In this work, we propose a peer-to-peer (P2P) voice delivery architecture to support an immersive communication environment (ICE). An ICE service allows multiple users in a distributed virtual environment (DVE) to exchange live voice streams that are rendered with directional and distance cues corresponding to the users' positions in the DVE. Our architecture addresses the scalability problem in peer access bandwidth consumption encountered by the "brute force" full-mesh P2P architecture. At the same time, it seeks a good balance between access bandwidth scalability and the voice quality delivered. The three key aspects of voice quality considered are voice transmission delay, directional voice rendering accuracy and voice attenuation accuracy. Our P2P architecture reduces access bandwidth by organising all pair-wise close peers into non-overlapping clusters and by mixing the voice streams within each cluster, thereby reducing the number of direct peer-to-peer voice exchanges. Several sets of experiments were conducted to compare the performance of our P2P architecture against the full-mesh P2P benchmark.
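
To make the clustering idea concrete, the following is a minimal illustrative sketch and not the algorithm evaluated in this work. It assumes a simple greedy rule for forming non-overlapping clusters of pair-wise close peers, with a hypothetical distance threshold `radius`. It also assumes a hypothetical stream-count model in which a peer receives one individual stream from each cluster mate and one mixed stream from every other cluster. The function names and the model are assumptions introduced purely for illustration.

```python
from math import dist


def cluster_pairwise_close(positions, radius):
    """Greedily partition peers into non-overlapping clusters.

    A peer joins the first cluster in which it is within `radius` of
    every existing member, so all pairs in a cluster are pair-wise
    close. `radius` is a hypothetical threshold, not a value from the
    paper.
    """
    clusters = []
    for peer, pos in positions.items():
        for cluster in clusters:
            if all(dist(pos, positions[m]) <= radius for m in cluster):
                cluster.append(peer)
                break
        else:
            # No existing cluster fits: start a new one.
            clusters.append([peer])
    return clusters


def full_mesh_incoming(n_peers):
    """Incoming streams per peer when every peer sends to every other."""
    return n_peers - 1


def clustered_incoming(clusters):
    """Incoming streams per peer under the assumed model.

    Assumption: individual streams from cluster mates, plus one mixed
    stream from each of the other clusters.
    """
    other_clusters = len(clusters) - 1
    return {
        peer: (len(cluster) - 1) + other_clusters
        for cluster in clusters
        for peer in cluster
    }
```

Under these assumptions, the per-peer stream count depends on the size of the peer's own cluster and on the number of clusters rather than on the total number of peers, which is the source of the bandwidth saving relative to the full mesh.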