This paper develops a scalable system design for creating and delivering, over the Internet, a realistic voice communication service for crowded virtual spaces, such as virtual marketplaces or battlefields in online games. Delivering a realistic crowded audio scene, including spatial rendering of the voices of surrounding avatars, in a peer-to-peer manner is impractical because of access bandwidth limitations and cost. A brute-force server model, on the other hand, faces significant computational costs and scalability issues. This paper presents a novel server-based architecture for this service that performs simple operations in the servers (including weighted mixing of audio streams) to cope with the access bandwidth restrictions of clients, and uses the spatial audio rendering capabilities of the clients to reduce the computational load on the servers. The paper then examines the performance of two components of this architecture: angular clustering and grid summarization. Finally, simulation results are used to evaluate the impact of two factors, a high density of avatars and realistic access bandwidth limitations, on the quality and accuracy of the audio scene.
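As a rough illustration of how server-side weighted mixing and angular clustering might fit together, the following minimal sketch groups the voices around a listener into equal-width angular sectors and mixes each sector into a single stream, which the client could then render spatially. The function name `angular_cluster_mix`, the equal-width sectors, and the inverse-distance weights are assumptions made for this example only; they are not taken from the architecture evaluated in the paper.

```python
import math


def angular_cluster_mix(listener, sources, num_sectors=4):
    """Mix audio frames of nearby sources into one frame per angular sector.

    listener: (x, y) position of the listening avatar.
    sources: list of ((x, y), samples); every samples list is assumed to
             have the same length (one audio frame).
    Returns a dict mapping sector index to the mixed frame for that sector.
    """
    sector_width = 2 * math.pi / num_sectors
    mixes = {}
    for (x, y), samples in sources:
        dx, dy = x - listener[0], y - listener[1]
        # Direction of the source as seen from the listener, in [0, 2*pi].
        angle = math.atan2(dy, dx) % (2 * math.pi)
        # The final modulo guards against rounding up to exactly 2*pi.
        sector = int(angle // sector_width) % num_sectors
        # Assumed weighting: inverse distance, capped at 1 for close sources.
        weight = 1.0 / max(math.hypot(dx, dy), 1.0)
        mix = mixes.setdefault(sector, [0.0] * len(samples))
        for i, sample in enumerate(samples):
            mix[i] += weight * sample
    return mixes
```

Under these assumptions, a client receives at most `num_sectors` streams regardless of how many avatars surround it, so the downstream bandwidth is bounded by the number of sectors rather than by the crowd density.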