MPI Distributed Word Count
Parallel Text Processing Across Docker Containers
Overview
This application counts character and word frequencies in large files in parallel using the Message Passing Interface (MPI). The program partitions each input file across multiple processes, each of which counts its own segment independently, then aggregates the results to report the 10 most frequent characters and words. Everything is containerized with Docker for portable cluster deployment.
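The partitioning step can be illustrated with a small sketch of the byte-range arithmetic. This is not the project's code: the function name `segment_bounds` is hypothetical, and the even split with the remainder given to the lowest ranks is an assumption, since the scheme is not stated here. In an MPI program the resulting offset and length would typically be what each rank passes to MPI_File_read_at.

```python
def segment_bounds(file_size, nprocs, rank):
    """Return (offset, length) of the byte range assigned to `rank`.

    Assumed scheme: split as evenly as possible and give the first
    `file_size % nprocs` ranks one extra byte each.
    """
    base, extra = divmod(file_size, nprocs)
    length = base + (1 if rank < extra else 0)
    # Every rank below this one contributed `base` bytes, plus one extra
    # byte for each of the lower ranks that received a remainder byte.
    offset = rank * base + min(rank, extra)
    return offset, length


if __name__ == "__main__":
    for r in range(4):
        print(r, segment_bounds(10, 4, r))
```

With this scheme the segments are contiguous and cover the file exactly once, so a word can only be split where one rank's range ends and the next begins.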
Key Features
- Parallel file partitioning with MPI-IO (MPI_File_read_at) for concurrent segment reading
- Boundary-word handling that uses non-blocking MPI_Irecv to merge words split across segments
- Character frequency tracking for the printable ASCII range (32-126) with MPI_Allreduce aggregation
- Word frequency collection via MPI_Gatherv with custom serialization/deserialization
- Deterministic tie-breaking: characters by ASCII value, words by first occurrence position
- Docker containerization for portable multi-node MPI deployment
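The tie-breaking rule in the list above can be sketched as a sort key. This is a minimal illustration, not the project's implementation. It assumes that ties go to the lower ASCII value for characters and to the earlier first occurrence for words, and that words are whitespace-delimited and case-sensitive; none of these details are specified here. The names `top_chars` and `top_words` are hypothetical.

```python
from collections import Counter


def top_chars(text, k=10):
    """Top-k printable ASCII characters; ties go to the lower ASCII value (assumed)."""
    counts = Counter(c for c in text if 32 <= ord(c) <= 126)
    # Sort by count descending, then by ASCII value ascending.
    return sorted(counts.items(), key=lambda kv: (-kv[1], ord(kv[0])))[:k]


def top_words(text, k=10):
    """Top-k words; ties go to the word that appears first (assumed)."""
    counts = {}
    first_pos = {}
    for pos, word in enumerate(text.split()):
        counts[word] = counts.get(word, 0) + 1
        first_pos.setdefault(word, pos)  # remember only the first occurrence
    # Sort by count descending, then by first occurrence ascending.
    return sorted(counts.items(), key=lambda kv: (-kv[1], first_pos[kv[0]]))[:k]


if __name__ == "__main__":
    print(top_chars("bbaa"))
    print(top_words("b a b a c"))
```

In a distributed run the first-occurrence position would have to be a global position in the file rather than a per-segment one, so each rank would need to carry that position along with its counts.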
Architecture
Each MPI rank reads its own segment of the input file independently using MPI-IO. Ranks exchange incomplete boundary words via non-blocking sends and receives so that a word split across a segment edge is counted exactly once. Character frequencies are aggregated with the MPI_Allreduce collective, which leaves the combined totals on every rank, while word data is serialized and gathered to rank 0 via MPI_Gatherv. The whole application runs in Docker containers for cluster deployment.
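The flow described above can be simulated in a single process to check the boundary logic. The sketch below is not the project's code and makes no MPI calls: a plain loop stands in for the ranks, a `carry` variable stands in for the non-blocking send and receive of a boundary fragment, and a byte-string join stands in for the MPI_Gatherv receive buffer on rank 0. The direction of the exchange (each rank forwards its trailing fragment to the next rank) and the record layout (word length, count, word bytes) are assumptions, and all function names are hypothetical.

```python
import struct
from collections import Counter


def split_segments(data, nprocs):
    """Split `data` into nprocs contiguous byte segments (assumed even split)."""
    base, extra = divmod(len(data), nprocs)
    segments, offset = [], 0
    for rank in range(nprocs):
        length = base + (1 if rank < extra else 0)
        segments.append(data[offset:offset + length])
        offset += length
    return segments


def count_segment(segment, carry, is_last):
    """Count complete words in carry + segment; return (counts, trailing fragment)."""
    buf = carry + segment
    cut = len(buf)
    if not is_last:
        # Walk back to the last whitespace byte. Anything after it may be
        # a partial word, so it is forwarded instead of counted here.
        while cut > 0 and not buf[cut - 1:cut].isspace():
            cut -= 1
    return Counter(buf[:cut].split()), buf[cut:]


def serialize(counts):
    """Pack each entry as: uint32 word length, uint32 count, word bytes."""
    out = bytearray()
    for word, n in counts.items():
        out += struct.pack("<II", len(word), n) + word
    return bytes(out)


def deserialize(blob):
    """Unpack records written by serialize(), summing counts for repeated words."""
    counts, pos = Counter(), 0
    while pos < len(blob):
        length, n = struct.unpack_from("<II", blob, pos)
        pos += 8
        counts[blob[pos:pos + length]] += n
        pos += length
    return counts


def distributed_word_count(data, nprocs):
    """Simulate the per-rank counting, boundary exchange, and gather on rank 0."""
    carry, blobs = b"", []
    for rank, segment in enumerate(split_segments(data, nprocs)):
        counts, carry = count_segment(segment, carry, rank == nprocs - 1)
        blobs.append(serialize(counts))
    gathered = b"".join(blobs)  # stand-in for the MPI_Gatherv receive buffer
    return deserialize(gathered)


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog"
    print(distributed_word_count(text, 4) == Counter(text.split()))
```

Because every record carries its own length, rank 0 can decode the concatenated buffer without knowing where one rank's contribution ends and the next begins. A real MPI_Gatherv call still needs per-rank byte counts and displacements, which are usually collected in a preceding gather of the blob sizes.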