MPI Distributed Word Count

Parallel Text Processing Across Docker Containers

C++, MPI, Docker, MPI-IO, Parallel Computing

Overview

A parallel text analysis application that counts character and word frequencies in large files using the Message Passing Interface (MPI). The program partitions each input file across multiple processes, each of which counts its own segment independently, then aggregates the results to report the 10 most frequent characters and the 10 most frequent words. The application is containerized with Docker for portable cluster deployment.
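As a rough illustration of the partitioning step, the sketch below computes a byte range for each rank. The `Segment` struct and `segment_for_rank` function are hypothetical names, and the even split with the remainder spread over the lowest ranks is an assumed policy rather than the project's confirmed one.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the byte range [offset, offset + length) owned by one rank.
struct Segment {
    std::int64_t offset;
    std::int64_t length;
};

// Split file_size bytes across nranks as evenly as possible.
// Assumed policy: the first (file_size % nranks) ranks each take one extra byte.
Segment segment_for_rank(std::int64_t file_size, int rank, int nranks) {
    assert(nranks > 0 && rank >= 0 && rank < nranks && file_size >= 0);
    const std::int64_t base = file_size / nranks;
    const std::int64_t extra = file_size % nranks;
    // Ranks below `extra` are one byte longer, so later offsets shift accordingly.
    const std::int64_t offset = rank * base + (rank < extra ? rank : extra);
    const std::int64_t length = base + (rank < extra ? 1 : 0);
    return {offset, length};
}
```

Under this scheme, a 10-byte file split over 3 ranks gives segments of 4, 3, and 3 bytes starting at offsets 0, 4, and 7. In an MPI program, the offset and length would typically be passed as the offset and count arguments of MPI_File_read_at.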

Key Features

Parallel file partitioning with MPI-IO (MPI_File_read_at) for concurrent segment reading
Boundary word handling using non-blocking MPI_Irecv to merge words split across segments
Character frequency tracking for the printable ASCII range (32-126) with MPI_Allreduce aggregation
Word frequency collection via MPI_Gatherv with custom serialization/deserialization
Deterministic tie-breaking: characters by ASCII value, words by first occurrence position
Docker containerization for portable multi-node MPI deployment
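The tie-breaking rules in the list above can be sketched as two sort comparators. This is a minimal illustration, not the project's code: `WordStat`, `top_chars`, and `top_words` are hypothetical names, and it assumes that ties go to the smaller ASCII code and to the earlier first occurrence.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: a word, its total count, and the byte offset of its first occurrence.
struct WordStat {
    std::string word;
    long count;
    long first_pos;
};

// Top-n printable characters: higher count first; ties go to the smaller ASCII code (assumed).
std::vector<std::pair<char, long>> top_chars(const std::array<long, 128>& freq, std::size_t n) {
    assert(n > 0);
    std::vector<std::pair<char, long>> v;
    for (int c = 32; c <= 126; ++c)
        if (freq[c] > 0) v.emplace_back(static_cast<char>(c), freq[c]);
    std::sort(v.begin(), v.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second;
        return a.first < b.first;
    });
    if (v.size() > n) v.resize(n);
    return v;
}

// Top-n words: higher count first; ties go to the word that appears earlier in the file (assumed).
std::vector<WordStat> top_words(std::vector<WordStat> words, std::size_t n) {
    assert(n > 0);
    std::sort(words.begin(), words.end(), [](const WordStat& a, const WordStat& b) {
        if (a.count != b.count) return a.count > b.count;
        return a.first_pos < b.first_pos;
    });
    if (words.size() > n) words.resize(n);
    return words;
}
```

Because both comparators fall back to a key that is unique per entry (the character code, or the first-occurrence offset), the resulting order does not depend on how many ranks produced the counts.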

Architecture

Each MPI rank reads its own segment of the input file independently using MPI-IO. Neighbouring ranks exchange incomplete boundary words via non-blocking sends and receives so that a word split across a segment edge is counted exactly once. Character frequencies are summed across ranks with MPI_Allreduce, a collective operation that leaves the totals on every rank, while word data is serialized and gathered to rank 0 via MPI_Gatherv. The entire system runs in Docker containers for easy cluster deployment.
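The boundary-word exchange can be checked without an MPI runtime by simulating the ranks in a plain loop. The sketch below is one possible convention, not necessarily the project's protocol: every rank except rank 0 hands the characters before its first whitespace to its left neighbour, which appends them to its own text before counting. The names `count_words`, `count_by_rank`, and `merge` are hypothetical, and words are assumed to be whitespace-delimited.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Counts = std::map<std::string, long>;

// Count whitespace-delimited words in s (this word definition is an assumption).
void count_words(const std::string& s, Counts& out) {
    std::istringstream in(s);
    std::string w;
    while (in >> w) ++out[w];
}

// Simulate nranks processes counting words over contiguous chunks of text.
// In an MPI program the `carry` hand-off would be a send/receive between neighbours.
std::vector<Counts> count_by_rank(const std::string& text, int nranks) {
    assert(nranks > 0);
    std::vector<Counts> per_rank(nranks);
    const std::size_t n = text.size(), base = n / nranks, extra = n % nranks;
    std::string carry;  // head fragment handed over by the right neighbour
    // Walk right to left so a fragment can pass through a chunk that has no whitespace.
    for (int r = nranks - 1; r >= 0; --r) {
        const std::size_t ur = static_cast<std::size_t>(r);
        const std::size_t begin = ur * base + std::min(ur, extra);
        const std::size_t len = base + (ur < extra ? 1 : 0);
        std::string local = text.substr(begin, len) + carry;
        carry.clear();
        if (r > 0) {
            // Everything before the first whitespace may continue a word that
            // started on the left neighbour, so it is handed over, not counted here.
            std::size_t cut = 0;
            while (cut < local.size() &&
                   !std::isspace(static_cast<unsigned char>(local[cut]))) ++cut;
            carry = local.substr(0, cut);
            local.erase(0, cut);
        }
        count_words(local, per_rank[r]);
    }
    return per_rank;
}

// Combine per-rank counts, standing in for the gather-and-merge step on rank 0.
Counts merge(const std::vector<Counts>& parts) {
    Counts total;
    for (const auto& p : parts)
        for (const auto& kv : p) total[kv.first] += kv.second;
    return total;
}
```

With this convention the merged counts match a serial count for any number of ranks, including the case where a rank's segment is empty or contains a single unbroken fragment. A real implementation would only need the right-to-left dependency in that last case; otherwise each rank can compute its head fragment from its own bytes.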
