Pipe-Filter #
Pipe-Filter is suitable for applications that require a defined series of (mostly) independent computations to be performed on a stream of data entries.
Topology #
Example component diagram:

- Component: Filters
  - Each filter transforms input data entries into output data entries.
  - Local transformations: filters share no state and have minimal knowledge of upstream/downstream filters.
  - Incremental processing: outputs can begin before all inputs are consumed.
- Connector: Pipes
  - Pipes carry data between filters (e.g., streams, files, in-memory queues).
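The topology above can be sketched with Python generators, where each filter is a generator function and the pipes are simply the iterators connecting them (an illustrative sketch; the filter names are made up):

```python
def parse(lines):
    """Filter: turn raw lines into integers, skipping blanks."""
    for line in lines:
        line = line.strip()
        if line:
            yield int(line)

def square(numbers):
    """Filter: a local transformation with no shared state."""
    for n in numbers:
        yield n * n

def render(numbers):
    """Filter: format each entry for output."""
    for n in numbers:
        yield f"{n}\n"

# The pipes are the iterators between stages; each stage processes
# incrementally, starting before all upstream input is consumed.
source = ["1", "2", " ", "3"]
pipeline = render(square(parse(source)))
print("".join(pipeline))
```

Because generators are lazy, each entry flows through all three filters before the next entry is read, which mirrors the incremental-processing property of the style.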
Pros and cons #
- Pros
  - Maintainability: filters can be replaced or improved locally, as long as their input and output formats stay stable.
  - Efficiency (throughput): the architecture naturally supports parallelization and stream processing.
  - Observability: the explicit topology makes throughput and deadlock analyses feasible.
- Cons
  - Efficiency (latency): (de)serialization and data copying across pipes can be expensive.
    - Variant: using a different data format for each pipe improves efficiency but adds complexity.
  - Complexity: debugging end-to-end behavior across many stages can be non-trivial.
  - Not ideal for interactive systems, which need low-latency request/response cycles.
Variants #
Pipeline #
This variant requires a linear sequence of filters (i.e., each filter has exactly one input pipe and one output pipe).
An example is the pipe operator `|` in Unix shells; e.g., the following component diagram shows the command `ls | grep '.pdf' | wc -l`:

This simplified version of pipe-filter is easier to understand, and enables straightforward stream processing.
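The same linear shape can be sketched in Python, with each generator standing in for one command of the `ls | grep '.pdf' | wc -l` example (a rough analogue, not how the shell actually implements it):

```python
def list_names(names):
    """Stands in for `ls`: emit one name per entry."""
    yield from names

def grep(pattern, lines):
    """Stands in for `grep`: keep only matching entries."""
    for line in lines:
        if pattern in line:
            yield line

def count(lines):
    """Stands in for `wc -l`: consume the stream and count entries."""
    total = 0
    for _ in lines:
        total += 1
    return total

files = ["a.pdf", "b.txt", "c.pdf"]
n = count(grep(".pdf", list_names(files)))
print(n)
```

Each filter has exactly one input and one output, so the composition is a straight chain, just like the shell pipeline.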
Batch-Sequential #
This variant, based on the Pipeline variant, additionally requires that each filter processes all input before producing output. This is common in compiler and data processing tasks.
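A sorting stage makes the batch requirement concrete: it cannot emit its first output until it has seen its entire input. A minimal sketch (illustrative stage names):

```python
def sort_stage(entries):
    """Batch filter: must buffer ALL input before any output."""
    return sorted(entries)

def dedupe_stage(entries):
    """Batch filter: assumes sorted input, drops adjacent duplicates."""
    out = []
    for e in entries:
        if not out or out[-1] != e:
            out.append(e)
    return out

data = [3, 1, 2, 3, 1]
result = dedupe_stage(sort_stage(data))
print(result)
```

Unlike the streaming pipeline above, each stage here returns a complete list, so downstream stages only start once upstream stages finish.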

Real-world examples #
Unix shell #
In Unix shells, following the Pipeline variant, each command (e.g., `ls`, `grep`, `wc`) is a filter, and the pipe operator (`|`) connects the output of one command to the input of the next.
The stream processing usually happens at the line level: a newline character emitted by the previous command triggers a "flush" to the pipe, so the next command can start processing that line while the previous command continues running.
Further reading: The Architecture of Open Source Applications: The Bourne-Again Shell
LLVM passes #
Modern compilers often implement a Pipe-Filter architecture, allowing different combinations of code optimization, analysis, and instrumentation steps. This enables customizing the compilation pipeline for specific target devices, e.g., trading executable size against runtime efficiency.
In LLVM, the filters are called "passes"; they are used extensively to implement the basic compiler functionality, and can also be added by developers. Each pass's input is a code element (e.g., function, basic block, instruction) in LLVM's intermediate representation (IR). A pass can choose to only analyze the input code (e.g., printing diagnostic messages) or to transform it.
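The pass idea can be illustrated with a toy pass manager in Python (an analogy only, NOT LLVM's real C++ API): each "pass" is a function that either analyzes or transforms a list of pseudo-instructions, and the manager runs the passes in sequence.

```python
def constant_fold(ir):
    """Transform pass: fold ("add", 2, 3) into ("const", 5)."""
    out = []
    for op in ir:
        if op[0] == "add" and isinstance(op[1], int) and isinstance(op[2], int):
            out.append(("const", op[1] + op[2]))
        else:
            out.append(op)
    return out

def count_ops(ir):
    """Analysis pass: inspects the IR without changing it."""
    print(f"{len(ir)} instructions")
    return ir

def run_passes(ir, passes):
    """Pass manager: pipe the IR through each pass in order."""
    for p in passes:
        ir = p(ir)
    return ir

ir = [("add", 2, 3), ("load", "x")]
ir = run_passes(ir, [count_ops, constant_fold])
```

The analysis pass leaves the IR untouched while the transform pass rewrites it, matching the analyze-or-transform distinction above.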
Further readings:
Distributed data processing #
Distributed data processing systems use the pipe-filter style to perform a graph of transformations over large datasets, often with additional concerns like shuffles, fault tolerance, and scheduling. One popular paradigm is MapReduce, which defines a few highly optimized filter types: Map, Shuffle, and Reduce. There are several open-source implementations of this paradigm, such as:
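Independent of any particular engine, the Map, Shuffle, and Reduce filters can be sketched in memory with a word-count example (illustrative only; real systems distribute each phase across machines):

```python
from collections import defaultdict

def map_phase(docs):
    """Map filter: emit a (word, 1) pair for every word."""
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle filter: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce filter: aggregate each key's values."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["a b a", "b c"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)
```

The three functions are ordinary filters connected by in-memory pipes; a distributed engine replaces those pipes with network transfers and adds fault tolerance around each stage.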