Compressed Apache Arrow Tables Over HTTP
Returning Arrow tables over HTTP is relatively straightforward. The
Arrow IPC streaming format
is repurposed to the HTTP use case and you can return each serialized IPC batch
as a chunk of the HTTP response — Transfer-Encoding: chunked
— to enable
streaming clients.
Arrow IPC streams have a MIME type
registered with IANA
so you can also set the appropriate Content-Type
header on your HTTP responses.
1
|
|
You can find Python examples of this in the arrow-experiments
repository:
https://github.com/apache/arrow-experiments/tree/main/http/get_simple/python.
Things get trickier when you try to support compression. Arrow IPC streams are not compressed by default and not all client implementations of the IPC protocol can parse streams with compressed buffers.
We will get back to IPC compression in a moment, but first let’s see how we can achieve compression of the data by compressing the HTTP response body itself.
Compressing HTTP Responses
All clients that can handle compressed HTTP responses can benefit from this.
A server can look at the Accept-Encoding
header of the request and decide to
compress the response body if the client supports it. Chrome, Firefox, and
most modern browsers support gzip
and deflate
compression as well as modern
codecs like br
(Brotli) and zstd
(Zstandard)1.
Web browsers and HTTP client libraries will automatically decompress the
response body if the Content-Encoding
header is set to the codec used to
compress the response body.
1
|
|
When Chrome makes an HTTP request, it sets the
Accept-Encoding: gzip, deflate, br, zstd
header to indicate that it can handle
compressed responses in any of these formats. If you’re using a library to make
HTTP requests, make sure it sets this header correctly so the server can
compress the response body without worrying about compatibility with the client.
Using Arrow IPC Compression
The buffers placed in a serialized Arrow IPC stream can be compressed using the LZ4 or Zstandard codecs. Arrow IPC compression has nothing to do with the HTTP protocol — the compression and decompression is done by the Arrow library itself as it parses the stream. Note, though, that not all Arrow implementations support compression.2
To maximize the chances of compression without compatibility issues, you can
ask the clients to communicate their compression preferences in the
request headers. This can be achieved by passing parameters to the MIME type in the
Accept
request header.3
1
|
|
This way, the server can decide to compress the IPC stream buffers using LZ4 or
Zstandard without fear that client won’t be able to decode it. If the client
doesn’t specify a codec, the server can disable compression altogether, fall back
to HTTP compression (based on the Accept-Encoding
header), or choose a
compression codec at the risk of incompatibility.
You can find examples of this in Python at https://github.com/apache/arrow-experiments/pull/35.
To make troubleshooting easier, the server may indicate which compression codec was used to compress the IPC stream, even though the Arrow IPC stream reader picks up which codec was used from the data stream itself.
1
|
|
A Note on Double-Compression
Applying both IPC buffer and HTTP compression to the same data is not recommended. The extra CPU overhead of decompressing the data twice is not worth any possible gains that double compression might bring. If compression ratios are unambiguously more important than reducing CPU overhead, then a different compression algorithm4 that optimizes for that can be chosen.
Conclusion
Returning Arrow tables over HTTP is a powerful way to share data between applications. Unsurprisingly, columnar formats like Arrow are very friendly to compression and serialization. This is why Arrow works well as a network transfer format. As tooling evolves you can expect to see more and more Arrow tables being shared over HTTP instead of CSV or JSON.
-
Check the implementation status table in the Arrow documentation to see if the Arrow implementation you’re using supports buffer compression. ↩
-
This is similar to clients requesting video streams by specifying the container format and the codecs they support like this
Accept: video/webm; codecs="vp8, vorbis"
. ↩ -
Compression codecs can be instantiated with different compression levels trading off compression ratio for speed. ↩