File format for tabular data

There are several high-quality, well-developed formats - HN

csv – Simple and text-based, fine for simple use cases, but it has many edge cases and lacks features
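To make the "many edge cases" point concrete, here is a minimal sketch using only Python's standard `csv` module. The sample rows are made up for illustration. It shows how a naive split on commas and line breaks goes wrong as soon as a field contains a quote, a comma, or a newline, while a real CSV parser round-trips the data.

```python
import csv
import io

# Hypothetical sample data: one field has quotes and a comma, another has a newline.
rows = [
    ["id", "comment"],
    ["1", 'He said "hi", then left'],
    ["2", "line one\nline two"],
]

# Write with the csv module, which quotes and escapes fields as needed.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# Naive parsing: split on line breaks, then on commas.
naive = [line.split(",") for line in text.splitlines()]

# Proper parsing: csv.reader understands quoting and embedded newlines.
parsed = list(csv.reader(io.StringIO(text, newline="")))

print(len(naive), "rows from naive split")   # more rows than were written
print(parsed == rows)                        # the csv module round-trips the data
```

The naive version sees an extra row (the embedded newline) and an extra column (the embedded comma), which is the kind of surprise that makes CSV fragile outside simple cases.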

ndjson – Every line is a JSON object
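A small sketch of the one-JSON-object-per-line idea, using only Python's standard `json` module. The helper names `dump_ndjson` and `load_ndjson` and the sample records are hypothetical, chosen for illustration.

```python
import io
import json

# Hypothetical records; rows do not need to share the same keys.
records = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b", "tags": ["x"]},
]

def dump_ndjson(records, fp):
    """Write one JSON object per line."""
    for record in records:
        # json.dumps escapes newlines inside strings, so each record stays on one line.
        fp.write(json.dumps(record) + "\n")

def load_ndjson(fp):
    """Read one JSON object per non-empty line."""
    return [json.loads(line) for line in fp if line.strip()]

buf = io.StringIO()
dump_ndjson(records, buf)
buf.seek(0)
loaded = load_ndjson(buf)
print(loaded == records)
```

Because each line stands alone, a file in this layout can be appended to and read line by line without loading the whole thing.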

xlsx – Works in Excel; a ubiquitous format with a standard, but complicated and missing scientific features

sqlite – Designed for relational data, somewhat ubiquitous; column types are declared but not enforced
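The point about types not being enforced can be seen with Python's built-in `sqlite3` module. This is a minimal sketch with a made-up table `t`: the column is declared `INTEGER`, yet a non-numeric string is accepted and stored as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")

# Text that looks like an integer is converted by the column's type affinity.
conn.execute("INSERT INTO t VALUES (?)", ("42",))
# Text that is not numeric is stored as-is, despite the INTEGER declaration.
conn.execute("INSERT INTO t VALUES (?)", ("hello",))

# typeof() reports the storage class actually used for each value.
rows = conn.execute("SELECT n, typeof(n) FROM t ORDER BY rowid").fetchall()
print(rows)
conn.close()
```

In an ordinary table the declared type acts as a preference for how values are stored rather than a constraint, so validation has to happen elsewhere (for example with `CHECK` constraints or in application code).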

parquet / hdf5 / apache feather / etc – Designed for scientific use cases: robust and efficient, but less ubiquitous
- DuckDB is a lightweight, very fast library/CLI for working with Parquet; it is like SQLite for columnar formats
- Arrow also has its own on-disk format called Feather

Cap'n Proto, Protocol Buffers, Avro, Thrift – Have specific features for data communication between systems

xml – Useful if you are programming in the early 2000s

GDBM, Kyoto Cabinet, etc – Useful if you are programming in the late 1990s


Written on May 13, 2022, Last update on June 28, 2023
file db array parquet csv