┌─ FILE ANALYSIS ──────────────────────────────┐
│ DEVELOPER   : Apache Software Foundation
│ CATEGORY    : Data
│ MIME TYPE   : application/vnd.apache.parquet
│ MAGIC BYTES : 50415231
└──────────────────────────────────────────────┘
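The magic bytes above (hex `50 41 52 31`, ASCII "PAR1") appear at both the start and the end of every Parquet file, so a cheap head-and-tail check can identify one without a full parser. A minimal sketch, using only the standard library (the function name `looks_like_parquet` is our own):

```python
# Quick sanity check for the Parquet magic bytes ("PAR1" = 0x50415231).
# Parquet files both begin and end with this 4-byte marker.
import os

MAGIC = b"PAR1"

def looks_like_parquet(path: str) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < 12:  # too small for header + footer + magic
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)     # the marker is repeated at the very end
        tail = f.read(4)
    return head == MAGIC and tail == MAGIC
```

Note this only inspects the markers; a file passing the check could still have a corrupt footer, so treat it as a fast pre-filter, not validation.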
What is a Parquet file?
Apache Parquet is an open-source columnar storage format designed for efficient data processing at scale. Unlike row-oriented formats such as CSV, Parquet stores data column by column, enabling strong compression and fast analytical queries that read only the columns they need. It has become the de facto standard storage format for big data lakes.
How to open Parquet files
- DuckDB — `SELECT * FROM 'file.parquet'` for fast SQL queries
- Python pandas — `pd.read_parquet('file.parquet')`
- Apache Spark — Distributed processing
- Parquet Viewer (VS Code extension) — Visual inspection
Technical specifications
| Property | Value |
|---|---|
| Storage | Columnar |
| Compression | Snappy, Gzip, LZ4, Zstd |
| Encoding | Dictionary, RLE, Delta, Bit-packing |
| Schema | Self-describing (embedded schema) |
| Types | Primitive + logical types (decimal, date, timestamp) |
Common use cases
- Data lakes: S3/GCS storage for analytics.
- ETL pipelines: Efficient intermediate data format.
- Machine learning: Feature stores and training datasets.
- Business intelligence: Fast analytical queries.