┌─ FILE ANALYSIS ──────────────────────────────┐
│ DEVELOPER   : Apache Software Foundation
│ CATEGORY    : Data
│ MIME TYPE   : application/vnd.apache.parquet
│ MAGIC BYTES : 50415231
└──────────────────────────────────────────────┘
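The magic bytes above (hex `50 41 52 31`, ASCII "PAR1") appear at both the start and the end of every Parquet file, so a cheap head-and-tail check can identify one without a full parser. A minimal sketch, using only the standard library (the function name `looks_like_parquet` is our own):

```python
# Quick sanity check for the Parquet magic bytes ("PAR1" = 0x50415231).
# Parquet files both begin and end with this 4-byte marker.
import os

MAGIC = b"PAR1"

def looks_like_parquet(path: str) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < 12:  # too small for header + footer + magic
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)     # the marker is repeated at the very end
        tail = f.read(4)
    return head == MAGIC and tail == MAGIC
```

Note this only inspects the markers; a file passing the check could still have a corrupt footer, so treat it as a fast pre-filter, not validation.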
What is a Parquet file?
Apache Parquet is an open-source columnar storage format designed for efficient data processing at scale. Unlike row-oriented formats such as CSV, Parquet stores data column by column, enabling strong compression and fast analytical queries that read only the columns they need. It has become the de facto standard storage format for big data lakes.
How to open Parquet files
- DuckDB — `SELECT * FROM 'file.parquet'` for fast SQL queries
- Python pandas — `pd.read_parquet('file.parquet')`
- Apache Spark — Distributed processing
- Parquet Viewer (VS Code extension) — Visual inspection
Technical specifications
| Property | Value |
|---|---|
| Storage | Columnar |
| Compression | Snappy, Gzip, LZ4, Zstd |
| Encoding | Dictionary, RLE, Delta, Bit-packing |
| Schema | Self-describing (embedded schema) |
| Types | Primitive + logical types (decimal, date, timestamp) |
Common use cases
- Data lakes: S3/GCS storage for analytics.
- ETL pipelines: Efficient intermediate data format.
- Machine learning: Feature stores and training datasets.
- Business intelligence: Fast analytical queries.