What is Parquet?


Apache Parquet is an open-source file format widely used for big data workloads, originally in Hadoop clusters. Parquet stores data in a compressed, columnar format and is well suited to efficiently storing nested data structures. Engines such as Apache Spark, which provides high-level APIs in Java, Scala, Python, and R, read and write Parquet natively.

Storage

A Parquet file consists of row groups. Each row group holds a subset of the rows, with one column chunk per column; column chunks are in turn divided into pages. Column chunks carry metadata that includes the number of defined values, their size, and statistics such as null counts and min/max values.
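This whole structure sits between two 4-byte "PAR1" magic numbers that open and close every Parquet file, with a length-prefixed footer holding the file metadata just before the trailing magic. A minimal sketch (Python, operating on a file's raw bytes) of checking that framing:

```python
import struct

MAGIC = b"PAR1"  # 4-byte magic at both the start and end of every Parquet file

def looks_like_parquet(data: bytes) -> bool:
    """Check the "PAR1" magic bytes that bracket a Parquet file."""
    # A valid file needs at least: leading magic + 4-byte footer length + trailing magic.
    if len(data) < 12:
        return False
    return data[:4] == MAGIC and data[-4:] == MAGIC

def footer_length(data: bytes) -> int:
    """The 4 bytes before the trailing magic give the footer's length (little-endian)."""
    return struct.unpack("<I", data[-8:-4])[0]
```

Readers use this framing to locate the footer first, then jump directly to the row groups and column chunks it describes.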

Parquet Data Types

Parquet supports a small set of primitive (physical) data types. These include the following:

  • BOOLEAN: 1-bit boolean
  • INT32: 32-bit signed ints
  • INT64: 64-bit signed ints
  • INT96: 96-bit signed ints
  • FLOAT: IEEE 32-bit floating point values
  • DOUBLE: IEEE 64-bit floating point values
  • BYTE_ARRAY: arbitrarily long byte arrays
  • FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
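Because only these few physical types exist, richer types such as strings, decimals, and timestamps are represented as logical annotations over them; a UTF-8 string, for example, is a BYTE_ARRAY annotated as a string. A small illustrative lookup (Python; the helper function is hypothetical) of the fixed-width types above:

```python
# Bit widths of Parquet's fixed-width physical types. BYTE_ARRAY and
# FIXED_LEN_BYTE_ARRAY have no single universal width, so they are omitted.
PHYSICAL_TYPE_BITS = {
    "BOOLEAN": 1,
    "INT32": 32,
    "INT64": 64,
    "INT96": 96,   # legacy type, historically used for timestamps
    "FLOAT": 32,
    "DOUBLE": 64,
}

def uncompressed_column_bits(ptype: str, num_values: int) -> int:
    """Rough plain-encoding size estimate for a fixed-width column."""
    return PHYSICAL_TYPE_BITS[ptype] * num_values
```

A million-row INT32 column, for instance, occupies about 32 million bits (4 MB) before encoding and compression shrink it further.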

Compression in Parquet

Parquet stores large datasets, so compression is an important feature of this file format. Different compression algorithms can be applied per column. The following compression codecs are provided with the base distribution:

  • GZIP: well suited to long-term, static storage such as data archival. GZIP compresses more tightly than Snappy but consumes more CPU resources.
  • Snappy: a good choice for hot data that is accessed often. Snappy decompresses faster than GZIP and, unlike GZIP, is splittable.
  • LZO: a good choice when your application requires very fast decompression.
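Snappy and LZO are not in Python's standard library, but the same speed-versus-ratio tradeoff is easy to demonstrate with zlib alone by varying the compression level: level 1 stands in for a "fast, lighter" codec and level 9 for a "slow, heavier" one. A sketch (stdlib only; the sample column data is made up):

```python
import zlib

# Columnar data compresses well because values within one column are similar.
column = ",".join(str(i % 100) for i in range(10_000)).encode()

fast = zlib.compress(column, level=1)   # cheap on CPU, larger output
heavy = zlib.compress(column, level=9)  # more CPU, smaller output

# Both round-trip to the original bytes; the heavier setting should not
# produce a larger result on repetitive data like this.
```

Parquet applies the same reasoning per column chunk, which is why you can pick a different codec for an archival column than for a frequently scanned one.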

Hadoop

Hadoop provides an open-source platform that scales horizontally to accommodate big data workloads, and Parquet was developed primarily for Hadoop environments. Many organizations created Hadoop clusters in the heyday of the big data movement. Because Hadoop tuning skills have become scarce, many of those organizations are now moving their Parquet data to more modern architectures. Hadoop systems such as Pig, Hive, and Spark all use Parquet, with Spark providing the most convenient API for accessing it. Modern data warehousing systems such as the Actian Data Platform use a Spark connector to access Parquet data wherever it resides, on-premises or in the cloud.

Actian Data Platform

Using a Spark connector, the Actian Data Platform can access Parquet as an external object. Predicates can be pushed down to Parquet to speed access. External Parquet data is easy to load because requests can be wrapped in SQL. Loading Parquet data into internal tables provides the fastest access, thanks to vector processing that loads columnar data into the CPU cache across a server or cluster to maximize the parallel processing of queries. Benchmarks have shown the Actian Data Platform can scale an order of magnitude beyond Hive.

Visit our website to learn more about the Actian Data Platform, or sign up for a free trial.

In the examples below, a Parquet external data source is mapped as an external table before loading its data into an internal table for faster application access.

Loading Parquet Data From Google Cloud Storage:

DROP TABLE IF EXISTS pemdata_gs;

CREATE EXTERNAL TABLE pemdata_gs (
    timeperiod VARCHAR(20),
    flow1 VARCHAR(20),
    flow2 VARCHAR(20),
    occupancy1 VARCHAR(20),
    speed1 VARCHAR(20)
) USING SPARK
WITH
    reference='gs://avpemdata/part*.parquet',
    format='parquet';

DROP TABLE IF EXISTS pemdata;

CREATE TABLE pemdata (
    timeperiod TIMESTAMP,
    flow1 VARCHAR(20),
    flow2 VARCHAR(20),
    occupancy1 VARCHAR(20),
    speed1 VARCHAR(20)
);

INSERT INTO pemdata SELECT * FROM pemdata_gs;

Loading Parquet Data From Azure Blob Storage:

DROP TABLE IF EXISTS pemdata_adl;

CREATE EXTERNAL TABLE pemdata_adl (
    timeperiod VARCHAR(20),
    flow1 VARCHAR(20),
    flow2 VARCHAR(20),
    occupancy1 VARCHAR(20),
    speed1 VARCHAR(20)
) USING SPARK
WITH
    reference='abfs://parquetdata@mydata.dfs.core.windows.net/part*.parquet',
    format='parquet';

DROP TABLE IF EXISTS pemdata;

CREATE TABLE pemdata (
    timeperiod TIMESTAMP,
    flow1 VARCHAR(20),
    flow2 VARCHAR(20),
    occupancy1 VARCHAR(20),
    speed1 VARCHAR(20)
);

INSERT INTO pemdata SELECT * FROM pemdata_adl;

Loading Parquet Data From AWS S3 Storage:

DROP TABLE IF EXISTS pemdata_s3;

CREATE EXTERNAL TABLE pemdata_s3 (
    timeperiod VARCHAR(20),
    flow1 VARCHAR(20),
    flow2 VARCHAR(20),
    occupancy1 VARCHAR(20),
    speed1 VARCHAR(20)
) USING SPARK
WITH
    reference='s3a://avpemdata/part*.parquet',
    format='parquet';

DROP TABLE IF EXISTS pemdata;

CREATE TABLE pemdata (
    timeperiod TIMESTAMP,
    flow1 VARCHAR(20),
    flow2 VARCHAR(20),
    occupancy1 VARCHAR(20),
    speed1 VARCHAR(20)
);

INSERT INTO pemdata SELECT * FROM pemdata_s3;


Start doing more with your Hadoop using the Actian Data Platform by trying the free trial.