deepr.io.ParquetDataset

class deepr.io.ParquetDataset(path_or_paths, filesystem=None, metadata=None, schema=None, split_row_groups=False, validate_schema=True, filters=None, metadata_nthreads=1, memory_map=False)[source]

Context-aware ParquetDataset with support for chunked writing.

Makes it easier to read / write Parquet datasets. For example:

>>> import pandas as pd
>>> from deepr.io import ParquetDataset
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>>> with ParquetDataset("viewfs://root/foo.parquet.snappy").open() as ds:
...     ds.write_pandas(df, chunk_size=100)

The use of context managers automatically opens / closes the dataset as well as the connection to the FileSystem.
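Reading follows the same pattern. The sketch below is illustrative only: it assumes read_pandas() mirrors pyarrow's ParquetDataset.read_pandas, i.e. that it accepts a columns argument and returns a pyarrow.Table which can be converted back to a DataFrame.

>>> from deepr.io import ParquetDataset
>>> with ParquetDataset("viewfs://root/foo.parquet.snappy").open() as ds:
...     table = ds.read_pandas(columns=["col1"])  # column pruning (assumed supported)
...     df = table.to_pandas()                    # back to a pandas DataFrame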

path_or_paths

Path to parquet dataset (directory or file), or list of files.

Type:

Union[str, Path, List[Union[str, Path]]]

filesystem

FileSystem. If None, it will be inferred automatically later.

Type:

FileSystem, Optional
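Because path_or_paths also accepts a list, a dataset can be built from explicit part files. The file names below are hypothetical; for local paths the FileSystem is expected to be inferred automatically.

>>> from deepr.io import ParquetDataset
>>> # Hypothetical part files on local disk; no filesystem argument needed.
>>> ds = ParquetDataset(["data/part-000.parquet", "data/part-001.parquet"])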

__init__(path_or_paths, filesystem=None, metadata=None, schema=None, split_row_groups=False, validate_schema=True, filters=None, metadata_nthreads=1, memory_map=False)[source]

Methods

__init__(path_or_paths[, filesystem, ...])

open()

Open HDFS FileSystem if the dataset is on HDFS.

read([columns, use_threads, use_pandas_metadata])

read_pandas([columns, use_threads])

write(table[, compression])

write_pandas(df[, compression, num_chunks, ...])

Write DataFrame as a Parquet Dataset (see the example below).
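The following sketch exercises the chunked-writing path. It is an assumption-laden example: the semantics of num_chunks (splitting the DataFrame into that many part files) and the accepted compression values (e.g. "snappy") are read off the signature above, not verified against the implementation.

>>> import pandas as pd
>>> from deepr.io import ParquetDataset
>>> df = pd.DataFrame(data={"col1": range(1000), "col2": range(1000)})
>>> with ParquetDataset("viewfs://root/bar.parquet.snappy").open() as ds:
...     ds.write_pandas(df, compression="snappy", num_chunks=10)  # 10 part files (assumed)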

Attributes

is_hdfs

Type:

bool

is_local

Type:

bool

pq_dataset
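A small, hedged illustration of the boolean attributes; the exact scheme detection (e.g. treating viewfs:// paths as HDFS) is assumed rather than documented here.

>>> from deepr.io import ParquetDataset
>>> ds = ParquetDataset("viewfs://root/foo.parquet.snappy")
>>> on_hdfs = ds.is_hdfs    # expected True for HDFS-style URIs such as viewfs://
>>> on_disk = ds.is_local   # expected False here; True for plain local paths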