deepr.io package

Submodules

deepr.io.hdfs module

HDFS Utilities

class deepr.io.hdfs.HDFSFile(filesystem, path, mode='rb', encoding='utf-8')[source]

Bases: object

File object wrapper supporting “r” and “w” modes, readlines, and iteration.

Makes it easier to read or write a file from any filesystem. For example, with HDFS you can do

>>> from deepr.io import HDFSFileSystem
>>> with HDFSFileSystem() as fs:
...     with HDFSFile(fs, "viewfs://root/user/foo.txt", "w") as file:  
...         file.write("Hello world!")  

Using the context manager ensures that the connection to the filesystem, as well as the file buffer, is automatically opened and closed.

filesystem

FileSystem instance

Type:

FileSystem

path

Path to file

Type:

str

mode

Write / read mode. Supported: “r”, “rb” (default), “w”, “wb”.

Type:

str, Optional

read(*args, **kwargs)[source]
readlines()[source]
write(data, *args, **kwargs)[source]
class deepr.io.hdfs.HDFSFileSystem[source]

Bases: object

Context-aware HDFSFileSystem using pyarrow.hdfs.

Opens and closes the connection to HDFS via a context manager

>>> from deepr.io import HDFSFileSystem
>>> with HDFSFileSystem() as fs:  
...     fs.open("path/to/file")  

deepr.io.json module

Json IO

deepr.io.json.is_json(data)[source]

Return True if data is a valid JSON string, else False

Return type:

bool
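
The check can be pictured with the standard library; a minimal sketch of such a validity test (an assumption about the behavior, not the actual deepr implementation):

```python
import json


def is_json(data: str) -> bool:
    """Return True if data parses as valid JSON, else False."""
    try:
        json.loads(data)
        return True
    except (ValueError, TypeError):
        return False
```

For instance, `is_json('{"a": 1}')` is True while `is_json("not json")` is False.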

deepr.io.json.load_json(data)[source]

Load JSON from a JSON file or a JSON string

deepr.io.json.read_json(path)[source]

Read a JSON or Jsonnet file into a dictionary

Return type:

Dict

deepr.io.json.write_json(data, path)[source]

Write data to path as JSON

deepr.io.parquet module

Utilities for parquet

class deepr.io.parquet.ParquetDataset(path_or_paths, filesystem=None, metadata=None, schema=None, split_row_groups=False, validate_schema=True, filters=None, metadata_nthreads=1, memory_map=False)[source]

Bases: object

Context-aware ParquetDataset with support for chunk writing.

Makes it easier to read / write a ParquetDataset. For example

>>> import pandas as pd
>>> from deepr.io import ParquetDataset
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>>> with ParquetDataset("viewfs://root/foo.parquet.snappy").open() as ds:  
...     ds.write_pandas(df, chunk_size=100)  

Using the context manager automatically opens and closes the dataset as well as the connection to the FileSystem.

path_or_paths

Path to parquet dataset (directory or file), or list of files.

Type:

Union[str, Path, List[Union[str, Path]]]

filesystem

FileSystem; if None, it is inferred automatically when the dataset is opened.

Type:

FileSystem, Optional

property is_hdfs: bool
Return type:

bool

property is_local: bool
Return type:

bool

open()[source]

Open an HDFS FileSystem if the dataset is on HDFS

property pq_dataset
read(columns=None, use_threads=True, use_pandas_metadata=False)[source]
read_pandas(columns=None, use_threads=True)[source]
write(table, compression='snappy')[source]
write_pandas(df, compression='snappy', num_chunks=None, chunk_size=None, schema=None)[source]

Write a DataFrame as a Parquet Dataset
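
Chunked writing amounts to splitting the DataFrame's rows into slices and writing one file per slice. A sketch of the slicing logic alone, with a hypothetical `chunks` helper and plain Python lists standing in for DataFrame rows (the real method works on pandas / pyarrow objects):

```python
def chunks(rows, chunk_size=None, num_chunks=None):
    """Split rows into slices, by fixed chunk_size or by a target num_chunks."""
    if chunk_size is None:
        # Ceiling division: every row lands in one of num_chunks slices
        chunk_size = -(-len(rows) // num_chunks)
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
```

Each slice would then be converted to a table and written as one part-file of the dataset.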

deepr.io.path module

Path Utilities

class deepr.io.path.Path(*args)[source]

Bases: object

Equivalent of pathlib.Path for both local and HDFS FileSystems

Automatically opens and closes an HDFS connection if the path is an HDFS path.

Allows you to work with local / HDFS files in an agnostic manner.

Example

path = Path("viewfs://foo", "bar") / "baz"
path.parent.mkdir()
with path.open("r") as file:
    for line in file:
        print(line)
for child in path.glob("*"):
    print(child.is_file())
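
The local / HDFS dispatch presumably hinges on the path's URI scheme; a sketch of such a check, mirroring the is_hdfs property (the scheme list is an assumption based on the viewfs:// examples above):

```python
from urllib.parse import urlparse

# Schemes treated as HDFS locations (assumed)
HDFS_SCHEMES = ("hdfs", "viewfs")


def is_hdfs(path: str) -> bool:
    """Return True if the path points to an HDFS location."""
    return urlparse(path).scheme in HDFS_SCHEMES
```

Local paths like "/tmp/file.txt" have no scheme, so they fall through to the local branch.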
copy_dir(dest, recursive=False, filesystem=None)[source]

Copy the current directory's files, and its subdirectories if recursive, to dest.

copy_file(dest, filesystem=None)[source]

Copy current file to dest (target directory must exist).

delete(filesystem=None)[source]

Delete file from filesystem

delete_dir(filesystem=None)[source]

Delete dir from filesystem

exists(filesystem=None)[source]

Return True if the path points to an existing file or dir.

Return type:

bool

glob(pattern)[source]

Retrieve directory content matching pattern

Return type:

Generator[Path, None, None]
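
The lazy return type can be sketched with fnmatch filtering over a directory listing (a hypothetical stand-alone generator, using plain strings rather than Path objects):

```python
import fnmatch
from typing import Iterator, List


def glob(entries: List[str], pattern: str) -> Iterator[str]:
    """Lazily yield directory entries matching the glob pattern."""
    for entry in entries:
        if fnmatch.fnmatch(entry, pattern):
            yield entry
```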

is_dir(filesystem=None)[source]

Return True if the path points to a regular directory.

Return type:

bool

is_file(filesystem=None)[source]

Return True if the path points to a regular file.

Return type:

bool

property is_hdfs: bool

Return True if the path points to an HDFS location

Return type:

bool

property is_local: bool

Return True if the path points to a local file or dir.

Return type:

bool

iterdir(filesystem=None)[source]

Retrieve directory content.

Return type:

Generator[Path, None, None]

mkdir(parents=False, exist_ok=False, filesystem=None)[source]

Create directory

property name: str

Final path component.

Return type:

str

open(mode='r', encoding='utf-8', filesystem=None)[source]

Open file on both HDFS and Local File Systems.

Example

Use a context manager like so

path = Path("viewfs://root/user/path/to/file.txt")
with path.open("w") as file:
    file.write("Hello world!")
property parent

Path to the parent of the current path

property suffix

File extension of the file if any.
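
name and suffix behave like their pathlib counterparts, presumably also for URI-style paths; a stdlib sketch of the expected results (hypothetical helpers, not the deepr implementation):

```python
import posixpath


def name(path: str) -> str:
    """Final path component."""
    return posixpath.basename(path)


def suffix(path: str) -> str:
    """File extension including the dot, or '' if none."""
    _, ext = posixpath.splitext(path)
    return ext
```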

Module contents