deepr.prepros package

Submodules

deepr.prepros.base module

Abstract Base Class for preprocessing

class deepr.prepros.base.Prepro[source]

Bases: ABC

Base class for composable preprocessing functions.

Prepro objects are the basic building blocks of a preprocessing pipeline. A Prepro defines a function on a tf.data.Dataset.

The basic usage of a Prepro is to apply it on a Dataset. For example:

>>> from deepr import readers
>>> from deepr.prepros import Map
>>> def gen():
...     for i in range(3):
...         yield {"a": i}
>>> raw_dataset = tf.data.Dataset.from_generator(gen, {"a": tf.int32}, {"a": tf.TensorShape([])})
>>> list(readers.from_dataset(raw_dataset))
[{'a': 0}, {'a': 1}, {'a': 2}]
>>> prepro_fn = Map(lambda x: {"a": x["a"] + 1})
>>> dataset = prepro_fn(raw_dataset)
>>> list(readers.from_dataset(dataset))
[{'a': 1}, {'a': 2}, {'a': 3}]

Because some preprocessing pipelines behave differently depending on the mode (TRAIN, EVAL, PREDICT), an optional argument can be provided:

>>> def map_func(element, mode=None):
...     if mode == tf.estimator.ModeKeys.PREDICT:
...         return {"a": 0}
...     else:
...         return element
>>> prepro_fn = Map(map_func)
>>> list(readers.from_dataset(raw_dataset))
[{'a': 0}, {'a': 1}, {'a': 2}]
>>> dataset = prepro_fn(raw_dataset, mode=tf.estimator.ModeKeys.TRAIN)
>>> list(readers.from_dataset(dataset))
[{'a': 0}, {'a': 1}, {'a': 2}]
>>> dataset = prepro_fn(raw_dataset, mode=tf.estimator.ModeKeys.PREDICT)
>>> list(readers.from_dataset(dataset))
[{'a': 0}, {'a': 1}, {'a': 2}]

TODO: the mode argument of map_func is currently not taken into account (hence the identical outputs above).

Map, Filter, Shuffle and Repeat have a special attribute modes that you can use to specify the modes on which the preprocessing should be applied. For example:

>>> def map_func(element, mode=None):
...     return {"a": 0}
>>> prepro_fn = Map(map_func, modes=[tf.estimator.ModeKeys.PREDICT])
>>> dataset = prepro_fn(raw_dataset, tf.estimator.ModeKeys.TRAIN)
>>> list(readers.from_dataset(dataset))
[{'a': 0}, {'a': 1}, {'a': 2}]
>>> dataset = prepro_fn(dataset, tf.estimator.ModeKeys.PREDICT)
>>> list(readers.from_dataset(dataset))
[{'a': 0}, {'a': 0}, {'a': 0}]

Authors of new Prepro subclasses typically override the apply method of the base Prepro class:

def apply(self, dataset: tf.data.Dataset, mode: str = None) -> tf.data.Dataset:
    return dataset

The easiest way to define custom preprocessors is to use the prepro decorator (see documentation).
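
As an illustration, here is a minimal sketch of such a subclass (TakeInEval is a hypothetical name, not part of the library) that caps the number of elements, but only in EVAL mode:

class TakeInEval(Prepro):
    """Keep at most count elements, but only in EVAL mode (hypothetical example)."""

    def __init__(self, count: int):
        super().__init__()
        self.count = count

    def apply(self, dataset: tf.data.Dataset, mode: str = None) -> tf.data.Dataset:
        # Only cap the dataset when evaluating; leave it untouched otherwise
        if mode == tf.estimator.ModeKeys.EVAL:
            return dataset.take(self.count)
        return dataset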

abstract apply(dataset, mode=None)[source]

Pre-process a dataset

Return type:

DatasetV1

class deepr.prepros.base.PreproFn(prepro_fn)[source]

Bases: Prepro

Prepro from function.
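
For example, a minimal sketch (assuming PreproFn simply forwards the dataset and mode to the wrapped function):

import deepr

def repeat_twice(dataset, mode=None):
    # hypothetical (dataset, mode) function wrapped as a Prepro
    return dataset.repeat(2)

prepro_fn = deepr.prepros.base.PreproFn(repeat_twice)
dataset = prepro_fn(tf.data.Dataset.from_tensor_slices([0, 1]))
# yields 0, 1, 0, 1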

apply(dataset, mode=None)[source]

Pre-process a dataset

Return type:

DatasetV1

deepr.prepros.base.prepro(fn)[source]

Decorator that creates a Prepro class from a function.

For example, the following snippet defines a subclass of Prepro whose apply offsets each element of the dataset by offset:

>>> from deepr import readers
>>> from deepr.prepros import prepro
>>> @prepro
... def AddOffset(dataset, mode, offset):
...     return dataset.map(lambda element: element + offset)
>>> raw_dataset = tf.data.Dataset.from_tensor_slices([0, 1, 2])
>>> prepro_fn = AddOffset(offset=1)
>>> dataset = prepro_fn(raw_dataset)
>>> list(readers.from_dataset(dataset))
[1, 2, 3]

The class created by the decorator is roughly equivalent to

class AddOffset(Prepro):

    def __init__(self, offset):
        Prepro.__init__(self)
        self.offset = offset

    def apply(self, dataset, mode: str = None):
        return dataset.map(lambda element: element + self.offset)

You can also add a mode argument to your preprocessor like so:

>>> @prepro
... def AddOffsetInTrain(dataset, mode, offset):
...     if mode == tf.estimator.ModeKeys.TRAIN:
...         return dataset.map(lambda element: element + offset)
...     else:
...         return dataset
>>> prepro_fn = AddOffsetInTrain(offset=1)
>>> dataset = prepro_fn(raw_dataset, tf.estimator.ModeKeys.TRAIN)
>>> list(readers.from_dataset(dataset))
[1, 2, 3]
>>> dataset = prepro_fn(raw_dataset, tf.estimator.ModeKeys.PREDICT)
>>> list(readers.from_dataset(dataset))
[0, 1, 2]
>>> dataset = prepro_fn(raw_dataset)
>>> list(readers.from_dataset(dataset))
[0, 1, 2]

Note that 'dataset' and 'mode' need to be the first arguments of the function IN THIS ORDER.

Return type:

Type[Prepro]

deepr.prepros.combinators module

Combine Preprocessors

class deepr.prepros.combinators.Serial(*preprocessors, fuse=True, num_parallel_calls=None)[source]

Bases: Prepro

Chain preprocessors to define complex preprocessing pipelines.

It applies each preprocessing step one after the other. For performance reasons, it fuses consecutive Map and Filter operations into single tf.data calls.

For an example, see the following snippet:

import deepr

def gen():
    yield {"a": [0], "b": [0, 1]}
    yield {"a": [0, 1], "b": [0]}
    yield {"a": [0, 1], "b": [0, 1]}

prepro_fn = deepr.prepros.Serial(
    deepr.prepros.Map(deepr.layers.Sum(inputs=("a", "b"), outputs="c")),
    deepr.prepros.Filter(deepr.layers.IsMinSize(inputs="a", outputs="a_size", size=2)),
    deepr.prepros.Filter(deepr.layers.IsMinSize(inputs="b", outputs="b_size", size=2)),
)

dataset = tf.data.Dataset.from_generator(gen, {"a": tf.int32, "b": tf.int32}, {"a": (None,), "b": (None,)})
reader = deepr.readers.from_dataset(prepro_fn(dataset))
expected = [{"a": [0, 1], "b": [0, 1], "c": [0, 2]}]

fuse

If True (default), will fuse Map and Filter.

Type:

bool, Optional

preprocessors

Positional arguments: Prepro instances, or a Tuple / List / Generator of Prepro instances

Type:

Union[Prepro, Tuple[Prepro], List[Prepro], Generator[Prepro, None, None]]

apply(dataset, mode=None)[source]

Pre-process a dataset

Return type:

DatasetV1

deepr.prepros.core module

Core Classes for preprocessing

class deepr.prepros.core.Batch(batch_size, drop_remainder=False)[source]

Bases: Prepro

Combines consecutive elements of a dataset into batches.
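
For example, a short sketch using the documented signature:

prepro_fn = deepr.prepros.Batch(batch_size=2, drop_remainder=True)
dataset = prepro_fn(tf.data.Dataset.from_tensor_slices([0, 1, 2, 3, 4]))
# yields [0, 1] then [2, 3]; the trailing element is dropped because drop_remainder=True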

batch_size

Size of batches

Type:

int

drop_remainder

If True, drop the last batch when it has fewer than batch_size elements. Default is False.

Type:

bool, Optional

apply(dataset, mode=None)[source]

Pre-process a dataset

class deepr.prepros.core.Cache(filename=None, modes=None)[source]

Bases: Prepro

Cache Dataset in memory, unless a file is provided.

You must iterate over the dataset completely to cache it (i.e. a tf.errors.OutOfRangeError must be raised).

If caching to file, note that it consumes a lot of disk space (10x to 100x compared to tfrecords), and reloading seems brittle.

Prefer writing preprocessed data to tfrecord instead.
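
For example, a sketch of an in-memory cache (assumes the preprocessed dataset fits in memory):

import deepr

prepro_fn = deepr.prepros.Cache()
dataset = prepro_fn(tf.data.Dataset.from_tensor_slices([0, 1, 2]))
# The first full iteration fills the cache; later iterations read from it
for _ in range(2):
    print(list(deepr.readers.from_dataset(dataset)))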

filename

Path to the cache file. If None (default), the dataset is cached in memory.

Type:

str, Optional

apply(dataset, mode=None)[source]

Pre-process a dataset

class deepr.prepros.core.Filter(predicate, on_dict=True, modes=None)[source]

Bases: Prepro

Filter a dataset, keeping only the elements for which the predicate is True.

A Filter instance applies a predicate to all elements of a dataset and keeps only the elements for which the predicate returns True.

By default, elements are expected to be dictionaries. You can set on_dict=False if your dataset does not yield dictionaries.
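
For instance, a minimal sketch of the non-dictionary case, where the predicate returns a tf.bool directly:

>>> from deepr import readers
>>> from deepr.prepros import Filter
>>> prepro_fn = Filter(lambda x: tf.greater(x, 0), on_dict=False)
>>> dataset = prepro_fn(tf.data.Dataset.from_tensor_slices([-1, 0, 1, 2]))
>>> list(readers.from_dataset(dataset))
[1, 2]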

Because some preprocessing pipelines behave differently depending on the mode (TRAIN, EVAL, PREDICT), an optional argument can be provided. By setting modes, you select the modes on which the map transformation should apply. For example:

>>> from deepr import readers
>>> from deepr.prepros import Filter
>>> def gen():
...     yield {"a": 0}
...     yield {"a": 1}
>>> raw_dataset = tf.data.Dataset.from_generator(gen, {"a": tf.int32}, {"a": tf.TensorShape([])})
>>> list(readers.from_dataset(raw_dataset))
[{'a': 0}, {'a': 1}]
>>> def predicate(x):
...     return {"b": tf.equal(x["a"], 0)}
>>> prepro_fn = Filter(predicate, modes=[tf.estimator.ModeKeys.TRAIN])
>>> raw_dataset = tf.data.Dataset.from_generator(gen, {"a": tf.int32}, {"a": tf.TensorShape([])})
>>> dataset = prepro_fn(raw_dataset, tf.estimator.ModeKeys.TRAIN)
>>> list(readers.from_dataset(dataset))
[{'a': 0}]
>>> dataset = prepro_fn(raw_dataset, tf.estimator.ModeKeys.PREDICT)
>>> list(readers.from_dataset(dataset))
[{'a': 0}, {'a': 1}]

If the mode is not given at runtime, the preprocessing is applied.

>>> dataset = prepro_fn(raw_dataset)
>>> list(readers.from_dataset(dataset))
[{'a': 0}]

predicate

Predicate function, returns either a tf.bool or a dictionary with one key.

Type:

Callable

on_dict

If True (default), assumes dataset yields dictionaries

Type:

bool, Optional

modes

Active modes for the filter (will skip modes not in modes). Default is None (all modes are considered active modes).

Type:

Iterable[str], Optional

apply(dataset, mode=None)[source]

Pre-process a dataset

property tf_predicate

Return final predicate function.

class deepr.prepros.core.Map(map_func, on_dict=True, update=True, modes=None, num_parallel_calls=None)[source]

Bases: Prepro

Map a function on each element of a tf.data.Dataset.

A Map instance applies a map_func to all elements of a dataset. By default, elements are expected to be dictionaries. You can set on_dict=False if your dataset does not yield dictionaries.
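
For instance, a minimal sketch of the non-dictionary case (assuming update is ignored when on_dict=False):

>>> from deepr import readers
>>> from deepr.prepros import Map
>>> prepro_fn = Map(lambda x: x + 1, on_dict=False)
>>> dataset = prepro_fn(tf.data.Dataset.from_tensor_slices([0, 1, 2]))
>>> list(readers.from_dataset(dataset))
[1, 2, 3]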

If elements are dictionaries, you can use the additional argument update to choose to update dictionaries instead of overriding them.

NOTE: If map_func is a Layer, it directly uses forward or forward_as_dict to avoid inspection overhead from the Layer.__call__ method.

WARNING: if map_func is a Layer, the mode will not be forwarded by the Map.apply() call, and the default None will always be used. This is intended to keep the signature of the generic map_func in line with the tf.data.Dataset.map method.

If you wish to use a Layer with a given mode, you can do

>>> from functools import partial
>>> from deepr import readers
>>> from deepr.layers import Sum
>>> from deepr.prepros import Map
>>> layer = Sum()
>>> prepro_fn = Map(partial(layer.forward_as_dict, mode=tf.estimator.ModeKeys.TRAIN))

For example, by setting update=True (DEFAULT behavior)

>>> def gen():
...     yield {"a": 0}
>>> dataset = tf.data.Dataset.from_generator(gen, {"a": tf.int32}, {"a": tf.TensorShape([])})
>>> list(readers.from_dataset(dataset))
[{'a': 0}]
>>> def map_func(x):
...     return {"b": x["a"] + 1}
>>> prepro_fn = Map(map_func, update=True)
>>> list(readers.from_dataset(prepro_fn(dataset)))
[{'a': 0, 'b': 1}]

On the other hand, update=False yields only the output of map_func

>>> prepro_fn = Map(map_func, update=False)
>>> list(readers.from_dataset(prepro_fn(dataset)))
[{'b': 1}]

Because some preprocessing pipelines behave differently depending on the mode (TRAIN, EVAL, PREDICT), an optional argument can be provided. By setting modes, you select the modes on which the map transformation should apply. For example:

>>> prepro_fn = Map(map_func, modes=[tf.estimator.ModeKeys.TRAIN])
>>> list(readers.from_dataset(prepro_fn(dataset, tf.estimator.ModeKeys.TRAIN)))
[{'a': 0, 'b': 1}]
>>> list(readers.from_dataset(prepro_fn(dataset, tf.estimator.ModeKeys.PREDICT)))
[{'a': 0}]

If the mode is not given at runtime, the preprocessing is applied.

>>> list(readers.from_dataset(prepro_fn(dataset)))
[{'a': 0, 'b': 1}]

map_func

Function to map to each element

Type:

Callable[[Any], Any]

modes

Active modes for the map (will skip modes not in modes). Default is None (all modes are considered active modes).

Type:

Iterable[str], Optional

num_parallel_calls

Number of threads.

Type:

int

on_dict

If True (default), assumes dataset yields dictionaries

Type:

bool

update

If True (default), combine element and map_func(element)

Type:

bool

apply(dataset, mode=None)[source]

Pre-process a dataset

property tf_map_func

Return final map function.

class deepr.prepros.core.PaddedBatch(batch_size, fields, drop_remainder=False)[source]

Bases: Prepro

Combines consecutive elements of a dataset into padded batches.

NOTE: this applies to datasets yielding dictionaries ONLY.

If you want to create padded batches from other structures, you need to create your own padded batch prepro wrapping the TensorFlow implementation. For example:

@deepr.prepros.prepro
def PaddedBatchDefault(dataset, batch_size, padded_shapes, padding_values):
    return dataset.padded_batch(batch_size, padded_shapes, padding_values)
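
For the dictionary case, a hedged sketch using the documented signature (assumes Field is exposed as deepr.Field and takes name, shape and dtype arguments, with None marking the padded dimensions):

fields = [deepr.Field(name="a", shape=(None,), dtype=tf.int32)]
prepro_fn = deepr.prepros.PaddedBatch(batch_size=2, fields=fields)
# pads "a" along its variable dimension so consecutive elements can be batched together
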
batch_size

Size of batches

Type:

int

fields

Field information for each key of yielded dictionaries

Type:

Iterable[Field]

drop_remainder

If True, drop the last batch when it has fewer than batch_size elements. Default is False.

Type:

bool, Optional

apply(dataset, mode=None)[source]

Pre-process a dataset

class deepr.prepros.core.Prefetch(buffer_size)[source]

Bases: Prepro

Creates a dataset that prefetches elements on CPU / GPU.

buffer_size

Number of elements to prefetch. High values may lead to high memory consumption; a buffer_size of 1 is recommended.

Type:

int

apply(dataset, mode=None)[source]

Pre-process a dataset

class deepr.prepros.core.Repeat(count=None, modes=None)[source]

Bases: Prepro

Repeats a dataset so each original value is seen count times.

count

Number of times the dataset is repeated; if None or -1, repeat forever.

Type:

int

modes

Active modes for the repeat (will skip modes not in modes). Default is None (all modes are considered active modes).

Type:

Iterable[str], Optional

apply(dataset, mode=None)[source]

Pre-process a dataset

class deepr.prepros.core.Shuffle(buffer_size, modes=None, seed=None, reshuffle_each_iteration=None)[source]

Bases: Prepro

Randomly shuffles the elements of a dataset.

buffer_size

Size of the shuffle buffer

Type:

int

modes

Active modes for the shuffle (will skip modes not in modes). Default is None (all modes are considered active modes).

Type:

Iterable[str], Optional

apply(dataset, mode=None)[source]

Pre-process a dataset

class deepr.prepros.core.Take(count=None)[source]

Bases: Prepro

Creates a dataset with at most count elements.

count

Maximum number of elements to keep. If None (default), no capping is applied (the take transformation is skipped).

Type:

int

apply(dataset, mode=None)[source]

Pre-process a dataset
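
Putting the core preprocessors together, a typical training pipeline might look like the following sketch (buffer and batch sizes are arbitrary):

raw_dataset = tf.data.Dataset.from_tensor_slices(list(range(100)))
prepro_fn = deepr.prepros.Serial(
    deepr.prepros.Shuffle(buffer_size=1024, modes=[tf.estimator.ModeKeys.TRAIN]),
    deepr.prepros.Repeat(count=None, modes=[tf.estimator.ModeKeys.TRAIN]),  # repeat forever in TRAIN
    deepr.prepros.Batch(batch_size=32),
    deepr.prepros.Prefetch(buffer_size=1),
)
dataset = prepro_fn(raw_dataset, tf.estimator.ModeKeys.TRAIN)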

deepr.prepros.lookup module

Lookup Preprocessing Utilities.

class deepr.prepros.lookup.TableInitializer(table_initializer_fn)[source]

Bases: Prepro

Table Initializer.

TensorFlow does not allow table initialization inside a map transformation (all tables must be created outside the map).

To remedy this, follow this example

First, create a table_initializer_fn that uses the tf.AUTO_REUSE argument.

>>> import deepr
>>> def table_initializer_fn():
...     return deepr.utils.table_from_mapping(
...         name="partner_table", mapping={1: 2}, reuse=tf.AUTO_REUSE
...     )

Then, define your preprocessing pipeline as follows

>>> prepro_fn = deepr.prepros.Serial(
...     deepr.prepros.TableInitializer(table_initializer_fn),
...     deepr.prepros.Map(deepr.layers.Lookup(table_initializer_fn)),
... )

When applied to a tf.data.Dataset, the prepro_fn first runs the table_initializer_fn (outside the map transformation), then applies the Lookup that uses the same table_initializer_fn. Thanks to reuse=tf.AUTO_REUSE, the Lookup does not create a new table but simply reuses the one created by the TableInitializer.

apply(dataset, mode=None)[source]

Pre-process a dataset

Return type:

DatasetV1

deepr.prepros.record module

Parse TF Records

class deepr.prepros.record.FromExample(fields, sequence=None, modes=None, num_parallel_calls=None, batched=False)[source]

Bases: Map

Parse TF Record Sequence Example

deepr.prepros.record.TFRecordSequenceExample

alias of FromExample
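
For example, a hedged sketch of parsing a TFRecord file (the filename is hypothetical; assumes Field is exposed as deepr.Field):

fields = [deepr.Field(name="a", shape=(None,), dtype=tf.int64)]
prepro_fn = deepr.prepros.FromExample(fields=fields)
# parse serialized tf.train.Example protos read from disk into dictionaries of Tensors
dataset = prepro_fn(tf.data.TFRecordDataset(["data.tfrecord"]))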

class deepr.prepros.record.ToExample(fields, sequence=None, modes=None, num_parallel_calls=None)[source]

Bases: Map

Convert dictionary of Tensors to tf.SequenceExample.

deepr.prepros.record.arrays_to_example(arrays, fields, sequence=None)[source]

Convert NumPy arrays to a tf.train.Example.
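
A short sketch (assumes arrays is a dictionary of NumPy arrays keyed by field name):

import numpy as np

fields = [deepr.Field(name="a", shape=(None,), dtype=tf.int64)]
example = deepr.prepros.record.arrays_to_example({"a": np.array([0, 1])}, fields=fields)
serialized = example.SerializeToString()  # ready to be written to a TFRecord file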

Module contents