deepr.prepros package
Submodules
deepr.prepros.base module
Abstract Base Class for preprocessing
- class deepr.prepros.base.Prepro[source]
Bases:
ABC
Base class for composable preprocessing functions.
Prepros are the basic building blocks of a preprocessing pipeline. A Prepro defines a function on a tf.data.Dataset.
The basic usage of a Prepro is to apply it on a Dataset. For example:

>>> from deepr import readers
>>> from deepr.prepros import Map
>>> def gen():
...     for i in range(3):
...         yield {"a": i}
>>> raw_dataset = tf.data.Dataset.from_generator(gen, {"a": tf.int32}, {"a": tf.TensorShape([])})
>>> list(readers.from_dataset(raw_dataset))
[{'a': 0}, {'a': 1}, {'a': 2}]
>>> prepro_fn = Map(lambda x: {'a': x['a'] + 1})
>>> dataset = prepro_fn(raw_dataset)
>>> list(readers.from_dataset(dataset))
[{'a': 1}, {'a': 2}, {'a': 3}]

Because some preprocessing pipelines behave differently depending on the mode (TRAIN, EVAL, PREDICT), an optional argument can be provided:

>>> def map_func(element, mode=None):
...     if mode == tf.estimator.ModeKeys.PREDICT:
...         return {'a': 0}
...     else:
...         return element
>>> prepro_fn = Map(map_func)
>>> list(readers.from_dataset(raw_dataset))
[{'a': 0}, {'a': 1}, {'a': 2}]
>>> dataset = prepro_fn(raw_dataset, mode=tf.estimator.ModeKeys.TRAIN)
>>> list(readers.from_dataset(dataset))
[{'a': 0}, {'a': 1}, {'a': 2}]
>>> dataset = prepro_fn(raw_dataset, mode=tf.estimator.ModeKeys.PREDICT)
>>> list(readers.from_dataset(dataset))
[{'a': 0}, {'a': 1}, {'a': 2}]
TODO: the mode argument of map_func is currently not taken into account.
Map, Filter, Shuffle and Repeat have a special attribute modes that you can use to specify the modes on which the preprocessing should be applied. For example:

>>> def map_func(element, mode=None):
...     return {'a': 0}
>>> prepro_fn = Map(map_func, modes=[tf.estimator.ModeKeys.PREDICT])
>>> dataset = prepro_fn(raw_dataset, tf.estimator.ModeKeys.TRAIN)
>>> list(readers.from_dataset(dataset))
[{'a': 0}, {'a': 1}, {'a': 2}]
>>> dataset = prepro_fn(dataset, tf.estimator.ModeKeys.PREDICT)
>>> list(readers.from_dataset(dataset))
[{'a': 0}, {'a': 0}, {'a': 0}]

Authors of new Prepro subclasses typically override the apply method of the base Prepro class:

def apply(self, dataset: tf.data.Dataset, mode: str = None) -> tf.data.Dataset:
    return dataset
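As a minimal sketch of such a subclass (the TakeFirst name and behavior below are illustrative, not part of the library):

import tensorflow as tf
from deepr.prepros.base import Prepro

class TakeFirst(Prepro):
    """Keep only the first `count` elements of the dataset (illustrative example)."""

    def __init__(self, count: int):
        super().__init__()
        self.count = count

    def apply(self, dataset: tf.data.Dataset, mode: str = None) -> tf.data.Dataset:
        # mode is ignored here; use the modes attribute of Map / Filter /
        # Shuffle / Repeat when you need mode-dependent behavior.
        return dataset.take(self.count)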
The easiest way to define custom preprocessors is to use the prepro decorator (see documentation).
- deepr.prepros.base.prepro(fn)[source]
Decorator that creates a Prepro class from a function.

For example, the following snippet defines a subclass of Prepro whose apply offsets each element of the dataset by offset:

>>> from deepr import readers
>>> from deepr.prepros import prepro
>>> @prepro
... def AddOffset(dataset, mode, offset):
...     return dataset.map(lambda element: element + offset)
>>> raw_dataset = tf.data.Dataset.from_tensor_slices([0, 1, 2])
>>> prepro_fn = AddOffset(offset=1)
>>> dataset = prepro_fn(raw_dataset)
>>> list(readers.from_dataset(dataset))
[1, 2, 3]
The class created by the decorator is roughly equivalent to
class AddOffset(Prepro):

    def __init__(self, offset):
        Prepro.__init__(self)
        self.offset = offset

    def apply(self, dataset, mode: str = None):
        return dataset.map(lambda element: element + self.offset)
You can also add a mode argument to your preprocessor like so:

>>> @prepro
... def AddOffsetInTrain(dataset, mode, offset):
...     if mode == tf.estimator.ModeKeys.TRAIN:
...         return dataset.map(lambda element: element + offset)
...     else:
...         return dataset
>>> prepro_fn = AddOffsetInTrain(offset=1)
>>> dataset = prepro_fn(raw_dataset, tf.estimator.ModeKeys.TRAIN)
>>> list(readers.from_dataset(dataset))
[1, 2, 3]
>>> dataset = prepro_fn(raw_dataset, tf.estimator.ModeKeys.PREDICT)
>>> list(readers.from_dataset(dataset))
[0, 1, 2]
>>> dataset = prepro_fn(raw_dataset)
>>> list(readers.from_dataset(dataset))
[0, 1, 2]
Note that dataset and mode need to be the first arguments of the function, IN THIS ORDER.
deepr.prepros.combinators module
Combine Preprocessors
- class deepr.prepros.combinators.Serial(*preprocessors, fuse=True, num_parallel_calls=None)[source]
Bases:
Prepro
Chain preprocessors to define complex preprocessing pipelines.
It will apply each preprocessing step one after the other on each element. For performance reasons, it fuses Map and Filter operations into single tf.data calls.

For an example, see the following snippet:
import deepr

def gen():
    yield {"a": [0], "b": [0, 1]}
    yield {"a": [0, 1], "b": [0]}
    yield {"a": [0, 1], "b": [0, 1]}

prepro_fn = deepr.prepros.Serial(
    deepr.prepros.Map(deepr.layers.Sum(inputs=("a", "b"), outputs="c")),
    deepr.prepros.Filter(deepr.layers.IsMinSize(inputs="a", outputs="a_size", size=2)),
    deepr.prepros.Filter(deepr.layers.IsMinSize(inputs="b", outputs="b_size", size=2)),
)

dataset = tf.data.Dataset.from_generator(gen, {"a": tf.int32, "b": tf.int32}, {"a": (None,), "b": (None,)})
reader = deepr.readers.from_dataset(prepro_fn(dataset))
expected = [{"a": [0, 1], "b": [0, 1], "c": [0, 2]}]
- preprocessors
Positional arguments of Prepro instances, or a Tuple / List / Generator of Prepro instances (see the sketch below).
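A hedged sketch of the list form (the map functions are illustrative):

import deepr

# Serial also accepts a list (or generator) of preprocessors, which is
# convenient when the pipeline is built programmatically.
prepro_fn = deepr.prepros.Serial(
    [deepr.prepros.Map(lambda x, offset=offset: {"a": x["a"] + offset}) for offset in (1, 2)]
)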
deepr.prepros.core module
Core Classes for preprocessing
- class deepr.prepros.core.Batch(batch_size, drop_remainder=False)[source]
Bases:
Prepro
Combines consecutive elements of a dataset into batches.
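A hedged usage sketch (the dataset and sizes are illustrative; imports assume the same re-exports used in the examples above):

import tensorflow as tf
from deepr import readers
from deepr.prepros import Batch

dataset = tf.data.Dataset.from_tensor_slices([0, 1, 2, 3, 4])
# drop_remainder=True discards the final, incomplete batch.
prepro_fn = Batch(batch_size=2, drop_remainder=True)
list(readers.from_dataset(prepro_fn(dataset)))  # two batches: [0, 1] and [2, 3]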
- class deepr.prepros.core.Cache(filename=None, modes=None)[source]
Bases:
Prepro
Cache Dataset in memory, unless a file is provided.
You must iterate over the dataset completely to cache it (i.e. a tf.errors.OutOfRangeError must be raised).

If caching to a file, note that it consumes a lot of disk space (10x to 100x compared to tfrecords), and reloading seems brittle. Prefer writing preprocessed data to tfrecords instead.
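A hedged sketch of both caching variants (the pipeline and the file path are illustrative):

from deepr.prepros import Cache, Map, Serial

# Cache the preprocessed dataset in memory (no filename).
in_memory = Serial(
    Map(lambda x: {"a": x["a"] + 1}),
    Cache(),
)

# Or cache to a file on disk (path is illustrative); prefer tfrecords for large data.
on_disk = Serial(
    Map(lambda x: {"a": x["a"] + 1}),
    Cache(filename="/tmp/dataset_cache"),
)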
- class deepr.prepros.core.Filter(predicate, on_dict=True, modes=None)[source]
Bases:
Prepro
Filter a dataset keeping only elements on which predicate is True
A Filter instance applies a predicate to all elements of a dataset and keeps only the elements for which the predicate returns True.

By default, elements are expected to be dictionaries. You can set on_dict=False if your dataset does not yield dictionaries.

Because some preprocessing pipelines behave differently depending on the mode (TRAIN, EVAL, PREDICT), an optional argument can be provided. By setting modes, you select the modes on which the filter transformation should apply. For example:
>>> from deepr import readers
>>> from deepr.prepros import Filter
>>> def gen():
...     yield {"a": 0}
...     yield {"a": 1}
>>> raw_dataset = tf.data.Dataset.from_generator(gen, {"a": tf.int32}, {"a": tf.TensorShape([])})
>>> list(readers.from_dataset(raw_dataset))
[{'a': 0}, {'a': 1}]
>>> def predicate(x):
...     return {"b": tf.equal(x["a"], 0)}
>>> prepro_fn = Filter(predicate, modes=[tf.estimator.ModeKeys.TRAIN])
>>> raw_dataset = tf.data.Dataset.from_generator(gen, {"a": tf.int32}, {"a": tf.TensorShape([])})
>>> dataset = prepro_fn(raw_dataset, tf.estimator.ModeKeys.TRAIN)
>>> list(readers.from_dataset(dataset))
[{'a': 0}]
>>> dataset = prepro_fn(raw_dataset, tf.estimator.ModeKeys.PREDICT)
>>> list(readers.from_dataset(dataset))
[{'a': 0}, {'a': 1}]
If the mode is not given at runtime, the preprocessing is applied.
>>> dataset = prepro_fn(raw_dataset)
>>> list(readers.from_dataset(dataset))
[{'a': 0}]
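When the dataset does not yield dictionaries, a hedged sketch with on_dict=False (the dataset and predicate are illustrative):

import tensorflow as tf
from deepr import readers
from deepr.prepros import Filter

dataset = tf.data.Dataset.from_tensor_slices([0, 1, 2])
# With on_dict=False the predicate receives the raw element and returns a tf.bool.
prepro_fn = Filter(lambda x: tf.greater(x, 0), on_dict=False)
list(readers.from_dataset(prepro_fn(dataset)))  # [1, 2]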
- predicate
Predicate function, returns either a tf.bool or a dictionary with one key.
- Type:
Callable
- modes
Active modes for the filter (will skip modes not in modes). Default is None (all modes are considered active modes).
- Type:
Iterable[str], Optional
- property tf_predicate
Return final predicate function.
- class deepr.prepros.core.Map(map_func, on_dict=True, update=True, modes=None, num_parallel_calls=None)[source]
Bases:
Prepro
Map a function on each element of a tf.data.Dataset.
A Map instance applies a map_func to all elements of a dataset. By default, elements are expected to be dictionaries. You can set on_dict=False if your dataset does not yield dictionaries.

If elements are dictionaries, you can use the additional argument update to choose to update dictionaries instead of overriding them.

NOTE: If map_func is a Layer, it directly uses forward or forward_as_dict to avoid inspection overhead from the Layer.__call__ method.

WARNING: if map_func is a Layer, the mode will not be forwarded by the Map.apply() call, and the default None will always be used. This is intended to keep the signature of the generic map_func in line with the tf.data.Dataset.map method.

If you wish to use a Layer with a given mode, you can do

>>> from functools import partial
>>> from deepr import readers
>>> from deepr.layers import Sum
>>> from deepr.prepros import Map
>>> layer = Sum()
>>> prepro_fn = Map(partial(layer.forward_as_dict, mode=tf.estimator.ModeKeys.TRAIN))
For example, by setting update=True (DEFAULT behavior)
>>> def gen():
...     yield {"a": 0}
>>> dataset = tf.data.Dataset.from_generator(gen, {"a": tf.int32}, {"a": tf.TensorShape([])})
>>> list(readers.from_dataset(dataset))
[{'a': 0}]
>>> def map_func(x):
...     return {"b": x["a"] + 1}
>>> prepro_fn = Map(map_func, update=True)
>>> list(readers.from_dataset(prepro_fn(dataset)))
[{'a': 0, 'b': 1}]
On the other hand, update=False yields only the output of the map_func:

>>> prepro_fn = Map(map_func, update=False)
>>> list(readers.from_dataset(prepro_fn(dataset)))
[{'b': 1}]
Because some preprocessing pipelines behave differently depending on the mode (TRAIN, EVAL, PREDICT), an optional argument can be provided. By setting modes, you select the modes on which the map transformation should apply. For example:
>>> prepro_fn = Map(map_func, modes=[tf.estimator.ModeKeys.TRAIN])
>>> list(readers.from_dataset(prepro_fn(dataset, tf.estimator.ModeKeys.TRAIN)))
[{'a': 0, 'b': 1}]
>>> list(readers.from_dataset(prepro_fn(dataset, tf.estimator.ModeKeys.PREDICT)))
[{'a': 0}]
If the mode is not given at runtime, the preprocessing is applied.
>>> list(readers.from_dataset(prepro_fn(dataset)))
[{'a': 0, 'b': 1}]
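The constructor also accepts num_parallel_calls; a hedged sketch, assuming it is forwarded to the underlying tf.data map call:

from deepr.prepros import Map

# Run the per-element map with 4 parallel calls (value is illustrative).
prepro_fn = Map(lambda x: {"a": x["a"] * 2}, num_parallel_calls=4)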
- map_func
Function to map to each element
- Type:
Callable[[Any], Any]
- modes
Active modes for the map (will skip modes not in modes). Default is None (all modes are considered active modes).
- Type:
Iterable[str], Optional
- property tf_map_func
Return final map function.
- class deepr.prepros.core.PaddedBatch(batch_size, fields, drop_remainder=False)[source]
Bases:
Prepro
Combines consecutive elements of a dataset into padded batches.
NOTE: this applies to datasets yielding dictionaries ONLY.
If you want to create padded batches from other structures, you need to create your own padded batch prepro wrapping the tensorflow implementation. For example:
@deepr.prepros.prepro
def PaddedBatchDefault(dataset, batch_size, padded_shapes, padding_values):
    return dataset.padded_batch(batch_size, padded_shapes, padding_values)
- class deepr.prepros.core.Prefetch(buffer_size)[source]
Bases:
Prepro
Creates a dataset that prefetches elements on CPU / GPU.
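A hedged sketch, typically placing Prefetch at the end of a pipeline (batch size is illustrative):

from deepr.prepros import Batch, Prefetch, Serial

prepro_fn = Serial(
    Batch(batch_size=32),
    Prefetch(buffer_size=1),  # prefetch one batch ahead
)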
- buffer_size
Number of elements to prefetch. High values may lead to high memory consumption; it is recommended to use a buffer_size of 1.
- Type:
int
- class deepr.prepros.core.Repeat(count=None, modes=None)[source]
Bases:
Prepro
Repeats a dataset so each original value is seen count times.
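A hedged sketch restricting the repeat to training (values are illustrative):

import tensorflow as tf
from deepr import readers
from deepr.prepros import Repeat

dataset = tf.data.Dataset.from_tensor_slices([0, 1])
prepro_fn = Repeat(count=2, modes=[tf.estimator.ModeKeys.TRAIN])
list(readers.from_dataset(prepro_fn(dataset, tf.estimator.ModeKeys.TRAIN)))  # [0, 1, 0, 1]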
- class deepr.prepros.core.Shuffle(buffer_size, modes=None, seed=None, reshuffle_each_iteration=None)[source]
Bases:
Prepro
Randomly shuffles the elements of a dataset.
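A hedged sketch, shuffling only in TRAIN mode with a fixed seed (values are illustrative):

import tensorflow as tf
from deepr.prepros import Shuffle

prepro_fn = Shuffle(buffer_size=1024, modes=[tf.estimator.ModeKeys.TRAIN], seed=42)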
deepr.prepros.lookup module
Lookup Preprocessing Utilities.
- class deepr.prepros.lookup.TableInitializer(table_initializer_fn)[source]
Bases:
Prepro
Table Initializer.
TensorFlow does not allow table initialization inside a map transformation (all tables must be created outside the map).

To remedy this, follow this example.

First, create a table_initializer_fn that uses the tf.AUTO_REUSE argument:

>>> import deepr
>>> def table_initializer_fn():
...     return deepr.utils.table_from_mapping(
...         name="partner_table", mapping={1: 2}, reuse=tf.AUTO_REUSE
...     )
Then, define your preprocessing pipeline as follows
>>> prepro_fn = deepr.prepros.Serial(
...     deepr.prepros.TableInitializer(table_initializer_fn),
...     deepr.prepros.Map(deepr.layers.Lookup(table_initializer_fn)),
... )
When applying the prepro_fn on a tf.data.Dataset, it will run the table_initializer_fn at the beginning (outside the map transformation), then apply the Lookup that uses the same table_initializer_fn; thanks to reuse=tf.AUTO_REUSE, instead of creating a new table it will simply reuse the table created by the TableInitializer.
deepr.prepros.record module
Parse TF Records
- class deepr.prepros.record.FromExample(fields, sequence=None, modes=None, num_parallel_calls=None, batched=False)[source]
Bases:
Map
Parse TF Record Sequence Example
- deepr.prepros.record.TFRecordSequenceExample
alias of
FromExample