‘Gulp’ a Dataset¶
gulpio package has been designed to be infinitely hackable and to support
arbitrary datasets. To this end, we are providing a so-called adapter
pattern. Specifically there exists an abstract class in
AbstractDatasetAdapter. In order to ingest your dataset, you basically
need to implement your own custom adapter that inherits from this.
You should be able to get going quickly by looking at the following examples, that we use internally to gulp our video datasets.
gulp2_20bn_json_videoscommand line script
And an example invocation would be:
gulp2_20bn_json_videos videos.json input_dir output_dir
Additionally, if you would like to ingest your dataset from the command line,
register_adapter script can be used to generate the command line interface
for the new adapter. Write your adapter that inherits from the
adapter.py file, then simply call:
gulp2_register_adapter gulpio.adapters <NewAdapterClassName>
The script that provides the command line interface will be in the main directory of the repository. To use it, execute
Sanity check the ‘Gulped’ Files¶
A very basic test to check the correctness of the gulped files is provided by the
For execution run:
gulp2_sanity_check <folder containing the gulped files>
The presence of any content in the
The file size of the
.gulpfile corresponds to the required file size that is given in the
Duplicate appearances of any video-ids
The file names of the files where any test fails will be printed. Currently no script to fix possible errors is provided, ‘regulping’ is the only solution.
Read a ‘Gulped’ Dataset¶
In order to read from the gulps, you can let yourself be inspired by the following snippet:
from gulpio2 import GulpDirectory # You can either read greyscale (`colorspace="GRAY"`) or RGB (`colorspace="RGB"`) # images. gulp_directory = GulpDirectory('/tmp/something_something_gulps') # iterate over all chunks for chunk in gulp_directory: # for each 'video' get the metadata and all frames for frames, meta in chunk: # do something with the metadata for i, f in enumerate(frames): # do something with the frames pass
Alternatively, a video with a specific
id can be directly accessed via:
from gulpio2 import GulpDirectory gulp_directory = GulpDirectory('/tmp/something_something_gulps') frames, meta = gulp_directory[<id>]
For down-sampling or loading only a part of a video, a python slice or list of indices can be passed as well:
frames, meta = gulp_directory[<id>, slice(1,10,2)] frames, meta = gulp_directory[<id>, [1, 5, 6, 8]]
frames, meta = gulp_directory[<id>, 1:10:2]
Below is an example loading an image dataset and defining an augmentation pipeline using torchvision. Transformations are applied to each instance on the fly.
from torch.utils.data import DataLoader from torchvision.transforms import Scale, CenterCrop, Compose, Normalize class GulpImageDataset: def __init__(self, gulp_dir: GulpDirectory, transform=None): self.gulp_dir = gulp_dir self.transform = transform if transform is not None else lambda x: x self.example_ids = list(gulp_dir.merged_meta_dict.keys()) def __getitem__(self, idx): if isinstance(idx, int): example_id = self.example_ids[idx] else: example_id = idx imgs, meta = self.gulp_dir[example_id] return self.transform(imgs), meta def __len__(self): return len(self.gulp_dir.merged_meta_dict) # define data augmentations. Notice that there are different functions for videos and images transform = Compose([ Resize(120), CenterCrop(112), Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)), ]) # define dataset wrapper and pick this up by the data loader interface. dataset = GulpImageDataset(GulpDirectory('/path/to/train_data'), transform=transforms) loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=0, drop_last=True) dataset_val = GulpImageDataset(GulpDirectory('/path/to/validation_data/'), transform=transforms) loader_val = DataLoader(dataset_val, batch_size=256, shuffle=False, num_workers=0, drop_last=True)
Here we iterate through the dataset we loaded. Iterator returns data and label as numpy arrays. You might need to cast these into the format of you deep learning library.
for data, label in loader: # train your model here # ...