Dataset loading utilities

class dlordinal.datasets.Adience(root: str | Path, train: bool = True, ranges: list | tuple = ((0, 2), (4, 6), (8, 13), (15, 20), (25, 32), (38, 43), (48, 53), (60, 100)), test_size: float = 0.2, transform: Callable | None = None, target_transform: Callable | None = None, verbose: bool = False, download: bool = False, username: str | None = None, password: str | None = None)[source]

PyTorch dataset for the Adience age classification benchmark (Eidinger et al. [1]).

The Adience dataset contains unfiltered face images collected from Flickr albums and is commonly used for age and gender classification benchmarks.

Parameters:

root (Union[str, Path]) –
Root directory where the dataset will be stored.

If download=False (default), the following files are expected to already exist inside the adience directory:

1. aligned.tar.gz: tar.gz archive containing the aligned face images.

2. folds: directory containing the official Adience fold files: fold_0_data.txt through fold_4_data.txt.

If download=True, these files are downloaded automatically from the official Adience website.
ranges (list, optional) – List of age ranges to use, by default [(0, 2), (4, 6), (8, 13), (15, 20), (25, 32), (38, 43), (48, 53), (60, 100)].
test_size (float, optional, default = 0.2) – Fraction of the data held out for the test partition.
transform (Callable, optional) – A callable that takes in a PIL image and returns a transformed version.
target_transform (Callable, optional) – A callable that takes in the target and transforms it.
verbose (bool, optional, default = False) – Whether to print progress messages.
download (bool, optional, default = False) –
Whether to download the dataset automatically.

Downloading requires valid username and password credentials provided by the Adience dataset authors.
username (str, optional) – Username to download the dataset. If not provided, the dataset will not be downloaded and the files are expected to be already present in the root directory.
password (str, optional) – Password to download the dataset. If not provided, the dataset will not be downloaded and the files are expected to be already present in the root directory.
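
The ranges parameter defines the ordinal age groups. The following is a minimal sketch of how such ranges could be turned into class indices, assuming each tuple is an inclusive (low, high) interval and that a sample's class is the index of the interval containing its age. age_to_class is a hypothetical helper, not part of dlordinal, and how the library treats ages that fall between intervals (such as 3 or 14) is not stated here.

```python
from typing import Optional, Sequence, Tuple

# Default age ranges, copied from the Adience signature above.
DEFAULT_RANGES = (
    (0, 2), (4, 6), (8, 13), (15, 20),
    (25, 32), (38, 43), (48, 53), (60, 100),
)


def age_to_class(
    age: int, ranges: Sequence[Tuple[int, int]] = DEFAULT_RANGES
) -> Optional[int]:
    """Hypothetical helper: index of the range that contains `age`.

    Assumes inclusive bounds. Returns None when the age falls in a gap
    between two ranges.
    """
    for index, (low, high) in enumerate(ranges):
        if low <= age <= high:
            return index
    return None
```

Under these assumptions, age_to_class(30) returns 4 and age_to_class(3) returns None, because 3 lies in the gap between (0, 2) and (4, 6).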

root

Root directory where the Adience dataset is stored.

Type:: Path

train

Whether to use the training or test partition.

Type:: bool

transform

A callable that takes in a PIL image and returns a transformed version.

Type:: Callable

target_transform

A callable that takes in the target and transforms it.

Type:: Callable

verbose

Whether to print progress messages.

Type:: bool

data

List of image paths.

Type:: list

targets

Target of each sample in the dataset.

Type:: list

classes

Unique classes in the dataset.

Type:: list

download

Whether to download the dataset if it is not already present in the root directory. If False, the files are expected to be already present in the root directory.

Type:: bool

class dlordinal.datasets.FGNet(root: str | Path, download: bool = True, target_size: tuple = (128, 128), categories: list = [3, 11, 16, 24, 40], test_size: float = 0.2, validation_size: float = 0.15, train: bool = True, transform: Callable | None = None, target_transform: Callable | None = None)[source]

Base class for FGNet dataset.

root

Root directory of the dataset.

Type:: Path

target_size

Size of the images after resizing.

Type:: tuple

categories

List of categories to be used.

Type:: list

test_size

Size of the test set.

Type:: float

validation_size

Size of the validation set.

Type:: float

transform

A function/transform that takes in a PIL image and returns a transformed version.

Type:: callable, optional

target_transform

A function/transform that takes in the target and transforms it.

Type:: callable, optional

data

Dataframe containing the dataset.

Type:: pd.DataFrame

Parameters:

root (str or Path) – Root directory of the dataset.
download (bool, optional, default = True) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
target_size (tuple, optional) – Size of the images after resizing. Default is (128, 128).
categories (list, optional) – List of categories to be used. Default is [3, 11, 16, 24, 40].
test_size (float, optional) – Size of the test set. Default is 0.2.
validation_size (float, optional) – Size of the validation set. Default is 0.15.
train (bool, optional) – If True, returns the training dataset, otherwise returns the test dataset. Default is True.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.

property classes: List[int]

Return the unique classes in the dataset.

Returns:: List of unique classes.
Return type:: list

download() → None[source]

Download the FGNet dataset and extract it.

find_category(real_age)[source]

Find the category of the real age.

Parameters:: real_age (int) – Real age of the image.
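
As an illustration of what mapping a real age to a category can look like, here is a minimal sketch that treats the categories list as ascending age thresholds, each acting as an exclusive upper bound. That interpretation is an assumption, and find_category_sketch is a hypothetical stand-in, not the library's implementation.

```python
from bisect import bisect_right
from typing import Sequence

# Default thresholds, copied from the FGNet signature above.
CATEGORIES = [3, 11, 16, 24, 40]


def find_category_sketch(
    real_age: int, categories: Sequence[int] = CATEGORIES
) -> int:
    """Hypothetical sketch: count how many thresholds are <= real_age.

    Assumes `categories` is sorted in ascending order. With five
    thresholds this yields six ordinal classes, 0 through 5.
    """
    return bisect_right(categories, real_age)
```

With the default thresholds, ages below 3 fall in class 0 and ages of 40 or more fall in class 5 under this reading.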

get_age_from_filename(filename)[source]

Get the age from the filename.

Parameters:: filename (str) – Filename of the image.
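
A rough sketch of extracting the age from a filename, assuming the naming convention subject ID, the letter A, then the age, optionally followed by a suffix letter (for example 001A02.JPG). Both the convention as written here and the helper age_from_filename_sketch are assumptions for illustration; the library's own parsing may differ.

```python
import re
from pathlib import Path

# Assumed pattern: digits (subject), 'A' or 'a', digits (age).
_AGE_PATTERN = re.compile(r"^\d+[Aa](\d+)")


def age_from_filename_sketch(filename: str) -> int:
    """Hypothetical sketch: parse the age out of an FG-NET style filename."""
    stem = Path(filename).stem  # drop the directory and the extension
    match = _AGE_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename!r}")
    return int(match.group(1))
```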

load_data(original_path: Path)[source]

Load the data from the original_path.

Parameters:: original_path (Path) – Path to the original dataset.

process(original_path, processed_path)[source]

Process the FGNet dataset and save it in the processed_path.

Parameters:

original_path (Path) – Path to the original dataset.
processed_path (Path) – Path to save the processed dataset.

process_images_from_df(df: DataFrame, original_path: Path, processed_path: Path)[source]

Process the images from the dataframe.

Parameters:

df (pd.DataFrame) – Dataframe with the images.
original_path (Path) – Path to the original dataset.
processed_path (Path) – Path to save the processed dataset.

split(original_csv_path: Path, train_csv_path: Path, test_csv_path: Path, original_images_path: Path, train_images_path: Path, test_images_path: Path)[source]

Split the FGNet dataset into train and test sets.

Parameters:

original_csv_path (Path) – Path to the original csv file.
train_csv_path (Path) – Path to save the train csv file.
test_csv_path (Path) – Path to save the test csv file.
original_images_path (Path) – Path to the original images.
train_images_path (Path) – Path to save the train images.
test_images_path (Path) – Path to save the test images.

split_dataframe(csv_path: Path, train_images_path: Path, original_images_path: Path, test_images_path: Path)[source]

Split the dataframe into train and test sets.

Parameters:

csv_path (Path) – Path to the csv file.
train_images_path (Path) – Path to save the train images.
original_images_path (Path) – Path to the original images.
test_images_path (Path) – Path to save the test images.

property targets: List[int]

Return the targets of the dataset.

Returns:: List of targets.
Return type:: list

class dlordinal.datasets.FeatureDataset(filename)[source]

PyTorch dataset for tabular data stored in a CSV file, with one feature per column. The last column is the target variable.

Example

>>> from torch.utils.data import DataLoader
>>> from dlordinal.datasets import FeatureDataset
>>> train_data = FeatureDataset("train.csv")
>>> train_data.normalize_X()
>>> train_data.normalize_y()
>>> train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
>>> for X, y in train_loader:
...     print(X.shape, y.shape)
>>> test_data = FeatureDataset("test.csv")
>>> test_data.normalize_X(train_data.X_mean, train_data.X_scale)
>>> test_data.normalize_y(train_data.y_mean, train_data.y_scale)
>>> test_loader = DataLoader(test_data, batch_size=32, shuffle=False)
>>> for X, y in test_loader:
...     print(X.shape, y.shape)

get_valid_shape_array(v: ArrayLike)[source]

Convert the input ArrayLike object to a 2D numpy array with shape (n, 1) if it is a 1D array.

Parameters:: v (ArrayLike) – Input array.
Returns:: v – 2D numpy array with shape (n, 1).
Return type:: np.ndarray
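
The reshaping described above can be sketched with plain NumPy. to_column_sketch is a hypothetical name used only for illustration, and leaving inputs with two or more dimensions untouched is an assumption.

```python
import numpy as np


def to_column_sketch(v) -> np.ndarray:
    """Hypothetical sketch: turn a 1D array-like into an (n, 1) column."""
    arr = np.asarray(v)
    if arr.ndim == 1:
        # -1 lets NumPy infer n from the number of elements.
        arr = arr.reshape(-1, 1)
    return arr
```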

normalize_X(mean: ArrayLike | None = None, scale: ArrayLike | None = None)[source]

Standardize the features of the dataset. If mean and scale are not provided, they are computed from the dataset. If they are provided, they are used to standardize the dataset.

Parameters:

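mean (array-like, default=None) – Per-feature mean to subtract. If None, it is computed from this dataset.
scale (array-like, default=None) – Per-feature scale to divide by. If None, it is computed from this dataset.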

Returns:

self – The dataset with standardized features.

Return type:

FeatureDataset

Example

>>> train_data = FeatureDataset("train.csv")
>>> train_data.normalize_X()
>>> test_data = FeatureDataset("test.csv")
>>> test_data.normalize_X(train_data.X_mean, train_data.X_scale)
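
The reason for passing the training statistics to the test set is that both splits must be transformed with the same mean and scale. The sketch below shows that idea with plain NumPy. It is not the library's code: fit_standardizer and standardize are hypothetical helpers, and the use of the population standard deviation and the guard for constant columns are assumptions.

```python
import numpy as np


def fit_standardizer(X):
    """Hypothetical helper: per-column mean and scale of a 2D array."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    scale = X.std(axis=0)
    # Assumption: constant columns keep scale 1 to avoid dividing by zero.
    scale[scale == 0.0] = 1.0
    return mean, scale


def standardize(X, mean, scale):
    """Hypothetical helper: apply previously computed statistics."""
    return (np.asarray(X, dtype=float) - mean) / scale


X_train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
X_test = np.array([[3.0, 10.0], [7.0, 12.0]])

# Statistics come from the training split only and are reused for the test split.
mean, scale = fit_standardizer(X_train)
X_train_std = standardize(X_train, mean, scale)
X_test_std = standardize(X_test, mean, scale)
```

After this, the training columns have zero mean, while the test columns generally do not, which is expected because they were shifted by the training mean.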

normalize_y(mean: ArrayLike = None, scale: ArrayLike = None)[source]

Standardize the target variable of the dataset. If mean and scale are not provided, they are computed from the dataset. If they are provided, they are used to standardize the dataset.

Parameters:

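mean (array-like, default=None) – Mean of the target to subtract. If None, it is computed from this dataset.
scale (array-like, default=None) – Scale of the target to divide by. If None, it is computed from this dataset.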

Returns:

self – The dataset with standardized target variable.

Return type:

FeatureDataset

Example

>>> train_data = FeatureDataset("train.csv")
>>> train_data.normalize_y()
>>> test_data = FeatureDataset("test.csv")
>>> test_data.normalize_y(train_data.y_mean, train_data.y_scale)

class dlordinal.datasets.HCI(root: str | Path, transform: Callable | None = None, target_transform: Callable | None = None, is_valid_file: Callable[[str], bool] | None = None, train: bool = True)[source]

Historical Color Images (HCI) Decade Database dataset (Palermo et al. [2]).

This dataset contains colour photographs from five decades (1930s-1970s), organised for decade classification. Upon first use, the dataset is automatically downloaded, verified, preprocessed, and split into training and test subsets.

The preprocessing pipeline includes:

- verifying and downloading the dataset archive if necessary;
- extracting and normalising directory names according to class labels;
- resizing all images to 224x224 pixels;
- creating a stratified 70/30 train/test split;
- generating an MD5 checksum file for future integrity checks.
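
The integrity check mentioned in the pipeline can be sketched with the standard library. This is a generic MD5 verification, not the library's own routine; md5_of_file and matches_md5 are hypothetical helpers, and the chunk size is an arbitrary choice.

```python
import hashlib


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hypothetical helper: hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected: str) -> bool:
    """Hypothetical helper: compare a file's digest with an expected value."""
    return md5_of_file(path) == expected.lower()
```

A download step could call matches_md5 on the archive and fetch the file again when the check fails.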

Parameters:

root (str or Path) – Root directory where the dataset will be stored and processed.
transform (callable, optional) – A function/transform applied to each loaded PIL image.
target_transform (callable, optional) – A function/transform applied to the target label.
is_valid_file (callable, optional) – A function that takes a file path and returns True if the file should be included.
train (bool, default=True) – If True, loads the training split; otherwise, loads the test split.

URL

Download URL for the dataset archive.

Type:: str

MD5

MD5 checksum used to verify the downloaded archive.

Type:: str

CATEGORIES

Mapping from decade names to numeric class labels (as strings).

Type:: dict

base_root

Base directory for dataset storage and processing.

Type:: Path

train

Indicates whether the dataset instance is for training or testing.

Type:: bool

Example

>>> from dlordinal.datasets.hci import HCI
>>> dataset = HCI(root="data", train=True)
>>> img, label = dataset[0]

Notes

The train/test split is stratified by decade, with 70% of the images in the training set and 30% in the test set. Preprocessing is only performed the first time the dataset is initialised.
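
To make the idea of a stratified 70/30 split concrete, here is a minimal standard-library sketch. It is an illustration under assumptions, not the library's splitting code: stratified_split is a hypothetical helper, and the seeding and rounding choices are arbitrary.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Hypothetical helper: split sample indices so each class keeps
    roughly the same train/test proportion."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: ten images for each of the five decades.
decades = ["1930s", "1940s", "1950s", "1960s", "1970s"]
labels = [decade for decade in decades for _ in range(10)]
train_idx, test_idx = stratified_split(labels)
```

Splitting per class, rather than over the whole dataset at once, keeps every decade represented in both subsets with the same 70/30 proportion.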