Dataset loading utilities
- class dlordinal.datasets.Adience(root: str | Path, train: bool = True, ranges: list | tuple = ((0, 2), (4, 6), (8, 13), (15, 20), (25, 32), (38, 43), (48, 53), (60, 100)), test_size: float = 0.2, transform: Callable | None = None, target_transform: Callable | None = None, verbose: bool = False, download: bool = False, username: str | None = None, password: str | None = None)[source]
PyTorch dataset for the Adience age classification benchmark (Eidinger et al. [1]).
The Adience dataset contains unfiltered face images collected from Flickr albums and is commonly used for age and gender classification benchmarks.
- Parameters:
root (Union[str, Path]) –
Root directory where the dataset will be stored.
If download=False (default), the following files are expected to already exist inside the adience directory:
1. aligned.tar.gz: tar.gz archive containing the aligned face images.
2. folds: directory containing the official Adience fold files, fold_0_data.txt through fold_4_data.txt.
If download=True, these files are downloaded automatically from the official Adience website.
ranges (list, optional) – List of age ranges to use, by default [(0, 2), (4, 6), (8, 13), (15, 20), (25, 32), (38, 43), (48, 53), (60, 100)].
test_size (float, optional, default = 0.2) – Proportion of the dataset assigned to the test partition.
transform (Callable, optional) – A callable that takes in a PIL image and returns a transformed version.
target_transform (Callable, optional) – A callable that takes in the target and transforms it.
verbose (bool, optional, default = False) – Whether to print progress messages.
download (bool, optional, default = False) –
Whether to download the dataset automatically.
Downloading requires valid username and password credentials provided by the Adience dataset authors.
username (str, optional) – Username to download the dataset. If not provided, the dataset will not be downloaded and the files are expected to be already present in the root directory.
password (str, optional) – Password to download the dataset. If not provided, the dataset will not be downloaded and the files are expected to be already present in the root directory.
- root
Root directory where the Adience dataset is stored.
- Type:
Path
- train
Whether to use the training or test partition.
- Type:
bool
- transform
A callable that takes in a PIL image and returns a transformed version.
- Type:
Callable
- target_transform
A callable that takes in the target and transforms it.
- Type:
Callable
- verbose
Whether to print progress messages.
- Type:
bool
- data
List of image paths.
- Type:
list
- targets
Target of each sample in the dataset.
- Type:
list
- classes
Unique classes in the dataset.
- Type:
list
- download
Whether to download the dataset if it is not already present in the root directory. If False, the files are expected to be already present in the root directory.
- Type:
bool
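Example
The ranges parameter of Adience defines the ordinal classes as age intervals. The following is a minimal sketch of that idea, not the library's implementation: age_to_class is a hypothetical helper, and it assumes that both bounds of each range are inclusive and that ages falling between two ranges have no class.

```python
from typing import Optional, Sequence, Tuple

# Default age ranges taken from the Adience signature above.
DEFAULT_RANGES = (
    (0, 2), (4, 6), (8, 13), (15, 20),
    (25, 32), (38, 43), (48, 53), (60, 100),
)


def age_to_class(
    age: int, ranges: Sequence[Tuple[int, int]] = DEFAULT_RANGES
) -> Optional[int]:
    """Return the index of the range containing `age` (hypothetical helper).

    Assumption: bounds are inclusive; ages outside every range map to None.
    """
    for index, (low, high) in enumerate(ranges):
        if low <= age <= high:
            return index
    return None
```

Because the ranges are listed in increasing order, the returned index preserves the natural ordering of the age groups, which is the property an ordinal classifier relies on.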
- class dlordinal.datasets.FGNet(root: str | Path, download: bool = True, target_size: tuple = (128, 128), categories: list = [3, 11, 16, 24, 40], test_size: float = 0.2, validation_size: float = 0.15, train: bool = True, transform: Callable | None = None, target_transform: Callable | None = None)[source]
Base class for FGNet dataset.
- root
Root directory of the dataset.
- Type:
Path
- target_size
Size of the images after resizing.
- Type:
tuple
- categories
List of categories to be used.
- Type:
list
- test_size
Size of the test set.
- Type:
float
- validation_size
Size of the validation set.
- Type:
float
- transform
A function/transform that takes in a PIL image and returns a transformed version.
- Type:
callable, optional
- target_transform
A function/transform that takes in the target and transforms it.
- Type:
callable, optional
- data
Dataframe containing the dataset.
- Type:
pd.DataFrame
- Parameters:
root (str or Path) – Root directory of the dataset.
download (bool, optional, default = True) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
target_size (tuple, optional) – Size of the images after resizing. Default is (128, 128).
categories (list, optional) – List of categories to be used. Default is [3, 11, 16, 24, 40].
test_size (float, optional) – Size of the test set. Default is 0.2.
validation_size (float, optional) – Size of the validation set. Default is 0.15.
train (bool, optional) – If True, returns the training dataset, otherwise returns the test dataset. Default is True.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
- property classes: List[int]
Return the unique classes in the dataset.
- Returns:
List of unique classes.
- Return type:
list
- find_category(real_age)[source]
Find the category of the real age.
- Parameters:
real_age (int) – Real age of the image.
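Example
One plausible way to map a real age to an ordinal category with the default categories [3, 11, 16, 24, 40] is to count how many thresholds the age has reached. This is a sketch under that assumption only: the boundary handling of the library's find_category is not described here, and category_for_age is a hypothetical name.

```python
from bisect import bisect_right
from typing import Sequence

# Default thresholds taken from the FGNet signature above.
DEFAULT_CATEGORIES = (3, 11, 16, 24, 40)


def category_for_age(
    real_age: int, categories: Sequence[int] = DEFAULT_CATEGORIES
) -> int:
    """Return the number of thresholds that are <= real_age.

    Assumption: an age below the first threshold belongs to class 0 and an
    age at or above the last threshold belongs to class len(categories).
    """
    return bisect_right(categories, real_age)
```

With five thresholds this yields six ordered classes, numbered 0 to 5.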
- get_age_from_filename(filename)[source]
Get the age from the filename.
- Parameters:
filename (str) – Filename of the image.
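Example
FG-NET image files are commonly named with a subject identifier, the letter A, and the age, optionally followed by a letter suffix (for example 001A02.JPG). The sketch below parses the age under that naming assumption; age_from_filename is a hypothetical helper and may differ from what get_age_from_filename does internally.

```python
import re
from pathlib import Path

# Assumed pattern: <subject digits> A <age digits> [optional suffix]
_AGE_PATTERN = re.compile(r"^\d+A(\d+)", re.IGNORECASE)


def age_from_filename(filename: str) -> int:
    """Extract the age encoded in an FG-NET style filename."""
    stem = Path(filename).stem  # drop directories and the extension
    match = _AGE_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Cannot parse an age from {filename!r}")
    return int(match.group(1))
```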
- load_data(original_path: Path)[source]
Load the data from the original_path.
- Parameters:
original_path (Path) – Path to the original dataset.
- process(original_path, processed_path)[source]
Process the FGNet dataset and save it in the processed_path.
- Parameters:
original_path (Path) – Path to the original dataset.
processed_path (Path) – Path to save the processed dataset.
- process_images_from_df(df: DataFrame, original_path: Path, processed_path: Path)[source]
Process the images from the dataframe.
- Parameters:
df (pd.DataFrame) – Dataframe with the images.
original_path (Path) – Path to the original dataset.
processed_path (Path) – Path to save the processed dataset.
- split(original_csv_path: Path, train_csv_path: Path, test_csv_path: Path, original_images_path: Path, train_images_path: Path, test_images_path: Path)[source]
Split the FGNet dataset into train and test sets.
- Parameters:
original_csv_path (Path) – Path to the original csv file.
train_csv_path (Path) – Path to save the train csv file.
test_csv_path (Path) – Path to save the test csv file.
original_images_path (Path) – Path to the original images.
train_images_path (Path) – Path to save the train images.
test_images_path (Path) – Path to save the test images.
- split_dataframe(csv_path: Path, train_images_path: Path, original_images_path: Path, test_images_path: Path)[source]
Split the dataframe into train and test sets.
- Parameters:
csv_path (Path) – Path to the csv file.
train_images_path (Path) – Path to save the train images.
original_images_path (Path) – Path to the original images.
test_images_path (Path) – Path to save the test images.
- property targets: List[int]
Return the targets of the dataset.
- Returns:
List of targets.
- Return type:
list
- class dlordinal.datasets.FeatureDataset(filename)[source]
PyTorch dataset for tabular data stored in a CSV file, where each column holds one feature. The last column is the target variable.
Example
>>> from torch.utils.data import DataLoader
>>> from dlordinal.datasets import FeatureDataset
>>> # Compute the normalisation statistics on the training data
>>> train_data = FeatureDataset("train.csv")
>>> train_data.normalize_X()
>>> train_data.normalize_y()
>>> train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
>>> for X, y in train_loader:
...     print(X.shape, y.shape)
>>> # Reuse the training statistics for the test data
>>> test_data = FeatureDataset("test.csv")
>>> test_data.normalize_X(train_data.X_mean, train_data.X_scale)
>>> test_data.normalize_y(train_data.y_mean, train_data.y_scale)
>>> test_loader = DataLoader(test_data, batch_size=32, shuffle=False)
>>> for X, y in test_loader:
...     print(X.shape, y.shape)
- get_valid_shape_array(v: ArrayLike)[source]
Convert the input ArrayLike object to a 2D numpy array with shape (n, 1) if it is a 1D array.
- Parameters:
v (ArrayLike) – Input array.
- Returns:
v – 2D numpy array with shape (n, 1).
- Return type:
np.ndarray
- normalize_X(mean: ArrayLike | None = None, scale: ArrayLike | None = None)[source]
Standardize the features of the dataset. If mean and scale are not provided, they are computed from the dataset. If they are provided, they are used to standardize the dataset.
- Parameters:
mean (array-like, default=None) – Mean of the dataset.
scale (array-like, default=None) – Scale of the dataset.
- Returns:
self – The dataset with standardized features.
- Return type:
FeatureDataset
Example
>>> train_data = FeatureDataset("train.csv")
>>> train_data.normalize_X()
>>> test_data = FeatureDataset("test.csv")
>>> test_data.normalize_X(train_data.X_mean, train_data.X_scale)
- normalize_y(mean: ArrayLike = None, scale: ArrayLike = None)[source]
Standardize the target variable of the dataset. If mean and scale are not provided, they are computed from the dataset. If they are provided, they are used to standardize the dataset.
- Parameters:
mean (array-like, default=None) – Mean of the dataset.
scale (array-like, default=None) – Scale of the dataset.
- Returns:
self – The dataset with standardized target variable.
- Return type:
FeatureDataset
Example
>>> train_data = FeatureDataset("train.csv")
>>> train_data.normalize_y()
>>> test_data = FeatureDataset("test.csv")
>>> test_data.normalize_y(train_data.y_mean, train_data.y_scale)
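Example
The reason normalize_X and normalize_y accept an optional mean and scale is that test data should be standardized with the statistics of the training data. The following library-independent sketch illustrates that pattern on a single column; standardize is a hypothetical function, and it assumes that the scale is the population standard deviation, which may not match the estimator used by FeatureDataset.

```python
from statistics import fmean, pstdev
from typing import List, Optional, Sequence, Tuple


def standardize(
    values: Sequence[float],
    mean: Optional[float] = None,
    scale: Optional[float] = None,
) -> Tuple[List[float], float, float]:
    """Return (standardized values, mean used, scale used).

    If mean/scale are omitted they are computed from `values`; otherwise the
    supplied statistics are applied unchanged.
    """
    if mean is None:
        mean = fmean(values)
    if scale is None:
        # Fall back to 1.0 for constant columns to avoid division by zero.
        scale = pstdev(values) or 1.0
    return [(v - mean) / scale for v in values], mean, scale


# Fit the statistics on the training column, then reuse them for the test column.
train_scaled, train_mean, train_scale = standardize([1.0, 2.0, 3.0])
test_scaled, _, _ = standardize([2.0, 4.0], train_mean, train_scale)
```

Computing separate statistics on the test column would place the two partitions on different scales, so the training mean and scale are passed explicitly.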
- class dlordinal.datasets.HCI(root: str | Path, transform: Callable | None = None, target_transform: Callable | None = None, is_valid_file: Callable[[str], bool] | None = None, train: bool = True)[source]
Historical Color Images (HCI) Decade Database dataset (Palermo et al. [2]).
This dataset contains colour photographs from five decades (1930s-1970s), organised for decade classification. Upon first use, the dataset is automatically downloaded, verified, preprocessed, and split into training and test subsets.
The preprocessing pipeline includes:
- verifying and downloading the dataset archive if necessary;
- extracting and normalising directory names according to class labels;
- resizing all images to 224x224 pixels;
- creating a stratified 70/30 train/test split;
- generating an MD5 checksum file for future integrity checks.
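The archive verification step can be sketched with the standard library alone. This is an illustration of MD5-based integrity checking in general, not the code used by HCI; file_md5 and archive_is_valid are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a large archive is never loaded fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_valid(path: Path, expected_md5: str) -> bool:
    """Return True if the file exists and its MD5 matches `expected_md5`."""
    path = Path(path)
    return path.is_file() and file_md5(path) == expected_md5.lower()
```

A loader following this pattern would download the archive again whenever archive_is_valid returns False.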
- Parameters:
root (str or Path) – Root directory where the dataset will be stored and processed.
transform (callable, optional) – A function/transform applied to each loaded PIL image.
target_transform (callable, optional) – A function/transform applied to the target label.
is_valid_file (callable, optional) – A function that takes a file path and returns True if the file should be included.
train (bool, default=True) – If True, loads the training split; otherwise, loads the test split.
- URL
Download URL for the dataset archive.
- Type:
str
- MD5
MD5 checksum used to verify the downloaded archive.
- Type:
str
- CATEGORIES
Mapping from decade names to numeric class labels (as strings).
- Type:
dict
- base_root
Base directory for dataset storage and processing.
- Type:
Path
- train
Indicates whether the dataset instance is for training or testing.
- Type:
bool
Example
>>> from dlordinal.datasets.hci import HCI
>>> dataset = HCI(root="data", train=True)
>>> img, label = dataset[0]
Notes
The train/test split is stratified by decade, with 70% of the images in the training set and 30% in the test set. Preprocessing is only performed the first time the dataset is initialised.
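A stratified 70/30 split keeps the proportion of each decade roughly the same in both partitions. The sketch below shows one way to build such a split over sample indices using only the standard library; stratified_split, its seed argument, and the rounding rule are assumptions made for illustration and are not taken from the HCI implementation.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split sample indices into train/test, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # local generator, so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)
```

For instance, with ten images labelled "1930s" and ten labelled "1940s", each decade contributes seven indices to the training list and three to the test list.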