Utils

This section provides documentation for utility functions in the hwm.utils module. These functions offer essential support for data manipulation, preprocessing, and other common tasks required during model development and evaluation.

activator

activator(z, activation='sigmoid', alpha=1.0, clipping_threshold=250)

Apply the specified activation function to the input array z [1].

Parameters

z : array-like

Input array to which the activation function is applied.

activation : str or callable, default='sigmoid'

The activation function to apply. Supported activation functions are: ‘sigmoid’, ‘relu’, ‘leaky_relu’, ‘identity’, ‘elu’, ‘tanh’, ‘softmax’. If a callable is provided, it should take z as input and return the transformed output.

alpha : float, default=1.0

The alpha value for activation functions that use it (e.g., ELU).

clipping_threshold : int, default=250

Threshold value to clip the input z to avoid overflow in activation functions like ‘sigmoid’ and ‘softmax’.
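The effect of clipping can be illustrated with a minimal NumPy sketch. It assumes the threshold is applied symmetrically with np.clip before exponentiation, which the text above does not spell out; the helper name clipped_sigmoid is hypothetical and is not part of hwm.utils.

```python
import numpy as np

def clipped_sigmoid(z, clipping_threshold=250):
    # Assumption: inputs are clipped to [-threshold, threshold] before exp().
    z = np.clip(np.asarray(z, dtype=float), -clipping_threshold, clipping_threshold)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-1000.0, 0.0, 1000.0])

# Without clipping, exp(1000) overflows to inf (NumPy would normally warn),
# so the sigmoid of -1000 collapses to exactly 0.
with np.errstate(over="ignore"):
    raw = 1.0 / (1.0 + np.exp(-z))

clipped = clipped_sigmoid(z)
print(raw[0])          # 0.0
print(clipped[0] > 0)  # True: exp(250) still fits in float64
print(clipped[1])      # 0.5
```

Clipping trades a small distortion for very large inputs against the guarantee that every intermediate value stays finite.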

Returns

activation_output : array-like

The output array after applying the activation function.

Notes

The available activation functions are defined as follows:

  • Sigmoid: \(\sigma(z) = \frac{1}{1 + \exp(-z)}\)

  • ReLU: \(\text{ReLU}(z) = \max(0, z)\)

  • Leaky ReLU: \(\text{Leaky ReLU}(z) = \max(0.01z, z)\)

  • Identity: \(\text{Identity}(z) = z\)

  • ELU: \(\text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha (\exp(z) - 1) & \text{if } z \leq 0 \end{cases}\)

  • Tanh: \(\tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}\)

  • Softmax: \(\text{Softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j} \exp(z_j)}\)
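As a rough illustration of how these definitions translate into array code, the sketch below implements a few of them with NumPy and dispatches on either a name or a callable. The names ACTIVATIONS and apply_activation are hypothetical; this is not the library's implementation, and sigmoid and softmax are left out for brevity.

```python
import numpy as np

def relu(z, alpha):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha):
    # Fixed slope of 0.01, as in the formula above
    return np.maximum(0.01 * z, z)

def elu(z, alpha):
    # exp() is only evaluated on the non-positive part to avoid overflow
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

# Hypothetical lookup table mirroring the formulas listed above
ACTIVATIONS = {
    "relu": relu,
    "leaky_relu": leaky_relu,
    "elu": elu,
    "identity": lambda z, alpha: z,
    "tanh": lambda z, alpha: np.tanh(z),
}

def apply_activation(z, activation="relu", alpha=1.0):
    z = np.asarray(z, dtype=float)
    if callable(activation):
        # A user-supplied callable receives z and returns the result
        return activation(z)
    if activation not in ACTIVATIONS:
        raise ValueError(f"Unsupported activation: {activation!r}")
    return ACTIVATIONS[activation](z, alpha)

z = np.array([1.0, 2.0, -1.0, -2.0])
print(apply_activation(z, "relu"))  # [1. 2. 0. 0.]
```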

Examples

The following examples demonstrate how to use the activator function with different activation functions and configurations.

Basic Example:

Applying ReLU activation to a simple array.

import numpy as np
from hwm.utils import activator

# Input array
z = np.array([1.0, 2.0, -1.0, -2.0])

# Apply ReLU activation
output = activator(z, activation='relu')
print(output)
# Output: [1. 2. 0. 0.]

Advanced Example:

Using softmax activation on a multi-class array.

import numpy as np
from hwm.utils import activator

# Input array for softmax
z = np.array([2.0, 1.0, 0.1])

# Apply softmax activation
output = activator(z, activation='softmax')
print(output)
# Output: [0.65900114 0.24243297 0.09856589]

Custom Callable Example:

Using a custom activation function.

import numpy as np
from hwm.utils import activator

# Define a custom activation function
def custom_activation(x):
    return np.sqrt(np.abs(x)) * np.sign(x)

# Input array
z = np.array([4, -9, 16, -25])

# Apply custom activation
output = activator(z, activation=custom_activation)
print(output)
# Output: [ 2. -3.  4. -5.]

resample_data

resample_data(*data, samples=1, replace=False, random_state=None, shuffle=True)

Resample multiple data structures (arrays, sparse matrices, Series, DataFrames) based on specified sample size or ratio [4].

Parameters

data

Variable number of array-like, sparse matrix, pandas Series, or DataFrame objects to be resampled.

samples

Specifies the number of items to sample from each data structure.

  • If an integer greater than 1, it is treated as the exact number of items to sample.

  • If a float between 0 and 1, it is treated as a ratio of the total number of rows to sample.

  • If a string containing a percentage (e.g., “50%”), the sample size is calculated as that percentage of the total data length.

Default is 1, meaning no resampling is performed unless a different value is specified.

replace

Determines if sampling with replacement is allowed, enabling the same row to be sampled multiple times. Default is False.

random_state

Sets the seed for the random number generator to ensure reproducibility. If specified, repeated calls with the same parameters will yield identical results. Default is None.

shuffle

If True, shuffles the data before sampling. Otherwise, rows are selected sequentially without shuffling. Default is True.
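One way to picture how replace, shuffle, and random_state interact is to draw a single set of row indices and apply it to every input, so that rows stay aligned across structures. The sketch below uses NumPy only; sample_rows is a hypothetical helper, and whether resample_data shares indices in exactly this way is an assumption.

```python
import numpy as np

def sample_rows(*arrays, n, replace=False, shuffle=True, random_state=None):
    """Hypothetical sketch: draw one index set and apply it to every array."""
    n_rows = arrays[0].shape[0]
    rng = np.random.default_rng(random_state)
    if shuffle:
        # Random rows; duplicates are possible only when replace=True
        idx = rng.choice(n_rows, size=n, replace=replace)
    else:
        # Sequential selection: take the first n rows in order
        idx = np.arange(min(n, n_rows))
    return [a[idx] for a in arrays]

X = np.arange(100).reshape(20, 5)
y = np.arange(20)

X_s, y_s = sample_rows(X, y, n=10, random_state=42)
print(X_s.shape, y_s.shape)  # (10, 5) (10,)
```

Because both outputs are indexed with the same idx, row i of X_s still corresponds to element i of y_s.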

Returns

List[Any]

A list of resampled data structures, each in the original format (e.g., numpy array, sparse matrix, pandas DataFrame) and with the specified sample size.

Methods

  • _determine_sample_size: Calculates the sample size based on the samples parameter.

  • _perform_sampling: Conducts the sampling process based on the calculated sample size, replace, and shuffle parameters.

Notes

  • If samples is given as a percentage string (e.g., “25%”), the actual number of rows to sample, \(n\), is calculated as:

    \[n = \left(\frac{\text{percentage}}{100}\right) \times N\]

    where \(N\) is the total number of rows in the data structure.

  • Resampling supports both dense and sparse matrices. If the input contains sparse matrices stored within numpy objects, the function extracts and samples them directly.
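The sample-size rules can be sketched as a small pure-Python function. The name sample_size_from_spec is hypothetical, and two details are assumptions rather than documented behavior: fractional results are truncated with int(), and samples=1 is read as "keep every row".

```python
def sample_size_from_spec(samples, n_rows):
    """Hypothetical helper: translate a `samples` spec into a row count."""
    if isinstance(samples, str):
        text = samples.strip()
        if not text.endswith("%"):
            raise ValueError(f"Expected a percentage string such as '25%', got {samples!r}")
        # n = (percentage / 100) * N
        return int(float(text[:-1]) / 100.0 * n_rows)
    if samples == 1:
        # Assumption: the default of 1 means "no resampling", i.e. all rows
        return n_rows
    if isinstance(samples, float) and 0.0 < samples < 1.0:
        # Ratio of the total number of rows
        return int(samples * n_rows)
    # Integer greater than 1: exact number of rows
    return int(samples)

print(sample_size_from_spec(10, 20))     # 10
print(sample_size_from_spec(0.5, 20))    # 10
print(sample_size_from_spec("25%", 20))  # 5
```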

Examples

The following examples demonstrate how to use the resample_data function to resample different data structures with various configurations.

Basic Example:

Resampling a NumPy array by selecting 10 items with replacement.

from hwm.utils import resample_data
import numpy as np

# Original data array
data = np.arange(100).reshape(20, 5)

# Resample 10 items with replacement
resampled_data = resample_data(data, samples=10, replace=True)
print(resampled_data[0].shape)
# Output: (10, 5)

Resampling by Ratio:

Resampling 50% of the rows from a NumPy array without replacement.

from hwm.utils import resample_data
import numpy as np

# Original data array
data = np.arange(100).reshape(20, 5)

# Resample 50% of the rows
resampled_data = resample_data(data, samples=0.5, random_state=42)
print(resampled_data[0].shape)
# Output: (10, 5)

Resampling with Percentage:

Resampling 25% of the rows from a NumPy array using a percentage string.

from hwm.utils import resample_data
import numpy as np

# Original data array
data = np.arange(100).reshape(20, 5)

# Resample 25% of the rows
resampled_data = resample_data(data, samples="25%", random_state=42)
print(resampled_data[0].shape)
# Output: (5, 5)

Multiple Data Structures:

Resampling multiple data structures simultaneously.

from hwm.utils import resample_data
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Original data structures
array = np.arange(100).reshape(20, 5)
dataframe = pd.DataFrame(array, columns=['A', 'B', 'C', 'D', 'E'])
sparse_matrix = sp.csr_matrix(array)

# Resample 10 items from each data structure
resampled_array, resampled_df, resampled_sparse = resample_data(
    array, dataframe, sparse_matrix, samples=10, replace=True, random_state=42
)

print(resampled_array.shape)
# Output: (10, 5)
print(resampled_df.shape)
# Output: (10, 5)
print(resampled_sparse.shape)
# Output: (10, 5)

See Also

numpy.random.choice() : Selects random samples from an array.

pandas.DataFrame.sample() : Randomly samples rows from a DataFrame.

add_noises_to

add_noises_to(data, noise=0.1, seed=None, gaussian_noise=False, cat_missing_value=pd.NA)

Adds missing values (NaN or a specified placeholder) or Gaussian noise to a pandas DataFrame [4].

Parameters

data

The DataFrame to which NaN values or specified missing values will be added.

noise

The fraction of values in each column to replace with NaN or the specified missing value, given as a number between 0 and 1. When gaussian_noise is True, this value is instead used as the standard deviation of the added noise. Default is 0.1 (10%).

seed

Seed for random number generator to ensure reproducibility. If seed is an int, array-like, or BitGenerator, it will be used to seed the random number generator. If seed is a np.random.RandomState or np.random.Generator, it will be used as given.

gaussian_noise

If True, adds Gaussian noise to the data. Otherwise, replaces values with NaN or the specified missing value. Default is False.

cat_missing_value

The value to use for missing data in categorical columns. By default, pd.NA is used.

Returns

pandas.DataFrame

A DataFrame with NaN or specified missing values added.

Notes

The function modifies the DataFrame by either adding Gaussian noise to numerical columns or replacing a percentage of values in each column with NaN or a specified missing value.

The Gaussian noise is added according to the formula:

\[\text{new\_value} = \text{original\_value} + \mathcal{N}(0, \text{noise})\]

where \(\mathcal{N}(0, \text{noise})\) represents a normal distribution with mean 0 and standard deviation equal to noise.
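Both modes can be sketched with NumPy alone. The snippet below is an illustration under assumptions, not the library's code: it draws the noise with numpy.random.Generator.normal, and for the missing-value mode it truncates noise * n_rows to an integer, a rounding rule the text above does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
noise = 0.1

# Gaussian mode: new_value = original_value + N(0, noise),
# with noise used as the standard deviation.
original = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])
noisy = original + rng.normal(loc=0.0, scale=noise, size=original.shape)

# Missing-value mode for one numeric column: blank out a fraction of entries.
column = np.arange(10, dtype=float)
n_missing = int(noise * column.size)  # assumption: truncate to an integer
missing_idx = rng.choice(column.size, size=n_missing, replace=False)
column[missing_idx] = np.nan

print(noisy.shape)                  # (3, 2)
print(int(np.isnan(column).sum()))  # 1
```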

Examples

The following examples demonstrate how to use the add_noises_to function to add missing values or Gaussian noise to a DataFrame.

Adding Missing Values:

from hwm.utils import add_noises_to
import pandas as pd

# Original DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

# Replace a fraction of 0.2 of the values with missing values
new_df = add_noises_to(df, noise=0.2)
print(new_df)
# Example output (the affected cells vary between runs because no seed is set):
#      A     B
# 0  1.0  <NA>
# 1  NaN     y
# 2  3.0  <NA>

Adding Gaussian Noise:

from hwm.utils import add_noises_to
import pandas as pd

# Original DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Add Gaussian noise with standard deviation 0.1
new_df = add_noises_to(df, noise=0.1, gaussian_noise=True)
print(new_df)
# Example output (values vary between runs because no seed is set):
#           A         B
# 0  1.063292  3.986400
# 1  2.103962  4.984292
# 2  2.856601  6.017380

See Also

pandas.DataFrame : Two-dimensional, size-mutable, potentially heterogeneous tabular data.

numpy.random.normal() : Draw random samples from a normal (Gaussian) distribution.

gen_X_y_batches

gen_X_y_batches(X, y, *, batch_size='auto', n_samples=None, min_batch_size=0, shuffle=True, random_state=None, return_batches=False, default_size=200)

Generate batches of data (X, y) for machine learning tasks such as training or evaluation [2]. This function slices the dataset into smaller batches, optionally shuffles the data, and returns them as a list of tuples or just the data batches [6].

Parameters

X

The input data matrix, where each row is a sample and each column represents a feature. Must be an ndarray of shape (n_samples, n_features).

y

The target variable(s) corresponding to X. Can be a vector or matrix depending on the problem (single or multi-output). Must be an ndarray of shape (n_samples,) or (n_samples, n_targets).

batch_size

The number of samples per batch. If set to “auto”, it uses the minimum between default_size and the number of samples, n_samples. Default is “auto”.

n_samples

The total number of samples to consider. If None, the function defaults to using the number of samples in X. Default is None.

min_batch_size

The minimum size for each batch. This parameter ensures that the final batch contains at least min_batch_size samples. If the last batch is smaller than min_batch_size, it will be excluded from the result. Default is 0.

shuffle

If True, the data is shuffled before batching. This helps avoid bias when splitting data for training and validation. Default is True.

random_state

The seed used by the random number generator for reproducibility. If None, the random number generator uses the system time or entropy source. Default is None.

return_batches

If True, the function returns both the data batches and the slice objects used to index into X and y. If False, only the data batches are returned. Default is False.

default_size

The default batch size used when batch_size="auto" is selected. Default is 200.

Returns

list of tuples

A list of tuples where each tuple contains a batch of X and its corresponding batch of y.

list of slice objects, optional

If return_batches=True, this list of slice objects is returned, each representing the slice of X and y used for a specific batch.

Notes

  • This function ensures that no empty batches are returned. If a batch contains zero samples (either from improper slicing or due to min_batch_size), it will be excluded.

  • The function performs shuffling using scikit-learn’s shuffle function, which is more stable and reduces memory usage by shuffling indices rather than the whole dataset.

  • The function utilizes the gen_batches utility to divide the data into batches.
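The batching flow described above (resolve the batch size, shuffle X and y together, cut the data into slices, skip empty slices) can be sketched with NumPy. The helper batch_slices is a hypothetical stand-in for a slice generator and does not model min_batch_size; it is not the library's code.

```python
import numpy as np

def batch_slices(n_samples, batch_size="auto", default_size=200):
    """Hypothetical stand-in for a slice generator; not the library's code."""
    if n_samples == 0:
        return []
    if batch_size == "auto":
        # "auto" resolves to the smaller of default_size and n_samples
        batch_size = min(default_size, n_samples)
    return [
        slice(start, min(start + batch_size, n_samples))
        for start in range(0, n_samples, batch_size)
    ]

rng = np.random.default_rng(0)
X = rng.random((2000, 5))
y = rng.integers(0, 2, size=2000)

# Shuffle X and y with the same permutation so rows and labels stay paired
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

slices = batch_slices(len(X), batch_size=500)
batches = [(X[s], y[s]) for s in slices if s.stop > s.start]  # skip empty slices

print(len(batches))               # 4
print(len(batch_slices(len(X))))  # 10
```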

Examples

The following examples demonstrate how to use the gen_X_y_batches function to generate data batches for machine learning tasks.

Basic Example:

Generating batches of size 500 from random data.

from hwm.utils import gen_X_y_batches
import numpy as np

# Generate random input data and binary targets
X = np.random.rand(2000, 5)
y = np.random.randint(0, 2, size=(2000,))

# Create batches of size 500 with shuffling
batches = gen_X_y_batches(X, y, batch_size=500, shuffle=True)
print(len(batches))
# Output: 4

Returning Batch Slices:

Generating batches and obtaining slice objects for indexing.

from hwm.utils import gen_X_y_batches
import numpy as np

# Generate random input data and binary targets
X = np.random.rand(2000, 5)
y = np.random.randint(0, 2, size=(2000,))

# Create batches of size 500 and return batch slices
batches, slices = gen_X_y_batches(
    X, y, batch_size=500, shuffle=True, return_batches=True
)
print(len(batches))
# Output: 4
print(len(slices))
# Output: 4

Handling Minimum Batch Size:

Ensuring that the final batch meets the minimum batch size requirement.

from hwm.utils import gen_X_y_batches
import numpy as np

# Generate random input data and binary targets
X = np.random.rand(1025, 5)
y = np.random.randint(0, 2, size=(1025,))

# Create batches with batch_size=500 and min_batch_size=25
batches = gen_X_y_batches(
    X, y, batch_size=500, min_batch_size=25, shuffle=True
)
print(len(batches))
# Output: 2
for batch in batches:
    print(batch[0].shape, batch[1].shape)
# Output:
# (500, 5) (500,)
# (525, 5) (525,)

See Also

gen_batches() : A utility function that generates slices of data.

shuffle() : A utility to shuffle data while keeping the data and labels in sync.

ensure_non_empty_batch

ensure_non_empty_batch(X, y, *, batch_slice, max_attempts=10, random_state=None, error='raise')

Shuffle the dataset (X, y) until the specified batch_slice yields a non-empty batch. This function ensures that the batch extracted using batch_slice contains at least one sample by repeatedly shuffling the data and reapplying the slice [5].

Parameters

X

The input data matrix, where each row corresponds to a sample and each column corresponds to a feature. Must be an ndarray of shape (n_samples, n_features).

y

The target variable(s) corresponding to X. It can be a one-dimensional array for single-output tasks or a two-dimensional array for multi-output tasks. Must be an ndarray of shape (n_samples,) or (n_samples, n_targets).

batch_slice

A slice object representing the indices for the batch. For example, slice(0, 512) would extract the first 512 samples from X and y.

max_attempts

The maximum number of attempts to shuffle the data to obtain a non-empty batch. If the batch remains empty after the specified number of attempts, a ValueError is raised. Default is 10.

random_state

Controls the randomness of the shuffling. Pass an integer for reproducible results across multiple function calls. If None, the random number generator is the RandomState instance used by np.random. Default is None.

error

Controls what happens when the batch is still empty after max_attempts. Expected values are “raise”, “warn”, or “ignore”. With “raise”, a ValueError is raised. With “warn”, the error is converted into a warning message. Any other value silently ignores the error. Default is “raise”.

Returns

X_batch : ndarray

The batch of input data extracted using batch_slice. Ensures that X_batch is not empty.

y_batch : ndarray

The batch of target data corresponding to X_batch, extracted using batch_slice. Ensures that y_batch is not empty.

Raises

ValueError

If a non-empty batch cannot be obtained after max_attempts shuffles.
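The retry behavior described in this section can be sketched as a short loop. The function retry_until_non_empty is hypothetical and only mirrors the documented parameters; the error message is taken from the example output shown in this section, and the actual implementation may differ.

```python
import warnings
import numpy as np

def retry_until_non_empty(X, y, batch_slice, max_attempts=10,
                          random_state=None, error="raise"):
    """Hypothetical sketch of the retry logic; not the library's code."""
    rng = np.random.default_rng(random_state)
    for _ in range(max_attempts):
        X_batch, y_batch = X[batch_slice], y[batch_slice]
        if len(X_batch) > 0:
            return X_batch, y_batch
        # Reshuffle rows and labels together, then try the slice again
        perm = rng.permutation(len(X))
        X, y = X[perm], y[perm]
    msg = f"Unable to obtain a non-empty batch after {max_attempts} attempts."
    if error == "raise":
        raise ValueError(msg)
    if error == "warn":
        warnings.warn(msg)
    # "warn" and any other value fall through and return the empty slice
    return X[batch_slice], y[batch_slice]

X = np.zeros((2000, 5))
y = np.zeros(2000, dtype=int)
X_batch, y_batch = retry_until_non_empty(X, y, slice(0, 512))
print(X_batch.shape, y_batch.shape)  # (512, 5) (512,)
```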

Examples

The following examples demonstrate how to use the ensure_non_empty_batch function to guarantee that a batch contains data.

Basic Example:

Ensuring a non-empty batch from random data.

from hwm.utils import ensure_non_empty_batch
import numpy as np

# Generate random input data and binary targets
X = np.random.rand(2000, 5)
y = np.random.randint(0, 2, size=(2000,))
batch_slice = slice(0, 512)

# Ensure the batch is non-empty
X_batch, y_batch = ensure_non_empty_batch(
    X, y, batch_slice=batch_slice
)
print(X_batch.shape)
# Output: (512, 5)
print(y_batch.shape)
# Output: (512,)

Handling Empty Batches:

Attempting to extract a batch from empty data, which raises a ValueError.

from hwm.utils import ensure_non_empty_batch
import numpy as np

# Empty input data
X_empty = np.empty((0, 5))
y_empty = np.empty((0,))
batch_slice = slice(0, 512)

# Attempt to ensure a non-empty batch
try:
    X_batch, y_batch = ensure_non_empty_batch(
        X_empty, y_empty, batch_slice=batch_slice
    )
except ValueError as e:
    print(e)
    # Output: Unable to obtain a non-empty batch after 10 attempts.

Using with Different Error Handling:

Converting the error into a warning with error="warn" and receiving the empty batch when a non-empty batch cannot be obtained.

from hwm.utils import ensure_non_empty_batch
import numpy as np

# Empty input data
X_empty = np.empty((0, 5))
y_empty = np.empty((0,))
batch_slice = slice(0, 512)

# Attempt to ensure a non-empty batch with warning
X_batch, y_batch = ensure_non_empty_batch(
    X_empty, y_empty, batch_slice=batch_slice, error="warn"
)
print(X_batch.shape, y_batch.shape)
# Output: (0, 5) (0,)

See Also

gen_batches() : Generate slice objects to divide data into batches.

shuffle() : Shuffle arrays or sparse matrices in a consistent way.

References