Utils¶
This section provides documentation for utility functions in the
hwm.utils module. These functions offer essential support for
data manipulation, preprocessing, and other common tasks required
during model development and evaluation.
activator¶
- activator(z, activation='sigmoid', alpha=1.0, clipping_threshold=250)¶
Apply the specified activation function to the input array z [1].
Parameters¶
- z : array-like
Input array to which the activation function is applied.
- activation : str or callable, default='sigmoid'
The activation function to apply. Supported activation functions are: 'sigmoid', 'relu', 'leaky_relu', 'identity', 'elu', 'tanh', 'softmax'. If a callable is provided, it should take z as input and return the transformed output.
- alpha : float, default=1.0
The alpha value for activation functions that use it (e.g., ELU).
- clipping_threshold : int, default=250
Threshold value to clip the input z to avoid overflow in activation functions like 'sigmoid' and 'softmax'.
Returns¶
- activation_output : array-like
The output array after applying the activation function.
Notes¶
The available activation functions are defined as follows:
Sigmoid: \(\sigma(z) = \frac{1}{1 + \exp(-z)}\)
ReLU: \(\text{ReLU}(z) = \max(0, z)\)
Leaky ReLU: \(\text{Leaky ReLU}(z) = \max(0.01z, z)\)
Identity: \(\text{Identity}(z) = z\)
ELU: \(\text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha (\exp(z) - 1) & \text{if } z \leq 0 \end{cases}\)
Tanh: \(\tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}\)
Softmax: \(\text{Softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j} \exp(z_j)}\)
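The formulas above can be sketched directly in NumPy. This is an illustrative re-implementation, not the library's code: the name `sketch_activation` is hypothetical, the softmax is taken over the whole 1-D input, and subtracting the maximum before exponentiating is an extra stability step assumed here in addition to the clipping that `clipping_threshold` describes.

```python
import numpy as np

def sketch_activation(z, activation="sigmoid", alpha=1.0, clipping_threshold=250):
    """Illustrative version of the formulas above (hypothetical, not hwm's code)."""
    z = np.asarray(z, dtype=float)
    # Clip before exponentiating so that exp() cannot overflow.
    zc = np.clip(z, -clipping_threshold, clipping_threshold)
    if activation == "sigmoid":
        return 1.0 / (1.0 + np.exp(-zc))
    if activation == "relu":
        return np.maximum(0.0, z)
    if activation == "leaky_relu":
        return np.maximum(0.01 * z, z)
    if activation == "identity":
        return z
    if activation == "elu":
        return np.where(z > 0, z, alpha * (np.exp(zc) - 1.0))
    if activation == "tanh":
        return np.tanh(z)
    if activation == "softmax":
        # Assumption: shift by the maximum for numerical stability.
        e = np.exp(zc - np.max(zc))
        return e / e.sum()
    raise ValueError(f"Unsupported activation: {activation!r}")

print(sketch_activation([1.0, 2.0, -1.0, -2.0], "relu"))
print(sketch_activation([2.0, 1.0, 0.1], "softmax"))
```

Clipping matters mainly for the exponential-based functions: without it, a large negative input to the sigmoid would evaluate `exp` on a large positive number and overflow.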
Examples¶
The following examples demonstrate how to use the activator function with different activation functions and configurations.
Basic Example:
Applying ReLU activation to a simple array.
import numpy as np
from hwm.utils import activator

# Input array
z = np.array([1.0, 2.0, -1.0, -2.0])

# Apply ReLU activation
output = activator(z, activation='relu')
print(output)
# Output: [1. 2. 0. 0.]
Advanced Example:
Using softmax activation on a multi-class array.
import numpy as np
from hwm.utils import activator

# Input array for softmax
z = np.array([2.0, 1.0, 0.1])

# Apply softmax activation
output = activator(z, activation='softmax')
print(output)
# Output: [0.65900114 0.24243297 0.09856589]
Custom Callable Example:
Using a custom activation function.
import numpy as np
from hwm.utils import activator

# Define a custom activation function
def custom_activation(x):
    return np.sqrt(np.abs(x)) * np.sign(x)

# Input array
z = np.array([4, -9, 16, -25])

# Apply custom activation
output = activator(z, activation=custom_activation)
print(output)
# Output: [ 2. -3.  4. -5.]
resample_data¶
- resample_data(*data, samples=1, replace=False, random_state=None, shuffle=True)¶
Resample multiple data structures (arrays, sparse matrices, Series, DataFrames) based on a specified sample size or ratio [4].
Parameters¶
| Parameter | Description |
|---|---|
| data | Variable number of array-like, sparse matrix, pandas Series, or DataFrame objects to be resampled. |
| samples | Specifies the number of items to sample from each data structure. If an integer greater than 1, it is treated as the exact number of items to sample. A float ratio (e.g., 0.5) or a percentage string (e.g., "25%") is also accepted, as shown in the examples below. Default is 1, meaning no resampling is performed unless a different value is specified. |
| replace | Determines if sampling with replacement is allowed, enabling the same row to be sampled multiple times. Default is False. |
| random_state | Sets the seed for the random number generator to ensure reproducibility. If specified, repeated calls with the same parameters will yield identical results. Default is None. |
| shuffle | If True, shuffles the data before sampling. Otherwise, rows are selected sequentially without shuffling. Default is True. |
Returns¶
- List[Any]
A list of resampled data structures, each in the original format (e.g., numpy array, sparse matrix, pandas DataFrame) and with the specified sample size.
Methods¶
_determine_sample_size: Calculates the sample size based on the samples parameter.
_perform_sampling: Conducts the sampling process based on the calculated sample size, replace, and shuffle parameters.
Notes¶
If samples is given as a percentage string (e.g., “25%”), the actual number of rows to sample, \(n\), is calculated as:
\[n = \left(\frac{\text{percentage}}{100}\right) \times N\]
where \(N\) is the total number of rows in the data structure.
Resampling supports both dense and sparse matrices. If the input contains sparse matrices stored within numpy objects, the function extracts and samples them directly.
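As a rough illustration of the sample-size calculation described above, the sketch below resolves a `samples` value to a row count. The function name `sketch_sample_size` is hypothetical, and two behaviours are assumptions rather than documented facts: fractional results are truncated toward zero, and a float strictly between 0 and 1 is treated as a ratio of the total rows.

```python
def sketch_sample_size(samples, n_rows):
    """Resolve `samples` to a number of rows (illustrative, not hwm's code)."""
    # Percentage string such as "25%": n = (percentage / 100) * N.
    if isinstance(samples, str) and samples.strip().endswith("%"):
        fraction = float(samples.strip().rstrip("%")) / 100.0
        return int(fraction * n_rows)  # assumption: truncate toward zero
    # Assumption: a float in (0, 1) is a ratio of the total number of rows.
    if isinstance(samples, float) and 0.0 < samples < 1.0:
        return int(samples * n_rows)
    # Otherwise treat the value as an exact count.
    return int(samples)

print(sketch_sample_size("25%", 20))  # 5
print(sketch_sample_size(0.5, 20))    # 10
print(sketch_sample_size(10, 20))     # 10
```

These three calls correspond to the percentage, ratio, and exact-count cases used in the examples for a 20-row array.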
Examples¶
The following examples demonstrate how to use the resample_data function to resample different data structures with various configurations.
Basic Example:
Resampling a NumPy array by selecting 10 items with replacement.
from hwm.utils import resample_data
import numpy as np

# Original data array
data = np.arange(100).reshape(20, 5)

# Resample 10 items with replacement
resampled_data = resample_data(data, samples=10, replace=True)
print(resampled_data[0].shape)
# Output: (10, 5)
Resampling by Ratio:
Resampling 50% of the rows from a NumPy array without replacement.
from hwm.utils import resample_data
import numpy as np

# Original data array
data = np.arange(100).reshape(20, 5)

# Resample 50% of the rows
resampled_data = resample_data(data, samples=0.5, random_state=42)
print(resampled_data[0].shape)
# Output: (10, 5)
Resampling with Percentage:
Resampling 25% of the rows from a NumPy array using a percentage string.
from hwm.utils import resample_data
import numpy as np

# Original data array
data = np.arange(100).reshape(20, 5)

# Resample 25% of the rows
resampled_data = resample_data(data, samples="25%", random_state=42)
print(resampled_data[0].shape)
# Output: (5, 5)
Multiple Data Structures:
Resampling multiple data structures simultaneously.
from hwm.utils import resample_data
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Original data structures
array = np.arange(100).reshape(20, 5)
dataframe = pd.DataFrame(array, columns=['A', 'B', 'C', 'D', 'E'])
sparse_matrix = sp.csr_matrix(array)

# Resample 10 items from each data structure
resampled_array, resampled_df, resampled_sparse = resample_data(
    array, dataframe, sparse_matrix, samples=10, replace=True, random_state=42
)

print(resampled_array.shape)
# Output: (10, 5)
print(resampled_df.shape)
# Output: (10, 5)
print(resampled_sparse.shape)
# Output: (10, 5)
See Also¶
numpy.random.choice() : Selects random samples from an array.
pandas.DataFrame.sample() : Randomly samples rows from a DataFrame.
add_noises_to¶
- add_noises_to(data, noise=0.1, seed=None, gaussian_noise=False, cat_missing_value=pd.NA)¶
Adds NaN or specified missing values to a pandas DataFrame [4].
Parameters¶
| Parameter | Description |
|---|---|
| data | The DataFrame to which NaN values or specified missing values will be added. |
| noise | The fraction of values to be replaced with NaN or the specified missing value in each column. This must be a number between 0 and 1. Default is 0.1 (10%). |
| seed | Seed for the random number generator to ensure reproducibility. If seed is an int, array-like, or BitGenerator, it will be used to seed the random number generator. If seed is a np.random.RandomState or np.random.Generator, it will be used as given. |
| gaussian_noise | If True, adds Gaussian noise to the data. Otherwise, replaces values with NaN or the specified missing value. Default is False. |
| cat_missing_value | The value to use for missing data in categorical columns. By default, pd.NA is used. |
Returns¶
- pandas.DataFrame
A DataFrame with NaN or specified missing values added.
Notes¶
The function modifies the DataFrame by either adding Gaussian noise to numerical columns or replacing a percentage of values in each column with NaN or a specified missing value.
The Gaussian noise is added according to the formula:
\[X_{\text{noisy}} = X + \mathcal{N}(0, \text{noise})\]
where \(X\) is the original numerical data and \(\mathcal{N}(0, \text{noise})\) represents a normal distribution with mean 0 and standard deviation equal to noise.
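A minimal sketch of the Gaussian-noise mode described above, assuming noise is drawn independently for every value in each numerical column and that non-numerical columns are left untouched. The name `sketch_add_gaussian_noise` is hypothetical, and the library's own implementation may differ in how it selects columns or seeds the generator.

```python
import numpy as np
import pandas as pd

def sketch_add_gaussian_noise(df, noise=0.1, seed=None):
    """Add N(0, noise) to each numerical column (illustrative, not hwm's code)."""
    rng = np.random.default_rng(seed)
    out = df.copy()  # leave the caller's DataFrame unchanged
    numeric_cols = out.select_dtypes(include="number").columns
    for col in numeric_cols:
        # Standard deviation equals `noise`, mean is 0.
        out[col] = out[col] + rng.normal(0.0, noise, size=len(out))
    return out

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
print(sketch_add_gaussian_noise(df, noise=0.1, seed=0))
```

Passing a fixed `seed` makes the perturbation reproducible, which is useful when the noisy data feeds a test or a benchmark.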
Examples¶
The following examples demonstrate how to use the add_noises_to function to add missing values or Gaussian noise to a DataFrame.
Adding Missing Values:
from hwm.utils import add_noises_to
import pandas as pd

# Original DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

# Add 20% missing values
new_df = add_noises_to(df, noise=0.2)
print(new_df)
# Example output (varies between runs because no seed is set):
#      A     B
# 0  1.0  <NA>
# 1  NaN     y
# 2  3.0  <NA>
Adding Gaussian Noise:
from hwm.utils import add_noises_to
import pandas as pd

# Original DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Add Gaussian noise with standard deviation 0.1
new_df = add_noises_to(df, noise=0.1, gaussian_noise=True)
print(new_df)
# Example output (varies between runs because no seed is set):
#           A         B
# 0  1.063292  3.986400
# 1  2.103962  4.984292
# 2  2.856601  6.017380
See Also¶
pandas.DataFrame : Two-dimensional, size-mutable, potentially heterogeneous tabular data.
numpy.random.normal() : Draw random samples from a normal (Gaussian) distribution.
gen_X_y_batches¶
- gen_X_y_batches(X, y, *, batch_size='auto', n_samples=None, min_batch_size=0, shuffle=True, random_state=None, return_batches=False, default_size=200)¶
Generate batches of data (X, y) for machine learning tasks such as training or evaluation [2]. This function slices the dataset into smaller batches, optionally shuffles the data, and returns them as a list of tuples or just the data batches [6].
Parameters¶
| Parameter | Description |
|---|---|
| X | The input data matrix, where each row is a sample and each column represents a feature. Must be an ndarray of shape (n_samples, n_features). |
| y | The target variable(s) corresponding to X. Can be a vector or matrix depending on the problem (single or multi-output). Must be an ndarray of shape (n_samples,) or (n_samples, n_targets). |
| batch_size | The number of samples per batch. If set to "auto", it uses the minimum between default_size and the number of samples, n_samples. Default is "auto". |
| n_samples | The total number of samples to consider. If None, the function defaults to using the number of samples in X. Default is None. |
| min_batch_size | The minimum size for each batch. This parameter ensures that the final batch contains at least min_batch_size samples. If the last batch is smaller than min_batch_size, it will be excluded from the result. Default is 0. |
| shuffle | If True, the data is shuffled before batching. This helps avoid bias when splitting data for training and validation. Default is True. |
| random_state | The seed used by the random number generator for reproducibility. If None, the random number generator uses the system time or entropy source. Default is None. |
| return_batches | If True, the function returns both the data batches and the slice objects used to index into X and y. If False, only the data batches are returned. Default is False. |
| default_size | The default batch size used when batch_size="auto" is selected. Default is 200. |
Returns¶
- list of tuples
A list of tuples where each tuple contains a batch of X and its corresponding batch of y.
- list of slice objects, optional
If return_batches=True, this list of slice objects is returned, each representing the slice of X and y used for a specific batch.
Notes¶
This function ensures that no empty batches are returned. If a batch contains zero samples (either from improper slicing or due to min_batch_size), it will be excluded.
The function performs shuffling using scikit-learn’s shuffle function, which is more stable and reduces memory usage by shuffling indices rather than the whole dataset.
The function utilizes the gen_batches utility to divide the data into batches.
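The shuffle-then-slice procedure described in these notes can be sketched with NumPy alone. This is a simplified stand-in under stated assumptions, not the library's implementation: the name `sketch_batches` is hypothetical, the slices are built by hand instead of with gen_batches, shuffling uses a NumPy permutation instead of scikit-learn's shuffle, and min_batch_size handling is left out.

```python
import numpy as np

def sketch_batches(X, y, batch_size=200, shuffle=True, random_state=None):
    """Split (X, y) into aligned batches (illustrative, not hwm's code)."""
    n_samples = X.shape[0]
    if shuffle:
        # Shuffle row indices once and apply the same order to X and y.
        rng = np.random.default_rng(random_state)
        order = rng.permutation(n_samples)
        X, y = X[order], y[order]
    # Consecutive slices covering all rows; the last one may be shorter.
    slices = [
        slice(start, min(start + batch_size, n_samples))
        for start in range(0, n_samples, batch_size)
    ]
    # Keep only batches that actually contain samples.
    batches = [(X[s], y[s]) for s in slices if X[s].shape[0] > 0]
    return batches, slices

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
batches, slices = sketch_batches(X, y, batch_size=4, random_state=0)
print([xb.shape[0] for xb, _ in batches])  # [4, 4, 2]
```

Applying one permutation to both arrays is what keeps each row of X paired with its target in y after shuffling.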
Examples¶
The following examples demonstrate how to use the gen_X_y_batches function to generate data batches for machine learning tasks.
Basic Example:
Generating batches of size 500 from random data.
from hwm.utils import gen_X_y_batches
import numpy as np

# Generate random input data and binary targets
X = np.random.rand(2000, 5)
y = np.random.randint(0, 2, size=(2000,))

# Create batches of size 500 with shuffling
batches = gen_X_y_batches(X, y, batch_size=500, shuffle=True)
print(len(batches))
# Output: 4
Returning Batch Slices:
Generating batches and obtaining slice objects for indexing.
from hwm.utils import gen_X_y_batches
import numpy as np

# Generate random input data and binary targets
X = np.random.rand(2000, 5)
y = np.random.randint(0, 2, size=(2000,))

# Create batches of size 500 and return batch slices
batches, slices = gen_X_y_batches(
    X, y, batch_size=500, shuffle=True, return_batches=True
)
print(len(batches))
# Output: 4
print(len(slices))
# Output: 4
Handling Minimum Batch Size:
Ensuring that the final batch meets the minimum batch size requirement.
from hwm.utils import gen_X_y_batches
import numpy as np

# Generate random input data and binary targets
X = np.random.rand(1025, 5)
y = np.random.randint(0, 2, size=(1025,))

# Create batches with batch_size=500 and min_batch_size=25
batches = gen_X_y_batches(
    X, y, batch_size=500, min_batch_size=25, shuffle=True
)
print(len(batches))
# Output: 2
for batch in batches:
    print(batch[0].shape, batch[1].shape)
# Output:
# (500, 5) (500,)
# (525, 5) (525,)
See Also¶
gen_batches() : A utility function that generates slices of data.
shuffle() : A utility to shuffle data while keeping the data and labels in sync.
ensure_non_empty_batch¶
- ensure_non_empty_batch(X, y, *, batch_slice, max_attempts=10, random_state=None, error='raise')¶
Shuffle the dataset (X, y) until the specified batch_slice yields a non-empty batch. This function ensures that the batch extracted using batch_slice contains at least one sample by repeatedly shuffling the data and reapplying the slice [5].
Parameters¶
| Parameter | Description |
|---|---|
| X | The input data matrix, where each row corresponds to a sample and each column corresponds to a feature. Must be an ndarray of shape (n_samples, n_features). |
| y | The target variable(s) corresponding to X. It can be a one-dimensional array for single-output tasks or a two-dimensional array for multi-output tasks. Must be an ndarray of shape (n_samples,) or (n_samples, n_targets). |
| batch_slice | A slice object representing the indices for the batch. For example, slice(0, 512) would extract the first 512 samples from X and y. |
| max_attempts | The maximum number of attempts to shuffle the data to obtain a non-empty batch. If the batch remains empty after the specified number of attempts, a ValueError is raised. Default is 10. |
| random_state | Controls the randomness of the shuffling. Pass an integer for reproducible results across multiple function calls. If None, the random number generator is the RandomState instance used by np.random. Default is None. |
| error | How to handle a batch that is still empty after max_attempts. Expected values are "raise", "warn", or "ignore". If "warn", the error is converted into a warning message. Any value other than "raise" or "warn" ignores the error. Default is "raise". |
Returns¶
- ndarray
The batch of input data extracted using batch_slice. Ensures that X_batch is not empty.
- ndarray
The batch of target data corresponding to X_batch, extracted using batch_slice. Ensures that y_batch is not empty.
Raises¶
- ValueError
If a non-empty batch cannot be obtained after max_attempts shuffles.
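The retry behaviour described above can be sketched as a bounded loop: shuffle, apply the slice, and stop as soon as the batch has at least one row. The name `sketch_ensure_non_empty` is hypothetical, only the error="raise" path is shown, and the use of a NumPy permutation for shuffling is an assumption made for this sketch.

```python
import numpy as np

def sketch_ensure_non_empty(X, y, batch_slice, max_attempts=10, random_state=None):
    """Reshuffle until the sliced batch is non-empty (illustrative, not hwm's code)."""
    rng = np.random.default_rng(random_state)
    for _ in range(max_attempts):
        # Same permutation for X and y keeps samples and targets aligned.
        order = rng.permutation(X.shape[0])
        X_batch, y_batch = X[order][batch_slice], y[order][batch_slice]
        if X_batch.shape[0] > 0:
            return X_batch, y_batch
    raise ValueError(
        f"Unable to obtain a non-empty batch after {max_attempts} attempts."
    )

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_batch, y_batch = sketch_ensure_non_empty(X, y, slice(0, 4), random_state=0)
print(X_batch.shape, y_batch.shape)  # (4, 2) (4,)
```

Bounding the loop with `max_attempts` guarantees termination when the input itself is empty, in which case no amount of shuffling can produce a non-empty batch.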
Examples¶
The following examples demonstrate how to use the ensure_non_empty_batch function to guarantee that a batch contains data.
Basic Example:
Ensuring a non-empty batch from random data.
from hwm.utils import ensure_non_empty_batch
import numpy as np

# Generate random input data and binary targets
X = np.random.rand(2000, 5)
y = np.random.randint(0, 2, size=(2000,))
batch_slice = slice(0, 512)

# Ensure the batch is non-empty
X_batch, y_batch = ensure_non_empty_batch(
    X, y, batch_slice=batch_slice
)
print(X_batch.shape)
# Output: (512, 5)
print(y_batch.shape)
# Output: (512,)
Handling Empty Batches:
Attempting to extract a batch from empty data, which raises a ValueError.
from hwm.utils import ensure_non_empty_batch
import numpy as np

# Empty input data
X_empty = np.empty((0, 5))
y_empty = np.empty((0,))
batch_slice = slice(0, 512)

# Attempt to ensure a non-empty batch
try:
    X_batch, y_batch = ensure_non_empty_batch(
        X_empty, y_empty, batch_slice=batch_slice
    )
except ValueError as e:
    print(e)
    # Output: Unable to obtain a non-empty batch after 10 attempts.
Using with Different Error Handling:
Downgrading the error to a warning with error="warn" and receiving the (still empty) batch when a non-empty batch cannot be obtained.
from hwm.utils import ensure_non_empty_batch
import numpy as np

# Empty input data
X_empty = np.empty((0, 5))
y_empty = np.empty((0,))
batch_slice = slice(0, 512)

# Attempt to ensure a non-empty batch with warning
X_batch, y_batch = ensure_non_empty_batch(
    X_empty, y_empty, batch_slice=batch_slice, error="warn"
)
print(X_batch.shape, y_batch.shape)
# Output: (0, 5) (0,)
See Also¶
gen_batches() : Generate slice objects to divide data into batches.
shuffle() : Shuffle arrays or sparse matrices in a consistent way.