utilities

Shared utilities for downloading, reading, and parsing data files.

Utility functions for the AMOCatlas package.

This module provides shared utility functions, including:

- File download and caching
- Data directory management
- URL and path validation
- Metadata loading and validation
- Decorator functions for default parameters

amocatlas.utilities.apply_defaults(default_source: str, default_files: List[str]) → Callable[source]

Decorator to apply default values for ‘source’ and ‘file_list’ parameters if they are None.

Parameters:

default_source (str) – Default source URL or path.
default_files (list of str) – Default list of filenames.

Returns:

A wrapped function with defaults applied.

Return type:

Callable
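The following is a minimal sketch of how a decorator of this kind could work, assuming the wrapped function takes `source` and `file_list` as its first two parameters. The names `apply_defaults_sketch` and `read_example`, the URL, and the filename are hypothetical; this is not the AMOCatlas implementation.

```python
import functools
from typing import Callable, List, Optional


def apply_defaults_sketch(default_source: str, default_files: List[str]) -> Callable:
    """Hypothetical stand-in for apply_defaults: fill in arguments left as None."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(
            source: Optional[str] = None,
            file_list: Optional[List[str]] = None,
            *args,
            **kwargs,
        ):
            # Substitute a default only when the caller passed None.
            if source is None:
                source = default_source
            if file_list is None:
                file_list = default_files
            return func(source, file_list, *args, **kwargs)

        return wrapper

    return decorator


@apply_defaults_sketch("https://example.org/data/", ["example.nc"])
def read_example(source, file_list):
    # Hypothetical reader that simply reports what it received.
    return source, file_list
```

Calling `read_example()` with no arguments would then receive both defaults, while `read_example(source="local/")` keeps the caller's source and only fills in the file list.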

amocatlas.utilities.apply_unit_standardization_after_metadata(ds: Dataset) → Dataset[source]

Apply unit standardization with high priority to override YAML metadata.

This function is designed to be called after metadata enrichment to ensure that standardized units take precedence over any units specified in YAML metadata files.

Parameters:

ds (xr.Dataset) – Dataset that may have had units overwritten by metadata processing.

Returns:

Dataset with units re-standardized.

Return type:

xr.Dataset

Notes

This addresses the issue where YAML metadata files contain “Sv” units that override the standardized “Sverdrup” units. This function should be called as the final step in standardization.

Examples

>>> # In standardization pipeline
>>> ds = apply_metadata_from_yaml(ds)  # This might set units: Sv
>>> ds = apply_unit_standardization_after_metadata(ds)  # This fixes it

amocatlas.utilities.download_file(url: str, dest_folder: str, redownload: bool = False, filename: str = None) → str[source]

Download a file from HTTP(S) or FTP to the specified destination folder.

Parameters:

url (str) – The URL of the file to download.
dest_folder (str) – Local folder to save the downloaded file.
redownload (bool, optional) – If True, force re-download of the file even if it exists. Defaults to False.
filename (str, optional) – Name to save the file under. If not given, the name is taken from the URL.

Returns:

The full path to the downloaded file.

Return type:

str

Raises:

ValueError – If the URL scheme is unsupported.

amocatlas.utilities.find_data_start(file_path: str) → int[source]

Locate the first line of numerical data in a legacy ASCII file.

This function scans an ASCII text file line by line and returns the zero-based line index of the first row that appears to contain data. A data row is identified as a non-empty line whose first non-whitespace character is a digit. This is useful for files with long, human-readable headers (titles, references, separators) preceding the actual data table.

Parameters:

file_path (str) – Path to the ASCII file to be scanned.

Returns:

Zero-based line index at which the numerical data table begins.

Return type:

int

Raises:

ValueError – If no data-like lines are found in the file.
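The scanning rule described above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the name `find_data_start_sketch`, the UTF-8 decoding with replacement, and the demo file contents are all hypothetical.

```python
import os
import tempfile


def find_data_start_sketch(file_path: str) -> int:
    """Return the zero-based index of the first line that starts with a digit."""
    with open(file_path, "r", encoding="utf-8", errors="replace") as handle:
        for index, line in enumerate(handle):
            stripped = line.lstrip()
            # A data row is a non-empty line whose first visible character is a digit.
            if stripped and stripped[0] in "0123456789":
                return index
    raise ValueError(f"No data-like lines found in {file_path}")


# Demo on a temporary file with a title, a separator, and a blank line.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "legacy.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(
            "Transport estimates\n-------------------\n\n"
            "2004 4 2 17.5\n2004 4 3 18.1\n"
        )
    demo_index = find_data_start_sketch(demo_path)  # 3
```

Note that under this rule a row beginning with a sign character (for example `-1.5`) would not count as data, because its first character is not a digit.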

amocatlas.utilities.get_default_data_dir() → Path[source]

Get the default data directory path for AMOCatlas.

amocatlas.utilities.get_project_root() → Path[source]

Return the absolute path to the project root directory.

amocatlas.utilities.get_standard_unit_mappings() → Dict[str, str][source]

Get the comprehensive mapping of unit variations to standard units.

Uses defaults.PREFERRED_UNITS as target values for standardization.

Returns:

Dictionary mapping various unit forms to their standard equivalents.

Return type:

Dict[str, str]

Notes

This centralizes all unit standardization rules for consistency across the AMOCatlas package. Add new unit mappings here as needed. Target values come from defaults.PREFERRED_UNITS to ensure consistency.

Examples

>>> mappings = get_standard_unit_mappings()
>>> print(mappings["Sv"])  # "Sverdrup"
>>> print(mappings["deg C"])  # "degree_C"

amocatlas.utilities.is_valid_url(url: str) → bool[source]

Validate if a given string is a valid URL with supported schemes.

Parameters:

url (str) – The URL string to validate.

Returns:

True if the URL is valid and uses a supported scheme (‘http’, ‘https’, ‘ftp’), otherwise False.

Return type:

bool
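A check of this kind can be sketched with the standard library's `urllib.parse.urlparse`. The name `is_valid_url_sketch` is hypothetical, and requiring a non-empty network location is an assumption about what "valid" means here; the actual function may apply different rules.

```python
from urllib.parse import urlparse

# Schemes listed as supported in the description above.
SUPPORTED_SCHEMES = ("http", "https", "ftp")


def is_valid_url_sketch(url: str) -> bool:
    """Return True when url has a supported scheme and a network location."""
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse can raise on malformed input such as a bad IPv6 literal.
        return False
    # Assumption: a usable URL needs both a supported scheme and a host part.
    return parsed.scheme in SUPPORTED_SCHEMES and bool(parsed.netloc)
```

Under these assumptions, `"https://example.org/file.nc"` passes, while a relative path such as `"data/file.nc"` or a `file://` URL does not.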

amocatlas.utilities.load_array_metadata(datasource_id: str) → dict[source]

Load metadata YAML for a given data source.

Parameters:

datasource_id (str) – Datasource identifier (e.g., ‘rapid26n’, ‘samba34s’).

Returns:

Dictionary containing the parsed YAML metadata.

Return type:

dict

amocatlas.utilities.mask_invalid_values(ds: Dataset) → Dataset[source]

Mask values outside valid_min/valid_max ranges as NaN.

Many netCDF files contain valid_min and valid_max attributes that define the valid range for variables. Values outside this range should be treated as missing data but are often not automatically masked by xarray.

Parameters:

ds (xr.Dataset) – Dataset to check for invalid values.

Returns:

Dataset with values outside valid ranges masked as NaN.

Return type:

xr.Dataset

Examples

>>> # Variable has valid_min=-100, valid_max=100 but contains 9.97e+36
>>> ds_clean = mask_invalid_values(ds)
>>> # Now extreme values are masked as NaN

amocatlas.utilities.normalize_whitespace(attrs: dict) → dict[source]

Replace non-breaking and other unusual whitespace in every string attribute value with a normal ASCII space, and collapse runs of whitespace into a single space.
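One way to get this behaviour is a single regular-expression substitution, because `\s` in a Python `str` pattern matches Unicode whitespace, including the non-breaking space U+00A0. The sketch below is hypothetical (the name `normalize_whitespace_sketch` is not part of the package) and leaves non-string values untouched, which is an assumption.

```python
import re
from typing import Any, Dict


def normalize_whitespace_sketch(attrs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of attrs with whitespace runs in string values collapsed."""
    cleaned: Dict[str, Any] = {}
    for key, value in attrs.items():
        if isinstance(value, str):
            # Any run of whitespace (tabs, NBSP, newlines, ...) becomes one space.
            value = re.sub(r"\s+", " ", value)
        cleaned[key] = value
    return cleaned


demo_attrs = normalize_whitespace_sketch(
    {"title": "AMOC\u00a0 transport\t\tdata", "n_moorings": 3}
)
# demo_attrs == {"title": "AMOC transport data", "n_moorings": 3}
```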

amocatlas.utilities.parse_ascii_header(file_path: str, comment_char: str = '%') → Tuple[List[str], int][source]

Parse the header of an ASCII file to extract column names and the number of header lines.

Header lines are identified by the given comment character (default: ‘%’). Columns are defined in lines like: ‘<comment_char> Column 1: <column_name>’.

Parameters:

file_path (str) – Path to the ASCII file.
comment_char (str, optional) – Character used to identify header lines. Defaults to ‘%’.

Returns:

A tuple containing:

- A list of column names extracted from the header.
- The number of header lines to skip.

Return type:

tuple of (list of str, int)
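The header convention described above can be illustrated with a short sketch. It assumes the header is a contiguous block of comment lines at the top of the file; the name `parse_ascii_header_sketch`, the regular expression, and the demo file are hypothetical and may differ from the actual implementation.

```python
import os
import re
import tempfile
from typing import List, Tuple


def parse_ascii_header_sketch(
    file_path: str, comment_char: str = "%"
) -> Tuple[List[str], int]:
    """Collect column names from '<comment_char> Column N: <name>' header lines."""
    column_pattern = re.compile(r"Column\s+\d+\s*:\s*(.+)")
    column_names: List[str] = []
    header_lines = 0
    with open(file_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if not line.startswith(comment_char):
                break  # Assumption: the first non-comment line ends the header.
            header_lines += 1
            match = column_pattern.search(line[len(comment_char):])
            if match:
                column_names.append(match.group(1).strip())
    return column_names, header_lines


# Demo on a temporary file with three header lines and one data row.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "example.asc")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(
            "% Example header\n% Column 1: time\n"
            "% Column 2: transport (Sv)\n1.0 17.2\n"
        )
    demo_columns, demo_skip = parse_ascii_header_sketch(demo_path)
```

The returned line count is the kind of value that can be handed to a reader as the number of rows to skip before the data table.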

amocatlas.utilities.read_ascii_file(file_path: str, comment_char: str = '#') → DataFrame[source]

Read an ASCII file into a pandas DataFrame, skipping lines starting with a specified comment character.

Parameters:

file_path (str) – Path to the ASCII file.
comment_char (str, optional) – Character denoting comment lines. Defaults to ‘#’.

Returns:

The loaded data as a pandas DataFrame.

Return type:

pd.DataFrame

amocatlas.utilities.resolve_file_path(file_name: str, source: str | Path | None, download_url: str | None, local_data_dir: Path, redownload: bool = False) → Path[source]

Resolve the path to a data file, using local source, cache, or downloading if necessary.

Parameters:

file_name (str) – The name of the file to resolve.
source (str or Path or None) – Optional local source directory.
download_url (str or None) – URL to download the file if needed.
local_data_dir (Path) – Directory where downloaded files are stored.
redownload (bool, optional) – If True, force a re-download even if a cached file exists. Defaults to False.

Returns:

Path to the resolved file.

Return type:

Path
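The lookup order named in the summary (local source, then cache, then download) might look like the sketch below. The exact precedence, the error raised when nothing is found, and the injectable `downloader` callable are assumptions made for illustration; `resolve_file_path_sketch` is a hypothetical name, not the package function.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional, Union


def resolve_file_path_sketch(
    file_name: str,
    source: Union[str, Path, None],
    download_url: Optional[str],
    local_data_dir: Path,
    redownload: bool = False,
    downloader: Optional[Callable[[str, Path], Path]] = None,  # hypothetical hook
) -> Path:
    """Resolve file_name from a local source, the cache, or a download."""
    # 1. Prefer a copy in the caller-supplied local directory.
    if source is not None:
        candidate = Path(source) / file_name
        if candidate.exists():
            return candidate
    # 2. Reuse a cached download unless a fresh copy was requested.
    cached = Path(local_data_dir) / file_name
    if cached.exists() and not redownload:
        return cached
    # 3. Fall back to downloading into the cache directory.
    if download_url and downloader is not None:
        return downloader(download_url, cached)
    raise FileNotFoundError(f"Could not resolve {file_name}")


# Demo: a file already present in the cache directory is returned as-is.
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_dir = Path(tmp_dir)
    (cache_dir / "example.txt").write_text("cached")
    resolved_name = resolve_file_path_sketch(
        "example.txt", None, None, cache_dir
    ).name
```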

amocatlas.utilities.safe_update_attrs(ds: Dataset, new_attrs: Dict[str, str], overwrite: bool = False, verbose: bool = True) → Dataset[source]

Safely update attributes of an xarray Dataset without overwriting existing keys, unless explicitly allowed.

Parameters:

ds (xr.Dataset) – The xarray Dataset whose attributes will be updated.
new_attrs (dict of str) – Dictionary of new attributes to add.
overwrite (bool, optional) – If True, allow overwriting existing attributes. Defaults to False.
verbose (bool, optional) – If True, emit a warning when skipping existing attributes. Defaults to True.

Returns:

The dataset with updated attributes.

Return type:

xr.Dataset
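The merge rule can be shown on a plain dictionary, which behaves like a dataset's `attrs` mapping for this purpose. This is a sketch under assumptions: `safe_update_attrs_sketch` is a hypothetical name, it returns a new dictionary instead of updating a dataset, and using `warnings.warn` for the skip message is a guess at how the warning is emitted.

```python
import warnings
from typing import Dict


def safe_update_attrs_sketch(
    attrs: Dict[str, str],
    new_attrs: Dict[str, str],
    overwrite: bool = False,
    verbose: bool = True,
) -> Dict[str, str]:
    """Return a copy of attrs with new_attrs merged in, keeping existing keys."""
    updated = dict(attrs)
    for key, value in new_attrs.items():
        if key in updated and not overwrite:
            # Existing keys win unless overwrite=True.
            if verbose:
                warnings.warn(f"Attribute '{key}' already exists; skipping.", stacklevel=2)
            continue
        updated[key] = value
    return updated


existing = {"title": "Original title"}
incoming = {"title": "Other title", "source": "mooring array"}
kept = safe_update_attrs_sketch(existing, incoming, verbose=False)
replaced = safe_update_attrs_sketch(existing, incoming, overwrite=True)
```

Here `kept` retains the original title and gains `source`, while `replaced` takes both values from `incoming`.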

amocatlas.utilities.sanitize_variable_name(name: str) → str[source]

Sanitize variable names to create valid Python identifiers.

Replaces illegal Python identifier characters (spaces, parentheses, periods, hyphens, etc.) with underscores and collapses repeated underscores into single ones.

Parameters:

name (str) – The original variable name, which may contain illegal characters.

Returns:

A sanitized variable name that is a valid Python identifier.

Return type:

str

Examples

>>> sanitize_variable_name("Total MOC anomaly (relative to record-length average of 14.7 Sv)")
'Total_MOC_anomaly_relative_to_record_length_average_of_14_7_Sv'
>>> sanitize_variable_name("Upper-cell volume transport anomaly")
'Upper_cell_volume_transport_anomaly'
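A replace-then-collapse approach of this kind can be sketched with two regular-expression substitutions. The name `sanitize_variable_name_sketch` is hypothetical. Trimming underscores from both ends and prefixing names that start with a digit are assumptions, added so the result is always usable as an identifier; the actual function may differ.

```python
import re


def sanitize_variable_name_sketch(name: str) -> str:
    """Turn an arbitrary label into a string usable as a Python identifier."""
    # Replace every character that is not an ASCII letter, digit, or underscore.
    sanitized = re.sub(r"[^0-9A-Za-z_]", "_", name)
    # Collapse runs of underscores, then trim them from both ends (assumption).
    sanitized = re.sub(r"_+", "_", sanitized).strip("_")
    # Identifiers cannot start with a digit; the "_" prefix is an assumption.
    if sanitized and sanitized[0].isdigit():
        sanitized = "_" + sanitized
    return sanitized


demo_name = sanitize_variable_name_sketch("Upper-cell volume transport anomaly")
# demo_name == "Upper_cell_volume_transport_anomaly"
```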

amocatlas.utilities.standardize_dataset_units(ds: Dataset, mapping: Dict[str, str] | None = None, log_changes: bool = True) → Dataset[source]

Standardize units throughout a dataset using comprehensive mapping rules.

Parameters:

ds (xr.Dataset) – Dataset to standardize units for.
mapping (Dict[str, str], optional) – Custom unit mapping. If None, uses get_standard_unit_mappings().
log_changes (bool, optional) – Whether to log unit changes. Default is True.

Returns:

Dataset with standardized units.

Return type:

xr.Dataset

Notes

This function applies unit standardization to all variables and coordinates in the dataset. It’s designed to be the central unit standardization function for AMOCatlas, replacing the simpler standardize_units function.

Examples

>>> ds_std = standardize_dataset_units(ds)
>>> # Check if Sv was converted to Sverdrup
>>> print(ds_std['transport'].attrs['units'])  # "Sverdrup"
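The core lookup can be illustrated without xarray by treating each variable's attributes as a plain dictionary. In this sketch only the two mapping entries shown in the examples on this page are used; the real mapping is larger. The names `EXAMPLE_MAPPING` and `standardize_units_sketch` are hypothetical, and leaving unmapped units unchanged is an assumption.

```python
from typing import Dict, Optional

# Only these two entries come from the examples in this reference.
EXAMPLE_MAPPING: Dict[str, str] = {"Sv": "Sverdrup", "deg C": "degree_C"}


def standardize_units_sketch(
    var_attrs: Dict[str, Dict[str, str]],
    mapping: Optional[Dict[str, str]] = None,
) -> Dict[str, Dict[str, str]]:
    """Return a copy of per-variable attrs with 'units' mapped to standard forms."""
    mapping = EXAMPLE_MAPPING if mapping is None else mapping
    result: Dict[str, Dict[str, str]] = {}
    for name, attrs in var_attrs.items():
        new_attrs = dict(attrs)
        units = new_attrs.get("units")
        # Units without a mapping entry are left unchanged (assumption).
        if units in mapping:
            new_attrs["units"] = mapping[units]
        result[name] = new_attrs
    return result


demo_units = standardize_units_sketch(
    {"transport": {"units": "Sv"}, "time": {"units": "days"}}
)
```

In this demo the `transport` units become `"Sverdrup"` and the `time` units stay as `"days"`.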

amocatlas.utilities.validate_array_yaml(datasource_id: str, verbose: bool = True) → bool[source]

Validate the structure and required fields of a datasource metadata YAML.

Parameters:

datasource_id (str) – The datasource identifier (e.g., ‘rapid26n’, ‘samba34s’).
verbose (bool, optional) – If True, print detailed validation messages. Defaults to True.

Returns:

True if validation passes, False otherwise.

Return type:

bool