tools

Helper functions for data manipulation, unit conversion, and analysis.

AMOCatlas analysis tools for data processing, filtering, and calculations.

amocatlas.tools.apply_tukey_filter(df: DataFrame, column: str, window_months: int = 6, samples_per_day: float = 0.2, alpha: float = 0.5, add_back_mean: bool = False, output_column: str | None = None) → DataFrame

Apply a Tukey filter using NumPy convolution (safely handles NaN values).

This function uses pandas DataFrame input to leverage NumPy’s convolution capabilities with Tukey windows, which provides more flexibility than xarray’s built-in rolling operations for this specific filtering approach.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the column to filter.

  • column (str) – Name of the column to apply the filter to.

  • window_months (int, default 6) – Filter window size in months.

  • samples_per_day (float, default 0.2) – Expected number of samples per day in the data.

  • alpha (float, default 0.5) – Tukey window parameter (0=rectangular, 1=Hann).

  • add_back_mean (bool, default False) – If True, remove the overall mean before filtering and add it back afterwards.

  • output_column (str, optional) – Name for the filtered output column. If None, uses “{column}_filtered”.

Returns:

Copy of input DataFrame with filtered column added.

Return type:

pandas.DataFrame

Notes

Uses pandas DataFrame rather than xarray Dataset because pandas provides better access to convolution operations with custom window functions.
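The idea of a NaN-safe Tukey convolution can be sketched as follows. This is an illustration, not the library's implementation: the names tukey_window and tukey_smooth are hypothetical, the window length is assumed to be window_months × 30 × samples_per_day samples, and NaN handling is assumed to work by convolving a validity mask alongside the data and dividing the two results.

```python
import numpy as np
import pandas as pd


def tukey_window(n: int, alpha: float) -> np.ndarray:
    """Tukey (tapered cosine) window: alpha=0 is rectangular, alpha=1 is Hann."""
    if n <= 1 or alpha <= 0:
        return np.ones(max(n, 0))
    x = np.arange(n) / (n - 1)  # normalised position in the window, 0..1
    w = np.ones(n)
    left = x < alpha / 2
    right = x > 1 - alpha / 2
    w[left] = 0.5 * (1 + np.cos(np.pi * (2 * x[left] / alpha - 1)))
    w[right] = 0.5 * (1 + np.cos(np.pi * (2 * x[right] / alpha - 2 / alpha + 1)))
    return w


def tukey_smooth(df, column, window_months=6, samples_per_day=0.2, alpha=0.5):
    """Hypothetical helper: weighted moving average that ignores NaN samples."""
    # Assumption: a month is treated as 30 days when sizing the window.
    n = max(int(round(window_months * 30 * samples_per_day)), 1)
    w = tukey_window(n, alpha)

    # Assumption: the series is at least as long as the window.
    values = df[column].to_numpy(dtype=float)
    valid = np.isfinite(values)

    # Convolve the zero-filled data and the validity mask with the same window;
    # their ratio is a weighted mean over the valid samples only.
    num = np.convolve(np.where(valid, values, 0.0), w, mode="same")
    den = np.convolve(valid.astype(float), w, mode="same")

    out = df.copy()
    with np.errstate(invalid="ignore", divide="ignore"):
        out[f"{column}_filtered"] = np.where(den > 0, num / den, np.nan)
    return out
```

Dividing by the convolved mask renormalises the weights near gaps and at the ends of the series, so isolated NaN values do not propagate into the filtered column.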

amocatlas.tools.bin_average_5day(df: DataFrame, time_column: str = 'time', value_column: str = 'moc') → DataFrame

Bin-average a time series into 5-day means.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame with time and value columns.

  • time_column (str, default "time") – Name of the datetime column.

  • value_column (str, default "moc") – Name of the data column to average.

Returns:

DataFrame with 5-day averaged time and values.

Return type:

pandas.DataFrame
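For orientation, 5-day bin averaging can be approximated with a pandas resample. The helper name five_day_means is hypothetical, and labelling each bin by its left edge is an assumption; the library function may label bins differently (for example by the mean time in each bin).

```python
import pandas as pd


def five_day_means(df, time_column="time", value_column="moc"):
    """Hypothetical helper: average a series into consecutive 5-day bins."""
    series = df.set_index(time_column)[value_column]
    # Each bin covers 5 days and is labelled by its left edge (assumption).
    binned = series.resample("5D").mean()
    return binned.reset_index()


daily = pd.DataFrame(
    {
        "time": pd.date_range("2020-01-01", periods=10, freq="D"),
        "moc": [float(i) for i in range(10)],
    }
)
five_day = five_day_means(daily)
```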

amocatlas.tools.bin_average_monthly(df: DataFrame, time_column: str = 'time') → DataFrame

Bin-average a time series into monthly means.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame with time column.

  • time_column (str, default "time") – Name of the datetime column.

Returns:

DataFrame with monthly averaged data.

Return type:

pandas.DataFrame

amocatlas.tools.check_and_bin(df: DataFrame, time_column: str = 'time') → DataFrame

Check temporal resolution and bin to monthly if needed.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame with time column.

  • time_column (str, default "time") – Name of the datetime column.

Returns:

Original DataFrame if already monthly, or monthly-binned version.

Return type:

pandas.DataFrame
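A rough sketch of the check-then-bin logic, under stated assumptions: the helper name bin_monthly_if_needed is hypothetical, "already monthly" is assumed to mean a median time step of at least 28 days, and all non-time columns are assumed to be numeric. The library may use a different criterion.

```python
import pandas as pd


def bin_monthly_if_needed(df, time_column="time", min_monthly_days=28):
    """Hypothetical helper: return df unchanged if monthly, else monthly means."""
    times = pd.to_datetime(df[time_column]).sort_values()
    median_step = times.diff().median()

    # Assumption: a median spacing of ~a month or more counts as "monthly".
    if median_step >= pd.Timedelta(days=min_monthly_days):
        return df

    # Otherwise average everything that falls in the same calendar month.
    monthly = df.set_index(time_column).resample("MS").mean()
    return monthly.reset_index()


daily = pd.DataFrame(
    {"time": pd.date_range("2020-01-01", periods=60, freq="D"), "moc": 1.0}
)
monthly_input = pd.DataFrame(
    {"time": pd.date_range("2020-01-01", periods=6, freq="MS"), "moc": 2.0}
)
binned = bin_monthly_if_needed(daily)
unchanged = bin_monthly_if_needed(monthly_input)
```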

amocatlas.tools.convert_units_var(var_values: ndarray | float, current_unit: str, new_unit: str, unit_conversion: dict[str, dict[str, float]] = {'PW': {'W': 1000000000000000.0}, 'Pa': {'dbar': 0.0001}, 'S/m': {'mS/cm': 0.1}, 'Sv': {'sverdrup': 1.0}, 'W': {'PW': 1e-15}, 'cm': {'m': 0.01}, 'cm s-1': {'m s-1': 0.01}, 'cm/s': {'m/s': 0.01}, 'dbar': {'Pa': 10000, 'kPa': 10}, 'degree_Celsius': {'degrees_Celsius': 1.0}, 'degrees_Celsius': {'degree_Celsius': 1}, 'g m-3': {'kg m-3': 0.001}, 'kPa': {'dbar': 0.1}, 'kg m-3': {'g m-3': 1000.0}, 'km': {'m': 1000.0}, 'm': {'cm': 100, 'km': 0.001}, 'm s-1': {'cm s-1': 100.0}, 'm/s': {'cm/s': 100.0}, 'mS/cm': {'S/m': 10.0}, 'sverdrup': {'Sv': 1}}) → ndarray | float

Converts variable values from one unit to another using a predefined conversion factor.

Parameters:
  • var_values (numpy.ndarray or float) – The values to be converted.

  • current_unit (str) – The current unit of the variable values.

  • new_unit (str) – The target unit to which the variable values should be converted.

  • unit_conversion (dict of {str: dict of {str: float}}, optional) – A dictionary containing conversion factors between units. The default is unit_conversion.

Returns:

The converted variable values. If no conversion factor is found, the original values are returned.

Return type:

numpy.ndarray or float

Raises:

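KeyError – If the conversion factor for the specified units is not found in the unit_conversion dictionary. As stated under Returns and Notes, this case is reported with a printed message and the original values are returned.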

Notes

If the conversion factor for the specified units is not available, a message is printed, and the original values are returned without any conversion.
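The lookup-with-fallback behaviour described above can be illustrated with a small stand-alone sketch. The function name convert_units and the reduced table UNIT_CONVERSION are hypothetical; the factors are a subset of the default table shown in the signature.

```python
# Subset of the default conversion table from the signature above.
UNIT_CONVERSION = {
    "m": {"cm": 100, "km": 0.001},
    "dbar": {"Pa": 10000, "kPa": 10},
    "cm/s": {"m/s": 0.01},
}


def convert_units(values, current_unit, new_unit, table=UNIT_CONVERSION):
    """Hypothetical helper: multiply by a tabulated factor, else return input."""
    try:
        factor = table[current_unit][new_unit]
    except KeyError:
        # No factor known: report it and hand the values back unchanged.
        print(f"No conversion found for {current_unit} -> {new_unit}; values unchanged.")
        return values
    return values * factor


length_cm = convert_units(2.5, "m", "cm")
untouched = convert_units(3.0, "m", "furlong")
```

Because the multiplication is a plain `*`, the same sketch works for floats and for NumPy arrays.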

amocatlas.tools.extract_time_and_time_num(ds: Dataset, time_var: str = 'TIME') → DataFrame

Extract time coordinates from xarray Dataset and convert to pandas DataFrame.

Parameters:
  • ds (xarray.Dataset) – Dataset containing time coordinate.

  • time_var (str, default "TIME") – Name of the time variable in the dataset.

Returns:

DataFrame with ‘time’ (datetime) and ‘time_num’ (decimal year) columns.

Return type:

pandas.DataFrame

amocatlas.tools.find_best_dtype(var_name: str, da: DataArray) → dtype

Determines the most suitable data type for a given variable.

Parameters:
  • var_name (str) – The name of the variable.

  • da (xarray.DataArray) – The data array containing the variable’s values.

Returns:

The optimal data type for the variable based on its name and values.

Return type:

numpy.dtype

amocatlas.tools.generate_reverse_conversions(forward_conversions: dict[str, dict[str, float]]) → dict[str, dict[str, float]]

Create a unit conversion dictionary with both forward and reverse conversions.

Parameters:

forward_conversions (dict of {str: dict of {str: float}}) – Mapping of source units to target units and conversion factors. Example: {"m": {"cm": 100, "km": 0.001}}

Returns:

Complete mapping of units including reverse conversions. Example: for the input {"m": {"cm": 100, "km": 0.001}}, the reverse entries added are {"cm": {"m": 0.01}, "km": {"m": 1000}}.

Return type:

dict of {str: dict of {str: float}}

Notes

If a conversion factor is zero, a warning is printed, and the reverse conversion is skipped.
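A minimal sketch of how reverse factors can be derived, assuming each reverse factor is simply the reciprocal of the forward one. The name add_reverse_conversions is hypothetical, and keeping an explicitly supplied entry in preference to a computed reciprocal is an assumption.

```python
def add_reverse_conversions(forward):
    """Hypothetical helper: return forward conversions plus their reciprocals."""
    # Copy the forward table so the input is not modified.
    complete = {src: dict(targets) for src, targets in forward.items()}

    for src, targets in forward.items():
        for dst, factor in targets.items():
            if factor == 0:
                # A zero factor has no reciprocal, so skip the reverse entry.
                print(f"Warning: zero factor for {src} -> {dst}; reverse skipped.")
                continue
            # Assumption: keep an explicitly supplied reverse factor if present.
            complete.setdefault(dst, {}).setdefault(src, 1 / factor)
    return complete


table = add_reverse_conversions({"m": {"cm": 100, "km": 0.001}})
```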

amocatlas.tools.handle_samba_gaps(df: DataFrame, time_column: str = 'time') → DataFrame

Handle temporal gaps in SAMBA MOC data to prevent plotting artifacts.

SAMBA data has significant gaps (e.g., 2011-2014) that cause plotting functions to draw connecting lines across missing periods. This function creates a regular monthly grid and masks interpolation to only occur within existing data periods, preventing spurious connections across large gaps.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame with time and MOC columns.

  • time_column (str, default "time") – Name of the datetime column.

Returns:

DataFrame with regular monthly grid and gap-aware data masking.

Return type:

pandas.DataFrame

Notes

PyGMT and other plotting functions connect all valid (non-NaN) data points regardless of temporal gaps. This function prevents artifacts by:

  1. Creating a regular monthly time grid

  2. Preserving NaN values where no original data existed

  3. Only interpolating within continuous data segments
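The gap-aware masking can be sketched with pandas as follows. This is an illustration under assumptions, not the library's code: the helper name regular_monthly_grid, the value_column argument, and the max_gap_months threshold (used here to decide what counts as a continuous segment) are all hypothetical.

```python
import pandas as pd


def regular_monthly_grid(df, time_column="time", value_column="moc", max_gap_months=2):
    """Hypothetical helper: monthly grid that keeps NaN across long gaps."""
    # Regular monthly grid; months with no data become NaN.
    monthly = df.set_index(time_column)[[value_column]].resample("MS").mean()
    values = monthly[value_column]

    # Label consecutive runs of missing / present months and measure gap length.
    is_gap = values.isna()
    run_id = is_gap.astype(int).diff().ne(0).cumsum()
    gap_length = is_gap.astype(int).groupby(run_id).transform("sum")

    # Interpolate everywhere, then keep the result only inside short gaps.
    filled = values.interpolate(method="linear", limit_area="inside")
    short_gap = is_gap & (gap_length <= max_gap_months)
    monthly[value_column] = values.where(~short_gap, filled)
    return monthly.reset_index()


samba_like = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2010-01-01", "2010-02-01", "2010-03-01", "2010-04-01",
             "2010-06-01", "2011-07-01", "2011-08-01"]
        ),
        "moc": [1.0, 2.0, 3.0, 4.0, 6.0, 10.0, 11.0],
    }
)
gridded = regular_monthly_grid(samba_like)
```

In this made-up series the single missing month (May 2010) is interpolated, while the year-long gap stays NaN so a plotting routine has nothing to connect across it.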

amocatlas.tools.reformat_units_var(ds: Dataset, var_name: str, unit_format: dict[str, str] = {'S/m': 'S m-1', 'cm/s': 'cm s-1', 'degrees_Celsius': 'degree_Celsius', 'g/m^3': 'g m-3', 'm/s': 'm s-1', 'meters': 'm'}) → str

Reformat the units of a variable in the dataset based on a provided mapping.

Parameters:
  • ds (xarray.Dataset) – The input dataset containing variables with units to be reformatted.

  • var_name (str) – The name of the variable whose units need to be reformatted.

  • unit_format (dict of {str: str}, optional) – A dictionary mapping old unit strings to new formatted unit strings. Defaults to unit_str_format.

Returns:

The reformatted unit string. If the old unit is not found in unit_format, the original unit string is returned.

Return type:

str
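The mapping-with-fallback described above amounts to a dictionary lookup that defaults to the original string. In this sketch a plain attribute dictionary stands in for the variable's attributes in the dataset; the name reformat_unit and the "units" attribute key are assumptions, and UNIT_FORMAT is a subset of the default mapping in the signature.

```python
# Subset of the default unit_format mapping from the signature above.
UNIT_FORMAT = {"m/s": "m s-1", "cm/s": "cm s-1", "meters": "m"}


def reformat_unit(attrs, unit_format=UNIT_FORMAT):
    """Hypothetical helper: map a unit string to its preferred spelling."""
    old_unit = attrs.get("units", "")
    # Unknown unit strings are passed through unchanged.
    return unit_format.get(old_unit, old_unit)


velocity_unit = reformat_unit({"units": "m/s"})
pressure_unit = reformat_unit({"units": "dbar"})
```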

amocatlas.tools.set_best_dtype(ds: Dataset) → Dataset

Adjust the data types of variables in a dataset to optimize memory usage.

Parameters:

ds (xarray.Dataset) – The input dataset whose variables’ data types will be adjusted.

Returns:

The dataset with updated data types for its variables, potentially saving memory.

Return type:

xarray.Dataset

Notes

  • The function determines the best data type for each variable using find_best_dtype.

  • Attributes like valid_min and valid_max are updated to match the new data type.

  • If the new data type is integer-based, NaN values are replaced with a fill value.

  • Logs the percentage of memory saved after the data type adjustments.

amocatlas.tools.set_fill_value(new_dtype: dtype) → int

Calculate the fill value for a given data type.

Parameters:

new_dtype (numpy.dtype) – The data type for which the fill value is to be calculated.

Returns:

The calculated fill value based on the bit-width of the data type.

Return type:

int
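One plausible reading of "based on the bit-width" is the largest value representable by a signed integer of that width. The sketch below assumes exactly that, and also shows how such a fill value could replace NaN before an integer cast, as mentioned in the set_best_dtype notes. The names fill_value_for and cast_with_fill are hypothetical.

```python
import numpy as np


def fill_value_for(dtype):
    """Hypothetical helper: largest signed value for the dtype's bit-width."""
    bits = np.dtype(dtype).itemsize * 8
    # Assumption: fill value is 2**(bits - 1) - 1, e.g. 32767 for 16 bits.
    return 2 ** (bits - 1) - 1


def cast_with_fill(values, dtype):
    """Replace NaN with the fill value, then cast to the integer dtype."""
    fill = fill_value_for(dtype)
    filled = np.where(np.isnan(values), fill, values)
    return filled.astype(dtype)


packed = cast_with_fill(np.array([1.0, np.nan, 3.0]), np.int16)
```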

amocatlas.tools.to_decimal_year(dates: Series) → Series

Convert datetime series to decimal years, handling NaN values safely.

Parameters:

dates (pandas.Series or pandas.DatetimeIndex) – Series or Index of datetime objects to convert.

Returns:

Series of decimal years with NaN preserved for invalid dates.

Return type:

pandas.Series

Examples

>>> import pandas as pd
>>> from amocatlas.tools import to_decimal_year
>>> dates = pd.Series(['2020-01-01', '2020-07-01', '2021-01-01'])
>>> dates = pd.to_datetime(dates)  # convert strings to datetime64
>>> decimal_years = to_decimal_year(dates)
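For readers who want to see what such a conversion involves, here is a stand-alone sketch. It assumes the decimal year is the calendar year plus the elapsed fraction of that year (so leap years use 366 days); the library's exact convention may differ, and the name decimal_year is hypothetical.

```python
import numpy as np
import pandas as pd


def decimal_year(dates):
    """Hypothetical helper: year + fraction of the year elapsed, NaN for NaT."""

    def _one(ts):
        if pd.isna(ts):
            return np.nan
        start = pd.Timestamp(year=ts.year, month=1, day=1)
        end = pd.Timestamp(year=ts.year + 1, month=1, day=1)
        # Timedelta / Timedelta gives the elapsed fraction of the year.
        return ts.year + (ts - start) / (end - start)

    stamps = pd.Series(pd.to_datetime(dates))
    return stamps.apply(_one).astype(float)


years = decimal_year(["2020-01-01", "2020-07-01", None, "2021-01-01"])
```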