tools
Helper functions for data manipulation, unit conversion, and analysis.
AMOCatlas analysis tools for data processing, filtering, and calculations.
- amocatlas.tools.apply_tukey_filter(df: DataFrame, column: str, window_months: int = 6, samples_per_day: float = 0.2, alpha: float = 0.5, add_back_mean: bool = False, output_column: str | None = None) DataFrame[source]
Apply a Tukey filter using NumPy convolution (safely handles NaN values).
The function takes a pandas DataFrame so that it can apply NumPy convolution with a Tukey window, which is more flexible than xarray’s built-in rolling operations for this filtering approach.
- Parameters:
df (pandas.DataFrame) – Input DataFrame containing the column to filter.
column (str) – Name of the column to apply the filter to.
window_months (int, default 6) – Filter window size in months.
samples_per_day (float, default 0.2) – Expected number of samples per day in the data.
alpha (float, default 0.5) – Tukey window parameter (0=rectangular, 1=Hann).
add_back_mean (bool, default False) – Whether to remove the overall mean before filtering and add it back afterwards.
output_column (str, optional) – Name for the filtered output column. If None, uses “{column}_filtered”.
- Returns:
Copy of input DataFrame with filtered column added.
- Return type:
pandas.DataFrame
Notes
Uses pandas DataFrame rather than xarray Dataset because pandas provides better access to convolution operations with custom window functions.
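The following is a minimal sketch of NaN-tolerant Tukey smoothing with NumPy convolution, not the amocatlas implementation. The helper names tukey_window and nan_safe_smooth are hypothetical, the window length assumes 30-day months (window_months * 30 * samples_per_day), and NaN handling is done by normalized convolution, which is one possible reading of "safely handles NaN values".

```python
import numpy as np
import pandas as pd


def tukey_window(n_points: int, alpha: float = 0.5) -> np.ndarray:
    """Symmetric Tukey window: alpha=0 is rectangular, alpha=1 is Hann."""
    if n_points < 2 or alpha <= 0:
        return np.ones(max(n_points, 1))
    alpha = min(alpha, 1.0)
    n = np.arange(n_points)
    width = alpha * (n_points - 1) / 2.0  # length of each cosine taper
    window = np.ones(n_points)
    rising = n < width
    falling = n > (n_points - 1) - width
    window[rising] = 0.5 * (1 + np.cos(np.pi * (n[rising] / width - 1)))
    window[falling] = 0.5 * (
        1 + np.cos(np.pi * (n[falling] - (n_points - 1) + width) / width)
    )
    return window


def nan_safe_smooth(values, window) -> np.ndarray:
    """Weighted moving average that ignores NaN samples (assumes len(window) <= len(values))."""
    values = np.asarray(values, dtype=float)
    valid = np.isfinite(values)
    filled = np.where(valid, values, 0.0)
    # Convolve the data and the validity mask with the same window,
    # then divide so missing samples do not bias the average.
    weighted_sum = np.convolve(filled, window, mode="same")
    weight_total = np.convolve(valid.astype(float), window, mode="same")
    with np.errstate(invalid="ignore", divide="ignore"):
        smoothed = weighted_sum / weight_total
    smoothed[weight_total == 0] = np.nan
    return smoothed


# Defaults from the signature above; the 30-day month is an assumption.
window_months, samples_per_day, alpha = 6, 0.2, 0.5
n_points = int(round(window_months * 30 * samples_per_day))

df = pd.DataFrame({"moc": np.full(100, 5.0)})
df.loc[10, "moc"] = np.nan  # a single missing sample
df["moc_filtered"] = nan_safe_smooth(df["moc"].to_numpy(), tukey_window(n_points, alpha))
```

Dividing by the convolved validity mask keeps a constant series constant even next to a missing sample, which is a quick sanity check for this kind of filter.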
- amocatlas.tools.bin_average_5day(df: DataFrame, time_column: str = 'time', value_column: str = 'moc') DataFrame[source]
Bin-average a time series into 5-day means.
- Parameters:
df (pandas.DataFrame) – Input DataFrame with time and value columns.
time_column (str, default "time") – Name of the datetime column.
value_column (str, default "moc") – Name of the data column to average.
- Returns:
DataFrame with 5-day averaged time and values.
- Return type:
pandas.DataFrame
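As an illustration of 5-day bin averaging, the sketch below uses pandas resampling. The function name bin_average_5day_sketch is hypothetical, and labelling each bin by its start time is an assumption; the library function may label or centre its bins differently.

```python
import numpy as np
import pandas as pd


def bin_average_5day_sketch(df, time_column="time", value_column="moc"):
    """Average value_column over consecutive 5-day bins, labelled by bin start."""
    binned = df.set_index(time_column)[value_column].resample("5D").mean()
    return binned.reset_index()


daily = pd.DataFrame(
    {
        "time": pd.date_range("2020-01-01", periods=10, freq="D"),
        "moc": np.arange(10, dtype=float),
    }
)
five_day = bin_average_5day_sketch(daily)
```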
- amocatlas.tools.bin_average_monthly(df: DataFrame, time_column: str = 'time') DataFrame[source]
Bin-average a time series into monthly means.
- Parameters:
df (pandas.DataFrame) – Input DataFrame with time column.
time_column (str, default "time") – Name of the datetime column.
- Returns:
DataFrame with monthly averaged data.
- Return type:
pandas.DataFrame
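Monthly bin averaging can be sketched in the same way. This is an assumption about the approach, not the library code: the name bin_average_monthly_sketch is hypothetical and the bins are labelled by the first day of each month.

```python
import numpy as np
import pandas as pd


def bin_average_monthly_sketch(df, time_column="time"):
    """Average all numeric columns within each calendar month."""
    return df.set_index(time_column).resample("MS").mean().reset_index()


times = pd.date_range("2020-01-01", "2020-02-29", freq="D")
daily = pd.DataFrame({"time": times, "moc": np.where(times.month == 1, 1.0, 3.0)})
monthly = bin_average_monthly_sketch(daily)
```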
- amocatlas.tools.check_and_bin(df: DataFrame, time_column: str = 'time') DataFrame[source]
Check temporal resolution and bin to monthly if needed.
- Parameters:
df (pandas.DataFrame) – Input DataFrame with time column.
time_column (str, default "time") – Name of the datetime column.
- Returns:
Original DataFrame if already monthly, or monthly-binned version.
- Return type:
pandas.DataFrame
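One way to decide whether binning is needed is to look at the median spacing between samples. The sketch below assumes that approach; the name bin_monthly_if_needed and the 28-day threshold are hypothetical choices, and the library may use a different test.

```python
import numpy as np
import pandas as pd


def bin_monthly_if_needed(df, time_column="time", monthly_threshold_days=28.0):
    """Return df unchanged if it is already roughly monthly, else monthly means."""
    times = pd.to_datetime(df[time_column]).sort_values()
    median_step_days = times.diff().median() / pd.Timedelta(days=1)
    if median_step_days >= monthly_threshold_days:
        return df
    return df.set_index(time_column).resample("MS").mean().reset_index()


daily = pd.DataFrame(
    {"time": pd.date_range("2020-01-01", periods=60, freq="D"), "moc": np.ones(60)}
)
already_monthly = pd.DataFrame(
    {"time": pd.date_range("2020-01-01", periods=6, freq="MS"), "moc": np.ones(6)}
)

binned = bin_monthly_if_needed(daily)
untouched = bin_monthly_if_needed(already_monthly)
```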
- amocatlas.tools.convert_units_var(var_values: ndarray | float, current_unit: str, new_unit: str, unit_conversion: dict[str, dict[str, float]] = {'PW': {'W': 1000000000000000.0}, 'Pa': {'dbar': 0.0001}, 'S/m': {'mS/cm': 0.1}, 'Sv': {'sverdrup': 1.0}, 'W': {'PW': 1e-15}, 'cm': {'m': 0.01}, 'cm s-1': {'m s-1': 0.01}, 'cm/s': {'m/s': 0.01}, 'dbar': {'Pa': 10000, 'kPa': 10}, 'degree_Celsius': {'degrees_Celsius': 1.0}, 'degrees_Celsius': {'degree_Celsius': 1}, 'g m-3': {'kg m-3': 0.001}, 'kPa': {'dbar': 0.1}, 'kg m-3': {'g m-3': 1000.0}, 'km': {'m': 1000.0}, 'm': {'cm': 100, 'km': 0.001}, 'm s-1': {'cm s-1': 100.0}, 'm/s': {'cm/s': 100.0}, 'mS/cm': {'S/m': 10.0}, 'sverdrup': {'Sv': 1}}) ndarray | float[source]
Converts variable values from one unit to another using a predefined conversion factor.
- Parameters:
var_values (numpy.ndarray or float) – The values to be converted.
current_unit (str) – The current unit of the variable values.
new_unit (str) – The target unit to which the variable values should be converted.
unit_conversion (dict of {str: dict of {str: float}}, optional) – A dictionary containing conversion factors between units. The default is unit_conversion.
- Returns:
The converted variable values. If no conversion factor is found, the original values are returned.
- Return type:
numpy.ndarray or float
- Raises:
KeyError – If the conversion factor for the specified units is not found in the unit_conversion dictionary.
Notes
If the conversion factor for the specified units is not available, a message is printed, and the original values are returned without any conversion.
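The lookup can be illustrated with a nested dictionary. The sketch follows the fallback described in the Returns and Notes entries (print a message and return the values unchanged) rather than raising KeyError; the function name convert_units_sketch is hypothetical and the table is only an excerpt of the default shown in the signature.

```python
import numpy as np

# Excerpt of the default conversion table from the signature above.
UNIT_CONVERSION_EXCERPT = {
    "cm": {"m": 0.01},
    "m": {"cm": 100, "km": 0.001},
    "dbar": {"Pa": 10000, "kPa": 10},
}


def convert_units_sketch(
    var_values, current_unit, new_unit, unit_conversion=UNIT_CONVERSION_EXCERPT
):
    """Multiply by the factor for current_unit -> new_unit, if one is known."""
    try:
        factor = unit_conversion[current_unit][new_unit]
    except KeyError:
        print(f"No conversion factor for {current_unit} -> {new_unit}; values unchanged.")
        return var_values
    return var_values * factor


lengths_m = convert_units_sketch(np.array([250.0, 1000.0]), "cm", "m")
unchanged = convert_units_sketch(5.0, "cm", "furlong")
```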
- amocatlas.tools.extract_time_and_time_num(ds: Dataset, time_var: str = 'TIME') DataFrame[source]
Extract time coordinates from xarray Dataset and convert to pandas DataFrame.
- Parameters:
ds (xarray.Dataset) – Dataset containing time coordinate.
time_var (str, default "TIME") – Name of the time variable in the dataset.
- Returns:
DataFrame with ‘time’ (datetime) and ‘time_num’ (decimal year) columns.
- Return type:
pandas.DataFrame
- amocatlas.tools.find_best_dtype(var_name: str, da: DataArray) dtype[source]
Determines the most suitable data type for a given variable.
- Parameters:
var_name (str) – The name of the variable.
da (xarray.DataArray) – The data array containing the variable’s values.
- Returns:
The optimal data type for the variable based on its name and values.
- Return type:
numpy.dtype
- amocatlas.tools.generate_reverse_conversions(forward_conversions: dict[str, dict[str, float]]) dict[str, dict[str, float]][source]
Create a unit conversion dictionary with both forward and reverse conversions.
- Parameters:
forward_conversions (dict of {str: dict of {str: float}}) – Mapping of source units to target units and conversion factors. Example: {“m”: {“cm”: 100, “km”: 0.001}}
- Returns:
Complete mapping of units including reverse conversions. Example: {“cm”: {“m”: 0.01}, “km”: {“m”: 1000}}
- Return type:
dict of {str: dict of {str: float}}
Notes
If a conversion factor is zero, a warning is printed, and the reverse conversion is skipped.
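A minimal sketch of building the reverse entries is shown below. The name generate_reverse_conversions_sketch is hypothetical, and keeping an explicitly supplied factor in preference to a computed reciprocal is an assumption.

```python
import math


def generate_reverse_conversions_sketch(forward_conversions):
    """Return the forward mapping plus reciprocal entries for each conversion."""
    complete = {src: dict(targets) for src, targets in forward_conversions.items()}
    for src, targets in forward_conversions.items():
        for dst, factor in targets.items():
            if factor == 0:
                # A zero factor has no reciprocal, so no reverse entry is added.
                print(f"Warning: zero factor for {src} -> {dst}; reverse skipped.")
                continue
            # setdefault keeps any factor that was given explicitly.
            complete.setdefault(dst, {}).setdefault(src, 1.0 / factor)
    return complete


conversions = generate_reverse_conversions_sketch({"m": {"cm": 100, "km": 0.001}})
with_zero = generate_reverse_conversions_sketch({"a": {"b": 0}})
```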
- amocatlas.tools.handle_samba_gaps(df: DataFrame, time_column: str = 'time') DataFrame[source]
Handle temporal gaps in SAMBA MOC data to prevent plotting artifacts.
SAMBA data has significant gaps (e.g., 2011-2014) that cause plotting functions to draw connecting lines across missing periods. This function creates a regular monthly grid and masks interpolation to only occur within existing data periods, preventing spurious connections across large gaps.
- Parameters:
df (pandas.DataFrame) – Input DataFrame with time and MOC columns.
time_column (str, default "time") – Name of the datetime column.
- Returns:
DataFrame with regular monthly grid and gap-aware data masking.
- Return type:
pandas.DataFrame
Notes
PyGMT and other plotting functions connect all valid (non-NaN) data points regardless of temporal gaps. This function prevents artifacts by:
1. Creating a regular monthly time grid
2. Preserving NaN values where no original data existed
3. Only interpolating within continuous data segments
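The three steps above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the library code: the name regrid_monthly_with_gaps, the value_column argument, and the max_gap_months threshold that separates a short hole from a real data gap are all hypothetical.

```python
import numpy as np
import pandas as pd


def regrid_monthly_with_gaps(df, time_column="time", value_column="moc", max_gap_months=2):
    """Put data on a monthly grid, filling short holes but keeping long gaps as NaN."""
    # Step 1: regular monthly grid (empty months become NaN).
    series = df.set_index(time_column)[value_column].resample("MS").mean()

    # Measure the length of each run of consecutive missing months.
    is_missing = series.isna()
    run_id = (is_missing != is_missing.shift()).cumsum()
    run_length = is_missing.groupby(run_id).transform("size")

    # Step 3: interpolate between existing samples only.
    filled = series.interpolate(method="time", limit_area="inside")

    # Step 2: restore NaN across gaps longer than the threshold.
    filled[is_missing & (run_length > max_gap_months)] = np.nan
    return filled.reset_index()


early = pd.date_range("2010-01-01", periods=6, freq="MS").delete(3)  # April missing
late = pd.date_range("2014-01-01", periods=3, freq="MS")
times = early.append(late)
gappy = pd.DataFrame({"time": times, "moc": np.ones(len(times))})

regridded = regrid_monthly_with_gaps(gappy).set_index("time")["moc"]
```

Because the long gap stays NaN on the regular grid, a plotting routine that breaks lines at NaN will not draw a connection across it.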
- amocatlas.tools.reformat_units_var(ds: Dataset, var_name: str, unit_format: dict[str, str] = {'S/m': 'S m-1', 'cm/s': 'cm s-1', 'degrees_Celsius': 'degree_Celsius', 'g/m^3': 'g m-3', 'm/s': 'm s-1', 'meters': 'm'}) str[source]
Reformat the units of a variable in the dataset based on a provided mapping.
- Parameters:
ds (xarray.Dataset) – The input dataset containing variables with units to be reformatted.
var_name (str) – The name of the variable whose units need to be reformatted.
unit_format (dict of {str: str}, optional) – A dictionary mapping old unit strings to new formatted unit strings. Defaults to unit_str_format.
- Returns:
The reformatted unit string. If the old unit is not found in unit_format, the original unit string is returned.
- Return type:
str
- amocatlas.tools.set_best_dtype(ds: Dataset) Dataset[source]
Adjust the data types of variables in a dataset to optimize memory usage.
- Parameters:
ds (xarray.Dataset) – The input dataset whose variables’ data types will be adjusted.
- Returns:
The dataset with updated data types for its variables, potentially saving memory.
- Return type:
xarray.Dataset
Notes
The function determines the best data type for each variable using find_best_dtype.
Attributes like valid_min and valid_max are updated to match the new data type.
If the new data type is integer-based, NaN values are replaced with a fill value.
Logs the percentage of memory saved after the data type adjustments.
- amocatlas.tools.set_fill_value(new_dtype: dtype) int[source]
Calculate the fill value for a given data type.
- Parameters:
new_dtype (numpy.dtype) – The data type for which the fill value is to be calculated.
- Returns:
The calculated fill value based on the bit-width of the data type.
- Return type:
int
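For illustration only, one convention that derives a fill value from the bit-width is the largest value of a signed integer of that width. Whether amocatlas uses this exact formula is an assumption, and the name fill_value_sketch is hypothetical.

```python
import numpy as np


def fill_value_sketch(new_dtype) -> int:
    """Largest signed integer representable in the bit-width of new_dtype."""
    bits = np.dtype(new_dtype).itemsize * 8
    return 2 ** (bits - 1) - 1


fill_int16 = fill_value_sketch(np.int16)
fill_int32 = fill_value_sketch(np.dtype("int32"))
```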
- amocatlas.tools.to_decimal_year(dates: Series) Series[source]
Convert datetime series to decimal years, handling NaN values safely.
- Parameters:
dates (pandas.Series or pandas.DatetimeIndex) – Series or Index of datetime objects to convert.
- Returns:
Series of decimal years with NaN preserved for invalid dates.
- Return type:
pandas.Series
Examples
>>> import pandas as pd
>>> from amocatlas.tools import to_decimal_year
>>> dates = pd.Series(['2020-01-01', '2020-07-01', '2021-01-01'])
>>> dates = pd.to_datetime(dates)
>>> decimal_years = to_decimal_year(dates)
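To show what a decimal-year conversion can look like, the sketch below computes the year plus the elapsed fraction of that year, accounting for leap years. The name decimal_year_sketch and the exact formula are assumptions; the library function may round or define the fraction differently.

```python
import numpy as np
import pandas as pd


def decimal_year_sketch(dates):
    """Year plus elapsed fraction of the year; NaT becomes NaN."""
    dates = pd.Series(pd.to_datetime(dates))
    days_in_year = np.where(dates.dt.is_leap_year, 366.0, 365.0)
    # Whole days since 1 January plus the fraction of the current day.
    time_of_day = (dates - dates.dt.normalize()) / pd.Timedelta(days=1)
    elapsed_days = (dates.dt.dayofyear - 1) + time_of_day
    return dates.dt.year + elapsed_days / days_in_year


years = decimal_year_sketch(pd.Series(["2020-01-01", "2020-07-01", None]))
```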