One common approach is to use a pandas MultiIndex to combine time and metadata into a single dimension.
Setup¶
%xmode minimal
import numpy as np
import pandas as pd
import xarray as xr
Creating the MultiIndex Dataset¶
We create a dataset where the time dimension has a MultiIndex with both time values and word labels.
# Create sample data
T = 1000
C = 2
times = np.linspace(0, 120, T)
data = np.random.rand(C, T)
# Define word boundaries
breaks = np.array([0, 333, 666, 1000])
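# Create word labels for each time point
# (object dtype, so that longer labels such as "green" are not truncated to 3 characters)
words = np.array(["red"] * T, dtype=object)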
words[breaks[0] : breaks[1]] = "red"
words[breaks[1] : breaks[2]] = "green"
words[breaks[2] :] = "blue"
# Create MultiIndex
mdx = pd.MultiIndex.from_arrays([words, times], names=["word", "time"])
# Create xarray Dataset
ds = xr.DataArray(data, [("C", range(C)), ("T", mdx)]).to_dataset(name="data")
ds
Selection by Word¶
We can select by word label:
ds.sel(word="red")
Limitations¶
1. Time slicing is awkward¶
Slicing by time requires specifying the word level too, or using .loc patterns. With float-valued times, slicing on the time level alone raises an error:
# This doesn't work as expected:
ds.sel(time=slice(0, 50))  # TypeError
# Need to be more explicit
# but this will always return a copy, instead of a less expensive view.
ds.sel(T=ds.T[(ds.time >= 0.15) & (ds.time <= 55.8)])
# ds.sel(T=slice(.15, 55.8))
2. Can’t isel the metadata coords¶
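You might want to select a word by position, for example the 3rd word. That is not easy with this index, because word is a level of the T index rather than a dimension of its own.
# Raises ValueError: "word" is not a dimension
ds.isel(word=0)
3. Interval boundaries are lost¶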
The MultiIndex doesn’t preserve the actual interval boundaries: we can’t easily ask “what time range does word X span?” without computing it from the data.
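As a rough sketch of what "computing it from the data" can look like, the snippet below rebuilds the same index with pandas alone and takes the first and last sampled time for each word. The name spans and the use of np.repeat to build the labels are choices made for this illustration, not part of the example above.

```python
import numpy as np
import pandas as pd

# Rebuild the index from the example above (labels built with np.repeat for brevity)
T = 1000
times = np.linspace(0, 120, T)
breaks = np.array([0, 333, 666, 1000])
words = np.repeat(["red", "green", "blue"], np.diff(breaks))
mdx = pd.MultiIndex.from_arrays([words, times], names=["word", "time"])

# First and last sampled time for each word, in order of appearance
spans = (
    mdx.to_frame(index=False)
    .groupby("word", sort=False)["time"]
    .agg(["min", "max"])
)
print(spans)
```

Note that this only recovers the first and last sample that carry each label. The true boundary between two words lies somewhere between the last sample of one word and the first sample of the next, and that information is not stored anywhere.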
4. Constrained to measurement time points¶
If metadata events happen at times not in your measurement grid, you lose that precision. For example, if you sample monthly but an event happened mid-month, you can’t represent that exactly.
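To make the loss of precision concrete, here is a small sketch using the measurement grid from the example above. The event time of 40.05 is a hypothetical value chosen to fall between two samples; with a per-sample label, the event can only be recorded at one of the neighbouring grid points.

```python
import numpy as np

times = np.linspace(0, 120, 1000)  # measurement grid from the example above
event_time = 40.05                 # hypothetical event between two samples

# Index of the first sample at or after the event
idx = np.searchsorted(times, event_time)
snapped = times[idx]

print(f"event at {event_time}, next sample at {snapped:.4f}")
print(f"precision lost: {snapped - event_time:.4f}")
```

The error is bounded by the grid spacing, so it grows as the sampling becomes coarser.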
5. Memory duplication¶
Each time point stores its word label, which is redundant information.
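The sketch below gives a rough sense of the redundancy by counting entries rather than bytes. It rebuilds the index from the example above; internally, pandas keeps one integer code per row for the word level (rather than the string itself), so the count still scales with the number of time points. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

T = 1000
times = np.linspace(0, 120, T)
breaks = np.array([0, 333, 666, 1000])
labels = ["red", "green", "blue"]
words = np.repeat(labels, np.diff(breaks))
mdx = pd.MultiIndex.from_arrays([words, times], names=["word", "time"])

# One entry per time point for the word level ...
per_point_entries = len(mdx.codes[0])
# ... versus one label per word plus the boundary positions
per_interval_entries = len(labels) + len(breaks)

print(per_point_entries, per_interval_entries)
```

An interval-based representation would only need the labels and the boundaries, independent of how densely the signal is sampled.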