StateArrayDataset

class StateArrayDataset(self, paths: pandas.DataFrame, likelihood: Likelihood, params: StateArrayParameters, path_col: str, condition_col: str = None, **kwargs)

Implements routines to run state arrays at the dataset level. Parallelizes inference across multiple files and provides visualization methods to compare different experimental conditions.

The structure of the SPT dataset is specified with the paths argument. This is a pandas.DataFrame that encodes the path and experimental condition for all files in the dataset. Only two columns in this DataFrame are recognized:

path_col (required): encodes the path to each SPT trajectory file

condition_col (optional): encodes the experimental condition to which that file belongs

If the DataFrame contains other columns, they are ignored.

An example is provided by the file experiment_conditions.csv under the examples folder in the saSPT repo. The path_col is filepath and condition_col is condition:

$ cat experiment_conditions.csv | head
filepath,condition
u2os_ht_nls_7.48ms/region_0_7ms_trajs.csv,HaloTag-NLS
u2os_ht_nls_7.48ms/region_10_7ms_trajs.csv,HaloTag-NLS
...

Running from the saspt/examples directory, we can construct a StateArrayDataset as follows:

import pandas as pd
from saspt import StateArrayDataset, RBME

# Load the paths DataFrame
paths = pd.read_csv("experiment_conditions.csv")

# Settings for state array inference
settings = dict(
    likelihood_type = RBME,   # type of likelihood function to use
    pixel_size_um = 0.16,     # camera pixel size in microns
    frame_interval = 0.00748, # frame interval in seconds
    focal_depth = 0.7,        # objective focal depth in microns
    path_col = 'filepath',    # column in *paths* encoding file path
    condition_col='condition',# column in *paths* encoding
                              # experimental condition
    num_workers = 4,          # parallel processes to use
    progress_bar = True,      # show progress
)

# Make a StateArrayDataset with these settings
with StateArrayDataset.from_kwargs(paths, **settings) as SAD:
    print(SAD)

The output of this script is:

StateArrayDataset:
    likelihood_type    : rbme
    shape              : (101, 36)
    n_files            : 22
    path_col           : filepath
    condition_col      : condition
    conditions         : ['HaloTag-NLS' 'RARA-HaloTag']

StateArrayDataset implements several methods to get information about a dataset. Continuing with the example above,

# Make a StateArrayDataset with these settings
with StateArrayDataset.from_kwargs(paths, **settings) as SAD:

    # Save some statistics on each SPT file, including the number
    # of trajectories, trajectory length, etc.
    SAD.raw_track_statistics.to_csv("raw_track_statistics.csv", index=False)

    # Save the posterior state occupations to a file
    SAD.marginal_posterior_occs_dataframe.to_csv(
        "marginal_posterior_distribution.csv", index=False)

    # Make some plots comparing the naive state occupations across all
    # files in this dataset
    SAD.naive_heat_map("naive_heat_map.png")
    SAD.naive_line_plot("naive_line_plot.png")

    # Make some plots comparing the posterior state occupations across all
    # files in this dataset
    SAD.posterior_heat_map("posterior_heat_map.png")
    SAD.posterior_line_plot("posterior_line_plot.png")

    # Do some kind of specific calculation; for example, calculate the
    # fraction of particles with diffusion coefficients less than 1.0
    # in each file
    print(SAD.marginal_posterior_occs[:,SAD.diff_coefs<1.0].sum(axis=1))

Warning

An important parameter when constructing the StateArrayDataset is num_workers, the number of parallel processes. This should not be set higher than the number of CPUs you have access to. Otherwise you’ll suffer performance drops.

Parameters

paths (pandas.DataFrame) –
likelihood_type (str) –
path_col (str) –
condition_col (str) –
pixel_size_um (float) –
frame_interval (float) –
focal_depth (float) –
num_workers (int) –
progress_bar (bool) –

classmethod from_kwargs(cls, paths: pandas.DataFrame, likelihood_type: str, path_col: str, condition_col: str = None, **kwargs)

Parameters

paths (pandas.DataFrame) –
likelihood_type (str) –
path_col (str) –
condition_col (str) –
pixel_size_um (float) –
frame_interval (float) –
focal_depth (float) –
num_workers (int) –
progress_bar (bool) –

Returns

new instance of StateArrayDataset

likelihood: Likelihood: The likelihood function used by all of the state arrays in this StateArrayDataset.

property n_files: int: Total number of files in this StateArrayDataset.

property shape: Tuple[int]

property likelihood_type: str: Name of the likelihood function; equivalent to StateArrayDataset.likelihood.name

property n_diff_coefs: int

Number of distinct diffusion coefficients in the parameter grid corresponding to this Likelihood function.

If self.likelihood does not use diffusion coefficient as a parameter, is 0.

property jumps_per_file: numpy.ndarray: Shape (n_files,), the number of observed jumps in each file (after preprocessing).

property raw_track_statistics: pandas.DataFrame

Raw trajectory statistics for this dataset. Each row of the DataFrame corresponds to one file, and each column to an attribute of that file.

Continuing with the previous example,

>>> with StateArray(paths, **settings) as SAD:
...     track_stats = SAD.raw_track_statistics

>>> print(track_stats.columns)
Index(['n_tracks', 'n_jumps', 'n_detections', 'mean_track_length',
   'max_track_length', 'fraction_singlets', 'fraction_unassigned',
   'mean_jumps_per_track', 'mean_detections_per_frame',
   'max_detections_per_frame', 'fraction_of_frames_with_detections',
   'filepath', 'condition'],
  dtype='object')

>>> print(track_stats[['mean_track_length', 'fraction_singlets', 'condition']])
    mean_track_length  fraction_singlets     condition
0            1.636783           0.839129   HaloTag-NLS
1            2.075513           0.666734   HaloTag-NLS
2            1.784457           0.746812   HaloTag-NLS
3            1.986613           0.675709   HaloTag-NLS
4            2.004172           0.698274   HaloTag-NLS
..                ...                ...           ...
17           3.881071           0.571429  RARA-HaloTag
18           3.826364           0.557824  RARA-HaloTag
19           3.423219           0.591547  RARA-HaloTag
20           3.682189           0.536682  RARA-HaloTag
21           3.520319           0.595750  RARA-HaloTag

[22 rows x 3 columns]

Notice that the trajectories from the HaloTag-NLS conditions are shorter and more likely to be singlets than the trajectories in the RARA-HaloTag condition.

property processed_track_statistics: pandas.DataFrame

Trajectory statistics for this dataset after preprocessing. Each row of the DataFrame corresponds to one file, and each column to an attribute of that file.

This is identical in form to raw_track_statistics.

property marginal_naive_occs: numpy.ndarray: Shape (n_files, n_diff_coefs), naive state occupations for each file marginalized on diffusion coefficient.

property marginal_posterior_occs: numpy.ndarray: Shape (n_files, n_diff_coefs), posterior state occupations for each file marginalized on diffusion coefficient.

property marginal_posterior_occs_dataframe: pandas.DataFrame

pandas.DataFrame representation of marginal_posterior_occs. Each row corresponds to a single state in a single file, so that the total number of rows is equal to n_files * n_diff_coefs.

Continuing the example from above,

>>> SAD = StateArrayDataset.from_kwargs(paths, **settings)
>>> diff_coefs = SAD.diff_coefs
>>> df = SAD.marginal_posterior_occs_dataframe

# Calculate the estimated fraction of trajectories with diffusion
# coefficients below 0.1 µm2/sec for all files in this dataset
>>> print(df.loc[df['diff_coef'] < 0.1].groupby('filepath')['posterior_occupation'].sum())
filepath
u2os_ht_nls_7.48ms/region_0_7ms_trajs.csv     0.173923
u2os_ht_nls_7.48ms/region_10_7ms_trajs.csv    0.067899
u2os_ht_nls_7.48ms/region_1_7ms_trajs.csv     0.165322
u2os_ht_nls_7.48ms/region_2_7ms_trajs.csv     0.020263
u2os_ht_nls_7.48ms/region_3_7ms_trajs.csv     0.101379
                                                ...
u2os_rara_ht_7.48ms/region_5_7ms_trajs.csv    0.364910
u2os_rara_ht_7.48ms/region_6_7ms_trajs.csv    0.430909
u2os_rara_ht_7.48ms/region_7_7ms_trajs.csv    0.426619
u2os_rara_ht_7.48ms/region_8_7ms_trajs.csv    0.350441
u2os_rara_ht_7.48ms/region_9_7ms_trajs.csv    0.553296
Name: posterior_occupation, Length: 22, dtype: float64

clear(self): Clear all cached attributes.

apply_by(self, col: str, func: Callable, is_variadic: bool = False, **kwargs)

Apply a function in parallel to groups of files identified by a common value in self.paths[col]. Essentially equivalent to a parallel version of self.paths.groupby(col)[self.path_col].apply(func).

func should have the signature func(paths: List[str], **kwargs) if is_variadic == False, or func(*paths: str, **kwargs) if is_variadic == True.

Parameters

col (str) – a column in self.paths to group files by
func (Callable) – function to apply to each file group
is_variadic (bool) – func is a variadic function
kwargs – additional keyword arguments to func

Returns

result (list), group_names (List[str])

infer_posterior_by_condition(self, col: str, normalize: bool = False)

Group files by common values of self.paths[col] and infer the marginal posterior occupations for each file group.

Parameters

col (str) – column in self.paths to group by
normalize (bool) – normalize the posterior occupations over all states for each file

Returns

posterior_occs (numpy.ndarray), group_names (List[str]). posterior_occs is a 2D array of shape (n_groups, n_diff_coefs) with the marginal posterior occupations for each file group, and group_names is the names of each group.

calc_marginal_naive_occs(self, *track_paths: str) → numpy.ndarray:

Calculate the naive state occupations (marginalized on diffusion coefficient) for one or more files. If multiple files are passed, runs on the concatenation of the detections across files.

If you want to infer the marginal naive occupations for all of the files in this StateArrayDataset, use StateArrayDataset.marginal_naive_occs instead.

Parameters: track_paths (str) – full paths to one or more files with detections
Returns: naive_occs (numpy.ndarray), 1D array of shape (n_diff_coefs,), the marginal naive state occupations

calc_marginal_posterior_occs(self, *track_paths: str) → numpy.ndarray:

Calculate the posterior state occupations (marginalized on diffusion coefficient) for one or more files. If multiple files are passed, runs on the concatenation of the detections across files.

If you want to infer the marginal naive occupations for all of the files in this StateArrayDataset, use StateArrayDataset.marginal_naive_occs instead.

Parameters: track_paths (str) – full paths to one or more files with detections
Returns: naive_occs (numpy.ndarray), 1D array of shape (n_diff_coefs,), the marginal posterior state occupations

naive_heat_map(self, out_png: str, normalize: bool = True, order_by_size: bool = True, **kwargs)

Naive state occupations, marginalized on diffusion coefficient, shown as a heat map. Groups by condition.

With normalize = True:

Parameters

out_png (str) – save path for this plot
normalize (bool) – normalize the state occupations for each file in the dataset. If False, the intensity for each file is proportional to the number of jumps observed in that SPT experiment.
order_by_size (bool) – within each condition group, order the files by decreasing number of observed jumps.
kwargs – additional kwargs to the plotting function

naive_line_plot(self, out_png: str, **kwargs)

Naive state occupations, marginalized on diffusion coefficient, shown as a line plot. Groups by condition.

Parameters

out_png (str) – save path for this plot
kwargs – additional kwargs to the plotting function

posterior_heat_map(self, out_png: str, normalize: bool = True, order_by_size: bool = True, **kwargs)

Posterior mean state occupations, marginalized on diffusion coefficient, shown as a heat map. Groups by condition.

../_images/example_posterior_heat_map.png

Parameters

out_png (str) – save path for this plot
normalize (bool) – normalize the state occupations for each file in the dataset. If False, the intensity for each file is proportional to the number of jumps observed in that SPT experiment.
order_by_size (bool) – within each condition group, order the files by decreasing number of observed jumps.
kwargs – additional kwargs to the plotting function

posterior_line_plot(self, out_png: str, **kwargs)

Posterior mean state occupations, marginalized on diffusion coefficient, shown as a line plot. Groups by condition.

../_images/example_posterior_line_plot.png

Parameters

out_png (str) – save path for this plot
kwargs – additional kwargs to the plotting function