StateArrayDataset

class StateArrayDataset(self, paths: pandas.DataFrame, likelihood: Likelihood, params: StateArrayParameters, path_col: str, condition_col: str = None, **kwargs)

Implements routines to run state arrays at the dataset level. Parallelizes inference across multiple files and provides visualization methods to compare different experimental conditions.

The structure of the SPT dataset is specified with the paths argument. This is a pandas.DataFrame that encodes the path and experimental condition for all files in the dataset. Only two columns in this DataFrame are recognized:

  • path_col (required): encodes the path to each SPT trajectory file

  • condition_col (optional): encodes the experimental condition to which that file belongs

If the DataFrame contains other columns, they are ignored.

An example is provided by the file experiment_conditions.csv under the examples folder in the saSPT repo. The path_col is filepath and condition_col is condition:

$ cat experiment_conditions.csv | head
filepath,condition
u2os_ht_nls_7.48ms/region_0_7ms_trajs.csv,HaloTag-NLS
u2os_ht_nls_7.48ms/region_10_7ms_trajs.csv,HaloTag-NLS
...

Running from the saspt/examples directory, we can construct a StateArrayDataset as follows:

import pandas as pd
from saspt import StateArrayDataset, RBME

# Load the paths DataFrame
paths = pd.read_csv("experiment_conditions.csv")

# Settings for state array inference
settings = dict(
    likelihood_type = RBME,   # type of likelihood function to use
    pixel_size_um = 0.16,     # camera pixel size in microns
    frame_interval = 0.00748, # frame interval in seconds
    focal_depth = 0.7,        # objective focal depth in microns
    path_col = 'filepath',    # column in *paths* encoding file path
    condition_col='condition',# column in *paths* encoding
                              # experimental condition
    num_workers = 4,          # parallel processes to use
    progress_bar = True,      # show progress
)

# Make a StateArrayDataset with these settings
with StateArrayDataset.from_kwargs(paths, **settings) as SAD:
    print(SAD)

The output of this script is:

StateArrayDataset:
    likelihood_type    : rbme
    shape              : (101, 36)
    n_files            : 22
    path_col           : filepath
    condition_col      : condition
    conditions         : ['HaloTag-NLS' 'RARA-HaloTag']

StateArrayDataset implements several methods to get information about a dataset. Continuing with the example above,

# Make a StateArrayDataset with these settings
with StateArrayDataset.from_kwargs(paths, **settings) as SAD:

    # Save some statistics on each SPT file, including the number
    # of trajectories, trajectory length, etc.
    SAD.raw_track_statistics.to_csv("raw_track_statistics.csv", index=False)

    # Save the posterior state occupations to a file
    SAD.marginal_posterior_occs_dataframe.to_csv(
        "marginal_posterior_distribution.csv", index=False)

    # Make some plots comparing the naive state occupations across all
    # files in this dataset
    SAD.naive_heat_map("naive_heat_map.png")
    SAD.naive_line_plot("naive_line_plot.png")

    # Make some plots comparing the posterior state occupations across all
    # files in this dataset
    SAD.posterior_heat_map("posterior_heat_map.png")
    SAD.posterior_line_plot("posterior_line_plot.png")

    # Do some kind of specific calculation; for example, calculate the
    # fraction of particles with diffusion coefficients less than 1.0
    # in each file
    print(SAD.marginal_posterior_occs[:,SAD.diff_coefs<1.0].sum(axis=1))

Warning

An important parameter when constructing the StateArrayDataset is num_workers, the number of parallel processes. This should not be set higher than the number of CPUs you have access to. Otherwise you’ll suffer performance drops.

Parameters
  • paths (pandas.DataFrame) –

  • likelihood_type (str) –

  • path_col (str) –

  • condition_col (str) –

  • pixel_size_um (float) –

  • frame_interval (float) –

  • focal_depth (float) –

  • num_workers (int) –

  • progress_bar (bool) –

classmethod from_kwargs(cls, paths: pandas.DataFrame, likelihood_type: str, path_col: str, condition_col: str = None, **kwargs)
Parameters
  • paths (pandas.DataFrame) –

  • likelihood_type (str) –

  • path_col (str) –

  • condition_col (str) –

  • pixel_size_um (float) –

  • frame_interval (float) –

  • focal_depth (float) –

  • num_workers (int) –

  • progress_bar (bool) –

Returns

new instance of StateArrayDataset

likelihood: Likelihood

The likelihood function used by all of the state arrays in this StateArrayDataset.

property n_files: int

Total number of files in this StateArrayDataset.

property shape: Tuple[int]
property likelihood_type: str

Name of the likelihood function; equivalent to StateArrayDataset.likelihood.name

property n_diff_coefs: int

Number of distinct diffusion coefficients in the parameter grid corresponding to this Likelihood function.

If self.likelihood does not use diffusion coefficient as a parameter, is 0.

property jumps_per_file: numpy.ndarray

Shape (n_files,), the number of observed jumps in each file (after preprocessing).

property raw_track_statistics: pandas.DataFrame

Raw trajectory statistics for this dataset. Each row of the DataFrame corresponds to one file, and each column to an attribute of that file.

Continuing with the previous example,

>>> with StateArray(paths, **settings) as SAD:
...     track_stats = SAD.raw_track_statistics

>>> print(track_stats.columns)
Index(['n_tracks', 'n_jumps', 'n_detections', 'mean_track_length',
   'max_track_length', 'fraction_singlets', 'fraction_unassigned',
   'mean_jumps_per_track', 'mean_detections_per_frame',
   'max_detections_per_frame', 'fraction_of_frames_with_detections',
   'filepath', 'condition'],
  dtype='object')

>>> print(track_stats[['mean_track_length', 'fraction_singlets', 'condition']])
    mean_track_length  fraction_singlets     condition
0            1.636783           0.839129   HaloTag-NLS
1            2.075513           0.666734   HaloTag-NLS
2            1.784457           0.746812   HaloTag-NLS
3            1.986613           0.675709   HaloTag-NLS
4            2.004172           0.698274   HaloTag-NLS
..                ...                ...           ...
17           3.881071           0.571429  RARA-HaloTag
18           3.826364           0.557824  RARA-HaloTag
19           3.423219           0.591547  RARA-HaloTag
20           3.682189           0.536682  RARA-HaloTag
21           3.520319           0.595750  RARA-HaloTag

[22 rows x 3 columns]

Notice that the trajectories from the HaloTag-NLS conditions are shorter and more likely to be singlets than the trajectories in the RARA-HaloTag condition.

property processed_track_statistics: pandas.DataFrame

Trajectory statistics for this dataset after preprocessing. Each row of the DataFrame corresponds to one file, and each column to an attribute of that file.

This is identical in form to raw_track_statistics.

property marginal_naive_occs: numpy.ndarray

Shape (n_files, n_diff_coefs), naive state occupations for each file marginalized on diffusion coefficient.

property marginal_posterior_occs: numpy.ndarray

Shape (n_files, n_diff_coefs), posterior state occupations for each file marginalized on diffusion coefficient.

property marginal_posterior_occs_dataframe: pandas.DataFrame

pandas.DataFrame representation of marginal_posterior_occs. Each row corresponds to a single state in a single file, so that the total number of rows is equal to n_files * n_diff_coefs.

Continuing the example from above,

>>> SAD = StateArrayDataset.from_kwargs(paths, **settings)
>>> diff_coefs = SAD.diff_coefs
>>> df = SAD.marginal_posterior_occs_dataframe

# Calculate the estimated fraction of trajectories with diffusion
# coefficients below 0.1 µm2/sec for all files in this dataset
>>> print(df.loc[df['diff_coef'] < 0.1].groupby('filepath')['posterior_occupation'].sum())
filepath
u2os_ht_nls_7.48ms/region_0_7ms_trajs.csv     0.173923
u2os_ht_nls_7.48ms/region_10_7ms_trajs.csv    0.067899
u2os_ht_nls_7.48ms/region_1_7ms_trajs.csv     0.165322
u2os_ht_nls_7.48ms/region_2_7ms_trajs.csv     0.020263
u2os_ht_nls_7.48ms/region_3_7ms_trajs.csv     0.101379
                                                ...
u2os_rara_ht_7.48ms/region_5_7ms_trajs.csv    0.364910
u2os_rara_ht_7.48ms/region_6_7ms_trajs.csv    0.430909
u2os_rara_ht_7.48ms/region_7_7ms_trajs.csv    0.426619
u2os_rara_ht_7.48ms/region_8_7ms_trajs.csv    0.350441
u2os_rara_ht_7.48ms/region_9_7ms_trajs.csv    0.553296
Name: posterior_occupation, Length: 22, dtype: float64
clear(self)

Clear all cached attributes.

apply_by(self, col: str, func: Callable, is_variadic: bool = False, **kwargs)

Apply a function in parallel to groups of files identified by a common value in self.paths[col]. Essentially equivalent to a parallel version of self.paths.groupby(col)[self.path_col].apply(func).

func should have the signature func(paths: List[str], **kwargs) if is_variadic == False, or func(*paths: str, **kwargs) if is_variadic == True.

Parameters
  • col (str) – a column in self.paths to group files by

  • func (Callable) – function to apply to each file group

  • is_variadic (bool) – func is a variadic function

  • kwargs – additional keyword arguments to func

Returns

result (list), group_names (List[str])

infer_posterior_by_condition(self, col: str, normalize: bool = False)

Group files by common values of self.paths[col] and infer the marginal posterior occupations for each file group.

Parameters
  • col (str) – column in self.paths to group by

  • normalize (bool) – normalize the posterior occupations over all states for each file

Returns

posterior_occs (numpy.ndarray), group_names (List[str]). posterior_occs is a 2D array of shape (n_groups, n_diff_coefs) with the marginal posterior occupations for each file group, and group_names is the names of each group.

calc_marginal_naive_occs(self, *track_paths: str) numpy.ndarray:

Calculate the naive state occupations (marginalized on diffusion coefficient) for one or more files. If multiple files are passed, runs on the concatenation of the detections across files.

If you want to infer the marginal naive occupations for all of the files in this StateArrayDataset, use StateArrayDataset.marginal_naive_occs instead.

Parameters

track_paths (str) – full paths to one or more files with detections

Returns

naive_occs (numpy.ndarray), 1D array of shape (n_diff_coefs,), the marginal naive state occupations

calc_marginal_posterior_occs(self, *track_paths: str) numpy.ndarray:

Calculate the posterior state occupations (marginalized on diffusion coefficient) for one or more files. If multiple files are passed, runs on the concatenation of the detections across files.

If you want to infer the marginal naive occupations for all of the files in this StateArrayDataset, use StateArrayDataset.marginal_naive_occs instead.

Parameters

track_paths (str) – full paths to one or more files with detections

Returns

naive_occs (numpy.ndarray), 1D array of shape (n_diff_coefs,), the marginal posterior state occupations

naive_heat_map(self, out_png: str, normalize: bool = True, order_by_size: bool = True, **kwargs)

Naive state occupations, marginalized on diffusion coefficient, shown as a heat map. Groups by condition.

With normalize = True:

../_images/example_naive_heat_map.png
Parameters
  • out_png (str) – save path for this plot

  • normalize (bool) – normalize the state occupations for each file in the dataset. If False, the intensity for each file is proportional to the number of jumps observed in that SPT experiment.

  • order_by_size (bool) – within each condition group, order the files by decreasing number of observed jumps.

  • kwargs – additional kwargs to the plotting function

naive_line_plot(self, out_png: str, **kwargs)

Naive state occupations, marginalized on diffusion coefficient, shown as a line plot. Groups by condition.

../_images/example_naive_line_plot.png
Parameters
  • out_png (str) – save path for this plot

  • kwargs – additional kwargs to the plotting function

posterior_heat_map(self, out_png: str, normalize: bool = True, order_by_size: bool = True, **kwargs)

Posterior mean state occupations, marginalized on diffusion coefficient, shown as a heat map. Groups by condition.

../_images/example_posterior_heat_map.png
Parameters
  • out_png (str) – save path for this plot

  • normalize (bool) – normalize the state occupations for each file in the dataset. If False, the intensity for each file is proportional to the number of jumps observed in that SPT experiment.

  • order_by_size (bool) – within each condition group, order the files by decreasing number of observed jumps.

  • kwargs – additional kwargs to the plotting function

posterior_line_plot(self, out_png: str, **kwargs)

Posterior mean state occupations, marginalized on diffusion coefficient, shown as a line plot. Groups by condition.

../_images/example_posterior_line_plot.png
Parameters
  • out_png (str) – save path for this plot

  • kwargs – additional kwargs to the plotting function