# Quickstart

Note

Before running, see Install for installation instructions.

Note

Want more code and less prose? Check out examples.py, a simple executable that demos most of stuff in this Quickstart.

This is a quick guide to getting started with `saspt`

. It assumes you’re familiar
with single particle tracking (SPT), have seen mixture models before, and have
installed `saspt`

.
For a more detailed explanation of `saspt`

’s purpose, model, and range of applicability,
see Background.

## Run state arrays on a single SPT experiment

We’ll use the sample set of trajectories that comes with `saspt`

:

```
>>> import pandas as pd, numpy as np, matplotlib.pyplot as plt
>>> from saspt import sample_detections, StateArray, RBME
>>> detections = sample_detections()
```

The expected format for input trajectories is described under Note on input format.
**Importantly, the units for the XY coordinates are in pixels, not microns.**

Next, we’ll set the parameters for a state array that uses mixtures of regular Brownian motions with localization error (RBMEs):

```
>>> settings = dict(
... likelihood_type = RBME,
... pixel_size_um = 0.122,
... frame_interval = 0.01,
... focal_depth = 0.7,
... progress_bar = True,
... )
```

Only three parameters are actually required:

`likelihood_type`

: the type of physical model to use

`pixel_size_um`

: size of camera pixels after magnification in microns

`frame_interval`

: time between frames in seconds

Additionally, we’ve set the `focal_depth`

, which accounts for the finite focal depth of
high-NA objectives used for SPT experiments, and `progress_bar`

, which shows progress
during inference. A full list of parameters and their meaning is described at StateArrayParameters.

Construct a `StateArray`

with these settings:

```
>>> SA = StateArray.from_detections(detections, **settings)
```

If we run `print(SA)`

, we get

```
StateArray:
likelihood_type : rbme
n_tracks : 1130
n_jumps : 3933
parameter_names : ('diff_coef', 'loc_error')
shape : (100, 36)
```

This means that this state array infers occupations on a 2D parameter grid of diffusion coefficient
(`diff_coef`

) and localization error (`loc_error`

), using 1130 trajectories. The shape of the
parameter grid is `(100, 36)`

, meaning that the grid uses 100 distinct diffusion coefficients
and 36 distinct localization errors (the default). These define the range of physical models that can be
described with this state array. We can get the values of these parameters using the
`StateArray.parameter_values`

attribute:

```
>>> diff_coefs, loc_errors = SA.parameter_values
>>> print(diff_coefs.shape)
(100,)
>>> print(loc_errors.shape)
(36,)
```

The `StateArray`

object provides two estimates of the state occupations at each point on
this parameter grid:

The “naive” estimate, a quick and dirty estimate from the raw likelihood function

The “posterior” estimate, which uses the full state array model

The posterior estimate is more precise than the naive estimate, but also requires more trajectories and time. The more trajectories are present in the input, the more precise the posterior estimate becomes.

The `StateArray`

object provides a built-in plot to compare the naive and posterior
estimates:

```
>>> SA.plot_occupations("rbme_occupations.png")
```

The plot will look something like this:

The bottom row shows the posterior occupations marginalized on diffusion coefficient. This is a simple and powerful mechanism to account for the influence of localization error.

In this case, the state array identified a dominant diffusive state with a diffusion coefficient of about 5 \(\mu \text{m}^{2}\)/sec. We can also see a less-populated state between about 1 and 3 \(\mu \text{m}^{2}\)/sec, and some very slow particles with diffusion coefficients in the range 0.01 to 0.1 \(\mu \text{m}^{2}\)/sec.

We can retrieve the raw arrays used in this plot via the `naive_occs`

and `posterior_occs`

attributes. Both are arrays defined on the same grid of diffusion coefficient vs. localization error:

```
>>> naive_occs = SA.naive_occs
>>> posterior_occs = SA.posterior_occs
>>> print(naive_occs.shape)
(100, 36)
>>> print(posterior_occs.shape)
(100, 36)
```

Along with the state occupations, the `StateArray`

object also infers the
probabilities of each *trajectory-state assignment*. As with the state occupations,
the trajectory-state assignment probabilities have both “naive” and “posterior”
versions that we can compare:

```
>>> naive_assignment_probabilities = SA.naive_assignment_probabilities
>>> posterior_assignment_probabilities = SA.posterior_assignment_probabilities
>>> print(naive_assignment_probabilities.shape)
(100, 36, 1130)
>>> print(posterior_assignment_probabilities.shape)
(100, 36, 1130)
```

Notice that these arrays have one element per point in our 100-by-36 parameter grid and per trajectory. For example, the marginal probability that trajectory 100 has each of the 100 diffusion coefficients is:

```
>>> posterior_assignment_probabilities[:,:,100].sum(axis=1)
```

`StateArray`

provides a plot to compare the naive and posterior assignment
probabilities across all trajectories:

```
>>> SA.plot_assignment_probabilities('rbme_assignment_probabilities.png')
```

Each column in this plot represents a single trajectory, and the rows represent the probability of the trajectories having a particular diffusion coefficient. (The trajectories are sorted by their posterior mean diffusion coefficient.)

- There are also a couple of related plots (not illustrated here):
`saspt.StateArray.plot_temporal_assignment_probabilities()`

: shows the assignment probabilities as a function of the frame(s) in which the respective trajectories were found`saspt.StateArray.plot_spatial_assignment_probabilities()`

: shows the assignment probabilities as a function of the spatial location of the component detections

Finally, `StateArray`

provides the naive and posterior state occupations as a
`pandas.DataFrame`

:

```
>>> occupations = SA.occupations_dataframe
>>> print(occupations)
diff_coef loc_error naive_occupation mean_posterior_occupation
0 0.01 0.000 0.000017 0.000009
1 0.01 0.002 0.000017 0.000008
2 0.01 0.004 0.000016 0.000008
... ... ... ... ...
3597 100.00 0.066 0.000042 0.000014
3598 100.00 0.068 0.000041 0.000014
3599 100.00 0.070 0.000041 0.000014
[3600 rows x 4 columns]
```

Each row corresponds to a single point on the parameter grid. For instance, if we wanted to get the probability that a particle has a diffusion coefficient less than 0.1 \(\mu \text{m}^{2}\)/sec, we could do:

```
>>> selected = occupations['diff_coef'] < 0.1
>>> naive_estimate = occupations.loc[selected, 'naive_occupation'].sum()
>>> posterior_estimate = occupations.loc[selected, 'mean_posterior_occupation'].sum()
>>> print(naive_estimate)
0.24171198737935867
>>> print(posterior_estimate)
0.2779671727562628
```

In this case, the naive and posterior estimates are quite similar.

## Run state arrays on a SPT dataset

Often we want to run state arrays on more than one SPT experiment and compare the
output between experimental conditions. The `StateArrayDataset`

object is intended to
be a simple solution that provides:

methods to parallelize state array inference across multiple SPT experiments

outputs and visualizations to help compare between experimental conditions

In this example, we’ll use an example
from the saspt repo.
You can follow along by cloning the `saspt`

repo and navigating to
the `examples`

subdirectory:

```
$ git clone https://github.com/alecheckert/saspt.git
$ cd saspt/examples
$ ls -1
examples.py
experiment_conditions.csv
u2os_ht_nls_7.48ms
u2os_rara_ht_7.48ms
```

- The
`examples`

subdirectory contains a small SPT dataset where two proteins have been tracked: `HT-NLS`

: HaloTag (HT) fused to a nuclear localization signal (NLS), labeled with the photoactivatable fluorescent dye PA-JFX549`RARA-HT`

: retinoic acid receptor \(\alpha\) (RARA) fused to HaloTag (HT), labeled with the photoactivatable fluorescent dye PA-JFX549

Each protein has 11 SPT experiments, stored as CSV files in the `examples/u2os_ht_nls_7.48ms`

and
`examples/u2os_rara_ht_7.48ms`

subdirectories. We also have a registry file (`experiment_conditions.csv`

) that contains the assignment of each file to an experimental condition:

```
>>> paths = pd.read_csv('experiment_conditions.csv')
```

In this case, we have two columns: `filepath`

encodes the path to the CSV corresponding
to each SPT experiment, while `condition`

encodes the experimental condition. (It doesn’t
actually matter what these are named as long as they match the `path_col`

and `condition_col`

parameters below.)

```
>>> print(paths)
filepath condition
0 u2os_ht_nls_7.48ms/region_0_7ms_trajs.csv HaloTag-NLS
1 u2os_ht_nls_7.48ms/region_10_7ms_trajs.csv HaloTag-NLS
2 u2os_ht_nls_7.48ms/region_1_7ms_trajs.csv HaloTag-NLS
.. ... ...
19 u2os_rara_ht_7.48ms/region_7_7ms_trajs.csv RARA-HaloTag
20 u2os_rara_ht_7.48ms/region_8_7ms_trajs.csv RARA-HaloTag
21 u2os_rara_ht_7.48ms/region_9_7ms_trajs.csv RARA-HaloTag
[22 rows x 2 columns]
```

Specify some parameters related to this analysis:

```
>>> settings = dict(
... likelihood_type = RBME,
... pixel_size_um = 0.16,
... frame_interval = 0.00748,
... focal_depth = 0.7,
... path_col = 'filepath',
... condition_col = 'condition',
... progress_bar = True,
... num_workers = 6,
... )
```

Warning

The `num_workers`

attribute specifies the number of parallel processes to use when
running inference. Don’t set this higher than the number of CPUs on your computer, or
you’re likely to suffer performance hits.

Create a `StateArrayDataset`

with these settings:

```
>>> from saspt import StateArrayDataset
>>> SAD = StateArrayDataset.from_kwargs(paths, **settings)
```

If you do `print(SAD)`

, you’ll get some basic info on this dataset:

```
>>> print(SAD)
StateArrayDataset:
likelihood_type : rbme
shape : (100, 36)
n_files : 22
path_col : filepath
condition_col : condition
conditions : ['HaloTag-NLS' 'RARA-HaloTag']
```

We can get more detailed information on these experiments (such as the detection density,
mean trajectory length, etc.) by accessing the `raw_track_statistics`

attribute:

```
>>> stats = SAD.raw_track_statistics
>>> print(stats)
n_tracks n_jumps ... filepath condition
0 2387 1520 ... u2os_ht_nls_7.48ms/region_0_7ms_trajs.csv HaloTag-NLS
1 4966 5341 ... u2os_ht_nls_7.48ms/region_10_7ms_trajs.csv HaloTag-NLS
2 3294 2584 ... u2os_ht_nls_7.48ms/region_1_7ms_trajs.csv HaloTag-NLS
.. ... ... ... ... ...
19 5418 13129 ... u2os_rara_ht_7.48ms/region_7_7ms_trajs.csv RARA-HaloTag
20 9814 26323 ... u2os_rara_ht_7.48ms/region_8_7ms_trajs.csv RARA-HaloTag
21 7530 18978 ... u2os_rara_ht_7.48ms/region_9_7ms_trajs.csv RARA-HaloTag
[22 rows x 13 columns]
>>> print(stats.columns)
Index(['n_tracks', 'n_jumps', 'n_detections', 'mean_track_length',
'max_track_length', 'fraction_singlets', 'fraction_unassigned',
'mean_jumps_per_track', 'mean_detections_per_frame',
'max_detections_per_frame', 'fraction_of_frames_with_detections',
'filepath', 'condition'],
dtype='object')
```

To get the naive and posterior state occupations for each file in this dataset:

```
>>> marginal_naive_occs = SAD.marginal_naive_occs
>>> marginal_posterior_occs = SAD.marginal_posterior_occs
>>> print(marginal_naive_occs.shape)
>>> print(marginal_posterior_occs.shape)
```

Note

It can take a few minutes to compute the posterior occupations for a dataset of
this size. If you need a quick estimate for a test, try reducing the `max_iter`

or `sample_size`

parameters.

These occupations are “marginal” in the sense that they’ve been marginalized onto the
parameter of interest in most SPT experiments: the diffusion coefficient. (You can
get the original, unmarginalized occupations via the `StateArrayDataset.posterior_occs`

and `StateArrayDataset.naive_occs`

attributes.)

The same information is also provided as a `pandas.DataFrame`

:

```
>>> occupations = SAD.marginal_posterior_occs_dataframe
```

For example, imagine we want to calculate the posterior probability that a particle had a diffusion coefficient less than 0.5 \(\mu\text{m}^{2}\)/sec for each file. We could do this by taking

```
>>> print(occupations.loc[occupations['diff_coef'] < 0.5].groupby(
... 'filepath')['mean_posterior_occupation'].sum())
filepath
u2os_ht_nls_7.48ms/region_0_7ms_trajs.csv 0.188782
u2os_ht_nls_7.48ms/region_10_7ms_trajs.csv 0.103510
u2os_ht_nls_7.48ms/region_1_7ms_trajs.csv 0.091148
...
u2os_rara_ht_7.48ms/region_7_7ms_trajs.csv 0.579444
u2os_rara_ht_7.48ms/region_8_7ms_trajs.csv 0.553111
u2os_rara_ht_7.48ms/region_9_7ms_trajs.csv 0.650187
Name: posterior_occupation, dtype: float64
```

The `StateArrayDataset`

provides a few plots to visualize these occupations:

```
>>> SAD.posterior_heat_map('posterior_heat_map.png')
```

Notice that the two kinds of proteins have different diffusive profiles: HaloTag-NLS occupies a narrow range of diffusion coefficients centered around 10 \(\mu \text{m}^{2}\)/sec, while RARA-HaloTag has a much broader range of free diffusion coefficients with a substantial immobile fraction (showing up at the lower end of the diffusion coefficient range).

The heat map plot is useful to judge how consistent the result is across SPT experiments in the same condition. We can also compare the variability using an alternative line plot representation:

```
>>> SAD.posterior_line_plot('posterior_line_plot.png')
```

```
>>> SAD.naive_heat_map('naive_heat_map.png')
```

Notice that the information provided by the naive occupations is qualitatively similar but less precise than the posterior occupations.

```
>>> SAD.naive_line_plot('naive_line_plot.png')
```

Additionally, rather than performing state array inference on each file individually, we can aggregate trajectories across all files matching a particular condition:

```
>>> posterior_occs, condition_names = SAD.infer_posterior_by_condition('condition')
>>> print(posterior_occs.shape)
(2, 100)
>>> print(condition_names)
['HaloTag-NLS', 'RARA-HaloTag']
```

The results are unnormalized (they reflect the total number of jumps in each condition). We can normalize and plot the results by doing:

```
>>> from saspt import normalize_2d
>>> posterior_occs = normalize_2d(posterior_occs, axis=1)
>>> diff_coefs = SAD.likelihood.diff_coefs
>>> for c in range(posterior_occs.shape[0]):
... plt.plot(diff_coefs, posterior_occs[c,:], label=condition_names[c])
>>> plt.xscale('log')
>>> plt.xlabel('Diff. coef. ($\mu$m$^{2}$ s$^{-1}$)')
>>> plt.ylabel('Mean posterior occupation')
>>> plt.ylim((0, plt.ylim()[1]))
>>> plt.legend()
>>> plt.show()
```

The more trajectories we aggregate, the better our state occupation estimates
become. `saspt`

performs best when using large datasets with tens of thousands of
trajectories per condition.