Quickstart

Note

Before running, see Install for installation instructions.

Note

Want more code and less prose? Check out examples.py, a simple executable that demos most of stuff in this Quickstart.

This is a quick guide to getting started with saspt. It assumes you’re familiar with single particle tracking (SPT), have seen mixture models before, and have installed saspt. For a more detailed explanation of saspt’s purpose, model, and range of applicability, see Background.

Run state arrays on a single SPT experiment

We’ll use the sample set of trajectories that comes with saspt:

>>> import pandas as pd, numpy as np, matplotlib.pyplot as plt
>>> from saspt import sample_detections, StateArray, RBME
>>> detections = sample_detections()

The expected format for input trajectories is described under Note on input format. Importantly, the units for the XY coordinates are in pixels, not microns.

Next, we’ll set the parameters for a state array that uses mixtures of regular Brownian motions with localization error (RBMEs):

>>> settings = dict(
...     likelihood_type = RBME,
...     pixel_size_um = 0.122,
...     frame_interval = 0.01,
...     focal_depth = 0.7,
...     progress_bar = True,
... )

Only three parameters are actually required:

likelihood_type: the type of physical model to use

pixel_size_um: size of camera pixels after magnification in microns

frame_interval: time between frames in seconds

Additionally, we’ve set the focal_depth, which accounts for the finite focal depth of high-NA objectives used for SPT experiments, and progress_bar, which shows progress during inference. A full list of parameters and their meaning is described at StateArrayParameters.

Construct a StateArray with these settings:

>>> SA = StateArray.from_detections(detections, **settings)

If we run print(SA), we get

StateArray:
  likelihood_type   : rbme
  n_tracks          : 1130
  n_jumps           : 3933
  parameter_names   : ('diff_coef', 'loc_error')
  shape             : (100, 36)

This means that this state array infers occupations on a 2D parameter grid of diffusion coefficient (diff_coef) and localization error (loc_error), using 1130 trajectories. The shape of the parameter grid is (100, 36), meaning that the grid uses 100 distinct diffusion coefficients and 36 distinct localization errors (the default). These define the range of physical models that can be described with this state array. We can get the values of these parameters using the StateArray.parameter_values attribute:

>>> diff_coefs, loc_errors = SA.parameter_values
>>> print(diff_coefs.shape)
(100,)
>>> print(loc_errors.shape)
(36,)

The StateArray object provides two estimates of the state occupations at each point on this parameter grid:

The “naive” estimate, a quick and dirty estimate from the raw likelihood function

The “posterior” estimate, which uses the full state array model

The posterior estimate is more precise than the naive estimate, but also requires more trajectories and time. The more trajectories are present in the input, the more precise the posterior estimate becomes.

The StateArray object provides a built-in plot to compare the naive and posterior estimates:

>>> SA.plot_occupations("rbme_occupations.png")

The plot will look something like this:

The bottom row shows the posterior occupations marginalized on diffusion coefficient. This is a simple and powerful mechanism to account for the influence of localization error.

In this case, the state array identified a dominant diffusive state with a diffusion coefficient of about 5 \(\mu \text{m}^{2}\)/sec. We can also see a less-populated state between about 1 and 3 \(\mu \text{m}^{2}\)/sec, and some very slow particles with diffusion coefficients in the range 0.01 to 0.1 \(\mu \text{m}^{2}\)/sec.

We can retrieve the raw arrays used in this plot via the naive_occs and posterior_occs attributes. Both are arrays defined on the same grid of diffusion coefficient vs. localization error:

>>> naive_occs = SA.naive_occs
>>> posterior_occs = SA.posterior_occs
>>> print(naive_occs.shape)
(100, 36)
>>> print(posterior_occs.shape)
(100, 36)

Along with the state occupations, the StateArray object also infers the probabilities of each trajectory-state assignment. As with the state occupations, the trajectory-state assignment probabilities have both “naive” and “posterior” versions that we can compare:

>>> naive_assignment_probabilities = SA.naive_assignment_probabilities
>>> posterior_assignment_probabilities = SA.posterior_assignment_probabilities
>>> print(naive_assignment_probabilities.shape)
(100, 36, 1130)
>>> print(posterior_assignment_probabilities.shape)
(100, 36, 1130)

Notice that these arrays have one element per point in our 100-by-36 parameter grid and per trajectory. For example, the marginal probability that trajectory 100 has each of the 100 diffusion coefficients is:

>>> posterior_assignment_probabilities[:,:,100].sum(axis=1)

StateArray provides a plot to compare the naive and posterior assignment probabilities across all trajectories:

>>> SA.plot_assignment_probabilities('rbme_assignment_probabilities.png')

_images/rbme_assignment_probabilities.png

Each column in this plot represents a single trajectory, and the rows represent the probability of the trajectories having a particular diffusion coefficient. (The trajectories are sorted by their posterior mean diffusion coefficient.)

There are also a couple of related plots (not illustrated here):

saspt.StateArray.plot_temporal_assignment_probabilities(): shows the assignment probabilities as a function of the frame(s) in which the respective trajectories were found
saspt.StateArray.plot_spatial_assignment_probabilities(): shows the assignment probabilities as a function of the spatial location of the component detections

Finally, StateArray provides the naive and posterior state occupations as a pandas.DataFrame:

>>> occupations = SA.occupations_dataframe
>>> print(occupations)
      diff_coef  loc_error  naive_occupation  mean_posterior_occupation
0          0.01      0.000          0.000017                   0.000009
1          0.01      0.002          0.000017                   0.000008
2          0.01      0.004          0.000016                   0.000008
...         ...        ...               ...                        ...
3597     100.00      0.066          0.000042                   0.000014
3598     100.00      0.068          0.000041                   0.000014
3599     100.00      0.070          0.000041                   0.000014

[3600 rows x 4 columns]

Each row corresponds to a single point on the parameter grid. For instance, if we wanted to get the probability that a particle has a diffusion coefficient less than 0.1 \(\mu \text{m}^{2}\)/sec, we could do:

>>> selected = occupations['diff_coef'] < 0.1
>>> naive_estimate = occupations.loc[selected, 'naive_occupation'].sum()
>>> posterior_estimate = occupations.loc[selected, 'mean_posterior_occupation'].sum()
>>> print(naive_estimate)
0.24171198737935867
>>> print(posterior_estimate)
0.2779671727562628

In this case, the naive and posterior estimates are quite similar.

Run state arrays on a SPT dataset

Often we want to run state arrays on more than one SPT experiment and compare the output between experimental conditions. The StateArrayDataset object is intended to be a simple solution that provides:

methods to parallelize state array inference across multiple SPT experiments

outputs and visualizations to help compare between experimental conditions

In this example, we’ll use an example from the saspt repo. You can follow along by cloning the saspt repo and navigating to the examples subdirectory:

$ git clone https://github.com/alecheckert/saspt.git
$ cd saspt/examples
$ ls -1
examples.py
experiment_conditions.csv
u2os_ht_nls_7.48ms
u2os_rara_ht_7.48ms

The examples subdirectory contains a small SPT dataset where two proteins have been tracked:

HT-NLS: HaloTag (HT) fused to a nuclear localization signal (NLS), labeled with the photoactivatable fluorescent dye PA-JFX549
RARA-HT: retinoic acid receptor \(\alpha\) (RARA) fused to HaloTag (HT), labeled with the photoactivatable fluorescent dye PA-JFX549

Each protein has 11 SPT experiments, stored as CSV files in the examples/u2os_ht_nls_7.48ms and examples/u2os_rara_ht_7.48ms subdirectories. We also have a registry file (experiment_conditions.csv) that contains the assignment of each file to an experimental condition:

>>> paths = pd.read_csv('experiment_conditions.csv')

In this case, we have two columns: filepath encodes the path to the CSV corresponding to each SPT experiment, while condition encodes the experimental condition. (It doesn’t actually matter what these are named as long as they match the path_col and condition_col parameters below.)

>>> print(paths)
                                      filepath     condition
0    u2os_ht_nls_7.48ms/region_0_7ms_trajs.csv   HaloTag-NLS
1   u2os_ht_nls_7.48ms/region_10_7ms_trajs.csv   HaloTag-NLS
2    u2os_ht_nls_7.48ms/region_1_7ms_trajs.csv   HaloTag-NLS
..                                         ...           ...
19  u2os_rara_ht_7.48ms/region_7_7ms_trajs.csv  RARA-HaloTag
20  u2os_rara_ht_7.48ms/region_8_7ms_trajs.csv  RARA-HaloTag
21  u2os_rara_ht_7.48ms/region_9_7ms_trajs.csv  RARA-HaloTag

[22 rows x 2 columns]

Specify some parameters related to this analysis:

>>> settings = dict(
...     likelihood_type = RBME,
...     pixel_size_um = 0.16,
...     frame_interval = 0.00748,
...     focal_depth = 0.7,
...     path_col = 'filepath',
...     condition_col = 'condition',
...     progress_bar = True,
...     num_workers = 6,
... )

Warning

The num_workers attribute specifies the number of parallel processes to use when running inference. Don’t set this higher than the number of CPUs on your computer, or you’re likely to suffer performance hits.

Create a StateArrayDataset with these settings:

>>> from saspt import StateArrayDataset
>>> SAD = StateArrayDataset.from_kwargs(paths, **settings)

If you do print(SAD), you’ll get some basic info on this dataset:

>>> print(SAD)
StateArrayDataset:
  likelihood_type    : rbme
  shape              : (100, 36)
  n_files            : 22
  path_col           : filepath
  condition_col      : condition
  conditions         : ['HaloTag-NLS' 'RARA-HaloTag']

We can get more detailed information on these experiments (such as the detection density, mean trajectory length, etc.) by accessing the raw_track_statistics attribute:

>>> stats = SAD.raw_track_statistics
>>> print(stats)
    n_tracks  n_jumps  ...                                    filepath     condition
0       2387     1520  ...   u2os_ht_nls_7.48ms/region_0_7ms_trajs.csv   HaloTag-NLS
1       4966     5341  ...  u2os_ht_nls_7.48ms/region_10_7ms_trajs.csv   HaloTag-NLS
2       3294     2584  ...   u2os_ht_nls_7.48ms/region_1_7ms_trajs.csv   HaloTag-NLS
..       ...      ...  ...                                         ...           ...
19      5418    13129  ...  u2os_rara_ht_7.48ms/region_7_7ms_trajs.csv  RARA-HaloTag
20      9814    26323  ...  u2os_rara_ht_7.48ms/region_8_7ms_trajs.csv  RARA-HaloTag
21      7530    18978  ...  u2os_rara_ht_7.48ms/region_9_7ms_trajs.csv  RARA-HaloTag

[22 rows x 13 columns]
>>> print(stats.columns)
Index(['n_tracks', 'n_jumps', 'n_detections', 'mean_track_length',
       'max_track_length', 'fraction_singlets', 'fraction_unassigned',
       'mean_jumps_per_track', 'mean_detections_per_frame',
       'max_detections_per_frame', 'fraction_of_frames_with_detections',
       'filepath', 'condition'],
      dtype='object')

To get the naive and posterior state occupations for each file in this dataset:

>>> marginal_naive_occs = SAD.marginal_naive_occs
>>> marginal_posterior_occs = SAD.marginal_posterior_occs
>>> print(marginal_naive_occs.shape)
>>> print(marginal_posterior_occs.shape)

Note

It can take a few minutes to compute the posterior occupations for a dataset of this size. If you need a quick estimate for a test, try reducing the max_iter or sample_size parameters.

These occupations are “marginal” in the sense that they’ve been marginalized onto the parameter of interest in most SPT experiments: the diffusion coefficient. (You can get the original, unmarginalized occupations via the StateArrayDataset.posterior_occs and StateArrayDataset.naive_occs attributes.)

The same information is also provided as a pandas.DataFrame:

>>> occupations = SAD.marginal_posterior_occs_dataframe

For example, imagine we want to calculate the posterior probability that a particle had a diffusion coefficient less than 0.5 \(\mu\text{m}^{2}\)/sec for each file. We could do this by taking

>>> print(occupations.loc[occupations['diff_coef'] < 0.5].groupby(
...   'filepath')['mean_posterior_occupation'].sum())
filepath
u2os_ht_nls_7.48ms/region_0_7ms_trajs.csv      0.188782
u2os_ht_nls_7.48ms/region_10_7ms_trajs.csv     0.103510
u2os_ht_nls_7.48ms/region_1_7ms_trajs.csv      0.091148
...
u2os_rara_ht_7.48ms/region_7_7ms_trajs.csv     0.579444
u2os_rara_ht_7.48ms/region_8_7ms_trajs.csv     0.553111
u2os_rara_ht_7.48ms/region_9_7ms_trajs.csv     0.650187
Name: posterior_occupation, dtype: float64

The StateArrayDataset provides a few plots to visualize these occupations:

>>> SAD.posterior_heat_map('posterior_heat_map.png')

Notice that the two kinds of proteins have different diffusive profiles: HaloTag-NLS occupies a narrow range of diffusion coefficients centered around 10 \(\mu \text{m}^{2}\)/sec, while RARA-HaloTag has a much broader range of free diffusion coefficients with a substantial immobile fraction (showing up at the lower end of the diffusion coefficient range).

The heat map plot is useful to judge how consistent the result is across SPT experiments in the same condition. We can also compare the variability using an alternative line plot representation:

>>> SAD.posterior_line_plot('posterior_line_plot.png')

>>> SAD.naive_heat_map('naive_heat_map.png')

Notice that the information provided by the naive occupations is qualitatively similar but less precise than the posterior occupations.

>>> SAD.naive_line_plot('naive_line_plot.png')

Additionally, rather than performing state array inference on each file individually, we can aggregate trajectories across all files matching a particular condition:

>>> posterior_occs, condition_names = SAD.infer_posterior_by_condition('condition')
>>> print(posterior_occs.shape)
(2, 100)
>>> print(condition_names)
['HaloTag-NLS', 'RARA-HaloTag']

The results are unnormalized (they reflect the total number of jumps in each condition). We can normalize and plot the results by doing:

>>> from saspt import normalize_2d
>>> posterior_occs = normalize_2d(posterior_occs, axis=1)
>>> diff_coefs = SAD.likelihood.diff_coefs
>>> for c in range(posterior_occs.shape[0]):
...     plt.plot(diff_coefs, posterior_occs[c,:], label=condition_names[c])
>>> plt.xscale('log')
>>> plt.xlabel('Diff. coef. ($\mu$m$^{2}$ s$^{-1}$)')
>>> plt.ylabel('Mean posterior occupation')
>>> plt.ylim((0, plt.ylim()[1]))
>>> plt.legend()
>>> plt.show()

_images/mean_posterior_occs_by_condition.png

The more trajectories we aggregate, the better our state occupation estimates become. saspt performs best when using large datasets with tens of thousands of trajectories per condition.