# Analysis of Locust Data Set

## Table of Contents

- 1. How to proceed?
- 2. Preliminary Analysis
- 3. Data renormalization
- 4. Interactive data exploration
- 5. Spike detection
- 6. Cuts
- 7. Getting "clean" events
- 8. Dimension reduction
- 9. Clustering
- 10. "Brute force" superposition resolution
- 11. Getting the spike trains
- 12. Custom developed code
- 12.1. Individual function definitions
- 12.1.1.
`cstValueSgts`

definition - 12.1.2. Definitions of generic function and time-series specific method for data exploration
- 12.1.3.
`peaks`

and associated methods definition - 12.1.4. Definition of an
`explore`

method for time series and`eventsPos`

objects - 12.1.5. Function
`cutSglEvt`

- 12.1.6. Function
`mkEvents`

- 12.1.7.
`events`

methods - 12.1.8.
`mkNoise`

definition - 12.1.9. Definition of an
`explore`

method for`pca`

results - 12.1.10. Definition of
`get_jitter`

- 12.1.11. Definition of
`mk_aligned_events`

- 12.1.12. Definition of
`mk_center_list`

- 12.1.13. Definition of
`classify_and_align_evt`

- 12.1.14. Definition of
`predict_data`

- 12.1.1.

- 12.1. Individual function definitions

Sys.setlocale(category="LC_MESSAGES",locale="C")

[1] "C"

## 1 How to proceed?

The locust data set is located at the following web site: http://xtof.disque.math.cnrs.fr/data. So you should start by dowloading the four file `Locust_x.tar.gz`

(where `x`

= 1, 2, 3, 4) in some directory where you will perform your analysis (see second sub-section).

### 1.1 Loading the data

The data are simply loaded, once we know where to find them and if we do not forget that they have been compressed with `gzip`

. The data are stored with one file per recording channel in double format. They were sampled at 15 kHz and there is 20 s of them. Since they are compressed, we cannot read them directly into `R`

from the depository nd we must download them first:

reposName <-"http://xtof.disque.math.cnrs.fr/data/" dN <- paste("Locust_",1:4,".dat.gz",sep="") sapply(1:4, function(i) download.file(paste(reposName,dN[i],sep=""), dN[i],mode="wb") )

Once the data are in our `working directory`

we can load them into our `R workspace`

:

dN <- paste("Locust_",1:4,".dat.gz",sep="") nb <- 20*15000 lD <- sapply(dN, function(n) { mC <- gzfile(n,open="rb") x <- readBin(mC,what="double",n=nb) close(mC);x } ) colnames(lD) <- paste("site",1:4)

We can check that our `lD`

object has the correct dimension:

dim(lD)

[1] 300000 4

### 1.2 Getting the code

The individual functions developed for this kind of analysis are defined at the end of this document (Sec. *Individual function definitions*).
They can also be downloaded as a single file sorting_{with}_{r.R} which must then be sourced with:

source("sorting_with_r.R")

where it is assumed that the working directory of your `R`

session is the directory where `sorting_with_r.R`

was downloaded.

## 2 Preliminary Analysis

We are going to start our analysis by some "sanity checks" to make sure that nothing "weird" happened during the recording.

### 2.1 Five number summary

We should start by getting an overall picture of the data like the one provided by the `summary`

method of `R`

which outputs a `5 numbers plus mean`

summary. The five numbers are the `minimum`

, the `first quartile`

, the `median`

, the `third quartile`

and the `maximum`

:

summary(lD,digits=2)

site 1 site 2 site 3 site 4 Min. :-9.074 Min. :-8.229 Min. :-6.890 Min. :-7.35 1st Qu.:-0.371 1st Qu.:-0.450 1st Qu.:-0.530 1st Qu.:-0.49 Median :-0.029 Median :-0.036 Median :-0.042 Median :-0.04 Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : 0.00 3rd Qu.: 0.326 3rd Qu.: 0.396 3rd Qu.: 0.469 3rd Qu.: 0.43 Max. :10.626 Max. :11.742 Max. : 9.849 Max. :10.56

We see that the data range (`maximum - minimum`

) is similar (close to 20) on the four recording sites. The inter-quartiles ranges are also similar.

### 2.2 Were the data normalized?

We can check next if some processing like a division by the *standard deviation* (SD) has been applied:

apply(lD,2,sd)

site 1 site 2 site 3 site 4 1 1 1 1

### 2.3 Discretization step amplitude

We clearly see that these data have been *scaled*, that is, normalized to have an SD of 1. Since the data have been digitized we can easily obtain the apparent size of the digitization set:

```
apply(lD,2, function(x) min(diff(sort(unique(x)))))
```

site 1 site 2 site 3 site 4 0.006709845 0.009194500 0.011888433 0.009614042

### 2.4 Detecting saturation

Before embarking into a comprehensive analysis of data that we did not record ourselves (of that we recorded so long ago that we do not remember any "remarkable" event concerning them), it can be wise to check that no amplifier or A/D card saturation occurred. We can quickly check for that by looking at the length of the longuest segment of constant value. When saturation occurs the recorded value stays for many sampling points at the same upper or lower saturating level. We use function cstValueSgts that does this job.

ndL <- lapply(1:4,function(i) cstValueSgts(lD[,i])) sapply(ndL, function(l) max(l[2,]))

[1] 2 2 2 2

We see that for each recording site, the longest segment of constant value is *two sampling points* long, that is 2/15 ms. There is no ground to worry about saturation here.

### 2.5 Plot the data

We are going to profit from the `time series`

(`ts`

and `mts`

for multiple time series) objects of `R`

by redefining our `lD`

matrix as:

```
lD <- ts(lD,start=0,freq=15e3)
```

It is then straightforward to plot the whole data set:

plot(lD)

Figure 1: The whole (20 s) locust data set.

It is also good to "zoom in" and look at the data with a finer time scale:

plot(window(lD,start=0,end=0.2))

Figure 2: First 200 ms of the locust data set.

## 3 Data renormalization

We are going to use a median absolute deviation (`MAD`

) based renormalization. The goal of the procedure is to scale the raw data such that the *noise SD* is approximately 1. Since it is not straightforward to obtain a noise SD on data where both signal (*i.e.*, spikes) and noise are present, we use this robust type of statistic for the SD. Luckily this is simply obtained in `R`

:

lD.mad <- apply(lD,2,mad) lD <- t((t(lD)-apply(lD,2,median))/lD.mad) lD <- ts(lD,start=0,freq=15e3)

where the last line of code ensures that `lD`

is still an `mts`

object. We can check on a plot how `MAD`

and `SD`

compare:

plot(window(lD[,1],0,0.2)) abline(h=c(-1,1),col=2) abline(h=c(-1,1)*sd(lD[,1]),col=4,lty=2,lwd=2)

Figure 3: First 200 ms on site 1 of the locust data set. In red: +/- the `MAD`

; in dashed blue +/- the `SD`

.

### 3.1 A quick check that the `MAD`

"does its job"

We can check that the `MAD`

does its job as a robust estimate of the *noise* standard deviation by looking at Q-Q plots of the whole traces normalized with the `MAD`

and normalized with the "classical" `SD`

.

lDQ <- apply(lD,2,quantile, probs=seq(0.01,0.99,0.01)) lDnormSD <- apply(lD,2,function(x) x/sd(x)) lDnormSDQ <- apply(lDnormSD,2,quantile, probs=seq(0.01,0.99,0.01)) qq <- qnorm(seq(0.01,0.99,0.01)) matplot(qq,lDQ,type="n",xlab="Normal quantiles",ylab="Empirical quantiles") abline(0,1,col="grey70",lwd=3) col=c("black","orange","blue","red") matlines(qq,lDnormSDQ,lty=2,col=col) matlines(qq,lDQ,lty=1,col=col) rm(lDnormSD,lDnormSDQ)

Figure 4: Performances of `MAD`

based vs `SD`

based normalizations. After normalizing the data of each recording site by its `MAD`

(plain colored curves) or its `SD`

(dashed colored curves), Q-Q plot against a standard normal distribution were constructed. Colors: site 1, black; site 2, orange; site 3, blue; site 4, red.

We see that the behavior of the "away from normal" fraction is much more homogeneous for small, as well as for large in fact, quantile values with the `MAD`

normalized traces than with the `SD`

normalized ones. If we consider automatic rules like the three sigmas we are going to reject fewer events (*i.e.*, get fewer putative spikes) with the `SD`

based normalization than with the `MAD`

based one.

## 4 Interactive data exploration

Although we can't illustrate properly this *key* step on a "static" document it is absolutely necessary to look at the data in detail using (the corresponding function definition can be found in time-series specific method for data exploration):

explore(lD,col=c("black","grey70"))

Upon using this command the user is invited to move forward (typing "n" + `RETURN`

or simply `RETURN`

), backward (typing "f" + `RETURN`

), to change the abscissa or ordinate scale, etc.

## 5 Spike detection

We are going to filter the data slightly using a "box" filter of length 5. That is, the data points of the original trace are going to be replaced by the average of themselves with their two nearest neighbors. We will then scale the filtered traces such that the `MAD`

is one on each recording sites and keep only the parts of the signal which above 4:

lDf <- filter(lD,rep(1,5)/5) lDf[is.na(lDf)] <- 0 lDf.mad <- apply(lDf,2,mad) lDf <- t(t(lDf)/lDf.mad) thrs <- c(4,4,4,4) bellow.thrs <- t(t(lDf) < thrs) lDfr <- lDf lDfr[bellow.thrs] <- 0 remove(lDf)

We can see the difference between the *raw* trace and the *filtered and rectified* one on which spikes are going to be detected with:

plot(window(lD[,1],0,0.2)) abline(h=4,col=4,lty=2,lwd=2) lines(window(ts(lDfr[,1],start=0,freq=15e3),0,0.2),col=2)

Figure 5: First 200 ms on site 1 of data set `lD`

. The raw data are shown in black, the detection threshold appears in dashed blue and the filtered and rectified trace on which spike detection is going to be preformed appears in red.

Spikes are then detected as local maxima on the *summed, filtered and rectified* traces. To do that we use a peak detecting function defined in `peaks`

and associated methods definition.

```
sp0 <- peaks(apply(lDfr,1,sum),15)
```

The returned object, `sp0`

, is essentially a vector of integer containing the indexes of the detected spikes. To facilitate handling it is in addition defined as an object of class `eventsPos`

meaning that entering its name on the command line and typing returns, that is, calling the `print`

method on the object gives a short description of it:

sp0

eventsPos object with indexes of 1795 events. Mean inter event interval: 166.81 sampling points, corresponding SD: 143.79 sampling points Smallest and largest inter event intervals: 16 and 1157 sampling points.

We see that `length(sp0)`

`1795`

events were detected. Since the mean inter event interval is very close to the SD, the "compound process" (since it's likely to be the sum of the activities of many neurons) is essentially Poisson.

### 5.1 Interactive spike detection check

We can interactively check the detection quality with the tailored `explore`

method (defined in Definition of an `explore`

method for time series and `eventsPos`

objects):

explore(sp0,lD,col=c("black","grey50"))

That leads to a display very similar to the one previously obtained with `explore(lD)`

except that the detected events appear superposed on the raw data as red dots.

### 5.2 Remove useless objects

Since we are not going to use `lDfr`

anymore we can save memory by removing it:

remove(lDfr)

### 5.3 Data set split

In order to get stronger checks for our procedure and to illustrate better how it works, we are going to split our data set in two parts, establish our model on the first and use this model on both parts:

(sp0E <- as.eventsPos(sp0[sp0 <= dim(lD)[1]/2])) (sp0L <- as.eventsPos(sp0[sp0 > dim(lD)[1]/2]))

eventsPos object with indexes of 908 events. Mean inter event interval: 164.88 sampling points, corresponding SD: 139.06 sampling points Smallest and largest inter event intervals: 16 and 931 sampling points. eventsPos object with indexes of 887 events. Mean inter event interval: 168.72 sampling points, corresponding SD: 148.59 sampling points Smallest and largest inter event intervals: 16 and 1157 sampling points.

We see that `eventsPos`

objects can be sub-set like classical vectors. We also see that the sub-setting based on total time results in set with roughly the same number of events.

## 6 Cuts

### 6.1 Getting the "right" length for the cuts

After detecting our spikes, we must make our cuts in order to create our events' sample. That is, for each detected event we literally cut a piece of data and we do that on the four recording sites. To this end we use function `mkEvents`

which in addition to an `eventPos`

argument (`sp1E`

) and a "raw data" argument (`lD`

) takes an integer argument (`before`

) stating how many sampling points we want to keep within the cut before the reference time as well as another integer argument (`after`

) stating how many sampling points we want to keep within the cut after the reference time. The function returns essentially a matrix where each event is a column. The cuts on the different recording sites are put one after the other when the event is built. The obvious question we must first address is: How long should our cuts be? The pragmatic way to get an answer is:

- Make cuts much longer than what we think is necessary, like 50 sampling points on both sides of the detected event's time.
- Compute robust estimates of the "central" event (with the
`median`

) and of the dispersion of the sample around this central event (with the`MAD`

). - Plot the two together and check when does the
`MAD`

trace reach the background noise level (at 1 since we have normalized the data). - Having the central event allows us to see if it outlasts significantly the region where the
`MAD`

is above the background noise level.

Clearly cutting beyond the time at which the `MAD`

hits back the noise level should not bring any useful information as far a classifying the spikes is concerned. So here we perform this task as follows:

evtsE <- mkEvents(sp0E,lD,49,50) evtsE.med <- median(evtsE) evtsE.mad <- apply(evtsE,1,mad)

plot(evtsE.med,type="n",ylab="Amplitude") abline(v=seq(0,400,10),col="grey") abline(h=c(0,1),col="grey") lines(evtsE.med,lwd=2) lines(evtsE.mad,col=2,lwd=2)

Figure 6: Robust estimates of the central event (black) and of the sample's dispersion around the central event (red) obtained with "long" (100 sampling points) cuts. We see clearly that the dispersion is back to noise level 15 points before the peak and 30 points after the peak (on all sites). We also see that the median event is not back to zero 50 points after the peak, we will have to keep his information in mind when we are going to look for superpositions.

### 6.2 Events

Once we are satisfied with our spike detection, at least in a provisory way, and that we have decided on the length of our cuts, we proceed by making `cuts`

around the detected events. :

```
evtsE <- mkEvents(sp0E,lD,14,30)
```

Here we have decided to keep 14 points before and 30 points after our reference times. `evtsE`

is a bit more than a matrix, it is an object of class `events`

, meaning that a `summary`

method is available:

summary(evtsE)

events object deriving from data set: lD. Events defined as cuts of 45 sampling points on each of the 4 recording sites. The 'reference' time of each event is located at point 15 of the cut. There are 908 events in the object.

A `print`

method which calls the `plot`

method is also available giving:

evtsE[,1:200]

Figure 7: First 200 events of `evtsE`

. Cuts from the four recording sites appear one after the other. The background (white / grey) changes with the site. In red, *robust* estimate of the "central" event obtained by computing the pointwise median. In blue, *robust* estimate of the scale (SD) obtained by computing the pointwise `MAD`

.

Like `eventsPos`

objects, `events`

objects can be sub-set *with respect to the rows* like usual matrix. Notice that a rather sophisticated plot was obtained with an extremely simple command… The beauty of `R`

class / method mechanism in action.

### 6.3 Noise

Getting an estimate of the noise statistical properties is an essential ingredient to build respectable goodness of fit tests. In our approach "noise events" are essentially anything that is not an "event" is the sense of the previous section. I wrote essentially and not exactly since there is a little twist here which is the minimal distance we are willing to accept between the reference time of a noise event and the reference time of the last preceding and of the first following "event". We could think that keeping a cut length on each side would be enough. That would indeed be the case if *all* events were starting from and returning to zero within a cut. But this is not the case with the cuts parameters we tool previously (that will become clear soon). You might wonder why we chose so short a cut length then. Simply to avoid having to deal with too many superposed events which are the really bothering events for anyone wanting to do proper sorting.
To obtain our noise events we are going to define function `mkNoise`

which takes the *same* arguments as function `mkEvents`

plus two number: `safetyFactor`

a number by which the cut length is multiplied and which sets the minimal distance between the reference times discussed in the previous paragraph and `size`

the maximal number of noise events one wants to cut (the actual number obtained might be smaller depending on the data length, the cut length, the safety factor and the number of events).

We cut next noise events with a rather large safety factor:

```
noiseE <- mkNoise(sp0E,lD,14,30,safetyFactor=2.5,2000)
```

Here `noiseE`

is also an `events`

object and its `summary`

is:

summary(noiseE)

events object deriving from data set: lD. Events defined as cuts of 45 sampling points on each of the 4 recording sites. The 'reference' time of each event is located at point 15 of the cut. There are 1330 events in the object.

The reader interested in checking the effect of the `safetyFactor`

argument is invited to try something like:

noiseElowSF <- mkNoise(sp0E,lD,14,30,safetyFactor=1,2000) plot(mean(noiseElowSF),type="l") lines(mean(noiseE),col=2)

## 7 Getting "clean" events

Our spike sorting has two main stages, the first one consist in estimating a `generative model`

and the second one consists in using this model to build a `classifier`

before applying to the data. Our `generative model`

*will include superposed events* but it is going to be built out of reasonably "clean" ones. Here by clean we mean events which are not due to a nearly simultaneous firing of two or more neurons; and simultaneity is defined on the time scale of one of our cuts.

In order to eliminate the most obvious superpositions we are going to use a rather brute force approach, looking at the sides of the central peak of our median event and checking if individual events are not too large there, that is do not exhibit extra peaks. We first define a function doing this job:

goodEvtsFct <- function(samp,thr=3) { samp.med <- apply(samp,1,median) samp.mad <- apply(samp,1,mad) above <- samp.med > 0 samp.r <- apply(samp,2,function(x) {x[above] <- 0;x}) apply(samp.r,2,function(x) all(abs(x-samp.med) < thr*samp.mad)) }

We then apply our new function to our realigned sample:

```
goodEvts <- goodEvtsFct(evtsE,8)
```

Here `goodEvts`

is a vector of `logical`

with as many elements as events in `evtsE`

. Elements of `goodEvts`

are `TRUE`

if the corresponding event of `evtsE`

is "good" (*i.e.*, not a superposition) and is `FALSE`

otherwise. We can look at the first 200 good events easily with:

evtsE[,goodEvts][,1:200]

Figure 8: First 200 good events of `evtsE`

.

We see that few superpositions are left but the most obvious ones of our previous events figure are gone. We can also look at the `sum(!goodEvts)`

`[1] 38`

"bad" events with:

evtsE[,!goodEvts]

Figure 9: Bad events of `evtsE`

.

## 8 Dimension reduction

### 8.1 Principal component analysis

Our events are living right now in an 180 dimensional space (our cuts are 45 sampling points long and we are working with 4 recording sites simultaneously). It turns out that it hard for most humans to perceive structures in such spaces. It also hard, not to say impossible with a realistic sample size, to estimate probability densities (which what some clustering algorithm are actually doing) in such spaces, unless one is ready to make strong assumptions about these densities. It is therefore usually a good practice to try to reduce the dimension of the sample space used to represent the data. We are going to that with principal component analysis (`PCA`

), using it on our "good" events.

```
evtsE.pc <- prcomp(t(evtsE[,goodEvts]))
```

We have to be careful here since function `prcomp`

assumes that the data matrix is built by stacking the events / observations as rows and not as columns like we did in our `events`

object. We apply therefore the function to the `transpose`

(`t()`

) of our events.

### 8.2 Exploring `PCA`

results

`PCA`

is a rather abstract procedure to most of its users, at least when they start using it. But one way to grasp what it does is to plot the `mean event`

plus or minus, say twice, each principal components like:

layout(matrix(1:4,nr=2)) explore(evtsE.pc,1,5) explore(evtsE.pc,2,5) explore(evtsE.pc,3,5) explore(evtsE.pc,4,5)

Figure 10: PCA of `evtsEo2`

(for "good" events) exploration (PC 1 to 4). Each of the 4 graphs shows the mean waveform (black), the mean waveform + 5 x PC (red), the mean - 5 x PC (blue) for each of the first 4 PCs. The fraction of the total variance "explained" by the component appears in between parenthesis in the title of each graph.

We can see that the first 3 PCs correspond to pure amplitude variations. An event with a large projection (`score`

) on the first PC is smaller than the average event on recording sites 1, 2 and 3, but not on 4. An event with a large projection on PC 2 is larger than average on site 1, smaller than average on site 2 and 3 and identical to the average on site 4. An event with a large projection on PC 3 is larger than the average on site 4 only. PC 4 is the first principal component corresponding to a change in *shape* as opposed to *amplitude*. A large projection on PC 4 means that the event as a shallower first valley and a deeper second valley than the average event on all recording sites.

We now look at the next 4 principal components:

layout(matrix(1:4,nr=2)) explore(evtsE.pc,5,5) explore(evtsE.pc,6,5) explore(evtsE.pc,7,5) explore(evtsE.pc,8,5)

Figure 11: PCA of `evtsE`

(for "good" events) exploration (PC 5 to 8). Each of the 4 graphs shows the mean waveform (black), the mean waveform + 5 x PC (red), the mean - 5 x PC (blue). The fraction of the total variance "explained" by the component appears in between parenthesis in the title of each graph.

An event with a large projection on PC 5 tends to be "slower" than the average event. An event with a large projection on PC 6 exhibits a slower kinetics of its second valley than the average event. PC 5 and 6 correspond to effects shared among recording sites. PC 7 correspond also to a "change of shape" effect on all sites except the first. Events with a large projection on PC 8 rise slightly faster and decay slightly slower than the average event on all recording site. Notice also that PC 8 has a "noisier" aspect than the other suggesting that we are reaching the limit of the "events extra variability" compared to the variability present in the background noise.

This guess can be confirmed by comparing the variance of the "good" events sample with the one of the noise sample to which the variance contributed by the first 8 PCs is added:

```
sapply(1:15, function(n) sum(diag(cov(t(noiseE))))+sum(evtsE.pc$sdev[1:n]^2)-sum(evtsE.pc$sdev^2))
```

[1] -277.4651543 -187.5634116 -128.0390777 -91.3186691 -58.8398876 [6] -36.3630674 -21.5437224 -8.2644952 0.2848893 6.9067336 [11] 13.3415488 19.4720891 25.2553356 29.1021047 32.5634359

This near equality means that we should not include component beyond the 8th one in our analysis. That's leave the room to use still fewer components.

### 8.3 Static representation of the projected data

We can build a `scatter plot matrix`

showing the projections of our "good" events sample onto the plane defined by pairs of the few first PCs:

panel.dens <- function(x,...) { usr <- par("usr") on.exit(par(usr)) par(usr = c(usr[1:2], 0, 1.5) ) d <- density(x, adjust=0.5) x <- d$x y <- d$y y <- y/max(y) lines(x, y, col="grey50", ...) } pairs(evtsE.pc$x[,1:4],pch=".",gap=0,diag.panel=panel.dens)

Figure 12: Scatter plot matrix of the projections of the good events in `evtsEo2`

onto the planes defined by the first 4 PCs. The diagonal shows a smooth (Gaussian kernel based) density estimate of the projection of the sample on the corresponding PC. Using the first 8 PCs does not make finner structure visible.

### 8.4 Dynamic representation of the projected data

The best way to discern structures in "high dimensional" data is to dynamically visualize them. To this end, the tool of choice is GGobi, an open source software available on `Linux`

, `Windows`

and `MacOS`

. It is in addition interfaced to `R`

thanks to the rggobi package. We have therefore two ways to use it: as a stand alone program after exporting the data from `R`

, or directly within `R`

. We are going to use it in its stand alone version here. We therefore start by exporting our data in `csv`

format to our disk:

```
write.csv(evtsE.pc$x[,1:8],file="evtsE.csv")
```

What comes next is not part of this document but here is a brief description of how to get it:

- Launch
`GGobi`

- In menu:
`File`

->`Open`

, select`evtsE.csv`

. - Since the glyphs are rather large, start by changing them for smaller ones:
- Go to menu:
`Interaction`

->`Brush`

. - On the Brush panel which appeared check the
`Persistent`

box. - Click on
`Choose color & glyph...`

. - On the chooser which pops out, click on the small dot on the upper left of the left panel.
- Go back to the window with the data points.
- Right click on the lower right corner of the rectangle which appeared on the figure after you selected
`Brush`

. - Dragg the rectangle corner in order to cover the whole set of points.
- Go back to the
`Interaction`

menu and select the first row to go back where you were at the start.

- Go to menu:
- Select menu:
`View`

->`Rotation`

. - Adjust the speed of the rotation in order to see things properly.

You should easily discern 10 rather well separated clusters. Meaning that an automatic clustering with 10 clusters on the first 3 principal components should do the job.

## 9 Clustering

### 9.1 k-means clustering

Since our dynamic visualization shows 10 well separated clusters in 3 dimension, a simple k-means should do the job:

set.seed(20061001,kind="Mersenne-Twister") km10 <- kmeans(evtsE.pc$x[,1:3],centers=10,iter.max=100,nstart=100) c10 <- km10$cluster

Since function `kmeans`

of `R`

does use a random initialization, we set the seed (as well as the `kind`

) of our pseudo random number generator in order to ensure full reproducibility. In order to ensure reproducibility even if another seed is used as well as to facilitate the interpretation of the results, we "order" the clusters by "size" using the integrated absolute value of the central / median event of each cluster as a measure of its size.

cluster.med <- sapply(1:10, function(cIdx) median(evtsE[,goodEvts][,c10==cIdx])) sizeC <- sapply(1:10,function(cIdx) sum(abs(cluster.med[,cIdx]))) newOrder <- sort.int(sizeC,decreasing=TRUE,index.return=TRUE)$ix cluster.mad <- sapply(1:10, function(cIdx) {ce <- t(evtsE)[goodEvts,];ce <- ce[c10==cIdx,];apply(ce,2,mad)}) cluster.med <- cluster.med[,newOrder] cluster.mad <- cluster.mad[,newOrder] c10b <- sapply(1:10, function(idx) (1:10)[newOrder==idx])[c10]

### 9.2 Results inspection with `GGobi`

We start by checking our clustering quality with `GGobi`

. To this end we export the data and the labels of each event:

```
write.csv(cbind(evtsE.pc$x[,1:3],c10b),file="evtsEsorted.csv")
```

Again the dynamic visualization is not part of this document, but here is how to get it:

- Load the new data into GGobi like before.
- In menu:
`Display`

->`New Scatterplot Display`

, select`evtsEsorted.csv`

. - Change the glyphs like before.
- In menu:
`Tools`

->`Color Schemes`

, select a scheme with 10 colors, like`Spectral`

,`Spectral 10`

. - In menu:
`Tools`

->`Automatic Brushing`

, select`evtsEsorted.csv`

tab and, within this tab, select variable`c10b`

. Then click on`Apply`

. - Select
`View`

->`Rotation`

like before and see your result.

### 9.3 Cluster specific plots

Another way to inspect the clustering results is to look at cluster specific events plots:

layout(matrix(1:4,nr=4)) par(mar=c(1,1,1,1)) plot(evtsE[,goodEvts][,c10b==1],y.bar=5) plot(evtsE[,goodEvts][,c10b==2],y.bar=5) plot(evtsE[,goodEvts][,c10b==3],y.bar=5) plot(evtsE[,goodEvts][,c10b==4],y.bar=5)

Figure 13: First 4 clusters. Cluster 1 at the top, cluster 4 at the bottom. Scale bar: 5 global `MAD`

units. Red, cluster specific central / median event. Blue, cluster specific `MAD`

.

Notice the increased `MAD`

on the rising phase of cluster 2 on the first recording site. A sing of misalignment of the events of this cluster.

layout(matrix(1:4,nr=4)) par(mar=c(1,1,1,1)) plot(evtsE[,goodEvts][,c10b==5],y.bar=5) plot(evtsE[,goodEvts][,c10b==6],y.bar=5) plot(evtsE[,goodEvts][,c10b==7],y.bar=5) plot(evtsE[,goodEvts][,c10b==8],y.bar=5)

Figure 14: Next 4 clusters. Cluster 5 at the top, cluster 8 at the bottom. Scale bar: 5 global `MAD`

units. Red, cluster specific central / median event. Blue, cluster specific `MAD`

.

Cluster 5 has few events while some "subtle" superpositions are present in cluster 7.

layout(matrix(1:2,nr=2)) par(mar=c(1,1,1,1)) plot(evtsE[,goodEvts][,c10b==9],y.bar=5) plot(evtsE[,goodEvts][,c10b==10],y.bar=5)

Figure 15: Last 2 clusters. Cluster 9 at the top, cluster 10 at the bottom. Scale bar: 5 global `MAD`

units. Red, cluster specific central / median event. Blue, cluster specific `MAD`

.

Cluster 10 exhibits an extra variability on sites 1 and 4 around its first valley and its peak.

### 9.4 Dealing with the jitter

#### 9.4.1 A worked out example

Cluster 2 is the cluster exhibiting the largest sampling jitter effects, since it has the largest time derivative, in absolute value, of its median event . This is clearly seen when we superpose the 51st event from this cluster with the median event:

c2_median = apply(evtsE[,goodEvts][,c10b==2],1,median) plot(c2_median,col='red',lwd=2,type="l") lines(unclass(evtsE[,goodEvts][,c10b==2][,51]),lwd=2)

Figure 16: The median event of cluster 2 (red) together with event 51 of the same cluster (black).

A Taylor expansion shows that if we write *g(t)* the observed 51st event, Î´ the sampling jitter and *f(t)* the actual waveform of the event then:
\[g(t) = f(t+Î´) + Îµ(t) \approx f(t) + Î´ \, f'(t) + Î´^2/2 \, f''(t) + Îµ(t) \, ;\]
where Îµ is a Gaussian process and where \(f'\) and \(f''\) stand for the first and second time derivatives of \(f\). Therefore, if we can get estimates of \(f'\) and \(f''\) we should be able to estimate Î´ by linear regression (if we neglect the \(Î´^2\) term as well as the potentially non null correlation in Îµ) or by non linear regression (if we keep the latter).

We start by getting the derivatives estimates:

dataD = apply(lD,2,function(x) c(0,diff(x,2)/2,0)) evtsED = mkEvents(sp0E,dataD,14,30) dataDD = apply(dataD,2,function(x) c(0,diff(x,2)/2,0)) evtsEDD = mkEvents(sp0E,dataDD,14,30) c2D_median = apply(evtsED[,goodEvts][,c10b==2],1,median) c2DD_median = apply(evtsEDD[,goodEvts][,c10b==2],1,median)

We then get something like:

plot(unclass(evtsE[,goodEvts][,c10b==2][,51])-c2_median,type="l",col='red',lwd=2) lines(1.5*c2D_median,col='blue',lwd=2) lines(1.5*c2D_median+1.5^2/2*c2DD_median,col='black',lwd=2)

Figure 17: The median event of cluster 2 subtracted from event 51 of the same cluster (red); 2 times the first derivative of the median event (blue)â€”corresponding to Î´=1.5â€”; 1.5 times the first derivative + 1.5^{2}/2 times the second (black)â€”corresponding again to Î´=1.5â€”.

If we neglect the \(Î´^2\) term we quickly arrive at: \[\hat{Î´} = \frac{\mathbf{f'} \cdot (\mathbf{g} -\mathbf{f})}{\| \mathbf{f'} \|^2} \, ;\]

where the 'vectorial' notation like \(\mathbf{a} \cdot \mathbf{b}\) stands here for: \[\sum_{i=0}^{179} a_i b_i \, .\]

For the 51st event of the cluster we get:

(delta_hat = sum(c2D_median*(unclass(evtsE[,goodEvts][,c10b==2][,51])-c2_median))/sum(c2D_median^2))

1.4917182304327

We can use this estimated value of `delta_hat`

as an initial guess for a procedure refining the estimate using also the \(Î´^2\) term. The obvious quantity we should try to minimize is the residual sum of square, `RSS`

defined by:
\[\mathrm{RSS}(Î´) = \| \mathbf{g} - \mathbf{f} - Î´ \, \mathbf{f'} - Î´^2/2 \, \mathbf{f''} \|^2 \; .\]

We can define a function returning the `RSS`

for a given value of Î´ as well as an event `evt`

a cluster center (median event of the cluster) `center`

and its first two derivatives, `centerD`

and `centerDD`

:

rss_for_alignment <- function(delta, evt=unclass(evtsE[,goodEvts][,c10b==2][,51]), center=c2_median, centerD=c2D_median, centerDD=c2DD_median) { sapply(delta, function(d) sum((evt - center - d*centerD - d^2/2*centerDD)^2) ) }

We then get the graph with:

layout(matrix(1:2,nc=2)) dd <- seq(-5,5,0.05) plot(dd,rss_for_alignment(dd),type='l',col='black',lwd=2) dd_fine <- seq(delta_hat-0.5,delta_hat+0.5,len=501) plot(dd_fine,rss_for_alignment(dd_fine),type='l',col='black',lwd=2) abline(v=delta_hat,col='red')

Figure 18: The `RSS`

as a function of Î´ for event 51 of cluster 2. Left, \(Î´ \in [-5,5]\); right, \(Î´ \in [\hat{Î´}-0.5,\hat{Î´}+0.5]\) and the red vertical line shows \(\hat{Î´}\).

The left panel of the above figure shows that our initial guess for \(\hat{Î´}\) is not bad but still approximately 0.2 units away from the actual minimum. The classical way to refine our Î´ estimateâ€”in 'nice situations' where the function we are trying to minimize is locally convexâ€”is to use the Newton-Raphson algorithm which consists in approximating locally the 'target function' (here our `RSS`

function) by a parabola having locally the same first and second derivatives, before jumping to the minimum of this approximating parabola. If we develop our previous expression of \(\mathrm{RSS}(Î´)\) we get:
\[\mathrm{RSS}(Î´) = \| \mathbf{h} \|^2 - 2\, Î´ \, \mathbf{h} \cdot \mathbf{f'} + Î´^2 \, \left( \|\mathbf{f'}\|^2 - \mathbf{h} \cdot \mathbf{f''}\right) + Î´^3 \, \mathbf{f'} \cdot \mathbf{f''} + \frac{Î´^4}{4} \|\mathbf{f''}\|^2 \, ;\]
where \(\mathbf{h}\) stands for \(\mathbf{g} - \mathbf{f}\). By differentiation with respect to Î´ we get:
\[\mathrm{RSS}'(Î´) = - 2\, \mathbf{h} \cdot \mathbf{f'} + 2 \, Î´ \, \left( \|\mathbf{f'}\|^2 - \mathbf{h} \cdot \mathbf{f''}\right) + 3 \, Î´^2 \, \mathbf{f'} \cdot \mathbf{f''} + Î´^3 \|\mathbf{f''}\|^2 \, .\]
And a second differentiation leads to:
\[\mathrm{RSS}''(Î´) = 2 \, \left( \|\mathbf{f'}\|^2 - \mathbf{h} \cdot \mathbf{f''}\right) + 6 \, Î´ \, \mathbf{f'} \cdot \mathbf{f''} + 3 \, Î´^2 \|\mathbf{f''}\|^2 \, .\]
The equation of the approximating parabola at \(Î´^{(k)}\) is then:
\[\mathrm{RSS}(Î´^{(k)} + Î·) \approx \mathrm{RSS}(Î´^{(k)}) + Î· \, \mathrm{RSS}'(Î´^{(k)}) + \frac{Î·^2}{2} \, \mathrm{RSS}''(Î´^{(k)})\; ,\]
and its minimumâ€”if \(\mathrm{RSS}''(Î´)\) > 0â€”is located at:
\[Î´^{(k+1)} = Î´^{(k)} - \frac{\mathrm{RSS}'(Î´^{(k)})}{\mathrm{RSS}''(Î´^{(k)})} \; .\]

Defining functions returning the required derivatives:

rssD_for_alignment <- function(delta, evt=unclass(evtsE[,goodEvts][,c10b==2][,51]), center=c2_median, centerD=c2D_median, centerDD=c2DD_median){ h = evt - center -2*sum(h*centerD) + 2*delta*(sum(centerD^2) - sum(h*centerDD)) + 3*delta^2*sum(centerD*centerDD) + delta^3*sum(centerDD^2) } rssDD_for_alignment <- function(delta, evt=unclass(evtsE[,goodEvts][,c10b==2][,51]), center=c2_median, centerD=c2D_median, centerDD=c2DD_median){ h = evt - center 2*(sum(centerD^2) - sum(h*centerDD)) + 6*delta*sum(centerD*centerDD) + 3*delta^2*sum(centerDD^2) }

we can get a graphical representation of a single step of the Newton-Raphson algorithm:

rss_at_delta_0 = rss_for_alignment(delta_hat) rssD_at_delta_0 = rssD_for_alignment(delta_hat) rssDD_at_delta_0 = rssDD_for_alignment(delta_hat) delta_1 = delta_hat - rssD_at_delta_0/rssDD_at_delta_0 plot(dd_fine,rss_for_alignment(dd_fine),type="l",col='black',lwd=2) abline(v=delta_hat,col='red') lines(dd_fine, rss_at_delta_0 + (dd_fine-delta_hat)*rssD_at_delta_0 + (dd_fine-delta_hat)^2/2*rssDD_at_delta_0,col='blue',lwd=2) abline(v=delta_1,col='grey')

Figure 19: The `RSS`

as a function of Î´ for event 51 of cluster 1 (black), the red vertical line shows \(\hat{Î´}\). In blue, the approximating parabola at \(\hat{Î´}\). The grey vertical line shows the minimum of the approximating parabola.

Subtracting the second order in Î´ approximation of f(t+Î´) from the observed 51st event of cluster 1 we get:

plot(evtsE[,goodEvts][,c10b==2][,51],type="l",col='black',lwd=2) lines(evtsE[,goodEvts][,c10b==2][,51]-c2_median-delta_1*c2D_median-delta_1^2/2*c2DD_median,col='red',lwd=2) lines(c2_median+delta_1*c2D_median+delta_1^2/2*c2DD_median,col='blue',lwd=1)

Figure 20: Event 51 of cluster 2 (black), second order approximation of f(t+Î´) (blue) and residual (red) for Î´â€”obtained by a succession of a linear regression (order 1) and a single Newton-Raphson stepâ€”equal to: `delta_1`

`1.37480481443249`

.

#### 9.4.2 Making the procedure systematic

Function `get_jitter`

is defined in Sec. Definition of `get_jitter`

. If we try it on the events attribute to cluster 2:

c2_jitter = get_jitter(evtsE[,goodEvts][,c10b==2],c2_median,c2D_median,c2DD_median)

we get the following summary statistic of the estimated jitters:

summary(c2_jitter,digits=2)

Min. 1st Qu. Median Mean 3rd Qu. Max. -1.900 -0.380 -0.044 0.041 0.660 1.800

We then get a estimate of the jitter shifted events of cluster 2 with:

c2_noj = unclass(evtsE[,goodEvts][,c10b==2]) - c2D_median %o% c2_jitter - c2DD_median %o% c2_jitter^2/2 attributes(c2_noj) = attributes(evtsE[,goodEvts][,c10b==2])

A plot allows us to compare the "raw version" of the events with the jitter corrected version:

layout(matrix(1:2,nr=2)) par(mar=c(1,1,1,1)) evtsE[,goodEvts][,c10b==2] c2_noj

Figure 21: Top: cluster 2 events before jitter correction. Bottom: cluster 2 events after second order jitter correction.

#### 9.4.3 Applying the procedure systematically

Using function `mk_aligned_events`

defined in Definition of `mk_aligned_events`

, for each of the 10 clusters:

```
evtsE_noj = lapply(1:10, function(i) mk_aligned_events(sp0E[goodEvts][c10b==i],lD))
```

Looking at the first 5 clusters we get:

layout(matrix(1:5,nr=5)) evtsE_noj[[1]] evtsE_noj[[2]] evtsE_noj[[3]] evtsE_noj[[4]] evtsE_noj[[5]]

Figure 22: First 5 clusters **after jitter cancellation**. Cluster 1 at the top, cluster 5 at the bottom. Red, cluster specific central / median event. Blue, cluster specific `MAD`

.

Looking at the last 5 clusters we get:

layout(matrix(1:5,nr=5)) evtsE_noj[[6]] evtsE_noj[[7]] evtsE_noj[[8]] evtsE_noj[[9]] evtsE_noj[[10]]

Figure 23: Last 5 clusters **after jitter cancellation**. Cluster 6 at the top, cluster 10 at the bottom. Red, cluster specific central / median event. Blue, cluster specific `MAD`

. Notice the vertical scale change compared to the previous figure

The five numbers summary of the remaining jitters for each of the 10 clusters are:

t(sapply(evtsE_noj, function(l) summary(attr(l,"delta"),digits=3)))

Min. 1st Qu. Median Mean 3rd Qu. Max. [1,] -0.493 -0.253 0.02120 -0.01690 0.171 0.465 [2,] -0.534 -0.225 -0.03800 0.00695 0.286 0.509 [3,] -0.450 -0.271 0.02930 0.01750 0.270 0.496 [4,] -0.475 -0.193 0.01990 0.01360 0.227 0.468 [5,] -0.627 -0.320 0.04860 -0.05690 0.216 0.337 [6,] -0.530 -0.235 0.01760 0.01860 0.320 0.473 [7,] -0.513 -0.249 -0.00277 -0.01360 0.234 0.560 [8,] -0.514 -0.233 0.00456 0.00753 0.256 0.613 [9,] -0.574 -0.258 0.02530 0.08350 0.256 4.570 [10,] -0.955 -0.275 0.01010 -0.00546 0.255 2.960

## 10 "Brute force" superposition resolution

We are going to resolve (the most "obvious") superpositions by a "recursive peeling method":

- Events are detected and cut from the raw data
*or from an already peeled version of the data*. - The closest center (in term of Euclidean distance) to the event is found—the jitter is always evaluated and compensated for when the distances are computed.
- If the residual sum of squares (
`RSS`

), that is: (actual data - best center)\(^2\), is smaller than the squared norm of a cut, the best center is subtracted from the data on which detection was performed—jitter is again compensated for at this stage. - Go back to step 1 or stop.

We get a named list of cluster centers lists with:

centers = lapply(1:10, function(i) mk_center_list(attr(evtsE_noj[[i]],"positions"),lD)) names(centers) = paste("Cluster",1:10)

We then apply function `classify_and_align_evt`

defined in Sec. Definition of `classify_and_align_evt`

**to every** detected event:

round0 = lapply(as.vector(sp0),classify_and_align_evt, data=lD,centers=centers)

We can see how many events (for a total of `length(sp0)`

`1795`

) got unclassified with:

sum(sapply(round0, function(l) l[[1]] == '?'))

28

### 10.1 First peeling

Using `predict_data`

we get:

pred0 = predict_data(round0,centers)

We subtract `pred0`

from `lD`

:

lD1 = lD - pred0

We can compare the original data with the result of the "first peeling":

ii = 1:1500 + 0.9*15000 tt = ii/15000 par(mar=c(1,1,1,1)) plot(tt, lD[ii,1], axes = FALSE, type="l",ylim=c(-50,20), xlab="",ylab="") lines(tt, lD1[ii,1], col='red') lines(tt, lD[ii,2]-15, col='black') lines(tt, lD1[ii,2]-15, col='red') lines(tt, lD[ii,3]-25, col='black') lines(tt, lD1[ii,3]-25, col='red') lines(tt, lD[ii,4]-40, col='black') lines(tt, lD1[ii,4]-40, col='red')

Figure 24: 100 ms (between 0.9 and 1.0 s) of the locust data set. Black, original data; red, after first peeling.

### 10.2 Second peeling

We then take `lD1`

as our former `lD`

and we repeat the procedure detecting only on the recording site 1 (also using a slightly shorter box filter):

lDf <- filter(lD1,rep(1,3)/3) lDf[is.na(lDf)] <- 0 lDf.mad <- apply(lDf,2,mad) lDf <- t(t(lDf)/lDf.mad) bellow.thrs <- t(t(lDf) < thrs) lDfr <- lDf lDfr[bellow.thrs] <- 0 remove(lDf) (sp1 <- peaks(lDfr[,1],15))

eventsPos object with indexes of 247 events. Mean inter event interval: 1208.87 sampling points, corresponding SD: 1386.01 sampling points Smallest and largest inter event intervals: 18 and 9140 sampling points.

We then apply again function `classify_and_align_evt`

defined in Sec. Definition of `classify_and_align_evt`

**to every** detected event:

round1 = lapply(as.vector(sp1),classify_and_align_evt, data=lD1,centers=centers)

We can see how many events (for a total of `length(sp1)`

`247`

) got unclassified with:

sum(sapply(round1, function(l) l[[1]] == '?'))

60

We then get our prediction

pred1 = predict_data(round1,centers)

We subtract `pred1`

from `lD1`

:

lD2 = lD1 - pred1

We can compare the first peeling with the second one:

ii = 1:1500 + 0.9*15000 tt = ii/15000 par(mar=c(1,1,1,1)) plot(tt, lD1[ii,1], axes = FALSE, type="l",ylim=c(-50,20), xlab="",ylab="") lines(tt, lD2[ii,1], col='red') lines(tt, lD1[ii,2]-15, col='black') lines(tt, lD2[ii,2]-15, col='red') lines(tt, lD1[ii,3]-25, col='black') lines(tt, lD2[ii,3]-25, col='red') lines(tt, lD1[ii,4]-40, col='black') lines(tt, lD2[ii,4]-40, col='red')

Figure 25: 100 ms (between 0.9 and 1.0 s) of the locust data set. Black, first peeling; red, second peeling.

### 10.3 Third peeling

We then take `lD2`

as our former `lD1`

and we repeat the procedure detecting only on the recording site 2:

lDf <- filter(lD2,rep(1,3)/3) lDf[is.na(lDf)] <- 0 lDf.mad <- apply(lDf,2,mad) lDf <- t(t(lDf)/lDf.mad) bellow.thrs <- t(t(lDf) < thrs) lDfr <- lDf lDfr[bellow.thrs] <- 0 remove(lDf) (sp2 <- peaks(lDfr[,2],15))

eventsPos object with indexes of 124 events. Mean inter event interval: 2404.16 sampling points, corresponding SD: 2381.06 sampling points Smallest and largest inter event intervals: 16 and 10567 sampling points.

We then apply again function `classify_and_align_evt`

defined in Sec. Definition of `classify_and_align_evt`

**to every** detected event:

round2 = lapply(as.vector(sp2),classify_and_align_evt, data=lD2,centers=centers)

We can see how many events (for a total of `length(sp2)`

`124`

) got unclassified with:

sum(sapply(round2, function(l) l[[1]] == '?'))

22

We then get our prediction

pred2 = predict_data(round2,centers)

We subtract `pred2`

from `lD2`

:

lD3 = lD2 - pred2

We can compare the second peeling with the third one:

ii = 1:1500 + 0.9*15000 tt = ii/15000 par(mar=c(1,1,1,1)) plot(tt, lD2[ii,1], axes = FALSE, type="l",ylim=c(-50,20), xlab="",ylab="") lines(tt, lD3[ii,1], col='red') lines(tt, lD2[ii,2]-15, col='black') lines(tt, lD3[ii,2]-15, col='red') lines(tt, lD2[ii,3]-25, col='black') lines(tt, lD3[ii,3]-25, col='red') lines(tt, lD2[ii,4]-40, col='black') lines(tt, lD3[ii,4]-40, col='red')

Figure 26: 100 ms (between 0.9 and 1.0 s) of the locust data set. Black, second peeling; red, third peeling.

### 10.4 Fourth peeling

We then take `lD3`

as our former `lD2`

and we repeat the procedure detecting only on the recording site 3:

lDf <- filter(lD3,rep(1,3)/3) lDf[is.na(lDf)] <- 0 lDf.mad <- apply(lDf,2,mad) lDf <- t(t(lDf)/lDf.mad) bellow.thrs <- t(t(lDf) < thrs) lDfr <- lDf lDfr[bellow.thrs] <- 0 remove(lDf) (sp3 <- peaks(lDfr[,3],15))

eventsPos object with indexes of 99 events. Mean inter event interval: 2993.43 sampling points, corresponding SD: 3686.95 sampling points Smallest and largest inter event intervals: 34 and 21217 sampling points.

We then apply again function `classify_and_align_evt`

defined in Sec. Definition of `classify_and_align_evt`

**to every** detected event:

round3 = lapply(as.vector(sp3),classify_and_align_evt, data=lD3,centers=centers)

We can see how many events (for a total of `length(sp3)`

`99`

) got unclassified with:

sum(sapply(round3, function(l) l[[1]] == '?'))

21

We then get our prediction

pred3 = predict_data(round3,centers)

We subtract `pred3`

from `lD3`

:

lD4 = lD3 - pred3

We can compare the third peeling with the fourth one:

ii = 1:1500 + 3.9*15000 tt = ii/15000 par(mar=c(1,1,1,1)) plot(tt, lD3[ii,1], axes = FALSE, type="l",ylim=c(-50,20), xlab="",ylab="") lines(tt, lD4[ii,1], col='red') lines(tt, lD3[ii,2]-15, col='black') lines(tt, lD4[ii,2]-15, col='red') lines(tt, lD3[ii,3]-25, col='black') lines(tt, lD4[ii,3]-25, col='red') lines(tt, lD3[ii,4]-40, col='black') lines(tt, lD4[ii,4]-40, col='red')

Figure 27: 100 ms (between 3.9 and 4.0 s) of the locust data set. Black, third peeling; red, fourth peeling. **Different time frame than on the previous plots**.

### 10.5 Fifth peeling

We then take `lD4`

as our former `lD3`

and we repeat the procedure detecting only on the recording site 4:

lDf <- filter(lD4,rep(1,3)/3) lDf[is.na(lDf)] <- 0 lDf.mad <- apply(lDf,2,mad) lDf <- t(t(lDf)/lDf.mad) bellow.thrs <- t(t(lDf) < thrs) lDfr <- lDf lDfr[bellow.thrs] <- 0 remove(lDf) (sp4 <- peaks(lDfr[,4],15))

eventsPos object with indexes of 164 events. Mean inter event interval: 1806.91 sampling points, corresponding SD: 1902.64 sampling points Smallest and largest inter event intervals: 25 and 9089 sampling points.

`classify_and_align_evt`

defined in Sec. Definition of `classify_and_align_evt`

**to every** detected event:

round4 = lapply(as.vector(sp4),classify_and_align_evt, data=lD4,centers=centers)

We can see how many events (for a total of `length(sp4)`

`164`

) got unclassified with:

sum(sapply(round4, function(l) l[[1]] == '?'))

60

We then get our prediction

pred4 = predict_data(round4,centers)

We subtract `pred4`

from `lD4`

:

lD5 = lD4 - pred4

We can compare the fourth peeling with the fifth one:

ii = 1:1500 + 3.9*15000 tt = ii/15000 par(mar=c(1,1,1,1)) plot(tt, lD4[ii,1], axes = FALSE, type="l",ylim=c(-50,20), xlab="",ylab="") lines(tt, lD5[ii,1], col='red') lines(tt, lD4[ii,2]-15, col='black') lines(tt, lD5[ii,2]-15, col='red') lines(tt, lD4[ii,3]-25, col='black') lines(tt, lD5[ii,3]-25, col='red') lines(tt, lD4[ii,4]-40, col='black') lines(tt, lD5[ii,4]-40, col='red')

Figure 28: 100 ms (between 3.9 and 4.0 s) of the locust data set. Black, fourth peeling; red, fifth peeling.

### 10.6 General comparison

We can compare the raw data with the fifth peeling on the first second:

ii = 1:15000 tt = ii/15000 par(mar=c(1,1,1,1)) plot(tt, lD[ii,1], axes = FALSE, type="l",ylim=c(-50,20), xlab="",ylab="") lines(tt, lD5[ii,1], col='red') lines(tt, lD[ii,2]-15, col='black') lines(tt, lD5[ii,2]-15, col='red') lines(tt, lD[ii,3]-25, col='black') lines(tt, lD5[ii,3]-25, col='red') lines(tt, lD[ii,4]-40, col='black') lines(tt, lD5[ii,4]-40, col='red')

Figure 29: The first second of the locust data set. Black, raw data; red, fifth peeling.

We can also look at the remaining unclassified events (that don't look like any of our templates):

bad_ones = sapply(round4, function(l) l[[1]] == '?') r4BE = mkEvents(sp4[bad_ones], lD4) r4BE

Figure 30: The 60 remaining bad events after the fifth peeling.

### 10.7 Sixth peeling

We can push things one peeling step further detecting on all sites simultaneously like we did the first time:

lDf <- filter(lD5,rep(1,3)/3) lDf[is.na(lDf)] <- 0 lDf.mad <- apply(lDf,2,mad) lDf <- t(t(lDf)/lDf.mad) bellow.thrs <- t(t(lDf) < thrs) lDfr <- lDf lDfr[bellow.thrs] <- 0 remove(lDf) (sp5 <- peaks(apply(lDfr,1,sum),15))

eventsPos object with indexes of 234 events. Mean inter event interval: 1283.21 sampling points, corresponding SD: 1325.54 sampling points Smallest and largest inter event intervals: 17 and 8484 sampling points.

`classify_and_align_evt`

defined in Sec. Definition of `classify_and_align_evt`

**to every** detected event:

round5 = lapply(as.vector(sp5),classify_and_align_evt, data=lD5,centers=centers)

We can see how many events (for a total of `length(sp5)`

`234`

) got unclassified with:

sum(sapply(round5, function(l) l[[1]] == '?'))

184

This number is a large fraction indeed of the total number of events. We can look at the whole sample of events:

mkEvents(sp5, lD5)

Figure 31: Detected events after fifth peeling (detection done on all sites simultaneously).

They also do not look like any of our templates.

## 11 Getting the spike trains

Once we have decided to stop the peeling iterations we can extract our spike trains with:

round_all = c(round0,round1,round2,round3,round4,round5) spike_trains = lapply(paste("Cluster",1:10), function(cn) sapply(round_all[sapply(round_all, function(l) l[[1]]==cn)], function(l) l[[2]]+l[[3]])) names(spike_trains) = paste("Cluster",1:10)

## 12 Custom developed code

### 12.1 Individual function definitions

#### 12.1.1 `cstValueSgts`

definition

Formal parameters:

`x`

: a numeric vector.

Returns: an intgeger matrix with as many columns as segments where the derivative is null, the first row contains the index of the beginning of each segment, the second row contains the segments' lengths.

cstValueSgts <- function(x) { dx <- as.integer(abs(diff(x,2))/2 <= .Machine$double.eps) ## use "order 2" derivative estimates ## ddx is zero where the derivative did not change, ## it is one if the derivative becomes null and -1 ## is the derivative stops being null ddx <- diff(dx) ## s contains the indexes of the first ## points of segments where the ## derivative is null s <- (1:length(ddx))[ddx==1] sapply(s, function(b) { n <- ddx[(b+1):length(ddx)] c(b+1,min((1:length(n))[n==-1])) } ) }

#### 12.1.2 Definitions of generic function and time-series specific method for data exploration

Definition of the generic function:

explore <- function(x,...) { UseMethod("explore") }

Definition of the corresponding method for time series: Formal arguments:

`x`

: a`ts`

(time series) object.`offsetFactor`

: a numeric controlling the the spacing between the recording sites on the plot. Smaller values lead to closer spacing.`...`

: additional arguments passed to`matplot`

and`plot`

functions called inernally.

explore.ts <- function(x, offsetFactor=0.5, ...) { stopifnot(inherits(x,"ts")) yRange <- range(x,na.rm=TRUE) plotPara <- list(tlim = start(x)[1] + c(0,0.1), ylim = yRange, yMin = yRange[1], yMax = yRange[2], firstTime = start(x)[1], lastTime = end(x)[1], keepGoing = TRUE) nFct <- function() { leftTime <- plotPara$tlim[1] rightTime <- plotPara$tlim[2] timeRange <- rightTime - leftTime rightTime <- rightTime + timeRange if (rightTime > plotPara$lastTime) { cat("Recording end reached.\n ") rightTime <- plotPara$lastTime } plotPara$tlim <- c(rightTime - timeRange, rightTime) plotPara } fFct <- function() { leftTime <- plotPara$tlim[1] rightTime <- plotPara$tlim[2] timeRange <- rightTime - leftTime leftTime <- leftTime - timeRange if (leftTime < plotPara$firstTime) { cat("Recording end reached.\n ") leftTime <- plotPara$firstTime } plotPara$tlim <- c(leftTime, leftTime + timeRange) plotPara } qFct <- function() { plotPara$keepGoing <- FALSE plotPara } ## Function tFct definition ## Allows the user to change the recording duration displayed on the window ## The user is invited to enter a factor which will be used to multiply the ## present duration displayed. ## If the resulting duration is too long a warning is given and the whole ## recording is shown. ## If possible the center of the displayed window is conserved. tFct <- function() { presentWindowLength <- diff(range(plotPara$tlim)) tMessage <- paste("Present duration displayed: ", presentWindowLength, " \n", sep = "") tMessage <- paste(tMessage, "By what factor do you want to multiply it? \n", sep = "") theFactor <- as.numeric(readline(tMessage)) if (theFactor <= 0) { cat("A negative or null factor does not make sense.\n") return(plotPara) } ## End of conditional on theFactor <= 0 ## Check that the new display length is reasonable totalLength <- plotPara$lastTime - plotPara$firstTime if (theFactor * presentWindowLength >= totalLength) { cat("Cannot show more data than available but only the entire record.\n ") plotPara$tlim[1] <- plotPara$firstTime plotPara$tlim[2] <- plotPara$lastTime return(plotPara) } windowCenter <- plotPara$tlim[1] + presentWindowLength / 2 newLeft <- windowCenter - theFactor * presentWindowLength / 2 newRight <- windowCenter + theFactor * presentWindowLength / 2 if (!(newLeft >= plotPara$firstTime & newRight <= plotPara$lastTime)) { if (newLeft <= plotPara$firstTime) { cat("Cannot show data before the recording started, the displayed center wont be conserved.\n ") plotPara$tlim[1] <- plotPara$firstTime plotPara$tlim[2] <- plotPara$tlim[1] + theFactor * presentWindowLength } if (newRight >= plotPara$lastTime) { cat("Cannot show data after the recording ended, the displayed center wont be conserved.\n ") plotPara$tlim[2] <- plotPara$lastTime plotPara$tlim[1] <- plotPara$tlim[2] - theFactor * presentWindowLength } return(plotPara) } ## End of conditional on !(newLeft >= plotPara$firstTime & newRight <= plotPara$lastTime) plotPara$tlim[1] <- newLeft plotPara$tlim[2] <- newRight return(plotPara) } ## End of function tFct definition ## Function rFct definition ## Allows the user to change the maximal value displayed on the abscissa ## The user is invited to enter a value. rFct <- function() { leftTime <- plotPara$tlim[1] rightTime <- plotPara$tlim[2] timeRange <- rightTime - leftTime tMessage <- paste("Present latest time displayed: ", rightTime, "\n", sep = "") tMessage <- paste(tMessage, "What new latest time do want (return leaves things unchanged)? \n", sep = "") theNewTime <- as.numeric(readline(tMessage)) if (is.na(theNewTime)) { ## Nothing entered, leave things unchanged return(plotPara) } ## End of conditional on is.na(theFactor) if (theNewTime <= plotPara$firstTime) { ## This choice does not make sense cat("Cannot display data before recording started.\n") return(plotPara) } if (theNewTime > plotPara$lastTime) { cat("Recording end reached.\n ") rightTime <- plotPara$lastTime } else { if (theNewTime <= leftTime) { ## The new latest time entered is smaller that the earliest time displayed cat("The new latest time is smaller than the earliest, adjustement will be made.\n") leftTime <- theNewTime - timeRange if (leftTime < plotPara$firstTime) { cat("Adjustment requires a change in displayed duration.\n") leftTime <- plotPara$firstTime } } ## End of conditional on theNewTime <= leftTime rightTime <- theNewTime } ## End of conditional on theNewTime > plotPara$lastTime plotPara$tlim <- c(leftTime, rightTime) plotPara } ## Function lFct definition ## Allows the user to change the minimal value displayed on the abscissa ## The user is invited to enter a value. lFct <- function() { leftTime <- plotPara$tlim[1] rightTime <- plotPara$tlim[2] timeRange <- rightTime - leftTime tMessage <- paste("Present earliest time displayed: ", leftTime, "\n", sep = "") tMessage <- paste(tMessage, "What new earliest time do want (return leaves things unchanged)? \n", sep = "") theNewTime <- as.numeric(readline(tMessage)) if (is.na(theNewTime)) { ## Nothing entered, leave things unchanged return(plotPara) } ## End of conditional on is.na(theFactor) if (theNewTime >= plotPara$lastTime) { ## This choice does not make sense cat("Cannot display data after recording ended.\n") return(plotPara) } if (theNewTime < plotPara$firstTime) { cat("Recording start reached.\n ") leftTime <- plotPara$firstTime } else { if (theNewTime >= rightTime) { ## The new earliest time entered is larger that the latest time displayed cat("The new earliest time is larger than the latest, adjustement will be made.\n") rightTime <- theNewTime + timeRange if (rightTime > plotPara$lastTime) { cat("Adjustment requires a change in displayed duration.\n") rightTime <- plotPara$lastTime } } ## End of conditional on theNewTime <= leftTime leftTime <- theNewTime } ## End of conditional on theNewTime > plotPara$lastTime plotPara$tlim <- c(leftTime, rightTime) plotPara } ## Function yMaxFct definition ## Allows the user to change the maximal value displayed on the ordinate ## The user is invited to enter a value. yMaxFct <- function() { presentWindowRange <- range(plotPara$ylim) tMessage <- paste("Present range displayed: [", paste(presentWindowRange, collapse = ","), "] \n", sep = "") tMessage <- paste(tMessage, "What new maximal ordinate value do want (return goes back to maximum)? \n", sep = "") theFactor <- as.numeric(readline(tMessage)) if (is.na(theFactor)) { plotPara$ylim <- c(presentWindowRange[1],plotPara$yMax) return(plotPara) } ## End of conditional on is.na(theFactor) if (theFactor <= plotPara$ylim[1]) { cat("The maximum should be larger than the minimum.\n") return(plotPara) } ## End of conditional on theFactor <= plotPara$ylim[1] plotPara$ylim <- c(presentWindowRange[1],theFactor) return(plotPara) } ## End of function yMaxFct definition ## Function yMinFct definition ## Allows the user to change the minimal value displayed on the ordinate ## The user is invited to enter a value. yMinFct <- function() { presentWindowRange <- range(plotPara$ylim) tMessage <- paste("Present range displayed: [", paste(presentWindowRange, collapse = ","), "] \n", sep = "") tMessage <- paste(tMessage, "What new minimal ordinate value do want (return goes back to minimum)? \n", sep = "") theFactor <- as.numeric(readline(tMessage)) if (is.na(theFactor)) { plotPara$ylim <- c(plotPara$yMin, presentWindowRange[2]) return(plotPara) } ## End of conditional on is.na(theFactor) if (theFactor >= plotPara$ylim[2]) { cat("The minimum should be smaller than the maximum.\n") return(plotPara) } ## End of conditional on theFactor >= plotPara$ylim[2] plotPara$ylim <- c(theFactor, presentWindowRange[2]) return(plotPara) } ## End of function yMinFct definition show <- function(x, plotPara, ...) { s <- plotPara$tlim[1] e <- plotPara$tlim[2] y.m <- plotPara$ylim[1] y.M <- plotPara$ylim[2] m <- unclass(window(x,start=s,end=e)) if (class(m) == "matrix") { m <- apply(m,2,function(x) ifelse(x < y.m, y.m,x)) m <- apply(m,2,function(x) ifelse(x > y.M, y.M,x)) ns <- dim(m)[2] offset <- c(0,-(1:(ns-1))*(y.M-y.m)) m <- t(t(m)+offset*offsetFactor) matplot(m,type="l",lty=1,axes=FALSE,xlab="",ylab="",...) } else { m[m<y.m] <- y.m m[m>y.M] <- y.M plot(m,type="l",lty=1,axes=FALSE,xlab="",ylab="",ylim=c(y.m,y.M),...) } } plot.new() par(mar=c(0.5,0.5,0.5,0.5)) show(x,plotPara,...) myMessage <- "Make a choice:\n n or 'return' (next); f (former); l (lower abscissa limit); r (upper abscissa limit) \n t (time scale); Y (upper ordinate limit); y (lower ordinate limit); q (quit) \n " while(plotPara$keepGoing) { myChoice <- readline(myMessage) plotPara <- switch(myChoice, n = nFct(), f = fFct(), l = lFct(), r = rFct(), t = tFct(), Y = yMaxFct(), y = yMinFct(), q = qFct(), nFct() ) show(x,plotPara,...) } ## End of while loop on keepGoing dev.off() invisible() }

#### 12.1.3 `peaks`

and associated methods definition

Function `peaks`

returns an object (essentially a vector of indices) of class `eventsPos`

:

peaks <- function(x, minimalDist=15, notZero=1e-3) { dx <- c(0,diff(x,2)/2,0) dx[abs(dx) < notZero] <- 0 dx <- diff(sign(dx)) res <- (1:length(dx))[dx < 0] res <- res[-length(res)][diff(res) > minimalDist] attr(res,"call") <- match.call() attr(res,"nIDx") <- length(x) class(res) <- "eventsPos" res }

Method `as.eventsPos`

transforms a vector into an `eventsPos`

object, its formal arguments are:

`x`

: an integer vector with strictly increasing elements.`start`

: an integer, the sampling point at which observation started.`end`

: an integer, the sampling point at which observation ended.

as.eventsPos <- function(x, start, end ) { x <- as.integer(x) stopifnot(all(diff(x)>0)) if (missing(start)) start <- floor(x) if (missing(end)) end <- ceiling(x) stopifnot(all(x>=start)) stopifnot(all(x<=end)) attr(x,"call") <- match.call() attr(x,"nIDx") <- end-start+1 class(x) <- "eventsPos" x }

Method `print.eventsPos`

prints an `eventsPos`

object:

print.eventsPos <- function(x, ...) { cat("\neventsPos object with indexes of ", length(x)," events. \n", sep = "") cat(" Mean inter event interval: ", round(mean(diff(x)),digits=2), " sampling points, corresponding SD: ", round(sd(diff(x)),digits=2), " sampling points \n", sep = "") cat(" Smallest and largest inter event intervals: ", paste(range(diff(x)),collapse=" and "), " sampling points. \n\n",sep= "") }

#### 12.1.4 Definition of an `explore`

method for time series and `eventsPos`

objects

The formal parameters of the method are:

`x`

: an`eventsPos`

object.`y`

: a`ts`

(time series) object.`offsetFactor`

: a numeric controlling the the spacing between the recording sites on the plot. Smaller values lead to closer spacing.`events.pch`

: an integer of a character: the ploting character used to indicate events.`events.col`

: an integer or a character string coding the color used to indicate the event.`...`

: additional arguments passed to`matplot`

and`plot`

functions called inernally.

explore.eventsPos <- function(x,y, offsetFactor=0.5, events.pch=16, events.col=2, ...) { stopifnot(inherits(y,"ts")) yRange <- range(y,na.rm=TRUE) plotPara <- list(tlim = start(y)[1] + c(0,0.1), ylim = yRange, yMin = yRange[1], yMax = yRange[2], firstTime = start(y)[1], lastTime = end(y)[1], keepGoing = TRUE) nFct <- function() { leftTime <- plotPara$tlim[1] rightTime <- plotPara$tlim[2] timeRange <- rightTime - leftTime rightTime <- rightTime + timeRange if (rightTime > plotPara$lastTime) { cat("Recording end reached.\n ") rightTime <- plotPara$lastTime } plotPara$tlim <- c(rightTime - timeRange, rightTime) plotPara } fFct <- function() { leftTime <- plotPara$tlim[1] rightTime <- plotPara$tlim[2] timeRange <- rightTime - leftTime leftTime <- leftTime - timeRange if (leftTime < plotPara$firstTime) { cat("Recording end reached.\n ") leftTime <- plotPara$firstTime } plotPara$tlim <- c(leftTime, leftTime + timeRange) plotPara } qFct <- function() { plotPara$keepGoing <- FALSE plotPara } ## Function tFct definition ## Allows the user to change the recording duration displayed on the window ## The user is invited to enter a factor which will be used to multiply the ## present duration displayed. ## If the resulting duration is too long a warning is given and the whole ## recording is shown. ## If possible the center of the displayed window is conserved. tFct <- function() { presentWindowLength <- diff(range(plotPara$tlim)) tMessage <- paste("Present duration displayed: ", presentWindowLength, " \n", sep = "") tMessage <- paste(tMessage, "By what factor do you want to multiply it? \n", sep = "") theFactor <- as.numeric(readline(tMessage)) if (theFactor <= 0) { cat("A negative or null factor does not make sense.\n") return(plotPara) } ## End of conditional on theFactor <= 0 ## Check that the new display length is reasonable totalLength <- plotPara$lastTime - plotPara$firstTime if (theFactor * presentWindowLength >= totalLength) { cat("Cannot show more data than available but only the entire record.\n ") plotPara$tlim[1] <- plotPara$firstTime plotPara$tlim[2] <- plotPara$lastTime return(plotPara) } windowCenter <- plotPara$tlim[1] + presentWindowLength / 2 newLeft <- windowCenter - theFactor * presentWindowLength / 2 newRight <- windowCenter + theFactor * presentWindowLength / 2 if (!(newLeft >= plotPara$firstTime & newRight <= plotPara$lastTime)) { if (newLeft <= plotPara$firstTime) { cat("Cannot show data before the recording started, the displayed center wont be conserved.\n ") plotPara$tlim[1] <- plotPara$firstTime plotPara$tlim[2] <- plotPara$tlim[1] + theFactor * presentWindowLength } if (newRight >= plotPara$lastTime) { cat("Cannot show data after the recording ended, the displayed center wont be conserved.\n ") plotPara$tlim[2] <- plotPara$lastTime plotPara$tlim[1] <- plotPara$tlim[2] - theFactor * presentWindowLength } return(plotPara) } ## End of conditional on !(newLeft >= plotPara$firstTime & newRight <= plotPara$lastTime) plotPara$tlim[1] <- newLeft plotPara$tlim[2] <- newRight return(plotPara) } ## End of function tFct definition ## Function rFct definition ## Allows the user to change the maximal value displayed on the abscissa ## The user is invited to enter a value. rFct <- function() { leftTime <- plotPara$tlim[1] rightTime <- plotPara$tlim[2] timeRange <- rightTime - leftTime tMessage <- paste("Present latest time displayed: ", rightTime, "\n", sep = "") tMessage <- paste(tMessage, "What new latest time do want (return leaves things unchanged)? \n", sep = "") theNewTime <- as.numeric(readline(tMessage)) if (is.na(theNewTime)) { ## Nothing entered, leave things unchanged return(plotPara) } ## End of conditional on is.na(theFactor) if (theNewTime <= plotPara$firstTime) { ## This choice does not make sense cat("Cannot display data before recording started.\n") return(plotPara) } if (theNewTime > plotPara$lastTime) { cat("Recording end reached.\n ") rightTime <- plotPara$lastTime } else { if (theNewTime <= leftTime) { ## The new latest time entered is smaller that the earliest time displayed cat("The new latest time is smaller than the earliest, adjustement will be made.\n") leftTime <- theNewTime - timeRange if (leftTime < plotPara$firstTime) { cat("Adjustment requires a change in displayed duration.\n") leftTime <- plotPara$firstTime } } ## End of conditional on theNewTime <= leftTime rightTime <- theNewTime } ## End of conditional on theNewTime > plotPara$lastTime plotPara$tlim <- c(leftTime, rightTime) plotPara } ## Function lFct definition ## Allows the user to change the minimal value displayed on the abscissa ## The user is invited to enter a value. lFct <- function() { leftTime <- plotPara$tlim[1] rightTime <- plotPara$tlim[2] timeRange <- rightTime - leftTime tMessage <- paste("Present earliest time displayed: ", leftTime, "\n", sep = "") tMessage <- paste(tMessage, "What new earliest time do want (return leaves things unchanged)? \n", sep = "") theNewTime <- as.numeric(readline(tMessage)) if (is.na(theNewTime)) { ## Nothing entered, leave things unchanged return(plotPara) } ## End of conditional on is.na(theFactor) if (theNewTime >= plotPara$lastTime) { ## This choice does not make sense cat("Cannot display data after recording ended.\n") return(plotPara) } if (theNewTime < plotPara$firstTime) { cat("Recording start reached.\n ") leftTime <- plotPara$firstTime } else { if (theNewTime >= rightTime) { ## The new earliest time entered is larger that the latest time displayed cat("The new earliest time is larger than the latest, adjustement will be made.\n") rightTime <- theNewTime + timeRange if (rightTime > plotPara$lastTime) { cat("Adjustment requires a change in displayed duration.\n") rightTime <- plotPara$lastTime } } ## End of conditional on theNewTime <= leftTime leftTime <- theNewTime } ## End of conditional on theNewTime > plotPara$lastTime plotPara$tlim <- c(leftTime, rightTime) plotPara } ## Function yMaxFct definition ## Allows the user to change the maximal value displayed on the ordinate ## The user is invited to enter a value. yMaxFct <- function() { presentWindowRange <- range(plotPara$ylim) tMessage <- paste("Present range displayed: [", paste(presentWindowRange, collapse = ","), "] \n", sep = "") tMessage <- paste(tMessage, "What new maximal ordinate value do want (return goes back to maximum)? \n", sep = "") theFactor <- as.numeric(readline(tMessage)) if (is.na(theFactor)) { plotPara$ylim <- c(presentWindowRange[1],plotPara$yMax) return(plotPara) } ## End of conditional on is.na(theFactor) if (theFactor <= plotPara$ylim[1]) { cat("The maximum should be larger than the minimum.\n") return(plotPara) } ## End of conditional on theFactor <= plotPara$ylim[1] plotPara$ylim <- c(presentWindowRange[1],theFactor) return(plotPara) } ## End of function yMaxFct definition ## Function yMinFct definition ## Allows the user to change the minimal value displayed on the ordinate ## The user is invited to enter a value. yMinFct <- function() { presentWindowRange <- range(plotPara$ylim) tMessage <- paste("Present range displayed: [", paste(presentWindowRange, collapse = ","), "] \n", sep = "") tMessage <- paste(tMessage, "What new minimal ordinate value do want (return goes back to minimum)? \n", sep = "") theFactor <- as.numeric(readline(tMessage)) if (is.na(theFactor)) { plotPara$ylim <- c(plotPara$yMin, presentWindowRange[2]) return(plotPara) } ## End of conditional on is.na(theFactor) if (theFactor >= plotPara$ylim[2]) { cat("The minimum should be smaller than the maximum.\n") return(plotPara) } ## End of conditional on theFactor >= plotPara$ylim[2] plotPara$ylim <- c(theFactor, presentWindowRange[2]) return(plotPara) } ## End of function yMinFct definition show <- function(x, y, plotPara, ...) { s <- plotPara$tlim[1] e <- plotPara$tlim[2] y.m <- plotPara$ylim[1] y.M <- plotPara$ylim[2] firstIdx <- round(max(1,s*frequency(y))) lastIdx <- round(min(end(y)[1]*frequency(y),e*frequency(y))) ii <- firstIdx:lastIdx xx <- x[firstIdx <= x & x <= lastIdx] if (class(y)[1] == "mts") { m <- y[ii,] if (length(xx) > 0) mAtx <- as.matrix(y)[xx,,drop=FALSE] } else { m <- y[ii] if (length(xx) > 0) mAtx <- y[xx] } if (class(m) == "matrix") { m <- apply(m,2,function(x) ifelse(x < y.m, y.m,x)) m <- apply(m,2,function(x) ifelse(x > y.M, y.M,x)) ns <- dim(m)[2] offset <- c(0,-(1:(ns-1))*(y.M-y.m)) m <- t(t(m)+offset*offsetFactor) matplot(m,type="l",lty=1,axes=FALSE,xlab="",ylab="",...) if (length(xx) > 0) { mAtx <- t(t(mAtx)+offset*offsetFactor) matpoints(xx-ii[1]+1,mAtx,pch=events.pch,col=events.col) } } else { m[m<y.m] <- y.m m[m>y.M] <- y.M plot(m,type="l",lty=1,axes=FALSE,xlab="",ylab="",ylim=c(y.m,y.M),...) if (length(xx) > 0) points(xx-ii[1]+1,mAtx,pch=events.pch,col=events.col) } } plot.new() par(mar=c(0.5,0.5,0.5,0.5)) show(x,y,plotPara,...) myMessage <- "Make a choice:\n n or 'return' (next); f (former); l (lower abscissa limit); r (upper abscissa limit) \n t (time scale); Y (upper ordinate limit); y (lower ordinate limit); q (quit) \n " while(plotPara$keepGoing) { myChoice <- readline(myMessage) plotPara <- switch(myChoice, n = nFct(), f = fFct(), l = lFct(), r = rFct(), t = tFct(), Y = yMaxFct(), y = yMinFct(), q = qFct(), nFct() ) show(x,y,plotPara,...) } ## End of while loop on keepGoing dev.off() invisible() }

#### 12.1.5 Function `cutSglEvt`

Its formal parameters are:

`evtPos`

: a numeric or integer interpretable as an index, the posistion at which cuts will be produced.`data`

: a numeric vector of matrix containing the data. If vector the argument is converted as a single column matrix internally. The matrix rows are indexed by sampling points and its columns by recording sites / channels.`before`

: an integer: the number of sampling points within the cut before the reference time given by evtPos.`after`

: an integer: the number of sampling points after the reference time.

It returns:

- A numeric vector with the cut(s). When several recording sites are used the cuts of each individual sites are placed one after the other.

cutSglEvt <- function(evtPos, data, before=14, after=30 ) { evtPos <- as.integer(evtPos) ## make sure evtPos is an integer before <- as.integer(before) ## make sure before is an integer stopifnot(0 <= before) ## make sure before is positive or null after <- as.integer(after) stopifnot(0 <= after) ## make sure after is positive or null if (is.vector(data)) data <- matrix(data,nc=1) ns <- dim(data)[2] dl <- dim(data)[1] stopifnot(0<evtPos, evtPos<=dl) ## make sure evtPos is within range sl <- before+after+1 ## the length of the cut keep <- -before:after + evtPos within <- 1 <= keep & keep <= dl kw <- keep[within] res <- sapply(1:ns, function(idx) { v <- numeric(sl) v[within] <- data[kw,idx] v } ) as.vector(res) }

#### 12.1.6 Function `mkEvents`

Its formal parameters are:

`positions`

: an integer vector with events' positions as indices / sampling points or an`eventsPos`

object.`data`

: a numeric vector of matrix containing the data or a 'ts' or 'mts' object. If vector the argument is converted as a single column matrix internally. The matrix rows are indexed by sampling points and its columns by recording sites / channels.`before`

: an integer, the number of sampling points within the cut before the reference times given by`positions`

.`after`

: an integer, the number of sampling points within the cut after the reference times given by`positions`

.

It returns a matrix with before + after + 1 rows and as many columns as elements in positions. Each column is an "event", that is, a set of cuts on the data. Attribute "positions" contains the value of the argument with the same name and attribute "data" contains the name of the corresponding argument, attribute "before" contains the value of the argument with the same name, attribute "after" contains the value of the argument with the same name, attribute "numberOfSites" contains the number of recording sites. Attribute "delta" is used when events are realligned on their mean waveforms (during "jitter cancellation"). The returned matrix is given an "events" class.

mkEvents <- function(positions, data, before=14, after=30 ) { positions <- unclass(positions) data <- unclass(data) res <- sapply(positions, cutSglEvt, data, before, after) the.call <- match.call() attr(res,"positions") <- positions attr(res,"delta") <- NULL attr(res,"data") <- the.call[["data"]] attr(res,"before") <- before attr(res,"after") <- after attr(res,"numberOfSites") <- ifelse(is.matrix(data),dim(data)[2],1) attr(res,"call") <- match.call() class(res) <- "events" res }

#### 12.1.7 `events`

methods

summary.events <- function(object, ...) { b <- attr(object,"before") a <- attr(object,"after") ns <- attr(object,"numberOfSites") cat("\nevents object deriving from data set: ",attr(object,"data"),".\n",sep="") cat(" Events defined as cuts of ", a+b+1, " sampling points on each of the ",ns, " recording sites.\n",sep="") cat(" The 'reference' time of each event is located at point ", b+1, " of the cut.\n",sep="") if (!is.null(attr(object,"delta"))) { cat(" Events were realigned on median event.\n",sep="") } cat(" There are ", length(attr(object,"positions")), " events in the object.\n\n",sep="") }

"[.events" <- function(x,i,j,drop = FALSE) { y <- NextMethod("[") if (!missing(i)) return(NULL) if (is.matrix(y) && dim(y)[2] > 1) { attr(y,"positions") <- attr(x,"positions")[j] attr(y,"delta") <- attr(x,"delta") attr(y,"data") <- attr(x,"data") attr(y,"before") <- attr(x,"before") attr(y,"after") <- attr(x,"after") attr(y,"numberOfSites") <- attr(x,"numberOfSites") attr(y,"call") <- match.call() class(y) <- "events" } y }

t.events <- function(x) { t(unclass(x)) }

mean.events <- function(x,...) { apply(unclass(x),1,mean,...) } median.events <- function(x,na.rm = FALSE) { apply(unclass(x),1,median,na.rm) }

'-.events' <- function(e1,e2) { stopifnot(length(e2) == dim(e1)[1]) res <- unclass(e1)-e2 attr(res,"positions") <- attr(e1,"positions") attr(res,"delta") <- attr(e1,"delta") attr(res,"data") <- attr(e1,"data") attr(res,"before") <- attr(e1,"before") attr(res,"after") <- attr(e1,"after") attr(res,"numberOfSites") <- attr(e1,"numberOfSites") attr(res,"call") <- match.call() class(res) <- "events" res }

plot.events <- function(x, y=NULL, evts.lwd = 0.1, medAndMad = TRUE, evts.col = "black", med.col = "red", mad.col = "blue", x.bar = NULL, y.bar = NULL) { nsites <- attr(x,"numberOfSites") ne <- dim(x)[2] cl <- dim(x)[1]/nsites ylim <- range(x) matplot(x,type="n",xlab="",ylab="",axes=FALSE,ylim=ylim) if (nsites > 1) { ii <- 2*(1:(nsites %/% 2)) rect((ii-1)*cl,ylim[1],ii*cl,ylim[2],col="grey80",border=NA) } matlines(x,col=evts.col,lty=1,lwd=evts.lwd) if (medAndMad) { med <- apply(x,1,median) mad <- apply(x,1,mad) lines(med,col=med.col) lines(mad,col=mad.col) } if (!is.null(x.bar)) segments(x0=0,y0=ylim[1]+0.1*diff(ylim),x1=x.bar) if (!is.null(y.bar)) segments(x0=0,y0=ylim[1]+0.1*diff(ylim),y1=ylim[1]+0.1*diff(ylim)+y.bar) }

lines.events <- function(x, evts.lwd = 0.1, evts.col = "black", ... ) { matlines(x,col=evts.col,lty=1,lwd=evts.lwd,...) }

print.events <- function(x, ... ) { plot.events(x,...) }

#### 12.1.8 `mkNoise`

definition

mkNoise <- function(positions, data, before=14, after=30, safetyFactor=2, size=2000) { positions <- unclass(positions) data <- unclass(data) if (!is.matrix(data)) data <- matrix(data,nc=1) size <- as.integer(size) stopifnot(0 < size) ## make sure size is a positive integer sl <- before+after+1 ns <- dim(data)[2] i1 <- diff(positions) ## inter events intervals nbI <- (i1-round(safetyFactor*sl))%/%sl ## number of noise sweeps ## one can cut from each ## interval nbPossible <- min(size, sum((nbI)[nbI>0]) ) ## allocate next the memory for the noise events noiseMatrix <- matrix(0, nr=ns*sl, nc=nbPossible ) iV <- (1:length(i1))[nbI>0] ## A vector containing the indexes of ## the (inter event) intervals from ## which at least one noise sweep can be ## cut. iIdx <- 1 ## an index running over the inter event intervals from ## which noise events can be cut. nInI <- nbI[iV[iIdx]] ## the number of noise sweeps that can be cut ## from the "non empty" inter event interval ## iV[iIdx]. nIdx <- 1 ## An index running over the noise sweeps. noisePositions <- integer(nbPossible) while (nIdx <= nbPossible) { uInI <- 1 ## An index running over the noise sweeps that will be ## cut from a given "non empty" inter event interval. iPos <- positions[iV[iIdx]] + round(safetyFactor*sl) noisePositions[nIdx] <- iPos while (uInI <= nInI & nIdx <= nbPossible ) { ii <- (-before:after) + iPos ns <- as.vector(data[ii,]) noiseMatrix[,nIdx] <- ns nIdx <- nIdx + 1 iPos <- iPos + sl uInI <- uInI + 1 } ## End of while loop on uInI iIdx <- iIdx + 1 nInI <- nbI[iV[iIdx]] } ## End of while loop on nIdx the.call <- match.call() attr(noiseMatrix,"positions") <- noisePositions attr(noiseMatrix,"delta") <- NULL attr(noiseMatrix,"data") <- the.call[["data"]] attr(noiseMatrix,"before") <- before attr(noiseMatrix,"after") <- after attr(noiseMatrix,"numberOfSites") <- ifelse(is.matrix(data),dim(data)[2],1) attr(noiseMatrix,"call") <- match.call() class(noiseMatrix) <- "events" noiseMatrix }

#### 12.1.9 Definition of an `explore`

method for `pca`

results

explore.prcomp <- function(x, pc=1, ##<< an integer: the pc index to add to the mean. factor=2, ##<< a numeric, the scaling factor; that is, the plot shows mean +/- factor * pc. m.col="black", ##<< a character string or an integer, the color used for mean. u.col="red", ##<< a character string or an integer, the color used for mean + factor * pc. l.col="blue", ##<< a character string or an integer, the color used for mean - factor * pc. xlab="Index", ##<< a character string with the abscissa label. ylab="Amplitude", ##<< a character string with the ordinate label. main, ##<< a character string with the title. If 'missing' (default) one is automatically generated. ... ##<< additional arguments passed to 'plot'. ) { if (missing(main)) { w <- x$sdev[pc]^2/sum(x$sdev^2) main <- paste("PC ",pc," (",round(100*w,digits=1),"%)",sep="") } u <- x$center + factor * x$rotation[,pc] l <- x$center - factor * x$rotation[,pc] ylim=range(c(l,u)) plot(x$center,type="l",xlab=xlab,ylab=ylab,col=m.col,main=main,ylim=ylim,...) lines(u,col=u.col,...) lines(l,col=l.col,...) }

#### 12.1.10 Definition of `get_jitter`

We define a function that estimates the jitter given:

`evts`

: an event—or a matrix of events where individual events form the columns.`center`

: the 'central' (median) event on which the alignment will be performed.`centerD`

: the first derivative of the central event.`centerDD`

: the second derivative of the central event.

The functions returns a vector of estimated jitters giving the amount of sampling points by which the central event should be shifted in order to best match each individual events. This sampling jitter estimation is performed by a two stages procedure:

- Linear regression is first used get a first jitter estimation based on a first order Taylor expansion.
- A Newton-Raphson step is used to refine the first estimation (used as a starting point) based on a second order Taylor expansion.

get_jitter <- function(evts, center, centerD, centerDD){ centerD_norm2 <- sum(centerD^2) centerDD_norm2 <- sum(centerDD^2) centerD_dot_centerDD <- sum(centerD*centerDD) if (is.null(dim(evts))) evts <- matrix(evts, nc=1) evts <- evts - center h_dot_centerD <- centerD %*% evts delta0 <- h_dot_centerD/centerD_norm2 h_dot_centerDD <- centerDD %*% evts first <- -2*h_dot_centerD + 2*delta0*(centerD_norm2 - h_dot_centerDD) + 3*delta0^2*centerD_dot_centerDD + delta0^3*centerDD_norm2 second <- 2*(centerD_norm2 - h_dot_centerDD) + 6*delta0*centerD_dot_centerDD + 3*delta0^2*centerDD_norm2 as.vector(delta0 - first/second) }

#### 12.1.11 Definition of `mk_aligned_events`

We define a function taking a vector of spike times, that should all come from the same cluster and correspond to reasonably "clean" events, and three arguments corresponding to the last three arguments of `mkEvents`

, the function returns a `events`

object. The `positions`

attribute of the returned object gives the nearest sampling point to the actual peak. The `delta`

attribute gives the offset between the previous spike position and the "actual" peak position (the actual position is attribute `positions`

**minus** attribute `delta`

):

mk_aligned_events <- function(positions, data, before=14, after=30){ dataD = apply(data,2,function(x) c(0,diff(x,2)/2,0)) dataDD = apply(dataD,2,function(x) c(0,diff(x,2)/2,0)) evts = mkEvents(positions, data, before, after) evtsD = mkEvents(positions, dataD, before, after) evtsDD = mkEvents(positions, dataDD, before, after) evts_median = apply(evts,1,median) evtsD_median = apply(evtsD,1,median) evtsDD_median = apply(evtsDD,1,median) evts_jitter = get_jitter(evts,evts_median,evtsD_median,evtsDD_median) ## positions = positions-[round(x.item(0)) for x in np.nditer(evts_jitter)] positions = positions-round(evts_jitter) evts = mkEvents(positions, data, before, after) evtsD = mkEvents(positions, dataD, before, after) evtsDD = mkEvents(positions, dataDD, before, after) evts_median = apply(evts,1,median) evtsD_median = apply(evtsD,1,median) evtsDD_median = apply(evtsDD,1,median) evts_jitter = get_jitter(evts,evts_median,evtsD_median,evtsDD_median) res = unclass(evts) - evtsD_median %o% evts_jitter - evtsDD_median %o% evts_jitter^2/2 attributes(res) = attributes(evts) attr(res,"positions") <- positions attr(res,"call") <- match.call() attr(res,"delta") <- evts_jitter res }

#### 12.1.12 Definition of `mk_center_list`

We define function `mk_center_list`

that given:

`positions`

, a vector of spike times, that should all come from the same cluster and correspond to reasonably "clean" events,`data`

, a data matrix,`before`

, the number of sampling point to keep before the peak,`after`

, the number of sampling point to keep after the peak,

returns a nammed list with the following elements and their content:

`center`

: the estimate of the center (obtained from the median),`centerD`

: the estimate of the center's derivative (obtained from the median of events cut on the derivative of data),`centerDD`

: the estimate of the center's second derivative (obtained from the median of events cut on the second derivative of data),`centerD_norm2`

: the squared norm of the center's derivative,`centerDD_norm2`

: the squared norm of the center's second derivative,`centerD_dot_centerDD`

: the scalar product of the center's first and second derivatives,`center_idx`

: an array of indices going from`-before`

to`after`

.

The peeling procedure, requiers, for each cluster, estimates of its center and of its first two derivatives. Clusters' centers must moreover be built such that they can be used for subtraction, this implies that we should make them long enough, on both side of the peak, to see them go back to baseline. Formal parameters `before`

and `after`

should therefore be set to larger values than the ones used for clustering.

mk_center_list = function(positions, data, before=49, after=80) { dataD = apply(data,2,function(x) c(0,diff(x,2)/2,0)) dataDD = apply(dataD,2,function(x) c(0,diff(x,2)/2,0)) evts = mkEvents(positions, data, before, after) evtsD = mkEvents(positions, dataD, before, after) evtsDD = mkEvents(positions, dataDD, before, after) evts_median = apply(evts,1,median) evtsD_median = apply(evtsD,1,median) evtsDD_median = apply(evtsDD,1,median) list("center" = evts_median, "centerD" = evtsD_median, "centerDD" = evtsDD_median, "centerD_norm2" = sum(evtsD_median^2), "centerDD_norm2" = sum(evtsDD_median^2), "centerD_dot_centerDD" = sum(evtsD_median*evtsDD_median), "center_idx" = -before:after) }

#### 12.1.13 Definition of `classify_and_align_evt`

The formal parameters of `classify_and_align_evt`

are:

- a set of events' positions,
- a data matrix containing the raw (but normalized data),
- a named list of centers,
- arguments
`before`

and`after`

corresponding to arguments with those names in`mk_center_list`

,

returns a list with the following element:

- the name of the closest center in terms of Euclidean distance or "?" if none of the clusters' waveform does better than a uniformly null one,
- the new position of the event (the previous position corrected by the integer part of the estimated jitter),
- the remaining jitter.

classify_and_align_evt <- function(evt_pos, data, centers, before=14, after=30 ){ cluster_names = names(centers) n_sites = dim(data)[2] centersM = sapply(cluster_names, function(cn) centers[[cn]][["center"]][rep(-before <= centers[[cn]][["center_idx"]] & centers[[cn]][["center_idx"]] <= after, n_sites)]) evt = cutSglEvt(evt_pos,data=data,before=before, after=after) delta = -(centersM - evt) cluster_idx = which.min(apply(delta^2,2,sum)) good_cluster_name = cluster_names[cluster_idx] good_cluster_idx = rep(-before <= centers[[good_cluster_name]][["center_idx"]] & centers[[good_cluster_name]][["center_idx"]] <= after, n_sites) centerD = centers[[good_cluster_name]][["centerD"]][good_cluster_idx] centerD_norm2 = sum(centerD^2) centerDD = centers[[good_cluster_name]][["centerDD"]][good_cluster_idx] centerDD_norm2 = sum(centerDD^2) centerD_dot_centerDD = sum(centerD*centerDD) h = delta[,cluster_idx] h_order0_norm2 = sum(h^2) h_dot_centerD = sum(h*centerD) jitter0 = h_dot_centerD/centerD_norm2 h_order1_norm2 = sum((h-jitter0*centerD)^2) if (h_order0_norm2 > h_order1_norm2) { h_dot_centerDD = sum(h*centerDD) first = -2*h_dot_centerD + 2*jitter0*(centerD_norm2 - h_dot_centerDD) + 3*jitter0^2*centerD_dot_centerDD + jitter0^3*centerDD_norm2 second = 2*(centerD_norm2 - h_dot_centerDD) + 6*jitter0*centerD_dot_centerDD + 3*jitter0^2*centerDD_norm2 jitter1 = jitter0 - first/second h_order2_norm2 = sum((h-jitter1*centerD-jitter1^2/2*centerDD)^2) if (h_order1_norm2 <= h_order2_norm2) { jitter1 = jitter0 } } else { jitter1 = 0 } if (abs(round(jitter1)) > 0) { evt_pos = evt_pos - round(jitter1) evt = cutSglEvt(evt_pos,data=data,before=before, after=after) h = evt - centers[[good_cluster_name]][["center"]][good_cluster_idx] h_order0_norm2 = sum(h^2) h_dot_centerD = sum(h*centerD) jitter0 = h_dot_centerD/centerD_norm2 h_order1_norm2 = sum((h-jitter0*centerD)^2) if (h_order0_norm2 > h_order1_norm2) { h_dot_centerDD = sum(h*centerDD) first = -2*h_dot_centerD + 2*jitter0*(centerD_norm2 - h_dot_centerDD) + 3*jitter0^2*centerD_dot_centerDD + jitter0^3*centerDD_norm2 second = 2*(centerD_norm2 - h_dot_centerDD) + 6*jitter0*centerD_dot_centerDD + 3*jitter0^2*centerDD_norm2 jitter1 = jitter0 - first/second h_order2_norm2 = sum((h-jitter1*centerD-jitter1^2/2*centerDD)^2) if (h_order1_norm2 <= h_order2_norm2) { jitter1 = jitter0 } } else { jitter1 = 0 } } if (sum(evt^2) > sum((h-jitter1*centerD-jitter1^2/2*centerDD)^2)) return(list(cluster_names[cluster_idx], evt_pos, jitter1)) else return(list('?',evt_pos, jitter1)) }

#### 12.1.14 Definition of `predict_data`

The formal parameters of `predict_data`

are:

`class_pos_jitter_list`

: a list of lists. Each sub-list corresponds to a single detected event and as three components: the cluster of origin (or "?" if none was found), the position and the jitter.`centers_list`

: a list of list. Each sub-list is the result of a call to`mk_center_list`

.`nb_channels`

: an integer, the number of recording channels.`data_length`

: an integer, the length of the data.

The function returns a matrix with `data_length`

rows and `nb_channels`

columns with observations predictions based on `class_pos_jitter_list`

and `centers_list`

.

predict_data <- function(class_pos_jitter_list, centers_list, nb_channels=4, data_length=300000) { res = matrix(0,nc=nb_channels, nr=data_length) for (class_pos_jitter in class_pos_jitter_list) { cluster_name = class_pos_jitter[[1]] if (cluster_name != '?') { center = centers_list[[cluster_name]][["center"]] centerD = centers_list[[cluster_name]][["centerD"]] centerDD = centers_list[[cluster_name]][["centerDD"]] jitter = class_pos_jitter[[3]] pred = center + jitter*centerD + jitter^2/2*centerDD pred = matrix(pred,nc=nb_channels) idx = centers_list[[cluster_name]][["center_idx"]] + class_pos_jitter[[2]] within = 0 < idx & idx <= data_length kw = idx[within] res[kw,] = res[kw,] + pred[within,] } } res }