Title: | Splitting-Coalescence-Estimation Method |
---|---|
Description: | We introduce improved methods for statistically assessing birth seasonality and intra-annual variation. The first method we propose is a new idea that uses a nonparametric clustering procedure to group individuals with similar time series data and estimate birth seasonality based on the clusters. One can use the function SCEM() to implement this method. The second method estimates input parameters for use with a previously-developed parametric approach (Tornero et al., 2013). The relevant code for this approach is makeFits_OLS(), while makeFits_initial() is the code to implement the same method but with given initial conditions for two parameters. The latter can be used to show the disadvantage of the existing approach. One can use the function makeFits() to generate parametric birth seasonality estimates using either initialization. Detailed description can be found here: Chazin Hannah, Soudeep Deb, Joshua Falk, and Arun Srinivasan. (2019) "New Statistical Approaches to Intra-Individual Isotopic Analysis and Modeling Birth Seasonality in Studies of Herd Animals." <doi:10.1111/arcm.12432>. |
Authors: | Hannah Chazin [aut], Soudeep Deb [aut], Joshua Falk [aut], Arun Srinivasan [aut], Kyung Serk Cho [cre] |
Maintainer: | Kyung Serk Cho <[email protected]> |
License: | GPL-3 |
Version: | 1.1.0 |
Built: | 2025-01-22 04:13:05 UTC |
Source: | https://github.com/kserkcho/scem |
Archaeological faunal remains (24 sheep second molars) from Late Bronze Age (1500–1100 BCE) sites in the Tsaghkahovit Plain, Armenia
data("armenia")
data("armenia")
A data frame with 223 observations on the following 4 variables.
ID
a numeric vector
Subsample
a factor with levels A
B
C
D
E
F
G
H
I
J
distance
a numeric vector
oxygen
a numeric vector
H. Chazin, S. Deb, J Falk and A. SRINIVASAN
Chazin, Hannah, Soudeep Deb, Joshua Falk, and Arun Srinivasan. 2019. “New Statistical Approaches to Intra-Individual Isotopic Analysis and Modeling Birth Seasonality in Studies of Herd Animals.” Archaeometry 61 (2): 478–93
data(armenia)
data(armenia)
SCEM uses the residual sum of squares for each group to give a sense of the error in estimation. It is defined by:
\[RSS(S_q) = \sum_{k \in S_q} \sum_{i = 1}^{n_k} ||y_{k,i} - \hat{\mu}_{S_q} \left(\frac{i}{n_k}\right) - \hat{c}_{k}||^2\](See Chazin et al. 2019, Supplemental Materials 1).
The trend function for each individual time series is estimated non-parametrically by the local linear estimate (as discussed in Fan and Gijbels (1996)). Then, the common trend function for the group is estimated by taking the average over the group. Next, the shift functions are estimated as the differences from the individual trend functions and finally, the residual sum of squares are calculated using the original values, the common trend functions and the shifts.
calculateRSS(paths, S, bandwidth)
calculateRSS(paths, S, bandwidth)
paths |
A list of data frames, where each frame contains the data for one individual. Every data frame should have two columns with names 'distance' and 'oxygen'. |
S |
A vector of integers showing which individuals are considered in the group. |
bandwidth |
Denotes the order of the bandwidth that should be used in the estimation process. bandwidth = k will mean that the bandwidth is n^k. |
A vector of length equal to the group-size, so that each element is the RSS for the corresponding individual in the group.
armenia_split = split(armenia,f = armenia$ID) band = -0.33 p = length(armenia_split) calculateRSS(armenia_split,1:p,band)
armenia_split = split(armenia,f = armenia$ID) band = -0.33 p = length(armenia_split) calculateRSS(armenia_split,1:p,band)
This function converts the estimated parameters from the non-linear least squares (NLS) model fit to the appropriate parameter space corresponding to the cosine model proposed by Balasse et al (2012).
convertParameters(curve)
convertParameters(curve)
curve |
A fitted model object from nls function. The fitted model should have the following parameter estimates - amplitude, intercept, frequency, phase. |
A list containing the following components:
amplitude |
estimated amplitude |
intercept |
estimated intercept |
x0 |
delay of the data |
X |
period of the data |
birth |
birth seasonality estimate |
armenia_split = split(armenia,f = armenia$ID) curve = sineFit(armenia_split[[1]],method = "OLS") convertParameters(curve)
armenia_split = split(armenia,f = armenia$ID) curve = sineFit(armenia_split[[1]],method = "OLS") convertParameters(curve)
This function calculates an extended version of BIC, which is computed using a particular weighted average of the total residual sum of squares and the number of clusters.
SCEM uses the following equation for the BIC of each partition:
\[BIC(P) = (np)\log \left\lbrace\frac{RSS(P)}{np}\right\rbrace + |P|(B_{n}^{-1}-1) \log(nB_{n}),\]where \(RSS(P) = \sum_{q=1}^{Q} RSS(S_q)\).
The sample size of each individual time series (i.e. the number of observations) is denoted by n, but in dealing with archaeological data, not all the time series in a data set will have the same number of observations.
In order to have a reasonable representative value for the sample size, we have chosen to use the natural arithmetic mean \(n=(n_1+\dots+n_p)/p\).
\((B_{n}^{-1}-1)\log(nB_{n})\) is the tuning parameter that places the penalty on the number of clusters (also note that the term \(nB_{n}\)). Using a different tuning parameter \(\gamma_{n}\) in place of \((B_{n}^{-1}-1)\log(nB_{n})\) allows stronger or weaker penalties on the number of clusters.
EBIC(paths, partition, bandwidth)
EBIC(paths, partition, bandwidth)
paths |
A list of data frames, where each frame contains the data for one individual. Every data frame should have two columns with names 'distance' and 'oxygen'. |
partition |
A list of vectors. Each element in the list is a vector of integers, corresponding to individuals considered in one group. |
bandwidth |
Denotes the order of the bandwidth that should be used in the estimation process. bandwidth = k will mean that the bandwidth is n^k. |
Value of the extended BIC function for the partition.
armenia_split = split(armenia,f = armenia$ID) band = -0.33 p = length(armenia_split) EBIC(armenia_split,1:p,band)
armenia_split = split(armenia,f = armenia$ID) band = -0.33 p = length(armenia_split) EBIC(armenia_split,1:p,band)
The trend function for each individual time series is estimated non-parametrically by the local linear estimate (as discussed in Fan and Gijbels (1996)). Detailed description can be found in Chazin et al. 2019, Supplemental Materials 1.
EstTrend(y, time, bandwidth)
EstTrend(y, time, bandwidth)
y |
A vector of time series observations. |
time |
A vector of time points where the value of the trend needs to be estimated. |
bandwidth |
Denotes the order of the bandwidth that should be used in the estimation process. bandwidth = k will mean that the bandwidth is n^k. |
A vector of estimated values for the trend function at the given time-points.
armenia_split = split(armenia,f = armenia$ID) band = -0.33 z = armenia_split[[1]]$oxygen n = length(z) ndx = (1:n)/n EstTrend(z,ndx,band)
armenia_split = split(armenia,f = armenia$ID) band = -0.33 z = armenia_split[[1]]$oxygen n = length(z) ndx = (1:n)/n EstTrend(z,ndx,band)
This function performs the iteration step. Detailed description can be found in Chazin et al. 2019, Supplemental Materials 1.
iteration(paths, U, bandwidth)
iteration(paths, U, bandwidth)
paths |
A list of data frames, where each frame contains the data for one individual. Every data frame should have two columns with names 'distance' and 'oxygen'. |
U |
A list of vectors. Each element in the list is a vector of integers, corresponding to individuals considered in one group. |
bandwidth |
Denotes the order of the bandwidth that should be used in the estimation process. bandwidth = k will mean that the bandwidth is n^k. |
A list containing the following components:
S1 |
A set of individuals who are in the cluster |
U |
A set of individuals to be used in the next iteration. |
## Not run: armenia_split = split(armenia,f = armenia$ID) band = -0.33 p = length(armenia_split) iteration(armenia_split,1:p,band) ## End(Not run)
## Not run: armenia_split = split(armenia,f = armenia$ID) band = -0.33 p = length(armenia_split) iteration(armenia_split,1:p,band) ## End(Not run)
Calculates the value of the Epanechnikov kernel function for any vector.
kernel(v)
kernel(v)
v |
A vector of real numbers. |
A vector of the calculated kernel values for the input vector.
Epanechnikov, V. A. (1969). Non-parametric estimation of a multivariate probability density. Theory of Probability and its Applications, 14(1), 153-6.
x = runif(10) kernel(x)
x = runif(10) kernel(x)
This function performs the nonlinear least squares (NLS) regression method for the cosine model. It fits the NLS method as required, and then computes different quantities for the birth seasonality estimates corresponding to different individuals.
makeFits( paths, amplitude = NULL, intercept = NULL, method = c("OLS", "initial") )
makeFits( paths, amplitude = NULL, intercept = NULL, method = c("OLS", "initial") )
paths |
A list of data frames, where each frame contains the data for one individual. Every data frame should have two columns with names 'distance' and 'oxygen'. |
amplitude |
Initial value for the amplitude parameter for the |
intercept |
Initial value for the intercept parameter for the |
method |
A character string giving the initialization for the nonlinear least squares regression. This must be either |
A data frame containing the following components:
amplitude |
estimated amplitude |
intercept |
estimated intercept |
x0 |
delay of the data |
X |
period of the data |
birth |
birth seasonality estimate |
predictedMin |
predicted minimum for the oxygen isotope variable |
predictedMax |
predicted maximum for the oxygen isotope variable |
observedMin |
observed minimum for the oxygen isotope variable |
observedMax |
observed minimum for the oxygen isotope variable |
MSE |
mean squared error corresponding to the model fit for every individual |
Pearson |
Pearson's R^2 corresponding to the model fit for every individual |
armenia_split = split(armenia,f = armenia$ID) amp = seq(1,10,by=0.5) int = seq(-25,0,by=0.5) makeFits(armenia_split,amp[1],int[1],method = "initial") makeFits(armenia_split, method = "OLS")
armenia_split = split(armenia,f = armenia$ID) amp = seq(1,10,by=0.5) int = seq(-25,0,by=0.5) makeFits(armenia_split,amp[1],int[1],method = "initial") makeFits(armenia_split, method = "OLS")
Performs the nonlinear least squares (NLS) regression method for the cosine model, with the given initial values for amplitude and intercept. It fits the NLS method as required, and then computes different quantities for the birth seasonality estimates corresponding to different individuals.
makeFits_initial(paths, amplitude, intercept)
makeFits_initial(paths, amplitude, intercept)
paths |
A list of data frames, where each frame contains the data for one individual. Every data frame should have two columns with names 'distance' and 'oxygen'. |
amplitude |
Initial value for the amplitude parameter. |
intercept |
Initial value for the intercept parameter. |
A data frame containing the following components:
amplitude |
estimated amplitude |
intercept |
estimated intercept |
x0 |
delay of the data |
X |
period of the data |
birth |
birth seasonality estimate |
predictedMin |
predicted minimum for the oxygen isotope variable |
predictedMax |
predicted maximum for the oxygen isotope variable |
observedMin |
observed minimum for the oxygen isotope variable |
observedMax |
observed minimum for the oxygen isotope variable |
MSE |
mean squared error corresponding to the model fit for every individual |
Pearson |
Pearson's R^2 corresponding to the model fit for every individual |
armenia_split = split(armenia,f = armenia$ID) amp = seq(1,10,by=0.5) int = seq(-25,0,by=0.5) makeFits_initial(armenia_split,amp[1],int[1])
armenia_split = split(armenia,f = armenia$ID) amp = seq(1,10,by=0.5) int = seq(-25,0,by=0.5) makeFits_initial(armenia_split,amp[1],int[1])
Performs the nonlinear least squares (NLS) regression method for the cosine model, with the proposed initialization for all the parameters. It fits the NLS method as required, and then computes different quantities for the birth seasonality estimates corresponding to different individuals.
makeFits_OLS(paths)
makeFits_OLS(paths)
paths |
A list of data frames, where each frame contains the data for one individual. Every data frame should have two columns with names 'distance' and 'oxygen'. |
A data frame containing the following components:
amplitude |
estimated amplitude |
intercept |
estimated intercept |
x0 |
delay of the data |
X |
period of the data |
birth |
birth seasonality estimate |
predictedMin |
predicted minimum for the oxygen isotope variable |
predictedMax |
predicted maximum for the oxygen isotope variable |
observedMin |
observed minimum for the oxygen isotope variable |
observedMax |
observed minimum for the oxygen isotope variable |
MSE |
mean squared error corresponding to the model fit for every individual |
Pearson |
Pearson's R^2 corresponding to the model fit for every individual |
armenia_split = split(armenia,f = armenia$ID) makeFits_OLS(armenia_split)
armenia_split = split(armenia,f = armenia$ID) makeFits_OLS(armenia_split)
This function performs the iterative clustering algorithm on the archaeological time series data. Detailed description can be found in Chazin et al. 2019, Supplemental Materials 1.
SCalgo(paths, bandwidth)
SCalgo(paths, bandwidth)
paths |
A list of data frames, where each frame contains the data for one individual. There should be two columns with names 'distance' and 'oxygen'. |
bandwidth |
Denotes the order of the bandwidth that should be used in the splitting-coalescence (SC) clustering algorithm. A value k will mean that the bandwidth used in the algorithm is n^k. |
A list of vectors where each vector gives the indexes of the individuals to be assigned in the same cluster.
## Not run: armenia_split = split(armenia,f = armenia$ID) band = -0.33 results = SCalgo(armenia_split,bandwidth = band) ## End(Not run)
## Not run: armenia_split = split(armenia,f = armenia$ID) band = -0.33 results = SCalgo(armenia_split,bandwidth = band) ## End(Not run)
This function performs the clustering algorithm SCEM on the bivariate time series data – where one series is the distance from the cementum-enamel junction, and the other series is the value of the oxygen-18 isotope at that distance. It returns the class assignments and birth seasonality estimates for all the individuals.
The SCEM assumes that the oxygen isotope values (z(t)) can be expressed as a function of x(t), the natural logarithm of the distance from the CEJ for each of the incremental samples, scaled down by the period (X, the length of the tooth crown). In other words,
\[z(t) = f(x(t)/X) + e(t)\]where the form of f is unknown and e(t) is an error process. Also, following our definition, it assumes that the value of x(t)/X that maximizes z(t) is the estimated birth seasonality.
Birth seasonality is estimated using the combined data from all individuals in a single cluster, but birth seasonality estimates for individuals in a cluster are based on individual estimates of the length of the tooth crown (\(X_{k}\)).
For a detailed description of the algorithm, please see Chazin et al. 2019, Supplemental Materials 1.
SCEM(paths, bandwidth)
SCEM(paths, bandwidth)
paths |
A list of data frames, where each frame contains the data for one individual. There should be two columns with names 'distance' and 'oxygen'. |
bandwidth |
Denotes the order of the bandwidth that should be used in the splitting-coalescence (SC) clustering algorithm. A value k will mean that the bandwidth used in the algorithm is n^k. |
A list containing the following components:
results |
A data frame that has the individual information (ID, species, number of observations in the time series), cluster assignment, estimated period, delay and the birth seasonality estimate for every individual. |
groups |
The groups formed by the clustering algorithm |
Chazin, Hannah, Soudeep Deb, Joshua Falk, and Arun Srinivasan. 2019. “New Statistical Approaches to Intra-Individual Isotopic Analysis and Modeling Birth Seasonality in Studies of Herd Animals.” Archaeometry 61 (2): 478–93.
## Not run: armenia_split = split(armenia,f = armenia$ID) results = SCEM(armenia_split,bandwidth = -0.33) ## End(Not run)
## Not run: armenia_split = split(armenia,f = armenia$ID) results = SCEM(armenia_split,bandwidth = -0.33) ## End(Not run)
Performs the updated nonlinear least squares (NLS) regression method for the cosine model proposed by Balasse et al. The method calculates with the proposed initial values for ampliitude and intercept, and then fits the NLS method as required.
sine_initial(data, amplitude, intercept)
sine_initial(data, amplitude, intercept)
data |
A data frame that contains the data for one individual. There should be two columns with names 'distance' and 'oxygen'. |
amplitude |
Initial value for the amplitude parameter. |
intercept |
Initial value for the intercept parameter. |
A fitted model object from the nls function in R:
m |
an 'nlsModel' object incorporating the model. |
convInfo |
a list with convergence information |
data |
the expression that was passed to 'nls' as the data argument. The actual data values are present in the environment of the 'm' component. |
call |
the matched call with several components, notably 'algorithm' |
dataClasses |
the '"dataClasses"' attribute (if any) of the '"terms"' attribute of the model frame. |
control |
the control 'list' used |
Florent Baty, Christian Ritz, Sandrine Charles, Martin Brutsche, Jean-Pierre Flandrois, Marie-Laure Delignette-Muller (2015). A Toolbox for Nonlinear Regression in R: The Package nlstools. Journal of Statistical Software, 66(5), 1-21. URL http://www.jstatsoft.org/v66/i05/.
armenia_split = split(armenia,f = armenia$ID) amp = seq(1,10,by=0.5) int = seq(-25,0,by=0.5) sine_initial(armenia_split[[2]],amp[3],int[4])
armenia_split = split(armenia,f = armenia$ID) amp = seq(1,10,by=0.5) int = seq(-25,0,by=0.5) sine_initial(armenia_split[[2]],amp[3],int[4])
Performs the updated nonlinear least squares (NLS) regression method for the cosine model (see Chazin et al. 2019).
sine_OLS(data)
sine_OLS(data)
data |
A data frame that contains the data for one individual. There should be two columns with names 'distance' and 'oxygen'. |
A fitted model object from the nls function in R:
m |
an 'nlsModel' object incorporating the model. |
convInfo |
a list with convergence information |
data |
the expression that was passed to 'nls' as the data argument. The actual data values are present in the environment of the 'm' component. |
call |
the matched call with several components, notably 'algorithm' |
dataClasses |
the '"dataClasses"' attribute (if any) of the '"terms"' attribute of the model frame. |
control |
the control 'list' used |
Florent Baty, Christian Ritz, Sandrine Charles, Martin Brutsche, Jean-Pierre Flandrois, Marie-Laure Delignette-Muller (2015). A Toolbox for Nonlinear Regression in R: The Package nlstools. Journal of Statistical Software, 66(5), 1-21. URL http://www.jstatsoft.org/v66/i05/.
armenia_split = split(armenia,f = armenia$ID) sine_OLS(armenia_split[[1]])
armenia_split = split(armenia,f = armenia$ID) sine_OLS(armenia_split[[1]])
This function performs the updated nonlinear least squares (NLS) regression method for the cosine model (see Chazin et al. 2019).
sineFit(data, amplitude = NULL, intercept = NULL, method = c("OLS", "initial"))
sineFit(data, amplitude = NULL, intercept = NULL, method = c("OLS", "initial"))
data |
A data frame that contains the data for one individual. There should be two columns with names 'distance' and 'oxygen'. |
amplitude |
Initial value for the amplitude parameter for the |
intercept |
Initial value for the intercept parameter for the |
method |
A character string giving the initialization for the nonlinear least squares regression. This must be either |
A fitted model object from the nls function in R:
m |
an 'nlsModel' object incorporating the model. |
convInfo |
a list with convergence information |
data |
the expression that was passed to 'nls' as the data argument. The actual data values are present in the environment of the 'm' component. |
call |
the matched call with several components, notably 'algorithm' |
dataClasses |
the '"dataClasses"' attribute (if any) of the '"terms"' attribute of the model frame. |
control |
the control 'list' used |
Florent Baty, Christian Ritz, Sandrine Charles, Martin Brutsche, Jean-Pierre Flandrois, Marie-Laure Delignette-Muller (2015). A Toolbox for Nonlinear Regression in R: The Package nlstools. Journal of Statistical Software, 66(5), 1-21. URL http://www.jstatsoft.org/v66/i05/.
armenia_split = split(armenia,f = armenia$ID) amp = seq(1,10,by=0.5) int = seq(-25,0,by=0.5) sineFit(armenia_split[[2]],amp[3],int[4],method = "initial") sineFit(armenia_split[[1]],method = "OLS")
armenia_split = split(armenia,f = armenia$ID) amp = seq(1,10,by=0.5) int = seq(-25,0,by=0.5) sineFit(armenia_split[[2]],amp[3],int[4],method = "initial") sineFit(armenia_split[[1]],method = "OLS")