A method created by Marco C. Campi and Erik Weyer




The need for a set of models

Even when a system $S$ belongs to the model class, we cannot expect that a model $\hat{M}$ identified from a finite number of data points coincides with $S$, since a number of noise sources affect our data: measurement noise, disturbances acting on the system, etc.

As a consequence, under general circumstances the only probabilistic claim we are in a position to make is that

\begin{displaymath}
Pr \{ \hat{M} = S\} = 0,
\end{displaymath}

which is clearly a useless statement if our intention is to credit the model with reliability.

To obtain certified reliability results, one can move from nominal-model to model-set identification.

The situation is depicted in Figure 1: for any number of data points $N$, the parameter estimate is affected by random fluctuation, so that it has probability zero of exactly hitting the true system parameter value
$\theta^0$. Considering a region around the estimate, however, raises the probability that $\theta^0$ belongs to the region to a nonzero - and therefore meaningful - value.

\includegraphics[width=5.5cm]{region-around-estimate.eps}

Fig.1  Random fluctuation in parameter estimate.



This observation points to a simple, but important fact:

exploiting the information conveyed by a finite number of data points can at best generate a model set to which the system belongs. Any further attempt to provide sharper estimation results (e.g. one single model) goes beyond the information available in the data and generates results that cannot be guaranteed.


Challenging the reader with a preliminary example

Consider the system

\begin{displaymath}
y_t + \theta^0 y_{t-1} = w_t.
\end{displaymath}

Assume we know that $w_t$ is an independent process with symmetric distribution around zero. Apart from this, no knowledge on $w_t$ is assumed: it can have any (unknown) distribution - Gaussian, uniform, etc. - and its variance can be any (unknown) number, from very small to very large. We do not even make any stationarity assumption on $w_t$, and we allow its distribution to vary with time.
The assumption that
$w_t$ is independent can be interpreted by saying that we know the system structure: it is an autoregressive system of order 1.

9 data points were generated according to the system and they are shown in Figure 2.
Can you guess the value of
$\theta^0$ or give a confidence region for it?

\includegraphics[width=7cm]{preview-y-data.eps}

Fig. 2  Data for the preliminary example.


To form a guaranteed confidence region for $\theta^0$, we use LSCR.

Rewrite the system as a model with generic parameter
$\theta$:
\begin{displaymath}
y_t + \theta y_{t-1} = w_t.
\end{displaymath}
The predictor and prediction error associated with the model are

\begin{displaymath}
\hat{y}_t(\theta) = - \theta y_{t-1}, \ \ \
\epsilon_t(\theta) = y_t - \hat{y}_{t}(\theta) = y_t + \theta y_{t-1}.
\end{displaymath}

Next we compute the prediction errors $\epsilon_t(\theta)$ for $t=1,\ldots,8$, and calculate

\begin{displaymath}
f_{t-1}(\theta) = \epsilon_{t-1}(\theta)\epsilon_t(\theta), \ \ \ t=2,\ldots,8.
\end{displaymath}

Note that the $f_{t-1}(\theta)$ are functions of $\theta$ that can indeed be computed from the available data set. Then, we take different averages of these functions. Precisely, we form 8 averages of the form:

\begin{displaymath}
g_i(\theta) = {1 \over 4} \sum_{k \in I_i} f_k(\theta), \ \ \ i=1,\ldots,8,
\end{displaymath}
where the sets $I_i$ are subsets of $\{1,\ldots,7\}$ containing the elements highlighted by a bullet in the table below. For instance: $I_1 = \{1,2,4,5\}$, $I_2 = \{1,3,4,6\}$, etc. The functions $g_i(\theta)$, $i=1,\ldots,7$, can be interpreted as empirical 1-step correlations of the prediction error.

   k:    1   2   3   4   5   6   7
 I_1:    •   •       •   •
 I_2:    •       •   •       •
 I_3:        •   •       •   •
 I_4:    •   •               •   •
 I_5:    •       •       •       •
 I_6:        •   •   •           •
 I_7:                •   •   •   •
 I_8:    (empty set)


Functions
$g_i(\theta)$, $i=1,\ldots,7$, obtained for the data in Figure 2 are displayed in Figure 3.


\includegraphics[width=7cm]{preview-correlations.eps}

Fig.3  The $g_i(\theta)$ functions.


Can you now guess the value of $\theta^0$?

Let me give you a hint: the $g_i(\theta)$ functions have a tendency to intersect the $\theta$-axis near $\theta^0$. Why is it so? Let us re-write one of these functions, say $g_1(\theta)$, for $\theta = \theta^0$ as follows:

\begin{displaymath}
g_1(\theta^0) = {1 \over 4} \sum_{k \in I_1} \epsilon_k(\theta^0)\epsilon_{k+1}(\theta^0) = {1 \over 4}\left( w_1 w_2 + w_2 w_3 + w_4 w_5 + w_5 w_6 \right).
\end{displaymath}

The right-hand side is zero-mean and, due to averaging, the "vertical displacement" from the zero line is reduced. So, we would like to claim that $\theta^0$ will be "somewhere" near where the average functions intersect the $\theta$-axis.

While the above reasoning makes sense, LSCR provides us with a much stronger claim:

RESULT: discard the leftmost and rightmost regions where at most one of the $g_i(\theta)$ functions is greater than zero or at most one is less than zero. The resulting interval [-0.04, 0.48] (see Figure 3) is a confidence region that contains $\theta^0$ with probability exactly equal to 0.5.
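For readers who want to experiment with this construction, the following is a minimal Python/NumPy sketch of the procedure just described. The data generation ($\theta^0 = 0.2$ and uniform noise are the simulation settings reported in the comments below), the zero initial condition, the grid and the random seed are illustrative assumptions; the subsets $I_i$ are those of the table above.

\begin{verbatim}
import numpy as np

# Subsets I_1,...,I_7 from the table above (Gordon's construction); I_8 is the
# empty set, whose average g_8 is identically zero and never counts as > 0 or < 0.
SUBSETS = [{1, 2, 4, 5}, {1, 3, 4, 6}, {2, 3, 5, 6}, {1, 2, 6, 7},
           {1, 3, 5, 7}, {2, 3, 4, 7}, {4, 5, 6, 7}]

def lscr_region_ar1(y, theta_grid, q=2):
    """Grid points kept by the rule: at least q of the g_i(theta) above zero
    and at least q below zero (q = 2, M = 8 gives probability 1 - 2q/M = 0.5)."""
    keep = np.zeros(len(theta_grid), dtype=bool)
    for j, theta in enumerate(theta_grid):
        eps = y[1:] + theta * y[:-1]          # eps_t(theta) = y_t + theta*y_{t-1}, t = 1,...,8
        f = eps[:-1] * eps[1:]                # f_k(theta) = eps_k * eps_{k+1}, k = 1,...,7
        g = [sum(f[k - 1] for k in I) / 4.0 for I in SUBSETS]
        keep[j] = sum(gi > 0 for gi in g) >= q and sum(gi < 0 for gi in g) >= q
    return theta_grid[keep]

# Hypothetical data (NOT the data of Figure 2, which are not listed in the text)
rng = np.random.default_rng(0)
theta0, y = 0.2, np.zeros(9)                  # y_0 = 0 assumed
w = rng.uniform(-1, 1, size=9)
for t in range(1, 9):
    y[t] = -theta0 * y[t - 1] + w[t]

region = lscr_region_ar1(y, np.linspace(-1, 1, 2001))
if region.size:
    print("confidence interval ~ [%.3f, %.3f]" % (region.min(), region.max()))
\end{verbatim}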


Comments. 

1. The interval is stochastic because it depends on the data; the true parameter value $\theta^0$ is not: it has a fixed location that does not depend on any random element. Thus, what the above RESULT says is that the random interval contains the true parameter value $\theta^0$ in 50% of the cases.

\includegraphics[width=3.5cm]{preview-more-trials.eps}

Fig.4  10 more trials.


To better understand the nature of the result, we performed 10 more simulation trials, obtaining the results in Figure 4. In these trials, $\theta^0 = 0.2$ and $w_t$ was an independent process with uniform distribution between -1 and +1. (A numerical check of the coverage claim is sketched after these comments.)

2. In this example, the probability is low (50%) and the interval is rather large. With more data, we obtain smaller intervals and higher probabilities.

3. The LSCR algorithm was applied with no knowledge about the noise level or distribution and, yet, it returned an interval whose probability was exact, not an upper bound. The key is that the above RESULT is a "pivotal" result as the probability remains the same no matter what the noise characteristics are.


4. LSCR works along a totally different inference principle from standard Prediction Error Minimization (PEM) methods. In particular - differently from the asymptotic theory of PEM - LSCR does not construct the confidence region by quantifying the variability in the estimate.
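The "exactly 50%" coverage claim can also be checked numerically. Below is a hedged Monte Carlo sketch under the simulation settings of Figure 4 ($\theta^0 = 0.2$, uniform noise); the number of trials, the seed and the zero initial condition are arbitrary choices.

\begin{verbatim}
import numpy as np

SUBSETS = [{1, 2, 4, 5}, {1, 3, 4, 6}, {2, 3, 5, 6}, {1, 2, 6, 7},
           {1, 3, 5, 7}, {2, 3, 4, 7}, {4, 5, 6, 7}]

def region_contains(y, theta, q=2):
    """True if theta belongs to the LSCR region built from the data y."""
    eps = y[1:] + theta * y[:-1]
    f = eps[:-1] * eps[1:]
    g = np.array([sum(f[k - 1] for k in I) for I in SUBSETS])
    return (np.sum(g > 0) >= q) and (np.sum(g < 0) >= q)

rng = np.random.default_rng(1)
theta0, hits, trials = 0.2, 0, 10000
for _ in range(trials):
    w = rng.uniform(-1, 1, size=9)
    y = np.zeros(9)
    for t in range(1, 9):
        y[t] = -theta0 * y[t - 1] + w[t]
    hits += region_contains(y, theta0)
print("empirical coverage:", hits / trials)   # expected to be close to 0.5
\end{verbatim}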



LSCR for general linear systems

Data generating system

Consider the general linear system in Figure 5.
  


Fig. 5  The system.


Assume that
$w_t$ and $u_t$ are independent processes (open-loop). For closed-loop operation, see the section "Closed-loop" below.

No a-priori knowledge about the noise is assumed.


The basic assumption is that the system structure is known. Correspondingly, we take a full-order model class of the form:

\begin{displaymath}
y_t = G(\theta)u_t+H(\theta)w_t.
\end{displaymath}

Goal: construct an algorithm that works in the following way (see Figure 6): the algorithm takes a finite input-output data set and a probability $\bar{p}$ as input and returns a confidence region that contains $\theta^0$ with probability $\bar{p}$.

 


Fig. 6  The algorithm.


Construction of confidence regions

Two types of confidence sets are constructed: $\Theta_r^\epsilon$, $\Theta_s^u$.
   
A confidence region
$\hat{\Theta}$ for $\theta^0$ is usually obtained by taking the intersection of a few of these $\Theta_r^\epsilon$ and $\Theta_s^u$ (see below).


------------------------------------------------------------------------------

Procedure for the construction of
$\Theta_r^\epsilon$
------------------------------------------------------------------------------
(this generalizes the preliminary example)

(1) Compute the prediction errors
\begin{displaymath}
\epsilon_t(\theta) = y_t - \hat{y}_t(\theta) = H^{-1}(\theta)y_t-H^{-1}(\theta)G(\theta)u_t
\end{displaymath}
for a finite number of values of
$t$, say $t=1,2,\ldots,K$;

(2) Select an integer
$r \geq 1$. For $t=1+r,\ldots,N+r=K$, compute
\begin{displaymath}
f^{\epsilon}_{t-r,r}(\theta) = \epsilon_{t-r}(\theta)\epsilon_t(\theta);
\end{displaymath}

(3) Let
$I = \{1,\ldots,N\}$ and consider a collection $G$ of subsets $I_i \subseteq I$, $i = 1, \ldots, M$, forming a group under the symmetric difference operation (i.e. $(I_i \cup I_j) - (I_i \cap I_j) \in G$, if $I_i, I_j \in G$). Compute
\begin{displaymath}
g^{\epsilon}_{i,r}(\theta) = \sum_{k
\in I_i} f^{\epsilon}_{k,r}(\theta), \ \ \ i = 1,\ldots,M;
\end{displaymath}

(4) Select an integer
$q$ in the interval $[1,(M+1)/2)$ and find the region $\Theta_r^\epsilon$ such that at least $q$ of the $g_{i,r}^{\epsilon}(\theta)$ functions are bigger than zero and at least $q$ are smaller than zero.


Theorem 1 below states that the probability that
$\theta^0 \in \Theta_r^{\epsilon}$ is exactly equal to $1-2q/M$. Thus, $q$ is a free parameter the user can employ to determine the probability of the confidence region.
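Steps (1)-(4) admit a compact generic implementation; here is a Python sketch in which the model class enters only through a user-supplied routine returning the prediction errors, and the group of subsets is supplied explicitly (e.g. via Gordon's construction described below). All function and variable names are illustrative.

\begin{verbatim}
import numpy as np

def theta_region_eps(theta_grid, pred_err, y, u, r, subsets, q):
    """Boolean mask over theta_grid marking the set Theta_r^eps.

    pred_err(theta, y, u) -> prediction errors eps_1(theta),...,eps_K(theta)   (step 1)
    subsets               -> index sets I_i (subsets of {1,...,N}, N = K - r),
                             forming a group under symmetric difference
    q                     -> integer with 1 <= q < (M+1)/2, M = len(subsets)
    """
    mask = np.zeros(len(theta_grid), dtype=bool)
    for j, theta in enumerate(theta_grid):
        eps = pred_err(theta, y, u)
        f = eps[:-r] * eps[r:]            # step (2): f_{t-r,r} = eps_{t-r}*eps_t, t = 1+r,...,K
        g = np.array([f[[k - 1 for k in I]].sum() if I else 0.0
                      for I in subsets])  # step (3)
        mask[j] = (np.sum(g > 0) >= q) and (np.sum(g < 0) >= q)   # step (4)
    return mask
\end{verbatim}

For $\Theta_s^u$ the only change is in step (2), where $f^u_{t-s,s}(\theta) = u_{t-s}\epsilon_t(\theta)$ replaces the product of two prediction errors.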

In the procedure for construction of $\Theta^u_s$, the empirical auto-correlations in point (2) are replaced by empirical cross-correlations between the input signal and the prediction error.

---------------------------------------------------------------
Procedure for the construction of  $\Theta^u_s$
---------------------------------------------------------------

(1) Compute the prediction errors

\begin{displaymath}
\epsilon_t(\theta) = y_t - \hat{y}_t(\theta) = H^{-1}(\theta)y_t-H^{-1}(\theta)G(\theta)u_t
\end{displaymath}
for a finite number of values of $t$, say $t=1,2,\ldots,K$;

(2) Select an integer
$s \geq 0$. For $t=1+s,\ldots,N+s=K$, compute
\begin{displaymath}
f_{t-s,s}^u(\theta) = u_{t-s}\epsilon_t(\theta);
\end{displaymath}

(3) Let
$I = \{1,\ldots,N\}$ and consider a collection $G$ of subsets $I_i \subseteq I$, $i = 1, \ldots, M$, forming a group under the symmetric difference operation. Compute

\begin{displaymath}
g_{i,s}^u(\theta) = \sum_{k \in I_i} f_{k,s}^u(\theta), \ \ \ i =
1,\ldots,M;
\end{displaymath}

(4) Select an integer
$q$ in the interval $\left[1,(M+1)/2\right)$ and find the region $\Theta_s^u$ such that at least $q$ of the $g^u_{i,s}(\theta)$ functions are bigger than zero and at least $q$ are smaller than zero.

Theorem 1 (see Campi and Weyer (2005))
If  $w_t$ is symmetrically distributed around zero, then

\begin{eqnarray*}
Pr\{\theta^0 \in \Theta^{\epsilon}_r\} &=& 1 - 2q/M, \\
Pr\{\theta^0 \in \Theta^u_s\} &=& 1 - 2q/M.
\end{eqnarray*}



Comments

1. The procedures return regions of guaranteed probability despite the fact that no a-priori knowledge about the noise is assumed: the noise enters the procedures through the data only. This could be phrased by saying that the procedures let the data speak, without assuming a-priori what they have to tell us.

2. The noise level does impact the final result as the shape and size of the region depend on noise via the data.

3. In the theorem, probabilities 1 - 2q/M are exact, not bounds. So, the theorem is free of any conservativeness.



Usually, a confidence set
$\hat{\Theta}$ is obtained by intersecting a number of sets $\Theta^{\epsilon}_r$ and $\Theta^u_s$, i.e.
\begin{displaymath}
\hat{\Theta}=\left(\bigcap_{r=1}^{n_\epsilon}\Theta^{\epsilon}_r\right) \cap \left(\bigcap_{s=1}^{n_u}\Theta_s^u\right).
\end{displaymath}

An obvious question to ask is how one should choose
$n_{\epsilon}$ and $n_u$ in order to obtain well-shaped confidence sets that are bounded and concentrate around the true parameter $\theta^0$ as the number of data points increases. The answer depends on the model class under consideration, and this issue is discussed in "LSCR properties".

From Theorem 1, it follows that:

Theorem 2 (see Campi and Weyer (2005))
\begin{displaymath}
Pr\{\theta^0 \in \hat{\Theta}\} \geq 1 -(n_\epsilon+n_u){2q}/{M}.
\end{displaymath}


Theorem 2 can be used in connection with robust design procedures: if a problem solution is robust with respect to $\hat{\Theta}$ in the sense that a certain property is achieved for any $\theta \in \hat{\Theta}$, then such a property is also guaranteed for the true system with the chosen probability $1 -(n_\epsilon+n_u){2q}/{M}$.
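In code, intersecting the individual regions and evaluating the Theorem 2 bound is immediate; a small sketch, assuming the individual regions are available as boolean masks over a common parameter grid (e.g. from the hypothetical routine sketched earlier):

\begin{verbatim}
import numpy as np

def intersect_regions(masks_eps, masks_u, q, M):
    """masks_eps, masks_u: lists of boolean arrays over the same theta grid,
    one per Theta_r^eps and Theta_s^u; returns the mask of Theta_hat and the
    Theorem 2 probability bound."""
    mask_hat = np.logical_and.reduce(list(masks_eps) + list(masks_u))
    prob_bound = 1 - (len(masks_eps) + len(masks_u)) * 2 * q / M
    return mask_hat, prob_bound
\end{verbatim}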


Example 1 - ARMA system

Consider the ARMA system

\begin{displaymath}
y_t + a^0y_{t-1} = w_t + c^0w_{t-1},
\end{displaymath}

where
$a^0=-0.5, c^0=0.2$ and $w_t$ is an independent sequence of zero mean Gaussian random variables with variance 1. 1025 data points were generated.

The model $y_t + ay_{t-1} = w_t + cw_{t-1},\ \vert a\vert<1,\ \vert c\vert<1$ has predictor and prediction error given by

\begin{eqnarray*}
\hat{y}_t(a,c)&=&-c\hat{y}_{t-1}(a,c)+(c-a)y_{t-1}, \\
\epsilon_t(a,c)&=&y_t-\hat{y}_{t}(a,c)=y_t+ay_{t-1}-c\epsilon_{t-1}(a,c).
\end{eqnarray*}
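The recursion above translates directly into code; a short sketch (the zero initialization $\epsilon_0(a,c)=0$ is an assumption):

\begin{verbatim}
import numpy as np

def arma_pred_err(a, c, y):
    """eps_t(a,c) = y_t + a*y_{t-1} - c*eps_{t-1}(a,c) for t = 1,...,len(y)-1."""
    eps = np.zeros(len(y))            # eps[0] plays the role of eps_0(a,c) = 0 (assumed)
    for t in range(1, len(y)):
        eps[t] = y[t] + a * y[t - 1] - c * eps[t - 1]
    return eps[1:]
\end{verbatim}

These prediction errors are then used to build the correlation functions below.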

In order to form a confidence region for
$\theta^0 = (a^0,c^0)$ we calculated
\begin{eqnarray*}
f_{t-1,1}^\epsilon(a,c)&=&\epsilon_{t-1}(a,c)\epsilon_t(a,c), \ \ \ t=2,\ldots,1024, \\
f_{t-2,2}^\epsilon(a,c)&=&\epsilon_{t-2}(a,c)\epsilon_t(a,c), \ \ \ t=3,\ldots,1025,
\end{eqnarray*}

and then computed
\begin{eqnarray*}
g_{i,1}^\epsilon(a,c)&=&\sum_{k\in I_i}f_{k,1}^\epsilon(a,c),\ \ \ i=1,\ldots,1024, \\
g_{i,2}^\epsilon(a,c)&=&\sum_{k\in I_i}f_{k,2}^\epsilon(a,c),\ \ \ i=1,\ldots,1024,
\end{eqnarray*}

using the Gordon group. Next, we discarded those values of
$a$ and $c$ for which zero was among the 12 largest or 12 smallest values of $g_{i,1}^\epsilon(a,c)$ or of $g_{i,2}^\epsilon(a,c)$.

According to Theorem 2, $(a^0,c^0)$ belongs to the constructed region with probability at least $1-2\cdot 2 \cdot 12/1024=0.9531$.

The obtained confidence region is the blank area in Figure 7. 

Using the algorithm for the construction of
$\hat{\Theta}$, we have obtained a bounded confidence set with a guaranteed probability based on a finite number of data points. As no asymptotic theory is involved, this is a rigorous finite sample result. For comparison, we have in Figure 7 also plotted the 95% confidence ellipsoid obtained using the asymptotic theory (e.g. L. Ljung, "System identification - Theory for the user, Chapter 9, 1999, Prentice Hall). The two confidence regions are of similar shape and size in this case.


\includegraphics[width=7.5cm]{armacolourg12asymp.eps}


Fig.7  Non-asymptotic confidence region for $(a^0,c^0)$  (blank region) and asymptotic confidence ellipsoid.  $\star$= true parameter, $\diamond$ = estimated parameter using a prediction error method.



Example 2 - A closed-loop system

This example was originally introduced in Garatti et al. (2004) to demonstrate that the asymptotic theory of PEM can at times deliver misleading results even with a large number of data points. It is re-examined here to show how LSCR works in this challenging situation.

Consider the system of Figure 8 where

\begin{eqnarray*}
F(\theta^0) &=& \frac{b^0z^{-1}}{1+a^0z^{-1}}, \hspace{0.5cm} a^0 = -0.7, \ \ b^0 = 0.3, \\
H'(\theta^0) &=& 1+h^0z^{-1}, \hspace{0.5cm} h^0 = 0.5,
\end{eqnarray*}


Fig.8  The closed-loop system.



$w_t$ is white Gaussian noise with variance 1 and the reference $u_t$ is also white Gaussian, with variance $10^{-6}$. Note that the variance of the reference signal is very small compared to the noise variance; that is, there is poor excitation. The present situation - though admittedly artificial - is a simplification of what often happens in practical applications of identification, where poor excitation is due to the closed-loop operation of the system.

2050 measurements of $u$ and $y$ were generated to be used for identification.

We first use PEM identification.


A full order model was identified. The amplitude Bode diagrams of the transfer function from
$u$ to $y$ of the identified model and of the real system are plotted in Figure 9. From the plot, a big mismatch between the real plant and the identified model is apparent, a fact that does not come as too much of a surprise considering that the reference signal is poorly exciting.

An analysis conducted in Garatti et al. (2004) shows that, when $u_t = 0$, the asymptotic PEM identification cost has two isolated global minimizers: one is $\theta^0$ and the other is a spurious parameter which we denote by $\theta^*$. When $u_t \neq 0$ but small, as is the case in our actual experiment, $\theta^*$ no longer minimizes the asymptotic cost, but random fluctuations in the identification cost due to the finiteness of the data may well cause the estimate to get trapped near the spurious $\theta^*$, generating a totally wrong identified model.

\includegraphics[width=7.5cm]{PEM.eps}

Fig.9  The identified $u-y$ transfer function.


Let us now see what we obtained as a 90% confidence region with the asymptotic theory. Figure 10 displays the confidence region in the frequency domain: surprisingly, it concentrates around the identified model, so that in a real identification application, where the true transfer function is not known, we would conclude that the estimated model is reliable - a totally misleading result.

\includegraphics[width=7.5cm]{PEM-uncertainty.eps}

Fig.10  90% confidence region for the identified $u-y$ transfer function obtained with the asymptotic theory.



Return now to the LSCR approach.

LSCR was used in a totally "blind" manner, with no concern at all for the existence of local minima: the method is guaranteed by theory and it works in all possible situations.

The prediction error is given by
\begin{eqnarray*}
\epsilon_t(\theta)
&=&\frac{1}{1+hz^{-1}}y_t \\
& & -\frac{bz^{-1}}{(1+az^{-1})(1+hz^{-1})}y_t \\
& & - \frac{bz^{-1}}{(1+az^{-1})(1+hz^{-1})}u_t.
\end{eqnarray*}
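Numerically, the three terms above are just rational filters applied to the data; a sketch using scipy.signal.lfilter, with zero initial conditions assumed and $\theta = (a,b,h)$ (the function name is illustrative):

\begin{verbatim}
import numpy as np
from scipy.signal import lfilter

def pred_err_closed_loop(a, b, h, y, u):
    """eps_t(theta) = y_t/(1+h z^-1) - [b z^-1/((1+a z^-1)(1+h z^-1))](y_t + u_t)."""
    y = np.asarray(y, dtype=float)
    u = np.asarray(u, dtype=float)
    den = np.convolve([1.0, a], [1.0, h])      # (1 + a z^-1)(1 + h z^-1)
    term1 = lfilter([1.0], [1.0, h], y)        # 1/(1 + h z^-1) applied to y
    term23 = lfilter([0.0, b], den, y + u)     # b z^-1/den applied to (y + u)
    return term1 - term23
\end{verbatim}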

We used a Gordon group
with 2048 elements, and computed
\begin{displaymath}
g^{\epsilon}_{i,r}(\theta)=\sum_{k\in I_i}\epsilon_{k-r}(\theta)\epsilon_{k}(\theta),\ \ r=1,2,3,
\end{displaymath}
in the parameter space. We excluded the regions where 0 was among the 34 smallest or largest values of any of the three correlations above to obtain a
$1-3\cdot 2\cdot 34/2048 = 0.9004$ confidence set (see Theorem 2). The confidence set is shown in Figure 11.

The set consists of two separate regions, one around the true parameter $\theta^0$ and one around $\theta^*$. This illustrates the global features of the approach: LSCR produces two separate regions as the overall confidence set because the information in the data is intrinsically unable to tell us which of the two regions contains the true parameter.



Fig.11  90% confidence set.


Figures 12 and 13 show the close-ups of the two regions. The ellipsoid in Figure 12 is the 90% confidence set obtained with the asymptotic PEM theory: when the PEM estimate gets trapped near
$\theta^*$, the confidence ellipsoid is centered around this spurious $\theta^*$ because the PEM asymptotic theory is local in nature (it is based on a Taylor expansion) and is therefore unable to explore locations far from the identified model. This is the reason why in Figure 10 we obtained a frequency domain confidence region unable to capture the real model uncertainty. The reader is referred to Garatti et al. (2004) for more details.

\includegraphics[width=7.5cm]{example-closed-LSCR1.eps}

Fig.12  Asymptotic confidence 90% ellipsoid, and the part of the non-asymptotic confidence set around $\theta^*$.


\includegraphics[width=7.5cm]{example-closed-LSCR2.eps}

Fig.13  Close-up of the non-asymptotic confidence region around $\theta^0$.




LSCR properties

Theorems 1 and 2 quantify the probability that $\theta^0$ belongs to the constructed regions. However, these theorems deal only with one side of the story. In fact, a good evaluation method must have two properties: (i) the constructed region must contain the true parameter $\theta^0$ with a guaranteed probability; (ii) the region must be bounded and must concentrate around $\theta^0$ as the number of data points increases.

Theorems 1 and 2 secure the first property. Securing the second property requires choosing $n_\epsilon$ and $n_u$ in a suitable way, and the correct choice depends on the model class. For a general discussion, particularly in connection with ARMA and ARMAX models, see Campi and Weyer (2005).



Presence of unmodeled dynamics

LSCR can also be used in the presence of unmodeled dynamics.

General ideas are discussed here by means of two simple examples:


Identification without a noise model: an example

Consider the system
\begin{displaymath}
y_t = \theta^0 u_t + n_t.
\end{displaymath}
Suppose that the structure of the
$u$ to $y$ transfer function is known. The noise $n_t$, instead, describes all other sources of variation in $y_t$ apart from $u_t$, and we do not want to make any assumption on how $n_t$ is generated. Correspondingly, we want our results regarding the value of $\theta^0$ to be valid with no limitations whatsoever on the deterministic noise sequence $n_t$.

Assume that we have access to the system for experimentation: we generate a finite number, say 7 for the sake of simplicity, of input data and - based on the collected outputs - we are asked to construct a confidence interval
$\hat{\Theta}$ for $\theta^0$ of guaranteed probability.

The problem is challenging: since the noise can be arbitrary, it seems that the observed data cannot help us in constructing a confidence region. In fact, for any given
$\theta^0$ and $u_t$, a suitable choice of the noise sequence can lead to any observed output signal! Let us see how LSCR gets around this problem.

Before proceeding, we would like to clarify what is meant here by "guaranteed probability". We said that
$n_t$ is regarded as a deterministic sequence, and the result is required to hold true for any $n_t$, that is uniformly in $n_t$. The stochastic element is instead the input sequence: we will select $u_t$ according to a random generation mechanism and we require that $\theta^0 \in \hat{\Theta}$ with a given probability, where the probability is with respect to the random choice of $u_t$.

We first indicate the input design and then the procedure for construction of the confidence interval
$\hat{\Theta}$.

Input design
Let
$u_t$, $t = 1,\ldots,7$, be independent and identically distributed with distribution
\begin{displaymath}
u_t =
\left\{
\begin{array}{ll}
1, & \ \ \ \mbox{with probability }0.5, \\
-1, & \ \ \ \mbox{with probability }0.5.
\end{array}\right.
\end{displaymath}

Procedure for construction of the confidence interval
Rewrite the system as a model with generic parameter
$\theta$:
\begin{displaymath}
y_t = \theta u_t + n_t.
\end{displaymath}
Construct a predictor by dropping the noise term
$n_t$:
\begin{displaymath}
\hat{y}_t(\theta) = \theta u_t, \ \ \ \epsilon_t(\theta) = y_t - \hat{y}_{t}(\theta) = y_t - \theta u_t.
\end{displaymath}
Next, we compute the prediction errors
$\epsilon_t(\theta)$ from the observed data for $t = 1,\ldots,7$ and calculate
\begin{displaymath}
f_t(\theta) = u_t \epsilon_t(\theta), \ \ \ t = 1,\ldots,7.
\end{displaymath}
Then, we take different averages of these functions. Precisely, we form 8 averages of the form:

\begin{displaymath}
g_i(\theta) = {1 \over 4} \sum_{k \in I_i} f_k(\theta), \ \ \ i=1,\ldots,8,
\end{displaymath}

where the sets $I_i$ are subsets of $\{1,\ldots,7\}$ containing the elements highlighted by a bullet in the table below. For instance: $I_1 = \{1,2,4,5\}$, $I_2 = \{1,3,4,6\}$, etc.

(The table of subsets $I_i$ is the same as in the preliminary example.)


The confidence interval is where at least two functions are below zero and at least two functions are above zero.

A simulation example was run with $\theta^0 = 1$ and the noise sequence shown in Figure 14. This noise sequence was obtained as a realization of a biased independent Gaussian process with mean 0.5 and variance 0.1. The obtained $g_i(\theta)$ functions are given in Figure 15.

We located the interval where at least two functions were below zero and at least two were above zero, obtaining the interval in Figure 15. By generalizing the theory developed for the case of no unmodelled dynamics, one can establish that the obtained interval has exact probability 0.5 of containing the true $\theta^0$ (see Campi and Weyer (2006)).

\includegraphics[width=7cm]{noise-noise-free-example.eps}

Fig.14  Noise sequence.




\includegraphics[width=7.5cm]{noise-free-correlations.eps}

Fig.15  The $g_i(\theta)$ functions.
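A compact Python sketch of this example follows (the particular noise realization, seed and grid are illustrative assumptions; the random $\pm 1$ input design and the subsets $I_i$ are those described above):

\begin{verbatim}
import numpy as np

SUBSETS = [{1, 2, 4, 5}, {1, 3, 4, 6}, {2, 3, 5, 6}, {1, 2, 6, 7},
           {1, 3, 5, 7}, {2, 3, 4, 7}, {4, 5, 6, 7}]

rng = np.random.default_rng(2)
theta0 = 1.0
u = rng.choice([-1.0, 1.0], size=7)                  # input design: random +/- 1
n = 0.5 + np.sqrt(0.1) * rng.standard_normal(7)      # biased Gaussian noise, as in Figure 14
y = theta0 * u + n

grid = np.linspace(-1, 3, 2001)
keep = np.zeros(len(grid), dtype=bool)
for j, theta in enumerate(grid):
    f = u * (y - theta * u)                          # f_t = u_t * eps_t(theta), t = 1,...,7
    g = [sum(f[k - 1] for k in I) / 4.0 for I in SUBSETS]
    keep[j] = sum(gi > 0 for gi in g) >= 2 and sum(gi < 0 for gi in g) >= 2
region = grid[keep]
if region.size:
    print("confidence interval ~ [%.3f, %.3f]" % (region.min(), region.max()))
\end{verbatim}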


Unmodeled dynamics in the transfer function between u and y: an example

Suppose that a system has structure
\begin{displaymath}
y_t = \theta_0^0 u_t + \theta_1^0 u_{t-1} + n_t,
\end{displaymath}
while - for estimation purposes - we use the reduced order model
\begin{displaymath}
y_t = \theta u_t + n_t.
\end{displaymath}
The noise can be arbitrary, and we regard
$n_t$ as a generic unknown deterministic signal.

After determining a region for the parameter
$\theta$, one sensible question to ask is: does this region contain, with a given probability, the system parameter $\theta_0^0$ linking $u_t$ to $y_t$?

Reinterpreting the above question we are asking whether the projection of the true transfer function
$\theta_0^0 + \theta_1^0 z^{-1}$ onto the 1-dimensional space spanned by constant transfer functions is contained in the estimated set with a certain probability.

Input design
Let
$u_t$, $t = 1,\ldots,7$, be independent and identically distributed with distribution

\begin{displaymath}
u_t =
\left\{
\begin{array}{ll}
1, & \ \ \ \mbox{with probability }0.5, \\
-1, & \ \ \ \mbox{with probability }0.5.
\end{array}\right.
\end{displaymath}


Procedure for construction of the confidence interval
Construct a predictor by dropping the noise term $n_t$, whose characteristics are unknown:
\begin{displaymath}
\hat{y}_t(\theta) = \theta u_t, \ \ \ \epsilon_t(\theta) = y_t - \hat{y}_{t}(\theta) = y_t - \theta u_t.
\end{displaymath}
Next, we compute the prediction errors
$\epsilon_t(\theta)$ from the observed data for $t = 1,\ldots,7$ and calculate:
\begin{displaymath}
f_t(\theta) = sign(u_t \epsilon_t(\theta)), \ \ \ t = 1,\ldots,7.
\end{displaymath}

Then, we take the average of some of these functions in many different ways. Precisely, we form 8 averages of the form:

\begin{displaymath}
g_i(\theta) = {1 \over 4} \sum_{k \in I_i} f_k(\theta), \ \ \ i=1,\ldots,8,
\end{displaymath}

where the sets $I_i$ are subsets of $\{1,\ldots,7\}$ containing the elements highlighted by a bullet in the table below. For instance: $I_1 = \{1,2,4,5\}$, $I_2 = \{1,3,4,6\}$, etc.

(The table of subsets $I_i$ is the same as in the preliminary example.)


The confidence interval is where at least two functions are below zero and at least two functions are above zero.

A simulation example was run with $\theta_0^0 = 1$, $\theta_1^0 = 0.5$ and the noise equal to the realization of the biased Gaussian process shown in Figure 14. As $sign(u_t\epsilon_t(\theta))$ can only take on the values -1, 1 and 0, it is possible that two or more of the $g_i(\theta)$ functions take on the same value on an interval. This tie is broken by introducing a random ordering. The obtained $g_i(\theta)$ functions and confidence region are shown in Figure 16.

\includegraphics[width=7.5cm]{sign-correlations.eps}

Fig.16  The $g_i(\theta)$ functions.

Generalizing the theory, one can establish that the obtained interval has exact probability 0.5 of containing $\theta_0^0$ (see Campi and Weyer (2006)).
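A sketch of the sign-correlation variant is given below under assumed data (realization, seed and grid are arbitrary). Note that the random-ordering tie-break described above is not implemented here - values of $\theta$ where some $g_i$ are exactly zero are handled only through the strict-sign counts - so this is an approximate illustration, not the full procedure.

\begin{verbatim}
import numpy as np

SUBSETS = [{1, 2, 4, 5}, {1, 3, 4, 6}, {2, 3, 5, 6}, {1, 2, 6, 7},
           {1, 3, 5, 7}, {2, 3, 4, 7}, {4, 5, 6, 7}]

rng = np.random.default_rng(4)
u = rng.choice([-1.0, 1.0], size=8)                  # u_0, u_1, ..., u_7
n = 0.5 + np.sqrt(0.1) * rng.standard_normal(7)      # biased Gaussian noise (assumed realization)
y = 1.0 * u[1:] + 0.5 * u[:-1] + n                   # true system: theta_0^0 = 1, theta_1^0 = 0.5

def in_region(theta, q=2):
    f = np.sign(u[1:] * (y - theta * u[1:]))         # f_t = sign(u_t * eps_t(theta)), t = 1,...,7
    g = np.array([sum(f[k - 1] for k in I) / 4.0 for I in SUBSETS])
    return np.sum(g > 0) >= q and np.sum(g < 0) >= q # ties at zero not counted (see text)

grid = np.linspace(-1, 3, 801)
region = grid[np.array([in_region(th) for th in grid])]
if region.size:
    print("confidence region spans ~ [%.2f, %.2f]" % (region.min(), region.max()))
\end{verbatim}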



Nonlinear systems

Here we discuss an example. The reader is referred to Dalai et al. (2007) for more details.

Consider the following nonlinear system

\begin{displaymath}
y_t = \theta^0 (y_{t-1}^2 - 1) + w_t,
\end{displaymath}
where
$w_t$ is an independent and symmetrically distributed sequence and $\theta^0$ is an unknown parameter.

This system can be made explicit with respect to
$w_t$ as follows:
\begin{displaymath}
w_t = y_t - \theta^0 (y_{t-1}^2 - 1),
\end{displaymath}
and - by substituting
$\theta^0$ with a generic $\theta$ and re-naming the so-obtained right-hand-side as $w_t(\theta)$ - we have
\begin{displaymath}
w_t(\theta) = y_t - \theta (y_{t-1}^2 - 1).
\end{displaymath}
Second-order statistics are in general not enough to identify the true parameter value for nonlinear systems. The good news is that LSCR can be extended to higher-order statistics with little effort. A general presentation of the results can be found in Dalai et al. (2007). Here, it suffices to say that we can e.g. consider the third-order statistic $E[w_t(\theta)^2 w_{t+1}(\theta)]$ and the theory goes through.

As an example, we generated 9 samples of
$y_t$, $t=0,\ldots,8$, where $w_t$ was zero-mean Gaussian with variance 1. Then, we constructed
\begin{displaymath}
g_i(\theta) = {1 \over 4} \sum_{k \in I_i} w_k(\theta)^2 w_{k+1}(\theta), \ \ \ i=1,\ldots,8,
\end{displaymath}

where the sets $I_i$ are subsets of $\{1,\ldots,7\}$ containing the elements highlighted by a bullet in the table below. For instance: $I_1 = \{1,2,4,5\}$, $I_2 = \{1,3,4,6\}$, etc.

(The table of subsets $I_i$ is the same as in the preliminary example.)


These functions are displayed in Figure 17. The interval marked in blue is where at least two functions were below zero and at least two were above zero and has exact probability 0.5 of containing $\theta^0 = 0$.

\includegraphics[width=7.5cm]{nonlinear-correlations.eps}

Fig.17  The $g_i(\theta)$ functions.
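A brief sketch of the higher-order-statistic computation follows ($\theta^0 = 0$ and unit-variance Gaussian noise as in the text; the particular realization, seed, zero initial condition and grid are illustrative assumptions):

\begin{verbatim}
import numpy as np

SUBSETS = [{1, 2, 4, 5}, {1, 3, 4, 6}, {2, 3, 5, 6}, {1, 2, 6, 7},
           {1, 3, 5, 7}, {2, 3, 4, 7}, {4, 5, 6, 7}]

rng = np.random.default_rng(3)
theta0 = 0.0
w = rng.standard_normal(9)
y = np.zeros(9)                                     # y_0 = 0 assumed
for t in range(1, 9):
    y[t] = theta0 * (y[t - 1] ** 2 - 1) + w[t]

grid = np.linspace(-1, 1, 2001)
keep = np.zeros(len(grid), dtype=bool)
for j, theta in enumerate(grid):
    w_hat = y[1:] - theta * (y[:-1] ** 2 - 1)       # w_t(theta), t = 1,...,8
    h = w_hat[:-1] ** 2 * w_hat[1:]                 # w_k(theta)^2 * w_{k+1}(theta), k = 1,...,7
    g = [sum(h[k - 1] for k in I) / 4.0 for I in SUBSETS]
    keep[j] = sum(gi > 0 for gi in g) >= 2 and sum(gi < 0 for gi in g) >= 2
region = grid[keep]
if region.size:
    print("confidence interval ~ [%.3f, %.3f]" % (region.min(), region.max()))
\end{verbatim}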








Closed-loop

Closed-loop systems can be treated within the open-loop framework by regarding $w_t$ and $u_t$ as external signals of the closed-loop system; see Figure 18. The LSCR theory can be applied unaltered in this setting.



Fig.18  Closed-loop recast as open-loop.



Gordon's construction of the incidence matrix of a group

Given $I = \{1,\ldots,N\}$, the incidence matrix for a group $\{I_i\}$ of subsets of $I$ is a matrix whose $(i,j)$ element is $1$ if $j\in I_i$ and zero otherwise. In L. Gordon, "Completely separating groups in subsampling", Annals of Statistics, vol. 2, pp. 572-578, 1974, the following construction procedure for an incidence matrix $\bar{R}$ is proposed, where $I = \{1,\ldots,2^l-1\}$ and the group has $2^l$ elements.

Let
$R(1)=[1]$, and recursively compute ($k=2,3,\ldots,l$)
\begin{displaymath}
R(2^k-1)=\left[ \begin{array}{ccc}
R(2^{k-1}-1) & R(2^{k-1}-1) & 0 \\
R(2^{k-1}-1) & J-R(2^{k-1}-1) & e \\
0^T & e^T & 1
\end{array}\right],
\end{displaymath}

where
$J$ and $e$ are, respectively, a matrix and a vector of all ones, and 0 is a vector of all zeros. Then, let

\begin{displaymath}
\bar{R}=\left[ \begin{array}{c} R(2^l-1) \\
0^T
\end{array}\right].
\end{displaymath}
Gordon (1974) also gives constructions of groups when the number of data points is different from
$2^{l}-1$.
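Gordon's recursion is straightforward to code; a sketch (the function name is illustrative) that builds $\bar{R}$ and the corresponding subsets:

\begin{verbatim}
import numpy as np

def gordon_incidence(l):
    """Incidence matrix R_bar for I = {1,...,2^l - 1}; the group has 2^l elements
    (rows), the last row (all zeros) being the empty set."""
    R = np.array([[1]], dtype=int)                                  # R(1)
    for _ in range(2, l + 1):
        n = R.shape[0]
        top = np.hstack([R, R, np.zeros((n, 1), dtype=int)])
        mid = np.hstack([R, 1 - R, np.ones((n, 1), dtype=int)])
        bottom = np.hstack([np.zeros(n, dtype=int), np.ones(n, dtype=int), [1]])
        R = np.vstack([top, mid, bottom])
    return np.vstack([R, np.zeros(R.shape[1], dtype=int)])

# l = 3 reproduces the 8 subsets used in the preliminary example above
subsets = [set(map(int, np.flatnonzero(row) + 1)) for row in gordon_incidence(3)]
print(subsets)
\end{verbatim}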



Papers

M.C. Campi and E. Weyer.
Guaranteed non-asymptotic confidence regions in system identification.
Automatica, 41:1751-1764, 2005.
(the downloadable file is an extended version (with all proofs) of the Automatica paper)

M.C. Campi and E. Weyer.
Identification with finitely many data points: the LSCR approach.
Semi-plenary presentation, Proc. Symposium on System Identification (SYSID 2006), Newcastle, Australia, 2006.

M. Dalai, E. Weyer and M.C. Campi.
Parameter identification for nonlinear systems: guaranteed confidence regions through LSCR.
Automatica, 43:1418-1425, 2007.
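
S. Garatti, M.C. Campi and S. Bittanti.
Assessing the quality of identified models through the asymptotic theory - when is the result reliable?
Automatica, 40:1319-1332, 2004.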


Other related papers are downloadable from M.C. Campi's webpage.