mathematical convex optimization

Posted in optimization by mlwkdb on December 21, 2009

Most optimizations can be expressed as the minimization of an objective function subject to a set of constraints both of which are defined over a set of optimization variables:

  • optimization variables x= \{ x_1, \cdots , x_n \}
  • objective function: f_0 : \mathbb{R}^n \to \mathbb{R}
  • constraints: f_i : \mathbb{R}^n \to \mathbb{R}, i=1, \cdots , m

The optimization variables are typically actions (portfolio optimization: amounts invested in various assets) or parameters (data fitting: parameters of the model being estimated). There exists a family of optimization problems for which a global solution can be found both efficiently and reliably: for an optimization problem to be convex, the objective function and the constraint functions must all be convex, i.e. have nonnegative curvature.

Some examples: least squares (no constraints, analytical solution), linear programming (affine objective and constraints, hence zero curvature, but no analytical solution) and convex quadratic programming (quadratic objective with linear or convex quadratic constraints).
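To make the least squares case concrete, here is a small numpy sketch (the matrix A and vector b are synthetic): the analytical solution comes from the normal equations A^T A x = A^T b and matches the library solver.

```python
import numpy as np

# least squares: minimize ||A x - b||_2^2 over x
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))   # synthetic data matrix
b = rng.standard_normal(50)        # synthetic observations

# analytical solution via the normal equations A^T A x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# library least-squares solver for comparison
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```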

For details click here.

q load NYSE data (23 assets, 44 years) into kdb+

Posted in Uncategorized by mlwkdb on October 28, 2009

Here is a q script to load the NYSE merged dataset found here into kdb+:

f:`:ahp.csv`:alcoa.csv`:amerb.csv`:coke.csv`:comme.csv`:dow.csv`:dupont.csv`:ford.csv`:ge.csv`:gm.csv`:hp.csv`:ibm.csv`:inger.csv`:jnj.csv`:kimbc.csv`:kinar.csv`:kodak.csv`:merck.csv`:mmm.csv`:morris.csv`:pandg.csv`:schlum.csv`:sherw.csv;
c:`date`close;
t:(uj/) {update ticker:x from flip c!("DF";",")0: x} each f;
t:`ticker`date`close xcols update `$ssr[;":";""] each ssr[;".csv";""] each string[ticker] from t;
save `:t;
library(lattice)
c <- open_connection()
t <- execute(c, "select from t where ticker in 16#exec distinct ticker from t")
names(t) <- c("ticker","date","close")
t$date <- strptime(t$date, format="%Y-%m-%d")
t$ticker <- factor(t$ticker)
xyplot(close~date | ticker,t,xlab='',ylab='',ylim=c(0.9, 1.1),scales=list(tick.number=2),type='l')

t$ibm <- execute(c, "exec close from t where ticker in `ibm")
t$coke <- execute(c, "exec close from t where ticker=`coke")
t$ge <- execute(c, "exec close from t where ticker=`ge")
t$dow <- execute(c, "exec close from t where ticker=`dow")
t$ford <- execute(c, "exec close from t where ticker=`ford")
t$gm <- execute(c, "exec close from t where ticker=`gm")
library(car)
scatterplot.matrix(~ibm+coke+ge+dow+ford+gm, data=t)

(figure: scatterplot matrix of the NYSE close prices)


q principal component analysis

Posted in kdb+, q, statistics by mlwkdb on October 23, 2009

PCA is a pattern recognition algorithm that highlights similarities and differences in a dataset. It can also be used to optimally (least information loss) reduce the dimension (given by the number of variates) of a dataset by projecting the data (in the least squares sense) onto a few of its principal (high variability) components (dimensions in transformed space). A reduction to 2 components (best approximating plane) can be used to capture and visualize the majority of the variance in the dataset.

The components correspond to the new transformed basis and don’t necessarily have a one-to-one correspondence with the original variates; there are two ways to think about why we would consider dropping components in the transformed space:

  1. if certain variates are redundant (express almost the same information i.e. are highly correlated) then it makes sense to include only a single component to record the information contained in the variates
  2. by rotating and stretching the data into a new basis we are left with components along which there is very little variance; intuitively these components correspond to noise since they fail to record any essential information

Since we are comparing the spreads of variates it makes sense to standardize the data to have zero mean and unit standard deviation.

Given a dataset X_{m \times n} = \{ \vec{x}_1 | \vec{x}_2 | \cdots | \vec{x}_n\} = \{\vec{f}_1, \vec{f}_2, \cdots , \vec{f}_m\} where the n columns are sample vectors and the m rows are feature vectors, we are interested in finding directions in \mathbb{R}^m (the row space) that capture maximal variance. We calculate the center of mass \nu_{m \times 1} = argmin_{\vec{x}} \sum \|\vec{x}_i-\vec{x}\|^2 = \frac{1}{n} \sum_{i=1}^n \vec{x}_i which gives us the mean of each row. The center of mass will serve as the origin of the new basis.

Next we calculate the scatter matrix from the demeaned dataset (subtract the mean of each row from that row so that every row has zero mean): S_{m \times m}=\sum_{i=1}^n (\vec{x}_i - \nu)(\vec{x}_i-\nu)^T, or in matrix form S_{m \times m} = XC_nX^T where C_n is the centering matrix. The entries of the scatter matrix measure the variance of the data in different directions relative to the center of mass; specifically the ij-th entry is the covariance between the i-th and j-th features.

Since we are rotating our data into a new basis, we need a measure of the variance of the dataset projected onto an axis (in fact onto any arbitrary axis, since we are after the one with the highest variance). To illustrate how to build such a measure (using only the scatter matrix and an arbitrary unit direction), let us project the data onto a line L=\nu+ae representing an axis in the direction e such that \|e \| =1.

(figure: projecting the data points onto the axis L through the center of mass)

The distance of the projected point \vec{x'}_i from the center of mass \nu is then \| \vec{x'}_i - \nu \| = \langle e, \vec{x}_i - \nu \rangle = e^T (\vec{x}_i - \nu) since \| e \| = 1.

So the total variance, i.e. the averaged squared distances of the projected points from the center of mass, is given by:
var(e) = \frac{1}{n} \sum_{i=1}^n \| \vec{x'}_i - \nu \|^2 = \frac{1}{n} \sum_{i=1}^n (e^T (\vec{x}_i - \nu))^2
which after some matrix algebra works out to var(e) = \frac{1}{n} e^T S e.

So now we can calculate the variance of the data projected onto an arbitrary direction e; the principal component is precisely the direction that maximizes this variance: \max_e e^TSe subject to e^Te=1. The Lagrangian of this optimization is e^TSe - \lambda(e^Te-1) where \lambda is a Lagrange multiplier. We apply the Lagrangian saddle-point equivalence and differentiate with respect to the primal variables: 2Se-2 \lambda e = 0 or Se=\lambda e, which is in fact part of the eigen-decomposition of the scatter matrix. So the direction along which the projected data has maximal variance is an eigenvector; but which one? The variance along an eigenvector is var(e) = \frac{1}{n} e^T S e = \frac{\lambda}{n} e^T e = \frac{\lambda}{n}, and since we are maximizing the variance, the eigenvector with the maximal corresponding eigenvalue \lambda is the principal component.

Note that if the eigenvalues are nearly equal then there is no preferable “principal” component. Also, PCA is optimal only in the least-squares sense.

Connection to SVD

The SVD of the scatter matrix is S=U\Sigma V^T; since the scatter matrix is symmetric and positive semidefinite we can take U=V, so S=U\Sigma U^T, which is exactly the eigen-decomposition: for S the eigenvalues are the singular values. Look here for the general relationship between the eigen-decomposition and the SVD when the matrix is not square.

Outline of PCA algorithm

  1. Calculate the scatter matrix S
  2. Calculate the SVD of S=U\Sigma U^T
  3. Rearrange the singular values and singular vectors: \tilde{\Sigma} and \tilde{U}
  4. Select a new basis as a subset of the singular vectors \tilde{U}_{1:r}
  5. Project the data onto the new basis X_r = \tilde{U}_{1:r}^T X
.qml.mpca:{[x;r]              / x: samples (rows) by variates (cols); r: components to keep
 t:-[;avg x] each x;          / demean each variate
 s:.qml.msvd[flip[t] $ t];    / svd of the m x m scatter matrix
 u:s[0][;til r];              / top r singular vectors
 :t $ u}                      / project the demeaned data onto the new basis
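For comparison, the same steps can be sketched in numpy (mpca is a hypothetical helper for illustration, assuming x is samples (rows) by variates (cols)):

```python
import numpy as np

def mpca(x, r):
    """Project the samples (rows of x) onto the top-r principal components."""
    t = x - x.mean(axis=0)        # demean each variate (column)
    s = t.T @ t                   # m x m scatter matrix
    u, _, _ = np.linalg.svd(s)    # s is symmetric psd, so svd = eigendecomposition
    return t @ u[:, :r]           # n x r projected data

rng = np.random.default_rng(1)
x = rng.standard_normal((100, 5))
proj = mpca(x, 2)
```

The columns of the result are ordered by decreasing variance, matching the sorted singular values.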


Kernel PCA

Kernel PCA is an extension of PCA to non-linear distributions. We would like to apply PCA in a higher-dimensional feature space \vec{x} \to \phi(\vec{x}); explicitly applying this map to the dataset is unnecessary because of the kernel trick. We simply form the kernel matrix and solve the dual eigen-problem K \alpha = \lambda \alpha. A feature f_K in the higher-dimensional feature space is given by f_K = \sum \alpha_i k(\vec{z}, \vec{f}_i) where \vec{z} is a feature in the original feature space. For a derivation of this result look here or here
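A minimal numpy sketch of the idea (kernel_pca and gamma are illustrative names and choices, not from the original post): form the RBF kernel matrix, center it, and take the leading eigenvectors of K \alpha = \lambda \alpha.

```python
import numpy as np

def kernel_pca(X, r, gamma=1.0):
    """Kernel PCA sketch with an RBF kernel; X is samples x features."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    K = np.exp(-gamma * sq)                              # kernel matrix
    n = len(X)
    C = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
    Kc = C @ K @ C                                       # centered kernel matrix
    lam, alpha = np.linalg.eigh(Kc)                      # eigenvalues, ascending
    lam = lam[::-1][:r]
    alpha = alpha[:, ::-1][:, :r]
    return alpha * np.sqrt(np.maximum(lam, 0))           # coordinates of the training points

rng = np.random.default_rng(7)
X = rng.standard_normal((60, 3))
Y = kernel_pca(X, 2)
```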

Variance Connection, Karhunen Connection

A typical data matrix contains signals, each with some underlying noise; a reduced-rank SVD approximation of the data matrix captures a maximal-variance mixture of the original signals in a lower-dimensional space. The noise is usually associated with the smallest singular values: these directions have the smallest semi-axes in the transformed unit sphere, so they carry very little of the variance and say the least about the true distribution of the data. Dropping them forms the approximation.

antlr grammar for q

Posted in kdb+, q by mlwkdb on October 17, 2009

A q grammar file for antlr can be downloaded here.


the karhunen–loève representation of a stochastic process

Posted in statistics by mlwkdb on October 17, 2009

Consider a continuous time, continuous state space stochastic process \{ S_t : t \in \mathcal{I} \} that has a continuous and compact index set \mathcal{I} and where the random variables S_t are measurable functions from a probability space to a (continuous, measurable) state space:

S_t : (\mathcal{X}, \Sigma_\mathcal{X}, \mu) \to (\mathcal{Y}, \Sigma_\mathcal{Y})

Assume the stochastic process is centered (each random variable is centered) so that \mathbb{E}(S_t)=\int_\mathcal{X} S_t d\mu = 0 then the autocovariance (the variance of the process or signal against a time shifted version of itself) is given by Cov(S_t,S_s) = \mathbb{E}(S_t \cdot S_s) - \mathbb{E}(S_t) \cdot \mathbb{E}(S_s) which given our assumption can be viewed as a kernel

K_{SS}(t,s) = \mathbb{E}(S_t \cdot S_s) = Cov(S_t,S_s)

Let us assume that this kernel is positive, symmetric and square integrable K_{SS}(t,s) \in L^2(\mathcal{X} \times \mathcal{X}) then the resulting integral operator

T_K : L^2(\mathcal{X}) \to L^2(\mathcal{X}), \hspace{5mm} (T_K f)(t) = \int_{\mathcal{I}} K_{SS}(t,s) f(s) ds

is positive, self-adjoint and compact. It therefore follows from the Spectral Decomposition Theorem that T_K must have a countable set of non-negative eigenvalues \{ \nu_1, \nu_2, \cdots \}; furthermore, the corresponding eigenfunctions \{ \psi_1, \psi_2, \cdots \} must form an orthonormal basis for L^2(\mathcal{X}):

T\psi_i(t) = \int_\mathcal{I} K_{SS}(t,s) \psi_i(s) ds = \nu_i\psi_i(t)

The integral transformation T_K essentially changes the representation of the input function f to an output function (T_Kf) defined in a space that has a basis given by the set of eigenfunctions.

Let us make a stronger assumption (than being square integrable) that the covariance (kernel) function is continuous Cov(S_t, S_s) \in C^0(\mathcal{X} \times \mathcal{X}) then by Mercer’s theorem (page 34) we have that:

Cov(S_t, S_s) = \sum_{i=1}^\infty \nu_i \psi_i(s) \psi_i(t)

where convergence is absolute and uniform in both variables. We define a new set of random variables \{ Z_n, n \geq 0\} based on our original stochastic process:

Z_n = \frac{1}{\sqrt{\nu_n}} \int_{\mathcal{I}} S_t \psi_n(t) dt

that are orthogonal and centered E(Z_n)=0. We can then represent our original stochastic process as an infinite linear combination of orthogonal functions:

S_t = \sum_{i=1}^\infty \sqrt{\nu_i} Z_i \psi_i(t)

where convergence is uniform over the index set \mathcal{I}. It is interesting to note that the orthogonal basis functions used in this expansion are determined by the autocovariance function.
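To make the expansion concrete, here is a numpy sketch for the Wiener-process covariance K_{SS}(t,s) = \min(t,s) on [0,1] (a standard example, not from the post): discretizing the integral operator on a grid and eigendecomposing recovers the known eigenvalues \nu_k = ((k-\frac{1}{2})\pi)^{-2}.

```python
import numpy as np

# discretize the index set [0,1] and the covariance kernel K(t,s) = min(t,s)
n = 500
t = (np.arange(n) + 0.5) / n            # midpoint grid
K = np.minimum.outer(t, t)

# the integral operator becomes the matrix K/n (Riemann weights 1/n)
lam, psi = np.linalg.eigh(K / n)        # eigenvalues, ascending
lam = lam[::-1]                         # descending

# known eigenvalues of the Brownian-motion covariance operator
k = np.arange(1, 6)
nu = 1.0 / (((k - 0.5) * np.pi) ** 2)
```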

1. Karhunen-Loeve Expansion of Vector Random Processes
2. Large Scale Learning: Dimensionality Reduction

Protected: my axiom of choice

Posted in Uncategorized by mlwkdb on October 15, 2009


singular value decomposition

Posted in kdb+, statistics by mlwkdb on October 7, 2009

The singular value decomposition of an n \times m matrix M is given by

M = U_{(n \times n)} \Sigma_{(n \times m)} {V^T}_{(m \times m)}

where the rows or right-singular vectors of V^T (and the columns or left-singular vectors of U) form a set of orthonormal basis vector directions:

U^T U = I and V^T V=I

The singular value matrix \Sigma is diagonal with entries equal to the square roots of the eigenvalues of the matrix M^TM. If the diagonal entries \sigma_i are sorted in decreasing order then \Sigma is uniquely determined. The same is not true of the left and right singular vectors.

As a geometric transform

Any real matrix may be viewed as a geometric transformation

M:\mathbb{R}^m \to \mathbb{R}^n : U \Sigma V^T=(hanger)(stretcher)(aligner)

that when applied to a set of data points M \vec{d}_i = U \Sigma V^T \vec{d}_i first aligns them with the coordinate axes (the rotation V^T), then stretches them along each axis by the corresponding singular value (\Sigma), and finally rotates the result into the output space (U). The order of the rotate and stretch is important as it determines the final output space. Note that from M\vec{v}_i = \sigma_i \vec{u}_i, \forall i \leq \min(n,m) (with M\vec{v}_i = 0 whenever \sigma_i = 0) it is clear that the transformation changes the underlying basis; the SVD stretches and rotates the dataset into a space whose basis is given by the set of left singular vectors. If for example we were to apply M to the unit sphere, it is transformed (rotated and stretched) into an ellipsoid whose semi-axes are given by the singular values of M.
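The unit-sphere picture is easy to check numerically in two dimensions (a sketch with a random 2x2 M): the image of the unit circle under M is an ellipse whose longest and shortest radii are \sigma_{\max} and \sigma_{\min}.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((2, 2))
U, sigma, Vt = np.linalg.svd(M)

# sample the unit circle and push it through M
theta = np.linspace(0, 2 * np.pi, 1000)
circle = np.vstack([np.cos(theta), np.sin(theta)])  # 2 x 1000
ellipse = M @ circle

# radii of the image points: bounded by the extreme singular values
radii = np.linalg.norm(ellipse, axis=0)
```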

As a linear combination

Another way to see this factorization is as a weighted ordered sum of separable matrices S_i=\vec{u}_i \otimes \vec{v}_i^T each of which has rank(S_i)=1, \forall i:

M = \sum_i \sigma_i \cdot S_i

As an unconstrained optimization

We can also formulate the solution of the svd in terms of an optimization: find the best (minimizes the frobenius norm \|M - B\|_F) approximation B of M that has lower rank \texttt{rank}(B) < \texttt{rank}(M). Dropping an appropriate number of singular values (starting with the smallest) and singular vectors and then reforming a matrix yields the exact global solution. The proof of this lies in the fact that

\| M - B \|_F = \left\| \sum_{i=1}^n \sigma_i \vec{u}_i \vec{v}_i^T - \sum_{i=1}^k \sigma_i \vec{u}_i \vec{v}_i^T \right\|_F = \left\| \sum_{i=k+1}^n \sigma_i \vec{u}_i \vec{v}_i^T \right\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_n^2}

This result is the Eckart–Young theorem.
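A quick numerical check of this identity (a sketch on a random matrix): the rank-k truncation of the SVD attains Frobenius error exactly \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_n^2}.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 3
B = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

err = np.linalg.norm(M - B, "fro")
expected = np.sqrt((s[k:] ** 2).sum())      # sqrt(sigma_{k+1}^2 + ... + sigma_n^2)
```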

Condition Number

The condition number of a matrix can be given in terms of its singular values:

k(M) = \|M \| \cdot \| M^{-1} \| = \frac{\sigma_{\max}(M)}{\sigma_{\min}(M)}

A matrix is said to be ill conditioned if the condition number is too large. How large the condition number can be, before the matrix is ill conditioned, is determined by the machine precision.


Protected: interfacing kdb+ & c: linking q.lib using mingw/gcc

Posted in kdb+ by mlwkdb on October 6, 2009


q/kdb+ multivariate gaussian distribution

Posted in kdb+, statistics by mlwkdb on September 10, 2009

Let's use this slightly hacky LU matrix factorization in q (hacky in that it uses lsq, q's least-squares solve, instead of Gaussian elimination):

lu:{[X]
 n:count[X]-1;
 if[n=0; :(enlist enlist 1f; X)];                        / base case: 1 x 1 matrix
 r:lu[X[til n;til n]];                                   / factor the leading principal submatrix
 $[n=1; v:X[til n;n]; v:lsq[X[til n;n];flip r[0]]];      / solve L v = last column
 $[n=1; mT:X[n;til n]%first r[1]; mT:lsq[X[n;til n];r[1]]]; / solve mT U = last row
 w:X[n;n]-mT wsum v;                                     / corner pivot
 L:(r[0],'n#0f),enlist mT,1f;
 U:(r[1],'v),enlist (n#0f),w;
 :(L;U)}

And a q determinant function based on the LU factorization:

det:{[X] n:count[X]; :prd lu[X][1] ./: (til n),'til n};
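The same bordering recursion can be sketched in Python (a translation for illustration; like the q version it does no pivoting, so it assumes the leading principal minors of X are nonsingular, e.g. X positive definite):

```python
import numpy as np

def lu_border(X):
    """LU factorization by bordering: factor the leading submatrix, then solve
    for the last column of U, the last row of L, and the corner pivot."""
    n = len(X)
    if n == 1:
        return np.ones((1, 1)), X.copy()
    Lr, Ur = lu_border(X[:-1, :-1])
    v = np.linalg.solve(Lr, X[:-1, -1])       # L v = last column of X
    mT = np.linalg.solve(Ur.T, X[-1, :-1])    # mT U = last row of X
    w = X[-1, -1] - mT @ v                    # corner pivot
    L = np.block([[Lr, np.zeros((n - 1, 1))], [mT[None, :], np.ones((1, 1))]])
    U = np.block([[Ur, v[:, None]], [np.zeros((1, n - 1)), np.array([[w]])]])
    return L, U

def det(X):
    return float(np.prod(np.diag(lu_border(X)[1])))   # product of the pivots

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))
X = A @ A.T + 5 * np.eye(5)   # positive definite, so no pivoting needed
```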

Finally, the multivariate gaussian distribution:

gaussnD:{[X;M;S]
 /check that S is symmetric and the dimensions conform
 if[not S~flip S;'`symmetry];
 if[not all n=/:(count[S];count[M];n:count[X]);'`size];
 reciprocal[xexp[(2*acos -1);n%2]*sqrt[det[S]]]*exp[neg((X-M)$inv[S]$(X-M))%2]}
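As a cross-check, a direct numpy transcription of the density (a sketch; the closed form is (2\pi)^{-n/2} \det(S)^{-1/2} \exp(-\frac{1}{2}(x-m)^T S^{-1}(x-m))):

```python
import numpy as np

def gauss_nd(x, m, S):
    """Multivariate gaussian density at x, mirroring the q routine above."""
    n = len(x)
    assert np.allclose(S, S.T), "S must be symmetric"
    d = x - m
    norm = ((2 * np.pi) ** (n / 2)) * np.sqrt(np.linalg.det(S))
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / norm
```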

To plot a bivariate distribution in R using the above definition of the gaussian:

library(lattice)
c <- open_connection()
execute(c, "S:`float$id[2];")
execute(c, "S[1;1]:0.4;S[0;0]:0.2;")
execute(c, "M:(0.2 -0.1);")
execute(c, "y:x:(-10+til 20)%10;")
execute(c, "z:gaussnD[;M;S] each raze x(,\\:)/: x;")
execute(c, "x:raze[x(,\\:)/: x][;0]")
execute(c, "y:raze[y(,\\:)/: y][;1];")
x <- execute(c, "x")
y <- execute(c, "y")
z <- execute(c, "z")
wireframe(z ~ x * y)  # surface plot of the bivariate density


q/kdb+ kernel regression

Posted in kdb+, q, regression, statistics by mlwkdb on September 8, 2009

In the MSE sense, the best prediction is always the conditional expectation E(\vec{y}|\mathbf{X}).

Let \mathcal{Y} be a random variable realizing a vector \vec{y} \in \mathbb{R}^n of n random variates. Let \mathcal{X} be a vector of m random variables realizing a matrix \mathbf{X} \in \mathbb{R}^{n\times m}. Linear regression approximates parameters \vec{\beta} such that:

\vec{y} = \mathbf{X}\vec{\beta} + \epsilon, \hspace{3mm} \epsilon \sim  N(0,\sigma^2)

The best \vec{\beta} parameter values (in the least squares sense, i.e. the unconstrained quadratic optimization {\arg \min}_\beta \|\vec{y}-\mathbf{X}\vec{\beta}\|_2^2 of the l_2 norm) are given by a unique analytical solution obtained by solving the normal equations. It is also easily proved that the conditional expectation E(\vec{y}|\mathbf{X}) is the least squares predictor of \vec{y} given \mathbf{X}. More generally, for any prediction problem \vec{y} = f(\mathbf{X}) + \epsilon, the function \hat{f}_{(\vec{y}, \mathbf{X})} that minimizes the l_2 norm of the training error is always the conditional expectation.
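A small simulation illustrating that last claim (synthetic data; the discrete predictor makes the conditional mean easy to estimate): among all functions of the input, the per-value conditional mean attains the smallest training MSE, so even the best linear fit cannot beat it.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.integers(0, 4, 10_000)                 # a discrete predictor
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

# conditional expectation estimated separately for each value of x
cond_mean = np.array([y[x == v].mean() for v in range(4)])
mse_cond = ((y - cond_mean[x]) ** 2).mean()

# a linear fit is just one particular function of x
a, b = np.polyfit(x, y, 1)
mse_lin = ((y - (a * x + b)) ** 2).mean()
```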

The conditional expectation can be decomposed as follows:

\displaystyle \hat{f}_{(\vec{y}, \mathbf{X})}=E(\vec{y}|\mathbf{X}) = \frac{\int \vec{y} P_{\mathbf{X}}(\vec{y},\vec{x}) d\vec{y}}{\int P_{\mathbf{X}}(\vec{y},\vec{x}) d\vec{y}} = \frac{\int \vec{y} P_{\mathbf{X}}(\vec{y},\vec{x}) d\vec{y}}{P_{\mathbf{X}}(\vec{x})}

Derivation of the Nadaraya-Watson non-parametric estimate

If we approximate P_{\mathbf{X}}(\vec{y},\vec{x}) and P_{\mathbf{X}}(\vec{x}) with their kernel density estimates [pdf] [applet] then we have:

\displaystyle \hat{f}_{(\vec{y}, \mathbf{X})}(\vec{x}) = \int \vec{y} \frac{  P_{\mathbf{X}}(\vec{y},\vec{x})}{P_{\mathbf{X}}(\vec{x})} d\vec{y} = \boxed{ \frac{\sum_{i=1}^n {\mathbb{K}\hspace{0.15mm}}_h \left(\vec{x}-\vec{x}_i\right)y_i}{{\sum_{i=1}^n\mathbb{K}\hspace{0.15mm}}_h \left(\vec{x}-\vec{x}_i\right)}}

If we take y_i = w_i then the solution can be seen as a weighted average of a set of kernel functions anchored at the data points \{ \vec{x}_1, \cdots , \vec{x}_n\}.
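The boxed estimator is a few lines of numpy (a sketch with a Gaussian kernel; the bandwidth h = 0.3 and the sine test function are illustrative):

```python
import numpy as np

def nw(x_query, X, y, h):
    """Nadaraya-Watson estimate at x_query: a kernel-weighted average of y."""
    w = np.exp(-((x_query - X) ** 2) / (2 * h ** 2))   # gaussian kernel weights
    return (w * y).sum() / w.sum()

rng = np.random.default_rng(6)
X = np.linspace(0, 2 * np.pi, 200)
y = np.sin(X) + 0.1 * rng.standard_normal(X.size)
yhat = np.array([nw(q, X, y, h=0.3) for q in X])
```

Away from the boundaries the smoothed estimate tracks the underlying sine much more closely than the noisy observations do.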

Solving the Nadaraya-Watson equation for optimal f

Minimize some metric of the training error

\displaystyle \min_{\vec{y} \in \mathbb{R}} \hspace{2mm} r(\vec{y},\mathbf{X})

For example the mean square error:

\displaystyle r(\vec{y},\mathbf{X}) = \frac{1}{n} \sum_{i=1}^n \| y_i - \hat{f}_{(\vec{y}, \mathbf{X})}(\vec{x}_i) \|^2=\frac{1}{n} \sum_{i=1}^n \left\| y_i - \frac{\sum_{j=1}^n {\mathbb{K}\hspace{0.15mm}}_h \left(\vec{x}_i-\vec{x}_j\right)y_j}{{\sum_{j=1}^n\mathbb{K}\hspace{0.15mm}}_h \left(\vec{x}_i-\vec{x}_j\right)} \right\|^2

The partial derivatives of r(\vec{y},\mathbf{X})

\displaystyle g(\vec{y},\mathbf{X}) = \frac{\partial r(\vec{y},\mathbf{X})}{\partial \vec{y}} = \left\{ \frac{\partial r(\vec{y},\mathbf{X})}{\partial y_1}, \frac{\partial r(\vec{y},\mathbf{X})}{\partial y_2}, \cdots, \frac{\partial r(\vec{y},\mathbf{X})}{\partial y_n} \right\}

are given by:

\displaystyle g_i(\vec{y},\mathbf{X}) =  \frac{\partial r(\vec{y},\mathbf{X})}{\partial y_i} = \frac{2}{n} \sum_{j = 1}^n (y_j - \hat{f}_{(\vec{y},\mathbf{X})}(\vec{x}_j)) \left(\mathbb{I}_{i=j}-\frac{\partial}{\partial y_i} \hat{f}_{(\vec{y},\mathbf{X})}(\vec{x}_j) \right)


\displaystyle  \frac{\partial}{\partial y_i}\hat{f}_{(\vec{y},\mathbf{X})}(\vec{x}_j) = \frac{\partial}{\partial y_i} \left( \frac{\sum_{k=1}^n {\mathbb{K}\hspace{0.15mm}}_h \left(\vec{x}_j-\vec{x}_k\right)y_k}{{\sum_{k=1}^n\mathbb{K}\hspace{0.15mm}}_h \left(\vec{x}_j-\vec{x}_k\right)} \right) = \left( \frac{{\mathbb{K}\hspace{0.15mm}}_h \left(\vec{x}_j-\vec{x}_i\right)}{{\sum_{k=1}^n\mathbb{K}\hspace{0.15mm}}_h \left(\vec{x}_j-\vec{x}_k\right)} \right)

Outline of a gradient descent kernel regression algorithm

  1. evaluate g_i(\vec{1}, \mathbf{X}), \forall i
  2. use gradient descent update rule: w_i \longleftarrow w_i - \eta \cdot g_i(\vec{w}, \mathbf{X})
  3. recalculate gradient using updated \vec{w}
  4. descend
  5. stop when \| \sum_{i=1}^n g_i(\vec{w}, \mathbf{X}) \|<\epsilon
  6. finally predict with \hat{y}(\vec{x}) = \hat{f}_{(\vec{w}, \mathbf{X})}(\vec{x})

Let's use an unnormalized radial basis function as our kernel. It takes a bandwidth parameter h that regulates model complexity:

kernel:{[x1;x2;h] :exp[neg reciprocal[2*xexp[h;2]] * sum[xexp[(x1-x2);2]]]}

Helper functions to calculate the partial derivatives and the gradient descent update:

fhat:{[x;X;y;h] k:kernel[x;;h] each flip X; :(y wsum k)%sum k};
derfhat:{[x;X;y;h] k:kernel[x;;h] each flip X; :k%sum k};
gi:{[ivar;X;y;h] n:count[y]; I:n#0f; I[ivar]:1f; :(2%n)*(y-fhat[;X;y;h] each X) wsum I-derfhat[X[ivar;];X;y;h]};
/assumes the data X (inputs), y (targets), x (query points) and bandwidth h are defined
n:count[y];w:n?1f;dw:enlist 0f;
do[2000; f:fhat[;X;w;h] each X; delta:gi[;X;w;h] each til n; dw,:sum[abs[y-f]]; w:w-delta];
yr:fhat[;X;w;h] each x