mathematical convex optimization
Most optimizations can be expressed as the minimization of an objective function subject to a set of constraints, both of which are defined over a set of optimization variables:

- optimization variables: $x = (x_1, \ldots, x_n)$
- objective function: minimize $f_0(x)$
- constraints: $f_i(x) \le b_i$, $i = 1, \ldots, m$

The optimization variables are typically actions (portfolio optimization: amounts invested in various assets) or parameters (data fitting: parameters of the model being estimated). There exists a family of optimization problems for which finding a global solution can be done both efficiently and reliably; for an optimization to be convex, the objective function and the constraints must all be convex functions (nonnegative curvature).
Examples include least squares (no constraints, analytical solution), linear programming (linear objective and constraints, no analytical solution) and quadratic programming (quadratic objective function with linear constraints; quadratically constrained variants also exist).
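As a small illustration of the first example, here is a numpy sketch (not from the post, which works in q) of the least squares analytical solution via the normal equations; the names `A`, `b`, `x_true` are illustrative:

```python
import numpy as np

# Least squares: minimize ||A x - b||^2, a convex problem whose global
# solution is available analytically from the normal equations A^T A x = A^T b.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true

# Solve the normal equations directly...
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
# ...or with the library's least-squares routine.
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(x_normal, x_true), np.allclose(x_lstsq, x_true))
```

Both routes recover the same minimizer because the objective is a convex quadratic with a unique stationary point.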
q load NYSE data (23 assets, 44 years) into kdb+
Here is a q script to load the NYSE merged dataset found here into kdb+:
f:`:ahp.csv`:alcoa.csv`:amerb.csv`:coke.csv`:comme.csv`:dow.csv`:dupont.csv`:ford.csv`:ge.csv`:gm.csv`:hp.csv`:ibm.csv`:inger.csv`:jnj.csv`:kimbc.csv`:kinar.csv`:kodak.csv`:merck.csv`:mmm.csv`:morris.csv`:pandg.csv`:schlum.csv`:sherw.csv;
c:`date`close;
t:(uj/) {update ticker:x from (flip c!("DF";",")0: x)} each f;
t:`ticker`date`close xcols update `$ssr[;":";""] each ssr[;".csv";""] each string[ticker] from t;
save `:t;
source("c:/r/qserver.r")
c <- open_connection()
t <- execute(c, "select from t where ticker in 16#exec distinct ticker from t")
names(t) <- c("ticker","date","close")
#t$date <- strptime(t$date, format="%Y%m%d")
t$ticker <- factor(t$ticker)
summary(t)
class(t)
str(t)
library(lattice)
#xyplot(close~date,t,groups=ticker,type='l')
xyplot(close~date|ticker,t,xlab='',ylab='',ylim=c(0.9,1.1),scales=list(tick.number=2),type='l')
t$ibm <- execute(c, "exec close from t where ticker=`ibm")
t$coke <- execute(c, "exec close from t where ticker=`coke")
t$ge <- execute(c, "exec close from t where ticker=`ge")
t$dow <- execute(c, "exec close from t where ticker=`dow")
t$ford <- execute(c, "exec close from t where ticker=`ford")
t$gm <- execute(c, "exec close from t where ticker=`gm")
library(car)
scatterplot.matrix(~ibm+coke+ge+dow+ford+gm, data=t)
close_connection(c)
q principal component analysis
PCA is a pattern recognition algorithm that highlights similarities and differences in a dataset. It can also be used to optimally (least information loss) reduce the dimension (given by the number of variates) of a dataset by projecting the data (in the least squares sense) onto a few of its principal (high variability) components (dimensions in transformed space). A reduction to 2 components (best approximating plane) can be used to capture and visualize the majority of the variance in the dataset.
The components correspond to the new transformed basis and don't necessarily have a one-to-one correspondence with the original variates; there are two ways to think about why we would consider dropping components in the transformed space:
- if certain variates are redundant (express almost the same information, i.e. are highly correlated) then it makes sense to include only a single component to record the information contained in the variates
- by rotating and stretching the data into a new basis we are left with components along which there is very little variance; intuitively these components correspond to noise since they fail to record any essential information
Since we are comparing the spreads of variates it makes sense to standardize the data to have zero mean and unit standard deviation.
Given a dataset $X \in \mathbb{R}^{m \times n}$ where the columns are sample vectors and the rows are feature vectors, we are interested in finding directions in $\mathbb{R}^m$ (the row space) that capture maximal variance. We calculate the center of mass $\mu = \frac{1}{n} \sum_{j=1}^{n} x_j$, which gives us the mean of each row. The center of mass will serve as the origin of the new basis.

Next we calculate the scatter matrix as the product of the demeaned dataset (subtract the mean of each row from the row so that all the rows have mean zero) with a transposed version of itself: $S = (X - \mu \mathbf{1}^\top)(X - \mu \mathbf{1}^\top)^\top$, or in matrix form $S = (XC)(XC)^\top = X C X^\top$ where $C = I_n - \frac{1}{n} \mathbf{1} \mathbf{1}^\top$ is the (symmetric, idempotent) centering matrix. The entries of the scatter matrix measure the variance of the data in different directions relative to the center of mass; specifically the $ij$th entry is (up to a factor of $n$) the covariance between the $i$th and $j$th feature.
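The two forms of the scatter matrix, and the idempotence of the centering matrix, are easy to check numerically; a small numpy sketch (the shapes and names here are illustrative, not from the post):

```python
import numpy as np

# Scatter matrix two ways: demean explicitly, or multiply by the centering
# matrix C = I - (1/n) 1 1^T, which is symmetric and idempotent (C @ C == C).
rng = np.random.default_rng(1)
m, n = 3, 50                          # m features (rows), n samples (columns)
X = rng.standard_normal((m, n))

mu = X.mean(axis=1, keepdims=True)    # center of mass (mean of each row)
S_demeaned = (X - mu) @ (X - mu).T    # explicit demeaning

C = np.eye(n) - np.ones((n, n)) / n   # centering matrix
S_centering = X @ C @ X.T             # matrix form: S = X C X^T

print(np.allclose(C @ C, C), np.allclose(S_demeaned, S_centering))
```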
Since we are rotating our data into a new basis, what we will need is a measure of the variance of the dataset projected onto each axis (in fact any arbitrary axis, since we need the one with highest variance). To illustrate how we can build such a measure (using only the scatter matrix and an arbitrary direction), let us project the data onto a line through the center of mass in the unit direction $u$, such that $\|u\| = 1$.

The distance of the projected point from the center of mass is then $u^\top (x_j - \mu)$.

So the total variance, or the squared sum of the distances of all the projected points from the center of mass, is given by:

$\sum_{j=1}^{n} \left( u^\top (x_j - \mu) \right)^2$

which after some matrix algebra works out to $u^\top S u$.
So now we can calculate the variance of the data projected onto some arbitrary direction $u$; the principal component is precisely the direction that maximizes this variance: maximize $u^\top S u$ subject to $u^\top u = 1$. The Lagrangian of this optimization is given by $L(u, \lambda) = u^\top S u - \lambda (u^\top u - 1)$, where $\lambda$ is a slack variable or Lagrange multiplier. We apply the Lagrangian saddle point equivalence theorem and differentiate with respect to the primal variables: $\nabla_u L = 2 S u - 2 \lambda u = 0$, or $S u = \lambda u$, which is in fact part of the eigendecomposition of the scatter matrix. So the direction along which the projected data has maximal variance is an eigenvector; but which one? The variance along an eigenvector is $u^\top S u = \lambda u^\top u = \lambda$, and since we are maximizing the variance, the eigenvector with a corresponding maximal eigenvalue is the principal component.
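The claim that the top eigenvector maximizes the projected variance can be sanity-checked against random unit directions; a numpy sketch with illustrative data:

```python
import numpy as np

# The direction u maximizing u^T S u over unit vectors is the top eigenvector
# of S; no random unit direction should exceed its projected variance.
rng = np.random.default_rng(2)
X = rng.standard_normal((3, 200))
Xc = X - X.mean(axis=1, keepdims=True)
S = Xc @ Xc.T

eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order
u_top = eigvecs[:, -1]                        # principal component
var_top = u_top @ S @ u_top                   # equals the largest eigenvalue

U = rng.standard_normal((3, 1000))
U /= np.linalg.norm(U, axis=0)                # 1000 random unit directions
var_rand = np.einsum('ij,ik,kj->j', U, S, U)  # u^T S u for each column

print(np.isclose(var_top, eigvals[-1]), (var_rand <= var_top + 1e-9).all())
```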
Note that if the eigenvalues are nearly the same then there is no preferable "principal" component. Also, PCA is optimal only in the least-squares sense.
Connection to SVD
The SVD of the scatter matrix is $S = U \Sigma V^\top$, and since the scatter matrix is symmetric, $S = S^\top$ implies $U \Sigma V^\top = V \Sigma U^\top$, which is satisfied when $U = V$; so we have $S = U \Sigma U^\top$, which is exactly the eigendecomposition, so that for a symmetric positive semidefinite matrix the eigenvalues are the singular values. Look here for the general relationship between the eigendecomposition and the SVD when the matrix is not square.
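This equivalence is quick to verify numerically; a numpy sketch with an illustrative scatter-like matrix:

```python
import numpy as np

# For a symmetric positive semidefinite matrix (such as a scatter matrix)
# the singular values coincide with the eigenvalues.
rng = np.random.default_rng(3)
X = rng.standard_normal((4, 30))
S = X @ X.T                                   # symmetric PSD

sing = np.linalg.svd(S, compute_uv=False)     # singular values, descending
eig = np.sort(np.linalg.eigvalsh(S))[::-1]    # eigenvalues, descending

print(np.allclose(sing, eig))
```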
Outline of PCA algorithm
- Calculate the scatter matrix $S$
- Calculate the SVD of $S$: $S = U \Sigma U^\top$
- Rearrange the singular values in decreasing order, $\sigma_1 \ge \sigma_2 \ge \ldots$, and reorder the singular vectors to match
- Select a new basis as a subset (the leading $r$) of the singular vectors
- Project the data onto the new basis
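The outline above can be sketched in numpy as follows (the post's own implementation below uses q with the qml library; this version is an assumption-free cross-check, with illustrative data sizes):

```python
import numpy as np

def pca(X, r):
    """X: features (rows) by samples (columns); returns r x n_samples scores."""
    Xc = X - X.mean(axis=1, keepdims=True)  # demean each feature
    S = Xc @ Xc.T                           # scatter matrix
    U, sigma, _ = np.linalg.svd(S)          # singular values come pre-sorted
    basis = U[:, :r]                        # new basis: leading r vectors
    return basis.T @ Xc                     # project onto the new basis

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 800))
scores = pca(X, 2)
print(scores.shape)  # (2, 800)
```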
.qml.mpca:{[x;r]
  / x is samples (rows) by variates (cols)
  t:x-\:avg x;            / demean each variate
  s:.qml.msvd[t$flip t];  / svd of the scatter matrix
  u:s[0][;til r];         / leading r singular vectors
  :flip[u]$t;             / project the data onto the new basis
 };
m:10; n:800; x:(m;n)#(m*n)?1f;
.qml.mpca[x;2]
Kernel PCA
Kernel PCA is an extension of PCA for nonlinear distributions. We would like to apply PCA in a higher-dimensional feature space $F$ reached by a map $\phi$; applying this map to the dataset explicitly is unnecessary because of the kernel trick. We simply form the kernel matrix $K$, with $K_{ij} = k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, and solve this dual problem by finding its singular value decomposition, $K = U \Sigma U^\top$. The projection of a point onto the $j$th principal component in the higher-dimensional feature space is then given by a linear combination of kernel evaluations against the training points, $\sum_i u_{ji}\, k(x_i, x)$, where $x$ is a feature in the original feature space. For a derivation of this result look here or here.
Variance Connection, Karhunen Connection
A typical data matrix contains signals, each of which has some underlying noise; a reduced-rank SVD approximation of the data matrix captures the highest-variance mixture of the original signals in a lower-dimensional space. The noise is usually associated with the smallest singular values: those directions say the least about the true distribution of the data, and correspond to the shortest semi-axes of the transformed unit sphere. They are dropped to form the approximation.
the karhunen–loève representation of a stochastic process
Consider a continuous-time, continuous-state-space stochastic process $\{X_t : t \in T\}$ that has a continuous and compact index set $T$, and where the random variables $X_t$ are measurable functions from a probability space $(\Omega, \mathcal{F}, P)$ to a (continuous, measurable) state space.

Assume the stochastic process is centered (each random variable is centered), so that $E[X_t] = 0$ for all $t \in T$; then the autocovariance (the variance of the process or signal against a time-shifted version of itself) is given by $K(s, t) = E[X_s X_t]$, which given our assumption can be viewed as a kernel.
Let us assume that this kernel is positive, symmetric and square integrable; then the resulting integral operator

$(T_K f)(s) = \int_T K(s, t)\, f(t)\, dt$

is positive, self-adjoint and compact. It therefore follows from the Spectral Decomposition Theorem that $T_K$ must have a countable set of nonnegative eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge 0$; furthermore, the corresponding eigenfunctions $\{e_k\}$ must form an orthonormal basis for $L^2(T)$:

$\int_T K(s, t)\, e_k(t)\, dt = \lambda_k e_k(s)$
The integral transformation essentially changes the representation of the input function to an output function defined in a space that has a basis given by the set of eigenfunctions.
Let us make a stronger assumption (than being square integrable) that the covariance (kernel) function is continuous; then by Mercer's theorem (page 34) we have that:

$K(s, t) = \sum_{k=1}^{\infty} \lambda_k e_k(s)\, e_k(t)$
where convergence is absolute and uniform in both variables. We define a new set of random variables based on our original stochastic process:

$Z_k = \int_T X_t\, e_k(t)\, dt$

that are orthogonal, $E[Z_j Z_k] = \lambda_k \delta_{jk}$, and centered, $E[Z_k] = 0$. We can then represent our original stochastic process as an infinite linear combination of orthogonal functions:

$X_t = \sum_{k=1}^{\infty} Z_k e_k(t)$
where convergence is uniform over the index set $T$. It is interesting to note that the orthogonal basis functions used in this expansion are determined by the autocovariance function.
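A discrete illustration of this idea, not from the post: for Brownian motion the covariance kernel is $K(s,t) = \min(s,t)$, so eigendecomposing its discretization shows how quickly a few leading eigenfunctions capture the variance of the process (grid size and the 10-term cutoff are illustrative choices):

```python
import numpy as np

# Discretize the Brownian-motion covariance kernel K(s,t) = min(s,t) on [0,1]
# and eigendecompose it: the eigenvectors are discrete Karhunen-Loeve basis
# functions, and the eigenvalues decay fast.
n = 200
t = np.linspace(1.0 / n, 1.0, n)
K = np.minimum.outer(t, t)

eigvals, eigvecs = np.linalg.eigh(K)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order

# Fraction of total variance captured by the leading terms of the expansion.
explained = eigvals.cumsum() / eigvals.sum()
print(explained[9] > 0.9)  # 10 terms already capture most of the variance
```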
References:
1. Karhunen–Loève Expansion of Vector Random Processes
2. Large Scale Learning: Dimensionality Reduction
singular value decomposition
The singular value decomposition of a matrix $A \in \mathbb{R}^{m \times n}$ is given by

$A = U \Sigma V^\top$

where the rows of $V^\top$ (the right-singular vectors $v_i$) and the columns of $U$ (the left-singular vectors $u_i$) form sets of orthonormal basis vector directions:

$U^\top U = I$ and $V^\top V = I$

The singular value matrix $\Sigma$ is diagonal with entries $\sigma_i = \sqrt{\lambda_i}$ equal to the square roots of the eigenvalues of the matrix $A^\top A$ (equivalently $A A^\top$). If the diagonal entries are sorted in decreasing order then $\Sigma$ is uniquely determined. The same is not true of the left and right singular vectors.
As a geometric transform
Any real matrix $A$ may be viewed as a geometric transformation

$x \mapsto A x = U \Sigma V^\top x$

that when applied to a set of data points rotates them in the direction of the right singular vectors first and then stretches them in the direction of each of the left singular vectors by a factor equal to the corresponding singular value $\sigma_i$. The order of the rotate and stretch is important as it determines the final output space. Note that from $A V = U \Sigma$ and $A v_i = \sigma_i u_i$ it is clear that the transformation changes the underlying basis; the SVD stretches and rotates the dataset into a space whose basis is given by the set of left singular vectors. If for example we were to apply $A$ to the unit sphere, it is transformed (rotated and stretched) into an ellipsoid whose semi-axes are given by the singular values of $A$.
As a linear combination
Another way to see this factorization is as a weighted, ordered sum of separable matrices $u_i v_i^\top$, each of which has $\operatorname{rank}(u_i v_i^\top) = 1$:

$A = \sum_{i=1}^{r} \sigma_i\, u_i v_i^\top$
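This rank-one expansion can be checked directly; a numpy sketch with illustrative dimensions:

```python
import numpy as np

# Rebuild A as the ordered sum of rank-one terms sigma_i * u_i v_i^T.
rng = np.random.default_rng(5)
A = rng.standard_normal((4, 6))
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

A_sum = sum(sigma[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(sigma)))
print(np.allclose(A, A_sum))
```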
As an unconstrained optimization
We can also formulate the solution of the SVD in terms of an optimization: find the best approximation $A_k$ of $A$ (minimizing the Frobenius norm $\|A - A_k\|_F$) that has lower rank $k$. Dropping an appropriate number of singular values (starting with the smallest) and singular vectors and then reforming a matrix, $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$, yields the exact global solution. The proof of this lies in the fact that

$\|A - A_k\|_F = \sqrt{\sum_{i > k} \sigma_i^2}$

This result is the Eckart–Young theorem.
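The Eckart–Young error formula is easy to verify numerically; a numpy sketch (matrix size and rank $k=2$ are illustrative):

```python
import numpy as np

# Truncate the SVD to the k largest singular values and check that the
# Frobenius error equals the root of the sum of the dropped sigma^2.
rng = np.random.default_rng(6)
A = rng.standard_normal((8, 5))
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err, np.sqrt((sigma[k:] ** 2).sum())))
```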
Condition Number
The condition number of a matrix can be given in terms of its singular values:

$\kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}}$
A matrix is said to be ill-conditioned if its condition number is too large. How large the condition number can be, before the matrix is considered ill-conditioned, is determined by the machine precision.
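The ratio definition agrees with what numerical libraries compute; a short numpy check (the matrix is illustrative):

```python
import numpy as np

# kappa(A) = sigma_max / sigma_min; numpy's cond (2-norm) is exactly this ratio.
rng = np.random.default_rng(7)
A = rng.standard_normal((5, 5))
sigma = np.linalg.svd(A, compute_uv=False)
print(np.isclose(sigma[0] / sigma[-1], np.linalg.cond(A)))
```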
q/kdb+ multivariate gaussian distribution
Let's use this slightly hacky (it uses least squares instead of Gaussian elimination) LU matrix factorization in q:
lu:{[X]
  n:count[X]-1;
  if[n=0;:(1f;X[0;0])];
  r:lu[X[til n;til n]];
  $[n=1;v:X[til n;n];v:lsq[X[til n;n];flip r[0]]];
  $[n=1;mT:X[n;til n]%r[1];mT:lsq[X[n;til n];r[1]]];
  w:X[n;n]-mT wsum v;
  L:(r[0],'n#0f),enlist(mT,1f);
  U:(r[1],'v),enlist((n#0f),w);
  :(L;U)
 };
And a q determinant function based on the LU factorization:
det:{[X] n:count[X]; :prd lu[X][1] ./: (til n),'til n};
Finally, the multivariate gaussian distribution:
gaussnD:{[X;M;S]
  / check that S is symmetric (a necessary condition for positive definiteness)
  if[not S~flip S;'`symmetry];
  if[not all n=/:(count[S];count[M];n:count[X]);'`size];
  reciprocal[xexp[2*acos -1;n%2]*sqrt[det[S]]]*exp[neg((X-M)$inv[S]$(X-M))%2]
 };
n:4;
S:(n;n)#(n*n)?1f;
S:(S+flip S)%2;
X:n?1f; M:n?1f;
gaussnD[X;M;S]
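For cross-checking the q implementation, here is the same density in numpy (an independent sketch, not the post's code); the sanity check uses the known 1-D standard normal peak $1/\sqrt{2\pi}$:

```python
import numpy as np

# Multivariate gaussian density:
# N(x; m, S) = exp(-(x-m)^T S^{-1} (x-m) / 2) / ((2*pi)^(n/2) * sqrt(det S)).
def gauss_nd(x, m, S):
    n = len(x)
    d = x - m
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(S))
    return np.exp(-d @ np.linalg.solve(S, d) / 2) / norm

# The 1-D standard normal peaks at 1/sqrt(2*pi) at the mean.
peak = gauss_nd(np.array([0.0]), np.array([0.0]), np.array([[1.0]]))
print(np.isclose(peak, 1 / np.sqrt(2 * np.pi)))
```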
To plot a bivariate distribution in R using the above definition of the gaussian:
source("c:/r/qserver.r")
c <- open_connection()
execute(c, "S:`float$id[2];")
execute(c, "S[1;1]:0.4;S[0;0]:0.2;")
execute(c, "M:(0.2 0.1);")
execute(c, "y:x:(-10+til 20)%10;")
execute(c, "z:gaussnD[;M;S] each raze x(,\\:)/: x;")
x <- execute(c, "x")
y <- execute(c, "y")
z <- execute(c, "z")
persp(x,y,matrix(nrow=20,ncol=20,z),theta=30,phi=20,shade=0.15,xlab='',ylab='',zlab='')
execute(c, "x:raze[x(,\\:)/: x][;0];")
execute(c, "y:raze[y(,\\:)/: y][;1];")
x <- execute(c, "x")
y <- execute(c, "y")
g <- data.frame(x,y)
g$z <- z
library(lattice)
wireframe(z~x*y,g,shade=TRUE,xlab='',ylab='',zlab='',box=FALSE)
close_connection(c)
q/kdb+ kernel regression
In the mse sense, the best prediction is always the conditional expectation
Let $y$ be a random variable realizing a vector of random variates $\mathbf{y} \in \mathbb{R}^n$. Let $x$ be a vector of random variables realizing a matrix $X \in \mathbb{R}^{n \times d}$. Linear regression approximates parameters $\beta$ such that:

$\mathbf{y} \approx X \beta$

The best parameter values (in the least squares sense, i.e. an unconstrained quadratic optimization of the $\ell_2$ norm of the residual) are given by a unique analytical solution obtained by solving the normal equations, $X^\top X \beta = X^\top \mathbf{y}$. It is also easily proved that the conditional expectation is the least squares predictor of $y$ given $x$. More generally, for any prediction problem, the function that optimally minimizes the $\ell_2$ norm of the training error is always the conditional expectation $E[y \mid x]$.

The conditional expectation can be decomposed as follows:

$E[y \mid x] = \int y\, f(y \mid x)\, dy = \frac{\int y\, f(x, y)\, dy}{f(x)}$
Derivation of the Nadaraya–Watson nonparametric estimate
If we approximate $f(x, y)$ and $f(x)$ with their kernel density estimates then we have:

$\hat{f}(x) = \frac{\sum_{i=1}^{n} K_h(x - x_i)\, y_i}{\sum_{i=1}^{n} K_h(x - x_i)}$

If we take $w_i(x) = K_h(x - x_i) / \sum_{j=1}^{n} K_h(x - x_j)$ then the solution $\hat{f}(x) = \sum_{i=1}^{n} w_i(x)\, y_i$ can be seen as a weighted average of a set of kernel functions anchored at the data points $x_i$.
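A minimal numpy sketch of the Nadaraya–Watson estimate with a Gaussian kernel (data and bandwidth are illustrative): because the weights are a convex combination, the estimate stays within the range of the observed targets, and constant targets are reproduced exactly.

```python
import numpy as np

# Nadaraya-Watson estimate: a weighted average of the observed y values,
# with weights given by a kernel anchored at the training points.
def nw(x, X, y, h):
    k = np.exp(-((x - X) ** 2) / (2 * h ** 2))  # kernel weights K_h(x - x_i)
    return (k * y).sum() / k.sum()              # normalized weighted average

rng = np.random.default_rng(8)
X = rng.uniform(0, 1, 50)
y = np.sin(2 * np.pi * X)

print(y.min() <= nw(0.5, X, y, 0.1) <= y.max(),
      np.isclose(nw(0.3, X, np.ones(50), 0.1), 1.0))
```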
Solving the Nadaraya–Watson equation for the optimal weights
Minimize some metric of the training error, for example the mean square error:

$E(w) = \frac{1}{n} \sum_{j=1}^{n} \left( y_j - \hat{f}(x_j; w) \right)^2$

The partial derivatives of $E$ are given by:

$\frac{\partial E}{\partial w_i} = -\frac{2}{n} \sum_{j=1}^{n} \left( y_j - \hat{f}(x_j; w) \right) \frac{\partial \hat{f}(x_j; w)}{\partial w_i}$

where

$\frac{\partial \hat{f}(x_j; w)}{\partial w_i} = \frac{K(x_j, x_i)}{\sum_{l=1}^{n} K(x_j, x_l)}$
Outline of a gradient descent kernel regression algorithm

- evaluate $\hat{f}(x_j; w)$ at every training point
- use the gradient descent update rule: $w \leftarrow w - \eta \nabla E(w)$
- recalculate the gradient using the updated $w$
- descend
- stop when the change in the training error is sufficiently small
- finally predict with $\hat{f}(x; w)$
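The steps above can be sketched in numpy as follows (an independent sketch, not the q code below; the bandwidth `h`, step size and iteration count are assumptions, and the gradient here is the standard one for a model that is linear in $w$):

```python
import numpy as np

# Fit the weights w of fhat(x) = sum_i w_i K(x, x_i) / sum_i K(x, x_i)
# by gradient descent on the mean square training error.
rng = np.random.default_rng(9)
X = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * X)
h, step = 0.1, 0.5                            # assumed bandwidth and step size

K = np.exp(-((X[:, None] - X[None, :]) ** 2) / (2 * h ** 2))
P = K / K.sum(axis=1, keepdims=True)          # P[j, i] = dfhat(x_j)/dw_i

w = rng.uniform(0, 1, 30)
err0 = np.mean((y - P @ w) ** 2)              # training error before descent
for _ in range(500):
    grad = -(2 / len(y)) * P.T @ (y - P @ w)  # dE/dw
    w -= step * grad
err1 = np.mean((y - P @ w) ** 2)              # training error after descent
print(err1 < err0)
```

Since $\hat{f}$ is linear in $w$, this objective is a convex quadratic and plain gradient descent converges for a small enough step.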
Let's use an unnormalized radial basis function as our kernel. It takes a bandwidth input parameter $h$ that regulates model complexity:
kernel:{[x1;x2;h] :exp[neg sum[xexp[x1-x2;2]]%2*xexp[h;2]]}
Helper functions to calculate the partial derivatives and the gradient descent update:
n:4; X:(n;n)#(n*n)?1f; y:n?1f;
h:0.5;  / bandwidth (not set in the original listing)
fhat:{[x;X;y;h] k:kernel[x;;h] each flip X; d:sum k; :(y wsum k)%d};
derfhat:{[x;X;y;h] k:kernel[x;;h] each flip X; :k%sum k};
gi:{[ivar;X;y;h] n:count[y]; I:n#0f; I[ivar]:1f; :(2%n)*(y-fhat[;X;y;h] each X) wsum (I-derfhat[X[ivar;];X;y;h])};
n:count[y]; w:n?1f; dw:enlist 0f;
do[2000;
  f:fhat[;X;w;h] each X;
  delta:gi[;X;w;h] each til n;
  dw,:sum abs y-f;
  w:w-delta];
w
dw
sum[w]
xr:10*n?1f;
yr:fhat[;X;w;h] each xr