# mlwkdb by ro

## almost ten years later

Posted in embedpy, kdb+, machinelearning, optimization, python, q by machine learning w kdb on July 6, 2018

so in 2008/2009 when i published the posts below i had just discovered and started working with q/kdb+

i still remember walking on granville island one sunny afternoon with dd and talking about the need to run more sophisticated analysis on what was then (and continues to be) the fastest time series column store

and then i told a stranger at the astoria one night that machine learning + kdb was the future and that stranger turned out to be an incredibly important ml researcher from stanford who encouraged me to start this blog

then someone published a connector to R and that was fantastic for then and so i wrote a few posts

and now 10 years later there is embedPy which is incredible – the way i see it, python is necessary for proof of concepts but eventually every productionized ml algorithm needs to be (mostly) written in pure q/k and benchmarked against python to prove it is producing identical results faster

to get there we first need a fully functional convex optimizer – an open source c package compiled with c.k and attached to q – in my opinion this should be the only portion that is not written in q – but i’m getting ahead of myself – i’m sticking to python initially as well just because its easier

i’m going to restart this blog now and include posts that not just go over practical machine learning (i.e. how do you run a support vector classification – see this fantastic post for example) but will also preface my results with underlining theory and diagrams that clearly explain how these statistical and machine learning algorithms are derived and will include a brief history of each in the appendix

i’m using latex to write my documents so they look better (following edward tufte’s strict conventions) but will convert and post to html as well

to begin with take a look at this presentation i recently gave at a kx meetup

## mathematical convex optimization

Posted in optimization by machine learning w kdb on December 21, 2009

Most optimizations can be expressed as the minimization of an objective function subject to a set of constraints both of which are defined over a set of optimization variables:

• optimization variables $x= \{ x_1, \cdots , x_n \}$
• objective function: $f_0 : \mathbb{R}^n \to \mathbb{R}$
• constraints: $f_i : \mathbb{R}^n \to \mathbb{R}, i=1, \cdots , m$

The optimization variables are typically actions (portfolio optimization: amounts invested in various assets) or parameters (data fitting: parameters of the model being estimated). There exists a family of optimization problems for which finding a global solution can be done both efficiently and reliably; for an optimization to be convex the objective function and the constraints must all have positive curvature.

Some examples include least squares (no constraints, has an analytical solution), linear programming (zero curvature objective and constraints, no analytical solution), quadratic optimization (quadratic objective function and either linear or non-linear constraints) are examples.

## q load NYSE data (23 assets, 44 years) into kdb+

Posted in Uncategorized by machine learning w kdb on October 28, 2009

Here is a q script to load the NYSE merged dataset found here into kdb+:

f::ahp.csv:alcoa.csv:amerb.csv:coke.csv:comme.csv:dow.csv:dupont.csv:ford.csv:ge.csv :gm.csv:hp.csv:ibm.csv:inger.csv:jnj.csv:kimbc.csv:kinar.csv:kodak.csv :merck.csv:mmm.csv:morris.csv:pandg.csv:schlum.csv:sherw.csv;
c:dateclose;
t: (uj/) {update ticker:x from (flip c!("DF";",")0: x)} each f;
t:tickerdateclose xcols update $ssr[;":";""] each ssr[;".csv";""] each string[ticker] from t; save :t;  source("c:/r/qserver.r") c <- open_connection() t <- execute(c, "select from t where ticker in 16#exec distinct ticker from t") names(t) <- c("ticker","date","close") #t$date <- strptime(t$date, format="%Y-%m-%d") t$ticker <- factor(t$ticker) summary(t) class(t) str(t) library(lattice) #xyplot(close~date,t,groups=ticker,type='l') xyplot(close~date | ticker,t,xlab='',ylab='',ylim=c(0.9, 1.1),scales=list(tick.number=2),type='l') t$ibm <- execute(c, "exec close from t where ticker in ibm")
t$coke <- execute(c, "exec close from t where ticker=coke") t$ge <- execute(c, "exec close from t where ticker=ge")
t$dow <- execute(c, "exec close from t where ticker=dow") t$ford <- execute(c, "exec close from t where ticker=ford")
t$gm <- execute(c, "exec close from t where ticker=gm") library(cars) scatterplot.matrix(~ibm+coke+ge+dow+ford+gm, data=t) close_connection(c)  Tagged with: , , , ## q principal component analysis Posted in kdb+, q, statistics by machine learning w kdb on October 23, 2009 PCA is a pattern recognition algorithm that highlights similarities and differences in a dataset. It can also be used to optimally (least information loss) reduce the dimension (given by the number of variates) of a dataset by projecting the data (in the least squares sense) onto a few of its principal (high variability) components (dimensions in transformed space). A reduction to 2 components (best approximating plane) can be used to capture and visualize the majority of the variance in the dataset. The components correspond to the new transformed basis and don’t necessarily have a one-to-one correspondence with the original variates; there are two ways to think about why we would consider dropping components in the transformed space: 1. if certain variates are redundant (express almost the same information i.e. are highly correlated) then it makes sense to include only a single component to record the information contained in the variates 2. by rotating and stretching the data into a new basis we are left with components along which there is very little variance; intuitively these components correspond to noise since they fail to record any essential information Since we are comparing the spreads of variates it makes sense to standardize the data to have zero mean and unit standard deviation. Given a dataset $X_{m \times n} = \{ \vec{x}_1 | \vec{x}_2 | \cdots | \vec{x}_n\} = \{\vec{f}_1, \vec{f}_2, \cdots , \vec{f}_m\}$ where the $n$ columns are sample vectors and the $m$ rows are feature vectors, we are interested in finding directions in $\mathbb{R}^m$ (the row space) that capture maximal variance. We calculate the center of mass $\nu_{m \times 1} = argmin_{\vec{x}} \sum \|\vec{x}_i-\vec{x}\|^2 = \frac{1}{n} \sum_{i=1}^n \vec{x}_i$ which gives us the mean of each row. The center of mass will serve as the origin of the new basis. Next we calculate the scatter matrix as the product of the demeaned dataset (subtract the mean of each row from the row so that all the rows have mean zero) with a transposed version of itself: $S_{m \times m}=\sum_{i=1}^m (\vec{f}_i - \nu_i)(\vec{f}_i-\nu_i)^T$ or in matrix form $S_{m \times m} = XC_nX^T$ where $C_n$ is the centering matrix. The entries of the scatter matrix measure the variance of the data in different directions relative to the center of mass; specifically the i-j th entry is the covariance between the ith and jth feature. Since we are rotating our data into a new basis, what we will need is a measure of the variance of the data set projected onto each axis (infact any arbitrary axis since we need the one with highest variance). To illustrate how we can build such a measure (using only the scatter matrix and any arbitrary unit vector direction) let us project the data onto a line $L=\nu+ae$ representing an axis in the direction $e$ such that $\|e \| =1$ The distance of the projected point $\vec{f'}_i$ from the center of mass $\nu$ is then $\| \vec{f'}_i - \nu \| = \langle e, \vec{f}_i - \nu \rangle \| e \| = e^T (\vec{f}_i - \nu)$ So the total variance or the squared sum of the distances of all the projected points from the center of mass is given by: $var(e) = \frac{1}{m} \sum \| \vec{f'}_i - \nu \|^2 = \frac{1}{m} \sum (e^T (\vec{f}_i - \nu))^2$ which after some matrix algebra works out to $var(e) = \frac{1}{m} e^T S e$. So now we can calculate the variance of the data projected onto some arbitrary direction $e$; the principal component is precisely the direction that maximizes this variance: $\max_e e^TSe$ subject to $e^Te=1$. The lagrangian of this optimization is given by $e^TSe - \lambda(e^Te-1)$ where the $\lambda$ is a slack variable or lagrange multiplier. We apply the lagrangian saddle point equivalence theorem and differentiate with respect to the primal variables $2Se-2 \lambda e = 0$ or $Se=\lambda e$ which is in fact part of the eigen-decomposition of the scatter matrix. So the direction along which the projected data has maximized variance is an eigen-vector; but which one? The variance of an eigen-vector is given by $var(e) = e^T S e = e^T \lambda e e^T e = \lambda$ and since we are maximizing the variance $\max_e var(e) = \max_e \lambda$ the eigen-vector with a corresponding maximal eigen-value $\lambda$ is the principal component. Note if the eigen-values are nearly the same then there is no preferable “principal” component. Also pca is optimal only in the least-squares sense. ## Connection to SVD The svd of the scatter matrix is $S=U\Sigma V^T$ and since the scatter matrix is symmetric $S=S^T$, we have $S=U\Sigma V^T=S^T=V\Sigma U^T$ which is satisfied only when $U=V$ so we have $S=U\Sigma U^T$ which is exactly the eigen-decomposition so that for $S$ the eigen-values are the singular values. Look here for the general relationship between the eigendecomposition and the svd when the matrix is not square. ## Outline of PCA algorithm 1. Calculate the scatter matrix S 2. Calculate the SVD of $S=U\Sigma U^T$ 3. Rearrange the singular values and singular vectors: $\tilde{\Sigma}$ and $\tilde{U}$ 4. Select a new basis as a subset of the singular vectors $\tilde{U}_{1:r}$ 5. Project the data onto the new basis $X_r = \tilde{U}_{1:r} \cdot X$ .qml.mpca:{[x;r] \x is samples (rows) by variates (cols) t:-[;avg x] each x; t: t$ flip t;
s:.qml.msvd[t];
u:s[0][;1+til r];
:flip[u] x; }; m:10; n:800; x:(m;n)#(m*n)?1f; .qml.mpca[x;2]  ## Kernel PCA Is an extension of PCA for non-linear distributions. We would like to apply PCA in a higher dimensional space $\vec{x} \to \phi(\vec{x})$; the application of this map to the data set is unnecessary because of the kernel trick. We simply form the kernel matrix and solve this dual problem by finding its singular value decomposition; $K \alpha = \lambda \alpha$. A feature $f_K$ in the higher dimensional feature space is given by $f_K = \sum \alpha_i k(\vec{z}, \vec{f}_i)$ where $\vec{z}$ is a feature in the original feature space. For a derivation of this result look here or here ## Variance Connection, Karhunen Connection A typical data matrix contains signals each of which has some underlying noise; a reduced rank SVD approximation of the data matrix captures a highest variance mixture of the original signals in a lower dimensional space. The noise is usually associated with the smallest singular values (outliers generally have the least to say about the true distribution of the data because well it is noise and doesn’t belong in the dataset; they have the smallest semi-axis in the transformed unit sphere i.e. they are what the fuck can i explain this better?) which are dropped to form the approximation. ## antlr grammar for q Posted in kdb+, q by machine learning w kdb on October 17, 2009 A q grammar file for antlr can be downloaded here. Tagged with: , ## the karhunen–loève representation of a stochastic process Posted in statistics by machine learning w kdb on October 17, 2009 Consider a continuous time, continuous state space stochastic process $\{ S_t : t \in \mathcal{I} \}$ that has a continuous and compact index set $\mathcal{I}$ and where the random variables $S_t$ are measurable functions from a probability space to a (continuous, measurable) state space: $S_t : (\mathcal{X}, \Sigma_\mathcal{X}, \mu) \to (\mathcal{Y}, \Sigma_\mathcal{Y})$ Assume the stochastic process is centered (each random variable is centered) so that $\mathbb{E}(S_t)=\int_\mathcal{X} S_t d\mu = 0$ then the autocovariance (the variance of the process or signal against a time shifted version of itself) is given by $Cov(S_t,S_s) = \mathbb{E}(S_t \cdot S_s) - \mathbb{E}(S_t) \cdot \mathbb{E}(S_s)$ which given our assumption can be viewed as a kernel $K_{SS}(t,s) = \mathbb{E}(S_t \cdot S_s) = Cov(S_t,S_s)$ Let us assume that this kernel is positive, symmetric and square integrable $K_{SS}(t,s) \in L^2(\mathcal{X} \times \mathcal{X})$ then the resulting integral operator $T_K f(\cdot) : L^2(\mathcal{X}) \to L^2(\mathcal{X}) = \int_{\mathcal{I}} K_{SS}(\cdot,s) f(t) ds$ is positive, self-adjoint and compact. It therefore follows from the Spectral Decomposition Theorem that $T_K$ must have a countable set of non-negative eigenvalues $\{ \nu_1, \nu_2, \cdots \}$; furthermore, the corresponding eigenfunctions $\{ \psi_1, \psi_2, \cdots \}$ must form an orthonormal basis for $L^2(\mathcal{X})$: $T\psi_i(t) = \int_\mathcal{I} K_{SS}(t,s) \psi_i(s) ds = \nu_i\psi_i(t)$ The integral transformation $T_K$ essentially changes the representation of the input function $f$ to an output function $(T_Kf)$ defined in a space that has a basis given by the set of eigenfunctions. Let us make a stronger assumption (than being square integrable) that the covariance (kernel) function is continuous $Cov(S_t, S_s) \in C^0(\mathcal{X} \times \mathcal{X})$ then by Mercer’s theorem (page 34) we have that: $Cov(S_t, S_s) = \sum_{i=1}^\infty \nu_i \psi_i(s) \psi_i(t)$ where convergence is absolute and uniform in both variables. We define a new set of random variables $\{ Z_n, n \geq 0\}$ based on our original stochastic process: $Z_n = \frac{1}{\sqrt{\nu_n}} \int_{\mathcal{I}} S_t \psi_n(t) dt$ that are orthogonal and centered $E(Z_n)=0$. We can then represent our original stochastic process as an infinite linear combination of orthogonal functions: $S_t = \sum_{i=1}^\infty \sqrt{\nu_i} Z_i \psi_i(t)$ where convergence is uniform over the index set $\mathcal{I}$. It is interesting to note that the orthogonal basis functions used in this expansion are determined by the autocovariance function. ## Protected: my axiom of choice Posted in Uncategorized by machine learning w kdb on October 15, 2009 This content is password protected. To view it please enter your password below: Enter your password to view comments. ## singular value decomposition Posted in kdb+, statistics by machine learning w kdb on October 7, 2009 The singular value decomposition of a $n \times m$ matrix $M$ is given by $M = U_{(n \times n)} \Sigma_{(n \times m)} {V^T}_{(m \times m)}$ where the rows or right-singular vectors of $V^T$ (and the columns or left-singular vectors of $U$) form a set of orthonormal basis vector directions: $U^T U = I$ and $V^T V=I$ The singular value matrix $\Sigma$ is diagonal with entries equal to the square roots of the eigenvalues of the matrix $M^TM$. If the diagonal entries $\sigma_i$ are sorted in decreasing order then $\Sigma$ is uniquely determined. The same is not true of the left and right singular vectors. ## As a geometric transform Any real matrix maybe viewed as a geometric transformation M:\mathbb{R}^m \to \mathbb{R}^n : U \Sigma V^T=(hanger)(stretcher)(aligner) that when applied to a set of data points $M \vec{d}_i = U \Sigma V^T \vec{d}_i$ rotates them in the direction of the right singular vectors $\{\vec{v}_1, \vec{v}_2, \cdots , \vec{v}_n \}$ first and then stretches them in the direction of each of the left singular vectors $\{\vec{u}_1, \vec{u}_2, \cdots , \vec{u}_m \}$ by a factor equal to its corresponding singular values $\{\sigma_1, \sigma_2, \cdots , \sigma_n \}$. The order of the rotate and stretch is important as it determines the final output space. Note that from $M\vec{v}_i = \sigma_i \vec{u}_i, \forall i \leq min(n,m)$ and $M\vec{v}_i = 0, \forall i > min(n,m)$ it is clear that the transformation changes the underlying basis; the SVD stretches and rotates the dataset into a space whose basis is given by the set of right singular vectors. If for example we were to apply $M$ to the unit sphere, it is transformed (rotated and stretched) into an ellipsoid whose semi-axes are given by the singular values of $M$. ## As a linear combination Another way to see this factorization is as a weighted ordered sum of separable matrices $S_i=\vec{u}_i \otimes \vec{v}_i^T$ each of which has $rank(S_i)=1, \forall i$: $M = \sum_i \sigma_i \cdot S_i$ ## As an unconstrained optimization We can also formulate the solution of the svd in terms of an optimization: find the best (minimizes the frobenius norm $\|M - B\|_F$) approximation $B$ of $M$ that has lower rank $\texttt{rank}(B) < \texttt{rank}(M)$. Dropping an appropriate number of singular values (starting with the smallest) and singular vectors and then reforming a matrix yields the exact global solution. The proof of this lies in the fact that $\| M - B \|_F = \left\| \sum_{i=1}^n \sigma_i \vec{u}_i \vec{v}_i^T - \sum_{i=1}^k \sigma_i \vec{u}_i \vec{v}_i^T \right\|_F = \left\| \sum_{i=k+1}^n \sigma_i \vec{u}_i \vec{v}_i^T \right\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_n^2}$ This result is the Eckhart-Young theorem. ## Condition Number The condition number of a matrix can be given in terms of its singular values: $k(M) = \|M \| \cdot \| M^{-1} \| = \frac{\sigma_{\max}(M)}{\sigma_{\min}(M)}$ A matrix is said to be ill conditioned if the condition number is too large. How large the condition number can be, before the matrix is ill conditioned, is determined by the machine precision. Tagged with: , , ## Protected: interfacing kdb+ & c: linking q.lib using mingw/gcc Posted in kdb+ by machine learning w kdb on October 6, 2009 This content is password protected. To view it please enter your password below: Enter your password to view comments. ## q/kdb+ multivariate gaussian distribution Posted in kdb+, statistics by machine learning w kdb on September 10, 2009 Lets use this slightly hacky (uses ols instead of gaussian elimination) LU matrix factorization in q: lu:{[X]n:count[X]-1; if[n=0;:(1f;X[0;0])]; r:lu[X[0+til n;0+til n]];[n=1;v:X[0+til n;n];v:lsq[X[0+til n;n];flip r[0]]];
$[n=1;mT:X[n;0+til n]%r[1];mT:lsq[X[n;0+til n];r[1]]]; w:X[n;n]-mT wsum v; L:(r[0],'n#0f),enlist(mT,1f); U:(r[1],'v),enlist((n#0f),w); : (L;U) };  And a q determinant function based on the LU factorization: det:{[X] n:count[X]; :prd lu[X][1] ./: (til n),'til n};  Finally, the multivariate gaussian distribution: gaussnD:{[X;M;S] /check that S is positive definite if[not S~flip S;'symmetry]; if[not all n=/:(count[S];count[M];n:count[X]);'size]; reciprocal[xexp[(2*acos -1);n%2]*sqrt[det[S]]]*exp[neg((X-M)$inv[S]$(X-M))%2] } n:4 S:(S + flip S)%2.; S:(n;n)#(n*n)?1f X:n?1f; M:n?1f; gaussnD[X;M;S]  To plot a bivariate distribution in R using the above definition of the gaussian: source("c:/r/qserver.r") c <- open_connection() execute(c, "S:float$id[2];")
execute(c, "S[1;1]:0.4;S[0;0]:0.2;")
execute(c, "M:(0.2 -0.1);")
execute(c, "y:x:(-10+til 20)%10;")
execute(c, "z:gaussnD[;M;S] each raze x(,\\:)/: x;")
x <- execute(c, "x")
y <- execute(c, "y")
z <- execute(c, "z")
`