What is the SMO (SVM) classifier? SMO stands for Sequential Minimal Optimization. Training a support vector machine requires solving a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of the smallest possible QP sub-problems (each over just two Lagrange multipliers). These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimizer as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size on various test problems, while the standard chunking SVM algorithm scales somewhere between linear and cubic in the training set size. SMO's computation time is dominated by SVM evaluation, so SMO is fastest for linear SVMs and sparse data sets. On real-world sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm.
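As a quick sketch of using an SMO-based solver in practice (scikit-learn's `SVC` wraps libsvm, whose optimizer is SMO; assuming scikit-learn is installed, and the toy data here is made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable 2-D problem (hypothetical data).
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0],
              [4.0, 4.0], [5.0, 5.0], [4.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# SVC goes through libsvm, which solves the QP via SMO's two-alpha
# analytic updates rather than a general numerical QP solver.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
pred = clf.predict([[0.5, 0.5], [5.0, 4.5]])  # one query point per cluster
print(pred)
```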
The method has also been extended to regression and to clustering / novelty-detection problems (one-class SVM).
maps feature vectors into a higher-dimensional space using a kernel function
builds an optimal linear discriminating function (a separating hyper-plane) in this space that fits the training data; for the linear kernel this is a linear boundary in the input space, while kernels such as RBF correspond to a non-linear boundary in the input space
In the case of SVM, the mapping into the higher-dimensional space is not computed explicitly.
All that needs to be defined is a similarity (an inner product, which plays the role of a distance) between any 2 points in the hyper-space.
The solution is optimal in the sense that the margin, i.e. the distance between the separating hyper-plane and the nearest feature vectors, is maximal.
The feature vectors that are the closest to the hyper-plane are called support vectors, which means that the position of other vectors does not affect the hyper-plane (the decision function).
The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin.
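The "depends only on the support vectors" claim can be checked empirically: refitting on just the support vectors should reproduce the same hyper-plane. A minimal sketch, assuming scikit-learn is installed (random blob data is made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Refit using ONLY the support vectors: points beyond the margin had
# alpha = 0, so dropping them should leave the hyper-plane unchanged
# (up to solver tolerance).
sv = clf.support_
clf_sv = SVC(kernel="linear", C=1.0).fit(X[sv], y[sv])
print(clf.coef_, clf_sv.coef_)
```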
Linear - maximize the margin; the solution is optimal, and only the few closest points are really needed: the others are zeroed out by their alphas (an alpha effectively says "pay attention to this point") in the quadratic programming problem. The term y_i y_j x_i^T x_j in that problem combines a similarity function (x_i^T x_j is large when x_i and x_j point in the same direction) with the labels y_i, y_j, so it measures whether similar pairs of points share the same output label. The farther points are therefore not needed, but the notion of similarity is central here.
Non-linear - e.g. a circle inside a circle needs to be mapped to a higher-dimensional space, and again a measure of similarity like x^T y is what matters. We use this similarity idea to map into a higher space, but we choose the map so that the resulting inner product behaves like a known function, such as (a + b)^2. For input q = (q1, q2), it turns out that the map phi(q) = (q1^2, q2^2, sqrt(2) q1 q2) is engineered, with that sqrt(2) factor, precisely so that phi(x)^T phi(y) = (x^T y)^2. We can substitute (x^T y)^2 for x^T y in the quadratic programming problem and let it do the mapping for us. This is the kernel trick: it sends the inner class to one side and the outer circle class to the other, and a plane passes between them.
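The circle-in-circle picture and the sqrt(2) trick can both be verified directly with numpy (a minimal sketch; the circle radii are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 50)

# Inner circle (class A) and outer circle (class B): not linearly
# separable in the original 2-D space.
inner = np.c_[0.5 * np.cos(theta), 0.5 * np.sin(theta)]
outer = np.c_[2.0 * np.cos(theta), 2.0 * np.sin(theta)]

def phi(X):
    # The engineered map: the sqrt(2) factor makes phi(x).phi(y) == (x.y)^2.
    return np.c_[X[:, 0] ** 2, X[:, 1] ** 2,
                 np.sqrt(2) * X[:, 0] * X[:, 1]]

# The first two mapped coordinates sum to the squared radius, so in the
# 3-D space a plane (e.g. at radius^2 = 1) separates the two circles.
r2_inner = phi(inner)[:, :2].sum(axis=1)   # ~0.25 everywhere
r2_outer = phi(outer)[:, :2].sum(axis=1)   # ~4.0 everywhere
print(r2_inner.max(), r2_outer.min())

# Kernel identity check: phi(x).phi(y) equals (x.y)^2.
x, y = np.array([1.0, 2.0]), np.array([3.0, 4.0])
lhs = phi(x[None])[0] @ phi(y[None])[0]
rhs = (x @ y) ** 2
print(lhs, rhs)
```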
Similarity is, intuitively, what makes the points of one class cluster together as opposed to the other class.
A general kernel K = (x^T y + c)^p is a polynomial kernel; it can express the squared function above (c = 0, p = 2) and many others.
The same quadratic programming formulation works with any valid kernel, the polynomial one included.
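A minimal numpy sketch of the polynomial kernel (the function name and test vectors are made up for illustration):

```python
import numpy as np

def poly_kernel(X, Y, c=1.0, p=3):
    # General polynomial kernel K(x, y) = (x.y + c)^p, computed for all
    # pairs of rows of X and Y; c and p are hyper-parameters.
    return (X @ Y.T + c) ** p

X = np.array([[1.0, 2.0], [0.0, 1.0]])
Y = np.array([[3.0, 4.0]])

# With c=0, p=2 this reduces to the (x.y)^2 kernel from the circle example.
K = poly_kernel(X, Y, c=0.0, p=2)
print(K)  # (1*3 + 2*4)^2 = 121 and (0*3 + 1*4)^2 = 16
```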
Most importantly, the kernel function is where our domain knowledge enters: IMO we should choose a kernel that fits the structure of our feature data.
The output of K is always a single number (a scalar), whatever the dimensionality of the underlying space.
Infinite-dimensional feature spaces are possible as well (the RBF kernel is the classic example).
Mercer condition - a valid kernel must act like an inner product (a similarity): formally, the Gram matrix it induces on any finite set of points must be symmetric positive semi-definite. That is the "rule" a kernel needs to follow.
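The positive semi-definiteness check can be done numerically on a sample of points (a rough sketch with made-up data; negative squared-distance is used here only as an example of an invalid kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Valid kernel: polynomial. Its Gram matrix should be PSD, i.e. all
# eigenvalues >= 0 up to numerical noise.
K = (X @ X.T + 1.0) ** 2
min_eig = np.linalg.eigvalsh(K).min()
print(min_eig >= -1e-6)

# Invalid "kernel": negative pairwise distance. Its Gram matrix has zero
# trace but nonzero entries, so it must have a negative eigenvalue.
K_bad = -np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
min_eig_bad = np.linalg.eigvalsh(K_bad).min()
print(min_eig_bad >= -1e-6)
```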
The method of SVM can be extended to solve regression problems.
Similar to classification, the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function for building the model ignores any training points whose prediction error is already small (they lie inside the epsilon-insensitive tube).
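A minimal SVR sketch showing that only a subset of the training points end up as support vectors (assuming scikit-learn is installed; data and hyper-parameters are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, (40, 1)), axis=0)
y = np.sin(X).ravel()

# epsilon defines the insensitive tube: training points predicted within
# epsilon contribute nothing to the loss, so they get alpha = 0 and are
# not support vectors.
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
n_sv = len(reg.support_)
print(n_sv, "of", len(X), "points are support vectors")
```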
- using kernel transforms to turn a non-linear problem into a linear problem beforehand.
From the link above, it seems liblinear is very much the same thing, just without those kernel transforms. So, as they say, in cases where kernel transforms are not needed (they mention document classification), it will be faster.
libsvm - an SMO implementation; training even a linear SVM through it costs roughly O(n^3).
liblinear - optimized to deal with linear classification without kernels; it does not support kernel SVMs.
(n is the number of samples in the training dataset.)
Conclusion: in practice libsvm becomes painfully slow at around 10k samples. Hence, for medium to large scale datasets, use liblinear and forget about libsvm (or maybe have a look at approximate kernel SVM solvers such as LaSVM, which saves training time and memory usage on large scale datasets).
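The two solvers are exposed side by side in scikit-learn, which makes the comparison easy to try (a sketch, assuming scikit-learn is installed; dataset size and C are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# SVC(kernel="linear") goes through libsvm's SMO, which scales poorly in
# the number of samples; LinearSVC goes through liblinear, which exploits
# the absence of a kernel and scales much better on large n.
svc = SVC(kernel="linear", C=1.0).fit(X, y)
lin = LinearSVC(C=1.0).fit(X, y)

acc_svc, acc_lin = svc.score(X, y), lin.score(X, y)
print(acc_svc, acc_lin)  # similar training accuracy, different scalability
```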
The kernel trick allows us to do certain calculations faster which would otherwise involve computations in the higher-dimensional space.
K(x, y) = <f(x), f(y)>. Here K is the kernel function, x and y are n-dimensional inputs, f is a map from the n-dimensional to an m-dimensional space, and <x, y> denotes the dot product. Usually m is much larger than n.
Normally, calculating <f(x), f(y)> requires us to compute f(x) and f(y) first and then take their dot product. These two steps can be quite expensive, as they involve manipulations in the m-dimensional space, where m can be a large number.
The result, however, is ONLY a scalar, i.e., a value in 1-dimensional space.
So we don't need to do that expensive calculation if we use a clever kernel that produces the same scalar directly from the n-dimensional inputs.
Simple Example: x = (x1, x2, x3); y = (y1, y2, y3). Then for the function f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the kernel is K(x, y ) = (<x, y>)^2.
Let's plug in some numbers to make this more intuitive: suppose x = (1, 2, 3) and y = (4, 5, 6). Then f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9) and f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36), so <f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024. With the kernel we get the same number in one cheap step: K(x, y) = (1*4 + 2*5 + 3*6)^2 = 32^2 = 1024.
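This arithmetic can be checked directly with numpy (the outer-product helper just builds the 9-dimensional map f from the example above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

def f(v):
    # Explicit 9-dimensional map: all pairwise products v_i * v_j.
    return np.outer(v, v).ravel()

lhs = f(x) @ f(y)    # dot product computed in the 9-dim space
rhs = (x @ y) ** 2   # kernel: one dot product in 3-dim, then square
print(lhs, rhs)      # both equal 1024.0
```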
The idea of SVM is that y = w^T phi(x) + b, where w is the weight vector, phi(x) is the feature vector, and b is the bias.
If y > 0, we classify the datum to class 1, else to class 0.
We want to find a set of weights and a bias such that the margin is maximized.
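The sign-based decision rule can be read straight off a fitted linear model (a minimal sketch, assuming scikit-learn is installed; the 1-D data is made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Four 1-D points, two per class, symmetric around zero.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Decision rule: class 1 exactly when w.x + b > 0.
scores = X @ w + b
labels = (scores > 0).astype(int)
print(labels)
```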
Previous answers mention that the kernel makes data linearly separable for the SVM. I think a more precise way to put this is: the kernel does not make the data linearly separable.
The feature vector phi(x) is what makes the data linearly separable. The kernel is there to make the calculation faster and easier, especially when phi is of very high dimension (for example, all monomials x1, x2, ..., x_D, x1^2, x2^2, ..., x_D^2, up to degree n).
Why it can also be understood as a measure of similarity: if we put the definition of the kernel above, <f(x), f(y)>, in the context of SVM and feature vectors, it becomes <phi(x), phi(y)>. The inner product measures the projection of phi(x) onto phi(y), or, colloquially, how much overlap x and y have in their feature space; in other words, how similar they are.
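The "kernel as similarity" reading is easiest to see with the RBF kernel, whose output lives in (0, 1]: identical points score 1, and the score decays toward 0 with distance (a numpy-only sketch; gamma is an arbitrary choice here):

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    # RBF kernel: exp(-gamma * ||x - y||^2), a similarity score in (0, 1].
    return np.exp(-gamma * np.sum((x - y) ** 2))

a = np.array([0.0, 0.0])
near = rbf(a, np.array([0.1, 0.0]))  # close points: similarity near 1
far = rbf(a, np.array([5.0, 5.0]))   # distant points: similarity near 0
print(near, far)
```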