Tale of twelve algorithms

An adventure from the plains of distance and similarity; on the rivers of probability; through the random forests of decision trees; gently sliding the slopes of gradient descent; into the seas of matrices and sandy tensor dunes. Strikingly similar, yet vastly different algorithms with unique superpowers live in those realms. Some are able to shapeshift and transform the land, others see into the future. Their DNA is written in strange scripture so elegantly that some may think it descended from gods.

Let’s have some fun in q

Most kdb+ developers write q to build tickerplants, databases or complex event-processing engines. Few use q as a data-science language. To the uninitiated, q is mostly an SQL-like language for extracting big, fast data from the kdb+ store.

This common impression is beginning to change. The proliferation of q analytical libraries emerging in the public domain brings convenience and a network effect to the community. Fun Q, the book and associated code library by Nick Psaris, is one of them.

As we will see, there is much more to q than just querying the database. The book’s 300+ pages are home to a single, lonely select-from-where query (featured for its irreplaceable dot notation).

Book for the new generation of data scientists

Readers with prior knowledge of kdb+ will benefit from the book in multiple ways. Kdb+ developers will find advanced idioms they previously encountered in the scriptures of q gods finally explained and used repeatedly, ready to become internalized in their daily repertoire.

Those looking to enter the world of machine learning will find the famous algorithms rise from the ground up, in front of their eyes. No more black boxes. We learn best by examples, pulling objects apart and storytelling. Every algorithm is broken down into its smallest parts. Every piece of dry math is accompanied by a real-world dataset—which we learn to download, visualize and model.

Seasoned data scientists find that the big or fast data they need to analyze is increasingly being stored in a kdb+ database. They will reap the speed benefit of bringing the machine learning algorithms to the place of storage, instead of incurring the cost of transferring and translating the data from the database layer to the research or application layer. This is a radical shift in their quant workflow.

Twelve diverse machine learning algorithms and their variants fit in just a thousand lines of code, including all the math, data wrangling and model validation. This should convincingly demonstrate that q is far more expressive than Python or R. Perhaps seasoned data scientists will find the algorithm implementations in q stunningly elegant, and leave convinced that q’s symbolic notation and notorious terseness are a thought-helping friend, not an enemy. Following the history of mathematical notation [1], data notation will evolve toward brevity [2] too. Both humans and machines perform faster on shorter code. Nick’s extensive commentary and examples make sure that the optimized brevity does not compromise the clarity, flow—and fun!

Unleashing lambda calculus

Looking to unleash the power of lambda calculus on big data? A match made in heaven. The book is a perfect opportunity to learn functional programming—or to unlearn bad procedural or OOP habits. Not a single class is defined. There is no unnecessary abstraction to keep us from the data, the metal and the brevity. How does the book manage complexity, then? The answer is functional composition (think higher-order functions) and strictly immutable state. This again reminds me of fractals—an infinitely complex yet modular system, derived from small, well-factored and highly reusable component functions. The data-ink ratio of the code is extremely high. There are no loops, rarely a need for flow control or head-spinning brackets, no nasty mutable-state side effects, and no tight coupling between objects or functions, so it is easy to reuse them in other projects. The book is a testament to the power of functional programming.

Modus operandi

It’s best to read the book linearly. Early chapters introduce the basic building block functions as needed and later chapters reuse them. The fourteen chapters gradually present twelve machine learning algorithms with increasing complexity. Every major algorithm gets its dedicated chapter.

Nick’s modus operandi for each chapter is the same: download the real-world dataset; run the algorithm to perform clustering or prediction; show the concise implementation; explain the q code line by line. That’s easy even with the most complex algorithms, because his functions are short, composable and reused.

Each algorithm, developed in front of our eyes, first runs on the data to train the model’s parameters. Nick equips us with functions and visualization tools to assess the model quality: ROC, AUC, recall, fallout, confusion matrix, etc. Functions to tune the model’s hyperparameters (such as cross-validation) reappear at the end of the chapters. If different variations of the same algorithm exist, Nick exposes their relative weaknesses and trade-offs. One example compares the speed of convergence of an iterative approach versus the memory consumption of an algebraic solution.

The soul of wit

The reader’s focus on the relationships and operations between variables in the code is paramount. Mentally parsing long variable names would detract from that focus. Therefore, Nick establishes a short variable-naming convention. Extensive comments on the meaning of the short variable names appear both in the text and in the code margins.

As you will see, the material is dense. Dozens of algorithm variants are developed in the short span of pages. The complete code is fully transparent throughout the chapters. Yet the reader has to study only unbelievably short code snippets—wait, this can’t be the whole thing. Yes it is, and Nick’s secret is in fractal-like reuse of the meticulously factored functions. The functions, often one-liners, rarely span more than four lines each. The resulting code base hosted on GitHub, a loose collection of over 300 functions, boasts about 1k lines of code.

Keeping the distance

K-nearest neighbors

The first algorithm chapter goes gentle into that good night, visiting the most intuitive of the supervised learning algorithms. The k-nearest neighbors (KNN) algorithm predicts the unknown variable based on the distance of its features to the rest of the training dataset. The user provides three parameters to the knn function: a distance metric, the number of neighbors to include, and the weight given to each neighbor. This chapter also introduces the data download function, which we will reuse extensively throughout the book. The function that partitions data is similarly reusable: given sampling weights and a sampling function, its most common use is to split a dataset into training and testing partitions. Confusion matrix analytics and cross-validation help pick the best value for the algorithm’s “k” parameter.
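
To make the idea concrete, here is a toy KNN classifier of my own, not the book’s knn implementation; it assumes a plain Euclidean metric, uniform neighbor weights, a list of feature rows X and labels y:

    / Euclidean distance between two feature vectors
    edist:{sqrt sum x*x:x-y}
    / classify query row v from training rows X and labels y by majority vote of the k nearest
    knnc:{[k;X;y;v] first key desc count each group y k#iasc X edist\:v}

    X:(1 1f;1 2f;8 8f;9 8f)    / toy training features
    y:`a`a`b`b                 / toy training labels
    knnc[3;X;y;1.5 1.5]        / `a -- two of the three closest points are labelled a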

K-means clustering

K-means, the simplest among the clustering algorithms, is the main protagonist of the next chapter and a distance-based sibling to KNN. In contrast to KNN, the only “lazy learning” algorithm in the book, K-means is an “eager learner”: rather than lugging the uncompressed training data around each time a new prediction is made on unseen data, it eagerly turns the whole training dataset into a reduced model consisting of k clusters. K-means does not predict. Iteration finds the cluster memberships that minimize the intra-cluster distances while maximizing the inter-cluster distances.

The chapter implements different methods for initializing the centroids (the centers of each cluster), guiding the centroids as they are recalculated, and determining the optimal number of iterations and centroids. It is quite impressive to discover that k-means is a specific parameterization of the more general Lloyd’s algorithm, which we can simply reuse to define k-medians, k-harmonic means and k-medoids – thanks to the functional nature of q.
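
A minimal sketch of one Lloyd iteration, with my own toy names rather than the library’s API, assuming X is a list of data rows and C a list of current centroids:

    / index of the centroid closest to point x
    nearest:{[C;x] d?min d:{sqrt sum x*x:x-y}[x;] each C}
    / one Lloyd step: assign each point to its nearest centroid, then move centroids to the group means
    / (for simplicity this sketch ignores empty clusters)
    step:{[X;C] avg each X value group nearest[C;] each X}

    X:(1 1f;1 2f;8 8f;9 9f)
    step[X]/[(0 0f;10 10f)]    / iterate to convergence from two seed centroids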

Hierarchical clustering

While the K-means chapter introduced the partitional clustering family of algorithms driven by centroid updates, the next chapter’s distance-driven algorithm achieves dataset compression without centroids. Hierarchical agglomerative clustering starts with as many clusters as there are data points, plus a matrix of distances between them. The algorithm then iteratively merges the two closest clusters until a single cluster contains all the data. During each merge, the ingenious Lance-Williams algorithm updates the new inter-cluster distances from the previous step’s distances alone. Merging stops when we reach the requested number of clusters.
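
The Lance-Williams update itself fits in one hedged line; the parameterization below is the textbook single-linkage case, and the names are mine rather than the book’s:

    / distance from the newly merged cluster (i,j) to any other cluster k,
    / computed purely from the previous step's distances dik, djk and dij
    lw:{[ai;aj;b;g;dik;djk;dij] (ai*dik)+(aj*djk)+(b*dij)+g*abs dik-djk}
    single:lw[.5;.5;0;-.5]    / single linkage: reduces to min[dik;djk]
    single[3 7;5 2;1]         / 3 2f -- the elementwise minimum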

The optimal number of clusters is determined with the help of the silhouette metric, which indicates how well each data point fits its assigned cluster rather than the next closest one. Thanks to clever factoring, the whole chapter’s rather complex-sounding logic is expressed in just 27 lines—shared among just four functions. You will enjoy the chapter’s commentary.

Weighing the probabilities

Expectation maximization

Previous algorithms assigned each data point to a single cluster with 100% certainty, while considering only the shortest distance and ignoring the compactness of the candidate clusters. Enter expectation-maximization (EM), where a data point can belong to more than one cluster with calculated probabilities, taking the density of the clusters into consideration.

Nick makes an interesting link that is hard to find in other books: the EM algorithm is a further generalization of k-means, and studying the functional implementation helps us see it. While the k-means higher-order function requires us to supply a distance function and a centroid function (for example, Euclidean distance and the average, respectively), EM accepts a likelihood function and a maximum-likelihood-estimator (MLE) function. In the k-means special case, the likelihood function returns 0 or 1 to indicate the closest centroid, while the MLE estimates only one parameter of the distribution – the centroid (most commonly an average of the data points).

One demonstrated application clusters handwritten digit images into ten clusters. Think of each image pixel as a feature with a varying probability of being black or white – a binomial mixture model. The second example revisits the famous iris dataset and finds three clusters with the Gaussian mixture model – think of each flower as belonging to a cluster with a computed probability, and each cluster described by a central location, density and covariance in relation to the other clusters.
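
As a flavour of the expectation step, here is a toy one-dimensional Gaussian-mixture E-step, not the book’s implementation; w, mu and s2 are my names for the mixture weights, means and variances:

    / Gaussian density, vectorized over the component means and variances
    gauss:{[mu;s2;x] exp[neg .5*t*t:(x-mu)%sqrt s2]%sqrt 2*acos[-1]*s2}
    / E-step: the responsibility each component takes for observation x
    estep:{[w;mu;s2;x] l%sum l:w*gauss[mu;s2;x]}

    estep[.5 .5;0 5f;1 1f;4.2]    / the second component claims nearly all the responsibility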

Naive Bayes

The next chapter’s code naturally builds from unsupervised clustering algorithms like EM into supervised algorithms like Naive Bayes. If correct labels (or classes) are already assigned to our training observations, we already know the cluster memberships. The algorithm can therefore skip the alternating cluster-assignment and update steps that EM iterates until convergence. Instead, in a single pass, we calculate the distribution-model parameters that best fit each cluster, reusing the MLE functions from the previous chapter. Once again, the elegance of the library shows in the ability to reuse the previous chapter’s likelihood functions along with the computed Naive Bayes model to classify new data (predict their class).
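
A hedged sketch of that single pass, assuming continuous features and a Gaussian model per class; nbfit is my own name, not the book’s:

    / per-class maximum-likelihood estimates: the mean and variance of every feature column
    / (class priors would come from counting the same groups)
    nbfit:{[X;y] {(avg x;var x)} each X value group y}

    X:(1 1f;1.2 0.9;8 8f;7.5 8.3)
    y:`a`a`b`b
    nbfit[X;y]    / ((means;variances) for class a; (means;variances) for class b)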

Growing the trees

Decision trees

Instead of modelling features as distributions to predict the unseen, the next chapter uses decision trees to partition one feature at a time into buckets that gradually become more uniform. A variety of functions measure impurity, decide which feature to partition by next, and control the complexity of the tree. Due to the lack of native tree structures in q, Nick invents a data structure to hold the tree-like nested levels during the iterative growth and pruning. Even utilities to graphically visualize tree hierarchies are provided.
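
For example, two of the classic impurity measures are one-liners in q; the names below are mine and only hint at the machinery the chapter builds:

    prop:{(count each group x)%count x}    / class proportions of a vector of labels
    gini:{1-sum p*p:prop x}                / Gini impurity
    entropy:{neg sum p*2 xlog p:prop x}    / information entropy, in bits

    gini `a`a`b`b       / 0.5
    entropy `a`a`a`b    / 0.811...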

Decision tree ensembles

The decision tree algorithm accepts many parameters that address the classic trade-off between bias (small, inaccurate trees) and variance (bushy, over-fitted trees). Complex methods pre- and post-prune the trees in the hope of finding the optimal balance. The next chapter finds an even better solution with the introduction of ensembles. By growing many decision trees, each on a random subset of the data, a ‘random forest’ is capable of making better predictions, as the errors tend to cancel each other out.
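
The ensemble mechanics themselves are tiny. A hedged sketch of bagging, with names of my own: draw bootstrap indices for each tree, then take a majority vote over the trees’ predictions:

    bootstrap:{x?x}                             / x indices drawn with replacement from 0..x-1
    vote:{first key desc count each group x}    / majority class across an ensemble's predictions

    bootstrap 5        / e.g. 0 3 3 1 4
    vote `b`a`b`b`a    / `b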

Sliding down the slope

Linear regression

Linear and logistic regression, neural networks and matrix factorization are all specific parameterizations of the same higher-order functional algorithm—gradient descent. Truly a versatile workhorse which, together with regularization, takes center stage in the next four chapters.

Choose to minimize the sum of squared errors (the differences between the actual and predicted values of a continuous ordered variable) as the measure of prediction loss, and we are looking at linear regression, sometimes nicknamed least-squares regression. Although this optimization problem of finding the coefficients can sometimes be solved with the so-called normal equations, manipulating matrices composed of the whole dataset, this is not always possible. An approximate alternative and our main hero, the gradient descent algorithm, iterates a randomly initialized set of parameters in the steepest direction towards a local minimum. The cost function (a measure of inaccuracy), and in turn its gradient, optionally accept an arbitrary list of regularization functions, giving rise to regression variants less prone to coefficient overfitting. Specific examples include LASSO, ridge regression and elastic-net regression.
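
A minimal batch gradient-descent step for least squares, written as a sketch rather than the book’s API; gdstep and its argument order are mine, X is a list of feature rows, y the targets, a the learning rate and th the coefficients:

    / move the coefficients against the gradient of the mean squared error
    gdstep:{[a;X;y;th] th-a*(e mmu X)%count e:(X mmu th)-y}

    X:(1 1f;1 2f;1 3f)             / a constant column plus one feature
    y:3 5 7f                       / generated by y = 1 + 2x
    gdstep[0.1;X;y]/[2000;0 0f]    / approaches 1 2f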

We have two options to control the speed of descent: 1. scale each feature to the same unit of variance so a single learning rate can be used; or 2. dynamically adjust the learning rate to reflect the slope steepness and cater to the variance of each feature. Even stochastic gradient descent is demonstrated by reusing the elegantly factored functional primitives.

By using sub-samples of the data, stochastic gradient descent is able to arrive quickly in the vicinity of the minimum—with no convergence guarantees. At the limit, a sub-sample of a single observation becomes online gradient descent.
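
Under the same assumptions, one stochastic step simply reuses the batch rule on a random sub-sample of m observations (again a sketch of my own, with gdstep, X and y from the earlier snippet):

    sgdstep:{[a;m;X;y;th] i:m?count y; gdstep[a;X i;y i;th]}
    sgdstep[0.1;2;X;y]/[5000;0 0f]    / noisier, but hovers near 1 2f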

Logistic regression

Logistic regression begins like linear regression, but concludes by using the sigmoid function to compress the continuous ordered predictions into the range between 0 and 1, yielding probabilities. Round them up or down and we are performing binary classification. If we need multi-class classification (one-vs.-all), we fit a binary classification model for each class, and the class with the highest probability is the predicted class. This is easy because every function designed by Nick naturally generalizes from atoms to vectors to matrices to tensors – in this case from individual to ensemble thetas. The algorithm, of course, leverages the same gradient descent infrastructure, just configured with a different loss function and, in turn, gradient function. Again, we reap the fruits of functional composition.
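
Compared with the linear sketch above, only the prediction, and therefore the gradient, changes; swapping in the sigmoid gives a toy logistic step (my own names, not the book’s API):

    sig:{1%1+exp neg x}    / squash scores into the interval (0;1)
    lgstep:{[a;X;y;th] th-a*(e mmu X)%count e:sig[X mmu th]-y}

    X:(1 1f;1 2f;1 8f;1 9f)
    y:0 0 1 1f                        / binary labels
    th:lgstep[0.5;X;y]/[5000;0 0f]
    sig X mmu th                      / fitted probabilities, approaching 0 0 1 1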

Neural networks

If the reader, like yours truly, thought that neural networks (NNs) are hard to understand, Nick’s functional approach illuminates. A whole book could be written about the different types of NNs, but this chapter implements only one architecture: a fully-connected feed-forward neural network. Luckily for our understanding, Nick reuses the same gradient descent and regularization framework from the previous chapters, with the additional complexity of a multi-layer network.

The neural network algorithm remembers the activation values at each layer during the feed-forward loss computation. It then uses the supplied gradient function to correct each parameter during the backpropagation stage. Instead of merely implementing the common sigmoid activation function and log-loss cost function with a fixed network topology, Nick once again provides a fully factorized algorithm which allows variations in network topology, hidden layer activation functions and final layer activation and cost functions. With full generality, the algorithm permits binomial classification, multi-class classification and even regression. The elegance of the implementation is breathtaking.
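
To give a flavour of the feed-forward half, here is a sketch of a forward pass expressed as a fold over a list of weight matrices; biases are omitted, and ff and ws are my names rather than the book’s:

    sig:{1%1+exp neg x}
    / feed forward: fold the input vector through successive weight matrices and activations
    ff:{[ws;v] {sig y mmu x}/[v;ws]}

    ws:(2 3#.1*1+til 6;1 2#.5 .5)    / toy weights: 3 inputs -> 2 hidden units -> 1 output
    ff[ws;1 0 1f]                    / a single prediction in (0;1)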

Connecting the dots

Recommender systems

The next chapter is about recommender systems, matrix factorization and datasets rife with null values. Using a movie-ratings dataset, we build models that generate statements like “Customers who bought X also bought Y”. Nick unleashes the workhorses developed in prior chapters.

Gradient descent with regularization fits the coefficients of a user’s preference for each of the product features in so-called content-based filtering. By modelling movies as a linear combination of genres that we either like or dislike, we can recommend other movies that share the same genres.

In the second application, called user-user collaborative filtering, our old friend KNN from chapter two also meets the null-rich dataset, this time observing how users are similar and predicting a person’s movie ratings from those of their nearest neighbors. In contrast, the KNN variant called item-item collaborative filtering observes how movies are similar and predicts a movie’s rating from the person’s own ratings of that movie’s neighbors.

Finally, the third and most comprehensive approach, matrix-factorization-based collaborative filtering, factors the rating matrix into a user-preference matrix and a product-feature matrix. Recombined, they approximate the original ratings but also infer hidden ratings for previously unrated user-product pairs.
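
The essence of the factor model fits in a line: if U holds a row of latent preferences per user and V a row of latent features per movie, their product reconstructs, and fills in, the ratings. A sketch with my own toy factors:

    U:(1 0f;0 1f;1 1f)    / 3 users x 2 latent features
    V:(5 1f;1 5f)         / 2 movies x 2 latent features
    U mmu flip V          / predicted ratings, including pairs never rated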

Three matrix factorization algorithms are discussed. The first uses the gradient descent of the previous chapters with a new cost function that models the errors introduced when recombining the factored matrices. The second, a stochastic one, more easily arrives at the global optimum of this non-convex problem. Lastly, alternating least squares (ALS) tackles the non-convexity by alternately solving for the user and product matrices, each iteration assuming the other matrix is correct.

PageRank

The last algorithm chapter grants understanding of the famous algorithm behind the Google search engine—PageRank. Links between pages are modelled as an adjacency matrix. The probabilities of clicking the links on a page are represented as a Markov matrix (with rows summing to 1). To simulate web surfing, the current page-visit vector is iteratively multiplied by the Markov transition matrix. Add a few random resets to the surfing, tuned with a damping factor, and we arrive at the Google matrix. Iteratively multiplying the initial position vector by the Google matrix until convergence then produces the PageRank of each site. These concepts are elegant q one-liners, of course.
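
These steps translate into a few hedged lines, assuming a dense adjacency matrix A in which every page has at least one outgoing link; the function names are mine, not the book’s:

    markov:{x%sum each x}                 / normalize each row to sum to 1
    google:{[d;M] (d*M)+(1-d)%count M}    / dampened transition (Google) matrix
    surf:{[G;r] r mmu G}                  / one surf step: redistribute the rank vector
    pagerank:{[d;A] surf[google[d;markov A]]/[n#1f%n:count A]}

    A:(0 1 1f;1 0 0f;0 1 0f)    / a tiny three-page web
    pagerank[0.85;A]            / ranks summing to 1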

There are alternative implementations, addressing various weaknesses: slow convergence, a large memory footprint, and sparse adjacency matrices. The power method takes seed ratings and iteratively improves them. The algebraic solution is faster, but inverts the whole matrix; it is a closed-form solution stemming from the observation that the PageRank vector is the vector which remains unchanged when multiplied by the transition matrix and then dampened. Lastly, there is the sparse-matrix solution. In a real network, most of the values in the adjacency matrix are 0. Unfortunately, q doesn’t have native support for sparse matrices, so Nick develops the necessary support for us and reuses it across many sparse-matrix computations.

Show me the data

The final chapter exposes the ASCII terminal plotting functions we relied on throughout the book to gain a visual understanding of the data. The plotting works with three user choices: plot dimensions, character palette, and the method for aggregating overlapping data. We are also given utilities for pretty-printing axis labels. The plotting utilities are demonstrated on none other than the Mandelbrot set, with ASCII characters in the terminal representing the color palette. It is even possible to escape the terminal and export the images into a high-resolution portable file format. The last example of the book (sadly, the last) presents a “sparkline” function that turns Unicode characters into compact charts: ▁▂▃▁▅▂▆▇▅▂. Like reading a good novel, the final words of the book left me wanting the fun to continue.
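
To show the spirit of that last example without reproducing the book’s function, here is a hedged sparkline of my own; it buckets a numeric vector into eight Unicode block characters:

    blocks:("▁";"▂";"▃";"▄";"▅";"▆";"▇";"█")
    spark:{raze blocks 7&floor 8*(x-m)%1e-12+max[x]-m:min x}

    spark 1 2 3 1 5 2 6 7 5 2f    / renders the vector as a row of block characters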

Performance and productivity takeaways

Whenever an opportunity to optimize in time, space or brevity presents itself, Nick peppers the narrative with tips. We are thus not just learning to write machine learning from scratch, but honing our language skills in the process. Each algorithm’s implementation story yields a tip or two. Some of the many advanced performance and productivity takeaways for me are:

  • Multi-argument (and faster) projections are possible with the :: operator instead of the @ operator. Use @ if your projection should be evaluated immediately upon supplying one argument; otherwise use :: for the faster projection, which is not evaluated until all remaining arguments are supplied. (chapter 2)
  • Sometimes I need to operate on rows rather than the generally faster columns. I could flip the large matrix and crash a memory-constrained system. A better solution is to iterate my function over the second dimension of the matrix, at a small cost. The wrapper function f2nd optimizes this common use case. (chapter 3)
  • Iterate a complex state with // (chapter 4)
  • Place null values on matrix diagonals and prefer min/max over &| to elegantly avoid data-point self-reference when calculating extrema such as the minimum distance: the min of -0w 0n is -0w, while -0w&0n is 0n (see the console sketch after this list). Higher-order functions should accept both functions and function names (symbols) to yield better composability. (chapter 4)
  • For the matrix algebra needed throughout the book, Nick reimplements the native q matrix operators for three reasons: 1. to parallelize matrix multiplication; 2. to enable X*Y' and X'*Y matrix multiplication without paying the cost of flipping the data; 3. to allow the optional redefinition of functions such as least squares and matrix inversion using external libraries such as BLAS. (chapter 9)
  • The each-both operator ' applies not only to functions but extends naturally to tensors (and matrices) as well. For example, T'[0 0 1 1;1 2 2 3;3 2 0 0] indexes deeply into the tensor T; this can be rewritten as (T') . (0 0 1 1;1 2 2 3;3 2 0 0). See the sketch after this list. (chapter 13)
  • Generating many small vectors is a good way to fragment q’s memory. Flipping between table and dictionary is cheap; flipping a matrix isn’t. Nick always expresses matrices as a list of a few, but long, vectors. I will always keep this in mind when defining data structures. (chapter 14)
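
Two of these tips are easy to verify at the console; the toy matrix and tensor below are mine (see the chapter 4 and chapter 13 bullets above):

    / chapter 4: null the diagonal, then min skips the self-distance while & would not
    d:(0n 2 5;2 0n 1;5 1 0n)    / pairwise distances with nulls on the diagonal
    min each d                  / 2 1 1f -- each point's nearest *other* point
    min -0w 0n                  / -0w    -- min ignores nulls
    -0w&0n                      / 0n     -- & propagates them

    / chapter 13: each-both indexes point-wise into a tensor
    T:2 2 2#til 8
    T'[0 1;1 0;1 0]             / 3 4 -- gathers T[0;1;1] and T[1;0;0]
    (T') . (0 1;1 0;1 0)        / same result via dot-apply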

All-purpose data utilities

Nick’s contribution to the q data-science ecosystem is immense, if not pivotal. I am sure I will find myself reusing many of the book’s data utilities – even outside the machine learning context. Null-aware matrix aggregations. Aggregation-based transformation functions such as divide-by-sum. Generating plots without leaving the convenience of the terminal. Visualizing, for example, how clusters change positions in each iteration. Feature scaling. Tensor and sparse tensor operators. These are just a few to mention.

Conclusion

Data in the 21st century is like oil in the 18th century [3]. A data scientist can save millions in labor and network-infrastructure costs. She can unlock further billions by gaining access to real-time insight. This is the benefit of a lab set up right at the heart of the rig, silo, distillery and refinery complex. An evolving [4] symbolic data notation [5] shortens the communication lines and breaks down the complexity overhead. It is this integrative kind of technology and language that makes the world yet smaller.

C++ has QuantLib, Python has numpy et al. Q/kdb+ now has three brand-new machine learning libraries [6]. That is, if we don’t count the additional ad-hoc GitHub projects of community members. We are seeing the start of a new era unfolding page by page. Hopefully, the three libraries will harmonize into a powerful toolbox in the hands of a new breed of data scientist.

Fun Q: A Functional Introduction to Machine Learning in Q
Published by: Vector Sigma, New York USA (2020)
ISBN: 978-1-7344675-0-5
https://www.fun-q.net/

References

[1] History of mathematical notation (2020). Available at: https://en.wikipedia.org/wiki/History_of_mathematical_notation (Accessed: 26 September 2020).

[2] Language of mathematics (2020). Available at: https://en.wikipedia.org/wiki/Language_of_mathematics#Concise_expression (Accessed: 26 September 2020).

[3] The world’s most valuable resource is no longer oil, but data (2017). Available at: https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data (Accessed: 26 September 2020).

[4] History of mathematical notation (2020). Available at: https://en.wikipedia.org/wiki/History_of_mathematical_notation#Symbolic_stage (Accessed: 26 September 2020).

[5] “Notation as a tool of thought | Communications of the ACM” (2020). Available at: https://dl.acm.org/doi/10.1145/358896.358899 (Accessed: 26 September 2020).

[6] github.com/KxSystems/ml, github.com/hanssmail/quantQ and github.com/psaris/funq