The way forward [GSoC 2017]

The summer of 2017 was a tough time for me – I realised I couldn’t finish my GSoC 2017 project within the program deadlines, so I decided to withdraw from the program. It wasn’t an easy decision to make – I had wonderful mentors, and the project, while challenging, was a lot of fun to work on.

With that being said, the purpose of this blog post is to discuss what would have to be done to finish the project – there’s quite a bit to do, and I figured getting my thoughts down would help both me and anyone else who wants to contribute.

For starters, there is a list of useful papers and material related to the project over at my personal repository here. We will be trying to base our implementation around the one described in the paper by Betancourt. In my previous blog post I talked about having to update the energy and momentum equations – our challenge remains the same, but the path is a lot clearer now.

Our entire task is summed up in the appendix – approximating the Hessian will make the computation a lot easier, and Betancourt mentions three different approximations that fit into our SoftAbs metric. Footnote 3 on page 11 even mentions that it’s possible to reduce the computation to O(N) if we compute Hessian-vector products with automatic differentiation techniques.
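
To make the O(N) remark concrete, here’s a minimal Theano sketch of the double-backprop trick for Hessian-vector products – the energy below is just a stand-in function, not the actual pymc3 potential:

```python
import theano
import theano.tensor as T

x = T.dvector('x')
v = T.dvector('v')

# Stand-in potential energy; in practice this would come from the model's log posterior
energy = 0.5 * T.sum(x ** 2) + 0.25 * T.sum(x ** 4)

grad = T.grad(energy, x)
# Hessian-vector product without ever forming the Hessian: H v = d/dx (grad . v)
hess_vec = T.grad(T.sum(grad * v), x)

hvp = theano.function([x, v], hess_vec)
```

Each call costs roughly as much as a couple of gradient evaluations, which is where the O(N) scaling comes from.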

 

Once we have our Hessian, we can move on to a symplectic integrator, which is described in appendix B. Algorithms 1 and 2 describe this, approximating the evolution of the current position and momentum. Page 3 of the paper does a nice job of explaining some of the approximations being carried out in the algorithms.
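
To give a feel for what appendix B describes, here’s a rough NumPy sketch of the generalized (implicit) leapfrog structure – not the pymc3/Theano code itself. The callables `dH_dq` and `dH_dp` for the partial derivatives of the Hamiltonian are assumptions of the sketch, and the implicit updates are resolved with a few fixed-point iterations:

```python
import numpy as np

def generalized_leapfrog(q, p, dH_dq, dH_dp, eps, n_fixed_point=6):
    """One step of the implicit (generalized) leapfrog for a
    non-separable Hamiltonian H(q, p), using fixed-point iterations."""
    # Half-step in momentum (implicit in p)
    p_half = p.copy()
    for _ in range(n_fixed_point):
        p_half = p - 0.5 * eps * dH_dq(q, p_half)

    # Full step in position (implicit in q)
    q_new = q.copy()
    for _ in range(n_fixed_point):
        q_new = q + 0.5 * eps * (dH_dp(q, p_half) + dH_dp(q_new, p_half))

    # Final explicit half-step in momentum
    p_new = p_half - 0.5 * eps * dH_dq(q_new, p_half)
    return q_new, p_new
```

The fixed-point loops are the price we pay for a position-dependent metric – with a constant metric they collapse back into the ordinary leapfrog.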

Following the restructuring of the NUTS class, we have to implement the new integrator while keeping the same base HMC class, with a few considerations for manifolds. PR #2439 attempts to begin this.

I was halfway through my implementations of the Hessian approximations when I had to take my break, and it will be a while before I can get back into it – but it would be great to see PRs or discussions about this. It isn’t a straightforward task, but it is certainly doable – the Stan implementation is proof that it can all be stitched together!

I hope this serves at least as a direction for others looking to get involved with RMHMC – and also fosters some discussion on the same!


Energy, velocity, integrators – and some code.

It’s about the halfway mark of Google Summer of Code 2017, and the fight against RMHMC continues. After struggling with manifolds and metrics for a good 2 weeks, I decided to re-read the papers on RMHMC to see if I was missing something, or to get an idea of how to proceed.

And I think I’m onto something – the paper introducing the SoftAbs metric as a Riemannian metric goes into detail about mathematical tricks that ease the computational burden of RMHMC – which is when I realised that the only things really relevant to us are the Hamiltonian equations, because the final sampling is based only on the change in the energy equations.

What does this mean? By rewriting the Theano Hamiltonian equations to account for a new metric, we should have a new sampler! Of course, this also means rewriting the energy and velocity equations, and more importantly, the integrator. The integrator function allows us to move to the next point in our space, and since the whole idea of using Hamiltonians is to move through this space “better”, this becomes the key part of the project.
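
For concreteness, here’s a minimal NumPy sketch of the kind of energy we need to compute – `metric(q)` is an assumed callable returning the position-dependent mass matrix (e.g. the SoftAbs-smoothed Hessian), and `neg_logp(q)` the potential:

```python
import numpy as np

def rmhmc_energy(q, p, neg_logp, metric):
    """H(q, p) = -log pi(q) + 0.5 * log|Sigma(q)| + 0.5 * p^T Sigma(q)^{-1} p,
    up to an additive constant."""
    sigma = metric(q)
    _, logdet = np.linalg.slogdet(sigma)
    kinetic = 0.5 * logdet + 0.5 * p @ np.linalg.solve(sigma, p)
    return neg_logp(q) + kinetic
```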

From a coding perspective, this means I might not end up with thousands of lines of complicated mathematical code; rather, I should figure out how best to translate the integrator functions into clean Theano code. As of now, RMHMC will inherit from the same base class as HMC, and is structured very similarly. This means that making changes to the trajectory and quad potential files is the bulk of the work.

But the bulk of the coding and testing still remains – I just now know what I should be doing, something I’d rather have figured out a long time ago. The nature of the project means it’s rather theory-heavy, and it isn’t the easiest math and physics around. Still, I’ve been learning tons throughout, and hope that I can pull through!

Stuck on not so smooth Manifolds

It’s been about a month since GSoC 2017 kicked off – and things have not been as smooth as I’d have liked. By this time, as I had mentioned in my proposal, my RMHMC code should have been taking some shape – but I only have two notebooks with some pseudocode, and I spend most of my time poring over the literature related to the project.

This reminded me a lot of my initial struggles with GSoC 2016, and how long it took to actually start pushing code. What helped was breaking my problem down into smaller problems and getting down and dirty with them.

My first task, given to me by Colin, is to create some general manifold equations and try to find distances on them. I’m working on this at the moment, and it’s giving me the clarity I need when working on the heavier stuff.

I had been mulling over the idea of working on probability distribution metrics for a while – a few days were ‘wasted’ coming up with an API for this, but the exercise was useful in getting a better feel for how pymc3 is built. I’d still like to come back and think about possible classical statistical distances for comparing models (or traces).

The second task, which I’ve taken on myself, is to make some important design decisions – especially whether we can use the base HMC class – so that I can push some actual code onto the very dry (at the moment) PR #2240!

Looking to pick up the pace – and my next blog post should talk about how things have been!

Riemannian Manifold Hamiltonian Monte Carlo what?

Trying to explain my 2017 Google Summer of Code project with pymc3 to people usually results in me being even more confused about what I’ll actually be doing this summer. I figured a blog post breaking it down would make things easier for everyone involved.

What usually leads to raised eyebrows is the name of the sampling algorithm I will be implementing – Riemannian Manifold Hamiltonian Monte Carlo. Those are a lot of big words, and not all of them might make sense even if you’ve dabbled with python or data science before. But it’s actually quite easy to understand, even if right now the only thing that comes to mind when you hear of Monte Carlo is the casino over at Monaco.

[Image: the Monte Carlo casino – not quite the Monte Carlo we’ll be working on.]

 

Let’s start by talking about what a Monte Carlo method is, and what sampling is. The wikipedia page on sampling actually does a good job explaining it; sampling can be understood as choosing a representative subset of the data or distribution we are dealing with. There are many different kinds of sampling algorithms – why do we concern ourselves with Monte Carlo ones?

This is simply because of how powerful this suite of algorithms is. We can sample from complex and possibly correlated distributions, and do it well. And how? The main idea is to use randomness to solve problems that might be deterministic in principle – and it works well! In cases where there is a complex interaction of many variables – or when the inherently probabilistic nature of certain phenomena rules out a definitive prediction – we can estimate the answer by using essentially random inputs (within realistic limits) to model the system and produce probable outcomes.
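
A tiny example of that idea – estimating a perfectly deterministic quantity, π, using nothing but random points:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size=(1_000_000, 2))
inside_circle = (points ** 2).sum(axis=1) <= 1.0
print(4 * inside_circle.mean())   # close to 3.1416
```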

Markov chain Monte Carlo (MCMC) methods are what we use to sample from probability distributions. We remember information about our previous step, jump towards regions of high probability, and hang around there for our sampling.
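
Here’s a minimal sketch of that behaviour (plain random-walk Metropolis, nothing pymc3-specific), where `logp` is an assumed callable giving the unnormalised log probability:

```python
import numpy as np

def metropolis(logp, q0, n_samples=5000, scale=0.5, seed=0):
    """Random-walk Metropolis: propose a nearby point, then accept or
    reject it based on the ratio of (unnormalised) probabilities."""
    rng = np.random.default_rng(seed)
    q = np.asarray(q0, dtype=float)
    samples = []
    for _ in range(n_samples):
        proposal = q + scale * rng.standard_normal(q.shape)
        if np.log(rng.uniform()) < logp(proposal) - logp(q):
            q = proposal                 # jump towards regions of high probability...
        samples.append(q.copy())         # ...and hang around there
    return np.array(samples)
```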

I could go on and explain exactly how this works, but there are a lot of well written explanations out there, both mathematical and otherwise. I particularly like this blog post by pymc3 maintainer Thomas, and stack exchange also does a good job. It’s also fun to know why it was called the Monte Carlo method – back in the 1940s, when von Neumann and Ulam were working on it, the code name Monte Carlo was suggested by their colleague Metropolis, whose uncle would often gamble in the Monte Carlo casino we talked about before. It’s also a very important method from a historic point of view, having been used in the simulations for the Manhattan Project, which produced the atomic bomb.

So now that the Monte Carlo bit is down, let’s see how Hamiltonian mechanics fits into all this. When we use Markov chain Monte Carlo, we jump from one state to another while doing our sampling – Hamiltonian Monte Carlo is when we move between successive sampled states using a Hamiltonian evolution. But wait again – what is a Hamiltonian? It can be understood as a reformulation of classical mechanics, where we rewrite our equations so that we deal with the total energy of the system we are working with. And as for why: the properties of Hamiltonians allow us to reduce the random-walk behaviour of MCMC – the point of Hamiltonian MC is to use a different kind of sampling that hopefully “explores” the distribution more efficiently. Instead of generating the next sample by taking the current one and making a random step in some defined way, HMC lets us simulate Hamiltonian dynamics from the current sample.
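
Here’s a stripped-down sketch of a single HMC transition (the plain Euclidean-metric version that RMHMC generalises), again assuming `logp` and `grad_logp` callables rather than the actual pymc3 machinery:

```python
import numpy as np

def hmc_step(q, logp, grad_logp, eps=0.1, n_steps=20, rng=None):
    """One HMC transition: draw a momentum, simulate Hamiltonian dynamics
    with the leapfrog integrator, then accept/reject on the energy change."""
    rng = rng or np.random.default_rng(0)
    p = rng.standard_normal(q.shape)
    q_new, p_new = np.array(q, dtype=float), p.copy()

    # Leapfrog integration of the dynamics
    p_new += 0.5 * eps * grad_logp(q_new)
    for _ in range(n_steps - 1):
        q_new += eps * p_new
        p_new += eps * grad_logp(q_new)
    q_new += eps * p_new
    p_new += 0.5 * eps * grad_logp(q_new)

    # Metropolis correction on the total energy H = -log p(q) + 0.5 * |p|^2
    h_old = -logp(q) + 0.5 * p @ p
    h_new = -logp(q_new) + 0.5 * p_new @ p_new
    return q_new if np.log(rng.uniform()) < h_old - h_new else q
```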

One of my favourite intuitive explanations of HMC came from my mentor for this project, Colin Carroll. He says – “Metropolis sampling will take the previous point, and pick a new proposal nearby, usually using a normal distribution. This proposal is accepted or rejected based on some numerical bookkeeping. The problem is that in high dimensions, there is too much space, and we either reject every proposal, or take super tiny steps. Hamiltonian Monte Carlo solves this problem by inducing a vector field that our proposals can “flow” along, leading to distant proposals that are (almost) always accepted, at the price of computing gradient information.”

For more details, Alex Rogozhnikov’s blog post is great, and Michael Betancourt’s paper on the subject is authoritative. For a less math-heavy explanation, this works too.

We’re getting there now – only two more words to go!

Riemannian Manifolds – quite the mouthful, eh?
Again, the wiki does a good job of explaining what a manifold is – it’s a topological space that locally resembles Euclidean space. In other words, if you “zoom in” close enough, we’re dealing with a Euclidean space. A nice example to understand this is our planet, Earth. Why did folks think the Earth was flat? I mean, it does look flat if we look around, but walk long enough in the same direction and you end up where you started! So our planet is a sphere – a curved two-dimensional surface globally – but locally it appears to be a two-dimensional plane. In other words, Earth is a two-dimensional manifold.
So why is this relevant to Monte Carlo or Hamiltonians? By working on Riemannian manifolds, we can use Riemannian metrics – position-dependent notions of distance that vary smoothly (differentiably) from point to point.

Hamiltonian Monte Carlo already deals with manifolds, in the sense that the potential energy of a state is defined as a function on some manifold – what we intend to do is, instead of using the standard Euclidean distances used before, make things funky by using Riemannian manifolds and metrics. These allow us to change our equations around, letting us sample from more complex and correlated data. How would using a different metric change the way we deal with data?

Maybe think about the distance between two points on a torus. The Riemannian metric would measure a path along the surface of the torus, but the Euclidean metric would use a straight line. For points that are far apart, the length of the straight line might be very different from the actual distance.
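
A toy version of the same idea on a circle – compare the straight-line (Euclidean) distance with the distance measured along the curve itself:

```python
import numpy as np

def chord_vs_arc(theta1, theta2, radius=1.0):
    """Euclidean (straight-line) distance vs. distance along the circle
    between two points given by their angles."""
    p1 = radius * np.array([np.cos(theta1), np.sin(theta1)])
    p2 = radius * np.array([np.cos(theta2), np.sin(theta2)])
    euclidean = np.linalg.norm(p1 - p2)            # cuts straight through the interior
    arc = radius * abs((theta2 - theta1 + np.pi) % (2 * np.pi) - np.pi)
    return euclidean, arc

print(chord_vs_arc(0.0, 3.0))   # roughly (2.0, 3.0) – the two notions disagree
```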

If you want to get more into the math and physics behind this, the reading material on my pymc3 GSoC page will let you know what’s up.

Let’s sum it up – Monte Carlo simulation lets us sample from probability distributions and data, Hamiltonians allow us to avoid some pitfalls of MCMC by assigning energy states and momenta, and Riemannian manifolds give HMC even more power by using more advanced distance metrics. Phew!

Not so nasty after all, eh? The first bit I’m working on is implementing the SoftAbs metric, which seems like it should be a load of fun. I’ll be attempting to explain my summer work over a bunch of blog posts – and hope to see you around!
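
As a teaser, here’s a first, hedged sketch of what the SoftAbs metric involves (eigendecompose the Hessian and smooth its eigenvalues towards their absolute values, following Betancourt’s SoftAbs paper – this is not the final pymc3 code):

```python
import numpy as np

def softabs_metric(hessian, alpha=1e6):
    """SoftAbs map: smooth the Hessian's eigenvalues towards their absolute
    values so that the resulting metric is positive definite."""
    lam, Q = np.linalg.eigh(hessian)
    with np.errstate(divide='ignore', invalid='ignore'):
        lam_soft = lam / np.tanh(alpha * lam)        # lambda * coth(alpha * lambda)
    lam_soft[~np.isfinite(lam_soft)] = 1.0 / alpha   # limit as lambda -> 0
    return (Q * lam_soft) @ Q.T
```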

 

Making Machine Learning Easy

As I grapple with my undergraduate thesis in Machine Learning, I realise that things would have been easier if I had paid more attention in my classes, or learned certain things earlier. Luckily I have a very understanding and helpful guide in Dr. Benjamin here at INRIA, Lille, but things could have been worse.

This (very short) blog post is mainly targeted at first- and second-year undergraduate Computer Science students at BITS Pilani, but could be extended to other Indian engineering colleges too, I guess.

So, some very basic things which would make a career in Machine Learning or Data Science easier –

  1. Pay attention to your Math 2 (Linear Algebra) and Prob & Stats courses. They are usually the foundation of whatever ML you will do. I had to study this paper thoroughly for my Google Summer of Code project, and am studying this paper now for my thesis. I would have saved a lot of valuable time if I had known those two courses well!
  2. Don’t just pay attention – do well and score well in them. Professors usually ask for your grades in these two courses, so it is very useful to have a ‘B’ grade or higher. That being said, you can still survive by just knowing these two topics well (through relevant textbooks, online courses, and YouTube videos).
  3. Learn Python and C++ well. Alternatively, you can be good at R, Julia, or other languages which implement Machine Learning techniques. Proficiency in at least one of them is usually a must when you apply for positions and jobs.

I know the advice is very superficial – isn’t it obvious that you have to be good at this? But people tend to take it for granted. I know I did. And now I have to put in double the effort to do my work. There are times when I am re-watching YouTube videos on conditional expectation and Frobenius forms of matrices and I think – well, damn, I’ve heard all of this before but don’t know it inside out. And knowing it inside out really helps if you want to better understand why your ML algorithm does the magic it does. 🙂

GSoC Final Submission

This blog post shall serve as my final submission link for Google Summer of Code 2016.

My task was to implement a pythonic Dynamic Topic Model for gensim and RaRe Technologies. I was very ably guided by my mentors Lev and Radim, and am happy that my code has been merged into gensim via PR #739. You can follow the tutorial to understand the theory behind Dynamic Topic Models and use the code here: notebook tutorial link.

My collaboration with Gensim has not been limited to this PR – I also enjoyed fixing bugs and adding features throughout the summer (all my PRs can be viewed here).

I have also regularly blogged about my work on the RaRe Technologies blog, where you can read my bi-weekly blog posts and a more regularly updated live blog.

There is still a lot of scope for taking this project further – particularly by improving memory usage, speed, and documentation. Adding the Document Influence Model mode and making the code distributed are other aims.

I have been addressing this via a new PR, #831. It will attempt to keep a fixed memory footprint, make the code more ‘pythonic’ via style changes, and improve speed by further vectorising the code. I will also update the tutorial notebook to explain how you can further play around with your Dynamic Topic Model. This should be up soon, and will make the already merged code only better. 🙂

Participating in GSoC has been a great and humbling learning experience, where I not only learned many new skills and coding practices, but also realised that there is so much more to learn and do in the fields of open source Machine Learning and Data Science. I hope to keep contributing to Gensim and the open source community!

Dynamic NMF and Dynamic Topics

While hunting for a data set to try my DTM python port on, I came across this paper, and this repository. The paper itself was quite an interesting read and analysed trends of topics in the European Parliament, but what caught my attention was the algorithm they used to perform this analysis – what they call Dynamic Non-negative Matrix Factorisation (NMF).

The reason for my interest was obvious: the authors claim it is a method to identify dynamic topics in a corpus – the exact same thing I am attempting to do.

The GitHub page describes their steps to run the algorithm and to clean the data [the dataset on the page is not the same as the one described in the paper – it instead consists of news transcripts split into Entertainment, Business, Politics, Technology and Football, spanning three months]. I decided to clean the data myself and store it in the Blei-LDA format – it becomes easier to use for my DTM code that way if I want to. Cleaning involved the usual – splitting up each document line by line and then word by word, removing stop words, and using the handy Gensim corpora class to store it to disk in the Blei format. The authors of the paper have described their own way of cleaning (by running the scripts on their repo), so you could do that as well.
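
Roughly, the cleaning step looked something like this – a simplified sketch, where `raw_docs` stands in for the list of transcript strings and isn’t a variable from the actual scripts:

```python
from gensim import corpora
from gensim.parsing.preprocessing import STOPWORDS

# raw_docs: one transcript string per document, assumed to be loaded already
texts = [[word for word in doc.lower().split() if word not in STOPWORDS]
         for doc in raw_docs]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# Save in the Blei LDA-C format so the DTM code can consume it directly
corpora.BleiCorpus.serialize('news_corpus.lda-c', bow_corpus, id2word=dictionary)
```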

Soon after getting my corpus ready, I sat down to figure out how exactly the Dynamic NMF works – only to realise it is quite, quite different from the Dynamic Topic Model. While the idea – identifying the evolution of topics over time – is the same, the approach is quite different. Before getting into it, if you aren’t sure what LDA or DTM is, have a look at this and the resources there to refresh your memory or understand it.

So in Blei’s DTM, the hyperparameters of the previous time-slice influence those of the next – that is to say, the words in a topic at an older time-slice directly affect the words of that topic in the next time-slice. In Dynamic NMF, the idea is instead to ‘discover’ what the authors call dynamic topics.
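
For reference, this chaining in Blei and Lafferty’s DTM is (roughly) a Gaussian random walk on each topic’s natural parameters, which are then mapped to word probabilities with a softmax:

```latex
\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}\!\left(\beta_{t-1,k},\, \sigma^2 I\right),
\qquad
p(w \mid \beta_{t,k}) = \frac{\exp(\beta_{t,k,w})}{\sum_{w'} \exp(\beta_{t,k,w'})}
```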

They do this in two steps – first, by separating the corpus into its time-slices, or windows. The NMF algorithm is then run on each of these windows, and we get topics for each time-slice. Then another matrix is constructed, containing the top terms of each window topic. In the second step, the NMF algorithm is applied again to this matrix, with the number of topics decided using the W2V coherence measure described in the paper. I’ve diluted the exact approach to make it more palatable – I would highly recommend reading the ‘Methods’ section of the paper yourself.
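
Here’s a heavily simplified scikit-learn sketch of the two-step idea. The real method keeps only the top terms per window topic and picks the second-level topic count via the coherence measure – both details are glossed over here – and `windows` (a list of document lists, one per time window) is an assumption of the sketch:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def topic_term_matrix(doc_term, n_topics):
    """Factorise a (document x term) matrix and return the topic-term factor."""
    model = NMF(n_components=n_topics, init='nndsvd', random_state=0)
    model.fit(doc_term)
    return model.components_

# windows: list of lists of raw document strings, one inner list per time window (assumed)
# Shared vocabulary across windows so the window topics can be stacked later
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit([doc for window in windows for doc in window])

# Step 1: topics for each time window
window_topics = [topic_term_matrix(vectorizer.transform(window), n_topics=10)
                 for window in windows]

# Step 2: stack the window topics into a new matrix and factorise it again;
# each second-level topic groups similar window topics into a 'dynamic topic'
dynamic_topics = topic_term_matrix(np.vstack(window_topics), n_topics=8)
```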

[Image: an excerpt from the Dynamic NMF paper]

What the above means is that there isn’t an identification of topics from the very start – rather, after finding a certain number of topics in each time-slice, the ones which are semantically similar are grouped together as ‘dynamic topics’. This is quite different from the original DTM, where topics are fixed from the start and evolve in each time-slice. We can see a similar evolution in Dynamic NMF, but it’s a little ‘looser’ – the words in DTM topics tend not to change so rapidly from one time-slice to the next.

Both are effective methods but have different approaches and give slightly different results. I’ll try and do a blog post next week comparing the results of both on the same corpus – and hopefully with my DTM python wrapper!