This is a short update to my earlier post from 2015 on modeling MEG signals using CRBMs. CRBMs were already outdated at the time I wrote that post, and one of the development areas was to try out recurrent neural networks (RNNs). Here I describe the results of my weekend project on applying RNNs to time series prediction of MEG data.
This is the first time I’m applying RNNs to time series prediction – so if something in the way I’m using them doesn’t make sense, drop me a note and I’ll educate myself more 🙂
Short History of RNNs and Why They Are Relevant
Regular feed-forward networks are great at tasks where there are no temporal dependencies between individual samples (e.g. “what number is in this image?”), but they struggle in cases where the sequence length is not fixed or the individual samples depend on each other.
RNNs were developed already in the 1980s to overcome these issues by allowing dependencies between time steps. However, they turned out to be very hard to train: the early RNN architectures suffered from vanishing/exploding gradients. Advances in network architectures have since overcome this problem, and today there are a few different solutions to it. One of the most widely used is the Long Short-Term Memory (LSTM), introduced already in 1997. Today LSTMs are used in multiple areas, including speech recognition, machine translation, time series prediction and text generation.
Since LSTMs are popular today, I chose to use them in my experimentation as well. In case you are interested in learning more about LSTMs, take a look at colah’s blog post.
Briefly about the Data-set
The data-set is the same as in my earlier post: I’m using the data from the ICANN 2011 MEG mind reading challenge. The data is split into 1s patches and each patch has 200 samples from 204 channels. In addition to the raw MEG signals, the data-set includes 5 differently filtered versions of the signals (from 2Hz up to 35Hz).
Raw MEG data is in general quite chaotic, so instead of using it “as is” I chose three frequency bands (10Hz, 20Hz and 35Hz) for my experimentation. This makes it easier to judge whether the network is modeling the data well. Since each frequency band gets its own set of 204 channels, the network gets data from 3*204=612 channels as its input. The following figure visualizes data from the different frequency bands to give an idea of what type of data is being modeled:

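To make the input structure concrete, here is a minimal sketch of how the 612-channel input could be assembled. The array names and the random placeholder data are my own assumptions for illustration, not taken from the actual code.

```python
import numpy as np

# Placeholder stand-ins for the band-filtered versions of one 1s patch:
# 200 samples x 204 channels per band (the real data comes from the data-set).
band_10hz = np.random.randn(200, 204)
band_20hz = np.random.randn(200, 204)
band_35hz = np.random.randn(200, 204)

# Stack the three bands side by side into the 612-channel input.
patch = np.concatenate([band_10hz, band_20hz, band_35hz], axis=1)  # shape (200, 612)
```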
Network for Modeling the Data-set
I wanted to be able to tune the network parameters easily to get a feeling for what works and what doesn’t, so I intentionally kept the network architecture simple: the network has an input layer holding a few previous time steps of the data, an LSTM layer capturing the dependencies between those time steps, a dense layer to improve the ability to cross-reference between channels, and an output layer for the prediction.
The next question is hyper-parameter selection. I tried to rationalize my choices instead of randomly guessing something and hoping it would work (a rough sketch of the resulting model follows after the list):
- The number of input neurons depends on the number of channels in the data and on the number of history samples used for prediction. The first is fixed by the data (612), but the second can be tuned. Given that the data has 10Hz components, I chose a history of 0.05s (10 samples), which is enough to hold half of a 10Hz oscillation. CRBMs were also observed to perform adequately with this history length.
- The LSTM layer needs to model oscillatory behavior for 612 channels, so I took that as the minimum number of nodes in the layer. Because we also want to model interactions between the input channels, I picked 3*612=1836 neurons here. Other LSTM hyper-parameters were left at the Keras defaults (which match the values from the original paper).
- The dense layer is also there to improve the ability to pick up interactions between channels, so I chose 2*612=1224 neurons here.
- The output layer size is likewise fixed by the data to 612 neurons (one per channel).
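Putting these choices together, a minimal Keras sketch of the architecture could look roughly like this. The layer sizes follow the list above, but the windowing code, activation, loss and optimizer are my own assumptions rather than the exact setup in the repository.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

n_channels = 612   # 3 frequency bands * 204 MEG channels
n_history = 10     # 0.05s of history at the 200Hz sampling rate

# Turn one (200, 612) patch into overlapping (history, channels) windows and
# next-sample targets; `patch` is the concatenated array sketched earlier.
patch = np.random.randn(200, n_channels)  # placeholder for real data
X = np.stack([patch[i:i + n_history] for i in range(len(patch) - n_history)])
y = patch[n_history:]

model = Sequential()
model.add(LSTM(3 * n_channels, input_shape=(n_history, n_channels)))
model.add(Dense(2 * n_channels, activation='relu'))  # activation is my assumption
model.add(Dense(n_channels))                         # one linear output per channel
model.compile(loss='mse', optimizer='adam')          # loss/optimizer are assumptions
model.fit(X, y, epochs=50)
```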
Results
After training the network for 50 iterations, I tested how well it predicts activity for samples that weren’t used in training. Here are the prediction results from four different starting points (blue curves denote the prediction, red curves the ground truth) and from three different frequency bands (from top to bottom: 10Hz, 25Hz and 35Hz):
The predicted values start deviating from the ground truth almost immediately; however, the predicted data shares similar dynamics with the original: like the original data, the produced signals look a little like sine waves with amplitude scaling, doing something a bit more surprising every now and then.
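For reference, one way to produce such multi-step curves is to roll the model forward from a seed window, feeding each prediction back in as the newest history sample. The sketch below illustrates that idea; it is my own reconstruction, not necessarily the exact procedure used in the repository.

```python
import numpy as np

def rollout(model, seed_window, n_steps):
    """Predict n_steps samples by feeding predictions back as input.

    seed_window: array of shape (n_history, n_channels) taken from real data.
    Returns an array of shape (n_steps, n_channels).
    """
    window = seed_window.copy()
    preds = []
    for _ in range(n_steps):
        next_sample = model.predict(window[np.newaxis])[0]  # predict the next time step
        preds.append(next_sample)
        window = np.vstack([window[1:], next_sample])       # drop oldest, append newest
    return np.array(preds)
```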
Discussion
The model seems to pick up some dynamics from the data, but it fails to model even one second of data from the validation set. There is more than one reason for these deviations:
- A larger network might model some of the behavior better.
- Training isn’t optimized for sequences. While the network has an LSTM layer, each x-y pair is treated as an independent sample and the state of the LSTM layer gets lost. This could be improved with relatively simple changes to the training part (see the sketch after this list).
- The brain isn’t an isolated system. Everything the test subject sensed while the data was being gathered affects the signals.
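Regarding the second point, one way to keep the LSTM state between consecutive samples in Keras is a stateful LSTM trained on unshuffled, time-ordered windows. The sketch below only illustrates the idea under my own assumptions (batch size of one, state reset at patch boundaries, placeholder data); it is not code from the repository.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Same architecture as before, but the LSTM keeps its internal state
# across consecutive training calls until it is explicitly reset.
model = Sequential()
model.add(LSTM(1836, stateful=True, batch_input_shape=(1, 10, 612)))
model.add(Dense(1224, activation='relu'))
model.add(Dense(612))
model.compile(loss='mse', optimizer='adam')

# `patches` is assumed to be a list of (X, y) window/target arrays, one per
# 1s patch, with the windows kept in time order (no shuffling).
patches = [(np.random.randn(190, 10, 612), np.random.randn(190, 612))]  # placeholder
for epoch in range(50):
    for X, y in patches:
        for x_t, y_t in zip(X, y):
            model.train_on_batch(x_t[np.newaxis], y_t[np.newaxis])  # state carried within a patch
        model.reset_states()  # forget state at the patch boundary
```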
Further, my analysis here is based on visual inspection of the time series. It would be interesting to analyze the channel-to-channel connectivity of the ground truth and of the predictions (using e.g. the Phase Lag Index) and see if they share any characteristics. That would give an unbiased estimate of how well the network models connectivity between channels. This might be a good topic for my next post…
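For reference, the Phase Lag Index between two signals can be computed from the sign of their instantaneous phase difference obtained via the Hilbert transform. A minimal sketch (not used in this post’s analysis):

```python
import numpy as np
from scipy.signal import hilbert

def phase_lag_index(x, y):
    """Phase Lag Index between two 1-D signals of equal length."""
    phase_diff = np.angle(hilbert(x)) - np.angle(hilbert(y))
    return np.abs(np.mean(np.sign(np.sin(phase_diff))))
```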
Given that I didn’t use much time for the project, I’m still happy with the results. My personal goal was to get familiar with newer machine learning frameworks. 🙂
Code
The code is available on GitHub. There shouldn’t be any significant quirks involved; just be aware that the data-set is large, and after pre-processing the data you can easily run out of memory if you have an older computer or if you experiment with e.g. increasing the number of history samples.
Also, I tried out a few new tools this time:
- Keras. I’ve earlier used Theano and the Theano+Lasagne combination, but the Tensorflow+Keras combination turned out to be fast and easy to use.
- Jupyter Notebook. While I’m generally a fan of vim when writing C/C++/Python, this tool made experimentation simpler: I could write and modify code without losing the state of the Python session, while everything is still safely saved for later use.