This week I reached the set of lectures in my online course on using PyTorch to build an artificial neural network (ANN) for tabular data, meaning data analogous to relational or structured data: anything that could be stored in an SQL database.
There are two kinds of values in this sort of data: categorical and continuous. Understandably, neural networks generally handle continuous data more naturally than categorical data. Even so, some continuous data doesn’t work well with an ANN in its raw form, which is where feature engineering comes in.
The example given was a pair of longitude and latitude coordinates marking a trip from point A to point B. Clearly, a better way to present this to a neural network is as an actual distance, so the distance was pre-calculated for each pair and appended to the input data. Date-time stamps were also given, and these too had to be reorganised into more useful measurements before the ANN could process them.
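To make the two feature-engineering steps concrete, here is a minimal sketch. It assumes the haversine formula for the distance (the course may well have used something else), and the timestamp format and the choice of derived features (hour, am/pm, weekday, weekend flag) are hypothetical picks for illustration.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def datetime_features(stamp):
    """Split a 'YYYY-MM-DD HH:MM:SS' string (assumed format) into simple features."""
    dt = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": dt.hour,
        "am_or_pm": "am" if dt.hour < 12 else "pm",
        "weekday": dt.weekday(),          # Monday = 0 ... Sunday = 6
        "is_weekend": dt.weekday() >= 5,  # Saturday or Sunday
    }
```

The distance becomes one extra continuous column, while the hour, am/pm, and weekday values are natural candidates for categorical columns.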
Unfortunately, the tutorial I am following did very little to explain the chosen architecture of the model. I had to go through the code with a fine-tooth comb to understand that the categorical data was being embedded, a term I had never heard of up to this point. Embedding categorical data means taking a categorical feature x with n possible values and creating a tensor of weights in which each of those values is represented by its own 1-dimensional tensor. Each of these has length min(50, (n+1)//2), that is, the smaller of 50 and half of n (integer division on n+1, so an odd n rounds up). These weights are part of the ANN and are learnt by the same backpropagation as the standard fully connected weights and biases of the network. Theoretically, the embedding not only identifies which category is being represented but can also capture similarities between categories. For example, if a weekday input leads to a different result depending on whether or not it falls on the weekend, that would hopefully show up as a visible difference between the embedded tensors for weekend days and those for the rest of the week.
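As a rough sketch of the embedding idea (not the tutorial’s actual model), the snippet below applies the min(50, (n+1)//2) rule to three hypothetical categorical columns and looks each category code up with PyTorch’s `nn.Embedding`. The class name `TabularEmbedder`, the column cardinalities, and the sample values are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical cardinalities: 24 hours, 2 am/pm values, 7 weekdays.
cat_sizes = [24, 2, 7]
# (number of categories, embedding length) for each categorical column.
emb_sizes = [(n, min(50, (n + 1) // 2)) for n in cat_sizes]


class TabularEmbedder(nn.Module):
    """Embeds each categorical column and joins the result with continuous data."""

    def __init__(self, emb_sizes, n_cont):
        super().__init__()
        # One learnable lookup table per categorical column.
        self.embeds = nn.ModuleList([nn.Embedding(n, d) for n, d in emb_sizes])
        self.out_features = sum(d for _, d in emb_sizes) + n_cont

    def forward(self, x_cat, x_cont):
        # x_cat: (batch, n_categorical) integer codes; x_cont: (batch, n_cont) floats.
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)]
        return torch.cat(embedded + [x_cont], dim=1)


embedder = TabularEmbedder(emb_sizes, n_cont=1)
x_cat = torch.tensor([[8, 0, 0], [15, 1, 5]])  # two rows of category codes
x_cont = torch.tensor([[2.1], [7.4]])          # e.g. a pre-computed distance
features = embedder(x_cat, x_cont)             # input for the fully connected layers
```

Because the lookup tables are ordinary parameters of the module, an optimiser updates them together with any linear layers stacked on top.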
After learning this, I decided to test myself on new data from Kaggle. Naturally, I chose the beginner’s introductory dataset describing which passengers did or did not survive the Titanic. I put what I had been taught to use, and after a fair bit of trouble working out how to handle missing passenger ages, and later a single missing fare in the test data, I began trying different hyperparameters and comparing the resulting accuracy on the test data. I reached a high of 0.746 the night I finished the program and submitted the predictions. I’m honestly quite happy with this, as I remember attempting to learn TensorFlow some years ago and barely managing an accuracy of 0.68. I’ll likely experiment with the hyperparameters more to get a better score, but that is all I have this week, and I really wanted to post my progress.
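For anyone stuck on the same missing-value problem, one common option is to fill the gaps with the median of the known values. The sketch below assumes that approach purely for illustration; the function name and the sample ages are made up and are not taken from the Titanic data.

```python
from statistics import median


def fill_missing_with_median(values):
    """Replace None entries with the median of the known entries."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


# Made-up ages with two gaps.
ages = [22.0, None, 38.0, 26.0, None, 35.0]
filled_ages = fill_missing_with_median(ages)
```

For a gap in the test data, such as the single missing fare, the fill value would usually be computed from the training data so that both sets are treated the same way.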