Wayne Cheng

Strategies for Handling Large Vocabulary Datasets in an RNN or CNN



Do you have a dataset with a large vocabulary or number of features?


A large vocabulary dataset is typically handled with an embedding layer in an RNN. This blog explores strategies for reducing the vocabulary to improve accuracy, and for using the dataset in a CNN-based model.


Reduce the Vocabulary by Adding Categories


Reducing the vocabulary makes the patterns in the dataset easier for machine learning algorithms to detect. The vocabulary can be reduced by splitting the data into additional categories, and then feeding the data to the model with all of the categories concatenated.


For example, a music note has properties such as duration and pitch. By separating the duration and pitch into distinct categories, the patterns of duration and pitch can be detected separately.
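
As a rough sketch of this idea (in plain Python, with hypothetical pitch and duration values), splitting a note into pitch and duration categories shrinks the representation from the product of the two vocabularies to their sum:

```python
# A rough sketch of splitting note tokens into separate categories.
# The pitch and duration values below are hypothetical examples.

pitches = [f"pitch_{i}" for i in range(128)]            # e.g. 128 MIDI pitches
durations = ["whole", "half", "quarter", "eighth", "sixteenth"]

# One combined vocabulary: every (pitch, duration) pair is its own token.
combined_vocab_size = len(pitches) * len(durations)     # 128 * 5 = 640

# Separate categories, concatenated per time step: far fewer symbols.
separate_vocab_size = len(pitches) + len(durations)     # 128 + 5 = 133

def encode_note(pitch, duration):
    """One-hot encode pitch and duration separately, then concatenate."""
    pitch_vec = [1.0 if p == pitch else 0.0 for p in pitches]
    duration_vec = [1.0 if d == duration else 0.0 for d in durations]
    return pitch_vec + duration_vec                     # length 133 instead of 640

print(combined_vocab_size, separate_vocab_size)         # 640 133
```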


The tradeoff in separating data into more categories is that the explicit relationship between the categories is removed. This tradeoff can be mitigated by training the model to a high accuracy, so that the relationship between the categories is learned implicitly from their spatial or temporal proximity.


Use a Custom Encoding for CNN-Based Models


Typically, for an RNN model, the input is tokenized and sent to an embedding layer. The output is a vector the size of the vocabulary, representing the probability distribution of the prediction, and it is trained against one-hot targets. This architecture works for an RNN because only one output is generated per time step.
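
As a point of reference, here is a minimal sketch of that typical RNN setup, assuming Keras; the vocabulary size, embedding dimension, and sequence length are hypothetical hyperparameters:

```python
# A minimal sketch of the typical RNN setup described above, assuming Keras.
# vocab_size, embed_dim, and seq_len are hypothetical hyperparameters.
import tensorflow as tf

vocab_size, embed_dim, seq_len = 640, 64, 32

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,)),                   # token indices
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
    tf.keras.layers.LSTM(128),
    # One probability distribution over the vocabulary per prediction,
    # trained against one-hot targets.
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```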


However, this architecture is not scalable for a CNN-based model, because a CNN must produce an output for every time step at once, and a vocabulary-sized probability distribution at every time step quickly becomes too large.


Instead of using a one-hot vector for each time step, the data can be encoded with a custom encoding scheme. The most direct way to compress a one-hot vector is with a binary encoding, which reduces the output size per time step from the vocabulary size to roughly its base-2 logarithm. Note that the final activation function should be sigmoid instead of softmax, since each output bit is now predicted independently. The output will need to be rounded to recover the encoding that represents the prediction.
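
Here is a minimal sketch of the binary encoding and decoding, using only numpy; the vocabulary size and token index are hypothetical. In a CNN-based model, the output layer would then produce this many sigmoid values per time step instead of a vocabulary-sized softmax:

```python
# A minimal sketch of the binary encoding and decoding, using numpy only.
# vocab_size and the token index below are hypothetical examples.
import math
import numpy as np

vocab_size = 640
n_bits = math.ceil(math.log2(vocab_size))         # 10 bits instead of 640 outputs

def encode(token_id):
    """Token index -> fixed-width binary vector (the training target)."""
    return np.array([(token_id >> b) & 1 for b in range(n_bits)], dtype=np.float32)

def decode(sigmoid_output):
    """Round each independent sigmoid output to 0/1, then rebuild the index."""
    bits = np.round(sigmoid_output).astype(int)
    return int(sum(bit << b for b, bit in enumerate(bits)))

target = encode(371)
# Simulate a slightly noisy sigmoid output; rounding still recovers the token.
noisy_output = np.clip(target + np.random.uniform(-0.4, 0.4, n_bits), 0.0, 1.0)
print(decode(noisy_output))                        # 371
```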


The tradeoff in using this encoding scheme is that a high accuracy is required to obtain the correct prediction. Since the output is an encoding rather than a probability distribution, a single bit that rounds the wrong way maps to a completely different token, and an inaccurate output may result in a terrible prediction.
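
To make the risk concrete (with a hypothetical token index), a single bit that rounds the wrong way decodes to an entirely different token:

```python
# A small illustration of the risk above, with a hypothetical token index:
# one bit rounding the wrong way decodes to a completely different token.
token = 371                     # binary 0101110011
flipped = token ^ (1 << 8)      # bit 8 rounds the wrong way
print(token, flipped)           # 371 115
```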



Thank you for reading. I hope you find this guide helpful for dealing with large vocabulary datasets.


Questions or comments? You can reach me at info@audoir.com



Wayne Cheng is an A.I., machine learning, and deep learning developer at Audoir, LLC. His research involves the use of artificial neural networks to create music. Prior to starting Audoir, LLC, he worked as an engineer in various Silicon Valley startups. He has an M.S.E.E. degree from UC Davis, and a Music Technology degree from Foothill College.

Copyright © 2020 Audoir, LLC

All rights reserved