Wayne Cheng

Solving Out Of Memory (OOM) Errors on Keras and Tensorflow Running on the GPU

Updated: Feb 4



Are you running into Out Of Memory errors when building and training your neural network models on the GPU?


The size of a model is limited by the available memory on the GPU. The following may occur when a model has exhausted the available memory:


  • Resource Exhausted Error: an error message that indicates Out Of Memory (OOM)

ResourceExhaustedError: OOM when allocating tensor with shape[...] ...

  • Failure to Get Convolution Algorithm: an error message that indicates the convolutional neural network model is too large.

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above ...

  • Jupyter Notebook hangs, and then displays an error message:

Kernel Restarting : The kernel appears to have died. It will restart automatically.

The following explains the methods that I find effective in resolving these Out Of Memory errors. I am creating my neural network models in Jupyter Notebook, and running Keras version 2.3.1 and Tensorflow version 2.0.0.


Step 1 : Enable Dynamic Memory Allocation


First, restart the kernel (Kernel -> Restart). The previous model remains in memory until the kernel is restarted, so rerunning the Notebook cells without restarting the kernel may lead to a false Out Of Memory error.


By default, Tensorflow statically allocates the memory in the GPU for the model. I find that I can build larger models when allowing Tensorflow to dynamically allocate the memory.


To enable dynamic memory allocation, I run the following commands at the start of my Jupyter Notebook:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
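On Tensorflow 2, the same effect can also be achieved without the compat.v1 session, using the tf.config API. This is a sketch assuming a single machine whose GPUs should all use memory growth; it must run before any GPU memory is allocated:

```python
import tensorflow as tf

# List the physical GPUs Tensorflow can see
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    # Ask Tensorflow to allocate GPU memory on demand instead of
    # grabbing (nearly) all of it up front
    tf.config.experimental.set_memory_growth(gpu, True)
```

Either approach works on Tensorflow 2.0.0; use one or the other, not both.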


Step 2 : Incrementally Reduce the Size of the Model


For each test, I reduce the size of one layer, working from the end of the model toward the start. For example, suppose that this model results in an Out of Memory error:

model.add(Dense(2**14, activation='relu', input_shape=(X_train.shape[1:])))
model.add(Dense(2**13, activation='relu'))
model.add(Dense(2**12, activation='relu'))

For test 1, I would try to run this model:

model.add(Dense(2**14, activation='relu', input_shape=(X_train.shape[1:])))
model.add(Dense(2**13, activation='relu'))
model.add(Dense(2**11, activation='relu'))

Test 2:

model.add(Dense(2**14, activation='relu', input_shape=(X_train.shape[1:])))
model.add(Dense(2**12, activation='relu'))
model.add(Dense(2**11, activation='relu'))

Test 3:

model.add(Dense(2**13, activation='relu', input_shape=(X_train.shape[1:])))
model.add(Dense(2**12, activation='relu'))
model.add(Dense(2**11, activation='relu'))

Test 4:

model.add(Dense(2**13, activation='relu', input_shape=(X_train.shape[1:])))
model.add(Dense(2**12, activation='relu'))
model.add(Dense(2**10, activation='relu'))

The reasoning behind this iterative process is that the layers at the start will have more of an impact on the results than the layers at the end. So it makes sense to reduce the nodes on the less important layers before reducing the nodes on the more important layers.

Note that I am keeping the ratio between layers roughly the same for each iteration.
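The tests above follow a simple pattern: halve one layer at a time, starting from the last layer and wrapping back around. A small helper (my own illustration, not from the original post) can generate that schedule instead of editing the sizes by hand:

```python
def shrink_schedule(sizes, n_tests):
    """Yield n_tests layer-size lists, halving one layer at a time,
    starting from the last (least important) layer and moving toward
    the first, so the ratio between layers stays roughly constant."""
    sizes = list(sizes)
    schedule = []
    i = len(sizes) - 1  # start shrinking at the last layer
    while len(schedule) < n_tests:
        sizes[i] //= 2
        schedule.append(list(sizes))
        # move one layer toward the front; wrap back to the end
        i = i - 1 if i > 0 else len(sizes) - 1
    return schedule

# Reproduces tests 1-4 for the original model [2**14, 2**13, 2**12]
for layer_sizes in shrink_schedule([2**14, 2**13, 2**12], 4):
    print(layer_sizes)
```

Each list in the schedule is the set of Dense layer sizes to try in the next test.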


Step 3 : Check the GPU Memory Usage


When I am no longer getting any error messages, I then check the GPU memory usage. I want to make sure that I am using most of the GPU memory, and not running a model that is smaller than the GPU can support. This can be done by typing the following in the terminal:

nvidia-smi

In the output, I can usually find the Python process that is running my Jupyter Notebook on the last line:


NVIDIA-SMI 435.2... ...
0  18679  C  .../anaconda3/bin/python  7495MiB

This tells me that I am using 7495MiB of GPU memory for my model, which is 94% of capacity. As long as the model uses at least 90% of the GPU memory, the model is optimally sized for the GPU.
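nvidia-smi can also report memory usage in a machine-readable form via --query-gpu=memory.used,memory.total --format=csv,noheader,nounits. A small parser (the helper name and the 7982 MiB total are my own illustration, not from the post) turns one output line into a utilization percentage:

```python
def memory_fraction(smi_line):
    """Parse one CSV line of
    `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`
    (e.g. "7495, 7982") into the fraction of GPU memory in use."""
    used, total = (int(field.strip()) for field in smi_line.split(','))
    return used / total

# Hypothetical 7982 MiB card with 7495 MiB in use
print(round(memory_fraction("7495, 7982") * 100))  # prints 94
```

This makes it easy to script the "at least 90% used" check instead of reading the nvidia-smi table by eye.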



Thank you for reading. I hope you find this guide helpful for solving Out Of Memory errors on Keras and Tensorflow when using the GPU.


Questions or comments? You can reach me at info@audoir.com



Wayne Cheng is an A.I., machine learning, and deep learning developer at Audoir, LLC. His research involves the use of artificial neural networks to create music. Prior to starting Audoir, LLC, he worked as an engineer in various Silicon Valley startups. He has an M.S.E.E. degree from UC Davis, and a Music Technology degree from Foothill College.

Copyright © 2020 Audoir, LLC

All rights reserved