How to reduce a DataFrame's memory footprint by selecting the appropriate data types for columns?

When working with relatively small datasets (< 100 MB), performance is rarely a problem in pandas. As the data grows, run times get much longer and code can fail outright because it runs out of memory.

In [1]:
import numpy as np
import pandas as pd
import math
from IPython.display import Image
In [2]:
# Our example dataset is from a Kaggle competition: https://www.kaggle.com/c/elo-merchant-category-recommendation
# It contains up to 3 months' worth of transactions for every card at any of the provided merchants
path = "../../../kaggle/Elo_Merchant_Category_Recommendation/historical_transactions.csv"
In [3]:
data = pd.read_csv(path)
In [4]:
# 14 columns and over 29 million rows
data.shape
Out[4]:
(29112361, 14)
In [5]:
data.head(3)
Out[5]:
authorized_flag card_id city_id category_1 installments category_3 merchant_category_id merchant_id month_lag purchase_amount purchase_date category_2 state_id subsector_id
0 Y C_ID_4e6213e9bc 88 N 0 A 80 M_ID_e020e9b302 -8 -0.703331 2017-06-25 15:33:07 1.0 16 37
1 Y C_ID_4e6213e9bc 88 N 0 A 367 M_ID_86ec983688 -7 -0.733128 2017-07-15 12:10:45 1.0 16 16
2 Y C_ID_4e6213e9bc 88 N 0 A 80 M_ID_979ed661fc -6 -0.720386 2017-08-09 22:04:29 1.0 16 37

We can use the DataFrame.info() method to get some high-level information about our DataFrame, including its size, data types and memory usage. By default, pandas approximates the memory usage of object columns (counting only the pointers, not the strings they point to) to save time. To get accurate memory usage we must set memory_usage='deep'.

In [6]:
# Pandas detects the types automatically; usually we do not have to worry about it
print(data.info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29112361 entries, 0 to 29112360
Data columns (total 14 columns):
authorized_flag         object
card_id                 object
city_id                 int64
category_1              object
installments            int64
category_3              object
merchant_category_id    int64
merchant_id             object
month_lag               int64
purchase_amount         float64
purchase_date           object
category_2              float64
state_id                int64
subsector_id            int64
dtypes: float64(2), int64(6), object(6)
memory usage: 12.7 GB
None

We have object, int64 and float64 columns. The object type is used for strings or for columns with mixed data types. Total memory usage is 12.7 GB.

In [7]:
data.memory_usage(deep=True)
Out[7]:
Index                           80
authorized_flag         1804966382
card_id                 2096089992
city_id                  232898888
category_1              1804966382
installments             232898888
category_3              1799621612
merchant_category_id     232898888
merchant_id             2090550752
month_lag                232898888
purchase_amount          232898888
purchase_date           2212539436
category_2               232898888
state_id                 232898888
subsector_id             232898888
dtype: int64

Each data type is stored separately in pandas. Let's examine the memory usage by data type.

In [8]:
for dtype in ['float','int64','object']:
    selected_dtype = data.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype,mean_usage_mb))
Average memory usage for float columns: 148.07 MB
Average memory usage for int64 columns: 190.38 MB
Average memory usage for object columns: 1608.81 MB
We can see that in our case the object columns use roughly 10x more memory, on average, than the int or float columns.

Why do object columns take more memory?

Numeric data in a pandas DataFrame is stored in NumPy data types. Each element in an object column, however, is really a pointer holding the "address" of the actual value's location in memory. Each pointer takes up 8 bytes of memory, and each actual string value uses the same amount of memory that string would use if stored individually in Python. (In the figure below, strides are NumPy's indexing scheme: the number of bytes to jump to reach the next element.)

In [19]:
Image(filename='object.png')
Out[19]:
Reference: https://www.dataquest.io/blog/pandas-big-data/
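As a rough toy illustration (made-up values, not from our dataset), we can compare the same numbers stored as an int64 column and as Python string objects:

# Toy comparison: fixed-width numeric storage vs. Python str objects
s_int = pd.Series(range(1000))                      # int64: 8 bytes per value
s_obj = s_int.astype(str)                           # object: 8-byte pointer + a full Python str per value
print(s_int.memory_usage(deep=True, index=False))   # 8000 bytes
print(s_obj.memory_usage(deep=True, index=False))   # tens of kilobytes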

We can use sys.getsizeof() to get the size of individual Python strings, and then of the items in a pandas Series.

In [9]:
from sys import getsizeof #Return the size of an object in bytes
s1 = 'memory'
print(getsizeof(s1))

s2 = 'individual long long long strings'
print(getsizeof(s2))
55
82
In [10]:
# String in series is exactly the same size + pointer
series = pd.Series(['memory','individual long long long strings'])
series.apply(getsizeof)
Out[10]:
0    55
1    82
dtype: int64

We can see that the size of strings stored in a pandas Series is identical to their size as separate strings in Python.

Is it possible to reduce the memory usage?

First, let's look at how to improve the memory usage of numeric columns.

Pandas represents numeric values as NumPy ndarrays and stores them in a contiguous block of memory under the hood. This storage model consumes less space and allows fast access to the values themselves.

Many types in pandas have multiple subtypes that can use fewer bytes to represent each value. For example, the float type has the float16, float32 and float64 subtypes, so a float16 number takes 2 bytes of memory instead of the 8 bytes a float64 needs.

The int type has int64 (uint64), int32 (uint32), int16 (uint16) and int8 (uint8) subtypes: int for signed and uint for unsigned integers. Both can represent the same number of distinct values, but because uint stores only non-negative numbers, unsigned integers let us be more efficient with columns that contain only positive values.
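As a quick sketch of what each subtype costs per value, we can check the item sizes with numpy.dtype:

# Bytes per value for the common numeric subtypes
for t in ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']:
    print(t, np.dtype(t).itemsize, 'bytes')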

We can use the numpy.iinfo() class to verify the minimum and maximum values for each integer subtype.

In [11]:
int_types = ["uint8", "int8", "int16" , "int64"]
for it in int_types:
    print(np.iinfo(it))
Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------

An int8 value uses 1 byte (8 bits) and can represent 256 distinct values, from -128 to 127 (including 0). An int64 value can store anything up to 9,223,372,036,854,775,807. Do we really need to allocate 8 bytes per value for numbers that large?
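A minimal sketch with toy data shows what is at stake: the same binary flags take one eighth of the space once stored as int8.

# Toy example: binary flags stored as the default int64 vs. downcast to int8
s = pd.Series([1, 0, 1, 0] * 1000)
print(s.dtype, s.memory_usage(index=False), 'bytes')                 # int64 -> 32000 bytes
print('int8', s.astype('int8').memory_usage(index=False), 'bytes')   # int8  -> 4000 bytes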

Optimizing Numeric Columns with Subtypes

In [12]:
initial_memory_usage = data.memory_usage(deep=True).sum()

def print_memory_usage_dif(data):
    # Print the memory saved so far as a percentage of the initial usage
    saved = initial_memory_usage - data.memory_usage(deep=True).sum()
    print(round(saved / initial_memory_usage * 100, 1), '%')

So, first let's convert the binary object columns into numeric values (int64) using map.

In [13]:
display(data.authorized_flag.unique())
data.authorized_flag = data.authorized_flag.map(dict(Y=1, N=0))

display(data.category_1.unique())
data.category_1 = data.category_1.map(dict(Y=1, N=0))
array(['Y', 'N'], dtype=object)
array(['N', 'Y'], dtype=object)
In [14]:
print(data.info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29112361 entries, 0 to 29112360
Data columns (total 14 columns):
authorized_flag         int64
card_id                 object
city_id                 int64
category_1              int64
installments            int64
category_3              object
merchant_category_id    int64
merchant_id             object
month_lag               int64
purchase_amount         float64
purchase_date           object
category_2              float64
state_id                int64
subsector_id            int64
dtypes: float64(2), int64(8), object(4)
memory usage: 9.8 GB
None
In [15]:
print_memory_usage_dif(data)
23.0 %

We saved quite a lot of memory just by converting two object columns into integer columns. We are down to 9.8 GB from the initial 12.7 GB.

Integer

Let's try to shrink the int64 columns by downcasting. All of these columns have values between -1 and 999, so int16 or lower is enough.

In [16]:
int_cols = data.select_dtypes(include=['int64'])

for col in int_cols.columns:
    print(col, 'min:',data[col].min(),'; max:',data[col].max())
    data[col] = pd.to_numeric(data[col], downcast ='integer')

print(data.info(memory_usage='deep'))
authorized_flag min: 0 ; max: 1
city_id min: -1 ; max: 347
category_1 min: 0 ; max: 1
installments min: -1 ; max: 999
merchant_category_id min: -1 ; max: 891
month_lag min: -13 ; max: 0
state_id min: -1 ; max: 24
subsector_id min: -1 ; max: 41
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29112361 entries, 0 to 29112360
Data columns (total 14 columns):
authorized_flag         int8
card_id                 object
city_id                 int16
category_1              int8
installments            int16
category_3              object
merchant_category_id    int16
merchant_id             object
month_lag               int8
purchase_amount         float64
purchase_date           object
category_2              float64
state_id                int8
subsector_id            int8
dtypes: float64(2), int16(3), int8(5), object(4)
memory usage: 8.4 GB
None

We are down to 8.4 GB from the initial 12.7 GB.

In [17]:
print_memory_usage_dif(data)
34.3 %

Float

The float64 type represents each floating-point value using 64 bits (8 bytes). We can halve that by downcasting from float64 to float32, at the cost of some precision.

Again we use pd.to_numeric() to downcast, this time with DataFrame.select_dtypes to select only the float columns, and then compare the memory usage.
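Downcasting floats does cost precision (float32 keeps roughly 7 significant digits). A quick sanity check, sketched with a few purchase_amount values from the sample rows above, shows how small the rounding error is here:

# How much error does float32 introduce for values of this magnitude?
x = pd.Series([-0.703331, -0.733128, -0.720386])
print((x.astype('float32').astype('float64') - x).abs().max())   # on the order of 1e-08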

In [18]:
float_cols = data.select_dtypes(include=['float'])

for col in float_cols.columns:
    data[col] = pd.to_numeric(data[col], downcast ='float')

print(data.info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29112361 entries, 0 to 29112360
Data columns (total 14 columns):
authorized_flag         int8
card_id                 object
city_id                 int16
category_1              int8
installments            int16
category_3              object
merchant_category_id    int16
merchant_id             object
month_lag               int8
purchase_amount         float32
purchase_date           object
category_2              float32
state_id                int8
subsector_id            int8
dtypes: float32(2), int16(3), int8(5), object(4)
memory usage: 8.2 GB
None

We are down to 8.2 GB from the initial 12.7 GB.

In [19]:
print_memory_usage_dif(data)
36.0 %

Can we do better?

Dummies

We could use dummy variables (one-hot encoding) to reduce the memory footprint. pd.get_dummies() should only be used on columns with very few unique values.

In [20]:
for col in data.select_dtypes(include=['object']):
    print(col, len(data[col].unique()))
card_id 325540
category_3 4
merchant_id 326312
purchase_date 16395300

category_3 has only 4 unique values (A, B, C and NaN). With get_dummies() we get three uint8 columns instead of one object column and save a lot of memory. We get only three columns because a NaN row simply has 0 in all three of them.
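A toy illustration (hypothetical rows) of what get_dummies() produces, including the all-zero NaN row:

toy = pd.DataFrame({'category_3': ['A', 'B', None, 'C']})
# The None row gets 0 in category_3_A, category_3_B and category_3_C
print(pd.get_dummies(toy, columns=['category_3']))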

In [21]:
data = pd.get_dummies(data, columns = ['category_3'])
In [22]:
print(data.info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29112361 entries, 0 to 29112360
Data columns (total 16 columns):
authorized_flag         int8
card_id                 object
city_id                 int16
category_1              int8
installments            int16
merchant_category_id    int16
merchant_id             object
month_lag               int8
purchase_amount         float32
purchase_date           object
category_2              float32
state_id                int8
subsector_id            int8
category_3_A            uint8
category_3_B            uint8
category_3_C            uint8
dtypes: float32(2), int16(3), int8(5), object(3), uint8(3)
memory usage: 6.6 GB
None

We are now down to 6.6 GB from the initial 12.7 GB and have reduced the memory usage by almost 50%.

In [23]:
print_memory_usage_dif(data)
48.5 %

Category type

As we saw earlier, the object type takes up the most memory. Pandas has a category type which is much more memory efficient: it stores each unique raw value only once and keeps a per-row integer code that maps to it. This arrangement pays off whenever a column contains a limited set of values.

In [26]:
Image(filename='categorical.png', width='500px')
Out[26]:
Reference: https://www.dataquest.io/blog/pandas-big-data/
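A minimal sketch with made-up IDs shows what the category dtype actually stores: each unique value once, plus a small integer code per row.

s = pd.Series(['M_ID_a', 'M_ID_b', 'M_ID_a', 'M_ID_c'], dtype='category')
print(s.cat.categories)   # the unique raw values, stored once
print(s.cat.codes)        # per-row integer codes (int8 here) referencing those categories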

We should use the category type primarily for object columns where less than 50% of the values are unique. If all of the values in a column are unique (e.g. index), the category type will end up using more memory. That's because the column is storing all of the raw string values in addition to the integer category codes.

We could write a loop to iterate over each object column, check if the number of unique values is less than 50%, and if so, convert it to the category type.

In [24]:
for col in data.select_dtypes(include=['object']):
    num_unique_values = len(data[col].unique())
    num_total_values = len(data[col])
    prop = num_unique_values / num_total_values
    if prop < 0.5:
        print(col, num_unique_values, str(round(prop*100))+'%')
card_id 325540 1%
merchant_id 326312 1%

In both cases, unique values make up only about 1% of the rows, since we have over 29 million of them. Let's convert these columns to the category type.

In [25]:
for cols in ['card_id','merchant_id']:
    data[cols] = data[cols].astype('category')

print(data.info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29112361 entries, 0 to 29112360
Data columns (total 16 columns):
authorized_flag         int8
card_id                 category
city_id                 int16
category_1              int8
installments            int16
merchant_category_id    int16
merchant_id             category
month_lag               int8
purchase_amount         float32
purchase_date           object
category_2              float32
state_id                int8
subsector_id            int8
category_3_A            uint8
category_3_B            uint8
category_3_C            uint8
dtypes: category(2), float32(2), int16(3), int8(5), object(1), uint8(3)
memory usage: 2.9 GB
None
In [26]:
print_memory_usage_dif(data)
76.9 %

We are now down to 2.9 GB from the initial 12.7 GB.

While converting all of the columns to this type sounds appealing, it's important to be aware of the trade-offs. The biggest one is the inability to perform numerical computations. We can't do arithmetic with category columns or use methods like Series.min() and Series.max() without converting to a true numeric dtype first.
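A small sketch with toy numeric values shows the limitation and the workaround:

s = pd.Series([1, 2, 2, 3], dtype='category')
try:
    print(s + 1)              # arithmetic on a category column raises
except TypeError as err:
    print('TypeError:', err)
print(s.astype('int8') + 1)   # convert to a numeric dtype first, then compute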

In [28]:
# We still can use category columns to group by
df = data.groupby(['card_id'])['purchase_amount'].agg('mean')
df.head()
Out[28]:
card_id
C_ID_00007093c1   -0.515739
C_ID_0001238066   -0.589002
C_ID_0001506ef0   -0.524271
C_ID_0001793786   -0.170306
C_ID_000183fdda   -0.478041
Name: purchase_amount, dtype: float32

We have one more optimization left: the purchase_date column.

In [29]:
for col in data.select_dtypes(include=['object']):
    print(col)
data.loc[0, 'purchase_date']
purchase_date
Out[29]:
'2017-06-25 15:33:07'

As the output above shows, we still have a column containing dates, purchase_date. It is stored as an object type, which takes a lot of memory. We can convert it to the datetime64[ns] type (8 bytes per value), which reduces the memory footprint and also lets us do time-series analysis (calculating time differences in days, etc.).
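For example (a sketch using the two purchase_date values from the sample rows shown earlier), the datetime64 type makes differences in days and date components trivial to compute:

dates = pd.to_datetime(pd.Series(['2017-06-25 15:33:07', '2017-08-09 22:04:29']))
print((dates.max() - dates.min()).days)   # 45 days between the two purchases
print(dates.dt.dayofweek)                 # date components via the .dt accessor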

In [30]:
print(data.info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29112361 entries, 0 to 29112360
Data columns (total 16 columns):
authorized_flag         int8
card_id                 category
city_id                 int16
category_1              int8
installments            int16
merchant_category_id    int16
merchant_id             category
month_lag               int8
purchase_amount         float32
purchase_date           object
category_2              float32
state_id                int8
subsector_id            int8
category_3_A            uint8
category_3_B            uint8
category_3_C            uint8
dtypes: category(2), float32(2), int16(3), int8(5), object(1), uint8(3)
memory usage: 2.9 GB
None
In [31]:
# Convert an object type to a datetime type
data.purchase_date = pd.to_datetime(data.purchase_date,format = "%Y-%m-%d %H:%M:%S")
In [32]:
print(data.info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29112361 entries, 0 to 29112360
Data columns (total 16 columns):
authorized_flag         int8
card_id                 category
city_id                 int16
category_1              int8
installments            int16
merchant_category_id    int16
merchant_id             category
month_lag               int8
purchase_amount         float32
purchase_date           datetime64[ns]
category_2              float32
state_id                int8
subsector_id            int8
category_3_A            uint8
category_3_B            uint8
category_3_C            uint8
dtypes: category(2), datetime64[ns](1), float32(2), int16(3), int8(5), uint8(3)
memory usage: 1.1 GB
None

We are down to 1.1 GB from the initial 12.7 GB.

In [33]:
print_memory_usage_dif(data)
91.4 %
In [34]:
data.purchase_date.dtype
Out[34]:
dtype('<M8[ns]')

Selecting Types While Reading the Data

We often don't have enough memory to hold all the values in a data set at once. How can we apply these memory-saving techniques if we can't even create the DataFrame in the first place?

Fortunately, we can specify the optimal column types while reading the data. The pandas.read_csv() function has a few parameters for this: the dtype parameter accepts a dictionary with (string) column names as keys and type names (or NumPy type objects) as values, and parse_dates lets us read "purchase_date" directly as datetime64[ns].

In [10]:
data = pd.read_csv(path,dtype={'card_id': 'category', 
                               'city_id': 'int16',
                               'installments':'int16',
                               'merchant_category_id':'int16',
                               'merchant_id':'category',
                               'month_lag':'int8',
                               'purchase_amount':'float32',
                               'category_2':'float32',
                               'state_id':'int8',
                               'subsector_id':'int8'
                              },parse_dates=["purchase_date"])
In [11]:
data.authorized_flag = data.authorized_flag.map(dict(Y=1, N=0))
data.category_1 = data.category_1.map(dict(Y=1, N=0))
In [12]:
for col in data.select_dtypes(include=['int64']):
    data[col] = pd.to_numeric(data[col], downcast ='integer')
In [13]:
data = pd.get_dummies(data, columns = ['category_3'])
In [14]:
print(data.info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29112361 entries, 0 to 29112360
Data columns (total 16 columns):
authorized_flag         int8
card_id                 category
city_id                 int16
category_1              int8
installments            int16
merchant_category_id    int16
merchant_id             category
month_lag               int8
purchase_amount         float32
purchase_date           datetime64[ns]
category_2              float32
state_id                int8
subsector_id            int8
category_3_A            uint8
category_3_B            uint8
category_3_C            uint8
dtypes: category(2), datetime64[ns](1), float32(2), int16(3), int8(5), uint8(3)
memory usage: 1.1 GB
None

We achieved the same result: 1.1 GB of memory usage.

An error is raised if we try to cast an object column such as card_id to int8.

In [15]:
data = pd.read_csv(path,dtype={'card_id': 'int8', 
                               'city_id': 'int16',
                               'installments':'int16',
                               'merchant_category_id':'int16',
                               'merchant_id':'category',
                               'month_lag':'int8',
                               'purchase_amount':'float32',
                               'category_2':'float32',
                               'state_id':'int8',
                               'subsector_id':'int8'
                              })
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

TypeError: Cannot cast array from dtype('O') to dtype('int8') according to the rule 'safe'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-15-6dbbb57a917d> in <module>()
      8                                'category_2':'float32',
      9                                'state_id':'int8',
---> 10                                'subsector_id':'int8'
     11                               })

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
    676                     skip_blank_lines=skip_blank_lines)
    677 
--> 678         return _read(filepath_or_buffer, kwds)
    679 
    680     parser_f.__name__ = name

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    444 
    445     try:
--> 446         data = parser.read(nrows)
    447     finally:
    448         parser.close()

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1034                 raise ValueError('skipfooter not supported for iteration')
   1035 
-> 1036         ret = self._engine.read(nrows)
   1037 
   1038         # May alter columns / col_dict

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1846     def read(self, nrows=None):
   1847         try:
-> 1848             data = self._reader.read(nrows)
   1849         except StopIteration:
   1850             if self._first_chunk:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

ValueError: invalid literal for int() with base 10: 'C_ID_4e6213e9bc'

The column merchant_category_id has values between -1 and 891. What happens if we cast it to a uint8 type?

In [16]:
data = pd.read_csv(path,dtype={'card_id': 'category', 
                               'city_id': 'int16',
                               'installments':'int16',
                               'merchant_category_id':'uint8',
                               'merchant_id':'category',
                               'month_lag':'int8',
                               'purchase_amount':'float32',
                               'category_2':'float32',
                               'state_id':'int8',
                               'subsector_id':'int8'
                              },parse_dates=["purchase_date"])
In [17]:
print(data.info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29112361 entries, 0 to 29112360
Data columns (total 14 columns):
authorized_flag         object
card_id                 category
city_id                 int16
category_1              object
installments            int16
category_3              object
merchant_category_id    uint8
merchant_id             category
month_lag               int8
purchase_amount         float32
purchase_date           datetime64[ns]
category_2              float32
state_id                int8
subsector_id            int8
dtypes: category(2), datetime64[ns](1), float32(2), int16(2), int8(3), object(3), uint8(1)
memory usage: 6.0 GB
None
In [18]:
data.merchant_category_id.min()
Out[18]:
1
In [19]:
data.merchant_category_id.max()
Out[19]:
255

What happened? uint8 can represent values from 0 to 255. Values outside that range silently wrapped around (they are stored modulo 256), so -1 became 255. Neither read_csv() nor astype() checks the value range for us.

In [20]:
df = pd.DataFrame([100,256,257,300,-1])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
0    5 non-null int64
dtypes: int64(1)
memory usage: 120.0 bytes
In [21]:
df.astype('uint8')
Out[21]:
0
0 100
1 0
2 1
3 44
4 255
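In contrast, pd.to_numeric() with downcast inspects the actual minimum and maximum and only downcasts when every value fits, so nothing wraps around. A quick sketch with the same toy values:

print(pd.to_numeric(pd.Series([100, 256, 257, 300, -1]), downcast='integer').dtype)   # int16: -1 and 300 both fit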

inplace = True?

Let's check the memory usage of the category_3 column.

In [22]:
print(data['category_3'].memory_usage(deep=True) / 1024 ** 2,'MB')
1716.252986907959 MB

Let's look at the total memory used by the Python process.

In [23]:
import os
import psutil
process = psutil.Process(os.getpid())
print(process.memory_info().rss / 1024 ** 2,'MB')
5811.35546875 MB

Let's count the NaN values in the category_3 column.

In [24]:
data['category_3'].isnull().sum(axis = 0)
Out[24]:
178159

In many operations and methods (e.g., df.dropna()) the default value of inplace is False, and we are used to doing things the following way.

In [25]:
# We have to assign back to the DataFrame because dropna() returns a new copy
data['category_3'] = data['category_3'].dropna(axis=0)
In [26]:
print(data['category_3'].memory_usage(deep=True) / 1024 ** 2,'MB')
1716.252986907959 MB

Total memory used by the process increased by almost 1.5 GB.

In [27]:
print(process.memory_info().rss / 1024 ** 2,'MB')
7289.13671875 MB

Why did the memory usage increase?

By default dropna() performs the operation and returns a copy of the object, which means we have to allocate extra memory (roughly the size of the column) for the new column without NaN values. With a large DataFrame we cannot afford copying it.

We can avoid this copy by setting inplace=True. This time there is no need to assign back to the DataFrame because the operation is done on the same column (the method returns None).

In [28]:
data['category_3'].dropna(axis=0, inplace=True)
In [29]:
print(process.memory_info().rss / 1024 ** 2,'MB')
7731.5078125 MB

To remove the DataFrame from memory, use del.

In [ ]:
del data

Then call the garbage collector manually to make sure the memory is released.

In [ ]:
import gc
gc.collect()