How to reduce the amount of RAM used by pandas - python

I have a Raspberry Pi 3B+ with 1 GB of RAM on which I am running a Telegram bot.
The bot uses a database I store in CSV format, which contains about 100k rows and four columns:
The first two columns are used for searching.
The third column is the result.
These use about 20-30 MB of RAM, which is manageable.
The last column is the real problem: it pushes RAM usage up to 180 MB, which is impossible for the RPi to handle. This column is also used for searching, but I only need it occasionally.
At first I just loaded the df with read_csv at the start of the script and let the script keep polling, but as the database grows I realized this is too much for the RPi.
What do you think is the best way to handle this? Thanks!

Setting the dtype according to the data can reduce memory usage a lot.
With read_csv you can directly set the dtype for each column:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
Example:
df = pd.read_csv(my_csv, dtype={0:'int8',1:'float32',2:'category'}, header=None)
See the next section on some dtype examples (with an existing df).
To change that on an existing dataframe you can use astype on the respective column.
Use df.info() to check the df memory usage before and after the change.
Some examples:
# Columns with integer values
# check first that the min & max of the column fit into the target int range so you don't lose info
df['column_with_integers'] = df['column_with_integers'].astype('int8')
# Columns with float values
# mind potential calculation precision - don't go too low
df['column_with_floats'] = df['column_with_floats'].astype('float32')
# Columns with categorical values (strings)
# e.g. when the rows repeatedly contain the same strings,
# like 'TeamRed', 'TeamBlue', 'TeamYellow' spread over 10k rows
df['Team_Name'] = df['Team_Name'].astype('category')
# Change boolean-like string columns to actual booleans
df['Yes_or_No'] = df['Yes_or_No'].map({'yes': True, 'no': False})
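As mentioned above, df.info() shows the memory usage; a quick before/after check could look like this (just a sketch, assuming df is already loaded):
# before the dtype changes
df.info(memory_usage='deep')          # per-column dtypes and total memory
print(df.memory_usage(deep=True))     # bytes per column, counting string/object data

# ... apply the astype()/map() conversions shown above ...

# after the dtype changes
df.info(memory_usage='deep')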
Kudos to Medallion Data Science and his YouTube video Efficient Pandas Dataframes in Python - Make your code run fast with these tricks!, where I learned these tips.
Kudos to Robert Haas for the additional link in the comments to Pandas Scaling to large datasets - Use efficient datatypes.

Not sure if this is a good idea in this case but it might be worth trying.
The Dask package was designed to allow Pandas-like data analysis on dataframes that are too big to fit in memory (RAM) (as well as other things). It does this by only loading chunks of the complete dataframe into memory at a time.
However, I'm not sure it was designed with machines like the Raspberry Pi in mind (I'm not even sure there is a distribution for it).
The good thing is that Dask slides almost seamlessly into a Pandas script, so it might not be too much effort to try it out.
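For an idea of what the drop-in change could look like, here is a rough sketch (the file name, column names, and dtypes are placeholders, not the OP's actual data; dask.dataframe mirrors much of the pandas API):
import dask.dataframe as dd

# read lazily in partitions instead of loading everything at once;
# dtype hints keep each partition small, as in the dtype answer above
ddf = dd.read_csv('my_database.csv', dtype={'col_a': 'int32', 'col_b': 'category'})

# filtering looks like pandas, but the work is done chunk by chunk when .compute() is called
result = ddf[ddf['col_a'] == 42].compute()   # returns a regular pandas DataFrame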
Here is a simple example I made before:
dask-examples.ipynb
If you try it, let me know if it works - I'm also interested.

Related

How to get around Memory Error when using Pandas?

I know that MemoryError is a common error when using various functions of the Pandas library.
I would like help in several areas. My questions are formulated below, after a description of the problem.
My OS is Ubuntu 18, my workspace is a Jupyter notebook within Anaconda, and I have 8 GB of RAM.
The task I am solving:
I have over 100,000 dictionaries containing data on site visits by users, like this:
{'meduza.io': 2, 'google.com': 4, 'oracle.com': 2, 'mail.google.com': 1, 'yandex.ru': 1, 'user_id': 3}
I need to form a DataFrame from this data. At first I used the append function to add the dictionaries to a DataFrame row by row:
for i in tqdm_notebook(data):
    real_data = real_data.append([i], ignore_index=True)
But even a toy dataset showed that this approach takes a long time to complete.
Then I tried to create the DataFrame directly by passing the list of dictionaries, like this:
real_data = pd.DataFrame(data=data, dtype='int')
Converting a small amount of data is fast enough, but when I pass the complete data set to the function, a MemoryError appears.
I tracked RAM consumption: the function never starts executing and does not consume any memory.
I tried expanding the swap file, but that did not help; the function does not use it.
I understand that to solve my particular problem I could break the data into parts and then combine them, but I'm not sure I know the most efficient way to do that.
I want to understand how Pandas calculates the amount of memory required for an operation.
Judging by the number of questions on this topic, a memory error occurs when reading, merging, etc. Is it possible to use a swap file to solve this problem?
How can I more efficiently add the dictionaries to a DataFrame?
append does not work efficiently; creating a DataFrame from the complete dataset at once is more efficient, but leads to an error.
I do not understand the internals of these processes, but I want to figure out the most efficient way to convert data like mine.
I'd suggest specifying the dtypes of the columns; it might be trying to read them all as object types. For example, with read_csv you can pass a per-column mapping like dtype={'a': np.float64, 'b': np.int32, 'c': 'Int64'}, and DataFrame.from_dict also takes a dtype argument (a single dtype to force for all columns). The best way to create the dataframe is from the dictionary object, as you're doing - never use DataFrame.append, because it's really inefficient.
See if any other programs are taking up memory on your system as well, and kill those before trying to do the load.
You could also try and see at what point the memory error occurs - 50k, 70k, 100k?
Debug the dataframe and see what types are being loaded, and make sure those types are the smallest appropriate ones (e.g. bool rather than object).
EDIT: What could be making your dataframe very large is if you have lots of sparse entries, especially if there are lots of different domains as column headers. It might be better to change your columns to a more 'key:value' approach, e.g. {'domain': 'google.ru', 'user_id': 3, 'count': 10}. You might have 100k columns!
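A rough sketch of that reshaping (variable names are placeholders; it assumes data is the list of visit dictionaries from the question):
import pandas as pd

# flatten each {'domain': count, ..., 'user_id': id} dict into long-format records,
# so there are only three columns instead of one column per domain
records = []
for d in data:
    user_id = d['user_id']
    for domain, count in d.items():
        if domain == 'user_id':
            continue
        records.append({'user_id': user_id, 'domain': domain, 'count': count})

real_data = pd.DataFrame(records)
real_data['count'] = real_data['count'].astype('int32')        # smaller int dtype
real_data['domain'] = real_data['domain'].astype('category')   # domains repeat a lot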

Is it a good practice to preallocate an empty dataframe with types?

I'm trying to load around 3 GB of data into a Pandas dataframe, and I figured that I would save some memory by first declaring an empty dataframe while enforcing that its float columns would be 32-bit instead of the default 64-bit. However, the Pandas dataframe constructor does not allow specifying the types of multiple columns on an empty dataframe.
I found a bunch of workarounds in the replies to this question, but they made me realize that Pandas is not designed in this way.
This made me wonder whether it was a good strategy at all to declare the empty dataframe first, instead of reading the file and then downcasting the float columns (which seems inefficient memory-wise and processing-wise).
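For context, the two alternatives being weighed might look roughly like this (just a sketch; the file and column names are placeholders):
import numpy as np
import pandas as pd

# option A (read, then downcast): the full float64 copy briefly exists in memory
df = pd.read_csv('big_file.csv')
float_cols = df.select_dtypes(include='float64').columns
df[float_cols] = df[float_cols].astype(np.float32)

# option B (declare dtypes at read time): the parser produces float32 directly,
# so the full 64-bit copy is never built
df = pd.read_csv('big_file.csv', dtype={'price': np.float32, 'score': np.float32})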
What would be the best strategy to design my program?

PySpark casting IntegerTypes to ByteType for optimization

I'm reading in a large amount of data from parquet files into dataframes. I noticed that a large proportion of the columns only contain 1, 0, or -1 as values and thus could be converted from Int to Byte types to save memory.
I wrote a function to do just that and return a new dataframe with the values cast as bytes; however, when looking at the memory of the dataframe in the UI, I see it stored as just a transformation of the original dataframe and not as a new dataframe itself, thus taking the same amount of memory.
I'm rather new to Spark and may not fully understand the internals, so how would I go about initially setting those columns to be of ByteType?
TL;DR It might be useful, but in practice the impact might be much smaller than you think.
As you noticed:
the memory of the dataframe in the UI, I see it saved as just a transformation from the original dataframe and not as a new dataframe itself, thus taking the same amount of memory.
For storage, Spark uses in-memory columnar storage, which applies a number of optimizations, including compression. If data has low cardinality, then column can be easily compressed using run length encoding or dictionary encoding, and casting won't make any difference.
In order to see whether there is any impact, you can try two things:
Write the data back to the file system, once with the original types and another time with your optimisation, and compare the size on disk.
Try calling collect on the dataframe and look at the driver memory in your OS's system monitor; make sure to induce a garbage collection to get a cleaner indication. Again, do this once without the optimisation and once with it.
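A rough sketch of the first check (the paths and column name are placeholders; it assumes an active SparkSession named spark):
from pyspark.sql import functions as F

df = spark.read.parquet('/data/original')

# cast a low-cardinality integer column down to bytes
df_bytes = df.withColumn('flag', F.col('flag').cast('byte'))

# write both versions and compare their size on disk;
# parquet's dictionary/RLE encoding may shrink the int version just as well
df.write.mode('overwrite').parquet('/tmp/size_check_int')
df_bytes.write.mode('overwrite').parquet('/tmp/size_check_byte')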
user8371915 is right in the general case, but take into account that the optimisations may or may not kick in depending on various parameters like row group size and the dictionary encoding threshold.
This means that even if you do see an impact, there is a good chance you could get the same compression by tuning Spark.

Pandas memory usage?

I was wondering how pandas handles memory usage in Python - more specifically, how memory is handled if I assign the results of a dataframe query to a variable. Under the hood, would the variable just hold references into the original dataframe object, or would I be cloning all of the data?
I'm afraid of memory ballooning out of control, but I have a dataframe with non-unique fields that I can't index it by. It's incredibly slow to query and plot data from it using commands like df[(df[''] == x) & (df[''] == y)].
(Both are integer values in the rows. They're also not unique, hence the fact that the query returns multiple results.)
I'm very new to pandas, but any insight into how to handle a situation where I'm looking for the arrays of values where two conditions match would be great too. Right now I'm looping through with an O(n) algorithm and indexing it myself, because even that runs faster than the search queries when I need to access the data quickly. Watching my system take twenty seconds on a dataset of only 6,000 rows is foreboding.
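One way to see what actually happens is to inspect a query result directly; a small sketch (the column names 'a' and 'b' and the values x and y stand in for the real ones):
import numpy as np

# a boolean-mask query like the one above
result = df[(df['a'] == x) & (df['b'] == y)]

# how much memory does the returned slice itself hold?
print(result.memory_usage(deep=True).sum())

# does it share its underlying buffers with the original frame?
print(np.shares_memory(df['a'].to_numpy(), result['a'].to_numpy()))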

Pandas - retrieving HDF5 columns and memory usage

I have a simple question, I cannot help but feel like I am missing something obvious.
I have read data from a source table (SQL Server) and have created an HDF5 file to store the data via the following:
output.to_hdf('h5name', 'df', format='table', data_columns=True, append=True, complib='blosc', min_itemsize = 10)
The dataset is ~50 million rows and 11 columns.
If I read the entire HDF5 file back into a dataframe (through HDFStore.select or read_hdf), it consumes about 24 GB of RAM. If I pass specific columns to the read statement (e.g. selecting 2 or 3 columns), the dataframe now only returns those columns, however the same amount of memory (24 GB) is consumed.
This is running on Python 2.7 with Pandas 0.14.
Am I missing something obvious?
EDIT: I think I answered my own question. While I did a ton of searching before posting, obviously once posted I found a useful link: https://github.com/pydata/pandas/issues/6379
Any suggestions on how to optimize this process would be great; due to memory limitations I cannot afford the peak memory required before it is released via gc.
HDFStore in table format is a row-oriented store. When selecting, the query indexes on the rows, but for each row you get every column; selecting a subset of columns does a reindex at the end.
There are several ways to approach this:
use a column store, like bcolz; this is currently not implemented by PyTables so this would involve quite a bit of work
chunk through the table (see here) and concat at the end - this will use constant memory (a sketch follows below this list)
store as a fixed format - this is a more efficient storage format so will use less memory (but cannot be appended)
create your own column-store-like layout by storing to multiple sub-tables and use select_as_multiple (see here)
Which option you choose depends on the nature of your data access.
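A small sketch of the chunking approach from option 2 (the column names are placeholders):
import pandas as pd

store = pd.HDFStore('h5name')
pieces = []
# iterate over the table-format store in chunks so only one chunk is
# materialised in memory at a time, keeping just the needed columns
for chunk in store.select('df', columns=['col_a', 'col_b'], chunksize=1000000):
    pieces.append(chunk)
store.close()

subset = pd.concat(pieces, ignore_index=True)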
Note: you may not want to have all of the columns as data_columns unless you are really going to select on all of them (you can only query ON a data_column or an index); limiting the data_columns will make storing/querying faster.
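For instance, a sketch of restricting data_columns when writing (reusing the to_hdf call from the question; 'col_a' is a placeholder column name):
import pandas as pd

# only 'col_a' becomes a queryable data column; the other columns are stored as a plain block
output.to_hdf('h5name', 'df', format='table', data_columns=['col_a'],
              append=True, complib='blosc', min_itemsize=10)

# later, queries can filter on that data column inside the store
subset = pd.read_hdf('h5name', 'df', where='col_a > 100')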
