How to get around MemoryError when using Pandas? - python

I know that MemoryError is a common error when using various functions of the Pandas library.
I want to get help in several areas. My questions are formulated below, after describing the problem.
My OS is Ubuntu 18, my workspace is a Jupyter notebook running under Anaconda, and the machine has 8 GB of RAM.
The task I am solving: I have over 100,000 dictionaries containing data on site visits by users, like this one:
{'meduza.io': 2, 'google.com': 4, 'oracle.com': 2, 'mail.google.com': 1, 'yandex.ru': 1, 'user_id': 3}
I need to build a DataFrame from this data. At first I used append to add the dictionaries to the DataFrame one row at a time:
for i in tqdm_notebook(data):
    real_data = real_data.append([i], ignore_index=True)
But even on a toy dataset this approach took a long time to complete.
Then I tried to create the DataFrame directly by passing the whole list of dictionaries:
real_data = pd.DataFrame(data=data, dtype='int')
Converting a small amount of data this way is fast enough, but when I pass the complete data set to the function, a MemoryError appears.
I monitored RAM consumption: the function does not even start executing and does not consume any memory.
I tried expanding the swap file, but that did not help; the function does not seem to use it.
I understand that to solve my particular problem I can break the data into parts and then combine them, but I am not sure this is the most effective approach.
I want to understand how Pandas calculates the amount of memory required for an operation.
Judging by the number of questions on this topic, a MemoryError occurs when reading, merging, etc. Is it possible to make use of a swap file to solve this problem?
How can I build a DataFrame from these dictionaries more efficiently? append does not work efficiently, and creating a DataFrame from the complete dataset in one call is faster but leads to the error.
I do not understand how these operations are implemented internally, but I want to figure out the most efficient way to convert data like mine.

I'd suggest specifying the dtypes of the columns; otherwise Pandas may try to read them as object types. For example, if using DataFrame.from_dict, specify the dtype argument, or cast the columns afterwards with something like astype({'a': np.float64, 'b': np.int32, 'c': 'Int64'}). The best way to create the dataframe is from the dictionary objects, as you're doing - never use dataframe.append, because it's really inefficient.
See if any other programs are taking up memory on your system as well, and kill those before trying to do the load.
You could also try to see at what point the memory error occurs - 50k, 70k, 100k records?
Debug the dataframe and see what types are being loaded, and make sure those types are the smallest appropriate ones (e.g. bool rather than object).
EDIT: What could be making your dataframe very large is having lots of sparse entries, especially if there are lots of different domains as column headers - you might end up with 100k columns! It might be better to change your columns to a 'key:value' approach, e.g. {'domain': 'google.ru', 'user_id': 3, 'count': 10}.
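To illustrate, here is a minimal sketch of that long-format ('key:value') idea, assuming data is the list of visit dictionaries from the question (the column names and dtypes are illustrative):
import pandas as pd

records = []
for d in data:
    user_id = d['user_id']
    for domain, count in d.items():
        if domain == 'user_id':
            continue
        records.append({'user_id': user_id, 'domain': domain, 'count': count})

real_data = pd.DataFrame.from_records(records)
# Downcast to compact dtypes so the frame stays small.
real_data['user_id'] = real_data['user_id'].astype('int32')
real_data['count'] = real_data['count'].astype('int32')
real_data['domain'] = real_data['domain'].astype('category')
This keeps the frame to three dense columns instead of one sparse column per domain.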

Related

How to reduce the amount of RAM used by pandas

I have a Raspberry Pi 3B+ with 1 GB of RAM on which I am running a Telegram bot.
This bot uses a database I store in CSV format, which contains about 100k rows and four columns:
The first two are used for searching.
The third is a result.
These use about 20-30 MB of RAM, which is acceptable.
The last column is the real problem: it pushes RAM usage up to 180 MB, which is impossible to manage on the RPi. This column is also used for searching, but I only need it sometimes.
I started by loading the df with read_csv once at the start of the script and letting the script poll, but as the database grows I realized that this is too much for the RPi.
What do you think is the best way to do this? Thanks!
Setting the dtype according to the data can reduce memory usage a lot.
With read_csv you can directly set the dtype for each column:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
Example:
df = pd.read_csv(my_csv, dtype={0:'int8',1:'float32',2:'category'}, header=None)
See the next section on some dtype examples (with an existing df).
To change that on an existing dataframe you can use astype on the respective column.
Use df.info() to check the df memory usage before and after the change.
Some examples:
# Columns with integer values
# check first if min & max of that column are in the respective int range so you don't lose info
df['column_with_integers'] = df['column_with_integers'].astype('int8')
# Columns with float values
# mind potential calculation precision = don't go too low
df['column_with_floats'] = df['column_with_floats'].astype('float32')
# Columns with categorical values (strings)
# e.g. when the rows repeatedly contain the same strings,
# like 'TeamRed', 'TeamBlue', 'TeamYellow' spread over 10k rows
df['Team_Name'] = df['Team_Name'].astype('category')
# Change boolean-like string columns to actual booleans
df['Yes_or_No'] = df['Yes_or_No'].map({'yes':True, 'no':False})
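As a quick usage check (complementary to df.info()), you can compare the total footprint before and after applying these conversions:
# Total memory footprint in MB; run this before and after the dtype changes above
print(df.memory_usage(deep=True).sum() / 1024**2, 'MB')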
Kudos to Medallion Data Science and his YouTube video Efficient Pandas Dataframes in Python - Make your code run fast with these tricks!, where I learned these tips.
Kudos to Robert Haas for the additional link in the comments to Pandas Scaling to large datasets - Use efficient datatypes
Not sure if this is a good idea in this case but it might be worth trying.
The Dask package was designed to allow Pandas-like data analysis on dataframes that are too big to fit in memory (RAM), among other things. It does this by only loading chunks of the complete dataframe into memory at a time.
However, I'm not sure it was designed for machines like the Raspberry Pi (I'm not even sure there is a distribution for it).
The good thing is Dask will slide seamlessly into your Pandas script so it might not be too much effort to try it out.
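To give a flavour, a minimal sketch of what the CSV load might look like with Dask (the file name, column names, and dtypes are placeholders):
import dask.dataframe as dd

# Dask reads the CSV lazily, in partitions, instead of all at once.
df = dd.read_csv('database.csv',
                 dtype={'key1': 'category', 'key2': 'category',
                        'result': 'float32', 'big_column': 'object'})

# Filtering works like in pandas; .compute() materialises only the matching rows.
matches = df[df['key1'] == 'some_value'].compute()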
Here is a simple example I made before:
dask-examples.ipynb
If you try it let me know if it works, I'm also interested.

How do I perform deduplication with the python record linkage toolkit with large data sets?

I am currently using the Python Record Linkage Toolkit to perform deduplication on data sets at work. In an ideal world, I would just use blocking or sortedneighborhood to trim down the size of the index of record pairs, but sometimes I need to do a full index on a data set with over 75k records, which results in a couple billion record pairs.
The issue I'm running into is that the workstation I'm able to use is running out of memory, so it can't store the full 2.5-3 billion pair multi-index. I know the documentation has ideas for doing record linkage with two large data sets using numpy split, which is simple enough for my usage, but doesn't provide anything for deduplication within a single dataframe. I actually incorporated this subset suggestion into a method for splitting the multiindex into subsets and running those, but it doesn't get around the issue of the .index() call seemingly loading the entire multiindex into memory and causing an out of memory error.
Is there a way to split a dataframe and compute the matched pairs iteratively so I don't have to load the whole kit and kaboodle into memory at once? I was looking at dask, but I'm still pretty green on the whole python thing, so I don't know how to incorporate the dask dataframes into the record linkage toolkit.
While I was able to solve this, sort of, I am going to leave it open because I suspect given my inexperience with python, my process could be improved.
Basically, I had to ditch the index function from the record linkage toolkit. I pulled out the index of the dataframe I was using, converted it to a list, and passed it through the itertools combinations function.
from itertools import combinations, islice

candidates = fl.index.tolist()            # plain Python list of the dataframe's index labels
candidates = combinations(candidates, 2)  # lazy iterator over all pairs of labels
This then gave me an iterator of tuples, without having to load everything into memory. I then passed it into an islice grouper as a for loop:
for x in iter(lambda: list(islice(candidates, 1000000)), []):  # take 1,000,000 pairs at a time until exhausted
I then performed all of the necessary comparisons inside the for loop, added each resulting dataframe to a dictionary, and concatenated them at the end into the full result. Python's memory usage hasn't risen above 3 GB the entire time.
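For concreteness, a sketch of how that loop might look with the toolkit's Compare object (the dataframe fl is from the snippets above; the comparison rules and column names are placeholders):
import pandas as pd
import recordlinkage
from itertools import combinations, islice

candidates = combinations(fl.index.tolist(), 2)

compare = recordlinkage.Compare()
compare.exact('first_name', 'first_name', label='first_name')                      # placeholder rule
compare.string('last_name', 'last_name', method='jarowinkler', label='last_name')  # placeholder rule

results = {}
for chunk_no, chunk in enumerate(iter(lambda: list(islice(candidates, 1000000)), [])):
    pairs = pd.MultiIndex.from_tuples(chunk)        # the toolkit expects pairs as a MultiIndex
    results[chunk_no] = compare.compute(pairs, fl)  # comparisons for this chunk only

features = pd.concat(results.values())              # assemble the full feature table at the end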
I would still love some information on how to incorporate dask into this, so I will accept any answer that can provide that (unless the mods think I should open a new question).

Dask DataFrame groupby/apply efficiency

I am struggling to come up with an efficient way of solving what seems to be a typical use case of dask.dataframe groupby+apply and am wondering if I'm missing something obvious (various docs speak to this issue but I haven't been able to fully resolve it).
In short, I'm trying to load a medium-sized (say 10GB) dataframe, group by some column, train a machine learning model for each subset (a few seconds per model, ~100k subsets), and save that model to disk. My best attempt so far looks like:
import dask.dataframe as dd
from dask.distributed import Client

c = Client()
df = dd.read_parquet('data.parquet')
df = c.persist(df.set_index('key'))  # data is already sorted by key
result = c.compute(df.groupby(df.index).apply(train_and_save_model))
No matter how I try to repartition the data, I seem to spend an enormous amount of time on serialization/IO compared to on actual computation. A silly naive workaround of writing 100k separate Parquet files up front and then passing filenames to the workers to load/train on seems to be much more efficient; I'm struggling to see why that would perform any differently. Isn't the idea of setting the index and partitioning that each worker understands which parts of the file it should read from? I assume I'm missing something obvious here so any guidance would be appreciated.
It looks like there's a big difference between data that is read in with index='key' and data that is re-indexed with set_index; I thought that if the new index was already sorted then there'd be no shuffling cost but I guess that was wrong.
Changing to dd.read_parquet('data.parquet', index='key') seems to give performance like what I was hoping for so I'm happy (though still curious why set_index seems to shuffle unnecessarily in this case).
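For reference, a sketch of the revised pipeline without the explicit set_index (train_and_save_model is the user-defined function from the question):
import dask.dataframe as dd
from dask.distributed import Client

c = Client()
# Declaring the index at read time avoids the later set_index and its shuffle.
df = c.persist(dd.read_parquet('data.parquet', index='key'))
result = c.compute(df.groupby(df.index).apply(train_and_save_model))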

PySpark casting IntegerTypes to ByteType for optimization

I'm reading a large amount of data from parquet files into dataframes. I noticed that a vast number of the columns only contain 1, 0, or -1 as values and thus could be converted from int to byte types to save memory.
I wrote a function to do just that and return a new dataframe with the values cast as bytes; however, when looking at the memory of the dataframe in the UI, I see it stored as just a transformation of the original dataframe and not as a new dataframe itself, thus taking the same amount of memory.
I'm rather new to Spark and may not fully understand the internals, so how would I go about initially setting those columns to be of ByteType?
TL;DR It might be useful, but in practice the impact might be much smaller than you think.
As you noticed:
the memory of the dataframe in the UI, I see it saved as just a transformation from the original dataframe and not as a new dataframe itself, thus taking the same amount of memory.
For storage, Spark uses in-memory columnar storage, which applies a number of optimizations, including compression. If the data has low cardinality, then a column can easily be compressed using run-length encoding or dictionary encoding, and casting won't make any difference.
In order to see whether there is any impact, you can try two things (a sketch of the first check follows below):
Write the data back to the file system, once with the original types and another time with your optimisation, and compare the size on disk.
Try calling collect on the dataframe and look at the driver memory in your OS's system monitor; make sure to induce a garbage collection to get a cleaner indication. Again, do this once without the optimisation and another time with it.
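A minimal sketch of the first check, assuming an existing SparkSession named spark and an input path in_path (the column names are placeholders):
from pyspark.sql import functions as F
from pyspark.sql.types import ByteType

df = spark.read.parquet(in_path)

# Cast the low-cardinality integer columns down to bytes, leave the rest untouched.
byte_cols = ['flag_a', 'flag_b', 'flag_c']  # placeholder column names
df_small = df.select(*[
    F.col(c).cast(ByteType()).alias(c) if c in byte_cols else F.col(c)
    for c in df.columns
])

# Write both versions out and compare their size on disk.
df.write.mode('overwrite').parquet('/tmp/original')
df_small.write.mode('overwrite').parquet('/tmp/optimised')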
user8371915 is right in the general case but take into account that the optimisations may or may not kick in based on various parameters like row group size and dictionary encoding threshold.
This means that even if you do see impact, there is a good chance you could get the same compression by tuning spark.

How can I ensure unique rows in a large HDF5

I'm working on storing a relatively large (5,000,000 rows and growing) set of time series data in an HDF5 table. I need a way to remove duplicates from it on a daily basis, one 'run' per day. As my data retrieval process currently stands, it's far easier to write in the duplicates during the retrieval process than to ensure no duplicates go in.
What is the best way to remove duplicates from a pytable? All of my reading points me towards importing the whole table into pandas, getting a unique-valued data frame, and writing it back to disk by recreating the table with each data run. This seems counter to the point of pytables, though, and I don't know that the whole data set will fit efficiently into memory over time. I should add that it is two columns that define a unique record.
No reproducible code, but can anyone give me pytables data management advice?
Big thanks in advance...
See this related question: finding a duplicate in a hdf5 pytable with 500e6 rows
Why do you say that this is 'counter to the point of pytables'? It is perfectly possible to store duplicates. The user is responsible for this.
You can also try this: merging two tables with millions of rows in python, where you use a merge function that is simply drop_duplicates().
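As a rough sketch of the pandas route without holding the whole table in memory at once, one could stream the table in chunks and keep only rows whose key pair has not been seen yet (this assumes the data was stored in PyTables 'table' format so it can be read with chunksize; the file path, key, and the two key columns are placeholders, and it only scales while the set of keys fits in memory):
import pandas as pd

unique_chunks = []
seen = set()

# Stream the table in chunks instead of loading it all at once.
for chunk in pd.read_hdf('timeseries.h5', key='data', chunksize=500000):
    mask = []
    # The two columns that define a unique record (placeholders).
    for k in zip(chunk['timestamp'], chunk['series_id']):
        mask.append(k not in seen)
        seen.add(k)
    unique_chunks.append(chunk[mask])

deduped = pd.concat(unique_chunks, ignore_index=True)
deduped.to_hdf('timeseries_deduped.h5', key='data', format='table', mode='w')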
