I want to read and handle a large CSV file (data_file) having the following 2-columns structure:
id params
1 '14':'blah blah','25':'more cool stuff'
2 '157':'yes, more stuff','15':'and even more'
3 '14':'blah blah','25':'more cool stuff'
4 '15':'different here'
5 '157':'yes, more stuff','15':'and even more'
6 '100':'exhausted'
This file contains 30,000,000 lines (5 Gb on disk). (The actual strings are UTF-8 encoded; for simplicity I show them as ASCII here.) Note that some of the values in the 2nd column are repeated.
I read this using pandas.read_csv():
df = pandas.read_csv(open(data_file, 'rb'), delimiter='\t',
                     usecols=['id', 'params'], dtype={'id': 'u4', 'params': 'str'})
Once the file is read, the data frame df uses 1.2 Gb of RAM.
So far so good.
Now comes the processing part. I want to have the params string column with this format:
blah blah||more cool stuff
yes, more stuff||and even more
blah blah||more cool stuff
different here
yes, more stuff||and even more
exhausted
I wrote:
def clean_keywords(x):
    return "||".join(x.split("'")[1:][::2])

df['params'] = df['params'].map(clean_keywords)
This code works in the sense that it gives the correct result. But:
More than 6.8 Gb of RAM is used while performing the map operation.
After the computation has finished, 5.5 Gb of RAM is used by df (after gc.collect()), even though the strings computed in the params column are shorter than the ones that were read.
Can someone explain this and propose an alternative way of performing the above operation using pandas (I use Python 3.4, pandas 0.16.2, win64)?
Answering my own question.
It turns out that pandas.read_csv() is clever: when the file is read, duplicate strings are stored only once. But when these strings are processed and stored back in the column, they are no longer unique, hence the RAM usage increases. To avoid this, one has to maintain the uniqueness manually. I did it this way:
unique_strings = {}

def clean_keywords(x):
    s = "||".join(x.split("'")[1:][::2])
    return unique_strings.setdefault(s, s)

df['params'] = df['params'].map(clean_keywords)
With this solution, peak RAM usage was only 2.8 Gb, and it dropped slightly below the initial usage after reading the data (1.2 Gb), as expected.
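For what it's worth, a similar deduplication can also be obtained with pandas' categorical dtype; this is my own suggestion, not part of the original solution, and peak memory during the map step itself would still need measuring:

# Alternative sketch (not the original answer): store each distinct cleaned
# string only once by converting the column to a categorical.
df['params'] = df['params'].map(lambda x: "||".join(x.split("'")[1:][::2]))
df['params'] = df['params'].astype('category')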
Related
I have a table in Hadoop which contains 7 billion strings which can themselves contain anything. I need to remove every name from the column containing the strings. An example string would be 'John went to the park' and I'd need to remove 'John' from that, ideally just replacing with '[name]'.
In the case of 'John and Mary went to market', the output would be '[NAME] and [NAME] went to market'.
To support this I have an ordered list of the most frequently occurring 20k names.
I have access to Hue (Hive, Impala) and Zeppelin (Spark, Python & libraries) to execute this.
I've tried this in the DB, but being unable to update columns or iterate over a variable made it a non-starter, so using Python and PySpark seems to be the best option, especially considering the number of comparisons (20k names * 7 billion input strings).
# nameList contains ['John', 'Emma', etc.]
def removeNames(line, nameList):
    str_line = line[0]
    for name in nameList:
        rx = f"(^| |[[:^alpha:]])({name})( |$|[[:^alpha:]])"
        str_line = re.sub(rx, '[NAME]', str_line)
    str_line = [str_line]
    return tuple(str_line)
df = session.sql("select free_text from table")
rdd = df.rdd.map(lambda line: removeNames(line, nameList))
rdd.toDF().show()
The code is executing, but it's taking an hour and a half even if I limit the input text to 1000 lines (which is nothing for Spark), and the lines aren't actually being replaced in the final output.
What I'm wondering is: Why isn't map actually updating the lines of the RDD, and how could I make this more efficient so it executes in a reasonable amount of time?
This is my first time posting so if there's essential info missing, I'll fill in as much as I can.
Thank you!
In case you're still curious about this: by applying a Python function (your removeNames function) with rdd.map, Spark has to serialize all of your data between the JVM and the Python workers, which loses much of the benefit of doing this operation natively and in a distributed fashion. As suggested in the comments, if you go with the regexp_replace() method, Spark can keep all of the data inside the JVM on the distributed nodes, keeping everything distributed and improving performance.
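A minimal sketch of what that could look like, reusing the free_text column from the question; the single alternation pattern built from the name list is my own construction, and with 20k names you may need to split the list and chain several regexp_replace calls:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

nameList = ['John', 'Emma', 'Mary']              # stand-in for the real 20k-name list
# Java-regex word boundaries as an approximation of the POSIX classes above.
pattern = r"\b(" + "|".join(nameList) + r")\b"

df = spark.sql("select free_text from table")
df = df.withColumn("free_text", F.regexp_replace("free_text", pattern, "[NAME]"))
df.show()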
I have an issue which I think I somewhat solved but I would like to learn more about it or learn about better solutions.
The problem: I have tab-separated files with ~600k lines (and one comment line), of which one field (out of 8) contains a string of variable length, anything between 1 and ~2000 characters.
Reading that file with the following function is terribly slow:
df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 names=list_of_names)
However, perhaps I don't care so much about most of the string (field name of this string is 'motif') and I'm okay with truncating it if it's too long using:
def truncate_motif(motif):
    if len(motif) > 8:
        return motif[:8] + '~'
    else:
        return motif

df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 converters={'motif': truncate_motif},
                 names=list_of_names)
This is suddenly a lot faster.
So my questions are:
Why is reading this file so slow? Does it have to do with allocating memory?
Why does the converter function help here? It has to execute an additional function for every row, yet it is still a lot faster...
What else can be done?
You didn't mention what "slow" means to you, but if:
your file contains ca. 600k rows,
each row contains 1-2000 characters (let's say 1000 on average, so each line is ca. 1000 B),
then this file's size is: 600,000 * 1000 B ~ 570 MB. That is a lot, especially if you don't have much RAM.
It helps because suddenly the average size of one stored line is not 1000 B, but ca. 6-7 B (considering the new max = 8 B). The system does not keep the whole strings; it only checks their length and cuts them if needed. Sounds logical to me!
In such cases, when you have to load lots of data, it's good to use chunks.
for chunk in pd.read_csv(tgfile, chunksize=10000):
    process(chunk)
The chunksize parameter says how many rows one chunk contains. It's worth checking whether it improves performance in your case!
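A sketch of how the two ideas might be combined, reusing the question's tgfile, list_of_names and truncate_motif; the concat at the end is just one way to put the pieces back together:

import pandas as pd

pieces = []
for chunk in pd.read_csv(tgfile,
                         sep="\t",
                         comment='#',
                         header=None,
                         names=list_of_names,
                         converters={'motif': truncate_motif},
                         chunksize=10000):
    pieces.append(chunk)          # or filter/aggregate each chunk here instead

df = pd.concat(pieces, ignore_index=True)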
Currently, the data in the h5 file itself does not have the prefix 'b', but when I read it with the following code the strings come back with it. I wonder whether there is a better way to read the h5 file without ending up with the prefix 'b'.
import tables as tb
import pandas as pd
import numpy as np
import time
time0=time.time()
pth='d:/download/'
# read data
data_trading=pth+'Trading_v01.h5'
filem=tb.open_file(data_trading,mode='a',driver="H5FD_CORE")
tb_trading=filem.get_node(where='/', name='wind_data')
df=pd.DataFrame.from_records(tb_trading[:])
time1=time.time()
print('\ntime on reading data %6.3fs' %(time1-time0))
# in python3, remove prefix 'b'
df.loc[:, 'Date'] = [dt.decode('utf-8') for dt in df.loc[:, 'Date']]
df.loc[:, 'Code'] = [cd.decode('utf-8') for cd in df.loc[:, 'Code']]
time2=time.time()
print("\ntime on removing prefix 'b' %6.3fs" %(time2-time1))
print('\ntotal time %6.3fs' %(time2-time0))
The timing results:
time on reading data 1.569s
time on removing prefix 'b' 29.921s
total time 31.490s
You see, removing the prefix 'b' is really time consuming.
I have tried to use pd.read_hdf, which doesn't produce the prefix 'b':
%time df2=pd.read_hdf(data_trading)
Wall time: 14.7 s
which so far is faster.
Using this SO answer and a vectorised str.decode, I can cut the conversion time to 9.1 seconds (and thus the total time to less than 11 seconds):
for key in ['Date', 'Code']:
    df[key] = df[key].str.decode("utf-8")
Question: is there an even more effective way to convert my bytes columns to string when reading a HDF 5 data table?
The best solution for performance is to stop trying to "remove the b prefix." The b prefix is there because your data consists of bytes, and Python 3 insists on displaying this prefix to indicate bytes in many places, even in places where it makes no sense, such as the output of the built-in csv module.
But inside your own program this may not hurt anything, and in fact, if you want the highest performance, you may be better off leaving these columns as bytes. This is especially true if you're using Python 3.0 to 3.2, which always use a multi-byte unicode representation (these versions predate PEP 393's flexible string representation).
Even if you are using Python 3.3 or later, where the conversion from bytes to unicode doesn't cost you any extra space, it may still be a waste of time if you have a lot of data.
Finally, Pandas is not optimal if you are dealing with columns of mostly unique strings which have a somewhat consistent width. For example if you have columns of text data which are license plate numbers, all of them will fit in about 9 characters. The inefficiency arises because Pandas does not exactly have a string column type, but instead uses an object column type, which contains pointers to strings stored separately. This is bad for CPU caches, bad for memory bandwidth, and bad for memory consumption (again, if your strings are mostly unique and of similar lengths). If your strings have highly variable widths, it may be worth it because a short string takes only its own length plus a pointer, whereas the fixed-width storage typical in NumPy and HDF5 takes the full column width for every string (even empty ones).
To get fast, fixed-width string columns in Python, you may consider using NumPy, which you can read via the excellent h5py library. This will give you a NumPy array which is a lot more similar to the underlying data stored in HDF5. It may still have the b prefix, because Python insists that non-unicode strings always display this prefix, but that's not necessarily something you should try to prevent.
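As a rough illustration (my own sketch, reusing the file path and node name from the question), reading the same table with h5py gives you a structured NumPy array with fixed-width byte strings:

import h5py

# Sketch only: node name 'wind_data' and the 'Date' field come from the question's code.
with h5py.File('d:/download/Trading_v01.h5', 'r') as f:
    wind = f['wind_data'][:]      # structured NumPy array, fixed-width byte strings
    dates = wind['Date']          # still bytes (displayed with a b prefix); decode only if you must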
I am trying to get the number of unique values in a specific column in a csv (~10 GB) and am looking for the fastest way to do that. I expected command line tools like csvstat to run faster than pandas, but:
def get_uniq_col_count(colname):
    df = pd.read_csv('faults_all_main_dp_1_joined__9-4-15.csv', engine='c',
                     usecols=[colname], nrows=400000)
    df.columns = ['col']
    return len(set(df.col.values)), list(set(df.col.values))
t1 = datetime.datetime.now()
count, uniq = get_uniq_col_count('model')
print(datetime.datetime.now() - t1)
# 0:00:04.386585
vs.
$ time csvcut -c model faults_all_main_dp_1_joined__9-4-15.csv | head -n 400000 | csvstat --unique
3
real 1m3.343s
user 1m3.212s
sys 0m0.237s
(I am using head because when I let csvstat run on the whole dataset, I went out for lunch, came back, and it was still running. It took pandas 50 sec to finish.)
I wonder if I am doing something wrong, and in general, if there is a way to speed up the process. (There are about 5 million rows to read through for each column.)
This is not surprising. 'csvtools' are written entirely in Python and do not use any optimization tricks. In particular, 'csvstat' reads the entire table into memory and stores multiple copies of it as regular Python lists. This has huge overhead, both in memory and in garbage-collection time.
On the other hand, pandas uses numpy, which means that the whole column uses only a few Python objects and has almost no memory overhead. You may be able to make your program slightly faster by using the pandas-specific unique method (Find unique values in a Pandas dataframe, irrespective of row or column location).
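For example (a sketch reusing the question's file and its 'model' column; nunique() and unique() are standard pandas methods):

import pandas as pd

df = pd.read_csv('faults_all_main_dp_1_joined__9-4-15.csv',
                 engine='c', usecols=['model'], nrows=400000)
count = df['model'].nunique()           # number of distinct values
uniq = df['model'].unique().tolist()    # the distinct values themselves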
If you are going to be doing this a lot, convert the data to a more efficient format. This page: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization/ shows that once you convert your data to HDF5, it will be much faster to load.
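A hedged sketch of that one-off conversion; the output filename and HDF5 key are my own choices, and for a very large csv you might need to write it out chunk by chunk instead:

import pandas as pd

# One-off conversion (slow, but done only once).
df = pd.read_csv('faults_all_main_dp_1_joined__9-4-15.csv', engine='c')
df.to_hdf('faults.h5', 'data', mode='w', format='table')

# Subsequent loads of a single column are much faster.
models = pd.read_hdf('faults.h5', 'data', columns=['model'])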
I am using Python to parse incoming comma-separated strings, and I want to do some calculations on the data afterwards.
Each string is about 800 characters long, with 120 comma-separated fields.
There are 1.2 million such strings to process.
for v in item.values():
    l.extend(get_fields(v.split(',')))
# process l
get_fields uses operator.itemgetter() to extract around 20 fields out of 120.
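The question doesn't show get_fields; for readers following along, it presumably looks roughly like this (the field indices here are purely illustrative):

from operator import itemgetter

# Guess at the shape of get_fields; the real code extracts ~20 of 120 fields.
get_fields = itemgetter(0, 3, 7, 12, 19)

row = "a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t".split(',')
print(get_fields(row))   # ('a', 'd', 'h', 'm', 't')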
This entire operation takes about 4-5 minutes excluding the time to bring in the data.
In the later part of the program I insert these lines into sqlite memory table for further use.
But overall 4-5 minutes time for just parsing and getting a list is not good for my project.
I run this processing in around 6-8 threads.
Might switching to C/C++ help?
Are you loading a dict with your file records? Probably better to process the data directly:
datafile = open("file_with_1point2million_records.dat")
# uncomment the next line to skip over a header record
# next(datafile)
l = sum((get_fields(v.split(',')) for v in datafile), [])   # works if get_fields returns lists; wrap in list(...) if it returns tuples
This avoids creating any overall data structure, and only accumulates the desired values as returned by get_fields.
Your program might be slowing down trying to allocate enough memory for 1.2M strings. In other words, the speed problem might not be due to the string parsing/manipulation, but rather to the l.extend. To test this hypothesis, you could put a print statement in the loop:
for v in item.values():
    print('got here')
    l.extend(get_fields(v.split(',')))
If the print statements get slower and slower, you can probably conclude l.extend is the culprit. In this case, you may see significant speed improvement if you can move the processing of each line into the loop.
PS: You probably should be using the csv module to take care of the parsing for you in a more high-level manner, but I don't think that will affect the speed very much.
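A minimal sketch of that csv-module suggestion, assuming the records sit in a plain text file (the filename is illustrative) and reusing the question's get_fields:

import csv

l = []
with open("records.dat", newline="") as fh:
    for row in csv.reader(fh):           # csv handles the splitting, including quoted fields
        l.extend(get_fields(row))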