I am trying to get the number of unique values in a specific column in a csv (~10 GB) and am looking for the fastest way to do that. I expected command line tools like csvstat to run faster than pandas, but:
import datetime
import pandas as pd

def get_uniq_col_count(colname):
    df = pd.read_csv('faults_all_main_dp_1_joined__9-4-15.csv', engine='c', usecols=[colname], nrows=400000)
    df.columns = ['col']
    uniq = set(df.col.values)  # build the set once and reuse it
    return len(uniq), list(uniq)

t1 = datetime.datetime.now()
count, uniq = get_uniq_col_count('model')
print(datetime.datetime.now() - t1)
# 0:00:04.386585
vs.
$ time csvcut -c model faults_all_main_dp_1_joined__9-4-15.csv | head -n 400000 | csvstat --unique
3
real 1m3.343s
user 1m3.212s
sys 0m0.237s
(I am using head here because I let csvstat run on the whole dataset, went out for lunch, came back, and it was still running. It took pandas 50 sec to finish.)
I wonder if I am doing something wrong, and in general, if there is a way to speed up the process. (There are about 5 million rows to read through for each column.)
This is not surprising. The csvkit tools (csvcut, csvstat) are written entirely in Python and do not use any optimization tricks. In particular, csvstat reads the entire table into memory and stores multiple copies of it as regular Python lists. This has a huge overhead, both in memory and in garbage-collector time.
On the other hand, pandas uses numpy, which means that the whole column uses only a few Python objects and has almost no memory overhead. You may be able to make your program slightly faster by using the pandas-specific unique method (see "Find unique values in a Pandas dataframe, irrespective of row or column location").
If you are going to be doing this a lot, convert the data to a more efficient format. This page: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization/ shows that once you convert your data to HDF5, it will be much faster to load.
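As a rough sketch of the pandas route (not benchmarked against the 10 GB file), the column can be read in chunks and each chunk's unique() result folded into a set, so that only one chunk is in memory at a time. The function name count_unique, the chunk size and the tiny in-memory sample are my own assumptions for illustration.

```python
import io
import pandas as pd

def count_unique(csv_source, colname, chunksize=100_000):
    """Count distinct values in one column, reading the CSV chunk by chunk."""
    seen = set()
    for chunk in pd.read_csv(csv_source, usecols=[colname], chunksize=chunksize):
        # Series.unique() deduplicates within the chunk before the set merge
        seen.update(chunk[colname].dropna().unique())
    return len(seen), sorted(seen)

# Tiny stand-in for the real file (hypothetical data)
sample = io.StringIO("id,model\n1,a\n2,b\n3,a\n4,c\n")
count, uniq = count_unique(sample, "model", chunksize=2)
print(count, uniq)
```

If the column fits in memory in one go, df[colname].nunique() does the same job without the loop.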
I want to add the columns df['B'] and df['C'].
I coded it like below:
df['B'] = df['A'].str.split("_").str[0]
df['C'] = df['A'].str.split("_").str[1]
But the split function is slower than I expected, so it takes too long as the dataframe gets bigger.
I want to find a more efficient way.
Is it possible to call split only once?
df[['B','C']] = df['A'].split("_") (this code is just an example)
Or is there a smarter way?
Thanks.
split in Pandas has an expand option which you could use as follows. I've not tested speed.
df = df.join(df['A'].str.split('_', expand = True).rename(columns = {0 : 'B', 1: 'C'}))
You could use expand to create several columns at a time, and then join to add all those columns:
df.join(df.A.str.split('_', expand=True).rename({0:'B', 1:'C'}, axis=1))
Tho, I fail to see how the method you are using can be so slow. I doubt this one is so much faster. It is basically the same split. That is not python's split btw. This is a pandas method. So it is "vectorized".
Unless you have thousands of those columns? In which case, indeed, it is better if you can avoid 1000 calls
Edit: timing
With 1000000 rows, on my computer, this method (and you can see that, in the same 10 seconds, you got twice the exact same answer — with the variation of axis=1 vs columns= for rename — so it might be the good one :D), takes 2.71 seconds.
Your method takes 2.77 seconds.
With some 0.03 standard deviation. So sometimes, yours is even faster, tho I ran it enough times to prove with a p-value<5% that yours is slightly slower, but really really slightly.
I guess, this is just as fast as it gets.
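For completeness, here is a minimal sketch of the "split once" form the question asked about, assigning both new columns from a single str.split call. The three-row frame is made-up sample data, and n=1 is my assumption that only the first underscore should split the value.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's column A
df = pd.DataFrame({"A": ["x_1", "y_2", "z_3"]})

# One split call; expand=True returns a DataFrame with one column per piece
df[["B", "C"]] = df["A"].str.split("_", n=1, expand=True)
print(df)
```

Going by the timings above, this is unlikely to be dramatically faster than two separate calls; it mainly avoids repeating the split.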
I'm trying to optimize a piece of software written in Python using pandas DFs. The algorithm takes a pandas DF as input, can't be distributed, and outputs a metric for each client.
Maybe it's not the best solution, but my time-efficient approach is to load all files in parallel and then build a DF for each client.
This works fine, BUT a few clients have a really HUGE amount of data, so I need to save memory when creating their DF.
In order to do this I'm performing a groupBy() (actually a combineByKey, but logically it's a groupBy) and then for each group (aka now a single Row of an RDD) I build a list and from it, a pandas DF.
However, this makes many copies of the data (RDD rows, list and pandas DF...) in a single task/node and crashes, so I would like to get rid of those extra copies on a single node.
I was thinking of a "special" combineByKey with the following pseudo-code:
def createCombiner(val):
    return [val]

def mergeCombinerVal(x, val):
    x.append(val)
    return x

def mergeCombiners(x, y):
    # Not checking if y is a pandas DF already, but we can do it too
    if isinstance(x, list):
        # first merge for this key: turn the list into a DF
        x = pd.DataFrame(data=x, columns=myCols)
    # x is a pandas DF here; concat returns a new DF, so return the result
    return pd.concat([x, pd.DataFrame(data=y, columns=myCols)], ignore_index=True)
My question: the docs say nothing about this, but does anyone know whether it's safe to assume that this will work? (The return type of merging two combiners is not the same as the type of the combiner.) I can handle the data type in mergeCombinerVal too if the number of "bad" calls is marginal, but it would be very inefficient to append to a pandas DF row by row.
Any better idea for doing what I want to do?
Thanks!
PS: Right now I'm packing Spark rows. Would switching from Spark rows to Python lists without column names help reduce memory usage?
Just writing my comment as answer
In the end I used a regular combineByKey. It's faster than groupByKey (idk the exact reason; I guess it helps with packing the rows, because my rows are small but there are maaaany of them). It also allows me to group them into a "real" Python list, which helps me with memory management when packing them into pandas/C datatypes. groupByKey groups into some kind of Iterable which pandas doesn't support, which forces me to create another copy of the structure, doubling memory usage and crashing.
Now I can use those lists to build a dataframe directly without any extra transformation (I don't know what structure is Spark's groupByKey "list", but pandas won't accept it in the constructor).
Still, my original idea should have given a little less memory usage (at most 1x DF + 0.5x list, while now I have 1x DF + 1x list), but as user8371915 said it's not guaranteed by the API/docs..., better not to put that into production :)
For now, my biggest clients fit into a reasonable amount of memory. I'm processing most of my clients in a very parallel low-memory-per-executor job and the biggest ones in a not-so-parallel high-memory-per-executor job. I decide which one to use based on a pre-count I perform.
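To make the list-based approach concrete, here is a sketch of the three functions as they would be passed to rdd.combineByKey(create_combiner, merge_value, merge_combiners). So that it runs without a cluster, combine_locally is a hypothetical stand-in that mimics the per-partition and cross-partition steps in plain Python; it is not Spark's implementation. The column names and sample rows are made up.

```python
import pandas as pd

def create_combiner(val):
    return [val]

def merge_value(acc, val):
    acc.append(val)
    return acc

def merge_combiners(a, b):
    # Both sides stay plain lists, so the combiner type never changes
    a.extend(b)
    return a

def combine_locally(partitions):
    """Hypothetical local stand-in for combineByKey over a list of partitions."""
    merged = {}
    for part in partitions:
        local = {}
        for key, val in part:
            if key in local:
                local[key] = merge_value(local[key], val)
            else:
                local[key] = create_combiner(val)
        for key, acc in local.items():
            merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
    return merged

my_cols = ["ts", "value"]  # assumed column names
partitions = [
    [("client_a", (1, 10.0)), ("client_b", (1, 5.0))],
    [("client_a", (2, 11.0))],
]
lists_by_client = combine_locally(partitions)
# One DataFrame per client, built straight from the combined list
frames = {k: pd.DataFrame(rows, columns=my_cols) for k, rows in lists_by_client.items()}
print(frames["client_a"])
```

Keeping the combiner a list in all three functions avoids relying on behaviour the API does not document, at the cost of holding the list and the DF at the same time.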
Say we have a (huge) csv file with a header "data".
I load this file with pandas (imported as pd):
pd_data = pd.read_csv(csv_file)
Now, say the file has multiple columns and many rows, so I would like to increase reading speed and reduce memory usage.
Is there a difference in performance between using:
(1) pd.read_csv(csv_file, usecols=['data'])
versus
(2) pd.read_csv(csv_file, usecols=[0])
(let us assume that 'data' is always column 0).
The reason for using (1) is the peace of mind that no matter where 'data' is actually stored in the file we will load the right column.
Using (2) does require us to trust that column 0 is indeed 'data'.
The reason why I think (2) is faster is because (1) may have some overhead in looking for the column (unless it says "if match on first column, stop searching", kind of thing, of course).
Questions:
Which approach is faster among (1) and (2)?
Is there an even faster approach?
Thank you.
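Not a definitive answer, but the two forms are easy to time directly. The sketch below reads a small generated CSV both ways and checks that the results match; note that usecols takes a list-like (or a callable), so the positional form is written as [0]. The generated data and the repeat count are assumptions, and timings on a real file may differ.

```python
import io
import timeit
import pandas as pd

# Small generated CSV standing in for the real file (hypothetical data)
csv_text = "data,other\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

def read_by_name():
    return pd.read_csv(io.StringIO(csv_text), usecols=["data"])

def read_by_index():
    return pd.read_csv(io.StringIO(csv_text), usecols=[0])

by_name = read_by_name()
by_index = read_by_index()
same = by_name.equals(by_index)  # both forms should load the same column

print("by name :", timeit.timeit(read_by_name, number=20))
print("by index:", timeit.timeit(read_by_index, number=20))
```

The header is a single line, so any name lookup happens once per read rather than once per row; I would expect the difference to be small next to the cost of parsing the rows, but measuring on the real file is the only way to be sure.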
I want to read and handle a large CSV file (data_file) having the following 2-columns structure:
id params
1 '14':'blah blah','25':'more cool stuff'
2 '157':'yes, more stuff','15':'and even more'
3 '14':'blah blah','25':'more cool stuff'
4 '15':'different here'
5 '157':'yes, more stuff','15':'and even more'
6 '100':'exhausted'
This file contains 30,000,000 lines (5 GB on disk). (The actual strings are encoded in UTF-8; for simplicity, I gave them in ASCII here.) Note that some of the values in the 2nd column are repeated.
I read this using pandas.read_csv():
df = pandas.read_csv(open(data_file, 'rb'), delimiter='\t',
                     usecols=['id', 'params'], dtype={'id': 'u4', 'params': 'str'})
Once the file is read, the data frame df uses 1.2 GB of RAM.
So far so good.
Now comes the processing part. I want to have the params string column with this format:
blah blah||more cool stuff
yes, more stuff||and even more
blah blah||more cool stuff
different here
yes, more stuff||and even more
exhausted
I wrote:
def clean_keywords(x):
    return "||".join(x.split("'")[1:][::2])

df['params'] = df['params'].map(clean_keywords)
This code works in the sense it gives the correct result. But:
More than 6.8 GB of RAM are used while performing the map operation.
After the computation has finished, 5.5 GB of RAM are used by df (after gc.collect()), although the strings computed in the params column are shorter than the ones that were read.
Can someone explain this and propose an alternative way of performing the above operation using pandas (I use Python 3.4, pandas 0.16.2, win64)?
Answering my own question.
It turns out that pandas.read_csv() is clever. When the file is read, strings are made unique. But when these strings are processed and stored in the column, they are no longer unique. Hence the RAM usage increases. In order to avoid this, one has to maintain the uniqueness manually. I did it this way:
unique_strings = {}

def clean_keywords(x):
    s = "||".join(x.split("'")[1:][::2])
    return unique_strings.setdefault(s, s)

df['params'] = df['params'].map(clean_keywords)
With this solution, maximum RAM usage was only 2.8 GB, and it went down to slightly under the initial RAM usage after reading the data (1.2 GB), as expected.
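A small self-contained check of the deduplication idea, without pandas: equal results come back as the very same string object, so repeated values cost one string plus references. The sample rows are copied from the question. The [3::4] slice is my assumption for rows quoted exactly as printed there; the real file may be quoted differently, in which case the original slice applies.

```python
unique_strings = {}

def clean_keywords(x):
    # Assumption: for "'k':'v','k':'v'" the values sit at positions 3, 7, ...
    s = "||".join(x.split("'")[3::4])
    # setdefault returns the first stored copy of an equal string
    return unique_strings.setdefault(s, s)

rows = [
    "'14':'blah blah','25':'more cool stuff'",
    "'157':'yes, more stuff','15':'and even more'",
    "'14':'blah blah','25':'more cool stuff'",
]
cleaned = [clean_keywords(r) for r in rows]

print(cleaned[0])
print(cleaned[0] is cleaned[2])  # identity check, not just equality
print(len(unique_strings))
```

The dictionary itself takes memory proportional to the number of distinct strings, so this pays off only when values repeat a lot, as they do here.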
Okay, deep breath, this may be a bit verbose, but better to err on the side of detail than lack thereof...
So, in one sentence, my goal is to find the intersection of about 22 files of ~300-400 MB each, based on 3 of 139 attributes.
Now a bit more background. The files range from ~300-400 MB, consist of 139 columns, and typically have 400,000-600,000 rows. I have three particular fields I want to join on: a unique ID, and latitude/longitude (with a bit of a tolerance if possible). The goal is to determine which of these records existed across certain ranges of files. Going worst case, that will mean performing a 22 file intersection.
So far, the following has failed
I tried using MySQL to perform the join. This was back when I was only looking at 7 years. Attempting the join on 7 years (using INNER JOIN about 7 times... e.g. t1 INNER JOIN t2 ON condition INNER JOIN t3 ON condition ... etc), I let it run for about 48 hours before the timeout ended it. Was it likely to actually still be running, or does that seem overly long? Despite all the suggestions I found to enable better multithreading and more RAM usage, I couldn't seem to get the cpu usage above 25%. If this is a good approach to pursue, any tips would be greatly appreciated.
I tried using ArcMap. I converted the CSVs to tables and imported them into a file geodatabase. I ran the intersection tool on two files, which took about 4 days, and the number of records returned was more than twice the number of input features combined. Each file had about 600,000 records. The intersection returned with 2,000,0000 results. In other cases, not all records were recognized by ArcMap. ArcMap says there are 5,000 records, when in reality there are 400,000+
I tried combining in python. Firstly, I can immediately tell RAM is going to be an issue. Each file takes up roughly 2GB of RAM in python when fully opened. I do this with:
import csv

# uniqueIDIndex, latIndex and longIndex are the column positions of the three join fields
f1 = [row for row in csv.reader(open('file1.csv', newline=''))]
f2 = [row for row in csv.reader(open('file2.csv', newline=''))]
joinOut = csv.writer(open('Intersect.csv', 'w', newline=''))
# list.extend() returns None, so concatenate the two lists instead
uniqueIDs = set([row[uniqueIDIndex] for row in f1] + [row[uniqueIDIndex] for row in f2])
for uniqueID in uniqueIDs:
    f1rows = [row for row in f1 if row[uniqueIDIndex] == uniqueID]
    f2rows = [row for row in f2 if row[uniqueIDIndex] == uniqueID]
    if len(f1rows) == 0 or len(f2rows) == 0:
        pass  # Not an intersect
    else:
        # Strings, split at decimal; if integer and first 3 places
        # after decimal are equal, they are spatially close enough
        f1lat = f1rows[0][latIndex].split('.')
        f1long = f1rows[0][longIndex].split('.')
        f2lat = f2rows[0][latIndex].split('.')
        f2long = f2rows[0][longIndex].split('.')
        if f1lat[0]+f1lat[1][:3] == f2lat[0]+f2lat[1][:3] and f1long[0]+f1long[1][:3] == f2long[0]+f2long[1][:3]:
            joinOut.writerows([f1rows[0], f2rows[0]])
Obviously, this approach requires that the files being intersected are available in memory. Well I only have 16GB of RAM available and 22 files would need ~44GB of RAM. I could change it so that instead, when each uniqueID is iterated, it opens and parses each file for the row with that uniqueID. This has the benefit of reducing the footprint to almost nothing, but with hundreds of thousands of unique IDs, that could take an unreasonable amount of time to execute.
So, here I am, asking for suggestions on how I can best handle this data. I have an i7-3770k at 4.4 GHz, 16 GB RAM, and a Vertex 4 SSD, rated at 560 MB/s read speed. Is this machine even capable of handling this amount of data?
Another avenue I've thought about exploring is an Amazon EC2 cluster and Hadoop. Would that be a better idea to investigate?
Suggestion: Pre-process all the files to extract the 3 attributes you're interested in first. You can always keep track of the file/rownumber as well, so you can reference all the original attributes later if you want.
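A minimal sketch of that pre-processing step, assuming each file has a header row: stream every file once, keep only (id, lat, lon) with coordinates truncated to 3 decimals as in the question's code, remember the row number, and intersect the key sets. The column names uid, lat and lon, the helper names and the two tiny in-memory files are hypothetical.

```python
import csv
import io

def truncate(value, places=3):
    """Keep only the first `places` decimals of a numeric string."""
    whole, _, frac = value.strip().partition(".")
    return whole + "." + frac[:places].ljust(places, "0")

def load_keys(csv_file, id_col, lat_col, lon_col):
    """Map (id, lat, lon) to the first data-row number where it occurs."""
    keys = {}
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        key = (row[id_col], truncate(row[lat_col]), truncate(row[lon_col]))
        keys.setdefault(key, row_number)
    return keys

# Two tiny stand-ins for the real 139-column files (made-up data)
file1 = io.StringIO("uid,lat,lon,other\nA1,40.12345,-75.45678,x\nB2,41.00000,-76.00000,y\n")
file2 = io.StringIO("uid,lat,lon,other\nA1,40.12399,-75.45601,z\nC3,42.5,-77.5,w\n")

per_file = [load_keys(f, "uid", "lat", "lon") for f in (file1, file2)]
# Keys present in every file; iterating a dict yields its keys
common = set(per_file[0]).intersection(*per_file[1:])
print(sorted(common))
```

Only three short strings and one integer per row are kept, so all 22 files should fit comfortably in 16 GB. Truncation treats two points on either side of a 0.001 boundary as different, which matches the question's comparison but is a fairly crude tolerance.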