I have a data acquisition system which takes measurements for a few minutes and generates a csv file with 10 million rows and 10 columns. Then I import this csv file into Python (csv.reader), perform a bunch of operations on the acquired numeric data (but 'only' 10000 rows at a time, otherwise the computer memory would be overwhelmed). In the end, I export the results into another, much smaller csv file (csv.writer).
The problem is that the runtime is very long and I want to speed it up. When I open the original csv file with Notepad I see that the numbers have up to 16 digits each, like 0.0015800159870059, 12.0257771094508 etc. I know that the accuracy of the DAQ is 0.1% at best and most of the trailing digits are noise. Is there an elegant way of forcing Python to operate globally with only 7-8 digits from start to finish, to speed up the calculations? I know about error propagation and I’m going to try different settings for the number of digits to see what the optimum is.
Please note that it is not enough for me to build a temporary csv file with 'truncated' data (e.g. containing 0.0015800, 12.0257771 etc.) and simply import that into Python. The calculations in Python should use reduced precision as well. I looked into the decimal module, with no success so far.
import csv

with open('datafile', newline='') as DAQfile:
    reader = csv.reader(DAQfile, delimiter=',')
    for row in reader:
        ...  # calculate stuff

with open('results.csv', 'w', newline='') as myfile:
    mywriter = csv.writer(myfile)
    ...  # write stuff
Adding some details, based on the comments so far:
The program calculates the peak of the moving average of the 'instantaneous power'. The data in the csv file can be described like this, where 'col' means column, V means voltage and I means current: col1=time, col2=V1, col3=I1, col4=V2, col5=I2 and so on until col11=V10, col12=I10. So each row represents a data sample taken by the DAQ.
The instantaneous power is Pi = V1*I1 + V2*I2 + ... + V10*I10
To calculate the moving average over 10000 rows at a time, I built a buffer (initialized with Buffer=[0]*10000). This buffer stores the Pi's for 10000 consecutive rows and is updated every time csv.reader moves to the next row. The buffer works exactly like a shift register.
This way the memory usage is insignificant (verified). In summary, the calculations are multiplications, additions, min(a,b) function (to detect the peak of the moving average) and del/append for refreshing the buffer. The moving average itself is iterative too, something like newavg=oldavg+(newlast-oldfirst)/bufsize.
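For concreteness, here is a stripped-down sketch of that loop (simplified; the column handling and the peak tracking with max() are illustrative assumptions, not my exact code):

import csv

bufsize = 10000
buf = [0.0] * bufsize          # shift register holding the last 10000 Pi values
avg = 0.0
peak = None

with open('datafile', newline='') as DAQfile:
    for row in csv.reader(DAQfile, delimiter=','):
        vals = [float(x) for x in row]
        # col1 = time, then alternating V/I columns: Pi = V1*I1 + V2*I2 + ...
        Pi = sum(vals[i] * vals[i + 1] for i in range(1, len(vals) - 1, 2))
        avg = avg + (Pi - buf[0]) / bufsize   # iterative moving average
        del buf[0]
        buf.append(Pi)
        peak = avg if peak is None else max(peak, avg)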
My thinking is that it does not make any sense to let Python work with all those decimals when I know that most of the trailing figures are garbage.
Forgot to mention that the size of the csv file coming from the DAQ is just under 1 GB.
Yes, there is a way: use NumPy. First, there are tons of vector/vector operations, which can be performed with one command
a = b + c
will efficiently sum two vectors.
Second, and this is the answer to your question, you can specify a 4-byte float type (numpy.float32), greatly reducing memory requirements and increasing speed.
You should read your file directly using
import numpy as np

data = np.genfromtxt('datafile.csv', dtype=np.float32, delimiter=',')
...
data would be made up of standard 32-bit floats, with circa 7 digits of precision.
The CSV file can also be read in parts/bunches:
numpy.genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None,
skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None,
usecols=None, names=None, excludelist=None, deletechars=None, replace_space='_',
autostrip=False, case_sensitive=True, defaultfmt='f%i',
unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None,
encoding='bytes')
Here is the full list of parameters. If max_rows is set to, say, 10, only 10 rows will be read; the default is to read the whole file. You can read anything in the middle of the file by skipping some initial records via the skip_header option.
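For example, here is a minimal sketch of reading one bunch of 10,000 rows from the middle of the file as 32-bit floats (the offsets are arbitrary):

import numpy as np

# Read rows 20,000-29,999 only, as 32-bit floats; all later arithmetic on
# `chunk` then stays in float32 (roughly 7 significant digits).
chunk = np.genfromtxt('datafile.csv', dtype=np.float32, delimiter=',',
                      skip_header=20000, max_rows=10000)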
Use DyZ's comment: if there is a way to speed up the calculations (e.g. using << or >> for multiplication or division, respectively, when the multiplier or divisor is a power of 2), you should take it.
example:
>>> 22 * 16
352
>>> 22 << 4
352
In that scenario I did the exact same operation with a marginal decrease in time. However, when that scales up to something like 100 trillion calculations, the difference becomes much more noticeable.
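If you want to check this on your own machine, here is a small illustrative micro-benchmark (timings will vary; the operand is kept in a variable so the interpreter cannot constant-fold the expressions):

import timeit

setup = 'x = 22'
# Compare integer multiplication against a left shift by the same power of 2
print(timeit.timeit('x * 16', setup=setup, number=10000000))
print(timeit.timeit('x << 4', setup=setup, number=10000000))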
I have an issue which I think I somewhat solved but I would like to learn more about it or learn about better solutions.
The problem: I have tab-separated files with ~600k lines (and one comment line), in which one field (out of 8 fields) contains a string of variable length, anything between 1 and ~2000 characters.
Reading that file with the following function is terribly slow:
import pandas as pd

df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 names=list_of_names)
However, perhaps I don't care so much about most of the string (field name of this string is 'motif') and I'm okay with truncating it if it's too long using:
def truncate_motif(motif):
    if len(motif) > 8:
        return motif[:8] + '~'
    else:
        return motif

df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 converters={'motif': truncate_motif},
                 names=list_of_names)
This suddenly is lots faster.
So my questions are:
Why is reading this file so slow? Does it have to do with allocating memory?
Why does the converter function help here? It has to execute an additional function for every row, but is still lots faster...
What else can be done?
You didn't mention what 'slow' means to you, but if:
your file contains ca. 600k rows,
each row contains 1-2000 characters (let's say 1000 on average, so each line is ca. 1000 B),
then this file's size is: 600,000 * 1000 B ≈ 570 MB. That's a lot, especially if you don't have much RAM.
It helps because suddenly the stored motif string is no longer ~1000 B on average, but at most 9 B (the 8 kept characters plus the '~'). The system does not have to keep the whole strings; it only checks their length and cuts them if needed. Sounds logical to me!
In such cases, when you have to load lots of data, it's good to use chunks.
for chunk in pd.read_csv(tgfile, chunksize=10000):
    process(chunk)
The chunksize parameter says how many rows one chunk contains. It's good to check whether it improves performance in your case!
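Putting both ideas together, a sketch could look like this (tgfile, list_of_names, truncate_motif and process are the names from your question):

import pandas as pd

# Truncate the long 'motif' field and read the file in 10,000-row chunks
reader = pd.read_csv(tgfile,
                     sep='\t',
                     comment='#',
                     header=None,
                     names=list_of_names,
                     converters={'motif': truncate_motif},
                     chunksize=10000)
for chunk in reader:
    process(chunk)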
Simple question: I have a dataframe in dask containing about 300 million records. I need to know the exact number of rows that the dataframe contains. Is there an easy way to do this?
When I try to run dataframe.x.count().compute() it looks like it tries to load the entire data into RAM, for which there is no space, and it crashes.
import dask.dataframe

# ensure a small enough block size for the graph to fit in your memory
ddf = dask.dataframe.read_csv('*.csv', blocksize="10MB")
ddf.shape[0].compute()
From the documentation:
blocksize <str, int or None>
    Optional. Number of bytes by which to cut up larger files. Default
    value is computed based on available physical memory and the number
    of cores, up to a maximum of 64MB. Can be a number like 64000000 or
    a string like "64MB". If None, a single block is used for each file.
If you only need the number of rows, you can load a subset of the columns, choosing one with low memory usage (such as a category/integer column rather than a string/object one); thereafter you can run len(df.index).
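For example, a minimal sketch along those lines, assuming a hypothetical low-memory integer column named 'id':

import dask.dataframe as dd

# Load only the one small column and count the rows
ddf = dd.read_csv('*.csv', usecols=['id'], blocksize='10MB')
n_rows = len(ddf.index)   # triggers the computation
print(n_rows)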
Currently, the data stored in the h5 file itself has no 'b' prefix, but when I read the file with the following code the string columns come back as bytes (displayed with the 'b' prefix). I wonder whether there is a better way to read the h5 file so that I end up with no 'b' prefix.
import tables as tb
import pandas as pd
import numpy as np
import time
time0=time.time()
pth='d:/download/'
# read data
data_trading=pth+'Trading_v01.h5'
filem=tb.open_file(data_trading,mode='a',driver="H5FD_CORE")
tb_trading=filem.get_node(where='/', name='wind_data')
df=pd.DataFrame.from_records(tb_trading[:])
time1=time.time()
print('\ntime on reading data %6.3fs' %(time1-time0))
# in python3, remove prefix 'b'
df.loc[:, 'Date'] = [dt.decode('utf-8') for dt in df.loc[:, 'Date']]
df.loc[:, 'Code'] = [cd.decode('utf-8') for cd in df.loc[:, 'Code']]
time2=time.time()
print("\ntime on removing prefix 'b' %6.3fs" %(time2-time1))
print('\ntotal time %6.3fs' %(time2-time0))
The timing results:
time on reading data 1.569s
time on removing prefix 'b' 29.921s
total time 31.490s
As you can see, removing the 'b' prefix is really time-consuming.
I have tried using pd.read_hdf, which does not produce the 'b' prefix.
%time df2=pd.read_hdf(data_trading)
Wall time: 14.7 s
which so far is faster.
Using this SO answer and a vectorised str.decode, I can cut the conversion time to 9.1 seconds (and thus the total time to less than 11 seconds):
for key in ['Date', 'Code']:
    df[key] = df[key].str.decode("utf-8")
Question: is there an even more effective way to convert my bytes columns to strings when reading an HDF5 data table?
The best solution for performance is to stop trying to "remove the b prefix." The b prefix is there because your data consists of bytes, and Python 3 insists on displaying this prefix to indicate bytes in many places, even places where it makes no sense, such as the output of the built-in csv module.
But inside your own program this may not hurt anything, and in fact, if you want the highest performance, you may be better off leaving these columns as bytes. This is especially true if you're using Python 3.0 to 3.2, which always use a multi-byte unicode representation.
Even if you are using Python 3.3 or later, where the conversion from bytes to unicode doesn't cost you any extra space, it may still be a waste of time if you have a lot of data.
Finally, Pandas is not optimal if you are dealing with columns of mostly unique strings which have a somewhat consistent width. For example if you have columns of text data which are license plate numbers, all of them will fit in about 9 characters. The inefficiency arises because Pandas does not exactly have a string column type, but instead uses an object column type, which contains pointers to strings stored separately. This is bad for CPU caches, bad for memory bandwidth, and bad for memory consumption (again, if your strings are mostly unique and of similar lengths). If your strings have highly variable widths, it may be worth it because a short string takes only its own length plus a pointer, whereas the fixed-width storage typical in NumPy and HDF5 takes the full column width for every string (even empty ones).
To get fast, fixed-width string columns in Python, you may consider using NumPy, which you can read via the excellent h5py library. This will give you a NumPy array which is a lot more similar to the underlying data stored in HDF5. It may still have the b prefix, because Python insists that non-unicode strings always display this prefix, but that's not necessarily something you should try to prevent.
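As a minimal sketch (assuming the '/wind_data' node from your question is an ordinary compound, table-like dataset that h5py can read):

import h5py

with h5py.File('d:/download/Trading_v01.h5', 'r') as f:
    data = f['wind_data'][:]   # structured NumPy array, read in one go
    dates = data['Date']       # fixed-width bytes (e.g. dtype 'S8'); values
                               # still display with the b prefix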
Okay, deep breath, this may be a bit verbose, but better to err on the side of detail than lack thereof...
So, in one sentence, my goal is to find the intersection of about 22 files of ~300-400 MB each, based on 3 of 139 attributes.
Now a bit more background. The files range from ~300-400 MB, consisting of 139 columns and typically in the range of 400,000-600,000 rows. I have three particular fields I want to join on: a unique ID, and latitude/longitude (with a bit of tolerance if possible). The goal is to determine which of these records existed across certain ranges of files. Going worst case, that will mean performing a 22-file intersection.
So far, the following approaches have failed:
I tried using MySQL to perform the join. This was back when I was only looking at 7 years. Attempting the join on 7 years (using INNER JOIN about 7 times... e.g. t1 INNER JOIN t2 ON condition INNER JOIN t3 ON condition ... etc), I let it run for about 48 hours before the timeout ended it. Was it likely to actually still be running, or does that seem overly long? Despite all the suggestions I found to enable better multithreading and more RAM usage, I couldn't seem to get the cpu usage above 25%. If this is a good approach to pursue, any tips would be greatly appreciated.
I tried using ArcMap. I converted the CSVs to tables and imported them into a file geodatabase. I ran the intersection tool on two files, which took about 4 days, and the number of records returned was more than twice the number of input features combined. Each file had about 600,000 records, and the intersection returned more than 2,000,000 results. In other cases, not all records were recognized by ArcMap: it says there are 5,000 records, when in reality there are 400,000+.
I tried combining in Python. Firstly, I can immediately tell RAM is going to be an issue. Each file takes up roughly 2 GB of RAM in Python when fully opened. I do this with:
import csv

f1 = [row for row in csv.reader(open('file1.csv', 'rU'))]
f2 = [row for row in csv.reader(open('file2.csv', 'rU'))]
joinOut = csv.writer(open('Intersect.csv', 'wb'))

uniqueIDs = set([row[uniqueIDIndex] for row in f1] +
                [row[uniqueIDIndex] for row in f2])
for uniqueID in uniqueIDs:
    f1rows = [row for row in f1 if row[uniqueIDIndex] == uniqueID]
    f2rows = [row for row in f2 if row[uniqueIDIndex] == uniqueID]
    if len(f1rows) == 0 or len(f2rows) == 0:
        pass  # not an intersect
    else:
        # Strings, split at the decimal point; if the integer part and the first
        # 3 places after the decimal are equal, they are spatially close enough
        f1lat = f1rows[0][latIndex].split('.')
        f1long = f1rows[0][longIndex].split('.')
        f2lat = f2rows[0][latIndex].split('.')
        f2long = f2rows[0][longIndex].split('.')
        if f1lat[0] + f1lat[1][:3] == f2lat[0] + f2lat[1][:3] and \
           f1long[0] + f1long[1][:3] == f2long[0] + f2long[1][:3]:
            joinOut.writerows([f1rows[0], f2rows[0]])
Obviously, this approach requires that the files being intersected are available in memory. I only have 16 GB of RAM available, and 22 files would need ~44 GB. I could change it so that, as each uniqueID is iterated, it opens and parses each file for the rows with that uniqueID. This has the benefit of reducing the memory footprint to almost nothing, but with hundreds of thousands of unique IDs it could take an unreasonable amount of time to execute.
So, here I am, asking for suggestions on how I can best handle this data. I have an i7-3770k at 4.4 GHz, 16 GB RAM, and a Vertex 4 SSD rated at 560 MB/s read speed. Is this machine even capable of handling this amount of data?
Another venue I've thought about exploring is an Amazon EC2 cluster and Hadoop. Would that be a better idea to investigate?
Suggestion: Pre-process all the files to extract the 3 attributes you're interested in first. You can always keep track of the file/row number as well, so you can reference all the original attributes later if you want.
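A sketch of that pre-processing step, where uniqueIDIndex, latIndex and longIndex stand for the (hypothetical) positions of your three join attributes:

import csv

def extract_keys(in_path, out_path, uniqueIDIndex, latIndex, longIndex):
    with open(in_path, newline='') as src, open(out_path, 'w', newline='') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for rownum, row in enumerate(reader):
            # Keep the file name and row number so the full 139-column record
            # can be looked up again later if needed.
            writer.writerow([in_path, rownum,
                             row[uniqueIDIndex], row[latIndex], row[longIndex]])

With each file reduced to five small columns, the 22-way comparison becomes far easier to fit in memory.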
I have a simple text file containing two columns, both integers:
1 5
1 12
2 5
2 341
2 12
and so on..
I need to group the dataset by the second value, such that the output will be:
5 1 2
12 1 2
341 2
Now the problem is that the file is very big, around 34 GB in size. I tried writing a Python script to group them into a dictionary with the value as an array of integers, but it still takes way too long. (I guess a lot of time is spent allocating the array('i') objects and extending them on append.)
I am now planning to write a Pig script which I plan to run on a pseudo-distributed Hadoop machine (an Amazon EC2 High-Memory Large instance).
data = LOAD 'Net.txt';
gdata = GROUP data BY $1; -- I know it will lead to 5 (1,5) (2,5) but that's okay for this snippet
STORE gdata INTO 'res.txt';
I wanted to know if there was any simpler way of doing this.
Update:
Keeping such a big file in memory is out of the question. For the Python solution, what I planned was to conduct 4 runs: in the first run only second-column values from 1 to 10 million are considered, in the next run 10 million to 20 million are considered, and so on. But this turned out to be really slow.
The Pig/Hadoop solution is interesting because it keeps everything on disk (well, most of it).
For better understanding: this dataset contains information about the connectivity of ~45 million Twitter users, and the format in the file means that the user ID given by the second number is following the first one.
Solution which I had used:
import array

class AdjDict(dict):
    """
    A special dictionary class to hold an adjacency list
    """
    def __missing__(self, key):
        """
        __missing__ is overridden so that when a key is not found,
        an integer array is initialized for it
        """
        self.__setitem__(key, array.array('i'))
        return self[key]

Adj = AdjDict()

for line in open("net.txt"):
    entry = line.strip().split('\t')
    node = int(entry[1])
    follower = int(entry[0])
    if node < 10 ** 6:
        Adj[node].append(follower)

# Code for writing the Adj matrix to the file:
Assuming you have ~17 characters per line (a number I picked randomly to make the math easier), you have about 2 billion records in this file. Unless you are running with much physical memory on a 64-bit system, you will thrash your pagefile to death trying to hold all this in memory in a single dict. And that's just to read it in as a data structure - one presumes that after this structure is built, you plan to actually do something with it.
With such a simple data format, I should think you'd be better off doing something in C instead of Python. Cracking this data shouldn't be difficult, and you'll have much less per-value overhead. At minimum, just to hold 2 billion 4-byte integers would be 8 GB (unless you can make some simplifying assumptions about the possible range of the values you currently list as 1 and 2 - if they will fit within a byte or a short, then you can use smaller int variables, which will be worth the trouble for a data set of this size).
If I had to solve this on my current hardware, I'd probably write a few small programs:
The first would work on 500-megabyte chunks of the file, swapping columns and writing the result to new files. (You'll get 70 or more.) (This won't take much memory.)
Then I'd call the OS-supplied sort(1) on each small file. (This might take a few gigs of memory.)
Then I'd write a merge-sort program that would merge together the lines from all 70-odd sub-files. (This won't take much memory.)
Then I'd write a program that would run through the large sorted list; you'll have a bunch of lines like:
5 1
5 2
12 1
12 2
and you'll need to return:
5 1 2
12 1 2
(This won't take much memory.)
By breaking it into smaller chunks, hopefully you can keep the RSS down to something that would fit a reasonable machine -- it will take more disk I/O, but on anything but astonishing hardware, swap use would kill attempts to handle this in one big program.
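A minimal sketch of that final pass, assuming the merged, sorted file (here called 'sorted.txt') already has the columns swapped into "key follower" order, one pair per line:

import itertools

with open('sorted.txt') as src, open('grouped.txt', 'w') as dst:
    pairs = (line.split() for line in src)
    # Consecutive lines with the same key are collapsed into one output line
    for key, group in itertools.groupby(pairs, key=lambda p: p[0]):
        dst.write('{} {}\n'.format(key, ' '.join(p[1] for p in group)))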
Maybe you can do a multi-pass through the file.
Do a range of keys on each pass through the file; for example, if you picked a range size of 100:
1st pass - work out all the keys from 0-99
2nd pass - work out all the keys from 100-199
3rd pass - work out all the keys from 200-299
4th pass - work out all the keys from 300-399
..and so on.
for your sample, the 1st pass would output
5 1 2
12 1 2
and the 4th pass would output
341 2
Choose the range size so that the dict you are creating fits into your RAM.
I wouldn't bother using multiprocessing to try to speed it up by using multiple cores; unless you have a very fast hard drive this should be IO-bound, and you would just end up thrashing the disk.
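A sketch of the multi-pass idea, where RANGE_SIZE and the ~45 million key ceiling are assumptions to be tuned so one pass's dict fits in RAM:

from collections import defaultdict

RANGE_SIZE = 10000000   # keys handled per pass (assumption)
MAX_KEY = 45000000      # ~45 million Twitter user ids, per the question

with open('grouped.txt', 'w') as out:
    for lo in range(0, MAX_KEY + RANGE_SIZE, RANGE_SIZE):
        groups = defaultdict(list)
        with open('net.txt') as src:
            for line in src:
                follower, key = line.split()
                key = int(key)
                if lo <= key < lo + RANGE_SIZE:
                    groups[key].append(follower)
        for key in sorted(groups):
            out.write('{} {}\n'.format(key, ' '.join(groups[key])))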
If you are working with a 34 GB file, I'm assuming that the hard drive, both in terms of storage and access time, is not a problem. How about reading the pairs sequentially and, when you find pair (x,y), opening file "x", appending " y", and closing file "x"? In the end, you will have one file per Twitter user ID, each containing all the users this one is connected to. You can then concatenate all those files if you want your result in the output format you specified.
THAT SAID HOWEVER, I really do think that:
(a) for such a large data set, exact resolution is not appropriate and that
(b) there is probably some better way to measure connectivity, so perhaps you'd like to tell us about your end goal.
Indeed, you have a very large graph, and a lot of efficient techniques have been devised to study the shape and properties of huge graphs; most of these techniques are built to work as streaming, online algorithms.
For instance, a technique called triangle counting, coupled with probabilistic cardinality estimation algorithms, efficiently and speedily provides information on the cliques contained in your graph. For a better idea on the triangle counting aspect, and how it is relevant to graphs, see for example this (randomly chosen) article.
I had a similar requirement, and you just need one more Pig statement to remove the redundancies in 5 (1,5) (2,5).
a = LOAD 'edgelist' USING PigStorage('\t') AS (user:int,following:int);
b = GROUP a BY user;
x = FOREACH b GENERATE group, a.following;
STORE x INTO 'following-list';