I have a very large file (10 GB) with approximately 400 billion lines. It is a pipe-separated csv; the first field is an ID, the second is the current position of that ID, and the third is a correlative number assigned to the row.
Similar to this:
41545496|4154|1
10546767|2791|2
15049399|491|3
38029772|491|4
15049399|1034|5
My intention is to create a fourth column (old position), in another file or the same one, that stores the position this ID had the last time it appeared. For each row I check whether the ID has appeared before: if it has, I look for its last appearance and assign that row's position to the old-position field; if the ID has not appeared before, I assign its current position from that same row as its old position.
Something like this:
41545496|4154|1|4154
10546767|2791|2|2791
15049399|491|3|491
38029772|491|4|491
15049399|1034|5|491
I have created a program that reads and analyses the file, but it only processes about 10 thousand lines per minute, which gives a very long time to read the entire file, more than 5 days approximately.
import pandas as pd

with open('file_in.csv', 'rb') as inf:
    df = pd.read_csv(inf, sep='|', header=None)

cont = 0
df[3] = 0

def test(x):
    global cont
    a = df.iloc[:cont, 0]
    try:
        index = a[a == df[0][cont]].index[-1]
        df[3][cont] = df[1][index]
    except IndexError:
        df[3][cont] = df[1][cont]
    cont += 1

df.apply(test, axis=1)

df.to_csv('file_out.csv', sep='|', index=False, header=False)
I have access to a computer with 64 processors and 64 GB of RAM at the university, but it still takes a long time. Is there any way to reduce that time? Thank you very much!
Processing the data efficiently
You have two main problems in your approach:
That amount of data should have never been written to a text file
Your approach needs (n^2/2) comparisons
A better idea is to index-sort your array first before doing the actual work. Then you need only about 2n operations for the comparisons and n*log(n) operations for the sorting in the worst case.
I also used numba to compile that function, which will speed up the computation time by a factor of 100 or more.
import numpy as np
import numba as nb

# This function isn't very hard to vectorize, but I expect better
# performance and easier understanding when doing it this way
@nb.njit()
def last_IDs(data, idx_1):
    # I assume that all values in the second column are positive
    res = np.zeros(data.shape[0], dtype=np.int64) - 1
    for i in range(1, data.shape[0]):
        if data[idx_1[i], 0] == data[idx_1[i - 1], 0]:
            res[idx_1[i]] = data[idx_1[i - 1], 1]
    # rows whose ID appears for the first time keep their current position
    same_ID = res == -1
    res[same_ID] = data[same_ID, 1]
    return res

# the hardest thing to do efficiently
data = np.genfromtxt('Test.csv', delimiter='|', dtype=np.int64)

# it is important that we use a stable sort algorithm here
idx_1 = np.argsort(data[:, 0], kind='mergesort')
column_4 = last_IDs(data, idx_1)
For performant writing and reading of the data, have a look at: https://stackoverflow.com/a/48997927/4045774
If you don't get at least 100 MB/s I/O speed, please ask.
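Just as a rough sketch of my own (not the linked answer): once column_4 is computed, the result can be stacked onto the original array and written in a binary format, which is much faster than a text file.

import numpy as np

# assumes `data` and `column_4` from the snippet above are in memory
out = np.column_stack((data, column_4))
np.save('file_out.npy', out)   # fast binary round-trip
# if a pipe-separated text file is really required (much slower):
# np.savetxt('file_out.csv', out, fmt='%d', delimiter='|')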
Related
I am a newbie to Pandas, and somewhat of a newbie to Python.
I am looking at stock data, which I read in as CSV and typical size is 500,000 rows.
The data looks like this
'''
'''
I need to check the data against itself - the basic algorithm is a loop similar to
Row = 0
x = get "low" price in row ROW
y = CalculateSomething(x)
go through the rest of the data, compare against y
if (a):
    append ("A") at the end of row ROW  # in the dataframe
else:
    print ("B") at the end of row ROW
Row = Row + 1
On the next iteration, the data pointer should reset to ROW 1, then go through the same process.
Each time, it adds notes to the dataframe at the ROW index.
I looked at Pandas and figured the way to try this would be to use two loops, copying the dataframe to maintain two separate instances.
The actual code looks like this (simplified)
import pandas as pd

df = pd.read_csv('data.csv')
calc1 = 1  # this part is confidential so set to something simple
calc2 = 2  # this part is confidential so set to something simple

def func3_df_index(df):
    dfouter = df.copy()
    for outerindex in dfouter.index:
        dfouter_openval = dfouter.at[outerindex, "Open"]
        for index in df.index:
            if df.at[index, "Low"] <= calc1 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message 1"
                break
            elif df.at[index, "High"] >= calc2 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message2"
                break
            else:
                dfouter.at[outerindex, 'notes'] = "message3"
This method is taking a long time (7+ minutes) per 5K rows, which will be quite long for 500,000 rows. There may be data exceeding 1 million rows.
I have tried using the two loop method with the following variants:
using iloc - e.g. df.iloc[index, 2]
using at - e.g. df.at[index, "low"]
using numpy & at - e.g. df.at[index, "low"] = np.where((df.at[index, "low"] < ..."
The data is floating point values and datetime strings.
Is it better to use numpy? Maybe an alternative to using two loops?
Any other methods, like using R, mongo, or some other database, would also be useful - I just need the results, and am not necessarily tied to Python.
Any help and constructs would be greatly appreciated.
Thanks in advance
You are copying the dataframe and manually looping over the indices. This will almost always be slower than vectorized operations.
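As a rough, hedged sketch of what a vectorized version could look like (calc1/calc2 are placeholders, since the real calculations are confidential): for each row, find the first row at or after it that triggers either condition, then pick the note from whichever fires first, mirroring the break-on-first-match logic of the nested loops.

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')
calc1, calc2 = 1, 2          # placeholders for the confidential calculations

low_hit = df["Low"].to_numpy() <= calc1
high_hit = df["High"].to_numpy() >= calc2
n = len(df)
idx = np.arange(n)

def first_hit_at_or_after(mask):
    # index of the first True at or after each position; n means "no hit"
    hits = np.flatnonzero(mask)
    pos = np.searchsorted(hits, idx)
    out = np.full(n, n)
    valid = pos < len(hits)
    out[valid] = hits[pos[valid]]
    return out

first_low = first_hit_at_or_after(low_hit)
first_high = first_hit_at_or_after(high_hit)

# "message 1" wins ties, just like the if/elif order in the loop version
notes = np.where(first_low <= first_high, "message 1", "message2")
notes[(first_low == n) & (first_high == n)] = "message3"
df["notes"] = notes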
If you only care about one row at a time, you can simply use the csv module.
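For example, a minimal streaming read with the standard csv module (my own sketch; the file name, column names, and threshold are assumptions), which never holds more than one row in memory:

import csv

with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        low = float(row["Low"])        # each row is handled as it is read
        if low <= 1.0:                 # placeholder threshold
            print(row["Low"], row["High"])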
numpy is not "better"; pandas internally uses numpy
Alternatively, load the data into a database - examples include SQLite, MySQL/MariaDB, PostgreSQL, or perhaps DuckDB - and then run queries against that. This has the added advantage of allowing type conversion from strings to floats, so numerical analysis is easier.
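For instance, a minimal sketch with DuckDB (my own illustration; the file name and threshold are assumptions), which can query the CSV directly and infers column types on read:

import duckdb   # pip install duckdb

result = duckdb.sql("""
    SELECT *
    FROM read_csv_auto('data.csv')
    WHERE "Low" <= 1.0          -- placeholder for the confidential calc1
""").df()
print(result.head())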
If you really want to process a file in parallel directly from Python, you could move to Dask or PySpark, although Pandas should work with some tuning; for a start, Pandas' read_sql function against such a database would work well.
You could split the main dataset into smaller datasets, e.g. 50 sub-datasets with 10,000 rows each, to increase speed. Process each sub-dataset using threading or concurrency and then combine your final results.
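A rough sketch of that idea (my own, with an assumed file name and a stand-in per-chunk function), using a process pool so the chunks are worked on concurrently:

import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    # stand-in for the real per-chunk analysis
    return chunk["Low"].min(), chunk["High"].max()

if __name__ == "__main__":
    chunks = list(pd.read_csv("data.csv", chunksize=10_000))
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process_chunk, chunks))
    # combine the per-chunk results
    print(min(r[0] for r in results), max(r[1] for r in results))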
I've been searching for a solution to this for a while, and I'm really stuck! I have a very large text file, imported as a pandas dataframe containing just two columns but with hundreds of thousands to millions of rows. The columns contain packet dumps: one is the data of the packets, formatted as ascii representations of monotonically increasing integers, and the second is the packet time.
I want to go through this dataframe and make sure that it is monotonically increasing, and if there are missing data, to insert new rows in order to make the list monotonically increasing, i.e. the 'data' column should be filled in with the appropriate value but the time should be set to 'NaN' or 'NULL', etc.
The following is a sample of the data:
data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303400 1527986052.506439335
So I have two questions:
1) I've been trying to loop through the dataframe using itertuples to get the next row, compare it with the current row, and add a new row if the difference is more than 100; unfortunately I've struggled with this since there doesn't seem to be a good way to retrieve the row after the one called.
2) Is there a better (faster) way to do this other than the way I've proposed?
This may be trivial, though I've really struggled with it. Thank you in advance for your help.
One problem at a time. You can check monotonicity directly with df.data.is_monotonic_increasing.
Inserting new indices: it is better to go the other way around. You already know the index you want: it is given by range(min_val, max_val+1, 100). You can create a blank DataFrame with this index and update it using your data.
This may be memory intensive, so you may need to go over your data in chunks. In that case, you may need to provide the index range ahead of time.
import pandas as pd

# test data
df = pd.read_csv(
    pd.compat.StringIO(
        """data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303500 1527986052.506439335"""
    ),
    sep=r" +",
)

# check if the data is increasing
assert df.data.is_monotonic_increasing

# desired index range
rng = range(df.data.iloc[0], df.data.iloc[-1] + 1, 100)

# blank frame with full index
df2 = pd.DataFrame(index=rng, columns=["frame_time_epoch"])

# update with existing data
df2.update(df.set_index("data"))

# result
#                     frame_time_epoch
# 303030303030303000      1.52799e+09
# 303030303030303100      1.52799e+09
# 303030303030303200      1.52799e+09
# 303030303030303300      1.52799e+09
# 303030303030303400              NaN
# 303030303030303500      1.52799e+09
Just for examination: did you try something like
delta = df['data'].diff()
delta[delta>0]
delta[delta<100]
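A possible continuation of that idea (my own assumption, not part of the original answer): since the expected step between rows is 100, any larger step marks a gap that would need filling.

gaps = delta[delta > 100]
print(gaps)   # positions where rows are missing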
I have 60 HUGE csv files (around 2.5 GB each). Each covers data for a month and has a 'distance' column I am interested in. Each has around 14 million rows.
I need to find the average distance for each month.
This is what I have so far:
import pandas as pd

for x in range(1, 60):
    df = pd.read_csv(r'x.csv', error_bad_lines=False, chunksize=100000)
    for chunk in df:
        print df["distance"].mean()
First I know 'print' is not a good idea. I need to assign the mean to a variable I guess. Second, what I need is the average for the whole dataframe and not just each chunk.
But I don't know how to do that. I was thinking of getting the average of each chunk and taking the simple average of all the chunks. That should give me the average for the dataframe as long as chunksize is equal for all chunks.
Third, I need to do this for all of the 60 csv files. Is my looping for that correct in the code above? My files are named 1.csv to 60.csv.
A few things I would fix, based on how your files are named. I presume your files are named like "1.csv", "2.csv", and so on. Also remember that range is exclusive of its end, so you need to go up to 61 in the range.
import pandas as pd

distance_array = []
for x in range(1, 61):
    # read each month's file in chunks and collect every distance value
    reader = pd.read_csv(str(x) + ".csv", error_bad_lines=False, chunksize=100000)
    for chunk in reader:
        distance_array.extend(chunk['distance'])
print(sum(distance_array)/len(distance_array))
I am presuming that the datasets are too large to load into memory as a pandas dataframe. If that is the case, consider using a generator on each csv file, something similar to: Where to use yield in Python best?
As the overall result you are after is an average, you can accumulate the total sum over the rows and track how many rows you have seen, in incremental steps.
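A rough sketch of that running-total idea (my own; file names 1.csv to 60.csv and a 'distance' column are assumed), which keeps only one chunk in memory at a time and prints a per-month mean:

import pandas as pd

for x in range(1, 61):
    total, count = 0.0, 0
    for chunk in pd.read_csv(str(x) + ".csv", usecols=["distance"], chunksize=100000):
        total += chunk["distance"].sum()
        count += len(chunk)
    print(str(x) + ".csv: mean distance =", total / count)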
I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:
Geography AgeGroup Gender Race Count
County1 1 M 1 12
County1 2 M 1 3
County1 2 M 2 0
To:
Geography Count
County1 15
County2 23
This would be a simple matter if the whole file could fit in memory, but using pandas.read_csv() gives a MemoryError. So I have been looking into other methods, and there appear to be many options - HDF5? Using itertools (which seems complicated - generators?) Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write out before loading in another 70 lines.
Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming data in, because I can think of a lot of other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.
Edit: In this small case I only want the sums of count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add 2 columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.
You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:
import dask.dataframe as dd
df = dd.read_csv('my_file.csv')
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
Alternatively, if pandas is a requirement, you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.
import pandas as pd

# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
    data.append(chunk)

# Combine the chunked data.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
I do like @root's solution, but I would go a bit further in optimizing memory usage - keeping only the aggregated DF in memory and reading only those columns that you really need:
import pandas as pd

cols = ['Geography', 'Count']
df = pd.DataFrame()

chunksize = 2  # adjust it! for example --> 10**5
for chunk in pd.read_csv(filename,           # filename: path to your CSV
                         usecols=cols,
                         chunksize=chunksize):
    # merge previously aggregated DF with a new portion of data and aggregate it again
    df = (pd.concat([df,
                     chunk.groupby('Geography')['Count'].sum().to_frame()])
            .groupby(level=0)['Count']
            .sum()
            .to_frame()
         )

df.reset_index().to_csv('c:/temp/result.csv', index=False)
test data:
Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111
output.csv:
Geography,Count
County1,36
County2,147
County3,1122
County5,24
County6,33
County7,11
County8,111
County9,1111
PS: using this approach you can process huge files.
PPS: the chunking approach should work unless you need to sort your data - in that case I would use classic UNIX tools like awk, sort, etc. to sort the data first.
I would also recommend using PyTables (HDF5 storage) instead of CSV files - it is very fast, allows you to read data conditionally (using the where parameter), is very handy, saves a lot of resources, and is usually much faster compared to CSV.
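As a rough illustration of that HDF5 route (my own sketch; file names and the column list are assumptions) - pandas uses PyTables under the hood when writing in the queryable 'table' format:

import pandas as pd   # also requires the 'tables' package (PyTables)

# write the CSV to HDF5 once, in chunks, in a queryable format
with pd.HDFStore('data.h5') as store:
    for chunk in pd.read_csv('my_file.csv', usecols=['Geography', 'Count'], chunksize=10**5):
        store.append('records', chunk, data_columns=['Geography'])

# later: read only the rows you need, without loading the whole file
subset = pd.read_hdf('data.h5', 'records', where="Geography == 'County1'")
print(subset['Count'].sum())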
I've been working for a while on the problem I'm about to present and have tried to solve it by resorting to mailing lists and other online resources, but so far all my efforts have been unsuccessful. That's why I'm asking for your precious time to help me accomplish the following task:
I'm working on a dataset stored as an sqlite database that I converted into a csv document, in order to process it better with Python 2.7.
The data is arranged in a comma-separated csv format and reports hits of a sensor along with other info. It comes in 8 comma-separated fields of different data types, i.e. string, int and float. I'm only interested in the second field, which is where the datetime of the recorded hit is stored as a UNIX timestamp in milliseconds.
Unfortunately the sensors' built-in clock failed to remain up to date, and I was only able to recover an approximate corrected timestamp for a given day by other means.
Here's an example of what the data looks like:
sensor_ID,timestamp,z,w,k,j,n,human-readable_datetime
651,956684876150254,-0.1692345,0.623286,0.01442572,0.81455,-0.145732,"2000-01-01 00:01:16"
651,956684936161895,0.00526153,0.999893,0.00998516,0.898215,-0.155301,"2000-01-01 00:02:16"
651,956684996173593,0.00526153,0.999893,0.00988516,0.865215,-0.154301,"2000-01-01 00:03:16"
651,956685056185292,0.00526153,0.999893,0.00978676,0.883215,-0.159301,"2000-01-01 00:04:16"
651,956685116196912,0.00526153,0.999893,0.00922469,0.809862,-0.158607,"2000-01-01 00:05:16"
What I want to do is the following:
1) compare each of the timestamps in column #2 to a value corresponding to the retrieved, approximately corrected timestamp, which is stored in a separate file. This means: for each timestamp 'x' in column #2, subtract it from the correct timestamp 'y'; IF abs(y-x) > 60 seconds THEN CONTINUE (to step 2) ELSE QUIT
2) once a match is found and the subtraction yields a value > 60 secs, add a given fixed value (which I will call the 'syncing_value') to all the timestamps in the file, both backwards and forwards, and do this as long as they remain coherent, i.e. as long as the timestamps are out of date. This is because some sensors' clocks would stop synchronising but would go back to working normally after a software update.
3) write file to output and exit
For the sake of simplicity I'll attach the pseudocode I used; as for the actual code, I have so many different almost-working alternatives that I wouldn't know which one to present. I hope you understand.
This is my pseudocode, which somehow lacks some of the fundamental features I mentioned above:
import csv

for row in data:
    for x in row[1]:
        if x <= y:
            for i in range(x, len(data)-1):
                i = i + syncing_value
        else:
            exit
I really hope that what I want to achieve is clear to you and would really appreciate your help.
Thank you!
import csv

filename = 'csv.txt'  # csv file to read/write to
timestamp = 956684876150324  # correct timestamp
syncing_value = 90

# read contents of csv file into list
data = list(csv.reader(open(filename)))

# compare second column of every row after header (data[1:]) to correct timestamp;
# if any differ by more than 60 from the correct timestamp, add syncing_value to all
# timestamps in the file, and save changes.
if any(abs(int(row[1]) - timestamp) > 60 for row in data[1:]):
    for row in data[1:]:
        row[1] = int(row[1]) + syncing_value
    csv_writer = csv.writer(open(filename, 'w'))
    csv_writer.writerows(data)