Find average in large csv file using pandas - python

I have 60 HUGE csv files (around 2.5 GB each). Each covers data for one month and has a 'distance' column I am interested in. Each has around 14 million rows.
I need to find the average distance for each month.
This is what I have so far:
import pandas as pd

for x in range(1, 60):
    df = pd.read_csv(r'x.csv', error_bad_lines=False, chunksize=100000)
    for chunk in df:
        print df["distance"].mean()
First I know 'print' is not a good idea. I need to assign the mean to a variable I guess. Second, what I need is the average for the whole dataframe and not just each chunk.
But I don't know how to do that. I was thinking of getting the average of each chunk and taking the simple average of all the chunks. That should give me the average for the dataframe as long as chunksize is equal for all chunks.
Third, I need to do this for all 60 csv files. Is my loop for that correct in the code above? My files are named 1.csv to 60.csv.

A few things I would fix, based on how your files are named. I presume your files are named like "1.csv", "2.csv". Also remember that range is exclusive of the upper bound, so you would need to go to 61 in the range.
distance_array = []
for x in range(1, 61):
    df = pd.read_csv(str(x) + ".csv", error_bad_lines=False, chunksize=100000)
    for chunk in df:
        for index, row in chunk.iterrows():
            distance_array.append(row['distance'])
print(sum(distance_array) / len(distance_array))

I am presuming that the datasets are too large to load into memory as a single pandas dataframe. If that is the case, consider using a generator on each csv file, something similar to: Where to use yield in Python best?
As the overall result you are after is the average, you can accumulate a running total of the distances and keep a count of how many rows you have seen as you go.
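For example, a minimal sketch of that running-total idea, one file per month (the 1.csv to 60.csv names come from the question; error_bad_lines is kept from the original snippet):

import pandas as pd

for x in range(1, 61):
    total = 0.0
    count = 0
    # accumulate the sum of distances and the number of rows, chunk by chunk
    for chunk in pd.read_csv('{}.csv'.format(x), error_bad_lines=False, chunksize=100000):
        total += chunk['distance'].sum()
        count += len(chunk)
    print('{}.csv: average distance = {}'.format(x, total / count))

This avoids averaging the chunk means directly, so it stays correct even when the last chunk is smaller than the others.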

Related

generate output files with random samples from pandas dataframe

I have a dataframe with 500K rows. I need to distribute sets of 100 randomly selected rows to volunteers for labeling.
for example:
df = pd.DataFrame(np.random.randint(0,450,size=(450,1)),columns=list('a'))
I can remove a random sample of 100 rows and output a file with a timestamp:
df_subset=df.sample(100)
df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
df=df.drop(df_subset.index)
the above works but if I try to apply it to the entire example dataframe:
while len(df)>0:
    df_subset=df.sample(100)
    df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
    df=df.drop(df_subset.index)
It runs continuously. My expected output is 5 timestamped dfsample.csv files, 4 of which have 100 rows and the fifth 50 rows, all randomly selected from df. However, df.drop(df_subset.index) doesn't seem to update df, so the condition is always true and it runs forever generating csv files. I'm having trouble solving this problem.
Any guidance would be appreciated.
UPDATE
this gets me almost there:
for i in range(4):
    df_subset=df.sample(100)
    df=df.drop(df_subset.index)
    time.sleep(1) #added because runs too fast for unique naming
    df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
It requires me to specify the number of files, and if I say 5 for the example df I get an error on the 5th iteration. I hoped for 5 files, with the 5th having 50 rows, but I'm not sure how to do that.
After running your code, I think the problem is not with df.drop but with the line containing time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv', because the loop creates multiple CSV files within the same second, which can cause overwriting issues.
If you want to label files using a timestamp, going down to the microsecond level might be more useful and prevents the possibility of an overwrite. In your case:
from datetime import datetime

while len(df) > 0:
    # sample at most 100 rows so the final, smaller batch doesn't raise an error
    df_subset = df.sample(min(100, len(df)))
    df_subset.to_csv(datetime.now().strftime("%Y%m%d_%H%M%S.%f") + 'dfsample.csv')
    df = df.drop(df_subset.index)
Another way is to shuffle your rows and get rid of that awful loop.
df.sample(frac=1)
and save slices of the shuffled dataframe.
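For instance, a minimal sketch of that shuffle-and-slice idea, using the example frame from the question (the output file names are just illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 450, size=(450, 1)), columns=list('a'))

# shuffle once, then write consecutive 100-row slices;
# the last slice simply ends up with the remaining 50 rows
shuffled = df.sample(frac=1)
for i in range(0, len(shuffled), 100):
    shuffled.iloc[i:i + 100].to_csv('dfsample_{}.csv'.format(i // 100))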

How can I loop through just a certain part of a csv file?

I need to loop through certain rows in my CSV file, for example, row 231 to row 252. Then I want to add up the values that I get from calculating every row and divide the total by the number of rows I looped through. How would I do that?
I'm new to pandas so I would really appreciate some help on this.
I have a CSV file from Yahoo finance looking something like this (it has many more rows):
Date,Open,High,Low,Close,Adj Close,Volume
2019-06-06,31.500000,31.990000,30.809999,31.760000,31.760000,1257700
2019-06-07,27.440001,30.000000,25.120001,29.820000,29.820000,5235700
2019-06-10,32.160000,35.099998,31.780001,32.020000,32.020000,1961500
2019-06-11,31.379999,32.820000,28.910000,29.309999,29.309999,907900
2019-06-12,29.270000,29.950001,28.900000,29.559999,29.559999,536800
I have done the basic steps of importing pandas and all that. Then I added two variables corresponding to different columns so I can easily reference just those columns.
import pandas as pd
df = pd.read_csv(file_name)
high = df.High
low = df.Low
Then I tried doing something like this. I tried using .loc in a variable, but that didn't seem to work. This is maybe super dumb but I'm really new to pandas.
dates = df.loc[231:252, :]
for rows in dates:
    # calculations here
    # for example:
    print(high - low)
    # I would have a more complex calculation than this,
    # but for simplicity's sake let's stick with this.
The output of this is that it prints high - low for every row, 1 through 252, for example:
...
231 3.319997
232 3.910000
233 1.050001
234 1.850001
235 0.870001
...
But I only want this output for a certain range of rows. Then I want to add up all of those values and divide them by the number of rows I looped over. This part is simple, so you don't need to include it in your answer, but it's okay if you do.
Use skiprows and nrows. Keep the header, as per Python Pandas read_csv skip rows but keep header, by passing a range to skiprows that starts at 1.
In [9]: pd.read_csv("t.csv",skiprows=range(1,3),nrows=2)
Out[9]:
Date Open High Low Close Adj Close Volume
0 2019-06-10 32.160000 35.099998 31.780001 32.020000 32.020000 1961500
1 2019-06-11 31.379999 32.820000 28.910000 29.309999 29.309999 907900
.loc slices by label. For integer-position slicing, use .iloc:
dates = df.iloc[231:252]
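Putting the two together for the calculation described in the question, a short sketch (reusing file_name from the question's code; the 231:252 range matches the example above):

import pandas as pd

df = pd.read_csv(file_name)

# integer-position slice of the rows of interest
rows = df.iloc[231:252]

# sum of High - Low over those rows, divided by the number of rows
spread = rows['High'] - rows['Low']
print(spread.sum() / len(rows))   # equivalent to spread.mean()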

Latency in large file analysis with python pandas

I have a very large file (10 GB) with approximately 400 billion lines; it is a pipe-delimited csv with three fields. The first field is an ID, the second is the current position of that ID, and the third is a sequential number assigned to the row.
Similar to this:
41545496|4154|1
10546767|2791|2
15049399|491|3
38029772|491|4
15049399|1034|5
My intention is to create a fourth column (old position), in another file or the same one, which stores the position at which that ID last appeared. What I do is check whether the ID number has already appeared before; if it has, I look up its last appearance and assign to its old-position field the position it had at that appearance. If the ID has not appeared before, I assign to its old position the current position it has in that same row.
Something like this:
41545496|4154|1|4154
10546767|2791|2|2791
15049399|491|3|491
38029772|491|4|491
15049399|1034|5|491
I have created a program that reads and analyses the file, but it processes only about 10 thousand lines per minute, so reading the entire file would take more than 5 days.
import pandas as pd

with open('file_in.csv', 'rb') as inf:
    df = pd.read_csv(inf, sep='|', header=None)

cont = 0
df[3] = 0

def test(x):
    global cont
    a = df.iloc[:cont, 0]
    try:
        index = a[a == df[0][cont]].index[-1]
        df[3][cont] = df[1][index]
    except IndexError:
        df[3][cont] = df[1][cont]
    cont += 1

df.apply(test, axis=1)
df.to_csv('file_out.csv', sep='|', index=False, header=False)
I have a computer with 64 processors and 64 GB of RAM at the university, but it still takes a very long time. Is there any way to reduce that time? Thank you very much!
Processing the data efficiently
You have two main problems in your approach:
That amount of data should have never been written to a text file
Your approach needs (n^2/2) comparisons
A better idea is to index-sort your array before doing the actual work. Then you need only about 2n operations for the comparisons and n*log(n) operations for the sort in the worst case.
I also used numba to compile that function, which speeds up the computation by a factor of 100 or more.
import numpy as np
import numba as nb

#This function isn't very hard to vectorize, but I expect better
#performance and easier understanding when doing it in this way
@nb.njit()
def last_IDs(data, idx_1):
    #I assume that all values in the second column are positive
    res = np.zeros(data.shape[0], dtype=np.int64) - 1
    for i in range(1, data.shape[0]):
        if (data[idx_1[i], 0] == data[idx_1[i-1], 0]):
            res[idx_1[i]] = data[idx_1[i-1], 1]
    #IDs with no earlier appearance keep their current position
    same_ID = res == -1
    res[same_ID] = data[same_ID, 1]
    return res

#the hardest thing to do efficiently
data = np.genfromtxt('Test.csv', delimiter='|', dtype=np.int64)

#it is important that we use a stable sort algorithm here
idx_1 = np.argsort(data[:, 0], kind='mergesort')

column_4 = last_IDs(data, idx_1)
For performant writing and reading of data, have a look at: https://stackoverflow.com/a/48997927/4045774
If you don't get at least 100 MB/s I/O speed, please ask.
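As one hedged illustration of the binary-format point (not necessarily the format the linked answer uses), the integer data could be stored as a native .npy file instead of CSV:

import numpy as np

# assumes `data` and `column_4` from the snippet above
out = np.column_stack((data, column_4))

# binary .npy files are far faster to write and read back than text
np.save('file_out.npy', out)
restored = np.load('file_out.npy')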

How to stream in and manipulate a large data file in python

I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:
Geography AgeGroup Gender Race Count
County1 1 M 1 12
County1 2 M 1 3
County1 2 M 2 0
To:
Geography Count
County1 15
County2 23
This would be a simple matter if the whole file could fit in memory, but using pandas.read_csv() gives a MemoryError. So I have been looking into other methods, and there appear to be many options - HDF5? Using itertools (which seems complicated - generators?)? Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write it out before loading in another 70 lines.
Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming data in, especially because I can think of a lot of other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.
Edit: In this small case I only want the sums of count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add 2 columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.
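A minimal sketch of that "most basic functionality" route, assuming a comma-delimited file with the header shown above (the file names are just placeholders):

import csv
from collections import defaultdict

totals = defaultdict(int)

# stream the file one row at a time, keeping only the running sums in memory
with open('my_file.csv', newline='') as f:
    for row in csv.DictReader(f):
        totals[row['Geography']] += int(row['Count'])

with open('my_output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Geography', 'Count'])
    for geography, count in sorted(totals.items()):
        writer.writerow([geography, count])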
You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:
import dask.dataframe as dd
df = dd.read_csv('my_file.csv')
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
Alternatively, if pandas is a requirement, you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.
# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
    data.append(chunk)

# Combine the chunked data.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
I do like @root's solution, but I would go a bit further in optimizing memory usage - keeping only the aggregated DF in memory and reading only those columns that you really need:
cols = ['Geography','Count']
df = pd.DataFrame()
chunksize = 2   # adjust it! for example --> 10**5

for chunk in pd.read_csv(filename, usecols=cols, chunksize=chunksize):
    # merge previously aggregated DF with a new portion of data and aggregate it again
    df = (pd.concat([df,
                     chunk.groupby('Geography')['Count'].sum().to_frame()])
            .groupby(level=0)['Count']
            .sum()
            .to_frame()
         )

df.reset_index().to_csv('c:/temp/result.csv', index=False)
test data:
Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111
output.csv:
Geography,Count
County1,36
County2,147
County3,1122
County5,24
County6,33
County7,11
County8,111
County9,1111
PS: using this approach you can process huge files.
PPS: the chunking approach should work unless you need to sort your data - in that case I would use classic UNIX tools like awk, sort, etc. to sort the data first.
I would also recommend using PyTables (HDF5 storage) instead of CSV files - it is very fast, allows you to read data conditionally (using the where parameter), is very handy, saves a lot of resources, and is usually much faster compared to CSV.
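A minimal sketch of that conditional-read idea via pandas' HDFStore interface (PyTables must be installed; the store path, key, and frame here are illustrative):

import pandas as pd

# hypothetical frame to store; any DataFrame with a 'Geography' column works
df = pd.read_csv('my_file.csv', usecols=['Geography', 'Count'])

# write once in table format so queries with `where` are possible
df.to_hdf('store.h5', key='data', format='table', data_columns=['Geography'])

# later, read back only the rows you actually need
subset = pd.read_hdf('store.h5', 'data', where="Geography == 'County1'")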

How to write a pandas DataFrame to a CSV by fixed size chunks

I need to output data from pandas into CSV files to interact with a 3rd party developed process.
The process requires that I pass it no more than 100,000 records in a file, or it will cause issues (slowness, perhaps a crash).
That said, how can I write something that takes a dataframe in pandas and splits it into frames of 100,000 records? Nothing would be different other than that the exported dataframes would be subsets of the parent dataframe.
I assume I could do a loop with something like this, but it seems it would be remarkably inefficient:
first take recordcount = len(df.index) to get the number of records, and then loop until I get there, using something like
df1 = df[currentrecord:currentrecord+100000,]
And then exporting that to a CSV file
There has to be an easier way.
You can try something like this:
def save_df(df, chunk_size=100000):
    df_size = len(df)
    for i, start in enumerate(range(0, df_size, chunk_size)):
        df[start:start+chunk_size].to_csv('df_name_{}.csv'.format(i))
You could add a column with a group, and then use the function groupby:
df1['Dummy'] = [a for b in zip(*[range(N)] * 100000) for a in b][:len(df1)]
Where N is set to a value large enough, the minimum being:
N = int(np.ceil(len(df1) / 100000))
Then group by that column and apply function to_csv():
def save_group(df):
    df.drop('Dummy', axis=1).to_csv("Dataframe_" + str(df['Dummy'].iloc[0]) + ".csv")

df1.groupby('Dummy').apply(save_group)
