Let's say I have a .dat file, filename.dat, and I wish to read this into a pandas DataFrame:
import pandas as pd
df = pd.read_table('filename.dat')
Is there a size limit here? I was hoping to save each column of the DataFrame to its own file, for an input file of about 1 TB. Is this possible?
To expand on the usage of chunksize mentioned in the comments, I'd do something like the following:
chunks = pd.read_table('filename.dat', chunksize=10**5)
fileout = 'filename_{}.dat'

for i, chunk in enumerate(chunks):
    # Create the files and write headers on the first chunk, then append.
    mode = 'w' if i == 0 else 'a'
    header = i == 0
    for col in chunk.columns:
        chunk[col].to_csv(fileout.format(col), index=False, header=header, mode=mode)
You'll probably want to experiment with the chunksize parameter to see what's most efficient for your data.
The reason I'm using enumerate is to create a new file with a header when the first chunk is read in, and append without a header for subsequent chunks.
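If the per-column files end up large themselves, they can be streamed back the same way. A minimal sketch, assuming one of the files generated above is named filename_somecol.dat (the column name is a placeholder):

import pandas as pd

# Hypothetical output file from the loop above; substitute a real column name.
col_file = 'filename_somecol.dat'

# Stream the single-column file back in pieces rather than loading it whole.
for chunk in pd.read_csv(col_file, chunksize=10**5):
    print(chunk.shape)  # replace with whatever per-chunk processing you need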
Related
I have a .csv file with over 50k rows. I would like to divide it into smaller chunks and save them as separate .csv files. I'm not sure if pandas is the best approach here (if not, I'm open to any suggestions).
My goal: read the file, find the number of rows in the dataframe, divide the dataframe into chunks (3000 rows per file, each including the header row), and save them as separate .csv files.
My code so far:
import os
import pandas as pd
i = 0
while os.path.exists("output/path/chunk%s.csv" % i):
    i += 1
size = 3000
df = pd.read_csv('/input/path/input.csv')
list_of_dfs = [df.loc[i:i+size-1, :] for i in range(0, len(df), size)]
for x in list_of_dfs:
    x.to_csv('/output/path/chunk%s.csv' % i, index=False)
The above code didn't throw any error, but it created only one file ('chunk0.csv') with 1439 rows instead of 3000.
Could someone help me with this? Thanks in advance!
Use DataFrame.groupby with integer division of the index values by size, then loop and write each group to a file, using f-strings for the file names:
size = 3000
df = pd.read_csv('/input/path/input.csv')
for i, g in df.groupby(df.index // size):
    g.to_csv(f'/output/path/chunk{i}.csv', index=False)
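Note that df.index // size relies on the default RangeIndex that read_csv assigns. If the index is something else, grouping by row position with numpy works the same way; a minimal sketch:

import numpy as np
import pandas as pd

size = 3000
df = pd.read_csv('/input/path/input.csv')

# Group by row position rather than by index label, so any index works.
for i, g in df.groupby(np.arange(len(df)) // size):
    g.to_csv(f'/output/path/chunk{i}.csv', index=False)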
You may be interested in the chunksize parameter of pd.read_csv. You can use it this way:
size = 3000
filename = '/input/path/input.csv'
for i, chunk in enumerate(pd.read_csv(filename, chunksize=size)):
    chunk.to_csv(f"output/path/chunk{i}.csv", index=False)
So I'm a bit stuck on how to save the chunks into their own separate csv files.
Here is what I'm trying to do:
Rows 1-1000    -> bracelet_no_variants_1000.csv
Rows 1001-2000 -> bracelet_no_variants_2000.csv
Rows 2001-3000 -> bracelet_no_variants_3000.csv
import numpy as np
import pandas as pd

df = pd.read_csv("bracelet_no_variants.csv")

def split(df, chunk_size):
    indices = index_marks(df.shape[0], chunk_size)
    return np.split(df, indices)

chunks = split(df, 1000)
for c in chunks:
    print(c)
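For reference, index_marks is not shown in the question. A minimal sketch of what such a helper presumably does (compute the row positions at which np.split should cut) could be:

def index_marks(nrows, chunk_size):
    # Split positions: chunk_size, 2*chunk_size, ...; a trailing mark at or past
    # nrows is harmless, np.split just produces a shorter (or empty) final piece.
    return range(chunk_size, (nrows // chunk_size + 1) * chunk_size, chunk_size)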
The following should work:
df = pd.read_csv("bracelet_no_variants.csv")
l=[i*1000 for i in range(len(df)//1000+1)]+[len(df)]
for i in range(len(l)-1):
temp=df.iloc[l[i]:l[i+1]]
temp.to_csv('bracelet_no_variants_'+str(l[i+1])+'.csv')
You can use the chunksize parameter of the read_csv function and iterate over the chunks to save them to csv:
pd.read_csv("bracelet_no_variants.csv",chunksize=1000)
counter=1000
for chunk in chunks:
chunk.to_csv("bracelet_no_variants_"+str(counter)+".csv")
counter=counter+1000
You might need to pass index=False to the to_csv function to stop saving the index as a column.
If you don't want the header in every csv file, set header=True for the first iteration only:
pd.read_csv("bracelet_no_variants.csv",chunksize=1000)
counter=1000
header=True
for chunk in chunks:
chunk.to_csv("bracelet_no_variants_"+str(counter)+".csv",header=header)
counter=counter+1000
header=False
I have a 1 GB, 70M-row file, and any time I load the whole thing it runs out of memory. I have read in 1000 rows and been able to prototype what I'd like it to do.
My problem is not knowing how to get the next 1000 rows, apply my logic, and then keep going through the file until the last rows are done. I've read about chunksize, although I can't figure out how to keep the iteration going over the chunks.
Ideally, it would flow like such:
1) Read in the first 1000 rows
2) Filter data based on criteria
3) Write to csv
4) Repeat until no more rows
Here's what I have so far:
import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000, iterator=True)
data = data[data['visits'] > 10]

with open('data.csv', 'a') as f:
    data.to_csv(f, sep=',', index=False, header=False)
You have some problems with your logic: we want to loop over each chunk of the data, not the data object itself.
The chunksize argument gives us a TextFileReader object that we can iterate over.
import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000)

for chunk in data:
    chunk = chunk[chunk['visits'] > 10]
    # Append so each chunk adds to data.csv instead of overwriting it.
    chunk.to_csv('data.csv', mode='a', index=False, header=False)
You will need to think about how to handle your header!
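One way to handle it, sketched here, is to write the header only for the first chunk (creating the file) and append without a header afterwards:

import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000)

for i, chunk in enumerate(data):
    filtered = chunk[chunk['visits'] > 10]
    # First chunk: create the file and write the header; later chunks: append without it.
    filtered.to_csv('data.csv', mode='w' if i == 0 else 'a',
                    header=(i == 0), index=False)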
When you pass a chunksize or iterator=True, pd.read_table returns a TextFileReader that you can iterate over or call get_chunk on. So you need to iterate or call get_chunk on data.
So proper handling of your entire file might look something like:
import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000, iterator=True)

with open('data.csv', 'a') as f:
    for chunk in data:
        chunk[chunk.visits > 10].to_csv(f, sep=',', index=False, header=False)
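As a side note, on newer pandas versions (1.2+, as far as I know) the reader can also be used as a context manager, which closes the underlying file handle when iteration is done; a sketch:

import pandas as pd

with pd.read_table('datafile.txt', sep='\t', chunksize=1000) as reader, \
        open('data.csv', 'a') as f:
    for chunk in reader:
        chunk[chunk.visits > 10].to_csv(f, sep=',', index=False, header=False)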
I have been using pandas on csv files to get some values out of them. My data looks like this:
"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"
I have a simple script to read the csv and compute the frequencies of the WORDs by group, so the output is like:
group  freqW1  freqW2
A      1       0
B      1       0
C      0       1
Then I do some other operations on the values. The problem is that now I have to deal with very large csv files (20+ GB) that can't be held in memory. I tried the chunksize=x option in pd.read_csv, but because a 'TextFileReader' object is not subscriptable, I can't do the necessary operations on the chunks.
I suspect there is some easy way to iterate through the csv and do what I want.
My code is like this:
df = pd.read_csv("csvfile.txt", sep=",", header = None,names=
["group","val1","val2","text"])
freq=Counter(df['group'])
word1=df[df["text"].str.contains("WORD1")].groupby("group").size()
word2=df[df["text"].str.contains("WORD2")].groupby("group").size()
df1 = pd.concat([pd.Series(freq),word1,word2], axis=1)
outfile = open("csv_out.txt","w", encoding='utf-8')
df1.to_csv(outfile, sep=",")
outfile.close()
You can specify a chunksize option in the read_csv call; see the pandas documentation for details.
Alternatively, you could use the Python csv library and create your own csv reader or DictReader, and then use that to read in data in whatever chunk size you choose.
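A minimal sketch of that standard-library route, batching rows with itertools.islice (the column order is assumed to match the sample data in the question):

import csv
from collections import Counter
from itertools import islice

freq, word1, word2 = Counter(), Counter(), Counter()

with open('csvfile.txt', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    while True:
        batch = list(islice(reader, 100000))  # read 100k rows at a time
        if not batch:
            break
        for group, val1, val2, text in batch:
            freq[group] += 1
            if 'WORD1' in text:
                word1[group] += 1
            if 'WORD2' in text:
                word2[group] += 1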
Okay, I misunderstood the chunksize parameter. I solved it by doing this:
frame = pd.DataFrame()
chunks = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"], chunksize=1000000)

for df in chunks:
    freq = Counter(df['group'])
    word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
    word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
    df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
    # Accumulate the per-chunk counts into the running totals.
    frame = frame.add(df1, fill_value=0)

outfile = open("csv_out.txt", "w", encoding='utf-8')
frame.to_csv(outfile, sep=",")
outfile.close()
I get a "TypeError: 'TextFileReader' object does not support item assignment" error when I try to add columns, modify header names, etc. in chunks.
My issue is I am using a slow work laptop to process a pretty large file (10 million rows). I want to add some simple columns (1 or 0 values), concatenate two columns to create a unique ID, change the dtype for other columns, and rename some headers so they match with other files that I will .merge later. I could probably split this csv (maybe select date ranges and make separate files), but I would like to learn how to use chunksize or deal with large files in general without running into memory issues. Is it possible to modify a file in chunks and then concatenate them all together later?
I am doing a raw data clean up which will then be loaded into Tableau for visualization.
Example (reading/modifying 10 million rows):
rep = pd.read_csv(r'C:\repeats.csv.gz',
                  compression='gzip', parse_dates=True,
                  usecols=['etc', 'stuff', 'others', '...'])
rep.sort()
rep['Total_Repeats'] = 1
rep.rename(columns={'X': 'Y'}, inplace=True)
rep.rename(columns={'Z': 'A'}, inplace=True)
rep.rename(columns={'B': 'C'}, inplace=True)
rep['D'] = rep['E'] + rep['C']
rep.rename(columns={'L': 'M'}, inplace=True)
rep.rename(columns={'N': 'O'}, inplace=True)
rep.rename(columns={'S': 'T'}, inplace=True)
If you pass the chunksize keyword to pd.read_csv, it returns an iterator over the csv, and you can write the processed chunks with the to_csv method in append mode. You will be able to process a large file this way, but you can't sort the whole dataframe.
import pandas as pd

reader = pd.read_csv(r'C:\repeats.csv.gz',
                     compression='gzip', parse_dates=True, chunksize=10000,
                     usecols=['etc', 'stuff', 'others', '...'])

output_path = 'output.csv'
for chunk_df in reader:
    chunk_result = do_something_with(chunk_df)  # placeholder for your per-chunk processing
    chunk_result.to_csv(output_path, mode='a', header=False)
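As a rough illustration, do_something_with could contain the kinds of transformations described in the question; the column names below are hypothetical placeholders, not real columns from repeats.csv.gz:

def do_something_with(chunk_df):
    # Hypothetical column names; substitute the real ones from your file.
    chunk_df = chunk_df.rename(columns={'X': 'Y', 'Z': 'A', 'B': 'C'})
    chunk_df['Total_Repeats'] = 1
    # Concatenate two columns to build a unique ID.
    chunk_df['D'] = chunk_df['E'].astype(str) + chunk_df['C'].astype(str)
    return chunk_df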
Python's usually pretty good at this, as long as you avoid calling .read() on the whole thing when working with large files.
If you just use the iterators, you should be fine:
with open('mybiginputfile.txt', 'rt') as in_file:
    with open('mybigoutputfile.txt', 'wt') as out_file:
        for row in in_file:
            # 'do something' with the row here
            out_file.write(row)
Someone who knows more will explain how the memory side of it works, but this works for me on multi-GB files without crashing Python.
You might want to chuck the data into a proper DB before killing your laptop with the task of serving up the data AND running Tableau too!