Join two huge files without chunking with pandas - python

I have File1 with "id,name" and File2 with "id,address". I cannot load the first file (less than 2GB): it crashes after 76k rows (with chunk concat) even though it has only 2 columns... I cannot read_csv the second file either, because it crashes the kernel after loading some rows.
I need to join File1 and File2 on "id", but if I cannot put the files into dataframe variables I don't know how to do it...
The second file is only 5GB with 30M rows, but it crashes the kernel after a few seconds of loading.
How can I join the files without putting them into dataframes?
I have tried chunking, but it crashes.
import pandas as pd

chunks = []
cols = [...]
for chunk in pd.read_csv("file2.csv", chunksize=500000, sep=',', error_bad_lines=False, low_memory=False, usecols=cols):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)
print(df.shape)
I need to load the dataframes to join them, or to join the files without loading them, if that is possible.

You read df2 chunk by chunk, but since you append all the chunks, the resulting dataframe is the same size as file2.
What you could do, if you are able to fully load your df1, is to merge df2 into it chunk by chunk and concatenate only the matches, like so:
merged_chunks = []
for chunk in pd.read_csv("file2.csv", chunksize=500000, sep=',', error_bad_lines=False, low_memory=False, usecols=cols):
    merged_chunks.append(df1.merge(chunk, on='id', how='inner'))
df = pd.concat(merged_chunks, ignore_index=True)

Chunking like that will definitely still crash your kernel, since you're still trying to fit everything into memory. You need to do something to your chunks to reduce their size.
For instance, you could read both files in chunks, join each chunk, output the matches to another file, and keep the un-matched IDs in memory. That might still crash your kernel if you get unlucky though. It depends on what your performance constraints are, and what you need to do with your data afterwards.
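A minimal sketch of that idea, assuming File1 (id, name) fits in memory as in the answer above; the file names, column names and chunk size are placeholders taken from the question:
import pandas as pd

# Load the small file (id, name) once.
df1 = pd.read_csv("file1.csv", usecols=["id", "name"])

# Stream the big file (id, address), join each chunk, and append the matches
# to an output file so the full result never has to sit in memory.
first = True
for chunk in pd.read_csv("file2.csv", chunksize=500000, usecols=["id", "address"]):
    matches = df1.merge(chunk, on="id", how="inner")
    matches.to_csv("joined.csv", mode="w" if first else "a", header=first, index=False)
    first = False
Rows of File1 whose id never appears in File2 are dropped here; a final pass comparing the written ids against df1 would recover them if needed.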

Related

read_csv stops at 100000

I am trying to import a .csv file from my Downloads folder.
Usually, the read_csv function imports all the rows, even when there are millions of them.
In this case, my file has 236,905 rows, but exactly 100,000 are loaded.
df = pd.read_csv(r'C:\Users\user\Downloads\df.csv',nrows=9999999,low_memory=False)
I came across the same problem with a file containing 5M rows.
I first tried this option:
tp = pd.read_csv('yourfile.csv', iterator=True, chunksize=1000)
data_customers = pd.concat(tp, ignore_index=True)
It did work, but in my case some rows were not read properly, since some columns contained the ',' character, which read_csv uses as the delimiter.
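If the extra commas sit inside quoted fields (an assumption about the data), read_csv already keeps them together through its default quotechar; a tiny illustration:
import io
import pandas as pd

# A quoted field keeps its embedded comma with the default quotechar='"'.
sample = 'id,name\n1,"Doe, John"\n'
print(pd.read_csv(io.StringIO(sample)))  # one row, two columns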
The other solution is to use Dask. It has an object called "DataFrame" (like pandas). Dask reads your file and constructs a Dask dataframe composed of several pandas dataframes.
It's a great solution for parallel computing.
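A minimal Dask sketch, with a hypothetical file name and filter column:
import dask.dataframe as dd

# Dask splits the CSV into many pandas partitions and only computes on demand,
# so the whole file is never held in memory at once.
ddf = dd.read_csv('yourfile.csv')
filtered = ddf[ddf['some_column'] == 'some_value']  # hypothetical filter
result = filtered.compute()  # materialize the (smaller) result as a pandas DataFrame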
Hope it helps
You need to create chunks using the chunksize= parameter:
temporary = pd.read_csv(r'C:\Users\user\Downloads\df.csv', iterator=True, chunksize=1000)
df = pd.concat(temporary, ignore_index=True)
ignore_index=True resets the index so it does not repeat across chunks.

Melt a big data frame without pandas

I have a 3GB dataset with 40k rows and 60k columns which pandas is unable to read, and I would like to melt the file based on the current index.
The first column of the file is the index, and I would like to melt the whole file based on it.
I tried pandas and Dask, but they all crash when reading the big file.
Do you have any suggestions?
Thanks
You need to use the chunksize parameter of pandas. See for example How to read a 6 GB csv file with pandas.
You will process N rows at a time, without loading the whole dataframe. N will depend on your computer: a low N costs less memory, but it increases the run time and the IO load.
# create an object reading your file 100 rows at a time
reader = pd.read_csv('bigfile.tsv', sep='\t', header=None, chunksize=100)
# process each chunk at a time
for chunk in reader:
    result = chunk.melt()
    # append the results to a new file
    result.to_csv('bigfile_melted.tsv', header=False, sep='\t', mode='a', index=False)
Furthermore, you can pass dtype=np.int32 to read_csv if your data is integer, or dtype=np.float32 to process the data faster if you do not need double precision.
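For instance, with the headerless file above, the dtype can be given per column position; the column indices and types below are assumptions about the data:
import numpy as np
import pandas as pd

# int32/float32 take half the memory of the default int64/float64.
reader = pd.read_csv('bigfile.tsv', sep='\t', header=None, chunksize=100,
                     dtype={0: np.int32, 1: np.float32})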
NB: here you have examples of memory usage: Using Chunksize in Pandas.

How to read a few lines in a large CSV file with pandas?

I have a CSV file that doesn't fit into my system's memory. Using Pandas, I want to read a small number of rows scattered all over the file.
I think that I can accomplish this without pandas following the steps here: How to read specific lines of a large csv file
In pandas, I am trying to use skiprows to select only the rows that I need.
# FILESIZE is the number of lines in the CSV file (~600M)
# rows2keep is an np.array with the line numbers that I want to read (~20)
rows2skip = (row for row in range(0,FILESIZE) if row not in rows2keep)
signal = pd.read_csv('train.csv', skiprows=rows2skip)
I would expect this code to return a small dataframe pretty fast. However, what it does is start consuming memory over several minutes until the system becomes unresponsive. I'm guessing that it reads the whole dataframe first and only gets rid of the rows in rows2skip afterwards.
Why is this implementation so inefficient? How can I efficiently create a dataframe with only the lines specified in rows2keep?
Try this
train = pd.read_csv('file.csv', iterator=True, chunksize=150000)
If you only want to read the first n rows:
train = pd.read_csv(..., nrows=n)
If you only want to read rows n to n+100:
train = pd.read_csv(..., skiprows=n, nrows=100)
chunksize should help in limiting the memory usage. Alternatively, if you only need a few lines, a possible way is to first read the required lines outside of pandas and then feed read_csv only that subset. The code could be:
import io
lines = [line for i, line in enumerate(open('train.csv')) if i in lines_to_keep]
signal = pd.read_csv(io.StringIO(''.join(lines)))

python pandas read and process a huge csv in chunks

I am trying to process a huge csv file with pandas.
First, I ran into a memory error when loading the file. I was able to fix it with this:
df = pd.read_csv('data.csv', chunksize=1000, low_memory=False)
device_data = pd.concat(df, ignore_index=True)
However, I still get memory errors when processing the "device_data" with multiple filters
Here are my questions:
1- Is there any way to get rid of memory errors when processing the dataframe loaded from that huge csv?
2- I have also tried adding filter conditions while concatenating the dataframe from the iterator, referring to this link: How can I filter lines on load in Pandas read_csv function?
iter_csv = pd.read_csv('data.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['ID'] == 1234567] for chunk in iter_csv])
However, the number of results seems much lower than it should be. Does anyone have any advice?
Thanks.
Update on 2019/02/19
I have managed to load the csv this way. However, I noticed that the number of results (shown in df.shape) varies with different chunksize values...
iter_csv = pd.read_csv('data.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['ID'] == 1234567] for chunk in iter_csv])
df.shape

Reading part of a csv file

I have a really large csv file, about 10GB. Whenever I try to read it into an IPython notebook using
data = pd.read_csv("data.csv")
my laptop gets stuck. Is it possible to just read, say, 10,000 rows or 500 MB of the csv file?
It is possible. You can create an iterator yielding chunks of your csv of a certain size at a time as a DataFrame by passing iterator=True with your desired chunksize to read_csv.
df_iter = pd.read_csv('data.csv', chunksize=10000, iterator=True)
for iter_num, chunk in enumerate(df_iter, 1):
    print(f'Processing iteration {iter_num}')
    # do things with chunk
Or, more briefly:
for chunk in pd.read_csv('data.csv', chunksize=10000):
    # do things with chunk
Alternatively if there was just a specific part of the csv you wanted to read, you could use the skiprows and nrows options to start at a particular line and subsequently read n rows, as the naming suggests.
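For instance, a short sketch that skips the first million data rows while keeping the header line, then reads the next 10,000 (the file name is a placeholder):
import pandas as pd

# range(1, 1000001) skips data rows 1..1,000,000 but keeps header row 0.
part = pd.read_csv('data.csv', skiprows=range(1, 1000001), nrows=10000)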
Likely a memory issue. On read_csv you can set chunksize (where you can specify number of rows).
Alternatively, if you don't need all the columns, you can change usecols on read_csv to import only the columns you need.
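A short sketch combining the two, with hypothetical column names:
import pandas as pd

# Only pull the needed columns, 10,000 rows at a time.
for chunk in pd.read_csv('data.csv', usecols=['id', 'value'], chunksize=10000):
    print(chunk.shape)  # process each piece here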
