I am trying to use multiprocessing to read the CSV file faster than a plain read_csv.
df = pd.read_csv('review-1m.csv', chunksize=10000)
But the df I get is not a DataFrame; it is an object of type pandas.io.parsers.TextFileReader. So I try to use
df = pd.concat(df, ignore_index=True)
to convert df into a DataFrame. But this step takes so long that the overall result is not much faster than using read_csv directly. Does anyone know how to speed up this conversion into a DataFrame?
pd.read_csv() is likely going to give you the same read time as any other method. If you want a real performance increase, you should change the format in which you store your file.
http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
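For example, a minimal sketch of a one-time conversion to pandas' pickle format (the file names are placeholders); later loads skip CSV parsing entirely:

import pandas as pd

# pay the CSV parsing cost once, then store in a binary format
df = pd.read_csv('review-1m.csv')
df.to_pickle('review-1m.pkl')

# subsequent loads are much faster than re-parsing the CSV
df = pd.read_pickle('review-1m.pkl')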
I am trying to import a .csv file from my Downloads folder.
Usually, the read_csv function imports all rows, even when there are millions of them.
In this case, my file has 236,905 rows, but exactly 100,000 are loaded.
df = pd.read_csv(r'C:\Users\user\Downloads\df.csv',nrows=9999999,low_memory=False)
I came across the same problem with a file containing 5M rows.
I first tried this option:
tp = pd.read_csv('yourfile.csv', iterator=True, chunksize=1000)
data_customers = pd.concat(tp, ignore_index=True)
It did work, but in my case some rows were not read properly, since some columns contained the character ',', which is also the delimiter used by read_csv.
The other solution is to use Dask. It has an object called "DataFrame" (like pandas). Dask reads your file and constructs a Dask DataFrame composed of several pandas DataFrames.
It's a great solution for parallel computing.
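A rough sketch (the file name is a placeholder):

import dask.dataframe as dd

# Dask reads the file lazily as a collection of pandas DataFrames
ddf = dd.read_csv('yourfile.csv')
# materialize a single pandas DataFrame once the result fits in memory
df = ddf.compute()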
Hope it helps
You need to create chunks using the chunksize= parameter:
temporary = pd.read_csv(r'C:\Users\user\Downloads\df.csv', iterator=True, chunksize=1000)
df = pd.concat(temporary, ignore_index=True)
ignore_index=True resets the index so it doesn't repeat from chunk to chunk.
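Note that concatenating all the chunks rebuilds the full DataFrame in memory, so this mainly helps when each chunk can be processed on its own. A sketch of that pattern (the row count is just an illustration):

temporary = pd.read_csv(r'C:\Users\user\Downloads\df.csv', chunksize=1000)
total_rows = 0
for chunk in temporary:
    # replace this with whatever per-chunk work you actually need
    total_rows += len(chunk)
print(total_rows)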
For me, working remotely means accessing big CSV files on a server, which take a long time to download to a local hard drive.
I've tried to speed this process up with a bit of Python, reading in only the particular columns I require. Ideally, though, if I could read in data for those columns only after a given date (e.g. > 2019-01-04), it would significantly reduce the amount of data.
My existing code reads in the whole file and then applies a date filter. I'm wondering if it's possible to apply that date filter while reading the file in the first place. I appreciate this might not be possible.
Example code:
import pandas as pd
fields = ['a','b','c'...]
data1 = pd.read_csv(r'SomeForeignDrive.csv', error_bad_lines=False,usecols=fields)
data1['c']=pd.to_datetime(data1['c'], errors='coerce')
data1 = data1.dropna()
data1 = data1[data1['c'] > '2019-01-04']
data1.to_csv(r'SomeLocalDrive.csv')
It's not possible to start reading a file from a specific date, but you can use the following workaround: read only the date column and find the index of the row where you want to start, then read the whole file again, skipping all rows before that index:
df = pd.read_csv('path', usecols=['date'])
df['date'] = pd.to_datetime(df['date'])
idx = df[df['date'] > '2019-01-04'].index[0]
df = pd.read_csv('path', skiprows=range(1, idx + 1))  # keep the header row, skip the earlier data rows
read_csv docs:
Using this parameter (usecols) results in much faster parsing time and lower memory usage.
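Alternatively, a sketch that applies the date filter chunk by chunk while reading, so the full file never sits in memory at once (the column names and cutoff date are taken from the question's code):

import pandas as pd

chunks = pd.read_csv('SomeForeignDrive.csv', usecols=['a', 'b', 'c'],
                     parse_dates=['c'], chunksize=100000)
# keep only the rows after the cutoff from each chunk
data1 = pd.concat(chunk[chunk['c'] > '2019-01-04'] for chunk in chunks)
data1.to_csv('SomeLocalDrive.csv', index=False)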
I have a 3GB dataset with 40k rows and 60k columns, which pandas is unable to read, and I would like to melt the file based on the current index.
The first column is the index, and I would like to melt the rest of the file based on it.
I tried pandas and dask, but both crash when reading the big file.
Do you have any suggestions?
thanks
You need to use the chunksize property of pandas. See for example How to read a 6 GB csv file with pandas.
You will process N rows at a time, without loading the whole dataframe into memory. N depends on your computer: a small N costs less memory, but it increases the run time and the I/O load.
# create an object reading your file 100 rows at a time
reader = pd.read_csv('bigfile.tsv', sep='\t', header=None, chunksize=100)
# process each chunk and append the results to a new file
for chunk in reader:
    result = chunk.melt()
    result.to_csv('bigfile_melted.tsv', header=False, sep='\t', mode='a')
Furthermore, you can pass dtype=np.int32 to read_csv if your data is integer, or dtype=np.float32 to process the data faster if you do not need full precision.
NB: here you have examples of memory usage: Using Chunksize in Pandas.
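For instance, a sketch with an explicit dtype (the column-to-dtype mapping here is hypothetical; with header=None the columns are numbered):

import numpy as np
import pandas as pd

# 32-bit types halve the memory per value compared to the 64-bit defaults
reader = pd.read_csv('bigfile.tsv', sep='\t', header=None,
                     dtype={0: np.int32, 1: np.float32}, chunksize=100)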
I am trying to process a huge CSV file with pandas.
First, I ran into a memory error when loading the file. I was able to fix it with this:
df = pd.read_csv('data.csv', chunksize=1000, low_memory=False)
device_data = pd.concat(df, ignore_index=True)
However, I still get memory errors when processing device_data with multiple filters.
Here are my questions:
1- Is there any way to get rid of memory errors when processing the dataframe loaded from that huge csv?
2- I have also tried adding conditions to concatenate the dataframe with the iterators, referring to this link: How can I filter lines on load in Pandas read_csv function?
iter_csv = pd.read_csv('data.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['ID'] == 1234567] for chunk in iter_csv])
However, the number of results seems much lower than it should be. Does anyone have any advice?
Thanks.
Update on 2019/02/19:
I have managed to load the CSV via the code below. However, I noticed that the number of results (shown in df.shape) varies with the chunksize.
iter_csv = pd.read_csv('data.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['ID'] == 1234567] for chunk in iter_csv])
df.shape
I have 30 CSV files. I want to feed them as input in a for loop in pandas.
Each file has a name such as fileaa, fileab, fileac, filead, ....
I have multiple input files and I would like to receive one output.
Usually I use read_csv, but due to a memory error, read_csv doesn't work:
f = "./file.csv"
df = pd.read_csv(f, sep="/", header=0, dtype=str)
So I would like to try parallel processing in Python 2.7.
You might want to have a look at dask.
Dask docs show a demo on how to read in many csv files and output a single dask dataframe:
import dask.dataframe as dd
df = dd.read_csv('*.csv')
And then MANY (but not all) of the pandas methods are available, e.g.:
df.head()
It would be useful to read more on the Dask DataFrame to understand how it differs from a pandas DataFrame.
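A sketch for this case, reusing the question's read_csv arguments; note that single_file=True (which asks Dask to write one combined CSV) is only available in recent Dask versions:

import dask.dataframe as dd

# 'file*' matches fileaa, fileab, fileac, ... in the current directory
df = dd.read_csv('file*', sep='/', header=0, dtype=str)

# write everything into one output file
df.to_csv('combined.csv', single_file=True, index=False)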