Melt a big data frame without pandas - python

I have a 3GB dataset with 40k rows and 60k columns which Pandas is unable to read and I would like to melt the file based on the current index.
The current file looks like this:
The first column is an index and I would like to melt all the file based on this index.
I tried pandas and dask, but all of them crush when reading the big file.
Do you have any suggestions?
thanks

You need to use the chunksize property of pandas. See for example How to read a 6 GB csv file with pandas.
You will process N rows at one time, without loading the whole dataframe. N will depend on your computer: if N is low, it will cost less memory but it will increase the run time and will cost more IO load.
# create an object reading your file 100 rows at a time
reader = pd.read_csv( 'bigfile.tsv', sep='\t', header=None, chunksize=100 )
# process each chunk at a time
for chunk in file:
result = chunk.melt()
# export the results into a new file
result.to_csv( 'bigfile_melted.tsv', header=None, sep='\t', mode='a' )
Furthermore, you can use the argument dtype=np.int32 for read_csv if you have integer or dtype=np.float32 to process data faster if you do not need precision.
NB: here you have examples of memory usage: Using Chunksize in Pandas.

Related

groupby on very large dataset +10GB with python librairies, pandas, vaex and dask

I have more than 10 GB transaction data, i used DASK to read the data, select the columns am intrested in and also groupby the columns i wanted. All this was incredibly fast but computing wasn't working well and debugging was hard.
I then decided to open my data by using PANDAS chunksize, chunking my data per million. And then used VAEX to combine the files in one big HDF5 file. Until here everything went well, but when i try to groupby my columns and exceed 50k data, my code crashes. I was wondering how to manage this...should i groupby every pandas chunk before combining them in the a vaex dataframe or is it possible to convert my vaex dataframe to a dask dataframe, groupby and then convert the grouped by dataframe into a vaex which is more user friendly for me as it's similar to pandas.
path=....
cols=['client_1_id','amount', 'client_2_id', 'transaction_direction']
chunksize = 10**6
df = pd.read_csv(path,
iterator=True,
sep='\t',
usecols=cols,
chunksize=chunksize,
error_bad_lines=False)
import vaex
# Step 1: export to hdf5 chunks
for i, chunk in enumerate(df):
print(i)
df_chunk = vaex.from_pandas(chunk, copy_index=False)
df_chunk.export_hdf5(f'dfv_{i}.hdf5')
dfv = vaex.open('dfv_*.hdf5')
# Step 2: Combine back into one big hdf5 file
dfv.export_hdf5('dfv.hdf5')
dfv=vaex.open('dfv.hdf5')
this is my first post, sorry if there's not enough details, or if i am unclear, please feel free to ask me any question.

How to read a few lines in a large CSV file with pandas?

I have a CSV file that doesn't fit into my system's memory. Using Pandas, I want to read a small number of rows scattered all over the file.
I think that I can accomplish this without pandas following the steps here: How to read specific lines of a large csv file
In pandas, I am trying to use skiprows to select only the rows that I need.
# FILESIZE is the number of lines in the CSV file (~600M)
# rows2keep is an np.array with the line numbers that I want to read (~20)
rows2skip = (row for row in range(0,FILESIZE) if row not in rows2keep)
signal = pd.read_csv('train.csv', skiprows=rows2skip)
I would expect this code to return a small dataframe pretty fast. However, what is does is start consuming memory over several minutes until the system becomes irresponsive. I'm guessing that it is reading the whole dataframe first and will get rid of rows2skip later.
Why is this implementation so inefficient? How can I efficiently create a dataframe with only the lines specified in rows2keep?
Try this
train = pd.read_csv('file.csv', iterator=True, chunksize=150000)
If you only want to read the first n rows:
train = pd.read_csv(..., nrows=n)
If you only want to read rows from n to n+100
train = pd.read_csv(..., skiprows=n, nrows=n+100)
chunksize should help in limiting the memory usage. Alternatively, if you only need a few number of lines, a possible way is to first read the required lines ouside of pandas and then only feed read_csv with that subset. Code could be:
lines = [line for i, line in enumerate(open('train.csv')) if i in lines_to_keep]
signal = pd.read_csv(io.StringIO(''.join(lines)))

Pandas read_csv large file Performance improvement

Just was wondering if there is a way to improve the performance of reading large csv files into a pandas dataframe. I have 3 large (3.5MM records each) pipe delimited file which I want to load into dataframe and perform some task on it. Currently I am using pandas.read_csv() defining the cols and there datatypes in the parameter like below. I did see some improvement by defining the datatype of the columns but it still takes more than 3 minutes to load.
import pandas as pd
df = pd.read_csv(file_, index_col=None, usecols = sourceFields, sep='|', header=0, dtype={'date':'str', 'gwTimeUtc':'str', 'asset':'|str',
'instrumentId':'|str', 'askPrice':'float64', 'bidPrice':'float64',
'askQuantity':'float64', 'bidQuantity':'float64', 'currency':'|str',
'venue':'|str', 'owner':'|str', 'status':'|str', 'priceNotation':'|str', 'nominalQuantity':'float64'})
Depending on what you wish to do with the data, a good option is dask.dataframe. This library works out-of-memory, and allows you to perform a subset of pandas operations lazily. You can then bring the results in memory as a pandas dataframe. Below is example code you can try:
import dask.dataframe as dd, pandas as pd
# point to all files beginning with "file"
dask_df = dd.read_csv('file*.csv')
# define your calculations as you would in pandas
dask_df['col2'] = dask_df['col1'] * 2
# compute results & return to pandas
df = dask_df.compute()
Crucially, nothing significant is computed until the very last line.
The .feather file is significantly faster than .csv. Pandas has built-in support for feather files.
Read the csv in using pd.read_csv(path) and then export it to a feather file: pd.to_feather(path). Now, read the feather file instead of csv.
In my case, a 950 MB csv file was compressed to a 180 MB feather file. Instead of taking 30 seconds to read, it takes about 1 second. I know I am a bit late to the party, but feather files are seriously underrated.

Reading a part of csv file

I have a really large csv file about 10GB. When ever I try to read in into iPython notebook using
data = pd.read_csv("data.csv")
my laptop gets stuck. Is it possible to just read like 10,000 rows or 500 MB of a csv file.
It is possible. You can create an iterator yielding chunks of your csv of a certain size at a time as a DataFrame by passing iterator=True with your desired chunksize to read_csv.
df_iter = pd.read_csv('data.csv', chunksize=10000, iterator=True)
for iter_num, chunk in enumerate(df_iter, 1):
print(f'Processing iteration {iter_num}')
# do things with chunk
Or more briefly
for chunk in pd.read_csv('data.csv', chunksize=10000):
# do things with chunk
Alternatively if there was just a specific part of the csv you wanted to read, you could use the skiprows and nrows options to start at a particular line and subsequently read n rows, as the naming suggests.
Likely a memory issue. On read_csv you can set chunksize (where you can specify number of rows).
Alternatively, if you don't need all the columns, you can change usecols on read_csv to import only the columns you need.

Large, persistent DataFrame in pandas

I am exploring switching to python and pandas as a long-time SAS user.
However, when running some tests today, I was surprised that python ran out of memory when trying to pandas.read_csv() a 128mb csv file. It had about 200,000 rows and 200 columns of mostly numeric data.
With SAS, I can import a csv file into a SAS dataset and it can be as large as my hard drive.
Is there something analogous in pandas?
I regularly work with large files and do not have access to a distributed computing network.
Wes is of course right! I'm just chiming in to provide a little more complete example code. I had the same issue with a 129 Mb file, which was solved by:
import pandas as pd
tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable with chunks of 1000 rows.
df = pd.concat(tp, ignore_index=True) # df is DataFrame. If errors, do `list(tp)` instead of `tp`
In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or memory-mapped file--np.mmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000) then concatenate then with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; had still technical issues with the memory size (my CSV was ~ 7.5 Gb).
Right now, I just read chunks of the CSV files in a for-loop approach and add them e.g., to an SQLite database step by step:
import pandas as pd
import sqlite3
from pandas.io import sql
import subprocess
# In and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'
table_name = 'my_table' # name for the SQLite database table
chunksize = 100000 # number of lines to process at each iteration
# columns that should be read from the CSV file
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']
# Get number of lines in the CSV file
nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)
nlines = int(nlines.split()[0])
# connect to database
cnx = sqlite3.connect(out_sqlite)
# Iteratively read CSV and dump lines into the SQLite table
for i in range(0, nlines, chunksize):
df = pd.read_csv(in_csv,
header=None, # no header, define column header manually later
nrows=chunksize, # number of rows to read at each iteration
skiprows=i) # skip rows that were already read
# columns to read
df.columns = columns
sql.to_sql(df,
name=table_name,
con=cnx,
index=False, # don't use CSV file index
index_label='molecule_id', # use a unique column from DataFrame as index
if_exists='append')
cnx.close()
Below is my working flow.
import sqlalchemy as sa
import pandas as pd
import psycopg2
count = 0
con = sa.create_engine('postgresql://postgres:pwd#localhost:00001/r')
#con = sa.create_engine('sqlite:///XXXXX.db') SQLite
chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1",
sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
Base on your file size, you'd better optimized the chunksize.
for chunk in chunks:
chunk.to_sql(name='Table', if_exists='append', con=con)
count += 1
print(count)
After have all data in Database, You can query out those you need from database.
If you want to load huge csv files, dask might be a good option. It mimics the pandas api, so it feels quite similar to pandas
link to dask on github
You can use Pytable rather than pandas df.
It is designed for large data sets and the file format is in hdf5.
So the processing time is relatively fast.

Categories