I have a significantly large volume of raw data stored in pickle files. First I have to read/load them into pandas dataframes. Afterwards I do some analysis, update something, analyze again, and so on.
Every time I run the code, it reads the raw data from the pickle files, which takes quite some time that I want to avoid. One dirty solution is to load the files once and then comment out the reading section, while being careful not to %reset the namespace.
import pandas as pd
df_1 = pd.read_pickle('MyFile_1.pkl')
df_2 = pd.read_pickle('MyFile_2.pkl')
df_3 = pd.read_pickle('MyFile_3.pkl')
Then I do some work on the loaded data...
Is there some smart way to do it? Something like
if 'df_1' not in globals():  # i.e. only read the pickle if it isn't loaded yet
    df_1 = pd.read_pickle('MyFile_1.pkl')
Related
So, I have this database with thousands of rows and columns. At the start of the program I load the data and assign it to a variable:
data=np.loadtxt('database1.txt',delimiter=',')
Since this database contains many elements, it takes minutes to start the program. Is there a way in Python (similar to .mat files in MATLAB) to load the data only once, even when I stop the program and then run it again? Currently my time is wasted waiting for the program to load the data when I just change a small thing for testing.
Firstly, NumPy isn't well suited to reading large text files; pandas handles this far better.
So just stop using np.loadtxt and start using pd.read_csv instead.
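A minimal sketch of the pandas route, converting back to a NumPy array in case the rest of the program expects one:
import pandas as pd

# read the comma-separated file with pandas; header=None because the
# file is raw numbers with no header row
data = pd.read_csv('database1.txt', header=None).to_numpy()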
But if you want to stay with NumPy, I think the np.fromfile() function is more efficient and faster than np.loadtxt().
So my advice is to try:
data = np.fromfile('database1.txt', sep=',')
instead of:
data = np.loadtxt('database1.txt',delimiter=',')
You could use pickle to cache your data:
import pickle
import os
import numpy as np

if os.path.isfile("cache.p"):
    # a cached copy exists: load it directly
    with open("cache.p", "rb") as f:
        data = pickle.load(f)
else:
    # first run: parse the text file, then cache the result
    data = np.loadtxt('database1.txt', delimiter=',')
    with open("cache.p", "wb") as f:
        pickle.dump(data, f)
The first time it will be very slow, then later executions will be pretty fast.
I just tested with a file containing 1 million rows and 20 columns of random floats; it took ~30 s the first time and ~0.4 s the following times.
I have to read a file into Spark (Databricks) as bytes and convert it to a string:
file_bytes.decode("utf-8")
This is all fine, and I have my data as a pipe-delimited string, including carriage returns etc. It all looks good. Something like:
"Column1"|"Column2"|"Column3"|"Column4"|"Column5"
"This"|"is"|"some"|"data."|
"Shorter"|"line."|||
I want this in a dataframe though so that I can manipulate it, and initially I was attempting to use the following:
df = (sqlContext.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("delimiter", '|')
      .load(???))
I appreciate that the load() portion is really meant to be a path to a location on the filesystem ... so have been struggling with that one.
I have therefore reverted to using pandas as it makes life a lot easier:
import io
import pandas
temp = io.StringIO(file_bytes.decode("utf-8"))
df = pandas.read_csv(temp, sep="|")
This is a pandas dataframe and not a Spark dataframe, which, as far as I am aware (and it's a very loose awareness), has pros and cons relating to where it lives (in memory), which relates to scalability / cluster usage etc.
Initially, is there a way for me to get my string into a Spark dataframe using sqlContext? Maybe I am missing some parameter or switch, or should I just stick with pandas?
The main thing I am worried about is that right now the files are quite small (200 KB or so), but they might not be forever, and I'd like to reuse a pattern that will allow me to work with larger things (which is why I am marginally concerned about using pandas).
You can actually load an RDD of strings using the CSV reader.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
So, assuming lines is an RDD of strings that you parsed as you described, you can run:
df = spark.read.csv(lines, sep='|', header=True, inferSchema=True)
The CSV source will then scan the RDD instead of trying to load files. This lets you perform custom pre-processing prior to parsing.
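For completeness, here is a minimal sketch of how lines could be built from the decoded string (file_bytes is the bytes object from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# split the decoded file into individual lines and distribute them
# as an RDD of strings, then hand that RDD to the CSV reader
lines = spark.sparkContext.parallelize(file_bytes.decode("utf-8").splitlines())
df = spark.read.csv(lines, sep='|', header=True, inferSchema=True)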
I have large tabular data which needs to be merged and split by group. The easy method is to use pandas, but the only problem is memory.
I have this code to merge dataframes:
import pandas as pd
from functools import reduce

large_df = pd.read_table('large_file.csv', sep=',')
This basically loads the whole data into memory.
# Then I could group the pandas dataframe by some column value (say "block" )
df_by_block = large_df.groupby("block")
# and then write the data by blocks as
for block_id, block_val in df_by_block:
    block_val.to_csv("df_" + str(block_id), sep="\t", index=False)
The only problem with the above code is memory allocation, which freezes my desktop. I tried to transfer this code to dask, but dask doesn't have a neat groupby implementation.
Note: I could have just sorted the file, then read the data line by line and split it as the "block" value changes (a sketch of this idea follows). But the only problem is that "large_df.txt" is created upstream in the pipeline by merging several dataframes.
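A minimal sketch of that line-by-line split, assuming the file is already sorted by the "block" column and has a header row:
import csv

with open('large_file.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    block_idx = header.index('block')
    current_block, out, writer = None, None, None
    for row in reader:
        # whenever the block value changes, start a new output file
        if row[block_idx] != current_block:
            if out is not None:
                out.close()
            current_block = row[block_idx]
            out = open('df_' + current_block, 'w', newline='')
            writer = csv.writer(out, delimiter='\t')
            writer.writerow(header)
        writer.writerow(row)
    if out is not None:
        out.close()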
Any suggestions?
Thanks,
Update:
I tried the following approach, but it still seems to be memory heavy:
# find unique values in the column of interest (which is to be "grouped by")
large_df_contig = large_df['contig']
contig_list = list(large_df_contig.unique().compute())

# index the dataframe by the grouping column
large_df_grouped = large_df.set_index('contig')

# now, split into one dataframe per contig and write it out
for items in contig_list:
    my_df = large_df_grouped.loc[items].compute().reset_index()
    my_df.to_csv('dask_output/my_df_' + str(items), sep='\t', index=False)
Everything is fine, but the line
my_df = large_df_grouped.loc[items].compute().reset_index()
seems to pull everything into memory again.
Any way to improve this code??
but dask doesn't have a neat groupby implementation
Actually, dask does have groupby + user-defined functions with out-of-memory reshuffling.
You can use
large_df.groupby(something).apply(write_to_disk)
where write_to_disk is some short function writing the block to the disk. By default, dask uses disk shuffling in these cases (as opposed to network shuffling). Note that this operation might be slow, and it can still fail if the size of a single group exceeds your memory.
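A minimal sketch of that pattern, reusing the file and column names from the question (the meta argument just tells dask what the per-group return value looks like):
import dask.dataframe as dd

# read the large file lazily
large_df = dd.read_csv('large_file.csv')

def write_to_disk(group):
    # each group arrives as an ordinary pandas dataframe;
    # group.name holds the value of the grouping column
    group.to_csv('df_' + str(group.name), sep='\t', index=False)
    return 0  # return something tiny so nothing large is collected

large_df.groupby('block').apply(write_to_disk, meta=('written', 'int64')).compute()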
I am trying to load csv files into a pandas dataframe. However, Python takes a very large amount of memory while loading the files. For example, the size of the csv file is 289 MB, but the memory usage goes to around 1700 MB while I am trying to load the file, and at that point the system shows a memory error. I have also tried the chunksize option, but the problem persists. Can anyone please show me a way forward?
OK, first things first: do not confuse disk size and memory size. A csv, at its core, is a plain text file, whereas a pandas dataframe is a complex object loaded in memory. That said, I can't give a statement about your particular case, considering that I don't know what you have in your csv. So instead I'll give you an example with a csv on my computer that has a similar size:
-rw-rw-r-- 1 alex users 341M Jan 12 2017 cpromo_2017_01_12_rec.csv
Now reading the CSV:
>>> import pandas as pd
>>> df = pd.read_csv('cpromo_2017_01_12_rec.csv')
sys:1: DtypeWarning: Columns (9) have mixed types. Specify dtype option on import or set low_memory=False.
>>> df.memory_usage(deep=True).sum() / 1024**2
1474.4243307113647
Pandas will attempt to optimize it as much as it can, but it won't be able to do the impossible. If you are low on memory, this answer is a good place to start. Alternatively you could try dask but I think that's too much work for a small csv.
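As a first concrete step, note the DtypeWarning above: specifying dtypes on import can shrink memory use considerably. A minimal sketch (the column names and dtypes here are made-up placeholders):
import pandas as pd

# declare narrow dtypes up front instead of letting pandas infer
# 64-bit numbers and generic objects for every column
dtypes = {'id': 'int32', 'price': 'float32', 'category': 'category'}
df = pd.read_csv('cpromo_2017_01_12_rec.csv', dtype=dtypes)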
You can use the "dask" library, e.g.:
# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')
Try it like this: 1) load with dask and then 2) convert to pandas.
import pandas as pd
import dask.dataframe as dd
import time
t = time.perf_counter()  # time.clock() was removed in Python 3.8
df_train = dd.read_csv('../data/train.csv', usecols=['col1', 'col2'])  # placeholder column names
df_train = df_train.compute()
print("load train:", time.perf_counter() - t)
I am exploring switching to python and pandas as a long-time SAS user.
However, when running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB csv file. It had about 200,000 rows and 200 columns of mostly numeric data.
With SAS, I can import a csv file into a SAS dataset and it can be as large as my hard drive.
Is there something analogous in pandas?
I regularly work with large files and do not have access to a distributed computing network.
Wes is of course right! I'm just chiming in to provide a little more complete example code. I had the same issue with a 129 MB file, which was solved by:
import pandas as pd
tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable with chunks of 1000 rows.
df = pd.concat(tp, ignore_index=True) # df is DataFrame. If errors, do `list(tp)` instead of `tp`
In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague, but it's been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row by row into a pre-allocated NumPy array or memory-mapped file, np.memmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000), then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
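For what it's worth, a minimal sketch of the row-by-row transcription idea (it assumes you know the shape up front, that there is a header row, and that every field is numeric):
import numpy as np

# pre-allocate the full array, then fill it one row at a time, so the
# entire text file is never held in memory at once
nrows, ncols = 200000, 200  # shape taken from the question above
data = np.empty((nrows, ncols), dtype=np.float64)
with open('large_dataset.csv') as f:
    next(f)  # skip the header row
    for i, line in enumerate(f):
        data[i] = [float(x) for x in line.split(',')]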
This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; I still had technical issues with memory (my CSV was ~7.5 GB).
Right now, I just read chunks of the CSV file in a for loop and add them step by step to, e.g., an SQLite database:
import pandas as pd
import sqlite3
import subprocess
# In and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'
table_name = 'my_table' # name for the SQLite database table
chunksize = 100000 # number of lines to process at each iteration
# columns that should be read from the CSV file
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']
# Get number of lines in the CSV file
nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)
nlines = int(nlines.split()[0])
# connect to database
cnx = sqlite3.connect(out_sqlite)
# Iteratively read CSV and dump lines into the SQLite table
for i in range(0, nlines, chunksize):
    df = pd.read_csv(in_csv,
                     header=None,      # no header; column names are set manually below
                     nrows=chunksize,  # number of rows to read at each iteration
                     skiprows=i)       # skip rows that were already read

    # assign the column names
    df.columns = columns

    # append the chunk to the SQLite table
    df.to_sql(name=table_name,
              con=cnx,
              index=False,  # don't write the DataFrame index
              if_exists='append')

cnx.close()
Below is my workflow.
import sqlalchemy as sa
import pandas as pd
import psycopg2
count = 0
con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')
#con = sa.create_engine('sqlite:///XXXXX.db') SQLite
chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1",
                     sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
Based on your file size, you'd better optimize the chunksize.
for chunk in chunks:
    chunk.to_sql(name='Table', if_exists='append', con=con)
    count += 1
    print(count)
Once you have all the data in the database, you can query out whatever you need.
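For example, a minimal sketch of the query step (the SQL itself is just a placeholder):
# pull back only the slice you need instead of re-reading the whole CSV
df = pd.read_sql('SELECT * FROM "Table" LIMIT 100000', con)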
If you want to load huge csv files, dask might be a good option. It mimics the pandas API, so it feels quite similar to pandas.
link to dask on github
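A minimal sketch of what that looks like (the file name is a placeholder):
import dask.dataframe as dd

# nothing is read yet; dask just builds a lazy task graph
df = dd.read_csv('huge_file.csv')

# operations look like pandas but run chunk by chunk on compute()
print(df.describe().compute())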
You can use PyTables rather than a pandas dataframe. It is designed for large data sets, and the file format is HDF5, so the processing time is relatively fast.
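For instance, a minimal sketch going through pandas' HDFStore, which is backed by PyTables (the file and key names are placeholders):
import pandas as pd

# a toy dataframe standing in for the real data
df = pd.DataFrame({'a': range(1000), 'b': range(1000)})

# write once to an HDF5 file (PyTables under the hood)
df.to_hdf('data.h5', key='my_table', format='table')

# later runs reload much faster than re-parsing a CSV
df = pd.read_hdf('data.h5', key='my_table')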