Fastest way to read large text files into a pandas DataFrame - python

I have several large files (> 4 GB each). Some of them are in a fixed-width format and some are pipe-delimited. The files contain both numeric and text data. Currently I am using the following approach:
import pandas as pd

df1 = pd.read_fwf(fwFileName, widths=[2, 3, 5, 2, 16],
                  names=columnNames, dtype=columnTypes,
                  skiprows=1, engine='c',
                  keep_default_na=False)

df2 = pd.read_csv(pdFileName, sep='|', names=columnNames,
                  dtype=columnTypes, usecols=colNumbers,
                  skiprows=1, engine='c',
                  keep_default_na=False)
However, this seems to be slower than, for example, R's read_fwf (from readr) and fread (from data.table). Are there other methods I can use that will speed up reading these files?
I am working on a big server with several cores, so memory is not an issue; I can safely load the whole files into memory. Maybe time and memory amount to the same thing in this case, but my goal is to optimize for time, not for resources.
Update
Based on the comments so far here are a few additional details about the data and my ultimate goal.
These files are compressed (the fixed-width files are zip and the pipe-delimited ones are gzip). Therefore, I am not sure whether things like Dask will add value for loading. Will they?
After loading these files, I plan to apply a computationally expensive function to groups of the data, so I need the whole data. The data is, however, sorted by group, i.e. the first x rows are group 1, the next y rows are group 2, and so on. Would forming the groups on the fly be more productive? Is there an efficient way of doing that, given that I don't know how many rows to expect for each group?
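Something along these lines is what I have in mind for forming groups on the fly: stream the pipe-delimited file in chunks and emit each group as soon as its key changes. This is a rough sketch only; the group column name group_id and the chunk size are placeholders.

import pandas as pd

# Rough sketch: relies on the file being sorted by the group column.
# group_id, columnNames, columnTypes and the chunk size are placeholders.
def iter_groups(path, chunksize=1_000_000):
    leftover = None
    reader = pd.read_csv(path, sep='|', names=columnNames, dtype=columnTypes,
                         skiprows=1, keep_default_na=False, chunksize=chunksize)
    for chunk in reader:
        if leftover is not None:
            chunk = pd.concat([leftover, chunk], ignore_index=True)
        # rows of the last group may continue in the next chunk, so hold them back
        last_key = chunk['group_id'].iloc[-1]
        complete = chunk[chunk['group_id'] != last_key]
        leftover = chunk[chunk['group_id'] == last_key]
        for key, grp in complete.groupby('group_id', sort=False):
            yield key, grp
    if leftover is not None and len(leftover):
        yield leftover['group_id'].iloc[-1], leftover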

Since we are taking time as the metric here, your memory size is not the main factor we should be looking at. Quite the contrary: all methods that use lazy loading (less memory, and objects are loaded only when needed) are much, much faster than loading all of the data into memory at once. You can check out Dask, as it provides such a lazy read function: https://dask.org/
import time
import dask.dataframe

start_time = time.time()
data = dask.dataframe.read_csv('rg.csv')
duration = time.time() - start_time
print(f"Time taken {duration} seconds")  # less than a second
But as I said, this won't load the data into memory; it will only load portions of the data when they are needed. You can, however, load it in full using:
data.compute()
If you want to load things into memory faster, you need good computing capabilities on your server. A good candidate that can benefit from such capabilities is ParaText: https://github.com/wiseio/paratext
You can benchmark ParaText against pandas.read_csv using the following code:
# ParaText
import time
import paratext

start_time = time.time()
df = paratext.load_csv_to_pandas("rg.csv")
duration = time.time() - start_time
print(f"Time taken {duration} seconds")

# pandas
import time
import pandas as pd

start_time = time.time()
df = pd.read_csv("rg.csv")
duration = time.time() - start_time
print(f"Time taken {duration} seconds")
Please note that results may be worse if you don't have enough compute power to support ParaText.
You can check out the benchmarks for ParaText loading large files here:
https://deads.gitbooks.io/paratext-bench/content/results_csv_throughput.html
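One more note on the compressed inputs from the update (an addition on my part, not part of the benchmarks above): gzip is not a splittable format, so Dask reads each gzip file as a single partition rather than splitting it in parallel. A minimal sketch of how that read would look; the file name is just a placeholder:

import dask.dataframe as dd

# gzip files cannot be split into blocks, so each file becomes one partition;
# blocksize=None makes that explicit and silences the corresponding warning.
df = dd.read_csv('data.psv.gz', sep='|', compression='gzip', blocksize=None)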

Related

Dask .categorize very slow

I am trying to use Dask to perform feature extraction on a very large dataset (feature extraction using tsfresh); however, I am having trouble with very long processing times.
My data is all stored in Parquet files on my hard drive.
To begin with, I import the data into a Dask DataFrame using the following code.
import dask
from dask import dataframe as dd
df = dd.read_parquet("/Users/oskar/Library/Mobile Documents/com~apple~CloudDocs/Documents/Studies/BSc Sem 7/Bachelor Project/programs/python/data/*/data.parquet")
I then initialise a Dask cluster.
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=8,
                       threads_per_worker=1,
                       scheduler_port=8786,
                       memory_limit='2GB')
cluster.scheduler_address
After that I start a Dask client.
from tsfresh.utilities.distribution import ClusterDaskDistributor
dask_distributor = ClusterDaskDistributor(address="127.0.0.1:8786")
dask_distributor.client
I then melt 'df'...
dfm = df.melt(id_vars=["id", "time"],
              value_vars=['FP1-F7', 'F7-T7', 'T7-P7', 'P7-O1', 'FP1-F3', 'F3-C3', 'C3-P3', 'P3-O1',
                          'FP2-F4', 'F4-C4', 'C4-P4', 'P4-O2', 'FP2-F8', 'F8-T8', 'T8-P8', 'P8-O2',
                          'FZ-CZ', 'CZ-PZ', 'T7-FT9', 'FT9-FT10', 'FT10-T8'],
              var_name="kind",
              value_name="value")
... and group it.
dfm_grouped = dfm.groupby(["id", "kind"])
I then create an instance of 'dask_feature_extraction_on_chunk'.
from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk
from tsfresh.feature_extraction.settings import MinimalFCParameters
features = dask_feature_extraction_on_chunk(dfm_grouped,
                                            column_id="id",
                                            column_kind="kind",
                                            column_sort="time",
                                            column_value="value",
                                            default_fc_parameters=MinimalFCParameters())
And then I finally try to categorize it. This is what takes absolutely forever, and I'm wondering whether it is possible to speed up this process.
features = features.categorize(columns=["variable"])
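One thing I have been wondering (a rough sketch only, not something I have verified) is whether declaring the categories up front would avoid the full data scan that categorize needs. Something like the following, where the list of variable names is purely a placeholder:

import pandas as pd

# Sketch: .categorize() scans all partitions to discover the categories.
# If the full set of values in the 'variable' column could be enumerated in
# advance, an astype with a known CategoricalDtype would avoid that pass.
known_variables = ["FP1-F7__mean", "FP1-F7__median"]  # placeholder names only
features = features.assign(
    variable=features["variable"].astype(pd.CategoricalDtype(known_variables))
)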
After that I intend to reset the index and pivot the table; presumably this will take forever as well.
features = features.reset_index(drop=True)
feature_table = features.pivot_table(index="id",
                                     columns="variable",
                                     values="value",
                                     aggfunc="sum")
Not to mention the actual computation..
df_features = feature_table.compute()
Again, is there any way I can set up Dask to allow for faster computation? My computer has 16 GB of memory. Thank you.

Pool.map running out of memory - when to close and join

I am parallelizing ingestion and analysis of a dataset using Pool.map. However, I keep running into an out-of-memory error. This is despite working on a cluster with 100 GB of memory and a compressed file of ~25 GB (uncompressed, it is likely larger than 100 GB). I believe I am running out of memory because I am joining and closing processes incorrectly, i.e. all the chunks are being kept in memory. Here is my code:
import pandas as pd
from functools import reduce
from multiprocessing import Pool

## load df in chunks
df_chunk = pd.read_csv(f'{file}', sep='\t', chunksize=10000)

## parallel process
pool = Pool(16)
processed_results = pool.map(conduct_analysis, df_chunk)

## merge all results together
df_all = reduce(merge_results, processed_results)

pool.close()
pool.join()
With the following function definition for the analysis:
from scipy import stats

def conduct_analysis(df):
    ## dictionary to store results
    p_value = {}
    ## clean columns
    df = df.replace('./.', '0/0')
    ## run analysis: one ANOVA per group
    for key, group in df.groupby(pos):
        sample = pd.DataFrame(group['age_of_onset'])
        # ANOVA test
        f_val, p_val = stats.f_oneway(*sample)
        p_value[key] = p_val[0]
    return p_value
Note: reduce(merge_results, ...) just takes the mapped results and joins them together.
I can provide output examples too, but I believe they are not necessary for my issue. I believe I am running out of memory because the function calls are not being closed before the next one is opened. Is there a way to close the chunk but still retain the output? If I run this code in a for loop, one chunk at a time, I do not get such an error (see the sketch below).
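For reference, the chunk-at-a-time loop that does stay within memory looks roughly like this (a sketch; conduct_analysis and merge_results are the same as above, and only one chunk plus the accumulated results are held at any time):

results = None
for chunk in pd.read_csv(f'{file}', sep='\t', chunksize=10000):
    res = conduct_analysis(chunk)
    # merge immediately so the chunk's DataFrame can be freed
    results = res if results is None else merge_results(results, res)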
Thanks!

How to apply a Dask method to apply functions to files in a list?

First of all, thanks for this community and all the advice we can find here; it's really appreciated!
This is my first venture into parallel processing, and I have been looking into Dask on my own, but I am having trouble actually coding it... to be honest, I am really lost.
In one of my projects, I want to hit URLs and retrieve observation data (meteorological stations) from XML files.
For each URL, I run several steps: retrieve the data from the URL, parse the XML into a DataFrame, apply a filter, and store the data in a MySQL database.
So I need to loop this process over thousands of URLs (stations)...
I wrote sequential code, and it takes 300 s to finish the computation, which is really too long and not efficient.
As we apply the same process to each station, I think I can speed up the computations, but I don't know where to start. I used delayed from dask, but I don't think it's the best approach.
This is my code so far:
First I have some functions.
import wget
import pandas as pd
import xml.etree.ElementTree as ETree

def xml_to_dataframe(ood_xml):
    tmp_file = wget.download(ood_xml)
    prstree = ETree.parse(tmp_file)
    root = prstree.getroot()
    ################ Section to retrieve data for one station and apply parameters
    all_obs = []
    for obs in root.iter('observations'):
        ood_observation = []
        for n, param in enumerate(list_parameters):
            x = obs.find(param).text  # look up each requested parameter
            ood_observation.append(x)
        all_obs.append(ood_observation)
    return pd.DataFrame(all_obs, columns=list_parameters)
def filter_criteria(df, threshold, criteria):
    if criteria in df.columns:
        result = []
        for index, row in df.iterrows():
            if pd.to_numeric(row[criteria], errors='coerce') >= threshold:
                result.append(index)
        return result
    else:
        # print(criteria + ' parameter does not exist for this station !!! ')
        return []
def get_and_filter_data(filename, criteria, threshold):
    try:
        xmlToDf = xml_to_dataframe(filename)
        final_df = xmlToDf.loc[filter_criteria(xmlToDf, threshold, criteria)]
        # some MySQL connection and instructions....
    except:
        pass
and then the main code I want to parallelise:
criteria = 'temperature'
threshold = 22
filenames = ['url1.html', 'url2.html', 'url3.html']

for file in filenames:
    get_and_filter_data(file, criteria, threshold)
Do you have any advice on how to do it?
Many thanks for your help!
Guillaume
Not 100% sure this is what you are after, but one way is via delayed:
from dask import delayed, compute
delayeds = [delayed(get_and_filter_data)(file,criteria,threshold) for file in filenames]
results = compute(delayeds)
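As a possible follow-up (an assumption on my part, not part of the answer above): since the work is mostly network I/O, the threaded scheduler with a capped number of workers may be worth trying; num_workers here is purely a tuning knob.

# Run the same delayed tasks on the threaded scheduler with a bounded
# number of workers; compute() returns a one-element tuple, hence the unpacking.
results, = compute(delayeds, scheduler="threads", num_workers=8)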

Dask read_csv: skip periodically occurring lines

I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ file, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the atom number, the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the atom number and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
            header=None)
But it looks like the Dask DataFrame does not support passing a function to skiprows.
Edit 2:
MRocklin's answer works! Just for completeness, I write down the full code I used.
from io import BytesIO
import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed
atom_nr = ...
filename = ...
def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
                       header=None)
bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: how do I tell Dask to compute only the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, neither pandas.read_csv nor dask.dataframe.read_csv offers this kind of functionality (to my knowledge).
Long answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (and then up to the next delimiter). You can control how fine you want your partitions to be with this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data
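A possible follow-up (my own addition, under the assumption that the column layout is known ahead of time): passing meta to from_delayed spares Dask from computing a sample partition just to infer the dtypes. For the XYZ data above, that could look like this:

import pandas as pd
import dask.dataframe as dd

# Hypothetical meta for the XYZ rows read with header=None:
# atom name plus x, y, z coordinates.
meta = pd.DataFrame({0: pd.Series(dtype="object"),
                     1: pd.Series(dtype="float64"),
                     2: pd.Series(dtype="float64"),
                     3: pd.Series(dtype="float64")})
df = dd.from_delayed(dfs, meta=meta)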

RAM consumption by pandas DataFrame

I am trying to work with around 100 CSV files to do a time series analysis.
To build an efficient algorithm, I've structured my data-reading function so that it reads all the files at once and I don't have to repeat the same process again and again. To explain further, the following is my code:
import pandas as pd
from itertools import chain, combinations

start_date = '2016-06-01'
end_date = '2017-09-02'
allocation = 170000

# contains 100 symbols
usesymbols = ['']
cost_matrix = []

def data():
    dates = pd.date_range(start_date, end_date)
    df = pd.DataFrame(index=dates)
    for symbol in usesymbols:
        df_temp = pd.read_csv('/home/furqan/Desktop/python_data/{}.csv'.format(str(symbol)),
                              usecols=['Date', 'Close'],
                              parse_dates=True, index_col='Date', na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    return df

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s) + 1))

power_set = list(powerset(usesymbols))
dataframe = data()
The problem is that if I run the above code with 15 symbols, it works perfectly.
But that's not sufficient; I want to use 100 symbols.
If I run the code with 100 items in usesymbols, my RAM is used up completely and the machine freezes.
Is there anything that can be done to avoid this situation?
Edited part:
1) I have 16 GB of RAM.
2) The issue is with the variable power_set; if I don't call the powerset function, the data gets retrieved easily.
DataFrame.memory_usage(index=False)
Returns:
sizes : Series
    A Series with column names as the index and the memory usage of each column, in bytes.
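For example, a minimal illustration of the call (deep=True additionally counts the memory held by object/string values):

import pandas as pd

df = pd.DataFrame({'Close': [1.0, 2.0, 3.0], 'Symbol': ['A', 'B', 'C']})
print(df.memory_usage(index=False))             # bytes per column
print(df.memory_usage(index=False, deep=True))  # include object values
print(df.memory_usage(deep=True).sum())         # total bytes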
