Using Dask to download, process, and save to csv - python

Problem
Part of my workflow involves downloading hundreds of thousands of files, parse the data, and then save to csv locally. I'm trying to set this workflow up with Dask but it does not appear to be processing in parallel. The Dask dashboard shows low cpu % for each worker and the task tab is empty. Status doesn't show anything either. htop doesn't appear to processing more than 1 or 2 "running" at a time. I'm not sure how to proceed from here.
Related: How should I write multiple CSV files efficiently using dask.dataframe? (Older question that this question is based on)
Example
from dask.delayed import delayed
from dask import compute
from dask.distributed import Client, progress
import pandas as pd
import wget
import zipfile
import multiprocessing
def get_fn(dat):
### Download file and unzip based on input dat
url = f"http://www.urltodownloadfrom.com/{dat['var1']}/{dat['var2']}.csv"
wget.download(url)
indat = unzip()
### Process file
outdat = proc_dat(indat)
### Save file
outdat.to_csv('file_path')
### Trash collection with custom download fn
delete_downloads()
if __name__ == '__main__':
### Dask setup
NCORES = multiprocessing.cpu_count() - 1
client = Client(n_workers=NCORES, threads_per_worker=1)
### Build df of needed dates and variables
beg_dat = "2020-01-01"
end_dat = "2020-01-31"
date_range = pd.date_range(beg_dat, end_dat)
var = ["var1", "var2"]
lst_ = [(x, y) for x in date_range for y in var]
date = [x[0] for x in lst_]
var = [x[1] for x in lst_]
indf = pd.DataFrame({'date': date, 'var': var}).reset_index()
### Group by each row to process
gb = indf.groupby('index')
gb_i = [gb.get_group(x) for x in gb.groups]
### Start dask using delayed
compute([delayed(get_fn)(thisRow) for thisRow in gb_i], scheduler='processes')
Dashboard

In this line:
compute([...], scheduler='processes')
you explicitly use a scheduler other than the distributed one you set up earlier in the script. If you do not specify scheduler= here, you will use the correct client, as it has been set as the default. You will see things appear in the dashboard.
Note that you might still not see high CPU usage, since it seems likely that most of the time is waiting for downloads.

Related

Parallelized loading of data into Pandas Dataframes [duplicate]

I have a Requirement, where I have three Input files and need to load them inside the Pandas Data Frame, before merging two of the files into one single Data Frame.
The File extension always changes, it could be .txt one time and .xlsx or .csv another time.
How Can I run this process parallel, in order to save the waiting/ loading time ?
This is my code at the moment,
from time import time # to measure the time taken to run the code
start_time = time()
Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"
import pandas as pd # to work with the data frames
Primary_df = pd.read_excel (Primary_File)
Secondary_1_df = pd.read_csv (Secondary_File_1)
Secondary_2_df = pd.read_csv (Secondary_File_2)
Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()
print(end_time - start_time)
It takes around 20 minutes for me to load my primary_df and secondary_df. So, I am looking for an efficient solution possibly using parallel processing to save time.
I timed by Reading operation and it takes most of the time approximately 18 minutes 45 seconds.
Hardware Config :- Intel i5 Processor, 16 GB Ram and 64-bit OS
Question Made Eligible for bounty :- As I am looking for a working
code with detailed steps - using a package with in anaconda
environment that supports loading my input files Parallel and
storing them in a pandas data frame separately. This should eventually
save time.
Try this:
from time import time
import pandas as pd
from multiprocessing.pool import ThreadPool
start_time = time()
pool = ThreadPool(processes=3)
Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"
# Define a function for the thread
def import_xlsx(file_name):
df_xlsx = pd.read_excel(file_name)
# print(df_xlsx.head())
return df_xlsx
def import_csv(file_name):
df_csv = pd.read_csv(file_name)
# print(df_csv.head())
return df_csv
# Create two threads as follows
Primary_df = pool.apply_async(import_xlsx, (Primary_File, )).get()
Secondary_1_df = pool.apply_async(import_csv, (Secondary_File_1, )).get()
Secondary_2_df = pool.apply_async(import_csv, (Secondary_File_2, )).get()
Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()
Why not use asyncio over multiprocessing?
Instead of using multiple threads, you might want to first leverage on the I/O level with an Async CSV Dict Reader (which can be parallelized using multiprocessing for multiple files). Afterwards, you can either concat the dicts and then load these dictionaries into pandas or load the individual dicts into pandas and concat there.
However, pandas does not support asyncio so you will have a performance loss at some point.
Try using #Cezary.Sz code but using (delete the calls to .get()), instead:
Primary_df_job = pool.apply_async(import_xlsx, (Primary_File, ))
Secondary_1_df_job = pool.apply_async(import_csv, (Secondary_File_1, ))
Secondary_2_df_job = pool.apply_async(import_csv, (Secondary_File_2, ))
Then
Secondary_1_df = Secondary_1_df_job.get()
Secondary_2_df = Secondary_2_df_job.get()
And you can use the dataframes, while Primary_df_job is loading.
Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
When you need Primary_df in your code, use
Primary_df = Primary_df_job.get()
This will block the execution until Primary_df_job is finished.
Unfortunately, due to GIL (Global Interpreter Lock) in Python, multiple threads do not run simultaneously — all threads use the same single CPU's core. That means if you create several threads to load your files, the total time will be equal (or actually greater) the time needed to load that files one-by-one.
More about GIL: https://wiki.python.org/moin/GlobalInterpreterLock
To speed up load time you can try to switch from csv/excel to pickle files (or HDF).
You give the hardware details but you do not give the most interesting part: the number of disks you have, the type of RAID you have and the filesystem you are reading from.
If you only have one disk, no RAID, and a regular filesystem (ext4, XFS, etc.), like you mostly have on laptops, you will not be able to increase the bandwidth simply by throwing CPUs (multithread or multiprocess) at the problem. Using multiple threads, or asynchronous I/Os will help mask the latency a bit, but will not increase the bandwidth, because chances are you are already saturating it with a single reader process.
So using the code suggested by #Cezary.Sz, try moving one of the file to a USB3.0 external storage, or to SDSX storage. If you are running on a large workstation, look at the hardware details to see if several disks are available, and if you run on a large cluster, look for a parallel filesystem (BeeGFS, Lustre, etc.)

Reading different set of json files same time with python

I have two sets of files b and c (JSON). The number of files in each is normally between 500-1000. Right now I am reading this seperately. Can I read these at the same time using multi-threading? I have enough memory and processors.
yc=no of c files
yb=no of b files
c_output_transaction_list =[]
for num in range(yc):
c_json_file='./output/d_c_'+str(num)+'.json'
print(c_json_file)
c_transaction_list = json.load(open(c_json_file))['data']['transaction_list']
c_output_transaction_list.extend(c_transaction_list)
df_res_c= pd.DataFrame(c_output_transaction_list)
b_output_transaction_list =[]
for num in range(yb):
b_json_file='./output/d_b_'+str(num)+'.json'
print(b_json_file)
b_transaction_list = json.load(open(b_json_file))['data']['transaction_list']
b_output_transaction_list.extend(b_transaction_list)
df_res_b= pd.DataFrame(b_output_transaction_list)
I use this method to read hundreds of files in parallel into a final dataframe. Without having your data, you'll have to verify this does what you want. Reading the multiprocess help docs will assist. I use the same code on linux (aws ec2 reading s3 files) and windows reading the same s3 files. I find a big time savings do this.
import os
import pandas as pd
from multiprocessing import Pool
# you set the number of processors or just take the cpu_count from the os object. playing around with this does make a difference. For me using the max isn't always the fast overall time
num_proc = os.cpu_count()
# define the funciton that creates a dataframe from your file
# note, this is different where you build the list the create a dataframe at the end
def json_parse(c_json_file):
c_transaction_list = json.load(open(c_json_file))['data']['transaction_list']
return pd.DataFrame(c_transaction_list)
# this is multiprocessing function that feeds the file names to the parsing function
# if you don't pass num_proc it defaults to 4
def json_multiprocess(fn_list, num_proc=4):
with Pool(num_proc) as pool:
# I use starmap, you may just be able use map
# if you pass more than the file name, starmap handles zip() very well
r = pool.starmap(json_parse, fn_list, 15)
pool.close()
pool.join()
return r
# build your file list first
yc=no of c files
flist = []
for num in range(yc):
c_json_file='./output/d_c_'+str(num)+'.json'
flist.append(c_json_file)
# get a list of of your intermediate dataframes
dfs = json_multiprocess(flist, num_proc=num_proc)
# concat your dataframe
df_res_c = pd.concat(dfs)
Then do the same for your next set of files...
Use the example in Aelarion's comment to help structure the file

Cant fit dataframe with fbprophet using dask to read the csv into a dataframe

References:
https://examples.dask.org/applications/forecasting-with-prophet.html?highlight=prophet
https://facebook.github.io/prophet/
A few things to note:
I've got a total of 48gb of ram
Here are my versions of the libraries im using
Python 3.7.7
dask==2.18.0
fbprophet==0.6
pandas==1.0.3
The reason im import pandas is for this line only pd.options.mode.chained_assignment = None
This helps with dask erroring when im using dask.distributed
So, I have a 21gb csv file that I am reading using dask and jupyter notebook...
I've tried to read it from my mysql database table, however, the kernel eventually crashes
I've tried multiple combinations of using my local network of workers, threads, and available memory, available storage_memory, and even tried not using distributed at all. I have also tried chunking with pandas (not with the line mentioned above related to pandas), however, even with chunking, the kernel still crashes...
I can now load the csv with dask, and apply a few transformations, such as setting the index, adding the column (names) that fbprophet requires... but I am still not able to compute the dataframe with df.compute(), as this is why I think I am receiving the error I am with fbprophet. After I have added the columns y, and ds, with the appropriate dtypes, I receive the error Truth of Delayed objects is not supported, and I think this is because fbprophet expects the dataframe to not be lazy, which is why im trying to run compute beforehand. I have also bumped up the ram on the client to allow it to use the full 48gb, as I suspected that it may be trying to load the data twice, however, this still failed, so most likely this wasn't the case / isn't causing the problem.
Alongside this, fbpropphet is also mentioned in the documentation of dask for applying machine learning to dataframes, however, I really don't understand why this isn't working... I've also tried modin with ray, and with dask, with basically the same result.
Another question... regarding memory usage
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 32.35 GB -- Worker memory limit: 25.00 GB
I am getting this error when assigning the client, reading the csv file, and applying operations/transformations to the dataframe, however the allotted size is larger than the csv file itself, so this confuses me...
What I have done to try and solve this myself:
- Googling of course, did not find anything :-/
- Asking a discord help channel, on multiple occasions
- Asking IIRC help channel, on multiple occasions
Anyways, would really appreciate any help on this problem!!!
Thank you in advance :)
MCVE
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
from fbprophet import Prophet
pd.options.mode.chained_assignment = None
client = Client(n_workers=2, threads_per_worker=4, processes=False, memory_limit='4GB')
csv_file = 'provide_your_own_csv_file_here.csv'
df = dd.read_csv(csv_file, parse_dates=['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['y'] = df[['a','b']].mean(axis=1)
m = Prophet(daily_seasonality=True)
m.fit(df)
# ERROR: Truth of Delayed objects is not supported
Unfortunately Prophet doesn't support Dask dataframes today.
The example that you refer to shows using Dask to accelerate Prophet's fitting on Pandas dataframes. Dask Dataframe is only one way that people use Dask.
As already suggested, one approach is to use dask.delayed with a pandas DataFrame, and skip dask.dataframe.
You could use a simplified version of the load-clean-analyze pipeline shown for custom computations using Dask.
Here is one possible approach based on this type of custom pipeline, using a small dataset (to create a MCVE) - every step in the pipeline will be delayed
Imports
import numpy as np
import pandas as pd
from dask import delayed
from dask.distributed import Client
from fbprophet import Prophet
Generate some data in a .csv, with column names Time (UTC), a and b
def generate_csv(nrows, fname):
df = pd.DataFrame(np.random.rand(nrows, 2), columns=["a", "b"])
df["Time (UTC)"] = pd.date_range(start="1850-01-01", periods=nrows)
df.to_csv(fname, index=False)
First write the load function from the pipeline, to load the .csv with Pandas, and delay its execution using the dask.delayed decorator
might be good to use read_csv with nrows to see how the pipeline performs on a subset of the data, rather than loading it all
this will return a dask.delayed object and not a pandas.DataFrame
#delayed
def load_data(fname, nrows=None):
return pd.read_csv(fname, nrows=nrows)
Now create the process function, to process data using pandas, again delayed since its input is a dask.delayed object and not a pandas.DataFrame
#delayed
def process_data(df):
df = df.rename(columns={"Time (UTC)": "ds"})
df["y"] = df[["a", "b"]].mean(axis=1)
return df
Last function - this one will train fbprophet on the data (loaded from the .csv and processed, but delayed) to make a forecast. This analyze function is also delayed, since one of its inputs is a dask.delayed object
#delayed
def analyze(df, horizon):
m = Prophet(daily_seasonality=True)
m.fit(df)
future = m.make_future_dataframe(periods=horizon)
forecast = m.predict(future)
return forecast
Run the pipeline (if running from a Python script, requires __name__ == "__main__")
the output of the pipeline (a forecast by fbprophet) is stored in a variable result, which is delayed
when this output is computed, this will generate a pandas.DataFrame (corresponding to the output of a forecast by fbprophet), so it can be evaluated using result.compute()
if __name__ == "__main__":
horizon = 8
num_rows_data = 40
num_rows_to_load = 35
csv_fname = "my_file.csv"
generate_csv(num_rows_data, csv_fname)
client = Client() # modify this as required
df = load_data(csv_fname, nrows=num_rows_to_load)
df = process_data(df)
result = analyze(df, horizon)
forecast = result.compute()
client.close()
assert len(forecast) == num_rows_to_load + horizon
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].head())
Output
ds yhat yhat_lower yhat_upper
0 1850-01-01 0.330649 0.095788 0.573378
1 1850-01-02 0.493025 0.266692 0.724632
2 1850-01-03 0.573344 0.348953 0.822692
3 1850-01-04 0.491388 0.246458 0.712400
4 1850-01-05 0.307939 0.066030 0.548981

Parallel loading of Input Files in Pandas Dataframe

I have a Requirement, where I have three Input files and need to load them inside the Pandas Data Frame, before merging two of the files into one single Data Frame.
The File extension always changes, it could be .txt one time and .xlsx or .csv another time.
How Can I run this process parallel, in order to save the waiting/ loading time ?
This is my code at the moment,
from time import time # to measure the time taken to run the code
start_time = time()
Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"
import pandas as pd # to work with the data frames
Primary_df = pd.read_excel (Primary_File)
Secondary_1_df = pd.read_csv (Secondary_File_1)
Secondary_2_df = pd.read_csv (Secondary_File_2)
Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()
print(end_time - start_time)
It takes around 20 minutes for me to load my primary_df and secondary_df. So, I am looking for an efficient solution possibly using parallel processing to save time.
I timed by Reading operation and it takes most of the time approximately 18 minutes 45 seconds.
Hardware Config :- Intel i5 Processor, 16 GB Ram and 64-bit OS
Question Made Eligible for bounty :- As I am looking for a working
code with detailed steps - using a package with in anaconda
environment that supports loading my input files Parallel and
storing them in a pandas data frame separately. This should eventually
save time.
Try this:
from time import time
import pandas as pd
from multiprocessing.pool import ThreadPool
start_time = time()
pool = ThreadPool(processes=3)
Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"
# Define a function for the thread
def import_xlsx(file_name):
df_xlsx = pd.read_excel(file_name)
# print(df_xlsx.head())
return df_xlsx
def import_csv(file_name):
df_csv = pd.read_csv(file_name)
# print(df_csv.head())
return df_csv
# Create two threads as follows
Primary_df = pool.apply_async(import_xlsx, (Primary_File, )).get()
Secondary_1_df = pool.apply_async(import_csv, (Secondary_File_1, )).get()
Secondary_2_df = pool.apply_async(import_csv, (Secondary_File_2, )).get()
Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()
Why not use asyncio over multiprocessing?
Instead of using multiple threads, you might want to first leverage on the I/O level with an Async CSV Dict Reader (which can be parallelized using multiprocessing for multiple files). Afterwards, you can either concat the dicts and then load these dictionaries into pandas or load the individual dicts into pandas and concat there.
However, pandas does not support asyncio so you will have a performance loss at some point.
Try using #Cezary.Sz code but using (delete the calls to .get()), instead:
Primary_df_job = pool.apply_async(import_xlsx, (Primary_File, ))
Secondary_1_df_job = pool.apply_async(import_csv, (Secondary_File_1, ))
Secondary_2_df_job = pool.apply_async(import_csv, (Secondary_File_2, ))
Then
Secondary_1_df = Secondary_1_df_job.get()
Secondary_2_df = Secondary_2_df_job.get()
And you can use the dataframes, while Primary_df_job is loading.
Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
When you need Primary_df in your code, use
Primary_df = Primary_df_job.get()
This will block the execution until Primary_df_job is finished.
Unfortunately, due to GIL (Global Interpreter Lock) in Python, multiple threads do not run simultaneously — all threads use the same single CPU's core. That means if you create several threads to load your files, the total time will be equal (or actually greater) the time needed to load that files one-by-one.
More about GIL: https://wiki.python.org/moin/GlobalInterpreterLock
To speed up load time you can try to switch from csv/excel to pickle files (or HDF).
You give the hardware details but you do not give the most interesting part: the number of disks you have, the type of RAID you have and the filesystem you are reading from.
If you only have one disk, no RAID, and a regular filesystem (ext4, XFS, etc.), like you mostly have on laptops, you will not be able to increase the bandwidth simply by throwing CPUs (multithread or multiprocess) at the problem. Using multiple threads, or asynchronous I/Os will help mask the latency a bit, but will not increase the bandwidth, because chances are you are already saturating it with a single reader process.
So using the code suggested by #Cezary.Sz, try moving one of the file to a USB3.0 external storage, or to SDSX storage. If you are running on a large workstation, look at the hardware details to see if several disks are available, and if you run on a large cluster, look for a parallel filesystem (BeeGFS, Lustre, etc.)

Read in HDF5 dataset as fast as possible

I need to read in a very large H5 file from disk to memory as fast as possible.
I am currently attempting to read it in using multiple threads via the multiprocessing library, but I keep getting errors related to the fact that H5 files cannot be read in concurrently.
Here is a little snippet demonstrating the approach that I am taking:
import multiprocessing
import h5py
import numpy
f = h5py.File('/path/to/dataset.h5', 'r')
data = f['/Internal/Path/Dataset'] # this is just to get how big axis 0 is
dataset = numpy.zeros((300, 720, 1280)) # what to read the data into
def process_wrapper(frameCounter):
dataset[frameCounter] = f['/Internal/Path/Dataset']
#init objects
cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(cores)
jobs = []
#create jobs
frameCounter = 0
for frame in range(0, data.shape[0]): # iterate through axis 0 of data
jobs.append( pool.apply_async(process_wrapper,([frameCounter])) )
frameCounter += 1
#wait for all jobs to finish
for job in jobs:
job.get()
#clean up
pool.close()
I am looking for either a workaround that will allow me to use multiple readers on the H5 file or a different approach that would still allow me to read it in faster. Thanks

Categories