Parallelized loading of data into Pandas Dataframes [duplicate] - python

I have a requirement where I need to load three input files into pandas DataFrames and then merge two of them into a single DataFrame.
The file extension changes from run to run: it could be .txt one time and .xlsx or .csv another.
How can I run the loading in parallel to cut the waiting time?
This is my code at the moment:
from time import time # to measure the time taken to run the code
start_time = time()
Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"
import pandas as pd # to work with the data frames
Primary_df = pd.read_excel (Primary_File)
Secondary_1_df = pd.read_csv (Secondary_File_1)
Secondary_2_df = pd.read_csv (Secondary_File_2)
Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()
print(end_time - start_time)
It takes around 20 minutes to load Primary_df and Secondary_df, so I am looking for an efficient solution, possibly using parallel processing, to save time.
I timed the read operations, and they account for most of that: approximately 18 minutes 45 seconds.
Hardware config: Intel i5 processor, 16 GB RAM, 64-bit OS.
Question made eligible for bounty: I am looking for working code with detailed steps, using a package available in the Anaconda environment, that loads my input files in parallel and stores them in separate pandas DataFrames. This should eventually save time.

Try this:
from time import time
import pandas as pd
from multiprocessing.pool import ThreadPool
start_time = time()
pool = ThreadPool(processes=3)
Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"
# Define the functions the worker threads will run
def import_xlsx(file_name):
    df_xlsx = pd.read_excel(file_name)
    # print(df_xlsx.head())
    return df_xlsx
def import_csv(file_name):
    df_csv = pd.read_csv(file_name)
    # print(df_csv.head())
    return df_csv
# Submit the three reads to the pool
# (note: calling .get() right away waits for each read before the next one is submitted)
Primary_df = pool.apply_async(import_xlsx, (Primary_File, )).get()
Secondary_1_df = pool.apply_async(import_csv, (Secondary_File_1, )).get()
Secondary_2_df = pool.apply_async(import_csv, (Secondary_File_2, )).get()
Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()

Why not use asyncio instead of multiprocessing?
Rather than using multiple threads, you might first want to work at the I/O level with an async CSV dict reader (which can in turn be parallelized across files with multiprocessing). Afterwards, you can either concatenate the dicts and load the result into pandas, or load each set of dicts into pandas and concatenate there.
However, pandas does not support asyncio, so you will lose some performance at that step.
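Because the pandas readers are blocking calls, one way to drive them from asyncio is to hand each call to a worker thread. The following is a minimal sketch under assumptions, not a benchmarked solution: it needs Python 3.9+ for asyncio.to_thread, the names read_any and load_all are hypothetical, and .txt files are assumed to be comma-delimited.

```python
import asyncio

import pandas as pd


def read_any(path):
    """Pick a pandas reader from the file extension (hypothetical helper)."""
    if path.lower().endswith((".xlsx", ".xls")):
        return pd.read_excel(path)
    # Assumption: .csv and .txt inputs are comma-delimited text.
    return pd.read_csv(path)


async def load_all(paths):
    """Run the blocking readers in worker threads and wait for all of them."""
    tasks = [asyncio.to_thread(read_any, p) for p in paths]
    # gather() returns the results in the same order as `paths`.
    return await asyncio.gather(*tasks)
```

It could be called as `primary_df, sec1_df, sec2_df = asyncio.run(load_all([Primary_File, Secondary_File_1, Secondary_File_2]))`. Whether this saves time depends on how much of each read is spent waiting on the network share rather than parsing.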

Try @Cezary.Sz's code, but without the immediate calls to .get():
Primary_df_job = pool.apply_async(import_xlsx, (Primary_File, ))
Secondary_1_df_job = pool.apply_async(import_csv, (Secondary_File_1, ))
Secondary_2_df_job = pool.apply_async(import_csv, (Secondary_File_2, ))
Then
Secondary_1_df = Secondary_1_df_job.get()
Secondary_2_df = Secondary_2_df_job.get()
You can then work with these two DataFrames while Primary_df_job is still loading:
Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
When you need Primary_df in your code, use
Primary_df = Primary_df_job.get()
This will block the execution until Primary_df_job is finished.
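Put together, the submit-first, get-later pattern might look like the sketch below. The function name load_and_merge and the primary_reader parameter are assumptions added for illustration (the latter because the primary file's extension varies); the merge on 'ID' comes from the question.

```python
from multiprocessing.pool import ThreadPool

import pandas as pd


def load_and_merge(primary_file, secondary_file_1, secondary_file_2,
                   primary_reader=pd.read_excel):
    """Start all three reads at once, then collect the results."""
    with ThreadPool(processes=3) as pool:
        # Submit everything first; apply_async() returns immediately.
        primary_job = pool.apply_async(primary_reader, (primary_file,))
        sec1_job = pool.apply_async(pd.read_csv, (secondary_file_1,))
        sec2_job = pool.apply_async(pd.read_csv, (secondary_file_2,))
        # The merge only needs the two CSV reads to be finished.
        secondary_df = sec1_job.get().merge(sec2_job.get(), how="inner", on=["ID"])
        # Only this call waits for the primary file.
        primary_df = primary_job.get()
    return primary_df, secondary_df
```

Usage would be `Primary_df, Secondary_df = load_and_merge(Primary_File, Secondary_File_1, Secondary_File_2)`.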

Unfortunately, because of the GIL (Global Interpreter Lock), multiple Python threads do not execute Python code simultaneously: at any moment only one thread runs, so together they use no more than a single CPU core. That means that if you create several threads to load your files, the total time will be equal to (or actually greater than) the time needed to load the files one by one.
More about GIL: https://wiki.python.org/moin/GlobalInterpreterLock
To speed up load time you can try to switch from csv/excel to pickle files (or HDF).
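One way to try the pickle suggestion without changing the source files is to cache the parsed DataFrame next to the CSV. This is a rough sketch: the helper name read_csv_cached and the ".pkl" suffix are assumptions, and it only pays off when the same file is read more than once.

```python
import os

import pandas as pd


def read_csv_cached(csv_path, cache_path=None):
    """Read a CSV, reusing a pickle cache when it is at least as new as the source."""
    cache_path = cache_path or csv_path + ".pkl"
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(csv_path)):
        # Loading the pickle skips the text parsing entirely.
        return pd.read_pickle(cache_path)
    df = pd.read_csv(csv_path)
    df.to_pickle(cache_path)  # later runs can load this instead
    return df
```

The first call still pays the full parsing cost; subsequent calls read the binary file.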

You give the hardware details but you do not give the most interesting part: the number of disks you have, the type of RAID you have and the filesystem you are reading from.
If you only have one disk, no RAID, and a regular filesystem (ext4, XFS, etc.), like you mostly have on laptops, you will not be able to increase the bandwidth simply by throwing CPUs (multithread or multiprocess) at the problem. Using multiple threads, or asynchronous I/Os will help mask the latency a bit, but will not increase the bandwidth, because chances are you are already saturating it with a single reader process.
So, using the code suggested by @Cezary.Sz, try moving one of the files to USB 3.0 external storage, or to SDSX storage. If you are running on a large workstation, look at the hardware details to see whether several disks are available, and if you run on a large cluster, look for a parallel filesystem (BeeGFS, Lustre, etc.).
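To check whether a single reader is already saturating the storage, a rough measurement of raw sequential read speed can help. The sketch below uses only the standard library; the function name read_throughput_mb_s and the 8 MiB chunk size are arbitrary choices, not values from the answer.

```python
import time


def read_throughput_mb_s(path, chunk_size=8 * 1024 * 1024):
    """Read `path` sequentially and return the observed throughput in MiB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small files.
    return total / (1024 * 1024) / elapsed if elapsed > 0 else float("inf")
```

If this figure is close to what the disk or network share can deliver, and the pandas load takes much longer than the raw read, the time is going into parsing rather than I/O. Note that a second run on the same file may be served from the operating system's cache and look much faster.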

Related

Asynchronous processing in spark pipeline

I have a local Linux server with 4 cores. I am running a PySpark job on it locally, which reads two tables from a database and loads the data into two DataFrames. I then use these two DataFrames for some processing and save the resulting DataFrames to Elasticsearch. Below is the code:
import time
from pyspark.sql import SparkSession, SQLContext
# merged_python_file and logic1_python_file ... logic5_python_file are the asker's own modules
def save_to_es(df):
    df.write.format('es').option('es.nodes', 'es_node').option('es.port', some_port_no).option('es.resource', index_name).option('es.mapping', es_mappings).save()
def coreFun():
    start_time = int(time.time())  # start of the timed section (missing in the original)
    spark = SparkSession.builder.master("local[1]").appName('test').getOrCreate()
    spark.catalog.clearCache()
    spark.sparkContext.setLogLevel("ERROR")
    sc = spark.sparkContext
    sqlContext = SQLContext(sc)
    select_sql = """(select * from db."master_table")"""
    df_master = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql).option("user", "username").option("password", "password").option("driver", "database_driver").load()
    select_sql_child = """(select * from db."child_table")"""
    df_child = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql_child).option("user", "username").option("password", "password").option("driver", "database_driver").load()
    # Each of these steps depends only on the input DataFrames, not on the other steps
    merged_df = merged_python_file.merged_function(df_master,df_child,sqlContext)
    logic1_df = logic1_python_file.logic1_function(df_master,sqlContext)
    logic2_df = logic2_python_file.logic2_function(df_master,sqlContext)
    logic3_df = logic3_python_file.logic3_function(df_master,sqlContext)
    logic4_df = logic4_python_file.logic4_function(df_master,sqlContext)
    logic5_df = logic5_python_file.logic5_function(df_master,sqlContext)
    save_to_es(merged_df)
    save_to_es(logic1_df)
    save_to_es(logic2_df)
    save_to_es(logic3_df)
    save_to_es(logic4_df)
    save_to_es(logic5_df)
    end_time = int(time.time())
    print(end_time-start_time)
    sc.stop()
if __name__ == "__main__":
    coreFun()
The processing logic lives in separate Python files, e.g. logic1 in logic1_python_file. I pass df_master to each function, and each returns its processed DataFrame to the driver, which I then save to Elasticsearch.
It works, but everything happens sequentially: merged_df is processed first while the others wait, even though they do not depend on the output of merged_function; then logic1 is processed while the rest wait, and so on. This is not an ideal design, given that the output of one step does not depend on any other.
I am sure asynchronous processing can help here, but I am not sure how to implement it for my use case. I know I may have to use some kind of queue (JMS, Kafka, etc.), but I don't have the complete picture.
Please let me know how I can use asynchronous processing here. Any other input that would improve the job's performance is welcome.
If, during a single step such as merged_python_file.merged_function, only one CPU core is heavily utilized while the others are nearly idle, multiprocessing can speed things up. This can be done with Python's multiprocessing module. For more details, see the answers to "How to do parallel programming in Python?"
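The answer points to the multiprocessing module. As a simpler starting point, the sketch below swaps in threads from concurrent.futures, on the assumption that the driver spends most of each step waiting on Spark rather than running Python code. The helper run_independent is hypothetical and is shown with plain callables so that it runs without a Spark installation.

```python
from concurrent.futures import ThreadPoolExecutor


def run_independent(steps, *args, max_workers=None):
    """Run each callable in `steps` with the same arguments, concurrently.

    `steps` maps a label to a callable; the result maps label -> return value.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit every step before waiting on any of them.
        futures = {name: executor.submit(fn, *args) for name, fn in steps.items()}
        # result() blocks until that step is done and re-raises its exception, if any.
        return {name: fut.result() for name, fut in futures.items()}
```

In the question's setting this might be called as `run_independent({"logic1": logic1_python_file.logic1_function, "logic2": logic2_python_file.logic2_function}, df_master, sqlContext)`, with the save_to_es calls submitted the same way. Note that master("local[1]") gives Spark a single worker thread, so a setting such as local[4] would probably be needed before concurrent jobs can use the other cores.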

Reading different set of json files same time with python

I have two sets of JSON files, b and c. Each set normally contains 500-1000 files. Right now I am reading them separately. Can I read them at the same time using multi-threading? I have enough memory and processors.
import json
import pandas as pd
# yc = number of c files
# yb = number of b files
c_output_transaction_list = []
for num in range(yc):
    c_json_file = './output/d_c_' + str(num) + '.json'
    print(c_json_file)
    c_transaction_list = json.load(open(c_json_file))['data']['transaction_list']
    c_output_transaction_list.extend(c_transaction_list)
df_res_c = pd.DataFrame(c_output_transaction_list)
b_output_transaction_list = []
for num in range(yb):
    b_json_file = './output/d_b_' + str(num) + '.json'
    print(b_json_file)
    b_transaction_list = json.load(open(b_json_file))['data']['transaction_list']
    b_output_transaction_list.extend(b_transaction_list)
df_res_b = pd.DataFrame(b_output_transaction_list)
I use this method to read hundreds of files in parallel into a final DataFrame. Without your data, you'll have to verify that it does what you want; reading the multiprocessing docs will help. I use the same code on Linux (AWS EC2 reading S3 files) and on Windows reading the same S3 files, and I see a big time saving.
import json
import os
import pandas as pd
from multiprocessing import Pool
# Set the number of processes yourself or take cpu_count() from os. Playing around with this does make a difference; for me, using the maximum isn't always the fastest overall.
num_proc = os.cpu_count()
# Define the function that creates a DataFrame from one file.
# Note: this differs from your code, which builds one list and creates a DataFrame at the end.
def json_parse(c_json_file):
    with open(c_json_file) as f:
        c_transaction_list = json.load(f)['data']['transaction_list']
    return pd.DataFrame(c_transaction_list)
# This is the multiprocessing function that feeds the file names to the parsing function.
# If you don't pass num_proc, it defaults to 4.
def json_multiprocess(fn_list, num_proc=4):
    with Pool(num_proc) as pool:
        # map() is enough here because each task takes a single argument (the file name).
        # starmap() is for argument tuples, e.g. built with zip(); given bare strings it would split them into characters.
        r = pool.map(json_parse, fn_list, 15)
    return r
# The guard keeps worker processes from re-running this part on import (needed on Windows).
if __name__ == '__main__':
    # Build your file list first
    # yc = number of c files
    flist = []
    for num in range(yc):
        c_json_file = './output/d_c_' + str(num) + '.json'
        flist.append(c_json_file)
    # Get a list of your intermediate DataFrames
    dfs = json_multiprocess(flist, num_proc=num_proc)
    # Concatenate your DataFrames
    df_res_c = pd.concat(dfs)
Then do the same for your next set of files...
Use the example in Aelarion's comment to help structure the file

Using Dask to download, process, and save to csv

Problem
Part of my workflow involves downloading hundreds of thousands of files, parsing the data, and then saving it to CSV locally. I'm trying to set this workflow up with Dask, but it does not appear to be processing in parallel. The Dask dashboard shows low CPU % for each worker, and the task tab is empty. Status doesn't show anything either. htop doesn't appear to show more than 1 or 2 processes "running" at a time. I'm not sure how to proceed from here.
Related: How should I write multiple CSV files efficiently using dask.dataframe? (Older question that this question is based on)
Example
from dask.delayed import delayed
from dask import compute
from dask.distributed import Client, progress
import pandas as pd
import wget
import zipfile
import multiprocessing
def get_fn(dat):
    ### Download file and unzip based on input dat
    url = f"http://www.urltodownloadfrom.com/{dat['var1']}/{dat['var2']}.csv"
    wget.download(url)
    indat = unzip()
    ### Process file
    outdat = proc_dat(indat)
    ### Save file
    outdat.to_csv('file_path')
    ### Trash collection with custom download fn
    delete_downloads()
if __name__ == '__main__':
    ### Dask setup
    NCORES = multiprocessing.cpu_count() - 1
    client = Client(n_workers=NCORES, threads_per_worker=1)
    ### Build df of needed dates and variables
    beg_dat = "2020-01-01"
    end_dat = "2020-01-31"
    date_range = pd.date_range(beg_dat, end_dat)
    var = ["var1", "var2"]
    lst_ = [(x, y) for x in date_range for y in var]
    date = [x[0] for x in lst_]
    var = [x[1] for x in lst_]
    indf = pd.DataFrame({'date': date, 'var': var}).reset_index()
    ### Group by each row to process
    gb = indf.groupby('index')
    gb_i = [gb.get_group(x) for x in gb.groups]
    ### Start dask using delayed
    compute([delayed(get_fn)(thisRow) for thisRow in gb_i], scheduler='processes')
Dashboard (screenshot not preserved)
In this line:
compute([...], scheduler='processes')
you explicitly use a scheduler other than the distributed one you set up earlier in the script. If you do not specify scheduler= here, you will use the correct client, as it has been set as the default. You will see things appear in the dashboard.
Note that you might still not see high CPU usage, since it seems likely that most of the time is spent waiting for downloads.

Is it possible to operate on multiple txt files at once in python?

I have 1080 .txt files, each of which contains over 100k rows of values in three columns. I have to compute the average of the first column in each of these .txt files.
Any method that loops is proving too slow, as numpy.loadtxt loads only one file at a time.
The kicker is that I have 38 of these folders on which I need to perform this operation, so 38*1080 files in total. Timing each numpy.loadtxt call with the time module gives me around 1.7 seconds. So the total time to run over all folders is over 21 hours, which seems a bit too much.
So I am wondering whether there is a way to perform multiple operations at once: open several txt files and average the first column of each, then store the averages in the same order as the txt files, since the order is important.
Since I am a beginner, I'm not sure if this is even the fastest way. Thanks in advance.
import numpy as np
import glob
import os
i = 0
while i < 39:
    source_directory = "something/" + str(i)  # go to the specific numbered folder
    hw_array = sorted(glob.glob(source_directory + "/data_*.txt"))  # read paths of 1080 txt files
    velocity_array = np.zeros((30,36,3))
    for probe in hw_array:
        x = 35 - int((i-0.0001)/30)  # describing position of the probes where velocities are measured
        y = (30 - int(round((i)%30)))%30
        velocity_column = np.loadtxt(probe, usecols=(0))  # step that takes most time
        average_array = np.mean(velocity_column, axis=0)
        velocity_array[y,x,0] = average_array
        velocity_array[y,x,1] = y*(2/29)
        velocity_array[y,x,2] = x*0.5
    np.save(r"C:/Users/md101/Desktop/AE2/Project2/raw/" + "R29" + "/data" + "R29", velocity_array)  # save velocity array for use in later analysis
    i += 1
Python has fairly slow I/O, and most of your time is spent talking to the operating system and on other costs associated with opening files.
Threading in Python is unusual and only provides an improvement in certain situations; here is why it suits your case. A thread runs only while it holds the GIL (global interpreter lock; read about it). While it is waiting for something, such as I/O, it gives up the GIL to another thread. So one thread can operate on its file (averaging the first column) while it holds the GIL, and while another file is being opened, the GIL passes to a different thread to do its work.
It's entirely possible to write a function that loads the files from one directory and spawn one process per directory, finishing in close to 1/39th of the time it was taking. Or parallelize by file rather than by directory: queue up the work and have the workers read from that queue.
Some pseudocode:
import multiprocessing
import os
pool = multiprocessing.Pool()
workers = []
for d in os.listdir("."):
    for f in os.listdir(d):
        # apply_async() returns immediately with a handle to the pending result
        workers.append(pool.apply_async(counter_function_for_file, (os.path.join(d, f),)))
s = sum(worker.get() for worker in workers)
...and put your code for reading from the file in that counter_function_for_file(filename) function.
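Since the question needs the averages in file order, an order-preserving map over the sorted file list may be a closer fit than summing the results. The sketch below is one possible shape: the names first_column_mean and folder_means and the default of 8 workers are assumptions, and it uses a thread pool so that it runs as-is.

```python
import glob
import os
from multiprocessing.pool import ThreadPool

import numpy as np


def first_column_mean(path):
    """Average of the first column of one whitespace-delimited text file."""
    return float(np.loadtxt(path, usecols=0, ndmin=1).mean())


def folder_means(folder, pattern="data_*.txt", workers=8):
    """Return (files, means) for one folder, both in sorted filename order."""
    files = sorted(glob.glob(os.path.join(folder, pattern)))
    with ThreadPool(workers) as pool:
        # map() returns results in the same order as `files`.
        means = pool.map(first_column_mean, files)
    return files, means
```

If parsing rather than file access turns out to dominate, swapping ThreadPool for multiprocessing.Pool (with the calling code under an `if __name__ == "__main__":` guard) would be the variant the answer describes. The sort is lexicographic, so data_10.txt comes before data_2.txt unless the numbers are zero-padded.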
