Read in HDF5 dataset as fast as possible - python

I need to read in a very large H5 file from disk to memory as fast as possible.
I am currently attempting to read it in using multiple processes via the multiprocessing library, but I keep getting errors related to the fact that H5 files cannot be read concurrently.
Here is a little snippet demonstrating the approach that I am taking:
import multiprocessing

import h5py
import numpy

f = h5py.File('/path/to/dataset.h5', 'r')
data = f['/Internal/Path/Dataset']  # this is just to get how big axis 0 is
dataset = numpy.zeros((300, 720, 1280))  # what to read the data into

def process_wrapper(frameCounter):
    # read one frame (a slice along axis 0) into the preallocated array
    dataset[frameCounter] = f['/Internal/Path/Dataset'][frameCounter]

# init objects
cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(cores)
jobs = []

# create jobs
frameCounter = 0
for frame in range(0, data.shape[0]):  # iterate through axis 0 of data
    jobs.append(pool.apply_async(process_wrapper, (frameCounter,)))
    frameCounter += 1

# wait for all jobs to finish
for job in jobs:
    job.get()

# clean up
pool.close()
I am looking for either a workaround that will allow me to use multiple readers on the H5 file or a different approach that would still allow me to read it in faster. Thanks
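One common workaround is to give every worker process its own read-only file handle instead of sharing the handle opened in the parent. Below is a minimal sketch of that idea, not the asker's code: it reuses the dataset path and layout from the snippet above, and each worker returns its frame to the parent (rather than writing into a shared array, since plain multiprocessing does not share memory).

import multiprocessing

import h5py
import numpy

H5_PATH = '/path/to/dataset.h5'   # same file as in the snippet above
DSET = '/Internal/Path/Dataset'   # same internal path as above

def read_frame(frameCounter):
    # Each worker opens its own read-only handle, so no handle is shared
    # across processes and the concurrent-read errors are avoided.
    with h5py.File(H5_PATH, 'r') as f:
        return f[DSET][frameCounter]

if __name__ == '__main__':
    with h5py.File(H5_PATH, 'r') as f:
        n_frames = f[DSET].shape[0]
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        frames = pool.map(read_frame, range(n_frames))
    dataset = numpy.stack(frames)  # shape (n_frames, 720, 1280)

Opening the file once per worker (for example via a Pool initializer) rather than once per frame would cut the overhead further. Whether any of this actually speeds things up depends on the storage: if a single reader already saturates the disk, more processes will not help (see the disk/RAID discussion in the related answers below).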

Related

Parallelized loading of data into Pandas Dataframes [duplicate]

I have a requirement where I have three input files and need to load them into pandas DataFrames, before merging two of the files into a single DataFrame.
The file extension always changes; it could be .txt one time and .xlsx or .csv another time.
How can I run this process in parallel, in order to save the waiting/loading time?
This is my code at the moment,
from time import time  # to measure the time taken to run the code
import pandas as pd    # to work with the data frames

start_time = time()

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"

Primary_df = pd.read_excel(Primary_File)
Secondary_1_df = pd.read_csv(Secondary_File_1)
Secondary_2_df = pd.read_csv(Secondary_File_2)

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])

end_time = time()
print(end_time - start_time)
It takes around 20 minutes for me to load my primary_df and secondary_df, so I am looking for an efficient solution, possibly using parallel processing, to save time.
I timed my reading operations and they take most of the time: approximately 18 minutes 45 seconds.
Hardware config: Intel i5 processor, 16 GB RAM and a 64-bit OS.
Question made eligible for bounty: I am looking for working code with detailed steps, using a package within the Anaconda environment, that supports loading my input files in parallel and storing them in separate pandas DataFrames. This should eventually save time.
Try this:
from time import time
from multiprocessing.pool import ThreadPool

import pandas as pd

start_time = time()
pool = ThreadPool(processes=3)

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"

# Define a function for the thread
def import_xlsx(file_name):
    df_xlsx = pd.read_excel(file_name)
    # print(df_xlsx.head())
    return df_xlsx

def import_csv(file_name):
    df_csv = pd.read_csv(file_name)
    # print(df_csv.head())
    return df_csv

# Create the threads as follows
Primary_df = pool.apply_async(import_xlsx, (Primary_File, )).get()
Secondary_1_df = pool.apply_async(import_csv, (Secondary_File_1, )).get()
Secondary_2_df = pool.apply_async(import_csv, (Secondary_File_2, )).get()

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()
Why not use asyncio over multiprocessing?
Instead of using multiple threads, you might want to start at the I/O level with an async CSV dict reader (which can be parallelized across multiple files using multiprocessing). Afterwards, you can either concatenate the dicts and then load them into pandas, or load the individual dicts into pandas and concatenate there.
However, pandas does not support asyncio, so you will lose some performance at some point.
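As a rough sketch of the asyncio route (my own assumption, not the async dict reader mentioned above): the blocking pandas readers can be pushed onto worker threads with asyncio.to_thread (Python 3.9+), reusing the file paths from the question.

import asyncio

import pandas as pd

# Paths reused from the question above
Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"

async def load_all():
    # pandas readers are blocking, so run each one in a worker thread
    return await asyncio.gather(
        asyncio.to_thread(pd.read_excel, Primary_File),
        asyncio.to_thread(pd.read_csv, Secondary_File_1),
        asyncio.to_thread(pd.read_csv, Secondary_File_2),
    )

Primary_df, Secondary_1_df, Secondary_2_df = asyncio.run(load_all())
Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])

Because the actual reading still happens in threads, the GIL discussion further down applies to this approach as well.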
Alternatively, try @Cezary.Sz's code, but delete the calls to .get() and instead use:
Primary_df_job = pool.apply_async(import_xlsx, (Primary_File, ))
Secondary_1_df_job = pool.apply_async(import_csv, (Secondary_File_1, ))
Secondary_2_df_job = pool.apply_async(import_csv, (Secondary_File_2, ))
Then
Secondary_1_df = Secondary_1_df_job.get()
Secondary_2_df = Secondary_2_df_job.get()
You can then use these dataframes while Primary_df_job is still loading.
Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
When you need Primary_df in your code, use
Primary_df = Primary_df_job.get()
This will block the execution until Primary_df_job is finished.
Unfortunately, due to the GIL (Global Interpreter Lock) in Python, multiple threads do not run Python code simultaneously; all threads share a single CPU core. That means if you create several threads to load your files, the total time will be equal to (or actually greater than) the time needed to load those files one by one.
More about GIL: https://wiki.python.org/moin/GlobalInterpreterLock
To speed up load time you can try to switch from csv/excel to pickle files (or HDF).
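A minimal sketch of that switch (the file names here are illustrative, and the dataframes are assumed to have been loaded once already by the code above): write them out as pickle after the first slow load, then reload the pickles on subsequent runs, which is typically much faster than re-parsing Excel/CSV.

import pandas as pd

# One-time conversion after the initial (slow) load; file names are illustrative
Primary_df.to_pickle("Report.pkl")
Secondary_1_df.to_pickle("Report2_1.pkl")
Secondary_2_df.to_pickle("Report2_2.pkl")

# Subsequent runs: reload the binary files instead of parsing Excel/CSV
Primary_df = pd.read_pickle("Report.pkl")
Secondary_1_df = pd.read_pickle("Report2_1.pkl")
Secondary_2_df = pd.read_pickle("Report2_2.pkl")

pd.DataFrame.to_hdf and pd.read_hdf work similarly if the optional PyTables dependency is installed.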
You give the hardware details but you do not give the most interesting part: the number of disks you have, the type of RAID you have and the filesystem you are reading from.
If you only have one disk, no RAID, and a regular filesystem (ext4, XFS, etc.), as you typically have on a laptop, you will not be able to increase the bandwidth simply by throwing CPUs (multithreading or multiprocessing) at the problem. Using multiple threads or asynchronous I/O will help mask the latency a bit, but will not increase the bandwidth, because chances are you are already saturating it with a single reader process.
So, using the code suggested by @Cezary.Sz, try moving one of the files to USB 3.0 external storage or to SDXC storage. If you are running on a large workstation, look at the hardware details to see if several disks are available, and if you run on a large cluster, look for a parallel filesystem (BeeGFS, Lustre, etc.).

Reading different set of json files same time with python

I have two sets of files, b and c (JSON). The number of files in each set is normally between 500 and 1000. Right now I am reading them separately. Can I read both sets at the same time using multi-threading? I have enough memory and processors.
yc = ...  # number of c files
yb = ...  # number of b files

c_output_transaction_list = []
for num in range(yc):
    c_json_file = './output/d_c_' + str(num) + '.json'
    print(c_json_file)
    c_transaction_list = json.load(open(c_json_file))['data']['transaction_list']
    c_output_transaction_list.extend(c_transaction_list)
df_res_c = pd.DataFrame(c_output_transaction_list)

b_output_transaction_list = []
for num in range(yb):
    b_json_file = './output/d_b_' + str(num) + '.json'
    print(b_json_file)
    b_transaction_list = json.load(open(b_json_file))['data']['transaction_list']
    b_output_transaction_list.extend(b_transaction_list)
df_res_b = pd.DataFrame(b_output_transaction_list)
I use this method to read hundreds of files in parallel into a final dataframe. Without having your data, you'll have to verify this does what you want. Reading the multiprocessing help docs will assist. I use the same code on Linux (an AWS EC2 instance reading S3 files) and on Windows reading the same S3 files, and I find big time savings doing this.
import json
import os
from multiprocessing import Pool

import pandas as pd

# You can set the number of processes yourself or just take cpu_count from the
# os module. Playing around with this does make a difference; for me, using the
# max isn't always the fastest overall time.
num_proc = os.cpu_count()

# Define the function that creates a dataframe from a single file.
# Note: this differs from the original code, which builds one list and creates
# a dataframe at the end.
def json_parse(c_json_file):
    c_transaction_list = json.load(open(c_json_file))['data']['transaction_list']
    return pd.DataFrame(c_transaction_list)

# This is the multiprocessing function that feeds the file names to the parser.
# If you don't pass num_proc, it defaults to 4.
def json_multiprocess(fn_list, num_proc=4):
    with Pool(num_proc) as pool:
        # I use starmap; you may just be able to use map.
        # If you pass more than the file name, starmap handles zip() very well.
        r = pool.starmap(json_parse, fn_list, 15)
        pool.close()
        pool.join()
    return r

# Build your file list first. starmap expects an iterable of argument tuples,
# so each file name is wrapped in a one-element tuple.
yc = ...  # number of c files
flist = []
for num in range(yc):
    c_json_file = './output/d_c_' + str(num) + '.json'
    flist.append((c_json_file,))

# Get a list of your intermediate dataframes
dfs = json_multiprocess(flist, num_proc=num_proc)

# Concatenate your dataframes
df_res_c = pd.concat(dfs)
Then do the same for your next set of files...
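For instance, a minimal sketch for the b files (reusing json_multiprocess and num_proc from the code above; yb is the number of b files, as in the question):

# Build the list of b files the same way, again as one-element tuples
yb = ...  # number of b files
b_flist = [('./output/d_b_' + str(num) + '.json',) for num in range(yb)]

# Parse them in parallel and concatenate
df_res_b = pd.concat(json_multiprocess(b_flist, num_proc=num_proc))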
Use the example in Aelarion's comment to help structure the file

Using Dask to download, process, and save to csv

Problem
Part of my workflow involves downloading hundreds of thousands of files, parsing the data, and then saving to csv locally. I'm trying to set this workflow up with Dask, but it does not appear to be processing in parallel. The Dask dashboard shows low CPU % for each worker and the task tab is empty. Status doesn't show anything either. htop doesn't appear to show more than 1 or 2 tasks "running" at a time. I'm not sure how to proceed from here.
Related: How should I write multiple CSV files efficiently using dask.dataframe? (Older question that this question is based on)
Example
import multiprocessing
import zipfile

import pandas as pd
import wget
from dask import compute
from dask.delayed import delayed
from dask.distributed import Client, progress

def get_fn(dat):
    ### Download file and unzip based on input dat
    url = f"http://www.urltodownloadfrom.com/{dat['var1']}/{dat['var2']}.csv"
    wget.download(url)
    indat = unzip()
    ### Process file
    outdat = proc_dat(indat)
    ### Save file
    outdat.to_csv('file_path')
    ### Trash collection with custom download fn
    delete_downloads()

if __name__ == '__main__':
    ### Dask setup
    NCORES = multiprocessing.cpu_count() - 1
    client = Client(n_workers=NCORES, threads_per_worker=1)

    ### Build df of needed dates and variables
    beg_dat = "2020-01-01"
    end_dat = "2020-01-31"
    date_range = pd.date_range(beg_dat, end_dat)
    var = ["var1", "var2"]
    lst_ = [(x, y) for x in date_range for y in var]
    date = [x[0] for x in lst_]
    var = [x[1] for x in lst_]
    indf = pd.DataFrame({'date': date, 'var': var}).reset_index()

    ### Group by each row to process
    gb = indf.groupby('index')
    gb_i = [gb.get_group(x) for x in gb.groups]

    ### Start dask using delayed
    compute([delayed(get_fn)(thisRow) for thisRow in gb_i], scheduler='processes')
Dashboard
(screenshot of the Dask dashboard omitted)
In this line:
compute([...], scheduler='processes')
you explicitly use a scheduler other than the distributed one you set up earlier in the script. If you do not specify scheduler= here, you will use the correct client, as it has been set as the default. You will see things appear in the dashboard.
Note that you might still not see high CPU usage, since it seems likely that most of the time is waiting for downloads.
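A minimal sketch of the corrected call, following the answer above (simply drop the scheduler= argument so the distributed Client created earlier is used as the default):

    ### Start dask using delayed, letting the distributed Client act as the default scheduler
    compute([delayed(get_fn)(thisRow) for thisRow in gb_i])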

BigQuery Storage. Python. Reading multiple streams in parallel issue (multiprocessing)

I am trying to use the BALANCED ShardingStrategy to get more than one stream, and the Python multiprocessing lib to read the streams in parallel.
However, when reading the streams in parallel, the same number of rows and the same data are returned. If I understand correctly, no data is assigned to any stream before it starts reading and is finalized, so two parallel streams try to read the same data, and as a result part of the data is never read.
Using the LIQUID strategy we can read all the data from one stream, which cannot be split.
According to the documentation it is possible to read multiple streams in parallel with the BALANCED one. However, I cannot figure out how to read in parallel and assign different data to each stream.
I have the following toy code:
import multiprocessing
import os
from multiprocessing import Pool

import google.auth
import pandas as pd
from google.cloud import bigquery_storage_v1beta1

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'key.json'
credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
bq_storage_client = bigquery_storage_v1beta1.BigQueryStorageClient(credentials=credentials)

table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "bigquery-public-data"
table_ref.dataset_id = "ethereum_blockchain"
table_ref.table_id = "contracts"
parent = "projects/{}".format(your_project_id)

session = bq_storage_client.create_read_session(
    table_ref,
    parent,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
)

def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(
        bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]),
        timeout=100000,
    ).to_arrow(session).to_pandas()
    return reader

if __name__ == '__main__':
    p = Pool(2)
    output = p.map(read_rows, [i for i in range(0, 2)])
    print(output)
I need assistance in getting multiple streams read in parallel. Probably there is a way to assign data to a stream before the reading starts. Any code examples, explanations, or tips would be appreciated.
I apologize for the partial answer, but it didn't fit in a comment.
LIQUID or BALANCED just affect how data is allocated to streams, not the fact that data arrives in multiple streams (see here).
When I ran a variant of your code with this read_rows function, I saw different data for the first row of each stream, so I was unable to replicate your problem of seeing the same data on this dataset with either sharding strategy.
def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(
        bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]),
        timeout=100000,
    )
    for row in reader.rows(session):
        print(row)
        break
I was running this code on a Linux compute engine instance.
I do worry that the output you are asking for in the map call is otherwise going to be quite large, however.
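If the goal is the full per-stream data rather than a spot check, a small usage sketch (reusing the asker's original read_rows, which already returns one DataFrame per stream) would simply concatenate the pool's output:

import pandas as pd

# output is the list of per-stream DataFrames produced by p.map(read_rows, ...)
df = pd.concat(output, ignore_index=True)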
