I have a local Linux server with 4 cores. I am running a PySpark job on it locally, which reads two tables from a database into two DataFrames. I then use these two DataFrames for some processing and save the resulting processed DataFrames into Elasticsearch. Below is the code:
import time

from pyspark.sql import SparkSession, SQLContext


def save_to_es(df):
    df.write.format('es').option('es.nodes', 'es_node').option('es.port', some_port_no).option('es.resource', index_name).option('es.mapping', es_mappings).save()


def coreFun():
    start_time = int(time.time())
    spark = SparkSession.builder.master("local[1]").appName('test').getOrCreate()
    spark.catalog.clearCache()
    spark.sparkContext.setLogLevel("ERROR")
    sc = spark.sparkContext
    sqlContext = SQLContext(sc)

    select_sql = """(select * from db."master_table")"""
    df_master = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql).option("user", "username").option("password", "password").option("driver", "database_driver").load()

    select_sql_child = """(select * from db."child_table")"""
    df_child = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql_child).option("user", "username").option("password", "password").option("driver", "database_driver").load()

    merged_df = merged_python_file.merged_function(df_master, df_child, sqlContext)
    logic1_df = logic1_python_file.logic1_function(df_master, sqlContext)
    logic2_df = logic2_python_file.logic2_function(df_master, sqlContext)
    logic3_df = logic3_python_file.logic3_function(df_master, sqlContext)
    logic4_df = logic4_python_file.logic4_function(df_master, sqlContext)
    logic5_df = logic5_python_file.logic5_function(df_master, sqlContext)

    save_to_es(merged_df)
    save_to_es(logic1_df)
    save_to_es(logic2_df)
    save_to_es(logic3_df)
    save_to_es(logic4_df)
    save_to_es(logic5_df)

    end_time = int(time.time())
    print(end_time - start_time)
    sc.stop()


if __name__ == "__main__":
    coreFun()
The different processing logics are written in separate Python files, e.g. logic1 in logic1_python_file, etc. I send df_master to the separate functions and they return the resulting processed DataFrame back to the driver. I then save this processed DataFrame into Elasticsearch.
It works fine, but the problem is that everything happens sequentially: first merged_df gets processed while the others simply wait, even though they do not depend on the output of the merged function; then logic1_df gets processed while the others wait, and so on. This is not an ideal design, considering the output of one logic is not dependent on another.
I am sure asynchronous processing can help here, but I am not sure how to implement it for my use case. I know I may have to use some kind of queue (JMS, Kafka, etc.) to accomplish this, but I do not have a complete picture.
Please let me know how I can use asynchronous processing here. Any other input that can help improve the performance of the job is welcome.
If, during the processing of a single step like merged_python_file.merged_function, only one CPU core is heavily utilized and the others are nearly idle, multiprocessing can speed things up. It can be achieved with Python's multiprocessing module. For more details, see the answers on How to do parallel programming in Python?
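For this specific Spark job there is also a thread-based alternative worth sketching (a hedged suggestion, not the multiprocessing approach above): Spark DataFrames cannot be pickled to child processes, but Spark's job scheduler is thread-safe, so the independent logic/save steps can be submitted from driver-side threads, with master("local[4]") instead of local[1] so all four cores can serve the concurrent jobs. The sketch below reuses the names from the question's code and assumes the logic functions are truly independent.

from concurrent.futures import ThreadPoolExecutor

def process_and_save(logic_function, df, sqlContext):
    # Each thread builds its DataFrame and triggers its own Spark write job.
    result_df = logic_function(df, sqlContext)
    save_to_es(result_df)

logic_functions = [
    logic1_python_file.logic1_function,
    logic2_python_file.logic2_function,
    logic3_python_file.logic3_function,
    logic4_python_file.logic4_function,
    logic5_python_file.logic5_function,
]

with ThreadPoolExecutor(max_workers=len(logic_functions)) as executor:
    futures = [executor.submit(process_and_save, fn, df_master, sqlContext)
               for fn in logic_functions]
    for future in futures:
        future.result()  # re-raise any exception from the worker threads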
Related
I have a requirement where I have three input files that need to be loaded into Pandas DataFrames, before merging two of them into a single DataFrame.
The file extension always changes: it could be .txt one time and .xlsx or .csv another time.
How can I run this process in parallel, in order to save the waiting/loading time?
This is my code at the moment:
from time import time  # to measure the time taken to run the code
import pandas as pd  # to work with the data frames

start_time = time()

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"

Primary_df = pd.read_excel(Primary_File)
Secondary_1_df = pd.read_csv(Secondary_File_1)
Secondary_2_df = pd.read_csv(Secondary_File_2)

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])

end_time = time()
print(end_time - start_time)
It takes around 20 minutes to load my Primary_df and Secondary_df, so I am looking for an efficient solution, possibly using parallel processing, to save time.
I timed my reading operations, and they take most of the time: approximately 18 minutes 45 seconds.
Hardware config: Intel i5 processor, 16 GB RAM, 64-bit OS.
Question made eligible for bounty: I am looking for working code with detailed steps, using a package within the Anaconda environment, that loads my input files in parallel and stores them in separate Pandas DataFrames. This should eventually save time.
Try this:
from time import time
import pandas as pd
from multiprocessing.pool import ThreadPool

start_time = time()
pool = ThreadPool(processes=3)

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"

# Define functions for the worker threads
def import_xlsx(file_name):
    df_xlsx = pd.read_excel(file_name)
    # print(df_xlsx.head())
    return df_xlsx

def import_csv(file_name):
    df_csv = pd.read_csv(file_name)
    # print(df_csv.head())
    return df_csv

# Submit the three reads to the thread pool
Primary_df = pool.apply_async(import_xlsx, (Primary_File, )).get()
Secondary_1_df = pool.apply_async(import_csv, (Secondary_File_1, )).get()
Secondary_2_df = pool.apply_async(import_csv, (Secondary_File_2, )).get()

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()
Why not use asyncio over multiprocessing?
Instead of using multiple threads, you might want to first exploit parallelism at the I/O level with an async CSV dict reader (which can itself be parallelized with multiprocessing across multiple files). Afterwards, you can either concatenate the dicts and then load them into pandas, or load the individual dicts into pandas and concatenate there.
However, pandas does not support asyncio, so you will have a performance loss at some point.
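One way to stay in asyncio while keeping pandas is to push the blocking reads onto executor threads, so the three file reads overlap instead of running back to back. This is only a minimal sketch of that workaround (not the async CSV dict reader mentioned above), reusing the file-path variables from the earlier code.

import asyncio
import pandas as pd

async def load_all(primary_path, csv_path_1, csv_path_2):
    loop = asyncio.get_running_loop()
    # run_in_executor offloads the blocking pandas readers to threads,
    # and asyncio.gather waits for all three concurrently.
    primary, sec1, sec2 = await asyncio.gather(
        loop.run_in_executor(None, pd.read_excel, primary_path),
        loop.run_in_executor(None, pd.read_csv, csv_path_1),
        loop.run_in_executor(None, pd.read_csv, csv_path_2),
    )
    return primary, sec1, sec2

Primary_df, Secondary_1_df, Secondary_2_df = asyncio.run(
    load_all(Primary_File, Secondary_File_1, Secondary_File_2)
)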
Try using @Cezary.Sz's code, but delete the calls to .get() and instead do:
Primary_df_job = pool.apply_async(import_xlsx, (Primary_File, ))
Secondary_1_df_job = pool.apply_async(import_csv, (Secondary_File_1, ))
Secondary_2_df_job = pool.apply_async(import_csv, (Secondary_File_2, ))
Then
Secondary_1_df = Secondary_1_df_job.get()
Secondary_2_df = Secondary_2_df_job.get()
You can then use these dataframes while Primary_df_job is still loading.
Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
When you need Primary_df in your code, use
Primary_df = Primary_df_job.get()
This will block the execution until Primary_df_job is finished.
Unfortunately, due to the GIL (Global Interpreter Lock) in Python, multiple threads do not run simultaneously: all threads use the same single CPU core. That means if you create several threads to load your files, the total time will be equal to (or actually greater than) the time needed to load the files one by one.
More about GIL: https://wiki.python.org/moin/GlobalInterpreterLock
To speed up load time you can try to switch from csv/excel to pickle files (or HDF).
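As a rough sketch of the pickle suggestion (the .pkl file names here are just illustrative): do a one-off conversion from the slow formats, then load the cached binary files on subsequent runs.

import pandas as pd

# One-off conversion: parse the slow formats once, then cache as pickle.
pd.read_excel(Primary_File).to_pickle("primary.pkl")
pd.read_csv(Secondary_File_1).to_pickle("secondary_1.pkl")

# Subsequent runs load the cached binary files, which is usually much faster
# than re-parsing .xlsx/.csv.
Primary_df = pd.read_pickle("primary.pkl")
Secondary_1_df = pd.read_pickle("secondary_1.pkl")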
You give the hardware details, but you do not give the most interesting part: the number of disks you have, the type of RAID, and the filesystem you are reading from.
If you only have one disk, no RAID, and a regular filesystem (ext4, XFS, etc.), as you mostly have on laptops, you will not be able to increase the bandwidth simply by throwing CPUs (multithreaded or multiprocess) at the problem. Using multiple threads or asynchronous I/O will help mask the latency a bit, but will not increase the bandwidth, because chances are you are already saturating it with a single reader process.
So, using the code suggested by @Cezary.Sz, try moving one of the files to USB 3.0 external storage or to SDXC storage. If you are running on a large workstation, look at the hardware details to see whether several disks are available, and if you run on a large cluster, look for a parallel filesystem (BeeGFS, Lustre, etc.).
I am doing a simulation where I compute some quantities over several time steps. For each time step I want to save a parquet file where each line corresponds to one simulation. It looks like this:
def simulation():
    nsim = 3
    timesteps = [1, 2]
    data = {}  # initialization not shown here
    for i in range(nsim):
        compute_stuff()
        for j in timesteps:
            data[str(j)] = compute_some_other_stuff()
    return data
Once I have my dict data containing the results of the simulation (as numpy arrays), I transform it into dask.DataFrame objects and then save them to file using the .to_parquet() method, as follows:
def save(data):
    for i in data.keys():
        data[i] = pd.DataFrame(data[i], bins=...)
        df = from_pandas(data[i], npartitions=2)
        df.to_parquet(datafolder + i + "/", engine="pyarrow", append=True, ignore_divisions=True)
When I use this code only once it works perfectly; the struggle arises when I try to run it in parallel. Using dask, I do:
from dask import compute, delayed
from dask.distributed import Client

client = Client(n_workers=10, processes=True)

def f(n):
    data = simulation()
    save(data)

to_compute = [delayed(f)(n) for n in range(20)]
compute(to_compute)
The behaviour of this last portion of code is quite random. At some point this happens:
distributed.worker - WARNING - Compute Failed
Function: f
args: (4)
kwargs: {}
Exception: "ArrowInvalid('Parquet file size is 0 bytes')"
....
distributed.worker - WARNING - Compute Failed
Function: f
args: (12)
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
I think these errors are due to the fact that two processes try to write to the same parquet file at the same time, and that is not handled well (as it can be for a text file). I have already tried switching to pySpark / Koalas without much success. Is there a better way to save the results along the way during a simulation (in case of a crash / wall-time limit on a cluster)?
You are making a classic dask mistake: invoking the dask API from within functions that are themselves delayed. The error indicates that things are happening in parallel (which is what dask does!) that are not expected to change during processing. Specifically, a file is clearly being edited by one task while another one is reading it (not sure which).
What you probably want to do is use concat on the dataframe pieces and then make a single call to to_parquet.
Note that it seems all of your data is actually held in the client, and you are using from_pandas. This seems like a bad idea, since you are missing out on one of dask's biggest features: loading data only when needed. You should instead load your data inside delayed functions or dask dataframe API calls.
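A hedged sketch of that restructuring: let each delayed task return a plain pandas DataFrame, build one dask DataFrame from the delayed pieces, and write the parquet dataset once. run_one_simulation and its columns are placeholders standing in for the question's simulation code.

import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client

@dask.delayed
def run_one_simulation(n):
    # Placeholder for simulation(): one pandas DataFrame per run, with a
    # timestep column instead of one file per timestep.
    rows = []
    for timestep in (1, 2):
        rows.append({"simulation": n, "timestep": timestep, "value": float(n * timestep)})
    return pd.DataFrame(rows)

if __name__ == "__main__":
    client = Client(n_workers=10, processes=True)
    pieces = [run_one_simulation(n) for n in range(20)]
    ddf = dd.from_delayed(pieces)                     # no data is held on the client
    ddf.to_parquet("datafolder/", engine="pyarrow")   # a single, coordinated write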
I am trying to use the BALANCED ShardingStrategy to get more than one stream, and Python's multiprocessing library to read the streams in parallel.
However, when reading the streams in parallel, the same number of rows and the same data are returned. If I understand correctly, no data is assigned to any stream before it starts reading and is finalized, so the two parallel streams try to read the same data, and as a result part of the data is never read.
Using the LIQUID strategy, we can read all the data from one stream, which cannot be split.
According to the documentation it is possible to read multiple streams in parallel with the BALANCED one. However, I cannot figure out how to read in parallel and have different data assigned to each stream.
I have the following toy code:
import os
import multiprocessing
from multiprocessing import Pool

import pandas as pd
import google.auth
from google.cloud import bigquery_storage_v1beta1

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'key.json'
credentials, your_project_id = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])

bq_storage_client = bigquery_storage_v1beta1.BigQueryStorageClient(credentials=credentials)

table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "bigquery-public-data"
table_ref.dataset_id = "ethereum_blockchain"
table_ref.table_id = "contracts"
parent = "projects/{}".format(your_project_id)

session = bq_storage_client.create_read_session(
    table_ref,
    parent,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
)

def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(
        bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]),
        timeout=100000,
    ).to_arrow(session).to_pandas()
    return reader

if __name__ == '__main__':
    p = Pool(2)
    output = p.map(read_rows, [i for i in range(0, 2)])
    print(output)
I need assistance to get multiple streams read in parallel.
There is probably a way to assign data to a stream before the reading starts. Any code examples, explanations, or tips would be appreciated.
I apologize for the partial answer, but it didn't fit in a comment.
LIQUID or BALANCED just affect how data is allocated to streams, not the fact that data arrives in multiple streams (see here).
When I ran a variant of your code with the read_rows function below, I saw different data for the first row of both streams, so I was unable to replicate your problem of seeing the same data on this dataset with either sharding strategy.
def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(
        bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]),
        timeout=100000,
    )
    for row in reader.rows(session):
        print(row)
        break
I was running this code on a Linux compute engine instance.
I do worry, however, that the output you are asking for in the map call is going to be quite large.
I'm iterating over M dataframes, each containing a column with N URLs. For each URL, I extract the paragraph text, then do standard cleaning for textual analysis before calculating "sentiment" scores.
Is it more efficient for me to:
1. continue as is (compute the scores inside the URL for-loop itself), or
2. extract all of the text from the URLs first, and then separately iterate over the list/column of text?
Or does it not make any difference?
Currently I'm running the calculations within the loop itself. Each DF has about 15,000-20,000 URLs, so it's taking an insane amount of time too!
import pandas as pd
import requests
from bs4 import BeautifulSoup

# DFs are stored on a website.
# I extract links to each .csv file and store them as a list in "df_links".
for link in df_links:
    cleaned_articles = []
    df = pd.read_csv(link, sep="\t", header=None)
    # Conduct df cleaning
    # URLs for articles to scrape are stored in 1 column, which I iterate over as...
    for url in df['article_url']:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        para_text = [text.get_text() for text in soup.findAll('p')]
        text = " ".join(para_text)
        words = text.split()
        if len(words) > 500:
            # Conduct text cleaning & scores computations
            # Cleaned text stored as a variable "clean_text"
            cleaned_articles.append(clean_text)
    df['article_text'] = cleaned_articles
    df.to_csv('file_name.csv')
To answer the question: it shouldn't make much of a difference whether you download all the data first and then apply the analysis to it. You'd just be rearranging the order in which you do a set of tasks that would effectively take the same time.
The only difference may be if the text corpora are rather large, in which case read/write time to disk starts to play a part, so it could be a little faster to run the analytics entirely in memory. But this still isn't going to really solve your problem.
May I be so bold as to reinterpret your question as: "My analysis is taking too long, help me speed it up!"
This sounds like a perfect use case for multiprocessing! Since this sounds like a data science project, you'll need to pip install multiprocess if you're using an IPython notebook (like Jupyter), or import multiprocessing if you're using a Python script. This is because of the way Python passes information between processes; don't worry though, the APIs for both multiprocess and multiprocessing are identical!
A basic and easy way to speed up your analysis is to indent your for loop and put it in a function. That function can then be passed to a multiprocessing map, which can spawn multiple processes and run the analysis on several URLs at once:
from multiprocess import Pool
import numpy as np
import os
import pandas as pd

num_cpus = os.cpu_count()

def analytics_function(*args):
    # Your full function, including fetching the data, goes here; it accepts an array of links
    return something

df_links_split = np.array_split(df_links, num_cpus * 2)  # I normally just use 2 as a rule of thumb
pool = Pool(num_cpus * 2)  # Start a pool with num_cpus * 2 processes
list_of_returned = pool.map(analytics_function, df_links_split)
This will spin up a load of processes and utilise your full CPU. You won't be able to do much else on your computer, and you'll need to have your resource monitor open to check that you're not maxing out your memory and slowing down or crashing the process. But it should significantly speed up your analysis, by roughly a factor of num_cpus * 2!
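If analytics_function returns one DataFrame per chunk of links, the pieces can then be stitched back together in the parent process. A small follow-up sketch, assuming that return type (the output file name is illustrative):

pool.close()   # no more tasks will be submitted
pool.join()    # wait for the worker processes to finish
combined_df = pd.concat(list_of_returned, ignore_index=True)
combined_df.to_csv("all_articles.csv", index=False)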
Extracting all of the texts and then processing them, versus extracting one text and processing it before extracting the next, won't make any difference.
Doing ABABAB takes as much time as doing AAABBB.
You might, however, be interested in using threads or asynchronous requests to fetch all of the data in parallel.
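A minimal thread-based sketch of that idea, fetching the pages concurrently and leaving the cleaning/scoring to run afterwards on the collected texts (the URL column name comes from the question's code; the worker count and timeout are illustrative):

from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup

def fetch_paragraph_text(url):
    # Network I/O releases the GIL, so threads overlap the downloads nicely.
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")
    return " ".join(p.get_text() for p in soup.find_all("p"))

urls = list(df["article_url"])
with ThreadPoolExecutor(max_workers=16) as executor:
    raw_texts = list(executor.map(fetch_paragraph_text, urls))

# raw_texts is in the same order as urls; run the existing cleaning and
# scoring steps over it as before.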