Python: Joblib for multiprocessing

So I have these given functions:
import json

import pandas as pd
from pandas import json_normalize  # on older pandas: from pandas.io.json import json_normalize
from tqdm import tqdm

def make_event_df(match_id, path):
    '''
    Function for making event dataframe.

    Arguments:
    match_id -- int, the required match id for which event data will be constructed.
    path -- str, path to .json file containing event data.

    Returns:
    df -- pandas dataframe, the event dataframe for the particular match.
    '''
    ## read in the json file
    event_json = json.load(open(path, encoding='utf-8'))

    ## normalize the json data
    df = json_normalize(event_json, sep='_')

    return df
def full_season_events(comp_name, match_df, match_ids, path):
    '''
    Function to make event dataframe for a full season.

    Arguments:
    comp_name -- str, competition name + season name
    match_df -- pandas dataframe, containing match-data
    match_ids -- list, list of match ids.
    path -- str, path to directory where the .json files are listed.
            e.g. '../input/Statsbomb/data/events'

    Returns:
    event_df -- pandas dataframe, containing event data for the whole season.
    '''
    ## init an empty dataframe
    event_df = pd.DataFrame()

    for match_id in tqdm(match_ids, desc=f'Making Event Data For {comp_name}'):
        ## .json file path for this match
        temp_path = path + f'/{match_id}.json'
        temp_df = make_event_df(match_id, temp_path)
        event_df = pd.concat([event_df, temp_df], sort=True)

    return event_df
Now I am running this piece of code to get the dataframe:
comp_id = 11
season_id = 1
path = f'../input/Statsbomb/data/matches/{comp_id}/{season_id}.json'
match_df = get_matches(comp_id, season_id, path)
comp_name = match_df['competition_name'].unique()[0] + '-' + match_df['season_name'].unique()[0]
match_ids = list(match_df['match_id'].unique())
path = f'../input/Statsbomb/data/events'
event_df = full_season_events(comp_name, match_df, match_ids, path)
The above code snippet is giving me this output:
Making Event Data For La Liga-2017/2018: 100%|██████████| 36/36 [00:29<00:00, 1.20it/s]
How can I make use of multiprocessing to speed this up, i.e. how can I use the match_ids in full_season_events() to grab the data from the JSON files in parallel? I am very new to joblib and the multiprocessing concept. Can someone tell me what changes I have to make in these functions to get the required results?

You don't need joblib here, just plain multiprocessing will do.
I'm using imap_unordered since it's faster than imap or map, but doesn't retain order (each worker can receive and submit jobs out of order). Not retaining order doesn't seem to matter since you're sort=Trueing anyway.
Because I'm using imap_unordered, there's a need for some additional job finagling: there's no istarmap_unordered that would unpack parameters, so we need to do it ourselves.
If you have many match_ids, things can be sped up by passing e.g. chunksize=10 to imap_unordered; it means each worker process will be fed 10 jobs at a time, and it will also return 10 results at a time. It's faster since less time is spent on process synchronization and serialization, but on the other hand the TQDM progress bar will update less often.
As usual, the code below is dry-coded and might not work OOTB.
import multiprocessing

def make_event_df(job):
    # Unpack parameters from the job tuple
    match_id, path = job
    with open(path) as f:
        event_json = json.load(f)
    # Return the match id (if required) and the result.
    return (match_id, json_normalize(event_json, sep="_"))

def full_season_events(comp_name, match_df, match_ids, path):
    event_df = pd.DataFrame()
    with multiprocessing.Pool() as p:
        # Generate job tuples
        jobs = [(match_id, path + f"/{match_id}.json") for match_id in match_ids]
        # Run & get results from the multiprocessing generator
        for match_id, temp_df in tqdm(
            p.imap_unordered(make_event_df, jobs),
            total=len(jobs),
            desc=f"Making Event Data For {comp_name}",
        ):
            event_df = pd.concat([event_df, temp_df], sort=True)
    return event_df
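For completeness, since the question asks about joblib specifically: the same fan-out can be written with joblib.Parallel against the original two-argument make_event_df. This is a rough, untested sketch; note that the progress bar advances as jobs are dispatched rather than as they finish.

from joblib import Parallel, delayed

def full_season_events(comp_name, match_df, match_ids, path):
    # n_jobs=-1 uses all available cores; joblib manages the worker processes.
    dfs = Parallel(n_jobs=-1)(
        delayed(make_event_df)(match_id, f"{path}/{match_id}.json")
        for match_id in tqdm(match_ids, desc=f"Making Event Data For {comp_name}")
    )
    return pd.concat(dfs, sort=True)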

Related

How to apply dask method to apply functions on files in list?

First of all, thanks for this community and all the advice we can find here, it's really appreciated!
This is my first venture into parallel processing and I have been looking into Dask on my own, but I am having trouble actually coding it... to be honest I am really lost.
In one of my projects, I want to hit URLs and retrieve observation data (meteorological stations) from XML files.
For each URL, I run several steps: retrieve the data from the URL, parse the XML information into a dataframe, apply a filter, and store the data in a MySQL database.
So I need to loop these steps over thousands of URLs (stations)...
I wrote sequential code, and it takes 300s to finish the computation, which is really too long and not efficient.
As we are applying the same process for each station, I think I can speed up all the computations, but I don't know where to start. I used delayed from dask but I don't think it's the best approach.
This is my code so far:
First I have some functions.
import wget
import xml.etree.ElementTree as ETree
import pandas as pd

def xml_to_dataframe(ood_xml):
    tmp_file = wget.download(ood_xml)
    prstree = ETree.parse(tmp_file)
    root = prstree.getroot()
    ################ Section to retrieve data for one station and apply parameters
    all_obs = []
    for obs in root.iter('observations'):
        ood_observation = []
        for n, param in enumerate(list_parameters):
            x = obs.find(variable_to_check).text
            ood_observation.append(x)
        all_obs.append(ood_observation)
    return pd.DataFrame(all_obs, columns=list_parameters)
def filter_criteria(df, threshold, criteria):
    if criteria in df.columns:
        result = []
        for index, row in df.iterrows():
            if pd.to_numeric(row[criteria], errors='coerce') >= threshold:
                result.append(index)
        return result
    else:
        #print(criteria + ' parameter does not exist for this station !!! ')
        return []
def get_and_filter_data(filename, criteria, threshold):
    try:
        xmlToDf = xml_to_dataframe(filename)
        final_df = xmlToDf.loc[filter_criteria(xmlToDf, threshold, criteria)]
        # some MySQL connection and instructions....
    except:
        pass
and then the main code I want to parallelise:
criteria = 'temperature'
threshold = 22
filenames = ['url1.html', 'url2.html', 'url3.html']

for file in filenames:
    get_and_filter_data(file, criteria, threshold)
Do you have any advice on how to do it?
Many thanks for your help!
Guillaume
Not 100% sure this is what you are after, but one way is via delayed:
from dask import delayed, compute
delayeds = [delayed(get_and_filter_data)(file,criteria,threshold) for file in filenames]
results = compute(delayeds)
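If the default scheduler hammers the web server or the database too hard, the scheduler and worker count can be set explicitly; this is a sketch using dask's scheduler and num_workers options, and the cap of 8 is arbitrary.

# Threads suit this workload because it is dominated by network and database I/O;
# num_workers caps how many URLs are processed at the same time.
results = compute(*delayeds, scheduler='threads', num_workers=8)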

Fastest way to send dataframe to Redis

I have a dataframe that contains 2 columns. For each row, I simply want to create a Redis SET where the first value of the dataframe is the key and the 2nd value is the value of the SET. I've done research and I think I found the fastest way of doing this via iterables:
def send_to_redis(df, r):
    df['bin_subscriber'] = df.apply(lambda row: uuid.UUID(row.subscriber).bytes, axis=1)
    df['bin_total_score'] = df.apply(lambda row: struct.pack('B', round(row.total_score)), axis=1)
    df = df[['bin_subscriber', 'bin_total_score']]
    with r.pipeline() as pipe:
        index = 0
        for subscriber, total_score in zip(df['bin_subscriber'], df['bin_total_score']):
            # queue the command on the pipeline so execute() flushes it in batches
            pipe.set(subscriber, total_score)
            if (index + 1) % 2000 == 0:
                pipe.execute()
            index += 1
With this, I can send about 400-500k sets to Redis per minute. We may end up processing up to 300 million which at this rate would take half a day or so. Doable but not ideal. Note that in the outer wrapper I am downloading .parquet files from s3 one at a time and pulling into Pandas via IO bytes.
def process_file(s3_resource, r, bucket, key):
    buffer = io.BytesIO()
    s3_object = s3_resource.Object(bucket, key)
    s3_object.download_fileobj(buffer)
    send_to_redis(
        pandas.read_parquet(buffer, columns=['subscriber', 'total_score']), r)

def main():
    args = get_args()
    s3_resource = boto3.resource('s3')
    r = redis.Redis()
    file_prefix = get_prefix(args)
    s3_keys = [
        item.key for item in
        s3_resource.Bucket(args.bucket).objects.filter(Prefix=file_prefix)
        if item.key.endswith('.parquet')
    ]
    for key in s3_keys:
        process_file(s3_resource, r, args.bucket, key)
Is there a way to send this data to Redis without the use of iteration? Is it possible to send an entire blob of data to Redis and have Redis set the key and value for every 1st and 2nd value of the data blob? I imagine that would be slightly faster.
The original parquet that I am pulling into Pandas is created via PySpark. I've tried using the Spark-Redis plugin, which is extremely fast, but I'm not sure how to convert my data to the above binary format within a Spark dataframe itself. I also don't like how the column name is added as a string to every single value; it doesn't seem to be configurable, and every Redis object carrying that label seems very space inefficient.
Any suggestions would be greatly appreciated!
Try Redis Mass Insertion and redis bulk import using --pipe:
Create a new text file input.txt containing the Redis commands:
SET Key0 Value0
SET Key1 Value1
...
SET KeyN ValueN
Use redis-mass.py (see below) to insert them into Redis:
python redis-mass.py input.txt | redis-cli --pipe
redis-mass.py from GitHub:
#!/usr/bin/env python
"""
redis-mass.py
~~~~~~~~~~~~~

Prepares a newline-separated file of Redis commands for mass insertion.

:copyright: (c) 2015 by Tim Simmons.
:license: BSD, see LICENSE for more details.
"""
import sys

def proto(line):
    result = "*%s\r\n$%s\r\n%s\r\n" % (str(len(line)), str(len(line[0])), line[0])
    for arg in line[1:]:
        result += "$%s\r\n%s\r\n" % (str(len(arg)), arg)
    return result

if __name__ == "__main__":
    try:
        filename = sys.argv[1]
        f = open(filename, 'r')
    except IndexError:
        f = sys.stdin.readlines()

    for line in f:
        print(proto(line.rstrip().split(' ')),)
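To connect this back to the dataframe in the question, input.txt could be generated straight from the two columns. This is a minimal sketch assuming plain-text keys and values; the binary-packed values from the question would need to be hex-encoded or otherwise escaped first, and 'subscriber'/'total_score' are the column names from the question.

import pandas as pd

def df_to_redis_commands(df, out_path='input.txt'):
    # One SET command per row, in the format expected by redis-mass.py above.
    with open(out_path, 'w') as f:
        for subscriber, total_score in zip(df['subscriber'], df['total_score']):
            f.write(f'SET {subscriber} {round(total_score)}\n')

The resulting file is then fed through redis-mass.py and redis-cli --pipe as shown above.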

Generating multiple ndjson files [3000+] by filtering the pandas dataframe for different criteria

I need to generate 3000+ ndjson files from a pandas data frame based on certain criteria. I tried running the following code; it works, but it takes a lot of time to finish.
def p_generate_files(result_df: pd.DataFrame, p_code: str) -> None:
    print(result_df.shape)
    tmp_df = result_df.filter(like=str(p_code), axis=0)
    start_date = tmp_df.index.unique(level='date').min().to_pydatetime().strftime('%b').upper()
    end_date = tmp_df.index.unique(level='date').max().to_pydatetime().strftime('%b').upper()
    file_name_path = f'data/CR-{p_code}-{start_date}-{end_date}-2000.json'
    tmp_df.reset_index(inplace=True)
    tmp_df.to_json(
        file_name_path,
        orient="records",
        index=True,
        lines=True)
    result_df.drop(labels=p_code, inplace=True)
I tried the following implementation of parallel processing but it doesn't seem to work. I have no experience with concurrent programming. Any help to speed up the processing is appreciated.
p_generate_files = partial(generate_files, result_df=big_df)

with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(p_generate_files, p_codes)
Try multiprocessing; you have to set up the inputs as a list of tuples:
import multiprocessing as mp

def generate_files(result_df: pd.DataFrame, codes: list) -> None:
    # Your function
    ...

if __name__ == '__main__':
    cores = mp.cpu_count()
    args = [(df1, lst1), (df2, lst2), (df3, lst3) ...]
    with mp.Pool(processes=cores) as pool:
        results = pool.starmap(generate_files, args)
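Alternatively, the concurrent.futures attempt from the question can be made to work by binding the dataframe positionally instead of by keyword, so that map only supplies the p_code argument. This is an untested sketch reusing p_generate_files, big_df and p_codes from the question; keep in mind that big_df is pickled and copied into every worker process.

from concurrent.futures import ProcessPoolExecutor
from functools import partial

# Bind result_df positionally; binding it by keyword while map fills the first
# positional slot is what makes the original attempt fail.
worker = partial(p_generate_files, big_df)

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        # list() forces the map to run and surfaces any worker exceptions.
        list(executor.map(worker, p_codes))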

Appending a new datetime record each time the class is run

I am trying to keep a record of how many times and when the code below is run, but every time I run it, it adds the current time, overriding the previous one. It should behave more like logs, but be saved in a dataframe.
The objective is to also capture the success or failure of the compilation in that same df.
class track:
    def tracker(self):
        start_time = dt.now()
        # do something here
        return start_time

    def create_dataframe(self):
        tracktime = self.tracker()
        # create empty pandas df
        pdf = pd.DataFrame(index=['index'], columns=['Date'])
        # convert to pyspark df
        sdf = sqlContext.createDataFrame(pdf)
        sdf = sdf.withColumn('Date', lit(str(tracktime.date())))
        sdf = sdf.withColumn('Time', lit(str(tracktime.time())))
        sdf.show()

if __name__ == '__main__':
    p = track()
    p.create_dataframe()
I don't see all the dependencies, so I can't run your code, but you should look at the append method of the DataFrame class.
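A minimal sketch of that idea in plain pandas: load the existing log if there is one, append a row, and write it back, so previous runs are not overridden. The file name and Status column are made up for illustration, and pd.concat is used because DataFrame.append is deprecated in recent pandas versions.

import pandas as pd
from datetime import datetime as dt

LOG_PATH = 'run_log.csv'  # hypothetical file used to persist the log between runs

def log_run(status):
    # Load the existing log if present, otherwise start a fresh one.
    try:
        log_df = pd.read_csv(LOG_PATH)
    except FileNotFoundError:
        log_df = pd.DataFrame(columns=['Date', 'Time', 'Status'])

    now = dt.now()
    new_row = pd.DataFrame([{'Date': str(now.date()),
                             'Time': str(now.time()),
                             'Status': status}])

    # Append the new record instead of overwriting the previous ones.
    log_df = pd.concat([log_df, new_row], ignore_index=True)
    log_df.to_csv(LOG_PATH, index=False)
    return log_df

The resulting pandas dataframe can still be converted to a Spark dataframe with sqlContext.createDataFrame(log_df) if needed.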

How can I make this function more efficient/ run it parallel?

I am trying to convert 33,000 zipcodes into coordinates using the geopy geocoder. I was hoping there was a way to parallelize this method because it is consuming quite a bit of resources.
from geopy.geocoders import ArcGIS
import pandas as pd
import time

geolocator = ArcGIS()

df1 = pd.DataFrame(0.0, index=list(range(0, len(df))), columns=['lat', 'lon'])
df = pd.concat([df, df1], axis=1)

for index in range(0, len(df)):
    row = df['zipcode'].loc[index]
    print(index)
    # time.sleep(1)
    # I put this in just in case it would give me a timeout error.
    myzip = geolocator.geocode(row)
    try:
        df['lat'].loc[index] = myzip.latitude
        df['lon'].loc[index] = myzip.longitude
    except:
        continue
geopy.geocoders.ArcGIS.geocode queries a web server. Sending 33,000 queries alone will probably get you IP banned, so I wouldn't suggest sending them in parallel.
You're looking up almost every single ZIP code in the US. The US Census Bureau has a 1MB CSV file that contains this information for 33,144 ZIP codes: https://www.census.gov/geo/maps-data/data/gazetteer2017.html.
You can process it all in a fraction of a second:
zip_df = pd.read_csv('2017_Gaz_zcta_national.zip', sep='\t')
zip_df.rename(columns=str.strip, inplace=True)
One thing to watch out for is that the last column's name isn't properly parsed by Pandas and contains a lot of trailing whitespace. You have to strip the column names before use.
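From there, the coordinates can be joined onto the original dataframe instead of geocoding each ZIP. This is an untested sketch that assumes the 2017 gazetteer column names (GEOID, INTPTLAT, INTPTLONG) and purely numeric ZIP codes.

import pandas as pd

zip_df = pd.read_csv('2017_Gaz_zcta_national.zip', sep='\t')
zip_df.rename(columns=str.strip, inplace=True)

# GEOID holds the ZCTA (ZIP) code; INTPTLAT/INTPTLONG are the internal point coordinates.
coords = zip_df[['GEOID', 'INTPTLAT', 'INTPTLONG']].rename(
    columns={'INTPTLAT': 'lat', 'INTPTLONG': 'lon'})

df['zipcode'] = df['zipcode'].astype(int)  # assumes ZIPs are stored without leading zeros
df = df.merge(coords, left_on='zipcode', right_on='GEOID', how='left').drop(columns='GEOID')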
Here would be one way to do it, using multiprocessing.Pool
from multiprocessing import Pool

def get_longlat(x):
    index, row = x
    print(index)
    time.sleep(1)
    myzip = geolocator.geocode(row['zipcode'])
    try:
        return myzip.latitude, myzip.longitude
    except:
        return None, None

p = Pool()
df[['lat', 'long']] = p.map(get_longlat, df.iterrows())
More generally, using DataFrame.iterrows (for which each item iterated over is an (index, row) tuple) is likely slightly more efficient than the index-based method you use above.
EDIT: after reading the other answer, you should be aware of rate limiting; you could use a fixed number of processes in the Pool along with a time.sleep delay to mitigate this to some extent, however.
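That mitigation only changes the pool construction in the example above; a quick sketch, where the worker count of 4 is an arbitrary choice:

# Cap concurrency so at most 4 geocoding requests are in flight at once;
# combined with the time.sleep(1) in get_longlat this keeps the request rate low.
p = Pool(processes=4)
df[['lat', 'long']] = p.map(get_longlat, df.iterrows())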
