I am seeing some strange runtime issues with PyCharm, explained below. The code was run on a machine with 20 cores and 256 GB of RAM, and there is plenty of memory to spare. I am not showing any of the real functions, as it is a reasonably large project, but I am more than happy to add details upon request.
In short, I have a .py file project with the following structure:
import ...
import ...

cpu_cores = control_parameters.cpu_cores
prng = RandomState(123)

def collect_results(result_list):
    return pd.DataFrame({'start_time': result_list[0::4],
                         'arrival_time': result_list[1::4],
                         'tour_id': result_list[2::4],
                         'trip_id': result_list[3::4]})

if __name__ == '__main__':
    # Run the serial code
    st = starttimes.StartTimesCreate(prng)
    temp_df, two_trips_df, time_dist_arr = st.run()

    # Prepare the dataframe to sample start times. Create groups from the input dataframe
    temp_df1 = st.prepare_two_trips_more_df(temp_df, two_trips_df)
    validation.logger.info("Dataframe prepared for multiprocessing")

    grp_list = []
    for name, group in temp_df1.groupby('tour_id'):  ### problem lies here in runtimes
        grp_list.append(group)
    validation.logger.info("All groups have been prepared for multiprocessing, "
                           "for a total of %s groups" % len(grp_list))

    ################ PARALLEL CODE BELOW #################
The for loop is run on a dataframe of 10.5 million rows and 18 columns. In its current form it takes about 25 minutes to create the list of groups (2.8 million groups). These groups are then fed to a multiprocessing pool, the code for which is not shown.
The 25 minutes seems quite long, because I also ran the following test, which takes only 7 minutes: I saved temp_df1 to a CSV, read the pre-saved file back in, and ran the same for loop as before.
import ...
import ...

cpu_cores = control_parameters.cpu_cores
prng = RandomState(123)

def collect_results(result_list):
    return pd.DataFrame({'start_time': result_list[0::4],
                         'arrival_time': result_list[1::4],
                         'tour_id': result_list[2::4],
                         'trip_id': result_list[3::4]})

if __name__ == '__main__':
    # Run the serial code
    st = starttimes.StartTimesCreate(prng)

    temp_df1 = pd.read_csv(r"c:\\...\\temp_df1.csv")
    time_dist = pd.read_csv(r"c:\\...\\start_time_distribution_treso_1.csv")
    time_dist_arr = np.array(time_dist.to_records())

    grp_list = []
    for name, group in temp_df1.groupby('tour_id'):
        grp_list.append(group)
    validation.logger.info("All groups have been prepared for multiprocessing, "
                           "for a total of %s groups" % len(grp_list))
QUESTION
So, what is causing the code to run roughly three times faster when I simply read the file back in, versus when the dataframe is created by a function further upstream?
Thanks in advance and please let me know how I can further clarify.
I am answering my own question, as I stumbled upon the answer while doing a bunch of tests, and thankfully, when I googled it, someone else had run into the same issue. The explanation for why having categorical columns is a bad idea when doing groupby operations can be found at the above link, so I am not going to repeat it here. Thanks.
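In my case the fix amounted to getting rid of the categorical dtypes before the groupby. A rough sketch of the idea (variable and column names as in my code above, not the exact project code):

# Cast any categorical columns created upstream back to plain object dtype,
# so groupby does not carry the full category machinery into every group.
cat_cols = temp_df1.select_dtypes(include='category').columns
temp_df1[cat_cols] = temp_df1[cat_cols].astype('object')

grp_list = [group for _, group in temp_df1.groupby('tour_id')]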
I have a local Linux server with 4 cores. I am running a PySpark job on it locally, which basically reads two tables from a database and saves the data into two dataframes. I then use these two dataframes to do some processing and save the resulting processed dataframes into Elasticsearch. Below is the code:
import time

from pyspark.sql import SparkSession, SQLContext
# the logic modules (merged_python_file, logic1_python_file, ...) are imported here in the real script

def save_to_es(df):
    df.write.format('es') \
        .option('es.nodes', 'es_node') \
        .option('es.port', some_port_no) \
        .option('es.resource', index_name) \
        .option('es.mapping', es_mappings) \
        .save()

def coreFun():
    start_time = int(time.time())

    spark = SparkSession.builder.master("local[1]").appName('test').getOrCreate()
    spark.catalog.clearCache()
    spark.sparkContext.setLogLevel("ERROR")
    sc = spark.sparkContext
    sqlContext = SQLContext(sc)

    select_sql = """(select * from db."master_table")"""
    df_master = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql).option("user", "username").option("password", "password").option("driver", "database_driver").load()

    select_sql_child = """(select * from db."child_table")"""
    df_child = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql_child).option("user", "username").option("password", "password").option("driver", "database_driver").load()

    merged_df = merged_python_file.merged_function(df_master, df_child, sqlContext)

    logic1_df = logic1_python_file.logic1_function(df_master, sqlContext)
    logic2_df = logic2_python_file.logic2_function(df_master, sqlContext)
    logic3_df = logic3_python_file.logic3_function(df_master, sqlContext)
    logic4_df = logic4_python_file.logic4_function(df_master, sqlContext)
    logic5_df = logic5_python_file.logic5_function(df_master, sqlContext)

    save_to_es(merged_df)
    save_to_es(logic1_df)
    save_to_es(logic2_df)
    save_to_es(logic3_df)
    save_to_es(logic4_df)
    save_to_es(logic5_df)

    end_time = int(time.time())
    print(end_time - start_time)

    sc.stop()

if __name__ == "__main__":
    coreFun()
The processing logic is written in separate Python files, e.g. logic1 in logic1_python_file, and so on. I send df_master to the separate functions, and they return the resulting processed dataframes back to the driver, which I then save into Elasticsearch.
It works fine, but the problem is that everything happens sequentially: first merged_df gets processed while the others simply wait, even though they do not really depend on the output of the merged function; then logic1_df gets processed while the others wait, and so on. This is not an ideal design, considering that the output of one logic step does not depend on another.
I am sure asynchronous processing can help me here, but I am not sure how to implement it for my use case. I know I may have to use some kind of queue (JMS, Kafka, etc.) to accomplish this, but I do not have a complete picture.
Please let me know how I can use asynchronous processing here. Any other input that could help improve the performance of the job is welcome.
If, during the processing of a single step such as merged_python_file.merged_function, only one CPU core is heavily utilized and the others are nearly idle, multiprocessing can speed things up. It can be achieved with Python's multiprocessing module; for more details, see the answers to "How to do parallel programming in Python?".
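Since the six pipelines are independent and share one SparkSession, another option is to submit their save actions concurrently from threads; Spark can run jobs from several threads of the same application at the same time. Also note that master("local[1]") limits Spark itself to one core; local[4] or local[*] would let it use all four. A sketch only, reusing save_to_es and the dataframe names from the question (to be called inside coreFun(), where those dataframes exist):

from concurrent.futures import ThreadPoolExecutor

def save_all_concurrently(dfs):
    # Each save_to_es call triggers its own Spark job; submitting them from
    # separate threads lets Spark schedule the independent jobs concurrently
    # (consider spark.scheduler.mode=FAIR so they share executors fairly).
    with ThreadPoolExecutor(max_workers=len(dfs)) as executor:
        futures = [executor.submit(save_to_es, df) for df in dfs]
        for future in futures:
            future.result()   # re-raise any exception from a worker thread

save_all_concurrently([merged_df, logic1_df, logic2_df,
                       logic3_df, logic4_df, logic5_df])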
I have read numerous threads on similar topics on the forum; however, I believe what I am asking here is not a duplicate question.
I am reading a very large dataset (22 GB) in CSV format, with 350 million rows. I am trying to read the dataset in chunks, based on the solution provided by that link.
My current code is as follows.
import pandas as pd

def Group_ID_Company(chunk_of_dataset):
    return chunk_of_dataset.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()

chunk_size = 9000000
chunk_skip = 1

transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows=range(1, chunk_skip), nrows=chunk_size)
Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv')

for i in range(0, 38):
    chunk_skip += chunk_size
    transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows=range(1, chunk_skip), nrows=chunk_size)
    Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv', mode='a', header=False)
There is no issue with the code itself; it runs fine. But the groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum() only runs over 9,000,000 rows at a time, the declared chunk_size, whereas I need that aggregation to cover the entire dataset, not chunk by chunk.
The reason is that when it runs chunk by chunk, only one chunk gets processed at a time, while many other rows for the same groups are scattered all over the dataset and end up in other chunks.
A possible solution is to run the code again on the newly generated "Group_ID_Company.csv": the code would then go through the new dataset once more and sum() the required columns. However, I suspect there may be another (better) way of achieving this, for example along the lines of the sketch below.
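For illustration, the two-pass idea could look roughly like this (a sketch only, untested on the full file):

import pandas as pd

# First pass: aggregate each chunk separately.
partials = []
for chunk in pd.read_csv('transactions.csv', chunksize=9000000):
    partials.append(
        chunk.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()
    )

# Second pass: rows for the same (id, company) pair may appear in several chunks,
# so group the concatenated partial sums once more and sum again.
result = pd.concat(partials).groupby(level=['id', 'company']).sum()
result.to_csv('Group_ID_Company.csv')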
The solution for your problem is probably Dask. You may watch the introductory video, read examples and try them online in a live session (in JupyterLab).
The answer from MarianD worked perfectly; I am answering to share the solution code here.
Moreover, Dask is able to utilize all cores equally, whereas pandas was using only one core at 100%. That is another benefit of Dask I have noticed over pandas.
import dask.dataframe as dd
transactions_dataset_DF = dd.read_csv('transactions.csv')
Group_ID_Company_DF = transactions_dataset_DF.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum().compute()
Group_ID_Company_DF.to_csv('Group_ID_Company.csv')
# to clear the memory
transactions_dataset_DF = None
Group_ID_Company_DF = None
Dask was able to read all 350 million rows (20 GB of data) at once, which I had not managed with pandas previously; there I had to create 37 chunks to process the entire dataset, and the processing took almost 2 hours.
With Dask, it took only around 20 minutes to read, process, and save the new dataset in one go.
I am trying to figure out a way to multiprocess this function.
from multiprocessing import Pool
import time

import pandas

# csv, dfnew1
def dataframe_iterator(csv, dfnew1):
    for item in csv['Trips_Tbl_ID'].unique():
        # creating a temporary dataframe for each unique trip table ID
        dftemp = traveL_time(item, 'all')
        # appending the temporary dataframe to a new dataframe
        dfnew1 = dfnew1.append(dftemp)
    return dfnew1

if __name__ == '__main__':
    args = [(csv), (dfnew1)]

    start_time = time.time()
    pool = Pool()
    pool.map(dataframe_iterator, args)
    pool.close()
    pool.join()
    print("--- %s seconds ---" % (time.time() - start_time))
Some context:
'csv' is a huge dataframe with a bunch of jumbled-up transportation records and a lot of non-unique trip table IDs. It has already been loaded in the Jupyter notebook.
For every unique trip table ID in 'csv', I apply a computation-heavy, custom-made function called traveL_time to transform and condense it.
'dfnew1' is a custom-made empty dataframe with exactly the structure a dataframe would have after traveL_time is applied to 'csv', which is why I append to it in the for loop. It has also already been created in the Jupyter notebook.
I am pretty new to Python and I am having trouble speeding up the runtime, because it is really a massive dataset: it currently takes about 9 minutes to go over 150k records in 'csv', and I eventually have to go over hundreds of millions of records.
My question is: how would I be able to multiprocess this function, using the two arguments csv and dfnew1, to potentially speed this up?
I have tried pandarallel but I am getting a lot of errors, and I am not quite sure how to go about passing multiple arguments through the multiprocessing library (see the sketch below for the kind of pattern I mean). Also, the current setup seems to run forever, and I am sure it is wrong.
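For illustration, here is the rough, untested kind of pattern I am asking about, where the work is split over the unique trip table IDs and the results are concatenated once at the end. It assumes traveL_time and csv are defined at module level so that worker processes can see them, which I am not sure holds in my notebook:

from multiprocessing import Pool

import pandas as pd

def process_one_id(trip_id):
    # traveL_time is my custom function from above; it looks the rows up in csv itself
    return traveL_time(trip_id, 'all')

if __name__ == '__main__':
    unique_ids = csv['Trips_Tbl_ID'].unique()

    with Pool() as pool:
        # one condensed dataframe per trip table ID
        results = pool.map(process_one_id, unique_ids)

    # concatenate once at the end instead of appending inside the loop
    dfnew1 = pd.concat(results, ignore_index=True)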
Thank you.
So I developed a script that pulls data from a live-updated site tracking coronavirus data. I set it up to pull data every 30 minutes, but recently tested it pulling every 30 seconds.
The idea is that it makes a request to the site, pulls the HTML, creates a list of all the data I need, then restructures it into a dataframe (basically the country, the cases, deaths, etc.).
Then it takes each row and appends it to the corresponding one of the 123 Excel files, one per country. This works well for, I believe, somewhere in the range of 30-50 iterations before it either corrupts files or produces weird data entries.
My code is below. I know it is poorly written (my initial reasoning was that I felt confident I could set it up quickly and I wanted to start collecting data quickly; unfortunately, I overestimated my abilities, but now I want to learn what went wrong). Below my code I will include sample output.
PLEASE note that the 30-second interval is only for quick testing; I do not normally intend to send that many requests for months on end. I just wanted to see what the issue was. It was originally set to pull every 30 minutes when I detected this issue.
See below for the code:
import schedule
import time

def RecurringProcess2():
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import datetime
    import numpy as np
    from os import listdir
    import os

    try:
        extractTime = datetime.datetime.now()
        extractTime = str(extractTime)
        print("Access Initiated at " + extractTime)

        link = 'https://www.worldometers.info/coronavirus/'
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser').findAll('td')  # [1107].get_text()

        table = pd.DataFrame(columns=['Date and Time', 'Country', 'Total Cases', 'New Cases', 'Total Deaths', 'New Deaths', 'Total Recovered', 'Active Cases', 'Serious Critical', 'Total Cases/1M pop'])

        soupList = []
        for i in range(1107):
            value = soup[i].get_text()
            soupList.insert(i, value)

        table = np.reshape(soupList, (123, -1))
        table = pd.DataFrame(table)
        table.columns = ['Country', 'Total Cases', 'New Cases (+)', 'Total Deaths', 'New Deaths (+)', 'Total Recovered', 'Active Cases', 'Serious Critical', 'Total Cases/1M pop']
        table['Date & Time'] = extractTime

        # Below code is run once to generate the initial files. That's it.
        # for i in range(122):
        #     fileName = table.iloc[i, 0] + '.xlsx'
        #     table.iloc[i:i+1, :].to_excel(fileName)

        FilesDirectory = 'D:\\Professional\\Coronavirus'
        fileType = '.csv'
        filenames = listdir(FilesDirectory)
        DataFiles = [filename for filename in filenames if filename.endswith(fileType)]

        for file in DataFiles:
            countryData = pd.read_csv(file, index_col=0)
            MatchedCountry = table.loc[table['Country'] == str(file)[:-4]]
            if file == ' USA .csv':
                print("Country Data Rows: ", len(countryData))
                if os.stat(file).st_size < 1500:
                    print("File Size under 1500")
            countryData = countryData.append(MatchedCountry)
            countryData.to_csv(FilesDirectory + '\\' + file, index=False)
    except:
        pass
    print("Process Complete!")
    return

schedule.every(30).seconds.do(RecurringProcess2)

while True:
    schedule.run_pending()
    time.sleep(1)
When I check the output after some number of iterations (it is usually successful for around 30-50), a file has either been reduced to only 2 rows with all other rows lost, or it keeps appending while deleting a single entry in the row above, with the row two above losing 2 entries, and so on (essentially forming a triangle of sorts).
Above that region, in the screenshot I took (not shown here), there are a few hundred empty rows. Does anyone have an idea of what is going wrong? I would consider this a failed attempt, but I would still like to learn from it. I appreciate any help in advance.
Hi, as per my understanding the webpage has only one table element. My suggestion would be to use pandas' read_html method, as it provides a clean and structured table.
Try the code below; you can modify it to run on the same schedule:
import requests
import pandas as pd
url = 'https://www.worldometers.info/coronavirus/'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
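For completeness, here is a rough sketch (untested, and the column positions are an assumption about the site's current layout) of how the read_html table could be slotted into your 30-second schedule and appended to the per-country CSV files:

import time

import pandas as pd
import requests
import schedule

def pull_and_append():
    url = 'https://www.worldometers.info/coronavirus/'
    html = requests.get(url).content
    table = pd.read_html(html)[-1]          # last table on the page, as above
    table['Date & Time'] = str(pd.Timestamp.now())

    # Assumption: the first column holds the country name; one CSV per country,
    # created beforehand with matching headers (mode='a' only appends rows).
    for _, row in table.iterrows():
        country = str(row.iloc[0]).strip()
        row.to_frame().T.to_csv(country + '.csv', mode='a', header=False, index=False)

schedule.every(30).seconds.do(pull_and_append)

while True:
    schedule.run_pending()
    time.sleep(1)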
Disclaimer: I'm still evaluating this solution. So far it works almost perfectly for 77 rows.
Originally I had set the script up to write .xlsx files. I converted everything to .csv but retained the index-column code:
countryData = pd.read_csv(file,index_col=0)
I started realizing that things were being ordered differently every time the script ran. I have since removed that from the code, and so far it works. Almost.
Unnamed: 0 Unnamed: 0.1
0 7
7
For some reason I have the above output in every file. I don't know why. But it's in the first 2 columns yet it still seems to be reading and writing correctly. Not sure what's going on here.
I am trying to read a pandas dataframe and perform certain operations and return the dataframe. I also want to multiprocess the operation to take advantage of multiple cores that my system has.
import time

import pandas as pd
import re
from jug import TaskGenerator

@TaskGenerator
def find_replace(input_path_find):
    start_time = time.clock()
    df_find = pd.read_csv(input_path_find)

    # strip unwanted characters
    df_find.currentTitle = df_find.currentTitle.str.replace(r"[^a-zA-Z0-9`~!|##%&_}={:\"\];<>,./. -]", r'')
    # extra space
    df_find.currentTitle = df_find.currentTitle.str.replace(r'\s+', ' ')
    # length
    df_find['currentTitle_sort'] = df_find.currentTitle.str.len()
    # sort
    df_find = df_find.sort_values(by='currentTitle_sort', ascending=0)
    # reindex
    df_find.reset_index(drop=True, inplace=True)
    del df_find['currentTitle_sort']

    return df_find
When I pass the parameter, which is the CSV file I want to process,
df_returned = find_replace('C:\\Users\\Dell\\Downloads\\Find_Replace_in_this_v1.csv')
I get some weird output:
find_replace
Task(__main__.find_replace, args=('C:\\Users\\Dell\\Downloads\\Find_Replace_in_this_v1.csv',), kwargs={})
In [ ]:
Any help? I basically want to save the output from the function.
I have already checked the answer to Pandas memoization and it didn't work. Also, I am using Python 2.7 and the Anaconda IDE.
This is a misunderstanding of how jug works.
The result you are getting is indeed a Task object, which you can run: df_returned.run().
Usually, though, you would have saved this script to a file (say analysis.py) and called jug execute analysis.py to execute the tasks.
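For completeness, both routes in code (the output path below is just an example; if I recall jug's API correctly, run() also returns the function's result):

# Option 1: run the task in-process and keep the dataframe it returns.
task = find_replace('C:\\Users\\Dell\\Downloads\\Find_Replace_in_this_v1.csv')
df_returned = task.run()   # executes find_replace and returns its result
df_returned.to_csv('C:\\Users\\Dell\\Downloads\\find_replace_output.csv', index=False)

# Option 2 (the intended jug workflow): save the script as analysis.py, make the
# "save the output" step itself a @TaskGenerator function that writes the CSV,
# and run it from the shell with:
#
#   jug execute analysis.py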