I scraped an HTML table from an NBA game into a pandas DataFrame.
import pandas as pd
url = 'https://www.basketball-reference.com/boxscores/pbp/200911060GSW.html'
dfs = pd.read_html(url)
df = dfs[0]
df.rename(columns={'Unnamed: 2_level_1': 'PM1', 'Unnamed: 4_level_1': 'PM2'}, inplace=True)
df
I have the column "Time", which starts at 12:00.0 and counts down to 0:00.0, and it does this for every quarter.
I want the time as an overall game time, so that it begins at 48:00.0 and counts down.
My approach: overall_time(i) = overall_time(i-1) - (quarter_time(i-1) - quarter_time(i))
e.g. 48:00.0 - (12:00.0 - 11:46.0) = 47:46.0 for the first row of my DataFrame.
I think this should work, but I am struggling to implement it in Python. Maybe someone can help me with this.
There is probably a better way, but I felt I needed to convert from the 'Time' string format like 11:30, which is hard to subtract, to a fractional 11.5 and then back again, plus a bit of fussing with formatting.
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None # default='warn'
url = 'https://www.basketball-reference.com/boxscores/pbp/200911060GSW.html'
dfs = pd.read_html(url)
df = dfs[0]
df.rename(columns={'Unnamed: 2_level_1': 'PM1', 'Unnamed: 4_level_1': 'PM2'}, inplace=True)
df.columns = df.columns.droplevel() #columns currently multiindex, you don't need 1st Q, drop it
df = df[df['Time'].str.contains(':')] #only include rows with a real 'Time' that contains a colon, excludes headers
#Identify which rows signify the start of a new quarter
#has to have 12 minutes of time and text of 'Start of...' in the 'Score' column
quarter_start_rows = df['Time'].eq('12:00.0') & df['Score'].str.startswith('Start of')
#create a new column called quarter w/ 1 at new quarter, 0 otherwise then cumsum
df['Quarter'] = np.where(quarter_start_rows,1,0).cumsum()
#separate the minutes and seconds and make them int and float respectively
df[['Minutes','Seconds']] = df['Time'].str.split(':',expand=True).astype({0:'int',1:'float'})
#represent Q2 11:30 as 11.5 etc so it is easy to add/subtract times
fractional_time = df['Minutes'].add(df['Seconds'].div(60))
#convert from Q2 11:30 (11.5) to 'global time' which would be 35.5
global_fractional_time = fractional_time.add((4-df['Quarter'])*12)
#convert from fractional time back to Minutes and Seconds
minutes = global_fractional_time.astype(int)
seconds = global_fractional_time.sub(minutes).multiply(60).round(1)
#Make a new string column to show the global minutes and seconds more nicely
df['Overall Time'] = minutes.astype(str).str.zfill(2)+':'+seconds.astype(str).str.zfill(4)
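As a quick sanity check of the arithmetic, here is a standalone snippet using the example values from the question rather than the scraped table:
# Q1, 11:46.0 on the quarter clock should map to 47:46.0 overall
quarter, minutes, seconds = 1, 11, 46.0
global_fractional = (4 - quarter) * 12 + minutes + seconds / 60
print(f"{int(global_fractional):02d}:{(global_fractional - int(global_fractional)) * 60:04.1f}") # prints 47:46.0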
I am attempting to roll up rows from a data set with similar measures into a consolidated row. There are two conditions that must be met for the roll-up:
1. The measures (measure1 through measure5) must be identical across the rows for them to be rolled up into a single row.
2. The dates must be continuous, i.e. there is no gap between one row's end_date and the next row's begin_date.
If these conditions are not met, the code should generate a separate row.
This is the sample data that I am using:
id,measure1,measure2,measure3,measure4,measure5,begin_date,end_date
ABC123XYZ789,1,1,1,1,1,1/1/2019,3/31/2019
ABC123XYZ789,1,1,1,1,1,4/23/2019,6/30/2019
ABC123XYZ789,1,1,1,1,1,7/1/2019,9/30/2019
ABC123XYZ789,1,1,1,1,1,10/12/2019,12/31/2019
FGH589J6U88SW,1,1,1,1,1,1/1/2019,3/31/2019
FGH589J6U88SW,1,1,1,1,1,4/1/2019,6/30/2019
FGH589J6U88SW,1,1,1,2,1,7/1/2019,9/30/2019
FGH589J6U88SW,1,1,1,2,1,10/1/2019,12/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,1/1/2019,3/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,4/1/2019,6/30/2019
253DRWQ85AT2F334B,1,2,1,3,1,7/1/2019,9/30/2019
253DRWQ85AT2F334B,1,2,1,3,1,10/1/2019,12/31/2019
The expected result should be:
id,measure1,measure2,measure3,measure4,measure5,begin_date,end_date
ABC123XYZ789,1,1,1,1,1,1/1/2019,3/31/2019
ABC123XYZ789,1,1,1,1,1,4/23/2019,9/30/2019
ABC123XYZ789,1,1,1,1,1,10/12/2019,12/31/2019
FGH589J6U88SW,1,1,1,1,1,1/1/2019,6/30/2019
FGH589J6U88SW,1,1,1,2,1,7/1/2019,12/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,1/1/2019,12/31/2019
I have implemented the code below, which seems to address condition #1, but I am looking for ideas on how to incorporate condition #2 into the solution.
import pandas as pd
import time
startTime=time.time()
data=pd.read_csv('C:\\Users\\usertemp\\Data\\Rollup2.csv')
data['end_date']= pd.to_datetime(data['end_date'])
data['begin_date']= pd.to_datetime(data['begin_date'])
data = data.groupby(['id','measure1','measure2', 'measure3', 'measure4', 'measure5']) \
['begin_date', 'end_date'].agg({'begin_date': ['min'], 'end_date': ['max']}).reset_index()
print(data)
print("It took %s seconds for the collapse process" % (time.time() - startTime))
Any help is appreciated.
You can do the following.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Convert begin_date and end_date to datetime (df here is the DataFrame loaded from the CSV, called `data` in the question)
df['begin_date'] = pd.to_datetime(df['begin_date'], format='%m/%d/%Y')
df['end_date']= pd.to_datetime(df['end_date'], format='%m/%d/%Y')
# We create a new column which contains the end_date+1 from the previous row
df['end_date_prev'] = df['end_date'].iloc[:-1] + timedelta(days=1)
df['end_date_prev'] = np.roll(df['end_date_prev'], 1)
# Cumulative sum that starts a new group whenever begin_date doesn't follow directly from the previous row's end_date
df['cont'] = (~(df['begin_date'] == df['end_date_prev'])).astype(int).cumsum()
# Since we need all measures to match we create a string column containing all measurements
df['comb_measure'] = df['measure1'].astype(str).str.cat(df[['measure{}'.format(i) for i in range(2,6)]].astype(str))
# Get the final df
new_df = df.groupby(['id', 'comb_measure', 'cont']).agg(
{'measure1':'first', 'measure2':'first', 'measure3':'first', 'measure4':'first', 'measure5':'first',
'begin_date':'first', 'end_date':'last'})
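If you want the result in the same flat layout as the expected output in the question, the helper keys can be dropped afterwards; a small follow-up sketch (continuing from new_df above, the name result is mine):
# drop the helper group keys and turn the remaining index back into columns
result = new_df.reset_index(level=['comb_measure', 'cont'], drop=True).reset_index()
print(result)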
I'm working with a relatively large dataset (approx 5m observations, made up of about 5.5k firms).
I needed to run OLS regressions with a 60-month rolling window for each firm. I noticed that the performance was insanely slow when I ran the following code:
for idx, sub_df in master_df.groupby("firm_id"):
    # OLS code
However, when I first split my dataframe into about 5.5k dfs and then iterated over each of the dfs, the performance improved dramatically.
grouped_df = master_df.groupby("firm_id")
df_list = [group for group in grouped_df]
for df in df_list:
    my_df = df[1]
    # OLS code
I'm talking 1-2 weeks of run time (24/7) for the first version to complete, compared to 8-9 hours at most for the second.
Can anyone please explain why splitting the master df into N smaller dfs and then iterating over each smaller df performs better than iterating over the same number of groups within the master df?
Thanks ever so much!
I'm unable to reproduce your observation. Here's some code that generates data and then times the direct and indirect methods separately. The time taken is very similar in either case.
Is it possible that you accidentally sorted the dataframe by the group key between the runs? Sorting by group key results in a noticeable difference in run time.
Otherwise, I'm beginning to think that there might be some other differences in your code. It would be great if you could post the full code.
import numpy as np
import pandas as pd
from datetime import datetime
def generate_data():
    ''' returns a Pandas DF with columns 'firm_id' and 'score' '''
    # configuration
    np.random.seed(22)
    num_groups = 50000 # number of distinct groups in the DF
    mean_group_length = 200 # how many records per group?
    cov_group_length = 0.10 # throw in some variability in the num records per group
    # simulate group lengths
    stdv_group_length = mean_group_length * cov_group_length
    group_lengths = np.random.normal(
        loc=mean_group_length,
        scale=stdv_group_length,
        size=(num_groups,)).astype(int)
    group_lengths[group_lengths <= 0] = mean_group_length
    # final length of DF
    total_length = sum(group_lengths)
    # compute entries for group key column
    firm_id_list = []
    for i, l in enumerate(group_lengths):
        firm_id_list.extend([(i + 1)] * l)
    # construct the DF; data column is 'score' populated with Numpy's U[0, 1)
    result_df = pd.DataFrame(data={
        'firm_id': firm_id_list,
        'score': np.random.rand(total_length)
    })
    # Optionally, shuffle or sort the DF by group keys
    # ALTERNATIVE 1: (badly) unsorted df
    result_df = result_df.sample(frac=1, random_state=13).reset_index(drop=True)
    # ALTERNATIVE 2: sort by group key
    # result_df.sort_values(by='firm_id', inplace=True)
    return result_df
def time_method(df, method):
    ''' time 'method' with 'df' as its argument '''
    t_start = datetime.now()
    method(df)
    t_final = datetime.now()
    delta_t = t_final - t_start
    print(f"Method '{method.__name__}' took {delta_t}.")
    return
def process_direct(df):
    ''' direct for-loop over groupby object '''
    for group, df in df.groupby('firm_id'):
        m = df.score.mean()
        s = df.score.std()
    return
def process_indirect(df):
    ''' indirect method: generate groups first as list and then loop over list '''
    grouped_df = df.groupby('firm_id')
    group_list = [pair for pair in grouped_df]
    for pair in group_list:
        m = pair[1].score.mean()
        s = pair[1].score.std()
df = generate_data()
time_method(df, process_direct)
time_method(df, process_indirect)
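To see the sorting effect mentioned above, one option is to time both methods again on a copy sorted by the group key (the same thing ALTERNATIVE 2 inside generate_data does); a small sketch reusing the functions above:
# time both methods again on a frame sorted by the group key
df_sorted = df.sort_values(by='firm_id').reset_index(drop=True)
time_method(df_sorted, process_direct)
time_method(df_sorted, process_indirect)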
Here's what my data looks like:
There are daily records, except for a gap from 2017-06-12 to 2017-06-16.
df2['timestamp'] = pd.to_datetime(df['timestamp'])
df2['timestamp'] = df2['timestamp'].map(lambda x:
    datetime.datetime.strftime(x,'%Y-%m-%d'))
df2 = df2.convert_objects(convert_numeric = True)
df2 = df2.groupby('timestamp', as_index = False).sum()
I need to fill this missing gap and others with values for all fields (e.g. timestamp, temperature, humidity, light, pressure, speed, battery_voltage, etc...).
How can I accomplish this with Pandas?
This is what I have done before
weektime = pd.date_range(start = '06/04/2017', end = '12/05/2017', freq = 'W-SUN')
df['week'] = 'nan'
df['weektemp'] = 'nan'
df['weekhumidity'] = 'nan'
df['weeklight'] = 'nan'
df['weekpressure'] = 'nan'
df['weekspeed'] = 'nan'
df['weekbattery_voltage'] = 'nan'
for i in range(0,len(weektime)):
    df['week'][i+1] = weektime[i]
    df['weektemp'][i+1] = df['temperature'].iloc[7*i+1:7*i+7].sum()
    df['weekhumidity'][i+1] = df['humidity'].iloc[7*i+1:7*i+7].sum()
    df['weeklight'][i+1] = df['light'].iloc[7*i+1:7*i+7].sum()
    df['weekpressure'][i+1] = df['pressure'].iloc[7*i+1:7*i+7].sum()
    df['weekspeed'][i+1] = df['speed'].iloc[7*i+1:7*i+7].sum()
    df['weekbattery_voltage'][i+1] = df['battery_voltage'].iloc[7*i+1:7*i+7].sum()
    i = i + 1
The summed values are not correct: because of the gap, the value for 2017-06-17 ends up being a sum over 2017-06-12 to 2017-06-16, and I do not want those days added in again. This is also not the only gap in the period; I want to fill all of them.
Here is a function I wrote that might be helpful to you. It looks for inconsistent jumps in time and fills them in. After using this function, try using a linear interpolation function (pandas has a good one) to fill in your null data values. Note: Numpy arrays are much faster to iterate over and manipulate than Pandas dataframes, which is why I switch between the two.
import numpy as np
import pandas as pd
data_arr = np.array(your_df) # your_df: the original DataFrame from the question
periodicity = 'daily'

def fill_gaps(data_arr, periodicity):
    rows = data_arr.shape[0]
    data_no_gaps = np.copy(data_arr) #avoid altering the thing you're iterating over
    data_no_gaps_idx = 0
    for row_idx in np.arange(1, rows): #iterate once for each row (except the first record; nothing to compare)
        oldtimestamp_str = str(data_arr[row_idx-1, 0])
        oldtimestamp = np.datetime64(oldtimestamp_str)
        currenttimestamp_str = str(data_arr[row_idx, 0])
        currenttimestamp = np.datetime64(currenttimestamp_str)
        period = currenttimestamp - oldtimestamp
        if period != np.timedelta64(900,'s') and period != np.timedelta64(3600,'s') and period != np.timedelta64(86400,'s'):
            if periodicity == 'quarterly':
                desired_period = 900
            elif periodicity == 'hourly':
                desired_period = 3600
            elif periodicity == 'daily':
                desired_period = 86400
            periods_missing = int(period / np.timedelta64(desired_period,'s'))
            for missing in np.arange(1, periods_missing):
                new_time_orig = str(oldtimestamp + missing*(np.timedelta64(desired_period,'s')))
                new_time = new_time_orig.replace('T', ' ')
                data_no_gaps = np.insert(data_no_gaps, (data_no_gaps_idx + missing),
                                         np.array((new_time, np.nan, np.nan, np.nan, np.nan, np.nan)), 0) # INSERT VALUES YOU WANT IN THE NEW ROW
            data_no_gaps_idx += (periods_missing-1) #increment the index (zero-based => -1) in accordance with added rows
        data_no_gaps_idx += 1 #allow index to change as we iterate over original data array (main for loop)
    #create a dataframe:
    data_arr_no_gaps = pd.DataFrame(data=data_no_gaps, index=None, columns=['Time', 'temp', 'humidity', 'light', 'pressure', 'speed'])
    return data_arr_no_gaps
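A short usage sketch following the note above about interpolating afterwards (column names are the ones the function assigns; pd.to_numeric is needed because the values come back as objects/strings from the NumPy array):
df_no_gaps = fill_gaps(data_arr, periodicity)
# convert the value columns back to numeric, then linearly interpolate the NaN rows that were inserted
value_cols = ['temp', 'humidity', 'light', 'pressure', 'speed']
df_no_gaps[value_cols] = df_no_gaps[value_cols].apply(pd.to_numeric, errors='coerce').interpolate(method='linear')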
Fill time gaps and nulls
Use the function below to ensure expected date sequence exists, and then use forward fill to fill in nulls.
import pandas as pd
import os
def fill_gaps_and_nulls(df, freq='1D'):
    '''
    General steps:
        A) check for extra dates (out of expected frequency/sequence)
        B) check for missing dates (based on expected frequency/sequence)
        C) use forwardfill to fill nulls
        D) use backwardfill to fill remaining nulls
        E) append to file
    '''
    #rename the timestamp to 'date'
    df = df.rename(columns={"timestamp": "date"})
    #sort to make indexing faster
    df = df.sort_values(by=['date'], inplace=False)
    #create an artificial index of dates at frequency = freq, with the same beginning and ending as the original data
    all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq=freq)
    #record column names
    df_cols = df.columns
    #delete ffill_df.csv so we can begin anew
    try:
        os.remove('ffill_df.csv')
    except FileNotFoundError:
        pass
    #check for extra dates and/or dates out of order. print warning statement for log
    extra_dates = set(df.date).difference(all_dates)
    #if there are extra dates (outside of expected sequence/frequency), deal with them
    if len(extra_dates) > 0:
        #############################
        #INSERT DESIRED BEHAVIOR HERE
        print('WARNING: Extra date(s):\n\t{}\n\t Shifting highlighted date(s) back by 1 day'.format(extra_dates))
        for date in extra_dates:
            #shift extra dates back one day
            df.date[df.date == date] = date - pd.Timedelta(days=1)
        #############################
    #check the artificial date index against df to identify missing gaps in time and fill them with nulls
    gaps = all_dates.difference(set(df.date))
    print('\n-------\nWARNING: Missing dates: {}\n-------\n'.format(gaps))
    #if there are time gaps, deal with them
    if len(gaps) > 0:
        #initialize df of correct size, filled with nulls
        gaps_df = pd.DataFrame(index=gaps, columns=df_cols.drop('date')) #len(index) sets number of rows
        #give index a name
        gaps_df.index.name = 'date'
        #add the region and type (r and t are variables from the author's surrounding script)
        gaps_df.region = r
        gaps_df.type = t
        #remove that index so gaps_df and df are compatible
        gaps_df.reset_index(inplace=True)
        #append gaps_df to df
        new_df = pd.concat([df, gaps_df])
        #sort on date
        new_df.sort_values(by='date', inplace=True)
        #fill nulls
        new_df.fillna(method='ffill', inplace=True)
        new_df.fillna(method='bfill', inplace=True)
        #append to file
        new_df.to_csv('ffill_df.csv', mode='a', header=False, index=False)
    return df_cols, regions, types, all_dates #regions and types also come from the author's wider script
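For reference, the core of steps B-D can also be sketched with a plain reindex; this is a simplified illustration (assuming df is your frame and 'date' is already a datetime column), not a replacement for the function above:
all_days = pd.date_range(df['date'].min(), df['date'].max(), freq='1D')
filled = (df.set_index('date')
            .reindex(all_days)   # missing days become all-null rows
            .ffill()             # forward fill nulls
            .bfill()             # backward fill any leading nulls
            .rename_axis('date')
            .reset_index())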
I want to calculate the average number of successful Rattata catches per hour for this whole dataset. I am looking for an efficient way to do this using pandas; I'm new to Python and pandas.
You don't need any loops. Try this; I think the logic is rather clear.
import pandas as pd
#read csv
df = pd.read_csv('pkmn.csv', header=0)
#we need apply some transformations to extract date from timestamp
df['time'] = df['time'].apply(lambda x : pd.to_datetime(str(x)))
df['date'] = df['time'].dt.date
#main transformations
df['hour'] = df['time'].dt.hour #the groupby below needs an 'hour' column; derive it from the timestamp if the CSV doesn't already have one
df = df.query("Pokemon == 'rattata' and caught == True").groupby('hour')
result = pd.DataFrame()
result['caught total'] = df['hour'].count()
result['days'] = df['date'].nunique()
result['caught average'] = result['caught total'] / result['days']
If you have your pandas dataframe saved as df this should work:
rats = df.loc[df.Pokemon == "rattata"] #Gives you subset of rows relating to Rattata
total = sum(rats.Caught) #Gives you the number caught total
diff = rats.time.iloc[-1] - rats.time.iloc[0] #Should give you the time span between the first and last record (assumes rows are sorted by time)
average = total / (diff.total_seconds() / 3600) #Should give you the number caught per hour over that span