I have a basic sports score at the end and the players ranking what I want to do is if their score is equal instead of one player being 3 and the other 4 I need them both to be 3-4, anyone have any good hints where I could find a solution?
Name 100 m Long jump Shot put High jump 400 m 110 m hurdles Discus throw Pole vault Javelin throw 1500 m Total Score Ranking
1 Edan Daniele 12.61 5.00 9.22 1.50 60.39 16.43 21.60 2.6 35.81 00:05:25.720000 3847.0 1
2 Coos Kwesi 13.75 4.84 10.12 1.50 68.44 19.18 30.85 2.8 33.88 00:06:22.750000 3127.0 2
3 Severi Eileifr 13.43 4.35 8.64 1.50 66.06 19.05 24.89 2.2 33.48 00:06:51.010000 2953.0 3
4 Lehi Poghos 13.04 4.53 7.79 1.55 64.72 18.74 24.20 2.4 28.20 00:06:50.760000 2940.0 4
This is the outcome, and this is the code
import numpy as np
from os import sep
import pandas as pd
df = pd.read_csv("Decathlon.csv",sep=";",header=None)
df.reset_index(drop=False)
df.index = np.arange(1, len(df) + 1)
df.columns = ["Name","100 m","Long jump","Shot put","High jump","400 m","110 m hurdles","Discus throw","Pole vault","Javelin throw","1500 m"]
df['100m score'] = round(25.4347*((18-df["100 m"])**1.81))
df["Long jump score"] = round(0.14354*(((df["Long jump"]-220)*-1)**1.4))
df["shot put score"] = round( 51.39*((df["Shot put"]-1.5)**1.05))
df["high jump score"] = round( 0.8465*(((df["High jump"]-75)*-1)**1.42))
df["400m score"] = round( 1.53775*((82-df["400 m"])**1.81))
df['110m hurdles score'] = round( 5.74352*((28.5-df['110 m hurdles'])**1.92))
df['Discus throw score'] = round( 12.91*((df['Discus throw']-4)**1.1))
df['Pole vault score'] = round( 0.2797*(((df['Pole vault']-100)*-1)*1.35))
df['Javelin throw score'] = round( 10.14*(((df['Javelin throw']-7)**1.08)))
df['1500 m'] = pd.to_datetime(df['1500 m'].str.strip(), format='%M.%S.%f')
df['Minute'] = pd.to_datetime(df['1500 m']).dt.minute
df['sekunde'] = pd.to_datetime(df['1500 m']).dt.second
df['milisekunde'] = pd.to_datetime(df['1500 m']).dt.microsecond
df.loc[df['milisekunde']>500000,['sekunde']] = df['sekunde']+1
df['Total seconds'] = (df["Minute"]*60) + df["sekunde"]
df['1500 m score'] = round(0.03768*((480-df["Total seconds"])**1.85))
df["Total Score"] = df['100m score']+df["Long jump score"]+df["shot put score"]+df["high jump score"]+df["400m score"]+df['110m hurdles score']+df['Discus throw score']+df['Pole vault score']+df['Javelin throw score']+df['1500 m score']
df["1500 m"] = pd.DatetimeIndex(df['1500 m']).time
#clean up
del df['100m score']
del df["Long jump score"]
del df["shot put score"]
del df["high jump score"]
del df["400m score"]
del df['110m hurdles score']
del df['Discus throw score']
del df['Pole vault score']
del df['Javelin throw score']
del df['Minute']
del df['sekunde']
del df['milisekunde']
del df["Total seconds"]
del df ["1500 m score"]
df = df.sort_values(['Total Score'], ascending = False)
df= df.reset_index(drop = True)
df.index = np.arange(1, len(df) + 1)
df["Ranking"] = df.index
print(df)
df.to_json('Json file')
Assuming "Decathlon.csv" file looks something like this:
Edan Daniele;12.61;5.00;9.22;1.50;60.39;16.43;21.60;2.6;35.81;00:05:25.720000
Coos Kwesi;13.75;4.84;10.12;1.50;68.44;19.18;30.85;2.8;33.88;00:06:22.750000
Severi Eileifr;13.43;4.35;8.64;1.50;66.06;19.05;24.89;2.2;33.48;00:06:51.010000
Severi Eileifr;13.43;4.35;8.64;1.50;66.06;19.05;24.89;2.2;33.48;00:06:51.010000
Lehi Poghos;13.04;4.53;7.79;1.55;64.72;18.74;24.20;2.4;28.20;00:06:50.760000
Here's how you can generate the rankings:
df["Ranking"] = df["Total Score"].apply(lambda score: df.index[df["Total Score"] == score].astype(str)).str.join("-")
output:
Name 100 m ... Total Score Ranking
1 Edan Daniele 12.61 ... 6529.0 1
2 Coos Kwesi 13.75 ... 6088.0 2
3 Severi Eileifr 13.43 ... 5652.0 3-4
4 Severi Eileifr 13.43 ... 5652.0 3-4
5 Lehi Poghos 13.04 ... 5639.0 5
or just use .tolist() to get rankings as a list:
df["Ranking"] = df["Total Score"].apply(lambda score: df.index[df["Total Score"] == score].tolist())
Name 100 m ... Total Score Ranking
1 Edan Daniele 12.61 ... 6529.0 [1]
2 Coos Kwesi 13.75 ... 6088.0 [2]
3 Severi Eileifr 13.43 ... 5652.0 [3, 4]
4 Severi Eileifr 13.43 ... 5652.0 [3, 4]
5 Lehi Poghos 13.04 ... 5639.0 [5]
Might not be the best approach though
Note: I've made rows 3 and 4 identical in the initial csv to match the example you've provided
Related
I have a DataFrame with Employees and their hours for different categories.
I need to recalculate only specific categories (OT, MILE and REST Categories SHOULD NOT Be Updated, ALL Other Should be updated) ONLY if OT category is present under Empl_Id.
data = {'Empl_Id': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3],
'Category': ["MILE", "REST", "OT", "TRVL", "REG", "ADMIN", "REST", "REG", "MILE", "OT", "TRVL", "REST", "MAT", "REG"],
'Value': [43, 0.7, 6.33, 2.67, 52, 22, 1.17, 16.5, 73.6, 4.75, 1.33, 2.5, 5.5, 52.25]}
df = pd.DataFrame(data=data)
df
Empl_Id
Category
Value
1
MILE
43
1
REST
0.7
1
OT
6.33
1
TRVL
2.67
1
REG
52
2
ADMIN
22
2
REST
1.17
2
REG
16.5
3
MILE
73.6
3
OT
4.75
3
TRVL
1.33
3
REST
2.5
3
MAT
5.5
3
REG
52.25
The Logic is to:
1) Find % of OT Hours from Total Hours (OT, REST and MILE don't count):
1st Empl_Id: 6.33 (OT) / 2.67 (TRVL) + 52 (REG) = 6.33 / 54.67 = 11.58 %
2nd Empl_Id: OT Hours Not present, nothing should be updated
3rd Empl_Id: 4.75 (OT) / 1.33 (TRVL) + 5.5 (MAT) + 52.25 (REG) = 4.75 / 59.08 = 8.04 %
2) Substract % of OT from each category (OT, REST and MILE don't count):
Empl_Id
Category
Value
1
MILE
43
1
REST
0.7
1
OT
6.33
1
TRVL
2.67 - 11.58 % (0.31) = 2.36
1
REG
52 - 11.58 % (6.02) = 45.98
2
ADMIN
22
2
REST
1.17
2
REG
16.5
3
MILE
73.6
3
OT
4.75
3
TRVL
1.33 - 8.04 % (0.11) = 1.22
3
REST
2.5
3
MAT
5.5 - 8.04 % (0.44) = 5.06
3
REG
52.25 - 8.04 % (4.2) = 48.05
You can use:
keep = ['OT', 'MILE', 'REST']
# get factor
factor = (df.groupby(df['Empl_Id'])
.apply(lambda g: g.loc[g['Category'].eq('OT'),'Value'].sum()
/g.loc[~g['Category'].isin(keep),'Value'].sum()
)
.rsub(1)
)
# update
df.loc[~df['Category'].isin(keep), 'Value'] *= df['Empl_Id'].map(factor)
output:
Empl_Id Category Value
0 1 MILE 43.000000
1 1 REST 0.700000
2 1 OT 6.330000
3 1 TRVL 2.360852
4 1 REG 45.979148
5 2 ADMIN 22.000000
6 2 REST 1.170000
7 2 REG 16.500000
8 3 MILE 73.600000
9 3 OT 1.750000
10 3 TRVL 1.290604
11 3 REST 2.500000
12 3 MAT 5.337085
13 3 REG 50.702310
I would like to simulate individual changes in growth and mortality for a variable number of days. My dataframe is formatted as follows...
import pandas as pd
data = {'unique_id': ['2', '4', '5', '13'],
'length': ['27.7', '30.2', '25.4', '29.1'],
'no_fish': ['3195', '1894', '8', '2774'],
'days_left': ['253', '253', '254', '256'],
'growth': ['0.3898', '0.3414', '0.4080', '0.3839']
}
df = pd.DataFrame(data)
print(df)
unique_id length no_fish days_left growth
0 2 27.7 3195 253 0.3898
1 4 30.2 1894 253 0.3414
2 5 25.4 8 254 0.4080
3 13 29.1 2774 256 0.3839
Ideally, I would like the initial length (i.e., length) to increase by the daily growth rate (i.e., growth) for each of the days remaining in the year (i.e., days_left).
df['final'] = df['length'] + (df['days_left'] * df['growth']
However, I would also like to update the number of fish that each individual represents (i.e., no_fish) on a daily basis using a size-specific equation. I'm fairly new to python so I initially thought to use a for-loop (I'm not sure if there is another, more efficient way). My code is as follows:
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for indx in range(len(df)):
count = 1
while count <= int(df.days_to_forecast[indx]):
# (1) update individual length
df.lgth[indx] = df.lgth[indx] + df.linearGR[indx]
# (2) estimate daily size-specific mortality
if df.lgth[indx] > 50.0:
df.z[indx] = 0.01
else:
if df.lgth[indx] <= 50.0:
df.z[indx] = 0.052857-((0.03/35)*df.lgth[indx])
elif df.lgth[indx] < 15.0:
df.z[indx] = 0.728*math.exp(-0.1892*df.lgth[indx])
df['no_fish'].round(decimals = 0)
if df.no_fish[indx] < 1.0:
df.no_fish[indx] = 0.0
elif df.no_fish[indx] >= 1.0:
df.no_fish[indx] = df.no_fish[indx]*math.exp(-(df.z[indx]))
# (3) reduce no. of days left in forecast by 1
count = count + 1
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
The above code now works correctly, but it is still far to inefficient to run for 40,000 individuals each for 200+ days.
I would really appreciate any advice on how to modify the following code to make it pythonic.
Thanks
Another option that was suggested to me is to use the pd.dataframe.apply function. This dramatically reduced the overall the run time and could be useful to someone else in the future.
### === RUN SIMULATION === ###
start_time = time.perf_counter() # keep track of run time -- START
#-------------------------------------------------------------------------#
def function_to_apply( df ):
df['z_instantMort'] = ''
for indx in range(int(df['days_left'])):
# (1) update individual length
df['length'] = df['length'] + df['growth']
# (2) estimate daily size-specific mortality
if df['length'] > 50.0:
df['z_instantMort'] = 0.01
else:
if df['length'] <= 50.0:
df['z_instantMort'] = 0.052857-((0.03/35)*df['length'])
elif df['length'] < 15.0:
df['z_instantMort'] = 0.728*np.exp(-0.1892*df['length'])
whole_fish = round(df['no_fish'], 0)
if whole_fish < 1.0:
df['no_fish'] = 0.0
elif whole_fish >= 1.0:
df['no_fish'] = df['no_fish']*np.exp(-(df['z_instantMort']))
return df
#-------------------------------------------------------------------------#
sim_results = df.apply(function_to_apply, axis=1)
total_elapsed_time = round(time.perf_counter() - start_time, 2) # END
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(sim_results)
### ====================== ###
output being...
Forecast iteration completed in 0.05 seconds
unique_id length no_fish days_left growth z_instantMort
0 2.0 126.3194 148.729190 253.0 0.3898 0.01
1 4.0 116.5742 93.018465 253.0 0.3414 0.01
2 5.0 129.0320 0.000000 254.0 0.4080 0.01
3 13.0 127.3784 132.864757 256.0 0.3839 0.01
As I said in my comment, a preferable alternative to for loops in this setting is using vector operations. For instance, running your code:
import pandas as pd
import time
import math
import numpy as np
data = {'unique_id': [2, 4, 5, 13],
'length': [27.7, 30.2, 25.4, 29.1],
'no_fish': [3195, 1894, 8, 2774],
'days_left': [253, 253, 254, 256],
'growth': [0.3898, 0.3414, 0.4080, 0.3839]
}
df = pd.DataFrame(data)
print(df)
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for indx in range(len(df)):
count = 1
while count <= int(df.days_left[indx]):
# (1) update individual length
df.length[indx] = df.length[indx] + df.growth[indx]
# (2) estimate daily size-specific mortality
if df.length[indx] > 50.0:
df.z[indx] = 0.01
else:
if df.length[indx] <= 50.0:
df.z[indx] = 0.052857-((0.03/35)*df.length[indx])
elif df.length[indx] < 15.0:
df.z[indx] = 0.728*math.exp(-0.1892*df.length[indx])
df['no_fish'].round(decimals = 0)
if df.no_fish[indx] < 1.0:
df.no_fish[indx] = 0.0
elif df.no_fish[indx] >= 1.0:
df.no_fish[indx] = df.no_fish[indx]*math.exp(-(df.z[indx]))
# (3) reduce no. of days left in forecast by 1
count = count + 1
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(df)
with output:
unique_id length no_fish days_left growth
0 2 27.7 3195 253 0.3898
1 4 30.2 1894 253 0.3414
2 5 25.4 8 254 0.4080
3 13 29.1 2774 256 0.3839
Forecast iteration completed in 31.75 seconds
unique_id length no_fish days_left growth z
0 2 126.3194 148.729190 253 0.3898 0.01
1 4 116.5742 93.018465 253 0.3414 0.01
2 5 129.0320 0.000000 254 0.4080 0.01
3 13 127.3784 132.864757 256 0.3839 0.01
Now with vector operations, you could do something like:
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for day in range(1, df.days_left.max() + 1):
update = day <= df['days_left']
# (1) update individual length
df[update]['length'] = df[update]['length'] + df[update]['growth']
# (2) estimate daily size-specific mortality
df[update]['z'] = np.where( df[update]['length'] > 50.0, 0.01, 0.052857-( ( 0.03 / 35)*df[update]['length'] ) )
df[update]['z'] = np.where( df[update]['length'] < 15.0, 0.728 * np.exp(-0.1892*df[update]['length'] ), df[update]['z'] )
df[update]['no_fish'].round(decimals = 0)
df[update]['no_fish'] = np.where(df[update]['no_fish'] < 1.0, 0.0, df[update]['no_fish'] * np.exp(-(df[update]['z'])))
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(df)
with output
Forecast iteration completed in 1.32 seconds
unique_id length no_fish days_left growth z
0 2 126.3194 148.729190 253 0.3898 0.0
1 4 116.5742 93.018465 253 0.3414 0.0
2 5 129.0320 0.000000 254 0.4080 0.0
3 13 127.3784 132.864757 256 0.3839 0.0
First off, I think it would be helpful to offer some background about what I want to do. I have a time-series dataset that describes air quality in a region, with hour resolution. Each row is an observation, each column is a different parameter (eg. Temperature, Pressure, Particulate matter, etc.) I want to take an average of observations for each hour in the day, across the entire five year dataset. However, I first need to distinguish between summer and winter observations. Here are a few rows for reference:
Date Time WSA WSV WDV WSM SGT T2M T10M DELTA_T PBAR SRAD RH PM25 AQI
0 2015-01-01 00:00:00 0.9 0.2 334 3.2 70.9 29.2 29.1 -0.1 740.4 8 102.5 69.0 157.970495
1 2015-01-01 01:00:00 1.5 0.7 129 4.0 58.8 29.6 29.2 -0.4 740.2 8 102.5 23.5 74.974249
2 2015-01-01 02:00:00 0.8 0.8 70 2.7 18.0 28.7 28.3 -0.4 740.3 7 102.2 40.1 112.326633
3 2015-01-01 03:00:00 1.1 1.0 82 3.4 21.8 28.2 27.8 -0.4 740.1 6 102.0 31.1 90.957082
4 2015-01-01 04:00:00 1.0 0.8 65 4.7 34.3 27.3 27.2 -0.2 739.7 6 101.7 13.7 54.364807
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
43175 2016-12-30 19:00:00 1.7 0.7 268 4.1 63.6 33.8 34.1 0.3 738.8 8 100.7 38.4 108.140704
43176 2016-12-30 20:00:00 1.5 0.1 169 3.3 77.5 33.2 33.7 0.5 738.7 9 101.0 27.2 82.755365
43177 2016-12-30 21:00:00 1.4 0.5 278 4.0 65.7 32.5 32.8 0.3 738.6 9 101.4 42.5 118.236181
43178 2016-12-30 22:00:00 2.8 2.7 277 6.5 16.7 33.2 33.3 0.1 738.6 9 101.6 25.2 78.549356
43179 2016-12-30 23:00:00 1.9 0.3 241 4.2 74.2 31.0 31.6 0.6 738.4 9 100.4 18.7 64.879828
[43180 rows x 15 columns]
I have tried splitting the dataset into two based on season, and plotting each separately. This works, but I cannot manage to make the plot display a legend.
mask = (df['Date'].dt.month > 3) & (df['Date'].dt.month < 10)
summer = df[mask]
winter = df[~mask]
summer = summer.groupby(summer['Time'].dt.hour).mean().reset_index()
winter = winter.groupby(winter['Time'].dt.hour).mean().reset_index()
p = (
ggplot(mapping=aes( x='Time', y='PM25')) +
geom_point(data=summer, color='red')+
geom_point(data=winter, color='blue')
)
print(p)
Plotting with separate dataframes:
[1]: https://i.stack.imgur.com/W75kk.png
I did some more research, and learned that plotnine/ggplot can color-code data points based on one of their attributes. This approach requires the data to be a single dataset, so I added a parameter specifying the season. However, when I group by hour, this 'Season' attribute is removed. I assume it is because you cannot take the mean of non-numeric data. As such, I find myself in a bit of a paradox.
Here is the my attempt at keeping the data together and adding a 'Season' column:
df.insert(0,'Season', 0)
summer = (df['Date'].dt.month > 3) & (df['Date'].dt.month < 10)
df['Season'] = df.where(summer, other='w')
df['Season'] = df.where(~summer, other='s')
df = df.groupby(df['Time'].dt.hour).mean()
print(df)
p = (
ggplot(data = df, mapping=aes( x='Time', y='PM25', color='Season')) +
geom_point()
)
print(p)
When I try to run this, it raises the following, and if I inspect the dataframe all non-numeric paramters have been removed:
plotnine.exceptions.PlotnineError: "Could not evaluate the 'color' mapping: 'Season' (original error: name 'Season' is not defined)"
Any suggestions would be hugely appreciated.
Data provided has been saved to airq.csv. Besides to Season column, Hour column has been included. Code provided has been used. 'Hour' and 'Season' have to be provided in groupby function. Two plotnine.ggplot possibilities are provided. Fist using geom_point, and second one adding facet_wrap. Theme customization has been included for each case.
from plotnine import *
import pandas as pd
df = pd.read_csv('airq.csv', parse_dates=[0,1])
df.insert(0,'Season', 0)
summer = (df['Date'].dt.month > 3) & (df['Date'].dt.month < 9)
df['Season'] = df.where(summer, other='Winter')
df['Season'] = df.where(~summer, other='Summer')
df['Hour'] = df['Time'].dt.hour
df = df.groupby(['Hour', 'Season']).mean().reset_index()
custom_axis = theme(axis_text_x = element_text(color="grey", size=6, angle=90, hjust=.3),
axis_text_y = element_text(color="grey", size=6),
plot_title = element_text(size = 25, face = "bold"),
axis_title = element_text(size = 10)
)
(
ggplot(data = df, mapping = aes(x='Hour', y='PM25',
color='Season')) + geom_point() +
custom_axis + ylab("Particulate matter 2.5 micrometres") + xlab("Hour") + labs(title="PM air quality report")
)
custom_axis = theme(axis_text_x = element_text(color="grey", size=6, angle=90, hjust=.3),
axis_text_y = element_text(color="grey", size=6),
plot_title = element_text(size = 25, face = "bold"),
axis_title = element_text(size = 10),
panel_spacing_y=.4,
figure_size=(8, 4)
)
(
ggplot(data = df, mapping = aes(x='Hour', y='PM25')) + geom_point(alpha=1) + facet_wrap('Season') +
custom_axis + ylab("Particulate matter 2.5 micrometres") + xlab("Hour") + labs(title="PM air quality report")
)
I'm trying to find an efficient way to generate rolling counts or sums in pandas given a grouping and a date range. Eventually, I want to be able to add conditions, ie. evaluating a 'type' field, but I'm not there just yet. I've written something to get the job done, but feel that there could be a more direct way of getting to the desired result.
My pandas data frame currently looks like this, with the desired output being put in the last column 'rolling_sales_180'.
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
My current solution and environment can be sourced below. I've been modeling my solution from this R Q&A in stackoverflow. Efficient way to perform running total in the last 365 day window
import pandas as pd
import numpy as np
def trans_date_to_dist_matrix(date_col): # used to create a distance matrix
x = date_col.tolist()
y = date_col.tolist()
data = []
for i in x:
tmp = []
for j in y:
tmp.append(abs((i - j).days))
data.append(tmp)
del tmp
return pd.DataFrame(data=data, index=date_col.values, columns=date_col.values)
def lower_tri(x_col, date_col, win): # x_col = column user wants a rolling sum of ,date_col = dates, win = time window
dm = trans_date_to_dist_matrix(date_col=date_col) # dm = distance matrix
dm = dm.where(dm <= win) # find all elements of the distance matrix that are less than window(time)
lt = dm.where(np.tril(np.ones(dm.shape)).astype(np.bool)) # lt = lower tri of distance matrix so we get only future dates
lt[lt >= 0.0] = 1.0 # cleans up our lower tri so that we can sum events that happen on the day we are evaluating
lt = lt.fillna(0) # replaces NaN with 0's for multiplication
return pd.DataFrame(x_col.values * lt.values).sum(axis=1).tolist()
def flatten(x):
try:
n = [v for sl in x for v in sl]
return [v for sl in n for v in sl]
except:
return [v for sl in x for v in sl]
data = [
['David', '1/1/2015', 100], ['David', '1/5/2015', 500], ['David', '5/30/2015', 50], ['David', '7/25/2015', 50],
['Ryan', '1/4/2014', 100], ['Ryan', '1/19/2015', 500], ['Ryan', '3/31/2016', 50],
['Joe', '7/1/2015', 100], ['Joe', '9/9/2015', 500], ['Joe', '10/15/2015', 50]
]
list_of_vals = []
dates_df = pd.DataFrame(data=data, columns=['name', 'date', 'amount'], index=None)
dates_df['date'] = pd.to_datetime(dates_df['date'])
list_of_vals.append(dates_df.groupby('name', as_index=False).apply(
lambda x: lower_tri(x_col=x.amount, date_col=x.date, win=180)))
new_data = flatten(list_of_vals)
dates_df['rolling_sales_180'] = new_data
print dates_df
Your time and feedback are appreciated.
Pandas has support for time-aware rolling via the rolling method, so you can use that instead of writing your own solution from scratch:
def get_rolling_amount(grp, freq):
return grp.rolling(freq, on='date')['amount'].sum()
df['rolling_sales_180'] = df.groupby('name', as_index=False, group_keys=False) \
.apply(get_rolling_amount, '180D')
The resulting output:
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
When I use pandas value_count method, I get the data below:
new_df['mark'].value_counts()
1 1349110
2 1606640
3 175629
4 790062
5 330978
How can I get the percentage for each row like this?
1 1349110 31.7%
2 1606640 37.8%
3 175629 4.1%
4 790062 18.6%
5 330978 7.8%
I need to divide each row by the sum of these data.
np.random.seed([3,1415])
s = pd.Series(np.random.choice(list('ABCDEFGHIJ'), 1000, p=np.arange(1, 11) / 55.))
s.value_counts()
I 176
J 167
H 136
F 128
G 111
E 85
D 83
C 52
B 38
A 24
dtype: int64
As percent
s.value_counts(normalize=True)
I 0.176
J 0.167
H 0.136
F 0.128
G 0.111
E 0.085
D 0.083
C 0.052
B 0.038
A 0.024
dtype: float64
counts = s.value_counts()
percent = counts / counts.sum()
fmt = '{:.1%}'.format
pd.DataFrame({'counts': counts, 'per': percent.map(fmt)})
counts per
I 176 17.6%
J 167 16.7%
H 136 13.6%
F 128 12.8%
G 111 11.1%
E 85 8.5%
D 83 8.3%
C 52 5.2%
B 38 3.8%
A 24 2.4%
I think you need:
#if output is Series, convert it to DataFrame
df = df.rename('a').to_frame()
df['per'] = (df.a * 100 / df.a.sum()).round(1).astype(str) + '%'
print (df)
a per
1 1349110 31.7%
2 1606640 37.8%
3 175629 4.1%
4 790062 18.6%
5 330978 7.8%
Timings:
It seems faster is use sum as twice value_counts:
In [184]: %timeit (jez(s))
10 loops, best of 3: 38.9 ms per loop
In [185]: %timeit (pir(s))
10 loops, best of 3: 76 ms per loop
Code for timings:
np.random.seed([3,1415])
s = pd.Series(np.random.choice(list('ABCDEFGHIJ'), 1000, p=np.arange(1, 11) / 55.))
s = pd.concat([s]*1000)#.reset_index(drop=True)
def jez(s):
df = s.value_counts()
df = df.rename('a').to_frame()
df['per'] = (df.a * 100 / df.a.sum()).round(1).astype(str) + '%'
return df
def pir(s):
return pd.DataFrame({'a':s.value_counts(),
'per':s.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'})
print (jez(s))
print (pir(s))
Here's a more pythonic snippet than what is proposed above I think
def aspercent(column,decimals=2):
assert decimals >= 0
return (round(column*100,decimals).astype(str) + "%")
aspercent(df['mark'].value_counts(normalize=True),decimals=1)
This will output:
1 1349110 31.7%
2 1606640 37.8%
3 175629 4.1%
4 790062 18.6%
5 330978 7.8%
This also allows to adjust the number of decimals
Create two series, first one with absolute values and a second one with percentages, and concatenate them:
import pandas
d = {'mark': ['pos', 'pos', 'pos', 'pos', 'pos',
'neg', 'neg', 'neg', 'neg',
'neutral', 'neutral' ]}
df = pd.DataFrame(d)
absolute = df['mark'].value_counts(normalize=False)
absolute.name = 'value'
percentage = df['mark'].value_counts(normalize=True)
percentage.name = '%'
percentage = (percentage*100).round(2)
pd.concat([absolute, percentage], axis=1)
Output:
value %
pos 5 45.45
neg 4 36.36
neutral 2 18.18