I have a data frame with columns:
User_id  PQ_played  PQ_offered
      1          5          15
      2         12          75
      3         25          50
I need to divide PQ_played by PQ_offered to calculate the % of games played. This is what I've tried so far:
new_df['%_PQ_played'] = df.groupby('User_id').((df['PQ_played']/df['PQ_offered'])*100),as_index=True
I know that I am terribly wrong.
It's much simpler than you think.
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
         PQ_offered  PQ_played  %_PQ_played
User_id
1                15          5    33.333333
2                75         12    16.000000
3                50         25    50.000000
You can use a lambda function:
df.groupby('User_id').apply(lambda x: (x['PQ_played'] / x['PQ_offered']) * 100) \
    .reset_index(1, drop=True).reset_index().rename(columns={0: '%_PQ_played'})
You get
   User_id  %_PQ_played
0        1    33.333333
1        2    16.000000
2        3    50.000000
I totally agree with @mVChr and think you are overcomplicating what you need to do. If you are simply trying to add an additional column, then his response is spot on. If you truly need to use groupby, it is worth noting that it is typically used for aggregation, e.g., sum(), count(), etc. If, for example, you had several records with non-unique values in the User_id column, then you could create the additional column using
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
and then perform an aggregation. Let's say you wanted to know, for each user, the average percentage of offered games that were played; you could do something like
new_df = df.groupby('User_id', as_index=False)['%_PQ_played'].mean()
This would yield (numbers are arbitrary)
   User_id  %_PQ_played
0        1    52.777778
1        2    29.250000
2        3    65.000000
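Putting the two steps together, a minimal runnable sketch (toy frame with hypothetical repeated User_id values; the numbers are made up) might look like:

```python
import pandas as pd

# toy frame with repeated User_id values (numbers are made up)
df = pd.DataFrame({
    "User_id":    [1, 1, 2, 3],
    "PQ_played":  [5, 10, 12, 25],
    "PQ_offered": [15, 20, 75, 50],
})

# per-row percentage first, then aggregate per user
df["%_PQ_played"] = df["PQ_played"] / df["PQ_offered"] * 100
new_df = df.groupby("User_id", as_index=False)["%_PQ_played"].mean()
```

With `as_index=False`, User_id stays a regular column in `new_df` rather than becoming the index.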
I am trying to randomly select records from a 17MM-row dataframe using np.random.choice, as it runs faster compared to other methods, but I am getting incorrect values in the output for some records. Example below:
import pandas as pd
import numpy as np

data = {
    "calories": [420, 380, 390, 500, 200, 100],
    "duration": [50, 40, 45, 600, 450, 210],
    "id": [1, 1, 2, 3, 2, 3],
    "yr": [2003, 2003, 2009, 2003, 2012, 2003],
    "mth": [3, 6, 9, 12, 3, 6],
}
df = pd.DataFrame(data)
df2 = df.groupby(['id', 'yr'], as_index=False).agg(np.random.choice)
Output:
id    yr  calories  duration  mth
 1  2003       420        50    6
 2  2009       390        45    9
 2  2012       200       450    3
 3  2003       500       210    6
The problem in the output is for id 3: for calories 500, duration and mth should be 600 and 12 instead of 210 and 6. Can anyone please help explain why it is choosing values from different rows?
Expected output:
Same row value should be retained after random selection
This doesn't work because pandas applies aggregates to each column independently. Try putting a print statement in, e.g.:
def fn(x):
    print(x)
    return np.random.choice(x)

df.groupby(['id', 'yr'], as_index=False).agg(fn)
This would let you see when the function was called and what it was called with.
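To see this column-by-column behavior concretely, here is a small sketch (same toy data as the question) that records which column each call receives, rather than printing:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "calories": [420, 380, 390, 500, 200, 100],
    "duration": [50, 40, 45, 600, 450, 210],
    "id":       [1, 1, 2, 3, 2, 3],
    "yr":       [2003, 2003, 2009, 2003, 2012, 2003],
    "mth":      [3, 6, 9, 12, 3, 6],
})

calls = []

def fn(x):
    # agg hands fn one column of one group at a time, so values
    # from the same original row are chosen independently
    calls.append(x.name)
    return np.random.choice(x)

out = df.groupby(["id", "yr"], as_index=False).agg(fn)
```

Every non-key column shows up in `calls`, confirming that each column is aggregated separately per group.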
I'm not an expert in Pandas, but using GroupBy.apply seems to be the easiest way I've found of keeping rows together.
Something like the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "calories": [420, 380, 390, 500, 200, 100],
    "duration": [50, 40, 45, 600, 450, 210],
    "id": [1, 1, 2, 3, 2, 3],
    "yr": [2003, 2003, 2009, 2003, 2012, 2003],
    "mth": [3, 6, 9, 12, 3, 6],
})
df.groupby(['id', 'yr'], as_index=False).apply(lambda x: x.sample(1))
produces:
     calories  duration  id    yr  mth
0 1       380        40   1  2003    6
1 2       390        45   2  2009    9
2 4       200       450   2  2012    3
3 5       100       210   3  2003    6
The two numbers at the beginning are there because you end up with a MultiIndex. If you want to know where the rows were selected from, this contains useful information; otherwise you can discard the index.
Note that there are warnings in the docs that this might not be very performant, but I don't know the details.
Update: I've just had more of a read of the docs, and noticed that there's a GroupBy.sample method, so you could instead just do:
df.groupby(['id', 'yr']).sample(1)
which would presumably be performant as well as being much shorter!
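As a quick sanity check that GroupBy.sample keeps rows intact, a sketch along these lines (same toy data) merges the sampled rows back against the original:

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390, 500, 200, 100],
    "duration": [50, 40, 45, 600, 450, 210],
    "id":       [1, 1, 2, 3, 2, 3],
    "yr":       [2003, 2003, 2009, 2003, 2012, 2003],
    "mth":      [3, 6, 9, 12, 3, 6],
})

# one intact row per (id, yr) group
picked = df.groupby(["id", "yr"]).sample(n=1, random_state=0)

# every sampled row should match a full row of the original frame
check = picked.merge(df, how="left", indicator=True)
```

If any sampled row had values mixed from different source rows, the merge indicator would show `left_only` instead of `both`.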
I know this question might be a bit inappropriate for this platform but I have nowhere else to ask for help.
I'm new to Python and I'm trying to learn how to iterate over a list with some conditions. I have the following problem: for each unique link from Where to Where, I want to choose one of the most profitable suppliers. A profitable supplier is one that on most days of the week turned out to be cheaper (that is, had a lower cost) than the other suppliers. The dataset is the following, where the columns are: 1st From, 2nd To, 3rd Day of the week, 4th supplier's number, 5th Cost.
To solve this task, I've decided firstly to create a new column and list with unique routes.
df_routes['route'] = df_routes['From'] + '-' + df_routes['Where']
routes = df_routes['route'].unique()
len(routes)
And then iterate over it but I do not fully understand how the structure should look like. My guess is that it should be something like this:
for i, route in enumerate(routes):
    x = df_routes[df_routes['route'] == route]
    if x['supplier'].nunique() == 1:
        print(route, supplier)
    else:
        ...
I don't know how to structure it further and whether this is the right structure. So how it should look like?
I will really appreciate any help (tips, hints, snippets of code) on this question.
This is more efficiently solved with pandas functions rather than looping.
Let df be a portion of your dataframe holding the first two routes. First we sort by cost, then group by the route and the day; taking the first row of each group tells us, for each day on each route, which supplier is the cheapest:
df1 = df.sort_values('Cost', ascending = True).groupby(['From','To', 'Day']).first()
df1 looks like this:
             Supplier   Cost
From To  Day
BGW  MOW 1          3  75910
         2          3  75990
         3          3  27340
         4          3  75990
         5         11  19880
         6          3  75440
         7         11  24740
OSS  UUS 1         47  65650
         2         47  47365
         3         47  70635
         4         47  47365
         5         47  62030
         6         47  62030
         7         47  71010
Next we count the number of mentions for each supplier for each route:
df2 = df1.groupby(['From','To'])['Supplier'].value_counts().rename('days').reset_index(level=2)
df2 looks like this:
          Supplier  days
From To
BGW  MOW         3     5
     MOW        11     2
OSS  UUS        47     7
e.g., for the first route, supplier 3 was the cheapest for 5 days and supplier 11 for 2 days.
Now we just pick the first (most-mentioned) supplier for each route:
df3 = df2.groupby(['From','To']).first()
df3 is the final output and looks like this:
          Supplier  days
From To
BGW  MOW         3     5
OSS  UUS        47     7
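The whole chain can be run end to end on a tiny hypothetical frame (two routes, three days, two suppliers; all numbers below are made up for illustration):

```python
import pandas as pd

# hypothetical toy data: two routes, three days, two suppliers each
df = pd.DataFrame({
    "From":     ["A"] * 6 + ["B"] * 6,
    "To":       ["X"] * 6 + ["Y"] * 6,
    "Day":      [1, 1, 2, 2, 3, 3] * 2,
    "Supplier": [10, 11] * 6,
    "Cost":     [100, 90, 80, 95, 70, 99, 50, 40, 45, 60, 55, 41],
})

# cheapest supplier per route per day
df1 = df.sort_values("Cost").groupby(["From", "To", "Day"]).first()
# how many days each supplier was cheapest, per route
df2 = (df1.groupby(["From", "To"])["Supplier"]
          .value_counts().rename("days").reset_index(level=2))
# value_counts sorts descending, so first() picks the most-mentioned supplier
df3 = df2.groupby(["From", "To"]).first()
```

On this data, supplier 10 is cheapest on 2 of 3 days for route A-X, and supplier 11 on 2 of 3 days for route B-Y.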
Group the dataframe by the columns ['From', 'To', 'Day'] and aggregate 'Cost' with min to get the result:
df.groupby(['From', 'To', 'Day'], as_index=False)['Cost'].min()
First post: I apologize in advance for sloppy wording (and possibly poor searching, if this question has been answered ad nauseam elsewhere; maybe I don't know the right search terms yet).
I have data in 10-minute chunks and I want to perform calculations on a column ('input') grouped by minute (i.e. 10 separate 60-second blocks - not a rolling 60 second period) and then store all ten calculations in a single list called output.
The 'seconds' column records the second from 1 to 600 in the 10-minute period. If no data was entered for a given second, there is no row for that number of seconds. So, some minutes have 60 rows of data, some have as few as one or two.
Note: the calculation (my_function) is not basic so I can't use groupby and np.sum(), np.mean(), etc. - or at least I can't figure out how to use groupby.
I have code that gets the job done but it looks ugly to me so I am sure there is a better way (probably several).
output = []
seconds_slicer = 0
for i in np.linspace(1, 10, 10):
    seconds_slicer += 60
    minute_slice = df[(df['seconds'] > (seconds_slicer - 60)) &
                      (df['seconds'] <= seconds_slicer)]
    calc = my_function(minute_slice['input'])
    output.append(calc)
If there is a cleaner way to do this, please let me know. Thanks!
Edit: Adding sample data and function details:
   seconds     input
1        1  0.000054
2        2 -0.000012
3        3  0.000000
4        4  0.000000
5        5  0.000045
def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))
For this answer, we're going to repurpose "Bin pandas dataframe by every X rows".
We'll create a dataframe with missing data in the 'seconds' column, matching my understanding of your data from the description given:
secs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 17, 19]
data = [np.random.randint(-25, 54) / 100000 for _ in range(15)]
df = pd.DataFrame(data=zip(secs, data), columns=['seconds', 'input'])
df
    seconds    input
0         1  0.00017
1         2 -0.00020
2         3  0.00033
3         4  0.00052
4         5  0.00040
5         6 -0.00015
6         7  0.00001
7         8 -0.00010
8         9  0.00037
9        11  0.00050
10       12  0.00000
11       14 -0.00009
12       15 -0.00024
13       17  0.00047
14       19 -0.00002
I didn't create 600 rows, but that's okay; we'll say we want to bin every 5 seconds instead of every 60. Because we're just trying to use equal time measures for grouping, we can use floor division to see which bin each time interval ends up in. (In your case, you'd divide by 60 instead.)
# drop the extra 'seconds' column: we don't care about the root
# sum of squares of the seconds themselves
grouped = df.groupby(df['seconds'] // 5).apply(realized_volatility).drop('seconds', axis=1)
grouped
            input
seconds
0        0.000441
1        0.000372
2        0.000711
3        0.000505
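Adapting this to the original 600-second layout, a sketch might look like the following (note the minus one: the seconds run from 1 to 600, so subtracting 1 before dividing keeps second 60 inside the first minute; the input values are made up):

```python
import numpy as np
import pandas as pd

def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))

# sparse, made-up data scattered across the 10-minute window
df = pd.DataFrame({
    "seconds": [1, 30, 60, 61, 150, 400, 599],
    "input":   [0.3, 0.4, 1.2, 0.6, 0.8, 0.5, 1.2],
})

# (seconds - 1) // 60 labels each row with its minute (0-9);
# minutes with no rows simply produce no group in the output
output = (df.groupby((df["seconds"] - 1) // 60)["input"]
            .apply(realized_volatility)
            .tolist())
```

One caveat: minutes with no data are absent from `output` entirely, so with sparse data the list can be shorter than 10 entries.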
I am currently working with a data stream that updates every 30 seconds with highway probe data. The database needs to aggregate the incoming data and provide a 15-minute total. The issue I am encountering is trying to sum specific columns while matching on keys.
Current_DataFrame:
uuid  lane-Number  lane-Status  lane-Volume  lane-Speed  lane-Class1Count  lane-Class2Count
   1            1         GOOD           10          55                 5                 5
   1            2         GOOD            5          57                 3                 2
   2            1         GOOD            7          45                 4                 3
New_Dataframe:
uuid  lane-Number  lane-Status  lane-Volume  lane-Speed  lane-Class1Count  lane-Class2Count
   1            1          BAD            7          59                 6                 1
   1            2         GOOD            4          64                 2                 2
   2            1          BAD            5          63                 3                 2
Goal_Dataframe:
uuid  lane-Number  lane-Status  lane-Volume  lane-Speed  lane-Class1Count  lane-Class2Count
   1            1          BAD           17          59                11                 6
   1            2         GOOD            9          64                 5                 4
   2            1          BAD           12          63                 7                 5
The goal is to match the dataframes on the uuid and lane-Number, and then to take the New_Dataframe values for lane-Status and lane-Speed, and then sum the lane-Volume, lane-Class1Count and laneClass2Count together. I want to keep all the new incoming data, unless it is aggregative (i.e. Number of cars passing the road probe) in which case I want to sum it together.
I found a solution after some more digging.
df = pd.concat([new_dataframe, current_dataframe], ignore_index=True)
df = df.groupby(["uuid", "lane-Number"]).agg(
{
"lane-Status": "first",
"lane-Volume": "sum",
"lane-Speed": "first",
"lane-Class1Count": "sum",
"lane-Class2Count": "sum"
})
By concatenating the current_dataframe onto the back of the new_dataframe I can use the first aggregation option to get the newest data, and then sum the necessary rows.
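A self-contained sketch of this pattern, rebuilding the two small frames from the question, would be:

```python
import pandas as pd

current = pd.DataFrame({
    "uuid": [1, 1, 2], "lane-Number": [1, 2, 1],
    "lane-Status": ["GOOD", "GOOD", "GOOD"],
    "lane-Volume": [10, 5, 7], "lane-Speed": [55, 57, 45],
    "lane-Class1Count": [5, 3, 4], "lane-Class2Count": [5, 2, 3],
})
new = pd.DataFrame({
    "uuid": [1, 1, 2], "lane-Number": [1, 2, 1],
    "lane-Status": ["BAD", "GOOD", "BAD"],
    "lane-Volume": [7, 4, 5], "lane-Speed": [59, 64, 63],
    "lane-Class1Count": [6, 2, 3], "lane-Class2Count": [1, 2, 2],
})

# new frame first, so "first" picks the freshest status/speed
df = pd.concat([new, current], ignore_index=True)
goal = df.groupby(["uuid", "lane-Number"], as_index=False).agg({
    "lane-Status": "first",        # keep newest
    "lane-Volume": "sum",          # aggregative: sum
    "lane-Speed": "first",         # keep newest
    "lane-Class1Count": "sum",
    "lane-Class2Count": "sum",
})
```

This reproduces the Goal_Dataframe from the question: for uuid 1, lane 1, the status and speed come from the new data (BAD, 59) while the counts are summed (17, 11, 6).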
This question is related to my previous question. I have the following dataframe:
df =
QUEUE_1  QUEUE_2  DAY  HOUR  TOTAL_SERVICE_TIME  TOTAL_WAIT_TIME  EVAL
ABC123   DEF656     1     7                  20               30     1
ABC123              1     7                  22               32     0
DEF656   ABC123     1     8                  15               12     0
FED456   DEF656     2     8                  15               16     1
I need to get the following dataframe (it's similar to the one I wanted to get in my previous question, but here I need to add 2 additional columns AVG_COUNT_PER_DAY_HOUR and AVG_PERCENT_EVAL_1).
QUEUE   HOUR  AVG_TOT_SERVICE_TIME  AVG_TOT_WAIT_TIME  AVG_COUNT_PER_DAY_HOUR  AVG_PERCENT_EVAL_1
ABC123     7                    21                 31                     1.0                  50
ABC123     8                    15                 12                     0.5                 100
DEF656     7                    20                 30                     0.5                 100
DEF656     8                    15                 14                     1.0                  50
FED456     7                     0                  0                     0.0                   0
FED456     8                    15                 14                     0.5                 100
The column AVG_COUNT_PER_DAY_HOUR should contain the average count of a corresponding HOUR value over days (DAY) grouped by QUEUE. For example, in df, in case of ABC123, the HOUR 7 appears 2 times for the DAY 1 and 0 times for the DAY 2. Therefore the average is 1. The same logic is applied to the HOUR 8. It appears 1 time in DAY 1 and 0 times in DAY 2 for ABC123. Therefore the average is 0.5.
The column AVG_PERCENT_EVAL_1 should contain the percent of EVAL equal to 1 over hours, grouped by QUEUE. For example, in case of ABC123, the EVAL is equal to 1 one time when HOUR is 7. It is also equal to 0 one time when HOUR is 7. So, AVG_PERCENT_EVAL_1 is 50 for ABC123 and hour 7.
I use this approach:
df = pd.lreshape(df, {'QUEUE': df.columns[df.columns.str.startswith('QUEUE')].tolist()})
piv_df = df.pivot_table(index=['QUEUE'], columns=['HOUR'], fill_value=0)
result = piv_df.stack().add_prefix('AVG_').reset_index()
I get stuck with adding the columns AVG_COUNT_PER_DAY_HOUR and AVG_PERCENT_EVAL_1. For instance, to add the column AVG_COUNT_PER_DAY_HOUR I am thinking of using .apply(pd.value_counts, 1).notnull().groupby(level=0).sum().astype(int), while for calculating AVG_PERCENT_EVAL_1 I am thinking of using [df.EVAL==1].agg({'EVAL' : 'count'}). However, I don't know how to incorporate them into my current code in order to get the correct solution.
UPDATE:
Perhaps it is easier to adapt this solution to what I need in this question:
result = pd.lreshape(df, {'QUEUE': ['QUEUE_1','QUEUE_2']})
mux = pd.MultiIndex.from_product([result.QUEUE.dropna().unique(),
                                  result.DAY.dropna().unique(),
                                  result.HOUR.dropna().unique()],
                                 names=['QUEUE', 'DAY', 'HOUR'])
print (result.groupby(['QUEUE','DAY','HOUR'])
.mean()
.reindex(mux, fill_value=0)
.add_prefix('AVG_')
.reset_index())
Steps:
1) To compute AVG_COUNT_PER_DAY_HOUR :
With the help of pd.crosstab(), compute the distinct counts of HOUR w.r.t. DAY (so that we obtain cases for missing days), grouped by QUEUE.
Stack the DF so that HOUR, which was part of a hierarchical column before, now gets positioned as an index, leaving just DAY as columns. We take the mean column-wise after filling NaNs with 0.
2) To compute AVG_PERCENT_EVAL_1:
After getting the pivoted frame (same as before), and since the mean of a binary (1/0) column is simply the fraction of ones, we take EVAL from this DF and multiply its result by 100; the means were computed during pivoting itself (default agg=np.mean).
Finally, we join all these frames.
same as in the linked post:
df = pd.lreshape(df, {'QUEUE': df.columns[df.columns.str.startswith('QUEUE')].tolist()})
piv_df = df.pivot_table(index='QUEUE', columns='HOUR', fill_value=0).stack()
avg_tot = piv_df[['TOTAL_SERVICE_TIME', 'TOTAL_WAIT_TIME']].add_prefix("AVG_")
additional portion:
avg_cnt = pd.crosstab(df['QUEUE'], [df['DAY'], df['HOUR']]).stack().fillna(0).mean(1)
avg_pct = piv_df['EVAL'].mul(100).astype(int)
avg_tot.join(
    avg_cnt.to_frame("AVG_COUNT_PER_DAY_HOUR")
).join(avg_pct.to_frame("AVG_PERCENT_EVAL_1")).reset_index()
avg_cnt looks like:
QUEUE HOUR
ABC123 7 1.0
8 0.5
DEF656 7 0.5
8 1.0
FED456 7 0.0
8 0.5
dtype: float64
avg_pct looks like:
QUEUE HOUR
ABC123 7 50
8 0
DEF656 7 100
8 50
FED456 7 0
8 100
Name: EVAL, dtype: int32