I am trying to obtain a rolling sum on a data frame after multiple levels of grouping:
import pandas as pd
import numpy as np
year_vec = np.arange(2000, 2005)
month_vec = np.arange(1, 4)
soln_list = []
firmList = [61, 62, 63]
firmId = []
year_month = []
year = []
month = []
for firmIndex in range(len(firmList)):
    for yearIndex in range(len(year_vec)):
        for monthIndex in range(len(month_vec)):
            soln_list.append("soln_%s_%s_%s" % (firmList[firmIndex], year_vec[yearIndex], month_vec[monthIndex]))
            firmId.append(firmList[firmIndex])
            month.append(month_vec[monthIndex])
            year.append(year_vec[yearIndex])
            year_month.append("%s_%s" % (year_vec[yearIndex], month_vec[monthIndex]))
df = pd.DataFrame({'firmId': firmId, 'year': year, 'month': month, 'year_month' : year_month,
'soln_vars': soln_list})
df = df.set_index(["firmId", "year_month"])
The resulting data frame looks as follows:
month soln_vars year
firmId year_month
61 2000_1 1 soln_61_2000_1 2000
2000_2 2 soln_61_2000_2 2000
2000_3 3 soln_61_2000_3 2000
2001_1 1 soln_61_2001_1 2001
2001_2 2 soln_61_2001_2 2001
2001_3 3 soln_61_2001_3 2001
2002_1 1 soln_61_2002_1 2002
... ... ...
At this point I want a rolling sum of soln_vars over a two-year window for each firm, aggregating over the months. To do so, I first group by firmId and year and then sum:
df = df.groupby([df.index.get_level_values(0), "year"])["soln_vars"].sum()
This operation gives me the sum of soln_vars over every year for each firm:
firmId year
61 2000 soln_61_2000_1soln_61_2000_2soln_61_2000_3
2001 soln_61_2001_1soln_61_2001_2soln_61_2001_3
2002 soln_61_2002_1soln_61_2002_2soln_61_2002_3
2003 soln_61_2003_1soln_61_2003_2soln_61_2003_3
2004 soln_61_2004_1soln_61_2004_2soln_61_2004_3
62 2000 soln_62_2000_1soln_62_2000_2soln_62_2000_3
2001 soln_62_2001_1soln_62_2001_2soln_62_2001_3
... ...
In my application the solution variables are provided by another library and are mathematical expressions, e.g. soln_61_2000_1 + soln_61_2000_2 + soln_61_2000_3; I am using strings here for simplicity.
Then grouping by firmId and applying a rolling sum:
df = df.groupby(level=0, group_keys=False).rolling(2).sum()
does not change df. Any help clarifying this is appreciated.
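For reference, a minimal sketch with hypothetical numeric data where the same groupby + rolling pattern does work; rolling sums are numeric-only in pandas, which is likely why the object (string) series above is not summed as hoped. The second half emulates a window-2 "sum" for objects that support +, using a group-wise shift:

import pandas as pd

idx = pd.MultiIndex.from_product(
    [[61, 62], [2000, 2001, 2002]], names=["firmId", "year"])

# Numeric values: groupby + rolling behaves as expected.
nums = pd.Series([1, 2, 3, 4, 5, 6], index=idx)
print(nums.groupby(level="firmId").rolling(2).sum())

# Object values (strings standing in for expressions): emulate a
# window-2 rolling "sum" by adding the group-wise shifted series.
# The first entry of each group comes out as NaN, as with rolling(2).
objs = pd.Series(list("abcdef"), index=idx)
print(objs.groupby(level="firmId").shift(1) + objs)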
I have a complex situation: I need to append rows holding an aggregate sum of population, based on several columns. See below. Please keep in mind that the real data has many more rows in every column, e.g. "year" covers the range 2019-2040 and "con" has multiple countries.
import pandas as pd
d = { 'year': [2019,2020,2019,2020,2019,2020], 'age': [10,10,20,20,30,30], 'con': ['UK','UK','UK','US','US','US'],'population': [1,2,300,400,1000,2000]}
df = pd.DataFrame(data=d)
df
year age con population
2019 10 UK 1
2020 10 UK 2
2019 20 UK 300
2020 20 US 400
2019 30 US 1000
2020 30 US 2000
output required:
year age con population
2019 10 UK 1
2020 10 UK 2
2019 10 UK 300
2020 20 US 400
2019 20 US 1000
2020 20 US 2000
2019 10-20 UK child 301 #addition of row 1 + row 3
2020 10-20 UK child 402 #addition of 1+2
2019 20-30 UK teen 1000+ age30 population
I am looking for a loop so that I can apply this per value of the con column. This is what I tried, which failed:
variable_list = ['UK', 'US']
ranges = [[0,10], [10,20], [20,30]]
categories = ["Child", "teen", "work"]
year = [x for x in range(2019,2022)]
q = df  # df.loc[(df["Kategorie 1"].str.strip()==BASE)]
q["age2"] = pd.to_numeric(q["age"])
sums_years = {}
for variable in variable_list:
    c = 0
    u = q.loc[q["cat2"]==variable]
    for r in ranges:
        cat = "Germany: " + categories[c]
        for year in date:
            group = str(r[0])+'-'+str(r[1])
            n = variable + "_" + group
            if n not in sums_years:
                sums_years[n] = {}
            s = u.loc[(u['year']==year) & (u["age"]>=r[0]) & (u["age"]<=r[1]), 'population'].sum()
And similarly, for a single condition:
df_uk = df[df.con=='UK'].reset_index(drop=True)
div =['child','teen','working']
c = [div[i] for i in range(len(df_uk))] #list to get element from div
y = [i+2018 for i in range(1,len(df_uk)+1)] #list of 2019,2020,2021
x = [[[0,10], [10,20], [20,30]] for i in range(1,len(df_uk)+1)]
d={'year':y, 'age':x, 'con':c, 'population': (df_uk['value'] + #adds_something).values}
df_new = pd.DataFrame(data=d)
df = pd.concat([df, df_new], ignore_index=True)
Sorry if it's a mess. I have asked around but got no help. I am sure there is an easier and better approach than a loop. Is there a better way to melt the dataframe and do all the calculation, or to restructure the dataframe?
import numpy as np
import pandas as pd

d = { 'year': [2019,2020,2021,2020,2019,2021],
'age': [10,20,30,10,20,30],
'con': ['UK','UK','UK','US','US','US'],
'population': [1,2,300,400,1000,2000]}
df = pd.DataFrame(data=d)
df2 = df.copy()
criteria = [df2['age'].between(0, 10),
df2['age'].between(11, 20),
df2['age'].between(21, 30)]
values = ['child', 'teen', 'work']
df2['con'] = df2['con']+'_'+np.select(criteria, values, 0)
df2['population'] = df.groupby(['con', 'age']).sum()\
.groupby(level=0).cumsum()\
.reset_index()['population']
final = pd.concat([df, df2])
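For reference, with the sample d above, df2 should come out roughly as follows (cumulative population within each country, with the age category appended to con):

   year  age       con  population
0  2019   10  UK_child           1
1  2020   20   UK_teen           3
2  2021   30   UK_work         303
3  2020   10  US_child         400
4  2019   20   US_teen        1400
5  2021   30   US_work        3400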
I'm trying to calculate a weighted average for multiple columns in a dataframe.
This is a sample of my data
Group  Year  Month  Weight(kg)  Nitrogen  Calcium
A      2020  01     10000       10        70
A      2020  01     15000       4         78
A      2021  05     12000       5         66
A      2021  05     10000       8         54
B      2021  08     14000       10        90
C      2021  08     50000       20        92
C      2021  08     40000       10        95
My desired result would look something like this:
What I've tried:
I can get the correct weighted average values for a single column using this function:
(similar to: link)
def wavg(df, value, weight):
    d = df[value]
    w = df[weight]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()
I can apply this function to a single column of my df:
df2 = df.groupby(["Group", "year", "month"]).apply(wavg, "Calcium", "Weight(kg)").to_frame()
(Don't mind the different values, they are correct for the data in my notebook)
The obvious problem is that this function only works for a single column whilst I have dozens of columns. I therefore tried a for loop:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg)").to_frame())
It calculates the values correctly, but the columns are placed on top of each other instead of next to each other. They also lack a useful column name:
How could I adapt my code to return the desired df?
Change the function to work on multiple columns and, to avoid losing the grouping columns, convert them to a MultiIndex:
def wavg(x, value, weight):
    d = x[value]
    w = x[weight]
    try:
        return d.mul(w, axis=0).div(w.sum())
    except ZeroDivisionError:
        return d.mean()
# columns used for groupby
groups = ["Group", "Year", "Month"]
# process all the other columns
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)
# set the index and process all columns listed in cols
df1 = (df.set_index(groups)
.groupby(level=groups)
.apply(wavg, cols, "Weight(kg)")
.reset_index())
print (df1)
Group Year Month Calcium Nitrogen
0 A 2020 1 28.000000 4.000000
1 A 2020 1 46.800000 2.400000
2 A 2021 5 36.000000 2.727273
3 A 2021 5 24.545455 3.636364
4 B 2021 8 90.000000 10.000000
5 C 2021 8 51.111111 11.111111
6 C 2021 8 42.222222 4.444444
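Note that this wavg returns one weighted contribution per input row, which is why the output keeps two rows per group. If a single weighted-average row per group is wanted instead, a small variation (a sketch; wavg_agg and df3 are hypothetical names) sums within the group before dividing:

def wavg_agg(x, value, weight):
    d = x[value]
    w = x[weight]
    # sum the weighted contributions, then divide by the total weight
    return d.mul(w, axis=0).sum().div(w.sum())

df3 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg_agg, cols, "Weight(kg)")
         .reset_index())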
Try via concat() and reset_index():
df=pd.concat(column_list,axis=1).reset_index()
OR
you can make changes here:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg)").reset_index())
# Finally:
df = pd.concat(column_list, axis=1)
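Alternatively, a fully vectorised sketch (assuming the Group/Year/Month/Weight(kg) columns from the sample data; num, den and out are hypothetical names): multiply the value columns by the weight, group-sum numerator and denominator, and divide:

groups = ["Group", "Year", "Month"]
w = df["Weight(kg)"]
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)

# group-sum of value*weight divided by group-sum of weight
num = df[cols].mul(w, axis=0).groupby([df[g] for g in groups]).sum()
den = w.groupby([df[g] for g in groups]).sum()
out = num.div(den, axis=0).reset_index()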
How to get mean of only positive values after groupby in pandas?
MWE:
import numpy as np
import pandas as pd
flights = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
print(flights.shape)
print(flights.iloc[:2,:4])
print()
not_cancelled = flights.dropna(subset=['dep_delay','arr_delay'])
df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
.mean().reset_index()
)
df['avg_delay2'] = df[df.arr_delay>0]['arr_delay'].mean()
print(df.head())
This gives all avg_delay2 values as 16.66.
(336776, 19)
year month day dep_time
0 2013 1 1 517.0
1 2013 1 1 533.0
year month day arr_delay avg_delay2
0 2013 1 1 12.651023 16.665681
1 2013 1 2 12.692888 16.665681
2 2013 1 3 5.733333 16.665681
3 2013 1 4 -1.932819 16.665681
4 2013 1 5 -1.525802 16.665681
Which is WRONG.
# sanity check
a = not_cancelled.query(""" year==2013 & month ==1 & day ==1 """)['arr_delay']
a = a[a>0]
a.mean() # 32.48156182212581
When I do the same thing in R:
library(nycflights13)
not_cancelled = flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
df = not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    # average delay
    avg_delay1 = mean(arr_delay),
    # average positive delay
    avg_delay2 = mean(arr_delay[arr_delay > 0]))
head(df)
It gives correct output for avg_delay2.
year month day avg_delay1 avg_delay2
2013 1 1 12.651023 32.48156
2013 1 2 12.692888 32.02991
2013 1 3 5.733333 27.66087
2013 1 4 -1.932819 28.30976
2013 1 5 -1.525802 22.55882
2013 1 6 4.236429 24.37270
How to do this in Pandas?
I would filter for the positive values before the groupby:
df = (not_cancelled[not_cancelled.arr_delay >0].groupby(['year','month','day'])['arr_delay']
.mean().reset_index()
)
df.head()
because, as in your code, df is a separate dataframe after the groupby operation has completed, and
df['avg_delay2'] = df[df.arr_delay>0]['arr_delay'].mean()
assigns the same scalar (the overall mean of the positive delays) to every row of df['avg_delay2'].
Edit: Similar to R, you can do both in one shot using agg:
def mean_pos(x):
    return x[x > 0].mean()
df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
.agg({'arr_delay': 'mean', 'arr_delay_2': mean_pos})
)
df.head()
Note that from pandas 0.23, using a dictionary in groupby agg is deprecated and will be removed in a future version, so we cannot use that method.
Warning
df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
.agg({'arr_delay': 'mean', 'arr_delay_2': mean_pos})
)
FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version.
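For reference, a sketch of the non-deprecated spelling using named aggregation (available from pandas 0.25 onwards), reusing the mean_pos helper above:

df = (not_cancelled.groupby(['year', 'month', 'day'])['arr_delay']
        .agg(avg_delay1='mean', avg_delay2=mean_pos)
        .reset_index())
df.head()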
So, to tackle that problem in this specific case, I came up with another idea.
Create a new column making all non-positive values nans, then do the usual groupby.
import numpy as np
import pandas as pd
# read data
flights = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
# select flights that are not cancelled
df = flights.dropna(subset=['dep_delay','arr_delay'])
# create new column to fill non-positive with nans
df['arr_delay_pos'] = df['arr_delay']
df.loc[df.arr_delay_pos <= 0,'arr_delay_pos'] = np.nan
df.groupby(['year','month','day'])[['arr_delay','arr_delay_pos']].mean().reset_index().head()
It gives:
year month day arr_delay arr_delay_pos
0 2013 1 1 12.651023 32.481562
1 2013 1 2 12.692888 32.029907
2 2013 1 3 5.733333 27.660870
3 2013 1 4 -1.932819 28.309764
4 2013 1 5 -1.525802 22.558824
Sanity check
# sanity check
a = not_cancelled.query(""" year==2013 & month ==1 & day ==1 """)['arr_delay']
a = a[a>0]
a.mean() # 32.48156182212581
I have a df with around 100,000 rows and 1,000 columns and need to make some adjustments based on the existing data. How do I best approach this? Most of the changes will follow this basic formula:
search a column (or two or three) to see if a condition is met
if met, change the values of dozens or hundreds of columns in that row
This is my best attempt, where I created a list of the columns and was looking to see whether the first column contained the value 1. Where it did, I wanted to just add some number. That part worked, but it only worked on the FIRST row, not on all the 1s in the column. To fix that, I think I need a second index [i] that goes through all the rows, but I wasn't sure if I was approaching the entire problem incorrectly. FWIW, test_cols is a list of columns and testing_2 is my df.
def try_this(test_cols):
    for i in range(len(test_cols)):
        if i == 0 and testing_2[test_cols[i]][i] == 1:
            testing_2[test_cols[i]][i] = testing_2[test_cols[i]][i] + 78787
        i += 1
    return test_cols
Edit/example:
Year Month Mean_Temp
City
Madrid 1999 Jan 7--this value should appear twice
Bilbao 1999 Jan 9--appear twice
Madrid 1999 Feb 9
Bilbao 1999 Feb 10
. . . .
. . . .
. . . .
Madrid 2000 Jan 6.8--this value should go away
Bilbao 2000 Jan 9.2--gone
So I would need to do something like (using your answer):
def alter(row):
    if row['Year'] == 2000 and row['Month'] == 'Jan':
        row['Mean_Temp'] = row['Mean_Temp']  # from year 1999!
        return row['Mean_Temp']
    else:
        return row['Mean_Temp']
One way you could do this is by creating a function and applying it. Suppose you want column 'c' to become 'b' scaled by a factor of 10 whenever the corresponding row in 'a' or 'b' holds an even number.
import pandas as pd
data = {'a':[1,2,3,4],'b':[3,6,8,12], 'c':[1,2,3,4]}
df = pd.DataFrame(data)
def alter(row):
    if row['a'] % 2 == 0 or row['b'] % 2 == 0:
        return row['b'] * 10
    else:
        return row['b']
df['c'] = df.apply(alter, axis=1)
would create a df that looks like,
a b c
0 1 3 3
1 2 6 60
2 3 8 80
3 4 12 120
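As a side note, with 100,000 rows and hundreds of affected columns a per-row apply can get slow; a vectorised sketch of the same rule using a boolean mask (same hypothetical data) may scale better:

import pandas as pd

data = {'a': [1, 2, 3, 4], 'b': [3, 6, 8, 12], 'c': [1, 2, 3, 4]}
df = pd.DataFrame(data)

# rows where 'a' or 'b' is even get c = b * 10, the rest keep c = b
mask = (df['a'] % 2 == 0) | (df['b'] % 2 == 0)
df['c'] = df['b'].mask(mask, df['b'] * 10)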
Edit to add:
If you want to apply values from other parts of the df you could put those in a dict and then pass that into your apply function.
import pandas as pd
data = {'Cities':['Madrid', 'Bilbao'] * 3, 'Year':[1999] * 4 + [2000] * 2,
'Month':['Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Jan'],
'Mean_Temp':[7, 9, 9, 10, 6.8, 9.2]}
df = pd.DataFrame(data)
df = df[['Cities', 'Year', 'Month', 'Mean_Temp']]
# create dictionary with the values from 1999
edf = df[df.Year == 1999]
keys = zip(edf.Cities, edf.Month)
values = edf.Mean_Temp
dictionary = dict(zip(keys, values))
def alter(row, dictionary):
    if row['Year'] == 2000 and row['Month'] == 'Jan':
        return dictionary[(row.Cities, row.Month)]
    else:
        return row['Mean_Temp']
df['Mean_Temp'] = df.apply(alter, args = (dictionary,), axis=1)
Which gives you a df that looks like,
Cities Year Month Mean_Temp
0 Madrid 1999 Jan 7
1 Bilbao 1999 Jan 9
2 Madrid 1999 Feb 9
3 Bilbao 1999 Feb 10
4 Madrid 2000 Jan 7
5 Bilbao 2000 Jan 9
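A variation on the same idea, sketched with hypothetical names (ref, out, jan_2000): build the 1999 values as a small lookup frame and merge instead of carrying a dict:

# look up each (city, month) pair's 1999 temperature via a merge
ref = (df.loc[df.Year == 1999, ['Cities', 'Month', 'Mean_Temp']]
         .rename(columns={'Mean_Temp': 'Temp_1999'}))
out = df.merge(ref, on=['Cities', 'Month'], how='left')
jan_2000 = (out.Year == 2000) & (out.Month == 'Jan')
out.loc[jan_2000, 'Mean_Temp'] = out.loc[jan_2000, 'Temp_1999']
out = out.drop(columns='Temp_1999')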
Of course you can change the parameters however you like. Hope this helps.
In the dataframe below (small snippet shown; the actual dataframe spans 2000 to 2014), I want to compute the annual average, but starting in September of one year and going only until May of the next year.
Cnt Year JD Min_Temp
S 2000 1 277.139
S 2000 2 274.725
S 2001 1 270.945
S 2001 2 271.505
N 2000 1 257.709
N 2000 2 254.533
N 2000 3 258.472
N 2001 1 255.763
I can compute annual average (Jan - Dec) using this code:
df['Min_Temp'].groupby(df['Year']).mean()
How do I adapt this code to average from September of one year to May of the next?
--EDIT: Based on comments below, you can assume that a MONTH column is also available, specifying the month for each row
Not sure which column refers to the month or if it is missing, but in the past I've used a quick and dirty method to assign custom seasons (interested if anyone has found a more elegant route).
I've used Yahoo Finance data to demonstrate the approach, unless one of your columns is Month?
EDIT: Requires the dataframe to be sorted by date, ascending.
import pandas as pd
import datetime
# pandas.io.data has been removed from pandas; the reader now lives
# in the separate pandas-datareader package
from pandas_datareader import data as web

start = datetime.datetime(2010, 9, 1)
end = datetime.datetime(2015, 5, 31)
df = web.DataReader("F", 'yahoo', start, end)
# Ensure date sorted -- required
df = df.sort_index()
# identify the custom season and set months June-August to null
count = 0
season = 1
for i, row in df.iterrows():
    if i.month in [9, 10, 11, 12, 1, 2, 3, 4, 5]:
        if count == 1:
            season += 1
        # set_value() was removed in recent pandas; use .loc instead
        df.loc[i, 'season'] = season
        count = 0
    else:
        count = 1
        df.loc[i, 'season'] = None
# new data frame excluding months June-August
df_data = df[~df['season'].isnull()]
df_data['Adj Close'].groupby(df_data.season).mean()
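For comparison, a more vectorised sketch that uses the Year and Month columns directly (hypothetical column names, following the question's edit): roll September-December forward into the next "season year", drop June-August, then group:

# season year: Sept-Dec of year Y count toward the season ending in Y+1
in_season = df['Month'].isin([9, 10, 11, 12, 1, 2, 3, 4, 5])
season_year = df['Year'] + (df['Month'] >= 9).astype(int)
print(df.loc[in_season, 'Min_Temp'].groupby(season_year[in_season]).mean())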