I am trying to run a balance check on a Pandas DataFrame using an OLS with entity fixed effects. An example DataFrame is below:
county     year  treatment_vs_control  age  gender
Jefferson  2022  1                     24   M
Jackson    2022  1                     31   M
Jefferson  2022  0                     28   F
Jackson    2022  1                     24   null
Adams      2022  0                     72   F
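For reference, PanelOLS here is linearmodels' PanelOLS, which expects an (entity, time) MultiIndex, so the DataFrame is indexed first — a minimal setup sketch:
import pandas as pd
from linearmodels.panel import PanelOLS

# PanelOLS requires an (entity, time) MultiIndex on the data
df = df.set_index(["county", "year"])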
First I try to run the model with the gender field as-is.
model_as_is = PanelOLS.from_formula(
    formula="treatment_vs_control ~ age + gender + EntityEffects",
    data=df
).fit()
model_as_is.summary
I get an F statistic of ~3.05 with a p-value of 0.0001.
Then I try to run the model with one-hot encoded gender dummy columns. The DataFrame now looks like this:
county     year  treatment_vs_control  age  gender_m  gender_f
Jefferson  2022  1                     24   1         0
Jackson    2022  1                     31   1         0
Jefferson  2022  0                     28   0         1
Jackson    2022  1                     24   0         0
Adams      2022  0                     72   0         1
My model now looks like:
model_dummy = PanelOLS(
    dependent=df["treatment_vs_control"],
    exog=df[["age", "gender_m", "gender_f"]],
    entity_effects=True,
    time_effects=False,
).fit()
model_dummy.summary
My F statistic is now ~2.61 with a p-value of 0.0002.
If I instead keep a single gender column but make it numeric rather than string-typed, I get yet a third set of statistics.
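A minimal sketch of that third attempt (the 0/1 mapping and the column name gender_numeric are illustrative, not from my actual code):
# hypothetical numeric encoding of the original string column: M -> 1, F -> 0, null -> NaN
df["gender_numeric"] = df["gender"].map({"M": 1, "F": 0})

model_numeric = PanelOLS(
    dependent=df["treatment_vs_control"],
    exog=df[["age", "gender_numeric"]],
    entity_effects=True,
).fit()
model_numeric.summary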
Why might this happen?
I'd like to compare the difference in data frames. xyz has all of the same columns as abc, but it has an additional column.
In the comparison, I'd like to match up the two like columns (Sport) but only show the SportLeague in the output (if a difference exists, that is). For example, instead of showing 'Soccer' as a difference, show 'Soccer:MLS', which is the adjacent column in xyz.
Here's the code that builds the two data frames:
import pandas as pd
import numpy as np
abc = {'Sport' : ['Football', 'Basketball', 'Baseball', 'Hockey'], 'Year' : ['2021','2021','2022','2022'], 'ID' : ['1','2','3','4']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
abc
xyz = {'Sport' : ['Football', 'Football', 'Basketball', 'Baseball', 'Hockey', 'Soccer'], 'SportLeague' : ['Football:NFL', 'Football:XFL', 'Basketball:NBA', 'Baseball:MLB', 'Hockey:NHL', 'Soccer:MLS'], 'Year' : ['2022','2019', '2022','2022','2022', '2022'], 'ID' : ['2','0', '3','2','4', '1']}
xyz = pd.DataFrame({k: pd.Series(v) for k, v in xyz.items()})
xyz = xyz.sort_values(by = ['ID'], ascending = True)
xyz
Code already tried:
abc.compare(xyz, align_axis=1, keep_shape=False, keep_equal=False)
Since the data frames don't have the exact same columns, this raises an error.
Example: if xyz['Sport'] does not show up anywhere within abc['Sport'], then show xyz['SportLeague'] as the difference between the data frames.
Further clarification of the logic:
Does abc['Sport'] appear anywhere in xyz['Sport']? If not, indicate "Not Found in xyz data frame". If it does exist, are its corresponding abc['Year'] and abc['ID'] values the same? If not, show "Change from xyz['Year'] and xyz['ID'] to abc['Year'] and abc['ID']".
Does xyz['Sport'] appear anywhere in abc['Sport']? If not, indicate "Remove xyz['SportLeague']".
What I've explained above is similar to the .compare method. However, the data frames in this example may not be the same length and have different amounts of variables.
If I understand you correctly, we basically want to merge both DataFrames, and then apply a number of comparisons between both DataFrames, and add a column that explains the course of action to be taken, given a certain result of a given comparison.
Note: in the example here I have added one sport ('Cricket') to your df abc, to trigger the condition abc['Sport'] does not exist in xyz['Sport'].
abc = {'Sport' : ['Football', 'Basketball', 'Baseball', 'Hockey','Cricket'], 'Year' : ['2021','2021','2022','2022','2022'], 'ID' : ['1','2','3','4','5']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
print(abc)
Sport Year ID
0 Football 2021 1
1 Basketball 2021 2
2 Baseball 2022 3
3 Hockey 2022 4
4 Cricket 2022 5
I've left xyz unaltered. Now, let's merge these two dfs:
df = xyz.merge(abc, on='Sport', how='outer', suffixes=('_xyz','_abc'))
print(df)
Sport SportLeague Year_xyz ID_xyz Year_abc ID_abc
0 Football Football:XFL 2019 0 2021 1
1 Football Football:NFL 2022 2 2021 1
2 Soccer Soccer:MLS 2022 1 NaN NaN
3 Baseball Baseball:MLB 2022 2 2022 3
4 Basketball Basketball:NBA 2022 3 2021 2
5 Hockey Hockey:NHL 2022 4 2022 4
6 Cricket NaN NaN NaN 2022 5
Now, we have a df where we can evaluate your set of conditions using np.select(conditions, choices, default). Like this:
conditions = [
    df.Year_abc.isnull(),
    df.Year_xyz.isnull(),
    (df.Year_xyz != df.Year_abc) & (df.ID_xyz != df.ID_abc),
    df.Year_xyz != df.Year_abc,
    df.ID_xyz != df.ID_abc,
]
choices = [
    'Sport not in abc',
    'Sport not in xyz',
    'Change year and ID to xyz',
    'Change year to xyz',
    'Change ID to xyz',
]
df['action'] = np.select(conditions, choices, default=np.nan)
The result is as below, with a new column action noting which course of action to take. (Note that np.select coerces the np.nan default to the string 'nan' here, as the complete-match row shows.)
Sport SportLeague Year_xyz ID_xyz Year_abc ID_abc \
0 Football Football:XFL 2019 0 2021 1
1 Football Football:NFL 2022 2 2021 1
2 Soccer Soccer:MLS 2022 1 NaN NaN
3 Baseball Baseball:MLB 2022 2 2022 3
4 Basketball Basketball:NBA 2022 3 2021 2
5 Hockey Hockey:NHL 2022 4 2022 4
6 Cricket NaN NaN NaN 2022 5
action
0 Change year and ID to xyz # match, but mismatch year and ID
1 Change year and ID to xyz # match, but mismatch year and ID
2 Sport not in abc # no match: Sport in xyz, but not in abc
3 Change ID to xyz # match, but mismatch ID
4 Change year and ID to xyz # match, but mismatch year and ID
5 nan # complete match: no action needed
6 Sport not in xyz # no match: Sport in abc, but not in xyz
Let me know if this is a correct interpretation of what you are looking to achieve.
I am relatively new to pandas / python.
I have a list of names and dates. I want to group the entries by Name and count the number of Names for 'after 2016' and 'before 2016'. The count should be added to a new column.
My input:
Name Date
Marc 2006
Carl 2003
Carl 2002
Carl 1990
Marc 1999
Max 2016
Max 2014
Marc 2006
Carl 2003
Carl 2002
Carl 2019
Marc 1999
Max 2016
Max 2014
And the output should look like this:
      Before 2016  Count
Marc  1            4
Marc  0            0
Carl  1            5
Carl  0            1
Max   1            2
Max   0            2
So the output should have 2 entries for each Name: one with a count of entries before 2016, and one after. Additionally, a column which just states 1 for before 2016 and 0 for after.
As mentioned before, I am quite a beginner. I was able to count the entries with the condition of the year:
df.groupby('Name')['Date'].apply(lambda x: (x < 2016).sum()).reset_index(name='count')
But honestly, I am not quite sure what to do next. Maybe somebody could point me in the right direction.
You can pass to apply a function which returns a 2x2 dataframe. Something like this:
def counting(x):
    bef = (x < 2016).sum()
    aft = (x >= 2016).sum()
    return pd.DataFrame([[1, bef], [0, aft]], index=[x.name, x.name],
                        columns=["before 2016", "Count"])

ddf = df.groupby('Name')['Date'].apply(counting).reset_index(level=0, drop=True)
ddf is:
before 2016 Count
Carl 1 5
Carl 0 1
Marc 1 4
Marc 0 0
Max 1 2
Max 0 2
You can group by an external series having the same length as the dataframe:
s = df['Date'].lt(2016).astype('int')
s.name = 'Before 2016'
df.groupby(['Name', s]).count()
Result:
Date
Name Before 2016
Carl 0 1
1 5
Marc 1 4
Max 0 2
1 2
lt stands for "less than". Other comparison methods are le (less than or equal), gt (greater than), ge (greater than or equal), and eq (equal).
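For instance, a quick sketch showing how these methods mirror the comparison operators:
import pandas as pd

s = pd.Series([2014, 2016, 2019])
s.lt(2016)  # same as s < 2016  -> True, False, False
s.ge(2016)  # same as s >= 2016 -> False, True, True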
From what I understand you need to populate both 1 and 0 for each name; try pivot_table followed by unstack():
(df.assign(Before=df['Date'].lt(2016).view('i1'))
   .pivot_table('Date', 'Name', 'Before', aggfunc='count', fill_value=0)
   .unstack()
   .sort_index(level=1)
   .reset_index(0, name='Count'))
Before Count
Name
Carl 0 1
Carl 1 5
Marc 0 0
Marc 1 4
Max 0 2
Max 1 2
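A compact alternative sketch using pd.crosstab (my own variant, assuming the same column names as above):
import pandas as pd

# count rows per (Name, before-2016 flag), then reshape to long form
out = (pd.crosstab(df['Name'], df['Date'].lt(2016).astype(int).rename('Before 2016'))
         .stack()
         .rename('Count')
         .reset_index())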
I am working on a dataset with pandas in which maintenance work is done at a location. The maintenance is done at random intervals, sometimes a year apart, and sometimes never. I want to find the years since the last maintenance action at each site, provided an action has been made on that site. There can be more than one action for a site and the occurrences of actions are random. For the years prior to the first action, it is not possible to know the years since the last action, because that information is not in the dataset.
I give only two sites in the following example but in the original dataset, I have thousands of them. My data only covers the years 2014 through 2017.
Action = 0 means no action has been performed that year, Action = 1 means some action has been done. Measurement is a performance reading related to the effect of the action. The action can happen in any year.
Site Year Action Measurement
A 2014 1 100
A 2015 0 150
A 2016 0 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 1 60
B 2017 0 110
Given this dataset, I want to have a dataset like this:
Site Year Action Measurement Years_Since_Last_Action
A 2014 1 100 1
A 2015 0 150 2
A 2016 0 300 3
A 2017 0 80 4
B 2015 1 250 1
B 2016 1 60 1
B 2017 0 110 2
Please observe that the year 2014 is filtered out for Site B, because that year is prior to the first action for that site.
Many thanks in advance!
I wrote the code myself. It is messy but does the job for me. :)
The solution assumes that df_select has an integer index.
df_select = df_select[df_select['Site'].map(df_select.groupby('Site')['Action'].max() == 1)]
pieces = []
gbo = df_select.groupby('Site')
for key, group in gbo:
    indices_with_ones = group[group['Action'] == 1].index
    indices = group.index
    group['Years_since_action'] = 0
    group.loc[indices_with_ones, 'Years_since_action'] = 1
    for idx_with_ones in indices_with_ones.sort_values(ascending=False):
        for idx in indices:
            if group.loc[idx, 'Years_since_action'] == 0:
                if idx > idx_with_ones:
                    group.loc[idx, 'Years_since_action'] = idx - idx_with_ones + 1
    pieces.append(group['Years_since_action'])
# Series.append is deprecated, so the per-group results are collected and concatenated
years_since_action = pd.concat(pieces)
df_final = pd.merge(df_select, pd.DataFrame(years_since_action),
                    how='left', left_index=True, right_index=True)
Here is how I would approach it:
import pandas as pd
from io import StringIO
import numpy as np
s = '''Site Year Action Measurement
A 2014 1 100
A 2015 0 150
A 2016 0 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 1 60
B 2017 0 110
'''
ss = StringIO(s)
df = pd.read_csv(ss, sep=r"\s+")
df_maintain = df[df.Action==1][['Site', 'Year']]
df_maintain.reset_index(drop=True, inplace=True)
df_maintain
def find_last_maintenance(x):
    df_temp = df_maintain[x.Site == df_maintain.Site]
    gap = [0]
    for ind, row in df_temp.iterrows():
        if x.Year >= row['Year']:
            gap.append(x.Year - row['Year'] + 1)
    return gap[-1]
df['Gap'] = df.apply(find_last_maintenance, axis=1)
df = df[df.Gap !=0]
This generates the desired output.
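A more vectorized sketch of the same idea (my own addition; Last_Action is a helper column I introduce, the other names follow the question):
import pandas as pd

df = df.sort_values(['Site', 'Year'])
# carry the year of the most recent action forward within each site
last_action = df['Year'].where(df['Action'] == 1)
df['Last_Action'] = last_action.groupby(df['Site']).ffill()
# rows before a site's first action have no reference point: drop them
df = df.dropna(subset=['Last_Action'])
df['Years_Since_Last_Action'] = (df['Year'] - df['Last_Action'] + 1).astype(int)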
I have a dataframe with records spanning multiple years:
WarName | StartDate | EndDate
---------------------------------------------
'fakewar1' 01-01-1990 02-02-1995
'examplewar' 05-01-1990 03-07-1998
(...)
'examplewar2' 05-07-1999 06-09-2002
I am trying to convert this dataframe to a summary overview of the total wars per year, e.g.:
Year | Number_of_wars
----------------------------
1989 0
1990 2
1991 2
1992 3
1994 2
Usually I would use something like df.groupby('year').count() to get total wars by year, but since I am currently working with ranges instead of set dates, that approach wouldn't work.
I am currently writing a function that generates a list of years, and then for each year in the list checks each row in the dataframe and runs a function that checks if the year is within the date-range of that row (returning True if that is the case).
years = range(1816, 2006)
year_dict = {}
for year in years:
    for index, row in df.iterrows():
        if year_in_range(year, row):
            year_dict[year] = year_dict.get(year, 0) + 1
This works, but it also seems extremely convoluted. So I was wondering, what am I missing? What would be the canonical 'pandas-way' to solve this issue?
Use a comprehension with pd.value_counts:
pd.value_counts([
d.year for s, e in zip(df.StartDate, df.EndDate)
for d in pd.date_range(s, e, freq='Y')
]).sort_index()
1990 2
1991 2
1992 2
1993 2
1994 2
1995 1
1996 1
1997 1
1999 1
2000 1
2001 1
dtype: int64
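Note that freq='Y' generates year-end stamps strictly inside each range, which is why the final partial year of a war (e.g. 1998 for 'examplewar', 2002 for 'feuxwar2') never appears above:
import pandas as pd

# year-end stamps between the two dates; 1998-12-31 falls outside the range
pd.date_range('1990-05-01', '1998-03-07', freq='Y')
# -> 1990-12-31, 1991-12-31, ..., 1997-12-31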
Alternate
from functools import reduce
def r(t):
    return pd.date_range(t.StartDate, t.EndDate, freq='Y')

pd.value_counts(reduce(pd.Index.append, map(r, df.itertuples())).year).sort_index()
Setup
df = pd.DataFrame(dict(
WarName=['fakewar1', 'examplewar', 'feuxwar2'],
StartDate=pd.to_datetime(['01-01-1990', '05-01-1990', '05-07-1999']),
EndDate=pd.to_datetime(['02-02-1995', '03-07-1998', '06-09-2002'])
), columns=['WarName', 'StartDate', 'EndDate'])
df
WarName StartDate EndDate
0 fakewar1 1990-01-01 1995-02-02
1 examplewar 1990-05-01 1998-03-07
2 feuxwar2 1999-05-07 2002-06-09
By using np.unique
x,y = np.unique(sum([list(range(x.year,y.year)) for x,y in zip(df.StartDate,df.EndDate)],[]), return_counts=True)
pd.Series(dict(zip(x,y)))
Out[222]:
1990 2
1991 2
1992 2
1993 2
1994 2
1995 1
1996 1
1997 1
1999 1
2000 1
2001 1
dtype: int64
The other answers with pandas are far preferable, but the native Python answer you showed didn't have to be so convoluted; just instantiate and directly index into an array:
wars = [0] * 191  # max(df['EndDate']).year - min(df['StartDate']).year + 1
yr_offset = 1816  # min(df['StartDate']).year
for _, row in df.iterrows():
    for yr in range(row['StartDate'].year - yr_offset, row['EndDate'].year - yr_offset):  # or maybe (year + 1)
        wars[yr] += 1
I have a dataframe with 2 columns as below:
Index Year Country
0 2015 US
1 2015 US
2 2015 UK
3 2015 Indonesia
4 2015 US
5 2016 India
6 2016 India
7 2016 UK
I want to create a new dataframe containing the maximum count of country in every year.
The new dataframe will contain 3 columns as below:
Index Year Country Count
0 2015 US 3
1 2016 India 2
Is there any function in pandas where this can be done quickly?
One way is to use groupby along with size to find the count in each category, then sort the values and slice by the number of distinct years. You can try the following:
num_year = df['Year'].nunique()
new_df = df.groupby(['Year', 'Country']).size().rename('Count').sort_values(ascending=False).reset_index()[:num_year]
Result:
Year Country Count
0 2015 US 3
1 2016 India 2
Use:
1.
First get the count of each Year and Country pair with groupby and size.
Then get the index of the max value per year with idxmax and select rows with loc:
df = df.groupby(['Year','Country']).size()
df = df.loc[df.groupby(level=0).idxmax()].reset_index(name='Count')
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
2.
Use a custom function with value_counts and head:
df = (df.groupby('Year')['Country']
        .apply(lambda x: x.value_counts().head(1))
        .rename_axis(('Year','Country'))
        .reset_index(name='Count'))
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
Here is a method without groupby:
Count = (pd.Series(list(zip(df2.Year, df2.Country))).value_counts()
           .head(2).reset_index(name='Count'))
Count[['Year', 'Country']] = Count['index'].apply(pd.Series)
Count.drop('index', axis=1)
Out[266]:
Count Year Country
0 3 2015 US
1 2 2016 India
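On newer pandas (1.1+), a compact sketch using DataFrame.value_counts, which already sorts counts in descending order (my addition; ties within a year are broken arbitrarily):
import pandas as pd

out = (df.value_counts(['Year', 'Country'])  # descending counts per pair
         .rename('Count')
         .reset_index()
         .drop_duplicates('Year')            # keep the top pair per year
         .reset_index(drop=True))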