Resampling DataFrame while keeping my other columns untouched - python

I have a dataset that tells me the frequency of an event for each demographic (for example, the first row says there are 13 white men, 11 years old, in the county Alameda in the year 2006 who experienced an event). Here is the original DataFrame:
Year County age Race Sex freq
0 2006 Alameda 11 1 0 13
1 2006 Alameda 11 1 1 9
2 2006 Alameda 11 2 0 9
3 2006 Alameda 11 2 1 16
4 2006 Alameda 11 3 0 2
Now I want to compute the 2-year average of the "freq" column, by demographic. This is the code I tried and the output:
dc = dc.dropna()
dc['date'] = dc.apply(lambda x: pd.Timestamp('{year}'.format(year=int(x.Year))),
                      axis=1)
dc.set_index('date', inplace=True)
dc = dc.resample('2A', how='mean')
Date age_range Race Sex freq
2006-12-31 14.507095 1.637789 0.489171 10.451830
2008-12-31 14.543697 1.664187 0.493120 10.285980
2010-12-31 14.516471 1.670205 0.489019 10.349927
2012-12-31 14.512953 1.675056 0.486677 10.109178
2014-12-31 14.568190 1.699817 0.485923 10.134186
It's computing the averages for every column, but how do I compute it for just the freq column, by the demographic cuts (like in the original DataFrame)?

Combining groupby(), Grouper(), and transform() should give you the series you want. Your sample data set does not have enough rows to really demonstrate the two-year grouping, but the approach is:
import io
import pandas as pd

dc = pd.read_csv(io.StringIO(""" Year County age Race Sex freq
0 2006 Alameda 11 1 0 13
1 2006 Alameda 11 1 1 9
2 2006 Alameda 11 2 0 9
3 2006 Alameda 11 2 1 16
4 2006 Alameda 11 3 0 2"""), sep=r"\s+")

dc["date"] = pd.to_datetime(dc["Year"], format="%Y")
dc["freq_2y"] = dc.groupby(["County", "age", "Race", "Sex",
                            pd.Grouper(key="date", freq="2A", origin="start")])["freq"].transform("mean")
   Year   County  age  Race  Sex  freq                 date  freq_2y
0  2006  Alameda   11     1    0    13  2006-01-01 00:00:00       13
1  2006  Alameda   11     1    1     9  2006-01-01 00:00:00        9
2  2006  Alameda   11     2    0     9  2006-01-01 00:00:00        9
3  2006  Alameda   11     2    1    16  2006-01-01 00:00:00       16
4  2006  Alameda   11     3    0     2  2006-01-01 00:00:00        2
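The Grouper above bins on calendar year-ends; as a hedged alternative sketch (an assumption about what "2-year average" should mean, not the answer's method), the per-demographic mean over fixed two-year buckets can also be computed by bucketing the integer Year directly. The 2007 rows below are invented purely so a bucket actually averages two values:
import io
import pandas as pd

# hypothetical sample: the 2007 rows are invented for illustration
dc = pd.read_csv(io.StringIO("""Year County age Race Sex freq
2006 Alameda 11 1 0 13
2007 Alameda 11 1 0 17
2006 Alameda 11 1 1 9
2007 Alameda 11 1 1 11"""), sep=r"\s+")

# bucket 2006-2007 together, 2008-2009 together, and so on
dc["year_bucket"] = (dc["Year"] // 2) * 2
dc["freq_2y"] = dc.groupby(["County", "age", "Race", "Sex", "year_bucket"])["freq"].transform("mean")
print(dc)
# freq_2y is 15.0 for the (Race=1, Sex=0) rows and 10.0 for the (Race=1, Sex=1) rows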

Related

random sampling of the data in python

I have a dataframe with several columns and I need to re-sample from that data with more weight given to one category. I think np.random.choice should work, but I am not sure how to implement it. Below is the example data from which I want to sample randomly, with a 70% probability of getting an expensive home (based on the Expensive_home column, value = 1) and a 30% probability for Expensive_home = 0. How can I create the re-sampled data file? Thank you!
ID Lot_Area Year_Built Full_Bath Bedroom Sale_Price Expensive_home
1 31770 1960 1 3 215000 0
2 11622 1961 1 2 105000 0
3 5389 1995 2 2 236500 0
4 8402 1998 2 3 180400 0
5 10176 1990 1 2 171500 0
6 6820 1985 1 1 212000 0
7 53504 2003 3 4 538000 1
8 12134 1988 2 4 164000 0
9 11394 2010 1 1 394432 1
10 19138 1951 1 2 141000 0
11 13175 1978 2 3 210000 0
12 11751 1977 2 3 190000 0
13 10625 1974 2 3 170000 0
14 7500 2000 2 3 216000 0
15 11241 1970 1 2 149000 0
16 2280 1978 2 3 146000 0
17 12858 2009 2 3 376162 1
18 12883 2009 2 3 290941 0
19 12182 2005 2 3 220000 0
20 11520 2005 2 3 275000 0
The desired output is a similar data file, but with more randomly picked 1s in the last column.
To create a dataframe of the same length but allowing expensive to have a higher chance of being selected and allowing replacements, use:
weights = df['Expensive_home'].replace({0: 30, 1: 70})
df1 = df.sample(len(df), replace=True, weights=weights)
To create a dataframe with all expensive and then 30% of non-expensive, you can do:
expensive = df['Expensive_home'].astype(bool)
df2 = pd.concat([df[expensive], df[~expensive].sample(frac=0.3)])
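A hedged caveat: because weights is applied per row, the share of expensive homes in df1 depends on how many rows each class has, not just on the 70/30 ratio. If the goal is an overall 70/30 class mix, one sketch (my own variation, not the answer above) normalizes the per-row weights by class size:
# scale the weights so each class contributes 70% / 30% of the total weight
class_counts = df['Expensive_home'].value_counts()
weights = df['Expensive_home'].map({0: 30, 1: 70}) / df['Expensive_home'].map(class_counts)
df3 = df.sample(len(df), replace=True, weights=weights)
print(df3['Expensive_home'].mean())  # should hover around 0.7 across runs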

How to calculate a groupby mean and variance in a pandas DataFrame?

I have a DataFrame and I want to calculate the mean and the variance for each row for each person. Moreover, there is a column date and the chronological order must be respected when calculating the mean and the variance; the dataframe is already sorted by date. The dates are just the number of days after the earliest date. The mean for a person's earliest date is simply the value in the column Points, and the variance should be NaN or 0. Then, for the second date, the mean should be the mean of the Points values for this date and the previous one. Here is my code to generate the dataframe:
import pandas as pd
import numpy as np
data=[["Al",0, 12],["Bob",2, 10],["Carl",5, 12],["Al",5, 5],["Bob",9, 2]
,["Al",22, 4],["Bob",22, 16],["Carl",33, 2],["Al",45, 7],["Bob",68, 4]
,["Al",72, 11],["Bob",79, 5]]
df= pd.DataFrame(data, columns=["Name", "Date", "Points"])
print(df)
Name Date Points
0 Al 0 12
1 Bob 2 10
2 Carl 5 12
3 Al 5 5
4 Bob 9 2
5 Al 22 4
6 Bob 22 16
7 Carl 33 2
8 Al 45 7
9 Bob 68 4
10 Al 72 11
11 Bob 79 5
Here is my code to obtain the mean and the variance:
df['Mean'] = df.apply(
    lambda x: df[(df.Name == x.Name) & (df.Date < x.Date)].Points.mean(),
    axis=1)
df['Variance'] = df.apply(
    lambda x: df[(df.Name == x.Name) & (df.Date < x.Date)].Points.var(),
    axis=1)
However, the mean is shifted by one row and the variance by two rows. The dataframe obtained when sorted by Name and Date is:
Name Date Points Mean Variance
0 Al 0 12 NaN NaN
3 Al 5 5 12.000000 NaN
5 Al 22 4 8.50000 24.500000
8 Al 45 7 7.000000 19.000000
10 Al 72 11 7.000000 12.666667
1 Bob 2 10 NaN NaN
4 Bob 9 2 10.000000 NaN
6 Bob 22 16 6.000000 32.000000
9 Bob 68 4 9.333333 49.333333
11 Bob 79 5 8.000000 40.000000
2 Carl 5 12 NaN NaN
7 Carl 33 2 12.000000 NaN
Instead, the dataframe should be as below:
Name Date Points Mean Variance
0 Al 0 12 12 NaN
3 Al 5 5 8.5 24.5
5 Al 22 4 7 19
8 Al 45 7 7 12.67
10 Al 72 11 7.8 ...
1 Bob 2 10 10 NaN
4 Bob 9 2 6 32
6 Bob 22 16 9.33 49.33
9 Bob 68 4 8 40
11 Bob 79 5 7.4 ...
2 Carl 5 12 12 NaN
7 Carl 33 2 7 50
What should I change?
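A hedged sketch of one possible fix, continuing from the df built above: the strict inequality Date < x.Date excludes the current row, which produces exactly the one-row shift described, so using <= (or, equivalently, an expanding window per Name, since the frame is already sorted by Date) gives the expected result:
# include the current row in the window by using <= instead of <
df['Mean'] = df.apply(
    lambda x: df[(df.Name == x.Name) & (df.Date <= x.Date)].Points.mean(),
    axis=1)
df['Variance'] = df.apply(
    lambda x: df[(df.Name == x.Name) & (df.Date <= x.Date)].Points.var(),
    axis=1)

# equivalent and faster: expanding mean/variance per person
g = df.groupby('Name')['Points']
df['Mean'] = g.transform(lambda s: s.expanding().mean())
df['Variance'] = g.transform(lambda s: s.expanding().var())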

How to calculate Quarterly difference and add missing Quarterly with count in python pandas

I have a data frame like the one below, and I have to find the missing quarterly values for each Id and the count of missing quarters between the existing ones. The data frame is:
year Data Id
2019Q4 57170 A
2019Q3 55150 A
2019Q2 51109 A
2019Q1 51109 A
2018Q1 57170 B
2018Q4 55150 B
2017Q4 51109 C
2017Q2 51109 C
2017Q1 51109 C
The expected output, with the start and end of each missing range and the count of missing quarters, is:
Id  start_year  end_year  count
B 2018Q2 2018Q3 2
B 2017Q3 2018Q3 1
How can I achieve this using Python pandas?
Use:
#changed data for more general solution - multiple missing years per groups
print (df)
year Data Id
0 2015 57170 A
1 2016 55150 A
2 2019 51109 A
3 2023 51109 A
4 2000 47740 B
5 2002 44563 B
6 2003 43643 C
7 2004 42050 C
8 2007 37312 C
#add missing rows for absent years by reindexing each group
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)))
         .reset_index(name='val'))
print (df1)
Id year val
0 A 2015 A
1 A 2016 A
2 A 2017 NaN
3 A 2018 NaN
4 A 2019 A
5 A 2020 NaN
6 A 2021 NaN
7 A 2022 NaN
8 A 2023 A
9 B 2000 B
10 B 2001 NaN
11 B 2002 B
12 C 2003 C
13 C 2004 C
14 C 2005 NaN
15 C 2006 NaN
16 C 2007 C
#boolean mask marking rows that are not NaN, kept in a variable for reuse
m = df1['val'].notnull().rename('g')
#cumulative sum of the mask creates a unique group id for each run of consecutive NaNs
df1.index = m.cumsum()
#keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
       .agg(['first', 'last', 'size'])
       .reset_index(level=1, drop=True)
       .reset_index())
print (df2)
Id first last size
0 A 2017 2018 2
1 A 2020 2022 3
2 B 2001 2001 1
3 C 2005 2006 2
EDIT:
#convert to datetimes
df['year'] = pd.to_datetime(df['year'], format='%Y%m')
#upsample each group to month starts with asfreq, leaving NaN for the missing months
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .resample('MS')
         .asfreq()
         .rename('val')
         .reset_index())
print (df1)
Id year val
0 A 2015-05-01 A
1 A 2015-06-01 NaN
2 A 2015-07-01 A
3 A 2015-08-01 NaN
4 A 2015-09-01 A
5 B 2000-01-01 B
6 B 2000-02-01 NaN
7 B 2000-03-01 B
8 C 2003-01-01 C
9 C 2003-02-01 C
10 C 2003-03-01 NaN
11 C 2003-04-01 NaN
12 C 2003-05-01 C
m = df1['val'].notnull().rename('g')
#cumulative sum of the mask creates a unique group id for each run of consecutive NaNs
df1.index = m.cumsum()
#keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
       .agg(['first', 'last', 'size'])
       .reset_index(level=1, drop=True)
       .reset_index())
print (df2)
Id first last size
0 A 2015-06-01 2015-06-01 1
1 A 2015-08-01 2015-08-01 1
2 B 2000-02-01 2000-02-01 1
3 C 2003-03-01 2003-04-01 2
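The question's data uses quarter labels such as 2019Q4 rather than plain years or months; a hedged sketch of the same gap-finding idea working directly on quarterly periods (built from the question's own rows, column names as above) might look like:
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""year Data Id
2019Q4 57170 A
2019Q3 55150 A
2019Q2 51109 A
2019Q1 51109 A
2018Q1 57170 B
2018Q4 55150 B
2017Q4 51109 C
2017Q2 51109 C
2017Q1 51109 C"""), sep=r"\s+")

# quarter strings such as "2019Q4" parse directly into quarterly periods
df['year'] = pd.PeriodIndex(df['year'], freq='Q')

rows = []
for gid, g in df.groupby('Id'):
    full = pd.period_range(g['year'].min(), g['year'].max(), freq='Q')
    missing = full.difference(pd.PeriodIndex(g['year']))
    if len(missing) == 0:
        continue
    # consecutive quarters have consecutive ordinals, so a new run starts where the step exceeds 1
    ords = np.array([p.ordinal for p in missing])
    run_id = np.cumsum(np.r_[True, np.diff(ords) > 1])
    for _, run in pd.Series(missing, index=run_id).groupby(level=0):
        rows.append({'Id': gid, 'start': run.iloc[0], 'end': run.iloc[-1], 'count': len(run)})

print(pd.DataFrame(rows))
#   Id   start     end  count
# 0  B  2018Q2  2018Q3      2
# 1  C  2017Q3  2017Q3      1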

How to update a column using another column in pandas

I'm trying to create a dataframe that keeps track of the number of public schools opened between 2010-2016.
StatusType County 2010 ...2016 OpenYear ClosedYear
1 Closed Alameda 0 0 2005 2015.0
2 Active Alameda 0 0 2006 NaN
3 Closed Alameda 0 0 2008 2015.0
4 Active Alameda 0 0 2011 NaN
5 Active Alameda 0 0 2011 NaN
6 Active Alameda 0 0 2012 NaN
7 Closed Alameda 0 0 1980 1989.0
8 Active Alameda 0 0 1980 NaN
9 Active Alameda 0 0 1980 NaN
I want to update the 2010-2016 columns to keep track of the number of schools open per year. For example, the first school in the dataframe opens in 2005 and closes in 2015. The code should check the "ClosedYear" column and add 1 to every year column earlier than 2015 (2010, 2011, ..., 2014). If the "ClosedYear" column shows "NaN", then starting at the year in the "OpenYear" column, add 1 to every year column greater than or equal to "OpenYear" (e.g. for school #4, columns 2011 through 2016 get +1 and column 2010 is unchanged).
I was thinking about using "apply" to apply a function to the dataframe. But that might not be the most efficient way to solve the problem. Need help figuring out how to make this work! Thanks!
Extra Step:
After finishing the counts, I want to group the year columns by county. I'm leaning towards using "groupby" with a sum to total the open-school counts per county per year. If someone could include that with the answer to the question above, it would be very helpful.
Expected Output:
StatusType County 2010 ...2016 OpenYear ClosedYear
1 Closed Alameda 1 0 2005 2015.0
2 Active Alameda 1 1 2006 NaN
3 Closed Alameda 1 0 2008 2015.0
4 Active Alameda 0 1 2011 NaN
5 Active Alameda 0 1 2011 NaN
6 Active Alameda 0 1 2012 NaN
7 Closed Alameda 0 0 1980 1989.0
8 Active Alameda 1 1 1980 NaN
9 Active Alameda 1 1 1980 NaN
I feel like there should be a way to do this without using a for loop, but I cannot think of it at the moment, so here's my solution:
# Read example data
import pandas as pd
from io import StringIO  # this import only works on Python 3+

df = pd.read_fwf(StringIO(
"""StatusType  County   OpenYear  ClosedYear
Closed      Alameda  2005      2015.0
Active      Alameda  2006      NaN
Closed      Alameda  2008      2015.0
Active      Alameda  2011      NaN
Active      Alameda  2011      NaN
Active      Alameda  2012      NaN
Closed      Alameda  1980      1989.0
Active      Alameda  1980      NaN
Active      Alameda  1980      NaN"""))

# For each year
for year in range(2010, 2016 + 1):
    # Create a column of 0s
    df[str(year)] = 0
    # Where the year is between OpenYear and ClosedYear (or ClosedYear is NaN) set it to 1
    df.loc[(df['OpenYear'] <= year) &
           (pd.isna(df['ClosedYear']) | (df['ClosedYear'] >= year)), str(year)] = 1

print(df.to_string())
Output:
StatusType County OpenYear ClosedYear 2010 2011 2012 2013 2014 2015 2016
0 Closed Alameda 2005 2015.0 1 1 1 1 1 1 0
1 Active Alameda 2006 NaN 1 1 1 1 1 1 1
2 Closed Alameda 2008 2015.0 1 1 1 1 1 1 0
3 Active Alameda 2011 NaN 0 1 1 1 1 1 1
4 Active Alameda 2011 NaN 0 1 1 1 1 1 1
5 Active Alameda 2012 NaN 0 0 1 1 1 1 1
6 Closed Alameda 1980 1989.0 0 0 0 0 0 0 0
7 Active Alameda 1980 NaN 1 1 1 1 1 1 1
8 Active Alameda 1980 NaN 1 1 1 1 1 1 1
(PS: I'm not quite sure what you were trying to do with the groupby)
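For the extra step in the question (per-county totals), a hedged sketch building on the 0/1 year columns created above:
# sum the 0/1 indicator columns to get open-school counts per county per year
year_cols = [str(y) for y in range(2010, 2016 + 1)]
print(df.groupby('County')[year_cols].sum())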
Unless there is really a need to create those intermediate columns, you can get the counts directly with a groupby and .size. Depending on whether you want to include the closing year, change the inequalities from <= to <. If you also want to group by county, you can do that in the same step.
Here's the starting df
StatusType County OpenYear ClosedYear
1 Closed Alameda 2005 2015.0
2 Active Alameda 2006 NaN
3 Closed Alameda 2008 2015.0
4 Active Alameda 2011 NaN
5 Active Alameda 2011 NaN
6 Active Alameda 2012 NaN
7 Closed Alameda 1980 1989.0
8 Active Alameda 1980 NaN
9 Active Alameda 1980 NaN
import pandas as pd

year_list = [2010, 2011, 2012, 2013, 2014, 2015, 2016]
df_list = []
for year in year_list:
    group = ((df.ClosedYear.isnull()) | (df.ClosedYear >= year)) & (df.OpenYear <= year)
    n_schools = df.groupby([group, df.County]).size()[True]
    df_list.append(pd.DataFrame({'n_schools': n_schools, 'year': year}))
ndf = pd.concat(df_list)
# n_schools year
#County
#Alameda 5 2010
#Alameda 7 2011
#Alameda 8 2012
#Alameda 8 2013
#Alameda 8 2014
#Alameda 8 2015
#Alameda 6 2016
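If a wide layout (one row per county and one column per year) is preferred, the stacked result above can be pivoted; a small sketch:
# reshape the stacked per-year counts into one column per year
wide = ndf.reset_index().pivot(index='County', columns='year', values='n_schools')
print(wide)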

Create variable with multiple return of numpy where

Hi, I am a Stata user and now I am trying to port my Stata code to Python/pandas. In this case I want to create a new variable size that assigns the value 1 if the number of jobs is between 1 and 9, the value 2 if jobs is between 10 and 49, 3 between 50 and 199, and 4 for 200 or more jobs.
Afterwards, if possible, I would like to label them (1: 'Micro', 2: 'Small', 3: 'Median', 4: 'Big').
id year entry cohort jobs
1 2009 0 NaN 3
1 2012 1 2012 3
1 2013 0 2012 4
1 2014 0 2012 11
2 2010 1 2010 11
2 2011 0 2010 12
2 2012 0 2010 13
3 2007 0 NaN 38
3 2008 0 NaN 58
3 2012 1 2012 58
3 2013 0 2012 70
4 2007 0 NaN 231
4 2008 0 NaN 241
I tried this code but couldn't succeed:
df['size'] = np.where((1 <= df['jobs'] <= 9), 'Micro',
             np.where((10 <= df['jobs'] <= 49), 'Small'),
             np.where((50 <= df['jobs'] <= 200), 'Median'),
             np.where((200 <= df['empleo']), 'Big', 'NaN'))
What you are trying to do is called binning; use pd.cut, i.e.
df['new'] = pd.cut(df['jobs'],bins=[1,10,50,201,np.inf],labels=['micro','small','medium','big'])
Output:
id year entry cohort jobs new
0 1 2009 0 NaN 3 micro
1 1 2012 1 2012.0 3 micro
2 1 2013 0 2012.0 4 micro
3 1 2014 0 2012.0 11 small
4 2 2010 1 2010.0 11 small
5 2 2011 0 2010.0 12 small
6 2 2012 0 2010.0 13 small
7 3 2007 0 NaN 38 small
8 3 2008 0 NaN 58 medium
9 3 2012 1 2012.0 58 medium
10 3 2013 0 2012.0 70 medium
11 4 2007 0 NaN 231 big
12 4 2008 0 NaN 241 big
For multiple conditions you have to go for np.select, not np.where. Hope that helps.
numpy.select(condlist, choicelist, default=0)
Here condlist is the list of your conditions and choicelist is the list of choices when the corresponding condition is met. The default is 0; here you can set it to np.nan.
Using np.select to do the same with the help of .between, i.e.:
np.select([df['jobs'].between(1, 10),
           df['jobs'].between(10, 50),
           df['jobs'].between(50, 200),
           df['jobs'].between(200, np.inf)],
          ['Micro', 'Small', 'Median', 'Big'],
          'NaN')
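Since the question also asked for numeric codes 1-4 alongside the labels, here is a hedged sketch of one way to get both with pd.cut (my own combination; the small frame is built from a few of the question's rows):
import numpy as np
import pandas as pd

# small sample based on the question's data
df = pd.DataFrame({'id':   [1, 1, 1, 2, 3, 4],
                   'jobs': [3, 4, 11, 13, 58, 231]})

# bins follow the question: 1-9, 10-49, 50-199, 200+ (assuming jobs >= 1)
bins = [0, 9, 49, 199, np.inf]
labels = ['Micro', 'Small', 'Median', 'Big']
df['size_label'] = pd.cut(df['jobs'], bins=bins, labels=labels)
df['size'] = df['size_label'].cat.codes + 1   # Micro -> 1, Small -> 2, Median -> 3, Big -> 4
print(df)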
