I found this description of how to resample a multi-index:
Resampling Within a Pandas MultiIndex
However as soon as I use count instead of sum the solution is not working any longer
This might be related to: Resampling with 'how=count' causing problems
Not working count and strings:
values_a =[1]*16
states = ['Georgia']*8 + ['Alabama']*8
#cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
dates = pd.DatetimeIndex([datetime.datetime(2012,1,1)+datetime.timedelta(days = i) for i in range(4)]*4)
df2 = pd.DataFrame(
{'value_a': values_a},
index = [states, dates])
df2.index.names = ['State', 'Date']
df2.reset_index(level=[0], inplace=True)
print(df2.groupby(['State']).resample('W',how='count'))
Yields:
2012-01-01 2012-01-08
State value_a State value_a
State
Alabama 2 2 6 6
Georgia 2 2 6 6
The working version with sum and numbers as values
values_a =[1]*16
states = ['Georgia']*8 + ['Alabama']*8
#cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
dates = pd.DatetimeIndex([datetime.datetime(2012,1,1)+datetime.timedelta(days = i) for i in range(4)]*4)
df2 = pd.DataFrame(
{'value_a': values_a},
index = [states, dates])
df2.index.names = ['State', 'Date']
df2.reset_index(level=[0], inplace=True)
print(df2.groupby(['State']).resample('W',how='sum'))
Yields (notice no duplication of 'State'):
value_a
State Date
Alabama 2012-01-01 2
2012-01-08 6
Georgia 2012-01-01 2
2012-01-08 6
When using count, state isn't a nuisance column (it can count strings) so the resample is going to apply count to it (although the output is not what I would expect). You could do something like (tell it only to apply count to value_a),
>>> print df2.groupby(['State']).resample('W',how={'value_a':'count'})
value_a
State Date
Alabama 2012-01-01 2
2012-01-08 6
Georgia 2012-01-01 2
2012-01-08 6
Or more generally, you can apply different kinds of how to different columns:
>>> print df2.groupby(['State']).resample('W',how={'value_a':'count','State':'last'})
State value_a
State Date
Alabama 2012-01-01 Alabama 2
2012-01-08 Alabama 6
Georgia 2012-01-01 Georgia 2
2012-01-08 Georgia 6
So while the above allows you to count a resampled multi-index dataframe it doesn't explain the behavior of output fromhow='count'. The following is closer to the way I would expect it to behave:
print df2.groupby(['State']).resample('W',how={'value_a':'count','State':'count'})
State value_a
State Date
Alabama 2012-01-01 2 2
2012-01-08 6 6
Georgia 2012-01-01 2 2
2012-01-08 6 6
#Karl D soln is correct; this will be possible in 0.14/master (releasing shortly), see docs here
In [118]: df2.groupby([pd.Grouper(level='Date',freq='W'),'State']).count()
Out[118]:
value_a
Date State
2012-01-01 Alabama 2
Georgia 2
2012-01-08 Alabama 6
Georgia 6
Prior to 0.14 it was difficult to groupby / resample with a time based grouper and another grouper. pd.Grouper allows a very flexible specification to do this.
Related
I have a countrydf as below, in which each cell in the country column contains a list of the countries where the movie was released.
countrydf
id Country release_year
s1 [US] 2020
s2 [South Africa] 2021
s3 NaN 2021
s4 NaN 2021
s5 [India] 2021
I want to make a new df which look like this:
country_yeardf
Year US UK Japan India
1925 NaN NaN NaN NaN
1926 NaN NaN NaN NaN
1927 NaN NaN NaN NaN
1928 NaN NaN NaN NaN
It has the release year and the number of movies released in each country.
My solution is that: with a blank df like the second one, run a for loop to count the number of movies released and then modify the value in the cell relatively.
countrylist=['Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', ….]
for x in countrylist:
for j in list(range(0,8807)):
if x in countrydf.country[j]:
t=int (countrydf.release_year[j] )
country_yeardf.at[t, x] = country_yeardf.at[t, x]+1
an error occurred which read:
TypeError Traceback (most recent call last)
<ipython-input-25-225281f8759a> in <module>()
1 for x in countrylist:
2 for j in li:
----> 3 if x in countrydf.country[j]:
4 t=int(countrydf.release_year[j])
5 country_yeardf.at[t, x] = country_yeardf.at[t, x]+1
TypeError: argument of type 'float' is not iterable
I don’t know which one is of float type here, I have check the type of countrydf.country[j] and it returned int.
I was using pandas and I am just getting started with it. Can anyone please explain the error and suggest a solution for a df that I want to create?
P/s: my English is not so good so hop you guys understand.
Here is a solution using groupby
df = pd.DataFrame([['US', 2015], ['India', 2015], ['US', 2015], ['Russia', 2016]], columns=['country', 'year'])
country year
0 US 2015
1 India 2015
2 US 2015
3 Russia 2016
Now just groupby country and year and unstack the output:
df.groupby(['year', 'country']).size().unstack()
country India Russia US
year
2015 1.0 NaN 2.0
2016 NaN 1.0 NaN
Some alternative ways to achieve this in pandas without loops.
If the Country Column have more than 1 value in the list in each row, you can try the below:
>>df['Country'].str.join("|").str.get_dummies().groupby(df['release_year']).sum()
India South Africa US
release_year
2020 0 0 1
2021 1 1 0
Else if Country has just 1 value per row in the list as you have shown in the example, you can use crosstab
>>pd.crosstab(df['release_year'],df['Country'].str[0])
Country India South Africa US
release_year
2020 0 0 1
2021 1 1 0
I have a Pandas dataframe as follows
df = pd.DataFrame([['John', '1/1/2017','10'],
['John', '2/2/2017','15'],
['John', '2/2/2017','20'],
['John', '3/3/2017','30'],
['Sue', '1/1/2017','10'],
['Sue', '2/2/2017','15'],
['Sue', '3/2/2017','20'],
['Sue', '3/3/2017','7'],
['Sue', '4/4/2017','20']
],
columns=['Customer', 'Deposit_Date','DPD'])
. What is the best way to calculate the PreviousMean column in the screen shot below?
The column is the year to date average of DPD for that customer. I.e. Includes all DPDs up to but not including rows that match the current deposit date. If no previous records existed then it's null or 0.
Screenshot:
Notes:
the data is grouped by Customer Name and expanding over Deposit Dates
within each group, the expanding mean is calculated using only values from the previous rows.
at the start of each new customer the mean is 0 or alternatively null as there are no previous records on which to form the mean
the data frame is ordered by Customer Name and Deposit_Date
instead of grouping & expanding the mean, filter the dataframe on the conditions, and calculate the mean of DPD:
Customer == current row's Customer
Deposit_Date < current row's Deposit_Date
Use df.apply to perform this operation for all row in the dataframe:
df['PreviousMean'] = df.apply(
lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
axis=1)
outputs:
Customer Deposit_Date DPD PreviousMean
0 John 2017-01-01 10 NaN
1 John 2017-02-02 15 10.0
2 John 2017-02-02 20 10.0
3 John 2017-03-03 30 15.0
4 Sue 2017-01-01 10 NaN
5 Sue 2017-02-02 15 10.0
6 Sue 2017-03-02 20 12.5
7 Sue 2017-03-03 7 15.0
8 Sue 2017-04-04 20 13.0
Here's one way to exclude repeated days from mean calculation:
# create helper series which is NaN for repeated days, DPD otherwise
s = df.groupby(['Customer Name', 'Deposit_Date']).cumcount() == 1
df['DPD2'] = np.where(s, np.nan, df['DPD'])
# apply pd.expanding_mean
df['CumMean'] = df.groupby(['Customer Name'])['DPD2'].apply(lambda x: pd.expanding_mean(x))
# drop helper series
df = df.drop('DPD2', 1)
print(df)
Customer Name Deposit_Date DPD CumMean
0 John 01/01/2017 10 10.0
1 John 01/01/2017 10 10.0
2 John 02/02/2017 20 15.0
3 John 03/03/2017 30 20.0
4 Sue 01/01/2017 10 10.0
5 Sue 01/01/2017 10 10.0
6 Sue 02/02/2017 20 15.0
7 Sue 03/03/2017 30 20.0
Ok here is the best solution I've come up with thus far.
The trick is to first create an aggregated table at the customer & deposit date level containing a shifted mean. To calculate this mean you have to calculate the sum and the count first.
s=df.groupby(['Customer Name','Deposit_Date'],as_index=False)[['DPD']].agg(['count','sum'])
s.columns = [' '.join(col) for col in s.columns]
s.reset_index(inplace=True)
s['DPD_CumSum']=s.groupby(['Customer Name'])[['DPD sum']].cumsum()
s['DPD_CumCount']=s.groupby(['Customer Name'])[['DPD count']].cumsum()
s['DPD_CumMean']=s['DPD_CumSum']/ s['DPD_CumCount']
s['DPD_PrevMean']=s.groupby(['Customer Name'])['DPD_CumMean'].shift(1)
df=df.merge(s[['Customer Name','Deposit_Date','DPD_PrevMean']],how='left',on=['Customer Name','Deposit_Date'])
I have some time series data that is mostly quarterly, but reported in year-month-day format for multiple variables and multiple countries, however some variables for some dates have are posted the last day of the quarter and others might post on close to the last day. I would like to perform a resample that aggregates each row to end of quarter of frequency. I have this:
Date Country Var1 Var2 Var3
2012-03-30 China 12 Nan 200
2012-03-31 China Nan 50 Nan
2012-06-28 China 13 Nan 199
2012-06-30 China Nan 48 Nan
2012-09-30 China 13 49 200
2012-12-31 China 12 50 201
What I want to see is
Date Country Var1 Var2 Var3
2012-03-31 China 12 50 200
2012-06-30 China 13 48 199
2012-09-30 China 13 49 200
2012-12-31 China 12 50 201
I tried a couple of different resample ideas. First I tried
df=df.groupby("Country").resample('Q').applymap(lambda x: df.shift(1) if math.isnan(x) else x)
Then I tried converting all the Nans to zeros then aggregating by sum, but this is not ideal since I cannot keep track of which data actually are zero and which data were missing.
df=df.fillna(0)
df=df.groupby("Country").resample('Q').sum()
Here's a small example with my own dataframe doing what you want.
# creating the dataframe
df = pd.DataFrame(np.random.randn(8, 3), columns=['Var1', 'Var2', 'Var3'])
# adding NaN values
df.iloc[1]['Var1'] = np.nan
df.iloc[5]['Var1'] = np.nan
df.iloc[4]['Var2'] = np.nan
df.iloc[6]['Var2'] = np.nan
df
'''
Var1 Var2 Var3
0 -0.437551 -2.707623 0.726240
1 NaN 2.529733 0.484732
2 0.199278 -0.316516 -0.655426
3 0.732910 -0.638045 -0.706436
4 0.877915 NaN -1.141384
5 NaN -2.050228 2.091994
6 -1.119849 NaN 1.222602
7 0.406632 -2.255687 0.742452
'''
# backfilling values in Var2
df['Var2'] = df['Var2'].fillna(method='backfill').dropna()
# dropping NaN rows based on column Var1
df.dropna()
df
'''
Var1 Var2 Var3
0 -0.437551 -2.707623 0.726240
2 0.199278 -0.316516 -0.655426
3 0.732910 -0.638045 -0.706436
4 0.877915 -2.050228 -1.141384
6 -1.119849 -2.255687 1.222602
7 0.406632 -2.255687 0.742452
'''
I have a pivot table and I want to plot the values for the 12 months of each year for each town.
2010-01 2010-02 2010-03
City RegionName
Atlanta Downtown NaN NaN NaN
Midtown 194.263702 196.319964 197.946962
Alexandria Alexandria NaN NaN NaN
West
Landmark- NaN NaN NaN
Van Dom
How can I select only the values for each region of each town? I thought maybe it would be better to change the column names with years and months to datetime format and set them as index. How can I do this?
The result must be:
City RegionName
2010-01 Atlanta Downtown NaN
Midtown 194.263702
Alexandria Alexandria NaN
West
Landmark- NaN
Van Dom
Here's some similar dummy data to play with:
idx = pd.MultiIndex.from_arrays([['A','A', 'B','C','C'],
['A1','A2','B1','C1','C2']], names=['City','Region'])
idcol = pd.date_range('2012-01', freq='M', periods=12)
df = pd.DataFrame(np.random.rand(5,12), index=idx, columns=[t.strftime('%Y-%m') for t in idcol])
Let's see what we've got:
print(df.ix[:,:3])
2012-01 2012-02 2012-03
City Region
A A1 0.513709 0.941354 0.133290
A2 0.734199 0.005218 0.068914
B B1 0.043178 0.124049 0.603469
C C1 0.721248 0.483388 0.044008
C2 0.784137 0.864326 0.450250
Let's convert these to a datetime: df.columns = pd.to_datetime(df.columns)
Now to plot you just need to transpose:
df.T.plot()
Update after your updated your question:
Use stack, and then reorder if you want:
df = df.stack().reorder_levels([2,0,1])
df.head()
City Region
2012-01-01 A A1 0.513709
2012-02-01 A A1 0.941354
2012-03-01 A A1 0.133290
2012-04-01 A A1 0.324518
2012-05-01 A A1 0.554125
I want to create a dataframe from an aggregate function. I thought that it would create by default a dataframe as this solution states, but it creates a series and I don't know why (Converting a Pandas GroupBy object to DataFrame).
The dataframe is from Kaggle's San Francisco Salaries. My code:
df=pd.read_csv('Salaries.csv')
in: type(df)
out: pandas.core.frame.DataFrame
in: df.head()
out: EmployeeName JobTitle TotalPay TotalPayBenefits Year Status 2BasePay 2OvertimePay 2OtherPay 2Benefits 2Year
0 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 567595.43 567595.43 2011 NaN 167411.18 0.00 400184.25 NaN 2011-01-01
1 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 538909.28 538909.28 2011 NaN 155966.02 245131.88 137811.38 NaN 2011-01-01
2 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 335279.91 335279.91 2011 NaN 212739.13 106088.18 16452.60 NaN 2011-01-01
3 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 332343.61 332343.61 2011 NaN 77916.00 56120.71 198306.90 NaN 2011-01-01
4 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 326373.19 326373.19 2011 NaN 134401.60 9737.00 182234.59 NaN 2011-01-01
in: df2=df.groupby(['JobTitle'])['TotalPay'].mean()
type(df2)
out: pandas.core.series.Series
I want df2 to be a dataframe with the columns 'JobTitle' and 'TotalPlay'
Breaking down your code:
df2 = df.groupby(['JobTitle'])['TotalPay'].mean()
The groupby is fine. It's the ['TotalPay'] that is the misstep. That is telling the groupby to only execute the the mean function on the pd.Series df['TotalPay'] for each group defined in ['JobTitle']. Instead, you want to refer to this column with [['TotalPay']]. Notice the double brackets. Those double brackets say pd.DataFrame.
Recap
df2 = df2=df.groupby(['JobTitle'])[['TotalPay']].mean()