I have a dataframe with a bunch of columns labelled in 'YYYY-MM' format, along with several other columns. I need to collapse the date columns into calendar quarters and take the mean. I was able to do it manually, but there are a few hundred date columns in my real data and I'd rather not map every one of them by hand. I'm generating the initial df from a CSV; I didn't see anything in read_csv that looked like it would help, but if there's something I can leverage there, that would be great. I found Series.dt.to_period("Q"), which converts datetime values to quarters, but I'm not sure how to apply it here, if I can at all.
Here's a sample df (code below):
foo bar 2016-04 2016-05 2016-06 2016-07 2016-08
0 6 5 3 3 5 8 1
1 9 3 6 9 9 7 8
2 8 5 8 1 9 9 4
3 5 8 1 2 3 5 6
4 4 5 1 2 7 2 6
This code will do what I'm looking for, but I had to generate mapping by hand:
mapping = {'2016-04':'2016q2', '2016-05':'2016q2', '2016-06':'2016q2', '2016-07':'2016q3', '2016-08':'2016q3'}
df = df.set_index(['foo', 'bar']).groupby(mapping, axis=1).mean().reset_index()
New df:
foo bar 2016q2 2016q3
0 6 5 3.666667 4.5
1 9 3 8.000000 7.5
2 8 5 6.000000 6.5
3 5 8 2.000000 5.5
4 4 5 3.333333 4.0
Code to generate the initial df:
df = pd.DataFrame(np.random.randint(1, 11, size=(5, 7)), columns=('foo', 'bar', '2016-04', '2016-05', '2016-06', '2016-07', '2016-08'))
Use a callable that gets applied to the index values. Use axis=1 to apply it to the column values instead.
(df.set_index(['foo', 'bar'])
.groupby(lambda x: pd.Period(x, 'Q'), axis=1)
.mean().reset_index())
foo bar 2016Q2 2016Q3
0 6 5 3.666667 4.5
1 9 3 8.000000 7.5
2 8 5 6.000000 6.5
3 5 8 2.000000 5.5
4 4 5 3.333333 4.0
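On newer pandas versions, groupby(..., axis=1) is deprecated (since pandas 2.1). A minimal sketch of the same idea that avoids axis=1 by transposing first is shown here; the values are copied from the sample df in the question, and the variable names are only illustrative:

```python
import pandas as pd

# Sample values copied from the question's printed df.
df = pd.DataFrame({
    'foo': [6, 9, 8, 5, 4],
    'bar': [5, 3, 5, 8, 5],
    '2016-04': [3, 6, 8, 1, 1],
    '2016-05': [3, 9, 1, 2, 2],
    '2016-06': [5, 9, 9, 3, 7],
    '2016-07': [8, 7, 9, 5, 2],
    '2016-08': [1, 8, 4, 6, 6],
})

# Keep only the 'YYYY-MM' columns as data columns.
months = df.set_index(['foo', 'bar'])

# Map each month label to its calendar quarter.
quarters = pd.PeriodIndex(months.columns, freq='Q')

# Transpose so the months become rows, group the rows by quarter, transpose back.
result = months.T.groupby(quarters).mean().T

# Turn the Period labels into plain strings before restoring foo/bar as columns.
result.columns = result.columns.astype(str)
result = result.reset_index()
print(result)
```

The transpose costs an extra copy, so for very wide or very long frames it may be worth comparing against the other approaches.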
The solution is quite short:
Start by copying the "monthly" columns to another DataFrame and converting
the column names to a PeriodIndex:
df2 = df.iloc[:, 2:]
df2.columns = pd.PeriodIndex(df2.columns, freq='M')
Then, to get the result, resample columns by quarter,
compute the mean (for each quarter) and join with 2 "initial" columns:
df.iloc[:, :2].join(df2.resample('Q', axis=1).agg('mean'))
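The question asks how .dt.to_period("Q") could be applied. One possible route, sketched here under the assumption that (foo, bar) uniquely identifies a row, is to reshape to long format, derive the quarter from the month label, and pivot back. The names `long_df`, `wide`, `month`, `value` and `quarter` are illustrative only:

```python
import pandas as pd

# First two rows of the question's sample df.
df = pd.DataFrame({
    'foo': [6, 9], 'bar': [5, 3],
    '2016-04': [3, 6], '2016-05': [3, 9], '2016-06': [5, 9],
    '2016-07': [8, 7], '2016-08': [1, 8],
})

# One row per (foo, bar, month) combination.
long_df = df.melt(id_vars=['foo', 'bar'], var_name='month', value_name='value')

# Parse the 'YYYY-MM' labels and convert them to quarter labels such as '2016Q2'.
long_df['quarter'] = (
    pd.to_datetime(long_df['month'], format='%Y-%m').dt.to_period('Q').astype(str)
)

# Average within each quarter and go back to one row per (foo, bar).
# Assumption: (foo, bar) pairs are unique; duplicates would be averaged together.
wide = long_df.pivot_table(
    index=['foo', 'bar'], columns='quarter', values='value', aggfunc='mean'
).reset_index()
wide.columns.name = None
print(wide)
```

Note that pivot_table sorts the result by the index columns, so the row order can differ from the input.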
data = [[2,2,2,3,3,3],[1,2,2,3,4,5],[1,2,2,3,4,5],[1,2,2,3,4,5],[1,2,2,3,4,5],[1,2,2,3,4,5]]
df = pd.DataFrame(data, columns = ['A','1996-04','1996-05','2000-07','2000-08','2010-10'])
# separate the date columns
df3 = df.iloc[:, 1:]
# separate other columns
df2 = df.iloc[:,0]
#apply groupby using period index
df3=df3.groupby(pd.PeriodIndex(df3.columns, freq='Q'), axis=1).mean()
final_df = pd.concat([df3,df2], axis=1)
print(final_df)
I have a dataset, df, where I would like to:
filter the values in the 'id' column, group these values, take their average, and then sum these values
id use free total G_Used G_Free G_Total
a 5 5 10 4 1 5
b 14 6 20 5 1 6
a 10 5 15 9 1 10
c 6 4 10 10 10 20
b 10 5 15 5 5 10
b 5 5 10 1 4 5
c 4 1 5 3 1 4
Desired Output
use free total
12.5 7.5 20
filter only values that contain 'a' or 'c'
group by each id
take the mean of the 'use', 'free' and 'total' columns
sum these values
Intermediate steps:
keep only the a and c values
id use free total G_Used G_Free G_Total
a 5 5 10 4 1 5
a 10 5 15 9 1 10
c 6 4 10 10 10 20
c 4 1 5 3 1 4
take mean of a
a
use free total
7.5 5 12.5
take mean of c
c
use free total
5 2.5 7.5
sum both a and c values for final desired output
use free total
12.5 7.5 20
This is what I am doing, however the syntax is not correct for some of the code. I am still researching. Any suggestion is appreciated
df1 = df[df.id = 'a' | 'b']
df2 = df1.groupby(['id'], as_index=False).agg({'use': 'mean', 'free': 'mean', 'total': 'mean'})
df3= df2.sum(['id'], axis = 0)
First use Series.isin to test membership and filter the rows, then select the columns and take the per-group mean. The output is summed, and the resulting Series is converted to a one-row DataFrame with Series.to_frame and transposed with DataFrame.T:
df1 = df[df.id.isin(['a','c'])]
df2 = df1.groupby('id')[['use','free','total']].mean().sum().to_frame().T
Your solution is similar, only it uses GroupBy.agg:
df1 = df[df.id.isin(['a','c'])]
df2 = df1.groupby('id').agg({'use': 'mean', 'free': 'mean', 'total': 'mean'}).sum().to_frame().T
print (df2)
use free total
0 12.5 7.5 20.0
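As a check, here is a self-contained sketch that rebuilds the question's table (only the columns used) and runs the same steps, keeping the per-id means as an intermediate so they can be inspected. The variable names `means` and `result` are illustrative:

```python
import pandas as pd

# Data copied from the question (G_* columns omitted, they are not used).
df = pd.DataFrame({
    'id':    ['a', 'b', 'a', 'c', 'b', 'b', 'c'],
    'use':   [5, 14, 10, 6, 10, 5, 4],
    'free':  [5, 6, 5, 4, 5, 5, 1],
    'total': [10, 20, 15, 10, 15, 10, 5],
})

# Step 1: keep only the rows whose id is 'a' or 'c'.
subset = df[df['id'].isin(['a', 'c'])]

# Step 2: mean of each column per id.
means = subset.groupby('id')[['use', 'free', 'total']].mean()
print(means)

# Step 3: sum the per-id means and present them as a single row.
result = means.sum().to_frame().T
print(result)
```

The per-id means come out as 7.5, 5.0, 12.5 for a and 5.0, 2.5, 7.5 for c (the mean of use for c is (6 + 4) / 2 = 5), which sum to 12.5, 7.5, 20.0.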
I have this data:
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_tuples(list(zip(*[['one', 'one', 'two', 'two'],['foo', 'bar', 'foo', 'bar']])))
df = pd.DataFrame(np.arange(12).reshape((3,4)), columns=index)
one two
foo bar foo bar
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Is there a way to do simple vectorized calculations (like addition) for each level 0 group columns on each of the level 1 columns without having to reference the specific column level pairs like:
df[('one','add')] = df[('one','foo')]+df[('one','bar')]
I'd like to get
one two
foo bar add foo bar add
0 0 1 1 2 3 5
1 4 5 9 6 7 13
2 8 9 17 10 11 21
I fiddled around with it for a bit and here is a one-liner that solves the problem in my opinion. It's fully vectorized and doesn't address specific column names. It also puts the add column in the right place.
df.stack(0).assign(add=df.stack(0).sum(axis=1)).stack(0).unstack(0).T
Unfortunately, because of the property of stack / unstack to do the stacking / unstacking into the innermost level, it needs the cryptic .stack(0).unstack(0) operation. It seems like those two operations should cancel each other out, but they actually shuffle the index levels while preserving order.
Here is the same thing split into 3 lines without assign statement.
df = df.stack(0)
df['add'] = df.sum(axis=1)
df = df.stack(0).unstack(0).T
Use pandas.DataFrame.sum with axis=1 and level=0:
df2 = df.sum(axis=1, level=0)
print(df2)
Output:
one two
0 1 5
1 9 13
2 17 21
You can then rename the new columns and attach them with pandas.concat:
df2.columns = [(c, "add") for c in df2]
df2 = pd.concat([df, df2], axis=1).sort_index(axis=1)
print(df2)
Output:
one two
add bar foo add bar foo
0 1 1 0 5 3 2
1 9 5 4 13 7 6
2 17 9 8 21 11 10
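The level argument of DataFrame.sum was removed in pandas 2.0, so the call above only works on older versions. A minimal sketch of the same steps that should also run on newer versions groups the transposed frame by the outer column level instead; the names `totals` and `combined` are illustrative:

```python
import numpy as np
import pandas as pd

# Same frame as in the question.
columns = pd.MultiIndex.from_tuples(
    [('one', 'foo'), ('one', 'bar'), ('two', 'foo'), ('two', 'bar')]
)
df = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=columns)

# Sum the inner columns within each outer group: transpose, group the rows
# by the outer level, sum, and transpose back.
totals = df.T.groupby(level=0).sum().T

# Label the sums as ('one', 'add') and ('two', 'add').
totals.columns = pd.MultiIndex.from_product([totals.columns, ['add']])

# Attach the sums and sort so each 'add' sits with its group.
combined = pd.concat([df, totals], axis=1).sort_index(axis=1)
print(combined)
```

As in the answer above, sorting puts the inner columns in alphabetical order (add, bar, foo) rather than the order asked for in the question.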
An alternative that uses the same sum but avoids pd.concat:
df[("one", "add")] = None
df[("two", "add")] = None
df.iloc[:, -2:] = df.sum(axis=1, level=0).to_numpy()
df.sort_index(axis=1)
one two
add bar foo add bar foo
0 1.0 1 0 5.0 3 2
1 9.0 5 4 13.0 7 6
2 17.0 9 8 21.0 11 10
Actual dataframe consist of more than a million rows.
Say for example a dataframe is:
UniqueID Code Value OtherData
1 A 5 Z01
1 B 6 Z02
1 C 7 Z03
2 A 10 Z11
2 B 11 Z24
2 C 12 Z23
3 A 10 Z21
4 B 8 Z10
I want to obtain ratio of A/B for each UniqueID and put it in a new dataframe. For example, for UniqueID 1, its ratio of A/B = 5/6.
What is the most efficient way to do this in Python?
Want:
UniqueID RatioAB
1 5/6
2 10/11
3 Inf
4 0
Thank you.
One approach is using pivot_table, aggregating with the sum in the case there are multiple occurrences of the same letters (otherwise a simple pivot will do), and evaluating on columns A and B:
df.pivot_table(index='UniqueID', columns='Code', values='Value', aggfunc='sum').eval('A/B')
UniqueID
1 0.833333
2 0.909091
3 NaN
4 NaN
dtype: float64
If there is at most one occurrence of each letter per group:
df.pivot(index='UniqueID', columns='Code', values='Value').eval('A/B')
UniqueID
1 0.833333
2 0.909091
3 NaN
4 NaN
dtype: float64
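The question's wanted output shows Inf for UniqueID 3 and 0 for UniqueID 4 rather than NaN. One way to get that, sketched here under the assumption that a missing code should count as 0, is to pass fill_value=0 to pivot_table. The names `wide` and `ratio` are illustrative:

```python
import pandas as pd

# Data copied from the question (OtherData omitted, it is not used).
df = pd.DataFrame({
    'UniqueID': [1, 1, 1, 2, 2, 2, 3, 4],
    'Code': list('ABCABCAB'),
    'Value': [5, 6, 7, 10, 11, 12, 10, 8],
})

# Missing (UniqueID, Code) combinations become 0 instead of NaN.
wide = df.pivot_table(
    index='UniqueID', columns='Code', values='Value', aggfunc='sum', fill_value=0
)

# pandas returns inf for x / 0 (x != 0) and NaN for 0 / 0.
ratio = (wide['A'] / wide['B']).rename('RatioAB').reset_index()
print(ratio)
```

An ID that has neither an A nor a B row would give 0 / 0, which is NaN, so that case still needs separate handling if it can occur.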
If you only care about A/B ratio:
df1 = df[df['Code'].isin(['A','B'])][['UniqueID', 'Code', 'Value']]
df1 = df1.pivot(index='UniqueID',
columns='Code',
values='Value')
df1['RatioAB'] = df1['A']/df1['B']
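The most direct way is via groupby. Note that .iloc[0] raises an IndexError for any UniqueID that is missing an A or a B row (IDs 3 and 4 in the sample):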
df.groupby('UniqueID').apply(lambda g: g.query("Code == 'A'")['Value'].iloc[0] / g.query("Code == 'B'")['Value'].iloc[0])
I currently have a dataset with two indexes, year and zip code, but multiple observations (prices) per zip code. How can I get the average price per zip code, so that I only have distinct observations per zip code and year.
Use DataFrame.mean with level parameter:
df = s.mean(level=[0,1])
Sample:
s = pd.DataFrame({
'B':[5,5,4,5,5,4],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
}).set_index(['F','B'])['E']
print (s)
F B
a 5 5
5 3
4 6
b 5 9
5 2
4 4
Name: E, dtype: int64
df = s.mean(level=[0,1]).reset_index()
print (df)
F B E
0 a 5 4.0
1 a 4 6.0
2 b 5 5.5
3 b 4 4.0
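The level argument of mean was removed in pandas 2.0. On newer versions, an equivalent sketch groups by the index levels explicitly; sort=False is used here so the groups appear in order of first occurrence, as in the output above:

```python
import pandas as pd

# Same sample Series as above, indexed by F and B.
s = pd.DataFrame({
    'B': [5, 5, 4, 5, 5, 4],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb'),
}).set_index(['F', 'B'])['E']

# One mean per distinct (F, B) pair.
out = s.groupby(level=[0, 1], sort=False).mean().reset_index()
print(out)
```

The same call works on a DataFrame with a two-level index, in which case every numeric column is averaged.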
I'm confused by the highlighted line. What exactly is this line doing? What does .div do? I tried to look through the documentation, which said
"Floating division of dataframe and other, element-wise (binary operator truediv)"
I'm not exactly sure what this means. Any help would be appreciated!
You can divide one dataframe by another, and pandas will automatically align the index and columns and then divide the matching values, e.g. df1 / df2.
If you divide a dataframe by a series, pandas automatically aligns the series index with the columns of the dataframe. It may be that you want to align the index of the series with the index of the dataframe instead. In that case, you have to use the div method.
So instead of:
df / s
You use
df.div(s, axis=0)
Which says to align the index of s with the index of df then perform the division while broadcasting over the other dimension, in this case columns.
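A small sketch of the two alignments, using made-up values and labels (`by_col`, `by_row` and the frame itself are purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({'x': [2, 4], 'y': [6, 8]}, index=['r1', 'r2'])

# A Series labelled like the columns: df / s matches 'x' with 'x', 'y' with 'y'.
by_col = pd.Series({'x': 2, 'y': 4})
col_result = df / by_col

# A Series labelled like the rows: axis=0 matches 'r1' with 'r1', 'r2' with 'r2'.
by_row = pd.Series({'r1': 2, 'r2': 4})
row_result = df.div(by_row, axis=0)

print(col_result)
print(row_result)
```

In the first case every value in column x is divided by 2 and every value in column y by 4; in the second, every value in row r1 is divided by 2 and every value in row r2 by 4.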
In the example above, pclass_xt is divided along axis 0 by the Series that pclass_xt.sum(1) produces. pclass_xt.sum(1) sums the values along axis=1, which gives, for each pclass, the total of survived and not survived. Then .div(..., axis=0) divides each row of the dataframe by that row's sum, so every row is turned into proportions.
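A minimal sketch of dividing each row by its own total, using a small table with made-up counts (the table `xt` and its numbers are hypothetical stand-ins, not the data from the question):

```python
import pandas as pd

# Hypothetical cross-tabulation: one row per class, one column per outcome.
xt = pd.DataFrame(
    {'died': [1, 3, 6], 'survived': [3, 1, 2]},
    index=pd.Index([1, 2, 3], name='pclass'),
)

# Total per row (sum across the columns).
row_totals = xt.sum(axis=1)

# axis=0 aligns row_totals with the row labels, so each row is divided by its own total.
pct = xt.div(row_totals, axis=0)
print(pct)
```

Each row of `pct` then adds up to 1, which is the usual way to turn counts into per-row proportions.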
import pandas as pd,numpy as np
data={"A":np.arange(10),"B":np.random.randint(1,10,10),"C":np.random.random(10)}
#print(data)
df2=pd.DataFrame(data=data)
print("DataFrame values:\n",df2)
s1=pd.Series(np.arange(1,11))
print("s1 series values:\n",s1)
print("Result of Division:\n",df2.div(s1,axis=0))
# How div works here, row by row (values taken from the output below):
# df row 0 / s1 row 0 -> 0/1, 2/1, 0.265396/1
# df row 1 / s1 row 1 -> 1/2, 2/2, 0.055646/2
#################Output###########################
DataFrame values:
A B C
0 0 2 0.265396
1 1 2 0.055646
2 2 7 0.963006
3 3 9 0.958677
4 4 6 0.256558
5 5 6 0.859066
6 6 8 0.818831
7 7 4 0.656055
8 8 6 0.885797
9 9 4 0.412497
s1 series values:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
dtype: int64
Result of Division:
A B C
0 0.000000 2.000000 0.265396
1 0.500000 1.000000 0.027823
2 0.666667 2.333333 0.321002
3 0.750000 2.250000 0.239669
4 0.800000 1.200000 0.051312
5 0.833333 1.000000 0.143178
6 0.857143 1.142857 0.116976
7 0.875000 0.500000 0.082007
8 0.888889 0.666667 0.098422
9 0.900000 0.400000 0.041250