How do I simultaneously group by and take the mean of every n rows? - python

I have a large DataFrame with multiple columns; one holds string IDs and the others are floats.
An example of the DataFrame:
df = pd.DataFrame({'ID': ['Child', 'Child', 'Child', 'Child', 'Baby', 'Baby', 'Baby', 'Baby'],
                   'income': [40000, 50000, 42000, 300, 2000, 4000, 2000, 3000],
                   'Height': [1.3, 1.5, 1.9, 2.0, 2.3, 1.4, 0.9, 0.8]})
What I want is to group by ID and then, within each group, take the average of every n rows across all columns.
Desired output:
steps = 3
df = pd.DataFrame({'ID': ['Child', 'Child', 'Baby', 'Baby'],
                   'income': [44000, 300, 2666.67, 3000],
                   'Height': [1.567, 2.0, 1.533, 0.8],
                   'Values': [3, 1, 3, 1]})
Here the rows are first grouped by ID and the mean is then taken over every 3 rows within the same group. I added the Values column so I can track how many rows went into each row's average.
I have found similar questions but I cannot seem to combine them to solve my problem:
This question gives averages of n rows.
This question covers pd.cut, which I might need as well; I just don't understand how the bins work.
How can I make this happen?

You can use a double groupby:
# set up secondary grouper
group = df.groupby('ID').cumcount().floordiv(steps)
# groupby + agg
(df.groupby(['ID', group], as_index=False, sort=False)
   .agg(**{'income': ('income', 'mean'),
           'Height': ('Height', 'mean'),
           'Values': ('Height', 'count'),
           })
)
output:
      ID        income    Height  Values
0  Child  44000.000000  1.566667       3
1  Child    300.000000  2.000000       1
2   Baby   2666.666667  1.533333       3
3   Baby   3000.000000  0.800000       1
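The trick is the secondary grouper: cumcount() numbers the rows within each ID group, and floordiv(steps) collapses those counts into chunk labels, so rows 0-2 of a group share label 0, rows 3-5 share label 1, and so on. A minimal sketch of the intermediate values on the example df:
group = df.groupby('ID').cumcount().floordiv(steps)
print(group.tolist())
# [0, 0, 0, 1, 0, 0, 0, 1]  -> 'Child' rows 0-2 form chunk 0, row 3 is chunk 1; same for 'Baby'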

Related

How do I rearrange nested Pandas DataFrame columns?

In the DataFrame below, I want to rearrange the nested columns, i.e. to have 'region_sea' appear before 'region_inland'.
df = pd.DataFrame({'state': ['WA', 'CA', 'NY', 'NY', 'CA', 'CA', 'WA'],
                   'region': ['region_sea', 'region_inland', 'region_sea', 'region_inland',
                              'region_sea', 'region_sea', 'region_inland'],
                   'count': [1, 3, 4, 6, 7, 8, 4],
                   'income': [100, 200, 300, 400, 600, 400, 300]})
df = df.pivot_table(index='state', columns='region', values=['count', 'income'], aggfunc={'count': 'sum', 'income': 'mean'})
df
I tried the code below but it's not working. Any idea how to do this? Thanks!
df[['count']]['region_sea', 'region_inland']
You can use sort_index to sort the columns. However, because the columns are nested, sorting on level 0 will reorder income and count as well.
df.sort_index(axis='columns', level=0, ascending=False, inplace=True)
If you don't want income/count reordered, sort on the region level instead; note that the columns will then be grouped by region first, so count and income no longer share a common header block.
df.sort_index(axis='columns', level='region', ascending=False, inplace=True)
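Alternatively, if you just want the region level in an explicit order while leaving the count/income blocks untouched, reindexing on that column level should also work; a minimal sketch, assuming only the two region labels shown above:
df = df.reindex(columns=['region_sea', 'region_inland'], level='region')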

How do I use pandas to add rows to a data frame based on a date column and number of days column

I would like to take a start date from one dataframe column and, using the number of days given in another column, add one row per day to the dataframe.
Essentially, I am trying to turn this data frame:
df = pd.DataFrame({
    'Name': ['Peter', 'Peter', 'Peter', 'Peter'],
    'Planned_Start': ['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
    'Duration': [2, 3, 5, 6],
    'Hrs': [0.6, 1, 1.2, 0.3]})
...into this data frame:
df_2 = pd.DataFrame({
    'Name': ['Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter'],
    'Date': ['1/1/2019', '1/2/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/10/2019', '1/15/2019', '1/16/2019'],
    'Hrs': [0.6, 0.6, 1, 1, 1, 1.2, 0.3, 0.3]})
I'm new to programming in general and have tried the following:
df_2 = pd.DataFrame({
    'date': pd.date_range(
        start=df.Planned_Start,
        end=pd.to_timedelta(df.Duration, unit='D'),
        freq='D'
    )
})
... and ...
df["date"] = df.Planned_Start + timedelta(int(df.Duration))
with no luck.
I am not entirely sure what you are trying to achieve as your df_2 looks a bit wrong from what I can see.
If you want to take the Duration column as days and add that many dates to a date column, the code below achieves that.
You can also drop any columns you don't need with the pd.Series.drop() method:
from datetime import timedelta
import pandas as pd

df = pd.DataFrame({
    'Name': ['Peter', 'Peter', 'Peter', 'Peter'],
    'Planned_Start': ['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
    'Duration': [2, 3, 5, 6],
    'Hrs': [0.6, 1, 1.2, 0.3]})

new_rows = []
for i, row in df.iterrows():
    start = pd.to_datetime(row.Planned_Start, format='%m/%d/%Y')
    for duration in range(row.Duration):
        # one expanded row per day of the task's duration
        newrow = row.copy()
        newrow['date'] = start + timedelta(days=duration)
        new_rows.append(newrow)
df_new = pd.DataFrame(new_rows).reset_index(drop=True)
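For larger frames, a vectorized alternative avoids the Python-level loop: repeat each row Duration times, then offset the parsed start date by a per-row counter. A sketch under the same column names as above:
out = df.loc[df.index.repeat(df['Duration'])].copy()
offsets = out.groupby(level=0).cumcount()  # 0, 1, ... within each original row
out['Date'] = pd.to_datetime(out['Planned_Start'], format='%m/%d/%Y') + pd.to_timedelta(offsets, unit='D')
out = out[['Name', 'Date', 'Hrs']].reset_index(drop=True)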

pandas aggregate data from two data frames

I have two pandas data frames, with some indexes and some column names in common (like partially overlapping time-series related to common quantities).
I need to merge these two dataframes into a single one containing all the indexes and, for each index, all the values, keeping the values of the left (right) frame whenever an index-column combination appears in both.
Both merge and join are unhelpful here: merge duplicates information I don't need, and join has the same problem.
What's an efficient method to obtain the result I need?
EDIT:
If for example I have the two data frames
df1 = pd.DataFrame({
    'C1': [1.1, 1.2, 1.3],
    'C2': [2.1, 2.2, 2.3],
    'C3': [3.1, 3.2, 3.3]},
    index=['a', 'b', 'c'])
df2 = pd.DataFrame({
    'C3': [3.1, 3.2, 33.3],
    'C4': [4.1, 4.2, 4.3]},
    index=['b', 'c', 'd'])
What I need is a method that allows me to create:
merged = pd.DataFrame({
    'C1': [1.1, 1.2, 1.3, np.nan],
    'C2': [2.1, 2.2, 2.3, np.nan],
    'C3': [3.1, 3.2, 3.3, 33.3],
    'C4': [np.nan, 4.1, 4.2, 4.3]},
    index=['a', 'b', 'c', 'd'])
Here are three possibilities:
Use concat/groupby: First concatenate both DataFrames vertically. Then group by the index and select the first row in each group.
Use combine_first: Make a new index which is the union of df1 and df2. Reindex df1 using the new index. Then use combine_first to fill in NaNs with values from df2.
Use manual construction: We could use df2.index.difference(df1.index) to find exactly which rows need to be added to df1. So we could manually select those rows from df2 and concatenate them on to df1.
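For example, applying the combine_first approach to the df1/df2 above reproduces the desired merged frame (df1 wins wherever an index-column combination appears in both):
index = df1.index.union(df2.index)  # Index(['a', 'b', 'c', 'd'])
merged = df1.reindex(index).combine_first(df2)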
For small DataFrames, using_concat is faster. For larger DataFrames, using_combine_first appears to be slightly faster than the other options:
import numpy as np
import pandas as pd
import perfplot

def make_dfs(N):
    df1 = pd.DataFrame(np.random.randint(10, size=(N, 2)))
    df2 = pd.DataFrame(np.random.randint(10, size=(N, 2)), index=range(N//2, N//2 + N))
    return df1, df2

def using_concat(dfs):
    df1, df2 = dfs
    result = pd.concat([df1, df2], sort=False)
    n = result.index.nlevels
    return result.groupby(level=range(n)).first()

def using_combine_first(dfs):
    df1, df2 = dfs
    index = df1.index.union(df2.index)
    result = df1.reindex(index)
    result = result.combine_first(df2)
    return result

def using_manual_construction(dfs):
    df1, df2 = dfs
    index = df2.index.difference(df1.index)
    cols = df2.columns.difference(df1.columns)
    result = pd.concat([df1, df2.loc[index]], sort=False)
    result.loc[df2.index, cols] = df2
    return result

perfplot.show(
    setup=make_dfs,
    kernels=[using_concat, using_combine_first, using_manual_construction],
    n_range=[2**k for k in range(5, 21)],
    logx=True,
    logy=True,
    xlabel='len(df)')
Without seeing your code I can only give a generic answer. To merge two dataframes, use
df3 = pd.merge(df1, df2, how='right', on=('col1', 'col2'))
or
a.merge(b, how='right', on=('c1', 'c2'))

How to get average of field in dataframe [duplicate]

Say my data looks like this:
date,name,id,dept,sale1,sale2,sale3,total_sale
1/1/17,John,50,Sales,50.0,60.0,70.0,180.0
1/1/17,Mike,21,Engg,43.0,55.0,2.0,100.0
1/1/17,Jane,99,Tech,90.0,80.0,70.0,240.0
1/2/17,John,50,Sales,60.0,70.0,80.0,210.0
1/2/17,Mike,21,Engg,53.0,65.0,12.0,130.0
1/2/17,Jane,99,Tech,100.0,90.0,80.0,270.0
1/3/17,John,50,Sales,40.0,50.0,60.0,150.0
1/3/17,Mike,21,Engg,53.0,55.0,12.0,120.0
1/3/17,Jane,99,Tech,80.0,70.0,60.0,210.0
I want a new column, average, which is the average of total_sale for each (name, id, dept) tuple.
I tried
df.groupby(['name', 'id', 'dept'])['total_sale'].mean()
And this does return a series with the mean:
name id dept
Jane 99 Tech 240.000000
John 50 Sales 180.000000
Mike 21 Engg 116.666667
Name: total_sale, dtype: float64
but how would I reference the data? The series is a one dimensional one of shape (3,). Ideally I would like this put back into a dataframe with proper columns so I can reference properly by name/id/dept.
If you call .reset_index() on the series that you have, it will get you a dataframe like you want (each level of the index will be converted into a column):
df.groupby(['name', 'id', 'dept'])['total_sale'].mean().reset_index()
EDIT: to respond to the OP's comment, adding this column back to your original dataframe is a little trickier. You don't have the same number of rows as in the original dataframe, so you can't assign it as a new column yet. However, if you set the index the same, pandas is smart and will fill in the values properly for you. Try this:
cols = ['date', 'name', 'id', 'dept', 'sale1', 'sale2', 'sale3', 'total_sale']
data = [
    ['1/1/17', 'John', 50, 'Sales', 50.0, 60.0, 70.0, 180.0],
    ['1/1/17', 'Mike', 21, 'Engg', 43.0, 55.0, 2.0, 100.0],
    ['1/1/17', 'Jane', 99, 'Tech', 90.0, 80.0, 70.0, 240.0],
    ['1/2/17', 'John', 50, 'Sales', 60.0, 70.0, 80.0, 210.0],
    ['1/2/17', 'Mike', 21, 'Engg', 53.0, 65.0, 12.0, 130.0],
    ['1/2/17', 'Jane', 99, 'Tech', 100.0, 90.0, 80.0, 270.0],
    ['1/3/17', 'John', 50, 'Sales', 40.0, 50.0, 60.0, 150.0],
    ['1/3/17', 'Mike', 21, 'Engg', 53.0, 55.0, 12.0, 120.0],
    ['1/3/17', 'Jane', 99, 'Tech', 80.0, 70.0, 60.0, 210.0]
]
df = pd.DataFrame(data, columns=cols)
mean_col = df.groupby(['name', 'id', 'dept'])['total_sale'].mean() # don't reset the index!
df = df.set_index(['name', 'id', 'dept']) # make the same index here
df['mean_col'] = mean_col
df = df.reset_index() # to take the hierarchical index off again
Another option: adding to_frame converts the series back into a dataframe:
df.groupby(['name', 'id', 'dept'])['total_sale'].mean().to_frame()
You are very close. You simply need to add a second set of brackets around [['total_sale']] to tell pandas to return a DataFrame and not a Series:
df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()
If you want all columns:
df.groupby(['name', 'id', 'dept'], as_index=False).mean()[['name', 'id', 'dept', 'total_sale']]
The answer is in two lines of code:
The first line creates the hierarchical frame.
df_mean = df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()
The second line converts it back to a dataframe with four columns ('name', 'id', 'dept', 'total_sale'):
df_mean = df_mean.reset_index()
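As a side note, if the goal is the average as a new column on the original rows rather than a collapsed frame, transform broadcasts the group mean back directly and skips the index juggling above:
df['average'] = df.groupby(['name', 'id', 'dept'])['total_sale'].transform('mean')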

Custom function using multiple parameters applied to every column in dataframe

I have a df that looks like this
data = [{'Stock': 'Apple',  'Weight': 0.2,  'Price': 101.99, 'Beta': 1.1},
        {'Stock': 'MCSFT',  'Weight': 0.1,  'Price': 143.12, 'Beta': 0.9},
        {'Stock': 'WARNER', 'Weight': 0.15, 'Price': 76.12,  'Beta': -1.1},
        {'Stock': 'ASOS',   'Weight': 0.35, 'Price': 76.12,  'Beta': -1.1},
        {'Stock': 'TESCO',  'Weight': 0.2,  'Price': 76.12,  'Beta': -1.1}]
data_df = pd.DataFrame(data)
and a custom function that will calculate weighted averages
def calc_weighted_averages(data_in, weighted_by):
    return sum(x * y for x, y in zip(data_in, weighted_by)) / sum(weighted_by)
I want to apply this custom function to all the columns in my df; my first idea was to do something like this:
data_df = data_df[['Weight','Price','Beta']]
data_df = data_df.apply(lambda x: calc_weighted_averages(x['Price'], x['Weight']), axis=1)
How can I keep my weighted_by column fixed and apply the custom function to the other columns? I should end up with a weighted average number for Price and Beta.
I think you need to select the relevant columns first and then pass the Weight column as the second argument:
s1 = data_df[['Price','Beta']].apply(lambda x: calc_weighted_averages(x, data_df['Weight']))
print (s1)
Price 87.994
Beta -0.460
dtype: float64
Another solution without apply is faster:
s1 = data_df[['Price','Beta']].mul(data_df['Weight'], 0).sum().div(data_df['Weight'].sum())
print (s1)
Price 87.994
Beta -0.460
dtype: float64
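As a cross-check, numpy ships a built-in weighted mean, so the same numbers can also be produced with np.average (assuming numpy is imported as np):
import numpy as np
s1 = data_df[['Price', 'Beta']].apply(lambda col: np.average(col, weights=data_df['Weight']))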
