I have a list of persons with their respective earnings by company, like this:
Company_code  Person  Date     Earning1  Earning2
1             Jonh    2014-01  100       200
2             Jonh    2014-01  300       400
1             Jonh    2014-02  500       600
1             Peter   2014-01  300       400
1             Peter   2014-02  500       600
And I would like to summarize it into this:
Company_code  Person  2014-01_E1  2014-01_E2  2014-02_E1  2014-02_E2
1             Jonh    100         200         500         600
2             Jonh    300         400
1             Peter   300         400         500         600
I had the same problem in SQL, which I solved with this code:
with t(Company_code, Person, Dt, Earning1, Earning2) as (
  select 1, 'Jonh',  to_date('2014-01-01', 'YYYY-MM-DD'), 100, 200 from dual union all
  select 2, 'Jonh',  to_date('2014-01-01', 'YYYY-MM-DD'), 300, 400 from dual union all
  select 1, 'Jonh',  to_date('2014-02-01', 'YYYY-MM-DD'), 500, 600 from dual union all
  select 1, 'Peter', to_date('2014-01-01', 'YYYY-MM-DD'), 300, 400 from dual union all
  select 1, 'Peter', to_date('2014-02-01', 'YYYY-MM-DD'), 500, 600 from dual
)
select *
from t
pivot (
  sum(Earning1) e1
  , sum(Earning2) e2
  for dt in (
    to_date('2014-01-01', 'YYYY-MM-DD') "2014-01"
    , to_date('2014-02-01', 'YYYY-MM-DD') "2014-02"
  )
)
COMPANY_CODE  PERSON  2014-01_E1  2014-01_E2  2014-02_E1  2014-02_E2
--------------------------------------------------------------------
2             Jonh    300         400         -           -
1             Peter   300         400         500         600
1             Jonh    100         200         500         600
How can this be achieved in Python? I'm trying with pandas pivot_table:
pd.pivot_table(df, columns=['COMPANY_CODE', 'PERSON', 'DATE'], aggfunc=np.sum)
but this just transposes the table ... any clues?
Using user1827356's suggestion (note that the old rows=/cols= arguments are called index=/columns= in current pandas):
df2 = pd.pivot_table(df, index=['Company_code', 'Person'], columns=['Date'], aggfunc='sum')
print(df2)
# Earning1 Earning2
# Date 2014-01 2014-02 2014-01 2014-02
# Company_code Person
# 1 Jonh 100 500 200 600
# Peter 300 500 400 600
# 2 Jonh 300 NaN 400 NaN
You can flatten the hierarchical columns like this:
columns = ['{}_E{}'.format(date, earning.replace('Earning', ''))
for earning, date in df2.columns.tolist()]
df2.columns = columns
print(df2)
# 2014-01_E1 2014-02_E1 2014-01_E2 2014-02_E2
# Company_code Person
# 1 Jonh 100 500 200 600
# Peter 300 500 400 600
# 2 Jonh 300 NaN 400 NaN
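If you want both earnings for each date to sit next to each other, as in the desired layout, a small optional step (not part of the original answer) is to sort the flattened column names:

df2 = df2[sorted(df2.columns)]
print(df2.columns.tolist())
# ['2014-01_E1', '2014-01_E2', '2014-02_E1', '2014-02_E2']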
Here's the nicest way to do it, using unstack.
df = pd.DataFrame({
    'company_code': [1, 2, 1, 1, 1],
    'person': ['Jonh', 'Jonh', 'Jonh', 'Peter', 'Peter'],
    'earning2': [200, 400, 600, 400, 600],
    'earning1': [100, 300, 500, 300, 500],
    'date': ['2014-01', '2014-01', '2014-02', '2014-01', '2014-02']
})
df = df.set_index(['date', 'company_code', 'person'])
df.unstack('date')
Resulting in:
earning1 earning2
date 2014-01 2014-02 2014-01 2014-02
company_code person
1 Jonh 100.0 500.0 200.0 600.0
1 Peter 300.0 500.0 400.0 600.0
2 Jonh 300.0 NaN 400.0 NaN
Setting the index to ['date', 'company_code', 'person'] is a good idea anyway, since that's really what your DataFrame contains: two different earnings categories (1 and 2) each described by a date, a company code and a person.
It's good practice to always work out what the 'real' data in your DataFrame is, and which columns are meta-data, and index accordingly.
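To go from the unstacked result to the flat column names shown in the question, a minimal sketch (building on the indexed DataFrame defined above) is:

out = df.unstack('date')
# ('earning1', '2014-01') becomes '2014-01_E1', and so on
out.columns = ['{}_E{}'.format(date, name[-1]) for name, date in out.columns]
out = out[sorted(out.columns)].reset_index()
print(out)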
Related
I have a dataframe of categories and amounts. Categories can be nested into subcategories to an arbitrary number of levels using a colon-separated string. I wish to sort it by descending amount, but in a hierarchical fashion, as shown below.
How I need it sorted
CATEGORY AMOUNT
Transport 5000
Transport : Car 4900
Transport : Train 100
Household 1100
Household : Utilities 600
Household : Utilities : Water 400
Household : Utilities : Electric 200
Household : Cleaning 100
Household : Cleaning : Bathroom 75
Household : Cleaning : Kitchen 25
Household : Rent 400
Living 250
Living : Other 150
Living : Food 100
EDIT:
The data frame:
df = pd.DataFrame({
    "category": ["Transport", "Transport : Car", "Transport : Train", "Household", "Household : Utilities", "Household : Utilities : Water", "Household : Utilities : Electric", "Household : Cleaning", "Household : Cleaning : Bathroom", "Household : Cleaning : Kitchen", "Household : Rent", "Living", "Living : Other", "Living : Food"],
    "amount": [5000, 4900, 100, 1100, 600, 400, 200, 100, 75, 25, 400, 250, 150, 100]
})
Note: this is the order I want it. It may be in any arbitrary order before the sort.
EDIT2:
If anyone is looking for a similar solution, I posted the one I settled on here: How to sort dataframe in pandas by value in hierarchical category structure
One way could be to first str.split the category column.
df_ = df['category'].str.split(' : ', expand=True)
print (df_.head())
0 1 2
0 Transport None None
1 Transport Car None
2 Transport Train None
3 Household None None
4 Household Utilities None
Then take the amount column; what you want is the maximum amount per group based on:
the first column alone,
then the first and second columns,
then the first, second and third columns, ...
You can do this with groupby.transform using 'max', and concat each column created.
s = df['amount']
l_cols = list(df_.columns)
dfa = pd.concat([s.groupby([df_[col] for col in range(0, lv+1)]).transform('max')
                 for lv in l_cols], keys=l_cols, axis=1)
print (dfa)
0 1 2
0 5000 NaN NaN
1 5000 4900.0 NaN
2 5000 100.0 NaN
3 1100 NaN NaN
4 1100 600.0 NaN
5 1100 600.0 400.0
6 1100 600.0 200.0
7 1100 100.0 NaN
8 1100 100.0 75.0
9 1100 100.0 25.0
10 1100 400.0 NaN
11 250 NaN NaN
12 250 150.0 NaN
13 250 100.0 NaN
Now you just need to sort_values on all these columns in the right order (first 0, then 1, then 2, ...), get the index, and use loc to reorder df in the expected way:
dfa = dfa.sort_values(l_cols, na_position='first', ascending=False)
dfs = df.loc[dfa.index] #here you can reassign to df directly
print (dfs)
category amount
0 Transport 5000
1 Transport : Car 4900
2 Transport : Train 100
3 Household 1100
4 Household : Utilities 600
5 Household : Utilities : Water 400
6 Household : Utilities : Electric 200
10 Household : Rent 400 #here is the one difference with this data
7 Household : Cleaning 100
8 Household : Cleaning : Bathroom 75
9 Household : Cleaning : Kitchen 25
11 Living 250
12 Living : Other 150
13 Living : Food 100
I packaged @Ben.T's answer into a more generic function; hopefully this is clearer to read!
EDIT: I have made changes to the function to group by columns in order rather than one by one, to address potential issues noted by @Ben.T in the comments.
import pandas as pd

def category_sort_df(df, sep, category_col, numeric_col, ascending=False):
    '''Sorts dataframe by nested categories, using `sep` as the delimiter for `category_col`.
    Sorts numeric columns in descending order by default.
    Returns a copy.'''
    df = df.copy()
    try:
        to_sort = pd.to_numeric(df[numeric_col])
    except ValueError:
        print(f'Column `{numeric_col}` is not numeric!')
        raise
    categories = df[category_col].str.split(sep, expand=True)
    # Strip any white space before and after sep
    categories = categories.apply(lambda x: x.str.strip())
    levels = list(categories.columns)
    to_concat = []
    for level in levels:
        # Group by all columns up to this level rather than one at a time
        level_by = [categories[col] for col in range(0, level + 1)]
        gb = to_sort.groupby(level_by)
        to_concat.append(gb.transform('max'))
    dfa = pd.concat(to_concat, keys=levels, axis=1)
    ixs = dfa.sort_values(levels, na_position='first', ascending=ascending).index
    df = df.loc[ixs].copy()
    return df
Using Python 3.7.3, pandas 0.24.2
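For example, with the DataFrame from the question (a hypothetical call; df and the column names category and amount are assumed from the question):

sorted_df = category_sort_df(df, ' : ', 'category', 'amount')
print(sorted_df)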
To answer my own question: I found a way. It is kind of long-winded, but here it is.
import numpy as np
import pandas as pd

def sort_tree_df(df, tree_column, sort_column):
    # sort on absolute values so negative amounts are ranked by magnitude
    sort_key = sort_column + '_abs'
    df[sort_key] = df[sort_column].abs()
    # split the colon-separated categories into a MultiIndex, one level per depth
    df.index = pd.MultiIndex.from_frame(
        df[tree_column].str.split(":").apply(lambda x: [y.strip() for y in x]).apply(pd.Series))
    # lexsort keys, from least to most significant: the category string, the row's own
    # value, then the group maxima from the deepest level up to the top level
    sort_columns = [df[tree_column].values, df[sort_key].values] + [
        df.groupby(level=list(range(0, x)))[sort_key].transform('max').values
        for x in range(df.index.nlevels - 1, 0, -1)
    ]
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    df_sorted.reset_index(drop=True, inplace=True)
    df_sorted.drop(sort_key, axis=1, inplace=True)
    return df_sorted
sort_tree_df(df, 'category', 'amount')
If you don't mind adding an extra column, you can extract the main category from the category and then sort by main category/amount/category, i.e.:
df['main_category'] = df.category.str.extract(r'^([^ ]+)')
df.sort_values(['main_category', 'amount', 'category'], ascending=False)[['category', 'amount']]
Output:
category amount
0 Transport 5000
1 Transport : Car 4900
2 Transport : Train 100
11 Living 250
12 Living : Other 150
13 Living : Food 100
3 Household 1100
4 Household : Utilities 600
5 Household : Utilities : Water 400
10 Household : Rent 400
6 Household : Utilities : Electric 200
7 Household : Cleaning 100
8 Household : Cleaning : Bathroom 75
9 Household : Cleaning : Kitchen 25
Note that this will work well only if your main categories are single words without spaces. Otherwise you will need to do it in a different way, i.e. extract everything up to the first colon and strip the trailing space:
df['main_category'] = df.category.str.extract(r'^([^:]+)')
df['main_category'] = df.main_category.str.rstrip()
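Putting the two pieces together, a minimal sketch (same idea as above, not from the original answer) that is safe for multi-word main categories:

# take everything before the first colon and drop surrounding whitespace
df['main_category'] = df.category.str.extract(r'^([^:]+)', expand=False).str.strip()
result = df.sort_values(['main_category', 'amount', 'category'],
                        ascending=False)[['category', 'amount']]
print(result)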
I am trying to aggregate values in a groupby over multiple columns. I come from the R/dplyr world and what I want is usually achievable in a single line using group_by/summarize. I am trying to find an equivalently elegant way of achieving this using pandas.
Consider the input dataset below. I would like to aggregate by state and calculate v1 = sum(n1)/sum(d1) and v2 = sum(n2)/sum(d2) per state.
The r-code for this using dplyr is as follows:
input %>% group_by(state) %>%
summarise(v1=sum(n1)/sum(d1),
v2=sum(n2)/sum(d2))
Is there an elegant way of doing this in Python? I found a slightly verbose way of getting what I want in a Stack Overflow answer here.
Copying over the modified Python code from the link:
In [14]: s = mn.groupby('state', as_index=False).sum()
In [15]: s['v1'] = s['n1'] / s['d1']
In [16]: s['v2'] = s['n2'] / s['d2']
In [17]: s[['state', 'v1', 'v2']]
INPUT DATASET
state n1 n2 d1 d2
CA 100 1000 1 2
FL 200 2000 2 4
CA 300 3000 3 6
AL 400 4000 4 8
FL 500 5000 5 2
NY 600 6000 6 4
CA 700 7000 7 6
OUTPUT
state v1 v2
AL 100 500.000000
CA 100 500.000000
NY 100 1500.000000
CA 100 1166.666667
FL 100 1166.666667
One possible solution with DataFrame.assign and DataFrame.reindex:
df = (mn.groupby('state', as_index=False)
.sum()
.assign(v1 = lambda x: x['n1'] / x['d1'], v2 = lambda x: x['n2'] / x['d2'])
.reindex(['state', 'v1', 'v2'], axis=1))
print (df)
state v1 v2
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
And another with GroupBy.apply and a custom lambda function:
df = (mn.groupby('state')
.apply(lambda x: x[['n1','n2']].sum() / x[['d1','d2']].sum().values)
.reset_index()
.rename(columns={'n1':'v1', 'n2':'v2'})
)
print (df)
state v1 v2
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
Another solution:
def func(x):
    u = x.sum()
    return pd.Series({'v1': u['n1']/u['d1'],
                      'v2': u['n2']/u['d2']})
df.groupby('state').apply(func)
Output:
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
Here is the equivalent of the way you did it in R:
>>> from datar.all import f, tribble, group_by, summarise, sum
>>>
>>> input = tribble(
... f.state, f.n1, f.n2, f.d1, f.d2,
... "CA", 100, 1000, 1, 2,
... "FL", 200, 2000, 2, 4,
... "CA", 300, 3000, 3, 6,
... "AL", 400, 4000, 4, 8,
... "FL", 500, 5000, 5, 2,
... "NY", 600, 6000, 6, 4,
... "CA", 700, 7000, 7, 6,
... )
>>>
>>> input >> group_by(f.state) >> \
... summarise(v1=sum(f.n1)/sum(f.d1),
... v2=sum(f.n2)/sum(f.d2))
state v1 v2
<object> <float64> <float64>
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
I am the author of the datar package.
Another option is with the pipe function, where the groupby object is reusable:
(df.groupby('state')
   .pipe(lambda df: pd.DataFrame({'v1': df.n1.sum() / df.d1.sum(),
                                  'v2': df.n2.sum() / df.d2.sum()}))
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
Another option would be to convert the columns into a MultiIndex before grouping:
temp = df.set_index('state')
# split 'n1' -> ('n', '1'), 'd2' -> ('d', '2'); droplevel removes the empty trailing piece
temp.columns = temp.columns.str.split(r'(\d)', expand=True).droplevel(-1)
(temp.groupby('state')
.sum()
.pipe(lambda df: df.n /df.d)
.add_prefix('v')
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
Yet another way, still with the MultiIndex option, while avoiding a groupby:
# keep the index, necessary for unstacking later
temp = df.set_index('state', append=True)
# convert the columns to a MultiIndex
temp.columns = temp.columns.map(tuple)
# this works because the index is unique
(temp.unstack('state')
.sum()
.unstack([0,1])
.pipe(lambda df: df.n / df.d)
.add_prefix('v')
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
I just started learning pandas a week ago or so and I've been struggling with a pandas dataframe for a bit now. My data looks like this:
State NY CA Other Total
Year
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
I made this table from a dataset that included 30 or so values for the variable I'm representing as State here. If they weren't NY or CA in the example, I summed them and put them in an 'Other' category. The years were made from a normalized list of dates (originally a mix of mm/dd/yyyy and yyyy-mm-dd), in case this is contributing to my issue:
dict = {'Date': pd.to_datetime(my_df.Date).dt.year}
and later:
my_df = my_df.rename_axis('Year')
I'm trying now to append a row at the bottom that shows the totals in each category:
final_df = my_df.append({'Year': 'Total',
                         'NY': my_df.NY.sum(),
                         'CA': my_df.CA.sum(),
                         'Other': my_df.Other.sum(),
                         'Total': my_df.Total.sum()},
                        ignore_index=True)
This does technically work, but it makes my table look like this:
NY CA Other Total State
0 450 50 25 525 NaN
1 300 75 5 380 NaN
2 500 100 100 700 NaN
3 250 50 100 400 NaN
4 a b c d Total
('a' and so forth are the actual totals of the columns.) It adds a column at the beginning and puts my 'Year' column at the end. In fact, it removes the 'Date' label as well, and turns all the years in the last column into NaNs.
Is there any way I can get this formatted properly? Thank you for your time.
I believe you need to create a Series with sum and rename it:
final_df = my_df.append(my_df.sum().rename('Total'))
print (final_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
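Note that DataFrame.append was deprecated and later removed in newer pandas versions (2.0+); an equivalent sketch using pd.concat would be:

total = my_df.sum().rename('Total').to_frame().T
final_df = pd.concat([my_df, total])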
Another solution is to use loc for setting with enlargement:
my_df.loc['Total'] = my_df.sum()
print (my_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
Another idea from the previous answer: add the parameters margins=True and margins_name='Total' to crosstab:
df1 = df.assign(**dct)
out = (pd.crosstab(df1['Firing'], df1['State'], margins=True, margins_name='Total'))
I have a column in my dataframe comprised of numbers. I'd like to have another column in the dataframe that takes a running average of the values greater than 0, which I can ideally do in NumPy without iteration (the data is huge).
Vals Output
-350
1000 1000
1300 1150
1600 1300
1100 1250
1000 1200
450 1075
1900 1192.857143
-2000 1192.857143
-3150 1192.857143
1000 1168.75
-900 1168.75
800 1127.777778
8550 1870
Code:
vals = [-350, 1000, 1300, 1600, 1100, 1000, 450,
        1900, -2000, -3150, 1000, -900, 800, 8550]
df = pd.DataFrame({'Vals': vals})
Option 1
expanding and mean
df.assign(Out=df.loc[df.Vals.gt(0)].Vals.expanding().mean()).ffill()
If you have other columns in your DataFrame that have NaN values, this method will ffill those too, so if that is a concern, you may want to consider using something like this:
df['Out'] = df.loc[df.Vals.gt(0)].Vals.expanding().mean()
df['Out'] = df.Out.ffill()
Which will only fill in the Out column.
Option 2
mask:
df.assign(Out=df.mask(df.Vals.lt(0)).Vals.expanding().mean())
Both of these result in:
Vals Out
0 -350 NaN
1 1000 1000.000000
2 1300 1150.000000
3 1600 1300.000000
4 1100 1250.000000
5 1000 1200.000000
6 450 1075.000000
7 1900 1192.857143
8 -2000 1192.857143
9 -3150 1192.857143
10 1000 1168.750000
11 -900 1168.750000
12 800 1127.777778
13 8550 1870.000000
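Since the question asks for a NumPy approach without iteration, here is a hedged sketch of an equivalent computation using cumulative sums (not from the original answers; the Out_np column name is just for illustration):

import numpy as np

vals = df['Vals'].to_numpy(dtype=float)
pos = vals > 0
counts = np.cumsum(pos)                     # how many positive values seen so far
sums = np.cumsum(np.where(pos, vals, 0.0))  # running sum of the positive values
df['Out_np'] = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)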
I want to store a dictionary in a data frame.
dictionary_example = {1234: {'choice': 0, 'choice_set': {0: {'A': 100, 'B': 200, 'C': 300}, 1: {'A': 200, 'B': 300, 'C': 300}, 2: {'A': 500, 'B': 300, 'C': 300}}},
                      234: {'choice': 1, 'choice_set': {0: {'A': 100, 'B': 400}, 1: {'A': 100, 'B': 300, 'C': 1000}}},
                      1876: {'choice': 2, 'choice_set': {0: {'A': 100, 'B': 400, 'C': 300}, 1: {'A': 100, 'B': 300, 'C': 1000}, 2: {'A': 600, 'B': 200, 'C': 100}}}
                      }
and put it into this shape:
id choice 0_A 0_B 0_C 1_A 1_B 1_C 2_A 2_B 2_C
1234 0 100 200 300 200 300 300 500 300 300
234 1 100 400 - 100 300 1000 - - -
1876 2 100 400 300 100 300 1000 600 200 100
I think the following is pretty close; the core idea is simply to convert those dictionaries into JSON and rely on pandas.read_json to parse them.
dictionary_example={
"1234":{'choice':0,'choice_set':{0:{'A':100,'B':200,'C':300},1:{'A':200,'B':300,'C':300},2:{'A':500,'B':300,'C':300}}},
"234":{'choice':1,'choice_set':{0:{'A':100,'B':400},1:{'A':100,'B':300,'C':1000}}},
"1876":{'choice':2,'choice_set':{0:{'A': 100,'B':400,'C':300},1:{'A':100,'B':300,'C':1000},2:{'A':600,'B':200,'C':100}}}
}
import json
import pandas as pd

df = pd.read_json(json.dumps(dictionary_example)).T

def to_s(r):
    return pd.read_json(json.dumps(r)).unstack()
flattened_choice_set = df["choice_set"].apply(to_s)
flattened_choice_set.columns = ['_'.join((str(col[0]), col[1])) for col in flattened_choice_set.columns]
result = pd.merge(df, flattened_choice_set,
                  left_index=True, right_index=True).drop("choice_set", axis=1)
result
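As a possibly simpler alternative (a sketch, not from the original answer; it produces columns named 0_A, 0_B, ... alongside id and choice, close to but not exactly the layout above), pd.json_normalize can flatten the nested dictionaries directly:

# one record per id; json_normalize flattens the nested choice_set dicts
records = [{'id': k, **v} for k, v in dictionary_example.items()]
flat = pd.json_normalize(records, sep='_')
# columns come out as 'choice_set_0_A', 'choice_set_0_B', ...; shorten them
flat.columns = [c.replace('choice_set_', '') for c in flat.columns]
print(flat)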