I have the following pandas dataframe and baseline value:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[
    {'yr': 2010, 'month': 0, 'name': 'Johnny', 'total': 50},
    {'yr': 2010, 'month': 0, 'name': 'Johnny', 'total': 50},
    {'yr': 2010, 'month': 1, 'name': 'Johnny', 'total': 105},
    {'yr': 2010, 'month': 0, 'name': 'Zack', 'total': 90}
])
baseline_value = 100
I'm grouping and aggregating the data based on year, month and name. Then I'm calculating the net sum relative to the baseline value:
pt = pd.pivot_table(data=df, index=['yr', 'month', 'name'], values='total', aggfunc=np.sum)
pt['net'] = pt['total'] - baseline_value
print(pt)
total net
yr month name
2010 0 Johnny 100 0
Zack 90 -10
1 Johnny 105 5
How can I restructure this DataFrame so the output looks something like this:
value
yr month name type
2010 0 Johnny Total 100
Net 0
Zack Total 90
Net -10
1 Johnny Total 105
Net 5
Option 1: Reshaping your pivot DataFrame pt
Use stack, rename, and to_frame:
pt.stack().rename('value').to_frame()
Output:
value
yr month name
2010 0 Johnny total 100
net 0
Zack total 90
net -10
1 Johnny total 105
net 5
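The stacked frame keeps the lowercase column names as an unnamed inner index level. If you also want that level to be called type with capitalized Total/Net labels, exactly as in the desired output, a small follow-up sketch (assuming the pt from above) could be:
out = pt.stack().rename('value').to_frame()
out.index = out.index.set_names('type', level=-1)     # name the new inner level
out = out.rename(index=str.capitalize, level='type')  # 'total' -> 'Total', 'net' -> 'Net'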
Option 2: Using set_index and sum on the original df
Here is another approach starting from your source df, using set_index and sum with the level parameter, then reshaping with stack:
baseline_value = 100

(df.set_index(['yr', 'month', 'name'])
   .sum(level=[0, 1, 2])
   .eval('net = total - @baseline_value', inplace=False)
   .stack()
   .to_frame(name='value'))
Output:
value
yr month name
2010 0 Johnny total 100
net 0
Zack total 90
net -10
1 Johnny total 105
net 5
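Note that DataFrame.sum(level=...) has since been deprecated in newer pandas releases in favour of an explicit groupby. A roughly equivalent version of the same chain (a sketch, using the baseline_value defined above) would be:
(df.set_index(['yr', 'month', 'name'])
   .groupby(level=[0, 1, 2]).sum()
   .eval('net = total - @baseline_value')
   .stack()
   .to_frame(name='value'))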
Picture of my dataframe
Is it possible to summarize or group every country's info to something like a 'total info' row
This df changes each month, so having a "quick access" summary view of how it looks will be very beneficial.
Take the picture as an example: I would like to have Albania's (and every other country's) info in one row, something like this:
**ORIGINATING COUNTRY   Calls Made   Actual Qty   Billable Qty.   Cost (€)**
Albania                 10           190          600             7
Zambia                  total        total        total           total
... (and every other country in my df, one total row each)
I've tried groupby() and sum() but can't figure it out.
import pandas as pd

df = pd.DataFrame(
    data=[
        ['Albania', 1, 10, 100, 0.1],
        ['Albania', 2, 20, 200, 0.2],
        ['Zambia', 3, 30, 300, 0.3],
        ['Zambia', 4, 40, 400, 0.4],
        [None, 5, 50, 500, 0.5],
        [None, 6, 60, 600, 0.6],
    ],
    columns=[
        'ORIGINATING COUNTRY',
        'Calls Made',
        'Actual Qty. (s)',
        'Billable Qty. (s)',
        'Cost (€)',
    ],
)

# Keep rows with a missing country by grouping them under 'Unknown'
df['ORIGINATING COUNTRY'].replace({None: 'Unknown'}, inplace=True)

# One summary row per country, with the numeric columns summed
df.groupby('ORIGINATING COUNTRY', as_index=False).sum()
Output:
ORIGINATING COUNTRY Calls Made Actual Qty. (s) Billable Qty. (s) Cost (€)
0 Albania 3 30 300 0.3
1 Unknown 11 110 1100 1.1
2 Zambia 7 70 700 0.7
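If you prefer to avoid inplace modification, an equivalent sketch starting from the original df (before the replace) would be to fill the missing country names and group in a single chain:
summary = (df.fillna({'ORIGINATING COUNTRY': 'Unknown'})
             .groupby('ORIGINATING COUNTRY', as_index=False)
             .sum())
print(summary)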
I have a real estate dataframe with many outliers and many observations.
I have these variables: total area, number of rooms (if rooms = 0, then it's a studio apartment) and kitchen_area.
A "minimalized" extraction from my dataframe:
import pandas as pd

dic = [{'area': 40, 'kitchen_area': 10, 'rooms': 1, 'price': 50000},
       {'area': 20, 'kitchen_area': 0, 'rooms': 0, 'price': 50000},
       {'area': 60, 'kitchen_area': 0, 'rooms': 2, 'price': 70000},
       {'area': 29, 'kitchen_area': 9, 'rooms': 1, 'price': 30000},
       {'area': 15, 'kitchen_area': 0, 'rooms': 0, 'price': 25000}]
df = pd.DataFrame(dic, index=['apt1', 'apt2', 'apt3', 'apt4', 'apt5'])
My target would be to eliminate apt3, because by law, the kitchen area cannot be smaller than 5 square meters in non-studio apartments.
In other words, I would like to eliminate all rows describing non-studio apartments (rooms > 0) that have kitchen_area < 5.
I have tried code like this:
df1 = df.drop(df[(df.rooms > 0) & (df.kitchen_area < 5)].index)
But it just eliminated all data from both columns kitchen_area and rooms according to the multiple conditions I put.
Clean
mask1 = df.rooms > 0
mask2 = df.kitchen_area < 5
df1 = df[~(mask1 & mask2)]
df1
area kitchen_area rooms price
apt1 40 10 1 50000
apt2 20 0 0 50000
apt4 29 9 1 30000
apt5 15 0 0 25000
pd.DataFrame.query
df1 = df.query('rooms == 0 | kitchen_area >= 5')
df1
area kitchen_area rooms price
apt1 40 10 1 50000
apt2 20 0 0 50000
apt4 29 9 1 30000
apt5 15 0 0 25000
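For completeness, the same filter can also be written with .loc, keeping studios and apartments whose kitchen meets the legal minimum; this is just a sketch equivalent to the two variants above:
df1 = df.loc[(df.rooms == 0) | (df.kitchen_area >= 5)]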
I have the following pandas dataframe:
d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'], 'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'], 'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(data=d)
print(df)
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 24
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 4321 2013 6
6 9567 2002 150
7 1169 2012 47
I now want to merge two rows of the DataFrame that have two different IDs, so that ultimately only one remains. The merge should only take place if the values of the column "YEAR" match. The values of the column "VALUE" should be added together.
The output should look like this:
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 30
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 9567 2002 150
6 1169 2012 47
Line 1 and line 5 have been merged. Line 5 is removed and line 1 remains with the previous ID, but the VALUEs of line 1 and line 5 have been added.
I would like to specify later which two lines or which two IDs should be merged. One of the two should always remain. The two IDs to be merged come from another function.
I experimented with the groupby() function, but I don't know how to merge two different IDs there. I managed it only with identical values of the "ID" column. This then looked like this:
df.groupby(['ID', 'YEAR'])['VALUE'].sum().reset_index(name ='VALUE')
Unfortunately, even after extensive searching, I have not found anything suitable. I would be very happy if someone can help me! I would like to apply the whole thing later to a much larger DataFrame with more rows. Thanks in advance and best regards!
Try this: for the two IDs in question, group on 'ID' and take the max YEAR and the sum of VALUE:
df[df['ID'].isin(['1234', '4321'])].groupby('ID', as_index=False).agg({'YEAR': 'max', 'VALUE': 'sum'})
Output:
ID YEAR VALUE
0 1234 2013 27
1 4321 2013 6
Or group on YEAR and take the first ID:
df[df['ID'].isin(['1234', '4321'])].groupby('YEAR', as_index=False).agg({'ID': 'first', 'VALUE': 'sum'})
Output:
YEAR ID VALUE
0 2012 1234 3
1 2013 1234 30
Based on all the comments and the update to the question, it sounds like logic along these lines (maybe not this exact code) is required...
Try:
import pandas as pd

d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'],
     'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'],
     'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(d)

# IDs are handled as integers below; note that leading zeros are lost ('0169' -> 169)
df['ID'] = df['ID'].astype(int)

def correctRows(l, i):
    # From the candidate row positions l, pick the one whose YEAR matches row i
    for x in l:
        if df.loc[x, 'YEAR'] == df.loc[i, 'YEAR']:
            row = x
            break
    return row

def mergeRows(a, b):
    # Merge the rows of ID b into the matching row of ID a (same YEAR);
    # a's row is kept, b's row is dropped after its VALUE has been added.
    rowa = list(df[df['ID'] == a].index)
    rowb = list(df[df['ID'] == b].index)
    if len(rowa) > 1:
        if type(rowb) == list:
            rowa = correctRows(rowa, rowb[0])
        else:
            rowa = correctRows(rowa, rowb)
    else:
        rowa = rowa[0]
    if len(rowb) > 1:
        if type(rowa) == list:
            rowb = correctRows(rowb, rowa[0])
        else:
            rowb = correctRows(rowb, rowa)
    else:
        rowb = rowb[0]
    print('Keeping: ', ', '.join(' '.join(s.split()) for s in df.loc[rowa].to_string().splitlines()))
    print('Dropping:', ', '.join(' '.join(s.split()) for s in df.loc[rowb].to_string().splitlines()))
    df.loc[rowa, 'VALUE'] = df.loc[rowa, 'VALUE'] + df.loc[rowb, 'VALUE']
    df.drop(df.index[rowb], inplace=True)
    df.reset_index(drop=True, inplace=True)
    return None

# Pass two IDs. The first 'ID' is kept; the second is dropped, but its 'VALUE'
# is added to the 'VALUE' of the first.
# Note: df['ID'] was cast to int near the start, hence integers are required here.
# mergeRows(4321, 1234)
mergeRows(1234, 4321)
Outputs:
Keeping: ID 1234, YEAR 2013, VALUE 24
Dropping: ID 4321, YEAR 2013, VALUE 6
Frame now looks like:
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 30 #<-- sum of 6 + 24
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 9567 2002 150
6 169 2012 47
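As an alternative to the helper functions, if the pair of IDs is known in advance and, as in this example, the two IDs share exactly one YEAR, a shorter vectorized sketch starting from the original df (with ID already cast to int) could be:
keep_id, drop_id = 1234, 4321
year = df.loc[df['ID'] == drop_id, 'YEAR'].iloc[0]    # the shared YEAR
extra = df.loc[df['ID'] == drop_id, 'VALUE'].sum()    # VALUE to fold into the kept row
df.loc[(df['ID'] == keep_id) & (df['YEAR'] == year), 'VALUE'] += extra
df = df[df['ID'] != drop_id].reset_index(drop=True)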
I have a dataframe in Python similar to the one in the picture, and I was wondering how I can get to the same output. So if I have a certain value and there exists another value of opposite sign in that row (-360 and 360, for example) which has the same date (the same month, more exactly), then I have to create a new variable that outputs '+/- same month'. Likewise, if the values are from different months, then '+/- different month'. If there is no opposite value, then I just have to print whether the value is positive or negative. I have tried to do this with 2 for loops but I have failed miserably and I am out of ideas.
Example of desired output
Gotta rush, I will add a description in the evening
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

test_data = [
    {'Title': "Account1", 'Amount': 100, 'Date': '2019-11-13'},
    {'Title': "Account1", 'Amount': -100, 'Date': '2019-11-17'},
    {'Title': "Account2", 'Amount': 200, 'Date': '2019-11-14'},
    {'Title': "Account2", 'Amount': -200, 'Date': '2019-12-14'},
    {'Title': "Account3", 'Amount': 300, 'Date': '2020-01-01'}
]
test_data = pd.DataFrame(test_data)

# Extract 'YYYY-MM' so rows can be compared by month
test_data['Month'] = test_data['Date'].apply(lambda x: x[:-3])

# Split into positive and negative amounts and pair them up per account
positive_amount = test_data[test_data['Amount'] > 0]
negative_amount = test_data[test_data['Amount'] < 0]
compare = pd.merge(how='left',
                   left=positive_amount,
                   right=negative_amount,
                   left_on='Title', right_on='Title',
                   suffixes=['_positive', '_negative'])

conditions = [
    (compare['Amount_negative'].isnull()),
    (compare['Month_positive'] == compare['Month_negative']),
    (compare['Month_positive'] != compare['Month_negative'])]
choices = ['', 'Same Month', 'Different Month']
compare['output'] = np.select(conditions, choices)
print(compare)
Title Amount_positive Date_positive Month_positive Amount_negative Date_negative Month_negative output
0 Account1 100 2019-11-13 2019-11 -100.0 2019-11-17 2019-11 Same Month
1 Account2 200 2019-11-14 2019-11 -200.0 2019-12-14 2019-12 Different Month
2 Account3 300 2020-01-01 2020-01 NaN NaN NaN
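The question also asks that a value with no opposite counterpart be labelled as simply positive or negative. Since the unmatched rows of this left merge all come from the positive side, one possible tweak (a sketch) is to change the first choice; note that negative amounts without any positive counterpart would not appear in this left merge at all and would still need separate handling:
choices = ['Positive', 'Same Month', 'Different Month']
compare['output'] = np.select(conditions, choices)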
I am trying to generate annual data for certain products when I have data for the base year and a growth rate.
In the toy example, each product has a different annual growth rate in efficiency depending on its 'color', and I want to generate yearly data until 2030.
Therefore, I have base year data (base_year) as follows:
year color shape efficiency
0 2018 red circle 50
1 2018 red square 30
2 2018 blue circle 100
3 2018 blue square 60
And each type of product's growth rate (growthrate) as:
color rate
0 red 30
1 blue 20
The result I desire is:
year color shape efficiency
0 2018 red circle 50
1 2018 red square 30
2 2018 blue circle 100
3 2018 blue square 60
4 2019 red circle 65
5 2019 red square 39
6 2019 blue circle 120
7 2019 blue square 72
8 2020 red circle 84.5
... (until 2030)
The data used in the toy example is:
base_year = pd.DataFrame(data = {'year': [2018,2018,2018,2018],
'color': ['red', 'red', 'blue', 'blue'],
'shape' : ['circle', 'square', 'circle', 'square'],
'efficiency' : [50, 30, 100, 60]}, columns = ['year', 'color', 'shape', 'efficiency'])
growthrate = pd.DataFrame(data = {'color': ['red', 'blue'],
'rate' : [30, 20]}, columns = ['color', 'rate'])
I've been trying an approach using .loc, but it seems quite inefficient.
Any suggestions or hints would be appreciated. Thank you in advance!
Here is one way to do this:
years = 2031 - 2018
df = (pd.concat([df.assign(year=df['year'] + i,
                           efficiency=df['efficiency'] * ((df['rate'] / 100 + 1) ** i))
                 for i, df in enumerate([base_year.merge(growthrate, on='color')] * years)])
        .drop('rate', axis=1))
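The same idea written as an explicit loop may be easier to follow; this sketch merges the rate onto the base year once and then compounds it for each year offset up to 2030, producing the same frame:
merged = base_year.merge(growthrate, on='color')
frames = []
for i in range(2031 - 2018):   # offsets 0..12 -> years 2018..2030
    frames.append(merged.assign(year=merged['year'] + i,
                                efficiency=merged['efficiency'] * (1 + merged['rate'] / 100) ** i))
result = pd.concat(frames, ignore_index=True).drop(columns='rate')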