When I use this syntax it creates a Series rather than adding a column to my new dataframe, sum.
My code:
sum = data['variance'] = data.budget + data.actual
My dataframe data currently has everything except the budget - actual column. How do I create the variance column?
   cluster                 date  budget  actual  budget - actual
0        a  2014-01-01 00:00:00   11000   10000             1000
1        a  2014-02-01 00:00:00    1200    1000
2        a  2014-03-01 00:00:00     200     100
3        b  2014-04-01 00:00:00     200     300
4        b  2014-05-01 00:00:00     400     450
5        c  2014-06-01 00:00:00     700    1000
6        c  2014-07-01 00:00:00    1200    1000
7        c  2014-08-01 00:00:00     200     100
8        c  2014-09-01 00:00:00     200     300
I think you've misunderstood some Python syntax; the following does two assignments:
In [11]: a = b = 1
In [12]: a
Out[12]: 1
In [13]: b
Out[13]: 1
So in your code it was as if you were doing:
sum = df['budget'] + df['actual'] # a Series
# and
df['variance'] = df['budget'] + df['actual'] # assigned to a column
The latter creates a new column for df:
In [21]: df
Out[21]:
cluster date budget actual
0 a 2014-01-01 00:00:00 11000 10000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
In [22]: df['variance'] = df['budget'] + df['actual']
In [23]: df
Out[23]:
cluster date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
As an aside, you shouldn't use sum as a variable name, as it overrides the built-in sum function.
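For example, a quick sketch of what goes wrong once the name is shadowed:
sum = df['budget'] + df['actual']   # shadows the built-in sum
sum([1, 2, 3])                      # TypeError: 'Series' object is not callable
del sum                             # removes the shadow, restoring the built-in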
Alternatively, you can sum the columns directly:
df['variance'] = df.loc[:, ['budget', 'actual']].sum(axis=1)
You could also use the .add() function:
df.loc[:,'variance'] = df.loc[:,'budget'].add(df.loc[:,'actual'])
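As a usage note, .add() also takes a fill_value argument, which helps if either column contains NaN; a sketch treating missing entries as zero:
df.loc[:, 'variance'] = df.loc[:, 'budget'].add(df.loc[:, 'actual'], fill_value=0)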
The same thing can be done using a lambda function.
Here I am reading the data from an xlsx file.
import pandas as pd
df = pd.read_excel("data.xlsx", sheet_name=4)
print(df)
Output:
cluster date budget actual
0 a 2014-01-01 00:00:00 11000 10000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
Sum the two columns into a third, new one:
df['variance'] = df.apply(lambda x: x['budget'] + x['actual'], axis=1)
print(df)
Output:
cluster date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
If "budget" has any NaN values but you don't want it to sum to NaN then try:
def fun (b, a):
if math.isnan(b):
return a
else:
return b + a
f = np.vectorize(fun, otypes=[float])
df['variance'] = f(df['budget'], df_Lp['actual'])
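A simpler vectorized sketch of the same idea, assuming only "budget" needs the NaN handling:
df['variance'] = df['budget'].fillna(0) + df['actual']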
This is an elegant solution that follows DRY and works well:
dataframe_name[['col1', 'col2', 'col3']].sum(axis=1, skipna=True)
eval lets you sum and create columns right away:
In [12]: data.eval('variance = budget + actual', inplace=True)
In [13]: data
Out[13]:
cluster date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
Since inplace=True, you don't need to assign the result back to data.
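If you'd rather avoid inplace, the equivalent assignment form is:
data = data.eval('variance = budget + actual')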
I was looking for similar topics, but I only found month-over-month change overall. I would like the month-over-month change in a value (e.g. UPL), but per user, as in the example below.
user_id  month                UPL
      1  2022-01-01 00:00:00  100
      1  2022-02-01 00:00:00  200
      2  2022-01-01 00:00:00  100
      2  2022-02-01 00:00:00   50
      1  2022-03-01 00:00:00  150
And I would like an additional column named "UPL_change_by_month":
user_id  month                UPL  UPL_change_by_month
      1  2022-01-01 00:00:00  100                    0
      1  2022-02-01 00:00:00  200                  100
      2  2022-01-01 00:00:00  100                    0
      2  2022-02-01 00:00:00   50                  -50
      1  2022-03-01 00:00:00  150                  -50
Is this possible using aggfunc or the shift function in pandas?
IIUC, you can use groupby + diff:
df['UPL_change_by_month'] = df.sort_values('month').groupby('user_id')['UPL'].diff().fillna(0)
print(df)
# Output
user_id month UPL UPL_change_by_month
0 1 2022-01-01 100 0.0
1 1 2022-02-01 200 100.0
2 2 2022-01-01 100 0.0
3 2 2022-02-01 50 -50.0
4 1 2022-03-01 150 -50.0
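Since the question mentions shift, an equivalent sketch using groupby + shift instead of diff (note this version reorders the frame):
df = df.sort_values(['user_id', 'month'])
df['UPL_change_by_month'] = (df['UPL'] - df.groupby('user_id')['UPL'].shift()).fillna(0)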
I have the input data frame:
ID  Date        Amount
A   2021-08-03     100
A   2021-08-04     100
A   2021-08-06      20
A   2021-08-07     100
A   2021-08-09     300
A   2021-08-11     100
A   2021-08-12     100
A   2021-08-13      10
A   2021-08-23      10
A   2021-08-24      10
A   2021-08-26      10
A   2021-08-28      10
Desired output data frame:
ID  Date        Amount  TwoWeekSum
A   2021-08-03     100         320
A   2021-08-04     100         320
A   2021-08-06      20         320
A   2021-08-07     100         320
A   2021-08-09     300         830
A   2021-08-11     100         830
A   2021-08-12     100         830
A   2021-08-13      10         830
A   2021-08-23      10          40
A   2021-08-24      10          40
A   2021-08-26      10          40
A   2021-08-28      10          40
I want to calculate a rolling two-week total: TwoWeekSum = current week's total + previous week's total. For example, if the current week is week 34, then TwoWeekSum is the week 34 total plus the week 33 total. How can I get this into the output data frame above so that I can use it for further analysis?
Use:
# convert values to datetimes
df['Date'] = pd.to_datetime(df['Date'])
# convert dates to ISO week numbers
df['week'] = df['Date'].dt.isocalendar().week
# aggregate sum per ID and week, then add missing weeks and take a rolling 2-week sum
f = lambda x: (x.reindex(range(x.index.min(), x.index.max() + 1))
                 .rolling(2, min_periods=1).sum())
df1 = df.groupby(['ID', 'week'])['Amount'].sum().reset_index(level=0).groupby('ID').apply(f)
print(df1)
         Amount
ID week
A  31     320.0
   32     830.0
   33     510.0
   34      40.0
# last, add back to the original DataFrame per ID and week
df = df.join(df1.rename(columns={'Amount': 'TwoWeekSum'}), on=['ID', 'week']).drop('week', axis=1)
print(df)
ID Date Amount TwoWeekSum
0 A 2021-08-03 100 320.0
1 A 2021-08-04 100 320.0
2 A 2021-08-06 20 320.0
3 A 2021-08-07 100 320.0
4 A 2021-08-09 300 830.0
5 A 2021-08-11 100 830.0
6 A 2021-08-12 100 830.0
7 A 2021-08-13 10 830.0
8 A 2021-08-23 10 40.0
9 A 2021-08-24 10 40.0
10 A 2021-08-26 10 40.0
11 A 2021-08-28 10 40.0
Alternatively, convert each date to a weekly Period, take a rolling two-week sum per week, and map it back onto the rows:
per = pd.period_range(df['Date'].min(), df['Date'].max(), freq='w')
mapper = df.groupby(df['Date'].astype('Period[W]')).sum().reindex(per, fill_value=0).rolling(2, 1).sum()['Amount']
out = df['Date'].astype('Period[W]').map(mapper)
out
0 320.0
1 320.0
2 320.0
3 320.0
4 830.0
5 830.0
6 830.0
7 830.0
8 40.0
9 40.0
10 40.0
11 40.0
Name: Date, dtype: float64
Then assign out to the TwoWeekSum column:
df.assign(TwoWeekSum=out)
ID Date Amount TwoWeekSum
0 A 2021-08-03 100 320.0
1 A 2021-08-04 100 320.0
2 A 2021-08-06 20 320.0
3 A 2021-08-07 100 320.0
4 A 2021-08-09 300 830.0
5 A 2021-08-11 100 830.0
6 A 2021-08-12 100 830.0
7 A 2021-08-13 10 830.0
8 A 2021-08-23 10 40.0
9 A 2021-08-24 10 40.0
10 A 2021-08-26 10 40.0
11 A 2021-08-28 10 40.0
Update
If there is more than one ID, groupby and merge:
per = pd.period_range(df['Date'].min(), df['Date'].max(), freq='w')
s = df['Date'].astype('Period[W]')
idx = pd.MultiIndex.from_product([df['ID'].unique(), per])
df1 = df.groupby(['ID', s]).sum().reindex(idx, fill_value=0).rolling(2, 1).agg(sum).reset_index().set_axis(['ID', 'period', 'TwoWeekSum'], axis=1)
df.assign(period=s).merge(df1, how='left').drop('period', axis=1)
Try grouping by ISO week (dt.week is deprecated in newer pandas, so use dt.isocalendar().week), summing each week, filling in the missing weeks, and mapping a rolling two-week sum back onto the rows:
week = df['Date'].dt.isocalendar().week
weekly = df.groupby(week)['Amount'].sum().reindex(range(int(week.min()), int(week.max()) + 1), fill_value=0)
df['TwoWeekSum'] = week.map(weekly.rolling(2, min_periods=1).sum()).astype(int)
And then:
print(df)
Gives:
ID Date Amount TwoWeekSum
0 A 2021-08-03 100 320
1 A 2021-08-04 100 320
2 A 2021-08-06 20 320
3 A 2021-08-07 100 320
4 A 2021-08-09 300 830
5 A 2021-08-11 100 830
6 A 2021-08-12 100 830
7 A 2021-08-13 10 830
8 A 2021-08-23 10 40
9 A 2021-08-24 10 40
10 A 2021-08-26 10 40
11 A 2021-08-28 10 40
I have the following dataframe in pandas:
code date tank product time_frst time_lst qty_frst qty_lst
123 2019-01-01 1 MS 02:00:00 10:00:00 234 100
123 2019-01-01 2 HS 02:30:00 19:00:00 200 50
123 2019-01-01 3 MS 00:30:00 22:00:00 300 500
My desired dataframe is as follows:
code date tank product time qty
123 2019-01-01 1 MS 02:00:00 234
123 2019-01-01 1 MS 10:00:00 100
123 2019-01-01 2 HS 02:30:00 200
123 2019-01-01 2 HS 19:00:00 50
123 2019-01-01 3 MS 00:30:00 300
123 2019-01-01 3 MS 22:00:00 500
How can I do it in pandas?
Set all the columns without _ as the index, split the remaining column names into a MultiIndex on _, and then reshape with DataFrame.stack. The first reset_index drops the unnecessary level, and the second converts the remaining index levels back into columns:
df = df.set_index(['code','date','tank','product'])
df.columns = df.columns.str.split('_', expand=True)
df = df.stack().reset_index(level=4, drop=True).reset_index()
print(df)
code date tank product qty time
0 123 2019-01-01 1 MS 234 02:00:00
1 123 2019-01-01 1 MS 100 10:00:00
2 123 2019-01-01 2 HS 200 02:30:00
3 123 2019-01-01 2 HS 50 19:00:00
4 123 2019-01-01 3 MS 300 00:30:00
5 123 2019-01-01 3 MS 500 22:00:00
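As an alternative sketch, pd.wide_to_long can do the same reshape, assuming code, date, tank and product together uniquely identify the rows:
df2 = (pd.wide_to_long(df, stubnames=['time', 'qty'], sep='_', suffix=r'\w+',
                       i=['code', 'date', 'tank', 'product'], j='which')
         .reset_index()
         .drop('which', axis=1))  # 'which' holds the frst/lst suffix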
I am trying to run sort on data imported from Excel but am getting the error below. Why, after importing into a dataframe, does it say it's an ordered dictionary?
Error:
'OrderedDict' object has no attribute 'sort'
Code:
import pandas as pd
dfs = pd.read_excel("data.xlsx", sheet_name=None)
dfs
data_df = (dfs.sort(['Date','Tank','Time']).groupby(['Date','Tank']))
data_df
DF:
OrderedDict([(u'Sheet1',
Date Time Tank Sales Quantity Delivery
0 2018-01-01 06:30:00 1 100 3444 0
1 2018-01-01 07:00:00 1 200 3144 0
2 2018-01-01 05:30:00 1 100 2900 0
3 2018-01-01 07:30:00 1 200 2800 0
4 2018-01-01 06:30:00 2 50 3000 0
5 2018-01-01 07:00:00 2 100 2950 0
6 2018-01-01 05:30:00 2 150 2800 0
7 2018-01-01 07:30:00 2 100 2704 0
8 2018-01-02 06:30:00 1 100 3444 0
9 2018-01-02 07:00:00 1 200 3144 0
10 2018-01-02 05:30:00 1 100 2900 50
11 2018-01-02 07:30:00 1 200 2800 0
12 2018-01-02 06:30:00 2 50 3000 0
13 2018-01-02 07:00:00 2 100 2950 0
14 2018-01-02 05:30:00 2 150 2800 50
15 2018-01-02 07:30:00 2 100 2704 0)])
Because of the parameter sheet_name=None in read_excel:
sheet_name : string, int, mixed list of strings/ints, or None, default 0
...
None -> All sheets as a dictionary of DataFrames
Also check the docs on specifying sheets.
So, to return the first sheet, you don't need the parameter at all:
df = pd.read_excel("data.xlsx")
Or specify sheet_name if necessary:
df = pd.read_excel("data.xlsx", sheet_name='Sheet1')
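Note that even on a DataFrame, .sort was removed from pandas long ago; a sketch of the intended sorting step using sort_values:
df = pd.read_excel("data.xlsx")
data_df = df.sort_values(['Date', 'Tank', 'Time']).groupby(['Date', 'Tank'])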
I have the table below in a Pandas dataframe:
date user_id whole_cost cost1
02/10/2012 00:00:00 1 1790 12
07/10/2012 00:00:00 1 364 15
30/01/2013 00:00:00 1 280 10
02/02/2013 00:00:00 1 259 24
05/03/2013 00:00:00 1 201 39
02/10/2012 00:00:00 3 623 1
07/12/2012 00:00:00 3 90 0
30/01/2013 00:00:00 3 312 90
02/02/2013 00:00:00 5 359 45
05/03/2013 00:00:00 5 301 34
02/02/2013 00:00:00 5 359 1
05/03/2013 00:00:00 5 801 12
..
The table was extracted from a csv file with the following code:
import pandas as pd
newnames = ['date', 'user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names=newnames, index_col='date', parse_dates=['date'], dayfirst=True)  # dates are dd/mm/yyyy
I have to analyse the profile of my users, and for this purpose I would like to group the records by month (for each user; there are thousands), summing whole_cost over the entire month. For example, user_id=1 has a whole_cost of 1790 on 02/10/2012 and 364 on 07/10/2012, so the new table should have a single entry of 2154 on 31/10/2012 (every date in the transformed table will be a month end representing the whole month to which it relates).
In pandas 0.14 you'll be able to groupby monthly and by another column at the same time:
In [11]: df
Out[11]:
user_id whole_cost cost1
2012-10-02 1 1790 12
2012-10-07 1 364 15
2013-01-30 1 280 10
2013-02-02 1 259 24
2013-03-05 1 201 39
2012-10-02 3 623 1
2012-12-07 3 90 0
2013-01-30 3 312 90
2013-02-02 5 359 45
2013-03-05 5 301 34
2013-02-02 5 359 1
2013-03-05 5 801 12
In [12]: df1 = df.sort_index() # requires sorted DatetimeIndex
In [13]: df1.groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
Out[13]:
user_id
2012-10-31 1 2154
3 623
2012-12-31 3 90
2013-01-31 1 280
3 312
2013-02-28 1 259
5 718
2013-03-31 1 201
5 1102
Name: whole_cost, dtype: int64
Until 0.14, I think you're stuck with doing two groupbys:
In [14]: g = df.groupby('user_id')['whole_cost']
In [15]: g.resample('M', how='sum').dropna()
Out[15]:
user_id
1 2012-10-31 2154
2013-01-31 280
2013-02-28 259
2013-03-31 201
3 2012-10-31 623
2012-12-31 90
2013-01-31 312
5 2013-02-28 718
2013-03-31 1102
dtype: float64
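For reference, the how= argument was later removed from resample; in modern pandas the two-groupby version would look roughly like this (it still requires a DatetimeIndex):
g = df.groupby('user_id')['whole_cost']
g.resample('M').sum()  # empty months come back as 0 rather than being dropped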
With TimeGrouper deprecated, you can replace it with pd.Grouper to get the same results:
df.groupby(['user_id', pd.Grouper(key='date', freq='M')]).agg({'whole_cost': sum})
To group by day of week instead:
df.groupby(['user_id', df['date'].dt.dayofweek]).agg({'whole_cost': sum})