I have two dataframes as follows
df1
Location Month Date Ratio
A June Jun 1 0.2
A June Jun 2 0.3
A June Jun 3 0.4
B June Jun 1 0.6
B June Jun 2 0.7
B June Jun 3 0.8
And df2
Location Month Value
A June 1000
B June 2000
The result should be:
df3
Location Month Date Value
A June Jun 1 200
A June Jun 2 300
A June Jun 3 400
B June Jun 1 1200
B June Jun 2 1400
B June Jun 3 1600
How do I go about doing this? I am able to carry out division without a problem, since pandas does a good job of aligning indices during division, but with multiplication the result is all over the place.
Thanks.
You can use df.merge and df.assign (using the question's names, df1 and df2):
df1.assign(Value=df1.merge(df2, how='inner', on=['Location', 'Month'])['Value']
               .mul(df1['Ratio']))
#or
# df3 = df1.merge(df2, how='inner', on=['Location', 'Month'])
# df3['Value'] *= df3['Ratio']
Location Month Date Ratio Value
0 A June Jun 1 0.2 200.0
1 A June Jun 2 0.3 300.0
2 A June Jun 3 0.4 400.0
3 B June Jun 1 0.6 1200.0
4 B June Jun 2 0.7 1400.0
5 B June Jun 3 0.8 1600.0
Or, using df.set_index (again with the question's names):
df1.set_index(['Location', 'Month'], inplace=True)
df2.set_index(['Location', 'Month'], inplace=True)
df1['Value'] = df1['Ratio'] * df2['Value']
IIUC, if Location is the index for both dataframes, then you can use pandas.Series.mul:
df1["Value"] = df1.Ratio.mul(df2.Value)
df1
Month Date Ratio Value
Location
A June Jun 1 0.2 200.0
A June Jun 2 0.3 300.0
A June Jun 3 0.4 400.0
B June Jun 1 0.6 1200.0
B June Jun 2 0.7 1400.0
B June Jun 3 0.8 1600.0
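For reference, the merge approach can be run end to end on the sample frames from the question (a minimal sketch; the result frame is called df3 here):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Month': ['June'] * 6,
    'Date': ['Jun 1', 'Jun 2', 'Jun 3'] * 2,
    'Ratio': [0.2, 0.3, 0.4, 0.6, 0.7, 0.8],
})
df2 = pd.DataFrame({
    'Location': ['A', 'B'],
    'Month': ['June', 'June'],
    'Value': [1000, 2000],
})

# The merge broadcasts each location's Value across its rows; then multiply.
df3 = df1.merge(df2, how='inner', on=['Location', 'Month'])
df3['Value'] = df3['Value'] * df3['Ratio']
print(df3)
```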
Related
This is my available df; it contains years from 2016 to 2020:
Year Month Bill
-----------------
2016 1 2
2016 2 5
2016 3 10
2016 4 2
2016 5 4
2016 6 9
2016 7 7
2016 8 8
2016 9 9
2016 10 5
2016 11 1
2016 12 3
.
.
.
2020 12 10
Now I want to create 2 new columns in this dataframe, Level and Contribution. The Level column contains Q1, Q2, Q3, Q4, representing the 4 quarters of the year, and the Contribution column contains the average of the Bill column over the 3 months of that quarter in the respective year. For example, Q1 for 2016 will contain the average of the Bill values for months 1, 2 and 3 in the Contribution column, and likewise Q3 for 2020 will contain the average of months 7, 8 and 9 of 2020's Bill column. The expected dataframe is given below:
Year Month Bill levels contribution
------------------------------------
2016 1 2 Q1 5.66
2016 2 5 Q1 5.66
2016 3 10 Q1 5.66
2016 4 2 Q2 5
2016 5 4 Q2 5
2016 6 9 Q2 5
2016 7 7 Q3 8
2016 8 8 Q3 8
2016 9 9 Q3 8
2016 10 5 Q4 3
2016 11 1 Q4 3
2016 12 3 Q4 3
.
.
2020 10 2 Q4 6
2020 11 6 Q4 6
2020 12 10 Q4 6
This process is repeated for the 4 quarters of each year.
I am not able to figure this out, as it is something new to me.
You can try:
import math

df['levels'] = 'Q' + df['Month'].div(3).apply(math.ceil).astype(str)
df['contribution'] = df.groupby(['Year', 'levels'])['Bill'].transform('mean')
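A quick runnable check of these two lines on the 2016 sample data from the question (a sketch; note that math.ceil needs the math import):

```python
import math
import pandas as pd

df = pd.DataFrame({
    'Year': [2016] * 12,
    'Month': list(range(1, 13)),
    'Bill': [2, 5, 10, 2, 4, 9, 7, 8, 9, 5, 1, 3],
})

# Map month -> quarter label, then broadcast each quarter's mean back to its rows.
df['levels'] = 'Q' + df['Month'].div(3).apply(math.ceil).astype(str)
df['contribution'] = df.groupby(['Year', 'levels'])['Bill'].transform('mean')
print(df.head(3))
```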
Pandas actually has a set of datatypes for monthly and quarterly values called Pandas.Period.
See this similar question:
How to create a period range with year and week of year in python?
In your case it would look like this:
from datetime import datetime
# First create dates for 1st of each month period
df['Dates'] = [datetime(row['Year'], row['Month'], 1) for i, row in df[['Year', 'Month']].iterrows()]
# Create monthly periods
df['Month Periods'] = df['Dates'].dt.to_period('M')
# Use the new monthly index
df = df.set_index('Month Periods')
# Group by quarters
df_qtrly = df['Bill'].resample('Q').mean()
df_qtrly.index.names = ['Quarters']
print(df_qtrly)
Result:
Quarters
2016Q1 5.666667
2016Q2 5.000000
2016Q3 8.000000
2016Q4 3.000000
Freq: Q-DEC, Name: Bill, dtype: float64
If you want to put these values back into the monthly dataframe you could do this:
df['Quarters'] = df['Dates'].dt.to_period('Q')
df['Contributions'] = df_qtrly.loc[df['Quarters']].values
Year Month Bill Dates Quarters Contributions
Month Periods
2016-01 2016 1 2 2016-01-01 2016Q1 5.666667
2016-02 2016 2 5 2016-02-01 2016Q1 5.666667
2016-03 2016 3 10 2016-03-01 2016Q1 5.666667
2016-04 2016 4 2 2016-04-01 2016Q2 5.000000
2016-05 2016 5 4 2016-05-01 2016Q2 5.000000
2016-06 2016 6 9 2016-06-01 2016Q2 5.000000
2016-07 2016 7 7 2016-07-01 2016Q3 8.000000
2016-08 2016 8 8 2016-08-01 2016Q3 8.000000
2016-09 2016 9 9 2016-09-01 2016Q3 8.000000
2016-10 2016 10 5 2016-10-01 2016Q4 3.000000
2016-11 2016 11 1 2016-11-01 2016Q4 3.000000
2016-12 2016 12 3 2016-12-01 2016Q4 3.000000
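For completeness, the period-based steps above can be condensed into a runnable sketch on the 2016 sample (this variant assembles the dates with pd.to_datetime and uses a groupby instead of resample, purely for brevity; the quarterly means are the same):

```python
import pandas as pd

df = pd.DataFrame({'Year': [2016] * 12,
                   'Month': list(range(1, 13)),
                   'Bill': [2, 5, 10, 2, 4, 9, 7, 8, 9, 5, 1, 3]})

# Assemble a date per row, convert to quarterly periods, and average per quarter.
dates = pd.to_datetime(df.rename(columns={'Year': 'year', 'Month': 'month'})
                         [['year', 'month']].assign(day=1))
df['Quarters'] = dates.dt.to_period('Q')
df_qtrly = df.groupby('Quarters')['Bill'].mean()
print(df_qtrly)

# Broadcast the quarterly means back onto the monthly rows.
df['Contributions'] = df_qtrly.loc[df['Quarters']].values
```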
I have a dataset that looks like this:
overflow_data={'state': ['CA', 'CA', 'HI', 'HI', 'HI', 'NY', 'NY'],
'year': [2010, 2013, 2010, 2012, 2016, 2009, 2013],
'value': [1, 3, 1, 2, 3, 2, 5]}
pd.DataFrame(overflow_data)
Starting DataFrame:
I would like to fill in the missing years for each state, and use the prior year's values for those years, so the table would look like this:
Expected output:
I think you are looking for pivot and fill:
(df.pivot(index='year', columns='state', values='value')  # print this line alone to see what it does
   .ffill().bfill()           # fill the missing data within each state
   .unstack()                 # transform back to long form
   .reset_index(name='value')
)
Output:
state year value
0 CA 2009 1.0
1 CA 2010 1.0
2 CA 2012 1.0
3 CA 2013 3.0
4 CA 2016 3.0
5 HI 2009 1.0
6 HI 2010 1.0
7 HI 2012 2.0
8 HI 2013 2.0
9 HI 2016 3.0
10 NY 2009 2.0
11 NY 2010 2.0
12 NY 2012 2.0
13 NY 2013 5.0
14 NY 2016 5.0
Note: I just realized the above is slightly different from what you are asking for. It only spreads the data across the years already present in the data; it does not resample to a continuous range of years.
For what you ask, we can resort to reindex with groupby:
import numpy as np

(df.set_index('year').groupby('state')
   .apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)).ffill())
   .reset_index('state', drop=True)
   .reset_index()
)
Output:
year state value
0 2010 CA 1.0
1 2011 CA 1.0
2 2012 CA 1.0
3 2013 CA 3.0
4 2010 HI 1.0
5 2011 HI 1.0
6 2012 HI 2.0
7 2013 HI 2.0
8 2014 HI 2.0
9 2015 HI 2.0
10 2016 HI 3.0
11 2009 NY 2.0
12 2010 NY 2.0
13 2011 NY 2.0
14 2012 NY 2.0
15 2013 NY 5.0
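The same reindex idea can be checked end to end on the question's sample dictionary; this variant (a sketch) selects just the value column so only it is reindexed and forward-filled:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'CA', 'HI', 'HI', 'HI', 'NY', 'NY'],
                   'year': [2010, 2013, 2010, 2012, 2016, 2009, 2013],
                   'value': [1, 3, 1, 2, 3, 2, 5]})

# Reindex each state's series onto its full year range and forward-fill.
out = (df.set_index('year').groupby('state')['value']
         .apply(lambda s: s.reindex(np.arange(s.index.min(), s.index.max() + 1))
                           .ffill())
         .reset_index())
print(out)
```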
I am using a calendar data set for price prediction for different houses, with a date feature covering all 365 days of the year. I would like to reduce the data set by putting each listing's average monthly price in a new column.
input data:
listing_id date price months
1 2020-01-08 75.0 Jan
1 2020-01-09 100.0 Jan
1 2020-02-08 350.0 Feb
2 2020-01-08 465.0 Jan
2 2020-02-08 250.0 Feb
2 2020-02-09 250.0 Feb
Output data:
listing_id date Avg_price months
1 2020-01-08 90.0 Jan
1 2020-02-08 100.0 Feb
2 2020-01-08 50.0 Jan
2 2020-02-08 150.0 Feb
You can get the average price for each month using groupby:
g = df.groupby("months")["price"].mean()
You can then create the new columns:
for month, avg in g.items():
    df["average_{}".format(month)] = avg
Example with dummy data:
import pandas as pd
df = pd.DataFrame({'months':['Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Mar'],
'price':[1, 2, 3, 4, 5, 6]})
Result:
months price average_Feb average_Jan average_Mar
0 Jan 1 2.5 1.0 5.0
1 Feb 2 2.5 1.0 5.0
2 Feb 3 2.5 1.0 5.0
3 Mar 4 2.5 1.0 5.0
4 Mar 5 2.5 1.0 5.0
5 Mar 6 2.5 1.0 5.0
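The per-month columns can also be built without an explicit loop, e.g. via a dict comprehension passed to assign; this is purely a style choice and produces the same result:

```python
import pandas as pd

df = pd.DataFrame({'months': ['Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Mar'],
                   'price': [1, 2, 3, 4, 5, 6]})

g = df.groupby('months')['price'].mean()

# Build all average_<month> columns in a single assign call.
df = df.assign(**{f'average_{month}': avg for month, avg in g.items()})
print(df)
```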
I upvoted Dan's answer.
It may help to see another way to do this.
Additionally, if you ever have data that spans multiple years you may want a month_year column instead.
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html
Example:
df = pd.DataFrame({'price':[i for i in range(121)]},
index=pd.date_range(start='12/1/2017',end='3/31/2018'))
df = df.reset_index()
df['month_year'] = (df['index'].dt.month_name() + " " +
                    df['index'].dt.year.astype(str))
df.pivot_table(values='price',columns='month_year')
Result:
In [39]: df.pivot_table(values='price',columns='month_year')
Out[39]:
month_year December 2017 February 2018 January 2018 March 2018
price 15.0 75.5 46.0 105.0
My data looks like this
import numpy as np
import pandas as pd
# My Data
enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
n_students = [[100, 100, 110, 110, np.nan]]
df = pd.DataFrame(
n_students,
columns=pd.MultiIndex.from_arrays(
[enroll_year, grad_year],
names=['enroll_year', 'grad_year']))
print(df)
# enroll_year 2010 2011 2012 2013 2014
# grad_year 2014 2015 2016 2017 2018
# 0 100 100 110 110 NaN
What I am trying to do is to stack the data, one column/index level for year of enrollment, one for year of graduation and one for the numbers of students, which should look like
# enroll_year grad_year n
# 2010 2014 100.0
# . . .
# . . .
# . . .
# 2014 2018 NaN
The data produced by .stack() is very close, but the missing records are dropped:
df1 = df.stack(['enroll_year', 'grad_year'])
df1.index = df1.index.droplevel(0)
print(df1)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# dtype: float64
So .stack(dropna=False) was tried, but it expands the index to all combinations of enrollment and graduation years:
df2 = df.stack(['enroll_year', 'grad_year'], dropna=False)
df2.index = df2.index.droplevel(0)
print(df2)
# enroll_year grad_year
# 2010 2014 100.0
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2011 2014 NaN
# 2015 100.0
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2012 2014 NaN
# 2015 NaN
# 2016 110.0
# 2017 NaN
# 2018 NaN
# 2013 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 110.0
# 2018 NaN
# 2014 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# dtype: float64
And I need to subset df2 to get my desired data set.
existing_combn = list(zip(
    df.columns.levels[0][df.columns.codes[0]],
    df.columns.levels[1][df.columns.codes[1]]))
df3 = df2.loc[existing_combn]
print(df3)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# 2014 2018 NaN
# dtype: float64
Although it only adds a few more extra lines to my code, I wonder if there are any better and neater approaches.
Use unstack and wrap in pd.DataFrame, then reset_index, drop the unnecessary level column, and rename the value column:
pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Or:
df.unstack().reset_index(level=2, drop=True)
enroll_year grad_year
2010 2014 100.0
2011 2015 100.0
2012 2016 110.0
2013 2017 110.0
2014 2018 NaN
dtype: float64
Or:
df.unstack().reset_index(level=2, drop=True).reset_index().rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Explanation:
print(pd.DataFrame(df.unstack()))
0
enroll_year grad_year
2010 2014 0 100.0
2011 2015 0 100.0
2012 2016 0 110.0
2013 2017 0 110.0
2014 2018 0 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1))
enroll_year grad_year 0
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'}))
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
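Putting the first variant together against the sample frame from the question (a quick runnable check):

```python
import numpy as np
import pandas as pd

enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
df = pd.DataFrame(
    [[100, 100, 110, 110, np.nan]],
    columns=pd.MultiIndex.from_arrays([enroll_year, grad_year],
                                      names=['enroll_year', 'grad_year']))

# unstack -> Series with (enroll_year, grad_year, row) index;
# drop the unnamed row level and name the values column.
out = (pd.DataFrame(df.unstack())
         .reset_index()
         .drop('level_2', axis=1)
         .rename(columns={0: 'n'}))
print(out)
```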
I have the following df:
Jan 2004 Feb 2004 Mar 2004 Apr 2004 May 2004 Jun 2004 \
0 6.4 6.1 5.9 5.2 5.4 6.1
1 134673.0 130294.0 126006.0 111309.0 114147.0 131745.0
2 1985886.0 1990082.0 1999936.0 2009556.0 2009573.0 2013057.0
3 2120559.0 2120376.0 2125942.0 2120865.0 2123720.0 2144802.0
4 8.8 8.9 8.5 7.8 7.4 7.6
Jul 2004 Aug 2004 Sep 2004 Oct 2004 ... May 2014 \
0 6.0 5.9 5.6 5.5 ... 6.6
1 128010.0 126954.0 119043.0 119278.0 ... 142417.0
2 2019963.0 2015320.0 2015103.0 2035705.0 ... 2009815.0
3 2147973.0 2142274.0 2134146.0 2154983.0 ... 2152232.0
4 6.5 6.2 6.5 6.8 ... 6.8
Jun 2014 Jul 2014 Aug 2014 Sep 2014 Oct 2014 Nov 2014 \
0 7.4 7.6 7.2 6.2 6.0 5.7
1 161376.0 165248.0 154786.0 132918.0 128711.0 122831.0
2 2008339.0 2003562.0 1994433.0 2001023.0 2019314.0 2016260.0
3 2169715.0 2168810.0 2149219.0 2133941.0 2148025.0 2139091.0
4 7.0 6.3 6.0 6.2 6.2 6.4
Dec 2014 state type_string
0 5.5 01 foo
1 117466.0 01 barb
2 2005276.0 01 asd
3 2122742.0 01 foobarbar
4 6.4 02 foo
That is, for every US state I have a set of variables (foo, barb, asd, foobarbar), as given in type_string.
I would like to switch the data frame to a structure where the different dates (currently in the columns) become the lower level of the MultiIndex, and the state becomes the upper level of the MultiIndex.
I tried
datesIndex = df.columns[:-2]
stateIndex = pd.Index(df.state)
mindex = pd.MultiIndex.from_tuples((datesIndex, stateIndex))
df.pivot(index=mindex, columns='type_string')
but got
ValueError: Length mismatch: Expected axis has 208 elements, new values have 2 elements
How should I approach this?
Expected Output
foo barb asd foobarbar
date state
2004/01/01 1 6.4 134673.0 1985886 2120559
2004/02/01 1 6.1 130294.0 1990082 2120376
2004/03/01 1 5.9 126006.0 1999936 2125942
This can be accomplished with pivot/transpose:
In [195]: result = df.pivot(index='type_string', columns='state').T
In [196]: result.columns.name = None
In [197]: result
Out[197]:
asd barb foo foobarbar
state
Jan 2004 1 1985886 134673 6.4 2120559
2 NaN NaN 8.8 NaN
Feb 2004 1 1990082 130294 6.1 2120376
2 NaN NaN 8.9 NaN
The idea here is that columns='state' moves the state column into a column level next to the dates. Thus, transposing with .T swaps the index and columns producing the desired result.
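As a minimal illustration of the swap (a sketch using just two rows of the sample data; the state codes are kept as strings, as in the question):

```python
import pandas as pd

df = pd.DataFrame({'Jan 2004': [6.4, 134673.0],
                   'Feb 2004': [6.1, 130294.0],
                   'state': ['01', '01'],
                   'type_string': ['foo', 'barb']})

# Pivot puts 'state' into a column level next to the month columns;
# transposing then swaps the axes so (month, state) becomes the row index.
result = df.pivot(index='type_string', columns='state').T
result.columns.name = None
print(result)
```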