I have the following df:
Jan 2004 Feb 2004 Mar 2004 Apr 2004 May 2004 Jun 2004 \
0 6.4 6.1 5.9 5.2 5.4 6.1
1 134673.0 130294.0 126006.0 111309.0 114147.0 131745.0
2 1985886.0 1990082.0 1999936.0 2009556.0 2009573.0 2013057.0
3 2120559.0 2120376.0 2125942.0 2120865.0 2123720.0 2144802.0
4 8.8 8.9 8.5 7.8 7.4 7.6
Jul 2004 Aug 2004 Sep 2004 Oct 2004 ... May 2014 \
0 6.0 5.9 5.6 5.5 ... 6.6
1 128010.0 126954.0 119043.0 119278.0 ... 142417.0
2 2019963.0 2015320.0 2015103.0 2035705.0 ... 2009815.0
3 2147973.0 2142274.0 2134146.0 2154983.0 ... 2152232.0
4 6.5 6.2 6.5 6.8 ... 6.8
Jun 2014 Jul 2014 Aug 2014 Sep 2014 Oct 2014 Nov 2014 \
0 7.4 7.6 7.2 6.2 6.0 5.7
1 161376.0 165248.0 154786.0 132918.0 128711.0 122831.0
2 2008339.0 2003562.0 1994433.0 2001023.0 2019314.0 2016260.0
3 2169715.0 2168810.0 2149219.0 2133941.0 2148025.0 2139091.0
4 7.0 6.3 6.0 6.2 6.2 6.4
Dec 2014 state type_string
0 5.5 01 foo
1 117466.0 01 barb
2 2005276.0 01 asd
3 2122742.0 01 foobarbar
4 6.4 02 foo
That is, I have for every US state a set of variables (foo, barb, asd, foobarbar, foo), as in type_string.
I would like to switch the data frame to a structure where the different dates (currently in the columns) become the lower level of the MultiIndex, and the state becomes the upper level of the MultiIndex.
I tried
datesIndex = df.columns[:-2]
stateIndex = pd.Index(df.state)
mindex = pd.MultiIndex.from_tuples((datesIndex, stateIndex))
df.pivot(index=mindex, columns='type_string')
but got
ValueError: Length mismatch: Expected axis has 208 elements, new values have 2 elements
How should I approach this?
Expected Output
foo barb asd foobarbar
date state
2004/01/01 1 6.4 134673.0 1985886 2120559
2004/02/01 1 6.1 130294.0 1990082 2120376
2004/03/01 1 5.9 126006.0 1999936 2125942
This can be accomplished with pivot/transpose:
In [195]: result = df.pivot(index='type_string', columns='state').T
In [196]: result.columns.name = None
In [197]: result
Out[197]:
asd barb foo foobarbar
state
Jan 2004 1 1985886 134673 6.4 2120559
2 NaN NaN 8.8 NaN
Feb 2004 1 1990082 130294 6.1 2120376
2 NaN NaN 8.9 NaN
The idea here is that columns='state' moves the state column into a column level next to the dates. Thus, transposing with .T swaps the index and columns producing the desired result.
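As a rough sketch of the same idea on a cut-down frame (only two month columns, values copied from the question), the snippet below also converts the month labels to real dates, as in the expected output. The `"%b %Y"` format string and the level name `date` are my assumptions, not part of the original answer.

```python
import pandas as pd

# Cut-down version of the question's frame: two month columns only.
df = pd.DataFrame({
    "Jan 2004": [6.4, 134673.0, 8.8],
    "Feb 2004": [6.1, 130294.0, 8.9],
    "state": ["01", "01", "02"],
    "type_string": ["foo", "barb", "foo"],
})

# Pivot so that columns become (month, state) pairs, then transpose
# so those pairs end up in the row index.
result = df.pivot(index="type_string", columns="state").T
result.columns.name = None

# Assumption: month labels look like "Jan 2004", so "%b %Y" parses them.
dates = pd.to_datetime(result.index.get_level_values(0), format="%b %Y")
states = result.index.get_level_values(1)
result.index = pd.MultiIndex.from_arrays([dates, states], names=["date", "state"])

print(result)
```

State 02 only has a `foo` row in this sample, so its `barb` cells come out as NaN, just like in the output above.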
I have a dataset that looks like this:
overflow_data={'state': ['CA', 'CA', 'HI', 'HI', 'HI', 'NY', 'NY'],
'year': [2010, 2013, 2010, 2012, 2016, 2009, 2013],
'value': [1, 3, 1, 2, 3, 2, 5]}
pd.DataFrame(overflow_data)
Starting DataFrame:
I would like to fill in the missing years for each state, and use the prior year's values for those years, so the table would look like this:
Expected output:
I think you are looking for pivot and fill:
(df.pivot(index='year', columns='state', values='value') # keyword arguments are required in pandas 2.0+; print this line alone to see what it does
.ffill().bfill() # fill the missing data within each state's column
.unstack() # transform back to original form
.reset_index(name='value')
)
Output:
state year value
0 CA 2009 1.0
1 CA 2010 1.0
2 CA 2012 1.0
3 CA 2013 3.0
4 CA 2016 3.0
5 HI 2009 1.0
6 HI 2010 1.0
7 HI 2012 2.0
8 HI 2013 2.0
9 HI 2016 3.0
10 NY 2009 2.0
11 NY 2010 2.0
12 NY 2012 2.0
13 NY 2013 5.0
14 NY 2016 5.0
Note: I just realized that the above is slightly different from what you are asking for. It only spreads the data over the years that already appear somewhere in the data; it does not resample each state onto a continuous range of years.
For what you ask, we can resort to reindex with groupby:
(df.set_index('year').groupby('state')
.apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max()+1)).ffill())
.reset_index('state',drop=True)
.reset_index()
)
Output:
year state value
0 2010 CA 1.0
1 2011 CA 1.0
2 2012 CA 1.0
3 2013 CA 3.0
4 2010 HI 1.0
5 2011 HI 1.0
6 2012 HI 2.0
7 2013 HI 2.0
8 2014 HI 2.0
9 2015 HI 2.0
10 2016 HI 3.0
11 2009 NY 2.0
12 2010 NY 2.0
13 2011 NY 2.0
14 2012 NY 2.0
15 2013 NY 5.0
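If the groupby/apply one-liner is hard to follow, here is a minimal sketch of the same reindex-and-ffill idea written as an explicit loop over the states. The helper name `fill_years` is made up for illustration, and the frame is rebuilt from the question's data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "HI", "HI", "HI", "NY", "NY"],
    "year": [2010, 2013, 2010, 2012, 2016, 2009, 2013],
    "value": [1, 3, 1, 2, 3, 2, 5],
})

def fill_years(group):
    """Hypothetical helper: one row per year between the group's min and max year."""
    years = np.arange(group["year"].min(), group["year"].max() + 1)
    # Reindexing introduces NaN for the missing years; ffill carries the prior value forward.
    filled = group.set_index("year")["value"].reindex(years).ffill()
    return filled.rename_axis("year").reset_index()

pieces = []
for state, group in df.groupby("state"):
    piece = fill_years(group)
    piece.insert(0, "state", state)  # put the state label back as the first column
    pieces.append(piece)

out = pd.concat(pieces, ignore_index=True)
print(out)
```

Each state only gets the years between its own first and last observation, which is the difference from the pivot version.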
I have two dataframes as follows
df1
Location Month Date Ratio
A June Jun 1 0.2
A June Jun 2 0.3
A June Jun 3 0.4
B June Jun 1 0.6
B June Jun 2 0.7
B June Jun 3 0.8
And df2
Location Month Value
A June 1000
B June 2000
Result should be as :
df3
Location Month Date Value
A June Jun 1 200
A June Jun 2 300
A June Jun 3 400
B June Jun 1 1200
B June Jun 2 1400
B June Jun 3 1600
How do I go about doing this? Division works without problems, since pandas does a good job of matching the indices, but with multiplication the result is all over the place.
Thanks.
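You can use df.merge and df.assign (here df is the frame holding Ratio, i.e. df1 in the question, and df1 is the frame holding Value, i.e. df2 in the question):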
df.assign(Value = df.merge(df1,how='inner',on=['Location','Month'])['Value'].\
mul(df['Ratio']))
#or
# df = df.merge(df1,how='inner',on=['Location','Month'])
# df['Value']*=df['Ratio']
Location Month Date Ratio Value
0 A June Jun 1 0.2 200.0
1 A June Jun 2 0.3 300.0
2 A June Jun 3 0.4 400.0
3 B June Jun 1 0.6 1200.0
4 B June Jun 2 0.7 1400.0
5 B June Jun 3 0.8 1600.0
Or
using df.set_index
df.set_index(['Location','Month'],inplace=True)
df1.set_index(['Location','Month'],inplace=True)
df['Value'] = df['Ratio']*df1['Value']
IIUC, and if Location is the index of both dataframes, then you can use pandas.Series.mul:
df1["Value"] = df1.Ratio.mul(df2.Value)
df1
Month Date Ratio Value
Location
A June Jun 1 0.2 200.0
A June Jun 2 0.3 300.0
A June Jun 3 0.4 400.0
B June Jun 1 0.6 1200.0
B June Jun 2 0.7 1400.0
B June Jun 3 0.8 1600.0
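For reference, a self-contained sketch of the merge-then-multiply route using the question's own names (df1 for the ratios, df2 for the values). The `how="left"` and `validate="many_to_one"` arguments are my additions: the latter makes pandas raise if df2 ever has more than one row per Location/Month pair.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "B", "B"],
    "Month": ["June"] * 6,
    "Date": ["Jun 1", "Jun 2", "Jun 3"] * 2,
    "Ratio": [0.2, 0.3, 0.4, 0.6, 0.7, 0.8],
})
df2 = pd.DataFrame({
    "Location": ["A", "B"],
    "Month": ["June", "June"],
    "Value": [1000, 2000],
})

# Bring the per-location Value onto every daily row, then scale it by Ratio.
merged = df1.merge(df2, on=["Location", "Month"], how="left", validate="many_to_one")
df3 = merged.assign(Value=merged["Ratio"] * merged["Value"]).drop(columns="Ratio")

print(df3)
```

Because the merge puts both columns in one frame, the multiplication is row by row and does not depend on how the two original indices line up.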
My data looks like this
import numpy as np
import pandas as pd
# My Data
enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
n_students = [[100, 100, 110, 110, np.nan]]
df = pd.DataFrame(
n_students,
columns=pd.MultiIndex.from_arrays(
[enroll_year, grad_year],
names=['enroll_year', 'grad_year']))
print(df)
# enroll_year 2010 2011 2012 2013 2014
# grad_year 2014 2015 2016 2017 2018
# 0 100 100 110 110 NaN
What I am trying to do is to stack the data, one column/index level for year of enrollment, one for year of graduation and one for the numbers of students, which should look like
# enroll_year grad_year n
# 2010 2014 100.0
# . . .
# . . .
# . . .
# 2014 2018 NaN
The data produced by .stack() is very close, but the record with the missing value is dropped:
df1 = df.stack(['enroll_year', 'grad_year'])
df1.index = df1.index.droplevel(0)
print(df1)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# dtype: float64
So I tried .stack(dropna=False), but it expands the index levels to all combinations of enrollment and graduation years:
df2 = df.stack(['enroll_year', 'grad_year'], dropna=False)
df2.index = df2.index.droplevel(0)
print(df2)
# enroll_year grad_year
# 2010 2014 100.0
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2011 2014 NaN
# 2015 100.0
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2012 2014 NaN
# 2015 NaN
# 2016 110.0
# 2017 NaN
# 2018 NaN
# 2013 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 110.0
# 2018 NaN
# 2014 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# dtype: float64
And I need to subset df2 to get my desired data set.
# MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24
existing_combn = list(zip(
df.columns.levels[0][df.columns.codes[0]],
df.columns.levels[1][df.columns.codes[1]]))
df3 = df2.loc[existing_combn]
print(df3)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# 2014 2018 NaN
# dtype: float64
Although it only adds a few more extra lines to my code, I wonder if there are any better and neater approaches.
Use unstack and wrap the result in pd.DataFrame, then reset_index, drop the unnecessary level_2 column, and rename the value column:
pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Or:
df.unstack().reset_index(level=2, drop=True)
enroll_year grad_year
2010 2014 100.0
2011 2015 100.0
2012 2016 110.0
2013 2017 110.0
2014 2018 NaN
dtype: float64
Or:
df.unstack().reset_index(level=2, drop=True).reset_index().rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Explanation :
print(pd.DataFrame(df.unstack()))
0
enroll_year grad_year
2010 2014 0 100.0
2011 2015 0 100.0
2012 2016 0 110.0
2013 2017 0 110.0
2014 2018 0 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1))
enroll_year grad_year 0
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'}))
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
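Why this keeps the NaN row: on a frame whose row index is not a MultiIndex, unstack returns a Series with one entry per existing column, keyed by the column levels plus the row label, so nothing is dropped and no extra combinations are created. A short sketch of that, using Series.droplevel instead of reset_index(level=2, drop=True) (an equivalent swap on my part, not what the answer above uses):

```python
import numpy as np
import pandas as pd

enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
df = pd.DataFrame(
    [[100, 100, 110, 110, np.nan]],
    columns=pd.MultiIndex.from_arrays(
        [enroll_year, grad_year], names=["enroll_year", "grad_year"]),
)

# Series indexed by (enroll_year, grad_year, original row label 0).
s = df.unstack()

# Drop the row-label level, name the values "n", and turn the index into columns.
tidy = s.droplevel(-1).rename("n").reset_index()
print(tidy)
```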
I have the following array:
[299.13953679 241.1902389 192.58645951 ... 8.53750551 24.38822528
71.61117789]
For each value in the array I want to get the interpolated wind speed based on the values in the column power in the following pd.DataFrame:
wind speed power
5 2.5 0
6 3.0 25
7 3.5 82
8 4.0 154
9 4.5 244
10 5.0 354
11 5.5 486
12 6.0 643
13 6.5 827
14 7.0 1038
15 7.5 1272
16 8.0 1525
17 8.5 1794
18 9.0 2037
19 9.5 2211
20 10.0 2362
21 10.5 2386
22 11.0 2400
So basically I'd like to retrieve the following array:
[4.7 4.5 4.3 ... 2.6 3.0 3.4]
Any suggestions on where to start? I was looking at the pd.DataFrame.interpolate function but reading through its functionalities it does not seem to be helpful in my problem. Or am I wrong?
Using interp from numpy
np.interp(ary,df['power'].values,df['wind speed'].values)
Out[202]:
array([4.75063426, 4.48439022, 4.21436922, 2.67075011, 2.98776451,
3.40886998])
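A self-contained version of the same call, using only the first few rows of the power curve and the six array values visible in the question. The names `curve` and `power_values` are placeholders of mine. One thing to keep in mind: np.interp expects the x-coordinates (here the power column) to be increasing, so the sketch checks that first.

```python
import numpy as np
import pandas as pd

# First rows of the power curve from the question.
curve = pd.DataFrame({
    "wind speed": [2.5, 3.0, 3.5, 4.0, 4.5, 5.0],
    "power": [0, 25, 82, 154, 244, 354],
})
power_values = np.array([299.13953679, 241.1902389, 192.58645951,
                         8.53750551, 24.38822528, 71.61117789])

xp = curve["power"].to_numpy()       # known x: power
fp = curve["wind speed"].to_numpy()  # known y: wind speed

# np.interp silently gives wrong answers if xp is not increasing.
if not np.all(np.diff(xp) > 0):
    raise ValueError("power column must be strictly increasing")

speeds = np.interp(power_values, xp, fp)
print(speeds.round(1))
```

Values outside the range of the power column are clamped to the first or last wind speed rather than extrapolated.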
I have the following df and want to fill the Number column backwards (from the oldest year upwards), overwriting existing values where necessary. The condition is to always carry the previous value forward unless the new value differs from it by more than 10%.
Date Number
2019 150
2018 NaN
2017 118
2016 NaN
2015 115
2014 107
2013 105
2012 NaN
2011 100
Because of the condition the value in e.g. 2013 is equal to 100, because it is not smaller than 90 and not greater than 110. The result would look like this:
Date Number
2019 150
2018 115
2017 115
2016 115
2015 115
2014 100
2013 100
2012 100
2011 100
You can reverse your column and then apply a function to update values. Finally reverse the column to the original order:
def get_val(x):
global prev_num
if x and x > prev_num*1.1:
prev_num = x
return prev_num
prev_num = 0
df['number'] = df['number'][::-1].apply(get_val)[::-1]
Group consecutive rows whose back-filled value falls in the same bucket of 10 (floor division by 10, with a new group starting wherever the bucketed value changes), then transform each group with min, i.e.
df['x'] = df.groupby((df['number'].bfill()[::-1]//10).diff().ne(0).cumsum())['number'].transform(min)
date number x
0 2019 150.0 150.0
1 2018 NaN 115.0
2 2017 118.0 115.0
3 2016 NaN 115.0
4 2015 115.0 115.0
5 2014 107.0 100.0
6 2013 105.0 100.0
7 2012 NaN 100.0
8 2011 100.0 100.0
Here is one way. It assumes the first value 100 is not NaN and the original dataframe is ordered descending by year. If performance is an issue, the loop can be converted to a list comprehension.
lst = df.sort_values('date')['number'].ffill().tolist()
for i in range(1, len(lst)):
if abs(lst[i] - lst[i-1]) / lst[i] <= 0.10:
lst[i] = lst[i-1]
df['number'] = list(reversed(lst))
# date number
# 0 2019 150.0
# 1 2018 115.0
# 2 2017 115.0
# 3 2016 115.0
# 4 2015 115.0
# 5 2014 100.0
# 6 2013 100.0
# 7 2012 100.0
# 8 2011 100.0
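Here is a sketch of the loop wrapped in a function. Note that it measures the 10% band relative to the value currently being held (100 keeps holding for anything in 90..110, as described in the question), whereas the loop above divides by the new value; that choice, the helper name `hold_within_tolerance`, and the `tol` parameter are my assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": [2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011],
    "number": [150, np.nan, 118, np.nan, 115, 107, 105, np.nan, 100],
})

def hold_within_tolerance(values, tol=0.10):
    """Hypothetical helper: keep the held value unless the new one differs by more than tol."""
    out = []
    for v in values:
        # NaN, or a value inside the band around the held value: keep holding.
        if out and (np.isnan(v) or abs(v - out[-1]) <= tol * out[-1]):
            out.append(out[-1])
        else:
            out.append(v)
    return out

# Walk from the oldest year to the newest, then write back in the original row order.
ordered = df.sort_values("date")
df.loc[ordered.index, "number"] = hold_within_tolerance(ordered["number"].tolist())
print(df)
```

Writing back through `ordered.index` means the original frame does not have to be sorted in any particular way.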