This question already has answers here:
Reshape wide to long in pandas
(2 answers)
Closed 3 years ago.
I have the below table, which shows rainfall by month in the UK across a number of years. I want to reshape it so that each row is one month/year combination and the data is chronological.
Year JAN FEB MAR APR
2010 79.7 74.8 79.4 48
2011 102.8 114.5 49.7 36.7
2012 110.9 60 37 128
2013 110.5 59.8 64.6 63.6
I would like it so the table looks like the below with year, month & rainfall as the columns:
2010 JAN 79.7
2010 FEB 74.8
2010 MAR 79.4
2010 APR 48
2011 JAN 102.8
2011 FEB 114.5
I think I need to use a for loop and iterate through each row to create a new dataframe, but I'm not sure of the syntax. I've tried the below loop, which nearly does what I want but doesn't output a dataframe.
for index, row in weather.iterrows():
    print(row["Year"], row)
2014.0 Year 2014.0
JAN 188.0
FEB 169.2
MAR 80.0
APR 67.8
MAY 99.6
JUN 54.8
JUL 64.7
Any help would be appreciated.
You should avoid using for loops here and instead use stack:
df.set_index('Year') \
  .stack() \
  .reset_index() \
  .rename(columns={'level_1': 'Month', 0: 'Amount'})
Year Month Amount
0 2010 JAN 79.7
1 2010 FEB 74.8
2 2010 MAR 79.4
3 2010 APR 48.0
4 2011 JAN 102.8
5 2011 FEB 114.5
6 2011 MAR 49.7
7 2011 APR 36.7
8 2012 JAN 110.9
9 2012 FEB 60.0
etc...
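For reference, a self-contained version of the above that can be run as-is, rebuilding the sample table from the question:

```python
import pandas as pd

# Rebuild the wide rainfall table from the question.
df = pd.DataFrame({
    'Year': [2010, 2011, 2012, 2013],
    'JAN': [79.7, 102.8, 110.9, 110.5],
    'FEB': [74.8, 114.5, 60.0, 59.8],
    'MAR': [79.4, 49.7, 37.0, 64.6],
    'APR': [48.0, 36.7, 128.0, 63.6],
})

# stack() moves the month columns into the index, producing one row per
# (Year, Month) pair and preserving the original row order (chronological).
long_df = (df.set_index('Year')
             .stack()
             .reset_index()
             .rename(columns={'level_1': 'Month', 0: 'Amount'}))
print(long_df.head())
```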
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I am trying to create a new column on an existing dataframe based on values of another dataframe.
# Define a dataframe containing 2 columns Date-Year and Date-Qtr
data1 = {'Date-Year': [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2017],
         'Date-Qtr': ['2015Q1', '2015Q2', '2015Q3', '2015Q4', '2016Q1', '2016Q2', '2016Q3', '2016Q4', '2017Q1', '2017Q2']}
dfx = pd.DataFrame(data1)
# Define another dataframe containing 2 columns Date-Year and Interest Rate
data2 = {'Date-Year': [2000, 2015, 2016, 2017, 2018, 2019, 2020, 2021],
         'Interest Rate': [0.00, 8.20, 8.20, 7.75, 7.50, 7.50, 6.50, 6.50]}
dfy = pd.DataFrame(data2)
# Add 1 more column to the first dataframe
dfx['Int-rate'] = float(0)
Output for dfx
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 0.0
1 2015 2015Q2 0.0
2 2015 2015Q3 0.0
3 2015 2015Q4 0.0
4 2016 2016Q1 0.0
5 2016 2016Q2 0.0
6 2016 2016Q3 0.0
7 2016 2016Q4 0.0
8 2017 2017Q1 0.0
9 2017 2017Q2 0.0
Output for dfy
Date-Year Interest Rate
0 2000 0.00
1 2015 8.20
2 2016 8.20
3 2017 7.75
4 2018 7.50
5 2019 7.50
6 2020 6.50
7 2021 6.50
Now I need to update the 'Int-rate' column of dfx by picking up the value of 'Interest Rate' from dfy for the corresponding year, which I am achieving with two for loops:
# Check the year from dfx, go to dfy, find the interest rate for that
# year, and write it into dfx's Int-rate column.
for i in range(len(dfx['Date-Year'])):
    for j in range(len(dfy['Date-Year'])):
        if dfx['Date-Year'][i] == dfy['Date-Year'][j]:
            dfx['Int-rate'][i] = dfy['Interest Rate'][j]
and I get the desired output
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
Is there a way I can achieve the same output:
- without declaring dfx['Int-rate'] = float(0) first? I get a KeyError: 'Int-rate' if I don't declare it.
- without the two for loops? I'm not very happy with them. Is it possible to do this in a better way, for example with map, merge, or joins?
I have tried looking through other posts, and the best one I found is here; I tried using map but could not get it to work. Any help will be appreciated.
Thanks.
You could use replace with a dictionary:
dfx['Int-Rate'] = dfx['Date-Year'].replace(dict(dfy.to_numpy()))
print(dfx)
Output
Date-Year Date-Qtr Int-Rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
Or with a Series as an alternative:
dfx['Int-Rate'] = dfx['Date-Year'].replace(dfy.set_index('Date-Year').squeeze())
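Since the question also asks about map, here is a sketch of that approach: build a year-to-rate lookup Series from dfy and map it onto dfx. No pre-declared dfx['Int-rate'] column is needed, because the assignment creates it (this assumes 'Date-Year' is unique in dfy):

```python
import pandas as pd

dfx = pd.DataFrame({'Date-Year': [2015, 2015, 2016, 2017],
                    'Date-Qtr': ['2015Q1', '2015Q2', '2016Q1', '2017Q1']})
dfy = pd.DataFrame({'Date-Year': [2000, 2015, 2016, 2017],
                    'Interest Rate': [0.00, 8.20, 8.20, 7.75]})

# Year -> rate lookup Series; map looks each dfx year up in its index.
rates = dfy.set_index('Date-Year')['Interest Rate']
dfx['Int-rate'] = dfx['Date-Year'].map(rates)
print(dfx)
```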
You can simply use df.merge:
In [4448]: df = dfx.merge(dfy).rename(columns={'Interest Rate':'Int-rate'})
In [4449]: df
Out[4449]:
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
I have a dataframe which has a datetime column; let's call it my_dates.
I also have a list of dates which has say 5 dates for this example.
15th Jan 2020
20th Mar 2020
28th Jun 2020
20th Jul 2020
8th Aug 2020
What I want to do is create another column in my dataframe that looks at each datetime in the my_dates column and takes the earliest date from my list that is not less than it. For example, for a row with 23rd June 2020, I want the new column to hold 28th June 2020. Hopefully the examples below are clear.
More examples
my_dates expected_values
14th Jan 2020 15th Jan 2020
15th Jan 2020 15th Jan 2020
16th Jan 2020 20th Mar 2020
... ...
19th Mar 2020 20th Mar 2020
20th Mar 2020 20th Mar 2020
21st Mar 2020 28th Jun 2020
What is the most efficient way to do this rather than looping?
IIUC, you need pd.merge_asof with the direction argument set to 'forward':
dates = ['15th Jan 2020',
         '20th Mar 2020',
         '28th Jun 2020',
         '20th Jul 2020',
         '8th Aug 2020']
dates_proper = [pd.to_datetime(d) for d in dates]
df = pd.DataFrame(pd.date_range('2020-01-14', '2020-03-21'), columns=['my_dates'])
df1 = pd.DataFrame(dates_proper, columns=['date_list'])
merged_df = pd.merge_asof(
    df, df1, left_on=["my_dates"], right_on=["date_list"], direction="forward"
)
print(merged_df)
my_dates date_list
0 2020-01-14 2020-01-15
1 2020-01-15 2020-01-15
2 2020-01-16 2020-03-20
3 2020-01-17 2020-03-20
4 2020-01-18 2020-03-20
.. ... ...
63 2020-03-17 2020-03-20
64 2020-03-18 2020-03-20
65 2020-03-19 2020-03-20
66 2020-03-20 2020-03-20
67 2020-03-21 2020-06-28
Finally, a use case for pd.merge_asof! :) From the documentation:
Perform an asof merge. This is similar to a left-join except that we match on nearest key rather than equal keys.
It would have been helpful to make your example reproducible like this:
In [12]: reference = pd.DataFrame([['15th Jan 2020'], ['20th Mar 2020'], ['28th Jun 2020'], ['20th Jul 2020'], ['8th Aug 2020']], columns=['reference']).astype('datetime64[ns]')
In [13]: my_dates = pd.DataFrame([['14th Jan 2020'], ['15th Jan 2020'], ['16th Jan 2020'], ['19th Mar 2020'], ['20th Mar 2020'], ['21st Mar 2020']], columns=['dates']).astype('datetime64[ns]')
In [15]: pd.merge_asof(my_dates, reference, left_on='dates', right_on='reference', direction='forward')
Out[15]:
dates reference
0 2020-01-14 2020-01-15
1 2020-01-15 2020-01-15
2 2020-01-16 2020-03-20
3 2020-03-19 2020-03-20
4 2020-03-20 2020-03-20
5 2020-03-21 2020-06-28
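If you would rather avoid a merge altogether, numpy.searchsorted can find the first list date at or after each row's date. A sketch on the question's sample dates; it assumes the list is sorted and that every value in my_dates has a successor in the list (otherwise the index would run past the end):

```python
import numpy as np
import pandas as pd

date_list = pd.to_datetime(['2020-01-15', '2020-03-20', '2020-06-28',
                            '2020-07-20', '2020-08-08'])
df = pd.DataFrame({'my_dates': pd.to_datetime(
    ['2020-01-14', '2020-01-15', '2020-01-16',
     '2020-03-19', '2020-03-20', '2020-03-21'])})

# side='left' returns, for each date, the position of the first
# list entry that is >= it.
idx = np.searchsorted(date_list.values, df['my_dates'].values, side='left')
df['expected_values'] = date_list[idx]
print(df)
```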
I've created a pandas dataframe using the read_html method from an external source. There's no problem creating the dataframe; however, I'm stuck trying to adjust the structure of the first column, 'Month'.
The data I'm scraping is updated once a month at the source, so the solution needs to be dynamic. So far I've only been able to achieve the desired outcome by using .iloc to manually update each row, which works fine only until the data is updated at the source next month.
This is what my dataframe looks like:
df = pd.read_html(url)
df
Month Value
0 2017 NaN
1 November 1.29
2 December 1.29
3 2018 NaN
4 January 1.29
5 February 1.29
6 March 1.29
7 April 1.29
8 May 1.29
9 June 1.28
10 July 1.28
11 August 1.28
12 September 1.28
13 October 1.26
14 November 1.16
15 December 1.09
16 2019 NaN
17 January 1.25
18 February 1.34
19 March 1.34
20 April 1.34
This is my desired outcome:
df
Month Value
0 November 2017 1.29
2 December 2017 1.29
4 January 2018 1.29
5 February 2018 1.29
6 March 2018 1.29
7 April 2018 1.29
8 May 2018 1.29
9 June 2018 1.28
10 July 2018 1.28
11 August 2018 1.28
12 September 2018 1.28
13 October 2018 1.26
14 November 2018 1.16
15 December 2018 1.09
17 January 2019 1.25
18 February 2019 1.34
19 March 2019 1.34
20 April 2019 1.34
Right now the best idea I've come up with is to select, extract and append the year to each row in the 'Month' column until the month 'December' is reached, then increment to the next year, but I have no idea how to implement this in code. Would this be a viable solution (and how could it be implemented?), or is there a better way?
Many thanks from a long-time reader and first-time poster on Stack Overflow!
Use where with ffill, based on Value: rows where Value is NaN hold a year rather than a month name, so forward-fill those year values and append them to each month, then drop the year-only rows:
df.Month = df.Month + ' ' + df.Month.where(df.Value.isna()).ffill().astype(str)
df.dropna(inplace=True)
df
Out[29]:
Month Value
1 November 2017 1.29
2 December 2017 1.29
4 January 2018 1.29
5 February 2018 1.29
6 March 2018 1.29
7 April 2018 1.29
8 May 2018 1.29
9 June 2018 1.28
10 July 2018 1.28
11 August 2018 1.28
12 September 2018 1.28
13 October 2018 1.26
14 November 2018 1.16
15 December 2018 1.09
17 January 2019 1.25
18 February 2019 1.34
19 March 2019 1.34
20 April 2019 1.34
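A self-contained run of the same idea on a slice of the question's data may make the mechanics clearer; the where/ffill pair keeps Month only on the year rows and fills the year downward:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Month': ['2017', 'November', 'December', '2018', 'January', 'February'],
    'Value': [np.nan, 1.29, 1.29, np.nan, 1.29, 1.29],
})

# Rows where Value is NaN hold a year, not a month name.  Keep Month
# only on those rows, forward-fill the year, and append it to each month.
year = df['Month'].where(df['Value'].isna()).ffill()
df['Month'] = df['Month'] + ' ' + year
df = df.dropna()  # drop the year-only rows
print(df)
```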
I have a DataFrame with dtype=object as:
YY MM DD hh var1 var2
.
.
.
10512 2013 01 01 06 1.64 4.64
10513 2013 01 01 07 1.57 4.63
10514 2013 01 01 08 1.56 4.71
10515 2013 01 01 09 1.45 4.69
10516 2013 01 01 10 1.53 4.67
10517 2013 01 01 11 1.31 4.63
10518 2013 01 01 12 1.41 4.70
10519 2013 01 01 13 1.49 4.80
10520 2013 01 01 20 1.15 4.91
10521 2013 01 01 21 1.14 4.74
10522 2013 01 01 22 1.10 4.95
As seen, there are missing rows for some hours (hh): for instance, between rows 10519 and 10520, hh jumps from 13 to 20. I tried to fill the gaps by setting hh as the index, following what was discussed here: Missing data, insert rows in Pandas and fill with NAN
df=df.set_index('hh')
new_index = pd.Index(np.arange(0,24), name="hh")
df=df.reindex(new_index).reset_index()
and reach something like:
YY MM DD hh var1 var2
10519 2013 01 01 13 1.49 4.80
10520 2013 01 01 14 Nan Nan
10521 2013 01 01 15 Nan Nan
10522 2013 01 01 16 Nan Nan
...
10523 2013 01 01 20 1.15 4.91
10524 2013 01 01 21 1.14 4.74
10525 2013 01 01 22 1.10 4.95
But I encounter the error "cannot reindex from a duplicate axis" at the df = df.reindex(new_index) step.
There are duplicate values of hh = 0, 1, ..., 23 because the same hh value repeats across different months (MM) and years (YY); that is probably the reason. How can I solve the problem?
In general, how can one fill in the missing rows of a pandas DataFrame when the index contains duplicates? I appreciate any comments.
First create a new column with the time, including date and hour, of type datetime. One way this can be done is as follows:
df = df.rename(columns={'YY': 'year', 'MM': 'month', 'DD': 'day', 'hh': 'hour'})
df['time'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
To use to_datetime in this way, the column names need to be year, month, day and hour which is why rename is used.
To get the expected result, set this new column as the index and use resample:
df.set_index('time').resample('H').mean()
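A minimal self-contained run of this approach, with a two-hour gap in the sample data (note the columns must be named year/month/day/hour for this to_datetime calling convention):

```python
import pandas as pd

df = pd.DataFrame({
    'YY': [2013, 2013, 2013], 'MM': [1, 1, 1], 'DD': [1, 1, 1],
    'hh': [13, 14, 17],                  # hours 15 and 16 are missing
    'var1': [1.49, 1.50, 1.15], 'var2': [4.80, 4.85, 4.91],
})

df = df.rename(columns={'YY': 'year', 'MM': 'month', 'DD': 'day', 'hh': 'hour'})
df['time'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])

# Resampling on an hourly grid inserts the missing hours as NaN rows.
out = df.set_index('time')[['var1', 'var2']].resample('h').mean()
print(out)
```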
This code does exactly what you need.
import pandas as pd
import numpy as np
from io import StringIO

YY, MM, DD, hh, var1, var2 = [], [], [], [], [], []
a = '''10512 2013 01 01 06 1.64 4.64
10513 2013 01 01 07 1.57 4.63
10514 2013 01 01 08 1.56 4.71
10515 2013 01 01 09 1.45 4.69
10516 2013 01 01 10 1.53 4.67
10517 2013 01 01 11 1.31 4.63
10518 2013 01 01 12 1.41 4.70
10519 2013 01 01 13 1.49 4.80
10520 2013 01 01 20 1.15 4.91
10521 2013 01 01 21 1.14 4.74
10522 2013 01 01 22 1.10 4.95
10523 2013 01 01 27 1.30 4.55
10524 2013 01 01 28 1.2 4.62
'''
text = StringIO(a)
for line in text.readlines():
    fields = list(filter(None, line.strip().split(" ")))
    YY.append(fields[1])
    MM.append(fields[2])
    DD.append(fields[3])
    hh.append(fields[4])
    var1.append(fields[5])
    var2.append(fields[6])

df = pd.DataFrame({'YY': YY, 'MM': MM, 'DD': DD,
                   'hh': hh, 'var1': var1, 'var2': var2})
df['hh'] = df.hh.astype(int)

# gap sizes between consecutive hours, and the positions where gaps start
gaps = np.diff(df.hh)
starts = np.where(gaps != 1)[0]

# Insert the missing rows; iterate from the last gap backwards so that
# earlier insertion positions are not shifted by rows already inserted.
for i in starts[::-1]:
    line = pd.DataFrame(columns=['YY', 'MM', 'DD', 'hh', 'var1', 'var2'])
    for k in range(gaps[i] - 1):
        line.loc[k] = [df.iloc[i, 0], df.iloc[i, 1], df.iloc[i, 2],
                       df.iloc[i, 3] + k + 1, np.nan, np.nan]
    df = pd.concat([df.loc[:i], line, df.loc[i + 1:]])
    df.reset_index(inplace=True, drop=True)
print(df)
YY MM DD hh var1 var2
0 2013 01 01 6 1.64 4.64
1 2013 01 01 7 1.57 4.63
2 2013 01 01 8 1.56 4.71
3 2013 01 01 9 1.45 4.69
4 2013 01 01 10 1.53 4.67
5 2013 01 01 11 1.31 4.63
6 2013 01 01 12 1.41 4.70
7 2013 01 01 13 1.49 4.80
8 2013 01 01 14 NaN NaN
9 2013 01 01 15 NaN NaN
10 2013 01 01 16 NaN NaN
11 2013 01 01 17 NaN NaN
12 2013 01 01 18 NaN NaN
13 2013 01 01 19 NaN NaN
14 2013 01 01 20 1.15 4.91
15 2013 01 01 21 1.14 4.74
16 2013 01 01 22 1.10 4.95
17 2013 01 01 23 NaN NaN
18 2013 01 01 24 NaN NaN
19 2013 01 01 25 NaN NaN
20 2013 01 01 26 NaN NaN
21 2013 01 01 27 1.30 4.55
22 2013 01 01 28 1.2 4.62
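As an alternative, the "cannot reindex from a duplicate axis" error from the question can be avoided by reindexing within each (YY, MM, DD) group, since hh is unique within a single day. A sketch under that assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'YY': ['2013'] * 3, 'MM': ['01'] * 3, 'DD': ['01'] * 3,
    'hh': [13, 14, 17], 'var1': [1.49, 1.50, 1.15], 'var2': [4.80, 4.85, 4.91],
})

def fill_day(g):
    # Within one day hh is unique, so reindexing onto 0..23 is valid;
    # missing hours come back as NaN rows.
    return g.set_index('hh')[['var1', 'var2']].reindex(
        pd.Index(np.arange(24), name='hh'))

out = (df.groupby(['YY', 'MM', 'DD'])
         .apply(fill_day)
         .reset_index())
print(out.head())
```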
I have a data frame that has 3 columns.
Time represents every day of the month across various months. What I am trying to do is take the 'Count' value per day, average it over each month, and do this for each country. The output must be in the form of a data frame.
Current data:
Time Country Count
2017-01-01 us 7827
2017-01-02 us 7748
2017-01-03 us 7653
..
..
2017-01-30 us 5432
2017-01-31 us 2942
2017-01-01 ca 5829
2017-01-02 ca 9843
2017-01-03 ca 7845
..
..
2017-01-30 ca 8654
2017-01-31 ca 8534
Desired output (dummy data; the numbers are not representative of the DF above):
Time Country Monthly Average
Jan 2017 us 6873
Feb 2017 us 8875
..
..
Nov 2017 us 9614
Dec 2017 us 2475
Jan 2017 ca 1878
Feb 2017 ca 4775
..
..
Nov 2017 ca 7643
Dec 2017 ca 9441
I'd organize it like this:
df.groupby(
    [df.Time.dt.strftime('%b %Y'), 'Country']
)['Count'].mean().reset_index(name='Monthly Average')
Time Country Monthly Average
0 Feb 2017 ca 88.0
1 Feb 2017 us 105.0
2 Jan 2017 ca 85.0
3 Jan 2017 us 24.6
4 Mar 2017 ca 86.0
5 Mar 2017 us 54.0
If your 'Time' column wasn't already a datetime column, I'd do this:
df.groupby(
    [pd.to_datetime(df.Time).dt.strftime('%b %Y'), 'Country']
)['Count'].mean().reset_index(name='Monthly Average')
Time Country Monthly Average
0 Feb 2017 ca 88.0
1 Feb 2017 us 105.0
2 Jan 2017 ca 85.0
3 Jan 2017 us 24.6
4 Mar 2017 ca 86.0
5 Mar 2017 us 54.0
Use pandas' dt.strftime to create the month-year column you want, then groupby + mean. I used this dataframe:
Dated country num
2017-01-01 us 12
2017-01-02 us 12
2017-02-02 us 134
2017-02-03 us 76
2017-03-30 us 54
2017-01-31 us 29
2017-01-01 us 58
2017-01-02 us 12
2017-02-02 ca 98
2017-02-03 ca 78
2017-03-30 ca 86
2017-01-31 ca 85
Then create a Month-Year column:
a['MonthYear'] = a.Dated.dt.strftime('%b %Y')
Then drop the Dated column and aggregate by mean:
a.drop('Dated', axis=1).groupby(['MonthYear', 'country']).mean().rename(columns={'num': 'Averaged'}).reset_index()
MonthYear country Averaged
Feb 2017 ca 88.0
Feb 2017 us 105.0
Jan 2017 ca 85.0
Jan 2017 us 24.6
Mar 2017 ca 86.0
Mar 2017 us 54.0
I retained the Dated column just in case.
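One caveat with strftime before grouping: the labels become plain strings, so the result sorts alphabetically (Feb before Jan, as in the outputs above). If chronological order matters, a sketch using pd.Grouper keeps Time as real datetimes and formats the label afterwards:

```python
import pandas as pd

df = pd.DataFrame({
    'Time': pd.to_datetime(['2017-01-01', '2017-01-02',
                            '2017-02-01', '2017-02-02']),
    'Country': ['us', 'us', 'us', 'ca'],
    'Count': [7827, 7748, 134, 98],
})

# Group by calendar month (month start) and country; the datetime key
# keeps the groups in chronological order.
out = (df.groupby([pd.Grouper(key='Time', freq='MS'), 'Country'])['Count']
         .mean().reset_index(name='Monthly Average'))
out['Time'] = out['Time'].dt.strftime('%b %Y')
print(out)
```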