I'm currently trying to build a dataframe consisting of daily US Treasury rates. As you can see, pandas automatically sorts the columns alphabetically, which is clearly not what I want. Here's some of my code; a small example is enough to show the problem I'm having.
import quandl
import pandas as pd
import matplotlib.pyplot as plt
One_Month = quandl.get('FRED/DGS1MO')
^^ Repeated for all rates
Yield_Curve = pd.DataFrame({'1m': One_Month['Value'], '3m': Three_Month['Value'], '1yr': One_Year['Value']})
Yield_Curve.loc['2017-06-22'].plot()
plt.show()
Yield_Curve.tail()
1m 1yr 3m
Date
2017-06-16 0.85 1.21 1.03
2017-06-19 0.85 1.22 1.02
2017-06-20 0.88 1.22 1.01
2017-06-21 0.85 1.22 0.99
2017-06-22 0.80 1.22 0.96
As I said, I only added three rates to the dataframe, but obviously the two-year, three-year, and five-year rates will cause the same problem.
I did some searching and saw this post:
Plotting Treasury Yield Curve, how to overlay two yield curves using matplotlib
While the code in that post clearly works, I'd rather keep my current datasets (One_Month, Three_Month, ...) to do this, since I use them for other analyses as well.
Question: Is there a way for me to lock the column order?
Thanks for your help!
If you're looking to define the column ordering, you can use reindex_axis() (since deprecated in favor of reindex()):
df = df.reindex_axis(labels=['1m', '3m', '1yr'], axis=1)
df
1m 3m 1yr
Date
2017-06-16 0.85 1.03 1.21
2017-06-19 0.85 1.02 1.22
2017-06-20 0.88 1.01 1.22
2017-06-21 0.85 0.99 1.22
2017-06-22 0.80 0.96 1.22
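Note that reindex_axis() was removed in later pandas versions; a minimal sketch of the modern equivalent, using a made-up frame mirroring the question's data:

```python
import pandas as pd

# Hypothetical one-row frame with the question's column names
df = pd.DataFrame({'1m': [0.85], '1yr': [1.21], '3m': [1.03]})
# reindex(columns=...) replaces the deprecated reindex_axis(..., axis=1)
df = df.reindex(columns=['1m', '3m', '1yr'])
```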
With pandas-datareader you can specify the symbols as one list. And in addition to using reindex_axis as suggested by @Andrew L, you can also just pass an ordered list of columns in double brackets (see the final line below) to specify column order.
from pandas_datareader.data import DataReader as dr
syms = ['DGS10', 'DGS5', 'DGS2', 'DGS1MO', 'DGS3MO']
yc = dr(syms, 'fred') # could specify start date with start param here
names = dict(zip(syms, ['10yr', '5yr', '2yr', '1m', '3m']))
yc = yc.rename(columns=names)
yc = yc[['1m', '3m', '2yr', '5yr', '10yr']]
print(yc)
1m 3m 2yr 5yr 10yr
DATE
2010-01-01 NaN NaN NaN NaN NaN
2010-01-04 0.05 0.08 1.09 2.65 3.85
2010-01-05 0.03 0.07 1.01 2.56 3.77
2010-01-06 0.03 0.06 1.01 2.60 3.85
2010-01-07 0.02 0.05 1.03 2.62 3.85
... ... ... ... ... ...
2017-06-16 0.85 1.03 1.32 1.75 2.16
2017-06-19 0.85 1.02 1.36 1.80 2.19
2017-06-20 0.88 1.01 1.36 1.77 2.16
2017-06-21 0.85 0.99 1.36 1.78 2.16
2017-06-22 0.80 0.96 1.34 1.76 2.15
yc.loc['2016-06-01'].plot(label='Jun 1')
yc.loc['2016-06-02'].plot(label='Jun 2')
plt.legend(loc=0)
If you don't want to change the original column order, but still need the columns sorted into standard finance notation, you can build a custom column order like below.
fi_col = df.columns.str.extract(r'(\d+)(\D+)', expand=True).sort_values([1, 0]).reset_index(drop=True)
fi_col = fi_col[0] + fi_col[1]
print(df[fi_col])
1m 3m 1yr
Date
2017-06-16 0.85 1.03 1.21
2017-06-19 0.85 1.02 1.22
2017-06-20 0.88 1.01 1.22
2017-06-21 0.85 0.99 1.22
2017-06-22 0.80 0.96 1.22
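To see the extract-and-sort trick end to end, here is a minimal self-contained sketch (the one-row frame is made up to mirror the question's columns):

```python
import pandas as pd

df = pd.DataFrame({'1m': [0.85], '1yr': [1.21], '3m': [1.03]})
# Split each column name into its digits and its unit suffix,
# then sort by unit first and digits second
parts = df.columns.str.extract(r'(\d+)(\D+)', expand=True)
parts = parts.sort_values([1, 0]).reset_index(drop=True)
ordered = parts[0] + parts[1]
df = df[ordered]
```

One caveat: the digits sort as strings, so e.g. '10' would sort before '2'; that is fine for single-digit tenors like these.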
You can also pull all the historical rates directly from the US Treasury's website (updated daily):
from bs4 import BeautifulSoup
import requests
import pandas as pd

soup = BeautifulSoup(requests.get('https://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData').text, 'lxml')
table = soup.find_all('m:properties')
fields = ['d:new_date', 'd:bc_1month', 'd:bc_2month', 'd:bc_3month', 'd:bc_6month', 'd:bc_1year',
          'd:bc_2year', 'd:bc_3year', 'd:bc_5year', 'd:bc_10year', 'd:bc_20year', 'd:bc_30year']
tbondvalues = []
for i in table:
    row = [i.find(f).text for f in fields]
    row[0] = row[0][:10]  # keep the date part only
    tbondvalues.append(row)
ustcurve = pd.DataFrame(tbondvalues, columns=['date', '1m', '2m', '3m', '6m', '1y', '2y', '3y', '5y', '10y', '20y', '30y'])
ustcurve.iloc[:, 1:] = ustcurve.iloc[:, 1:].apply(pd.to_numeric) / 100
ustcurve['date'] = pd.to_datetime(ustcurve['date'])
I am trying to use Seaborn to plot a simple bar plot using data that was transformed. The data started out looking like this (text follows):
element 1 2 3 4 5 6 7 8 9 10 11 12
C 95.6 95.81 96.1 95.89 97.92 96.71 96.1 96.38 96.09 97.12 95.12 95.97
N 1.9 1.55 1.59 1.66 0.53 1.22 1.57 1.63 1.82 0.83 2.37 2.13
O 2.31 2.4 2.14 2.25 1.36 1.89 2.23 1.8 1.93 1.89 2.3 1.71
Co 0.18 0.21 0.16 0.17 0.01 0.03 0.13 0.01 0.02 0.01 0.14 0.01
Zn 0.01 0.03 0.02 0.03 0.18 0.14 0.07 0.17 0.14 0.16 0.07 0.18
and after importing using:
df1 = pd.read_csv(r"C:\path.txt", sep='\t', header=0, usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], index_col='element').transpose()
display(df1)
When I plot the values of an element versus the first column (which represents an observation), the first column of data corresponding to 'C' is used instead. What am I doing wrong and how can I fix it?
I also tried importing, then pivoting the dataframe, which resulted in an undesired shape that repeated the element set as columns 12 times.
ax = sns.barplot(x=df1.iloc[:,0], y='Zn', data=df1)
edited to add that I am not married to using any particular package or technique. I just want to be able to use my data to build a bar plot with 1-12 on the x axis and elemental compositions on the y.
You have a couple of options here. The problem is that after the transpose the observation numbers are the index of df1, not a column, so x=df1.iloc[:,0] selects the 'C' column instead.
1) Use the index directly:
ax = sns.barplot(x=df1.index, y='Zn', data=df1)
2) Reset the index so the observation numbers become a regular column:
df1 = df1.reset_index()  # the observation labels become a column named 'index'
ax = sns.barplot(x=df1.iloc[:, 0], y='Zn', data=df1)
# equivalent to
ax = sns.barplot(x='index', y='Zn', data=df1)
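If you'd rather plot all elements at once, reshaping to long form works too. A minimal sketch with a made-up subset of the question's data (the column and variable names are illustrative):

```python
import pandas as pd

# Two observations, three elements -- a made-up subset of the question's data
df1 = pd.DataFrame({'element': ['C', 'N', 'Zn'],
                    '1': [95.6, 1.9, 0.01],
                    '2': [95.81, 1.55, 0.03]}).set_index('element').transpose()
# Move the observation labels out of the index, then melt to long form
long_df = df1.reset_index().melt(id_vars='index',
                                 var_name='element', value_name='pct')
# long_df now has one row per (observation, element) pair, ready for
# sns.barplot(x='index', y='pct', hue='element', data=long_df)
```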
I have time-series data in CSV format. I want to calculate the mean for several different time periods in a single run of the script, e.g. 01-05-2017 to 30-04-2018, 01-05-2018 to 30-04-2019, and so on. Below is sample data.
I have a script, but it handles only one time period per run; I want to supply multiple time periods as mentioned above.
from datetime import datetime
import pandas as pd
df = pd.read_csv(r'D:\Data\RT_2015_2020.csv', index_col=[0],parse_dates=[0])
z = df['2016-05-01' : '2017-04-30']
# Want to make like this way
#z = df[['2016-05-01' : '2017-04-30'], ['2017-05-01' : '2018-04-30']]
# It will calculate the mean for the selected time period
z.mean()
If you use the dates as an index, you can slice out the data that falls within the desired range.
import pandas as pd
import numpy as np
import io
data = '''
Date Mean
18-05-2016 0.31
07-06-2016 0.32
17-07-2016 0.50
15-09-2016 0.62
25-10-2016 0.63
04-11-2016 0.56
24-11-2016 0.56
14-12-2016 0.22
13-01-2017 0.22
23-01-2017 0.23
12-02-2017 0.21
22-02-2017 0.21
'''
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # the dates are DD-MM-YYYY
df.set_index('Date', inplace=True)
df.loc['2016'].head()
Mean
Date
2016-05-18 0.31
2016-06-07 0.32
2016-07-17 0.50
2016-09-15 0.62
2016-10-25 0.63
df.loc['2016-05-01':'2017-01-30']
Mean
Date
2016-05-18 0.31
2016-06-07 0.32
2016-07-17 0.50
2016-09-15 0.62
2016-10-25 0.63
2016-11-04 0.56
2016-11-24 0.56
2016-12-14 0.22
2017-01-13 0.22
2017-01-23 0.23
df.loc['2016-05-01':'2017-01-30'].mean()
Mean    0.417
dtype: float64
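To cover the asker's "multiple periods in one run" requirement, you can loop over a list of (start, end) pairs. A minimal sketch with made-up data:

```python
import pandas as pd

idx = pd.to_datetime(['2016-06-01', '2017-06-01', '2018-06-01'])
df = pd.DataFrame({'Mean': [1.0, 2.0, 4.0]}, index=idx)

periods = [('2016-05-01', '2017-04-30'), ('2017-05-01', '2018-04-30')]
# One mean per period, keyed by the period's date range
means = {f'{s}..{e}': df.loc[s:e, 'Mean'].mean() for s, e in periods}
```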
This question already has answers here:
Pandas pivot table with multiple columns at once
(2 answers)
Closed 5 years ago.
Have a df like the one below, and looking to compress duplicate index values into a single row:
ask bid
date
2011-01-03 0.32 0.30
2011-01-03 1.03 1.01
2011-01-03 4.16 4.11
and expected output is to have (column names not important for now will manually set it):
ask bid ask1 bid1 ask2 bid2
date
2011-01-03 0.32 0.30 1.03 1.01 4.16 4.11
Something like below can be done to get the output you're looking for:
from functools import reduce
import pandas as pd

df_1 = pd.DataFrame({'date': ['2011-01-03', '2011-01-03', '2011-01-03'],
                     'ask': [0.31, 1.05, 4.17],
                     'bid': [0.40, 1.41, 5.11]})
dfs = list()
df_count = 1
while df_1['date'].duplicated().any():
    df_count += 1
    b = df_1.drop_duplicates(subset='date', keep='first')
    dfs.append(b)
    df_1 = df_1.merge(b, how='outer', on=['date', 'ask', 'bid'], indicator=True)
    df_1 = df_1[df_1['_merge'] == 'left_only']
    del df_1['_merge']
dfs.append(df_1)
df_final = reduce(lambda left, right: pd.merge(left, right, on='date', suffixes=('_1', '_2')), dfs)
input:
ask bid date
0 0.31 0.40 2011-01-03
1 1.05 1.41 2011-01-03
2 4.17 5.11 2011-01-03
Output :
ask_1 bid_1 date ask_2 bid_2 ask bid
0 0.31 0.4 2011-01-03 1.05 1.41 4.17 5.11
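A shorter route, using cumcount to number the duplicates and then pivoting (essentially what the linked duplicate suggests), might look like this sketch built on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2011-01-03'] * 3,
                   'ask': [0.32, 1.03, 4.16],
                   'bid': [0.30, 1.01, 4.11]})
df['n'] = df.groupby('date').cumcount()          # 0, 1, 2 within each date
wide = df.pivot(index='date', columns='n', values=['ask', 'bid'])
wide.columns = [f'{col}{n}' for col, n in wide.columns]  # ('ask', 1) -> 'ask1'
```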
For this data that is already pivoted in a dataframe:
1 2 3 4 5 6 7
2013-05-28 -0.44 0.03 0.06 -0.31 0.13 0.56 0.81
2013-07-05 0.84 1.03 0.96 0.90 1.09 0.59 1.15
2013-08-21 0.09 0.25 0.06 0.09 -0.09 -0.16 0.56
2014-10-15 0.35 1.16 1.91 3.44 2.75 1.97 2.16
2015-02-09 0.09 -0.10 -0.38 -0.69 -0.25 -0.85 -0.47
.. I'm trying to make a lines chart. This from Excel:
.. and if I click that flip x & y button in Excel, also this pic:
I'm getting lost with the to-chart and to-png steps, and most of the examples want unpivoted raw data, which is a stage I'm already past.
Seaborn or Matplotlib or anything that can make the chart would be great. On a box without X11 would be better still :)
I thought about posting this a comment on this SO answer, but I could not do newlines, insert pics and all of that.
Edit: Sorry, I've not pasted in any of the attempts I've tried because they have not even come close to putting a PNG out. The only other examples on SO I can see start with transactional rows, and pivot for sure, but don't go as far as PNG output.
You need to transpose your data before plotting it.
df.T.plot()
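Since the asker also needs PNG output on a box without X11, here is a minimal sketch combining the transpose with matplotlib's non-interactive Agg backend (the data values are a made-up subset of the question's table):

```python
import matplotlib
matplotlib.use('Agg')            # non-interactive backend: no X11 required
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame([[-0.44, 0.03, 0.06], [0.84, 1.03, 0.96]],
                  index=['2013-05-28', '2013-07-05'],
                  columns=[1, 2, 3])
ax = df.T.plot()                 # one line per date, columns 1..3 on the x axis
ax.get_figure().savefig('curves.png')
plt.close('all')
```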
I want to compute the duration in weeks between changes. For example, p is the same for weeks 1, 2, and 3 and changes to 1.11 in period 4, so the duration is 3. Right now the duration is computed in a loop ported from R. It works, but it is slow. Any suggestions on how to improve this would be greatly appreciated.
raw['duration'] = np.nan
ids = raw['unique_id'].unique()
for i in range(len(ids)):
    # rows for this id where the price actually changed
    pos1 = abs(raw['dp']) > 0
    pos2 = raw['unique_id'] == ids[i]
    pos = np.where(pos1 & pos2)[0]
    # first change: duration is measured from week 1
    raw.loc[raw.index[pos[0]], 'duration'] = raw['week'].iloc[pos[0]] - 1
    for j in range(1, len(pos)):
        raw.loc[raw.index[pos[j]], 'duration'] = raw['week'].iloc[pos[j]] - raw['week'].iloc[pos[j - 1]]
The dataframe is raw, and values for a particular unique_id looks like this.
date week p change duration
2006-07-08 27 1.05 -0.07 1
2006-07-15 28 1.05 0.00 NaN
2006-07-22 29 1.05 0.00 NaN
2006-07-29 30 1.11 0.06 3
... ... ... ... ...
2010-06-05 231 1.61 0.09 1
2010-06-12 232 1.63 0.02 1
2010-06-19 233 1.57 -0.06 1
2010-06-26 234 1.41 -0.16 1
2010-07-03 235 1.35 -0.06 1
2010-07-10 236 1.43 0.08 1
2010-07-17 237 1.59 0.16 1
2010-07-24 238 1.59 0.00 NaN
2010-07-31 239 1.59 0.00 NaN
2010-08-07 240 1.59 0.00 NaN
2010-08-14 241 1.59 0.00 NaN
2010-08-21 242 1.61 0.02 5
Computing durations once you have your list in date order is trivial: iterate over the list, keeping track of how long it has been since the last change to p. If the slowness comes from how you get that list, you haven't provided nearly enough info for help with that.
You can simply get the list of weeks where there is a change, then compute their differences, and finally join those differences back onto your original DataFrame.
weeks = raw.query('change != 0.0')[['week']]
weeks['duration'] = weeks.week.diff()
pd.merge(raw, weeks, on='week', how='left')
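Run end to end on a made-up slice of the question's data, the approach looks like:

```python
import pandas as pd

raw = pd.DataFrame({'week': [27, 28, 29, 30],
                    'change': [-0.07, 0.00, 0.00, 0.06]})
weeks = raw.query('change != 0.0')[['week']]
weeks['duration'] = weeks.week.diff()       # weeks elapsed since the previous change
out = pd.merge(raw, weeks, on='week', how='left')
```

The no-change weeks (and the very first change, which has nothing to diff against) come back as NaN.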
raw2 = raw.loc[raw['change'] != 0, ['week', 'unique_id']]
data2 = raw2.groupby('unique_id')
raw2['duration'] = data2['week'].transform(lambda x: x.diff())
raw = pd.merge(raw, raw2, on=['unique_id', 'week'], how='left')
Thank you all. I modified the suggestion and got it to give the same answer as the complicated loop. For 10,000 observations it is not a whole lot faster, but the code is more compact.
I set no-change rows to NaN because the duration seems undefined when no change is made, though zero would work too. With the above code, the NaN is put in automatically by the merge. In any case, I want to compute statistics for the no-change group separately.