I am trying to use Seaborn to plot a simple bar plot using data that was transformed. The data started out looking like this (text follows):
element 1 2 3 4 5 6 7 8 9 10 11 12
C 95.6 95.81 96.1 95.89 97.92 96.71 96.1 96.38 96.09 97.12 95.12 95.97
N 1.9 1.55 1.59 1.66 0.53 1.22 1.57 1.63 1.82 0.83 2.37 2.13
O 2.31 2.4 2.14 2.25 1.36 1.89 2.23 1.8 1.93 1.89 2.3 1.71
Co 0.18 0.21 0.16 0.17 0.01 0.03 0.13 0.01 0.02 0.01 0.14 0.01
Zn 0.01 0.03 0.02 0.03 0.18 0.14 0.07 0.17 0.14 0.16 0.07 0.18
and after importing using:
df1 = pd.read_csv(r"C:\path.txt", sep='\t', header=0, usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], index_col='element').transpose()
display(df1)
When I plot the values of an element versus the first column (which represents an observation), the first column of data corresponding to 'C' is used instead. What am I doing wrong and how can I fix it?
I also tried importing, then pivoting the dataframe, which resulted in an undesired shape that repeated the element set as columns 12 times.
ax = sns.barplot(x=df1.iloc[:,0], y='Zn', data=df1)
Edited to add: I am not married to any particular package or technique. I just want to build a bar plot from my data with 1-12 on the x axis and the elemental compositions on the y axis.
You have different possibilities here. The problem arises because 'element' is the index of your dataframe, so x=df1.iloc[:,0] is the column of 'C'.
1)
ax = sns.barplot(x=df1.index, y='Zn', data=df1)
2)
df1 = df1.reset_index().rename(columns={'index': 'element'})  # now the observation labels (1-12) are the first column of df1
ax = sns.barplot(x=df1.iloc[:,0], y='Zn', data=df1)
# equivalent to
ax = sns.barplot(x='element', y='Zn', data=df1)
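If you eventually want every element in one figure, a minimal sketch (assuming the transposed df1 from above, before any reset_index) is to melt to long form and let seaborn split the bars by element:

import seaborn as sns

# long form: one row per (observation, element, value)
long_df = df1.reset_index().rename(columns={'index': 'observation'}).melt(
    id_vars='observation', var_name='element', value_name='composition')

# observations 1-12 on the x axis, one colored bar per element
ax = sns.barplot(x='observation', y='composition', hue='element', data=long_df)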
Related
I have a dataset like below
data = {'ReportingDate':['2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31',
'2013/6/28','2013/6/28',
'2013/6/28','2013/6/28','2013/6/28'],
'MarketCap':[' ',0.35,0.7,0.875,0.7,0.35,' ',1,1.5,0.75,1.25],
'AUM':[3.5,3.5,3.5,3.5,3.5,3.5,5,5,5,5,5],
'weight':[' ',0.1,0.2,0.25,0.2,0.1,' ',0.2,0.3,0.15,0.25]}
# Create DataFrame
df = pd.DataFrame(data)
df.set_index('ReportingDate', inplace=True)
df
This is just a sample of an 8,000-row dataset.
ReportingDate runs from 2013/5/31 to 2015/10/30 and covers every month in that period, but only the last day of each month.
The first line of each month has two missing values. I know that:
- the sum of weight for each month is equal to 1
- weight * AUM is equal to MarketCap
I can use the lines below to get the answer I want, for a single month:
a = 1 - df["2013-5"].iloc[1:]['weight'].sum()  # the missing weight
b = a * df["2013-5"]['AUM'].iloc[0]            # the missing MarketCap = weight * AUM
df.iloc[0, 0] = b
df.iloc[0, 2] = a
How can I use a loop to get the data for the whole period? Thanks
One way, using pandas.DataFrame.groupby:
import numpy as np

# If the blanks really are whitespace strings rather than NaN
df = df.replace(r"\s+", np.nan, regex=True)
# If the index is not already a DatetimeIndex
df.index = pd.to_datetime(df.index)

# For each month, the missing weight is 1 minus the sum of the known weights
s = df["weight"].fillna(1) - df.groupby(df.index.date)["weight"].transform("sum")
df["weight"] = df["weight"].fillna(s)
df["MarketCap"] = df["MarketCap"].fillna(s * df["AUM"])
Note: this assumes each month appears with a single date (the last day), so that grouping by date is equivalent to grouping by year-month. If that is not the case, try:
s = df["weight"].fillna(1) - df.groupby(df.index.strftime("%Y%m"))["weight"].transform("sum")
Output:
MarketCap AUM weight
ReportingDate
2013-05-31 0.350 3.5 0.10
2013-05-31 0.525 3.5 0.15
2013-05-31 0.700 3.5 0.20
2013-05-31 0.875 3.5 0.25
2013-05-31 0.700 3.5 0.20
2013-05-31 0.350 3.5 0.10
2013-06-28 0.500 5.0 0.10
2013-06-28 1.000 5.0 0.20
2013-06-28 1.500 5.0 0.30
2013-06-28 0.750 5.0 0.15
2013-06-28 1.250 5.0 0.25
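As a sanity check: the known 2013-06-28 weights sum to 0.2 + 0.3 + 0.15 + 0.25 = 0.9, so the filled first June row gets weight 1 - 0.9 = 0.10 and MarketCap 0.10 * 5 = 0.50, as shown above.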
I have the following dataframe (df):
            1     2     3     4     5
Row Number
Row 0    0.24  0.16 -0.18 -0.20  1.24
Row 1    0.18  0.12 -0.73 -0.36 -0.54
Row 2   -0.01  0.25 -0.35 -0.08 -0.43
Row 3   -0.43  0.21  0.53  0.55 -1.03
Row 4   -0.24 -0.20  0.49  0.08  0.61
Row 5   -0.19 -0.29 -0.08 -0.16  0.34
I am attempting to sum all the negative and positive numbers respectively, e.g. sum(neg_numbers) = n and sum(pos_numbers) = x
I have tried:
df.groupby(df.agg([('negative', lambda x: x[x < 0].sum()), ('positive', lambda x: x[x > 0].sum())]))
to no avail.
How would I sum these values?
Thank you in advance!
You can do
sum_pos = df[df>0].sum(1)
sum_neg = df[df<0].sum(1)
if you want the sums per row. If you want to sum all values regardless of rows or columns, you can use np.nansum:
sum_pos = np.nansum(df[df>0])
You can also do it with
df.mul(df.gt(0)).sum().sum()
Out[447]: 5.0
df.mul(~df.gt(0)).sum().sum()
Out[448]: -5.5
If you need the sums per column
df.mul(df.gt(0)).sum()
Out[449]:
1 0.42
2 0.74
3 1.02
4 0.63
5 2.19
dtype: float64
Yet another way for the total sums:
sum_pos = df.to_numpy().flatten().clip(min=0).sum()
sum_neg = df.to_numpy().flatten().clip(max=0).sum()
And for sums by column:
sum_pos_col = df.to_numpy().clip(min=0).sum(axis=0)
sum_neg_col = df.to_numpy().clip(max=0).sum(axis=0)
If the dataframe also contains string columns and you want the sums for a particular numeric column, then
df[df['column_name']>0]['column_name'].sum()
df[df['column_name']<0]['column_name'].sum()
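For completeness, a numpy-only sketch of the same grand totals (assuming all columns are numeric):
import numpy as np

arr = df.to_numpy()
total_pos = np.where(arr > 0, arr, 0).sum()  # 5.0 for the frame above
total_neg = np.where(arr < 0, arr, 0).sum()  # -5.5 for the frame above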
For this data that is already pivoted in a dataframe:
1 2 3 4 5 6 7
2013-05-28 -0.44 0.03 0.06 -0.31 0.13 0.56 0.81
2013-07-05 0.84 1.03 0.96 0.90 1.09 0.59 1.15
2013-08-21 0.09 0.25 0.06 0.09 -0.09 -0.16 0.56
2014-10-15 0.35 1.16 1.91 3.44 2.75 1.97 2.16
2015-02-09 0.09 -0.10 -0.38 -0.69 -0.25 -0.85 -0.47
.. I'm trying to make a line chart from it, like the one Excel produces (and, after clicking Excel's flip x & y button, the transposed version).
I'm getting lost with the to-chart and to-PNG steps, and most of the examples want unpivoted raw data, which is a stage I'm already past.
Seaborn or Matplotlib or anything else that can make the chart would be great. Working on a box without X11 would be better still :)
I thought about posting this as a comment on another SO answer, but I could not do newlines, insert pics and all of that.
Edit: Sorry, I've not pasted in any of my attempts because none of them came close to producing a PNG. The other examples I can find on SO start with transactional rows and do the pivot, but don't go as far as PNG output.
You need to transpose your data before plotting it.
df.T.plot()
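To get from there to a PNG on a headless box (no X11), a minimal sketch: select the non-interactive Agg backend before pyplot is imported, then save the figure (the filename is illustrative):

import matplotlib
matplotlib.use('Agg')  # render off-screen; no X11 required
import matplotlib.pyplot as plt

ax = df.T.plot()  # dates become the lines, columns 1-7 the x axis
ax.figure.savefig('lines.png', dpi=150)
plt.close(ax.figure)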
I'm currently trying to build a dataframe of daily US Treasury rates. As you can see, pandas automatically sorts the columns alphabetically, which is clearly not what I want. Here's some of my code; a small example is enough to show the problem I'm having.
import quandl
import pandas as pd
import matplotlib.pyplot as plt

One_Month = quandl.get('FRED/DGS1MO')
# ^^ repeated for all the other rates (Three_Month, One_Year, ...)

Yield_Curve = pd.DataFrame({'1m': One_Month['Value'], '3m': Three_Month['Value'], '1yr': One_Year['Value']})
Yield_Curve.loc['2017-06-22'].plot()
plt.show()
Yield_Curve.tail()
1m 1yr 3m
Date
2017-06-16 0.85 1.21 1.03
2017-06-19 0.85 1.22 1.02
2017-06-20 0.88 1.22 1.01
2017-06-21 0.85 1.22 0.99
2017-06-22 0.80 1.22 0.96
As I said, I only added three rates to the dataframe, but obviously the two-, three-, and five-year rates will cause the same problem.
I did some searching and saw this post:
Plotting Treasury Yield Curve, how to overlay two yield curves using matplotlib
While the code in the linked post clearly works, I'd rather keep my current datasets (One_Month, Three_Month, ...) since I use them for other analyses as well.
Question: Is there a way for me to lock the column order?
Thanks for your help!
If you're looking to define the column ordering, you can use reindex_axis():
df = df.reindex_axis(labels=['1m', '3m', '1yr'], axis=1)
df
1m 3m 1yr
Date
2017-06-16 0.85 1.03 1.21
2017-06-19 0.85 1.02 1.22
2017-06-20 0.88 1.01 1.22
2017-06-21 0.85 0.99 1.22
2017-06-22 0.80 0.96 1.22
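Note: reindex_axis was deprecated and later removed in newer pandas releases; the equivalent call there is reindex:
df = df.reindex(columns=['1m', '3m', '1yr'])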
With pandas-datareader you can specify the symbols as one list. And in addition to using reindex_axis as suggested by @Andrew L, you can also just select the columns in the order you want with double brackets (see the marked line below).
from pandas_datareader.data import DataReader as dr
import matplotlib.pyplot as plt

syms = ['DGS10', 'DGS5', 'DGS2', 'DGS1MO', 'DGS3MO']
yc = dr(syms, 'fred')  # could specify a start date with the start param here
names = dict(zip(syms, ['10yr', '5yr', '2yr', '1m', '3m']))
yc = yc.rename(columns=names)
yc = yc[['1m', '3m', '2yr', '5yr', '10yr']]  # double brackets: select columns in this order
print(yc)
1m 3m 2yr 5yr 10yr
DATE
2010-01-01 NaN NaN NaN NaN NaN
2010-01-04 0.05 0.08 1.09 2.65 3.85
2010-01-05 0.03 0.07 1.01 2.56 3.77
2010-01-06 0.03 0.06 1.01 2.60 3.85
2010-01-07 0.02 0.05 1.03 2.62 3.85
...          ...   ...   ...   ...   ...
2017-06-16 0.85 1.03 1.32 1.75 2.16
2017-06-19 0.85 1.02 1.36 1.80 2.19
2017-06-20 0.88 1.01 1.36 1.77 2.16
2017-06-21 0.85 0.99 1.36 1.78 2.16
2017-06-22 0.80 0.96 1.34 1.76 2.15
yc.loc['2016-06-01'].plot(label='Jun 1')
yc.loc['2016-06-02'].plot(label='Jun 2')
plt.legend(loc=0)
If you don't want to spell the column order out by hand but still need the columns sorted by tenor (finance notation), you can build your own customized column order like below.
fi_col = df.columns.str.extract(r'(\d)(\D+)', expand=True).sort_values([1, 0]).reset_index(drop=True)
fi_col = fi_col[0] + fi_col[1]
print(df[fi_col])
1m 3m 1yr
Date
2017-06-16 0.85 1.03 1.21
2017-06-19 0.85 1.02 1.22
2017-06-20 0.88 1.01 1.22
2017-06-21 0.85 0.99 1.22
2017-06-22 0.80 0.96 1.22
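One caveat: the (\d) pattern only matches single-digit tenors. A sketch that also handles labels like '10yr' (assuming every column label is digits followed by a unit):
parts = df.columns.str.extract(r'(\d+)(\D+)', expand=True)
parts[0] = parts[0].astype(int)          # numeric sort, so 2yr comes before 10yr
order = parts.sort_values([1, 0]).index  # sort by unit, then by number
print(df[df.columns[order]])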
You can also pull all the historical rates directly from the US Treasury's website (updated daily):
from bs4 import BeautifulSoup
import requests
import pandas as pd
soup = BeautifulSoup(requests.get('https://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData').text, 'lxml')
table = soup.find_all('m:properties')

tenors = ['1month', '2month', '3month', '6month', '1year', '2year',
          '3year', '5year', '10year', '20year', '30year']
tbondvalues = []
for i in table:
    tbondvalues.append([i.find('d:new_date').text[:10]] +
                       [i.find('d:bc_' + t).text for t in tenors])

ustcurve = pd.DataFrame(tbondvalues, columns=['date', '1m', '2m', '3m', '6m', '1y',
                                              '2y', '3y', '5y', '10y', '20y', '30y'])
ustcurve.iloc[:, 1:] = ustcurve.iloc[:, 1:].apply(pd.to_numeric) / 100  # percent -> decimal
ustcurve['date'] = pd.to_datetime(ustcurve['date'])
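From there, plotting a single day's curve is one line (the date is illustrative):
ustcurve.set_index('date').loc['2017-06-22'].plot()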
I want to apply a function to slices of a dataframe in pandas, for each row, and return a dataframe in which each value has been transformed relative to its slice.
So, for example
df = pandas.DataFrame(numpy.round(numpy.random.normal(size=(2, 10)),2))
f = lambda x: (x - x.mean())
What I want is to apply the lambda function f to columns 0 to 5 and to columns 5 to 10.
I did this:
a = pandas.DataFrame(f(df.T.iloc[0:5,:]))
but this is only for the first slice. How can I include the second slice, so that my resulting output frame looks exactly like the input frame, just with every data point changed to its value minus the mean of the corresponding slice?
I hope that makes sense. What would be the right way to go about this?
Thank you.
You can simply reassign the result to the original df, like this:
import pandas as pd
import numpy as np
# I'd rather use a function than lambda here, preference I guess
def f(x):
return x - x.mean()
df = pd.DataFrame(np.round(np.random.normal(size=(2,10)), 2))
df.T
0 1
0 0.92 -0.35
1 0.32 -1.37
2 0.86 -0.64
3 -0.65 -2.22
4 -1.03 0.63
5 0.68 -1.60
6 -0.80 -1.10
7 -0.69 0.05
8 -0.46 -0.74
9 0.02 1.54
# make a copy of df here
df1 = df.copy()
# reassign the transformed slices back to the copy
df1.T[:5], df1.T[5:] = f(df.T.iloc[0:5,:]), f(df.T.iloc[5:,:])
df1.T
0 1
0 0.836 0.44
1 0.236 -0.58
2 0.776 0.15
3 -0.734 -1.43
4 -1.114 1.42
5 0.930 -1.23
6 -0.550 -0.73
7 -0.440 0.42
8 -0.210 -0.37
9 0.270 1.91
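A caveat on the assignment through df1.T: transpose generally returns a new object, and in recent pandas (especially with copy-on-write) writes to it are not guaranteed to reach df1. A minimal sketch of an equivalent that operates on column slices directly:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.round(np.random.normal(size=(2, 10)), 2))

df1 = df.copy()
# subtract each slice's per-row mean (mean over columns 0-4, then over columns 5-9)
df1.iloc[:, :5] = df.iloc[:, :5].sub(df.iloc[:, :5].mean(axis=1), axis=0)
df1.iloc[:, 5:] = df.iloc[:, 5:].sub(df.iloc[:, 5:].mean(axis=1), axis=0)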