Cross tab with mean of one column - python

Using Pandas and Matplotlib, how can I make a bar plot from a cross tab of two columns, where one column is just the mean? Here is an example of my data set:
score lunch setting
70 N Sub
69 N Sub
62 Y Urb
78 N R
60 Y R
58 Y Urb
80 N Sub
75 N Urb
70 N R
70 N Urb
69 N Sub
70 N Urb
What I would like to do is get
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("my file path")
pd.crosstab(df["score"], df["lunch"]).plot(kind="bar", figsize=(8,2))
plt.show()
#pd is pandas and df is my data frame
with the "score" column being the mean of all scores rather than the individual scores.
After running plt.show() this is the plot that I get:
What I would like is two adjacent bars, with the y-axis showing the mean score for lunch 'N' and the mean score for lunch 'Y'.
I have tried
df_grouped = df.groupby(["lunch"])["score"].mean()
df_grouped.plot(kind="bar", figsize=(7,2))
This seems to look alright except I would like to be able to get the legend and have the two bars be side by side. Here is what it looks like by grouping first:
I would like to know if I can do this by using crosstab first without having to group? I need to keep the legend and also have the two bars side by side.
My thought would be something that looks like this:
pd.crosstab(df["score"].mean(), df["lunch"]).plot(kind="bar",figsize=(6,3))
Getting the mean of each lunch using crosstab.

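Try with to_frame, then transpose so that each lunch value becomes its own column (and therefore its own bar and legend entry):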
df.groupby('lunch')['score'].mean().to_frame().T.plot.bar()
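If you would rather stay with crosstab, as the question asks, one possible sketch is below. It uses the sample rows from the question and relies on crosstab's values and aggfunc arguments. The constant row label "mean score" and the series name "stat" are made up here purely to collapse the table into a single row; they are not part of the original data.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "score": [70, 69, 62, 78, 60, 58, 80, 75, 70, 70, 69, 70],
    "lunch": ["N", "N", "Y", "N", "Y", "Y", "N", "N", "N", "N", "N", "N"],
    "setting": ["Sub", "Sub", "Urb", "R", "R", "Urb",
                "Sub", "Urb", "R", "Urb", "Sub", "Urb"],
})

# A constant row label (hypothetical name) puts every record in one row,
# so crosstab only splits by lunch and averages the scores in each column.
row = pd.Series("mean score", index=df.index, name="stat")
table = pd.crosstab(row, df["lunch"], values=df["score"], aggfunc="mean")
print(table)
```

Calling table.plot(kind="bar", rot=0) on the result should then draw the N and Y bars side by side with a lunch legend, much like the to_frame().T approach.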

Related

Show that when values rise in one column, so do the values in another

I'm working with a COVID dataset for some Python exercises I'm working through to learn. I've loaded it the usual way:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("C:/Users/Desktop/Python Short Course/diagnosis.csv")
In this dataset there are two columns called BodyTemp and SpO2, and I want to show that their values move together: when the values rise in the BodyTemp column, so do the values in the SpO2 column. I had thought of maybe doing a bar chart like:
plt.xlabel("BodyTemp")
plt.ylabel("SpO2")
plt.bar(x = df["BodyTemp"], height = df["SpO2"])
plt.show()
but all the bars are very close together and it just doesn't look great, so what would be a better way to do this? Or would there be a better approach to show the visualisation of the distribution of values?
Edit: to show screenshot of graph
Edit to show data:
BodyTemp SpO2
37.6 85
38.9 93
38.5 92
37 80
I've added a table showing the first few rows; there are many more, but it gives an idea of the data.
You need to change the scale of the y-axis. Try this:
plt.ylim((df['SpO2'].min()-.5, df['SpO2'].max()+.5))
If this doesn't work, it's probably because there are very small values in the SpO2 column. These gaps between the bars may be small values that are distorting the data. Try removing them from the dataframe.
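As an alternative to the bar chart, a scatter plot together with a correlation coefficient is a common way to show whether two columns rise together. The sketch below is only an illustration: it uses the four sample rows from the question rather than the full diagnosis.csv file, and the choice of Pearson correlation is an assumption about what "similar" should mean here.

```python
import matplotlib.pyplot as plt
import pandas as pd

# First few rows shown in the question; the real file has many more.
df = pd.DataFrame({
    "BodyTemp": [37.6, 38.9, 38.5, 37.0],
    "SpO2": [85, 93, 92, 80],
})

# Pearson correlation: a value close to +1 means the columns rise together.
r = df["BodyTemp"].corr(df["SpO2"])

# One point per row, so overlapping bars are no longer a problem.
fig, ax = plt.subplots()
ax.scatter(df["BodyTemp"], df["SpO2"])
ax.set_xlabel("BodyTemp")
ax.set_ylabel("SpO2")
ax.set_title(f"Pearson r = {r:.2f}")
```

Call plt.show() afterwards to display the figure. On the full dataset the coefficient may well differ from what these four rows suggest.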

How do I format a y-axis 'y' in matplotlib going between pandas dataframes and simple variables?

CSV1only is a dataframe loaded from a CSV.
Suppose CSV1only has a single column such that:
TRADINGITEMID:
1233
2455
3123
1235
5098
as a small example
How can I plot a scatterplot accordingly, specifically the y-axis?
I tried:
import pandas as pd
import matplotlib.pyplot as plt
CSV1only.plot(kind='scatter',x='TRADINGITEMID', y= [1,2], color='b')
plt.xlabel('TRADINGITEMID Numbers')
plt.ylabel('Range')
plt.title('Distribution of ItemIDNumbers')
and it doesn't work because of the y.
So, my main question is just how I can get a 0, 1, 2 y-axis for this scatter plot, as I want to make a distribution graph.
The following code doesn't work because it doesn't match the amount of rows included in the original TRADINGITEMID column, which has 5000 rows:
newcolumn_values = [1, 2]
CSV1only['un et deux'] = newcolumn_values
#and then I changed the y = [1,2] from before into y = ['un et deux']
Therefore the solution would need to work from any integer 1 to N, N being the # of rows. Yet, it would only have a range of [0, 2] or some [0, m], m being some arbitrary integer.
Don't need to worry about the actual pandas data frame CSV1only.
The 'TRADINGITEMID' column contains 5000 rows of unique numbers, so I just want to plot those numbers on a line, with the y-axis being the number of instances (which will never exceed 1, since each number is unique).
I think you are looking for the following. You need to generate y-values from 0 to n-1, where n is the total number of rows:
import numpy as np
y = np.arange(len(CSV1only['TRADINGITEMID']))
plt.scatter(CSV1only['TRADINGITEMID'], y, c='DarkBlue')
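A self-contained sketch of this idea follows, using the five example IDs from the question as a stand-in for the real 5000-row column. It also shows a second option that is not in the answer above and is only a guess at what the asker meant by "instances": plotting how often each ID occurs, which is 1 everywhere when the IDs are unique.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small stand-in for the 5000-row column described in the question.
CSV1only = pd.DataFrame({"TRADINGITEMID": [1233, 2455, 3123, 1235, 5098]})
ids = CSV1only["TRADINGITEMID"]

# Option 1 (the answer above): y is simply the row position, 0 .. n-1.
y_position = np.arange(len(ids))

# Option 2 (assumed reading of "instances"): y is the occurrence count per ID.
y_count = ids.map(ids.value_counts())

fig, (ax_pos, ax_cnt) = plt.subplots(1, 2, figsize=(8, 3))
ax_pos.scatter(ids, y_position, c="darkblue")
ax_pos.set_ylabel("Row position")
ax_cnt.scatter(ids, y_count, c="darkblue")
ax_cnt.set_ylabel("Instances")
ax_cnt.set_ylim(0, 2)  # the fixed [0, 2] range the question mentions
```

Both arrays have exactly one value per row, so they match the column length however many rows the real file has.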

Compact way of visualizing heat maps of correlated data

I am trying to visualize the correlation of the Result column with every other column.
   A_B        A_C       B_C       Result
0  0.318182   0.925311  0.860465  91
1  -0.384030  0.991803  0.996344  12
2  -0.818182  0.411765  0.920000  53
3  0.444444   0.978261  0.944444  64
A_B = (A-B)/(A+B), and correspondingly for all the other columns.
The heatmap works for a small number of columns, but as I increase the number of columns, the rows in the heatmap keep stacking up. Is there a compact way to represent it?
Following code will reproduce the output-
import pandas as pd
import seaborn as sns
data = {'A':[232,243,12,546,67,12,78,11,245],
'B':[120,546,120,210,56,120,56,89,12],
'C':[9,1,5,6,7,43,7,12,64],
'Result':[91,12,53,64,71,436,74,123,641],
}
df = pd.DataFrame(data,columns=['A','B','C','Result'])
#Responsible for (A-B)/(A+B) ,(A-C)/(A+C) and similarly
colnames = df.columns.tolist()[:-1]
for i,c in enumerate(colnames):
    if i!=len(colnames):
        for k in range(i+1,len(colnames)):
            df[c+'_'+colnames[k]]=(df[c]-df[colnames[k]])/(df[c]+df[colnames[k]])
newdf = df[['A_B','A_C','B_C','Result']].copy()
#Plotting A_B,A_C,B_C by ignoring the output of result of itself
plot = pd.DataFrame(newdf.corr().iloc[:-1,-1])
sns.heatmap(plot,annot=True)
A technique I have heard of, but cannot find a source for, represents each correlation factor in mini-rectangles like this:
Treating the given map as a 3*3 matrix with (0,0) at the bottom left, A_B would be represented in (1,1),
A_C in (2,1), and B_C in (2,2).
But I don't understand how to do it.
You can plot the correlation of each column against the Result column and other columns as well. Below is one way to do so. Providing the x- and y-ticklabels guides you better for comparing the correlations. You can also annotate the correlation values to be displayed on the heat map.
cor = newdf.corr()
sns.heatmap(cor, xticklabels=cor.columns.values,
yticklabels=cor.columns.values, annot=True)
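For the compact "mini-rectangle" layout the question describes, one possible sketch is to place each pair's correlation with Result in a square grid indexed by the base columns, so that n(n-1)/2 pairs occupy an n by n grid instead of one long column. The names grid and base are introduced here for illustration, and the exact cell placement is an assumption about the layout the asker had in mind.

```python
import numpy as np
import pandas as pd

data = {"A": [232, 243, 12, 546, 67, 12, 78, 11, 245],
        "B": [120, 546, 120, 210, 56, 120, 56, 89, 12],
        "C": [9, 1, 5, 6, 7, 43, 7, 12, 64],
        "Result": [91, 12, 53, 64, 71, 436, 74, 123, 641]}
df = pd.DataFrame(data)
base = ["A", "B", "C"]

# grid.loc[X, Y] holds corr((X - Y) / (X + Y), Result) for each pair X before Y;
# every other cell stays NaN.
grid = pd.DataFrame(np.nan, index=base, columns=base)
for i, first in enumerate(base):
    for second in base[i + 1:]:
        ratio = (df[first] - df[second]) / (df[first] + df[second])
        grid.loc[first, second] = ratio.corr(df["Result"])
print(grid.round(2))
```

Passing this to sns.heatmap(grid, annot=True) should draw only the filled cells, since seaborn masks missing values automatically.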

Calculating angle between two points in time-series

I have time-series data and I am trying to calculate the angle (in degrees) between two points. Here is what I did so far, but it doesn't seem to give the correct solution:
import numpy as np
import pandas as pd

pts = 2  # number of rows between the two points being compared
df = pd.read_csv("EURUSD.csv")
df = df.reset_index()
df['A'] = np.rad2deg(np.arctan2(df['Low']-df['Low'].shift(pts), df['index']-df['index'].shift(pts)))
df.dropna(inplace=True)
However, sometimes this gives me weird outputs like:
2693 3.141258
2702 -3.141383
2708 -3.141451
2719 -3.141033
2724 -3.140893
2734 3.141550
I have also tried the following code:
df['A'] = ((df['Low']-df['Low'].shift(pts))/(df['index']-df['index'].shift(pts)))
2693 -0.000334
2702 0.000210
2708 0.000142
2719 0.000560
2724 0.000700
2734 -0.000043
What am I doing wrong here?
EDIT:
Here is a screenshot of what I'm trying to do. I'm simply trying to find that -48 degree angle in Python. I am not trying to get these points automatically. I have spotted them manually and just need to do the calculation.
I guess your question is how to calculate the angle between two lines, where each line is defined by a single point and a common origin, and you then want to perform this operation for a series of x1, x2 points recorded over time.
Here you can find the arithmetics and here an example.
To get your line angle between the two points, you'll need the following:
price difference (looks like 1.29250 - 1.29650 = -0.004)
number of bars between the two points (that appears to be 10 bars)
Price to Bar ratio (you'll have to look at the settings for that particular graph)
price_diff = -0.004
bars = 10
price_to_bar = unknown
x = bars * price_to_bar
Final output:
import numpy as np
round(np.angle(complex(x, price_diff), deg=True), 0)
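To make the dependence on the chart scaling concrete, here is a minimal sketch of the same calculation with the standard library. The function name line_angle_deg is hypothetical, and price_to_bar = 0.0004 is an assumed value chosen only so the arithmetic is easy to follow; the real ratio has to come from the chart settings and is still unknown here.

```python
import math


def line_angle_deg(price_diff, bars, price_to_bar):
    """Angle of the line in chart space, in degrees (negative = sloping down)."""
    # Convert the horizontal distance from bars into price units so that
    # both legs of the triangle are measured on the same scale.
    x = bars * price_to_bar
    return math.degrees(math.atan2(price_diff, x))


# price_to_bar=0.0004 is an ASSUMED value for illustration only.
angle = line_angle_deg(price_diff=-0.004, bars=10, price_to_bar=0.0004)
print(round(angle, 1))
```

With this assumed ratio both legs have the same length, so the result is about -45 degrees. A different ratio gives a different angle for the same two points, which is why the raw index difference used in the question does not reproduce the -48 degrees shown on the chart.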

Plot several densities on one plot

I have a data frame with a MultiIndex (expenditure, groupid):
                      coef       stderr    N
expenditure groupid
TOTEXPCQ    176       3745.124   858.1998  81
            358       -1926.703  1036.636  75
            109       239.3678   639.373   280
            769       6406.512   1823.979  96
            775       2364.655   1392.187  220
I can get the density using df['coef'].plot(kind='density'). I would like to group these densities by the outer level of the MultiIndex (expenditure), and draw the different densities for different levels of expenditure into the same plot.
How would I achieve this? Bonus: label the different expenditure graphs with the 'expenditure' value
Answer
My initial approach was to merge the different kdes by generating one ax object and passing that along, but the accepted answer inspired me to rather generate one df with the group identifiers as columns:
import numpy as np
import pandas as pd

n = 25
df = pd.DataFrame({'expenditure' : np.random.choice(['foo','bar'], n),
                   'groupid' : np.random.choice(['one','two'], n),
                   'coef' : np.random.randn(n)})
df2 = df[['expenditure', 'coef']].pivot_table(index=df.index, columns='expenditure', values='coef')
df2.plot(kind='kde')
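When the frame already has the (expenditure, groupid) MultiIndex from the question, the same "one column per group" shape can probably be reached without pivot_table by unstacking the outer index level. The sketch below uses random toy data, not the real coefficients, and the level values foo and bar are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the toy data is repeatable

# Toy frame with the same index layout as the question: (expenditure, groupid).
index = pd.MultiIndex.from_product(
    [["foo", "bar"], range(10)], names=["expenditure", "groupid"]
)
df = pd.DataFrame({"coef": rng.standard_normal(len(index))}, index=index)

# Move the outer level into the columns: one column per expenditure value.
wide = df["coef"].unstack(level="expenditure")
print(wide.head())
```

Calling wide.plot(kind="density") should then draw one curve per expenditure value on the same axes and label each in the legend (pandas needs SciPy installed for density plots).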
Wow, that ended up being much harder than I expected. Seemed easy in concept, but (yet again) concept and practice really differed.
Set up some toy data:
n = 25
df = pd.DataFrame({'expenditure' : np.random.choice(['foo','bar'], n),
                   'groupid' : np.random.choice(['one','two'], n),
                   'coef' : np.random.randn(n)})
Then group by expenditure, iterate through each expenditure, pivot the data, and plot the kde:
import matplotlib.pyplot as plt

gExp = df.groupby('expenditure')
for exp in gExp:  # each item is a (group name, sub-frame) pair
    print(exp[0])
    gGroupid = exp[1].groupby('groupid')
    g = exp[1][['groupid','coef']].reset_index(drop=True)
    gpt = g.pivot_table(index = g.index, columns='groupid', values='coef')
    gpt.plot(kind='kde').set_title(exp[0])
plt.show()
Results in:
It took some trial and error to figure out the data had to be pivoted before plotting.
