I have a dataframe with a column (C_NC) containing two values, C and NC. I plotted the frequency of the C and NC values with
df['C_NC'].value_counts().plot(kind='bar')
Though this graph is nice, I also want the exact frequency shown on each bar of the bar chart. I am quite new to data visualization with pandas DataFrames. Is there a way to do this with a pandas DataFrame?
Use:
s=df['C_NC'].value_counts()
s.plot(kind='bar',yticks=s)
Example
As you can see, this reproduces the same problem:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
s1=pd.Series(np.random.randint(0,2,300))
s=s1.value_counts()
print(s)
1 156
0 144
dtype: int64
s1.value_counts().plot(kind='bar')
We can now show the exact values as y-axis ticks:
s.plot(kind='bar',yticks=s)
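The yticks approach puts the counts on the y-axis rather than on the bars themselves. If the number should sit on top of each bar, one possible sketch is to annotate every bar patch. The small DataFrame below is a made-up stand-in for the asker's data, and the Agg backend line is only there so the snippet runs without a display (drop it in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the asker's df['C_NC'] column
df = pd.DataFrame({"C_NC": ["C", "NC", "C", "C", "NC"]})

counts = df["C_NC"].value_counts()
ax = counts.plot(kind="bar")

# Write each bar's height just above the bar, centred horizontally
labels = []
for patch in ax.patches:
    height = int(patch.get_height())
    ax.annotate(
        str(height),
        (patch.get_x() + patch.get_width() / 2, patch.get_height()),
        ha="center",
        va="bottom",
    )
    labels.append(str(height))
```

On matplotlib 3.4 or newer, ax.bar_label(ax.containers[0]) should give a similar result in one call.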
Related
Python newbie here. I'm looking at some daily weather data for a couple of cities over the course of a year. Each city has its own csv file. I'm interested in comparing the count of daily average temperatures between two cities in a bar graph, so I can see (for example) how often the average temperature in Seattle was 75 degrees (or 30 or 100) compared to Phoenix.
I'd like a bar graph with side-by-side bars, with temperature on the x-axis and count on the y-axis. I've been able to get a bar graph of each city separately with this data, but I don't know how to get both cities on the same bar chart with a different color for each city. It seems like it should be pretty simple, but hours of searching haven't gotten me a good answer yet.
Suggestions please, oh wise stackoverflow mentors?
Here's what I've got so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("KSEA.csv")
df2 = pd.read_csv("KPHX.csv")
df["actual_mean_temp"].value_counts(sort=False).plot(kind ="bar")
df2["actual_mean_temp"].value_counts(sort = False).plot(kind = 'bar')
You can concat DataFrames, assigning city as a column, and then use histplot in seaborn:
import seaborn as sns
z = pd.concat([
df[['actual_mean_temp']].assign(city='KSEA'),
df2[['actual_mean_temp']].assign(city='KPHX'),
])
ax = sns.histplot(data=z, x='actual_mean_temp', hue='city',
multiple='dodge', binwidth=1)
Output:
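If seaborn is not available, a pandas-only variant is also possible: build one column of counts per city and let DataFrame.plot draw grouped bars. This is only a sketch; the two small frames below are invented stand-ins for the KSEA.csv and KPHX.csv data, and the Agg backend line is there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-ins for pd.read_csv("KSEA.csv") and pd.read_csv("KPHX.csv")
df = pd.DataFrame({"actual_mean_temp": [50, 50, 60]})
df2 = pd.DataFrame({"actual_mean_temp": [60, 90, 90]})

# One column of counts per city; the temperatures are aligned on the index
counts = (
    pd.DataFrame({
        "KSEA": df["actual_mean_temp"].value_counts(),
        "KPHX": df2["actual_mean_temp"].value_counts(),
    })
    .fillna(0)      # a temperature seen in only one city counts as 0 in the other
    .astype(int)
    .sort_index()
)

# Each column becomes its own colored bar, side by side per temperature
ax = counts.plot(kind="bar")
ax.set_xlabel("actual_mean_temp")
ax.set_ylabel("count")
```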
Here is my problem
This is a sample of my two DataFrames (I have 30 columns in reality)
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.DataFrame({"Marc":[6,0,8,-30,-15,0,-3],
"Elisa":[0,1,0,-1,0,-2,-4],
"John":[10,12,24,-20,7,-10,-30]})
df1 = pd.DataFrame({"Marc":[8,2,15,-12,-8,0,-35],
"Elisa":[4,5,7,0,0,1,-2],
"John":[20,32,44,-30,15,-10,-50]})
I would like to create a scatter plot with two different colors:
one color if the scores of df1 are negative and another if they are positive, but I don't really know how to do it.
I already made a basic scatter plot with matplotlib:
plt.scatter(df,df1);
I also checked a linked answer, but the problem is that I have two pandas DataFrames
and not NumPy arrays as in that answer, so I can't use the c=np.sign(df.y) method.
I would like to keep the pandas DataFrames as I have many columns, but I am really stuck on this.
If anyone has a solution, you are welcome!
You can pass the color array in, but it seems to work with 1D array only:
import numpy as np
# colors as stated
colors = np.where(df1<0, 'C0', 'C1')
# stack and ravel to turn into 1D
plt.scatter(df.stack(),df1.stack(), c=colors.ravel())
Output:
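A slightly different sketch computes the colors from the stacked df1 values instead of raveling a 2D array, so each color stays tied to the point it describes even if stack() drops some entries (which, depending on the pandas version and arguments, it can do for missing values). It assumes both frames share the same index and columns; the two-row frames are a cut-down version of the question's data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Cut-down version of the question's frames (same index and columns in both)
df = pd.DataFrame({"Marc": [6, -30], "Elisa": [0, -1]})
df1 = pd.DataFrame({"Marc": [8, -12], "Elisa": [4, 0]})

# Flatten both frames in the same row-by-row order
x = df.stack()
y = df1.stack()

# One color per point: 'C0' where the df1 score is negative, 'C1' otherwise
colors = np.where(y < 0, "C0", "C1")

fig, ax = plt.subplots()
ax.scatter(x, y, c=colors)
```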
I am creating a barplot by using groupby showing the success rate of an individual for Calendar year 2012. This works well. X axis= S_L's and Y axis is the success rate%. I have a column in my dataset for the success (1 or 0).
ax=df[df['CY']==2012].groupby('S_L').success.mean().sort_values(ascending=False).plot(kind='bar',stacked=False)
Instead of showing the value of each bar, I want to show the calculation behind the mean, i.e. the total count for each group and the count where success (which is a flag) = 1, i.e. the numerator. For example, if the bar shows 90%, calculated as 9 successes (numerator) out of 10 (overall count for the given S_L group), I want to show n=9 and n=10 for that bar.
I looked at the post "Add labels to barplots", and it works when I display the values for the bars.
However, I don't know how to add the values for the calculation. As I am also sorting the values in descending order, I don't know how to do this. Please help.
My code:
import pandas as pd
from os import path
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
fname=path.expanduser(r'Test file.xlsx')
df=pd.read_excel(io=fname,sheet_name='Sheet1')
ax=df.groupby('S_L').success.mean().sort_values(ascending=False).plot(kind='bar',stacked=False)
vals = ax.get_yticks()
ax.set_ylabel('Success Rate')
ax.set_yticklabels(['{:,.2%}'.format(x) for x in vals])
Below is the dataset image
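One possible approach, sketched here under assumptions: compute the mean, the number of successes, and the group size in a single aggregation, sort that table once, and label the bars from it so the labels follow the sorted order. The column names S_L and success come from the question; the data and the label wording are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the Excel sheet
df = pd.DataFrame({
    "S_L": ["A", "A", "A", "A", "B", "B"],
    "success": [1, 1, 1, 0, 1, 0],
})

# Rate, numerator and denominator per group, sorted together by rate
stats = (
    df.groupby("S_L")["success"]
    .agg(rate="mean", wins="sum", total="count")
    .sort_values("rate", ascending=False)
)

ax = stats["rate"].plot(kind="bar")
ax.set_ylabel("Success Rate")

# Bars are drawn in the same order as the rows of stats
labels = []
for patch, wins, total in zip(ax.patches, stats["wins"], stats["total"]):
    text = f"n={wins} of {total}"
    ax.annotate(
        text,
        (patch.get_x() + patch.get_width() / 2, patch.get_height()),
        ha="center",
        va="bottom",
    )
    labels.append(text)
```

Because the plot and the labels are both read from the same sorted table, the descending sort does not break the pairing between a bar and its counts.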
I'm looking to make a stacked area plot over time, based on summary data created by groupby and sum.
The groupby and sum part correctly groups and sums the data I want, but it seems the resultant format is nonsense in terms of plotting it.
I'm not sure where to go from here:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.DataFrame({'invoice':[1,2,3,4,5,6],'year':[2016,2016,2017,2017,2017,2017],'part':['widget','wonka','widget','wonka','wonka','wonka'],'dollars':[10,20,30,10,10,10]})
#drop the invoice number from the data since we don't need it
df=df[['dollars','part','year']]
#group by year and part, and add them up
df=df.groupby(['year','part']).sum()
#plotting this is nonsense:
df.plot.area()
plt.show()
To chart multiple series, it's easiest to have each series organized as a separate column, i.e. replace
df=df.groupby(['year','part']).sum()
with
df=df.groupby(['year', 'part']).sum().unstack(-1)
Then the rest of the code should work. But, I'm not sure if this is what you need because the desired output is not shown.
df.plot.area() then produces a stacked area chart with one band per part.
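To make the reshaping concrete, here is a small sketch using the question's data. It differs slightly from the answer by selecting the dollars column before summing, which is an assumption made here to keep the resulting column labels flat (widget, wonka) rather than a two-level index. The Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "invoice": [1, 2, 3, 4, 5, 6],
    "year": [2016, 2016, 2017, 2017, 2017, 2017],
    "part": ["widget", "wonka", "widget", "wonka", "wonka", "wonka"],
    "dollars": [10, 20, 30, 10, 10, 10],
})

# Sum dollars per (year, part), then move 'part' from the index into columns
wide = df.groupby(["year", "part"])["dollars"].sum().unstack("part")

# One row per year, one column per part: the shape plot.area() expects
ax = wide.plot.area()
```

If some year/part combination were missing, unstack would leave a NaN there; passing fill_value=0 to unstack is one way to handle that.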
I am trying to create a plot extracting points from a .txt file. The points are separated by 'tab' space only. Also, there are too many points to be accommodated in only one column, so they have been spread over 3 columns. However, when I plot in matplotlib, I am a little suspicious I am not seeing all the numbers plotted. It may be the case the data is plotted only over the first column and is ignoring the other two columns.
Here is a sample of such a data file: https://www.dropbox.com/s/th6uwrk2xdnmhyi/n1l2m2.txt?dl=0
I also attached the simple code I am using to plot:
import matplotlib.pyplot as plt
%matplotlib inline
import sys
import os
import numpy
from pylab import *
exp_sum = '/home/trina/Downloads/n1l2m2.txt'
a= numpy.loadtxt(exp_sum, unpack =True)
plt.plot(a)
show()
and here is the output image:
I would like to know whether this plot covers all the points in my data file. Any suggestions are much appreciated.
By doing plt.plot(a), you are passing a 2-dimensional array (your three columns of data) to be plotted on a 2-dimensional graph.
From the matplotlib docs for plot:
"If x and/or y is 2-dimensional, then the corresponding columns will be plotted."
So, your graph output is:
column 0 values at x = 0
column 1 values at x = 1
column 2 values at x = 2
Adding the following to the code:
for i in range(0,len(a)):
print('a'+str(i),max(a[i]),min(a[i]))
Outputs the following:
stats max min
a0 0.9999 0.0
a1 0.9856736 0.3736717
a2 -0.003469009 -0.08896232
Using the mouseover position readout with matplotlib, this looks correct.
As a general point about graphs, I'd recommend using histograms, boxplots or violin plots if you want to visualise the frequency (and other stats) of data sets. See the matplotlib examples for histograms, boxplots and violin plots.
Edit: from the shading on the graph you have, it also looks like it does contain all the points, as your data columns are long tails when plotted individually. The long tail graphs correlate to the shading on the graph you have.
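To see the behaviour described above without the original file, here is a minimal sketch with random numbers standing in for the array returned by numpy.loadtxt(..., unpack=True), which has one row per file column. The shape (3, 50) is an assumption for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for numpy.loadtxt(exp_sum, unpack=True): 3 rows, 50 values each
rng = np.random.default_rng(0)
a = rng.random((3, 50))

# plt.plot(a) style: every column of the array becomes one line with 3 points (x = 0, 1, 2)
fig0, ax0 = plt.subplots()
default_lines = ax0.plot(a)

# Plotting each file column as its own series instead: 3 lines with 50 points each
fig1, ax1 = plt.subplots()
per_column_lines = []
for i, column in enumerate(a):
    per_column_lines.extend(ax1.plot(column, label=f"column {i}"))
ax1.legend()
```

In both versions every value is drawn; the difference is only in how the values are grouped into lines.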