I have a DataFrame of Airbnb data where each row represents an Airbnb listing. I am trying to plot two columns as a bar plot using Matplotlib.
fig,ax= plt.subplots()
ax.bar(airbnb['neighbourhood_group'],airbnb['revenue'])
plt.show()
My expectation is that this plots every neighbourhood group on the x-axis and the average revenue per neighbourhood group on the y-axis (I assumed a bar graph takes the mean value per category by default).
This line of code keeps running without raising any error, as if it had entered an infinite loop.
Can someone please suggest what could be wrong?
In the following I use a sample DataFrame, since none was provided.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Create sample DataFrame
y = np.random.rand(10,2)
y[:,0]= np.arange(10)
df = pd.DataFrame(y, columns=["neighbourhood_group", "revenue"])
Note that np.random gives different values for the revenue column every time you run the program.
# bar plot
ax = df.plot(x="neighbourhood_group", y="revenue", kind="bar")
Regarding your statement that your code seems to run in a loop: it could be that the DataFrame holds so much data that drawing the bar chart takes a very long time. To say that for sure, however, you would have to provide us with the dataset.
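A note on the original call: as far as I know, ax.bar does not aggregate anything. It draws one bar per row it is given, which may be why the call appears to hang on a large DataFrame. Below is a minimal sketch that computes the mean per group first. It assumes the column names from the question; the sample data and the group names are made up.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the real Airbnb data: many rows, few groups.
rng = np.random.default_rng(0)
airbnb = pd.DataFrame({
    "neighbourhood_group": rng.choice(["Bronx", "Brooklyn", "Manhattan"], size=10_000),
    "revenue": rng.random(10_000) * 1_000,
})

# Aggregate first: one mean revenue value per neighbourhood group.
mean_revenue = airbnb.groupby("neighbourhood_group")["revenue"].mean()

# Now ax.bar draws one bar per group instead of one per listing.
fig, ax = plt.subplots()
ax.bar(mean_revenue.index, mean_revenue.values)
ax.set_xlabel("neighbourhood_group")
ax.set_ylabel("mean revenue")
```

Call plt.show() afterwards to display the figure.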
Python newbie here. I'm looking at some daily weather data for a couple of cities over the course of a year. Each city has its own csv file. I'm interested in comparing the count of daily average temperatures between two cities in a bar graph, so I can see (for example) how often the average temperature in Seattle was 75 degrees (or 30 or 100) compared to Phoenix.
I'd like a bar graph with side-by-side bars, with temperature on the x-axis and count on the y-axis. I've been able to get a bar graph of each city separately with this data, but I don't know how to get both cities on the same bar chart with a different color for each city. It seems like it should be pretty simple, but hours of searching haven't gotten me a good answer yet.
Suggestions please, oh wise stackoverflow mentors?
Here's what I've got so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("KSEA.csv")
df2 = pd.read_csv("KPHX.csv")
df["actual_mean_temp"].value_counts(sort=False).plot(kind="bar")
df2["actual_mean_temp"].value_counts(sort=False).plot(kind="bar")
You can concat DataFrames, assigning city as a column, and then use histplot in seaborn:
import seaborn as sns
z = pd.concat([
    df[['actual_mean_temp']].assign(city='KSEA'),
    df2[['actual_mean_temp']].assign(city='KPHX'),
])
ax = sns.histplot(data=z, x='actual_mean_temp', hue='city',
                  multiple='dodge', binwidth=1)
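If you would rather stay with pandas and matplotlib only, a possible alternative is to count the temperatures per city and put the two count Series side by side in one DataFrame. This is a sketch under assumptions: the temperatures below are made up and stand in for the actual_mean_temp columns of the two CSV files.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical temperatures standing in for the KSEA.csv and KPHX.csv columns.
df = pd.DataFrame({"actual_mean_temp": [50, 52, 52, 60, 75]})
df2 = pd.DataFrame({"actual_mean_temp": [75, 75, 80, 90, 52]})

# Count how often each temperature occurs per city, aligned on temperature.
# Temperatures missing for one city get a count of 0.
counts = pd.concat(
    {
        "KSEA": df["actual_mean_temp"].value_counts(),
        "KPHX": df2["actual_mean_temp"].value_counts(),
    },
    axis=1,
).fillna(0).sort_index()

# DataFrame.plot.bar draws one colored bar per column, side by side.
ax = counts.plot.bar()
ax.set_xlabel("actual_mean_temp")
ax.set_ylabel("count")
```

Each column of counts becomes one color in the chart, so adding a third city only means adding another entry to the dictionary.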
I am new to Python and am trying to plot a color-magnitude diagram (CMD) for a selected cluster with matplotlib. There are 3,400,000 stars to plot; for each star, the color goes on the x-axis and the magnitude on the y-axis. My code reads two columns from a CSV file and plots them. When I use part of the data (3000 stars), I can plot a CMD successfully, but when I use all the data, the plot is a mess (see figure below): the points seem to be plotted by their position in the column instead of by their value. For example, a point with data (0.92, 20.64) should be close to the y-axis, but it is actually located at the far right of the plot just because it sits in the last few rows of the dataset. How can I plot the entire dataset and get a plot like the first figure? Thanks for your time. This is my code:
import matplotlib.pyplot as plt
import pandas as pd
import csv
data = pd.read_csv(r'C:\Users\Peter\Desktop\F275W test.csv', low_memory=False)
# Select the color and magnitude columns
x = data['F275W-F336W']
y = data['F275W']
# Remove the axes
plt.axis('off')
plt.plot(x,y, ',')
plt.show()
This is the plot I got for 3000 stars; it's a CMD.
This is the plot I got for the entire dataset, which is a mess.
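One possible cause, which is an assumption since the file is not available: if the full CSV contains even one non-numeric cell, pandas reads the whole column as strings, and matplotlib then treats the values as categories and places them in order of appearance rather than by value. A minimal sketch of checking and fixing that, with made-up sample values in place of the real file:

```python
import pandas as pd

# Hypothetical sample: numeric columns read as strings because of one bad cell.
data = pd.DataFrame({
    "F275W-F336W": ["0.92", "1.10", "bad", "0.35"],
    "F275W": ["20.64", "19.80", "18.20", "21.05"],
})

# Coerce both columns to floats; cells that cannot be parsed become NaN.
x = pd.to_numeric(data["F275W-F336W"], errors="coerce")
y = pd.to_numeric(data["F275W"], errors="coerce")

# Keep only the rows where both values are valid numbers.
valid = x.notna() & y.notna()
x, y = x[valid], y[valid]

print(x.dtype, y.dtype, len(x))
```

If data.dtypes shows object for these columns in the real file, converting them this way before calling plt.plot(x, y, ',') may restore the expected diagram.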
I'm trying to recreate the following plot:
With an online tool I could create the dataset (135 data points) which I saved in a CSV file with the following structure:
Year,Number of titles available
1959,1.57480315
1959,1.57480315
1959,1.57480315
...
1971,221.4273356
1971,215.2494175
1971,211.5426666
I created a Python file with the following code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('file.csv')
df.plot.line(x='Year', y='Number of titles available')
plt.show()
and I'm getting the following plot:
What can I do to get a smooth line like in the original plot?
How can I have the same values in the x axis like in the original plot?
EDIT: I reworked the data set and formatted the dates properly; the plot is now better.
This is how the data set looks now:
Date,Number of available titles
1958/07/31,2.908816952
1958/09/16,3.085527674
1958/11/02,4.322502727
1958/12/19,5.382767059
...
1971/04/13,221.6766907
1971/05/30,215.4918154
1971/06/26,211.7808903
This is the plot I can get with the same code posted above:
The question now is: how can I have the same date range as in the original plot (1958 - mid 1971)?
Try taking the mean of your values grouped by year. This smooths out the discontinuities you get within each year by replacing them with an average value. If that does not help, you could apply one of the many available smoothing filters.
df.groupby('Year').mean().plot(kind='line')
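For the follow-up question about the date range, one option is to parse the dates and set the x-axis limits explicitly. This is a sketch, not a definitive answer: it inlines only the rows shown in the question (the elided rows are left out), and the exact limits, start of 1958 to mid-1971, are my reading of the original plot.

```python
import io
import pandas as pd
import matplotlib.pyplot as plt

# The rows shown in the question, inlined for illustration (elided rows omitted).
csv_text = """Date,Number of available titles
1958/07/31,2.908816952
1958/09/16,3.085527674
1958/11/02,4.322502727
1958/12/19,5.382767059
1971/04/13,221.6766907
1971/05/30,215.4918154
1971/06/26,211.7808903
"""
df = pd.read_csv(io.StringIO(csv_text))

# Convert the date strings into real datetimes so the x-axis is a time axis.
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")

fig, ax = plt.subplots()
ax.plot(df["Date"], df["Number of available titles"])

# Assumed limits: start of 1958 to the middle of 1971.
ax.set_xlim(pd.Timestamp("1958-01-01"), pd.Timestamp("1971-07-01"))
```

With the real file, replace the io.StringIO(csv_text) argument with the path to the CSV and call plt.show() at the end.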
I'm relatively new to Python (in the process of self-teaching) and so this is proving to be quite a learning curve but I'm very happy to get to grips with it. I have a set of data points from an experiment in excel, one column is time (with the format 00:00:00:000) and a second column is the measured parameter.
I'm using pandas to read the excel document in order to produce a graph from it with time along the x-axis and the measured variable along the y-axis. However, when I plot the data, the time column becomes the data point number (i.e. 00:00:00:000 - 00:05:40:454 becomes 0 - 2000) and I'm not sure why. Could anyone please advise how to rectify this?
Secondly, I'd like to produce a subplot that shows the difference between the y-values as a function of time, basically a gradient to show the variation. Is there a way to easily calculate this and display it using pandas?
Here is my code, please do forgive how basic it is!
import pandas as pd
import matplotlib.pyplot as plt
import pylab
df = pd.read_excel('rest.xlsx', 'Sheet1')
df.plot(legend=False, grid=False)
plt.show()
plt.savefig('myfig')
If you just read the Excel file, pandas will create a RangeIndex, starting at 0. To use the time information from your Excel file as the index, specify the name of the time column (as a string) with the keyword argument index_col in the read_excel call:
df = pd.read_excel('rest.xlsx', 'Sheet1', index_col='name_of_time_column')
Just replace 'name_of_time_column' with the actual name of the column that contains the time information.
(Hopefully pandas will automatically parse the time information into a DatetimeIndex; if it does not, the column may need to be converted explicitly.) The plot will then use that index on the x-axis.
To get the difference between consecutive data points, use the diff method with argument 1 on your DataFrame:
difference = df.diff(1)
difference.plot(legend=False, grid=False)
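In case the time column comes in as plain text, here is a hedged sketch of converting the 00:00:00:000 format by hand. It assumes the last field is milliseconds and that the columns are called 'Time' and 'Parameter'; both names and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample with the column names 'Time' and 'Parameter' assumed.
df = pd.DataFrame({
    "Time": ["00:00:00:000", "00:00:01:500", "00:05:40:454"],
    "Parameter": [1.0, 4.0, 4.0],
})

# Turn the last ':' into '.', giving HH:MM:SS.mmm, which pd.to_timedelta accepts.
cleaned = df["Time"].str.replace(r":(\d{3})$", r".\1", regex=True)
df["seconds"] = pd.to_timedelta(cleaned).dt.total_seconds()

# Change in the parameter per second between consecutive rows.
df["rate"] = df["Parameter"].diff() / df["seconds"].diff()

# Use elapsed seconds as the index so plots show time on the x-axis.
df = df.set_index("seconds")
```

Dividing the difference in the parameter by the difference in time gives a rate of change, which is closer to a gradient than the plain difference when the samples are not evenly spaced.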
Try This:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('rest.xlsx', 'Sheet1')
X = df['Time'].tolist()  # if the time column is called 'Time'
Y = df['Parameter'].tolist()  # if the parameter column is called 'Parameter'
plt.plot(X,Y)
plt.gcf().autofmt_xdate()
plt.show()
With matplotlib you can create a figure with two axes and name them, for example, ax_df and ax_diff:
import matplotlib.pyplot as plt
fig, [ax_df, ax_diff] = plt.subplots(nrows=2, ncols=1, sharex=True)
sharex=True specifies to use the same x-axis for both subplots.
When calling plot on the DataFrame, you can redirect the output to one of these axes by passing it as the keyword argument ax:
df.plot(ax=ax_df)
df.diff(1).plot(ax=ax_diff)
plt.show()
It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.
I'm using this syntax for the boxplot so that it automatically generates the box plot of Y vs. X without any external manipulation of the data frame:
df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)
One approach I thought of is to draw a line plot from the mean values of the boxplot, but I'm not sure how to extract that information from the plot.
You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.
First let's generate some sample data:
import pandas as pd
import numpy as np
import seaborn as sns
N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})
print(df.head())
  group     value
0     A  0.816847
1     A  0.468465
2     C  0.871975
3     B  0.933708
4     A  0.480170
...
Next, make the boxplot and save the axis object:
ax = df.boxplot(column='value', by='group', showfliers=True,
                positions=range(df.group.unique().shape[0]))
Note: There's a curious positions argument in Pyplot/Pandas boxplot(), which can cause off-by-one errors. See more in this discussion, including the workaround I've employed here.
Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:
sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
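If you prefer not to depend on seaborn, a similar overlay can be sketched with matplotlib alone. This version leaves out the positions argument, so it relies on the default box positions 1, 2, ..., n; the seeded sample data is only for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sample data in the same shape as above (seeded so the run is repeatable).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.random(150),
    "group": rng.choice(["A", "B", "C"], size=150),
})

# Mean per category; groupby sorts the group labels, as boxplot(by=...) does.
means = df.groupby("group")["value"].mean()

ax = df.boxplot(column="value", by="group", showmeans=True)

# Without a positions argument, the boxes sit at x = 1, 2, ..., n.
positions = np.arange(1, len(means) + 1)
(mean_line,) = ax.plot(positions, means.values, marker="o", color="C1")
```

Swap mean() for median() in the groupby line to connect the medians instead.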
You can get the values of the medians by calling the get_data() method of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.
Let bp be your boxplot created as bp=plt.boxplot(data). Then, bp is a dict containing the medians key, among others. That key contains a list of matplotlib.lines.Line2D, from which you can extract the (x,y) position as follows:
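import numpy as np
import matplotlib.pyplot as plt
# data is your dataset
bp = plt.boxplot(data)
X = []
Y = []
for m in bp['medians']:
    # Each median is a short horizontal line; take its midpoint
    [[x0, x1], [y0, y1]] = m.get_data()
    X.append(np.mean((x0, x1)))
    Y.append(np.mean((y0, y1)))
plt.plot(X, Y, c='C1')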
For an arbitrary dataset (data), this script generates this figure. Hope it helps!