Smoothing the curve in a line plot - Values interval x axis - python

I'm trying to recreate the following plot:
With an online tool I could create the dataset (135 data points) which I saved in a CSV file with the following structure:
Year,Number of titles available
1959,1.57480315
1959,1.57480315
1959,1.57480315
...
1971,221.4273356
1971,215.2494175
1971,211.5426666
I created a Python file with the following code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('file.csv')
df.plot.line(x='Year', y='Number of titles available')
plt.show()
and I'm getting the following plot:
What can I do to get a smooth line like in the original plot?
How can I get the same values on the x axis as in the original plot?
EDIT: I reworked the data set and formatted the dates properly, and the plot now looks better.
This is how the data set looks now:
Date,Number of available titles
1958/07/31,2.908816952
1958/09/16,3.085527674
1958/11/02,4.322502727
1958/12/19,5.382767059
...
1971/04/13,221.6766907
1971/05/30,215.4918154
1971/06/26,211.7808903
This is the plot I can get with the same code posted above:
The question now is: how can I have the same date range as in the original plot (1958 - mid 1971)?

Try taking the mean of your values grouped by year. This replaces the repeated points within each year with a single average value, which removes the jumps you get at each year. If that does not help, apply a smoothing filter, for example a rolling mean.
df.groupby('Year').mean().plot(kind='line')
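As a rough sketch of the filter route, here is a centered rolling mean applied to date-indexed data shaped like the EDIT in the question, with the x-axis limits set explicitly to cover 1958 to mid 1971. The data below is synthetic (it only stands in for file.csv), the window size of 5 is an arbitrary choice, and the "Smoothed" column name is made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to get a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the CSV in the question (values are made up).
dates = pd.date_range("1958-07-31", "1971-06-26", periods=135)
rng = np.random.default_rng(0)
values = np.linspace(3, 215, 135) + rng.normal(0, 5, 135)
col = "Number of available titles"
df = pd.DataFrame({"Date": dates, col: values})

# Centered rolling mean; the window size (5) is an assumption to tune by eye.
df["Smoothed"] = df[col].rolling(window=5, center=True, min_periods=1).mean()

fig, ax = plt.subplots()
ax.plot(df["Date"], df["Smoothed"])
# Force the x axis to span 1958 to mid 1971.
ax.set_xlim(pd.Timestamp("1958-01-01"), pd.Timestamp("1971-07-01"))
plt.show()
```

A larger window gives a smoother curve but flattens real features, so it is worth trying a few values.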

Related

Python Visualisation Not Plotting Full Range of Data Points

I'm just starting out on using Python and I'm using it to plot some points through Power BI. I use Power BI as part of my work anyway and this is for an article I'm writing alongside learning. I'm aware Power BI isn't the ideal place to be using Python :)
I have a dataset of average banana prices since 1995 (https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1132096/bananas-30jan23.csv)
I've managed to turn that into a nice line chart which plots the average for each month but only shows the yearly labels. The chart is really nice and I'm happy with it, other than the fact that it isn't plotting anything before 1997 or after 2020 even though the date range extends beyond that. Earlier visualisations without the x-axis label grouping plotted all the points, but with it this no longer works.
ChatGPT got me going in circles that never resolved the issue, so I suspect the problem may lie in my understanding of Python. If anyone could help me understand the issue that would be brilliant. I can provide more information if that helps:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Convert the 'Date' column to a datetime format
dataset['Date'] = pd.to_datetime(dataset['Date'])
# Group the dataframe by month and calculate the average price for each month
monthly_average = dataset.groupby(dataset['Date'].dt.strftime('%B-%Y'))['Price'].mean()
# Plot the monthly average price against the month using seaborn
ax = sns.lineplot(x=monthly_average.index, y=monthly_average.values)
# Find the unique years in the dataset
unique_years = np.unique(dataset['Date'].dt.year)
# Set the x-axis tick labels to only be the unique years
ax.xaxis.set_ticklabels(unique_years)
ax.xaxis.set_major_locator(plt.MaxNLocator(len(unique_years)))
# Show the plot
plt.show()
Resulting Chart
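One likely cause, offered as a guess since the chart itself is not reproduced here: strftime('%B-%Y') turns the group keys into strings, so groupby sorts them alphabetically ("April-1995", "April-1996", ...) rather than chronologically, and set_ticklabels then just relabels whatever tick positions exist with the list of years, so the labels no longer correspond to the data. A minimal sketch that keeps real dates on the x axis and lets matplotlib place one tick per year is below. The `dataset` here is synthetic (Power BI normally supplies it), and plain matplotlib is used instead of seaborn to keep the example short.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to get a window
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Power BI `dataset` (prices are made up).
rng = np.random.default_rng(0)
dates = pd.date_range("1995-01-02", "2023-01-30", freq="W-MON")
dataset = pd.DataFrame({"Date": dates, "Price": rng.uniform(0.8, 1.2, len(dates))})

dataset["Date"] = pd.to_datetime(dataset["Date"])
# Group by calendar month, but keep a datetime key so months stay in order.
month_start = dataset["Date"].dt.to_period("M").dt.to_timestamp()
monthly_average = dataset.groupby(month_start)["Price"].mean()

fig, ax = plt.subplots()
ax.plot(monthly_average.index, monthly_average.values)
# One tick per year, labelled with the year, instead of relabelling by hand.
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
fig.autofmt_xdate()
plt.show()
```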

No Output: Bar Graph Using Matplotlib

I have a DataFrame of Airbnb data where each row represents an Airbnb listing. I am trying to plot two columns as a bar plot using Matplotlib.
fig,ax= plt.subplots()
ax.bar(airbnb['neighbourhood_group'],airbnb['revenue'])
plt.show()
My understanding is that this should plot every neighbourhood group on the x axis and the average revenue per neighbourhood group on the y axis (I assumed a bar graph takes the mean value per category by default).
This line of code keeps running without giving me any error, as if it had entered an infinite while loop.
Can someone please suggest what could be wrong?
In the following I have used a sample DataFrame, since none was provided.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Create sample DataFrame
y = np.random.rand(10,2)
y[:,0]= np.arange(10)
df = pd.DataFrame(y, columns=["neighbourhood_group", "revenue"])
Keep in mind that np.random gives different values for the revenue column each time you start the program.
df:
# bar plot
ax = df.plot(x="neighbourhood_group", y="revenue", kind="bar")
Regarding your statement that your code seems to run in a loop: it could be that the DataFrame contains too much data to draw the bar chart in a reasonable time. To say that for sure, however, you would have to provide us with a dataset.
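Worth noting as well: Matplotlib's ax.bar does not aggregate, it draws one bar per row it is given, so passing every listing directly means one bar per listing. A minimal sketch of computing the mean per group first and plotting only the aggregated values is below. The `airbnb` frame here is made up (three placeholder groups "A", "B", "C"), since the real data is not available.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to get a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up listings; the group names and revenue values are placeholders.
rng = np.random.default_rng(0)
airbnb = pd.DataFrame({
    "neighbourhood_group": rng.choice(["A", "B", "C"], size=1000),
    "revenue": rng.uniform(0, 500, size=1000),
})

# ax.bar does not aggregate, so compute the mean per group first.
mean_revenue = airbnb.groupby("neighbourhood_group")["revenue"].mean()

fig, ax = plt.subplots()
ax.bar(mean_revenue.index, mean_revenue.values)  # one bar per group
ax.set_xlabel("neighbourhood_group")
ax.set_ylabel("mean revenue")
plt.show()
```

With the aggregation done up front, the number of bars equals the number of groups regardless of how many listings there are.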

How do you change the spread of the Y axis of pandas box plot?

I am plotting 100 data points for 9 different groups. One group's data points are much larger than all the other groups so when I make a box graph using pandas only that group is shown, while all other groups are smashed to the bottom. Here is what it looks like now: smushed box plot
I would like the Y axis to be more spaced out so that I can see the other groups' box graphs. Here is similar data in a scatter plot that has the spacing I am looking for: well spaced scatter plot
What I have
What I need
Here is my code at the moment:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("residues.csv")
df.plot.box()
plt.show()
It looks like you want y to be log-scaled:
df.plot.box(logy=True)
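A small self-contained sketch of the log-scale suggestion, using made-up data in place of residues.csv (nine hypothetical groups g0 to g8, with g8 scaled up to dwarf the others). A log axis only works for positive values, so this assumes the data contains no zeros or negatives.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to get a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: 100 points for each of 9 groups, one group much larger.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.uniform(1, 10, size=(100, 9)),
    columns=[f"g{i}" for i in range(9)],
)
df["g8"] *= 1000

# logy=True puts the y axis on a log scale so the small groups stay visible.
ax = df.plot.box(logy=True)
plt.show()
```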
Try this:
boxplot = df.boxplot(column=df.columns)
plt.show()
Reference
See the pandas documentation on boxplot: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html

A problem of python plot for large number of data

I am new to Python and am trying to plot a color-magnitude diagram (CMD) for a selected cluster with matplotlib. There are 3,400,000 stars to plot; for each star, the color goes on the x axis and the magnitude on the y axis. My code reads two columns from a CSV file and plots them. When I use part of the data (3,000 stars), the CMD plots successfully, but when I use all the data the plot is a mess (see figure below), and the points seem to be placed by their position in the column instead of by their value. For example, a point with data (0.92, 20.64) should be close to the y axis, but it actually ends up at the far right of the plot just because it sits near the end of the dataset. How can I plot the entire dataset and get a plot like the first figure? Thanks for your time. This is my code:
import matplotlib.pyplot as plt
import pandas as pd
import csv
data = pd.read_csv(r'C:\Users\Peter\Desktop\F275W test.csv', low_memory=False)
# Read the color (x) and magnitude (y) columns
x = data['F275W-F336W']
y = data['F275W']
#remove the axis
plt.axis('off')
plt.plot(x,y, ',')
plt.show()
This is the plot I got for 3,000 stars; it is a CMD.
This is the plot I got for the entire dataset, which is a mess.
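A possible explanation, offered as a guess since the CSV is not shown: if the full file contains even a few non-numeric entries, pandas reads the whole column as strings (object dtype), and matplotlib then treats the values as categories and places them in order of first appearance rather than by value. A minimal sketch of checking for this and converting with pd.to_numeric is below. The tiny inline CSV and its "bad" entry are invented for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to get a window
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample: one non-numeric entry makes the first column object dtype.
csv_text = "F275W-F336W,F275W\n0.92,20.64\n1.10,19.80\nbad,21.00\n0.75,22.10\n"
data = pd.read_csv(io.StringIO(csv_text))
print(data.dtypes)  # inspect the column types before plotting

# Convert to numbers; entries that cannot be parsed become NaN and are dropped.
cols = ["F275W-F336W", "F275W"]
for c in cols:
    data[c] = pd.to_numeric(data[c], errors="coerce")
data = data.dropna(subset=cols)

fig, ax = plt.subplots()
ax.plot(data["F275W-F336W"], data["F275W"], ",")
plt.show()
```

If data.dtypes already reports float64 for both columns on the full file, this is not the cause and the problem lies elsewhere.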

How can I plot a pandas dataframe as a scatter graph? I think I may have messed up the indexing and can't add a new index?

I am trying to plot my steps as a scatter graph and then eventually add a trend line.
I managed to get it to work with df.plot() but it is a line chart.
The following is the code I have tried:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data_file = pd.read_csv('CSV/stepsgyro.csv')
# print(data_file.head())
# put in the correct data types
data_file = data_file.astype({"steps": int})
pd.to_datetime(data_file['date'])
# makes the date definitely the index at the bottom
data_file.set_index(['date'], inplace=True)
# sorts the data frame by the index
data_file.sort_values(by=['date'], inplace=True, ascending=True)
# data_file.columns.values[1] = 'date'
# plot the raw steps data
# data_file.plot()
plt.scatter(data_file.date, data_file.steps)
plt.title('Daily Steps')
plt.grid(alpha=0.3)
plt.show()
plt.close('all')
# plot the cumulative steps data
data_file = data_file.cumsum()
data_file.plot()
plt.title('Cumulative Daily Steps')
plt.grid(alpha=0.3)
plt.show()
plt.close('all')
and here is a screenshot of what it's looking like on my IDE:
any guidance would be greatly appreciated!
You have set the index to be the "date" column. From that moment on, there is no "date" column anymore, hence data_file.date fails.
Two options:
Don't set the index. Sorting doesn't seem to be needed anyways.
Plot the index, plt.scatter(data_file.index, data_file.steps)
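The second option can be sketched as follows, together with the linear trend line mentioned in the question. The step counts are made up in place of CSV/stepsgyro.csv, and fitting against the day number with np.polyfit is just one simple way to get a trend line. Note that the result of pd.to_datetime has to be assigned back, otherwise the conversion is discarded.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to get a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data standing in for the CSV (dates stored as strings, as in a file).
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", periods=30, freq="D")
data_file = pd.DataFrame({
    "date": dates.strftime("%Y-%m-%d"),
    "steps": rng.integers(3000, 12000, 30),
})

# Assign the converted dates back, then index and sort by date.
data_file["date"] = pd.to_datetime(data_file["date"])
data_file = data_file.set_index("date").sort_index()

fig, ax = plt.subplots()
ax.scatter(data_file.index, data_file["steps"])  # the index holds the dates

# Linear trend: fit steps against the number of days since the first date.
days = (data_file.index - data_file.index[0]).days.to_numpy()
slope, intercept = np.polyfit(days, data_file["steps"].to_numpy(), 1)
ax.plot(data_file.index, slope * days + intercept)

ax.set_title("Daily Steps")
ax.grid(alpha=0.3)
plt.show()
```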
I can't figure out just by looking at your example why you are getting that error. However, I can offer a quick and easy solution to plotting your data:
data_file.plot(marker='.', linestyle='none')
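You can use df.plot(kind='scatter', x='date', y='steps') to avoid the line chart. The scatter kind requires both x and y column names, so date has to be a regular column rather than the index.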
