Update legend when adding data to existing plot (Pandas) - python

I've written a short Python script to read Covid statistics from ourworldindata.org and plot a chosen data series for a chosen country.
from pandas import read_csv
import pandas as pd
import matplotlib.pyplot as plt
filename = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
dataset = read_csv(filename)
dataset["date"] = pd.to_datetime(dataset["date"])
country = "Norway"
data = "new_cases"
mask = dataset["location"] == country
dataset.loc[mask].set_index("date")[data].plot()
plt.ylabel(data)
plt.legend([country])
plt.show()
It works as intended and plots the number of new cases in Norway as a function of date in the example above. If I change "country" and rerun it, it will plot a new curve for the new country with a different color in the same plot, which is what I want. But there's a problem with the legend. It shows the name of the last plotted country but the color of the first plotted country. I would like it to show both with the correct name and color. How can I do that?
The link shows a figure with the result when first plotting Norway (blue curve) and then Denmark (yellow curve):
Plot of new cases in Norway and Denmark

I'm not sure exactly how you "rerun" the code, but you can define your countries in a list and plot them in a loop:
import pandas as pd
import matplotlib.pyplot as plt
filename = "owid-covid-data.csv"
dataset = pd.read_csv(filename)
dataset["date"] = pd.to_datetime(dataset["date"])
countries = ["Denmark", "Norway"]
data = "new_cases"
for country in countries:
    mask = dataset["location"] == country
    dataset.loc[mask].set_index("date")[data].plot()
plt.ylabel(data)
plt.legend(countries)
plt.show()
Or you can use seaborn instead of the loop:
import seaborn as sns
df = dataset[dataset["location"].isin(countries)][["date", "location", data]]
sns.lineplot(data=df, x="date", y=data, hue="location")
plt.show()
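A variation on the loop above, sketched under assumptions: passing label=country to each plot call and then calling ax.legend() without arguments lets matplotlib pair every name with its own line, so the legend does not depend on the order of a hand-written list. The small dataset built here is made-up stand-in data with the same date, location, and new_cases columns, not the real OWID file.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in data (assumption): same column names as the OWID CSV.
dates = pd.date_range("2021-01-01", periods=5, freq="D")
dataset = pd.DataFrame({
    "date": list(dates) * 2,
    "location": ["Norway"] * 5 + ["Denmark"] * 5,
    "new_cases": [10, 12, 9, 15, 11, 20, 18, 25, 22, 19],
})

countries = ["Norway", "Denmark"]
data = "new_cases"

fig, ax = plt.subplots()
for country in countries:
    mask = dataset["location"] == country
    # label= attaches the country name to this specific line
    dataset.loc[mask].set_index("date")[data].plot(ax=ax, label=country)

ax.set_ylabel(data)
ax.legend()  # builds the legend from the line labels, one entry per line
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```

With an interactive backend, plt.show() displays the figure as in the original code.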

Related

How can I plot a bar chart from the variable m, and how can I extract each column and save it in a list?

import pandas as pd
import matplotlib.pyplot as p
from IPython.display import display
survey = pd.read_csv('Video_Game_Sales.csv')
l = []
# the variable m sorts and sums the sales per genre
m = survey.groupby('Genre')[['Global_Sales']].sum().sort_values('Global_Sales')
x = survey["Genre"]
# here the error mentions a 2D array: how do I extract the columns from the array and use them as the x and y axes?
df = pd.DataFrame({"Genre": x, "Global_Sales": m})
ax = df.plot.bar(x='Genre', y="Global_Sales", rot=0)
ax.plot()
p.show()
You can plot directly from the dataframe created by the groupby sum.
The dataframe m looks like this (here using test data). Note that it contains a column with the sales and an index holding the genres.
                  Global_Sales
Genre
Other                  5434900
Puzzle                 5779272
MMO                    7885381
Sports                 8177998
Simulation            10883251
Action                11022618
Action-adventure      11821294
Role-playing          12010055
Strategy              12874008
Adventure             14073635
You can call m.plot.bar() to create a bar plot. (Or m.plot.barh() for horizontal bars.)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# first create some dummy data for testing purposes
genres = ['Action', 'Action-adventure', 'Adventure', 'Puzzle', 'Role-playing',
          'Simulation', 'Strategy', 'Sports', 'MMO', 'Other']
survey = pd.DataFrame({'Genre': np.random.choice(genres, 200),
                       'Global_Sales': np.random.randint(10000, 1000000, 200)})
# create a dataframe with the sums of sales per genre
m = survey.groupby('Genre')[['Global_Sales']].sum().sort_values('Global_Sales')
# plot the dataframe
ax = m.plot.bar(rot=0, figsize=(12, 5))
ax.ticklabel_format(axis='y', style='plain') # prevent scientific notation
plt.tight_layout()
plt.show()
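The question also asks how to get each column into a list. A minimal sketch, assuming m has the shape shown above (genres in the index, one Global_Sales column); the small survey frame here is invented for illustration and does not come from Video_Game_Sales.csv.

```python
import pandas as pd

# Invented sample data (assumption) standing in for Video_Game_Sales.csv.
survey = pd.DataFrame({
    "Genre": ["Action", "Puzzle", "Action", "Sports", "Puzzle"],
    "Global_Sales": [5.0, 1.0, 3.0, 2.5, 0.5],
})

m = survey.groupby("Genre")[["Global_Sales"]].sum().sort_values("Global_Sales")

# The genres live in the index, the summed sales in the single column.
genre_list = m.index.tolist()
sales_list = m["Global_Sales"].tolist()

print(genre_list)
print(sales_list)
```

These two lists can be passed to plt.bar(genre_list, sales_list) if a plain matplotlib call is preferred over m.plot.bar().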

Python Pie chart in Matplotlib or Altair from two categorical data

This is my sample data frame
Column A    Column B
Paris       Gas
Paris       Solar
Paris       Solar
London      Oil
London      Solar
London      Oil
I want to create one pie chart for each unique value in Column A, with the slice sizes coming from Column B. For example, for this data frame, I want one pie chart for Paris and one for London.
How can I do this in a single piece of code?
Thank you
The easiest way, I think, is to cross-tabulate your data:
#data generation
import numpy as np
np.random.seed(1234)
n=20
city = np.random.choice(["Paris", "London"], n)
cat = np.random.choice(["Gas", "Oil", "Sun"], n)
from matplotlib import pyplot as plt
import pandas as pd
df = pd.DataFrame({"A": city, "B": cat})
#count the occurrences of each category in an A x B table
df_count = df.pivot_table(index="B", columns="A", fill_value=0, aggfunc="size")
#plot the pie chart using pandas convenience wrapper for matplotlib
ax = df_count.plot.pie(subplots=True, figsize=(10, 5))
plt.show()
For a quick overview, the pandas plotting routine is sufficient. For better control over the final image, you may want to plot df_count directly using matplotlib.
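As a rough sketch of that matplotlib route, assuming df_count has the categories in the index and one column per city, each city can get its own axes. The tiny data frame below is invented for illustration, and pd.crosstab is used in place of pivot_table to build the same kind of count table.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample data (assumption) matching the question's layout.
df = pd.DataFrame({
    "A": ["Paris", "Paris", "Paris", "London", "London", "London"],
    "B": ["Gas", "Solar", "Solar", "Oil", "Solar", "Oil"],
})

# Rows: categories in B, columns: cities in A, values: occurrence counts.
df_count = pd.crosstab(df["B"], df["A"])

fig, axes = plt.subplots(1, df_count.shape[1], figsize=(10, 5), squeeze=False)
for ax, city in zip(axes.flat, df_count.columns):
    counts = df_count[city]
    counts = counts[counts > 0]  # skip categories that do not occur in this city
    ax.pie(counts, labels=counts.index, autopct="%.0f%%")
    ax.set_title(city)
fig.tight_layout()
```

Working on the axes directly makes it straightforward to adjust titles, label formats, or colors per city.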
Is this what you are looking for?
df_2 = df.groupby(["A"])['B'].nunique()
plt.pie(df_2, labels = df_2.index, autopct='%.0f%%')
plt.show()

Adjust seaborn countplot by hue groups

I have a dataset that looks something like this
status     age_group
failure    18-25
failure    26-30
failure    18-25
success    41-50
and so on...
sns.countplot(y='status', hue='age_group', data=data)
When I countplot the full dataset, I get this:
dataset countplot hued by age_group
My question is: how do I plot a graph that is adjusted for the number of occurrences of each age_group, directly with seaborn? Without the adjustment the graph is really misleading; for example, the >60 age group appears the most simply because it contains more people. I searched the documentation but could not find a built-in function for this case.
Thanks in advance.
The easiest way to show the proportions is via sns.histplot(..., multiple='fill'). To force an order for the age groups and the status, creating ordered categories can help.
Here is some example code, tested with seaborn 0.11.1:
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
import numpy as np
import pandas as pd
data = pd.DataFrame({'status': np.random.choice(['Success', 'Failure'], 100, p=[.7, .3]),
                     'age_group': np.random.choice(['18-45', '45-60', '> 60'], 100, p=[.2, .3, .5])})
data['age_group'] = pd.Categorical(data['age_group'], ordered=True, categories=['18-45', '45-60', '> 60'])
data['status'] = pd.Categorical(data['status'], ordered=True, categories=['Failure', 'Success'])
ax = sns.histplot(y='age_group', hue='status', multiple='fill', data=data)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel('Percentage')
plt.show()
Now, to create the exact plot of the question, some pandas manipulations can build the required dataframe: count the values for each age group and status, then divide these counts by the total for each age group.
Some shortcuts can probably be taken, but this is how I tried to juggle with pandas (edit from a comment by @PatrickFitzGerald: using pd.crosstab()):
# df = data.groupby(['status', 'age_group']).agg(len).reset_index(level=0) \
# .pivot(columns='status').droplevel(level=0, axis=1)
# totals = df.sum(axis=1)
# df['Success'] /= totals
# df['Failure'] /= totals
df = pd.crosstab(data['age_group'], data['status'], normalize='index')
df1 = df.melt(var_name='status', value_name='percentage', ignore_index=False).reset_index()
ax = sns.barplot(y='status', x='percentage', hue='age_group', palette='rocket', data=df1)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel('Percentage')
ax.set_ylabel('')
plt.show()
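To see what normalize='index' does in pd.crosstab, here is a small check on invented data: each row (age group) is divided by its own total, so every row sums to 1.

```python
import pandas as pd

# Invented sample data (assumption), small enough to verify by hand.
data = pd.DataFrame({
    "age_group": ["18-45", "18-45", "> 60", "> 60", "> 60", "> 60"],
    "status": ["Success", "Failure", "Success", "Success", "Success", "Failure"],
})

# Share of each status within each age group.
shares = pd.crosstab(data["age_group"], data["status"], normalize="index")
print(shares)

# Every age group should add up to 1 across the status columns.
row_totals = shares.sum(axis=1)
print(row_totals)
```

Because the shares are computed per age group, a large group such as "> 60" no longer dominates the plot just by having more rows.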

Make Matplotlib Ignore CSV Headings

Trying to create a bar chart using a CSV file and Matplotlib. However, there are two headings (COUNTRY & COST) which means that the code isn't able to run properly and produce the bar chart. How do I edit the code so that it will ignore the headings? The first image is what the CSV file actually looks like and the second image is what the code is able to understand and run.
EDIT: the Python assistant tells me that the error seems to occur on line 14 of the code: price.append(float(row[1]))
import matplotlib.pyplot as plt
import csv
price = []
countries = []
with open("Europe.csv", "r") as csvfile:
    plot = csv.reader(csvfile)
    for idx, row in enumerate(plot):
        if idx == 0:
            continue
        price.append(float(row[1]))
        countries.append(str(row[0]))
plt.style.use('grayscale')
plt.bar( countries, price, label='Europe', color='red')
plt.ylabel('Price in US$')
plt.title('Cost of spotify premium per country')
plt.xticks(rotation=90)
plt.legend(loc='best')
plt.show()
I would use pandas for this. With it you can create the bar plot more easily using DataFrame.plot.bar.
Example using your variables countries and price:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"country": countries, "price": price})
df.plot.bar(x="country", y="price")
plt.show()
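If the standard csv module is preferred, one common way to skip the heading row is to advance the reader once with next() before the loop. A minimal sketch; the in-memory text below is invented and stands in for Europe.csv, whose real contents are not shown.

```python
import csv
import io

# Invented stand-in for Europe.csv (assumption): a heading row, then data rows.
csv_text = "COUNTRY,COST\nNorway,11.5\nDenmark,13.0\n"

price = []
countries = []
with io.StringIO(csv_text) as csvfile:
    reader = csv.reader(csvfile)
    next(reader, None)  # consume the heading row; None avoids an error on an empty file
    for row in reader:
        countries.append(row[0])
        price.append(float(row[1]))

print(countries, price)
```

If float() still fails after the heading is skipped, the file may use a different delimiter (for example ';'), which csv.reader accepts through its delimiter argument.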
Just use pandas.read_csv with skiprows=[0] and header=None, like this:
import pandas as pd
df = pd.read_csv('data.csv',sep=';',skiprows=[0],header=None)
I am using ';' as the separator because I assume your CSV file was created in MS Excel.
But I think you can simply read the CSV file without skiprows, like this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv',sep=';')
price = df['cost']
countries = df['country']
plt.style.use('grayscale')
plt.bar( countries, price, label='Europe', color='red')
plt.ylabel('Price in US$')
plt.title('Cost of spotify premium per country')
plt.xticks(rotation=90)
plt.legend(loc='best')
plt.show()

Uncertain why trendline is not appearing on matplotlib scatterplot

I am trying to plot a trendline for a matplotlib scatterplot and am uncertain why the trendline is not appearing. What should I change in my code to make the trendline appear? Event is a categorical data type.
I've followed what most other stackoverflow questions suggest about plotting a trendline, but am uncertain why my trendline is not appearing.
#import libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pandas.plotting import register_matplotlib_converters
#register datetime converters
register_matplotlib_converters()
#read dataset using pandas
dataset = pd.read_csv("UsrNonCallCDCEvents_CDCEventType.csv")
#convert date to datetime type
dataset['Interval'] = pd.to_datetime(dataset['Interval'])
#convert other columns to numeric type
for cols in list(dataset):
    if cols != 'Interval' and cols != 'CDCEventType':
        dataset[cols] = pd.to_numeric(dataset[cols])
#create pivot of dataset
pivot_dataset = dataset.pivot(index='Interval',columns='CDCEventType',values='AvgWeight(B)')
#create scatterplot with trendline
x = pivot_dataset.index.values.astype('float64')
y = pivot_dataset['J-STD-025']
plt.scatter(x,y)
z = np.polyfit(x,y,1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.show()
This is the graph currently being output. I am trying to get this same graph, but with a trendline: https://imgur.com/a/o18a5Y3
It's also fine if the x axis does not show dates.
A snippet of my dataframe looks like this: https://imgur.com/a/xJAcgEI
I've painted out the irrelevant column names.
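One possible cause, offered only as an assumption since the data is not shown: DataFrame.pivot leaves NaN wherever an Interval/CDCEventType combination is missing, and np.polyfit does not skip NaN values, so the fit may not yield a drawable line. The sketch below uses invented data to show one way of filtering out the gaps before fitting; it also measures x in days since the first timestamp, a swapped-in choice to avoid fitting against very large nanosecond values.

```python
import numpy as np
import pandas as pd

# Invented data (assumption): a linear trend with two gaps, as a pivot might produce.
index = pd.date_range("2021-01-01", periods=6, freq="D")
y = pd.Series([1.0, 3.0, np.nan, 7.0, np.nan, 11.0], index=index)

# Days since the first timestamp instead of raw nanoseconds.
x = (index - index[0]).days.to_numpy(dtype="float64")

# Keep only the points where y is a finite number.
y_values = y.to_numpy()
valid = np.isfinite(y_values)

slope, intercept = np.polyfit(x[valid], y_values[valid], 1)
trend = np.poly1d([slope, intercept])
print(slope, intercept)
```

With these names, plt.scatter(x[valid], y_values[valid]) followed by plt.plot(x, trend(x), "r--") would draw the points and the fitted line together.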
