Plot two pandas dataframes in one scatter plot - python

I have two dataframes with the same index and columns like:
import pandas as pd
dfGDPgrowth = pd.DataFrame({'France':[2, 1.8, 3], 'Germany':[3, 2, 2.5]}, index = [2007, 2006, 2005])  # growth in %
dfpopulation = pd.DataFrame({'France':[100, 105, 112], 'Germany':[70, 73, 77]}, index = [2007, 2006, 2005])
Is there a straightforward matplotlib way to create a scatter plot with % growth on the x-axis and population on the y-axis?
Edit: My dataframe has 64 columns, so I wonder if it could be done with a loop so I don't have to input them all manually.

Are you looking for something like this?
import pandas as pd
import matplotlib.pyplot as plt
dfGDPgrowth = pd.DataFrame({'France':[2, 1.8, 3], 'Germany':[3, 2, 2.5]}, index = [2007, 2006, 2005])
dfpopulation = pd.DataFrame({'France':[100, 105, 112], 'Germany':[70, 73, 77]}, index = [2007, 2006, 2005])
for col in dfGDPgrowth.columns:
    plt.scatter(dfGDPgrowth[col], dfpopulation[col], label=col)
plt.legend(loc='best', fontsize=16)
plt.xlabel('Growth %')
plt.ylabel('Population')
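For the 64-column case, an alternative to the loop is to reshape both frames to long form and draw everything with a single scatter call. This is a minimal sketch, assuming both frames share the same index and columns as in the question; the names long_df, growth and population are illustrative and not from the original post.

```python
import pandas as pd
import matplotlib.pyplot as plt

dfGDPgrowth = pd.DataFrame({'France': [2, 1.8, 3], 'Germany': [3, 2, 2.5]}, index=[2007, 2006, 2005])
dfpopulation = pd.DataFrame({'France': [100, 105, 112], 'Germany': [70, 73, 77]}, index=[2007, 2006, 2005])

# stack() turns each frame into a Series indexed by (year, country);
# concat then aligns the two Series on that shared index
long_df = pd.concat(
    {'growth': dfGDPgrowth.stack(), 'population': dfpopulation.stack()},
    axis=1,
)

# one scatter call covers every country, however many columns there are
ax = long_df.plot.scatter(x='growth', y='population')
ax.set_xlabel('Growth %')
ax.set_ylabel('Population')
```

This version drops the per-country legend, which may be preferable with 64 countries; call plt.show() to display the figure when running as a script.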

Related

How can I find the Optimal Price Point and Group By ID?

I have a dataframe that looks like this.
import pandas as pd
# initialise the data as a dict of lists.
data = {'ID':[101762, 101762, 101762, 101762, 102842, 102842, 102842, 102842, 108615, 108615, 108615, 108615, 108615, 108615],
'Year':[2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021, 2021, 2021],
'Quantity':[60, 80, 88, 75, 50, 55, 62, 58, 100, 105, 112, 110, 98, 95],
'Price':[2000, 3000, 3330, 4000, 850, 900, 915, 980, 1000, 1250, 1400, 1550, 1600, 1850]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
Here are some plots of the data.
import matplotlib.pyplot as plt
import seaborn as sns
uniques = df['ID'].unique()
for i in uniques:
    fig, ax = plt.subplots()
    fig.set_size_inches(4,3)
    df_single = df[df['ID']==i]
    sns.lineplot(data=df_single, x='Price', y='Quantity')
    ax.set(xlabel='Price', ylabel='Quantity')
    plt.xticks(rotation=45)
    plt.show()
Now, I am trying to find the optimal price to sell something, before quantity sold starts to decline. I think the code below is pretty close, but when I run the code I get '33272.53'. This doesn't make any sense. I am trying to get the optimal price point per ID. How can I do that?
df["% Change in Quantity"] = df["Quantity"].pct_change()
df["% Change in Price"] = df["Price"].pct_change()
df["Price Elasticity"] = df["% Change in Quantity"] / df["% Change in Price"]
df.columns
import pandas as pd
from sklearn.linear_model import LinearRegression
x = df[["Price"]]
y = df["Quantity"]
# Fit a linear regression model to the data
reg = LinearRegression().fit(x, y)
# Find the optimal price that maximizes the quantity sold
optimal_price = reg.intercept_/reg.coef_[0]
optimal_price
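One possible reason the number looks meaningless: the regression is fitted across all three IDs at once, and intercept_/coef_[0] is (up to sign) the price at which the fitted line reaches zero quantity, not a peak. The sketch below takes a simpler per-ID reading of "before quantity starts to decline". It assumes the optimal price is the observed price at which each ID's quantity is highest; that definition is an assumption, not something stated in the question, and the names peak_rows and optimal are illustrative.

```python
import pandas as pd

data = {'ID': [101762, 101762, 101762, 101762, 102842, 102842, 102842, 102842,
               108615, 108615, 108615, 108615, 108615, 108615],
        'Year': [2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020,
                 2021, 2021, 2021, 2021, 2021, 2021],
        'Quantity': [60, 80, 88, 75, 50, 55, 62, 58, 100, 105, 112, 110, 98, 95],
        'Price': [2000, 3000, 3330, 4000, 850, 900, 915, 980,
                  1000, 1250, 1400, 1550, 1600, 1850]}
df = pd.DataFrame(data)

# row label of the highest quantity within each ID
# (assumed definition of "optimal": the price where quantity peaks)
peak_rows = df.groupby('ID')['Quantity'].idxmax()

# look up the price and quantity on those rows, one row per ID
optimal = df.loc[peak_rows, ['ID', 'Price', 'Quantity']].set_index('ID')
print(optimal)
```

The same grouping can keep the percentage changes from running across ID boundaries, for example df.groupby('ID')['Quantity'].pct_change().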

Python plot category on a day axis

Hello, I have a data set of Date, Category, and Quantity. I want to plot both date and category on the x-axis and quantity on the y-axis, that is, a plot of Quantity vs Category for each day in the data frame.
The question is tagged plotly, hence a plotly answer.
This uses the documented approach at https://plotly.com/python/categorical-axes/#multicategorical-axes
I have simulated a dataframe that has the same structure as the image in your question.
I have deliberately used strings for the dates and the interval index.
import pandas as pd
import numpy as np
import plotly.graph_objects as go
# simulate dataframe shown in question
df = pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [pd.date_range("7-feb-2022", "17-feb-2022"), range(1, 51, 1)],
        names=["Date", "Category"],
    ),
    data=np.random.uniform(1, 25, 550),
    columns=["Quantity"],
).reset_index()
df["Category"] = pd.cut(df["Category"], bins=[0, 10, 20, 30, 40, 50]).astype(str)
df = df.groupby(["Date", "Category"]).sum()
# https://plotly.com/python/categorical-axes/#multicategorical-axes
go.Figure(
    go.Bar(
        x=[
            df.index.get_level_values("Date").strftime("%Y-%m-%d").tolist(),
            df.index.get_level_values("Category").tolist(),
        ],
        y=df["Quantity"],
    )
)

How to order the tick labels on a discrete axis (0 indexed like a bar plot)

I have a dataframe with this data and want to plot it with a bar graph with x-axis labels being months
import pandas as pd
data = {'Birthday': ['1900-01-31', '1900-02-28', '1900-03-31', '1900-04-30', '1900-05-31', '1900-06-30', '1900-07-31', '1900-08-31', '1900-09-30', '1900-10-31', '1900-11-30', '1900-12-31'],
'Players': [32, 25, 27, 19, 27, 18, 18, 21, 23, 21, 26, 23]}
df = pd.DataFrame(data)
Birthday Players
1900-01-31 32
1900-02-28 25
1900-03-31 27
1900-04-30 19
1900-05-31 27
1900-06-30 18
1900-07-31 18
1900-08-31 21
1900-09-30 23
1900-10-31 21
1900-11-30 26
1900-12-31 23
This is what I have
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
fig = plt.figure(figsize=(12, 7))
locator = mdates.MonthLocator()
fmt = mdates.DateFormatter('%b')
X = plt.gca().xaxis
X.set_major_locator(locator)
X.set_major_formatter(fmt)
plt.bar(month_df.index, month_df.Players, color = 'maroon', width=10)
but in the result the labels start from Feb instead of Jan.
Bar plot x-axis tick locations are 0-indexed, not datetimes.
This solution applies to any plot with a discrete axis (e.g. bar, hist, heat, etc.).
Similar to this answer, the easiest solution is as follows. Skip to step 3 if the str column already exists.
1. Convert the 'Birthday' column to a datetime dtype with pd.to_datetime.
2. Extract the abbreviated month name to a separate column.
3. Order the column with pd.Categorical. The built-in calendar module is used to supply an ordered list of abbreviated month names, or the list can be typed manually.
4. Plot the dataframe with pandas.DataFrame.plot, which uses matplotlib as the default backend.
Tested in python 3.8.12, pandas 1.3.4, matplotlib 3.4.3
import pandas as pd
import matplotlib.pyplot as plt
from calendar import month_abbr as ma # ordered abbreviated month names
# convert the Birthday column to a datetime dtype
df.Birthday = pd.to_datetime(df.Birthday)
# create a month column
df['month'] = df.Birthday.dt.strftime('%b')
# convert the column to categorical and ordered
df.month = pd.Categorical(df.month, categories=ma[1:], ordered=True)
# plot the dataframe
ax = df.plot(kind='bar', x='month', y='Players', figsize=(12, 7), rot=0, legend=False)
If there are many repeated months, where the data must be aggregated, then combine the data using pandas.DataFrame.groupby and aggregate some function like .mean() or .sum()
dfg = df.groupby('month').Players.sum()
ax = dfg.plot(kind='bar', figsize=(12, 7), rot=0, legend=False)
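The statement that bar plot tick locations are 0-indexed can be checked directly. This is a small sketch using the first three rows of the question's data; it only inspects the tick positions that pandas sets on the axes.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Birthday': pd.to_datetime(['1900-01-31', '1900-02-28', '1900-03-31']),
    'Players': [32, 25, 27],
})

ax = df.plot(kind='bar', x='Birthday', y='Players', legend=False)

# the bars sit at integer positions, one per row, not at datetime values
print(ax.get_xticks())

# the dates appear only as the text of the tick labels
print([t.get_text() for t in ax.get_xticklabels()])
```

Because the positions are plain integers, a date locator such as mdates.MonthLocator has no real dates to work with on this kind of axis, which is why relabelling or an ordered categorical column is the more reliable route.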
Typically, matplotlib's bar does not handle datetimes very well, for various reasons. It's easy to set your x tick locations and labels manually, as below. ax.set_xticks and ax.set_xticklabels are convenience wrappers around a fixed locator and a fixed formatter, and they let you take control quite easily.
import pandas as pd
import matplotlib.pyplot as plt
#generate data
data = pd.Series({
'1900-01-31' : 32, '1900-02-28' : 25, '1900-03-31' : 27,
'1900-04-30' : 19, '1900-05-31' : 27, '1900-06-30' : 18,
'1900-07-31' : 18, '1900-08-31' : 21, '1900-09-30' : 23,
'1900-10-31' : 21, '1900-11-30' : 26, '1900-12-31' : 23,
})
#make plot
fig, ax = plt.subplots(figsize=(12, 7))
ax.bar(range(len(data)), data, color = 'maroon', width=0.5, zorder=3)
#ax.set_xticks uses a fixed locator
ax.set_xticks(range(len(data)))
#ax.set_xticklabels uses a fixed formatter
ax.set_xticklabels(pd.to_datetime(data.index).strftime('%b'))
#format plot a little bit
ax.spines[['top','right']].set_visible(False)
ax.tick_params(axis='both', left=False, bottom=False, labelsize=13)
ax.grid(axis='y', color='gray', dashes=(8,3), alpha=0.5)
I'm not familiar with matplotlib.dates, but because you are using pandas, there are simple ways of doing what you need with pandas.
Here is my code:
import pandas as pd
import calendar
from matplotlib import pyplot as plt
# data
data = {'Birthday': ['1900-01-31', '1900-02-28', '1900-03-31', '1900-04-30', '1900-05-31', '1900-06-30', '1900-07-31', '1900-08-31', '1900-09-30', '1900-10-31', '1900-11-30', '1900-12-31'],
'Players': [32, 25, 27, 19, 27, 18, 18, 21, 23, 21, 26, 23]}
df = pd.DataFrame(data)
# convert column to datetime
df["Birthday"] = pd.to_datetime(df["Birthday"], format="%Y-%m-%d")
# groupby month and plot bar plot
df.groupby(df["Birthday"].dt.month)["Players"].sum().plot(kind="bar", color = "maroon")
# set plot properties
plt.xlabel("Birthday Month")
plt.ylabel("Count")
plt.xticks(ticks = range(0,12) ,labels = calendar.month_name[1:])
# show plot
plt.show()

How do I create a Matplotlib bar chart with categories?

I'm trying to create a matplotlib bar chart with categories on the X-axis, but I can't get the categories right. Here's a minimal example of what I'm trying to do.
data = [[46, 11000], [97, 15000], [27, 24000], [36, 9000], [9, 17000]]
df = pd.DataFrame(data, columns=['car_id', 'price'])
fig1, ax1 = plt.subplots(figsize=(10,5))
ax1.set_title('Car prices')
ax1.bar(df['car_id'], df['price'])
plt.xticks(np.arange(len(df)), list(df['car_id']))
plt.legend()
plt.show()
I need the five categories (car_id) on the X-axis. What am I doing wrong?
You can turn car_id into category:
df['car_id'] = df['car_id'].astype('category')
df.plot.bar(x='car_id')
You can also plot just the price column and relabel:
ax = df.plot.bar(y='price')
ax.set_xticklabels(df['car_id'])
You mixed up the positions and the labels in xticks. Here you specify the positions np.arange(len(df)) and the labels list(df['car_id']), so matplotlib puts the labels at the specified positions, i.e. array([0, 1, 2, 3, 4]), while the bars are drawn at the car_id values.
If the positions and the labels are the same, just replace plt.xticks(np.arange(len(df)), list(df['car_id'])) with plt.xticks(df['car_id']).
If you want them to be evenly spaced, your approach is right, but you also need to change ax1.bar(df['car_id'], df['price']) to ax1.bar(np.arange(len(df)), df['price']), so that the bar x-positions are evenly spaced.
Full code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = [[46, 11000], [97, 15000], [27, 24000], [36, 9000], [9, 17000]]
df = pd.DataFrame(data, columns=['car_id', 'price'])
fig1, ax1 = plt.subplots(figsize=(10,5))
ax1.set_title('Car prices')
ax1.bar(np.arange(len(df)), df['price'])
ax1.set_xticks(np.arange(len(df)))
ax1.set_xticklabels(df['car_id'])
plt.show()
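Another option, sketched here as an alternative and not taken from the answers above, is to pass the car_id values as strings. matplotlib treats string x values as categories and places them at evenly spaced positions in the order given, so no manual tick handling is needed.

```python
import pandas as pd
import matplotlib.pyplot as plt

data = [[46, 11000], [97, 15000], [27, 24000], [36, 9000], [9, 17000]]
df = pd.DataFrame(data, columns=['car_id', 'price'])

fig, ax = plt.subplots(figsize=(10, 5))
ax.set_title('Car prices')

# string x values are treated as categories and placed at 0, 1, 2, ... in order
bars = ax.bar(df['car_id'].astype(str), df['price'])
```

The bars keep the row order of the dataframe; sort the dataframe first if a different order is wanted, and call plt.show() to display the figure.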

using matplotlib visualize two pandas dataframes in a single scatter plot

I have two pandas data frames having same column names.
Dataframe 1:
Dataframe 2:
I need to visualize both dataframes in the same scatter plot, with the values in the 'function' column (D1_1_2, D1_2_3, etc.) on the x-axis. The requirements are:
A single scatter plot for all entries (labels) in the 'function' column, e.g. 'D1_1_2', 'D1_2_3', on the x-axis. The y-axis can dynamically pick the numeric values.
Different colors for the values of each dataframe.
Spacing or jitter between overlapping values.
I need help with this.
The example below might give you an idea of how to do what you are looking for:
import pandas as pd
import matplotlib.pyplot as plt
index = ["D1_1-2", "D1_2-3", "D1_3-4"]
df1 = pd.DataFrame({"count": [10, 20, 25]}, index=index)
df2 = pd.DataFrame({"count": [15, 11, 30]}, index=index)
ax = df1.plot(style='ro', legend=False)
df2.plot(style='bo',ax=ax, legend=False)
plt.show()
The key is asking the plot of df2 to use the axes from the plot of df1.
Approach with jitter:
If you want to add jitter to your data, one approach is as follows: instead of reusing the previous plot's axes, we concatenate the dataframes and iterate over the columns:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
index = ["D1_1-2", "D1_2-3", "D1_3-4", "D1_4-5", "D1_5-6", "D1_6-7", "D1_7-8", "D1_8-9", "D1_1-3", "D1_2-3", "D1_3-5", "D1_5-7"]
df1 = pd.DataFrame({"count": [10, 20, 25, 30, 32, 35, 25, 15, 5, 17, 11, 2]}, index=index)
df2 = pd.DataFrame({"count": [15, 11, 30, 30, 20, 30, 25, 27, 5, 16, 11, 5]}, index=index)
#We ensure we use different column names for df1 and df2
df1.columns = ["count1"]
df2.columns = ["count2"]
#We concatenate the dataframes
df = pd.concat([df1, df2],axis=1)
#Function to add jitter to the array
def rand_jitter(arr):
    stdev = .01*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev
# We iterate between the two columns of the concatenated dataframe
for i,d in enumerate(df):
    y = df[d]
    arr = range(1,len(y)+1)
    x = rand_jitter(arr)
    plt.plot(x, y, mfc = ["red","blue"][i], mec='k', ms=7, marker="o", linestyle="None")
# We set the ticks as the index labels and rotate the labels to avoid overlapping
plt.xticks(arr, index, rotation='vertical')
plt.show()
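If random jitter is not wanted, a deterministic alternative is to shift each dataframe's points by a fixed amount to either side of the tick. This is a minimal sketch using the data from the first example; the offset value of 0.15 and the names positions and offset are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

index = ["D1_1-2", "D1_2-3", "D1_3-4"]
df1 = pd.DataFrame({"count": [10, 20, 25]}, index=index)
df2 = pd.DataFrame({"count": [15, 11, 30]}, index=index)

positions = np.arange(len(index))  # one integer slot per label
offset = 0.15                      # hypothetical spacing; adjust to taste

fig, ax = plt.subplots()
# df1 is drawn slightly left of each tick, df2 slightly right
ax.plot(positions - offset, df1["count"], "ro", label="df1")
ax.plot(positions + offset, df2["count"], "bo", label="df2")

# label the integer slots with the index values
ax.set_xticks(positions)
ax.set_xticklabels(index, rotation="vertical")
ax.legend()
```

Unlike random jitter, the plot looks the same on every run, and equal values from the two dataframes never overlap. Call plt.show() to display the figure.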
