I have two pandas data frames with the same column names.
Dataframe 1:
Dataframe 2:
I need to visualize both data frames in the same scatter plot, with the values of the 'function' column (D1_1_2, D1_2_3, etc.) on the X-axis.
A single scatter plot is required for all the entries (labels) in the 'function' column, e.g. 'D1_1_2', 'D1_2_3'. The Y-axis should take the numeric values.
The values from each data frame should have a different color.
Overlapping values should be separated by spacing or jitter.
How can I do this?
The example below should give you an idea of how to do what you are looking for:
import pandas as pd
import matplotlib.pyplot as plt
index = ["D1_1-2", "D1_2-3", "D1_3-4"]
df1 = pd.DataFrame({"count": [10, 20, 25]}, index=index)
df2 = pd.DataFrame({"count": [15, 11, 30]}, index=index)
ax = df1.plot(style='ro', legend=False)
df2.plot(style='bo',ax=ax, legend=False)
plt.show()
The key is to ask the plot of df2 to use the axes returned by the plot of df1.
The plot you get for this is as follows:
Approach with jitter:
If you want to add jitter to your data one approach can be as follows, where instead of using the previous plot axis we concatenate the dataframes and iterate over it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
index = ["D1_1-2", "D1_2-3", "D1_3-4", "D1_4-5", "D1_5-6", "D1_6-7", "D1_7-8", "D1_8-9", "D1_1-3", "D1_2-3", "D1_3-5", "D1_5-7"]
df1 = pd.DataFrame({"count": [10, 20, 25, 30, 32, 35, 25, 15, 5, 17, 11, 2]}, index=index)
df2 = pd.DataFrame({"count": [15, 11, 30, 30, 20, 30, 25, 27, 5, 16, 11, 5]}, index=index)
#We ensure we use different column names for df1 and df2
df1.columns = ["count1"]
df2.columns = ["count2"]
#We concatenate the dataframes
df = pd.concat([df1, df2],axis=1)
#Function to add jitter to the array
def rand_jitter(arr):
    stdev = .01*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev
# We iterate between the two columns of the concatenated dataframe
for i,d in enumerate(df):
    y = df[d]
    arr = range(1,len(y)+1)
    x = rand_jitter(arr)
    plt.plot(x, y, mfc = ["red","blue"][i], mec='k', ms=7, marker="o", linestyle="None")
# We set the ticks as the index labels and rotate the labels to avoid overlapping
plt.xticks(arr, index, rotation='vertical')
plt.show()
Finally, this results in the following graph:
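For data laid out as in the question, with the labels in a 'function' column rather than in the index, a minimal sketch might look like the following. The sample values, the jitter width of 0.1 and the names `pos` and `rng` are assumptions made for illustration; each label is mapped to an integer position, and a small random offset is added per point so overlapping values separate.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data shaped like the question: a 'function' label column plus a numeric column
df1 = pd.DataFrame({"function": ["D1_1_2", "D1_2_3", "D1_3_4"], "count": [10, 20, 25]})
df2 = pd.DataFrame({"function": ["D1_1_2", "D1_2_3", "D1_3_4"], "count": [15, 20, 30]})

# Map every label that occurs in either frame to an integer x position
labels = sorted(set(df1["function"]) | set(df2["function"]))
pos = {label: i for i, label in enumerate(labels)}

rng = np.random.default_rng(0)  # seeded so the jitter is reproducible
fig, ax = plt.subplots()
for df, color, name in [(df1, "red", "df1"), (df2, "blue", "df2")]:
    x = df["function"].map(pos).to_numpy(dtype=float)
    # Jitter: shift each point horizontally by at most 0.1 (assumed width)
    x_jittered = x + rng.uniform(-0.1, 0.1, size=len(x))
    ax.scatter(x_jittered, df["count"], color=color, edgecolor="k", label=name)

# Put the original labels back on the x axis
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation="vertical")
ax.legend()
```

In an interactive session, calling plt.show() afterwards displays the figure.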
I have a straightforward for loop that loops through datasets in a set and plots the resultant scatterplot for each dataset using the code below;
for i in dataframes:
    x = i['cycleNumber']
    y = i['QCharge_mA_h']
    plt.figure()
    sns.scatterplot(x=x, y=y).set(title=i.name)
This plots the graphs out as expected, one on top of the other. Is there a simple way to get them all to plot onto a grid for better readability?
As an example, let's say we have the following datasets and code:
data1 = {'X':[12, 10, 20, 17], 'Y':[9, 8, 5, 3]}
data2 = {'X':[2, 13, 7, 21], 'Y':[17, 18, 4, 6]}
data3 = {'X':[9, 19, 20, 3], 'Y':[6, 12, 4, 1]}
data4 = {'X':[10, 13, 15, 1], 'Y':[6, 12, 5,16]}
data5 = {'X':[12, 10, 5, 3], 'Y':[18, 7, 21, 7]}
data6 = {'X':[5, 10, 8, 17], 'Y':[9, 12, 5, 18]}
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
df3=pd.DataFrame(data3)
df4=pd.DataFrame(data4)
df5=pd.DataFrame(data5)
df6=pd.DataFrame(data6)
lst = [df1, df2, df3, df4, df5, df6]
for i in lst:
    plt.figure()
    sns.scatterplot(x=i['X'], y=i['Y'])
This outputs the scatterplots one on top of another, i.e. stacked. I can't upload a screenshot of that output because it runs across multiple pages (a tidy output that I can capture and display is exactly what I'm trying to achieve).
I want the plots in a grid instead, let's say a 2x3 grid given that there are 6 plots. How do I achieve this?
There are a few ways you could do this.
The Original
import matplotlib # 3.6.0
from matplotlib import pyplot as plt
import numpy as np # 1.23.3
import pandas as pd # 1.5.1
import seaborn as sns # 0.12.1
# make fake data
df = pd.DataFrame({
    "cycleNumber": np.random.random(size=(100,)),
    "QCharge_mA_h": np.random.random(size=(100,)),
})
# single plot
fig, ax = plt.subplots()
sns.scatterplot(df, x="cycleNumber", y="QCharge_mA_h", ax=ax)
plt.show()
With matplotlib
# make 5 random data frames
dataframes = []
for i in range(5):
    np.random.seed(i)
    random_df = pd.DataFrame({
        "cycleNumber": np.random.random(size=(100,)),
        "QCharge_mA_h": np.random.random(size=(100,)),
    })
    dataframes.append(random_df)
# make len(dataframes) rows using matplotlib
fig, axs = plt.subplots(nrows=len(dataframes))
for df, ax in zip(dataframes, axs):
    sns.scatterplot(df, x="cycleNumber", y="QCharge_mA_h", ax=ax)
plt.show()
With seaborn
# make 5 random data frames
dataframes = []
for i in range(5):
    np.random.seed(i)
    random_df = pd.DataFrame({
        "cycleNumber": np.random.random(size=(100,)),
        "QCharge_mA_h": np.random.random(size=(100,)),
    })
    dataframes.append(random_df)
# concat dataframes
dfs = pd.concat(dataframes, keys=range(len(dataframes)), names=["keys"])
# move keys to columns
dfs = dfs.reset_index(level="keys")
# make grid and map scatterplot to each row
grid = sns.FacetGrid(data=dfs, row="keys")
grid.map(sns.scatterplot, "cycleNumber", "QCharge_mA_h")
plt.show()
With col_wrap=3
# make 5 random data frames
dataframes = []
for i in range(5):
    np.random.seed(i)
    random_df = pd.DataFrame({
        "cycleNumber": np.random.random(size=(100,)),
        "QCharge_mA_h": np.random.random(size=(100,)),
    })
    dataframes.append(random_df)
# concat dataframes
dfs = pd.concat(dataframes, keys=range(len(dataframes)), names=["keys"])
# move keys to columns
dfs = dfs.reset_index(level="keys")
# make grid and map scatterplot to each column, wrapping after 3
grid = sns.FacetGrid(data=dfs, col="keys", col_wrap=3)
grid.map(sns.scatterplot, "cycleNumber", "QCharge_mA_h")
plt.show()
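For the 2x3 grid asked about in the question, one more option is to create the grid with plt.subplots and walk over the flattened array of axes. This is only a sketch: it uses the six sample datasets from the question and swaps sns.scatterplot for plain ax.scatter so that it depends on matplotlib and pandas alone; sns.scatterplot(x=..., y=..., ax=ax) should slot into the loop in the same way. The `frames` list and the "df1" to "df6" titles are illustrative names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# The six sample datasets from the question
frames = [
    pd.DataFrame({"X": [12, 10, 20, 17], "Y": [9, 8, 5, 3]}),
    pd.DataFrame({"X": [2, 13, 7, 21], "Y": [17, 18, 4, 6]}),
    pd.DataFrame({"X": [9, 19, 20, 3], "Y": [6, 12, 4, 1]}),
    pd.DataFrame({"X": [10, 13, 15, 1], "Y": [6, 12, 5, 16]}),
    pd.DataFrame({"X": [12, 10, 5, 3], "Y": [18, 7, 21, 7]}),
    pd.DataFrame({"X": [5, 10, 8, 17], "Y": [9, 12, 5, 18]}),
]

# One figure holding a 2x3 grid of axes
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(12, 6))

# axs is a 2D array; axs.flat iterates over it row by row
for i, (df, ax) in enumerate(zip(frames, axs.flat), start=1):
    ax.scatter(df["X"], df["Y"])
    ax.set_title(f"df{i}")

fig.tight_layout()  # keep titles and tick labels from overlapping
```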
I have a dataframe with this data and want to plot it with a bar graph with x-axis labels being months
import pandas as pd
data = {'Birthday': ['1900-01-31', '1900-02-28', '1900-03-31', '1900-04-30', '1900-05-31', '1900-06-30', '1900-07-31', '1900-08-31', '1900-09-30', '1900-10-31', '1900-11-30', '1900-12-31'],
'Players': [32, 25, 27, 19, 27, 18, 18, 21, 23, 21, 26, 23]}
df = pd.DataFrame(data)
Birthday Players
1900-01-31 32
1900-02-28 25
1900-03-31 27
1900-04-30 19
1900-05-31 27
1900-06-30 18
1900-07-31 18
1900-08-31 21
1900-09-30 23
1900-10-31 21
1900-11-30 26
1900-12-31 23
This is what I have
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
fig = plt.figure(figsize=(12, 7))
locator = mdates.MonthLocator()
fmt = mdates.DateFormatter('%b')
X = plt.gca().xaxis
X.set_major_locator(locator)
X.set_major_formatter(fmt)
plt.bar(month_df.index, month_df.Players, color = 'maroon', width=10)
but the result is this, with the labels starting from Feb instead of Jan
Bar plot x-axis tick locations are 0 indexed, not datetimes
This solution applies to any plot with a discrete axis (e.g. bar, hist, heat, etc.).
Similar to this answer, the easiest solution follows:
Skip to step 3 if the str column already exists.
1. Convert the 'Birthday' column to a datetime dtype with pd.to_datetime
2. Extract the abbreviated month name to a separate column
3. Order the column with pd.Categorical. The built-in calendar module is used to supply an ordered list of abbreviated month names, or the list can be typed manually
4. Plot the dataframe with pandas.DataFrame.plot, which uses matplotlib as the default backend
Tested in python 3.8.12, pandas 1.3.4, matplotlib 3.4.3
import pandas as pd
import matplotlib.pyplot as plt
from calendar import month_abbr as ma # ordered abbreviated month names
# convert the Birthday column to a datetime dtype
df.Birthday = pd.to_datetime(df.Birthday)
# create a month column
df['month'] = df.Birthday.dt.strftime('%b')
# convert the column to categorical and ordered
df.month = pd.Categorical(df.month, categories=ma[1:], ordered=True)
# plot the dataframe
ax = df.plot(kind='bar', x='month', y='Players', figsize=(12, 7), rot=0, legend=False)
If there are many repeated months, where the data must be aggregated, then combine the data using pandas.DataFrame.groupby and aggregate some function like .mean() or .sum()
dfg = df.groupby('month').Players.sum()
ax = dfg.plot(kind='bar', figsize=(12, 7), rot=0, legend=False)
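The statement that bar plot tick locations are 0 indexed can be checked directly. The sketch below uses a shortened, three-row version of the question's data (an assumption to keep it small) and reads the tick positions back from the axes that pandas returns: they are the integers 0, 1, 2, not dates, which is why a date locator does not line up with the bars.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Shortened version of the question's data (first three months only)
df = pd.DataFrame({
    "Birthday": pd.to_datetime(["1900-01-31", "1900-02-28", "1900-03-31"]),
    "Players": [32, 25, 27],
})

# Even with a datetime column on x, a pandas bar plot places the bars at 0, 1, 2, ...
ax = df.plot(kind="bar", x="Birthday", y="Players", legend=False, rot=0)
tick_positions = list(ax.get_xticks())
print(tick_positions)

# Because the positions are plain integers, the labels can simply be replaced
ax.set_xticklabels(df["Birthday"].dt.strftime("%b"))
```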
Typically, matplotlib's bar does not do a very good job with datetimes, for various reasons. It's easy to set your x tick locations and labels manually, as below. This is a fixed-formatter convenience wrapper, but it lets you take control quite easily.
import pandas as pd
import matplotlib.pyplot as plt
#generate data
data = pd.Series({
    '1900-01-31' : 32, '1900-02-28' : 25, '1900-03-31' : 27,
    '1900-04-30' : 19, '1900-05-31' : 27, '1900-06-30' : 18,
    '1900-07-31' : 18, '1900-08-31' : 21, '1900-09-30' : 23,
    '1900-10-31' : 21, '1900-11-30' : 26, '1900-12-31' : 23,
})
#make plot
fig, ax = plt.subplots(figsize=(12, 7))
ax.bar(range(len(data)), data, color = 'maroon', width=0.5, zorder=3)
#ax.set_xticks uses a fixed locator
ax.set_xticks(range(len(data)))
#ax.set_xticklabels uses a fixed formatter
ax.set_xticklabels(pd.to_datetime(data.index).strftime('%b'))
#format plot a little bit
ax.spines[['top','right']].set_visible(False)
ax.tick_params(axis='both', left=False, bottom=False, labelsize=13)
ax.grid(axis='y', color='gray', dashes=(8,3), alpha=0.5)
I'm not familiar with matplotlib.dates, but because you are using pandas, there are simple ways of doing what you need with pandas.
Here is my code:
import pandas as pd
import calendar
from matplotlib import pyplot as plt
# data
data = {'Birthday': ['1900-01-31', '1900-02-28', '1900-03-31', '1900-04-30', '1900-05-31', '1900-06-30', '1900-07-31', '1900-08-31', '1900-09-30', '1900-10-31', '1900-11-30', '1900-12-31'],
'Players': [32, 25, 27, 19, 27, 18, 18, 21, 23, 21, 26, 23]}
df = pd.DataFrame(data)
# convert column to datetime
df["Birthday"] = pd.to_datetime(df["Birthday"], format="%Y-%m-%d")
# groupby month and plot bar plot
# (only the Players column is summed; the datetime column itself cannot be summed)
df.groupby(df["Birthday"].dt.month)["Players"].sum().plot(kind="bar", color = "maroon")
# set plot properties
plt.xlabel("Birthday Month")
plt.ylabel("Count")
plt.xticks(ticks = range(0,12) ,labels = calendar.month_name[1:])
# show plot
plt.show()
Output:
Suppose I have the following DataFrame:
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime as dt
df = pd.DataFrame(
    [
        ['2008-02-19', 10],
        ['2008-03-01', 15],
        ['2009-02-05', 20],
        ['2009-05-10', 40],
        ['2010-10-10', 25],
        ['2010-11-15', 5]
    ],
    columns = ['Date', 'DollarTotal']
)
df
I want to plot the total summed by year so I perform the following transformations:
df['Date'] = pd.to_datetime(df['Date'])
df_Year = df.groupby(df['Date'].dt.year)
df_Year = df_Year.sum('DollarTotal')
df_Year
The following code in matplotlib creates the chart below:
fig,ax = plt.subplots()
ax.plot(df_Year.index, df_Year.values)
ax.set_xlabel("OrderYear")
ax.set_ylabel("$ Total")
ax.set_title("Annual Purchase Amount")
plt.xticks([x for x in df_Year.index], rotation=0)
plt.show()
The problem occurs when I want to create a bar graph using the same DataFrame. By changing the code above from ax.plot to ax.bar, I get the following error:
I've never come across this error before when plotting in matplotlib. What have I done wrong?
Please see the answer below by dm2 which solves this problem.
Edit:
I just figured out why I never had this problem in the past. It has to do with how I summed the groupby. If I replace df_Year = df_Year.sum('DollarTotal') with df_Year = df_Year['DollarTotal'].sum() then this problem does not occur.
df = pd.DataFrame(
    [
        ['2008-02-19', 10],
        ['2008-03-01', 15],
        ['2009-02-05', 20],
        ['2009-05-10', 40],
        ['2010-10-10', 25],
        ['2010-11-15', 5]
    ],
    columns = ['Date', 'DollarTotal']
)
df['Date'] = pd.to_datetime(df['Date'])
df_Year = df.groupby(df['Date'].dt.year)
df_Year = df_Year['DollarTotal'].sum()
df_Year
fig,ax = plt.subplots()
ax.bar(df_Year.index, df_Year.values)
ax.set_xlabel("OrderYear")
ax.set_ylabel("$ Total")
ax.set_title("Annual Purchase Amount")
plt.xticks([x for x in df_Year.index], rotation=0)
plt.show()
From matplotlib.axes.Axes.bar documentation, the function expects height parameter to be a scalar or a sequence of scalars. pandas.DataFrame.values is a two-dimensional array that has rows as its first dimension and columns as its second dimension (even with just one column, it's a two dimensional array), so it's a sequence of arrays. Therefore, if you use df.values, you also need to reshape it to the expected sequence (i.e. one-dimensional array) of scalars (i.e. df.values.reshape(len(df))).
Or, specifically in your code: ax.bar(df_Year.index, df_Year.values.reshape(len(df_Year))).
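The difference in shape can be seen without plotting anything. The sketch below rebuilds the question's data and compares the two ways of summing; selecting with double brackets is used here to get a one-column DataFrame, which is an assumption standing in for the question's df_Year.sum('DollarTotal') call. The names `frame_sum` and `series_sum` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['2008-02-19', 10],
        ['2008-03-01', 15],
        ['2009-02-05', 20],
        ['2009-05-10', 40],
        ['2010-10-10', 25],
        ['2010-11-15', 5],
    ],
    columns=['Date', 'DollarTotal'],
)
df['Date'] = pd.to_datetime(df['Date'])
by_year = df.groupby(df['Date'].dt.year)

# Summing a one-column DataFrame gives a DataFrame: .values is 2D, one row per year
frame_sum = by_year[['DollarTotal']].sum()
print(frame_sum.values.shape)

# Summing the selected column gives a Series: .values is 1D, which ax.bar accepts as heights
series_sum = by_year['DollarTotal'].sum()
print(series_sum.values.shape)

# Reshaping the 2D array to 1D yields the same numbers as the Series
flat = frame_sum.values.reshape(len(frame_sum))
```

Either `flat` or `series_sum.values` can then be passed as the height argument.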
Result:
You could also just use the plot.bar of pandas in the following way:
df_Year.plot.bar()
plt.show()
This will produce:
I'm trying to create a matplotlib bar chart with categories on the X-axis, but I can't get the categories right. Here's a minimal example of what I'm trying to do.
data = [[46, 11000], [97, 15000], [27, 24000], [36, 9000], [9, 17000]]
df = pd.DataFrame(data, columns=['car_id', 'price'])
fig1, ax1 = plt.subplots(figsize=(10,5))
ax1.set_title('Car prices')
ax1.bar(df['car_id'], df['price'])
plt.xticks(np.arange(len(df)), list(df['car_id']))
plt.legend()
plt.show()
I need the five categories (car_id) on the X-axis. What Am I doing wrong? :-/
You can turn car_id into category:
df['car_id'] = df['car_id'].astype('category')
df.plot.bar(x='car_id')
Output:
You can also plot just the price column and relabel:
ax = df.plot.bar(y='price')
ax.set_xticklabels(df['car_id'])
You mixed up the positions and the labels in xticks. Here you specify the positions np.arange(len(df)) and the labels list(df['car_id']). So matplotlib puts the labels at the specified positions np.arange(len(df)), i.e. array([0, 1, 2, 3, 4]), while the bars are drawn at the car_id values.
If the positions and the labels are the same here, just replace plt.xticks(np.arange(len(df)), list(df['car_id'])) by plt.xticks(df['car_id']).
If you want them to be evenly spaced, your approach is right, but you also need to change ax1.bar(df['car_id'], df['price']) to ax1.bar(np.arange(len(df)), df['price']), so that the bar x-positions are evenly spaced.
Full code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = [[46, 11000], [97, 15000], [27, 24000], [36, 9000], [9, 17000]]
df = pd.DataFrame(data, columns=['car_id', 'price'])
fig1, ax1 = plt.subplots(figsize=(10,5))
ax1.set_title('Car prices')
ax1.bar(np.arange(len(df)), df['price'])
ax1.set_xticks(np.arange(len(df)))
ax1.set_xticklabels(df['car_id'])
plt.show()
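Another way to get evenly spaced bars, sketched here as an alternative rather than as part of the answers above, is to pass the ids as strings. matplotlib treats string x values as categories and places them at 0, 1, 2, ... in order of appearance, so no manual tick handling is needed. The `labels` variable is an illustrative name.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

data = [[46, 11000], [97, 15000], [27, 24000], [36, 9000], [9, 17000]]
df = pd.DataFrame(data, columns=['car_id', 'price'])

fig, ax = plt.subplots(figsize=(10, 5))
ax.set_title('Car prices')

# Strings are plotted as categories: one evenly spaced bar per id, in the given order
labels = df['car_id'].astype(str)
ax.bar(labels, df['price'])
```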
I want to have two curves with different x-datapoints shown in the same plot:
import pandas as pd
df1 = pd.DataFrame()
df1['Date']= ['2014-12-31', '2015-12-31', '2016-12-31', '2017-12-31']
df1['Value'] = [22, 44, 11, 55]
df2 = pd.DataFrame()
df2['Date']= ['2015-03-31', '2015-07-31', '2015-8-31', '2015-12-31']
df2['Value'] = [34, 39, 31, 27]
ax1 = df1.plot(x='Date', marker='o')
df2.plot(ax=ax1, marker='o')
In the above code the 2nd curve (df2 data) uses the x-datapoints of the df1 data, not its own.
I can make it work by manipulating the data (e.g. add the missing Dates in df1 and df2 with NaN accordingly), but I would like to know if there is something like a simple setting directly in the df.plot()-function.
Note: I did convert those dates to datetimes using df['Date'] = pd.to_datetime(df.Date)
One way to do this is to use pd.concat then use pandas plot:
pd.concat([df1,df2], keys=['df1','df2'])\
    .set_index('Date', append=True)\
    .unstack(0)['Value']\
    .reset_index(0, drop=True)\
    .fillna(0).plot(marker='o')
More like a scatter plot:
pd.concat([df1,df2], keys=['df1','df2'])\
    .set_index('Date', append=True)\
    .unstack(0)['Value']\
    .reset_index(0, drop=True)\
    .plot(marker='o',linestyle='none')
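If reshaping the data is not wanted, a different route is to skip df.plot and call matplotlib directly. This is a sketch of that alternative, not a setting of df.plot: each ax.plot call receives its own x values, so once both 'Date' columns are real datetimes the two curves share one date axis while keeping their own datapoints.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df1 = pd.DataFrame({
    "Date": ['2014-12-31', '2015-12-31', '2016-12-31', '2017-12-31'],
    "Value": [22, 44, 11, 55],
})
df2 = pd.DataFrame({
    "Date": ['2015-03-31', '2015-07-31', '2015-08-31', '2015-12-31'],
    "Value": [34, 39, 31, 27],
})

# Both frames need real datetimes so they are placed on a common time axis
for df in (df1, df2):
    df["Date"] = pd.to_datetime(df["Date"])

fig, ax = plt.subplots()
# Each call uses the x values of its own frame
ax.plot(df1["Date"], df1["Value"], marker="o", label="df1")
ax.plot(df2["Date"], df2["Value"], marker="o", label="df2")
ax.legend()
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```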