Bar plot based on two columns - python

I have generated the dataframe below. I want to draw a bar plot where the x-axis shows the two exp_type categories and the y-axis shows the avg value, with a legend entry for each disk_type.
      exp_type disk_type          avg
0  Random Read      nvme  3120.240000
1  Random Read       sda   132.638831
2  Random Read       sdb   174.313413
3     Seq Read      nvme  3137.849000
4     Seq Read       sda   119.171269
5     Seq Read       sdb   211.451616
I have attempted to use the code below for the plotting but I get the wrong plot. They should be grouped together with links.
def plot(df):
    df.plot(x='exp_type', y=['avg'], kind='bar')
    print(df)

The important thing here is to reshape your dataframe correctly with pivot:
(df.pivot(index='disk_type', columns='exp_type', values='avg').rename_axis(columns='Exp Type')
   .plot(kind='bar', rot=0, title='Performance', xlabel='Disk Type', ylabel='IOPS'))
# OR
(df.pivot(index='exp_type', columns='disk_type', values='avg').rename_axis(columns='Disk Type')
   .plot(kind='bar', rot=0, title='Performance', xlabel='Exp Type', ylabel='IOPS'))
Update
Pandas can't tell how to group the bars because your dataframe is flat (long format, one numeric value per row). You have to reshape it so that each row is a bar group and each column is one bar within every group:
>>> df.pivot(index='exp_type', columns='disk_type', values='avg')
disk_type           nvme         sda         sdb  # <- One bar per column in each group
exp_type
Random Read  3120.240000  132.638831  174.313413  # <- First bar group
Seq Read     3137.849000  119.171269  211.451616  # <- Second bar group
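As a self-contained sketch of the second variant above (exp_type on the x-axis, disk_type in the legend), the snippet below rebuilds the question's data by hand and plots it. The variable name `wide` is only illustrative, and the axis labels are my own choice.

```python
import pandas as pd

# Data copied from the table in the question
df = pd.DataFrame({
    "exp_type": ["Random Read"] * 3 + ["Seq Read"] * 3,
    "disk_type": ["nvme", "sda", "sdb"] * 2,
    "avg": [3120.240000, 132.638831, 174.313413,
            3137.849000, 119.171269, 211.451616],
})

# Long -> wide: one row per x-axis group, one column per legend entry
wide = df.pivot(index="exp_type", columns="disk_type", values="avg")

# Each row becomes a group of bars, each column one bar in every group
ax = wide.plot(kind="bar", rot=0)
ax.set_xlabel("Exp Type")
ax.set_ylabel("avg")
```

Call `plt.show()` (after `import matplotlib.pyplot as plt`) or `ax.figure.savefig(...)` to see the result. Swapping `index` and `columns` in the pivot gives the other grouping.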

Related

How to plot multiple time series from a CSV while the data points are in different columns

I have a data frame (loaded from a CSV file) that looks like the one below:
     Data      Mean        sd   time__1   time__2   time__3   time__4   time__5
0  Data_1  0.947667  0.025263  0.501517  0.874750  0.929426  0.953847  0.958375
1  Data_2  0.031960  0.017314  0.377588  0.069185  0.037523  0.024028  0.021532
Now I want to plot two time series (Data_1 and Data_2) using time__1, time__2, etc. as the time points. The x-axis is time__1, time__2, etc., and the y-axis shows the associated values.
The code I am trying
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("file.csv", delimiter=',', header=0)
data = data.drop(["Unnamed: 0"], axis=1)
# Set the date column as the index
data = data.set_index(["time__1", "time__2", "time__3", "time__4", "time__5"])
ax = data.plot(linewidth=2, fontsize=12)
ax.set_xlabel('Data')
ax.legend(fontsize=12)
plt.savefig("series.png")
plt.show()
The figure I am getting is not as expected.
I think I am doing something wrong with set_index(), as my time points are in different columns.
How can I plot time-series when time points are in different columns?
Reproducible data in dictionary format:
{'Data': {(0.501517236232758, 0.874750375747681, 0.929425954818726, 0.953846752643585, 0.958374977111816): 'Data_1', (0.377588421106338, 0.069185301661491, 0.037522859871388, 0.0240284409374, 0.021532088518143): 'Data_2'}, 'Mean': {(0.501517236232758, 0.874750375747681, 0.929425954818726, 0.953846752643585, 0.958374977111816): 0.947667360305786, (0.377588421106338, 0.069185301661491, 0.037522859871388, 0.0240284409374, 0.021532088518143): 0.031959813088179}, 'sd': {(0.501517236232758, 0.874750375747681, 0.929425954818726, 0.953846752643585, 0.958374977111816): 0.025263005867601, (0.377588421106338, 0.069185301661491, 0.037522859871388, 0.0240284409374, 0.021532088518143): 0.017313838005066}}
IIUC you are getting the index wrong: If time__1, time__2 etc. is supposed to be your x-axis, that's what you want your index to be. The plot data series names are the columns. Therefore, you need to transpose your DataFrame. Using the csv data in your first table:
print(df)
# out:
Data Mean sd time__1 time__2 time__3 time__4 \
0 Data_1 0.947667 0.025263 0.501517 0.874750 0.929426 0.953847
1 Data_2 0.031960 0.017314 0.377588 0.069185 0.037523 0.024028
time__5
0 0.958375
1 0.021532
Dropping the Mean and sd columns, setting Data as the index, and transposing:
df.drop(["Mean", "sd"], axis=1).set_index("Data").T
yields an appropriately formatted dataframe:
Data Data_1 Data_2
time__1 0.501517 0.377588
time__2 0.874750 0.069185
time__3 0.929426 0.037523
time__4 0.953847 0.024028
time__5 0.958375 0.021532
which can simply be plotted:
df.drop(["Mean", "sd"], axis=1).set_index("Data").T.plot()
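Put together as a runnable sketch, with the frame built by hand from the rounded values in the question's first table instead of reading file.csv (the name `series` and the axis label are illustrative choices):

```python
import pandas as pd

# Values copied from the table in the question
df = pd.DataFrame({
    "Data": ["Data_1", "Data_2"],
    "Mean": [0.947667, 0.031960],
    "sd": [0.025263, 0.017314],
    "time__1": [0.501517, 0.377588],
    "time__2": [0.874750, 0.069185],
    "time__3": [0.929426, 0.037523],
    "time__4": [0.953847, 0.024028],
    "time__5": [0.958375, 0.021532],
})

# Drop the summary columns, label the rows by series name, then transpose
# so that the time points form the index (x-axis) and each series a column
series = df.drop(["Mean", "sd"], axis=1).set_index("Data").T

# One line per column; the index supplies the x-axis tick labels
ax = series.plot(linewidth=2, marker="o")
ax.set_xlabel("Time point")
```

Note that the result of the transpose has to be kept (here in `series`) or plotted directly; `df` itself is left unchanged by the chain.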

Ordering of elements in Pandas stacked bar chart

I'm trying to graph information about the portion of a household's income earned in a specific industry across 5 districts in a region.
I used groupby to sort the information in my data frame by district:
df = df_orig.groupby('District')['Portion of income'].value_counts(dropna=False)
df = df.groupby('District').transform(lambda x: 100*x/sum(x))
df = df.drop(labels=math.nan, level=1)
ax = df.unstack().plot.bar(stacked=True, rot=0)
ax.set_ylim(ymax=100)
display(df.head())
District  Portion of income
A         <25%                 12.121212
          25 - 50%              9.090909
          50 - 75%              7.070707
          75 - 100%             2.020202
Since this income falls into categories, I would like to order the elements in the stacked bar in a logical way. The graph Pandas produced is below. Right now, the ordering (starting from the bottom of each bar) is:
25 - 50%
50 - 75%
75 - 100%
<25%
Unsure
I realize that these are sorted in alphabetical order and was curious if there was a way to set a custom ordering. To be intuitive, I would like the order to be (again, starting from the bottom of the bar):
Unsure
<25%
25 - 50%
50 - 75%
75 - 100%
Then, I would like to flip the legend to display the reverse of this order (ie, I would like the legend to have 75 - 100 at the top, as that is what will be at the top of the bars).
To impose a custom sort order on the income categories, one way is to convert them to a CategoricalIndex.
To reverse the order of matplotlib legend entries, use the get_legend_handles_labels method from this SO question: Reverse legend order pandas plot
import pandas as pd
import numpy as np
import math
np.random.seed(2019)
# Hard-code the custom ordering of categories
categories = ['unsure', '<25%', '25 - 50%', '50 - 75%', '75 - 100%']
# Generate some example data
# I'm not sure if this matches your input exactly
# (np.random.choice is called directly; the pd.np alias was removed in pandas 2.0)
df_orig = pd.DataFrame({'District': np.random.choice(list('ABCDE'), size=100),
                        'Portion of income': np.random.choice(categories + [np.nan], size=100)})
# Unchanged from your code. Note that value_counts() returns a
# Series, but you name it df
df = df_orig.groupby('District')['Portion of income'].value_counts(dropna=False)
df = df.groupby('District').transform(lambda x: 100*x/sum(x))
# In my example data, np.nan was cast to the string 'nan', so
# I have to drop it like this
df = df.drop(labels='nan', level=1)
# Instead of plotting right away, unstack the MultiIndex
# into columns, then convert those columns to a CategoricalIndex
# with custom sort order
df = df.unstack()
df.columns = pd.CategoricalIndex(df.columns.values,
                                 ordered=True,
                                 categories=categories)
# Sort the columns (axis=1) by the new categorical ordering
df = df.sort_index(axis=1)
# Plot
ax = df.plot.bar(stacked=True, rot=0)
ax.set_ylim(ymax=100)
# Matplotlib idiom to reverse legend entries
handles, labels = ax.get_legend_handles_labels()
ax.legend(reversed(handles), reversed(labels))
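If the unstacked frame already has one column per category, a simpler alternative to the CategoricalIndex approach is to reorder the columns with reindex. This is a different technique from the one above, shown here as a minimal sketch on made-up percentages (the `shares` frame and its values are hypothetical):

```python
import pandas as pd

# Desired stacking order, bottom of the bar first
order = ["Unsure", "<25%", "25 - 50%", "50 - 75%", "75 - 100%"]

# Hypothetical percentages per district, columns in alphabetical order
shares = pd.DataFrame(
    {
        "25 - 50%": [20.0, 30.0],
        "50 - 75%": [10.0, 25.0],
        "75 - 100%": [5.0, 15.0],
        "<25%": [40.0, 20.0],
        "Unsure": [25.0, 10.0],
    },
    index=pd.Index(["A", "B"], name="District"),
)

# Column order determines stacking order, so put the columns in `order`
shares = shares.reindex(columns=order)

ax = shares.plot.bar(stacked=True, rot=0)
ax.set_ylim(ymax=100)

# Reverse the legend so its top entry matches the top of each bar
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1])
```

A category listed in `order` but missing from the data would show up as an all-NaN column after reindex, so the list should match the actual column labels exactly (including capitalization).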

How to find the correct condition for my matplotlib scatterplot?

I'm trying to correlate two measures (DD and DRE) from a data set that contains many more columns. I created a data frame and called it 'Data'.
Within this data, I want to create a scatterplot of DD (x-axis) against DRE (y-axis), including only DD values between 0 and 100.
Please help me with the first line of my code so that it selects DD values between 0 and 100.
Also, when I plot the scatterplot, I get dots beyond 100% (the y-axis is DRE in %), even though I don't have any value above 100%.
Data1= Data[ Data['DD']<100]
plt.scatter(Data1.DD,Data1.DRE)
tick_val = [0,10,20,30,40,50,60,70,80,90,100]
tick_lab = ['0%','10%','20%','30%','40%','50%','60%','70%','80%','90%','100']
plt.yticks(tick_val,tick_lab)
plt.show()
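One way to express the "DD between 0 and 100" condition is a boolean mask built with Series.between. The sketch below uses a small made-up frame in place of the real 'Data' (its values are hypothetical). Fixing the y-axis limits to 0..100 is my own assumption about what the plot should show, not a diagnosis of the stray dots.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the asker's 'Data' frame
Data = pd.DataFrame({
    "DD": [-5, 0, 12, 40, 75, 100, 130],
    "DRE": [10, 20, 35, 50, 80, 95, 60],
})

# Keep only rows with 0 <= DD <= 100 (between() includes both ends by default)
Data1 = Data[Data["DD"].between(0, 100)]
# Equivalent spelling: Data[(Data["DD"] >= 0) & (Data["DD"] <= 100)]

fig, ax = plt.subplots()
ax.scatter(Data1["DD"], Data1["DRE"])

# Percent tick labels every 10 units, with the axis pinned to 0..100
tick_val = list(range(0, 101, 10))
ax.set_yticks(tick_val)
ax.set_yticklabels([f"{v}%" for v in tick_val])
ax.set_ylim(0, 100)
```

If points still seem to lie above 100%, checking `Data1["DRE"].max()` would confirm whether the data or the axis scaling is responsible.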

Turning Pandas DataFrame into Histogram Using Matplotlib

I have a Pandas DataFrame which has two columns, pageviews and type:
   pageviews      type
0       48.0  original
1        1.0  licensed
2      181.0  licensed
...
I'm trying to create a histogram each for original and licensed. Each histogram would (ideally) chart the number of occurrences in a given range for that particular type. So the x-axis would be a range of pageviews and the y-axis would be the number of pageviews that fall within that range.
Any recs on how to do this? I feel like it should be straightforward...
Thanks!
Using your current dataframe: df.hist(by='type')
For example:
import numpy as np
import pandas as pd
# Recreate a dataframe shaped like yours, with random values
pageviews = np.random.randint(200, size=100)
types = np.random.choice(['original','licensed'], size=100)
df = pd.DataFrame({'pageviews': pageviews,'type':types})
# Code you need to create a faceted histogram, one panel per type
df.hist(by='type')
pandas.DataFrame.hist documentation

plot multiple data series from numpy array

I have a very ambitious project (for my novice level) using a numpy array: I load a series of data and make different plots based on my needs. I have uploaded a slim version of my data file, input_data. I want to make plots for a chosen value of F (selected before looping), where each value in column E becomes its own data series (e.g. A12 is one series, A23 another, etc.), with the corresponding values in column D on the x-axis.
To summarize: for a chosen value in column F, I want 4 different data series (one per distinct value in column E), plotted against the values in column D (a date) on the x-axis.
I stumbled at the first step (despite spending too much time on it), where I wanted to plot all the data sharing one F identifier in a single plot.
Here is what I have up to now:
import os
import numpy as np
N = 8 #different values on column F
M = 4 #different values on column E
dataset = open('array_data.txt').readlines()[1:]
data = np.genfromtxt(dataset)
my_array = data
day = len(my_array)/M/N # number of measurement sets - variation on column D
for i in range(0, len(my_array), N):
    plt.xlim(0, )
    plt.ylim(-1, 2)
    plt.plot(my_array[i, 0], my_array[i, 2], 'o')
    plt.hold(True)
plt.show()
this does nothing.... and I still have a long way to go..
With pandas you can do:
import pandas as pd
dataset = pd.read_table("toplot.txt", sep="\t")
#make D index (automatically puts it on the x axis)
dataset.set_index("D", inplace=True)
#plotting R vs. D
dataset.R.plot()
#plotting F vs. D
dataset.F.plot()
dataset is a DataFrame object and DataFrame.plot is just a wrapper around the matplotlib function to plot the series.
I'm not clear on how you want to plot it, but it sounds like you'll need to select rows by the values in a column. That would be:
# get where F == 1000
maskF = dataset.F == 1000
# get the values where F == 1000
rows = dataset[maskF]
# get the values where A12 is in column E
rows = rows[rows.E == "A12"]
# remove the columns we don't want to plot
rows = rows.drop(columns=["E", "F"])
#Plot the result
rows.plot(xlim=(0,None), ylim=(-1,2))
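To get one line per value of column E for a chosen F, one option is to filter on F and then pivot so that E becomes the columns. The sketch below is only illustrative: the column names D, E, F and R follow the snippets above, but the values are made up, D is an integer rather than a real date, and `chosen_F` is a hypothetical name.

```python
import pandas as pd

# Hypothetical data; only the column names follow the answer above
dataset = pd.DataFrame({
    "D": [1, 1, 2, 2, 1, 1, 2, 2],
    "E": ["A12", "A23", "A12", "A23"] * 2,
    "F": [1000] * 4 + [2000] * 4,
    "R": [0.1, 0.4, 0.2, 0.5, 0.9, 1.1, 1.0, 1.2],
})

# Pick the F value to plot before doing anything else
chosen_F = 1000
subset = dataset[dataset["F"] == chosen_F]

# Rows: D (x-axis); columns: one per E value (one line each); cells: R
wide = subset.pivot(index="D", columns="E", values="R")

ax = wide.plot(marker="o", ylim=(-1, 2))
```

pivot raises an error if a (D, E) pair occurs more than once within the subset; in that case pivot_table with an aggregation function would be needed instead.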
