Plotting categorical variable against numeric variable in matplotlib - python

My DataFrame's structure
trx.columns
Index(['dest', 'orig', 'timestamp', 'transcode', 'amount'], dtype='object')
I'm trying to plot transcode (transaction code) against amount to see the how much money is spent per transaction. I made sure to convert transcode to a categorical type as seen below.
trx['transcode']
...
Name: transcode, Length: 21893, dtype: category
Categories (3, int64): [1, 17, 99]
The result I get from doing plt.scatter(trx['transcode'], trx['amount']) is
Scatter plot
While the above plot is not entirely wrong, I would like the X axis to contain just the three possible values of transcode [1, 17, 99] instead of the entire [1, 100] range.
Thanks!

In matplotlib 2.1 you can plot categorical variables by using strings. I.e. if you provide the column for the x values as string, it will recognize them as categories.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.choice([1,17,99], size=100),
"y" : np.random.rand(100)*100})
plt.scatter(df["x"].astype(str), df["y"])
plt.margins(x=0.5)
plt.show()
In order to optain the same in matplotlib <=2.0 one would plot against some index instead.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.choice([1,17,99], size=100),
"y" : np.random.rand(100)*100})
u, inv = np.unique(df["x"], return_inverse=True)
plt.scatter(inv, df["y"])
plt.xticks(range(len(u)),u)
plt.margins(x=0.5)
plt.show()
The same plot can be obtained using seaborn's stripplot:
sns.stripplot(x="x", y="y", data=df)
And a potentially nicer representation can be done via seaborn's swarmplot:
sns.swarmplot(x="x", y="y", data=df)

Related

Python vs matplotlib - Chart generation issue

I have the below python code. but as an output it gives a chart like in the attachment. And its really messy in python. Can anybody tell me hw to fix the issue and make the day in ascenting order in X axis?
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("C/desktop/data.xlsx")
df = df.loc[df['month'] == 8]
df = df.astype({'day': str})
plt.plot( 'day', 'cases', data=df)
In the first instance, i didnt take the day as str. So it came like this.
Because it had decimal numbers, i have converted it to str. now this happens.
What you got is typical of an unsorted dataset with many points per group.
As you did not provide an example, here is one:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'day': np.random.randint(1,21,size=100),
'cases': np.random.randint(0,50000,size=100),
})
plt.plot('day', 'cases', data=df)
There is no reason to plot a line in this case, you can use a scatter plot instead:
plt.scatter('day', 'cases', data=df)
To make more sense of your data, you can also compute an aggregated value (ex. mean):
plt.plot('day', 'cases', data=df.groupby('day', as_index=False)['cases'].mean())

How to make a bar chart with multiple series and count

I want to have x-axis = 'brand', y-axis = 'count', and 2 series for 'online_order' (True & False)
How can I do this on Python (using Jupyter?)
Right now, my Y axis comes on a scale of 0-1. I want to ensure that the Y axis is automated based on the values
This is the result I am getting :
I'm guessing the plot was made with something like the following:
Since the plot code is not included, it's just a guess.
df.groupby(['brand', 'online_order'])['count'].size().unstack().plot.bar(legend=True)
The issue is, size is not the value in 'count', it's .Groupby.size which computes group sizes, of which there is 1 of each.
Using seaborn
The easiest way to get the desired plot is using seaborn, which is a high-level API for matplolib.
Use seaborn.barplot with hue='online_order'.
The dataframe does not need to be reshaped.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# test data
df = pd.DataFrame({'brand': ['Solex', 'Solex', 'Giant Bicycles', 'Giant Bicycles'], 'online_order': [False, True, True, False], 'count': [2122, 2047, 1640, 1604]})
# plot
plt.figure(figsize=(7, 5))
sns.barplot(x='brand', y='count', hue='online_order', data=df)
Using pandas.DataFrame.pivot
.pivot changes the shape of the dataframe to accommodate the plot API
This option also uses pandas.DataFrame.plot.bar
df.pivot('brand', 'online_order', 'count').plot.bar()
If the data is a csv file, you can import matplotlib and pandas to create a graph and view the data.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("file name here")
plt.bar(data.brand,data.count)
plt.xlabel("brand")
plt.ylabel("count")

Python Pandas Seaborn FacetGrid: use dataframe series' names to set up columns

I am using pandas dataframes to hold some volume calculation results, and trying to configure a seaborn FacetGrid setup to visualize results of 4 different types of volume calculations for a reservoir zone.
I believe I can handle the dataframe part, my problems is with the visualization part:
Each different type of volume calculations is loaded in the dataframe as a series. The series name corresponds to the type of volume calculation. I want to create a number of plots then, aligned so that each column of plot corresponds to one series in my dataframe.
Theory (documentation) says this should do it (example from tutorial at https://seaborn.pydata.org/tutorial/axis_grids.html):
import seaborn as sns
import matpltlib.pyplot as plt
tips = sns.load_dataset("tips")
g=sns.FacetGrid(tips, col = "time")
I cannot find the referenced dataset "tips" for download, but I think that is a minor problem. From the code snippet above and after some testing on my own data, I infer that "time" in that dataset refers to the name of one series in the dataframe and that different times would then be different categories or other types of values in that series.
This is not how my dataset is ordered. I have the different types of volume calculations that I would see as individual plots (in columns) represented as series in my dataframe. How do I provide the series name as input to seaborn FacetGrid col= argument?
g = seaborn.FacetGrid(data=volumes_table, col=?????)
I cannot figure out how I can get col=dataframe.series and I cannot find any documented example of that.
here's a setup with some exciting dummy names and dummy values
import os
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
#provide some input data, using a small dictionary
volumes_categories = {'zone_numbers': [1, 2, 3, 4],
'zone_names': ['corona', 'hiv', 'h5n1', 'measles'],
'grv': [30, 90, 80, 100],
'nv': [20, 60, 20, 50],
'pv': [5, 12, 4, 25],
'hcpv': [4, 6, 1, 20]}
# create the dataframe
volumes_table = pandas.DataFrame(volumes_categories)
# set up for plotting
seaborn.set(style='ticks')
g= seaborn.FacetGrid(data=volumes_table, col='zone_names')
The above setup generates columns ok, but I cannot get the colums to represent series in my dataframe (the columns when visualizing the dataframe as a table....)
What do I need to do?
The main part of the solution is described in BBQuercus's answer: reshaping the nice, human-readable wide-format dataframe/table into a long-format table which is simpler to digest for seaborn, using seaborn.melt()
I implemented this by creating a copy of the original dataframe and melting the copy:
# first copy dataframe
vol_table2 = volumes_table.copy()
#melt it into long format
vol_table2 = pandas.melt(vol_table2, id_vars = ['zone_numbers','zone_names'], value_vars=['grv','nv','pv','hcpv'], var_name = "volume_type", value_name = "volume")
In the end I also decided to scrap the explicit FacetGrid and map setup and use seaborn.catplot (with FacetGrid functionality included).
Thanks for assistance
(PS: it must be a good idea for seaborn to accept series names for Facetgrid setup)
Once we imported all requirements:
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')
The FacetGrid essentially just provides a canvas to draw on. You can then use the map function to "project" plotting functions onto the canvas:
# Blueprint
g = sns.FacetGrid(dataframe, col="dataframe.column", row="dataframe.column")
g = g.map(plotting.function, "dataframe.column")
# Example with the tips dataset
g = sns.FacetGrid(tips, col="time", row="smoker")
g = g.map(plt.hist, "total_bill")
plt.show()
In your case as mentioned above I would also melt the columns first to get a tidy data format and then plot as usual. Changing what to plot however necessary:
volumes_table = volumes_table.melt(id_vars=['zone_numbers', 'zone_names'])
g = sns.FacetGrid(data=volumes_table, col='variable')
g = g.map(plt.scatter, 'zone_numbers', 'value')
plt.show()

How to remove certain values before plotting data

I'm using python for the first time. I have a csv file with a few columns of data: location, height, density, day etc... I am plotting height (i_h100) v density (i_cd) and have managed to constrain the height to values below 50 with the code below. I now want to constrain the values on the y axis to be within a certain 'day' range say (85-260). I can't work out how to do this.
import pandas
import matplotlib.pyplot as plt
data=pandas.read_csv('data.csv')
data.plot(kind='scatter',x='i_h100',y='i_cd')
plt.xlim(right=50)
Use .loc to subset data going into graph.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Make some dummy data
np.random.seed(42)
df = pd.DataFrame({'a':np.random.randint(0,365,20),
'b':np.random.rand(20),
'c':np.random.rand(20)})
# all data: plot of 'b' vs. 'c'
df.plot(kind='scatter', x='b', y='c')
plt.show()
# use .loc to subset data displayed based on value in 'a'
# can also use .loc to restrict values of 'b' displayed rather than plt.xlim
df.loc[df['a'].between(85,260) & (df['b'] < 0.5)].plot(kind='scatter', x='b', y='c')
plt.show()

Simple barplot of column means using seaborn

I have a pandas dataframe with 26 columns of numerical data. I want to represent the mean of each column in a barplot with 26 bars. This is easy to do with pandas plotting function: df.plot(kind = 'bar'). However, the results are ugly and the column labels are often truncated, i.e.:
I'd like to do this with seaborn instead, but can't seem to find a way no matter how hard I look. Surely there's an easy way to do a simple barplot of column averages? Thanks.
You can try something like this:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
fig = df.mean().plot(kind='bar')
plt.margins(0.02)
plt.ylabel('Your y-label')
plt.xlabel('Your x-label')
fig.set_xticklabels(df.columns, rotation = 45, ha="right")
plt.show()
If anyone finds this by a search, the easiest solution I've found (I'm OP) is to use use the pandas.melt() function. This concatenates all the columns into a single column, but adds a second column that preserves the column title adjacent to each value. This dataframe can be passed directly to seaborn.
You can use sns.barplot - especially for horizontal barplots more suitable for so many categories - like this:
import seaborn as sns
df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]})
unstacked = df.unstack().to_frame()
sns.barplot(
y=unstacked.index.get_level_values(0),
x=unstacked[0]);
df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]})
sns.barplot(x = df.mean().index, y = df.mean())
plt.show()

Categories