Boxplot - grouped data - Python (only frequency known)

I'm starting to use Jupyter and the pandas library, and I'm having trouble with boxplots.
I have the following dataframe:
[image: dataframe]
The problem is that I only have frequencies for different ranges of values, not the underlying observations. How can I plot this kind of frequency table? I'd like to make a boxplot for each frequency column.
Thank you!

I'm not sure I understand your question correctly, but here is a boxplot demo that I hope helps:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# a custom dataframe
df = pd.DataFrame({'A': np.random.randint(1, 10, 5),
                   'B': np.random.randint(1, 10, 5)})
# and your first column
df['E'] = pd.Series(['0-6', '7-13', '14-20', '21-27', '28-34'])
print(df)
df.boxplot(column=['A', 'B'], by='E', vert=True, showmeans=True,
           meanline=True, showfliers=False)
plt.show()
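If only the counts per range are known, one possible approach is to approximate every observation in a range by the midpoint of that range and repeat each midpoint as many times as its frequency. The sketch below assumes a hypothetical frequency table (the original dataframe was posted as an image, so the column names and counts here are made up), and the resulting boxplots are only as accurate as the midpoint approximation.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical frequency table: one row per value range, one count column per group
freq = pd.DataFrame(
    {"A": [2, 5, 9, 3, 1], "B": [1, 3, 4, 8, 6]},
    index=["0-6", "7-13", "14-20", "21-27", "28-34"],
)

# Approximate each range "lo-hi" by its midpoint
midpoints = np.array(
    [(int(lo) + int(hi)) / 2 for lo, hi in (label.split("-") for label in freq.index)]
)

# Rebuild pseudo-observations: repeat each midpoint according to its frequency
samples = {col: np.repeat(midpoints, freq[col].to_numpy()) for col in freq.columns}

# One box per frequency column
fig, ax = plt.subplots()
ax.boxplot(list(samples.values()), showmeans=True)
ax.set_xticklabels(list(samples.keys()))
ax.set_ylabel("value (range midpoint)")
```

Outside a notebook, call plt.show() afterwards to display the figure.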

Related

2 line plot using seaborn

I want to plot this array with seaborn. I used:
import seaborn as sns
sns.set_style('whitegrid')
sns.kdeplot(data= score_for_modelA[:,0])
But this only plots the first column. My scores are in both columns, and I want both of them plotted on the same graph.
The sample data looks like this:
array([[0.67,0.33],[0.45,0.55],......,[0.81,0.19]])
You can try putting them into a data frame first, with the proper column names, for example:
import seaborn as sns
import numpy as np
import pandas as pd
# create sample dataframe in wide format
score_for_modelA = np.random.normal(0, 1, (50, 2))
df = pd.DataFrame(score_for_modelA, columns=['col1', 'col2'])
# use melt to convert the dataframe to a long form
dfm = df.melt()
# plot the long-form dataframe
sns.kdeplot(data=dfm, hue="variable", x="value")
As pointed out by @JohanC, if you want all of the columns:
sns.kdeplot(data=df)

Is there a way to plot a heatmap for a dataframe based on rows/columns?

Has anyone made a heatmap of a pandas DataFrame in which each column is colored individually (with the same color gradient running from 'low' to 'high' within each column)? It is like per-column conditional formatting in Excel (see the included image). I tried sns.heatmap, but it colors by the overall range of the whole table. I have a DataFrame like the one below:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=30).reshape(5, 6),
                  columns=['A', 'B', 'C', 'D', 'E', 'F'],
                  index=['aa', 'bb', 'cc', 'dd', 'ee'])
I wanted to make something like this.
One trick using seaborn.heatmap is to apply a min-max normalization to each column of your DataFrame, so that the values of each column are rescaled to the range [0, 1].
The rescaled values are used to map the colors, but you annotate the heatmap with the original values (i.e., pass annot=df).
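import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame(np.random.randint(0, 100, size=30).reshape(5, 6),
                  columns=['A', 'B', 'C', 'D', 'E', 'F'],
                  index=['aa', 'bb', 'cc', 'dd', 'ee'])
# rescale every column to [0, 1] so that colors are comparable per column
norm_df = (df - df.min(0)) / (df.max(0) - df.min(0))
# color by the rescaled values, annotate with the original ones
sns.heatmap(norm_df, annot=df, cmap="YlGn", cbar=False, linewidths=0.01)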
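One caveat with the per-column min-max step: a column whose values are all equal has a range of zero, so the division yields NaN. A minimal sketch of a guarded version follows; the helper name minmax_by_column and the choice to map constant columns to 0 are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

def minmax_by_column(df):
    """Rescale each column to [0, 1]; constant columns become all zeros."""
    col_min = df.min(axis=0)
    col_range = df.max(axis=0) - col_min
    # A zero range would divide by zero, so mark it as NaN first
    safe_range = col_range.replace(0, np.nan)
    # Note: fillna also turns missing values in the input into 0
    return ((df - col_min) / safe_range).fillna(0.0)

# Small example: column B is constant
example = pd.DataFrame({"A": [1, 2, 3], "B": [5, 5, 5]})
scaled = minmax_by_column(example)
```

The result can be passed to sns.heatmap in place of norm_df, still with annot set to the original dataframe.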

Python Pandas Seaborn FacetGrid: use dataframe series' names to set up columns

I am using pandas dataframes to hold some volume calculation results, and trying to configure a seaborn FacetGrid setup to visualize results of 4 different types of volume calculations for a reservoir zone.
I believe I can handle the dataframe part; my problem is with the visualization part:
Each type of volume calculation is loaded into the dataframe as a series, and the series name corresponds to the type of volume calculation. I want to create a number of plots, aligned so that each column of plots corresponds to one series in my dataframe.
Theory (documentation) says this should do it (example from tutorial at https://seaborn.pydata.org/tutorial/axis_grids.html):
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset("tips")
g = sns.FacetGrid(tips, col="time")
I cannot find the referenced dataset "tips" for download, but I think that is a minor problem. From the code snippet above and after some testing on my own data, I infer that "time" in that dataset refers to the name of one series in the dataframe and that different times would then be different categories or other types of values in that series.
This is not how my dataset is ordered. I have the different types of volume calculations that I would see as individual plots (in columns) represented as series in my dataframe. How do I provide the series name as input to seaborn FacetGrid col= argument?
g = seaborn.FacetGrid(data=volumes_table, col=?????)
I cannot figure out how I can get col=dataframe.series and I cannot find any documented example of that.
here's a setup with some exciting dummy names and dummy values
import os
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
#provide some input data, using a small dictionary
volumes_categories = {'zone_numbers': [1, 2, 3, 4],
                      'zone_names': ['corona', 'hiv', 'h5n1', 'measles'],
                      'grv': [30, 90, 80, 100],
                      'nv': [20, 60, 20, 50],
                      'pv': [5, 12, 4, 25],
                      'hcpv': [4, 6, 1, 20]}
# create the dataframe
volumes_table = pandas.DataFrame(volumes_categories)
# set up for plotting
seaborn.set(style='ticks')
g= seaborn.FacetGrid(data=volumes_table, col='zone_names')
The above setup generates plot columns fine, but I cannot get the plot columns to represent the series in my dataframe (that is, the columns you see when viewing the dataframe as a table).
What do I need to do?
The main part of the solution is described in BBQuercus's answer: reshape the nice, human-readable wide-format dataframe into a long-format table, which is easier for seaborn to digest, using pandas.melt().
I implemented this by creating a copy of the original dataframe and melting the copy:
# first copy dataframe
vol_table2 = volumes_table.copy()
#melt it into long format
vol_table2 = pandas.melt(vol_table2,
                         id_vars=['zone_numbers', 'zone_names'],
                         value_vars=['grv', 'nv', 'pv', 'hcpv'],
                         var_name='volume_type', value_name='volume')
In the end I also decided to scrap the explicit FacetGrid and map setup and use seaborn.catplot (with FacetGrid functionality included).
Thanks for assistance
(PS: it would be a good idea for seaborn to accept series names for the FacetGrid setup)
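The catplot route mentioned in this answer could look roughly like the sketch below. It reuses the dummy data from the question; the choice of kind="bar" and of zone_names on the x axis is an assumption, since the answer does not say which kind of plot was used in the end.

```python
import pandas as pd
import seaborn as sns

# Dummy data from the question (wide format: one column per volume type)
volumes_table = pd.DataFrame({
    "zone_numbers": [1, 2, 3, 4],
    "zone_names": ["corona", "hiv", "h5n1", "measles"],
    "grv": [30, 90, 80, 100],
    "nv": [20, 60, 20, 50],
    "pv": [5, 12, 4, 25],
    "hcpv": [4, 6, 1, 20],
})

# Long format: one row per (zone, volume type) pair
long_table = volumes_table.melt(
    id_vars=["zone_numbers", "zone_names"],
    value_vars=["grv", "nv", "pv", "hcpv"],
    var_name="volume_type",
    value_name="volume",
)

# catplot builds the FacetGrid itself: one plot column per volume type
g = sns.catplot(data=long_table, x="zone_names", y="volume",
                col="volume_type", kind="bar")
```

Because the former series names now live in the volume_type column, they can be passed to col= directly.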
Once we have imported everything we need:
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')
The FacetGrid essentially just provides a canvas to draw on. You can then use the map function to "project" plotting functions onto the canvas:
# Blueprint (placeholders, not runnable as-is):
# g = sns.FacetGrid(dataframe, col="<column name>", row="<column name>")
# g = g.map(plotting_function, "<column name>")
# Example with the tips dataset
g = sns.FacetGrid(tips, col="time", row="smoker")
g = g.map(plt.hist, "total_bill")
plt.show()
In your case, as mentioned above, I would first melt the columns to get a tidy (long) data format and then plot as usual, changing the plotting function and columns as necessary:
volumes_table = volumes_table.melt(id_vars=['zone_numbers', 'zone_names'])
g = sns.FacetGrid(data=volumes_table, col='variable')
g = g.map(plt.scatter, 'zone_numbers', 'value')
plt.show()

Pandas scatter_matrix plotting - additional arguments

I am running Python 3.6 with Pandas version 0.19.2. On the code example below, I have two questions regarding the Pandas plotting function scatter_matrix():
1. How can I colour-label the observations in the scatter plots with respect to the Label column?
2. How can I specify the number of bins for the histograms on the diagonal? Can I do this individually or just one bin number for all?
import pandas as pd
import numpy as np
N= 1000
df_feat = pd.DataFrame(np.random.randn(N, 4), columns=['A','B','C','D'])
df_label = pd.DataFrame(np.random.choice([0,1], N), columns=['Label'])
df = pd.concat([df_feat, df_label], axis=1)
axes = pd.tools.plotting.scatter_matrix(df, alpha=0.2)
This question is linked to a more general one.
To answer your first question, there may be a less 'kludgey' way, but this works:
scatter_matrix(df,c=['r' if i == 1 else 'b' for i in df['Label']])
To answer the second:
scatter_matrix forwards the keywords in the hist_kwds dictionary to the histogram calls on the diagonal:
scatter_matrix(df,hist_kwds={'bins':5})
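In recent pandas versions the function is exposed as pandas.plotting.scatter_matrix rather than pd.tools.plotting.scatter_matrix. The following sketch combines both answers under that assumption; the seeded random generator is only there to make the example reproducible. Note that the same hist_kwds dictionary is applied to every diagonal histogram.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

N = 1000
rng = np.random.default_rng(0)  # seeded only for reproducibility
df_feat = pd.DataFrame(rng.standard_normal((N, 4)), columns=["A", "B", "C", "D"])
labels = rng.choice([0, 1], N)

# One color per observation, derived from the label
colors = ["r" if label == 1 else "b" for label in labels]

# c= is passed on to the scatter plots, hist_kwds to the diagonal histograms
axes = scatter_matrix(df_feat, alpha=0.2, c=colors, hist_kwds={"bins": 5})
```

The label column is kept out of the plotted frame here so that it is used only for coloring.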

Seaborn tsplot not showing data

I'm trying to use seaborn to make a simple tsplot, but for reasons that aren't clear to me nothing shows up when I run the code. Here's a minimal example:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'value': np.random.rand(31), 'time': range(31)})
ax = sns.tsplot(data=df, value='value', time='time')
plt.show()
Usually with tsplot you supply multiple data points for each time point, but does it simply not work if you supply only one?
I know matplotlib can be used to do this pretty easily, but I wanted to use seaborn for some of its other functionality.
You are missing individual units. When you pass a data frame, tsplot assumes that several time series have been recorded, one per unit, and that each unit can be identified in the data frame. The error band is then computed across the units.
So for a single series, you can get it working like this:
df = pd.DataFrame({'value': np.random.rand(31), 'time': range(31)})
df['subject'] = 0
sns.tsplot(data=df, value='value', time='time', unit='subject')
Just to see how the error is computed, look at this example:
dfs = []
for i in range(10):
    df = pd.DataFrame({'value': np.random.rand(31), 'time': range(31)})
    df['subject'] = i
    dfs.append(df)
all_dfs = pd.concat(dfs)
sns.tsplot(data=all_dfs, value='value', time='time', unit='subject')
You can also set the time column as the index and then plot the resulting Series:
import matplotlib.pyplot as plt
df = pd.DataFrame({'value': np.random.rand(31), 'time': range(31)})
df = df.set_index('time')['value']
ax = sns.tsplot(data=df)
plt.show()
