Heatmap or other two variable histogram option? - python

I have a dataframe with two columns, the first one can have an integer from 0-15, the other one can have an integer from 0-10.
The df has approximately 10,000 rows.
I want to plot some sort of grid, (15x10) that can visually represent how many instances of each combination I have throughout the dataframe, ideally displaying the actual number on every grid cell.
I have tried both Seaborn and Matplotlib.
In Seaborn I tried a jointplot which almost did it but I can't get it to show an actual 15x10 grid. I also tried a heatmap but it gave me an error (see below) and I wasn't able to find anything on it.
I also tried plotting some sort of 3D histogram.
Finally I tried pivoting the data but Pandas calculates the numbers as values instead of treating them as "buckets".
Not sure where to go from here.
*heatmap error: "ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''"
sns.heatmap(x='pressure_bucket', y='rate_bucket', data=df)
The closest to what I want is something like this, ideally with the actual numbers in each cell
https://imgur.com/a/d4qWIod
Thanks to all in advance!

We can use plt.imshow to display a heat map:
# get the counts in form of a dataframe indexed by (c1,c2)
counts = df.groupby(['c1'])['c2'].value_counts().rename('value').reset_index()
# pivot to c1 as index, c2 as columns
counts = counts.pivot(index='c1', columns='c2', values='value')
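# the pivot only contains values that occur in the data, so add the missing ones
# add rows for missing c1 values (0-15); reindex is not in-place, so assign the result
counts = counts.reindex(range(16))
# add columns for missing c2 values (0-10, i.e. 11 values)
counts = counts.reindex(range(11), axis=1)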
# fill all missing values with 0
counts = counts.fillna(0)
# imshow
plt.figure(figsize=(15,10))
plt.imshow(counts, cmap='hot')
plt.grid(False)
plt.show()
# sns would give a color bar legend
plt.figure(figsize=(15,10))
sns.heatmap(counts, cmap='hot')
plt.show()
Output with random entries: [figure: imshow heat map]
Output with sns: [figure: seaborn heat map with color bar]
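The question also asks for the actual count in every cell, and sns.heatmap expects a 2D table of values rather than x/y column names. A minimal sketch of one way to get both, assuming the column names from the question (pressure_bucket, rate_bucket) and using random data as a stand-in for the real DataFrame; the helper name plot_counts is made up for illustration:

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): random buckets in the ranges given in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pressure_bucket": rng.integers(0, 16, size=10_000),  # 0..15
    "rate_bucket": rng.integers(0, 11, size=10_000),      # 0..10
})

# Count every (pressure_bucket, rate_bucket) combination, then make sure
# every bucket is present, filling combinations that never occur with 0.
counts = pd.crosstab(df["pressure_bucket"], df["rate_bucket"]).reindex(
    index=range(16), columns=range(11), fill_value=0
)


def plot_counts(counts):
    """Draw the count table as a heat map with the number written in each cell."""
    # Imported here so the counting step above runs without a plotting backend.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(15, 10))
    # annot=True writes the value into each cell; fmt="d" formats it as an integer.
    sns.heatmap(counts, annot=True, fmt="d", cmap="hot", ax=ax)
    return ax
```

Calling plot_counts(counts) followed by plt.show() displays the grid. pd.crosstab replaces the groupby/pivot steps here, and fill_value=0 keeps the table as integers, which fmt="d" requires.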

Related

pandas DataFrame line plot does not work when there are missing values but scatter plot works fine

In Python pandas DataFrame, when a column does not have value for every index value, the line plot will be partially or entirely missing, but the scatter plot will be just fine. Is there a way to plot the line plot correctly? I am on 0.24.2 version of pandas.
Note this question is not a duplicate of some other similar questions, since I don't want to fillna or interpolate the missing values since that is not what I want to show. I just want the missing to stay missing, and a straight line connect every two closest non-missing dots (which is the normal behavior one would expect for line plots).
Thanks in advance.
Creating an example dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(index = range(1,21,2), columns=['val1'])
df1.val1=np.random.rand(10)
df2 = pd.DataFrame(index = range(2,22,2), columns=['val2'])
df2.val2=np.random.rand(10)
df = pd.concat([df1, df2], sort=False).sort_index()
df
A scatter plot looks just as what I expect:
df.plot(style='.')
A line plot, on the contrary, does not work:
df.plot(style='-')
You need to drop the NaN values on the fly. The DataFrame itself stays unchanged; only the non-NaN values and their index labels are used for plotting. By default, pandas plot uses the index as the x-axis. Dropping NaN values keeps the original index labels, so every point stays at its original x position and the x-axis is not squeezed.
df.plot()
df.iloc[:,0].dropna().plot()
df.iloc[:,1].dropna().plot()
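The same idea can be written as a loop over all columns that draws onto one shared axis. This is only a sketch: the example frame mirrors the one in the question, and the helper name plot_lines_skipping_nan is made up for illustration.

```python
import numpy as np
import pandas as pd

# Two columns whose values sit on interleaved index labels, as in the question.
df1 = pd.DataFrame({"val1": np.random.rand(10)}, index=range(1, 21, 2))
df2 = pd.DataFrame({"val2": np.random.rand(10)}, index=range(2, 22, 2))
df = pd.concat([df1, df2], sort=False).sort_index()


def plot_lines_skipping_nan(df):
    """Plot each column as a line that connects its non-NaN points only."""
    # Imported here so the data preparation runs without a plotting backend.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for col in df.columns:
        # dropna() keeps the original index labels, so x positions are unchanged.
        df[col].dropna().plot(ax=ax, style="-", label=col)
    ax.legend()
    return ax
```

The DataFrame is not modified; dropna() returns a new Series for each column, and the missing values stay missing in df.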
Pandas will not plot a line with NaN values. There is no way around this as far as I know. Either use a different plot type or fill your NaN values in an acceptable manner.
It isn't clear what you would expect a line plot to even look like here. You have 0 (x,y) pairs to plot.
Note that lines and markers are different kinds of style: a line connects two points, while a marker sits on a single point.
There will be a line if there are two consecutive non-na points.
There will be a marker, however, as long as there is a non-na point.
I.e., the point can be isolated from the other points.
This behavior not only applies to pandas plot, but also to most software in general (e.g., Excel).

How do I plot my histogram for density rather than count? (Matplotlib)

I have a data frame called 'train' with a column 'string' and a column 'string length' and a column 'rank' which has ranking ranging from 0-4.
I want to create a histogram of the string length for each ranking and plot all of the histograms on one graph to compare. I am experiencing two issues with this:
The only way I can manage to do this is by creating separate datasets e.g. with the following type of code:
S0 = train.loc[train['rank'] == 0]
S1 = train.loc[train['rank'] == 1]
Then I create individual histograms for each dataset using:
plt.hist(train['string length'], bins = 100)
plt.show()
This code doesn't plot the density but instead plots the counts. How do I alter my code such that it plots density instead?
Is there also a way to do this without having to create separate datasets? I was told that my method is 'unpythonic'
You could do something like:
df.loc[:, df.columns != 'string'].groupby('rank').hist(density=True, bins =10, figsize=(5,5))
Basically, it selects all columns except string, groups them by rank, and draws a histogram of each remaining column with the given arguments.
Setting density=True normalizes each histogram so that its total area is 1, instead of showing raw counts.
Hope this has helped.
EDIT:
If there are more variables and you want the histograms overlapped, try:
df.groupby('rank')['string length'].hist(density=True, histtype='step', bins =10,figsize=(5,5))
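For a labelled overlay without creating separate datasets by hand, one option is to loop over the groups and draw each histogram on the same axis. This is a sketch under assumptions: the random train frame below only imitates the columns described in the question, and plot_density_by_rank is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): ranks 0-4 and arbitrary string lengths.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "rank": rng.integers(0, 5, size=500),
    "string length": rng.integers(1, 200, size=500),
})


def plot_density_by_rank(train, bins=20):
    """Overlay one normalized histogram of 'string length' per rank."""
    # Imported here so the numeric check below runs without a plotting backend.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for rank, group in train.groupby("rank"):
        ax.hist(group["string length"], bins=bins, density=True,
                histtype="step", label=f"rank {rank}")
    ax.set_xlabel("string length")
    ax.set_ylabel("density")
    ax.legend()
    return ax


# What density=True means numerically: bar heights times bin widths sum to 1.
heights, edges = np.histogram(
    train.loc[train["rank"] == 0, "string length"], bins=20, density=True
)
area = float(np.sum(heights * np.diff(edges)))
```

Because each group is normalized separately, ranks with very different numbers of rows can be compared on the same scale.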

Fixing axis spacing (ticks) in Bokeh scatter plots

I'm generating scatter plots with Bokeh with differing numbers of Y values for each X value. When Bokeh generates the plot, it automatically pads the x-axis spacing based on the number of values plotted. I would like all values on the x-axis to be spaced evenly, regardless of the number of individual data points. I've looked into manually setting the ticks, but it looks like I would have to set the spacing myself with that approach (i.e. specify the exact positions). I would like it to set the spacing evenly automatically, as it does when plotting singular x,y value pairs. Can this be done?
Here is an example showing the behavior.
import pandas
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
days =['Mon','Mon','Mon', 'Tues', 'Tues', 'Weds','Weds','Weds','Weds']
vals = [1,3,5,2,3,6,3,2,4]
df = pandas.DataFrame({'Day': days, 'Values':vals})
source = ColumnDataSource(df)
p = figure(x_range=df['Day'].tolist())
p.circle(x='Day', y='Values', source=source)
show(p)
You are passing a list of strings as the range. This creates a categorical axis. However, the list of categories for the range is expected to be unique, with no duplicates. You are passing a list with duplicate values. This is actually invalid usage, and the result is undefined behavior. You should pass a unique list of categorical factors, in the order you want them to appear, for the range.
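A minimal sketch of that fix, using the data from the question. The duplicates are removed while keeping first-seen order; the helper name build_plot is made up for illustration, and scatter is used as the glyph method here, which is an assumption about the installed Bokeh version (the question's p.circle call could be kept instead).

```python
days = ['Mon', 'Mon', 'Mon', 'Tues', 'Tues', 'Weds', 'Weds', 'Weds', 'Weds']
vals = [1, 3, 5, 2, 3, 6, 3, 2, 4]

# dict.fromkeys drops duplicates and preserves the order of first appearance.
factors = list(dict.fromkeys(days))


def build_plot():
    """Build the scatter plot with a categorical x-axis of unique factors."""
    # Imported here so the factor computation runs even without Bokeh installed.
    from bokeh.models import ColumnDataSource
    from bokeh.plotting import figure

    source = ColumnDataSource(data={'Day': days, 'Values': vals})
    p = figure(x_range=factors)  # unique categories only
    p.scatter(x='Day', y='Values', source=source)
    return p
```

Passing the result of build_plot() to bokeh.plotting.show displays it. With a DataFrame, df['Day'].unique().tolist() gives the same list of factors.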

How do I create a multiline plot using seaborn?

I am trying out Seaborn to make my plot visually better than matplotlib. I have a dataset which has a column 'Year' which I want to plot on the X-axis and 4 Columns say A,B,C,D on the Y-axis using different coloured lines. I was trying to do this using the sns.lineplot method but it allows for only one variable on the X-axis and one on the Y-axis. I tried doing this
sns.lineplot(data_preproc['Year'],data_preproc['A'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['B'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['C'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['D'], err_style=None)
But this way I don't get a legend in the plot to show which coloured line corresponds to what. I tried checking the documentation but couldn't find a proper way to do this.
Seaborn favors the "long format" as input. The key ingredient to convert your DataFrame from its "wide format" (one column per measurement type) into long format (one column for all measurement values, one column to indicate the type) is pandas.melt. Given a data_preproc structured like yours, filled with random values:
import numpy as np
import pandas as pd
num_rows = 20
years = list(range(1990, 1990 + num_rows))
data_preproc = pd.DataFrame({
    'Year': years,
    'A': np.random.randn(num_rows).cumsum(),
    'B': np.random.randn(num_rows).cumsum(),
    'C': np.random.randn(num_rows).cumsum(),
    'D': np.random.randn(num_rows).cumsum()})
A single plot with four lines, one per measurement type, is obtained with
sns.lineplot(x='Year', y='value', hue='variable',
data=pd.melt(data_preproc, ['Year']))
(Note that 'value' and 'variable' are the default column names returned by melt, and can be adapted to your liking.)
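To make the wide-to-long conversion concrete, here is a small sketch that shows the shape of the melted frame and uses custom column names. The names series and reading, and the helper plot_long, are illustrative choices and not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
num_rows = 20
data_preproc = pd.DataFrame({
    "Year": range(1990, 1990 + num_rows),
    "A": rng.standard_normal(num_rows).cumsum(),
    "B": rng.standard_normal(num_rows).cumsum(),
    "C": rng.standard_normal(num_rows).cumsum(),
    "D": rng.standard_normal(num_rows).cumsum(),
})

# Wide (20 rows x 5 columns) -> long (80 rows x 3 columns):
# one row per (Year, series) pair, with the measurement in 'reading'.
long_df = pd.melt(data_preproc, id_vars=["Year"],
                  var_name="series", value_name="reading")
print(long_df.head())


def plot_long(long_df):
    """Draw one line per series; seaborn builds the legend from 'hue'."""
    # Imported here so the reshaping step runs without a plotting backend.
    import seaborn as sns

    return sns.lineplot(x="Year", y="reading", hue="series", data=long_df)
```

Each of the four value columns contributes 20 rows, so the long frame has 80 rows, and the hue column is what gives every line its own colour and legend entry.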
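This:
sns.lineplot(data=data_preproc.set_index('Year'))
will do what you want. Seaborn treats the DataFrame as wide-form data: it plots each column as its own line against the index, with a legend entry per column. Setting Year as the index keeps it on the x-axis instead of drawing it as a fifth line.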
See the documentation:
sns.lineplot(x="Year", y="signal", hue="label", data=data_preproc)
You probably need to re-organize your dataframe in a suitable way so that there is one column for the x data, one for the y data, and one which holds the label for the data point.
You can also just use matplotlib.pyplot. If you import seaborn, much of the improved design is also used for "regular" matplotlib plots. Seaborn is really "just" a collection of methods which conveniently feed data and plot parameters to matplotlib.

How to connect boxplot median values

It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.
I'm using this syntax to do the boxplot so that it automatically generate the box plot for Y vs. X device without having to do external manipulation of the data frame:
df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)
One way I thought of doing is to just do a line plot by getting the mean values from the boxplot, but I'm not sure how to extract that information from the plot.
You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.
First let's generate some sample data:
import pandas as pd
import numpy as np
import seaborn as sns
N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})
print(df.head())
group value
0 A 0.816847
1 A 0.468465
2 C 0.871975
3 B 0.933708
4 A 0.480170
...
Next, make the boxplot and save the axis object:
ax = df.boxplot(column='value', by='group', showfliers=True,
positions=range(df.group.unique().shape[0]))
Note: There's a curious positions argument in Pyplot/Pandas boxplot(), which can cause off-by-one errors. See more in this discussion, including the workaround I've employed here.
Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:
sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
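A seaborn-free sketch of the median variant, assuming the same kind of sample data as above. It relies on two behaviours: groupby sorts the group labels, and the boxes are drawn at x positions 1..n by default, in that same sorted order. The helper name boxplot_with_median_line is made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): three groups with random values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.random(150),
    "group": rng.choice(["A", "B", "C"], size=150),
})

# One median per category, in sorted category order.
medians = df.groupby("group")["value"].median()


def boxplot_with_median_line(df, medians):
    """Draw the grouped boxplot and connect the group medians with a line."""
    ax = df.boxplot(column="value", by="group", showfliers=True)
    # Default box positions are 1..n, so the line uses the same x values.
    positions = range(1, len(medians) + 1)
    ax.plot(positions, medians.to_numpy(), marker="o", color="C1")
    return ax
```

Swapping median() for mean() connects the means instead, matching the showmeans=True markers from the question.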
You can get the median values by calling the .get_data() method of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.
Let bp be your boxplot created as bp=plt.boxplot(data). Then, bp is a dict containing the medians key, among others. That key contains a list of matplotlib.lines.Line2D, from which you can extract the (x,y) position as follows:
bp=plt.boxplot(data)
X=[]
Y=[]
for m in bp['medians']:
    [[x0, x1],[y0,y1]] = m.get_data()
    X.append(np.mean((x0,x1)))
    Y.append(np.mean((y0,y1)))
plt.plot(X,Y,c='C1')
For an arbitrary dataset (data), this script generates this figure. Hope it helps!
