Plot distribution of pandas dataframe depending on target value - python

I want to visualize the grade depending on the sex (male/female).
My dataframe:
df = pd.DataFrame(
{
"key": ["K0", "K1", "K2", "K3", "K4", "K5", "K6", "K7", "K8", "K9"],
"grade": [1.0, 2.0, 4.0, 1.0, 5.0, 2.0, 3.0, 1.0, 6.0, 3.0],
"sex": [1, 0, 0, 1, 0,1,0,1,0,0]
}
)
key grade sex
0 K0 1.0 1
1 K1 2.0 0
2 K2 4.0 0
3 K3 1.0 1
4 K4 5.0 0
5 K5 2.0 1
6 K6 3.0 0
7 K7 1.0 1
8 K8 6.0 0
9 K9 3.0 0
My approach was to use a histogram and plot the distribution. However, I don't know how to visualize the distribution depending on the target. There are some examples in the Seaborn documentation, but I failed to apply them to my specific problem.
All I have is this:
plt.hist(df['grade'], bins=10, edgecolor='black');
plt.xlabel('grade');
plt.ylabel('count');

You can do this in matplotlib:
import matplotlib.pyplot as pyplot
x=df.loc[df['sex']==1, 'grade']
y=df.loc[df['sex']==0, 'grade']
bins=list(range(1, 8))  # edges 1..7 so that grade 6 is not dropped
pyplot.hist(x, bins, alpha=0.5, label='sex=1')
pyplot.hist(y, bins, alpha=0.5, label='sex=0')
pyplot.legend(loc='upper right')
pyplot.show()

There is also a way to do this with pandas:
df[df['sex'] == 0]['grade'].plot.hist()
df[df['sex'] == 1]['grade'].plot.hist()
You can also get a smooth curve by using kde():
df[df['sex'] == 0]['grade'].plot.kde()
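As a hedged sketch of the same idea, the per-group histograms can also be drawn in one loop over groupby, with shared bin edges so the two groups are directly comparable. The bin edges 1..7 and the `counts` table are assumptions made for illustration, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3", "K4", "K5", "K6", "K7", "K8", "K9"],
    "grade": [1.0, 2.0, 4.0, 1.0, 5.0, 2.0, 3.0, 1.0, 6.0, 3.0],
    "sex": [1, 0, 0, 1, 0, 1, 0, 1, 0, 0],
})

bins = list(range(1, 8))  # assumed edges: one bin per grade 1..6

fig, ax = plt.subplots()
# One semi-transparent histogram per value of the target column.
for sex_value, grades in df.groupby("sex")["grade"]:
    ax.hist(grades, bins=bins, alpha=0.5, edgecolor="black",
            label=f"sex={sex_value}")
ax.set_xlabel("grade")
ax.set_ylabel("count")
ax.legend(loc="upper right")

# The same information as a table: rows are sex, columns are grade.
counts = df.groupby("sex")["grade"].value_counts().unstack(fill_value=0)
```

The `counts` table is a quick way to check that the bars show what you expect before tuning the plot.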


How to do Pandas stacked bar chart on number line instead of categories

I am trying to make a stacked bar chart where the x-axis is based on a regular number line instead of categories. Maybe bar chart is not the right term?
How can I make the stacked bars, but have the x number line be spaced "normally" (with a big relative gap between 5.0 and 10.6)? I also want to set a regular tick interval, instead of having every bar labeled. (The real dataset is dense but with some spurious gaps, and I want to use the bar colors to qualitatively show changes as a function of x.)
fid = ["name", "name", "name", "name", "name"]
x = [1.02, 1.3, 2, 5, 10.6]
y1 = [0, 1, 0.2, 0.6, 0.1]
y2 = [0.3, 0, 0.1, 0.1, 0.4]
y3 = [0.7, 0, 0.7, 0.3, 0.5]
df = pd.DataFrame(data=zip(fid, x, y1, y2, y3), columns=["fid", "x", "y1", "y2", "y3"])
fig, ax = plt.subplots()
df.plot.bar(x="x", stacked=True, ax=ax)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2)
In a pandas bar chart (DataFrame.plot.bar), the x values are treated as categorical data, so the bars are always drawn at positions range(0, ...) and the ticks are relabeled with the x values.
To scale the bar distances, reindex the x values to have filler rows between the real data points:
start, stop = 0, 16
xstep = 0.01
tickstep = 2
xfill = np.round(np.arange(start, stop + xstep, xstep), 2)
out = df.set_index("x").reindex(xfill).reset_index()
ax = out.plot.bar(x="x", stacked=True, width=20, figsize=(10, 3))
xticklabels = np.arange(start, stop+tickstep, tickstep).astype(float)
xticks = out.index[out.x.isin(xticklabels)]
ax.set_xticks(xticks)
ax.set_xticklabels(xticklabels)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2)
Details
Generate the xfill as [0, 0.01, 0.02, ...]. I've tried to make this portable by extracting the max number of decimals from x, but float precision is always tricky so this may need to be tweaked:
decimals = df.x.astype(str).str.split(".").str[-1].str.len().max()
xstep = 10.0 ** -decimals
start = 0
stop = 16
xfill = np.round(np.arange(start, stop + xstep, xstep), decimals)
# array([ 0. , 0.01, 0.02, 0.03, 0.04, 0.05, ...])
Reindex the x column against this new xfill, so the filler rows will be NaN:
out = df.set_index("x").reindex(xfill).reset_index()
# x fid y1 y2 y3
# 0.00 NaN NaN NaN NaN
# ... ... ... ... ...
# 1.01 NaN NaN NaN NaN
# 1.02 name 0.0 0.3 0.7
# 1.03 NaN NaN NaN NaN
# ... ... ... ... ...
# 1.29 NaN NaN NaN NaN
# 1.30 name 1.0 0.0 0.0
# 1.31 NaN NaN NaN NaN
# ... ... ... ... ...
# 1.99 NaN NaN NaN NaN
# 2.00 name 0.2 0.1 0.7
# 2.01 NaN NaN NaN NaN
# ... ... ... ... ...
# 4.99 NaN NaN NaN NaN
# 5.00 name 0.6 0.1 0.3
# 5.01 NaN NaN NaN NaN
# ... ... ... ... ...
# 10.59 NaN NaN NaN NaN
# 10.60 name 0.1 0.4 0.5
# 10.61 NaN NaN NaN NaN
# ... ... ... ... ...
# 16.00 NaN NaN NaN NaN
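Because the reindex relies on exact float equality, it may be worth checking that every original x value actually landed on the filler grid. The following is a minimal sketch of such a check, assuming two decimals and a reduced frame with only `x` and `y1`; the names `matched` and `missing` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.02, 1.3, 2, 5, 10.6],
                   "y1": [0, 1, 0.2, 0.6, 0.1]})

decimals = 2                      # assumed precision of the x values
xstep = 10.0 ** -decimals
xfill = np.round(np.arange(0, 16 + xstep, xstep), decimals)

out = df.set_index("x").reindex(xfill)

# Rows that survived the reindex keep their y1 value; filler rows are NaN.
matched = int(out["y1"].notna().sum())
# Original x values that have no exact counterpart on the grid.
missing = sorted(set(df["x"]) - set(xfill))
```

If `missing` is not empty, the rounding precision or the step probably needs adjusting before plotting.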
Plot the reindexed data (with xticks spaced apart by tickstep):
ax = out.plot.bar(x="x", stacked=True, width=20, figsize=(10, 3))
tickstep = 2
xticklabels = np.arange(start, stop + tickstep, tickstep).astype(float)
xticks = out.index[out.x.isin(xticklabels)]
ax.set_xticks(xticks)
ax.set_xticklabels(xticklabels)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2)
Combined code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"fid": ["name", "name", "name", "name", "name"], "x": [1.02, 1.3, 2, 5, 10.6], "y1": [0, 1, 0.2, 0.6, 0.1], "y2": [0.3, 0, 0.1, 0.1, 0.4], "y3": [0.7, 0, 0.7, 0.3, 0.5]})
decimals = df.x.astype(str).str.split(".").str[-1].str.len().max()
xstep = 10.0 ** -decimals
start = 0
stop = 16
xfill = np.round(np.arange(start, stop + xstep, xstep), decimals)
out = df.set_index("x").reindex(xfill).reset_index()
ax = out.plot.bar(x="x", stacked=True, width=20, figsize=(10, 3))
tickstep = 2
xticklabels = np.arange(start, stop + tickstep, tickstep).astype(float)
xticks = out.index[out.x.isin(xticklabels)]
ax.set_xticks(xticks)
ax.set_xticklabels(xticklabels)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2)
My answer below illustrates how stacking with spacing can be done. You can adapt the solution and tailor the function to your needs; for example, you don't need itertools, a regular counter will do. You can also adjust the arguments as needed.
The idea behind the solution:
use cumsum to calculate the stacked heights
use matplotlib to plot plain (unstacked) bars and use zorder to control which bar is in front.
Function
from itertools import count
from math import floor
def plt_stack_spacing(df, figsize=(10,6), width=0.2, bb_anchor=(1.05,1)):
    ycol = df.columns[2:]
    df1 = df.iloc[:,[0,1]]
    # cumulative sums give the top of each stacked segment
    df1 = df1.join(df[ycol].cumsum(axis=1))
    c = count(0,-1) # either itertools.count or a manually decremented counter
    plt.figure(figsize=figsize)
    for col in ycol:
        # shorter bars get a higher zorder, so they are drawn in front
        plt.bar(df1.x,df1[col],label=col,width=width,zorder=next(c))
    xmin = floor(df.x.min())
    xmax = floor(df.x.max())
    xt = [*range(xmin,xmax+2)]
    plt.xticks(xt)
    plt.legend(bbox_to_anchor=bb_anchor, loc=2)
    plt.show()
Calling the function:
plt_stack_spacing(df,(18,5),0.2,(1.01,1))
Output:
Benchmark: Timing of 100 plots of 300 rows and 4 columns (y1,y2,y3,y4) = 225 secs = 3.75 min without enhancements.
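For comparison, here is a hedged sketch of a third route: matplotlib's own `Axes.bar` accepts numeric x positions directly, and its `bottom` argument can carry the running total, so neither filler rows nor zorder tricks are needed. The bar width of 0.2 and the tick spacing of 2 are assumptions taken from the examples above, not requirements.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fid": ["name"] * 5,
    "x": [1.02, 1.3, 2, 5, 10.6],
    "y1": [0, 1, 0.2, 0.6, 0.1],
    "y2": [0.3, 0, 0.1, 0.1, 0.4],
    "y3": [0.7, 0, 0.7, 0.3, 0.5],
})

ycols = ["y1", "y2", "y3"]
width = 0.2  # assumed bar width in x-axis units

fig, ax = plt.subplots(figsize=(10, 3))
bottom = np.zeros(len(df))
for col in ycols:
    # Each segment starts where the previous one ended.
    ax.bar(df["x"], df[col], width=width, bottom=bottom, label=col)
    bottom = bottom + df[col].to_numpy()

ax.set_xticks(np.arange(0, 13, 2))  # regular tick interval on a true number line
ax.legend(bbox_to_anchor=(1.05, 1), loc=2)
```

Since the x axis is numeric here, the gap between 5.0 and 10.6 is preserved without any reindexing.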

A better way for sorting and arranging specific mesh data using DataFrames

I'm currently using a specific FEM software. The post-processing tool is quite outdated, and it can run only on a dedicated machine. I want to visualize some of the results on my own laptop (for better presentation), using the result files the software produces. I'm using the Pandas library with Python.
I was able to get to the point where I have two different DataFrames, one with the element ID and the nodes that construct it, and the second with nodes ID, and x,y coordinates -
elementDF - includes {index, element ID, node1, node2, node3} # elements have 3 nodes
coordsDF - includes {index, node ID, x, y}
and I was able to combine the two into a single DataFrame -
df - includes {index, element ID, x1, y1, x2, y2, x3, y3} # where x1 and y1 are the coordinates of node1, etc
I will later use this DataFrame to build polygons and visualize the mesh.
The thing is, I believe I used a very costly loop to search for each node by its ID, extract the x & y coordinates, and then combine everything. I know this because the dedicated post-processing program does that in a few seconds (for a large mesh - 10,000 elements or more) and mine takes around 40~60 seconds for the same number of elements. I would like to know if there is a quicker and more efficient way to construct the final DataFrame.
Sample input DataFrames:
elementDF = pd.DataFrame({
'element': [1,2,3,4,5,6,7,8,9,10],
'node1': [2,33,33,32,183,183,183,185,185,36],
'node2': [34,34,183,183,34,35,185,35,36,37],
'node3': [33,183,32,184,35,185,186,36,187,187]
})
coordsDF = pd.DataFrame({
'node': [2,32,33,34,35,36,37,183,184,185,186,187],
'x': [-1, 1, 1, -1, -1.1, 1.1, 1.1, -1.1, -1.1, 1.1, 2, 2.2],
'y': [0,0,2,2,-0.2,-0.2,0,0,2,2, 4, 4.4]
})
Sample code:
import pandas as pd
def extractXY(nodeNumber,df):
    # extract x,y data from node location
    nodeData = df.loc[df['node'] == nodeNumber]
    x = nodeData.x
    y = nodeData.y
    return x, y
#main#
df = pd.DataFrame(columns = ['x1','y1','x2','y2','x3','y3'])
for i in range(len(elementDF)):
    nodeNumber1 = elementDF.loc[i].node1
    x1, y1 = extractXY(nodeNumber1, coordsDF)
    nodeNumber2 = elementDF.loc[i].node2
    x2, y2 = extractXY(nodeNumber2, coordsDF)
    nodeNumber3 = elementDF.loc[i].node3
    x3, y3 = extractXY(nodeNumber3, coordsDF)
    df = df.append({'x1': float(x1), 'y1': float(y1),
                    'x2': float(x2), 'y2': float(y2) ,
                    'x3': float(x3), 'y3': float(y3)}, ignore_index = True)
df = pd.concat([elementDF['element'],df], axis = 1)
Let's try this:
import pandas as pd
elementDF = pd.DataFrame({
'element': [1,2,3,4,5,6,7,8,9,10],
'node1': [2,33,33,32,183,183,183,185,185,36],
'node2': [34,34,183,183,34,35,185,35,36,37],
'node3': [33,183,32,184,35,185,186,36,187,187]
})
coordsDF = pd.DataFrame({
'node': [2,32,33,34,35,36,37,183,184,185,186,187],
'x': [-1, 1, 1, -1, -1.1, 1.1, 1.1, -1.1, -1.1, 1.1, 2, 2.2],
'y': [0,0,2,2,-0.2,-0.2,0,0,2,2, 4, 4.4]
})
mapx = coordsDF.set_index('node')['x']
mapy = coordsDF.set_index('node')['y']
df = pd.concat([
elementDF.set_index('element').replace(mapx).rename(columns=lambda x: x.replace('node','x')),
elementDF.set_index('element').replace(mapy).rename(columns=lambda y: y.replace('node','y')),
],
axis=1)
df
Output:
x1 x2 x3 y1 y2 y3
element
1 -1.0 -1.0 1.0 0.0 2.0 2.0
2 1.0 -1.0 -1.1 2.0 2.0 0.0
3 1.0 -1.1 1.0 2.0 0.0 0.0
4 1.0 -1.1 -1.1 0.0 0.0 2.0
5 -1.1 -1.0 -1.1 0.0 2.0 -0.2
6 -1.1 -1.1 1.1 0.0 -0.2 2.0
7 -1.1 1.1 2.0 0.0 2.0 4.0
8 1.1 -1.1 1.1 2.0 -0.2 -0.2
9 1.1 1.1 2.2 2.0 -0.2 4.4
10 1.1 1.1 2.2 -0.2 0.0 4.4
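A variant of the same lookup, sketched here with `Series.map` instead of `replace`, may be easier to reason about because each node column is translated explicitly and only by node ID. The column order (x1, y1, x2, y2, ...) and the `parts` dictionary are choices made for this illustration.

```python
import pandas as pd

elementDF = pd.DataFrame({
    'element': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'node1': [2, 33, 33, 32, 183, 183, 183, 185, 185, 36],
    'node2': [34, 34, 183, 183, 34, 35, 185, 35, 36, 37],
    'node3': [33, 183, 32, 184, 35, 185, 186, 36, 187, 187],
})
coordsDF = pd.DataFrame({
    'node': [2, 32, 33, 34, 35, 36, 37, 183, 184, 185, 186, 187],
    'x': [-1, 1, 1, -1, -1.1, 1.1, 1.1, -1.1, -1.1, 1.1, 2, 2.2],
    'y': [0, 0, 2, 2, -0.2, -0.2, 0, 0, 2, 2, 4, 4.4],
})

# Lookup table indexed by node ID.
coords = coordsDF.set_index('node')

parts = {'element': elementDF['element']}
for i in (1, 2, 3):
    nodes = elementDF[f'node{i}']
    # Vectorized lookup: every node ID is replaced by its coordinate.
    parts[f'x{i}'] = nodes.map(coords['x'])
    parts[f'y{i}'] = nodes.map(coords['y'])

df = pd.DataFrame(parts)
```

Any node ID that is missing from coordsDF shows up as NaN, which makes incomplete input easy to spot.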

Seaborn scatter plot from pandas dataframe colours based on third column

I have a pandas dataframe, with columns 'groupname', 'result', and 'temperature'. I've plotted a Seaborn swarmplot, where x='groupname' and y='result', which shows the results data separated into the groups.
What I also want to do is to colour the markers according to their temperature, using a colormap, so that for example the coldest are blue and hottest red.
Plotting the chart is very simple:
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
data = {'groupname': ['G0', 'G0', 'G0', 'G0', 'G1', 'G1', 'G1'], 'shot': [1, 2, 3, 4, 1, 2, 3], 'temperature': [20, 25, 35, 10, -20, -17, -6], 'result': [10.0, 10.1, 10.5, 15.0, 15.1, 13.5, 10.5]}
results = pd.DataFrame(data)
groupname shot temperature result
0 G0 1 20 10.0
1 G0 2 25 10.1
2 G0 3 35 10.5
3 G0 4 10 15.0
4 G1 1 -20 15.1
5 G1 2 -17 13.5
6 G1 3 -6 10.5
plt.figure()
sns.stripplot(data=results, x="groupname", y="result")
plt.show()
But now I'm stuck trying to colour the points, I've tried a few things like:
sns.stripplot(data=results, x="groupname", y="result", cmap=matplotlib.cm.get_cmap('Spectral'))
which doesn't seem to do anything.
Also tried:
sns.stripplot(data=results, x="groupname", y="result", hue='temperature')
which does colour the points depending on the temperature, however the colours are random rather than mapped.
I feel like there is probably a very simple way to do this, but haven't been able to find any examples.
Ideally looking for something like:
sns.stripplot(data=results, x="groupname", y="result", colorscale='temperature')
The keyword you are looking for is "palette".
The following should work:
sns.stripplot(data=results, x="groupname", y="result", hue='temperature',palette="vlag")
http://man.hubwiz.com/docset/Seaborn.docset/Contents/Resources/Documents/generated/seaborn.stripplot.html
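If a continuous colormap with a colorbar is wanted, one possible fallback is plain matplotlib. The sketch below assumes that jitter is not needed, that the "coolwarm" colormap is acceptable (blue for cold, red for hot), and it encodes the group names as integer positions with `pd.factorize`; none of this is part of the seaborn API.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

data = {'groupname': ['G0', 'G0', 'G0', 'G0', 'G1', 'G1', 'G1'],
        'shot': [1, 2, 3, 4, 1, 2, 3],
        'temperature': [20, 25, 35, 10, -20, -17, -6],
        'result': [10.0, 10.1, 10.5, 15.0, 15.1, 13.5, 10.5]}
results = pd.DataFrame(data)

# Turn the group names into integer x positions 0, 1, ...
codes, labels = pd.factorize(results['groupname'])

fig, ax = plt.subplots()
points = ax.scatter(codes, results['result'],
                    c=results['temperature'], cmap='coolwarm')
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_xlabel('groupname')
ax.set_ylabel('result')
fig.colorbar(points, ax=ax, label='temperature')
```

The colorbar makes the temperature scale explicit, which a legend with one entry per distinct value does not.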

Frequency mean calculation for an arbitrary distribution in pandas

I have a large dataset with values ranging from 1 to 25 with a resolution of 0.1. The distribution is arbitrary, with a mode of 1. A sample of the dataset looks like this:
1,
1,
23.05,
19.57,
1,
1.56,
1,
23.53,
19.74,
7.07,
1,
22.85,
1,
1,
7.78,
16.89,
12.75,
15.32,
7.7,
14.26,
15.41,
1,
16.34,
8.57,
15,
14.97,
1.18,
14.15,
1.94,
14.61,
1,
15.49,
1,
9.18,
1.71,
1,
10.4,
How can I count the values in different ranges (0-0.5, 0.5-1, etc.) and compute their frequency mean in pandas?
The expected output could look like this:
values: 1, 2.2, 2.8, 3.7, 5.5, 5.8, 4.3, 2.7, 3.5, 1.8, 5.9
ranges (f)    occurrence (n)    f*n
1-2           2                 3
2-3           3                 7.5
3-4           2                 7
4-5           1                 4.5
5-6           3                 16.5
sum           11                38.5
frequency mean = 38.5 / 11 = 3.5
Use cut for binning, then convert the binned column to an IntervalIndex to get the mid values, multiply the counts by the mids with mul, sum the columns, and finally divide the two scalars:
df = pd.DataFrame({'col':[1,2.2,2.8,3.7,5.5,5.8,4.3,2.7,3.5,1.8,5.9]})
print (df)
col
0 1.0
1 2.2
2 2.8
3 3.7
4 5.5
5 5.8
6 4.3
7 2.7
8 3.5
9 1.8
10 5.9
binned = pd.cut(df['col'], np.arange(1, 7), include_lowest=True)
df1 = df.groupby(binned).size().reset_index(name='val')
df1['mid'] = pd.IntervalIndex(df1['col']).mid
df1['mul'] = df1['val'].mul(df1['mid'])
print (df1)
col val mid mul
0 (0.999, 2.0] 2 1.4995 2.999
1 (2.0, 3.0] 3 2.5000 7.500
2 (3.0, 4.0] 2 3.5000 7.000
3 (4.0, 5.0] 1 4.5000 4.500
4 (5.0, 6.0] 3 5.5000 16.500
a = df1[['val', 'mid', 'mul']].sum()  # skip the interval column, which cannot be summed
print (a)
val 11.0000
mid 17.4995
mul 38.4990
dtype: float64
b = a['mul'] / a['val']
print (b)
3.49990909091
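The same calculation can be sketched with NumPy alone, which may be convenient for a large dataset. Note the assumption: `np.histogram` uses left-closed bins ([1, 2), [2, 3), ..., [5, 6]) rather than the right-closed intervals of `pd.cut`, so a value sitting exactly on an edge could be counted in a different bin. For this sample the counts are the same.

```python
import numpy as np

values = np.array([1, 2.2, 2.8, 3.7, 5.5, 5.8, 4.3, 2.7, 3.5, 1.8, 5.9])

edges = np.arange(1, 7)                  # 1, 2, ..., 6
counts, _ = np.histogram(values, bins=edges)
mids = (edges[:-1] + edges[1:]) / 2      # 1.5, 2.5, ..., 5.5

# Frequency mean = sum(mid * count) / sum(count)
freq_mean = np.average(mids, weights=counts)
```

Because the bin mids are exactly 1.5, 2.5 and so on, this reproduces the 38.5 / 11 from the expected output rather than the 38.499 / 11 caused by the 0.999 lower edge.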

p_value is 0 when I use scipy.stats.kstest() for large dataset

I have a series of unique values with their frequencies and want to know whether they come from a normal distribution, so I ran a Kolmogorov–Smirnov test using scipy.stats.kstest. Since, to my knowledge, the function only takes a list, I expanded the frequencies into a list before passing it to the function. However, the result is odd: pvalue=0.0.
The histogram of the original data and my code are as follows:
Histogram of my dataset
[In]: frequencies = mp[['c','v']]
[In]: print frequencies
c v
31 3475.8 18.0
30 3475.6 12.0
29 3475.4 13.0
28 3475.2 8.0
20 3475.0 49.0
14 3474.8 69.0
13 3474.6 79.0
12 3474.4 78.0
11 3474.2 78.0
7 3474.0 151.0
6 3473.8 157.0
5 3473.6 129.0
2 3473.4 149.0
1 3473.2 162.0
0 3473.0 179.0
3 3472.8 145.0
4 3472.6 139.0
8 3472.4 95.0
9 3472.2 103.0
10 3472.0 125.0
15 3471.8 56.0
16 3471.6 75.0
17 3471.4 70.0
18 3471.2 70.0
19 3471.0 57.0
21 3470.8 36.0
22 3470.6 22.0
23 3470.4 20.0
24 3470.2 12.0
25 3470.0 23.0
26 3469.8 13.0
27 3469.6 17.0
32 3469.4 6.0
[In]: testData = map(lambda x: np.repeat(x[0], int(x[1])), frequencies.values)
[In]: testData = list(itertools.chain.from_iterable(testData))
[In]: print len(testData)
2415
[In]: print np.unique(testData)
[ 3469.4 3469.6 3469.8 3470. 3470.2 3470.4 3470.6 3470.8 3471.
3471.2 3471.4 3471.6 3471.8 3472. 3472.2 3472.4 3472.6 3472.8
3473. 3473.2 3473.4 3473.6 3473.8 3474. 3474.2 3474.4 3474.6
3474.8 3475. 3475.2 3475.4 3475.6 3475.8]
[In]: scs.kstest(testData, 'norm')
KstestResult(statistic=1.0, pvalue=0.0)
Thanks in advance, everyone.
Using 'norm' for your input will check if the distribution of your data is the same as scipy.stats.norm.cdf with default parameters: loc=0, scale=1.
Instead, you will need to fit a normal distribution to your data and then check if the data and the distribution are the same using the Kolmogorov–Smirnov test.
import numpy as np
from scipy.stats import norm, kstest
import matplotlib.pyplot as plt
freqs = [[3475.8, 18.0], [3475.6, 12.0], [3475.4, 13.0], [3475.2, 8.0], [3475.0, 49.0],
[3474.8, 69.0], [3474.6, 79.0], [3474.4, 78.0], [3474.2, 78.0], [3474.0, 151.0],
[3473.8, 157.0], [3473.6, 129.0], [3473.4, 149.0], [3473.2, 162.0], [3473.0, 179.0],
[3472.8, 145.0], [3472.6, 139.0], [3472.4, 95.0], [3472.2, 103.0], [3472.0, 125.0],
[3471.8, 56.0], [3471.6, 75.0], [3471.4, 70.0], [3471.2, 70.0], [3471.0, 57.0],
[3470.8, 36.0], [3470.6, 22.0], [3470.4, 20.0], [3470.2, 12.0], [3470.0, 23.0],
[3469.8, 13.0], [3469.6, 17.0], [3469.4, 6.0]]
data = np.hstack([np.repeat(x,int(f)) for x,f in freqs])
loc, scale = norm.fit(data)
# create a normal distribution with loc and scale
n = norm(loc=loc, scale=scale)
Plot the fit of the norm to the data:
plt.hist(data, bins=np.arange(data.min(), data.max()+0.2, 0.2), rwidth=0.5)
x = np.arange(data.min(), data.max()+0.2, 0.2)
plt.plot(x, 350*n.pdf(x))
plt.show()
This is not a terribly good fit, mostly due to the long tail on the left. However, you can now run a proper Kolmogorov–Smirnov test using the cdf of the fitted normal distribution:
kstest(data, n.cdf)
# returns:
KstestResult(statistic=0.071276854859734784, pvalue=4.0967451653273201e-11)
So we are still rejecting the null hypothesis of the distribution that produced the data being the same as the fitted distribution.
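To illustrate why the original call returned statistic=1.0, here is a small sketch on synthetic data; the seed, mean and standard deviation are made up and only stand in for the real measurements. It also shows that passing the fitted parameters through `args` is equivalent to passing the frozen distribution's cdf. One caveat: because loc and scale are estimated from the same sample, the reported p-value is only approximate (this is the situation the Lilliefors correction addresses).

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(0)
# Synthetic stand-in for the data: values around 3473, far from 0.
data = rng.normal(loc=3473.0, scale=1.4, size=2000)

# Default 'norm' means N(0, 1): its cdf is 1.0 for every sample,
# so the KS statistic is 1.0 and the p-value collapses to 0.
default = kstest(data, "norm")

# Fit first, then test against the fitted distribution.
loc, scale = norm.fit(data)
fitted = kstest(data, "norm", args=(loc, scale))
same = kstest(data, norm(loc=loc, scale=scale).cdf)
```

Comparing `default.statistic` with `fitted.statistic` shows that the huge statistic in the question comes from the mismatch in location and scale, not from the shape of the data.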
