Visualizing a large number of group combinations using pandas - python

I have a data frame with a structure similar to
dt, ing_net, egs_net, ing_ip, egs_ip, avg_pkt, sum_time
2017-01-01, A2, A1, 10.100.0.0, 22.54.23.0, 12.1, 123
2017-01-01, B2, A1, 10.100.1.0, 22.54.23.0, 12.1, 982
2017-01-01, B2, A2, 10.0.1.0, 22.54.13.0, 92.1, 692
...
2017-06-30, A2, B8, 65.200.0.0, 33.0.23.0, 12.7, 99887
and the possible number of combinations between ing_net and egs_net is 250. How can I visualize one of the value variables, say avg_pkt, for all possible combinations? I'm looking for a visual aid or approach to search for outliers.
Seaborn FacetGrid cannot plot all graphs:
g = sns.FacetGrid(df, row='egs_net', col='ing_net')
g.map(sns.distplot, 'avg_pkt')
# all plots are extremely small and with wrong x datetime axis
Doing a groupby on the pandas dataframe generates all graphs, but not in a grid:
for name, group in df_avg.groupby(['egs_net', 'ing_net']):
    group.plot(x='dt', y='avg_pkt', title='{} - {}'.format(name[0], name[1]),
               figsize=(7, 5), subplots=True)
and a HoloViews HoloMap chokes on the number of intermediate graphs:
hv_df = hv.Dataset(df.reset_index(), kdims=['dt', 'egs_net', 'ing_net'])
hv_df.to(hv.Curve, vdims=['avg_pkt'])
How can I explore this phase space?
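One sketch of an alternative (not tried in the post above, and assuming the dataframe layout from the sample): instead of 250 separate plots, collapse each (egs_net, ing_net) pair to a single robust statistic and scan the whole grid as a heatmap; cells that stand out are outlier candidates to drill into individually.
import seaborn as sns

# One cell per (egs_net, ing_net) combination. The median is robust,
# so combinations that still stand out are worth inspecting further.
pivot = df.pivot_table(index='egs_net', columns='ing_net',
                       values='avg_pkt', aggfunc='median')
sns.heatmap(pivot, cmap='viridis')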

Related

How to use linear regression to create a "calibration curve" class in Python for experimental activities?

I'm a newbie in Python and would like to create a module defining a class named "calibration_curve" to help save time during my lab experimental activities.
My goal is to simply load my measurements into Python with pd.read_excel and obtain a "calibration curve" resulting from the linear regression of my data points.
My measurements are typically in this form: usually I have two different measurements of different things (A and B), in duplicate (A1, A2; B1, B2) or in triplicate (A1, A2, A3; B1, B2, B3).
Time    Measurement A1    Measurement A2    Measurement B1    Measurement B2
0       2.451             2.480             3.01              3.01
1       2.102             2.09              3.31              3.02
2       1.850             1.844             3.2               2.9
3       1.200             NaN               3.4               3.2
4       0.999             1.001             2.9               3.01
I typically have to make some calculations on these data, like the ratio between measurement B1 and measurement A1, the ratio between measurement B2 and measurement A2, and then the mean of these ratios, etc. I usually obtain something like this:
Final result image
Then I need to calculate the linear regression of mean vs. time and find slope, intercept, rsquared, etc.
I would like to store all of these values as attributes of an object so I can access them when I need them. This is how it should work:
calibration_curve_instrument1 = calibration_curve("datapoints.xlsx", intercept=0)
print(calibration_curve_instrument1.intercept) --> 0
print(calibration_curve_instrument1.slope) --> 0.44
print(calibration_curve_instrument1.rsquared) --> 0.98
I wrote some code, but I don't know how to overcome these issues:
I don't know how to force the intercept to equal 0 (sometimes I need to make this assumption);
how to make it "smart" (for instance, not raising an error if there are 3 measurement columns instead of 2, namely A1, A2, A3; B1, B2, B3);
how to avoid referring to column names in the code, so it adapts to different situations (for instance, when the first column is named "velocity" instead of "time").
I've seen that there are many libraries to solve these problems but, being new to Python, I'm wondering which is best in this situation: sklearn.linear_model, numpy.polyfit, or scipy.stats.linregress?
I tried to define a class called calibration_curve that reads .xlsx data points and sets intercept, slope, and rsquared values as attributes.
import numpy as np
import pandas as pd
import scipy.stats

# row[i] is positional: Time=0, A1=1, A2=2, B1=3, B2=4
def ratio_A(row):
    return row[3] / row[1]

def ratio_B(row):
    return row[4] / row[2]

def ratio_mean(row):
    return np.nanmean([row[5], row[6]])

class calibration_curve():
    def __init__(self, xlsx):
        self.raw = pd.read_excel(xlsx)
        self.input = self.raw.copy()
        self.raw["ratio_A"] = self.raw.apply(ratio_A, axis="columns")
        self.raw["ratio_B"] = self.raw.apply(ratio_B, axis="columns")
        self.raw["mean"] = self.raw.apply(ratio_mean, axis="columns")
        # linregress(x, y): x is Time, y is the mean ratio
        # (note: the value unpacked into rsquared is actually r, not r squared)
        self.slope, self.intercept, self.rsquared, self.p, self.std_err = \
            scipy.stats.linregress(self.raw["Time"], self.raw["mean"])
It worked but, as I said, I can't force intercept = 0 with scipy.stats.linregress, and I'm pretty sure there are better ways to solve this problem than this one. For instance, if the dataset is in triplicate (A1, A2, A3; B1, B2, B3), this wouldn't work, because I defined the functions "ratio_A" and "ratio_B" using fixed column indices. I also had to refer to the column names (self.raw["mean"], self.raw["Time"]), so if the first column has a different name (like "velocity") this wouldn't work either.
I hope I was clear. Any suggestions for tackling this kind of problem are appreciated! Thanks a lot.
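A minimal sketch of one way to address all three points at once. It assumes (my assumption, based on the sample table) that replicate column names end in A<digit> / B<digit> and that the x variable is always the first column, whatever its name. Plain numpy is enough here: np.polyfit handles the ordinary fit, and np.linalg.lstsq makes the intercept = 0 case easy, since fitting through the origin is just solving y ≈ slope·x.
import numpy as np
import pandas as pd

class calibration_curve:
    def __init__(self, xlsx, intercept=None):
        self.raw = pd.read_excel(xlsx)
        x = self.raw.iloc[:, 0]              # first column, whatever its name
        a = self.raw.filter(regex=r'A\d+$')  # every A replicate (A1, A2, A3, ...)
        b = self.raw.filter(regex=r'B\d+$')  # every B replicate, paired positionally
        y = np.nanmean(b.to_numpy() / a.to_numpy(), axis=1)  # mean of B/A ratios
        mask = ~np.isnan(y)                  # drop rows where every ratio is NaN
        x, y = x.to_numpy(float)[mask], y[mask]
        if intercept == 0:
            # least squares through the origin: solve y ~ slope * x
            self.slope = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]
            self.intercept = 0.0
        else:
            self.slope, self.intercept = np.polyfit(x, y, 1)
        pred = self.slope * x + self.intercept
        self.rsquared = 1 - np.sum((y - pred) ** 2) / np.sum((y - np.mean(y)) ** 2)

# usage, matching the example above:
# cc = calibration_curve("datapoints.xlsx", intercept=0)
# print(cc.intercept, cc.slope, cc.rsquared)
Because the replicates are selected by a name pattern rather than a fixed index, duplicates and triplicates both work, and the x column never needs to be called "Time".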

How to create a bar plot with the same column from multiple dataframes

Here is how the graph currently looks. I want it to have just 3 bars (s1, s2, and s9) on the x-axis and microns on the y-axis, but for some reason there are 5 separate colors for the bars and the x-axis is just an extension of the y-axis.
Here is the code:
If you are looking for just one bar for each of the columns s1, s2, s9, with the sum of all values in each column as the height of the bar, you should be using barplot() instead. Get the sum of each column using sum() and then plot it. The code below shows this for some random data. sum() computes the total for each column, and reset_index() turns the column names into an index column. You can try print(box_data.sum().reset_index()) to see what the summed data looks like.
Code
import numpy as np
import pandas as pd
import seaborn as sns

data = {'s1': np.random.randint(20, 160, size=100),
        's2': np.random.randint(16, 80, size=100),
        's9': np.random.randint(60, 170, size=100)}
box_data = pd.DataFrame(data)
sns.barplot(data=box_data.sum().reset_index(), y=0, x='index')
Plot
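As a side note (not in the original answer, and assuming matplotlib is available as the plotting backend), pandas can draw the same chart directly from the summed Series:
box_data.sum().plot(kind='bar')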

Python: How to plot a conditional cumulative frequency histogram?

I have this list of data for which I would like to plot a histogram chart. However, the graph is not very readable for large values on the X axis, and those values are not really important to keep.
Here is a sub sample of my data:
print(v)
1 1738 # the values I want to plot on the histogram
2 2200
3 1338
4 1222
5 939
6 898
I calculated the cumulative frequency as follows:
v = x.cumsum()
t = [round(100*v/x.sum(),2)]
t
and the output is:
1 9.90
2 22.44
3 30.06
4 37.02
5 42.37
How can I represent on the histogram only the data for which the cumulative frequency is less than or equal to 40%?
I don't know how to do this in Python. Thank you in advance for your help.
The short answer is: Slice the numpy array to filter values <= 40%. For example, if a is a 1D numpy array:
a[a <= 40]
A longer answer is provided by the example below, which shows:
A generation of normally distributed random data (as the provided dataset is very small)
Performing your calculation on the numpy array
Slicing the array to return values which are <= 40%
Plotting the results using the Plotly library - API only.
Example code:
import numpy as np
import plotly.io as pio

# Generate random dataset (for demo only).
np.random.seed(1)
X = np.random.normal(0, 1, 10000)

# Calculate the cumulative frequency.
X_ = np.cumsum(X) * 100 / X.sum()
data = X_[X_ <= 40]

# Plot the histogram.
pio.show({'data': {'x': data,
                   'type': 'histogram',
                   'marker': {'line': {'width': 0.5}}},
          'layout': {'title': 'Cumulative Frequency Demo'}})
Output:
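For reference, a minimal matplotlib sketch of the same slice-then-plot step (my addition; the original answer uses Plotly only):
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
X = np.random.normal(0, 1, 10000)

# Same calculation and slicing as above.
X_ = np.cumsum(X) * 100 / X.sum()
data = X_[X_ <= 40]

plt.hist(data, bins=50, edgecolor='black')
plt.title('Cumulative Frequency Demo')
plt.show()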

Overlapping coefficient using scipy quad not working as expected

I'm trying to find the overlapping area of two scipy.stats.skewnorm distributions, which I have generated using the code below:
import scipy.integrate
from scipy.stats import skewnorm

a1 = 1
loc1 = 0
scale1 = 1
a2 = 2
loc2 = 0
scale2 = 1
print(scipy.integrate.quad(
    lambda x: min(skewnorm.pdf(x, a1, loc=loc1, scale=scale1),
                  skewnorm.pdf(x, a2, loc=loc2, scale=scale2)),
    -10, 10))
output: (0.8975836176504333, 8.065277615563445e-10)
However, changing the limits significantly affects my results:
print(scipy.integrate.quad(
    lambda x: min(skewnorm.pdf(x, a1, loc=loc1, scale=scale1),
                  skewnorm.pdf(x, a2, loc=loc2, scale=scale2)),
    -1, 1))
output: (0.341344746068543, 3.789687964201238e-15)
How can I determine the limits to be used?
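One possible way to pick the limits (a sketch, not from the original post): quad accepts infinite limits directly, so you can integrate over the whole real line; alternatively, derive finite bounds from the distributions' quantiles via ppf so that essentially all of both densities is covered.
import numpy as np
import scipy.integrate
from scipy.stats import skewnorm

def overlap_pdf(x):
    return min(skewnorm.pdf(x, 1, loc=0, scale=1),
               skewnorm.pdf(x, 2, loc=0, scale=1))

# quad handles infinite intervals itself:
print(scipy.integrate.quad(overlap_pdf, -np.inf, np.inf))

# or choose finite limits wide enough to cover both distributions:
lo = min(skewnorm.ppf(1e-9, 1), skewnorm.ppf(1e-9, 2))
hi = max(skewnorm.ppf(1 - 1e-9, 1), skewnorm.ppf(1 - 1e-9, 2))
print(scipy.integrate.quad(overlap_pdf, lo, hi))
The [-1, 1] result is smaller simply because a large part of the overlap area lies outside that interval.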

How to Distinguish the 22 Variables in the Stacked Bar Graph?

I have a pandas dataframe that contains 75 samples (rows) and measures of 22 different human cell types (columns), like the following:
import pandas as pd
import numpy as np
import plotly.plotly as py
import cufflinks as cf
cf.set_config_file(offline=False, world_readable=True, theme='ggplot')
#pandas dataframe:
df = pd.DataFrame(
    np.random.rand(75, 22),
    columns=['B cells naive', 'B cells memory', 'Plasma cells', 'T cells CD8',
             'T cells CD4 naive', 'T cells CD4 memory resting',
             'T cells CD4 memory activated', 'T cells follicular helper',
             'T cells regulatory (Tregs)', 'T cells gamma delta',
             'NK cells resting', 'NK cells activated', 'Monocytes',
             'Macrophages M0', 'Macrophages M1', 'Macrophages M2',
             'Dendritic cells resting', 'Dendritic cells activated',
             'Mast cells resting', 'Mast cells activated',
             'Eosinophils', 'Neutrophils']
)
df.iplot(kind="bar", barmode="stack")
I want to visualize how much of each of the 22 cell types is present in each of the 75 samples. So I thought of using a stacked bar graph where each bar represents a sample and the stacks show the estimated amount of each cell type found in the sample. Here is the graph generated using Plotly:
Problem: the cell types represented in the stacked bar graph do not have unique, distinguishable colors. For example, three cell types share the same color (red), and this makes the graph pointless because we cannot visualize the frequency of each cell type per sample.
Question: what are some possible ways to distinguish the cell types for each sample? ... Any ideas on what sets of colors and/or patterns could be used to solve this problem?
Do you think there are other ways to visualize, other than a stacked bar graph?
Thanks!
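One possibility (a sketch, not from the original post, using Plotly Express rather than cufflinks): pick a qualitative palette with at least 22 distinct colors, such as px.colors.qualitative.Alphabet (26 colors), and pass it explicitly.
import numpy as np
import pandas as pd
import plotly.express as px

# Hypothetical stand-in columns; replace with the 22 real cell-type names.
cell_types = ['cell type %02d' % i for i in range(22)]
df = pd.DataFrame(np.random.rand(75, 22), columns=cell_types)

# Long format: one row per (sample, cell type) pair.
long_df = df.reset_index().melt(id_vars='index',
                                var_name='cell type', value_name='amount')

# px.bar stacks the bars by default when a color column is given; the
# Alphabet palette has 26 distinguishable colors, enough for 22 types.
fig = px.bar(long_df, x='index', y='amount', color='cell type',
             color_discrete_sequence=px.colors.qualitative.Alphabet)
fig.show()
Beyond color, an alternative is to drop the stacked bars entirely and use a heatmap (samples on one axis, cell types on the other), which avoids the 22-color problem altogether.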
