By group, plot highest quantile data vs lowest, and capture stats - python

I wish to group a dataset by "assay", then compare intensities for small cells versus large cells. The problem is that my code only selects the top and bottom cellArea quantiles of the entire DataFrame, rather than of each individual assay ('wt' and 'cnt').
As a final point, I would like to compare the mean intensities of the two groups for each assay type...
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = DataFrame({'assay':['cnt']*10+['wt']*10,
                'image':['001']*10+['002']*5+['001']*5,
                'roi':['1']*5+['2']*5+['3']*5+['1']*5,
                'cellArea':[99,90,50,2,30,65,95,30,56,5,33,18,98,76,56,72,12,5,47,89],
                'intensity':[88,34,1,50,2,67,88,77,73,3,2,67,37,34,12,45,23,82,12,1]},
               columns=['assay','image','roi','cellArea','intensity'])
df.loc[(df['cellArea'] < df['cellArea'].quantile(.20)),'group'] = 'Small_CellArea'
df.loc[(df['cellArea'] > df['cellArea'].quantile(.80)),'group'] = 'Large_CellArea'
df = df.reset_index(drop=True)
sns.violinplot(data=df, y='intensity', x='assay', hue='group', palette="Set3", inner='quartile', split=True, cut=0)
plt.ylim(-20,105)
plt.legend(loc='center', bbox_to_anchor=(0.5, 0.08), ncol=3, frameon=True, fancybox=True, shadow=True, fontsize=12)

Calculate the low and high quantiles by group, merge them back into the original data frame, and from there assign the group variable as small or large:
import pandas as pd
quantileLow = df.groupby('assay').cellArea.quantile(0.2).reset_index()
quantileHigh = df.groupby('assay').cellArea.quantile(0.8).reset_index()
df = pd.merge(df, pd.merge(quantileLow, quantileHigh, on = "assay"), on = "assay")
df.loc[df['cellArea'] < df.cellArea_x,'group'] = 'Small_CellArea'
df.loc[df['cellArea'] > df.cellArea_y,'group'] = 'Large_CellArea'
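For the final point in the question, comparing the mean intensities of the two groups within each assay, a minimal follow-up sketch on the merged df above (the cellArea_x and cellArea_y columns are the per-assay low and high quantiles picked up from the merge) could be:
# mean intensity of Small_CellArea vs Large_CellArea cells, per assay
mean_intensity = (df.dropna(subset=['group'])
                    .groupby(['assay', 'group'])['intensity']
                    .mean()
                    .unstack())
print(mean_intensity)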

Related

seaborn violin plot with frequency and values in separate columns

I have some DataFrame:
import pandas as pd
import numpy as np
import seaborn as sns
np.random.seed(1)
data = {'values': range(0,200,1), 'frequency': np.random.randint(low=0, high=2000, size=200)}
df = pd.DataFrame(data)
I am trying to create a violin plot where the y-axis corresponds to the values column and the width of the violin corresponds to the frequency column.
I can duplicate each row by the value in the frequency column and then call a violin plot:
repeat_df = df.loc[df['values'].repeat(df['frequency'])]
sns.violinplot(y=repeat_df['values'])
This works...except when the resulting duplicated DataFrame has 50+ million rows. What is a better solution when working with large DataFrames?
As suggested in my comment:
Before repeating the frequencies, reduce their resolution to percent level by normalizing them and rounding to an integer range of 0 to 100.
This way you are not losing a significant amount of detail, but the number of repetitions per value is capped at 100.
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
np.random.seed(1)
n_values = 50000
# creating values with sinusoidal frequency modulation (cast to int, since repeat() needs integer counts)
data = {'values': range(0, n_values, 1),
        'frequency': (np.random.randint(low=0, high=2000, size=n_values)
                      * (np.sin(np.arange(n_values)/(n_values/50)) + 2)).astype(int)}
df = pd.DataFrame(data)
# old method: 100 million rows after repeat
repeat_df = df.loc[df['values'].repeat(df['frequency'])]
print(f"Old method: {len(repeat_df)} Observations")
# new method: renormalize and round frequency to reduce repetitions to 100
# resulting in <2 million rows after repeat
df.frequency = np.round(df.frequency / df.frequency.max() * 100).astype(int)  # cast so repeat() gets integer counts
repeat_df = df.loc[df['values'].repeat(df['frequency'])]
print(f"New method: {len(repeat_df)} normalized Observations")
sns.violinplot(y=repeat_df['values'])
plt.show()
If your 50+ million rows stem from the values instead, I would rebin those values accordingly, e.g. to a set of 100 values.
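As a rough sketch of that rebinning idea, continuing from the df built above (the 100-bucket count is an assumption; pick whatever resolution suits your data), you could sum the frequencies per value bucket and then normalize as before:
# bucket 'values' into ~100 bins and sum the frequency per bin
n_bins = 100
df['value_bin'] = pd.cut(df['values'], bins=n_bins).apply(lambda iv: iv.mid).astype(float)
binned = df.groupby('value_bin')['frequency'].sum().reset_index()
# cap the per-bin repetitions at 100, as above
binned['frequency'] = np.round(binned['frequency'] / binned['frequency'].max() * 100).astype(int)
repeat_df = binned.loc[binned.index.repeat(binned['frequency'])]
sns.violinplot(y=repeat_df['value_bin'])
plt.show()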

Python: How to construct a joyplot with values taken from a column in pandas dataframe as y axis

I have a dataframe df in which the column extracted_day consists of dates ranging from 2022-05-08 to 2022-05-12. I have another column named gas_price, which contains the price of the gas. I want to construct a joyplot such that for each date, it shows the gas_price on the y axis and has minutes_elapsed_from_start_of_day on the x axis. We may also use a ridgeplot or any other plot if this doesn't work.
This is the code that I have written, but it doesn't serve my purpose.
from joypy import joyplot
import matplotlib.pyplot as plt
df['extracted_day'] = df['extracted_day'].astype(str)
joyplot(df, by = 'extracted_day', column = 'minutes_elapsed_from_start_of_day',figsize=(14,10))
plt.xlabel("Number of minutes elapsed throughout the day")
plt.show()
Create dataframe with mock data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from joypy import joyplot
np.random.seed(111)
df = pd.DataFrame({
    'minutes_elapsed_from_start_of_day': np.tile(np.arange(1440), 5),
    'extracted_day': np.repeat(['2022-05-08', '2022-05-09', '2022-05-10', '2022-05-11', '2022-05-12'], 1440),
    'gas_price': abs(np.cumsum(np.random.randn(1440*5)))})
Then create the joyplot. It is important that you set kind='values', since you do not want joyplot to show KDEs (kernel density estimates, joyplot's default) but the raw gas_price values:
joyplot(df, by='extracted_day',
        column='gas_price',
        kind='values',
        x_range=np.arange(1440),
        figsize=(7,5))
The resulting joyplot looks like this (the fake gas prices are represented by the y-values of the lines):
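To apply this to the real df from the question (a sketch, assuming one gas_price reading per minute and day, as in the mock data), it may help to first sort the rows by day and by minutes_elapsed_from_start_of_day so they are laid out the same way as the mock data above:
# hypothetical application to the question's real dataframe
df = df.sort_values(['extracted_day', 'minutes_elapsed_from_start_of_day'])
joyplot(df, by='extracted_day',
        column='gas_price',
        kind='values',
        x_range=np.arange(1440),
        figsize=(7, 5))
plt.show()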

Frequency bar chart of a date column in ascending order of dates

So, I have a dataset (some of its first rows are pasted here). My goal is to plot a frequency distribution of the 'sample_date' column. It seemed pretty simple to me at first: just convert the column to datetime, sort the values (dates) in ascending order, and finally plot the bar chart. But the problem is that the bar chart is displayed not in ascending order of dates (which is what I want), but in descending order of the value counts corresponding to those dates.
Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('dataset.csv')
data['sample_date'] = pd.to_datetime(data['sample_date'])
data = data.sort_values(by='sample_date')
data['sample_date'].value_counts().plot(kind='bar')
Here is the dataset.csv:
,sequence_name,sample_date,epi_week,epi_date,lineage
1,England/MILK-1647769/2021,2021-06-07,76,2021-06-06,C.37
2,England/MILK-156082C/2021,2021-05-06,71,2021-05-02,C.37
3,England/CAMC-149B04F/2021,2021-03-30,66,2021-03-28,C.37
4,England/CAMC-13962F4/2021,2021-03-04,62,2021-02-28,C.37
5,England/CAMC-13238EB/2021,2021-02-23,61,2021-02-21,C.37
0,England/PHEC-L304L78C/2021,2021-05-12,72,2021-05-09,B.1.617.3
1,England/MILK-15607D4/2021,2021-05-06,71,2021-05-02,B.1.617.3
2,England/MILK-156C77E/2021,2021-05-05,71,2021-05-02,B.1.617.3
4,England/PHEC-K305K062/2021,2021-04-25,70,2021-04-25,B.1.617.3
5,England/PHEC-K305K080/2021,2021-04-25,70,2021-04-25,B.1.617.3
6,England/ALDP-153351C/2021,2021-04-23,69,2021-04-18,B.1.617.3
7,England/PHEC-30C13B/2021,2021-04-22,69,2021-04-18,B.1.617.3
8,England/PHEC-30AFE8/2021,2021-04-22,69,2021-04-18,B.1.617.3
9,England/PHEC-30A935/2021,2021-04-21,69,2021-04-18,B.1.617.3
10,England/ALDP-152BC6D/2021,2021-04-21,69,2021-04-18,B.1.617.3
11,England/ALDP-15192D9/2021,2021-04-17,68,2021-04-11,B.1.617.3
12,England/ALDP-1511E0A/2021,2021-04-15,68,2021-04-11,B.1.617.3
13,England/PHEC-306896/2021,2021-04-12,68,2021-04-11,B.1.617.3
14,England/PORT-2DFB70/2021,2021-04-06,67,2021-04-04,B.1.617.3
Here is what I get and do not want to get:
(Bar chart for the 'sample_date' column, ordered by descending value counts of the dates)
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('dataset.csv')
data['sample_date'] = pd.to_datetime(data['sample_date'])
data['sample_date'].value_counts().sort_index().plot(kind='bar') # Use sort_index()
plt.tight_layout()
plt.show()
value_counts() has an ascending flag; set it to True and the bars will be plotted in ascending order of the counts, so you don't actually need sort_values() at all.
Check out the value_counts() documentation: https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html
Code:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('dataset.csv')
data['sample_date'] = pd.to_datetime(data['sample_date'])
data['sample_date'].value_counts(ascending=True).plot(kind='bar')
plt.show()

How to remove certain values before plotting data

I'm using Python for the first time. I have a csv file with a few columns of data: location, height, density, day, etc. I am plotting height (i_h100) vs density (i_cd) and have managed to constrain the height to values below 50 with the code below. I now want to constrain the plotted data to a certain 'day' range, say 85-260. I can't work out how to do this.
import pandas
import matplotlib.pyplot as plt
data=pandas.read_csv('data.csv')
data.plot(kind='scatter',x='i_h100',y='i_cd')
plt.xlim(right=50)
Use .loc to subset the data going into the graph.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Make some dummy data
np.random.seed(42)
df = pd.DataFrame({'a': np.random.randint(0, 365, 20),
                   'b': np.random.rand(20),
                   'c': np.random.rand(20)})
# all data: plot of 'b' vs. 'c'
df.plot(kind='scatter', x='b', y='c')
plt.show()
# use .loc to subset data displayed based on value in 'a'
# can also use .loc to restrict values of 'b' displayed rather than plt.xlim
df.loc[df['a'].between(85,260) & (df['b'] < 0.5)].plot(kind='scatter', x='b', y='c')
plt.show()
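Translated to the columns from the original question (a sketch, assuming the CSV really has a 'day' column alongside 'i_h100' and 'i_cd'), the same pattern would look like:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data.csv')
# keep rows with day between 85 and 260 and height below 50, then plot
subset = data.loc[data['day'].between(85, 260) & (data['i_h100'] < 50)]
subset.plot(kind='scatter', x='i_h100', y='i_cd')
plt.show()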

Count Rows in Dictionary of Dataframes

I have a dictionary of dataframes. I am trying to count the rows in each dataframe. For the real data, my code counts just over ten thousand rows for a dataframe that only has a few rows.
I have tried to reproduce the error using dummy data. Unfortunately, the code works fine with the dummy data!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Dataframe
Df = pd.DataFrame(np.random.randint(0,10,size=(100, 4)), columns=list('ABCD'))
# Map
Ma = Df.groupby('D')
# Dictionary of Dataframes
Di = {}
for name, group in Ma:
    Di[str(name)] = group
# Count the Rows in each Dataframe
Li = []
for k in Di:
    Count = Di[k].shape[0]
    Li.append([Count])
# Flatten
Li_1 = []
for sublist in Li:
    for item in sublist:
        Li_1.append(item)
# Histogram
plt.hist(Li_1, bins=10)
plt.xlabel("Rows / Dataframe")
plt.ylabel("Frequency")
fig = plt.gcf()
To get the number of rows corresponding to each category in 'D', you can simply use .size when you do your groupby:
Df.groupby('D').size()
pandas also allows you to directly plot graphs, so your code can be reduced to:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Df = pd.DataFrame(np.random.randint(0,10,size=(100, 4)), columns=list('ABCD'))
Df.groupby('D').size().plot.hist()
plt.xlabel("Rows / Dataframe")
plt.ylabel("Frequency")
fig = plt.gcf()
Assuming the data in column D is a categorical variable, you can get the count for each category using seaborn's countplot.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Dataframe
df = pd.DataFrame(np.random.randint(0,10,size=(100, 4)), columns=list('ABCD'))
# easy count plot in sns
sns.countplot(x='D',data=df)
plt.xlabel("category")
plt.ylabel("frequency")
But if you are looking for a distribution plot rather than a categorical count plot, you can use the following:
# for a distribution plot (histplot replaces the deprecated distplot)
sns.histplot(df['D'], bins=10)
plt.xlabel("Spread")
plt.ylabel("frequency")
But if you want a distribution plot of the group sizes after the groupby (which does not make much sense to me), you can use the following:
# for a distribution plot of the group sizes
sns.histplot(df.groupby('D').size(), bins=10)
plt.xlabel("Spread")
plt.ylabel("frequency")
