I have some DataFrame:
import pandas as pd
import numpy as np
import seaborn as sns
np.random.seed(1)
data = {'values': range(0,200,1), 'frequency': np.random.randint(low=0, high=2000, size=200)}
df = pd.DataFrame(data)
I am trying to create a violin plot where the y-axis corresponds to the values column and the width of the violin corresponds to the frequency column.
I can duplicate each row by the value in the frequency column and then call a violin plot:
repeat_df = df.loc[df['values'].repeat(df['frequency'])]
sns.violinplot(y=repeat_df['values'])
This works...except when the resulting duplicated DataFrame has 50+ million rows. What is a better solution when working with large DataFrames?
As suggested in my comment:
Before repeating the frequencies, reduce their resolution to a percent level, by normalizing and rounding them to an integer range of 0 to 100.
This way, you do not lose a significant amount of detail, but you keep the number of repetitions per row to a maximum of 100.
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
np.random.seed(1)
n_values = 50000
# creating values with sinusoidal frequency modulation
# (cast to int, because repeat counts must be integers)
data = {'values': range(0,n_values,1), 'frequency': (np.random.randint(low=0, high=2000, size=n_values)*(np.sin(np.arange(n_values)/(n_values/50))+2)).astype(int)}
df = pd.DataFrame(data)
# old method: 100 million rows after repeat
repeat_df = df.loc[df['values'].repeat(df['frequency'])]
print(f"Old method: {len(repeat_df)} Observations")
# new method: renormalize and round frequency to reduce repetitions to at most 100 per row
# resulting in <2 million rows after repeat
df['frequency'] = np.round(df['frequency'] / df['frequency'].max() * 100).astype(int)
repeat_df = df.loc[df['values'].repeat(df['frequency'])]
print(f"New method: {len(repeat_df)} normalized Observations")
sns.violinplot(y=repeat_df['values'])
plt.show()
If your 50+ million rows stem from the values instead, I would rebin those values accordingly, e.g. to a set of 100 values.
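The rebinning idea could be sketched as follows. This is a minimal sketch, not a tested drop-in: the bin count `n_bins`, the use of bin centers as representative values, and the names `edges`, `centers`, `binned` and `binned_df` are all assumptions made for illustration. It sums the frequencies inside each of 100 equal-width bins and then applies the same percent-level normalization as above.

```python
import numpy as np
import pandas as pd

# Mock data in the same shape as the question (assumed for illustration)
rng = np.random.default_rng(1)
n_values = 50_000
df = pd.DataFrame({
    'values': np.arange(n_values),
    'frequency': rng.integers(0, 2000, size=n_values),
})

# Rebin 'values' into n_bins equal-width bins (n_bins is an assumed choice)
n_bins = 100
edges = np.linspace(df['values'].min(), df['values'].max(), n_bins + 1)
centers = (edges[:-1] + edges[1:]) / 2

# labels=False returns the integer bin number of every row
bin_idx = pd.cut(df['values'], bins=edges, labels=False, include_lowest=True)

# Sum the frequencies that fall into each bin
binned = df.groupby(bin_idx)['frequency'].sum()
binned_df = pd.DataFrame({
    'values': centers[binned.index.to_numpy()],
    'frequency': binned.to_numpy(),
})

# Same percent-level normalization as above, now on only n_bins rows
binned_df['frequency'] = np.round(
    binned_df['frequency'] / binned_df['frequency'].max() * 100
).astype(int)

# At most n_bins * 100 rows after repeating
repeat_df = binned_df.loc[binned_df.index.repeat(binned_df['frequency'])]
print(f"Rebinned: {len(repeat_df)} rows")
```

`repeat_df['values']` can then be passed to `sns.violinplot(y=...)` as before. The trade-off is that detail finer than one bin width is lost.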
Related
I have a dataframe df in which the column extracted_day consists of dates ranging between 2022-05-08 to 2022-05-12. I have another column named gas_price, which consists of the price of the gas. I want to construct a joyplot such that for each date, it shows the gas_price in the y axis and has minutes_elapsed_from_start_of_day in the x axis. We may also use ridgeplot or any other plot if this doesn't work.
This is the code that I have written, but it doesn't serve my purpose.
from joypy import joyplot
import matplotlib.pyplot as plt
df['extracted_day'] = df['extracted_day'].astype(str)
joyplot(df, by = 'extracted_day', column = 'minutes_elapsed_from_start_of_day',figsize=(14,10))
plt.xlabel("Number of minutes elapsed throughout the day")
plt.show()
Create dataframe with mock data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from joypy import joyplot
np.random.seed(111)
df = pd.DataFrame({
'minutes_elapsed_from_start_of_day': np.tile(np.arange(1440), 5),
'extracted_day': np.repeat(['2022-05-08', '2022-05-09', '2022-05-10','2022-05-11', '2022-05-12'], 1440),
'gas_price': abs(np.cumsum(np.random.randn(1440*5)))})
Then create the joyplot. It is important that you set kind='values', since you do not want joyplot to show KDEs (kernel density estimates, joyplot's default) but the raw gas_price values:
joyplot(df, by='extracted_day',
column='gas_price',
kind='values',
x_range=np.arange(1440),
figsize=(7,5))
In the resulting joyplot, the fake gas prices are represented by the y-values of the lines.
I want my matplotlib plot to display my df's DateTimeIndex as consecutive count data (in seconds) on the x-axis and my df's Load data on the y axis. Then I want to overlap it with a scipy.signal find_peaks result (which has an x-axis of consecutive seconds). My data is not consecutive (real world data), though it does have a frequency of seconds.
Code
import pandas as pd
import matplotlib.pyplot as plt
from scipy import signal
import numpy as np
# Create Sample Dataset
df = pd.DataFrame([['2020-07-25 09:26:28',2],['2020-07-25 09:26:29',10],['2020-07-25 09:26:32',203],['2020-07-25 09:26:33',30]],
columns = ['Time','Load'])
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index("Time")
print(df)
# Try to solve the problem
rng = pd.date_range(df.index[0], df.index[-1], freq='s')
print(rng)
peaks, _ = signal.find_peaks(df["Load"])
plt.plot(rng, df["Load"])
plt.plot(peaks, df["Load"][peaks], "x")
plt.plot(np.zeros_like(df["Load"]), "--", color="gray")
plt.show()
This code does not work because rng has a length of 6, while the df has a length of 4. I think I might be going about this the wrong way entirely. Thoughts?
You are really close - I think you can get what you want by reindexing your df with your range. For instance:
df = df.reindex(rng).fillna(0)
peaks, _ = signal.find_peaks(df["Load"])
...
Does that do what you expect?
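For reference, here is how the reindexing suggestion might look end to end with the sample data from the question. It is a sketch under two assumptions: that treating missing seconds as a load of 0 is acceptable (that is what `fillna(0)` does), and that the x-axis should show seconds elapsed since the first sample. The `seconds` variable is introduced here for illustration only.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy import signal

# Sample dataset from the question
df = pd.DataFrame(
    [['2020-07-25 09:26:28', 2], ['2020-07-25 09:26:29', 10],
     ['2020-07-25 09:26:32', 203], ['2020-07-25 09:26:33', 30]],
    columns=['Time', 'Load'])
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index('Time')

# Reindex onto a gap-free one-second grid; missing seconds become 0
rng = pd.date_range(df.index[0], df.index[-1], freq='s')
df = df.reindex(rng).fillna(0)

# Seconds elapsed since the first sample (assumed x-axis for the plot)
seconds = (df.index - df.index[0]).total_seconds()

# find_peaks returns positional indices, which now line up with 'seconds'
peaks, _ = signal.find_peaks(df['Load'])

plt.plot(seconds, df['Load'])
plt.plot(seconds[peaks], df['Load'].iloc[peaks], 'x')
plt.axhline(0, linestyle='--', color='gray')
plt.xlabel('Seconds since first sample')
plt.ylabel('Load')
plt.show()
```

Because the reindexed frame has one row per second, the positions returned by `find_peaks` and the elapsed seconds share the same axis. Note that filling gaps with 0 can create peaks next to a gap; interpolating instead would be another option, depending on what the data represents.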
I'm using python for the first time. I have a csv file with a few columns of data: location, height, density, day etc... I am plotting height (i_h100) v density (i_cd) and have managed to constrain the height to values below 50 with the code below. I now want to constrain the values on the y axis to be within a certain 'day' range say (85-260). I can't work out how to do this.
import pandas
import matplotlib.pyplot as plt
data=pandas.read_csv('data.csv')
data.plot(kind='scatter',x='i_h100',y='i_cd')
plt.xlim(right=50)
Use .loc to subset the data before it goes into the plot.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Make some dummy data
np.random.seed(42)
df = pd.DataFrame({'a':np.random.randint(0,365,20),
'b':np.random.rand(20),
'c':np.random.rand(20)})
# all data: plot of 'b' vs. 'c'
df.plot(kind='scatter', x='b', y='c')
plt.show()
# use .loc to subset data displayed based on value in 'a'
# can also use .loc to restrict values of 'b' displayed rather than plt.xlim
df.loc[df['a'].between(85,260) & (df['b'] < 0.5)].plot(kind='scatter', x='b', y='c')
plt.show()
I have a large data set with over 10,000 rows with values between 0 and 400,000,000. I would like to plot those values vs. the mean of another column in matplotlib where the x axis increments by 50,000,000 but I am unsure how to do so. I can plot it using pandas but would really like to do it using matplotlib but unsure how. This is what I have in pandas:
mean_values = df.groupby(pd.cut(df['budget_adj'],np.arange(0,4000000000,50000000)))['vote_average'].mean()
mean_values.plot(kind='line',figsize=(12,5))
I think I figured out what your problem is:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# Create some data
df = pd.DataFrame({'budget_adj': np.random.uniform(0, 4000000000, 10000),
'vote_average': np.random.uniform(0, 100000, 10000)})
# Calculate the mean values
mean_values = df.groupby(pd.cut(df['budget_adj'],np.arange(0,4000000000,50000000)))['vote_average'].mean()
And this is what I suspect you are doing:
# This won't work since mean_values.index holds intervals
plt.plot(mean_values.index, mean_values)
This won't work since your index is a categorical index of intervals. For plot to work, your x-values have to be numbers. We can convert the intervals to numbers in several ways:
# You can pick the left endpoint...
x_values = [i.left for i in mean_values.index]
# the right endpoint...
x_values = [i.right for i in mean_values.index]
# or the center value.
x_values = [i.mid for i in mean_values.index]
# And NOW you will get no error
plt.plot(x_values, mean_values)
I wish to group a dataset by "assay", then compare intensities for small cells versus large cells. The problem I have is that in writing my code I only understand how to group the top and bottom cellArea quantiles of the entire dataFrame, rather than for each individual assay ('wt' and 'cnt').
As a final point, I would like to compare the mean values between the intensities of the two groups for each assay type...
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = DataFrame({'assay':['cnt']*10+['wt']*10,
'image':['001']*10+['002']*5+['001']*5,
'roi':['1']*5+['2']*5+['3']*5+['1']*5,
'cellArea':[99,90,50,2,30,65,95,30,56,5,33,18,98,76,56,72,12,5,47,89],
'intensity':[88,34,1,50,2,67,88,77,73,3,2,67,37,34,12,45,23,82,12,1]},
columns=['assay','image','roi','cellArea','intensity'])
df.loc[(df['cellArea'] < df['cellArea'].quantile(.20)),'group'] = 'Small_CellArea'
df.loc[(df['cellArea'] > df['cellArea'].quantile(.80)),'group'] = 'Large_CellArea'
df = df.reset_index(drop=True)
sns.violinplot(data=df,y='intensity',x='assay',hue='group',capsize=1,ci=95,palette="Set3",inner='quartile',split=True, cut=0)
plt.ylim(-20,105)
plt.legend(loc='center', bbox_to_anchor=(0.5, 0.08), ncol=3, frameon=True, fancybox=True, shadow=True, fontsize=12)
Calculate the low and high quantiles per group, then merge them back into the original data frame. From there you can assign the group variable as small or large:
import pandas as pd
quantileLow = df.groupby('assay').cellArea.quantile(0.2).reset_index()
quantileHigh = df.groupby('assay').cellArea.quantile(0.8).reset_index()
df = pd.merge(df, pd.merge(quantileLow, quantileHigh, on = "assay"), on = "assay")
df.loc[df['cellArea'] < df.cellArea_x,'group'] = 'Small_CellArea'
df.loc[df['cellArea'] > df.cellArea_y,'group'] = 'Large_CellArea'
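As a possible alternative to the two merges, the per-assay thresholds can be broadcast back onto the rows with `groupby(...).transform`, and the mean intensities the question asks about can then be read off with a second `groupby`. This is a sketch using the question's data (only the columns needed here); the names `low`, `high` and `means` are introduced for illustration.

```python
import pandas as pd

# Data from the question, reduced to the columns used here
df = pd.DataFrame({
    'assay': ['cnt'] * 10 + ['wt'] * 10,
    'cellArea': [99, 90, 50, 2, 30, 65, 95, 30, 56, 5,
                 33, 18, 98, 76, 56, 72, 12, 5, 47, 89],
    'intensity': [88, 34, 1, 50, 2, 67, 88, 77, 73, 3,
                  2, 67, 37, 34, 12, 45, 23, 82, 12, 1],
})

# Per-assay 20% and 80% quantiles, aligned row by row with df
area_by_assay = df.groupby('assay')['cellArea']
low = area_by_assay.transform(lambda s: s.quantile(0.2))
high = area_by_assay.transform(lambda s: s.quantile(0.8))

# Label rows below or above their own assay's thresholds
df.loc[df['cellArea'] < low, 'group'] = 'Small_CellArea'
df.loc[df['cellArea'] > high, 'group'] = 'Large_CellArea'

# Mean intensity per assay and group (unlabelled rows are dropped)
means = df.groupby(['assay', 'group'])['intensity'].mean()
print(means.unstack('group'))
```

Rows between the two thresholds keep a missing `group` value and are therefore left out of the means, which matches the behaviour of the merge-based version above.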