How to remove certain values before plotting data - python

I'm using python for the first time. I have a csv file with a few columns of data: location, height, density, day etc... I am plotting height (i_h100) v density (i_cd) and have managed to constrain the height to values below 50 with the code below. I now want to constrain the values on the y axis to be within a certain 'day' range say (85-260). I can't work out how to do this.
import pandas
import matplotlib.pyplot as plt
data=pandas.read_csv('data.csv')
data.plot(kind='scatter',x='i_h100',y='i_cd')
plt.xlim(right=50)

Use .loc to subset data going into graph.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Make some dummy data
np.random.seed(42)
df = pd.DataFrame({'a':np.random.randint(0,365,20),
'b':np.random.rand(20),
'c':np.random.rand(20)})
# all data: plot of 'b' vs. 'c'
df.plot(kind='scatter', x='b', y='c')
plt.show()
# use .loc to subset data displayed based on value in 'a'
# can also use .loc to restrict values of 'b' displayed rather than plt.xlim
df.loc[df['a'].between(85,260) & (df['b'] < 0.5)].plot(kind='scatter', x='b', y='c')
plt.show()

Related

Python: How to construct a joyplot with values taken from a column in pandas dataframe as y axis

I have a dataframe df in which the column extracted_day consists of dates ranging between 2022-05-08 to 2022-05-12. I have another column named gas_price, which consists of the price of the gas. I want to construct a joyplot such that for each date, it shows the gas_price in the y axis and has minutes_elapsed_from_start_of_day in the x axis. We may also use ridgeplot or any other plot if this doesn't work.
This is the code that I have written, but it doesn't serve my purpose.
from joypy import joyplot
import matplotlib.pyplot as plt
df['extracted_day'] = df['extracted_day'].astype(str)
joyplot(df, by = 'extracted_day', column = 'minutes_elapsed_from_start_of_day',figsize=(14,10))
plt.xlabel("Number of minutes elapsed throughout the day")
plt.show()
Create dataframe with mock data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from joypy import joyplot
np.random.seed(111)
df = pd.DataFrame({
'minutes_elapsed_from_start_of_day': np.tile(np.arange(1440), 5),
'extracted_day': np.repeat(['2022-05-08', '2022-05-09', '2022-05-10','2022-05-11', '2022-05-12'], 1440),
'gas_price': abs(np.cumsum(np.random.randn(1440*5)))})
Then create the joyplot. It is important that you set kind='values', since you do not want joyplot to show KDEs (kernel density estimates, joyplot's default) but the raw gas_price values:
joyplot(df, by='extracted_day',
column='gas_price',
kind='values',
x_range=np.arange(1440),
figsize=(7,5))
The resulting joyplot looks like this (the fake gas prices are represented by the y-values of the lines):

FREQUENCY BAR CHART OF A DATE COLUMN IN AN ASCENDING ORDER OF DATES

So, I have a dataset (some first rows of it pasted here). My goal is to plot a frequency distribution of the 'sample_date' column. It seemed pretty simple to me at first. Just convert the column to datetime, sort values (dates) by default in an ascending order, and finally plot the bar chart. But the problem is that the bar chart is displayed NOT IN AN ASCENDING ORDER OF DATES (which is what I want to get), but in a DESCENDING ORDER OF VALUE COUNTS CORRESPONDING TO THESE DATES.
Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('dataset.csv')
data['sample_date'] = pd.to_datetime(data['sample_date'])
data = data.sort_values(by='sample_date')
data['sample_date'].value_counts().plot(kind='bar')
Here is the dataset.csv:
,sequence_name,sample_date,epi_week,epi_date,lineage
1,England/MILK-1647769/2021,2021-06-07,76,2021-06-06,C.37
2,England/MILK-156082C/2021,2021-05-06,71,2021-05-02,C.37
3,England/CAMC-149B04F/2021,2021-03-30,66,2021-03-28,C.37
4,England/CAMC-13962F4/2021,2021-03-04,62,2021-02-28,C.37
5,England/CAMC-13238EB/2021,2021-02-23,61,2021-02-21,C.37
0,England/PHEC-L304L78C/2021,2021-05-12,72,2021-05-09,B.1.617.3
1,England/MILK-15607D4/2021,2021-05-06,71,2021-05-02,B.1.617.3
2,England/MILK-156C77E/2021,2021-05-05,71,2021-05-02,B.1.617.3
4,England/PHEC-K305K062/2021,2021-04-25,70,2021-04-25,B.1.617.3
5,England/PHEC-K305K080/2021,2021-04-25,70,2021-04-25,B.1.617.3
6,England/ALDP-153351C/2021,2021-04-23,69,2021-04-18,B.1.617.3
7,England/PHEC-30C13B/2021,2021-04-22,69,2021-04-18,B.1.617.3
8,England/PHEC-30AFE8/2021,2021-04-22,69,2021-04-18,B.1.617.3
9,England/PHEC-30A935/2021,2021-04-21,69,2021-04-18,B.1.617.3
10,England/ALDP-152BC6D/2021,2021-04-21,69,2021-04-18,B.1.617.3
11,England/ALDP-15192D9/2021,2021-04-17,68,2021-04-11,B.1.617.3
12,England/ALDP-1511E0A/2021,2021-04-15,68,2021-04-11,B.1.617.3
13,England/PHEC-306896/2021,2021-04-12,68,2021-04-11,B.1.617.3
14,England/PORT-2DFB70/2021,2021-04-06,67,2021-04-04,B.1.617.3
Here is what I get and do not want to get:
BAR CHART FOR THE 'SAMPLE_DATE' COLUMN IN A DESCENDING ORDER OF VALUE COUNTS OF THE DATES
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('dataset.csv')
data['sample_date'] = pd.to_datetime(data['sample_date'])
data['sample_date'].value_counts().sort_index().plot(kind='bar') # Use sort_index()
plt.tight_layout()
plt.show()
The value_counts() give you a option to add a flag - ascending you only need to set it to True and the bar chart will be in ascending order. actually you don't need to use the sort_values() at all.
Check out value_counts() documentation: https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html
Code:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('dataset.csv')
data['sample_date'] = pd.to_datetime(data['sample_date'])
data['sample_date'].value_counts(ascending=True).plot(kind='bar')
plt.show()
Output:

Matplotlib violinplots overlap on the same column

I want to create a figure with different violin plots on the same graph (but not on the same column).
My data are a list of dataframes and I want to create a violin plot of one column for each dataframe. (the names of the columns in the final figure I prefer to have as a name that is inside each dataframe in one other column).
I used this code:
for i in range(0,len(sta_list)):
plt.violinplot(sta_list[i]['diff_APS_1'])
I know that this is wrong, I want to split up the resulting plots in the figure.
You can specify the x-position of the violin plot for each column using positions argument
for i in range(0, len(sta_list)):
plt.violinplot(sta_list[i]['diff_APS_1'], positions=[i])
A sample answer for demonstration taking the dataset from this post
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.random.poisson(lam =3, size=100)
y = np.random.choice(["S{}".format(i+1) for i in range(4)], size=len(x))
df = pd.DataFrame({"Scenario":y, "LMP":x})
fig, ax = plt.subplots()
for i, key in enumerate(['S1', 'S2', 'S3', 'S4']):
ax.violinplot(df[df.Scenario == key]["LMP"].values, positions=[i])

Splitting large data set and plotting the average in matplotlib

I have a large data set with over 10,000 rows with values between 0 and 400,000,000. I would like to plot those values vs. the mean of another column in matplotlib where the x axis increments by 50,000,000 but I am unsure how to do so. I can plot it using pandas but would really like to do it using matplotlib but unsure how. This is what I have in pandas:
mean_values = df.groupby(pd.cut(df['budget_adj'],np.arange(0,4000000000,50000000)))['vote_average'].mean()
mean_values.plot(kind='line',figsize=(12,5))
I think I figured out what your problem is
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# Create some data
df = pd.DataFrame({'budget_adj': np.random.uniform(0, 4000000000, 10000),
'vote_average': np.random.uniform(0, 100000, 10000)})
# Calculate the mean values
mean_values = df.groupby(pd.cut(df['budget_adj'],np.arange(0,4000000000,50000000)))['vote_average'].mean()
And this is what I suspect you do
# This wont work since mean_values.index is an interval
plt.plot(mean_values.index, mean_values)
This wont work since you index is a categorical interval. In order for plot to work your x-values have to be numbers. We can convert our intervals in many ways
# You can pick the left endpoint...
x_values = [i.left for i in mean_values.index]
# the right endpoint...
x_values = [i.right for i in mean_values.index]
# or the center value.
x_values = [i.mid for i in mean_values.index]
# And NOW you will get no error
plt.plot(x_values, mean_values)

Clustermapping in Python using Seaborn

I am trying to create a heatmap with dendrograms on Python using Seaborn and I have a csv file with about 900 rows. I'm importing the file as a pandas dataframe and attempting to plot that but a large number of the rows are not being represented in the heatmap. What am I doing wrong?
This is the code I have right now. But the heatmap only represents about 49 rows.
Here is an image of the clustermap I've obtained but it is not displaying all of my data.
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
# Data set
df = pd.read_csv('diff_exp_gene.csv', index_col = 0)
# Default plot
sns.clustermap(df, cmap = 'RdBu', row_cluster=True, col_cluster=True)
plt.show()
Thank you.
An alternative approach would be to use imshow in matpltlib. I'm not exactly sure what your question is but I demonstrate a way to graph points on a plane from csv file
import numpy as np
import matplotlib.pyplot as plt
import csv
infile = open('diff_exp_gene.csv')
df = csv.DictReader(in_file)
temp = np.zeros((128,128), dtype = int)
for row in data:
if row['TYPE'] == types:
temp[int(row['Y'])][int(row['X'])] = temp[int(row['Y'])][int(row['X'])] + 1
plt.imshow(temp, cmap = 'hot', origin = 'lower')
plt.show()
As far as I know, keywords that apply to seaborn heatmaps also apply to clustermap, as the sns.clustermap passes to the sns.heatmap. In that case, all you need to do in your example is to set yticklabels=True as a keyword argument in sns.clustermap(). That will make all of the 900 rows appear.
By default, it is set as "auto" to avoid overlap. The same applies to the xticklabels. See more here: https://seaborn.pydata.org/generated/seaborn.heatmap.html

Categories