Uncertain why trendline is not appearing on matplotlib scatterplot - python

I am trying to plot a trendline for a matplotlib scatterplot and am uncertain why the trendline is not appearing. What should I change in my code to make the trendline appear? Event is a categorical data type.
I've followed what most other stackoverflow questions suggest about plotting a trendline, but am uncertain why my trendline is not appearing.
#import libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pandas.plotting import register_matplotlib_converters
#register datetime converters
register_matplotlib_converters()
#read dataset using pandas
dataset = pd.read_csv("UsrNonCallCDCEvents_CDCEventType.csv")
#convert date to datetime type
dataset['Interval'] = pd.to_datetime(dataset['Interval'])
#convert other columns to numeric type
for cols in list(dataset):
    if cols != 'Interval' and cols != 'CDCEventType':
        dataset[cols] = pd.to_numeric(dataset[cols])
#create pivot of dataset
pivot_dataset = dataset.pivot(index='Interval',columns='CDCEventType',values='AvgWeight(B)')
#create scatterplot with trendline
x = pivot_dataset.index.values.astype('float64')
y = pivot_dataset['J-STD-025']
plt.scatter(x,y)
z = np.polyfit(x,y,1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.show()
This is the graph currently being output. I am trying to get this same graph, but with a trendline: https://imgur.com/a/o18a5Y3
It's also fine that the x axis is not showing dates.
A snippet of my dataframe looks like this: https://imgur.com/a/xJAcgEI
I've painted out the irrelevant column names.
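One possible cause, offered as a guess since the data is only shown as an image: the pivot can leave NaN values in the 'J-STD-025' column for intervals where that event type has no row. np.polyfit does not skip NaN, so the fit can return NaN coefficients (or fail), and then no line is drawn. A minimal sketch that drops non-finite points before fitting follows; fit_trendline is a hypothetical helper and the sample arrays are made up.

```python
import numpy as np

def fit_trendline(x, y, degree=1):
    """Fit a polynomial trendline, ignoring points where x or y is NaN/inf.

    Hypothetical helper for illustration; not part of numpy or matplotlib.
    """
    x = np.asarray(x, dtype="float64")
    y = np.asarray(y, dtype="float64")
    mask = np.isfinite(x) & np.isfinite(y)
    # Shift x so very large values (e.g. nanosecond timestamps) do not hurt the fit.
    x0 = x[mask].min()
    coeffs = np.polyfit(x[mask] - x0, y[mask], degree)
    poly = np.poly1d(coeffs)
    # Return a callable that applies the same shift to any new x values.
    return lambda new_x: poly(np.asarray(new_x, dtype="float64") - x0)

# Made-up data with one missing y value.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, np.nan, 5.0, 7.0])
trend = fit_trendline(x, y)
trend_values = trend(x)
```

With the variables from the question, this would be used as trend = fit_trendline(x, y) followed by plt.plot(x, trend(x), "r--").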

Related

How to create a Boxplot with Timestamp using Matplotlib and Seaborn?

I have been trying to get a boxplot with each box representing an emotion over a period of time.
The data frame used for the plot contains a timestamp and an emotion name. I tried converting the timestamp to a string, then to datetime, and finally to int64, which produced the gaps between x labels seen in the plot. I also tried it without the int64 conversion, but matplotlib doesn't seem to accept the dates in the plot.
I'm attaching the code I have used here:
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib qt
import pandas as pd
import numpy as np
from datetime import datetime
import seaborn as sns
data = pd.read_csv("TX-governor-sentiment.csv")
## check data types
data.dtypes
# drop rows with all missing values
data = data.dropna(how='all')
## transforming the timestamp column
#convert from obj type to string then to date type
data['timestamp2'] = data['timestamp']
data['timestamp2'] = pd.to_datetime(data['timestamp2'].astype(str), format='%m/%d/%Y %H:%M')
# convert to number format with the following logic:
# yyyymmddhourmin --> this allows us to treat dates as a continuous variable
data['timestamp2'] = data['timestamp2'].dt.strftime('%Y%m%d%H%M')
data['timestamp2'] = data['timestamp2'].astype('int64')
print (data[['timestamp','timestamp2']])
#data transformation for data from Orange
df = pd.DataFrame(columns=('timestamp', 'emotion'))
for index, row in data.iterrows():
    if row['sentiment'] == 0:
        df.loc[index] = [row['timestamp2'], 'Neutral']
    else:
        df.loc[index] = [row['timestamp2'], row['Emotion']]
# Plot using Seaborn & Matplotlib
#convert timestamp in case it's not in number format
df['timestamp'] = df['timestamp'].astype('int64')
fig = plt.figure(figsize=(10,10))
#colors = {"Neutral": "grey", "Joy": "pink", "Surprise":"blue"}
#visualize as boxplot
plot_ = sns.boxplot(x="timestamp", y="emotion", data=df, width=0.5,whis=np.inf);
#add data point on top
plot_ = sns.stripplot(x="timestamp", y="emotion", data=df, alpha=0.8, color="black");
fig.canvas.draw()
#modify ticks and labels
plt.xlim([202003010000,202004120000])
plt.xticks([202003010000, 202003150000, 202003290000, 202004120000], ['2020/03/01', '2020/03/15', '2020/03/29', '2020/04/12'])
#add colors
for patch in plot_.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .3))
Please let me know how I can overcome this problem of gaps in the boxplot. Thank you!
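One likely contributor to the gaps, assuming the x axis is drawn on the yyyymmddHHMM integers: that encoding is not linear in time, so the step across a month boundary is far larger than the real elapsed time. A minimal sketch with two made-up timestamps one minute apart compares the encoding with elapsed minutes:

```python
import pandas as pd

# Two made-up timestamps that are one minute apart, across a month boundary.
stamps = pd.Series(pd.to_datetime(["2020-03-31 23:59", "2020-04-01 00:00"]))

# Encoding used in the question: the digits of the date read as one integer.
encoded = stamps.dt.strftime("%Y%m%d%H%M").astype("int64")

# Alternative: minutes elapsed since the first timestamp (a linear time scale).
elapsed_minutes = (stamps - stamps.iloc[0]).dt.total_seconds() / 60

encoded_gap = int(encoded.iloc[1] - encoded.iloc[0])
elapsed_gap = float(elapsed_minutes.iloc[1] - elapsed_minutes.iloc[0])
print(encoded_gap, elapsed_gap)
```

Plotting against elapsed time, or against matplotlib date numbers, should keep the spacing proportional to real time; the tick labels can then be set to formatted dates as in the question.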

How to sequentially add seaborn boxplots to the same axis?

Is there a way to add multiple seaborn boxplots to one figure sequentially?
Taking example from Time-series boxplot in pandas:
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)
This gives me one series of box plots.
Now, is there any way to plot two time series like this on the same plot, side by side? I want to plot it in a function that has a make_new_plot boolean parameter for separating the boxplots that are plotted from the for-loop.
If I try to just call it on the same axis, it gives me overlapping plots.
I know that it is possible to concatenate the dataframes and make box plots of the concatenated dataframe together, but I would not want this plotting function to return any dataframes.
Is there some other way to do it? Maybe it is possible to manipulate the width and position of the boxes to achieve this? The fact that I need a time series of boxplots, and that the matplotlib "positions" parameter is deliberately not supported by seaborn, makes it a bit tricky for me to figure out how to do it.
Note that it is NOT the same as eg. Plotting multiple boxplots in seaborn?, because I want to plot it sequentially without returning any dataframes from the plotting function.
You could do something like the following if you want to have hue nesting of different time-series in your boxplots.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
n = 480
ts0 = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
ts1 = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
ts2 = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
def ts_boxplot(ax, list_of_ts):
    new_list_of_ts = []
    for i, ts in enumerate(list_of_ts):
        ts = ts.to_frame(name='ts_variable')
        ts['ts_number'] = i
        ts['doy']=ts.index.dayofyear
        new_list_of_ts.append(ts)
    plot_data = pd.concat(new_list_of_ts)
    sns.boxplot(data=plot_data, x='doy', y='ts_variable', hue='ts_number', ax=ax)
    return ax
fig, ax = plt.subplots(figsize=(12,5))
ax = ts_boxplot(ax, [ts0, ts1, ts2])

Clustermapping in Python using Seaborn

I am trying to create a heatmap with dendrograms in Python using Seaborn, and I have a CSV file with about 900 rows. I'm importing the file as a pandas dataframe and attempting to plot it, but a large number of the rows are not being represented in the heatmap. What am I doing wrong?
This is the code I have right now. But the heatmap only represents about 49 rows.
Here is an image of the clustermap I've obtained but it is not displaying all of my data.
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
# Data set
df = pd.read_csv('diff_exp_gene.csv', index_col = 0)
# Default plot
sns.clustermap(df, cmap = 'RdBu', row_cluster=True, col_cluster=True)
plt.show()
Thank you.
An alternative approach would be to use imshow in matplotlib. I'm not exactly sure what your question is, but this demonstrates a way to plot points on a plane from a CSV file.
import numpy as np
import matplotlib.pyplot as plt
import csv
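# 'TYPE', 'X' and 'Y' are the column names this answer assumes;
# `types` is not defined in the original answer and must be set to the TYPE value to count
temp = np.zeros((128,128), dtype = int)
with open('diff_exp_gene.csv') as in_file:
    reader = csv.DictReader(in_file)
    for row in reader:
        if row['TYPE'] == types:
            temp[int(row['Y'])][int(row['X'])] += 1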
plt.imshow(temp, cmap = 'hot', origin = 'lower')
plt.show()
As far as I know, keywords that apply to seaborn heatmaps also apply to clustermap, because sns.clustermap passes them on to sns.heatmap. In that case, all you need to do in your example is set yticklabels=True as a keyword argument in sns.clustermap(). That will make the labels for all 900 rows appear.
By default, it is set as "auto" to avoid overlap. The same applies to the xticklabels. See more here: https://seaborn.pydata.org/generated/seaborn.heatmap.html
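A minimal sketch of that suggestion, using a made-up dataframe in place of diff_exp_gene.csv (the row and column names, the figure size and the label size are all assumptions for illustration). It presumes the rows are all drawn already and that only their labels are thinned by the "auto" setting:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for diff_exp_gene.csv: 200 labeled rows, 6 columns.
rng = np.random.default_rng(0)
n_rows = 200
df = pd.DataFrame(
    rng.normal(size=(n_rows, 6)),
    index=[f"gene_{i}" for i in range(n_rows)],
    columns=[f"sample_{j}" for j in range(6)],
)

# yticklabels=True asks for a label on every row instead of the "auto" subset.
# A tall figure leaves room for the extra labels.
grid = sns.clustermap(df, cmap="RdBu", yticklabels=True, figsize=(6, 20))

# Shrink the font so the row labels stay legible.
grid.ax_heatmap.tick_params(axis="y", labelsize=4)

n_labels = len(grid.ax_heatmap.get_yticklabels())
```

Calling plt.show() from matplotlib.pyplot afterwards displays the figure; with 900 rows, an even taller figure or smaller font is probably needed.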

How to combine bar and line plots with x-axis as datetime in matplotlib

I have a dataFrame with datetimeIndex and two columns with int values. I would like to plot on the same graph Col1 as a bar plot, and Col2 as a line plot.
An important requirement is a correctly labeled datetime x-axis, including when zooming in and out. I don't think solutions using DateFormatter would work, since I want dynamic xtick labeling.
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt
import numpy as np
startDate = dt.datetime(2018,1,1,0,0)
nrHours = 144
datetimeIndex = [startDate + dt.timedelta(hours=x) for x in range(0,nrHours)]
dF = pd.DataFrame(index=datetimeIndex)
dF['Col1'] = np.random.randint(1,3,nrHours)
dF['Col2'] = np.random.randint(3,6,nrHours)
axes = dF[['Col1']].plot(kind='bar')
dF[['Col2']].plot(ax=axes)
What seemed to be a simple task turned out to be very challenging. After an extensive search, I still haven't found a clean solution.
I have tried both pandas plotting and matplotlib.
The main issue is that the bar plot has difficulty handling a datetime index (it prefers integers; in some cases it plots dates, but as offsets from the 1970-01-01 epoch, which corresponds to 0).
I finally found a way using mdates and date2num. The solution is not very clean, but it does the following:
Combine bar and line plot on same graph
Using datetime on x-axis
Correctly and dynamically displaying x-ticks time labels (also when zooming in and out)
Working example :
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import datetime as dt
import numpy as np
startDate = dt.datetime(2018,1,1,0,0)
nrHours = 144
datetimeIndex = [startDate + dt.timedelta(hours=x) for x in range(0, nrHours)]
dF = pd.DataFrame(index=datetimeIndex)
dF['Col1'] = np.random.randint(1,3,nrHours)
dF['Col2'] = np.random.randint(3,6,nrHours)
fig,axes = plt.subplots()
axes.xaxis_date()
axes.plot(mdates.date2num(list(dF.index)),dF['Col2'])
axes.bar(mdates.date2num(list(dF.index)),dF['Col1'],align='center',width=0.02)
fig.autofmt_xdate()

Only the graph axis shows up when I try to plot CSV data using pandas

I have a CSV file with zip codes and a corresponding number for each zip code. I want to plot it using a histogram, but right now only the axes are showing up, with none of the actual data.
import pandas as pd
import matplotlib.pyplot as plt
installedbase = 'zipcode.csv'
df = pd.read_csv(installedbase)
df.plot(x = 'zip_code_installed', y = 'installed_NP', kind = 'hist', rwidth = .5, bins = 1000 )
plt.xlabel('zip code')
plt.ylabel('NP sum')
plt.axis([9000,9650,0,6400])
plt.show()
I am using pandas and matplotlib to plot. "x" and "y" are both set to different columns in my CSV file.
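A possible explanation, hedged because the CSV is not shown: as far as I can tell, kind='hist' bins the values of installed_NP and does not use x as a position, so the x axis holds installed_NP values rather than zip codes, and plt.axis([9000,9650,0,6400]) may then crop everything out of view. If the goal is one bar per zip code, summing per zip code and drawing a bar chart is one option. A minimal sketch with made-up rows under the question's column names:

```python
import pandas as pd

# Made-up rows using the column names from the question.
df = pd.DataFrame({
    "zip_code_installed": [9000, 9000, 9100, 9650],
    "installed_NP": [10, 5, 7, 3],
})

# One total per zip code, ordered by zip code.
totals = df.groupby("zip_code_installed")["installed_NP"].sum().sort_index()
print(totals)
```

With matplotlib installed, totals.plot(kind="bar") followed by plt.show() should draw one bar per zip code, with zip codes as the x labels.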
