How to plot markers on a Python graph

I have a huge python data frame that looks like this.
HR ICULOS SepsisLabel PatientID
100.3 1 0 1
117.0 2 0 1
103.9 3 0 1
104.7 4 0 1
102.0 5 0 1
88.1 6 0 1
Access the whole file here. I plotted the HR column against ICULOS. The code is here:
import matplotlib.pyplot as plt

ax = plt.gca()
ax.set_title("Patient ID = 1")
ax.set_xlabel('ICULOS')
ax.set_ylabel('HR Readings')
dummy.plot(kind='line', x='ICULOS', y='HR', ax=ax)
plt.show()
What I want is to add a marker on the HR graph based on SepsisLabel. At ICULOS = 249, SepsisLabel changes from 0 to 1, and I want the graph to mark the point where that change happens.

If I understand right, what you want to do is to 1) find the index where the value changes and 2) plot (scatter) a point there with a specific marker type.
Regarding 1), you can use dummy['SepsisLabel'].diff().idxmax(). diff() produces a series of 0s with a 1 at the change from 0 to 1 (and a -1 at any change back). idxmax() then finds the first 1 - the point you are looking for.
Regarding 2), you can just use plt.scatter(x, y, marker='X') before plt.show() to draw the point. Anything you draw (with pandas, matplotlib or other libraries) before calling plt.show() will still appear in the plot.
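Putting the two steps together, a minimal sketch with made-up stand-in data (the real file isn't reproduced here, so the HR values and the transition point are invented):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the patient data: SepsisLabel flips from 0 to 1 partway through.
dummy = pd.DataFrame({
    "ICULOS": range(1, 11),
    "HR": [100.3, 117.0, 103.9, 104.7, 102.0, 88.1, 95.2, 99.8, 101.4, 97.6],
    "SepsisLabel": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# 1) index of the first 0 -> 1 change: diff() is 1 there, 0 elsewhere
change_idx = dummy["SepsisLabel"].diff().idxmax()

# 2) draw the line plot, then overlay one marker at the change point
ax = plt.gca()
ax.set_title("Patient ID = 1")
ax.set_xlabel("ICULOS")
ax.set_ylabel("HR Readings")
dummy.plot(kind="line", x="ICULOS", y="HR", ax=ax)
ax.scatter(dummy.loc[change_idx, "ICULOS"], dummy.loc[change_idx, "HR"],
           marker="X", s=100, color="red", zorder=3)
plt.show()
```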

Related

Bar plot not appearing normally using df.plot.bar()

I have the following code. I am trying to loop through variables (dataframe columns) and create bar plots. I have attached below an example of a graph for the column newerdf['age'].
I believe this should produce 3 bars (one for each option: male (value = 1), female (value = 2), other (value = 3)).
However, the graph below does not seem to show this.
I would be so grateful for a helping hand as to where I am going wrong!
listedvariables = ['age','gender-quantised','hours_of_sleep','frequency_of_alarm_usage','nap_duration_mins','frequency_of_naps','takes_naps_yes/no','highest_education_level_acheived','hours_exercise_per_week_in_last_6_months','drink_alcohol_yes/no','drink_caffeine_yes/no','hours_exercise_per_week','hours_of_phone_use_per_week','video_game_phone/tablet_hours_per_week','video_game_all_devices_hours_per_week']
for i in range(0, len(listedvariables)):
    fig = newerdf[[listedvariables[i]]].plot.bar(figsize=(30, 20))
    fig.tick_params(axis='x', labelsize=40)
    fig.tick_params(axis='y', labelsize=40)
    plt.tight_layout()
newerdf['age']
age
0 2
1 2
2 4
3 3
5 2
... ...
911 2
912 1
913 2
914 3
915 2
The data are not grouped into categories yet, so a value count is needed before calling the plotting method:
for var in listedvariables:
    ax = newerdf[var].value_counts().plot.bar(figsize=(30, 20))
    ax.tick_params(axis='x', labelsize=40)
    ax.tick_params(axis='y', labelsize=40)
    plt.tight_layout()
    plt.show()
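To see why value_counts() fixes the bar count, here is a tiny sketch with made-up survey responses (the column name and the 1/2/3 coding are assumptions taken from the question):

```python
import pandas as pd

# Invented responses: 1 = male, 2 = female, 3 = other
newerdf = pd.DataFrame({"gender-quantised": [1, 2, 2, 3, 2, 1, 2]})

# Raw column: one row per respondent, so plot.bar() draws one bar per row.
# After value_counts(): one row per category, so plot.bar() draws 3 bars.
counts = newerdf["gender-quantised"].value_counts()
print(counts)
```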

Separate dataframe lat/lon pairs and plot multiple figures based on column value

I have a dataframe of lat/lon pairs that form a polygon. The rows with 'GAP' and NaN's in the lat/lon columns are separators. So in this case I have 4 polygons with multiple lat/lon locations. My goal is to separate these polygons from each other and then plot using cartopy.
0 1 2
0 POINT 87.6298 9.397332
1 POINT 87.8435 9.842206
2 POINT 87.2354 9.472004
4 GAP NaN NaN
5 POINT 87.8354 9.397332
6 POINT 87.9544 9.472004
7 POINT 87.9632 9.191509
8 POINT 87.6244 9.221509
9 POINT 87.4554 9.397332
10 GAP NaN NaN
11 POINT 87.6249 9.397332
12 POINT 87.7556 9.221509
13 POINT 87.5567 9.086767
14 POINT 87.3222 9.397332
15 GAP NaN NaN
16 POINT 87.6554 9.221509
17 POINT 87.9667 9.191509
18 POINT 87.8854 9.056767
19 POINT 87.4452 9.086767
Assume that each time this is run, the number of polygons and the number of lat/lon pairs in each polygon can change.
Sample code below of the set up:
df = pd.read_excel(xl, sheet_name=0, header=None)
#change column names
df.rename(columns={1:'lon', 2:'lat'},inplace=True)
#replace GAP with NaN so I can separate groups by rows of NaNs (it didn't work with 'GAP')
df.replace('GAP',np.nan, inplace=True)
df['group_no'] = df.isnull().all(axis=1).cumsum()
#define amount of unique numbers and put into list for looping
numbers = df['group_no'].unique()
aa = list(numbers)
Here is where I get lost: the area after the set-up above and before the plotting code shown below.
a_lon, a_lat = 87.8544, 8.721576
b_lon, b_lat = 87.6554, 8.585951
fig,ax = plt.subplots()
plt.figure(figsize=(10.6,6))
proj = ccrs.PlateCarree()
ax = plt.axes(projection=proj)
ax.stock_img()
ax.set_extent([90, 85, 7, 11], crs=ccrs.PlateCarree())
proj = ccrs.Geodetic()
plt.plot([a_lon, b_lon], [a_lat, b_lat], linewidth=1, color='blue', transform=proj)
#plt.show()
So as you can see, I replaced the 'GAP' rows with NaN, separated the rows with a new 'group_no' column, then removed the NaNs. Resulting dataframe:
0 lon lat group_no
0 POINT 87.6298 9.397332 0
1 POINT 87.8435 9.842206 0
2 POINT 87.2354 9.472004 0
4 POINT 87.8354 9.397332 1
5 POINT 87.9544 9.472004 1
6 POINT 87.9632 9.191509 1
7 POINT 87.6244 9.221509 1
8 POINT 87.4554 9.397332 1
10 POINT 87.6249 9.397332 2
11 POINT 87.7556 9.221509 2
12 POINT 87.5567 9.086767 2
13 POINT 87.3222 9.397332 2
15 POINT 87.6554 9.221509 3
16 POINT 87.9667 9.191509 3
17 POINT 87.8854 9.056767 3
18 POINT 87.4452 9.086767 3
I've attempted a few things but can't seem to close the deal. I've tried separating them in a dictionary using groupby, with the key being the group_no and the values being the lat/lon pairs, but I don't know dictionaries well enough to manipulate them and plot each 'key'.
I also attempted to separate each into a new dataframe. Thinking I could loop through and create a df0, df1, etc. and then use a for loop to plot, but couldn't figure that out either.
Any help would be appreciated and please ask if further details are needed.
You're almost there: if you call groupby on the group number, you can pull out each group and get its lat/lon pairs. Make sure you have proper projections set up as well.
from shapely.geometry import Polygon

for group_no, group_data in df.groupby('group_no'):
    poly_coords = group_data[['lon', 'lat']].values
    # build a shape from poly_coords with whatever geometry library you use, e.g.
    polygon = Polygon(poly_coords)
    # ... add to map
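A fuller sketch of the split on a small stand-in dataframe (the cartopy plotting itself is left out; each resulting coordinate array is what you would hand to ax.plot(..., transform=proj) or to Polygon as above):

```python
import numpy as np
import pandas as pd

# Stand-in for the raw sheet: 'GAP' already replaced with NaN as in the question
df = pd.DataFrame({
    0: ["POINT", "POINT", "POINT", np.nan, "POINT", "POINT", "POINT"],
    "lon": [87.6298, 87.8435, 87.2354, np.nan, 87.8354, 87.9544, 87.9632],
    "lat": [9.397332, 9.842206, 9.472004, np.nan, 9.397332, 9.472004, 9.191509],
})

# All-NaN separator rows increment the group counter, then get dropped
df["group_no"] = df.isnull().all(axis=1).cumsum()
df = df.dropna()

# One (n, 2) lon/lat array per polygon, keyed by group number
polygons = {g: d[["lon", "lat"]].to_numpy() for g, d in df.groupby("group_no")}
print({g: arr.shape for g, arr in polygons.items()})
```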

Rolling Window of Local Minima/Maxima

I've made a script (shown below) that helps determine local maxima using historical stock data. It uses the daily highs to mark out local resistance levels. It works great, but for any given point in time (or row in the stock data), I want to know what the most recent resistance level was just prior to that point, and I want this in its own column in the dataset. So for instance:
The top grey line is the high for each day, and the bottom grey line is the close of each day. So roughly speaking, the dataset for that section would look like this:
High Close
216.8099976 216.3399963
215.1499939 213.2299957
214.6999969 213.1499939
215.7299957 215.2799988 <- First blue dot at high
213.6900024 213.3699951
214.8800049 213.4100037 <- 2nd blue dot at high
214.5899963 213.4199982
216.0299988 215.8200073
217.5299988 217.1799927 <- 3rd blue dot at high
216.8800049 215.9900055
215.2299957 214.2400055
215.6799927 215.5700073
....
Right now, this script looks at the entire dataset at once to determine the local maxima indexes for the highs, and then for any given point in the stock history (i.e. any given row), it looks for the NEXT maximum in the list of all maxima found. That would be a way to determine where the next resistance level is, but I don't want that due to look-ahead bias. I just want a column of the most recent past resistance level, or ideally the 2 most recent levels in 2 columns. That would be ideal actually.
So my final output would look like this for the 1 column:
High Close Most_Rec_Max
216.8099976 216.3399963 0
215.1499939 213.2299957 0
214.6999969 213.1499939 0
215.7299957 215.2799988 0
213.6900024 213.3699951 215.7299957
214.8800049 213.4100037 215.7299957
214.5899963 213.4199982 214.8800049
216.0299988 215.8200073 214.8800049
217.5299988 217.1799927 214.8800049
216.8800049 215.9900055 217.5299988
215.2299957 214.2400055 217.5299988
215.6799927 215.5700073 217.5299988
....
You'll notice that a peak only shows up in the most-recent column after it has already been discovered.
Here is the code I am using:
import numpy as np
import matplotlib.pyplot as plt

real_close_prices = df['Close'].to_numpy()
highs = df['High'].to_numpy()
# local max: sign of the first difference flips from + to -
max_indexes = (np.diff(np.sign(np.diff(highs))) < 0).nonzero()[0] + 1
# +1 due to the fact that diff reduces the original index number
max_values_at_indexes = highs[max_indexes]
curr_high = [c for c in highs]
max_values_at_indexes.sort()
for m in max_values_at_indexes:
    for i, c in enumerate(highs):
        if m > c and curr_high[i] == c:
            curr_high[i] = m
df['High_Resistance'] = curr_high

# plot
x = np.arange(len(highs))
plt.figure(figsize=(12, 5))
plt.plot(x, highs, color='grey')
plt.plot(x, real_close_prices, color='grey')
plt.plot(x[max_indexes], highs[max_indexes], "o", label="max", color='b')
plt.show()
Hoping someone will be able to help me out with this. Thanks!
Here is one approach. Once you know where the peaks are, you can store peak indices in p_ids and peak values in p_vals. To assign the k'th most recent peak, note that p_vals[:-k] will occur at p_ids[k:]. The rest is forward filling.
# find all local maxima in the series by comparing to shifted values
peaks = (df.High > df.High.shift(1)) & (df.High > df.High.shift(-1))

# keep the peak value where a peak occurs and NaN otherwise, then
# forward fill with the previous peak value & handle leading NaNs with fillna
df['Most_Rec_Max'] = (df.High * peaks.replace(False, np.nan)).ffill().fillna(0)

# for finding the n-most recent peak
p_ids, = np.where(peaks)
p_vals = df.High[p_ids].values
for n in [1, 2]:
    col_name = f'{n+1}_Most_Rec_Max'
    df[col_name] = np.nan
    df.loc[p_ids[n:], col_name] = p_vals[:-n]
    df[col_name].ffill(inplace=True)
    df[col_name].fillna(0, inplace=True)
# High Close Most_Rec_Max 2_Most_Rec_Max 3_Most_Rec_Max
# 0 216.809998 216.339996 0.000000 0.000000 0.000000
# 1 215.149994 213.229996 0.000000 0.000000 0.000000
# 2 214.699997 213.149994 0.000000 0.000000 0.000000
# 3 215.729996 215.279999 215.729996 0.000000 0.000000
# 4 213.690002 213.369995 215.729996 0.000000 0.000000
# 5 214.880005 213.410004 214.880005 215.729996 0.000000
# 6 214.589996 213.419998 214.880005 215.729996 0.000000
# 7 216.029999 215.820007 214.880005 215.729996 0.000000
# 8 217.529999 217.179993 217.529999 214.880005 215.729996
# 9 216.880005 215.990006 217.529999 214.880005 215.729996
# 10 215.229996 214.240006 217.529999 214.880005 215.729996
# 11 215.679993 215.570007 217.529999 214.880005 215.729996
I just came across this function that might help you a lot: scipy.signal.find_peaks.
Based on your sample dataframe, we can do the following:
from scipy.signal import find_peaks
# Grab the minimum high value as a threshold.
min_high = df["High"].min()
# Run the High values through the function. The docs explain more,
# but we can set our height to the minimum high value.
# We just need one of the two return values.
peaks, _ = find_peaks(df["High"], height=min_high)
# Build a small frame marking the peak rows, then merge on our index values.
peaks_df = pd.DataFrame({"local_high": 1}, index=peaks)
df1 = df.merge(peaks_df, how="left", left_index=True, right_index=True)
# Set non-null values to 1 and null values to 0; convert column to integer type.
df1.loc[~df1["local_high"].isna(), "local_high"] = 1
df1.loc[df1["local_high"].isna(), "local_high"] = 0
df1["local_high"] = df1["local_high"].astype(int)
Then, your dataframe should look like the following:
High Close local_high
0 216.809998 216.339996 0
1 215.149994 213.229996 0
2 214.699997 213.149994 0
3 215.729996 215.279999 1
4 213.690002 213.369995 0
5 214.880005 213.410004 1
6 214.589996 213.419998 0
7 216.029999 215.820007 0
8 217.529999 217.179993 1
9 216.880005 215.990005 0
10 215.229996 214.240005 0
11 215.679993 215.570007 0
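As an aside, the merge can be skipped entirely: find_peaks returns integer positions, so the marker column can be assigned directly with .loc. A sketch with a short invented price series (the peaks array is hard-coded to what find_peaks would be expected to return for it):

```python
import numpy as np
import pandas as pd

# Invented highs; indices 3 and 5 are the interior local maxima
df = pd.DataFrame({"High": [216.81, 215.15, 214.70, 215.73, 213.69, 214.88, 214.59]})
peaks = np.array([3, 5])  # assumed output of scipy.signal.find_peaks(df["High"])

# Direct assignment: 0 everywhere, 1 at the peak rows
df["local_high"] = 0
df.loc[peaks, "local_high"] = 1
print(df["local_high"].tolist())
```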

How to visualize columns of a Python dataframe as a plot?

I have a dataframe that looks like below:
DateTime ID Temperature
2019-03-01 18:36:01 3 21
2019-04-01 18:36:01 3 21
2019-18-01 08:30:01 2 18
2019-12-01 18:36:01 2 12
I would like to visualize this as a plot with the DateTime on the x-axis and Temperature on the y-axis, with a hue by ID. I tried the below, but I need to see the Temperature distribution for every point more clearly. Is there any other visualization technique?
x= df['DateTime'].values
y= df['Temperature'].values
hue=df['ID'].values
plt.scatter(x, y,hue,color = "red")
You can try:
df.set_index('DateTime').plot()
or you can use:
df.set_index('DateTime').plot(style="x-", figsize=(15, 10))
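If the hue-per-ID effect from the question is the goal, one plain-matplotlib sketch (column names taken from the sample data, values invented) is to scatter each ID group separately so each gets its own color and legend entry:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "DateTime": pd.to_datetime(["2019-01-03", "2019-01-04", "2019-01-18", "2019-01-12"]),
    "ID": [3, 3, 2, 2],
    "Temperature": [21, 21, 18, 12],
})

fig, ax = plt.subplots()
# one scatter call per ID -> one color and one legend entry per ID
for id_, grp in df.groupby("ID"):
    ax.scatter(grp["DateTime"], grp["Temperature"], label=f"ID {id_}")
ax.set_xlabel("DateTime")
ax.set_ylabel("Temperature")
ax.legend()
```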

Matplotlib showing x-tick labels overlapping

Have a look at the graph below:
It's a subplot of this larger figure:
I see two problems with it. First, the x-axis labels overlap with one another (this is my major issue). Second, the location of the x-axis minor gridlines seems a bit wonky. On the left of the graph, they look properly spaced. But on the right, they seem to be crowding the major gridlines... as if the major gridline locations aren't proper multiples of the minor tick locations.
My setup is that I have a DataFrame called df which has a DatetimeIndex on the rows and a column called value which contains floats. I can provide an example of the df contents in a gist if necessary. A dozen or so lines of df are at the bottom of this post for reference.
Here's the code that produces the figure:
import datetime as dt
import matplotlib as mpl
import matplotlib.pyplot as plt
from dateutil.relativedelta import relativedelta

now = dt.datetime.now()
fig, axes = plt.subplots(2, 2, figsize=(15, 8), dpi=200)
for i, d in enumerate([360, 30, 7, 1]):
    ax = axes.flatten()[i]
    earlycut = now - relativedelta(days=d)
    data = df.loc[df.index >= earlycut, :]
    ax.plot(data.index, data['value'])
    ax.xaxis_date()
    ax.get_xaxis().set_minor_locator(mpl.ticker.AutoMinorLocator())
    ax.get_yaxis().set_minor_locator(mpl.ticker.AutoMinorLocator())
    ax.grid(b=True, which='major', color='w', linewidth=1.5)
    ax.grid(b=True, which='minor', color='w', linewidth=0.75)
What is my best option here to get the x-axis labels to stop overlapping each other (in each of the four subplots)? Also, separately (but less urgently), what's up with the minor tick issue in the top-left subplot?
I am on Pandas 0.13.1, numpy 1.8.0, and matplotlib 1.4.x.
Here's a small snippet of df for reference:
id scale tempseries_id value
timestamp
2014-11-02 14:45:10.302204+00:00 7564 F 1 68.0000
2014-11-02 14:25:13.532391+00:00 7563 F 1 68.5616
2014-11-02 14:15:12.102229+00:00 7562 F 1 68.9000
2014-11-02 14:05:13.252371+00:00 7561 F 1 69.0116
2014-11-02 13:55:11.792191+00:00 7560 F 1 68.7866
2014-11-02 13:45:10.782227+00:00 7559 F 1 68.6750
2014-11-02 13:35:10.972248+00:00 7558 F 1 68.4500
2014-11-02 13:25:10.362213+00:00 7557 F 1 68.1116
2014-11-02 13:15:10.822247+00:00 7556 F 1 68.2250
2014-11-02 13:05:10.102200+00:00 7555 F 1 68.5616
2014-11-02 12:55:10.292217+00:00 7554 F 1 69.0116
2014-11-02 12:45:10.382226+00:00 7553 F 1 69.3500
2014-11-02 12:35:10.642245+00:00 7552 F 1 69.2366
2014-11-02 12:25:12.642255+00:00 7551 F 1 69.1250
2014-11-02 12:15:11.122382+00:00 7550 F 1 68.7866
2014-11-02 12:05:11.332224+00:00 7549 F 1 68.5616
2014-11-02 11:55:11.662311+00:00 7548 F 1 68.2250
2014-11-02 11:45:11.122193+00:00 7547 F 1 68.4500
2014-11-02 11:35:11.162271+00:00 7546 F 1 68.7866
2014-11-02 11:25:12.102211+00:00 7545 F 1 69.2366
2014-11-02 11:15:10.422226+00:00 7544 F 1 69.4616
2014-11-02 11:05:11.412216+00:00 7543 F 1 69.3500
2014-11-02 10:55:10.772212+00:00 7542 F 1 69.1250
2014-11-02 10:45:11.332220+00:00 7541 F 1 68.7866
2014-11-02 10:35:11.332232+00:00 7540 F 1 68.5616
2014-11-02 10:25:11.202411+00:00 7539 F 1 68.2250
2014-11-02 10:15:11.932326+00:00 7538 F 1 68.5616
2014-11-02 10:05:10.922229+00:00 7537 F 1 68.9000
2014-11-02 09:55:11.602357+00:00 7536 F 1 69.3500
Edit: Trying fig.autofmt_xdate():
I don't think this is going to do the trick. It seems to use the same x-tick labels for both graphs on the left and also for both graphs on the right, which is not correct given my data. Please see the problematic output below:
Ok, finally got it working. The trick was to use plt.setp to manually rotate the tick labels. Using fig.autofmt_xdate() did not work, as it does some unexpected things when you have multiple subplots in your figure. Here's the working code:
for i, d in enumerate([360, 30, 7, 1]):
    ax = axes.flatten()[i]
    earlycut = now - relativedelta(days=d)
    data = df.loc[df.index >= earlycut, :]
    ax.plot(data.index, data['value'])
    ax.get_xaxis().set_minor_locator(mpl.ticker.AutoMinorLocator())
    ax.get_yaxis().set_minor_locator(mpl.ticker.AutoMinorLocator())
    ax.grid(b=True, which='major', color='w', linewidth=1.5)
    ax.grid(b=True, which='minor', color='w', linewidth=0.75)
    plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')
fig.tight_layout()
By the way, the comment earlier about some matplotlib things taking forever is very interesting here. I'm using a raspberry pi to act as a weather station at a remote location. It's collecting the data and serving the results via the web. And boy oh boy, it's really wheezing trying to put out these graphics.
Due to the way text rendering is handled in matplotlib, auto-detecting overlapping text really slows things down. (The space that text takes up can't be accurately calculated until after it's been drawn.) For that reason, matplotlib doesn't try to do this automatically.
Therefore, it's best to rotate long tick labels. Because dates most commonly have this problem, there's a figure method fig.autofmt_xdate() that will (among other things) rotate the tick labels to make them a bit more readable. (Note: If you're using a pandas plot method, it returns an axes object, so you'll need to use ax.figure.autofmt_xdate().)
As a quick example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
time = pd.date_range('01/01/2014', '4/01/2014', freq='H')
values = np.random.normal(0, 1, time.size).cumsum()
fig, ax = plt.subplots()
ax.plot_date(time, values, marker='', linestyle='-')
fig.autofmt_xdate()
plt.show()
Without fig.autofmt_xdate() the tick labels overlap; with fig.autofmt_xdate() they are rotated and stay readable.
For the problems which don't have date values in x axis, rather a string, you can insert \n character in x axis values so they don't overlap. Here is an example -
The data frame is
somecol value
category 1 of column 16
category 2 of column 13
category 3 of column 21
category 4 of column 20
category 5 of column 11
category 6 of column 22
category 7 of column 19
category 8 of column 14
category 9 of column 18
category 10 of column 23
category 11 of column 10
category 12 of column 24
category 13 of column 17
category 14 of column 15
category 15 of column 12
I need to plot value on y axis and somecol on x axis, which will normally be plotted like this -
As you can see, there is a lot of overlap. Now introduce \n character in somecol column.
somecol = df['somecol'].values.tolist()
for i in range(len(somecol)):
    x = somecol[i].split(' ')
    # insert \n before 'of'
    x.insert(x.index('of'), '\n')
    somecol[i] = ' '.join(x)
Now if you plot, it will look like this -
plt.plot(somecol, df['value'])
This method works well if you don't want to rotate your labels.
The only con I have found so far with this method is that you may need to tweak your labels 3-4 times, i.e. try multiple formats, to display the plot in the best-looking way.
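As a variation on the same idea, the line breaks can be computed automatically with textwrap rather than hand-inserting \n before a fixed word (the labels and wrap width here are made up):

```python
import textwrap

labels = ["category 1 of column", "category 2 of column"]
# wrap each label to at most 12 characters per line, then rejoin with newlines
wrapped = ["\n".join(textwrap.wrap(lbl, width=12)) for lbl in labels]
print(wrapped[0])  # two lines instead of one long label
```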
