How to label categories in time series data? - python

I have time series data where each point on the time series is also part of a category. There are 3 categories and usually several points after each other are in the same category. I’d like to be able to plot the time series, but change the colour of the line according to which category the observation is in.
I currently have a solution which has the time series and then coloured points for each observation, based on their category, but it looks quite cluttered.
I’ve also tried splitting the categories into 3 datasets and plotting them separately, but then the lines don’t connect when the category changes in the series.
I’m using python at the minute, but as I have the data set I’m not limited to a python solution.
Data snapshot:
Date Value Group
2016-04-01 0.65 2
2016-04-02 0.66 0
2016-04-03 0.65 0
2016-04-04 0.69 1

This should do what you want. I also make use of pandas:
import pandas as pd
import matplotlib.pyplot as mpl
df = pd.read_csv("data.txt", sep=r'\s+')  # or however you build the dataframe with pandas
colors = {0: 'g', 1: 'r', 2: 'c'}  # group -> green, red, cyan
for i in range(len(df.index)):
    col = colors[df.loc[i, 'Group']]  # color is set by the group of the first point
    subdf = df.loc[i:i+1]  # select two consecutive points (.loc slices include both ends)
    mpl.plot(subdf['Date'], subdf['Value'], 'o' + col)  # plot bullet points
    mpl.plot(subdf['Date'], subdf['Value'], col)  # plot connecting line
mpl.show()
And this is the result:
The idea is to loop over the series, taking each pair of consecutive points and plotting it twice: once for the bullet points and once for the connecting segment. The color of each segment is the color associated with the group of its first point.
I added the bullet points to show the color of the last point, which may belong to a different group than the segment leading to it.
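If the markers feel cluttered, the same per-segment colouring can be sketched without a Python-level plotting loop by handing all segments to matplotlib at once. The following is only a sketch, assuming the sample rows from the question and the same group-to-colour mapping; `LineCollection` is a swapped-in technique, not part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.collections import LineCollection

# Sample rows from the question
df = pd.DataFrame({
    "Date": pd.to_datetime(["2016-04-01", "2016-04-02", "2016-04-03", "2016-04-04"]),
    "Value": [0.65, 0.66, 0.65, 0.69],
    "Group": [2, 0, 0, 1],
})
group_colors = {0: "g", 1: "r", 2: "c"}  # assumed mapping, same as in the answer

# Build one (start, end) pair per consecutive pair of observations
x = mdates.date2num(df["Date"].to_numpy())
y = df["Value"].to_numpy()
points = np.column_stack([x, y])
segments = np.stack([points[:-1], points[1:]], axis=1)  # shape (n - 1, 2, 2)

# Each segment takes the colour of the group of its first point
seg_colors = [group_colors[g] for g in df["Group"].iloc[:-1]]

fig, ax = plt.subplots()
ax.add_collection(LineCollection(segments, colors=seg_colors, linewidths=2))
ax.xaxis_date()      # interpret the x values as dates
ax.autoscale()       # collections do not rescale the axes on their own
fig.autofmt_xdate()
plt.show()
```

Because the segments share their end points, the line stays connected when the category changes.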

Related

plotly histogram facet row animation frame

Here is a sample of my data:
Time,Value,Name,Type
0,6.9,A,start
40,6.9,A,start
60,6.9,A,start
0,0.01,B,start
40,0.01,B,start
60,0.01,B,start
0,1.0,C,start
40,1.0,C,start
60,1.0,C,start
0,0.08,D,start
40,0.08,D,start
60,0.08,D,start
0,0.000131,E,End
40,0.00032,E,End
60,0.99209,E,End
0,0.002754,F,End
40,0.00392,F,End
60,0.01857,F,End
0,0.003,G,End
40,0.00516,G,End
60,0.00746,G,End
0,0.00426,H,End
40,0.0043,H,End
60,0.0095,H,End
0,0,I,End
40,0.0017,I,End
60,0.0183,I,End
And my code below:
import plotly.express as px
import pandas as pd
df=pd.read_csv('tohistogram.csv')
fig_bar = px.histogram(df,x='Name',y='Value',animation_frame='Time',color='Name',facet_row='Type')
fig_bar.update_layout(yaxis_title="value")
fig_bar.update_xaxes(matches=None)
fig_bar.for_each_xaxis(lambda xaxis: xaxis.update(showticklabels=True))
fig_bar.show()
Fig1:
Fig2:
With the data listed above, I want two histograms, separated by Type (start, End), in one figure with a single animation_frame.
I tried the code above and, as the images show, only partially achieved this:
1. In Fig1 the second histogram also shows A, B, C and D, whereas I expected only E to I.
2. Fig2 is what I see after pressing the play button and autoscaling: A-D are gone and only E-I remain.
Fig2 is what I wanted from the start: before the animation runs, the two histograms should already be split by 'Type'.
A. Is this possible? I tried a couple of things, such as removing color:
fig_bar = px.histogram(df,x='Name',y='Value',animation_frame='Time',facet_row='Type')
The histograms are then split by 'Type' (without color, of course), but the second x-axis has no labels.
B. fig_bar = px.histogram(df,x='Name',y='Value',color='Name',facet_row='Type')
This splits correctly, but there is no animation.
Is what I am trying to do possible? I need two histograms in the same figure, split by 'Type', with both color and animation_frame.
C. If it is possible, how do I change the y-axis label of the first histogram from "sum of Value" to a user-defined name, and give it its own axis range?
D. I haven't come across an example of this, but can hovering over a histogram bar show another simple line graph instead of text or a value?
Thank you

compute and plot monthly mean SST anomalies with xarray multiindex (pangeo tutorial gallery)

I'm working through the pangeo tutorial gallery and am stuck on the ENSO exercise at the end of the xarray tutorial.
You'll need to download some files:
%%bash
git clone https://github.com/pangeo-data/tutorial-data.git
Then:
import numpy as np
import xarray as xr
import pandas as pd
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
# subset years to match hint at the bottom
sst_enso = sst_enso.sel(time=sst_enso.time.dt.year>=1982)
# groupby each timepoint and find mean for entire spatial region
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
This figure matches the one shown at the bottom of the tutorial. So far so good, but I'd like to compute and plot the Oceanic Niño Index (ONI) as well. Warm (cold) phases are defined by five consecutive overlapping 3-month running means of sea surface temperature (SST) anomalies in the Niño 3.4 region that are above (below) the threshold of +0.5°C (-0.5°C).
I run into trouble because the month becomes an index.
Q1. I'm not sure how to make sure that subtracting sst_enso - enso_clim results in the correct math.
Assuming that is correct, I can compute the regional mean anomaly again and then use a rolling window mean.
enso_clim = sst_enso.sst.groupby('time.month').mean('time')
sst_anom = sst_enso - enso_clim
enso_anom = sst_anom.groupby('time').mean(dim=['lat','lon'])
oni = enso_anom.rolling(time = 3).mean()
Now I'd like to plot a bar chart of oni with positive red, negative blue. Something like this:
for example with:
oni.sst.plot.bar(color=(oni.sst < 0).map({True: 'b', False: 'r'}))
Instead oni.sst.plot() gives me:
Resetting the index enso_anom.reset_index('month', drop=True).sst still keeps month as a dimension and gives the same plot. If you drop_dims('month') then the sst data goes away.
I also tried converting to a pandas DataFrame with oni.to_dataframe(), but you end up with 5040 rows, which is 12 months x the 420 month-years I subsetted for. According to the docs, "The DataFrame is indexed by the Cartesian product of index coordinates (in the form of a pandas.MultiIndex)", so I guess that makes sense, but it is not useful. Even if you reset_index of oni before converting to a dataframe, you get the same 5040 rows. Q2. Since the dataframe must be repeating itself I can probably figure out where, but is there a "cleaner" way to do this, without each date being repeated for all 12 months?
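Your code results in a DataArray with the dimensions time and month, because subtracting the monthly climatology (indexed by month) directly from the data (indexed by time) broadcasts over both dimensions. This is why you end up with such a plot.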
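The extra month dimension can be reproduced on a small synthetic series. This is only a sketch: the variable names and the 36 made-up monthly values are assumptions for illustration, not the tutorial data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a regional-mean SST series: 36 monthly values
time = pd.date_range("2000-01-01", periods=36, freq="MS")
sst = xr.DataArray(np.arange(36, dtype=float), coords={"time": time}, dims="time")

# Monthly climatology: one value per calendar month, along a new "month" dimension
clim = sst.groupby("time.month").mean("time")

# Direct subtraction broadcasts time against month (36 x 12 values)
broadcast = sst - clim

# Grouping by month first subtracts only the matching month (36 values)
anom = sst.groupby("time.month") - clim

print(broadcast.dims, broadcast.shape)
print(anom.dims, anom.shape)
```

The grouped subtraction keeps a single time dimension, so a later to_dataframe() would have one row per date rather than one per date and month.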
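There is a trick for calculating anomalies (from the xarray documentation example referenced in the code below): group the data by month before subtracting the climatology. In addition, I would use 1986-2015 as the reference period (see the NOAA definition of the ONI, also referenced below).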
Combining both I ended up in this short code (without the bar plots):
import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
ds = sst_enso.sst.mean(dim=['lat','lon'])
enso_clim = ds.sel(time=slice('1986-01-01', '2015-12-31')).groupby("time.month").mean("time")
# ref: https://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_change.shtml
enso_anom = ds.groupby("time.month") - enso_clim
# ref: http://xarray.pydata.org/en/stable/examples/weather-data.html#Calculate-monthly-anomalies
enso_anom.plot()
oni = enso_anom.rolling(time = 3).mean()
oni.plot()
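For the red/blue bar chart asked about in the question, one possible sketch is to build a list of colours from the sign of each value and pass it to matplotlib's bar. The series below is a made-up stand-in for the ONI; with xarray, a one-dimensional oni could be converted with oni.to_series() first.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical 3-month running mean, standing in for the real ONI series
time = pd.date_range("2000-01-01", periods=24, freq="MS")
raw = pd.Series(np.sin(np.linspace(0, 4 * np.pi, 24)), index=time)
oni = raw.rolling(3).mean().dropna()  # the first two windows are incomplete

# Blue for negative values, red otherwise
colors = np.where(oni.values < 0, "b", "r").tolist()

fig, ax = plt.subplots()
x = mdates.date2num(oni.index.to_pydatetime())
ax.bar(x, oni.values, width=20, color=colors)  # width is in days
ax.axhline(0.5, color="k", linestyle="--", linewidth=0.5)   # warm threshold
ax.axhline(-0.5, color="k", linestyle="--", linewidth=0.5)  # cold threshold
ax.xaxis_date()
fig.autofmt_xdate()
plt.show()
```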

Pandas plot several df with different variables on the same barplot

I have 4 DataFrames for different locations (Indonesia, Singapore, Malaysia and Total), each containing the revenue percentages of the top 5 revenue-generating products. I have plotted them separately.
I want to combine them together on one plot where X-axis shows different locations and top-revenue-generating products for each location.
I have printed data frames and as you can see they have different products in them.
print(Ind_top_cat, Sin_top_cat, Mal_top_cat, Tot_top_cat)
Category Amt
M020P 0.144131
MH 0.099439
ML 0.055052
PB 0.050057
PPDR 0.048315
Category Amt
ML 0.480781
M015 0.073034
PPDR 0.035412
M025 0.033418
M020 0.031836
Category Amt
TN 0.343650
PPDR 0.190773
NMCN 0.118425
M015 0.047539
NN 0.038140
Category Amt
M020P 0.158575
MH 0.092012
ML 0.064179
PPDR 0.050803
PB 0.044301
Thanks to joelostblom I was able to construct a plot, however, there are still some issues.
all_countries = pd.concat([Ind_top_cat, Sin_top_cat, Mal_top_cat, Tot_top_cat])
all_countries['Category'] = all_countries.index
sns.barplot(x='Country', y='Amt',hue = 'Category',data=all_countries)
Is there any way to put the legend values (the categories) on the x-axis and the data values on top of the bars? There is no need to colour by category; I want to colour by country instead. Also, the bars are not centred and I have no idea how to fix that.
You could create a new column in each dataframe with the country name, e.g.
Ind_top_cat['Country'] = 'Indonesia'
Sin_top_cat['Country'] = 'Singapore'
Then you can create one big dataframe by concatenating the country dataframes together:
all_countries = pd.concat([Ind_top_cat, Sin_top_cat])
And finally, you can use a high level plotting library such as seaborn to assign one column to the x-axis location and one to the color of the bars:
import seaborn as sns
sns.barplot(x='Country', y='Amt', hue='Category', data=all_countries)
You can scroll down to the second example on this page to get an idea what such a plot would look like (also pasted below):
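For the follow-up request in the question (categories on the x-axis, colour by country, values on top of the bars), a plain matplotlib sketch could look like the following. It assumes only two of the country frames, truncated to three rows each, and the colour choices are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.patches import Patch

# Truncated copies of two of the frames printed in the question
ind = pd.DataFrame({"Category": ["M020P", "MH", "ML"],
                    "Amt": [0.144131, 0.099439, 0.055052]})
sin = pd.DataFrame({"Category": ["ML", "M015", "PPDR"],
                    "Amt": [0.480781, 0.073034, 0.035412]})
ind["Country"] = "Indonesia"
sin["Country"] = "Singapore"
all_countries = pd.concat([ind, sin], ignore_index=True)

country_colors = {"Indonesia": "tab:blue", "Singapore": "tab:orange"}  # arbitrary
bar_colors = all_countries["Country"].map(country_colors).tolist()

fig, ax = plt.subplots()
positions = list(range(len(all_countries)))
bars = ax.bar(positions, all_countries["Amt"], color=bar_colors)

# Categories become the tick labels, centred under each bar
ax.set_xticks(positions)
ax.set_xticklabels(all_countries["Category"], rotation=45)

# Write each value above its bar
for rect, amt in zip(bars, all_countries["Amt"]):
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height(),
            f"{amt:.1%}", ha="center", va="bottom")

# The legend now identifies countries instead of categories
ax.legend(handles=[Patch(color=c, label=name) for name, c in country_colors.items()])
plt.show()
```

Drawing one bar per row avoids the empty slots that a hue-based grouped bar plot reserves for categories a country does not have, which is a likely reason the bars looked off-centre.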

How to find the correct condition for my matplotlib scatterplot?

I'm trying to correlate two measures (DD and DRE) from a data set that contains many more columns. I created a data frame and called it 'Data'.
From this data I want to create a scatterplot of DD (x-axis) against DRE (y-axis), including only DD values between 0 and 100.
Please help me with the first line of my code, so that the condition selects DD between 0 and 100.
Also, when I plot the scatterplot I get dots beyond 100% (the y-axis is DRE in %), though I don't have any value > 100%.
Data1= Data[ Data['DD']<100]
plt.scatter(Data1.DD,Data1.DRE)
tick_val = [0,10,20,30,40,50,60,70,80,90,100]
tick_lab = ['0%','10%','20%','30%','40%','50%','60%','70%','80%','90%','100%']
plt.yticks(tick_val,tick_lab)
plt.show()
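One way to express the condition on DD is a boolean mask built with Series.between, which includes both end points by default. The sketch below uses made-up numbers in place of the real 'Data' frame; fixing the y-limits explicitly is a suggestion for checking where the points really fall, not a confirmed diagnosis of the dots beyond 100%.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values standing in for the real data set
Data = pd.DataFrame({"DD": [-5, 0, 20, 55, 100, 130],
                     "DRE": [10, 35, 60, 80, 95, 99]})

# Keep only rows with 0 <= DD <= 100
Data1 = Data[Data["DD"].between(0, 100)]

fig, ax = plt.subplots()
ax.scatter(Data1["DD"], Data1["DRE"])

# Fix the y-axis range so that the tick labels match the plotted values
tick_val = list(range(0, 101, 10))
ax.set_ylim(0, 100)
ax.set_yticks(tick_val)
ax.set_yticklabels([f"{t}%" for t in tick_val])
plt.show()
```

Printing Data1["DRE"].max() is a quick way to confirm whether any value above 100 survives the filter.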

Setting col_colors in seaborn clustermap from pandas

I have a clustermap generated from a pandas dataframe. Two of the columns are used to generate the clustermap and I need to use a 3rd column to generate a col_colors bar using sns.palplot(sns.light_palette('red')) palette (values will be from 0 - 1, light - dark colors).
The pseudo-code looks something like this:
df=pd.DataFrame(input, columns = ['Source', 'amplicon', 'coverage', 'GC'])
tiles = df.pivot(index="Source", columns="amplicon", values="coverage")
col_colors = [values from df['GC']]
sns.clustermap(tiles, vmin=0, vmax=2, col_colors=col_colors)
I'm battling to find details on how to setup the col_colors so the correct values are linked to the appropriate tiles. Some direction would be greatly appreciated.
This will be much easier to explain with sample data. I don't know what your data looks like, but say you had a bunch of GC content measurements. For instance:
import seaborn as sns
import numpy as np
import pandas as pd
data = {'16S':np.random.normal(.52, 0.05, 12),
'ITS':np.random.normal(.52, 0.05, 12),
'Source':np.random.choice(['soil', 'water', 'air'], 12, replace=True)}
df=pd.DataFrame(data)
df[:3]
16S ITS Source
0 0.493087 0.460066 air
1 0.607229 0.592945 water
2 0.577155 0.440726 water
So the data is GC content, and there is a column describing the source. Say we want to plot a cluster map of the GC content where we use the Source column to define the network:
#create a color palette with the same number of colors as unique values in the Source column
network_pal = sns.light_palette('red', len(df.Source.unique()))
#Create a dictionary where the key is the category and the values are the
#colors from the palette we just created
network_lut = dict(zip(df.Source.unique(), network_pal))
#get the series of all of the categories
networks = df.Source
#map the colors to the series. Now we have a list of colors the same
#length as our dataframe, where unique values are mapped to the same color
network_colors = pd.Series(networks).map(network_lut)
#plot the heatmap with the 16S and ITS categories with the network colors
#defined by Source column
sns.clustermap(df[['16S', 'ITS']], row_colors=network_colors, cmap='BuGn_r')
Basically, most of the above code creates a vector of colors that correspond to the Source column of the data frame. You could of course create this manually, where the first color in the list would be mapped to the first row in the dataframe, the second color to the second row, and so on (this order will change when you plot it), but that would be a lot of work. I used a red color palette because that is what you mentioned in your question, though I might recommend a different palette. I colored by rows; however, you can do the same thing for columns. Hope this helps!
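For the column case with a continuous value, as in the question (GC between 0 and 1), a possible sketch is to pass each value through a colormap and index the resulting colors by the pivoted column labels. The data below is invented and it assumes one GC value per amplicon; the column names follow the question's pseudo-code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.colors import to_hex

# Invented data: every Source/amplicon pair has a coverage, every amplicon one GC value
rng = np.random.default_rng(0)
sources = ["soil", "water", "air"]
gc_by_amplicon = {"amp1": 0.35, "amp2": 0.50, "amp3": 0.62, "amp4": 0.80}
rows = [(s, a, rng.uniform(0, 2), gc)
        for s in sources for a, gc in gc_by_amplicon.items()]
df = pd.DataFrame(rows, columns=["Source", "amplicon", "coverage", "GC"])

tiles = df.pivot(index="Source", columns="amplicon", values="coverage")

# One GC value per amplicon, indexed by the same labels as the columns of tiles
gc = df.drop_duplicates("amplicon").set_index("amplicon")["GC"]

# Map 0..1 onto light-to-dark red and store the colors as hex strings
cmap = sns.light_palette("red", as_cmap=True)
col_colors = gc.map(lambda v: to_hex(cmap(v)))

grid = sns.clustermap(tiles, vmin=0, vmax=2, col_colors=col_colors)
```

Because col_colors is a pandas Series indexed by amplicon, seaborn aligns the colors to the columns by label, so they stay attached to the right tiles after clustering reorders them.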
