How to specify date bin ranges for Seaborn displot

Problem statement
I am creating a distribution plot of flood events per N year periods starting in 1870. I am using Pandas and Seaborn. I need help with...
specifying the date range of each bin when using sns.displot, and
clearly representing my bin size specifications along the x axis.
To clarify this problem, here is the data that I am working with, what I have tried, and a description of the desired output.
The Data
The data I am using is available from the U.S. Weather service.
import pandas as pd
import bs4
import urllib.request
link = "https://water.weather.gov/ahps2/crests.php?wfo=jan&gage=jacm6&crest_type=historic"
webpage=str(urllib.request.urlopen(link).read())
soup = bs4.BeautifulSoup(webpage)
tbl = soup.find('div', class_='water_information')
vals = tbl.get_text().split(r'\n')
tcdf = pd.Series(vals).str.extractall(r'\((?P<Rank>\d+)\)\s(?P<Stage>\d+.\d+)\sft\son\s(?P<Date>\d{2}\/\d{2}\/\d{4})')\
.reset_index(drop=True)
tcdf['Stage'] = tcdf.Stage.astype(float)
total_crests_events = len(tcdf)
tcdf['Rank'] = tcdf.Rank.astype(int)
tcdf['Date'] = pd.to_datetime(tcdf.Date)
What works
I am able to plot the data with Seaborn's displot, and I can change the number of bins with the bins parameter.
The second image is closer to my desired output. However, I do not think that it's clear where the bins start and end. For example, the first two bins (reading left to right) clearly start before and end after 1880, but the precise years are not clear.
import seaborn as sns
# fig. 1: data distribution using default bin parameters
sns.displot(data=tcdf,x="Date")
# fig. 2: data distribution using 40 bins
sns.displot(data=tcdf,x="Date",bins=40)
What fails
I tried specifying date ranges using the bins input. The approach is loosely based on a previous SO thread.
my_bins = pd.date_range(start='1870',end='2025',freq='5YS')
sns.displot(data=tcdf,x="Date",bins=my_bins)
This attempt, however, produced a TypeError
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
This is a long question, so I imagine that some clarification might be necessary. Please do not hesitate to ask questions in the comments.
Thanks in advance.

Seaborn internally converts its input data to numbers so that it can do math on them, and it uses matplotlib's "unit conversion" machinery to do that. So the easiest way to pass bins that will work is to use matplotlib's date converter:
import matplotlib.dates as mdates
sns.displot(data=tcdf, x="Date", bins=mdates.date2num(my_bins))
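To make the bin boundaries explicit, it can help to look at the converted edges and the counts per bin directly. The following is a minimal sketch, not the asker's real data: the six dates are made up, and the names `dates`, `edges`, `edges_num` and `bin_labels` are only illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.dates as mdates

# Made-up crest dates standing in for tcdf["Date"]
dates = pd.to_datetime(
    ["1871-03-01", "1874-12-15", "1876-06-30",
     "1902-01-10", "1979-04-17", "2020-02-17"]
)

# One bin edge every 5 years, each on 1 January
edges = pd.date_range(start="1870", end="2025", freq="5YS")

# date2num turns datetimes into the float day numbers matplotlib uses internally
edges_num = mdates.date2num(edges)
dates_num = mdates.date2num(dates)

# Count events per bin and build a readable label for each bin
counts, _ = np.histogram(dates_num, bins=edges_num)
bin_labels = [f"{start.year}-{stop.year}"
              for start, stop in zip(edges[:-1], edges[1:])]

for label, count in zip(bin_labels, counts):
    if count:
        print(label, count)
```

To show the same boundaries on the plot, one option is to put the x ticks on the edges. `sns.displot` returns a FacetGrid, so something like `g.ax.set_xticks(edges_num[::2])` on the returned object should place a tick on every other bin edge.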

Related

How can I plot only particular values in xarray?

I am using data from cdasws to plot dynamic spectra. I am following the example found here https://cdaweb.gsfc.nasa.gov/WebServices/REST/jupyter/CdasWsExample.html
This is my code which I have modified to obtain a dynamic spectra for STEREO.
from cdasws import CdasWs
from cdasws.datarepresentation import DataRepresentation
import matplotlib.pyplot as plt
cdas = CdasWs()
import numpy as np
datasets = cdas.get_datasets(observatoryGroup='STEREO')
for index, dataset in enumerate(datasets):
    print(dataset['Id'], dataset['Label'])
variables = cdas.get_variables('STEREO_LEVEL2_SWAVES')
for variable_1 in variables:
    print(variable_1['Name'], variable_1['LongDescription'])
data = cdas.get_data('STEREO_LEVEL2_SWAVES', ['avg_intens_ahead'],
                     '2020-07-11T02:00:00Z', '2020-07-11T03:00:00Z',
                     dataRepresentation = DataRepresentation.XARRAY)[1]
print(data)
plt.figure(figsize = (15,7))
# plt.ylim(100,1000)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.yscale('log')
sorted_data.transpose().plot()
plt.xlabel("Time",size=18)
plt.ylabel("Frequency (kHz)",size=18)
plt.show()
Using this code gives a plot that looks something like this,
My question is: is there any way to plot this spectrum for only one particular frequency? For example, can I plot just the intensity values at 636 kHz?
Any help is greatly appreciated. I don't understand xarray and have never worked with it before.
Edit -
Using the command,
data_stereo.avg_intens_ahead.loc[:,625].plot()
generates a plot that looks like,
While this is useful, what I need is this:
for the dynamic spectrum, if I choose a particular frequency such as 600 kHz, can it display something like this? (I have just added white boxes to clarify what I mean.)
If you still want the plot to be 2D, but to include a subset of your data along one of the dimensions, you can provide an array of indices or a slice object. For example:
data_stereo.avg_intens_ahead.sel(
frequency=[625]
).plot()
Or
# include a 10% band on either side
data_stereo.avg_intens_ahead.sel(
frequency=slice(625*0.9, 625*1.1)
).plot()
Alternatively, if you would actually like your plot to show white space outside this selected area, you could mask your data with where:
data_stereo.avg_intens_ahead.where(
data_stereo.frequency==625
).plot()
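One thing to watch: `.sel(frequency=[625])` and `frequency == 625` only match if 625 is exactly one of the frequency coordinates. Here is a small sketch of selecting the nearest channel instead, and of masking a band while keeping the 2D shape. The array is made up; the dimension name `frequency` follows the answer above, and the real dataset may use other channel values.

```python
import numpy as np
import xarray as xr

# Made-up spectrum: 4 time steps x 5 frequency channels (kHz)
freqs = np.array([125.0, 375.0, 625.0, 875.0, 1125.0])
spectrum = xr.DataArray(
    np.arange(20.0).reshape(4, 5),
    dims=("time", "frequency"),
    coords={"time": np.arange(4), "frequency": freqs},
)

# 636 kHz is not a channel, so pick the closest one (625 kHz here)
nearest = spectrum.sel(frequency=636, method="nearest")

# Keep the 2D shape, but set everything outside a +/-10% band to NaN
band = spectrum.where(abs(spectrum.frequency - 625) <= 62.5)

print(float(nearest.frequency))
print(band.shape)
```

`nearest.plot()` would then give a line of intensity against time, while `band.plot()` keeps the time/frequency image with the masked channels left blank.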

Power law test using XY scatter plot

I have daily crude oil prices downloaded from FRED, about 10k observations; some values are blank (the code cleans them). I believe that I cannot share Excel sheets here, so I will just give you a screenshot of what the data looks like:
I calculate the differences and returns and clean up the data but I am kind of stuck.
Here is what the code looks like to get you started:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("DCOILWTICO.csv")
nan_value = float("NaN")
data.replace("", nan_value, inplace=True)
data.replace(".", nan_value, inplace=True)
data['Previous'] = data['DCOILWTICO'].shift(1)
data.dropna(subset=['Previous'],inplace=True)
data.replace("", nan_value, inplace=True)
data.replace(".", nan_value, inplace=True)
data['DCOILWTICO'] = data['DCOILWTICO'].astype(float)
data['Previous'] = data['Previous'].astype(float)
data['Diff'] = data['DCOILWTICO'] - data['Previous']
data['Return'] = (data['DCOILWTICO'] - data['Previous'])/data['Previous']
Here comes the question: I am trying to duplicate the graph below (which I believe was generated using Mathematica). The difficult part is creating the bins in the right way. Looking at the graph, there appear to be around 200 bins. The x-axis shows the returns and the y-axis shows the binned frequencies.
I think you are asking how to make equally spaced bins in logspace. If so then use the np.geomspace function (geometric space), rather than np.linspace (linear space).
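plt.figure()
# geomspace needs strictly positive endpoints, so (as an assumption) bin the nonzero absolute returns
abs_returns = data['Return'].abs()
abs_returns = abs_returns[abs_returns > 0]
bins = np.geomspace(abs_returns.min(), abs_returns.max(), 200)
plt.hist(abs_returns, bins=bins)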
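To see what geometric bins do, here is a rough sketch on made-up heavy-tailed data (a Pareto sample standing in for absolute returns, not the FRED series). Because the bins get wider to the right, the counts are divided by the bin width to get a density before comparing them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up heavy-tailed sample standing in for absolute returns (strictly positive)
sample = rng.pareto(3.0, size=10_000) + 1e-3

# 50 edges, equally spaced on a log axis
edges = np.geomspace(sample.min(), sample.max(), 50)
counts, _ = np.histogram(sample, bins=edges)

# Each edge is a constant multiple of the previous one
ratios = edges[1:] / edges[:-1]

# Divide by bin width so wide bins in the tail are not over-weighted
widths = np.diff(edges)
density = counts / (counts.sum() * widths)

# Geometric bin centres, the natural x positions on a log axis
centres = np.sqrt(edges[:-1] * edges[1:])
```

Plotting `density` against `centres` with `plt.loglog(centres, density, 'o')` would show a power-law tail as a roughly straight line.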

How to modify the output of my COXPH image drawn by cph.plot_covariate_groups

I do not know how to modify the output image produced by lifelines, since I am unfamiliar with "cph.plot_covariate_groups". Unfortunately, there seems to be no detailed description of it at this link: https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html .
What I am looking for is: (1) how to shorten the event days (x axis); I do not want to show such a long time span for the survival curve. Ideally, the axis would end at 4000. (2) If possible, I would also like to remove the baseline survival curve from the image. (3) I would also like to change the colors of the survival curves from orange/blue to something else.
import pandas as pd
from lifelines import AalenAdditiveFitter, CoxPHFitter, KaplanMeierFitter
data = pd.read_csv("cluster label.csv", index_col=0)
cph = CoxPHFitter()
cph.fit(data, duration_col="time", event_col="status")
cph.plot_covariate_groups('label', [0,1])
This is all possible. Information about specific functions and methods is available on the docs page: https://lifelines.readthedocs.io/en/latest/References.html
Specifically: https://lifelines.readthedocs.io/en/latest/lifelines.fitters.html#lifelines.fitters.coxph_fitter.CoxPHFitter.plot_covariate_groups
So try this:
import matplotlib.pyplot as plt
cph.plot_covariate_groups('label', [0,1],
    plot_baseline=False,
    cmap='coolwarm'
)
plt.xlim(0, 4000)
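Since the result is an ordinary matplotlib plot, the axis limits and line colors can also be changed afterwards on the Axes. Below is a sketch of that idea using plain matplotlib only: the two step curves are made up and stand in for the curves lifelines would draw, and the color names are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up survival curves standing in for the lifelines output
t = np.linspace(0, 10_000, 200)
fig, ax = plt.subplots()
ax.step(t, np.exp(-t / 3000), label="label=0")
ax.step(t, np.exp(-t / 6000), label="label=1")

# Shorten the x axis to 4000 days
ax.set_xlim(0, 4000)

# Recolor the existing lines one by one
for line, color in zip(ax.get_lines(), ["green", "purple"]):
    line.set_color(color)

# Build the legend last so it picks up the new colors
ax.legend()
colors = [line.get_color() for line in ax.get_lines()]
```

With lifelines, the same calls would be applied to the Axes that the plotting method returns, or to `plt.gca()` right after the plot call.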

Python plot lines with specific x values from numpy

I have a set of data files. Each file holds a number of samples per time frame, and that number depends on the system: at time t=1 one file might have 10 items and another 20, but within a given file every later frame has the same number of items. The format is time, x, y, z in columns, loaded into a numpy array. The time values identify the frame; let's take 10 items per frame as an example. For one frame I have a (10,4) numpy array in which the time values are identical, but a file contains many frames, say 100, so the full array is really (1000,4).
I want to plot the data with time on the x-axis and manipulations of the other columns on the y-axis, but I am unsure how to do this with the line plot methods in matplotlib. Normally, to provide both x and y values I believe I need a scatter plot, so I'm hoping there's a better way. Ideally, each row within a frame is treated as a separate series (so it gets its own colour), and the row at the same position in the next frame (time value) gets the same colour, giving contiguous lines. We can look at the time column to work out how many items share a time code; let's call that "n". Sample code:
a = numpy.loadtxt('sampledata.txt')
plt.plot(a[:0,:,n],a[:1,:1])
plt.show()
I think this code expresses what I'm going for, though it doesn't work.
Edit:
I hope this is what you wanted.
Seaborn's scatterplot can group data by a shared code (the time code in this case) and give each group its own color.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"E:\Programming\Python\Matplotlib\timecodes.csv",
names=["time","x","y","z","code"]) #use your file
df["time"]=pd.to_datetime(df["time"]) #recognize the data as Time
df["x"]=df["time"].dt.day # I changed the data into "Date only" and imported to x column. Easier to see on graph.
#just used random numbers in y and z in my data.
sns.scatterplot(x="x", y="y", data = df, hue = "code") #hue does the grouping
plt.show()
I used a CSV file here, but the same works for your text file if you add sep="\t" to the arguments. I also added a code column to the file. If you have one, it can group the data in the graph, so you don't have to separate the data or build a hierarchical index. To change the colors or grouping, see the seaborn website.
Hope this helps.
Alternatively, here is the method I used; Tim's answer is still accurate as well. Since the time codes are not date/time information, I modified my own code to add tags as a second column that I call "p" (they're polymers).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
datain = np.loadtxt('somefile.txt')
df = pd.DataFrame(data = datain, columns = ["t","p","x","y","z"])
ax = sns.scatterplot(x="t", y="x", data = df, hue = "p")
plt.show()
And of course the other columns can be plotted similarly if desired.
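For the line-plot version the question originally asked about, one possible approach is to reshape the array so that frame and item become separate axes. This is only a sketch on made-up data, and it assumes the rows are sorted by frame and that the items keep the same order within every frame.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: 100 frames x 10 items, columns are time, x, y, z
n_frames, n = 100, 10
rng = np.random.default_rng(0)
time = np.repeat(np.arange(n_frames, dtype=float), n)
a = np.column_stack([time, rng.normal(size=(n_frames * n, 3))])

# Items per frame = number of rows sharing the first time code
n_items = int(np.sum(a[:, 0] == a[0, 0]))

# Reshape to (frame, item, column); relies on the ordering assumption above
frames = a.reshape(-1, n_items, 4)

fig, ax = plt.subplots()
# A 2D y array gives one line per column, so one coloured line per item
lines = ax.plot(frames[:, 0, 0], frames[:, :, 1])
```

Here `frames[:, :, 1]` is the x column; swapping the last index to 2 or 3 plots y or z in the same way.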

Creating a faceted matplotlib/seaborn plot using indicator variables rather than a single column

Seaborn is great for creating faceted plots based on a categorical variable encoding the class of each facet. However, this assumes your categories are mutually exclusive. Is it possible to create a Seaborn FacetGrid (or similar) based on a set of indicator variables?
As a concrete example, think about comparing patients that are infected with one or more viruses, and plotting an attribute of interest by virus. It's possible that a patient carries more than one virus, so creating a single virus column to build a grid on is not possible. You can, however, create a set of indicator variables (one for each virus) that flags the virus for each patient. There does not seem to be a way of passing a set of indicator variables to any of the Seaborn functions to do this.
I can't imagine I'm the first person to come across this scenario, so I'm hoping there are suggestions for how to do this without coding it by hand in Matplotlib.
I don't see how to do it with FacetGrid, possibly because this isn't faceting the data, since a data record might appear several times or only once in the plot. One of the standard tricks with a set of bitfields is to read them as binary, so you see each combination of the bits. That's unambiguous but gets messy:
import pandas as pd
import seaborn as sns
from numpy.random import random, randint
from numpy import concatenate
import matplotlib.pyplot as plt
# Dummy data
vdata = pd.DataFrame(concatenate((randint(2, size=(32,4)), random(size=(32,2))), axis=1))
vdata.columns=['Species','v1','v2','v3','x','y']
# Making a binary number out of the "virusX?" fields
binary_v = vdata.v1 + vdata.v2*2 + vdata.v3*4
vdata = pd.concat((vdata, binary_v), axis=1)
vdata.columns=['Species','v1','v2','v3','x','y','binary_v']
# Plotting group membership by row
#g = sns.FacetGrid(vdata, col="Species", row='binary_v')
#g.map(plt.scatter, "x", "y")
#g.add_legend()
#plt.savefig('multiple_facet_binary_row') # Unreadably big.
h = sns.FacetGrid(vdata, col="Species", hue="binary_v")
h.map(plt.scatter, "x","y")
h.add_legend()
plt.savefig('multiple_facet_binary_hue')
If you have too many indicators to deal with the combinatorial explosion, explicitly making the new subsets works:
# Build one subset per indicator; a record can end up in several subsets
bdata = vdata[vdata.v1 + vdata.v2 + vdata.v3 == 0.].copy()
assert len(bdata) > 0  # with random dummy data the "none" group could be empty
bdata['Virus'] = 'none'
for i in ['v1','v2','v3']:
    on = vdata[vdata[i] == 1.].copy()
    on['Virus'] = i
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    bdata = pd.concat([bdata, on])
j = sns.FacetGrid(bdata, col='Species', row='Virus')
j.map(plt.scatter, 'x', 'y')
j.add_legend()
j.savefig('multiple_facet_refish')
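The same "one subset per indicator" table can also be built without a loop by melting the indicator columns into long format. This is a sketch on a tiny made-up table; the column names `patient`, `virus` and `infected` are only illustrative.

```python
import pandas as pd

# Made-up patients with one indicator column per virus
patients = pd.DataFrame({
    "patient": ["a", "b", "c"],
    "v1": [1, 0, 1],
    "v2": [0, 1, 1],
    "x": [0.1, 0.2, 0.3],
})

# One row per (patient, virus) pair, then keep only the pairs that are flagged
long_df = patients.melt(
    id_vars=["patient", "x"],
    value_vars=["v1", "v2"],
    var_name="virus",
    value_name="infected",
)
long_df = (
    long_df[long_df["infected"] == 1]
    .drop(columns="infected")
    .reset_index(drop=True)
)

print(long_df)
```

Patient "c" carries both viruses and therefore appears twice. The resulting `virus` column is a regular categorical column, so it could be passed to `sns.FacetGrid(long_df, col="virus")` like any other facet variable.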
