Plotting box plots in python without grouping the data - python

I want to draw a box plot of a variable in a data frame, xldata['yaxis_data'], split according to a 1/0 mapping stored in a separate array (one_zero_map).
I have working code for this, but I am not sure it is the best way. Any help would be great.
I am unsure because I would expect boxplot to accept one_zero_map and xldata['yaxis_data'] directly, without my having to build good_ones and bad_ones and then put them in a list called final_list.
%matplotlib inline
import matplotlib.pyplot as plt
good_ones=[val for ind, val in zip(one_zero_map,xldata['yaxis_data']) if ind==1]
bad_ones=[val for ind, val in zip(one_zero_map,xldata['yaxis_data']) if ind==0]
final_list=[good_ones,bad_ones]
plt.boxplot(final_list)
To be clearer, I am looking for the Python equivalent of this R code:
# Boxplot of MPG by Car Cylinders
boxplot(mpg~cyl,data=mtcars, main="Car Milage Data",
xlab="Number of Cylinders", ylab="Miles Per Gallon")
or the Python equivalent of this GraphLab call:
sales.show(view='BoxWhisker Plot',x='zipcode',y='price')

You can use the boxplot method directly from pandas DataFrames. This code is equivalent to your R example:
# statsmodels only needed to get the R mtcars dataset
import statsmodels.api as sm
mtcars = sm.datasets.get_rdataset('mtcars').data
mtcars.boxplot('mpg', by='cyl')
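Applied to the data in the question, the same by= grouping might look like the sketch below. The numbers and the mapping are made up for illustration, and 'flag' is a hypothetical column name; only yaxis_data and one_zero_map come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-ins for xldata['yaxis_data'] and one_zero_map
xldata = pd.DataFrame({"yaxis_data": [3.1, 4.7, 2.9, 5.2, 4.1, 3.8, 5.0, 2.5]})
one_zero_map = [1, 0, 1, 0, 0, 1, 0, 1]

# Attach the mapping as a column ('flag' is a hypothetical name),
# then let pandas split the values by it: one box per distinct flag value
plot_df = xldata.assign(flag=one_zero_map)
ax = plot_df.boxplot(column="yaxis_data", by="flag")
ax.set_xlabel("one_zero_map value")
ax.set_ylabel("yaxis_data")
plt.suptitle("")  # remove the automatic "Boxplot grouped by flag" super-title
```

This avoids building good_ones and bad_ones by hand; pandas does the grouping, and it also works if the mapping ever has more than two distinct values.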

How to loop over all columns and check data distribution using Fitter library?

I need to check the distributions of all the numeric columns in a dataset, and I chose the Fitter library to do so. I loop over all the columns but get only one plot and one summary table as output. What is wrong with my code?
from fitter import Fitter
import numpy as np
df_numeric = df.select_dtypes(include=np.number).sample(n=5000)
num_cols = df_numeric.columns.tolist()
distr = ['cauchy',
'chi2',
'expon',
'exponpow',
'gamma',
'beta',
'lognorm',
'logistic',
'norm',
'powerlaw',
'rayleigh',
'uniform']
for col in num_cols:
    modif_col = df_numeric[col].fillna(0).values
    dist_fitter = Fitter(modif_col, distributions=distr)
    dist_fitter.fit()
    dist_fitter.summary()
Maybe there is another approach to check distributions in a loop?
Your loop does visit every numeric column and fit the listed distributions to each one. You see only one plot and one table because every call to dist_fitter.summary() draws on the same matplotlib figure, so each iteration overwrites the previous plot, and the table that summary() returns is discarded inside a loop unless you print it.
To get a separate plot and table per column, open a new figure at the start of each iteration, title it with the column name, show it, and print the returned table.
Here is an example:
import matplotlib.pyplot as plt
for col in num_cols:
    modif_col = df_numeric[col].fillna(0).values
    dist_fitter = Fitter(modif_col, distributions=distr)
    dist_fitter.fit()
    plt.figure()  # fresh figure for this column
    summary_table = dist_fitter.summary()  # draws the histogram with the best fits and returns the table
    plt.title(col)
    plt.show()
    print(col)
    print(summary_table)
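As for another approach: if the goal is only to rank candidate distributions per column, a plain loop over scipy.stats may be enough. This is a different technique, not Fitter's API. It fits each candidate by maximum likelihood and scores it with the Kolmogorov-Smirnov statistic. The data frame, its column names and the shortened candidate list below are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up numeric data standing in for df_numeric
rng = np.random.default_rng(0)
df_numeric = pd.DataFrame({
    "a": rng.normal(loc=5.0, scale=2.0, size=2000),
    "b": rng.exponential(scale=3.0, size=2000),
})

candidates = ["norm", "expon", "logistic"]  # a subset of the list in the question

rows = []
for col in df_numeric.columns:
    values = df_numeric[col].dropna().to_numpy()
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(values)  # maximum-likelihood parameter estimates
        ks_stat, _ = stats.kstest(values, name, args=params)
        rows.append({"column": col, "distribution": name, "ks_stat": ks_stat})

results = pd.DataFrame(rows)
# Smallest KS statistic = closest fit among the candidates, per column
best = results.loc[results.groupby("column")["ks_stat"].idxmin()]
print(best)
```

The sketch drops missing values instead of using fillna(0), because filling with zeros adds an artificial spike at 0 that can distort every fit.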

How can I plot only particular values in xarray?

I am using data from cdasws to plot dynamic spectra. I am following the example found here https://cdaweb.gsfc.nasa.gov/WebServices/REST/jupyter/CdasWsExample.html
This is my code, which I have modified to obtain a dynamic spectrum for STEREO.
from cdasws import CdasWs
from cdasws.datarepresentation import DataRepresentation
import matplotlib.pyplot as plt
cdas = CdasWs()
import numpy as np
datasets = cdas.get_datasets(observatoryGroup='STEREO')
for index, dataset in enumerate(datasets):
    print(dataset['Id'], dataset['Label'])
variables = cdas.get_variables('STEREO_LEVEL2_SWAVES')
for variable_1 in variables:
    print(variable_1['Name'], variable_1['LongDescription'])
data = cdas.get_data('STEREO_LEVEL2_SWAVES', ['avg_intens_ahead'],
'2020-07-11T02:00:00Z', '2020-07-11T03:00:00Z',
dataRepresentation = DataRepresentation.XARRAY)[1]
print(data)
plt.figure(figsize = (15,7))
# plt.ylim(100,1000)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.yscale('log')
sorted_data.transpose().plot()
plt.xlabel("Time",size=18)
plt.ylabel("Frequency (kHz)",size=18)
plt.show()
Using this code gives a dynamic spectrum plot (image not shown).
My question is: is there any way to plot this spectrum for only one particular frequency? For example, can I plot just the intensity values at 636 kHz?
Any help is greatly appreciated. I don't understand xarray and have never worked with it before.
Edit:
Using the command
data_stereo.avg_intens_ahead.loc[:,625].plot()
generates a plot (image not shown).
While this is useful, what I need is this:
for the dynamic spectrum, if I choose a particular frequency such as 600 kHz, can it display something like this (image not shown; I added white boxes to clarify what I mean)?
If you still want the plot to be 2D but restricted to a subset of your data along one of the dimensions, you can pass .sel a list of coordinate values or a slice object. For example:
data_stereo.avg_intens_ahead.sel(
frequency=[625]
).plot()
Or
# include a 10% band on either side
data_stereo.avg_intens_ahead.sel(
frequency=slice(625*0.9, 625*1.1)
).plot()
Alternatively, if you would actually like your plot to show white space outside this selected area, you could mask your data with where:
data_stereo.avg_intens_ahead.where(
data_stereo.frequency==625
).plot()
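The three selections can be tried on a small synthetic array first. In this sketch the intensities and the frequency channels are made up; only the names avg_intens_ahead and frequency come from the question. It also uses method="nearest", an option of .sel that picks the closest channel when the requested value (636 kHz in the question) is not an exact coordinate.

```python
import numpy as np
import xarray as xr

# Made-up stand-in for data_stereo.avg_intens_ahead: (time, frequency) intensities
time = np.arange(6)
frequency = np.array([125.0, 375.0, 625.0, 875.0, 1125.0])  # kHz, hypothetical channels
intens = xr.DataArray(
    np.random.default_rng(0).random((time.size, frequency.size)),
    coords={"time": time, "frequency": frequency},
    dims=("time", "frequency"),
    name="avg_intens_ahead",
)

# 1D time series at the channel closest to 636 kHz (no exact match needed)
nearest = intens.sel(frequency=636, method="nearest")

# 2D subset: every channel between 500 and 1000 kHz
band = intens.sel(frequency=slice(500, 1000))

# Same shape as the input, NaN everywhere except the chosen channel
masked = intens.where(intens.frequency == float(nearest.frequency))

print(float(nearest.frequency), band.sizes["frequency"], int(masked.notnull().sum()))
```

Appending .plot() to nearest gives a line plot, while band and masked still plot as 2D spectra; masked keeps the full frequency axis, with blank space outside the selected channel.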

Accessing (the right) data when using holoviews/bokeh

I am having difficulties accessing (the right) data when using holoviews/bokeh, either for connected plots showing a different aspect of the dataset, or just customising a plot with dynamic access to the data as plotted (say a tooltip).
TLDR: How do I add a projection plot of my dataset (a different set of dimensions, linked to the main plot, like a marginal distribution but not restricted to a histogram or distribution)? A similar solution would probably also answer a related question I asked here on SO.
Let me give an example (straight from an ipynb, so it should be quite reproducible):
import numpy as np
import random, pandas as pd
import bokeh
import datashader as ds
import holoviews as hv
from holoviews import opts
from holoviews.operation.datashader import datashade, shade, dynspread, spread, rasterize
hv.extension('bokeh')
With the imports set up, let's create a dataset (N target 10e12 ;) to use with datashader. Besides the key dimensions, I really need some value dimensions (here z and z2).
import numpy as np
import pandas as pd
N = int(10e6)
x_r = (0,100)
y_r = (100,2000)
z_r = (0,10e8)
x = np.random.randint(x_r[0]*1000,x_r[1]*1000,size=(N, 1))
y = np.random.randint(y_r[0]*1000,y_r[1]*1000,size=(N, 1))
z = np.random.randint(z_r[0]*1000,z_r[1]*1000,size=(N, 1))
z2 = np.ones((N,1)).astype(int)
df = pd.DataFrame(np.column_stack([x,y,z,z2]), columns=['x','y','z','z2'])
df[['x','y','z']] = df[['x','y','z']].div(1000, axis=0)
df
Now I plot the data, rasterised, and also activate the tooltip to see the defaults. Sure, x/y is trivial, but as I said, I care about the value dimensions. It shows z2 as x_y z2. I have a question related to tooltips with the same sort of data here on SO for value dimension access for the tooltips.
from matplotlib import colormaps  # matplotlib.cm.get_cmap was removed in matplotlib 3.9
palette = colormaps['viridis']
# palette_inv = palette.reversed()
p=hv.Points(df,['x','y'], ['z','z2'])
P=rasterize(p, aggregator=ds.sum("z2"),x_range=(0,100)).opts(cmap=palette)
P.opts(tools=["hover"]).opts(height=500, width=500,xlim=(0,100),ylim=(100,2000))
Now I can add a histogram or a marginal distribution which is pretty close to what I want, but there are issues with this soon past the trivial defaults. (E.g.: P << hv.Distribution(p, kdims=['y']) or P.hist(dimension='y',weight_dimension='x_y z',num_bins = 2000,normed=True))
Both approaches come close, but neither gives me the other value dimension I'd like to visualise. If I try to access the other value dimension ('x_y z'), it fails. Also, the 'x_y z2' naming seems very clumsy; is there a better way?
When I do something like this, my browser/notebook-extension blows up, of course.
transformed = p.transform(x=hv.dim('z'))
P << hv.Curve(transformed)
So how do I access all my data in the right way?

How to correctly use the savitzky-golay filter or other filters (python)

I am trying to get data set 1 to equal data set 2 using a filter. So far the only thing I have tried is the Savitzky-Golay filter in Python. The problem I have been tasked with is finding out how data set 1 is transformed into data set 2, and my best guess is a filter.
from scipy.signal import savgol_filter
import matplotlib.pyplot as plt
import pandas as pd
data = [0.725/12,0.057833333,0.045,0.036666667,0.031916667,0.029083333,0.0295,0.0315,0.025166667,
0.025083333,0.027166667,0.03,0.036833333,0.041083333,0.050166667,0.060416667,0.058416667,0.054833333,
0.053916667,0.049666667,0.0415,0.030083333,0.023083333,0.017416667,0.015666667,0.0125,0.014166667,
0.014333333,0.01525,0.018583333,0.006833333,0.001083333,-0.006833333,-0.012166667,-0.016416667,-0.019166667,
-0.015916667,-0.01475,-0.008916667,-0.012083333,-0.013083333,-0.012333333,-0.002916667,0.007166667,0.003416667,
-0.00475,-0.009,-0.01175,-0.013166667,-0.013583333,-0.01175,-0.011833333,-0.009333333,-0.008166667,
-0.005,-0.004416667,-0.004916667,-0.005833333,-0.014416667,-0.019666667,-0.020916667,-0.02475,-0.02675,
-0.027,-0.026916667,-0.027333333,-0.0255,-0.026,-0.02025,-0.014833333,-0.0095,-0.004916667,
-0.014916667, -0.024166667,-0.027666667,-0.029916667,-0.029833333,-0.029,-0.0255,-0.020166667,-0.013,
-0.012083333,-0.00775,-0.002416667,0.004666667,0.014,0.003166667,-0.008166667,-0.0175,-0.0275,
-0.0305,-0.031333333,-0.02925,-0.02625,-.295/12 ]
data2=[0.0579166666666667,0.0554166666666667,0.0426666666666667,0.0344166666666667,0.02975,0.0269166666666667,0.0274166666666667,0.0295,0.02325,
0.02325,0.0254166666666667,0.0283333333333333,0.03525,0.0395833333333333,0.0486666666666667,0.059,0.0570833333333333,0.0535833333333333,
0.05275,0.0485833333333333,0.0405,0.0291666666666667,0.0221666666666667,0.0165833333333333,0.0149166666666667,0.0118333333333333,0.0135833333333333,
0.0138333333333333,0.0148333333333333,0.01825,0.00658333333333333,0.000833333333333333,-0.007,-0.01225,-0.0164166666666667,-0.0190833333333333,
-0.01575,-0.0145,-0.00858333333333333,-0.01175,-0.0126666666666667,-0.0118333333333333,-0.00233333333333333,0.00783333333333333,0.00416666666666667,
-0.00391666666666667,-0.00808333333333333,-0.01075,-0.0121666666666667,-0.0125,-0.0105833333333333,-0.0105833333333333,-0.008,-0.00675,
-0.0035,-0.00283333333333333,-0.00333333333333333,-0.00416666666666667,-0.0126666666666667,-0.0178333333333333,-0.019,-0.02275,-0.0246666666666667,
-0.0248333333333333,-0.0246666666666667,-0.0250833333333333,-0.0231666666666667,-0.0235833333333333,-0.01775,-0.01225,-0.00683333333333333,-0.00216666666666667,
-0.0120833333333333,-0.0213333333333333,-0.02475,-0.0269166666666667,-0.02675,-0.0258333333333333,-0.02225,-0.0168333333333333,-0.00958333333333333,
-0.00858333333333333,-0.00425,0.00116666666666667,0.00833333333333333,0.01775,0.007,-0.00425,-0.0135,-0.0234166666666667,
-0.0264166666666667,-0.0271666666666667,-0.025,-0.0219166666666667,-0.0201666666666667]
plt.plot(data)
plt.plot(data2)
plt.plot((savgol_filter(data,5,2)))
plt.show()
Can anyone explain how to change the filter parameters to make this work, or suggest another filter that would? I am open to any suggestions for solving this problem.
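Before tuning window_length and polyorder, it may help to look at what the target series actually removes from the input and compare that with what the filter removes. The sketch below does this for the first eight samples of each series, copied from the lists above. It is a diagnostic under that assumption, not a solution. On these samples the difference data - data2 is positive and shrinks slowly from sample to sample (and it is negative by the last sample of the full lists), which looks more like an offset or drift correction than smoothing. Whether that holds across the full series needs checking.

```python
import numpy as np
from scipy.signal import savgol_filter

# First eight samples of each series, copied from the question
data = np.array([0.725/12, 0.057833333, 0.045, 0.036666667,
                 0.031916667, 0.029083333, 0.0295, 0.0315])
data2 = np.array([0.0579166666666667, 0.0554166666666667, 0.0426666666666667,
                  0.0344166666666667, 0.02975, 0.0269166666666667,
                  0.0274166666666667, 0.0295])

# What the target series removes from the input
diff = data - data2

# What the Savitzky-Golay filter from the question removes
# (window_length=5 samples, polyorder=2)
smoothed = savgol_filter(data, window_length=5, polyorder=2)
filter_residual = data - smoothed

# Straight-line fit to the difference: slope and intercept
slope, intercept = np.polyfit(np.arange(diff.size), diff, 1)

print("data - data2:       ", np.round(diff, 6))
print("data - savgol(data):", np.round(filter_residual, 6))
print("linear fit of diff: slope=%.3g intercept=%.3g" % (slope, intercept))
```

A smoothing filter leaves a residual that changes sign with the local wiggles of the signal, so a difference that keeps one sign and drifts steadily is hard to reproduce by changing the filter parameters alone.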

Python plot lines with specific x values from numpy

I have a bunch of data files. Each file holds a number of samples per time frame, and that number depends on the system: at time t=1 one file might have 10 items and another 20, but within a file the number of items per frame never changes. The format is time, x, y, z in columns, loaded into a numpy array. The time values identify the frame. Taking 10 items as an example, each frame is a (10,4) block whose time values are identical, and with, say, 100 frames in the file the full array is (1000,4).
I want to plot the data with time on the x-axis and manipulations of the other columns on the y-axis, but I am unsure how to do this with the line plot methods in matplotlib. Normally, to provide both x and y values I believe I would need a scatter plot, so I am hoping there is a better way. Ideally each row within a frame (each line sharing a time code) is treated as a separate series with its own colour, and the data on the same row number in the next frame (the next time value) gets the same colour, giving nice contiguous lines. We can look at the time column to work out how many items share a time code; let's call that "n". Sample code:
a = numpy.loadtxt('sampledata.txt')
plt.plot(a[:0,:,n],a[:1,:1])
plt.show()
I think this code expresses what I'm going for, though it doesn't work.
Edit:
I hope this is what you wanted.
seaborn's scatterplot can group rows that share a code (the time code in this case) and give each group its own colour.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"E:\Programming\Python\Matplotlib\timecodes.csv",
names=["time","x","y","z","code"]) #use your file
df["time"]=pd.to_datetime(df["time"]) #recognize the data as Time
df["x"]=df["time"].dt.day # I changed the data into "Date only" and imported to x column. Easier to see on graph.
#just used random numbers in y and z in my data.
sns.scatterplot(data=df, x="x", y="y", hue="code") # hue does the grouping; newer seaborn requires x and y as keywords
plt.show()
I used a CSV file here, but the same works for a tab-separated text file if you add sep="\t" to the read_csv call. I also added a code column to the file. If your data has one, hue can group the data in the graph, so you don't have to separate the data or build a hierarchical index. To change the colours or the grouping, see the seaborn website.
Hope this helps.
Alternatively, here is the method I used, though Tim's answer is still accurate as well. Since my time codes are not date/time information, I modified my own code to add tags as a second column that I call "p" (they're polymers).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
datain = np.loadtxt('somefile.txt')
df = pd.DataFrame(data=datain, columns=["t","p","x","y","z"])
ax = sns.scatterplot(data=df, x="t", y="x", hue="p")
plt.show()
And of course the other columns can be plotted similarly if desired.
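For connected lines rather than scatter points, plain matplotlib can also do this without a tag column. A minimal sketch, assuming the rows are ordered by time and the items appear in the same order within every frame; the array below is made up in place of the loaded file, and the names n_items and frames are only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in for np.loadtxt('sampledata.txt'): columns are time, x, y, z
n_frames, n = 100, 10
rng = np.random.default_rng(0)
times = np.repeat(np.arange(n_frames, dtype=float), n)  # each time code repeated n times
xyz = rng.normal(size=(n_frames * n, 3))
a = np.column_stack([times, xyz])  # shape (1000, 4)

# n from the question: how many rows share the first time code
n_items = np.count_nonzero(a[:, 0] == a[0, 0])

# Regroup into (frame, item, column); row k of every frame is the same item
frames = a.reshape(-1, n_items, 4)

# Given 2D inputs, plot() draws one line per column, so one line per item
lines = plt.plot(frames[:, :, 0], frames[:, :, 1])
plt.xlabel("time")
plt.ylabel("x")
```

Each item gets its own colour from the default colour cycle and keeps it across frames. Swapping the last index (1 for x, 2 for y, 3 for z) plots the other columns.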
