I would like to plot a histogram with the value TP on the y axis and the method on the x axis. In particular, I would like to obtain a separate figure for each value of the column 'data'.
In this case I want a first histogram with values 2,1,6,9,8,1,0 and a second histogram with values 10,10,16,...
The Python version of ggplot seems to be slightly different from the R one.
FN FP TN TP data method
method
SS0208 18 0 80 2 A p=100 n=100 SNR=0.5 SS0208
SS0408 19 0 80 1 A p=100 n=100 SNR=0.5 SS0408
SS0206 14 9 71 6 A p=100 n=100 SNR=0.5 SS0206
SS0406 11 6 74 9 A p=100 n=100 SNR=0.5 SS0406
SS0506 12 6 74 8 A p=100 n=100 SNR=0.5 SS0506
SS0508 19 0 80 1 A p=100 n=100 SNR=0.5 SS0508
LKSC 20 0 80 0 A p=100 n=100 SNR=0.5 LKSC
SS0208 10 1 79 10 A p=100 n=100 SNR=10 SS0208
SS0408 10 0 80 10 A p=100 n=100 SNR=10 SS0408
SS0206 4 5 75 16 A p=100 n=100 SNR=10 SS0206
As a first step I have tried to plot only one histogram and I received an error.
df = df[df.data == df.data.unique()[0]]
In [65]: ggplot() + geom_bar(df, aes(x='method', y='TP'), stat='identity')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-65-dd47b8d85375> in <module>()
----> 1 ggplot() + geom_bar(df, aes(x='method', y='TP'), stat='identity')
TypeError: __init__() missing 2 required positional arguments: 'aesthetics' and 'data'
I have tried different combinations of commands but could not solve it.
Once this first problem is solved, I would like the histograms grouped according to the value of 'data'. This could probably be done with 'facet_wrap'.
This is probably because you called ggplot() without an argument (not sure whether that should be possible; if you think so, please open an issue at http://github.com/yhat/ggplot).
Anyway, this should work:
ggplot(df, aes(x='method', y='TP')) + geom_bar(stat='identity')
Unfortunately, faceting with geom_bar doesn't work properly yet (it only works when all facets have all levels / x values) -> Bugreport
I am trying to use a Canadian gridded historical dataset of temperature anomalies, but it seems that I don't have the skills to pull that off. The grd files contain temperature anomalies on what I believe is a highly regular grid. I have no experience with that kind of grid and I am having trouble building the xarray dataset.
What I have (a subset of the grd and the text file is accessible here) :
2075 '.grd' files ('t190001.grd' to 't202112.grd' following "t{year}{month}.grd" structure)
1 txt file listing the grid coordinates called "CANGRD_points_LL.txt"
From this I would like to build a xarray dataset in order to do some analysis.
Naively, I thought the grid files were already georeferenced and all so I started by doing this :
import glob
import rioxarray as rio
import pandas as pd
import numpy as np
import xarray as xr
#not used for the moment even though I believe that will be needed
#df = pd.read_csv(r"CANGRD_points_LL.txt", sep = ' ', header=None)
list_files = sorted(set(glob.glob(r"t?????[0-2].grd" ) + glob.glob(r"t????0[0-9].grd" )))
times = pd.date_range("1900/01/01",freq='M', periods= len(list_files))
datarrays = [rio.open_rasterio(rst, masked=True,band_as_variable=True).assign_coords(time = t).expand_dims(dim='time').squeeze() for rst,t in zip(list_files, times)]
ds = xr.concat(datarrays,dim='time').rename({'band_1' : 'tas', 'y': 'lat', 'x' : 'lon'})
But as I plotted the results it became evident that my coordinates were only the indices of the pixels :
So I believe I have to use the txt file provided. However, I have no idea how to build the xarray grid from the grid's coordinates and how to make that match the array obtained by loading a grid file via rioxarray. Here is a sample; the complete file is available above. What baffles me is that most of the 11874 lines of the dataframe resulting from the txt file seem to be unique, so how could I fit an array of dimensions 125 lon by 95 lat into it?
0 1 2 3
0 0 0 40.0451 -129.8530
1 0 1 40.1780 -129.3650
2 0 2 40.3080 -128.8740
3 0 3 40.4348 -128.3801
4 0 4 40.5585 -127.8834
5 0 5 40.6790 -127.3840
6 0 6 40.7963 -126.8817
7 0 7 40.9104 -126.3768
8 0 8 41.0211 -125.8693
9 0 9 41.1286 -125.3591
10 0 10 41.2327 -124.8465
11 0 11 41.3335 -124.3314
12 0 12 41.4308 -123.8140
13 0 13 41.5247 -123.2942
14 0 14 41.6151 -122.7722
15 0 15 41.7020 -122.2481
16 0 16 41.7853 -121.7218
17 0 17 41.8651 -121.1936
18 0 18 41.9413 -120.6634
19 0 19 42.0139 -120.1313
20 0 20 42.0828 -119.5975
21 0 21 42.1481 -119.0620
22 0 22 42.2097 -118.5249
23 0 23 42.2675 -117.9863
24 0 24 42.3216 -117.4462
25 0 25 42.3720 -116.9049
26 0 26 42.4186 -116.3622
27 0 27 42.4614 -115.8185
28 0 28 42.5005 -115.2736
29 0 29 42.5357 -114.7279
30 0 30 42.5670 -114.1812
31 0 31 42.5946 -113.6338
32 0 32 42.6182 -113.0857
33 0 33 42.6381 -112.5371
34 0 34 42.6540 -111.9880
35 0 35 42.6661 -111.4385
36 0 36 42.6743 -110.8888
37 0 37 42.6786 -110.3389
38 0 38 42.6791 -109.7889
39 0 39 42.6757 -109.2390
40 0 40 42.6684 -108.6892
41 0 41 42.6572 -108.1397
42 0 42 42.6421 -107.5905
43 0 43 42.6232 -107.0417
44 0 44 42.6004 -106.4935
45 0 45 42.5738 -105.9459
46 0 46 42.5433 -105.3991
47 0 47 42.5090 -104.8531
48 0 48 42.4708 -104.3081
49 0 49 42.4289 -103.7640
Here is the view of one grid file loaded as xarray,
Any help would be greatly appreciated! Thank you so much
I asked directly in the xarray GitHub discussions; here is the original answer from Keewis:
https://github.com/pydata/xarray/discussions/7443#discussioncomment-4700261
The grid file contains stacked 2D coordinates, which I guess is due to the grid's original coordinate system not being aligned with the lat / lon axes.
To read the coordinates into 2D coordinates you can use:
df = pd.read_csv(r"CANGRD_points_LL.txt", sep=" ", header=None, names=["y", "x", "lat", "lon"])
grid = df.set_index(["y", "x"]).to_xarray().set_coords(["lat", "lon"])
raw = xr.concat([...], dim="time")
ds = xr.merge([raw, grid]).assign_coords(time=times).rename_vars(...)
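To make the idea above concrete, here is a minimal sketch on a made-up 2 x 3 grid. The tiny DataFrame stands in for the points file, the `tas` variable and the two time steps are placeholders, and it assumes that the y/x index values of the raster match the ones in the text file (which may not hold for data loaded through rioxarray without adjustment).

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2 x 3 grid standing in for the 4-column points file:
# row index (y), column index (x), latitude, longitude.
df = pd.DataFrame(
    {
        "y": [0, 0, 0, 1, 1, 1],
        "x": [0, 1, 2, 0, 1, 2],
        "lat": [40.0, 40.1, 40.2, 41.0, 41.1, 41.2],
        "lon": [-130.0, -129.5, -129.0, -130.2, -129.7, -129.2],
    }
)

# Unstack the (y, x) rows into 2D lat/lon arrays and mark them as coordinates.
grid = df.set_index(["y", "x"]).to_xarray().set_coords(["lat", "lon"])

# Placeholder for the concatenated raster data, indexed by pixel position.
raw = xr.Dataset(
    {"tas": (("time", "y", "x"), np.zeros((2, 2, 3)))},
    coords={"y": [0, 1], "x": [0, 1, 2]},
)
times = pd.date_range("1900-01-01", periods=2, freq="MS")

# Attach the 2D coordinates to the data and label the time axis.
ds = xr.merge([raw, grid]).assign_coords(time=times)
print(ds)
```

Each pixel keeps its integer (y, x) position as dimension coordinate, while `lat` and `lon` become 2D coordinates of shape (y, x). That is why nearly every line of the text file can be unique: there is one lat/lon pair per pixel rather than one value per row or column.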
I know there are tons of similar question titles, but none of them solved my particular question.
So I have this code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# my_list contains 983 list items
df = pd.DataFrame(np.array(my_list), columns=list('ABCDEF'))
df contains 983 items composed of lists of list
df.head()
A B C D E F
0 47 5 17 16 57 58
1 6 23 34 21 46 37
2 57 5 53 42 18 55
3 43 24 36 16 39 22
4 32 53 5 18 34 29
scaler = StandardScaler().fit(df.values)
transformed_dataset = scaler.transform(df.values)
transformed_df = pd.DataFrame(data=transformed_dataset, index=df.index)
number_of_rows = df.values.shape[0] # all our lists
window_length = 983 # amount of past number list we need to take in consideration for prediction
number_of_features = df.values.shape[1] # number count
train = np.empty([number_of_rows-window_length, window_length, number_of_features], dtype=float)
label = np.empty([number_of_rows-window_length, number_of_features], dtype=float)
window_length = 982
for i in range(0, number_of_rows-window_length):
    train[i]=transformed_df.iloc[i:i+window_length,0:number_of_features]
    label[i]=transformed_df.iloc[i:i+window_length:i+window_length+1,0:number_of_features]
train.shape
(0, 983, 6)
label.shape
(0, 6)
train[0] is working fine, but when I do train[1] I get this error:
train[1]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-43-e73aed9430c6> in <module>
----> 1 train[1]
IndexError: index 1 is out of bounds for axis 0 with size 0
Likewise, label[0] is fine, but when I do label[1] I get this error:
label[1]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-45-1e13a70afa10> in <module>
----> 1 label[1]
IndexError: index 1 is out of bounds for axis 0 with size 0
How can I fix these IndexErrors?
You're creating an array whose first dimension has size 0; that's why you're getting these errors.
You're using the value number_of_rows - window_length for the first dimension, which is 983 - 983 = 0. I guess that's not what you want.
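A small sketch of the point above, using random-free dummy data in place of the scaled DataFrame. The window length of 7 is only a hypothetical choice for illustration; any value smaller than the number of rows leaves room for at least one sample.

```python
import numpy as np

number_of_rows, number_of_features = 983, 6
# Dummy data standing in for the scaled values (assumption for illustration).
data = np.arange(number_of_rows * number_of_features, dtype=float)
data = data.reshape(number_of_rows, number_of_features)

# A window as long as the data leaves no room for any sample.
bad = np.empty([number_of_rows - 983, 983, number_of_features])
print(bad.shape)  # (0, 983, 6)

# A shorter window (hypothetical value) gives a non-empty first dimension.
window_length = 7
n_samples = number_of_rows - window_length
train = np.empty([n_samples, window_length, number_of_features])
label = np.empty([n_samples, number_of_features])
for i in range(n_samples):
    train[i] = data[i:i + window_length]  # the window of past rows
    label[i] = data[i + window_length]    # the row right after the window

print(train.shape, label.shape)  # (976, 7, 6) (976, 6)
```

The first dimension is the number of (window, next row) pairs that fit into the data, so it shrinks to zero as the window length approaches the number of rows.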
Once more I need your help.
To introduce the problem, I have this:
x=[0 1 3 4 5 6 7 8]
y=[9 10 11 12 13 14 15 16]
x=x(:)
y=y(:)
X=[x.^2, x.*y,y.^2,x,y]
a=sum(X)/(X'*X)
X=
0 0 81 0 9
1 10 100 1 10
9 33 121 3 11
16 48 144 4 12
25 65 169 5 13
36 84 196 6 14
49 105 225 7 15
64 128 256 8 16
a =
-0.0139 0.0278 -0.0139 -0.2361 0.2361
Assume that the MATLAB code is correct.
I translated it to:
x=[0,1,3,4,5,6,7,8]
y=[9,10,11,12,13,14,15,16]
X=np.array([x*x,x*y,y*y,x,y]).T
a=np.sum(X)/np.dot(X.T,X) # line with the problem
X is the same.
But I get a (5,5) matrix for a.
I think the problem comes from the multiplication between X.T and X. I tried np.matmul, np.dot, transpose and T, and I don't know why I can't get a (1,5) or (5,1) vector. What is wrong in the translation between these two languages in the calculation of a?
Any suggestions?
The division of two such matrices in MATLAB:
s = sum(X)
XX = (X'*X)
a = s / XX
is solving for t the linear system t * XX = s (the / operator is right division). Since XX = X'*X is symmetric, this is equivalent to solving XX * t' = s'.
To achieve the same in Python/NumPy, just use np.linalg.solve(), making sure to call np.sum() with the correct axis parameter to mimic the behavior of MATLAB's sum(), as indicated in the comments and in @AnderBiguri's answer:
x=np.array([0,1,3,4,5,6,7,8])
y=np.array([9,10,11,12,13,14,15,16])
X=np.array([x*x,x*y,y*y,x,y]).T
s = np.sum(X, 0)
XX = np.dot(X.T, X)
a = np.linalg.solve(XX, s)
print(a)
# [-0.01388889 0.02777778 -0.01388889 -0.23611111 0.23611111]
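For completeness, here is a hedged sketch of how MATLAB's right division could be mimicked when the matrix is not symmetric. The helper name `mrdivide` and the small example matrices are made up for illustration, and the sketch only covers a square, invertible right-hand matrix.

```python
import numpy as np

def mrdivide(A, B):
    """Return t such that t @ B == A (square, invertible B assumed)."""
    # t @ B = A  is the same as  B.T @ t.T = A.T, which solve() can handle.
    return np.linalg.solve(B.T, A.T).T

B = np.array([[2.0, 1.0],
              [0.0, 4.0]])   # deliberately not symmetric
A = np.array([3.0, 5.0])

t = mrdivide(A, B)
print(t)
print(np.allclose(t @ B, A))
```

With a symmetric matrix such as X.T @ X, the transposes make no difference, which is why np.linalg.solve(XX, s) is enough in the case above.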
The issue is sum.
In MATLAB, sum by default sums over the first axis (column sums for a matrix). In NumPy, np.sum without an axis argument sums all the values.
a=np.sum(X, axis=0)/np.dot(X.T,X)
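A quick check of the shapes involved, using the arrays from the question, shows what each piece produces. Note that / in NumPy is element-wise with broadcasting, so on its own the axis fix still gives a (5, 5) array rather than a vector.

```python
import numpy as np

x = np.array([0, 1, 3, 4, 5, 6, 7, 8])
y = np.array([9, 10, 11, 12, 13, 14, 15, 16])
X = np.array([x * x, x * y, y * y, x, y]).T

total = np.sum(X)             # a single number: sum of all entries
col_sums = np.sum(X, axis=0)  # one sum per column, like MATLAB's sum(X)
print(np.ndim(total), col_sums.shape)  # 0 (5,)

# Element-wise division broadcasts the (5,) vector against the (5, 5) matrix.
elementwise = col_sums / np.dot(X.T, X)
print(elementwise.shape)  # (5, 5)
```

So the axis argument fixes the numerator, while the division itself still has to be expressed as a linear solve to reproduce the MATLAB result.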
EDIT 2
I fixed one part of the code that was wrong. With the following line of code, I add the category to every value (x axis).
y = joy(cat, EveryTest[i].GPS)
After adding that line of code, the graph improved, but something is still failing. The graph starts at the 4th category (I mean 12:40:00), and it should start at the first (12:10:00). What am I doing wrong?
EDIT 1:
I updated Bokeh to 0.12.13, and the label problem was fixed.
Now my problem is:
I assumed the for loop (for i, cat in enumerate(reversed(cats)):) would put every chart on its own label, but that does not happen. The chart is stuck at the 5th or 6th label (12:30:00 or 12:50:00).
- Start of question -
I am trying to reproduce the joyplot example, but I have trouble when I want to plot my own data. I don't want to plot a histogram; I want to plot some lists in X and some lists in Y. But I do not understand what I am doing wrong.
The code (fixed):
from numpy import linspace
from scipy.stats.kde import gaussian_kde
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, FixedTicker, PrintfTickFormatter
from bokeh.plotting import figure
#from bokeh.sampledata.perceptions import probly
bokeh.BOKEH_RESOURCES='inline'
import colorcet as cc
output_file("joyplot.html")
def joy(category, data, scale=20):
    return list(zip([category]*len(data),data))
#Elements = 7
cats = ListOfTime # list(reversed(probly.keys())) #list(['Pos_1','Pos_2']) #
print len(cats),' lengh of times'
palette = [cc.rainbow[i*15] for i in range(16)]
palette += palette
print len(palette),'lengh palette'
x = X # linspace(-20,110, 500) #Test.X #
print len(x),' lengh X'
source = ColumnDataSource(data=dict(x=x))
p = figure(y_range=cats, plot_width=900, x_range=(0, 1500), toolbar_location=None)
for i, cat in enumerate(reversed(cats)):
    y = joy(cat, EveryTest[i].GPS)
    #print cat
    source.add(y, cat)
    p.patch('x', cat, color=palette[i], alpha=0.6, line_color="black", source=source)
    #break
print source
p.outline_line_color = None
p.background_fill_color = "#efefef"
p.xaxis.ticker = FixedTicker(ticks=list(range(0, 1500, 100)))
#p.xaxis.formatter = PrintfTickFormatter(format="%d%%")
p.ygrid.grid_line_color = None
p.xgrid.grid_line_color = "#dddddd"
p.xgrid.ticker = p.xaxis[0].ticker
p.axis.minor_tick_line_color = None
p.axis.major_tick_line_color = None
p.axis.axis_line_color = None
#p.y_range.range_padding = 0.12
#p
show(p)
the variables are:
print X, type(X)
[ 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75
78 81 84 87 90 93 96 99] <type 'numpy.ndarray'>
and
print EveryTest[0].GPS, type(EveryTest[i].GPS)
0 2
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 2
17 2
18 2
19 2
20 2
21 2
22 2
23 2
24 2
25 2
26 2
27 2
28 2
29 2
30 2
31 2
32 2
Name: GPS, dtype: int64 <class 'pandas.core.series.Series'>
Following the example, the data types look fine. But I get the following image:
And I expected something like this:
I have a fairly large (~5000 rows) DataFrame, with a number of variables, say 2 ['max', 'min'], sorted by 4 parameters, ['Hs', 'Tp', 'wd', 'seed']. It looks like this:
>>> data.head()
Hs Tp wd seed max min
0 1 9 165 22 225 18
1 1 9 195 16 190 18
2 2 5 165 43 193 12
3 2 10 180 15 141 22
4 1 6 180 17 219 18
>>> len(data)
4500
I want to keep only the first 2 parameters and get the maximum standard deviation for all 'seed's calculated individually for each 'wd'.
In the end, I'm left with unique (Hs, Tp) pairs with the maximum standard deviations for each variable. Something like:
>>> stdev.head()
Hs Tp max min
0 1 5 43.31321 4.597629
1 1 6 43.20004 4.640795
2 1 7 47.31507 4.569408
3 1 8 41.75081 4.651762
4 1 9 41.35818 4.285991
>>> len(stdev)
30
The following code does what I want, but since I have little understanding about DataFrames, I'm wondering if these nested loops can be done in a different and more DataFramy way =)
import pandas as pd
import numpy as np
#
#data = pd.read_table('data.txt')
#
# don't worry too much about this ugly generator,
# it just emulates the format of my data...
total = 4500
data = pd.DataFrame()
data['Hs'] = np.random.randint(1,4,size=total)
data['Tp'] = np.random.randint(5,15,size=total)
data['wd'] = [[165, 180, 195][np.random.randint(0,3)] for _ in xrange(total)]
data['seed'] = np.random.randint(1,51,size=total)
data['max'] = np.random.randint(100,250,size=total)
data['min'] = np.random.randint(10,25,size=total)
# and here it starts. would the creators of pandas pull their hair out if they see this?
# can this be made better?
stdev = pd.DataFrame(columns = ['Hs', 'Tp', 'max', 'min'])
i=0
for hs in set(data['Hs']):
    data_Hs = data[data['Hs'] == hs]
    for tp in set(data_Hs['Tp']):
        data_tp = data_Hs[data_Hs['Tp'] == tp]
        stdev.loc[i] = [
            hs,
            tp,
            max([np.std(data_tp[data_tp['wd']==wd]['max']) for wd in set(data_tp['wd'])]),
            max([np.std(data_tp[data_tp['wd']==wd]['min']) for wd in set(data_tp['wd'])])]
        i+=1
Thanks!
PS: if curious, these are statistics on variables that depend on sea waves. Hs is wave height, Tp wave period, wd wave direction, the seeds represent different realizations of an irregular wave train, and min and max are the peaks of my variable during a certain exposure time. After all this, by means of the standard deviation and average, I can fit some distribution to the data, like Gumbel.
This could be a one-liner, if I understood you correctly:
data.groupby(['Hs', 'Tp', 'wd'])[['max', 'min']].std(ddof=0).max(level=[0, 1])
(include reset_index() on the end if you want)
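On recent pandas versions the level argument of max is no longer accepted (it was removed in pandas 2.0, as far as I know), so the same idea may need a second groupby on the index levels. Below is a sketch under that assumption; the data generator is a Python 3 rewrite of the one in the question, with a fixed seed added only to make the run reproducible.

```python
import numpy as np
import pandas as pd

# Emulated data in the same format as the question (seed is an assumption).
rng = np.random.default_rng(0)
total = 4500
data = pd.DataFrame({
    "Hs": rng.integers(1, 4, size=total),
    "Tp": rng.integers(5, 15, size=total),
    "wd": rng.choice([165, 180, 195], size=total),
    "seed": rng.integers(1, 51, size=total),
    "max": rng.integers(100, 250, size=total),
    "min": rng.integers(10, 25, size=total),
})

stdev = (
    data.groupby(["Hs", "Tp", "wd"])[["max", "min"]]
    .std(ddof=0)                    # population std per (Hs, Tp, wd), like np.std
    .groupby(level=["Hs", "Tp"])    # regroup by the first two index levels
    .max()                          # largest std over the wave directions
    .reset_index()
)
print(stdev.head())
print(len(stdev))
```

The ddof=0 matters because np.std in the loop version uses the population formula, whereas the pandas std defaults to ddof=1.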