I am trying to use a Canadian gridded historical dataset of temperature anomalies, but it seems I don't have the skills to pull that off. The .grd files contain temperature anomalies on what I believe is a highly regular grid. I have no experience with that kind of grid, and I am having trouble building the xarray dataset.
What I have (a subset of the .grd files and the text file is accessible here):
2075 '.grd' files ('t190001.grd' to 't202112.grd' following "t{year}{month}.grd" structure)
1 txt file listing the grid coordinates called "CANGRD_points_LL.txt"
From this I would like to build an xarray dataset in order to do some analysis.
Naively, I thought the grid files were already georeferenced, so I started by doing this:
import glob
import rioxarray as rio
import pandas as pd
import numpy as np
import xarray as xr
#not used for the moment even though I believe that will be needed
#df = pd.read_csv(r"CANGRD_points_LL.txt", sep = ' ', header=None)
list_files = sorted(set(glob.glob(r"t?????[0-2].grd" ) + glob.glob(r"t????0[0-9].grd" )))
times = pd.date_range("1900/01/01",freq='M', periods= len(list_files))
datarrays = [rio.open_rasterio(rst, masked=True,band_as_variable=True).assign_coords(time = t).expand_dims(dim='time').squeeze() for rst,t in zip(list_files, times)]
ds = xr.concat(datarrays,dim='time').rename({'band_1' : 'tas', 'y': 'lat', 'x' : 'lon'})
But when I plotted the results, it became evident that my coordinates were only the pixel indices:
So I believe I have to use the txt file provided. However, I have no idea how to build the xarray grid from these coordinates, or how to match them with the array obtained by loading a grid via rioxarray. Here is a sample; the complete file is available above. What baffles me is that most of the 11874 lines of the dataframe read from the txt file seem to be unique, so how could an array of 125 lon by 95 lat fit into it?
0 1 2 3
0 0 0 40.0451 -129.8530
1 0 1 40.1780 -129.3650
2 0 2 40.3080 -128.8740
3 0 3 40.4348 -128.3801
4 0 4 40.5585 -127.8834
5 0 5 40.6790 -127.3840
6 0 6 40.7963 -126.8817
7 0 7 40.9104 -126.3768
8 0 8 41.0211 -125.8693
9 0 9 41.1286 -125.3591
10 0 10 41.2327 -124.8465
11 0 11 41.3335 -124.3314
12 0 12 41.4308 -123.8140
13 0 13 41.5247 -123.2942
14 0 14 41.6151 -122.7722
15 0 15 41.7020 -122.2481
16 0 16 41.7853 -121.7218
17 0 17 41.8651 -121.1936
18 0 18 41.9413 -120.6634
19 0 19 42.0139 -120.1313
20 0 20 42.0828 -119.5975
21 0 21 42.1481 -119.0620
22 0 22 42.2097 -118.5249
23 0 23 42.2675 -117.9863
24 0 24 42.3216 -117.4462
25 0 25 42.3720 -116.9049
26 0 26 42.4186 -116.3622
27 0 27 42.4614 -115.8185
28 0 28 42.5005 -115.2736
29 0 29 42.5357 -114.7279
30 0 30 42.5670 -114.1812
31 0 31 42.5946 -113.6338
32 0 32 42.6182 -113.0857
33 0 33 42.6381 -112.5371
34 0 34 42.6540 -111.9880
35 0 35 42.6661 -111.4385
36 0 36 42.6743 -110.8888
37 0 37 42.6786 -110.3389
38 0 38 42.6791 -109.7889
39 0 39 42.6757 -109.2390
40 0 40 42.6684 -108.6892
41 0 41 42.6572 -108.1397
42 0 42 42.6421 -107.5905
43 0 43 42.6232 -107.0417
44 0 44 42.6004 -106.4935
45 0 45 42.5738 -105.9459
46 0 46 42.5433 -105.3991
47 0 47 42.5090 -104.8531
48 0 48 42.4708 -104.3081
49 0 49 42.4289 -103.7640
Here is the view of one grid file loaded as xarray,
Any help would be greatly appreciated! Thank you so much
I asked directly in the xarray GitHub discussions; here is the original answer from Keewis:
https://github.com/pydata/xarray/discussions/7443#discussioncomment-4700261
The grid file contains stacked 2D coordinates, which I guess is due to the grid's original coordinate system not being aligned with the lat / lon axes.
To read the coordinates into 2D coordinates you can use:
df = pd.read_csv(r"CANGRD_points_LL.txt", sep=" ", header=None, names=["y", "x", "lat", "lon"])
grid = df.set_index(["y", "x"]).to_xarray().set_coords(["lat", "lon"])
raw = xr.concat([...], dim="time")
ds = xr.merge([raw, grid]).assign_coords(time=times).rename_vars(...)
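To make the idea above concrete, here is a minimal sketch on made-up data. The 2 x 3 grid, the variable name `tas`, and the fake monthly fields are assumptions for illustration only; the real `raw` would come from concatenating the loaded .grd files, with `y` and `x` as pixel indices.

```python
import io

import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2 x 3 stand-in for CANGRD_points_LL.txt (columns: y, x, lat, lon).
points_txt = """0 0 40.0 -130.0
0 1 40.1 -129.5
0 2 40.2 -129.0
1 0 40.5 -130.1
1 1 40.6 -129.6
1 2 40.7 -129.1
"""
df = pd.read_csv(io.StringIO(points_txt), sep=" ", header=None,
                 names=["y", "x", "lat", "lon"])

# Unstack the (y, x) rows into 2D lat/lon arrays and mark them as coordinates.
grid = df.set_index(["y", "x"]).to_xarray().set_coords(["lat", "lon"])

# Made-up data standing in for the concatenated rasters, indexed by pixel position.
times = pd.date_range("1900-01-01", periods=2, freq="MS")
raw = xr.Dataset(
    {"tas": (("time", "y", "x"), np.arange(12, dtype=float).reshape(2, 2, 3))},
    coords={"time": times, "y": [0, 1], "x": [0, 1, 2]},
)

# Merging attaches lat(y, x) and lon(y, x) to the data variable.
ds = xr.merge([raw, grid])
print(ds)
```

Each of the 11874 rows of the text file describes one pixel, which is why nearly every (lat, lon) pair is unique: the coordinates are 2D fields over `y` and `x`, not two 1D axes. With the coordinates attached, something like `ds.tas.isel(time=0).plot(x="lon", y="lat")` should plot against latitude and longitude rather than pixel indices.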
I have a huge CSV file that I load as a dataframe. However, it has no date columns. I only have the sales for every month from Jan-2022 until Dec-2034. Below is an example of my dataframe:
import pandas as pd
data = [[6661, 'Mobile Phone', 43578, 5000, 78564, 52353, 67456, 86965, 43634, 32546, 56332, 58944, 98878, 68588, 43634, 3463, 74533, 73733, 64436, 45426, 57333, 89762, 4373, 75457, 74845, 86843, 59957, 74563, 745335, 46342, 463473, 52352, 23622],
[6672, 'Play Station', 4475, 2546, 5757, 2352, 57896, 98574, 53536, 56533, 88645, 44884, 76585, 43575, 74573, 75347, 57573, 5736, 53737, 35235, 5322, 54757, 74573, 75473, 77362, 21554, 73462, 74736, 1435, 4367, 63462, 32362, 56332],
[6631, 'Laptop', 35347, 36376, 164577, 94584, 78675, 76758, 75464, 56373, 56343, 54787, 7658, 76584, 47347, 5748, 8684, 75373, 57573, 26626, 25632, 73774, 847373, 736646, 847457, 57346, 43732, 347346, 75373, 6473, 85674, 35743, 45734],
[6600, 'Camera', 14365, 60785, 25436, 46747, 75456, 97644, 63573, 56433, 25646, 32548, 14325, 64748, 68458, 46537, 7537, 46266, 7457, 78235, 46223, 8747, 67453, 4636, 3425, 4636, 352236, 6622, 64625, 36346, 46346, 35225, 6436],
[6643, 'Lamp', 324355, 143255, 696954, 97823, 43657, 66686, 56346, 57563, 65734, 64484, 87685, 54748, 9868, 573, 73472, 5735, 73422, 86352, 5325, 84333, 7473, 35252, 7547, 73733, 7374, 32266, 654747, 85743, 57333, 46346, 46266]]
ds = pd.DataFrame(data, columns = ['ID', 'Product', 'SalesJan-22', 'SalesFeb-22', 'SalesMar-22', 'SalesApr-22', 'SalesMay-22', 'SalesJun-22', 'SalesJul-22', 'SalesAug-22', 'SalesSep-22', 'SalesOct-22', 'SalesNov-22', 'SalesDec-22', 'SalesJan-23', 'SalesFeb-23', 'SalesMar-23', 'SalesApr-23', 'SalesMay-23', 'SalesJun-23', 'SalesJul-23', 'SalesAug-23', 'SalesSep-23', 'SalesOct-23', 'SalesNov-23', 'SalesDec-23', 'SalesJan-24', 'SalesFeb-24', 'SalesMar-24', 'SalesApr-24', 'SalesMay-24', 'SalesJun-24', 'SalesJul-24'])
Since I have many monthly sales columns, I want to loop over them and insert a date column after each one. The first 6 months should get the number 1, the next 12 months the number 2, the following 12 months the number 3, the 12 months after that the number 4, and so on.
Below shows the sample of result that I want:
Is there any way to perform the loop and add the date column beside each monthly sales column?
Here is the simplest approach I can think of:
for i, col in enumerate(ds.columns[2:]):
    # position 2 * i + 3 places each new column right after its sales column
    ds.insert(2 * i + 3, col.removeprefix("Sales"), (i - 6) // 12 + 2)
Here is a vectorized approach (calling insert repeatedly is inefficient):
import numpy as np
# convert (valid) columns to datetime
cols = pd.to_datetime(ds.columns, format='Sales%b-%y', errors='coerce')
# identify valid dates
m = cols.notna()
# get year
y = cols[m].year
# calculate number (1 for first 6 months, then +1 per 12 months)
num = ((cols[m].month+12*(y-y.min()))+5)//12+1
# slice dates columns, assign the number, rename
df2 = (ds.loc[:, m].assign(**dict(zip(ds.columns[m], num)))
         .rename(columns=lambda x: x[5:])
      )
# get new order of columns
idx = np.r_[np.zeros((~m).sum()), np.tile(np.arange(m.sum()), 2)+1]
# concat and reorder (a stable sort keeps each sales column ahead of its number column)
out = pd.concat([ds, df2], axis=1).iloc[:, np.argsort(idx, kind='stable')]
print(out)
output:
ID Product SalesJan-22 Jan-22 SalesFeb-22 Feb-22 SalesMar-22 Mar-22 SalesApr-22 Apr-22 SalesMay-22 May-22 SalesJun-22 Jun-22 SalesJul-22 Jul-22 SalesAug-22 Aug-22 SalesSep-22 Sep-22 SalesOct-22 Oct-22 SalesNov-22 Nov-22 SalesDec-22 Dec-22 SalesJan-23 Jan-23 SalesFeb-23 Feb-23 SalesMar-23 Mar-23 SalesApr-23 Apr-23 SalesMay-23 May-23 SalesJun-23 Jun-23 SalesJul-23 Jul-23 SalesAug-23 Aug-23 SalesSep-23 Sep-23 SalesOct-23 Oct-23 SalesNov-23 Nov-23 SalesDec-23 Dec-23 SalesJan-24 Jan-24 SalesFeb-24 Feb-24 SalesMar-24 Mar-24 SalesApr-24 Apr-24 SalesMay-24 May-24 SalesJun-24 Jun-24 SalesJul-24 Jul-24
0 6661 Mobile Phone 43578 1 5000 1 78564 1 52353 1 67456 1 86965 1 43634 2 32546 2 56332 2 58944 2 98878 2 68588 2 43634 2 3463 2 74533 2 73733 2 64436 2 45426 2 57333 3 89762 3 4373 3 75457 3 74845 3 86843 3 59957 3 74563 3 745335 3 46342 3 463473 3 52352 3 23622 4
1 6672 Play Station 4475 1 2546 1 5757 1 2352 1 57896 1 98574 1 53536 2 56533 2 88645 2 44884 2 76585 2 43575 2 74573 2 75347 2 57573 2 5736 2 53737 2 35235 2 5322 3 54757 3 74573 3 75473 3 77362 3 21554 3 73462 3 74736 3 1435 3 4367 3 63462 3 32362 3 56332 4
2 6631 Laptop 35347 1 36376 1 164577 1 94584 1 78675 1 76758 1 75464 2 56373 2 56343 2 54787 2 7658 2 76584 2 47347 2 5748 2 8684 2 75373 2 57573 2 26626 2 25632 3 73774 3 847373 3 736646 3 847457 3 57346 3 43732 3 347346 3 75373 3 6473 3 85674 3 35743 3 45734 4
3 6600 Camera 14365 1 60785 1 25436 1 46747 1 75456 1 97644 1 63573 2 56433 2 25646 2 32548 2 14325 2 64748 2 68458 2 46537 2 7537 2 46266 2 7457 2 78235 2 46223 3 8747 3 67453 3 4636 3 3425 3 4636 3 352236 3 6622 3 64625 3 36346 3 46346 3 35225 3 6436 4
4 6643 Lamp 324355 1 143255 1 696954 1 97823 1 43657 1 66686 1 56346 2 57563 2 65734 2 64484 2 87685 2 54748 2 9868 2 573 2 73472 2 5735 2 73422 2 86352 2 5325 3 84333 3 7473 3 35252 3 7547 3 73733 3 7374 3 32266 3 654747 3 85743 3 57333 3 46346 3 46266 4
Here's a small solution. (I put the year instead of your 1, 2, ... numbering since I thought it was more representative, but you can change that easily.)
idx_counter = 0
for idx, col in enumerate(ds.columns):
    if col.startswith('Sales'):
        date = col.replace('Sales', '')
        year = col.split('-')[1]
        # shift the insert position by the number of columns already added
        ds.insert(loc=idx + 1 + idx_counter, column=date, value=[year] * ds.shape[0])
        idx_counter += 1
output:
ID Product SalesJan-22 Jan-22 SalesFeb-22 Feb-22 SalesMar-22 Mar-22 SalesApr-22 Apr-22 ... SalesMar-24 Mar-24 SalesApr-24 Apr-24 SalesMay-24 May-24 SalesJun-24 Jun-24 SalesJul-24 Jul-24
0 6661 Mobile Phone 43578 22 5000 22 78564 22 52353 22 ... 745335 24 46342 24 463473 24 52352 24 23622 24
1 6672 Play Station 4475 22 2546 22 5757 22 2352 22 ... 1435 24 4367 24 63462 24 32362 24 56332 24
2 6631 Laptop 35347 22 36376 22 164577 22 94584 22 ... 75373 24 6473 24 85674 24 35743 24 45734 24
3 6600 Camera 14365 22 60785 22 25436 22 46747 22 ... 64625 24 36346 24 46346 24 35225 24 6436 24
4 6643 Lamp 324355 22 143255 22 696954 22 97823 22 ... 654747 24 85743 24 57333 24 46346 24 46266 24
This should do the trick.
import math
new_cols = []
old_cols = [x for x in ds.columns if x.startswith('Sales')]
for i, col in enumerate(old_cols):
    new_cols.append(col[5:])
    if i < 6:
        val = 1
    else:
        val = ((i+6)/12)+1
    ds[col[5:]] = math.floor(val)
# interleave each sales column with its new number column
ds[['ID', 'Product'] + [x for y in zip(old_cols, new_cols) for x in y]]
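The numbering rule used by the answers above (1 for the first 6 months, then one more every 12 months) can be checked in isolation. The sketch below is only an illustration on a small made-up frame; `period_number`, `demo`, and the 8-month range are hypothetical names and data, not part of the original question.

```python
import pandas as pd


def period_number(i: int) -> int:
    """Map a 0-based month position to 1 for the first 6 months, then +1 every 12 months."""
    return (i + 6) // 12 + 1


# Small hypothetical frame with 8 monthly sales columns.
months = ["Jan-22", "Feb-22", "Mar-22", "Apr-22", "May-22", "Jun-22", "Jul-22", "Aug-22"]
demo = pd.DataFrame({"ID": [1, 2], "Product": ["A", "B"],
                     **{f"Sales{m}": [10, 20] for m in months}})

sales_cols = [c for c in demo.columns if c.startswith("Sales")]
# Build all number columns at once instead of inserting them one by one.
numbers = pd.DataFrame(
    {c[len("Sales"):]: period_number(i) for i, c in enumerate(sales_cols)},
    index=demo.index,
)
# Interleave: each sales column is followed by its number column.
order = ["ID", "Product"] + [name for c in sales_cols for name in (c, c[len("Sales"):])]
out = pd.concat([demo, numbers], axis=1)[order]
print(out)
```

Building the new columns in one frame and concatenating once avoids repeated `insert` calls, at the cost of spelling out the final column order explicitly.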
I have tried out the following snippet of code for my project:
import pandas as pd
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')
df=[]
hypo = wn.synset('science.n.01').hyponyms()
hyper = wn.synset('science.n.01').hypernyms()
mero = wn.synset('science.n.01').part_meronyms()
holo = wn.synset('science.n.01').part_holonyms()
ent = wn.synset('science.n.01').entailments()
df = df+hypo+hyper+mero+holo+ent
df_agri_clean = pd.DataFrame(df)
df_agri_clean.columns=["Items"]
print(df_agri_clean)
pd.set_option('display.expand_frame_repr', False)
It has given me this output of a dataframe:
Items
0 Synset('agrobiology.n.01')
1 Synset('agrology.n.01')
2 Synset('agronomy.n.01')
3 Synset('architectonics.n.01')
4 Synset('cognitive_science.n.01')
5 Synset('cryptanalysis.n.01')
6 Synset('information_science.n.01')
7 Synset('linguistics.n.01')
8 Synset('mathematics.n.01')
9 Synset('metallurgy.n.01')
10 Synset('metrology.n.01')
11 Synset('natural_history.n.01')
12 Synset('natural_science.n.01')
13 Synset('nutrition.n.03')
14 Synset('psychology.n.01')
15 Synset('social_science.n.01')
16 Synset('strategics.n.01')
17 Synset('systematics.n.01')
18 Synset('thanatology.n.01')
19 Synset('discipline.n.01')
20 Synset('scientific_theory.n.01')
21 Synset('scientific_knowledge.n.01')
Printing df itself shows the underlying list:
[Synset('agrobiology.n.01'), Synset('agrology.n.01'), Synset('agronomy.n.01'), Synset('architectonics.n.01'), Synset('cognitive_science.n.01'), Synset('cryptanalysis.n.01'), Synset('information_science.n.01'), Synset('linguistics.n.01'), Synset('mathematics.n.01'), Synset('metallurgy.n.01'), Synset('metrology.n.01'), Synset('natural_history.n.01'), Synset('natural_science.n.01'), Synset('nutrition.n.03'), Synset('psychology.n.01'), Synset('social_science.n.01'), Synset('strategics.n.01'), Synset('systematics.n.01'), Synset('thanatology.n.01'), Synset('discipline.n.01'), Synset('scientific_theory.n.01'), Synset('scientific_knowledge.n.01')]
I wish to change every word under "Items" like so :
Synset('agrobiology.n.01') => agrobiology.n.01
or
Synset('agrobiology.n.01') => 'agrobiology'
Any answer associated will be appreciated! Thanks!
To access the name of each item, call the synset's name() method. You could use a list comprehension to update the items as follows:
df_agri_clean['Items'] = [df_agri_clean['Items'][i].name() for i in range(len(df_agri_clean))]
df_agri_clean
The output will be as you expected
Items
0 agrobiology.n.01
1 agrology.n.01
2 agronomy.n.01
3 architectonics.n.01
4 cognitive_science.n.01
5 cryptanalysis.n.01
6 information_science.n.01
7 linguistics.n.01
8 mathematics.n.01
9 metallurgy.n.01
10 metrology.n.01
11 natural_history.n.01
12 natural_science.n.01
13 nutrition.n.03
14 psychology.n.01
15 social_science.n.01
16 strategics.n.01
17 systematics.n.01
18 thanatology.n.01
19 discipline.n.01
20 scientific_theory.n.01
21 scientific_knowledge.n.01
To also strip the part-of-speech and sense suffix (".n.01", ".n.03", ...), apply the following to the original Synset objects instead. Splitting on the first dot handles every suffix, whereas replacing the literal '.n.01' would leave 'nutrition.n.03' unchanged:
df_agri_clean['Items'] = [df_agri_clean['Items'][i].name().split('.')[0] for i in range(len(df_agri_clean))]
df_agri_clean
Output (just like your second expected output)
Items
0 agrobiology
1 agrology
2 agronomy
3 architectonics
4 cognitive_science
5 cryptanalysis
6 information_science
7 linguistics
8 mathematics
9 metallurgy
10 metrology
11 natural_history
12 natural_science
13 nutrition
14 psychology
15 social_science
16 strategics
17 systematics
18 thanatology
19 discipline
20 scientific_theory
21 scientific_knowledge
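If the names have already been converted to strings, the same clean-up can be done without WordNet loaded. This is a minimal sketch that assumes every name follows the `lemma.pos.nn` pattern seen in the output above; `split_synset_name` is a hypothetical helper, not an NLTK function.

```python
def split_synset_name(name: str) -> tuple:
    """Split 'lemma.pos.nn' into (lemma, pos, sense number), assuming that pattern."""
    # rsplit from the right so a lemma containing dots stays in one piece
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, pos, int(sense)


names = ["agrobiology.n.01", "nutrition.n.03", "cognitive_science.n.01"]
lemmas = [split_synset_name(n)[0] for n in names]
print(lemmas)
```

Splitting from the right keeps the part-of-speech tag and sense number available in case they are needed later, instead of discarding them with a string replace.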
EDIT 2
I fixed one part of the code that was wrong. With this line, I add the category to every data point (X axis).
y = joy(cat, EveryTest[i].GPS)
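For reference, a minimal sketch of what `joy` (as defined in the code further down) returns for a single category. The label and values here are made up for illustration:

```python
def joy(category, data, scale=20):
    # Pair every value with the same category label (scale is unused, as in the question).
    return list(zip([category] * len(data), data))


pairs = joy("12:10:00", [2, 2, 3])
print(pairs)
```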
After adding that line of code, the graph improved, but something is still failing. The graph starts at the 4th category (12:40:00), but it should start at the first (12:10:00). What am I doing wrong?
EDIT 1:
I updated Bokeh to 0.12.13, which fixed the label problem.
Now my problem is:
I assumed the loop (for i, cat in enumerate(reversed(cats)):) would put each chart on its own label, but that does not happen. I see the chart stuck at the 5th or 6th label (12:30:00 or 12:50:00).
- Start of question -
I am trying to reproduce the joyplot example, but I have trouble when I want to plot my own data. I don't want to plot a histogram; I want to plot one list on X and several lists on Y. I do not understand what I am doing wrong.
The code (fixed):
from numpy import linspace
from scipy.stats.kde import gaussian_kde
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, FixedTicker, PrintfTickFormatter
from bokeh.plotting import figure
#from bokeh.sampledata.perceptions import probly
import bokeh
bokeh.BOKEH_RESOURCES='inline'
import colorcet as cc
output_file("joyplot.html")
def joy(category, data, scale=20):
    return list(zip([category]*len(data),data))
#Elements = 7
cats = ListOfTime # list(reversed(probly.keys())) #list(['Pos_1','Pos_2']) #
print len(cats),' lengh of times'
palette = [cc.rainbow[i*15] for i in range(16)]
palette += palette
print len(palette),'lengh palette'
x = X # linspace(-20,110, 500) #Test.X #
print len(x),' lengh X'
source = ColumnDataSource(data=dict(x=x))
p = figure(y_range=cats, plot_width=900, x_range=(0, 1500), toolbar_location=None)
for i, cat in enumerate(reversed(cats)):
    y = joy(cat, EveryTest[i].GPS)
    #print cat
    source.add(y, cat)
    p.patch('x', cat, color=palette[i], alpha=0.6, line_color="black", source=source)
    #break
print source
p.outline_line_color = None
p.background_fill_color = "#efefef"
p.xaxis.ticker = FixedTicker(ticks=list(range(0, 1500, 100)))
#p.xaxis.formatter = PrintfTickFormatter(format="%d%%")
p.ygrid.grid_line_color = None
p.xgrid.grid_line_color = "#dddddd"
p.xgrid.ticker = p.xaxis[0].ticker
p.axis.minor_tick_line_color = None
p.axis.major_tick_line_color = None
p.axis.axis_line_color = None
#p.y_range.range_padding = 0.12
#p
show(p)
the variables are:
print X, type(X)
[ 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75
78 81 84 87 90 93 96 99] <type 'numpy.ndarray'>
and
print EveryTest[0].GPS, type(EveryTest[i].GPS)
0 2
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 2
17 2
18 2
19 2
20 2
21 2
22 2
23 2
24 2
25 2
26 2
27 2
28 2
29 2
30 2
31 2
32 2
Name: GPS, dtype: int64 <class 'pandas.core.series.Series'>
Compared with the example, the data types look fine. But I get the following image:
And I expected something like this: