I have an xarray.Dataset that looks roughly like this:
<xarray.Dataset>
Dimensions: (index: 286720)
Coordinates:
* index (index) int64 0 1 2 3 4 ... 286716 286717 286718 286719
Data variables:
Time (index) float64 2.525 2.525 2.525 ... 9.475 9.475 9.475
ch (index) int64 1 1 1 1 1 1 1 1 1 1 ... 2 2 2 2 2 2 2 2 2 2
pixel (index) int64 1 2 3 4 5 6 ... 1020 1021 1022 1023 1024
Rough_wavelength (index) float64 2.698 2.701 2.704 ... 32.05 32.05 32.06
Count (index) int64 463 197 265 335 305 ... 285 376 278 0 278
There are only 140 unique values for the Time variable, 2 for ch(annel), and 1024 for pixel. I'd thus like to turn them into coordinates and drop the largely irrelevant index coordinate entirely, something like this:
<xarray.Dataset>
Dimensions: (Time: 140, ch: 2, pixel: 1024)
Coordinates:
Time (Time) float64 2.525 ... 9.475
ch (ch) int64 1 2
pixel (pixel) int64 1 2 3 4 5 6 ... 1020 1021 1022 1023 1024
Data variables:
Rough_wavelength (Time, ch, pixel) float64 2.698 ... 32.06
Count (Time, ch, pixel) int64 463 ... 278
Is there a way to do this using xarray? If not, what's a sane way to do this using the standard numpy stack?
Replace the index coordinate with a pd.MultiIndex, then unstack the index:
In [10]: ds.assign_coords(
...: {
...: "index": pd.MultiIndex.from_arrays(
...: [ds.Time.values, ds.ch.values, ds.pixel.values],
...: names=["Time", "ch", "pixel"],
...: )
...: }
...: ).drop_vars(["Time", "ch", "pixel"]).unstack("index")
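On recent xarray versions, assigning a raw pd.MultiIndex this way is deprecated; a sketch of the explicit equivalent (assuming the same ds), using xr.Coordinates.from_pandas_multiindex:
import pandas as pd
import xarray as xr

# build the MultiIndex from the three variables, then attach it explicitly
midx = pd.MultiIndex.from_arrays(
    [ds.Time.values, ds.ch.values, ds.pixel.values],
    names=["Time", "ch", "pixel"],
)
ds = (
    ds.drop_vars(["Time", "ch", "pixel"])
    .assign_coords(xr.Coordinates.from_pandas_multiindex(midx, "index"))
    .unstack("index")
)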
I want to read a plaintext file using pandas.
I have entries without delimiters and with different widths like this:
59967Y98Doe John 6211100004545SO20140314- 00024278
N0546664SCHMIDT-PETER 7441100008300AW20140314- 00023643
G4894jmhTAKLONSKY-JUERGEN 4211100005000TB20140315 00023882
34875738PODESBERG-SCHUMPERTS6211100003671SO20140315 00024622
1-8 is a string.
9-28 is a string.
29-31 is numeric.
32-34 is numeric.
35-41 is numeric.
42-43 is a string.
44-51 is a date (yyyyMMdd).
52 is minus or a blank
The rest is a currency amount without a decimal point (the last two digits are always the decimal places). For example: - 00024278 = -242.78 €
I know there is pd.read_fwf
There is an argument width. I could do this:
pd.read_fwf(StringIO(txt), widths=[8], header="Personal Nr.")
But how could I read my file with different column widths?
As the s in widths suggests, you can pass a list of widths:
pd.read_fwf(io.StringIO(txt), widths=[8,20,3,3,7,2,8,1,99], header=None)
output:
0 1 2 3 4 5 6 7 8
0 59967Y98 Doe John 621 110 4545 SO 20140314 - 24278
1 N0546664 SCHMIDT-PETER 744 110 8300 AW 20140314 - 23643
2 G4894jmh TAKLONSKY-JUERGEN 421 110 5000 TB 20140315 NaN 23882
3 34875738 PODESBERG-SCHUMPERTS 621 110 3671 SO 20140315 NaN 24622
If you want names and dtypes:
df = (pd.read_fwf(io.StringIO(txt), widths=[8,20,3,3,7,2,8,1,99], header=None,
                  names=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
                  dtype={'A': str, 'B': str, 'C': int, 'D': int, 'E': int,
                         'F': str, 'G': str, 'H': str, 'I': int})
      .assign(**{'G': lambda d: pd.to_datetime(d['G'], format='%Y%m%d')})
)
output:
A B C D E F G H I
0 59967Y98 Doe John 621 110 4545 SO 2014-03-14 - 24278
1 N0546664 SCHMIDT-PETER 744 110 8300 AW 2014-03-14 - 23643
2 G4894jmh TAKLONSKY-JUERGEN 421 110 5000 TB 2014-03-15 NaN 23882
3 34875738 PODESBERG-SCHUMPERTS 621 110 3671 SO 2014-03-15 NaN 24622
df.dtypes
A object
B object
C int64
D int64
E int64
F object
G datetime64[ns]
H object
I int64
dtype: object
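Since the sign sits in column H and the last two digits are cents (per the spec above), the signed amount in euros can be derived with a small sketch (the amount column name is my own):
# '-' marks a negative amount; a blank (read as NaN) means positive
sign = df['H'].eq('-').map({True: -1, False: 1})
df['amount'] = sign * df['I'] / 100   # e.g. '- 00024278' -> -242.78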
I have an xarray Dataset that looks like the one below. I need to be able to plot any of the three data variables (si10, si10_u, avg) by latitude and longitude. However, I cannot figure out how to change the dimensions from index_id to latitude and longitude, or how to delete "index_id" from Coordinates: when I've tried that, 'latitude' and 'longitude' disappear from "Coordinates" as well. Thank you for suggestions.
Here is my xarray Dataset:
<xarray.Dataset>
Dimensions: (index: 2448, index_id: 2448)
Coordinates:
* index_id (index_id) MultiIndex
- latitude (index_id) float64 58.0 58.0 58.0 58.0 ... 23.0 23.0 23.0 23.0
- longitude (index_id) float64 -130.0 -129.0 -128.0 ... -65.0 -64.0 -63.0
Dimensions without coordinates: index
Data variables:
si10 (index) float32 1.7636629 1.899161 ... 5.9699616 5.9121003
si10_u (index) float32 1.6784391 1.7533684 ... 6.13361 6.139127
avg (index) float32 1.721051 1.8262646 ... 6.0517855 6.025614
You have two issues. First, you need to replace 'index' with 'index_id' so your data is indexed consistently. Second, to unstack 'index_id', you're looking for xr.Dataset.unstack:
ds = ds.unstack('index_id')
As an example, here's a dataset like yours:
In [16]: y = np.arange(58, 23, -1)
...: x = np.arange(-130, -63, 1)
In [17]: ds = xr.Dataset(
...: data_vars={
...: v: (("index",), np.random.random(len(x) * len(y)))
...: for v in ["si10", "si10_u", "avg"]
...: },
...: coords={
...: "index_id": pd.MultiIndex.from_product(
...: [y, x], names=["latitude", "longitude"],
...: ),
...: },
...: )
In [18]: ds
Out[18]:
<xarray.Dataset>
Dimensions: (index: 2345, index_id: 2345)
Coordinates:
* index_id (index_id) MultiIndex
- latitude (index_id) int64 58 58 58 58 58 58 58 58 ... 24 24 24 24 24 24 24
- longitude (index_id) int64 -130 -129 -128 -127 -126 ... -68 -67 -66 -65 -64
Dimensions without coordinates: index
Data variables:
si10 (index) float64 0.9412 0.7395 0.6843 ... 0.03979 0.4259 0.09203
si10_u (index) float64 0.7359 0.1984 0.5919 ... 0.5535 0.2867 0.4093
avg (index) float64 0.04257 0.1442 0.008705 ... 0.1911 0.2669 0.1498
First, reorganize your data to have consistent dims:
In [19]: index_id = ds['index_id']
In [20]: ds = (
...: ds.drop_vars("index_id")
...: .rename({"index": "index_id"})
...: .assign_coords(index_id=index_id)
...: )
Then, ds.unstack reorganizes the data to be the combinatorial product of all levels in the MultiIndex:
In [21]: ds.unstack("index_id")
Out[21]:
<xarray.Dataset>
Dimensions: (latitude: 35, longitude: 67)
Coordinates:
* latitude (latitude) int64 24 25 26 27 28 29 30 31 ... 52 53 54 55 56 57 58
* longitude (longitude) int64 -130 -129 -128 -127 -126 ... -67 -66 -65 -64
Data variables:
si10 (latitude, longitude) float64 0.9855 0.1467 ... 0.6569 0.9479
si10_u (latitude, longitude) float64 0.4672 0.2664 ... 0.4894 0.128
avg (latitude, longitude) float64 0.3738 0.01793 ... 0.1264 0.21
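With latitude and longitude as real dimensions, each variable is a 2-D array and can be plotted directly; a short usage sketch (assumes matplotlib is installed):
unstacked = ds.unstack("index_id")
unstacked["si10"].plot(x="longitude", y="latitude")  # lat/lon heatmap of si10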
I have an xarray dataset of sea surface temperature values on an x/y grid. x and y are 1D vector coordinates, so it looks like this minimal example:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
I am able to compute the lat/lon from this x/y grid, and the output is two 2D arrays. I can add them as coordinates with ds.assign_coords:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
lat (x, y) float64 30.0 30.0 30.0 30.0 30.0 ... 39.0 39.0 39.0 39.0
lon (x, y) float64 -120.0 -119.0 -118.0 -117.0 ... -113.0 -112.0 -111.0
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
But I'd like to .sel along slices of the lat/lon. This currently isn't possible, as I get the error:
ds.sel(lat=slice(32,36), lon=slice(-118, -115))
ValueError Traceback (most recent call last)
<ipython-input-20-28c79202d5f3> in <module>
----> 1 ds.sel(lat=slice(32,36), lon=slice(-118, -115))
~/.local/lib/python3.8/site-packages/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
2363 """
2364 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 2365 pos_indexers, new_indexes = remap_label_indexers(
2366 self, indexers=indexers, method=method, tolerance=tolerance
2367 )
~/.local/lib/python3.8/site-packages/xarray/core/coordinates.py in remap_label_indexers(obj, indexers, method, tolerance, **indexers_kwargs)
419 }
420
--> 421 pos_indexers, new_indexes = indexing.remap_label_indexers(
422 obj, v_indexers, method=method, tolerance=tolerance
423 )
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in remap_label_indexers(data_obj, indexers, method, tolerance)
256 new_indexes = {}
257
--> 258 dim_indexers = get_dim_indexers(data_obj, indexers)
259 for dim, label in dim_indexers.items():
260 try:
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in get_dim_indexers(data_obj, indexers)
222 ]
223 if invalid:
--> 224 raise ValueError(f"dimensions or multi-index levels {invalid!r} do not exist")
225
226 level_indexers = defaultdict(dict)
ValueError: dimensions or multi-index levels ['lat', 'lon'] do not exist
So my question is this: How can I change the dimensions of data to be (lon: (10,10), lat: (10,10)) instead of (x: 10, y: 10)? Is this even possible?
Code to reproduce the example dataset:
import numpy as np
import xarray as xr
# Create sample data
data = np.random.rand(10,10)
x = y = np.arange(10)
# Set up dataset
ds = xr.Dataset(
data_vars = dict(
data = (["x", "y"], data)
),
coords= {
"x" : x,
"y" : y
}
)
# Create example lat/lon and assign to dataset
lon, lat = np.meshgrid(np.linspace(-120, -111, 10), np.linspace(30, 39, 10))
ds = ds.assign_coords({
"lat": (["x", "y"], lat),
"lon": (["x", "y"], lon)
})
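For reference, a common workaround for selecting by value on 2-D coordinates is a boolean mask with ds.where rather than a label-based .sel; a sketch using the example above:
# keep only grid cells inside the lat/lon box; drop=True trims the result
subset = ds.where(
    (ds.lat >= 32) & (ds.lat <= 36) & (ds.lon >= -118) & (ds.lon <= -115),
    drop=True,
)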
I have an ArviZ InferenceData posterior trace which is an XArray Dataset.
In there, posterior traces for two of my random variables, a_mu_org and b_mu_org are DataArrays. Their coordinates are:
a_mu_org: (chain, draw, a_mu_org), with lengths (1, 2000, 15) respectively.
b_mu_org: (chain, draw, b_mu_org), with lengths (1, 2000, 15) respectively.
Semantically, a_mu_org and b_mu_org should really be indexed by a single categorical coordinate system of 15 organisms, rather than be separate indexes.
For a bit more clarity, here is the full dataset string repr:
<xarray.Dataset>
Dimensions: (L_dim_0: 34281, a_dim_0: 456260, a_prot_shift_dim_0: 34281, b_dim_0: 456260, b_mu_org_dim_0: 15, b_prot_shift_dim_0: 34281, chain: 1, draw: 2000, organism: 15, sigma_dim_0: 34281, t50_org_dim_0: 15, t50_prot_dim_0: 39957)
Coordinates:
* chain (chain) int64 0
* draw (draw) int64 0 1 2 3 4 5 ... 1995 1996 1997 1998 1999
* a_prot_shift_dim_0 (a_prot_shift_dim_0) object 'A0A023PXQ4_YMR173W-A' ... 'Z4YNA9_AB124611'
* b_prot_shift_dim_0 (b_prot_shift_dim_0) object 'A0A023PXQ4_YMR173W-A' ... 'Z4YNA9_AB124611'
* L_dim_0 (L_dim_0) object 'A0A023PXQ4_YMR173W-A' ... 'Z4YNA9_AB124611'
a_mu_org_dim_0 (organism) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
* a_dim_0 (a_dim_0) object 'ytzI' 'mtlF' ... 'atpG2' 'atpB2'
* b_mu_org_dim_0 (b_mu_org_dim_0) int64 0 1 2 3 4 5 ... 9 10 11 12 13 14
* b_dim_0 (b_dim_0) object 'ytzI' 'mtlF' ... 'atpG2' 'atpB2'
* t50_prot_dim_0 (t50_prot_dim_0) <U65 'Bacillus subtilis_168_lysate_R1-C0H3Q1_ytzI' ... 'Oleispira antarctica_RB-8_lysate_R1-R4YVF0_atpB2'
* t50_org_dim_0 (t50_org_dim_0) <U43 'Arabidopsis thaliana seedling lysate' ... 'Thermus thermophilus HB27 lysate'
* sigma_dim_0 (sigma_dim_0) object 'A0A023PXQ4_YMR173W-A' ... 'Z4YNA9_AB124611'
Dimensions without coordinates: organism
Data variables:
a_org_pop (chain, draw) float32 519.3236 518.8292 ... 517.84784
a_prot_shift (chain, draw, a_prot_shift_dim_0) float32 ...
b_org_pop (chain, draw) float32 11.509291 11.445394 ... 11.929538
b_prot_shift (chain, draw, b_prot_shift_dim_0) float32 ...
L_pop (chain, draw) float32 3.445896 3.4300675 ... 3.3917112
L (chain, draw, L_dim_0) float32 ...
a_mu_org (chain, draw, organism) float32 430.56827 ... 813.2518
a (chain, draw, a_dim_0) float32 ...
b_mu_org (chain, draw, b_mu_org_dim_0) float32 9.997488 ... 8.389757
b (chain, draw, b_dim_0) float32 ...
t50_prot (chain, draw, t50_prot_dim_0) float32 39.249863 ... 52.19809
t50_org (chain, draw, t50_org_dim_0) float32 43.067646 ... 96.93388
sigma (chain, draw, sigma_dim_0) float32 ...
Attributes:
created_at: 2020-04-23T08:54:58.300091
arviz_version: 0.7.0
inference_library: pymc3
inference_library_version: 3.8
I would like to make a_mu_org and b_mu_org take on dimensions (chain, draw, organism) instead of their separate a_mu_org and b_mu_org. Things I have already tried include:
Adding a coordinate called organism, and then doing trace.posterior.swap_dims({"a_mu_org_dim_0": "organism"}), but I get an error stating that "replacement dimension 'organism' is not a 1D variable along the old dimension 'a_mu_org_dim_0'".
Renaming the dimension a_mu_org_dim_0 to organism, but then I also can't swap b_mu_org_dim_0 to the new organism.
Is what I'm trying to accomplish possible?
I am not sure my solution is very good practice; it feels a little too hacky. Terminology is also quite tricky; I'll try to stick to xarray terminology but may fail in doing so. The trick is to remove the coordinates so that a_dim_0 and b_dim_0 become plain dimensions (dimensions without coordinates). Afterwards, they can both be renamed to the same name and assigned a new coordinate. Here is one example:
Starting from the following dataset called ds:
<xarray.Dataset>
Dimensions: (a_dim_0: 15, b_dim_0: 15, chain: 4, draw: 100)
Coordinates:
* chain (chain) int64 0 1 2 3
* draw (draw) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
* a_dim_0 (a_dim_0) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
* b_dim_0 (b_dim_0) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Data variables:
a (chain, draw, a_dim_0) float64 0.8152 1.189 ... 1.32 -0.2023
b (chain, draw, b_dim_0) float64 0.6447 -0.8059 ... -0.06435 -0.8666
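(A sketch of code that builds a dataset shaped like this repr, with random values standing in for the real draws:)
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "a": (("chain", "draw", "a_dim_0"), np.random.randn(4, 100, 15)),
        "b": (("chain", "draw", "b_dim_0"), np.random.randn(4, 100, 15)),
    },
    coords={
        "chain": np.arange(4),
        "draw": np.arange(100),
        "a_dim_0": np.arange(15),
        "b_dim_0": np.arange(15),
    },
)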
the following three commands do the trick (the placement of the assign_coords call does not seem to affect the output, which makes sense, but it is key to first remove the coordinates and then rename):
organism_names = [f"o{i}" for i in range(15)]
ds.reset_index(["a_dim_0", "b_dim_0"], drop=True) \
.assign_coords(organism=organism_names) \
.rename({"a_dim_0": "organism", "b_dim_0": "organism"})
Output:
<xarray.Dataset>
Dimensions: (chain: 4, draw: 100, organism: 15)
Coordinates:
* chain (chain) int64 0 1 2 3
* draw (draw) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
* organism (organism) <U3 'o0' 'o1' 'o2' 'o3' ... 'o11' 'o12' 'o13' 'o14'
Data variables:
a (chain, draw, organism) float64 0.8152 1.189 ... 1.32 -0.2023
b (chain, draw, organism) float64 0.6447 -0.8059 ... -0.8666
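With both variables on the shared organism dimension, one label-based selection now covers them both; a brief usage sketch (ds2 is my name for the result above):
ds2 = (ds.reset_index(["a_dim_0", "b_dim_0"], drop=True)
         .assign_coords(organism=organism_names)
         .rename({"a_dim_0": "organism", "b_dim_0": "organism"}))
ds2.sel(organism="o3")  # returns both a and b for the same organism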
I have two dataframes
results:
0 2211 E Winston Rd Ste B, 92806, CA 33.814547 -117.886028 4
1 P.O. Box 5601, 29304, SC 34.945855 -81.930035 6
2 4113 Darius Dr, 17025, PA 40.287768 -76.967292 8
acctypeDF:
0 rooftop
1 place
2 rooftop
I wanted to combine both these dataframes into one, so I did:
import pandas as pd
resultsfinal = pd.concat([results, acctypeDF], axis=1)
But the output is:
resultsfinal
Out[155]:
0 1 2 3 0
0 2211 E Winston Rd Ste B, 92806, CA 33.814547 -117.886028 4 rooftop
1 P.O. Box 5601, 29304, SC 34.945855 -81.930035 6 place
2 4113 Darius Dr, 17025, PA 40.287768 -76.967292 8 rooftop
As you can see, the output repeats the column label 0. Why does this happen? My objective is to drop the first column (the one with addresses), but I am getting this error:
resultsfinal.drop(columns='0')
raise KeyError('{} not found in axis'.format(labels))
KeyError: "['0'] not found in axis"
I also tried:
resultsfinal = pd.concat([results, acctypeDF], axis=1,ignore_index=True)
resultsfinal
Out[158]:
0 1 ... 4 5
0 2211 E Winston Rd Ste B, 92806, CA 33.814547 ... rooftop rooftop
1 P.O. Box 5601, 29304, SC 34.945855 ... place place
But as you see above, even though the repeated 0 label goes away, this creates a duplicate column (5).
If i do:
resultsfinal = results[results.columns[1:]]
resultsfinal
Out[161]:
1 2 ... 0 0
0 33.814547 -117.886028 ... 2211 E Winston Rd Ste B, 92806, CA rooftop
1 34.945855 -81.930035 ... P.O. Box 5601, 29304, SC place
print(resultsfinal.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
0 10 non-null object
1 10 non-null float64
2 10 non-null float64
3 10 non-null int64
4 10 non-null object
dtypes: float64(2), int64(1), object(2)
memory usage: 480.0+ bytes
Use ignore_index=True so the concatenated columns get unique, consecutive labels:
resultsfinal = pd.concat([results, acctypeDF], axis=1,ignore_index=True)
or
resultsfinal = pd.concat([results, acctypeDF], axis=1)
resultsfinal.columns=range(len(resultsfinal.columns))
print(resultsfinal)
Then remove the first column by position:
resultsfinal[resultsfinal.columns[1:]]
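As an aside on the original KeyError: after the concat, the column labels are the integers 0-3 plus a second 0, not strings, so drop(columns='0') cannot find the string label '0'. Once ignore_index=True has renumbered the columns uniquely, the address column can be dropped by its integer label:
resultsfinal = pd.concat([results, acctypeDF], axis=1, ignore_index=True)
resultsfinal = resultsfinal.drop(columns=0)  # the integer 0, not the string '0'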