I've merged a list of DataArrays into one Dataset using the code below:
surface_dataarray = []
for (key, value) in surfaces_item_service.items():
    print(f'{key}: {value}')
    single_surface_class = __yearly_surface_type(folder_path, provider, data_source, bins, key)
    single_surface_class.name = key
    if single_surface_class.count() > 1:
        single_surface_class.rio.to_raster(output_file_path + f'/{key}.tif', driver="GTiff")
    surface_dataarray.append(single_surface_class)
surface_data = xr.merge(surface_dataarray)
And I obtained a Dataset like the one below:
<xarray.Dataset>
Dimensions: (x: 1868, y: 1373)
Coordinates:
band int64 1
* x (x) float64 4.269e+05 4.269e+05 ... 4.455e+05 4.455e+05
* y (y) float64 4.53e+06 4.53e+06 ... 4.516e+06 4.516e+06
spatial_ref int64 0
year int64 2020
variable <U15 'CLASSIFIED DATA'
Data variables:
water_surface (y, x) float32 nan nan nan nan nan ... nan nan nan nan
non_green_surface (y, x) float32 nan nan nan nan nan ... nan nan nan nan
green_surface (y, x) float32 nan nan nan nan nan ... nan nan nan nan
Is it possible to sum the Data variables?
I need to save it as a single band.
NB: there are NaN values because my area doesn't have values at the borders.
Yes, one way to do it is to use Dataset.to_array, which will combine all the data variables into a single array along a new "variable" dimension:
sum_of_data_variables = ds.to_array().sum("variable")
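If you also need to write the result as a single-band GeoTIFF, here is a minimal sketch with rioxarray, assuming ds is the merged surface_data Dataset and the output filename is hypothetical:
import xarray as xr
import rioxarray  # registers the .rio accessor used below

# ds is the merged Dataset (surface_data in the question)
total = ds.to_array().sum("variable", skipna=True, min_count=1)
# min_count=1 keeps NaN where every variable is NaN (e.g. at the borders),
# instead of turning all-NaN pixels into 0
total.rio.to_raster("total_surface.tif", driver="GTiff")  # single-band raster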
Here is my data if anyone wants to try to reproduce the problem:
https://github.com/LunaPrau/personal/blob/main/O_paired.csv
I have a pd.DataFrame (called O) of 1402 rows × 1402 columns, whose columns and index are both of the form ['XXX-icsd', 'YYY-icsd', ...]. The cell values are mostly np.float64, some np.nan, and, problematically, some are pandas.core.series.Series.
            202324-icsd  644068-icsd  27121-icsd  93847-icsd  154319-icsd
202324-icsd    0.000000     0.029729         NaN    0.098480     0.097867
644068-icsd         NaN     0.000000         NaN    0.091311     0.091049
27121-icsd     0.144897     0.137473         0.0    0.081610     0.080442
93847-icsd          NaN          NaN         NaN    0.000000     0.005083
154319-icsd         NaN          NaN         NaN         NaN     0.000000
The problem is that some cells (e.g. O.loc["192693-icsd", "192401-icsd"]) contain a pandas.core.series.Series of the form:
192693-icsd 0.129562
192693-icsd 0.129562
Name: 192401-icsd, dtype: float64
I'm struggling to make this cell contain only a np.float64.
I tried:
O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0]
and various other known ways of assigning a new value to a cell of a pd.DataFrame, but they only assign a new element to the same Series in this cell; e.g. if I do
O.loc["192693-icsd", "192401-icsd"] = 5
then when calling O.loc["192693-icsd", "192401-icsd"] I get:
192693-icsd 5.0
192693-icsd 5.0
Name: 192401-icsd, dtype: float64
How do I modify O.loc["192693-icsd", "192401-icsd"] so that it is of type np.float64?
It's not that df.loc["192693-icsd", "192401-icsd"] contains a Series; your index just isn't unique. This is especially obvious looking at these outputs:
>>> df.loc["192693-icsd"]
202324-icsd 644068-icsd 27121-icsd 93847-icsd 154319-icsd 28918-icsd 28917-icsd ... 108768-icsd 194195-icsd 174188-icsd 159632-icsd 89111-icsd 23308-icsd 253341-icsd
192693-icsd NaN NaN NaN NaN 0.146843 NaN NaN ... NaN 0.271191 NaN NaN NaN NaN 0.253996
192693-icsd NaN NaN NaN NaN 0.146843 NaN NaN ... NaN 0.271191 NaN NaN NaN NaN 0.253996
[2 rows x 1402 columns]
# And the fact that this returns the same:
>>> df.at["192693-icsd", "192401-icsd"]
192693-icsd 0.129562
192693-icsd 0.129562
Name: 192401-icsd, dtype: float64
You can fix this with a groupby, but you have to decide what to do with the non-unique groups. It looks like they're the same, so we'll combine them with max:
df = df.groupby(level=0).max()
Now it'll work as expected:
>>> df.loc["192693-icsd", "192401-icsd"]
0.129562120551387
Your non-unique index values are:
>>> df.index[df.index.duplicated()]
Index(['193303-icsd', '192693-icsd', '416602-icsd'], dtype='object')
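For illustration, a minimal toy frame (hypothetical data) that reproduces the behavior and the fix:
import pandas as pd

# the index label "x" is duplicated, mimicking the situation above
df = pd.DataFrame({"a": [1.0, 1.0, 2.0]}, index=["x", "x", "y"])
print(type(df.loc["x", "a"]))   # pandas Series: two rows match the label
df = df.groupby(level=0).max()  # collapse the duplicate labels
print(type(df.loc["x", "a"]))   # numpy.float64: the label is now unique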
IIUC, you can try DataFrame.applymap to check each cell's type and take the first element if it is a Series:
df = df.applymap(lambda x: x.iloc[0] if isinstance(x, pd.Series) else x)
It works as expected for O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0]
Check this colab link: https://colab.research.google.com/drive/1XFXuj4OBu8GXQx6DTqv04XellmFcFWbC?usp=sharing
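Note that in recent pandas (2.1 and later) DataFrame.applymap is deprecated in favour of the element-wise DataFrame.map, so the same one-liner becomes:
df = df.map(lambda x: x.iloc[0] if isinstance(x, pd.Series) else x)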
I have two lists that I use to create a dictionary: list1 has text data and list2 is a list of (text, float) tuples. The goal is to create a dataframe whose rows are labelled by the elements of list1, whose columns are named after each unique text term from the first tuple element, and whose cells hold the float values that connect them.
For example, here's the dictionary, with keys {be, associate, induce, represent} and values like [('prove', 0.583171546459198), ('serve', 0.4951282739639282)], etc.:
{'be': [('prove', 0.583171546459198), ('serve', 0.4951282739639282), ('render', 0.4826732873916626), ('represent', 0.47748714685440063), ('lead', 0.47725602984428406), ('replace', 0.4695377051830292), ('contribute', 0.4529820680618286)],
'associate': [('interact', 0.8237789273262024), ('colocalize', 0.6831706762313843)],
'induce': [('suppress', 0.8159114718437195), ('provoke', 0.7866303324699402), ('elicit', 0.7509980201721191), ('inhibit', 0.7498961687088013), ('potentiate', 0.742023229598999), ('produce', 0.7384929656982422), ('attenuate', 0.7352016568183899), ('abrogate', 0.7260081768035889), ('trigger', 0.717864990234375), ('stimulate', 0.7136563658714294)],
'represent': [('prove', 0.6612186431884766), ('evoke', 0.6591314673423767), ('up-regulate', 0.6582908034324646), ('synergize', 0.6541063785552979), ('activate', 0.6512928009033203), ('mediate', 0.6494284272193909)]}
Desired Output
prove serve render represent
be 0.58 0.49 0.48 0.47
associate 0 0 0 0
induce 0.45 0.58 0.9 0.7
represent 0.66 0 0 1
So what trips me up is that the verb prove can be found under more than one key (i.e. for the key be the score is 0.58, and for the key represent the score is 0.66).
If I use df = pd.DataFrame.from_dict(d, orient='index'), then the verb prove will appear twice as a column name, whereas I want each term to appear as a column only once.
Can someone help?
With the dictionary that you provided (as d), you can't use from_dict directly.
You either need to rework the dictionary to have elements as dictionaries:
pd.DataFrame.from_dict({k: dict(v) for k,v in d.items()}, orient='index')
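Note that this leaves NaN where a verb does not occur under a key; to get zeros instead, as in the desired output, you can append fillna:
pd.DataFrame.from_dict({k: dict(v) for k, v in d.items()}, orient='index').fillna(0)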
Or you need to read it as a Series and reshape:
(pd.Series(d).explode()
.apply(pd.Series)
.set_index(0, append=True)[1]
.unstack(fill_value=0)
)
output:
prove serve render represent lead replace \
be 0.583172 0.495128 0.482673 0.477487 0.477256 0.469538
represent 0.661219 NaN NaN NaN NaN NaN
associate NaN NaN NaN NaN NaN NaN
induce NaN NaN NaN NaN NaN NaN
contribute interact colocalize suppress ... produce \
be 0.452982 NaN NaN NaN ... NaN
represent NaN NaN NaN NaN ... NaN
associate NaN 0.823779 0.683171 NaN ... NaN
induce NaN NaN NaN 0.815911 ... 0.738493
attenuate abrogate trigger stimulate evoke up-regulate \
be NaN NaN NaN NaN NaN NaN
represent NaN NaN NaN NaN 0.659131 0.658291
associate NaN NaN NaN NaN NaN NaN
induce 0.735202 0.726008 0.717865 0.713656 NaN NaN
synergize activate mediate
be NaN NaN NaN
represent 0.654106 0.651293 0.649428
associate NaN NaN NaN
induce NaN NaN NaN
[4 rows x 24 columns]
I have a large dataframe containing daily timeseries of prices for 10,000 columns (stocks) over a period of 20 years (5000 rows x 10000 columns). Missing observations are indicated by NaNs.
0 1 2 3 4 5 6 7 8 \
31.12.2009 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
01.01.2010 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
04.01.2010 31.85 66.99 NaN NaN NaN NaN 404.93 57.04 NaN
05.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
06.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
Now I want to run a rolling regression with a 250-day window for each column over the whole sample period and save the coefficients in another dataframe.
Iterating over the columns and rows using two for-loops isn't very efficient, so I tried the following, but I get this error message:
def regress(start, end):
    y = df_returns.iloc[start:end].values
    if np.isnan(y).any() == False:
        X = np.arange(len(y))
        X = sm.add_constant(X, has_constant="add")
        model = sm.OLS(y, X).fit()
        return model.params[1]
    else:
        return np.nan

regression_window = 250
for t in (regression_window, len(df_returns.index)):
    df_coef[t] = df_returns.apply(regress(t-regression_window, t), axis=1)
TypeError: ("'float' object is not callable", 'occurred at index 31.12.2009')
Here is my version, using df.rolling() instead and iterating over the columns. I am not completely sure it is what you were looking for, so don't hesitate to comment.
import numpy as np
import pandas as pd
import statsmodels.regression.linear_model as sm
import statsmodels.tools.tools as sm2

df_returns = pd.DataFrame({'0': [30, 30, 31, 32, 32], '1': [60, 60, 60, 60, 60], '2': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})

def regress(X, Z):
    # skip windows that contain any missing observation
    if np.isnan(X).any() == False:
        model = sm.OLS(X, Z).fit()
        return model.params[1]  # slope coefficient
    else:
        return np.NaN

regression_window = 3
Z = np.arange(regression_window)
Z = sm2.add_constant(Z, has_constant="add")

df_coef = pd.DataFrame()
for col in df_returns.columns:
    df_coef[col] = df_returns[col].rolling(window=regression_window).apply(lambda win: regress(win, Z))
df_coef
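One optional speed-up with this many columns: passing raw=True makes rolling().apply() hand regress a plain NumPy array instead of constructing a Series for every window:
df_coef[col] = df_returns[col].rolling(window=regression_window).apply(lambda win: regress(win, Z), raw=True)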
I have a DataFrame of the following form:
df = pd.DataFrame({"06":{'6/6/2006':'5','6/24/2006':'3','8/24/2006':'3'}, "06_01":{}, "06_02":{}, "06_03":{} ,"06_04":{} ,"06_05":{} ,"06_06":{'6/6/2006':'5', '6/24/2006':'3'} ,"06_07":{} ,"06_08":{'8/24/2006':'3'}, "06_09":{} ,"06_10":{} ,"06_11":{}, "06_12":{}})
where each column represents all observations in a given year, or year_month period. I would like to average all the dictionary values within each given year_month period. So the output for 06_06 would be simply 4.
Any advice is greatly appreciated.
Just convert df to float and call mean:
df.astype('float').mean()
Out[738]:
06 3.666667
06_01 NaN
06_02 NaN
06_03 NaN
06_04 NaN
06_05 NaN
06_06 4.000000
06_07 NaN
06_08 3.000000
06_09 NaN
06_10 NaN
06_11 NaN
06_12 NaN
dtype: float64
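Note that mean skips NaN by default, so each column is averaged over only the dates it actually contains. If you want just the year_month columns, you can drop the yearly column first (assuming it is the one named '06'):
df.astype('float').drop(columns='06').mean()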
I want to extract a 12° x 12° region from lat/long/conductivity grids and calculate the mean conductivity values in this region. I can successfully apply masks on the lat/long grids, but somehow the same process is not working for the conductivity grid.
I've tried masking with for loops and now I'm using numpy.ma.masked_where function. I can successfully plot masked results (i.e: I can see that the region is extracted when I plot global maps), but the calculated mean conductivity values are corresponding to non-masked data.
I did a simple example of what I want to do:
import numpy as np
import numpy.ma as ma
import matplotlib.pyplot as plt

x = np.linspace(1, 10, 10)
y = np.linspace(1, 10, 10)
xm = np.median(x)
ym = np.median(y)
x = ma.masked_outside(x, xm-3, xm+3)
y = ma.masked_outside(x, ym-3, ym+3)
x = np.ma.filled(x.astype(float), np.nan)
y = np.ma.filled(y.astype(float), np.nan)
x, y = np.meshgrid(x, y)
z = 2*x + 3*y
z = np.ma.masked_where(np.ma.getmask(x), z)
plt.pcolor(x, y, z)
plt.colorbar()
print('Maximum z:', np.nanmax(z))
print('Minimum z:', np.nanmin(z))
print('Mean z:', np.nanmean(z))
My code is:
def Observatory_Cond_Plot(filename, ndcfile, obslon, obslat, obsname, date):
    files = np.array(sorted(glob.glob(filename)))  # sort the txt files containing the 2-D conductivity arrays
    filenames = ['January', 'February', 'March', 'April', 'May', 'June',
                 'July', 'August', 'September', 'October', 'November', 'December']  # used for naming output plots and files
    for i, fx in zip(filenames, files):
        ndcdata = Dataset(ndcfile)  # load netcdf file
        lat = ndcdata.variables['latitude'][:]  # import latitude data
        long = ndcdata.variables['longitude'][:]  # import longitude data
        cond = np.genfromtxt(fx)
        cond, long = shiftgrid(180., cond, long, start=False)

        # Mask lat and long arrays and fill masks with nan values
        lat = ma.masked_outside(lat, obslat-12, obslat+12)
        long = ma.masked_outside(long, obslon-12, obslon+12)
        lat = np.ma.filled(lat.astype(float), np.nan)
        long = np.ma.filled(long.astype(float), np.nan)
        longrid, latgrid = np.meshgrid(long, lat)
        cond = np.ma.masked_where(np.ma.getmask(longrid), cond)
        cond = np.ma.filled(cond.astype(float), np.nan)
        condmean = np.nanmean(cond)
        print('Mean Conductivity is:', condmean)
        print('Minimum conductivity is:', np.nanmin(cond))
        print('Maximum conductivity is:', np.nanmax(cond))
After that, the rest of the code just plots the data
My results are:
Mean Conductivity is: 3.5241649673154587
Minimum conductivity is: 0.497494528344129
Maximum conductivity is: 5.997825822915771
However, from my maps it is clear that the conductivity in this region should not be lower than 3.2 S/m. Also, printing the lat, long and cond grids:
long:
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
lat:
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
cond:
[[ nan nan nan ... nan nan nan]
[ nan nan nan ... nan nan nan]
[2.86749432 2.86743283 2.86746221 ... 2.87797247 2.87265508 2.87239185]
...
[ nan nan nan ... nan nan nan]
[ nan nan nan ... nan nan nan]
[ nan nan nan ... nan nan nan]]
And it seems like the mask is not working properly.
The problem is that the call to np.ma.filled de-masks the long variable. Also, np.meshgrid doesn't preserve the masks.
You could save the masks directly after creation and also build a meshgrid of the masks. I adapted your example accordingly. What can be seen is that all versions of the numpy mean take the mask into account. I had to adapt the upper limit (changed to 2), because otherwise the means were equal.
import numpy as np

x = np.linspace(1, 10, 10)
y = np.linspace(1, 10, 10)
xm = np.median(x)
ym = np.median(y)
# Note: changed limits
x = np.ma.masked_outside(x, xm-3, xm+2)
y = np.ma.masked_outside(x, ym-3, ym+2)
xmask = np.ma.getmask(x)
ymask = np.ma.getmask(y)
x, y = np.meshgrid(x, y)
xmask, ymask = np.meshgrid(xmask, ymask)
z = 2*x + 3*y
z1 = np.ma.masked_where(np.ma.getmask(x), z)
z2 = np.ma.masked_where(xmask | ymask, z)
print(z1)
print(z2)
print('Type z1, z2:', type(z1), type(z2))
print('Maximum z1, z2:', np.nanmax(z1), np.nanmax(z2))
print('Minimum z1, z2:', np.nanmin(z1), np.nanmin(z2))
print('Mean z1, z2:', np.mean(z1), np.mean(z2) )
print('nan Mean z1, z2:', np.nanmean(z1), np.nanmean(z2) )
print('masked Mean z1, z2:', z1.mean(), z2.mean())
Beware that any kind of simple mean calculation (summing and dividing by the count), such as np.mean, will not give you the correct answer if you are averaging over a lat-lon grid, since the cell area changes as you move towards the poles. You need to take a weighted average, weighting by cos(lat).
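In numpy, a minimal sketch of such a weighted mean, assuming cond is the 2-D (lat, lon) conductivity array with NaN outside the region and lat is the 1-D latitude array in degrees:
import numpy as np

weights = np.cos(np.deg2rad(lat))                    # cos(latitude) area weighting
w2d = np.broadcast_to(weights[:, None], cond.shape)  # repeat along the longitude axis
masked = np.ma.masked_invalid(cond)                  # mask the NaN cells
weighted_mean = np.ma.average(masked, weights=w2d)   # area-weighted regional mean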
As you say you have the data in netCDF format, I hope you will permit me to suggest an alternative solution from the command line, using the Climate Data Operators (cdo) utility (on Ubuntu you can install it with sudo apt install cdo).
To extract the region of interest:
cdo sellonlatbox,lon1,lon2,lat1,lat2 infile.nc outfile.nc
Then you can work out the correct weighted mean with:
cdo fldmean infile.nc outfile.nc
You can pipe the two together like this:
cdo fldmean -sellonlatbox,lon1,lon2,lat1,lat2 infile.nc outfile.nc