I'm using xarray's groupby + reducer to perform spatial overlay/aggregation on spatial rasters. Is there a way to use a different reducer for certain data variables? In the code below, for instance, I would like categorical_variable to be reduced with first() (or mode, but that doesn't seem to be implemented) and continuous_variable to be reduced with mean().
import xarray as xr
import numpy as np

categorical_variable = np.array([[1, 1, 1, 1, 1],
                                 [1, 1, 1, 1, 2],
                                 [1, 1, 1, 2, 2],
                                 [1, 1, 2, 2, 2],
                                 [1, 2, 2, 2, 2]], dtype='int16')
grouping_variable = np.array([[1, 1, 1, 2, 2],
                              [1, 1, 3, 2, 2],
                              [1, 3, 3, 3, 3],
                              [3, 3, 3, 3, 3],
                              [4, 4, 4, 4, 4]], dtype='int16')
continuous_variable = np.random.rand(5, 5)

xr_dataset = xr.Dataset({'grouping_variable': xr.DataArray(grouping_variable, dims=['x', 'y']),
                         'categorical_variable': xr.DataArray(categorical_variable, dims=['x', 'y']),
                         'continuous_variable': xr.DataArray(continuous_variable, dims=['x', 'y'])})

xr_grouped = xr_dataset.groupby('grouping_variable')
xr_reduced = xr_grouped.mean()
This isn't possible in one go in xarray currently, AFAIK, but since you're losing the spatial structure anyway, you can go via pandas quite simply and use agg:
>>> df = xr_dataset.to_dataframe()
>>> df.groupby('grouping_variable').agg({"categorical_variable": "first",
...                                      "continuous_variable": "mean"})
                   categorical_variable  continuous_variable
grouping_variable
1                                     1             0.458534
2                                     1             0.822294
3                                     1             0.539483
4                                     1             0.515586
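If you need the result back as an xarray object afterwards (a side note, not part of the original answer), the aggregated frame can be converted with DataFrame.to_xarray(), since the group labels end up as the index:
# convert the aggregated pandas result back to an xarray Dataset;
# 'grouping_variable' (the index) becomes the dimension/coordinate
agg = df.groupby('grouping_variable').agg({"categorical_variable": "first",
                                           "continuous_variable": "mean"})
xr_result = agg.to_xarray()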
The performance is not optimal but this is what I ended up doing:
xr_dataset = xr.merge([
    xr_dataset.categorical_variable.groupby('grouping_variable').first(),
    xr_dataset.continuous_variable.groupby('grouping_variable').mean(),
    ...
])
I've written a piece of code to extract data from an HDF5 file and save it into a dataframe that I can export as a .csv later. The final dataframe has roughly 2.5 million rows, and the script takes a long time to run.
Is there any way I can optimize this code so that it runs more efficiently?
The current runtime is 7.98 minutes!
Ideally I would want to run this program for 48 files like these, so I'm hoping for a much faster run time.
Link to source file: https://drive.google.com/file/d/1g2fpJHZmD5FflfB4s3BlAoiB5sGISKmg/view
import h5py
import numpy as np
import pandas as pd
#import geopandas as gpd

#%%
f = h5py.File('mer.h5', 'r')

for key in f.keys():
    #print(key)           # names of the root-level objects in the HDF5 file - can be groups or datasets
    #print(type(f[key]))  # object type: usually group or dataset
    ls = list(f.keys())

# Get the HDF5 group; key needs to be a group name from above
key = 'DHI'
#group = f['OBSERVATION_TIME']
#print("Group")
#print(group)
#for key in ls:
#    data = f.get(key)
#    dataset1 = np.array(data)
#    length = len(dataset1)

masterdf = pd.DataFrame()

data = f.get(key)
dataset1 = np.array(data)
#masterdf[key] = dataset1

X = f.get('X')
X_1 = pd.DataFrame(X)
Y = f.get('Y')
Y_1 = pd.DataFrame(Y)

#%%
data_df = pd.DataFrame(index=range(len(Y_1)), columns=range(len(X_1)))
for i in data_df.index:
    data_df.iloc[i] = dataset1[0][i]
#data_df.to_csv("test.csv")

#%%
final = pd.DataFrame(index=range(1616*1616), columns=['X', 'Y', 'GHI'])
k = 0
for y in range(len(Y_1)):
    for x in range(len(X_1[:-2])):  # X and Y ranges are not the same
        final.loc[k, 'X'] = X_1[0][x]
        final.loc[k, 'Y'] = Y_1[0][y]
        final.loc[k, 'GHI'] = data_df.iloc[y, x]
        k = k + 1
        # print(k)
You can optimize these loops by vectorizing the operations: vectorized code is typically one to two orders of magnitude faster than its pure-Python equivalent, especially for numerical computations. NumPy provides this kind of vectorization; it is a library with efficient data structures designed to hold matrix data.
Could you please try the following (file.h5 is your file):
import pandas as pd
import h5py
with h5py.File("file.h5", "r") as file:
    df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
    df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
    DHI = file.get("DHI")[0][:, :-2].reshape(-1)

final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]
Some explanations:
First read the data with key X into a dataframe df_X with one column X, except for the last 2 data points.
Then read the full data with key Y into a dataframe df_Y with one column Y.
Then get the data with key DHI and take the first element [0] (there are no others): the result is a NumPy array with 2 dimensions, a matrix. Now remove the last two columns ([:, :-2]) and reshape the matrix into a 1-dimensional array, in the order you are looking for (order="C" is the default). The result is the DHI column of your final dataframe.
Finally, take the cross join of df_Y and df_X (y is your outer dimension in the loop) via .merge with how="cross", add the DHI column, and rearrange the columns into the order you want.
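Since you mention 48 such files, here is a minimal sketch of applying the same extraction per file and writing one CSV each; the glob pattern and output naming are assumptions, so adjust them to your actual filenames:
import glob
import h5py
import pandas as pd

for path in glob.glob("*.h5"):  # assumed pattern; point this at your 48 files
    with h5py.File(path, "r") as file:
        df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
        df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
        DHI = file.get("DHI")[0][:, :-2].reshape(-1)
    final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]
    final.to_csv(path.replace(".h5", ".csv"), index=False)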
I created a custom primitive as shown below.
class Correlate(TransformPrimitive):
    name = 'correlate'
    input_types = [Numeric, Numeric]
    return_type = Numeric
    commutative = True
    compatibility = [Library.PANDAS, Library.DASK, Library.KOALAS]

    def get_function(self):
        def correlate(column1, column2):
            return np.correlate(column1, column2, "same")
        return correlate
Then I checked the calculation like below just in case.
np.correlate(feature_matrix["alcohol"], feature_matrix["chlorides"],mode="same")
However, the results of the two calculations were different.
Do you know why they differ?
If my code is basically wrong, please correct me.
Thanks for the question! You can create a custom primitive with a fixed argument to calculate that kind of correlation by using the TransformPrimitive as a base class. I will go through an example using this data.
import pandas as pd

data = [
    [0.40168819, 0.0857946],
    [0.06268886, 0.27811651],
    [0.16931269, 0.96509497],
    [0.15123022, 0.80546244],
    [0.58610794, 0.56928692],
]

df = pd.DataFrame(data=data, columns=list('ab'))
df.reset_index(inplace=True)
df
index a b
0 0.401688 0.085795
1 0.062689 0.278117
2 0.169313 0.965095
3 0.151230 0.805462
4 0.586108 0.569287
The function np.correlate is a transform when the parameter mode=same, so define a custom primitive by using the TransformPrimitive as a base class.
from featuretools.primitives import TransformPrimitive
from featuretools.variable_types import Numeric
import numpy as np

class Correlate(TransformPrimitive):
    name = 'correlate'
    input_types = [Numeric, Numeric]
    return_type = Numeric

    def get_function(self):
        def correlate(a, b):
            return np.correlate(a, b, mode='same')
        return correlate
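As a side note (not part of the original question or answer), here is what np.correlate with mode='same' computes on a tiny input, using the example values from the NumPy documentation; it is a sliding cross-correlation truncated to the length of the first argument, not a single correlation coefficient:
import numpy as np

# sliding cross-correlation; the output has the same length as the first input
np.correlate([1, 2, 3], [0, 1, 0.5], mode='same')
# array([2. , 3.5, 3. ])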
The DFS call requires the data to be structured into an EntitySet, then you can use the custom primitive.
import featuretools as ft

es = ft.EntitySet()

es.entity_from_dataframe(
    entity_id='data',
    dataframe=df,
    index='index',
)

fm, fd = ft.dfs(
    entityset=es,
    target_entity='data',
    trans_primitives=[Correlate],
    max_depth=1,
)

fm[['CORRELATE(a, b)']]
       CORRELATE(a, b)
index
0             0.534548
1             0.394685
2             0.670774
3             0.670506
4             0.622236
You should get the same values between the feature matrix and np.correlate.
actual = fm['CORRELATE(a, b)'].values
expected = np.correlate(df['a'], df['b'], mode='same')
np.testing.assert_array_equal(actual, expected)
You can learn more about defining simple custom primitives and advanced custom primitives in the Featuretools documentation. Let me know if you found this helpful.
I have all the data (sites and distances) already.
Now I have to form a string matrix to use as an input for another Python script.
I have the sites and distances as returned from a query, delimited like this:
A|B|5
A|C|3
A|D|9
B|C|7
B|D|2
C|D|6
How can I create this kind of matrix?
A|B|C|D
A|0|5|3|9
B|5|0|7|2
C|3|7|0|6
D|9|2|6|0
This has to be returned as a string from Python, and I'll have more than 1000 sites, so it should be optimized for that size.
Thanks
I have no doubt it could be done in a cleaner way (because Python).
I will do some more research later on but I do want you to have something to start with, so here it is.
import pandas as pd

data = [
    ('A', 'B', 5),
    ('A', 'C', 3),
    ('A', 'D', 9),
    ('B', 'C', 7),
    ('B', 'D', 2),
    ('C', 'D', 6),
]

data.extend([(y, x, val) for x, y, val in data])

df = pd.DataFrame(data, columns=['x', 'y', 'val'])
df = df.pivot_table(values='val', index='x', columns='y')
df = df.fillna(0)
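The question also asks for the matrix as a pipe-delimited string; here is a minimal sketch on top of the pivoted frame above, using to_csv (which returns a str when no path is given) and casting back to int because pivot_table/fillna produce floats. The leading index label in the header row may need trimming to match the exact format your other script expects:
# pivoted, zero-filled distance matrix -> pipe-delimited string
matrix_str = df.astype(int).to_csv(sep='|')
print(matrix_str)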
Here is a demo for 1000x1000 (takes about 2 seconds):
import pandas as pd, itertools as it

data = [(x, y, val) for val, (x, y) in enumerate(it.combinations(range(1000), 2))]
data.extend([(y, x, val) for x, y, val in data])

df = pd.DataFrame(data, columns=['x', 'y', 'val'])
df = df.pivot_table(values='val', index='x', columns='y')
df = df.fillna(0)
I am trying to convert a program to be parallelizable/multithreaded with the excellent dask library. Here is the program I am working on converting:
Python PANDAS: Stack by Enumerated Date to Create Records Vectorized
import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da
from io import StringIO
test_data = '''id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8'''
df_test = pd.read_csv(StringIO(test_data), sep=',')
df_test['transaction_dt'] = pd.to_datetime(df_test['transaction_dt'])
df_test = df_test.loc[np.repeat(df_test.index, df_test['units'])]
df_test['transaction_dt'] += pd.to_timedelta(df_test.groupby(level=0).cumcount(), unit='d')
df_test = df_test.reset_index(drop=True)
expected results:
id,transaction_dt,measures
1,2018-01-01,30.5
1,2018-01-02,30.5
1,2018-01-03,30.5
1,2018-01-04,30.5
1,2018-01-03,26.3
1,2018-01-04,26.3
1,2018-01-05,26.3
1,2018-01-06,26.3
2,2018-01-01,12.7
2,2018-01-02,12.7
2,2018-01-03,12.7
2,2018-01-03,8.8
2,2018-01-04,8.8
2,2018-01-05,8.8
It occurred to me that this might be a good candidate to try to parallelize because the separate dask partitions should not need to know anything about each other to accomplish the required operations. Here is a naive representation of how I thought it might work:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] += dd_test.to_timedelta(dd.groupby(level=0).cumcount(), unit='d')
dd_test = dd_test.reset_index(drop=True)
So far I have been trying to work through the following errors or idiomatic differences:
"NotImplementedError: Only integer valued repeats supported."
I have tried to convert the index into a int column/array to try as well but still run into the issue.
2. dask does not support the mutating operator: "+="
3. No dask .to_timedelta() argument
4. No dask .cumcount() (but I think .cumsum() is interchangable?!)
If any dask experts out there can let me know whether there are fundamental impediments that would preclude this, or can share any tips on implementation, that would be a great help!
Edit:
I think I have made a bit of progress on this since posting the question:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] = dd_test['transaction_dt'] + (dd_test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
However, I am still stuck on the dask array repeats error. Any tips still welcome.
Not sure if this is exactly what you are looking for, but I replaced da.repeat with np.repeat, explicitly cast dd_test.index and dd_test['units'] to numpy arrays, and finally added dd_test['transaction_dt'].astype('M8[us]') to your timedelta calculation.
df_test = pd.read_csv(StringIO(test_data), sep=',')
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[np.repeat(np.array(dd_test.index),
                                np.array(dd_test['units']))]
dd_test['transaction_dt'] = dd_test['transaction_dt'].astype('M8[us]') + (dd_test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
df_expected = dd_test.compute()
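As an alternative sketch (not part of the answer above): since each partition can be expanded independently, you could also wrap the working pandas logic from the question in a function and hand it to map_partitions, assuming the repeated index values stay within a single partition:
import numpy as np
import pandas as pd
import dask.dataframe as dd

def expand_rows(pdf):
    # the pandas version from the question, applied to one partition at a time
    pdf = pdf.loc[np.repeat(pdf.index, pdf['units'])]
    pdf['transaction_dt'] += pd.to_timedelta(pdf.groupby(level=0).cumcount(), unit='d')
    return pdf.reset_index(drop=True)

dd_test = dd.from_pandas(df_test, npartitions=3)
result = dd_test.map_partitions(expand_rows, meta=df_test.iloc[:0]).compute()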
I have a dataset df with three columns: 'String_key_val', 'Float_other_val1', 'Int_other_val2'. I want to groupby on key_val, then extract the sum of val1 (resp. val2) with respect to these groups. Here is my code:
import pandas

df = pandas.read_csv('test.csv')
grouped = df.groupby('String_key_val')
series_calculus1 = grouped['Float_other_val1'].sum()
series_calculus2 = grouped['Int_other_val2'].sum()
res = pandas.concat([series_calculus1, series_calculus2], axis=1)
res.to_csv('output_test.csv')
My problem is that my input dataset is 10 GB and I have 4 GB of RAM, so I need to chunk my computation, but I can't see how. I thought of using HDFStore, but since I only have to build a numerical dataset, I see no point in storing a complete DataFrame, and I don't think HDFStore can store simple arrays.
What can I do?
I believe a simple approach would be something along these lines....
import pandas as pd

chunker = pd.read_csv('test.csv', iterator=True, chunksize=50000)

# sum val1/val2 per key within each chunk, then combine the partial results
partial_sums = []
for chunk in chunker:
    group = chunk.groupby('String_key_val')
    partial_sums.append(group[['Float_other_val1', 'Int_other_val2']].sum())

summary = pd.concat(partial_sums).reset_index()

# the same key can appear in several chunks, so group and sum once more
group = summary.groupby('String_key_val')
summary = group[['Float_other_val1', 'Int_other_val2']].sum()
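A short follow-up (my note, not from the original answer): this two-pass approach is valid because sums are decomposable, so adding up the per-chunk group sums gives the same totals as a single pass over the whole file; it would not work for a mean without also tracking counts. Writing the result out then mirrors the original script:
# persist the aggregated result, as in the original code
summary.to_csv('output_test.csv')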