Why does manipulating xarrays take up so much memory permanently? - python

I'm new to Xarray (using it inside Jupyter notebooks), and up to now everything has worked like a charm, except that when I started to look at how much RAM my functions use (e.g. with htop), I got confused (I didn't find anything about this on Stack Exchange).
I am combining monthly data into yearly means, taking into account month lengths, masking NaN values and also using only specific months, which requires the use of groupby and resample.
As I can see from the memory profiler, these operations temporarily take up ~15 GB of RAM, which as such is not a problem because I have 64 GB of RAM at hand.
Nonetheless, it seems like some memory stays blocked permanently, even though I call these methods inside a function. For the function below it blocks ~4 GB of memory, although the resulting xarray is only ~440 MB (55*10**6 float64 entries); with more complex operations it blocks more memory.
Explicitly using del, gc.collect() or DataArray.close() inside the function did not change anything.
A basic function to compute a weighted yearly mean from monthly data looks like this:
import xarray as xr

test = xr.open_dataset(path)['pr']

def weighted_temporal_mean(ds):
    """
    Taken from https://ncar.github.io/esds/posts/2021/yearly-averages-xarray/
    Compute the yearly average from monthly data, taking into account month
    length and masking NaN values.
    """
    # Determine the month length
    month_length = ds.time.dt.days_in_month
    # Calculate the weights
    wgts = month_length.groupby("time.year") / month_length.groupby("time.year").sum()
    # Set up the masking for NaN values
    cond = ds.isnull()
    ones = xr.where(cond, 0.0, 1.0)
    # Calculate the numerator
    obs_sum = (ds * wgts).resample(time="AS").sum(dim="time")
    # Calculate the denominator
    ones_out = (ones * wgts).resample(time="AS").sum(dim="time")
    # Return the weighted average
    return obs_sum / ones_out

wm = weighted_temporal_mean(test)
print("nbytes in MB:", wm.nbytes / (1024 * 1024))
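For intuition, the weighting that the groupby performs can be checked on a toy non-leap year in plain Python (a sketch, independent of xarray, with made-up monthly values):

```python
# Month lengths for a non-leap year
month_length = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
total = sum(month_length)                   # 365 days in the year
wgts = [m / total for m in month_length]    # per-month weights within the year

# Weighted yearly mean of twelve toy monthly values
monthly = [float(i) for i in range(1, 13)]
wmean = sum(v * w for v, w in zip(monthly, wgts))
```

The weights sum to 1 within each year, which is exactly what dividing by month_length.groupby("time.year").sum() guarantees.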
Any idea how to ensure that the memory is freed, or am I overlooking something and this behavior is actually expected?
Thank you!

The only hypothesis I have for this behavior is that some of the operations involving the passed-in ds modify it in place, increasing its size, since, apart from the returned objects, this is the only object that should survive after the function executes.
You can easily test that by using del on the ds structure you passed in, after the function has run. (If you need the data afterwards, re-read it, or make a deep copy before calling the function.)
If that does not resolve the problem, then this is likely an issue with the xarray project, and I'd advise you to open an issue on their tracker.
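To illustrate the suggestion: the effect of del plus gc.collect() on the last remaining reference can be demonstrated with a toy stand-in for the dataset (a hypothetical class, no xarray involved):

```python
import gc
import weakref

class BigArray:
    """Toy stand-in for a large dataset object (hypothetical)."""
    def __init__(self, n):
        self.data = [0.0] * n

def process(ds):
    # Only the returned value should survive the call
    return sum(ds.data)

ds = BigArray(1000)
ref = weakref.ref(ds)    # lets us observe when the object is gone
result = process(ds)
del ds                   # drop the last strong reference
gc.collect()             # force a collection pass
```

After the del and collection, ref() returns None, confirming the object (and its memory) was released; if a library kept an internal reference to the input, this is exactly what would not happen.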

Is there another way to convert ee.Number to float except getInfo()?

Hello friends!
Summarization:
I have an ee.FeatureCollection containing around 8,500 ee.Point objects. I would like to calculate the distance of these points to a given coordinate, let's say (0.0, 0.0).
For this I use the function geopy.distance.distance() (ref: https://geopy.readthedocs.io/en/latest/#module-geopy.distance). As input, the function takes 2 coordinates in the form of 2 tuples containing 2 floats each.
Problem: When I try to convert the coordinates from an ee.List to float, I always use the getInfo() function. I know this is a callback and it is very time intensive, but I don't know another way to extract them. Long story short: extracting the data as ee.Number takes less than a second; if I want it as float it takes more than an hour. Is there any trick to fix this?
Code:
fc_containing_points = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100')  # ee.FeatureCollection
list_containing_points = fc_containing_points.toList(fc_containing_points.size())  # ee.List
fc_containing_points_length = fc_containing_points.size()  # ee.Number
for index in range(fc_containing_points_length.getInfo()):  # I need to convert ee.Number to int
    point_tmp = list_containing_points.get(index)  # ee.ComputedObject
    point = ee.Feature(point_tmp)  # transform ee.ComputedObject to ee.Feature
    coords = point.geometry().coordinates()  # ee.List containing 2 ee.Numbers
    # when I run the loop without the next part,
    # I get all the data I want as ee.Number in under 1 sec
    coords_as_tuple_of_ints = (coords.getInfo()[1], coords.getInfo()[0])  # tuple containing 2 floats
    # when I add this part to the function it takes hours
PS: This is my first question, please be patient with me.
I would use .map instead of your loop. This stays server side until you export the table (or possibly do a .getInfo on the whole thing).
fc_containing_points = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100')
fc_containing_points = fc_containing_points.map(
    lambda feature: feature.set(
        "distance_to_point",
        feature.distance(ee.Feature(ee.Geometry.Point([0.0, 0.0])))))
# Then export using ee.batch.Export.Table.toXXX or call getInfo
(An alternative might be to use ee.Image.paint to convert the target point to an image, then use ee.Image.distance to calculate the distance to the point (as an image), then use reduceRegions over the feature collection with all the points, but 1) you can only calculate distance up to a certain range and 2) I don't think it would be any faster.)
To comment on your code: you are probably aware that loops (especially client-side loops) are frowned upon in GEE (primarily for the performance reasons you've run into), but also note that any time you call .getInfo on a server-side object it incurs a performance cost. So this line
coords_as_tuple_of_ints = (coords.getInfo()[1],coords.getInfo()[0])
would take roughly double the time of this:
coords_client = coords.getInfo()
coords_as_tuple_of_ints = (coords_client[1],coords_client[0])
Finally, you could always just export your entire feature collection to a shapefile (using ee.batch.Export.Table.... as above) and do all the operations locally using geopy.
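If you take that route, the local distance computation could be sketched like this, with a plain haversine formula standing in for geopy.distance.distance (the radius is the textbook mean Earth radius, and the coordinates are made-up examples):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (simplified stand-in for geopy.distance.distance)."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# One degree of longitude at the equator is roughly 111.2 km
d = haversine_km(0.0, 0.0, 0.0, 1.0)
```

geopy's geodesic distance is more accurate than a spherical haversine, so for the real workflow the exported coordinates would still go through geopy.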

plot results from user defined ACT Extension

As a result of my simulation, I want the volume of a surface body (computed using a convex hull algorithm). This calculation is done in seconds, but plotting the results takes a long time, which becomes a problem for the future design of experiments. I think the main problem is that a matrix (size = number of nodes = over 33,000 nodes) is filled with the same volume value in order to be plotted. Is there any other way to obtain that value without creating this matrix? (The value retrieved must be selectable as an output parameter afterwards.)
Note that the volume value is computed in Python in an intermediate script, then saved to an output file that is later read by IronPython in the main script in Ansys ACT.
Thanks!
The matrix creation in the intermediate script (myICV is the computed volume):
import numpy as np

NodeNo = np.array(Col_1)
ICV = np.full_like(NodeNo, myICV)
np.savetxt(outputfile, (NodeNo, ICV), delimiter=',', fmt='%f')
Plot of the results in the main script:
import csv  # after the CPython function

resfile = opfile
reader = csv.reader(open(resfile, 'rb'), quoting=csv.QUOTE_NONNUMERIC)  # read the node number and the scaled displ
NodeNos = next(reader)
ICVs = next(reader)
# ScaledUxs = next(reader)
a = int(NodeNos[1])
b = ICVs[1]
ExtAPI.Log.WriteMessage(a.GetType().ToString())
ExtAPI.Log.WriteMessage(b.GetType().ToString())
userUnit = ExtAPI.DataModel.CurrentUnitFromQuantityName("Length")
DispFactor = units.ConvertUnit(1, userUnit, "mm")
for id in collector.Ids:
    collector.SetValues(int(NodeNos[NodeNos.index(id)]), {ICVs[NodeNos.index(id)] * DispFactor})  # plot results
ExtAPI.Log.WriteMessage("ICV read")
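The round trip between the two scripts (savetxt writing the two rows, the main script reading them back with one next() per row) can be sketched with the stdlib csv module (toy values and an in-memory file standing in for the real output file):

```python
import csv
import io

# Toy stand-ins for the node numbers and the repeated volume value
node_no = [1.0, 2.0, 3.0]
my_icv = 42.5
icv = [my_icv] * len(node_no)      # same role as np.full_like(NodeNo, myICV)

# Write the same two-row layout that np.savetxt((NodeNo, ICV)) produces
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(node_no)
writer.writerow(icv)

# Read it back the way the main script does: one next() per row
buf.seek(0)
reader = csv.reader(buf, quoting=csv.QUOTE_NONNUMERIC)
node_nos = next(reader)
icvs = next(reader)
```

QUOTE_NONNUMERIC makes the reader return floats directly, which is why the main script only needs int() for the node IDs.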
So far the result looks like this: (screenshot omitted)
Considering that your 'CustomPost' object is not relevant for visualization but only passes the volume calculation on as a parameter, without adding many changes to the workflow I suggest changing the 'Scoping Method' to 'Geometry' and then selecting a single node (if the extension result type is 'Node'; you can check this in the XML file), instead of 'All Bodies'.
If your code runs slowly due to the plotting, this should fix it, because you will be requesting just one node.
As you refer to a DoE, I understand you expect to run this model iteratively and read the parameter result. An easy trick might be to generate a 'NamedSelection' by 'Worksheet' and select 'Mesh Node' (Entity Type) with 'NodeID' as the criterion, equal to '1', for example. Even if you change the mesh between iterations, node ID 1 should always exist, so your NamedSelection is guaranteed to be generated successfully in each iteration.
Then you can scope your 'CustomPost' to 'NamedSelection' and then to the one you created. This should work.
If your extension does not accept 'NamedSelection' as a 'Scoping Method' and you are changing the mesh in each iteration (if you are not, you can directly scope a node), I think it is time to manually write the parameter as an 'Input Parameter' in the 'Parameter Set'. But this way you will have to control the execution of the model from the Workbench platform.
I am curious to see how it goes.

How to parallelize calculating with odb files in ABAQUS?

I calculate the sum of the volume at all integration points like this:
volume = f.steps['Step-1'].frames[-1].fieldOutputs['IVOL'].values

# CALCULATE TOTAL VOLUME AT INTEGRATION POINTS
V = 0
for i in range(len(volume)):
    V = V + volume[i].data
The length of volume in my problem is about 27,000,000, so this takes far too long.
I am trying to parallelize this process with the multiprocessing module in Python.
As far as I know, the data should be split into several parts for that.
Could you give me some advice about splitting the data in odb files into several parts, or about parallelizing that code?
You can make use of the bulkDataBlocks command available on field output objects.
Using bulkDataBlocks, you can access all the field data as an array (an Abaqus array actually, but NumPy arrays and Abaqus arrays are largely the same), hence manipulation becomes very easy.
For example:
# Accessing the bulkDataBlocks object
ivolObj = f.steps['Step-1'].frames[-1].fieldOutputs['IVOL'].bulkDataBlocks
# Accessing the data as an array --> note it is an Abaqus array type, not a numpy array
ivolData = ivolObj[0].data
# Now you can sum it up
ivolSum = sum(ivolData)
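If you still want to parallelize the summation itself with multiprocessing, a generic chunked sum might look like this (a sketch on a plain Python list; the odb data would first have to be extracted into such an array, since odb objects themselves do not travel across processes):

```python
from multiprocessing import Pool

def chunked_sum(values, nprocs=4):
    """Split values into roughly nprocs chunks and sum them in parallel."""
    size = max(1, len(values) // nprocs)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with Pool(nprocs) as pool:
        return sum(pool.map(sum, chunks))

if __name__ == "__main__":
    # The __main__ guard matters on Windows/macOS, where workers re-import the module
    data = [float(i) for i in range(100000)]
    total = chunked_sum(data, nprocs=4)
```

For a pure sum, the array approach above will usually beat multiprocessing anyway, because the per-process serialization cost tends to dominate.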

reduce calculation precision to speed up execution

I have a data acquisition system which takes measurements for a few minutes and generates a csv file with 10 million rows and 10 columns. Then I import this csv file into Python (csv.reader) and perform a bunch of operations on the acquired numeric data (but 'only' 10,000 rows at a time, otherwise the computer's memory would be overwhelmed). In the end, I export the results to another, much smaller csv file (csv.writer).
The problem is that the runtime is very long and I want to speed it up. When I open the original csv file with Notepad I see that the numbers have up to 16 digits each, like 0.0015800159870059, 12.0257771094508 etc. I know that the accuracy of the DAQ is 0.1% at best and most of the trailing digits are noise. Is there an elegant way of forcing Python to operate globally with only 7-8 digits from start to finish, to speed up the calculations? I know about error propagation and I’m going to try different settings for the number of digits to see what the optimum is.
Please note that it is not enough for me to build a temporary csv file with 'truncated' data (e.g. containing 0.0015800, 12.0257771 etc.) and simply import that into Python. The calculations in Python should use reduced precision as well. I looked into the decimal module, with no success so far.
with open('datafile', newline='') as DAQfile:
    reader = csv.reader(DAQfile, delimiter=',')
    for row in reader:
        ...  # calculate stuff

with open('results.csv', 'w', newline='') as myfile:
    mywriter = csv.writer(myfile)
    ...  # write stuff
Adding some details, based on the comments so far:
The program calculates the peak of the moving average of the 'instantaneous power'. The data in the csv file can be described like this, where 'col' means column, V means voltage and I means current: col1=time, col2=V1, col3=I1, col4=V2, col5=I2 and so on until col11=V10, col12=I10. So each row represents a data sample taken by the DAQ.
The instantaneous power is Pi=V1*I1+V2*I2+...+V10*I10
To calculate the moving average over 10000 rows at a time, I built a buffer (initialized with Buffer=[0]*10000). This buffer stores the Pi's for 10000 consecutive rows and is updated every time csv.reader moves to the next row. The buffer works exactly like a shift register.
This way the memory usage is insignificant (verified). In summary, the calculations are multiplications, additions, the min(a,b) function (to detect the peak of the moving average) and del/append for refreshing the buffer. The moving average itself is iterative too, something like newavg=oldavg+(newlast-oldfirst)/bufsize.
My thinking is that it does not make any sense to let Python work with all those decimals when I know that most of the trailing figures are garbage.
Forgot to mention that the size of the csv file coming from the DAQ is just under 1 GB.
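The shift-register update described above can be sketched in plain Python (toy data, one V/I pair instead of ten, and a tiny buffer):

```python
from collections import deque

BUFSIZE = 4   # 10000 in the real setup; kept tiny for the toy example

buf = deque([0.0] * BUFSIZE)   # shift register of instantaneous powers
avg = 0.0                      # running moving average
peak = float('-inf')           # peak of the moving average so far

rows = [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0), (1.0, 1.0)]   # (V, I) samples
for v, i in rows:
    p = v * i                           # instantaneous power for this row
    oldest = buf.popleft()              # drop the oldest sample
    buf.append(p)                       # shift the new one in
    avg = avg + (p - oldest) / BUFSIZE  # iterative update, as in the question
    peak = max(peak, avg)               # track the peak
```

With numeric types, this update runs the same way whether the values carry 16 digits or 7, which is why the precision reduction has to happen in the number type itself (e.g. a 32-bit float) rather than in pure-Python floats.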
Yes, there is a way: use NumPy. First, there are tons of vector/vector operations which can be performed with one command, e.g.
a = b + c
will efficiently sum two vectors.
Second, and this is the answer to your question, you can specify a 4-byte float type, greatly reducing memory requirements and increasing speed.
You should read your file directly using
import numpy as np

data = np.genfromtxt('datafile.csv', dtype=np.float32, delimiter=',')
...
data will be made up of standard 32-bit floats, with roughly 7 digits of precision.
The CSV file can also be read in parts/bunches:
numpy.genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None,
skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None,
usecols=None, names=None, excludelist=None, deletechars=None, replace_space='_',
autostrip=False, case_sensitive=True, defaultfmt='f%i',
unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None,
encoding='bytes')
Here is the full list of parameters. If max_rows is set to, say, 10, only 10 rows will be read; the default is to read the whole file. You can also read from the middle of the file by skipping some initial records via the skip_header option.
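If you would rather stay with the stdlib, the same chunked reading can be done with csv and itertools.islice (a sketch; the in-memory file stands in for the real 10-million-row csv):

```python
import csv
import io
from itertools import islice

def read_chunks(f, chunk_size):
    """Yield lists of float rows, chunk_size rows at a time."""
    reader = csv.reader(f)
    while True:
        chunk = [[float(x) for x in row] for row in islice(reader, chunk_size)]
        if not chunk:
            break
        yield chunk

# Toy stand-in for the DAQ csv file
f = io.StringIO("1.0,2.0\n3.0,4.0\n5.0,6.0\n7.0,8.0\n9.0,10.0\n")
sizes = [len(c) for c in read_chunks(f, 2)]
```

This keeps memory bounded, just like genfromtxt with max_rows, though without the float32 savings.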
Take DyZ's comment into account: if there is a way to speed up the calculations (e.g. using << or >> for multiplications or divisions, respectively, when the second operand or the divisor is a power of 2), you should take it.
example:
>>> 22 * 16
352
>>> 22 << 4
352
In that scenario I did the exact same operation with only a marginal decrease in time. However, scaled up to something like 100 trillion calculations, the difference becomes much more notable.
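The claim can be checked with the stdlib timeit module; note that shifts only work on integers, so float DAQ data would have to be scaled to ints first (timings vary by machine, so none are asserted here):

```python
import timeit

# Shifting only works on integers; 22 << 4 yields the same value as 22 * 16
mul_time = timeit.timeit('x * 16', setup='x = 22', number=1000000)
shl_time = timeit.timeit('x << 4', setup='x = 22', number=1000000)
same = (22 << 4 == 22 * 16)
```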

Numpy mean for big array dataset using For loop Python

I have a big dataset in array form, arranged like this:
(image: rainfall amounts arranged in array form)
The average (mean) for each latitude and longitude along axis=0 is computed using this declaration:
Lat = data[:,0]
Lon = data[:,1]
rain1 = data[:,2]
rain2 = data[:,3]
...
rain44 = data[:,44]
rainT = [rain1, rain2, rain3, rain4, ..., rain44]
mean = np.mean(rainT)
The result was awesome, but the computation takes time, and I am looking to use a for loop to ease the calculation. At the moment the script I use is like this:
mean = []
lat = data[:,0]
lon = data[:,1]
for x in range(2,46):
    rainT = data[:,x]
    mean = np.mean(rainT, axis=0)
print mean
But a weird result appears. Anyone?
First, you probably meant the for loop to accumulate the subarrays rather than keep replacing rainT with another slice of the array. Only the last assignment matters, so the code averages just that one subarray, rainT=data[:,45]; it also doesn't divide by the correct number of original elements to compute an overall average. Both of these mistakes contribute to the weird result.
Second, numpy should be able to average elements faster than a Python for loop can do it since that's just the kind of thing that numpy is designed to do in optimized native code.
Third, your original code copies a bunch of subarrays into a Python list, then asks numpy to average that. You should get much faster results by asking numpy to average the relevant subarray without making a copy, something like this:
rainT = data[:,2:] # this gets a view onto data[], not a copy
mean = np.mean(rainT)
That computes an average over all the rainfall values, like your original code.
If you want an average for each latitude or some such, you'll need to do it differently. You can average over an array axis, but latitude and longitude aren't axes in your data[].
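For intuition, the axis distinction can be mimicked in plain Python (toy numbers; in numpy this is np.mean(rainT) versus np.mean(rainT, axis=1)):

```python
# Each row: lat, lon, then two rainfall readings (toy values)
data = [
    [54.0, 10.0, 1.0, 3.0],
    [54.5, 10.5, 2.0, 4.0],
]
rain = [row[2:] for row in data]                  # analogue of data[:, 2:]

# One number over everything, like np.mean(rain)
overall = sum(map(sum, rain)) / sum(len(r) for r in rain)

# One number per row (per lat/lon pair), like np.mean(rain, axis=1)
per_row = [sum(r) / len(r) for r in rain]
```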
Thanks friends, you are giving me such inspiration. Here is the working script based on @Jerry101's ideas, though I decided NOT to use a Python loop. The new declaration looks like this:
lat1 = data[:,0]
lon1 = data[:,1]
rainT = data[:,2:46]  # THIS IS THE STEP THAT I WAS MISSING EARLIER
mean = np.mean(rainT, axis=1) * 24  # make average daily rainfall for each lat and lon
mean2 = np.array([lat1, lon1, mean])
mean2 = mean2.T
np.savetxt('average-daily-rainfall.dat2', mean2, fmt='%9.3f')
And finally the result is exactly the same as the program written in Fortran.
