I calculate the sum of the volumes at all integration points like this:
volume = f.steps['Step-1'].frames[-1].fieldOutputs['IVOL'].values
# calculate the total volume over all integration points
V = 0
for i in range(len(volume)):
    V = V + volume[i].data
The length of volume in my problem is about 27,000,000, so this takes too long.
I am trying to parallelize this with Python's multiprocessing module.
As far as I know, the data should be split into several parts for that.
Could you give me some advice on splitting the data in ODB files into several parts, or on parallelizing this code?
You can make use of the bulkDataBlocks command available on field output objects.
Using bulkDataBlocks, you can access all the field data as an array (an Abaqus array actually, but NumPy arrays and Abaqus arrays are largely the same), which makes manipulation much easier.
For example:
# Accessing the bulkDataBlocks object
ivolObj = f.steps['Step-1'].frames[-1].fieldOutputs['IVOL'].bulkDataBlocks
# Accessing the data as an array --> note it is an Abaqus array type, not a numpy array
ivolData = ivolObj[0].data
# now you can sum it up
ivolSum = sum(ivolData)
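If the field output is split across several blocks (bulkDataBlocks can return one block per element type or instance), summing each block's array with np.sum avoids the per-value Python loop entirely. A minimal sketch with plain NumPy arrays standing in for the block data (the block count and values here are made up):

```python
import numpy as np

# Stand-ins for the .data arrays of several bulkDataBlocks entries
# (in Abaqus, each block typically covers one element type or instance)
blocks = [np.ones(1000) * 0.5 for _ in range(3)]

# np.sum runs in compiled code, so each block is reduced without a
# per-element Python loop; the outer sum only iterates over blocks
total_volume = sum(float(np.sum(b)) for b in blocks)
```

The same pattern applies to the real bulkDataBlocks output: loop over the blocks, np.sum each block's .data, and add the partial sums.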
Hello friends!
Summary:
I have an ee.FeatureCollection containing around 8,500 ee.Point objects. I would like to calculate the distance of these points to a given coordinate, let's say (0.0, 0.0).
For this I use the function geopy.distance.distance() (ref: https://geopy.readthedocs.io/en/latest/#module-geopy.distance). As input, the function takes two coordinates, each in the form of a tuple containing two floats.
Problem: To convert the coordinates from an ee.List to floats, I always use the getInfo() function. I know this is a callback and very time-intensive, but I don't know another way to extract them. Long story short: extracting the data as ee.Number takes less than a second, but getting it as floats takes more than an hour. Is there any trick to fix this?
Code:
fc_containing_points = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100') # ee.FeatureCollection
list_containing_points = fc_containing_points.toList(fc_containing_points.size()) # ee.List
fc_containing_points_length = fc_containing_points.size() # ee.Number
for index in range(fc_containing_points_length.getInfo()): # I need to convert ee.Number to int
    point_tmp = list_containing_points.get(index) # ee.ComputedObject
    point = ee.Feature(point_tmp) # transform ee.ComputedObject to ee.Feature
    coords = point.geometry().coordinates() # ee.List containing 2 ee.Numbers
    # when I run the loop without the next line,
    # I get all the data I want as ee.Number in under 1 sec
    coords_as_tuple_of_ints = (coords.getInfo()[1], coords.getInfo()[0]) # tuple containing 2 floats
    # when I add this line, it takes hours
PS: This is my first question, pls be patient with me.
I would use .map instead of your loop. This stays server-side until you export the table (or possibly do a .getInfo on the whole thing):
fc_containing_points = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100')
target = ee.Feature(ee.Geometry.Point([0.0, 0.0]))
fc_with_distance = fc_containing_points.map(
    lambda feature: feature.set("distance_to_point", feature.distance(target)))
# Then export using ee.batch.Export.Table.toXXX or call getInfo
(An alternative might be to use ee.Image.paint to convert the target point to an image, then use ee.Image.distance to calculate the distance to the point (as an image), then use reduceRegions over the feature collection with all points. But 1) you can only calculate distance up to a certain range and 2) I don't think it would be any faster.)
To comment on your code: you are probably aware that loops (especially client-side loops) are frowned upon in GEE, primarily for the performance reasons you've run into. Also note that every .getInfo call on a server-side object incurs a round-trip cost. So this line
coords_as_tuple_of_ints = (coords.getInfo()[1],coords.getInfo()[0])
would take roughly twice as long as this:
coords_client = coords.getInfo()
coords_as_tuple_of_ints = (coords_client[1],coords_client[0])
Finally, you could always export your entire feature collection to a shapefile (using ee.batch.Export.Table..., as above) and do all the operations locally with geopy.
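If you do pull the coordinates client-side (one getInfo on the whole collection, or an export), you don't even need geopy for a rough distance to (0.0, 0.0). A haversine sketch on a spherical Earth (geopy's geodesic distance is more accurate, so treat this as an approximation):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in km, assuming a spherical Earth of radius r."""
    p1, p2 = radians(lat1), radians(lat2)
    dp = p2 - p1
    dl = radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * r * asin(sqrt(a))

# One degree of latitude is roughly 111 km
d = haversine_km(1.0, 0.0, 0.0, 0.0)
```

For 8,500 points this runs in milliseconds once the coordinates are local.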
I'm new to using xarray (inside Jupyter notebooks), and up to now everything has worked like a charm, except now that I've started to look at how much RAM my functions use (e.g. with htop), which confuses me (I didn't find anything on Stack Exchange).
I am combining monthly data into yearly means, taking into account month lengths, masking NaN values, and also using only specific months, which requires the use of groupby and resample.
As I can see from the memory profiler, these operations temporarily take up ~15 GB of RAM, which as such is not a problem because I have 64 GB at hand.
Nonetheless, some memory seems to stay blocked permanently, even though I call these methods inside a function. For the function below it blocks ~4 GB of memory although the resulting xarray only has a size of ~440 MB (55*10**6 float64 entries); with more complex operations it blocks more.
Explicitly using del, gc.collect(), or DataArray.close() inside the function did not change anything.
A basic function to compute a weighted yearly mean from monthly data looks like this:
import xarray as xr

test = xr.open_dataset(path)['pr']

def weighted_temporal_mean(ds):
    """
    Taken from https://ncar.github.io/esds/posts/2021/yearly-averages-xarray/
    Compute the yearly average from monthly data, taking into account
    month length and masking NaN values.
    """
    # Determine the month length
    month_length = ds.time.dt.days_in_month
    # Calculate the weights
    wgts = month_length.groupby("time.year") / month_length.groupby("time.year").sum()
    # Set up the masking for NaN values
    cond = ds.isnull()
    ones = xr.where(cond, 0.0, 1.0)
    # Calculate the numerator
    obs_sum = (ds * wgts).resample(time="AS").sum(dim="time")
    # Calculate the denominator
    ones_out = (ones * wgts).resample(time="AS").sum(dim="time")
    # Return the weighted average
    return obs_sum / ones_out

wm = weighted_temporal_mean(test)
print("nbytes in MB:", wm.nbytes / (1024 * 1024))
Any idea how to ensure that the memory is freed up, or am I overlooking something and this behavior is actually expected?
Thank you!
The only hypothesis I have for this behavior is that some of the operations involving the passed-in ds modify it in place, increasing its size, since, apart from the returned objects, this is the only object that should survive the function call.
That can easily be verified by using del on the ds structure after the function has run. (If you need the data afterwards, re-read it, or make a deepcopy before calling the function.)
If that does not resolve the problem, then this is an issue with the xarray project, and I'd advise you to open an issue on their tracker.
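The "is memory really retained after the function returns?" check can be sketched with tracemalloc. Note this uses a plain Python list as a stand-in for the dataset, because NumPy/xarray buffers may be allocated outside tracemalloc's view; the function name and sizes are made up:

```python
import gc
import tracemalloc

def yearly_mean_stand_in(n):
    # Large temporary that should be freed once the function returns
    tmp = [float(i) for i in range(n)]
    return sum(tmp) / n

tracemalloc.start()
result = yearly_mean_stand_in(500_000)
gc.collect()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
# peak reflects the temporary list; if current stays close to peak,
# something is still holding a reference to the temporary data
```

If current stays high after del ds and gc.collect() in the real workflow, the reference is being kept somewhere (e.g. by the file backend or a cache), which is worth reporting upstream.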
I have about 15 million pairs, each consisting of a single int paired with a batch of (2 to 100) other ints.
If it makes a difference, the ints themselves range from 0 to 15 million.
I have considered using:
Pandas, storing the batches as python lists
Numpy, where each batch is stored as its own numpy array (since numpy doesn't allow variable-length rows in its 2D data structures)
Python List of Lists.
I also looked at TensorFlow TFRecords, but I'm not too sure about that one.
I only have about 12 GB of RAM. I will also be using this data to train a machine learning algorithm.
If you must store all values in memory, numpy will probably be the most efficient way. Pandas is built on top of numpy so it includes some overhead which you can avoid if you do not need any of the functionality that comes with pandas.
Numpy should have no memory issues handling data of this size. Another thing to consider, depending on how you will use this data, is a generator that reads from a file with each pair on a new line. This would reduce memory usage significantly, but would be slower than numpy for aggregate functions like sum() or max(); it is more suitable if each value pair is processed independently.
with open(file, 'r') as f:
    data = (l for l in f)  # generator
    for line in data:
        # process each record here
        pass
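To make the generator idea concrete, here is a sketch that parses one pair per line. The "key: v1 v2 ..." line format is an assumption, so adapt the parsing to your actual file:

```python
import io

# Hypothetical file contents: "<single int>: <batch of ints>", one pair per line
raw = io.StringIO("7: 1 2 3\n12: 4 5\n")

def iter_pairs(f):
    """Lazily yield (key, batch) tuples without loading the whole file."""
    for line in f:
        head, _, tail = line.partition(":")
        yield int(head), [int(v) for v in tail.split()]

pairs = list(iter_pairs(raw))  # in real use, iterate instead of materializing
```

In real use you would iterate over iter_pairs(open(path)) directly, so only one pair is in memory at a time.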
I would do the following:
# create example data
A = np.random.randint(0,15000000,100)
B = [np.random.randint(0,15000000,k) for k in np.random.randint(2,101,100)]
int32 is sufficient
A32 = A.astype(np.int32)
We want to glue all the batches together.
First, write down the batch sizes so we can separate them later.
from itertools import chain
sizes = np.fromiter(chain((0,), map(len, B)), np.int32, len(B) + 1)
boundaries = sizes.cumsum()
# force int32
B_all = np.empty(boundaries[-1], np.int32)
np.concatenate(B, out=B_all)
After gluing, re-split:
B32 = np.split(B_all, boundaries[1:-1])
Finally, make an array of pairs for convenience:
pairs = np.rec.fromarrays([A32,B32],names=["first","second"])
What was the point of gluing and then splitting again?
First, note that the re-split arrays are all views into B_all, so we do not waste much memory by having both. Also, if we modify either B_all or B32 (or rather some of its elements) in place, the other one is automatically updated as well.
The advantage of having B_all around is efficiency via numpy's reduceat ufunc method. If we wanted, for example, the means of all batches, we could do np.add.reduceat(B_all, boundaries[:-1]) / sizes[1:] (note the [1:]: the leading 0 in sizes is an offset, not a batch length), which is faster than looping through pairs['second'].
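The whole glue/reduceat round trip can be verified on a tiny example (the batch sizes and values here are arbitrary):

```python
import numpy as np
from itertools import chain

rng = np.random.default_rng(0)
B = [rng.integers(0, 100, k) for k in (2, 5, 3)]  # three ragged batches

sizes = np.fromiter(chain((0,), map(len, B)), np.int64, len(B) + 1)
boundaries = sizes.cumsum()          # [0, 2, 7, 10]
B_all = np.concatenate(B)

# One reduceat call over the glued array gives per-batch sums...
means_fast = np.add.reduceat(B_all, boundaries[:-1]) / sizes[1:]
# ...which match a plain Python loop over the original batches
means_loop = np.array([b.mean() for b in B])
```

For 15 million batches the reduceat version replaces 15 million Python-level iterations with one vectorized call.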
Use numpy. It is the most efficient, and you can use it easily with a machine learning model.
I have a big dataset in array form, arranged like this:
[figure: rainfall amounts arranged in array form]
The average (mean) for each latitude and longitude along axis=0 is computed using these declarations:
Lat = data[:,0]
Lon = data[:,1]
rain1 = data[:,2]
rain2 = data[:,3]
...
rain44 = data[:,44]
rainT = [rain1, rain2, rain3, rain4, ..., rain44]
mean = np.mean(rainT)
The result was awesome but the computation takes time, so I wanted to use a for loop to ease the calculation. At the moment the script I use looks like this:
mean = []
lat = data[:,0]
lon = data[:,1]
for x in range(2,46):
    rainT = data[:,x]
    mean = np.mean(rainT, axis=0)
print mean
But a weird result appeared. Anyone?
First, you probably meant the for loop to accumulate the subarrays rather than repeatedly replace rainT with another slice. Only the last assignment matters, so the code averages that one subarray, rainT = data[:,45]; it also doesn't divide by the correct number of original elements to compute an average. Both of these mistakes contribute to the weird result.
Second, numpy should be able to average elements faster than a Python for loop can do it since that's just the kind of thing that numpy is designed to do in optimized native code.
Third, your original code copies a bunch of subarrays into a Python list, then asks numpy to average that. You should get much faster results by asking numpy to average the relevant subarray without making a copy, something like this:
rainT = data[:,2:] # this gets a view onto data[], not a copy
mean = np.mean(rainT)
That computes an average over all the rainfall values, like your original code.
If you want an average for each latitude or some such, you'll need to do it differently. You can average over an array axis, but latitude and longitude aren't axes in your data[].
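The axis semantics are easy to check on a tiny array; the lat/lon and rainfall values here are made up:

```python
import numpy as np

# Each row is one station: [lat, lon, rain reading 1, rain reading 2]
data = np.array([[54.0,  9.0, 1.0, 3.0],
                 [55.0, 10.0, 5.0, 7.0]])

rain = data[:, 2:]               # view of the rainfall columns, no copy
per_station = rain.mean(axis=1)  # one mean per row, i.e. per lat/lon pair
per_reading = rain.mean(axis=0)  # one mean per rainfall column
overall = rain.mean()            # single scalar over all rainfall values
```

So a per-latitude/longitude average is axis=1 here (averaging across a row's readings), not axis=0.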
Thanks friends, you are giving me such inspiration. Here is the working script based on the ideas by #Jerry101, though I decided NOT to use a Python loop. The new declaration looks like this:
lat1 = data[:,0]
lon1 = data[:,1]
rainT = data[:,2:46]  # this is the step that I was missing earlier
mean = np.mean(rainT, axis=1) * 24  # make average daily rainfall for each lat and lon
mean2 = np.array([lat1, lon1, mean])
mean2 = mean2.T
np.savetxt('average-daily-rainfall.dat2', mean2, fmt='%9.3f')
And finally the result is exactly the same as the program written in Fortran.
I have a Matlab (.mat, version >7.3) file that contains a structure (data) that itself contains many fields. Each field is a single column array. Each field represents an individual sensor and the array is the time series data. I am trying to open this file in Python to do some more analysis. I am using PyTables to read the data in:
import tables
impdat = tables.openFile('data_file.mat')
This reads the file in, and I can inspect the file object and get the names of each field by using:
impdat.root.data.__members__
This prints a list of the fields:
['rdg', 'freqlabels', 'freqbinsctr',... ]
Now, what I would like is a method to take each field in data and make a python variable (perhaps dictionary) with the field name as the key (if it is a dictionary) and the corresponding array as its value. I can see the size of the array by doing, for example:
impdat.root.data.rdg
which returns this:
/data/rdg (EArray(1, 1286920), zlib(3))
atom := Int32Atom(shape=(), dflt=0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := (1, 16290)
My question is how do I access some of the data stored in that large array (1, 1286920). How can I read that array into another Python variable (list, dictionary, numpy array, etc.)? Any thoughts or guidance would be appreciated.
I have come up with a working solution. It is not very elegant, as it requires an eval. I first create a new variable (alldata) pointing to the data I want to access, then create an empty dictionary datastruct, then loop over all the members of data and assign the arrays to the appropriate key in the dictionary:
alldata = impdat.root.data
datastruct = {}
for names in impdat.root.data.__members__:
    datastruct[names] = eval('alldata.' + names + '[0][:]')
The '[0]' may be superfluous depending on the structure of the data you are trying to access. In my case the data is stored in an array of arrays and I just want the first one. If you come up with a better solution, please feel free to share it.
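One way to drop the eval is getattr, which looks up an attribute by a name string. A sketch with a SimpleNamespace standing in for impdat.root.data (with real PyTables you would iterate __members__ and slice with [0][:] exactly as above):

```python
from types import SimpleNamespace

# Stand-in for impdat.root.data: attributes named like the .mat fields
alldata = SimpleNamespace(rdg=[[1, 2, 3]], freqlabels=[["a", "b"]])

datastruct = {}
for name in ["rdg", "freqlabels"]:  # stand-in for __members__
    # getattr(alldata, "rdg") is equivalent to alldata.rdg -- no eval needed
    datastruct[name] = getattr(alldata, name)[0][:]
```

getattr works with any attribute-based API, so the same loop applies unchanged to the PyTables node.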
I can't seem to replicate your code; I get an error when trying to open the file, which I made in MATLAB 8.0, using tables.
How about taking the variables within the structure and saving them to a new .mat file that contains only a collection of variables? This would make it much easier to deal with, and it has already been answered quite eloquently here.
That answer states that v7.3 .mat files are simply HDF5 files, which can be read with:
import numpy as np, h5py
f = h5py.File('somefile.mat','r')
data = f.get('data/variable1')
data = np.array(data) # For converting to numpy array
I'm not sure of the size of the data set you're working with. If it's large, I'm sure I could come up with a script to pull the fields out of the structures. I did find this tool, which may be helpful: it recursively gets all of the structure field names.