Python: Creating time series with multiple channels for ConvNet

I am using convolutional networks for time series forecasting. I use rolling windows to take the last t points as one sample, and every feature becomes a channel, so I end up with a set of multichannel time series. The data need to be in 3 dimensions: [n_samples, window_size, features]. The original data set has shape [n_samples, features] and is already in ascending time order. My problem is that the way I am creating this 3D tensor crashes my computer, since I have close to 500k rows. This is the code I am using:
prueba = x_data  # This data set has shape [500k, 20]
window_size = 100  # I am taking the last 100 days
n_units, n_features = prueba.shape
n_samples = n_units - window_size + 1  # Number of samples you get from the rolling windows
data_list = []
for init_index in range(n_samples):
    fin_index = window_size + init_index
    window_set = prueba[init_index:fin_index, :]
    window_flat = np.reshape(window_set, (1, window_size * n_features))
    data_list.append(window_flat)
features_tensor = np.concatenate(data_list, axis=0)
features_tensor = np.reshape(features_tensor, (n_samples, window_size, n_features))  # This breaks my computer
The problem is that my computer crashes when I use np.concatenate to put together all the individual windows I create. Does anyone know a faster way to do this? I am trying to think of a way to avoid np.concatenate, but so far I haven't been able to figure one out.

Using the approach you have here (which ends in np.concatenate) is quite inefficient, since you are duplicating every data point (roughly) window_size times. That is almost certainly a waste of memory: whatever operation acts on this dataset should, ideally, be able to work on a rolling basis, going through the time series without ever seeing the fully expanded, vastly duplicated dataset in tensor format.
So, I suggest that the better approach is to find a way to avoid building this redundant tensor in the first place.
Since we don't know what you are doing with this tensor, it's not possible to give a definitive answer. However, here are a few things to consider:
One "right" way to do this is to use pandas, which has a rolling window feature df.rolling()docs here. This does exactly what you want (performs computations on a rolling window, without a big redundant tensor), but of course only if that works with the downstream code.
If you are using TensorFlow, you'll be better served by writing a generator that yields each window when called, which can be wrapped in a tf.data.Dataset (see the Dataset.from_generator() method and the examples in the TensorFlow docs).
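A sketch of that generator approach, reusing the names from your code (the float32 cast and the batch size of 32 are assumptions):

import numpy as np
import tensorflow as tf

window_size = 100
n_units, n_features = prueba.shape
n_samples = n_units - window_size + 1

def window_generator():
    # Yield one (window_size, n_features) window at a time, so the full
    # 3D tensor never has to exist in memory.
    for start in range(n_samples):
        yield prueba[start:start + window_size, :].astype(np.float32)

dataset = tf.data.Dataset.from_generator(
    window_generator,
    output_signature=tf.TensorSpec(shape=(window_size, n_features), dtype=tf.float32),
).batch(32)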
In Keras, try TimeseriesGenerator, which has this capability built in (see the Keras docs).
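A sketch, assuming targets is a label array aligned with prueba (one target per row) and model is your already-built ConvNet:

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

# Builds 100-step windows on the fly instead of materialising them up front.
gen = TimeseriesGenerator(prueba, targets, length=100, batch_size=32)
model.fit(gen, epochs=10)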

Related

Fast Savgol Filter on 3D Tensor

I have a tensor of example shape (543, 133, 3), meaning 543 frames, each with 133 points of X, Y, Z.
I would like to run savgol_filter on every point in every dimension; however, doing it naively is quite slow:
points, frames, dims = data.shape  # data has example shape (543, 133, 3)
new_data = []
for point in range(points):
    new_dims = []
    for dim in range(dims):
        new_dims.append(scipy.signal.savgol_filter(data[point, :, dim], 3, 1))
    new_data.append(new_dims)
tensor = np.array(new_data)
On my computer, for this small tensor, this takes 300ms, which is quite a long time.
Is there a way to make this faster?
This is by no means the fastest method, but it should be quite a lot faster than what you're currently doing. We can utilize vectorized operations instead of for loops to achieve much better performance.
From your code, it seems like you want to smooth along the 133-point dimension (axis 1), so you can apply the Savitzky-Golay filter to everything at once with savgol_filter(data, 3, 1, axis=1). In general, you can specify the axis along which the filter is applied. On my computer, this brought the computation from 500ms down to 2ms.
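For illustration, a minimal sketch with random data of the shape from the question (the random array is just a stand-in for your tensor):

import numpy as np
from scipy.signal import savgol_filter

data = np.random.rand(543, 133, 3)  # stand-in for the real (543, 133, 3) tensor

# One vectorized call along axis 1 (the 133-point axis) replaces both loops.
smoothed = savgol_filter(data, window_length=3, polyorder=1, axis=1)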
A side note: since you care about performance, I would pay attention to your data order. Depending on what you're doing, it might be worth reordering your data once to save time.
For example: say you have a matrix of 5 signals (shape 5x299). If you want a single signal, that's easy: signal[0]. This doesn't copy any data; it is just a "view" of the memory. But what if you want a particular band across all signals? signal[:, 0] is a strided access that has to touch every signal to pick out that index, so it can't be read as one contiguous block of memory. If you had first transposed the matrix, the first index would directly give you that band of every spectrum, with no iteration needed. Data order can be an important part of getting the best performance out of your computations.
There are two related concepts here: contiguous memory and vectorized operations. The full story of why data order matters has more complications, and you will need to do your own research to determine which data ordering gives the best performance for your application. The big thing to watch out for is C versus Fortran contiguous memory layout.
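As a rough illustration of the contiguity point above (array sizes follow the 5x299 example; the final reordering step is only worthwhile if you access columns repeatedly):

import numpy as np

signals = np.random.rand(5, 299)             # 5 signals of length 299, C-contiguous

row = signals[0]                             # contiguous view, no copy needed
col = signals[:, 0]                          # strided access that touches every row

print(signals.flags['C_CONTIGUOUS'])         # True
print(signals.T.flags['C_CONTIGUOUS'])       # False: the transpose is Fortran-contiguous

# If columns are accessed repeatedly, reordering (and copying) once can pay off:
by_band = np.ascontiguousarray(signals.T)    # shape (299, 5); each band row is now contiguous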
Here are some resources I found: (not an endorsement)
Stack Overflow question on contiguous memory: What is the difference between contiguous and non-contiguous arrays?
Towards Data Science article on vectorized operations https://towardsdatascience.com/vectorization-must-know-technique-to-speed-up-operations-100x-faster-50b6e89ddd45

python xarray write to netcdf file very slow

list_of_ds = []
for fname in ids['fnames']:
    aq = xr.open_dataset(fname, chunks='auto', mask_and_scale=False)
    aq = aq[var_lists]
    aq = aq.isel(lat=slice(yoff, yoff + ysize), lon=slice(xoff, xoff + xsize))
    list_of_ds.append(aq)
    aq.close()
all_ds = xr.concat(list_of_ds, dim='time')
all_ds.to_netcdf('tmp.nc')
Hi all, I am using xarray to read netCDF files (around 1000) and save selected results to a temporary file, as shown above. However, the saving part runs very slowly. How can I speed this up?
I also tried loading the data directly, but it is still very slow.
I've also tried using open_mfdataset with parallel=True, and it's also slow:
aq = xr.open_mfdataset(
    sorted(ids_list),
    data_vars=var_lists,
    preprocess=add_time_dim,
    combine='by_coords',
    mask_and_scale=False,
    decode_cf=False,
    parallel=True,
)
aq.isel({'lon': irlon, 'lat': irlat}).to_netcdf('tmp.nc')
Unfortunately, concatenating ~1000 files in xarray will be slow; there's no great way around that.
It's hard for us to offer specific advice without more detail about your data and setup. But here are some things I'd try:
Use xr.open_mfdataset. Your second code block looks great; dask will generally be faster and more efficient at managing these tasks than a hand-written for loop.
Make sure your chunks are aligned with how you're slicing the data. You don't want to read in more than you have to. If you're reading netCDFs, you have flexibility in how the data is read into dask. Since you're selecting (it looks like) a small spatial region within each array, it may make sense to explicitly chunk the data so that you only read a small portion of each array, e.g. with chunks={"lat": 50, "lon": 50}. You'll want to balance a few things here: keep chunk sizes manageable but not too small (too many tiny chunks leads to too many tasks). As a general rule, shoot for chunks in the ~100-500 MB range, and try to keep the total number of tasks below ~1 million (or the number of chunks below roughly 10-100k across all your datasets).
Be explicit about your concatenation. The more "magic" the process feels, the more work xarray is doing to infer what you mean. Generally, combine='nested' performs better than combine='by_coords', so if your files are structured logically along one or more dimensions, it helps to arrange them explicitly in that order rather than relying on coordinate inference.
Skip the pre-processing. If you can, add new dimensions at concatenation time rather than as an ingestion step. This lets dask plan the computation more fully, instead of treating your preprocess function as a black box, and, worse, as a prerequisite to scheduling the final array construction (because with combine='by_coords' the coords are the result of an earlier dask operation). If you need to attach a time dim to each file, with one element per file, something like xr.open_mfdataset(files, concat_dim=pd.Index(pd.date_range("2020-01-01", freq="D", periods=1000), name="time"), combine="nested") works well in my experience.
If this is all still taking too long, you could try pre-processing the data. Using a compiled utility like nco, or simply subsetting the data and grouping smaller subsets into larger files with dask.distributed's client.map, might help cut down the complexity of the final dataset join. A rough sketch combining a few of the suggestions above follows.
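Putting a few of these suggestions together, a sketch (variable names follow the question; the chunk sizes and the daily time index are assumptions you would adapt to your files):

import pandas as pd
import xarray as xr

ds = xr.open_mfdataset(
    sorted(ids_list),
    combine='nested',
    concat_dim=pd.Index(
        pd.date_range('2020-01-01', freq='D', periods=len(ids_list)), name='time'
    ),
    chunks={'lat': 50, 'lon': 50},   # chunk the spatial dims so only small pieces are read
    mask_and_scale=False,
    decode_cf=False,
    parallel=True,
)
subset = ds[var_lists].isel(
    lat=slice(yoff, yoff + ysize),
    lon=slice(xoff, xoff + xsize),
)
subset.to_netcdf('tmp.nc')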

SatPy load all channels at once

I have a Scene object, and I would like to load all channels into a numpy array of shape (24, 24, 3), where 3 is the number of channels.
scene_xybox = scn.crop(xy_bbox=box)
Currently I have to select each channel:
channel = scene_xybox['VIS006'].values
then repeat, and stack at the end.
Is there a way to get the stacked numpy array in one line?
This takes 5 seconds for each box; I have many files, and doing the same operation for multiple boxes per image across multiple images will take a very long time.
A perfect answer may require more information from you regarding what your end goal is, how many "boxes" you are cutting out, etc. But I'll see what I can clear up first. I assume you are not resampling the data with Scene.resample in your code at all.
Satpy uses dask, so if possible it is best to compute everything at once, or at least to limit how many times things are computed (.values computes the dask array). If you have a lot of boxes to cut out and your system has the available memory, you may want to calculate the slices yourself for all the xy bboxes (I think there are methods to help with this), load the entire image (see xr.concat below), and then use basic slicing to get each box cutout. This saves you from loading the data from disk each time you call .values, and it also helps with the other files you have, since the slices should be the same across all times (except for special instrument cases).
You say you want the final shape to be (rows, cols, N). Is there a good reason you can't use (N, rows, cols)? The latter should be faster, since the arrays stay in their original contiguous form. And if whatever processing you do afterwards can be done with dask, it would also "flow" well with the tasks being created.
You can use xr.concat, passing all the DataArrays at once, and then call .values to get the full numpy array underneath. This should compute all the bands at the same time. Something like:
final_arr = xr.concat([scn['VIS006'], scn['band2'], scn['band3']], "bands").values
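Building on that, a sketch of cutting several boxes out of one computed array (the band names besides 'VIS006' and the box_slices list of precomputed index slices are placeholders):

import numpy as np
import xarray as xr

# Stack the bands once and compute a single numpy array.
full_arr = xr.concat([scn['VIS006'], scn['band2'], scn['band3']], "bands").values

boxes = []
for row_slice, col_slice in box_slices:          # e.g. (slice(r0, r0 + 24), slice(c0, c0 + 24))
    cutout = full_arr[:, row_slice, col_slice]   # shape (3, 24, 24)
    boxes.append(np.moveaxis(cutout, 0, -1))     # (24, 24, 3) if that axis order is required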

Slow loop in Python to search data in another data frame

I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the different stations where each observation starts and ends (called 'info'). I am trying to get a data frame where I have the latitude and longitude next to each station in each observation. My code in Python:
for i in range(0, 15557580):
    for j in range(0, 542):
        if data.year[i] == '2018' and data.station[i] == info.station[j]:
            data.latitude[i] = info.latitude[j]
            data.longitude[i] = info.longitude[j]
            break
but since I have about 15 million observations, doing it this way takes a lot of time. Is there a quicker way of doing it?
Thank you very much (I am still new to this).
edit :
My file info looks like this (about 500 observations, one for each station).
My file data looks like this (there are other variables not shown here; about 15 million observations, one for each trip).
And what I am looking to get is that when the station numbers match, the resulting data would look like this:
This is one solution. You could also use pandas.merge to add the two new columns to data and perform the equivalent mapping (a sketch of the merge approach is shown after the code below).
# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['longitude']
# calculate a Boolean mask on year
mask = data['year'] == '2018'
# apply the mappings; where no mapping is found, fillna keeps the original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
    .fillna(data.loc[mask, 'latitude'])
data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
    .fillna(data.loc[mask, 'longitude'])
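For completeness, a sketch of the pandas.merge alternative mentioned above (column names are taken from the question; the suffix handling assumes data already has latitude/longitude columns):

# Left-join the station coordinates onto the big frame in one vectorized step.
merged = data.merge(
    info[['station', 'latitude', 'longitude']],
    on='station', how='left', suffixes=('', '_info'),
)
mask = merged['year'] == '2018'
merged.loc[mask, 'latitude'] = merged.loc[mask, 'latitude_info']
merged.loc[mask, 'longitude'] = merged.loc[mask, 'longitude_info']
merged = merged.drop(columns=['latitude_info', 'longitude_info'])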
This is a very common and important issue that anyone who starts dealing with large datasets runs into. Big data is a whole subject in itself; here is a quick introduction to the main concepts.
1. Prepare your dataset
In big data, 80% to 90% of the time is spent gathering, filtering and preparing the datasets. Create subsets of the data that are optimized for your further processing.
2. Optimize your script
Short code does not always mean optimized code in terms of performance. In your case, without knowing your dataset, it is hard to say exactly how you should process it; you will have to figure out on your own how to avoid as much computation as possible while getting exactly the same result. Try to avoid any unnecessary computation.
You can also consider splitting the work over multiple threads if appropriate.
As a general rule, you should not use for loops that you break out of: whenever you don't know in advance precisely how many iterations you will need, you should use while or do...while loops instead.
3. Consider using distributed storage and computing
This is a subject in itself that is way too big to be all explained here.
Storing, accessing and processing data in a serialized way is fast for small amounts of data, but very inappropriate for large datasets. Instead, we use distributed storage and computing frameworks.
These aim at doing everything in parallel and rely on a concept named MapReduce.
The first distributed data storage framework was Hadoop (e.g. the Hadoop Distributed File System, HDFS). This framework has its advantages and flaws, depending on your application.
In any case, if you are willing to use this framework, it will probably be more appropriate not to use MapReduce directly on top of HDFS, but to use an upper-level, preferably in-memory, framework such as Spark or Apache Ignite on top of HDFS. Also, depending on your needs, have a look at frameworks such as Hive, Pig or Sqoop.
Again, this subject is a whole different world, but it might very well be suited to your situation. Feel free to read up on all these concepts and frameworks, and leave your questions in the comments if needed.

Running filter over a large number of data points and a long time period?

I need to apply two running filters to a large amount of data. I have read that creating variables on the fly is not a good idea, but I wonder whether it might still be the best solution in my case.
My question:
Can I create arrays in a loop with the help of a counter (array1, array2, …) and then call them with the counter (something like 'array' + str(counter) or 'array' + str(counter - 1))?
Why I want to do it:
The data are 400x700 arrays for 15-minute time steps over a year (so I have about 35,000 of these 400x700 arrays). Each time step is read into Python individually. I need to apply one running filter that checks whether the last four time steps are equal (element-wise) and, if they are, sets all four values to zero. The next filter uses the data after the first filter has run and checks whether the sum of the last twelve time steps exceeds a certain value. When both filters are done, I want to sum up the values, so that at the end of the year I have one 400x700 array with the filtered, accumulated values.
I do not have enough memory to read in all the data at once. So I thought I could create a loop where, for each time step, a new variable for the 400x700 array is created and the two filters run. The older arrays that have been filtered could then be added to the yearly sum and deleted, so that I never have more than 16 (4+12) time steps (arrays) in memory at any time.
I don't know if it's appropriate to ask such a question without any code to show, but I would really appreciate the help.
If your question is about the best data structure for keeping a certain number of arrays in memory, I would suggest using a three-dimensional array. Its shape would be (400, 700, 12), since twelve is how many arrays you need to look back at. The advantage is that your memory use stays constant as you load new arrays into the larger one; the disadvantage is that you need to shift all the arrays manually.
If you don't want to deal with the shifting yourself, I'd suggest using a deque with a maxlen of 12.
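A minimal sketch of the deque approach (read_time_steps() is a placeholder for however you load one 400x700 array per 15-minute step; the filter logic itself is left as described in the question):

from collections import deque
import numpy as np

window = deque(maxlen=12)                 # keeps at most the last 12 time steps
yearly_sum = np.zeros((400, 700))

for step in read_time_steps():            # yields one (400, 700) array per time step
    window.append(step)                   # oldest array is dropped automatically
    if len(window) == window.maxlen:
        stacked = np.stack(window)        # shape (12, 400, 700)
        # ... apply the two running filters from the question to `stacked` here ...
        yearly_sum += stacked[0]          # accumulate the oldest, fully filtered step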
"Can I create arrays in a loop with the help of a counter (array1, array2…) and then call them with the counter (something like: ‘array’+str(counter) or ‘array’+str(counter-1)?"
This is a very common question that I think a lot of programmers will face eventually. Two examples for Python on Stack Overflow:
generating variable names on fly in python
How do you create different variable names while in a loop? (Python)
The lesson to learn from this is to not use dynamic variable names, but instead put the pieces of data you want to work with in an encompassing data structure.
The data structure could, for example, be a list, a dict or a NumPy array. The collections.deque proposed by @Midnighter also seems to be a good candidate for such a running filter.
