I have about a million rows of data with lat and lon attached, and more to come. Even now, reading the data from a SQLite file (I read it with pandas, then create a point for each row) takes a lot of time.
Now I need to do a spatial join over those points to assign a zip code to each one, and I really want to optimise this process.
So I wonder: is there any relatively easy way to parallelize those computations?
I am assuming you have already implemented GeoPandas and are still finding difficulties?
You can improve this by further hashing your coordinate data, similar to how Google hashes its search data. Some databases already provide support for these types of operations (e.g. MongoDB). Imagine taking the first (leftmost) digit of your coordinates and putting each set of corresponding data into a separate SQLite file. Each digit then acts as a hash pointing to the correct file to search. Your lookup time improves by a factor of roughly 19 (one bucket per value in range(-9, 10)), assuming the hash lookup itself takes minimal time in comparison.
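A minimal sketch of the bucketing idea above. It uses a coarse grid cell (10 degrees by default) rather than the literal first digit, and the helper names and the one-file-per-cell naming scheme are hypothetical, not part of any library:

```python
import math
from collections import defaultdict


def bucket_key(lat, lon, cell_deg=10.0):
    """Map a coordinate to a coarse grid cell that serves as the hash key."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def bucket_filename(key):
    """Hypothetical naming scheme: one SQLite file per grid cell."""
    return f"points_{key[0]}_{key[1]}.sqlite"


def partition(rows, cell_deg=10.0):
    """Group (lat, lon, payload) rows by grid cell."""
    buckets = defaultdict(list)
    for lat, lon, payload in rows:
        buckets[bucket_key(lat, lon, cell_deg)].append((lat, lon, payload))
    return dict(buckets)
```

A lookup then only has to open the file returned by bucket_filename(bucket_key(lat, lon)) instead of scanning every row.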
As it turned out, the most convenient solution in my case is to use the pandas.read_sql function with a specific chunksize parameter. In this case it returns a generator of data chunks, which can be fed to mp.Pool().map() along with the job.
In my case the job consists of 1) reading the geoboundaries, 2) doing the spatial join for the chunk, and 3) writing the chunk to the database.
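The approach described above could be sketched roughly as follows. This is not the poster's actual code: the file name zipcodes.shp, the lat/lon column names, and the output table name are assumptions, and results are written to a separate output database from the parent process to avoid SQLite lock contention while the input is still being read.

```python
import sqlite3
import multiprocessing as mp
from contextlib import closing

import pandas as pd

ZIP_PATH = "zipcodes.shp"  # hypothetical path to the zip code polygons


def iter_chunks(db_path, query, chunksize):
    """Yield DataFrames of at most `chunksize` rows from a SQLite file."""
    # The pool consumes this generator from a helper thread, hence the flag.
    conn = sqlite3.connect(db_path, check_same_thread=False)
    with closing(conn):
        yield from pd.read_sql(query, conn, chunksize=chunksize)


def join_chunk(chunk):
    """Spatially join one chunk of points (lat/lon columns) to zip polygons."""
    import geopandas as gpd  # imported here so only the workers need it

    zips = gpd.read_file(ZIP_PATH)  # assumes the file declares a CRS
    points = gpd.GeoDataFrame(
        chunk,
        geometry=gpd.points_from_xy(chunk["lon"], chunk["lat"]),
        crs="EPSG:4326",  # assumes plain WGS84 lat/lon input
    ).to_crs(zips.crs)
    joined = gpd.sjoin(points, zips, how="left", predicate="within")
    # Drop the geometry so the result is cheap to send back and store.
    return pd.DataFrame(joined.drop(columns="geometry"))


def run(db_path, query, out_path, chunksize=100_000, workers=4):
    """Join all chunks in parallel and append the results to `out_path`."""
    chunks = iter_chunks(db_path, query, chunksize)
    with mp.Pool(workers) as pool, closing(sqlite3.connect(out_path)) as out:
        for joined in pool.imap_unordered(join_chunk, chunks):
            joined.to_sql("points_zip", out, if_exists="append", index=False)
```

On platforms that spawn worker processes, the call to run(...) should sit under an `if __name__ == "__main__":` guard. Reading the polygons once per worker (for example in a Pool initializer) would avoid repeating that step for every chunk.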
This method is completely dependent on your spatial scale, but one way you might parallelize your join would be to subdivide your polygons into subpolygons and then offload the work to separate threads in separate cores. This geopandas r-tree tutorial demonstrates that technique, subdividing a large polygon into many small ones and intersecting each with a large set of points. But again, this only works if your spatial scale is appropriate: ie, a few polygons and a lot of points (such as a few zip code polygons and millions of points in and around them).
for fname in ids['fnames']:
    aq = xr.open_dataset(fname, chunks='auto', mask_and_scale=False)
    aq = aq[var_lists]
    aq = aq.isel(lat=slice(yoff, yoff+ysize), lon=slice(xoff, xoff+xsize))
    list_of_ds.append(aq)
    aq.close()
all_ds = xr.concat(list_of_ds, dim='time')
all_ds.to_netcdf('tmp.nc')
Hi all, I am using xarray to read netCDF files (around 1000) and save selected results to a temporary file, as shown above. However, the saving part runs very slowly. How can I speed this up?
I also tried loading the data directly, but it is still very slow.
I've also tried using open_mfdataset with parallel=True, and it's also slow:
aq = xr.open_mfdataset(
    sorted(ids_list),
    data_vars=var_lists,
    preprocess=add_time_dim,
    combine='by_coords',
    mask_and_scale=False,
    decode_cf=False,
    parallel=True,
)
aq.isel({'lon':irlon,'lat':irlat}).to_netcdf('tmp.nc')
Unfortunately, concatenating ~1000 files in xarray will be slow. Not a great way around that.
It's hard for us to offer specific advice without more detail about your data and setup. But here are some things I'd try:
Use xr.open_mfdataset. Your second code block looks great. dask will generally be faster and more efficient at managing tasks than you will be with a for loop.
Make sure your chunks are aligned with how you're slicing the data. You don't want to read in more than you have to. If you're reading netCDFs, you have flexibility about how to read the data into dask. Since you're selecting (it looks like) a small spatial region within each array, it may make sense to explicitly chunk the data such that you're only reading in a small portion of each array, e.g. with chunks={"lat": 50, "lon": 50}. You'll want to balance a few things here: making sure the chunk sizes are manageable and not too small (leading to too many tasks). Aim for chunks in the ~100-500 MB range as a general rule, and try to keep the number of tasks below 1 million (or the number of chunks below ~10-100k across all your datasets).
Be explicit about your concatenation. The more "magic" the process feels, the more work xarray is doing to infer what you mean. Generally, combine='nested' performs better than 'by_coords', so if you're concatenating files which are structured logically along one or more dimensions, it may help to arrange the files in the same way a dim is provided.
Skip the pre-processing. If you can, add new dimensions on concatenation rather than as an ingestion step. This allows dask to more fully plan the computation, rather than treating your preprocess function as a black box and, what's worse, as a prerequisite to scheduling the final array construction operation because you're using combine='by_coords', where the coords are the result of an earlier dask operation. If you need to attach a time dim to each file, with 1 element per file, something like xr.open_mfdataset(files, concat_dim=pd.Index(pd.date_range("2020-01-01", freq="D", periods=1000), name="time"), combine="nested") works well in my experience.
If this is all taking too long, you could try pre-processing the data. Using a compiled utility like nco or even just subsetting the data and grouping smaller subsets of the data into larger files using dask.distributed's client.map might help cut down on the complexity of the final dataset join.
I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the different stations where each observation starts and ends (called 'info'). I am trying to get a data frame with the latitude and longitude next to each station in each observation. My code in Python:
for i in range(0,15557580):
    for j in range(0,542):
        if data.year[i] == '2018' and data.station[i]==info.station[j]:
            data.latitude[i] = info.latitude[j]
            data.longitude[i] = info.longitude[j]
            break
But since I have about 15 million observations, this takes a lot of time. Is there a quicker way of doing it?
Thank you very much (I am still new to this)
edit :
my file info looks like this (about 500 observation, one for each station)
my file data like this (theres other variables not shown here) (about 15 million observations , one for each travel)
and what i am looking to get is that when the stations numbers match that the resulting data would look like this :
This is one solution. You can also use pandas.merge to add 2 new columns to data and perform the equivalent mapping.
# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['longitude']
# calculate Boolean mask on year
mask = data['year'] == '2018'
# apply mappings, if no map found use fillna to retrieve original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
    .fillna(data.loc[mask, 'latitude'])
data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
    .fillna(data.loc[mask, 'longitude'])
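The pandas.merge alternative mentioned above might look like the sketch below. It assumes data already has latitude and longitude columns (as the question's loop implies) and that year is stored as a string; the function name add_station_coords is made up for illustration.

```python
import pandas as pd


def add_station_coords(data, info):
    """Copy station coordinates from `info` onto the 2018 rows of `data`."""
    # One row per station, so the left join cannot duplicate rows of `data`.
    coords = info[["station", "latitude", "longitude"]].drop_duplicates("station")
    merged = data.merge(coords, on="station", how="left", suffixes=("", "_info"))
    # Only overwrite 2018 rows for which a matching station was found.
    use_info = (merged["year"] == "2018") & merged["latitude_info"].notna()
    merged.loc[use_info, "latitude"] = merged.loc[use_info, "latitude_info"]
    merged.loc[use_info, "longitude"] = merged.loc[use_info, "longitude_info"]
    return merged.drop(columns=["latitude_info", "longitude_info"])
```

Both this and the map-based version replace the 15 million x 542 Python-level comparisons with a single vectorized lookup keyed on station.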
This is a very recurrent and important issue when anyone starts to deal with large datasets. Big Data is a whole subject in itself, here is a quick introduction to the main concepts.
1. Prepare your dataset
In big data, 80% to 90% of the time is spent gathering, filtering and preparing your datasets. Create subsets of data, making them optimized for your further processing.
2. Optimize your script
Short code does not always mean optimized code in terms of performance. In your case, without knowing your dataset, it is hard to say exactly how you should process it; you will have to figure out on your own how to avoid as much computation as possible while getting the exact same result. Try to avoid any unnecessary computation.
You can also consider splitting the work over multiple threads if appropriate.
As a general rule, you should not use for loops and break them inside. Whenever you don't know precisely how many loops you will have to go through in the first place, you should always use while or do...while loops.
3. Consider using distributed storage and computing
This is a subject in itself that is way too big to be all explained here.
Storing, accessing and processing data in a serialized way is faster for small amounts of data but very inappropriate for large datasets. Instead, we use distributed storage and computing frameworks.
Distributed computing aims at doing everything in parallel. It relies on a concept named MapReduce (MR).
One of the first widely used distributed storage frameworks was Hadoop (with the Hadoop Distributed File System, or HDFS). This framework has its advantages and flaws, depending on your application.
In any case, if you are willing to use this framework, it will probably be more appropriate not to use MR directly on top of HDFS, but to use a higher-level framework, preferably an in-memory one, such as Spark or Apache Ignite on top of HDFS. Also, depending on your needs, have a look at frameworks such as Hive, Pig or Sqoop.
Again, this subject is a whole different world, but it might very well suit your situation. Feel free to read up on all these concepts and frameworks, and leave your questions in the comments if needed.
I have two GPX files (from a race I ran twice, obtained via the Strava API) and I would like to be able to compare the effort across both. The sampling frequency is irregular however (i.e. data is not recorded every second, or every meter), so a straightforward comparison is not possible and I would need to standardize the data first. Preferably, I would resample the data so that I have data points for every 10 meters for example.
I'm using Pandas, so I'm currently standardizing a single file by inserting rows for every 10 meters and interpolating the heartrate, duration, lat/lng, etc from the surrounding data points. This works, but doesn't make the data comparable across files, as the recording does not start at the exact same location.
An alternative is first standardizing the course coordinates using something like geohashing and then trying to map both efforts to this standardized course. Since coordinates can not be easily sorted, I'm not sure how to do that correctly however.
Any pointers are appreciated, thanks!
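For reference, the single-file standardization described above (one sample every 10 meters, interpolated from the surrounding points) can be sketched with numpy as follows. The function names are illustrative, the haversine distance assumes a spherical Earth, and this does not address aligning the start points of two recordings.

```python
import numpy as np


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between points given in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))


def resample_by_distance(lat, lon, values, step_m=10.0):
    """Interpolate each series in `values` onto a regular distance grid."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # Cumulative distance travelled at every recorded point.
    seg = haversine_m(lat[:-1], lon[:-1], lat[1:], lon[1:])
    dist = np.concatenate(([0.0], np.cumsum(seg)))
    grid = np.arange(0.0, dist[-1] + 1e-9, step_m)
    resampled = {
        name: np.interp(grid, dist, np.asarray(series, dtype=float))
        for name, series in values.items()
    }
    return grid, resampled
```

Passing lat and lon themselves in values yields interpolated coordinates on the same grid as heart rate and duration.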
I'm writing a program that creates vario-function plots for a fixed region of a digital elevation model that has been converted to an array. I calculate the variance (difference in elevation) and lag (distance) between point pairs within the window constraints. Every array position is compared with every other array position. For each pair, the lag and variance values are appended to separate lists. Once all pairs have been compared, these lists are then used for data binning, averaging and eventually plotting.
The program runs fine for smaller window sizes (say 60x60 px). For windows up to about 120x120 px or so, which would give 2 lists of 207,360,000 entries, I am able to slowly get the program running. Greater than this, and I run into "MemoryError" reports - e.g. for a 240x240 px region, I would have 3,317,760,000 entries
At the beginning of the program, I create an empty list:
variance = []
lag = []
Then within a for loop where I calculate my lags and variances, I append the values to the different lists:
variance.append(var_val)
lag.append(lag_val)
I've had a look over the Stack Overflow pages and have seen a similar issue discussed here. This solution would potentially improve run-time performance; however, it only goes up to 100 million entries and therefore doesn't help with the larger regions (as with the 240x240 px example). I've also considered using numpy arrays to store the values, but I don't think this will stave off the memory issues.
Any suggestions for ways to use some kind of list of the proportions I have defined for the larger window sizes would be much appreciated.
I'm new to python so please forgive any ignorance.
The main bulk of the code can be seen here
Use the array module of Python. It offers some list-like types that are more memory efficient (but cannot be used to store random objects, unlike regular lists). For example, you can have arrays containing regular floats ("doubles" in C terms), or even single-precision floats (four bytes each instead of eight, at the cost of a reduced precision). An array of 3 billion such single-floats would fit into 12 GB of memory.
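A small sketch of how the two lists could be swapped for array objects. The pair loop is a stand-in for the asker's own code: it visits each unordered pair once and uses the absolute elevation difference as a placeholder for the variance value.

```python
import math
from array import array


def collect_pairs(points):
    """Return (lag, variance) for all point pairs as single-precision arrays.

    `points` is a sequence of (x, y, elevation) tuples.
    """
    lag = array("f")       # 'f' = C float, 4 bytes per entry
    variance = array("f")
    n = len(points)
    for i in range(n):
        x1, y1, z1 = points[i]
        for j in range(i + 1, n):
            x2, y2, z2 = points[j]
            lag.append(math.hypot(x2 - x1, y2 - y1))
            variance.append(abs(z2 - z1))  # placeholder for the real formula
    return lag, variance
```

The arrays support append, indexing and slicing like lists, so the rest of the binning code should need few changes; using typecode "d" instead gives full double precision at 8 bytes per entry.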
You could look into PyTables, a library wrapping the HDF5 C library that can be used with numpy and pandas.
Essentially PyTables will store your data on disk and transparently load it into memory as needed.
Alternatively if you want to stick to pure python, you could use a sqlite3 database to store and manipulate your data - the docs say the size limit for a sqlite database is 140TB, which should be enough for your data.
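As a rough sketch of the sqlite3 route, the pairs can be written to a table in batches and the binning and averaging done in SQL, so the full set of values never has to sit in memory at once. The table layout and function names below are assumptions for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a database holding one row per point pair."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS pairs (lag REAL, variance REAL)")
    return conn


def add_pairs(conn, pairs):
    """Insert an iterable of (lag, variance) tuples in one transaction."""
    with conn:
        conn.executemany("INSERT INTO pairs VALUES (?, ?)", pairs)


def binned_means(conn, bin_width):
    """Return (bin index, mean variance, pair count) for each lag bin."""
    sql = (
        "SELECT CAST(lag / ? AS INTEGER) AS bin, AVG(variance), COUNT(*) "
        "FROM pairs GROUP BY bin ORDER BY bin"
    )
    return conn.execute(sql, (float(bin_width),)).fetchall()
```

Calling add_pairs with a generator, or once per batch of a few hundred thousand pairs, keeps memory use flat; expect inserts on this scale to be much slower than in-memory appends.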
try using heapq, import heapq. It uses the heap for storage rather than the stack allowing you to access the computer full memory.
I'm currently rewriting some python code to make it more efficient and I have a question about saving python arrays so that they can be re-used / manipulated later.
I have a large amount of data saved in CSV files. Each file contains time-stamped values of the data that I am interested in, and I have reached the point where I have to deal with tens of millions of data points. The data set has grown so large that the processing time is excessive: the way the current code is written, the entire data set has to be reprocessed every time some new data is added.
What I want to do is this:
Read in all of the existing data to python arrays
Save the variable arrays to some kind of database/file
Then, the next time more data is added, I load my database, append the new data, and resave it. This way only a small amount of data needs to be processed at any one time.
I would like the saved data to be accessible to further python scripts but also to be fairly "human readable" so that it can be handled in programs like OriginPro or perhaps even Excel.
My question is: what's the best format to save the data in? HDF5 seems like it might have all the features I need, but would something like SQLite make more sense?
EDIT: My data is single dimensional. I essentially have 30 arrays which are (millions, 1) in size. If it wasn't for the fact that there are so many points then CSV would be an ideal format! I am unlikely to want to do lookups of single entries---more likely is that I might want to plot small subsets of data (eg the last 100 hours, or the last 1000 hours, etc).
HDF5 is an excellent choice! It has a nice interface, is widely used (in the scientific community at least), and many programs support it (MATLAB, for example). There are libraries for C, C++, Fortran, Python, and more, and it comes with a complete toolset for displaying the contents of an HDF5 file. If you later want to do complex MPI calculations on your data, HDF5 supports concurrent reads and writes. It is very well suited to handling very large datasets.
Maybe you could use some kind of key-value database like Redis, Berkeley DB, MongoDB... But it would help to have some more information about the schema you would be using.
EDITED
If you choose Redis for example, you can index very long lists:
The max length of a list is 2^32 - 1 elements (4294967295, more than 4
billion of elements per list). The main features of Redis Lists from
the point of view of time complexity are the support for constant time
insertion and deletion of elements near the head and tail, even with
many millions of inserted items. Accessing elements is very fast near
the extremes of the list but is slow if you try accessing the middle
of a very big list, as it is an O(N) operation.
I would use a single file with fixed record length for this use case. No specialised DB solution is needed (it seems overkill to me in this case), just plain old struct (see the documentation for the struct module) and read()/write() on a file. If you have just millions of entries, everything should work nicely in a single file of some dozens or hundreds of MB (which is hardly too large for any file system). You also have random access to subsets in case you need that later.
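A minimal sketch of the fixed-record idea, assuming each record is a (timestamp, value) pair stored as two little-endian doubles; the record layout and function names are illustrative choices, not a prescribed format.

```python
import struct

# One record = timestamp and value, each an 8-byte little-endian double.
RECORD = struct.Struct("<dd")


def append_records(path, records):
    """Append (timestamp, value) pairs to the end of the file."""
    with open(path, "ab") as f:
        for ts, value in records:
            f.write(RECORD.pack(ts, value))


def read_records(path, start, count):
    """Read up to `count` records beginning at record index `start`."""
    with open(path, "rb") as f:
        f.seek(start * RECORD.size)  # fixed size makes random access trivial
        data = f.read(count * RECORD.size)
    return list(RECORD.iter_unpack(data))
```

Because every record has the same size, new data is added with a plain append and the last N records can be read without touching the rest of the file. The result is not human readable, so a CSV export of the subset of interest would still be needed for tools like Excel.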