I have a GeoDataFrame in a local projection (EPSG:2263) that I want to transform to WGS84 (global) in order to add basemap interactivity within Python. However, when transforming this data, the runtime for this block of code is hours long and extremely impractical; I only have ~40,000 polygons that need to be transformed.
The code I am using:
gdf.to_crs(epsg=4326)
Does anyone know of quicker ways to re-project somewhat large datasets in python?
I have done this call before with a smaller dataset to test it out, and indeed it took nearly a minute with only 150 records.
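In case it helps, the same reprojection can be spelled out as one vectorized coordinate transform with pyproj and shapely 2.0. This is only a minimal sketch (it assumes shapely >= 2.0 is installed, that gdf is the frame from the question, and that its active geometry column is named "geometry"); if this runs in seconds while to_crs takes hours, the bottleneck is probably somewhere other than the projection math.
import numpy as np
import geopandas as gpd
import pyproj
import shapely

# one transformer reused for every vertex; always_xy keeps (x, y) -> (lon, lat) order
transformer = pyproj.Transformer.from_crs(2263, 4326, always_xy=True)

def reproject(coords):
    # coords: (N, 2) float64 array of vertex coordinates handed in by shapely.transform
    x, y = transformer.transform(coords[:, 0], coords[:, 1])
    return np.column_stack([x, y])

new_geoms = shapely.transform(np.asarray(gdf.geometry.values), reproject)
gdf_wgs84 = gpd.GeoDataFrame(gdf.drop(columns="geometry"), geometry=new_geoms, crs=4326)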
import xarray as xr

list_of_ds = []
for fname in ids['fnames']:
    # open lazily with dask chunks and keep only the selected variables
    aq = xr.open_dataset(fname, chunks='auto', mask_and_scale=False)
    aq = aq[var_lists]
    # subset the spatial window of interest
    aq = aq.isel(lat=slice(yoff, yoff + ysize), lon=slice(xoff, xoff + xsize))
    list_of_ds.append(aq)
    aq.close()
all_ds = xr.concat(list_of_ds, dim='time')
all_ds.to_netcdf('tmp.nc')
Hi all, I am making use of xarray to read netCDF files (around 1000) and save selected results to a temporary file, as shown above. However, the saving part runs very slowly. How can I speed this up?
I also tried loading the data directly, but it is still very slow.
I've also tried using open_mfdataset with parallel=True, and it's also slow:
aq = xr.open_mfdataset(
    sorted(ids_list),
    data_vars=var_lists,
    preprocess=add_time_dim,
    combine='by_coords',
    mask_and_scale=False,
    decode_cf=False,
    parallel=True,
)
aq.isel({'lon': irlon, 'lat': irlat}).to_netcdf('tmp.nc')
Unfortunately, concatenating ~1000 files in xarray will be slow. Not a great way around that.
It's hard for us to offer specific advice without more detail about your data and setup. But here are some things I'd try:
Use xr.open_mfdataset. Your second code block looks great; dask will generally be faster and more efficient at managing tasks than you will be with a for loop.
Make sure your chunks are aligned with how you're slicing the data. You don't want to read in more than you have to. If you're reading netCDFs, you have flexibility in how the data is read into dask. Since you're selecting (it looks like) a small spatial region within each array, it may make sense to explicitly chunk the data so that you're only reading a small portion of each array, e.g. with chunks={"lat": 50, "lon": 50}. You'll want to balance a few things here: keeping chunk sizes manageable but not too small (which leads to too many tasks). As a general rule, shoot for chunks in the ~100-500 MB range, and try to keep the number of tasks under ~1 million (or the number of chunks under ~10-100k across all your datasets).
Be explicit about your concatenation. The more "magic" the process feels, the more work xarray is doing to infer what you mean. Generally, combine='nested' performs better than 'by_coords', so if you're concatenating files that are structured logically along one or more dimensions, it may help to arrange the file list in the same order as the dimension you're concatenating along.
Skip the pre-processing. If you can, add new dimensions on concatenation rather than as an ingestion step. This allows dask to plan the computation more fully, rather than treating your preprocess function as a black box, and, what's worse, as a prerequisite to scheduling the final array-construction operation, because with combine='by_coords' the coords are the result of an earlier dask operation. If you need to attach a time dim to each file, with one element per file, something like xr.open_mfdataset(files, concat_dim=pd.Index(pd.date_range("2020-01-01", freq="D", periods=1000), name="time"), combine="nested") works well in my experience (see the sketch below).
If this is all still taking too long, you could try pre-processing the data. Using a compiled utility like nco, or even just subsetting the data and grouping smaller subsets into larger intermediate files with dask.distributed's client.map, might help cut down the complexity of the final dataset join.
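Putting a few of these together, here is a minimal sketch of the nested-combine version. It assumes the files are daily and already sorted into time order, that ids_list, var_lists, yoff/xoff and ysize/xsize mean the same as in your snippets, and the start date is a placeholder:
import pandas as pd
import xarray as xr

files = sorted(ids_list)   # assumed: one file per day, sorted into time order
time_index = pd.Index(
    pd.date_range("2020-01-01", freq="D", periods=len(files)), name="time"
)

ds = xr.open_mfdataset(
    files,
    combine="nested",               # be explicit: files are ordered along "time"
    concat_dim=time_index,          # attach the time coordinate at concat time, no preprocess needed
    chunks={"lat": 50, "lon": 50},  # read only small spatial chunks from each file
    mask_and_scale=False,
    decode_cf=False,
)

subset = ds[var_lists].isel(
    lat=slice(yoff, yoff + ysize), lon=slice(xoff, xoff + xsize)
)
subset.to_netcdf("tmp.nc")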
I am working on software that processes time series. Sometimes these are very long (>10 million data points). Our software is very usable for shorter time series but gets unusably bogged down for these long ones. When looking at the RAM usage, it's almost 10x what all the time series data together occupy.
When doing some tests, it's clear that a lot of memory is used by matplotlib, which we are using to plot the time series. Using a separate piece of code that includes ONLY loading of the time series from a file and plotting, I can see that when going from loading only (with the plotting command commented out) to plotting, the memory usage goes up almost 3-fold. This is true whether or not the whole time range is visible within the given axis limits, although passing only a small slice of the series (numpy array) to matplotlib DOES proportionally reduce the excess memory.
Given that we expect users to scroll through the time series and only view short chunks at a time, it would be much better to have matplotlib only fetch the visible portion of the numpy array, grabbing new elements as the user scrolls or zooms. In fact, it would likely be preferable to replace the X and Y arrays with generators that re-compute the values on the fly as the plot needs them, possibly caching points just outside the limits to make scrolling faster. The X values in particular are simple linspaces that would likely be best not stored at all, given that computing them should be as fast as a lookup into a huge array, never mind storing them once in the outer software AND also in matplotlib.
I know we could try to "fake" this by capturing user events sent to the plot and re-sending new X and Y arrays all the time, but this feels clunky, prone to all sorts of corner cases where things get out of sync, and like trying to take over from the plotting library things it "wants" to do itself. At some point it would become easier just to write our own simple plotting routine in C/C++ that does the computations and draws lines using a graphics API. In fact, the nearest closed-source competitor to our software seems to be doing just that, given that it's super snappy and uses an amount of RAM that is a mere fraction of the size of a time series. But, we want our software to be extensible by users without a deep understanding of the internals of our code.
Is there a standard way of handling this, or is this just too far from the "spirit" of matplotlib to be worth using it? And in that case, is there an alternative Python plotting library with exactly this use case in mind? I would imagine that data scientists working with terabytes of data would want a way to graphically explore it without the plotting code eating terabytes of storage itself...
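On the idea of capturing events and re-sending arrays: here is a minimal sketch of that approach, assuming evenly spaced samples (the sample spacing, the series itself, and the decimation target below are made up for illustration). It keeps the full series in NumPy, hands matplotlib only the currently visible window decimated to a few thousand points, and refreshes it whenever the x-limits change:
import numpy as np
import matplotlib.pyplot as plt

N = 10_000_000
dt = 0.001                        # sample spacing; the x values are an implicit linspace
y = np.cumsum(np.random.randn(N)).astype(np.float32)

fig, ax = plt.subplots()
(line,) = ax.plot([], [], lw=0.8)

MAX_POINTS = 4000                 # roughly a few points per horizontal pixel

def refresh(ax):
    # slice out only the visible window, then decimate it by striding
    x0, x1 = ax.get_xlim()
    i0 = max(int(x0 / dt), 0)
    i1 = min(int(x1 / dt) + 1, N)
    step = max((i1 - i0) // MAX_POINTS, 1)
    idx = np.arange(i0, i1, step)
    line.set_data(idx * dt, y[idx])   # x values recomputed on the fly, never stored in full
    ax.figure.canvas.draw_idle()

ax.set_xlim(0, N * dt)
ax.set_ylim(float(y.min()), float(y.max()))
refresh(ax)
ax.callbacks.connect("xlim_changed", refresh)
plt.show()
Note that naive stride decimation can hide narrow peaks; per-pixel min/max binning is the usual refinement if that matters for your signals.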
I am trying to write code in Python that will display the trajectory of a projectile on a 2D graph. The initial velocity and launch angle will vary. Instead of calculating it every time, I was wondering if there is any way to create a data file which will store all the values of the coordinates for each of those different combinations of speed and launch angle. That would be a four-dimensional database. Is this even possible?
This sounds like a pretty ideal case for using CSV as your file format. It's not so much a "4-dimensional" database as a "4-column" one:
initial_velocity, launch_angle, end_x, end_y
which you can write out and read in easily, using either the standard library's csv module or pandas' read_csv().
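For instance, here's a minimal sketch of generating and reading that file (it uses the no-drag range formula, so end_y is always 0, and the speed/angle grid is just an example):
import csv
import math

g = 9.81
rows = []
for v0 in range(10, 101, 10):            # initial speeds in m/s (illustrative grid)
    for angle in range(5, 86, 5):        # launch angles in degrees
        theta = math.radians(angle)
        end_x = v0 ** 2 * math.sin(2 * theta) / g   # horizontal range on flat ground
        rows.append((v0, angle, end_x, 0.0))        # projectile lands back at y = 0

with open("trajectories.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["initial_velocity", "launch_angle", "end_x", "end_y"])
    writer.writerows(rows)

# or read it back with pandas
import pandas as pd
df = pd.read_csv("trajectories.csv")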
I think you should look at the HDF5 format, which is specialized for working with big data (NASA uses it) and is bulletproof in very large-scale applications.
From the h5py website:
HDF5 lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
I'd also add that NumPy has been developed to work with multidimensional arrays very efficiently. Good luck!
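A four-dimensional array of trajectories maps onto HDF5 naturally. Here is a minimal sketch with h5py (grid sizes, file and dataset names are all illustrative), using the standard no-drag equations of motion:
import numpy as np
import h5py

g = 9.81
speeds = np.arange(10.0, 101.0, 10.0)            # initial speeds in m/s (illustrative grid)
angles = np.radians(np.arange(5.0, 86.0, 5.0))   # launch angles converted to radians
n_points = 200                                   # samples per trajectory

# shape: (n_speeds, n_angles, n_points, 2), last axis holds (x, y)
data = np.zeros((len(speeds), len(angles), n_points, 2))
for i, v0 in enumerate(speeds):
    for j, th in enumerate(angles):
        t = np.linspace(0.0, 2.0 * v0 * np.sin(th) / g, n_points)   # time of flight
        data[i, j, :, 0] = v0 * np.cos(th) * t                      # x(t)
        data[i, j, :, 1] = v0 * np.sin(th) * t - 0.5 * g * t**2     # y(t)

with h5py.File("trajectories.h5", "w") as f:
    f.create_dataset("trajectories", data=data, compression="gzip")
    f.create_dataset("speeds", data=speeds)
    f.create_dataset("angles_deg", data=np.degrees(angles))

# later, slice straight from disk without loading the whole array:
with h5py.File("trajectories.h5", "r") as f:
    xy = f["trajectories"][3, 7]    # the (x, y) points for the 4th speed and 8th angle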
I have two GPX files (from a race I ran twice, obtained via the Strava API) and I would like to be able to compare the effort across both. The sampling frequency is irregular however (i.e. data is not recorded every second, or every meter), so a straightforward comparison is not possible and I would need to standardize the data first. Preferably, I would resample the data so that I have data points for every 10 meters for example.
I'm using Pandas, so I'm currently standardizing a single file by inserting rows for every 10 meters and interpolating the heartrate, duration, lat/lng, etc from the surrounding data points. This works, but doesn't make the data comparable across files, as the recording does not start at the exact same location.
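For reference, a minimal sketch of that kind of distance-based resampling, assuming each file has already been loaded into a DataFrame with a cumulative 'distance' column in meters (the column names are illustrative):
import numpy as np
import pandas as pd

def resample_by_distance(df, step=10.0):
    # target grid: one row every `step` meters along the recorded track
    grid = np.arange(0.0, df["distance"].iloc[-1], step)
    out = pd.DataFrame({"distance": grid})
    for col in ["heartrate", "duration", "lat", "lng"]:
        # linear interpolation against cumulative distance
        out[col] = np.interp(grid, df["distance"], df[col])
    return out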
An alternative is to first standardize the course coordinates using something like geohashing and then try to map both efforts to this standardized course. Since coordinates cannot easily be sorted, however, I'm not sure how to do that correctly.
Any pointers are appreciated, thanks!
I have about a million rows of data with lat and lon attached, and more to come. Even now, reading the data from the SQLite file (I read it with pandas, then create a point for each row) takes a lot of time.
Now I need to do a spatial join over those points to get a zip code for each one, and I really want to optimise this process.
So I wonder: is there any relatively easy way to parallelize those computations?
I am assuming you have already implemented GeoPandas and are still finding difficulties?
You can improve this by further hashing your coords data, similar to how Google hashes their search data. Some databases already provide support for these types of operations (e.g. MongoDB). Imagine taking the first (left) digit of your coords and putting each set of corresponding data into a separate SQLite file. Each digit is then a hash pointing to the correct file to look in. Now your lookup time has improved by a factor of 20 (range(-9, 10)), assuming the hash lookup takes minimal time in comparison.
As it turned out, the most convenient solution in my case is to use the pandas.read_sql function with a specific chunksize parameter. In this case, it returns a generator of data chunks, which can be effectively fed to mp.Pool().map() along with the job.
In this (my) case the job consists of 1) reading the geoboundaries, 2) a spatial join of the chunk, and 3) writing the chunk to the database.
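A minimal sketch of that pattern (table, column and file names are illustrative; it assumes geopandas >= 0.10 and that the points and the zip code boundaries are in the same CRS). Here the joined chunks are written by the parent process, to avoid concurrent writes to SQLite:
import multiprocessing as mp
import sqlite3

import geopandas as gpd
import pandas as pd

# zip code boundaries, read once (re-read per worker on platforms that spawn processes)
ZIPS = gpd.read_file("zipcodes.shp")

def job(chunk):
    # build points from the lat/lon columns and spatially join them to the zip polygons
    pts = gpd.GeoDataFrame(
        chunk,
        geometry=gpd.points_from_xy(chunk["lon"], chunk["lat"]),
        crs=ZIPS.crs,
    )
    joined = gpd.sjoin(pts, ZIPS[["zipcode", "geometry"]], how="left", predicate="within")
    return pd.DataFrame(joined.drop(columns=["geometry", "index_right"]))

if __name__ == "__main__":
    conn = sqlite3.connect("points.db")
    chunks = pd.read_sql("SELECT id, lat, lon FROM points", conn, chunksize=100_000)
    with mp.Pool() as pool:
        results = pool.map(job, chunks)
    pd.concat(results).to_sql("points_with_zip", conn, if_exists="replace", index=False)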
This method is completely dependent on your spatial scale, but one way you might parallelize your join would be to subdivide your polygons into subpolygons and then offload the work to separate threads in separate cores. This geopandas r-tree tutorial demonstrates that technique, subdividing a large polygon into many small ones and intersecting each with a large set of points. But again, this only works if your spatial scale is appropriate: ie, a few polygons and a lot of points (such as a few zip code polygons and millions of points in and around them).