Displaying very large data sets more efficiently - Python

I have a logic analyser project that records several hundred million 16-bit values (roughly 100-500 million) and I need to display anything from a few hundred samples to the entire capture as the user zooms.
When the user zooms out, the whole system takes a huge performance hit because it loads a massive chunk of the file.
It occurred to me this morning that it would be more efficient to "stride" through the file at the user's screen resolution, since you can't physically display anything between pixels anyway. That doesn't solve the memory hit from the massive file, though.
Is there a way I can take a huge data set and stream it down in chunks efficiently?
I was thinking of streaming from start to start + view size, stepping by the horizontal resolution, but that makes zooming very choppy.
The program uses Python, but I am open to calling something in C if it already exists.

Well, I don't know if this is really a question about programming or about design overall.
For the "zooming" problem with visualizations I suggest:
Have pre-computed/cached versions for some zoom levels. Ideally, the gradation should be chosen based on user behaviour.
When the user zooms in, you simultaneously
calculate the "proper" data, or load pre-computed aggregated data from a deeper zoom layer and crop it to your view frame
cheat by rendering low-res data from the previous layer, or smooth it with some approximation (but make sure to somehow tell the user that the data is not final)
Aside from that, think about whether you can optimize the way you store the data. Trees may make your life much easier, both for partial disk reads/searches and for storing aggregated data.
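As a rough sketch of the cached-zoom-level idea (nothing here is from the original poster's code: the flat little-endian uint16 file layout, the block size, and the function names are all assumptions for illustration), you could pre-compute a per-block min/max layer once with numpy.memmap and then serve zoomed-out views from it:

import numpy as np

def build_minmax_level(capture_path, block=1024):
    """Pre-compute one zoom level: a (min, max) pair per block of samples."""
    # Assumes raw native-endian uint16 samples; the OS page cache streams the
    # file through, so this one-off pass never needs the whole capture in RAM.
    samples = np.memmap(capture_path, dtype=np.uint16, mode="r")
    usable = len(samples) - (len(samples) % block)
    blocks = samples[:usable].reshape(-1, block)
    level = np.empty((blocks.shape[0], 2), dtype=np.uint16)
    level[:, 0] = blocks.min(axis=1)
    level[:, 1] = blocks.max(axis=1)
    np.save(f"{capture_path}.min_max_{block}.npy", level)  # cache next to the capture
    return level

def visible_points(level, first_sample, last_sample, block=1024, width_px=1920):
    """Read only what the current viewport needs from the aggregated level."""
    lo = first_sample // block
    hi = max(lo + 1, last_sample // block)
    view = level[lo:hi]
    stride = max(1, len(view) // width_px)  # never hand the plot more points than pixels
    return view[::stride]

Keeping both the minimum and the maximum of each block means narrow pulses stay visible when zoomed out, and you can stack several such layers (block = 1024, 1024**2, ...) to cover different zoom ranges.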

In my opinion, there is no point in displaying even a few hundred samples unless they form some kind of image or shape. I suppose one can look at a hundred numbers if they are properly structured (coloured); several hundred, I doubt it. At that point you replace the actual data with some visualization (plots, charts, maps, ...).
To approach the problem you could define a rule for when to stop displaying the actual data at all. For instance, if the digit height becomes less than, say, 10 pixels, you display some kind of message such as "selected numbers are from rows 200...300, columns 400...500", or some graphical alternative with corner coordinates and the amount of numbers.

Related

Is there a way to generate a GIF in Python without consuming an excessive amount of RAM?

I'm writing a little application to generate a GIF from a kifu file (a type of file used to save a game of Japanese chess). I'm currently using Matplotlib to draw the board and the pieces, and the matplotlib.animation.FuncAnimation class combined with numpngw.AnimatedPNGWriter to write the GIF. However, it uses more than 800 MB of RAM to generate a single GIF with 80 frames. On reflection, this value is not surprising, because (from my understanding) each frame is 1700x1000 pixels and in colour, so keeping every frame in memory needs at least 1700*1000*80*(bytes per pixel), which is a huge amount of RAM.
Is there a way to minimize this amount, either with matplotlib or with another library? I suppose I need to compress frames after creating them instead of keeping them raw, but I can't figure out how to do that.
Thank you very much
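One hedged sketch of the "write frames out as you create them instead of keeping them raw" idea, assuming the imageio package is acceptable and using a hypothetical render_frame() helper standing in for the Matplotlib drawing:

import numpy as np
import imageio

def render_frame(i):
    # Hypothetical stand-in for drawing one 1700x1000 board position as a uint8 RGB array.
    return np.zeros((1000, 1700, 3), dtype=np.uint8)

# Each frame is encoded and flushed as soon as it is produced, so only one raw
# frame lives in memory at a time instead of all 80.
with imageio.get_writer("game.gif", mode="I", duration=0.5) as writer:
    for i in range(80):
        writer.append_data(render_frame(i))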

Compressing big GeoJSON/Shapefile datasets for viewing in a web browser

So I have a shapefile that is 3 GB in size and, as you can imagine, my browser doesn't like it. How can I compress the data I have, which is either in lon/lat coordinates or points on an X,Y grid?
I saw a video on Computerphile about Discrete Cosine Transforms for reducing high-dimensionality data, but being a programmer and not a mathematician I don't know if this is even possible. I have tried taking a point every 10 steps in the file, like so: map[0:100000:10], but this had an undesirable and very lossy effect.
I would ideally like my data to work like Google Earth, where the resolution adjusts to your viewport altitude: as you zoom in, higher-frequency data is presented in the viewport, limiting the amount of points. But I don't know how they do this, and Google returns nothing of value.
The last point is that, since these are just vectors, is there any type of vector compression I could use? I'm not too great at math, so as you can imagine, when I look into this I just get confused fairly quickly. I understand SciPy has some DCT built in, and I know it has a whole bunch of other features which I don't understand; perhaps I could use this?
I can answer the "level of detail" part: you can experiment with Leaflet (a JavaScript mapping library). You could then define a "coarse" layer which is displayed at low zoom levels and "high detail" layers that are only displayed at higher zoom levels. You probably need to capture the map's zoomend event and load/unload your layers from there.
One solution to this problem is to use a Web Map Server (WMS) like GeoServer or MapServer that stores your ShapeFile (though a spatial database like PostGIS would be better) on the server and sends a rendered image (often broken down into cacheable tiles) to the browser.

Presenting parts of a pre-prepared image array in Shady

I'm interested in migrating from Psychtoolbox to Shady for my stimulus presentation. I looked through the online docs, but it is not very clear to me how to replicate in Shady what I'm currently doing in Matlab.
What I do is actually very simple. For each trial,
I load from disk a single image (I do luminance linearization off-line), which contains all the frames I plan to display in that trial (the stimulus is 1000x1000 px, and I present 25 frames, hence the image is 5000x5000 px; I only use BW images, so I have a single int8 value per pixel).
I transfer the entire image from the CPU to the GPU.
At some point (externally controlled) I copy the first frame to the video buffer and present it.
At some other point (externally controlled) I trigger the presentation of the remaining 24 frames (copying the relevant part of the image to the video buffer for each video frame, and then calling flip()).
The external control happens by having another machine communicate with the stimulus presentation code over TCP/IP. After the control PC sends a command to the presentation PC and this is executed, the presentation PC needs to send back an acknowledgement message to the control PC. I need to send three ACK messages, one when the first frame appears on screen, one when the 2nd frame appears on screen, and one when the 25th frame appears on screen (this way the control PC can easily verify if a frame has been dropped).
In matlab I do this by calling the blocking method flip() to present a frame, and when it returns I send the ACK to the control PC.
That's it. How would I do that in shady? Is there an example that I should look at?
The places to look for this information are the docstrings of Shady.Stimulus and Shady.Stimulus.LoadTexture, as well as the included example script animated-textures.py.
Like most things Python, there are multiple ways to do what you want. Here's how I would do it:
w = Shady.World()
s = w.Stimulus( [frame00, frame01, frame02, ...], multipage=True )
where each frameNN is a 1000x1000-pixel numpy array (either floating-point or uint8).
Alternatively you can ask Shady to load directly from disk:
s = w.Stimulus('trial01/*.png', multipage=True)
where directory trial01 contains twenty-five 1000x1000-pixel image files, named (say) 00.png through 24.png so that they get sorted correctly. Or you could supply an explicit list of filenames.
Either way, whether you loaded from memory or from disk, the frames are all transferred to the graphics card in that call. You can then (time-critically) switch between them with:
s.page = 0 # or any number up to 24 in your case
Note that, due to our use of the multipage option, we're using the "page" animation mechanism (create one OpenGL texture per frame) instead of the default "frame" mechanism (create one 1000x25000 OpenGL texture) because the latter would exceed the maximum allowable dimensions for a single texture on many graphics cards. The distinction between these mechanisms is discussed in the docstring for the Shady.Stimulus class as well as in the aforementioned interactive demo:
python -m Shady demo animated-textures
To prepare the next trial, you might use .LoadPages() (new in Shady version 1.8.7). This loops through the existing "pages" loading new textures into the previously-used graphics-card texture buffers, and adds further pages as necessary:
s.LoadPages('trial02/*.png')
Now, you mention that your established workflow is to concatenate the frames as a single 5000x5000-pixel image. My solutions above assume that you have done the work of cutting it up again into 1000x1000-pixel frames, presumably using numpy calls (sounds like you might be doing the equivalent in Matlab at the moment). If you're going to keep saving as 5000x5000, the best way of staying in control of things might indeed be to maintain your own code for cutting it up. But it's worth mentioning that you could take the entirely different strategy of transferring it all in one go:
s = w.Stimulus('trial01_5000x5000.png', size=1000)
This loads the entire pre-prepared 5000x5000 image from disk (or again from memory, if you want to pass a 5000x5000 numpy array instead of a filename) into a single texture in the graphics card's memory. However, because of the size specification, the Stimulus will only show the lower-left 1000x1000-pixel portion of the array. You can then switch "frames" by shifting the carrier relative to the envelope. For example, if you were to say:
s.carrierTranslation = [-1000, -2000]
then you would be looking at the frame located one "column" across and two "rows" up in your 5x5 array.
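If you do keep saving as a single 5000x5000 image and cut it up yourself, as suggested a couple of paragraphs back, a minimal numpy sketch might look like the following (this assumes Pillow for loading and that the frames are tiled left-to-right, top-to-bottom; adjust the loop order to match however you actually assembled the image):

import numpy
from PIL import Image   # assumption: Pillow is available for loading the PNG

big = numpy.array(Image.open('trial01_5000x5000.png').convert('L'))   # 5000x5000, uint8
frames = [
    big[row * 1000:(row + 1) * 1000, col * 1000:(col + 1) * 1000]
    for row in range(5)
    for col in range(5)
]   # 25 arrays of 1000x1000 pixels, read left-to-right then top-to-bottom
s = w.Stimulus(frames, multipage=True)   # w is the Shady.World from the earlier snippet

The resulting frames list is exactly the kind of argument the first Stimulus example above expects.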
As a final note, remember that you could take advantage of Shady's on-the-fly gamma-correction and dithering; they're happening anyway unless you explicitly disable them, though of course they have no physical effect if you leave the stimulus .gamma at 1.0 and use integer pixel values. So you could generate your stimuli as separate 1000x1000 arrays, each containing unlinearized floating-point values in the range [0.0, 1.0], and let Shady worry about everything beyond that.

Resample Scrolling Plot Live Data to show only actually visible points to increase performance (PyQtGraph)

I have a device which I am reading from. Currently it's just a test device used to implement a GUI (PyQt/PySide2). I am using PyQtGraph to display plots.
This is the update function (simplified for better readability):
def update(self, line):
    self.data_segment[self.ptr] = line[1]                 # gets a new line from a plot manager which updates all plots
    self.ptr += 1                                         # counts the amount of samples
    self.line_plot.setData(self.data_segment[:self.ptr])  # displays all read samples
    self.line_plot.setPos(-self.ptr, 0)                   # shifts the plot to the left so it scrolls
I have an algorithm that deletes the first x values of the array and saves them into a temp file. Currently the maximum of available data is 100k. If the user is zoomed in and only sees a part of the plot, there is no problem and the plot does not lag.
But the more points are displayed (bigger x-range), the more it lags.
Especially when I set the width of the scrolling plot to < 1, it starts lagging much faster. Note that this is just a test plot; the actual plot will be more complex, but the peaks will be important as well, so losing data is crucial.
I need an algorithm that resamples the data with little or no loss of information and displays only visible points, rather than calculating 100k points which aren't visible anyway and wasting performance for no gain.
This seems like a basic problem to me, but I can't seem to find a solution for it... My knowledge of signal processing is very limited, which is why I might not be able to find anything on the web. I might even have taken the wrong approach to solving this problem.
EDIT
This is what I mean by "invisible points"
As a simple modification of what you are doing, you could try something like this:
def update(self, line):
    # Get new data and update the counter
    self.data_segment[self.ptr] = line[1]
    self.ptr += 1
    # Update the graph to show the last 256 samples
    n = min(256, len(self.data_segment))
    self.line_plot.setData(self.data_segment[-n:])
For an explicit downsampling of the data, you can try this (with import scipy.signal at the top of your file):
resampled_data = scipy.signal.resample(data, NumberOfPixels)
or, to downsample the most recent set of N points:
n = min(N, len(self.data_segment))
newdata = scipy.signal.resample(self.data_segment[-n:], NumberOfPixels)
self.line_plot.setData(newdata)
However, a good graphics engine should do this for you automatically.
A caveat of resampling or downsampling is that the original data should not contain information or features on a scale that is too fast for the new scale after you resample or downsample. If it does, the features will run together and you will get something that looks like your second graph.
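If the peaks are what you care about, one alternative to scipy.signal.resample (this is a plain min/max decimation sketch, not something prescribed above; the function name is made up) is to keep the minimum and maximum of each pixel-sized bucket, so narrow spikes survive the reduction:

import numpy as np

def minmax_decimate(data, target_points):
    """Reduce to roughly target_points while keeping per-bucket min and max,
    so narrow peaks survive that a plain stride would skip over."""
    if len(data) <= target_points:
        return np.asarray(data)
    n_buckets = max(1, target_points // 2)
    usable = len(data) - (len(data) % n_buckets)
    buckets = np.asarray(data[:usable]).reshape(n_buckets, -1)
    out = np.empty(n_buckets * 2, dtype=buckets.dtype)
    out[0::2] = buckets.min(axis=1)
    out[1::2] = buckets.max(axis=1)
    return out

You would call it with roughly twice the plot's pixel width as target_points before handing the result to setData().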
Some general comments on coding signal acquisition, processing and display
It seems perhaps useful at this point to offer some general comments on working with and displaying signals.
In any signal acquisition, processing and display coding task, the architect or coder (sometimes by default) should understand (a) something of the physical phenomenon represented by the data, (b) how the information will be used, and (c) the physical characteristics of the measurement, signal processing, and display systems (cf. bandwidths, sampling rates, dynamic range, noise characteristics, aliasing, effects of pixelation, and so forth).
This is a large subject, and not often completely described in any one textbook. It seems to take some experience to pull it all together. Moreover, it seems to me that if you don't understand a measurement well enough to code it yourself, then you also don't know enough to use or rely on a canned routine. In other words, there is no substitute for understanding, and the canned routine should be only a convenience and not a crutch. Even for the resampling algorithm suggested above, I would encourage its user to understand how it works and how it affects their signal.
In this particular example, we learn that the application is cardiography, type unspecified, and that a great deal of latitude is left to the coder. As the coder, then, we should try to learn about these kinds of measurements (cf. the heart in general and electro-, acoustic-, and echo-cardiography), how they are performed and used, and try to find some examples.
P.S. For anyone working with digital filters: if you have not formally studied the subject, it might be useful to read the book "Digital Filters" by Hamming. It's available as a Dover book and is affordable.
Pyqtgraph has downsampling implemented:
self.line_plot.setDownsampling(auto=True, method='peak')
Depending on how you created the line, you might instead have to use
self.line_plot.setDownsampling(auto=True, mode='peak')
There are other methods/modes available.
What can also slow down the drawing (and the responsiveness of the UI) is continuously moving the shown XRange. Simply updating the position only every x ms or every x samples can help in that case. The same goes for updating the plots.
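A rough sketch of that throttling idea, assuming a Qt event loop is already running (the class and attribute names mirror the question's snippet but are otherwise made up, and the 50 ms interval is arbitrary):

from pyqtgraph.Qt import QtCore

class ScrollingPlot:
    def __init__(self, line_plot, data_segment):
        self.line_plot = line_plot
        self.data_segment = data_segment
        self.ptr = 0
        # Repaint at a fixed rate instead of once per incoming sample.
        self.redraw_timer = QtCore.QTimer()
        self.redraw_timer.timeout.connect(self.redraw)
        self.redraw_timer.start(50)   # refresh every 50 ms; tune to taste

    def update(self, line):
        # The data path only stores the new sample; no drawing happens here.
        self.data_segment[self.ptr] = line[1]
        self.ptr += 1

    def redraw(self):
        self.line_plot.setData(self.data_segment[:self.ptr])
        self.line_plot.setPos(-self.ptr, 0)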
I use pyqtgraph to plot live data coming in from three vibration sensors with a sampling rate of 12800 samples/second. For the plot I view a time window of 10 seconds per sensor (so a total of 384,000 samples). The times shown include reading the data, plotting it, regularly calculating and plotting FFTs, writing to a database, etc. For the "no downsampling" part, I turned off the downsampling for one of the three plots.
It is more than fast enough that I haven't bothered with multithreading or anything like that.

Best way to print data in columnar format?

I am using Python to read in data in a user-unfriendly format and transform it into an easier-to-read format. The records I am outputting are usually going to be just a last name, first name, and room code.
I would like to output a series of pages, each containing a contiguous subset of the total records, divided into multiple columns, each of which contains a contiguous subset of the records on the page. (So in other words, you'd read down the first column, move to the next column, move to the next column, etc., and then start over on the next page...)
The problem I am facing now is that for output formats, I'm almost certainly limited to HTML (and JavaScript, CSS, etc.). What is the best way to get the data into this columnar format? If I knew for certain that the printable area of the paper would hold 20 records vertically and five horizontally, for instance, I could easily print tables of 5x20, but I don't know if there's a way to indicate a page break, and I don't know if there's any way to programmatically calculate how many records will fit on the page.
How would you approach this?
EDIT: The reason I said that I was limited in output: I have to produce the file on one computer, then bring it to a different computer upon which we cannot install new software and on which the selection of existing software is not optimal. The file itself is only going to be used to make a physical printout (which is what the end users will actually work with), but my time on the computer that I can print from is going to be limited, so I need to have the file all ready to go and print right away without a lot of tweaking.
Right now I've managed to find a word processor that I can use on the target machine, so I'm going to see if I can target a format that the word processor uses.
EDIT: Once I knew there was a word processor I could use, I made a simple skeleton file with the settings that I wanted (column and tab settings, a monospaced font in a small point size, etc.) and then measured how many characters I got per line of a column and how many lines I got per column. I've watched the runs pretty carefully to make sure that there weren't some strange lines that somehow overflowed the characters-per-line guideline (which shouldn't happen with a monospaced font, of course, but how many times do you end up having to figure out why the thing that "shouldn't" happen is happening anyway?)
If there hadn't been a word processor on the target machine that I could use, I probably would have looked at PDF as an output format.
"If I knew for certain that the printable area of the paper would hold 20 records vertically and five horizontally"
You do know that.
You know the size of your paper. You know the size of your font. You can easily do the math.
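For example, a back-of-the-envelope version of that math in Python (every measurement below is a placeholder to swap for your own paper, margins, and font):

# Rough records-per-page arithmetic; all values are example placeholders.
POINTS_PER_INCH = 72

page_width, page_height = 8.5 * POINTS_PER_INCH, 11 * POINTS_PER_INCH  # US Letter
margin = 0.5 * POINTS_PER_INCH
font_size = 8                     # 8 pt monospaced font
line_height = font_size * 1.2     # typical leading
char_width = font_size * 0.6      # rough advance width of one monospaced glyph
record_chars = 30                 # "Lastname, Firstname  ROOM"

rows_per_column = int((page_height - 2 * margin) // line_height)
columns_per_page = int((page_width - 2 * margin) // (record_chars * char_width))
print(rows_per_column, columns_per_page, rows_per_column * columns_per_page)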
"almost certainly limited to HTML..." doesn't make much sense. Is this a web application? The page can have a "Previous" and "Next" button to step through the pages? Pick a size that looks good to you and display one page full with "Previous" and "Next" buttons.
If it's supposed to be one HTML page that prints correctly, that's hard. There are CSS things you can do, but you'll be happier creating a PDF file.
Get PyX or ReportLab and create a PDF that prints properly.
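A minimal ReportLab sketch of that approach (the records, the rows/columns counts, and the spacing are placeholders; Canvas, setFont, drawString, showPage and save are the real ReportLab calls):

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

records = [("Lastname%03d" % i, "Firstname", "R%03d" % i) for i in range(500)]  # placeholder data
rows_per_column, columns_per_page = 60, 3   # tune to your own page/font measurements
page_width, page_height = letter
margin = 36  # half an inch, in points

c = canvas.Canvas("roster.pdf", pagesize=letter)
c.setFont("Courier", 8)
per_page = rows_per_column * columns_per_page
for start in range(0, len(records), per_page):
    for i, (last, first, room) in enumerate(records[start:start + per_page]):
        col, row = divmod(i, rows_per_column)
        x = margin + col * (page_width - 2 * margin) / columns_per_page
        y = page_height - margin - row * 11
        c.drawString(x, y, "%s, %s  %s" % (last, first, room))
    c.showPage()                 # explicit page break
    c.setFont("Courier", 8)      # font state resets after showPage()
c.save()

showPage() gives you the explicit page break that HTML makes awkward.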
I personally have no patience with any of this. I try to put this kind of thing into a CSV file. My users can then open the CSV with a spreadsheet tool (OpenOffice.org has a good one), adjust the columns, and print from there.
