How to speed up a Python Basemap choropleth animation - python

Taking ideas from various sources, and combining with my own, I sought to create an animated maps showing the shading of countries based on some value in my data.
The basic process is this:
Run DB query to get dataset, keyed by country and time
Use pandas to do some data manipulation (sums, avgs, etc)
Initialize basemap object, then load Load external shapefile
Using the animation library, color the countries, one frame for each distinct "time" in the dataset.
Save as gif or mp4 or whatever
This works just fine. The problem is that it is extremely slow. I have potentially over 100k time intervals (over several metrics) I want to animate, and I'm getting an average time of 15s to generate each frame, and it gets worse the more frames there are. At this rate, it will potentially take weeks of maxing out the cpu and memory on my computer to generate a single animation.
I know that matplotlib isn't known for being very fast (examples: 1 and 2) But I read stories of people generating animations at 5+ fps and wonder what I'm doing wrong.
Some optimizations that I have done:
Only recolor the countries in the animate function. This takes on average ~3s per frame, so while it could be improved, it's not what takes the most time.
I use the blit option.
I tried using a smaller plots size and less detailed basemap, but the results were marginal.
Perhaps a less detailed shapefile would speed up the coloring of the shapes, but as I said before, that's only a 3s per frame improvement.
Here is the code (minus a few identifiable features)
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import time
from math import pi
from sqlalchemy import create_engine
from mpl_toolkits.basemap import Basemap
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from geonamescache import GeonamesCache
from datetime import datetime
def get_dataset(avg_interval, startTime, endTime):
### SQL query
# Returns a dataframe with fields [country, unixtime, metric1, metric2, metric3, metric4, metric5]]
# I use unixtime so I can group by any arbitrary interval to get sums and avgs of the metrics (hence the param avg_interval)
return df
# Initialize plot figure
fig=plt.figure(figsize=(11, 6))
ax = fig.add_subplot(111, axisbg='w', frame_on=False)
# Initialize map with Robinson projection
m = Basemap(projection='robin', lon_0=0, resolution='c')
# Load and read shapefile
shapefile = 'countries/ne_10m_admin_0_countries'
m.readshapefile(shapefile, 'units', color='#dddddd', linewidth=0.005)
# Get valid country code list
gc = GeonamesCache()
iso2_codes = list(gc.get_dataset_by_key(gc.get_countries(), 'fips').keys())
# Get dataset and remove invalid countries
# This one will get daily aggregates for the first week of the year
df = get_dataset(60*60*24, '2016-01-01', '2016-01-08')
df.set_index(["country"], inplace=True)
df = df.ix[iso2_codes].dropna()
num_colors = 20
# Get list of distinct times to iterate over in the animation
period = df["unixtime"].sort_values(ascending=True).unique()
# Assign bins to each value in the df
values = df["metric1"]
cm = plt.get_cmap('afmhot_r')
scheme= cm(1.*np.arange(num_colors)/num_colors)
bins = np.linspace(values.min(), values.max(), num_colors)
df["bin"] = np.digitize(values, bins) - 1
# Initialize animation return object
x,y = m([],[])
point = m.plot(x, y,)[0]
# Pre-zip country details and shap objects
zipped = zip(m.units_info, m.units)
tbegin = time.time()
# Animate! This is the part that takes a long time. Most of the time taken seems to happen between frames...
def animate(i):
# Clear the axis object so it doesn't draw over the old one
ax.clear()
# Dynamic title
fig.suptitle('Num: {}'.format(datetime.utcfromtimestamp(int(i)).strftime('%Y-%m-%d %H:%M:%S')), fontsize=30, y=.95)
tstart = time.time()
# Get current frame dataset
frame = df[df["unixtime"]==i]
# Loop through every country
for info, shape in zipped:
iso2 = info['ISO_A2']
if iso2 not in frame.index:
# Gray if not in dataset
color = '#dddddd'
else:
# Colored if in dataset
color = scheme[int(frame.ix[iso2]["bin"])]
# Get shape info for country, then color on the ax subplot
patches = [Polygon(np.array(shape), True)]
pc = PatchCollection(patches)
pc.set_facecolor(color)
ax.add_collection(pc)
tend = time.time()
#print "{}%: {} of {} took {}s".format(str(ind/tot*100), str(ind), str(tot), str(tend-tstart))
print "{}: {}s".format(datetime.utcfromtimestamp(int(i)).strftime('%Y-%m-%d %H:%M:%S'), str(tend-tstart))
return None
# Initialize animation object
output = animation.FuncAnimation(fig, animate, period, interval=150, repeat=False, blit=False)
filestring = time.strftime("%Y%m%d%H%M%S")
# Save animation object as m,p4
#output.save(filestring + '.mp4', fps=1, codec='ffmpeg', extra_args=['-vcodec', 'libx264'])
# Save animation object as gif
output.save(filestring + '.gif', writer='imagemagick')
tfinish = time.time()
print "Total time: {}s".format(str(tfinish-tbegin))
print "{}s per frame".format(str((tfinish-tbegin)/len(df["unixtime"].unique())))
P.S. I know the code is sloppy and could use some cleanup. I'm open to any suggestions, especially if that cleanup would improve performance!
Edit 1: Here is an example of the output
2016-01-01 00:00:00: 3.87843298912s
2016-01-01 00:00:00: 4.08691620827s
2016-01-02 00:00:00: 3.40868711472s
2016-01-03 00:00:00: 4.21187019348s
Total time: 29.0233821869s
9.67446072896s per frame
The first first few lines represent the date being processed, and the runtime of each frame. I have no clue why the first one is repeated. The final line it the total runtime of the program divided by the number of frames. Note that the average time is 2-3x the individual times. This makes me think that there is something happening "between" the frames that is eating up a lot of time.
Edit 2: I ran some performance tests and determined that average time to generate each additional frame is greater than the last, proportional to the number of frames, indicating that this is an quadratic-time process. (or would it be exponential?) Either way, I'm very confused why this wouldn't be linear. If the dataset is already generated, and the maps take a constant time to regenerate, what variable is causing each extra frame to take longer than the previous?
Edit 3: I just made the realization that I have no idea how the animation function works. The (x,y) and point variables were taken from an example that was just plotting moving dots, so it makes sense in that context. A map... not so much. I tried returning something map related from the animate function and got better performance. Returning the ax object (return ax,) makes the procedure run in linear time... but doesn't write anything to the gif. Anybody have any idea what I need to return from the animate function to make this work?
Edit 4: Clearing the axis every frame lets the frames generate at a constant rate! Now I just have to work on general optimizations. I'll start with ImportanceOfBeingErnest's suggestion first. The previous edits are obsolete now.

Related

Plot each Dask partition seperatly using python

I'm using Dask to read 500 parquet files and it does it much faster than other methods that I have tested.
Each parquet file contains a time column and many other variable columns.
My goal is to create a single plot that will have 500 lines of variable vs time.
When I use the following code, it works very fast compared to all other methods that I have tested but it gives me a single "line" on the plot:
import dask.dataframe as dd
import matplotlib.pyplot as plt
import time
start = time.time()
ddf = dd.read_parquet("results_parq/*.parquet")
plt.plot(ddf['t'].compute(),ddf['reg'].compute())
plt.show()
end = time.time()
print(end-start)
from my understanding, it happens because Dask just plots the following:
t
0
0.01
.
.
100
0
0.01
.
.
100
0
What I mean it plots a huge column instead of 500 columns.
One possible solution that I tried to do is to plot it in a for loop over the partitions:
import dask.dataframe as dd
import matplotlib.pyplot as plt
import time
start = time.time()
ddf = dd.read_parquet("results_parq/*.parquet")
for p in ddf.partitions:
plt.plot(p['t'].compute(),p['reg'].compute())
plt.show()
end = time.time()
print(end-start)
It does the job and the resulting plot looks like I want:
However, it results in much longer times.
Is there a way to do something like this but yet to use Dask multicore benefits? Like somehow use map_partitions for it?
Thank you
as a start, you cannot normally make matplotlib draw to the same figure from multiple processes, as the renderers aren't using shared memory. (neither should they from a programming point of view)
drawing 500 lines is a very simple task for matplotlib and the problem maybe not in matplotlib.
your dask workers are likely sending data sequentially to your main process, hence the slowdown. (each worker has to wait for master to request data then send it then wait for confirmation, then wait for next order to come, etc)
you can force them to send their data faster by prefetching all the data before you start plotting by matplotlib.
import numpy as np
ddf = dd.read_parquet("results_parq/*.parquet")
# compute length of each partition
lengths = ddf.map_partitions(len).compute()
# get all partitions at once
ddf2 = ddf.compute()
# calculate each parition start and end
lengths = list(lengths)
lengths.insert(0,0)
accumelated_lengths= np.cumsum(lengths)
# plot each partition
for i in range(len(accumelated_lengths)-1):
plt.plot(ddf2['t'][accumelated_lengths[i]:accumelated_lengths[i+1]],
ddf2['reg'][accumelated_lengths[i]:accumelated_lengths[i+1]])
plt.show()
Edit: making 500 calls to plt.plot is probably slowing you down too, you could use matplotlib.LineCollection instead, it takes 1/20 of the time for 500 lines.
# plot each partition
lines = []
fig, ax = plt.subplots()
for i in range(len(accumelated_lengths) - 1):
lines.append(tuple(zip(ddf2['a'][accumelated_lengths[i]:accumelated_lengths[i + 1]],
ddf2['b'][accumelated_lengths[i]:accumelated_lengths[i + 1]])))
coll = LineCollection(lines,colors = np.random.random([len(lines),3]))
ax.add_collection(coll)
ax.autoscale_view()
plt.show()

Time series dBFS plot output modification - current output plot not as expected (matplotlib)

I'm trying to plot the Amplitude (dBFS) vs. Time (s) plot of an audio (.wav) file using matplotlib. I managed to do that with the following code:
def convert_to_decibel(sample):
ref = 32768 # Using a signed 16-bit PCM format wav file. So, 2^16 is the max. value.
if sample!=0:
return 20 * np.log10(abs(sample) / ref)
else:
return 20 * np.log10(0.000001)
from scipy.io.wavfile import read as readWav
from scipy.fftpack import fft
import matplotlib.pyplot as gplot1
import matplotlib.pyplot as gplot2
import numpy as np
import struct
import gc
wavfile1 = '/home/user01/audio/speech.wav'
wavsamplerate1, wavdata1 = readWav(wavfile1)
wavdlen1 = wavdata1.size
wavdtype1 = wavdata1.dtype
gplot1.rcParams['figure.figsize'] = [15, 5]
pltaxis1 = gplot1.gca()
gplot1.axhline(y=0, c="black")
gplot1.xticks(np.arange(0, 10, 0.5))
gplot1.yticks(np.arange(-200, 200, 5))
gplot1.grid(linestyle = '--')
wavdata3 = np.array([convert_to_decibel(i) for i in wavdata1], dtype=np.int16)
yvals3 = wavdata3
t3 = wavdata3.size / wavsamplerate1
xvals3 = np.linspace(0, t3, wavdata3.size)
pltaxis1.set_xlim([0, t3 + 2])
pltaxis1.set_title('Amplitude (dBFS) vs Time(s)')
pltaxis1.plot(xvals3, yvals3, '-')
which gives the following output:
I had also plotted the Power Spectral Density (PSD, in dBm) using the code below:
from scipy.signal import welch as psd # Computes PSD using Welch's method.
fpsd, wPSD = psd(wavdata1, wavsamplerate1, nperseg=1024)
gplot2.rcParams['figure.figsize'] = [15, 5]
pltpsdm = gplot2.gca()
gplot2.axhline(y=0, c="black")
pltpsdm.plot(fpsd, 20*np.log10(wPSD))
gplot2.xticks(np.arange(0, 4000, 400))
gplot2.yticks(np.arange(-150, 160, 10))
pltpsdm.set_xlim([0, 4000])
pltpsdm.set_ylim([-150, 150])
gplot2.grid(linestyle = '--')
which gives the output as:
The second output above, using the Welch's method plots a more presentable output. The dBFS plot though informative is not very presentable IMO. Is this because of:
the difference in the domains (time in case of 1st output vs frequency in the 2nd output)?
the way plot function is implemented in pyplot?
Also, is there a way I can plot my dBFS output as a peak-to-peak style of plot just like in my PSD (dBm) plot rather than a dense stem plot?
Would be much helpful and would appreciate any pointers, answers or suggestions from experts here as I'm just a beginner with matplotlib and plots in python in general.
TLNR
This has nothing to do with pyplot.
The frequency domain is different from the time domain, but that's not why you didn't get what you want.
The calculation of dbFS in your code is wrong.
You should frame your data, calculate RMSs or peaks in every frame, and then convert that value to dbFS instead of applying this transformation to every sample point.
When we talk about the amplitude, we are talking about a periodic signal. And when we read in a series of data from a sound file, we read in a series of sample points of a signal(may be or be not periodic). The value of every sample point represents a, say, voltage value, or sound pressure value sampled at a specific time.
We assume that, within a very short time interval, maybe 10ms for example, the signal is stationary. Every such interval is called a frame.
Some specific function is applied to each frame usually, to reduce the sudden change at the edge of this frame, and these functions are called window functions. If you did nothing to every frame, you added rectangle windows to them.
An example: when the sampling frequency of your sound is 44100Hz, in a 10ms-long frame, there are 44100*0.01=441 sample points. That's what the nperseg argument means in your psd function but it has nothing to do with dbFS.
Given the knowledge above, now we can talk about the amplitude.
There are two methods a get the value of amplitude in every frame:
The most straightforward one is to get the maximum(peak) values in every frame.
Another one is to calculate the RMS(Root Mean Sqaure) of every frame.
After that, the peak values or RMS values can be converted to dbFS values.
Let's start coding:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
# Determine full scall(maximum possible amplitude) by bit depth
bit_depth = 16
full_scale = 2 ** bit_depth
# dbFS function
to_dbFS = lambda x: 20 * np.log10(x / full_scale)
# Read in the wave file
fname = "01.wav"
fs,data = wavfile.read(fname)
# Determine frame length(number of sample points in a frame) and total frame numbers by window length(how long is a frame in seconds)
window_length = 0.01
signal_length = data.shape[0]
frame_length = int(window_length * fs)
nframes = signal_length // frame_length
# Get frames by broadcast. No overlaps are used.
idx = frame_length * np.arange(nframes)[:,None] + np.arange(frame_length)
frames = data[idx].astype("int64") # Convert to in 64 to avoid integer overflow
# Get RMS and peaks
rms = ((frames**2).sum(axis=1)/frame_length)**.5
peaks = np.abs(frames).max(axis=1)
# Convert them to dbfs
dbfs_rms = to_dbFS(rms)
dbfs_peak = to_dbFS(peaks)
# Let's start to plot
# Get time arrays of every sample point and ever frame
frame_time = np.arange(nframes) * window_length
data_time = np.linspace(0,signal_length/fs,signal_length)
# Plot
f,ax = plt.subplots()
ax.plot(data_time,data,color="k",alpha=.3)
# Plot the dbfs values on a twin x Axes since the y limits are not comparable between data values and dbfs
tax = ax.twinx()
tax.plot(frame_time,dbfs_rms,label="RMS")
tax.plot(frame_time,dbfs_peak,label="Peak")
tax.legend()
f.tight_layout()
# Save serval details
f.savefig("whole.png",dpi=300)
ax.set_xlim(1,2)
f.savefig("1-2sec.png",dpi=300)
ax.set_xlim(1.295,1.325)
f.savefig("1.2-1.3sec.png",dpi=300)
The whole time span looks like(the unit of the right axis is dbFS):
And the voiced part looks like:
You can see that the dbFS values become greater while the amplitudes become greater at the vowel start point:

How to plot real time graph using python for the signals received from RTL SDR?

Using the below python(ubuntu) code and rtlsdr, I am able to plot a graph. Can anyone please tell me how to modify this code to plot the graph continuously in real time?
from pylab import *
from rtlsdr import *
sdr = RtlSdr()
sdr.sample_rate = 2.4e6
sdr.center_freq = 93.5e6
sdr.gain = 50
samples = sdr.read_samples(256*1024)
sdr.close()
psd(samples.real, NFFT=1024, Fs=sdr.sample_rate/1e6, Fc=sdr.center_freq/1e6)
xlabel('Frequency (MHz)')
ylabel('Relative power (dB)')
show()
Normally you can update a pyplot-generated plot by calling plot.set_xdata(), plot.set_ydata(), and plot.draw() (Dynamically updating plot in matplotlib), without having to recreate the entire plot. However, this only works for plots that directly draw a data series. The plot instance cannot automatically recalculate the spectral density calculated by psd().
You'll therefore need to call psd() again when you want to update the plot - depending on how long it takes to draw, you could do this in regular intervals of a second or less.
This might work:
from pylab import *
from rtlsdr import *
from time import sleep
sdr = RtlSdr()
sdr.sample_rate = 2.4e6
sdr.center_freq = 93.5e6
sdr.gain = 50
try:
while True: # run until interrupted
samples = sdr.read_samples(256*1024)
clf()
psd(samples.real, NFFT=1024, Fs=sdr.sample_rate/1e6, Fc=sdr.center_freq/1e6)
xlabel('Frequency (MHz)')
ylabel('Relative power (dB)')
show()
sleep(1) # sleep for 1s
except:
pass
sdr.close()
Edit: Of course, I'm not sure how read_samples runs; my example here assumed that it returns almost immediately. If it blocks for a long time while waiting for data, you may want to read less data at a time, and discard old data as you do so:
from collections import deque
max_size = 256*1024
chunk_size = 1024
samples = deque([], max_size)
while True:
samples.extend(sdr.read_samples(chunk_size))
# draw plot

Python plot lines with specific x values from numpy

I have a situation with a bunch of datafiles, these datafiles have a number of samples in a given time frame that depends on the system. i.e. At time t=1 for instance I might have a file with 10 items, or 20 items, at later times in that file I will always have the same number of items. The format is time, x, y, z in columns, and loaded into a numpy array. The time values show which frame, but as mentioned there's always the same, let's go with 10 as a sample. So I'll have a (10,4) numpy array where the time values are identical, but there are many frames in the file, so lets say 100 frames, so really I have (1000,4). I want to plot the data with time on the x-axis and manipulations of the other data on the y, but I am unsure how to do this with line plot methods in matplotlib. Normally to provide both x,y values I believe I need to do a scatter plot, so I'm hoping there's a better way to do this. What I ideally want is to treat each line that has the same time code as a different series (so it will colour differently), and the next bit of data for that same line number in the next frame (time value) will be labelled the same colour, giving those good contiguous lines. We can look at the time column and figure out how many items share a time code, let's call it "n". Sample code:
a = numpy.loadtxt('sampledata.txt')
plt.plot(a[:0,:,n],a[:1,:1])
plt.show()
I think this code expresses what I'm going for, though it doesn't work.
Edit:
I hope this is what you wanted.
seaborn scatterplot can categorize data to some groups which have the same codes (time code in this case) and use the same colors to them.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"E:\Programming\Python\Matplotlib\timecodes.csv",
names=["time","x","y","z","code"]) #use your file
df["time"]=pd.to_datetime(df["time"]) #recognize the data as Time
df["x"]=df["time"].dt.day # I changed the data into "Date only" and imported to x column. Easier to see on graph.
#just used random numbers in y and z in my data.
sns.scatterplot("x", "y", data = df, hue = "code") #hue does the grouping
plt.show()
I used csv file here but you can do to your text file as well by adding sep="\t" in the argument. I also added a code in the file. If you have it the code can group the data in the graph, so you don't have to separate or make a hierarchical index. If you want to change colors or grouping please see seaborn website.
Hope this helps.
Alternative, the method I used, but Tim's answer is still accurate as well. Since the time codes are not date/time information I modified my own code to add tags as a second column I call "p" (they're polymers).
import numpy as np
import pandas as pd
datain = np.loadtxt('somefile.txt')
df = pd.DataFrame(data = datain, columns = ["t","p","x","y","z"])
ax = sns.scatterplot("t","x", data = df, hue = "p")
plt.show()
And of course the other columns can be plotted similarly if desired.

Real Time temperature plotting with python

I am currently making a project which requires real time monitoring of various quantities like temperature, pressure, humidity etc. I am following a approach of making individual arrays of all the sensors and ploting a graph using matplotlib and drwnow.
HOST = "localhost"
PORT = 4223
UID1 = "tsJ" # S1
from tinkerforge.ip_connection import IPConnection
from tinkerforge.bricklet_ptc import BrickletPTC
import numpy as np
import serial
import matplotlib
from matplotlib.ticker import ScalarFormatter, FormatStrFormatter
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
from drawnow import *
# creating arrays to feed the data
tempC1 = []
def makeafig():
# creating subplots
fig1 = plt.figure(1)
a = fig1.add_subplot(111)
#setting up axis label, auto formating of axis and title
a.set_xlabel('Time [s]', fontsize = 10)
a.set_ylabel('Temperature [°C]', fontsize = 10)
y_formatter = matplotlib.ticker.ScalarFormatter(useOffset=False)
a.yaxis.set_major_formatter(y_formatter)
title1 = "Current Room Temperature (Side1): " + str(temperature1/100) + " °C"
a.set_title(title1, fontsize = 10)
#plotting the graph
a.plot(tempC1, "#00A3E0")
#saving the figure
fig1.savefig('RoomTemperature.png', dpi=100)
while True:
ipcon = IPConnection() # Create IP connection
ptc1 = BrickletPTC(UID1, ipcon) # S1
ipcon.connect(HOST, PORT) # Connect to brickd
#setting the temperature from PTC bricklet
temperature1 = ptc1.get_temperature()
#processing data from a temperature sensor to 1st array
dataArray1=str(temperature1/100).split(',')
temp1 = float(dataArray1[0])
tempC1.append(temp1)
#making a live figure
drawnow(makeafig)
plt.draw()
This is the approach I found good on the internet and it is working. The only problem I am facing is It consumes more time if I made more arrays for other sensors and the plot being made lags from the real time when I compare it with a stopwatch.
Is there any good and efficient approach for obtaining live graphs that will be efficient with lot of sensors and doen't lag with real time.
Or any command to clear up the already plotted array values?
I'd be obliged if anyone can help me with this problem.
I'd like to ask that, does filling data in arrays continuosly makes the process slow or is it my misconception?
That one is easy to test; creating an empty list and appending a few thousand values to it takes roughly 10^-4 seconds, so that shouldn't be a problem. For me somewhat surprisingly, it is actually faster than creating and filling a fixed size numpy.ndarray (but that is probably going to depend on the size of the list/array).
I quickly played around with your makeafig() function, placing a = fig1.add_subplot(111) up to (including) a.plot(..) in a simple for i in range(1,5) loop, with a = fig1.add_subplot(2,2,i); that makes makeafig() about 50% slower, but differences are only about 0.1-0.2 seconds. Is that in line with the lag that you are experiencing?
That's about what I can test without the real data, my next step would be to time the part from ipcon=.. to temperature1=... Perhaps the bottleneck is simply the retrieval of the data? I'm sure that there are several examples on SO on how to time parts of Python scripts, for these kind of problems something like the example below should be sufficient:
import time
t0 = time.time()
# do something
dt = time.time() - t0

Categories