Plot each Dask partition seperatly using python - python

I'm using Dask to read 500 parquet files and it does it much faster than other methods that I have tested.
Each parquet file contains a time column and many other variable columns.
My goal is to create a single plot that will have 500 lines of variable vs time.
When I use the following code, it works very fast compared to all other methods that I have tested but it gives me a single "line" on the plot:
import dask.dataframe as dd
import matplotlib.pyplot as plt
import time
start = time.time()
ddf = dd.read_parquet("results_parq/*.parquet")
plt.plot(ddf['t'].compute(),ddf['reg'].compute())
plt.show()
end = time.time()
print(end-start)
from my understanding, it happens because Dask just plots the following:
t
0
0.01
.
.
100
0
0.01
.
.
100
0
What I mean it plots a huge column instead of 500 columns.
One possible solution that I tried to do is to plot it in a for loop over the partitions:
import dask.dataframe as dd
import matplotlib.pyplot as plt
import time
start = time.time()
ddf = dd.read_parquet("results_parq/*.parquet")
for p in ddf.partitions:
plt.plot(p['t'].compute(),p['reg'].compute())
plt.show()
end = time.time()
print(end-start)
It does the job and the resulting plot looks like I want:
However, it results in much longer times.
Is there a way to do something like this but yet to use Dask multicore benefits? Like somehow use map_partitions for it?
Thank you

as a start, you cannot normally make matplotlib draw to the same figure from multiple processes, as the renderers aren't using shared memory. (neither should they from a programming point of view)
drawing 500 lines is a very simple task for matplotlib and the problem maybe not in matplotlib.
your dask workers are likely sending data sequentially to your main process, hence the slowdown. (each worker has to wait for master to request data then send it then wait for confirmation, then wait for next order to come, etc)
you can force them to send their data faster by prefetching all the data before you start plotting by matplotlib.
import numpy as np
ddf = dd.read_parquet("results_parq/*.parquet")
# compute length of each partition
lengths = ddf.map_partitions(len).compute()
# get all partitions at once
ddf2 = ddf.compute()
# calculate each parition start and end
lengths = list(lengths)
lengths.insert(0,0)
accumelated_lengths= np.cumsum(lengths)
# plot each partition
for i in range(len(accumelated_lengths)-1):
plt.plot(ddf2['t'][accumelated_lengths[i]:accumelated_lengths[i+1]],
ddf2['reg'][accumelated_lengths[i]:accumelated_lengths[i+1]])
plt.show()
Edit: making 500 calls to plt.plot is probably slowing you down too, you could use matplotlib.LineCollection instead, it takes 1/20 of the time for 500 lines.
# plot each partition
lines = []
fig, ax = plt.subplots()
for i in range(len(accumelated_lengths) - 1):
lines.append(tuple(zip(ddf2['a'][accumelated_lengths[i]:accumelated_lengths[i + 1]],
ddf2['b'][accumelated_lengths[i]:accumelated_lengths[i + 1]])))
coll = LineCollection(lines,colors = np.random.random([len(lines),3]))
ax.add_collection(coll)
ax.autoscale_view()
plt.show()

Related

Is there a more efficient way to plot csv file content?

I made a simple app to represent a chart from a csv file with python pandas and plot.
The file I use to plot out is a csv which is around 180kb with almost 6000 rows and two column's.
Nothing complex.
My question/problem is:
When I start the app to plot the chart, it takes to much time.
Around 20 - 30 sec which is I think to much time for such of small data.
I use Win10, i5, 8G ram, ssd...
This file is a log file which will grow day by day and I expect in one year to have a log file of around 1Mb.
I cant imagine does I should wait around 1min to read and plot the file of 1Mb one day.
What is wrong with this code or is there a better/faster/more effective way to process the data?
And if I use the zoom function on the plot it takes a loooot of time too.
It looks for me so buggy...
This is my function:
def plot_file(file_name):
input_file = "log.csv"
sample_data = pd.read_csv(input_file, names=["A", "B"], header=None)
plt.plot(sample_data.A, sample_data.B)
plt.title("File name: " + input_file)
plt.xlabel("time scale x / date & time")
plt.ylabel("quality scale y in %")
plt.show()
Plotting can be a significant bottleneck, particularly in matplotlib but should not be with your example. The quick test below is less than 1s on my machine including creation of the csv file. Delay might be else where including imports, reading from remote source, etc.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import time
def main():
start = time.time()
make_test_csv()
plot_file()
end = time.time()
print(f'Duration: {end - start}')
def make_test_csv():
df = pd.DataFrame(np.random.randint(0,100,size=(6000, 2)), columns=list('AB'))
df.to_csv('test_data.csv')
def plot_file():
input_file = 'test_data.csv'
sample_data = pd.read_csv(input_file, names=["A", "B"], header=None)
plt.plot(sample_data.A, sample_data.B)
plt.title("File name: " + input_file)
plt.xlabel("time scale x / date & time")
plt.ylabel("quality scale y in %")
plt.show()
The matplotlib user guide provides recommendations regarding performance. Using the fast plotting style might help.
import matplotlib.style as mplstyle
mplstyle.use('fast')

How to plot real time graph using python for the signals received from RTL SDR?

Using the below python(ubuntu) code and rtlsdr, I am able to plot a graph. Can anyone please tell me how to modify this code to plot the graph continuously in real time?
from pylab import *
from rtlsdr import *
sdr = RtlSdr()
sdr.sample_rate = 2.4e6
sdr.center_freq = 93.5e6
sdr.gain = 50
samples = sdr.read_samples(256*1024)
sdr.close()
psd(samples.real, NFFT=1024, Fs=sdr.sample_rate/1e6, Fc=sdr.center_freq/1e6)
xlabel('Frequency (MHz)')
ylabel('Relative power (dB)')
show()
Normally you can update a pyplot-generated plot by calling plot.set_xdata(), plot.set_ydata(), and plot.draw() (Dynamically updating plot in matplotlib), without having to recreate the entire plot. However, this only works for plots that directly draw a data series. The plot instance cannot automatically recalculate the spectral density calculated by psd().
You'll therefore need to call psd() again when you want to update the plot - depending on how long it takes to draw, you could do this in regular intervals of a second or less.
This might work:
from pylab import *
from rtlsdr import *
from time import sleep
sdr = RtlSdr()
sdr.sample_rate = 2.4e6
sdr.center_freq = 93.5e6
sdr.gain = 50
try:
while True: # run until interrupted
samples = sdr.read_samples(256*1024)
clf()
psd(samples.real, NFFT=1024, Fs=sdr.sample_rate/1e6, Fc=sdr.center_freq/1e6)
xlabel('Frequency (MHz)')
ylabel('Relative power (dB)')
show()
sleep(1) # sleep for 1s
except:
pass
sdr.close()
Edit: Of course, I'm not sure how read_samples runs; my example here assumed that it returns almost immediately. If it blocks for a long time while waiting for data, you may want to read less data at a time, and discard old data as you do so:
from collections import deque
max_size = 256*1024
chunk_size = 1024
samples = deque([], max_size)
while True:
samples.extend(sdr.read_samples(chunk_size))
# draw plot

Python plot lines with specific x values from numpy

I have a situation with a bunch of datafiles, these datafiles have a number of samples in a given time frame that depends on the system. i.e. At time t=1 for instance I might have a file with 10 items, or 20 items, at later times in that file I will always have the same number of items. The format is time, x, y, z in columns, and loaded into a numpy array. The time values show which frame, but as mentioned there's always the same, let's go with 10 as a sample. So I'll have a (10,4) numpy array where the time values are identical, but there are many frames in the file, so lets say 100 frames, so really I have (1000,4). I want to plot the data with time on the x-axis and manipulations of the other data on the y, but I am unsure how to do this with line plot methods in matplotlib. Normally to provide both x,y values I believe I need to do a scatter plot, so I'm hoping there's a better way to do this. What I ideally want is to treat each line that has the same time code as a different series (so it will colour differently), and the next bit of data for that same line number in the next frame (time value) will be labelled the same colour, giving those good contiguous lines. We can look at the time column and figure out how many items share a time code, let's call it "n". Sample code:
a = numpy.loadtxt('sampledata.txt')
plt.plot(a[:0,:,n],a[:1,:1])
plt.show()
I think this code expresses what I'm going for, though it doesn't work.
Edit:
I hope this is what you wanted.
seaborn scatterplot can categorize data to some groups which have the same codes (time code in this case) and use the same colors to them.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"E:\Programming\Python\Matplotlib\timecodes.csv",
names=["time","x","y","z","code"]) #use your file
df["time"]=pd.to_datetime(df["time"]) #recognize the data as Time
df["x"]=df["time"].dt.day # I changed the data into "Date only" and imported to x column. Easier to see on graph.
#just used random numbers in y and z in my data.
sns.scatterplot("x", "y", data = df, hue = "code") #hue does the grouping
plt.show()
I used csv file here but you can do to your text file as well by adding sep="\t" in the argument. I also added a code in the file. If you have it the code can group the data in the graph, so you don't have to separate or make a hierarchical index. If you want to change colors or grouping please see seaborn website.
Hope this helps.
Alternative, the method I used, but Tim's answer is still accurate as well. Since the time codes are not date/time information I modified my own code to add tags as a second column I call "p" (they're polymers).
import numpy as np
import pandas as pd
datain = np.loadtxt('somefile.txt')
df = pd.DataFrame(data = datain, columns = ["t","p","x","y","z"])
ax = sns.scatterplot("t","x", data = df, hue = "p")
plt.show()
And of course the other columns can be plotted similarly if desired.

How to speed up a Python Basemap choropleth animation

Taking ideas from various sources, and combining with my own, I sought to create an animated maps showing the shading of countries based on some value in my data.
The basic process is this:
Run DB query to get dataset, keyed by country and time
Use pandas to do some data manipulation (sums, avgs, etc)
Initialize basemap object, then load Load external shapefile
Using the animation library, color the countries, one frame for each distinct "time" in the dataset.
Save as gif or mp4 or whatever
This works just fine. The problem is that it is extremely slow. I have potentially over 100k time intervals (over several metrics) I want to animate, and I'm getting an average time of 15s to generate each frame, and it gets worse the more frames there are. At this rate, it will potentially take weeks of maxing out the cpu and memory on my computer to generate a single animation.
I know that matplotlib isn't known for being very fast (examples: 1 and 2) But I read stories of people generating animations at 5+ fps and wonder what I'm doing wrong.
Some optimizations that I have done:
Only recolor the countries in the animate function. This takes on average ~3s per frame, so while it could be improved, it's not what takes the most time.
I use the blit option.
I tried using a smaller plots size and less detailed basemap, but the results were marginal.
Perhaps a less detailed shapefile would speed up the coloring of the shapes, but as I said before, that's only a 3s per frame improvement.
Here is the code (minus a few identifiable features)
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import time
from math import pi
from sqlalchemy import create_engine
from mpl_toolkits.basemap import Basemap
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from geonamescache import GeonamesCache
from datetime import datetime
def get_dataset(avg_interval, startTime, endTime):
### SQL query
# Returns a dataframe with fields [country, unixtime, metric1, metric2, metric3, metric4, metric5]]
# I use unixtime so I can group by any arbitrary interval to get sums and avgs of the metrics (hence the param avg_interval)
return df
# Initialize plot figure
fig=plt.figure(figsize=(11, 6))
ax = fig.add_subplot(111, axisbg='w', frame_on=False)
# Initialize map with Robinson projection
m = Basemap(projection='robin', lon_0=0, resolution='c')
# Load and read shapefile
shapefile = 'countries/ne_10m_admin_0_countries'
m.readshapefile(shapefile, 'units', color='#dddddd', linewidth=0.005)
# Get valid country code list
gc = GeonamesCache()
iso2_codes = list(gc.get_dataset_by_key(gc.get_countries(), 'fips').keys())
# Get dataset and remove invalid countries
# This one will get daily aggregates for the first week of the year
df = get_dataset(60*60*24, '2016-01-01', '2016-01-08')
df.set_index(["country"], inplace=True)
df = df.ix[iso2_codes].dropna()
num_colors = 20
# Get list of distinct times to iterate over in the animation
period = df["unixtime"].sort_values(ascending=True).unique()
# Assign bins to each value in the df
values = df["metric1"]
cm = plt.get_cmap('afmhot_r')
scheme= cm(1.*np.arange(num_colors)/num_colors)
bins = np.linspace(values.min(), values.max(), num_colors)
df["bin"] = np.digitize(values, bins) - 1
# Initialize animation return object
x,y = m([],[])
point = m.plot(x, y,)[0]
# Pre-zip country details and shap objects
zipped = zip(m.units_info, m.units)
tbegin = time.time()
# Animate! This is the part that takes a long time. Most of the time taken seems to happen between frames...
def animate(i):
# Clear the axis object so it doesn't draw over the old one
ax.clear()
# Dynamic title
fig.suptitle('Num: {}'.format(datetime.utcfromtimestamp(int(i)).strftime('%Y-%m-%d %H:%M:%S')), fontsize=30, y=.95)
tstart = time.time()
# Get current frame dataset
frame = df[df["unixtime"]==i]
# Loop through every country
for info, shape in zipped:
iso2 = info['ISO_A2']
if iso2 not in frame.index:
# Gray if not in dataset
color = '#dddddd'
else:
# Colored if in dataset
color = scheme[int(frame.ix[iso2]["bin"])]
# Get shape info for country, then color on the ax subplot
patches = [Polygon(np.array(shape), True)]
pc = PatchCollection(patches)
pc.set_facecolor(color)
ax.add_collection(pc)
tend = time.time()
#print "{}%: {} of {} took {}s".format(str(ind/tot*100), str(ind), str(tot), str(tend-tstart))
print "{}: {}s".format(datetime.utcfromtimestamp(int(i)).strftime('%Y-%m-%d %H:%M:%S'), str(tend-tstart))
return None
# Initialize animation object
output = animation.FuncAnimation(fig, animate, period, interval=150, repeat=False, blit=False)
filestring = time.strftime("%Y%m%d%H%M%S")
# Save animation object as m,p4
#output.save(filestring + '.mp4', fps=1, codec='ffmpeg', extra_args=['-vcodec', 'libx264'])
# Save animation object as gif
output.save(filestring + '.gif', writer='imagemagick')
tfinish = time.time()
print "Total time: {}s".format(str(tfinish-tbegin))
print "{}s per frame".format(str((tfinish-tbegin)/len(df["unixtime"].unique())))
P.S. I know the code is sloppy and could use some cleanup. I'm open to any suggestions, especially if that cleanup would improve performance!
Edit 1: Here is an example of the output
2016-01-01 00:00:00: 3.87843298912s
2016-01-01 00:00:00: 4.08691620827s
2016-01-02 00:00:00: 3.40868711472s
2016-01-03 00:00:00: 4.21187019348s
Total time: 29.0233821869s
9.67446072896s per frame
The first first few lines represent the date being processed, and the runtime of each frame. I have no clue why the first one is repeated. The final line it the total runtime of the program divided by the number of frames. Note that the average time is 2-3x the individual times. This makes me think that there is something happening "between" the frames that is eating up a lot of time.
Edit 2: I ran some performance tests and determined that average time to generate each additional frame is greater than the last, proportional to the number of frames, indicating that this is an quadratic-time process. (or would it be exponential?) Either way, I'm very confused why this wouldn't be linear. If the dataset is already generated, and the maps take a constant time to regenerate, what variable is causing each extra frame to take longer than the previous?
Edit 3: I just made the realization that I have no idea how the animation function works. The (x,y) and point variables were taken from an example that was just plotting moving dots, so it makes sense in that context. A map... not so much. I tried returning something map related from the animate function and got better performance. Returning the ax object (return ax,) makes the procedure run in linear time... but doesn't write anything to the gif. Anybody have any idea what I need to return from the animate function to make this work?
Edit 4: Clearing the axis every frame lets the frames generate at a constant rate! Now I just have to work on general optimizations. I'll start with ImportanceOfBeingErnest's suggestion first. The previous edits are obsolete now.

Real Time temperature plotting with python

I am currently making a project which requires real time monitoring of various quantities like temperature, pressure, humidity etc. I am following a approach of making individual arrays of all the sensors and ploting a graph using matplotlib and drwnow.
HOST = "localhost"
PORT = 4223
UID1 = "tsJ" # S1
from tinkerforge.ip_connection import IPConnection
from tinkerforge.bricklet_ptc import BrickletPTC
import numpy as np
import serial
import matplotlib
from matplotlib.ticker import ScalarFormatter, FormatStrFormatter
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
from drawnow import *
# creating arrays to feed the data
tempC1 = []
def makeafig():
# creating subplots
fig1 = plt.figure(1)
a = fig1.add_subplot(111)
#setting up axis label, auto formating of axis and title
a.set_xlabel('Time [s]', fontsize = 10)
a.set_ylabel('Temperature [°C]', fontsize = 10)
y_formatter = matplotlib.ticker.ScalarFormatter(useOffset=False)
a.yaxis.set_major_formatter(y_formatter)
title1 = "Current Room Temperature (Side1): " + str(temperature1/100) + " °C"
a.set_title(title1, fontsize = 10)
#plotting the graph
a.plot(tempC1, "#00A3E0")
#saving the figure
fig1.savefig('RoomTemperature.png', dpi=100)
while True:
ipcon = IPConnection() # Create IP connection
ptc1 = BrickletPTC(UID1, ipcon) # S1
ipcon.connect(HOST, PORT) # Connect to brickd
#setting the temperature from PTC bricklet
temperature1 = ptc1.get_temperature()
#processing data from a temperature sensor to 1st array
dataArray1=str(temperature1/100).split(',')
temp1 = float(dataArray1[0])
tempC1.append(temp1)
#making a live figure
drawnow(makeafig)
plt.draw()
This is the approach I found good on the internet and it is working. The only problem I am facing is It consumes more time if I made more arrays for other sensors and the plot being made lags from the real time when I compare it with a stopwatch.
Is there any good and efficient approach for obtaining live graphs that will be efficient with lot of sensors and doen't lag with real time.
Or any command to clear up the already plotted array values?
I'd be obliged if anyone can help me with this problem.
I'd like to ask that, does filling data in arrays continuosly makes the process slow or is it my misconception?
That one is easy to test; creating an empty list and appending a few thousand values to it takes roughly 10^-4 seconds, so that shouldn't be a problem. For me somewhat surprisingly, it is actually faster than creating and filling a fixed size numpy.ndarray (but that is probably going to depend on the size of the list/array).
I quickly played around with your makeafig() function, placing a = fig1.add_subplot(111) up to (including) a.plot(..) in a simple for i in range(1,5) loop, with a = fig1.add_subplot(2,2,i); that makes makeafig() about 50% slower, but differences are only about 0.1-0.2 seconds. Is that in line with the lag that you are experiencing?
That's about what I can test without the real data, my next step would be to time the part from ipcon=.. to temperature1=... Perhaps the bottleneck is simply the retrieval of the data? I'm sure that there are several examples on SO on how to time parts of Python scripts, for these kind of problems something like the example below should be sufficient:
import time
t0 = time.time()
# do something
dt = time.time() - t0

Categories