Is there a more efficient way to plot csv file content? - python

I made a simple app to chart a CSV file with Python, pandas, and matplotlib.
The file I plot is a CSV of around 180 KB, with almost 6000 rows and two columns.
Nothing complex.
My question/problem is:
When I start the app to plot the chart, it takes too much time.
Around 20-30 seconds, which I think is far too long for such a small dataset.
I use Windows 10, an i5, 8 GB RAM, an SSD...
This file is a log file that will grow day by day, and I expect it to reach around 1 MB within a year.
I can't imagine having to wait around a minute one day just to read and plot a 1 MB file.
What is wrong with this code, or is there a better/faster/more effective way to process the data?
Zooming on the plot also takes a very long time.
It feels buggy to me...
This is my function:
import matplotlib.pyplot as plt
import pandas as pd

def plot_file(file_name):
    # Read the two-column log file; it has no header row
    sample_data = pd.read_csv(file_name, names=["A", "B"], header=None)
    plt.plot(sample_data.A, sample_data.B)
    plt.title("File name: " + file_name)
    plt.xlabel("time scale x / date & time")
    plt.ylabel("quality scale y in %")
    plt.show()

Plotting can be a significant bottleneck, particularly in matplotlib, but it should not be with your example. The quick test below runs in under 1 s on my machine, including creation of the CSV file. The delay might be elsewhere: imports, reading from a remote source, etc.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time

def main():
    start = time.time()
    make_test_csv()
    plot_file()
    end = time.time()
    print(f'Duration: {end - start}')

def make_test_csv():
    # 6000 rows of random integers, written without index or header
    # so that plot_file can read it back with header=None
    df = pd.DataFrame(np.random.randint(0, 100, size=(6000, 2)), columns=list('AB'))
    df.to_csv('test_data.csv', index=False, header=False)

def plot_file():
    input_file = 'test_data.csv'
    sample_data = pd.read_csv(input_file, names=["A", "B"], header=None)
    plt.plot(sample_data.A, sample_data.B)
    plt.title("File name: " + input_file)
    plt.xlabel("time scale x / date & time")
    plt.ylabel("quality scale y in %")
    plt.show()

if __name__ == '__main__':
    main()
The matplotlib user guide provides recommendations regarding performance. Using the fast plotting style might help.
import matplotlib.style as mplstyle
mplstyle.use('fast')
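For reference, the 'fast' style is shorthand for a set of rcParams; per the matplotlib performance guide it mainly enables path simplification and chunked line rendering. A rough equivalent in explicit settings (values as documented in that guide):
import matplotlib.pyplot as plt
import matplotlib.style as mplstyle

mplstyle.use('fast')  # shorthand for roughly the settings below

# Roughly equivalent explicit rcParams (see the matplotlib performance guide):
plt.rcParams['path.simplify'] = True
plt.rcParams['path.simplify_threshold'] = 1.0  # simplify paths as aggressively as allowed
plt.rcParams['agg.path.chunksize'] = 10000     # split very long lines into chunks when rendering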

Related

Plot each Dask partition separately using python

I'm using Dask to read 500 parquet files and it does it much faster than other methods that I have tested.
Each parquet file contains a time column and many other variable columns.
My goal is to create a single plot that will have 500 lines of variable vs time.
When I use the following code, it works very fast compared to all other methods that I have tested but it gives me a single "line" on the plot:
import dask.dataframe as dd
import matplotlib.pyplot as plt
import time
start = time.time()
ddf = dd.read_parquet("results_parq/*.parquet")
plt.plot(ddf['t'].compute(),ddf['reg'].compute())
plt.show()
end = time.time()
print(end-start)
From my understanding, this happens because Dask just plots the concatenated time column, which restarts from 0 at every partition boundary:
t
0
0.01
...
100
0
0.01
...
100
0
What I mean is that it plots one huge concatenated column instead of 500 separate series.
One possible solution I tried is to plot in a for loop over the partitions:
import dask.dataframe as dd
import matplotlib.pyplot as plt
import time

start = time.time()
ddf = dd.read_parquet("results_parq/*.parquet")
for p in ddf.partitions:
    plt.plot(p['t'].compute(), p['reg'].compute())
plt.show()
end = time.time()
print(end - start)
It does the job, and the resulting plot looks like I want it to:
However, it results in much longer times.
Is there a way to do something like this while still getting Dask's multicore benefits? Could map_partitions somehow be used for it?
Thank you
As a start, you cannot normally make matplotlib draw to the same figure from multiple processes, as the renderers don't use shared memory (nor should they, from a programming point of view).
Drawing 500 lines is a very simple task for matplotlib, so the problem is probably not matplotlib itself.
Your Dask workers are likely sending data sequentially to your main process, hence the slowdown: each worker has to wait for the master to request data, then send it, then wait for confirmation, then wait for the next order, and so on.
You can force them to send their data faster by prefetching all the data before you start plotting with matplotlib.
import dask.dataframe as dd
import matplotlib.pyplot as plt
import numpy as np

ddf = dd.read_parquet("results_parq/*.parquet")

# compute the length of each partition
lengths = ddf.map_partitions(len).compute()

# pull all partitions into the main process at once
ddf2 = ddf.compute()

# calculate each partition's start and end index
lengths = list(lengths)
lengths.insert(0, 0)
accumulated_lengths = np.cumsum(lengths)

# plot each partition as its own line
for i in range(len(accumulated_lengths) - 1):
    plt.plot(ddf2['t'][accumulated_lengths[i]:accumulated_lengths[i + 1]],
             ddf2['reg'][accumulated_lengths[i]:accumulated_lengths[i + 1]])
plt.show()
Edit: making 500 calls to plt.plot is probably slowing you down too; you could use matplotlib.collections.LineCollection instead, which takes about 1/20 of the time for 500 lines.
from matplotlib.collections import LineCollection

# plot each partition as one segment in a single LineCollection
lines = []
fig, ax = plt.subplots()
for i in range(len(accumulated_lengths) - 1):
    lines.append(tuple(zip(ddf2['t'][accumulated_lengths[i]:accumulated_lengths[i + 1]],
                           ddf2['reg'][accumulated_lengths[i]:accumulated_lengths[i + 1]])))
coll = LineCollection(lines, colors=np.random.random([len(lines), 3]))
ax.add_collection(coll)
ax.autoscale_view()
plt.show()
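For reference, a minimal self-contained LineCollection example with synthetic data (the 500 "partitions" and their shapes are made up for illustration):
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection

# 500 synthetic "partitions", each a line of 100 (x, y) points
x = np.linspace(0, 100, 100)
segments = [np.column_stack([x, np.sin(x / 10) + 0.01 * k]) for k in range(500)]

fig, ax = plt.subplots()
coll = LineCollection(segments, colors=np.random.random([len(segments), 3]))
ax.add_collection(coll)
ax.autoscale_view()  # LineCollection does not update the data limits automatically
plt.show()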

How to check the cycle count in time series data using Python

I have battery voltage data with respect to datetime, collected over one month. I need to find out the number of battery cycles. Here is my data: https://docs.google.com/spreadsheets/d/1K0XspcrpO94mv2wFgW45DzSjAO0Af1uSwmoHHVxDJwI/edit?usp=sharing
I used a peak detection algorithm, but it is not detecting the cycles correctly. Is there another way to find the cycles?
Here is the code I used:
import matplotlib.pyplot as plt
import numpy as np
from scipy.signal import find_peaks

# df is the dataframe loaded from the linked spreadsheet
op_col = []
for i in df["voltage"]:
    op_col.append(i)
np.set_printoptions(threshold=np.inf)
x = np.array(op_col)

# detect peaks; width/prominence/height were tuned by hand
peak, _ = find_peaks(x, width=800, prominence=1, height=51)

fig = plt.figure(figsize=(19, 8))
plt.plot(x)
plt.xlim(0, 45000)
plt.plot(peak, x[peak], "x", color='r')
plt.show()
Here is my output.
How can I do it another way? Any suggestions?
Thanks in advance.
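If peak widths are hard to tune, one alternative worth trying is counting cycles as rising crossings of a baseline. This is only a sketch with synthetic data standing in for the spreadsheet's voltage column; the baseline choice and smoothing window are assumptions to adjust to the real signal:
import numpy as np

# Synthetic stand-in for the battery voltage: 10 cycles of period 10, plus noise.
rng = np.random.default_rng(0)
t = np.arange(0, 100, 0.01)
voltage = 51 + 2 * np.sin(2 * np.pi * t / 10) + rng.normal(0, 0.1, t.size)

# Light smoothing suppresses noise chatter around the baseline.
window = 200
smooth = np.convolve(voltage, np.ones(window) / window, mode='same')

# Count a cycle each time the smoothed signal crosses its mean going upward.
above = smooth > smooth.mean()
rising = np.flatnonzero(~above[:-1] & above[1:])
print(f"Estimated cycle count: {len(rising)}")  # ~10 for this signal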

Astronomical Plotting Techniques in Python

I am new to Python and am currently attempting to plot the retrograde motion of Mars. I have a txt file that has R.A. and Declination in addition to 12 other columns of data (like apparent magnitude etc.). From that file I am trying to convert only the R.A. and Dec. to decimal degrees, in order to create a scatter plot with Dec on the x axis and R.A. on the y axis. After researching, I discovered that astropy's SkyCoord may be the best tool to use. The problem I am having is how to code the conversion for the two specific columns of data I need. Any help is greatly appreciated!
import pandas as pd
import matplotlib.pyplot as plt
from astropy import units as u
from astropy.coordinates import SkyCoord

# quick look at the raw file
with open("Mars2.txt", "r") as f:
    print(f.read())

# read the ';'-separated ephemeris into named columns
df = pd.read_csv('Mars2.txt', sep=";", names=['Date(0 UT)', 'Apparent R.A.', 'Apparent Declination', 'Distance to Earth', 'Distance to Sun', 'App. Mag.', 'Ang. Diam.', 'Phase Illum', 'Phase Angle', 'S.E Long', 'S.E Lat', 'P.A Axis', 'Ls', 'Solar Elong'])
print(df)

df.plot(x='Apparent Declination', y='Apparent R.A.', kind='scatter')
plt.show()

# a single hard-coded coordinate, already in decimal degrees
c = SkyCoord(ra=10.625 * u.degree, dec=41.2 * u.degree, frame='icrs')
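On the conversion question: SkyCoord can parse sexagesimal strings directly when given the units, so converting whole columns to decimal degrees might look like the sketch below. The sample strings and the "HH MM SS.S" / "+DD MM SS" format are assumptions for illustration; adjust to your file's actual format:
import pandas as pd
import matplotlib.pyplot as plt
from astropy import units as u
from astropy.coordinates import SkyCoord

# Hypothetical sample rows in a common ephemeris format
df = pd.DataFrame({
    'Apparent R.A.': ['00 44 41.9', '00 47 02.3'],
    'Apparent Declination': ['+01 35 33', '+01 51 09'],
})

# SkyCoord parses whole string columns at once: hourangle for R.A., degrees for Dec
coords = SkyCoord(ra=df['Apparent R.A.'].values, dec=df['Apparent Declination'].values,
                  unit=(u.hourangle, u.deg), frame='icrs')

df['ra_deg'] = coords.ra.degree    # decimal degrees
df['dec_deg'] = coords.dec.degree

df.plot(x='dec_deg', y='ra_deg', kind='scatter')
plt.show()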

How to speed up a Python Basemap choropleth animation

Taking ideas from various sources and combining them with my own, I sought to create an animated map showing the shading of countries based on some value in my data.
The basic process is this:
Run DB query to get dataset, keyed by country and time
Use pandas to do some data manipulation (sums, avgs, etc)
Initialize a Basemap object, then load an external shapefile
Using the animation library, color the countries, one frame for each distinct "time" in the dataset.
Save as gif or mp4 or whatever
This works just fine. The problem is that it is extremely slow. I have potentially over 100k time intervals (over several metrics) I want to animate, and I'm getting an average time of 15s to generate each frame, and it gets worse the more frames there are. At this rate, it will potentially take weeks of maxing out the cpu and memory on my computer to generate a single animation.
I know that matplotlib isn't known for being very fast (examples: 1 and 2) But I read stories of people generating animations at 5+ fps and wonder what I'm doing wrong.
Some optimizations that I have done:
Only recolor the countries in the animate function. This takes on average ~3s per frame, so while it could be improved, it's not what takes the most time.
I use the blit option.
I tried using a smaller plot size and a less detailed basemap, but the results were marginal.
Perhaps a less detailed shapefile would speed up the coloring of the shapes, but as I said before, that's only a 3s per frame improvement.
Here is the code (minus a few identifiable features)
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import time
from math import pi
from sqlalchemy import create_engine
from mpl_toolkits.basemap import Basemap
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from geonamescache import GeonamesCache
from datetime import datetime
def get_dataset(avg_interval, startTime, endTime):
    ### SQL query
    # Returns a dataframe with fields [country, unixtime, metric1, metric2, metric3, metric4, metric5]
    # I use unixtime so I can group by any arbitrary interval to get sums and avgs of the metrics (hence the param avg_interval)
    return df
# Initialize plot figure
fig=plt.figure(figsize=(11, 6))
ax = fig.add_subplot(111, facecolor='w', frame_on=False)  # facecolor replaces the removed axisbg keyword
# Initialize map with Robinson projection
m = Basemap(projection='robin', lon_0=0, resolution='c')
# Load and read shapefile
shapefile = 'countries/ne_10m_admin_0_countries'
m.readshapefile(shapefile, 'units', color='#dddddd', linewidth=0.005)
# Get valid country code list
gc = GeonamesCache()
iso2_codes = list(gc.get_dataset_by_key(gc.get_countries(), 'fips').keys())
# Get dataset and remove invalid countries
# This one will get daily aggregates for the first week of the year
df = get_dataset(60*60*24, '2016-01-01', '2016-01-08')
df.set_index(["country"], inplace=True)
df = df.reindex(iso2_codes).dropna()  # .ix is deprecated; reindex keeps only valid codes
num_colors = 20
# Get list of distinct times to iterate over in the animation
period = df["unixtime"].sort_values(ascending=True).unique()
# Assign bins to each value in the df
values = df["metric1"]
cm = plt.get_cmap('afmhot_r')
scheme= cm(1.*np.arange(num_colors)/num_colors)
bins = np.linspace(values.min(), values.max(), num_colors)
df["bin"] = np.digitize(values, bins) - 1
# Initialize animation return object
x,y = m([],[])
point = m.plot(x, y,)[0]
# Pre-zip country details and shap objects
zipped = list(zip(m.units_info, m.units))  # materialize so it can be re-iterated every frame
tbegin = time.time()
# Animate! This is the part that takes a long time. Most of the time taken seems to happen between frames...
def animate(i):
    # Clear the axis object so it doesn't draw over the old one
    ax.clear()
    # Dynamic title
    fig.suptitle('Num: {}'.format(datetime.utcfromtimestamp(int(i)).strftime('%Y-%m-%d %H:%M:%S')), fontsize=30, y=.95)
    tstart = time.time()
    # Get current frame dataset
    frame = df[df["unixtime"] == i]
    # Loop through every country
    for info, shape in zipped:
        iso2 = info['ISO_A2']
        if iso2 not in frame.index:
            # Gray if not in dataset
            color = '#dddddd'
        else:
            # Colored if in dataset
            color = scheme[int(frame.loc[iso2]["bin"])]
        # Get shape info for country, then color on the ax subplot
        patches = [Polygon(np.array(shape), True)]
        pc = PatchCollection(patches)
        pc.set_facecolor(color)
        ax.add_collection(pc)
    tend = time.time()
    #print("{}%: {} of {} took {}s".format(ind/tot*100, ind, tot, tend - tstart))
    print("{}: {}s".format(datetime.utcfromtimestamp(int(i)).strftime('%Y-%m-%d %H:%M:%S'), tend - tstart))
    return None
# Initialize animation object
output = animation.FuncAnimation(fig, animate, period, interval=150, repeat=False, blit=False)
filestring = time.strftime("%Y%m%d%H%M%S")
# Save animation object as m,p4
#output.save(filestring + '.mp4', fps=1, codec='ffmpeg', extra_args=['-vcodec', 'libx264'])
# Save animation object as gif
output.save(filestring + '.gif', writer='imagemagick')
tfinish = time.time()
print "Total time: {}s".format(str(tfinish-tbegin))
print "{}s per frame".format(str((tfinish-tbegin)/len(df["unixtime"].unique())))
P.S. I know the code is sloppy and could use some cleanup. I'm open to any suggestions, especially if that cleanup would improve performance!
Edit 1: Here is an example of the output
2016-01-01 00:00:00: 3.87843298912s
2016-01-01 00:00:00: 4.08691620827s
2016-01-02 00:00:00: 3.40868711472s
2016-01-03 00:00:00: 4.21187019348s
Total time: 29.0233821869s
9.67446072896s per frame
The first few lines show the date being processed and the runtime of each frame. I have no clue why the first one is repeated. The final line is the total runtime of the program divided by the number of frames. Note that the average time is 2-3x the individual times. This makes me think that something happening "between" the frames is eating up a lot of time.
Edit 2: I ran some performance tests and determined that the average time to generate each additional frame is greater than the last, proportional to the number of frames, indicating that this is a quadratic-time process (or would it be exponential?). Either way, I'm very confused why this isn't linear: if the dataset is already generated, and the maps take a constant time to regenerate, what variable is causing each extra frame to take longer than the previous one?
Edit 3: I just made the realization that I have no idea how the animation function works. The (x,y) and point variables were taken from an example that was just plotting moving dots, so it makes sense in that context. A map... not so much. I tried returning something map related from the animate function and got better performance. Returning the ax object (return ax,) makes the procedure run in linear time... but doesn't write anything to the gif. Anybody have any idea what I need to return from the animate function to make this work?
Edit 4: Clearing the axis every frame lets the frames generate at a constant rate! Now I just have to work on general optimizations. I'll start with ImportanceOfBeingErnest's suggestion first. The previous edits are obsolete now.
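A further optimization commonly used for choropleth animations (a sketch only, not necessarily the suggestion referenced above) is to build the PatchCollection once and update only its face colors each frame, so no polygons are re-created or re-added:
# Build all country polygons once, keeping ISO codes in matching order.
iso2_list, patches = [], []
for info, shape in zipped:
    iso2_list.append(info['ISO_A2'])
    patches.append(Polygon(np.array(shape), True))
pc = PatchCollection(patches)
ax.add_collection(pc)

def animate(i):
    frame = df[df["unixtime"] == i]
    # Only the facecolor array changes between frames.
    colors = ['#dddddd' if iso2 not in frame.index
              else scheme[int(frame.loc[iso2]["bin"])]
              for iso2 in iso2_list]
    pc.set_facecolor(colors)
    return pc,

output = animation.FuncAnimation(fig, animate, period, interval=150, blit=True)
This reuses the names from the question's code (zipped, df, scheme, ax, fig, period), so it is a sketch under those assumptions rather than a drop-in replacement.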

Real Time temperature plotting with python

I am currently making a project which requires real-time monitoring of various quantities like temperature, pressure, humidity, etc. I am following an approach of making individual arrays for all the sensors and plotting a graph using matplotlib and drawnow.
HOST = "localhost"
PORT = 4223
UID1 = "tsJ" # S1
from tinkerforge.ip_connection import IPConnection
from tinkerforge.bricklet_ptc import BrickletPTC
import numpy as np
import serial
import matplotlib
from matplotlib.ticker import ScalarFormatter, FormatStrFormatter
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
from drawnow import *
# creating arrays to feed the data
tempC1 = []
def makeafig():
    # creating subplots
    fig1 = plt.figure(1)
    a = fig1.add_subplot(111)
    # setting up axis labels, auto formatting of axis, and title
    a.set_xlabel('Time [s]', fontsize=10)
    a.set_ylabel('Temperature [°C]', fontsize=10)
    y_formatter = matplotlib.ticker.ScalarFormatter(useOffset=False)
    a.yaxis.set_major_formatter(y_formatter)
    title1 = "Current Room Temperature (Side1): " + str(temperature1/100) + " °C"
    a.set_title(title1, fontsize=10)
    # plotting the graph
    a.plot(tempC1, "#00A3E0")
    # saving the figure
    fig1.savefig('RoomTemperature.png', dpi=100)

while True:
    ipcon = IPConnection()  # Create IP connection
    ptc1 = BrickletPTC(UID1, ipcon)  # S1
    ipcon.connect(HOST, PORT)  # Connect to brickd
    # reading the temperature from the PTC bricklet
    temperature1 = ptc1.get_temperature()
    # processing data from the temperature sensor into the 1st array
    dataArray1 = str(temperature1/100).split(',')
    temp1 = float(dataArray1[0])
    tempC1.append(temp1)
    # making a live figure
    drawnow(makeafig)
    plt.draw()
This is the approach I found on the internet, and it works. The only problem I am facing is that it consumes more time as I add more arrays for other sensors, and the plot lags behind real time when I compare it with a stopwatch.
Is there a good, efficient approach for live graphs that scales to many sensors and doesn't lag behind real time?
Or any command to clear the already-plotted array values?
I'd be obliged if anyone can help me with this problem.
I'd also like to ask: does filling the arrays continuously make the process slow, or is that a misconception?
That one is easy to test; creating an empty list and appending a few thousand values to it takes roughly 10^-4 seconds, so that shouldn't be a problem. For me somewhat surprisingly, it is actually faster than creating and filling a fixed size numpy.ndarray (but that is probably going to depend on the size of the list/array).
I quickly played around with your makeafig() function, placing everything from a = fig1.add_subplot(111) up to (and including) a.plot(..) in a simple for i in range(1,5) loop, with a = fig1.add_subplot(2,2,i); that makes makeafig() about 50% slower, but the differences are only about 0.1-0.2 seconds. Is that in line with the lag you are experiencing?
That's about all I can test without the real data; my next step would be to time the part from ipcon=.. to temperature1=... Perhaps the bottleneck is simply the retrieval of the data? There are several examples on SO of how to time parts of a Python script; for this kind of problem something like the example below should be sufficient:
import time
t0 = time.time()
# do something
dt = time.time() - t0
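As for a more efficient live-plot approach: recreating the whole figure on every sample (which drawnow effectively does here) is usually what makes the plot lag. A common alternative is to create the figure once and update only the line data each iteration. A minimal sketch with a simulated sensor (read_temperature is a hypothetical stand-in for the PTC bricklet call):
import random
import time
import matplotlib.pyplot as plt

def read_temperature():
    # stand-in for ptc1.get_temperature() / 100
    return 21.0 + random.uniform(-0.5, 0.5)

plt.ion()  # interactive mode: drawing does not block
fig, ax = plt.subplots()
line, = ax.plot([], [], "#00A3E0")
ax.set_xlabel('Time [s]')
ax.set_ylabel('Temperature [°C]')

times, temps = [], []
start = time.time()
for _ in range(100):
    times.append(time.time() - start)
    temps.append(read_temperature())
    line.set_data(times, temps)  # update data in place instead of replotting
    ax.relim()                   # recompute data limits for the new data
    ax.autoscale_view()
    fig.canvas.draw_idle()
    plt.pause(0.1)               # process GUI events and sleep briefly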
