Faster plotting in matplotlib or better options - python

I am trying to visualize a data log of close to 25,000 points. When I run this with matplotlib.pyplot in Python, it takes a really long time to render even simple line graphs, and sometimes I've had to kill it after waiting 10+ minutes. This log was made for sampling purposes; real logs can be much larger (some files are several gigabytes).
With this in mind, is there any way to plot data this big in matplotlib without extremely slow execution? Or is there another framework that does this better in Python? I understand that rendering can still take a while at that size, but for practical purposes, 10+ minutes per plot is simply not useful. Any help or guides are appreciated.
Here is a sample of my code:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('sample.txt', low_memory=False)  # 25k lines of data
df = df.iloc[:-2, :]  # dropping the last two rows since we don't need them

# 'some_column' and 'another_column_name' are example names;
# both columns are 25k lines long
y = 'some_column'
x = df['another_column_name']
x.pop(0)  # removing an unnecessary value, ignore this

fig, ax = plt.subplots()
tmpy = df[y]
tmpy.pop(0)  # removing an unnecessary value, ignore this
ax.plot(x, tmpy)  # plot x against y
ax.set_title('Sample Graph')
plt.show()
Here I basically plot one column of the Pandas DataFrame against another; very simple plotting to produce a line graph. The columns contain some integers but mostly decimal values. Even this sample takes a really long time, and real files are much bigger, as mentioned. The goal is to be able to do this with any input file.
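One common workaround (a suggestion on my part, not something from the original post) is to decimate the data before plotting: a screen can only show a few thousand distinct points anyway, so plotting every Nth row gives a visually similar line far faster. A minimal sketch, reusing the hypothetical file and column names from above:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('sample.txt', low_memory=False)

# choose a step so that roughly 2,500 points end up on screen
step = max(1, len(df) // 2500)
small = df.iloc[::step]

fig, ax = plt.subplots()
ax.plot(small['another_column_name'], small['some_column'])
ax.set_title('Sample Graph (decimated)')
plt.show()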

Related

How to reduce Plotly HTML size in Python?

I am making three plots:
a line plot, a boxplot, and a histogram.
None of these plots really needs hover info, and none of them needs to plot every single point of the data.
As you can see, the plots are very simple. However, when dealing with huge data (30 million observations), the resulting HTML weighs 5 MB, which is a lot because there are 100 more plots like this.
At the moment I have made some optimizations...
When saving to html I put these parameters:
fig.to_html(include_plotlyjs="cdn", full_html=False)
This reduces the file size a lot, but it is not enough.
I have also tried specifying line={"simplify": True} and hoverinfo="skip" on the line plot. However, the file size stays almost the same.
Any help or workaround is appreciated.
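Since the question itself notes that not every point needs to be plotted, one option (my suggestion, not something from the post) is to downsample before building the figure, so far fewer points ever reach the HTML. A minimal sketch with hypothetical data standing in for the real observations:

import numpy as np
import plotly.graph_objects as go

# hypothetical data standing in for the large series
y = np.random.randn(3_000_000).cumsum()

# keep roughly 5,000 evenly spaced points
step = max(1, len(y) // 5000)
x_small = np.arange(len(y))[::step]
y_small = y[::step]

fig = go.Figure(go.Scatter(x=x_small, y=y_small, mode="lines", hoverinfo="skip"))
html = fig.to_html(include_plotlyjs="cdn", full_html=False)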

Should I transform a CSV to an ndarray to make a plot?

Recently I learned that if we want to manipulate the data in a CSV file from Excel, we need to transform it first into an ndarray with NumPy (please correct me if what I just learned is wrong).
While learning about that, I also learned how to make plots with matplotlib. Somewhere I saw simple code that displayed a plot with matplotlib, and the writer didn't transform the data into an ndarray; they simply plotted it using row[0] and row[1].
Why didn't they transform it into a NumPy ndarray first? And how can I tell when I should turn a CSV file into an ndarray?
It's really hard to say what this other person was doing to make their plot without seeing their code, but probably the data was already in memory as a Python object. You can only make a plot in matplotlib using data that you have in memory, e.g. from a Python list, or from a NumPy array, or maybe from a Pandas DataFrame, or some other object.
As you probably know, CSV is a file format. It's not a Python or NumPy object. In order to make a plot from the data, you must use some kind of file-reading code to read the file into memory. Then you can do things with it in Python.
People do this file reading in all sorts of different ways, depending on their ultimate goal. For example, you can use NumPy's genfromtxt() function, as mentioned by a commenter and as described in this Stack Overflow question. You might do this:
data = np.genfromtxt("mydata.csv", delimiter=',')
A note about pandas
A lot of people really like Pandas for handling data from CSVs. This is because a CSV can have all sorts of different data in it. For example, it might have a column of strings, a column of floats, a column of dates, etc. NumPy is great for datasets in which every element is of the same type (e.g. all floats representing the same thing, like measurements of temperature on a surface, say). But it's not ideal for datasets in which you have lots of different kinds of measurement. That's what Pandas is for. Pandas is also great for reading and writing CSV and even XLS files.
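For example, a minimal sketch (the file name and column names here are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

# read a CSV with mixed column types; pandas infers each column's dtype
df = pd.read_csv("mydata.csv")

# plot one column against another directly from the DataFrame
plt.plot(df["time"], df["temperature"])
plt.show()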
Your data does not have to be an ndarray in order to plot it with matplotlib. You can read in your data as a list and it will plot just the same, as kwinkunks also mentioned. How you read in your data matters, and that is the step you really need to worry about first!
To answer your question: if you really want to manipulate data and not just plot it, then using a NumPy array is the way to go. The advantage of NumPy arrays is that you can easily compute new variables and condition the data you have.
Take the following example. On the left, you can plot the data as a list, but you cannot manipulate the data and subset points. On the right side, if your data is a NumPy array, you can easily condition the data, say, take only the points whose x value is greater than 3 and plot them in red.
import matplotlib.pyplot as plt
import numpy as np

# Declare some data as lists
x = [2, 5, 4, 3, 6, 2, 6, 10, 1, 0, .5]
y = [7, 2, 8, 1, 4, 5, 6, 5, 4, 5, 2]

# Make that same data NumPy arrays
x_array = np.array(x)
y_array = np.array(y)

# Declare a figure with 2 subplots
fig = plt.figure(figsize=(12, 6))
ax1 = plt.subplot(121)
ax2 = plt.subplot(122)

# Plot only the list
ax1.scatter(x, y)

# Plot the list again on the second subplot
ax2.scatter(x, y)

# Index the arrays on a condition and plot those points in red
ax2.scatter(x_array[x_array > 3], y_array[x_array > 3], c='red')

plt.show()

Nice ways to collect pyplot graphs into a single page/document/etc?

I'm currently working on a program that plots many graphs. I'm using Jupyter. It works okay so far, but it is printing each graph to its own window. This is less than ideal because there are hundreds of graphs.
What are some ways to condense the output? I am hoping for a way to have the many graphs sent to a single document/window.
Also, I am iterating over a dictionary and only plotting graphs when the program encounters something of interest, so the frequency is not very predictable. It is something like this:
while still_true:
    if my_condition is True:
        a = np.arange(20)  # not actually a range, but a dynamic np array
        plt.plot(a)
        plt.ylabel("some numbers")
        plt.show()
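One common way to collect an unpredictable number of figures into a single document (my suggestion, not something from the post) is matplotlib's PdfPages backend, which appends each figure as a page of one PDF. A minimal sketch, where my_dict and is_interesting() are hypothetical stand-ins for the dictionary and the condition:

import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

with PdfPages("all_graphs.pdf") as pdf:
    for key, values in my_dict.items():  # my_dict is hypothetical
        if is_interesting(values):       # stand-in for the real condition
            fig, ax = plt.subplots()
            ax.plot(values)
            ax.set_ylabel("some numbers")
            ax.set_title(str(key))
            pdf.savefig(fig)  # append this figure as a new PDF page
            plt.close(fig)    # free memory; important with hundreds of figures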

KDE is very slow with large data

When I try to make a scatter plot, colored by density, it takes forever.
Probably because the length of the data is quite big.
This is basically how I do it:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

xy = np.vstack([np.array(x_values), np.array(y_values)])
z = gaussian_kde(xy)(xy)  # evaluate the density at every point
plt.scatter(np.array(x_values), np.array(y_values), c=z, s=100, edgecolor='none')
As additional info, I should add that:
>>> len(x_values)
809649
>>> len(y_values)
809649
Is there any other option to get the same result with better speed?
No, there are no good solutions.
Every point has to be prepared, and a circle drawn for it, most of which will end up hidden behind other points.
My tricks (note that these may change the output slightly):
Get the minimum and maximum first and set the figure limits from them, so the figure does not need to be redrawn.
Remove as much data as possible:
duplicate data
data that becomes duplicate after rounding to a chosen precision (e.g. of floats). You may calculate the precision from half the size of a dot (or from the resolution of the graph, if you want the original look); see the sketch after this list.
Less data means more speed. Removing a point is far quicker than drawing it in a graph (where it would just be overwritten).
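A minimal sketch of the rounding-and-deduplication trick, assuming x_values and y_values as in the question (the precision of 2 decimals is an arbitrary example):

import numpy as np

xy = np.vstack([np.array(x_values), np.array(y_values)])

# round to a chosen precision, then drop duplicate points;
# pick the precision from the dot size or the figure resolution
rounded = np.round(xy, decimals=2)
unique_xy = np.unique(rounded, axis=1)  # unique columns = unique points

print(xy.shape[1], "points reduced to", unique_xy.shape[1])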
Often heatmaps are more interesting for huge data sets: they give more information (a sketch follows below). But in your case, I think you still have too much data.
Note: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde also has a nice example (with just 2000 points). In any case, that page also applies my first trick.
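A minimal sketch of the heatmap idea, assuming x_values and y_values as in the question (the bin count is an arbitrary example):

import numpy as np
import matplotlib.pyplot as plt

# a 2-D histogram: color encodes how many points fall in each cell
plt.hist2d(np.array(x_values), np.array(y_values), bins=200)
plt.colorbar()
plt.show()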
I would suggest plotting a sample of the data.
If the sample is large enough you should get the same distribution.
Making sure the plot is representative of the entire data set is also quite easy: you can simply take multiple samples and compare them.
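A minimal sketch of this idea, assuming x_values and y_values as in the question (the sample size of 10,000 is an arbitrary example):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

x = np.array(x_values)
y = np.array(y_values)

# draw a random sample of the points, without replacement
rng = np.random.default_rng(0)
idx = rng.choice(len(x), size=10_000, replace=False)

xy = np.vstack([x[idx], y[idx]])
z = gaussian_kde(xy)(xy)
plt.scatter(x[idx], y[idx], c=z, s=100, edgecolor='none')
plt.show()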

Plotting a histogram in log-log scale with identical bar thickness

I'm trying to plot input data as a histogram in log-log scale (to quickly check whether it could fit a power law), but I'm having trouble getting the output I want. I'm using Python, and more specifically the matplotlib/numpy libraries:
import sys
import numpy as N
import matplotlib.pyplot as plt

thebins = N.linspace(min_data.min(), min_data.max(), int(sys.argv[-1]))
thebins = N.log(thebins)
bar_min = plt.hist(min_data, bins=thebins, alpha=0.40, label=['Minimal Distance'], log=True)
min_data is my 1-D data array; the first two lines create the bins and then put them on a log scale. The final line fills the bins/histogram, with a log y scale.
The graphical output (image omitted) is a histogram whose bars have visibly different widths.
It may seem fussy, but I'm not satisfied with having bins of different thickness; it seems to me that the data is harder to read, or can even be misread, that way. Not all log-log histograms have same-width bins, and I'm convinced this can be done in Python; do you have an idea of how to change my code to get there?
Thank you in advance ;)
It should have been a no-brainer: I only had to take the log of my data for the x axis, and then build the histogram passing the argument log=True for the y axis.
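A minimal sketch of that fix, with hypothetical power-law-like data standing in for min_data and an arbitrary bin count:

import numpy as N
import matplotlib.pyplot as plt

# hypothetical data standing in for min_data
min_data = N.random.pareto(2.0, 10000) + 1.0

# take the log of the data itself, then use evenly spaced bins
logged = N.log(min_data)
thebins = N.linspace(logged.min(), logged.max(), 30)
plt.hist(logged, bins=thebins, alpha=0.40, label=['Minimal Distance'], log=True)  # log=True gives the log y axis
plt.show()

Because the bins are evenly spaced in log space, every bar has the same thickness.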
