Plotting large datasets with pandas

Plotting large datasets with pandas - python

I convert an oscilloscope dataset with millions of values into a pandas DataFrame. Next step is to plot it. But Matplotlib needs on my fairly powerful machine ~50 seconds to plot the DataFrame.
import pandas as pd
import matplotlib.pyplot as plt
import readTrc
datX, datY, m = readTrc.readTrc('C220180104_ch2_UHF00000.trc')
srx, sry = pd.Series(datX), pd.Series(datY)
df = pd.concat([srx, sry], axis = 1)
df.set_index(0, inplace = True)
df.plot(grid = 1)
plt.show()
Now I found out that there is a way to make matplotlib faster with large datasets by using 'Agg'.
import matplotlib
matplotlib.use('Agg')
import pandas as pd
import matplotlib.pyplot as plt
import readTrc
datX, datY, m = readTrc.readTrc('C220180104_ch2_UHF00000.trc')
srx, sry = pd.Series(datX), pd.Series(datY)
df = pd.concat([srx, sry], axis = 1)
df.set_index(0, inplace = True)
df.plot(grid = 1)
plt.show()
Unfortunately no plot is shown. The process of processing the plot takes ~5 seconds (a big improvement) but no plot is shown. Is this method not compatible with pandas?

You can use Ploty and Lenspy (was built to solve this exact problem). Here is an example of how you can plot 10m points on scatter plot. This plot runs super fast on my 2016 MacBook.
import numpy as np
import plotly.graph_objects as go
from lenspy import DynamicPlot
# First, let's create a very large figure
x = np.arange(1, 11, 1e-6)
y = 1e-2*np.sin(1e3*x) + np.sin(x) + 1e-3*np.sin(1e10*x)
fig = go.Figure(data=[go.Scattergl(x=x, y=y)])
fig.update_layout(title=f"{len(x):,} Data Points.")
# Use DynamicPlot.show to view the plot
plot = DynamicPlot(fig)
plot.show()
# Plot will be available in the browser at http://127.0.0.1:8050/
For your use case (again, I cannot test this since I don’t have access to your dataset):
import pandas as pd
import matplotlib.pyplot as plt
import readTrc
from lenspy import DynamicPlot
import plotly.graph_objects as go
datX, datY, m = readTrc.readTrc('C220180104_ch2_UHF00000.trc')
srx, sry = pd.Series(datX), pd.Series(datY)
fig = go.Figure(data=[go.Scattergl(x=srx, y=sry)])
fig.update_layout(title=f"{len(x):,} Data Points.")
# Use DynamicPlot.show to view the plot
plot = DynamicPlot(fig)
plot.show()
Disclaimer: I am the creator of Lenspy

Related

How to plot Multiline Graphs Via Seaborn library in Python?

I have written a code that looks like this:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
T = np.array([10.03,100.348,1023.385])
power1 = np.array([100000,86000,73000])
power2 = np.array([1008000,95000,1009000])
df1 = pd.DataFrame(data = {'Size': T, 'Encrypt_Time': power1, 'Decrypt_Time': power2})
exp1= sns.lineplot(data=df1)
plt.savefig('exp1.png')
exp1_smooth= sns.lmplot(x='Size', y='Time', data=df, ci=None, order=4, truncate=False)
plt.savefig('exp1_smooth.png')
That gives me Graph_1:
The Size = x- axis is a constant line but as you can see in my code it varies from (10,100,1000).
How does this produces a constant line? I want to produce a multiline graph with x-axis = Size(T),y- axis= Encrypt_Time and Decrypt_Time (power1 & power2).
Also I wanted to plot a smooth graph of the same graph I am getting right now but it gives me error. What needs to be done to achieve a smooth multi-line graph with x-axis = Size(T),y- axis= Encrypt_Time and Decrypt_Time (power1 & power2)?

I think it not the issue, the line represents for size looks like constant but it NOT.
Can see that values of size in range 10-1000 while the minimum division of y-axis is 20,000 (20 times bigger), make it look like a horizontal line on your graph.
You can try with a bigger values to see the slope clearly.
If you want 'size` as x-axis, you can try below example:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
T = np.array([10.03,100.348,1023.385])
power1 = np.array([100000,86000,73000])
power2 = np.array([1008000,95000,1009000])
df1 = pd.DataFrame(data = {'Size': T, 'Encrypt_Time': power1, 'Decrypt_Time': power2})
fig = plt.figure()
fig = sns.lineplot(data=df1, x='Size',y='Encrypt_Time' )
fig = sns.lineplot(data=df1, x='Size',y='Decrypt_Time' )

Plotting categorized data in Seaborn

I have categorized data. At specific dates I have data (A to E) that is counted every 15 minutes.
When I want to plot with seaborn I get this:
Bigger bubbles cover smaller ones and the entire thing is not easy readable (e.g. 2020-05-12 at 21:15). Is it possible to display the bubbles for each 15-minute-class next to each other with a little bit of overlap?
My code:
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import os
df = pd.read_csv("test_df.csv")
#print(df)
sns.set_theme()
sns.scatterplot(
data = df,
x = "date",
y = "time",
hue = "category",
size = "amount",sizes=(15, 200)
)
plt.gca().invert_yaxis()
plt.show()
My CSV file:
date,time,amount,category
2020-05-12,21:15,13,A
2020-05-12,21:15,2,B
2020-05-12,21:15,5,C
2020-05-12,21:15,1,D
2020-05-12,21:30,4,A
2020-05-12,21:30,2,C
2020-05-12,21:30,1,D
2020-05-12,21:45,3,B
2020-05-12,22:15,4,A
2020-05-12,22:15,2,D
2020-05-12,22:15,9,E
2020-05-12,00:15,21,D
2020-05-12,00:30,11,E
2020-05-12,04:15,7,A
2020-05-12,04:30,1,B
2020-05-12,04:30,2,C
2020-05-12,04:45,1,A
2020-05-14,21:15,1,A
2020-05-14,21:15,5,C
2020-05-14,21:15,3,D
2020-05-14,21:30,4,A
2020-05-14,21:30,1,D
2020-05-14,21:45,5,B
2020-05-14,22:15,4,A
2020-05-14,22:15,11,E
2020-05-14,00:15,2,D
2020-05-14,00:30,11,E
2020-05-14,04:15,9,A
2020-05-14,04:30,11,B
2020-05-14,04:30,5,C
2020-05-14,05:00,7,A

You can use a seaborn swarmplot for this. You first have to separate the "amount" column into separate entries, using .reindex and .repeat. Then you can plot.
Here is the code:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import os
df = pd.read_csv("test.csv")
df = df.reindex(df.index.repeat(df.amount))
sns.swarmplot(data = df, x = "date", y = "time", hue = "category")
plt.gca().invert_yaxis()
plt.show()
Here is the output:

Plotting tendency line in Python

I want to plot a tendency line on top of a data plot. This must be simple but I have not been able to figure out how to get to it.
Let us say I have the following:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), columns=list('A'))
sns.lineplot(data=df)
ax.set(xlabel="Index",
ylabel="Variable",
title="Sample")
plt.show()
The resulting plot is:
What I would like to add is a tendency line. Something like the red line in the following:
I thank you for any feedback.

A moving average is one method (my first thought, and already suggested).
Another method is to use a polynomial fit. Since you had 100 points in your original data, I picked a 10th order fit (square root of data length) in the example below. With some modification of your original code:
idx = [i for i in range(100)]
rnd = np.random.randint(0,100,size=100)
ser = pd.Series(rnd, idx)
fit = np.polyfit(idx, rnd, 10)
pf = np.poly1d(fit)
plt.plot(idx, rnd, 'b', idx, pf(idx), 'r')
This code provides a plot like this:

You can do something like this using Rolling Average:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = np.random.randint(0,100,size=(100, 1))
df["rolling_avg"] = df.A.rolling(7).mean().shift(-3)
sns.lineplot(data=df)
plt.show()
You could also do a Regression plot to analyse how data can be interpolated using:
ax = sns.regplot(x=df.index, y="A",
data=df,
scatter_kws={"s": 10},
order=10,
ci=None)

plot sensor boolean data matplotlib

I have data from two sensors that I want to visualize. Both sensors take only 0/1 values. How can I change the xaxis labels to show the time series and y axis should have 2 labels 0 and 1 representing the value of sensors along the time series.
import pandas as pd
import matplotlib.pyplot as plt
def drawgraph(inputFile):
df=pd.read_csv(inputFile)
fig=plt.figure()
ax=fig.add_subplot(111)
y = df[['sensor1']]
x=df.index
plt.plot(x,y)
plt.show()

You should have explained what you tried before asking a question for this to be meaningful. Anyway, below is the example.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
trange = pd.date_range("11:00", "21:30", freq="30min")
df = pd.DataFrame({'time':trange,'sensor1':np.round(np.random.rand(len(trange))),\
'sensor2':np.round(np.random.rand(len(trange)))})
df = df.set_index('time')
df.plot(yticks=[0,1],ylim=[-0.1,1.1],style={'sensor1':'ro','sensor2':'bx'})

forming histogram plots in python

suppose I want to plot 2 histogram subplots on the same window in python, one below the next. The data from these histograms will be read from a file containing a table with attributes A and B.
In the same window, I need a plot of A vs the number of each A and a plot of B vs the number of each B - directly below the plot of A. so suppose the attributes were height and weight, then we'd have a graph of height and number of people with said height and below it a separate graph of weight and number of people with said weight.
import numpy as np; import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
frame = pd.read_csv('data.data', header=None)
subplot.hist(frame['A'], frame['A.count()'])
subplot.hist(frame['B'], frame['B.count()'])
Thanks for any help!

Using pandas you can make histograms like this:
import numpy as np; import pandas as pd
import matplotlib.pyplot as plt
frame = pd.read_csv('data.csv')
frame.hist(layout = (2,1))
plt.show()
I'm confused by the second part of the question. Do you want four separate subplots?

You can do this:
import numpy as np
import numpy.random
import pandas as pd
import matplotlib.pyplot as plt
#df = pd.read_csv('data.data', header=None)
df = pd.DataFrame({'A': numpy.random.random_integers(0,10,30),
'B': numpy.random.random_integers(0,10,30)})
print df['A']
ax1 = plt.subplot(211)
ax1.set_title('A')
ax1.set_ylabel('number of people')
ax1.set_xlabel('height')
ax2 = plt.subplot(212)
ax2.set_title('B')
ax2.set_ylabel('number of people')
ax2.set_xlabel('weight')
ax1.hist(df['A'])
ax2.hist(df['B'])
plt.tight_layout()
plt.show()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Plotting large datasets with pandas - python

Related

How to plot Multiline Graphs Via Seaborn library in Python?

Plotting categorized data in Seaborn

Plotting tendency line in Python

plot sensor boolean data matplotlib

forming histogram plots in python

Categories

Resources