I am new to data science and was trying to draw a scatter plot for a dataset with 4000 rows. I am running Jupyter Notebook on a MacBook. It took more than five minutes for the scatter plot to appear in the notebook. The laptop was bought recently and has a 2.3 GHz Intel Core i5 and 8 GB of memory.
I have two questions: why did it take so long, and why was the plot so congested (for example, the x-axis tick labels were tiny and ran together, so they could not be read)? The dataset is here: https://raw.githubusercontent.com/datascienceinc/learn-data-science/master/Introduction-to-K-means-Clustering/Data/data_1024.csv
I would really appreciate any insight.
Here is my code:
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
df= pd.read_csv('/users/kyaw/Downloads/data_1024.csv')
df = df.join(df['Driver_ID'].str.split(expand=True))
df = df.drop(["Driver_ID"], axis=1)
df.columns=['Driver_ID','Distance_Feature','Speeding_Feature']
f1 = df['Distance_Feature'].values
f2 = df['Speeding_Feature'].values
X=np.array(list(zip(f1,f2)))
fig=plt.gcf()
fig.set_size_inches(10,8)
kmeans = KMeans(n_clusters=3).fit(X)
plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, cmap='rainbow')
plt.scatter(kmeans.cluster_centers_[:,0] ,kmeans.cluster_centers_[:,1], color='black')
plt.show()
I tried to run your code and it didn't work. I made the following corrections:
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
# %matplotlib inline  (removed; it is only needed inside Jupyter)
from sklearn.cluster import KMeans
df= pd.read_csv('./data_1024.csv',sep='\t' ) # the file is tab-separated, so specify the separator
# removed the split/drop/rename steps, which are no longer needed
f1 = df['Distance_Feature'].values
f2 = df['Speeding_Feature'].values
X=np.array(list(zip(f1,f2)))
fig=plt.gcf()
fig.set_size_inches(10,8)
kmeans = KMeans(n_clusters=3).fit(X)
plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, cmap='rainbow')
plt.scatter(kmeans.cluster_centers_[:,0] ,kmeans.cluster_centers_[:,1], color='black')
plt.show()
I got this image
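A likely explanation for both the slowness and the crowded ticks: without `sep='\t'`, the values reach matplotlib as strings, and matplotlib treats strings as categories, drawing one tick per distinct value. The sketch below only illustrates the parsing difference. The rows are made-up values in the same tab-separated layout, not taken from the real file.

```python
import io

import pandas as pd

# Made-up rows in the same tab-separated layout as data_1024.csv (illustrative only).
sample = (
    "Driver_ID\tDistance_Feature\tSpeeding_Feature\n"
    "1001\t71.24\t28\n"
    "1002\t52.53\t25\n"
)

# The default separator is a comma, so each whole line ends up in one text column.
wrong = pd.read_csv(io.StringIO(sample))

# Declaring the tab separator yields three numeric columns.
right = pd.read_csv(io.StringIO(sample), sep="\t")

print(wrong.shape)
print(right.dtypes)
```

Checking `df.dtypes` before plotting is a quick way to confirm that the columns are numeric rather than text.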
Currently I'm doing some data visualization using Python, matplotlib and mplcursors that requires showing different parameters and their values at the same time over a certain time period.
Sample CSV data that was extracted from a system:
https://i.stack.imgur.com/fjd1d.png
My expected output would look like this:
https://i.stack.imgur.com/zXGXA.png
I found a similar case, but it used numpy functions: "Add the vertical line to the hoverbox (see pictures)".
I'm hoping someone can suggest the best approach to my problem.
Code below:
import matplotlib.pyplot as plt
import numpy as np
import mplcursors
import pandas as pd
fig, ax=plt.subplots()
y1=ax.twinx()
y2=ax.twinx()
y2.spines.right.set_position(("axes", 1.05))
df=pd.read_csv(r"C:\Users\OneDrive\Desktop\sample.csv")
time=df['Time']
yd1=df['Real Power']
yd2=df['Frequency']
yd3=df['SOC']
l1=ax.plot(time,yd1,color='black', label='Real Power')
l2=y1.plot(time,yd2, color='blue', label='Frequency')
l3=y2.plot(time,yd3, color='orange', label='SOC')
df=pd.DataFrame(df)
arr=df.to_numpy()
print(arr)
def show_annotation(sel):
    x=sel.target[0]
    annotation_str = df['Real Power'][sel.index]
    #sel.annotation.set_text(annotation_str)
fig.autofmt_xdate()
cursor=mplcursors.cursor(hover=True)
cursor.connect('add', show_annotation)
plt.show()
I have a scatter plot I'm working with, and for some reason I'm not seeing all the x values on my graph.
#%%
from pandas import DataFrame, read_csv
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
file = r"re2.csv"
df = pd.read_csv(file)
#sns.set(rc={'figure.figsize':(11.7,8.27)})
g = sns.FacetGrid(df, col='city')
g.map(plt.scatter, 'type', 'price').add_legend()
This is an image of a small subset of my plots. You can see that Res is displaying; the middle position should be displaying Con and the last one Mlt. These are all defined in the type column of my data set but are not displaying.
Any clue how to fix this?
Python is doing what you tell it to do. If you want to generate more interesting plots, just pick different features, presumably ones that make more sense for plotting. See the generic example below.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="darkgrid")
tips = sns.load_dataset("tips")
sns.relplot(x="total_bill", y="tip", hue="smoker", data=tips);
Personally, I like plotly plots, which are dynamic, more than I like seaborn plots.
https://plotly.com/python/line-and-scatter/
While trying to plot a simple graph in Jupyter Notebook with matplotlib, I came across a strange problem that I had never had before.
I've seen that it has happened to other people before, but the answers talk about backends and other complicated things that I can't follow, since I have only a rather basic knowledge of Python.
Here comes the code:
import numpy as np
import matplotlib.pyplot as plt
time_samples = np.arange(17000)
force_samples = np.arange(17000)
plt.plot(time_samples,force_samples)
plt.show()
time_samples2 = np.random.rand(1,1000)
force_samples2 = np.random.rand(1,1000)
plt.plot(time_samples2,force_samples2)
plt.show()
And this is what I get:
I have no clue why this is happening.
I think the array dimension is the issue: x and y should each be a 1D array.
import numpy as np
import matplotlib.pyplot as plt
time_samples = np.arange(17000)
force_samples = np.arange(17000)
plt.plot(time_samples,force_samples)
plt.show()
time_samples2 = np.random.rand(1000)
force_samples2 = np.random.rand(1000)
plt.plot(time_samples2,force_samples2)
plt.show()
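To make the dimension issue concrete, here is a small sketch (the variable names are mine, and the `Agg` backend is used only so it runs without a display). With 2D input, `plot` treats each column as a separate data set, so a `(1, 1000)` array becomes 1000 lines of a single point each, and a single point drawn with the default line style has no visible segment. Flattening with `ravel()` gives one line with 1000 points.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x2d = np.random.rand(1, 1000)  # shape (1, 1000), as in the question
y2d = np.random.rand(1, 1000)

fig, ax = plt.subplots()

# 2D input: every column is plotted as its own line.
lines_2d = ax.plot(x2d, y2d)

# 1D input: one line through all 1000 points.
lines_1d = ax.plot(x2d.ravel(), y2d.ravel())

print(len(lines_2d), len(lines_2d[0].get_xdata()))
print(len(lines_1d), len(lines_1d[0].get_xdata()))
plt.close(fig)
```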
I convert an oscilloscope dataset with millions of values into a pandas DataFrame. The next step is to plot it, but Matplotlib needs ~50 seconds on my fairly powerful machine to plot the DataFrame.
import pandas as pd
import matplotlib.pyplot as plt
import readTrc
datX, datY, m = readTrc.readTrc('C220180104_ch2_UHF00000.trc')
srx, sry = pd.Series(datX), pd.Series(datY)
df = pd.concat([srx, sry], axis = 1)
df.set_index(0, inplace = True)
df.plot(grid = 1)
plt.show()
Now I found out that there is a way to make matplotlib faster with large datasets by using 'Agg'.
import matplotlib
matplotlib.use('Agg')
import pandas as pd
import matplotlib.pyplot as plt
import readTrc
datX, datY, m = readTrc.readTrc('C220180104_ch2_UHF00000.trc')
srx, sry = pd.Series(datX), pd.Series(datY)
df = pd.concat([srx, sry], axis = 1)
df.set_index(0, inplace = True)
df.plot(grid = 1)
plt.show()
Unfortunately, no plot is shown. Processing the plot takes ~5 seconds (a big improvement), but nothing is displayed. Is this method not compatible with pandas?
You can use Plotly and Lenspy (which was built to solve this exact problem). Here is an example of how you can plot 10 million points on a scatter plot. This plot runs very fast on my 2016 MacBook.
import numpy as np
import plotly.graph_objects as go
from lenspy import DynamicPlot
# First, let's create a very large figure
x = np.arange(1, 11, 1e-6)
y = 1e-2*np.sin(1e3*x) + np.sin(x) + 1e-3*np.sin(1e10*x)
fig = go.Figure(data=[go.Scattergl(x=x, y=y)])
fig.update_layout(title=f"{len(x):,} Data Points.")
# Use DynamicPlot.show to view the plot
plot = DynamicPlot(fig)
plot.show()
# Plot will be available in the browser at http://127.0.0.1:8050/
For your use case (again, I cannot test this since I don’t have access to your dataset):
import pandas as pd
import matplotlib.pyplot as plt
import readTrc
from lenspy import DynamicPlot
import plotly.graph_objects as go
datX, datY, m = readTrc.readTrc('C220180104_ch2_UHF00000.trc')
srx, sry = pd.Series(datX), pd.Series(datY)
fig = go.Figure(data=[go.Scattergl(x=srx, y=sry)])
fig.update_layout(title=f"{len(srx):,} Data Points.")
# Use DynamicPlot.show to view the plot
plot = DynamicPlot(fig)
plot.show()
Disclaimer: I am the creator of Lenspy
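On the original `Agg` question: `Agg` is a non-interactive backend that renders to an image buffer, so `plt.show()` has no window to open. That is a matplotlib behavior, not a pandas incompatibility. A minimal sketch, assuming that writing the figure to a file is acceptable. The synthetic signal and the output file name are placeholders standing in for the oscilloscope data.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive: renders to a buffer, opens no window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder data standing in for the oscilloscope trace.
x = np.linspace(0, 1, 100_000)
df = pd.DataFrame({"signal": np.sin(50 * x)}, index=x)

ax = df.plot(grid=True)

# With Agg, save the figure to a file instead of calling plt.show().
out_path = os.path.join(tempfile.gettempdir(), "trace_agg.png")
ax.get_figure().savefig(out_path)
plt.close(ax.get_figure())

print(os.path.exists(out_path))
```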
I'm starting to learn a bit of python (been using R) for data analysis. I'm trying to create two plots using seaborn, but it keeps saving the second on top of the first. How do I stop this behavior?
import seaborn as sns
iris = sns.load_dataset('iris')
length_plot = sns.barplot(x='sepal_length', y='species', data=iris).get_figure()
length_plot.savefig('ex1.pdf')
width_plot = sns.barplot(x='sepal_width', y='species', data=iris).get_figure()
width_plot.savefig('ex2.pdf')
You have to start a new figure in order to do that. There are multiple ways to do that, assuming you have matplotlib. Also get rid of get_figure() and you can use plt.savefig() from there.
Method 1
Use plt.clf()
import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset('iris')
length_plot = sns.barplot(x='sepal_length', y='species', data=iris)
plt.savefig('ex1.pdf')
plt.clf()
width_plot = sns.barplot(x='sepal_width', y='species', data=iris)
plt.savefig('ex2.pdf')
Method 2
Call plt.figure() before each one
plt.figure()
length_plot = sns.barplot(x='sepal_length', y='species', data=iris)
plt.savefig('ex1.pdf')
plt.figure()
width_plot = sns.barplot(x='sepal_width', y='species', data=iris)
plt.savefig('ex2.pdf')
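Both methods work because axes-level seaborn functions such as `barplot` draw onto the current Axes when no `ax` is passed. The sketch below uses plain matplotlib calls as a stand-in for `sns.barplot` (so it needs no dataset download) to show that the current Axes is reused until a new figure is started. The bar values are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# First plot: creates a figure and Axes implicitly.
first_ax = plt.gca()
first_ax.barh(["a", "b"], [1, 2])

# A second plot without a new figure would land on the same Axes.
second_ax = plt.gca()
reused = second_ax is first_ax

# Starting a new figure gives the next plot a fresh Axes.
plt.figure()
third_ax = plt.gca()
fresh = third_ax is not first_ax

print(reused, fresh)
plt.close("all")
```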
I agree with a previous comment that importing matplotlib.pyplot is not the best software engineering practice, as it exposes the underlying library. I was creating and saving plots in a loop, so I needed to clear the figure each time, and I found that this can now be done easily by importing seaborn only:
since version 0.11:
import seaborn as sns
import numpy as np
data = np.random.normal(size=100)
path = "/path/to/img/plot.png"
plot = sns.displot(data) # also works with histplot() etc
plot.fig.savefig(path)
plot.fig.clf() # this clears the figure
# ... continue with next figure
alternative example with a loop:
import seaborn as sns
import numpy as np
for i in range(3):
    data = np.random.normal(size=100)
    path = "/path/to/img/plot2_{0:01d}.png".format(i)
    plot = sns.displot(data)
    plot.fig.savefig(path)
    plot.fig.clf() # this clears the figure
before version 0.11 (original post):
import seaborn as sns
import numpy as np
data = np.random.normal(size=100)
path = "/path/to/img/plot.png"
plot = sns.distplot(data)
plot.get_figure().savefig(path)
plot.get_figure().clf() # this clears the figure
# ... continue with next figure
Create specific figures and plot onto them:
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset('iris')
length_fig, length_ax = plt.subplots()
sns.barplot(x='sepal_length', y='species', data=iris, ax=length_ax)
length_fig.savefig('ex1.pdf')
width_fig, width_ax = plt.subplots()
sns.barplot(x='sepal_width', y='species', data=iris, ax=width_ax)
width_fig.savefig('ex2.pdf')
I've found that if interactive mode is turned off, seaborn plots the heatmap normally.