I am building a GUI in PySide where I have to keep redrawing pandas.DataFrame objects.
In this simple snippet, plotting the pandas DataFrame object df takes much longer than plotting the equivalent numpy.array object, even though the two plots are nearly identical. This is too slow for my GUI. Why is it so much slower?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
data = np.cumsum(np.random.randn(100, 10), axis=0)
df = pd.DataFrame(data)
df.plot() # Compare the speed of this line... (slow)
plt.plot(data) # to this line. (fast)
I like the way that the pandas.DataFrame plots look, especially because in my real example my x-axis is datetime data from pandas. I do not know how to format a matplotlib.pyplot x-axis to look good with datetime data.
How do I speed up pandas.DataFrame plotting?
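One possible workaround, offered as a sketch rather than a measured fix: hand the underlying arrays straight to matplotlib and let matplotlib's own date helpers format the x-axis. The example assumes a DatetimeIndex and made-up data. It uses matplotlib.figure.Figure directly, which is a common pattern when embedding in a Qt GUI; whether this is fast enough for your redraw rate is something you would need to time yourself.

```python
import numpy as np
import pandas as pd
import matplotlib.dates as mdates
from matplotlib.figure import Figure

# Made-up data with a datetime index (assumption: the real frame looks similar).
index = pd.date_range("2020-01-01", periods=100, freq="D")
df = pd.DataFrame(np.cumsum(np.random.randn(100, 10), axis=0), index=index)

# Build the figure without pyplot, as one would when embedding it in a GUI canvas.
fig = Figure()
ax = fig.subplots()

# Plot the raw arrays: one line per column, datetime64 values on the x-axis.
lines = ax.plot(df.index.to_numpy(), df.to_numpy())

# Let matplotlib choose date ticks and render them compactly.
locator = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))

# Recreate the legend that DataFrame.plot would have added.
ax.legend(lines, [str(c) for c in df.columns], loc="best")
```

ConciseDateFormatter is only one choice here; mdates.DateFormatter with an explicit format string works the same way if you want full control over the labels.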
Related
Pyplot directly on data from yfinance
Here's a little script which loads data into pyplot from yfinance:
import yfinance as yf
import matplotlib.pyplot as plt
data = yf.Ticker('MSFT').history(period="max").reset_index()[["Date", "Open"]]
plt.plot(data["Date"], data["Open"])
plt.show()
The UI loads quickly and is quite responsive. It looks like this:
Pyplot on equivalent data from csv
Here's a similar script which writes the data to CSV, loads it from CSV, then plots the graph using the data from CSV:
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
data = yf.Ticker('MSFT').history(period="max").reset_index()[["Date", "Open"]]
data.to_csv('out.csv')
from_csv = pd.read_csv('out.csv', index_col=0)
plt.plot(from_csv["Date"], from_csv["Open"])
plt.show()
This time:
The UI loads much more slowly
Zooming is slow
Panning is slow
Resizing is slow
The horizontal axes labels don't display clearly
Question
I'd like to avoid hitting the yfinance API each time the script is run so as to not burden their systems unnecessarily. (I'd include a bit more logic than in the script above which takes care of not accessing the API if a CSV is available. I kept the example simple for the sake of demonstration.)
Is there a way to get the CSV version to result in a pyplot UI that is as responsive as the direct-from-API version?
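When loading from CSV, pandas reads the Date column as plain strings, and matplotlib then treats every string as a separate category with its own tick. The Date column needs to be converted to actual datetime values before plotting: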
from_csv["Date"] = pd.to_datetime(from_csv["Date"])
Here's a fast version of the above script:
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
data = yf.Ticker('MSFT').history(period="max").reset_index()[["Date", "Open"]]
data.to_csv('out.csv')
from_csv = pd.read_csv('out.csv', index_col=0)
from_csv["Date"] = pd.to_datetime(from_csv["Date"])
plt.plot(from_csv["Date"], from_csv["Open"])
plt.show()
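As an alternative sketch, pd.read_csv can do the conversion while loading via its parse_dates parameter. The CSV text below is a small hypothetical stand-in for out.csv, used only so the example runs without touching the yfinance API.

```python
import io
import pandas as pd

# Hypothetical stand-in for out.csv (the values are made up).
csv_text = "Date,Open\n2024-01-02,370.0\n2024-01-03,369.5\n2024-01-04,368.2\n"

# Without parse_dates, the Date column is loaded as strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the column is converted to datetimes while loading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(raw["Date"].dtype)     # a string/object dtype
print(parsed["Date"].dtype)  # a datetime64 dtype
```

Checking the dtype like this is a quick way to confirm that the column matplotlib receives really contains dates.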
I am working with data that has over 2 million columns. It loads quickly in pandas, but I want to apply a linear regression to it, and that takes a very long time in plain Python code because the equation is computed for every single Close price value. Would numpy or another library process this data more efficiently, and is there a function that could speed this up substantially?
%%time
%matplotlib notebook
import pandas as pd
import math
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from copy import deepcopy
from itertools import groupby
minute_2015_20 = 'input.csv'
data = pd.read_csv(minute_2015_20, low_memory=False)
# Reverse the row order and reset the index
data1 = data.iloc[::-1].reset_index(drop=True)
data1
Since I don't know your exact approach, my first thought is to make sure you are running a vectorized operation rather than a Python loop.
If you are already doing that, you can try processing chunks of the data in parallel.
This should make the operations faster.
Another approach, provided you have access to a GPU, is to convert the data to a tensor and run the computation on the GPU.
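To illustrate what "vectorized" means here, a minimal sketch that fits a straight line to a price series in one NumPy call instead of looping over rows. The synthetic Close series and the choice of regressing against the row number are assumptions, since the original regression code was not posted.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Close column (assumption: the real data has one).
rng = np.random.default_rng(0)
n = 1000
close = pd.Series(100 + 0.05 * np.arange(n) + rng.normal(0, 1, n), name="Close")

# Regress Close on the row number; np.polyfit solves the least-squares
# problem for the whole array at once.
x = np.arange(n, dtype=float)
y = close.to_numpy()
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept

# The same slope from the closed-form formula, also without a Python loop.
x_centered = x - x.mean()
slope_closed_form = (x_centered * (y - y.mean())).sum() / (x_centered ** 2).sum()
```

Both versions work on whole arrays, so the per-element work happens in compiled code rather than in the Python interpreter.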
I have a dataframe of emails that has three columns: From, Message and Received (which is a date format).
I've written the below script to show how many messages there are per month in a bar plot.
But the plot doesn't show and I can't work out why. It's no doubt something very simple. Any help understanding why is much appreciated!
Thanks!
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('XXX')
df = df[df['Message'].notna()]
df['Received'] = pd.to_datetime(df['Received'], format='%d/%m/%Y')
df['Received'].groupby(df['Received'].dt.month).count().plot
A figure created through pyplot (commonly imported as plt) is not shown until you call plt.show(). It is designed that way so you can create your plot and then modify it as needed before showing or saving it.
Also check out plt.savefig().
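Note also that the last line of the question ends in .plot without parentheses, so the plotting method is never actually called. A minimal sketch of the full flow follows; the sample emails are made up, and kind="bar" is assumed because the question asks for a bar plot.

```python
import io
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample data in the same shape as the question's CSV.
df = pd.DataFrame({
    "From": ["a@example.com", "b@example.com", "a@example.com", "c@example.com"],
    "Message": ["hi", None, "hello", "bye"],
    "Received": ["05/01/2021", "17/01/2021", "09/02/2021", "23/03/2021"],
})
df = df[df["Message"].notna()].copy()
df["Received"] = pd.to_datetime(df["Received"], format="%d/%m/%Y")

# Count messages per calendar month.
counts = df["Received"].groupby(df["Received"].dt.month).count()

# Call .plot(...) with parentheses so the bar chart is actually drawn.
ax = counts.plot(kind="bar")
ax.set_xlabel("Month")
ax.set_ylabel("Messages")

# Save the current figure (here to an in-memory buffer; a filename works too).
buffer = io.BytesIO()
plt.savefig(buffer, format="png")

# In a script run on a desktop, this is where the window would be opened:
# plt.show()
```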
(My first ever StackOverflow question)
I'm trying to plot bitcoin's market-cap against the date using pandas and matplotlib in Python.
Here is my code:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#read in CSV file using Pandas built in method
df = pd.read_csv("btc.csv", index_col=0, parse_dates=True)
Here are some details about the data frame:
[Image: dataframe details]
matplotlib code:
#Plot marketcap(usd)
plt.plot(df.index, df["marketcap(USD)"])
plt.show()
Result:
[Image: incorrect result]
The plot looks like scribbles that seem to move backwards. How can I fix this?
You can plot your Pandas Series "marketcap(USD)" directly using:
df["marketcap(USD)"].plot()
See the Pandas documentation on Basic Plotting
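If you want to keep calling matplotlib yourself, one thing worth checking is whether the dates are in ascending order, because a line plot connects points in row order and unsorted dates make the line double back. This is a guess about the cause, since the CSV itself is not shown. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up data with the newest date first (assumption about the CSV's order).
dates = pd.date_range("2021-01-01", periods=5, freq="D")[::-1]
df = pd.DataFrame({"marketcap(USD)": [5.0, 4.0, 3.0, 2.0, 1.0]}, index=dates)

# Sort by the datetime index so points are connected in chronological order.
df_sorted = df.sort_index()
ax = df_sorted["marketcap(USD)"].plot()
```

It is also worth printing df.index.dtype to confirm that parse_dates really produced datetimes rather than strings.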
I want the x-axis tick marks to be the different states (i.e. IDLE, Data=Addr, Hammer, etc.) that are in column A of the CSV file.
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.read_csv("Output.csv", index_col = 0)
df1.plot(x = df1.index.values)
I have also tried
df1.plot(xticks = df1.index.values)
without any success.
[Image: CSV file]
[Image: plot]
Thanks in advance!
You may want to try Seaborn, because this looks less like a plotting issue and more like a peripheral styling issue (everything blacked out) in your environment.
Once you have installed Seaborn, add the following lines to your code.
import seaborn as sns
sns.set_style("whitegrid")
As a side note, if you want one x-axis tick per label in your index, replace your plotting code with the following:
df1.plot()
plt.xticks(range(df1.shape[0]), df1.index)
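The same idea as a self-contained sketch, using the Axes object that df1.plot() returns. The frame below is a made-up stand-in for Output.csv; the column name "count" and the state "Done" are hypothetical.

```python
import pandas as pd

# Made-up stand-in for Output.csv, with the state names as the index.
df1 = pd.DataFrame(
    {"count": [3, 7, 2, 5]},
    index=["IDLE", "Data=Addr", "Hammer", "Done"],
)

ax = df1.plot()

# One tick per row, labelled with the corresponding state name.
ax.set_xticks(range(len(df1)))
ax.set_xticklabels(list(df1.index), rotation=45, ha="right")

# Leave room for the rotated labels.
ax.figure.tight_layout()
```

Rotating the labels is optional, but it helps when the state names are long enough to overlap.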
Hope this helps.