Not getting the proper graph comparison using Python


I am trying to compare two CSV files and find their points of intersection. I am plotting both as graphs to make the comparison easier to understand.
But one graph comes out far smaller than the other.
See the following:
Here is the data: trade-volume.csv
Here is the real graph:
Here is the data: miners-revenue.csv
Here is the real graph:
Here is the program I wrote for comparison:
import pandas as pd
import matplotlib.pyplot as plt
dat2 = pd.read_csv("trade-volume.csv", parse_dates=['time'])
dat3 = pd.read_csv("miners-revenue.csv", parse_dates=['time'])
dat2['timeDiff'] = (dat2['time'] - dat2['time'][0]).astype('timedelta64[D]')
dat3['timeDiff'] = (dat3['time'] - dat3['time'][0]).astype('timedelta64[D]')
fig, ax = plt.subplots()
ax.plot(dat2['timeDiff'], dat2['Value'])
ax.plot(dat3['timeDiff'], dat3['Value'])
plt.show()
I got the output like the following:
As you can see, the orange graph sits so low that I cannot make out its points. I want to overlay the two graphs and then compare them.
Please help me make this possible with my existing code, ideally with as little alteration as possible.

The problem comes down to your y axis. One dataset has a maximum of 60,000,000 while the other has a maximum of 6,000,000,000. Plotting both on the same axis makes the smaller one look like a flat line, even though it isn't if you zoom in.
A possible solution is to use a second y axis (you can change the color of the lines using the color= argument in ax.plot()):
import pandas as pd
import matplotlib.pyplot as plt
dat2 = pd.read_csv("trade-volume.csv", parse_dates=['time'])
dat3 = pd.read_csv("miners-revenue.csv", parse_dates=['time'])
# .dt.days gives the elapsed whole days; astype('timedelta64[D]') is rejected by pandas 2.x
dat2['timeDiff'] = (dat2['time'] - dat2['time'][0]).dt.days
dat3['timeDiff'] = (dat3['time'] - dat3['time'][0]).dt.days
fig, ax = plt.subplots()
ax.plot(dat2['timeDiff'], dat2['Value'], color="blue")
# twinx() creates a second y axis that shares the same x axis
ax2 = ax.twinx()
ax2.plot(dat3['timeDiff'], dat3['Value'], color="red")
plt.show()

The two datasets live on very different scales. You can normalize both in order to compare them.
import pandas as pd
import matplotlib.pyplot as plt
dat2 = pd.read_csv("trade-volume.csv", parse_dates=['time'])
dat3 = pd.read_csv("miners-revenue.csv", parse_dates=['time'])
# .dt.days gives the elapsed whole days; astype('timedelta64[D]') is rejected by pandas 2.x
dat2['timeDiff'] = (dat2['time'] - dat2['time'][0]).dt.days
dat3['timeDiff'] = (dat3['time'] - dat3['time'][0]).dt.days
fig, ax = plt.subplots()
# divide each series by its own maximum so both peak at 1.0
ax.plot(dat2['timeDiff'], dat2['Value'] / dat2['Value'].max())
ax.plot(dat3['timeDiff'], dat3['Value'] / dat3['Value'].max())
plt.show()
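Since the question was also about finding where the two curves intersect, here is a rough plain-Python sketch of one way to do that after normalizing: look for sign changes in the difference between the two series. The numbers are made up, normalize and crossing_indices are hypothetical helper names, and it assumes both series are sampled at the same x positions. Points where the curves merely touch (difference exactly zero) are not counted.

```python
def normalize(values):
    """Scale values so the largest becomes 1.0 (same idea as Value / Value.max())."""
    peak = max(values)
    return [v / peak for v in values]


def crossing_indices(a, b):
    """Return indices i where the two curves cross between samples i and i + 1."""
    diff = [x - y for x, y in zip(a, b)]
    return [i for i in range(len(diff) - 1) if diff[i] * diff[i + 1] < 0]


# Made-up numbers on very different scales, for illustration only.
volume = [10e6, 20e6, 60e6, 20e6, 40e6]
revenue = [6e9, 3e9, 1e9, 4e9, 2e9]

crossings = crossing_indices(normalize(volume), normalize(revenue))
print(crossings)
```

Each returned index marks the left-hand sample of an interval in which the curves swap order. If you need the exact crossing position, you could interpolate linearly inside that interval.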

Related

Visualizing Time series data

I'm trying to do some data analysis with Python and pandas on a power consumption dataset.
However, when I plot the data I get a straight line from 5-1-2007 to 13-1-2007, even though I have no missing values in my dataset. This is weird behavior, as I made sure that my dataset is clean.
Has anyone had a similar issue, or can anyone explain this behavior?
Thank you.
EDIT: Here is what the data looks like in that range
EDIT 2 : Here is the link to the original dataset (before cleaning) if that might help: https://archive.ics.uci.edu/ml/machine-learning-databases/00235/

What does the data between 2007-01-01 and 2007-01-15 look like? (use df[(df['Date_Time'] >= '2007-01-01') & (df['Date_Time'] <= '2007-01-15')]).
If no data is missing, it could be that the dataset has been manipulated and the missing data points were interpolated (see Interpolation).
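One way to check whether rows are actually absent (rather than present but empty) is to compare consecutive timestamps. Below is a minimal plain-Python sketch of that idea. find_gaps is a hypothetical helper, and the sample timestamps and the one-minute step are assumptions for illustration only. With pandas, df['Date_Time'].diff() would give the same differences.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_step):
    """Return (earlier, later) pairs that are further apart than expected_step."""
    return [
        (earlier, later)
        for earlier, later in zip(timestamps, timestamps[1:])
        if later - earlier > expected_step
    ]


# Hypothetical minute-level readings with a hole between 00:02 and 00:10.
start = datetime(2007, 1, 5)
stamps = [start + timedelta(minutes=m) for m in (0, 1, 2, 10, 11)]

gaps = find_gaps(stamps, expected_step=timedelta(minutes=1))
for earlier, later in gaps:
    print(f"gap from {earlier} to {later}")
```

If this reports a gap covering the range where the straight line appears, the rows were probably dropped during cleaning rather than filled in.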

The fact is that when there are dates on the x (datetime) axis with no corresponding data on the y axis, the line is still drawn across them. This is especially noticeable in financial data around weekends and holidays, or wherever there are gaps.
The problem is described here.
Although you say that the data is present, try this code anyway; maybe it is a matter of omissions.
To avoid drawing across dates that have no y data, plot against the integer index and use ticker.FuncFormatter(format_date) to label the ticks with dates.
Below is the code, for which I deliberately made gaps in the data file, along with a picture of how it turned out:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
# infer_datetime_format is deprecated in pandas 2.x, so it is omitted here
df = pd.read_csv('custom.csv',
                 index_col='DATE',
                 parse_dates=True)
z = df.iloc[:, 3].values
date = df.iloc[:, 0].index.date
fig, axes = plt.subplots(ncols=2)
ax = axes[0]
ax.plot(date, z)
ax.set_title("Default")
fig.autofmt_xdate()
N = len(z)
ind = np.arange(N)
def format_date(x, pos=None):
    # map the tick position back to the nearest row, clamped to the valid range
    thisind = np.clip(int(x + 0.5), 0, N - 1)
    return date[thisind].strftime('%Y-%m-%d')
ax = axes[1]
ax.plot(ind, z)
ax.xaxis.set_major_formatter(ticker.FuncFormatter(format_date))
ax.set_title("Without empty values")
fig.autofmt_xdate()
plt.show()
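If you would rather keep a real date axis and show a visible break instead of a connecting line, another option is to insert a NaN sample wherever consecutive timestamps are too far apart, since matplotlib does not draw line segments through NaN values. This is only a sketch in plain Python. break_at_gaps is a hypothetical helper, and the dates and readings are made up.

```python
import math
from datetime import datetime, timedelta


def break_at_gaps(timestamps, values, max_step):
    """Insert a NaN sample after every point followed by a gap longer than max_step."""
    out_t, out_v = [], []
    for i, (t, v) in enumerate(zip(timestamps, values)):
        out_t.append(t)
        out_v.append(v)
        is_last = i == len(timestamps) - 1
        if not is_last and timestamps[i + 1] - t > max_step:
            # Place the NaN marker one step after the last real sample before the gap.
            out_t.append(t + max_step)
            out_v.append(math.nan)
    return out_t, out_v


day = timedelta(days=1)
dates = [datetime(2007, 1, d) for d in (3, 4, 5, 13, 14)]
readings = [1.0, 1.2, 1.1, 0.9, 1.0]

plot_dates, plot_values = break_at_gaps(dates, readings, max_step=day)
print(len(plot_values))  # one extra NaN sample was added
```

Passing plot_dates and plot_values to ax.plot() should then leave the gap empty instead of bridging it with a straight line.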

matplotlib: Add AxesSubplot instances to a figure

I'm going insane here ... this should be a simple exercise but I'm stuck:
I have a Jupyter notebook and am using the ruptures Python package. All I want to do is, take the figure or AxesSubplot(s) that the display() function returns and add it to a figure of my own, so I can share the x-axis, have a single image, etc.:
import pandas as pd
import ruptures as rpt
import matplotlib.pyplot as plt
myfigure = plt.figure()
l = len(df.columns)
# df.items() yields (column name, Series) pairs
for index, (_, series) in enumerate(df.items()):
    data = series.to_numpy().astype(int)
    algo = rpt.KernelCPD(kernel='rbf', min_size=4).fit(data)
    result = algo.predict(pen=3)
    myfigure.add_subplot(l, 1, index+1)
    rpt.display(data, result)
    plt.title(series.name)
plt.show()
What I get is a figure with the desired number of subplots (all empty) and n separate figures from ruptures:
When instead I want the subplots to be filled with the figures ...

I basically had to recreate the plot that ruptures.display(data,result) produces, to get my desired figure:
import pandas as pd
import numpy as np
import ruptures as rpt
import matplotlib.pyplot as plt
from matplotlib.ticker import EngFormatter
fig, axs = plt.subplots(len(df.columns), figsize=(22,20), dpi=300)
for index, series in enumerate(df):
    # iterating a DataFrame yields column names, so `series` is a column label here
    resampled = df[series].dropna().resample('6H').mean().ffill()
    data = resampled.to_numpy().astype(int)
    algo = rpt.KernelCPD(kernel='rbf', min_size=4).fit(data)
    result = algo.predict(pen=3)
    # Create ndarray of tuples from the result
    result = np.insert(result, 0, 0) # Insert 0 as first result
    tuples = np.array([ result[i:i+2] for i in range(len(result)-1) ])
    ax = axs[index]
    # Fill area between results alternating blue/red
    for i, tup in enumerate(tuples):
        if i%2==0:
            ax.axvspan(tup[0], tup[1], lw=0, alpha=.25)
        else:
            ax.axvspan(tup[0], tup[1], lw=0, alpha=.25, color='red')
    ax.plot(data)
    ax.set_title(series)
    ax.yaxis.set_major_formatter(EngFormatter())
plt.subplots_adjust(hspace=.3)
plt.show()
I've wasted more time on this than I can justify, but it's pretty now and I can sleep well tonight :D
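For anyone adapting this, the core of the workaround is turning the list of breakpoints into (start, end) pairs that can be shaded one by one. Here is that step on its own as a plain-Python sketch. breakpoints_to_segments is a hypothetical helper, and the example list is made up in the shape the code above expects (increasing indices, with the final one equal to the signal length).

```python
def breakpoints_to_segments(breakpoints):
    """Turn [b1, b2, ..., n] into [(0, b1), (b1, b2), ..., (b_last, n)]."""
    edges = [0] + list(breakpoints)
    return list(zip(edges[:-1], edges[1:]))


# Hypothetical breakpoint list, for illustration only.
example_result = [40, 95, 120]

segments = breakpoints_to_segments(example_result)
for i, (start, end) in enumerate(segments):
    # Same even/odd alternation as the axvspan loop above.
    colour = "blue" if i % 2 == 0 else "red"
    print(start, end, colour)
```

Each pair can then be passed to ax.axvspan(start, end, ...) on whichever axes you created yourself.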

Changing the order of pandas/matplotlib line plotting without changing data order

Given the following example:
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
df.plot(linewidth=10)
The order of plotting puts the last column on top:
How can I make this keep the data & legend order but change the behaviour so that it plots X on top of Y on top of Z?
(I know I can change the data column order and edit the legend order but I am hoping for a simpler easier method leaving the data as is)
UPDATE: final solution used:
(Thanks to r-beginners) I used the get_lines to modify the z-order of each plot
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
fig = plt.figure()
ax = fig.add_subplot(111)
df.plot(ax=ax, linewidth=10)
lines = ax.get_lines()
for i, line in enumerate(lines, -len(lines)):
    line.set_zorder(abs(i))
fig
In a notebook produces:

Get the default zorder and sort it in the desired order.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.random.seed(2021)
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
ax = df.plot(linewidth=10)
l = ax.get_children()
print(l)
l[0].set_zorder(3)
l[1].set_zorder(1)
l[2].set_zorder(2)
Before definition
After defining zorder

I will just put this answer here because it is a solution to the problem, but probably not the one you are looking for.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# generate data
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
# read columns in reverse order and plot them
# so normally, the legend will be inverted as well, but if we invert it again, you should get what you want
df[df.columns[::-1]].plot(linewidth=10, legend="reverse")
Note that in this example, you don't change the order of your data, you just read it differently, so I don't really know if that's what you want.
You can also make it easier on the eyes by creating a corresponding method.
def plot_dataframe(df: pd.DataFrame) -> None:
    df[df.columns[::-1]].plot(linewidth=10, legend="reverse")
# then you just have to call this
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
plot_dataframe(df)

matplotlib event.xdata out of timeseries range

I'm having an issue using matplotlib's event.xdata when plotting a pandas time series. I tried to reproduce the answer proposed in a very related question, but I get very strange behavior.
Here's the code, adapted to python3 and with a little more stuff in the on_click() function:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
def on_click(event):
    if event.inaxes is not None:
        # provide raw and converted x data
        print(f"{event.xdata} --> {mdates.num2date(event.xdata)}")
        # add a vertical line at clicked location
        line = ax.axvline(x=event.xdata)
        plt.draw()
t = pd.date_range('2015-11-01', '2016-01-06', freq='H')
y = np.random.normal(0, 1, t.size).cumsum()
df = pd.DataFrame({'Y':y}, index=t)
fig, ax = plt.subplots()
line = None
df.plot(ax=ax)
fig.canvas.mpl_connect('button_press_event', on_click)
plt.show()
If I launch this, I get the following diagram. The date range is as expected, between Nov. 2015 and Jan. 2016; so is the cursor position shown in the footer of the window (here 2015-11-01 10:00), and the vertical lines are placed correctly:
However, the command-line output is as follows:
C:\Users\me\Documents\code\>python matplotlib_even.xdate_num2date.py
402189.6454115977 --> 1102-02-27 15:29:23.562039+00:00
402907.10400704964 --> 1104-02-15 02:29:46.209088+00:00
Those event.xdata values are clearly outside both the input data range and the x axis range, and are unusable later on (for example, to find the closest y value in the series).
So, does anyone know how I can get a correct xdata?

Something must have changed in the way matplotlib/pandas handles datetime info between the answer to the related question you linked and now. I cannot comment on why, but I found a solution to your problem.
I went digging through the code that shows the coordinates in the bottom left of the status bar, and I found that when you're plotting a timeseries, pandas patches the function that prints this info and replaces it with this one.
From there, you can see that you need to convert the float value to a Period object.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def on_click(event):
    # event.xdata is a float; pandas expects an integer ordinal for the Period
    print(pd.Period(ordinal=int(event.xdata), freq='H'))
t = pd.date_range('2015-11-01', '2016-01-06', freq='H')
y = np.random.normal(0, 1, t.size).cumsum()
df = pd.DataFrame({'Y': y}, index=t)
fig, ax = plt.subplots()
df.plot(ax=ax)
fig.canvas.mpl_connect('button_press_event', on_click)
plt.show()
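To see why this works: as far as I understand, the ordinal of an hourly Period is the number of hours since 1970-01-01 00:00, so the raw xdata values in the question are hour counts rather than matplotlib date numbers. The stdlib-only sketch below checks that reading against the two values printed in the question. hourly_ordinal_to_datetime is a hypothetical helper, and the epoch interpretation is my assumption, not something taken from the pandas source.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def hourly_ordinal_to_datetime(xdata):
    """Interpret xdata as hours since 1970-01-01, truncated to the hour like int() above."""
    return EPOCH + timedelta(hours=int(xdata))


# The two raw values printed in the question.
for raw in (402189.6454115977, 402907.10400704964):
    print(raw, "-->", hourly_ordinal_to_datetime(raw))
```

Both results land inside the plotted Nov. 2015 to Jan. 2016 range, which is consistent with that interpretation. With a different index frequency, the unit would presumably change accordingly.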

HDF5 file to diagram in python

I'm trying to generate some diagrams from an .h5 file but I don't know how to do it.
I'm using pytables, numpy and matplotlib.
The HDF5 files I use contain 2 sets of data, for 2 different curves.
My goal is to get diagrams like this one.
This is what I managed to do for the moment:
import tables as tb
import numpy as np
import matplotlib.pyplot as plt
# PyTables 3.x uses snake_case names (open_file, walk_groups, walk_nodes)
h5file = tb.open_file(args['FILE'], "a")
for group in h5file.walk_groups("/"):
    for array in h5file.walk_nodes("/", "Array"):
        if(isinstance(array.atom.dflt, int)):
            tab = np.array(array.read())
            x = tab[0]
            y = tab[1]
            plt.plot(x, y)
            plt.show()
x and y values are good but I don't know how to use them, so the result is wrong. I get a triangle instead of what I want ^^
Thank you for your help
EDIT
I solved my problem.
Here is the code :
fig = plt.figure()
tableau = np.array(array.read())
x = tableau[0]
y = tableau[1]
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
ax1.plot(x)
ax2.plot(y)
plt.title(array.name)
plt.show()
