I'm trying to do some data analysis with Python and pandas on a power consumption dataset.
However, when I plot the data I get a straight line from 5-1-2007 to 13-1-2007, even though I have no missing values in my dataset. This is weird behavior, as I made sure that my dataset is clean.
Has anyone had a similar issue, or can anyone explain this behavior?
Thank you.
EDIT: Here is what the data looks like in that range
EDIT 2 : Here is the link to the original dataset (before cleaning) if that might help: https://archive.ics.uci.edu/ml/machine-learning-databases/00235/
What does the data between 2007-01-01 and 2007-01-15 look like? (Use df[(df['Date_Time'] >= '2007-01-01') & (df['Date_Time'] <= '2007-01-15')].)
If no data is missing, it could be that the dataset has been manipulated and the missing data points were interpolated (see Interpolation).
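A quick way to check for such a hole without plotting is to look at the time difference between consecutive rows. This is only a minimal sketch: the frame below is made-up stand-in data, and the column names Date_Time and power as well as the one-hour threshold are assumptions, not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the power-consumption data: minute-level
# readings with one multi-day hole (all values are made up).
idx = pd.date_range("2007-01-01", "2007-01-20", freq="min")
idx = idx[(idx < "2007-01-05") | (idx >= "2007-01-13")]
df = pd.DataFrame({"Date_Time": idx, "power": 1.0})

# Time elapsed between each row and the previous one.
step = df["Date_Time"].diff()

# Rows that follow an unusually long pause mark the end of a gap.
# The one-hour threshold is an arbitrary choice for this sketch.
gaps = df.loc[step > pd.Timedelta(hours=1), ["Date_Time"]].assign(gap=step)
print(gaps)
```

If gaps comes back empty on the real data, the rows are all there and the straight line has some other cause.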
The fact is that when the x axis is a datetime axis, a stretch of dates with no y data is still rendered: the line simply runs straight across the gap.
This is especially noticeable with financial data on weekends and holidays, or wherever there are gaps.
Although you say that the data is present, try this code anyway; maybe it is a matter of omissions.
To avoid drawing where there is no y data, the values are plotted against their integer position and the tick labels are produced with ticker.FuncFormatter(format_date).
Below is the code, where I deliberately made gaps in the data file, and a picture of how it turned out:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
df = pd.read_csv('custom.csv',
                 index_col='DATE',
                 parse_dates=True,
                 infer_datetime_format=True)
z = df.iloc[:, 3].values
date = df.iloc[:, 0].index.date
fig, axes = plt.subplots(ncols=2)
ax = axes[0]
ax.plot(date, z)
ax.set_title("Default")
fig.autofmt_xdate()
N = len(z)
ind = np.arange(N)
def format_date(x, pos=None):
    thisind = np.clip(int(x + 0.5), 0, N - 1)
    return date[thisind].strftime('%Y-%m-%d')
ax = axes[1]
ax.plot(ind, z)
ax.xaxis.set_major_formatter(ticker.FuncFormatter(format_date))
ax.set_title("Without empty values")
fig.autofmt_xdate()
plt.show()
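A different way to get at the same effect, if you would rather keep a real date axis, is to put the series on a regular time grid so that the missing timestamps become NaN. Matplotlib does not draw a line segment through NaN, so the gap shows up as a break instead of a straight line. This is just a sketch with made-up daily data, not the code from the answer above.

```python
import numpy as np
import pandas as pd

# Made-up daily series with a hole from Jan 5 to Jan 12 (inclusive).
idx = pd.date_range("2007-01-01", "2007-01-20", freq="D")
ser = pd.Series(np.arange(len(idx), dtype=float), index=idx)
ser = ser[(ser.index < "2007-01-05") | (ser.index > "2007-01-12")]

# Reindex onto the full daily grid: the missing days come back as NaN.
full = ser.asfreq("D")
print(full.isna().sum())
```

Calling full.plot() then leaves the missing days blank, while ser.plot() would connect Jan 4 directly to Jan 13.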
I need some help here, since I'm something of a novice. I was able to import and plot my time series data via pandas and Matplotlib, respectively. The thing is, the plot is too cramped (due to the amount of data).
Using the same data set, is it possible to 'divide' the whole plot into 3 separate subplots?
Here's a sample of what I mean:
What I'm trying to do here is distribute my plot over 3 subplots (it seems there is no ncol=x option).
Initially, my code looks like this:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import pandas as pd
pd.options.display.float_format = '{:,.4f}'.format
data = pd.read_csv ('all_visuallc.csv')
df = pd.DataFrame(data, columns= ['JD', 'Magnitude'])
print(df) #displays ~37000ish data x 2 columns
colors = ('#696969') #a very nice dim grey heh
area = np.pi*1.7
ax = df.plot.scatter(x="JD", y="Magnitude", s=area, c=colors, alpha=0.2)
ax.set(title='HD 39801', xlabel='Julian Date', ylabel='Visual Magnitude')
ax.invert_yaxis()
ax.xaxis.set_minor_locator(ticker.AutoMinorLocator())
ax.yaxis.set_minor_locator(ticker.AutoMinorLocator())
plt.rcParams['figure.figsize'] = [20, 4]
plt.rcParams['figure.dpi'] = 250
plt.savefig('test_p.jpg')
plt.show()
which shows a very tight plot:
Thanks everyone and I do hope for your help and responses.
P.S. I think iloc[value:value] to slice from a df may work?
First of all, you have to create a separate plot for each part of your data.
For example, if we want to split the data into 3 parts, we create 3 subplots. Then, as you correctly wrote, we can apply iloc (or another type of indexing) to the data.
Here is a toy example; I hope you will be able to apply your decorations to it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

y = np.arange(0, 20, 1)
x = np.arange(20, 40, 1)
sample = pd.DataFrame(x, y).reset_index().rename(columns={'index': 'y',
                                                          0: 'x'})
n_plots = 3
figs, axs = plt.subplots(n_plots, figsize=[30, 10])
# Suppose we want to split data into equal parts
start_ind = 0
for i in range(n_plots):
    end_ind = start_ind + round(len(sample) / n_plots)  # (*)
    part_of_frame = sample.iloc[start_ind:end_ind]
    axs[i].scatter(part_of_frame['x'], part_of_frame['y'])
    start_ind = end_ind
plt.show()
It's also possible to split the data into unequal parts by changing the logic on the line marked (*).
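For unequal parts, one simple option is to list the row positions where each part should start and end, then slice between neighbouring positions. The cut points below are hypothetical and only there to illustrate the idea.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({"x": np.arange(20, 40), "y": np.arange(0, 20)})

# Hypothetical cut points (row positions): parts of 5, 10 and 5 rows.
cuts = [0, 5, 15, len(sample)]

# Slice between each pair of neighbouring cut points.
parts = [sample.iloc[a:b] for a, b in zip(cuts[:-1], cuts[1:])]
print([len(p) for p in parts])
```

Each element of parts can then be passed to its own axs[i].scatter(...) call, exactly as in the loop above.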
I'm having an issue using matplotlib's event.xdata when plotting a pandas time series. I tried to reproduce the answer proposed in a closely related question, but I get very strange behavior.
Here's the code, adapted to python3 and with a little more stuff in the on_click() function:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
def on_click(event):
    if event.inaxes is not None:
        # provide raw and converted x data
        print(f"{event.xdata} --> {mdates.num2date(event.xdata)}")
        # add a vertical line at clicked location
        line = ax.axvline(x=event.xdata)
        plt.draw()
t = pd.date_range('2015-11-01', '2016-01-06', freq='H')
y = np.random.normal(0, 1, t.size).cumsum()
df = pd.DataFrame({'Y':y}, index=t)
fig, ax = plt.subplots()
line = None
df.plot(ax=ax)
fig.canvas.mpl_connect('button_press_event', on_click)
plt.show()
If I launch this, I get the following diagram, with the expected date range between Nov. 2015 and Jan. 2016. The cursor position shown in the footer of the window (here 2015-11-01 10:00) is also correct, as is the location of the vertical lines:
However, the command-line output is as follows:
C:\Users\me\Documents\code\>python matplotlib_even.xdate_num2date.py
402189.6454115977 --> 1102-02-27 15:29:23.562039+00:00
402907.10400704964 --> 1104-02-15 02:29:46.209088+00:00
Those event.xdata values are clearly outside both the input data range and the x-axis data range, and are unusable for later use (for example, to find the closest y value in the series).
So, does anyone know how I can get a correct xdata?
Something must have changed in the way matplotlib/pandas handles datetime info between the answer to the related question you linked and now. I cannot comment on why, but I found a solution to your problem.
I went digging through the code that shows the coordinates in the bottom left of the status bar, and I found that when you're plotting a time series, pandas patches the function that prints this info and replaces it with its own version.
From there, you can see that you need to convert the float value to a Period object.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def on_click(event):
    print(pd.Period(ordinal=int(event.xdata), freq='H'))
t = pd.date_range('2015-11-01', '2016-01-06', freq='H')
y = np.random.normal(0, 1, t.size).cumsum()
df = pd.DataFrame({'Y': y}, index=t)
fig, ax = plt.subplots()
df.plot(ax=ax)
fig.canvas.mpl_connect('button_press_event', on_click)
plt.show()
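To go one step further and find the closest y value, the Period can be turned into a Timestamp and looked up in the index. The sketch below does this without any GUI, feeding in the first event.xdata value from the question's output. The helper name xdata_to_timestamp is made up for this example, and the frequency is passed as a pd.offsets.Hour() object rather than a string, on the assumption that this sidesteps spelling differences of the hourly alias between pandas versions.

```python
import numpy as np
import pandas as pd


def xdata_to_timestamp(xdata, freq):
    """Convert a pandas-plot x coordinate (a period ordinal) to a Timestamp."""
    return pd.Period(ordinal=int(xdata), freq=freq).to_timestamp()


hour = pd.offsets.Hour()
t = pd.date_range("2015-11-01", "2016-01-06", freq=hour)
# Deterministic stand-in values so the lookup is easy to check by eye.
df = pd.DataFrame({"Y": np.arange(t.size, dtype=float)}, index=t)

# First raw value printed in the question.
ts = xdata_to_timestamp(402189.6454115977, hour)

# Position of the index entry nearest to the clicked time.
pos = df.index.get_indexer([ts], method="nearest")[0]
print(ts, df["Y"].iloc[pos])
```

Inside on_click you would call the helper with event.xdata and the frequency of your own index.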
I am trying to compare two CSV files and find a proper point of intersection between them. I am plotting them for better understanding.
But one graph comes out very diminished compared to the other.
See the following:
Here is the data: trade-volume.csv
Here is the real graph:
Here is the data: miners-revenue.csv
Here is the real graph:
Here is the program I wrote for comparison:
import pandas as pd
import matplotlib.pyplot as plt
dat2 = pd.read_csv("trade-volume.csv", parse_dates=['time'])
dat3 = pd.read_csv("miners-revenue.csv", parse_dates=['time'])
dat2['timeDiff'] = (dat2['time'] - dat2['time'][0]).astype('timedelta64[D]')
dat3['timeDiff'] = (dat3['time'] - dat3['time'][0]).astype('timedelta64[D]')
fig, ax = plt.subplots()
ax.plot(dat2['timeDiff'], dat2['Value'])
ax.plot(dat3['timeDiff'], dat3['Value'])
plt.show()
I got the output like the following:
As one can see, the orange graph is very low, and I cannot make out its points. I would like to overlap the graphs and then check.
Please help me make this possible with my existing code, ideally with little or no alteration.
The problem comes down to your y axis. One has a maximum of 60,000,000 while the other has a maximum of 6,000,000,000. Trying to plot these on the same graph is going to lead to one "looking" like a straight line even though it isn't if you zoom in.
A possible solution is to use a second y axis (you can change the color of the lines using the color= argument in ax.plot()):
import pandas as pd
import matplotlib.pyplot as plt
dat2 = pd.read_csv("trade-volume.csv", parse_dates=['time'])
dat3 = pd.read_csv("miners-revenue.csv", parse_dates=['time'])
dat2['timeDiff'] = (dat2['time'] - dat2['time'][0]).astype('timedelta64[D]')
dat3['timeDiff'] = (dat3['time'] - dat3['time'][0]).astype('timedelta64[D]')
fig, ax = plt.subplots()
ax.plot(dat2['timeDiff'], dat2['Value'], color="blue")
ax2=ax.twinx()
ax2.plot(dat3['timeDiff'], dat3['Value'], color="red")
plt.show()
The two datasets live on very different scales. You may normalize both in order to compare them.
import pandas as pd
import matplotlib.pyplot as plt
dat2 = pd.read_csv("trade-volume.csv", parse_dates=['time'])
dat3 = pd.read_csv("miners-revenue.csv", parse_dates=['time'])
dat2['timeDiff'] = (dat2['time'] - dat2['time'][0]).astype('timedelta64[D]')
dat3['timeDiff'] = (dat3['time'] - dat3['time'][0]).astype('timedelta64[D]')
fig, ax = plt.subplots()
ax.plot(dat2['timeDiff'], dat2['Value']/dat2['Value'].values.max())
ax.plot(dat3['timeDiff'], dat3['Value']/dat3['Value'].values.max())
plt.show()
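Here is the same normalization as a small sketch that runs without the CSV files. The two frames are made-up stand-ins with the same time and Value columns. It also uses .dt.days for the day offset, since as far as I know recent pandas releases no longer accept astype('timedelta64[D]'); treat that as an assumption about your pandas version.

```python
import pandas as pd

# Made-up stand-ins for trade-volume.csv and miners-revenue.csv.
dat2 = pd.DataFrame({
    "time": pd.date_range("2018-01-01", periods=4, freq="D"),
    "Value": [2e9, 6e9, 4e9, 3e9],
})
dat3 = pd.DataFrame({
    "time": pd.date_range("2018-01-01", periods=4, freq="D"),
    "Value": [1e7, 3e7, 6e7, 2e7],
})

for dat in (dat2, dat3):
    # Whole days elapsed since the first row.
    dat["timeDiff"] = (dat["time"] - dat["time"].iloc[0]).dt.days
    # Scale so that each series peaks at 1.0.
    dat["scaled"] = dat["Value"] / dat["Value"].max()

print(dat2[["timeDiff", "scaled"]])
print(dat3[["timeDiff", "scaled"]])
```

Plotting the scaled column of both frames against timeDiff puts the two curves in the same 0 to 1 range.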
I am trying to show time series lines representing an effort amount using matplotlib and pandas.
I've got all my DataFrames to overlay in one plot; however, when I do, Python seems to strip the dates from the x axis and put in some numbers instead. (I'm not sure where these come from, but at a guess, not all days contain the same data, so Python has reverted to using an index id number.) If I plot any one of them on its own, it comes up with dates on the x axis.
Any hints or solutions to make the x axis show dates for the multiple plot would be much appreciated.
This is the single-figure plot with the time axis:
The code I'm using to plot is:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(b342,color='black')
ax.plot(b343,color='blue')
ax.plot(b344,color='red')
ax.plot(b345,color='green')
ax.plot(b346,color='pink')
ax.plot(fi,color='yellow')
plt.show()
This is the multiple plot fig with weird x axis:
One option would be to manually specify the x-axis based on the DataFrame index, and then plot directly using matplotlib.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# make up some data
n = 100
dates = pd.date_range(start = "2015-01-01", periods = n, name = "yearDate")
dfs = []
for i in range(3):
    df = pd.DataFrame(data = np.random.random(n)*(i + 1), index = dates,
                      columns = ["FishEffort"] )
    df.df_name = str(i)
    dfs.append(df)
# plot it directly using matplotlib instead of through the DataFrame
fig = plt.figure()
ax = fig.add_subplot()
for df in dfs:
    plt.plot(df.index,df["FishEffort"], label = df.df_name)
plt.legend()
plt.show()
Another option would be to concatenate your DataFrames and plot using Pandas. If you give your "FishEffort" field the correct label name when loading the data or via DataFrame.rename then the labels will be specified automatically.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
n = 100
dates = pd.date_range(start = "2015-01-01", periods = n, name = "yearDate")
dfs = []
for i in range(3):
    df = pd.DataFrame(data = np.random.random(n)*(i + 1), index = dates,
                      columns = ["DataFrame #" + str(i) ] )
    df.df_name = str(i)
    dfs.append(df)
df = pd.concat(dfs, axis = 1)
df.plot()
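If your frames all arrive with the same column name, the DataFrame.rename route could look roughly like this. The names b342, b343 and b344 are borrowed from the question as labels; the data is made up.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2015-01-01", periods=5, name="yearDate")

# Labels borrowed from the question; the values are made up.
names = ["b342", "b343", "b344"]

frames = []
for i, name in enumerate(names):
    frame = pd.DataFrame({"FishEffort": np.arange(5) * (i + 1)}, index=dates)
    # Give each frame's column a distinct name so it becomes the legend label.
    frames.append(frame.rename(columns={"FishEffort": name}))

combined = pd.concat(frames, axis=1)
print(list(combined.columns))
```

combined.plot() then draws one line per column on a date axis and uses the column names in the legend.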
I've found an answer that does what I want. It seems that calling plt.plot wasn't using the dates as the x axis; however, plotting through pandas, as shown in the pandas documentation, did the trick.
ax = b342.plot(label='342')
b343.plot(ax=ax, label='test')
b344.plot(ax=ax)
b345.plot(ax=ax)
b346.plot(ax=ax)
fi.plot(ax=ax)
plt.show()
I was wondering if anyone knows how to change the labels here?
My DataFrame has an uneven time index.
How can I plot the data and place the index labels automatically? I searched here, and I know I can plot something like
e.plot()
but then the time index (x axis) has even intervals, for example every 5 minutes.
If I have 100 data points in the first 5 minutes and 6 data points in the second 5 minutes, how do I plot
the data evenly spaced by count, and place the right timestamps on the x axis?
Here's the plot by even count, but I don't know how to add the time index.
plot(e['Bid'].values)
example of data format as requested
Time,Bid
2014-03-05 21:56:05:924300,1.37275
2014-03-05 21:56:05:924351,1.37272
2014-03-05 21:56:06:421906,1.37275
2014-03-05 21:56:06:421950,1.37272
2014-03-05 21:56:06:920539,1.37275
2014-03-05 21:56:06:920580,1.37272
2014-03-05 21:56:09:071981,1.37275
2014-03-05 21:56:09:072019,1.37272
and here's the link
http://code.google.com/p/eu-ats/source/browse/trunk/data/new/eur-fix.csv
Here's the code I used to plot:
import numpy as np
import pandas as pd
import datetime as dt
e = pd.read_csv("data/ecb/eur.csv", dtype={'Time':object})
e.Time = pd.to_datetime(e.Time, format='%Y-%m-%d %H:%M:%S:%f')
e.plot()
f = e.copy()
f.index = f.Time
x = [str(s)[:-7] for s in f.index]
ff = f.set_index(pd.Series(x))
ff.index.name = 'Time'
ff.plot()
Update:
I added two new plots for comparison to clarify the issue. I also tried brute force: converting the timestamp index back to strings and plotting the strings as the x axis. The format easily gets messed up, and it seems hard to customize the location of the x labels.
Ok, it seems like what you're after is that you want to move around the x-tick locations so that there are an equal number of points between each tick. And you'd like to have the grid drawn on these appropriately-located ticks. Do I have that right?
If so:
import pandas as pd
from urllib.request import urlopen  # Python 3 location of urlopen
import matplotlib.pyplot as plt
import seaborn as sbn
content = urlopen('https://eu-ats.googlecode.com/svn/trunk/data/new/eur-fix.csv')
df = pd.read_csv(content, header=0)
df['Time'] = pd.to_datetime(df['Time'], format='%Y-%m-%d %H:%M:%S:%f')
every30 = df.loc[df.index % 30 == 0, 'Time'].values
fig, ax = plt.subplots(1, 1, figsize=(9, 5))
df.plot(x='Time', y='Bid', ax=ax)
ax.set_xticks(every30)
I have tried to reproduce your issue, but I can't seem to. Can you have a look at this example and see how your situation differs?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
np.random.seed(0)
idx = pd.date_range('11:00', '21:30', freq='1min')
ser = pd.Series(data=np.random.randn(len(idx)), index=idx)
ser = ser.cumsum()
for i in range(20):
    for j in range(8):
        ser.iloc[10*i +j] = np.nan
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
ser.plot(ax=axes[0])
ser.dropna().plot(ax=axes[1])
gives the following two plots:
There are a couple of differences between the graphs. The one on the left doesn't connect the non-continuous bits of data, and it lacks vertical gridlines. But both seem to respect the actual index of the data. Can you show an example of your e series? What is the exact format of its index? Is it a DatetimeIndex or is it just text?
Edit:
Playing with this, my guess is that your index is actually just text. If I continue from above with:
idx_str = [str(x) for x in idx]
newser = ser.copy()  # copy, so that ser keeps its datetime index
newser.index = idx_str
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
newser.plot(ax=axes[0])
newser.dropna().plot(ax=axes[1])
then I get something like your problem:
More edit:
If this is in fact your issue (the index is a bunch of strings, not really a bunch of timestamps) then you can convert them and all will be well:
idx_fixed = pd.to_datetime(idx_str)
fixedser = newser.copy()  # copy, so that newser keeps its string index
fixedser.index = idx_fixed
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
fixedser.plot(ax=axes[0])
fixedser.dropna().plot(ax=axes[1])
produces output identical to the first code sample above.
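A quick check for this, without plotting anything, is to look at the type of the index before and after the conversion. This is a small sketch on made-up data.

```python
import pandas as pd

idx = pd.date_range("2014-03-05 11:00", periods=4, freq="min")

# A series whose index only looks like timestamps but is really text.
as_text = pd.Series([1.0, 2.0, 3.0, 4.0], index=[str(x) for x in idx])
print(isinstance(as_text.index, pd.DatetimeIndex))

# Convert on a copy so the original stays available for comparison.
fixed = as_text.copy()
fixed.index = pd.to_datetime(fixed.index)
print(isinstance(fixed.index, pd.DatetimeIndex))
```

The first check prints False and the second prints True. Running the same isinstance check on e.index should tell you which case you are in.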
Editing again:
To see the uneven spacing of the data, you can do this:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
fixedser.plot(ax=axes[0], marker='.', linewidth=0)
fixedser.dropna().plot(ax=axes[1], marker='.', linewidth=0)
Let me try this one from scratch. Does this solve your issue?
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
from urllib.request import urlopen  # Python 3 location of urlopen
content = urlopen('https://eu-ats.googlecode.com/svn/trunk/data/new/eur-fix.csv')
df = pd.read_csv(content, header=0, index_col='Time')
df.index = pd.to_datetime(df.index, format='%Y-%m-%d %H:%M:%S:%f')
df.plot()
The thing is, you want to plot bid vs time. If you've put the times into your index then they become your x-axis for "free". If the time data is just another column, then you need to specify that you want to plot bid as the y-axis variable and time as the x-axis variable. So in your code above, even when you convert the time data to be datetime type, you were never instructing pandas/matplotlib to use those datetimes as the x-axis.
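To make the two options concrete, here is a sketch that uses the first few sample rows from the question instead of the downloaded file. It parses the unusual seconds:microseconds format and then builds a series indexed by time.

```python
import io
import pandas as pd

# First rows of the sample data posted in the question.
raw = io.StringIO(
    "Time,Bid\n"
    "2014-03-05 21:56:05:924300,1.37275\n"
    "2014-03-05 21:56:05:924351,1.37272\n"
    "2014-03-05 21:56:06:421906,1.37275\n"
    "2014-03-05 21:56:06:421950,1.37272\n"
)
df = pd.read_csv(raw)
df["Time"] = pd.to_datetime(df["Time"], format="%Y-%m-%d %H:%M:%S:%f")

# Option 1: move the timestamps into the index, so they become the x axis.
by_index = df.set_index("Time")["Bid"]
print(by_index.index.dtype)
```

by_index.plot() uses the timestamps as the x axis. The second option is to leave Time as a column and call df.plot(x='Time', y='Bid').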