Python pandas and matplotlib automatically filling in missing data

I'm very new to Python and trying to figure out how to graph some data that can have missing values for any given date.
The data is the number of jobs completed (y), their rating (secondary y), and the date (x).
The graph looks how I'd like, but jobs don't get completed every day, so there are days with no data and the line on the graph just stops.
Is there a way to have it automatically connect the dots on the graph?
import matplotlib.pyplot as plt
import pandas as pd
import database
df = pd.DataFrame(database.getTasks("Pete"), columns=['date', 'rating', 'jobs']).set_index('date')
fig, ax = plt.subplots()
ax3 = ax.twinx()
rspine = ax3.spines['right']
rspine.set_position(('axes', 1.15))
ax3.set_frame_on(True)
ax3.patch.set_visible(False)
df.jobs.plot(ax=ax, style='b-')
df.rating.plot(ax=ax, style='r-', secondary_y=True)
plt.show()

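I think you are looking for DataFrame.fillna() with forward fill.
df = df.fillna(method='ffill')
Forward fill ('ffill') uses the last valid observation in place of a missing value. fillna returns a new DataFrame, so assign the result back. In recent pandas versions the method argument of fillna is deprecated; df.ffill() does the same thing.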

To fill your data you can use pandas fillna with method='ffill' to propagate the last valid value. Check the documentation to see which method fits best.
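
As a hedged illustration of the options mentioned in the answers above, the sketch below uses made-up data (the dates and job counts are assumptions, not the asker's database) to compare forward filling with two alternatives: linear interpolation, and simply dropping the empty days before plotting. Matplotlib breaks a line at NaN values, so any of the three gives a continuous line; they differ in what the line claims about the missing days.

```python
import numpy as np
import pandas as pd

# Hypothetical data: jobs completed per day, with two days missing.
dates = pd.date_range("2024-01-01", periods=5, freq="D")
jobs = pd.Series([3.0, np.nan, 5.0, np.nan, 9.0], index=dates, name="jobs")

filled = jobs.ffill()              # repeat the last known value
interpolated = jobs.interpolate()  # straight line between known values
observed = jobs.dropna()           # keep only the days that have data

print(pd.DataFrame({"raw": jobs, "ffill": filled, "interpolate": interpolated}))
```

Passing observed (or interpolated) to .plot() instead of the raw column should draw an unbroken line. Dropping the empty days is arguably the most honest choice, because it invents no values.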

Related

Plotting dates with Pandas Matplotlib - random (apparently) years

I am trying to plot a simple .csv file downloaded from Yahoo-finance (file example here), but I cannot understand why the years appear as (apparently) random numbers. Please see image below:
Another thing that I would like to do is to remove the x axis from the top graph (since the same axis is already in the bottom plot), but I would like to keep the dashed grid. I tried to use ax[0].set_xticklabels([]), but it didn't work.
Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, MonthLocator, YearLocator
#LOAD DATA
df_name = "0P0000UL8U.L.csv"
col_list = ["Date", "Adj Close"] #list of column to import
df = pd.read_csv(df_name, header=0, usecols=col_list, na_values=['null'], thousands=r',', parse_dates=["Date"], dayfirst=True)
df = df.dropna() #Drop the rows where at least one element is missing.
df.set_index("Date", inplace = True)
df.index = [pd.to_datetime(date).date() for date in df.index] #convert index to datetime.date, not datetime.datetime.
print("Opening df:\n", df)
print("\nLength of df: ", len(df.index))
#PLOT DATA
fig, ax = plt.subplots(2,1, figsize=(11,5))
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.25, hspace=0.8) #Adjust space between graphs
df[["Adj Close"]].plot(ax=ax[0], kind="line", style="-", color="blue", stacked=False, rot=90)
ax[0].set_axisbelow(True) # To put plot grid below plots
ax[0].yaxis.grid(color='gray', linestyle='dashed')
ax[0].xaxis.grid(color='gray', linestyle='dashed')
ax[0].xaxis.set_major_locator(YearLocator()) # one major tick per year
ax[0].xaxis.set_major_formatter(DateFormatter("%b %Y"))
ax[0].set(xlabel=None, ylabel="Price") # Set title and labels for axes
df[["Adj Close"]].plot(ax=ax[1], kind="line", style="-", color="blue", stacked=False, rot=90)
ax[1].set_axisbelow(True) # To put plot grid below plots
ax[1].yaxis.grid(color='gray', linestyle='dashed')
ax[1].xaxis.grid(color='gray', linestyle='dashed')
ax[1].xaxis.set_major_locator(YearLocator()) # one major tick per year
ax[1].xaxis.set_major_formatter(DateFormatter("%b %Y"))
ax[1].set(xlabel="Time", ylabel="Price") # Set title and labels for axes
fig.savefig("0P0000UL8U.L.png", bbox_inches='tight', dpi=300)
What am I doing wrong? Thanks in advance for any help.

To remove the x-Axis labels from the top graph, you can add the following line:
ax[0].tick_params(labelbottom=False)
before ax[0].set(xlabel=None, ylabel="Price")

It's probably not your fault. This looks like a date-handling incompatibility between pandas plotting and recent matplotlib releases rather than a mistake in your code; there are several related bug reports (#18010, #17983, #34850).
In the meantime you can downgrade matplotlib to v3.2.2, which works, and wait for the bug to be fixed.
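
A workaround that avoids downgrading, sketched under the assumption that the mismatch comes from mixing pandas' own date handling in DataFrame.plot with matplotlib's DateFormatter: plot with matplotlib directly, so that the axis uses matplotlib's date units throughout. The data below is synthetic (the asker's CSV is not available), and sharex=True is used so that only the bottom plot shows tick labels while both plots keep their grid.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.dates import DateFormatter, YearLocator

# Synthetic stand-in for the Yahoo Finance CSV: weekly prices over three years.
dates = pd.date_range("2018-01-01", "2020-12-31", freq="W")
df = pd.DataFrame({"Adj Close": np.linspace(100.0, 150.0, len(dates))}, index=dates)

# sharex=True links the x axes and hides the tick labels of the top plot.
fig, ax = plt.subplots(2, 1, figsize=(11, 5), sharex=True)
for a in ax:
    # ax.plot (not df.plot) keeps the x values in matplotlib's own date units.
    a.plot(df.index, df["Adj Close"], color="blue")
    a.set_axisbelow(True)
    a.grid(color="gray", linestyle="dashed")
    a.set_ylabel("Price")

ax[1].xaxis.set_major_locator(YearLocator())          # one major tick per year
ax[1].xaxis.set_major_formatter(DateFormatter("%Y"))  # label ticks with the year
ax[1].set_xlabel("Time")
fig.savefig("example.png", bbox_inches="tight", dpi=100)
```

pandas also documents an x_compat=True option for DataFrame.plot that is meant to switch off its own tick handling, which may be enough if you prefer to keep the df.plot calls.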

Integrating over range of dates, and labeling the xaxis

I am trying to integrate two curves as they change through time using pandas. I am loading data from a CSV file like this:
Where the Dates are the X-axis and both the Oil & Water points are the Y-axis. I have learned to use the cross-section option to isolate the "NAME" values, but am having trouble finding a good way to integrate with dates as the X-axis. I eventually would like to be able to take the integrals of both curves and stack them against each other. I am also having trouble with the plot defaulting the x-ticks to arbitrary values, instead of the dates.
I can change the labels/ticks manually, but have a large CSV to process and would like to automate the process. Any help would be greatly appreciated.
NAME,DATE,O,W
A,1/20/2000,12,50
B,1/20/2000,25,28
C,1/20/2000,14,15
A,1/21/2000,34,50
B,1/21/2000,8,3
C,1/21/2000,10,19
A,1/22/2000,47,35
B,1/22/2000,4,27
C,1/22/2000,46,1
A,1/23/2000,19,31
B,1/23/2000,18,10
C,1/23/2000,19,41
Contents of CSV in text form above.

Further to my comment above, here is some sample code (using logic from the example mentioned) to label your xaxis with formatted dates. Hope this helps.
Data Collection / Imports:
Just re-creating your dataset for the example.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
header = ['NAME', 'DATE', 'O', 'W']
data = [['A','1/20/2000',12,50],
['B','1/20/2000',25,28],
['C','1/20/2000',14,15],
['A','1/21/2000',34,50],
['B','1/21/2000',8,3],
['C','1/21/2000',10,19],
['A','1/22/2000',47,35],
['B','1/22/2000',4,27],
['C','1/22/2000',46,1],
['A','1/23/2000',19,31],
['B','1/23/2000',18,10],
['C','1/23/2000',19,41]]
df = pd.DataFrame(data, columns=header)
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y')
# Subset to just the 'A' labels.
df_a = df[df['NAME'] == 'A']
Plotting:
# Define the number of ticks you need.
nticks = 4
# Define the date format.
mask = '%m-%d-%Y'
# Create the set of custom date labels.
step = max(1, df_a.shape[0] // nticks)  # at least 1, so the slice below never gets a zero step
xdata = np.arange(df_a.shape[0])
xlabels = df_a['DATE'].dt.strftime(mask).tolist()[::step]
# Create the plot.
fig, ax = plt.subplots(1, 1)
ax.plot(xdata, df_a['O'], label='Oil')
ax.plot(xdata, df_a['W'], label='Water')
ax.set_xticks(np.arange(df_a.shape[0], step=step))
ax.set_xticklabels(xlabels, rotation=45, horizontalalignment='right')
ax.set_title('Test in Naming Labels for the X-Axis')
ax.legend()
Output:

I'd recommend converting the x-axis to integers or floats (seconds, minutes, hours, or days since a reference time, depending on the precision you need). You can then use the usual methods to integrate, and the x-axis no longer defaults to arbitrary values.
See "How to convert datetime to integer in python".
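
A minimal sketch of that suggestion, using the 'A' rows from the sample CSV in the question: the dates are converted to days elapsed since the first date, and each curve is integrated with the trapezoidal rule. The helper name trapezoid is hypothetical; it is written out by hand rather than relying on a specific NumPy function.

```python
import numpy as np
import pandas as pd

# The 'A' rows from the sample data in the question.
df_a = pd.DataFrame({
    "DATE": ["1/20/2000", "1/21/2000", "1/22/2000", "1/23/2000"],
    "O": [12, 34, 47, 19],
    "W": [50, 50, 35, 31],
})
df_a["DATE"] = pd.to_datetime(df_a["DATE"], format="%m/%d/%Y")

# Days elapsed since the first date, as floats (0.0, 1.0, 2.0, ...).
elapsed = df_a["DATE"] - df_a["DATE"].iloc[0]
days = elapsed.dt.total_seconds().to_numpy() / 86400.0


def trapezoid(y, x):
    """Trapezoidal rule: sum of the average height times the width of each interval."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))


oil_total = trapezoid(df_a["O"], days)
water_total = trapezoid(df_a["W"], days)
print(oil_total, water_total)
```

The results are in "value times days"; dividing elapsed seconds by a different constant changes the time unit. The same steps can be repeated per NAME, for example inside a groupby loop.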

Plot Data from CSV and group values in column

I am pretty new to Python and am trying to understand how to do the following:
I am trying to plot data from a CSV file that has values for A, values for B and values for C. How can I group the data by Valuegroup and plot the Value column for each group?
import pandas as pd
import matplotlib.pyplot as plt
csv_loader = pd.read_csv('C:/Test.csv', encoding='cp1252', sep=';', index_col=0).dropna()
#csv_loader.plot()
print(csv_loader)
fig, ax = plt.subplots()
csv_loader.groupby('Valuegroup').plot(x='Date', y='Value', ax=ax, legend=False, kind='line')
The data looks like the following:
Calcgroup;Valuegroup;id;Date;Value
Group1;A;1;20080103;0.1
Group1;A;1;20080104;0.3
Group1;A;1;20080107;0.5
Group1;A;1;20080108;0.9
Group1;B;1;20080103;0.5
Group1;B;1;20080104;1.3
Group1;B;1;20080107;2.0
Group1;B;1;20080108;0.15
Group1;C;1;20080103;1.9
Group1;C;1;20080104;2.1
Group1;C;1;20080107;2.9
Group1;C;1;20080108;0.45

If you want to take the mean of Value for each Valuegroup and show it as a line chart, use
csv_loader.groupby('Valuegroup')['Value'].mean().plot()
There are various chart types available; please refer to the pandas documentation on plot.
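
If the goal is instead one line per Valuegroup over time, one possible approach (a sketch, not the only way) is to pivot the data so that each group becomes a column. The example embeds the sample data from the question in a string so that it runs without the CSV file.

```python
import io

import pandas as pd

# Sample data copied from the question.
raw = """Calcgroup;Valuegroup;id;Date;Value
Group1;A;1;20080103;0.1
Group1;A;1;20080104;0.3
Group1;A;1;20080107;0.5
Group1;A;1;20080108;0.9
Group1;B;1;20080103;0.5
Group1;B;1;20080104;1.3
Group1;B;1;20080107;2.0
Group1;B;1;20080108;0.15
Group1;C;1;20080103;1.9
Group1;C;1;20080104;2.1
Group1;C;1;20080107;2.9
Group1;C;1;20080108;0.45
"""

# Read Date as text first, then parse it with an explicit format.
df = pd.read_csv(io.StringIO(raw), sep=";", dtype={"Date": str})
df["Date"] = pd.to_datetime(df["Date"], format="%Y%m%d")

# One row per date, one column per Valuegroup.
wide = df.pivot(index="Date", columns="Valuegroup", values="Value")
print(wide)
```

Calling wide.plot() then draws one line per column (A, B and C) with the dates on the x-axis.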

Not getting the proper graph comparison using Python

I am trying to compare two CSV files and find their points of intersection. I am plotting them as graphs for better understanding.
But one graph comes out very flattened compared to the other.
See the following:
Here is the data: trade-volume.csv
Here is the real graph:
Here is the data: miners-revenue.csv
Here is the real graph:
Here is the program I wrote for comparison:
import pandas as pd
import matplotlib.pyplot as plt
dat2 = pd.read_csv("trade-volume.csv", parse_dates=['time'])
dat3 = pd.read_csv("miners-revenue.csv", parse_dates=['time'])
dat2['timeDiff'] = (dat2['time'] - dat2['time'][0]).astype('timedelta64[D]')
dat3['timeDiff'] = (dat3['time'] - dat3['time'][0]).astype('timedelta64[D]')
fig, ax = plt.subplots()
ax.plot(dat2['timeDiff'], dat2['Value'])
ax.plot(dat3['timeDiff'], dat3['Value'])
plt.show()
I got the output like the following:
As one can see, the orange graph is very low and I cannot make out its points. I would like to overlay the graphs and then compare them.
Please help me make this possible with my existing code, ideally with as little alteration as possible.

The problem comes down to your y axis. One has a maximum of 60,000,000 while the other has a maximum of 6,000,000,000. Trying to plot these on the same graph is going to lead to one "looking" like a straight line even though it isn't if you zoom in.
A possible solution is to use a second y axis (you can change the color of the lines using the color= argument in ax.plot()):
import pandas as pd
import matplotlib.pyplot as plt
dat2 = pd.read_csv("trade-volume.csv", parse_dates=['time'])
dat3 = pd.read_csv("miners-revenue.csv", parse_dates=['time'])
dat2['timeDiff'] = (dat2['time'] - dat2['time'].iloc[0]).dt.days  # whole days since the first row
dat3['timeDiff'] = (dat3['time'] - dat3['time'].iloc[0]).dt.days  # whole days since the first row
fig, ax = plt.subplots()
ax.plot(dat2['timeDiff'], dat2['Value'], color="blue")
ax2=ax.twinx()
ax2.plot(dat3['timeDiff'], dat3['Value'], color="red")
plt.show()

The two datasets live on very different scales. You can normalize both in order to compare them.
import pandas as pd
import matplotlib.pyplot as plt
dat2 = pd.read_csv("trade-volume.csv", parse_dates=['time'])
dat3 = pd.read_csv("miners-revenue.csv", parse_dates=['time'])
dat2['timeDiff'] = (dat2['time'] - dat2['time'].iloc[0]).dt.days  # whole days since the first row
dat3['timeDiff'] = (dat3['time'] - dat3['time'].iloc[0]).dt.days  # whole days since the first row
fig, ax = plt.subplots()
ax.plot(dat2['timeDiff'], dat2['Value']/dat2['Value'].values.max())
ax.plot(dat3['timeDiff'], dat3['Value']/dat3['Value'].values.max())
plt.show()
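
Since the question also asks about points of intersection, here is a rough sketch of how crossings of the two normalized curves could be located. The numbers are invented stand-ins for the two CSV files, and the result depends on the normalization chosen, so a crossing here only says that the relative shapes cross, not the raw values.

```python
import numpy as np
import pandas as pd

# Invented stand-ins for the 'Value' columns of the two CSV files.
volume = pd.Series([1e7, 2e7, 4e7, 5e7, 3e7, 1e7])   # e.g. trade volume
revenue = pd.Series([6e9, 5e9, 3e9, 2e9, 4e9, 6e9])  # e.g. miners revenue

# Scale each series by its own maximum so both lie in (0, 1].
volume_n = volume / volume.max()
revenue_n = revenue / revenue.max()

# A crossing lies between positions i and i+1 when the difference changes sign.
diff = (volume_n - revenue_n).to_numpy()
crossings = np.where(np.sign(diff[:-1]) * np.sign(diff[1:]) < 0)[0]
print(crossings)
```

Each entry of crossings is the position of the sample just before a crossing; the exact point could then be estimated by linear interpolation between that sample and the next one.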

Multiple series in a trace for plotly

I dynamically generate a pandas dataframe where columns are months, index is day-of-month, and values are cumulative revenue. This is fairly easy, b/c it just pivots a dataframe that is month/dom/rev.
But now I want to plot it in plotly. Since every month the columns will expand, I don't want to manually add a trace per month. But I can't seem to have a single trace incorporate multiple columns. I could've sworn this was possible.
revs = Scatter(
    x=df.index,
    y=[df['2016-Aug'], df['2016-Sep']],
    name=['rev', 'revvv'],
    mode='lines'
)
data=[revs]
fig = dict( data=data)
iplot(fig)
This generates an empty graph, no errors. Ideally I'd just pass df[df.columns] to y. Is this possible?

You were probably thinking of cufflinks. It lets you plot a whole dataframe with Plotly using the iplot function, without replicating the data.
An alternative would be to use pandas' plot to get a matplotlib object, which is then converted via plotly.tools.mpl_to_plotly and plotted. The whole procedure can be shortened to one line:
plotly.plotly.plot_mpl(df.plot().figure)
The output is virtually identical, just the legend needs tweaking.
import plotly
import pandas as pd
import random
import cufflinks as cf
data = {}  # a plain dict keeps insertion order in Python 3.7+
for month in ['2016-Aug', '2016-Sep']:
    data[month] = [random.randrange(i * 10, i * 100) for i in range(1, 30)]
#using cufflinks
df = pd.DataFrame(data, index=[i for i in range(1, 30)])
fig = df.iplot(asFigure=True, kind='scatter', filename='df.html')
plot_url = plotly.offline.plot(fig)
print(plot_url)
#using mpl_to_plotly
plot_url = plotly.offline.plot(plotly.tools.mpl_to_plotly(df.plot().figure))
print(plot_url)
