Preparing Data-frame for Bokeh Consumption - python

I'm trying to plot with Bokeh using a data-frame as the source, but the plot comes out empty. I'm a beginner, so I'm probably missing something fundamental.
The plot works if I hard-code some basic X and Y values, so I know the issue is with the data-frame I'm using as the source.
...
df = pd.DataFrame(j)
df.columns = ['Team','Type','Date','SLA_MET']
df['SLA_MET']= df['SLA_MET'].round(2)
pd.set_option('display.max_columns', 10)
print(df)
source = ColumnDataSource(df)
p = figure(background_fill_color='gray',
           background_fill_alpha=0.5,
           border_fill_color='blue',
           border_fill_alpha=0.25,
           plot_height=600,
           plot_width=1000,
           x_axis_label='Month',
           x_axis_location='below',
           y_axis_label='% SLA Met',
           y_axis_location='left',
           title='Percentage of SLA Met',
           title_location='above',
           toolbar_location='below',
           tools='save')
p.line(source=source,x='Date',y='SLA_MET')
show(p)
I decided to pass clean lists to the plot instead:
for index, row in df.iterrows():
    if row[2] == 'Service Request':
        sr_list.append(row[3])
    else:
        inc_list.append(row[3])
        date_list.append(row[1])  # Only need 1 list of dates
The problem now is that the dates are shown in scientific notation and are not in order.

Bokeh does not know what to do with the strings in your Date column. You have two options:
1. Convert this column to real Python/NumPy/pandas (numeric) datetime values, and also set x_axis_type="datetime" in your figure call, or
2. Use the string values as categorical factors.
It's not clear what your intention is, so I can't recommend one over the other.
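As a rough illustration of the first option, the sketch below converts string dates to real datetime values with pandas and sorts them. The column names follow the question; the sample values and the 'YYYY-MM' date format are assumptions made up for the example, since the real data is not shown.

```python
import pandas as pd

# Made-up sample rows: the column names follow the question, but the
# values and the 'YYYY-MM' date format are assumptions.
df = pd.DataFrame({
    "Team": ["A", "A", "A", "A"],
    "Type": ["Incident", "Service Request", "Incident", "Service Request"],
    "Date": ["2019-02", "2019-02", "2019-01", "2019-01"],
    "SLA_MET": [0.91, 0.88, 0.95, 0.97],
})

# Turn the strings into real datetime values and sort by them, so that
# a line glyph connects the points in chronological order.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m")
df = df.sort_values("Date").reset_index(drop=True)

# One column per request type, i.e. one line per type when plotted.
wide = df.pivot(index="Date", columns="Type", values="SLA_MET")
print(wide)
```

With datetime values in the source, passing x_axis_type="datetime" to figure() should then make Bokeh label the axis with dates instead of raw numbers.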

Related

Time series plot showing unique occurrences per day

I have a dataframe, where I would like to make a time series plot with three different lines that each show the daily occurrences (the number of rows per day) for each of the values in another column.
To give an example, for the following dataframe, I would like to see the development for how many a's, b's and c's there have been each day.
df = pd.DataFrame({'date':pd.to_datetime(['2019-10-10','2019-10-14','2019-10-09','2019-10-10','2019-10-08','2019-10-14','2019-10-10','2019-10-08','2019-10-08','2019-10-13','2019-10-08','2019-10-12','2019-10-11','2019-10-09','2019-10-08']),
'letter':['a','b','c','a','b','b','b','b','c','b','b','a','b','a','c']})
When I try the command below (my best guess so far), however, it does not filter for the different dates (I would like three lines, one for each of the letters).
Any ideas on how to solve this?
df.groupby(['date']).count().plot()['letter']
I have also tried a solution in Matplotlib, though this one gives an error:
fig, ax = plt.subplots()
ax.plot(df['date'], df['letter'].count())
Based on your question, I believe you are looking for a line plot with the dates on the x-axis and the counts of letters on the y-axis. To achieve this, these are the steps:
1. Group the dataframe by date and then letter, and get the number of rows in each group using size().
2. Flatten the grouped dataframe using reset_index(), rename the new column to Counts and sort by the letter column (so that the legend is in alphabetical order). These steps are mostly about keeping the new dataframe and the graph clean and presentable. I would suggest you do each step separately and print the result, so that you know what is happening in each step.
3. Plot each line separately by filtering the dataframe for each specific letter.
4. Show the legend and rotate the dates so that they are easier to read.
The code is shown below:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'date':pd.to_datetime(['2019-10-10','2019-10-14','2019-10-09','2019-10-10','2019-10-08','2019-10-14','2019-10-10','2019-10-08','2019-10-08','2019-10-13','2019-10-08','2019-10-12','2019-10-11','2019-10-09','2019-10-08']),
                   'letter':['a','b','c','a','b','b','b','b','c','b','b','a','b','a','c']})
df_grouped = df.groupby(by=['date', 'letter']).size().reset_index()  ## New DF for grouped data
df_grouped.rename(columns={0: 'Counts'}, inplace=True)
## Sort by letter, then by date so each line is drawn in chronological order
df_grouped.sort_values(['letter', 'date'], inplace=True)
colors = ['r', 'g', 'b']  ## One color per letter, change as per your preference
for i, ltr in enumerate(df_grouped.letter.unique()):
    plt.plot(df_grouped[df_grouped.letter == ltr].date, df_grouped[df_grouped.letter == ltr].Counts, '-o', label=ltr, c=colors[i])
plt.gcf().autofmt_xdate()  ## Rotate x-axis labels so you can see dates clearly without overlap
plt.legend()  ## Show legend
Output graph
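A more compact variant of the same idea, offered only as a sketch: pandas.crosstab counts the rows for every date/letter combination and returns a table with one row per date and one column per letter. It uses the sample dataframe from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2019-10-10', '2019-10-14', '2019-10-09', '2019-10-10',
                            '2019-10-08', '2019-10-14', '2019-10-10', '2019-10-08',
                            '2019-10-08', '2019-10-13', '2019-10-08', '2019-10-12',
                            '2019-10-11', '2019-10-09', '2019-10-08']),
    'letter': ['a', 'b', 'c', 'a', 'b', 'b', 'b', 'b', 'c', 'b', 'b', 'a', 'b', 'a', 'c'],
})

# Rows are dates, columns are letters, cells are the number of rows
# for that combination (0 where a letter does not occur on a date).
counts = pd.crosstab(df['date'], df['letter'])
print(counts)
```

Calling counts.plot(marker='o') on the result draws one line per column, so each letter gets its own line and legend entry. Dates with no rows at all do not appear in the table.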

Why is the code not plotting the expected output?

country = str(input())
import matplotlib.pyplot as plt
lines = f.readlines ()
x = []
y = []
results = []
for line in lines:
    words = line.split(',')
f.close()
plt.plot(x,y)
plt.show()
First problem is in the title of the plot. It is giving Population inCountryI instead of Population in Country I.
Second problem is in the graph.
While my answer could point out the mistakes in your code, I think it might also be enlightening to show another, perhaps more standard, way of doing this. This is particularly useful if you're going to do this more often, or with large datasets.
Handling CSV files and creating subgroups out of them by yourself is nice, but can become very tricky. Python already has a built-in csv module, but the Pandas library is nowadays basically the default for handling tabular data (there are other options as well). This means it is widely available and easy to install. Plus, it goes well with Matplotlib. (Read some of the Pandas user guide for a good overview.)
With Pandas, you can use the following (I've put comments in between the actual code):
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['figure.figsize'] = (8, 8)
# Read the CSV file into a Pandas dataframe
# For a normal CSV, this will work fine without tweaks
df = pd.read_csv('population.csv')
# Convert the month and year columns to a datetime
# Years have to be converted to string type for that
# '%b%Y' is the format for month abbreviation (English) and 4-digit year;
# see e.g. https://strftime.org/
# Instead of creating a new column, we set the date as the index ("row-indices")
# of the dataframe
df.index = pd.to_datetime(df['Month'] + df['Year'].astype(str), format='%b%Y')
# We can remove the month and year columns now
df = df.drop(columns=['Month', 'Year'])
# For nicety, replace the dot in the country name with a space
df['Country'] = df['Country'].str.replace('.', ' ', regex=False)
# Group the dataframe by country, and loop over the groups
# The resulting grouped dataframes, `grouped`, will have just
# their index (date) values and population values
# The .plot() method will therefore automatically use
# the index/dates as x-axis, and the population as
# y-axis.
for country, grouped in df.groupby('Country'):
    # Use the convenience .plot() method
    grouped.plot()
    # Standard Matplotlib functions are still available
    plt.title(country)
The resulting plots are shown below (2, given the example data).
If you don't want a legend (since there is only one line), use grouped.plot(legend=None) instead.
If you want to pick one specific country, remove and replace the whole for-loop with the following
country = "Country II"
df[df['Country'] == country].plot()
If you want to do even more, also have a look at the Seaborn library.
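To try the date handling without the original file, here is a small self-contained sketch that reads a CSV from a string. The layout is an assumption based on the code above: the column name Population and all values are made up, and country names are assumed to use a dot where a space belongs.

```python
import io
import pandas as pd

# Hypothetical CSV content: the 'Population' column name and all values
# are made up for illustration.
csv_text = """Country,Month,Year,Population
Country.I,Jan,2020,100
Country.I,Feb,2020,110
Country.II,Jan,2020,200
"""

df = pd.read_csv(io.StringIO(csv_text))

# Build a datetime index from the month abbreviation and the year
df.index = pd.to_datetime(df['Month'] + df['Year'].astype(str), format='%b%Y')
df = df.drop(columns=['Month', 'Year'])

# Replace the dot in the country name with a space
df['Country'] = df['Country'].str.replace('.', ' ', regex=False)

# One title per country, with the space the original title was missing
titles = [f"Population in {country}" for country, _ in df.groupby('Country')]
print(titles)
```

Building the title with an f-string keeps the space between "in" and the country name, which addresses the first problem in the question.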
Resulting plots:

Reformatting y axis values in a multi-line plot in Python

Updated with more info
I've seen this answered on here for single line plots, but I need help with a plot showing two variables, if that matters at all... I am fairly new to python in general. My line graph shows two different departments' funding over the years. I just want to reformat the y axis to display as a number in the hundreds of millions.
Using a csv for the general public funding report of Minneapolis.
msp_df = pd.read_csv('Minneapolis_Data_Snapshot_v2.csv',error_bad_lines=False)
msp_df.info()
Saved just the two depts I was interested in, to a dataframe.
CPED_df = (msp_df['Unnamed: 0'] == 'CPED')
msp_df.iloc[CPED_df.values]
police_df = (msp_df['Unnamed: 0'] == 'Police')
msp_df.iloc[police_df.values]
("test" is the new name of my data frame containing all the info as seen below.)
test = pd.DataFrame({'Year': range(2014,2021),
'CPED': msp_df.iloc[CPED_df.values].T.reset_index(drop=True).drop(0,0)[5].tolist(),
'Police': msp_df.iloc[police_df.values].T.reset_index(drop=True).drop(0,0)[4].tolist()})
(The numbers from the original dataset were being read as strings because of the commas, so I had to fix that first.)
test['Police2'] = test['Police'].str.replace(',','').astype(int)
test['CPED2'] = test['CPED'].str.replace(',','').astype(int)
And here is my code for the plot. It executes; I just want to reformat the y axis number scale. Right now it just shows up as a decimal. (I've already imported pandas, seaborn and matplotlib.)
plt.plot(test.Year, test.Police2, test.Year, test.CPED2)
plt.ylabel('Budget in Hundreds of Millions')
plt.xlabel('Year')
Current plot
Any help super appreciated! Thanks :)
The easiest way to reformat the y axis and force it to show certain values is to use
plt.yticks(ticks, labels)
For example, if you want to display only values from 0 to 1, you can do:
plt.yticks([0,0.2,0.5,0.7,1], ['a', 'b', 'c', 'd', 'e'])
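If the tick positions should stay automatic and only the labels should change, a tick formatter is another option. The sketch below uses matplotlib.ticker.FuncFormatter to show raw values as multiples of 100 million. The budget figures are placeholders, not the real Minneapolis numbers, and hundreds_of_millions is a helper name introduced only for this example.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Placeholder figures only, not the real budget data.
years = [2014, 2015, 2016]
police = [150_000_000, 160_000_000, 175_000_000]
cped = [90_000_000, 95_000_000, 110_000_000]


def hundreds_of_millions(value, pos):
    """Format a raw tick value as a multiple of 100 million."""
    return f"{value / 1e8:.1f}"


fig, ax = plt.subplots()
ax.plot(years, police, label="Police")
ax.plot(years, cped, label="CPED")
# Matplotlib calls the formatter once per tick to build the tick label
ax.yaxis.set_major_formatter(FuncFormatter(hundreds_of_millions))
ax.set_xlabel("Year")
ax.set_ylabel("Budget in Hundreds of Millions")
ax.legend()
```

In an interactive session, plt.show() displays the figure. Unlike plt.yticks, this approach does not require listing the tick positions by hand.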

Legend on pandas plot of time series shows only "None"

data is a pandas dataframe with a date-time-index on entries with multiple attributes. One of these attributes is called STATUS. I tried to create a plot of the number of entries per day, broken down by the STATUS attribute.
My first attempt using pandas.plot:
for status in data["STATUS"].unique():
    entries = data[data["STATUS"] == status]
    entries.groupby(pandas.TimeGrouper("D")).size().plot(figsize=(16,4), legend=True)
The result:
How should I modify the code above so that the legend shows which status the curve belongs to?
Also, feel free to suggest a different approach to realizing such a visualization (group time series by time interval, count entries, and break down by attributes of the entries).
I believe that with the change below to your code you will get what you want:
import pandas
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for status in data["STATUS"].unique():
    entries = data[data["STATUS"] == status]
    # pandas.Grouper(freq="D") replaces pandas.TimeGrouper("D"), which was removed in pandas 1.0
    dfPlot = pandas.DataFrame(entries.groupby(pandas.Grouper(freq="D")).size())
    dfPlot.columns = [status]
    dfPlot.plot(ax=ax, figsize=(16,4), legend=True)
What happens is that size() returns a Series with no name, so the legend has no label to show. Creating a DataFrame from the Series and setting its column name to the status does the trick.
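As for a different approach, one possibility is to group by day and by STATUS in a single step and then spread the statuses into columns, so that each column carries its own legend label. This is only a sketch: the sample timestamps and status values are made up, and only the STATUS column name comes from the question.

```python
import pandas as pd

# Made-up sample data with a datetime index, standing in for `data`.
data = pd.DataFrame(
    {"STATUS": ["open", "open", "closed", "closed", "open"]},
    index=pd.to_datetime([
        "2021-01-01 08:00", "2021-01-01 12:00", "2021-01-01 15:00",
        "2021-01-02 09:00", "2021-01-02 10:00",
    ]),
)

# Count entries per calendar day and status, then turn the statuses
# into columns; combinations that never occur are filled with 0.
counts = (
    data.groupby([pd.Grouper(freq="D"), "STATUS"])
        .size()
        .unstack(fill_value=0)
)
print(counts)
```

Calling counts.plot(figsize=(16, 4)) on the result draws one line per status, labelled with the column name.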

Plotting Pandas DataFrames as single days on the x-axis in Python/Matplotlib

I've got data like this:
col1 ;col2
2001-01-01;1
2001-01-01;2
2001-01-02;3
2001-01-03;4
2001-01-03;2
2001-01-04;2
I'm reading it in Python/Pandas using pd.read_csv(...) into a DataFrame.
Now I want to plot col2 on the y-axis and col1 on the x-axis, day-wise. I searched a lot but couldn't find many useful pages describing this in detail. I found that matplotlib currently does NOT support the data format in which the dates are stored (datetime64).
I tried converting it like this:
fig, ax = plt.subplots()
X = np.asarray(df['col1']).astype(DT.datetime)
xfmt = mdates.DateFormatter('%b %d')
ax.xaxis.set_major_formatter(xfmt)
ax.plot(X, df['col2'])
plt.show()
but this does NOT work.
What is the best way?
I can only find bits here and there, but nothing complete that really works and, more importantly, no up-to-date resources on this functionality for the latest versions of pandas/numpy/matplotlib.
I'd also be interested in converting these absolute dates to consecutive day indices, i.e.:
The starting day 2001-01-01 is Day 1, thus the data would look like this:
col1 ;col2 ; col3
2001-01-01;1;1
2001-01-01;2;1
2001-01-02;3;2
2001-01-03;4;3
2001-01-03;2;3
2001-01-04;2;4
.....
2001-02-01;2;32
Thank you very much in advance.
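pandas.read_csv supports a parse_dates argument (the default is False). parse_dates=True tries to parse the index as dates, and a list such as parse_dates=['col1'] parses the named columns, which saves you converting the dates separately.
Also, for a simple dataframe like this, the pandas plot() function works perfectly well.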
Example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dates = pd.date_range('20160601', periods=4)
dt = pd.DataFrame(np.random.randn(4,1), index=dates, columns=['col1'])
dt.plot()
plt.show()
OK, as far as I can see there is no longer any need to use matplotlib directly; pandas itself already offers plotting functions that can be used as methods of the dataframe objects, see http://pandas.pydata.org/pandas-docs/stable/visualization.html. These functions use matplotlib themselves, but are easier to use because they handle the data types correctly :-)
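Applied to the sample data from the question, a minimal sketch could look like this. It reads the semicolon-separated text from a string rather than a file, assumes a header without the stray space after col1, and also derives the consecutive day index (col3) asked about in the question.

```python
import io
import pandas as pd

# Sample data from the question, with the header cleaned to 'col1;col2'.
csv_text = """col1;col2
2001-01-01;1
2001-01-01;2
2001-01-02;3
2001-01-03;4
2001-01-03;2
2001-01-04;2
"""

# sep=';' matches the file layout; parse_dates turns col1 into datetime64
df = pd.read_csv(io.StringIO(csv_text), sep=";", parse_dates=["col1"])

# Consecutive day index: the earliest date is Day 1
df["col3"] = (df["col1"] - df["col1"].min()).dt.days + 1
print(df)
```

From here, df.plot(x="col1", y="col2") puts the dates on the x-axis, and df.plot(x="col3", y="col2") uses the day index instead.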
