Stacked area chart with datetime axis - python

I am attempting to create a Bokeh stacked area chart from the following Pandas DataFrame.
An example of the DataFrame (df) is as follows:
date tom jerry bill
2014-12-07 25 12 25
2014-12-14 15 16 30
2014-12-21 10 23 32
2014-12-28 12 13 55
2015-01-04 5 15 20
2015-01-11 0 15 18
2015-01-18 8 9 17
2015-01-25 11 5 16
The above DataFrame is a snippet of the full df, which spans a number of years and contains additional names beyond the ones shown.
I am attempting to use the datetime column date as the x-axis, with the count information for each name as the y-axis.
Any assistance that anyone could provide would be greatly appreciated.

You can create a stacked area chart by using the patch glyph. I first used df.cumsum to stack the values in the dataframe by row. After that I appended two rows to the dataframe, holding the max and min date with a Y value of 0, so that each patch is closed along the x-axis. I plot the patches in reverse order of the column list (excluding the date column), so the person with the highest cumulative values is plotted first and the persons with lower values are drawn on top afterwards.
Another implementation of a stacked area chart can be found here.
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.palettes import inferno
from bokeh.models.formatters import DatetimeTickFormatter

df = pd.read_csv('stackData.csv')
names = list(df)[1:]  # every column except the date column
# Stack the values row-wise: each column becomes a running total.
df_stack = df[names].cumsum(axis=1)
df_stack['date'] = df['date'].astype('datetime64[ns]')

# Close each patch with two rows at y = 0 (max date first, then min date).
# DataFrame.append was removed in pandas 2.0, so pd.concat is used here.
for edge_date in (df_stack['date'].max(), df_stack['date'].min()):
    bot = {name: 0 for name in names}
    bot['date'] = edge_date
    df_stack = pd.concat([df_stack, pd.DataFrame([bot])], ignore_index=True)

p = figure(x_axis_type='datetime')
# Bokeh 3.x takes a plain string here; older releases expect a list: days=["%d/%m/%Y"]
p.xaxis.formatter = DatetimeTickFormatter(days="%d/%m/%Y")
p.xaxis.major_label_orientation = 45
# Draw the tallest stack first so the lower ones are painted on top of it.
for person, color in zip(reversed(names), inferno(len(names))):
    # legend_label replaces the legend keyword of older Bokeh releases
    p.patch(x=df_stack['date'], y=df_stack[person], color=color, legend_label=person)
p.legend.click_policy = "hide"
show(p)
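To make the stacking step easier to follow, here is a minimal sketch of what cumsum(axis=1) does to the first three rows of the sample table from the question. The variable names (names, stacked) are only illustrative, and no plotting is involved.

```python
import pandas as pd

# First three rows of the sample table from the question.
df = pd.DataFrame({
    "date": pd.to_datetime(["2014-12-07", "2014-12-14", "2014-12-21"]),
    "tom": [25, 15, 10],
    "jerry": [12, 16, 23],
    "bill": [25, 30, 32],
})

# Every column except the date column holds a count to be stacked.
names = [column for column in df.columns if column != "date"]

# Running total across each row: tom, tom + jerry, tom + jerry + bill.
stacked = df[names].cumsum(axis=1)
print(stacked)
```

Each column of stacked is the upper edge of that person's area, which is why the last column (the largest totals) has to be drawn first when the patches overlap.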

Related

Stacked bar plot of large data in python

I would like to plot a stacked bar plot from a csv file in python. I have three columns of data
year word frequency
2018 xyz 12
2017 gfh 14
2018 sdd 10
2015 fdh 1
2014 sss 3
2014 gfh 12
2013 gfh 2
2012 gfh 4
2011 wer 5
2010 krj 4
2009 krj 4
2019 bfg 4
... 300+ rows of data.
I need to go through all the data and plot a stacked bar plot categorized by year: the x-axis is the word, the y-axis is the frequency, and the legend color shows the year. I want to see how each word evolved year by year. Some of the technology words are used repeatedly in every year, so the stacked bar graph should add the values on top of each other. For example, the word gfh first plots 14 for year 2017, and then for year 2014 I want gfh to plot (in a different color) a value of 12 on top of the 2017 segment. How do I do this?
So far I have loaded the csv file in my code, but I don't understand how to go over all the rows and stack the words appropriately (as some words repeat through all the years). The years are in random order in the csv, but I sorted them by year to make it easier.
I am just learning python and trying to understand this plotting routine, since I have 40 years of data and ~20 words, so I thought a stacked bar plot would be the best way to represent them. Any other visualisation method is also welcome. Any help is highly appreciated.
This can be done using pandas:
import pandas as pd
df = pd.read_csv("file.csv")
# Aggregate data
df = df.groupby(["word", "year"], as_index=False).agg({"frequency": "sum"})
# Create list to sort by
sorter = (
    df.groupby(["word"], as_index=False)
    .agg({"frequency": "sum"})
    .sort_values("frequency")["word"]
    .values
)
# Pivot, reindex, and plot
df = df.pivot(index="word", columns="year", values="frequency")
df = df.reindex(sorter)
df.plot.bar(stacked=True)
Which outputs:
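To see what the pivot step produces before it is plotted, here is a small sketch on five rows taken from the question. It uses pivot_table with aggfunc="sum" as a shortcut for the groupby plus pivot above (a substitution on my part, not the answer's exact code), and the names wide and order are illustrative.

```python
import pandas as pd

# Five rows from the sample data in the question.
df = pd.DataFrame({
    "year": [2018, 2017, 2018, 2014, 2013],
    "word": ["xyz", "gfh", "sdd", "gfh", "gfh"],
    "frequency": [12, 14, 10, 12, 2],
})

# One row per word, one column per year; missing combinations become 0.
wide = df.pivot_table(
    index="word", columns="year", values="frequency",
    aggfunc="sum", fill_value=0,
)

# Order the words by their total frequency, smallest first.
order = wide.sum(axis=1).sort_values().index
wide = wide.reindex(order)
print(wide)

# wide.plot.bar(stacked=True) would then draw one bar per word,
# with one coloured segment per year (requires matplotlib).
```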

Count frequency and plot

I need to plot the frequency of items by date. My csv contains three columns: Date, Name & Surname, and Birthday.
I am interested in plotting the number of people recorded on each date. My expected output would be:
Date Count
0 01/01/2018 9
1 01/02/2018 12
2 01/03/2018 6
3 01/04/2018 4
4 01/05/2018 5
.. ... ...
.. 02/27/2020 122
.. 02/28/2020 84
The table above was found as follows:
by_date = df.groupby(df['Date']).size().reset_index(name='Count')
Date is a column in my csv file, but Count is not, which is why I am having difficulty drawing a line plot.
How can I plot the frequency as a list of numbers/column?
Although not absolutely required, you should convert the Date column into Timestamp for easier analysis in later steps:
df['Date'] = pd.to_datetime(df['Date'])
Now, to your question. To count how many births there are per day, you can use value_counts:
births = df['Date'].value_counts()
But you don't even have to do that for plotting a histogram! Use hist:
import matplotlib.dates as mdates
year = mdates.YearLocator()
month = mdates.MonthLocator()
formatter = mdates.ConciseDateFormatter(year)
ax = df['Date'].hist()
ax.set_title('# of births')
ax.xaxis.set_major_locator(year)
ax.xaxis.set_minor_locator(month)
ax.xaxis.set_major_formatter(formatter)
Result (from random data):
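If a line plot of the daily counts is wanted rather than a histogram, one possible sketch is to sort the value_counts result by date before plotting. The five sample dates below are made up for illustration, and the MM/DD/YYYY format is an assumption based on the expected output shown in the question.

```python
import pandas as pd

# Hypothetical records: one row per person, with the date they were recorded.
df = pd.DataFrame({
    "Date": ["01/02/2018", "01/01/2018", "01/02/2018", "01/01/2018", "01/02/2018"],
})

# Assumed month-first format, matching the table in the question.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# value_counts orders by frequency, so sort by the date index for a time series.
counts = df["Date"].value_counts().sort_index()
print(counts)

# counts.plot() would draw the counts as a line over time (requires matplotlib).
```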

Make plot between multiple files for a particular Id in a specific period

I am trying to make a plot to compare 4 different files. Each file has ID, Date and Value. While the IDs and dates remain the same, the value differs in each of the files. Now I want to plot the Value field for one ID, let's say "A", for some 7-day period in January. The result would be an overlaid plot of four different values from the four different files. How can I go about this in python? I want to keep it as automated as possible, without several manual steps. Appreciate all your help!
Example data set below
Sample data set 1
ID Date Value
A 01-01-18 12
A 01-02-18 15
A 01-03-18 18
A 02-01-18 12
B 01-01-18 11
B 01-02-18 19
C 01-01-18 15
Sample data set 2
ID Date Value
A 01-01-18 13
A 01-02-18 16
A 01-03-18 12
A 02-01-18 13
B 01-01-18 16
B 01-02-18 15
C 01-01-18 13
Sample data set 3
ID Date Value
A 01-01-18 12
A 01-02-18 12
A 01-03-18 13
A 02-01-18 14
B 01-01-18 15
B 01-02-18 12
C 01-01-18 13
Sample data set 4
ID Date Value
A 01-01-18 12
A 01-02-18 15
A 01-03-18 14
A 02-01-18 12
B 01-01-18 11
B 01-02-18 14
C 01-01-18 13
From this sample data, let's say I am trying to plot the values for ID "A" between the dates 01-01-18 and 01-03-18. I will then have a plot of 4 different lines, each representing the values from one of the data sets.
I have been able to do this in Excel, but it involved too many manual steps and the data is 800,000+ lines, so I don't feel very confident. I am sure there is a better way to do it in python.
Suppose your data is stored in separate text files. Then you may do what you want with the following code:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
filenames = ['sample_1.txt', 'sample_2.txt', 'sample_3.txt', 'sample_4.txt']
data = list()
for filename in filenames:
    data.append(pd.read_table(filename, delimiter=' ', parse_dates=[1]))
fig = plt.figure()
for idx in range(len(filenames)):
    condition_1 = data[idx].loc[:, 'ID'] == 'A'
    condition_2 = (
        (data[idx].loc[:, 'Date'] >= '2018-01-01') &
        (data[idx].loc[:, 'Date'] <= '2018-01-03'))
    plt.plot(
        data[idx].loc[condition_1 & condition_2, 'Date'],
        data[idx].loc[condition_1 & condition_2, 'Value'], 'o--')
plt.title('Some figure')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend(filenames)
# X-axis formatting
days = mdates.DayLocator()
days_fmt = mdates.DateFormatter('%Y-%m-%d')
fig.gca().xaxis.set_major_locator(days)
fig.gca().xaxis.set_major_formatter(days_fmt)
Result:
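The filtering part can also be sketched without any plotting, by collecting the selected values from every file into one table with a column per file. The sketch below reads two of the sample data sets from in-memory strings instead of real files, assumes the dates are month-first (MM-DD-YY), and uses hypothetical helper names (load, select).

```python
import io
import pandas as pd

# In-memory stand-ins for two of the sample files from the question.
sample_1 = "ID Date Value\nA 01-01-18 12\nA 01-02-18 15\nA 01-03-18 18\nA 02-01-18 12\nB 01-01-18 11\n"
sample_2 = "ID Date Value\nA 01-01-18 13\nA 01-02-18 16\nA 01-03-18 12\nA 02-01-18 13\nB 01-01-18 16\n"


def load(text):
    """Read one space-separated data set and parse its Date column."""
    frame = pd.read_csv(io.StringIO(text), sep=" ")
    # Assumption: dates are written month-first (MM-DD-YY).
    frame["Date"] = pd.to_datetime(frame["Date"], format="%m-%d-%y")
    return frame


def select(frame, wanted_id, start, end):
    """Return the Value series for one ID within an inclusive date range."""
    mask = (frame["ID"] == wanted_id) & frame["Date"].between(start, end)
    return frame.loc[mask].set_index("Date")["Value"]


start, end = pd.Timestamp("2018-01-01"), pd.Timestamp("2018-01-03")
sources = {"sample_1": sample_1, "sample_2": sample_2}

# One column per data set, aligned on the shared dates.
combined = pd.DataFrame({
    name: select(load(text), "A", start, end) for name, text in sources.items()
})
print(combined)

# combined.plot(style="o--") would draw one line per data set (requires matplotlib).
```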

Manipulate Python Data frame to plot line charts

I have the following data frame:
Parameters Year 2016 Year 2017 Year 2018....
0) X 10 12 13
1) Y 12 12 45
2) Z 89 23 97
3
.
.
.
I want to make a line chart with the column headers starting from Year 2016 to be on the x-axis and each line on the chart to represent each of the parameters - X, Y, Z
I am using the matplotlib library to make the plot but it is throwing errors.
Given this dataframe:
import pandas as pd
df = pd.DataFrame({'Parameters':['X','Y','Z'],
                   'Year 2016':[10,12,13],
                   'Year 2017':[12,12,45],
                   'Year 2018':[89,23,97]})
Input Dataframe:
Parameters Year 2016 Year 2017 Year 2018
0 X 10 12 89
1 Y 12 12 23
2 Z 13 45 97
You can use some dataframe shaping and pandas plot:
df_out = df.set_index('Parameters').T
# set_axis returns a new object; the inplace argument was removed in pandas 2.0
df_out.set_axis(pd.to_datetime(df_out.index, format='Year %Y'), axis=0)\
    .plot()
Output Graph:
If you have a pandas DataFrame, let's call it df, for which your columns are X, Y, Z, etc. and your rows are the years in order, you can simply call df.plot() to plot each column as a line with the y axis being the values and the row name giving the x-axis.
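As a rough sketch of getting the data into that shape, the table from the question can be transposed so that the years become the rows and X, Y, Z become the columns. The name by_year is illustrative, and turning the 'Year 2016' labels into integers is just one possible choice.

```python
import pandas as pd

# The table from the question: one row per parameter, one column per year.
df = pd.DataFrame({
    "Parameters": ["X", "Y", "Z"],
    "Year 2016": [10, 12, 89],
    "Year 2017": [12, 12, 23],
    "Year 2018": [13, 45, 97],
})

# Parameters become the columns, the year labels become the row index.
by_year = df.set_index("Parameters").T

# Strip the "Year " prefix so the x-axis holds plain integers.
by_year.index = by_year.index.str.replace("Year ", "", regex=False).astype(int)
print(by_year)

# by_year.plot() would draw one line per parameter (requires matplotlib).
```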

Make all columns (dates) index of data frame

My data is organized like this:
Where country code is the index of the data frame and the columns are the years for the data. First, is it possible to plot line graphs (using matplotlib.pyplot) over time for each country without transforming the data any further?
Second, if the above is not possible, how can I make the columns the index of the table so I can plot time series line graphs?
Trying df.t gives me this:
How can I make the dates the index now?
Transpose using df.T.
Plot as usual.
Sample:
import pandas as pd
df = pd.DataFrame({1990:[344,23,43], 1991:[234,64,23], 1992:[43,2,43]}, index = ['AFG', 'ALB', 'DZA'])
df = df.T
df
AFG ALB DZA
1990 344 23 43
1991 234 64 23
1992 43 2 43
# transform index to dates
import datetime as dt
df.index = [dt.date(year, 1, 1) for year in df.index]
import matplotlib.pyplot as plt
df.plot()
plt.savefig('test.png')
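As an alternative to building datetime.date objects by hand, the integer years could be converted to a pandas DatetimeIndex. This is only a sketch on the same sample frame, and the name by_year is illustrative.

```python
import pandas as pd

# Same sample frame as above: countries as rows, years as columns.
df = pd.DataFrame(
    {1990: [344, 23, 43], 1991: [234, 64, 23], 1992: [43, 2, 43]},
    index=["AFG", "ALB", "DZA"],
)

# Transpose so that the years become the index.
by_year = df.T

# Parse the year labels as dates (1 January of each year).
by_year.index = pd.to_datetime(by_year.index.astype(str), format="%Y")
print(by_year)

# by_year.plot() would then use a proper date axis (requires matplotlib).
```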
