I am using an SQLite database with Pandas and want to display the dynamic data using Bokeh (varea_stack).
My dynamic data (df) looks like this:
id date site numberOfSessions ... avgSessionDuration uniqueDimensionCombinations events pageViews
0 1 2020-07-29 177777770 3 ... 11.00 2 4 3
1 2 2020-07-29 178888883 1 ... 11.00 1 4 3
2 3 2020-07-29 177777770 1 ... 11.00 1 4 3
3 4 2020-07-29 173333333 2 ... 260.50 2 23 10
4 5 2020-07-29 178888883 2 ... 260.50 2 23 10
5 6 2020-07-29 173333333 2 ... 260.50 2 23 10
6 7 2020-07-29 178888883 12 ... 103.75 12 143 36
7 8 2020-07-30 178376403 12 ... 103.75 12 143 36
8 9 2020-07-30 178376403 12 ... 103.75 12 143 36
9 10 2020-07-28 178376403 12 ... 103.75 12 143 36
I would like to create a varea_stack plot where the:
x-axis -> "date"
y-axis -> "numberOfSessions" stacked according to "site"
(I am thinking maybe using some sort of Pivot Table?)
this is what I have:
from bokeh.plotting import figure, output_file, show
from bokeh.embed import components
from bokeh.models import HoverTool
plot = figure()
plot.varea_stack(df.site.unique().tolist(), x=df.index.values.tolist(), source=df)
script, div = components(plot)
the Error I get:
Keyword argument sequences for broadcasting must be the same length as stackers
I have searched the Bokeh documentation (https://docs.bokeh.org/en/latest/docs/reference/plotting.html#bokeh.plotting.figure.Figure.varea_stack) and Stack Overflow, but I can't seem to find an answer.
I can't really speak to the Pandas operations needed, but this is the general format the data needs to be in for varea_stack:
sites = [<list of sites>]
data = {
'date' : <all the datetime values>,
<site1> : <site1 values for every date>,
<site2> : <site2 values for every date>,
<site3> : <site3 values for every date>,
...
}
plot.varea_stack(sites, x='date', source=data)
Note that to be usable by varea_stack the following must be true:
every item in the sites list has to be a column in the data
every sites column has to be the same length (a value for every date)
Note that the above also assumes the dates are converted to real datetime values. If you are using your dates as categoricals (i.e. not using real datetimes and a continuous datetime axis), then you will need to pass the list of date strings to the x_range for figure as well (as with any categorical axis).
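For the Pandas side, one possible way to get the data into that shape is a pivot table, as the question suggests. This is a minimal sketch, not a definitive solution. The small frame below is made-up sample data that reuses the question's column names. Missing date/site combinations are assumed to mean zero sessions, and the site ids are converted to strings because they become column names.

```python
import pandas as pd

# Made-up sample rows using the column names from the question.
df = pd.DataFrame({
    "date": ["2020-07-28", "2020-07-29", "2020-07-29", "2020-07-29", "2020-07-30"],
    "site": [178376403, 177777770, 178888883, 177777770, 178376403],
    "numberOfSessions": [12, 3, 1, 1, 12],
})
df["date"] = pd.to_datetime(df["date"])  # real datetimes for a datetime axis
df["site"] = df["site"].astype(str)      # column names should be strings

# One row per date, one column per site; duplicates are summed and
# missing combinations are filled with 0 (an assumption about the data).
wide = df.pivot_table(
    index="date",
    columns="site",
    values="numberOfSessions",
    aggfunc="sum",
    fill_value=0,
).reset_index()
wide.columns.name = None

# The stackers are every column except the x column.
sites = [c for c in wide.columns if c != "date"]
print(wide)
```

With the data in this wide form, the call from the answer becomes plot.varea_stack(sites, x='date', source=wide), on a figure created with x_axis_type='datetime'.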
Related
I have a dataframe for air pollution with several missing gaps like this:
Date AMB_TEMP CO PM10 PM2.5
2010-01-01 0 8 10 ... 15
2010-01-01 1 10 15 ... 20
...
2010-01-02 0 5 ...
2010-01-02 1 ... 20
...
2010-01-03 1 4 13 ... 34
To specify, here's the data link: shorturl.at/blBN1
Take the first column, ambient temperature (AMB_TEMP), for instance.
Its missing-value summary is given below:
Length of time series: 87648
Number of Missing Values: 746
Percentage of Missing Values: 0.85 %
Number of Gaps: 136
Average Gap Size: 5.485294
Longest NA gap (series of consecutive NAs): 32
Most frequent gap size (series of consecutive NA series): 1 (occurring 50 times)
I want to plot the overview of missing value and what I've done is:
import missingno as msno
missing_plot = msno.matrix(df , freq='Y')
and got a figure like this:
Clearly, the first column (AMB_TEMP) does not match the real data: the plot shows only three horizontal lines, but there should be at least 136.
Update: Thanks to Patrick, I also tried plotting only one column, and nothing improved.
Is there an error in the code, or is something else going on?
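One way to tell whether the plot or the data is at fault is to count the gaps directly in pandas and compare them with the summary quoted above. The sketch below runs on a short made-up series, not the linked dataset. It only illustrates the run-length idea and does not explain what missingno draws.

```python
import numpy as np
import pandas as pd

# Made-up series with three gaps of sizes 1, 3 and 2.
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0, 8.0, np.nan, np.nan])

is_na = s.isna()
# A new run starts whenever the NA flag differs from the previous row.
run_id = is_na.ne(is_na.shift()).cumsum()
# Keep only the NA rows and count how many rows each run contains.
gap_sizes = is_na[is_na].groupby(run_id[is_na]).size()

n_missing = int(is_na.sum())
n_gaps = len(gap_sizes)
longest_gap = int(gap_sizes.max())
average_gap = float(gap_sizes.mean())
print(n_missing, n_gaps, longest_gap, average_gap)
```

Applied to a single column such as df['AMB_TEMP'], the same steps should reproduce the number of gaps and the longest gap listed in the question.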
I have a dataset with the following [structure][1] -
On a high level it is a time series data. I want to plot this time series data and have a unique color for each column. This will enable me to show the transitions better to the viewer. The column names/labels change from one data set to another. That means I need to create colors for the y value based on labels present in each dataset. I am trying to decide how to do this in a scalable manner.
Sample data ->
;(1275, 51) PCell Tput Avg (kbps) (Average);(1275, 95) PCell Tput Avg (kbps) (Average);(56640, 125) PCell Tput Avg (kbps) (Average);Time Stamp
0;;;79821.1;2021-04-29 23:01:53.624
1;;;79288.3;2021-04-29 23:01:53.652
2;;;77629.2;2021-04-29 23:01:53.682
3;;;78980.3;2021-04-29 23:01:53.695
4;;;77953.4;2021-04-29 23:01:53.723
5;;;;2021-04-29 23:01:53.748
6;;75558.7;;2021-04-29 23:01:53.751
7;;73955.5;;2021-04-29 23:01:53.780
8;;73689.8;;2021-04-29 23:01:53.808
9;;74819.8;;2021-04-29 23:01:53.839
10;10000;;;2021-04-29 23:01:53.848
11;68499;;;2021-04-29 23:01:53.867
[1]: https://i.stack.imgur.com/YM2P6.png
As long as each dataframe has the datetime column labeled as 'Time Stamp', you should just be able to do this for each one:
import matplotlib.pyplot as plt
# df = ...  (load your DataFrame here)
df.plot.line(x = 'Time Stamp',grid=True)
plt.show()
Example df:
>>> df
A B C D Time Stamp
0 6 9 7 8 2018-01-01 00:00:00.000000000
1 7 3 8 6 2018-05-16 18:04:05.267114496
2 4 1 4 0 2018-09-29 12:08:10.534228992
3 1 2 5 8 2019-02-12 06:12:15.801343744
4 6 7 9 3 2019-06-28 00:16:21.068458240
5 8 5 9 9 2019-11-10 18:20:26.335572736
6 3 0 8 2 2020-03-25 12:24:31.602687232
7 8 0 0 9 2020-08-08 06:28:36.869801984
8 0 9 7 8 2020-12-22 00:32:42.136916480
9 7 8 9 2 2021-05-06 18:36:47.404030976
Resulting plot:
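If the colors should be chosen explicitly per label instead of left to the default cycle, one option is to build a label-to-color mapping from a colormap. This is a rough sketch under a few assumptions: the column names A, B, C and the sample values are made up, the tab10 colormap is an arbitrary choice, and colors repeat once there are more labels than colormap entries.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: three value columns with gaps plus the 'Time Stamp' column.
df = pd.DataFrame({
    "A": [1.0, 2.0, None, None],
    "B": [None, None, 3.0, 4.0],
    "C": [5.0, None, None, 6.0],
    "Time Stamp": pd.to_datetime([
        "2021-04-29 23:01:53.624",
        "2021-04-29 23:01:53.652",
        "2021-04-29 23:01:53.682",
        "2021-04-29 23:01:53.695",
    ]),
})

# Every column except the time column is treated as a series to plot.
value_cols = [c for c in df.columns if c != "Time Stamp"]

# Assign one colormap entry per label, wrapping around if labels outnumber colors.
cmap = plt.get_cmap("tab10")
colors = {col: cmap(i % cmap.N) for i, col in enumerate(value_cols)}

ax = df.plot.line(
    x="Time Stamp",
    y=value_cols,
    color=[colors[c] for c in value_cols],
    grid=True,
)
```

Because the mapping is derived from whatever columns are present, the same code can be reused for each dataset without listing the labels by hand.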
Is there an easy way to sum the value of all the rows above the current row in an adjacent column? Click on the image below to see what I'm trying to make. It's easier to see it than explain it.
Text explanation: I'm trying to create a chart where column B is either the sum or percent of total of all the rows in A that are above it. That way I can quickly visualize where the quartile, third, etc are in the dataframe. I'm familiar with the percentile function (see "How to calculate 1st and 3rd quartiles?"), but I'm not sure I can get it to do exactly what I want it to do. Image below as well as text version:
Text Version
1--1%
1--2%
4--6%
4--10%
2--12%
...
and so on to 100 percent.
Do I need to write a for loop to do this?
Excel Chart:
You can use cumsum for this:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=dict(x=[13,22,34,21,33,41,87,24,41,22,18,12,13]))
df["percent"] = (100*df.x.cumsum()/df.x.sum()).round(1)
output:
x percent
0 13 3.4
1 22 9.2
2 34 18.1
3 21 23.6
4 33 32.3
5 41 43.0
6 87 65.9
7 24 72.2
8 41 82.9
9 22 88.7
10 18 93.4
11 12 96.6
12 13 100.0
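To locate where the quartiles fall, as the question asks, the cumulative percent column can be compared against thresholds. The sketch below reuses the data from the answer. The thresholds 25, 50 and 75 and the name first_rows are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"x": [13, 22, 34, 21, 33, 41, 87, 24, 41, 22, 18, 12, 13]})
df["percent"] = 100 * df["x"].cumsum() / df["x"].sum()

# For each threshold, find the index label of the first row whose
# cumulative percent reaches it (idxmax returns the first True).
thresholds = [25, 50, 75]
first_rows = {t: int((df["percent"] >= t).idxmax()) for t in thresholds}
print(first_rows)
```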
I have a Dataframe object coming from a SQL-Query that looks like this:
Frage/Diskussion ... Wissenschaft&Technik
date ...
2018-05-10 13 ... 6
2018-05-11 28 ... 1
2018-05-12 11 ... 2
2018-05-13 21 ... 3
2018-05-14 30 ... 4
2018-05-15 38 ... 5
2018-05-16 25 ... 7
2018-05-17 23 ... 2
2018-05-18 24 ... 4
2018-05-19 31 ... 4
[10 rows x 6 columns]
I want to visualize this data with a Matplotlib stackplot in python.
What works is following line:
df.plot(kind='area', stacked=True)
What doesn't work is following line:
plt.stackplot(df.index, df.values)
The error I get with the last line is:
"ValueError: operands could not be broadcast together with shapes (10,) (6,) "
Apparently the whole 10 rows x 6 columns array is passed into the plotting function, and I can't get around it.
Writing out each column by hand also works, but it is not really what I want since there will be many rows later on.
plt.stackplot(df.index.values, df['Frage/Diskussion'], df['Humor'], df['Nachrichten'], df['Politik'], df['Interessant'], df['Wissenschaft&Technik'])
Your problem here is that df.values has shape (rows, columns), here (10, 6), while stackplot expects one row per series, i.e. shape (6, 10). To get the form you want you need to transpose it. Fortunately, that is easy: replace df.values with df.values.T. So in your code replace:
plt.stackplot(df.index,df.values)
with
plt.stackplot(df.index,df.values.T)
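The shape mismatch and the fix can be seen in a small self-contained sketch. The frame below is made-up data that borrows three of the question's column names. Passing labels and calling legend are optional extras added here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: 4 dates (rows) and 3 categories (columns).
df = pd.DataFrame(
    {"Humor": [1, 2, 3, 4], "Politik": [2, 2, 1, 3], "Nachrichten": [4, 3, 5, 1]},
    index=pd.to_datetime(["2018-05-10", "2018-05-11", "2018-05-12", "2018-05-13"]),
)

print(df.values.shape)    # (rows, columns)
print(df.values.T.shape)  # (columns, rows): one row per stacked series

fig, ax = plt.subplots()
# Transposed, each row of the array is one series with a value per date.
layers = ax.stackplot(df.index, df.values.T, labels=list(df.columns))
ax.legend(loc="upper left")
```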
I'm trying to scatter plot the following dataframe:
mydf = pd.DataFrame({'x':[1,2,3,4,5,6,7,8,9],
'y':[9,8,7,6,5,4,3,2,1],
'z':np.random.randint(0,9, 9)},
index=["12:00", "1:00", "2:00", "3:00", "4:00",
"5:00", "6:00", "7:00", "8:00"])
x y z
12:00 1 9 1
1:00 2 8 1
2:00 3 7 7
3:00 4 6 7
4:00 5 5 4
5:00 6 4 2
6:00 7 3 2
7:00 8 2 8
8:00 9 1 8
I would like to see the times "12:00, 1:00, ..." as the x-axis and x,y,z columns on the y-axis.
When I try to plot with pandas via mydf.plot(kind="scatter"), I get the error ValueError: scatter requires and x and y column. Do I have to break down my dataframe into appropriate parameters? What I would really like to do is get this scatter plotted with seaborn.
Just running
mydf.plot(style=".")
works fine for me:
Seaborn is actually built around pandas.DataFrames. However, your data frame needs to be "tidy":
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
Since you want to plot x, y, and z on the same plot, it seems like they are actually different observations. Thus, you really have three variables: time, value, and the letter used.
The "tidy" standard comes from Hadley Wickham, who implemented it in the tidyr package.
First, I convert the index to a Datetime:
mydf.index = pd.DatetimeIndex(mydf.index)
Then we do the conversion to tidy data:
pivoted = mydf.unstack().reset_index()
and rename the columns
pivoted = pivoted.rename(columns={"level_0": "letter", "level_1": "time", 0: "value"})
Now, this is what our data looks like:
letter time value
0 x 2019-03-13 12:00:00 1
1 x 2019-03-13 01:00:00 2
2 x 2019-03-13 02:00:00 3
3 x 2019-03-13 03:00:00 4
4 x 2019-03-13 04:00:00 5
Unfortunately, seaborn doesn't handle datetimes that well, so you can just extract the hour as an integer instead:
pivoted["hour"] = pivoted["time"].dt.hour
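The same tidy layout can also be reached with melt, which avoids the rename step. This is an alternative sketch, not the method used above. It uses a shortened made-up version of mydf and parses the hour straight from the index strings instead of converting to datetimes, which assumes every label has the form "H:MM".

```python
import pandas as pd

# Shortened, made-up version of the question's frame.
mydf = pd.DataFrame(
    {"x": [1, 2, 3], "y": [9, 8, 7], "z": [4, 5, 6]},
    index=["12:00", "1:00", "2:00"],
)

# Name the index, turn it into a column, then stack x/y/z into long form.
tidy = (
    mydf.rename_axis("time")
    .reset_index()
    .melt(id_vars="time", var_name="letter", value_name="value")
)

# Take the part before the colon as the integer hour.
tidy["hour"] = tidy["time"].str.split(":").str[0].astype(int)
print(tidy)
```

The resulting frame has the same hour, value and letter columns as pivoted, so it can be passed to seaborn in the same way.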
With a data frame in this form, seaborn takes in the data easily:
import seaborn as sns
sns.set()
sns.scatterplot(data=pivoted, x="hour", y="value", hue="letter")
Outputs: