plotting multiple lines in one line plot - python

I have a dataframe that has 3 columns. I made it from a bigger dataframe like this:
new_df = df[['client_name', 'time_window_end', 'tag_count']]
Then I used groupby to find the number of tags for each client on each day using this code:
new_df.groupby(['client_name', 'time_window_end'])['tag_count'].count()
I have 70 client names in total in a list, and I want to loop through the list to plot a line
plot for each client name. On the x-axis I want 'time_window_end' and on the y-axis I want 'tag_count'.
I want 70 plots, but the for loop I have written does not do that. I would be happy if you could help me fix it.
clients = new_df['client_name'].unique()
client_list = clients.tolist()
for client in client_list[:60]:
    temp = new_df.loc[new_df['client_name'] == client]
    x = temp.groupby(temp['time_window_end'].dt.floor('d'))['tag_count'].sum()
    df2 = x.to_frame()
    df2.reset_index(inplace=True)
    df2["time_window_end"] = pd.to_datetime(df2["time_window_end"])
    line_chart = df2.copy()
    plt.plot(line_chart.reset_index()["time_window_end"], x)

If I'm understanding this right, it sounds like the seaborn package might have what you need. The plotting functions take a 'hue' argument, which splits the plot into multiple lines based on the data in a column.
import seaborn as sn

new_df = new_df.groupby(['client_name', 'time_window_end'])['tag_count'].count().reset_index()
sn.relplot(
    data=new_df,
    x=pd.to_datetime(new_df["time_window_end"]),
    y='tag_count',
    hue='client_name',
    kind='line')
EDIT: to get multiple plots
import seaborn as sn

new_df["time_window_end"] = pd.to_datetime(new_df["time_window_end"])
g = sn.FacetGrid(
    data=new_df,
    row='client_name')
g.map(sn.lineplot, 'time_window_end', 'tag_count')
EDIT again: to get separate plot images
import matplotlib.pyplot as plt

for name in pd.unique(new_df.client_name):
    sn.lineplot(
        data=new_df.loc[new_df.client_name == name],
        x='time_window_end',
        y='tag_count',
        label=name)
    plt.show()
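For completeness, here is a plain-matplotlib variant closer to the loop in the question (a sketch assuming new_df has the columns named above and that time_window_end is already a datetime column): the key point is that opening a new figure inside the loop is what produces one plot per client.
import matplotlib.pyplot as plt

for client in new_df['client_name'].unique():
    temp = new_df.loc[new_df['client_name'] == client]
    daily = temp.groupby(temp['time_window_end'].dt.floor('d'))['tag_count'].sum()
    plt.figure()                       # fresh figure for this client
    plt.plot(daily.index, daily.values)
    plt.title(client)
    plt.show()                         # render before moving to the next client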


How to plot multiple traces with trendlines?

I'm trying to plot trendlines on multiple traces on scatters in plotly. I'm kind of stumped on how to do it.
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'],
                         y=df_df['Height (meters)'],
                         name='Douglas Fir', mode='markers')
              )
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'],
                         y=df_wp['Height (meters)'],
                         name='White Pine', mode='markers')
              )
fig.update_layout(title="Tree Circumference vs Height (meters)",
                  xaxis_title=df_df['Circumference (meters)'].name,
                  yaxis_title=df_df['Height (meters)'].name,
                  title_x=0.5)
fig.show()
Trying to get something like this:
Here's how I resolved it. Basically, I used the NumPy polyfit function to calculate the slope and intercept, then added the trendline for each data set as a trace.
import numpy as np

df_m, df_b = np.polyfit(df_df['Circumference (meters)'].to_numpy(), df_df['Height (meters)'].to_numpy(), 1)
wp_m, wp_b = np.polyfit(df_wp['Circumference (meters)'].to_numpy(), df_wp['Height (meters)'].to_numpy(), 1)

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'],
                         y=df_df['Height (meters)'],
                         name='Douglas Fir', mode='markers')
              )
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'],
                         y=(df_m * df_df['Circumference (meters)'] + df_b),
                         name='douglas fir trendline',
                         mode='lines')
              )
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'],
                         y=df_wp['Height (meters)'],
                         name='White Pine', mode='markers')
              )
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'],
                         y=(wp_m * df_wp['Circumference (meters)'] + wp_b),
                         name='white pine trendline',
                         mode='lines')
              )
fig.update_layout(title="Tree Circumference vs Height (meters)",
                  xaxis_title=df_df['Circumference (meters)'].name,
                  yaxis_title=df_df['Height (meters)'].name,
                  title_x=0.5)
fig.show()
You've already put together a procedure that solves your problem, but I would like to mention that you can use plotly.express to do the very same thing with only a few lines of code. Using px.scatter() there are actually two slightly different approaches, depending on whether your data is of a long or wide format. Your data seems to be of the latter format, since you're asking:
how can I make this work with separate traces?
So I'll start with that. And I'll use a subset of the built-in dataset px.data.stocks() since you haven't provided a data sample.
Code 1 - Wide data
fig_wide = px.scatter(df_wide, x='index', y=['AAPL', 'MSFT'],
                      trendline='ols')
Code 2 - Long data
fig_long = px.scatter(df_long, x='index', y='value',
                      color='variable',
                      trendline='ols')
Plot 1 - Identical results
About the data:
A dataframe of a wide format typically has an index with unique values in the left-most column, variable names in the column headers, and corresponding values for each variable per index in the columns like this:
index AAPL MSFT
0 1.000000 1.000000
1 1.011943 1.015988
2 1.019771 1.020524
3 0.980057 1.066561
4 0.917143 1.040708
Here, adding information about another variable would require adding another column.
A dataframe of a long format, on the other hand, typically organizes the same data with only (though not necessarily only) three columns; index, variable and value:
index variable value
0 AAPL 1.000000
1 AAPL 1.011943
.
.
100 MSFT 1.720717
101 MSFT 1.752239
And contrary to the wide format, this means that index will have duplicate values. But for a good reason.
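As a small illustration of the two shapes (a sketch using the same AAPL/MSFT subset of px.data.stocks() that the complete code further down builds), pd.melt turns the wide frame into the long one, and pivot goes back the other way:
import pandas as pd
import plotly.express as px

df_wide = px.data.stocks()[['AAPL', 'MSFT']].reset_index()    # wide: one column per stock
df_long = pd.melt(df_wide, id_vars='index')                   # long: index, variable, value
df_back = df_long.pivot(index='index', columns='variable', values='value').reset_index()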
So what's the difference?
If you look at Code 1 you'll see that the only thing you need to specify for px.scatter in order to get multiple traces with trendlines, in this case AAPL and MSFT on the y-axis versus an index on the x-axis, is trendline = 'ols'. This is because plotly.express automatically identifies the data format as wide and knows how to apply the trendlines correctly. Different columns mean different categories, for which a trace and trendline are produced.
As for the "long approach", you've got both AAPL and MSFT in the same variable column, and values for both of them in the value column. But setting color = 'variable' lets plotly.express know how to categorize the variable column, correctly separate the data in the value column, and thus correctly produce the trendlines. A different name in the variable column means that index and value in the same row belong to different categories, for which a new trace and trendline are built.
Any pros and cons?
The arguably only advantage with the wide format is that it's easier to read (particularly for those of us damaged by too many years of sub-excellent data handling with Excel). And one great advantage with the long format is that you can easily illustrate more dimensions of the data if you have more categories with, for example, different symbols or sizes for the markers.
Another advantage with the long format occurs if the dataset changes, for example with the addition of another variable 'AMZN'. Then the name and the values of that variable will occur in the already existing columns instead of adding another one like you would for the wide format. This means that you actually won't have to change the code in:
fig_long = px.scatter(df_long, x='index', y='value',
                      color='variable',
                      trendline='ols')
... in order to add the data to the figure.
While for the wide format, you would have to specify y = ['AAPL', 'MSFT', 'AMZN'] in:
fig_wide = px.scatter(df_wide, x='index', y=['AAPL', 'MSFT', 'AMZN'],
                      trendline='ols')
And I would strongly argue that this outweighs the slight inconvenience of specifying color = 'variable' in:
fig_long = px.scatter(df_long, x='index', y='value',
                      color='variable',
                      trendline='ols')
Plot 2 - A new variable:
Complete code
# imports
import pandas as pd
import plotly.express as px
# data
df = px.data.stocks()
# df.date = pd.to_datetime(df.date)
df_wide = df.drop(['date', 'GOOG', 'AMZN', 'NFLX', 'FB'], axis = 1).reset_index()
# df_wide = df.drop(['date', 'GOOG', 'NFLX', 'FB'], axis = 1).reset_index()
df_long = pd.melt(df_wide, id_vars = 'index')
df_long
fig_wide = px.scatter(df_wide, x='index', y=['AAPL', 'MSFT'],
                      trendline='ols')
fig_long = px.scatter(df_long, x='index', y='value',
                      color='variable',
                      trendline='ols')
# fig_long.show()
fig_wide.show()

Colour code the plot based on the two data frame values

I would like to colour-code the scatter plot based on two dataframe values: for each distinct value of df[1], a new colour should be assigned, and for the df[2] values that share the same df[1] value, that colour's opacity should vary, with the highest df[2] value in the group drawn fully opaque and the lowest drawn least opaque.
Here is the code:
def func():
    ...

df = pd.read_csv(PATH + file, sep=",", header=None)
b = 2.72
a = 0.00000009
popt, pcov = curve_fit(func, df[2], df[5]/df[4], p0=[a, b])
perr = np.sqrt(np.diag(pcov))
# plot responsible for the data points in the figure
plt.scatter(df[1], df[5]/df[4]/df[2])
# plot responsible for the curve in the figure
plt.plot(df[1], func_cpu(df[2], *popt)/df[2], "r")
plt.legend(loc="upper left")
Here is the sample dataset:
**df[0],df[1],df[2],df[3],df[4],df[5],df[6]**
file_name_1_i1,31,413,36120,10,9,10
file_name_1_i2,31,1240,60488,10,25,27
file_name_1_i3,31,2769,107296,10,47,48
file_name_1_i4,31,8797,307016,10,150,150
file_name_2_i1,34,72,10868,11,9,10
file_name_2_i2,34,6273,250852,11,187,196
file_name_3_i1,36,84,29568,12,9,10
file_name_3_i2,36,969,68892,12,25,26
file_name_3_i3,36,6545,328052,12,150,151
file_name_4_i1,69,116,40712,13,25,26
file_name_4_i2,69,417,80080,13,47,48
file_name_4_i2,69,1313,189656,13,149,150
file_name_4_i4,69,3009,398820,13,195,196
file_name_4_i5,69,22913,2855044,13,3991,4144
file_name_5_i1,85,59,48636,16,47,48
file_name_5_i2,85,163,64888,15,77,77
file_name_5_i3,85,349,108728,16,103,111
file_name_5_i4,85,1063,253180,14,248,248
file_name_5_i5,85,2393,526164,15,687,689
file_name_5_i6,85,17713,3643728,15,5862,5867
file_name_6_i1,104,84,75044,33,137,138
file_name_6_i2,104,455,204792,28,538,598
file_name_6_i3,104,1330,513336,31,2062,2063
file_name_6_i4,104,2925,1072276,28,3233,3236
file_name_6_i5,104,6545,2340416,28,7056,7059
...
So, the x-axis would be df[1], which is 31, 31, 31, 31, 34, 34, ..., and the y-axis is df[5], df[4], df[2], which are 9, 10, 413. For each distinct value of df[1], a new colour needs to be assigned. It would be fine to repeat the colour cycle after, say, 6 unique colours. Within each colour, the opacity needs to vary with the value of df[2] (even though the y-axis shows df[5], df[4], df[2]): the highest value gets the darkest shade of that colour and the lowest gets the lightest.
and the scatter plot:
This is roughly how my desired solution of the color code needs to look like:
I have around 200 entries in the csv file.
Would using NumPy be more advantageous in this scenario?
Let me know if this is appropriate or if I have misunderstood anything.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# not needed for you
# df = pd.read_csv('~/Documents/tmp.csv')
max_2 = pd.DataFrame(df.groupby('1').max()['2'])
no_unique_colors = 3
color_set = [np.random.random((3)) for _ in range(no_unique_colors)]
# assign colors to unique df2 in cyclic order
max_2['colors'] = [color_set[unique_df2 % no_unique_colors] for unique_df2 in range(max_2.shape[0])]
# calculate the opacities for each entry in the dataframe
colors = [list(max_2.loc[df1].colors) + [float(df['2'].iloc[i])/max_2['2'].loc[df1]] for i, df1 in enumerate(df['1'])]
# repeat thrice so that df2, df4 and df5 share the same opacity
colors = [x for x in colors for _ in range(3)]
plt.scatter(df['1'].values.repeat(3), df[['2', '4', '5']].values.reshape(-1), c=colors)
plt.show()
Well, what do you know. I understood this task totally differently. I thought the point was to have alpha levels according to all df[2], df[4], and df[5] values for each df[1] value. Oh well, since I have done the work already, why not post it?
from matplotlib import pyplot as plt
import pandas as pd
from itertools import cycle
from matplotlib.colors import to_rgb

# read the data; column numbers will be generated automatically
df = pd.read_csv("data.txt", sep=",", header=None)

# our figure with the ax object
fig, ax = plt.subplots(figsize=(10, 10))

# definition of the colours
sc_color = cycle(["tab:orange", "red", "blue", "black"])

# get groups with the same df[1] value; they will also be sorted at the same time
dfgroups = df.iloc[:, [2, 4, 5]].groupby(by=df[1])

# plot each group with a different colour
for groupkey, groupval in dfgroups:
    # create a group dataframe with the df[1] value as x and the df[2], df[4], and df[5] values as y
    groupval = groupval.melt(var_name="x", value_name="y")
    groupval.x = groupkey
    # get min and max y for the normalization
    y_high = groupval.y.max()
    y_low = groupval.y.min()
    # read out the r, g, and b values of the next colour in the cycle
    r, g, b = to_rgb(next(sc_color))
    # create a colour array with nonlinearly normalized alpha levels
    # between roughly 0.2 and 1.0, so that all data points remain visible
    group_color = [(r, g, b, 0.19 + 0.8 * ((y_high - val) / (y_high - y_low))**7) for val in groupval.y]
    # and plot
    ax.scatter(groupval.x, groupval.y, c=group_color)

plt.show()
Sample output of your data:
Two main problems here. One is that alpha in a scatter plot does not accept an array, but color does; hence the detour of reading out the RGB values and creating an RGBA array with the alpha levels added.
The other is that your data are spread over a rather wide range. A linear normalization makes changes near the lowest values invisible. There is surely some optimization possible; I like, for instance, this suggestion.
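To see the detour in isolation, here is a minimal, self-contained sketch with made-up values (not the answer's data): per-point transparency is achieved by handing scatter a list of RGBA tuples.
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import to_rgb

y = np.array([9.0, 25.0, 150.0, 4144.0])        # made-up values spanning a wide range
t = (y.max() - y) / (y.max() - y.min())          # 0 for the highest value, 1 for the lowest
r, g, b = to_rgb("tab:orange")
colors = [(r, g, b, 0.19 + 0.8 * ti**7) for ti in t]   # nonlinear alpha, roughly 0.2 to 1.0
plt.scatter(np.arange(len(y)), y, c=colors)      # color accepts RGBA tuples; alpha does not take an array
plt.show()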

plot matplotlib aggregated data python

I need a plot of aggregated data.
import pandas as pd
basic_data= pd.read_csv('WHO-COVID-19-global-data _2.csv',parse_dates= ['Date_reported'] )
cum_daily_cases = basic_data.groupby('Date_reported')[['New_cases']].sum()
import pylab
x = cum_daily_cases['Date_reported']
y = cum_daily_cases['New_cases']
pylab.plot(x,y)
pylab.show()
Error: 'Date_reported'
Input: Date_reported, Country_code, Country, WHO_region, New_cases, Cumulative_cases, New_deaths, Cumulative_deaths 2020-01-03,AF,Afghanistan,EMRO,0,0,0,0
Output: the total quantity of "New cases" shown on the plot per day.
What should I do to run this plot? link to dataset
The column names contain a leading space (can be easily seen by checking basic_data.dtypes). Fix that by adding the following line immediately after basic_data was read:
basic_data.columns = [s.strip() for s in basic_data.columns]
In addition, your x variable should be the index after groupby-sum, not a column Date_reported. Correction:
x = cum_daily_cases.index
The plot should show as expected.
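Putting both fixes together, a minimal sketch (assuming the same CSV as in the question):
import pandas as pd
import matplotlib.pyplot as plt

basic_data = pd.read_csv('WHO-COVID-19-global-data _2.csv')
basic_data.columns = [s.strip() for s in basic_data.columns]           # remove the leading spaces
basic_data['Date_reported'] = pd.to_datetime(basic_data['Date_reported'])
cum_daily_cases = basic_data.groupby('Date_reported')[['New_cases']].sum()

plt.plot(cum_daily_cases.index, cum_daily_cases['New_cases'])          # the dates live in the index
plt.show()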

ggplot summarise mean value of categorical variable on y axis

I am trying to replicate a Python plot in R that I found in this Kaggle notebook: Titanic Data Science Solutions
This is the Python code to generate the plot, the dataset used can be found here:
import seaborn as sns
...
grid = sns.FacetGrid(train_df, row='Embarked', size=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()
Here is the resulting plot.
The survival column takes values of 0 and 1 (survive or not survive) and the y-axis is displaying the mean per pclass. When searching for a way to calculate the mean using ggplot2, I usually find the stat_summary() function. The best I could do was this:
library(dplyr)
library(ggplot2)
...
train_df %>%
  ggplot(aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
  stat_summary(fun.y = mean, geom = "line") +
  facet_grid(Embarked ~ .)
The output can be found here.
There are some issues:
There seems to be an empty facet, maybe from NA's in Embarked?
The points don't align with the line
The lines are different than those in the Python plot
I think I also haven't fully grasped the layering concept of ggplot. I would like to separate the geom = "line" in the stat_summary() function and rather add it as a + geom_line().
There is actually an empty level (i.e. "") in train_df$Embarked. You can filter that out before plotting.
train_df <- read.csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
train_df <- subset(train_df, Embarked != "")

ggplot(train_df, aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
  stat_summary(fun.data = 'mean_cl_boot') +
  geom_line(stat = 'summary', fun.y = mean) +
  facet_grid(Embarked ~ .)
You can replicate the python plot by drawing confidence intervals using stat_summary. Although your lines with stat_summary were great, I've rewritten it as a geom_line call, as you asked.
Note that your ggplot code doesn't draw any points, so I can't answer that part, but probably you were drawing the raw values which are just many 0s and 1s.

How to have fixed x ticks while plotting using pandas dataframe

My dataframe columns are employee and X-Folder. This is an example of the data:
For every unique employee I count the number of each X-Folder. In this case the X-Folders are contacts, straw, conference, etc. I plot the graph using:
from collections import Counter

for e in employees:
    topics = df1.loc[df1.employee == e, "X-Folder"]
    letter_counts = Counter(topics)
    df2 = pd.DataFrame.from_dict(letter_counts, orient='index')
    ax = df2.plot(kind='bar', figsize=(10, 10))
It works just fine. The only problem is that the x-ticks are not fixed. For example, for the first employee, conference, meetings, and active international are zero, so they won't show on the x-axis; only contacts and straw will. I want a graph that shows all the labels.
Edit: I want all the topics shown on the x-axis, with only the relevant counter values plotted.
Edit: This is what I have done. I store all the topics in a list, and any topic missing from the letter_counts dictionary is assigned the value zero. It works fine for the bar graph:
import matplotlib.pyplot as plt
from collections import Counter

topic = df1["X-Folder"].unique()
for e in employees:
    topics = df1.loc[df1.employee == e, "X-Folder"]
    letter_counts = Counter(topics)
    for t in topic:
        if str(t) not in letter_counts.keys():
            letter_counts[t] = 0
    df2 = pd.DataFrame.from_dict(letter_counts, orient='index')
    ax = df2.plot(kind='bar', figsize=(20, 10))
This is the output
But for the area graph it doesn't work (it only shows 4 of the 23 topics):
Try this, putting in your own values for x, of course.
From Stack Overflow: Changing the "tick frequency" on x or y axis in matplotlib?
import numpy as np
import matplotlib.pyplot as plt

x = [1, 10, 20, 30, 40, 50, 60, 70, 80, 90]
plt.xticks(np.arange(min(x), max(x) + 1, 1.0))
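A slightly different sketch of the same zero-padding idea from the question's edit (assuming df1 and employees as defined above, and assuming that missing ticks are what the area plot lacks): reindex the counts against the full topic list and set every tick explicitly.
import matplotlib.pyplot as plt

all_topics = df1['X-Folder'].unique()
for e in employees:
    counts = df1.loc[df1.employee == e, 'X-Folder'].value_counts()
    counts = counts.reindex(all_topics, fill_value=0)    # zero for topics this employee lacks
    ax = counts.plot(kind='area', figsize=(20, 10))
    ax.set_xticks(range(len(all_topics)))                # one tick per topic
    ax.set_xticklabels(all_topics, rotation=90)
    plt.show()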
