How to show last row of Pandas DataFrame in box plot - python

Random data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.DataFrame(np.random.normal(size=(20,4)))
data
0 1 2 3
0 -0.710006 -0.748083 -1.261515 0.048941
1 0.856541 0.533073 0.649113 -0.236297
2 -0.091005 -0.244658 -2.194779 0.632878
3 -0.059058 0.807661 -0.418446 -0.295255
4 -0.103701 0.775622 0.258412 0.024411
5 -0.447976 -0.034419 -1.521598 -0.903301
6 1.451105 0.549661 -1.655751 -0.147499
7 1.479374 -1.475347 0.665726 0.236611
8 -1.427979 -1.812916 0.522802 0.006066
9 0.198515 1.203476 -0.475389 -1.721707
10 0.286255 0.564450 0.590050 -0.657811
11 -1.076161 1.820218 -0.315127 -0.848114
12 0.061848 0.303502 0.978169 0.024630
13 -0.307827 -1.047835 0.547052 -0.647217
14 0.679214 0.734134 0.158803 -0.334951
15 0.469675 1.043391 -1.449727 1.335354
16 -0.483831 -0.988185 0.264027 -0.831833
17 -2.013968 -0.200699 1.076526 1.275300
18 -0.199473 -1.630597 -1.697146 -0.177458
19 1.245289 0.132349 1.054312 -0.082550
data.boxplot(vert= False, figsize = (15,10))
I want to add red dots to the box plot indicating the last value (bottom) in each column. For example (red dots I've edited in are not in their exact position, but this gives you a general idea):
Thank you.

You could just add a scatter plot on top of the boxplot.
For the provided example, it looks like this:
fig, ax = plt.subplots(figsize=(8,5))
# the question's DataFrame is named data
data.boxplot(vert=False, patch_artist=True, ax=ax, zorder=1)
lastrow = data.iloc[-1,:]
print(lastrow)
# the boxes sit at y positions 1..n, one per column
ax.scatter(x=lastrow, y=[*range(1,len(lastrow)+1)], color='r', zorder=2)
# for displaying the values of the red points:
for i, val in enumerate(lastrow,1):
    ax.annotate(text=f"{val:.2f}", xy=(val,i+0.1))
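The overlay works because the boxes are drawn at positions 1..n in column order, so the last-row values only need to be paired with those positions. A minimal self-contained sketch of that idea, assuming a seeded random DataFrame in place of the question's data and a headless Agg backend (both are choices made for this example):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the question's random data (seeded so it is reproducible)
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(20, 4)))

fig, ax = plt.subplots(figsize=(8, 5))
data.boxplot(vert=False, ax=ax)

last_row = data.iloc[-1]
# Boxes are placed at positions 1..n, so pair each last-row value with its box
positions = np.arange(1, len(last_row) + 1)
points = ax.scatter(last_row.to_numpy(), positions, color="r", zorder=3)
```

With vert=False the values go on the x-axis and the box positions on the y-axis; for a vertical box plot the two arguments to scatter would be swapped.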

Related

Highlight a single point with a marker in lineplot

I would like to highlight a single point on my lineplot graph using a marker. So far I have managed to create my plot and insert the highlight where I wanted.
The problem is that I have 4 different lineplots (4 different categorical attributes) and the marker gets placed on every single lineplot, like in the following image:
I would like to place the marker only on the 2020 line (the purple one). This is my code so far:
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import matplotlib.gridspec as gridspec
fig = plt.figure(figsize=(15,10))
gs0 = gridspec.GridSpec(2,2, figure=fig, hspace=0.2)
ax1 = fig.add_subplot(gs0[0,:]) # lineplot
ax2 = fig.add_subplot(gs0[1,0]) #Used for another plot not shown here
ax3 = fig.add_subplot(gs0[1,1]) #Used for another plot not shown here
flatui = ["#636EFA", "#EF553B", "#00CC96", "#AB63FA"]
sns.lineplot(ax=ax1,x="number of weeks", y="avg streams", hue="year", data=df, palette=flatui, marker = 'o', markersize=20, fillstyle='none', markeredgewidth=1.5, markeredgecolor='black', markevery=[5])
ax1.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x/1000) + 'K'))
ax1.set(title='Streams trend')
ax1.xaxis.set_major_locator(ticker.MultipleLocator(2))
I used the markevery field to place a marker in position 5. Is there a way to also specify which line/category the marker is placed on?
EDIT: This is my dataframe:
avg streams date year number of weeks
0 145502.475 01-06 2017 0
1 158424.445 01-13 2017 1
2 166912.255 01-20 2017 2
3 169132.215 01-27 2017 3
4 181889.905 02-03 2017 4
... ... ... ... ...
181 760505.945 06-26 2020 25
182 713891.695 07-03 2020 26
183 700764.875 07-10 2020 27
184 753817.945 07-17 2020 28
185 717685.125 07-24 2020 29
186 rows × 4 columns
markevery is a Line2D property. sns.lineplot doesn't return the lines so you need to get the line you want to annotate from the Axes. Remove all the marker parameters from the lineplot call and add ...
lines = ax1.get_lines()
If the 2020 line/data is the fourth in the series,
line = lines[3]
line.set_marker('o')
line.set_markersize(20)
line.set_markevery([5])
line.set_fillstyle('none')
line.set_markeredgewidth(1.5)
line.set_markeredgecolor('black')
# or
props = {'marker':'o','markersize':20, 'fillstyle':'none','markeredgewidth':1.5,
'markeredgecolor':'black','markevery': [5]}
line.set(**props)
Another option, inspired by Quang Hoang's comment would be to add a circle around/at the point deriving the point from the DataFrame.
x = 5 # your spec
wk = df['number of weeks']==5
yr = df['year']==2020
s = df[wk & yr]
y = s['avg streams'].to_numpy()
# or
y = df.loc[(df['year']==2020) & (df['number of weeks']==5),'avg streams'].to_numpy()
ax1.plot(x,y, 'ko', markersize=20, fillstyle='none', markeredgewidth=1.5)
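If relying on the order of ax1.get_lines() feels fragile, one alternative is to draw the lines with plain matplotlib and keep a handle to each one by year. This is a sketch of a different technique from the seaborn call in the question; the DataFrame below is a hypothetical stand-in with the same column names, and the values are made up:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's DataFrame (same column names, made-up values)
weeks = np.arange(30)
df = pd.DataFrame({
    "number of weeks": np.tile(weeks, 4),
    "year": np.repeat([2017, 2018, 2019, 2020], len(weeks)),
})
df["avg streams"] = 100_000.0 * (df["year"] - 2016) + 1_000.0 * df["number of weeks"]

fig, ax = plt.subplots()
lines = {}
for year, group in df.groupby("year"):
    # ax.plot returns a list with one Line2D; keep it under its year
    (lines[year],) = ax.plot(group["number of weeks"], group["avg streams"], label=str(year))

# Only the 2020 line gets the marker, and only at index 5 of its data
lines[2020].set(marker="o", markersize=20, fillstyle="none",
                markeredgewidth=1.5, markeredgecolor="black", markevery=[5])
ax.legend()
```

Looking the line up by year avoids having to know that 2020 happens to be the fourth line on the Axes.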

How to set_xticklabels to corresponding "dates" column of a pandas DataFrame?

The issue
I have a pandas dataframe with two columns of data and corresponding date times in another column. I want to make a line plot for the two sets of data and have the date times, which is a list of strings, as the x-axis tick labels. But when I make the plot, the tick labels don't align with the corresponding data for that date. Instead the xlabels just come from the first n entries of the dates column.
My original program had a tick at every data point, so I tried setting ax.set_xticks to reduce the number of ticks, and that's when this issue arose.
The code
import numpy as np
import datetime
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = { 'Set1': np.random.rand(24),
         'Set2': np.random.rand(24)
       }
df = pd.DataFrame(data,columns=['Set1','Set2'])
date_list = []
base = datetime.datetime(2000,1,1,0)
for i in range(len(df.Set1)):
    date = base + datetime.timedelta(hours=i*3)
    date_frmt = date.strftime("%b%d%Hz")
    date_list.append(str(date_frmt))
df['Dates'] = date_list
print(df)
Set1 Set2 Dates
0 0.521824 0.371057 Jan0100z
1 0.726503 0.945712 Jan0103z
2 0.881100 0.725798 Jan0106z
3 0.432198 0.549191 Jan0109z
4 0.083255 0.297057 Jan0112z
5 0.428145 0.441973 Jan0115z
6 0.168049 0.411889 Jan0118z
7 0.654588 0.822227 Jan0121z
8 0.540984 0.824515 Jan0200z
9 0.999410 0.809121 Jan0203z
10 0.055359 0.901241 Jan0206z
11 0.163407 0.085901 Jan0209z
12 0.523488 0.011856 Jan0212z
13 0.133038 0.881413 Jan0215z
14 0.880946 0.301656 Jan0218z
15 0.575265 0.972408 Jan0221z
16 0.489332 0.399983 Jan0300z
17 0.119246 0.216152 Jan0303z
18 0.805346 0.873699 Jan0306z
19 0.806190 0.277772 Jan0309z
20 0.868357 0.311854 Jan0312z
21 0.042386 0.461695 Jan0315z
22 0.354832 0.262534 Jan0318z
23 0.209049 0.780153 Jan0321z
ax = df.plot(figsize=(10, 5))
ax.set_xticklabels(df.Dates)
The problem
How do I get the ticks' labels to align correctly with their corresponding data?
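You can use DateFormatter from matplotlib.dates. It formats real datetime values, so the strings in the Dates column have to be parsed first:
import matplotlib.dates as mdates
# the strings carry no year, so the parsed dates default to 1900
dates = pd.to_datetime(df['Dates'], format='%b%d%Hz')
fig, ax = plt.subplots()
for col in ['Set1', 'Set2']:
    ax.plot(dates, df[col], label=col)
ax.legend()
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b%d%Hz'))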
Output:
You can use this line after ax = df.plot(figsize=(10, 5)):
plt.xticks(np.arange(df.shape[0])[::1], df.Dates[::1], rotation=35)
The indexing part [::1] lets you pick the interval at which labels are placed. Here it keeps all of them, which will be cluttered; a larger step such as [::4] labels every fourth point. With this line you no longer need ax.set_xticklabels(df.Dates).
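The underlying point is that tick positions and tick labels have to be chosen from the same rows. A short sketch of that pairing, assuming a step of 4 and seeded random data in place of the question's values (both are choices made for this example):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 24 timestamps, 3 hours apart, formatted like the question's Dates column
stamps = pd.Timestamp("2000-01-01") + pd.to_timedelta(np.arange(24) * 3, unit="h")
df = pd.DataFrame({"Set1": rng.random(24), "Set2": rng.random(24),
                   "Dates": stamps.strftime("%b%d%Hz")})

ax = df[["Set1", "Set2"]].plot(figsize=(10, 5))

step = 4  # assumed spacing: label every 4th row
ticks = np.arange(0, len(df), step)
ax.set_xticks(ticks)  # positions are row numbers
ax.set_xticklabels(df["Dates"].iloc[ticks].tolist(), rotation=35)  # labels from the same rows

labels = [t.get_text() for t in ax.get_xticklabels()]
```

Calling set_xticklabels on its own, as in the question, assigns the first labels to whatever ticks matplotlib happened to pick, which is why the labels drifted away from their data.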

How to manage different labels in a bar chart, taking data from a text file?

I'm new to matplotlib, so I'm having some problems. I need to create a bar chart with different labels for each website that I have.
The file is like the following:
1001 adblock 12
1001 badger 11
1001 disconnect 15
1001 ghostery 15
1001 nottrack 14
1001 origin 15
1001 policy 16
1001 ultimate 14
4ruote adblock 12
4ruote badger 1
4ruote disconnect 14
4ruote ghostery 27
4ruote nottrack 9
4ruote origin 26
4ruote policy 34
4ruote ultimate 20
...... ........ ...
My goal is to create a bar chart in which I have:
on the x axis sites (first column of the file), is a string
on the y axis the values (third column of the file) for that site (that are 8 times repeated inside the file), so 8 integer values
labels that, for a specific site, are present in the second column (strings).
I have read several answers, but none of them deals with this comparison between labels for the same variable.
What I'm doing is reading the file, splitting each row and taking the first and third columns, but how can I manage the labels?
seaborn does this neatly:
from pandas import read_csv
from matplotlib.pyplot import show
from seaborn import catplot  # named factorplot before seaborn 0.9
# the columns are separated by one or more whitespace characters
fil = read_csv('multi_bar.txt', sep=r'\s+', engine='python', header=None)
fil.columns=['site','type','value']
catplot(data=fil, x='site', y='value', hue='type', kind='bar')
show()
Let's assume that you have read the values for each label into a separate dataset (adblock, badger, disconnect, etc.), each holding one value per site. You can then use the logic below to plot each series and show the labels in a legend.
import numpy
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# one x position per site; every dataset must have this length
x = numpy.arange(len(adblock))
width = 0.1
# plot each dataset here, offset by the width of the preceding bars
b1 = ax.bar(x, adblock, width, color='r')
b2 = ax.bar(x + width, badger, width, color='g')
b3 = ax.bar(x + width*2, disconnect, width, color='m')
ax.legend([b1[0], b2[0], b3[0]], ['adblock', 'badger', 'disconnect'])
plt.show()
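Another way to get from the three-column file to grouped bars is to pivot the long table so that each label becomes its own column, then let pandas draw the bars. A minimal sketch, assuming the first rows of the question's file are pasted into a string instead of being read from disk (the names long_df and wide are made up for this example):

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A few rows from the question's file, inlined so the sketch is self-contained
raw = """1001 adblock 12
1001 badger 11
1001 disconnect 15
4ruote adblock 12
4ruote badger 1
4ruote disconnect 14
"""

long_df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None,
                      names=["site", "type", "value"], dtype={"site": str})

# One row per site, one column per label
wide = long_df.pivot(index="site", columns="type", values="value")

# Each column becomes one bar series, and the column names fill the legend
ax = wide.plot.bar(rot=0)
ax.set_ylabel("value")
```

Reading the site column as a string keeps "1001" from being parsed as a number, so it is treated as a category on the x-axis like "4ruote".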

How to border a bar for a particular data period in Python

I have a dataset of a year and its numerical description. Example:
X Y
1890 6
1900 4
2000 1
2010 9
I plot a bar like plt.bar(X,Y) and it looks like:
How can I make the step of the X scale more detailed, for example, 2 years?
Can I somehow border every 5 years with another color, red for instance?
There are some different ways to do this. This is a possible solution:
import numpy as np
import matplotlib.pyplot as plt
x = [1890,1900,2000,2010]
y = [6,4,1,9]
stepsize = 10 # Choose your step here
fig, ax = plt.subplots()
ax.bar(x,y)
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, stepsize))
plt.show()
The result is:
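For the second part of the question, each bar is a rectangle patch whose edge color can be set individually. The sketch below is one possible reading: it assumes "border" means a red outline, and it picks the years divisible by 100 purely as an example criterion. It also uses MultipleLocator so the ticks land on round years, and a bar width of 5 so the bars are visible on an axis spanning more than a century (the width is a choice made here, not part of the question):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

x = [1890, 1900, 2000, 2010]
y = [6, 4, 1, 9]

fig, ax = plt.subplots()
bars = ax.bar(x, y, width=5)

# Put a tick on every multiple of 10 years
ax.xaxis.set_major_locator(MultipleLocator(10))

# Example criterion (an assumption): outline the bars whose year is divisible by 100
for year, bar in zip(x, bars):
    if year % 100 == 0:
        bar.set_edgecolor("red")
        bar.set_linewidth(2)
```

Swapping the condition inside the loop changes which bars get the outline.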

How to add more space in between Y-points in a barh()?

Image:
Should be pretty clear from the image; basically I want the Y-labels to be visible. How do I add more spacing between them? My code is as follows:
import numpy as np
import matplotlib.pyplot as plt
import sys
file_hours = sys.argv[1]
file_bytes = sys.argv[2]
list_hours = []
list_bytes = []
for line in open(file_hours, 'r'):
    list_hours.append(line)
for line in open(file_bytes, 'r'):
    list_bytes.append(float(line))
y_pos = np.arange(len(list_hours))
fig = plt.figure()
plt.barh(y_pos,list_bytes,align='center')
plt.yticks(y_pos, list_hours)
plt.show()
EDIT: It goes without saying that I have a large amount of data to graph. It doesn't matter if the graph is (much) taller.
You can do something like:
fig = plt.figure(figsize=(8,36))
where the first number is the width and the second is the height. Reference: http://matplotlib.org/api/figure_api.html#matplotlib.figure.Figure
You can also space out the yticks, e.g.:
plt.yticks(np.arange(min(y_pos), max(y_pos), 5.0))
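Rather than hard-coding the height, the figure can be sized from the number of bars so every label gets the same vertical room. A small sketch, assuming 0.25 inches per bar and made-up labels and values in place of the two input files:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-ins for the contents of the two input files
labels = [f"hour {i}" for i in range(120)]
values = np.linspace(1.0, 100.0, len(labels))

inches_per_bar = 0.25  # assumed spacing; increase it for larger fonts
height = max(4.0, inches_per_bar * len(labels))

fig, ax = plt.subplots(figsize=(8, height))
y_pos = np.arange(len(labels))
ax.barh(y_pos, values, align="center")
ax.set_yticks(y_pos)
ax.set_yticklabels(labels)
```

With 120 bars this gives a figure 30 inches tall, so saving it to a file is likely more practical than showing it on screen.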
