Python - connect two sets of dots in order - python

I created two scatterplots and put them on the same graph. I also want to match the points of the two scatterplots (note that the two scatterplots have the same number of points).
My current code is provided below, and the plot I want to get is sketched at the bottom of this post.
plt.scatter(tmp_df['right_eye_x'], tmp_df['right_eye_y'],
            color='green', label='right eye')
plt.scatter(tmp_df['left_eye_x'], tmp_df['left_eye_y'],
            color='cyan', label='left eye')
plt.legend()
Here is a fake dataframe you may use, in case you need to do some testing. (My data is of the following format; you may use the last two lines in the code chunk to create the dataframe)
timestamp right_eye_x right_eye_y left_eye_x left_eye_y
15 54 22 28 19
20 56 21 29 21
25 59 16 28 16
30 58 18 31 18
35 62 15 33 14
data = {'timestamp':[15,20,25,30,35],
'right_eye_x':[54, 56, 59, 58, 62],
'right_eye_y':[22, 21, 16, 18, 15],
'left_eye_x':[28, 29, 22, 31, 33],
'left_eye_y':[19, 21, 16, 18, 14]}
tmp_df = pd.DataFrame(data)
I saw this post: Matplotlib python connect two scatter plots with lines for each pair of (x,y) values?
but I am still very confused.
I would appreciate any insights! Thank you!
(If you find any part confusing, please let me know!)

Use the solution from the comments of the post you cite: draw one two-point line for each pair of points.
import matplotlib.pyplot as plt
x1 = [0.19, 0.15, 0.13, 0.25]
x2 = [0.18, 0.5, 0.13, 0.25]
y1 = [0.1, 0.15, 0.3, 0.2]
y2 = [0.85, 0.76, 0.8, 0.9]
for i in range(len(x1)):
    # line from point i of the first set to point i of the second set
    plt.plot([x1[i], x2[i]], [y1[i], y2[i]])
plt.show()
You can set labels, colors and other line properties as described at https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html
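Applied to the dataframe from the question, the same idea might look like the sketch below. It assumes that row i of the left-eye columns should be joined to row i of the right-eye columns; the grey color, the `zorder` and the `connectors` list are illustrative choices that are not part of the original answer.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Fake data copied from the question
data = {'timestamp': [15, 20, 25, 30, 35],
        'right_eye_x': [54, 56, 59, 58, 62],
        'right_eye_y': [22, 21, 16, 18, 15],
        'left_eye_x': [28, 29, 22, 31, 33],
        'left_eye_y': [19, 21, 16, 18, 14]}
tmp_df = pd.DataFrame(data)

fig, ax = plt.subplots()
ax.scatter(tmp_df['right_eye_x'], tmp_df['right_eye_y'],
           color='green', label='right eye')
ax.scatter(tmp_df['left_eye_x'], tmp_df['left_eye_y'],
           color='cyan', label='left eye')

# One two-point line per row: left-eye point -> right-eye point.
# `connectors` is a hypothetical name, kept only so the lines can be inspected.
connectors = []
for row in tmp_df.itertuples(index=False):
    line, = ax.plot([row.left_eye_x, row.right_eye_x],
                    [row.left_eye_y, row.right_eye_y],
                    color='grey', linewidth=1, zorder=0)
    connectors.append(line)

ax.legend()
```

Calling plt.show() afterwards displays the figure. Passing zorder=0 draws the connecting lines underneath the scatter markers.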

Related

How to make a multi-level chart column label by hue

This is a continuation of this question. But now I have a bar-chart with hue.
Here's what I have:
df = pd.DataFrame({'age': ['20-30', '20-30', '20-30', '30-40', '30-40', '30-40', '40-50', '40-50', '40-50', '50-60', '50-60', '50-60'],
'expenses':['50$', '100$', '200$', '50$', '100$', '200$', '50$', '100$', '200$', '50$', '100$', '200$'],
'users': [59, 42, 57, 68, 47, 98, 75, 73, 54, 81, 52, 43],
'buyers': [22, 35, 18, 27, 12, 57, 19, 29, 31, 47, 10, 5],
'percentage': [37.2881, 83.3333, 31.5789, 39.7058, 25.5319, 58.1632, 25.3333, 39.7260, 57.4074, 58.0246, 19.2307, 11.6279]})
index  age    expenses  users  buyers  percentage
0      20-30  50$       59     22      37.2881
1      20-30  100$      42     35      83.3333
2      20-30  200$      57     18      31.5789
3      30-40  50$       68     27      39.7058
4      30-40  100$      47     12      25.5319
5      30-40  200$      98     57      58.1632
6      40-50  50$       75     19      25.3333
7      40-50  100$      73     29      39.726
8      40-50  200$      54     31      57.4074
9      50-60  50$       81     47      58.0246
10     50-60  100$      52     10      19.2307
11     50-60  200$      43     5       11.6279
fig, ax = plt.subplots(figsize=(20, 10))
# Plot the all users
sns.barplot(x='age', y='users', data=df, hue='expenses', palette='Blues', edgecolor='grey', alpha=0.7, ax=ax)
# Plot the buyers
sns.barplot(x='age', y='buyers', data=df, hue='expenses', palette='Blues', edgecolor='darkgrey', hatch='//', ax=ax)
plt.show()
I need to get the same annotated chart as before. With hue, however, the code:
# extract the separate containers
c1, c2 = ax.containers
# annotate with the users values
ax.bar_label(c1, fontsize=13)
# annotate with the buyer and percentage values
l2 = [f"{v.get_height()}: {df.loc[i, 'percentage']}%" for i, v in enumerate(c2)]
ax.bar_label(c2, labels=l2, fontsize=8, label_type='center', fontweight='bold')
no longer works.
I would be glad for any hints.
Each object in ax.containers represents the bars for a single hue group.
When using bar_label, the annotations for each bar in '50$', then '100$', and then '200$' are added.
I think it's easier to select the correct data by annotating the 'buyers' group separately.
The answer to your previous question selects the data from the entire dataframe, but here Boolean indexing is used to select only a segment of the dataframe. Using print(data) in each loop will help with understanding.
fig, ax = plt.subplots(figsize=(20, 10))
# plot the all users
sns.barplot(x='age', y='users', data=df, hue='expenses', palette='Blues', edgecolor='grey', alpha=0.7, ax=ax)
# annotate the bars in the 3 containers (1 container per hue group)
for c in ax.containers:
    ax.bar_label(c)
# plot the 'buyers', which adds 3 more containers to ax
sns.barplot(x='age', y='buyers', data=df, hue='expenses', palette='Blues', edgecolor='darkgrey', hatch='//', ax=ax)
# iterate through the last 3 new containers containing the hatched groups
for c in ax.containers[3:]:
    # get the hue label, which will be used to select the data group
    hue_label = c.get_label()
    # select the data based on hue_label
    data = df.loc[df.expenses.eq(hue_label), ['buyers', 'percentage']]
    # customize the labels
    labels = [f"{v.get_height()}: {data.iloc[i, 1]:0.2f}%" for i, v in enumerate(c)]
    # add the labels
    ax.bar_label(c, labels=labels)
plt.show()
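To see what the Boolean indexing selects for each hue group without drawing anything, a small sketch like the following may help. It rebuilds only the relevant columns of the dataframe from the question and builds the label text from the 'buyers' column instead of the bar heights, which is an assumption made for illustration; `labels_by_hue` is a hypothetical name.

```python
import pandas as pd

# Relevant columns of the dataframe from the question
df = pd.DataFrame({
    'age': ['20-30'] * 3 + ['30-40'] * 3 + ['40-50'] * 3 + ['50-60'] * 3,
    'expenses': ['50$', '100$', '200$'] * 4,
    'buyers': [22, 35, 18, 27, 12, 57, 19, 29, 31, 47, 10, 5],
    'percentage': [37.2881, 83.3333, 31.5789, 39.7058, 25.5319, 58.1632,
                   25.3333, 39.7260, 57.4074, 58.0246, 19.2307, 11.6279],
})

labels_by_hue = {}
for hue_label in df['expenses'].unique():  # '50$', '100$', '200$'
    # Rows of one hue group only, in the same age order as the bars
    data = df.loc[df['expenses'].eq(hue_label), ['buyers', 'percentage']]
    labels_by_hue[hue_label] = [
        f"{buyers}: {pct:0.2f}%" for buyers, pct in data.itertuples(index=False)
    ]
    print(hue_label, labels_by_hue[hue_label])
```

Each list has one entry per age group, which is why it lines up with the bars of a single container.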

Matplotlib error plotting interval bins for discretized values from pandas dataframe

I get an error when I try to plot an interval column.
I binned my age column into intervals, and now I want to chart revenue against the age interval.
My code:
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
clients['tranche'] = pd.cut(clients.age, bins)
clients.head()
client_id sales revenue birth age sex tranche
0 c_1 39 558.18 1955 66 m (60, 70]
1 c_10 58 1353.60 1956 65 m (60, 70]
2 c_100 8 254.85 1992 29 m (20, 30]
3 c_1000 125 2261.89 1966 55 f (50, 60]
4 c_1001 102 1812.86 1982 39 m (30, 40]
# Plot a scatter tranche x revenue
df = clients.groupby('tranche')[['revenue']].sum().reset_index().copy()
plt.scatter(df.tranche, df.revenue)
plt.show()
But an error appears, ending with:
TypeError: float() argument must be a string or a number, not 'pandas._libs.interval.Interval'
How can I use intervals for plotting?
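You'll need to pass labels to pd.cut. (I tried to convert the intervals to str using .astype(str), but that did not seem to work in 3.9.)
If you do the following, it will work just fine. Note that pd.cut needs exactly one label per interval, i.e. one fewer than the number of bin edges:
labels = [f'{lo}-{hi}' for lo, hi in zip(bins[:-1], bins[1:])]  # '10-20', '20-30', ..., '90-100'
clients['tranche'] = pd.cut(clients.age, bins, labels=labels)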
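As a rough end-to-end sketch, the five rows shown in the question can be binned with string labels and then aggregated. The dataframe below is rebuilt from that printed sample only, and passing observed=True (to drop empty age groups) is a choice made for this illustration, not something the original answer specifies.

```python
import pandas as pd

# The five sample rows printed in the question
clients = pd.DataFrame({
    'client_id': ['c_1', 'c_10', 'c_100', 'c_1000', 'c_1001'],
    'revenue': [558.18, 1353.60, 254.85, 2261.89, 1812.86],
    'age': [66, 65, 29, 55, 39],
})

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# One label per interval: len(labels) == len(bins) - 1
labels = [f'{lo}-{hi}' for lo, hi in zip(bins[:-1], bins[1:])]
clients['tranche'] = pd.cut(clients['age'], bins, labels=labels)

# Sum revenue per age group, keeping only groups that actually occur
summary = (clients.groupby('tranche', observed=True)['revenue']
                  .sum()
                  .reset_index())
# Plain strings instead of a categorical column
summary['tranche'] = summary['tranche'].astype(str)
print(summary)
```

Because the tranche column now holds plain strings, it can be passed to plt.scatter(summary.tranche, summary.revenue) as a categorical x-axis.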

Scatter Plot from pandas table in Python

New student to python and struggling with a task at the moment. I'm trying to publish a scatter plot from some data in a pandas table and can't seem to work it out.
Here is a sample of my data set:
import pandas as pd
data = {'housing_age': [14, 11, 3, 4],
'total_rooms': [25135, 32627, 39320, 37937],
'total_bedrooms': [4819, 6445, 6210, 5471],
'population': [35682, 28566, 16305, 16122]}
df = pd.DataFrame(data)
I'm trying to draw a scatter plot of the data in housing_age, but I'm having some difficulty figuring it out.
Initially I tried to make the x-axis 'housing_data' and the y-axis a count of housing data, but couldn't get the code to work. Then I read somewhere that the x-axis should be a variable and the y-axis a constant, so I tried this code:
x='housing_data'
y=[0,5,10,15,20,25,30,35,40,45,50,55]
plt.scatter(x,y)
ax.set_xlabel("Number of buildings")
ax.set_ylabel("Age of buildings")
but get this error:
ValueError: x and y must be the same size
Note - the data in 'housing_data' ranges from 1-53 years.
I imagine this should be a pretty easy thing, but for some reason I can't figure it out.
Does anyone have any tips?
I understand you are starting so confusion is common. Please bear with me.
From your description, it looks like you swapped x and y:
# x is the categories: 0-5 yrs, 5-10 yrs, ...
x = [0,5,10,15,20,25,30,35,40,45,50,55]
# y is the number of observations in each category
# I just assigned it some random numbers
y = [78, 31, 7, 58, 88, 43, 47, 87, 91, 87, 36, 78]
plt.scatter(x,y)
plt.title('Housing Data')
Generally, if you have a list of observations and you want to count them across a number of categories, it's called a histogram. pandas has many convenient functions to give you a quick look at the data. The one of interest for this question is hist - create a histogram:
# A series of 100 random buildings whose age is between 1 and 54 (randint's upper bound is exclusive)
buildings = pd.Series(np.random.randint(1, 55, 100))
# Make a histogram with 10 bins
buildings.hist(bins=10)
# The edges of those bins were determined automatically so they appear a bit weird:
pd.cut(buildings, bins=10)
0 (22.8, 28.0]
1 (7.2, 12.4]
2 (33.2, 38.4]
3 (38.4, 43.6]
4 (48.8, 54.0]
...
95 (48.8, 54.0]
96 (22.8, 28.0]
97 (12.4, 17.6]
98 (43.6, 48.8]
99 (1.948, 7.2]
You can also set the bins explicitly: 0-5, 5-10, ..., 50-55
buildings.hist(bins=range(0,60,5))
# Then the edges are no longer random
pd.cut(buildings, bins=range(0,60,5))
0 (25, 30]
1 (5, 10]
2 (30, 35]
3 (40, 45]
4 (45, 50]
...
95 (45, 50]
96 (25, 30]
97 (15, 20]
98 (40, 45]
99 (0, 5]
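To get the counts per bin as numbers rather than as a plot, pd.cut can be combined with value_counts. The sketch below uses the four housing_age values from the question; the string labels and the variable names are assumptions made for illustration.

```python
import pandas as pd

# housing_age values from the question's sample data
ages = pd.Series([14, 11, 3, 4], name='housing_age')

edges = list(range(0, 60, 5))  # 0, 5, ..., 55
labels = [f'{lo}-{hi}' for lo, hi in zip(edges[:-1], edges[1:])]

# Assign each age to its bin, then count the members of every bin
binned = pd.cut(ages, bins=edges, labels=labels)
counts = binned.value_counts(sort=False)
print(counts)
```

These counts are what the histogram bars represent, so something like plt.scatter(counts.index.astype(str), counts.values) would give the "age group versus number of buildings" scatter plot described above.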

How can I iterate through a CSV file and plot it to a boxplot by each column representing a second in Python?

Say I have a csv file like so:
20 30 33 54 12 56
90 54 66 12 88 11
33 22 63 86 12 65
11 44 65 34 23 26
I want to create a boxplot where each column is one second on the x-axis, with the actual data on the y-axis. So 20, 90, 33, 11 form the box at 1 second, 30, 54, 22, 44 the box at 2 seconds, and so on. The real CSV file has more data than this, and I don't know in advance how many columns there are, so I can't hard-code anything.
This is what I have so far:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('/user/Desktop/test.csv', header = None)
fig = plt.figure()
ax = fig.add_subplot()
plt.xlabel('Time (s)')
plt.ylabel('ms')
df.boxplot()
plt.show()
Try this:
axes = df.groupby(df.columns//10, axis=1).boxplot(subplots=True,
                                                   figsize=(12,18))
plt.xlabel('Time (s)')
plt.ylabel('ms')
plt.show()
Output:
If you want to set y limits of the subplots:
for ax in axes.flatten():
    ax.set_ylim(0,100)

scipy.signal: filtering variable-time dataset

I have a large dataset of the form [t, y(t)] to which I want to apply an IIR low-pass filter (first- or second-order Butterworth should suffice) using scipy.signal (in particular scipy.signal.butter and scipy.signal.filtfilt). The problem is that t is not regularly spaced, which appears to be a requirement for the functions in scipy.signal.
For any "missing" points, I know that my signal remains unchanged from its previous value (so given two consecutive points t1 and t2 in my t-data and a point T not in the data, such that t1<T<t2, the "real" function Y(t) which I'm sampling would take the value Y(T)=Y(t1)). t is integer-valued, so I could simply add the missing points, but this would cause the size of my dataset to grow by a factor ~10, which is problematic given that it's already very large.
So the question is, is there a (sufficiently simple and low-overhead) way to filter my dataset without inserting all "missing" points?
You can efficiently "wrap" your data into a function.
If your data is in the form of a list of lists then you'll need to convert it into a dict and to create a sorted list of your t values. Then you can interpolate the missing values using the list bisection algorithm in the bisect module.
Here's some demo code written in Python 2, but it should be straight-forward to convert it to Python 3, if required.
from random import seed, sample
from bisect import bisect

#Create some fake data
seed(37)
data = dict((u, u/10.) for u in sample(xrange(50), 25))
keys = data.keys()
keys.sort()
print keys

def interp(t):
    i = bisect(keys, t)
    k = keys[max(0, i-1)]
    return data[k]

for i in xrange(50):
    print i, interp(i)
output
[2, 4, 8, 10, 14, 15, 19, 21, 22, 23, 26, 27, 29, 30,
32, 33, 34, 35, 37, 38, 39, 42, 43, 44, 48]
0 0.2
1 0.2
2 0.2
3 0.2
4 0.4
5 0.4
6 0.4
7 0.4
8 0.8
9 0.8
10 1.0
11 1.0
12 1.0
13 1.0
14 1.4
15 1.5
16 1.5
17 1.5
18 1.5
19 1.9
20 1.9
21 2.1
22 2.2
23 2.3
24 2.3
25 2.3
26 2.6
27 2.7
28 2.7
29 2.9
30 3.0
31 3.0
32 3.2
33 3.3
34 3.4
35 3.5
36 3.5
37 3.7
38 3.8
39 3.9
40 3.9
41 3.9
42 4.2
43 4.3
44 4.4
45 4.4
46 4.4
47 4.4
48 4.8
49 4.8
(I manually wrapped the output of keys to make it easier to read without horizontal scrolling).
You'll get a tiny speedup by re-writing the body of the interpolation function in one line:
def interp(t):
    return data[keys[max(0, bisect(keys, t)-1)]]
It's much less readable, IMHO, but the speed difference may be worth it if the function gets called a lot.
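For reference, a Python 3 version of the same lookup might look like this. It is only a sketch: the small `samples` dict is made-up illustrative data rather than the random data from the demo, and `bisect_right` is the function that the name `bisect` refers to.

```python
from bisect import bisect_right

# Made-up sparse samples {t: y(t)} with integer t (illustrative only)
samples = {2: 0.2, 4: 0.4, 8: 0.8, 10: 1.0}
keys = sorted(samples)  # sorted list of the known t values

def interp(t):
    """Return the value at the latest sample time <= t.

    Times before the first sample are clamped to the first sample.
    """
    i = bisect_right(keys, t)
    return samples[keys[max(0, i - 1)]]

for t in range(12):
    print(t, interp(t))
```

Each lookup costs O(log n) after the one-time sort, so the missing points are never materialized.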
The answer by PM 2Ring works, but assuming that your data are already ordered by t, it is less efficient than possible. It takes log-linear time and linear additional space. You can write a generator that produces a transformed dataset with regular sampling intervals in linear time and constant additional space:
# Assumes that dataset rows are lists as described in the question:
# [[t1, Y(t1)], [t2, Y(t2)], [t3, Y(t3)], ..., [tz, Y(tz)]]
# If this assumption is wrong, just extract t and Y(t) in another way.
# The generated range starts at t1 and ends directly after tz.
# Warning: will overgenerate points if the data are more densely sampled
# than the requested sampling interval.
def step_interpolate(dataset, interval):
    left = next(dataset) # [t1, Y(t1)]
    right = next(dataset) # [t2, Y(t2)]
    t_regular = left[0]
    while True:
        if left is right: # same list object
            right = next(dataset) # iteration stops when dataset stops
        if right[0] <= t_regular:
            left = right
        yield [t_regular, left[1]]
        t_regular += interval
Testing:
data = [[1, 10], [15, 2], [50, 100], [55, 17]]
for item in step_interpolate(iter(data), 10):
    print item[0], item[1]
Output:
1 10
11 10
21 2
31 2
41 2
51 100
61 17
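The generator above is written for Python 2 and ends when next(dataset) raises StopIteration. Since Python 3.7 (PEP 479), a StopIteration that escapes a generator body is turned into a RuntimeError, so a Python 3 port has to catch it and return explicitly. A possible sketch, keeping the original logic and only adding that handling (the `rows` and `result` names are illustrative):

```python
def step_interpolate(dataset, interval):
    """Yield [t, y] on a regular grid, holding the last known y value."""
    rows = iter(dataset)
    try:
        left = next(rows)   # [t1, Y(t1)]
        right = next(rows)  # [t2, Y(t2)]
    except StopIteration:
        return              # fewer than two rows: nothing to generate
    t_regular = left[0]
    while True:
        if left is right:   # same list object: fetch the next row
            try:
                right = next(rows)
            except StopIteration:
                return      # input exhausted, end the generator cleanly
        if right[0] <= t_regular:
            left = right
        yield [t_regular, left[1]]
        t_regular += interval

data = [[1, 10], [15, 2], [50, 100], [55, 17]]
result = list(step_interpolate(data, 10))
for t, y in result:
    print(t, y)
```

This prints the same seven rows as the Python 2 test. The y column of the regularly sampled output can then be handed to the filtering routine.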
