I'm new to Python and struggling with a task at the moment. I'm trying to produce a scatter plot from some data in a pandas DataFrame and can't seem to work it out.
Here is a sample of my data set:
import pandas as pd
data = {'housing_age': [14, 11, 3, 4],
'total_rooms': [25135, 32627, 39320, 37937],
'total_bedrooms': [4819, 6445, 6210, 5471],
'population': [35682, 28566, 16305, 16122]}
df = pd.DataFrame(data)
I'm trying to draw a scatter plot of the data in housing_age, but I'm having some difficulty figuring it out.
Initially tried for x axis to be 'housing_data' and the y axis to be a count of housing data, but couldn't get the code to work. Then read somewhere that x-axis should be variable, and y-axis should be constant, so tried this code:
x='housing_data'
y=[0,5,10,15,20,25,30,35,40,45,50,55]
plt.scatter(x,y)
ax.set_xlabel("Number of buildings")
ax.set_ylabel("Age of buildings")
but get this error:
ValueError: x and y must be the same size
Note - the data in 'housing_data' ranges from 1-53 years.
I imagine this should be a pretty easy thing, but for some reason I can't figure it out.
Does anyone have any tips?
I understand you are starting so confusion is common. Please bear with me.
From your description, it looks like you swapped x and y:
# x is the categories: 0-5 yrs, 5-10 yrs, ...
x = [0,5,10,15,20,25,30,35,40,45,50,55]
# y is the number of observations in each category
# I just assigned it some random numbers
y = [78, 31, 7, 58, 88, 43, 47, 87, 91, 87, 36, 78]
plt.scatter(x,y)
plt.title('Housing Data')
Generally, if you have a list of observations and you want to count them across a number of categories, it's called a histogram. pandas has many convenient functions to give you a quick look at the data. The one of interest for this question is hist - create a histogram:
# A series of 100 random buildings whose age is between 1 and 54 (randint's upper bound is exclusive)
buildings = pd.Series(np.random.randint(1, 55, 100))
# Make a histogram with 10 bins
buildings.hist(bins=10)
# The edges of those bins were determined automatically so they appear a bit weird:
pd.cut(buildings, bins=10)
0 (22.8, 28.0]
1 (7.2, 12.4]
2 (33.2, 38.4]
3 (38.4, 43.6]
4 (48.8, 54.0]
...
95 (48.8, 54.0]
96 (22.8, 28.0]
97 (12.4, 17.6]
98 (43.6, 48.8]
99 (1.948, 7.2]
You can also set the bins explicitly: 0-5, 5-10, ..., 50-55
buildings.hist(bins=range(0,60,5))
# Then the edges fall on the round numbers you chose
pd.cut(buildings, bins=range(0,60,5))
0 (25, 30]
1 (5, 10]
2 (30, 35]
3 (40, 45]
4 (45, 50]
...
95 (45, 50]
96 (25, 30]
97 (15, 20]
98 (40, 45]
99 (0, 5]
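Applied to the sample df from the question, the counting step could be sketched as below. This is a minimal sketch, assuming 5-year bins and using np.histogram to do the counting; the names edges, counts, x and y are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# Sample column from the question.
df = pd.DataFrame({'housing_age': [14, 11, 3, 4]})

# Assumed bin edges: 0-5, 5-10, ..., 50-55.
edges = np.arange(0, 60, 5)

# counts[i] is the number of ages in [edges[i], edges[i+1]);
# only the last bin also includes its right edge.
counts, _ = np.histogram(df['housing_age'], bins=edges)

# x and y now have the same length, which is what plt.scatter requires.
x = edges[:-1]
y = counts
print(list(zip(x.tolist(), y.tolist())))
```

Passing x and y to plt.scatter(x, y) then draws one point per age category, which avoids the "x and y must be the same size" error.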
I created two scatterplots and put them on the same graph. I also want to match the points of the two scatterplots (note that the two scatterplots have the same number of points).
My current code is provided below, and the plot I want to get is sketched at the bottom of this post.
plt.scatter(tmp_df['right_eye_x'], tmp_df['right_eye_y'],
color='green', label='right eye')
plt.scatter(tmp_df['left_eye_x'], tmp_df['left_eye_y'],
color='cyan', label='left eye')
plt.legend()
Here is a fake dataframe you may use, in case you need to do some testing. (My data is of the following format; you may use the last two lines in the code chunk to create the dataframe)
timestamp right_eye_x right_eye_y left_eye_x left_eye_y
15 54 22 28 19
20 56 21 29 21
25 59 16 28 16
30 58 18 31 18
35 62 15 33 14
data = {'timestamp':[15,20,25,30,35],
'right_eye_x':[54, 56, 59, 58, 62],
'right_eye_y':[22, 21, 16, 18, 15],
'left_eye_x':[28, 29, 22, 31, 33],
'left_eye_y':[19, 21, 16, 18, 14]}
tmp_df = pd.DataFrame(data)
I saw this post: Matplotlib python connect two scatter plots with lines for each pair of (x,y) values?
but I am still very confused.
I would appreciate any insights! Thank you!
(If you find any part confusing, please let me know!)
Use the solution from the comments that is shown in the post you cite.
import matplotlib.pyplot as plt
import numpy as np
x1 = [0.19, 0.15, 0.13, 0.25]
x2 = [0.18, 0.5, 0.13, 0.25]
y1 = [0.1, 0.15, 0.3, 0.2]
y2 = [0.85, 0.76, 0.8, 0.9]
for i in range(len(x1)):
    plt.plot([x1[i],x2[i]], [y1[i],y2[i]])
You can set labels, colors and other styling as described at https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html
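For the dataframe in the question, the same pairing could be sketched as follows. This is only a sketch: it builds the pairs of coordinates from the fake tmp_df given in the question, and the name segments is an illustrative choice.

```python
import pandas as pd

# Fake dataframe from the question.
tmp_df = pd.DataFrame({
    'timestamp': [15, 20, 25, 30, 35],
    'right_eye_x': [54, 56, 59, 58, 62],
    'right_eye_y': [22, 21, 16, 18, 15],
    'left_eye_x': [28, 29, 22, 31, 33],
    'left_eye_y': [19, 21, 16, 18, 14],
})

# One ([x_right, x_left], [y_right, y_left]) pair per row, in the
# argument order that plt.plot expects.
segments = [
    ([rx, lx], [ry, ly])
    for rx, ry, lx, ly in zip(tmp_df['right_eye_x'].tolist(),
                              tmp_df['right_eye_y'].tolist(),
                              tmp_df['left_eye_x'].tolist(),
                              tmp_df['left_eye_y'].tolist())
]
print(segments[0])  # ([54, 28], [22, 19])
```

Looping over segments and calling plt.plot(xs, ys, color='gray') for each pair, alongside the two plt.scatter calls, should draw one connecting line per timestamp.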
An error is raised when I try to plot an interval.
I created intervals for my age column, and now I want to show on a chart how the age intervals compare to revenue.
My code:
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
clients['tranche'] = pd.cut(clients.age, bins)
clients.head()
client_id sales revenue birth age sex tranche
0 c_1 39 558.18 1955 66 m (60, 70]
1 c_10 58 1353.60 1956 65 m (60, 70]
2 c_100 8 254.85 1992 29 m (20, 30]
3 c_1000 125 2261.89 1966 55 f (50, 60]
4 c_1001 102 1812.86 1982 39 m (30, 40]
# Plot a scatter tranche x revenue
df = clients.groupby('tranche')[['revenue']].sum().reset_index().copy()
plt.scatter(df.tranche, df.revenue)
plt.show()
But an error appears, ending with
TypeError: float() argument must be a string or a number, not 'pandas._libs.interval.Interval'
How can I use an interval for plotting?
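You'll need to add labels, one per bin (that is, one fewer than the number of bin edges). (I tried to convert the intervals to str using .astype(str), but that does not seem to work in 3.9.)
If you do the following, it will work just fine:
labels = ['10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90', '90-100']
clients['tranche'] = pd.cut(clients.age, bins, labels=labels)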
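To avoid typing the labels by hand, they can be derived from the bin edges. The following is a minimal sketch using only the age and revenue columns of the five sample rows shown in the question; the observed=True argument and the names totals, x and y are my own choices, not part of the original answer.

```python
import pandas as pd

# The five sample rows from the question (age and revenue only).
clients = pd.DataFrame({
    'age': [66, 65, 29, 55, 39],
    'revenue': [558.18, 1353.60, 254.85, 2261.89, 1812.86],
})

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# One string label per bin: '10-20', '20-30', ..., '90-100'.
labels = [f'{lo}-{hi}' for lo, hi in zip(bins[:-1], bins[1:])]
clients['tranche'] = pd.cut(clients['age'], bins, labels=labels)

# Sum revenue per age bracket, keeping only brackets that occur in the data.
totals = clients.groupby('tranche', observed=True)['revenue'].sum()

x = totals.index.astype(str).tolist()  # plain strings for the x axis
y = totals.tolist()
print(x)  # ['20-30', '30-40', '50-60', '60-70']
```

Because x holds plain strings rather than Interval objects, plt.scatter(x, y) treats it as a categorical axis.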
I have data points as dataframe just like represented in figure1
sample data
df=
74 34
74.5 34.5
75 34.5
75 34
74.5 34
74 34.5
76 34
76 34.5
74.5 34
74 34.5
75.5 34.5
75.5 34
75 34
75 34.5
I want to add random points in between those points but keep the shape of the initial points.
The desired output is something like figure 2 (black dots represent the random points, and the red line represents the boundary).
Any suggestions? I am looking for a general solution, since the geometry of the outer boundary will change from problem to problem.
Interpolation might be worth looking into:
import numpy as np
# lets suppose y = 2x and x[i], y[i] is a data point
x = [1, 5, 16, 20, 60]
y = [2, 10, 32, 40, 120]
interp_here = [7, 8, 9, 21] # the points where we are going to interpolate values.
print(np.interp(interp_here, x, y)) ## [14. 16. 18. 42.]
If you want random points, then you could use the above as a guide line and then for each point add/subtract some delta.
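The "interpolate, then add a delta" idea could be sketched like this. It is only a sketch under assumptions: the value of delta, the number of points and the seeded generator are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Same guide line as above: y = 2x.
x = [1, 5, 16, 20, 60]
y = [2, 10, 32, 40, 120]

# Random x positions inside the known range, then interpolated y values.
new_x = rng.uniform(min(x), max(x), size=10)
new_y = np.interp(new_x, x, y)

# Shift each point by a random offset of at most delta in each direction.
delta = 0.5  # hypothetical maximum offset
jitter_x = new_x + rng.uniform(-delta, delta, size=new_x.shape)
jitter_y = new_y + rng.uniform(-delta, delta, size=new_y.shape)
```

A smaller delta keeps the random points closer to the guide line; a larger one scatters them further away from it.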
If the shape is convex it is pretty simple:
def get_random_point(points):
    point_selectors = np.random.randint(0, len(points), 2)
    scale = np.random.rand()  # value between 0 and 1
    start_point = points[point_selectors[0]]
    end_point = points[point_selectors[1]]
    return start_point + (end_point - start_point) * scale
The shape you have specified is not convex. Without additional information, such as which points make up the exterior of your shape, or a constraint like only allowing boundary lines parallel to the x or y axis, the shape is not sufficiently specified mathematically.
Final remark: There are algorithms which can check whether a point is within a polygon (Point in polygon).
You can then 1) specify the bounding polygon, 2) generate a point within the bounding rectangle of your shape, and 3) test whether the point lies within your polygon.
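Those three steps could be sketched as follows, using a ray-casting point-in-polygon test and rejection sampling. This is a sketch under assumptions: the L-shaped boundary is a hypothetical polygon chosen for illustration (the question does not say which points form the exterior), and the function names are my own.

```python
import random

def point_in_polygon(x, y, polygon):
    """Ray casting: count the edges crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def random_points_in_polygon(polygon, count, rng):
    """Draw points in the bounding rectangle and keep those inside the polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < count:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            points.append((x, y))
    return points

# Hypothetical non-convex (L-shaped) boundary, vertices listed in order.
boundary = [(74, 34), (76, 34), (76, 34.25), (75, 34.25), (75, 34.5), (74, 34.5)]
points = random_points_in_polygon(boundary, 20, random.Random(0))
```

Rejection sampling gets slow when the polygon covers only a small part of its bounding rectangle, but it works for any simple polygon once the boundary vertices are known.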
I'm working with eBay, and I have a list of 100 prices of items that sold. What I want to do is separate the floating-point prices into groups, and then count up the groups to sort of determine the most common general price for this item, so I can automate the pricing of my own item.
Initially, I thought to separate prices by the $10 value, but I realized that this isn't a good method of grouping because prices can vary greatly because of outliers, unrelated items, etc.
If I have a list of prices like so: [90, 92, 95, 99, 1013, 1100]
my desire is for the application to separate the values into:
{nineties:4, thousands:2}
But I'm just not sure how to tell Python to do this. Preferably, the more simply I can integrate the snippet into my code, the better!
Any help or suggestions would be appreciated!
The technique you use depends on your notion of what a group is.
If the number of groups is known, use kmeans with k==2. See this link for working code in pure Python:
from kmeans import k_means, assign_data
prices = [90, 92, 95, 99, 1013, 1100]
points = [(x,) for x in prices]
centroids = k_means(points, k=2)
labeled = assign_data(centroids, points)
for centroid, group in labeled.items():
    print('Group centered around:', centroid[0])
    print([x for (x,) in group])
    print()
This outputs:
Group centered around: 94.0
[90, 92, 95, 99]
Group centered around: 1056.5
[1013, 1100]
Alternatively, if a fixed maximum distance between elements defines the groupings, then just sort and loop over the elements, checking the distance between them to see whether a new group has started:
max_gap = 100
prices.sort()
groups = []
last_price = prices[0] - (max_gap + 1)
for price in prices:
    if price - last_price > max_gap:
        groups.append([])
    groups[-1].append(price)
    last_price = price
print(groups)
This outputs:
[[90, 92, 95, 99], [1013, 1100]]
Naive approach to point in the right direction:
>>> from math import log10
>>> from collections import Counter
>>> def f(i):
...     x = 10**int(log10(i))  # largest power of 10 (1, 10, 100, ...) <= i
...     return i // x * x
...
>>> lst = [90, 92, 95, 99, 1013, 1100]
>>> c = Counter(map(f, lst))
>>> c
Counter({90: 4, 1000: 2})
If your buckets are somewhat arbitrary in size (like between 55 and 95 and between 300 and 366), you can use a binning approach to classify each value into a bin range. The cut-offs for the bins can be anything you want, so long as they increase from left to right.
Assume these bin values:
bins=[0,100,1000,10000]
Then:
[0, 100, 1000, 10000]
bin 1: 0 <= x < 100
bin 2: 100 <= x < 1000
bin 3: 1000 <= x < 10000
You can use numpy digitize to do this:
import numpy as np
bins=np.array([0.0,100,1000,10000])
prices=np.array([90, 92, 95, 99, 1013, 1100])
inds=np.digitize(prices,bins)
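As a quick check of what np.digitize returns here, the sketch below prints the bin index of each price and then counts the prices per bin. The use of np.bincount for the counting step is my own addition, not part of the original answer.

```python
import numpy as np

bins = np.array([0.0, 100, 1000, 10000])
prices = np.array([90, 92, 95, 99, 1013, 1100])

# With the default right=False, index i means bins[i-1] <= price < bins[i].
inds = np.digitize(prices, bins)
print(inds.tolist())  # [1, 1, 1, 1, 3, 3]

# counts[i] is the number of prices that landed in bin i (index 0 = below bins[0]).
counts = np.bincount(inds, minlength=len(bins))
print(counts.tolist())  # [0, 4, 0, 2]
```

Bin 1 holds the four prices in the nineties and bin 3 the two prices in the thousands, which matches the grouping asked for in the question.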
You can also do this in pure Python:
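bins = [0.0, 100, 1000, 10000]
# Materialize the pairs: in Python 3 a bare zip() would be exhausted after the first price
tests = list(zip(bins, bins[1:]))
prices = [90, 92, 95, 99, 1013, 1100]
inds = []
for price in prices:
    if price < min(bins) or price >= max(bins):
        idx = -1  # outside every bin
    else:
        for idx, test in enumerate(tests, 1):
            if test[0] <= price < test[1]:
                break
    inds.append(idx)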
Then classify by bin (from the result of either approach above):
for i, e in enumerate(prices):
    print("{} <= {} < {} bin {}".format(bins[inds[i]-1], e, bins[inds[i]], inds[i]))
0.0 <= 90 < 100 bin 1
0.0 <= 92 < 100 bin 1
0.0 <= 95 < 100 bin 1
0.0 <= 99 < 100 bin 1
1000 <= 1013 < 10000 bin 3
1000 <= 1100 < 10000 bin 3
Then filter out the values of interest (bin 1) versus the outlier (bin 3)
>>> my_prices=[price for price, bin in zip(prices, inds) if bin==1]
>>> my_prices
[90, 92, 95, 99]
I think scatter plots are underrated for this sort of thing. I recommend plotting the distribution of prices, then choosing threshold(s) that look right for your data, then adding any descriptive stats by group that you want.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Reproduce your data
prices = pd.DataFrame(pd.Series([90, 92, 95, 99, 1013, 1100]), columns=['price'])
# Add an arbitrary second column so I have two columns for scatter plot
prices['label'] = 'price'
# jitter=True spreads your data points out horizontally, so you can see
# clearly how much data you have in each group (groups based on vertical space)
sns.stripplot(data=prices, x='label', y='price', jitter=True)
plt.show()
Any number between 200 and 1,000 separates your data nicely. I'll arbitrarily choose 200, maybe you'll choose different threshold(s) with more data.
# Add group labels, Get average by group
prices['price group'] = pd.cut(prices['price'], bins=(0,200,np.inf))
prices['group average'] = prices.groupby('price group')['price'].transform('mean')
price label price group group average
0 90 price (0, 200] 94.0
1 92 price (0, 200] 94.0
2 95 price (0, 200] 94.0
3 99 price (0, 200] 94.0
4 1013 price (200, inf] 1056.5
5 1100 price (200, inf] 1056.5
I have two dataframes:
One: a score card for scoring student marks
Two: a student dataset
I want to apply the score card to a given student dataset to compute scores and aggregate them. I'm trying to develop a generic function that takes the score card and applies it to any student marks dataset.
import pandas as pd
score_card_data = {
'subject_id': ['MATHS', 'SCIENCE', 'ARTS'],
'bin_list': [[0,25,50,75,100], [0,20,40,60,80,100], [0,20,40,60,80,100]],
'bin_value': [[1,2,3,4], [1,2,3,4,5], [3,4,5,6,7] ]}
score_card_data = pd.DataFrame(score_card_data, columns = ['subject_id', 'bin_list', 'bin_value'])
score_card_data
student_scores = {
'STUDENT_ID': ['S1', 'S2', 'S3','S4','S5'],
'MATH_MARKS': [10,15,25,65,75],
'SCIENCE_MARKS': [8,15,20,35,85],
'ARTS_MARKS':[55,90,95,88,99]}
student_scores = pd.DataFrame(student_scores, columns = ['STUDENT_ID', 'MATH_MARKS', 'SCIENCE_MARKS','ARTS_MARKS'])
student_scores
Functions
Define bins
Apply the bins over columns
bins = list(score_card_data.loc[score_card_data['subject_id'] == 'MATHS', 'bin_list'])
student_scores['MATH_SCORE'] = pd.cut(student_scores['MATH_MARKS'],bins, labels='MATHS_MARKS')
Error: ValueError: object too deep for desired array
I'm trying to convert the cell value to a string, but it is detected as an object. Is there any way to resolve this?
How can I make the function more generic?
Thanks
Pari
bins is a list that contains a single list, which is what raises the ValueError. You can just use bins[0] to extract the inner list:
bins[0]
[0, 25, 50, 75, 100]
type(bins[0])
<class 'list'>
student_scores['MATH_SCORE'] = pd.cut(student_scores['MATH_MARKS'], bins[0])
STUDENT_ID MATH_MARKS SCIENCE_MARKS ARTS_MARKS MATH_SCORE
0 S1 10 8 55 (0, 25]
1 S2 15 15 90 (0, 25]
2 S3 25 20 95 (0, 25]
3 S4 65 35 88 (50, 75]
4 S5 75 85 99 (50, 75]
I left out the labels because you'd need to provide a list of four labels given there are five cutoffs / bin edges.
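As for making this generic, one possible sketch loops over the rows of the score card and passes each bin_value list as the labels. Several things here are assumptions rather than part of the question: the column_map dictionary (needed because the subject ids, such as 'MATHS', do not match the column names, such as 'MATH_MARKS'), the function name apply_score_card, the TOTAL_SCORE column, and include_lowest=True so that a mark of exactly 0 still lands in the first bin.

```python
import pandas as pd

score_card_data = pd.DataFrame({
    'subject_id': ['MATHS', 'SCIENCE', 'ARTS'],
    'bin_list': [[0, 25, 50, 75, 100], [0, 20, 40, 60, 80, 100], [0, 20, 40, 60, 80, 100]],
    'bin_value': [[1, 2, 3, 4], [1, 2, 3, 4, 5], [3, 4, 5, 6, 7]],
})
student_scores = pd.DataFrame({
    'STUDENT_ID': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'MATH_MARKS': [10, 15, 25, 65, 75],
    'SCIENCE_MARKS': [8, 15, 20, 35, 85],
    'ARTS_MARKS': [55, 90, 95, 88, 99],
})

# Hypothetical mapping from score card subject ids to marks columns.
column_map = {'MATHS': 'MATH_MARKS', 'SCIENCE': 'SCIENCE_MARKS', 'ARTS': 'ARTS_MARKS'}

def apply_score_card(scores, card, column_map):
    """Add one *_SCORE column per subject plus a TOTAL_SCORE column."""
    out = scores.copy()
    score_cols = []
    for row in card.itertuples(index=False):
        marks_col = column_map[row.subject_id]
        score_col = marks_col.replace('_MARKS', '_SCORE')
        # bin_value has one entry per bin, i.e. one fewer than the bin edges.
        out[score_col] = pd.cut(out[marks_col], bins=row.bin_list,
                                labels=row.bin_value,
                                include_lowest=True).astype(int)
        score_cols.append(score_col)
    out['TOTAL_SCORE'] = out[score_cols].sum(axis=1)
    return out

result = apply_score_card(student_scores, score_card_data, column_map)
print(result[['STUDENT_ID', 'TOTAL_SCORE']])
```

Marks that fall outside every bin would become NaN and make the astype(int) conversion fail, so the bin edges need to cover the full range of possible marks.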