I have a 120x70 matrix from which I want to graph diagonal lines.
For ease of typing, I will explain my problem with a smaller 4x4 matrix.
index  2020  2021  2022  2023
0         1     2     5     7
1         3     5     8    10
0         1     2     5     3
1         3     5     8     4
I now want to graph a diagonal, for example starting at 2021, index 0,
so that I get the following diagonal numbers in a graph: 2, 8, 10
or, if I started at 2020, I would get 1, 5, 5, 4.
Kind regards!
You can do this with a simple for-loop, e.g.:
import numpy as np

matrix = np.zeros((120, 70))  # placeholder; np.array((120, 70)) would only hold the two numbers 120 and 70
graph_points = []
column_index = 0  # Change this to whatever column you want to start at
for i in range(matrix.shape[0]):
    graph_points.append(matrix[i, column_index])
    column_index += 1
    if column_index >= matrix.shape[1]:
        break
# Plot graph_points here
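As an alternative sketch, NumPy's np.diagonal returns the same values without an explicit loop; its offset argument plays the role of the starting column, and it stops at the edge of a non-square matrix on its own. The 4x4 values below are the example from the question, and start_column is a name made up for illustration.

```python
import numpy as np

# The 4x4 example from the question (columns correspond to 2020..2023).
matrix = np.array([
    [1, 2, 5, 7],
    [3, 5, 8, 10],
    [1, 2, 5, 3],
    [3, 5, 8, 4],
])

start_column = 0  # hypothetical name: 0 starts at 2020, 1 at 2021, and so on
graph_points = np.diagonal(matrix, offset=start_column)
print(graph_points.tolist())  # [1, 5, 5, 4]
```

The resulting array can be passed straight to a plotting call in place of the list built by the loop.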
I have a parking lot with cars of different models (nr), and the cars are so closely packed that, for one to get out, some others might need to be moved. It is a little like a 15-puzzle, except that I can take one or more cars out of the parking lot. Ordered_car_List holds the cars that will be picked up today, and they need to be taken out of the parking lot while moving as few non-ordered cars as possible. There are more columns in this DataFrame, but this is the part I can't figure out.
I have a program that works well for small sets of data, but it seems that this is not the pandas way :-)
I have this:
import pandas as pd

cars = pd.DataFrame({'x': [1,1,1,1,1,2,2,2,2],
                     'y': [1,2,3,4,5,1,2,3,4],
                     'order_number':[6,6,7,6,7,9,9,10,12]})
cars['order_number_no_dublicates_down'] = None
Ordered_car_List = [6,9,9,10,28]
i = 0
while i < len(cars):
    temp_val = cars.at[i, 'order_number']
    if temp_val in Ordered_car_List:
        cars.at[i, 'order_number_no_dublicates_down'] = temp_val
        Ordered_car_List.remove(temp_val)
    i += 1
If I use cars.apply(lambda ...), how can I change Ordered_car_List in each iteration?
Is there another approach that I can take?
I found this page, and it made me want to be faster. The lambda approach is in the middle when it comes to speed, but it is still much faster than what I am doing now.
https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
Updating cars
We can vectorize this based on two counters:
cumcount() to cumulatively count each unique value in cars['order_number']
collections.Counter() to count each unique value in Ordered_car_List
from collections import Counter

cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))
#    order_number  cumcount  maxcount
# 0             6         1         1
# 1             6         2         1
# 2             7         1         0
# 3             6         3         1
# 4             7         2         0
# 5             9         1         2
# 6             9         2         2
# 7            10         1         1
# 8            12         1         0
So then we only want to keep cars['order_number'] where cumcount <= maxcount:
either use DataFrame.loc[]
cars.loc[cumcount <= maxcount, 'nodup'] = cars['order_number']
or Series.where()
cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
or Series.mask() with the condition inverted
cars['nodup'] = cars['order_number'].mask(cumcount > maxcount)
Updating Ordered_car_List
The final Ordered_car_List is a Counter() difference:
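Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']
# [6, 9, 9, 10]
# .elements() keeps repeated order numbers; list(Counter) would list each key only once
Ordered_car_List = list((Counter(Ordered_car_List) - Counter(Used_car_List)).elements())
# [28]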
Final output
cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))
cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
#    x  y  order_number  nodup
# 0  1  1             6    6.0
# 1  1  2             6    NaN
# 2  1  3             7    NaN
# 3  1  4             6    NaN
# 4  1  5             7    NaN
# 5  2  1             9    9.0
# 6  2  2             9    9.0
# 7  2  3            10   10.0
# 8  2  4            12    NaN
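Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']
# .elements() keeps repeated order numbers; list(Counter) would list each key only once
Ordered_car_List = list((Counter(Ordered_car_List) - Counter(Used_car_List)).elements())
# [28]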
Timings
Note that your loop is still very fast with small data, but the vectorized counter approach scales much better.
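One way to check the scaling claim yourself is a rough timing sketch like the one below. The function names loop_version and counter_version, the synthetic data sizes and the value ranges are all assumptions made for illustration; the measured times will vary by machine, so none are quoted here.

```python
import time
from collections import Counter

import numpy as np
import pandas as pd


def loop_version(cars, ordered):
    """Row-by-row approach from the question, working on a copy of the list."""
    remaining = list(ordered)
    out = [np.nan] * len(cars)
    for i, value in enumerate(cars['order_number'].tolist()):
        if value in remaining:
            out[i] = value
            remaining.remove(value)
    return pd.Series(out, index=cars.index, dtype='float64')


def counter_version(cars, ordered):
    """Vectorized cumcount/Counter approach described above."""
    cumcount = cars.groupby('order_number').cumcount().add(1)
    maxcount = cars['order_number'].map(Counter(ordered))
    return cars['order_number'].where(cumcount <= maxcount)


# Synthetic data; sizes and ranges are arbitrary assumptions.
rng = np.random.default_rng(0)
cars = pd.DataFrame({'order_number': rng.integers(0, 500, size=5_000)})
ordered = rng.integers(0, 600, size=500).tolist()

for func in (loop_version, counter_version):
    start = time.perf_counter()
    func(cars, ordered)
    print(func.__name__, round(time.perf_counter() - start, 4), 'seconds')

# Both approaches should mark exactly the same rows.
same = np.allclose(loop_version(cars, ordered),
                   counter_version(cars, ordered), equal_nan=True)
print('same result:', same)
```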
I have a dataframe with points on a 2-dimensional plane:
    index          x          y
0       0  -0.032836  49.268820
1       0   4.160005  49.268820
2       0   4.105928  68.330440
3       0  -0.062953  68.342125
4       1   4.166139  49.269398
5       1   8.497650  49.278310
6       1   8.592334  68.336560
7       1   4.041361  68.336560
8       2   8.426349  49.278890
9       2  13.480260  49.278890
10      2  13.446286  68.336560
11      2   8.467557  68.336560
12      3  13.438516  49.278374
13      3  17.356792  49.287285
14      3  17.378400  68.338240
15      3  13.382163  68.333786
16      4  17.295988  49.289800
17      4  21.418156  49.289800
18      4  21.336264  67.359630
19      4  17.313816  67.359630
and I've been trying to find a way to draw lines between the (x,y) coordinates for each index. The resulting plot should be closed rectangles.
Now, I've tried to approach this by defining series:
x = df['x']
y = df['y']
and then
index_l = df.index.tolist()
for i in index_l:
    plt.plot([df.x[i],df.y[i]])
This doesn't work at all. Any ideas on how to proceed? A note: ideally, I would like closed rectangles, but if connecting the points diagonally is easier, I can live with that.
Thankful for any hints or solutions.
You can group by the index and then for x, y values of each group, append the first row to the end so that plt.plot plots a closed rectangle:
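import pandas as pd
import matplotlib.pyplot as plt
for idx, points in df.groupby("index")[["x", "y"]]:
    # DataFrame.append was removed in pandas 2.0; pd.concat repeats the first row instead
    points_to_plot = pd.concat([points, points.iloc[[0]]])
    plt.plot(points_to_plot.x, points_to_plot.y)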
Consider the sorted array a:
a = np.array([0, 2, 3, 4, 5, 10, 11, 11, 14, 19, 20, 20])
If I specified left and right deltas,
delta_left, delta_right = 1, 1
Then this is how I'd expect the clusters to be assigned:
# a =   [  0  .  2  3  4  5  .  .  .  . 10 11  .  . 14  .  .  .  . 19 20
#                                          11                         20
#
#                                      [10--|-12]                 [19--|-21]
#            [1--|--3]                 [10--|-12]                 [19--|-21]
#     [-1--|--1]   [3--|--5]         [9--|-11]                 [18--|-20]
#    +--+--|--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
#               [2--|--4]                       [13--|-15]
#
#          │     ╰───┬────╯              ╰┬─╯        │              ╰┬─╯
#          │     Cluster 2            Cluster 3      │           Cluster 5
#      Cluster 1                                 Cluster 4
NOTE: Although the interval [-1, 1] shares an edge with [1, 3], neither interval contains the other's point, so their clusters are not joined.
Assuming the cluster assignments were stored in an array named clusters, I'd expect the results to look like this
print(clusters)
[1 2 2 2 2 3 3 3 4 5 5 5]
However, suppose I change the left and right deltas to be different:
delta_left, delta_right = 2, 1
This means that a value x should be combined with any other point in the interval [x - 2, x + 1]:
# a =   [  0  .  2  3  4  5  .  .  .  . 10 11  .  . 14  .  .  .  . 19 20
#                                          11                         20
#
#                                    [9-----|-12]              [18-----|-21]
#         [0-----|--3]               [9-----|-12]              [18-----|-21]
#  [-2-----|--1][2-----|--5]      [8-----|-11]              [17-----|-20]
#    +--+--|--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
#            [1-----|--4]                    [12-----|-15]
#
#          ╰──────┬───────╯              ╰┬─╯        │              ╰┬─╯
#             Cluster 1               Cluster 2      │           Cluster 4
#                                                Cluster 3
NOTE: Although the interval [9, 12] shares an edge with [12, 15], neither interval contains the other's point, so their clusters are not joined.
Assuming the cluster assignments were stored in an array named clusters, I'd expect the results to look like this:
print(clusters)
[1 1 1 1 1 2 2 2 3 4 4 4]
We will leverage np.searchsorted and logic to find cluster edges.
First, let's take a closer look at what np.searchsorted does:
Find the indices into a sorted array a such that, if the corresponding elements in v were inserted before the indices, the order of a would be preserved.
I'll call np.searchsorted on a, passing a - delta_left as the values to insert. Let's look at that for delta_left = 1:
# a =
# [ 0  2  3  4  5 10 11 11 14 19 20 20]
#
# a - delta_left
# [-1  1  2  3  4  9 10 10 13 18 19 19]
-1 would get inserted at position 0 to maintain order
1 would get inserted at position 1 to maintain order
2 would get inserted at position 1 as well, indicating that 2 might be in the same cluster as 1
3 would get inserted at position 2 indicating that 3 might be in the same cluster as 2
so on and so forth
What we notice is that only when an element less delta would get inserted at its current position would we consider a new cluster starting.
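The left-hand observation above can be checked directly. This is a small sketch using the array from the question; positions is a name chosen here for illustration.

```python
import numpy as np

a = np.array([0, 2, 3, 4, 5, 10, 11, 11, 14, 19, 20, 20])
delta_left = 1

positions = np.arange(len(a))
# Index at which each a[i] - delta_left would be inserted into a
edge_left = a.searchsorted(a - delta_left)
# A cluster can only start where that index equals the element's own position
starts = edge_left == positions

print(edge_left)           # [0 1 1 2 3 5 5 5 8 9 9 9]
print(starts.astype(int))  # [1 1 0 0 0 1 0 0 1 1 0 0]
```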
We do this again for the right side, with one difference. By default, if a bunch of elements are the same, np.searchsorted inserts in front of them. To identify the ends of clusters, I want to insert after the identical elements, so I'll use the parameter side='right':
If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index. If there is no suitable index, return either 0 or N (where N is the length of a).
Now the logic. A cluster can only begin if a prior cluster has ended, with the exception of the first cluster. We'll then consider a shifted version of the results of our second np.searchsorted
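Here is the right-hand side worked through on the same array, as a sketch; after_right and positions are names chosen here for illustration. The shift by one position is what lets each element ask whether the previous element's reach has already ended.

```python
import numpy as np

a = np.array([0, 2, 3, 4, 5, 10, 11, 11, 14, 19, 20, 20])
delta_right = 1

positions = np.arange(len(a))
# Index just past the last element that a[i] + delta_right still reaches
after_right = a.searchsorted(a + delta_right, side='right')
# Shift right by one (0 in front) so position i sees the reach of element i - 1
edge_right = np.append(0, after_right[:-1])
# The previous cluster has ended at i if that reach stops exactly at i
ends = edge_right == positions

print(after_right)       # [ 1  3  4  5  5  8  8  8  9 12 12 12]
print(ends.astype(int))  # [1 1 0 0 0 1 0 0 1 1 0 0]
```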
Let's now define our function
import numpy as np

def delta_cluster(a, dleft, dright):
    # use to track whether searchsorted results are at correct positions
    rng = np.arange(len(a))
    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng
    # we put 0 in front to shift the results right by one position
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng
    return (starts & ends).cumsum()
demonstration
with left, right deltas equal to 1 and 1
print(delta_cluster(a, 1, 1))
[1 2 2 2 2 3 3 3 4 5 5 5]
with left, right deltas equal to 2 and 1
print(delta_cluster(a, 2, 1))
[1 1 1 1 1 2 2 2 3 4 4 4]
Extra Credit
What if a isn't sorted?
I'll utilize information learned from this post
def delta_cluster(a, dleft, dright):
    s = a.argsort()
    size = s.size
    # y is the inverse permutation of s; it puts the labels back in the original order
    if size > 1000:
        y = np.empty(s.size, dtype=np.int64)
        y[s] = np.arange(s.size)
    else:
        y = s.argsort()
    a = a[s]
    rng = np.arange(len(a))
    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng
    return (starts & ends).cumsum()[y]
demonstration
b = np.random.permutation(a)
print(b)
[14 10 3 11 20 0 19 20 4 11 5 2]
print(delta_cluster(a, 2, 1))
[1 1 1 1 1 2 2 2 3 4 4 4]
print(delta_cluster(b, 2, 1))
[3 2 1 2 4 1 4 4 1 2 1 1]
print(delta_cluster(b, 2, 1)[b.argsort()])
[1 1 1 1 1 2 2 2 3 4 4 4]
I have a pandas series of value_counts for a data set. I would like to plot the data with a color band (I'm using bokeh, but calculating the data band is the important part).
I hesitate to use the term standard deviation, since all the references I use calculate it based on the mean value, and I specifically want to use the mode as the center.
So, basically, I'm looking for a way in pandas to start at the mode and return a new series of value counts that includes 68.2% of the sum of the value_counts. If I had this series:
val  count
  1      0
  2      0
  3      3
  4      1
  5      2
  6      5  <-- mode
  7      4
  8      3
  9      2
 10      1
total = sum(count) # example value 21
band1_count = 21 * 0.682 # example value ~ 14.3
This is the order in which they would be added by an algorithm that walks the value counts on each side of the mode and includes the higher of the two until the sum of the counts is greater than 14.3.
band1_values = [6, 7, 8, 5, 9]
Here are the steps:
val  count  step
  1      0
  2      0
  3      3
  4      1
  5      2  <-- 4) add to list -- eq (9,2), closer to (6,5)
  6      5  <-- 1) add to list -- mode
  7      4  <-- 2) add to list -- gt (5,2)
  8      3  <-- 3) add to list -- gt (5,2)
  9      2  <-- 5) add to list -- gt (4,1), stop since sum of counts > 14.3
 10      1
Is there a native way to do this calculation in pandas or numpy? If there is a formal name for this study, I would appreciate knowing what it's called.
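The walk described in the steps above can be sketched in plain Python on top of a pandas Series. This is only one reading of the description, not a built-in pandas or numpy routine: the function name mode_band is made up, the series is assumed to be sorted by value, the first maximum is treated as the mode, and an exact tie in both count and distance goes to the left side (the question does not say what should happen there).

```python
import pandas as pd


def mode_band(counts, fraction=0.682):
    """Walk outward from the mode, always taking the larger neighbouring count."""
    values = list(counts.index)
    target = counts.sum() * fraction
    mode_pos = int(counts.to_numpy().argmax())  # first maximum is used as the mode
    left, right = mode_pos - 1, mode_pos + 1
    picked = [values[mode_pos]]
    running = counts.iloc[mode_pos]
    # Stop once the running sum exceeds the target or both sides are exhausted
    while running <= target and (left >= 0 or right < len(values)):
        take_left = right >= len(values) or (
            left >= 0 and (
                counts.iloc[left] > counts.iloc[right]
                or (counts.iloc[left] == counts.iloc[right]
                    and mode_pos - left <= right - mode_pos)  # tie: closer side wins
            )
        )
        pos = left if take_left else right
        picked.append(values[pos])
        running += counts.iloc[pos]
        if take_left:
            left -= 1
        else:
            right += 1
    return counts.loc[picked]


counts = pd.Series([0, 0, 3, 1, 2, 5, 4, 3, 2, 1], index=range(1, 11))
band = mode_band(counts)
print(list(band.index))  # [6, 7, 8, 5, 9]
```

The returned series keeps the order in which the values were added, so sorting its index gives the range to shade.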
I wrote a program in Python to calculate the inventory; however, I have a problem formatting the output layout. What I have done so far is:
def summary(a,b,c,row,col,tot):
    d={0:"Small", 1:"Medium", 2:"Large", 3:"Xlarge"}
    for i in range(row):
        for j in range(col):
            print "%6d" %(a[i][j]),
        print "%s%6d\n" %(d[i],(b[i])),
    print "\n" ,
    for j in range(col):
        print "%6d" %(c[j]),
    print "%6d\n" %tot
So the output is the 7 x 4 matrix, with the row totals on the right-hand side and the column totals below. However, I want to put names on the left-hand side to label each row (size small, etc.), so I used a dictionary, but the names appear on the right-hand side, just before the row total. I can't figure out how to put them on the left-hand side, in the same row as the numbers. I want two extra columns to the left of the numbers (the matrix): the far-left column would hold the word "size", vertically centered, and the second column would hold the names specified in the dictionary, followed by the numbers in the same row.
Thanks a lot for any help or suggestions.
I want it to look like this
        small      1    1    1    1    1    1    1      7
        medium     1    1    1    1    1    1    1      7
size    large      1    1    1    1    1    1    1      7
        xlarge     1    1    1    1    1    1    1      7
                   4    4    4    4    4    4    4     28
and I get
1    1    1    1    1    1    1    small     7
1    1    1    1    1    1    1    medium    7
1    1    1    1    1    1    1    large     7
1    1    1    1    1    1    1    xlarge    7
4    4    4    4    4    4    4             28
sorry for not being specific enough previously.
Just print it before the row:
def summary(a,b,c,row,col,tot):
    d={0:"Small", 1:"Medium", 2:"Large", 3:"Xlarge"}
    for i in range(row):
        print d[i].ljust(6),
        for j in range(col):
            print "%6d" %(a[i][j]),
        print "%6d\n" %(b[i]),
    print "\n" ,
    for j in range(col):
        print "%6d" %(c[j]),
    print "%6d\n" %tot
This assumes you want the first column left justified. Right justification (rjust()) and centering (center()) are also available.
Also, since you're just using contiguous numeric indices, you can just use a list instead of a dictionary.
As a side note, more descriptive variables are never a bad thing. Also, according to this, % formatting is obsolete, and the format() method should be used in new programs.
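For readers on Python 3, where print is a function, here is a sketch that combines the suggestions above: a list instead of a dictionary, more descriptive names, and the format() method. The parameter names are made up for illustration, and the function returns the text instead of printing it so that the layout is easy to inspect.

```python
def summary(matrix, row_totals, col_totals, grand_total):
    """Build the inventory table with the size names in the left-hand column."""
    sizes = ["Small", "Medium", "Large", "Xlarge"]  # a list instead of a dict
    lines = []
    for name, row, row_total in zip(sizes, matrix, row_totals):
        cells = "".join("{:6d}".format(value) for value in row)
        # {:<8} left-justifies the name in a field of width 8
        lines.append("{:<8}{}{:6d}".format(name, cells, row_total))
    cells = "".join("{:6d}".format(value) for value in col_totals)
    # The totals row gets an empty name so the numbers stay aligned
    lines.append("{:<8}{}{:6d}".format("", cells, grand_total))
    return "\n".join(lines)


table = summary([[1] * 7 for _ in range(4)], [7] * 4, [4] * 7, 28)
print(table)
```

Using {:>8} or {:^8} in place of {:<8} right-justifies or centers the names.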
You just have to move the "%s" and the appropriate variable to the correct position:
def summary(a,b,c,row,col,tot):
    d={0:"Small", 1:"Medium", 2:"Large", 3:"Xlarge"}
    for i in range(row):
        print "%8s" % d[i],
        for j in range(col):
            print "%6d" %(a[i][j]),
        print "%6d\n" % ((b[i])),
    print "\n" ,
    print "%8s" % " ",
    for j in range(col):
        print "%6d" %(c[j]),
    print "%6d\n" %tot
When calling this with the following (note that these are just test numbers; replace them with your real ones):
summary([[1, 2, 3, 4, 5, 6, 7],
         [1, 2, 3, 4, 5, 6, 7],
         [1, 2, 3, 4, 5, 6, 7],
         [1, 2, 3, 4, 5, 6, 7]], [12, 13, 14, 15],
        [22, 23, 24, 25, 26, 27, 28], 4, 7, 7777)
you get something like:
   Small      1      2      3      4      5      6      7     12
  Medium      1      2      3      4      5      6      7     13
   Large      1      2      3      4      5      6      7     14
  Xlarge      1      2      3      4      5      6      7     15
             22     23     24     25     26     27     28   7777
If you want the names left adjusted, you have to add a '-' before the format description like:
print "%-8s" % d[i],