Pandas dataframe - pairing off rows within a bucket - python

I have a dataframe that looks like this:
bucket type v
0 1 X 14
1 1 X 10
2 1 Y 11
3 1 X 15
4 2 X 16
5 2 Y 9
6 2 Y 10
7 3 Y 20
8 3 X 18
9 3 Y 15
10 3 X 14
The desired output looks like this:
bucket type v v_paired
0 1 X 14 nan (no Y coming before it)
1 1 X 10 nan (no Y coming before it)
2 1 Y 11 14 (highest X in bucket 1 before this row)
3 1 X 15 11 (lowest Y in bucket 1 before this row)
4 2 X 16 nan (no Y coming before it in the same bucket)
5 2 Y 9 16 (highest X in same bucket coming before)
6 2 Y 10 16 (highest X in same bucket coming before)
7 3 Y 20 nan (no X coming before it in the same bucket)
8 3 X 18 20 (single Y coming before it in same bucket)
9 3 Y 15 18 (single X coming before it in same bucket)
10 3 X 14 15 (smallest Y coming before it in same bucket)
The goal is to construct the v_paired column, and the rules are as follows:
Look for rows in the same bucket, coming before this one, that have the opposite type (X vs. Y); call these 'pair candidates'.
If the current row is X, choose the minimum v among the pair candidates to be v_paired for the current row; if the current row is Y, choose the maximum v among the pair candidates.
Thanks in advance.

I believe this should be done in a sequential manner: first group by bucket, then walk each group in order while tracking the best X and Y values seen so far.
groups = df.groupby('bucket', group_keys=False)

# this function will be applied to each bucket group
def func(group):
    y_value = None  # lowest Y value seen so far in this bucket
    x_value = None  # highest X value seen so far in this bucket
    result = []
    for _, (_, value_type, value) in group.iterrows():
        if value_type == 'X':
            x_value = max(v for v in (x_value, value) if v is not None)
            result.append(y_value)
        elif value_type == 'Y':
            y_value = min(v for v in (y_value, value) if v is not None)
            result.append(x_value)
    # return a Series aligned to the group's index so the assignment
    # below lines up row by row (a fresh DataFrame index would misalign)
    return pd.Series(result, index=group.index)

df['v_paired'] = groups.apply(func)
Hopefully this will do the job.
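A quick sanity check against the sample above (a sketch; it rebuilds the input by hand and assumes the corrected bucket value of 1 in row 0):

import pandas as pd

df = pd.DataFrame({
    'bucket': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'type':   ['X', 'X', 'Y', 'X', 'X', 'Y', 'Y', 'Y', 'X', 'Y', 'X'],
    'v':      [14, 10, 11, 15, 16, 9, 10, 20, 18, 15, 14],
})
# with func defined as above
df['v_paired'] = df.groupby('bucket', group_keys=False).apply(func)
print(df)  # v_paired: None, None, 14, 11, None, 16, 16, None, 20, 18, 15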

Pandas: Run a set of codes for multiple parameters with multiple levels of each parameter (output is a dataframe)

Let's say we have the code given below. Currently, we have two parameters whose values are initialized from user input. The output here is a dataframe.
What we want:
Use a function to create a dataframe with all combinations of X and Y. Let's say X and Y have 4 input values each.
Then join the output dataframe df for each combination to get the desired output dataframe.
X = float(input("Enter the value of X: "))
Y = float(input("Enter the value of Y: "))
A = X * Y
B = X * (Y ** 2)  # ** is exponentiation in Python; ^ is bitwise XOR
df = pd.DataFrame({"X": X, "Y": Y, "A": A, "B": B}, index=[0])  # all-scalar values need an index
Desired output
X Y A B
1 2 2 4
1 4 4 16
1 6 6 36
1 8 8 64
2 2 4 8
2 4 8 32
2 6 12 72
2 8 16 128
3 2 6 12
3 4 12 48
3 6 18 108
3 8 24 192
4 2 8 16
4 4 16 64
4 6 24 144
4 8 32 256
Is this what you were looking for?
import pandas as pd

def so_help():
    x = input('Please enter all X values separated by a comma(,)')
    y = input('Please enter all Y values separated by a comma(,)')
    # in case anyone gets comma happy
    x = x.strip(',')
    y = y.strip(',')
    x_list = x.split(',')
    y_list = y.split(',')
    df_x = pd.DataFrame({'X': x_list})
    df_y = pd.DataFrame({'Y': y_list})
    df_cross = pd.merge(df_x, df_y, how='cross')  # cross merge needs pandas >= 1.2
    df_cross['X'] = df_cross['X'].astype(int)
    df_cross['Y'] = df_cross['Y'].astype(int)
    df_cross['A'] = df_cross['X'].mul(df_cross['Y'])
    df_cross['B'] = df_cross['X'].mul(df_cross['Y'].pow(2))
    return df_cross

so_help()
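If you'd rather test without the interactive prompts, here is a non-interactive sketch of the same cross-merge idea (the helper name make_cross and the hard-coded lists are my own):

import pandas as pd

def make_cross(x_vals, y_vals):
    # cross product of the two value lists, then the derived columns
    df_cross = pd.merge(pd.DataFrame({'X': x_vals}),
                        pd.DataFrame({'Y': y_vals}), how='cross')
    df_cross['A'] = df_cross['X'] * df_cross['Y']
    df_cross['B'] = df_cross['X'] * df_cross['Y'] ** 2
    return df_cross

print(make_cross([1, 2, 3, 4], [2, 4, 6, 8]))  # matches the desired output above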

length of list len(list) resulting in wrong value in Python

It might sound trivial, but I am surprised by the output. Basically, I am calculating y = a*x + b for given a, b, and x. With the code below I am able to get the desired result of y, which is a list of 20 values.
But when I check the length of the list, I get 1 in return, and the range is (0, 1), which is weird, as I was expecting it to be 20.
Am I making a mistake here?
a = 10
b = 0
x = df['x']
print(x)
0 0.000000
1 0.052632
2 0.105263
3 0.157895
4 0.210526
5 0.263158
6 0.315789
7 0.368421
8 0.421053
9 0.473684
10 0.526316
11 0.578947
12 0.631579
13 0.684211
14 0.736842
15 0.789474
16 0.842105
17 0.894737
18 0.947368
19 1.000000
y_new = []
for i in x:
    y = a*x + b
    y_new.append(y)
len(y_new)
Output: 1
print(y_new)
[0 0.000000
1 0.526316
2 1.052632
3 1.578947
4 2.105263
5 2.631579
6 3.157895
7 3.684211
8 4.210526
9 4.736842
10 5.263158
11 5.789474
12 6.315789
13 6.842105
14 7.368421
15 7.894737
16 8.421053
17 8.947368
18 9.473684
19 10.000000
Name: x, dtype: float64]
I would propose two solutions:
The first solution: convert your column df['x'] into a list with df['x'].tolist(), re-run your code, and replace a*x + b with a*i + b so the loop variable is actually used.
The second solution (which I would do): convert df['x'] into an array with x = np.array(df['x']). This lets you use array broadcasting.
So your code will simply be:
import numpy as np

x = np.array(df['x'])
y = a*x + b
This should give you the desired output.
I hope this is helpful.
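For completeness, a minimal sketch of the first solution (the fix is simply to use the loop variable i instead of the whole Series x):

y_new = []
for i in df['x'].tolist():  # iterate over plain floats, not the Series
    y_new.append(a*i + b)   # a*i, not a*x, so each element is scaled once
print(len(y_new))           # 20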
With the code below, I get a length of 20 for the list y_new. Are you sure you printed the right value? According to this post, df['x'] returns a pandas Series, so df['x'] is equivalent to pd.Series(...).
df['x'] indexes the column named 'x' and returns a pd.Series.
import pandas as pd

a = 10
b = 0
x = pd.Series(data=[0.000000, 0.052632, 0.105263, 0.157895, 0.210526, 0.263158,
                    0.315789, 0.368421, 0.421053, 0.473684, 0.526316, 0.578947,
                    0.631579, 0.684211, 0.736842, 0.789474, 0.842105, 0.894737,
                    0.947368, 1.000000])
y_new = []
for i in x:
    y = a*x + b  # note: this deliberately reproduces the original code, Series and all
    y_new.append(y)
print("y_new length: " + str(len(y_new)))
Output:
y_new length: 20

Generate combinations for comma-separated strings in a pandas row

I have a dataframe like this:
ID, Values
1 10, 11, 12, 13
2 14
3 15, 16, 17, 18
I want to create a new dataframe like this:
ID COl1 Col2
1 10 11
1 11 12
1 12 13
2 14
3 15 16
3 16 17
3 17 18
Please help me with how to do this.
Note: the values in the Values column of the input df are of str type.
Use a list comprehension with flattening, plus one small change to the chunking helper (if i > 0: becomes if i == 2:) so that one-element values are handled correctly:
from collections import deque

# https://stackoverflow.com/a/36586925
def chunks(iterable, chunk_size=2, overlap=1):
    # we'll use a deque to hold the values because it automatically
    # discards any extraneous elements if it grows too large
    if chunk_size < 1:
        raise Exception("chunk size too small")
    if overlap >= chunk_size:
        raise Exception("overlap too large")
    queue = deque(maxlen=chunk_size)
    it = iter(iterable)
    i = 0
    try:
        # start by filling the queue with the first group
        for i in range(chunk_size):
            queue.append(next(it))
        while True:
            yield tuple(queue)
            # after yielding a chunk, get enough elements for the next chunk
            for i in range(chunk_size - overlap):
                queue.append(next(it))
    except StopIteration:
        # if the iterator is exhausted, yield any remaining elements
        i += overlap
        if i == 2:
            yield tuple(queue)[-i:]
L = [[x] + list(z) for x, y in zip(df['ID'], df['Values']) for z in (chunks(y.split(', ')))]
df = pd.DataFrame(L, columns=['ID','Col1','Col2']).fillna('')
print(df)
ID Col1 Col2
0 1 10 11
1 1 11 12
2 1 12 13
3 2 14
4 3 15 16
5 3 16 17
6 3 17 18
Tried a slightly different approach: created a function that returns the numbers in pairs from the initial comma-separated string.
def pairup(mystring):
    """Function to return a paired-up list from a string."""
    mylist = mystring.split(',')
    if len(mylist) == 1:
        return [mylist]
    splitlist = []
    for index, item in enumerate(mylist):
        try:
            splitlist.append([mylist[index], mylist[index + 1]])
        except IndexError:  # the last element has no successor
            pass
    return splitlist
Now let's create the new data frame.
# https://stackoverflow.com/a/39955283/3679377
new_df = df[['ID']].join(
    df.Values.apply(lambda x: pd.Series(pairup(x)))
      .stack()
      .apply(lambda x: pd.Series(x))
      .fillna("")
      .reset_index(level=1, drop=True),
    how='left').reset_index(drop=True)
new_df.columns = ['ID', 'Col 1', 'Col 2']
Here's the output of print(new_df).
ID Col 1 Col 2
0 1 10 11
1 1 11 12
2 1 12 13
3 2 14
4 3 15 16
5 3 16 17
6 3 17 18
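Both answers above work; for comparison, here is a simpler sketch of the same pairing idea (my own variant, using zip over the split values):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Values': ['10, 11, 12, 13', '14', '15, 16, 17, 18']})
rows = []
for i, vals in zip(df['ID'], df['Values']):
    parts = [p.strip() for p in vals.split(',')]
    if len(parts) == 1:
        rows.append((i, parts[0], ''))  # single value: leave Col2 empty
    else:
        rows.extend((i, a, b) for a, b in zip(parts, parts[1:]))
print(pd.DataFrame(rows, columns=['ID', 'Col1', 'Col2']))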

data extraction from text file in Python

I have a text file that represents motion vector data from a video clip.
# pts=-26 frame_index=2 pict_type=P output_type=raw shape=3067x4
8 8 0 0
24 8 0 -1
40 8 0 0
...
8 24 0 0
24 24 3 1
40 24 0 0
...
8 40 0 0
24 40 0 0
40 40 0 0
# pts=-26 frame_index=3 pict_type=P output_type=raw shape=3067x4
8 8 0 1
24 8 0 0
40 8 0 0
...
8 24 0 0
24 24 5 -3
40 24 0 0
...
8 40 0 0
24 40 0 0
40 40 0 0
...
So it is some sort of grid where the first two numbers are the x and y coordinates and the third and fourth are the x and y components of the motion vector.
To use this data further, I need to extract the pairs of x and y values where at least one value differs from 0, and organize them in lists.
For example:
(0, -1, 2)
(3, 1, 2)
(0, 1, 3)
(5, -3, 3)
The third number is the frame_index.
I would appreciate it a lot if somebody could help me with a plan for how to crack this task, and where I should start.
This is actually quite simple since there is only one type of data.
We can do this without resorting to e.g. regular expressions.
Disregarding any error checking (did we actually read 3067 points for frame 2, or only 3065? Is a line malformed? ...), it would look something like this:
frame_data = {}  # maps frame_idx -> list of (x, y, vx, vy)
with open('mydatafile.txt', 'r') as f:
    for line in f:
        if line.startswith('#'):  # a header line
            options = {key: value for key, value in
                       [token.split('=') for token in line[1:].split()]}
            curr_frame = int(options['frame_index'])
            curr_data = []
            frame_data[curr_frame] = curr_data
        else:  # not a header line
            x, y, vx, vy = map(int, line.split())
            curr_data.append((x, y, vx, vy))  # append to the current frame's list, not to frame_data
You now have a dictionary that maps a frame number to a list of (x, y, vx, vy) tuples.
Extracting the new list from the dictionary is now easy:
result = []
for frame_number, data in frame_data.items():
    for x, y, vx, vy in data:
        if not (vx == 0 and vy == 0):
            result.append((vx, vy, frame_number))
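A quick end-to-end check (a sketch; it writes an abridged two-frame sample, with the '...' rows dropped, to a file, then runs the two loops above):

sample = """\
# pts=-26 frame_index=2 pict_type=P output_type=raw shape=3067x4
8 8 0 0
24 8 0 -1
24 24 3 1
# pts=-26 frame_index=3 pict_type=P output_type=raw shape=3067x4
8 8 0 1
24 24 5 -3
"""
with open('mydatafile.txt', 'w') as f:
    f.write(sample)
# ... run the two loops above, then:
print(result)  # [(0, -1, 2), (3, 1, 2), (0, 1, 3), (5, -3, 3)]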

Binning values into groups with a minimum size using pandas

I'm trying to bin a sample of observations into n discrete groups, then combine these groups until each subgroup has a minimum of 6 members. So far, I've generated bins and grouped my DataFrame into them:
# df is a DataFrame containing 135 measurements
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
grp.size()
1 4
2 1
3 2
4 3
5 2
6 8
7 7
8 6
9 19
10 12
11 13
12 12
13 7
14 12
15 12
16 2
17 3
18 6
19 3
21 1
So I can see that I need to combine groups 1 - 3, 3 - 5, and 16 - 21, while leaving the others intact, but I don't know how to do this programmatically.
You can do this:
df = pd.DataFrame(np.random.randint(1, 201, 135), columns=['heights'])  # random_integers is deprecated
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
sizes = grp.size()

def f(vals, max_size):  # renamed from max to avoid shadowing the builtin
    total = 0
    group = 1
    for v in vals:
        total += v
        if total <= max_size:
            yield group
        else:
            group += 1
            total = v
            yield group

# I've changed 6 to 30 for the example because I don't have your original dataset
grp.size().groupby([g for g in f(sizes, 30)])
And if you do print(grp.size().groupby([g for g in f(sizes, 30)]).cumsum()) you will see that the cumulative sums are grouped as expected.
Also, if you want to group the original values, you can do something like:
# dat = np.random.randint(0, 201, 135)  # original random draw; pinned below for reproducibility
dat = np.array([78,116,146,111,147,78,14,91,196,92,163,144,107,182,58,89,77,134,
                83,126,94,70,121,175,174,88,90,42,93,131,91,175,135,8,142,166,
                1,112,25,34,119,13,95,182,178,200,97,8,60,189,49,94,191,81,
                56,131,30,107,16,48,58,65,78,8,0,11,45,179,151,130,35,64,
                143,33,49,25,139,20,53,55,20,3,63,119,153,14,81,93,62,162,
                46,29,84,4,186,66,90,174,55,48,172,83,173,167,66,4,197,175,
                184,20,23,161,70,153,173,127,51,186,114,27,177,96,93,105,169,158,
                83,155,161,29,197,143,122,72,60])
df = pd.DataFrame({'heights': dat})
bins = np.digitize(dat, np.linspace(0, 200, 21))
grp = df.heights.groupby(bins)

m = 15  # you should put 6 here, the minimum
s = 0
c = 1

def f(x):
    global c, s
    res = pd.Series([c] * x.size, index=x.index)
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.apply(f)
print(df.groupby(g).size())
# another way of doing the same, just a matter of taste
m = 15  # you should put 6 here, the minimum
s = 0
c = 1

def f2(x):
    global c, s
    res = [c] * x.size  # here is the main difference with f
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.transform(f2)  # call it this way
print(df.groupby(g).size())
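As a side note, the same running-count idea can be written without module-level globals; a sketch using a small closure (make_grouper is my own name):

def make_grouper(min_size):
    state = {'total': 0, 'group': 1}
    def grouper(x):
        # label every row of this bin with the current combined-group number
        res = [state['group']] * x.size
        state['total'] += x.size
        if state['total'] > min_size:  # group is big enough: start a new one
            state['total'] = 0
            state['group'] += 1
        return res
    return grouper

g = grp.transform(make_grouper(6))
print(df.groupby(g).size())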
