pandas groupby + transform gives shape mismatch - python

I have a Pandas DataFrame with one column holding the index of each row within its group. I now want to determine whether a row is in the beginning, middle, or end of the group based on this index. I wanted to apply a UDF that returns start (0), middle (1), or end (2) as output, and I want to save that output per row in a new column. Here is my UDF:
def add_position_within_group(group):
    length_of_group = group.max()
    three_lists = self.split_lists_into_three_parts([x for x in range(length_of_group)])
    result_list = []
    for x in group:
        if int(x) in three_lists[0]:
            result_list.append(0)
        elif int(x) in three_lists[1]:
            result_list.append(1)
        elif int(x) in three_lists[2]:
            result_list.append(2)
    return result_list
Here is the split_lists_into_three_parts method (tried and tested):
def split_lists_into_three_parts(self, event_list):
    k, m = divmod(len(event_list), 3)
    total_list = [event_list[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(3)]
    start_of_list = total_list[0]
    middle_of_list = total_list[1]
    end_of_list = total_list[2]
    return [start_of_list, middle_of_list, end_of_list]
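For reference, a quick check of what this splitter returns (with `self` dropped, since the method only uses its argument):

```python
def split_lists_into_three_parts(event_list):
    k, m = divmod(len(event_list), 3)
    # Slice the list into three near-equal parts; the first m parts get one extra element.
    return [event_list[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(3)]

print(split_lists_into_three_parts(list(range(7))))  # [[0, 1, 2], [3, 4], [5, 6]]
```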
Here is the line of code that groups the DataFrame and runs transform(). From what I have read, transform() called on a groupby iterates over all the groups, passes each group's column as a Series to my UDF, and the UDF has to return a one-dimensional list or Series the same size as the group:
compound_data_frame["position_in_sequence"] = compound_data_frame.groupby('patient_id')["group_index"].transform(self.add_position_within_group)
I'm getting the following error:
shape mismatch: value array of shape (79201,) could not be broadcast to indexing result of shape (79202,)
I still can't figure out what kind of output my function has to have when passed to transform, or why I'm getting this error. Any help would be much appreciated.

Well, I'm embarrassed to say this, but here goes: in order to create the three lists of indices I use range(group.max()), which produces one element fewer than the group size (since the indices start at 0, group.max() is the group size minus 1). What I should have done is either use the group size or add 1 to group.max().
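As a self-contained sketch of that fix (the helper is inlined, and the patient_id/group_index values are made up for illustration):

```python
import pandas as pd

def add_position_within_group(group):
    # Use the group size (max index + 1), so the range covers every index.
    n = group.max() + 1
    k, m = divmod(n, 3)
    idx = list(range(n))
    parts = [idx[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(3)]
    # Map each index to 0 (start), 1 (middle), or 2 (end).
    return [0 if x in parts[0] else 1 if x in parts[1] else 2 for x in group]

df = pd.DataFrame({
    "patient_id":  [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "group_index": [0, 1, 2, 3, 4, 5, 0, 1, 2],
})
df["position_in_sequence"] = (
    df.groupby("patient_id")["group_index"].transform(add_position_within_group)
)
print(df["position_in_sequence"].tolist())  # [0, 0, 1, 1, 2, 2, 0, 1, 2]
```

Because the returned list is now exactly the same length as each group, transform broadcasts it back without a shape mismatch.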

Related

Taking n elements at a time from 1d list and add them to 2d list

I have a list of data, and I'd like to take 4 elements at a time from this list and put them into a 2D list, where each 4-element chunk becomes a new row of that list.
My first attempts involve input to 1d list:
list.append(input("Enter data type 1:"))
list.append(input("Enter data type 2:"))
# etc.
and then I've tried to loop the list and to "switch" rows once the index reaches 4.
for x in range(n * 4):
    for idx, y in enumerate(list):
        if idx % 4 == 0:
            x = x + 1
        list[y] = result[x][y]
where I've initialised result according to the following:
ran = int(len(list) / 4)
result = [[0 for x in range(ran)] for j in range(n)]
I've also attempted to ascribe a temporary empty list that will append to an initialised 2D list.
row.append(list)
result = [[x for x in row] for j in range(n + 1)]
# result[n] = row
print(result)
n = n + 1
row.clear()
list.clear()
so that each new loop starts with an empty row, takes input from user and copies it.
I'm at a loss for how to make result keep the first entry instead of being redefined on the second, third, and fourth entries.
I think this post is probably what you need. With np.reshape() you can just have your list filled with all the values you need and do the reshaping after in a single step.
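A minimal sketch of that reshape approach (the sample values are made up):

```python
import numpy as np

# Flat list of values; assumes its length is a multiple of 4.
flat = [1, 2, 3, 4, 5, 6, 7, 8]

# Reshape into rows of 4; -1 lets NumPy infer the number of rows.
rows = np.reshape(flat, (-1, 4))
print(rows.tolist())  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```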

Execute function using value of list as parameter

I have a function that has a parameter requiring a value, and I have the values stored within a list. (These values are the index numbers of a dataframe; dfFunction takes a row from the dataframe using iloc, so it needs a number between 0 and 99 to return a value.)
Such as
IndexList = [0, 1, 2, 3, 4, ..., 99]
RowIndex = IndexList.index(0)
dfFunction(RowIndex)
# Output: 10
But I'd like to be able to run through the list of index values and apply them to the function, thus producing the function result for each index number.
However, the code I have at the moment, to iterate through the list, is returning <function __main__.IndexFunction()>
def IndexFunction():
    RowIndex = 0
    while RowIndex <= len(IndexList):
        RowIndexValue = IndexList.index(RowIndex)
        RowIndex += 1
    return RowIndexValue
I thought I could apply this function as the parameter, like: dfFunction(IndexFunction) but I also know this isn't correct.
Is there a while loop or way to apply enumerate() or something else so I can use the list values in the dfFunction and produce the results for all of them?
Ideally:
dfFunction(#insert solution)
#Output dfFunction results for each list index to put back in dataframe as a new column
OR:
IndexList = [0,1,2,3,4,5,...,99]
IndexNumber = 0
row_index = IndexList.index(IndexNumber)
Result = dfColumn1.iloc[row_index] / dfFunction(row_index)
Result
#Output
IndexNumber = 1
row_index = IndexList.index(IndexNumber)
Result = dfColumn1.iloc[row_index] / dfFunction(row_index)
Result
#Output
IndexNumber = 3
row_index = IndexList.index(IndexNumber)
Result = dfColumn1.iloc[row_index] / dfFunction(row_index)
Result
#Output
etc
Simplified, so there don't have to be 100 repeats of those chunks of code: one function spits out all the outputs in one go, like:
#Output of function from Index Position 1
#Output of function from Index Position 2
#Output of function from Index Position 3
#Output of function from Index Position 4
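One common pattern for this, sketched here with a hypothetical stand-in for dfFunction (the real function operates on a dataframe row, so the doubling body below is purely illustrative):

```python
# Hypothetical stand-in for dfFunction: doubles the row index it receives.
def dfFunction(row_index):
    return row_index * 2

IndexList = list(range(100))

# Call dfFunction once per value in the list with a list comprehension.
results = [dfFunction(i) for i in IndexList]
print(results[:5])  # [0, 2, 4, 6, 8]
```

The same list of results can then be assigned back to the dataframe as a new column.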

Handling dataframe slice endpoints

I am writing a function that splits a dataframe into n equally sized slices, similar to np.array_split, but instead of starting from index 0, it starts from the index of the remainder.
First, I get the indices with which to split.
import math
import pandas as pd

lst = [1] * 100
df = pd.DataFrame(lst)

def df_split(df, n):
    step = math.floor(len(df) / n)
    remainder = len(df) % step
    splits = [remainder]
    while max(splits) < len(df):
        splits.append(max(splits) + step)
    return splits

splits = df_split(df, 3)
splits = df_split(df, 3)
This returns [1, 34, 67, 100].
I would then like to get the sub-arrays, or slices:
arrays = []
for i in range(len(splits) - 1):
    st = splits[i]
    end = splits[i + 1]
    arrays.append(df[st:end])
The final iteration of this for loop, however, is indexing df[67:100], which is exclusive of the last row in the df. I would like to make sure that the last row is included.
If I utilize df[67:101], I get an out-of-index error.
I could write an if statement that checks whether end is the last element in splits, and then simply returns df[67:], which would give the desired output, but I was wondering if there is a simpler way to achieve the same result.
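One simpler route, assuming positional slicing is acceptable: `iloc` follows Python slice semantics, so an out-of-range stop is clamped rather than raising, and the last slice can simply run through `len(df)`:

```python
import pandas as pd

df = pd.DataFrame([1] * 100)
splits = [1, 34, 67, 100]  # the output of df_split(df, 3) from the question

# iloc slicing clamps an out-of-range stop, so the final slice safely
# reaches the last row without any special-casing.
arrays = [df.iloc[splits[i]:splits[i + 1]] for i in range(len(splits) - 1)]
print([len(a) for a in arrays])  # [33, 33, 33]
```

Note the last slice ends at positional index 99, the final row of the frame (row 0 is skipped by design, since the splits start from the remainder).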

Delete 2D unique elements in a 2D NumPy array

I generate a set of unique coordinate combinations by using:
axis_1 = np.arange(image.shape[0])
axis_1 = np.reshape(axis_1,(axis_1.shape[0],1))
axis_2 = np.arange(image.shape[1])
axis_2 = np.reshape(axis_2,(axis_2.shape[0],1))
coordinates = np.array(np.meshgrid(axis_1, axis_2)).T.reshape(-1,2)
I then check for some condition, and if it is satisfied I want to delete those coordinates from the array.
Something like this:
if image[coordinates[i, 0], coordinates[i, 1]] != 0:
    # remove coordinates i from coordinates
I tried the remove and delete commands but one doesn't work for arrays and the other simply just removes every instance where coordinates[i,0] and coordinates[i,1] appear, rather than the unique combination of both.
You can use np.where to generate the coordinate pairs that should be removed, and np.unique combined with masking to remove them:
y, x = np.where(image > 0.7)
yx = np.column_stack((y, x))
combo = np.vstack((coordinates, yx))
unique, counts = np.unique(combo, axis=0, return_counts=True)
clean_coords = unique[counts == 1]
The idea here is to stack the original coordinates and the coordinates-to-be-removed in the same array, then drop the ones that occur in both.
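A minimal end-to-end sketch of that idea on a toy 2x2 image (values made up):

```python
import numpy as np

image = np.array([[0.0, 0.9],
                  [0.5, 0.0]])

# All coordinate pairs of the 2x2 image, as in the question.
coordinates = np.array(np.meshgrid(np.arange(2), np.arange(2))).T.reshape(-1, 2)

# Pairs to remove: where image > 0.7 (only (0, 1) here).
y, x = np.where(image > 0.7)
yx = np.column_stack((y, x))

# Stack both sets, then keep only rows that occur exactly once.
combo = np.vstack((coordinates, yx))
unique, counts = np.unique(combo, axis=0, return_counts=True)
clean_coords = unique[counts == 1]
print(clean_coords.tolist())  # [[0, 0], [1, 0], [1, 1]]
```

One caveat of this approach: `np.unique` returns the surviving pairs in sorted order, not in the original order of `coordinates`.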
You can use the numpy.delete function, but note that it returns a new modified array and does not modify the array in place (which would be quite problematic, especially in a for loop).
So your code would look like this:
nb_rows_deleted = 0
for i in range(0, coordinates.shape[0]):
    corrected_i = i - nb_rows_deleted
    if image[coordinates[corrected_i, 0], coordinates[corrected_i, 1]] != 0:
        coordinates = np.delete(coordinates, corrected_i, 0)
        nb_rows_deleted += 1
The corrected_i takes into account that some rows have already been deleted earlier in the loop.

Python concatenate 2D array to new list if condition is met

Let's say I have an array:
print(arr1.shape)
(188621, 10)
And in the nth column (say the 4th for this example), I want to check when a value is above a threshold t. Whenever the value in the 4th column of row i is above t, I want to append that entire row of arr1 to a new list, ending up with x rows. So far I have:
arr2 = []
for i in range(0, len(arr1)):
    if arr1[i, 4] > t:
        arr2.append(arr1[i, :])
I have also tried something along the lines of:
for i in range(0, len(arr1)):
    if arr1[i, 4] > t:
        if len(arr2) == 0:
            arr2 = arr1[i, :]
        else:
            arr2 = np.concatenate((arr2, arr1[i, :]))
However, both versions seem to grow as a 1D structure of length x*10 instead of a 2D array of shape (x, 10) as the condition is met. What am I missing here?
Well, it wasn't that difficult.
arr2 = arr1[np.logical_not(arr1[:,4] < t)]
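A small demonstration of that boolean-mask indexing on made-up data:

```python
import numpy as np

t = 5
arr1 = np.arange(30).reshape(3, 10)  # column 4 holds 4, 14, 24

# Keep every full row whose column-4 value is not below the threshold.
arr2 = arr1[np.logical_not(arr1[:, 4] < t)]
print(arr2.shape)  # (2, 10)
```

The mask selects whole rows at once, so the result keeps its 2D shape with no loop or concatenation needed.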
