I have a large df with coordinates in multiple dimensions. I am trying to create classes (objects) based on a threshold difference between the coordinates. An example df is below:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12],'z': [7, 6, 2, 43, 1, 40]})
Based on this df I want to assign each row to a class using a threshold of ±2 across all coordinates, so that the df gets a unique group name added to each row. The output for this threshold function is:
x  y   z   group
1  10  7   -
2  14  6   -
3  5   2   G1
4  14  43  -
5  3   1   G1
6  12  40  -
It is similar to clustering, but I want to work with my own threshold functions. How can this be done in Python?
EDIT
To clarify: the threshold applies to similar coordinates. All rows that lie within ± the threshold of each other across all coordinates will be grouped as a single object. It can also be seen as grouping rows based on a threshold across all columns and assigning a unique label to each group.
As far as I understood, what you need is the apply function. It was not very clear from your statement whether you need all the differences between the coordinates, or just the neighbouring differences (x-y and y-z). Row 5 has a difference of 4 between the x and z coordinates, but is still assigned to class G1.
That's why I wrote it for both possibilities, so you can choose the one you need:
import pandas as pd
import numpy as np

def your_specific_function(row):
    '''
    For all differences use this:
    diffs = np.array([abs(row.x-row.y), abs(row.y-row.z), abs(row.x-row.z)])
    '''
    # for only x - y, y - z use this:
    diffs = np.diff(row)
    statement = all(diffs <= 2)
    if statement:
        return 'G1'
    else:
        return '-'
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12],'z': [7, 6, 2, 43, 1, 40]})
df['group'] = df.apply(your_specific_function, axis = 1)
print(df.head())
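If, as the EDIT clarifies, the goal is instead to group rows whose coordinates are all within the threshold of each other, here is a rough pairwise sketch of that idea (a simple greedy pass that does not merge overlapping groups transitively):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [10, 14, 5, 14, 3, 12],
                   'z': [7, 6, 2, 43, 1, 40]})

threshold = 2
coords = df[['x', 'y', 'z']].to_numpy()
# rows i and j belong together if every coordinate differs by at most the threshold
close = (np.abs(coords[:, None, :] - coords[None, :, :]) <= threshold).all(axis=2)

labels = ['-'] * len(df)
group_no = 0
for i in range(len(df)):
    partners = [j for j in range(len(df)) if j != i and close[i, j]]
    if partners and labels[i] == '-':
        group_no += 1
        labels[i] = f'G{group_no}'
        for j in partners:
            labels[j] = labels[i]
df['group'] = labels
print(df)

For the example data this labels the rows with x=3 and x=5 as G1 and leaves the rest as '-', matching the output shown in the question.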
Related
I am working with a Series in Python. What I want to achieve is to get the highest value out of every n values in the series.
For example:
if n is 3
Series: 2, 1, 3, 5, 3, 6, 1, 6, 9
Expected Series: 3, 6, 9
I have tried the nlargest function in pandas, but it returns the largest values in descending order, whereas I need the values in the order of the original series.
There are various options. If the series is guaranteed to have a length that is a multiple of n, you could drop down to numpy and do a .reshape followed by .max along an axis.
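For that case, a minimal sketch of the reshape approach (assuming the length is an exact multiple of n):

import numpy as np
import pandas as pd

n = 3
ser = pd.Series([2, 1, 3, 5, 3, 6, 1, 6, 9])
# reshape into rows of length n and take the max of each row
out = pd.Series(ser.to_numpy().reshape(-1, n).max(axis=1))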
Otherwise, if the index is the default (0, 1, 2, ...), you can use groupby:
import pandas as pd
n = 3
ser = pd.Series([2, 1, 3, 5, 3, 6, 1, 6, 9])
out = ser.groupby(ser.index // n).max()
out:
0 3
1 6
2 9
dtype: int64
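If the index is not the default, a positional key built with numpy works as well (reusing ser and n from above):

import numpy as np
# group by position rather than by index label
out = ser.groupby(np.arange(len(ser)) // n).max()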
I have a dataframe that has duplicated time indices and I would like to get the mean across all observations for the previous 2 days (I do not want to drop any observations; they are all information that I need). I've checked the pandas documentation and read previous posts on Stack Overflow (such as Apply rolling mean function on data frames with duplicated indices in pandas), but could not find a solution. Here's an example of what my data frame looks like and the output I'm looking for. Thank you in advance.
data:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],'t': [1, 2, 3, 2, 1, 2, 2, 3, 4],'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
output:
t  v2
1  -
2  -
3  4.167
4  5
5  6.667
A rough proposal: concatenate 2 copies of the input frame in which the values in 't' are replaced by 't+1' and 't+2' respectively. This way, the meaning of the column 't' becomes "the target day".
Setup:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],
't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
Implementation:
n_rows = df.shape[0]  # number of rows (avoids shadowing the built-in len)
incr = pd.DataFrame({'id': [0]*n_rows, 't': [1]*n_rows, 'v1': [0]*n_rows})  # +1 in 't'
df2 = pd.concat([df + incr, df + incr + incr]).groupby('t').mean()
df2 = df2[1:-1]  # drop the days that do not have full values for the 2 previous days
df2 = df2.rename(columns={'v1': 'v2'}).drop('id', axis=1)
Output:
v2
t
3 4.166667
4 5.000000
5 6.666667
Thank you for all the help. I ended up using groupby + rolling with a 2-day window, and then dropping duplicates (keeping the last observation).
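Not the groupby + rolling route mentioned above, but for reference, a sketch that reproduces the expected output using per-day aggregates and a fixed-size rolling window shifted to exclude the current day:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4, 4, 4],
                   't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
                   'v1': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# per-day sum and count of v1, on a continuous range of days (one extra day at the end)
daily = df.groupby('t')['v1'].agg(['sum', 'count'])
daily = daily.reindex(range(df['t'].min(), df['t'].max() + 2), fill_value=0)

# totals over a 2-day window, shifted by one day so the current day is excluded
prev2 = daily.rolling(2).sum().shift(1)
v2 = (prev2['sum'] / prev2['count']).rename('v2')
print(v2.dropna())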
I am trying to calculate a point biserial correlation for a set of columns in my dataset. I am able to do it on an individual variable, but if I need to calculate it for all the columns in one iteration, it throws an error.
Below is the code:
import pandas as pd
from scipy import stats

df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10], 'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})
corr_list = {}
y = df['A'].astype(float)
for column in df:
    x = df[['B', 'C', 'D']].astype(float)
    corr = stats.pointbiserialr(x, y)
    corr_list[['B', 'C', 'D']] = corr
print(corr_list)
TypeError: No loop matching the specified signature and casting was found for ufunc add
x must be a column, not a dataframe; if you take the column instead of the dataframe, it will work. You can try this:
import pandas as pd
from scipy import stats

df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10], 'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})
print(df)
corr_list = []
y = df['A'].astype(float)
# note: this iterates over every column, including 'A' itself
for column in df:
    x = df[column]
    corr = stats.pointbiserialr(list(x), list(y))
    corr_list.append(corr[0])
print(corr_list)
By the way, you can use print(df.corr()) to get the correlation matrix of the dataframe.
You can use the pd.DataFrame.corrwith() function:
df[['B', 'C', 'D']].corrwith(df['A'].astype('float'), method=stats.pointbiserialr)
The output lists the columns and their corresponding correlations & p-values (rows 0 and 1, respectively) with the target DataFrame or Series; see the pandas DataFrame.corrwith documentation.
B C D
0 4.547937e-18 0.400066 -0.094916
1 1.000000e+00 0.504554 0.879331
I have two columns A and B in a pandas dataframe, where values are repeated multiple times. For a unique value in A, B is expected to have "another" unique value too, and each unique value of A has a corresponding unique value in B (see the example below in the form of two lists). But since each value in each column is repeated multiple times, I would like to check whether a one-to-one relationship exists between the two columns. Is there any built-in function in pandas to check that? If not, is there an efficient way of achieving that task?
Example:
A = [1, 3, 3, 2, 1, 2, 1, 1]
B = [5, 12, 12, 10, 5, 10, 5, 5]
Here, for each 1 in A, the corresponding value in B is always 5, and nothing else. Similarly, 2 --> 10 and 3 --> 12. Hence, each number in A has only one unique corresponding number in B (and no other number). I have called this a one-to-one relationship. Now I want to check whether such a relationship exists between two columns of a pandas dataframe.
An example where this relationship is not satisfied:
A = [1, 3, 3, 2, 1, 2, 1, 1]
B = [5, 12, 12, 10, 5, 10, 7, 5]
Here, 1 in A doesn't have a unique corresponding value in B. It has two corresponding values - 5 and 7. Hence, the relationship is not satisfied.
Suppose you have some dataframe:
import pandas as pd
import numpy as np

d = pd.DataFrame({'A': [1, 3, 1, 2, 1, 3, 2], 'B': [4, 6, 4, 5, 4, 6, 5]})
d has a groupby method, which returns a GroupBy object. This is the interface for grouping rows by equal column values, for example.
gb = d.groupby('A')
grouped_b_column = gb['B']
On the grouped rows you can perform an aggregation. Let's find the min and max value in every group.
res = grouped_b_column.agg([np.min, np.max])
>>> print(res)
amin amax
A
1 4 4
2 5 5
3 6 6
Now we just need to check that amin and amax are equal in every group, i.e. that every group consists of a single B value:
res['amin'].equals(res['amax'])
If this check passes, then every A has a unique B. Now check the same criterion with the A and B columns swapped; a sketch wrapping both directions into one helper follows.
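Both directions can be wrapped into one small helper, for example using nunique per group (is_one_to_one is just a name introduced here for illustration):

import pandas as pd

def is_one_to_one(frame, col_a, col_b):
    # every value of col_a maps to exactly one value of col_b, and vice versa
    a_to_b = frame.groupby(col_a)[col_b].nunique().eq(1).all()
    b_to_a = frame.groupby(col_b)[col_a].nunique().eq(1).all()
    return bool(a_to_b and b_to_a)

d = pd.DataFrame({'A': [1, 3, 3, 2, 1, 2, 1, 1],
                  'B': [5, 12, 12, 10, 5, 10, 5, 5]})
print(is_one_to_one(d, 'A', 'B'))  # True for the first example in the question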
I have a very large pandas dataframe with two columns that I'd like to recursively lookup.
Given input of the following dataframe:
NewID, OldID
1, 0
2, 1
3, 2
5, 4
7, 6
8, 7
9, 5
I'd like to generate the series OriginalId:
NewID, OldId, OriginalId
1, 0, 0
2, 1, 0
3, 2, 0
5, 4, 4
7, 6, 6
8, 7, 6
9, 5, 4
This can be trivially solved by iterating over the sorted data and, for each row, checking whether OldId points to an existing NewId and, if so, setting OriginalId to the OriginalId of that row.
This can be solved by iteratively merging and updating columns, with the following algorithm:
Merge OldId to NewId.
For any row that did not match, set OriginalId to OldId.
If it did match, set OldId to the OldId of the matched row.
Repeat until all OriginalIds are filled in.
Feels like there should be a pandas-friendly way to do this via cumulative sums or similar.
Easy:
df.set_index('NewID', inplace=True)
# look up each OldID among the NewID index values; fall back to the row's own OldID when there is no match
df['OriginalId'] = df['OldID'].map(df['OldID']).fillna(df['OldID']).astype(int)
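A single lookup like this resolves only one link of the chain; if OldID can point several steps back (as with NewID 3 -> 2 -> 1 -> 0), the lookup can be repeated until the values stop changing, along the lines of the iterative algorithm sketched in the question:

import pandas as pd

df = pd.DataFrame({'NewID': [1, 2, 3, 5, 7, 8, 9],
                   'OldID': [0, 1, 2, 4, 6, 7, 5]})

lookup = df.set_index('NewID')['OldID']   # NewID -> OldID
orig = df['OldID'].copy()
while True:
    # follow one more link of the chain; keep the current value where there is no match
    resolved = orig.map(lookup).fillna(orig).astype(int)
    if resolved.equals(orig):
        break
    orig = resolved
df['OriginalId'] = orig
print(df)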