I have the following problem: I want to detect whether 2 or more consecutive values in a column of a dataframe are greater than 0.5. My approach so far: I check each cell for whether its value is less than 0.5 and record the result in a column "condition" (see table below).
Now the question is: how can I detect in that column whether 2 consecutive cells hold the same value (rows 4-5)? Or is it possible to detect the problem directly in the data column?
If 2 consecutive cells are False, the dataframe can be discarded.
I would be very grateful for any help!
   data  condition
0  0.1   True
1  0.1   True
2  0.25  True
3  0.3   True
4  0.6   False
5  0.7   False
6  0.3   True
7  0.1   True
6  0.9   False
7  0.1   True
You can compute a boolean series of the values greater than 0.5 (i.e. True when invalid), then apply a boolean AND (&) between this series and its shift: any two consecutive True values will yield True. Finally, check whether any True is present to decide whether to discard the dataset:
s = df['data'].gt(0.5)
(s & s.shift()).any()
Output: True -> the dataset is invalid
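For reference, a minimal self-contained sketch that rebuilds the example data from the question and applies this check (the column name data is assumed as above):

import pandas as pd

df = pd.DataFrame({'data': [0.1, 0.1, 0.25, 0.3, 0.6, 0.7, 0.3, 0.1, 0.9, 0.1]})

s = df['data'].gt(0.5)           # True where a value exceeds 0.5
invalid = (s & s.shift()).any()  # True if two consecutive values exceed 0.5
print(invalid)                   # True, because of the 0.6/0.7 pair at rows 4-5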
You can use the .diff method and check whether the difference is equal to zero:
df['eq_to_prev'] = df.data.diff().eq(0)
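To detect two consecutive False values in the condition column directly, a shift-based comparison is a safe sketch (it avoids subtracting booleans; the condition values below are copied from the question's table):

import pandas as pd

df = pd.DataFrame({'condition': [True, True, True, True, False,
                                 False, True, True, False, True]})

# True where this row and the previous row are both False
two_false = ~df['condition'] & ~df['condition'].shift(fill_value=True)
print(two_false.any())  # True -> the dataframe can be discarded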
I am trying to make an article similarity checker by comparing 6 articles with a list of articles that I obtained from an API. I have used cosine similarity to compare each article one by one with the 6 articles that I am using as a baseline.
My dataframe now looks like this:
id   Article     cosinesin1  cosinesin2  cosinesin3  cosinesin4  cosinesin5  cosinesin6  Similar
id1  [Article1]  0.2         0.5         0.6         0.8         0.7         0.8         True
id2  [Article2]  0.1         0.2         0.03        0.8         0.2         0.45        False
So I want to add a Similar column to my dataframe that checks the values of each cosinesin column (1-6) and returns True if at least 3 out of 6 have a value greater than 0.5, otherwise False.
Is there any way to do this in Python?
Thanks
In Python, you can treat True and False as integers, 1 and 0, respectively.
So if you compare all the similarity metrics to 0.5, you can sum over the resulting Boolean DataFrame along the columns, to get the number of comparisons that resulted in True for each row. Comparing those numbers to 3 yields the column you want:
cos_cols = [f"cosinesin{i}" for i in range(1, 7)]
df['Similar'] = (df[cos_cols] > 0.5).sum(axis=1) >= 3
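A quick demonstration on the two rows from the question (values copied from the table above):

import pandas as pd

df = pd.DataFrame({
    'id': ['id1', 'id2'],
    'cosinesin1': [0.2, 0.1], 'cosinesin2': [0.5, 0.2],
    'cosinesin3': [0.6, 0.03], 'cosinesin4': [0.8, 0.8],
    'cosinesin5': [0.7, 0.2], 'cosinesin6': [0.8, 0.45],
})

cos_cols = [f"cosinesin{i}" for i in range(1, 7)]
df['Similar'] = (df[cos_cols] > 0.5).sum(axis=1) >= 3
print(df[['id', 'Similar']])  # id1 -> True (4 of 6), id2 -> False (1 of 6)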
I am facing issues with pandas row filtering. I am trying to filter out teams whose weights do not sum to one.
dfteam
Team  Weight
A     0.2
A     0.5
A     0.2
A     0.1
B     0.5
B     0.25
B     0.25
dfteamtemp = dfteam.groupby(['Team'], as_index=False)['Weight'].sum()
dfweight = dfteamtemp[(dfteamtemp['Weight'].astype(float)!=1.0)]
dfweight
Team Weight
0 A 1.0
I am not sure about the reason for this output. I should get an empty dataframe, but it is giving me Team A even though the sum is 1.
You are a victim of floating point inaccuracies. The weights of the first group do not add up to exactly 1.0:
df.groupby('Team').Weight.sum().iat[0]
0.99999999999999989
You can resolve this by using np.isclose instead -
np.isclose(df.groupby('Team').Weight.sum(), 1.0)
array([ True, True], dtype=bool)
And filter on this array. Or, as #ayhan suggested, use groupby + filter -
df.groupby('Team').filter(lambda x: not np.isclose(x['Weight'].sum(), 1))
Empty DataFrame
Columns: [Team, Weight]
Index: []
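To filter on the boolean array from np.isclose explicitly, a minimal sketch (keeping only the teams whose weights sum to approximately 1) could look like this:

import numpy as np

sums = df.groupby('Team').Weight.sum()
bad = sums.index[~np.isclose(sums, 1.0)]  # teams whose weights do not sum to ~1
df_filtered = df[~df['Team'].isin(bad)]   # keep only the teams that pass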
I am trying to preprocess a data set, and I'd like to delete very sparse columns by setting a threshold: columns in which the fraction of entries greater than zero falls below the threshold should be removed.
The code below should get the job done, but I do not understand how it works. Kindly assist with an explanation or suggestions on how I can get this done. Thanks!
sparse_col_idx = ((x_sparse > 0).mean(0) > 0.05).A.ravel()
x_sparse has shape (12060, 272776)
Let's break this down into steps. Assuming x_sparse is a DataFrame, x_sparse > 0 will return a DataFrame with the exact same dimensions, index and columns, with each value True or False depending on the condition given (here, whether the value is > 0).
.mean(0)
This takes the mean of each column. Since False evaluates as 0 and True evaluates as 1, mean() returns the fraction of the column that meets the criterion. You are down to a Series at this point, where the column names are the index and the values are the fractions that meet the criterion.
> 0.05
This turns the previous Series into a boolean Series indicating which columns meet the criterion.
.A.ravel()
This isn't necessary for a DataFrame; it suggests x_sparse is actually a SciPy sparse matrix, where .mean(0) returns a numpy.matrix, so .A converts the result to an ndarray and .ravel() flattens it to 1-D. I will come up with a simple example below to show the steps.
Create a DataFrame with random normal values
np.random.seed(3)
x_sparse = pd.DataFrame(data=np.random.randn(100, 5), columns=list('abcde'))
print(x_sparse.head())
Output:
a b c d e
0 1.788628 0.436510 0.096497 -1.863493 -0.277388
1 -0.354759 -0.082741 -0.627001 -0.043818 -0.477218
2 -1.313865 0.884622 0.881318 1.709573 0.050034
3 -0.404677 -0.545360 -1.546477 0.982367 -1.101068
4 -1.185047 -0.205650 1.486148 0.236716 -1.023785
# the argument 0 is unnecessary; the default is to average over the columns
(x_sparse > 0).mean()
Output
a 0.48
b 0.52
c 0.44
d 0.55
e 0.45
# create a threshold
threshold = .5
(x_sparse > 0).mean() > threshold
Output
a False
b True
c False
d True
e False
Keep specific columns
threshold = .5
keep = (x_sparse > 0).mean() > threshold
x_sparse[x_sparse.columns[keep]]
Output
b d
0 0.436510 -1.863493
1 -0.082741 -0.043818
2 0.884622 1.709573
3 -0.545360 0.982367
4 -0.205650 0.236716
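For completeness, a hedged sketch of the SciPy sparse case (where .A.ravel() is actually needed); the matrix here is random toy data, not the questioner's:

import numpy as np
import scipy.sparse as sp

x_sparse = sp.random(100, 5, density=0.3, format='csr', random_state=3)

col_density = (x_sparse > 0).mean(0)     # numpy.matrix of shape (1, 5)
keep = (col_density > 0.05).A.ravel()    # .A -> ndarray, .ravel() -> 1-D boolean mask
x_kept = x_sparse[:, np.where(keep)[0]]  # keep only the dense-enough columns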
I have a scenario where I want to check, over consecutive pandas dataframe rows in a column (Col1), that a minimum criterion (0.6) is met, where a run only counts if its starting value is at least 0.7, i.e.:
Col1
0.3
0.5
0.55
0.8 = true
0.65 = true
0.9 = true
0.61 = true
0.3
0.6
0.67
0.74 = true
0.63 = true
0.61 = true
In other words, the check is True if the value is at least 0.7, or if the value is at least 0.6 and the previous values in the run are all at least 0.6, with the first value in the consecutive series being at least 0.7.
It will be running on a very large data set, so it needs to be efficient. I am thinking something with shift() would work... but I can't get it quite right.
You can use Series.where() to construct the logical Series.
Steps:
initialize the Series with NaN values;
assign True to all values of at least 0.7;
assign False to all values of at most 0.6;
forward-fill the values between 0.6 and 0.7, since their truth depends on previous values;
fill possible missing values at the beginning of the Series;
convert the dtype to boolean (optional)
so:
import pandas as pd
import numpy as np
df['check'] = np.nan
df['check'] = (df['check'].where(df.Col1 < 0.7, True)
                          .where(df.Col1 > 0.6, False)
                          .ffill().fillna(False)
                          .astype(bool))
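Running this on the sample values from the question reproduces the expected flags (the DataFrame construction below is just for demonstration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [0.3, 0.5, 0.55, 0.8, 0.65, 0.9, 0.61,
                            0.3, 0.6, 0.67, 0.74, 0.63, 0.61]})

df['check'] = np.nan
df['check'] = (df['check'].where(df.Col1 < 0.7, True)
                          .where(df.Col1 > 0.6, False)
                          .ffill().fillna(False)
                          .astype(bool))
print(df)  # check is True exactly for the rows marked "= true" above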
I have a pandas dataframe with several groups, and I would like to exclude groups where some conditions (in a specific column) are not met, e.g. delete group B here because it has a non-number value in column "crit1".
I can delete specific columns based on a condition, e.g. df.loc[:, (df > 0).any(axis=0)], but that doesn't delete the whole group, and somehow I can't make the next step and apply this to the whole group.
name  crit1  crit2
A     0.3    4
A     0.7    6
B     inf    4
B     0.4    3
So the result after this filtering (allow only finite values) should be:
name  crit1  crit2
A     0.3    4
A     0.7    6
You can use groupby and filter. For the example you give, you can check whether np.inf exists in a group and filter on that condition:
import pandas as pd
import numpy as np
df.groupby('name').filter(lambda g: (g != np.inf).all().all())
# name crit1 crit2
# 0 A 0.3 4
# 1 A 0.7 6
If the predicate only applies to one column, you can access that column as an attribute of the group (e.g. g.crit1), for example:
df.groupby('name').filter(lambda g: (g.crit1 != np.inf).all())
# name crit1 crit2
# 0 A 0.3 4
# 1 A 0.7 6
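If "non-number" should also cover NaN, a hedged alternative (assuming the numeric columns crit1 and crit2 from the example) is to require every value to be finite:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B', 'B'],
                   'crit1': [0.3, 0.7, np.inf, 0.4],
                   'crit2': [4, 6, 4, 3]})

# keep only groups whose numeric values are all finite (drops inf and NaN)
out = df.groupby('name').filter(lambda g: np.isfinite(g[['crit1', 'crit2']]).all().all())
print(out)  # only the two A rows remain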