Even out quantity of rows with a certain pair - python

So, I have a pandas data frame that looks something like this:
data | Flag | Set
-----------------------------
0 | True | A
30 | True | A
-1 | False | A
20 | True | B
5 | False | B
19 | False | B
7 | False | C
8 | False | c
How can I (elegantly) drop rows in such a way that, for each set, there is an equal number of True and False Flags? The output would look something like this:
data | Flag | Set
-----------------------------
0 | True | A
-1 | False | A
20 | True | B
5 | False | B
For A there is only one False flag, for B only one True flag, and for C there are zero True flags. I know how to brute-force this, but I feel like there's some elegant way that I don't know about.

First get counts of Flag values per Set with crosstab, filter out rows containing 0 (those Sets have only True or only False values) and take the row-wise minimum into dictionary d:
df1 = pd.crosstab(df['Set'], df['Flag'])
d = df1[df1.ne(0).all(axis=1)].min(axis=1).to_dict()
print (d)
{'A': 1, 'B': 1}
Then filter rows whose Set is among the dictionary keys, and use DataFrame.head per group with the length looked up in this dict:
df1 = (df[df['Set'].isin(d.keys())]
         .groupby(['Set', 'Flag'], group_keys=False)
         .apply(lambda x: x.head(d[x.name[0]])))
print (df1)
data Flag Set
2 -1 False A
0 0 True A
4 5 False B
3 20 True B
EDIT: To verify the solution, here is what it returns when Set A has two True and two False rows:
print (df)
data Flag Set
0 0 True A
1 8 True A
2 30 True A
3 -1 False A
4 -14 False A
5 20 True B
6 5 False B
7 19 False B
8 7 False C
9 8 False c
df1 = pd.crosstab(df['Set'], df['Flag'])
d = df1[df1.ne(0).all(axis=1)].min(axis=1).to_dict()
print (d)
{'A': 2, 'B': 1}
df1 = (df[df['Set'].isin(d.keys())]
         .groupby(['Set', 'Flag'], group_keys=False)
         .apply(lambda x: x.head(d[x.name[0]])))
print (df1)
data Flag Set
3 -1 False A
4 -14 False A
0 0 True A
1 8 True A
6 5 False B
5 20 True B
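The groupby/apply result above comes back ordered by (Set, Flag) rather than by the original index; as a small optional follow-up (not part of the original answer), sorting by the index restores the input order:
df1 = df1.sort_index()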

This might be a possible solution consisting of 3 steps:
1. Remove all sets that don't have both True and False flags (here C).
2. Count the number of rows that are wanted for each set-flag combination.
3. Remove all rows that exceed that counted number of rows.
This yields the following code:
import pandas as pd

df = pd.DataFrame(data={"data": [0, 30, -1, 20, 5, 19, 7, 8],
                        "Flag": [True, True, False, True, False, False, False, False],
                        "Set": ["A", "A", "A", "B", "B", "B", "C", "C"]})
# 1. removing sets with only one of both flags
reducer = df.groupby("Set")["Flag"].transform("nunique") > 1
df_reduced = df.loc[reducer]
# 2. counting the minimum number of rows per set
counts = df_reduced.groupby(["Set", "Flag"]).count().groupby("Set").min()
# 3. reducing each set and flag to the minimum number of rows
df_equal = df_reduced.groupby(["Set", "Flag"]) \
                     .apply(lambda x: x.head(counts.loc[x["Set"].values[0]][0])) \
                     .reset_index(drop=True)

EDIT: I came up with an easy-to-understand, concise solution:
Simply get the .cumcount() grouped by Set and Flag.
Check whether each combination of Set and the cumcount result above (cc below in the code) is duplicated. If a row's combination has no duplicate, that means it needs to be removed.
In[1]:
data Flag Set
0 0 True A
1 8 True A
2 30 True A
3 0 True A
4 8 True A
5 30 True A
6 -1 False A
7 -14 False A
8 -1 False A
9 -14 False A
10 20 True B
11 5 False B
12 19 False B
13 7 False C
14 8 False c
EDIT 2: Per @jezrael, I could simplify the three lines of code below further to:
df = (df[df.assign(cc = df.groupby(['Set', 'Flag'])
                          .cumcount()).duplicated(['Set', 'cc'], keep=False)])
Further breakdown of code below.
df['cc'] = df.groupby(['Set', 'Flag']).cumcount()
s = df.duplicated(['Set','cc'], keep=False)
df = df[s].drop('cc', axis=1)
df
Out[1]:
data Flag Set
0 0 True A
1 8 True A
2 30 True A
3 0 True A
6 -1 False A
7 -14 False A
8 -1 False A
9 -14 False A
10 20 True B
11 5 False B
Prior to dropping, this is how the data would look:
df['cc'] = df.groupby(['Set', 'Flag']).cumcount()
df['s'] = df.duplicated(['Set','cc'], keep=False)
# df = df[df['s']].drop('cc', axis=1)
df
Out[1]:
data Flag Set cc s
0 0 True A 0 True
1 8 True A 1 True
2 30 True A 2 True
3 0 True A 3 True
4 8 True A 4 False
5 30 True A 5 False
6 -1 False A 0 True
7 -14 False A 1 True
8 -1 False A 2 True
9 -14 False A 3 True
10 20 True B 0 True
11 5 False B 0 True
12 19 False B 1 False
13 7 False C 0 False
14 8 False c 0 False
Then, the False rows in column s are dropped with df = df[df['s']]
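For reuse, this cumcount/duplicated trick can be wrapped in a small helper. This is just a sketch of the same idea with assumed column names Set and Flag, not code from the answers above:
def balance_flags(df, set_col='Set', flag_col='Flag'):
    # number each row within its (set, flag) group
    cc = df.groupby([set_col, flag_col]).cumcount()
    # a row survives only if its (set, cc) pair occurs for both flag values
    keep = df.assign(cc=cc).duplicated([set_col, 'cc'], keep=False)
    return df[keep]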

Related

how to change suffix on new df column of df when merging iteratively

I have a temp df and a dflst.
The temp df has as columns the unique column names from the dataframes in dflst.
The dflst has a dynamic length; my problem arises when len(dflst) >= 4.
All DFs (temp and all the ones in dflst) have columns with True/False values and a p column with numbers.
Code to recreate the data:
import itertools
import numpy as np
import pandas as pd

# making temp df
var_cols = ['a', 'b', 'c', 'd']
temp = pd.DataFrame(list(itertools.product([False, True], repeat=len(var_cols))), columns=var_cols)
# making dflst
df0 = pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'b']))), columns=['a', 'b'])
df0['p'] = np.random.randint(1, 5, df0.shape[0])
df1 = pd.DataFrame(list(itertools.product([False, True], repeat=len(['c', 'd']))), columns=['c', 'd'])
df1['p'] = np.random.randint(1, 5, df1.shape[0])
df2 = pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'c']))), columns=['a', 'c'])
df2['p'] = np.random.randint(1, 5, df2.shape[0])
df3 = pd.DataFrame(list(itertools.product([False, True], repeat=len(['d']))), columns=['d'])
df3['p'] = np.random.randint(1, 5, df3.shape[0])
dflst = [df0, df1, df2, df3]
I want to merge the dfs in dflst so that the 'p' column values from each df in dflst end up in temp, in the rows whose values are compatible between the two.
I am currently doing it with pd.merge as follows:
for df in dflst:
    temp = temp.merge(df, on=list(df)[:-1], how='right')
but this results in a df that has the same name for different columns when dflst has 4 or more dfs. I understand that this is due to merge's suffixes, but it creates problems with column indexing.
How can I have unique names on the new columns added to temp iteratively?
I don't fully understand what you want but IIUC:
for i, df in enumerate(dflst):
    temp = temp.merge(df.rename(columns={'p': f'p{i}'}),
                      on=df.columns[:-1].tolist(), how='right')
print(temp)
# Output:
a b c d p0 p1 p2 p3
0 False False False False 4 2 2 1
1 False True False False 3 2 2 1
2 False False True False 4 3 4 1
3 False True True False 3 3 4 1
4 True False False False 3 2 2 1
5 True True False False 3 2 2 1
6 True False True False 3 3 1 1
7 True True True False 3 3 1 1
8 False False False True 4 4 2 3
9 False True False True 3 4 2 3
10 False False True True 4 1 4 3
11 False True True True 3 1 4 3
12 True False False True 3 4 2 3
13 True True False True 3 4 2 3
14 True False True True 3 1 1 3
15 True True True True 3 1 1 3
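An alternative sketch (my own variation, not part of the answer above): keep each frame's column named p and let merge's suffixes parameter disambiguate the duplicates. The resulting columns are p, p_1, p_2, p_3, with the unsuffixed p coming from the first frame:
for i, df in enumerate(dflst):
    # suffix only the incoming right-hand 'p' column on a name clash
    temp = temp.merge(df, on=list(df)[:-1], how='right',
                      suffixes=('', f'_{i}'))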

Creating a matrix of 0 and 1 that maintains its shape

I am attempting to create a matrix of 1s where every 2nd column's value is greater than the previous column's value and 0s where it is less. When I use np.where it just flattens the result; I want to keep the first column, the last column, and the original shape.
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
newdf = pd.DataFrame()
for x in df.columns[1::2]:
    if bool(df.iloc[:, df.columns.get_loc(x)] <=
            df.iloc[:, df.columns.get_loc(x) - 1]):
        newdf.append(1)
    else:
        newdf.append(0)
This question was a little vague, but I will answer a question that I think gets at the heart of what you are asking:
Say you start with a matrix:
df1 = pd.DataFrame(np.random.randn(8, 4),columns=['A', 'B', 'C', 'D'])
Which creates:
A B C D
0 2.464130 0.796172 -1.406528 0.332499
1 -0.370764 -0.185119 -0.514149 0.158218
2 -2.164707 0.888354 0.214550 1.334445
3 2.019189 0.910855 0.582508 -0.861778
4 1.574337 -1.063037 0.771726 -0.196721
5 1.091648 0.407703 0.406509 -1.052855
6 -1.587963 -1.730850 0.168353 -0.899848
7 0.225723 0.042629 2.152307 -1.086585
Now you can use DataFrame.shift() to shift the entire matrix, and then check the resulting columns item by item in one step. For example:
df1.shift(-1)
Creates:
A B C D
0 -0.370764 -0.185119 -0.514149 0.158218
1 -2.164707 0.888354 0.214550 1.334445
2 2.019189 0.910855 0.582508 -0.861778
3 1.574337 -1.063037 0.771726 -0.196721
4 1.091648 0.407703 0.406509 -1.052855
5 -1.587963 -1.730850 0.168353 -0.899848
6 0.225723 0.042629 2.152307 -1.086585
7 NaN NaN NaN NaN
And now you can compare the shifted frame against the original to get a new boolean matrix, like so:
df2 = df1.shift(-1) > df1
which returns:
A B C D
0 False False True False
1 False True True True
2 True True True False
3 False False True True
4 False True False False
5 False False False True
6 True True True False
7 False False False False
To finish answering your question, we convert the True/False values to 1/0 like this:
df2 = df2.applymap(lambda x: 1 if x == True else 0)
Which returns:
A B C D
0 0 0 1 0
1 0 1 1 1
2 1 1 1 0
3 0 0 1 1
4 0 1 0 0
5 0 0 0 1
6 1 1 1 0
7 0 0 0 0
In one line:
df2 = (df1.shift(-1)>df1).replace({True:1,False:0})
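Equivalently, since the comparison already yields booleans, a cast does the same job (a minor variant, not from the answer above):
df2 = (df1.shift(-1) > df1).astype(int)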

How to add a new column in Pandas by defining my own function that checks values in many columns?

So I have a very big data set, and I need to create a function that checks the values in the same row across multiple columns; obviously the value to check is different for each column.
Then, if the check is true for all the given columns, I want to return something and add a new column to the DF to use as a flag for later filtering.
I think you need to compare with eq and then use all to check if all values are True:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[1,5,6],
                   'C':[1,8,9],
                   'D':[1,3,5],
                   'E':[1,3,6],
                   'F':[7,4,3]})
print (df)
A B C D E F
0 1 1 1 1 1 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
#check same values in columns A,B,C,E
cols = ['B','C','E']
print (df[cols].eq(df.A, axis=0))
B C E
0 True True True
1 False False False
2 False False False
print (df[cols].eq(df.A, axis=0).all(axis=1))
0 True
1 False
2 False
dtype: bool
df['col'] = df[cols].eq(df.A, axis=0).all(axis=1)
print (df)
A B C D E F col
0 1 1 1 1 1 7 True
1 2 5 8 3 3 4 False
2 3 6 9 5 6 3 False
EDIT based on the comment:
You need to create a boolean mask with & (and), | (or) or ~ (not):
print ((df.A == 1) & (df.B > 167) & (df.B <=200))
0 False
1 False
2 False
dtype: bool
df['col'] = (df.A == 1) & (df.B > 167) & (df.B <=200)
print (df)
A B C D E F col
0 1 1 1 1 1 7 False
1 2 5 8 3 3 4 False
2 3 6 9 5 6 3 False
Please try using masks. For example:
mask=df['a']==4
df=df[mask]
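If each column has its own value to check against, one hedged sketch (the column names and target values below are made up for illustration, not from the question) is to collect the per-column conditions and combine them with all:
# hypothetical per-column target values
targets = {'B': 5, 'C': 8, 'E': 3}
mask = pd.concat([df[c].eq(v) for c, v in targets.items()], axis=1).all(axis=1)
df['col'] = mask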

Identifying closest value in a column for each filter using Pandas

I have a data frame with categories and values. I need to find the value in each category closest to a given value. I think I'm close, but I can't really get the right output when applying the results of argsort to the original dataframe.
For example, if the input was defined as in the code below, the output should contain only (a, 1, True), (b, 2, True), (c, 2, True), and all other isClosest values should be False.
If multiple values are equally close, the first value listed should be the one marked.
Here is the code I have which works but I can't get it to reapply to the dataframe correctly. I would love some pointers.
df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['values'] = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
df['isClosest'] = False
uniqueCategories = df['category'].unique()
for c in uniqueCategories:
    filteredCategories = df[df['category'] == c]
    sortargs = (filteredCategories['values'] - 2.0).abs().argsort()
    # how can I use sortargs to set isClosest=True in df for the value closest to 2.0 in each category?
You can create a column of absolute differences:
df['dif'] = (df['values'] - 2).abs()
df
Out:
category values dif
0 a 1 1
1 b 2 0
2 b 3 1
3 b 4 2
4 c 5 3
5 a 4 2
6 b 3 1
7 c 2 0
8 c 1 1
9 a 0 2
And then use groupby.transform to check whether the minimum value of each group is equal to the difference you calculated:
df['is_closest'] = df.groupby('category')['dif'].transform('min') == df['dif']
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
df.groupby('category')['dif'].idxmin() would also give you the indices of the closest values for each category. You can use that for mapping too.
For selection:
df.loc[df.groupby('category')['dif'].idxmin()]
Out:
category values dif
0 a 1 1
1 b 2 0
7 c 2 0
For assignment:
df['is_closest'] = False
df.loc[df.groupby('category')['dif'].idxmin(), 'is_closest'] = True
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
The difference between these approaches is that if you check equality against the difference, you would get True for all rows in case of ties. However, with idxmin it will return True for the first occurrence (only one for each group).
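A small made-up illustration of that tie behaviour, where values 1 and 3 in category a are equally close to 2:
tie = pd.DataFrame({'category': ['a', 'a'], 'values': [1, 3]})
tie['dif'] = (tie['values'] - 2).abs()
print(tie.groupby('category')['dif'].transform('min') == tie['dif'])
# 0    True
# 1    True     -> both rows marked by the transform comparison
print(tie.index.isin(tie.groupby('category')['dif'].idxmin()))
# [ True False]  -> only the first occurrence with idxmin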
Solution with DataFrameGroupBy.idxmin: get the indexes of the minimal values per group, then assign a boolean mask built with Index.isin to the isClosest column:
idx = (df['values'] - 2).abs().groupby([df['category']]).idxmin()
print (idx)
category
a 0
b 1
c 7
Name: values, dtype: int64
df['isClosest'] = df.index.isin(idx)
print (df)
category values isClosest
0 a 1 True
1 b 2 True
2 b 3 False
3 b 4 False
4 c 5 False
5 a 4 False
6 b 3 False
7 c 2 True
8 c 1 False
9 a 0 False

All rows within a given column must match, for all columns

I have a Pandas DataFrame of data in which all rows within a given column must match:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
                   'B': [2,2,2,2,2,2,2,2,2,2],
                   'C': [3,3,3,3,3,3,3,3,3,3],
                   'D': [4,4,4,4,4,4,4,4,4,4],
                   'E': [5,5,5,5,5,5,5,5,5,5]})
In [10]: df
Out[10]:
A B C D E
0 1 2 3 4 5
1 1 2 3 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
I would like a quick way to know if there is any variance anywhere in the DataFrame. At this point, I don't need to know which values have varied, since I will be going in to handle those later. I just need a quick way to know whether the DataFrame needs further attention or whether I can ignore it and move on to the next one.
I can check any given column using
(df.loc[:,'A'] != df.loc[0,'A']).any()
but my Pandas knowledge limits me to iterating through the columns (I understand iteration is frowned upon in Pandas) to compare all of them:
A B C D E
0 1 2 3 4 5
1 1 2 9 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
for col in df.columns:
    if (df.loc[:, col] != df.loc[0, col]).any():
        print("Found a fail in col %s" % col)
        break
Out: Found a fail in col C
Is there an elegant way to return a boolean indicating whether any value within any column of the dataframe does not match the rest of that column... possibly without iteration?
Given your example dataframe:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
                   'B': [2,2,2,2,2,2,2,2,2,2],
                   'C': [3,3,3,3,3,3,3,3,3,3],
                   'D': [4,4,4,4,4,4,4,4,4,4],
                   'E': [5,5,5,5,5,5,5,5,5,5]})
You can use the following:
df.apply(pd.Series.nunique) > 1
Which gives you:
A False
B False
C False
D False
E False
dtype: bool
If we then force a couple of errors:
df.loc[3, 'C'] = 0
df.loc[5, 'B'] = 20
You then get:
A False
B True
C True
D False
E False
dtype: bool
You can compare the entire DataFrame to the first row like this:
In [11]: df.eq(df.iloc[0], axis='columns')
Out[11]:
A B C D E
0 True True True True True
1 True True True True True
2 True True True True True
3 True True True True True
4 True True True True True
5 True True True True True
6 True True True True True
7 True True True True True
8 True True True True True
9 True True True True True
then test if all values are true:
In [13]: df.eq(df.iloc[0], axis='columns').all()
Out[13]:
A True
B True
C True
D True
E True
dtype: bool
In [14]: df.eq(df.iloc[0], axis='columns').all().all()
Out[14]: True
You can use apply to loop through the columns and check, for each column, whether any element differs from the first (True means the column is not constant):
df.apply(lambda col: (col != col[0]).any())
# A False
# B False
# C False
# D False
# E False
# dtype: bool
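To collapse any of these checks into the single boolean the question asks for (a small follow-up of my own, assuming the same df), either reduction works:
needs_attention = df.ne(df.iloc[0], axis='columns').any().any()
# or equivalently
needs_attention = (df.nunique() > 1).any()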
