How can I find duplicate columns in a DataFrame and create a new column with the same name? The new column should be the result of applying the OR operator to these columns, after which the old duplicated columns are dropped.
Example:
I tried to create a single column 'job' that is the result of the OR operator applied to the two 'job' columns in the table below.
My table looks like this:
name   job    maried  children  job
John   True   True    True      True
Peter  True   False   True      True
Karl   False  True    True      True
jack   False  False   False     False
The result that I want is:
name   job    maried  children
John   True   True    True
Peter  True   False   True
Karl   True   True    True
jack   False  False   False
I tried to do this (df1 is my table):
df_join = pd.DataFrame()
df1_dulp = pd.DataFrame()
df_tmp = pd.DataFrame()
for column in df1.columns:
    df1_dulp = df1.filter(like=str(column))
    if df1_dulp.shape[1] >= 2:
        for i in range(0, df1_dulp.shape[1]):
            df_tmp += df1_dulp.iloc[:,i]
        if column in df1_dulp.columns:
            df1_dulp.drop(column, axis=1, inplace=True)
        df_join = df_join.join(df1_dulp, how = 'left', lsuffix='left', rsuffix='right')
The result is an empty table (df_join).
You can select the boolean columns with select_dtypes, then aggregate as OR with groupby.any on columns:
out = (df
   .select_dtypes(exclude='bool')
   .join(df.select_dtypes('bool')
           .groupby(level=0, axis=1, sort=False).any()
        )
)
output:
    name    job  maried  children
0   John   True    True      True
1  Peter   True   False      True
2   Karl   True    True      True
3   jack  False   False     False
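Note that groupby(..., axis=1) is deprecated in recent pandas releases (2.1 and later). A minimal sketch of an equivalent approach, assuming the same layout as the question's table, is to transpose the boolean columns, group on the index, and transpose back. The sample frame is rebuilt here for illustration, and job_2 is only a temporary, made-up name used to construct the duplicate column:

```python
import pandas as pd

# Build the question's table; 'job_2' is a temporary name so the dict keys
# stay unique, then the columns are renamed to create the duplicate 'job'.
df = pd.DataFrame({
    'name': ['John', 'Peter', 'Karl', 'jack'],
    'job': [True, True, False, False],
    'maried': [True, False, True, False],
    'children': [True, True, True, False],
    'job_2': [True, True, True, False],
})
df.columns = ['name', 'job', 'maried', 'children', 'job']

bools = df.select_dtypes('bool')
# Transpose so the duplicate column names become index labels, OR them
# together with any(), then transpose back.
merged = bools.T.groupby(level=0, sort=False).any().T

out = df.select_dtypes(exclude='bool').join(merged)
print(out)
```

With sort=False the merged columns keep the order of their first appearance, so 'job' stays ahead of 'maried' and 'children'.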
Related
My DataFrame has a column named "Teacher", and I want to know which rows in that column are empty.
Example:
print(df["Teacher"])
0
1
2 Richard
3
4 Richard
Name: Teacher, Length: 5, dtype: object
I know that if I do something like this:
R = ['R']
cond = df['Teacher'].str.startswith(tuple(R))
print(cond)
It prints the rows of that column and tells me, as a boolean, whether each teacher's name starts with R.
print(cond)
0 False
1 False
2 True
3 False
4 True
Name: Teacher, Length: 5, dtype: object
I want the same for the empty ones: return True when the value is empty and False when it's not, but I don't know how.
If "empty" means a missing value (NaN or None), use Series.isna:
cond = df['Teacher'].isna()
If "empty" means zero or more whitespace characters, use Series.str.contains:
cond = df['Teacher'].str.contains(r'^\s*$', na=False)
If "empty" means the empty string, compare against it:
cond = df['Teacher'] == ''
df = pd.DataFrame({'Teacher':['',' ', None, np.nan, 'Richard']})
cond1 = df['Teacher'].isna()
cond2 = df['Teacher'].str.contains(r'^\s*$', na=False)
cond3 = df['Teacher'] == ''
df = df.assign(cond1= cond1, cond2= cond2, cond3= cond3)
print (df)
   Teacher  cond1  cond2  cond3
0           False   True   True
1           False   True  False
2     None   True  False  False
3      NaN   True  False  False
4  Richard  False  False  False
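If all three meanings of "empty" should count at once, the conditions can be combined. The following is a minimal sketch rather than part of the original answer; the name is_empty is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Teacher': ['', ' ', None, np.nan, 'Richard']})

# Missing values are caught by isna(); after stripping whitespace,
# both '' and ' ' compare equal to the empty string.
is_empty = df['Teacher'].isna() | df['Teacher'].str.strip().eq('')
print(is_empty)
```

Rows where the value is missing produce NaN from str.strip(), which eq('') treats as False, so the isna() term is what flags them.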
I want to break a DataFrame into blocks, each running from one True value to the next True value:
data                   flag
MODS start 12/12/2020  True
Some data              False
Some data              False
MODS start 30/12/2020  True
Some data              False
Some data              False
To
data                   flag
MODS start 12/12/2020  True
Some data              False
Some data              False

data                   flag
MODS start 30/12/2020  True
Some data              False
Some data              False
Please help
You can use cumsum to create groups, then query the DataFrame for each group:
df = pd.DataFrame({'data':['MODS start 12/12/202','Some data', 'Some data', 'MODS starts 30/12/2020', 'Some data', 'Some data'],
                   'flag':[True, False, False, True, False, False]})
df['grp'] = df['flag'].cumsum()
print(df)
Output:
data flag grp
0 MODS start 12/12/202 True 1
1 Some data False 1
2 Some data False 1
3 MODS starts 30/12/2020 True 2
4 Some data False 2
5 Some data False 2
Then use:
df.query('grp == 1')
data flag grp
0 MODS start 12/12/202 True 1
1 Some data False 1
2 Some data False 1
and
df.query('grp == 2')
data flag grp
3 MODS starts 30/12/2020 True 2
4 Some data False 2
5 Some data False 2
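To get every block at once instead of querying each group number by hand, the same cumsum key can be passed to groupby. This is a sketch under the assumption that the data matches the question's table; the names blocks and block are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'data': ['MODS start 12/12/2020', 'Some data', 'Some data',
             'MODS start 30/12/2020', 'Some data', 'Some data'],
    'flag': [True, False, False, True, False, False],
})

# Each True in 'flag' increments the running sum, so it starts a new group.
blocks = [block for _, block in df.groupby(df['flag'].cumsum())]

for block in blocks:
    print(block, end='\n\n')
```

Any rows that appear before the first True would land in a group with key 0, so they would come back as an extra leading block.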
You can use numpy.split:
np.split(df, df.index[df.flag])[1:]
Here, I used [1:] because numpy.split also returns the group before the first index, even if it's empty.
That said, you can also use a simple list comprehension:
idx = df.index[df.flag].tolist() + [df.shape[0]]
[df.iloc[idx[i]:idx[i+1]] for i in range(len(idx)-1)]
Output (both approaches):
data flag
0 MODS start 12/12/2020 True
1 Some data False
2 Some data False
data flag
3 MODS start 30/12/2020 True
4 Some data False
5 Some data False
Get the indices of the rows where flag is True:
true_idx = df[df['flag']==True].index
n = len(true_idx)
Loop over true_idx and build a list of DataFrames, each running from one True index to the next:
new_dfs_list = [df.iloc[ true_idx[i]:true_idx[i+1], :] for i in range(n-1)]
Append the last DataFrame, from the last True index to the end of df:
new_dfs_list.append(df.iloc[ true_idx[n-1]:, :])
Access any of the new DataFrames by index:
print(new_dfs_list[-1])
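The snippet above passes index labels to iloc, which only lines up when the DataFrame has the default 0..n-1 index. A hedged variant that works with positions instead, so it also handles other indexes, might look like this. The letter index and the names starts, bounds and blocks are made up for illustration:

```python
import numpy as np
import pandas as pd

# Same data as the question, but with a non-default index to show that
# the split does not depend on the index labels.
df = pd.DataFrame({
    'data': ['MODS start 12/12/2020', 'Some data', 'Some data',
             'MODS start 30/12/2020', 'Some data', 'Some data'],
    'flag': [True, False, False, True, False, False],
}, index=list('abcdef'))

# Integer positions (not labels) of the rows where flag is True.
starts = np.flatnonzero(df['flag'].to_numpy()).tolist()
# Add the end of the frame as the final boundary.
bounds = starts + [len(df)]

blocks = [df.iloc[bounds[i]:bounds[i + 1]] for i in range(len(starts))]
print(blocks[-1])
```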
I have a dataframe 'df' from which I want to select the subset where 3 specific columns are not null.
So far I have tried to apply bool filtering
mask_df = df[['Empty', 'Peak', 'Full']].notnull()
which gives me the following result
Empty Peak Full
0 True False False
1 False False False
2 True True True
3 False False False
4 False False False
... ... ... ...
2775244 True True True
2775245 True True True
2775246 False False False
2775247 False False False
2775248 False False False
Now I want to select ONLY the rows where the mask for those 3 columns is True (i.e., rows where all 3 columns have non-null values). If I filter the original dataframe 'df' with this mask, I get the original dataframe full of null values, except where mask_df is True.
I probably can do this by applying a lambda function row-wise, but I would prefer to avoid that computation if there was a simpler way to do this.
Thanks in advance!
Use pandas.DataFrame.all with axis=1 to keep only the rows where all three mask columns are True:
df[mask_df.all(axis = 1)]
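As a small self-contained illustration, here is the row-wise all() applied to made-up data with the question's column names. The sketch also shows DataFrame.dropna with subset, which selects the same rows without building the mask explicitly. The column other and all the values are invented for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Empty': [1.0, np.nan, 3.0, np.nan],
    'Peak': [np.nan, np.nan, 5.0, np.nan],
    'Full': [np.nan, np.nan, 7.0, 8.0],
    'other': list('abcd'),
})
cols = ['Empty', 'Peak', 'Full']

# Keep rows where every one of the three columns is non-null.
mask_df = df[cols].notnull()
subset = df[mask_df.all(axis=1)]

# dropna drops a row if any of the listed columns is null,
# which leaves the same rows.
same = df.dropna(subset=cols)
print(subset)
```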
The attribute is_unique returns False on rows of my DataFrame although these should be unique. What is going on?
This works as expected:
multi_index = pd.MultiIndex.from_product([['A', 'B','C'], ['spam', 'foo'], [2019,2020]])
df = pd.DataFrame(index=multi_index, columns=['Value'])
df.index.is_unique # returns True as expected
But with my data it does not: I get False on every row in the dataframe.
df['unique'] = df.index.is_unique # returns False on all rows
df['unique'].sum() # returns 0
But if I select any row using a unique combination of index keys, just one row is returned, although the column 'unique' shows False on this one row:
df.sort_index(inplace=True) # to avoid indexing.py:1494: PerformanceWarning
df.loc[('00AO00', '2019-2020', 1319), 'unique'] # returns one row with value False
The DataFrame with my data is shared in this OneDrive folder. (I just did df.to_pickle('df.pkl') ).
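I'm not sure I understand, but note that Index.is_unique is a single boolean for the whole index, so assigning it to a column repeats that one value on every row. If you want to check all duplicated MultiIndex values, use Index.duplicated with keep=False to flag all duplicates and filter with boolean indexing: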
df = pd.read_pickle('df.pkl')
print (df[df.index.duplicated(keep=False)])
                                              Totaal  unique
Vestigingsnummer Jaar      Postcode leerling
00AZ00           2016-2017 Onbekend                1   False
                           Onbekend                2   False
00BW00           2016-2017 Onbekend                2   False
                           Onbekend                7   False
                 2017-2018 Onbekend                4   False
                                                 ...     ...
31BK00           2019-2020 Onbekend               12   False
31FM00           2018-2019 Onbekend                2   False
                           Onbekend                1   False
31LK00           2019-2020 Onbekend                1   False
                           Onbekend                1   False

[5057 rows x 2 columns]
To remove the duplicated values, drop keep=False so that the default keep='first' is used, then invert the mask and filter. The result has a unique MultiIndex:
df1 = df[~df.index.duplicated()]
print (df1.index.is_unique)
True
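The difference between the index-wide is_unique and a per-row flag can be seen on a tiny example. This is only a sketch on made-up data (the level names key and year and the values are assumptions, not the asker's data):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('A', 2019), ('A', 2019), ('B', 2020)], names=['key', 'year'])
df = pd.DataFrame({'Value': [1, 2, 3]}, index=idx)

# is_unique is one boolean describing the whole index.
print(df.index.is_unique)

# A per-row flag: True only for index keys that occur exactly once.
df['unique'] = ~df.index.duplicated(keep=False)
print(df)

# Keep the first occurrence of every key; the resulting index is unique.
df1 = df[~df.index.duplicated()]
print(df1.index.is_unique)
```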
So I have a pytest testing the results of a query that returns pandas dataframe.
I want to assert that every value in a particular column col contains a given input as a substring.
The expression below gives me the rows (as a dataframe) whose col value contains the input. How can I assert that this holds for all rows?
assert result_df[result_df['col'].astype(str).str.contains(category)].bool == True
doesn't work
Try this:
assert result_df['col'].astype(str).str.contains(category).all() == True
Please refer to the pandas docs for more info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.all.html
The reason your code doesn't work is that it compares the .bool attribute of the filtered dataframe to True instead of testing whether all of the values match.
I believe you need Series.all to check whether all values of the boolean Series are True:
assert result_df['col'].astype(str).str.contains(category).all()
Sample:
result_df = pd.DataFrame({
    'col': list('aaabbb')
})
print (result_df)
col
0 a
1 a
2 a
3 b
4 b
5 b
category = 'b'
assert result_df['col'].astype(str).str.contains(category).all()
AssertionError
Detail:
print (result_df['col'].astype(str).str.contains(category))
0 False
1 False
2 False
3 True
4 True
5 True
Name: col, dtype: bool
print (result_df['col'].astype(str).str.contains(category).all())
False
category = 'a|b'
assert result_df['col'].astype(str).str.contains(category).all()
print (result_df['col'].astype(str).str.contains(category))
0 True
1 True
2 True
3 True
4 True
5 True
Name: col, dtype: bool
print (result_df['col'].astype(str).str.contains(category).all())
True
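As the 'a|b' sample shows, str.contains treats its pattern as a regular expression by default. If the input should be matched as a literal substring, regex=False can be passed. A minimal sketch for use in a test, where the helper name all_contain and the sample values are made up for illustration:

```python
import pandas as pd

def all_contain(df, col, part):
    """Return True if every value in df[col] contains `part` literally."""
    # regex=False makes characters such as '|' or '.' match themselves.
    return bool(df[col].astype(str).str.contains(part, regex=False).all())

result_df = pd.DataFrame({'col': ['cat_a', 'cat_b', 'cat_c']})

assert all_contain(result_df, 'col', 'cat')
# With regex=False, 'a|b' is a literal string that no value contains.
assert not all_contain(result_df, 'col', 'a|b')
```

Note that all() returns True for an empty Series, so a test may also want to assert that result_df is not empty.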
Found it: assert result_df['col'].astype(str).str.contains(category).all() works (thanks to @jezrael for suggesting all).
Note that all must be called with parentheses. A bare .all or .bool is a bound method object, which is always truthy, so the assertion would pass regardless of the data.