Split pandas dataframe into blocks based on column values

I want to break a dataframe into blocks, each running from one True value to the next True value:
data                    flag
MODS start 12/12/2020   True
Some data               False
Some data               False
MODS start 30/12/2020   True
Some data               False
Some data               False
To:
data                    flag
MODS start 12/12/2020   True
Some data               False
Some data               False
data                    flag
MODS start 30/12/2020   True
Some data               False
Some data               False
Please help

You can use cumsum to create groups, then query the dataframe for each group:
import pandas as pd
df = pd.DataFrame({'data': ['MODS start 12/12/202', 'Some data', 'Some data', 'MODS starts 30/12/2020', 'Some data', 'Some data'],
                   'flag': [True, False, False, True, False, False]})
df['grp'] = df['flag'].cumsum()
print(df)
Output:
                     data   flag  grp
0    MODS start 12/12/202   True    1
1               Some data  False    1
2               Some data  False    1
3  MODS starts 30/12/2020   True    2
4               Some data  False    2
5               Some data  False    2
Then use:
df.query('grp == 1')
                   data   flag  grp
0  MODS start 12/12/202   True    1
1             Some data  False    1
2             Some data  False    1
and
df.query('grp == 2')
                     data   flag  grp
3  MODS starts 30/12/2020   True    2
4               Some data  False    2
5               Some data  False    2
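If you want all the blocks at once rather than querying one group at a time, the same cumsum idea can be passed straight to groupby. This is a minimal sketch, assuming the sample data from the question; the name blocks is only illustrative. Rows that come before the first True would land in a group numbered 0.

```python
import pandas as pd

df = pd.DataFrame({
    "data": ["MODS start 12/12/2020", "Some data", "Some data",
             "MODS start 30/12/2020", "Some data", "Some data"],
    "flag": [True, False, False, True, False, False],
})

# The running sum of the flag increases by one at every True row,
# so it labels each block with its own group number.
blocks = [block for _, block in df.groupby(df["flag"].cumsum())]

for block in blocks:
    print(block, end="\n\n")
```

Each element of blocks is an ordinary dataframe that keeps the original row index.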

You can use numpy.split:
import numpy as np
np.split(df, df.index[df.flag])[1:]
Here, I used [1:] because numpy.split also returns the group before the first split index, even when that group is empty.
That said, you can also use a simple list comprehension:
idx = df.index[df.flag].tolist() + [df.shape[0]]
[df.iloc[idx[i]:idx[i+1]] for i in range(len(idx)-1)]
Output (both approaches):
                    data   flag
0  MODS start 12/12/2020   True
1              Some data  False
2              Some data  False
                    data   flag
3  MODS start 30/12/2020   True
4              Some data  False
5              Some data  False

Get the indices of the rows where flag is True:
true_idx = df.index[df['flag']]
n = len(true_idx)
Loop over true_idx and build a list of dataframes, each running from one True index to the next:
new_dfs_list = [df.iloc[true_idx[i]:true_idx[i+1], :] for i in range(n-1)]
Append the last dataframe, which runs from the last True index to the end of df:
new_dfs_list.append(df.iloc[true_idx[n-1]:, :])
Access any of the new dataframes by position:
print(new_dfs_list[-1])
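Note that the snippet above passes index labels to iloc, which only lines up when df has the default 0..n-1 index. A possible variant that works on integer positions instead is sketched below; the helper name split_on_flag and the letter index are hypothetical, chosen only to illustrate the non-default-index case. Rows before the first True are dropped, as in the original.

```python
import numpy as np
import pandas as pd

def split_on_flag(df, flag_col="flag"):
    """Split df into blocks that each start at a row where flag_col is True."""
    # Integer positions (not index labels) of the True rows.
    starts = np.flatnonzero(df[flag_col].to_numpy())
    # Each block ends where the next one starts; the last ends at len(df).
    ends = np.append(starts[1:], len(df))
    return [df.iloc[s:e] for s, e in zip(starts, ends)]

df = pd.DataFrame(
    {"data": ["MODS start 12/12/2020", "Some data", "Some data",
              "MODS start 30/12/2020", "Some data", "Some data"],
     "flag": [True, False, False, True, False, False]},
    index=list("abcdef"),  # non-default index, for illustration only
)

blocks = split_on_flag(df)
for block in blocks:
    print(block, end="\n\n")
```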

Related

Calculate sum of columns of same name pandas

How can I find duplicate columns in a dataframe and replace them with a single column of the same name? The new column should be the result of applying the 'OR' operator to the duplicated columns, and the old duplicated columns should then be dropped.
Example:
I want a single 'job' column that is the 'OR' of the two 'job' columns in the table below.
My table looks like this:
name   job    maried  children  job
John   True   True    True      True
Peter  True   False   True      True
Karl   False  True    True      True
jack   False  False   False     False
The result that I want is:
name   job    maried  children
John   True   True    True
Peter  True   False   True
Karl   True   True    True
jack   False  False   False
I tried to do this (df1 is my table):
df_join = pd.DataFrame()
df1_dulp = pd.DataFrame()
df_tmp = pd.DataFrame()
for column in df1.columns:
    df1_dulp = df1.filter(like=str(column))
    if df1_dulp.shape[1] >= 2:
        for i in range(0, df1_dulp.shape[1]):
            df_tmp += df1_dulp.iloc[:,i]
        if column in df1_dulp.columns:
            df1_dulp.drop(column, axis=1, inplace=True)
            df_join = df_join.join(df1_dulp, how = 'left', lsuffix='left', rsuffix='right')
The result is an empty table (df_join).

You can select the boolean columns with select_dtypes, then aggregate as OR with groupby.any on columns:
out = (df
       .select_dtypes(exclude='bool')
       .join(df.select_dtypes('bool')
               .groupby(level=0, axis=1, sort=False).any()
            )
      )
output:
name job maried children
0 John True True True
1 Peter True False True
2 Karl True True True
3 jack False False False
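Newer pandas releases deprecate the axis=1 argument of groupby, so the snippet above may warn or fail there. One possible workaround, sketched here under the assumption that the table from the question is rebuilt by hand, is to transpose the boolean columns, group on the row labels, and transpose back. The temporary name job2 and the variables bools and merged are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Peter", "Karl", "jack"],
    "job": [True, True, False, False],
    "maried": [True, False, True, False],
    "children": [True, True, True, False],
    "job2": [True, True, True, False],  # temporary name for the second 'job'
})
# Recreate the duplicated column name from the question.
df.columns = ["name", "job", "maried", "children", "job"]

bools = df.select_dtypes("bool")
# After transposing, the column names become row labels, so an ordinary
# row-wise groupby on level 0 merges same-named columns with a logical OR.
merged = bools.T.groupby(level=0, sort=False).any().T

out = df.select_dtypes(exclude="bool").join(merged)
print(out)
```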

Annotating maximum by iterating each rows and make new column with resultant output

I want to annotate the maximum by iterating over each row and put the result in a new column.
Can anyone show how to get this result using pandas in Python?
        text      A      B      C
index
0       Cool  False  False   True
1      Drunk   True  False  False
2      Study  False   True  False
Output:
        Text  Result
index
0       Cool   False
1      Drunk   False
2      Study   False

If the sum of a row is more than half the number of boolean columns, True is the more common value in that row.
Try:
flags = df.drop("text", axis=1)
df["Result"] = flags.sum(axis=1) > flags.shape[1] / 2
output = df[["text", "Result"]]
>>> output
    text  Result
0   Cool   False
1  Drunk   False
2  Study   False
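Every row in the question's data has a False majority, so the rule is easier to check with a row where True wins. The sketch below adds a hypothetical third row ("Party") for that purpose and uses the row mean, which is just another way of writing "more than half the boolean columns are True".

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["Cool", "Drunk", "Party"],  # "Party" is a made-up extra row
    "A": [False, True, True],
    "B": [False, False, True],
    "C": [True, False, False],
})

# Only the boolean columns take part in the vote.
flags = df.drop(columns="text")
# True counts as 1 and False as 0, so the mean is the share of True values.
df["Result"] = flags.mean(axis=1) > 0.5

print(df[["text", "Result"]])
```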

How to find duplicates in pandas dataframe and print them

I am checking a pandas dataframe for duplicate rows using the duplicated function, which works well. But how do I print out the row contents of only the items that are True?
For example, if I run:
duplicateCheck = dataSet.duplicated(subset=['Name', 'Date',], keep=False)
print(duplicateCheck)
it outputs:
0 False
1 False
2 False
3 False
4 True
5 True
6 False
7 False
8 False
9 False
I'm looking for something like:
for row in duplicateCheck.keys():
    if row == True:
        print (row, duplicateCheck[row])
Which prints the items from the dataframe that are duplicates.

Why not
duplicateCheck = dataSet.duplicated(subset=['Name', 'Date',], keep=False)
print(dataSet[duplicateCheck])
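As a self-contained illustration of that boolean-mask approach, here is a minimal sketch with made-up sample data (the names, dates, and the Score column are assumptions, not from the question). It also shows one way to loop over the duplicated rows if a per-row printout is wanted.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share the same Name and Date.
dataSet = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cid"],
    "Date": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-03"],
    "Score": [1, 2, 3, 4],
})

# keep=False marks every member of a duplicate group, not just the repeats.
duplicateCheck = dataSet.duplicated(subset=["Name", "Date"], keep=False)

# Indexing with the boolean Series keeps only the rows marked True.
duplicates = dataSet[duplicateCheck]
print(duplicates)

# Optional: print the duplicated rows one at a time.
for idx, row in duplicates.iterrows():
    print(idx, row["Name"], row["Date"])
```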

Pandas dataframe check if a value exists in multiple columns for one row

I want to print out the row where the value is "True" for more than one column.
For example if data frame is the following:
Remove Ignore Repair
0 True False False
1 False True True
2 False True False
I want it to print:
1
Is there an elegant way to do this instead of bunch of if statements?

You can use sum with axis=1 to add up the values across the columns of each row.
import pandas as pd
df = pd.DataFrame({'a':[False, True, False],'b':[False, True, False], 'c':[True, False, False,]})
print(df)
print("Ans: ",df[(df.sum(axis=1)>1)].index.tolist())
output:
a b c
0 False False True
1 True True False
2 False False False
Ans: [1]

To get the first row that meets the criteria:
df.index[df.sum(axis=1).gt(1)][0]
Output:
Out[14]: 1
Since more than one row can match, you can drop the [0] to get all the rows that meet your criteria.

You can just sum booleans as they will be interpreted as True=1, False=0:
df.sum(axis=1) > 1
So to filter to rows where this evaluates as True:
df.loc[df.sum(axis=1) > 1]
Or the same thing but being more explicit about converting the booleans to integers:
df.loc[df.astype(int).sum(axis=1) > 1]

Extracting all rows from pandas Dataframe that have certain value in a specific column

I am relatively new to Python/Pandas and am struggling with extracting the correct data from a pd.DataFrame. What I actually have is a DataFrame with 3 columns:
data =
Position Letter Value
1 a TRUE
2 f FALSE
3 c TRUE
4 d TRUE
5 k FALSE
What I want to do is put all of the TRUE rows into a new Dataframe so that the answer would be:
answer =
Position Letter Value
1 a TRUE
3 c TRUE
4 d TRUE
I know that you can access a particular column using
data['Value']
but how do I extract all of the TRUE rows?
Thanks for any help and advice,
Alex

You can test which Values are True:
In [11]: data['Value'] == True
Out[11]:
0 True
1 False
2 True
3 True
4 False
Name: Value, dtype: bool
and then use boolean indexing to pull out those rows:
In [12]: data[data['Value'] == True]
Out[12]:
Position Letter Value
0 1 a True
2 3 c True
3 4 d True
*Note: if the values are actually the strings 'TRUE' and 'FALSE' (they probably shouldn't be!) then use:
data['Value'] == 'TRUE'

You can wrap your value or values in a list and do the following:
new_df = df.loc[df['yourColumnName'].isin(['your', 'list', 'items'])]
This returns a new dataframe containing the rows of df whose value in that column matches one of the list items.
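To make both filtering styles concrete, here is a small sketch that assumes the Value column holds real booleans, as in the question's table. The letters passed to isin are an arbitrary choice for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "Position": [1, 2, 3, 4, 5],
    "Letter": ["a", "f", "c", "d", "k"],
    "Value": [True, False, True, True, False],
})

# A boolean column can be used directly as the row mask.
answer = data[data["Value"]]
print(answer)

# isin builds a mask from a list of accepted values in any column.
subset = data.loc[data["Letter"].isin(["a", "k"])]
print(subset)
```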
