I have a dataframe
x y
a 1
b 1
c 1
d 0
e 0
f 0
g 1
h 1
i 0
j 0
I want to remove the rows with 0 except the first occurrence of 0 after a 1, so the resulting dataframe should be
x y
a 1
b 1
c 1
d 0
g 1
h 1
i 0
Is it possible to do this without creating groups or iterating row by row? I have a big dataframe, so I'd like to keep it fast.
Let us try diff with cumsum to create consecutive-value groups, then use duplicated to keep only the first row of each group:
out = df[~df.y.diff().ne(0).cumsum().duplicated() | df.y.ne(0)].copy()
Out[352]:
x y
0 a 1
1 b 1
2 c 1
3 d 0
6 g 1
7 h 1
8 i 0
Check consecutive similarity using shift()
df[df.y.ne(0)|(df.y.eq(0)&df.y.shift(1).ne(0))]
x y
0 a 1
1 b 1
2 c 1
3 d 0
6 g 1
7 h 1
8 i 0
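Both one-liners can be sanity-checked with a small runnable sketch (`df` is rebuilt here from the sample data above):

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({'x': list('abcdefghij'),
                   'y': [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]})

# Keep rows where y != 0, plus each first 0 that follows a run of 1s
out = df[df.y.ne(0) | (df.y.eq(0) & df.y.shift(1).ne(0))]
```

The shift-based condition keeps a 0 only when the previous row was nonzero, which is exactly "the first new occurrence of 0 after 1".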
Say I have a dataframe, filled as below, with the column 'Key' taking one of five possible values A, B, C, D, X. I would like to add a new column 'Res' that counts these letters cumulatively and resets each time it hits an X.
For example:
Key Res
0 D 1
1 X 0
2 B 1
3 C 2
4 D 3
5 X 0
6 A 1
7 C 2
8 X 0
9 X 0
Can anyone assist with how I can achieve this?
A possible solution:
a = df.Key.ne('X')
df['new'] = ((a.cumsum()-a.cumsum().where(~a).ffill().fillna(0)).astype(int))
Another possible solution, which is more basic than the previous one, but much faster (several orders of magnitude):
import numpy as np

s = np.zeros(len(df), dtype=int)
for i in range(len(df)):
    if df.Key[i] != 'X':
        s[i] = s[i - 1] + 1  # for i == 0, s[-1] is still 0, so the first row works out
df['new'] = s
Output:
Key Res new
0 D 1 1
1 X 0 0
2 B 1 1
3 C 2 2
4 D 3 3
5 X 0 0
6 A 1 1
7 C 2 2
8 X 0 0
9 X 0 0
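As a sanity check, both approaches can be run side by side on the rebuilt sample frame (a sketch; the expected values are the Res column from the table above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(list('DXBCDXACXX'), columns=['Key'])

# Vectorized: running count of non-'X' rows, minus the count frozen at the last 'X'
a = df.Key.ne('X')
vec = (a.cumsum() - a.cumsum().where(~a).ffill().fillna(0)).astype(int)

# Loop: restart the count after every 'X'
s = np.zeros(len(df), dtype=int)
for i in range(len(df)):
    if df.Key[i] != 'X':
        s[i] = s[i - 1] + 1
```

Both produce 1 0 1 2 3 0 1 2 0 0, matching the Res column.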
Example
df = pd.DataFrame(list('DXBCDXACXX'), columns=['Key'])
df
Key
0 D
1 X
2 B
3 C
4 D
5 X
6 A
7 C
8 X
9 X
Code
df1 = pd.concat([df.iloc[[0]], df])
grouper = df1['Key'].eq('X').cumsum()
df1.assign(Res=df1.groupby(grouper).cumcount()).iloc[1:]
result:
Key Res
0 D 1
1 X 0
2 B 1
3 C 2
4 D 3
5 X 0
6 A 1
7 C 2
8 X 0
9 X 0
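The prepend-one-row trick can be checked the same way (a sketch; duplicating the first row makes cumcount start at 1 rather than 0 for a leading non-'X' run, and iloc[1:] drops the duplicate again):

```python
import pandas as pd

df = pd.DataFrame(list('DXBCDXACXX'), columns=['Key'])

df1 = pd.concat([df.iloc[[0]], df])              # duplicate the first row
grouper = df1['Key'].eq('X').cumsum()            # new group starts at each 'X'
res = df1.groupby(grouper).cumcount().iloc[1:]   # drop the duplicated row
```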
I have a CSV file which contains a symmetric adjacency matrix, meaning rows and columns have the same labels.
I would like to import it into a pandas dataframe, ideally have a simple GUI pop up and ask for a list of items to delete, then take that list and set the values in the corresponding rows and columns to zero, returning a separate, altered dataframe.
In short, something that takes the following matrix
a b c d e
a 0 3 5 3 5
b 3 0 2 4 5
c 5 2 0 1 7
d 3 4 1 0 9
e 5 5 7 9 0
Pops up a simple interface asking "which regions should be deleted?" with a line to enter those regions,
and say c and e are entered
returns
a b c d e
a 0 3 0 3 0
b 3 0 0 4 0
c 0 0 0 0 0
d 3 4 0 0 0
e 0 0 0 0 0
with the altered entries set to zero.
It should be able to do this for as many regions as entered, which can be up to 379, ideally separated by commas.
Set columns and rows by index values with DataFrame.loc:
vals = ['c','e']
df.loc[vals, :] = 0
df[vals] = 0
#alternative
#df.loc[:, vals] = 0
print (df)
a b c d e
a 0 3 0 3 0
b 3 0 0 4 0
c 0 0 0 0 0
d 3 4 0 0 0
e 0 0 0 0 0
Another solution is to create a boolean mask with numpy broadcasting and set values with DataFrame.mask:
mask = df.index.isin(vals) | df.columns.isin(vals)[:, None]
df = df.mask(mask, 0)
print (df)
a b c d e
a 0 3 0 3 0
b 3 0 0 4 0
c 0 0 0 0 0
d 3 4 0 0 0
e 0 0 0 0 0
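Both variants can be verified on the matrix from the question (a runnable sketch that rebuilds it inline and checks they agree):

```python
import pandas as pd

labels = list('abcde')
df = pd.DataFrame([[0, 3, 5, 3, 5],
                   [3, 0, 2, 4, 5],
                   [5, 2, 0, 1, 7],
                   [3, 4, 1, 0, 9],
                   [5, 5, 7, 9, 0]], index=labels, columns=labels)

vals = ['c', 'e']

# loc-based variant
out1 = df.copy()
out1.loc[vals, :] = 0
out1[vals] = 0

# mask-based variant: True wherever the row or the column label is in vals
mask = df.index.isin(vals) | df.columns.isin(vals)[:, None]
out2 = df.mask(mask, 0)
```

Note the broadcasting in the mask: a (5,) row vector OR-ed with a (5, 1) column vector gives the full 5x5 boolean grid.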
Start by importing the csv:
import pandas as pd
adj_matrix = pd.read_csv("file/name/to/your.csv", index_col=0)
Then request the input:
regions = input("Please enter the regions to delete, separated by commas: ")
regions = [r.strip() for r in regions.split(',')]  # input() returns one string, so split it
adj_matrix.loc[regions, :] = 0
adj_matrix.loc[:, regions] = 0
Now adj_matrix should be in the form you want.
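Since input() always returns one string, a small helper (zero_regions is a hypothetical name, not from the question) can do the comma splitting and the zeroing in one place:

```python
import pandas as pd

def zero_regions(adj, answer):
    """Zero the rows and columns named in a comma-separated string like 'c, e'."""
    regions = [r.strip() for r in answer.split(',') if r.strip()]
    out = adj.copy()  # leave the original matrix untouched
    out.loc[regions, :] = 0
    out.loc[:, regions] = 0
    return out
```

Usage would then be `zero_regions(adj_matrix, input("Regions to delete: "))`.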
A program I am working with outputed a tab delimited file that looks like this:
marker A B C
Bin_1 1 2 1
marker C G H B T
Bin_2 3 1 1 1 2
marker B H T Z Y A C
Bin_3 1 1 2 1 3 4 5
I want to fix it so that it looks like this:
marker A B C G H T Y Z
Bin_1 1 2 1 0 0 0 0 0
Bin_2 0 1 3 1 1 2 0 0
Bin_3 4 1 5 0 1 2 3 1
This is what I have so far
import pandas as pd
from collections import OrderedDict

df = pd.read_csv('markers.txt', header=None, sep='\t')
x = list(map(list, df.values))  # list() so it can be indexed under Python 3
list_of_dicts = []
s = 0
e = 1
g = len(x) + 1
while e < g:
    new_dict = OrderedDict(zip(x[s], x[e]))
    list_of_dicts.append(new_dict)
    s += 2
    e += 2
Initially I was converting these to dictionaries and then was going to do some kind of count and recreate a dataframe but that seems to be taking a lot of time and memory for what seems like an easy task. Any suggestions on a better way to approach this?
lines = [str.strip(l).split() for l in open('markers.txt').readlines()]
dicts = {b[0]: pd.Series(dict(zip(m[1:], b[1:])))
         for m, b in zip(lines[::2], lines[1::2])}
pd.concat(dicts).unstack(fill_value=0)
A B C G H T Y Z
Bin_1 1 2 1 0 0 0 0 0
Bin_2 0 1 3 1 1 2 0 0
Bin_3 4 1 5 0 1 2 3 1
The insight is that when you "append" DataFrames, the result is a DataFrame with columns that are the union of the columns, with NaNs or whatever in the holes. So:
$ cat test.py
import pandas as pd
frame = pd.DataFrame()
with open('/tmp/foo.tsv') as markers:
    while True:
        line = markers.readline()
        if not line:
            break
        columns = line.strip().split('\t')
        data = markers.readline().strip().split('\t')
        new = pd.DataFrame(data=[data], columns=columns)
        frame = frame.append(new)  # on pandas >= 2.0 use pd.concat([frame, new])
frame = frame.fillna(0)
print(frame)
$ python test.py
A B C G H T Y Z marker
0 1 2 1 0 0 0 0 0 Bin_1
0 0 1 3 1 1 2 0 0 Bin_2
0 4 1 5 0 1 2 3 1 Bin_3
If you aren't using pandas anywhere else, then this might (or might not) be overkill. But if you are already using it anyway, then I think this is totally reasonable.
Not the most elegant thing in the world, but...
headers = df.iloc[::2][0].apply(lambda x: x.split()[1:])
data = df.iloc[1::2][0].apply(lambda x: x.split()[1:])
result = []
for h, d in zip(headers.values, data.values):
    result.append(pd.Series(d, index=h))
pd.concat(result, axis=1).fillna(0).T
A B C G H T Y Z
0 1 2 1 0 0 0 0 0
1 0 1 3 1 1 2 0 0
2 4 1 5 0 1 2 3 1
Why not manipulate the data into a dict on input and then construct the DataFrame:
>>> with open(...) as f:
...     d = {}
...     for marker, bins in zip(f, f):
...         z = zip(marker.split(), bins.split())
...         _, bin = next(z)  # the first pair is ('marker', 'Bin_N')
...         d[bin] = dict(z)
>>> pd.DataFrame(d).fillna(0).T
A B C G H T Y Z
Bin_1 1 2 1 0 0 0 0 0
Bin_2 0 1 3 1 1 2 0 0
Bin_3 4 1 5 0 1 2 3 1
If you really need the column axis name:
>>> pd.DataFrame(d).fillna(0).rename_axis('marker').T
marker A B C G H T Y Z
Bin_1 1 2 1 0 0 0 0 0
Bin_2 0 1 3 1 1 2 0 0
Bin_3 4 1 5 0 1 2 3 1
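The dict-of-dicts idea can be checked end to end without a file (a sketch; the marker lines are kept in a string and the values converted to int so the assertions compare numbers, and the columns are sorted to match the tables above):

```python
import pandas as pd

text = """marker A B C
Bin_1 1 2 1
marker C G H B T
Bin_2 3 1 1 1 2
marker B H T Z Y A C
Bin_3 1 1 2 1 3 4 5"""

# Pair each marker line with the bin line below it
lines = [l.split() for l in text.splitlines()]
d = {}
for m, b in zip(lines[::2], lines[1::2]):
    d[b[0]] = {k: int(v) for k, v in zip(m[1:], b[1:])}

# Missing markers become NaN, filled with 0
result = pd.DataFrame(d).fillna(0).astype(int).T.sort_index(axis=1)
```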
I have a dataframe df with columns [ShowOnAir, AfterPremier, ID, EverOnAir].
My condition is that
if it is the first element of a groupby(df.ID) group,
then if (df.ShowOnAir == 0 or df.AfterPremier == 0), then EverOnAir = 0,
else EverOnAir = 1.
I am not sure how to compare the first element of each group with the elements of the original dataframe df.
I would really appreciate some help with this. Thank you.
You can get a row number for your groups by using cumsum, then you can do your logic on the resulting dataframe:
df = pd.DataFrame([[1],[1],[2],[2],[2]])
df['n']=1
df.groupby(0).cumsum()
n
0 1
1 2
2 1
3 2
4 3
You can first create a new column EverOnAir filled with 1. Then group by ID and apply a custom function f that picks the first element of each group with iat and fills in 0 where needed:
print df
ShowOnAir AfterPremier ID
0 0 0 a
1 0 1 a
2 1 1 a
3 1 1 b
4 1 0 b
5 0 0 b
6 0 1 c
7 1 0 c
8 0 0 c
import numpy as np

def f(x):
    x['EverOnAir'].iat[0] = np.where((x['ShowOnAir'].iat[0] == 0) |
                                     (x['AfterPremier'].iat[0] == 0), 0, 1)
    return x

df['EverOnAir'] = 1
print(df.groupby('ID').apply(f))
ShowOnAir AfterPremier ID EverOnAir
0 0 0 a 0
1 0 1 a 1
2 1 1 a 1
3 1 1 b 1
4 1 0 b 1
5 0 0 b 1
6 0 1 c 0
7 1 0 c 1
8 0 0 c 1
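An apply-free alternative (my sketch, not from the answer above): mark the first row of each ID group with duplicated and set EverOnAir directly with loc:

```python
import pandas as pd

# Sample frame matching the answer above
df = pd.DataFrame({'ShowOnAir':    [0, 0, 1, 1, 1, 0, 0, 1, 0],
                   'AfterPremier': [0, 1, 1, 1, 0, 0, 1, 0, 0],
                   'ID': list('aaabbbccc')})

df['EverOnAir'] = 1
first = ~df.duplicated('ID')  # True only on the first row of each ID
zero = (df['ShowOnAir'] == 0) | (df['AfterPremier'] == 0)
df.loc[first & zero, 'EverOnAir'] = 0
```

This avoids groupby entirely, which tends to be faster on large frames.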
I have a pandas dataframe as follows:
time winner loser stat
1 A B 0
2 C B 0
3 D B 1
4 E B 0
5 F A 0
6 G A 0
7 H A 0
8 I A 1
Each row is a match result: the first column is the time of the match, the second and third columns contain the winner/loser, and the fourth column is one stat from the match.
I want to detect streaks of zeros for this stat per loser.
The expected result should look like this:
time winner loser stat streak
1 A B 0 1
2 C B 0 2
3 D B 1 0
4 E B 0 1
5 F A 0 1
6 G A 0 2
7 H A 0 3
8 I A 1 0
In pseudocode the algorithm should work like this:
.groupby loser column.
then iterate over each row of each loser group
in each row, look at the stat column: if it contains 0, increment the streak value from the previous row by 1; if it is not 0, start a new streak, that is, put 0 into the streak column.
So the .groupby is clear. But then I would need some sort of .apply where I can look at the previous row? This is where I am stuck.
You can apply a custom function f built from cumsum, cumcount and astype:
def f(x):
    x['streak'] = (x.groupby((x['stat'] != 0).cumsum()).cumcount() +
                   ((x['stat'] != 0).cumsum() == 0).astype(int))
    return x

df = df.groupby('loser', sort=False).apply(f)
print(df)
time winner loser stat streak
0 1 A B 0 1
1 2 C B 0 2
2 3 D B 1 0
3 4 E B 0 1
4 5 F A 0 1
5 6 G A 0 2
6 7 H A 0 3
7 8 I A 1 0
For better understanding:
def f(x):
    x['c'] = (x['stat'] != 0).cumsum()
    x['a'] = (x['c'] == 0).astype(int)
    x['b'] = x.groupby('c').cumcount()
    x['streak'] = x.groupby('c').cumcount() + x['a']
    return x

df = df.groupby('loser', sort=False).apply(f)
print(df)
time winner loser stat c a b streak
0 1 A B 0 0 1 0 1
1 2 C B 0 0 1 1 2
2 3 D B 1 1 0 0 0
3 4 E B 0 1 0 1 1
4 5 F A 0 0 1 0 1
5 6 G A 0 0 1 1 2
6 7 H A 0 0 1 2 3
7 8 I A 1 1 0 0 0
Not as elegant as jezrael's answer, but for me easier to understand...
First, define a function that works with a single loser:
def f(df):
    df['streak2'] = (df['stat'] == 0).cumsum()
    df['cumsum'] = np.nan
    df.loc[df['stat'] == 1, 'cumsum'] = df['streak2']
    df['cumsum'] = df['cumsum'].ffill().fillna(0)
    df['streak'] = df['streak2'] - df['cumsum']
    df.drop(['streak2', 'cumsum'], axis=1, inplace=True)
    return df
The streak is essentially a cumsum, but we need to reset it each time stat is 1. We therefore subtract the value of the cumsum where stat is 1, carried forward until the next 1.
Then groupby and apply by loser:
df.groupby('loser').apply(f)
The result is as expected.
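The same streak logic can also be written without groupby().apply (my sketch): build a per-loser block id with transform, then cumcount over (loser, block):

```python
import pandas as pd

df = pd.DataFrame({'time': range(1, 9),
                   'winner': list('ACDEFGHI'),
                   'loser':  list('BBBBAAAA'),
                   'stat':   [0, 0, 1, 0, 0, 0, 0, 1]})

# block increments at every nonzero stat, separately per loser
block = (df.groupby('loser')['stat']
           .transform(lambda s: s.ne(0).cumsum())
           .rename('block'))
# count rows within each (loser, block); +1 only before the group's first nonzero stat
df['streak'] = df.groupby(['loser', block]).cumcount() + block.eq(0).astype(int)
```

Doing the grouping once, without a Python-level apply, usually scales better on large frames.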
You could use iterrows to access the previous row (.loc replaces the removed .ix; this assumes the default RangeIndex):
df['streak'] = 0
for i, row in df.iterrows():
    if i != 0:
        if row['stat'] == 0:
            if row['loser'] == df.loc[i - 1, 'loser']:
                df.loc[i, 'streak'] = df.loc[i - 1, 'streak'] + 1
            else:
                df.loc[i, 'streak'] = 1
    else:
        if row['stat'] == 0:
            df.loc[i, 'streak'] = 1
Which gives:
In [210]: df
Out[210]:
time winner loser stat streak
0 1 A B 0 1
1 2 C B 0 2
2 3 D B 1 0
3 4 E B 0 1
4 5 F A 0 1
5 6 G A 0 2
6 7 H A 0 3
7 8 I A 1 0