How to delete specific rows in pandas dataframe if a condition is met - python

I have a pandas dataframe with few thousand rows and only one column. The structure of the content is as follows:
| 0
0 | Score 1
1 | Date 1
2 | Group 1
3 | Score 1
4 | Score 2
5 | Date 2
6 | Group 2
7 | Score 2
8 | Score 3
9 | Date 3
10| Group 3
11| ...
12| ...
13| Score (n-1)
14| Score n
15| Date n
16| Group n
I need to delete all rows with index i if "Score" in row(i) and "Score" in row(i+1). Any suggestion on how to achieve this?
The expected output is as follows:
| 0
0 | Score 1
1 | Date 1
2 | Group 1
3 | Score 2
4 | Date 2
5 | Group 2
6 | Score 3
7 | Date 3
8 | Group 3
9 | ...
10| ...
11| Score n
12| Date n
13| Group n

I need to delete all rows with index i if "Score" in row(i) and "Score" in row(i+1). Any suggestion on how to achieve this?
Given
>>> df
0
0 Score 1
1 Date 1
2 Group 1
3 Score 1
4 Score 2
5 Date 2
6 Group 2
7 Score 2
8 Score 3
9 Date 3
you can use
>>> mask = df.assign(shift=df[0].shift(-1)).apply(lambda s: s.str.contains('Score')).all(1)
>>> df[~mask].reset_index(drop=True)
0
0 Score 1
1 Date 1
2 Group 1
3 Score 2
4 Date 2
5 Group 2
6 Score 3
7 Date 3
Although if I were you I would use fix the format of the data first as the commenters already pointed out.

Related

How do I get the maximum value for every group and rank with all other groups?

I want to find the max value for every team and rank the team ascending.
This is the dataframe:
TEAM | GROUP | SCORE
1 | A | 5
1 | B | 5
1 | C | 5
2 | D | 6
2 | A | 6
3 | D | 5
3 | A | 5
No team should have the same rank so in case the score is similar who shows up first gets the first rank - others will adjust accordingly. So the output for this is:
TEAM | GROUP | SCORE | RANK
1 | A | 5 | 1
1 | B | 5 | 1
1 | C | 5 | 1
2 | D | 6 | 3
2 | A | 6 | 3
3 | D | 5 | 2
3 | A | 5 | 2
I'm not very familiar with some python syntax but here's what I have so far:
team = df.groupby(['TEAM'])
for x in team:
df['Rank'] = x.groupby(['TEAM'])['SCORE'].max().rank()
Please try the below which uses sorting on score and team, then gets the changes and does a cumulative sum for rank:
s = df[['TEAM','SCORE']].sort_values(['SCORE','TEAM'])
df['RANK'] = s['TEAM'].ne(s['TEAM'].shift()).cumsum()
print(df)
TEAM GROUP SCORE RANK
0 1 A 5 1
1 1 B 5 1
2 1 C 5 1
3 2 D 6 3
4 2 A 6 3
5 3 D 5 2
6 3 A 5 2

Count consecutive occurrence's for column [duplicate]

This question already has answers here:
GroupBy Pandas Count Consecutive Zero's
(2 answers)
Closed 1 year ago.
I am trying to count consecutive occurrence's for Products column. Result should be as shown in "Total counts" column. I tried using groupby with cumsum but my logic could not work
+----------+--------------+
| Products | Total counts |
+----------+--------------+
| 1 | 3 |
+----------+--------------+
| 1 | 3 |
+----------+--------------+
| 1 | 3 |
+----------+--------------+
| 2 | 1 |
+----------+--------------+
| 3 | 3 |
+----------+--------------+
| 3 | 3 |
+----------+--------------+
| 3 | 3 |
+----------+--------------+
| 4 | 2 |
+----------+--------------+
| 4 | 2 |
+----------+--------------+
Use groupby with transform and count,
df['Total counts'] = df.groupby('Products').transform('count')
Output:
Products Total counts
0 1 3
1 1 3
2 1 3
3 2 1
4 3 3
5 3 3
6 3 3
7 4 2
8 4 2
Consective Products, that repeat later in dataframe:
grp = (df['Products'] != df['Products'].shift()).cumsum()
df['Total counts'] = df.groupby(grp)['Products'].transform('count')
Output:
Products Total counts
0 1 3
1 1 3
2 1 3
3 2 1
4 3 3
5 3 3
6 3 3
7 4 2
8 4 2

groupby to get the average, using dynamic condition

I have been searching about groupby using conditions and found many posts about that. This one for example: Pandas: conditional group-specific computations
However, I couldn't find any where the condition is applied over itself. In my case I'd like to get the average (or count or any other formula for that matter) but the thing that I couldn't find is to filter the dataset over a dynamic condition.
To illustrate this, this is the summarized dataset:
ID | Seq | Total
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
2 | 1 | 1
2 | 2 | 2
2 | 3 | 1
If I want to get the mean grouped by ID, but with the additional condition that for each record within the group, only those where the Seq is smaller must be computed. This should be the result
ID | Seq | Total | x
1 | 1 | 1 | 1 <-- mean of 1
1 | 2 | 2 | 1.5 <-- mean of 1 and 2
1 | 3 | 3 | 2 <-- mean of 1,2 and 3
2 | 1 | 1 | 1 <-- mean of 1
2 | 2 | 2 | 1.5 <-- mean of 1 and 2
2 | 3 | 1 | 1.33 < mean of 1, 2 and 1
Any help will be appreciated!
It looks like you are just trying to get the expanding().mean() of the ID-grouped Total column, e.g.:
In []:
df['x'] = df.groupby('ID')['Total'].expanding().mean().values
df
Out[]:
ID Seq Total x
0 1 1 1 1.000000
1 1 2 2 1.500000
2 1 3 3 2.000000
3 2 1 1 1.000000
4 2 2 2 1.500000
5 2 3 1 1.333333

Pandas groupby on multiple values

Start with a sorted table:
Index | A | B | C |
0 | A1| 0 | Group 1 |
1 | A1| 0 | Group 1 |
2 | A1| 1 | Group 2 |
3 | A1| 1 | Group 2 |
4 | A1| 2 | Group 3 |
5 | A1| 2 | Group 3 |
6 | A2| 7 | Group 4 |
7 | A2| 7 | Group 4 |
Returns records 0,1,2,3,6,7
First I want to create groups based on Columns A and B.
Then I want only the first two subgroups of a Column A group returned.
I want all the records returned for the subgroup.
Thank you so much.
Use pd.factorize within a groupby and filter for less than 2
df[df.groupby('A').B.transform(lambda x: x.factorize()[0]).lt(2)]
# same as
# df[df.groupby('A').B.transform(lambda x: x.factorize()[0]) < 2]
A B C
0 A1 0 Group 1
1 A1 0 Group 1
2 A1 1 Group 2
3 A1 1 Group 2
6 A2 7 Group 4
7 A2 7 Group 4

Interleaving Pandas Dataframes by Timestamp

I've got 2 Pandas DataFrame, each of them containing 2 columns. One of the columns is a timestamp column [t], the other one contains sensor readings [s].
I now want to create a single DataFrame, containing 4 columns, that is interleaved on the timestamp column.
Example:
First Dataframe:
+----+----+
| t1 | s1 |
+----+----+
| 0 | 1 |
| 2 | 3 |
| 3 | 3 |
| 5 | 2 |
+----+----+
Second DataFrame:
+----+----+
| t2 | s2 |
+----+----+
| 1 | 5 |
| 2 | 3 |
| 4 | 3 |
+----+----+
Target:
+----+----+----+----+
| t1 | t2 | s1 | s2 |
+----+----+----+----+
| 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 5 |
| 2 | 1 | 3 | 5 |
| 2 | 2 | 3 | 3 |
| 3 | 2 | 3 | 3 |
| 3 | 4 | 3 | 3 |
| 5 | 4 | 2 | 3 |
+----+----+----+----+
I hat a look at pandas.merge, but that left me with a lot of NaNs and an unsorted table.
a.merge(b, how='outer')
Out[55]:
t1 s1 t2 s2
0 0 1 NaN NaN
1 2 3 2 3
2 3 3 NaN NaN
3 5 2 NaN NaN
4 1 NaN 1 5
5 4 NaN 4 3
Merging will put NaNs in common columns that you merge on, if those values are not present in both indexes. It will not create new data that is not present in the dataframes that are being merged.
For example, index 0 in your target dataframe shows t2 with a value of 0. This is not present in the second dataframe, so you cannot expect it to appear in the merged dataframe either. Same applies for other rows as well.
What you can do instead is reindex the dataframes to a common index. In your case, since the maximum index is 5 in the target dataframe, lets use this list to reindex both input dataframes:
In [382]: ind
Out[382]: [0, 1, 2, 3, 4, 5]
Now, we will reindex according both inputs to this index:
In [372]: x = a.set_index('t1').reindex(ind).fillna(0).reset_index()
In [373]: x
Out[373]:
t1 s1
0 0 1
1 1 0
2 2 3
3 3 3
4 4 0
5 5 2
In [374]: y = b.set_index('t2').reindex(ind).fillna(0).reset_index()
In [375]: y
Out[375]:
t2 s2
0 0 0
1 1 5
2 2 3
3 3 0
4 4 5
5 5 0
And, now we merge it to get something close to the target dataframe:
In [376]: x.merge(y, left_on=['t1'], right_on=['t2'], how='outer')
Out[376]:
t1 s1 t2 s2
0 0 1 0 0
1 1 0 1 5
2 2 3 2 3
3 3 3 3 0
4 4 0 4 5
5 5 2 5 0

Categories