pandas dataframe: compare value in one column with previous value - python

I have a pandas dataframe to which I want to add a column (col_new) whose values depend on a comparison of values in an existing column (col_exist).
Existing column (type=objects) contains As and Bs.
New column should count, starting with 1.
If an A follows an A, the count should rise by one.
If an A follows a B, the count should rise by one.
If a B follows an A, the count should not rise.
If a B follows a B, the count should not rise.
col_exist col_new
A 1
A 2
A 3
B 3
A 4
B 4
B 4
A 5
B 5
I am completely new to programming, so thank you in advance for your help.

Use eq and cumsum:
df['col_new'] = df['col_exist'].eq('A').cumsum()
output:
col_exist col_new
0 A 1
1 A 2
2 A 3
3 B 3
4 A 4
5 B 4
6 B 4
7 A 5
8 B 5
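Why this works: the four rules reduce to "the count rises exactly when the current row is A", so a running sum over a boolean mask is enough. A minimal self-contained sketch, using the sample data from the question (the intermediate name is_a is only for illustration):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({'col_exist': list('AAABABBAB')})

# True where the row is 'A'; cumsum treats True as 1 and False as 0
is_a = df['col_exist'].eq('A')
df['col_new'] = is_a.cumsum()

print(df)
```

Note that if the column started with one or more Bs, those leading rows would get 0 rather than 1, because no A has been counted yet.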

Related

Sliding minimum value in a pandas column

I am working with a pandas dataframe where I have the following two columns: "personID" and "points". I would like to create a third variable ("localMin") which will store the minimum value of the column "points" at each point in the dataframe as compared with all previous values in the "points" column for each personID (see image below).
Does anyone have an idea how to achieve this most efficiently? I have approached this problem using shift() with different period sizes, but of course, shift is sensitive to variations in the sequence and doesn't always produce the output I would expect.
Thank you in advance!
Use groupby.cummin:
df['localMin'] = df.groupby('personID')['points'].cummin()
Example:
import pandas as pd

df = pd.DataFrame({'personID': list('AAAAAABBBBBB'),
                   'points': [3, 4, 2, 6, 1, 2, 4, 3, 1, 2, 6, 1]})
df['localMin'] = df.groupby('personID')['points'].cummin()
output:
personID points localMin
0 A 3 3
1 A 4 3
2 A 2 2
3 A 6 2
4 A 1 1
5 A 2 1
6 B 4 4
7 B 3 3
8 B 1 1
9 B 2 1
10 B 6 1
11 B 1 1

Is there a way to filter out rows from a table with an unnamed column

I'm currently trying to do analysis of rolling correlations of a dataset with four compared values but only need the output of rows containing 'a'
I got my data frame by using the command newdf = df.rolling(3).corr()
Sample input (random numbers)
a b c d
1 a
1 b
1 c
1 d
2 a
2 b
2 c
2 d
3 a
3 b 5 6 3
3 c 4 3 1
3 d 3 4 2
4 a 1 3 5 6
4 b 6 2 4 1
4 c 8 6 6 7
4 d 2 5 4 6
5 a 2 5 4 1
5 b 1 4 6 3
5 c 2 6 3 7
5 d 3 6 3 7
and need the output
a b c d
1 a 1 3 5 6
2 a 2 5 4 1
I've tried filtering it with adf = newdf.filter(['a'], axis=0), but that removes everything, and on the other axis it filters by column. Unfortunately, the column containing the values a, b, c, d is unnamed, so I can't filter on that column directly. This wouldn't be an issue if it's possible to flip the rows and columns, with the values listed by index, to get the desired output.
Try using loc. Put the column of abcdabcd ... as index and just use loc
df.loc['a']
The actual source of the problem in your case is that your DataFrame
has a MultiIndex.
When you execute newdf.filter(['a'], axis=0), you are asking to keep
only the rows whose index label is exactly the string "a".
But since your DataFrame has a MultiIndex, each row with "a" at
level 1 also carries a number at level 0, so no label matches.
To get your intended result, run:
newdf.filter(like='a', axis=0)
maybe followed by .dropna().
An alternative solution is:
newdf.xs('a', level=1, drop_level=False)
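To see the MultiIndex described above, here is a small sketch. The data is hypothetical (random numbers, since the question only shows placeholder values), and the names rng and a_rows are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical data: random numbers standing in for the real dataset
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((6, 4)), columns=list('abcd'))

newdf = df.rolling(3).corr()

# The index has two levels: level 0 is the original row label,
# level 1 is the column name ('a', 'b', 'c', 'd')
print(newdf.index.nlevels)

# Select the rows labelled 'a' at level 1, then drop the leading
# windows that are all NaN because fewer than 3 rows were available
a_rows = newdf.xs('a', level=1, drop_level=False).dropna()
print(a_rows)
```

Passing drop_level=False keeps both index levels in the result; leaving it out would return a frame indexed by the original row label only.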

How to remove duplicates based on two columns, removing the largest of the 3rd column, in a pandas dataframe?

Suppose I have a pandas dataframe that is like this:
df=
A B 6 2
A C 4 2
D F 9 3
K L 8 9
A B 4 3
D F 8 2
How can I say, if columns A and B have duplicates remove the ones that have the largest column C?
So for instance we can see lines 1 and 5 have the same columns A and B.
A B 6 2 (Line 1)
A B 4 3 (Line 5)
I want to remove line 1 as 6 is greater than 4.
So my output should be
A C 4 2
K L 8 9
A B 4 3
D F 8 2
Sort by column C in ascending order with DataFrame.sort_values, then call DataFrame.drop_duplicates on columns A and B. drop_duplicates keeps the first occurrence by default, so the row with the smallest C in each group survives:
df.sort_values(by=['C'],ascending=[True],inplace=True)
df.drop_duplicates(subset=['A','B'],inplace=True)
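A runnable sketch of the same approach on the data from the question. The column names A, B, C, D are assumed, since the question shows no header row, and the final sort_index() is an addition that restores the original row order:

```python
import pandas as pd

# Column names are assumed; the question shows no header row
df = pd.DataFrame(
    [['A', 'B', 6, 2],
     ['A', 'C', 4, 2],
     ['D', 'F', 9, 3],
     ['K', 'L', 8, 9],
     ['A', 'B', 4, 3],
     ['D', 'F', 8, 2]],
    columns=['A', 'B', 'C', 'D'],
)

result = (
    df.sort_values('C')                    # smallest C first
      .drop_duplicates(subset=['A', 'B'])  # keeps the first row, i.e. the smallest C, per (A, B) pair
      .sort_index()                        # restore the original row order
)
print(result)
```

This version avoids inplace=True and leaves the original df untouched, which makes the steps easier to chain.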

df.groupby() modification HELP needed

This is my table:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 2
Now, I want to group all rows by Column A and B. Column C should be summed and for column E, I want to use the value where value C is max.
I did the first part of grouping A and B and summing C. I did this with:
df = df.groupby(['A', 'B'])['C'].sum()
But at this point, I am not sure how to tell that column E should take the value where C is max.
The end result should look like this:
A B C E
0 1 1 6 4
1 3 3 8 2
Can somebody help me with this last piece?
Thanks!
Using groupby with agg after sorting by C.
In general, if you are applying different functions to different columns, DataFrameGroupBy.agg allows you to pass a dictionary specifying which operation is applied to each column:
df.sort_values('C').groupby(['A', 'B'], sort=False).agg({'C': 'sum', 'E': 'last'})
C E
A B
1 1 6 4
3 3 8 2
By sorting by column C first, and not sorting as part of groupby, we can select the last value of E per group, which will align with the maximum value of C for each group.
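For completeness, a self-contained sketch of the steps above on the table from the question. The trailing reset_index() is an addition here: it turns the group keys back into ordinary columns so the layout matches the requested end result.

```python
import pandas as pd

# Table copied from the question
df = pd.DataFrame({'A': [1, 1, 3],
                   'B': [1, 1, 3],
                   'C': [5, 1, 8],
                   'E': [4, 1, 2]})

result = (
    df.sort_values('C')                       # the row with max C ends up last in each group
      .groupby(['A', 'B'], sort=False)        # keep the sorted row order inside groups
      .agg({'C': 'sum', 'E': 'last'})         # sum C, take E from the last (max-C) row
      .reset_index()                          # A and B become regular columns again
)
print(result)
```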

In pandas, how to display the most frequent diagnoses in dataframe, but only count 1 occurrence of the same diagnoses per patient

In pandas and python:
I have a large dataset of health records in which patients have records of diagnoses.
How to display the most frequent diagnoses, but only count 1 occurrence of the same diagnoses per patient?
Example ('pid' is patient id. 'code' is the code of a diagnosis):
IN:
pid code
1 A
1 B
1 A
1 A
2 A
2 A
2 B
2 A
3 B
3 C
3 D
4 A
4 A
4 A
4 B
OUT:
B 4
A 3
C 1
D 1
I would like to be able to use .isin .index if possible.
Example:
Remove all rows with less than 3 in frequency count in column 'code'
s = df['code'].value_counts().ge(3)
df = df[df['code'].isin(s[s].index)]
You can use groupby + nunique:
df.groupby(by='code').pid.nunique().sort_values(ascending=False)
Out[60]:
code
B 4
A 3
D 1
C 1
Name: pid, dtype: int64
To remove all rows with less than 3 in frequency count in column 'code'
df.groupby(by='code').filter(lambda x: x.pid.nunique()>=3)
Out[55]:
pid code
0 1 A
1 1 B
2 1 A
3 1 A
4 2 A
5 2 A
6 2 B
7 2 A
8 3 B
11 4 A
12 4 A
13 4 A
14 4 B
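Since you mention value_counts (count(level=0) was removed in pandas 2.0, so group on the index level instead; the Name shown in the output below may differ between pandas versions):
df.groupby('code').pid.value_counts().groupby(level=0).count()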
Out[42]:
code
A 3
B 4
C 1
D 1
Name: pid, dtype: int64
You should be able to use the groupby and nunique() functions to obtain a distinct count of patients that had each diagnosis. This should give you the result you need:
df[['pid', 'code']].groupby(['code']).nunique()
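If you would like to keep the value_counts / .isin / .index pattern from the question, one possible sketch is to drop repeated (pid, code) pairs first, so that each patient counts once per diagnosis. The names counts, keep and filtered are illustrative only:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'pid':  [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    'code': list('ABAAAABABCDAAAB'),
})

# One row per (patient, diagnosis) pair, then count patients per diagnosis
counts = df.drop_duplicates(['pid', 'code'])['code'].value_counts()
print(counts)

# Keep only rows whose code was recorded for at least 3 distinct patients
keep = counts[counts.ge(3)].index
filtered = df[df['code'].isin(keep)]
print(filtered)
```

On the sample data this keeps the rows for codes A and B and drops the single C and D rows.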
