Create if/else flag column based on group value in pandas - python

I have a DataFrame df which looks like this
ID timediff group_count
1 30 1
2 20 4
2 25 4
2 40 4
2 27 4
3 15 3
3 10 3
3 40 3
I'm trying to create a flag column that assesses records at the group-ID level, checking whether the following conditions are met:
if df.timediff <= 30 OR
(df.timediff > 30 and df.group_count >= 4)
then df['flag'] = 1
else df['flag'] = 0
df should then be flagged like this:
ID timediff group_count flag1
1 30 1 1
2 20 4 1
2 25 4 1
2 40 4 1
2 27 4 1
3 15 3 0
3 10 3 0
3 40 3 0
Groups flagged with 0 should be dropped. Wondering if those 0-flagged rows can be dropped immediately.

We can try groupby with transform to create the conditions:
cond = df['timediff'].le(30).groupby(df['ID']).transform('all')  # True when every timediff in the group is <= 30
cnt = df.groupby('ID')['timediff'].transform('count')  # number of rows per ID
df['new'] = (((cnt >= 4) & (~cond)) | cond).astype(int)
df
Out[194]:
ID timediff group_count new
0 1 30 1 1
1 2 20 4 1
2 2 25 4 1
3 2 40 4 1
4 2 27 4 1
5 3 15 3 0
6 3 10 3 0
7 3 40 3 0
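To drop the 0-flagged groups immediately, as asked above, a boolean filter on the computed column should work; a minimal sketch, assuming the flag column is named new as in the answer:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 2, 2, 3, 3, 3],
    'timediff': [30, 20, 25, 40, 27, 15, 10, 40],
    'group_count': [1, 4, 4, 4, 4, 3, 3, 3],
})

cond = df['timediff'].le(30).groupby(df['ID']).transform('all')
cnt = df.groupby('ID')['timediff'].transform('count')
df['new'] = (((cnt >= 4) & (~cond)) | cond).astype(int)

# keep only the rows whose group-level flag is 1
kept = df[df['new'].eq(1)].reset_index(drop=True)
```

The filter and the flag computation can be done in one pass, but keeping the flag column first makes it easy to inspect which groups are being removed.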

Related

Counting consecutive duplicate elements in a dataframe and storing them in a new column

I am trying to count the consecutive elements in a data frame and store them in a new column. I don't want the total number of times an element appears overall, but how many times it appeared consecutively. I used this:
a=[1,1,3,3,3,5,6,3,3,0,0,0,2,2,2,0]
df = pd.DataFrame(list(zip(a)), columns =['Patch'])
df['count'] = df.groupby('Patch').Patch.transform('size')
print(df)
this gave me a result like this:
Patch count
0 1 2
1 1 2
2 3 5
3 3 5
4 3 5
5 5 1
6 6 1
7 3 5
8 3 5
9 0 4
10 0 4
11 0 4
12 2 3
13 2 3
14 2 3
15 0 4
However, I want the result to be like this:
Patch count
0 1 2
1 3 3
2 5 1
3 6 1
4 3 2
5 0 3
6 2 3
7 0 1
df = (
    df.groupby((df.Patch != df.Patch.shift(1)).cumsum())
    .agg({"Patch": ("first", "count")})
    .reset_index(drop=True)
    .droplevel(level=0, axis=1)
    .rename(columns={"first": "Patch"})
)
print(df)
Prints:
Patch count
0 1 2
1 3 3
2 5 1
3 6 1
4 3 2
5 0 3
6 2 3
7 0 1
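The key idea above is that (df.Patch != df.Patch.shift(1)).cumsum() assigns a new group id every time the value changes. A minimal sketch of the same run-length grouping using named aggregation instead of the dict/droplevel combination (an equivalent alternative, not the answer's exact code):

```python
import pandas as pd

a = [1, 1, 3, 3, 3, 5, 6, 3, 3, 0, 0, 0, 2, 2, 2, 0]
df = pd.DataFrame({'Patch': a})

# a new run id starts wherever the value changes
run_id = (df['Patch'] != df['Patch'].shift()).cumsum()

out = (
    df.groupby(run_id)['Patch']
    .agg(Patch='first', count='size')  # named aggregation avoids MultiIndex columns
    .reset_index(drop=True)
)
```

Named aggregation (pandas >= 0.25) gives flat column names directly, so no droplevel/rename step is needed.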

Pandas Structured 2D Data to XYZ Table

I want to create an xyz table from a structured grid representation of data in a Pandas DataFrame.
What I have is:
# Grid Nodes/indices
x=np.arange(5)
y=np.arange(5)
# DataFrame
df = pd.DataFrame(np.random.rand(5,5), columns=x, index=y)
>>>df
0 1 2 3 4
0 0.536047 0.673782 0.935536 0.853286 0.916081
1 0.884820 0.438207 0.070120 0.292445 0.789046
2 0.899633 0.822928 0.445154 0.643797 0.776154
3 0.682129 0.974159 0.078451 0.306872 0.689137
4 0.117842 0.770962 0.861076 0.429738 0.149199
I want to convert the above DataFrame to this DataFrame structure:
>>>df
x y val
0 0 #
0 1 #
...
4 4 #
I can write a for loop to do this, but I believe I should be able to use pivot, stack, or some other built-in method; I'm not getting it from the documentation, which seems to produce multilevel DataFrames that I do not want. Bonus points for converting it back.
dff = df.stack().reset_index(name="values")
pd.pivot_table(index="level_0",columns="level_1",values="values",data=dff)
The first line (stacking) is taken from the previous answer; the second line pivots the long table back to the original grid.
level_0 level_1 values
0 0 0 0.536047
1 0 1 0.673782
2 0 2 0.935536
3 0 3 0.853286
4 0 4 0.916081
5 1 0 0.884820
6 1 1 0.438207
7 1 2 0.070120
8 1 3 0.292445
9 1 4 0.789046
10 2 0 0.899633
11 2 1 0.822928
12 2 2 0.445154
13 2 3 0.643797
14 2 4 0.776154
15 3 0 0.682129
16 3 1 0.974159
17 3 2 0.078451
18 3 3 0.306872
19 3 4 0.689137
20 4 0 0.117842
21 4 1 0.770962
22 4 2 0.861076
23 4 3 0.429738
24 4 4 0.149199
# Unstacking
level_1 0 1 2 3 4
level_0
0 0.536047 0.673782 0.935536 0.853286 0.916081
1 0.884820 0.438207 0.070120 0.292445 0.789046
2 0.899633 0.822928 0.445154 0.643797 0.776154
3 0.682129 0.974159 0.078451 0.306872 0.689137
4 0.117842 0.770962 0.861076 0.429738 0.149199
Use df.stack with df.reset_index:
In [4474]: df = df.stack().reset_index(name='value').rename(columns={'level_0':'x', 'level_1': 'y'})
In [4475]: df
Out[4475]:
x y value
0 0 0 0.772210
1 0 1 0.921495
2 0 2 0.903645
3 0 3 0.980514
4 0 4 0.156923
5 1 0 0.516448
6 1 1 0.121148
7 1 2 0.394074
8 1 3 0.532963
9 1 4 0.369175
10 2 0 0.605971
11 2 1 0.712189
12 2 2 0.866299
13 2 3 0.174830
14 2 4 0.042236
15 3 0 0.350161
16 3 1 0.100152
17 3 2 0.049185
18 3 3 0.808631
19 3 4 0.562624
20 4 0 0.090918
21 4 1 0.713475
22 4 2 0.723183
23 4 3 0.569887
24 4 4 0.980238
For converting it back, use df.pivot (keyword arguments are required in pandas >= 2.0):
In [4481]: unstacked_df = df.pivot(index='x', columns='y')
In [4482]: unstacked_df
Out[4482]:
value
y 0 1 2 3 4
x
0 0.772210 0.921495 0.903645 0.980514 0.156923
1 0.516448 0.121148 0.394074 0.532963 0.369175
2 0.605971 0.712189 0.866299 0.174830 0.042236
3 0.350161 0.100152 0.049185 0.808631 0.562624
4 0.090918 0.713475 0.723183 0.569887 0.980238
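A slightly tidier round trip names the axes first with rename_axis, so stack produces the x/y columns directly without a rename step; a minimal sketch (the x/y/val column names are assumptions matching the question's target layout):

```python
import numpy as np
import pandas as pd

# structured grid as in the question
x = np.arange(5)
y = np.arange(5)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 5)), columns=x, index=y)

# name the axes so stack() yields 'x' and 'y' columns directly
long_df = (
    df.rename_axis(index='x', columns='y')
    .stack()
    .reset_index(name='val')
)

# round trip back to the wide grid
wide = long_df.pivot(index='x', columns='y', values='val')
```

Using values='val' in pivot also avoids the extra column level that appears when pivot is asked to carry all remaining columns.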

pandas create category column based on sequence repetition in another column

This is very likely a duplicate, but I'm not sure what to search for to find it.
I have a column in a dataframe that cycles from 0 to some value a number of times (in my example it cycles to 4 three times). I want to create another column that simply shows which cycle it is. Example:
import pandas as pd
df = pd.DataFrame({'A':[0,1,2,3,4,0,1,2,3,4,0,1,2,3,4]})
df['desired_output'] = [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2]
print(df)
A desired_output
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 0 1
6 1 1
7 2 1
8 3 1
9 4 1
10 0 2
11 1 2
12 2 2
13 3 2
14 4 2
I was thinking maybe something along the lines of a groupby(), cumsum() and transform(), but I'm not quite sure how to implement it. Could be wrong though.
Compare with 0 using Series.eq, take Series.cumsum, and finally subtract 1:
df['desired_output'] = df['A'].eq(0).cumsum() - 1
print (df)
A desired_output
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 0 1
6 1 1
7 2 1
8 3 1
9 4 1
10 0 2
11 1 2
12 2 2
13 3 2
14 4 2
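If the cycles are detected by a drop in value rather than an exact 0 at the start (e.g. cycles that begin at different values), a variant using diff works the same way; a minimal sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]})

# a new cycle starts wherever the value decreases;
# diff() is NaN on the first row, and NaN < 0 is False, so counting starts at 0
df['cycle'] = df['A'].diff().lt(0).cumsum()
```

On this input it produces the same result as eq(0).cumsum() - 1, but it does not require the first element of each cycle to be exactly 0.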

compare the next row value and change the current row value using pandas python

Is there any way of comparing a row value with the next row value and changing the current row value using pandas?
Basically, in the first DataFrame DF1, one of the entries in the value column is 999, and the values in the subsequent rows for that user-id are less than 999. In this case I want to add 1000, which is 10^(len('999')), to all successive values for that user-id.
I tried using shift, but found that it skips one of the row values by giving a null, and I am also not sure how to do it without creating a new column.
For example,
if this is the data set I have, DF1
user-id serial-number value day
1 2 10 1
1 2 20 2
1 2 30 3
1 2 40 4
1 2 50 5
1 2 60 6
1 2 70 7
1 2 80 8
1 2 90 9
1 2 100 10
1 2 999 11
1 2 300 12
1 2 400 13
2 3 11 1
2 3 12 2
2 3 13 3
2 3 14 4
2 3 99 5
2 3 16 6
2 3 17 7
2 3 18 8
I need the resultant data frame to be DF1:
user-id serial-number value day
1 2 10 1
1 2 20 1
1 2 30 1
1 2 40 1
1 2 50 1
1 2 60 1
1 2 70 1
1 2 80 1
1 2 90 1
1 2 100 1
1 2 999 1
1 2 1300 1
1 2 1400 1
. .
2 3 11 1
2 3 12 1
2 3 13 1
2 3 14 1
2 3 99 1
2 3 116 1
2 3 117 1
2 3 118 1
I think I've explained the question properly. Similarly, I want to do this for all the values in the value column for each user-id.
Any suggestions?
I have 2 methods for this.
In the first we multiply by the magnitude of the max value of each user-id; it works on the sample dataset you provided but might not work in general:
df.set_index('user-id', inplace=True)
df['value'] += df.groupby('user-id')['value'].apply(
    lambda x: (x.shift() > x).astype(int).cumsum()
) * 10**df.groupby('user-id')['value'].max().apply(lambda x: len(str(x)))
The other one loops through each item:
def foo(x):
    for i in range(1, len(x)):
        if x.iloc[i] < x.iloc[i-1]:
            x.iloc[i:] = x.iloc[i:] + 10**(len(str(x.iloc[i-1])))
    return x
df['value'] = df.groupby('user-id')['value'].apply(foo)
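A self-contained run of the looping method on the question's sample data, as a sketch to check it against the expected output (transform is used here instead of apply to keep the original row index, which should be equivalent for a same-length result):

```python
import pandas as pd

df = pd.DataFrame({
    'user-id': [1]*13 + [2]*8,
    'value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 999, 300, 400,
              11, 12, 13, 14, 99, 16, 17, 18],
})

def foo(x):
    # whenever a value drops below its predecessor, shift the rest of the
    # group up by the next power of ten of that predecessor
    for i in range(1, len(x)):
        if x.iloc[i] < x.iloc[i - 1]:
            x.iloc[i:] = x.iloc[i:] + 10 ** len(str(x.iloc[i - 1]))
    return x

df['value'] = df.groupby('user-id')['value'].transform(foo)
```

For user-id 1 this turns 300, 400 into 1300, 1400 (adding 10^3 after the drop from 999), and for user-id 2 it turns 16, 17, 18 into 116, 117, 118, matching the expected DF1.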

Python Pandas dataframe is not including all duplicates

I'm basically trying to create a Pandas dataframe (CQUAD_mech_loads) that is a subset of a larger dataframe (CQUAD_Mech). This subset dataframe is essentially created by filtering based on two conditions. There are NO duplicates in the larger dataframe (CQUAD_Mech).
The problem is that my subset dataframe doesn't include the duplicate ID's in the ELM column. It does, however, include duplicates in the LC column.
CQUAD_ELM is a list containing four IDs ([387522, 387522, 387506, 387507]), with 387522 appearing twice. Right now CQUAD_mech_loads is a dataframe with only three rows, one for each unique ID. I want that fourth, duplicate ID in there as well.
The code:
def get_df(df, col1, cond1, col2='', cond2=0):
    return df[(df[col1] == cond1) & (df[col2].isin(cond2))].reset_index(drop=True)

CQUAD_mech_loads = get_df(CQUAD_Mech, 'LC', LC, 'ELM', CQUAD_ELM)
The output (where is the other line for 387522?):
LC ELM FX FY FXY
0 3113 387506 0 0 0
1 3113 387507 0 0 0
2 3113 387522 0 0 0
Since you're dropping the index anyway, you can just set the index to the column you're interested in and use .loc indexing (.ix has been removed from modern pandas):
In [28]: df = pd.DataFrame(np.arange(25).reshape(5,5))
In [29]: df
Out[29]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
In [30]: df.set_index(4, drop=False).loc[[4,4,19,4,24]].reset_index(drop=True)
Out[30]:
0 1 2 3 4
0 0 1 2 3 4
1 0 1 2 3 4
2 15 16 17 18 19
3 0 1 2 3 4
4 20 21 22 23 24
EDIT: Your current method just finds each distinct col1/col2 pair. If you want to filter on multiple columns, just do it twice, once for each column:
In [98]: df.set_index(1, drop=False).loc[[1, 6, 16]].set_index(4, drop=False).loc[[4,4,4,4,4,4,4,4,19,9]].reset_index(drop=True)
Out[98]:
0 1 2 3 4
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
4 0 1 2 3 4
5 0 1 2 3 4
6 0 1 2 3 4
7 0 1 2 3 4
8 15 16 17 18 19
9 5 6 7 8 9
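The reason isin collapses duplicates is that it only tests membership per row; indexing by a list of labels with .loc repeats rows once per occurrence in the list. A minimal sketch on the question's own column names (the LC value and the extra row are made-up test data):

```python
import pandas as pd

CQUAD_Mech = pd.DataFrame({
    'LC':  [3113, 3113, 3113, 3113],
    'ELM': [387506, 387507, 387522, 387599],
    'FX':  [0, 0, 0, 0],
})
CQUAD_ELM = [387522, 387522, 387506, 387507]

# filter on LC with a boolean mask, then look up ELM by label so
# duplicate entries in the lookup list produce duplicate rows
sub = CQUAD_Mech[CQUAD_Mech['LC'] == 3113]
out = sub.set_index('ELM', drop=False).loc[CQUAD_ELM].reset_index(drop=True)
```

Rows come back in the order of CQUAD_ELM, with 387522 appearing twice, which is exactly what the isin-based filter cannot provide.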