I want to make a calculation based on 4 columns in a dataframe and apply the result to a new column.
The 4 columns I'm interested in are as follows.
   rating_1  time_1  rating_2  time_2  col_x  col_y  ...
0         1       1         1       1      1      1
If time_1 is greater than time_2, I want rating_1 in the new column; if time_2 is greater, I want rating_2.
What's the simplest way to do this please?
You can use the numpy.where() function:
In [241]: x
Out[241]:
   rating_1  time_1  rating_2  time_2  col_x  col_y
0        11       1        21       1      1      1
1        12       2        21       1      1      1
2        13       1        21       5      1      1
3        14       5        21       5      1      1

In [242]: x['new'] = np.where(x.time_1 > x.time_2, x.rating_1, x.rating_2)

In [243]: x
Out[243]:
   rating_1  time_1  rating_2  time_2  col_x  col_y  new
0        11       1        21       1      1      1   21
1        12       2        21       1      1      1   12
2        13       1        21       5      1      1   21
3        14       5        21       5      1      1   21
Alternatively, you can apply a row-wise function. Note that the >= sends ties to rating_1, and that df.apply is much slower than np.where on large frames:

def myfunc(row):
    if row.time_1 >= row.time_2:
        return row.rating_1
    else:
        return row.rating_2

df.loc[:, 'calculatedColumn'] = df.apply(myfunc, axis=1)
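If you later need more than two outcomes, numpy.select generalizes the same idea. A minimal sketch reusing the x frame from above and marking ties explicitly; the -1 marker is just an assumption for illustration:

conditions = [x.time_1 > x.time_2, x.time_1 < x.time_2]
choices = [x.rating_1, x.rating_2]
# rows where neither condition holds (a tie) receive the default value
x['new'] = np.select(conditions, choices, default=-1)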
I have a DataFrame df which looks like this:

ID  timediff  group_count
 1        30            1
 2        20            4
 2        25            4
 2        40            4
 2        27            4
 3        15            3
 3        10            3
 3        40            3
I'm trying to create a flag column which assesses records at the group (ID) level to check if the following conditions are met:
if df.timediff <= 30 OR
(df.timediff > 30 and df.group_count >= 4)
then df['flag'] = 1
else df['flag'] = 0
df should then be flagged like this:
ID  timediff  group_count  flag1
 1        30            1      1
 2        20            4      1
 2        25            4      1
 2        40            4      1
 2        27            4      1
 3        15            3      0
 3        10            3      0
 3        40            3      0
Groups flagged with 0 should be dropped; I'm wondering if those rows can be dropped immediately.
We can try groupby with transform to create the conditions:
# True for every row whose entire ID group has timediff <= 30
cond = df['timediff'].le(30).groupby(df['ID']).transform('all')
# number of rows in each ID group
cnt = df.groupby('ID')['timediff'].transform('count')
# flag a row if its group passes the timediff check, or has at least 4 rows
df['new'] = (((cnt >= 4) & (~cond)) | cond).astype(int)
df
Out[194]:
   ID  timediff  group_count  new
0   1        30            1    1
1   2        20            4    1
2   2        25            4    1
3   2        40            4    1
4   2        27            4    1
5   3        15            3    0
6   3        10            3    0
7   3        40            3    0
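To answer the follow-up about dropping the 0-flagged groups immediately: a minimal sketch, reusing the cond and cnt series from above, that filters without ever materializing the flag column:

# keep a row if its group passes either group-level condition
keep = cond | (cnt >= 4)
df_kept = df[keep]   # groups 1 and 2 survive; group 3 is dropped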
I have a dataframe and need to break it into 2 equal dataframes.
The 1st dataframe would contain the top half of the rows and the 2nd would contain the remaining rows.
How can I achieve this using Python?
It should handle both the even-row and the odd-row scenario (with an odd number of rows, I would need to drop the last row to make the halves equal).
Consider df:
In [122]: df
Out[122]:
    id  days  sold  days_lag
0    1     1     1         0
1    1     3     0         2
2    1     3     1         2
3    1     8     1         5
4    1     8     1         5
5    1     8     0         5
6    2     3     0         0
7    2     8     1         5
8    2     8     1         5
9    2     9     2         1
10   2     9     0         1
11   2    12     1         3
12   3     4     5         6
Use numpy.array_split():
In [127]: import numpy as np

In [128]: def split_df(df):
     ...:     if len(df) % 2 != 0:  # handle df with an odd number of rows
     ...:         df = df.iloc[:-1, :]
     ...:     df1, df2 = np.array_split(df, 2)
     ...:     return df1, df2
     ...:
In [130]: df1, df2 = split_df(df)
In [131]: df1
Out[131]:
   id  days  sold  days_lag
0   1     1     1         0
1   1     3     0         2
2   1     3     1         2
3   1     8     1         5
4   1     8     1         5
5   1     8     0         5

In [133]: df2
Out[133]:
    id  days  sold  days_lag
6    2     3     0         0
7    2     8     1         5
8    2     8     1         5
9    2     9     2         1
10   2     9     0         1
11   2    12     1         3
With a simple example, you can try it as below:
import pandas as pd

data = [['Alex', 10], ['Bob', 12], ['Clarke', 13], ['Tom', 20], ['Jerry', 25]]
#data = [['Alex', 10], ['Bob', 12], ['Clarke', 13], ['Tom', 20]]

data1 = data[0:len(data) // 2]
if len(data) % 2 == 0:
    data2 = data[len(data) // 2:]
else:
    data2 = data[len(data) // 2:-1]  # odd length: drop the last item

df1 = pd.DataFrame(data1, columns=['Name', 'Age'], dtype=float)
print("1st half:\n", df1)
df2 = pd.DataFrame(data2, columns=['Name', 'Age'], dtype=float)
print("2nd Half:\n", df2)
Output:
D:\Python>python temp.py
1st half:
    Name   Age
0   Alex  10.0
1    Bob  12.0
2nd Half:
      Name   Age
0   Clarke  13.0
1      Tom  20.0
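If the data is already in a DataFrame rather than a list, a minimal iloc-based sketch (assuming df is the frame to split) avoids the list slicing entirely:

half = len(df) // 2            # integer midpoint
df1 = df.iloc[:half]           # top half
df2 = df.iloc[half:2 * half]   # bottom half; drops the last row when len(df) is odd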
I have a grouped pandas DataFrame:
dis  type  id        date  qty
  1     1  10  2017-01-01    1
  1     1  10  2017-01-01    0
  1     1  10  2017-01-02  4.5
  1     2  11  2017-04-03    1
  1     2  11  2017-04-03    2
  1     2  11  2017-04-03    0
  1     2  11  2017-04-05    0
I want to apply some operations to this grouped data:
1. Add a new column total_order that counts the number of orders on a particular date for a particular material.
2. Add a column zero_qty that counts the number of zero-quantity orders on a particular date for a particular material.
3. Change the date column so it holds the number of days between each subsequent order for a particular material; the first order becomes 0.
The final dataframe should look something like this:
dis  type  id  date  qty  total_order  zero_qty
  1     1  10     0    1            2         1
  1     1  10     0    0            2         1
  1     1  10     1  4.5            1         1
  1     2  11     0    1            3         2
  1     2  11     0    2            3         2
  1     2  11     0    0            3         2
  1     2  11     2    0            1         1
I think you need transform to get the size of each group for total_order, then count the number of zeros in qty, and finally get the difference with diff, fillna, and dt.days.
Notice: the difference requires sorted values, so use sort_values first if necessary:
df = df.sort_values(['dis', 'type', 'id', 'date'])
g = df.groupby(['dis', 'type', 'id', 'date'])
# size of each (dis, type, id, date) group
df['total_order'] = g['id'].transform('size')
# number of zero-quantity orders in each group
df['zero_qty'] = g['qty'].transform(lambda x: (x == 0).sum()).astype(int)
# days between consecutive orders per material; pd.Timedelta(0) is used because
# plain fillna(0) raises on timedelta columns in recent pandas versions
df['date'] = df.groupby(['dis', 'type', 'id'])['date'].diff().fillna(pd.Timedelta(0)).dt.days
print(df)
   dis  type  id  date  qty  total_order  zero_qty
0    1     1  10     0  1.0            2         1
1    1     1  10     0  0.0            2         1
2    1     1  10     1  4.5            1         0
3    1     2  11     0  1.0            3         1
4    1     2  11     0  2.0            3         1
5    1     2  11     0  0.0            3         1
6    1     2  11     2  0.0            1         1
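One caveat: diff().dt.days only works when date is a datetime column. If it was read from CSV as strings, convert it first (a minimal sketch):

df['date'] = pd.to_datetime(df['date'])  # parse strings like '2017-01-01' into datetime64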
Another solution: instead of multiple transforms, use apply with a custom function:
df = df.sort_values(['dis', 'type', 'id', 'date'])

def f(x):
    x['total_order'] = len(x)
    x['zero_qty'] = int(x['qty'].eq(0).sum())
    return x

df = df.groupby(['dis', 'type', 'id', 'date']).apply(f)
df['date'] = df.groupby(['dis', 'type', 'id'])['date'].diff().fillna(pd.Timedelta(0)).dt.days
print(df)
   dis  type  id  date  qty  total_order  zero_qty
0    1     1  10     0  1.0            2         1
1    1     1  10     0  0.0            2         1
2    1     1  10     1  4.5            1         0
3    1     2  11     0  1.0            3         1
4    1     2  11     0  2.0            3         1
5    1     2  11     0  0.0            3         1
6    1     2  11     2  0.0            1         1
EDIT:
The last line can be rewritten too if you need to process more columns:

def f2(x):
    # add any other per-group logic here
    x['date'] = x['date'].diff().fillna(pd.Timedelta(0)).dt.days
    return x

df = df.groupby(['dis', 'type', 'id']).apply(f2)
I have a dataframe df that looks like this:
             ID1       ID2  Bool  Count
0       12868123  387DB71C     0      1
1       12868123  84C0E502     1     11
2       12868123  387DB71C     1      1
8       12868123  80A9DCFC     0     16
9       12868123  7A260136     1     20
10      12868123  80A9DCFC     0     16
11      12868123  80BB4591     0     36
327295  8617B7D9  76A08B0E     0     19
327296  8617B7D9  76A08B0E     0     19
327297  8617B7D9  76D0DA26     1      2
327298  8617B7D9  7C92B2A6     1      3
327299  8617B7D9  75883296     1      1
327300  8617B7D9  78711A4F     0     12
327301  8617B7D9  78711A4F     0     12
327302  8617B7D9  78711A4F     0     12
I want to do two things:
1- I want to "randomly" extract n unique rows for each (ID1, Bool) instance.
So if n = 2, one possible result could be:
             ID1       ID2  Bool  Count
0       12868123  387DB71C     0      1
8       12868123  80A9DCFC     0     16
1       12868123  84C0E502     1     11
2       12868123  387DB71C     1      1
327295  8617B7D9  76A08B0E     0     19
327296  8617B7D9  76A08B0E     0     19
327297  8617B7D9  76D0DA26     1      2
327298  8617B7D9  7C92B2A6     1      3
I tried looking for something along the lines of df.groupby('ID1', 'Bool').random(size=n), but couldn't figure it out.
2- I then want to calculate the average Count for each (ID1, Bool) pair, so that the final resulting DF is:
        ID1  Bool  AverageCount
0  12868123     0           8.5
1  12868123     1           6.0
2  8617B7D9     0          19.0
3  8617B7D9     1           2.5
I think I have the second part figured out:
df.groupby(['ID1','Bool'])['Count'].mean()
groupby + sample
df.groupby(['ID1', 'Bool']).apply(
    lambda g: g.sample(2).Count.mean()  # sample 2 rows per group, then average Count
).reset_index(name='AverageCount')
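If you also need the sampled rows themselves (part 1 of the question), the same groupby + sample idea returns them directly; group_keys=False keeps the original index (exact behavior may vary slightly across pandas versions):

sampled = df.groupby(['ID1', 'Bool'], group_keys=False).apply(lambda g: g.sample(2))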
You can use groupby with numpy.random.choice:
n = 2
# note: np.random.choice samples WITH replacement by default;
# pass replace=False if the n sampled rows must be unique
df1 = df.groupby(['ID1', 'Bool'])['Count'] \
        .apply(lambda x: np.mean(np.random.choice(x, n))) \
        .reset_index(name='AverageCount')
print(df1)
        ID1  Bool  AverageCount
0  12868123     0          18.5
1  12868123     1           6.0
2  8617B7D9     0          19.0
3  8617B7D9     1           3.0
I have a pandas DF with two columns, Day and Data, read from a csv file.
After reading, I add 3 columns: "Days with condition 0", 1, and 2. For example, for the column 'Days with condition 2' I do this:
DF['Days with condition 2'] = ''
DF['Days with condition 2'][DF['Data'] == 2] = 1
What I need to do and can't figure out is how to calculate 'Days since condition' 0,1,2. For example, the 'Days since condition 2' should display 11 in index 19, since that's the number of rows since the last condition was triggered (index 8). Is there any pandas function to do this?
Starting with your two original columns:

    Day  Data
0     1     1
1     2     0
2     3     0
3     4     0
4     5     0
5     6     0
6     7     1
7     8     0
8     9     2
9    10     0
10   11     0
11   12     1
12   13     0
13   14     0
14   15     0
15   16     1
16   17     0
17   18     1
18   19     0
19   20     2
20   21     0
21   22     0
22   23     0
Here's how you could populate "Days with condition 2". Filter for the 2s using boolean indexing, then subtract the previous Day using shift().
The next couple of steps filter for the first occurrence of 2 and set "Days with condition 2" equal to Day, but it could be whatever you want it to be.
Finally, a fillna() gets rid of the NaNs. The same pattern could be used for the other two columns you want to add:
filter = (df["Data"] == 2)
df.loc[filter,"Days with condition 2"] = df[filter]["Day"] - df[filter]["Day"].shift(1)
filter = filter & (df["Days with condition 2"].isnull())
df.loc[filter,"Days with condition 2"] = df[filter]["Day"]
df = df.fillna(0)
df
    Day  Data  Days with condition 2
0     1     1                      0
1     2     0                      0
2     3     0                      0
3     4     0                      0
4     5     0                      0
5     6     0                      0
6     7     1                      0
7     8     0                      0
8     9     2                      9
9    10     0                      0
10   11     0                      0
11   12     1                      0
12   13     0                      0
13   14     0                      0
14   15     0                      0
15   16     1                      0
16   17     0                      0
17   18     1                      0
18   19     0                      0
19   20     2                     11
20   21     0                      0
21   22     0                      0
22   23     0                      0
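Since the question asks for all three conditions, the same pattern can be looped over the values 0, 1, and 2. A minimal sketch; the "Days since condition" column naming is an assumption following the question's wording:

for cond in [0, 1, 2]:
    col = "Days since condition {}".format(cond)
    mask = df["Data"] == cond
    # gap in Day since the previous occurrence of this condition
    df.loc[mask, col] = df.loc[mask, "Day"] - df.loc[mask, "Day"].shift(1)
    # the first occurrence has no predecessor; fall back to Day itself
    first = mask & df[col].isnull()
    df.loc[first, col] = df.loc[first, "Day"]
df = df.fillna(0)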