Pandas DataFrame create new csv column based on two other columns - python

I need to create a new column in a csv called BTTS, which is based on two other columns, FTHG and FTAG. If FTHG & FTAG are both greater than zero, BTTS should be 1. Otherwise it should be zero.
What's the best way to do this in pandas / numpy?

I'm not sure what the best way is, but here is one solution using pandas' loc method:
df.loc[((df['FTHG'] > 0) & (df['FTAG'] > 0)),'BTTS'] = 1
df['BTTS'].fillna(0, inplace=True)
Another solution using pandas apply method:
def check_greater_zero(row):
    return 1 if row['FTHG'] > 0 and row['FTAG'] > 0 else 0
df['BTTS'] = df.apply(check_greater_zero, axis=1)
EDIT:
As stated in the comments, the first, vectorized, implementation is more efficient.
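The vectorized approach can also be written in a single step that avoids the fillna pass entirely: build the combined boolean condition and cast it straight to integers. A minimal sketch on made-up scores (the column names come from the question):

```python
import pandas as pd

# Toy data standing in for the CSV; values are invented for illustration.
df = pd.DataFrame({'FTHG': [2, 0, 1, 3], 'FTAG': [1, 0, 0, 2]})

# Both conditions as one boolean Series, cast directly to 0/1 --
# no intermediate NaN column to fill afterwards.
df['BTTS'] = ((df['FTHG'] > 0) & (df['FTAG'] > 0)).astype(int)
print(df['BTTS'].tolist())  # [1, 0, 0, 1]
```

This stays fully vectorized, so it keeps the performance advantage mentioned above.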

I don't know if this is the best way to do it, but this works :)
df['BTTS'] = [1 if x > 0 and y > 0 else 0 for x, y in zip(df['FTAG'], df['FTHG'])]

Related

compare two columns in data frame, then produce 1 or 0 if they are equal or not

I wanted to add a column that would tell me if two of my results were the same so I could calculate a % of true/1 or false/0
def same(closests):
    if 'ConvenienceStoreClosest' >= 'ConvenienceStoreClosestOSRM':
        return 1
    else:
        return 0
This is what I tried
df_all['same'] = df_all['ConvenienceStoreClosest'].apply(same)
specific section from df_all
Never use a loop/apply when you can use vectorial code.
In your case a simple way would be:
df_all['same'] = (df_all['ConvenienceStoreClosest']
                  .eq(df_all['ConvenienceStoreClosestOSRM'])
                  .astype(int)
                 )
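To see the pattern end to end, here is a sketch on a hypothetical df_all (the store names are invented; only the two compared columns matter). The percentage of matches the asker wanted then falls out of the column mean:

```python
import pandas as pd

# Hypothetical stand-in for df_all; values are made up for illustration.
df_all = pd.DataFrame({
    'ConvenienceStoreClosest':     ['7-Eleven', 'Lawson', 'FamilyMart'],
    'ConvenienceStoreClosestOSRM': ['7-Eleven', 'Circle K', 'FamilyMart'],
})

# Element-wise equality of the two columns, cast to 0/1.
df_all['same'] = (df_all['ConvenienceStoreClosest']
                  .eq(df_all['ConvenienceStoreClosestOSRM'])
                  .astype(int))

print(df_all['same'].tolist())      # [1, 0, 1]
print(df_all['same'].mean() * 100)  # ~66.7, the match percentage
```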

Rownumber without groupby Pandas

I have three columns: id(unique), value, time
I want to create a new column that does a simple row_number without any partitioning
I tried : df['test'] = df.groupby('id_col').cumcount()+1
But the output is only ones.
Expecting to get 1->len of the dataframe
Also, is there a way to do it in numpy for better performance?
If your index is already ordered starting from 0:
df["row_num"] = df.index + 1
Otherwise:
df["row_num"] = df.reset_index().index + 1
Comparing speed with %%timeit, from fastest to slowest: @Scott Boston's method > @Henry Ecker's method > mine.
df["row_num"] = range(1,len(df)+1)
Alternative:
df.insert(0, "row_num", range(1,len(df)+1))
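For the numpy route the asker mentions, np.arange builds the 1..n counter directly and works regardless of the index state. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame; the id/value columns are invented for illustration.
df = pd.DataFrame({'id': ['a', 'b', 'c'], 'value': [10, 20, 30]})

# np.arange produces the row numbers as a plain array, so no
# groupby and no dependence on how the index is currently ordered.
df['row_num'] = np.arange(1, len(df) + 1)
print(df['row_num'].tolist())  # [1, 2, 3]
```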

I want to update a pandas dataframe iteratively

I have a dataframe that I need to check some conditions in 2 other columns and update another column iteratively. Basically I want to replace NaNs in smoking_status column with new categories.
Here is my code:
import numpy as np
for i in range(df.shape[0]):
    if df['age'][i] < 15 and df['smoking_status'][i] == np.nan:
        df['smoking_status'][i] = 'never smoked'
    elif df['age'][i] >= 15 and df['smoking_status'][i] == np.nan:
        df['smoking_status'][i] = 'occassional smoker'
The code runs but when I check my updated table I still notice no change. Any help would be appreciated.
Try to use pandas' vectorized functions instead of looping through every row; they are both faster and result in neater code. Note that your loop never updates anything because df['smoking_status'][i] == np.nan is always False: NaN never compares equal to anything, including itself. Use isna() to detect missing values instead:
cond = df['smoking_status'].isna()
df.loc[cond, 'smoking_status'] = np.where(df.loc[cond, 'age'] < 15, 'never smoked', 'occassional smoker')
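An equivalent variant builds the age-based default for every row and lets fillna apply it only where smoking_status is missing. A sketch on invented data (the string values keep the question's spelling of 'occassional'):

```python
import numpy as np
import pandas as pd

# Toy frame with the two columns from the question; ages are made up.
df = pd.DataFrame({
    'age': [10, 40, 12, 70],
    'smoking_status': [np.nan, 'smokes', np.nan, np.nan],
})

# An age-based default for every row, index-aligned with df.
default = pd.Series(
    np.where(df['age'] < 15, 'never smoked', 'occassional smoker'),
    index=df.index,
)

# fillna only touches the NaN positions, so existing values survive.
df['smoking_status'] = df['smoking_status'].fillna(default)
print(df['smoking_status'].tolist())
# ['never smoked', 'smokes', 'never smoked', 'occassional smoker']
```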

Keep upper n rows of a pandas dataframe based on condition

how would I delete all rows from a dataframe that come after a certain fulfilled condition? As an example I have the following dataframe:
import pandas as pd
xEnd=1
yEnd=2
df = pd.DataFrame({'x':[1,1,1,2,2,2], 'y':[1,2,3,3,4,3], 'id':[0,1,2,3,4,5]})
How would I get a dataframe that deletes the last 4 rows and keeps the upper 2, since in row 2 the condition x=xEnd and y=yEnd is fulfilled?
EDITED: I should have mentioned that the dataframe is not necessarily ascending. It could also be descending and I would still like to get the upper rows.
To slice your dataframe until the first time a condition across 2 series is satisfied, first calculate the required index and then slice via iloc.
You can calculate the index via a boolean mask and np.ndarray.argmax, which returns the position of the first True:
idx = ((df['x'] == xEnd) & (df['y'] == yEnd)).values.argmax()
res = df.iloc[:idx+1]
print(res)
x y id
0 1 1 0
1 1 2 1
If you need better performance, see Efficiently return the index of the first value satisfying condition in array.
Not 100% sure I understand correctly, but you can filter your dataframe like this:
df[(df.x <= xEnd) & (df.y <= yEnd)]
this yields the dataframe:
   x  y  id
0  1  1   0
1  1  2   1
If x and y are not strictly increasing and you want what's above the first row that satisfies the condition:
df[df.index <= df[(df.x == xEnd) & (df.y == yEnd)].index[0]]
df = df.iloc[0:2, :]
This selects just the first two rows, keeps all columns, and puts the result in a new dataframe.
Or you can reuse the same variable name.
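Another way to keep everything up to and including the first matching row, whether the data is ascending or not, is a cumulative mask: cummax() turns True from the first hit onward, and shifting it down one row marks everything after the hit. A sketch using the question's own frame:

```python
import pandas as pd

xEnd, yEnd = 1, 2
df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2],
                   'y': [1, 2, 3, 3, 4, 3],
                   'id': [0, 1, 2, 3, 4, 5]})

# hit is True exactly where the stopping condition holds.
hit = (df['x'] == xEnd) & (df['y'] == yEnd)

# cummax: True from the first hit onward; shifting by one row marks
# only the rows *after* the first hit, which we then drop.
res = df[~hit.cummax().shift(fill_value=False)]
print(res['id'].tolist())  # [0, 1]
```

This needs no positional index arithmetic and works even if the match occurs on the first row.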

pandas generate new columns according to the former ones

There is a dataset with one of the columns containing some missing values. I want to generate a new column: if the cell of the former column is missing, the new column should be 1, else 0.
I tried
df[newcolumn] = map(lambda x: 1 if x is None else 0, df[formercolumn])
but it didn't work.
While
df[newcolumn] = df[formercolumn].isnull().apply(lambda x: 1 if x is True else 0)
worked well.
Any better solutions to this situation?
Use np.where:
df['newcolumn'] = np.where(df['formercolumn'].isnull(), 1, 0)
I have the following using numpy, which is really similar to your solution but slightly shorter/faster:
df[newcolumn] = df[formercolumn].apply(lambda x: 1 if np.isnan(x) else 0)
I think however that Scott's answer is better/faster.
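The shortest vectorized form skips apply and lambda entirely: isna() already yields the desired booleans, and casting gives the 1-for-missing flag directly. A sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Hypothetical column name; the pattern is the point.
df = pd.DataFrame({'formercolumn': [1.0, np.nan, 3.0, np.nan]})

# isna() marks missing cells True; astype(int) turns that into 1/0.
df['newcolumn'] = df['formercolumn'].isna().astype(int)
print(df['newcolumn'].tolist())  # [0, 1, 0, 1]
```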
