I need to create a new column in a csv called BTTS, which is based on two other columns, FTHG and FTAG. If FTHG & FTAG are both greater than zero, BTTS should be 1. Otherwise it should be zero.
What's the best way to do this in pandas / numpy?
I'm not sure what the best way is, but here is one solution using pandas' loc method:
df.loc[((df['FTHG'] > 0) & (df['FTAG'] > 0)),'BTTS'] = 1
df['BTTS'] = df['BTTS'].fillna(0)
Another solution using pandas apply method:
def check_greater_zero(row):
    # use `and`, not `&`: `&` binds tighter than `>`, so the original
    # `row['FTHG'] > 0 & row['FTAG'] > 0` parses as `row['FTHG'] > (0 & row['FTAG']) > 0`,
    # which is always False
    return 1 if row['FTHG'] > 0 and row['FTAG'] > 0 else 0

df['BTTS'] = df.apply(check_greater_zero, axis=1)
EDIT:
As stated in the comments, the first, vectorized, implementation is more efficient.
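For reference, a minimal fully vectorized sketch on made-up data (the sample values are assumptions), casting the boolean mask straight to 0/1:

import pandas as pd

# hypothetical sample scores
df = pd.DataFrame({'FTHG': [0, 2, 1], 'FTAG': [1, 1, 0]})

# 1 where both teams scored, 0 otherwise
df['BTTS'] = ((df['FTHG'] > 0) & (df['FTAG'] > 0)).astype(int)
# BTTS -> [0, 1, 0]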
I don't know if this is the best way to do it, but this works :)
df['BTTS'] = [1 if x > 0 and y > 0 else 0 for x, y in zip(df['FTAG'], df['FTHG'])]
I wanted to add a column that would tell me if two of my results were the same, so I could calculate a % of true/1 or false/0.
def same(closests):
    if 'ConvenienceStoreClosest' >= 'ConvenienceStoreClosestOSRM':
        return 1
    else:
        return 0
This is what I tried:
df_all['same'] = df_all['ConvenienceStoreClosest'].apply(same)
(a specific section from df_all was shown here)
Never use a loop/apply when you can use vectorized code.
Note that your function compares the two string literals, not the column values, and apply on a single column can't see the second column anyway. In your case a simple way would be:
df_all['same'] = (df_all['ConvenienceStoreClosest']
                  .eq(df_all['ConvenienceStoreClosestOSRM'])
                  .astype(int)
                 )
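A minimal reproduction on made-up distances (the values are assumptions), which also shows the percentage the question asks for:

import pandas as pd

# hypothetical sample: the same metric from two sources
df_all = pd.DataFrame({'ConvenienceStoreClosest': [120, 85, 300],
                       'ConvenienceStoreClosestOSRM': [120, 90, 300]})

df_all['same'] = (df_all['ConvenienceStoreClosest']
                  .eq(df_all['ConvenienceStoreClosestOSRM'])
                  .astype(int))

# share of rows where the two sources agree
print(df_all['same'].mean())  # 2 of 3 rows match -> ~0.67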
I have three columns: id(unique), value, time
I want to create a new column that does a simple row_number without any partitioning
I tried: df['test'] = df.groupby('id').cumcount() + 1
But the output is only ones.
I'm expecting to get 1 -> len of the dataframe.
Also, is there a way to do it in numpy for better performance?
Since id is unique, every group has size 1 and cumcount() is always 0, hence the ones; you don't need a groupby at all. If your index is already ordered starting from 0:
df["row_num"] = df.index + 1
otherwise:
df["row_num"] = df.reset_index().index + 1
Comparing speed with %%timeit, from fastest to slowest: @Scott Boston's method > @Henry Ecker's method > mine.
df["row_num"] = range(1,len(df)+1)
Alternative:
df.insert(0, "row_num", range(1,len(df)+1))
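For the numpy part of the question, a minimal sketch using np.arange on a hypothetical frame (the sample values are made up to match the columns described):

import numpy as np
import pandas as pd

# hypothetical data with the columns described in the question
df = pd.DataFrame({'id': [7, 3, 9], 'value': [10, 20, 30], 'time': [1, 2, 3]})

# plain row number, 1..len(df), with no partitioning
df['row_num'] = np.arange(1, len(df) + 1)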
I have a dataframe that I need to check some conditions in 2 other columns and update another column iteratively. Basically I want to replace NaNs in smoking_status column with new categories.
Here is my code:
import numpy as np
for i in range(df.shape[0]):
    if df['age'][i] < 15 and df['smoking_status'][i] == np.nan:
        df['smoking_status'][i] = 'never smoked'
    elif df['age'][i] >= 15 and df['smoking_status'][i] == np.nan:
        df['smoking_status'][i] = 'occasional smoker'
The code runs but when I check my updated table I still notice no change. Any help would be appreciated.
Try to use pandas' vectorized functions instead of looping through every row; they are both faster and result in neater code. Your loop also never matches anything: x == np.nan is always False, because NaN compares unequal to everything (including itself), so you need isna():
cond = df['smoking_status'].isna()
df.loc[cond, 'smoking_status'] = np.where(df.loc[cond, 'age'] < 15, 'never smoked', 'occasional smoker')
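A minimal end-to-end sketch on made-up rows (ages and statuses are assumptions):

import numpy as np
import pandas as pd

# hypothetical sample data
df = pd.DataFrame({'age': [10, 40, 25],
                   'smoking_status': [np.nan, np.nan, 'smokes']})

cond = df['smoking_status'].isna()
df.loc[cond, 'smoking_status'] = np.where(df.loc[cond, 'age'] < 15,
                                          'never smoked', 'occasional smoker')
# row 0 -> 'never smoked', row 1 -> 'occasional smoker', row 2 unchanged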
How would I delete all rows from a dataframe that come after a certain fulfilled condition? As an example I have the following dataframe:
import pandas as pd
xEnd=1
yEnd=2
df = pd.DataFrame({'x':[1,1,1,2,2,2], 'y':[1,2,3,3,4,3], 'id':[0,1,2,3,4,5]})
How would I get a dataframe that deletes the last 4 rows and keeps the upper 2, since in row 2 the condition x == xEnd and y == yEnd is fulfilled?
EDIT: I should have mentioned that the dataframe is not necessarily ascending. It could also be descending and I would still like to get the upper rows.
To slice your dataframe until the first time a condition across 2 series is satisfied, first calculate the required index and then slice via iloc.
You can calculate the index via set_index, isin and np.ndarray.argmax:
idx = df.set_index('id').isin({'x': [xEnd], 'y': [yEnd]}).all(1).values.argmax()
res = df.iloc[:idx+1]
print(res)
x y id
0 1 1 0
1 1 2 1
If you need better performance, see Efficiently return the index of the first value satisfying condition in array.
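Equivalently, a minimal sketch that builds the boolean mask directly, without set_index (same result on this data):

# position of the first row where both conditions hold
mask = (df['x'] == xEnd) & (df['y'] == yEnd)
idx = mask.values.argmax()  # argmax of a boolean array is the first True
res = df.iloc[:idx + 1]
# caveat: if no row matches, argmax returns 0, so check mask.any() first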
Not 100% sure I understand correctly, but you can filter your dataframe like this:
df[(df.x <= xEnd) & (df.y <= yEnd)]
this yields the dataframe:
id x y
0 0 1 1
1 1 1 2
If x and y are not strictly increasing and you want what's above the first row that satisfies the condition:
df[df.index <= df[(df.x == xEnd) & (df.y == yEnd)].index[0]]
df = df.iloc[0:2, :]
Select just the first two rows and keep all columns, putting the result in a new dataframe.
Or you can reuse the same variable name, as done here.
There is a dataset with one of the columns containing some missing values. I want to generate a new column: if the cell of the former column is missing, assign the new column 1, else 0.
I tried
df[newcolumn] = map(lambda x: 1 if x is None else 0, df[formercolumn])
but it didn't work.
While
df[newcolumn] = df[formercolumn].isnull().apply(lambda x: 1 if x is True else 0)
worked well.
Any better solutions to this situation?
Use np.where:
df['newcolumn'] = np.where(df['formercolumn'].isnull(), 1, 0)
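A shorter equivalent (same assumed column names) casts the boolean mask directly to integers:

df['newcolumn'] = df['formercolumn'].isnull().astype(int)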
I have the following using numpy, which is really similar to your solution but slightly shorter/faster (note this assumes a numeric column, since np.isnan raises on strings):
df[newcolumn] = df[formercolumn].apply(lambda x: 1 if np.isnan(x) else 0)
I think, however, that Scott's answer is better/faster.