This question already has answers here:
Create new column based on values from other columns / apply a function of multiple columns, row-wise in Pandas
(8 answers)
Closed 2 years ago.
I am trying to build a column based on another one. The new column should keep the values that meet certain criteria and put 0 where they do not.
For example, a column called bank balance has negative and positive values; the new column, overdraft, should carry the negative value for the corresponding row and 0 where the balance is greater than 0.
Bal   Ovr
 21     0
-34   -34
 45     0
-32   -32
The final result should look like the above.
Assuming your dataframe is called df, you can use np.where and do:
import numpy as np

df['Ovr'] = np.where(df['Bal'] < 0, df['Bal'], 0)
which will create a column called Ovr that holds 0 when Bal is positive and the value of Bal when it is negative.
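A quick check against the sample data from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Bal': [21, -34, 45, -32]})
df['Ovr'] = np.where(df['Bal'] < 0, df['Bal'], 0)
print(df)
#    Bal  Ovr
# 0   21    0
# 1  -34  -34
# 2   45    0
# 3  -32  -32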
df["over"] = df.Bal.apply(lambda x: 0 if x>0 else x)
Additional method to enrich your coding skills. However, it isn't needed for such easy tasks.
This question already has answers here:
Ranking in python dataframe
(1 answer)
Factorize a column of strings in pandas
(1 answer)
Closed 3 months ago.
Here is a dataframe example:
ColA  ColB  ColC
Low     10  Tg
Mid     20  asd
High    30  mnr
If you want to work on it, here is a copy-paste version:
import pandas as pd

df = pd.DataFrame({
    'ColA': ['Low', 'Mid', 'High'],
    'ColB': [10, 20, 30],
    'ColC': ['Tg', 'asd', 'mnr']
})
What I want to do is create a function that returns a continuous value (e.g. 1-2-3) depending on the value distribution in ColB.
In the example above, ColA:Low has 10 in ColB, and ColA:Mid has 20.
def getlinear(x):
    return 0 if x == 'Low' else 1 if x == 'Mid' else 2
This function solves the problem and returns continuous values, but then I would need to create another function to apply to ColC. I want one function for all columns.
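One generic approach, following the linked "Factorize a column of strings in pandas" duplicate, is pd.factorize, which maps each distinct value to a consecutive integer in order of first appearance, so a single function covers ColA and ColC alike. A minimal sketch (note: if the rows were not already ordered by ColB, you would sort by ColB first so the codes follow that ordering):

import pandas as pd

df = pd.DataFrame({
    'ColA': ['Low', 'Mid', 'High'],
    'ColB': [10, 20, 30],
    'ColC': ['Tg', 'asd', 'mnr']
})

def getlinear(column):
    # pd.factorize returns (codes, uniques); we only need the codes
    codes, _ = pd.factorize(column)
    return codes

df['ColA_code'] = getlinear(df['ColA'])  # [0, 1, 2]
df['ColC_code'] = getlinear(df['ColC'])  # [0, 1, 2]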
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 6 months ago.
I have a df that is 2,000 lines long; it is taking about 5 minutes to map the 2,000 IDs to their corresponding names, which is way too long.
I'm trying to figure out a way to reduce the mapping time. One option I want to try is mapping the IDs as I build the dictionary, rather than storing more than one entry at a time in it.
Here is the process I'm using:
df_IDs =

studentID  grade  age
      123   12th   18
      432   11th   17
      421   11th   16
And I want to replace the 'studentID' column with the student names that I can get from a mapping file (student_directory_df).
I created functions that will make a dictionary and map the IDs to the names:
import numpy as np

dicts = {}

def search_tool_student(df, column, accession):
    index = np.where(df[column].str.contains(accession, na=False))
    if len(index[0]) == 0:
        done = ""
    else:
        done = df.iloc[index]
    return done

def create_dict(x):
    result_df = search_tool_student(student_directory_df, 'studentID', x)
    if len(result_df) == 0:
        print('bad match found: ' + x)
    else:
        student_name = result_df['name'].iloc[0]
        dicts[x] = student_name
    return dicts

def map_ID(df_IDs):
    studentIDs = df_IDs['studentID']
    new_dict = list(map(create_dict, studentIDs))[0]
    df_IDs['studentID'] = df_IDs['studentID'].map(new_dict).fillna(df_IDs['studentID'])
    return df_IDs
Desired output:

studentID  grade  age
    sally   12th   18
      joe   11th   17
    sarah   11th   16
Pandas works much better when you think of DataFrames not as grids of cells to index into, but rather as collections of columns (Series) that you can manipulate. The more work you can offload to a single pandas function, the more likely it is to be optimized and performant.
In this case, you want to use Series.map as much as you can. Series.map can take a function, a dictionary-like mapping, or a Series.
Since you already have the student names and ids in a dataframe, I would recommend something like the following:
# create a series to do the mapping
id_to_name_map = student_directory_df.set_index('studentID')['name']
# pass it to map all at once
df_IDs['student_name'] = df_IDs['studentID'].map(id_to_name_map)
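If some IDs have no match in the directory, map leaves NaN there; mirroring the fillna in the question's own code keeps the original ID in that case:

df_IDs['student_name'] = df_IDs['studentID'].map(id_to_name_map).fillna(df_IDs['studentID'])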
Hope that helps!
This question already has answers here:
Pandas how to use pd.cut()
(5 answers)
Closed 6 months ago.
I am using Pandas cut to bin certain values into ranges according to a column. I am using user-defined bins, i.e. the ranges are passed as an array.
df['Range'] = pd.cut(df.TOTAL, bins=[0,100,200,300,400,450,500,600,700,800,900,1000,2000])
However, the values I have range up to 100,000. This restricts the values to 2000 as an upper limit, and I am losing values greater than 2000. I want to keep an interval for values greater than 2000. Is there any way to do this?
Let's add np.inf to the end of your bin list:
import numpy as np

pd.cut(df.TOTAL, bins=[0,100,200,300,400,450,500,600,700,800,900,1000,2000,np.inf])
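A quick, self-contained check with a shorter bin list (the sample TOTAL values are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'TOTAL': [50, 1500, 45000]})
df['Range'] = pd.cut(df.TOTAL, bins=[0, 1000, 2000, np.inf])
# 50 -> (0.0, 1000.0], 1500 -> (1000.0, 2000.0], 45000 -> (2000.0, inf]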
This question already has answers here:
Binning a column with pandas
(4 answers)
Closed 3 years ago.
I have a dataframe of cars with a price column, and I want to create a new column carsrange that holds values like 'high', 'low', etc. according to the car price. For example:
If price is between 0 and 9000, then carsrange should have 'low' for those cars; similarly, if price is between 9000 and 30,000, carsrange should have 'medium', etc. I tried doing it, but my code is replacing one value with the other. Any help please?
I ran a for loop over the price column and used if-else branches to define my column values.
for i in cars_data['price']:
    if i > 0 and i < 9000:
        cars_data['carsrange'] = 'Low'
    elif i >= 9000 and i < 18000:
        cars_data['carsrange'] = 'Medium-Low'
    elif i >= 18000 and i < 27000:
        cars_data['carsrange'] = 'Medium'
    elif i >= 27000 and i < 36000:
        cars_data['carsrange'] = 'High-Medium'
    else:
        cars_data['carsrange'] = 'High'
Now, when I run the unique function for carsrange, it shows only 'High'.
cars_data['carsrange'].unique()
This is the Output:
In[74]:cars_data['carsrange'].unique()
Out[74]: array(['High'], dtype=object)
I believe I have applied the wrong concept here. Any ideas as to what I should do now?
You can use a list:
resultList = []
for i in cars_data['price']:
    if i > 0 and i < 9000:
        resultList.append("Low")
    else:
        resultList.append("HIGH")
    # write the other conditions here as elif branches

cars_data["carsrange"] = resultList
Then find the unique values from cars_data["carsrange"].
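As the linked "Binning a column with pandas" duplicate suggests, pd.cut can also do the whole binning in one vectorized call; a sketch, assuming the thresholds from the loop above:

cars_data['carsrange'] = pd.cut(
    cars_data['price'],
    bins=[0, 9000, 18000, 27000, 36000, float('inf')],
    labels=['Low', 'Medium-Low', 'Medium', 'High-Medium', 'High']
)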
This question already has an answer here:
Replace values in a pandas series via dictionary efficiently
(1 answer)
Closed 4 years ago.
I have a dataframe next_train with weekly data for many players (80,000 players observed through 4 weeks, total of 320,000 observations) and a dictionary players containing a binary variable for some of the players (say 10,000). I want to add this binary variable to the dataframe next_train (if a player is not in the dictionary players, I set the variable equal to zero). This is how I'm doing it:
next_train = pd.read_csv()
# ... calculate dictionary 'players' ...
next_train['variable'] = 0
for player in players:
    next_train.loc[next_train['id_of_player'] == player, 'variable'] = players[player]
However the for loop takes ages to complete, and I don't understand why. It looks like the task is to perform binary search for the value player in my dataframe for 10,000 times (size of the players dictionary), but the execution time is several minutes. Is there any efficient way to do this task?
You should use map instead of slicing; that will be way faster:
next_train['variable'] = next_train.id_of_player.map(players)
As you want 0 in the other rows, you can then use fillna:
next_train.variable.fillna(0, inplace=True)
Moreover, if your dictionary only contains boolean values, you might want to redefine the type of the variable column to take less space. So you end up with this piece of code:
next_train['variable'] = next_train.id_of_player.map(players).fillna(0).astype(int)
Use map and fillna:
next_train['variable'] = next_train['id_of_player'].map(players).fillna(0)
This creates a new column by applying the dictionary on the player ids and then fills all empty values with 0.
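A minimal, self-contained illustration of the pattern (the player IDs and flags below are made up):

import pandas as pd

next_train = pd.DataFrame({'id_of_player': [1, 2, 3, 4]})
players = {2: 1, 4: 0}  # hypothetical binary flags for some players
next_train['variable'] = next_train['id_of_player'].map(players).fillna(0).astype(int)
print(next_train['variable'].tolist())  # [0, 1, 0, 0]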