Sample df:
Student Marks
Avery 70
Joe 80
John 75
Jordan 90
I want to use a function like the one below to return marks when a student name is passed.
def get_marks(student):
    return *something*
Expected Output: get_marks('Joe') ==> 80
I think the following might work.
def get_marks(student):
    p = df.index[df['Student'] == student].tolist()
    p = p[0]
    return df['Marks'][p]
What I have done is first get the row index for the given Student, and then simply return the Marks at that index.
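The same lookup can be done more directly with a single boolean mask (a sketch, assuming student names are unique; the sample data is inlined):

```python
import pandas as pd

df = pd.DataFrame({"Student": ["Avery", "Joe", "John", "Jordan"],
                   "Marks": [70, 80, 75, 90]})

def get_marks(student):
    # boolean-mask the Student column, then take the first matching mark
    return df.loc[df["Student"] == student, "Marks"].iloc[0]

print(get_marks("Joe"))  # 80
```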
Related
I would like to go through every row (entry) in my df and remove every entry that has the value "" (which, yes, is an empty string).
So if my data set is:
Name Gender Age
Jack 5
Anna F 6
Carl M 7
Jake M 7
Therefore Jack would be removed from the dataset.
On another note, I would also like to remove entries that have the values "Unspecified" and "Undetermined" as well.
Eg:
Name Gender Age Address
Jack 5 *address*
Anna F 6 *address*
Carl M 7 Undetermined
Jake M 7 Unspecified
Now,
Jack will be removed due to empty field.
Carl will be removed due to the value Undetermined present in a column.
Jake will be removed due to the value Unspecified present in a column.
For now, this has been my approach but I keep getting a TypeError.
list = []
for i in df.columns:
    if df[i] == "":
        # every time there is an empty string, add 1 to list
        list.append(1)
# count list to see how many entries there are with empty string
len(list)
Please help me with this. I would prefer a for loop being used due to there being about 22 columns and 9000+ rows in my actual dataset.
Note - I do understand that there are other questions like this; it's just that none of them apply to my situation, since most of them only work for a few columns and I do not wish to hardcode all 22 columns.
Edit - Thank you for all your feedbacks, you all have been incredibly helpful.
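For reference on the error in the loop above: `df[i] == ""` produces a whole boolean Series, which Python cannot evaluate directly in an `if`. The empty cells can be counted without any loop at all (a sketch on a small stand-in frame):

```python
import pandas as pd

# small stand-in frame; the real one has ~22 columns and 9000+ rows
df = pd.DataFrame({"Name": ["Jack", "Anna"],
                   "Gender": ["", "F"],
                   "Age": [5, 6]})

# (df == "") is a boolean frame; summing twice counts the empty cells
n_empty = int((df == "").sum().sum())
print(n_empty)  # 1
```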
To delete a row based on a condition use the following:
df = df.drop(df[condition].index)
For example:
df = df.drop(df[df.Age == 5].index)  # drops the rows where Age is 5
I've come across a post regarding the same from 2017; it should help you understand it more clearly.
Regarding question 2, here's how to remove rows with the specified values in a given column:
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
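A runnable sketch of the drop pattern from above, using a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jack", "Anna"], "Age": [5, 6]})

# df[condition].index gives the labels of the matching rows; drop removes them
df = df.drop(df[df.Age == 5].index)
print(df)
```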
Let's assume we have a Pandas DataFrame object df.
To remove every row that matches your conditions, simply do:
df = df[~((df.Gender == "") | (df.Age == "") | df.Address.isin(["", "Undetermined", "Unspecified"]))]
(Note that element-wise conditions on Series need | rather than or, and isin rather than in.)
If the unspecified fields are NaN, you can also do:
df = df.dropna(how="any", axis=0)
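If the blanks are literal empty strings rather than NaN, they can be converted first so the same dropna applies (a sketch; the "addr" values are placeholders):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Jack", "Anna", "Carl", "Jake"],
                   "Gender": ["", "F", "M", "M"],
                   "Age": [5, 6, 7, 7],
                   "Address": ["addr", "addr", "Undetermined", "Unspecified"]})

# map the unwanted markers to NaN, then drop any row containing one
cleaned = df.replace(["", "Undetermined", "Unspecified"], np.nan).dropna(how="any", axis=0)
print(cleaned)
```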
The answers from #ThatCSFresher or #Bence will help you remove rows based on a single column, which is great!
However, your query has multiple conditions that need to be checked across multiple columns at once. An apply-lambda can do the job; try the following code:
df = pd.DataFrame({"Name": ["Jack", "Anna", "Carl", "Jake"],
                   "Gender": ["", "F", "M", "M"],
                   "Age": [5, 6, 7, 7],
                   "Address": ["address", "address", "Undetermined", "Unspecified"]})

df["Noise_Tag"] = df.apply(lambda x: "Noise" if ("" in list(x)) or ("Undetermined" in list(x)) or ("Unspecified" in list(x)) else "No Noise", axis=1)
df1 = df[df["Noise_Tag"] == "No Noise"]
del df1["Noise_Tag"]
# Output of df;
Name Gender Age Address Noise_Tag
0 Jack 5 address Noise
1 Anna F 6 address No Noise
2 Carl M 7 Undetermined Noise
3 Jake M 7 Unspecified Noise
# Output of df1;
Name Gender Age Address
1 Anna F 6 address
Well, OP actually wants to delete any row containing an "empty" string.
df = df[~(df=="").any(axis=1)] # deletes all rows that have empty string in any column.
If you want to delete specifically for address column, then you can just delete using
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Or, if any column may contain Undetermined or Unspecified, adapt the first solution in my post by replacing the empty string with those values:
df = df[~((df=="Undetermined") | (df=="Unspecified")).any(axis=1)]
You can build masks and then filter the df according to it:
m1 = df.eq('').any(axis=1)
# m1 is True if any cell in a row has an empty string
m2 = df['Address'].isin(['Undetermined', 'Unspecified'])
# m2 is True if a row has one of the values in the list in column 'Address'
out = df[~m1 & ~m2] # invert both conditions and get the desired output
print(out)
Output:
Name Gender Age Address
1 Anna F 6 *address*
Used Input:
df = pd.DataFrame({'Name': ['Jack', 'Anna', 'Carl', 'Jake'],
                   'Gender': ['', 'F', 'M', 'M'],
                   'Age': [5, 6, 7, 7],
                   'Address': ['*address*', '*address*', 'Undetermined', 'Unspecified']})
Using a lambda function
Code:
df[df.apply(lambda x: False if (x.Address in ['Undetermined', 'Unspecified'] or '' in list(x)) else True, axis=1)]
Output:
Name Gender Age Address
1 Anna F 6 *address*
I have a dataframe
>temp
Age Rank PhoneNumber State City
10 1 99-22344-1 Ga abc
15 12 No Ma xyz
For the PhoneNumber column, I want to strip characters like '-' from the full phone numbers, and if the value is 'No' or any other non-numeric word, I want it to become blank. How can I do this?
My attempt can handle special characters but not words like 'No':
temp['PhoneNumber'] = temp['PhoneNumber'].str.replace(r'[^\d]+', '', regex=True)
Desired Output df -
>temp
Age Rank PhoneNumber State City
10 1 99223441 Ga abc
15 12 Ma xyz
This does the job.
import pandas as pd
import re

data = [
    [10, 1, '99-223344-1', 'GA', 'Abc'],
    [15, 12, 'No', 'MA', 'Xyz']
]
df = pd.DataFrame(data, columns='Age Rank PhoneNumber State City'.split())
print(df)

def valphone(p):
    # keep the value only if it consists solely of digits and dashes
    if re.match(r'[\d-]+$', p):
        return p
    return ""

df['PhoneNumber'] = df['PhoneNumber'].apply(valphone)
print(df)
Output:
   Age  Rank  PhoneNumber State City
0   10     1  99-223344-1    GA  Abc
1   15    12           No    MA  Xyz
   Age  Rank  PhoneNumber State City
0   10     1  99-223344-1    GA  Abc
1   15    12                 MA  Xyz
One pitfall to be aware of: if the frame is built with columns=['Age Rank PhoneNumber State City'.split()] (note the extra brackets), pandas creates a MultiIndex, and df['PhoneNumber'] then returns a one-column DataFrame instead of a Series. Series.apply passes one value at a time, as above, but DataFrame.apply (with axis=1) passes whole rows, so the function would additionally need p = p['PhoneNumber'] to unwrap the value. If df['PhoneNumber'] returns a DataFrame for you, check how the columns were constructed.
Followup
OK, assuming the presence of a phone-number-validation module such as phonenumbers, as mentioned in the comments, this becomes:
import phonenumbers
...
def valphone(p):
    try:
        # parse() needs a region hint for national-format numbers; "US" is assumed here
        n = phonenumbers.parse(p, "US")
    except phonenumbers.NumberParseException:
        return ''
    return p if phonenumbers.is_possible_number(n) else ''
...
temp['PhoneNumber'] = temp['PhoneNumber'].apply(str).str.findall(r'\d').str.join('')
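A quick check of that one-liner against the sample column (a sketch; findall keeps only the digit characters, so non-numeric words collapse to an empty string):

```python
import pandas as pd

temp = pd.DataFrame({"PhoneNumber": ["99-22344-1", "No"]})

# extract every digit per cell and re-join them into one string
temp["PhoneNumber"] = temp["PhoneNumber"].apply(str).str.findall(r"\d").str.join("")
print(temp["PhoneNumber"].tolist())  # ['99223441', '']
```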
Name Amount
Rajesh 50
Mahesh 20
Jon 60
Jack 85
This data is in my Excel file.
Name is constant and Amount is variable (it changes monthly).
I want the total amount of Rajesh + Jon = 110.
I have written the following program, but it shows the following error:
name 'Rajesh' is not defined
My code is as below:
import pandas as pd
top=pd.read_excel(r'E:\Python\ipc_python.xlsx',header=0,usecols="A:B",sheet_name="add")
Amount1=(top.head(4))
Series_topsheet=top['Amount'].to_list()
Legend=top['Name'].to_list()
Legend=Amount1
Total_amount_Rajesh_Jon = Rajesh + Jon
print(Total_amount_Rajesh_Jon)
You are trying to refer to variables that do not exist.
"Rajesh" and "Jon" are undefined names; they are values in the Name column of your dataframe, not variables (and not columns either).
Try this:
Total_amount_Rajesh_Jon = top.loc[top["Name"] == "Rajesh", "Amount"].iloc[0] + top.loc[top["Name"] == "Jon", "Amount"].iloc[0]
It may be that you also need to convert the type to float:
Total_amount_Rajesh_Jon = float(top.loc[top["Name"] == "Rajesh", "Amount"].iloc[0]) + float(top.loc[top["Name"] == "Jon", "Amount"].iloc[0])
You didn't define variables called Rajesh and Jon.
First step: take the Rajesh row.
Second step: take the Jon row.
rajesh = top.loc[top.Name == "Rajesh"]
jon = top.loc[top.Name == "Jon"]
Then sum the Amount values. Use .iloc[0] to extract scalars; adding the two one-row Series directly would misalign their indices and produce NaN:
Total_amount_Rajesh_Jon = rajesh.Amount.iloc[0] + jon.Amount.iloc[0]
print(Total_amount_Rajesh_Jon)
Have a good day.
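The two row selections above can also be collapsed into a single isin lookup (a sketch, with the sample data inlined instead of read from Excel):

```python
import pandas as pd

top = pd.DataFrame({"Name": ["Rajesh", "Mahesh", "Jon", "Jack"],
                    "Amount": [50, 20, 60, 85]})

# select the rows whose Name is in the wanted set and sum their Amounts
total = top.loc[top["Name"].isin(["Rajesh", "Jon"]), "Amount"].sum()
print(total)  # 110
```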
I am trying to perform string matching between two pandas dataframes.
df_1:
ID Text Text_ID
1 Apple 53
2 Banana 84
3 Torent File 77
df_2:
ID File_name
22 apple_mask.txt
23 melon_banana.txt
24 Torrent.pdf
25 Abc.ppt
Objective: I want to populate the Text_ID against File_name in df_2 if the string in df_1['Text'] matches df_2['File_name']. If no match is found, populate df_2['Text_ID'] as -1. The resulting df_2 looks like:
ID File_name Text_ID
22 apple_mask.txt 53
23 melon_banana.txt 84
24 Torrent.pdf 77
25 Abc.ppt -1
I have tried this SO thread, but it gives a column of File_name-wise fuzz scores rather than the mapping I need.
I am trying a non-fuzzy way. Please see the code snippet below:
text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()
text_id = []
for i, j in zip(text_ls, file_ls):
    if str(j) in str(i):
        t_i = df_1.loc[df_1['Text'] == i, 'Text_ID']
        text_id.append(t_i)
    else:
        t_i = -1
        text_id.append(t_i)
df_2['Text_ID'] = text_id
But I am getting a blank text_id list.
Can anybody provide some clue on this? I am OK to use fuzzywuzzy as well.
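One likely reason the zip approach misbehaves, sketched on the sample lists:

```python
text_ls = ["Apple", "Banana", "Torent File"]
file_ls = ["apple_mask.txt", "melon_banana.txt", "Torrent.pdf", "Abc.ppt"]

# zip pairs lists positionally and stops at the shorter one, so each Text is
# only ever compared with the File_name in the same position; "Abc.ppt" is
# never visited at all, and no cross-row comparison happens.
pairs = list(zip(text_ls, file_ls))
print(pairs)       # 3 pairs only
print(len(pairs))  # 3
```

A nested loop (or a merge on an extracted key) is needed to compare every Text against every File_name.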
You can get it with the following code:
df2['Text_ID'] = -1  # set -1 by default for all the file names
for _, file_name in df2.iterrows():
    for _, text in df1.iterrows():
        if text['Text'].lower() in file_name['File_name'].lower():  # compare strings
            # assign the Text_ID from df1 in df2
            df2.loc[df2.File_name == file_name['File_name'], 'Text_ID'] = text['Text_ID']
            break
Keep in mind:
String comparison: as it works now, apple and banana are contained in apple_mask.txt and melon_banana.txt, but torent file is not contained in Torrent.pdf (note the different spelling and the extra word), so that row stays at -1. Consider redefining the string comparison if that row should match.
df.iterrows() returns two values per row, the index and the row's values; here the index is bound to _ since it is not needed for this problem.
result:
df2
File_name Text_ID
0 apple_mask.txt 53
1 melon_banana.txt 84
2 Torrent.pdf -1
3 Abc.ppt -1
You can try the following code:
text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()

df_2['Text_ID'] = -1  # default; also covers file names beyond the zipped pairs
for i, j in zip(text_ls, file_ls):
    if j.lower().find(i.lower()) != -1:
        t_i = df_1.loc[df_1['Text'] == i, 'Text_ID'].values[0]
        df_2.loc[df_2['File_name'] == j, 'Text_ID'] = t_i
Note the .values[0]: assigning the raw one-element Series would misalign its index with df_2's and write NaN instead of the ID.
Let's say I have a CSV file which reads
Student_Name Grade
Mary 75
John 65
Stella 90
I'd like to store Stella's grade as a variable.
My current code looks like:
import pandas as pd
student_grades = pd.read_csv('.../Term2grades.csv')
x = student_grades.loc[student_grades['Student_Name'] == "Stella", ['Grade']]
print(x)
The output of this code is:
Grade
2 90
However, I only want to get 90 so that I can use it later (e.g. if x > 85).
Thanks for the help.
Access the underlying numpy array and take its first element (assuming there is a single match):
student_grades.loc[student_grades['Student_Name'] == "Stella", 'Grade'].values[0]
Out: 90
You can also use iat or iloc on the returning Series:
student_grades.loc[student_grades['Student_Name'] == "Stella", 'Grade'].iloc[0]
Out: 90
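Series.squeeze is another option that collapses a one-element selection to a plain scalar (a sketch, with the CSV contents inlined):

```python
import pandas as pd

student_grades = pd.DataFrame({"Student_Name": ["Mary", "John", "Stella"],
                               "Grade": [75, 65, 90]})

# squeeze() turns a one-element Series into a scalar
x = student_grades.loc[student_grades["Student_Name"] == "Stella", "Grade"].squeeze()
print(x)  # 90
if x > 85:
    print("Distinction")
```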