Function to get a row value of a dataframe using a string index - python

Sample df:
Student Marks
Avery 70
Joe 80
John 75
Jordan 90
I want to use a function like the one below to return the marks when a student's name is passed.
def get_marks(student):
    return *something*
Expected output: get_marks('Joe') ==> 80

I think the following might work.
def get_marks(student):
    p = df.index[df['Student'] == student].tolist()
    p = p[0]
    return df['Marks'][p]
What I have done is first get the index of the row for the given student, and then simply return the Marks at that same index.
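That approach works; a slightly more direct sketch (assuming df holds the sample data) does the lookup with boolean indexing in a single step:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"Student": ["Avery", "Joe", "John", "Jordan"],
                   "Marks": [70, 80, 75, 90]})

def get_marks(student):
    # Select the matching row's Marks and take the first hit
    return df.loc[df["Student"] == student, "Marks"].iloc[0]

print(get_marks("Joe"))  # 80
```

Note that .iloc[0] raises an IndexError for an unknown student, which may be preferable to silently returning nothing.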

Pandas remove every entry with a specific value

I would like to go through every row (entry) in my df and remove every entry that has the value of " " (which yes is an empty string).
So if my data set is:
Name Gender Age
Jack 5
Anna F 6
Carl M 7
Jake M 7
Therefore Jack would be removed from the dataset.
On another note, I would also like to remove entries that have the values "Unspecified" and "Undetermined" as well.
Eg:
Name Gender Age Address
Jack 5 *address*
Anna F 6 *address*
Carl M 7 Undetermined
Jake M 7 Unspecified
Now,
Jack will be removed due to empty field.
Carl will be removed due to the value Undetermined present in a column.
Jake will be removed due to the value Unspecified present in a column.
For now, this has been my approach, but I keep getting a TypeError.
list = []
for i in df.columns:
    if df[i] == "":
        # every time there is an empty string, add 1 to list
        list.append(1)
# count list to see how many entries there are with empty string
len(list)
Please help me with this. I would prefer a for loop being used due to there being about 22 columns and 9000+ rows in my actual dataset.
Note - I do understand that there are other questions like this; it's just that none of them apply to my situation, since most of them only work for a few columns and I do not wish to hardcode all 22 columns.
Edit - Thank you for all your feedbacks, you all have been incredibly helpful.
To delete a row based on a condition use the following:
df = df.drop(df[condition].index)
For example:
df = df.drop(df[df.Age == 5].index) will drop the rows where Age is 5.
I've come across a post about the same problem dating back to 2017; it should help you understand it more clearly.
Regarding question 2, here's how to remove rows with the specified values in a given column:
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
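Put together, a minimal runnable sketch of both steps on the sample data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jack", "Anna", "Carl", "Jake"],
                   "Gender": ["", "F", "M", "M"],
                   "Age": [5, 6, 7, 7],
                   "Address": ["*address*", "*address*", "Undetermined", "Unspecified"]})

# Drop rows matching a condition via their index
df = df.drop(df[df["Gender"] == ""].index)

# Drop rows whose Address holds one of the unwanted values
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]

print(df)  # only Anna's row remains
```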
Let's assume we have a Pandas DataFrame object df.
To remove every row matching your conditions, simply do:
df = df[~((df.Gender == "") | (df.Age == "") | df.Address.isin(["", "Undetermined", "Unspecified"]))]
If the unspecified fields are NaN, you can also do:
df = df.dropna(how="any", axis = 0)
The answers from #ThatCSFresher and #Bence will help you remove rows based on a single column... which is great!
However, your query has multiple conditions that need to be checked across multiple columns at once. An apply-lambda can do the job; try the following code:
df = pd.DataFrame({"Name": ["Jack", "Anna", "Carl", "Jake"],
                   "Gender": ["", "F", "M", "M"],
                   "Age": [5, 6, 7, 7],
                   "Address": ["address", "address", "Undetermined", "Unspecified"]})
df["Noise_Tag"] = df.apply(lambda x: "Noise" if ("" in list(x)) or ("Undetermined" in list(x)) or ("Unspecified" in list(x)) else "No Noise", axis=1)
df1 = df[df["Noise_Tag"] == "No Noise"]
del df1["Noise_Tag"]
# Output of df;
Name Gender Age Address Noise_Tag
0 Jack 5 address Noise
1 Anna F 6 address No Noise
2 Carl M 7 Undetermined Noise
3 Jake M 7 Unspecified Noise
# Output of df1;
Name Gender Age Address
1 Anna F 6 address
Well, OP actually wants to delete any row with an "empty" string.
df = df[~(df=="").any(axis=1)] # deletes all rows that have empty string in any column.
If you want to delete specifically for address column, then you can just delete using
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Or, for any column with Undetermined or Unspecified, try something similar to the first solution in my post, just replacing the empty string with Undetermined or Unspecified.
df = df[~((df=="Undetermined") | (df=="Unspecified")).any(axis=1)]
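The two one-liners can also be combined; a sketch on the sample data, with all three unwanted values folded into a single mask over every column:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jack", "Anna", "Carl", "Jake"],
                   "Gender": ["", "F", "M", "M"],
                   "Age": [5, 6, 7, 7],
                   "Address": ["*address*", "*address*", "Undetermined", "Unspecified"]})

# True for rows containing "", "Undetermined" or "Unspecified" in any column
bad = ((df == "") | (df == "Undetermined") | (df == "Unspecified")).any(axis=1)
clean = df[~bad]
print(clean)  # only Anna's row survives
```

Comparing the integer Age column against a string simply yields False everywhere, so mixed-type columns are safe here.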
You can build masks and then filter the df according to it:
m1 = df.eq('').any(axis=1)
# m1 is True if any cell in a row has an empty string
m2 = df['Address'].isin(['Undetermined', 'Unspecified'])
# m2 is True if a row has one of the values in the list in column 'Address'
out = df[~m1 & ~m2] # invert both condition and get the desired output
print(out)
Output:
Name Gender Age Address
1 Anna F 6 *address*
Used Input:
df = pd.DataFrame({'Name': ['Jack', 'Anna', 'Carl', 'Jake'],
'Gender': ['', 'F', 'M', 'M'],
'Age': [5, 6, 7, 7],
'Address': ['*address*', '*address*', 'Undetermined', 'Unspecified']}
)
Using a lambda function:
Code:
df[df.apply(lambda x: False if (x.Address in ['Undetermined', 'Unspecified'] or '' in list(x)) else True, axis=1)]
Output:
Name Gender Age Address
1 Anna F 6 *address*

How do I force a blank for rows in a dataframe that have any str or character apart from numerics?

I have a datframe
>temp
Age Rank PhoneNumber State City
10 1 99-22344-1 Ga abc
15 12 No Ma xyz
For the PhoneNumber column, I want to strip all characters like '-' so that only the digits of full phone numbers remain, and if the value says No or any other word apart from a numeric, I want it to be blank. How can I do this?
My attempt is able to handle special characters, but not words like 'No':
temp['PhoneNumber'] = temp['PhoneNumber'].str.replace(r'[^\d]+', '', regex=True)
Desired Output df -
>temp
Age Rank PhoneNumber State City
10 1 99223441 Ga abc
15 12 Ma xyz
This does the job.
import pandas as pd
import re

data = [
    [10, 1, '99-223344-1', 'GA', 'Abc'],
    [15, 12, 'No', 'MA', 'Xyz'],
]
df = pd.DataFrame(data, columns=['Age Rank PhoneNumber State City'.split()])
print(df)

def valphone(p):
    p = p['PhoneNumber']
    if re.match(r'[0-9-]+$', p):  # digits and dashes only (note '0' must be included)
        return p
    else:
        return ""

print(df['PhoneNumber'])
df['PhoneNumber'] = df['PhoneNumber'].apply(valphone, axis=1)
print(df)
Output:
Age Rank PhoneNumber State City
0 10 1 99-223344-1 GA Abc
1 15 12 No MA Xyz
Age Rank PhoneNumber State City
0 10 1 99-223344-1 GA Abc
1 15 12 MA Xyz
I do have to admit to a bit of frustration with this. I EXPECTED to be able to do
df['PhoneNumber'] = df['PhoneNumber'].apply(valphone)
because df['PhoneNumber'] should return a Series, and the Series.apply function should pass me one value at a time. However, that's not what happens here, and I don't know why. df['PhoneNumber'] returns a DataFrame instead of a Series, so I have to use the column reference inside the function.
Thus, YOU may need to do some experimentation. If df['PhoneNumber'] returns a Series for you, then you don't need the axis=1, and you don't need the p = p['PhoneNumber'] line in the function.
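A likely cause of the DataFrame-instead-of-Series surprise (an assumption worth checking against your pandas version): in the columns= argument above, the .split() list is wrapped in another list, and pandas interprets a list of lists as a MultiIndex, whose single-label selection does not behave like a flat-column lookup. A sketch of the difference:

```python
import pandas as pd

data = [[10, 1, '99-223344-1', 'GA', 'Abc'],
        [15, 12, 'No', 'MA', 'Xyz']]

# Extra brackets: columns is a list containing a list, which pandas
# builds into a one-level MultiIndex
df_multi = pd.DataFrame(data, columns=['Age Rank PhoneNumber State City'.split()])
print(type(df_multi.columns))  # MultiIndex

# A plain list of names gives flat columns, and df['PhoneNumber'] is a Series
df_flat = pd.DataFrame(data, columns='Age Rank PhoneNumber State City'.split())
print(type(df_flat['PhoneNumber']))  # Series
```

With the flat construction, Series.apply passes one value at a time as expected, and neither axis=1 nor the p = p['PhoneNumber'] line is needed.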
Followup
OK, assuming the presence of a "phone number validation" module, as is mentioned in the comments, this becomes:
import phonenumbers
...
def valphone(p):
    p = p['PhoneNumber']  # May not be required
    n = phonenumbers.parse(p)
    if phonenumbers.is_possible_number(n):
        return p
    else:
        return ''
...
temp['PhoneNumber'] = temp['PhoneNumber'].apply(str).str.findall(r'\d').str.join('')
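This one-liner handles both requirements at once: findall keeps only the digit characters, and a value like "No" that contains no digits naturally collapses to an empty string. A sketch on the sample frame:

```python
import pandas as pd

temp = pd.DataFrame({"Age": [10, 15], "Rank": [1, 12],
                     "PhoneNumber": ["99-22344-1", "No"],
                     "State": ["Ga", "Ma"], "City": ["abc", "xyz"]})

# Keep only the digits; entries with no digits (e.g. "No") become ""
temp['PhoneNumber'] = temp['PhoneNumber'].apply(str).str.findall(r'\d').str.join('')
print(temp['PhoneNumber'].tolist())  # ['99223441', '']
```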

Python pandas load Excel file assign column values (variable) to another column (string)

Name Amount
Rajesh 50
Mahesh 20
Jon 60
Jack 85
This data is in my Excel file.
Name is constant and Amount is variable (it changes monthly).
I want the total amount of Rajesh + Jon = 110.
I have written the following program; it shows this error:
name 'Rajesh' is not defined
My code is below:
import pandas as pd
top = pd.read_excel(r'E:\Python\ipc_python.xlsx', header=0, usecols="A:B", sheet_name="add")
Amount1 = top.head(4)
Series_topsheet = top['Amount'].to_list()
Legend = top['Name'].to_list()
Legend = Amount1
Total_amount_Rajesh_Jon = Rajesh + Jon
print(Total_amount_Rajesh_Jon)
You are trying to refer to variables that do not exist.
"Rajesh" and "Jon" are undefined names: they are values in the Name column of your dataframe, not Python variables (and not columns either).
Select the matching rows instead:
Total_amount_Rajesh_Jon = top.loc[top["Name"] == "Rajesh", "Amount"].iloc[0] + top.loc[top["Name"] == "Jon", "Amount"].iloc[0]
It may be that you need to convert the type to float:
Total_amount_Rajesh_Jon = float(top.loc[top["Name"] == "Rajesh", "Amount"].iloc[0]) + float(top.loc[top["Name"] == "Jon", "Amount"].iloc[0])
You didn't define variables called Rajesh and Jon.
First step: take the Rajesh row.
Second step: take the Jon row.
rajesh = top.loc[top.Name == "Rajesh"]
jon = top.loc[top.Name == "Jon"]
Then sum the Amount attributes. Use .sum() to reduce each one-row selection to a scalar; adding the two raw Series would align them on their (different) indices and produce NaN:
Total_amount_Rajesh_Jon = rajesh.Amount.sum() + jon.Amount.sum()
print(Total_amount_Rajesh_Jon)
Have a good day.
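Both answers boil down to selecting rows by their Name value; a vectorized sketch (a stand-in DataFrame replaces the read_excel call) filters both names at once and sums the Amount column:

```python
import pandas as pd

# Stand-in for the Excel sheet; pd.read_excel would yield the same shape
top = pd.DataFrame({"Name": ["Rajesh", "Mahesh", "Jon", "Jack"],
                    "Amount": [50, 20, 60, 85]})

total = top.loc[top["Name"].isin(["Rajesh", "Jon"]), "Amount"].sum()
print(total)  # 110
```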

Python: Merging two columns of two different pandas dataframe using string matching

I am trying to perform string matching between two pandas dataframe.
df_1:
ID Text Text_ID
1 Apple 53
2 Banana 84
3 Torent File 77
df_2:
ID File_name
22 apple_mask.txt
23 melon_banana.txt
24 Torrent.pdf
25 Abc.ppt
Objective: I want to populate Text_ID against File_name in df_2 if the string in df_1['Text'] matches df_2['File_name']. If no match is found, populate df_2['Text_ID'] as -1. So the resultant df looks like:
ID File_name Text_ID
22 apple_mask.txt 53
23 melon_banana.txt 84
24 Torrent.pdf 77
25 Abc.ppt -1
I have tried this SO thread, but it gives a column listing the fuzz score for each File_name.
I am trying out a non fuzzy way. Please see below the code snippets:
text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()
text_id = []
for i, j in zip(text_ls, file_ls):
    if str(j) in str(i):
        t_i = df_1.loc[df_1['Text'] == i, 'Text_ID']
        text_id.append(t_i)
    else:
        t_i = -1
        text_id.append(t_i)
df_2['Text_ID'] = text_id
But I am getting a blank text_id list.
Can anybody provide some clue on this? I am OK to use fuzzywuzzy as well.
You can get it with the following code:
df2['Text_ID'] = -1  # set -1 by default for all the file names
for _, file_name in df2.iterrows():
    for _, text in df1.iterrows():
        if text['Text'].lower() in file_name['File_name'].lower():  # compare strings
            df2.loc[df2.File_name == file_name['File_name'], 'Text_ID'] = text['Text_ID']  # assign the Text_ID from df1 in df2
            break
Keep in mind:
String comparison: As it is now working, apple and banana are contained in apple_mask.txt and melon_banana.txt, but torrent file is not in torrent.pdf. Consider redefining the string comparison.
df.iterrows() returns two values, the index of the row and the values of the row; in this case the index is replaced by _ since it is not needed to solve this problem.
result:
df2
File_name Text_ID
0 apple_mask.txt 53
1 melon_banana.txt 84
2 Torrent.pdf -1
3 Abc.ppt -1
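The nested-iterrows idea can also be phrased as a helper applied to each file name; a sketch reproducing the same result (including the Torrent.pdf miss noted above, since "Torent File" is not a substring of "Torrent.pdf"):

```python
import pandas as pd

df_1 = pd.DataFrame({"ID": [1, 2, 3],
                     "Text": ["Apple", "Banana", "Torent File"],
                     "Text_ID": [53, 84, 77]})
df_2 = pd.DataFrame({"ID": [22, 23, 24, 25],
                     "File_name": ["apple_mask.txt", "melon_banana.txt",
                                   "Torrent.pdf", "Abc.ppt"]})

def match_text_id(file_name):
    # Return the Text_ID of the first Text contained in the file name, else -1
    for _, row in df_1.iterrows():
        if row["Text"].lower() in file_name.lower():
            return row["Text_ID"]
    return -1

df_2["Text_ID"] = df_2["File_name"].apply(match_text_id)
print(df_2)
```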
You can try the following code. Note that zipping the two lists would only pair them up positionally (and, with three texts against four file names, never touch Abc.ppt), so each file name must be checked against every text:
text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()
df_2['Text_ID'] = -1
for j in file_ls:
    for i in text_ls:
        if j.lower().find(i.lower()) != -1:
            df_2.loc[df_2['File_name'] == j, 'Text_ID'] = df_1.loc[df_1['Text'] == i, 'Text_ID'].iloc[0]

boolean indexing to store a column value as a variable in python

Let's say I have a CSV file which reads
Student_Name Grade
Mary 75
John 65
Stella 90
I'd like to store Stella's grade as a variable.
My current code looks like:
import pandas as pd
student_grades = pd.read_csv('.../Term2grades.csv')
x = student_grades.loc[student_grades['Student_Name'] == "Stella", ['Grade']]
print(x)
The output of this code is:
Grade
2 90
However, I only want to get 90 so that I can use it later (if x > 85 etc.)
Thanks for the help.
Access the underlying numpy array and take its first element (assuming you have a single element):
student_grades.loc[student_grades['Student_Name'] == "Stella", 'Grade'].values[0]
Out: 90
You can also use iat or iloc on the returned Series:
student_grades.loc[student_grades['Student_Name'] == "Stella", 'Grade'].iloc[0]
Out: 90
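If the mask is guaranteed to match exactly one row, Series.item() is another option: it reduces the one-element Series to a plain scalar and raises if the match is not unique. A sketch with the sample data standing in for the CSV:

```python
import pandas as pd

# Stand-in for pd.read_csv('.../Term2grades.csv')
student_grades = pd.DataFrame({"Student_Name": ["Mary", "John", "Stella"],
                               "Grade": [75, 65, 90]})

# .item() turns the single-element selection into a scalar
x = student_grades.loc[student_grades["Student_Name"] == "Stella", "Grade"].item()
print(x)  # 90
if x > 85:
    print("Distinction")
```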
