I have this CSV file:
Names Credit
0 James 21
1 John 30
2 Lucas 20
3 William 11
What I want to do using Pandas is: if I enter any name, like John, I want to store his Credit in a variable to do some math with it.
I'm trying this:
import pandas as pd
df = pd.read_csv('file.csv')
n = input('Enter a name: ')
x = df[df['Names'] == n]['Credit']
print(x)
but it doesn't work for me:
Enter a name: John
1 30
Name: Credit, dtype: int64
(I'm trying to get just the number: 30)
You can .squeeze() that last dimension:
>>> df[df['Names'] == 'John']['Credit'].squeeze()
30
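One caveat, assuming names can repeat: .squeeze() only collapses to a scalar when exactly one row matches; with duplicate names you still get a Series back:
>>> dupes = pd.DataFrame({'Names': ['John', 'John'], 'Credit': [30, 40]})
>>> dupes[dupes['Names'] == 'John']['Credit'].squeeze()
0    30
1    40
Name: Credit, dtype: int64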
What you want is loc:
n = input('Enter a name: ')
x = df.loc[df['Names'] == n, 'Credit']
x is a pandas Series; if you want only one value, you can do it like this:
x = df[df['Names'] == n]['Credit'].tolist()[0]
But if you have two "John"s in your DataFrame you will only get the credit for the first, so make sure your Names column is always unique. If it is, consider doing the following:
df = df.set_index('Names', drop=True)
This will make 'Names' the index of your DataFrame, and then you can get the credit more easily in the following way:
x = df.loc[n, 'Credit']
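Putting it together, a minimal end-to-end sketch (assuming file.csv contains the columns shown in the question) might be:
import pandas as pd

df = pd.read_csv('file.csv')
df = df.set_index('Names', drop=True)  # 'Names' becomes the row index
n = input('Enter a name: ')
x = df.loc[n, 'Credit']  # a plain number, ready for math
print(x * 2)
Note that df.loc[n, 'Credit'] raises a KeyError if the name is missing, so you may want to check n in df.index first.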
I would like to go through every row (entry) in my df and remove every row that contains an empty string ("").
So if my data set is:
Name Gender Age
Jack 5
Anna F 6
Carl M 7
Jake M 7
Therefore Jack would be removed from the dataset.
On another note, I would also like to remove entries that have the value "Unspecified" or "Undetermined".
E.g.:
Name Gender Age Address
Jack 5 *address*
Anna F 6 *address*
Carl M 7 Undetermined
Jake M 7 Unspecified
Now,
Jack will be removed due to empty field.
Carl will be removed due to the value Undetermined present in a column.
Jake will be removed due to the value Unspecified present in a column.
For now, this has been my approach but I keep getting a TypeError.
list = []
for i in df.columns:
    if df[i] == "":
        # every time there is an empty string, add 1 to the list
        list.append(1)
# count the list to see how many entries have an empty string
len(list)
Please help me with this. I would prefer a for loop, since there are about 22 columns and 9000+ rows in my actual dataset.
Note: I do understand that there are other questions like this; it's just that none of them apply to my situation, since most of them only handle a few columns and I do not wish to hardcode all 22 columns.
Edit: Thank you for all your feedback; you have all been incredibly helpful.
To delete a row based on a condition use the following:
df = df.drop(df[condition].index)
For example, df = df.drop(df[df['Age'] == 5].index) will drop the rows where Age is 5.
I've come across a post about the same topic dating back to 2017; it should help you understand it more clearly.
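Applied to the first condition here (rows containing an empty string in any column), that pattern might look like this sketch, assuming the blanks really are empty strings:
# drop every row that has an empty string in any column
df = df.drop(df[(df == "").any(axis=1)].index)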
Regarding question 2, here's how to remove rows with the specified values in a given column:
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Let's assume we have a Pandas DataFrame object df.
To remove every row given your conditions, simply do:
df = df[~((df.Gender == "") | (df.Age == "") | df.Address.isin(["", "Undetermined", "Unspecified"]))]
If the unspecified fields are NaN, you can also do:
df = df.dropna(how="any", axis=0)
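If the blanks are empty strings rather than NaN, one option is to convert the placeholders to NaN first and then drop; a sketch:
import numpy as np

# treat "", "Undetermined" and "Unspecified" as missing, then drop those rows
df = df.replace(["", "Undetermined", "Unspecified"], np.nan).dropna(how="any", axis=0)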
The answers from @ThatCSFresher and @Bence will help you remove rows based on a single column, which is great!
However, I think your query has multiple conditions that need to be checked across multiple columns at once. An apply-lambda can do the job; try the following code:
df = pd.DataFrame({"Name":["Jack","Anna","Carl","Jake"],
"Gender":["","F","M","M"],
"Age":[5,6,7,7],
"Address":["address","address","Undetermined","Unspecified"]})
df["Noise_Tag"] = df.apply(lambda x: "Noise" if ("" in list(x)) or ("Undetermined" in list(x)) or ("Unspecified" in list(x)) else "No Noise",axis=1)
df1 = df[df["Noise_Tag"] == "No Noise"]
del df1["Noise_Tag"]
# Output of df;
Name Gender Age Address Noise_Tag
0 Jack 5 address Noise
1 Anna F 6 address No Noise
2 Carl M 7 Undetermined Noise
3 Jake M 7 Unspecified Noise
# Output of df1;
Name Gender Age Address
1 Anna F 6 address
Well, the OP actually wants to delete any row with an empty string.
df = df[~(df=="").any(axis=1)] # deletes all rows that have empty string in any column.
If you want to delete specifically for address column, then you can just delete using
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Or, to drop rows where any column contains Undetermined or Unspecified, do something similar to the first solution in my post, just replacing the empty string with those values:
df = df[~((df=="Undetermined") | (df=="Unspecified")).any(axis=1)]
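Combining all three placeholder values in a single pass can also be written with DataFrame.isin; a sketch:
# True wherever a cell is "", "Undetermined" or "Unspecified"
bad = df.isin(["", "Undetermined", "Unspecified"])
df = df[~bad.any(axis=1)]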
You can build masks and then filter the df according to them:
m1 = df.eq('').any(axis=1)
# m1 is True if any cell in a row has an empty string
m2 = df['Address'].isin(['Undetermined', 'Unspecified'])
# m2 is True if a row has one of the values in the list in column 'Address'
out = df[~m1 & ~m2]  # invert both conditions and get the desired output
print(out)
Output:
Name Gender Age Address
1 Anna F 6 *address*
Used Input:
df = pd.DataFrame({'Name': ['Jack', 'Anna', 'Carl', 'Jake'],
                   'Gender': ['', 'F', 'M', 'M'],
                   'Age': [5, 6, 7, 7],
                   'Address': ['*address*', '*address*', 'Undetermined', 'Unspecified']})
Using a lambda function:
Code:
df[df.apply(lambda x: x.Address not in ['Undetermined', 'Unspecified'] and '' not in list(x), axis=1)]
Output:
Name Gender Age Address
1 Anna F 6 *address*
I want to groupby and see if all members in the group meet a certain condition. Here's a dummy example:
x = ['Mike','Mike','Mike','Bob','Bob','Phil']
y = ['Attended','Attended','Attended','Attended','Not attend','Not attend']
df = pd.DataFrame({'name':x,'attendance':y})
And what I want to do is return a 3x2 dataframe that shows for each name, who was always in attendance. It should look like below:
new_df = pd.DataFrame({'name':['Mike','Bob','Phil'],'all_attended':[True,False,False]})
What's the best way to do this?
Thanks so much.
Let's try
out = (df['attendance'].eq('Attended')
         .groupby(df['name']).all()
         .to_frame('all_attended').reset_index())
print(out)
name all_attended
0 Bob False
1 Mike True
2 Phil False
One way could be:
df.groupby('name')['attendance'].apply(lambda x: (x == 'Attended').all())
name
Bob False
Mike True
Phil False
Name: attendance, dtype: bool
I would stay away from strings for data that does not need to be a string:
z = [s == 'Attended' for s in y]
df = pd.DataFrame({'name': x, 'attended': z})
Now you can check if all the elements for a given group are True:
>>> df.groupby('name')['attended'].all()
name
Bob False
Mike True
Phil False
Name: attended, dtype: bool
If something can only be one of two values, using a string introduces the possibility of errors, because someone might type Atended instead of Attended, for example.
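For example, if the data does arrive as strings, converting at the boundary with an explicit mapping catches such typos early; a sketch against the original df with its string 'attendance' column (the mapping values are assumptions based on the sample data):
mapping = {'Attended': True, 'Not attend': False}
attended = df['attendance'].map(mapping)
# values outside the mapping become NaN, so fail fast instead of silently
assert attended.notna().all(), 'unexpected attendance value'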
Name    Mark
Ben     20
James   50
Jimmy   70
I have a DataFrame which looks something like this. I want to check if the name exists and then print the mark for that specific person.
if len(df[(df['Name'] == "James")]) != 0:
    print(len(df["Mark"]))
Above is my code. Hope to get some advice!
Better to use a Series here, with get and a default argument:
marks = df.set_index('Name')['Mark']
marks.get('James', 'missing')
# 50
marks.get('Nonexistent', 'missing')
# missing
Or without default, get returns None:
marks.get('Nonexistent') # no output
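A minimal sketch combining this with the original goal (check for a name, then use the mark) might be:
marks = df.set_index('Name')['Mark']
mark = marks.get('James')
if mark is not None:
    print(mark)  # 50
else:
    print('name not found')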
You can return the mark of a specified name in your Name column using loc. The below will print the Mark of the name you pass, and will return an empty series if the name does not exist in your Name column:
name_to_retrieve_mark = 'Ben'
df.loc[df.Name.eq(name_to_retrieve_mark),'Mark']
Out[13]:
0 20
Name: Mark, dtype: int64
name_to_retrieve_mark = 'Sophocles'
df.loc[df.Name.eq(name_to_retrieve_mark),'Mark']
Out[15]: Series([], Name: Mark, dtype: int64)
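If you need to branch on whether the name exists, you can test the returned Series for emptiness, e.g.:
result = df.loc[df.Name.eq('James'), 'Mark']
if not result.empty:
    print(result.iloc[0])  # 50
else:
    print('name not found')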
I have two dictionaries:
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
and a table consisting of one single column where bond names are contained:
bond_names=pd.DataFrame({'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']})
I need to replace each name with a string of the following format: EUA21, where the first two letters are the value corresponding to the currency key in the dictionary, the next letter is the value corresponding to the month key, and the last two digits are the year from the name.
I tried to split the name using the following code:
bond_names['Names']=bond_names['Names'].apply(lambda x: x.split('.'))
but I am not sure how to proceed from here to create the string, as I need to look up the currency and the month in the dictionaries at the same time, extract the values, join them, and append the year from the name.
This will give you a list of what you need:
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names = {'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']}
result = []
for names in bond_names['Names']:
    bond = names.split('.')
    result.append(currency[bond[1]] + time[bond[2]] + bond[3])
print(result)
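If you want the converted names back in a DataFrame column rather than a plain list, assigning the list also works, assuming bond_names is the DataFrame from the question:
bond_names['Names'] = result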
You can do that like this:
import pandas as pd
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency = {'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names = pd.DataFrame({'Names': ['Bond.USD.JAN.21', 'Bond.USD.MAR.25', 'Bond.EUR.APR.22', 'Bond.HUF.JUN.21', 'Bond.HUF.JUL.23', 'Bond.GBP.JAN.21']})
# relies on the fixed-width 'Bond.CCY.MON.YY' layout of the names
bond_names['Names2'] = bond_names['Names'].apply(lambda x: currency[x[5:8]] + time[x[9:12]] + x[-2:])
print(bond_names['Names2'])
# 0 USA21
# 1 USC25
# 2 EUD22
# 3 HFF21
# 4 HFH23
# 5 GBA21
# Name: Names2, dtype: object
With extended regex substitution:
In [42]: bond_names['Names'].str.replace(r'^[^.]+\.([^.]+)\.([^.]+)\.(\d+)',
    ...:     lambda m: '{}{}{}'.format(currency.get(m.group(1), m.group(1)),
    ...:                               time.get(m.group(2), m.group(2)),
    ...:                               m.group(3)),
    ...:     regex=True)
Out[42]:
0 USA21
1 USC25
2 EUD22
3 HFF21
4 HFH23
5 GBA21
Name: Names, dtype: object
You can try this:
import pandas as pd
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names=pd.DataFrame({'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']})
bond_names['Names'] = bond_names['Names'].apply(lambda x: x.split('.'))
for idx, bond in enumerate(bond_names['Names']):
    currencyID = currency.get(bond[1])
    monthID = time.get(bond[2])
    yearID = bond[3]
    bond_names.loc[idx, 'Names'] = currencyID + monthID + yearID
Output
Names
0 USA21
1 USC25
2 EUD22
3 HFF21
4 HFH23
5 GBA21
I have this table as an input, and I would like to add the name of the header to its corresponding cells before converting it to a DataFrame.
I am generating association rules after converting the table to a DataFrame, and it is not clear which antecedent/consequent each rule belongs to.
Example for the first column of my desired table:
Age
Age = 45
Age = 30
Age = 45
Age = 80
.
.
and so on for the rest of the columns. What is the best way to access each column and rewrite them? And is there a better solution to reference my values after generating association rules other than adding the name of the header to each cell?
Here is one way to add the column names to all cells:
import pandas as pd

df = pd.DataFrame({'age': [1, 2], 'sex': ['M', 'F']})
df = df.applymap(str)
for c in df.columns:
    df[c] = df[c].apply(lambda s: "{} = {}".format(c, s))
This yields:
age sex
0 age = 1 sex = M
1 age = 2 sex = F
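As an alternative sketch without the loop, the same result can be built column-wise in one pass per column (worth noting that applymap is deprecated in newer pandas in favor of DataFrame.map):
out = df.apply(lambda col: col.name + ' = ' + col.astype(str))
print(out)
This assumes the same df as above and produces the identical frame.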