I am trying to create an if statement using three fields, 'Status', 'Emp_Type' and 'Check'. I want to drop rows that don't fit the conditions, but I keep getting the error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
I tried converting the Series to strings and using both "and" and "&" in my if statement, but nothing worked. Below is the code I tried and what the data looks like:
# Changing fields from Series to string
df.Check.apply(str)
df.Status.apply(str)

# Dropping rows with conditions
if (df['Check'] == 'Check') and (df['Emp_Type'] == 'Contractor') and (df['Status'] == 'T'):
    df.drop()
The Data looks like this:
ID Name Status Emp_Type Check
1234 John Doe A Contractor Ignore
1234 John Doe T Contractor Ignore
1234 John Doe A Employee Ignore
1234 John Doe T Contractor Check
1234 John Doe A Employee Ignore
1234 John Doe T Contractor Check
And what I need is:
ID Name Status Emp_Type Check
1234 John Doe A Contractor Ignore
1234 John Doe T Contractor Ignore
1234 John Doe A Employee Ignore
1234 John Doe A Employee Ignore
Pandas works a bit differently from a standard if statement, at least if you think row- or column-wise. You need to combine the individual comparisons element-wise with boolean operators.
You probably want:
selection = (df['Check'] == 'Check') & (df['Emp_Type'] == 'Contractor') & (df['Status'] == 'T')
df = df.drop(df.index[selection])
# or
df = df[~selection]
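For reference, here is a self-contained sketch that rebuilds the sample data from the question and applies the mask; it keeps exactly the four rows shown in the expected output:
import pandas as pd

# Rebuild the sample data from the question (for illustration only)
df = pd.DataFrame({
    'ID': [1234] * 6,
    'Name': ['John Doe'] * 6,
    'Status': ['A', 'T', 'A', 'T', 'A', 'T'],
    'Emp_Type': ['Contractor', 'Contractor', 'Employee', 'Contractor', 'Employee', 'Contractor'],
    'Check': ['Ignore', 'Ignore', 'Ignore', 'Check', 'Ignore', 'Check'],
})

# Each comparison yields a boolean Series; & combines them element-wise
selection = (df['Check'] == 'Check') & (df['Emp_Type'] == 'Contractor') & (df['Status'] == 'T')
df = df[~selection]  # drops the two matching rows, keeping the other four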
I have a pandas dataframe that looks like this
ID Year
0 509 2023.0
1 216 1998.0
2 193 1957.0
I want to use the value in the ID column to select a row, compare its Year value to the current year, and evaluate whether it matches.
For example, this is the code I have:
if df.loc[df["ID"] == 509]["Year"] == 2023.0:
print("The ID belongs to this year")
But I am currently getting this error:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I think it has to do with the fact that I am comparing a series of values with a single float value but I am having trouble fixing it.
Try this:
id_num = 509
year = 2023
if (df['ID'].eq(id_num) & df['Year'].eq(year)).any():
    print('The ID belongs to this year')
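Alternatively, if each ID occurs at most once in the dataframe, you could pull the matching Year out as a scalar with .item(), one of the options the error message itself suggests. This is only a sketch and will raise an error if the selection does not contain exactly one row:
year_for_id = df.loc[df['ID'] == 509, 'Year'].item()  # scalar float, not a Series
if year_for_id == 2023.0:
    print('The ID belongs to this year')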
This is the code I have written to check whether each female customer's household income is higher than the male customers' average household income.
# Get female customers' incomes
df_Female = df[df['Gender'] == 'Female']
FemaleIncomeArray = df_Female.loc[:, 'Income'].values  # get female income
FemaleIncomeList = FemaleIncomeArray.tolist()  # convert to list

# Get male average household income
df_Male = df[df['Gender'] == 'Male']
MaleAvgIncome = df_Male.groupby('Gender')['Income'].mean()

color_list = []
y_pos = []  # y-axis positions
for i in range(len(FemaleIncomeList)):
    if FemaleIncomeList[i] >= MaleAvgIncome:  # got error from this line
        color_list.append('y')
    else:
        color_list.append('r')
    y_pos.append(i + 1)
However, I got an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Sometimes when you check a condition on a Series or DataFrame, the output is itself a Series (for example one containing a single False) rather than a single boolean.
In this case you must reduce it with any(), all(), item(), etc.
Print your condition to see the Series you are actually testing.
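Here, MaleAvgIncome is a one-element Series (because of the groupby), so FemaleIncomeList[i] >= MaleAvgIncome is also a Series, which the if statement cannot evaluate. A minimal sketch of one way to fix it, taking the male average as a plain scalar instead:
# Mean over the male rows only, as a plain float rather than a one-element Series
MaleAvgIncome = df[df['Gender'] == 'Male']['Income'].mean()

color_list = ['y' if income >= MaleAvgIncome else 'r' for income in FemaleIncomeList]
y_pos = list(range(1, len(FemaleIncomeList) + 1))  # y-axis positions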
I have a dataframe of videos with several columns of tags (strings) as follows:
import pandas as pd
videos = [(1, 'cool video','drama','horror'), (2, 'great video','sports','drama'), (3,'super video','comedy','horror')]
df = pd.DataFrame(data=videos, columns=['video_id', 'title','tag_1','tag_2'])
video_id title tag_1 tag_2
0 1 cool video drama horror
1 2 great video sports drama
2 3 super video comedy horror
Then I have another dataframe of search terms "df_search_terms" (which I could put into a list, for example). I want to see if these search terms occur at least once in one of the columns, and if so increment a counter in the dataframe of search terms (that is to say, OK we found this term once for the video, so += 1). To clarify, I want to know how many times the search term is matched in the dataframe containing +/- 1000 videos, for at least one of the tags.
Obviously I can make a count of matches, but I only want to increment the counter in df_search_terms once for that particular term. Something like this (which doesn't work, but I hope you get the gist):
search_count=df['tag_1'].str.contains('drama').sum()
df_search_terms.loc[(df_search_terms['search_term'] == 'drama'),'matching_videos'] +=1
The df_search_terms would be something like this:
search_terms = [('drama',0), ('horror',0), ('sports',0)]
df_search_terms = pd.DataFrame(data=search_terms, columns=['search_term', 'number_matching_videos'])
search_term number_matching_videos
drama 0
horror 0
sports 0
I imagine the solution lies in some clever use of apply but I'm afraid I can't figure it out.
I've tried using an "if" statement such as below, but I have an error:
if df.loc[(df['name'] == 'drama') | (df['tag_1'] == 'drama') | (df['tag_2'] == 'drama')]:
    df_search_terms.loc[(df_search_terms['search_term'] == 'drama'), 'matching_videos'] += 1
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Try this:
df_search_terms['number_matching_videos'] = (df_search_terms['search_term']
                                             .map(df.set_index('video_id')
                                                    .stack()
                                                    .str.get_dummies()
                                                    .sum()))
Here is another way:
df_search_terms['number_matching_videos'] = (df_search_terms['search_term']
    .map((df.loc[:, df.columns.str.contains('tag')]
            .stack()
            .str.extractall('({})'.format(df_search_terms['search_term'].str.cat(sep='|')))[0]
            .str.get_dummies()
            .sum())))
Use regex to search and count all matches
search_re = '(' + df_search_terms.search_term.str.cat(sep=')|(') + ')'
Combine all tag columns into a single string and search
df_search_terms['number_matching_videos'] = (
    df.filter(regex='tag_*')
      .agg(' '.join, axis=1)
      .str.extractall(search_re)
      .notnull().sum()
)
Output
search_term number_matching_videos
0 drama 2
1 horror 2
2 sports 1
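Note that extractall counts every match, so a video whose tags matched the same term more than once would be counted more than once. Since the question only wants each video counted once per term, an element-wise comparison is another option; a minimal sketch, assuming the tag columns all start with 'tag_':
tag_cols = df.filter(regex='^tag_')
df_search_terms['number_matching_videos'] = df_search_terms['search_term'].map(
    lambda term: tag_cols.eq(term).any(axis=1).sum()  # count videos with at least one matching tag
)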
I would like to know whether it is possible to print one row (specific fields) on the screen, then ask for a boolean value, and then add this value to the corresponding field in a new column.
For example, I have a dataframe
Name Age Job
Alexandra 24 Student
Michael 42 Lawyer
John 53 Data Analyst
...
I would like to print the rows on the screen, checking them one by one.
So I should have:
Alexandra Student
then a prompt that asks whether Alexandra is female. Since Alexandra is female, I should put True (the input value) in a new column called Sex.
Then I move to the next row:
Michael Lawyer
Since Michael is not a female, I should put False in the Sex column.
Same for John.
At the end, my expected output would be:
Name Age Job Sex
Alexandra 24 Student True
Michael 42 Lawyer False
John 53 Data Analyst False
...
You could try this with df.to_records(), which is one of the fastest ways to iterate over rows:
sexs=[True if str(input(f'{row[1]} {row[3]}\nIs female?\n')).lower()=='true' else False for row in df.to_records()]
df['Sex']=sexs
print(df)
Or, to avoid that conditional True if str(input(f'{row[1]} {row[3]}\nIs female?\n')).lower()=='true' else False, since you will only ever input 'True' or 'False', you can try:
import ast
sexs=[ast.literal_eval(str(input(f'{row[1]} {row[3]}\nIs female?\n'))) for row in df.to_records()]
df['Sex']=sexs
print(df)
And if you want to keep the inputs as strings 'True' or 'False', you could try:
sexs=[input(f'{row[1]} {row[3]}\nIs female?\n') for row in df.to_records()]
df['Sex']=sexs
Output of all the above options:
>>>Alexandra Student
>>>Is female?
True
>>>Michael Lawyer
>>>Is female?
False
>>>John Data Analyst
>>>Is female?
False
df
Name Age Job Sex
0 Alexandra 24 Student True
1 Michael 42 Lawyer False
2 John 53 Data Analyst False
This code should work. It iterates through the rows using a for loop.
import numpy as np

df['Sex'] = np.nan
for i in range(len(df)):
    sex = input('Is {} {} a female? '.format(df.iloc[i, 0], df.iloc[i, 2]))
    df.iloc[i, 3] = sex
Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. By iteration, I mean using functions such as iterrows and itertuples that run in native Python.
I'm not quite sure what your mechanism is for determining the gender, but the best way to handle this is a multi-column apply.
df['Sex'] = df.apply(
    lambda row: find_gender(row['Name'], row['Job']),
    axis=1
)
In your find_gender function, you can write your logic based on the name and job (as you have stated in your question). The function has to return a Boolean, which apply then assigns to each row.
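As a purely hypothetical example (the name find_gender and its prompt are illustrative, not taken from the question), such a function could simply ask the user, matching the interactive workflow described above:
def find_gender(name, job):
    # Hypothetical helper: prompt the user and return a Boolean
    answer = input(f'{name} {job}\nIs female?\n')
    return answer.strip().lower() in ('true', 'yes', 'y')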
I have a dataframe ('frame') that has multiple columns. In column 2 ('col2'), I want it to search for a set of characters (' 42 ', spaces included), and then I want it to search for 'ID' in column 1 ('col1'). I then have values in columns 6 and 7 ('col6', 'col7') that I want to assign to a list (let's call it 'vals42'), but only the values from the index where we found ' 42 ' up until the index where we found 'ID'. I then want this to start over and look for the next ' 42 ', and so on.
The other tricky part is that the first 'ID' is actually before the first ' 42 ', so I need it to start at the second 'ID' so I don't have an invalid range.
The other thing, I have a bunch of NaN's that exist in my dataframe.
I am trying to do this by getting the index of the ' 42 ' and the index of the 'ID' and then creating a subset of the pandas dataframe, and then when it goes on to the next ' 42 ' it creates another subset and so on.
In some of my previous code, I already figured out how to get the values from 'col6' and 'col7' into a list. What I am struggling with (I think) is getting the logic to check out for checking for these characters and getting the indexes of them.
for var in frame:
    if (frame['col2'].str.contains(' 42 ')) == True:
        begin = frame.index.get_loc(frame.name)
    elif (frame['col1'].str.contains('ID')) == True:
        end = frame.index.get_loc(frame.name)
        subset = frame[begin:end]
        for column in subset[['col6', 'col7']]:
            ColumnContents = pd.concat([subset['col6'], subset['col7']])
            ColumnContents = pd.to_numeric(ColumnContents, errors='coerce')
            vals42 = ColumnContents[ColumnContents.apply(lambda x: x > 0 and x < 20)]
The pandas dataframe is made by reading in csv Excel files from a folder I have. Below is an example of the general format of the dataframe: The x's symbolize data that isn't important.
col1 col2 . . . col6 col7
0 ID xxx NaN NaN
1 xxx 67812 LT 42 01
2 xxx xxx NaN NaN
3 xxx xxx NaN NaN
4 xxx xxx NaN NaN
.
.
.
17 xxx xxx 0.543 1670
18 xxx xxx 0.321 8954
.
.
.
29 ID xxx NaN NaN
30 xxx 12976 42 01
So in this example, the values I want to assign start at index 17 and go all the way down to index 28. At index 29, though, we get a new 'ID', so it starts over. The number of rows for which values exist in columns 6 and 7 is not necessarily the same each time. Also, the number is not always ' 42 '; it could be something else, but once I get it working for one number I can apply it to others, so I am using ' 42 ' as an example. Ideally, I want the numbers from 'col6' and 'col7' to be assigned to the list if they meet the conditions (x > 0, x < 20).
EDIT: The 'ID' literally is just 'ID'. There is no special string associated with it in each instance, so the 'ID' in row 0 and the one in row 29 are exactly the same. The x'd-out data is not the same as the 'ID' or the '67812 LT 42 01'; each x'd-out cell has its own unique set of characters, so it won't match either of them.
So the thing is, I know the bottom four lines of code work in my other scenario where I am just getting the values from the entire dataframe.
When I ran this I got the error "the truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
I have figured it out if anyone is curious for future reference:
start = True
vals42 = pd.Series()
for index2, rows in frame[1:].iterrows():
    if (' 42 ' in rows['col2']):
        begin = index2
        start = True
    elif (start is True) & ('ID' in rows['col1']):
        end = index2
        subset = frame[begin:end]
        for column in subset[['col6', 'col7']]:
            ColumnContents = pd.concat([subset['col6'], subset['col7']])
            ColumnContents = pd.to_numeric(ColumnContents, errors='coerce')
            subsetTotal = ColumnContents[ColumnContents.apply(lambda x: x > 0 and x < 20)]
            vals42 = pd.concat([vals42, subsetTotal])
        start = False
This answer was posted as an edit to the question Assigning values to a list based on a string search in a pandas dataframe? by the OP JJXS under CC BY-SA 4.0.