I need to add an indicator column to my dataframe that flags users with a promo code (1 if on promo, else 0). I need to look at two columns and see if any promo code exists under either of col_promo_1 or col_promo_2. This is the code I'm using, but it returns NaN values:
df['promo_ind'] = df[['col_promo_1', 'col_promo_2']].apply(lambda x: 1 if x is not None else 0)
However, when I use the code with only one column, for example col_promo_1, the result is accurate. Any thoughts on how I can get this fixed?
Make a new column:
df['promo_ind'] = 0
You can build a mask and use it to set the values in the correct places:
df.loc[df['col_promo_1'].notna() | df['col_promo_2'].notna(), 'promo_ind'] = 1
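For example, with a small hypothetical DataFrame (made-up promo codes, where NaN means no code), this mask-and-set approach gives the expected 0/1 indicator:

```python
import pandas as pd
import numpy as np

# hypothetical example data: NaN means the user has no promo code
df = pd.DataFrame({
    'col_promo_1': ['SAVE10', np.nan, np.nan, 'FREESHIP'],
    'col_promo_2': [np.nan, np.nan, 'VIP20', 'BOGO'],
})

df['promo_ind'] = 0
df.loc[df['col_promo_1'].notna() | df['col_promo_2'].notna(), 'promo_ind'] = 1

print(df['promo_ind'].tolist())  # → [1, 0, 1, 1]
```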
Sticking to your approach, let's assume you have the example DataFrame (df) below with two columns (promo1 and promo2), and the goal is to indicate promo status in a third column if a user is on either promo1 or promo2.
import pandas as pd
df = pd.DataFrame(data={'promo1': [0, 1, 0, 1], 'promo2': [0, 0, 1, 1]})
The line below creates a third column: it checks the two existing columns at every row and calculates the corresponding promo status accordingly. (The issue with the posted code is that "x" takes the columns of the DataFrame one by one, whereas you want to take rows and check them. The fix is to pass axis=1 to the apply() method.)
df['promo_ind'] = df[['promo1', 'promo2']].apply(lambda row: 0 if (row['promo1']==0 and row['promo2']==0) else 1, axis=1)
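As a side note, with 0/1 columns like the example above, the same indicator can be computed without apply() at all, using any() along axis=1 (a sketch of the vectorized alternative):

```python
import pandas as pd

df = pd.DataFrame(data={'promo1': [0, 1, 0, 1], 'promo2': [0, 0, 1, 1]})

# True where either column is nonzero, then cast the booleans to 0/1
df['promo_ind'] = df[['promo1', 'promo2']].any(axis=1).astype(int)

print(df['promo_ind'].tolist())  # → [0, 1, 1, 1]
```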
I'm trying to add each value from one column ('smoking') to another column ('sex') and put the result in a new column called 'something'. The dataset is a DataFrame called 'data'. The values in the columns 'smoking' and 'sex' are int64.
The rows of the column 'smoking' have 1 or 0: 1 means the person smokes and 0 means the person doesn't smoke. The column 'sex' has 0 and 1 too: 0 for female and 1 for male.
for index, row in data.iterrows():
    data.loc[index, 'something'] = row['smoking'] + row['sex']
data
The problem is that the column 'something' contains only the number 2.0; that is, even when 'smoking' is 0 and 'sex' is 1 in a row, the value in 'something' is 2.0.
I don't understand the error.
I'm using Python 3.9.2
The dataset is in this link of kaggle: https://www.kaggle.com/andrewmvd/heart-failure-clinical-data
I see #Vishnudev just posted the solution in a comment, but allow me to explain what is going wrong:
The issue here is that the addition results in a float instead of an int: assigning to a column that doesn't exist yet via .loc first creates that column filled with NaN (a float value), so the whole column ends up with a float dtype. There are two solutions:
With the loop, casting the result to int:
for index, row in data.iterrows():
    data.loc[index, 'something'] = row['smoking'] + row['sex']
data['something'] = data['something'].astype(int)
data
Without the loop (as #Vishnudev suggested):
data['something'] = data['smoking'] + data['sex']
data
You need not iterate over the rows to do that; you could just use:
data['something'] = data['smoking'] + data['sex']
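To illustrate why the vectorized version avoids the float issue, here is a quick sketch (with made-up data standing in for the Kaggle set): the vectorized sum of two int64 columns keeps the int64 dtype, so no cast is needed.

```python
import pandas as pd

# made-up stand-in for the Kaggle dataset
data = pd.DataFrame({'smoking': [1, 0, 1, 0], 'sex': [1, 1, 0, 0]})

# vectorized addition: dtype stays int64, no NaN-filled column is created
data['something'] = data['smoking'] + data['sex']

print(data['something'].tolist())  # → [2, 1, 1, 0]
print(data['something'].dtype)     # → int64
```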
I'm trying to count the number of ships in a column of a dataframe; in this case, the number of '77H's. I can do it for an individual element, but actions on the whole column don't seem to work.
E.g. this works with an individual element in my dataframe:
df = pd.DataFrame({'Route':['Callais','Dover','Portsmouth'],'shipCode':[['77H','77G'],['77G'],['77H','77H']]})
df['shipCode'][2].count('77H')
But when I try to perform the action on every row using either
df['shipCode'].count('77H')
df['shipCode'].str.count('77H')
both attempts fail. Any help on how to code this would be much appreciated.
Thanks
What if you did something like this?
Assuming your initial data...
import pandas as pd
from collections import Counter
df = pd.DataFrame(df)  # where df is the data defined in the question; a no-op if it is already a DataFrame
you can generate a Counter for all of the elements in the lists in each row like this:
df['counts'] = df['shipCode'].apply(lambda x: Counter(x))
output:
Route shipCode counts
0 Callais [77H, 77G] {'77H': 1, '77G': 1}
1 Dover [77G] {'77G': 1}
2 Portsmouth [77H, 77H] {'77H': 2}
or if you want one in particular, i.e. '77H', you can do something like this:
df['counts'] = df['shipCode'].apply(lambda x: Counter(x)['77H'])
output:
Route shipCode counts
0 Callais [77H, 77G] 1
1 Dover [77G] 0
2 Portsmouth [77H, 77H] 2
or even this using the first method (full Counter in each row):
[count['77H'] for count in df['counts']]
output:
[1, 0, 2]
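Since each cell already holds a Python list, a shorter variant of the same idea (no Counter needed) is to call the list's own count method per row, then sum if a grand total is wanted:

```python
import pandas as pd

df = pd.DataFrame({'Route': ['Callais', 'Dover', 'Portsmouth'],
                   'shipCode': [['77H', '77G'], ['77G'], ['77H', '77H']]})

# per-row count of '77H' in each list
per_row = df['shipCode'].apply(lambda codes: codes.count('77H'))

print(per_row.tolist())  # → [1, 0, 2]
print(per_row.sum())     # → 3  (total number of 77Hs)
```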
The data frame has a shipCode column with a list of values in each row.
First, produce a True or False value identifying rows whose shipCode list contains the string '77H'.
> df['shipCode'].map(lambda val: val.count('77H')>0)
Now filter the data frame based on those True/False values obtained in the previous step.
> df[df['shipCode'].map(lambda val: val.count('77H')>0)]
Finally, get a count of all rows in the data frame where the shipCode list contains a value matching '77H', using the Python len function.
> len(df[df['shipCode'].map(lambda val: val.count('77H')>0)])
Another way that makes it easy to remember what's been analyzed is to create a column in the same data frame to store the True/False values, then filter by them. It's really the same as above, but a little prettier in my opinion.
> df['filter_column'] = df['shipCode'].map(lambda val: val.count('77H')>0)
> len(df[df['filter_column']])
Good luck and enjoy working with Python and Pandas to process your data!
I have a DataFrame object from pandas and I wanted to know if there is any way I can access a specific value from a specific column and change it.
from pandas import DataFrame as df
gameboard = df([['#','#',"#"],['#','#',"#"],['#','#',"#"]], columns = [1, 2, 3], index = [1,2,3])
print(gameboard)
For example, I wanted to change the '#' in the second position of the second list.
Or, if gameboard were a 2D list, how could I access gameboard[1][1]'s element?
I think you're looking for the .iloc indexer
(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html)
To access said value, you would call something like:
gameboard.iloc[1, 1] = 6
iloc essentially selects the second row (that's what the [1 is) and then the position of the value within that row (, 1] for the second value in our case). Finally, you assign whatever new value you want it to be.
Your output would be:
1 2 3
1 # # #
2 # 6 #
3 # # #
Edit: using alollz's recommendation.
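Since the example gameboard happens to use the labels 1–3 for both the index and the columns, it's worth noting that .loc selects by label rather than position; for this particular board, .loc[2, 2] refers to the same cell as .iloc[1, 1] (a quick sketch):

```python
import pandas as pd

gameboard = pd.DataFrame([['#', '#', '#'], ['#', '#', '#'], ['#', '#', '#']],
                         columns=[1, 2, 3], index=[1, 2, 3])

# label-based: row labelled 2, column labelled 2
gameboard.loc[2, 2] = 6

print(gameboard.iloc[1, 1])  # → 6 (same cell, selected by position)
```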
I have a dataframe with multiple columns and rows. I want to locate rows that meet certain criteria on a subset of the columns AND, if a row meets those criteria, change the value of a different column in that same row.
I am prototyping with the following dataframe:
df = pd.DataFrame([[1, 2], [4, 5], [5, 5], [5, 9], [55, 55]], columns=['max_speed', 'shield'])
df['frcst_stus'] = 'current'
df
which gives the following result:
max_speed shield frcst_stus
0 1 2 current
1 4 5 current
2 5 5 current
3 5 9 current
4 55 55 current
I want to change index row 2 to read 5, 5, 'hello' without changing the rest of the dataframe.
I can do the examples in the pandas .loc documentation on setting values. I can set a row, a column, and rows matching a callable condition. But the callable works on a single column or series; I want two.
And I have found a number of Stack Overflow answers that use loc on a single column to set a value in a second column. That's not my issue: I want to search two columns' worth of data.
The following allows me to get the row I want:
result = df[(df['shield'] == 5) & (df['max_speed'] == 5) & (df['frcst_stus'] == 'current')]
And I know that just changing the equality operator (== 'current') to assignment (= 'current') gives me an error.
And when I select on two columns I can set the columns (see below), but both columns get set ('arghh'), and when I try to test the value of 'max_speed' I get a 'False is not in index' error.
df.loc[:, ['max_speed', 'frcst_stus']] = 'hello'
I also run into errors with Python's boolean handling here. Frankly, I just don't understand the whole operator overloading yet.
If you need to set different values in the two columns by mask m:
m = (df['shield'] == 5) & (df['max_speed'] == 5) & (df['frcst_stus'] == 'current')
df.loc[m, ['max_speed', 'frcst_stus']] = [100, 'hello']
If you need to set the same value in both columns by mask m:
df.loc[m, ['max_speed', 'frcst_stus']] = 'hello'
If you need to set only one column by mask m:
df.loc[m, 'frcst_stus'] = 'hello'
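Putting it together with the prototype dataframe from the question (a quick sketch to show the expected result: only index row 2 matches the mask and gets 'hello'):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [5, 5], [5, 9], [55, 55]],
                  columns=['max_speed', 'shield'])
df['frcst_stus'] = 'current'

# boolean mask over two (here three) columns at once
m = (df['shield'] == 5) & (df['max_speed'] == 5) & (df['frcst_stus'] == 'current')
df.loc[m, 'frcst_stus'] = 'hello'

print(df.loc[2].tolist())  # → [5, 5, 'hello']
```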
I'm building a fuzzy search program, using FuzzyWuzzy, to find matching names in a dataset. My data is in a DataFrame of about 10378 rows, and len(df['Full name']) is 10378, as expected. But len(choices) is only 1695.
I'm running Python 2.7.10 and pandas 0.17.0, in an IPython Notebook.
choices = df['Full name'].astype(str).to_dict()
def fuzzy_search_to_df(term, choices=choices):
    search = process.extract(term, choices, limit=len(choices))  # does the search itself
    rslts = pd.DataFrame(data=search, index=None, columns=['name', 'rel', 'df_ind'])  # puts the results in DataFrame form
    return rslts
results = fuzzy_search_to_df(term='Ben Franklin') # returns the search result for the given term
matches = results[results.rel > 85] # subset of results, these are the best search results
find = df.iloc[matches['df_ind']] # matches in the main df
As you can probably tell, I'm getting the index of the result in the choices dict as df_ind, which I had assumed would be the same as the index in the main dataframe.
I'm fairly certain that the issue is in the first line, with the to_dict() function, as len(df['Full name'].astype(str)) results in 10378 and len(df['Full name'].to_dict()) results in 1695.
The issue is that you have multiple rows in your dataframe where the index is the same. Since a Python dictionary can only hold a single value per key, and the Series.to_dict() method uses the index as the key, the values from earlier rows get overwritten by the values that come later.
A very simple example to show this behavior -
In [36]: df = pd.DataFrame([[1],[2]],index=[1,1],columns=['A'])
In [37]: df
Out[37]:
A
1 1
1 2
In [38]: df['A'].to_dict()
Out[38]: {1: 2}
This is what is happening in your case. As noted in the comments, the number of unique values in the index is only 1695, which we can confirm by checking len(df.index.unique()).
If you are content with having numbers as keys (the positional index of the dataframe), then you can reset the index using DataFrame.reset_index() and then call .to_dict() on that. Example -
choices = df.reset_index()['Full name'].astype(str).to_dict()
Demo from above example -
In [40]: df.reset_index()['A'].to_dict()
Out[40]: {0: 1, 1: 2}
This is the same as the solution the OP found - choices = dict(zip(df['n'], df['Full name'].astype(str))) (as can be seen from the comments) - but this method would be faster than using zip and dict.