I have a pandas data frame in the form of:
user_id  referral_code  referred_by
1        A              None
2        B              A
3        C              B
5        None           None
6        E              B
7        None           None
....
What I want to do is create another column, weight, for each user_id, containing the total number of referrals the user has made to others plus the number of times the user was referred, i.e. I have to check whether the referral_code of a user_id is present in the referred_by column, count the frequency of those matches, and also add 1 if the referred_by column has an entry for that user.
Expected output is:
user_id  referral_code  referred_by  weights
1        A              None         1
2        B              A            3
3        C              B            1
5        None           None         None
6        E              B            1
7        None           None         None
The approaches I have tried use df.groupby along with size and count, but nothing gives the expected output.
You want to build a new conditional column. If the conditions are simple enough, you can do it with np.where. I suggest you have a look at this post.
Here it's quite complex; there should be a solution with np.where, but it is not really obvious. In this case, you can use the apply method, which lets you write conditions as complex as you want. Using apply is less efficient than np.where, since it goes through a Python-level function call for each row; which one is preferable depends on your dataset and the complexity of your conditions.
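For reference, here is a minimal np.where sketch on hypothetical data, showing the simple single-condition case it is best suited for:

import numpy as np
import pandas as pd

# np.where(condition, value_if_true, value_if_false) is fully vectorized
demo = pd.DataFrame({"score": [10, 35, 20]})
demo["label"] = np.where(demo["score"] > 25, "high", "low")
print(demo)
#    score label
# 0     10   low
# 1     35  high
# 2     20   low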
Here is an example with apply:
import pandas as pd

df = pd.DataFrame(
    [[1, "A", None],
     [2, "B", "A"],
     [3, "C", "B"],
     [5, None, None],
     [6, "E", "B"],
     [7, None, None]],
    columns='user_id referral_code referred_by'.split(' ')
)
print(df)
#    user_id referral_code referred_by
# 0        1             A        None
# 1        2             B           A
# 2        3             C           B
# 3        5          None        None
# 4        6             E           B
# 5        7          None        None
weight_refered_by = df.referred_by.value_counts()
print(weight_refered_by)
# B 2
# A 1
def countWeight(row):
    count = 0
    # How many times this user's code appears in referred_by
    if row['referral_code'] in weight_refered_by.index:
        count = weight_refered_by[row['referral_code']]
    # Add 1 if the user was referred by someone
    if row['referred_by'] is not None:
        count += 1
    # If referral_code is None, the result is None,
    # because referred_by is included in referral_code
    if row['referral_code'] is None:
        count = None
    return count
df["weights"] = df.apply(countWeight, axis=1)
print(df)
#    user_id referral_code referred_by  weights
# 0        1             A        None      1.0
# 1        2             B           A      3.0
# 2        3             C           B      1.0
# 3        5          None        None      NaN
# 4        6             E           B      1.0
# 5        7          None        None      NaN
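If performance matters, the same logic can also be vectorized. Here is a sketch without apply, assuming the df defined above:

# Count how often each code appears in referred_by, add 1 when the user
# was referred, and blank out rows that have no referral_code at all.
counts = df['referral_code'].map(df['referred_by'].value_counts()).fillna(0)
was_referred = df['referred_by'].notna().astype(int)
df['weights'] = (counts + was_referred).where(df['referral_code'].notna())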
Hope that helps!
What you can do is use weights = df.referred_by.value_counts()['myword'] + 1 and then add it to your df in the weights column!
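For example, with the df above, a sketch of that lookup where the code "B" plays the role of 'myword':

# Users referred by "B" (2 of them) plus B's own referral from "A"
weight_b = df.referred_by.value_counts()['B'] + 1
print(weight_b)
# 3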
I have a dataframe like this (it's an extract of the entire dataframe):
 a  b
 1  1
 6  3
 7  5
 1  7
12  5
12  5
 2  5
95  2
44  3
I want to create a new column using NumPy in Python based on multiple where conditions, where the conditions also depend on previous values. Let me explain with an example:
I want to create column 'C' with value = '1' when:
(a > b) and (a[-1] < b) and (the previous non-empty value of "c" is 2)
Another condition is 'C' = '2' when:
(a < b) and (the previous non-empty value of "c" is 1)
Thank you!
You can use np.select, which returns an array drawn from elements in choicelist depending on conditions.
Use:
import numpy as np
import pandas as pd

df['c'] = ''  # assign an initial value so the shifted comparisons below work
conditions = [
    (df['a'].gt(df['b']) & df['a'].shift().lt(df['b']))
        & (df['c'].shift().eq('') | df['c'].shift().eq(2)),
    df['a'].lt(df['b']) & (df['c'].shift().eq(1) | df['c'].shift().eq(''))
]
choices = [1, 2]
df['c'] = np.select(conditions, choices, default='')
print(df)
print(df)
This prints:
    a  b  c
0   1  1
1   6  3  1
2   7  5
3   1  7  2
4  12  5  1
5  12  5
6   2  5  2
7  95  2
8  44  3
So, my data is travel data.
I want to create a column df['user_type'] that determines whether the df['user_id'] occurs more than once. If it does occur more than once, I'll label the user as a frequent user.
Here is my code below, but it takes way too long:
# Column that determines user type
def determine_user_type(val):
    df_freq = df[df['user_id'].duplicated()]
    user_type = ""
    if val in df_freq['user_id'].values:
        user_type = "Frequent"
    else:
        user_type = "Single"
    return user_type

df['user_type'] = df['user_id'].apply(lambda x: determine_user_type(x))
Use numpy.where with duplicated; to flag all duplicates, add the parameter keep=False:
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id':list('aaacbbt')})
df['user_type'] = np.where(df['user_id'].duplicated(keep=False), 'Frequent', 'Single')
Alternative:
d = {True:'Frequent',False:'Single'}
df['user_type'] = df['user_id'].duplicated(keep=False).map(d)
print (df)
  user_id user_type
0       a  Frequent
1       a  Frequent
2       a  Frequent
3       c    Single
4       b  Frequent
5       b  Frequent
6       t    Single
EDIT:
df = pd.DataFrame({'user_id':list('aaacbbt')})
print (df)
  user_id
0       a
1       a
2       a
3       c
4       b
5       b
6       t
Here drop_duplicates removes duplicated rows by column user_id and keeps only the first occurrence (the default parameter is keep='first'):
df_single = df.drop_duplicates('user_id')
print (df_single)
  user_id
0       a
3       c
4       b
6       t
But Series.duplicated returns True for all duplicates except the first:
print (df['user_id'].duplicated())
0    False
1     True
2     True
3    False
4    False
5     True
6    False
Name: user_id, dtype: bool
df_freq = df[df['user_id'].duplicated()]
print (df_freq)
  user_id
1       a
2       a
5       b
Using jezrael's data
df = pd.DataFrame({'user_id':list('aaacbbt')})
You can use array indexing
df.assign(
    user_type=np.array(['Single', 'Frequent'])[
        df['user_id'].duplicated(keep=False).astype(int)
    ]
)
  user_id user_type
0       a  Frequent
1       a  Frequent
2       a  Frequent
3       c    Single
4       b  Frequent
5       b  Frequent
6       t    Single
Using data from Jez, this method involves value_counts:
df.user_id.map(df.user_id.value_counts().gt(1).replace({True:'Frequent',False:'Single'}))
Out[52]:
0    Frequent
1    Frequent
2    Frequent
3      Single
4    Frequent
5    Frequent
6      Single
Name: user_id, dtype: object
I have a pandas dataframe column of lists and want to extract numbers from list strings and add them to their own separate column.
Column A
0 [ FUNNY (1), CARING (1)]
1 [ Gives good feedback (17), Clear communicator (2)]
2 [ CARING (3), Gives good feedback (3)]
3 [ FUNNY (2), Clear communicator (1)]
4 []
5 []
6 [ CARING (1), Clear communicator (1)]
I would like the output to look as follows:
FUNNY  CARING  Gives good feedback  Clear communicator
1      1       None                 None
None   None    17                   2
None   3       3                    None
2      None    None                 1
None   None    None                 None
etc...
Let's use apply with pd.Series, then extract and reshape with set_index and unstack:
df['Column A'].apply(pd.Series).stack().str.extract(r'(\w+)\((\d+)', expand=True)\
    .reset_index(1, drop=True).set_index(0, append=True)[1]\
    .unstack(1)
Output (note: this run used the answer's original sample data, not the data from the question):
0 Authentic Caring Classy Funny
0         1      3   None     2
1         2   None      1     2
Edit with new input data set:
df['Column A'].apply(pd.Series).stack().str.extract(r'(\w+).*\((\d+)', expand=True)\
    .reset_index(1, drop=True)\
    .set_index(0, append=True)[1]\
    .unstack(1)
0 CARING Clear FUNNY Gives
0      1  None     1  None
1   None     2  None    17
2      3  None  None     3
3   None     1     2  None
6      1     1  None  None
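On newer pandas (0.25+), a hedged alternative sketch that avoids apply(pd.Series) by using explode; this regex also keeps multi-word trait names intact:

# One string per row, then pull out the trait name and its count,
# then pivot back to one column per trait.
s = df['Column A'].explode().str.extract(r'\s*(?P<trait>.+?)\s*\((?P<count>\d+)\)')
out = (s.dropna()
        .reset_index()
        .pivot(index='index', columns='trait', values='count'))
print(out)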
I have 2 columns, User_ID and Item_ID. Now I want to make a new column 'Reordered' which will contain either 0 or 1. 0 is when a particular user has ordered an item only once, and 1 is when a particular user orders an item more than once.
I think this can be done by grouping on User_ID and then using the apply function to map duplicated items as 1 and non-duplicated ones as 0, but I'm not able to figure out the correct Python code for that.
Could someone please help me with this?
You can use Series.duplicated with the parameter keep=False to mark all duplicates - the output is a boolean Series. Then convert it to integers with astype:
df['Reordered'] = df['User_ID'].duplicated(keep=False).astype(int)
Sample:
df = pd.DataFrame({'User_ID':list('aaabaccd'),
                   'Item_ID':list('eetyutyu')})
df['Reordered'] = df['User_ID'].duplicated(keep=False).astype(int)
print (df)
  Item_ID User_ID  Reordered
0       e       a          1
1       e       a          1
2       t       a          1
3       y       b          0
4       u       a          1
5       t       c          1
6       y       c          1
7       u       d          0
Or maybe you need DataFrame.duplicated to check duplicates per user and item:
df['Reordered'] = df.duplicated(['User_ID','Item_ID'], keep=False).astype(int)
print (df)
  Item_ID User_ID  Reordered
0       e       a          1
1       e       a          1
2       t       a          0
3       y       b          0
4       u       a          0
5       t       c          0
6       y       c          0
7       u       d          0
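If you prefer the groupby route described in the question, here is a minimal sketch with transform that produces the same result:

# Size of each (User_ID, Item_ID) group; a size above 1 means reordered
df['Reordered'] = (df.groupby(['User_ID', 'Item_ID'])['Item_ID']
                     .transform('size')
                     .gt(1)
                     .astype(int))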
I have a Pandas dataset that I want to clean up before applying my ML algorithm. I am wondering whether it is possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
   a  b
0  1  6
1  4  7
2  2  4
3  3  7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
   a  b
0  1  6
1  3  7
...
Currently, my implementation is not working because some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence I would like to remove anything that is not exactly the integer value.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with isin on both columns, combined with &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
   a  b
0  1  6
3  3  7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
   a  b
0  1  6
3  3  7
But if you need to remove all rows with non-numeric values, then you need to_numeric with errors='coerce', which returns NaN for unparseable values; it is then possible to filter them out with notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
                   'b':['4','5','dsws7']})
print (df)
      a      b
0  1abc      4
1     2      5
2     3  dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
   a  b
1  2  5
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
                   'b':['4','5',np.nan]})
print (df)
      a    b
0  1abc    4
1  None    5
2     3  NaN
print (df[df.isnull().any(axis=1)])
      a    b
1  None    5
2     3  NaN
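Putting the two steps together for the original question, a minimal sketch (assuming the original df with string columns a and b) that coerces first and then keeps only the allowed values:

# Coerce both columns to numbers (bad strings such as '1abc' become NaN),
# then keep rows where a is in {1, 3} and b is in {6, 7}.
a = pd.to_numeric(df['a'], errors='coerce')
b = pd.to_numeric(df['b'], errors='coerce')
df1 = df[a.isin([1, 3]) & b.isin([6, 7])]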
You can use pandas isin():
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
   a  b
0  1  6
3  3  7