If I have a dataframe like this:
         date           A           B           C
0  01.01.2003  01.01.2003         NaN         NaN
1  02.01.2003         NaN         NaN         NaN
2  03.01.2003         NaN  03.01.2003         NaN
3  05.01.2003  05.01.2003         NaN         NaN
4  06.01.2003         NaN  06.01.2003         NaN
5  08.01.2003  08.01.2003  08.01.2003  08.01.2003
If the values in columns A, B, and C are all equal, I want to delete the values in columns A and B and leave only C.
So the output should be:
         date           A           B           C
0  01.01.2003  01.01.2003         NaN         NaN
1  02.01.2003         NaN         NaN         NaN
2  03.01.2003         NaN  03.01.2003         NaN
3  05.01.2003  05.01.2003         NaN         NaN
4  06.01.2003         NaN  06.01.2003         NaN
5  08.01.2003         NaN         NaN  08.01.2003
I applied np.where, but the error says the condition does not apply on timestamps:
np.where((df['A'] & df['B'] == df['C']),
         df['A'] & df['B'], '')
Thanks for the lead.
Use boolean indexing with the help of all:
df.loc[df[['A', 'B']].eq(df['C'], axis=0).all(axis=1), ['A', 'B']] = np.nan
Output:
date A B C
0 01.01.2003 01.01.2003 None None
1 02.01.2003 None None None
2 03.01.2003 NaN 03.01.2003 None
3 05.01.2003 05.01.2003 None None
4 06.01.2003 NaN 06.01.2003 None
5 08.01.2003 NaN NaN 08.01.2003
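For completeness, the np.where idea from the question can work once the condition is written as element-wise comparisons instead of &-ing the column values themselves; a minimal sketch, assuming the same df:
import numpy as np
# blank A and B only where all three columns hold the same value;
# rows with missing values compare as False and are left alone
mask = df['A'].eq(df['C']) & df['B'].eq(df['C'])
df['A'] = np.where(mask, None, df['A'])
df['B'] = np.where(mask, None, df['B'])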
You can use pandas.DataFrame.loc with two conditions on row selection, namely A=B and B=C, and assign [None] to both A and B fields.
df.loc[(df['A']==df['B']) & (df['B']==df['C']), ['A', 'B']] = [[None, None]]
Output:
date A B C
0 01.01.2003 01.01.2003 None None
1 02.01.2003 None None None
2 03.01.2003 None 03.01.2003 None
3 05.01.2003 05.01.2003 None None
4 06.01.2003 None 06.01.2003 None
5 08.01.2003 None None 08.01.2003
I have a large dataset (4GB) like this:
userID date timeofday seq
0 1000014754 20211028 20 133669542676:1:148;133658378700:1:16;133650937891:1:85
1 1000019906 20211028 6 508420199:0:0;133669581685:1:19
2 1000019906 20211028 22 133665269544:0:0
From this, I would like to split "seq" by ";" first and create a new dataset with renamed columns. It should look like this:
userID date timeofday seq1 seq2 seq3 ... seqN
0 1000014754 20211028 20 133669542676:1:148 133658378700:1:16 133650937891:1:85
1 1000019906 20211028 6 508420199:0:0 133669581685:1:19 None None
2 1000019906 20211028 22 133665269544:0:0 None None None
Then I want to split seq1, seq2, ..., seqN by ":" and create another new dataset with renamed columns. It should look like this:
userID date timeofday name1 click1 time1 name2 click2 time2 ....nameN clickN timeN
0 1000014754 20211028 20 133669542676 1 148 133658378700 1 16 133650937891 1 85 None None None
1 1000019906 20211028 6 508420199 0 0 133669581685 1 19 None None None None None None
2 1000019906 20211028 22 133665269544 0 0 None None None None None None None None None
I know pandas' str.split can split the columns, but I don't know how to do it efficiently. Thank you!
A clean solution is to use a regex and extractall, then reshape using unstack, rename the columns and join to the original dataframe.
Assuming df is the dataframe name:
df2 = (df['seq'].str.extractall(r'(?P<name>[^:]+):(?P<click>[^:]+):(?P<time>[^;]+);?')
.unstack('match')
.sort_index(level=1, axis=1, sort_remaining=False)
)
df2.columns = df2.columns.map(lambda x: f'{x[0]}{x[1]+1}')
df2 = df.drop(columns='seq').join(df2)
output:
userID date timeofday name1 click1 time1 name2 click2 time2 name3 click3 time3
0 1000014754 20211028 20 133669542676 1 148 133658378700 1 16 133650937891 1 85
1 1000019906 20211028 6 508420199 0 0 133669581685 1 19 NaN NaN NaN
2 1000019906 20211028 22 133665269544 0 0 NaN NaN NaN NaN NaN NaN
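If you only need the first step (splitting on ";" into seq1..seqN), a shorter sketch with str.split(expand=True) does it; the seqN names here are generated for illustration:
# one column per ';'-separated field; short rows are padded with None
parts = df['seq'].str.split(';', expand=True)
parts.columns = [f'seq{i+1}' for i in range(parts.shape[1])]
df2 = df.drop(columns='seq').join(parts)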
Try this, it should get you the result:
A = pd.DataFrame({1: [2, 3, 4], 2: ['as:d', 'asd', 'a:sd']})
print(A)
for i in A.index:
    split = str(A[2][i]).split(':', 1)
    A.at[i, 3] = split[0]
    if len(split) > 1:
        A.at[i, 4] = split[1]
print(A)
It's probably slow, since the dataframe is updated often. Alternatively, you can collect the new columns in separate lists and merge them into one table later, as sketched below.
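A minimal sketch of that variant, reusing the toy frame A from above:
# build both new columns in plain lists, then assign once at the end
col3, col4 = [], []
for val in A[2].astype(str):
    parts = val.split(':', 1)
    col3.append(parts[0])
    col4.append(parts[1] if len(parts) > 1 else None)
A[3], A[4] = col3, col4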
I have a pandas data frame in the form of:
user_id referral_code referred_by
1 A None
2 B A
3 C B
5 None None
6 E B
7 None None
....
What I want to do is create another column, weight, for each user ID. It should contain the number of referrals the user has made to others plus the number of times the user was referred, i.e. I have to count how often a user's referral_code appears in the referred_by column, and add 1 if the user's own referred_by column has an entry.
Expected output is:
user_id referral_code referred_by weights
1 A None 1
2 B A 3
3 C B 1
5 None None None
6 E B 1
7 None None None
The approaches I have tried use df.groupby along with size and count, but nothing gives the expected output.
You want to build a new conditional column. If the conditions are simple enough, you can do it with np.where. I suggest you have a look at this post.
Here it's quite complex; there should be a solution with np.where, but it is not obvious. In this case, you can use the apply method. It lets you write conditions as complex as you want. Using apply is less efficient than np.where, since each row goes through a Python-level function call; whether that matters depends on your dataset and the complexity of your conditions.
Here is an example with apply:
df = pd.DataFrame(
    [[1, "A", None],
     [2, "B", "A"],
     [3, "C", "B"],
     [5, None, None],
     [6, "E", "B"],
     [7, None, None]],
    columns='user_id referral_code referred_by'.split(' ')
)
print(df)
# user_id referral_code referred_by
# 0 1 A None
# 1 2 B A
# 2 3 C B
# 3 5 None None
# 4 6 E B
# 5 7 None None
weight_refered_by = df.referred_by.value_counts()
print(weight_refered_by)
# B 2
# A 1
def countWeight(row):
    count = 0
    if row['referral_code'] in weight_refered_by.index:
        count = weight_refered_by[row.referral_code]
    if row["referred_by"] is not None:
        count += 1
    # If referral_code is None, the result is None,
    # because referred_by is included in referral_code
    if row["referral_code"] is None:
        count = None
    return count
df["weights"] = df.apply(countWeight, axis=1)
print(df)
# user_id referral_code referred_by weights
# 0 1 A None 1.0
# 1 2 B A 3.0
# 2 3 C B 1.0
# 3 5 None None NaN
# 4 6 E B 1.0
# 5 7 None None NaN
Hope that helps!
What you can do is use weights = df.referred_by.value_counts()['myword'] + 1 and then add it to your df in the weights column!
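Spelled out as a vectorized sketch of that value_counts idea (assuming the df built in the previous answer):
# referrals made = how often the user's code appears in referred_by;
# +1 when the user was referred themselves; None when there is no code
counts = df['referred_by'].value_counts()
df['weights'] = df['referral_code'].map(counts).fillna(0) + df['referred_by'].notna()
df.loc[df['referral_code'].isna(), 'weights'] = None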
I am trying to update my_df based on conditional selection as in:
my_df[my_df['group'] == 'A']['rank'].fillna('A+')
However, this is not persistent, e.g. my_df still has NaN or NaT, and I am not sure how to do this in place. Please advise on how to persist the update to my_df.
Create a boolean mask and assign to the filtered column rank:
my_df = pd.DataFrame({'group':list('AAAABC'),
'rank':['a','b',np.nan, np.nan, 'c',np.nan],
'C':[7,8,9,4,2,3]})
print (my_df)
group rank C
0 A a 7
1 A b 8
2 A NaN 9
3 A NaN 4
4 B c 2
5 C NaN 3
m = my_df['group'] == 'A'
my_df.loc[m, 'rank'] = my_df.loc[m, 'rank'].fillna('A+')
print(my_df)
group rank C
0 A a 7
1 A b 8
2 A A+ 9
3 A A+ 4
4 B c 2
5 C NaN 3
You need to assign it back:
my_df.loc[my_df['group'] == 'A','rank']=my_df.loc[my_df['group'] == 'A','rank'].fillna('A+')
Your operations are not in-place, so you need to assign back to a variable. In addition, chained indexing is not recommended.
One option is pd.Series.mask with a Boolean series:
# data from #jezrael
df['rank'].mask((df['group'] == 'A') & df['rank'].isnull(), 'A+', inplace=True)
print(df)
C group rank
0 7 A a
1 8 A b
2 9 A A+
3 4 A A+
4 2 B c
5 3 C NaN
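The same fix can also be sketched with np.where, rebuilding the column wholesale:
import numpy as np
# 'A+' where group is A and rank is missing, otherwise keep the old value
m = (df['group'] == 'A') & df['rank'].isnull()
df['rank'] = np.where(m, 'A+', df['rank'])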
I have a pandas dataframe column of lists and want to extract the numbers from the list strings and add them to their own separate columns.
Column A
0 [ FUNNY (1), CARING (1)]
1 [ Gives good feedback (17), Clear communicator (2)]
2 [ CARING (3), Gives good feedback (3)]
3 [ FUNNY (2), Clear communicator (1)]
4 []
5 []
6 [ CARING (1), Clear communicator (1)]
I would like the output to look as follows:
FUNNY CARING Gives good feedback Clear communicator
1 1 None None
None None 17 2
None 3 3 None
2 None None 1
None None None None
etc...
Let's use apply with pd.Series, then extract and reshape with set_index and unstack:
df['Column A'].apply(pd.Series).stack().str.extract(r'(\w+)\((\d+)', expand=True)\
.reset_index(1, drop=True).set_index(0, append=True)[1]\
.unstack(1)
Output:
0 Authentic Caring Classy Funny
0 1 3 None 2
1 2 None 1 2
Edit with new input data set:
df['Column A'].apply(pd.Series).stack().str.extract(r'(\w+).*\((\d+)', expand=True)\
.reset_index(1, drop=True)\
.set_index(0, append=True)[1]\
.unstack(1)
0 CARING Clear FUNNY Gives
0 1 None 1 None
1 None 2 None 17
2 3 None None 3
3 None 1 2 None
6 1 1 None None
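Note that \w+ only captures the first word of each name, which is why "Gives good feedback" comes out as "Gives". A sketch that keeps the full name, assuming pandas >= 0.25 for explode:
# one row per list element, then pull the "name (count)" pairs apart
# and pivot wide; rows whose list is empty drop out of the result
s = df['Column A'].explode().dropna()
pairs = s.str.extract(r'\s*(?P<name>.+?)\s*\((?P<num>\d+)\)')
result = pairs.set_index('name', append=True)['num'].unstack('name')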
My pandas dataframe:
dframe = pd.DataFrame({"A":list("abcde"), "B":list("aabbc"), "C":[1,2,3,4,5]}, index=[10,11,12,13,14])
A B C
10 a a 1
11 b a 2
12 c b 3
13 d b 4
14 e c 5
My desired output:
A B C a b c
10 a a 1 1 None None
11 b a 2 2 None None
12 c b 3 None 3 None
13 d b 4 None 4 None
14 e c 5 None None 5
The idea is to create new columns based on the values in the 'B' column, copying the respective values from the 'C' column into the newly created columns.
Here is my code:
lis = sorted(list(dframe.B.unique()))
# creating empty columns
for items in lis:
    dframe[items] = None
# here copy and pasting
for items in range(0, len(dframe)):
    slot = dframe.B.iloc[items]
    dframe[slot][items] = dframe.C.iloc[items]
I ended up with this warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
app.launch_new_instance()
This code worked well in Python 2.7 but not in 3.x. Where am I going wrong?
Start with
to_be_appended = pd.get_dummies(dframe.B).replace(0, np.nan).mul(dframe.C, axis=0)
Then concat
dframe = pd.concat([dframe, to_be_appended], axis=1)
Looks like:
print(dframe)
A B C a b c
10 a a 1 1.0 NaN NaN
11 b a 2 2.0 NaN NaN
12 c b 3 NaN 3.0 NaN
13 d b 4 NaN 4.0 NaN
14 e c 5 NaN NaN 5.0
Notes for searching.
This combines one-hot encoding with a broadcast multiplication.
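An alternative sketch of the same reshaping with pivot:
# pivot spreads C into one column per unique value of B, NaN elsewhere
dframe = dframe.join(dframe.pivot(columns='B', values='C'))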
Chained assignment will now by default warn if the user is assigning to a copy.
This can be changed with the option mode.chained_assignment, allowed options are raise/warn/None. See the docs.
In [5]: dfc = DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})
In [6]: pd.set_option('chained_assignment','warn')
The following warning / exception will show if this is attempted.
In [7]: dfc.loc[0]['A'] = 1111
Traceback (most recent call last)
...
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
Here is the correct method of assignment.
In [8]: dfc.loc[0,'A'] = 11
In [9]: dfc
A B
0 11 1
1 bbb 2
2 ccc 3