I have the dataset below:
timestamp conversationId UserId MessageId tpMessage Message
1614578324 ceb9004ae9d3 1c376ef 5bbd34859329 question Where do you live?
1614578881 ceb9004ae9d3 1c376ef d3b5d3884152 answer Brooklyn
1614583764 ceb9004ae9d3 1c376ef 0e4501fcd61f question What's your name?
1614590885 ceb9004ae9d3 1c376ef 97d841b79ff7 answer Phill
1614594952 ceb9004ae9d3 1c376ef 11ed3fd24767 question What's your gender?
1614602036 ceb9004ae9d3 1c376ef 601538860004 answer Male
1614602581 ceb9004ae9d3 1c376ef 8bc8d9089609 question How old are you?
1614606219 ceb9004ae9d3 1c376ef a2bd45e64b7c answer 35
1614606240 jto9034pe0i5 1c489rl o6bd35e64b5j question What's your name?
1614606250 jto9034pe0i5 1c489rl 96jd89i55b7t answer Robert
and I'm trying to reproduce the SQL ROW_NUMBER function in pandas:
ROW_NUMBER() OVER(PARTITION BY userId ORDER BY UserId,timestamp,conversationId ASC) AS num_Row
I've tried some approaches so far, but none worked as intended:
df['row_number'] = df.groupby(['userId','timestamp','conversationId']).cumcount() + 1
or
df['row_number'] = df.sort_values(['userId','timestamp','conversationId'], ascending=[True,False]) \
.groupby(['userId']) \
.cumcount() + 1
print(df)
My desired output is as follows:
timestamp conversationId UserId MessageId tpMessage Message num_row
1614578324 ceb9004ae9d3 1c376ef 5bbd34859329 question Where do you live? 1
1614578881 ceb9004ae9d3 1c376ef d3b5d3884152 answer Brooklyn 2
1614583764 ceb9004ae9d3 1c376ef 0e4501fcd61f question What's your name? 3
1614590885 ceb9004ae9d3 1c376ef 97d841b79ff7 answer Phill 4
1614594952 ceb9004ae9d3 1c376ef 11ed3fd24767 question What's your gender? 5
1614602036 ceb9004ae9d3 1c376ef 601538860004 answer Male 6
1614602581 ceb9004ae9d3 1c376ef 8bc8d9089609 question How old are you? 7
1614606219 ceb9004ae9d3 1c376ef a2bd45e64b7c answer 35 8
1614606240 jto9034pe0i5 1c489rl o6bd35e64b5j question What's your name? 1
1614606250 jto9034pe0i5 1c489rl 96jd89i55b7t answer Robert 2
Could you help with that?
A variation of your last attempt, which gives the provided output and matches the logic:
df['num_row'] = (df
.sort_values(by=['timestamp', 'conversationId'],
ascending=True) # this is the default
.groupby('UserId', sort=False).cumcount().add(1)
)
Output:
timestamp conversationId UserId MessageId tpMessage Message num_row
0 1614578324 ceb9004ae9d3 1c376ef 5bbd34859329 question Where do you live? 1
1 1614578881 ceb9004ae9d3 1c376ef d3b5d3884152 answer Brooklyn 2
2 1614583764 ceb9004ae9d3 1c376ef 0e4501fcd61f question What's your name? 3
3 1614590885 ceb9004ae9d3 1c376ef 97d841b79ff7 answer Phill 4
4 1614594952 ceb9004ae9d3 1c376ef 11ed3fd24767 question What's your gender? 5
5 1614602036 ceb9004ae9d3 1c376ef 601538860004 answer Male 6
6 1614602581 ceb9004ae9d3 1c376ef 8bc8d9089609 question How old are you? 7
7 1614606219 ceb9004ae9d3 1c376ef a2bd45e64b7c answer 35 8
8 1614606240 jto9034pe0i5 1c489rl o6bd35e64b5j question What's your name? 1
9 1614606250 jto9034pe0i5 1c489rl 96jd89i55b7t answer Robert 2
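For reference, here is a self-contained sketch of this approach on a trimmed-down version of the sample data (only the columns that matter for the numbering are kept):

```python
import pandas as pd

# Two conversations, two users, as in the question (trimmed to 4 rows)
df = pd.DataFrame({
    'timestamp': [1614578324, 1614578881, 1614606240, 1614606250],
    'conversationId': ['ceb9004ae9d3', 'ceb9004ae9d3',
                       'jto9034pe0i5', 'jto9034pe0i5'],
    'UserId': ['1c376ef', '1c376ef', '1c489rl', '1c489rl'],
})

# Equivalent of ROW_NUMBER() OVER (PARTITION BY UserId
#                                  ORDER BY timestamp, conversationId)
df['num_row'] = (df
    .sort_values(['timestamp', 'conversationId'])
    .groupby('UserId', sort=False)
    .cumcount()
    .add(1)
)
print(df['num_row'].tolist())  # [1, 2, 1, 2]
```

Because `cumcount` returns a Series indexed like the original frame, the assignment aligns correctly even though `sort_values` reorders the rows.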
You can also use pandasql, whose API is the sqldf function (it queries dataframes in the calling scope; the alias is lower-cased here to match the output below):
from pandasql import sqldf
sqldf("select *, ROW_NUMBER() OVER(PARTITION BY UserId ORDER BY timestamp, conversationId ASC) AS num_row from df")
Output:
timestamp conversationId UserId MessageId tpMessage Message num_row
0 1614578324 ceb9004ae9d3 1c376ef 5bbd34859329 question Where do you live? 1
1 1614578881 ceb9004ae9d3 1c376ef d3b5d3884152 answer Brooklyn 2
2 1614583764 ceb9004ae9d3 1c376ef 0e4501fcd61f question What's your name? 3
3 1614590885 ceb9004ae9d3 1c376ef 97d841b79ff7 answer Phill 4
4 1614594952 ceb9004ae9d3 1c376ef 11ed3fd24767 question What's your gender? 5
5 1614602036 ceb9004ae9d3 1c376ef 601538860004 answer Male 6
6 1614602581 ceb9004ae9d3 1c376ef 8bc8d9089609 question How old are you? 7
7 1614606219 ceb9004ae9d3 1c376ef a2bd45e64b7c answer 35 8
8 1614606240 jto9034pe0i5 1c489rl o6bd35e64b5j question What's your name? 1
9 1614606250 jto9034pe0i5 1c489rl 96jd89i55b7t answer Robert 2
I have a dataframe with thousands of rows like this:
city zip_code name
paris 1 John
paris 1 Eric
paris 2 David
LA 3 David
LA 4 David
LA 4 NaN
How can I group by city and zip_code and get the names for each group?
Expected output: a dataframe with one row per name for each unique (city, zip_code) combination:
city zip_code name
paris 1 John
Eric
paris 2 David
LA 3 David
LA 4 David
IIUC, you want to know the existing combinations of city and zip_code?
[k for k,_ in df.groupby(['city', 'zip_code'])]
output: [('LA', 3), ('LA', 4), ('paris', 1), ('paris', 2)]
Edit following your change to the question:
It looks like you want:
df.drop_duplicates().dropna()
output:
city zip_code name
0 paris 1 John
1 paris 1 Eric
2 paris 2 David
3 LA 3 David
4 LA 4 David
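For completeness, a runnable version of that suggestion on the sample data (the sample has no exact duplicate rows, so only the NaN row is dropped):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['paris', 'paris', 'paris', 'LA', 'LA', 'LA'],
    'zip_code': [1, 1, 2, 3, 4, 4],
    'name': ['John', 'Eric', 'David', 'David', 'David', np.nan],
})

# Drop exact duplicate rows, then drop rows containing NaN
out = df.drop_duplicates().dropna()
print(out['name'].tolist())  # ['John', 'Eric', 'David', 'David', 'David']
```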
I'm trying to extract the first two words from a string column in a dataframe:
df["Name"]
Name
Anthony Frank Hawk
John Rodney Mullen
Robert Dean Silva Burnquis
Geoffrey Joseph Rowley
To get the index of the second " " (space) I tried this, but find returns NaN instead of the number of characters up to the second space.
df["temp"] = df["Name"].str.find(" ")+1
df["temp"] = df["Status"].str.find(" ", start=df["Status"], end=None)
df["temp"]
0 NaN
1 NaN
2 NaN
3 NaN
The last step is to slice those names; I tried this code but it doesn't work either.
df["Status"] = df["Status"].str.slice(0,df["temp"])
df["Status"]
0 NaN
1 NaN
2 NaN
3 NaN
Expected result:
0 Anthony Frank
1 John Rodney
2 Robert Dean
3 Geoffrey Joseph
If you have a more efficient way to do this, please let me know!
df['temp'] = df.Name.str.rpartition().get(0)
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean Silva
3 Geoffrey Joseph Rowley Geoffrey Joseph
EDIT
If only the first two elements are required in the output:
df['temp'] = df.Name.str.split().str[:2].str.join(' ')
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join(x[:2]))
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join([x[0], x[1]]))
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean
3 Geoffrey Joseph Rowley Geoffrey Joseph
You can use str.index(substring) instead of str.find; it returns the smallest index at which the substring (such as " ", a space) is found in the string. You can then slice the string at that index and repeat the same step on the remainder.
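The idea can be sketched in plain Python on one of the sample names:

```python
name = "Anthony Frank Hawk"

# Index of the first space, then of the second space
first = name.index(" ")
second = name.index(" ", first + 1)

print(name[:second])  # Anthony Frank
```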
I have a simple data frame with first and last names. I would like to get the equivalent of a SQL SELF JOIN in pandas.
Here goes the full example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'first_name': ['Rose','Summer','Jane','Kim','Jack'],
'last_name': ['Howard','Solstice','Kim','Cruz','Rose'],
'customer_id': [1,2,3,4,5]})
df
first_name last_name customer_id
0 Rose Howard 1
1 Summer Solstice 2
2 Jane Kim 3
3 Kim Cruz 4
4 Jack Rose 5
REQUIRED OUTPUT
customer_id first_name last_name customer_id_1 first_name_1 last_name_1
1 Rose Howard 5 Jack Rose
4 Kim Cruz 3 Jane Kim
Using SQL
select a.first_name, a.last_name, b.first_name, b.last_name
from df as a, df as b
where a.first_name = b.last_name
My attempt
(pd.concat( [ df[['first_name','last_name']],
df[['first_name','last_name']].add_suffix('_1')
], axis=1, ignore_index=False)
)
first_name last_name first_name_1 last_name_1
0 Rose Howard Rose Howard
1 Summer Solstice Summer Solstice
2 Jane Rose Jane Rose
But,
(pd.concat( [ df,df.add_suffix('_1')], axis=1)
.query(" first_name == last_name_1 ")
)
This gives empty output, to my surprise!
I want two rows and four columns, as given by the SQL.
Use left_on and right_on:
df.merge(df, left_on='first_name', right_on='last_name')
Result:
first_name_x last_name_x customer_id_x first_name_y last_name_y \
0 Rose Howard 1 Jack Rose
1 Kim Cruz 4 Jane Kim
customer_id_y
0 5
1 3
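To also match the column names in the required output, the suffixes can be set explicitly — a sketch (the suffix choice here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['Rose', 'Summer', 'Jane', 'Kim', 'Jack'],
                   'last_name': ['Howard', 'Solstice', 'Kim', 'Cruz', 'Rose'],
                   'customer_id': [1, 2, 3, 4, 5]})

# Self-join: keep rows where a first_name equals another row's last_name
out = df.merge(df, left_on='first_name', right_on='last_name',
               suffixes=('', '_1'))
print(out[['customer_id', 'first_name', 'last_name',
           'customer_id_1', 'first_name_1', 'last_name_1']])
```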
I have a dataframe like this, and I want to find the pairs and do operations with them.
fname lname time of entry ............... other columns
Adrian Peter 1
Jhon Adrian 3
Peter Rusk 4
Rusk Anton 10
Gile John 12
Angela Gomes 13
Gomes Angela 14
Now I want something like this, where the culprit is a value that appears in both fname and lname. If, for example, both values appear in fname and lname, as in the Angela Gomes case below, then the culprit column gets one line with Angela and another with Gomes.
pair fname lname Culprit time diff ...... other columns
1 Adrian Peter Adrian -2
1 John Adrian Adrian 2
2 Peter Rusk Rusk -6
2 Rusk Anton Rusk 6
3 Angela Gomes Angela -1
3 Gomes Angela Gomes 1
From the above I know that in pair 3 both Angela and Gomes are culprits. Time should also be sorted in ascending order.
I'm not in love with this, there's probably a better way, but it works and doesn't use any python iteration / lists.
Code:
# Find and number the pairs and filter out the rows that don't belong
df = df.loc[(df['fname'].isin(df['lname'].shift())) | (df['fname'].isin(df['lname'].shift(-1)))].reset_index(drop = True)
df['pair'] = (df.index / 2.0).astype(int) + 1
# Find the culprit
df['culprit'] = df.loc[(df['fname'] == df['lname'].shift(-1)) | (df['fname'] == df['lname'].shift(1)), 'fname']
df.sort_values(by = ['pair','culprit'], inplace = True)
df.ffill(inplace = True)
# Calculate the time difference ('ToE' here stands for the time-of-entry column)
df['time_diff'] = df.loc[df['pair'] == df['pair'].shift(1), 'ToE'] - df['ToE'].shift(1)
df['time_diff'] = df['time_diff'].fillna(df['time_diff'].shift(-1) * -1).astype(int)
# Sort
df.sort_values(by = ['pair','time_diff'], inplace = True)
print(df[['pair','fname','lname','culprit','time_diff']].to_string(index = False))
Output:
pair fname lname culprit time_diff
1 Adrian Peter Adrian -2
1 John Adrian Adrian 2
2 Peter Rusk Rusk -6
2 Rusk Anton Rusk 6
3 Angela Gomes Angela -1
3 Gomes Angela Gomes 1
I have two data frames
df1 =
actorID actorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
df2 =
directorID directorName
0 john_lasseter John Lasseter
1 joe_johnston Joe Johnston
2 donald_petrie Donald Petrie
3 forest_whitaker Forest Whitaker
4 charles_shyer Charles Shyer
What I ideally want is a concatenation of these two dataframes, like pd.concat((df1, df2)):
actorID-directorID actorName-directorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
5 john_lasseter John Lasseter
6 joe_johnston Joe Johnston
7 donald_petrie Donald Petrie
8 forest_whitaker Forest Whitaker
9 charles_shyer Charles Shyer
However, I want an easy way to specify that df1.actorName and df2.directorName should be combined into one column, and likewise actorID / directorID. How can I do this?
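One way to do this is to rename both frames to a shared schema before concatenating — a sketch on the first two rows of each frame (the combined column names are chosen for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'actorID': ['annie_potts', 'bill_farmer'],
                    'actorName': ['Annie Potts', 'Bill Farmer']})
df2 = pd.DataFrame({'directorID': ['john_lasseter', 'joe_johnston'],
                    'directorName': ['John Lasseter', 'Joe Johnston']})

# Give both frames the same column labels, then stack them
cols = ['actorID-directorID', 'actorName-directorName']
out = pd.concat([df1.set_axis(cols, axis=1),
                 df2.set_axis(cols, axis=1)],
                ignore_index=True)
print(out)
```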