Merge two data frames after using str.contains? - python

I have two data frames. I want to match partial strings using the str.contains function and then merge them.
Here is an example:
data1
email         is_mane  name                  id
hi#amal.com   1        there is rain         10
hi2#amal.com  1        here is the food      9
hi3#amal.com  1        let's go together     8
hi4#amal.com  1        today is my birthday  6
data2
id  name
1   the rain is beautiful
1   the food
2   together
4   my birthday
3   your birthday
And here is the code I wrote:
data1.loc[data1.name.str.contains('|'.join(data2.name)), :]
and the output:
email         is_mane  name                  id
hi2#amal.com  1        here is the food      9
hi3#amal.com  1        let's go together     8
hi4#amal.com  1        today is my birthday  6
As you can see, it did not return "there is rain" even though the word rain appears in data2: could it be because of the spaces?
I also want to merge data1 with data2, so that I know which email has a match.
I would like to have the following output:
email         is_mane  name                  id  id2  name2
hi2#amal.com  1        here is the food      9   1    the food
hi3#amal.com  1        let's go together     8   2    together
hi4#amal.com  1        today is my birthday  6   4    my birthday
hi4#amal.com  1        today is my birthday  6   3    your birthday
Is there any way to do it?

If you're fine with matching only full words (so e.g. dog and dogs won't match), you can do:
# split each name into individual words to use as join keys
data1["key"] = data1["name"].str.split(r"[^\w]+")
data2["key"] = data2["name"].str.split(r"[^\w]+")
# one row per word, merge on the word, then drop the helper column and the duplicates
data3 = (data1.explode("key").merge(data2.explode("key"), on="key", suffixes=["", "2"])
              .drop("key", axis=1).drop_duplicates())
Otherwise it's a matter of a cross join, and then applying str.contains(...) to filter out the pairs that don't match.
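For completeness, here is a minimal sketch of that cross-join approach, assuming the data1/data2 frames above and pandas >= 1.2 (for how="cross"); the row-wise substring check is just one way to do the filtering:
# pair every row of data1 with every row of data2
cross = data1.merge(data2, how="cross", suffixes=["", "2"])
# keep only the pairs where the data2 name appears inside the data1 name
data3 = cross[cross.apply(lambda r: r["name2"] in r["name"], axis=1)]
Note that the cross join grows as len(data1) * len(data2), so this only scales to reasonably small frames.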

Related

Multiple group counts within a database

I have been presented with a very small dataset that has the date of each time a user logs into a system. I have to use this dataset to create a table that shows, for each log-in, the cumulative monthly count of logins and the overall cumulative count of logins. This is the dataset I have:
date       user
1/01/2022  Mark
2/01/2022  Mark
3/02/2022  Mark
4/02/2022  Mark
5/03/2022  Mark
6/03/2022  Mark
7/03/2022  Mark
8/03/2022  Mark
9/03/2022  Mark
and this is my desired output:
row  date       user  monthly_track  acum_track
1    1/01/2022  Mark  1              1
2    2/01/2022  Mark  2              2
3    3/02/2022  Mark  1              3
4    4/02/2022  Mark  2              4
5    5/03/2022  Mark  1              5
6    6/03/2022  Mark  2              6
7    7/03/2022  Mark  3              7
8    8/03/2022  Mark  4              8
9    9/03/2022  Mark  5              9
Why? Let's look at row number 5. This is the first time the user Mark has logged into the system during month 3 (March), but it is the 5th overall login in the dataset (for the purpose of learning there will only be one year, 2022).
I have no idea how to get the monthly and overall counts together. I can group by user and sort by date to count how many times in total a user has logged in, but to achieve my desired output I would have to group by date and user and count per month, and then somehow group the data by user (only) to get the overall count, and I don't think I can group the data twice.
First you need to convert date to actual datetime values with to_datetime. The rest is simple with groupby and cumcount:
# parse the day-first dates into real datetimes
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
# running login count within each (user, year, month) group
df['monthly_count'] = df.groupby([df['user'], df['date'].dt.year, df['date'].dt.month]).cumcount() + 1
# running login count per user over the whole dataset
df['acum_count'] = df.groupby('user').cumcount() + 1
Output:
>>> df
date user monthly_count acum_count
0 2022-01-01 Mark 1 1
1 2022-01-02 Mark 2 2
2 2022-02-03 Mark 1 3
3 2022-02-04 Mark 2 4
4 2022-03-05 Mark 1 5
5 2022-03-06 Mark 2 6
6 2022-03-07 Mark 3 7
7 2022-03-08 Mark 4 8
8 2022-03-09 Mark 5 9

How to count Pandas df elements with dynamic condition per row (=countif)

I am trying to do some equivalent of COUNTIF in Pandas. I am trying to get my head around doing it with groupby, but I am struggling because my logical grouping condition is dynamic.
Say I have a list of customers, and the day on which they visited. I want to identify new customers based on 2 logical conditions
They must be the same customer (same Guest ID)
They must have been there on the previous day
If both conditions are met, they are a returning customer. If not, they are new (hence newby = 1 - ... to identify new customers).
I managed to do this with a for loop, but obviously performance is terrible and this goes pretty much against the logic of Pandas.
How can I wrap the following code into something smarter than a loop?
for i in range(0, len(df)):
    df.loc[i, "newby"] = 1 - np.sum((df["Day"] == df.iloc[i]["Day"] - 1) & (df["Guest ID"] == df.iloc[i]["Guest ID"]))
This post does not help, as the condition is static. I would like to avoid introducing "dummy columns", such as by transposing the df, because I will have many categories (many customer names) and would like to build more complex logical statements. I do not want to run the risk of ending up with many auxiliary columns.
I have the following input
df
Day Guest ID
0 3230 Tom
1 3230 Peter
2 3231 Tom
3 3232 Peter
4 3232 Peter
and expect this output
df
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
Note that elements 3 and 4 are not necessarily duplicates - given there might be additional, varying columns (such as their order).
Do:
# ensure the df is sorted by date
df = df.sort_values('Day')
# group by customer and find the diff within each group
df['newby'] = (df.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
UPDATE
If multiple visits are allowed per day, you could do:
# only keep unique visits per day
uniques = df.drop_duplicates()
# ensure the df is sorted by date
uniques = uniques.sort_values('Day')
# group by customer and find the diff within each group
uniques['newby'] = (uniques.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
# merge the uniques visits back into the original df
res = df.merge(uniques, on=['Day', 'Guest ID'])
print(res)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
As an alternative, without sorting or merging, you could do:
# for every observed visit, record the (following day, guest) pair
lookup = {(day + 1, guest) for day, guest in df[['Day', 'Guest ID']].value_counts().to_dict()}
# a visit is new (1) unless its own (Day, Guest ID) pair appears in that set
df['newby'] = (~pd.MultiIndex.from_arrays([df['Day'], df['Guest ID']]).isin(lookup)).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1

Python: Clustering with grouped data

By grouped data I mean the following: assume we have a data set that is grouped by a single feature, e.g. customer data grouped by the individual customer:
Customer | Purchase Nr | Item      | Paid Amount ($)
1        | 1           | TShirt    | 15
1        | 2           | Trousers  | 25
1        | 3           | Scarf     | 10
2        | 1           | Underwear | 5
2        | 2           | Dress     | 35
2        | 3           | Trousers  | 30
2        | 4           | TShirt    | 10
3        | 1           | TShirt    | 8
3        | 2           | Socks     | 5
4        | 1           | Shorts    | 13
I want to find clusters in such a way that a customer's purchases are all in one single cluster, in other words, that a customer does not appear in two clusters.
I thought about grouping the data set by the customer with a groupby, though it is difficult to express all the information of the columns for one customer in only one column. Further, the order of purchases is important to me, e.g. whether a T-Shirt was bought first or second.
Is there any cluster algorithm which includes information about groups like this?
Thank you!
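One way to guarantee that each customer lands in exactly one cluster is to aggregate the purchase rows into a single feature vector per customer and cluster those vectors. Below is a minimal sketch using scikit-learn's KMeans; the chosen features, the cluster count, and the DataFrame name df are illustrative assumptions, and this simple aggregation does not capture purchase order:
import pandas as pd
from sklearn.cluster import KMeans

# one row per customer: total spend, number of purchases, average price
features = df.groupby('Customer')['Paid Amount ($)'].agg(
    total_spend='sum', n_purchases='count', avg_price='mean')

# cluster the per-customer feature vectors; k=2 is an arbitrary example
features['cluster'] = KMeans(n_clusters=2, n_init=10).fit_predict(features)
print(features)
To keep the order information, the per-customer vector would have to encode the sequence somehow, for example one column per purchase position, or a sequence-aware distance (such as edit distance over the item sequence) combined with a clustering method that accepts precomputed distances.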

How do I bin data into categories alphabetically using pandas?

I have a data frame which includes a column containing a series of strings
books = pd.DataFrame([[1,'In Search of Lost Time'],[2,'Don Quixote'],[3,'Ulysses'],[4,'The Great Gatsby'],[5,'Moby Dick']], columns = ['Book ID', 'Title'])
Book ID Title
0 1 In Search of Lost Time
1 2 Don Quixote
2 3 Ulysses
3 4 The Great Gatsby
4 5 Moby Dick
And a sorted list of boundaries
boundaries = ['AAAAAAA','The Great Gatsby', 'zzzzzzzz']
I would like to use these boundaries to categorize the values in the data frame into alphabetical bins, similarly to how pd.cut() works for numeric data. My desired output would look something like the below.
Book ID Title binning
0 1 In Search of Lost Time ['AAAAAAA','The Great Gatsby')
1 2 Don Quixote ['AAAAAAA','The Great Gatsby')
2 3 Ulysses ['The Great Gatsby','zzzzzzzz')
3 4 The Great Gatsby ['The Great Gatsby','zzzzzzzz')
4 5 Moby Dick ['AAAAAAA','The Great Gatsby')
Is this possible?
searchsorted
boundaries = np.array(['The Great Gatsby'])
bins = np.array(['[A..The Great Gatsby)', '[The Great Gatsby..Z]'])
books.assign(binning=bins[boundaries.searchsorted(books.Title)])
Book ID Title binning
0 1 In Search of Lost Time [A..The Great Gatsby)
1 2 Don Quixote [A..The Great Gatsby)
2 3 Ulysses [The Great Gatsby..Z]
3 4 The Great Gatsby [A..The Great Gatsby)
4 5 Moby Dick [A..The Great Gatsby)
Expand this to some other set of boundaries:
from string import ascii_uppercase as letters
boundaries = np.array([*letters[1:-1]])
bins = np.array([f'[{a}..{b})' for a, b in zip(letters, letters[1:])])
books.assign(binning=bins[boundaries.searchsorted(books.Title)])
Book ID Title binning
0 1 In Search of Lost Time [I..J)
1 2 Don Quixote [D..E)
2 3 Ulysses [U..V)
3 4 The Great Gatsby [T..U)
4 5 Moby Dick [M..N)

How to retrieve first and last row per person from xlsx file using python?

I'm very new to programming, and I have a Python question regarding reading and manipulating Excel files. I've been trying out the xlrd and openpyxl modules, but still haven't been able to do exactly what I'm looking for.
I have two excel files: one with a list of people, their ID number, and their scores on a survey. In File #1, each person only has ONE line. The second file has a list of people's ID numbers, a date, and some other survey scores. In File #2, each person can have many lines of scores for administrations of the survey that occurred on different days.
My goal is to take each person's ID number from File #1, find any rows in File #2 that have the same ID number, and return the FIRST and LAST of those rows per person from File #2. I then want to ADD these first & last rows onto the end of each person's row from File #1 (so just make each person's row longer).
I've made two test files of fake data and then a file of what I'd like it to look like in the end so that people might understand better what I'm trying to do...
File #1:
ID first_name last_name Score1 Score2 Score3
12-34-56-7 joe schmoe 2 7 3
98-76-54-3 jane doe 3 5 8
99-11-99-1 sally jones 5 6 9
File #2:
ID DATE Score4 Score5 Score6
12-34-56-7 12/28/2000 8 9 4
12-34-56-7 12/29/2000 5 1 6
12-34-56-7 12/30/2000 2 5 5
12-34-56-7 12/31/2000 1 4 7
12-34-56-7 1/1/2001 4 6 3
12-34-56-7 1/2/2001 7 3 1
98-76-54-3 4/18/1999 9 2 2
98-76-54-3 9/6/2014 6 7 8
98-76-54-3 10/5/2020 3 8 9
99-11-99-1 6/7/2012 2 5 5
99-11-99-1 6/8/2012 5 2 4
Sorry, I know that might have been confusing... this is what I want it to look like in the end. E.g. find any rows in File #2 that have the same ID number as the first person in File #1 (12-34-56-7), grab his first and last rows of data from File #2, then tack them onto the end of his row from File #1. Rinse and repeat for every person in File #1.
ID first_name last_name Score1 Score2 Score3 ID DATE Score4 Score5 Score6 ID DATE Score4 Score5 Score6
12-34-56-7 joe schmoe 2 7 3 12-34-56-7 12/28/2000 8 9 4 12-34-56-7 1/2/2001 7 3 1
98-76-54-3 jane doe 3 5 8 98-76-54-3 4/18/1999 9 2 2 98-76-54-3 9/6/2014 6 7 8
99-11-99-1 sally jones 5 6 9 99-11-99-1 10/5/2020 3 8 9 99-11-99-1 6/8/2012 5 2 4
Any advice for how to write this? Again, I'm really really new to writing scripts so any and all help would be much appreciated :)
Install openpyxl.
Read your sheet 2.
Store each row for a particular ID in a 2D array (converting all values to strings).
Read your sheet 1.
Append the name data to the arrays according to the ID.
Take the data from the arrays and write it to the Excel sheet (converting back to the corresponding data types).
If you are looking for the syntax, this tutorial will help you.
Read this.
This Stack Overflow question will also help you.
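For reference, here is a minimal sketch of those steps using pandas instead of looping over cells with openpyxl directly (pandas reads .xlsx files through openpyxl). The file names and the assumption that File #2 is already sorted by date are illustrative:
import pandas as pd

# hypothetical file names; adjust to the real workbooks
file1 = pd.read_excel('file1.xlsx')   # one row per person
file2 = pd.read_excel('file2.xlsx')   # many rows per person

# first and last File #2 row per ID (assumes the rows are already in date order)
first = file2.groupby('ID', as_index=False).first()
last = file2.groupby('ID', as_index=False).last()

# tack the first and last File #2 rows onto each person's File #1 row
merged = file1.merge(first, on='ID').merge(last, on='ID', suffixes=('_first', '_last'))
merged.to_excel('combined.xlsx', index=False)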
