I would like to convert the following SQL statement to the equivalent pandas expression.
select
    a1.country,
    a1.platform,
    a1.url_page as a1_url_page,
    a2.url_page as a2_url_page,
    a1.userid,
    a1.a1_min_time,
    min(a2.dvce_created_tstamp) as a2_min_time
from (
    select country, platform, url_page, userid,
           min(dvce_created_tstamp) as a1_min_time
    from pageviews
    group by 1, 2, 3, 4) as a1
left outer join pageviews as a2 on a1.userid = a2.userid
    and a1.a1_min_time < a2.dvce_created_tstamp
    and a2.url_page <> a1.url_page
group by 1, 2, 3, 4, 5, 6
I am aware of the pandas merge command; however, in our case we have a composite join clause that also includes an inequality. I haven't found any documentation on how to handle this case.
Of course, as a last resort I could iterate through the DataFrames, but I do not think that would be the most efficient way to do it.
For example, here is some sample input data:
| country | platform | url_page | userid | dvce_created_tstamp |
|---------|----------|----------|--------|---------------------|
| gr      | win      | a        | bar    | 2019-01-01 00:00:00 |
| gr      | win      | b        | bar    | 2019-01-01 00:01:00 |
| gr      | win      | a        | bar    | 2019-01-01 00:02:00 |
| gr      | win      | a        | foo    | 2019-01-01 00:00:00 |
| gr      | win      | a        | foo    | 2019-01-01 01:00:00 |
The response from SQL:
When I use the DataFrame left merge command, I get the following output:
It is obvious that we are missing the rows with a null a2_url_page.
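One pattern that handles a composite join clause with an inequality in pandas is to merge on the equality key only, filter the merged frame with boolean conditions, aggregate, and then left-merge back onto the aggregated a1 frame so unmatched rows are kept. A rough sketch of that idea (not a verified translation of the query above; it assumes pageviews is a DataFrame whose dvce_created_tstamp column is already a datetime):

import pandas as pd

# a1: first pageview time per (country, platform, url_page, userid)
a1 = (pageviews
      .groupby(['country', 'platform', 'url_page', 'userid'], as_index=False)
      .agg(a1_min_time=('dvce_created_tstamp', 'min')))

# Join on the equality key only, then apply the inequality conditions as filters
pairs = a1.merge(pageviews[['userid', 'url_page', 'dvce_created_tstamp']],
                 on='userid', suffixes=('', '_a2'))
pairs = pairs[(pairs['dvce_created_tstamp'] > pairs['a1_min_time']) &
              (pairs['url_page_a2'] != pairs['url_page'])]

# Aggregate like the outer query, then left-merge back onto a1 so that rows
# with no qualifying a2 pageview survive with NaN (the LEFT OUTER JOIN part)
agg = (pairs
       .groupby(['country', 'platform', 'url_page', 'url_page_a2',
                 'userid', 'a1_min_time'], as_index=False)
       .agg(a2_min_time=('dvce_created_tstamp', 'min')))
result = a1.merge(agg,
                  on=['country', 'platform', 'url_page', 'userid', 'a1_min_time'],
                  how='left')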
I have two pandas DataFrames. df1 is 2 years of time series data recorded hourly for 20,000+ users, and it looks something like this:
| TimeStamp           | UserID1 | UserID2 | ... | UserID20000 |
|---------------------|---------|---------|-----|-------------|
| 2017-01-01 00:00:00 | 1.5     | 22.5    | ... | 5.5         |
| 2017-01-01 01:00:00 | 4.5     | 3.2     | ... | 9.12        |
| ...                 | ...     | ...     | ... | ...         |
| 2019-12-31 22:00:00 | 4.2     | 7.6     | ... | 8.9         |
| 2019-12-31 23:00:00 | 3.2     | 0.9     | ... | 11.2        |
df2 is ~ 20 attributes for each of the users and looks something like this:
| User        | Attribute1 | Attribute2 | ... | Attribute20 |
|-------------|------------|------------|-----|-------------|
| UserID1     | yellow     | big        | ... | 450         |
| UserID2     | red        | small      | ... | 6500        |
| ...         | ...        | ...        | ... | ...         |
| UserID20000 | yellow     | small      | ... | 950         |
I would like to create a Plotly Dash app with callbacks where a user can specify attribute values or ranges of values (e.g. Attribute1 == 'yellow', or Attribute20 < 1000 AND Attribute20 > 500) to create line graphs of the time series data for only the users that meet the specified attribute criteria.
I'm new to Plotly, but I'm able to create static plots with matplotlib by filtering df2 based on the attributes I want, making a list of the User IDs after filtering, and reindexing df1 with the list of filtered User IDs:
filtered_users = df2.loc[df2['Attribute1'] == 'yellow', 'User'].to_list()
df1 = df1.reindex(filtered_users, axis=1)
While this works, I'm not sure if the code is that efficient, and I'd like to be able to explore the data interactively, hence the move to Plotly.
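For the interactive part, here is a minimal Dash sketch under the same assumptions (df1 with a TimeStamp index and one column per user, df2 with a 'User' column and attribute columns; the single Attribute1 dropdown and the component ids are only illustrative, and a recent Dash 2.x is assumed):

import plotly.express as px
from dash import Dash, dcc, html, Input, Output

# df1: time series with a TimeStamp index, one column per user (assumed loaded)
# df2: one row per user with 'User' and attribute columns (assumed loaded)
app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id='attr1-value',
                 options=sorted(df2['Attribute1'].unique()),
                 value='yellow'),
    dcc.Graph(id='ts-graph'),
])

@app.callback(Output('ts-graph', 'figure'), Input('attr1-value', 'value'))
def update_graph(attr1_value):
    # Same filter-then-reindex idea as the matplotlib version above
    users = df2.loc[df2['Attribute1'] == attr1_value, 'User'].to_list()
    return px.line(df1.reindex(users, axis=1))

if __name__ == '__main__':
    app.run(debug=True)

The callback reuses the same filter-then-reindex idea as the matplotlib snippet, so supporting more attributes or ranges is mostly a matter of adding more inputs to the callback.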
I'm trying to find a trend in attendance. I filtered my existing df down to this so I can look at one activity at a time.
+---+-----------+-------+----------+-------+---------+
| | Date | Org | Activity | Hours | Weekday |
+---+-----------+-------+----------+-------+---------+
| 0 | 8/3/2020 | Org 1 | Gen Ab | 10.5 | Monday |
| 1 | 8/25/2020 | Org 1 | Gen Ab | 2 | Tuesday |
| 3 | 8/31/2020 | Org 1 | Gen Ab | 8.5 | Monday |
| 7 | 8/10/2020 | Org 2 | Gen Ab | 1 | Monday |
| 8 | 8/14/2020 | Org 3 | Gen Ab | 3.5 | Friday |
+---+-----------+-------+----------+-------+---------+
This code:
gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]
sum_gen_ab = gen_ab.groupby(['Date', 'Activity']).sum()
sum_gen_ab.head()
Returns this:
+------------+----------+------------+
| | | Hours |
+------------+----------+------------+
| Date | Activity | |
| 06/01/2020 | Gen Ab | 347.250000 |
| 06/02/2020 | Gen Ab | 286.266667 |
| 06/03/2020 | Gen Ab | 169.583333 |
| 06/04/2020 | Gen Ab | 312.633333 |
| 06/05/2020 | Gen Ab | 317.566667 |
+------------+----------+------------+
How do I make the summed column name 'Hours'? I still get the same result when I do this:
sum_gen_ab['Hours'] = gen_ab.groupby(['Date', 'Activity']).sum()
What I eventually want to do is have a line graph that shows the sum of hours for the activity over time. The time of course would be the dates in my df.
plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()
returns KeyError: Date
Once you've used groupby(['Date', 'Activity']), Date and Activity have been turned into index levels and can no longer be referenced with sum_gen_ab['Date'].
To avoid transforming them to indices you can use groupby(['Date', 'Activity'], as_index=False) instead.
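For example, a small sketch reusing gen_ab from the question (the pd.to_datetime call is only there so the x-axis sorts chronologically):

import matplotlib.pyplot as plt
import pandas as pd

# Parse the dates so the x-axis sorts chronologically
gen_ab = gen_ab.assign(Date=pd.to_datetime(gen_ab['Date']))
sum_gen_ab = gen_ab.groupby(['Date', 'Activity'], as_index=False)['Hours'].sum()

# 'Date' and 'Hours' are ordinary columns now, so this plots without a KeyError
plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()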
I typically use the pandasql library to reshape my DataFrames into different datasets. It lets you manipulate a pandas DataFrame with SQL code and can be used alongside regular pandas.
EXAMPLE:
import pandas as pd
import pandasql as psql

df = ...  # your DataFrame with Date, Activity and Hours columns

# Pass locals() so pandasql can see the df variable
new_dataset = psql.sqldf('''
    SELECT Date, Activity, SUM(Hours) AS Sum_Of_Hours
    FROM df
    GROUP BY Date, Activity''', locals())

new_dataset.head()  # shows the first 5 rows of the result
I have two dataframes. One contains a list of the most recent meeting for each customer. The second is a list of statuses that each customer has been recorded with, and their start date and end date.
I want to look up a customer and meeting date, and find out what status they were at when the meeting occurred.
What I think this will involve is creating a new column in my meeting dataframe that checks the rows of the statuses dataframe for a matching customer ID, then checks if the date from the first dataframe is between two dates in the second. If it is, the calculated column will take its value from the second dataframe's status column.
My dataframes are:
meeting
| CustomerID | MeetingDate |
|------------|-------------|
| 70704 | 2019-07-23 |
| 70916 | 2019-09-04 |
| 72712 | 2019-04-16 |
statuses
| CustomerID | Status | StartDate | EndDate |
|------------|--------|------------|------------|
| 70704 | First | 2019-04-01 | 2019-06-30 |
| 70704 | Second | 2019-07-01 | 2019-08-25 |
| 70916 | First | 2019-09-01 | 2019-10-13 |
| 72712 | First | 2019-03-15 | 2019-05-02 |
So, I think I want to take meeting.CustomerID and find a match in statuses.CustomerID, then check whether meeting.MeetingDate falls between statuses.StartDate and statuses.EndDate. If it does, I want to return statuses.Status from the matching row; if not, I want to ignore that row, move on to the next one, and apply the same check.
The final result should look like:
| CustomerID | MeetingDate | Status |
|------------|-------------|--------|
| 70704 | 2019-07-23 | Second |
| 70916 | 2019-09-04 | First |
| 72712 | 2019-04-16 | First |
I'm certain there must be a neater and more streamlined way to do this than what I've suggested, but I'm still learning the ins and outs of python and pandas and would appreciate if someone could point me in the right direction.
This should work. It assumes the rows are already sorted by CustomerID and StartDate (if they are not, sorting them first is easy) and that your dates are already a datetime type. Here, df2 refers to the dataframe whose columns are CustomerID, Status, StartDate, and EndDate.
import numpy as np
import pandas as pd

# Reverse df2 so each customer's most recent status row comes first, then
# np.unique(..., return_index=True) picks that first (i.e. latest) occurrence
df2 = df2[::-1]
row_arr = np.unique(df2.CustomerID, return_index=True)[1]
df2 = df2.iloc[row_arr, :].drop(['StartDate', 'EndDate'], axis=1)
# df1 is the meetings frame; attach each customer's latest status to it
final = pd.merge(df1, df2, how='inner', on='CustomerID')
I managed to wrangle something that works for me:
df = statuses.merge(meetings, on='CustomerID')
df = df[(df['MeetingDate'] >= df['StartDate']) & (df['MeetingDate'] <= df['EndDate'])].reset_index(drop=True)
Gives:
| CustomerID | Status | StartDate | EndDate | MeetingDate |
|------------|--------|------------|------------|-------------|
| 70704 | Second | 2019-01-21 | 2019-07-28 | 2019-07-23 |
| 70916 | First | 2019-09-04 | 2019-10-21 | 2019-09-04 |
| 72712 | First | 2019-03-19 | 2019-04-17 | 2019-04-16 |
And I can just drop the now unneeded columns.
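For example, a quick sketch of that last step, reusing the merged df from above:

# Keep only the columns needed for the final result
df = df[['CustomerID', 'MeetingDate', 'Status']]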
Conversions
| user_id | tag    | timestamp           |
|---------|--------|---------------------|
| 1       | click1 | 2016-11-01 01:20:39 |
| 2       | click2 | 2016-11-01 09:48:10 |
| 3       | click1 | 2016-11-04 14:27:22 |
| 4       | click4 | 2016-11-05 17:50:14 |
User Sessions
| user_id | utm_campaign | session_start       |
|---------|--------------|---------------------|
| 1       | outbrain_2   | 2016-11-01 00:15:34 |
| 1       | email        | 2016-11-01 01:00:29 |
| 2       | google_1     | 2016-11-01 08:24:39 |
| 3       | google_4     | 2016-11-04 14:25:06 |
| 4       | google_1     | 2016-11-05 17:43:02 |
Given the 2 tables above, I want to map each conversion event to the most recent campaign that brought a particular user to a site (aka last touch/last click attribution).
The desired output is a table of the format:
| user_id | tag    | timestamp           | campaign |
|---------|--------|---------------------|----------|
| 1       | click1 | 2016-11-01 01:20:39 | email    |
| 2       | click2 | 2016-11-01 09:48:10 | google_1 |
| 3       | click1 | 2016-11-04 14:27:22 | google_4 |
| 4       | click4 | 2016-11-05 17:50:14 | google_1 |
Note how user 1 visited the site via the outbrain_2 campaign and then came back to the site via the email campaign. Sometime during the user's second visit, they converted, thus the conversion should be attributed to email and not outbrain_2.
Is there a way to do this in MySQL or Python?
You can do this in Python with pandas. I assume you can load the data from the MySQL tables into pandas DataFrames conversions and sessions. First, concatenate both tables:
import numpy as np
import pandas as pd

events = pd.concat([conversions, sessions])  # stack conversions and sessions
Some of the elements in the new frame will be NAs. Create a new column that collects the time stamps from both tables:
all["ts"] = np.where(all["session_start"].isnull(),
all["timestamp"],
all["session_start"])
Sort by this column, forward-fill the missing values, group by the user ID, and select the last (most recent) row from each group:
groups = events.sort_values("ts").ffill().groupby("user_id", as_index=False).last()
Select the right columns:
result = groups[["user_id", "tag", "timestamp", "utm_campaign"]]
I tried this code with your sample data and got the right answer.
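An alternative sketch uses pandas.merge_asof, which is built for this kind of "most recent earlier row" lookup (it assumes the same conversions and sessions DataFrames, with the time columns converted to datetimes):

import pandas as pd

conversions['timestamp'] = pd.to_datetime(conversions['timestamp'])
sessions['session_start'] = pd.to_datetime(sessions['session_start'])

# Both frames must be sorted on their time keys for merge_asof
attributed = pd.merge_asof(
    conversions.sort_values('timestamp'),
    sessions.sort_values('session_start'),
    left_on='timestamp',
    right_on='session_start',
    by='user_id',
    direction='backward',  # most recent session at or before the conversion
)[['user_id', 'tag', 'timestamp', 'utm_campaign']]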
I come from a SPSS background and I want to declare missing values in a Pandas DataFrame.
Consider the following dataset from a Likert Scale:
SELECT COUNT(*),v_6 FROM datatable GROUP BY v_6;
+----------+------+
| COUNT(*) | v_6  |
+----------+------+
|     1268 | NULL |
|        2 |  -77 |
|     3186 |    1 |
|     2700 |    2 |
|      512 |    3 |
|       71 |    4 |
|       17 |    5 |
|       14 |    6 |
+----------+------+
I have a DataFrame
pdf = psql.frame_query('SELECT * FROM datatable', con)
The null values are already declared as NaN - now I want -77 also to be a missing value.
In SPSS I am used to:
MISSING VALUES v_6 (-77).
Now I am looking for the pandas counterpart.
I have read:
http://pandas.pydata.org/pandas-docs/stable/missing_data.html
but I honestly do not see how the approach proposed there applies to my case.
Use pandas.Series.replace():
import numpy as np
df['v_6'] = df['v_6'].replace(-77, np.nan)
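If several codes should count as missing (for example -77 and -99), replace() also accepts a list; a small sketch assuming both codes:

df['v_6'] = df['v_6'].replace([-77, -99], np.nan)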