Last Touch Attribution in MySQL - python

Conversions
user_id | tag | timestamp
|--------- |-------- |---------------------|
| 1 | click1 | 2016-11-01 01:20:39 |
| 2 | click2 | 2016-11-01 09:48:10 |
| 3 | click1 | 2016-11-04 14:27:22 |
| 4 | click4 | 2016-11-05 17:50:14 |
User Sessions
user_id | utm_campaign | session_start
|--------- |--------------- |---------------------|
| 1 | outbrain_2 | 2016-11-01 00:15:34 |
| 1 | email | 2016-11-01 01:00:29 |
| 2 | google_1 | 2016-11-01 08:24:39 |
| 3 | google_4 | 2016-11-04 14:25:06 |
| 4 | google_1 | 2016-11-05 17:43:02 |
Given the 2 tables above, I want to map each conversion event to the most recent campaign that brought a particular user to a site (aka last touch/last click attribution).
The desired output is a table of the format:
user_id | tag | timestamp | campaign
|--------- |-------- |---------------------|-----------
| 1 | click1 | 2016-11-01 01:20:39 | email
| 2 | click2 | 2016-11-01 09:48:10 | google_1
| 3 | click1 | 2016-11-04 14:27:22 | google_4
| 4 | click4 | 2016-11-05 17:50:14 | google_1
Note how user 1 visited the site via the outbrain_2 campaign and then came back to the site via the email campaign. Sometime during the user's second visit, they converted, thus the conversion should be attributed to email and not outbrain_2.
Is there a way to do this in MySQL or Python?

You can do this in Python with Pandas. I assume you can load the data from MySQL tables to Pandas dataframes conversions and sessions. First, concatenate both tables:
all = pd.concat([conversions,sessions])
Some of the elements in the new frame will be NAs. Create a new column that collects the time stamps from both tables:
all["ts"] = np.where(all["session_start"].isnull(),
all["timestamp"],
all["session_start"])
Sort by this column, forward fill the time values, group by the user ID, and select the last (most recent) row from each group:
groups = all.sort_values("ts").ffill().groupby("user_id",as_index=False).last()
Select the right columns:
result = groups[["user_id", "tag", "timestamp", "utm_campaign"]]
I tried this code with your sample data and got the right answer.

Related

Reference a Many-To-Many row

I am dealing with the design of a database in Flask connected to Postgresql. I have 2 Tables Reservation and Device which are related through a many-to-many relationship Table ReservationItem as follows:
| Reservation | | Device | | ReservationItem |
| ----------- | | ------ | | --------------- |
| id_res | | id_dev | | res_id (FK/PK) |
| etc... | | etc.. | | dev_id (FK/PK) |
| created_at |
| status |
Where dev_id and res_id are foreign keys and make up the composite primary key for the table. The columns created_at and status where originally conceived to track the history of the development of each Reservation-Device status.
Example
Someone reserves 3 Devices (respectively with id_dev's 1 - 2 - 3) on the 1st of January 2021 hence I would create 1 Reservation entry (id_res 1) and 3 ReservationItem entry with status "booked".
ReservationItem
| --------------------------------------|
| res_id | dev_id | created_at | status |
| ------------------------------------- |
| 1 | 1 | 2021-01-01 | booked |
| 1 | 2 | 2021-01-01 | booked |
| 1 | 3 | 2021-01-01 | booked |
On the 2nd of January the client returns the Device.id = 1 so I would create a fourth entry in the ReservationItem Table where the only updated fields are created_at and status, so that I could track where the devices are.
| --------------------------------------- |
| res_id | dev_id | created_at | status |
| --------------------------------------- |
| 1 | 1 | 2021-01-01 | booked |
| ... | ... | ... | ... |
| 1 | 1 | 2021-01-02 | returned |
Which basically weaken the uniqueness of the composite key (res_id,dev_id).
So I thought: Should I created another table lets say History to track these updates?
These would be my new models...
| ReservationItem | | History |
| --------------- | | ------------- |
| id_assoc (PK) | | id_hist (PK) |
| res_id (FK) | | assoc_id (FK) |
| dev_id (FK) | | created_at |
| | | status |
I would change the ReservationItem Table so that res_id are dev_id are not primary keys anymore. I would move the created_at and status into the History table and I would add the column id_assoc and use it as primary key, so that I can reference it from the History table.
I've been looking around and it seems that using one column as primary key in a many to many relationship is not ideal.
How would you design the relationships otherwise?
Is there any tool that Flask offers?
EDIT
After reading this post, which suggests to audit database table and write logs to track changed entries (or operations on databases) I found this article which suggests how to implement audit logs in Flask. But why wouldn't my solution work (or lets say "isn't ideal")?
thank you!

Where am I going wrong when analyzing this data?

Trying to find a trend in attendance. I filtered my existing df to this so I can look at 1 activity at a time.
+---+-----------+-------+----------+-------+---------+
| | Date | Org | Activity | Hours | Weekday |
+---+-----------+-------+----------+-------+---------+
| 0 | 8/3/2020 | Org 1 | Gen Ab | 10.5 | Monday |
| 1 | 8/25/2020 | Org 1 | Gen Ab | 2 | Tuesday |
| 3 | 8/31/2020 | Org 1 | Gen Ab | 8.5 | Monday |
| 7 | 8/10/2020 | Org 2 | Gen Ab | 1 | Monday |
| 8 | 8/14/2020 | Org 3 | Gen Ab | 3.5 | Friday |
+---+-----------+-------+----------+-------+---------+
This code:
gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]
sum_gen_ab = gen_ab.groupby(['Date', 'Activity']).sum()
sum_gen_ab.head()
Returns this:
+------------+----------+------------+
| | | Hours |
+------------+----------+------------+
| Date | Activity | |
| 06/01/2020 | Gen Ab | 347.250000 |
| 06/02/2020 | Gen Ab | 286.266667 |
| 06/03/2020 | Gen Ab | 169.583333 |
| 06/04/2020 | Gen Ab | 312.633333 |
| 06/05/2020 | Gen Ab | 317.566667 |
+------------+----------+------------+
How do I make the summed column name 'Hours'? I still get the same result when I do this:
sum_gen_ab['Hours'] = gen_ab.groupby(['Date', 'Activity']).sum()
What I eventually want to do is have a line graph that shows the sum of hours for the activity over time. The time of course would be the dates in my df.
plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()
returns KeyError: Date
Once you've used groupby(['Date', 'Activity']) Date and Activity have been transformed to indices and can't be referenced with sum_gen_ab['Date'].
To avoid transforming them to indices you can use groupby(['Date', 'Activity'], as_index=False) instead.
I will typically use the pandasql library to manipulate my data frames into different datasets. This allows you to manipulate your pandas data frame with SQL code. Pandasql can be used alongside pandas.
EXAMPLE:
import pandas as pd
import pandasql as psql
df = "will be your dataset"
new_dataset = psql.sqldf('''
SELECT DATE, ACTIVITY, SUM(HOURS) as SUM_OF_HOURS
FROM df
GROUP BY DATE, ACTIVITY''')
new_dataset.head() #Shows the first 5 rows of your dataset

Sqlalchemy many to one array response

Im working with SQLAlchemy and Flask. I have a content table like:
--------------------------------------------
| id | title | description |
--------------------------------------------
| 1 | example | my content |
| 2 | another piece| my other content|
--------------------------------------------
And a status table like this:
--------------------------------------------------------
| id | content_id | status type | date |
--------------------------------------------------------
| 1 | 1 | written | 1/5/2020 |
| 2 | 1 | edited | 1/7/2020 |
--------------------------------------------------------
I want to be able to query the db and get a content with all of the status's in one row instead of have multiple rows of the content repeated. For example I want:
----------------------------------------------------------
| id | title | description | status's |
----------------------------------------------------------
| 1 | example | my content | [1,2] |
----------------------------------------------------------
Is there a way to do this with sqlalchemy?
You can use this query for fetching your answer:
SELECT b.*,
(SELECT GROUP_CONCAT (id) FROM status_table
WHERE content_id = b.id) AS `status's`
FROM status_table a JOIN content_table b
ON a.content_id = b.id
GROUP BY a.content_id;

Python - Pandas - Converting column with specific subsets into rows

I have a dataframe that looks like this below with Date, Price and Serial.
+----------+--------+--------+
| Date | Price | Serial |
+----------+--------+--------+
| 2/1/1996 | 0.5909 | 1 |
| 2/1/1996 | 0.5711 | 2 |
| 2/1/1996 | 0.5845 | 3 |
| 3/1/1996 | 0.5874 | 1 |
| 3/1/1996 | 0.5695 | 2 |
| 3/1/1996 | 0.584 | 3 |
+----------+--------+--------+
I will like to make it look like this where the serial becomes the column name and the data sorts itself into the correct date row as well as Serial column.
+----------+--------+--------+--------+
| Date | 1 | 2 | 3 |
+----------+--------+--------+--------+
| 2/1/1996 | 0.5909 | 0.5711 | 0.5845 |
| 3/1/1996 | 0.5874 | 0.5695 | 0.584 |
+----------+--------+--------+--------+
I understand I can do this via a loop but just wondering if there is a more efficient way to do this?
Thanks for your kind help. Also curious if there is a better way to paste such tables rather than attaching images in my questions =x
You can use pandas.pivot_table:
res = df.pivot_table(index='Date', columns='Serial', values='Price', aggfunc=np.sum)\
.reset_index()
res.columns.name = ''
Date 1 2 3
0 2/1/1996 0.5909 0.5711 0.5845
1 3/1/1996 0.5874 0.5695 0.5840

BigQuery streaming insertAll appears to lose data - why?

Im trying to use the streaming insert_all method to insert data to a table using the google-api-client gem in ruby.
So I start with creating a new table in Bigquery (read and write priveleges are correct)
with the following contents:
+-----+-----------+-------------+
| Row | person_id | person_name |
+-----+-----------+-------------+
| 1 | 1 | ABCD |
| 2 | 2 | EFGH |
| 3 | 3 | IJKL |
+-----+-----------+-------------+
This is my code in ruby: (I discovered earlier today that tabledata.insert_all is ruby for tabledata.insertAll - google docs / example need to be updated)
def streaming_insert_data_in_table(table, dataset=DATASET)
body = {"rows"=>[
{"json"=> {"person_id"=>10,"person_name"=>"george"}},
{"json"=> {"person_id"=>11,"person_name"=>"washington"}}
]}
result = #client.execute(
:api_method=> #bigquery.tabledata.insert_all,
:parameters=> {
:projectId=> #project_id.to_s,
:datasetId=> dataset,
:tableId=>table},
:body_object=>body,
)
puts result.body
end
So I run my code the first time and all appears fine. I see this in the table on Bigquery:
+-----+-----------+-------------+
| Row | person_id | person_name |
+-----+-----------+-------------+
| 1 | 1 | ABCD |
| 2 | 2 | EFGH |
| 3 | 3 | IJKL |
| 4 | 10 | george |
| 5 | 11 | washington |
+-----+-----------+-------------+
Then I change the data in the method to:
body = {"rows"=>[
{"json"=> {"person_id"=>5,"person_name"=>"john"}},
{"json"=> {"person_id"=>6,"person_name"=>"kennedy"}}
]}
Run the method and get this in Bigquery:
+-----+-----------+-------------+
| Row | person_id | person_name |
+-----+-----------+-------------+
| 1 | 1 | ABCD |
| 2 | 2 | EFGH |
| 3 | 3 | IJKL |
| 4 | 10 | george |
| 5 | 6 | kennedy |
+-----+-----------+-------------+
So, what gives? I've lost data.... (ids 11 and id 5 have vanished) The responses for the request do not have errors either.
Could someone tell me if Im doing something incorrectly or why this is happening please?
Any help is much appreciated.
Thanks and have a great day.
Discovered this appears something to do with the ui (row count doesn't populate for a while and trying to extract the data in the table results in an error "Unexpected. Please try again."). However data is actually stored and can be queried. Thanks for the help Jordan

Categories