Count Rows from a separate dataframe with conditions - PySpark - python

I have an airline booking transaction dataframe with columns as customer_id, date_of_travel, dest_country. I have another dataframe with hotel booking transaction details with customer_id, date_of_booking, dest_country.
I need to bring a column into the airline booking dataframe that would be the count of all hotel bookings made by that customer_id in the same country within 1 month (+/- 30 days) of the date_of_travel.
I have added 2 additional columns to my airline booking dataset for Date_before_30_Days and Date_after_30_Days, but I am having trouble understanding how to get the counts of hotel bookings in the same country within 1 month of the airline travel.
Airline Data
Customer_id Country Date_of_Travel
xyz US 10-20-2018
abc MX 03-04-2018
xyz US 04-05-2019
Hotel Data
Customer_id Country Date_of_Booking
xyz CA 03-30-2018
xyz US 05-01-2018
xyz US 10-15-2018
abc PH 10-20-2018
abc MX 03-01-2018
Final Output
Customer_id Country Date_of_Travel Total_Hotel_Bookings
xyz US 10-20-2018 1
abc MX 03-04-2018 0
xyz CA 04-05-2019 1

The example you have provided has a different output than what you described. For example, customer id "abc" in country "MX" has a hotel booking on 1st March and a flight booking on 4th March in the airline data, which does fall within the (+/-) 30-day range.
Based on my understanding of your description, I have made the example below.
Airline Data
airline_data = pd.DataFrame({"Customer_Id":["xyz", "abc", "xyz"], "Country":["US", "MX", "CA"],"Date_Of_Travel":["10-20-2018", "03-04-2018", "04-05-2019"]})
print(airline_data)
Customer_Id Country Date_Of_Travel
xyz US 10-20-2018
abc MX 03-04-2018
xyz CA 04-05-2019
Hotel Data
Hotel_data = pd.DataFrame({"Customer_Id":["xyz","xyz", "xyz","abc", "abc"], "Country":["CA","US","US","PH","MX"],"Date_of_Booking":["03-30-2019", "05-01-2018", "10-15-2018", "10-20-2018", "03-01-2018"]})
print(Hotel_data)
Customer_Id Country Date_of_Booking
xyz CA 03-30-2019
xyz US 05-01-2018
xyz US 10-15-2018
abc PH 10-20-2018
abc MX 03-01-2018
Convert the date columns from strings to datetime objects:
airline_data["Date_Of_Travel"] = pd.to_datetime(airline_data["Date_Of_Travel"])
Hotel_data["Date_of_Booking"] = pd.to_datetime(Hotel_data["Date_of_Booking"])
Create two columns for the minimum and maximum dates, which are (+/-) 30 days from the travel date:
airline_data["Min_date"] = (airline_data["Date_Of_Travel"]) - pd.Timedelta(days=30)
airline_data["Max_date"] = (airline_data["Date_Of_Travel"]) + pd.Timedelta(days=30)
Join the two dataframes on Customer_Id and Country to get the required data:
df_1 = pd.merge(airline_data, Hotel_data, on=["Customer_Id", "Country"],how="left")
print(df_1)
Customer_Id Country Date_Of_Travel Min_date Max_date Date_of_Booking
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-05-01
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-10-15
abc MX 2018-03-04 2018-02-02 2018-04-03 2018-03-01
xyz CA 2019-04-05 2019-03-06 2019-05-05 2019-03-30
Create a column that flags whether the booking falls within +/- 30 days of the travel date:
df_1["Bool"] = (df_1.Date_of_Booking >= df_1.Min_date) & (df_1.Date_of_Booking <= df_1.Max_date)
df_1["Bool"] = df_1["Bool"].apply(int)
print(df_1)
Customer_Id Country Date_Of_Travel Min_date Max_date Date_of_Booking Bool
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-05-01 0
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-10-15 1
abc MX 2018-03-04 2018-02-02 2018-04-03 2018-03-01 1
xyz CA 2019-04-05 2019-03-06 2019-05-05 2019-03-30 1
Now apply groupby on df_1 and sum the Bool column to get the total bookings for each customer in each country:
Req_Results = df_1.groupby(["Customer_Id","Country","Date_Of_Travel","Date_of_Booking"]).sum().reset_index()
Req_Results = Req_Results[Req_Results.Bool!=0]
Req_Results.rename(columns={"Bool":"Total_Hotel_Bookings"},inplace=True)
print(Req_Results)
Customer_Id Country Date_Of_Travel Date_of_Booking Total_Hotel_Bookings
abc MX 2018-03-04 2018-03-01 1
xyz CA 2019-04-05 2019-03-30 1
xyz US 2018-10-20 2018-10-15 1
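Since the question is tagged PySpark, here is a rough Spark equivalent of the same idea (only a sketch: it assumes airline_df and hotel_df are Spark DataFrames with the column names from the question's sample data and that the date columns are already of date type):
from pyspark.sql import functions as F

result = (
    airline_df.alias("a")
    .join(
        hotel_df.alias("h"),
        (F.col("a.Customer_id") == F.col("h.Customer_id"))
        & (F.col("a.Country") == F.col("h.Country"))
        & (F.datediff(F.col("h.Date_of_Booking"), F.col("a.Date_of_Travel")).between(-30, 30)),
        "left",
    )
    # count() skips the nulls produced by the left join, so unmatched rows get 0
    .groupBy("a.Customer_id", "a.Country", "a.Date_of_Travel")
    .agg(F.count("h.Date_of_Booking").alias("Total_Hotel_Bookings"))
)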

For each row of airline_data you have to create filters on hotel_data:
filter_id = hotel_data["Customer_id"] == desired_id
filter_country = hotel_data["Country"] == desired_country
filter_date = (hotel_data["Date_of_Booking"] > start_date) & (hotel_data["Date_of_Booking"] < end_date)
then apply the filters and count the results:
filtered_df = hotel_data[filter_id & filter_country & filter_date]
total_bookings = len(filtered_df)
Of course you can do this with a loop: append the number of bookings to a list and, at the end, add that list as a new column of airline_data, as sketched below.
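Spelled out as a loop, a minimal sketch (it assumes the column names from the question's sample data and that the date columns are already datetimes):
import pandas as pd

counts = []
for _, row in airline_data.iterrows():
    start_date = row["Date_of_Travel"] - pd.Timedelta(days=30)
    end_date = row["Date_of_Travel"] + pd.Timedelta(days=30)
    match = (
        (hotel_data["Customer_id"] == row["Customer_id"])
        & (hotel_data["Country"] == row["Country"])
        & (hotel_data["Date_of_Booking"] >= start_date)
        & (hotel_data["Date_of_Booking"] <= end_date)
    )
    counts.append(int(match.sum()))  # number of hotel bookings within the window
airline_data["Total_Hotel_Bookings"] = counts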
Does this help?

Related

Pandas: Groupby and sum customer profit, for every 6 months, starting from users first transaction

I have a dataset like this:
Customer ID Date Profit
1 4/13/2018 10.00
1 4/26/2018 13.27
1 10/23/2018 15.00
2 1/1/2017 7.39
2 7/5/2017 9.99
2 7/7/2017 10.01
3 5/4/2019 30.30
I'd like to groupby and sum profit, for every 6 months, starting at each user's first transaction.
The output ideally should look like this:
Customer ID Date Profit
1 4/13/2018 23.27
1 10/13/2018 15.00
2 1/1/2017 7.39
2 7/1/2017 20.00
3 5/4/2019 30.30
The closest I've seemed to get on this problem is by using:
df.groupby(['Customer ID',pd.Grouper(key='Date', freq='6M', closed='left')])['Profit'].sum().reset_index()
But that doesn't seem to sum starting on a user's first transaction day.
If changing the dates is not possible (e.g. customer 2's date being 7/1/2017 rather than 7/5/2017), then at least summing the profit so that it's based on each user's own 6-month purchase journey would be extremely helpful. Thank you!
I can get you the first of the month until you find a more perfect solution.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = (
    df
    .set_index("Date")
    .groupby(["Customer ID"])
    .Profit
    .resample("6MS")
    .sum()
    .reset_index(name="Profit")
)
print(df)
Customer ID Date Profit
0 1 2018-04-01 23.27
1 1 2018-10-01 15.00
2 2 2017-01-01 7.39
3 2 2017-07-01 20.00
4 3 2019-05-01 30.30
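If the 6-month windows really do need to start on each customer's first transaction date rather than on the first of the month, one possible sketch (assuming df has the columns above; binning by whole calendar months is an approximation) is to count how many 6-month blocks have elapsed since each customer's first purchase and group on the block start:
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
first = df.groupby("Customer ID")["Date"].transform("min")
# number of whole 6-month blocks elapsed since the customer's first purchase, by calendar month
blocks = ((df["Date"].dt.year - first.dt.year) * 12
          + (df["Date"].dt.month - first.dt.month)) // 6
df["block_start"] = [start + pd.DateOffset(months=6 * int(b))
                     for start, b in zip(first, blocks)]
out = df.groupby(["Customer ID", "block_start"])["Profit"].sum().reset_index()
On the sample data this gives block starts of 4/13/2018 and 10/13/2018 for customer 1 and 1/1/2017 and 7/1/2017 for customer 2, matching the desired output; dates falling in the boundary month but before the anchor day would need extra handling.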

Pandas add multiple rows with IF condition

I have the following dataframe consisting of city bicycle trips. However, I have some problems handling trips that span more than one hour (I want to use YYYYmmDDhh as a composite key in my data model). What I want to do is create a column "keyhour" that I could connect with other tables. This would be YYYYmmDDhh based on started_at IF start_hour == end_hour. However, if end_hour is greater than start_hour, I want to insert that many rows with the same TourID into my dataframe, to indicate that the trip lasted several hours.
started_at ended_at duration start_station_id start_station_name start_station_description ... end_station_description end_station_latitude end_station_longitude TourID start_hour end_hour
0 2020-05-01 03:03:14.941000+00:00 2020-05-01 03:03:14.941000+00:00 635 484 Karenlyst allé ved Skabos vei ... langs Drammensveien 59.914145 10.715505 0 3 3
1 2020-05-01 03:05:48.529000+00:00 2020-05-01 03:05:48.529000+00:00 141 455 Sofienbergparken sør langs Sofienberggata ... ved Sars gate 59.921206 10.769989 1 3 3
2 2020-05-01 03:13:33.156000+00:00 2020-05-01 03:13:33.156000+00:00 330 550 Thereses gate ved Bislett trikkestopp ... ved Kristian IVs gate 59.914767 10.740971 2 3 3
3 2020-05-01 03:14:14.549000+00:00 2020-05-01 03:14:14.549000+00:00 479 597 Fredensborg ved rundkjøringen ... ved Oslo City 59.912334 10.752292 3 3 3
4 2020-05-01 03:20:12.355000+00:00 2020-05-01 03:20:12.355000+00:00 629 617 Bjerregaardsgate Øst ved Uelands gate ... langs Oslo gate 59.908255 10.767800 4 3 3
So for example if started_at = 2020-05-01 03:03:14.941000+00:00, ended_at = 2020-05-01 06:03:14.941000+00:00 , start_hour = 3, end_hour = 6 and TourID = 1, I want to have rows with:
keyhour ; TourID
2020050103 ;1
2020050104 ;1
2020050105 ;1
2020050106 ;1
And all other values (duration etc.) related to this trip id.
However, I really cannot find any way to do it in Pandas. Is it possible or do I have to use pure python to re-write my source csv?
Thank you for any advice!
Assuming your dataframe is df and that you have import pandas as pd
# convert to datetime and rounddown to hour
df['started_at'] = pd.to_datetime(df['started_at']).dt.floor(freq='H')
df['ended_at'] = pd.to_datetime(df['ended_at']).dt.floor(freq='H')
# this creates a list of hourly datetime ranges from started_at to ended_at
df['keyhour'] = df.apply(lambda x: list(pd.date_range(x['started_at'], x['ended_at'], freq="1H")), axis='columns')
# this expands each element of the keyhour list into its own row
df = df.explode('keyhour')
# converts it to a string in the format you specified; explode leaves the column as object dtype,
# so convert it back to datetime before using the .dt accessor
df['keyhour'] = pd.to_datetime(df['keyhour']).dt.strftime('%Y%m%d%H')
df
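As a quick sanity check against the example trip mentioned in the question (a minimal, self-contained sketch; the one-row frame is made up for illustration):
import pandas as pd

demo = pd.DataFrame({
    "started_at": ["2020-05-01 03:03:14.941000+00:00"],
    "ended_at": ["2020-05-01 06:03:14.941000+00:00"],
    "TourID": [1],
})
demo["started_at"] = pd.to_datetime(demo["started_at"]).dt.floor(freq="H")
demo["ended_at"] = pd.to_datetime(demo["ended_at"]).dt.floor(freq="H")
demo["keyhour"] = demo.apply(
    lambda x: list(pd.date_range(x["started_at"], x["ended_at"], freq="1H")),
    axis="columns")
demo = demo.explode("keyhour")
demo["keyhour"] = pd.to_datetime(demo["keyhour"]).dt.strftime("%Y%m%d%H")
# keyhour is now 2020050103, 2020050104, 2020050105, 2020050106, one row per hour, all with TourID 1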

Build dataframe with sequential timeseries

I have a dataset that contains many timestamps associated with different ships and ports.
obj_id timestamp port
0 4 2019-10-01 Houston
1 2 2019-09-01 New York
2 4 2019-07-31 Boston
3 1 2019-07-28 San Francisco
4 2 2019-10-15 Miami
5 1 2019-09-01 Honolulu
6 1 2019-08-01 Tokyo
I want to build a dataframe that contains a single record for the latest voyage by ship (obj_id), by assigning the latest timestamp/port for each obj_id as a 'destination', and the second latest timestamp/port as the 'origin'. So the final result would look something like this:
obj_id origin_time origin_port destination_time destination_port
0 4 2019-07-31 Boston 2019-10-01 Houston
1 2 2019-09-01 New York 2019-10-15 Miami
3 1 2019-07-28 Tokyo 2019-09-01 Honolulu
I've successfully filtered the latest timestamps for each obj_id with this code, but I still can't figure out a way to filter the second-latest timestamp, let alone pull them both into a single row.
df.sort_values(by ='timestamp', ascending = False).drop_duplicates(['obj_id'])
Using groupby.agg with first, last:
dfg = df.sort_values('timestamp').groupby('obj_id').agg(['first', 'last']).reset_index()
dfg.columns = [f'{c1}_{c2}' for c1, c2 in dfg.columns]
obj_id_ timestamp_first timestamp_last port_first port_last
0 1 2019-07-28 2019-09-01 San Francisco Honolulu
1 2 2019-09-01 2019-10-15 New York Miami
2 4 2019-07-31 2019-10-01 Boston Houston
You want to sort the trips by timestamp so we can get the most recent voyages, then group the voyages by object id and grab the first and second voyage per object, then merge.
groups = df.sort_values(by="timestamp", ascending=False).groupby("obj_id")
pd.merge(groups.nth(1), groups.nth(0),
         on="obj_id",
         suffixes=("_origin", "_dest"))
Make sure your timestamp column is the proper timestamp data type though, otherwise your sorting will be messed up.
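Putting that together, a small follow-up sketch (assuming the merge above produces suffixed timestamp/port columns) that converts the timestamp first and renames the result to match the desired output:
df["timestamp"] = pd.to_datetime(df["timestamp"])  # proper dtype so the sort is chronological
groups = df.sort_values(by="timestamp", ascending=False).groupby("obj_id")
result = pd.merge(groups.nth(1), groups.nth(0), on="obj_id",
                  suffixes=("_origin", "_dest"))
result = result.rename(columns={
    "timestamp_origin": "origin_time", "port_origin": "origin_port",
    "timestamp_dest": "destination_time", "port_dest": "destination_port",
})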

Python selecting row from second dataframe based on complex criteria

I have two dataframes, one with some purchasing data, and one with a weekly calendar, e.g.
df1:
purchased_at product_id cost
01-01-2017 1 £10
01-01-2017 2 £8
09-01-2017 1 £10
18-01-2017 3 £12
df2:
week_no week_start week_end
1 31-12-2016 06-01-2017
2 07-01-2017 13-01-2017
3 14-01-2017 20-01-2017
I want to use data from the two to add a 'week_no' column to df1, which is selected from df2 based on where the 'purchased_at' date in df1 falls between the 'week_start' and 'week_end' dates in df2, i.e.
df1:
purchased_at product_id cost week_no
01-01-2017 1 £10 1
01-01-2017 2 £8 1
09-01-2017 1 £10 2
18-01-2017 3 £12 3
I've searched but I've not been able to find an example where the data is being pulled from a second dataframe using comparisons between the two, and I've been unable to correctly apply any examples I've found, e.g.
df1.loc[(df1['purchased_at'] < df2['week_end']) &
        (df1['purchased_at'] > df2['week_start']), df2['week_no']]
was unsuccessful, with the ValueError 'can only compare identically-labeled Series objects'
Could anyone help with this problem? I'm also open to suggestions if there is a better way to achieve the same outcome.
edit to add further detail of df1
df1 full dataframe headers
purchased_at purchase_id product_id product_name transaction_id account_number cost
01-01-2017 1 1 A 1 AA001 £10
01-01-2017 2 2 B 1 AA001 £8
02-01-2017 3 1 A 2 AA008 £10
03-01-2017 4 3 C 3 AB040 £12
...
09-01-2017 12 1 A 10 AB102 £10
09-01-2017 13 2 B 11 AB102 £8
...
18-01-2017 20 3 C 15 AA001 £12
So the purchase_id increases incrementally with each row, the product_id and product_name have a 1:1 relationship, the transaction_id also increases incrementally, but there can be multiple purchases within a transaction.
If your dataframes are not too big you can use this trick.
Do a full cartesian product join of all records to all records:
df_out = pd.merge(df1.assign(key=1),df2.assign(key=1),on='key')
Next, filter out those records that do not match the criteria, in this case where purchased_at is not between week_start and week_end:
(df_out.query('week_start < purchased_at < week_end')
       .drop(['key','week_start','week_end'], axis=1))
Output:
purchased_at product_id cost week_no
0 2017-01-01 1 £10 1
3 2017-01-01 2 £8 1
7 2017-01-09 1 £10 2
11 2017-01-18 3 £12 3
If you do have large dataframes then you can use this numpy method as proposed by PiRSquared.
import numpy as np

a = df1.purchased_at.values
bh = df2.week_end.values
bl = df2.week_start.values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.DataFrame(
    np.column_stack([df1.values[i], df2.values[j]]),
    columns=df1.columns.append(df2.columns)
).drop(['week_start', 'week_end'], axis=1)
Output:
purchased_at product_id cost week_no
0 2017-01-01 00:00:00 1 £10 1
1 2017-01-01 00:00:00 2 £8 1
2 2017-01-09 00:00:00 1 £10 2
3 2017-01-18 00:00:00 3 £12 3
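Another option, not shown in the answers above but worth a mention (a sketch assuming the date columns are already datetimes and the weeks do not overlap), is to build an IntervalIndex over the weeks and look each purchase date up directly:
intervals = pd.IntervalIndex.from_arrays(df2["week_start"], df2["week_end"], closed="both")
idx = intervals.get_indexer(df1["purchased_at"])  # -1 where a purchase falls in no week
df1["week_no"] = pd.Series(df2["week_no"].values[idx], index=df1.index).where(idx >= 0)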
You could just use strftime() to extract the week number from the date. If you want to keep counting the weeks upwards, you need to define a "zero year" as the start of your time series and offset the week_no accordingly:
import pandas as pd
data = {'purchased_at': ['01-01-2017', '01-01-2017', '09-01-2017', '18-01-2017'], 'product_id': [1,2,1,3], 'cost':['£10', '£8', '£10', '£12']}
df = pd.DataFrame(data, columns=['purchased_at', 'product_id', 'cost'])
def getWeekNo(date, year0):
    datetime = pd.to_datetime(date, dayfirst=True)
    year = int(datetime.strftime('%Y'))
    weekNo = int(datetime.strftime('%U'))
    return weekNo + 52*(year-year0)
df['week_no'] = df.purchased_at.apply(lambda x: getWeekNo(x, 2017))
Here, I use pd.to_datetime() to convert the date string from df into a datetime object. strftime('%Y') returns the year and strftime('%U') the week (with the first week of a year starting on its first Sunday; if weeks should start on Monday, use '%W' instead).
This way, you don't need to maintain a separate DataFrame only for week numbers.

for loop for multiple values of column using python

I got stuck defining a loop over the values of a column.
So, I am applying an ARIMA model to time series data. The data includes one client's orders for different countries, so the ARIMA model will be applied per country for that particular client. For that purpose my challenge is looping over the country column and using ARIMA for each country, and this is where I am getting stuck. The data is as follows:
Date ClientName Country order
2016-11-05 ABC Albania 9
2016-11-05 ABC1 Belgium 1
2016-11-05 ABC2 Cambodia 8
2016-11-05 MCN1 Canada 4
2016-11-05 MCN2 China 1
In the above data I have to model the data for each country.
If anyone can help me with this problem it would be great.
IIUC you need iterrows:
for i, x in df.iterrows():
    print(i)
    print(x)
0
Date 2016-11-05
ClientName ABC
Country Albania
order 9
Name: 0, dtype: object
1
Date 2016-11-05
ClientName ABC1
Country Belgium
order 1
Name: 1, dtype: object
2
Date 2016-11-05
ClientName ABC2
Country Cambodia
order 8
Name: 2, dtype: object
3
Date 2016-11-05
ClientName MCN1
Country Canada
order 4
Name: 3, dtype: object
4
Date 2016-11-05
ClientName MCN2
Country China
order 1
Name: 4, dtype: object
What you have to do is use iterrows, and then you can access every column you want:
for index, row in df.iterrows():
    print(row['ClientName'], row['Country'], row['order'])
index will be the index of the row.
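For the per-country modelling the question actually asks about, grouping is usually more convenient than iterating row by row. A rough sketch, assuming statsmodels is installed, that the order=(1, 0, 0) parameters are only placeholders, and that each country has enough observations to fit a model:
from statsmodels.tsa.arima.model import ARIMA

models = {}
for country, group in df.groupby("Country"):
    # one order series per country, indexed by date
    series = group.sort_values("Date").set_index("Date")["order"]
    models[country] = ARIMA(series, order=(1, 0, 0)).fit()
# e.g. forecast the next period for one country:
# models["Albania"].forecast(steps=1)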
