For loop over multiple values of a column using Python

I'm stuck defining a loop over the values of a column.
I am applying an ARIMA model to time series data. The data contains one client's orders broken down by country, so an ARIMA model must be fit separately for each country for that client. My challenge is looping over the country column and applying ARIMA for each country, and that is where I am stuck. The data looks like:
Date ClientName Country order
2016-11-05 ABC Albania 9
2016-11-05 ABC1 Belgium 1
2016-11-05 ABC2 Cambodia 8
2016-11-05 MCN1 Canada 4
2016-11-05 MCN2 China 1
In the data above I have to model each country separately.
Any help with this problem would be appreciated.

IIUC you need iterrows:
for i, x in df.iterrows():
    print(i)
    print(x)
0
Date 2016-11-05
ClientName ABC
Country Albania
order 9
Name: 0, dtype: object
1
Date 2016-11-05
ClientName ABC1
Country Belgium
order 1
Name: 1, dtype: object
2
Date 2016-11-05
ClientName ABC2
Country Cambodia
order 8
Name: 2, dtype: object
3
Date 2016-11-05
ClientName MCN1
Country Canada
order 4
Name: 3, dtype: object
4
Date 2016-11-05
ClientName MCN2
Country China
order 1
Name: 4, dtype: object

What you have to do is use iterrows, and then you can access every column you want:
for index, row in df.iterrows():
    print(row['ClientName'], row['Country'], row['order'])
Here index is the index of the row.
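Since the goal is to fit a model per country, iterating over groups is often more direct than iterating over rows. A minimal sketch, assuming statsmodels is installed and that order is the series to model (the ARIMA order (1, 1, 1) is just a placeholder to tune):
from statsmodels.tsa.arima.model import ARIMA

# Fit one ARIMA model per country; (1, 1, 1) is a placeholder order.
models = {}
for country, grp in df.groupby('Country'):
    series = grp.set_index('Date')['order'].astype(float)
    models[country] = ARIMA(series, order=(1, 1, 1)).fit()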

Related

Get index and column name for a particular value in Pandas Dataframe

I have the following Pandas DataFrame:
A B
0 Exporter Invoice No. & Date
1 ABC PVT LTD. ABC/1234/2022-23 DATED 20/08/2022
2 1234/B, XYZ,
3 ABCD, DELHI, INDIA Proforma Invoice No. Date.
4 AB/CDE/FGH/2022-23/1234 20.08.2022
5 Consignee Buyer (If other than consignee)
6 ABC Co.
8 P.O BOX NO. 54321
9 Berlin, Germany
Now I want to search for a value in this DataFrame, and store the index and column name in 2 different variables.
For example:
If I search "Consignee", I should get
index = 5
column = 'A'
Assuming you really want the index/column of the match, you can use a mask and stack:
df.where(df.eq('Consignee')).stack()
output:
5 A Consignee
dtype: object
As list:
df.where(df.eq('Consignee')).stack().index.tolist()
output: [(5, 'A')]
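To store the match in two separate variables, as asked (assuming exactly one match exists):
idx, col = df.where(df.eq('Consignee')).stack().index.tolist()[0]
# idx -> 5, col -> 'A'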

Count Rows from a separate dataframe with conditions - PySpark

I have an airline booking transaction dataframe with columns customer_id, date_of_travel, dest_country. I have another dataframe with hotel booking transaction details: customer_id, date_of_booking, dest_country.
I need to add a column to the airline booking dataframe that counts all hotel bookings made by that customer_id in the same country within 1 month (+/- 30 days) of the date_of_travel.
I have added 2 additional columns to my airline booking dataset, Date_before_30_Days and Date_after_30_Days, but I am having trouble understanding how to count the hotel bookings in the same country within 1 month of the airline travel.
Airline Data
Customer_id Country Date_of_Travel
xyz US 10-20-2018
abc MX 03-04-2018
xyz US 04-05-2019
Hotel Data
Customer_id Country Date_of_Booking
xyz CA 03-30-2018
xyz US 05-01-2018
xyz US 10-15-2018
abc PH 10-20-2018
abc MX 03-01-2018
Final Output
Customer_id Country Date_of_Travel Total_Hotel_Bookings
xyz US 10-20-2018 1
abc MX 03-04-2018 0
xyz CA 04-05-2019 1
The example you have provided has a different output than your description implies. For example, customer id "abc" in country "MX" has a hotel booking on 1st March and, per the airline data, a flight booking on 4th March, which falls within the (+/-) 30-day range.
Based on your description, I have built the example below.
Airline Data
import pandas as pd

airline_data = pd.DataFrame({"Customer_Id": ["xyz", "abc", "xyz"], "Country": ["US", "MX", "CA"], "Date_Of_Travel": ["10-20-2018", "03-04-2018", "04-05-2019"]})
print(airline_data)
Customer_Id Country Date_Of_Travel
xyz US 10-20-2018
abc MX 03-04-2018
xyz CA 04-05-2019
Hotel Data
Hotel_data = pd.DataFrame({"Customer_Id":["xyz","xyz", "xyz","abc", "abc"], "Country":["CA","US","US","PH","MX"],"Date_of_Booking":["03-30-2019", "05-01-2018", "10-15-2018", "10-20-2018", "03-01-2018"]})
print(Hotel_data)
Customer_Id Country Date_of_Booking
xyz CA 03-30-2019
xyz US 05-01-2018
xyz US 10-15-2018
abc PH 10-20-2018
abc MX 03-01-2018
Convert the date columns from strings to datetime objects:
airline_data["Date_Of_Travel"] = pd.to_datetime(airline_data["Date_Of_Travel"])
Hotel_data["Date_of_Booking"] = pd.to_datetime(Hotel_data["Date_of_Booking"])
Create two columns for the minimum and maximum dates, which are (+/-) 30 days from the travel date:
airline_data["Min_date"] = (airline_data["Date_Of_Travel"]) - pd.Timedelta(days=30)
airline_data["Max_date"] = (airline_data["Date_Of_Travel"]) + pd.Timedelta(days=30)
Join the two dataframes on Customer_Id and Country to get the required data:
df_1 = pd.merge(airline_data, Hotel_data, on=["Customer_Id", "Country"],how="left")
print(df_1)
Customer_Id Country Date_Of_Travel Min_date Max_date Date_of_Booking
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-05-01
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-10-15
abc MX 2018-03-04 2018-02-02 2018-04-03 2018-03-01
xyz CA 2019-04-05 2019-03-06 2019-05-05 2019-03-30
Create a column indicating whether the booking falls within +/- 30 days of the travel date:
df_1["Bool"] = (df_1.Date_of_Booking >= df_1.Min_date) & (df_1.Date_of_Booking <= df_1.Max_date)
df_1["Bool"] = df_1["Bool"].apply(int)
print(df_1)
Customer_Id Country Date_Of_Travel Min_date Max_date Date_of_Booking Bool
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-05-01 0
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-10-15 1
abc MX 2018-03-04 2018-02-02 2018-04-03 2018-03-01 1
xyz CA 2019-04-05 2019-03-06 2019-05-05 2019-03-30 1
Now apply groupby on df_1 and sum Bool to get the total bookings of each customer in a particular country:
Req_Results = df_1.groupby(["Customer_Id","Country","Date_Of_Travel","Date_of_Booking"]).sum().reset_index()
Req_Results = Req_Results[Req_Results.Bool!=0]
Req_Results.rename(columns={"Bool":"Total_Hotel_Bookings"},inplace=True)
print(Req_Results)
Customer_Id Country Date_Of_Travel Date_of_Booking Total_Hotel_Bookings
abc MX 2018-03-04 2018-03-01 1
xyz CA 2019-04-05 2019-03-30 1
xyz US 2018-10-20 2018-10-15 1
For each row of airline_data you have to create filters on your hotel_data:
filter_id = hotel_data["Customer_id"] == desired_id
filter_country = hotel_data["Country"] == desired_country
filter_date = (hotel_data["Date_of_Booking"] > start_date) & (hotel_data["Date_of_Booking"] < end_date)
Then apply the filters and count the results:
filtered_df = hotel_data[filter_id & filter_country & filter_date]
total_bookings = len(filtered_df)
Of course you can do this in a loop, append the number of bookings to a list, and at the end add the list as a new column of airline_data, as sketched below.
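A minimal sketch of that loop (assuming pandas DataFrames with the column names used above, and that the date columns have already been converted with pd.to_datetime):
import pandas as pd

counts = []
for _, row in airline_data.iterrows():
    # Build the +/- 30-day window around each travel date.
    start_date = row["Date_of_Travel"] - pd.Timedelta(days=30)
    end_date = row["Date_of_Travel"] + pd.Timedelta(days=30)
    filter_id = hotel_data["Customer_id"] == row["Customer_id"]
    filter_country = hotel_data["Country"] == row["Country"]
    filter_date = (hotel_data["Date_of_Booking"] >= start_date) & (hotel_data["Date_of_Booking"] <= end_date)
    counts.append(len(hotel_data[filter_id & filter_country & filter_date]))
airline_data["Total_Hotel_Bookings"] = counts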
Does this help?

Build dataframe with sequential timeseries

I have a dataset that contains many timestamps associated with different ships and ports.
obj_id timestamp port
0 4 2019-10-01 Houston
1 2 2019-09-01 New York
2 4 2019-07-31 Boston
3 1 2019-07-28 San Francisco
4 2 2019-10-15 Miami
5 1 2019-09-01 Honolulu
6 1 2019-08-01 Tokyo
I want to build a dataframe that contains a single record for the latest voyage by ship (obj_id), by assigning the latest timestamp/port for each obj_id as a 'destination', and the second latest timestamp/port as the 'origin'. So the final result would look something like this:
obj_id origin_time origin_port destination_time destination_port
0 4 2019-07-31 Boston 2019-10-01 Houston
1 2 2019-09-01 New York 2019-10-15 Miami
3 1 2019-08-01 Tokyo 2019-09-01 Honolulu
I've successfully filtered the latest timestamps for each obj_id through this code but still can't figure a way to filter the second latest timestamp, let alone pull them both into a single row.
df.sort_values(by ='timestamp', ascending = False).drop_duplicates(['obj_id'])
Using groupby.agg with first, last:
dfg = df.sort_values('timestamp').groupby('obj_id').agg(['first', 'last']).reset_index()
dfg.columns = [f'{c1}_{c2}' for c1, c2 in dfg.columns]
obj_id_ timestamp_first timestamp_last port_first port_last
0 1 2019-07-28 2019-09-01 San Francisco Honolulu
1 2 2019-09-01 2019-10-15 New York Miami
2 4 2019-07-31 2019-10-01 Boston Houston
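If you want the exact column names from the question, a rename of the flattened columns (based on the output above) gets you there:
dfg = dfg.rename(columns={'obj_id_': 'obj_id',
                          'timestamp_first': 'origin_time', 'port_first': 'origin_port',
                          'timestamp_last': 'destination_time', 'port_last': 'destination_port'})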
Sort the trips by timestamp to get the most recent voyages first, then group by obj_id, grab the first and second voyage per object, and merge:
groups = df.sort_values(by = "timestamp", ascending = False).groupby("obj_id")
pd.merge(groups.nth(1), groups.nth(0),
         on="obj_id",
         suffixes=("_origin", "_dest"))
Make sure your timestamp column is the proper timestamp data type though, otherwise your sorting will be messed up.
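For instance:
df['timestamp'] = pd.to_datetime(df['timestamp'])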

How to extract a certain string from a text?

I have a certain feature "Location" from which I want to extract country.
The feature looks like:
data['Location'].head()
0 stockton, california, usa
1 edmonton, alberta, canada
2 timmins, ontario, canada
3 ottawa, ontario, canada
4 n/a, n/a, n/a
Name: Location, dtype: object
I want:
data['Country'].head(3)
0 usa
1 canada
2 canada
I've tried:
data['Country'] = data.Location.str.extract('(+[a-zA-Z])', expand=False)
data[['Location', 'Country']].sample(10)
which returns:
error: nothing to repeat at position 1
When I try '[a-zA-Z]+' instead, it gives me the city.
Help would be appreciated. Thanks.
You can also use regex patterns:
df['Country'] = df['Location'].str.split(r'(,\s)(\w+)$', n=1, expand=True)[2]
Output:
df['Country'].head(3)
Out[111]:
0 usa
1 canada
2 canada
Name: Country, dtype: object
data['Country'] = data['Location'].apply(lambda row: str(row).split(',')[-1])
You may do this: apply runs the function on every value of the column; the lambda splits each location on commas and takes the last piece, which is saved into the new Country column.
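A vectorized alternative that also trims the leading space (a sketch, assuming the country is always the last comma-separated token):
data['Country'] = data['Location'].str.split(',').str[-1].str.strip()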

Retrieve the corresponding row to an attribute in pandas dataframes

I'm trying to return the values of "ID" and "cnt" corresponding to the most recent date in a fairly large dataframe. Below is a small example of the dataframe (df).
ID cnt date
1 US 2018-02-01
2 CA 2017-10-01
3 UK 2018-06-10
2 UK 2018-08-23
1 FR 2017-05-01
Hence, the output shall look like as follows.
ID cnt date
2 UK 2018-08-23
3 UK 2018-06-10
1 US 2018-02-01
Applying "group_by" method on 'ID' and then "max()" of the date as
df.groupby('ID')['date'].max()
giving me the most recent date per each 'ID'; however, I'm not sure how to retrieve the corresponding 'ctn' value in 'df'. Thanks!
Sort the dataframe by dates before grouping. The most recent entry will always be the last:
df.sort_values('date').groupby('ID').last()
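If you also want ID back as a column and the rows ordered with the most recent date first, as in the desired output, you can follow up with:
df.sort_values('date').groupby('ID').last().reset_index().sort_values('date', ascending=False)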
