Python: Creating column based on a condition from the other dataframe - python

I have two data frames as follows:
df1=
date company userDomain keyword pageViews category
2015-12-02 1-800 Contacts glasses.com SAN 2 STORAGE
2015-12-02 1-800 Contacts rhgi.com SAN 3 STORAGE
2015-12-02 100 Percent Fun dialogdesign.ca SAN 1 STORAGE
2015-12-02 101netlink 101netlink.com SAN 8 STORAGE
2015-12-02 1020 nlc.bc.ca SAN 4 STORAGE
df2=
Outcome Job Title Wave
Created Opportunity IT Manager 1.0
Closed Out Prospect/Contact Infrastructure Manager 1.0
NaN IT Director 1.0
NaN Supervisor Technical Support 1.0
Created Opportunity Director of IT Services 1.0
Wave Date userDomain
2016-02-16 15:07:05 dialogdesign.ca
2016-02-16 15:07:05 rhgi.com
2016-02-16 15:07:05 surefire.com
2016-02-16 15:07:05 isd2144.org
2016-02-16 15:07:05 nlc.bc.ca
I would like to add a column to df1 called wave_date, filled with the dates from df2['Wave Date'] wherever df1['userDomain'] appears in df2['userDomain'].
If there is no matching userDomain in both frames, the value should be NaN. I'm sorry if this is a very naive question, but I'm frustrated with my failure. What I'm doing is something like this:
df1['wave_date'] = df1.apply(lambda x: df2['Wave Date'] if x['userDomain'].isin(df2['userDomain']) else np.nan)
I keep getting
IndexError: ('userDomain', 'occurred at index date')
Can you please point out the correct way to do it? Thanks a lot.

# Build a userDomain -> Wave Date lookup, then map it onto df1;
# domains with no match in df2 become NaN automatically.
m = dict(zip(df2['userDomain'], df2['Wave Date']))
df1.assign(wave_date=df1.userDomain.map(m))
date company userDomain keyword pageViews category wave_date
0 2015-12-02 1-800 Contacts glasses.com SAN 2 STORAGE NaN
1 2015-12-02 1-800 Contacts rhgi.com SAN 3 STORAGE 2016-02-16 15:07:05
2 2015-12-02 100 Percent Fun dialogdesign.ca SAN 1 STORAGE 2016-02-16 15:07:05
3 2015-12-02 101netlink 101netlink.com SAN 8 STORAGE NaN
4 2015-12-02 1020 nlc.bc.ca SAN 4 STORAGE 2016-02-16 15:07:05
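If you prefer a join to a dict lookup, a left merge gives the same result. A minimal sketch, assuming df2's columns are literally named 'userDomain' and 'Wave Date' as printed above:
# Keep one Wave Date per userDomain, then left-merge so unmatched domains stay NaN.
lookup = df2[['userDomain', 'Wave Date']].drop_duplicates(subset='userDomain')
out = df1.merge(lookup, on='userDomain', how='left')
out = out.rename(columns={'Wave Date': 'wave_date'})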


argument of type "float" is not iterable when trying to use for loop

I have a countrydf as below, in which each cell in the country column contains a list of the countries where the movie was released.
countrydf
id Country release_year
s1 [US] 2020
s2 [South Africa] 2021
s3 NaN 2021
s4 NaN 2021
s5 [India] 2021
I want to make a new df which looks like this:
country_yeardf
Year US UK Japan India
1925 NaN NaN NaN NaN
1926 NaN NaN NaN NaN
1927 NaN NaN NaN NaN
1928 NaN NaN NaN NaN
It has the release year and the number of movies released in each country.
My approach: starting from a blank df like the second one, run a for loop to count the number of movies released and then update the corresponding cell.
countrylist=['Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', ….]
for x in countrylist:
    for j in list(range(0, 8807)):
        if x in countrydf.country[j]:
            t = int(countrydf.release_year[j])
            country_yeardf.at[t, x] = country_yeardf.at[t, x] + 1
an error occurred which read:
TypeError Traceback (most recent call last)
<ipython-input-25-225281f8759a> in <module>()
1 for x in countrylist:
2 for j in li:
----> 3 if x in countrydf.country[j]:
4 t=int(countrydf.release_year[j])
5 country_yeardf.at[t, x] = country_yeardf.at[t, x]+1
TypeError: argument of type 'float' is not iterable
I don't know which value is of float type here; I have checked the type of countrydf.country[j] and it returned int.
I am using pandas and I am just getting started with it. Can anyone please explain the error and suggest a solution for the df that I want to create?
P/S: my English is not so good, so I hope you guys understand.
Here is a solution using groupby
df = pd.DataFrame([['US', 2015], ['India', 2015], ['US', 2015], ['Russia', 2016]], columns=['country', 'year'])
country year
0 US 2015
1 India 2015
2 US 2015
3 Russia 2016
Now just groupby country and year and unstack the output:
df.groupby(['year', 'country']).size().unstack()
country India Russia US
year
2015 1.0 NaN 2.0
2016 NaN 1.0 NaN
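To apply the same idea to the original frame, where Country holds lists and some rows are NaN (those NaN floats are what made the loop fail), here is a sketch assuming pandas 0.25+ for explode:
# Drop rows with no country list, give each country its own row, then count per year.
exploded = countrydf.dropna(subset=['Country']).explode('Country')
country_yeardf = exploded.groupby(['release_year', 'Country']).size().unstack()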
Here are some alternative ways to achieve this in pandas without loops.
If the Country column can have more than one value in the list in each row, you can try the below:
>>> df['Country'].str.join("|").str.get_dummies().groupby(df['release_year']).sum()
India South Africa US
release_year
2020 0 0 1
2021 1 1 0
Otherwise, if Country has just one value per row in the list, as you have shown in the example, you can use crosstab:
>>> pd.crosstab(df['release_year'], df['Country'].str[0])
Country India South Africa US
release_year
2020 0 0 1
2021 1 1 0

Can I copy values from other rows and columns and automatically replace the missing values?

So, my dataframe is
price model_year model condition cylinders fuel odometer transmission type paint_color is_4wd date_posted days_listed
0 9400 2011.0 bmw x5 good 6.0 gas 145000.0 automatic SUV NaN True 2018-06-23 19
1 25500 NaN ford f-150 good 6.0 gas 88705.0 automatic pickup white True 2018-10-19 50
2 5500 2013.0 hyundai sonata like new 4.0 gas 110000.0 automatic sedan red False 2019-02-07 79
3 1500 2003.0 ford f-150 fair 8.0 gas NaN automatic pickup NaN False 2019-03-22 9
4 14900 2017.0 chrysler 200 excellent 4.0 gas 80903.0 automatic sedan black False 2019-04-02 28
As you can see, row 1's model is the same as row 3's, but row 1's model year is missing. It would naturally follow that I can fill row 1's model year from row 3's so there isn't a NaN there. I'm aware I can change it manually, but the dataframe is over 50,000 rows long and there are many more values just like that. Is there an automated way I can go about replacing these values?
Edit: After looking over the df just now, I've realized that I can't really replace the model year like that as it can change even within the same model, although I would still love to know how it's done if possible for future reference
You can merge the dataframe with itself and fillna the missing years.
df_want = df.merge(df[['model_year','model']].dropna().drop_duplicates(),on='model',how='left')
df_want['model_year'] = df_want['model_year_x'].fillna(df_want['model_year_y'])
df_want = df_want.drop(['model_year_x','model_year_y'],axis=1)
Yes, you can replace all NaN model years with the non-NaN entry like this:
models = df['model'].unique()
for m in models:
    # assumes each model has at least one non-NaN year, otherwise .values[0] raises IndexError
    year = df.loc[(df['model_year'].notna()) & (df['model'] == m)]['model_year'].values[0]
    df.loc[(df['model_year'].isna()) & (df['model'] == m), 'model_year'] = year
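A loop-free variant of the same idea (a sketch, not from the answers above) fills each model's missing years with that model's first known year:
# GroupBy.transform('first') broadcasts the first non-NaN model_year of each model.
df['model_year'] = df['model_year'].fillna(
    df.groupby('model')['model_year'].transform('first')
)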

Count Rows from a separate dataframe with conditions - PySpark

I have an airline booking transaction dataframe with the columns customer_id, date_of_travel, dest_country. I have another dataframe with hotel booking transaction details: customer_id, date_of_booking, dest_country.
I need to bring a column into the airline booking dataframe which would be the count of all hotel bookings made by that customer_id in the same country within 1 month (+/- 30 days) of the date_of_travel.
I have added 2 additional columns to my airline booking dataset, Date_before_30_Days and Date_after_30_Days, but I am having trouble understanding how to get the counts of hotel bookings in the same country within 1 month of the airline travel.
Airline Data
Customer_id Country Date_of_Travel
xyz US 10-20-2018
abc MX 03-04-2018
xyz US 04-05-2019
Hotel Data
Customer_id Country Date_of_Booking
xyz CA 03-30-2018
xyz US 05-01-2018
xyz US 10-15-2018
abc PH 10-20-2018
abc MX 03-01-2018
Final Output
Customer_id Country Date_of_Travel Total_Hotel_Bookings
xyz US 10-20-2018 1
abc MX 03-04-2018 0
xyz CA 04-05-2019 1
The example you have provided has a different output than your problem statement. For example, customer id "abc" in country "MX" has a hotel booking on 1 March and an airline booking on 4 March per the airline data, which does fall within the (+/-) 30-day range.
Going by your problem statement, I have made the example below.
Airline Data
airline_data = pd.DataFrame({"Customer_Id":["xyz", "abc", "xyz"], "Country":["US", "MX", "CA"],"Date_Of_Travel":["10-20-2018", "03-04-2018", "04-05-2019"]})
print(airline_data)
Customer_Id Country Date_Of_Travel
xyz US 10-20-2018
abc MX 03-04-2018
xyz CA 04-05-2019
Hotel Data
Hotel_data = pd.DataFrame({"Customer_Id":["xyz","xyz", "xyz","abc", "abc"], "Country":["CA","US","US","PH","MX"],"Date_of_Booking":["03-30-2019", "05-01-2018", "10-15-2018", "10-20-2018", "03-01-2018"]})
print(Hotel_data)
Customer_Id Country Date_of_Booking
xyz CA 03-30-2019
xyz US 05-01-2018
xyz US 10-15-2018
abc PH 10-20-2018
abc MX 03-01-2018
Convert the date columns from strings to datetime objects:
airline_data["Date_Of_Travel"] = pd.to_datetime(airline_data["Date_Of_Travel"])
Hotel_data["Date_of_Booking"] = pd.to_datetime(Hotel_data["Date_of_Booking"])
Create two columns for the minimum and maximum dates, which are (+/-) 30 days from the travel date:
airline_data["Min_date"] = (airline_data["Date_Of_Travel"]) - pd.Timedelta(days=30)
airline_data["Max_date"] = (airline_data["Date_Of_Travel"]) + pd.Timedelta(days=30)
Join the two dataframes on Customer_Id and Country to get the required data:
df_1 = pd.merge(airline_data, Hotel_data, on=["Customer_Id", "Country"],how="left")
print(df_1)
Customer_Id Country Date_Of_Travel Min_date Max_date Date_of_Booking
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-05-01
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-10-15
abc MX 2018-03-04 2018-02-02 2018-04-03 2018-03-01
xyz CA 2019-04-05 2019-03-06 2019-05-05 2019-03-30
Create a column that flags whether the booking falls within +/- 30 days of the travel date:
df_1["Bool"] = (df_1.Date_of_Booking >= df_1.Min_date) & (df_1.Date_of_Booking <= df_1.Max_date)
df_1["Bool"] = df_1["Bool"].apply(int)
print(df_1)
Customer_Id Country Date_Of_Travel Min_date Max_date Date_of_Booking Bool
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-05-01 0
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-10-15 1
abc MX 2018-03-04 2018-02-02 2018-04-03 2018-03-01 1
xyz CA 2019-04-05 2019-03-06 2019-05-05 2019-03-30 1
Now, group df_1 and sum the Bool column to get the total bookings for each customer in a particular country:
Req_Results = df_1.groupby(["Customer_Id","Country","Date_Of_Travel","Date_of_Booking"]).sum().reset_index()
Req_Results = Req_Results[Req_Results.Bool!=0]
Req_Results.rename(columns={"Bool":"Total_Hotel_Bookings"},inplace=True)
print(Req_Results)
Customer_Id Country Date_Of_Travel Date_of_Booking Total_Hotel_Bookings
abc MX 2018-03-04 2018-03-01 1
xyz CA 2019-04-05 2019-03-30 1
xyz US 2018-10-20 2018-10-15 1
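Since the question asks for PySpark, the same join-filter-aggregate logic there could look like the sketch below, assuming Spark DataFrames named airline_df and hotel_df with the lowercase column names from the question and the date columns already cast to dates:
from pyspark.sql import functions as F

# Join hotel bookings to airline trips on customer and country,
# keep only bookings within 30 days of the travel date, then count per trip.
joined = airline_df.join(hotel_df, on=["customer_id", "dest_country"], how="inner")
within_30 = joined.where(F.abs(F.datediff("date_of_booking", "date_of_travel")) <= 30)
counts = (within_30.groupBy("customer_id", "dest_country", "date_of_travel")
                   .agg(F.count("*").alias("Total_Hotel_Bookings")))
result = (airline_df.join(counts, ["customer_id", "dest_country", "date_of_travel"], "left")
                    .fillna(0, subset=["Total_Hotel_Bookings"]))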
For each row of airline_data you have to create filters on your hotel_data:
filter_id = hotel_data['Customer_id'] == desired_id
filter_country = hotel_data['Country'] == desired_country
filter_date = (hotel_data['Date_of_Booking'] > start_date) & (hotel_data['Date_of_Booking'] < end_date)
Then apply the filters and count the results:
filtered_df = hotel_data[filter_id & filter_country & filter_date]
total_bookings = len(filtered_df)
Of course you can do this with a loop, append the number of bookings to a list, and at the end add the list as a new column of airline_data.
Does this help?
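A minimal sketch of that loop, assuming the airline_data and Hotel_data frames built above (with the date columns already converted to datetime):
counts = []
for _, row in airline_data.iterrows():
    start_date = row['Date_Of_Travel'] - pd.Timedelta(days=30)
    end_date = row['Date_Of_Travel'] + pd.Timedelta(days=30)
    match = ((Hotel_data['Customer_Id'] == row['Customer_Id'])
             & (Hotel_data['Country'] == row['Country'])
             & (Hotel_data['Date_of_Booking'] >= start_date)
             & (Hotel_data['Date_of_Booking'] <= end_date))
    counts.append(int(match.sum()))
airline_data['Total_Hotel_Bookings'] = counts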

Build dataframe with sequential timeseries

I have a dataset that contains many timestamps associated with different ships and ports.
obj_id timestamp port
0 4 2019-10-01 Houston
1 2 2019-09-01 New York
2 4 2019-07-31 Boston
3 1 2019-07-28 San Francisco
4 2 2019-10-15 Miami
5 1 2019-09-01 Honolulu
6 1 2019-08-01 Tokyo
I want to build a dataframe that contains a single record for the latest voyage by ship (obj_id), by assigning the latest timestamp/port for each obj_id as a 'destination', and the second latest timestamp/port as the 'origin'. So the final result would look something like this:
obj_id origin_time origin_port destination_time destination_port
0 4 2019-07-31 Boston 2019-10-01 Houston
1 2 2019-09-01 New York 2019-10-15 Miami
3 1 2019-07-28 Tokyo 2019-09-01 Honolulu
I've successfully filtered the latest timestamps for each obj_id through this code, but I still can't figure out a way to filter the second-latest timestamp, let alone pull them both into a single row.
df.sort_values(by ='timestamp', ascending = False).drop_duplicates(['obj_id'])
Using groupby.agg with first, last:
dfg = df.sort_values('timestamp').groupby('obj_id').agg(['first', 'last']).reset_index()
dfg.columns = [f'{c1}_{c2}' for c1, c2 in dfg.columns]
obj_id_ timestamp_first timestamp_last port_first port_last
0 1 2019-07-28 2019-09-01 San Francisco Honolulu
1 2 2019-09-01 2019-10-15 New York Miami
2 4 2019-07-31 2019-10-01 Boston Houston
You want to sort the trips by timestamp so the most recent voyages come first, then group the voyages by object id, grab the first and second voyage per object, and merge.
groups = df.sort_values(by="timestamp", ascending=False).groupby("obj_id")
pd.merge(groups.nth(1), groups.nth(0),
         on="obj_id",
         suffixes=("_origin", "_dest"))
Make sure your timestamp column is the proper timestamp data type though, otherwise your sorting will be messed up.
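If you want the exact layout from the question (origin = the second-latest voyage, destination = the latest), one sketch building on the groupby.agg idea above, assuming pandas 0.25+ for named aggregation:
# Keep only the last two voyages per ship, then label the earlier one as origin and the later one as destination.
last_two = df.sort_values('timestamp').groupby('obj_id').tail(2)
result = (last_two.groupby('obj_id')
                  .agg(origin_time=('timestamp', 'first'),
                       origin_port=('port', 'first'),
                       destination_time=('timestamp', 'last'),
                       destination_port=('port', 'last'))
                  .reset_index())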

Aggregate function to data frame in pandas

I want to create a dataframe from an aggregate function. I thought it would create a dataframe by default, as this solution states (Converting a Pandas GroupBy object to DataFrame), but it creates a series and I don't know why.
The dataframe is from Kaggle's San Francisco Salaries. My code:
df=pd.read_csv('Salaries.csv')
in: type(df)
out: pandas.core.frame.DataFrame
in: df.head()
out: EmployeeName JobTitle TotalPay TotalPayBenefits Year Status 2BasePay 2OvertimePay 2OtherPay 2Benefits 2Year
0 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 567595.43 567595.43 2011 NaN 167411.18 0.00 400184.25 NaN 2011-01-01
1 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 538909.28 538909.28 2011 NaN 155966.02 245131.88 137811.38 NaN 2011-01-01
2 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 335279.91 335279.91 2011 NaN 212739.13 106088.18 16452.60 NaN 2011-01-01
3 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 332343.61 332343.61 2011 NaN 77916.00 56120.71 198306.90 NaN 2011-01-01
4 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 326373.19 326373.19 2011 NaN 134401.60 9737.00 182234.59 NaN 2011-01-01
in: df2=df.groupby(['JobTitle'])['TotalPay'].mean()
type(df2)
out: pandas.core.series.Series
I want df2 to be a dataframe with the columns 'JobTitle' and 'TotalPay'.
Breaking down your code:
df2 = df.groupby(['JobTitle'])['TotalPay'].mean()
The groupby is fine. It's the ['TotalPay'] that is the misstep. That is telling the groupby to only execute the mean function on the pd.Series df['TotalPay'] for each group defined in ['JobTitle']. Instead, you want to refer to this column with [['TotalPay']]. Notice the double brackets: they say pd.DataFrame.
Recap
df2 = df.groupby(['JobTitle'])[['TotalPay']].mean()
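Note that with [['TotalPay']] the JobTitle values end up as the index of df2. If you want JobTitle as a regular column next to TotalPay, two equivalent variants (a small sketch, either works):
# Reset the index afterwards...
df2 = df.groupby(['JobTitle'])[['TotalPay']].mean().reset_index()
# ...or keep the group keys as columns from the start.
df2 = df.groupby('JobTitle', as_index=False)['TotalPay'].mean()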
