Build dataframe with sequential timeseries - python

I have a dataset that contains many timestamps associated with different ships and ports.
obj_id timestamp port
0 4 2019-10-01 Houston
1 2 2019-09-01 New York
2 4 2019-07-31 Boston
3 1 2019-07-28 San Francisco
4 2 2019-10-15 Miami
5 1 2019-09-01 Honolulu
6 1 2019-08-01 Tokyo
I want to build a dataframe that contains a single record for the latest voyage by ship (obj_id), by assigning the latest timestamp/port for each obj_id as a 'destination', and the second latest timestamp/port as the 'origin'. So the final result would look something like this:
obj_id origin_time origin_port destination_time destination_port
0 4 2019-07-31 Boston 2019-10-01 Houston
1 2 2019-09-01 New York 2019-10-15 Miami
3 1 2019-08-01 Tokyo 2019-09-01 Honolulu
I've successfully filtered the latest timestamps for each obj_id with this code, but I still can't figure out a way to filter the second latest timestamp, let alone pull them both into a single row.
df.sort_values(by ='timestamp', ascending = False).drop_duplicates(['obj_id'])

Using groupby.agg with first, last:
dfg = df.sort_values('timestamp').groupby('obj_id').agg(['first', 'last']).reset_index()
dfg.columns = [f'{c1}_{c2}' for c1, c2 in dfg.columns]
obj_id_ timestamp_first timestamp_last port_first port_last
0 1 2019-07-28 2019-09-01 San Francisco Honolulu
1 2 2019-09-01 2019-10-15 New York Miami
2 4 2019-07-31 2019-10-01 Boston Houston

You want to sort the trips by timestamp so the most recent voyages come first, then group the voyages by object id, grab the first and second voyage per object, and merge:
groups = df.sort_values(by="timestamp", ascending=False).groupby("obj_id")
pd.merge(groups.nth(1), groups.nth(0),
         on="obj_id",
         suffixes=("_origin", "_dest"))
Make sure your timestamp column is the proper timestamp data type though, otherwise your sorting will be messed up.
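Putting the pieces together, here is a minimal runnable sketch of the nth approach, using the sample data from the question. In this sample the ISO-format date strings happen to sort correctly even as text, but the pd.to_datetime conversion matters for timestamps in general:

```python
import pandas as pd

# sample data from the question; timestamps start out as strings
df = pd.DataFrame({
    'obj_id': [4, 2, 4, 1, 2, 1, 1],
    'timestamp': ['2019-10-01', '2019-09-01', '2019-07-31', '2019-07-28',
                  '2019-10-15', '2019-09-01', '2019-08-01'],
    'port': ['Houston', 'New York', 'Boston', 'San Francisco',
             'Miami', 'Honolulu', 'Tokyo'],
})

# convert to a real datetime dtype so sorting is chronological, not lexical
df['timestamp'] = pd.to_datetime(df['timestamp'])

groups = df.sort_values(by='timestamp', ascending=False).groupby('obj_id')
result = pd.merge(groups.nth(1), groups.nth(0),
                  on='obj_id',
                  suffixes=('_origin', '_dest'))
print(result)
```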

argument of type "float" is not iterable when trying to use for loop

I have a countrydf as below, in which each cell in the country column contains a list of the countries where the movie was released.
countrydf
id Country release_year
s1 [US] 2020
s2 [South Africa] 2021
s3 NaN 2021
s4 NaN 2021
s5 [India] 2021
I want to make a new df which look like this:
country_yeardf
Year US UK Japan India
1925 NaN NaN NaN NaN
1926 NaN NaN NaN NaN
1927 NaN NaN NaN NaN
1928 NaN NaN NaN NaN
It has the release year and the number of movies released in each country.
My solution is: starting with a blank df like the second one, run a for loop to count the number of movies released and then modify the value in the corresponding cell.
countrylist = ['Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', ….]
for x in countrylist:
    for j in list(range(0, 8807)):
        if x in countrydf.country[j]:
            t = int(countrydf.release_year[j])
            country_yeardf.at[t, x] = country_yeardf.at[t, x] + 1
An error occurred, which read:
TypeError Traceback (most recent call last)
<ipython-input-25-225281f8759a> in <module>()
1 for x in countrylist:
2 for j in li:
----> 3 if x in countrydf.country[j]:
4 t=int(countrydf.release_year[j])
5 country_yeardf.at[t, x] = country_yeardf.at[t, x]+1
TypeError: argument of type 'float' is not iterable
I don't know which one is of float type here; I have checked the type of countrydf.country[j] and it returned int.
I was using pandas and I am just getting started with it. Can anyone please explain the error and suggest a solution for the df that I want to create?
P/S: my English is not so good, so I hope you guys understand.
Here is a solution using groupby. (The TypeError itself comes from the NaN rows: NaN is a float, so a membership test like x in countrydf.country[j] fails when the cell is NaN.)
df = pd.DataFrame([['US', 2015], ['India', 2015], ['US', 2015], ['Russia', 2016]], columns=['country', 'year'])
country year
0 US 2015
1 India 2015
2 US 2015
3 Russia 2016
Now just groupby country and year and unstack the output:
df.groupby(['year', 'country']).size().unstack()
country India Russia US
year
2015 1.0 NaN 2.0
2016 NaN 1.0 NaN
Some alternative ways to achieve this in pandas without loops.
If the Country column can have more than one value per row in the list, you can try the below:
>>> df['Country'].str.join("|").str.get_dummies().groupby(df['release_year']).sum()
India South Africa US
release_year
2020 0 0 1
2021 1 1 0
Otherwise, if Country has just one value per row in the list, as in your example, you can use crosstab:
>>> pd.crosstab(df['release_year'], df['Country'].str[0])
Country India South Africa US
release_year
2020 0 0 1
2021 1 1 0
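For reference, a self-contained sketch of the get_dummies approach on data shaped like the question's (country lists plus NaN rows); the NaN rows simply produce all-zero dummy rows, so no explicit NaN handling is needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['s1', 's2', 's3', 's4', 's5'],
    'Country': [['US'], ['South Africa'], np.nan, np.nan, ['India']],
    'release_year': [2020, 2021, 2021, 2021, 2021],
})

# join each list into a '|'-separated string, one-hot encode,
# then sum the indicator columns per release year
counts = (df['Country'].str.join('|')
            .str.get_dummies()
            .groupby(df['release_year'])
            .sum())
print(counts)
```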

Pandas add multiple rows with IF condition

I have the following dataframe consisting of city bicycle trips. However, I have some problems handling trips that last more than one hour (I want to use YYYYmmDDhh as a composite key in my data model). So what I want to do is create a column "keyhour" that I can join with other tables. This would be YYYYmmDDhh based on started_at IF start_hour == end_hour. However, if end_hour is greater than start_hour, I want to insert that many rows with the same TourID into my dataframe, to indicate that the trip lasted several hours.
started_at ended_at duration start_station_id start_station_name start_station_description ... end_station_description end_station_latitude end_station_longitude TourID start_hour end_hour
0 2020-05-01 03:03:14.941000+00:00 2020-05-01 03:03:14.941000+00:00 635 484 Karenlyst allé ved Skabos vei ... langs Drammensveien 59.914145 10.715505 0 3 3
1 2020-05-01 03:05:48.529000+00:00 2020-05-01 03:05:48.529000+00:00 141 455 Sofienbergparken sør langs Sofienberggata ... ved Sars gate 59.921206 10.769989 1 3 3
2 2020-05-01 03:13:33.156000+00:00 2020-05-01 03:13:33.156000+00:00 330 550 Thereses gate ved Bislett trikkestopp ... ved Kristian IVs gate 59.914767 10.740971 2 3 3
3 2020-05-01 03:14:14.549000+00:00 2020-05-01 03:14:14.549000+00:00 479 597 Fredensborg ved rundkjøringen ... ved Oslo City 59.912334 10.752292 3 3 3
4 2020-05-01 03:20:12.355000+00:00 2020-05-01 03:20:12.355000+00:00 629 617 Bjerregaardsgate Øst ved Uelands gate ... langs Oslo gate 59.908255 10.767800 4 3 3
So for example if started_at = 2020-05-01 03:03:14.941000+00:00, ended_at = 2020-05-01 06:03:14.941000+00:00 , start_hour = 3, end_hour = 6 and TourID = 1, I want to have rows with:
keyhour ; TourID
2020050103 ;1
2020050104 ;1
2020050105 ;1
2020050106 ;1
And all other values (duration etc.) related to this trip id.
However, I really cannot find any way to do this in Pandas. Is it possible, or do I have to use pure Python to re-write my source csv?
Thank you for any advice!
Assuming your dataframe is df and that you have run import pandas as pd:
# convert to datetime and round down to the hour
df['started_at'] = pd.to_datetime(df['started_at']).dt.floor(freq='H')
df['ended_at'] = pd.to_datetime(df['ended_at']).dt.floor(freq='H')
# build a list of hourly datetimes from started_at to ended_at
df['keyhour'] = df.apply(lambda x: list(pd.date_range(x['started_at'], x['ended_at'], freq="1H")), axis='columns')
# expand each element of the keyhour list into its own row
df = df.explode('keyhour')
# explode leaves an object-dtype column, so convert back to datetime before formatting as a string
df['keyhour'] = pd.to_datetime(df['keyhour']).dt.strftime('%Y%m%d%H')
df
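A quick sanity check of the recipe on two hypothetical trips (one contained within a single hour, one spanning 03:00 to 06:00, timestamps modeled on the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    'TourID': [0, 1],
    'started_at': ['2020-05-01 03:03:14+00:00', '2020-05-01 03:05:48+00:00'],
    'ended_at':   ['2020-05-01 03:20:12+00:00', '2020-05-01 06:03:14+00:00'],
})

# floor both ends to the hour, enumerate the hours, one row per hour
df['started_at'] = pd.to_datetime(df['started_at']).dt.floor(freq='H')
df['ended_at'] = pd.to_datetime(df['ended_at']).dt.floor(freq='H')
df['keyhour'] = df.apply(
    lambda x: list(pd.date_range(x['started_at'], x['ended_at'], freq='1H')),
    axis='columns')
df = df.explode('keyhour')

# explode yields an object-dtype column, so convert back before formatting
df['keyhour'] = pd.to_datetime(df['keyhour']).dt.strftime('%Y%m%d%H')
print(df[['TourID', 'keyhour']])
```

The one-hour trip keeps a single row, while the three-hour trip expands to four rows sharing the same TourID.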

Python: Creating column based on a condition from the other dataframe

I have two data frames as follows:
df1=
date company userDomain keyword pageViews category
2015-12-02 1-800 Contacts glasses.com SAN 2 STORAGE
2015-12-02 1-800 Contacts rhgi.com SAN 3 STORAGE
2015-12-02 100 Percent Fun dialogdesign.ca SAN 1 STORAGE
2015-12-02 101netlink 101netlink.com SAN 8 STORAGE
2015-12-02 1020 nlc.bc.ca SAN 4 STORAGE
df2=
Outcome Job Title Wave
Created Opportunity IT Manager 1.0
Closed Out Prospect/Contact Infrastructure Manager 1.0
NaN IT Director 1.0
NaN Supervisor Technical Support 1.0
Created Opportunity Director of IT Services 1.0
Wave Date userDomain
2016-02-16 15:07:05 dialogdesign.ca
2016-02-16 15:07:05 rhgi.com
2016-02-16 15:07:05 surefire.com
2016-02-16 15:07:05 isd2144.org
2016-02-16 15:07:05 nlc.bc.ca
I would like to add a column to df1 called wave_date, filled with the dates from df2['Wave Date'] wherever df1['userDomain'] appears in df2['userDomain'].
If there is no match of userDomain between the two frames, the value should be NaN. I'm sorry if this is a very naive question, but I'm frustrated with my failure. What I'm doing is something like this:
df1['wave_date'] = df1.apply(lambda x: df2['Wave Date'] if x['userDomain'].isin(df2['userDomain']) else np.nan)
I keep getting
IndexError: ('userDomain', 'occurred at index date')
Can you please point out the correct way to do it? Thanks a lot.
m = dict(zip(df2['userDomain'], df2['Wave Date']))
df1.assign(wave_date=df1.userDomain.map(m))
date company userDomain keyword pageViews category wave_date
0 2015-12-02 1-800 Contacts glasses.com SAN 2 STORAGE NaN
1 2015-12-02 1-800 Contacts rhgi.com SAN 3 STORAGE 2016-02-16 15:07:05
2 2015-12-02 100 Percent Fun dialogdesign.ca SAN 1 STORAGE 2016-02-16 15:07:05
3 2015-12-02 101netlink 101netlink.com SAN 8 STORAGE NaN
4 2015-12-02 1020 nlc.bc.ca SAN 4 STORAGE 2016-02-16 15:07:05
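A runnable sketch of the map approach on trimmed-down versions of the two frames (domains and the single Wave Date value taken from the question); domains absent from df2 map to NaN automatically:

```python
import pandas as pd

df1 = pd.DataFrame({
    'userDomain': ['glasses.com', 'rhgi.com', 'dialogdesign.ca',
                   '101netlink.com', 'nlc.bc.ca'],
})
df2 = pd.DataFrame({
    'userDomain': ['dialogdesign.ca', 'rhgi.com', 'surefire.com',
                   'isd2144.org', 'nlc.bc.ca'],
    'Wave Date': ['2016-02-16 15:07:05'] * 5,
})

# build a domain -> date lookup; Series.map leaves unmatched keys as NaN
m = dict(zip(df2['userDomain'], df2['Wave Date']))
df1 = df1.assign(wave_date=df1.userDomain.map(m))
print(df1)
```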

Aggregate function to data frame in pandas

I want to create a dataframe from an aggregate function. I thought it would create a dataframe by default, as this solution states, but it creates a series and I don't know why (Converting a Pandas GroupBy object to DataFrame).
The dataframe is from Kaggle's San Francisco Salaries. My code:
df=pd.read_csv('Salaries.csv')
in: type(df)
out: pandas.core.frame.DataFrame
in: df.head()
out: EmployeeName JobTitle TotalPay TotalPayBenefits Year Status 2BasePay 2OvertimePay 2OtherPay 2Benefits 2Year
0 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 567595.43 567595.43 2011 NaN 167411.18 0.00 400184.25 NaN 2011-01-01
1 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 538909.28 538909.28 2011 NaN 155966.02 245131.88 137811.38 NaN 2011-01-01
2 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 335279.91 335279.91 2011 NaN 212739.13 106088.18 16452.60 NaN 2011-01-01
3 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 332343.61 332343.61 2011 NaN 77916.00 56120.71 198306.90 NaN 2011-01-01
4 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 326373.19 326373.19 2011 NaN 134401.60 9737.00 182234.59 NaN 2011-01-01
in: df2=df.groupby(['JobTitle'])['TotalPay'].mean()
type(df2)
out: pandas.core.series.Series
I want df2 to be a dataframe with the columns 'JobTitle' and 'TotalPay'
Breaking down your code:
df2 = df.groupby(['JobTitle'])['TotalPay'].mean()
The groupby is fine. It's the ['TotalPay'] that is the misstep. That is telling the groupby to only execute the mean function on the pd.Series df['TotalPay'] for each group defined by ['JobTitle']. Instead, you want to refer to this column with [['TotalPay']]. Notice the double brackets. Those double brackets say pd.DataFrame.
Recap
df2 = df.groupby(['JobTitle'])[['TotalPay']].mean()
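To illustrate on a toy frame (hypothetical titles and pay, not the Kaggle data): single brackets give a Series, double brackets give a DataFrame, and as_index=False is another common variant that keeps JobTitle as a regular column instead of the index:

```python
import pandas as pd

df = pd.DataFrame({
    'JobTitle': ['CAPTAIN', 'CAPTAIN', 'MECHANIC'],
    'TotalPay': [100.0, 200.0, 50.0],
})

s = df.groupby(['JobTitle'])['TotalPay'].mean()      # Series
df2 = df.groupby(['JobTitle'])[['TotalPay']].mean()  # DataFrame, JobTitle as index
df3 = df.groupby('JobTitle', as_index=False)['TotalPay'].mean()  # JobTitle as column
print(df3)
```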

How can I count a resampled multi-indexed dataframe in pandas

I found this description of how to resample a multi-index:
Resampling Within a Pandas MultiIndex
However, as soon as I use count instead of sum, the solution no longer works.
This might be related to: Resampling with 'how=count' causing problems
Not working (count with string values):
values_a = [1]*16
states = ['Georgia']*8 + ['Alabama']*8
#cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
dates = pd.DatetimeIndex([datetime.datetime(2012,1,1) + datetime.timedelta(days=i) for i in range(4)]*4)
df2 = pd.DataFrame(
    {'value_a': values_a},
    index=[states, dates])
df2.index.names = ['State', 'Date']
df2.reset_index(level=[0], inplace=True)
print(df2.groupby(['State']).resample('W', how='count'))
Yields:
2012-01-01 2012-01-08
State value_a State value_a
State
Alabama 2 2 6 6
Georgia 2 2 6 6
The working version with sum and numbers as values
values_a = [1]*16
states = ['Georgia']*8 + ['Alabama']*8
#cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
dates = pd.DatetimeIndex([datetime.datetime(2012,1,1) + datetime.timedelta(days=i) for i in range(4)]*4)
df2 = pd.DataFrame(
    {'value_a': values_a},
    index=[states, dates])
df2.index.names = ['State', 'Date']
df2.reset_index(level=[0], inplace=True)
print(df2.groupby(['State']).resample('W', how='sum'))
Yields (notice no duplication of 'State'):
value_a
State Date
Alabama 2012-01-01 2
2012-01-08 6
Georgia 2012-01-01 2
2012-01-08 6
When using count, State isn't a nuisance column (count can handle strings), so the resample applies count to it as well (although the output is not what I would expect). You could do something like the following (tell it to apply count only to value_a):
>>> print df2.groupby(['State']).resample('W',how={'value_a':'count'})
value_a
State Date
Alabama 2012-01-01 2
2012-01-08 6
Georgia 2012-01-01 2
2012-01-08 6
Or more generally, you can apply different kinds of how to different columns:
>>> print df2.groupby(['State']).resample('W',how={'value_a':'count','State':'last'})
State value_a
State Date
Alabama 2012-01-01 Alabama 2
2012-01-08 Alabama 6
Georgia 2012-01-01 Georgia 2
2012-01-08 Georgia 6
So while the above allows you to count a resampled multi-index dataframe, it doesn't explain the behavior of the output from how='count'. The following is closer to the way I would expect it to behave:
print df2.groupby(['State']).resample('W',how={'value_a':'count','State':'count'})
State value_a
State Date
Alabama 2012-01-01 2 2
2012-01-08 6 6
Georgia 2012-01-01 2 2
2012-01-08 6 6
@Karl D's solution is correct; this will be possible in 0.14/master (releasing shortly), see docs here
In [118]: df2.groupby([pd.Grouper(level='Date',freq='W'),'State']).count()
Out[118]:
value_a
Date State
2012-01-01 Alabama 2
Georgia 2
2012-01-08 Alabama 6
Georgia 6
Prior to 0.14 it was difficult to groupby / resample with a time based grouper and another grouper. pd.Grouper allows a very flexible specification to do this.
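The how= keyword was removed in later pandas (resample(...).count() replaced it), but the pd.Grouper form above still runs today. A self-contained modern sketch of that approach, assuming a current pandas version:

```python
import datetime
import pandas as pd

# same sample data as the question: 4 dates repeated, 8 rows per state
values_a = [1] * 16
states = ['Georgia'] * 8 + ['Alabama'] * 8
dates = pd.DatetimeIndex(
    [datetime.datetime(2012, 1, 1) + datetime.timedelta(days=i) for i in range(4)] * 4)

df2 = pd.DataFrame({'value_a': values_a}, index=[states, dates])
df2.index.names = ['State', 'Date']

# weekly buckets on the Date level, crossed with the State level
out = df2.groupby([pd.Grouper(level='Date', freq='W'), 'State']).count()
print(out)
```

Since 2012-01-01 was a Sunday, each state gets 2 rows in the week ending 2012-01-01 and 6 rows in the week ending 2012-01-08, matching the output shown above.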
