Create a trip report with end latitude and longitude - python

Please help, I have a data set structured like below
import pandas as pd

ss = {'ride_id': {0: 'ride1', 1: 'ride1', 2: 'ride1', 3: 'ride2', 4: 'ride2',
                  5: 'ride2', 6: 'ride2', 7: 'ride3', 8: 'ride3', 9: 'ride3', 10: 'ride3'},
      'lat': {0: 5.616526, 1: 5.623686, 2: 5.616555, 3: 5.616556, 4: 5.613834,
              5: 5.612899, 6: 5.610804, 7: 5.616614, 8: 5.644431, 9: 5.650771, 10: 5.610828},
      'long': {0: -0.231901, 1: -0.227248, 2: -0.23192, 3: -0.23168, 4: -0.223812,
               5: -0.22869, 6: -0.226193, 7: -0.231461, 8: -0.237549, 9: -0.271337, 10: -0.226157},
      'distance': {0: 0.0, 1: 90.021, 2: 138.0751, 3: 0.0, 4: 90.0041, 5: 180.0293,
                   6: 180.562, 7: 0.0, 8: 90.004, 9: 180.0209, 10: 189.0702}}
df = pd.DataFrame(ss)
The ride_id column identifies the points that make up each ride, and consecutive points within a ride form its trips.
For example, ride1 consists of 2 trips: the first trip starts at index 0 and ends at index 1, then trip 2 starts at index 1 and ends at index 2.
I want to create a new data frame of trip reports, where each row has the start coordinates (lat, long), the trip's end coordinates (end_lat, end_long) taken from the next row, and then the distance. The result should look like the data frame below:
sf = {'ride_id': {0: 'ride1', 1: 'ride1', 2: 'ride2', 3: 'ride2', 4: 'ride2'},
      'lat': {0: 5.616526, 1: 5.623686, 2: 5.616556, 3: 5.613834, 4: 5.612899},
      'long': {0: -0.231901, 1: -0.227248, 2: -0.23168, 3: -0.223812, 4: -0.22869},
      'end_lat': {0: 5.623686, 1: 5.616555, 2: 5.613834, 3: 5.612899, 4: 5.610804},
      'end_long': {0: -0.227248, 1: -0.23192, 2: -0.223812, 3: -0.22869, 4: -0.226193},
      'distance': {0: 90.021, 1: 138.0751, 2: 90.0041, 3: 180.0293, 4: 180.5621}}
df_s = pd.DataFrame(sf)
df_s
OUT:
ride_id lat long end_lat end_long distance
0 ride1 5.616526 -0.231901 5.623686 -0.227248 90.0210
1 ride1 5.623686 -0.227248 5.616555 -0.231920 138.0751
2 ride2 5.616556 -0.231680 5.613834 -0.223812 90.0041
3 ride2 5.613834 -0.223812 5.612899 -0.228690 180.0293
4 ride2 5.612899 -0.228690 5.610804 -0.226193 180.5621
I tried grouping the data frame by ride_id to isolate each ride, but I'm stuck; any ideas are warmly welcomed.

We can do groupby with shift, then dropna:
# within each ride, shift the coordinates down one row so each row
# also carries the previous point's coordinates as the trip start
df['start_lat'] = df.groupby('ride_id')['lat'].shift()
df['start_long'] = df.groupby('ride_id')['long'].shift()
df = df.dropna()  # the first row of each ride has no previous point
df
Out[480]:
ride_id lat long distance start_lat start_long
1 ride1 5.623686 -0.227248 90.0210 5.616526 -0.231901
2 ride1 5.616555 -0.231920 138.0751 5.623686 -0.227248
4 ride2 5.613834 -0.223812 90.0041 5.616556 -0.231680
5 ride2 5.612899 -0.228690 180.0293 5.613834 -0.223812
6 ride2 5.610804 -0.226193 180.5620 5.612899 -0.228690
8 ride3 5.644431 -0.237549 90.0040 5.616614 -0.231461
9 ride3 5.650771 -0.271337 180.0209 5.644431 -0.237549
10 ride3 5.610828 -0.226157 189.0702 5.650771 -0.271337
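Alternatively, to get the exact column layout from the question, a sketch using shift(-1) so the end coordinates and the distance come from the next row of each ride:
import pandas as pd

df = pd.DataFrame(ss)

# pull the *next* row's coordinates into the current row, per ride
df['end_lat'] = df.groupby('ride_id')['lat'].shift(-1)
df['end_long'] = df.groupby('ride_id')['long'].shift(-1)

# the distance of a trip is recorded on its end row, so shift it up too
df['distance'] = df.groupby('ride_id')['distance'].shift(-1)

# the last point of each ride starts no trip; drop it
df_s = df.dropna().reset_index(drop=True)
This should yield the ride1/ride2 rows shown in the expected output above, plus the corresponding rows for ride3.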

Plot count of unique values in Python

I have a data frame that is similar to the following:
Time Account_ID Device_ID Zip_Code
0 2011-02-02 12:02:19 ABC123 A12345 83420
1 2011-02-02 13:35:12 EFG456 B98765 37865
2 2011-02-02 13:54:57 EFG456 B98765 37865
3 2011-02-02 14:45:20 EFG456 C24568 37865
4 2011-02-02 15:08:58 ABC123 A12345 83420
5 2011-02-02 15:25:17 HIJ789 G97352 97452
How do I make a plot with the count of unique account IDs on the y-axis and the number of unique device IDs associated with a single account ID on the x-axis?
So in this instance the "1" bin on the x-axis would have a height of 2, since accounts "ABC123" and "HIJ789" each have only 1 unique device ID, and the "2" bin would have a height of 1, since account "EFG456" has two unique device IDs associated with it.
EDIT
This is the output I got from trying
df.groupby("Account_ID")["Device_ID"].nunique().value_counts().plot.bar()
You can combine groupby nunique and value_counts like this:
df.groupby("Account_ID")["Device_ID"].nunique().value_counts().plot.bar()
Edit:
Code used to recreate your data:
import pandas as pd

df = pd.DataFrame({'Time': {0: '2011-02-02 12:02:19', 1: '2011-02-02 13:35:12', 2: '2011-02-02 13:54:57',
                            3: '2011-02-02 14:45:20', 4: '2011-02-02 15:08:58', 5: '2011-02-02 15:25:17'},
                   'Account_ID': {0: 'ABC123', 1: 'EFG456', 2: 'EFG456', 3: 'EFG456', 4: 'ABC123', 5: 'HIJ789'},
                   'Device_ID': {0: 'A12345', 1: 'B98765', 2: 'B98765', 3: 'C24568', 4: 'A12345', 5: 'G97352'},
                   'Zip_Code': {0: 83420, 1: 37865, 2: 37865, 3: 37865, 4: 83420, 5: 97452}})
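For reference, here is what each step produces on this sample data (a sketch; the intermediate names are just for illustration):
per_account = df.groupby("Account_ID")["Device_ID"].nunique()
# ABC123 -> 1, EFG456 -> 2, HIJ789 -> 1

bin_heights = per_account.value_counts()
# 1 -> 2 accounts, 2 -> 1 account

bin_heights.plot.bar()  # x: unique devices per account, y: number of accounts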

Pandas Groupby and generate "duplicate" columns for each groupby value

I have a vertical data frame that I am looking to make more horizontal by "duplicating" columns for each item in the groupby column.
I have the following data frame:
import pandas as pd

df = pd.DataFrame({'posteam': {0: 'ARI', 1: 'ARI', 2: 'ARI', 3: 'ARI', 4: 'ARI'},
                   'offense_grouping': {0: 'personnel_00',
                                        1: 'personnel_01',
                                        2: 'personnel_02',
                                        3: 'personnel_10',
                                        4: 'personnel_11'},
                   'snap_ct': {0: 1, 1: 6, 2: 4, 3: 396, 4: 1441},
                   'personnel_epa': {0: 0.1539720594882965,
                                     1: 0.7805194854736328,
                                     2: -0.2678736448287964,
                                     3: 0.1886662095785141,
                                     4: 0.005721719935536385}})
And in its current state, there are 5 duplicate values in the 'posteam' column and 5 different values in the 'offense_grouping' column. Ideally, I would like to group by 'posteam' (so the team only has one row) and by 'offense_grouping'. Each 'offense_grouping' value corresponds to a 'snap_ct' value and a 'personnel_epa' value. I would like the end result of this group to be something like this:
posteam  personnel_00_snap_ct  personnel_00_personnel_epa  personnel_01_snap_ct  personnel_01_personnel_epa  personnel_02_snap_ct  personnel_02_personnel_epa
ARI      1                     .1539...                    6                     .7805...                    4                     -.2679
And so on. How can this be achieved?
Given the data you provide, the following would give the expected result. But there might be more complex cases in your data.
z = (
    df
    .set_index(['posteam', 'offense_grouping'])
    .unstack('offense_grouping')
    .swaplevel(axis=1)
    .sort_index(axis=1, ascending=[True, False])
)

# or, alternatively (might be better if you have multiple values
# for some given indices/columns):
z = (
    df
    .pivot_table(index='posteam', columns='offense_grouping',
                 values=['snap_ct', 'personnel_epa'])
    .swaplevel(axis=1)
    .sort_index(axis=1, ascending=[True, False])
)
>>> z
offense_grouping personnel_00 personnel_01 \
snap_ct personnel_epa snap_ct personnel_epa
posteam
ARI 1 0.153972 6 0.780519
offense_grouping personnel_02 personnel_10 \
snap_ct personnel_epa snap_ct personnel_epa
posteam
ARI 4 -0.267874 396 0.188666
offense_grouping personnel_11
snap_ct personnel_epa
posteam
ARI 1441 0.005722
Then you can join the two levels of columns:
res = z.set_axis([f'{b}_{a}' for a, b in z.columns], axis=1)
>>> res
snap_ct_personnel_00 personnel_epa_personnel_00 snap_ct_personnel_01 personnel_epa_personnel_01 snap_ct_personnel_02 personnel_epa_personnel_02 snap_ct_personnel_10 personnel_epa_personnel_10 snap_ct_personnel_11 personnel_epa_personnel_11
posteam
ARI 1 0.153972 6 0.780519 4 -0.267874 396 0.188666 1441 0.005722
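Note that this produces names like snap_ct_personnel_00. If you want the personnel_00_snap_ct order shown in the question, swap the two fields in the f-string:
res = z.set_axis([f'{a}_{b}' for a, b in z.columns], axis=1)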

Problem with pandas.Series.cat.rename_categories - getting 'categories must be unique' error

I'm new to Python and working through some exercises.
I have a column in my data called 'sequels' (for books) with numbers 1 through to 8.
I want to make a new column called 'sequelcategory' which relabels the numbers: 1 should be renamed to 'original' and anything else should be renamed to 'sequel'. The exercise suggests that I use pd.Series.cat.rename_categories to do this.
The first hurdle I overcame was an error that said I needed categorical data (the column was initially int64); I fixed that with:
bookdata['sequels'] = bookdata['sequels'].astype('category')
That was all well and good. I then set to creating my new column:
bookdata["sequelcategory"] = bookdata["sequels"].cat.rename_categories({1: 'original', 2: 'sequel'})
The above works absolutely fine; the problem I am having is that I also want numbers 3 to 8 to be relabelled 'sequel', meaning that the below:
bookdata["sequelcategory"] = bookdata["sequels"].cat.rename_categories({1: 'original', 2: 'sequel', 3: 'sequel', 4: 'sequel', 5: 'sequel', 6: 'sequel', 7: 'sequel', 8: 'sequel'})
...returns the error: ValueError: Categorical categories must be unique.
Anyone have some advice on the above? I know there are probably 101 other ways to do this, but I am being told I need to do it with pandas.Series.cat.rename_categories and can't for the life of me work it out.
Any help would be greatly appreciated!
rename_categories requires the resulting categories to be unique, so a many-to-one mapping like {2: 'sequel', 3: 'sequel', ...} fails. We could instead map the values before setting them as category:
bookdata = pd.DataFrame({'book series': [1, 2, 3, 4, 5, 1, 1, 2, 6, 8]})
bookdata
###
book series
0 1
1 2
2 3
3 4
4 5
5 1
6 1
7 2
8 6
9 8
map_dict = {1: 'original', 2: 'sequel', 3: 'sequel', 4: 'sequel', 5: 'sequel', 6: 'sequel', 7: 'sequel', 8: 'sequel'}
bookdata['sequelcategory'] = bookdata['book series'].map(map_dict).astype('category')
bookdata
###
book series sequelcategory
0 1 original
1 2 sequel
2 3 sequel
3 4 sequel
4 5 sequel
5 1 original
6 1 original
7 2 sequel
8 6 sequel
9 8 sequel
bookdata.info()
###
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   book series     10 non-null     int64
 1   sequelcategory  10 non-null     category
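If the exercise really insists on pandas.Series.cat.rename_categories, one workaround (a sketch against the bookdata frame above) is to first collapse everything other than 1 to a single placeholder value, so the renamed categories stay unique:
# collapse 2..8 to the single value 2, keeping 1 as-is
collapsed = bookdata['book series'].where(bookdata['book series'].eq(1), 2)

# now the mapping {1, 2} -> {'original', 'sequel'} is one-to-one
bookdata['sequelcategory'] = (
    collapsed.astype('category').cat.rename_categories({1: 'original', 2: 'sequel'})
)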

How to do conditional operations on columns in python pandas?

I'm trying to write code that calculates the variation of "prod" ("rgdpna"/"emp") relative to one specific year, in an Excel data set that contains data from several countries, and I need to do it for all of them.
(country, year, rgdpna and emp are the columns from the Excel file)
Country  year  rgdpna  emp  "prod"(rgdpna/emp)  "prodvar"
Brazil   1980  100     12   8.3                 (8.3/8.3) = 1
Brazil   1981  120     12   10                  (10/8.3) = 1.2
Brazil   1982  140     15   9.3                 (9.3/8.3) = 1.1
...
Canada   1980  300     11   27.2                (27.2/27.2) = 1
Canada   1981  327     10   32.7                (32.7/27.2) = 1.2
Canada   1982  500     12   41.6                (41.6/27.2) = 1.5
...
Something like this: "prodvar" = ("prod" when "year" >= 1980) divided by ("prod" when "year" == 1980).
I think I need to do it with a "while" loop, but I'm not sure. So far I have:
df["prod"] = df["rgdpna"].div(df["emp"])
For pandas, avoid writing for and while loops wherever possible.
Try this:
df['prod'] = df.apply(lambda x: x['prod']/df['prod'].loc[(df['year']==1980)&(df['country']==x['country'])].values[0], axis=1)
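A vectorized alternative (a sketch, assuming the same prod, year and country columns): pick out each country's 1980 prod with groupby + transform and divide by it:
# prod where year == 1980, NaN elsewhere; transform('first') broadcasts
# the first non-NaN value back over each country's rows
base = df['prod'].where(df['year'].eq(1980)).groupby(df['country']).transform('first')
df['prodvar'] = df['prod'] / base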
First of all, let's get your data into a complete, minimal example. For that we don't need the intermediate columns, so let's keep only the relevant column, and call it 'value' for clarity's sake:
data_dict = {'country': {0: 'Brazil',
                         1: 'Brazil',
                         2: 'Brazil',
                         3: 'Canada',
                         4: 'Canada',
                         5: 'Canada'},
             'value': {0: 8.3, 1: 10, 2: 9.3, 3: 27.2, 4: 32.7, 5: 41.6},
             'year': {0: 1980.0, 1: 1981.0, 2: 1982.0, 3: 1980.0, 4: 1981.0, 5: 1982.0}}
df = pd.DataFrame(data_dict)
(I'm also using clear column names in the rest of this answer, even if they're long)
Secondly, we will create an intermediate values column, that just holds the value when year is 1980:
df['value_1980'] = df.apply(lambda row: df.set_index(['year','country']).loc[1980]['value'][row['country']], axis=1)
Finally, we just divide the two, as in your example:
df['value_relative_to_1980'] = df['value'] / df['value_1980']
Check the result.
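For the sample data this comes out to roughly the following (working the divisions by hand):
  country  value    year  value_1980  value_relative_to_1980
0  Brazil    8.3  1980.0         8.3                    1.00
1  Brazil   10.0  1981.0         8.3                    1.20
2  Brazil    9.3  1982.0         8.3                    1.12
3  Canada   27.2  1980.0        27.2                    1.00
4  Canada   32.7  1981.0        27.2                    1.20
5  Canada   41.6  1982.0        27.2                    1.53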

Annotating line chart with data values

I have a plot with multiple line charts. I would like to display the data value for each point in each line, as it appears in my pandas dataframe. However, I am having difficulty annotating these data points.
This is a sample of the code I have written to try and solve this issue:
ax = weekdays.plot(marker='o', label='Conversion rates between signup states')
ax.set_xticklabels(['original', 'sunday', 'monday',
                    'tuesday', 'wednesday', 'thursday',
                    'friday', 'saturday'])
for i in weekdays.values:
    ax.text(str(i), xy=i)
Here is a sample of my data (from weekdays dataframe). I returned it as a dictionary for ease of reading:
{'filter': {0: 'original',
1: 'sunday',
2: 'monday',
3: 'tuesday',
4: 'wednesday',
5: 'thursday',
6: 'friday',
7: 'saturday'},
'session_to_leads': {0: 16.28,
1: 13.88,
2: 13.63,
3: 15.110000000000001,
4: 13.469999999999999,
5: 13.54,
6: 12.58,
7: 12.82},
'leads_to_opps': {0: 9.47,
1: 6.279999999999999,
2: 7.62,
3: 8.6,
4: 7.5600000000000005,
5: 7.9,
6: 7.08,
7: 5.7299999999999995},
'opps_to_complete': {0: 1.92,
1: 0.86,
2: 1.3599999999999999,
3: 1.69,
4: 1.3599999999999999,
5: 1.48,
6: 1.51,
7: 0.88}}
You can try it in a different way with plotly. You can first generate a new data frame with 3 columns using the following code:
import numpy as np
import pandas as pd

values = weekdays.T.values[1:].ravel()
idx = weekdays.T.values[0].ravel().tolist() * 3
cols = weekdays.columns[1:]
cols_ = []
for col in cols:
    cols_.append([col] * 8)
cols_ = np.array(cols_).ravel()
weekdays_ = pd.DataFrame({'days': idx, 'values': values, 'cols': cols_})
The output looks like this:
days values cols
0 original 16.28 session_to_leads
1 sunday 13.88 session_to_leads
2 monday 13.63 session_to_leads
3 tuesday 15.11 session_to_leads
4 wednesday 13.47 session_to_leads
5 thursday 13.54 session_to_leads
6 friday 12.58 session_to_leads
7 saturday 12.82 session_to_leads
8 original 9.47 leads_to_opps
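As an aside, the same reshaping can be done in one step with pd.melt (a sketch; column order aside, it is equivalent to the loop above):
weekdays_ = (weekdays.melt(id_vars='filter', var_name='cols', value_name='values')
                     .rename(columns={'filter': 'days'}))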
Now use the following to get a plot:
import plotly_express as px # or import plotly.express as px
px.line(weekdays_, x='days', y='values', color='cols')
which produces the following interactive plot.
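If you would rather stay with matplotlib, here is a minimal annotation sketch over the original plot (assuming the weekdays frame from the question, with filter as a plain column):
import matplotlib.pyplot as plt

plot_df = weekdays.set_index('filter')  # x positions become 0..7
ax = plot_df.plot(marker='o')

# label every point of every line with its value
for x, (_, row) in enumerate(plot_df.iterrows()):
    for y in row:
        ax.annotate(f'{y:.2f}', (x, y), textcoords='offset points',
                    xytext=(0, 6), ha='center')
plt.show()
Unlike ax.text, ax.annotate takes the label first and the (x, y) point second, which is what the original attempt tripped over.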
