How to fill missing GPS data in pandas?

I have a data frame that looks something like this
+-----+------------+-------------+-------------------------+----+----------+----------+
| | Actual_Lat | Actual_Long | Time | ID | Cal_long | Cal_lat |
+-----+------------+-------------+-------------------------+----+----------+----------+
| 0 | 63.433376 | 10.397068 | 2019-09-30 04:48:13.540 | 11 | 10.39729 | 63.43338 |
| 1 | 63.433301 | 10.395846 | 2019-09-30 04:48:18.470 | 11 | 10.39731 | 63.43326 |
| 2 | 63.433259 | 10.394543 | 2019-09-30 04:48:23.450 | 11 | 10.39576 | 63.43323 |
| 3 | 63.433258 | 10.394244 | 2019-09-30 04:48:29.500 | 11 | 10.39555 | 63.43436 |
| 4 | 63.433258 | 10.394215 | 2019-09-30 04:48:35.683 | 11 | 10.39505 | 63.43427 |
| ... | ... | ... | ... | ...| ... | ... |
| 70 | NaN | NaN | NaT | NaN| 10.35826 | 63.43149 |
| 71 | NaN | NaN | NaT | NaN| 10.35809 | 63.43155 |
| 72 | NaN | NaN | NaT | NaN| 10.35772 | 63.43163 |
| 73 | NaN | NaN | NaT | NaN| 10.35646 | 63.43182 |
| 74 | NaN | NaN | NaT | NaN| 10.35536 | 63.43196 |
+-----+------------+-------------+-------------------------+----+----------+----------+
Actual_Lat and Actual_Long contain GPS coordinates obtained from a GPS device. Cal_lat and Cal_long are GPS coordinates obtained from OSRM's API. As you can see, a lot of data is missing in the actual coordinates. I am looking to get a data set such that when I take the difference of Actual_Lat vs Cal_lat it is zero, or at least close to zero. I tried to fill these missing values with the destination lat/long, but that results in a huge difference. My question is: how can I fill these values using Python/pandas so that, where the vehicle followed the OSRM-estimated path, the difference between the actual lat/long and the estimated lat/long is zero or close to zero? I am new to GIS data sets and have no idea how to deal with them.
EDIT: I am looking for something like this.
+-----+------------+-------------+-------------------------+----------+----------+----------+----------------------+----------------------+
| | Actual_Lat | Actual_Long | Time | Tour ID | Cal_long | Cal_lat | coordinates_diff_Lat | coordinates_diff_Lon |
+-----+------------+-------------+-------------------------+----------+----------+----------+----------------------+----------------------+
| 0 | 63.433376 | 10.397068 | 2019-09-30 04:48:13.540 | 11 | 10.39729 | 63.43338 | -0.000 | -0.000 |
| 1 | 63.433301 | 10.395846 | 2019-09-30 04:48:18.470 | 11 | 10.39731 | 63.43326 | 0.000 | -0.001 |
| 2 | 63.433259 | 10.394543 | 2019-09-30 04:48:23.450 | 11 | 10.39576 | 63.43323 | 0.000 | -0.001 |
| 3 | 63.433258 | 10.394244 | 2019-09-30 04:48:29.500 | 11 | 10.39555 | 63.43436 | -0.001 | -0.001 |
| 4 | 63.433258 | 10.394215 | 2019-09-30 04:48:35.683 | 11 | 10.39505 | 63.43427 | -0.001 | -0.001 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 70 | 63.43000 | 10.35800 | NaT | 115268.0 | 10.35826 | 63.43149 | 0.000 | -0.003 |
| 71 | 63.43025 | 10.35888 | NaT | 115268.0 | 10.35809 | 63.43155 | 0.000 | -0.003 |
| 72 | 63.43052 | 10.35713 | NaT | 115268.0 | 10.35772 | 63.43163 | 0.000 | -0.002 |
| 73 | 63.43159 | 10.35633 | NaT | 115268.0 | 10.35646 | 63.43182 | 0.000 | -0.001 |
| 74 | 63.43197 | 10.35537 | NaT | 115268.0 | 10.35536 | 63.43196 | 0.000 | 0.000 |
+-----+------------+-------------+-------------------------+----------+----------+----------+----------------------+----------------------+
Note that (63.43197, 10.35537) is the destination and (63.433376, 10.397068) is the starting position. All these points represent road coordinates.

IIUC, you need something like this:
I am taking the columns out of df as lists first:

actual_lat = df['Actual_Lat'].dropna().tolist()  # only the rows with a real reading
cal_lat = df['Cal_lat'].tolist()

# stretch the shorter actual list over the full length of the calculated list
div = float(len(cal_lat)) / float(len(actual_lat))
new_l = []
for i in range(len(cal_lat)):
    new_l.append(actual_lat[int(i / div)])
print(new_l)
len(new_l)  # same length as cal_lat
Do the same with the longitude columns.
Since these are GPS points, you can work to an accuracy of about 3 decimal places when taking the difference. Keeping this in mind, starting from Actual_Lat and Actual_Long, if each filled value repeats the nearest actual point, the difference won't be much greater than zero.
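For the final comparison, the difference columns from your desired output could then be computed and rounded to 3 decimals (a small sketch, assuming the filled lists have been put back into df):

df['coordinates_diff_Lat'] = (df['Actual_Lat'] - df['Cal_lat']).round(3)
df['coordinates_diff_Lon'] = (df['Actual_Long'] - df['Cal_long']).round(3)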
Hopefully, I made sense and you have your solution.

You need pandas.DataFrame.where.
Let's say your dataframe is df, then you can do:
df.Actual_Lat = df.Actual_Lat.where(~df.Actual_Lat.isna(), df.Cal_lat)
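The same pattern covers the longitude column; equivalently, Series.fillna does it in one step (a sketch using the column names from the question):

df['Actual_Lat'] = df['Actual_Lat'].fillna(df['Cal_lat'])
df['Actual_Long'] = df['Actual_Long'].fillna(df['Cal_long'])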

Related

Order matrix with numpy.triu and non-NaN values

I have a dataset where I need to do a transformation to get an upper-triangular matrix. My matrix has this format:
           |      1 |       2 |      3 |
01/01/1999 |    nan |  582.96 |    nan |
02/01/1999 |    nan |  589.78 |  78.47 |
03/01/1999 |    nan |  588.74 |  79.41 |
...        |        |         |        |
01/01/2022 | 752.14 | 1005.78 | 193.47 |
02/01/2022 | 754.14 |  997.57 | 192.99 |
I use DataFrame.T to get my dates as columns, but I also need the rows to be ordered by their non-NaN values.
  | 01/01/1999 | 02/01/1999 | 03/01/1999 | ... | 01/01/2022 | 02/01/2022 |
2 |     582.96 |     589.78 |     588.74 | ... |    1005.78 |     997.57 |
3 |        nan |      78.47 |      79.41 | ... |     193.47 |     192.99 |
1 |        nan |        nan |        nan | ... |     752.14 |     754.14 |
I tried different combinations of numpy.triu, sort_values and DataFrame.T, but without success.
My main goal is to get this format; getting it with good performance would be nice too, because my data is big.
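One possible approach (a sketch on a toy frame shaped like the sample above, not a definitive answer): transpose, then reorder the rows by how many NaNs each contains, fewest first.

import numpy as np
import pandas as pd

# toy frame: dates as the index, series 1..3 as columns
df = pd.DataFrame(
    {1: [np.nan, np.nan, np.nan], 2: [582.96, 589.78, 588.74], 3: [np.nan, 78.47, 79.41]},
    index=['01/01/1999', '02/01/1999', '03/01/1999'],
)

out = df.T  # dates become the columns
# rows with fewer NaNs first -> the staircase / upper-triangular layout
out = out.loc[out.isna().sum(axis=1).sort_values(kind='stable').index]
print(out)

Counting the NaNs is a single vectorized pass and the sort is O(n log n) in the number of rows, so this should stay cheap even on big data.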

sumif and countif in Python for multiple columns, on row level and not column level

I'm trying to figure out a way to do:
COUNTIF(Col2,Col4,Col6,Col8,Col10,Col12,Col14,Col16,Col18,">=0.05")
SUMIF(Col2,Col4,Col6,Col8,Col10,Col12,Col14,Col16,Col18,">=0.05")
My attempt:
import pandas as pd
df=pd.read_excel(r'C:\\Users\\Downloads\\Prepped.xls') #Please use: https://github.com/BeboGhattas/temp-repo/blob/main/Prepped.xls
df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float) #changing dtype to float
#unconditional sum
df['sum']=df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float).sum(axis=1)
Whatever goes below won't work:
#sum if
df['greater-than-0.05']=df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float).sum([c for c in col if c >= 0.05])
| | # | word | B64684807 | B64684807Measure | B649845471 | B649845471Measure | B83344143 | B83344143Measure | B67400624 | B67400624Measure | B85229235 | B85229235Measure | B85630406 | B85630406Measure | B82615898 | B82615898Measure | B87558236 | B87558236Measure | B00000009 | B00000009Measure | 有效竞品数 | 关键词抓取时间 | 搜索量排名 | 月搜索量 | 在售商品数 | 竞争度 |
|---:|----:|:--------|------------:|:-------------------|-------------:|:-------------------------|------------:|:-------------------------|------------:|:-------------------|------------:|:-------------------|------------:|:-------------------|------------:|:-------------------|------------:|-------------------:|------------:|:-------------------|-------------:|:--------------------|-------------:|-----------:|-------------:|---------:|
| 0 | 1 | word 1 | 0.055639 | [主要流量词] | 0.049416 | nan | 0.072298 | [精准流量词, 主要流量词] | 0.00211 | nan | 0.004251 | nan | 0.007254 | nan | 0.074409 | [主要流量词] | 0.033597 | nan | 0.000892 | nan | 9 | 2022-10-06 00:53:56 | 5726 | 326188 | 3810 | 0.01 |
| 1 | 2 | word 2 | 0.045098 | nan | 0.005472 | nan | 0.010791 | nan | 0.072859 | [主要流量词] | 0.003423 | nan | 0.012464 | nan | 0.027396 | nan | 0.002825 | nan | 0.060989 | [主要流量词] | 9 | 2022-10-07 01:16:21 | 9280 | 213477 | 40187 | 0.19 |
| 2 | 3 | word 3 | 0.02186 | nan | 0.05039 | [主要流量词] | 0.007842 | nan | 0.028832 | nan | 0.044385 | [精准流量词] | 0.001135 | nan | 0.003866 | nan | 0.021035 | nan | 0.017202 | nan | 9 | 2022-10-07 00:28:31 | 24024 | 81991 | 2275 | 0.03 |
| 3 | 4 | word 4 | 0.000699 | nan | 0.01038 | nan | 0.001536 | nan | 0.021512 | nan | 0.007658 | nan | 5e-05 | nan | 0.048682 | nan | 0.001524 | nan | 0.000118 | nan | 9 | 2022-10-07 00:52:12 | 34975 | 53291 | 30970 | 0.58 |
| 4 | 5 | word 5 | 0.00984 | nan | 0.030248 | nan | 0.003006 | nan | 0.014027 | nan | 0.00904 | [精准流量词] | 0.000348 | nan | 0.000414 | nan | 0.006721 | nan | 0.00153 | nan | 9 | 2022-10-07 02:36:05 | 43075 | 41336 | 2230 | 0.05 |
| 5 | 6 | word 6 | 0.010029 | [精准流量词] | 0.120739 | [精准流量词, 主要流量词] | 0.014359 | nan | 0.002796 | nan | 0.002883 | nan | 0.028747 | [精准流量词] | 0.007022 | nan | 0.017803 | nan | 0.001998 | nan | 9 | 2022-10-07 00:44:51 | 49361 | 34791 | 517 | 0.01 |
| 6 | 7 | word 7 | 0.002735 | nan | 0.002005 | nan | 0.005355 | nan | 6.3e-05 | nan | 0.000772 | nan | 0.000237 | nan | 0.015149 | nan | 2.1e-05 | nan | 2.3e-05 | nan | 9 | 2022-10-07 09:48:20 | 53703 | 31188 | 511 | 0.02 |
| 7 | 8 | word 8 | 0.003286 | [精准流量词] | 0.058161 | [主要流量词] | 0.013681 | [精准流量词] | 0.000748 | [精准流量词] | 0.002684 | [精准流量词] | 0.013916 | [精准流量词] | 0.029376 | nan | 0.019792 | nan | 0.005602 | nan | 9 | 2022-10-06 01:51:53 | 58664 | 27751 | 625 | 0.02 |
| 8 | 9 | word 9 | 0.004273 | [精准流量词] | 0.025581 | [精准流量词] | 0.014784 | [精准流量词] | 0.00321 | [精准流量词] | 0.000892 | nan | 0.00223 | nan | 0.005315 | nan | 0.02211 | nan | 0.027008 | [精准流量词] | 9 | 2022-10-07 01:34:28 | 73640 | 20326 | 279 | 0.01 |
| 9 | 10 | word 10 | 0.002341 | [精准流量词] | 0.029604 | nan | 0.007817 | [精准流量词] | 0.000515 | [精准流量词] | 0.001865 | [精准流量词] | 0.010128 | [精准流量词] | 0.015378 | nan | 0.019677 | nan | 0.003673 | nan | 9 | 2022-10-07 01:17:44 | 80919 | 17779 | 207 | 0.01 |
So my question is:
How can I do the SUMIF and COUNTIF on this exact table? (It should use col2, col4, ... etc., because every file will have the same format but different headers, so using df['B64684807'] isn't helpful.)
Sample file can be found at:
https://github.com/BeboGhattas/temp-repo/blob/main/Prepped.xls
IIUC, you can use a boolean mask:
df2 = df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float)
m = df2.ge(0.05)
df['countif'] = m.sum(axis=1)
df['sumif'] = df2.where(m).sum(axis=1)
output (last 3 columns only):
sum countif sumif
0 0.299866 3 0.202346
1 0.241317 2 0.133848
2 0.196547 1 0.050390
3 0.092159 0 0.000000
4 0.075174 0 0.000000
5 0.206376 1 0.120739
6 0.026360 0 0.000000
7 0.147246 1 0.058161
8 0.105403 0 0.000000
9 0.090998 0 0.000000
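Here m marks the cells greater than or equal to 0.05: m.sum(axis=1) counts them per row (the COUNTIF), while df2.where(m) blanks out everything else so the row-wise sum is the SUMIF.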

How to get the column values of a Dataframe into another dataframe as a new column after matching the values in columns that both dataframes have?

I'm trying to create a new column in a DataFrame and populate it with values stored in a different DataFrame, by first comparing the values of the columns that both DataFrames have. For example:
df1 >>>
| name | team | week | dates | interceptions | pass_yds | rating |
| ---- | ---- | -----| ---------- | ------------- | --------- | -------- |
| maho | KC | 1 | 2020-09-10 | 0 | 300 | 105 |
| went | PHI | 1 | 2020-09-13 | 2 | 225 | 74 |
| lock | DEN | 1 | 2020-09-14 | 0 | 150 | 89 |
| dris | DEN | 2 | 2020-09-20 | 1 | 220 | 95 |
| went | PHI | 2 | 2020-09-20 | 2 | 250 | 64 |
| maho | KC | 2 | 2020-09-21 | 1 | 245 | 101 |
df2 >>>
| name | team | week | catches | rec_yds | rec_tds |
| ---- | ---- | -----| ------- | ------- | ------- |
| ertz | PHI | 1 | 5 | 58 | 1 |
| fant | DEN | 2 | 6 | 79 | 0 |
| kelc | KC | 2 | 8 | 105 | 1 |
| fant | DEN | 1 | 3 | 29 | 0 |
| kelc | KC | 1 | 6 | 71 | 1 |
| ertz | PHI | 2 | 7 | 91 | 2 |
| goed | PHI | 2 | 2 | 15 | 0 |
I want to create a dates column in df2 holding the values of the dates column in df1, after matching on the team and week columns. After the matching, df2 in this example should look something like this:
df2 >>>
| name | team | week | catches | rec_yds | rec_tds | dates |
| ---- | ---- | -----| ------- | ------- | ------- | ---------- |
| ertz | PHI | 1 | 5 | 58 | 1 | 2020-09-13 |
| fant | DEN | 2 | 6 | 79 | 0 | 2020-09-20 |
| kelc | KC | 2 | 8 | 105 | 1 | 2020-09-20 |
| fant | DEN | 1 | 3 | 29 | 0 | 2020-09-14 |
| kelc | KC | 1 | 6 | 71 | 1 | 2020-09-10 |
| ertz | PHI | 2 | 7 | 91 | 2 | 2020-09-20 |
| goed | PHI | 2 | 2 | 15 | 0 | 2020-09-20 |
I'm looking for an optimal solution. I've already tried nested for loops and comparing the week and team columns from both dataframes together but that hasn't worked. At this point I'm all out of ideas. Please help!
Disclaimer: The actual DataFrames I'm working with are a lot larger. They have a lot more rows, columns, and values (i.e. a lot more teams in the team columns, a lot more dates in the dates columns, and a lot more weeks in the week columns)
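A left merge on the shared keys does this without loops; a minimal sketch, assuming the column names shown above (the drop_duplicates guards against a team appearing twice in the same week of df1, which would otherwise duplicate rows in df2):

import pandas as pd

dates = df1[['team', 'week', 'dates']].drop_duplicates(['team', 'week'])
df2 = df2.merge(dates, on=['team', 'week'], how='left')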

Manipulate pandas columns with datetime

Please see this SO post Manipulating pandas columns
I shared this dataframe:
+----------+------------+-------+-----+------+
| Location | Date | Event | Key | Time |
+----------+------------+-------+-----+------+
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-04 | 1 | a | 2 |
| i2 | 2019-03-15 | 2 | b | 0 |
| i9 | 2019-02-22 | 2 | c | 0 |
| i9 | 2019-03-10 | 3 | d | |
| i9 | 2019-03-10 | 3 | d | 0 |
| s8 | 2019-04-22 | 1 | e | |
| s8 | 2019-04-25 | 1 | e | |
| s8 | 2019-04-28 | 1 | e | 6 |
| t14 | 2019-05-13 | 3 | f | |
+----------+------------+-------+-----+------+
This is a follow-up question. Consider two more columns after Date as shown below.
+-----------------------+----------------------+
| Start Time (hh:mm:ss) | Stop Time (hh:mm:ss) |
+-----------------------+----------------------+
| 13:24:38 | 14:17:39 |
| 03:48:36 | 04:17:20 |
| 04:55:05 | 05:23:48 |
| 08:44:34 | 09:13:15 |
| 19:21:05 | 20:18:57 |
| 21:05:06 | 22:01:50 |
| 14:24:43 | 14:59:37 |
| 07:57:32 | 09:46:21 |
| 19:21:05 | 20:18:57 |
| 21:05:06 | 22:01:50 |
| 14:24:43 | 14:59:37 |
| 07:57:32 | 09:46:21 |
+-----------------------+----------------------+
The task remains the same: to get the time difference, but in hours, between the Stop Time of the first row and the Start Time of the last row for each Key.
Based on the answer, I was trying something like this:
df['Time']=df.groupby(['Location','Event']).Date.\
transform(lambda x : (x.iloc[-1]-x.iloc[0]))[~df.duplicated(['Location','Event'],keep='last')]
df['Time_h']=df.groupby(['Location','Event'])['Start Time (hh:mm:ss)','Stop Time (hh:mm:ss)'].\
transform(lambda x,y : (x.iloc[-1]-y.iloc[0]))[~df.duplicated(['Location','Event'],keep='last')] # This gives an error on transform
to get the difference in days and hours separately and then combine. Is there a better way?
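One way that avoids computing days and hours separately (a sketch, assuming Date and the two time columns parse cleanly; your attempt's ['Location','Event'] grouping would work the same way): build full timestamps first, then take the last Start minus the first Stop per Key, expressed in hours.

import pandas as pd

df['Start_dt'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Start Time (hh:mm:ss)'])
df['Stop_dt'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Stop Time (hh:mm:ss)'])

# last row's Start minus first row's Stop per group, in hours
hours = df.groupby('Key').apply(
    lambda g: (g['Start_dt'].iloc[-1] - g['Stop_dt'].iloc[0]) / pd.Timedelta(hours=1)
)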

create rows from database column

Currently I have this data retrieved from the database as follows:
+------------+--------------+-------+-----+-------------+-----------+------------------+-----------------+
| Monitor ID | Casting Date | Label | AGE | Client Name | Project | Average Strength | Average Density |
+------------+--------------+-------+-----+-------------+-----------+------------------+-----------------+
| 1082 | 2018-07-05 | b52 | 1 | Trial Mix | Trial Mix | 21.78 | 2.436 |
| 1082 | 2018-07-05 | b52 | 2 | Trial Mix | Trial Mix | 33.11 | 2.406 |
| 1082 | 2018-07-05 | b52 | 4 | Trial Mix | Trial Mix | 43.11 | 2.447 |
| 1082 | 2018-07-05 | b52 | 8 | Trial Mix | Trial Mix | 48.22 | 2.444 |
| 1083 | 2018-07-05 | B53 | 1 | Trial Mix | Trial Mix | 10.44 | 2.421 |
| 1083 | 2018-07-05 | B53 | 2 | Trial Mix | Trial Mix | 20.0 | 2.400 |
| 1083 | 2018-07-05 | B53 | 4 | Trial Mix | Trial Mix | 27.78 | 2.397 |
| 1083 | 2018-07-05 | B53 | 8 | Trial Mix | Trial Mix | 33.33 | 2.409 |
| 1084 | 2018-07-05 | B54 | 1 | Trial Mix | Trial Mix | 12.89 | 2.430 |
| 1084 | 2018-07-05 | B54 | 2 | Trial Mix | Trial Mix | 24.44 | 2.427 |
| 1084 | 2018-07-05 | B54 | 4 | Trial Mix | Trial Mix | 34.22 | 2.412 |
| 1084 | 2018-07-05 | B54 | 8 | Trial Mix | Trial Mix | 41.56 | 2.501 |
+------------+--------------+-------+-----+-------------+-----------+------------------+-----------------+
How can I change the table to something like this?
+------------+--------------+-------+-----------+-----------+---------+-------------+---------+-------------+---------+-------------+---------+-------------+
| Monitor Id | Casting Date | Label | Client | Project | 1 Day | | 2 Days | | 4 Days | | 8 Days | |
+------------+--------------+-------+-----------+-----------+---------+-------------+---------+-------------+---------+-------------+---------+-------------+
| | | | | | avg str | avg density | avg str | avg density | avg str | avg density | avg str | avg density |
| | | | | | | | | | | | | |
| 1082 | 05/07/2018 | B52 | Trial Mix | Trial Mix | 21.78 | 2.436 | 33.11 | 2.406 | 43.11 | 2.44 | 48.22 | 2.444 |
| 1083 | 05/07/2018 | B53 | Trial Mix | Trial Mix | 10.44 | 2.421 | 20 | 2.4 | 27.78 | 2.397 | 33.33 | 2.409 |
| 1084 | 05/07/2018 | B54 | Trial Mix | Trial Mix | 12.89 | 2.43 | 24.44 | 2.427 | 34.22 | 2.412 | 41.56 | 2.501 |
+------------+--------------+-------+-----------+-----------+---------+-------------+---------+-------------+---------+-------------+---------+-------------+
I get the data by joining multiple tables from the database using peewee.
Below is my full code to retrieve and format the data:
from lib.database import *
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from prettytable import PrettyTable
import numpy as np

# table to hold data
table = PrettyTable()
table.field_names = ['Monitor ID', 'Casting Date', 'Label', 'AGE', 'Client Name',
                     'Project', 'Average Strength', 'Average Density']

# cutoff of 2 weeks ago (renamed from `int`, which shadowed the builtin)
cutoff = datetime.today() - timedelta(days=14)

result = (MonitorCombine
          .select(ResultCombine.strength.alias('str'), ResultCombine.density.alias('density'),
                  ResultCombine.age, MonitorCombine.clientname, MonitorCombine.p_alias,
                  MonitorCombine.monitorid, MonitorCombine.monitor_label,
                  MonitorCombine.casting_date)
          .join(ResultCombine, on=(ResultCombine.monitorid == MonitorCombine.monitorid))
          .dicts()
          .where(MonitorCombine.casting_date > cutoff)
          .order_by(MonitorCombine.monitor_label, ResultCombine.age.asc()))

for r in result:
    table.add_row([r['monitorid'], r['casting_date'], r['monitor_label'], r['age'],
                   r['clientname'], r['p_alias'], r['str'], r['density']])
print(table)
You have to pivot the data. Since MariaDB has no PIVOT operator, you can do it in SQL with conditional aggregation:
SELECT
MonitorID,
CastingDate,
Label,
ClientName,
Project,
SUM(IF(Age=1, AverageStrength, 0)) AS AvgStr1,
SUM(IF(Age=2, AverageStrength, 0)) AS AvgStr2,
SUM(IF(Age=4, AverageStrength, 0)) AS AvgStr4,
SUM(IF(Age=8, AverageStrength, 0)) AS AvgStr8,
SUM(IF(Age=1, AverageDensity, 0)) AS AvgDensity1,
SUM(IF(Age=2, AverageDensity, 0)) AS AvgDensity2,
SUM(IF(Age=4, AverageDensity, 0)) AS AvgDensity4,
SUM(IF(Age=8, AverageDensity, 0)) AS AvgDensity8
FROM
YourTable
GROUP BY MonitorID, CastingDate, Label, ClientName, Project  -- Age must stay out of the GROUP BY so the ages collapse into one row
ORDER BY MonitorID, CastingDate;
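Alternatively, since the rows already come back into Python, pandas can do the pivot; a sketch assuming the query results were loaded into a DataFrame df with the columns shown in the first table:

import pandas as pd

# one row per (monitor, age) in; one row per monitor out,
# with a strength and a density column for each age
wide = df.pivot_table(
    index=['Monitor ID', 'Casting Date', 'Label', 'Client Name', 'Project'],
    columns='AGE',
    values=['Average Strength', 'Average Density'],
)
# flatten the (measure, age) MultiIndex into names like 'Average Strength_1'
wide.columns = [f'{measure}_{age}' for measure, age in wide.columns]
wide = wide.reset_index()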
