create rows from database column - python

Currently I have this data retrieved from the database as follows:
+------------+--------------+-------+-----+-------------+-----------+------------------+-----------------+
| Monitor ID | Casting Date | Label | AGE | Client Name | Project | Average Strength | Average Density |
+------------+--------------+-------+-----+-------------+-----------+------------------+-----------------+
| 1082 | 2018-07-05 | b52 | 1 | Trial Mix | Trial Mix | 21.78 | 2.436 |
| 1082 | 2018-07-05 | b52 | 2 | Trial Mix | Trial Mix | 33.11 | 2.406 |
| 1082 | 2018-07-05 | b52 | 4 | Trial Mix | Trial Mix | 43.11 | 2.447 |
| 1082 | 2018-07-05 | b52 | 8 | Trial Mix | Trial Mix | 48.22 | 2.444 |
| 1083 | 2018-07-05 | B53 | 1 | Trial Mix | Trial Mix | 10.44 | 2.421 |
| 1083 | 2018-07-05 | B53 | 2 | Trial Mix | Trial Mix | 20.0 | 2.400 |
| 1083 | 2018-07-05 | B53 | 4 | Trial Mix | Trial Mix | 27.78 | 2.397 |
| 1083 | 2018-07-05 | B53 | 8 | Trial Mix | Trial Mix | 33.33 | 2.409 |
| 1084 | 2018-07-05 | B54 | 1 | Trial Mix | Trial Mix | 12.89 | 2.430 |
| 1084 | 2018-07-05 | B54 | 2 | Trial Mix | Trial Mix | 24.44 | 2.427 |
| 1084 | 2018-07-05 | B54 | 4 | Trial Mix | Trial Mix | 34.22 | 2.412 |
| 1084 | 2018-07-05 | B54 | 8 | Trial Mix | Trial Mix | 41.56 | 2.501 |
+------------+--------------+-------+-----+-------------+-----------+------------------+-----------------+
How can I change the table to something like this?
+------------+--------------+-------+-----------+-----------+---------+-------------+---------+-------------+---------+-------------+---------+-------------+
| Monitor Id | Casting Date | Label | Client | Project | 1 Day | | 2 Days | | 4 Days | | 8 Days | |
+------------+--------------+-------+-----------+-----------+---------+-------------+---------+-------------+---------+-------------+---------+-------------+
| | | | | | avg str | avg density | avg str | avg density | avg str | avg density | avg str | avg density |
| | | | | | | | | | | | | |
| 1082 | 05/07/2018 | B52 | Trial Mix | Trial Mix | 21.78 | 2.436 | 33.11 | 2.406 | 43.11 | 2.44 | 48.22 | 2.444 |
| 1083 | 05/07/2018 | B53 | Trial Mix | Trial Mix | 10.44 | 2.421 | 20 | 2.4 | 27.78 | 2.397 | 33.33 | 2.409 |
| 1084 | 05/07/2018 | B54 | Trial Mix | Trial Mix | 12.89 | 2.43 | 24.44 | 2.427 | 34.22 | 2.412 | 41.56 | 2.501 |
+------------+--------------+-------+-----------+-----------+---------+-------------+---------+-------------+---------+-------------+---------+-------------+
I get the data by joining multiple tables from the database using peewee.
Below is my full code to retrieve and format the data:
from lib.database import *
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from prettytable import PrettyTable
import numpy as np

# table to hold data
table = PrettyTable()
table.field_names = ['Monitor ID', 'Casting Date', 'Label', 'AGE',
                     'Client Name', 'Project', 'Average Strength', 'Average Density']

# interval of 2 weeks ago (renamed from `int`, which shadows the built-in)
interval = datetime.today() - timedelta(days=14)

result = (MonitorCombine
          .select(ResultCombine.strength.alias('str'),
                  ResultCombine.density.alias('density'),
                  ResultCombine.age,
                  MonitorCombine.clientname,
                  MonitorCombine.p_alias,
                  MonitorCombine.monitorid,
                  MonitorCombine.monitor_label,
                  MonitorCombine.casting_date)
          .join(ResultCombine, on=(ResultCombine.monitorid == MonitorCombine.monitorid))
          .where(MonitorCombine.casting_date > interval)
          .order_by(MonitorCombine.monitor_label, ResultCombine.age.asc())
          .dicts())

for r in result:
    table.add_row([r['monitorid'], r['casting_date'], r['monitor_label'], r['age'],
                   r['clientname'], r['p_alias'], r['str'], r['density']])
print(table)

You have to pivot the data. Since MariaDB has no PIVOT operator, you can do it in SQL with conditional aggregation:
SELECT
    MonitorID,
    CastingDate,
    Label,
    ClientName,
    Project,
    SUM(IF(Age = 1, AverageStrength, 0)) AS AvgStr1,
    SUM(IF(Age = 2, AverageStrength, 0)) AS AvgStr2,
    SUM(IF(Age = 4, AverageStrength, 0)) AS AvgStr4,
    SUM(IF(Age = 8, AverageStrength, 0)) AS AvgStr8,
    SUM(IF(Age = 1, AverageDensity, 0)) AS AvgDensity1,
    SUM(IF(Age = 2, AverageDensity, 0)) AS AvgDensity2,
    SUM(IF(Age = 4, AverageDensity, 0)) AS AvgDensity4,
    SUM(IF(Age = 8, AverageDensity, 0)) AS AvgDensity8
FROM YourTable
GROUP BY MonitorID, CastingDate, Label, ClientName, Project
ORDER BY MonitorID, CastingDate;
Note that Age must not appear in the GROUP BY list, otherwise each age keeps its own row and nothing collapses.
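Alternatively, you can keep the peewee query unchanged and pivot on the Python side. A minimal sketch, reusing result and PrettyTable from the question; the fixed age list [1, 2, 4, 8] is assumed from the sample data:
from collections import OrderedDict
from prettytable import PrettyTable

ages = [1, 2, 4, 8]  # assumed from the sample data

# group the flat rows by monitor, collecting one (strength, density) pair per age
pivot = OrderedDict()
for r in result:
    entry = pivot.setdefault(r['monitorid'], {
        'meta': [r['monitorid'], r['casting_date'], r['monitor_label'],
                 r['clientname'], r['p_alias']],
        'by_age': {},
    })
    entry['by_age'][r['age']] = (r['str'], r['density'])

wide = PrettyTable()
headers = ['Monitor Id', 'Casting Date', 'Label', 'Client', 'Project']
for a in ages:
    headers += ['%d Day avg str' % a, '%d Day avg density' % a]
wide.field_names = headers

for entry in pivot.values():
    cells = list(entry['meta'])
    for a in ages:
        strength, density = entry['by_age'].get(a, ('', ''))
        cells += [strength, density]
    wide.add_row(cells)
print(wide)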

Reading data to python dataframe

I am struggling with reading data into a Python dataframe. I'm an R programmer trying to do stuff in Python. So how would I read the following data into a pandas dataframe? The data is actually the result of calling an API.
Thanks.
b'{"mk_id":"1200011617609","doc_type":"sales_order","opr_code":"0","count_code":"1051885/2022","doc_date":"2022-08-23+02:00","partner":{"mk_id":"400020633177","business_entity":"false","taxpayer":"false","foreign_county":"true","customer":"Emilia Chabadova","street":"Ga\xc5\xa1tanov\xc3\xa1 2915/13","street_number":"2915/13","post_number":"92101","place":"Pie\xc5\xa1\xc5\xa5any","country":"Slovakia","count_code":"5770789334526546744","partner_contact":{"gsm":"+421949340254","email":"emily.chabadova#gmail.com"},"mk_address_id":"400020530565","country_iso_2":"SK","buyer":"true","supplier":"false"},"receiver":{"mk_id":"400020633177","business_entity":"false","taxpayer":"false","foreign_county":"true","customer":"Emilia Chabadova","street":"Ga\xc5\xa1tanov\xc3\xa1 2915/13","street_number":"2915/13","post_number":"92101","place":"Pie\xc5\xa1\xc5\xa5any","country":"Slovakia","count_code":"5770789334526546744","partner_contact":{"gsm":"+421949340254","email":"emily.chabadova#gmail.com"},"mk_address_id":"400020530565","country_iso_2":"SK","buyer":"true","supplier":"false"},"currency_code":"EUR","status_code":"Zaklju\xc4\x8dena","doc_created_email":"stifter.rok#gmail.com","buyer_order":"SK-956103","warehouse":"glavno","delivery_type":"Gls_sk","product_list":[{"count_code":"54","mk_id":"266405022384","code":"MSS","name":"Mousse","unit":"kos","amount":"1","price":"16.66","price_with_tax":"19.99","tax":"200"},{"count_code":"53","mk_id":"266405022383","code":"MIT","name":"Mitt","unit":"kos","amount":"1","price":"0","tax":"200"},{"count_code":"48","mk_id":"266404892511","code":"TM","name":"Tanning mist","name_desc":"TM","unit":"kos","amount":"1","price":"0","tax":"200"}],"extra_column":[{"name":"tracking_number","value":"91114278162"}],"sum_basic":"16.66","sum_tax_200":"3.33","sum_all":"19.99","sum_paid":"19.99","profit_center":"SHINE BROWN, PROIZVODNJA, TRGOVINA IN STORITVE, D.O.O.","bank_ref_number":"10518852022","method_of_payment":"Pla\xc4\x8dilo po povzetju","order_create_ts":"2022-08-23T09:43:00+02:00","created_ts":"2022-08-23T11:59:14+02:00","shipped_date":"2022-08-24+02:00","doc_link_list":[{"mk_id":"266412181173","count_code":"SK-MK-36044","doc_type":"sales_bill_foreign"},{"mk_id":"400015161112","count_code":"1043748/2022","doc_type":"warehouse_packing_list"},{"mk_id":"1200011617609","count_code":"1051885/2022","doc_type":"sales_order"}]}'
You can start by doing something like this:
result = {"mk_id":"1200011617609","doc_type":"sales_order","opr_code":"0","count_code":"1051885/2022","doc_date":"2022-08-23+02:00","partner":{"mk_id":"400020633177","business_entity":"false","taxpayer":"false","foreign_county":"true","customer":"Emilia Chabadova","street":"Ga\xc5\xa1tanov\xc3\xa1 2915/13","street_number":"2915/13","post_number":"92101","place":"Pie\xc5\xa1\xc5\xa5any","country":"Slovakia","count_code":"5770789334526546744","partner_contact":{"gsm":"+421949340254","email":"emily.chabadova#gmail.com"},"mk_address_id":"400020530565","country_iso_2":"SK","buyer":"true","supplier":"false"},"receiver":{"mk_id":"400020633177","business_entity":"false","taxpayer":"false","foreign_county":"true","customer":"Emilia Chabadova","street":"Ga\xc5\xa1tanov\xc3\xa1 2915/13","street_number":"2915/13","post_number":"92101","place":"Pie\xc5\xa1\xc5\xa5any","country":"Slovakia","count_code":"5770789334526546744","partner_contact":{"gsm":"+421949340254","email":"emily.chabadova#gmail.com"},"mk_address_id":"400020530565","country_iso_2":"SK","buyer":"true","supplier":"false"},"currency_code":"EUR","status_code":"Zaklju\xc4\x8dena","doc_created_email":"stifter.rok#gmail.com","buyer_order":"SK-956103","warehouse":"glavno","delivery_type":"Gls_sk","product_list":[{"count_code":"54","mk_id":"266405022384","code":"MSS","name":"Mousse","unit":"kos","amount":"1","price":"16.66","price_with_tax":"19.99","tax":"200"},{"count_code":"53","mk_id":"266405022383","code":"MIT","name":"Mitt","unit":"kos","amount":"1","price":"0","tax":"200"},{"count_code":"48","mk_id":"266404892511","code":"TM","name":"Tanning mist","name_desc":"TM","unit":"kos","amount":"1","price":"0","tax":"200"}],"extra_column":[{"name":"tracking_number","value":"91114278162"}],"sum_basic":"16.66","sum_tax_200":"3.33","sum_all":"19.99","sum_paid":"19.99","profit_center":"SHINE BROWN, PROIZVODNJA, TRGOVINA IN STORITVE, D.O.O.","bank_ref_number":"10518852022","method_of_payment":"Pla\xc4\x8dilo po povzetju","order_create_ts":"2022-08-23T09:43:00+02:00","created_ts":"2022-08-23T11:59:14+02:00","shipped_date":"2022-08-24+02:00","doc_link_list":[{"mk_id":"266412181173","count_code":"SK-MK-36044","doc_type":"sales_bill_foreign"},{"mk_id":"400015161112","count_code":"1043748/2022","doc_type":"warehouse_packing_list"},{"mk_id":"1200011617609","count_code":"1051885/2022","doc_type":"sales_order"}]}
pd.DataFrame([result])
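Since the payload arrives as bytes containing JSON, you could also parse it directly with json.loads instead of pasting the dict. A sketch, assuming data holds the bytes shown in the question and that it is valid UTF-8 JSON:
import json
import pandas as pd

record = json.loads(data)  # json.loads accepts bytes directly on Python 3.6+
df = pd.DataFrame([record])  # one row; nested dicts/lists stay as Python objects
products = pd.json_normalize(record, record_path="product_list")  # one row per product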
Here is a way using BytesIO and json_normalize:
from ast import literal_eval
from io import BytesIO
import pandas as pd
data = b'{"mk_id":"1200011617609","doc_type":"sales_order","opr_code":"0","count_code":"1051885/2022","doc_date":"2022-08-23+02:00","partner":{"mk_id":"400020633177","business_entity":"false","taxpayer":"false","foreign_county":"true","customer":"Emilia Chabadova","street":"Ga\xc5\xa1tanov\xc3\xa1 2915/13","street_number":"2915/13","post_number":"92101","place":"Pie\xc5\xa1\xc5\xa5any","country":"Slovakia","count_code":"5770789334526546744","partner_contact":{"gsm":"+421949340254","email":"emily.chabadova#gmail.com"},"mk_address_id":"400020530565","country_iso_2":"SK","buyer":"true","supplier":"false"},"receiver":{"mk_id":"400020633177","business_entity":"false","taxpayer":"false","foreign_county":"true","customer":"Emilia Chabadova","street":"Ga\xc5\xa1tanov\xc3\xa1 2915/13","street_number":"2915/13","post_number":"92101","place":"Pie\xc5\xa1\xc5\xa5any","country":"Slovakia","count_code":"5770789334526546744","partner_contact":{"gsm":"+421949340254","email":"emily.chabadova#gmail.com"},"mk_address_id":"400020530565","country_iso_2":"SK","buyer":"true","supplier":"false"},"currency_code":"EUR","status_code":"Zaklju\xc4\x8dena","doc_created_email":"stifter.rok#gmail.com","buyer_order":"SK-956103","warehouse":"glavno","delivery_type":"Gls_sk","product_list":[{"count_code":"54","mk_id":"266405022384","code":"MSS","name":"Mousse","unit":"kos","amount":"1","price":"16.66","price_with_tax":"19.99","tax":"200"},{"count_code":"53","mk_id":"266405022383","code":"MIT","name":"Mitt","unit":"kos","amount":"1","price":"0","tax":"200"},{"count_code":"48","mk_id":"266404892511","code":"TM","name":"Tanning mist","name_desc":"TM","unit":"kos","amount":"1","price":"0","tax":"200"}],"extra_column":[{"name":"tracking_number","value":"91114278162"}],"sum_basic":"16.66","sum_tax_200":"3.33","sum_all":"19.99","sum_paid":"19.99","profit_center":"SHINE BROWN, PROIZVODNJA, TRGOVINA IN STORITVE, D.O.O.","bank_ref_number":"10518852022","method_of_payment":"Pla\xc4\x8dilo po povzetju","order_create_ts":"2022-08-23T09:43:00+02:00","created_ts":"2022-08-23T11:59:14+02:00","shipped_date":"2022-08-24+02:00","doc_link_list":[{"mk_id":"266412181173","count_code":"SK-MK-36044","doc_type":"sales_bill_foreign"},{"mk_id":"400015161112","count_code":"1043748/2022","doc_type":"warehouse_packing_list"},{"mk_id":"1200011617609","count_code":"1051885/2022","doc_type":"sales_order"}]}'
df = pd.DataFrame(BytesIO(data))
df[0] = df[0].str.decode("utf-8").apply(literal_eval)
df = pd.json_normalize(
    data=df.pop(0),
    record_path="product_list",
    meta=["mk_id", "doc_type", "opr_code", "count_code", "doc_date", "currency_code",
          "status_code", "doc_created_email", "buyer_order", "warehouse", "delivery_type"],
    meta_prefix="meta."
)
print(df.to_markdown())
| | count_code | mk_id | code | name | unit | amount | price | price_with_tax | tax | name_desc | meta.mk_id | meta.doc_type | meta.opr_code | meta.count_code | meta.doc_date | meta.currency_code | meta.status_code | meta.doc_created_email | meta.buyer_order | meta.warehouse | meta.delivery_type |
|---:|-------------:|-------------:|:-------|:-------------|:-------|---------:|--------:|-----------------:|------:|:------------|--------------:|:----------------|----------------:|:------------------|:-----------------|:---------------------|:-------------------|:-------------------------|:-------------------|:-----------------|:---------------------|
| 0 | 54 | 266405022384 | MSS | Mousse | kos | 1 | 16.66 | 19.99 | 200 | nan | 1200011617609 | sales_order | 0 | 1051885/2022 | 2022-08-23+02:00 | EUR | Zaključena | stifter.rok#gmail.com | SK-956103 | glavno | Gls_sk |
| 1 | 53 | 266405022383 | MIT | Mitt | kos | 1 | 0 | nan | 200 | nan | 1200011617609 | sales_order | 0 | 1051885/2022 | 2022-08-23+02:00 | EUR | Zaključena | stifter.rok#gmail.com | SK-956103 | glavno | Gls_sk |
| 2 | 48 | 266404892511 | TM | Tanning mist | kos | 1 | 0 | nan | 200 | TM | 1200011617609 | sales_order | 0 | 1051885/2022 | 2022-08-23+02:00 | EUR | Zaključena | stifter.rok#gmail.com | SK-956103 | glavno | Gls_sk |

How to fill missing GPS Data in pandas?

I have a data frame that looks something like this
+-----+------------+-------------+-------------------------+----+----------+----------+
| | Actual_Lat | Actual_Long | Time | ID | Cal_long | Cal_lat |
+-----+------------+-------------+-------------------------+----+----------+----------+
| 0 | 63.433376 | 10.397068 | 2019-09-30 04:48:13.540 | 11 | 10.39729 | 63.43338 |
| 1 | 63.433301 | 10.395846 | 2019-09-30 04:48:18.470 | 11 | 10.39731 | 63.43326 |
| 2 | 63.433259 | 10.394543 | 2019-09-30 04:48:23.450 | 11 | 10.39576 | 63.43323 |
| 3 | 63.433258 | 10.394244 | 2019-09-30 04:48:29.500 | 11 | 10.39555 | 63.43436 |
| 4 | 63.433258 | 10.394215 | 2019-09-30 04:48:35.683 | 11 | 10.39505 | 63.43427 |
| ... | ... | ... | ... | ...| ... | ... |
| 70 | NaN | NaN | NaT | NaN| 10.35826 | 63.43149 |
| 71 | NaN | NaN | NaT | NaN| 10.35809 | 63.43155 |
| 72 | NaN | NaN | NaT | NaN| 10.35772 | 63.43163 |
| 73 | NaN | NaN | NaT | NaN| 10.35646 | 63.43182 |
| 74 | NaN | NaN | NaT | NaN| 10.35536 | 63.43196 |
+-----+------------+-------------+-------------------------+----+----------+----------+
Actual_Lat and Actual_Long contain GPS coordinates obtained from a GPS device. Cal_long and Cal_lat are GPS coordinates obtained from OSRM's API. As you can see, a lot of data is missing in the actual coordinates. I am looking to get a data set such that when I take the difference of Actual_Lat vs Cal_lat, it should be zero or at least close to zero. I tried to fill these missing values with the destination lat and long, but that would result in a huge difference. My question is: how can I fill these values using Python/pandas so that, when the vehicle followed the OSRM-estimated path, the difference between the actual lat/long and the estimated lat/long is zero or close to zero? I am new to GIS data sets and have no idea how to deal with them.
EDIT: I am looking for something like this.
+-----+------------+-------------+-------------------------+----------+----------+----------+----------------------+----------------------+
| | Actual_Lat | Actual_Long | Time | Tour ID | Cal_long | Cal_lat | coordinates_diff_Lat | coordinates_diff_Lon |
+-----+------------+-------------+-------------------------+----------+----------+----------+----------------------+----------------------+
| 0 | 63.433376 | 10.397068 | 2019-09-30 04:48:13.540 | 11 | 10.39729 | 63.43338 | -0.000 | -0.000 |
| 1 | 63.433301 | 10.395846 | 2019-09-30 04:48:18.470 | 11 | 10.39731 | 63.43326 | 0.000 | -0.001 |
| 2 | 63.433259 | 10.394543 | 2019-09-30 04:48:23.450 | 11 | 10.39576 | 63.43323 | 0.000 | -0.001 |
| 3 | 63.433258 | 10.394244 | 2019-09-30 04:48:29.500 | 11 | 10.39555 | 63.43436 | -0.001 | -0.001 |
| 4 | 63.433258 | 10.394215 | 2019-09-30 04:48:35.683 | 11 | 10.39505 | 63.43427 | -0.001 | -0.001 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 70 | 63.43000 | 10.35800 | NaT | 115268.0 | 10.35826 | 63.43149 | 0.000 | -0.003 |
| 71 | 63.43025 | 10.35888 | NaT | 115268.0 | 10.35809 | 63.43155 | 0.000 | -0.003 |
| 72 | 63.43052 | 10.35713 | NaT | 115268.0 | 10.35772 | 63.43163 | 0.000 | -0.002 |
| 73 | 63.43159 | 10.35633 | NaT | 115268.0 | 10.35646 | 63.43182 | 0.000 | -0.001 |
| 74 | 63.43197 | 10.35537 | NaT | 115268.0 | 10.35536 | 63.43196 | 0.000 | 0.000 |
+-----+------------+-------------+-------------------------+----------+----------+----------+----------------------+----------------------+
Note that 63.43197,10.35537 is destination and 63.433376,10.397068 is starting position. All these points represent road coordinates.
IIUC, you need something like this:
I am taking the columns out of df as lists:
actual_lat = df['Actual_Lat'].dropna().tolist()
cal_lat = df['Cal_lat'].tolist()
# stretch the shorter list of actual points across the length of the calculated list
div = float(len(cal_lat)) / float(len(actual_lat))
new_l = []
for i in range(len(cal_lat)):
    new_l.append(actual_lat[int(i / div)])
print(new_l)
len(new_l)
Do the same with the longitude columns.
Since these are GPS points, you can work to an accuracy of about 3 decimal places when taking the difference. Keeping this in mind, starting from Actual_Lat and Actual_Long, if the next value is the same as the previous one, the difference won't be much greater.
Hopefully this makes sense and you have your solution.
You need pandas.DataFrame.where.
Let's say your dataframe is df, then you can do:
df.Actual_Lat = df.Actual_Lat.where(~df.Actual_Lat.isna(), df.Cal_lat)
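FWIW, the same fill can also be written with fillna, which aligns the two columns on the index (column names as in the question):
df['Actual_Lat'] = df['Actual_Lat'].fillna(df['Cal_lat'])
df['Actual_Long'] = df['Actual_Long'].fillna(df['Cal_long'])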

Manipulate pandas columns with datetime

Please see this SO post Manipulating pandas columns
I shared this dataframe:
+----------+------------+-------+-----+------+
| Location | Date | Event | Key | Time |
+----------+------------+-------+-----+------+
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-04 | 1 | a | 2 |
| i2 | 2019-03-15 | 2 | b | 0 |
| i9 | 2019-02-22 | 2 | c | 0 |
| i9 | 2019-03-10 | 3 | d | |
| i9 | 2019-03-10 | 3 | d | 0 |
| s8 | 2019-04-22 | 1 | e | |
| s8 | 2019-04-25 | 1 | e | |
| s8 | 2019-04-28 | 1 | e | 6 |
| t14 | 2019-05-13 | 3 | f | |
+----------+------------+-------+-----+------+
This is a follow-up question. Consider two more columns after Date as shown below.
+-----------------------+----------------------+
| Start Time (hh:mm:ss) | Stop Time (hh:mm:ss) |
+-----------------------+----------------------+
| 13:24:38 | 14:17:39 |
| 03:48:36 | 04:17:20 |
| 04:55:05 | 05:23:48 |
| 08:44:34 | 09:13:15 |
| 19:21:05 | 20:18:57 |
| 21:05:06 | 22:01:50 |
| 14:24:43 | 14:59:37 |
| 07:57:32              | 09:46:21             |
| 19:21:05 | 20:18:57 |
| 21:05:06 | 22:01:50 |
| 14:24:43 | 14:59:37 |
| 07:57:32 | 09:46:21 |
+-----------------------+----------------------+
The task remains the same: to get the time difference, but in hours, corresponding to the Stop Time of the first row and the Start Time of the last row for each Key.
Based on the answer, I was trying something like this:
df['Time'] = df.groupby(['Location', 'Event']).Date.\
    transform(lambda x: (x.iloc[-1] - x.iloc[0]))[~df.duplicated(['Location', 'Event'], keep='last')]
df['Time_h'] = df.groupby(['Location', 'Event'])['Start Time (hh:mm:ss)', 'Stop Time (hh:mm:ss)'].\
    transform(lambda x, y: (x.iloc[-1] - y.iloc[0]))[~df.duplicated(['Location', 'Event'], keep='last')]  # this gives an error on transform
to get the difference in days and hours separately and then combine. Is there a better way?
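One possible sketch, assuming Date and the two time columns can be combined into full timestamps (column names as in the question, grouping by Location and Event as in the attempt above):
import pandas as pd

# build full timestamps from the date and the time-of-day columns
start = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Start Time (hh:mm:ss)'])
stop = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Stop Time (hh:mm:ss)'])

# Start Time of the last row minus Stop Time of the first row, per group, in hours
grouper = [df['Location'], df['Event']]
elapsed = start.groupby(grouper).transform('last') - stop.groupby(grouper).transform('first')
df['Time_h'] = (elapsed.dt.total_seconds() / 3600)[~df.duplicated(['Location', 'Event'], keep='last')]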

Using pandas as a distance matrix and then getting a sub-dataframe of relevant distances

I have created a pandas df that has the distances between location i and location j. Beginning with a start point P1 and end point P2, I want to find the sub-dataframe (distance matrix) that has one axis of the df having P1, P2 and the other axis having the rest of the indices.
I'm using a pandas DF because I think it's the most efficient way.
# dm_dict: distance matrix in dict form, where dm_dict[i][j] gives the distance from i to j
dm_df = pd.DataFrame.from_dict(dm_dict)
P1 = dm_df.max(axis=0).idxmax()
P2 = dm_df[P1].idxmax()
route = [P1, P2]
remaining_locs = dm_df[dm_df[~dm_df.isin(route)].isin(route)]
while not_done:
    # go through remaining_locs until all the locations are added
No error messages, but the remaining_locs df is full of NaNs rather than a df with the distances.
Using dm_df[~dm_df.isin(route)].isin(route) seems to give me a boolean df that is accurate.
Sample data (it's technically the haversine distance, but Euclidean should be fine for filling up the matrix):
import numpy

def dist(i, j):
    a = numpy.array((i[1], i[2]))
    b = numpy.array((j[1], j[2]))
    return numpy.linalg.norm(a - b)
locations = [
    ("Ottawa", 45.424722, -75.695),
    ("Edmonton", 53.533333, -113.5),
    ("Victoria", 48.428611, -123.365556),
    ("Winnipeg", 49.899444, -97.139167),
    ("Fredericton", 49.899444, -97.139167),
    ("StJohns", 47.561389, -52.7125),
    ("Halifax", 44.647778, -63.571389),
    ("Toronto", 43.741667, -79.373333),
    ("Charlottetown", 46.238889, -63.129167),
    ("QuebecCity", 46.816667, -71.216667),
    ("Regina", 50.454722, -104.606667),
    ("Yellowknife", 62.442222, -114.3975),
    ("Iqaluit", 63.748611, -68.519722)
]
dm_dict = {i: {j: dist(i, j) for j in locations if j != i} for i in locations}
It looks like you want scipy's distance_matrix:
import pandas as pd
from scipy.spatial import distance_matrix

df = pd.DataFrame(locations)
x = df[[1, 2]]
dm = pd.DataFrame(distance_matrix(x, x),
                  index=df[0],
                  columns=df[0])
Output:
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
| | Ottawa | Edmonton | Victoria | Winnipeg | Fredericton | StJohns | Halifax | Toronto | Charlottetown | QuebecCity | Regina | Yellowknife | Iqaluit |
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
| Ottawa | 0.000000 | 38.664811 | 47.765105 | 21.906059 | 21.906059 | 23.081609 | 12.148481 | 4.045097 | 12.592181 | 4.689667 | 29.345960 | 42.278586 | 19.678657 |
| Edmonton | 38.664811 | 0.000000 | 11.107987 | 16.759535 | 16.759535 | 61.080146 | 50.713108 | 35.503607 | 50.896264 | 42.813477 | 9.411122 | 8.953983 | 46.125669 |
| Victoria | 47.765105 | 11.107987 | 0.000000 | 26.267600 | 26.267600 | 70.658378 | 59.913580 | 44.241193 | 60.276176 | 52.173796 | 18.867990 | 16.637528 | 56.945306 |
| Winnipeg | 21.906059 | 16.759535 | 26.267600 | 0.000000 | 0.000000 | 44.488147 | 33.976105 | 18.802741 | 34.206429 | 26.105163 | 7.488117 | 21.334745 | 31.794214 |
| Fredericton | 21.906059 | 16.759535 | 26.267600 | 0.000000 | 0.000000 | 44.488147 | 33.976105 | 18.802741 | 34.206429 | 26.105163 | 7.488117 | 21.334745 | 31.794214 |
| StJohns | 23.081609 | 61.080146 | 70.658378 | 44.488147 | 44.488147 | 0.000000 | 11.242980 | 26.933071 | 10.500284 | 18.519147 | 51.974763 | 63.454538 | 22.625084 |
| Halifax | 12.148481 | 50.713108 | 59.913580 | 33.976105 | 33.976105 | 11.242980 | 0.000000 | 15.827902 | 1.651422 | 7.946971 | 41.444115 | 53.851052 | 19.731392 |
| Toronto | 4.045097 | 35.503607 | 44.241193 | 18.802741 | 18.802741 | 26.933071 | 15.827902 | 0.000000 | 16.434995 | 8.717042 | 26.111037 | 39.703942 | 22.761342 |
| Charlottetown | 12.592181 | 50.896264 | 60.276176 | 34.206429 | 34.206429 | 10.500284 | 1.651422 | 16.434995 | 0.000000 | 8.108112 | 41.691201 | 53.767927 | 18.320711 |
| QuebecCity | 4.689667 | 42.813477 | 52.173796 | 26.105163 | 26.105163 | 18.519147 | 7.946971 | 8.717042 | 8.108112 | 0.000000 | 33.587610 | 45.921044 | 17.145385 |
| Regina | 29.345960 | 9.411122 | 18.867990 | 7.488117 | 7.488117 | 51.974763 | 41.444115 | 26.111037 | 41.691201 | 33.587610 | 0.000000 | 15.477744 | 38.457705 |
| Yellowknife | 42.278586 | 8.953983 | 16.637528 | 21.334745 | 21.334745 | 63.454538 | 53.851052 | 39.703942 | 53.767927 | 45.921044 | 15.477744 | 0.000000 | 45.896374 |
| Iqaluit | 19.678657 | 46.125669 | 56.945306 | 31.794214 | 31.794214 | 22.625084 | 19.731392 | 22.761342 | 18.320711 | 17.145385 | 38.457705 | 45.896374 | 0.000000 |
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
I am pretty sure this is what I wanted:
filtered = dm_df.filter(items=route, axis=1).filter(items=set(locations).difference(set(route)), axis=0)
filtered is a df with [2 rows x 10 columns], and from there I can find the minimum value.
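An equivalent selection with .loc, assuming dm from the answer above and route holding two of its city labels:
rest = [loc for loc in dm.index if loc not in route]
filtered = dm.loc[rest, route]  # rows: the remaining locations, columns: P1 and P2
print(filtered.min())  # smallest remaining distance to each of the two endpoints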

Saving Graphlab LDA model turns topics into gibberish?

Ok, this is just plain wacky. I think the problem may have been introduced by a recent GraphLab update (I've never seen the issue before, but I'm not sure). Anyway, check this out:
import graphlab as gl
corpus = gl.SArray('path/to/corpus_data')
lda_model = gl.topic_model.create(dataset=corpus,num_topics=10,num_iterations=50,alpha=1.0,beta=0.1)
lda_model.get_topics(num_words=3).print_rows(30)
+-------+---------------+------------------+
| topic | word | score |
+-------+---------------+------------------+
| 0 | Music | 0.0195325651638 |
| 0 | Love | 0.0120906781994 |
| 0 | Photography | 0.00936914065591 |
| 1 | Recipe | 0.0205673829742 |
| 1 | Food | 0.0202932111556 |
| 1 | Sugar | 0.0162560126511 |
| 2 | Business | 0.0223993672813 |
| 2 | Science | 0.0164027313084 |
| 2 | Education | 0.0139221301443 |
| 3 | Science | 0.0134658216431 |
| 3 | Video_game | 0.0113924173881 |
| 3 | NASA | 0.0112188654905 |
| 4 | United_States | 0.0127908290673 |
| 4 | Automobile | 0.00888669047383 |
| 4 | Australia | 0.00854809547772 |
| 5 | Disease | 0.00704245203928 |
| 5 | Earth | 0.00693360028027 |
| 5 | Species | 0.00648700544757 |
| 6 | Religion | 0.0142311765509 |
| 6 | God | 0.0139990904439 |
| 6 | Human | 0.00765681454222 |
| 7 | Google | 0.0198547267697 |
| 7 | Internet | 0.0191105480317 |
| 7 | Computer | 0.0179914269911 |
| 8 | Art | 0.0378733245262 |
| 8 | Design | 0.0223646138082 |
| 8 | Artist | 0.0142755732766 |
| 9 | Film | 0.0205971724156 |
| 9 | Earth | 0.0125386246077 |
| 9 | Television | 0.0102082224947 |
+-------+---------------+------------------+
Ok, even without knowing anything about my corpus, these topics are at least kinda comprehensible, insofar as the top terms per topic are more or less related.
But now if I simply save and reload the model, the topics completely change (to nonsense, as far as I can tell):
lda_model.save('test')
lda_model = gl.load_model('test')
lda_model.get_topics(num_words=3).print_rows(30)
+-------+-----------------------+-------------------+
| topic | word | score |
+-------+-----------------------+-------------------+
| 0 | Cleanliness | 0.00468171463384 |
| 0 | Chicken_soup | 0.00326753275774 |
| 0 | The_Language_Instinct | 0.00314506174959 |
| 1 | Equalization | 0.0015724652078 |
| 1 | Financial_crisis | 0.00132675410371 |
| 1 | Tulsa,_Oklahoma | 0.00118899041288 |
| 2 | Batoidea | 0.00142300468887 |
| 2 | Abbottabad | 0.0013474225953 |
| 2 | Migration_humaine | 0.00124284781396 |
| 3 | Gewürztraminer | 0.00147470845039 |
| 3 | Indore | 0.00107223358321 |
| 3 | White_wedding | 0.00104791136102 |
| 4 | Bregenz | 0.00130871351963 |
| 4 | Carl_Jung | 0.000879345016186 |
| 4 | ภ | 0.000855001542873 |
| 5 | 18e_eeuw | 0.000950866105797 |
| 5 | Vesuvianite | 0.000832367570269 |
| 5 | Gary_Kirsten | 0.000806410748201 |
| 6 | Sunday_Bloody_Sunday | 0.000828552346797 |
| 6 | Linear_cryptanalysis | 0.000681188343324 |
| 6 | Clothing_sizes | 0.00066708652481 |
| 7 | Mile | 0.000759081990574 |
| 7 | Pinwheel_calculator | 0.000721971708181 |
| 7 | Third_Age | 0.000623010955132 |
| 8 | Tennessee_Williams | 0.000597449568381 |
| 8 | Levite | 0.000551338743949 |
| 8 | Time_Out_(company) | 0.000536667117994 |
| 9 | David_Deutsch | 0.000543813843275 |
| 9 | Honing_(metalworking) | 0.00044496051774 |
| 9 | Clearing_(finance) | 0.000431699705779 |
+-------+-----------------------+-------------------+
Any idea what could possibly be happening here? save should just pickle the model, so I don't see where the weirdness is happening, but somehow the topic distributions are getting totally changed around in some non-obvious way. I've verified this on two different machines (Linux and Mac), with similar weird results.
EDIT
Downgrading Graphlab from 1.7.1 to 1.6.1 seems to resolve this issue, but that's not a real solution. I don't see anything obvious in the 1.7.1 release notes to explain what happened, and would like this to work in 1.7.1 if possible...
This is a bug in GraphLab Create 1.7.1. It has now been fixed in GraphLab Create 1.8.
