I am struggling with reading data into a Python dataframe. I am an R programmer trying to do things in Python. How would I read the following data into a pandas DataFrame? The data is the result of calling an API.
Thanks.
b'{"mk_id":"1200011617609","doc_type":"sales_order","opr_code":"0","count_code":"1051885/2022","doc_date":"2022-08-23+02:00","partner":{"mk_id":"400020633177","business_entity":"false","taxpayer":"false","foreign_county":"true","customer":"Emilia Chabadova","street":"Ga\xc5\xa1tanov\xc3\xa1 2915/13","street_number":"2915/13","post_number":"92101","place":"Pie\xc5\xa1\xc5\xa5any","country":"Slovakia","count_code":"5770789334526546744","partner_contact":{"gsm":"+421949340254","email":"emily.chabadova#gmail.com"},"mk_address_id":"400020530565","country_iso_2":"SK","buyer":"true","supplier":"false"},"receiver":{"mk_id":"400020633177","business_entity":"false","taxpayer":"false","foreign_county":"true","customer":"Emilia Chabadova","street":"Ga\xc5\xa1tanov\xc3\xa1 2915/13","street_number":"2915/13","post_number":"92101","place":"Pie\xc5\xa1\xc5\xa5any","country":"Slovakia","count_code":"5770789334526546744","partner_contact":{"gsm":"+421949340254","email":"emily.chabadova#gmail.com"},"mk_address_id":"400020530565","country_iso_2":"SK","buyer":"true","supplier":"false"},"currency_code":"EUR","status_code":"Zaklju\xc4\x8dena","doc_created_email":"stifter.rok#gmail.com","buyer_order":"SK-956103","warehouse":"glavno","delivery_type":"Gls_sk","product_list":[{"count_code":"54","mk_id":"266405022384","code":"MSS","name":"Mousse","unit":"kos","amount":"1","price":"16.66","price_with_tax":"19.99","tax":"200"},{"count_code":"53","mk_id":"266405022383","code":"MIT","name":"Mitt","unit":"kos","amount":"1","price":"0","tax":"200"},{"count_code":"48","mk_id":"266404892511","code":"TM","name":"Tanning mist","name_desc":"TM","unit":"kos","amount":"1","price":"0","tax":"200"}],"extra_column":[{"name":"tracking_number","value":"91114278162"}],"sum_basic":"16.66","sum_tax_200":"3.33","sum_all":"19.99","sum_paid":"19.99","profit_center":"SHINE BROWN, PROIZVODNJA, TRGOVINA IN STORITVE, D.O.O.","bank_ref_number":"10518852022","method_of_payment":"Pla\xc4\x8dilo po povzetju","order_create_ts":"2022-08-23T09:43:00+02:00","created_ts":"2022-08-23T11:59:14+02:00","shipped_date":"2022-08-24+02:00","doc_link_list":[{"mk_id":"266412181173","count_code":"SK-MK-36044","doc_type":"sales_bill_foreign"},{"mk_id":"400015161112","count_code":"1043748/2022","doc_type":"warehouse_packing_list"},{"mk_id":"1200011617609","count_code":"1051885/2022","doc_type":"sales_order"}]}'
You can start by doing something like this:
result = {"mk_id":"1200011617609","doc_type":"sales_order","opr_code":"0","count_code":"1051885/2022","doc_date":"2022-08-23+02:00","partner":{"mk_id":"400020633177","business_entity":"false","taxpayer":"false","foreign_county":"true","customer":"Emilia Chabadova","street":"Ga\xc5\xa1tanov\xc3\xa1 2915/13","street_number":"2915/13","post_number":"92101","place":"Pie\xc5\xa1\xc5\xa5any","country":"Slovakia","count_code":"5770789334526546744","partner_contact":{"gsm":"+421949340254","email":"emily.chabadova#gmail.com"},"mk_address_id":"400020530565","country_iso_2":"SK","buyer":"true","supplier":"false"},"receiver":{"mk_id":"400020633177","business_entity":"false","taxpayer":"false","foreign_county":"true","customer":"Emilia Chabadova","street":"Ga\xc5\xa1tanov\xc3\xa1 2915/13","street_number":"2915/13","post_number":"92101","place":"Pie\xc5\xa1\xc5\xa5any","country":"Slovakia","count_code":"5770789334526546744","partner_contact":{"gsm":"+421949340254","email":"emily.chabadova#gmail.com"},"mk_address_id":"400020530565","country_iso_2":"SK","buyer":"true","supplier":"false"},"currency_code":"EUR","status_code":"Zaklju\xc4\x8dena","doc_created_email":"stifter.rok#gmail.com","buyer_order":"SK-956103","warehouse":"glavno","delivery_type":"Gls_sk","product_list":[{"count_code":"54","mk_id":"266405022384","code":"MSS","name":"Mousse","unit":"kos","amount":"1","price":"16.66","price_with_tax":"19.99","tax":"200"},{"count_code":"53","mk_id":"266405022383","code":"MIT","name":"Mitt","unit":"kos","amount":"1","price":"0","tax":"200"},{"count_code":"48","mk_id":"266404892511","code":"TM","name":"Tanning mist","name_desc":"TM","unit":"kos","amount":"1","price":"0","tax":"200"}],"extra_column":[{"name":"tracking_number","value":"91114278162"}],"sum_basic":"16.66","sum_tax_200":"3.33","sum_all":"19.99","sum_paid":"19.99","profit_center":"SHINE BROWN, PROIZVODNJA, TRGOVINA IN STORITVE, D.O.O.","bank_ref_number":"10518852022","method_of_payment":"Pla\xc4\x8dilo po povzetju","order_create_ts":"2022-08-23T09:43:00+02:00","created_ts":"2022-08-23T11:59:14+02:00","shipped_date":"2022-08-24+02:00","doc_link_list":[{"mk_id":"266412181173","count_code":"SK-MK-36044","doc_type":"sales_bill_foreign"},{"mk_id":"400015161112","count_code":"1043748/2022","doc_type":"warehouse_packing_list"},{"mk_id":"1200011617609","count_code":"1051885/2022","doc_type":"sales_order"}]}
pd.DataFrame([result])
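In practice, since the API returns bytes, you would probably parse the payload rather than pasting the literal into your script. A minimal sketch, assuming the response body is stored in a variable named data as in the question:
import json
import pandas as pd

result = json.loads(data)    # json.loads accepts the bytes directly
df = pd.DataFrame([result])  # one row; nested objects stay as dict/list columns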
Here is a way using BytesIO and json_normalize:
from ast import literal_eval
from io import BytesIO
import pandas as pd
data = b'{"mk_id":"1200011617609","doc_type":"sales_order","opr_code":"0","count_code":"1051885/2022","doc_date":"2022-08-23+02:00","partner":{"mk_id":"400020633177","business_entity":"false","taxpayer":"false","foreign_county":"true","customer":"Emilia Chabadova","street":"Ga\xc5\xa1tanov\xc3\xa1 2915/13","street_number":"2915/13","post_number":"92101","place":"Pie\xc5\xa1\xc5\xa5any","country":"Slovakia","count_code":"5770789334526546744","partner_contact":{"gsm":"+421949340254","email":"emily.chabadova#gmail.com"},"mk_address_id":"400020530565","country_iso_2":"SK","buyer":"true","supplier":"false"},"receiver":{"mk_id":"400020633177","business_entity":"false","taxpayer":"false","foreign_county":"true","customer":"Emilia Chabadova","street":"Ga\xc5\xa1tanov\xc3\xa1 2915/13","street_number":"2915/13","post_number":"92101","place":"Pie\xc5\xa1\xc5\xa5any","country":"Slovakia","count_code":"5770789334526546744","partner_contact":{"gsm":"+421949340254","email":"emily.chabadova#gmail.com"},"mk_address_id":"400020530565","country_iso_2":"SK","buyer":"true","supplier":"false"},"currency_code":"EUR","status_code":"Zaklju\xc4\x8dena","doc_created_email":"stifter.rok#gmail.com","buyer_order":"SK-956103","warehouse":"glavno","delivery_type":"Gls_sk","product_list":[{"count_code":"54","mk_id":"266405022384","code":"MSS","name":"Mousse","unit":"kos","amount":"1","price":"16.66","price_with_tax":"19.99","tax":"200"},{"count_code":"53","mk_id":"266405022383","code":"MIT","name":"Mitt","unit":"kos","amount":"1","price":"0","tax":"200"},{"count_code":"48","mk_id":"266404892511","code":"TM","name":"Tanning mist","name_desc":"TM","unit":"kos","amount":"1","price":"0","tax":"200"}],"extra_column":[{"name":"tracking_number","value":"91114278162"}],"sum_basic":"16.66","sum_tax_200":"3.33","sum_all":"19.99","sum_paid":"19.99","profit_center":"SHINE BROWN, PROIZVODNJA, TRGOVINA IN STORITVE, D.O.O.","bank_ref_number":"10518852022","method_of_payment":"Pla\xc4\x8dilo po povzetju","order_create_ts":"2022-08-23T09:43:00+02:00","created_ts":"2022-08-23T11:59:14+02:00","shipped_date":"2022-08-24+02:00","doc_link_list":[{"mk_id":"266412181173","count_code":"SK-MK-36044","doc_type":"sales_bill_foreign"},{"mk_id":"400015161112","count_code":"1043748/2022","doc_type":"warehouse_packing_list"},{"mk_id":"1200011617609","count_code":"1051885/2022","doc_type":"sales_order"}]}'
df = pd.DataFrame(BytesIO(data))  # one row per line of the byte stream (a single row here)
df[0] = df[0].str.decode("utf-8").apply(literal_eval)  # decode the bytes and parse the dict literal
df = pd.json_normalize(
data=df.pop(0),
record_path="product_list",
meta=["mk_id", "doc_type", "opr_code", "count_code", "doc_date", "currency_code",
"status_code", "doc_created_email", "buyer_order", "warehouse", "delivery_type"],
meta_prefix="meta."
)
print(df.to_markdown())
| | count_code | mk_id | code | name | unit | amount | price | price_with_tax | tax | name_desc | meta.mk_id | meta.doc_type | meta.opr_code | meta.count_code | meta.doc_date | meta.currency_code | meta.status_code | meta.doc_created_email | meta.buyer_order | meta.warehouse | meta.delivery_type |
|---:|-------------:|-------------:|:-------|:-------------|:-------|---------:|--------:|-----------------:|------:|:------------|--------------:|:----------------|----------------:|:------------------|:-----------------|:---------------------|:-------------------|:-------------------------|:-------------------|:-----------------|:---------------------|
| 0 | 54 | 266405022384 | MSS | Mousse | kos | 1 | 16.66 | 19.99 | 200 | nan | 1200011617609 | sales_order | 0 | 1051885/2022 | 2022-08-23+02:00 | EUR | Zaključena | stifter.rok#gmail.com | SK-956103 | glavno | Gls_sk |
| 1 | 53 | 266405022383 | MIT | Mitt | kos | 1 | 0 | nan | 200 | nan | 1200011617609 | sales_order | 0 | 1051885/2022 | 2022-08-23+02:00 | EUR | Zaključena | stifter.rok#gmail.com | SK-956103 | glavno | Gls_sk |
| 2 | 48 | 266404892511 | TM | Tanning mist | kos | 1 | 0 | nan | 200 | TM | 1200011617609 | sales_order | 0 | 1051885/2022 | 2022-08-23+02:00 | EUR | Zaključena | stifter.rok#gmail.com | SK-956103 | glavno | Gls_sk |
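If you would rather skip the BytesIO/literal_eval step, the bytes are valid JSON and can be parsed directly before normalizing; a minimal sketch, equivalent to the pipeline above:
import json
import pandas as pd

parsed = json.loads(data)
df = pd.json_normalize(
    parsed,
    record_path="product_list",
    meta=["mk_id", "doc_type", "opr_code", "count_code", "doc_date", "currency_code",
          "status_code", "doc_created_email", "buyer_order", "warehouse", "delivery_type"],
    meta_prefix="meta."
)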
I have the following pandas dataframe, where the column id is the dataframe index:
+----+-----------+------------+-----------+------------+
| | price_A | amount_A | price_B | amount_b |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
I want to convert this dataframe into a multi-column dataframe that looks like this:
+----+-----------+------------+-----------+------------+
| | A | B |
+----+-----------+------------+-----------+------------+
| id | price | amount | price | amount |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
I've tried transforming my old pandas dataframe into a dict this way:
dict = {"A": df[["price_a","amount_a"]], "B":df[["price_b", "amount_b"]]}
df = pd.DataFrame(dict, index=df.index)
But I had no success. How can I do that?
Try renaming columns manually:
df.columns=pd.MultiIndex.from_tuples([x.split('_')[::-1] for x in df.columns])
df.index.name='id'
Output:
A B b
price amount price amount
id
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495
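Note that the stray lowercase b level in that output comes from the amount_b column in the source frame. If you want it to fold into group B, you could normalize the case of the suffix before building the MultiIndex; a minimal sketch, assuming the price_X/amount_X naming from the question:
df.columns = pd.MultiIndex.from_tuples(
    # split "price_A" -> ("A", "price"); upper-case the group so "amount_b" joins "B"
    [(grp.upper(), field) for field, grp in (c.split('_') for c in df.columns)]
)
df.index.name = 'id'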
You can split the column names on the underscore and convert to a tuple. Once you map each split column name to a tuple, pandas will convert the Index to a MultiIndex for you. From there we just need to call swaplevel to get the letter level to come first and reassign to the dataframe.
Note: in my input dataframe I replaced the column name "amount_b" with "amount_B", because that lines up with your expected output, so I assumed it was a typo.
df.columns = df.columns.str.split("_", expand=True).swaplevel()
print(df)
A B
price amount price amount
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495
Please see this SO post: Manipulating pandas columns.
I shared this dataframe:
+----------+------------+-------+-----+------+
| Location | Date | Event | Key | Time |
+----------+------------+-------+-----+------+
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-04 | 1 | a | 2 |
| i2 | 2019-03-15 | 2 | b | 0 |
| i9 | 2019-02-22 | 2 | c | 0 |
| i9 | 2019-03-10 | 3 | d | |
| i9 | 2019-03-10 | 3 | d | 0 |
| s8 | 2019-04-22 | 1 | e | |
| s8 | 2019-04-25 | 1 | e | |
| s8 | 2019-04-28 | 1 | e | 6 |
| t14 | 2019-05-13 | 3 | f | |
+----------+------------+-------+-----+------+
This is a follow-up question. Consider two more columns after Date as shown below.
+-----------------------+----------------------+
| Start Time (hh:mm:ss) | Stop Time (hh:mm:ss) |
+-----------------------+----------------------+
| 13:24:38 | 14:17:39 |
| 03:48:36 | 04:17:20 |
| 04:55:05 | 05:23:48 |
| 08:44:34 | 09:13:15 |
| 19:21:05 | 20:18:57 |
| 21:05:06 | 22:01:50 |
| 14:24:43 | 14:59:37 |
| 07:57:32              | 09:46:21             |
| 19:21:05 | 20:18:57 |
| 21:05:06 | 22:01:50 |
| 14:24:43 | 14:59:37 |
| 07:57:32 | 09:46:21 |
+-----------------------+----------------------+
The task remains the same: to get the time difference, but in hours, between the Stop Time of the first row and the Start Time of the last row for each Key.
Based on the answer, I was trying something like this:
df['Time']=df.groupby(['Location','Event']).Date.\
transform(lambda x : (x.iloc[-1]-x.iloc[0]))[~df.duplicated(['Location','Event'],keep='last')]
df['Time_h']=df.groupby(['Location','Event'])['Start Time (hh:mm:ss)','Stop Time (hh:mm:ss)'].\
transform(lambda x,y : (x.iloc[-1]-y.iloc[0]))[~df.duplicated(['Location','Event'],keep='last')] # This gives an error on transform
to get the difference in days and hours separately and then combine. Is there a better way?
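One way to avoid computing days and hours separately is to combine Date with the clock-time columns into full timestamps first, then take a single difference. A minimal sketch, not from the original answer, assuming the column names shown in the tables above, that Date is stored as a date string like 2019-03-02, and grouping per Key as the task describes:
import pandas as pd

# Build full timestamps from the date and the clock-time columns.
df['start_ts'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Start Time (hh:mm:ss)'])
df['stop_ts'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Stop Time (hh:mm:ss)'])

# Hours between the Stop Time of the first row and the Start Time of the last row, per Key.
hours = df.groupby('Key').apply(
    lambda g: (g['start_ts'].iloc[-1] - g['stop_ts'].iloc[0]).total_seconds() / 3600
)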
I have created a pandas df that has the distances between location i and location j. Beginning with a start point P1 and end point P2, I want to find the sub-dataframe (distance matrix) that has one axis of the df having P1, P2 and the other axis having the rest of the indices.
I'm using a Pandas DF because I think it's the most efficient way.
dm_dict = # distance matrix in dict form where you can call dm_dict[i][j] and get the distance from i to j
dm_df = pd.DataFrame().from_dict(dm_dict)
P1 = dm_df.max(axis=0).idxmax()
P2 = dm_df[P1].idxmax()
route = [P1, P2]
remaining_locs = dm_df[dm_df[~dm_df.isin(route)].isin(route)]
while not_done:
# go through the remaining_locs until found all the locations are added.
No error messages, but the remaining_locs df is full of nan's rather than a df with the distances.
using dm_df[~dm_df.isin(route)].isin(route) seems to give me a boolean df that is accurate.
Sample data; it's technically the haversine distance, but Euclidean should be fine for filling up the matrix:
import numpy
def dist(i, j):
a = numpy.array((i[1], i[2]))
b = numpy.array((j[1], j[2]))
return numpy.linalg.norm(a-b)
locations = [
("Ottawa", 45.424722,-75.695),
("Edmonton", 53.533333,-113.5),
("Victoria", 48.428611,-123.365556),
("Winnipeg", 49.899444,-97.139167),
("Fredericton", 49.899444,-97.139167),
("StJohns", 47.561389, -52.7125),
("Halifax", 44.647778, -63.571389),
("Toronto", 43.741667, -79.373333),
("Charlottetown",46.238889, -63.129167),
("QuebecCity",46.816667, -71.216667 ),
("Regina", 50.454722, -104.606667),
("Yellowknife", 62.442222, -114.3975),
("Iqaluit", 63.748611, -68.519722)
]
dm_dict = {i: {j: dist(i, j) for j in locations if j != i} for i in locations}
It looks like you want scipy's distance_matrix:
from scipy.spatial import distance_matrix
import pandas as pd

df = pd.DataFrame(locations)
x = df[[1, 2]]  # latitude/longitude columns
dm = pd.DataFrame(distance_matrix(x, x),
                  index=df[0],
                  columns=df[0])
Output:
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
| | Ottawa | Edmonton | Victoria | Winnipeg | Fredericton | StJohns | Halifax | Toronto | Charlottetown | QuebecCity | Regina | Yellowknife | Iqaluit |
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
| 0 | | | | | | | | | | | | | |
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
| Ottawa | 0.000000 | 38.664811 | 47.765105 | 21.906059 | 21.906059 | 23.081609 | 12.148481 | 4.045097 | 12.592181 | 4.689667 | 29.345960 | 42.278586 | 19.678657 |
| Edmonton | 38.664811 | 0.000000 | 11.107987 | 16.759535 | 16.759535 | 61.080146 | 50.713108 | 35.503607 | 50.896264 | 42.813477 | 9.411122 | 8.953983 | 46.125669 |
| Victoria | 47.765105 | 11.107987 | 0.000000 | 26.267600 | 26.267600 | 70.658378 | 59.913580 | 44.241193 | 60.276176 | 52.173796 | 18.867990 | 16.637528 | 56.945306 |
| Winnipeg | 21.906059 | 16.759535 | 26.267600 | 0.000000 | 0.000000 | 44.488147 | 33.976105 | 18.802741 | 34.206429 | 26.105163 | 7.488117 | 21.334745 | 31.794214 |
| Fredericton | 21.906059 | 16.759535 | 26.267600 | 0.000000 | 0.000000 | 44.488147 | 33.976105 | 18.802741 | 34.206429 | 26.105163 | 7.488117 | 21.334745 | 31.794214 |
| StJohns | 23.081609 | 61.080146 | 70.658378 | 44.488147 | 44.488147 | 0.000000 | 11.242980 | 26.933071 | 10.500284 | 18.519147 | 51.974763 | 63.454538 | 22.625084 |
| Halifax | 12.148481 | 50.713108 | 59.913580 | 33.976105 | 33.976105 | 11.242980 | 0.000000 | 15.827902 | 1.651422 | 7.946971 | 41.444115 | 53.851052 | 19.731392 |
| Toronto | 4.045097 | 35.503607 | 44.241193 | 18.802741 | 18.802741 | 26.933071 | 15.827902 | 0.000000 | 16.434995 | 8.717042 | 26.111037 | 39.703942 | 22.761342 |
| Charlottetown | 12.592181 | 50.896264 | 60.276176 | 34.206429 | 34.206429 | 10.500284 | 1.651422 | 16.434995 | 0.000000 | 8.108112 | 41.691201 | 53.767927 | 18.320711 |
| QuebecCity | 4.689667 | 42.813477 | 52.173796 | 26.105163 | 26.105163 | 18.519147 | 7.946971 | 8.717042 | 8.108112 | 0.000000 | 33.587610 | 45.921044 | 17.145385 |
| Regina | 29.345960 | 9.411122 | 18.867990 | 7.488117 | 7.488117 | 51.974763 | 41.444115 | 26.111037 | 41.691201 | 33.587610 | 0.000000 | 15.477744 | 38.457705 |
| Yellowknife | 42.278586 | 8.953983 | 16.637528 | 21.334745 | 21.334745 | 63.454538 | 53.851052 | 39.703942 | 53.767927 | 45.921044 | 15.477744 | 0.000000 | 45.896374 |
| Iqaluit | 19.678657 | 46.125669 | 56.945306 | 31.794214 | 31.794214 | 22.625084 | 19.731392 | 22.761342 | 18.320711 | 17.145385 | 38.457705 | 45.896374 | 0.000000 |
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
I am pretty sure this is what I wanted:
filtered = dm_df.filter(items=route,axis=1).filter(items=set(locations).difference(set(route)), axis=0)
filtered is a df with [2 rows x 10 columns] and then I can find the minimum value from there
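From there, one way to pull out the nearest remaining location (a minimal sketch, assuming filtered is the frame built above; the variable names are illustrative):
# stack() turns the frame into a Series indexed by (remaining location, endpoint)
nearest_loc, endpoint = filtered.stack().idxmin()
best_distance = filtered.stack().min()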
I'm trying to create a new dataframe for each possible combination in 'combinations', reading in some values from a dataframe. An example of the dataframe:
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Species | OGT | Domain | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Aeropyrum pernix | 95 | Archaea | 9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 | 2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 | 6.266067034 | 4.2052190807 | 9.2692433532 | 1.318690698 | 3.5614200159 |
| Argobacterium fabrum | 26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 | 9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984 |
| Anaeromyxobacter dehalogenans | 27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 | 2.940143349 | 2.3473650439 | 10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289 |
| Aquifex aeolicus | 85 | Bacteria | 5.8730327277 | 0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987 |
| Archaeoglobus fulgidus | 83 | Archaea | 7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 | 4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 | 9.4826333048 | 2.6014466253 | 3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148 |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
Here is my code at the moment.
import itertools
import pandas as pd
letters = ['G','A','L','M','F','W','K','Q','E','S','P','V','I','C','Y','H','R','N','D','T']
combinations = [''.join(i) for j in range(1,len(letters) + 1) for i in itertools.combinations(letters,r=j)]
df = pd.read_csv('COMPLETECOPYFORR.csv')
for combination in combinations:
    new_df = df[['Species', 'OGT']]
    new_df['Sum of percentage'] = df[list(combination)]
    new_df.to_csv(combination + '.csv')
The desired output is something along the lines of 10 million CSV files, each with the name of the different combinations, so
G.csv, A.csv, through to GALMFWKQESPVICYHRNDT.csv
Species OGT Sum of percentage
------------------------------- ----- -------------------
Aeropyrum pernix 95 23.4353
Anaeromyxobacter dehalogenans 26 20.3232
Argobacterium fabrum 27 14.2312
Aquifex aeolicus 85 15.0403
Archaeoglobus fulgidus 83 34.0532
It looks like you need:
new_df['Sum of percentage'] = df[list(combination)].sum(axis=1)
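In context, the loop would then look something like this (a minimal sketch; .copy() is added only to avoid the SettingWithCopy warning, the rest follows the question's code):
for combination in combinations:
    new_df = df[['Species', 'OGT']].copy()
    # Sum the percentage columns for the letters in this combination, row-wise.
    new_df['Sum of percentage'] = df[list(combination)].sum(axis=1)
    new_df.to_csv(combination + '.csv')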
OK, this is just plain wacky. I think the problem may have been introduced by a recent GraphLab update (because I've never seen the issue before, but I'm not sure). Anyway, check this out:
import graphlab as gl
corpus = gl.SArray('path/to/corpus_data')
lda_model = gl.topic_model.create(dataset=corpus,num_topics=10,num_iterations=50,alpha=1.0,beta=0.1)
lda_model.get_topics(num_words=3).print_rows(30)
+-------+---------------+------------------+
| topic | word | score |
+-------+---------------+------------------+
| 0 | Music | 0.0195325651638 |
| 0 | Love | 0.0120906781994 |
| 0 | Photography | 0.00936914065591 |
| 1 | Recipe | 0.0205673829742 |
| 1 | Food | 0.0202932111556 |
| 1 | Sugar | 0.0162560126511 |
| 2 | Business | 0.0223993672813 |
| 2 | Science | 0.0164027313084 |
| 2 | Education | 0.0139221301443 |
| 3 | Science | 0.0134658216431 |
| 3 | Video_game | 0.0113924173881 |
| 3 | NASA | 0.0112188654905 |
| 4 | United_States | 0.0127908290673 |
| 4 | Automobile | 0.00888669047383 |
| 4 | Australia | 0.00854809547772 |
| 5 | Disease | 0.00704245203928 |
| 5 | Earth | 0.00693360028027 |
| 5 | Species | 0.00648700544757 |
| 6 | Religion | 0.0142311765509 |
| 6 | God | 0.0139990904439 |
| 6 | Human | 0.00765681454222 |
| 7 | Google | 0.0198547267697 |
| 7 | Internet | 0.0191105480317 |
| 7 | Computer | 0.0179914269911 |
| 8 | Art | 0.0378733245262 |
| 8 | Design | 0.0223646138082 |
| 8 | Artist | 0.0142755732766 |
| 9 | Film | 0.0205971724156 |
| 9 | Earth | 0.0125386246077 |
| 9 | Television | 0.0102082224947 |
+-------+---------------+------------------+
Ok, even without knowing anything about my corpus, these topics are at least kinda comprehensible, insofar as the top terms per topic are more or less related.
But now if I simply save and reload the model, the topics completely change (to nonsense, as far as I can tell):
lda_model.save('test')
lda_model = gl.load_model('test')
lda_model.get_topics(num_words=3).print_rows(30)
+-------+-----------------------+-------------------+
| topic | word | score |
+-------+-----------------------+-------------------+
| 0 | Cleanliness | 0.00468171463384 |
| 0 | Chicken_soup | 0.00326753275774 |
| 0 | The_Language_Instinct | 0.00314506174959 |
| 1 | Equalization | 0.0015724652078 |
| 1 | Financial_crisis | 0.00132675410371 |
| 1 | Tulsa,_Oklahoma | 0.00118899041288 |
| 2 | Batoidea | 0.00142300468887 |
| 2 | Abbottabad | 0.0013474225953 |
| 2 | Migration_humaine | 0.00124284781396 |
| 3 | Gewürztraminer | 0.00147470845039 |
| 3 | Indore | 0.00107223358321 |
| 3 | White_wedding | 0.00104791136102 |
| 4 | Bregenz | 0.00130871351963 |
| 4 | Carl_Jung | 0.000879345016186 |
| 4 | ภ | 0.000855001542873 |
| 5 | 18e_eeuw | 0.000950866105797 |
| 5 | Vesuvianite | 0.000832367570269 |
| 5 | Gary_Kirsten | 0.000806410748201 |
| 6 | Sunday_Bloody_Sunday | 0.000828552346797 |
| 6 | Linear_cryptanalysis | 0.000681188343324 |
| 6 | Clothing_sizes | 0.00066708652481 |
| 7 | Mile | 0.000759081990574 |
| 7 | Pinwheel_calculator | 0.000721971708181 |
| 7 | Third_Age | 0.000623010955132 |
| 8 | Tennessee_Williams | 0.000597449568381 |
| 8 | Levite | 0.000551338743949 |
| 8 | Time_Out_(company) | 0.000536667117994 |
| 9 | David_Deutsch | 0.000543813843275 |
| 9 | Honing_(metalworking) | 0.00044496051774 |
| 9 | Clearing_(finance) | 0.000431699705779 |
+-------+-----------------------+-------------------+
Any idea what could possibly be happening here? save should just pickle the model, so I don't see where the weirdness is happening, but somehow the topic distributions are getting totally changed around in some non-obvious way. I've verified this on two different machines (Linux and Mac) with similar weird results.
EDIT
Downgrading GraphLab from 1.7.1 to 1.6.1 seems to resolve this issue, but that's not a real solution. I don't see anything obvious in the 1.7.1 release notes to explain what happened, and I would like this to work in 1.7.1 if possible...
This is a bug in GraphLab Create 1.7.1. It has been fixed in GraphLab Create 1.8.