I have a CSV with 500+ rows where one column, "_source", is stored as JSON, and I want to extract it into a pandas DataFrame so that each key becomes its own column. The data is about 1 MB of online social media content (Facebook, Twitter, web-crawled pages, etc.). There are approximately 528 rows of posts/tweets/text, each containing many dictionaries nested inside dictionaries. I need every key-value pair in those nested dictionaries turned into its own column in the DataFrame. A few steps from my Jupyter notebook are attached below to give a more complete picture.
Thank you so much this will be a huge help!!!
I have tried changing it to a dataframe by doing this
source = pd.DataFrame.from_dict(source, orient='columns')
And it returns something like this... I thought it might unpack the dictionary but it did not.
#source.head()
#_source
#0 {'sub_organization_id': 'default', 'uid': 'aba...
#1 {'sub_organization_id': 'default', 'uid': 'ab0...
#2 {'sub_organization_id': 'default', 'uid': 'ac0...
below is the shape
#source.shape (528, 1)
Below is what an actual "_source" row looks like when stretched out. There are many dictionaries and key:value pairs, and each key needs to be its own column. Thanks! The actual links have been altered/scrambled for privacy reasons.
{'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
before you post make sure the actual code works for the data attached. Thanks!
I tried the code below, but it did not work: it raised a syntax error that I could not figure out.
pd.io.json.json_normalize(source_data.[_source].apply(json.loads))
pd.io.json.json_normalize(source_data.[_source].apply(json.loads))
^
SyntaxError: invalid syntax
Whoever can help me with this will be a saint!
I had to do something like that a while back. Basically, I used a function that completely flattens the JSON to identify the keys that will become columns, then iterated through the flattened keys to reconstruct a row and write it into a "results" dataframe. With the data you provided, it produced one row with 52 columns and, looking through it, every key appears to have ended up in its own column. Nested dict keys are joined with '_' (for example, 'doc': {'attrs': {'uid': ...}} becomes doc_attrs_uid). For keys inside a list, such as 'meta': {'rule_matcher': [{'atribs': {'website': ...}}]}, the code below keeps only the part after the first list index (atribs_website) and marks any deeper list level with a '.' (results.rule_type).
data_source = {'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
Code:
def flatten_json(y):
    """Flatten nested dicts/lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

flat = flatten_json(data_source)

import pandas as pd
import re

results = pd.DataFrame()
special_cols = []
columns_list = list(flat.keys())
for item in columns_list:
    try:
        row_idx = re.findall(r'\_(\d+)\_', item)[0]
    except IndexError:
        # No list index in the key: a top-level or dict-only value
        special_cols.append(item)
        continue
    column = re.findall(r'\_\d+\_(.*)', item)[0]
    column = re.sub(r'\_\d+\_', '.', column)
    row_idx = int(row_idx)
    value = flat[item]
    results.loc[row_idx, column] = value

for item in special_cols:
    results[item] = flat[item]
Output:
print (results.to_string())
atribs_website atribs_source atribs_version atribs_type results.rule_type results.rule_tag results.description results.project_veid results.campaign_id results.value results.organization_id results.sub_organization_id results.appid results.project_id results.rule_id results.node_id results.metadata_campaign_title results.metadata_project_title attribs_website attribs_version attribs_type results.render_status results.path results.image_hash results.url results.load_time sub_organization_id uid project_veid campaign_id organization_id norm_attribs_website norm_attribs_version norm_attribs_type project_id system_timestamp doc_appid doc_response_url doc_url doc_status_code doc_status_msg doc_encoding doc_attrs_uid doc_timestamp doc_crawlid type norm_body norm_domain norm_author norm_url norm_timestamp norm_id
0 github.com/res Explicit 1.1 crawl hashtag Far NaN A7180EA-7078-0C7F-ED5D-86AD7 2A6DA0C-365BB-67DD-B05830920 #Far NaN NaN ray CDE2F42-5B87-C594-C900E578C 1838 NaN AF AF github.com/res 1.0 Page Render success https://east.amanaws.com/rays-ime-store/render... bb7674b8ea3fc05bfd027a19815f82c https://discooprdapp.com/ 32.0 default ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b default default default github.com/res 1.1 crawl default 2019-02-22T19:04:53.569623 subtter https://discooprdapp.com https://discooprdapp.com/ 200 OK utf-8 2ab8f2651cb32261b911c990a8b 2019-02-22T19:04:53.963 7fd95-785-4dd259-fcc-8752f crawl \n discordapp.com crawl https://discooprdapp.com 2019-02-22T19:04:53.961283+00:00 7fc5-685-4dd9-cc-8762f
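As an alternative to the hand-written flattener, here is a minimal sketch of the json_normalize route the question was attempting. It assumes the "_source" cells are text that looks like the sample row (single quotes and None, so Python literals rather than strict JSON), which is why it uses ast.literal_eval instead of json.loads. The two-row source_data frame is a made-up stand-in for the real CSV.

```python
import ast

import pandas as pd

# Hypothetical stand-in for the CSV: one "_source" column holding dict text.
source_data = pd.DataFrame({
    "_source": [
        "{'uid': 'a1', 'doc': {'status_code': 200, 'attrs': {'uid': 'x'}}, "
        "'norm': {'domain': 'example.com'}}",
        "{'uid': 'b2', 'doc': {'status_code': 404, 'attrs': {'uid': 'y'}}, "
        "'norm': {'domain': None}}",
    ]
})

# Parse each cell into a real dict (assumption: cells are Python-literal text).
records = source_data["_source"].apply(ast.literal_eval).tolist()

# json_normalize expands nested dicts into dotted column names.
# Values that are lists stay as list objects in a single cell.
flat_df = pd.json_normalize(records)
print(flat_df.columns.tolist())
```

If the cells already hold dicts rather than text, the literal_eval step can be dropped. Lists such as meta.rule_matcher are not expanded by this call, so they would still need separate handling.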
This question is regarding the use of todoist python api.
After adding an item (that's what a task is called in the API), I get these weird-looking IDs. I say weird because a regular ID is just an integer. These IDs are not usable in any way: I can't call api.items.get_by_id() with one. What is going on? How do I get out of this weird state?
'reply email': 'b5b4eb2c-b28f-11e9-bd8d-80e6500af142',
I printed out a few more IDs, and all the integer ones work well with all API calls, the UUID ones throw an exception.
3318771761
3318771783
3318771807
3318771823
3318771843
61c30a10-b2a0-11e9-98d7-80e6500af142
62326586-b2a0-11e9-98d7-80e6500af142
62a3ea9e-b2a0-11e9-98d7-80e6500af142
631222ac-b2a0-11e9-98d7-80e6500af142
63816338-b2a0-11e9-98d7-80e6500af142
63efd14c-b2a0-11e9-98d7-80e6500af142
You have to call api.commit() to perform the sync and get the final id. The id you're trying to use is only a temporary id that exists until the sync is done.
>>> import todoist; import os; token = os.environ.get('token'); api = todoist.TodoistAPI(token); item=api.items.add('test');
>>> item
Item({'content': 'test',
'id': '3b4d77c0-b891-11e9-a080-2afbabeedbe3',
'project_id': 1490600000})
>>> api.commit()
{'full_sync': False, 'sync_status': {'3b4d793c-b891-11e9-a080-2afbabeaeaef': 'ok'}, 'temp_id_mapping': {'3b4d77c0-b891-11e9-a080-2afbabeaeaef': 3331113774}, 'labels': [], 'project_notes': [], 'filters': [], 'sync_token': 'EEVprctG1E39VDqJfu_cwyhpO6rkOaavyU5r70Eu0nY1ZjsWSjssGr4qLLLikucJAu_Zakld7DniBsEyZ7i820dqcZNcOAbUcbzHFpMpSjzr-GALTA', 'day_orders': {}, 'projects': [], 'collaborators': [], 'day_orders_timestamp': '1564672738.25', 'live_notifications_last_read_id': 2259500000, 'items': [{'legacy_project_id': 1484800000, 'day_order': -1, 'assigned_by_uid': 5050000, 'labels': [], 'sync_id': None, 'section_id': None, 'in_history': 0, 'child_order': 3, 'date_added': '2019-08-06T21:29:18Z', 'id': 3331113774, 'content': 'test2', 'checked': 0, 'user_id': 5050000, 'due': None, 'priority': 1, 'parent_id': None, 'is_deleted': 0, 'responsible_uid': None, 'project_id': 1490600000, 'date_completed': None, 'collapsed': 0}], 'notes': [], 'reminders': [], 'due_exceptions': [], 'live_notifications': [], 'sections': [], 'collaborator_states': []}
>>> item
Item({'assigned_by_uid': 5050000,
'checked': 0,
'child_order': 3,
'collapsed': 0,
'content': 'test',
'date_added': '2019-08-06T21:29:18Z',
'date_completed': None,
'day_order': -1,
'due': None,
'id': 3331100000,
'in_history': 0,
'is_deleted': 0,
'labels': [],
'legacy_project_id': 148400000,
'parent_id': None,
'priority': 1,
'project_id': 1490660000,
'responsible_uid': None,
'section_id': None,
'sync_id': None,
'user_id': 5050000})
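The commit response shown above includes a 'temp_id_mapping' entry, so if you kept a temporary id around you could translate it afterwards. The following is only a sketch: resolve_id is a hypothetical helper (not part of the todoist library), and the response dict is a trimmed, hand-written copy of the shape printed above.

```python
def resolve_id(item_id, commit_response):
    """Return the permanent id for a temporary id, or item_id unchanged."""
    mapping = commit_response.get("temp_id_mapping", {})
    return mapping.get(item_id, item_id)


# Trimmed, hypothetical response shaped like the api.commit() output above.
response = {
    "temp_id_mapping": {"3b4d77c0-b891-11e9-a080-2afbabeaeaef": 3331113774},
}

temp_id = "3b4d77c0-b891-11e9-a080-2afbabeaeaef"
print(resolve_id(temp_id, response))     # permanent integer id
print(resolve_id(3318771761, response))  # already permanent, returned as-is
```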
I am trying to scrape some ticketing inventory info using Stubhub's API, but I cannot seem to figure out how to loop through the get request.
I basically want to loop through multiple events. The eventid_list is a list of eventids. The code I have is below:
inventory_url = 'https://api.stubhub.com/search/inventory/v2'
for eventid in eventid_list:
    data = {'eventid': eventid, 'rows':500}
    inventory = requests.get(inventory_url, headers=headers, params=data)
    inv = inventory.json()
    print(inv)
    listing_df = pd.DataFrame(inv['listing'])
When I run this, the dataframe only returns results for one event, instead of multiple. What am I doing wrong?
EDIT: print(inv) outputs something like this:
{
'eventId': 102994860,
'totalListings': 82,
'totalTickets': 236,
'minQuantity': 1,
'maxQuantity': 6,
'listing': [
{
'listingId': 1297697413,
'currentPrice': {'amount': 108.58, 'currency': 'USD'},
'listingPrice': {'amount': 88.4, 'currency': 'USD'},
'sectionId': 1638686,
'row': 'E',
'quantity': 6,
'sellerSectionName': 'FRONT MEZZANINE RIGHT',
'sectionName': 'Front Mezzanine Sides',
'seatNumbers': '2,4,6,8,10,12',
'zoneId': 240236,
'zoneName': 'Front Mezzanine',
'deliveryTypeList': [5],
'deliveryMethodList': [23, 24, 25],
'isGA': 0,
'dirtyTicketInd': False,
'splitOption': '2',
'ticketSplit': '1',
'splitVector': [1, 2, 3, 4, 6],
'sellerOwnInd': 0,
'score': 0.0
},
...
{
'listingId': 1297697417,
'currentPrice': {'amount': 108.58, 'currency': 'USD'},
'listingPrice': {'amount': 88.4, 'currency': 'USD'},
'sectionId': 1638686,
'row': 'D',
'quantity': 3,
'sellerSectionName': 'FRONT MEZZANINE RIGHT',
'sectionName': 'Front Mezzanine Sides',
'seatNumbers': '2,4,6',
'zoneId': 240236,
'zoneName': 'Front Mezzanine',
'deliveryTypeList': [5],
'deliveryMethodList': [23, 24, 25],
'isGA': 0,
'dirtyTicketInd': False,
'splitOption': '2',
'ticketSplit': '1',
'splitVector': [1, 3],
'sellerOwnInd': 0,
'score': 0.0
},
]
}
I'm guessing inventory.json()['listing'] is a list of events. If so, you can try this:
import itertools

import pandas as pd
import requests

inventory_url = 'https://api.stubhub.com/search/inventory/v2'

def get_event(eventid):
    """Given an event id returns inventory['listing']"""
    data = {'eventid': eventid, 'rows':500}
    inventory = requests.get(inventory_url, headers=headers, params=data)
    return inventory.json().get('listing', [])

# Concatenate output of all events (chain.from_iterable flattens one level)
events = itertools.chain.from_iterable(get_event(eventid) for eventid in eventid_list)
listing_df = pd.DataFrame(list(events))
This is just a starting point; you will have to deal with cases where inventory.status_code != 200. The resulting dataframe is probably not very useful as-is, so you may have to flatten some of the nested attributes of the listing items, like currentPrice and listingPrice.
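One way to flatten the nested price fields mentioned above might look like this. It is a sketch using pd.json_normalize on two trimmed listings copied from the sample output in the question; the real listings have more keys.

```python
import pandas as pd

# Two trimmed listings shaped like the print(inv) output in the question.
listings = [
    {"listingId": 1297697413, "row": "E", "quantity": 6,
     "currentPrice": {"amount": 108.58, "currency": "USD"},
     "listingPrice": {"amount": 88.4, "currency": "USD"}},
    {"listingId": 1297697417, "row": "D", "quantity": 3,
     "currentPrice": {"amount": 108.58, "currency": "USD"},
     "listingPrice": {"amount": 88.4, "currency": "USD"}},
]

# Nested dicts become dotted columns such as "currentPrice.amount".
listing_df = pd.json_normalize(listings)
print(listing_df[["listingId", "currentPrice.amount", "listingPrice.amount"]])
```

List-valued fields such as splitVector are left as lists in a single column.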
I have several lists of data which looks like this:
ISIN Currency Rates
26545555 Eur 0.12345
56554455 Eur 0.25665
75884554 Eur 0.89654
I want to save this data into a dictionary or json like format.
So I am trying to store the following data:
id: 0, ISIN: 26545555, Currency: Eur, Rates: 0.12345
id: 1, ISIN: 56554455, Currency: Eur, Rates: 0.25665
The problem is I am trying to use the following dictionary:
dict_data = {'id': '', 'ISIN': '', 'Currency': ''}
But when I try to append data it doesn't store all of the data.
I am getting the data from an Excel sheet using Pandas. If you think I should use something else, please let me know.
You should use list of dicts:
[
{'id': 0, 'ISIN': '', 'Currency': ''},
{'id': 1, 'ISIN': '', 'Currency': ''}
]
or if you want dict:
{
0: {'ISIN': '', 'Currency': ''},
1: {'ISIN': '', 'Currency': ''}
}
or (partly based on Marco's suggestion):
{
0: [26545555, 'Eur', 0.12345],
1: [26545555, 'Eur', 0.12345]
}
Personally, I prefer the second variant if you need to access elements by ID.
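To make the first two variants concrete, here is a small sketch that builds both from the rows in the question. The names columns, rows, records and by_id are just illustrative.

```python
columns = ["ISIN", "Currency", "Rates"]
rows = [
    [26545555, "Eur", 0.12345],
    [56554455, "Eur", 0.25665],
    [75884554, "Eur", 0.89654],
]

# Variant 1: a list of dicts, one dict per row, with an explicit id.
records = [{"id": i, **dict(zip(columns, row))} for i, row in enumerate(rows)]

# Variant 2: a dict keyed by id, for direct lookup.
by_id = {i: dict(zip(columns, row)) for i, row in enumerate(rows)}

print(records[1])
print(by_id[1]["Rates"])
```

Since the data comes from pandas, df.to_dict("records") would give a similar list of dicts directly (without the id key unless it is a column).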
Make each item a dict, and group all the items in a list:
[
{'id': 0, 'ISIN': '', 'Currency': ''},
{'id': 1, 'ISIN': '', 'Currency': ''}
]
I have a list of dicts, and I want to sort the list by a value inside each dict.
How does this work?
[{'id': 0, 'thread': 'First',
'post': [
{'id': 0, 'title': 'MyPost', 'time': '2015-11-07 01:06:08.939687'}]
},
{'id': 1, 'thread': 'Second',
'post': [
{'id': 0, 'title': 'MyPost', 'time': '2015-11-07 01:06:42.933263'}]},
{'id': 2, 'name': 'NoPosts', 'post': []}]
I would like to sort my Threadlist by time of the first post, is that possible?
You can pass sort or sorted a key function:
In [11]: def key(x):
             try:
                 return x["post"][0]["time"]  # return ISO date string
             except IndexError:
                 return "Not a date string"  # any letter > all date strings
In [12]: sorted(d, key=key)
Out[12]:
[{'id': 0,
'post': [{'id': 0, 'time': '2015-11-07 01:06:08.939687', 'title': 'MyPost'}],
'thread': 'First'},
{'id': 1,
'post': [{'id': 0, 'time': '2015-11-07 01:06:42.933263', 'title': 'MyPost'}],
'thread': 'Second'},
{'id': 2, 'name': 'NoPosts', 'post': []}]
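A variation on the same idea, sketched here as an alternative rather than a correction: instead of relying on a sentinel string sorting after the dates, the key can return a tuple whose first element flags threads without posts, so they always land at the end. The threads list is the question's data in shuffled order, and first_post_time is a hypothetical name.

```python
threads = [
    {"id": 2, "name": "NoPosts", "post": []},
    {"id": 1, "thread": "Second",
     "post": [{"id": 0, "title": "MyPost", "time": "2015-11-07 01:06:42.933263"}]},
    {"id": 0, "thread": "First",
     "post": [{"id": 0, "title": "MyPost", "time": "2015-11-07 01:06:08.939687"}]},
]


def first_post_time(thread):
    """Sort key: threads with posts first (by first post time), empty ones last."""
    posts = thread.get("post", [])
    if posts:
        return (0, posts[0]["time"])
    return (1, "")


ordered = sorted(threads, key=first_post_time)
print([t["id"] for t in ordered])
```

Comparing the timestamps as strings works here because they share one fixed, zero-padded format.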