How do I read a YAML file into a Jupyter notebook?

I have a file from an OpenAPI spec that I have been trying to access in a Jupyter notebook. It is a .yaml file. I was able to upload it into Jupyter and put it in the same folder as the notebook I'd like to use to access it. I am new to Jupyter and Python, so I'm sorry if this is a basic question. I found a forum post that suggested this code to read the data (in my case, the file "openapi.yaml"):
import yaml

with open("openapi.yaml", 'r') as stream:
    try:
        print(yaml.safe_load(stream))
    except yaml.YAMLError as exc:
        print(exc)
This seems to bring the data in, but it is a completely unstructured stream like so:
{'openapi': '3.0.0', 'info': {'title': 'XY Tracking API', 'version': '2.0', 'contact': {'name': 'Narrativa', 'url': 'http://link, 'email': '}, 'description': 'The XY Tracking Project collects information from different data sources to provide comprehensive data for the XYs, X-Y. Contact Support:'}, 'servers': [{'url': 'link'}], 'paths': {'/api': {'get': {'summary': 'Data by date range', 'tags': [], 'responses': {'200': {'description': 'OK', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/covidtata'}}}}}, 'operationId': 'get-api', 'parameters': [{'schema': {'type': 'string', 'format': 'date'}, 'in': 'query', 'name': 'date_from', 'description': 'Date range beginig (YYYY-DD-MM)', 'required': True}, {'schema': {'type': 'string', 'format': 'date'}, 'in': 'query', 'name': 'date_to', 'description': 'Date range ending (YYYY-DD-MM)'}], 'description': 'Returns the data for a specific date range.'}}, '/api/{date}': {'parameters': [{'schema': {'type': 'string', 'format': 'date'}, 'name': 'date', 'in': 'path', 'required': True}], 'get': {'summary': 'Data by date', 'tags': [], 'responses': {'200': {'description': 'OK', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/data'}}}}}, 'operationId': 'get-api-date', 'description': 'Returns the data for a specific day.'}}, '/api/country/{country}': {'parameters': [{'schema': {'type': 'string', 'example': 'spain'}, 'name': 'country', 'in': 'path', 'required': True, 'example': 'spain'}, {'schema': {'type': 'strin
...etc.
I'd like to work through the data for analysis but can't seem to access it correctly. Any help would be extremely appreciated!!! Thank you so much for reading.

What you're seeing in the output is not unstructured: yaml.safe_load has parsed the YAML into a nested Python data structure (dictionaries and lists), and print is showing it on one line, without the newlines or indentation a human-oriented dump would have. This machine-readable form is exactly what you want: you should be able to work with this data just fine in your code.
Alternatively, you may want to consider another parser/emitter such as ruamel.yaml, which can make dealing with YAML files considerably easier than the package you're currently importing. It can also dump the data back out with lines and indentation preserved for better readability.
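For example, a minimal sketch of drilling into the parsed structure (the key names follow the spec shown above):
import yaml

# safe_load parses the YAML into nested Python dicts and lists
with open("openapi.yaml", 'r') as stream:
    spec = yaml.safe_load(stream)

# Index into it like any other dictionary
print(spec['info']['title'])    # 'XY Tracking API'
print(spec['openapi'])          # '3.0.0'
for path, ops in spec['paths'].items():
    print(path, list(ops))      # each endpoint and its keys

# Re-dump with indentation for a readable view
print(yaml.safe_dump(spec, default_flow_style=False))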

Related

Why does reading a JSON-format file result in all the records going to _corrupt_record in PySpark?

I am reading data from an API call and the data is in the form of JSON, like below:
{'success': True, 'errors': [], 'requestId': '151a2#fg', 'warnings': [], 'result': [{'id': 10322433, 'name': 'sdfdgd', 'desc': '', 'createdAt': '2016-09-20T13:48:58Z+0000', 'updatedAt': '2020-07-16T13:08:03Z+0000', 'url': 'https://eda', 'subject': {'type': 'Text', 'value': 'Register now'}, 'fromName': {'type': 'Text', 'value': 'ramjdn fg'}, 'fromEmail': {'type': 'Text', 'value': 'ffdfee#ozx.com'}, 'replyEmail': {'type': 'Text', 'value': 'ffdfee#ozx.com'}, 'folder': {'type': 'Folder', 'value': 478, 'folderName': 'sjha'}, 'operational': False, 'textOnly': False, 'publishToMSI': False, 'webView': False, 'status': 'approved', 'template': 1031, 'workspace': 'Default', 'isOpenTrackingDisabled': False, 'version': 2, 'autoCopyToText': True, 'preHeader': None}]}
Now, when I create a dataframe out of this data using the code below:
df = spark.read.json(sc.parallelize([data]))
I get only one column, _corrupt_record; below is the dataframe output I am getting. I have tried using the multiLine=True option but am still not getting the desired output.
+--------------------+
|     _corrupt_record|
+--------------------+
|{'id': 12526, 'na...|
+--------------------+
The expected output is a dataframe with the JSON exploded into different columns: id as one column, name as another, and so on.
I have tried a lot of things but have not been able to fix this.
I made certain changes and it worked.
I needed to define a custom schema, then used this bit of code:
data = sc.parallelize([items])
df = spark.createDataFrame(data, schema=schema)
And it worked.
If there is a more optimized solution to this, please feel free to share.
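For reference, a minimal sketch of what the custom-schema route can look like. The field names below are taken from the response shown above, and items is assumed to be the parsed response dict; the nested result fields would need to be extended to match the full payload:
from pyspark.sql.types import (StructType, StructField, StringType,
                               BooleanType, LongType, ArrayType)

# Schema for a few of the top-level fields shown above; extend as needed
schema = StructType([
    StructField('success', BooleanType(), True),
    StructField('requestId', StringType(), True),
    StructField('result', ArrayType(StructType([
        StructField('id', LongType(), True),
        StructField('name', StringType(), True),
        StructField('status', StringType(), True),
    ])), True),
])

data = sc.parallelize([items])
df = spark.createDataFrame(data, schema=schema)
df.select('requestId', 'success').show()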

How to save a JSON file in Python from an API response when the class is a list and the object is not serializable

I have tried to find the answer but I could not find it.
I am looking for a way to save a JSON file from Python to my computer.
I call the API:
configuration = api.Configuration()
configuration.api_key['X-XXXX-Application-ID'] = 'xxxxxxx'
configuration.api_key['X-XXX-Application-Key'] = 'xxxxxxxx1'

## List our parameters as search operators
opts = {
    'title': 'Deutsche Bank',
    'body': 'fraud',
    'language': ['en'],
    'published_at_start': 'NOW-7DAYS',
    'published_at_end': 'NOW',
    'per_page': 1,
    'sort_by': 'relevance'
}

try:
    ## Make a call to the Stories endpoint for stories that meet the criteria of the search operators
    api_response = api_instance.list_stories(**opts)
    ## Print the returned story
    pp(api_response.stories)
except ApiException as e:
    print('Exception when calling DefaultApi->list_stories: %s\n' % e)
I got a response like this:
[{'author': {'avatar_url': None, 'id': 1688440, 'name': 'Pranav Nair'},
'body': 'The law firm will investigate whether the bank or its officials have '
'engaged in securities fraud or unlawful business practices. '
'Industries: Bank Referenced Companies: Deutsche Bank',
'categories': [{'confident': False,
'id': 'IAB11-5',
'level': 2,
'links': {'_self': 'https://,
'parent': 'https://'},
'score': 0.39,
'taxonomy': 'iab-qag'},
{'confident': False,
'id': 'IAB3-12',
'level': 2,
'links': {'_self': 'https://api/v1/classify/taxonomy/iab-qag/IAB3-12',
'score': 0.16,
'taxonomy': 'iab-qag'},
'clusters': [],
'entities': {'body': [{'indices': [[168, 180]],
'links': {'dbpedia': 'http://dbpedia.org/resource/Deutsche_Bank'},
'score': 1.0,
'text': 'Deutsche Bank',
'types': ['Bank',
'Organisation',
'Company',
'Banking',
'Agent']},
{'indices': [[80, 95]],
'links': {'dbpedia': 'http://dbpedia.org/resource/Securities_fraud'},
'score': 1.0,
'text': 'securities fraud',
'types': ['Practice', 'Company']},
'hashtags': ['#DeutscheBank', '#Bank', '#SecuritiesFraud'],
'id': 3004661328,
'keywords': ['Deutsche',
'behalf',
'Bank',
'firm',
'investors',
'Deutsche Bank',
'bank',
'fraud',
'unlawful'],
'language': 'en',
'links': {'canonical': None,
'coverages': '/coverages?story_id=3004661328',
'permalink': 'https://www.snl.com/interactivex/article.aspx?KPLT=7&id=58657069',
'related_stories': '/related_stories?story_id=3004661328'},
'media': [],
'paragraphs_count': 1,
'published_at': datetime.datetime(2020, 5, 19, 16, 8, 5, tzinfo=tzutc()),
'sentences_count': 2,
'sentiment': {'body': {'polarity': 'positive', 'score': 0.599704},
'title': {'polarity': 'neutral', 'score': 0.841333}},
'social_shares_count': {'facebook': [],
'google_plus': [],
'source': {'description': None,
'domain': 'snl.com',
'home_page_url': 'http://www.snl.com/',
'id': 8256,
'links_in_count': None,
'locations': [{'city': 'Charlottesville',
'country': 'US',
'state': 'Virginia'}],
'logo_url': None,
'name': 'SNL Financial',
'scopes': [{'city': None,
'country': 'US',
'level': 'national',
'state': None},
{'city': None,
'country': None,
'level': 'international',
'state': None}],
'title': None},
'summary': {'sentences': ['The law firm will investigate whether the bank or '
'its officials have engaged in securities fraud or '
'unlawful business practices.',
'Industries: Bank Referenced Companies: Deutsche '
'Bank']},
'title': "Law firm to investigate Deutsche Bank's US ops on behalf of "
'investors',
'translations': {'en': None},
'words_count': 26}]
The documentation says: "Stories you retrieve from the API are returned as JSON objects by default. These JSON story objects contain 22 top-level fields, whereas a full story object will contain 95 unique data points."
The class is a list. When I try to save a JSON file I get the error "TypeError: Object of type Story is not JSON serializable".
How can I save a JSON file to my computer?
The response you got is not JSON: JSON uses double quotes, but here it is single quotes (with Python literals such as None). Copy-paste your response into the following link to see the issues: http://json.parser.online.fr/.
If it were changed like
[{"author": {"avatar_url": null, "id": 1688440, "name": "Pranav Nair"},
"body": "......
it would parse. For a string that is already valid JSON, you can use the Python json module:
import json
data = json.loads(json_text)
Really, though, it should be the duty of the API provider to return something serializable; to make it work, you can convert the result you got before dumping it.
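As a concrete sketch of the saving step, one way past the "not JSON serializable" error is to turn each Story into a plain dict before dumping. The to_dict() call here is an assumption (Swagger/OpenAPI-generated clients commonly provide it, but check yours), and default=str catches remaining non-JSON types such as the datetime values visible in the response above:
import json

# Assumption: the generated model class exposes to_dict(); keep the raw
# object otherwise and let default=str stringify what json can't handle
stories = [s.to_dict() if hasattr(s, 'to_dict') else s for s in api_response.stories]

with open('stories.json', 'w') as f:
    json.dump(stories, f, indent=2, default=str)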

Pandas: read the chat log JSON into a data frame?

How do I convert the structure below into a data frame? The list contains details about cloud containers; I want to extract information such as name, language, description and workspace id.
{'workspaces': [{'name': 'A_SupportAgent_dev',
'language': 'en',
'metadata': {'api_version': {'major_version': 'v1',
'minor_version': '2019-02-28'},
'digressions': True},
'description': 'Credit Card Banking Support Agent to assist with Sales And Service, created by Oliver Ivanoski and Steve Green',
'workspace_id': '',
'learning_opt_out': False},
{'name': 'Neatnik Watson Assistant Webhook Demo Skill',
'language': 'en',
'metadata': {'api_version': {'major_version': 'v1',
'minor_version': '2019-02-28'}},
'webhooks': [{'url': 'https://neatnik.net/watson/assistant/webhook/',
'name': 'main_webhook',
'headers': []}],
'description': '',
'workspace_id': '',
'system_settings': {'tooling': {'store_generic_responses': True},
'system_entities': {'enabled': True},
'spelling_auto_correct': True},
'learning_opt_out': False}],
'pagination': {'refresh_url': '/v1/workspaces?version=2019-02-28'}}
I want to convert the above list into a data frame with those columns. I tried:
pd.DataFrame(list(Workspace_List.items()), columns=['workspaces', 'pagination'])
columns = list(Workspace_List.keys())
values = list(Workspace_List.values())
arr_len = len(values)
You need to specify the columns, as you have nested dictionaries. I think the following code will help you organize your desired output:
key = ['name', 'language', 'description', 'workspace_id']
output = pd.DataFrame(columns=key)

# One row per workspace, keeping only the requested keys
for i in range(len(Workspace_List['workspaces'])):
    ll = Workspace_List['workspaces'][i]
    output.loc[i] = [ll[x] for x in key]
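A more compact alternative, assuming a reasonably recent pandas (older versions expose the same function as pd.io.json.json_normalize), is to flatten the workspace list directly and select the columns you want:
import pandas as pd

key = ['name', 'language', 'description', 'workspace_id']

# Each workspace dict becomes one row; nested keys get dotted column names
output = pd.json_normalize(Workspace_List['workspaces'])[key]
print(output)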

How to flatten the nested-dict-formatted '_source' column of a CSV into a dataframe

I have a CSV with 500+ rows where one column, "_source", is stored as JSON. I want to extract that into a pandas dataframe, with each key as its own column. The data is a ~1 MB JSON file of online social media content (Facebook, Twitter, web-crawled, etc.). There are approximately 528 separate rows of posts/tweets/text, each having many dictionaries inside dictionaries, and I need to turn all the key-value pairs of those nested dictionaries into columns of a dataframe. I am attaching a few steps from my Jupyter notebook below to give a more complete understanding.
Thank you so much, this will be a huge help!!!
I have tried changing it to a dataframe by doing this:
source = pd.DataFrame.from_dict(source, orient='columns')
And it returns something like this... I thought it might unpack the dictionary but it did not.
#source.head()
#_source
#0 {'sub_organization_id': 'default', 'uid': 'aba...
#1 {'sub_organization_id': 'default', 'uid': 'ab0...
#2 {'sub_organization_id': 'default', 'uid': 'ac0...
Below is the shape:
#source.shape (528, 1)
Below is what an actual "_source" row looks like stretched out. There are many dictionaries and key:value pairs where each key needs to be its own column. Thanks! The actual links have been altered/scrambled for privacy reasons.
{'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
Before you post, please make sure the actual code works for the data attached. Thanks!
I tried the code below, but it did not work; there was a syntax error that I could not figure out:
pd.io.json.json_normalize(source_data.[_source].apply(json.loads))

pd.io.json.json_normalize(source_data.[_source].apply(json.loads))
                                      ^
SyntaxError: invalid syntax
Whoever can help me with this will be a saint!
I had to do something like that a while back. Basically, I used a function that completely flattens out the JSON to identify the keys that would be turned into the columns, then iterated through the JSON to reconstruct each row and append it to a "results" dataframe. With the data you provided, it created a row of 52 columns, and looking through it, it seems to have included every key as its own column. Anything nested, for example 'meta': {'rule_matcher': [{'atribs': {'website': ...}}]}, then gets a column name like meta.rule_matcher.atribs.website, where the '.' denotes the nested keys.
data_source = {'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
Code:
def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out
flat = flatten_json(data_source)

import pandas as pd
import re

results = pd.DataFrame()
special_cols = []
columns_list = list(flat.keys())

for item in columns_list:
    try:
        row_idx = re.findall(r'\_(\d+)\_', item)[0]
    except:
        special_cols.append(item)
        continue
    column = re.findall(r'\_\d+\_(.*)', item)[0]
    column = re.sub(r'\_\d+\_', '.', column)
    row_idx = int(row_idx)
    value = flat[item]
    results.loc[row_idx, column] = value

for item in special_cols:
    results[item] = flat[item]
Output:
print(results.to_string())
atribs_website atribs_source atribs_version atribs_type results.rule_type results.rule_tag results.description results.project_veid results.campaign_id results.value results.organization_id results.sub_organization_id results.appid results.project_id results.rule_id results.node_id results.metadata_campaign_title results.metadata_project_title attribs_website attribs_version attribs_type results.render_status results.path results.image_hash results.url results.load_time sub_organization_id uid project_veid campaign_id organization_id norm_attribs_website norm_attribs_version norm_attribs_type project_id system_timestamp doc_appid doc_response_url doc_url doc_status_code doc_status_msg doc_encoding doc_attrs_uid doc_timestamp doc_crawlid type norm_body norm_domain norm_author norm_url norm_timestamp norm_id
0 github.com/res Explicit 1.1 crawl hashtag Far NaN A7180EA-7078-0C7F-ED5D-86AD7 2A6DA0C-365BB-67DD-B05830920 #Far NaN NaN ray CDE2F42-5B87-C594-C900E578C 1838 NaN AF AF github.com/res 1.0 Page Render success https://east.amanaws.com/rays-ime-store/render... bb7674b8ea3fc05bfd027a19815f82c https://discooprdapp.com/ 32.0 default ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b default default default github.com/res 1.1 crawl default 2019-02-22T19:04:53.569623 subtter https://discooprdapp.com https://discooprdapp.com/ 200 OK utf-8 2ab8f2651cb32261b911c990a8b 2019-02-22T19:04:53.963 7fd95-785-4dd259-fcc-8752f crawl \n discordapp.com crawl https://discooprdapp.com 2019-02-22T19:04:53.961283+00:00 7fc5-685-4dd9-cc-8762f
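As an aside, the SyntaxError in the question comes from the stray dot in source_data.[_source]: bracket indexing takes no dot, and the column name must be a quoted string. Assuming the CSV column holds JSON strings, a corrected call would look like the sketch below; if the cells actually hold Python-style dict reprs with single quotes (as in the sample above), swap json.loads for ast.literal_eval:
import json
import pandas as pd

source_data = pd.read_csv('posts.csv')  # hypothetical file name

# Bracket indexing with a quoted column name, then parse each cell
flat_df = pd.json_normalize(source_data['_source'].apply(json.loads))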

How do I get the 'Resolution' from Jira with Python and XML-RPC

I'm using Python to fetch issues from Jira with XML-RPC. It works well, except that the 'Resolution' field is missing from the returned dictionary, for example 'Fixed' or 'Won't Fix', etc.
This is how I get the issue from Jira:
import xmlrpclib
s = xmlrpclib.ServerProxy('http://myjira.com/rpc/xmlrpc')
auth = s.jira1.login('user', 'pass')
issue = s.jira1.getIssue(auth, 'PROJ-28')
print issue.keys()
And this is the list of fields that I get back:
['status', 'project', 'attachmentNames', 'votes', 'updated',
'components', 'reporter', 'customFieldValues', 'created',
'fixVersions', 'summary', 'priority', 'assignee', 'key',
'affectsVersions', 'type', 'id', 'description']
The full content is:
{'affectsVersions': [{'archived': 'false',
'id': '11314',
'name': 'v3.09',
'released': 'false',
'sequence': '7'}],
'assignee': 'myuser',
'attachmentNames': '2011-08-17_attach.tar.gz',
'components': [],
'created': '2011-06-14 12:33:54.0',
'customFieldValues': [{'customfieldId': 'customfield_10040', 'values': ''},
{'customfieldId': 'customfield_10010',
'values': 'Normal'}],
'description': "Blah blah...\r\n",
'fixVersions': [],
'id': '28322',
'key': 'PROJ-28',
'priority': '3',
'project': 'PROJ',
'reporter': 'myuser',
'status': '1',
'summary': 'blah blah...',
'type': '1',
'updated': '2011-08-18 15:41:04.0',
'votes': '0'}
When I do:
resolutions = s.jira1.getResolutions(auth)
pprint.pprint(resolutions)
I get:
[{'description': 'A fix for this issue is checked into the tree and tested.',
'id': '1',
'name': 'Fixed'},
{'description': 'The problem described is an issue which will never be fixed.',
'id': '2',
'name': "Won't Fix"},
{'description': 'The problem is a duplicate of an existing issue.',
'id': '3',
'name': 'Duplicate'},
{'description': 'The problem is not completely described.',
'id': '4',
'name': 'Incomplete'},
{'description': 'All attempts at reproducing this issue failed, or not enough information was available to reproduce the issue. Reading the code produces no clues as to why this behavior would occur. If more information appears later, please reopen the issue.',
'id': '5',
'name': 'Cannot Reproduce'},
{'description': 'Code is checked in, and is, er, ready for build.',
'id': '6',
'name': 'Ready For Build'},
{'description': 'Invalid bug', 'id': '7', 'name': 'Invalid'}]
The Jira version is v4.1.1#522 and I am using Python 2.7.
Any ideas why I don't get a field called 'resolution'?
Thanks!
The answer is that the getIssue method in JiraXmlRpcService.java calls makeIssueStruct with a RemoteIssue object. The RemoteIssue object contains the Resolution field, but makeIssueStruct copies only values that are set. So if Resolution is not set, it won't appear in the Hashtable there.
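So on the client side, a missing key simply means the issue is unresolved, and you can translate the id through getResolutions when it is present. A minimal sketch in the question's Python 2 style (that the value would arrive as an id string under the key 'resolution' is an assumption, by analogy with 'status' and 'priority' above):
import xmlrpclib

s = xmlrpclib.ServerProxy('http://myjira.com/rpc/xmlrpc')
auth = s.jira1.login('user', 'pass')

# Map resolution ids to names, e.g. {'1': 'Fixed', '2': "Won't Fix", ...}
res_names = dict((r['id'], r['name']) for r in s.jira1.getResolutions(auth))

issue = s.jira1.getIssue(auth, 'PROJ-28')
# The key is absent when no resolution has been set
res_id = issue.get('resolution')
print res_names.get(res_id, 'Unresolved')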
