Constructing JSON Array - python

I want something like this:
[
('Urbandale paid the Regional Municipality of Ottawa-Carleton "redevelopment" charges',
{'entities': [(0, 9, 'PLTF')]}),
('Urbandale carries on business as a land developer.',
{'entities': [(0, 9, 'PLTF')]})
]
I am able to set it as a variable and get it to work, but now I am try to automate constructing the JSON array. This one seems to be in a unconventional format? i.e. why is there a comma after the sentence and not :
Anyway it is the format that is required by spacy.
I've tried creating a dictionary then doing json.dumps into a json object, which is a lot easier but what is wanted is an array.
I've looked at this post: Python - Create array of json objects from for loops and tried iterating constructively but I just get an invalid syntax error
spans = []
for mention in mentions:
mention = str(mention)
for f in re.finditer(subj, mention):
spans.append(f.span())
train_data = [{mention, "entities": (f.span()[0], f.span()[1], 'PLTF')} for mention, span in zip(mentions, spans)]
Edit: using json.load(), TEST_DATA[0][1] (the structure I want) yields pretty much the same result as TRAIN_DATA[0][1] except with the additional () on the inside of the dict. I'm pretty sure its the culprit as I get this error: TypeError: 'int' object is not iterable. So how would I insert that? If I simply put an extra () around it, python parses it and it's removed.
Solved: Just put an extra []
Thanks

What you want is actually a list of tuples of str as the first element and dict as the second element. Simply put more parenthesis and move your curry bracket.
train_data = [(mention, {"entities": (f.span()[0], f.span()[1], 'PLTF')}) for mention, span in zip(mentions, spans)]

Related

Python: Json.load gives list and can't parse data from it

I have a data.json file, which looks like this:
["{\"Day\":\"Today\",\"Event\":\"1\", \"Date\":\"2019-03-20\"}"]
I am trying to get "Event" from this file using python and miserably failing at this.
with open('data.json', 'r') as json_file:
data = json.load(json_file)
print (data['Event'])
I get the following error:
TypeError: list indices must be integers or slices, not str
And even when I try
print (data[0]['Event'])
then I get this error:
TypeError: string indices must be integers
One more thing:
print(type(data))
gives me "list"
I have searched all over and have not found a solution to this. I would really appreciate your suggestions.
You could use the ast module for this:
import ast
mydata = ["{\"Day\":\"Today\",\"Event\":\"1\", \"Date\":\"2019-03-20\"}"]
data = ast.literal_eval(mydata[0])
data
{'Day': 'Today', 'Event': '1', 'Date': '2019-03-20'}
data['Event']
'1'
Edit
Your original code does load the data into a list structure, but only contains a single string entry inside that list, despite proper json syntax. ast, like json, will parse that string entry into a python data structure, dict.
As it sits, when you try to index that list, it's not the same as calling a key in a dict, hence the slices cannot be str:
alist = [{'a':1, 'b':2, 'c':3}]
alist['a']
TypeError
# need to grab the dict entry from the list
adict = alist[0]
adict['a']
1
You need to convert the elements in data to dict using json module.
Ex:
import json
with open(filename) as infile:
data = json.load(infile)
for d in data:
print(json.loads(d)['Event'])
Or:
data = list(map(json.loads, data))
print(data[0]["Event"])
Output:
1
Your problem is that you are parsing it as a list that consists of a single element that is a string.
["{\"Day\":\"Today\",\"Event\":\"1\", \"Date\":\"2019-03-20\"}"]
See how the entire content of the list is surrounded by " on either side and every other " is preceded by a \? The slash generally means to ignore the special meaning the following character might have, but interpret it as purely a string.
If you have control over the file's contents, the easiest solution would be to adjust it. You will want it to be in a format like this:
[{"Day":"Today", "Event": "1", "Date": "2019-03-20"}]
Edit: As others have suggested, you can also parse it in its current state. Granted, cleaning the data is tedious, but oftentimes worth the effort. Though this may not be one of those cases. I'm leaving this answer up anyway because it may help with explaining why OPs initial attempt did not work, and why he received the error messages he got.

Python- Insert new values into 'nested' list?

What I'm trying to do isn't a huge problem in php, but I can't find much assistance for Python.
In simple terms, from a list which produces output as follows:
{"marketId":"1.130856098","totalAvailable":null,"isMarketDataDelayed":null,"lastMatchTime":null,"betDelay":0,"version":2576584033,"complete":true,"runnersVoidable":false,"totalMatched":null,"status":"OPEN","bspReconciled":false,"crossMatching":false,"inplay":false,"numberOfWinners":1,"numberOfRunners":10,"numberOfActiveRunners":8,"runners":[{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":2.8,"size":34.16},{"price":2.76,"size":200},{"price":2.5,"size":237.85}],"availableToLay":[{"price":2.94,"size":6.03},{"price":2.96,"size":10.82},{"price":3,"size":33.45}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832765}...
All I want to do is add in an extra field, containing the 'runner name' in the data set below, into each of the 'runners' sub lists from the initial data set, based on selection_id=selectionId.
So initially I iterate through the full dataset, and then create a separate list to get the runner name from the runner id (I should point out that runnerId===selectionId===selection_id, no idea why there are multiple names are used), this works fine and the code is shown below:
for market_book in market_books:
market_catalogues = trading.betting.list_market_catalogue(
market_projection=["RUNNER_DESCRIPTION", "RUNNER_METADATA", "COMPETITION", "EVENT", "EVENT_TYPE", "MARKET_DESCRIPTION", "MARKET_START_TIME"],
filter=betfairlightweight.filters.market_filter(
market_ids=[market_book.market_id],
),
max_results=100)
data = []
for market_catalogue in market_catalogues:
for runner in market_catalogue.runners:
data.append(
(runner.selection_id, runner.runner_name)
)
So as you can see I have the data in data[], but what I need to do is add it to the initial data set, based on the selection_id.
I'm more comfortable with Php or Javascript, so apologies if this seems a bit simplistic, but the code snippets I've found on-line only seem to assist with very simple Python lists and nothing 'nested' (to me the structure seems similar to a nested array).
As per the request below, here is the full list:
{"marketId":"1.130856098","totalAvailable":null,"isMarketDataDelayed":null,"lastMatchTime":null,"betDelay":0,"version":2576584033,"complete":true,"runnersVoidable":false,"totalMatched":null,"status":"OPEN","bspReconciled":false,"crossMatching":false,"inplay":false,"numberOfWinners":1,"numberOfRunners":10,"numberOfActiveRunners":8,"runners":[{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":2.8,"size":34.16},{"price":2.76,"size":200},{"price":2.5,"size":237.85}],"availableToLay":[{"price":2.94,"size":6.03},{"price":2.96,"size":10.82},{"price":3,"size":33.45}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832765},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":20,"size":3},{"price":19.5,"size":26.36},{"price":19,"size":2}],"availableToLay":[{"price":21,"size":13},{"price":22,"size":2},{"price":23,"size":2}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832767},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":11,"size":9.75},{"price":10.5,"size":3},{"price":10,"size":28.18}],"availableToLay":[{"price":11.5,"size":12},{"price":13.5,"size":2},{"price":14,"size":7.75}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832766},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":48,"size":2},{"price":46,"size":5},{"price":42,"size":5}],"availableToLay":[{"price":60,"size":7},{"price":70,"size":5},{"price":75,"size":10}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832769},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":18.5,"size":28.94},{"price":18,"size":5},{"price":17.5,"size":3}],"availableToLay":[{"price":21,"size":20},{"price":23,"size":2},{"price":24,"size":2}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832768},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":4.3,"size":9},{"price":4.2,"size":257.98},{"price":4.1,"size":51.1}],"availableToLay":[{"price":4.4,"size":20.97},{"price":4.5,"size":30},{"price":4.6,"size":16}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832771},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":24,"size":6.75},{"price":23,"size":2},{"price":22,"size":2}],"availableToLay":[{"price":26,"size":2},{"price":27,"size":2},{"price":28,"size":2}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832770},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":5.7,"size":149.33},{"price":5.5,"size":29.41},{"price":5.4,"size":5}],"availableToLay":[{"price":6,"size":85},{"price":6.6,"size":5},{"price":6.8,"size":5}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":10064909}],"publishTime":1551612312125,"priceLadderDefinition":{"type":"CLASSIC"},"keyLineDescription":null,"marketDefinition":{"bspMarket":false,"turnInPlayEnabled":false,"persistenceEnabled":false,"marketBaseRate":5,"eventId":"28180290","eventTypeId":"2378961","numberOfWinners":1,"bettingType":"ODDS","marketType":"NONSPORT","marketTime":"2019-03-29T00:00:00.000Z","suspendTime":"2019-03-29T00:00:00.000Z","bspReconciled":false,"complete":true,"inPlay":false,"crossMatching":false,"runnersVoidable":false,"numberOfActiveRunners":8,"betDelay":0,"status":"OPEN","runners":[{"status":"ACTIVE","sortPriority":1,"id":10064909},{"status":"ACTIVE","sortPriority":2,"id":12832765},{"status":"ACTIVE","sortPriority":3,"id":12832766},{"status":"ACTIVE","sortPriority":4,"id":12832767},{"status":"ACTIVE","sortPriority":5,"id":12832768},{"status":"ACTIVE","sortPriority":6,"id":12832770},{"status":"ACTIVE","sortPriority":7,"id":12832769},{"status":"ACTIVE","sortPriority":8,"id":12832771},{"status":"LOSER","sortPriority":9,"id":10317013},{"status":"LOSER","sortPriority":10,"id":10317010}],"regulators":["MR_INT"],"countryCode":"GB","discountAllowed":true,"timezone":"Europe\/London","openDate":"2019-03-29T00:00:00.000Z","version":2576584033,"priceLadderDefinition":{"type":"CLASSIC"}}}
i think i understand what you are trying to do now
first hold your data as a python object (you gave us a json object)
import json
my_data = json.loads(my_json_string)
for item in my_data['runners']:
item['selectionId'] = [item['selectionId'], my_name_here]
the thing is that my_data['runners'][i]['selectionId'] is a string, unless you want to concat the name and the id together, you should turn it into a list or even a dictionary
each item is a dicitonary so you can always also a new keys to it
item['new_key'] = my_value
So, essentially this works...with one exception...I can see from the print(...) in the loop that the attribute is updated, however what I can't seem to do is then see this update outside the loop.
mkt_runners = []
for market_catalogue in market_catalogues:
for r in market_catalogue.runners:
mkt_runners.append((r.selection_id, r.runner_name))
for market_book in market_books:
for runner in market_book.runners:
for x in mkt_runners:
if runner.selection_id in x:
setattr(runner, 'x', x[1])
print(market_book.market_id, runner.x, runner.selection_id)
print(market_book.json())
So the print(market_book.market_id.... displays as expected, but when I print the whole list it shows the un-updated version. I can't seem to find an obvious solution, which is odd, as it seems like a really simple thing (I tried messing around with indents, in case that was the problem, but it doesn't seem to be, its like its not refreshing the market_book list post update of the runners sub list)!

Parsing multiple occurrences of an item into a dictionary

Attempting to parse several separate image links from JSON data through python, but having some issues drilling down to the right level, due to what I believe is from having a list of strings.
For the majority of the items, I've had success with the below example, pulling back everything I need. Outside of this instance, everything is a 1:1 ratio of keys:values, but for this one, there are multiple values associated with one key.
resultsdict['item_name'] = item['attribute_key']
I've been adding it all to a resultsdict={}, but am only able to get to the below sample string when I print.
INPUT:
for item in data['Item']:
resultsdict['images'] = item['Variations']['Pictures']
OUTPUT (only relevant section):
'images': [{u'VariationSpecificPictureSet': [{u'PictureURL': [u'http//imagelink1'], u'VariationSpecificValue': u'color1'}, {u'PictureURL': [u'http//imagelink2'], u'VariationSpecificValue': u'color2'}, {u'PictureURL': [u'http//imagelink3'], u'VariationSpecificValue': u'color3'}, {u'PictureURL': [u'http//imagelink4'], u'VariationSpecificValue': u'color4'}]
I feel like I could add ['VariationPictureSet']['PictureURL'] at the end of my initial input, but that throws an error due to the indices not being integers, but strings.
Ideally, I would like to see the output as a simple comma-separated list of just the URLs, as follows:
OUTPUT:
'images': http//imagelink1, http//imagelink2, http//imagelink3, http//imagelink4
An answer to your comment that required a bit of code to it.
When using
for item in data['Item']:
resultsdict['images'] = item['Variations']['Pictures']
you get a list with one element, so I recommend using this
for item in data['Item']:
resultsdict['images'] = item['Variations']['Pictures'][0]
now you can use
for image in resultsdict['images']['VariationsSpecificPictureSet']:
print(image['PictureUR‌​L'])
Thanks for the help, #uzzee, it's appreciated. I kept tinkering with it and was able to pull the continuous string of all the image URLs with the following code.
resultsdict['images'] = sum([x['PictureURL'] for x in item['variations']['Pictures'][0]['VariationSpecificPictureSet']],[])
Without the sum it looks like this and pulls in the whole list of lists...
resultsdict['images'] = [x['PictureURL'] for x in item['variations']['Pictures'][0]['VariationSpecificPictureSet']]

Parsing data with pattern

I'm parsing some data with the pattern as follows:
tagA:
titleA
dataA1
dataA2
dataA3
...
tagB:
titleB
dataB1
dataB2
dataB3
...
tagC:
titleC
dataC1
dataC2
...
...
These tags are stored in a list list_of_tags, if I iterate through the list, I can get all the tags; also, if iterating through the tags, I can get the title and the data associated with the title.
The tags in my data are pretty much something like <div>, so they are not useful to me; what I'm trying to do is to construct a dictionary which uses titles as keys and datas as a list of values.
The constructed dictionary would look like:
{
titleA: [dataA1, dataA2, dataA3...],
titleB: [dataB1, dataB2, dataB3...],
...
}
Notice every tag only contains one title and some datas, and title always comes before data.
So here are my working codes:
Method 1:
result = {}
for tag in list_of_tags:
list_of_values = []
for idx, elem in enumerate(tag):
if not idx:
key = elem
else:
construct_list_of_values()
update_the_dictionary()
Actually, method 1 works fine and gives me my desired result; however, if I put that piece of codes in PyCharm, it warns me that "Local variable 'key' might be referenced before assignment" at the last line. Hence, I try another approach:
Method 2:
result = {tag[0]: tag[1:] for tag in list_of_tags}
Method 2 works fine if tags are lists, but I also want the code to work normally if tags are generators ('generator' object is not subscriptable will occur with method 2)
In order to work with generators, I come up with:
Method 3:
key_val_list = [(next(tag), list(tag)) for tag in list_of_tags]
result = dict(key_val_list)
Method 3 also works; but I cannot write this in dictionary comprehension ({next(tag): list(tag) for tag in list_of_tags} would give StopIteration exception because list(tag) will be evaluated first)
So, my question is, is there an elegant way for dealing with this pattern which could work no matter tags are lists or generators? (method 1 seems to work for both, but I don't know if I should ignore the warning PyCharms gives; the other two methods looks more concise, but one can only work on lists while the other can only work on generators)
Sorry for the long question, thanks for the patience!
I guess the reason why PyCharm is giving you a warning is that you are using key in update_the_dictionary, but key could be left unassigned if tag does not contain at least one element. You might have the knowledge that the title will always be in the list, but the static analyzer is not able to infer that from the context.
If you are using Python 3, you might want to try using PEP 3132 - Extended Iterable Unpacking. It should work for both lists and generators.
e.g.
title, *data = tag

Reach a string behind unknown value in JSON

I use Wikipedia's API to get information about a page.
The API gives me JSON like this:
"query":{
"pages":{
"188791":{
"pageid":188791,
"ns":0,
"title":"Vanit\u00e9",
"langlinks":[
{
"lang":"bg",
"*":"Vanitas"
},
{
"lang":"ca",
"*":"Vanitas"
},
ETC.
}
}
}
}
You can see the full JSON response.
I want to obtain all entries like:
{
"lang":"ca",
"*":"Vanitas"
}
but the number key ("188791") in the pages object is the problem.
I found Find a value within nested json dictionary in python that explains me how to do enumerate the values.
Unfortunately I get the following exception:
TypeError: 'dict_values' object does not support indexing
My code is:
json["query"]["pages"].values()[0]["langlinks"]
It's probably a dumb question but I can't find a way to pass in the page id value.
One solution is to use the indexpageids parameter, e.g.: http://fr.wikipedia.org/w/api.php?action=query&titles=Vanit%C3%A9&prop=langlinks&lllimit=500&format=jsonfm&indexpageids. It will add an array of pageids to the response. You can then use that to access the dictionary.
As long as you're only querying one page at a time, Simeon Visser's answer will work. However, as a matter of good style, I'd recommend structuring your code so that you iterate over all the returned results, even if you know there should be only one:
for page in data["query"]["pages"].values():
title = page["title"]
langlinks = page["langlinks"]
# do something with langlinks...
In particular, by writing your code this way, if you ever find yourself needing to run the query for multiple pages, you can do it efficiently with a single MediaWiki API request.
You're using Python 3 and values() now returns a dict_values instead of a list. This is a view on the values of the dictionary.
Hence that's why you're getting that error because indexing fails. Indexing is possible in a list but not a view.
To fix it:
list(json["query"]["pages"].values())[0]["langlinks"]
If you really want just one page arbitrarily, do that the way Simeon Visser suggested.
But I suspect you want all langlinks in all pages, yes?
For that, you want a comprehension:
[page["langlinks"] for page in json["query"]["pages"].values()]
But of course that gives you a 2D list. If you want to iterate over each page's links, that's perfect. If you want to iterate over all of the langlinks at once, you want to flatten the list:
[langlink for page in json["query"]["pages"]
for langlink in page["langlinks"].values()]
… or…
itertools.chain.from_iterable(page["langlinks"]
for page in json["query"]["pages"].values())
(The latter gives you an iterator; if you need a list, wrap the whole thing in list. Conversely, for the first two, if you don't need a list, just any iterable, use parens instead of square brackets to get a generator expression.)

Categories