Parse Wikipedia Wikitext Template Named Parameters to Extract Data from Taxobox - python

Using Python, I am attempting to extract data from the various "fields" of a Wikipedia Taxobox (the infobox usually displayed on each animal or plant species page; see, for example: https://en.wikipedia.org/wiki/Okapi).
The solution provided here (How to use Wikipedia API to get section of sidebar?) is interesting but not useful in my case, since I am interested in data from a lower taxonomic category (species).
What I want is a way (as pythonic as possible) to access every field in a Taxobox and then get the data (as a dictionary, perhaps) of interest.
Thanks in advance for any assistance.
EDIT: Here (https://github.com/siznax/wptools) is another good solution which should be what I need, but unfortunately it is a set of command-line tools (which, in addition, depends on other command-line tools available only on Linux) and not a Python library.
EDIT 2: wptools is now a Python (2 and 3) library.

@maurobio, @jimhark: wptools is a Python (2+3) library now. It will give you any infobox with "box" in the name as a Python dict, but you probably want to use Wikidata (e.g. okapi: https://www.wikidata.org/wiki/Q82037) because infoboxen are messy (to say the least). If you focus on Wikidata, then everyone benefits, and wptools can get Wikidata for you too. We've recently updated wptools so that it gets ALL Wikidata by default.
You can get the infobox data in the example below in some languages, but as @biojl points out, wikitext has a different structure in different languages!
>>> page = wptools.page('Okapi')
>>> page.get_parse()
en.wikipedia.org (parse) Okapi
en.wikipedia.org (imageinfo) File:Okapi2.jpg
Okapi (en) data
{
image: <list(1)> {'kind': 'parse-image', u'descriptionshorturl':...
infobox: <dict(9)> status, status_ref, name, image, taxon, autho...
iwlinks: <list(4)> https://commons.wikimedia.org/wiki/Okapia_joh...
pageid: 22709
parsetree: <str(39115)> <root><template><title>about</title><par...
requests: <list(2)> parse, imageinfo
title: Okapi
wikibase: Q82037
wikidata_url: https://www.wikidata.org/wiki/Q82037
wikitext: <str(29930)> {{about|the animal}}{{good article}}{{use...
}
>>> page.data['infobox']
{'authority': '([[P.L. Sclater]], 1901)',
'image': 'Okapi2.jpg',
'image_caption': "An okapi at [[Disney's Animal Kingdom]] in [[Florida]].",
'name': 'Okapi',
'parent_authority': '[[Ray Lankester|Lankester]], 1901',
'status': 'EN',
'status_ref': '<ext><name>ref</name><attr> name=iucn</attr><inner>{{IUCN2008|assessor=IUCN SSC Antelope Specialist Group|year=2008|id=15188|title=Okapia johnstoni|downloaded=26 November 2013}} Database entry includes a brief justification of why this species is endangered.</inner><close></ref></close></ext>',
'status_system': 'IUCN3.1',
'taxon': 'Okapia johnstoni'}
However, because it is structured, you can get Wikidata in many languages, e.g.
>>> page = wptools.page('Okapi', lang='fr')
>>> page.get_wikidata()
www.wikidata.org (wikidata) Okapi
www.wikidata.org (labels) P646|P349|P373|P685|P627|Q16521|Q7432|Q...
fr.wikipedia.org (imageinfo) File:Okapia johnstoni -Marwell Wildl...
Okapi (fr) data
{
aliases: <list(2)> Mondonga, Okapia johnstoni
claims: <dict(26)> P646, P181, P935, P815, P373, P1417, P685, P1...
description: espèce de mammifères
image: <list(2)> {'kind': 'wikidata-image', u'descriptionshortur...
label: Okapi
labels: <dict(31)> P646, P373, P685, P627, Q16521, Q7432, Q20415...
modified: <dict(1)> wikidata
pageid: 84481
requests: <list(3)> wikidata, labels, imageinfo
title: Okapi
what: taxon
wikibase: Q82037
wikidata: <dict(26)> identifiant BioLib (P838), taxon supérieur ...
wikidata_url: https://www.wikidata.org/wiki/Q82037
}
>>> page.data['wikidata']
{u'carte de r\xe9partition (P181)': u'Okapi distribution.PNG',
u'cat\xe9gorie Commons (P373)': u'Okapia johnstoni',
u'dur\xe9e de gestation (P3063)': {u'amount': u'+14.5',
u'lowerBound': u'+14.0',
u'unit': u'http://www.wikidata.org/entity/Q5151',
u'upperBound': u'+15.0'},
u'd\xe9crit par (P1343)': u'encyclop\xe9die Otto (Q2041543)',
u'galerie Commons (P935)': u'Okapia johnstoni',
u'identifiant ARKive (P2833)': u'okapi/okapia-johnstoni',
u'identifiant Animal Diversity Web (P4024)': u'Okapia_johnstoni',
u'identifiant Biblioth\xe8que nationale de la Di\xe8te (P349)': u'01092792',
u'identifiant BioLib (P838)': u'33523',
u'identifiant Encyclopedia of Life (P830)': u'308387',
u'identifiant Encyclop\xe6dia Britannica en ligne (P1417)': u'animal/okapi',
u'identifiant Fossilworks (P842)': u'149380',
u'identifiant Freebase (P646)': u'/m/05pf4',
u'identifiant GBIF (P846)': u'2441207',
u'identifiant ITIS (P815)': u'625037',
u'identifiant Mammal Species of the World (P959)': u'14200484',
u'identifiant NCBI (P685)': u'86973',
u'identifiant UICN (P627)': u'15188',
u'identifiant de la Grande Encyclop\xe9die russe en ligne (P2924)': u'2290412',
u'image (P18)': [u'Okapia johnstoni -Marwell Wildlife, Hampshire, England-8a.jpg',
u'Okapia johnstoni1.jpg'],
u"nature de l'\xe9l\xe9ment (P31)": u'taxon (Q16521)',
u'nom scientifique du taxon (P225)': u'Okapia johnstoni',
u'nom vernaculaire (P1843)': [u'Okapi', u'Okapi'],
u'rang taxinomique (P105)': u'esp\xe8ce (Q7432)',
u'statut de conservation UICN (P141)': u'esp\xe8ce en danger (Q11394)',
u'taxon sup\xe9rieur (P171)': u'Okapia (Q1872039)'}
Don't forget that you can edit Wikidata in your own language. There are tools available to enable editing a large number of Wikidata pages.
EDIT: we've added a more general parser that should work (to some extent) with any infobox syntax, e.g.
>>> page = wptools.page('Okapi', lang='fr')
>>> page.get_parse()
fr.wikipedia.org (parse) Okapi
Okapi (fr) data
{
infobox: <dict(2)> count, boxes
...
}
>>> page.data['infobox']['count']
13
>>> page.data['infobox']['boxes']
[{u'Taxobox d\xe9but': [[{'index': '1'}, 'animal'],
[{'index': '2'}, "''Okapia johnstoni''"],
[{'index': '3'}, 'Okapi2.jpg'],
[{'index': '4'}, 'Okapi']]},
{'Taxobox': [[{'index': '1'}, 'embranchement'],
[{'index': '2'}, 'Chordata']]},
{'Taxobox': [[{'index': '1'}, 'classe'], [{'index': '2'}, 'Mammalia']]},
{'Taxobox': [[{'index': '1'}, 'sous-classe'], [{'index': '2'}, 'Theria']]},
{'Taxobox': [[{'index': '1'}, 'ordre'], [{'index': '2'}, 'Artiodactyla']]},
{'Taxobox': [[{'index': '1'}, 'famille'], [{'index': '2'}, 'Giraffidae']]},
{'Taxobox taxon': [[{'index': '1'}, 'animal'],
[{'index': '2'}, 'genre'],
[{'index': '3'}, 'Okapia'],
[{'index': '4'}, '[[Edwin Ray Lankester|Lankester]], [[1901]]']]},
{'Taxobox taxon': [[{'index': '1'}, 'animal'],
[{'index': '2'}, u'esp\xe8ce'],
[{'index': '3'}, 'Okapia johnstoni'],
[{'index': '4'}, '([[Philip Lutley Sclater|Sclater]], [[1901]])']]},
{'Taxobox synonymes': [[{'index': '1'},
"* ''Equus johnstoni'' <small>P.L. Sclater, 1901</small>"]]},
{'Taxobox UICN': [[{'index': '1'}, 'EN'], [{'index': '2'}, 'A2abcd+4abcd']]},
{u'Taxobox r\xe9partition': [[{'index': '1'}, 'Okapi map.jpg']]},
{u'Taxobox r\xe9partition': [[{'index': '1'}, 'Okapi distribution.PNG']]},
{'Taxobox fin': []}]
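If you want that in the shape asked for in the question, the per-rank 'Taxobox' entries shown above can be collapsed into a plain {rank: taxon} dict with a short loop. This is a minimal sketch written against the structure printed above, not a wptools feature:
classification = {}
for box in page.data['infobox']['boxes']:
    for name, params in box.items():
        values = [value for _, value in params]  # drop the {'index': ...} markers
        if name == 'Taxobox' and len(values) == 2:
            rank, taxon = values
            classification[rank] = taxon

print(classification)
# {'embranchement': 'Chordata', 'classe': 'Mammalia', 'sous-classe': 'Theria', ...}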
Hope that helps.

(@siznax has posted a better answer. I'm only leaving my answer here as an example of using the wiki APIs and parsing the results. This would only be of practical use if a library like wptools couldn't meet your needs for some reason.)
This is a significant rewrite that includes a (more) proper parser to match the template's closing double braces '}}'. It also makes it easier to request different template names, and includes a main() to allow testing from the shell / command line.
import sys
import re
import requests
import json

wikiApiRoot = 'https://en.wikipedia.org/w/api.php'

# returns the position past the requested token or end of string if not found
def FindToken(text, token, start=0):
    pos = text.find(token, start)
    if -1 == pos:
        nextTokenPos = len(text)
    else:
        nextTokenPos = pos
    return nextTokenPos + len(token)

# Get the contents of the template as text
def GetTemplateText(wikitext, templateName):
    templateTag = '{{' + templateName
    startPos = FindToken(wikitext, templateTag)
    if len(wikitext) <= startPos:
        # Template not found
        return None
    openCount = 1
    curPos = startPos
    nextOpenPos = FindToken(wikitext, '{{', curPos)
    nextClosePos = FindToken(wikitext, '}}', curPos)
    # scan for template's matching close braces
    while 0 < openCount:
        if nextOpenPos < nextClosePos:
            openCount += 1
            curPos = nextOpenPos
            nextOpenPos = FindToken(wikitext, '{{', curPos)
        else:
            openCount -= 1
            curPos = nextClosePos
            nextClosePos = FindToken(wikitext, '}}', curPos)
    templateText = wikitext[startPos:curPos - 2]
    return templateText

def GetTemplateDict(title, templateName='Taxobox'):
    templateDict = None
    # Get data from Wikipedia:
    resp = requests.get(wikiApiRoot + '?action=query&prop=revisions&' +
                        'rvprop=content&rvsection=0&format=json&redirects&titles=' +
                        title)
    # Get the response text into a JSON object:
    rjson = json.loads(resp.text)
    # Pull out the text for the revision (list() makes this work on Python 2 and 3):
    wikitext = list(rjson['query']['pages'].values())[0]['revisions'][0]['*']
    # Parse the text for the template
    templateText = GetTemplateText(wikitext, templateName)
    if templateText:
        # Parse templateText to get named properties
        templateItemIter = re.finditer(
            r'\|\s*(\w*)\s*=\s*([^\n]*)\n',
            templateText,
            re.M)
        templateList = [item.groups() for item in templateItemIter]
        templateDict = dict(templateList)
    return templateDict

def main():
    import argparse
    import pprint
    parser = argparse.ArgumentParser()
    parser.add_argument('title', nargs='?', default='Okapia_johnstoni', help='title of the desired article')
    parser.add_argument('template', nargs='?', default='Taxobox', help='name of the desired template')
    args = parser.parse_args()
    templateDict = GetTemplateDict(args.title, args.template)
    pprint.pprint(templateDict)

if __name__ == "__main__":
    main()
GetTemplateDict returns a dictionary of the page's taxobox entries. For the Okapi page, this includes:
binomial
binomial_authority
classis
familia
genus
genus_authority
image
image_caption
ordo
phylum
regnum
species
status
status_ref
status_system
trend
I expect the actual items to vary by page.
The dictionary values are Wikipedia's decorated text:
>>> taxoDict['familia']
'[[Giraffidae]]'
So additional parsing or filtering may be desired or required.
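If you only need the plain names, a rough cleanup can be done with a couple of regular expressions. This is a minimal sketch that handles [[link]] / [[link|label]] markup and bold/italic quotes, not every wikitext construct:
import re

def strip_wiki_markup(value):
    # [[target|label]] -> label, [[target]] -> target
    value = re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]', r'\1', value)
    # drop bold/italic quote markup ('' and ''')
    value = re.sub(r"'{2,}", '', value)
    return value.strip()

print(strip_wiki_markup('[[Giraffidae]]'))            # Giraffidae
print(strip_wiki_markup('([[P.L. Sclater]], 1901)'))  # (P.L. Sclater, 1901)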

Related

Unable to extract and read a particular part of a link from a file

So basically I was trying to scrape a Reddit wiki page about Game of Thrones, https://www.reddit.com/r/gameofthrones/wiki/episode_discussion, which contains many other links. What I was trying to do was scrape all those links into a file, which is done. Now I have to individually scrape every link and print out the data in individual files, either CSV or JSON.
I've tried all the approaches I could find on Google but am still unable to come to a solution. Any help would be appreciated.
import praw
import json
import pandas as pd  # Pandas for scraping and saving it as a csv

# This is PRAW.
reddit = praw.Reddit(client_id='',
                     client_secret='',
                     user_agent='android:com.example.myredditapp:v1.2.3 (by /u/AshKay12)',
                     username='******',
                     password='******')

subreddit = reddit.subreddit("gameofthrones")
Comments = []
submission = reddit.submission("links")

with open('got_reddit_links.json') as json_file:
    data = json.load(json_file)
    for p in data:
        print('season: ' + str(p['season']))
        print('episode: ' + str(p['episode']))
        print('title: ' + str(p['title']))
        print('links: ' + str(p['links']))
        print('')

submission.comments.replace_more(limit=None)
for comment in submission.comments.list():
    print(20*'#')
    print('Parent ID:', comment.parent())
    print('Comment ID:', comment.id)
    print(comment.body)
    Comments.append([comment.body, comment.id])

Comments = pd.DataFrame(Comments, columns=['All_Comments', 'Comment ID'])
Comments.to_csv('Reddit3.csv')
This code prints out the links, titles and episode numbers. It also extracts data when a link is entered manually, but there are over 50 links on the website, so I extracted those and put them in a file.
You can find all episode blocks with the links, and then write a function to scrape the comments for each episode discovered by each link:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import requests, itertools, re
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.reddit.com/r/gameofthrones/wiki/episode_discussion')
new_d = soup(d.page_source, 'html.parser').find('div', {'class':'md wiki'}).find_all(re.compile('h2|h4|table'))
g = [(a, list(b)) for a, b in itertools.groupby(new_d, key=lambda x:x.name == 'h2')]
r = {g[i][-1][0].text:{g[i+1][-1][k].text:g[i+1][-1][k+1] for k in range(0, len(g[i+1][-1]), 2)} for i in range(0, len(g), 2)}
final_r = {a:{b:[j['href'] for j in c.find_all('a', {'href':re.compile('redd\.it')})] for b, c in k.items()} for a, k in r.items()}
Now, you have a dictionary with all the links structured according to Season and episode:
{'Season 1 Threads': {'1.01 Winter Is Coming': ['https://redd.it/gsd0t'], '1.02 The Kingsroad': ['https://redd.it/gwlcx'], '1.03 Lord Snow': ['https://redd.it/h1otp/'], '1.04 Cripples, Bastards, & Broken Things': ['https://redd.it/h70vv'].....
To get the comments, you have to use selenium as well, to be able to click on the button that displays the entire comment structure:
import time

d = webdriver.Chrome('/path/to/chromedriver')

def scrape_comments(url):
    d.get(url)
    _b = [i for i in d.find_elements_by_tag_name('button') if 'VIEW ENTIRE DISCUSSION' in i.text][0]
    _b.send_keys('\n')
    time.sleep(1)
    p_obj = soup(d.page_source, 'html.parser').find('div', {'class':'_1YCqQVO-9r-Up6QPB9H6_4 _1YCqQVO-9r-Up6QPB9H6_4'}).contents
    p_obj = [i for i in p_obj if i != '\n']
    c = [{'poster':'[deleted]' if i.a is None else i.a['href'], 'handle':getattr(i.find('div', {'class':'_2X6EB3ZhEeXCh1eIVA64XM _2hSecp_zkPm_s5ddV2htoj _zMIUk6t-WDI7fxfkvD02'}), 'text', 'N/A'), 'points':getattr(i.find('span', {'class':'_2ETuFsVzMBxiHia6HfJCTQ _3_GZIIN1xcMEC5AVuv4kfa'}), 'text', 'N/A'), 'time':getattr(i.find('a', {'class':'_1sA-1jNHouHDpgCp1fCQ_F'}), 'text', 'N/A'), 'comment':getattr(i.p, 'text', 'N/A')} for i in p_obj]
    return c
Sample output when running scrape_comments on one of the urls:
[{'poster': '/user/BWPhoenix/', 'handle': 'N/A', 'points': 'Score hidden', 'time': '2 years ago', 'comment': 'Week one, so a couple of quick questions:'}, {'poster': '/user/No0neAtAll/', 'handle': 'N/A', 'points': '957 points', 'time': '2 years ago', 'comment': "Davos fans showing their love Dude doesn't say a word the entire episode and gives only 3 glances but still get's 548 votes."}, {'poster': '/user/MairmanChao/', 'handle': 'N/A', 'points': '421 points', 'time': '2 years ago', 'comment': 'Davos always gets votes for being the most honorable man in Westeros'}, {'poster': '/user/BourbonSlut/', 'handle': 'N/A', 'points': '47 points', 'time': '2 years ago', 'comment': 'I was hoping for some Tyrion dialogue too..'}.....
Now, putting it all together:
final_result = {a:{b:[scrape_comments(i) for i in c] for b, c in k.items()} for a, k in final_r.items()}
From here, you can now create a pd.DataFrame from final_result or write the results to the file.
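For example, here is a minimal sketch (assuming the nested season -> episode -> list-of-comment-lists structure built above; the output file name is arbitrary) that flattens everything into one DataFrame and writes a CSV:
import pandas as pd

rows = []
for season, episodes in final_result.items():
    for episode, threads in episodes.items():
        for thread in threads:           # one list of comment dicts per scraped link
            for comment in thread:
                row = dict(comment)      # poster, handle, points, time, comment
                row['season'] = season
                row['episode'] = episode
                rows.append(row)

pd.DataFrame(rows).to_csv('got_episode_comments.csv', index=False)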

How to tag sentences for spaCy's Sense2vec implementation

SpaCy has implemented a sense2vec word embeddings package which they document here
The vectors are all of the form WORD|POS. For example, the sentence
Dear local newspaper, I think effects computers have on people are great learning skills/affects because they give us time to chat with friends/new people, helps us learn about the globe(astronomy) and keeps us out of trouble
Needs to be converted into
Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT I|PRON think|VERB effects|NOUN computers|NOUN have|VERB on|ADP people|NOUN are|VERB great|ADJ learning|NOUN skills/affects|NOUN because|ADP they|PRON give|VERB us|PRON time|NOUN to|PART chat|VERB with|ADP friends/new|ADJ people|NOUN ,|PUNCT helps|VERB us|PRON learn|VERB about|ADP the|DET globe(astronomy|NOUN )|PUNCT and|CONJ keeps|VERB us|PRON out|ADP of|ADP trouble|NOUN !|PUNCT
in order to be interpretable by the sense2vec pretrained embeddings, i.e. to be in the sense2vec format.
How can this be done?
Based on spaCy's bin/merge.py implementation, which does exactly what is needed:
from spacy.en import English
import re

LABELS = {
    'ENT': 'ENT',
    'PERSON': 'ENT',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}

nlp = False

def tag_words_in_sense2vec_format(passage):
    global nlp
    if nlp == False:
        nlp = English()  # load the pipeline lazily, only once
    if isinstance(passage, str):
        passage = passage.decode('utf-8', errors='ignore')
    doc = nlp(passage)
    return transform_doc(doc)

def transform_doc(doc):
    # merge each named entity into a single token tagged with its entity label
    for index, ent in enumerate(doc.ents):
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
        #if index % 100 == 0: print("enumerating at entity index " + str(index))
    #for np in doc.noun_chunks:
    #    while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
    #        np = np[1:]
    #    np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for index, sent in enumerate(doc.sents):
        if sent.text.strip():
            strings.append(' '.join(represent_word(w) for w in sent if not w.is_space))
        #if index % 100 == 0: print("converting at sentence index " + str(index))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''

def represent_word(word):
    if word.like_url:
        return '%%URL|X'
    text = re.sub(r'\s', '_', word.text)
    tag = LABELS.get(word.ent_type_, word.pos_)
    if not tag:
        tag = '?'
    return text + '|' + tag
Where
print(tag_words_in_sense2vec_format("Dear local newspaper, ..."))
results in
Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT ...

Parsing a python list converted from JSON

I am trying to use the Google Custom Search API to search through US news outlets. Using the code example provided by Google, you end up with a Python dictionary containing a multitude of other dictionaries and lists. The keys read from res in the meta function are the values I am trying to access for each article.
import os.path
import csv
from lxml import html
from googleapiclient.discovery import build

def newslist():
    '''
    Uses google custom search to search 20 US news sources for gun control articles,
    and converts info into python dictionary.
    in - none
    out - res: JSON formatted search results
    '''
    service = build("customsearch", "v1",
                    developerKey="key")
    res = service.cse().list(
        q='query',
        cx='searchid',
    ).execute()
    return res

def meta(res, doc_count):
    '''
    Finds necessary meta-data of all articles. Avoids collections, such as those found on Huffington Post and New York Times.
    in - res: defined above
    out - meta_csv: csv file with article meta-data
    '''
    row1 = ['doc_id', 'url', 'title', 'publisher', 'date']
    if res['context']['items']['pagemap']['metatags']['applicationname'] is not 'collection':
        for art in res['context']['items']:
            url = res['context']['items']['link']
            title = res['context']['items']['pagemap']['article']['newsarticle']['headline']
            publisher = res['context']['items']['displayLink'].split('www.' and '.com')
            date = res['context']['items']['pagemap']['newsarticle']['datepublished']
            row2 = [doc_count, url, title, publisher, date]
            with open('meta.csv', 'w', encoding = 'utf-8') as meta:
                csv_file = csv.writer(meta, delimiter = ',', quotechar = '|',
                                      quoting = csv.QUOTE_MINIMAL)
                if doc_count == 1:
                    csv_file.writerow(row1)
                csv_file.writerow(row2)
            doc_count += 1
Here's an example of the printed output from a search query:
{'context': {'title': 'Gun Control articles'},
'items': [{'displayLink': 'www.washingtonpost.com',
'formattedUrl': 'https://www.washingtonpost.com/.../white-resentment-is-fueling-opposition- '
'to-gun-control-researchers-say/',
'htmlFormattedUrl': 'https://www.washingtonpost.com/.../white-resentment-is-fueling-opposition- '
'to-<b>gun</b>-<b>control</b>-researchers-say/',
'htmlSnippet': 'Apr 4, 2016 <b>...</b> Racial prejudice could play '
'a significant role in white Americans' '
'opposition to <br>\n'
'<b>gun control</b>, according to new research from '
'political scientists at ...',
'htmlTitle': 'White resentment is fueling opposition to <b>gun '
'control</b>, researchers say',
'kind': 'customsearch#result',
'link': 'https://www.washingtonpost.com/news/wonk/wp/2016/04/04/white-resentment-is-fueling-opposition-to-gun-control-researchers-say/',
'pagemap': {'cse_image': [{'src': 'https://img.washingtonpost.com/rf/image_1484w/2010-2019/WashingtonPost/2015/10/03/Others/Images/2015-10-03/Botsford_gunshow1004_15_10_03_41831443897980.jpg'}],
'cse_thumbnail': [{'height': '183',
'src': 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSXtMnfm_GHkX3d2dOWgmto3rFjmhzxV8InoPao1tBuiBrEWsDMz4WDKcPB',
'width': '275'}],
'metatags': [{'apple-itunes-app': 'app-id=938922398, '
'app-argument=https://www.washingtonpost.com/news/wonk/wp/2016/04/04/white-resentment-is-fueling-opposition-to-gun-control-researchers-say/',
'article:author': 'https://www.facebook.com/chrisingraham',
'article:publisher': 'https://www.facebook.com/washingtonpost',
'author': 'https://www.facebook.com/chrisingraham',
'fb:admins': '1513210492',
'fb:app_id': '41245586762',
'news_keywords': 'guns, gun control, '
'racial resentment, '
'white people',
'og:description': 'Some white gun owners '
'"understand '
"'freedom' in a very "
'particular way."',
'og:image': 'https://img.washingtonpost.com/rf/image_1484w/2010-2019/WashingtonPost/2015/10/03/Others/Images/2015-10-03/Botsford_gunshow1004_15_10_03_41831443897980.jpg',
'og:site_name': 'Washington Post',
'og:title': 'White resentment is fueling '
'opposition to gun control, '
'researchers say',
'og:type': 'article',
'og:url': 'https://www.washingtonpost.com/news/wonk/wp/2016/04/04/white-resentment-is-fueling-opposition-to-gun-control-researchers-say/',
'referrer': 'unsafe-url',
'twitter:card': 'summary_large_image',
'twitter:creator': '#_cingraham',
'viewport': 'width=device-width, '
'initial-scale=1.0, '
'user-scalable=yes, '
'minimum-scale=0.5, '
'maximum-scale=2.0'}],
'newsarticle': [{'articlebody': 'People look at '
'handguns during the '
"Nation's Gun Show in "
'Chantilly, Va. in '
'October 2015. (Photo '
'by Jabin Botsford/The '
'Washington Post) '
'Racial prejudice '
'could play a '
'significant role in '
'white...',
'datepublished': '2016-04-04T11:46-500',
'description': 'Some white gun owners '
'"understand '
"'freedom' in a very "
'particular way."',
'headline': 'White resentment is '
'fueling opposition to '
'gun control, researchers '
'say',
'image': 'https://img.washingtonpost.com/rf/image_1484w/2010-2019/WashingtonPost/2015/10/03/Others/Images/2015-10-03/Botsford_gunshow1004_15_10_03_41831443897980.jpg',
'mainentityofpage': 'True',
'url': 'https://www.washingtonpost.com/news/wonk/wp/2016/04/04/white-resentment-is-fueling-opposition-to-gun-control-researchers-say/'}],
'person': [{'name': 'Christopher Ingraham'}]},
'snippet': 'Apr 4, 2016 ... Racial prejudice could play a '
"significant role in white Americans' opposition to \n"
'gun control, according to new research from political '
'scientists at\xa0...',
'title': 'White resentment is fueling opposition to gun control, '
'researchers say'},
I understand that I could basically write a for loop, but I'm wondering if there is an easier, less code intensive way of accessing this data for each desired value: URL, title, publisher, and date.
Why don't you use the json module?
import json
s = ... # Your JSON text
result = json.loads(s)
result will be a dict or list depending on your JSON.
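(Note that in the question's newslist(), the res returned by execute() should already be a parsed dict, so json.loads() is only needed if you start from raw JSON text.) Either way, the values asked about can then be pulled out with a short loop. A sketch, assuming every item carries the pagemap/newsarticle structure shown in the question's example output (real results may omit some of these keys):
rows = []
for item in result['items']:
    article = item.get('pagemap', {}).get('newsarticle', [{}])[0]
    rows.append({
        'url': item.get('link'),
        'title': item.get('title'),
        'publisher': item.get('displayLink'),
        'date': article.get('datepublished'),
    })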

Python search loop slow

I am running a search on a list of ads (adscrape). Each ad is a dict within adscrape (e.g. ad below). It searches through a list of IDs (database_ids), which could be between 200,000 and 1,000,000 items long. I want to find any ads in adscrape that don't have an ID already in database_ids.
My current code is below. It takes a long time, multiple seconds per ad, to scan through database_ids. Is there a more efficient/faster way of doing this (finding which items in one big list are in another big list)?
database_ids = ['id1','id2','id3'...]
ad = {'body': u'\xa0SUV', 'loc': u'SA', 'last scan': '06/02/16', 'eng': u'\xa06cyl 2.7L ', 'make': u'Hyundai', 'year': u'2006', 'id': u'OAG-AD-12371713', 'first scan': '06/02/16', 'odo': u'168911', 'active': 'Y', 'adtype': u'Dealer: Used Car', 'model': u'Tucson Auto 4x4 ', 'trans': u'\xa0Automatic', 'price': u'9990'}

for ad in adscrape:
    ad['last scan'] = date
    ad['active'] = 'Y'
    adscrape_ids.append(ad['id'])
    if ad['id'] not in database_ids:
        ad['first scan'] = date
        print 'new ad:', ad
        newads.append(ad)
You can use a list comprehension for this, as in the code below, using the existing database_ids list and adscrape dicts given above.
Code:
new_ads = [ad for ad in adscrape if ad['id'] not in database_ids]
You can build ids_map as a dict and check whether an id is in the list by looking its key up in ids_map, as in the code snippet below:
database_ids = ['id1','id2','id3']
ad = {'id': u'OAG-AD-12371713', 'body': u'\xa0SUV', 'loc': u'SA', 'last scan': '06/02/16', 'eng': u'\xa06cyl 2.7L ', 'make': u'Hyundai', 'year': u'2006', 'first scan': '06/02/16', 'odo': u'168911', 'active': 'Y', 'adtype': u'Dealer: Used Car', 'model': u'Tucson Auto 4x4 ', 'trans': u'\xa0Automatic', 'price': u'9990'}

#build ids map
ids_map = dict((k, v) for v, k in enumerate(database_ids))

for ad in adscrape:
    # some logic before checking whether id in database_ids
    try:
        ids_map[ad['id']]
    except KeyError:
        pass
    else:
        #error not thrown, perform logic for existing ids
        print 'id %s in list' % ad['id']
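The same constant-time membership test can also be done with a set built once before the loop, e.g. a minimal sketch reusing the names from the question:
database_id_set = set(database_ids)
new_ads = [ad for ad in adscrape if ad['id'] not in database_id_set]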

Failing to append to dictionary. Python

I am experiencing strange, faulty behaviour where a dictionary is only appended to once and I cannot add more key:value pairs to it.
My code reads in a multi-line string and extracts substrings via split(), to be added to a dictionary, using conditional statements. Strangely, only the key:value pairs under the first conditional statement are added.
Therefore I cannot complete the dictionary.
How can I solve this issue?
Minimal code:
# I hope the '\n' is sufficient, or use '\r\n'
example = "Name: Bugs Bunny\nDOB: 01/04/1900\nAddress: 111 Jokes Drive, Hollywood Hills, CA 11111, United States"

def format(data):
    dic = {}
    for line in data.splitlines():
        #print('Line:', line)
        if ':' in line:
            info = line.split(': ', 1)[1].rstrip()  # does not work with files
            #print('Info: ', info)
            if ' Name:' in info:  # middle name problems! /maiden name
                dic['F_NAME'] = info.split(' ', 1)[0].rstrip()
                dic['L_NAME'] = info.split(' ', 1)[1].rstrip()
            elif 'DOB' in info:  # overhang
                dic['DD'] = info.split('/', 2)[0].rstrip()
                dic['MM'] = info.split('/', 2)[1].rstrip()
                dic['YY'] = info.split('/', 2)[2].rstrip()
            elif 'Address' in info:
                dic['STREET'] = info.split(', ', 2)[0].rstrip()
                dic['CITY'] = info.split(', ', 2)[1].rstrip()
                dic['ZIP'] = info.split(', ', 2)[2].rstrip()
    return dic

if __name__ == '__main__':
    x = format(example)
    for v, k in x.iteritems():
        print v, k
Your code doesn't work, at all. You split off the name before the colon and discard it, looking only at the value after the colon, stored in info. That value never contains the names you are looking for; Name, DOB and Address all are part of the line before the :.
Python lets you assign to multiple names at once; make use of this when splitting:
def format(data):
    dic = {}
    for line in data.splitlines():
        if ':' not in line:
            continue
        name, _, value = line.partition(':')
        name = name.strip()
        if name == 'Name':
            dic['F_NAME'], dic['L_NAME'] = value.split(None, 1)  # strips whitespace for us
        elif name == 'DOB':
            dic['DD'], dic['MM'], dic['YY'] = (v.strip() for v in value.split('/', 2))
        elif name == 'Address':
            dic['STREET'], dic['CITY'], dic['ZIP'] = (v.strip() for v in value.split(', ', 2))
    return dic
I used str.partition() here rather than limit str.split() to just one split; it is slightly faster that way.
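For reference, str.partition() always returns a 3-tuple that includes the separator, while str.split(':', 1) drops it:
>>> 'Name: Bugs Bunny'.partition(':')
('Name', ':', ' Bugs Bunny')
>>> 'Name: Bugs Bunny'.split(':', 1)
['Name', ' Bugs Bunny']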
For your sample input this produces:
>>> format(example)
{'CITY': 'Hollywood Hills', 'ZIP': 'CA 11111, United States', 'L_NAME': 'Bunny', 'F_NAME': 'Bugs', 'YY': '1900', 'MM': '04', 'STREET': '111 Jokes Drive', 'DD': '01'}
>>> from pprint import pprint
>>> pprint(format(example))
{'CITY': 'Hollywood Hills',
'DD': '01',
'F_NAME': 'Bugs',
'L_NAME': 'Bunny',
'MM': '04',
'STREET': '111 Jokes Drive',
'YY': '1900',
'ZIP': 'CA 11111, United States'}
