I am running a search on a list of ads (adscrape). Each ad is a dict within adscrape (e.g. ad below). It searches through a list of IDs (database_ids) which could be between 200,000 and 1,000,000 items long. I want to find any ads in adscrape that don't have an ID already in database_ids.
My current code is below. It takes a long time, multiple seconds per ad, to scan through database_ids. Is there a more efficient/faster way of doing this (finding which items in one big list are in another big list)?
database_ids = ['id1','id2','id3'...]
ad = {'body': u'\xa0SUV', 'loc': u'SA', 'last scan': '06/02/16', 'eng': u'\xa06cyl 2.7L ', 'make': u'Hyundai', 'year': u'2006', 'id': u'OAG-AD-12371713', 'first scan': '06/02/16', 'odo': u'168911', 'active': 'Y', 'adtype': u'Dealer: Used Car', 'model': u'Tucson Auto 4x4 ', 'trans': u'\xa0Automatic', 'price': u'9990'}
for ad in adscrape:
    ad['last scan'] = date
    ad['active'] = 'Y'
    adscrape_ids.append(ad['id'])
    if ad['id'] not in database_ids:
        ad['first scan'] = date
        print 'new ad:', ad
        newads.append(ad)
You can use a list comprehension for this, as in the code below. Use the existing database_ids list and adscrape as given above.
Code:
new_ads = [ad for ad in adscrape if ad['id'] not in database_ids]
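The slowness comes from ad['id'] not in database_ids doing a linear scan of a huge list for every ad. Converting database_ids to a set once makes each membership test effectively constant-time; a minimal sketch with hypothetical stand-in data:

```python
# Hypothetical stand-ins for the real adscrape list and database_ids
database_ids = ['id1', 'id2', 'id3']
adscrape = [{'id': 'OAG-AD-12371713'}, {'id': 'id2'}]

# One-time O(n) conversion; each membership check is then O(1) on average
# instead of a linear scan over up to a million ids
database_ids_set = set(database_ids)

new_ads = [ad for ad in adscrape if ad['id'] not in database_ids_set]
print(new_ads)  # only the ad whose id is not already in the database
```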
You can build an ids_map as a dict and check whether an id is present by looking up the key in that ids_map, as in the code snippet below:
database_ids = ['id1','id2','id3']
ad = {'id': u'OAG-AD-12371713', 'body': u'\xa0SUV', 'loc': u'SA', 'last scan': '06/02/16', 'eng': u'\xa06cyl 2.7L ', 'make': u'Hyundai', 'year': u'2006', 'first scan': '06/02/16', 'odo': u'168911', 'active': 'Y', 'adtype': u'Dealer: Used Car', 'model': u'Tucson Auto 4x4 ', 'trans': u'\xa0Automatic', 'price': u'9990'}
# build ids map
ids_map = dict((k, v) for v, k in enumerate(database_ids))
for ad in adscrape:
    # some logic before checking whether id is in database_ids
    try:
        ids_map[ad['id']]
    except KeyError:
        pass
    else:
        # no error thrown: perform logic for existing ids
        print 'id %s in list' % ad['id']
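A plain built-in set gives the same constant-time lookup without building an index dict or using try/except; a small sketch of the equivalent check:

```python
database_ids = ['id1', 'id2', 'id3']
ad = {'id': 'OAG-AD-12371713'}

ids_set = set(database_ids)  # hashes every id once up front

# membership test is O(1) on average, same as the ids_map lookup
if ad['id'] in ids_set:
    print('id %s in list' % ad['id'])
else:
    print('id %s is new' % ad['id'])
```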
So I have a data set with user, date, and post columns. I'm trying to generate a column of the calories that the foods in the post column contain, for each user. This dataset has a length of 21, and the code below finds the food words, gets their calorie values, appends them to that user's calorie list, and appends that list to the new column. The new generated column, however, somehow has a length of 25:
Current data: 21
New column: 25
Does anybody know why this occurs? Here is the code below and samples of what the original dataset and the new column look like:
while len(col) < len(data['post']):
    for post, api_id, api_key in zip(data['post'], ids_keys.keys(), ids_keys.values()): # cycles through text data & api keys
        headers = {
            'Content-Type': 'application/x-www-form-urlencoded',
            'x-app-id': api_id,
            'x-app-key': api_key,
            'x-remote-user-id': '0'
        }
        calories = []
        print('Current data:', len(data['post']), '\n New column: ', len(col)) # prints length of post vs new cal column
        for word in eval(post):
            if word not in food:
                continue
            else:
                print('Detected Word: ', word)
                query = {'query': '{}'.format(word)}
                try:
                    response = requests.request("POST", url, headers=headers, data=query)
                except KeyError as ke:
                    print(ke, 'Out of calls, next key...')
                    ids_keys.pop(api_id) # drop current api id & key from dict if out of calls
                    print('API keys left:', len(ids_keys))
                finally:
                    stats = response.json()
                    print('Food Stats: \n', stats)
                    print('Calories in food: ', stats['foods'][0]['nf_calories'])
                    calories.append(stats['foods'][0]['nf_calories'])
                    print('Current Key', api_id, ':', api_key)
        col.append(calories)
        if len(col) == len(data['post']):
            break
I attempted to use the while loop to only append up to the length of the dataset, but to no avail.
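One detail that may explain the length mismatch: zip() stops at the shortest of its iterables, so if ids_keys holds fewer keys than there are posts, each pass of the for loop handles only part of the data and the outer while loop restarts from the first post, appending duplicate rows. A small demonstration with hypothetical data, including itertools.cycle as one way to reuse keys across all posts:

```python
from itertools import cycle

posts = ['p1', 'p2', 'p3', 'p4', 'p5']
api_ids = ['key_a', 'key_b']  # fewer API keys than posts

# zip() silently truncates to the shorter input
pairs = list(zip(posts, api_ids))
print(len(pairs))  # 2, not 5

# cycle() repeats the keys so every post is paired exactly once
pairs_all = list(zip(posts, cycle(api_ids)))
print(len(pairs_all))  # 5
```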
Original Data Set:
pd.DataFrame({'user':['avskk', 'janejellyn', 'firlena227','...'],
'date': ['October 22', 'October 22', 'October 22','...'],
'post': [['autumn', 'fully', 'arrived', 'cooking', 'breakfast', 'toaster','...'],
['breakfast', 'chinese', 'sticky', 'rice', 'tempeh', 'sausage', 'cucumber', 'salad', 'lunch', 'going', 'lunch', 'coworkers', 'probably', 'black', 'bean', 'burger'],
['potato', 'bean', 'inspiring', 'food', 'day', 'today', '...']]
})
New Column:
pd.DataFrame({'Calories': [[22,33,45,32,2,5,7,9,76],
[43,78,54,97,32,56,97],
[23,55,32,22,7,99,66,98,54,35,33]]
})
Here is an example of the list:
[{'Item': 'milk', 'Price': '2.0', 'Quantity': '2'}, {'Item': 'egg', 'Price': '12.0', 'Quantity': '1'}]
Here is my code:
def edit_items(info):
    xy = info
    print('Index | Orders')
    for x in enumerate(xy):
        print('\n')
        print(x)
    choice = int(input('Which entry would you like to edit? Choose by index. :'))
    print(x[choice])
I'd like the user to be able to choose an entry by index, and to allow them to edit information inside the dictionary.
So far my code prints out:
Index | Orders
(0, {'Item': 'milk', 'Price': '2.0', 'Quantity': '2'})
(1, {'Item': 'egg', 'Price': '12.0', 'Quantity': '1'})
But I have no idea how to choose one, assign it to a variable, and edit what's inside.
Cheers. Nalpak_
def edit_items(info):
    xy = info
    # loop to make multiple edits
    while True:
        print('Index | Orders')
        for x in range(len(xy)):
            print(x, xy[x])
        choice = int(input('Which entry would you like to edit?\nChoose by index: '))
        print(xy[choice])
        edit = input('What you want to edit: ') # key of the dict
        value = input("Enter: ") # value for the specific key in the dict
        xy[choice][edit] = value
        print('list updated.')
        print(xy[choice])
        more_edits = input('\nDo you want to make more edits?(y/n): ')
        if more_edits == 'n':
            break
edit_items(info)
This will help you make multiple edits.
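As a side note, int(input(...)) raises ValueError on non-numeric text, and an out-of-range index raises IndexError. A small hypothetical helper (choose_entry is not part of the original code) that guards both:

```python
def choose_entry(entries, raw_choice):
    """Return (index, entry) for the user's raw string, or None if invalid."""
    try:
        index = int(raw_choice)
        return index, entries[index]
    except (ValueError, IndexError):
        return None

orders = [{'Item': 'milk', 'Price': '2.0'}, {'Item': 'egg', 'Price': '12.0'}]
print(choose_entry(orders, '1'))    # (1, {'Item': 'egg', 'Price': '12.0'})
print(choose_entry(orders, 'abc'))  # None
```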
If you want to edit an item in a dictionary, you can easily do it by accessing it by the key.
First, we set up the data
xy = [{'Item': 'milk', 'Price': '2.0', 'Quantity': '2'}, {'Item': 'egg', 'Price': '12.0', 'Quantity': '1'}]
Then if I understood you correctly, this edit_items method should do exactly what you need:
def edit_items(i):
    name = input('Type in a new item name: ')
    xy[i]['Item'] = name # 'Item' is the key.
Everything else is pretty much the same:
print('Index | Orders')
for x in enumerate(xy):
print('\n')
print(x)
choice = int(input('Which entry would you like to edit? Choose by index. :'))
print(xy[choice])
edit_items(choice)
print(xy)
If you want, you can also use input for getting a key (property) of an item you want to edit.
I'm kind of new to python, and I need some help. I'm making an employee list menu. My list of dictionaries is:
person_infos = [ {'name': 'John Doe', 'age': '46', 'job position': 'Chair Builder', 'pay per hour': '14.96','date hired': '2/26/19'},
{'name': 'Phillip Waltertower', 'age': '19', 'job position': 'Sign Holder', 'pay per hour': '10','date hired': '5/9/19'},
{'name': 'Karen Johnson', 'age': '40', 'job position': 'Manager', 'pay per hour': '100','date hired': '9/10/01'},
{'name': 'Linda Bledsoe', 'age': '60', 'job position': 'CEO', 'pay per hour': '700', 'date hired': '8/24/99'},
{'name': 'Beto Aretz', 'age': '22', 'job position': 'Social Media Manager', 'pay per hour': '49','date hired': '2/18/12'}]
and my "search the list of dicts" input function is supposed to print the correct dictionary based on the name the user inputs:
def search_query(person_infos):
    if answer == '3':
        search_query = input('Who would you like to find: ')
        they_are_found = False
        location = None
        for i, each_employee in enumerate(person_infos):
            if each_employee['name'] == search_query:
                they_are_found = True
            location = i
        if they_are_found:
            print('Found: ', person_infos[location]['name'], person_infos[location]['job position'], person_infos[location]['date hired'], person_infos[location]['pay per hour'])
        else:
            print('Sorry, your search query is non-existent.')
and I also have this:
elif answer == '3':
    person_infos = search_query(person_infos)
This seems like a step in the right direction, but for
search_query = input('Who would you like to find: ')
if I input one of the names in person_infos, like "John Doe," it just prints the last dictionary's information (no matter which name I enter, the last dictionary in the list is always output) instead of John Doe's; in this case, it would only print Beto Aretz's.
Can someone please help? It's something I've been struggling on for a while and it would be awesome.
I've researched a lot, but I could not find anything that either used techniques I already knew or that covered this kind of input search.
Thanks,
LR
At first glance, it looks like your location = i is not indented inside your if statement, so it gets set to the latest i on each iteration of the for loop. Let me know if this helps.
def search_query(person_infos):
    if answer == '3':
        search_query = input('Who would you like to find: ')
        they_are_found = False
        location = None
        for i, each_employee in enumerate(person_infos):
            if each_employee['name'] == search_query:
                they_are_found = True
                location = i
        if they_are_found:
            print('Found: ', person_infos[location]['name'], person_infos[location]['job position'], person_infos[location]['date hired'], person_infos[location]['pay per hour'])
        else:
            print('Sorry, your search query is non-existent.')
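Since only the first match is needed, the scan can also stop early; an equivalent sketch using next() with a generator expression (find_employee is a hypothetical helper name, not part of the original code):

```python
person_infos = [
    {'name': 'John Doe', 'job position': 'Chair Builder'},
    {'name': 'Beto Aretz', 'job position': 'Social Media Manager'},
]

def find_employee(people, query):
    # next() returns the first matching dict, or None when nothing matches
    return next((p for p in people if p['name'] == query), None)

print(find_employee(person_infos, 'John Doe'))  # John Doe's dict
print(find_employee(person_infos, 'Nobody'))    # None
```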
Using Python, I am attempting to extract data from the several "fields" of a Wikipedia Taxobox (an infobox which is usually displayed for each animal or plant species page, see for example here: https://en.wikipedia.org/wiki/Okapi).
The solution provided here (How to use Wikipedia API to get section of sidebar?) is interesting but not useful in my case, since I am interested in data from a lower taxonomic category (species).
What I want is a way (as pythonic as possible) to access every field in a Taxobox and then get the data (as a dictionary, perhaps) of interest.
Thanks in advance for any assistance.
EDIT: Here (https://github.com/siznax/wptools) is another good solution which should be what I need, but unfortunately it is a set of command-line tools (besides depending on other command-line tools available only on Linux) and not a Python library.
EDIT2: wptools is a (python 2,3) library now.
@maurobio, @jimhark: wptools is a Python (2+3) library now. It will give you any infobox with "box" in the name as a Python dict, but you probably want to use Wikidata (e.g. okapi https://www.wikidata.org/wiki/Q82037) because infoboxen are messy (to say the least). If you focus on Wikidata, then everyone benefits, and wptools can get Wikidata for you too. We've recently updated wptools so that it gets ALL Wikidata by default.
You can get the infobox data in the example below in some languages, but as @biojl points out, wikitext has a different structure in different languages!
>>> page = wptools.page('Okapi')
>>> page.get_parse()
en.wikipedia.org (parse) Okapi
en.wikipedia.org (imageinfo) File:Okapi2.jpg
Okapi (en) data
{
image: <list(1)> {'kind': 'parse-image', u'descriptionshorturl':...
infobox: <dict(9)> status, status_ref, name, image, taxon, autho...
iwlinks: <list(4)> https://commons.wikimedia.org/wiki/Okapia_joh...
pageid: 22709
parsetree: <str(39115)> <root><template><title>about</title><par...
requests: <list(2)> parse, imageinfo
title: Okapi
wikibase: Q82037
wikidata_url: https://www.wikidata.org/wiki/Q82037
wikitext: <str(29930)> {{about|the animal}}{{good article}}{{use...
}
>>> page.data['infobox']
{'authority': '([[P.L. Sclater]], 1901)',
'image': 'Okapi2.jpg',
'image_caption': "An okapi at [[Disney's Animal Kingdom]] in [[Florida]].",
'name': 'Okapi',
'parent_authority': '[[Ray Lankester|Lankester]], 1901',
'status': 'EN',
'status_ref': '<ext><name>ref</name><attr> name=iucn</attr><inner>{{IUCN2008|assessor=IUCN SSC Antelope Specialist Group|year=2008|id=15188|title=Okapia johnstoni|downloaded=26 November 2013}} Database entry includes a brief justification of why this species is endangered.</inner><close></ref></close></ext>',
'status_system': 'IUCN3.1',
'taxon': 'Okapia johnstoni'}
However, because it is structured, you can get Wikidata in many languages, e.g.
>>> page = wptools.page('Okapi', lang='fr')
>>> page.get_wikidata()
www.wikidata.org (wikidata) Okapi
www.wikidata.org (labels) P646|P349|P373|P685|P627|Q16521|Q7432|Q...
fr.wikipedia.org (imageinfo) File:Okapia johnstoni -Marwell Wildl...
Okapi (fr) data
{
aliases: <list(2)> Mondonga, Okapia johnstoni
claims: <dict(26)> P646, P181, P935, P815, P373, P1417, P685, P1...
description: espèce de mammifères
image: <list(2)> {'kind': 'wikidata-image', u'descriptionshortur...
label: Okapi
labels: <dict(31)> P646, P373, P685, P627, Q16521, Q7432, Q20415...
modified: <dict(1)> wikidata
pageid: 84481
requests: <list(3)> wikidata, labels, imageinfo
title: Okapi
what: taxon
wikibase: Q82037
wikidata: <dict(26)> identifiant BioLib (P838), taxon supérieur ...
wikidata_url: https://www.wikidata.org/wiki/Q82037
}
>>> page.data['wikidata']
{u'carte de r\xe9partition (P181)': u'Okapi distribution.PNG',
u'cat\xe9gorie Commons (P373)': u'Okapia johnstoni',
u'dur\xe9e de gestation (P3063)': {u'amount': u'+14.5',
u'lowerBound': u'+14.0',
u'unit': u'http://www.wikidata.org/entity/Q5151',
u'upperBound': u'+15.0'},
u'd\xe9crit par (P1343)': u'encyclop\xe9die Otto (Q2041543)',
u'galerie Commons (P935)': u'Okapia johnstoni',
u'identifiant ARKive (P2833)': u'okapi/okapia-johnstoni',
u'identifiant Animal Diversity Web (P4024)': u'Okapia_johnstoni',
u'identifiant Biblioth\xe8que nationale de la Di\xe8te (P349)': u'01092792',
u'identifiant BioLib (P838)': u'33523',
u'identifiant Encyclopedia of Life (P830)': u'308387',
u'identifiant Encyclop\xe6dia Britannica en ligne (P1417)': u'animal/okapi',
u'identifiant Fossilworks (P842)': u'149380',
u'identifiant Freebase (P646)': u'/m/05pf4',
u'identifiant GBIF (P846)': u'2441207',
u'identifiant ITIS (P815)': u'625037',
u'identifiant Mammal Species of the World (P959)': u'14200484',
u'identifiant NCBI (P685)': u'86973',
u'identifiant UICN (P627)': u'15188',
u'identifiant de la Grande Encyclop\xe9die russe en ligne (P2924)': u'2290412',
u'image (P18)': [u'Okapia johnstoni -Marwell Wildlife, Hampshire, England-8a.jpg',
u'Okapia johnstoni1.jpg'],
u"nature de l'\xe9l\xe9ment (P31)": u'taxon (Q16521)',
u'nom scientifique du taxon (P225)': u'Okapia johnstoni',
u'nom vernaculaire (P1843)': [u'Okapi', u'Okapi'],
u'rang taxinomique (P105)': u'esp\xe8ce (Q7432)',
u'statut de conservation UICN (P141)': u'esp\xe8ce en danger (Q11394)',
u'taxon sup\xe9rieur (P171)': u'Okapia (Q1872039)'}
Don't forget that you can edit Wikidata in your own language. There are tools available to enable editing a large number of Wikidata pages.
EDIT: we've added a more general parser that should work (to some extent) with any infobox syntax, e.g.
>>> page = wptools.page('Okapi', lang='fr')
>>> page.get_parse()
fr.wikipedia.org (parse) Okapi
Okapi (fr) data
{
infobox: <dict(2)> count, boxes
...
}
>>> page.data['infobox']['count']
13
>>> page.data['infobox']['boxes']
[{u'Taxobox d\xe9but': [[{'index': '1'}, 'animal'],
[{'index': '2'}, "''Okapia johnstoni''"],
[{'index': '3'}, 'Okapi2.jpg'],
[{'index': '4'}, 'Okapi']]},
{'Taxobox': [[{'index': '1'}, 'embranchement'],
[{'index': '2'}, 'Chordata']]},
{'Taxobox': [[{'index': '1'}, 'classe'], [{'index': '2'}, 'Mammalia']]},
{'Taxobox': [[{'index': '1'}, 'sous-classe'], [{'index': '2'}, 'Theria']]},
{'Taxobox': [[{'index': '1'}, 'ordre'], [{'index': '2'}, 'Artiodactyla']]},
{'Taxobox': [[{'index': '1'}, 'famille'], [{'index': '2'}, 'Giraffidae']]},
{'Taxobox taxon': [[{'index': '1'}, 'animal'],
[{'index': '2'}, 'genre'],
[{'index': '3'}, 'Okapia'],
[{'index': '4'}, '[[Edwin Ray Lankester|Lankester]], [[1901]]']]},
{'Taxobox taxon': [[{'index': '1'}, 'animal'],
[{'index': '2'}, u'esp\xe8ce'],
[{'index': '3'}, 'Okapia johnstoni'],
[{'index': '4'}, '([[Philip Lutley Sclater|Sclater]], [[1901]])']]},
{'Taxobox synonymes': [[{'index': '1'},
"* ''Equus johnstoni'' <small>P.L. Sclater, 1901</small>"]]},
{'Taxobox UICN': [[{'index': '1'}, 'EN'], [{'index': '2'}, 'A2abcd+4abcd']]},
{u'Taxobox r\xe9partition': [[{'index': '1'}, 'Okapi map.jpg']]},
{u'Taxobox r\xe9partition': [[{'index': '1'}, 'Okapi distribution.PNG']]},
{'Taxobox fin': []}]
Hope that helps.
(@siznax has posted a better answer. I'm only leaving my answer here as an example of using the wiki APIs and parsing the results. This would only be of practical use if a library like wptools couldn't meet your needs for some reason.)
This is a significant rewrite that includes a (more) proper parser to match the template's closing double braces '}}'. Also makes it easier to request different template names and includes a main() to allow testing from the shell / command line.
import re
import requests
import json

wikiApiRoot = 'https://en.wikipedia.org/w/api.php'

# Returns the position past the requested token, or past end of string if not found
def FindToken(text, token, start=0):
    pos = text.find(token, start)
    if -1 == pos:
        nextTokenPos = len(text)
    else:
        nextTokenPos = pos
    return nextTokenPos + len(token)

# Get the contents of the template as text
def GetTemplateText(wikitext, templateName):
    templateTag = '{{' + templateName
    startPos = FindToken(wikitext, templateTag)
    if len(wikitext) <= startPos:
        # Template not found
        return None
    openCount = 1
    curPos = startPos
    nextOpenPos = FindToken(wikitext, '{{', curPos)
    nextClosePos = FindToken(wikitext, '}}', curPos)
    # Scan for the template's matching close braces
    while 0 < openCount:
        if nextOpenPos < nextClosePos:
            openCount += 1
            curPos = nextOpenPos
            nextOpenPos = FindToken(wikitext, '{{', curPos)
        else:
            openCount -= 1
            curPos = nextClosePos
            nextClosePos = FindToken(wikitext, '}}', curPos)
    templateText = wikitext[startPos:curPos - 2]
    return templateText

def GetTemplateDict(title, templateName='Taxobox'):
    templateDict = None
    # Get data from Wikipedia:
    resp = requests.get(wikiApiRoot + '?action=query&prop=revisions&' +
                        'rvprop=content&rvsection=0&format=json&redirects&titles=' +
                        title)
    # Get the response text into a JSON object:
    rjson = json.loads(resp.text)
    # Pull out the text for the revision (next(iter(...)) keeps this working on
    # Python 3, where dict.values() is not indexable):
    wikitext = next(iter(rjson['query']['pages'].values()))['revisions'][0]['*']
    # Parse the text for the template
    templateText = GetTemplateText(wikitext, templateName)
    if templateText:
        # Parse templateText to get named properties
        templateItemIter = re.finditer(
            r'\|\s*(\w*)\s*=\s*([^\n]*)\n',
            templateText,
            re.M)
        templateList = [item.groups() for item in templateItemIter]
        templateDict = dict(templateList)
    return templateDict

def main():
    import argparse
    import pprint
    parser = argparse.ArgumentParser()
    parser.add_argument('title', nargs='?', default='Okapia_johnstoni', help='title of the desired article')
    parser.add_argument('template', nargs='?', default='Taxobox', help='name of the desired template')
    args = parser.parse_args()
    templateDict = GetTemplateDict(args.title, args.template)
    pprint.pprint(templateDict)

if __name__ == "__main__":
    main()
GetTemplateDict returns a dictionary of the page's taxobox entries. For the Okapi page, this includes:
binomial
binomial_authority
classis
familia
genus
genus_authority
image
image_caption
ordo
phylum
regnum
species
status
status_ref
status_system
trend
I expect the actual items to vary by page.
The dictionary values are Wikipedia's decorated text:
>>> taxoDict['familia']
'[[Giraffidae]]'
So additional parsing or filtering may be desired or required.
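For instance, the [[...]] link markup can be reduced to its display text with a small regex. This is only a sketch; real wikitext has more cases (templates, external links) than the plain and piped links handled here:

```python
import re

def strip_wikilinks(text):
    # [[target|label]] -> label, [[target]] -> target
    return re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)\]\]', r'\1', text)

print(strip_wikilinks('[[Giraffidae]]'))                     # Giraffidae
print(strip_wikilinks('[[Ray Lankester|Lankester]], 1901'))  # Lankester, 1901
```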
I am experiencing a strange faulty behaviour, where a dictionary is only appended once and I can not add more key value pairs to it.
My code reads in a multi-line string and extracts substrings via split(), to be added to a dictionary. I make use of conditional statements. Strangely only the key:value pairs under the first conditional statement are added.
Therefore I can not complete the dictionary.
How can I solve this issue?
Minimal code:
# I hope the '\n' is sufficient, or use '\r\n'
example = "Name: Bugs Bunny\nDOB: 01/04/1900\nAddress: 111 Jokes Drive, Hollywood Hills, CA 11111, United States"

def format(data):
    dic = {}
    for line in data.splitlines():
        #print('Line:', line)
        if ':' in line:
            info = line.split(': ', 1)[1].rstrip() # does not work with files
            #print('Info: ', info)
            if ' Name:' in info: # middle name problems! / maiden name
                dic['F_NAME'] = info.split(' ', 1)[0].rstrip()
                dic['L_NAME'] = info.split(' ', 1)[1].rstrip()
            elif 'DOB' in info: # overhang
                dic['DD'] = info.split('/', 2)[0].rstrip()
                dic['MM'] = info.split('/', 2)[1].rstrip()
                dic['YY'] = info.split('/', 2)[2].rstrip()
            elif 'Address' in info:
                dic['STREET'] = info.split(', ', 2)[0].rstrip()
                dic['CITY'] = info.split(', ', 2)[1].rstrip()
                dic['ZIP'] = info.split(', ', 2)[2].rstrip()
    return dic

if __name__ == '__main__':
    x = format(example)
    for v, k in x.iteritems():
        print v, k
Your code doesn't work at all: you split off the name before the colon and discard it, looking only at the value after the colon, stored in info. That value never contains the names you are looking for; Name, DOB and Address are all part of the line before the :.
Python lets you assign to multiple names at once; make use of this when splitting:
def format(data):
    dic = {}
    for line in data.splitlines():
        if ':' not in line:
            continue
        name, _, value = line.partition(':')
        name = name.strip()
        if name == 'Name':
            dic['F_NAME'], dic['L_NAME'] = value.split(None, 1) # strips whitespace for us
        elif name == 'DOB':
            dic['DD'], dic['MM'], dic['YY'] = (v.strip() for v in value.split('/', 2))
        elif name == 'Address':
            dic['STREET'], dic['CITY'], dic['ZIP'] = (v.strip() for v in value.split(', ', 2))
    return dic
I used str.partition() here rather than limit str.split() to just one split; it is slightly faster that way.
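For reference, str.partition() always returns a 3-tuple (head, separator, tail), which is what makes the multiple-assignment unpacking safe; when the separator is absent, the last two elements are empty strings:

```python
line = 'DOB: 01/04/1900'
name, sep, value = line.partition(':')
print((name, sep, value))  # ('DOB', ':', ' 01/04/1900')

# No separator: the head is the whole string, the rest is empty
print('no colon here'.partition(':'))  # ('no colon here', '', '')
```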
For your sample input this produces:
>>> format(example)
{'CITY': 'Hollywood Hills', 'ZIP': 'CA 11111, United States', 'L_NAME': 'Bunny', 'F_NAME': 'Bugs', 'YY': '1900', 'MM': '04', 'STREET': '111 Jokes Drive', 'DD': '01'}
>>> from pprint import pprint
>>> pprint(format(example))
{'CITY': 'Hollywood Hills',
'DD': '01',
'F_NAME': 'Bugs',
'L_NAME': 'Bunny',
'MM': '04',
'STREET': '111 Jokes Drive',
'YY': '1900',
'ZIP': 'CA 11111, United States'}