I am using Python 2.7 with the wikipedia package to retrieve the text from multiple random Wikipedia pages, as explained in the docs.
I use the following code:
def get_random_pages_summary(pages=0):
    import wikipedia
    page_names = [wikipedia.random(1) for i in range(pages)]
    return [[p, wikipedia.page(p).summary] for p in page_names]
text = get_random_pages_summary(50)
and get the following error:
  File "/home/user/.local/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
    raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "Priuralsky" may refer to:
Priuralsky District
Priuralsky (rural locality)
What I am trying to do is get the text from random pages in Wikipedia, and I need it to be just regular text, without any markup.
I assume that the problem is getting a random name that matches more than one option when searching for a Wikipedia page.
When I use it to get one Wikipedia page, it works well.
Thanks
As you're doing this for random articles through the Wikipedia API (not pulling the HTML directly with other tools), my suggestion would be to catch the DisambiguationError and draw a new random article when this happens:
def random_page():
    random = wikipedia.random(1)
    try:
        result = wikipedia.page(random).summary
    except wikipedia.exceptions.DisambiguationError as e:
        result = random_page()
    return result
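As a side note, the recursion above can in principle nest deeply if several disambiguation pages come back in a row. A loop-based variant of the same retry idea (a minimal sketch, using the same wikipedia calls):

import wikipedia

def random_page():
    # keep drawing random titles until one resolves to a real article
    while True:
        title = wikipedia.random(1)
        try:
            return wikipedia.page(title).summary
        except wikipedia.exceptions.DisambiguationError:
            continue  # ambiguous title, draw again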
According to the documentation (http://wikipedia.readthedocs.io/en/latest/quickstart.html), the error carries the candidate page titles, so you need to look those candidates up again:
try:
    wikipedia.summary("Priuralsky")
except wikipedia.exceptions.DisambiguationError as e:
    for page_name in e.options:
        print(page_name)
        print(wikipedia.page(page_name).summary)
You can improve your code like this:
import wikipedia

def get_page_summaries(page_name):
    try:
        return [[page_name, wikipedia.page(page_name).summary]]
    except wikipedia.exceptions.DisambiguationError as e:
        return [[p, wikipedia.page(p).summary] for p in e.options]

def get_random_pages_summary(pages=0):
    ret = []
    page_names = [wikipedia.random(1) for i in range(pages)]
    for p in page_names:
        for page_summary in get_page_summaries(p):
            ret.append(page_summary)
    return ret

text = get_random_pages_summary(50)
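One caveat: an entry in e.options can itself be another disambiguation page, in which case the inner wikipedia.page(p) call raises again. A guarded variant of get_page_summaries (a sketch that simply skips such options rather than resolving them recursively):

def get_page_summaries(page_name):
    try:
        return [[page_name, wikipedia.page(page_name).summary]]
    except wikipedia.exceptions.DisambiguationError as e:
        summaries = []
        for p in e.options:
            try:
                summaries.append([p, wikipedia.page(p).summary])
            except wikipedia.exceptions.DisambiguationError:
                pass  # nested disambiguation page; skip it
        return summaries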
On auction websites, there is a clock counting down the time remaining. I am trying to extract that piece of information (among others) and write it to a CSV file.
For example, I am trying to take the value after 'Time Left:' on this site: https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx
I have tried three different options, without any success:
1)
time = ''
try:
    time = soup.find(id='tzcd').text.replace('Time Left:','')
    #print("Time: ", time)
except Exception as e:
    print(e)
2)
time = ''
try:
    time = soup.find(id='tzcd').text
    #print("Time: ", time)
except:
    pass
3)
time = ''
try:
    time = soup.find('div', id="BiddingTimeSection").find_next_sibling("div").text
    #print("Time: ", time)
except:
    pass
I am a new user of Python and don't know if the problem is the date/time structure of the page or something else inherently flawed in my code.
Any help would be greatly appreciated!
That information is pulled into the page via a JavaScript XHR call. You can see that by inspecting the Network tab in your browser's dev tools. The following code will get you the time left in seconds:
import requests

s = requests.Session()
header = {'X-AjaxPro-Method': 'GetTimerText'}
payload = '{"inventoryId":271177}'
# hit the lot page first so the session picks up the required cookies
r = s.get('https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx')
s.headers.update(header)
# then call the AjaxPro endpoint the page itself uses for the countdown
r = s.post('https://auctionofchampions.com/ajaxpro/LotDetail,App_Web_lotdetail.aspx.cdcab7d2.1voto_yr.ashx', data=payload)
print(r.json()['value']['timeLeft'])
Response:
792309
792309 seconds is a bit over 9 days. There are easy ways to convert that into days/hours/minutes if you want.
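For example, a quick conversion of the returned value with divmod (a small sketch; 792309 is just the sample response above):

seconds = 792309
minutes, secs = divmod(seconds, 60)
hours, minutes = divmod(minutes, 60)
days, hours = divmod(hours, 24)
print('%dd %02dh %02dm %02ds' % (days, hours, minutes, secs))  # 9d 04h 05m 09s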
I am trying to collect data from a website using Python. On the webpage there are multiple listings of software, and in each listing my data is within an h5 tag with the class 'price_software_details'.
However, in some cases the tag, along with the data, is missing. I want to print an 'NA' message if the data and tag are missing; otherwise it should print the data.
I tried the code that I have mentioned below, though it's not working.
Help please!
interest = soup.find(id='allsoftware')
for link in interest.findAll('h5'):
    if link.find(class_='price_software_details') == True:
        print(link.getText())
    else:
        print('NA')
Have you tried error handling (try, except)?
interest = soup.find(id='allsoftware')
for link in interest.findAll('h5'):
    try:
        item = link.find(class_='price_software_details')
        print(item.get_text())  # raises AttributeError when item is None
    except AttributeError:
        print('NA')
You need to know that soup.find() never returns True. It returns either the matched element or None.
interest = soup.find(id='allsoftware')
for link in interest.findAll('h5'):
    if link.find(class_='price_software_details'):
        print(link.getText())
    else:
        print('NA')
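A slightly more compact variant of the same truthiness check, assuming the same soup object as above:

interest = soup.find(id='allsoftware')
for link in interest.findAll('h5'):
    item = link.find(class_='price_software_details')
    # item is a Tag when found and None when missing
    print(item.getText() if item else 'NA')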
I am not a Python geek, but I have tried to solve this problem using information from several answers to similar questions; none seem to really work in my case. Here it is:
I am calling a function from a Python script. Here is the function:
def getsom(X):
    #some codes
    try:
        st = get data from site 1 using X
    except:
        print "not available from site 1, getting from site 2"
        st = get data from site 2 using X
    #some codes that depend on st
I am calling this from a Python script as such:
#some codes
for yr in range(min_yr, max_yr+1):
    day = 1
    while day < max_day:
        st1 = getsom(X)
        #some code that depends on st1
        day += 1
This works fine when data is available on either site 1 or site 2 for a particular day, but breaks down when it is unavailable on both sites for another day.
I want to be able to move on and check the next day when data is unavailable for a particular day on both sites. I have tried different configurations of try and except with no success and would appreciate any help on the most efficient way to do this.
Thanks!
***Edit
Final version that worked.
In the function part:
def getsom(X):
    #some codes
    try:
        st = get data from site 1 using X
    except:
        print "not available from site 1, getting from site 2"
        try:
            st = get data from site 2 using X
        except:
            print "data not available from sites 1 and 2"
            st = None
    if st is not None:
        #some codes that depend on st
In order to iterate to the next day on the script side, I had to handle the None case from the function with another try/except block:
#some codes
for yr in range(min_yr, max_yr+1):
    day = 1
    while day < max_day:
        try:
            st = getsom(X)
        except:
            st = None
        if st is not None:
            #some codes that depend on st
As mentioned in the comments, it seems you want to catch the exception in the first-level exception handler. You can do it like this:
def getsom(X):
    #some codes
    try:
        st = get data from site 1 using X
    except:
        print "not available from site 1, getting from site 2"
        try:
            st = get data from site 2 using X
        except:
            print "Not available from site 2 as well."
            # Here you can assign some arbitrary value to your variable
            # (like None, for example) or return from the function.
    #some codes that depend on st
If the data is not available on either of the sites, you can assign some arbitrary value to your variable st or simply return from the function.
Is this what you are looking for?
Also, you shouldn't simply write a bare except without specifying the type of exception you expect; look here for more info: Should I always specify an exception type in `except` statements?
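To illustrate why: a bare except also swallows things like KeyboardInterrupt, so you can't even stop the script with Ctrl-C. Catching Exception (or something narrower) avoids that. A minimal sketch, with a hypothetical fetch() standing in for the placeholder site calls:

try:
    st = fetch(X)  # hypothetical stand-in for "get data from site 1 using X"
except Exception as e:  # catches ordinary errors, but not KeyboardInterrupt/SystemExit
    print "fetch failed: %s" % e
    st = None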
Edit, to answer the problem in the comment:
If you have no data for a certain day, you can just return None and handle it like this:
#some codes
for yr in range(min_yr, max_yr+1):
    day = 1
    while day < max_day:
        st1 = getsom(X)
        if st1 is not None:
            #some code that depends on st1
        day += 1
Why don't you create a separate function for it?
def getdata(X):
    for site in [site1, site2]:  # possibly more
        try:
            return get_data_from_site_using_X()
        except:
            print "not available in %s" % site
    print "couldn't find data anywhere"
Then getsom becomes:
def getsom(X):
    #some codes
    st = getdata(X)
    #some codes that depend on st
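To make the fallback pattern concrete, here is a sketch of what getdata could look like if the two sites are plain HTTP endpoints fetched with the requests library; the URL templates and the requests dependency are assumptions, since the thread only shows placeholders:

import requests

def getdata(x, url_templates=("http://site1.example/%s", "http://site2.example/%s")):
    # try each source in order and return the first successful payload
    for template in url_templates:
        url = template % x
        try:
            r = requests.get(url, timeout=10)
            r.raise_for_status()  # treat HTTP errors like fetch failures
            return r.text
        except requests.RequestException as e:
            print "not available from %s: %s" % (url, e)
    print "couldn't find data anywhere"
    return None  # caller checks for None and moves on to the next day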
I have some experience in Python, but I have never used try/except blocks to catch errors, due to a lack of formal training.
I am working on extracting a few articles from Wikipedia. For this I have an array of titles, a few of which do not have any article or search result. I would like the page-retrieval function to just skip those few names and continue running the script on the rest. Reproducible code follows.
import wikipedia
# This one works.
links = ["CPython"]
test = [wikipedia.page(link, auto_suggest=False) for link in links]
test = [testitem.content for testitem in test]
print(test)
#The sequence breaks down if there is no wikipedia page.
links = ["CPython","no page"]
test = [wikipedia.page(link, auto_suggest=False) for link in links]
test = [testitem.content for testitem in test]
print(test)
The library loads pages with a method like this. Normally changing it would be really bad practice, but since this is just a one-off data extraction, I am willing to change the local copy of the library to get it to work. Edit: I have now included the complete function.
def page(title=None, pageid=None, auto_suggest=True, redirect=True, preload=False):
    '''
    Get a WikipediaPage object for the page with title `title` or the pageid
    `pageid` (mutually exclusive).

    Keyword arguments:

    * title - the title of the page to load
    * pageid - the numeric pageid of the page to load
    * auto_suggest - let Wikipedia find a valid page title for the query
    * redirect - allow redirection without raising RedirectError
    * preload - load content, summary, images, references, and links during initialization
    '''

    if title is not None:
        if auto_suggest:
            results, suggestion = search(title, results=1, suggestion=True)
            try:
                title = suggestion or results[0]
            except IndexError:
                # if there is no suggestion or search results, the page doesn't exist
                raise PageError(title)
        return WikipediaPage(title, redirect=redirect, preload=preload)
    elif pageid is not None:
        return WikipediaPage(pageid=pageid, preload=preload)
    else:
        raise ValueError("Either a title or a pageid must be specified")
What should I do to retrieve only the pages that do not give the error? Maybe there is a way to filter out all items in the list that raise this error, or an error of some kind. Returning "NA" or similar would be fine for pages that don't exist. Skipping them without notice would be fine too. Thanks!
The function wikipedia.page will raise a wikipedia.exceptions.PageError if the page doesn't exist. That's the error you want to catch.
import wikipedia

links = ["CPython", "no page"]
test = []
for link in links:
    try:
        # try to load the wikipedia page
        page = wikipedia.page(link, auto_suggest=False)
        test.append(page)
    except wikipedia.exceptions.PageError:
        # if a "PageError" was raised, ignore it and continue to next link
        continue
You have to surround the wikipedia.page call with a try block, so I'm afraid you can't use a list comprehension directly.
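If you'd still like something comprehension-shaped, one workaround (a sketch, reusing the same wikipedia calls) is to push the try/except into a small helper that returns None on failure, then filter:

import wikipedia

def safe_page(link):
    try:
        return wikipedia.page(link, auto_suggest=False)
    except wikipedia.exceptions.PageError:
        return None  # missing page; filtered out below

links = ["CPython", "no page"]
pages = [p for p in (safe_page(l) for l in links) if p is not None]
test = [p.content for p in pages]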
Understand that this will be bad practice, but for a one-off, quick-and-dirty script you can just swallow the error.
Edit: wait, sorry. I've just noticed the list comprehension. You'll need to break that down:
links = ["CPython", "no page"]
test = []
for link in links:
try:
page = wikipedia.page(link, auto_suggest=False)
test.append(page)
except wikipedia.exceptions.PageError:
pass
test = [testitem.content for testitem in test]
print(test)
pass essentially tells Python to trust you and ignore the error so that it can continue on about its day.
I'm looking at scraping some data from Facebook using Python 2.7. My code basically increments the Facebook profile ID by 1 and captures the details returned by each page.
An example of the page I'm looking to capture the data from is graph.facebook.com/4.
Here's my code below:
import scraperwiki
import urlparse
import simplejson

source_url = "http://graph.facebook.com/"
profile_id = 1

while True:
    try:
        profile_id += 1
        profile_url = urlparse.urljoin(source_url, str(profile_id))
        results_json = simplejson.loads(scraperwiki.scrape(profile_url))
        for result in results_json['results']:
            print result
            data = {}
            data['id'] = result['id']
            data['name'] = result['name']
            data['first_name'] = result['first_name']
            data['last_name'] = result['last_name']
            data['link'] = result['link']
            data['username'] = result['username']
            data['gender'] = result['gender']
            data['locale'] = result['locale']
            print data['id'], data['name']
            scraperwiki.sqlite.save(unique_keys=['id'], data=data)
            #time.sleep(3)
    except:
        continue
    profile_id += 1
I am using the ScraperWiki site to carry out this check, but no data is printed back to the console, despite the line print data['id'], data['name'] being there just to check the code is working.
Any suggestions on what is wrong with this code? As said, for each returned profile, the unique data should be captured and printed to screen as well as populated into the sqlite database.
Thanks
Any suggestions on what is wrong with this code?
Yes. You are swallowing all of your errors. There could be a huge number of things going wrong in the block under try. If anything goes wrong in that block, you move on without printing anything.
You should only ever use a try / except block when you are looking to handle a specific error.
Modify your code so that it looks like this:
while True:
    profile_id += 1
    profile_url = urlparse.urljoin(source_url, str(profile_id))
    results_json = simplejson.loads(scraperwiki.scrape(profile_url))
    for result in results_json['results']:
        print result
        data = {}
        # ... more ...
and then you will get detailed error messages when specific things go wrong.
As for your concern in the comments:

    The reason I have the error handling is because, if you look for example at graph.facebook.com/3, this page contains no user data, and so I don't want to collate this info and skip to the next user, i.e. no. 4 etc.
If you want to handle the case where there is no data, then find a way to handle that case specifically. It is bad practice to swallow all errors.
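For instance, one way to handle the empty-profile case specifically rather than with a blanket except (a sketch; exactly what the Graph API returns for a missing profile varies, so the emptiness check below is an assumption):

while True:
    profile_id += 1
    profile_url = urlparse.urljoin(source_url, str(profile_id))
    result = simplejson.loads(scraperwiki.scrape(profile_url))
    # skip profiles that come back without user data (e.g. graph.facebook.com/3)
    if not result or 'id' not in result:
        continue
    data = {'id': result['id'], 'name': result['name']}
    print data['id'], data['name']
    scraperwiki.sqlite.save(unique_keys=['id'], data=data)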