So I'm interested in this theory that if you go to a random Wikipedia article and repeatedly click the first link not inside parentheses, in 95% of cases you will end up on the article about Philosophy.
I wanted to write a script in Python that does the link fetching for me and in the end, print a nice list of which articles were visited (linkA -> linkB -> linkC) etc.
I managed to get the HTML DOM of the web pages and to strip out some unnecessary links and the top description bar which leads to disambiguation pages. So far I have concluded that:
The DOM begins with the table which you see on the right on some pages, for example in Human. We want to ignore these links.
The valid link elements all have a <p> element somewhere as their ancestor (most often the parent, or the grandparent if the link is inside a <b> tag or similar). The top bar which leads to disambiguation pages does not seem to contain any <p> elements.
Invalid links contain some special words followed by a colon, e.g. Wikipedia:
So far, so good. But it's the parentheses that get me. In the article about Human for example, the first link not inside parentheses is "/wiki/Species", but the script finds "/wiki/Taxonomy" which is inside them.
I have no idea how to go about this programmatically, since I have to look for text in some combination of parent/child nodes which may not always be the same. Any ideas?
My code can be seen below, but it's something I put together really quickly and am not very proud of. It's commented, however, so you can see my line of thought (I hope :) ).
"""Wikipedia fun"""
import urllib2
from xml.dom.minidom import parseString
import time
def validWikiArticleLinkString(href):
""" Takes a string and returns True if it contains the substring
'/wiki/' in the beginning and does not contain any of the
"special" wiki pages.
"""
return (href.find("/wiki/") == 0
and href.find("(disambiguation)") == -1
and href.find("File:") == -1
and href.find("Wikipedia:") == -1
and href.find("Portal:") == -1
and href.find("Special:") == -1
and href.find("Help:") == -1
and href.find("Template_talk:") == -1
and href.find("Template:") == -1
and href.find("Talk:") == -1
and href.find("Category:") == -1
and href.find("Bibcode") == -1
and href.find("Main_Page") == -1)
if __name__ == "__main__":
visited = [] # a list of visited links. used to avoid getting into loops
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')] # need headers for the api
currentPage = "Human" # the page to start with
while True:
infile = opener.open('http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % currentPage)
html = infile.read() # retrieve the contents of the wiki page we are at
htmlDOM = parseString(html) # get the DOM of the parsed HTML
aTags = htmlDOM.getElementsByTagName("a") # find all <a> tags
for tag in aTags:
if "href" in tag.attributes.keys(): # see if we have the href attribute in the tag
href = tag.attributes["href"].value # get the value of the href attribute
if validWikiArticleLinkString(href): # if we have one of the link types we are looking for
# Now come the tricky parts. We want to look for links in the main content area only,
# and we want the first link not in parentheses.
# assume the link is valid.
invalid = False
# tables which appear to the right on the site appear first in the DOM, so we need to make sure
# we are not looking at a <a> tag somewhere inside a <table>.
pn = tag.parentNode
while pn is not None:
if str(pn).find("table at") >= 0:
invalid = True
break
else:
pn = pn.parentNode
if invalid: # go to next link
continue
# Next we look at the descriptive texts above the article, if any; e.g
# This article is about .... or For other uses, see ... (disambiguation).
# These kinds of links will lead into loops so we classify them as invalid.
# We notice that this text does not appear to be inside a <p> block, so
# we dismiss <a> tags which aren't inside any <p>.
pnode = tag.parentNode
while pnode is not None:
if str(pnode).find("p at") >= 0:
break
pnode = pnode.parentNode
# If we have reached the root node, which has parentNode None, we classify the
# link as invalid.
if pnode is None:
invalid = True
if invalid:
continue
###### this is where I got stuck:
# now we need to look if the link is inside parentheses. below is some junk
# for elem in tag.parentNode.childNodes:
# while elem.firstChild is not None:
# elem = elem.firstChid
# print elem.nodeValue
print href # this will be the next link
newLink = href[6:] # except for the /wiki/ part
break
# if we have been to this link before, break the loop
if newLink in visited:
print "Stuck in loop."
break
# or if we have reached Philosophy
elif newLink == "Philosophy":
print "Ended up in Philosophy."
break
else:
visited.append(currentPage) # mark this currentPage as visited
currentPage = newLink # make the the currentPage we found the new page to fetch
time.sleep(5) # sleep some to see results as debug
I found a Python script on GitHub (http://github.com/JensTimmerman/scripts/blob/master/philosophy.py) that plays this game.
It uses BeautifulSoup for HTML parsing, and to cope with the parentheses issue it simply removes the text between brackets before parsing the links.
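For reference, here is a minimal sketch of that bracket-stripping approach (not the linked script itself, just an illustration using requests and BeautifulSoup; the regex and the colon check for "special" pages are my own assumptions):

import re
import requests
from bs4 import BeautifulSoup

def first_link(title):
    """Return the first /wiki/ link of an article, ignoring text in parentheses.
    Sketch only: parenthesized text is stripped from each paragraph's raw HTML
    before re-parsing it, mirroring the 'remove text between brackets' trick."""
    html = requests.get("https://en.wikipedia.org/wiki/" + title,
                        headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.select("#mw-content-text p"):
        # crude removal of "(...)" spans; does not handle nested parentheses
        cleaned = BeautifulSoup(re.sub(r"\([^)]*\)", "", str(p)), "html.parser")
        for a in cleaned.find_all("a", href=True):
            href = a["href"]
            if href.startswith("/wiki/") and ":" not in href:  # skip File:, Wikipedia:, etc.
                return href
    return None

print(first_link("Human"))  # per the question, the first in-article link should be /wiki/Species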
I'm trying to write a scraper that randomly chooses a wiki article link from a page, goes there, grabs another, and loops. I want to exclude links with "Category:", "File:", or "List" in the href. I'm pretty sure the links I want are all inside of <p> tags, but when I include "p" in find_all, I get an "int object is not subscriptable" error.
The code below returns wiki pages but does not exclude the things I want to filter out.
This is a learning journey for me. All help is appreciated.
import requests
from bs4 import BeautifulSoup
import random
import time

def scrapeWikiArticle(url):
    response = requests.get(
        url=url,
    )

    soup = BeautifulSoup(response.content, 'html.parser')

    title = soup.find(id="firstHeading")
    print(title.text)
    print(url)

    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)
    linkToScrape = 0

    for link in allLinks:
        # Here i am trying to select hrefs with /wiki/ in them and exclude hrefs with "Category:" etc. It does select for wikis but does not exclude anything.
        if link['href'].find("/wiki/") == -1:
            if link['href'].find("Category:") == 1:
                if link['href'].find("File:") == 1:
                    if link['href'].find("List") == 1:
                        continue

        # Use this link to scrape
        linkToScrape = link
        articleTitles = open("savedArticles.txt", "a+")
        articleTitles.write(title.text + ", ")
        articleTitles.close()
        time.sleep(6)
        break

    scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape['href'])

scrapeWikiArticle("https://en.wikipedia.org/wiki/Anarchism")
You need to modify the for loop: .attrs is used to access the attributes of a tag, and if you want to exclude links whose href value contains a particular keyword, use a != -1 comparison.
Modified code:
import requests
from bs4 import BeautifulSoup
import random
import time

def scrapeWikiArticle(url):
    response = requests.get(
        url=url,
    )

    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find(id="firstHeading")

    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)
    linkToScrape = 0

    for link in allLinks:
        if("href" in link.attrs):
            if link.attrs['href'].find("/wiki/") == -1 or link.attrs['href'].find("Category:") != -1 or link.attrs['href'].find("File:") != -1 or link.attrs['href'].find("List") != -1:
                continue

            linkToScrape = link
            articleTitles = open("savedArticles.txt", "a+")
            articleTitles.write(title.text + ", ")
            articleTitles.close()
            time.sleep(6)
            break

    if(linkToScrape):
        scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape.attrs['href'])

scrapeWikiArticle("https://en.wikipedia.org/wiki/Anarchism")
This section seems problematic.
if link['href'].find("/wiki/") == -1:
    if link['href'].find("Category:") == 1:
        if link['href'].find("File:") == 1:
            if link['href'].find("List") == 1:
                continue
find returns the index of the substring you are looking for (or -1 if it is not found), and you are also using it incorrectly: comparing the result to 1 does not test whether the substring is present.
So: if "/wiki/" is not found, or "Category:", "File:", etc. appear in the href, then continue.
if link['href'].find("/wiki/") == -1 or \
   link['href'].find("Category:") != -1 or \
   link['href'].find("File:") != -1 or \
   link['href'].find("List") != -1:
    print("skipped " + link["href"])
    continue
Saint Petersburg
https://en.wikipedia.org/wiki/St._Petersburg
National Diet Library
https://en.wikipedia.org/wiki/NDL_(identifier)
Template talk:Authority control files
https://en.wikipedia.org/wiki/Template_talk:Authority_control_files
skipped #searchInput
skipped /w/index.php?title=Template_talk:Authority_control_files&action=edit&section=1
User: Tom.Reding
https://en.wikipedia.org/wiki/User:Tom.Reding
skipped http://toolserver.org/~dispenser/view/Main_Page
Iapetus (moon)
https://en.wikipedia.org/wiki/Iapetus_(moon)
87 Sylvia
https://en.wikipedia.org/wiki/87_Sylvia
skipped /wiki/List_of_adjectivals_and_demonyms_of_astronomical_bodies
Asteroid belt
https://en.wikipedia.org/wiki/Main_asteroid_belt
Detached object
https://en.wikipedia.org/wiki/Detached_object
Use :not() to handle the list of exclusions within the href, alongside the * (contains) operator; this filters out hrefs containing any of the specified substrings. Precede this with an attribute selector that requires the href to contain /wiki/. I have specified a case-insensitive match via the i flag for the first two exclusions, which can be removed:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://en.wikipedia.org/wiki/2018_FIFA_World_Cup#Prize_money')
soup = bs(r.content, 'lxml') # 'html.parser'
links = [i['href'] for i in soup.select('#bodyContent a[href*="/wiki/"]:not([href*="Category:" i], [href*="File:" i], [href*="List"])')]
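If you only need the first such link rather than the whole list, select_one() with the same selector returns just the first match (or None when nothing matches):

first = soup.select_one('#bodyContent a[href*="/wiki/"]:not([href*="Category:" i], [href*="File:" i], [href*="List"])')
if first is not None:
    print(first['href'])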
I'm using Django and Python 3.7. I want to have more efficient parsing so I was reading about SoupStrainer objects. I created a custom one to help me parse only the elements I need ...
def my_custom_strainer(self, elem, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
        if elem == 'div' and 'class' in attr and attrs['class'] == "score":
            return True
        elif elem == "span" and elem.text == re.compile("my text"):
            return True

article_stat_page_strainer = SoupStrainer(self.my_custom_strainer)
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
One of the conditions is I only want to parse "span" elements whose text matches a certain pattern. Hence the
elem == "span" and elem.text == re.compile("my text")
clause. However, this results in an
AttributeError: 'str' object has no attribute 'text'
error when I try to run the above. What's the proper way to write my strainer?
TLDR; No, this is currently not easily possible in BeautifulSoup (modification of BeautifulSoup and SoupStrainer objects would be needed).
Explanation:
The problem is that the Strainer-passed function gets called from the handle_starttag() method. As you can guess, at that point you only have the values from the opening tag (e.g. the element name and attrs).
https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/__init__.py#L524
if (self.parse_only and len(self.tagStack) <= 1
    and (self.parse_only.text
         or not self.parse_only.search_tag(name, attrs))):
    return None
And as you can see, if your Strainer function returns False, the element gets discarded immediately, without having a chance to take the inner text into consideration (unfortunately).
On the other hand, if you add "text" to the search:
SoupStrainer(text="my text")
it will start searching inside tags for text, but this doesn't have the context of the element or its attributes - you can see the irony :/
Combining the two will just find nothing. And you can't even access the parent, as shown here in the find function:
https://gist.github.com/RichardBronosky/4060082
So currently Strainers are only good for filtering on elements/attrs. You would need to change a lot of Beautiful Soup code to get this working.
If you really need this, I suggest inheriting BeautifulSoup and SoupStrainer objects and modifying their behavior.
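If subclassing is more than you need, a pragmatic workaround (my own suggestion, not a SoupStrainer feature) is to strain only on the element/attribute part and apply the text condition after parsing:

import re
from bs4 import BeautifulSoup, SoupStrainer

html = "<div><span>some other text</span><span>here is my text</span></div>"

# The strainer only ever sees tag names and attributes, so restrict to <span> here...
only_spans = SoupStrainer("span")
soup = BeautifulSoup(html, "html.parser", parse_only=only_spans)

# ...and check the inner text afterwards, once it is actually available.
matches = soup.find_all("span", string=re.compile("my text"))
print([m.get_text() for m in matches])  # ['here is my text']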
It seems you are trying to loop over soup elements in the my_custom_strainer method.
In order to do so, you could do it as follows:
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
my_custom_strainer(soup, attrs)
Then slightly modify my_custom_strainer to meet something like:
def my_custom_strainer(soup, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
    for d in soup.findAll(['div', 'span']):
        if d.name == 'span' and 'class' in attr and attrs['class'] == "score":
            return d.text  # meet your needs here
        elif d.name == 'span' and d.text == re.compile("my text"):
            return d.text  # meet your needs here
This way you can access the soup objects iteratively.
I recently created an lxml / BeautifulSoup parser for HTML files, which also searches between specific tags.
The function I wrote opens up your operating system's file manager and allows you to select the specific HTML file to parse.
def openFile(self):
    options = QFileDialog.Options()
    options |= QFileDialog.DontUseNativeDialog
    fileName, _ = QFileDialog.getOpenFileName(self, "QFileDialog.getOpenFileName()", "",
                                              "All Files (*);;Python Files (*.py)", options=options)
    if fileName:
        file = open(fileName)
        data = file.read()
        soup = BeautifulSoup(data, "lxml")
        results = []    # collect the values found inside the <strong> tags
        for item in soup.find_all('strong'):
            results.append(float(item.text))
        print('Score =', results[1])
        print('Fps =', results[0])
You can see that the tag I specified was 'strong', and I was trying to find the text within that tag.
Hope I could help in some way.
I'm a newbie to Python and thus to Scrapy (a tool to crawl websites, written in Python) too; I hope someone can shed some light on my way... I just wrote a spider consisting of 2 parsing functions:
- the first parsing function parses the start page I'm crawling, which contains chapters & sub-chapters nested 7 levels deep, with some of the chapters at various levels pointing to articles or lists of articles
- the second parsing function parses the articles or lists of articles and is invoked as the callback of scrapy.Request(...)
The objective of this spider is to create a sort of big DOM of the entire content with the chapters, sub-chapters, articles & their content.
I have a problem in the second function, which sometimes seems to receive responses that do not correspond to the content located at the URL used when invoking scrapy.Request. This problem disappeared when setting CONCURRENT_REQUESTS to 1. I initially thought that this was due to some multi-threading / non-re-entrant function problem, but I found that I had no re-entrancy issue and read afterwards that Scrapy is actually not multi-threaded... so I cannot figure out where my problem comes from.
Here is a snippet of my code:
#---------------------------------------------
# Init part:
#---------------------------------------------
import scrapy
from scrapy import signals
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
from scrapy.exceptions import CloseSpider

top = Element('top')
curChild = top

class mytest(scrapy.Spider):
    name = 'lfb'
    #
    # This is what makes my code work, but I don't know why !!!
    # Ideally I would like to benefit from the speed of having several concurrent
    # requests when crawling & parsing
    #
    custom_settings = {
        'CONCURRENT_REQUESTS': '1',
    }

    #
    # This section is just here to be able to do something when the spider closes.
    # In this case I want to print the DOM I've created.
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(mytest, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        print ("Spider closed - !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        # this is to print the DOM created at the end
        print tostring(top)

    def parse(self, response):
        pass

    def start_requests(self):
        level = 0
        print "Start parsing legifrance level set to %d" % level
        # This is to print the DOM which is empty (or almost - just the top element in there)
        print tostring(top)
        yield scrapy.Request("<Home Page>", callback=self.parse)

    #----------------------------------------------
    # First parsing function - Parsing the Home page - this one works fine (I think)
    #----------------------------------------------
    def parse(self, response):
        for sel in response.xpath('//span'):
            cl = sel.xpath("@class").extract()
            desc = sel.xpath('text()').extract()
            #
            # Do some stuff here depending on the class (cl) of 'span', which corresponds
            # to either one of the 7 levels of chapters & sub-chapters or to a list of
            # articles attached to a sub-chapter. To simplify I'm just putting here the
            # code corresponding to the handling of lists of articles (cl == codeLienArt)
            # ...
            # ...
            if cl == [unicode('codeLienArt')]:
                art_plink = sel.css('a::attr("href")').extract()
                artLink = "<Base URL>" + str(unicode(art_plink[0]))
                #
                # curChild points to the element in the DOM to which the list of articles
                # should be attached. Pass it in the request meta, in order for the second
                # parsing function to place the articles & their content at the right place
                # in the DOM
                #
                thisChild = curChild
                #
                # print for debug - thisChild.text contains the heading of the sub-chapter
                # to which the list of articles that will be processed by parse1 should be
                # attached.
                #
                print "follow link cl:%s art:%s for %s" % (cl, sel.xpath('a/text()').extract(), thisChild.text)
                #
                # get the list of articles following artLink & pass the response to the second parsing function
                # (I know it's called parse1 :-)
                #
                yield scrapy.Request(artLink, callback=self.parse1, meta={'element': thisChild})

    #-------------------
    # This is the second parsing function that parses lists of Articles & their content
    # format is basically one or several articles, each being presented (simplified) as
    # <div class="Articles">
    #     <div class="titreArt"> Title here</div>
    #     <div class="corpsArt"> Sometimes some text and often a list of paragraphs <p>sentences</p></div>
    # </div>
    #-------------------
    def parse1(self, resp):
        print "enter parse1"
        numberOfArticles = 0
        for selArt in resp.xpath('//div[@class="article"]'):
            #
            # This is where I see the problem when CONCURRENT_REQUESTS > 1: sometimes
            # the response points to a page that is not the page that was requested in
            # the previous parsing function...
            #
            clArt = selArt.xpath('.//div[@class="titreArt"]/text()').extract()
            print clArt
            numberOfArticles += 1
            childArt = SubElement(resp.meta['element'], 'Article')
            childArt.text = str(unicode("%s" % clArt[0]))
            corpsArt = selArt.xpath('.//div[@class="corpsArt"]/text()').extract()
            print "corpsArt=%s" % corpsArt
            temp = ''
            for corpsItem in corpsArt:
                if corpsItem != '\n':
                    temp += corpsItem
            if temp != '':
                childCorps = SubElement(childArt, 'p')
                childCorps.text = temp
                print "corpsArt is not empty %s" % temp
            for paraArt in selArt.xpath('.//div[@class="corpsArt"]//p/text()').extract():
                childPara = SubElement(childArt, 'p')
                childPara.text = paraArt
                print "childPara.text=%s" % childPara.text
        print "link followed %s (%d)" % (resp.url, numberOfArticles)
        print "leave parse1"
        yield
I'm writing a solution to test this phenomenon in Python. I have most of the logic done, but there are many edge cases that arise when following links in Wikipedia articles.
The problem I'm running into arises for a page like this where the first <p> has multiple levels of child elements and the first <a> tag after the first set of parentheses needs to be extracted. In this case (to extract this link), you have to skip over the parentheses and then get to the very next anchor tag/href. In most articles, my algorithm can skip over the parentheses, but with the way that it looks for links in front of parentheses (or when they don't exist), it is finding the anchor tag in the wrong place. Specifically, here: <span style="font-size: small;"><span id="coordinates">Coordinates
The algorithm works by iterating through the elements in the first paragraph tag (in the main body of the article), stringifying each element iteratively, and first checking to see whether it contains either an '(' or an '<a'.
Is there any straightforward way to avoid embedded anchor tags and only take the first link that is a direct child of the first <p>?
Below is the function with this code for reference:
def getValidLink(self, currResponse):
    currRoot = BeautifulSoup(currResponse.text, "lxml")
    temp = currRoot.body.findAll('p')[0]
    parenOpened = False
    parenCompleted = False
    openCount = 0
    foundParen = False
    while temp.next:
        temp = temp.next
        curr = str(temp)
        if '(' in curr and str(type(temp)) == "<class 'bs4.element.NavigableString'>":
            foundParen = True
            break
        if '<a' in curr and str(type(temp)) == "<class 'bs4.element.Tag'>":
            link = temp
            break

    temp = currRoot.body.findAll('p')[0]
    if foundParen:
        while temp.next and not parenCompleted:
            temp = temp.next
            curr = str(temp)
            if '(' in curr:
                openCount += 1
                if parenOpened is False:
                    parenOpened = True
            if ')' in curr and parenOpened and openCount > 1:
                openCount -= 1
            elif ')' in curr and parenOpened and openCount == 1:
                parenCompleted = True
        try:
            return temp.findNext('a').attrs['href']
        except KeyError:
            print "\nReached article with no main body!\n"
            return None

    try:
        return str(link.attrs['href'])
    except KeyError:
        print "\nReached article with no main body\n"
        return None
I think you are seriously overcomplicating the problem.
There are multiple ways to use the direct parent-child relationship between the elements in BeautifulSoup. One way is the > CSS selector:
In [1]: import requests
In [2]: from bs4 import BeautifulSoup
In [3]: url = "https://en.wikipedia.org/wiki/Sierra_Leone"
In [4]: response = requests.get(url)
In [5]: soup = BeautifulSoup(response.content, "html.parser")
In [6]: [a.get_text() for a in soup.select("#mw-content-text > p > a")]
Out[6]:
['West Africa',
'Guinea',
'Liberia',
...
'Allen Iverson',
'Magic Johnson',
'Victor Oladipo',
'Frances Tiafoe']
Here we've found the a elements that are located directly under the p elements, which are in turn directly under the element with id="mw-content-text" - from what I understand, this is where the main Wikipedia article content is located.
If you need a single element, use select_one() instead of select().
Also, if you want to solve it via find*(), pass the recursive=False argument.
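For completeness, a rough find()-based equivalent of the same idea (a sketch only; like the selector above, it assumes the <p> elements are direct children of the content element):

content = soup.find(id="mw-content-text")
for p in content.find_all("p", recursive=False):   # direct child <p> elements only
    link = p.find("a", recursive=False)            # first <a> that is a direct child of the <p>
    if link is not None:
        print(link.get("href"))
        break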
I am trying to get BeautifulSoup to do the following.
I have HTML files which I wish to modify. I am interested in two tags in particular. One, which I will call TagA, is
<div class ="A">...</div>
and one, which I will call TagB, is
<p class = "B">...</p>
Both tags occur independently throughout the HTML and may themselves contain other tags and be nested inside other tags.
I want to place a marker tag around every TagA whenever it is not immediately followed by TagB so that
<p class="A"">...</p> becomes <marker><p class="A">...</p></marker>
But when TagA is followed immediately by TagB, I want the marker Tag to surround them both
so that
<p class="A">...</p><div class="B">...</div>
becomes
<marker><div class="A">...</div><p class="B">...</p></marker>
I can see how to select TagA and enclose it with the marker tag, but when it is followed by TagB I do not know if or how the BeautifulSoup 'selection' can be extended to include the next sibling.
Any help appreciated.
BeautifulSoup does have a "next sibling" function. Find all tags of class A and use tag.next_sibling to check whether it is TagB.
Look at the docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways
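A minimal sketch of that approach, assuming TagA is <div class="A"> and TagB is <p class="B"> as defined in the question:

from bs4 import BeautifulSoup, NavigableString

html = '<div class="A">first</div><p class="B">follows A</p><div class="A">alone</div>'
soup = BeautifulSoup(html, "html.parser")

for tag_a in soup.find_all("div", class_="A"):
    sibling = tag_a.next_sibling
    # skip whitespace-only text nodes between the two tags
    while isinstance(sibling, NavigableString) and not sibling.strip():
        sibling = sibling.next_sibling
    is_b = getattr(sibling, "name", None) == "p" and "B" in (sibling.get("class") or [])
    print(tag_a.get_text(), "-> followed by TagB" if is_b else "-> not followed by TagB")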
I think I was going about this the wrong way by trying to extend the 'selection' from one tag to the following one. Instead I found that the following code, which inserts the outer 'Marker' tag and then inserts the A and B tags into it, does the trick.
I am pretty new to Python so I would appreciate advice regarding improvements or snags with the following.
def isTagB(tag):
    # If tag is <p class = "B"> return true
    # if not - or tag is just a string - return false
    try:
        return tag.name == 'p'  # has_key('p') and tag.has_key('B')
    except:
        return False

from bs4 import BeautifulSoup
soup = BeautifulSoup("""<div class = "A"><p><i>more content</i></p></div><div class = "A"><p><i>hello content</i></p></div><p class="B">da <i>de</i> da </p><div class = "fred">not content</div>""")

for TagA in soup.find_all("div", "A"):
    Marker = soup.new_tag('Marker')
    nexttag = TagA.next_sibling
    # skip over white space
    while str(nexttag).isspace():
        nexttag = nexttag.next_sibling
    if isTagB(nexttag):
        TagA.replaceWith(Marker)  # Put it where the A element is
        Marker.insert(1, TagA)
        Marker.insert(2, nexttag)
    else:
        # print("FALSE", nexttag)
        TagA.replaceWith(Marker)  # Put it where the A element is
        Marker.insert(1, TagA)

print (soup)
import urllib
from BeautifulSoup import BeautifulSoup
html = urllib.urlopen("http://ursite.com") #gives html response
soup = BeautifulSoup(html)
all_div = soup.findAll("div",attrs={}) #use attrs as dict for attribute parsing
#exa- attrs={'class':"class","id":"1234"}
single_div = all_div[0]
#to find p tag inside single_div
p_tag_obj = single_div.find("p")
You can use obj.findNext(), obj.findAllNext(), obj.findAllPrevious() and obj.findPrevious().
To get an attribute you can use obj.get("href"), obj.get("title"), etc.
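For example, a small hypothetical usage of those calls on the objects defined above (it assumes the page really contains the div and the <p> we searched for):

next_p = p_tag_obj.findNext("p")      # the <p> that follows the one we found, if any
for a in single_div.findAll("a"):     # all anchors inside the first div
    print(a.get("href"))              # .get() returns None when the attribute is missing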