Web Scraping a wikipedia page - python

In some Wikipedia pages, after the title of the article (appearing in bold), there is some text inside parentheses used to explain the pronunciation and phonetics of the words in the title. For example, on this page, after the bold title diglossia in the <p>, there is an open parenthesis. In order to find the corresponding close parenthesis, you would have to iterate through the text nodes one by one, which is simple. What I'm trying to do is find the very next href link after that and store it.
The issue here is that (AFAIK) there isn't a way to uniquely identify the text node with the close parenthesis and then get the following href. Is there any straightforward (not convoluted) way to get the first link outside of the initial parentheses?
EDIT
In the case of the link provided here, the href to be stored should be https://en.wikipedia.org/wiki/Dialects, since that is the first link outside of the parentheses.

Is this what you want?
import requests
from BeautifulSoup import BeautifulSoup

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
print parsed_html.body.findAll('p')[0].findAll('a')[0]
This gives:
<a href="/wiki/Linguistics">linguistics</a>
If you want to extract just the href, you can use this:
parsed_html.body.findAll('p')[0].findAll('a')[0].attrs[0][1]
UPDATE
It seems you want the href after the parentheses, not the one before. I have written a script for it. Try this:
import requests
from BeautifulSoup import BeautifulSoup

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
temp = parsed_html.body.findAll('p')[0]
start_count = 0
started = False
found = False
while temp.next and found is False:
    temp = temp.next
    if '(' in temp:
        start_count += 1
        if started is False:
            started = True
    if ')' in temp and started and start_count > 1:
        start_count -= 1
    elif ')' in temp and started and start_count == 1:
        found = True
print temp.findNext('a').attrs[0][1]
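For anyone reading this later, here is a minimal sketch of the same parenthesis-counting idea with the maintained bs4 package on Python 3 (my own sketch; it assumes the first <p> is the lead paragraph, which may need adjusting for current Wikipedia markup):

import requests
from bs4 import BeautifulSoup, NavigableString

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia')
soup = BeautifulSoup(rs.text, 'html.parser')

depth = 0
seen_open = False
first_link = None
# Walk every descendant of the first paragraph in document order,
# tracking parenthesis depth in the text nodes.
for node in soup.body.find('p').descendants:
    if isinstance(node, NavigableString):
        if '(' in node:
            seen_open = True
        depth += node.count('(') - node.count(')')
    elif node.name == 'a' and seen_open and depth == 0:
        # first link after the opening parenthesis has been closed
        first_link = node.get('href')
        break
print(first_link)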

Related

Detect word and page of occurrence in Word Document

I am trying to detect specific words (with a regex pattern that I already have) in a Word document. I do not only want to detect the word but also to know on which page it appears; I am thinking of something like a list of tuples: [(WordA, 10), (WordB, 4), ...]
I am able to extract the text from the Word document and detect all the words that match the regex pattern, but I am not able to tell on which page a word appears. Also, I want to detect all the occurrences, regardless of whether they appear in the header, body or footnotes.
Here is my regex pattern:
pattern = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')
Extraction of text:
import docx2txt
result = docx2txt.process("Word_Document.docx")
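Combined, this finds every match in the flat text, but with no page information, which is exactly the problem:

import re
import docx2txt

pattern = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')
result = docx2txt.process("Word_Document.docx")
all_matches = pattern.findall(result)  # matches, but no page numbers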
Thank you in advance,
I just wanted to say thank you to those who tried to answer this question. I found two solutions:
With Word documents, splitting them into one document per page with Aspose: https://products.aspose.cloud/words/python/split/
Converting the Word document into a PDF and then creating one PDF per page with PyPDF2 or another library
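For the second option, a minimal sketch of the per-page split with PyPDF2 3.x (the filenames are placeholders; the regex search would then run on each page file's text):

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("Word_Document.pdf")  # the converted document (placeholder name)
for page_num, page in enumerate(reader.pages, start=1):
    writer = PdfWriter()
    writer.add_page(page)
    # one output file per page, so the page number is known when searching
    with open(f"page_{page_num}.pdf", "wb") as out:
        writer.write(out)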
EDIT
Ok, after a while of trying to figure this out, I managed to get this:
import re
import docx2txt.docx2txt as docx2txt

page_contents = []

def xml2text(xml):
    text = u''
    root = docx2txt.ET.fromstring(xml)
    start = 0
    for child in root.iter():
        if child.tag == docx2txt.qn('w:t'):
            t_text = child.text
            text += t_text if t_text is not None else ''
        elif child.tag == docx2txt.qn('w:tab'):
            text += '\t'
        elif child.tag in (docx2txt.qn('w:br'), docx2txt.qn('w:cr')):
            text += '\n'
        elif child.tag == docx2txt.qn("w:p"):
            text += '\n\n'
        elif child.tag == docx2txt.qn('w:lastRenderedPageBreak'):
            # a rendered page break: everything since `start` belongs to the current page
            end = len(text) + 1
            page_contents.append(text[start:end])
            start = len(text)
    page_contents.append(text[start:len(text) + 1])  # the last page
    return text

docx2txt.xml2text = xml2text  # monkey-patch the module's parser
docx2txt.process('test_file.docx')  # use your filename

matches = []
pattern = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')
for page_num, page_content in enumerate(page_contents, start=1):
    # do regex search on each page's text
    all_matches = pattern.findall(page_content)
    if all_matches:
        for match in all_matches:
            matches.append((match, page_num))
print(matches)
It monkey-patches the module's xml2text parser so that, in addition to extracting the text, it detects each page break and appends that page's contents to the module-level page_contents list; the index + 1 of each entry is then the page number. It relies on the 'lastRenderedPageBreak' tag; the slight caution is to re-save the file if you have edited it, so that the placement of these tags also gets updated.

Flatten HTML code, with tree structure delimiters

I have some raw HTML scraped from a random website, possibly messy, with some scripts, self-closing tags, etc. Example:
ex="<!DOCTYPE html PUBLIC \\\n><html lang=\\'en-US\\'><head><meta http-equiv=\\'Content-Type\\'/><title>Some text</title></head><body><h1>Some other text</h1><p><span style='color:red'>My</span> first paragraph.</p></body></html>"
I want to return the HTML DOM without any strings, attributes or such stuff, only the tag structure, in the format of a string showing the relation between parents, children and siblings; this would be my expected output (though the use of brackets is a personal choice):
'[html[head[meta, title], body[h1, p[span]]]]'
So far I have tried using BeautifulSoup (this answer was helpful). I figured out I should split the work into two steps:
- extract the tag "skeleton" of the HTML DOM, emptying everything like strings, attributes, and stuff before the <html>.
- return the flat HTML DOM, but structured with tree-like delimiters indicating children and siblings, such as brackets.
I posted the code as a self-answer.
You can use recursion. The name attribute gives the name of the tag. You can check whether the type is bs4.element.Tag to confirm that an element is a tag.
import bs4

ex = "<!DOCTYPE html PUBLIC \\\n><html lang=\\'en-US\\'><head><meta http-equiv=\\'Content-Type\\'/><title>Some text</title></head><body><h1>Some other text</h1><p><span style='color:red'>My</span> first paragraph.</p></body></html>"
soup = bs4.BeautifulSoup(ex, 'html.parser')
output = ''  # renamed from str to avoid shadowing the built-in

def recursive_child_seach(tag):
    global output
    output += tag.name
    child_tag_list = [x for x in tag.children if type(x) == bs4.element.Tag]
    if len(child_tag_list) > 0:
        output += '['
    for i, child in enumerate(child_tag_list):
        recursive_child_seach(child)
        if not i == len(child_tag_list) - 1:  # if not last child
            output += ', '
    if len(child_tag_list) > 0:
        output += ']'
    return

recursive_child_seach(soup.find())
print(output)
# html[head[meta, title], body[h1, p[span]]]
print('[' + output + ']')
# [html[head[meta, title], body[h1, p[span]]]]
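A small variant of the same recursion (my own sketch, reusing soup from above) that returns the string instead of building it in a global:

def tag_structure(tag):
    children = [c for c in tag.children if isinstance(c, bs4.element.Tag)]
    # join the children's structures with commas, then wrap in brackets if any exist
    inner = ', '.join(tag_structure(c) for c in children)
    return tag.name + ('[' + inner + ']' if children else '')

print('[' + tag_structure(soup.find()) + ']')
# [html[head[meta, title], body[h1, p[span]]]]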
I post here my first solution, which is still a bit messy and uses a lot of regex. The first function returns the emptied DOM structure as a raw string; the second function modifies the string to add the delimiters.
import re
from bs4 import BeautifulSoup as soup  # the functions below use soup as the parser class

def clear_tags(htmlstring, remove_scripts=False):
    htmlstring = re.sub("^.*?(<html)", r"\1", htmlstring, flags=re.DOTALL)  # drop everything before <html>
    finishyoursoup = soup(htmlstring, 'html.parser')
    for tag in finishyoursoup.find_all():
        tag.attrs = {}  # strip all attributes
        for sub in tag.contents:
            if sub.string:
                sub.string.replace_with('')  # empty the text nodes
    if remove_scripts:
        [tag.extract() for tag in finishyoursoup.find_all(['script', 'noscript'])]
    return str(finishyoursoup)

clear_tags(ex)
# '<html><head><meta/><title></title></head><body><h1></h1><p><span></span></p></body></html>'
def flattened_html(htmlstring):
    squeletton = clear_tags(htmlstring)
    step1 = re.sub(r"<([^/]*?)>", r"[\1", squeletton)  # replace beginning of tag
    step2 = re.sub(r"</(.*?)>", r"]", step1)           # replace end of tag
    step3 = re.sub(r"<(.*?)/>", r"[\1]", step2)        # deal with self-closing tags
    step4 = re.sub(r"\]\[", ", ", step3)               # join sibling tags with a comma
    return step4

flattened_html(ex)
# '[html[head[meta, title], body[h1, p[span]]]]'

Scraping returning only one value

I wanted to scrape something as my first program, just to learn the basics really, but I'm having trouble showing more than one result.
The premise is to go to a forum (http://blackhatworld.com), scrape all thread titles and compare them with a string. If a title contains the word "free" it will be printed, otherwise it won't.
Here's the current code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')
n = 0
for x in range(len(threadtitles)):
    test = list(threadtitles)[n]
    test2 = list(test)[0]
    if test2.find('free') == -1:
        n = n + 1
    else:
        print(test2)
        n = n + 1
This is the result of running the program:
https://i.gyazo.com/6cf1e135b16b04f0807963ce21b2b9be.png
As you can see, it's checking for the word "free" and it works, but it only shows the first result, while there are several more on the page.
By default, string comparison is case-sensitive (FREE != free). To solve your problem, first you need to put test2 in lowercase:
test2 = list(test)[0].lower()
To solve your problem and simplify your code, try this:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')
count = 0
for title in threadtitles:
    if "free" in title.get_text().lower():
        print(title.get_text())
    else:
        count += 1
print(count)
Bonus: print the value of each href:
for title in threadtitles:
    print(title["href"])

Python script extract data from HTML page

I'm trying to do a massive data accumulation on college basketball teams. This link: https://www.teamrankings.com/ncb/stats/ has a TON of team stats.
I have tried to write a script that scans all the desired links (all Team Stats) from this page, finds the rank of the specified team (an input), then returns the sum of that team's ranks across all links.
I graciously found this: https://gist.github.com/phillipsm/404780e419c49a5b62a8
...which is GREAT!
But I must have something wrong because I'm getting 0
Here's my code:
import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")
stat_links = []
for table_row in soup.select(".expand-section li"):
    table_cells = table_row.findAll('li')
    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)
total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeaultifulSoup(r.text)
    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank
print total_rank
Check out that link to double-check that I have the correct class specified. I have a feeling the problem might be in the first for loop, where I select an li tag and then select all li tags within that first tag, I dunno.
I don't use Python, so I'm unfamiliar with any debugging tools. If anyone wants to point me to one of those, that would be great!
First, the team stats and player stats sections are contained in a div with class 'column large-2'. The team stats are in the first occurrence. Then you can find all of the href tags within it. I've combined both in a one-liner.
teamstats = soup(class_='column large-2')[0].find_all(href=True)
The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs contained "#" (part of navigation links) so I excluded them.
links = [a['href'] for a in teamstats if a['href'] != '#']
Here is a sample of output:
links
Out[84]:
['/ncaa-basketball/stat/points-per-game',
 '/ncaa-basketball/stat/average-scoring-margin',
 '/ncaa-basketball/stat/offensive-efficiency',
 '/ncaa-basketball/stat/floor-percentage',
 '/ncaa-basketball/stat/1st-half-points-per-game',
 ...]
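Those hrefs are relative, so they need the site root prepended before requesting. A sketch of the rest of the loop under that assumption (the 'tr-table' class and the rank/team column order are taken from the question, not verified against the live page; note the rank text must also be cast to int before summing):

import requests
from bs4 import BeautifulSoup

total_rank = 0
for link in links:
    r = requests.get('https://www.teamrankings.com' + link)
    page = BeautifulSoup(r.text, 'html.parser')
    for row in page.select('table.tr-table tr'):
        cells = row.find_all('td')
        # rank is assumed to be the first column, team name the second
        if len(cells) > 1 and cells[1].text.strip() == 'Oklahoma':
            total_rank += int(cells[0].text.strip())
print(total_rank)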
I ran your code on my machine and the line table_cells = table_row.findAll('li') always returns an empty list, so stat_links ends up being an empty list; therefore the iteration over stat_links never runs and total_rank never gets incremented. I suggest you fiddle around with the way you find all the list elements.
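For what it's worth, one possible fix for that line is to take the <a> directly from each <li> instead of searching for nested <li> tags (a sketch, assuming the links are anchors inside those list items):

stat_links = []
for table_row in soup.select(".expand-section li"):
    a = table_row.find('a')  # the row itself is the <li>; look for its link
    if a and a.get('href') and a['href'] != '#':
        stat_links.append(a['href'])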

How to find the position of specific element in a list?

I have a list like this:
website = ['http://freshtutorial.com/install-xamp-ubuntu/', 'http://linuxg.net/how-to-install-xampp-on-ubuntu-13-04-12-10-12-04/', 'http://ubuntuforums.org/showthread.php?t=2149654', 'http://andyhat.co.uk/2012/07/installing-xampp-32bit-ubuntu-11-10-12-04/', 'http://askubuntu.com/questions/303068/error-with-tar-command-cannot-install-xampp-1-8-1-on-ubuntu-13-04', 'http://askubuntu.com/questions/73541/how-to-install-xampp']
I want to check whether the list contains a certain URL.
The URL would be in this format: url = 'http://freshtutorial.com'
The matching website is the 1st element of the list, so I want to print 1, not 0.
I want everything in a loop so that if no website matches the URL, it dynamically regenerates the list and searches for the website again.
I have done this up to now:
for i in website:
    if url in website:
        print "True"
I can't seem to print the position or wrap everything in a loop. Also, is it better to use a regex or the "if this in that" syntax? Thanks.
for i, v in enumerate(website, 1):
    if url in v:
        print i
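If you only need the first position, the same idea fits in one expression with next() (a Python 3 sketch):

position = next((i for i, v in enumerate(website, 1) if url in v), 0)
print(position)  # prints 1 here; 0 means no element contained the url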
Here is the complete program:
def search(li, ur):
    for u in li:
        if u.startswith(ur):
            return li.index(u) + 1
    return 0

def main():
    website = ['http://freshtutorial.com/install-xamp-ubuntu/', 'http://linuxg.net/how-to-install-xampp-on-ubuntu-13-04-12-10-12-04/', 'http://ubuntuforums.org/showthread.php?t=2149654', 'http://andyhat.co.uk/2012/07/installing-xampp-32bit-ubuntu-11-10-12-04/', 'http://askubuntu.com/questions/303068/error-with-tar-command-cannot-install-xampp-1-8-1-on-ubuntu-13-04', 'http://askubuntu.com/questions/73541/how-to-install-xampp']
    url = 'http://freshtutorial.com'
    print search(website, url)

if __name__ == '__main__':
    main()
The code:
for i in range(0, len(website)):
    current_url = website[i]
    if url in current_url:
        print i + 1
It's a simple for loop.
