Python Web-scraping, category extraction

I have the code below to extract quote text and author using BeautifulSoup. That part works, but each quote also falls under a category (e.g. KINDNESS in the HTML below, at the end of the string). How can I get the category along with the quote text and author?
table = soup.findAll('img')
for image in table:
    alt_table = image.attrs['alt'].split('#')
    # print(alt_table[0]) # Quote text extracted
    # print(len(alt_table))
    # To prevent index error if author is not there
    if len(alt_table)>1:
        quote = alt_table[0]
        author = alt_table[1]
        author = (alt_table[1]).replace('<Author:' , '').replace('>', '') #Format author label
        print('Quote: %s \nAuthor: %s' %(quote, author))
    else:
        quote = alt_table[0]
        print('Quote: %s' %(quote))
html example
</div><div class="col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top">
<img alt="Extend yourself in kindness to other human beings wherever you can. #<Author:0x00007f7746c65b78>" class="margin-10px-bottom shadow" height="310" src="https://assets.passiton.com/quotes/quote_artwork/8165/medium/20201208_tuesday_quote_alternate.jpg?1607102963" width="310"/>
<h5 class="value_on_red">KINDNESS</h5>

Since you are starting from the image tag, use find_next to reach the following tag and .text to read its value.
table = soup.findAll('img')
for image in table:
    alt_table = image.attrs['alt'].split('#')
    # print(alt_table[0]) # Quote text extracted
    # print(len(alt_table))
    # To prevent index error if author is not there
    if len(alt_table)>1:
        quote = alt_table[0]
        author = alt_table[1]
        author = (alt_table[1]).replace('<Author:' , '').replace('>', '') #Format author label
        print('Quote: %s \nAuthor: %s' %(quote, author))
        print(image.find_next('h5', class_='value_on_red').find_next('a').text)
    else:
        quote = alt_table[0]
        print('Quote: %s' %(quote))
        print(image.find_next('h5', class_='value_on_red').find_next('a').text)
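
The HTML example in the question shows the category text directly inside the h5, with no a element, so on that snippet it may be enough to read the h5 itself. The following is a minimal sketch under that assumption; the wrapping div's closing tag and the helper name extract_quotes are illustrative and not part of the original page or code.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (closing </div> added for illustration).
html = '''<div class="col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top">
<img alt="Extend yourself in kindness to other human beings wherever you can. #<Author:0x00007f7746c65b78>" class="margin-10px-bottom shadow" height="310" src="https://assets.passiton.com/quotes/quote_artwork/8165/medium/20201208_tuesday_quote_alternate.jpg?1607102963" width="310"/>
<h5 class="value_on_red">KINDNESS</h5>
</div>'''


def extract_quotes(soup):
    """Hypothetical helper: return (quote, author, category) for every <img> with an alt text."""
    results = []
    for image in soup.find_all('img', alt=True):
        # partition() never raises, even when there is no '#' in the alt text
        quote, _, author = image['alt'].partition('#')
        author = author.replace('<Author:', '').replace('>', '') or None
        # the category heading is the next matching <h5> after the image
        heading = image.find_next('h5', class_='value_on_red')
        category = heading.get_text(strip=True) if heading else None
        results.append((quote.strip(), author, category))
    return results


soup = BeautifulSoup(html, 'html.parser')
rows = extract_quotes(soup)
for quote, author, category in rows:
    print('Quote: %s\nAuthor: %s\nCategory: %s' % (quote, author, category))
```

If the live page wraps the category in a link inside the h5, get_text() still returns the link text, so the same call should cover both layouts.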

Related

Removing HTML Tag removes additional words

I am working on a data-cleaning problem where I need to remove HTML tags from strings while keeping the text content.
Example text for cleanup is given below. I tried removing the "pre" tags, but I get no data back.
x = '<pre>i am </pre><p> siddharth </p><pre> sid </pre>'
re.sub(r'<pre>.*<\/pre>', '', x)
If I add back the "\n" that I deleted before, I get the output shown below
x = '<pre>i am </pre>\n<p> siddharth </p><pre> sid </pre>'
re.sub(r'<pre>.*<\/pre>', '', x)
output - '\n siddharth '
A string from dataset for cleanup is given below for reference
'<p>I\'ve written a database generation script in SQL and want to execute it in my Adobe AIR application: </p> <pre> <code> Create Table tRole ( roleID integer Primary Key ,roleName varchar(40));Create Table tFile ( fileID integer Primary Key ,fileName varchar(50) ,fileDescription varchar(500) ,thumbnailID integer ,fileFormatID integer ,categoryID integer ,isFavorite boolean ,dateAdded date ,globalAccessCount integer ,lastAccessTime date ,downloadComplete boolean ,isNew boolean ,isSpotlight boolean ,duration varchar(30));Create Table tCategory ( categoryID integer Primary Key ,categoryName varchar(50) ,parent_categoryID integer);... </code> </pre> <p> I execute this in Adobe AIR using the following methods: </p> <pre> <code> public static function RunSqlFromFile(fileName:String):void { var file:File = File.applicationDirectory.resolvePath(fileName); var stream:FileStream = new FileStream(); stream.open(file, FileMode.READ) var strSql:String = stream.readUTFBytes(stream.bytesAvailable); NonQuery(strSql);}public static function NonQuery(strSQL:String):void{ var sqlConnection:SQLConnection = new SQLConnection(); sqlConnection.open(File.applicationStorageDirectory.resolvePath(DBPATH); var sqlStatement:SQLStatement = new SQLStatement(); sqlStatement.text = strSQL; sqlStatement.sqlConnection = sqlConnection; try { sqlStatement.execute(); } catch (error:SQLError) { Alert.show(error.toString()); }} </code> </pre> <p> No errors are generated, however only <code>tRole</code> exists. It seems that it only looks at the first query (up to the semicolon- if I remove it, the query fails). Is there a way to call multiple queries in one statement?</p>'
Detailed code for cleanup is given below. The array "arr" contains all the text for which cleanup is needed.
arr = [i.replace('\n','') for i in arr]
arr = [re.sub(r'<pre>.*<\/pre>', '', i) for i in arr]
arr = [re.sub(r'<code>.*<\/code>', '', i) for i in arr]
arr = [re.sub('<[^<]+?>', '', i) for i in arr]
Please let me know if anyone has run into the same issue and found a way around it.

Since the question is tagged BeautifulSoup: to remove a specific tag and keep its content, you can use .unwrap():
from bs4 import BeautifulSoup
html = '''<pre>i am </pre><p> siddharth </p><pre> sid </pre>'''
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('pre'):
    tag.unwrap()
soup
->
i am <p> siddharth </p> sid
Or to extract texts only use .get_text():
from bs4 import BeautifulSoup
html = '''<pre>i am </pre>\n<p> siddharth </p><pre> sid </pre>'''
soup = BeautifulSoup(html, 'html.parser')
soup.get_text(' ', strip=True)
->
i am siddharth sid

Another method, using .extract():
from bs4 import BeautifulSoup
html_doc = '<pre>i am </pre><p> siddharth </p><pre> sid </pre>'
soup = BeautifulSoup(html_doc, 'html.parser')
# remove <pre> and <code> tags:
for tag in soup.select('pre, code'):
    tag.extract()
# get remaining text:
text = soup.get_text(strip=True, separator=' ')
print(text)
Prints:
siddharth
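
As for why the regex in the question returns nothing: `.*` is greedy, so it matches from the first `<pre>` to the last `</pre>` in the string, and `.` does not cross a newline unless re.DOTALL is set (which is why adding "\n" changed the result). A small sketch of the difference, assuming the goal is to drop each pre block together with its content:

```python
import re

x = '<pre>i am </pre><p> siddharth </p><pre> sid </pre>'

# Greedy: '.*' runs to the LAST '</pre>', so the whole string is matched and removed.
greedy = re.sub(r'<pre>.*</pre>', '', x)

# Lazy: '.*?' stops at the FIRST '</pre>'; DOTALL also lets '.' cross newlines.
lazy = re.sub(r'<pre>.*?</pre>', '', x, flags=re.DOTALL)

# Strip the remaining tags, as the last step of the question's pipeline does.
text_only = re.sub(r'<[^<]+?>', '', lazy)

print(repr(greedy))
print(repr(lazy))
print(repr(text_only))
```

With the lazy quantifier, the newline-stripping step from the question is no longer needed to get a consistent result.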

How can i get only name and contact number from div?

I'm trying to get the name and contact number from a div that can hold three spans, but the problem is that sometimes the div has only one span, sometimes two, and sometimes three.
First span has name.
Second span has other data.
Third span has contact number
Here is HTML
<div class="ds-body-small" id="yui_3_18_1_1_1554645615890_3864">
<span class="listing-field" id="yui_3_18_1_1_1554645615890_3863">beth
budinich</span>
<span class="listing-field"><a href="http://Www.redfin.com"
target="_blank">See listing website</a></span>
<span class="listing-field" id="yui_3_18_1_1_1554645615890_4443">(206)
793-8336</span>
</div>
Here is my Code
try:
    name= browser.find_element_by_xpath("//span[@class='listing-field'][1]")
    name = name.text.strip()
    print("name : " + name)
except:
    print("Name are missing")
    name = "N/A"
try:
    contact_info= browser.find_element_by_xpath("//span[@class='listing-field'][3]")
    contact_info = contact_info.text.strip()
    print("contact info : " + contact_info)
except:
    print("contact_info are missing")
    days = "N/A"
My code is not giving me the correct result. Can anyone suggest the best possible solution? Thanks

You can iterate through the contacts and check whether each one has a child a element or matches a phone number pattern:
import re
contacts = browser.find_elements_by_css_selector("span.listing-field")
contact_name = []
contact_phone = "N/A"
contact_web = "N/A"
for i in range(0, len(contacts)):
    if len(contacts[i].find_elements_by_tag_name("a")) > 0:
        contact_web = contacts[i].find_element_by_tag_name("a").get_attribute("href")
    elif re.search("\\(\\d+\\)\\s+\\d+-\\d+", contacts[i].text):
        contact_phone = contacts[i].text
    else:
        contact_name.append(contacts[i].text)
contact_name = ", ".join(contact_name) if len(contact_name) > 0 else "N/A"
Output:
contact_name: ['Kevin Howard', 'Howard enterprise']
contact_phone: '(206) 334-8414'
The page has a CAPTCHA. It is better to scrape it with requests, since all the information is provided in JSON format.

#sudharsan
# April 07 2019
from bs4 import BeautifulSoup
text ='''<div class="ds-body-small" id="yui_3_18_1_1_1554645615890_3864">
<span class="listing-field" id="yui_3_18_1_1_1554645615890_3863">beth
budinich</span>
<span class="listing-field"><a href="http://Www.redfin.com"
target="_blank">See listing website</a></span>
<span class="listing-field" id="yui_3_18_1_1_1554645615890_4443">(206)
793-8336</span>
</div>'''
# the given sample html is stored as input in the variable called "text"
soup = BeautifulSoup(text,"html.parser")
# find_all (not find) returns every span with class name "listing-field" as a list
main = soup.find_all(class_="listing-field")
print(main[0].text)
# it will print the first span element
print(main[-1].text)
# it will print the last span element
#Thank you
# if you like the code "Vote for it"
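
Taking the first and last span only works when the phone number is present. A sketch that instead classifies each span by content, run on the static HTML from the question, might look like this; the helper name parse_contact and the US-style phone pattern are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

# Sample markup from the question (generated id attributes dropped for brevity).
html = '''<div class="ds-body-small">
<span class="listing-field">beth
budinich</span>
<span class="listing-field"><a href="http://Www.redfin.com"
target="_blank">See listing website</a></span>
<span class="listing-field">(206)
793-8336</span>
</div>'''

# Assumed phone layout: "(206) 793-8336"
PHONE_RE = re.compile(r'\(\d{3}\)\s*\d{3}-\d{4}')


def parse_contact(markup):
    """Hypothetical helper: return (name, phone), using "N/A" for anything missing."""
    soup = BeautifulSoup(markup, 'html.parser')
    name, phone = "N/A", "N/A"
    for span in soup.select('span.listing-field'):
        if span.find('a'):
            continue  # the website link span is neither a name nor a phone
        text = ' '.join(span.get_text().split())  # collapse the line breaks
        if PHONE_RE.search(text):
            phone = text
        elif name == "N/A":
            name = text
    return name, phone


result = parse_contact(html)
only_name = parse_contact('<div><span class="listing-field">beth budinich</span></div>')
print(result)
print(only_name)
```

Because each span is judged by what it contains rather than by its position, the same function copes with one, two, or three spans.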

Extracting data from an inconsistent HTML page using BeautifulSoup4 and Python

I’m trying to extract data from this webpage and I'm having some trouble due to inconsistencies within the page's HTML formatting. I have a list of OGAP IDs and I want to extract the Gene Name and any literature information (PMID #) for each OGAP ID I iterate through. Thanks to other questions on here, and the BeautifulSoup documentation, I've been able to consistently get the gene name for each ID, but I'm having trouble with the literature part. Here are a couple of search terms that highlight the inconsistencies.
HTML sample that works
Search term: OG00131
<tr>
<td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation:
<br> PMID:
20068230
[CAD, ETD MS/MS]; <br>
<br>
</td>
</tr>
HTML sample that doesn't work
Search term: OG00020
<td align="top" bgcolor="#FBFFCC">
<div class="STYLE28">Literature describing O-GlcNAcylation: </div>
<div class="STYLE28">
<div class="STYLE28">PMID:
16408927
[Azide-tag, nano-HPLC/tandem MS]
</div>
<br>
Site has not yet been determined. Use
OGlcNAcScan
to predict the O-GlcNAc site. </div>
</td>
Here's the code I have so far
import urllib2
from bs4 import BeautifulSoup
#define list of genes
#initialize variables
gene_list = []
literature = []
# Test list
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]
for i in range(len(gene_listID)):
    print gene_listID[i]
    # Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided
    dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i]
    # Opens the URL as a page
    page = urllib2.urlopen(dbOGAP)
    # Reads the page and parses it through "lxml" format
    soup = BeautifulSoup(page, "lxml")
    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
    print gene_name[1:]
    gene_list.append(gene_name[1:])
    # PubMed IDs are located near the <td> tag with the term "Data and Source"
    pmid = soup.find("span", text="Data and Source")
    # Based on inspection of the website, need to move up to the parent <td> tag
    pmid_p = pmid.parent
    # Then we move to the next <td> tag, denoted as sibling (since they share parent <tr> (Table row) tag)
    pmid_s = pmid_p.next_sibling
    #for child in pmid_s.descendants:
    #    print child
    # Now we search down the tree to find the next table data (<td>) tag
    pmid_c = pmid_s.find("td")
    temp_lit = []
    # Next we print the text of the data
    #print pmid_c.text
    if "No literature is available" in pmid_c.text:
        temp_lit.append("No literature is available")
        print "Not available"
    else:
        # and then print out a list of urls for each pubmed ID we have
        print "The following is available"
        for link in pmid_c.find_all('a'):
            # the <a> tag includes more than just the link address.
            # for each <a> tag found, print the address (href attribute) and extra bits
            # link.string provides the string that appears to be hyperlinked.
            # In this case, it is the pubmedID
            print link.string
            temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))
    literature.append(temp_lit)
    print "\n"
So it seems the element is what is throwing the code for a loop. Is there a way to search for any element with the text "PMID" and return the text that comes after it (and url if there is a PMID number)? If not, would I just want to check each child, looking for the text I'm interested in?
I'm using Python 2.7.10

import requests
from bs4 import BeautifulSoup
import re
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID)
for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    regex = re.compile(r'http://www.ncbi.nlm.nih.gov/pubmed/\d+')
    a_tag = soup.find('a', href=regex)
    # guard against pages that have no PubMed link at all
    has_pmid = a_tag is not None and 'PMID' in a_tag.previous_element
    if has_pmid:
        print(a_tag.text, a_tag.next_sibling, a_tag.get("href"))
    else:
        print("Not available")
out:
18984734 [GalNAz-Biotin tagging, CAD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/18984734
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS]; http://www.ncbi.nlm.nih.gov/pubmed/16408927
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS] http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation
Find the first a tag whose href matches the target URL, which ends in digits, then check whether 'PMID' is in its previous element.
The page is very inconsistent and this took many attempts; I hope it helps.
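
On the question of searching for the text "PMID" regardless of the surrounding markup: one option is to flatten the fragment with get_text() and match the pattern in plain text, so it no longer matters whether the entry sits after a br or inside nested divs. This is a sketch run on the two HTML samples exactly as pasted in the question (the live pages may differ); the helper name find_pmids is an assumption.

```python
import re
from bs4 import BeautifulSoup

# The two samples from the question: one laid out with <br>, one with nested <div>s.
sample_br = '''<tr>
<td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation:
<br> PMID:
20068230
[CAD, ETD MS/MS]; <br>
<br>
</td>
</tr>'''

sample_div = '''<td align="top" bgcolor="#FBFFCC">
<div class="STYLE28">Literature describing O-GlcNAcylation: </div>
<div class="STYLE28">
<div class="STYLE28">PMID:
16408927
[Azide-tag, nano-HPLC/tandem MS]
</div>
<br>
Site has not yet been determined. Use
OGlcNAcScan
to predict the O-GlcNAc site. </div>
</td>'''

# "PMID:", then the number, then the bracketed method description.
PMID_RE = re.compile(r'PMID:\s*(\d+)\s*\[([^\]]*)\]')


def find_pmids(markup):
    """Hypothetical helper: return a list of (pmid, method) pairs found in the markup."""
    text = BeautifulSoup(markup, 'html.parser').get_text()
    return PMID_RE.findall(text)


for sample in (sample_br, sample_div):
    for pmid, method in find_pmids(sample):
        # URL prefix taken from the links printed in the answer's output
        print(pmid, method, 'http://www.ncbi.nlm.nih.gov/pubmed/' + pmid)
```

An empty list from find_pmids corresponds to the "Not available" case.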

Python extracting data from HTML using split

A certain page retrieved from a URL, has the following syntax :
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
I want to extract the data in Name, Surname etc. (I have to repeat this task for many pages)
For that I tried using the following code:
import urllib2
url = 'http://www.my.lk/details.aspx?view=1&id=%2031'
source = urllib2.urlopen(url)
start = '<p><strong>Given Name:</strong>'
end = '<strong>Surname'
givenName=(source.read().split(start))[1].split(end)[0]
start = 'Surname: </strong>'
end = 'Former/AKA Name'
surname=(source.read().split(start))[1].split(end)[0]
print(givenName)
print(surname)
When I'm calling the source.read.split method only one time it works fine. But when I use it twice it gives a list index out of range error.
Can someone suggest a solution?

You can use BeautifulSoup to parse the HTML string.
Here is some code you might try.
It uses BeautifulSoup to get the text rendered by the HTML, then parses that string to extract the data.
from bs4 import BeautifulSoup as bs
dic = {}
data = \
"""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""
soup = bs(data, 'html.parser')
# Get the text on the html through BeautifulSoup
text = soup.get_text()
# parsing the text
lines = text.splitlines()
for line in lines:
    # check if line has ':', if it doesn't, move to the next line
    if line.find(':') == -1:
        continue
    # split the string at ':'
    parts = line.split(':')
    # You can add more tests here like
    # if len(parts) != 2:
    #     continue
    # stripping whitespace
    for i in range(len(parts)):
        parts[i] = parts[i].strip()
    # adding the values to a dictionary
    dic[parts[0]] = parts[1]
    # printing the data after processing
    print('%16s %20s' % (parts[0],parts[1]))
A tip:
If you are going to use BeautifulSoup to parse HTML,
it helps when the elements you want carry identifying attributes such as class=input or id=10, that is, when all tags of the same kind share the same id or class.
Update
In response to your comment, see the code below.
It applies the tip above, which makes life (and coding) a lot easier.
from bs4 import BeautifulSoup as bs
c_addr = []
id_addr = []
data = \
"""
<h2>Primary Location</h2>
<div class="address" id="10">
<p>
No. 4<br>
Private Drive,<br>
Sri Lanka ON K7L LK <br>
"""
soup = bs(data, 'html.parser')
for i in soup.find_all('div'):
    # get data using "class" attribute
    addr = ""
    if i.get("class")[0] == u'address': # unicode string
        text = i.get_text()
        for line in text.splitlines(): # line-wise
            line = line.strip() # remove whitespace
            addr += line # add to address string
        c_addr.append(addr)
    # get data using "id" attribute
    addr = ""
    if int(i.get("id")) == 10: # integer
        text = i.get_text()
        # same processing as above
        for line in text.splitlines():
            line = line.strip()
            addr += line
        id_addr.append(addr)
print("id_addr")
print(id_addr)
print("c_addr")
print(c_addr)

You are calling read() twice. That is the problem. Instead of doing that you want to call read once, store the data in a variable, and use that variable where you were calling read(). Something like this:
fetched_data = source.read()
Then later...
givenName=(fetched_data.split(start))[1].split(end)[0]
and...
surname=(fetched_data.split(start))[1].split(end)[0]
That should work. Your code didn't work because read() consumes the content the first time and leaves the stream positioned at the end. The next call to read() has nothing left to return, so it gives back an empty string; splitting that yields a single-element list, and indexing it with [1] raises the "list index out of range" error.
Check out the docs for urllib2 and methods on file objects
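
The read-once behaviour can be reproduced without a network connection. In this sketch an io.BytesIO object stands in for the response returned by urlopen (an assumption made so the example is self-contained), and the markup is a shortened version of the page from the question.

```python
import io

# Stand-in for the HTTP response: a file-like object that can only be read forward.
page = b"<p><strong>Name:</strong> Pasan <br/><strong>Surname: </strong> Wijesingher <br/></p>"
source = io.BytesIO(page)

first = source.read()   # the whole document
second = source.read()  # nothing left: the stream is already at the end

# Read once, keep the result, and reuse it for every field.
html = first.decode("utf-8")
name = html.split("<strong>Name:</strong>")[1].split("<br/>")[0].strip()
surname = html.split("<strong>Surname: </strong>")[1].split("<br/>")[0].strip()

print(repr(second))
print(name, surname)
```

Splitting the empty second result would give a one-element list, which is exactly where the index error in the question comes from.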

If you want to be quick, regexes are more useful for this kind of task. It can be a harsh learning curve at first but regexes will save your butt one day.
Try this code:
import re
# read the whole document into memory
full_source = source.read()
# flags go to re.compile(); the second argument of a compiled pattern's search() is a start position
NAME_RE = re.compile('Name:.+?>(.*?)<', re.MULTILINE)
SURNAME_RE = re.compile('Surname:.+?>(.*?)<', re.MULTILINE)
name = NAME_RE.search(full_source).group(1).strip()
surname = SURNAME_RE.search(full_source).group(1).strip()
See here for more info on how to use regexes in python.
A more comprehensive solution would involve parsing the HTML (using a lib like BeautifulSoup), but that can be overkill depending on your particular application.
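
For comparison, the parsing route mentioned above could be sketched as follows: each strong element is treated as a label, and the text node right after it as the value. The variable name fields is illustrative, and the markup is the sample from the question.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>"""

soup = BeautifulSoup(html, "html.parser")

fields = {}
for label in soup.find_all("strong"):
    # "Surname: " -> "Surname"
    key = label.get_text(strip=True).rstrip(":")
    # the value is the bare text node that follows the <strong> tag
    value = label.next_sibling
    fields[key] = value.strip() if isinstance(value, NavigableString) else ""

print(fields)
```

This picks up every labelled field in one pass, so new labels on other pages need no extra patterns.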

You can use HTQL:
page="""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""
import htql
print(htql.query(page, "<p>.<strong> {a=:tx; b=:xx} "))
# [('Name:', ' Pasan '),
# ('Surname: ', ' Wijesingher '),
# ('Former/AKA Name:', ' No Former/AKA Name '),
# ('Gender:', ' Male '),
# ('Language Fluency:', ' ENGLISH ')
# ]

How to retrieve the storyline paragraph with imdbpy?

So far, I haven't figured out how to retrieve the short description paragraph from imdb with imdbpy.
I can retrieve a very (very) long plot this way though :
from imdb import IMDb
ia = IMDb()
movie = ia.search_movie("brazil")
movie = movie[0]
movie = ia.get_movie(movie.movieID)
plot = movie.get('plot', [''])[0]
plot = plot.split('::')[0]
The last line removes the submitter username.
In the HTML source, the block I'm looking for is markedup as <p itemprop="description">.
Any ideas?
Thanks!

description = movie.get('plot outline')
This code will get a list of the type of information available for a movie:
movie.keys()
