Extracting a particular string from a text file in Python

Hi, I have a copy of HTML code in a text file, and I need to extract a few pieces of information from it. I managed to locate the lines of interest, but I can't find a specific pattern to extract the text. The script currently only prints the positions of the matching lines:
Position : 27
Position : 28
Position : 29
I want to extract the URL after "href" and the product name after "aria-label". How can I do that in Python?
Currently I'm using the script below to find the lines of interest:
import psycopg2

try:
    filePath = '/Users/lins/Downloads/pladiv.txt'
    with open(filePath, 'r') as file:
        print('entered loop')
        cnt = 1  # line counter
        for line in file:
            # lines of interest contain this class/attribute string
            if 'pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="' in line:
                print('Position : ' + str(cnt))
            cnt = cnt + 1
            if 'href="' in line:
                print(line)
                fields = line.split(";")
                # print(fields[0] + ' as URL')
except (Exception, psycopg2.Error) as error:
    quit()
Note: I was inserting the results into my PostgreSQL DB; that code has been removed from the sample above (hence the psycopg2 import).

You can either use a regex, like this:
import re
url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)
print(urls)
['http://example.com', 'http://example2.com']
Or you can parse it as HTML:
>>> from bs4 import BeautifulSoup as Soup
>>> html = Soup(url, 'html.parser')  # Soup(url, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://example2.com']
Either way works.
EDIT: to get the entire value of href you can use this:
url = """..."""  # the HTML snippet from the question goes here (left out of the original post)
findall = re.findall(r"(https?://[^\s]+)", url)
print(findall)
['http://www.fliegende-pillen.de/product/doppelherz-folsaeure-800-b-vitamine-tabletten.230477.html?p=466453&noCS=1&adword=google/PLA&pk_campaign=google/PLA"']
Note that the trailing quote is captured because [^\s]+ runs past the closing "; strip it if you only want the URL.
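Since the question also asks for the aria-label text, here is a minimal sketch that pulls both attributes from each line of the saved file (the path and attribute layout are assumed from the question; HTML with single-quoted attributes would need different patterns):
import re

with open('/Users/lins/Downloads/pladiv.txt', 'r') as f:
    for line in f:
        # both searches assume double-quoted attribute values
        href = re.search(r'href="([^"]+)"', line)
        label = re.search(r'aria-label="([^"]+)"', line)
        if href and label:
            print(href.group(1) + ' -> ' + label.group(1))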

Related

Unexpected Output while parsing Html file using Beautiful Soup

I have a basic HTML file that contains text inside tags as follows:
<head>
<title></title>
</head>
<body>
<div>{#One#}</div>
<span>{#Two#}</span>
<b>{#Three#}</b>
<i>four</i>
<td>{#five#}</td>
<sup>{#six#}</sup>
<sub>{#seven#}</sub>
<i>eight</i>
</body>
Using Python, I wanted to parse this file and check for a special character (e.g. '{'), and if this character is not present, return the line and the number of the line on which it is missing. So I wrote a small snippet for it.
from bs4 import BeautifulSoup
import re
import urllib2

url = "testhtml.html"
page = open(url)
soup = BeautifulSoup(page.read())
bdy = soup.find('body')
for lines in bdy:
    for num, line in enumerate(lines, 1):
        if "{" not in line:
            print num, lines
However, when I run the program I get a strange output, shown below:
1
1
1
1
1
1<i>four</i>
1
1
1
1<i>eight</i>
Instead of :
4<i>four</i>
8<i>eight</i>
What am I doing wrong here? It seems like a silly mistake.
Using find('body') returns the whole body tag with all of its contents as a single element, so iterating over bdy doesn't give what you think.
You need to use bdy.find_all(True), which returns all the tags inside body. Then change the if statement to if '{' not in tag.text:.
soup = BeautifulSoup(html, 'lxml')
bdy = soup.find('body')
for i, tag in enumerate(bdy.find_all(True), 1):
    if '{' not in tag.text:
        print(i, tag)
Output:
4 <i>four</i>
8 <i>eight</i>
from bs4 import BeautifulSoup
import re
import urllib2

url = "index.html"
page = open(url)
soup = BeautifulSoup(page.read(), "html.parser")
soup.prettify()
bdy = soup.find('body')
for num, lines in enumerate(bdy):
    for line in lines:
        if line != '\n' and '{' not in line:
            print num, lines
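If you want the line numbers as they appear in the source file rather than a tag's position in the parse tree, newer versions of Beautiful Soup record them. A small sketch, assuming bs4 4.8+ (which populates tag.sourceline when built with html.parser):
from bs4 import BeautifulSoup

html = open('testhtml.html').read()
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.body.find_all(True):
    if '{' not in tag.text:
        # sourceline is the line of the tag's opening in the original file
        print(tag.sourceline, tag)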

How to rectify a TypeError in Python? Beautiful Soup string-to-tag error

Here is a simple snippet to scrape a Wikipedia page and print each of its contents separately, e.g. the cast in one variable, the production in another, and so on.
In the first div named "bodyContent" there is another div named "mw-content-text". My problem is retrieving the data of the first paragraphs before the "h2" tag. I have a code snippet to work this out, but I am unable to build a BeautifulSoup attribute lookup from a string, and the error is TypeError: unsupported operand type(s) for +: 'Tag' and 'str'.
import urllib
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Deadpool_(film)"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
soup = BeautifulSoup(htmltext, "lxml")
#print soup.prettify()
movie_title = soup.find('h1', {'id': 'firstHeading'})
print movie_title.text
movie_info = soup.find_all('p')
#print movie_info[0].text
#print movie_info[1].text
'''I don't want it like this, because we don't know how many
intro paragraphs there will be, so we have to scrape all paras just before that h2 tag'''
Here the problem arises: I want to iterate with .next_sibling and use a try-except block to check whether resultant_next_url.name == 'p'.
def findNextSibling(base_url):
    tag_addition = 'next_sibling'
    next_url = base_url + '.' + tag_addition  # fails: base_url is a Tag, not a str
    return next_url
And finally do it like this:
base_url = movie_info[0]
resultant_url = findNextSibling(base_url)
print resultant_url.text
Finally found the answer; this solves the problem:
import urllib
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Deadpool_(film)"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
soup = BeautifulSoup(htmltext, "lxml")
#print soup.prettify()
movie_title = soup.find('h1', {'id': 'firstHeading'})
print movie_title.text
movie_info = soup.find_all('p')
# print movie_info[0].text
# print movie_info[1].text

def findNextSibling(resultant_url):
    # .next_sibling is an attribute lookup, not string concatenation
    return resultant_url.next_sibling

resultant_url = movie_info[0]
resultant_url = findNextSibling(resultant_url)
print resultant_url.text
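Since the stated goal was to collect every intro paragraph up to the first h2, here is a sketch of the full sibling walk (it reuses the soup from above; Wikipedia's markup may have changed since this was written):
first_p = soup.find('p')
intro = [first_p.get_text()]
# walk the siblings until the first h2 section heading appears
for sib in first_p.next_siblings:
    if sib.name == 'h2':
        break
    if sib.name == 'p':
        intro.append(sib.get_text())
print '\n\n'.join(intro)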

Python HTML parsing script that takes array of URLs and outputs specific data about each of the URLs

I am trying to write an HTML parser in Python that takes as its input a URL or list of URLs and outputs specific data about each of those URLs in the format:
URL: data1: data2
The data points can be found at the exact same HTML node in each of the URLs; they are consistently between the same starting and ending tags. If anyone out there would like to help an amateur Python programmer get the job done, it would be greatly appreciated. Extra points if you can come up with a way to output the information so it can be easily copied and pasted into an Excel document for subsequent data analysis!
For example, let's say I would like to output the view count for a particular YouTube video. For the URL http://www.youtube.com/watch?v=QOdW1OuZ1U0, the view count is around 3.6 million. For all YouTube videos, this number is found in the following format within the page's source:
<span class="watch-view-count ">
3,595,057
</span>
Fortunately, these exact tags are found only once on a particular YouTube video's page. These starting and ending tags can be supplied to the program or built in and modified when necessary. The output of the program would be:
http://www.youtube.com/watch?v=QOdW1OuZ1U0: 3,595,057 (or 3595057).
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.youtube.com/watch?v=QOdW1OuZ1U0'
f = urllib2.urlopen(url)
data = f.read()
soup = BeautifulSoup(data)
span = soup.find('span', attrs={'class':'watch-view-count'})
print '{}:{}'.format(url, span.text)
If you do not want to use BeautifulSoup, you can use re:
import urllib2
import re
url = 'http://www.youtube.com/watch?v=QOdW1OuZ1U0'
f = urllib2.urlopen(url)
data = f.read()
pattern = re.compile('<span class="watch-view-count.*?([\d,]+).*?</span>', re.DOTALL)
r = pattern.search(data)
print '{}:{}'.format(url, r.group(1))
As for the outputs, I think you can store them in a csv file.
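A minimal sketch of that CSV step, matching the Python 2 snippets above (the file name and the collected pair are illustrative):
import csv

# illustrative url/count pairs collected by the scraper above
results = [('http://www.youtube.com/watch?v=QOdW1OuZ1U0', '3,595,057')]
with open('view_counts.csv', 'wb') as f:  # 'wb' is the Python 2 csv convention
    writer = csv.writer(f)
    writer.writerow(['url', 'views'])
    writer.writerows(results)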
I prefer HTMLParser over re for this type of task. However, HTMLParser can be a bit tricky. I use module-level mutable objects to store data... I'm sure this is the wrong way of doing it, but it has worked in several projects for me in the past.
import urllib2
from HTMLParser import HTMLParser
import csv

position = []
results = [""]

class hp(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'span' and ('class', 'watch-view-count ') in attrs:
            position.append('bingo')

    def handle_endtag(self, tag):
        if tag == 'span' and 'bingo' in position:
            position.remove('bingo')

    def handle_data(self, data):
        if 'bingo' in position:
            results[0] += " " + data.strip() + " "

my_pages = ["http://www.youtube.com/watch?v=QOdW1OuZ1U0"]
data = []
for url in my_pages:
    response = urllib2.urlopen(url)
    page = str(response.read())
    parser = hp()
    parser.feed(page)
    data.append(results[0])
    # reinitialize the module-level lists for the next page
    position = []
    results = [""]

index = 0
with open('/path/to/test.csv', 'wb') as f:
    writer = csv.writer(f)
    header = ['url', 'output']
    writer.writerow(header)
    for d in data:
        row = [my_pages[index], data[index]]
        writer.writerow(row)
        index += 1
Then just open /path/to/test.csv in Excel.

Can't decode output from BeautifulSoup in Python

I've been attempting to write a little scraper in Python using BeautifulSoup.
Everything goes smoothly until I attempt to print (or write to a file) the strings contained inside the various HTML elements. The site I'm scraping is http://www.yellowpages.ca/search/si/1/Boots/Montreal+QC, which contains various French characters. For some reason, when I attempt to print the content in the terminal or into a file, instead of the decoded string I'm getting the raw escaped bytes.
Here's the script:
from BeautifulSoup import BeautifulSoup as bs
import urllib as ul
##import re

base_url = 'http://www.yellowpages.ca'
data_file = open('yellow_file.txt', 'a')

data = ul.urlopen(base_url + '/locations/Quebec/Montreal/90014002.html').readlines()
bt = bs(str(data))
result = bt.findAll('div', 'ypgCategory')
bt = bs(str(result))
result = bt.findAll('a')

for tag in result:
    link = base_url + tag['href']
    ##print str(link)
    data = ul.urlopen(link).readlines()
    #data = str(data).decode('latin-1')
    bt = bs(str(data), convertEntities=bs.HTML_ENTITIES, fromEncoding='latin-1')
    titles = bt.findAll('span', 'listingTitle')
    phones = bt.findAll('a', 'phoneNumber')
    entries = zip(titles, phones)
    for title, phone in entries:
        #print title.prettify(encoding='latin-1')
        #data_file.write(title.text.decode('utf-8') + " " + phone.text.decode('utf-8') + "\n")
        print title.text

data_file.close()
And the output of this is: Projets Autochtones Du Qu\xc3\xa9bec
As you can see, the accented e that's supposed to go in Québec isn't displaying. I've tried everything mentioned on SO: calling unicode(), passing fromEncoding to soup, .decode('latin-1'), but I'm getting nothing.
Any ideas?
This should be something like what you want:
from BeautifulSoup import BeautifulSoup as bs
import urllib as ul

base_url = 'http://www.yellowpages.ca'
data_file = open('yellow_file.txt', 'a')

bt = bs(ul.urlopen(base_url + '/locations/Quebec/Montreal/90014002.html'))
for div in bt.findAll('div', 'ypgCategory'):
    for a in div.findAll('a'):
        link = base_url + a['href']
        bt = bs(ul.urlopen(link), convertEntities=bs.HTML_ENTITIES)
        titles = bt.findAll('span', 'listingTitle')
        phones = bt.findAll('a', 'phoneNumber')
        for title, phone in zip(titles, phones):
            line = '%s %s\n' % (title.text, phone.text)
            data_file.write(line.encode('utf-8'))
            print line.rstrip()
data_file.close()
Who told you to use latin-1 to decode something that is UTF-8? (It's clearly specified in the meta tag.)
If you are on Windows you may have problems outputting Unicode to the console; it is better to test by writing to text files first.
If you open a file in text mode, do not write binary to it; use one of:
codecs.open(..., "w", "utf-8").write(unicode_str)
open(..., "wb").write(unicode_str.encode("utf_8"))

Cannot prettify scraped html in BeautifulSoup

I have a small script that uses urllib2 to get the contents of a site, finds all the link tags, appends a small piece of HTML to the top and bottom, and then tries to prettify it. It keeps returning TypeError: sequence item 1: expected string, Tag found. I have looked around and I can't really find the issue. As always, any help is much appreciated.
import urllib2
from BeautifulSoup import BeautifulSoup
import re

reddit = 'http://www.reddit.com'
pre = '<html><head><title>Page title</title></head>'
post = '</html>'

site = urllib2.urlopen(reddit)
html = site.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a')
tags.insert(0, pre)
tags.append(post)
soup1 = BeautifulSoup(''.join(tags))
print soup1.prettify()
This is the traceback:
Traceback (most recent call last):
  File "C:\Python26\bea.py", line 21, in <module>
    soup1 = BeautifulSoup(''.join(tags))
TypeError: sequence item 1: expected string, Tag found
This works for me:
soup1 = BeautifulSoup(''.join(str(t) for t in tags))
This pyparsing solution gives some decent output, too:
from pyparsing import makeHTMLTags, originalTextFor, SkipTo, Combine
# makeHTMLTags defines HTML tag patterns for given tag string
aTag,aEnd = makeHTMLTags("A")
# makeHTMLTags by default returns a structure containing
# the tag's attributes - we just want the original input text
aTag = originalTextFor(aTag)
aEnd = originalTextFor(aEnd)
# define an expression for a full link, and use a parse action to
# combine the returned tokens into a single string
aLink = aTag + SkipTo(aEnd) + aEnd
aLink.setParseAction(lambda tokens : ''.join(tokens))
# extract links from the input html
links = aLink.searchString(html)
# build list of strings for output
out = []
out.append(pre)
out.extend([' '+lnk[0] for lnk in links])
out.append(post)
print '\n'.join(out)
prints:
<html><head><title>Page title</title></head>
<a href="http://www.reddit.com/r/pics/" >pics</a>
<a href="http://www.reddit.com/r/reddit.com/" >reddit.com</a>
<a href="http://www.reddit.com/r/politics/" >politics</a>
<a href="http://www.reddit.com/r/funny/" >funny</a>
<a href="http://www.reddit.com/r/AskReddit/" >AskReddit</a>
<a href="http://www.reddit.com/r/WTF/" >WTF</a>
.
.
.
<a href="http://reddit.com/help/privacypolicy" >Privacy Policy</a>
close this window
<a href="http://www.reddit.com/feedback" >volunteer to translate</a>
close this window
</html>
soup1 = BeautifulSoup(''.join(unicode(tag) for tag in tags))
A bit of a syntax error in Jonathan's answer; here's the corrected one:
soup1 = BeautifulSoup(''.join([unicode(tag) for tag in tags]))
