Hi, I am still a beginner at Python and I was experimenting.
I am looking for a way to request a URL and get the data of the webpage, so the page does not need to open in a browser.
Once I get the data, I need to search it for a tag; for example, whether it has 'hello' somewhere on the home page that is requested.
Here is an example:
import urllib.request
fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
x = mystr.find('testing word tag')  # find() returns the index of the first match, or -1 if not found
print(x)
I found the code above on here, but it does not seem to work to find a string.
Please bear with me as I am still a rookie and can't find an example of what I am looking for.
Does anyone know the best way to do it?
Thank you guys :)
Here are the most used libraries for this kind of work:
Requests, to get the HTML of the page.
BeautifulSoup, to find elements (and much more).
$ pip install requests beautifulsoup4
And in your favorite IDE:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.python.org")
soup = BeautifulSoup(r.content, "html.parser")
sometag = soup.find("sometag")
print(sometag)
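If all you need is to check whether a word like 'hello' appears somewhere on the page, as in your original question, you can search the parsed text directly. A minimal sketch continuing from the snippet above (the search word is just a placeholder):
if 'hello' in soup.get_text():  # get_text() returns the page text with the HTML tags stripped
    print("found it")
else:
    print("not on this page")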
Try this.
import requests
url = "https://stackoverflow.com/questions/63577634/extract-html-and-search-in-python"
res = requests.get(url)
print(res.text)
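If you then want to search that HTML for a word, str.find works on res.text; a small sketch (the search term is a placeholder):
pos = res.text.find('hello')  # index of the first occurrence, or -1 if absent
print(pos)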
Another method.
from simplified_scrapy import SimplifiedDoc, req
html = req.get('https://www.python.org')
doc = SimplifiedDoc(html)
title = doc.getElement('title').text
print(title)
title = doc.getElementByText('Welcome to', tag='title').text
print(title)
Result:
Welcome to Python.org
Welcome to Python.org
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
I watched a video that teaches how to use BeautifulSoup and requests to scrape a website.
Here's the code:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd
pages_to_scrape = 1
for i in range(1, pages_to_scrape+1):
    url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup = bs4(page.text, 'html.parser')
    #print(soup.prettify())
    for j in soup.findAll('p', class_='price_color'):
        price = j.getText()
        print(price)
The code is working well, but in the results I noticed a weird character before the euro symbol, and when checking the HTML source I didn't find that character. Any ideas why this character appears, and how it can be fixed? Is using replace enough, or is there a better approach?
It seems to me you've framed your question slightly wrong. I assume that you are using Windows, where your terminal/IDLE uses the default encoding of cp1252,
but you are dealing with UTF-8 content, so you have to configure your terminal/IDLE to use UTF-8.
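For example, when running in a terminal on Python 3.7+, one way to force this from inside the script is to reconfigure stdout (a minimal sketch; the printed price is just an illustration):
import sys
sys.stdout.reconfigure(encoding='utf-8')  # force UTF-8 output regardless of the locale default
print('£51.77')  # should now print without the stray character
That aside, here is a cleaner version of the scraping code itself: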
import requests
from bs4 import BeautifulSoup

def main(url):
    with requests.Session() as req:
        for item in range(1, 10):
            r = req.get(url.format(item))
            print(r.url)
            soup = BeautifulSoup(r.content, 'html.parser')
            goal = [(x.h3.a.text, x.select_one("p.price_color").text)
                    for x in soup.select("li.col-xs-6")]
            print(goal)

main("http://books.toscrape.com/catalogue/page-{}.html")
Try to always follow the DRY principle, which means "Don't Repeat Yourself".
Since you are dealing with the same host, you should maintain the same session instead of repeatedly opening a TCP socket stream, closing it, and opening it again. That can get your requests blocked and treated as a DDoS attack once the back-end captures the TCP flags. Imagine opening your browser, visiting a website, closing the browser, and repeating the cycle!
Python functions usually look nicer and are easier to read than code that reads like journal text.
Notes: the usage of range(), {} format strings, and CSS selectors.
You could use page.content.decode('utf-8') instead of page.text. As people in the comments said, it is an encoding issue: .content returns the HTML as bytes, which you can then convert into a string with the right encoding using .decode('utf-8'), whereas .text returns a string decoded with a possibly wrong encoding (maybe cp1252). The final code may look like this:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd
pages_to_scrape = 1
pages = []  # You forgot this line
for i in range(1, pages_to_scrape+1):
    url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup = bs4(page.content.decode('utf-8'), 'html.parser')  # Replace .text with .content.decode('utf-8')
    #print(soup.prettify())
    for j in soup.findAll('p', class_='price_color'):
        price = j.getText()
        print(price)
This should hopefully work.
P.S.: Sorry for writing an answer directly; I don't have enough reputation to write in the comments :D
I'm creating a Python program that collects images from this website by Google.
The images on the website change after a certain number of seconds, and the image URL also changes with time. This change is handled by a script on the website, and I have no idea how to get the image links from it.
I tried using BeautifulSoup and the requests library to get the image links from the site's html code:
import requests
from bs4 import BeautifulSoup
url = 'https://clients3.google.com/cast/chromecast/home'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
tags = soup('img')
for tag in tags:
    print(tag)
But the code returns '{{backgroundUrl}}' in the image src ("ng-src").
For example:
<img class="S9aygc-AHe6Kc" id="picture-background" image-error-handler="" image-index="0" ng-if="backgroundUrl" ng-src="{{backgroundUrl}}"/>
How can I get the image links from a dynamically changing site? Can BeautifulSoup handle this? If not, what library will do the job?
import requests
import re

def main(url):
    r = requests.get(url)
    # the image URL is JSON-escaped inside the page source (e.g. \/ and \u003d),
    # so pull it out with a regex and then strip the escapes
    match = re.search(r"(lh4\.googl.+?mv)", r.text).group(1)
    match = match.replace("\\", "").replace("u003d", "=")
    print(match)

main("https://clients3.google.com/cast/chromecast/home")
Just a minor addition to the answer by αԋɱҽԃ αмєяιcαη (ahmed american), in case anyone is wondering:
the subdomain (lhx) in lhx.google.com is also dynamic, so the link can be lh3 or lh4 et cetera.
This code fixes the problem:
import requests
import re
r = requests.get("https://clients3.google.com/cast/chromecast/home").text
match = re.search(r"(lh.\.googl.+?mv)", r).group(1)
match = match.replace('\\', '').replace("u003d", "=")
print(match)
The major difference is that the lh4 in the code by ahmed american has been replaced with "lh." so that all images can be collected no matter the URL.
EDIT: This line does not work:
match = match.replace('\\', '').replace("u003d", "=")
Replace with:
match = match.replace("\\", "")
match = match.replace("u003d", "=")
None of the provided answers worked for me. The issues may be related to using an older version of Python and/or the source page changing some things around.
Also, this will return all matches instead of only the first match.
Tested in Python 3.9.6.
import requests
import re
url = 'https://clients3.google.com/cast/chromecast/home'
r = requests.get(url)
for match in re.finditer(r"(ccp-lh\..+?mv)", r.text, re.S):
    image_link = 'https://%s' % (match.group(1).replace("\\", "").replace("u003d", "="))
    print(image_link)
I am trying to learn web scraping using BeautifulSoup and Python.
I scraped a list of URLs from a website, and I want to display the text of all the links that are in the format "/askwiki/questions/", like
"/askwiki/questions/4" or "/askwiki/questions/123".
import requests
from bs4 import BeautifulSoup
url = 'http://unistd.herokuapp.com/askrec'
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
links = soup.find_all("a")
for link in links:
    if ...:  # url is of my desired format
        print(link.text)
What should I write in the if statement?
I am new to Python as well as web scraping. It may be a really stupid question, but I am not getting what to write there.
I tried things like:
if "/askwiki/questions/[0-9]+ " in link.get("href"):
if "/askwiki/questions/[0-9]?" in link.get("href"):
but it's not working.
P.S. - There are other links too, like '/askwiki/questions/tags' and '/askwiki/questions/users'.
Edit: using a regex to identify only the links with numbers at the end:
import re

for link in links:
    url = str(link.get('href'))
    if re.findall(r'/askwiki/questions/[\d]+', url):
        print(link)
You're on the right track! The missing component is the re module.
I think what you want is something like this:
import re
matcher = re.compile(r"/askwiki/questions/[0-9]+")

if matcher.search(link.get("href") or ""):  # the or "" guards against tags with no href
    print(link.text)
Alternatively, you can just drop the number component if you're only really looking for links containing "/askwiki/questions" (though note, per your P.S., this will also match links like /askwiki/questions/tags):
if "/askwiki/questions" in (link.get("href") or ""):
    print(link.text)
Try something like:
for link in links:
    href = link.get("href", "")
    if href.startswith("/askwiki/questions/"):
        print(link.text)
If you want to use a regex (i.e., what you have: [0-9]+), you have to import the re library. Check out the re documentation for how to find patterns.
I'm trying to open a webpage and return all the links as a dictionary that would look like this:
{"http://my.computer.com/some/file.html" : "link text"}
So the link would be after the href= and the text would be between the > and the </a>.
I'm using https://www.yahoo.com/ as my test website.
I keep getting this error:
'href=' in line:
TypeError: a bytes-like object is required, not 'str'
Here's my code:
import urllib.request

def urlDict(myUrl):
    url = myUrl
    page = urllib.request.urlopen(url)
    pageText = page.readlines()
    urlList = {}
    for line in pageText:
        if '<a href=' in line:
            try:
                url = line.split('<a href="')[-1].split('">')[0]
                txt = line.split('<a href="')[-1].split('">')[-1].split('< /a>')[0]
                urlList[url] = txt
            except:
                pass
    return urlList
What am I doing wrong? I've looked around and people have mostly suggested this BeautifulSoup parser thing. I'd use it, but I don't think that would fly with my teacher.
The issue is that you're attempting to compare a byte string to a regular string. If you add print(line) as the first command in your for loop, you'll see that it prints a string of HTML, but with a b' at the beginning, indicating it's a bytes object rather than a decoded string. This makes things difficult. The proper way to use urllib here is the following:
def url_dict(myUrl):
    with urllib.request.urlopen(myUrl) as f:
        s = f.read().decode('utf-8')
This leaves the s variable holding the entire text of the page. You can then use a regular expression to parse out the links and the link text. Here is an example which will pull the link targets without the HTML:
import urllib.request
import re

def url_dict():
    # url = myUrl
    with urllib.request.urlopen('http://www.yahoo.com') as f:
        s = f.read().decode('utf-8')
    r = re.compile('(?<=href=").*?(?=")')
    print(r.findall(s))

url_dict()
Using a regex to get both the link text and the link target into a dictionary is outside the scope of where you are in your class, so I would absolutely not recommend submitting it for the assignment, although I would recommend learning it for later use.
You'll want to use BeautifulSoup as suggested, as it makes this entire thing extremely easy. There is an example in the docs that you can cut and paste to extract the URLs.
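From memory, that docs example boils down to something like this, given a soup object:
for link in soup.find_all('a'):
    print(link.get('href'))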
For what it's worth, here is a BeautifulSoup and requests approach.
Feel free to replace requests with urllib, but BeautifulSoup doesn't really have a nice replacement.
import requests
from bs4 import BeautifulSoup
def get_links(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    # href=True skips anchor tags that have no href attribute (avoids a KeyError)
    return {a_tag['href']: a_tag.text for a_tag in soup.find_all('a', href=True)}

for link, text in get_links('https://www.yahoo.com/').items():
    print(text.strip(), link)
I'm having trouble with my script. I am able to get the title and links, but I can't seem to open the article and scrape it. Can somebody please help!
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
source = urlopen('http://www.marketingmag.com.au/feed/').read()
title = re.compile('<title>(.*)</title>')
link = re.compile('<a href="(.*)">')
find_title = re.findall(title, source)
find_link = re.findall(link, source)
literate = []
literate[:] = range(1, 10)
for i in literate:
    print find_title[i]
    print find_link[i]
    articlePage = urlopen(find_link[i]).read()
    divBegin = articlePage.find('<div class="entry-content">')
    article = articlePage[divBegin:(divBegin + 1000)]
    soup = BeautifulSoup(article)
    paragList = soup.findAll('p')
    for parag in paragList:  # renamed from i to avoid shadowing the outer loop variable
        print parag
        print "\n"
Do not use a regex to parse HTML. Just use Beautiful Soup and its facilities, like find_all, to get the links, and then you can use urllib2.urlopen to open each URL and read the contents.
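A rough sketch of that approach (assuming bs4 is installed, and Python 2 to match the question's code; the variable names are illustrative):
from bs4 import BeautifulSoup
import urllib2

source = urllib2.urlopen('http://www.marketingmag.com.au/feed/').read()
soup = BeautifulSoup(source, 'html.parser')
for a in soup.find_all('a', href=True):          # every link in the feed
    article = urllib2.urlopen(a['href']).read()  # open the linked article
    print a['href'], len(article)                # URL and size of each fetched page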
Your code strongly reminds me of: http://www.youtube.com/watch?v=Ap_DlSrT-iE
Why do you actually use BeautifulSoup for XML parsing? It's built for HTML sites, and Python itself has very good XML parsers. Example: http://docs.python.org/library/xml.dom.minidom.html