I wanted to play around with Python to learn it, so I'm taking on a little project, but part of it requires me to search for a name on this list:
https://bughunter.withgoogle.com/characterlist/1
(the number 1 at the end is incremented by one for each page when searching for the name)
So I will be scraping the HTML. I'm new to Python and would appreciate it if someone could give me an example of how to make this work.
import json
import requests
from bs4 import BeautifulSoup

URL = 'https://bughunter.withgoogle.com'

def get_page_html(page_num):
    r = requests.get('{}/characterlist/{}'.format(URL, page_num))
    r.raise_for_status()
    return r.text

def get_page_profiles(page_html):
    page_profiles = {}
    soup = BeautifulSoup(page_html, 'html.parser')
    for table_cell in soup.find_all('td'):
        profile_name = table_cell.find_next('h2').text
        profile_url = table_cell.find_next('a')['href']
        page_profiles[profile_name] = '{}{}'.format(URL, profile_url)
    return page_profiles

if __name__ == '__main__':
    all_profiles = {}
    for page_number in range(1, 81):
        current_page_html = get_page_html(page_number)
        current_page_profiles = get_page_profiles(current_page_html)
        all_profiles.update(current_page_profiles)
    with open('google_hall_of_fame_profiles.json', 'w') as f:
        json.dump(all_profiles, f, indent=2)
Your question wasn't clear about how you wanted the data structured after scraping, so I just saved the profiles in a dict (with the key/value pair as {profile_name: profile_url}) and then dumped the results to a JSON file.
Let me know if anything is unclear!
Try this. You will need to install bs4 first (Python 3). It will get all of the names of the people on the website page:
from bs4 import BeautifulSoup as soup
import urllib.request

# Decode the response bytes rather than wrapping them in str()
html = urllib.request.urlopen('https://bughunter.withgoogle.com/characterlist/1').read().decode('utf-8')
page = soup(html, 'html.parser')
print(page.findAll(class_='item-list')[0].get_text())
Hi, I am still a beginner at Python and I was experimenting.
I am looking for a way to request a URL and get the data of the webpage without the page needing to open.
Once I get the data, I need to search it for a tag: for example, whether it has 'hello' somewhere on the requested home page.
Here is an example:
import urllib.request
fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
x = mystr.find('testing word tag')
print(x)
Please bear with me as I am still a rookie and can't find an example of what I am looking for.
I found the code above on here, but it does not seem to work for finding a string.
Anyone knows the best way to do it?
Thank you guys :)
Here are the most used libraries for this kind of work:
Requests to get the HTML of the page.
BeautifulSoup to find elements (and much more)
$ pip install requests bs4
And in your favorite IDE:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.python.org")
soup = BeautifulSoup(r.content, "html.parser")
sometag = soup.find("sometag")
print(sometag)
Try this.
import requests
url = "https://stackoverflow.com/questions/63577634/extract-html-and-search-in-python"
res = requests.get(url)
print(res.text)
Another method.
from simplified_scrapy import SimplifiedDoc, req

html = req.get('https://www.python.org')
doc = SimplifiedDoc(html)
title = doc.getElement('title').text
print(title)
title = doc.getElementByText('Welcome to', tag='title').text
print(title)
Result:
Welcome to Python.org
Welcome to Python.org
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
I am trying to scrape the values of the first two sections, i.e. the 1X2 and DOUBLE CHANCE sections, using bs4 and requests from this website: https://web.bet9ja.com/Sport/SubEventDetail?SubEventID=76512106
The code I have written is:
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://web.bet9ja.com/Sport/SubEventDetail?SubEventID=76512106')
soup = bs.BeautifulSoup(source, 'lxml')

for div in soup.find_all('div', class_='SEItem ng-scope'):
    print(div.text)
When I run it, I don't get any output. Can anyone help?
The page is loaded via JavaScript, so you have two options: use Selenium, or call the API directly.
Instead of using Selenium, I've called the API directly and got the required info.
Further explanation about XHR and APIs can be found here.
import requests

data = {
    'IDGruppoQuota': '0',
    'IDSottoEvento': '76512106'
}

def main(url):
    r = requests.post(url, json=data).json()
    count = 0
    for item in r['d']['ClassiQuotaList']:
        count += 1
        print(item['ClasseQuota'], [x['Quota'] for x in item['QuoteList']])
        if count == 2:
            break

main("https://web.bet9ja.com/Controls/ControlsWS.asmx/GetSubEventDetails")
Output:
1X2 ['3.60', '4.20', '1.87']
Double Chance ['1.83', '1.19', '1.25']
Try:
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://web.bet9ja.com/Sport/SubEventDetail?SubEventID=76512106')
soup = bs.BeautifulSoup(source, 'lxml')

for i in soup.find_all('div'):
    try:
        print(i['class'])
    except KeyError:  # this div has no class attribute
        pass
    for j in i.find_all('div'):
        try:
            print(j['class'])
        except KeyError:
            pass
This helps you find the classes available inside the <div> tags.
You get nothing when the class you search for doesn't exist in the fetched HTML. This happens because many sites are rendered dynamically with JavaScript, so requests can't see the final markup. In these cases, we need to use Selenium.
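To see why requests returns nothing here, the behaviour can be reproduced offline. Searching markup that does not yet contain the JavaScript-rendered class (the snippet below is a hypothetical stand-in for what the server actually sends) simply yields an empty list:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw HTML the server sends before
# JavaScript runs and injects the real content.
raw_html = '<div id="app"><!-- content injected by JavaScript --></div>'

soup = BeautifulSoup(raw_html, 'html.parser')
matches = soup.find_all('div', class_='SEItem ng-scope')
print(matches)  # [] -- the class only exists after JavaScript renders it
```

This is the silent "nothing" the question describes: no error, just no matches.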
I am trying to scrape a page, website/post-sitemap.xml, which contains all the URLs posted for a WordPress website. In the first step, I need to make a list of all the URLs present in the post sitemap. When I use requests.get and I check the output, it opens all of the internal URLs as well, which is weird. My intention is to make a list of all the URLs first, and then, using a loop, scrape the individual URLs in the next function. Below is the code I have so far. I would need all the URLs as a list as my final output, if Python gurus can help.
I have tried using requests.get and urlopen, but nothing seems to open only the base URL for /post-sitemap.xml.
import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re

class wordpress_ext_url_cleanup(object):

    def __init__(self, wp_url):
        self.wp_url_raw = wp_url
        self.wp_url = wp_url + '/post-sitemap.xml/'

    def identify_ext_url(self):
        html = requests.get(self.wp_url)
        print(self.wp_url)
        print(html.text)
        soup = BeautifulSoup(html.text, 'lxml')
        #print(soup.get_text())
        raw_data = soup.find_all('tr')
        print(raw_data)
        #for link in raw_data:
        #    print(link.get("href"))

def main():
    print("Inside Main Function")
    url = "http://punefirst dot com"  # (knowingly removed the . so it doesn't look spammy)
    first_call = wordpress_ext_url_cleanup(url)
    first_call.identify_ext_url()

if __name__ == '__main__':
    main()
I would need all 548 URLs present in the post sitemap as a list, which I will use in the next function for further scraping.
The document that is returned from the server is XML and transformed with XSLT to HTML form (more info here). To parse all links from this XML, you can use this script:
import requests
from bs4 import BeautifulSoup

url = 'http://punefirst.com/post-sitemap.xml/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

for loc in soup.select('url > loc'):
    print(loc.text)
Prints:
http://punefirst.com
http://punefirst.com/hospitals/pcmc-hospitals/aditya-birla-memorial-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/saijyoti-hospital-and-icu-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/niramaya-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/chetna-hospital-chinchwad-pune
http://punefirst.com/hospitals/hadapsar-hospitals/pbmas-h-v-desai-eye-hospital
http://punefirst.com/hospitals/punecentral-hospitals/shree-sai-prasad-hospital
http://punefirst.com/hospitals/punecentral-hospitals/sadhu-vaswani-missions-medical-complex
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/shivneri-hospital
http://punefirst.com/hospitals/punecentral-hospitals/kelkar-nursing-home
http://punefirst.com/hospitals/pcmc-hospitals/shrinam-hospital
http://punefirst.com/hospitals/pcmc-hospitals/dhanwantari-hospital-nigdi
http://punefirst.com/hospitals/punecentral-hospitals/dr-tarabai-limaye-hospital
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/satyanand-hospital-kondhwa-pune
...and so on.
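Since the goal was a list rather than printed output, the same `url > loc` selector can feed a list comprehension. A minimal sketch on a made-up fragment of the sitemap XML (the real document comes from requests.get(url).text as in the answer):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a fragment of the real post-sitemap.xml
xml = """
<urlset>
  <url><loc>http://punefirst.com</loc></url>
  <url><loc>http://punefirst.com/hospitals/pcmc-hospitals/aditya-birla-memorial-hospital-chinchwad-pune</loc></url>
</urlset>
"""

soup = BeautifulSoup(xml, 'html.parser')  # 'lxml' works equally well here
all_urls = [loc.text for loc in soup.select('url > loc')]
print(len(all_urls), 'URLs collected')
```

On the real sitemap, all_urls would hold all 548 URLs, ready to pass to the next function.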
I'm trying to open a webpage and return all the links as a dictionary that would look like this.
{"http://my.computer.com/some/file.html" : "link text"}
So the link would be after the href= and the text would be between the > and the </a>
I'm using https://www.yahoo.com/ as my test website
I keep getting this error:
'href=' in line:
TypeError: a bytes-like object is required, not 'str'
Heres my code:
def urlDict(myUrl):
    url = myUrl
    page = urllib.request.urlopen(url)
    pageText = page.readlines()
    urlList = {}
    for line in pageText:
        if '<a href=' in line:
            try:
                url = line.split('<a href="')[-1].split('">')[0]
                txt = line.split('<a href="')[-1].split('">')[-1].split('< /a>')[0]
                urlList[url] = txt
            except:
                pass
    return urlList
What am I doing wrong? I've looked around, and people have mostly suggested this mysoup parser thing. I'd use it, but I don't think that would fly with my teacher.
The issue is that you're attempting to compare a byte string to a regular string. If you add print(line) as the first command in your for loop, you'll see that it prints a string of HTML, but with a b' at the beginning, indicating it's a bytes object rather than a decoded utf-8 string. This makes things difficult. The proper way to use urllib here is the following:
def url_dict(myUrl):
    with urllib.request.urlopen(myUrl) as f:
        s = f.read().decode('utf-8')
This will have the s variable hold the entire text of the page. You can then use a regular expression to parse out the links and the link target. Here is an example which will pull the link targets without the HTML.
import urllib.request
import re

def url_dict():
    # url = myUrl
    with urllib.request.urlopen('http://www.yahoo.com') as f:
        s = f.read().decode('utf-8')
    r = re.compile('(?<=href=").*?(?=")')
    print(r.findall(s))

url_dict()
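The original bytes-vs-str mismatch is easy to reproduce in isolation, without touching the network (a minimal sketch; the byte string below is made up):

```python
# What fp.read() or readlines() gives you in Python 3: bytes, not str
line = b'<a href="/about">About</a>'

# Testing a str against bytes raises the TypeError from the question
try:
    '<a href=' in line
except TypeError as e:
    print('TypeError:', e)

# Decoding first makes the membership test work as expected
decoded = line.decode('utf-8')
print('<a href=' in decoded)  # True
```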
Using regex to get both the html and the link itself in a dictionary is outside the scope of where you are in your class, so I would absolutely not recommend submitting it for the assignment, although I would recommend learning it for later use.
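For later use, here is a sketch of that out-of-scope version: one regex with two capture groups pulls the href and the link text into a dict in a single pass. The HTML snippet is made up, and real-world markup is messy enough that BeautifulSoup remains the safer choice:

```python
import re

# Made-up snippet standing in for the decoded page text
s = '<p><a href="http://my.computer.com/some/file.html">link text</a></p>'

# Group 1 captures the href value, group 2 the text between > and </a>
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')
url_list = {href: text for href, text in pattern.findall(s)}
print(url_list)
```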
You'll want to use BeautifulSoup as suggested, as it makes this entire thing extremely easy. There is an example in the docs that you can cut and paste to extract the URLs.
For what it's worth, here is a BeautifulSoup and requests approach.
Feel free to replace requests with urllib, but BeautifulSoup doesn't really have a nice replacement.
import requests
from bs4 import BeautifulSoup

def get_links(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    # Guard against <a> tags that have no href attribute
    return {a_tag['href']: a_tag.text for a_tag in soup.find_all('a') if a_tag.has_attr('href')}

for link, text in get_links('https://www.yahoo.com/').items():
    print(text.strip(), link)
This is my first Python project so it is very basic and rudimentary.
I often have to clean off viruses for friends and the free programs that I use are updated often. Instead of manually downloading each program, I was trying to create a simple way to automate the process. Since I am also trying to learn python I thought it would be a good opportunity to practice.
Questions:
I have to find the .exe file with some of the links. I can find the correct URL, but I get an error when it tries to download.
Is there a way to add all of the links into a list, and then create a function to go through the list and run the function on each url? I've Google'd quite a bit and I just cannot seem to make it work. Maybe I am not thinking in the right direction?
import urllib, urllib2, re, os
from BeautifulSoup import BeautifulSoup

# Website list
sas = 'http://cdn.superantispyware.com/SUPERAntiSpyware.exe'
tds = 'http://support.kaspersky.com/downloads/utils/tdsskiller.exe'
mbam = 'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1'
tr = 'http://www.simplysup.com/tremover/download.html'
urllist = [sas, tds, mbam, tr]
urllist2 = []

# Find exe files to download
match = re.compile(r'\.exe')
data = urllib2.urlopen(urllist)
page = BeautifulSoup(data)

# Check links
#def findexe():
for link in page.findAll('a'):
    try:
        href = link['href']
        if re.search(match, href):
            urllist2.append(href)
    except KeyError:
        pass

os.chdir(r"C:\_VirusFixes")
urllib.urlretrieve(urllist2, os.path.basename(urllist2))
As you can see, I have left the function commented out as I cannot get it to work correctly.
Should I abandon the list and just download them individually? I was trying to be efficient.
Any suggestions or if you could point me in the right direction, it would be most appreciated.
In addition to mikez302's answer, here's a slightly more readable way to write your code:
import os
import re
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

websites = [
    'http://cdn.superantispyware.com/SUPERAntiSpyware.exe',
    'http://support.kaspersky.com/downloads/utils/tdsskiller.exe',
    'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1',
    'http://www.simplysup.com/tremover/download.html',
]

download_links = []

for url in websites:
    connection = urllib2.urlopen(url)
    soup = BeautifulSoup(connection)
    connection.close()
    for link in soup.findAll('a', {'href': re.compile(r'\.exe$')}):
        download_links.append(link['href'])

for url in download_links:
    urllib.urlretrieve(url, os.path.join(r'C:\_VirusFixes', os.path.basename(url)))
urllib2.urlopen is a function for accessing a single URL. If you want to access multiple ones, you should loop over the list. You should do something like this:
for url in urllist:
    data = urllib2.urlopen(url)
    page = BeautifulSoup(data)
    # Check links
    for link in page.findAll('a'):
        try:
            href = link['href']
            if re.search(match, href):
                urllist2.append(href)
        except KeyError:
            pass

os.chdir(r"C:\_VirusFixes")
for url in urllist2:
    urllib.urlretrieve(url, os.path.basename(url))
The code above didn't work for me; in my case it was because the pages assemble their links through a script instead of including them in the markup. When I ran into that problem, I used the following code, which is just a scraper:
import os
import re
import urllib
import urllib2
from bs4 import BeautifulSoup

url = ''
connection = urllib2.urlopen(url)
soup = BeautifulSoup(connection)  # Everything the same up to here

regex = '(.+?).zip'  # Here we insert the pattern we are looking for
pattern = re.compile(regex)
link = re.findall(pattern, str(soup))  # This finds all the .zip (.exe) in the text

# When it finds all the .zip, it usually comes back with a lot of undesirable
# text; luckily the file name is almost always separated by a space from the
# rest of the text, which is why we do the split.
for x, i in enumerate(link):
    link[x] = i.split(' ')[-1]

os.chdir(r"F:\Documents")  # This is the filepath where I want to save everything I download

for i in link:
    # The text we found doesn't include the .zip (or .exe in your case),
    # so we re-append it here.
    urllib.urlretrieve(url, filename=i + ".zip")
This is not as efficient as the code in the previous answers, but it will work for almost any site.