I wrote code to scrape four properties, but I'm only getting data from the first field, "title"; the other three fields return empty results. Could anyone please guide me on how to fix this issue? Thanks!
Here is my code:
import requests
from bs4 import BeautifulSoup
#import pandas as pd
import csv

def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'html.parser')  # 1. html, 2. parser
        return soup

def get_detail_data(soup):
    try:
        title = soup.find('span', class_="text-info h4", id=False).find('strong').text
    except:
        title = 'empty'
    print(title)

    try:
        add = soup.find('div', class_="col-xs-12 col-sm-4", id=False).find('strong')
    except:
        add = 'empty add'
    print(add)

    try:
        phone = soup.find('div', class_="col-xs-12 col-sm-4", id=False).text
    except:
        phone = 'empty phone'
    print(phone)

def main():
    url = "https://www.dobsearch.com/people-finder/view.php?searchnum=287404084791&sessid=vusqgp50pm8r38lfe13la8ta1l"
    get_detail_data(get_page(url))

if __name__ == '__main__':
    main()
For the second field you are passing a class that occurs earlier in the page than the element you actually want, so find() stops at the wrong element; you need to target a different class or chain several find calls. The same thing happens with the third field. Classes like col-xs-12 are generic Bootstrap layout classes that appear all over the page, so they are poor anchors for find() (or you have to build more specific queries around them). Since this site doesn't have many unique classes, chaining multiple find methods is probably your best option. One more thing: avoid bare try...except blocks unless you know exactly what to expect from that part of the page; right now they hide the fact that your selectors are matching the wrong elements.
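For illustration, here is a minimal sketch of the chained-find idea. I can't see the page markup, so the label strings ("Address", "Phone") and the assumption that each value sits inside a block containing a strong label are guesses; adjust them to what the page actually contains.

from bs4 import BeautifulSoup

def get_detail_data(soup):
    # Title: this part already works for you; just guard against a missing
    # element instead of using a bare try/except.
    title_tag = soup.find('span', class_='text-info h4')
    title = title_tag.find('strong').get_text(strip=True) if title_tag and title_tag.find('strong') else 'empty'
    print(title)

    # Instead of matching the generic Bootstrap class "col-xs-12 col-sm-4",
    # anchor on a label the page likely contains and read the text of the
    # block that holds it. "Address" and "Phone" are assumed label texts.
    for label_text in ('Address', 'Phone'):
        label = soup.find('strong', string=lambda s: s and label_text in s)
        if label and label.parent:
            print(label_text + ':', label.parent.get_text(' ', strip=True))
        else:
            print(label_text + ':', 'empty')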
Basically I am trying to build a program that can identify login pages by URL.
My idea is to parse each page in search of textboxes (and then identify them by name and type). Here is the code:
import requests
from bs4 import BeautifulSoup

# parse page html (soup)
def parse(soup):
    found = []
    for a in soup.find_all('input'):
        if a['type'] in ['text', 'password', 'email']:
            found.append(a['name'])
    return found

# get site's html
def get_site_content(url):
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html5lib')
    textBoxes = parse(soup)
    print("Found in: " + url)
    print(textBoxes)

if __name__ == '__main__':
    get_site_content('https://login.facebook.com')
    get_site_content('https://www.instagram.com/accounts/login/')
    get_site_content('https://instagram.com')
    get_site_content('https://instagram.com/login')
    get_site_content('https://login.yahoo.com')
It seems to work just fine, but for some reason I've had problems with Instagram's login pages. Here is the output:
Found in: https://login.facebook.com
['email', 'pass']
Found in: https://www.instagram.com/accounts/login/
[]
Found in: https://instagram.com
[]
Found in: https://instagram.com/login
[]
Found in: https://login.yahoo.com
['username', 'passwd']
Process finished with exit code 0
After trying different libraries for fetching the HTML and different parsers, I've come to understand that the problem is with the html = requests.get(url) line: it just doesn't get the full HTML.
Any ideas on how to fix this?
Thanks in advance!
By the way, if you have a better idea for what I am trying to accomplish, I would love to hear it :)
The content is provided dynamically by JavaScript, which requests does not render. To get the rendered page_source, use Selenium.
You could also select your elements more specifically:
for a in soup.select('input[name]'):
Example
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())

def parse(soup):
    found = []
    for a in soup.select('input[name]'):
        if a['type'] in ['text', 'password', 'email']:
            found.append(a['name'])
    return found

def get_site_content(url):
    driver.get(url)
    time.sleep(2)
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    textBoxes = parse(soup)
    print("Found in: " + url)
    print(textBoxes)

if __name__ == '__main__':
    get_site_content('https://login.facebook.com')
    get_site_content('https://www.instagram.com/accounts/login/')
    get_site_content('https://instagram.com')
    get_site_content('https://instagram.com/login')
    get_site_content('https://login.yahoo.com')
Output
Found in: https://login.facebook.com
['email', 'pass']
Found in: https://www.instagram.com/accounts/login/
['username', 'password']
Found in: https://instagram.com
['username', 'password']
Found in: https://instagram.com/login
['username', 'password']
Found in: https://login.yahoo.com
['username', 'passwd']
Alright, so thanks to #user:14460824 (HedgHog) I have come to realize that the problem was the need to render the page, since it is built dynamically by JavaScript. Personally, I didn't like Selenium and used requests-html instead. It works much like Selenium but feels easier to use, and once I figure out how to identify whether a web page is rendered dynamically by JavaScript or not, this library will make it easy to skip rendering when it isn't needed and avoid wasting resources.
Here is the code:
from requests_html import HTMLSession
import requests

# parse page html
def parse(html):
    found = []
    for a in html.find('input'):
        if a.attrs['type'] in ['text', 'password', 'email'] and 'name' in a.attrs:
            found.append(a.attrs['name'])
    return found

# get site's html
def get_site_content(url):
    try:
        session = HTMLSession()
        response = session.get(url)
        # if(JAVASCRIPT):                      # here I need to find a way to tell whether
        #     render the page                  # the page is rendered dynamically from JavaScript
        #     response.html.render(timeout=20)
        response.html.render(timeout=20)  # for now, render all pages
        return response.html
    except requests.exceptions.RequestException as e:
        print(e)

def find_textboxes(url):
    textBoxes = parse(get_site_content(url))
    print("Found in: " + url)
    print(textBoxes)

if __name__ == '__main__':
    find_textboxes('https://login.facebook.com')
    find_textboxes('https://www.instagram.com/accounts/login/')
    find_textboxes('https://instagram.com')
    find_textboxes('https://login.yahoo.com')
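One rough way to fill in that if(JAVASCRIPT) placeholder, as a heuristic rather than a real JavaScript detector, is to render only when parsing the un-rendered HTML finds nothing. A sketch reusing the parse() and get_site_content() functions above:

def get_site_content(url):
    try:
        session = HTMLSession()
        response = session.get(url)
        # Heuristic: if the plain (un-rendered) HTML already contains matching
        # inputs, skip the expensive render step; otherwise fall back to it.
        if not parse(response.html):
            response.html.render(timeout=20)
        return response.html
    except requests.exceptions.RequestException as e:
        print(e)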
I am currently trying to get a spot in a Zoom class, and if I don't get a spot I owe a driving school 300 dollars. Long story, but not very important. I want to get a notification when the Zoom registration page has been updated. I originally tried to just check whether the hash of the site changed at any point, but there must be some internal clock on the page that changes, because it notifies me every minute. The specific element I want to watch for removal is:
<div class="form-group registration-over">Registration is closed.</div>
I am not sure how to isolate it within the hash. Below is the code I have for checking for any update.
import time
import hashlib
from urllib.request import urlopen, Request
import webbrowser

url = Request('URL HERE',
              headers={'User-Agent': 'Mozilla/5.0'})

response = urlopen(url).read()
currentHash = hashlib.sha224(response).hexdigest()
print("running")
time.sleep(10)

while True:
    try:
        response = urlopen(url).read()
        currentHash = hashlib.sha224(response).hexdigest()
        time.sleep(30)

        response = urlopen(url).read()
        newHash = hashlib.sha224(response).hexdigest()

        if newHash == currentHash:
            continue
        else:
            from datetime import datetime
            now = datetime.now()
            nowtime = now.strftime("%H:%M:%S: ")
            print(nowtime, "Something changed")
            webbrowser.open('URL HERE')
            response = urlopen(url).read()
            currentHash = hashlib.sha224(response).hexdigest()
            time.sleep(30)
            continue
    except Exception as e:
        print("error")
You can use Beautiful Soup to parse the HTML response, which gives you a tree-like structure to search. From your example it looks like you want to look for the class of the registration div. The find method returns None if no matching element exists, so you can just test for that instead of comparing hashes.
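A minimal sketch of that approach, reusing the Request and User-Agent setup from the question; the URL and the 30-second interval are placeholders:

import time
import webbrowser
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

URL = 'URL HERE'
req = Request(URL, headers={'User-Agent': 'Mozilla/5.0'})

while True:
    html = urlopen(req).read()
    soup = BeautifulSoup(html, 'html.parser')
    # find() returns None once the "registration-over" div disappears.
    closed = soup.find('div', class_='registration-over')
    if closed is None:
        print('Registration element is gone - go check the page!')
        webbrowser.open(URL)
        break
    time.sleep(30)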
I'm very new to writing and working with classes in Python. I've written a class-based parser that checks whether the .get_nextpage() method produces a next-page URL. When .get_nextpage() does produce a link, I want it to be printed right after the self.get_nextpage(soup) line in the try/except block inside .get_links(). I'm stuck on how to make that possible.
I'm not after an alternative solution; I just wish to know whether the logic can work this way.
I used a while True loop within .get_links() so that it keeps running until .get_nextpage() stops generating a new link. (That's not part of this question, just so you know why while True is there.)
This is the scraper:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://stackoverflow.com/questions/tagged/web-scraping"

class StackOverflowClass(object):
    def __init__(self, link):
        self.url = link

    def get_links(self):
        while True:
            res = requests.get(self.url)
            soup = BeautifulSoup(res.text, "lxml")
            try:
                self.get_nextpage(soup)
                # what to do here to get the link generated within ".get_nextpage()" method
            except:
                break

    def get_nextpage(self, sauce):
        nurl = sauce.select_one("div.pager a[rel='next']")
        if nurl:
            link = urljoin(self.url, nurl.get("href"))

crawler = StackOverflowClass(url)
crawler.get_links()
To be clearer about what I meant, take a look at the following lines once again:
try:
    self.get_nextpage(soup)
    # what to do here to get the link generated within ".get_nextpage()" method
except:
    break
You can modify your get_nextpage as below:
def get_nextpage(self, sauce):
    nurl = sauce.select_one("div.pager a[rel='next']")
    if nurl:
        link = urljoin(self.url, nurl.get("href"))
        return link
and then you can use it in get_links() to get the link value:
def get_links(self):
    while True:
        res = requests.get(self.url)
        soup = BeautifulSoup(res.text, "lxml")
        if self.get_nextpage(soup):
            link = self.get_nextpage(soup)
            # do whatever you want with link
        else:
            break
Note that if/else is used instead of try/except: a method or function without an explicit return returns None, and None inside a try block never raises an exception, so the break in your except clause would never execute.
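For example, here is one sketch of what "do whatever you want with link" could look like if you also want the while loop to keep crawling: print the link and then move self.url forward so the next iteration fetches the next page.

def get_links(self):
    while True:
        res = requests.get(self.url)
        soup = BeautifulSoup(res.text, "lxml")
        link = self.get_nextpage(soup)
        if link:
            print(link)       # the link produced by .get_nextpage() is available here
            self.url = link   # follow it on the next iteration
        else:
            break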
As of now I have created a basic program in Python 2.7 using urllib2 and re that fetches the HTML of a website, prints it out, and indexes a keyword. I would like to create a much more complex and dynamic program that could gather data from websites (such as sports or stock statistics) and aggregate it into lists, which could then be analyzed in something like an Excel document. I'm not asking for someone to literally write the code; I simply need help understanding how to approach it, whether I need extra libraries, and so on. Here is the current code. It is very simplistic for now:
import urllib2
import re

y = 0
while y == 0:
    x = str(raw_input("[[[Enter URL]]]"))
    keyword = str(raw_input("[[[Enter Keyword]]]"))
    wait = 0
    try:
        req = urllib2.Request(x)
        response = urllib2.urlopen(req)
        page_content = response.read()
        idall = [m.start() for m in re.finditer(keyword, page_content)]
        wait = raw_input("")
        print(idall)
        wait = raw_input("")
        print(page_content)
    except urllib2.HTTPError as e:
        print e.reason
You can use requests to handle the interaction with the website. Here is the link for it: http://docs.python-requests.org/en/latest/
Then you can use Beautiful Soup to parse the HTML content. Here is the link for it: http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
They're much easier to use than urllib2 and re.
Hope it helps.
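As a rough sketch (in Python 3 syntax), the same keyword-indexing loop could look like this with requests and Beautiful Soup; searching the visible text rather than the raw HTML is my assumption about what you actually want:

import requests
from bs4 import BeautifulSoup

while True:
    url = input("[[[Enter URL]]]")
    keyword = input("[[[Enter Keyword]]]")
    try:
        response = requests.get(url)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(e)
        continue
    soup = BeautifulSoup(response.text, 'html.parser')
    text = soup.get_text()
    # Collect the start index of every occurrence of the keyword in the page text.
    positions = [i for i in range(len(text)) if text.startswith(keyword, i)]
    print(positions)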
I am trying to create a website downloader using Python. I have the code for:
Finding all URLs on a page
Downloading a given URL
What I need to do is recursively download a page, and if there are any other links on that page, download them as well. I tried combining the two functions above, but the recursion doesn't work.
The code is given below:
1)
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    wanted_url = raw_input("Enter the URL: ")
    usock = urllib.urlopen(wanted_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        download(url)
2) where the download(url) function is defined as follows:
def download(url):
    import urllib
    webFile = urllib.urlopen(url)
    localFile = open(url.split('/')[-1], 'w')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()

a = raw_input("Enter the URL")
download(a)
print "Done"
Kindly help me figure out how to combine these two pieces of code so that the links found on a downloaded page are themselves downloaded recursively.
You may want to look into the Scrapy library.
It would make a task like this pretty trivial, and allow you to download multiple pages concurrently.
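For instance, here is a minimal sketch of a Scrapy spider that saves each page and follows the links it finds; the domain and start URL are placeholders, and you would run it with the scrapy crawl command rather than as a plain script:

import scrapy

class SiteDownloaderSpider(scrapy.Spider):
    name = 'site_downloader'
    allowed_domains = ['example.com']        # keep the crawl inside one site
    start_urls = ['https://example.com/']

    def parse(self, response):
        # Save the page body to a local file named after the last URL segment.
        filename = response.url.rstrip('/').split('/')[-1] or 'index'
        with open(filename + '.html', 'wb') as f:
            f.write(response.body)

        # Queue every link on the page; Scrapy deduplicates repeated requests.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)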
done_url = []

def download(url):
    if url in done_url:
        return
    # ...download url code...
    done_url.append(url)
    urls = some_function_to_fetch_urls_from_this_page()
    for url in urls:
        download(url)
This is very rough code: for example, you would still need to check whether each URL is inside the domain you want to crawl. However, you asked for recursion.
Be mindful of the recursion depth.
There are just so many things wrong with my solution. :P
You should really try a crawling library like Scrapy or something similar.
Generally, the idea is this:
def get_links_recursive(document, current_depth, max_depth):
    links = document.get_links()
    for link in links:
        downloaded = link.download()
        if current_depth < max_depth:
            get_links_recursive(downloaded, current_depth + 1, max_depth)
Call get_links_recursive(document, 0, 3) to get things started.