How to extract valid urls of similar pattern?

How to extract valid urls of similar pattern? - python

I am scraping an entire article management system storing thousands of articles. My script works, but the problem is that beautifulsoup and requests both take a long in determining whether the the page is an actual article or an article not found page. I have approximately 4000 articles and by calculating, the amount time the script will run will complete is in days.
for article_url in edit_article_list:
article_edit_page = s.get(article_url, data=payload).text
article_edit_soup = BeautifulSoup(article_edit_page, 'lxml')
# Section
if article_edit_soup.find("select", {"name":"ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"}) == None:
continue
else:
for thing in article_edit_soup.find("select", {"name":"ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"}).findAll("option", {"selected":"selected"}):
f.write(thing.get_text(strip=True) + "\t")
The first if determines whether the url is good or bad.
edit_article_list is made by:
for count in range(87418,307725):
edit_article_list.append(login_url+"AddEditArticle.aspxArticleID="+str(count))
My script right now checks for the bad and good urls and then scrapes the content. Is there any way I can get the valid urls of similar pattern using requests while making the url list?

To skip articles which don't exist, need to not allow redirects and check the status code:
for article_url in edit_article_list:
r = requests.get(article_url, data=payload, allow_redirects=False)
if r.status_code != 200:
continue
article_edit_page = r.text
article_edit_soup = BeautifulSoup(article_edit_page, 'lxml')
# Section
if article_edit_soup.find("select", {"name":"ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"}) == None:
continue
else:
for thing in article_edit_soup.find("select", {"name":"ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"}).findAll("option", {"selected":"selected"}):
f.write(thing.get_text(strip=True) + "\t")
I do though recommend parsing the article list page for the actual urls - you are currently firing off over 200,000 requests and only expecting 4,000 articles, that is a lot of overhead and traffic, and not very efficient!

Related

Recursive crawling with BeautifulSoup really slow

I'm building a crawler that downloads all .pdf Files of a given website and its subpages. For this, I've used built-in functionalities around the below simplified recursive function that retrieves all links of a given URL.
However this becomes quite slow, the longer it crawls a given website (may take 2 minutes or longer per URL).
I can't quite figure out what's causing this and would really appreciate suggestions on what needs to be changed in order to increase the speed.
import re
import requests
from bs4 import BeautifulSoup
pages = set()
def get_links(page_url):
global pages
pattern = re.compile("^(/)")
html = requests.get(f"https://www.srs-stahl.de/{page_url}").text
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a", href=pattern):
if "href" in link.attrs:
if link.attrs["href"] not in pages:
new_page = link.attrs["href"]
print(new_page)
pages.add(new_page)
get_links(new_page)
get_links("")

It is not that easy to figure out what activly slow down your crawling - It is maybe the way you crawl, server of the website, ...
In your code, you request a URL, grab the links and call the functions itself in the first iteration, so you only append requested urls.
You may want to work with "queues" to keep the processes more transparent.
One advantage is that if the script aborts, you have this information stored and can access it to start from the urls you already have collected to visit. Quite the opposite of your for loop, which may have to start at an earlier point to ensure it get all urls.
Another point is, you request the PDF files, but without using the response in any way. Wouldn't it make more sense to either download and save them directly or skip the request and at least keep the links in separate "queue" for post processing?
Collected information in comparison - Based on iterations
Code in question:
pages --> 24
Example code (without delay):
urlsVisited --> 24
urlsToVisit --> 87
urlsToDownload --> 67
Example
Just to demonstrate, feel free to create defs, classes and structure to your needs. Note added some delay, but you can skip it if you like. "Queues" to demonstrate the process are lists but should be files, database,... to store your data safely.
import requests, time
from bs4 import BeautifulSoup
baseUrl = 'https://www.srs-stahl.de'
urlsToDownload = []
urlsToVisit = ["https://www.srs-stahl.de/"]
urlsVisited = []
def crawl(url):
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
for a in soup.select('a[href^="/"]'):
url = f"{baseUrl}{a['href']}"
if '.pdf' in url and url not in urlsToDownload:
urlsToDownload.append(url)
else:
if url not in urlsToVisit and url not in urlsVisited:
urlsToVisit.append(url)
while urlsToVisit:
url = urlsToVisit.pop(0)
try:
crawl(url)
except Exception as e:
print(f'Failed to crawl: {url} -> error {e}')
finally:
urlsVisited.append(url)
time.sleep(2)

Wait Before Scraping using Beatifulsoup

I'm trying to scrape data from this review site. It first go through first page, check if there's a 2nd page then go to it too. Problem is when getting to 2nd page. Page takes time to update and I still get the first page's data instead of 2nd
For example, if you go here, you will see how it takes time to load page 2 data
I tried to put a timeout or sleep but didn't work. Prefer a solution with minimal package/browser dependency (like webdriver.PhantomJS()) as I need to run this code on my employer's environment and not sure if I can use it. Thank you!!
from urllib.request import Request, urlopen
from time import sleep
from socket import timeout
req = Request(softwareadvice, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req, timeout=10).read()
webpage = web_byte.decode('utf-8')
parsed_html = BeautifulSoup(webpage, features="lxml")
true=parsed_html.find('div', {'class':['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})
while(true):
true = parsed_html.find('div', {'class':['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})
if(not True):
true=False
else:
req = Request(softwareadvice+'?review.page=2', headers=hdr)
sleep(10)
webpage = urlopen(req, timeout=10)
sleep(10)
webpage = webpage.read().decode('utf-8')
parsed_html = BeautifulSoup(webpage, features="lxml")

The reviews are loaded from external source via Ajax request. You can use this example how to load them:
import re
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/"
api_url = (
"https://pkvwzofxkc.execute-api.us-east-1.amazonaws.com/production/reviews"
)
params = {
"q": "s*|-s*",
"facet.gdm_industry_id": '{"sort":"bucket","size":200}',
"fq": "(and product_id: '{}' listed:1)",
"q.options": '{"fields":["pros^5","cons^5","advice^5","review^5","review_title^5","vendor_response^5"]}',
"size": "50",
"start": "50",
"sort": "completeness_score desc,date_submitted desc",
}
# get product id
soup = BeautifulSoup(requests.get(url).content, "html.parser")
a = soup.select_one('a[href^="https://reviews.softwareadvice.com/new/"]')
id_ = int("".join(re.findall(r"\d+", a["href"])))
params["fq"] = params["fq"].format(id_)
for start in range(0, 3): # <-- increase the number of pages here
params["start"] = 50 * start
data = requests.get(api_url, params=params).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# print some data:
for h in data["hits"]["hit"]:
if "review" in h["fields"]:
print(h["fields"]["review"])
print("-" * 80)
Prints:
After 2 years using Twilio services, mainly phone and messages, I can say I am so happy I found this solution to handle my communications. It is so flexible, Although it has been a little bit complicated sometimes to self-learn about online phoning systems it saved me from a lot of hassles I wanted to avoid. The best benefit you get is the ultra efficient support service
--------------------------------------------------------------------------------
An amazingly well built product -- we rarely if ever had reliability issues -- the Twilio Functions were an especially useful post-purchase feature discovery -- so much so that we still use that even though we don't do any texting. We also sometimes use FracTEL, since they beat Twilio on pricing 3:1 for 1-800 texts *and* had MMS 1-800 support long before Twilio.
--------------------------------------------------------------------------------
I absolutely love using Twilio, have had zero issues in using the SIP and text messaging on the platform.
--------------------------------------------------------------------------------
Authy by Twilio is a run-of-the-mill 2FA app. There's nothing special about it. It works when you're not switching your hardware.
--------------------------------------------------------------------------------
We've had great experience with Twilio. Our users sign up for text notification and we use Twilio to deliver them information. That experience has been well-received by customers. There's more to Twilio than that but texting is what we use it for. The system barely ever goes down and always shows us accurate information of our usage.
--------------------------------------------------------------------------------
...and so on.

I have been scraping many types of websites and I think in the world of scraping, there are roughly 2 types of websites.
The first one is "URL-based" websites (i.e. you send request with URL, the server responds with HTML tags from which elements can be directly extracted), and the second one is "JavaScript-rendered" websites (i.e. the response you only get is the javascript and you can only see HTML tags after it is run).
In former's cases, you can freely navigate through the website with bs4. But in the latter's cases, you cannot always use URLs as a rule of thumb.
The site you are going to scrape is built with Angular.js, which is based on client-side rendering. So, the response you get is the JavaScript code, not HTML tags with page content in it. You have to run the code to get the content.
About the code you introduced:
req = Request(softwareadvice, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req, timeout=10).read() # response is javascript, not page content you want...
webpage = web_byte.decode('utf-8')
All you can get is the JavaScript code that must be run to get HTML elements. That is why you get the same pages(response) every time.
So, what to do? Is there any way to run JavaScript within bs4? I guess there aren't any appropriate ways to do this. You can use selenium for this one. You can literally wait until the page fully loads, you can click buttons and anchors, or get page content at any time.
Headless browsers in selenium might work, which means you don't have to see the controlled browser opening on your computer.
Here are some links that might be of help to you.
scrape html generated by javascript with python
https://sadesmith.com/2018/06/15/blog/scraping-client-side-rendered-data-with-python-and-selenium
Thanks for reading.

When parsing many websites, how to make Scrapy stop yielding requests in the for loop if data is found and move to the next website

So, I am parsing emails from many websites
1)
I take them from the front page and from the contacts section ('kont' or 'cont' in hrefs)
There could be many links with 'kont' or 'cont' at the front page
I don't want to visit all of them in the "for" loop
I would like the program to go to another website when the data is found in one of those links (email_list_2 != []). how to do that?
2)
There is some redundancy in the code, I yield data at the front page because I am afraid the request from the for loop would be unsuccessful, in which case I will lose data from the front page.
Can I just yield {'site': site,
'email_list_1': email_list_1,
'email_list_2': []} if data is not found
or
{'site': site,
'email_list_1': email_list_1,
'email_list_2': ['xyz']} if data is found without double yielding?
Please help
Regards,
class QuotesSpider(scrapy.Spider):
name = 'enrichment'
start_urls = website_list
def parse(self, response):
site = response.url
data = response.text
email_list_1 = emailRegex.findall(data)
yield {'lvl': '1',
'site': site,
'email_list_1': email_list_1,
'email_list_2': [],
}
soup = BeautifulSoup(data,'lxml')
for link in soup.find_all('a'):
raw_url = link.get('href')
full_url = str(site) + str(raw_url)
if (re.search('cont', full_url) != None or
re.search('kont', full_url) != None):
yield scrapy.Request(url=full_url,
callback=self.parse_2d_level,
meta={'site': site,'email_list_1': email_list_1 }
)
def parse_2d_level(self, response):
site = response.meta['site']
email_list_1 = response.meta['email_list_1']
data_2 = response.text
email_list_2 = emailRegex.findall(data_2)
yield {'lvl': '2',
'site': site,
'email_list_1': email_list_1,
'email_list_2': email_list_2,
}

I'm not sure I fully understand your question, but here it goes:
1 - You want to scrape PAGE1, look for 'cont' or 'kont' and if these components exists, make a new request for PAGE2. In PAGE2 you search for a email_list_2 and yield results. You asked:
I would like the program to go to another website when the data is
found in one of those links (email_list_2 != []). how to do that?
What website do you want it to go? Is it a follow on the page you are already scraping? Is it another website in your start_urls?
At current state, after parsing PAGE2 (on parse_2d_level method) your spider will yield results, whether it found values for email_list_2 or not. If there are other requests on queue, scrapy will go on to execute those, if there aren't, the spider will end.
2- You want to make sure the data you already found before the loop is yielded in case the request from inside the loop fails. Since you said
the request from the for loop would be unsuccessful
I'll assume you are only worried about the REQUEST failure, there are other ways your parsing could fail.
For failed request you can catch and handle the issue with a scrapy signal called spider_error, take a look here.
3-You should take a look at Scrapy's selectors, they are a very powerful tool. You don't need beautiful soup for the parsing, and the Selectors will help a lot with the precision.

Scrape title by only downloading relevant part of webpage

I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites so it has to be fast. I've seen previous questions like retrieving just the title of a webpage in python, but all of the ones I've found download the entire page before retrieving the title, which seems highly inefficient as most often the title is contained within the first few lines of HTML.
Is it possible to download only the parts of the webpage until the title has been found?
I've tried the following, but page.readline() downloads the entire page.
import urllib2
print("Looking up {}".format(link))
hdr = {'User-Agent': 'Mozilla/5.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
req = urllib2.Request(link, headers=hdr)
page = urllib2.urlopen(req, timeout=10)
content = ''
while '</title>' not in content:
content = content + page.readline()
-- Edit --
Note that my current solution makes use of BeautifulSoup constrained to only process the title so the only place I can optimize is likely to not read in the entire page.
title_selector = SoupStrainer('title')
soup = BeautifulSoup(page, "lxml", parse_only=title_selector)
title = soup.title.string.strip()
-- Edit 2 --
I've found that BeautifulSoup itself splits the content into multiple strings in the self.current_data
variable (see this function in bs4), but I'm unsure how to modify the code to basically stop reading all remaining content after the title has been found. One issue could be that redirects should still work.
-- Edit 3 --
So here's an example. I have a link www.xyz.com/abc and I have to follow this through any redirects (almost all of my links use a bit.ly kind of link shortening). I'm interested in both the title and domain that occurs after any redirections.
-- Edit 4 --
Thanks a lot for all of your assistance! The answer by Kul-Tigin works very well and has been accepted. I'll keep the bounty until it runs out though to see if a better answer comes up (as shown by e.g. a time measurement comparison).
-- Edit 5 --
For anyone interested: I've timed the accepted answer to be roughly twice as fast as my existing solution using BeautifulSoup4.

You can defer downloading the entire response body by enabling stream mode of requests.
Requests 2.14.2 documentation - Advanced Usage
By default, when you make a request, the body of the response is
downloaded immediately. You can override this behaviour and defer
downloading the response body until you access the Response.content
attribute with the stream parameter:
...
If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close.
This can lead to inefficiency with connections. If you find yourself partially reading request bodies (or not reading them at all) while using stream=True, you should consider using contextlib.closing (documented here)
So, with this method, you can read the response chunk by chunk until you encounter the title tag. Since the redirects will be handled by the library you'll be ready to go.
Here's an error-prone code tested with Python 2.7.10 and 3.6.0:
try:
from HTMLParser import HTMLParser
except ImportError:
from html.parser import HTMLParser
import requests, re
from contextlib import closing
CHUNKSIZE = 1024
retitle = re.compile("<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
buffer = ""
htmlp = HTMLParser()
with closing(requests.get("http://example.com/abc", stream=True)) as res:
for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
buffer = "".join([buffer, chunk])
match = retitle.search(buffer)
if match:
print(htmlp.unescape(match.group(1)))
break

Question: ... the only place I can optimize is likely to not read in the entire page.
This does not read the entire page.
Note: Unicode .decode() will raise Exception if you cut a Unicode sequence in the middle. Using .decode(errors='ignore') remove those sequences.
For instance:
import re
try:
# PY3
from urllib import request
except:
import urllib2 as request
for url in ['http://www.python.org/', 'http://www.google.com', 'http://www.bit.ly']:
f = request.urlopen(url)
re_obj = re.compile(r'.*(<head.*<title.*?>(.*)</title>.*</head>)',re.DOTALL)
Found = False
data = ''
while True:
b_data = f.read(4096)
if not b_data: break
data += b_data.decode(errors='ignore')
match = re_obj.match(data)
if match:
Found = True
title = match.groups()[1]
print('title={}'.format(title))
break
f.close()
Output:
title=Welcome to Python.org
title=Google
title=Bitly | URL Shortener and Link Management Platform
Tested with Python: 3.4.2 and 2.7.9

You're scraping webpages using standard REST requests and I'm not aware of any request that only returns the title, so I don't think it's possible.
I know this doesn't necessarily help get the title only, but I usually use BeautifulSoup for any web scraping. It's much easier. Here's an example.
Code:
import requests
from bs4 import BeautifulSoup
urls = ["http://www.google.com", "http://www.msn.com"]
for url in urls:
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
print "Title with tags: %s" % soup.title
print "Title: %s" % soup.title.text
print
Output:
Title with tags: <title>Google</title>
Title: Google
Title with tags: <title>MSN.com - Hotmail, Outlook, Skype, Bing, Latest News, Photos & Videos</title>
Title: MSN.com - Hotmail, Outlook, Skype, Bing, Latest News, Photos & Videos

the kind of thing you want i don't think can be done, since the way the web is set up, you get the response for a request before anything is parsed. there isn't usually a streaming "if encounter <title> then stop giving me data" flag. if there is id love to see it, but there is something that may be able to help you. keep in mind, not all sites respect this. so some sites will force you to download the entire page source before you can act on it. but a lot of them will allow you to specify a range header. so in a requests example:
import requests
targeturl = "http://www.urbandictionary.com/define.php?term=Blarg&page=2"
rangeheader = {"Range": "bytes=0-150"}
response = requests.get(targeturl, headers=rangeheader)
response.text
and you get
'<!DOCTYPE html>\n<html lang="en-US" prefix="og: http://ogp.me/ns#'
now of course here's the problems with this
what if you specify a range that is too short to get the title of the page?
whats a good range to aim for? (combination of speed and assurance of accuracy)
what happens if the page doesn't respect Range? (most of the time you just get the whole response you would have without it.)
i don't know if this might help you? i hope so. but i've done similar things to only get file headers for download checking.
EDIT4:
so i thought of another kind of hacky thing that might help. nearly every page has a 404 page not found page. we might be able to use this to our advantage. instead of requesting the regular page. request something like this.
http://www.urbandictionary.com/nothing.php
the general page will have tons of information, links, data. but the 404 page is nothing more than a message, and (in this case) a video. and usually there is no video. just some text.
but you also notice that the title still appears here. so perhaps we can just request something we know does not exist on any page like.
X5ijsuUJSoisjHJFk948.php
and get a 404 for each page. that way you only download a very small and minimalistic page. nothing more. which will significantly reduce the amount of information you download. thus increasing speed and efficiency.
heres the problem with this method: you need to check somehow if the page does not supply its own version of the 404. most pages have it because it looks good with the site. and its standard practice to include one. but not all of them do. make sure to handle this case.
but i think that could be something worth trying out. over the course of thousands of sites, it would save many ms of download time for each html.
EDIT5:
so as we talked about, since you are interested in urls that redirect. we might make use of an http head reqeust. which wont get the site content. just the headers. so in this case:
response = requests.head('http://myshortenedurl.com/5b2su2')
replace my shortenedurl with tunyurl to follow along.
>>>response
<Response [301]>
nice so we know this redirects to something.
>>>response.headers['Location']
'http://stackoverflow.com'
now we know where the url redirects to without actually following it or downloading any page source. now we can apply any of the other techniques previously discussed.
Heres an example, using requests and lxml modules and using the 404 page idea. (be aware, i have to replace bit.ly with bit'ly so stack overflow doesnt get mad.)
#!/usr/bin/python3
import requests
from lxml.html import fromstring
links = ['http://bit'ly/MW2qgH',
'http://bit'ly/1x0885j',
'http://bit'ly/IFHzvO',
'http://bit'ly/1PwR9xM']
for link in links:
response = '<Response [301]>'
redirect = ''
while response == '<Response [301]>':
response = requests.head(link)
try:
redirect = response.headers['Location']
except Exception as e:
pass
fakepage = redirect + 'X5ijsuUJSoisjHJFk948.php'
scrapetarget = requests.get(fakepage)
tree = fromstring(scrapetarget.text)
print(tree.findtext('.//title'))
so here we get the 404 pages, and it will follow any number of redirects. now heres the output from this:
Urban Dictionary error
Page Not Found - Stack Overflow
Error 404 (Not Found)!!1
Kijiji: Page Not Found
so as you can see we did indeed get out titles. but we see some problems with the method. namely some titles add things, and some just dont have a good title at all. and thats the issue with that method. we could however try the range method too. benefits of that would be the title would be correct, but sometimes we might miss it, and sometimes we have to download the whole pagesource to get it. increasing required time.
Also credit to alecxe for this part of my quick and dirty script
tree = fromstring(scrapetarget.text)
print(tree.findtext('.//title'))
for an example with the range method. in the loop for link in links: change the code after the try catch statement to this:
rangeheader = {"Range": "bytes=0-500"}
scrapetargetsection = requests.get(redirect, headers=rangeheader)
tree = fromstring(scrapetargetsection.text)
print(tree.findtext('.//title'))
output is:
None
Stack Overflow
Google
Kijiji: Free Classifieds in...
here we see urban dictionary has no title or ive missed it in the bytes returned. in any of these methods there are tradeoffs. the only way to get close to total accuracy would be to download the entire source for each page i think.

using urllib you can set the Range header to request a certain range of bytes, but there are some consequences:
it depends on the server to honor the request
you assume that data you're looking for is within desired range (however you can make another request using different range header to get next bytes - i.e. download first 300 bytes and get another 300 only if you can't find title within first result - 2 requests of 300 bytes are still much cheaper than whole document)
(edit) - to avoid situations when title tag splits between two ranged requests, make your ranges overlapped, see 'range_header_overlapped' function in my example code
import urllib
req = urllib.request.Request('http://www.python.org/')
req.headers['Range']='bytes=%s-%s' % (0, 300)
f = urllib.request.urlopen(req)
just to verify if server accepted our range:
content_range=f.headers.get('Content-Range')
print(content_range)

my code also solves cases when title tag is splitted between chunks.
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Tue May 30 04:21:26 2017
====================
#author: s
"""
import requests
from string import lower
from html.parser import HTMLParser
#proxies = { 'http': 'http://127.0.0.1:8080' }
urls = ['http://opencvexamples.blogspot.com/p/learning-opencv-functions-step-by-step.html',
'http://www.robindavid.fr/opencv-tutorial/chapter2-filters-and-arithmetic.html',
'http://blog.iank.org/playing-capitals-with-opencv-and-python.html',
'http://docs.opencv.org/3.2.0/df/d9d/tutorial_py_colorspaces.html',
'http://scikit-image.org/docs/dev/api/skimage.exposure.html',
'http://apprize.info/programming/opencv/8.html',
'http://opencvexamples.blogspot.com/2013/09/find-contour.html',
'http://docs.opencv.org/2.4/modules/imgproc/doc/geometric_transformations.html',
'https://github.com/ArunJayan/OpenCV-Python/blob/master/resize.py']
class TitleParser(HTMLParser):
def __init__(self):
HTMLParser.__init__(self)
self.match = False
self.title = ''
def handle_starttag(self, tag, attributes):
self.match = True if tag == 'title' else False
def handle_data(self, data):
if self.match:
self.title = data
self.match = False
def valid_content( url, proxies=None ):
valid = [ 'text/html; charset=utf-8',
'text/html',
'application/xhtml+xml',
'application/xhtml',
'application/xml',
'text/xml' ]
r = requests.head(url, proxies=proxies)
our_type = lower(r.headers.get('Content-Type'))
if not our_type in valid:
print('unknown content-type: {} at URL:{}'.format(our_type, url))
return False
return our_type in valid
def range_header_overlapped( chunksize, seg_num=0, overlap=50 ):
"""
generate overlapping ranges
(to solve cases when title tag splits between them)
seg_num: segment number we want, 0 based
overlap: number of overlaping bytes, defaults to 50
"""
start = chunksize * seg_num
end = chunksize * (seg_num + 1)
if seg_num:
overlap = overlap * seg_num
start -= overlap
end -= overlap
return {'Range': 'bytes={}-{}'.format( start, end )}
def get_title_from_url(url, proxies=None, chunksize=300, max_chunks=5):
if not valid_content(url, proxies=proxies):
return False
current_chunk = 0
myparser = TitleParser()
while current_chunk <= max_chunks:
headers = range_header_overlapped( chunksize, current_chunk )
headers['Accept-Encoding'] = 'deflate'
# quick fix, as my locally hosted Apache/2.4.25 kept raising
# ContentDecodingError when using "Content-Encoding: gzip"
# ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.',
# error('Error -3 while decompressing: incorrect header check',))
r = requests.get( url, headers=headers, proxies=proxies )
myparser.feed(r.content)
if myparser.title:
return myparser.title
current_chunk += 1
print('title tag not found within {} chunks ({}b each) at {}'.format(current_chunk-1, chunksize, url))
return False

Searching through a website directory, validate, then place URL in a list depending on content

I've been working on a script and I thought I would ask for help. I'm looking to search a series of websites, check if the site is valid. Then the next step would be to check for specific content on the site. If the site holds that content, place the URL in a list.
import urllib2
def getPage():
url="import urllib2
National=[]
Local=[]
Sports=[]
Culture=[]
def getPage():
url="http://readingeagle.com/section.aspx?id=2"
for i in range (0,100,1)
req = urllib2.Request(http://readingeagle.com/section.aspx?id=,i)
if "national" in response:
response = urllib2.urlopen(req)
return response.read()
for g in range (0,100,1)
if "national" in response:
National.append("http://readingeagle.com/section.aspx?id=,g"
# I would like to set-up an iteration to check the 'entryid from 1-100. If the term is found on the page, place the url in the list.
if __name__ == "__main__":
namesPage = getPage()
print (namesPage)

Here's my answer to the question of how to validate a given web site.
python check html valid
For checking the context of the page the tools consist of basic string methods, regex, or more sophisticated tools like lxml or beautifulsoup.
matchingSites = []
matchingSites.append(url) #Since you asked. :-p

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.