I want to monitor an announcement webpage, so that when a new announcement comes in, I can execute tasks as quickly as possible. Currently I'm using Python with the requests package:
allText = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
and then find the first occurrence of the text with a particular header corresponding to an article item:
ind = allText.find('<li class="article-list-item">'); allText = allText[ind:]; ind = allText.find('</a>'); allText = allText[0:ind]
I'm repeating the command (i.e. refreshing the page) every ~1.5 seconds.
The problems are:
it's not fast enough. It typically takes more than 3 seconds for my program to detect a new announcement after it appears. I guess the text search is taking up too much time. Is there a faster way?
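As an aside, the two `find` calls above copy the page text twice through slicing; a version that scans with explicit indices instead avoids that (a sketch, with a hypothetical helper name), though in practice the network round-trip usually dominates a 1.5-second polling loop:

```python
def extract_first_article(html):
    """Return the first article snippet, scanning with indices instead of slicing copies."""
    start = html.find('<li class="article-list-item">')
    if start == -1:
        return None  # marker not present on this page
    end = html.find('</a>', start)
    if end == -1:
        return None
    return html[start:end]
```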
on some websites, the articles are concealed and the requests call does not return anything, even though the browser can still see them. Example source code from such a webpage:
<div data-app="recent-activity" data-url="/hc/api/internal/recent_activities"></div>
How should I scrape this kind of page please?
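The `data-url` attribute in that snippet suggests the content is loaded by JavaScript from an internal API endpoint. One approach (a sketch; the endpoint path comes from the snippet above, but the base URL and the response format are assumptions) is to resolve that path against the page's URL and request it directly:

```python
from urllib.parse import urljoin

def api_endpoint(page_url, data_url):
    # Resolve the data-url attribute against the page's own URL.
    return urljoin(page_url, data_url)

# Hypothetical usage -- the base is whatever page carried the <div>:
endpoint = api_endpoint('https://example.zendesk.com/hc/en-us',
                        '/hc/api/internal/recent_activities')
# resp = requests.get(endpoint, headers={'User-Agent': 'Mozilla/5.0'})
# data = resp.json()  # assumption: the endpoint returns JSON
```

Inspect the browser's network tab to confirm what the endpoint actually returns before relying on it.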
At https://nb-bet.com/Results, I try to access every match page, but the site seems to block me for a while after the first batch of requests. I only get access to 60-80 matches, and then I get a 500 or 404 error (although the page is available and no protection page is shown if you open it through a browser; this happens only on the individual match pages, and https://nb-bet.com/Results itself still opens normally). The block disappears after about 30-40 minutes and I can access new matches again.
If I use time.sleep(random.uniform(5, 10)), I only get access to 5-7 matches. I've tried fake_headers, fake_useragent, and accessing the pages in random order, but to no avail. I need to find a solution without using proxies, etc. Any help would be greatly appreciated.
As an example, I provide links to 158 matches and show how I go through them; the goal is simply to get a 200 status code for each page in one pass (i.e. without a 30-40 minute break). The list of links had to be published on a separate site, because this site does not allow posting the question with such a large block of text; I hope you understand.
The list of links is here - https://pastebin.com/LPeCP5bQ
import requests

s = requests.Session()
for link in links:
    r = s.get(link)
    print(r.status_code)
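For reference, a common pattern when a site rate-limits like this is to pace requests and back off when the server pushes back. This is only a sketch and no guarantee against this particular site's protection; the `get` callable is injected so the pacing logic can be exercised without network access:

```python
import time

def fetch_all(links, get, delay=5.0, retries=2, backoff=60.0):
    """Fetch each link, pausing between requests and backing off on non-200s.

    `get` is any callable returning an object with a .status_code attribute
    (e.g. requests.Session().get), injected so this can be tested offline.
    """
    results = {}
    for link in links:
        for attempt in range(retries + 1):
            status = get(link).status_code
            if status == 200:
                break
            time.sleep(backoff)  # server pushed back; wait before retrying
        results[link] = status
        time.sleep(delay)  # polite gap between pages
    return results
```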
I am trying to scrape the earnings calendar data from the table on zacks.com; the URL is attached below.
https://www.zacks.com/stock/research/aapl/earnings-calendar
The thing is, I want to scrape all the data from the table, but it has a dropdown list to select 10, 25, 50 or 100 rows per page. Ideally I want to scrape all 100 rows, but when I select 100 from the dropdown list, the URL doesn't change. My code is below.
Note that the website blocks the default user agent, so I had to use ChromeDriver to impersonate a human visiting the site. The result of pd.read_html is a list of all the tables, and d[4] returns the earnings calendar with only 10 rows (which I want to change to 100).
import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome('../files/chromedriver96')
symbol = 'AAPL'
url = 'https://www.zacks.com/stock/research/{}/earnings-calendar'.format(symbol)
driver.get(url)
content = driver.page_source
d = pd.read_html(content)
d[4]
So I'm calling for help from anyone who can guide me on this.
Thanks!
UPDATE: it looks like my last post was downvoted due to a lack of clear articulation and evidence of past research. Maybe I am still a newbie at posting questions on this site. I have actually found several pages, including this one, with the same issue, but the solutions didn't seem to work for me, which is why I posted this as a new question.
UPDATE 12/05:
Thanks a lot for the advice. As commented below, I finally got it working. Below is the code I used:
dropdown = driver.find_element_by_css_selector('#earnings_announcements_earnings_table_length')
time.sleep(1)
hundreds = dropdown.find_element_by_xpath(".//option[. = '100']")
hundreds.click()
Having taken a look, this is not going to be something that is easy to scrape. Given that the table is produced by JavaScript, I would say you have two options.
Option one:
Use Selenium to render the page, allowing the JavaScript to run. This way you can simply use the id/class of the dropdown to interact with it.
You can then scrape the data by looking at the values in the table.
Option two:
This is the more challenging one. Look through the requests the page makes and try to find the ones whose responses contain the data you then see on the page. By cross-referencing these, there may be a way to request that data directly.
You may find that to get at the data you need to accept a key from the original request to the page and then send that key as part of a second request. This approach lets you scrape the data without running a Selenium instance, so it runs more efficiently.
My personal suggestion is to go with option one, as computer resources are cheap and developer time is expensive.
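A generic sketch of option two (every URL, parameter name, and response shape here is a placeholder; the real ones have to be discovered in the browser dev tools' network tab). The session is passed in so the logic can be exercised with a stub:

```python
def fetch_table(session, endpoint, key):
    """Replay a discovered XHR directly, sending the key the page handed out.

    `endpoint` and the 'key' parameter name are hypothetical placeholders --
    find the real ones by inspecting the page's network traffic.
    """
    resp = session.get(endpoint, params={'key': key})
    resp.raise_for_status()  # surface 4xx/5xx instead of parsing an error page
    return resp.json()
```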
I am creating a simple application where I have to follow links from a page, and so on, thus building a very basic prototype of a web crawler.
When I was testing it, I came across robots.txt, which can set a rate limit for external crawlers trying to crawl a site. For example, if a website's robots.txt specifies a limit of no more than 1 hit per second from a given IP (as wikipedia.org's does), and I crawl a few pages of Wikipedia at the rate of 1 page per second, how do I estimate how many hits that will incur?
Question: if I download one entire page through Python's urllib, how many hits does it account for?
Here is my Example Code:
import urllib.request

opener = urllib.request.FancyURLopener({})
open_url = opener.open(a)  # a holds the URL of the page to fetch
page = open_url.read()
print(page)
If you download an entire page from a site with urllib, it will account as one (1) hit.
Save the page into a variable, and work with this variable from now on.
Additionally, I'd advise you to use requests instead of urllib. It's much easier/better/stronger.
Link to the documentation of Requests.
One thing you can do is put a time gap between two requests; this will solve your problem, and it also prevents you from getting blocked.
I'm new to developing Python scripts, but I'm trying to write a script that will inform me when a web page has been updated. For each check I use a counter to see how many times the program has run until the site has updated in some way.
My doubt is this: when I feed it the URL "stackoverflow.com", my program can run up to 6 times, but when I feed it "stackoverflow.com/questions", the program runs at most once. Both sites seem to update their questions often on refresh. Could someone explain why there is such a big difference in the number of times the program runs?
import urllib2
import time

refreshcnt = 0
url = raw_input("Enter the site to check")
x = int(raw_input("Enter the time duration to refresh"))
url = "http://" + url
response = urllib2.urlopen(url)
html = response.read()
htmlnew = html
while html == htmlnew:
    time.sleep(x)
    try:
        htmlnew = urllib2.urlopen(url).read()
    except IOError:
        print "Can't open site"
        break
    refreshcnt += 1
    print "Refresh Count", refreshcnt
print "The site has updated!"
Just add this little loop to the end of your code and see what's changing:
for i in xrange(min(len(htmlnew), len(html))):
    if htmlnew[i] != html[i]:
        print(htmlnew[i-20:i+20])
        print(html[i-20:i+20])
        break
I tried it quickly, and it appears that there is a ServerTime key that is updated every second. For one reason or another, this key is updated every second on the "/questions" page, but only every half a minute or so on the homepage.
However, from a couple of other quick checks, this is certainly not the only part of the HTML being updated on the "stackoverflow.com/questions" page. Comparing the entire HTML against the old one probably won't work in many situations. You'll likely want to search for a specific part of the HTML and then see whether that piece has changed. For example, look for the HTML signifying the title of the newest question on SO and see if that title is different than before.
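A sketch of that targeted comparison (the marker strings are hypothetical; pick markers that bracket the part of the page you actually care about):

```python
def section_between(html, start_marker, end_marker):
    """Return the substring between two markers, or None if either is missing."""
    start = html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return None
    return html[start:end]

def section_changed(old_html, new_html, start_marker, end_marker):
    # Compare only the bracketed section instead of the whole page.
    return (section_between(old_html, start_marker, end_marker)
            != section_between(new_html, start_marker, end_marker))
```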
I am learning web programming with Python, and one of the exercises I am working on is the following: I am writing a Python program to query the website "orbitz.com" and return the lowest airfare. The departure and arrival cities and dates are used to construct the URL.
I am doing this using the urlopen command, as follows:
(search_str contains the URL)
from lxml.html import parse
from urllib2 import urlopen
parsed = parse(urlopen(search_str))
doc = parsed.getroot()
links = doc.findall('.//a')
the_link = (links[j].text_content()).strip()
The idea is to retrieve all the links from the query results and search for strings such as "Delta", "United" etc, and read off the dollar amount next to the links.
It worked successfully until today; it looks like orbitz.com has changed their output page. Now, when you enter the travel details on the orbitz.com website, a page appears showing a spinning wheel saying "looking up itineraries" or something to that effect. This is just a filler page and contains no real information. After a few seconds, the real results page is displayed. Unfortunately, the Python code returns the links for the filler page each time, and I never obtain the real results.
How can I get around this? I am a relative beginner to web programming, so any help is greatly appreciated.
This kind of thing is normal in the world of crawlers.
What you need to do is figure out what URL the page redirects to after the "itinerary page", and hit that URL directly from your script.
Then figure out whether they have changed the final search results page too; if so, modify your script to accommodate those changes.
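If the final URL can't be hit directly, another common pattern is to poll until the filler page is gone. A sketch (the marker text and the fetch function are placeholders; `fetch` is injectable so the loop can be tested without network access):

```python
import time

def wait_for_results(fetch, url, filler_marker, attempts=10, delay=2.0):
    """Re-fetch `url` until the filler-page marker disappears from the HTML.

    `fetch` is any callable taking a URL and returning HTML text (e.g. a
    wrapper around urlopen); `filler_marker` is text that only appears on
    the interstitial page, such as "looking up itineraries".
    """
    for _ in range(attempts):
        html = fetch(url)
        if filler_marker not in html:
            return html  # the real results page
        time.sleep(delay)
    return None  # gave up: still seeing the filler page
```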