I am crawling some data from a website. I need to recover some links from a list of products. First I identified one of the links with inspect element:
Then I used requests to save all the source code of that page in a text file:
source_code = requests.get(link)
plain_text = source_code.text
Then I used my text editor to search for the link, and it did not find it. I'm working with BeautifulSoup4, and I have already tried several different ways to crawl the page to get the list of products, but all give the same result.
My suspicion is that the list of products is generated by some code (probably JavaScript) when someone enters the page, but I am not sure. I have spent several hours trying to make this work, so any hint will be appreciated.
Python never ceases to amaze me. I found a Python library that uses PhantomJS; it allows us to run JavaScript code inside a Python program. After a lot of work, I will answer my own question:
from ghost import Ghost
import re

def filterProductLinks(links):  # filter the useless links using a regex
    pLinks = list()
    for l in links:
        if re.match(".*productDetails.*", str(l)):
            pLinks.append(l)
    return pLinks  # list of item URLs (40 max)

def getProductLinks(url):  # get the links generated by JavaScript code
    ghost = Ghost(wait_timeout=100)
    ghost.open(url)
    links = ghost.evaluate("""
        var links = document.querySelectorAll("a");
        var listRet = [];
        for (var i=0; i<links.length; i++){
            listRet.push(links[i].href);
        }
        listRet;
        """)
    pLinks = filterProductLinks(links[0])
    return pLinks

# Test
pLinks = getProductLinks('http://www.lider.cl/walmart/catalog/category.jsp?id=CF_Nivel3_000042&pId=CF_Nivel1_000003&navAction=jump&navCount=0#categoryCategory=CF_Nivel3_000042&pageSizeCategory=20&currentPageCategory=1&currentGroupCategory=1&orderByCategory=lowestPrice&lowerLimitCategory=0&upperLimitCategory=0&&504')
for l in pLinks:
    print l
print len(pLinks)
The JavaScript code is not mine. I took it from a Ghost.py documentation page: Ghost.py Documentation
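For reference, PhantomJS is no longer maintained, and a roughly equivalent approach with Selenium (the tool used in other answers below) might look like the following. This is a minimal sketch, assuming geckodriver is installed and on the PATH, and it reuses the same productDetails regex filter:

# Hypothetical Selenium version of getProductLinks(); a real browser executes
# the page's JavaScript, so the product links exist in the rendered DOM.
import re
from selenium import webdriver

def get_product_links(url):
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # collect the href of every <a> element on the rendered page
        links = [a.get_attribute('href') for a in driver.find_elements_by_tag_name('a')]
    finally:
        driver.quit()
    # keep only the product-detail links, same filter as filterProductLinks() above
    return [l for l in links if re.match(".*productDetails.*", str(l))]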
Related
I've been at this all day and I'm getting a little overwhelmed, so let me explain: as a personal project, I want to scrape all the links using the acestream:// protocol from a website and turn them into a playlist for Acestream. For now I can either extract the links from the whole site (something like the site map) or extract the acestream links from a specific subpage. One of the problems I have is that, since the same acestream link appears several times on the page, I obviously get the same link multiple times, and I only want it once. Besides, I don't know how to do it (I'm very new to this) so that, instead of hard-coding the link, the script automatically takes it from a list of links in a .csv, because I need to get an acestream link from each link I put in the .csv. I'm sorry about the tirade; I hope it's not a nuisance.
I hope you understand; I translated this with Google Translate.
from bs4 import BeautifulSoup
import requests

# creating empty list
urls = []

# function created
def scrape(site):
    # getting the request from url
    r = requests.get(site)
    # converting the text
    s = BeautifulSoup(r.text, "html.parser")
    for i in s.find_all("a"):
        href = i.attrs['href']
        if href.startswith("acestream://"):
            site = site + href
            if site not in urls:
                urls.append(site)
                print(site)
                # calling the scrape function itself
                # generally called recursion
                scrape(site)

# main function
if __name__ == "__main__":
    site = "https://www.websitehere.com/index.htm"
    scrape(site)
Based on your last comment and your code, you can read in a .csv using
import pandas as pd
file_path = 'C:\<path to your csv>'
df = pd.read_csv(file_path)
csv_links = df['<your_column_name_for_links>'].to_list()
With this, you can get the URLs from the .csv. Just change the values in the <>.
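Putting the two together, a minimal sketch of the whole flow could look like this (the file path and column name are placeholders, and a set is used so each acestream link is kept only once):

# Hypothetical glue code combining the .csv reading with the scrape function.
import pandas as pd
import requests
from bs4 import BeautifulSoup

found = set()  # a set keeps each acestream link only once

def scrape(site):
    r = requests.get(site)
    s = BeautifulSoup(r.text, "html.parser")
    for a in s.find_all("a", href=True):
        href = a["href"]
        if href.startswith("acestream://") and href not in found:
            found.add(href)
            print(href)

df = pd.read_csv('<path to your csv>')                      # placeholder path
for link in df['<your_column_name_for_links>'].to_list():   # placeholder column
    scrape(link)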
I am attempting to scrape the figures shown on https://www.usdebtclock.org/world-debt-clock.html; however, because the numbers are constantly changing, I am unsure how to collect this data.
This is an example of what I am attempting to do.
import requests
from bs4 import BeautifulSoup
url ="https://www.usdebtclock.org/world-debt-clock.html"
URL=requests.get(url)
site=BeautifulSoup(URL.text,"html.parser")
data=site.find_all("span",id="X4a79R9BW")
print(data)
The result is this:
"[ ]"
when I was expecting
"$19,987,137,284,731"
Is there something I can change in order to extract the number?
BeautifulSoup cannot do this for you, because the data you need is provided by JavaScript, and BeautifulSoup does not support JS processing.
An alternative is to use a tool such as Selenium WebDriver:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.usdebtclock.org/world-debt-clock.html')
elem2 = driver.find_element_by_xpath('//span[@id="X4a79R9BW"]')
print(elem2.text)
driver.close()
If you have not used Selenium WebDriver before, you need to follow the installation instructions here.
In particular, you will need to follow the instructions for downloading the browser driver of your choice (I use geckodriver for Firefox). And make sure the executable is on your path.
(I expect there are other Python-based alternatives, also.)
Based on the page's code, I think what you want to accomplish may not be possible with BS. Running your code returned [<span id="X4a79R9BW"> </span>], and trying getText() on that returned nothing. When inspecting the page, I noticed that the numerical value in the span was continuously updating, as it does on the page. Viewing the page source showed that X4a79R9BW appears in five places: first to set aspects of the font, then in several places where an equation is being processed, and finally in the empty span scraped by your code. From viewing the source, it appears that the counter is an equation running inside a <script type="text/javascript"> tag. Here is what I think is the equation running under the JavaScript tag:
{'leftMargin':0,'color':-16751104,:0 */var X3a34729DW = /*144,:14 */ 96.9230013 /*751104,:0 */; var R3a45G7S = /*7104,:54 */ 0.000000306947 /*43,451134,:5 */; var Y12 = /*241,:15457 */ 18442.16666 /*19601*2*2*/*21600*2*2; /*79301*2*2*/ var Class = new Date(); var Method = Class.getTime() / 1000 - Y12a4798; var Public = X3a34729DW + Method * R3a45G7S; var Assign = FormatNumber2(Public); document.getElementById ('X3a34729DW') .firstChild.nodeValue = Assign; /*'advance':4289}
This section of the page's source indicates that the text you want is being continuously updated via JavaScript. Given that, it is my understanding that BS is not the appropriate library to complete the desired task. Though I have not used it myself, I've seen Selenium as a suggested library for scraping pages dynamically updated via JavaScript. Good luck, perhaps someone else can help provide a clearer path forward.
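For what it's worth, that quoted snippet boils down to computing base + elapsed_seconds * rate in the browser and writing the result into the span. A rough Python illustration of that logic, using the constants quoted above (the variable mapping is my guess, since the quoted snippet is partly garbled):

# Rough re-creation of the counter logic from the quoted page source.
import time

base = 96.9230013                    # X3a34729DW: value at the reference time
rate = 0.000000306947                # R3a45G7S: increase per second
start = 18442.16666 * 21600 * 2 * 2  # Y12: reference time as a Unix timestamp

elapsed = time.time() - start        # seconds elapsed since the reference time
print(base + elapsed * rate)         # the figure the page renders via JavaScript

This also shows why Selenium works where BeautifulSoup does not: the number only exists after a browser has run that script.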
I'm trying to retrieve a list of downloadable xls files on a website.
I'm a bit reluctant to provide full links to the website in question.
Hopefully I'm able to provide all necessary details all the same.
If this is useless, please let me know.
Download .xls files from a webpage using Python and BeautifulSoup is a very similar question, but the details below will show that the solution most likely has to be different, since the links on that particular site are tagged with an href attribute:
And the ones I'm trying to get are not tagged the same way.
On the webpage, the files that are available for downloading are listed like this:
A simple mousehover gives these further details:
I'm following the setup here with a few changes to produce the snippet below that provides a list of some links, but not to any of the xls files:
from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    with urllib.request.urlopen(url) as response:
        html = response.read()
    soup = BeautifulSoup(html, "lxml")
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

links1 = getLinks("https://SOMEWEBSITE")
links1 = getLinks("https://SOMEWEBSITE")
A further inspection using Ctrl+Shift+I in Google Chrome reveals that those particular links do not have an href attribute, but rather an ng-href attribute:
So I tried changing that in the snippet above, but with no success.
And I've tried different combinations with re.compile("^https://"), attrs={'ng-href' and links.append(link.get('ng-href')), but still with no success.
So I'm hoping someone has a better suggestion!
EDIT - Further details
It seems it's a bit problematic to read these links directly.
When I use Ctrl+Shift+I and then Select an element in the page to inspect it (Ctrl+Shift+C), this is what I can see when I hover over one of the links listed above:
And what I'm looking to extract here is the information associated with the ng-href tag. But if I right-click the page and select Show Source, the same tag only appears once, along with some metadata(?):
And I guess this is why my rather basic approach is failing in the first place.
I'm hoping this makes sense to some of you.
Update:
using selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('http://.....')

# wait max 15 seconds until the links appear
xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(@ng-href, ".xls")]'))
# Or
# xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(@href, ".xls")]'))

links = []
for link in xls_links:
    url = "https://SOMEWEBSITE" + link.get_attribute('ng-href')
    print(url)
    links.append(url)
Assuming ng-href is not dynamically generated: from your last image I see that the URL does not start with https:// but with a slash /, so you can try a regex that matches URLs containing .xls:
for link in soup.findAll('a', attrs={'ng-href': re.compile(r"\.xls")}):
    xls_link = "https://SOMEWEBSITE" + link['ng-href']
    print(xls_link)
    links.append(xls_link)
My guess is that the data you are trying to crawl is created dynamically: ng-href is one of AngularJS's constructs. You could try using Google Chrome's network inspection as you already did (Ctrl+Shift+I) and see if you can find the URL that is queried (open the Network tab and reload the page). That request should typically return JSON containing the links to the xls files.
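Once you have found that request in the Network tab, you can usually reproduce it directly with requests. A minimal sketch, where the endpoint URL and the JSON field name are placeholders you would replace with whatever the Network tab actually shows:

# Hypothetical sketch: call the backing API found via the Network tab.
import requests

resp = requests.get("https://SOMEWEBSITE/api/documents")  # placeholder endpoint
resp.raise_for_status()

xls_links = []
for item in resp.json():              # structure depends on the real response
    href = item.get("ngHref", "")     # placeholder field name
    if href.endswith(".xls"):
        xls_links.append("https://SOMEWEBSITE" + href)
print(xls_links)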
There is a thread about a similar problem here. Perhaps that helps you: Unable to crawl some href in a webpage using python and beautifulsoup
This is from the book "automate the boring stuff with python".
At first I made a .bat file and ran it with arguments from cmd; it didn't open any pages in Chrome. I looked it up on here and changed the code, and it still executes perfectly and prints the print line, but it doesn't open tabs as it should.
What am I doing wrong? Thanks in advance.
#! python3
# lucky.py opens several google search matches
import requests,sys,webbrowser,bs4
searchTerm1 = 'python'
print('Googling...')
res = requests.get('https://www.google.com/search?={0}'.format(searchTerm1))
res.raise_for_status()
#retrieve top search result links
soup = bs4.BeautifulSoup(res.text,"html.parser")
#open a browser tab for each result.
linkElems = soup.select('.r a')
numOpen = min(5,len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))
The short answer is that your URL is not returning results. Here's a URL that provides results: https://www.google.com/search?q=python.
I changed the one line in your code to use this template: "https://www.google.com/search?q={0}" and I saw that linkElems was non-trivial.
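A minimal sketch of just that change, letting requests build the query string for you:

# Pass the search term as the q parameter instead of a bare '?=' query string.
res = requests.get('https://www.google.com/search', params={'q': searchTerm1})
res.raise_for_status()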
In short, webbrowser is not opening any pages because numOpen is 0, so the for loop iterates over 0 items, which means the code inside that for loop block (webbrowser.open) never gets executed.
The longer, more detailed explanation is that numOpen = 0 because of a redirect that occurs with the initial GET request for your custom Google query. See this answer for how to circumvent these issues, as there are numerous ways; the easiest is probably to use the Google search API.
As a result of the redirect, your BeautifulSoup search will not return any successful results, causing the numOpen variable to be set to 0 as there will be no list elements. As there are no list elements, the for loop does not execute.
You can debug things like this on your own the quick and dirty, but not perfect, way by simply adding print statements throughout the script to see which print statements fail to execute as well as looking at the variables and their returned values.
As an aside, the shebang should also be set to #!/usr/bin/env python3 rather than simply #! python3. Reference here.
Hope this helps
I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')

results = []

while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")

    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass

    # Find the next page to scrape
    prev = soup.find("prev_signature")

    # Check if another page of results exists - if not, break from loop
    if prev == None:
        break

    # Get the previous URL
    url = prev.contents[0]

    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned, though: there is a lot of data there to go through, and I have no idea whether this is against the website's terms of service, so you would need to check that yourself.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on a site in a short amount of time, slowing down legitimate users' requests. Not to mention stealing all of their data.
Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).
Or if you do absolutely need to scrape:
Space your requests using a timer (see the sketch after this list)
Scrape smartly
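For the first point, the timer can be as simple as a sleep between page fetches. A minimal sketch; page_urls is a placeholder for whatever list of pages you end up fetching:

# Minimal illustration of spacing requests out with a timer.
import time
import urllib2

page_urls = []  # placeholder: the list of page URLs you want to fetch
for url in page_urls:
    html = urllib2.urlopen(url).read()
    # ... process html here ...
    time.sleep(5)  # pause a few seconds so you don't hammer the server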
I took a quick glance at that page and it appears to me they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be some sort of REST call. By doing this you lessen the load on their server by only requesting the data you need. It will also be easier for you to actually process the data, because it will be in a nice format.
Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
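If you want to check that programmatically, Python's standard library includes a robots.txt parser (robotparser in Python 2, urllib.robotparser in Python 3). A small sketch:

# Sketch: check robots.txt before fetching (Python 2 robotparser module).
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.thepetitionsite.com/robots.txt")
rp.read()
# prints False if /xml/ is disallowed for all user agents
print rp.can_fetch("*", "http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml")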
What do you mean by "not working"? An empty list, or an error?
If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.