I am trying to web scrape text from a href inside a scope, which can't be accessed through xpath as I want to iterate through a table and find the text inside the box.
Here is a screenshot of what I want to find
I think you can do
import urllib.request
page_url = #Insert your url here
with urllib.request.urlopen(page_url) as f:
html = f.read().decode('utf-8')
html.find(#whatever)
which I think should return the entirety of the html for the page, which you can then scrape for whatever you need.
According to information provided by you, you wanted to print the text, for example in this case: "3i Group PLC".
I just call the xpath link for 3i Group PLC is link_abc because the xpath link is not provided by you.
Here is the code by using selenium library:
name = driver.find_element_by_xpath("link_abc")
print(name.text)
The output should be 3i Group PLC.
Related
I want to crawl the data from this website
I only need the text "Pictograph - A spoon 勺 with something 一 in it"
I checked Network -> Doc and I think the information is hidden here.
Because I found there's a line is
i.length > 0 && (r += '<span>» Formation: <\/span>' + i + _Eb)
And I think this page generates part of the page that we can see from the link.
However, I don't know what is the code? It has html, but it also contains so many function().
Update
If the code is Javascript, I would like to know how can I crawl the website not using Selenium?
Thanks!
This page use JavaScript to add this element. Using Selenium I can get HTML after adding this element and then I can search text in HTML. This HTML has strange construction - all text is in tag so this part has no special tag to find it. But it is last text in this tag and it starts with "Formation:" so I use BeautifulSoup to ge all text with all subtags using get_text() and then I can use split('Formation:') to get text after this element.
import selenium.webdriver
from bs4 import BeautifulSoup as BS
driver = selenium.webdriver.Firefox()
driver.get('https://www.archchinese.com/chinese_english_dictionary.html?find=%E4%B8%8E')
soup = BS(driver.page_source)
text = soup.find('div', {'id': "charDef"}).get_text()
text = text.split('Formation:')[-1]
print(text.strip())
Maybe Selenium works slower but it was faster to create solution.
If I could find url used by JavaScript to load data then I would use it without Selenium but I didn't see these information in XHR responses. There was few responses compressed (probably gzip) or encoded and maybe there was this text but I didn't try to uncompress/decode it.
I'm trying to retrieve a list of downloadable xls files on a website.
I'm a bit reluctant to provide full links to the website in question.
Hopefully I'm able to provide all necessary details all the same.
If this is useless, please let me know.
Download .xls files from a webpage using Python and BeautifulSoup is a very similar question, but the details below will show that the solution most likely will have to be different since the links on that particular site are tagged with a href anchor:
And the ones I'm trying to get are not tagged the same way.
On the webpage, the files that are available for downloading are listed like this:
A simple mousehover gives these further details:
I'm following the setup here with a few changes to produce the snippet below that provides a list of some links, but not to any of the xls files:
from bs4 import BeautifulSoup
import urllib
import re
def getLinks(url):
with urllib.request.urlopen(url) as response:
html = response.read()
soup = BeautifulSoup(html, "lxml")
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
links.append(link.get('href'))
return links
links1 = getLinks("https://SOMEWEBSITE")
A further inspection using ctrl+shift+I in Google Chrome reveals that those particular links do not have a href anchor tag, but rather a ng-href anchor tag:
So I tried changing that in the snippet above, but with no success.
And I've tried different combinations with e.compile("^https://"), attrs={'ng-href' and links.append(link.get('ng-href')), but still with no success.
So I'm hoping someone has a better suggestion!
EDIT - Further details
It seems it's a bit problematic to read these links directly.
When I use ctrl+shift+I and the Select an element in the page to inspect it Ctrl+Shift+C, this is what I can see when I hover over one of the links listed above:
And what I'm looking to extract here is the information associated with the ng-href tag. But If I right-click the page and select Show Source, the same tag only appears once along with som metadata(?):
And I guess this is why my rather basic approach is failing in the first place.
I'm hoping this makes sense to some of you.
Update:
using selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()
driver.get('http://.....')
# wait max 15 second until the links appear
xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(#ng-href, ".xls")]'))
# Or
# xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(#href, ".xls")]'))
links = []
for link in xls_links:
url = "https://SOMEWEBSITE" + link.get_attribute('ng-href')
print(url)
links.append(url)
Assume ng-href is not dynamically generated, from your last image I see that the URL is not starts with https:// but the slash / you can try with regex URL contains .xls
for link in soup.findAll('a', attrs={'ng-href': re.compile(r"\.xls")}):
xls_link = "https://SOMEWEBSITE" + link['ng-href']
print(xls_link)
links.append(xls_link)
My guess is that the data you are trying to crawl is created dynamically: ng-href is one of AngularJs's constructs. You could try using Google Chrome's Network inspection as you already did (ctrl+shift+I) and see if you can find the url that is queried (open the network tab and reload the page). The query should typically return a JSON with the links to the xls-files.
There is a thread about a similar problem here. Perhaps that helps you: Unable to crawl some href in a webpage using python and beautifulsoup
I want to retrieve all links from a website that contain a specific phrase.
An example on a public website would be to retrieve all videos from a large youtube channel (for example Linus Tech Tips):
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.youtube.com/user/LinusTechTips/videos'
html = requests.get(url)
soup = bs(html.content, "html.parser")
current_link = ''
for link in soup.find_all('a'):
current_link = link.get('href')
print(current_link)
Now I have 3 problems here:
How do I get only hyperlinks containing a phrase like "watch?v="
Most hyperlinks aren't shown. In the browser: They appear when you scroll down. BeautifulSoup does only find the links which can be found without scrolling. How can I retrieve all hyperlinks?
All hyperlinks appear two times. How can I only choose each hyperlink once?
Any suggestions?
How do I get only hyperlinks containing a phrase like "watch?v="
Add a single if statement above your print statement
if 'watch?v=' in current_link:
print(current_link)
All hyperlinks appear two times. How can I only choose each hyperlink once?
Store all hyperlinks in a dictionary as the key and set the value to any arbitrary number (dictionaries only allow a single key entry so you wont be able to add duplicates)
Something like this:
myLinks = {} //declare a dictionary variable to hold your data
if 'watch?v=' in current_link:
print(current_link)
myLinks[currentLink] = 1
You can iterate over the keys (links) in the dictionary like this:
for link,val in myLinks:
print(link)
This will print all the links in your dictionary
Most hyperlinks aren't shown. In the browser: They appear when you scroll down. BeautifulSoup does only find the links which can be found without scrolling. How can I retrieve all hyperlinks?
I'm unsure as to how you directly get around the scripting on the page you have directed us to but you could always crawl the links you get from the initial scrape and rip new links off the side panels/traverse them, this should give you most, if not all, of the links you want.
To do so you would want another dictionary to store the already traversed links/check if you already traversed them. You can check for a key in a dictionary like so:
if key in myDict:
print('myDict has this key already!')
I would use the request library,
for python3
import urllib.request
import requests
SearchString="SampleURL.com"
response = requests.get(SearchString, stream=True)
zeta= str(response.content)
with open ("File.txt" , "w") as l:
l.write(zeta)
l.close()
#And now open up the file with the information written to it
x = open("File.txt", "r")
jello = []
for line in x:
jello.append(line)
t = (jello[0].split(""""salePrice":""",1)[1].split(",",1)[0] )
#you'll notice above that I have the keyword "salePrice", this should be a unique identifier in the pages xpath. typically f12 in chrome and then navigating til the item is highlighted gives you the xpath if you right click and copy
#Now this will only return a single result, youll want to use a for loop to iterate over the File.txt until you find all the separate results
I hope this helps Ill keep an eye on this thread if you need more help.
Part One and Three:
Create a list and append links to the list:
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.youtube.com/user/LinusTechTips/videos'
html = requests.get(url)
soup = bs(html.content, "html.parser")
links = [] # see here
for link in soup.find_all('a'):
links.append(link.get('href')) # and here
Then create a set and convert it back to list to remove duplicates:
links = list(set(links))
Now return the items of interest:
clean_links = [i for i in links if 'watch?v=' in i]
Part Two:
In order to navigate through the site you may need more than just Beautiful Soup. Scrapy has a great API that allows you to pull down a page and explore how you want to parse parent and child elements with xpath. I highly encourage you to try Scrapy and use the interactive shell to tweak your extraction method.
HELPFUL LINK
I want to scrape the pricing data from an eCommerce site called flipkart, I tried using Beautifulsoup with casperjs(nodejs utility) and similar libraries but none of them is good enough.
Here's the URL and the structure.
https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct?
the problem is the layout...What are some ways to get around this?
P.S : Is there anyway I could apply machine learning for getting the pricing data without knowing complex math? Like where do i even start?
You should probably construct your XPath in a way so it does not rely on the class, but rather on the content (node()) of the element you want to match. Alternatively you could match the data-reactid if that doesn't change?
For matching the div by data-reactid:
//div[#data-reactid=220]
Or for matching the div based on its location:
//span[child::img[#src="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/fa_8b4b59.png"]]/preceding-sibling::div
Assuming the img_path doesn't change you're on the safe side.
Since you can't use xpath due to dynamic changing you probably could try to use a regex for finding a price in the script tag on the page.
Something like this:
import requests
import re
url = "https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct"
r = requests.get(url)
pattern = re.compile('prexoAvailable\":[\w]+,\"price\":(\d+)')
result = pattern.search(r.text)
print(result.group(1))
from bs4 import BeatifulSoup
page = request.get(url, headers)
soup = BeautifulSoup(page.content, 'html.parser')
for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
print(price.text)
E-commerce have does not allow anymore to scrape data like before, every entity of the product like product price, specification, reviews are now enclosed in a separate “Dynamic” class name.
And scraping certain data from the webpage you need to use specific class name which is dynamic. So using request.get() or soup() won't work.
Any way to do it using crawl spider? Not yielding requests. Just an example would suffice. I want to use the href text as the title of web page and have a link to the url that contained the link. I'm just using basic selectors to fill my item, but not sure how to get this information.
Edit:
I looked into it and I want to be able to pass in meta data of the href title and referencing url and also be able to comply with the rules I've defined rather than having to get all urls and conditioning on them myself.
meta={"hrefText" : ..., "refURL": ...}
see CrawlSpider code:
for link in links:
r = Request(url=link.url, callback=self._response_downloaded)
r.meta.update(rule=n, link_text=link.text)
yield rule.process_request(r)
meaning you can get href text from response.meta['link_text']