I want to extract the text of a particular span shown in the snapshot, but I am unable to find the span by its class attribute. I have attached the HTML source (snapshot) of the data to be extracted as well.
Any suggestions?
import bs4 as bs
import urllib.request  # "import urllib" alone does not expose urllib.request in Python 3

sourceUrl = 'https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2'
source = urllib.request.urlopen(sourceUrl).read()
soup = bs.BeautifulSoup(source, 'html.parser')
count = soup.find('span', {'class': 'number'})
print(count)  # prints None - the span is not in the static HTML
If you disable JavaScript in your browser, you can easily see that the span element you want disappears.
One possible way to get that element is to use a Selenium-driven browser.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2')
span = driver.find_element_by_xpath('//li[3]/span')  # the counter rendered by JavaScript
print(span.text)
driver.close()
Output:
Another solution: find the desired value deep down in the web page source (in Chrome press Ctrl+U) and extract the span value using a regular expression.
import re
import requests

r = requests.get(
    'https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2')
# the post count is embedded as JSON in the raw page source
span = re.search(r'"posts_count":(\d+)', r.text)
print(span.group(1))
Output:
If you know how to use CSS selectors, you can use:
mySpan = soup.select("span.number")
It returns a list of all nodes matching the selector, so mySpan[0] could contain what you need. Then use one of the methods, for example get_text(), to extract the text.
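For example, a minimal sketch assuming the page has already been parsed into soup:
spans = soup.select("span.number")  # list of every node matching the selector
if spans:  # an empty list means nothing matched
    print(spans[0].get_text())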
First of all, you need to decode the response:
source = urllib.request.urlopen(sourceUrl).read().decode()
Maybe your issue will disappear after this fix.
I want to crawl the data from this website. I only need the text "Pictograph - A spoon 勺 with something 一 in it". I checked Network -> Doc and I think the information is hidden there, because I found this line:
i.length > 0 && (r += '<span>» Formation: <\/span>' + i + _Eb)
so I think this page generates part of what we see from that link. However, I don't understand that code: it has HTML, but it also contains so many function() calls.
Update
If the code is JavaScript, how can I crawl the website without using Selenium?
Thanks!
This page uses JavaScript to add this element. Using Selenium I can get the HTML after the element is added and then search the text in it. The HTML has a strange construction: all the text sits in one tag, so the part you want has no special tag of its own to find it by. But it is the last text in that tag and it comes after "Formation:", so I use BeautifulSoup to get all the text (with all subtags) using get_text() and then split('Formation:') to keep the text after that marker.
import selenium.webdriver
from bs4 import BeautifulSoup as BS

driver = selenium.webdriver.Firefox()
driver.get('https://www.archchinese.com/chinese_english_dictionary.html?find=%E4%B8%8E')

soup = BS(driver.page_source, 'html.parser')  # name a parser explicitly to avoid a warning
text = soup.find('div', {'id': "charDef"}).get_text()
text = text.split('Formation:')[-1]  # keep only the text after the "Formation:" label
print(text.strip())
driver.quit()
Maybe Selenium is slower, but it was faster to create this solution. If I could find the URL that the JavaScript uses to load the data, I would use it without Selenium, but I didn't see that information in the XHR responses. A few responses were compressed (probably gzip) or encoded, and maybe the text was in there, but I didn't try to uncompress/decode them.
How do I click an element using Selenium and BeautifulSoup in Python? I have the lines of code below and I find this difficult to achieve. I want to click every element in every iteration. There is no pagination or next page; there are only about 10 elements, and after clicking the last element it should stop. Does anyone know what I should do? Here is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import urllib
import urllib.request
from bs4 import BeautifulSoup

chrome_path = r"C:\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
url = 'https://www.99.co/singapore/condos-apartments/a-treasure-trove'
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
details = soup.select('.FloorPlans__container__rwH_w')  # whole container of the result
for d in details:
    picture = d.find('span', {'class': 'Tappable-inactive'}).click()  # the single element
    print(d)
driver.close()
Here is the site: https://www.99.co/singapore/condos-apartments/a-treasure-trove. I want to scrape the details and the image in every floor-plan section, but it is difficult because the image only appears after you click the specific element. I can get all the details except the image itself. Try it yourself so you see what I mean.
EDIT:
I tried this method:
for d in driver.find_elements_by_xpath('//*[@id="floorPlans"]/div/div/div/div/span'):
    d.click()
The problem is that it clicks too fast for the images to load. Also, I'm using Selenium here. Is there a way to click a BeautifulSoup selection, like picture = d.find('span', {'class': 'Tappable-inactive'}).click()?
You cannot interact with website widgets using BeautifulSoup; you need to work with Selenium. There are two ways to handle this problem.
The first is to get the main wrapper (class) of the 10 elements and then iterate over each child element of that class.
Alternatively, you can locate each element by XPath and increment the index at the end of the XPath by one on each iteration to move to the next element, as in the sketch below.
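A rough sketch of the second approach, using the pre-4.0 Selenium API from the question; which div carries the index is a guess, so adjust the XPath to the real page structure:

import time
from selenium import webdriver

driver = webdriver.Chrome(r"C:\chromedriver.exe")
driver.get('https://www.99.co/singapore/condos-apartments/a-treasure-trove')

# count the floor-plan spans once, then address each one by its 1-based index
count = len(driver.find_elements_by_xpath('//*[@id="floorPlans"]/div/div/div/div/span'))
for i in range(1, count + 1):
    driver.find_element_by_xpath(
        '//*[@id="floorPlans"]/div/div/div[{}]/div/span'.format(i)).click()
    time.sleep(1)  # give each image time to load before the next click

driver.close()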
I printed some results to check your code:
"details" only has one item, and "picture" is not a Selenium element (so it is not clickable).
details = soup.select('.FloorPlans__container__rwH_w')
print(details)
print(len(details))

for d in details:
    print(d)
    picture = d.find('span', {'class': 'Tappable-inactive'})
    print(picture)
Output:
For your edited version, you can check that each image has become visible before you call click().
Use visibility_of_element_located to do that.
Reference: https://selenium-python.readthedocs.io/waits.html
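A minimal sketch, reusing the driver from the question and the XPath from its edit; the 10-second timeout is an arbitrary choice:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
xpath = '//*[@id="floorPlans"]/div/div/div/div/span'
count = len(driver.find_elements_by_xpath(xpath))
for i in range(1, count + 1):
    # block until the i-th span is actually visible, then click it
    span = wait.until(EC.visibility_of_element_located(
        (By.XPATH, '({})[{}]'.format(xpath, i))))
    span.click()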
I am scraping a website to get a number. This number changes dynamically every split second, but upon inspection the number is shown. I just need to capture that number, but the div wrapper that contains it returns no value. What am I missing? (Please go easy on me, as I am quite new to Python and scraping.)
I have some code that works and returns the piece of HTML that supposedly contains the data I want, but no joy: the div wrapper returns no value.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://deuda-publica-espana.com')
deuda = BeautifulSoup(r.text, 'html.parser')
deuda = deuda.find_all('div', {'id': 'contador_PDEH'})  # find_all is the idiomatic spelling of findAll
print(deuda)
I don't receive any errors, I am just getting [<div class="contador_xl contador_verde" id="contador_PDEH"></div>] with no value!
Indeed, it is easy with Selenium. I suspect a JS script runs a counter that supplies the number, which is why you can't find it with your method (as mentioned in the comments).
from selenium import webdriver
d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://deuda-publica-espana.com/')
print(d.find_element_by_id('contador_PDEH').text)  # the counter text filled in by JavaScript
d.quit()
from selenium import webdriver
import re

driver = webdriver.Chrome(executable_path=r"C:\Users\chromedriver")
sentence = "chiropractor in maryland"
url = "https://google.com/search?hl=en&q={}".format(sentence)
driver.get(url)
links = driver.find_elements_by_xpath('//a[@href]')  # "@href", not "#href"
maps = [i for i in links if i.text == "Maps"][0].click()
html = driver.page_source
#ChIJaYGxdRj9t4kRcJmJlvQkKX0
#ChIJCf4MzWjgt4kRluBnhQTHlBM
#ChIJBXxr8brIt4kRVE-gIYDyV8c
#ChIJX0W_Xo4syIkRUAtRFy8nz1Y place ids in html
Hello, this is my first Selenium project. I am trying to find the place ids in the results; I have added some place ids above (which I got using the API). I tried to find them in the inspector tools but couldn't; however, they are available in the page source. Trying a regex, they seem to follow this pattern:
2,[null,null,\\"bizbuilder:gmb_web\\",[6,7,4,1,3]\\n]\\n]\\n]\\n,1,null,null,null,null,null,null,[\\"-8523065488279764631\\",\\"9018780361702349168\\"]\\n]\\n]\\n]\\n,null,null,null,[[\\"chiropractor\\"]\\n]\\n,null,\\"ChIJaYGxdRj9t4kRcJmJlvQkKX0\\",null,null,null,[\\"South Gate\\",\\"806 Landmark Dr Suite 126\\",\\"806 Landmark Dr Suite 126\\",\\"Glen Burnie\\"]\\n,null,null,null,null,null,[null,\\"SearchResult.TYPE_PERSONAL_
That is, the id sits where "Place ID" appears in "\"chiropractor\"]\n]\n,null,\"Place ID",null ..., but I can't find the right regex for it.
I need help writing the correct regex, or another way of finding the place_id. I hope no one answers by just referring me to the API.
I think this could be improved, but the string itself sits in a script tag containing window.APP_OPTIONS. Each of those ids starts with ChIJ, is followed by a defined character set, and is 27 characters long in total.
I have also started directly from the maps page rather than clicking through to it. I didn't need a wait condition despite several runs; one could be added if wanted/required.
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import re

sentence = "chiropractor in maryland"
url = 'https://www.google.com/maps/search/{}'.format(sentence)
d = webdriver.Chrome()
d.get(url)
soup = bs(d.page_source, 'lxml')

# pick out the script tag that carries the data blob
for script in soup.select('script'):
    if 'window.APP_OPTIONS' in script.text:
        script = script.text
        break

r = re.compile(r'(ChIJ[a-zA-Z\.0-9\-\_]{23})')
items = r.findall(script)
print(items)
d.quit()
A little riskier: you could work off page_source directly.
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import re
sentence = "chiropractor in maryland"
url = 'https://www.google.com/maps/search/{}'.format(sentence)
d = webdriver.Chrome()
d.get(url)
r = re.compile(r'(ChIJ[a-zA-Z\.0-9\-\_]{23})')
items = r.findall(d.page_source)
print(items)
d.quit()
Notes:
The pattern is designed to match only the required items for the given search. It is conceivable that, in future or different searches, the same pattern could occur in a string that is not an id. The page_source is a larger search space and therefore carries a greater likelihood of an unwanted string matching the pattern; the script tag is not only where you would expect to find the ids, it is also a smaller search space. Over time you might also want to check whether the character set needs additional characters to match new ids. You can easily sanity-check the match count against the number of results per page.
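As a sketch of tightening the pattern, you could additionally require the quotes that surround each id; the optional backslash accounts for the escaped quotes (\") visible in the snippet above, which is an assumption about how the blob is quoted:

import re

tight = re.compile(r'\\?"(ChIJ[a-zA-Z0-9._\-]{23})\\?"')
items = set(tight.findall(script))  # set() de-duplicates repeated ids
print(items)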
I am trying to count the number of items in a list box on a webpage and then select multiple items from it. I can select the items fine; I am just struggling to count the items in the list box.
See the code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
... ...
accountListBox = Select(driver.find_element_by_id("ctl00_MainContent_accountItemsListBox"))
accountListBox.select_by_index(0)
print(len(accountListBox))
I have tried using len(), which results in the error "TypeError: object of type 'Select' has no len()".
I have also tried accountListBox.size(), and removing Select from line 3, which also doesn't work.
I'm pretty new to this, so I would appreciate your feedback.
Thanks!
According to the docs, the list of a Select element's options can be obtained via select.options. In your case that is accountListBox.options, and you need to call len() on that, not on the Select instance itself:
print(len(accountListBox.options))
Or, if you only want to print a list of currently selected options:
print(len(accountListBox.all_selected_options))
You could instead use find_elements with a selector common to all of the list box's items, store the found elements in a variable, and count them with Python's built-in len().
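A minimal sketch, assuming the list box is a standard select element whose items are option tags:

# find every item of the list box and count the resulting list
items = driver.find_elements_by_css_selector(
    '#ctl00_MainContent_accountItemsListBox option')
print(len(items))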
I usually use Selenium along with Beautiful Soup. Beautiful Soup is a Python package for parsing HTML and XML documents.
With Beautiful Soup you can get the count of items in a list box in the following way:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.PhantomJS() # or webdriver.Firefox()
driver.get('http://some-website.com/some-page/')
html = driver.page_source.encode('utf-8')
b = BeautifulSoup(html, 'lxml')
items = b.find_all('p', attrs={'id':'ctl00_MainContent_accountItemsListBox'})
print(len(items))
I assumed that the DOM element you want to find is a paragraph (p tag), but you can replace this with whatever element you need to find.