from selenium import webdriver
import re
driver= webdriver.Chrome(executable_path=r"C:\Users\chromedriver")
sentence = "chiropractor in maryland"
url="https://google.com/search?hl=en&q={}".format(sentence)
driver.get(url)
links=driver.find_elements_by_xpath('//a[@href]')
maps=[i for i in links if i.text=="Maps"][0].click()
html=driver.page_source
#ChIJaYGxdRj9t4kRcJmJlvQkKX0
#ChIJCf4MzWjgt4kRluBnhQTHlBM
#ChIJBXxr8brIt4kRVE-gIYDyV8c
#ChIJX0W_Xo4syIkRUAtRFy8nz1Y place ids in html
Hello, this is my first Selenium project. I am trying to find the place IDs from the results; I have added some of the place IDs above (I got them using the API). I tried to find them with the inspector tools but couldn't; however, they are available in the page source. I tried using regex, and it seems that they appear in the following context:
2,[null,null,\\"bizbuilder:gmb_web\\",[6,7,4,1,3]\\n]\\n]\\n]\\n,1,null,null,null,null,null,null,[\\"-8523065488279764631\\",\\"9018780361702349168\\"]\\n]\\n]\\n]\\n,null,null,null,[[\\"chiropractor\\"]\\n]\\n,null,\\"ChIJaYGxdRj9t4kRcJmJlvQkKX0\\",null,null,null,[\\"South Gate\\",\\"806 Landmark Dr Suite 126\\",\\"806 Landmark Dr Suite 126\\",\\"Glen Burnie\\"]\\n,null,null,null,null,null,[null,\\"SearchResult.TYPE_PERSONAL_
after "\"chiropractor\"]\n]\n,null,\"Place ID",null ...
but I can't find the regex for it.
I need help writing the correct regex, or finding another way of getting the place_id.
I would prefer answers that do not just refer me to the official API.
I think this could be improved, but the string itself sits in a script tag that contains window.APP_OPTIONS. Each of those ids starts with ChIJ, is followed by a defined character set, and is 27 characters long in total.
I have also started directly from the Maps page rather than clicking through to it. I didn't need a wait condition across several runs, but one could be added if wanted/required.
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import re
sentence = "chiropractor in maryland"
url = 'https://www.google.com/maps/search/{}'.format(sentence)
d = webdriver.Chrome()
d.get(url)
soup = bs(d.page_source, 'lxml')
for script in soup.select('script'):
    if 'window.APP_OPTIONS' in script.text:
        script = script.text
        break
r = re.compile(r'(ChIJ[a-zA-Z\.0-9\-\_]{23})')
items = r.findall(script)
print(items)
d.quit()
A little riskier: you could work off page_source directly.
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import re
sentence = "chiropractor in maryland"
url = 'https://www.google.com/maps/search/{}'.format(sentence)
d = webdriver.Chrome()
d.get(url)
r = re.compile(r'(ChIJ[a-zA-Z\.0-9\-\_]{23})')
items = r.findall(d.page_source)
print(items)
d.quit()
Notes:
I am specifying a pattern designed to match only the required items for the given search. It is conceivable that, in future/new searches, the pattern could occur in a string that is not an id. The page_source is a larger search space and therefore carries a greater likelihood of an unwanted string matching the pattern; the script tag is not only where you would expect to find the ids but is also a smaller search space. Over time you might also want to check whether the character set needs additional characters to match new ids. You can easily check the matches against the per-page result count.
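As a rough sanity check, you could dedupe the matches and compare the count against what the page shows (run this before d.quit(); the result-card selector below is an assumption and may need adjusting):
# Dedupe while keeping order, then compare against the visible results.
unique_ids = list(dict.fromkeys(items))
print(len(items), 'matches,', len(unique_ids), 'unique place ids')
# The selector for visible result cards is an assumption and may need updating:
cards = d.find_elements_by_css_selector('div[role="article"]')
print('visible results on this page:', len(cards))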
Related
I am writing a little script to get my F@H (Folding@home) user data from a basic HTML page.
I want to locate my username on that page and the numbers before and after it.
All the data I want is between two HTML <tr> and </tr> tags.
I am currently using this:
re.search(r'<tr>(.*?)</tr>', htmlstring)
I know this works for any substring, as all the Google results for my question show. The difference here is that I need it only when that substring also contains a specific word.
However, that only returns the first string between those two delimiters, not all of them.
This pattern occurs hundreds of times on the page. I suspect it doesn't get them all because I'm not handling the newline characters correctly, but I'm not sure.
If it returned all of them, I could at least go through each result.group() and pick out the one that contains my username, but I can't even do that.
I have been fiddling with different regexes for ages now but can't figure out which one I need, much to my frustration.
TL;DR -
I need a re.search() pattern that finds a substring between two words, that also contains a specific word.
If I understand correctly, something like this might work:
<tr>(?:(?:(?:(?!<\/tr>).)*?)\bWORD\b(?:.*?))<\/tr>
<tr> find "<tr>"
(?:(?:(?!<\/tr>).)*?) Find anything except "</tr>" as few times as possible
\bWORD\b find WORD
(?:.*?)) find anything as few times as possible (the trailing ) closes the outer group)
<\/tr> find "</tr>"
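A minimal sketch of applying that pattern with re.findall (and re.DOTALL so . also crosses newlines); the sample HTML and username are made up:
import re

# Made-up sample data just to demonstrate the pattern.
html = '<tr><td>1</td><td>OtherUser</td><td>10</td></tr>\n' \
       '<tr><td>2</td><td>SteveMoody</td><td>42</td></tr>'
word = 'SteveMoody'

pattern = rf'<tr>(?:(?:(?:(?!</tr>).)*?)\b{re.escape(word)}\b(?:.*?))</tr>'
rows = re.findall(pattern, html, flags=re.DOTALL)
print(rows)  # only the <tr>...</tr> block containing the word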
Sample
There are a few ways to do it but I prefer the pandas way:
from urllib import request
import pandas as pd # you need to install pandas
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
web_df_list = pd.read_html(web_request, attrs={'class': 'members'})  # read_html returns a list of DataFrames
web_df = web_df_list[0].set_index(keys=['Name'])
# print(web_df)
user_name_to_find_in_table = 'SteveMoody'
user_name_df = web_df.loc[user_name_to_find_in_table]
print(user_name_df)
Then there are plenty of other ways to do this: using just BeautifulSoup's find or CSS selectors, or maybe re as Peter suggests.
Using BeautifulSoup's find method together with re, you can do it the following way:
import re
from bs4 import BeautifulSoup as bs # you need to install beautifulsoup
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml') # need to install lxml and bs4(beautifulsoup for Python 3+)
user_name_to_find_in_table = 'SteveMoody'
row_tag = page_soup.find(
    lambda t: t.name == "td"
    and re.findall(user_name_to_find_in_table, t.text, flags=re.I)
).find_parent(name="tr")
print(row_tag.get_text().strip('tr'))
Using BeautifulSoup and CSS selectors (no re, just BeautifulSoup):
from bs4 import BeautifulSoup as bs # you need to install beautifulsoup
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml') # need to install lxml and bs4(beautifulsoup for Python 3+)
user_name_to_find_in_table = 'SteveMoody'
row_tag = page_soup.select_one(f'tr:has(> td:contains({user_name_to_find_in_table})) ')
print(row_tag.get_text().strip('tr'))
In your case I would favor the pandas example as you keep headers and can easily get other stats, and it runs very quickly.
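For example, once the frame is indexed by Name, the rest of that user's row is available directly (the column names are whatever the stats table defines, so treat any specific column as an assumption and inspect them first):
# Column names depend on the team-stats table; inspect them first.
print(web_df.columns.tolist())
# The whole row for that user, with every column the table provides:
print(web_df.loc[user_name_to_find_in_table])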
Using Re:
So far, the best input is Peter's comment (linked), so I just adapted it to Python code (happy for it to be edited), as this solution doesn't need the installation of any extra libraries.
import re
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
user_name_to_find_in_table = 'SteveMoody'
re_pattern = rf'<tr>(?:(?:(?:(?!<\/tr>).)*?)\b{user_name_to_find_in_table}\b(?:.*?))<\/tr>'
res = re.search(pattern=re_pattern, string=str(web_request))
print(res.group(0))
Helpful link on using variables in regex: Stack Overflow.
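The usual caveat when interpolating a variable into a pattern is to escape it (re.escape), in case the value contains regex metacharacters; a minimal sketch with a made-up name:
import re

user = 'Steve.Moody'                 # made-up name containing a regex metacharacter
pattern = rf'\b{re.escape(user)}\b'  # becomes \bSteve\.Moody\b
match = re.search(pattern, 'rank 5 Steve.Moody 1234')
print(match.group(0) if match else 'no match')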
In the code example below, I get different results for the print(len(block1)) item at the end of the code. I cannot seem to figure out what is causing this:
my code,
page loading with Selenium,
some sort of anti-scrape method that Amazon uses, or
a silly thing I am missing.
My ten most recent results were
LOG: 3/14/2020 - 2:30pm EDT
Length Results for 10 separate runs:
0 / 20 / 55 / 25 / 57 / 55 / 6 / 59 / 54 / 39
# python version: 3.8.1
#Import necessary modules
from selenium import webdriver # version 3.141.0
from bs4 import BeautifulSoup # version 4.8.2
#set computer path and object to chrome browser
chrome_path = r"C:\webdrivers\chromedriver.exe"
browser = webdriver.Chrome(chrome_path)
# search Amazon for "bar+soap"
# use 'get' for URL request and set object to variable "source"
browser.get("https://www.amazon.com/s?k=soap+bar&ref=nb_sb_noss_2")
source = browser.page_source
#use Beautiful Soup to parse html
page_soup = BeautifulSoup(source, 'html.parser')
#set a variable "block1" to find all "a" tags that fit criteria
block1 = page_soup.findAll("a", {"class":"a-size-base"})
#print the number of tags pulled
print(len(block1))
Your code looks correct. I modified it a little to make sure everything is fine: I collected the tags with both Selenium and Beautiful Soup and counted them, and the two counts always match.
I was getting very different results at first, so I added a 7-second wait after the page load. This made things more stable, so it is possible that some of the elements just take longer to load and are not yet on the page when you count.
This didn't fully solve the issue: I am still getting different results; across 10 runs I got 64 (x2), 65 (x6), and 67 (x2).
My recommendation for you would be to:
try adding and increasing the sleep and see how it behaves (an explicit-wait sketch follows after this list);
try actually printing out the results and see what is the difference between runs;
potentially, just use the result that you get most often, because a lot of websites run product A/B tests and there can be multiple UI/content variants for the same page, or for different components of the same page (this is very likely the case here). Every time we run the script we land on a certain A/B variant, or probably a combination of variants, which leads to these differing results.
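As an alternative to a fixed sleep, an explicit wait is a common option; a minimal sketch (presence of the elements still does not guarantee the page has finished changing, so the counts may still vary):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Chrome()
browser.get("https://www.amazon.com/s?k=soap+bar&ref=nb_sb_noss_2")
# Wait up to 15 seconds for at least one matching link to be present.
WebDriverWait(browser, 15).until(
    lambda d: d.find_elements_by_css_selector('a.a-size-base')
)
print(len(browser.find_elements_by_css_selector('a.a-size-base')))
browser.quit()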
Just in case, my code:
#Import necessary modules
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
#set computer path and object to chrome browser
browser = webdriver.Chrome()
# use 'get' for URL request and set object to variable "source"
browser.get("https://www.amazon.com/s?k=soap+bar&ref=nb_sb_noss_2")
sleep(7)
source = browser.page_source
#use Beautiful Soup to parse html
page_soup = BeautifulSoup(source, 'html.parser')
#set a variable "block1" to find all "a" tags that fit criteria
block1 = page_soup.findAll("a", {"class":"a-size-base"})
#print the number of tags pulled
print('BS', len(block1))
# To be safe, let's also count with pure Selenium:
e = browser.find_elements_by_css_selector('a.a-size-base')
print('SEL', len(e))
Hope this helps, good luck.
I am scraping a website to get a number. This number changes dynamically every split second, but upon inspection the number is shown. I just need to capture that number, but the div wrapper that contains it returns no value. What am I missing? (Please go easy on me, as I am quite new to Python and data scraping.)
I have some code that runs and returns the piece of HTML that supposedly contains the data I want, but no joy: the div wrapper has no value.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://deuda-publica-espana.com')
deuda = BeautifulSoup(r.text, 'html.parser')
deuda = deuda.findAll('div', {'id': 'contador_PDEH'})
print(deuda)
I don't receive any errors, I am just getting [<div class="contador_xl contador_verde" id="contador_PDEH"></div>] with no value!
Indeed, it is easy with Selenium. I suspect there is a JS script running a counter that supplies the number, which is why you can't find it with your method (as mentioned in the comments).
from selenium import webdriver
d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://deuda-publica-espana.com/')
print(d.find_element_by_id('contador_PDEH').text)
d.quit()
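If the counter happens to still be empty when the script reads it, a short explicit wait for non-empty text is a minimal safeguard; a sketch, assuming the same element id:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://deuda-publica-espana.com/')
# Wait up to 10 seconds for the counter element to contain some text,
# then return that text (raises TimeoutException if it stays empty).
value = WebDriverWait(d, 10).until(
    lambda drv: drv.find_element_by_id('contador_PDEH').text or False
)
print(value)
d.quit()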
I'm trying to retrieve a list of downloadable xls files on a website.
I'm a bit reluctant to provide full links to the website in question.
Hopefully I'm able to provide all necessary details all the same.
If this is useless, please let me know.
Download .xls files from a webpage using Python and BeautifulSoup is a very similar question, but the details below will show that the solution most likely has to be different, since the links on that particular site are tagged with an href attribute:
And the ones I'm trying to get are not tagged the same way.
On the webpage, the files that are available for downloading are listed like this:
A simple mousehover gives these further details:
I'm following the setup here with a few changes to produce the snippet below that provides a list of some links, but not to any of the xls files:
from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    with urllib.request.urlopen(url) as response:
        html = response.read()
    soup = BeautifulSoup(html, "lxml")
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

links1 = getLinks("https://SOMEWEBSITE")
A further inspection using Ctrl+Shift+I in Google Chrome reveals that those particular links do not have an href attribute, but rather an ng-href attribute:
So I tried changing that in the snippet above, but with no success.
And I've tried different combinations with re.compile("^https://"), attrs={'ng-href': ...} and links.append(link.get('ng-href')), but still with no success.
So I'm hoping someone has a better suggestion!
EDIT - Further details
It seems it's a bit problematic to read these links directly.
When I use Ctrl+Shift+I and "Select an element in the page to inspect it" (Ctrl+Shift+C), this is what I can see when I hover over one of the links listed above:
And what I'm looking to extract here is the information associated with the ng-href attribute. But if I right-click the page and view the source, the same attribute only appears once, along with some metadata(?):
And I guess this is why my rather basic approach is failing in the first place.
I'm hoping this makes sense to some of you.
Update:
using selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()
driver.get('http://.....')
# wait max 15 second until the links appear
xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(@ng-href, ".xls")]'))
# Or
# xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(@href, ".xls")]'))
links = []
for link in xls_links:
    url = "https://SOMEWEBSITE" + link.get_attribute('ng-href')
    print(url)
    links.append(url)
Assuming ng-href is not dynamically generated: from your last image I see that the URL does not start with https:// but with a slash /, so you can try a regex matching ng-href values that contain .xls:
# Inside getLinks(), replacing the original findAll loop:
for link in soup.findAll('a', attrs={'ng-href': re.compile(r"\.xls")}):
    xls_link = "https://SOMEWEBSITE" + link['ng-href']
    print(xls_link)
    links.append(xls_link)
My guess is that the data you are trying to crawl is created dynamically: ng-href is one of AngularJS's constructs. You could try using Google Chrome's Network inspection as you already did (Ctrl+Shift+I) and see if you can find the URL that is queried (open the Network tab and reload the page). The query should typically return JSON with the links to the xls files.
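If such a request shows up in the Network tab, you can usually replay it directly; a rough sketch (both the endpoint and the JSON shape below are hypothetical placeholders for whatever the Network tab actually shows):
import requests

# Hypothetical endpoint and JSON shape -- substitute the real request
# that the Network tab shows when the page loads its file list.
resp = requests.get('https://SOMEWEBSITE/api/documents')
data = resp.json()
xls_links = [item.get('url') for item in data
             if str(item.get('url', '')).endswith('.xls')]
print(xls_links)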
There is a thread about a similar problem here. Perhaps that helps you: Unable to crawl some href in a webpage using python and beautifulsoup
I want to extract the text of a particular span, which is shown in the snapshot. I am unable to find the span by its class attribute. I have attached the HTML source (snapshot) of the data to be extracted as well.
Any suggestions?
import bs4 as bs
import urllib.request
sourceUrl='https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2'
source=urllib.request.urlopen(sourceUrl).read()
soup=bs.BeautifulSoup(source, 'html.parser')
count=soup.find('span',{'class':'number'})
print(len(count))
See the image:
If you disable JavaScript in your browser, you can easily see that the span element you want disappears.
In order to get that element, one possible solution is to use a Selenium-driven browser.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2')
span = driver.find_element_by_xpath('//li[3]/span')
print(span.text)
driver.close()
Output:
Another solution: find the desired value deep down in the web page source (in Chrome, press Ctrl+U) and extract the span value using a regular expression.
import re
import requests
r = requests.get(
'https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2')
span = re.search(r'"posts_count":(\d+)', r.text)
print(span.group(1))
Output:
If you know how to use CSS selectors you can use:
mySpan = soup.select("span.number")
It will return a list of all nodes that match this selector.
So mySpan[0] could contain what you need; then use one of the methods, for example get_text(), to extract the text.
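A minimal sketch of that approach; note that, as pointed out above, the value may only appear after JavaScript runs, so the list can come back empty on the raw HTML:
import urllib.request
from bs4 import BeautifulSoup

sourceUrl = 'https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2'
source = urllib.request.urlopen(sourceUrl).read().decode()
soup = BeautifulSoup(source, 'html.parser')

mySpan = soup.select("span.number")   # list of every node matching the selector
if mySpan:
    print(mySpan[0].get_text())
else:
    print("no span.number found -- likely rendered by JavaScript")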
First of all, you need to decode the response:
source=urllib.request.urlopen(sourceUrl).read().decode()
Maybe your issue will disappear after this fix.