I created a script with Selenium to fetch the details for a given GST number. That program is complete and prints the required details without any problem.
Now I no longer want it to interact with the Chrome browser, so I'm trying to do the same thing with BeautifulSoup.
BeautifulSoup is new to me, so I don't have much idea how to find elements, and I've searched a lot for how to send keys with BeautifulSoup, but I'm not getting it.
Now my script is stuck here.
from bs4 import BeautifulSoup
import requests
import urllib.request as urllib2
quote_page = 'https://my.gstzen.in/p/search-taxpayer'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
Now, even if I manage to find the GST input element, how do I send keys to it, i.e. the 15-digit GST number followed by pressing Enter or clicking "Search GSTIN Details"?
If possible, let me know the solution so I can start my research on it.
Actually, I need to complete this tonight.
Also, here is my script which does the same thing easily with Selenium. I want to do the same with BeautifulSoup, because I do not want Chrome to run every time I check a GST number, and BeautifulSoup seems interesting.
import selenium
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
import csv
import requests
#import pyvirtualdisplay
#from pyvirtualdisplay import display
#display = Display(visible=0, size=(800, 600))
#display.start()
browser = webdriver.Chrome('E:\\Chrome Driver\\chromedriver_win32\\chromedriver.exe')
browser.set_window_position(-10000,0)
browser.get('https://my.gstzen.in/p/search-taxpayer/')
with open('product.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    next(csv_reader)
    for row in csv_reader:
        name, phone = row
        time.sleep(1)
        gst = browser.find_element_by_name('gstin')
        gst.click()
        gst.send_keys(name)
        time.sleep(1)
        Details = browser.find_element_by_xpath("//*[contains(text(), ' Search GSTIN Details')]")
        Details.click()
        info = browser.find_element_by_class_name('col-sm-4')
        print(info.text)
        info2 = browser.find_element_by_xpath('/html/body/div[4]/div/div/div[1]/div[2]/div[2]/div[1]/div[2]')
        print(info2.text)

input('Press Enter to quit')
browser.quit()
BeautifulSoup is a library for parsing and formatting, not interacting with web pages. For the latter, if that page requires JavaScript to work, you're stuck using a headless browser.
If it doesn't, you have at least two options:
Watch the Network tab in your browser's developer tools and see if you can recreate the request for the page you want using requests or urllib2
Use mechanize, which is built specifically to work with forms on sites that don't depend on JavaScript
Compared to recreating the request by hand, mechanize is a little more work if there's no CSRF token or similar mechanism (though, again, it'll fail if JavaScript is required), and a little less work if there is one.
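For the first option, the usual pattern is to fetch the form page, pull any hidden inputs (such as a CSRF token) out of it with BeautifulSoup, and POST them back together with the GSTIN. Here is a minimal sketch: the field name 'gstin' comes from the asker's Selenium code, but the hidden-field handling and the POST target are assumptions, and whether this site accepts a plain POST (or requires JavaScript) is exactly what you would have to verify in the Network tab.

```python
import requests
from bs4 import BeautifulSoup

def build_form_payload(form_html, gstin):
    """Collect hidden inputs (e.g. a CSRF token) from the form and add the GSTIN.

    The 'gstin' field name comes from the question's Selenium code; the rest
    of the form layout is an assumption.
    """
    soup = BeautifulSoup(form_html, 'html.parser')
    payload = {inp.get('name'): inp.get('value', '')
               for inp in soup.find_all('input', type='hidden')
               if inp.get('name')}
    payload['gstin'] = gstin
    return payload

# Hypothetical usage against the real site (untested sketch):
# session = requests.Session()
# page = session.get('https://my.gstzen.in/p/search-taxpayer/')
# payload = build_form_payload(page.text, '15-digit-gstin-here')
# result = session.post('https://my.gstzen.in/p/search-taxpayer/', data=payload)
# print(BeautifulSoup(result.text, 'html.parser').get_text())
```

Using a `Session` matters here because the CSRF token is usually tied to a session cookie set on the first GET.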
Related
Is it possible to send a get request to a webdriver using selenium?
I want to scrape a website with an infinite page and want to scrape a substantial amount of the objects on it. For this I use Selenium to open the website in a webdriver and scroll down until enough objects are visible.
However, I'd like to scrape the information on the page with BeautifulSoup, since this is the most effective way in this case. If the GET request is sent in the normal way (see the code), the response only holds the first objects and not the objects from the scrolled-down page (which makes sense).
But is there any way to send a GET request to an open webdriver?
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import requests
import time  # needed for time.sleep below
from bs4 import BeautifulSoup

# Opening the website in the webdriver
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)  # url is defined elsewhere in the script

# Loop for scrolling
scroll_start = 0
for i in range(100):
    scroll_end = scroll_start + 1080
    driver.execute_script(f'window.scrollTo({scroll_start}, {scroll_end})')
    time.sleep(2)
    scroll_start = scroll_end

# The get request
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
You should probably find out which endpoint the website uses to fetch the data for the infinite scrolling.
Go to the website, open the dev tools, open the Network tab, and find the HTTP request that is asking for the content you're seeking; then maybe you can use it too. Just know that there are a lot of variables: are they using some sort of authorization for their APIs? Do the APIs return JSON, XML, HTML, ...? Also, I am not sure whether this counts as fair use.
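Once you have spotted that request in the Network tab, the scrolling loop can often be replaced by paging the endpoint directly with requests. This is a sketch under assumptions: the endpoint URL, the 'page' query parameter, and the 'results' key are all placeholders for whatever the site actually uses. The HTTP getter is injectable so the paging logic itself can be exercised without a network.

```python
import requests

def fetch_all_pages(endpoint, max_pages, get=requests.get):
    """Page through a JSON endpoint discovered in the Network tab.

    The 'page' parameter and the 'results' key are placeholders -- the real
    names depend on the request the site makes while you scroll.
    """
    items = []
    for page in range(1, max_pages + 1):
        resp = get(endpoint, params={'page': page})
        batch = resp.json().get('results', [])
        if not batch:  # stop early once the server runs out of data
            break
        items.extend(batch)
    return items
```

With a real site you would pass the discovered URL plus any required headers or auth; here the default getter is plain `requests.get`.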
I've got a collection of URLs in a CSV file and I want to loop through these links, opening each one in the browser one at a time. I'm getting several different errors depending on what I try, but nonetheless I can't get the browser to open the links. The print shows that the links are there.
When I run my code I get the following error:
Traceback (most recent call last):
File "/Users/Main/PycharmProjects/ScrapingBot/classpassgiit.py", line 26, in <module>
open = browser.get(link_loop)
TypeError: Object of type bytes is not JSON serializable
Can someone help me with my code below, in case I am missing something or doing it wrong?
My code:
import csv
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as browser_wait
from selenium.webdriver.support import expected_conditions as EC
import requests
browser = webdriver.Chrome(executable_path=r'./chromedriver')
contents = []
with open('ClassPasslite.csv', 'rt') as cp_csv:
    cp_url = csv.reader(cp_csv)
    for row in cp_url:
        links = row[0]
        contents.append(links)

for link in contents:
    url_html = requests.get(links)
    for link_loop in url_html:
        print(contents)
        open = browser.get(link_loop)
Apparently, you are mixing up the names. Without a copy of the .csv file I cannot reproduce the error, so I will assume that you correctly extract the links from the file.
In the second part of your code, you call requests.get on links (mind the plural), but links is just the last value assigned in the previous section (links = row[0]), whereas link is the object you actually define in the for loop. Below is a version of the code that might be a helpful starting point.
Let me add, though, that using requests and selenium together makes little sense in your context: why fetch an HTML page with requests and then loop over its contents to open other pages with selenium?
import csv
import requests
from selenium import webdriver  # needed for webdriver.Chrome below

browser = webdriver.Chrome(executable_path=r'./chromedriver')
contents = []

with open('ClassPasslite.csv', 'rt') as cp_csv:
    cp_url = csv.reader(cp_csv)
    for row in cp_url:
        links = row[0]
        contents.append(links)

for link in contents:
    url_html = requests.get(link)  # now this is singular
    # Do what you have to do here with requests, in spite of using selenium #
Since you have not shown what your variable contents holds, I will assume that it is a list of URL strings.
As @cap.py mentioned, you are tripping yourself up by using requests and selenium at the same time. When you make a GET request, the server at the destination sends you a text response. That text can be plain text, like Hello world!, or it can be HTML. But that HTML has to be interpreted on the computer that sent the request.
That's the point of selenium over requests: requests returns the text gathered from the destination URL, while selenium asks a browser (e.g. Chrome) to gather the text and, if it is HTML, to interpret it into a real, readable web page. Moreover, the browser runs the JavaScript on the page, so dynamic pages work as well.
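The distinction is easy to see on a toy page: a parser like BeautifulSoup only sees the static markup, and a script that would rewrite the page in a browser stays inert text.

```python
from bs4 import BeautifulSoup

# A toy page whose visible text would be produced by JavaScript in a browser.
# A plain HTTP GET returns exactly this markup, with the script as inert text.
html = """
<div id="greeting">placeholder</div>
<script>document.getElementById('greeting').textContent = 'hello from JS';</script>
"""

soup = BeautifulSoup(html, 'html.parser')

# The parser never executes the script, so only the static text is visible:
print(soup.find(id='greeting').get_text())  # placeholder
```

A browser driven by selenium would run the script and report 'hello from JS' instead, which is exactly why the two tools disagree on dynamic pages.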
In the end, the only thing you need to make your code work is this:
import csv
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as browser_wait
from selenium.webdriver.support import expected_conditions as EC
import requests
browser = webdriver.Chrome(executable_path=r'./chromedriver')
contents = []

with open('ClassPasslite.csv', 'rt') as cp_csv:
    cp_url = csv.reader(cp_csv)
    for row in cp_url:
        links = row[0]
        contents.append(links)

# link should be something like "https://www.classpass.com/studios/forever-body-coaching-london?search-id=49534025882004019"
for link in contents:
    browser.get(link)
    # paste the code you have here
Tip: Don't forget that browsers take some time to load pages. Adding some time.sleep(3) will help you a lot.
There is a web site from which I can get the data I need with Python / Selenium (I am new to Selenium and Python).
On the web page there are tabs. I can get the data on the first tab, as that one is active by default, but I cannot get the data on the second tab.
I attached an image: it shows the data in the Overview tab. I want to get the data in the Fundamental tab as well. The web page is investing.com.
As for the code (I did not use everything yet; some imports were added for future use):
from time import sleep, strftime
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart
from bs4 import BeautifulSoup
url = ('https://www.investing.com/stock-screener/?'
       'sp=country::6|sector::a|industry::a|equityType::a|exchange::a|last::1,1220|avg_volume::250000,15950000%3Ceq_market_cap;1')
chrome_path = 'E:\\BackUp\\IT\\__Programming\\Python\\_Scripts\\_Ati\\CSV\\chromedriver'
driver = webdriver.Chrome(chrome_path)
#driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get(url)
my_name = driver.find_elements_by_xpath("//td[@data-column-name='name_trans']")
my_symbol = driver.find_elements_by_xpath("//td[@data-column-name='viewData.symbol']")
my_last = driver.find_elements_by_xpath("//td[@data-column-name='last']")
my_change = driver.find_elements_by_xpath("//td[@data-column-name='pair_change_percent']")
my_marketcap = driver.find_elements_by_xpath("//td[@data-column-name='eq_market_cap']")
my_volume = driver.find_elements_by_xpath("//td[@data-column-name='turnover_volume']")
The code above all works.
The XPath for the second tab does not work, though. PE Ratio is in the second tab (the Fundamental one).
I tried these three:
my_peratio = driver.find_elements_by_xpath('//*[@id="resultsTable"]/tbody/tr[1]/td[4]')
my_peratio = driver.find_elements_by_xpath("//*[@id='resultsTable']")
my_peratio = driver.find_elements_by_xpath("//td[@data-column-name='eq_pe_ratio']")
There are no error messages, but my_peratio has nothing in it. It is empty.
I would really appreciate it if you could point me in the right direction.
Thanks a lot
Ati
Probably the data shown on the second tab is loaded dynamically.
In that case, you have to click on the second tab first so that the data is rendered.
driver.find_element_by_xpath("selector_for_second_tab").click()
After that it should be possible to get the data.
I would like some advice on how to scrape data from this website.
I started with Selenium but got stuck at the very beginning because, for example, I have no idea how to set the dates.
My code so far:
from bs4 import BeautifulSoup as soup
from openpyxl import load_workbook
from openpyxl.styles import PatternFill, Font
from selenium import webdriver
from selenium.webdriver.common.by import By
import datetime
import os
import time
import re
day = datetime.date.today().day
month = datetime.date.today().month
year = datetime.date.today().year
my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit/non-usability-history'
cookieValue = '12-c12-cached|from:' +str(year)+ '-' +str(month)+ '-' +str(day-5)+ ','+'to:' +str(year)+ '-' +str(month)+ '-' + str(day) +',dateType:1,company:PreussenElektra,fuel:uranium,canceled:0,durationComparator:ge,durationValue:5,durationUnit:day'
#saving url
browser = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit'
browser.add_cookie({'name': 'tem', 'value': cookieValue})
browser.get(my_url)
my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit/non-usability-history'
browser.get(my_url)
Obviously I am not asking for finished code, just some suggestions on how to continue with Selenium (how to set the dates and the other filters), or any other idea on how to scrape this website.
Thanks in advance.
EDIT: I am trying the cookie approach. The code above is the updated version. I read that the cookie needs to be created before loading the page, and that is what I do, so any idea why it is not working?
The best approach for you will be changing the cookies, because all the filter data is saved in a cookie.
Check the cookies in Chrome (F12 -> Application -> Cookies) and play with the filters. Note that if you change a cookie in the developer tools, you have to refresh the website. :)
Check this post on how to change cookies in Selenium with Python.
To get values from the website you have to use the classic way, like you did here, but with class names:
radio = browser.find_elements_by_class_name('aaaaaa')
You can always use XPath to find elements (Chrome will generate the expressions for you).
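One detail worth fixing while experimenting with that cookie: the question builds the from date with str(day-5), which produces nonsense during the first five days of a month. datetime.timedelta handles the month rollover correctly. The cookie layout below is copied verbatim from the question; whether the site actually accepts a cookie built this way is untested.

```python
import datetime

def build_filter_cookie(today=None):
    """Rebuild the question's 'tem' cookie value with safe date arithmetic.

    The field layout is copied from the question; only the from/to dates are
    computed here. Using timedelta avoids the bug where str(day - 5) goes
    negative early in a month.
    """
    today = today or datetime.date.today()
    start = today - datetime.timedelta(days=5)
    return ('12-c12-cached|from:{0}-{1}-{2},to:{3}-{4}-{5},dateType:1,'
            'company:PreussenElektra,fuel:uranium,canceled:0,'
            'durationComparator:ge,durationValue:5,durationUnit:day').format(
                start.year, start.month, start.day,
                today.year, today.month, today.day)

# e.g. on 2020-03-02 the range rolls back into February:
# build_filter_cookie(datetime.date(2020, 3, 2)) contains 'from:2020-2-26,to:2020-3-2'
```

The result is what you would pass as the 'value' in browser.add_cookie({'name': 'tem', 'value': ...}).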
Is there any particular reason why you decided to use Selenium over other web scraping tools (scrapy, urllib, etc.)? I personally have not used Selenium, but I have used some of the other tools. Below is an example of a script that just pulls all the HTML from a page.
import urllib2  # Python 2; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup as soup

link = "https://ubuntu.com"
page = urllib2.urlopen(link)
data = soup(page, 'html.parser')
print(data)
This is just a short script to pull all the HTML off a page. I believe BeautifulSoup has additional tools for working with data in fields, but the exact method slips my mind right now; if I can find my notes on it, I will edit this post. I remember it being very straightforward, though.
Best of luck!
Edit: here's a discussion of web scraping tools from reddit a while back that I had saved: https://www.reddit.com/r/Python/comments/1qnbq3/webscraping_selenium_vs_conventional_tools/
I've been googling this all day without finding the answer, so apologies in advance if this has already been answered.
I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.
After a couple of days of research, I decided that Selenium was my best chance. I've found a way to grab all the text with Selenium; unfortunately, the same text is being grabbed multiple times:
from selenium import webdriver
import codecs
filen = codecs.open('outoput.txt', encoding='utf-8', mode='w+')
driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")
allelements = driver.find_elements_by_xpath("//*")
ferdigtxt = []
for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)
filen.close()
driver.quit()
The if condition inside the for loop is an attempt to eliminate the problem of fetching the same text multiple times; however, it only works as planned on some webpages, and it also makes the script a lot slower.
I'm guessing the reason for my problem is that - when asking for the inner text of an element - I also get the inner text of the elements nested inside the element in question.
Is there any way around this? Is there some sort of master element I grab the inner text of? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated as I'm out of ideas for this one.
Edit: the reason I used Selenium and not Mechanize and BeautifulSoup is that I wanted JavaScript-rendered text.
Using lxml, you might try something like this:
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean

url = "http://www.yahoo.com"
ignore_tags = ('script', 'noscript', 'style')

with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url)  # Load page
    content = browser.page_source

cleaner = clean.Cleaner()
content = cleaner.clean_html(content)
with open('/tmp/source.html', 'w') as f:
    f.write(content.encode('utf-8'))

doc = LH.fromstring(content)
with open('/tmp/result.txt', 'w') as f:
    for elt in doc.iterdescendants():
        if elt.tag in ignore_tags:
            continue
        text = elt.text or ''
        tail = elt.tail or ''
        words = ' '.join((text, tail)).strip()
        if words:
            words = words.encode('utf-8')
            f.write(words + '\n')
This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).
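The text/tail pairing in that loop is what keeps nested text from being lost or double-counted: in lxml, each element owns the text before its first child (.text) and the text that follows its own closing tag (.tail), so walking the descendants once visits every piece of text exactly one time. On a tiny snippet (lxml assumed installed):

```python
import lxml.html as LH

# .text is the text before an element's first child; .tail is the text that
# follows the element's closing tag. Walking descendants once therefore
# collects each piece of text exactly one time.
doc = LH.fromstring('<div><p>alpha<b>beta</b>gamma</p></div>')

pieces = []
for elt in doc.iterdescendants():
    text = elt.text or ''
    tail = elt.tail or ''
    words = ' '.join((text, tail)).strip()
    if words:
        pieces.append(words)

print(pieces)  # ['alpha', 'beta gamma']
```

Compare this with Selenium's element.text, which returns the full subtree text for every element, so a page-wide //* query reports the same text once per ancestor.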
Here's a variation on @unutbu's answer:
#!/usr/bin/env python
import sys
from contextlib import closing
import lxml.html as html # pip install 'lxml>=2.3.1'
from lxml.html.clean import Cleaner
from selenium.webdriver import Firefox # pip install selenium
from werkzeug.contrib.cache import FileSystemCache # pip install werkzeug
cache = FileSystemCache('.cachedir', threshold=100000)
url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"
# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7)  # week in seconds

# extract text
root = html.document_fromstring(page_source)
# remove flash, images, <script>, <style>, etc
Cleaner(kill_tags=['noscript'], style=True)(root)  # lxml >= 2.3.1
print root.text_content()  # extract text
I've separated your task into two parts:
get the page (including elements generated by javascript)
extract the text
The two parts are connected only through the cache. You can fetch pages in one process and extract the text in another, or defer the extraction and do it later using a different algorithm.
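The cache itself is just a convenience (FileSystemCache later moved out of werkzeug into the cachelib package); the fetch/extract decoupling needs nothing more than a minimal file cache. A sketch, storing one file per hashed key, with no timeout handling:

```python
import hashlib
import os

class TinyFileCache:
    """Minimal stand-in for the FileSystemCache used above: one file per key."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # hash the key (a URL) so it is always a valid filename
        return os.path.join(self.directory,
                            hashlib.sha256(key.encode('utf-8')).hexdigest())

    def get(self, key):
        try:
            with open(self._path(key), encoding='utf-8') as f:
                return f.read()
        except FileNotFoundError:
            return None  # cache miss, same contract as FileSystemCache.get

    def set(self, key, value):
        with open(self._path(key), 'w', encoding='utf-8') as f:
            f.write(value)
```

The fetch process calls cache.set(url, page_source) after the browser loads the page; the extract process calls cache.get(url) and parses, exactly as in the answer above.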