I'm attempting to scrape data from a table on a website, but when I run my code the output is just blank. I'm not sure why nothing is getting printed. Is the scraped content too large for the IDE terminal, or is there a fundamental issue with my code?
Note: the website link is: https://www.kaggle.com/nareshbhat/health-care-data-set-on-heart-attack-possibility
And the data I'm trying to scrape is the table at the bottom (heart.csv).
Any help is greatly appreciated!
Code:
import time
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
import re
DRIVER_PATH = r"/Users/mouradsal/Downloads/DataSets Python/chromedriver"
URL = "https://www.kaggle.com/nareshbhat/health-care-data-set-on-heart-attack-possibility"
browser = webdriver.Chrome(DRIVER_PATH)
browser.get(URL)
time.sleep(4)
content = browser.find_elements_by_css_selector(".dfXZEj div")
for e in content:
    start = e.get_attribute("innerHTML")
    soup = BeautifulSoup(start, features="lxml")
    print(soup.get_text())
Thanks
I have solved the above question using Java and it is working fine for me.
I'm attaching the Java code for reference; as per my understanding, in Python you can directly use
start = e.text
Java Code
WebDriverManager.chromedriver().setup();
WebDriver driver = new ChromeDriver();
driver.get("https://www.kaggle.com/nareshbhat/health-care-data-set-on-heart-attack-possibility");
List<WebElement> list = driver.findElements(By.cssSelector(".dfXZEj div"));
System.out.println(list.size());
for(WebElement element : list) {
    System.out.println(element.getText());
}
driver.quit();
The equivalent Python loop is:
for e in content:
    print(e.text)
You don't need to use BeautifulSoup for this.
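Putting it together, a minimal Python sketch (assuming the generated .dfXZEj class still matches the table container on the page; generated class names like this change often):

import time
from selenium import webdriver

DRIVER_PATH = r"/Users/mouradsal/Downloads/DataSets Python/chromedriver"
URL = "https://www.kaggle.com/nareshbhat/health-care-data-set-on-heart-attack-possibility"

browser = webdriver.Chrome(DRIVER_PATH)
browser.get(URL)
time.sleep(4)  # crude wait; an explicit WebDriverWait on the selector would be more robust

# .text returns the rendered text directly, so no HTML parsing is needed
for e in browser.find_elements_by_css_selector(".dfXZEj div"):
    print(e.text)

browser.quit()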
I'm currently in the process of systematically scraping data from an online retailer's website. I have been doing this once a week for 2 months now, and my Python code has been working great, but when I tried to run it today it returned blank files instead of my usual data. I have tried multiple ways to solve this but haven't managed to fix it: I switched to geckodriver, but got the same result; I also updated Selenium, chromedriver, and Chrome, but no luck. Does anyone have suggestions for fixing this?
(this is my first post so hopefully I displayed the code clearly)
from bs4 import BeautifulSoup
import re
import csv
from selenium import webdriver
import numpy
import pandas as pd
url = "https://www.zalando.be/sportsokken/_zwart/"
driver = webdriver.Chrome(executable_path="/Users/lisabyloos/Downloads/chromedriver")
pages = numpy.arange(1, 3, 1)
for page in pages:
    driver.get(url + "?p=" + str(page))
    html_content = driver.execute_script('return document.body.innerHTML')
    soup = BeautifulSoup(html_content, "lxml")
    product_divs = soup.find_all("div", attrs={"class": "_4qWUe8 w8MdNG cYylcv QylWsg SQGpu8 iOzucJ JT3_zV DvypSJ"})
    results = []
    for product in product_divs:
        results.append(product.get_text(separator=";"))
    df = pd.DataFrame([sub.split(";") for sub in results])
    df.to_csv("myfile" + str(page) + ".csv")
What happens?
The classes of the elements you are trying to find are dynamically generated and have changed.
Note: Pages change from time to time, but changes to structure are rarer than changes to styles. It is therefore always a good strategy to select on elements or ids rather than classes.
How to fix?
Adjust selecting criteria to get your results:
product_divs = soup.find_all('article')
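Dropped into your loop, the whole fix might look like this (a sketch; it assumes each product card is wrapped in an <article> tag, which was the page structure at the time of writing):

for page in pages:
    driver.get(url + "?p=" + str(page))
    html_content = driver.execute_script('return document.body.innerHTML')
    soup = BeautifulSoup(html_content, "lxml")
    # <article> is part of the page structure, so it survives restyling
    # far better than the generated class names do
    product_divs = soup.find_all("article")
    results = [product.get_text(separator=";") for product in product_divs]
    df = pd.DataFrame([sub.split(";") for sub in results])
    df.to_csv("myfile" + str(page) + ".csv")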
I am trying to automate acquiring data for a research project from Morningstar using Selenium and BeautifulSoup in Python. I am quite new to Python, so I have just tried a bunch of solutions from Stack Overflow and similar forums, but I have been unsuccessful.
What I am trying to scrape is on the url https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000014CU8&tab=3
In the url, I am specifically looking for the "Faktorprofil", for which you can click to show the data as a table. I can get the headings from the page, but I am unable to find any other text with soup.find. I have tried multiple ids and classes, but without any luck. The code I believe I have been the most successful with is written below.
I hope someone can help!
from bs4 import BeautifulSoup
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("--headless")
chrome_driver = os.getcwd() + "/chromedriver"
driver = webdriver.Chrome(options=opts, executable_path=chrome_driver)
driver.get("https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000ZG2E&tab=3")
soup_file = driver.page_source
soup = BeautifulSoup(soup_file, 'html.parser')
print(soup.title.get_text())
#print(soup.find(class_='').get_text())
#print(soup.find(id='').get_text())
This is the data I want to scrape (screenshot): https://i.stack.imgur.com/wkSMj.png
All those tables are inside an iframe. The code below retrieves the data and prints each row as a list.
driver.implicitly_wait(10)
driver.get("https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000014CU8&tab=3")
driver.switch_to.frame(1)
driver.find_element_by_xpath("//button[contains(@class,'show-table')]//span").click()
table = driver.find_elements_by_xpath("//div[contains(@class,'sal-mip-factor-profile__value-table')]/table//tr/th")
header = []
for tab in table:
    header.append(tab.text)
print(header)
tablebody = driver.find_elements_by_xpath("//div[contains(@class,'sal-mip-factor-profile__value-table')]/table//tbody/tr")
for tab in tablebody:
    data = []
    content = tab.find_elements_by_tag_name("td")
    for con in content:
        data.append(con.text)
    print(data)
Output:
['Faktorer', 'Fonds værdi', '5 år Min. Værdi', '5 år Maks værdi', 'Kategori Gennemsnitlig']
['Stil', '62,33', '31,52', '76,36', '48,20']
['Effektiv rente', '48,83', '20,82', '69,12', '34,74']
['Momentum', '58,47', '7,48', '77,21', '71,15']
['Kvalitet', '25,65', '21,61', '59,66', '38,15']
['Volatilitet', '45,25', '34,66', '81,08', '74,93']
['Likviditet', '35,70', '33,40', '74,94', '79,39']
['Størrelse', '39,60', '35,67', '48,78', '87,59']
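If you would rather have one table than printed lists, the header and rows collected above can be assembled into a pandas DataFrame (a sketch, assuming pandas is installed):

import pandas as pd

rows = []
for tab in tablebody:
    # one list of cell texts per table row
    rows.append([con.text for con in tab.find_elements_by_tag_name("td")])

# the first header cell ('Faktorer') labels the factor-name column
df = pd.DataFrame(rows, columns=header)
print(df)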
I need to scrape the names and addresses of colleges from this site: https://www.collegenp.com/2-science-colleges/ , but the problem is that I am only getting the data for the first 11 colleges in the list and none of the others. My code is:
from selenium import webdriver
import bs4
from bs4 import BeautifulSoup
import requests
import pandas as pd
from time import sleep
driver = webdriver.Chrome('C:/Users/acer/Downloads/chromedriver.exe')
driver.get('https://www.collegenp.com/2-science-colleges/')
driver.refresh()
sleep(20)
page=requests.get("https://www.collegenp.com/2-science-colleges/")
college = []
location=[]
soup = BeautifulSoup(page.content, 'html.parser')
for a in soup.find_all('div', attrs={'class': 'media'}):
    name = a.find('h3', attrs={'class': 'college-name'})
    college.append(name.text)
    loc = a.find('span', attrs={'class': 'college-address'})
    location.append(loc.text)
df = pd.DataFrame({'College name': college, 'Locations': location})
df.to_csv('hell.csv', index=False, encoding='utf-8')
Any guidance to scrape all the data?
The easiest way is just to execute a JavaScript command through Python. For example:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight-200);")
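A fuller sketch of the same idea: scroll in a loop until the page height stops growing, then parse Selenium's rendered page source rather than a separate requests.get (which only ever returns the initial 11 entries):

import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight-200);")
    time.sleep(2)  # give the lazily loaded colleges time to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new loaded, so we have reached the end
    last_height = new_height

# parse the rendered DOM, not a fresh request for the bare HTML
soup = BeautifulSoup(driver.page_source, 'html.parser')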
I have a website to scrape and I am using Selenium to do it. When I finished writing the code, I noticed that I was not getting any output when I printed the table contents. I viewed the page source and found that the table was not in it. That is why, even when I use the table's XPath from inspect element, I can't get any output from it. Does someone know how I could get the response/data, or just print the table, from the JavaScript response? Thanks.
Here is my current code
from bs4 import BeautifulSoup
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--incognito')
chrome_path = r"C:\chromedriver.exe"
driver = webdriver.Chrome(chrome_path,options=options)
driver.implicitly_wait(3)
url = "https://reversewhois.domaintools.com/?refine#q=
%5B%5B%5B%22whois%22%2C%222%22%2C%22VerifiedID%40SG-Mandatory%22%5D%5D%5D"
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html,'lxml')
# These lines of code select the desired search parameters from the combo box;
# you can disregard them, since I put the whole URL with params above
input = driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[3]/input')
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[1]/div').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[5]/div[1]/div/div[3]').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[2]/div/div[1]').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[6]/div[1]/div/div[1]').click()
input.send_keys("VerifiedID@SG-Mandatory")
driver.find_element_by_xpath('//*[@id="search-button-container"]/button').click()
table = driver.find_elements_by_xpath('//*[@id="refine-preview-content"]/table/tbody/tr/td')
for i in table:
    print(i)  # no output
I just want to scrape all the domain names, like the one in the first result (0 _ _ .sg).
You can try the code below. After all the search options have been selected and the search button clicked, a short sleep makes sure we get the full page source. Then we use read_html from pandas, which finds any tables present in the HTML and returns a list of DataFrames; we take the required DataFrame from there.
from selenium import webdriver
import time
from selenium.webdriver.chrome.options import Options
import pandas as pd
options = Options()
options.add_argument('--incognito')
chrome_path = r"C:/Users/prakh/Documents/PythonScripts/chromedriver.exe"
driver = webdriver.Chrome(chrome_path,options=options)
driver.implicitly_wait(3)
url = "https://reversewhois.domaintools.com/?refine#q=%5B%5B%5B%22whois%22%2C%222%22%2C%22VerifiedID%40SG-Mandatory%22%5D%5D%5D"
driver.get(url)
#html = driver.page_source
#soup = BeautifulSoup(html,'lxml')
# These lines of code select the desired search parameters from the combo box
input = driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[3]/input')
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[1]/div').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[5]/div[1]/div/div[3]').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[2]/div/div[1]').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[6]/div[1]/div/div[1]').click()
input.send_keys("VerifiedID@SG-Mandatory")
driver.find_element_by_xpath('//*[@id="search-button-container"]/button').click()
time.sleep(5)
html = driver.page_source
tables = pd.read_html(html)
df = tables[-1]
print(df)
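As a side note, the fixed time.sleep(5) can be swapped for an explicit wait, so the script continues as soon as the results table is present (a sketch using Selenium's standard wait helpers; the CSS selector is derived from the XPath in the question):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 15 seconds for the results table to appear in the DOM
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#refine-preview-content table"))
)
html = driver.page_source
tables = pd.read_html(html)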
If you are open to other ways, does the following give the expected results? It mimics the XHR request the page makes (though I have trimmed it down to the essential elements only) to retrieve the lookup results. It is faster than using a browser.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://reversewhois.domaintools.com/?ajax=mReverseWhois&call=ajaxUpdateRefinePreview&q=[[[%22whois%22,%222%22,%22VerifiedID@SG-Mandatory%22]]]&sf=true', headers=headers)
table = pd.read_html(r.json()['results'])
print(table)
I want to scrape pictures from a public Instagram account. I'm pretty familiar with bs4, so I started with that. Using the element inspector in Chrome, I noted that the pictures are in an unordered list and that each li has the class 'photo', so I figured, what the hell -- can't be that hard to scrape with findAll, right?
Wrong: it doesn't return anything (code below), and I soon noticed that the code shown in the element inspector and the code I pulled with requests were not the same, i.e. there was no unordered list in the HTML I pulled with requests.
Any idea how I can get the code that shows up in element inspector?
Just for the record, this was my code to start, which didn't work because the unordered list was not there:
from bs4 import BeautifulSoup
import requests
import re
r = requests.get('http://instagram.com/umnpics/')
soup = BeautifulSoup(r.text, 'html.parser')
for x in soup.findAll('li', {'class': 'photo'}):
    print(x)
Thank you for your help.
If you look at the source code for the page, you'll see that the webpage is generated by JavaScript. What you see in the element inspector is the webpage after the script has run, whereas BeautifulSoup only gets the raw HTML file. In order to parse the rendered webpage, you'll need to use something like Selenium to render it for you.
So, for example, this is how it would look with Selenium:
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
url = 'http://instagram.com/umnpics/'
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
for x in soup.findAll('li', {'class': 'photo'}):
    print(x)
Now the soup should be what you are expecting.
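One caveat: driver.page_source is read immediately after get(), so on a slow connection the photo list may not have rendered yet. A variant with an explicit wait (a sketch, assuming li elements with class photo still exist on the rendered page):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(url)
# wait up to 10 seconds for at least one photo entry to render
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "li.photo"))
)
soup = BeautifulSoup(driver.page_source, 'html.parser')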