Selenium WebDriver to extract only paragraphs - python

I am totally new to all of this. I am trying to extract articles from many pages (only 4 URLs are in the code below), and I need to extract only the important paragraphs from the &lt;p&gt; tags.
Here is my code for this sample:
currency = 'BTC'
btc_today = pd.DataFrame({'Currency': [],
                          'Date': [],
                          'Title': [],
                          'Content': [],
                          'URL': []})

links = ["https://www.investing.com/news/cryptocurrency-news/3-reasons-why-bitcoins-drop-to-21k-and-the-marketwide-selloff-could-be-worse-than-you-think-2876810",
         "https://www.investing.com/news/cryptocurrency-news/crypto-flipsider-news--btc-below-22k-no-support-for-pow-eth-ripple-brazil-odl-cardano-testnet-problems-mercado-launches-crypto-2876644",
         "https://www.investing.com/news/cryptocurrency-news/can-exchanges-create-imaginary-bitcoin-to-dump-price-crypto-platform-exec-answers-2876559",
         "https://www.investing.com/news/cryptocurrency-news/bitcoin-drops-7-to-hit-3week-lows-432SI-2876376"]

for link in links:
    driver.get(link)
    driver.maximize_window()
    time.sleep(2)
    data = []
    date = driver.find_element(By.XPATH, '/html/body/div[5]/section/div[1]/span').text.strip()
    title = driver.find_element(By.XPATH, '/html/body/div[5]/section/h1').text.strip()
    url = link
    content = driver.find_elements(By.TAG_NAME, 'p')
    for item in content:
        body = item.text
        print(body)
        articles = {'Currency': currency, 'Date': date, 'Title': title, 'Content': body, 'URL': url}
        btc_today = btc_today.append(pd.DataFrame(articles, index=[0]))

btc_today.reset_index(drop=True, inplace=True)
btc_today
I got this as a result (screenshot of the output omitted).
I have also tried it with the loop below, but it returns results in many rows and not article by article:
for p_number in range(1, 10):
    try:
        content = driver.find_element(By.XPATH, f'/html/body/div[5]/section/div[3]/p[{p_number}]').text.strip()
        # print(content)
    except NoSuchElementException:
        pass
Can somebody help, please? I would really appreciate it. I have seriously been trying for days to find a solution, but with no progress.

I am assuming you need to get the main article content. For that, change the locator for 'content':
content = driver.find_elements(By.CSS_SELECTOR, '.WYSIWYG.articlePage p')
Also, there are unnecessary &lt;p&gt; tags mixed in with that content, such as "Position added successfully to: " and "Continue reading on DailyCoin". You can skip those with an if statement inside the for loop below (a sketch follows the loop):
for item in content:
    body = item.text
    print(body)
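For example, here is a minimal sketch of that filter, continuing from the snippet above. The phrases to skip are the ones mentioned; joining the remaining paragraphs into a single Content value so each article becomes one row is my assumption about the desired output, and pd.concat is used because DataFrame.append is deprecated in recent pandas:

skip_phrases = ("Position added successfully to:", "Continue reading on DailyCoin")

paragraphs = []
for item in content:
    body = item.text.strip()
    # drop empty paragraphs and the boilerplate lines listed above
    if not body or any(body.startswith(phrase) for phrase in skip_phrases):
        continue
    paragraphs.append(body)

# one row per article: all kept paragraphs joined into a single string
articles = {'Currency': currency, 'Date': date, 'Title': title,
            'Content': '\n'.join(paragraphs), 'URL': url}
btc_today = pd.concat([btc_today, pd.DataFrame(articles, index=[0])], ignore_index=True)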

Related

Python Selenium Web Scraping

I am trying to extract the following information from the website https://www.brecorder.com/pakistan/2022-11-17.
I want to do the following things:
Extract Category name
Extract News Articles links as well as headline given on the page
Go to Individual article link and fetch whole news from there
Paginate to previous day and repeat the above mentioned steps
Store everything in a csv file
Now, what I have done up till now: I can get the category name, extract the article links, and paginate to the previous page, but my code isn't working well. First of all, I am getting random article links that aren't part of that particular webpage. I can paginate and extract article links for the previous day, but the same happens there too (I am attaching a screenshot of it). Moreover, I am unable to click on an individual link and get the detailed news from there.
I am also attaching snippets of the page's HTML: https://i.stack.imgur.com/juvg0.png and https://i.stack.imgur.com/rK1El.png
My code up till now is
PATH = r"C:\Users\HP\PycharmProjects\WebScraping\chromedriver.exe"
driver = webdriver.Chrome(PATH)

category_urls = ['https://www.brecorder.com/pakistan/2022-11-17']

Category = []
Headline = []
Date = []
Url = []
News = []

def Url_Extraction():
    category_name = driver.find_element_by_css_selector('div[class="p-4 text-md text-gr bg-orange-200 inline-block my-2 font-sans font-medium text-white"]')
    cat = category_name.text  # Save category name in variable
    print(f"{cat}")
    news_articles = driver.find_elements_by_css_selector('a[class="story__link "]')
    for element in news_articles:
        URL = element.get_attribute('href')
        print(URL)
        Url.append(URL)
        Category.append(cat)
    current_time = time.time() - start_time
    print(f'{len(Url)} urls extracted')
    print(f'{len(Category)} categories extracted')
    print(f'Current Time: {current_time / 3600:.2f} hr, {current_time / 60:.2f} min, {current_time:.2f} sec',
          flush=True)
    try:
        next_page = driver.find_element(By.CSS_SELECTOR, 'a[class="infinite-more-link w-40 mx-auto text-center p-2 my-10 border bg-beige-400"]')
        driver.execute_script("arguments[0].click();", next_page)
    except Exception as e:
        print(e)

start_time = time.time()
for url in category_urls:
    driver.get(url)  # Go to Webpage
    driver.implicitly_wait(30)  # we don't need to wait 30 secs if element is already there (very useful)
    for num in range(2):
        print(f'page no. {num+1}')
        Url_Extraction()

''' Saving URLs to a csv file '''
with open('URL_List', 'w', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(Url)
f.close()

''' Adding Data to a Dataframe '''
cols = ['Url', 'Category']
data = pd.DataFrame(columns=cols, index=range(len(Url)))
for index in range(len(Url)):
    data.loc[index].Url = Url[index]
    data.loc[index].Category = Category[index]
data.to_csv('URLlist_with_Cat.csv')

time.sleep(3)
driver.quit()
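No answer is recorded here, but for the "go to each article link and fetch the whole news" step, a rough sketch along these lines might help. It reuses the driver and the Url/News lists from the code above and would have to run before driver.quit(); the 'div.story__content p' selector is only a guess at the article page's markup and needs to be checked against the real HTML:

from selenium.webdriver.common.by import By

def fetch_article_text(article_url):
    # Open an article page collected earlier and pull its paragraphs.
    driver.get(article_url)
    driver.implicitly_wait(10)
    # NOTE: 'div.story__content p' is a hypothetical selector; inspect the
    # article page and replace it with the real container of the news body.
    paragraphs = driver.find_elements(By.CSS_SELECTOR, 'div.story__content p')
    return '\n'.join(p.text for p in paragraphs if p.text.strip())

for article_url in Url:
    News.append(fetch_article_text(article_url))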

Problem: Python scraping with requests + Selenium

Hi guys,
I have a problem with scraping this dynamic site (https://kvartiry-bolgarii.ru/). I need to get all the links to the home sale ads.
I used Selenium to load the page and get the links to the ads, and after that I scroll the page down to load new ads. After the new ads are loaded, I parse all the links on the page again and write them to the list.
But the data in the list is not updated, and the script continues to work with the links that were on the page before scrolling down.
By the way, I added a check so that the script keeps running until the last announcement on the site, whose link I found out in advance, appears in the list.
How can this problem be fixed?
def get_link_info():
    try:
        url = "https://kvartiry-bolgarii.ru/"
        driver = webdriver.Chrome(
            executable_path=r'C:\Users\kk\Desktop\scrape_house\drivers\chromedriver.exe',
            options=options
        )
        driver.get(url)
        req = requests.get(url)
        req.encoding = 'utf8'
        soup = BeautifulSoup(req.text, "lxml")
        articles = soup.find_all("div", class_="content")
        links_urls = []
        for article in articles:
            house_url = article.find("a").get("href")
            links_urls.append(house_url)
        # print(links_urls)
        first_link_number = links_urls[-2].split("-")[-1]
        first_link_number = first_link_number[1:]
        # print(first_link_number)
        last_link_number = links_urls[-1].split("-")[-1]
        last_link_number = last_link_number[1:]
        # print(last_link_number)
        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.END)
        check = "https://kvartiry-bolgarii.ru/kvartira-v-elitnom-komplekse-s-unikalynym-sadom-o21751"
        for a in links_urls:
            if a != check:
                for article in articles:
                    house_url = article.find("a").get("href")
                    links_urls.append(house_url)
                html = driver.find_element_by_tag_name('html')
                html.send_keys(Keys.END)
                print(links_urls[-1])
            else:
                print(links_urls[0], links_urls[-1])
                print("all links are ready")
Some pointers: you don't need to mix Selenium, requests, and BeautifulSoup; Selenium alone is enough. When you are scrolling infinitely, you need to remove duplicate elements before adding them to your list.
You can try this. This should work.
from selenium import webdriver
import time

def get_link_info():
    all_links = []
    try:
        driver = webdriver.Chrome(executable_path='C:/chromedriver.exe')
        driver.get('https://kvartiry-bolgarii.ru/')
        time.sleep(3)
        old_links = set()  # Empty Set
        while True:
            # Scroll to get more ads
            driver.execute_script("window.scrollBy(0,3825)", "")
            # Wait for new ads to load
            time.sleep(8)
            links_divs = driver.find_elements_by_xpath('//div[@class="content"]//a')  # Find Elements
            ans = set(links_divs) - set(old_links)  # Remove old elements
            for link in ans:
                # Scroll to the link.
                driver.execute_script("arguments[0].scrollIntoView();", link)
                fir = link.get_attribute('href')
                all_links.append(fir)
            # Remove Duplicates
            old_links = links_divs
    except Exception as e:
        raise e

get_link_info()
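One thing to note: the while True loop above never stops on its own. Since the question wants to stop once a known last ad appears, a variation I would sketch (my assumption, reusing the answer's driver, old_links, and all_links variables plus the last-ad URL from the question) is to break out when that link is collected or when a scroll produces no new links:

check = "https://kvartiry-bolgarii.ru/kvartira-v-elitnom-komplekse-s-unikalynym-sadom-o21751"

while True:
    driver.execute_script("window.scrollBy(0,3825)", "")
    time.sleep(8)
    links_divs = driver.find_elements_by_xpath('//div[@class="content"]//a')
    new_elements = set(links_divs) - set(old_links)
    if not new_elements:
        # the last scroll loaded no new ads, so we are done
        break
    for link in new_elements:
        all_links.append(link.get_attribute('href'))
    old_links = links_divs
    if check in all_links:
        # the known last announcement has been reached
        break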

I want to extract the content of p tags with xpath. What should I do?

I want to extract the contents of all the p[@class="article"] tags under div[@class="content"]. How should I write the XPath? It is worth mentioning that there is more than one div[@class="content"] tag. I would appreciate it if you could solve my problem.
And my code is as follows:
target_url = 'https://gongyi.qq.com/succor/detail.htm?id=12857'
driver = webdriver.Chrome(r'my driver path')
driver.get(target_url)
button = driver.find_element_by_xpath('.//ul[@id="middle_avi"]/li[@wrapid="process_desc"]/a')
button.click()
page_text = driver.page_source
tree = etree.HTML(page_text)
p_list = tree.xpath('.//div[@class="proj_content"]/div[@class="content"]/p[@class="article "]')
for i in p_list:
    print(i)
You can use
//div[@class='proj_content']/div[@class='content']/p[@class='article']
If this was not specific enough, you'd have to add further details.
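As a side note, printing the matched elements directly only shows lxml element objects. A small sketch of pulling the text out with that XPath (reusing the driver from the question; if the class attribute really has a trailing space, as in the question's code, the predicate would need to match "article " instead):

from lxml import etree

tree = etree.HTML(driver.page_source)
p_list = tree.xpath("//div[@class='proj_content']/div[@class='content']/p[@class='article']")
for p in p_list:
    # string(.) concatenates all text nodes inside the <p>, including nested tags
    print(p.xpath('string(.)').strip())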

Unsure why beautifulsoup code won't scrape site

I've used BS a fair bit, but I'm unsure why this won't scrape, as the other addons I've made for Kodi work fine. Could someone look at the code below and find the bit I'm missing?
The addon/Python doesn't throw any error; it just shows an empty GUI screen. If the title or image scraping were fine and the link wasn't, then it would show a title/image but the link wouldn't work when clicked, so it's obviously the title/image part. I've even tried commenting out the image section so it just looks for a link and title, but still nothing.
Link being scraped: https://store.counterpunch.org/feed/podcast/
def get_soup1(url1):
    page = requests.get(url1)
    soup1 = BeautifulSoup(page.text, 'html.parser')
    print("type: ", type(soup1))
    return soup1

get_soup1("https://store.counterpunch.org/feed/podcast/")

def get_playable_podcast1(soup1):
    subjects = []
    for content in soup1.find_all('item', limit=9):
        try:
            link = content.find('enclosure')
            link = link.get('url')
            print("\n\nLink: ", link)
            title = content.find('title')
            title = title.get_text()
        except AttributeError:
            continue
        item = {
            'url': link,
            'title': title,
            'thumbnail': "https://is2-ssl.mzstatic.com/image/thumb/Podcasts71/v4/71/55/88/71558834-c449-9ac3-e327-cad002e305b4/mza_4409042347411679857.jpg/600x600bb.jpg",
        }
        subjects.append(item)
    return subjects

def compile_playable_podcast1(playable_podcast1):
    items = []
    for podcast in playable_podcast1:
        items.append({
            'label': podcast['title'],
            'thumbnail': podcast['thumbnail'],
            'path': podcast['url'],
            'is_playable': True,
        })
    return items
You need a User-Agent
def get_soup1(url1):
    page = requests.get(url1, headers={'User-Agent': 'Mozilla/5.0'})
    soup1 = BeautifulSoup(page.text, 'html.parser')
    print("type: ", type(soup1))
    return soup1
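A quick way to sanity-check the fix outside Kodi is to wire the question's three functions together with the patched get_soup1 (just a usage sketch, assuming the usual requests and BeautifulSoup imports are in place):

soup1 = get_soup1("https://store.counterpunch.org/feed/podcast/")
playable = get_playable_podcast1(soup1)       # list of dicts with url/title/thumbnail
items = compile_playable_podcast1(playable)   # Kodi-style items with label/path
for entry in items:
    print(entry['label'], '->', entry['path'])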

Simulate clicking a link inside a link - Selenium Python

Python Knowledge: beginner
I managed to create a script to scrape contact information. The flow I followed, since I am a beginner, is to extract all the first-level links, copy them to a text file, and use them in link = browser.find_element_by_link_text(str(link_text)). Scraping of contact details has been confirmed working (based on a separate run). The problem is that after clicking the first links, it won't go on clicking the links inside them, hence it cannot scrape the contact info.
What is wrong with my script? Please bear in mind I am a beginner so my script is a little bit manual and lengthy.
Thanks very much!!!
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import requests
from bs4 import BeautifulSoup
import urllib
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import csv, time, lxml

######################### open file list ####################################
testfile = open("category.txt")  # this is where I saved the category
readfile = testfile.read()
readfilesplit = readfile.split("\n")
############################### end #########################################

################### open browser #############################################
browser = webdriver.Firefox()
browser.get('http://aucklandtradesmen.co.nz/')
####################### end ##################################################

link_texts = readfilesplit
for link_text in link_texts:
    link = browser.find_element_by_link_text(str(link_text))
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".add-listing")))
    link.click()  # click link
    time.sleep(5)
    print "-------------------------------------------------------------------------------------------------"
    print("Getting listings for '%s'" % link_text)

    ################# get list name #######################
    urlNoList = 'http://aucklandtradesmen.co.nz/home-mainmenu-1.html'
    r = requests.get(browser.current_url)

    if (urlNoList != browser.current_url):
        soup = BeautifulSoup(r.content, 'html.parser')
        g_data = soup.find_all("div", {"class": "listing-summary"})
        pageRange = soup.find_all("span", {"class": "xlistings"})
        pageR = [pageRange[0].text]
        pageMax = str(pageR)[-4:-2]  # get max item for lists
        X = str(pageMax).replace('nd', '0')
        # print "Number of listings: ", X
        Y = int(X)  # convert string to int
        print "Number of listings: ", Y

        for item in g_data:
            try:
                listingNames = item.contents[1].text
                lstList = []
                lstList[len(lstList):] = [listingNames]
                replStr = re.sub(r"u'", "'", str(lstList))            # strip u' char
                replStr1 = re.sub(r"\s+'", "'", str(replStr))         # strip space and '
                replStr2 = re.sub(r"\sFeatured", "", str(replStr1))   # strip Featured string
                print "Cleaned string: ", replStr2

                ################ SCRAPE INFO ################
                ################### This is where the code is not executing #######################
                count = 0
                while (count < Y):
                    for info in replStr2:
                        link2 = browser.find_element_by_link_text(str(info))
                        time.sleep(10)
                        link2.click()
                        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#rating-msg")))
                        print "count", count
                        count += 1
                        print("Contact info for: '%s'" % link_text)
                        r2 = requests.get(browser.current_url)
                        soup2 = BeautifulSoup(r2.content, 'html.parser')
                        g_data2 = soup.find_all("div", {"class": "fields"})
                        for item2 in g_data2:
                            # print item.contents[0]
                            print item2.contents[0].text
                            print item2.contents[1].text
                            print item2.contents[2].text
                            print item2.contents[3].text
                            print item2.contents[4].text
                            print item2.contents[5].text
                            print item2.contents[6].text
                            print item2.contents[7].text
                            print item2.contents[8].text
                        browser.back()
                        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".add-listing")))
                ################### END ---- This is where the code is not executing END --- ######################
                ############ END SCRAPE INFO ####################
            except NoSuchElementException:
                browser.back()
                WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
    else:
        browser.back()
        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
        print "Number of listings: 0"

    browser.back()
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
By the way this is some of the result:
-------------------------------------------------------------------------------------------------
Getting listings for 'Plumbers'
Number of listings: 5
Cleaned string: ['Hydroflame Plumbing & Gas Ltd']
Cleaned string: ['Osborne Plumbing Ltd']
Cleaned string: ['Plumbers Auckland Central']
Cleaned string: ['Griffiths Plumbing']
Cleaned string: ['Plumber Auckland']
-------------------------------------------------------------------------------------------------
Getting listings for 'Professional Services'
Number of listings: 2
Cleaned string: ['North Shore Chiropractor']
Cleaned string: ['Psychotherapy Werks - Rob Hunter']
-------------------------------------------------------------------------------------------------
Getting listings for 'Property Maintenance'
Number of listings: 7
Cleaned string: ['Auckland Tree Services']
Cleaned string: ['Bob the Tree Man']
Cleaned string: ['Flawless House Washing & Drain Unblocking']
Cleaned string: ['Yardiez']
Cleaned string: ['Build Corp Apartments Albany']
Cleaned string: ['Auckland Trellis']
Cleaned string: ['Landscape Design']
What I would do is change the logic some. Here's the logic flow I would suggest you use; this will eliminate writing the links out to a file and speed up the script:
1. Navigate to http://aucklandtradesmen.co.nz/
2. Grab all elements using the CSS selector "#index a" and store the "href" attribute of each in an array of strings (links to each category page)
3. Loop through the href array
   3.1. Navigate to the href
        3.1.1. Grab all elements using the CSS selector "div.listing-summary a" and store the .text of each (company names)
        3.1.2. If an element found by_link_text("Next") exists, click it and return to 3.1.1.

If you want business contact info off of the company pages, you would want to store the href in 3.1.1 as well, and then loop through that list and grab what you want off each page.
Okay, I found a solution after thinking about #jeffC's suggestion:
extract the href values and append them to the base URL, which is http://aucklandtradesmen.co.nz. So, for example, if the extracted href is /home-mainmenu-1/alarms-a-security/armed-alarms-ltd-.html, tell the browser to navigate to that URL, and then I can do whatever I want on the current page. A rough sketch of this combined flow is below.
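For illustration, here is a rough Python 3 sketch of that combined flow, using the "#index a" and "div.listing-summary a" selectors suggested above (the "Next" pagination step is omitted, and the selectors would still need to be verified against the live site):

from urllib.parse import urljoin

from selenium import webdriver
from selenium.webdriver.common.by import By

BASE = 'http://aucklandtradesmen.co.nz/'
browser = webdriver.Firefox()
browser.get(BASE)

# 1. Collect the category links from the index page.
category_links = [a.get_attribute('href')
                  for a in browser.find_elements(By.CSS_SELECTOR, '#index a')]

for category in category_links:
    # urljoin handles both relative hrefs (as in the approach above) and absolute ones.
    browser.get(urljoin(BASE, category))
    # 2. Collect the company page links from each listing summary.
    company_links = [a.get_attribute('href')
                     for a in browser.find_elements(By.CSS_SELECTOR, 'div.listing-summary a')]
    for company in company_links:
        browser.get(urljoin(BASE, company))
        # 3. Grab whatever contact fields are needed from the company page here.

Collecting the hrefs into plain strings before navigating avoids stale element errors once the browser leaves the listing page.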
