I'm working on a project where I fetch posts from a couple of websites and show them on the main page of my own site, with filters that let users search by keyword and see matching posts.
This is how the code works:
Here we build the customized URL for the site from our keyword and city filter:
def link_gen(city='', Kword=''):
    # for example.com
    urls = []
    if Kword != '':
        if city == '':
            url = f'https://www.example.com/search/with{Kword}'
        else:
            url = f'https://www.example.com/search/with{Kword}in{city}'
        url = url.strip().replace(" ", "-")
        urls.append(url)
    else:
        if city != '':
            url = f'https://www.example.com/search/in{city}'
            url = url.strip().replace(" ", "-")
            urls.append(url)
        else:
            urls.append('none')
    return urls
This part is where we crawl the target website for its posts:
# function for getting the title, link, icon, desc of all posts
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

def get_cards(urls):
    data = []
    # for example.com
    if urls[0] != 'none':
        # we use a webdriver to render the site's dynamic components
        url = urls[0]
        options = Options()
        options.headless = True
        browser = webdriver.Firefox(options=options)
        browser.get(url)
        print("Headless Firefox initialized")
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        jobs = soup.find_all('div', class_="job-list-item", limit=3)
        # loop through all the cards
        for job in jobs:
            # get the title, link, icon, desc
            title = job.find('a', class_="title vertical-top display-inline").text
            icon = job.find('img')['src']  # was: job.find(tage_name_img), an undefined name
            link = job.find('a', class_="title vertical-top display-inline")['href']
            date = job.find('div', class_="date").text
            data.append(dict(
                title=title,
                icon=f'https://www.example.com/{icon}',
                link=f'https://www.example.com/{link}',
                date=date,
                site='example'
            ))
        browser.quit()  # quit() ends the browser process; close() only closes the window
    return data
The problem is that to get the posts and the dynamically rendered tags I had to use Selenium; session.get(url) won't return all the tags. But with Selenium it takes forever to return the posts, even though I only crawl 3 of them, and the webdriver uses a lot of resources: I ran out of RAM when I tried to run it locally.
Any suggestion would be much appreciated.
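A sketch of one way to cut the Selenium overhead (an untested assumption about your setup, not a drop-in fix): create the headless browser once, reuse it for every URL, call quit() so the whole browser process is released, and keep the parsing in a browser-free function. The selectors are copied from get_cards above; the names parse_cards and crawl are hypothetical.

```python
from bs4 import BeautifulSoup

def parse_cards(page_source, limit=3):
    """Parse already-rendered HTML; needs no browser, so it is cheap to test."""
    soup = BeautifulSoup(page_source, 'html.parser')
    cards = []
    for job in soup.find_all('div', class_="job-list-item", limit=limit):
        anchor = job.find('a', class_="title vertical-top display-inline")
        cards.append({
            'title': anchor.text,
            'link': anchor['href'],
            'icon': job.find('img')['src'],
            'date': job.find('div', class_="date").text,
        })
    return cards

def crawl(urls):
    # Selenium is imported lazily so parse_cards stays usable without it installed
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    options = Options()
    options.headless = True
    browser = webdriver.Firefox(options=options)  # one browser for ALL urls
    try:
        data = []
        for url in urls:
            browser.get(url)
            data.extend(parse_cards(browser.page_source))
        return data
    finally:
        browser.quit()  # quit(), not close(): frees the whole browser process
```

Creating one driver per page (or never quitting it) is a common cause of runaway RAM use, since each Firefox process keeps its own cache.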
I think my title explains the problem I'm facing pretty well. Let's look at a picture of the problem. (You can find the web page at this address, though it has probably changed.)
I have highlighted the text I want to grab in blue: this is the model year, 2008. Now, it is not mandatory for the seller to submit the model year, so it may or may not exist. But when it does exist, it always follows the <i> tag with class="fa fa-calendar". My solution so far has been to grab all the text within <p class="result-details"> ... </p> (this then becomes a list) and choose the second element, conditional on <i class="fa fa-calendar"> ... </i> existing. Otherwise I do not grab anything.
Now, it seems this does not work in general, since the text that comes before the second element can end up split across more than one element if it contains whitespace. So, is there any way (any function) to grab a text string that neighbours another tag, as seen in my picture?
PS: if I have made myself unclear, I just want to fetch the year 2008 from the post on the web page, if it exists.
Edit
In this situation my code erroneously gives me the word "Hjulvältar" ("wheel rollers" in English) instead of the year 2008.
CODE
from bs4 import BeautifulSoup
from datetime import date
import requests
url_avvikande = ['bomliftar','teleskop-bomliftar','kompakta-sjalvgaende-bomlyftar','bandschaktare','reachstackers','staplare']
today = date.today().isoformat()
url_main = 'https://www.mascus.se'
produktgrupper = ['lantbruksmaskiner','transportfordon','skogsmaskiner','entreprenadmaskiner','materialhantering','gronytemaskiner']
kategorier = {
'lantbruksmaskiner': ['traktorer','sjalvgaende-falthackar','skordetroskor','atv','utv:er','snoskotrar'],
'transportfordon': ['fordonstruckar','elektriska-fordon','terrangfordon'],
'skogsmaskiner': ['skog-skordare','skog-gravmaskiner','skotare','drivare','fallare-laggare','skogstraktorer','lunnare','terminal-lastare'],
'entreprenadmaskiner': ['gravlastare','bandgravare','minigravare-7t','hjulgravare','midigravmaskiner-7t-12t','atervinningshanterare','amfibiska-gravmaskiner','gravmaskiner-med-frontskopa','gravmaskiner-med-lang-rackvidd','gravmaskiner-med-slapskopa','rivningsgravare','specialgravmaskiner','hjullastare','kompaktlastare','minilastmaskiner','bandlastare','teleskopiska-hjullastare','redaskapshallare','gruvlastare','truckar-och-lastare-for-gruvor','bergborriggar','teleskoplastare','dumprar','minidumprar','gruvtruckar','banddumprar','specialiserade-dragare','vaghyvlar','vattentankbilar','allterrangkranar','terrangkranar-grov-terrang','-bandgaende-kranar','saxliftar','bomliftar','teleskop-bomliftar','personhissar-och-andra-hissar','kompakta-sjalvgaende-bomlyftar','krossar','mobila-krossar','sorteringsverk','mobila-sorteringsverk','bandschaktare','asfaltslaggningsmaskiner','--asfaltskallfrasmaskiner','tvavalsvaltar','envalsvaltar','jordkompaktorer','pneumatiska-hjulvaltar','andra-valtar','kombirullar','borrutrustning-ytborrning','horisontella-borrutrustning','trenchers-skar-gravmaskin'],
'materialhantering': ['dieseltruckar','eldrivna-gaffeltruckar','lpg-truckar','gaffeltruckar---ovriga','skjutstativtruck','sidlastare','teleskopbomtruckar','terminaltraktorer','reachstackers','ovriga-materialhantering-maskiner','staplare-led','staplare','plocktruck-laglyftande','plocktruck-hoglyftande','plocktruck-mediumlyftande','dragtruck','terrangtruck','4-vagstruck','smalgangstruck','skurborsttorkar','inomhus-sopmaskiner','kombinationsskurborstar'],
'gronytemaskiner': ['kompakttraktorer','akgrasklippare','robotgrasklippare','nollsvangare','plattformsklippare','sopmaskiner','verktygsfraktare','redskapsbarare','golfbilar','fairway-grasklippare','green-grasklippare','grasmattevaltar','ovriga-gronytemaskiner']
}
url = 'https://www.mascus.se'
mappar = ['Lantbruk', 'Transportfordon', 'Skogsmaskiner', 'Entreprenad', 'Materialhantering', 'Grönytemaskiner']
index = -1
status = True
for produktgrupp in kategorier:
    index += 1
    mapp = mappar[index]
    save_path = f'/home/protector.local/vika99/webscrape_mascus/Annonser/{mapp}'
    underkategorier = kategorier[produktgrupp]
    for underkategori in underkategorier:
        # OBS
        if underkategori != 'borrutrustning-ytborrning' and status:
            continue
        else:
            status = False
        # OBS
        if underkategori in url_avvikande:
            url = f'{url_main}/{produktgrupp}/{underkategori}'
        elif underkategori == 'gravmaskiner-med-frontskopa':
            url = f'{url_main}/{produktgrupp}/begagnat-{underkategori}'
        elif underkategori == 'borrutrustning-ytborrning':
            url = f'{url_main}/{produktgrupp}/begagnad-{underkategori}'
        else:
            url = f'{url_main}/{produktgrupp}/begagnade-{underkategori}'
        file_name = f'{save_path}/{produktgrupp}_{underkategori}_{today}.txt'
        sida = 1
        print(url)
        with open(file_name, 'w') as f:
            while True:
                print(sida)
                html_text = None
                soup = None
                links = None
                while links is None:
                    html_text = requests.get(url).text
                    soup = BeautifulSoup(html_text, 'lxml')
                    links = soup.find('ul', class_='page-numbers')
                annonser = soup.find_all('li', class_='col-row single-result')
                for annons in annonser:
                    modell = annons.find('a', class_='title-font').text
                    if annons.p.find('i', class_='fa fa-calendar') is not None:
                        tillverkningsar = annons.find('p', class_='result-details').text.strip().split(" ")[1]
                    else:
                        tillverkningsar = 'Ej angiven'
                    try:
                        pris = annons.find('span', class_='title-font no-ws-wrap').text
                    except AttributeError:
                        pris = annons.find('span', class_='title-font no-price').text
                    f.write(f'{produktgrupp:<21}{underkategori:25}{modell:<70}{tillverkningsar:<13}{pris:>14}\n')
                url_part = None
                sida += 1
                try:
                    url_part = links.find('a', text=f'{sida}')['href']
                except TypeError:
                    print('Avläsning av underkategori klar.')
                    break
                url = f'{url_main}{url_part}'
As you loop over the listings you can test whether the calendar icon class is present; if it is, grab the next_sibling:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.mascus.se/entreprenadmaskiner/begagnade-pneumatiska-hjulvaltar')
soup = bs(r.content, 'lxml')
listings = soup.select('.single-result')

for listing in listings:
    calendar = listing.select_one('.fa-calendar')
    if calendar is not None:
        print(calendar.next_sibling)
    else:
        print('Not present')
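The next_sibling behaviour can be checked offline against a minimal, made-up snippet (the class names mimic the real page, but the markup here is hypothetical):

```python
from bs4 import BeautifulSoup

html = '''
<li class="single-result"><p class="result-details">
  <i class="fa fa-calendar"></i> 2008 <span>1 200 h</span></p></li>
<li class="single-result"><p class="result-details">
  <span>no year given</span></p></li>
'''
soup = BeautifulSoup(html, 'html.parser')
years = []
for listing in soup.select('.single-result'):
    calendar = listing.select_one('.fa-calendar')
    # next_sibling is the raw text node directly after the icon tag
    years.append(calendar.next_sibling.strip() if calendar else 'Not present')
print(years)  # ['2008', 'Not present']
```

Because next_sibling returns the whole text node between the icon and the next tag, whitespace inside the year text no longer splits it into multiple list elements.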
Hi guys,
I have a problem scraping this dynamic site (https://kvartiry-bolgarii.ru/). I need to get all the links to the home-sale ads.
I used Selenium to load the page and collect links to ads, then scrolled the page down to load new ads. After the new ads load, I parse all the links on the page and write them to the list again. But the list is not updated, and the script keeps working with the links that were on the page before scrolling down.
By the way, I added a check so that the script runs until the last announcement on the site (whose link I found out in advance) appears in the list.
How can this problem be corrected?
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

options = Options()  # not shown in the original snippet

def get_link_info():
    try:
        url = "https://kvartiry-bolgarii.ru/"
        driver = webdriver.Chrome(
            executable_path=r'C:\Users\kk\Desktop\scrape_house\drivers\chromedriver.exe',
            options=options
        )
        driver.get(url)
        req = requests.get(url)
        req.encoding = 'utf8'
        soup = BeautifulSoup(req.text, "lxml")
        articles = soup.find_all("div", class_="content")
        links_urls = []
        for article in articles:
            house_url = article.find("a").get("href")
            links_urls.append(house_url)
        # print(links_urls)
        first_link_number = links_urls[-2].split("-")[-1]
        first_link_number = first_link_number[1:]
        # print(first_link_number)
        last_link_number = links_urls[-1].split("-")[-1]
        last_link_number = last_link_number[1:]
        # print(last_link_number)
        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.END)
        check = "https://kvartiry-bolgarii.ru/kvartira-v-elitnom-komplekse-s-unikalynym-sadom-o21751"
        for a in links_urls:
            if a != check:
                for article in articles:
                    house_url = article.find("a").get("href")
                    links_urls.append(house_url)
                html = driver.find_element_by_tag_name('html')
                html.send_keys(Keys.END)
                print(links_urls[-1])
            else:
                print(links_urls[0], links_urls[-1])
                print("all links are ready")
    except Exception as e:  # the except clause was missing from the original snippet
        raise e
Some pointers: you don't need to mix Selenium, requests, and BeautifulSoup; Selenium alone is enough here. And when you scroll infinitely, you need to remove duplicate elements before adding them to your list.
You can try this; it should work:
from selenium import webdriver
import time

def get_link_info():
    all_links = []
    try:
        driver = webdriver.Chrome(executable_path='C:/chromedriver.exe')
        driver.get('https://kvartiry-bolgarii.ru/')
        time.sleep(3)
        old_links = set()  # empty set
        while True:
            # Scroll to get more ads
            driver.execute_script("window.scrollBy(0,3825)", "")
            # Wait for new ads to load
            time.sleep(8)
            # Find the elements (note: @class, not #class, in the XPath)
            links_divs = driver.find_elements_by_xpath('//div[@class="content"]//a')
            ans = set(links_divs) - set(old_links)  # remove old elements
            for link in ans:
                # Scroll to the link
                driver.execute_script("arguments[0].scrollIntoView();", link)
                fir = link.get_attribute('href')
                all_links.append(fir)
            # Remember what has already been handled
            old_links = links_divs
    except Exception as e:
        raise e

get_link_info()
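The set-difference step above can be sanity-checked with plain strings (hypothetical link IDs, not the site's real URLs): each scroll returns everything rendered so far, and the difference keeps only what is new this round.

```python
def collect_new(all_links, seen, current_batch):
    """Append only the links from current_batch that were not seen before."""
    new = set(current_batch) - seen   # drop elements handled in earlier scrolls
    all_links.extend(sorted(new))     # sorted() only to make the order deterministic
    seen.update(current_batch)
    return all_links, seen

links, seen = [], set()
links, seen = collect_new(links, seen, ['ad-1', 'ad-2'])           # initial page
links, seen = collect_new(links, seen, ['ad-1', 'ad-2', 'ad-3'])   # after a scroll
print(links)  # ['ad-1', 'ad-2', 'ad-3'] -- no duplicates
```

The same logic applies whether the set holds strings or Selenium element handles, as long as the page does not recreate its DOM nodes on every scroll.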
I am trying to scrape all the data from Google search results: title, URL, and description. However, I can't grab the description of the search results; it returns an empty string.
# Check your Chrome version: Menu (the three dots, upper right corner) -> Help -> About Google Chrome
# Download the ChromeDriver that matches your Chrome version (e.g. version 79)
# from https://sites.google.com/a/chromium.org/chromedriver/downloads
# and place chromedriver.exe in the current working directory
# pip install selenium
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from bs4.element import Tag
import pandas as pd
import random

keywords = pd.read_csv('keywords.csv', header=0, index_col=None)
df = pd.DataFrame(columns=['keyword', 'title', 'url', 'description'])

for i in keywords['keyword']:
    # Scraper that gives back: titles, links, descriptions
    driver = webdriver.Chrome()
    google_url = "https://www.google.com/search?gl=US&q=" + i + "&num=" + str(10)
    driver.get(google_url)
    time.sleep(random.randrange(15, 50))
    soup = BeautifulSoup(driver.page_source, 'lxml')
    result_div = soup.find_all('div', attrs={'class': 'g'})
    links = []
    titles = []
    descriptions = []
    for r in result_div:
        # Check that each element is present, else skip the result
        try:
            link = r.find('a', href=True)
            title = r.find('h3')
            if isinstance(title, Tag):
                title = title.get_text()
            description = r.find('span', attrs={'class': 'st'})
            if isinstance(description, Tag):
                description = description.get_text()
            # Make sure everything is present before appending
            if link != '' and title != '' and description != '':
                links.append(link['href'])
                titles.append(title)
                descriptions.append(description)
        # Next loop if one element is not present
        except Exception as e:
            print(e)
            continue
    for link, title, description in zip(links, titles, descriptions):
        df = df.append({'keyword': i, 'title': title, 'url': link, 'description': description}, ignore_index=True)

df.to_csv(r'final_dataset.csv', index=False)
Does anyone have an idea how to grab the description in the Google search results?
Get the description node with the following code.
description = r.select('.aCOpRe span:not(.f)')
Also, you can use requests instead of Selenium. The full example is in an online IDE.
from requests import Session
from bs4 import BeautifulSoup
from bs4.element import Tag
import pandas as pd

keywords = pd.read_csv('keywords.csv', header=0, index_col=None)
df = pd.DataFrame(columns=['keyword', 'title', 'url', 'description'])

for i in keywords['keyword']:
    # Scraper that gives back: titles, links, descriptions
    params = {'q': i, 'gl': 'US', 'num': 10}
    headers = {
        "User-Agent":
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36 Edg/80.0.361.62"
    }
    with Session() as session:
        r = session.get("https://google.com/search", params=params, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    result_div = soup.find_all('div', attrs={'class': 'g'})
    links = []
    titles = []
    descriptions = []
    for r in result_div:
        # Check that each element is present, else skip the result
        try:
            link = r.find('a', href=True)
            title = r.find('h3')
            if isinstance(title, Tag):
                title = title.get_text()
            # select() returns a list of spans, so join their text
            description = ''.join(s.get_text() for s in r.select('.aCOpRe span:not(.f)'))
            # Make sure everything is present before appending
            if link is not None and title and description:
                links.append(link['href'])
                titles.append(title)
                descriptions.append(description)
        # Next loop if one element is not present
        except Exception as e:
            print(e)
            continue
    for link, title, description in zip(links, titles, descriptions):
        df = df.append({  # DataFrame.append requires pandas < 2.0
            'keyword': i,
            'title': title,
            'url': link,
            'description': description
        }, ignore_index=True)

df.to_csv(r'final_dataset.csv', index=False)
Alternatively, you can extract data from Google Search via SerpApi.
Disclaimer: I work at SerpApi.
I've used BS a fair bit, but I'm unsure why this won't scrape, as the other addons I've made for Kodi work fine. Could someone look at the code between the tags and find the bit I'm missing?
The addon/Python doesn't throw any error; it just shows an empty GUI screen. If the title or image scraping were fine and the link wasn't, it would show a title/image but the link wouldn't work when clicked. So it's obviously the title/image part. I've even tried commenting out the image section so it only looks for a link and title, but still nothing.
Link being scraped: https://store.counterpunch.org/feed/podcast/
def get_soup1(url1):
    page = requests.get(url1)
    soup1 = BeautifulSoup(page.text, 'html.parser')
    print("type: ", type(soup1))
    return soup1

get_soup1("https://store.counterpunch.org/feed/podcast/")

def get_playable_podcast1(soup1):
    subjects = []
    for content in soup1.find_all('item', limit=9):
        try:
            link = content.find('enclosure')
            link = link.get('url')
            print("\n\nLink: ", link)
            title = content.find('title')
            title = title.get_text()
        except AttributeError:
            continue
        item = {
            'url': link,
            'title': title,
            'thumbnail': "https://is2-ssl.mzstatic.com/image/thumb/Podcasts71/v4/71/55/88/71558834-c449-9ac3-e327-cad002e305b4/mza_4409042347411679857.jpg/600x600bb.jpg",
        }
        subjects.append(item)
    return subjects

def compile_playable_podcast1(playable_podcast1):
    items = []
    for podcast in playable_podcast1:
        items.append({
            'label': podcast['title'],
            'thumbnail': podcast['thumbnail'],
            'path': podcast['url'],
            'is_playable': True,
        })
    return items
You need a User-Agent
def get_soup1(url1):
    page = requests.get(url1, headers={'User-Agent': 'Mozilla/5.0'})
    soup1 = BeautifulSoup(page.text, 'html.parser')
    print("type: ", type(soup1))
    return soup1
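You can confirm offline that the header is actually attached by preparing the request without sending it (no network access needed; the URL is just the feed from the question):

```python
import requests

# Build the request object but do not send it
req = requests.Request('GET', 'https://store.counterpunch.org/feed/podcast/',
                       headers={'User-Agent': 'Mozilla/5.0'})
prepared = req.prepare()
# The prepared request carries exactly the headers we set
print(prepared.headers['User-Agent'])  # Mozilla/5.0
```

Servers that block the default `python-requests/x.y` User-Agent will often return an empty or error page with no exception raised, which matches the "no error, empty GUI" symptom.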