I wrote some Python code to scrape data from a site. It doesn't seem to work the way it's supposed to. I want to get all the articles from the page, but I get one paragraph from the first article repeated multiple times. I can't see what's wrong with the code. Please help me fix it if you know what the issue is.
import requests
from bs4 import BeautifulSoup

URL = 'https://zdravi.doktorka.cz/clanky?page=0'
HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'}
HOST = 'https://zdravi.doktorka.cz'

def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('article', class_='node-teaser-display')
    articles = []
    for item in items:
        articles.append({
            HOST + item.find('a').get('href'),
        })
    arts = []
    for each in articles:
        b = ''.join(each)
        arts.append(b)
    for art in arts:
        page = get_html(art)
        pagesoup = BeautifulSoup(html, 'html.parser')
        parags = pagesoup.find('p').get_text()
        print(art)
        print(parags)

def parse():
    html = get_html(URL)
    if html.status_code == 200:
        get_content(html.text)
    else:
        print('Error')

parse()
This is the output I get:
https://zdravi.doktorka.cz/infekcnost-bezpriznakovych-nosicu-covid-19-muze-byt-slaba-naznacuje-studie
Jsme tým lékařů, terapeutů, kosmetiček, odborníků pracujících ve zdravotnictví, v oboru fitness a ekologie. Náš web funguje od roku 1999 a patří mezi nejnavštěvovanější weby zabývající se zdravým životním stylem v ČR.
https://zdravi.doktorka.cz/pri-operativni-lecbe-sedeho-zakalu-existuji-tri-moznosti
Jsme tým lékařů, terapeutů, kosmetiček, odborníků pracujících ve zdravotnictví, v oboru fitness a ekologie. Náš web funguje od roku 1999 a patří mezi nejnavštěvovanější weby zabývající se zdravým životním stylem v ČR.
https://zdravi.doktorka.cz/epidemiolog-varuje-pred-dlouhodobym-nosenim-rousek
Jsme tým lékařů, terapeutů, kosmetiček, odborníků pracujících ve zdravotnictví, v oboru fitness a ekologie. Náš web funguje od roku 1999 a patří mezi nejnavštěvovanější weby zabývající se zdravým životním stylem v ČR.
https://zdravi.doktorka.cz/jidlo-muze-prozradit-na-co-mate-alergii
Jsme tým lékařů, terapeutů, kosmetiček, odborníků pracujících ve zdravotnictví, v oboru fitness a ekologie. Náš web funguje od roku 1999 a patří mezi nejnavštěvovanější weby zabývající se zdravým životním stylem v ČR.
https://zdravi.doktorka.cz/jak-muzeme-nyni-posilit-svou-imunitu
Jsme tým lékařů, terapeutů, kosmetiček, odborníků pracujících ve zdravotnictví, v oboru fitness a ekologie. Náš web funguje od roku 1999 a patří mezi nejnavštěvovanější weby zabývající se zdravým životním stylem v ČR.
In the for-loop you have to use page.text instead of html:
for art in arts:
    page = get_html(art)
    pagesoup = BeautifulSoup(page.text, 'html.parser')
    parags = pagesoup.find('p').get_text()
    print(art)
    print(parags)
In html you have the HTML from the main page, so you always parsed the same HTML. Later you get a new response from the subpage and assign it to the variable page, and that variable holds the HTML from the subpage.
BTW: You would probably have seen it if you had checked print(html).
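For example, a quick check inside the loop (just a debugging sketch reusing the variables from the question) makes the mistake visible:

for art in arts:
    page = get_html(art)
    print(art)                          # the article URL that was requested
    print(len(html), len(page.text))    # 'html' never changes, 'page.text' does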
EDIT: Full working code with some other changes and with saving to a .csv file:
import requests
from bs4 import BeautifulSoup
import csv

URL = 'https://zdravi.doktorka.cz/clanky?page=0'
HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'}
HOST = 'https://zdravi.doktorka.cz'

def get_soup(url, headers=HEADERS, params=None):
    r = requests.get(url, headers=headers, params=params)
    if r.status_code != 200:
        print('Error:', r.status_code, url)
        return
    return BeautifulSoup(r.text, 'html.parser')

def get_content(soup):
    data = []
    articles = soup.find_all('article', class_='node-teaser-display')
    for item in articles:
        url = HOST + item.find('a').get('href')
        print(url)
        soup = get_soup(url)
        if soup:
            paragraph = soup.find('p').get_text().strip()
            print(paragraph)
            data.append({
                'url': url,
                'paragraph': paragraph,
            })
        print('---')
    with open('output.csv', 'w') as fh:
        csv_writer = csv.DictWriter(fh, ['url', 'paragraph'])
        csv_writer.writeheader()
        csv_writer.writerows(data)

def parse():
    soup = get_soup(URL)
    if soup:
        get_content(soup)

if __name__ == '__main__':
    parse()
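To check the saved file afterwards, it can be read back with csv.DictReader (a small sketch using the same output.csv and column names as above):

import csv

with open('output.csv', newline='') as fh:
    for row in csv.DictReader(fh):
        print(row['url'])
        print(row['paragraph'])
        print('---')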
So I'm scraping this website: https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari
I'm scraping all the reviews for a particular product and also handling the pagination. The URLs for the different pages look like:
https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari?sayfa=1
https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari?sayfa=2
https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari?sayfa=3
The code is working as expected, but when accessing the (n/2)+1-th page, the website redirects to the first page of the comments. That is, suppose the total number of pages is 12. Upon accessing the 7th page directly, for which the URL is
https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari?sayfa=7
It redirects to the first page for which the URL is
https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari
When we go to the 7th page using the website's page links, it opens the 7th page with no issues. It's only when we access the page directly that it redirects.
How can I handle this with BeautifulSoup? Also, please don't suggest using Selenium, as the client needs the task to be done using BeautifulSoup only.
Here's the Python code:
from urllib import response
from bs4 import BeautifulSoup
from django.shortcuts import redirect
import requests
import sys
import re

file_path = 'output.txt'
sys.stdout = open(file_path, "w", encoding="utf-8")

# url = sys.argv[1]
url = 'https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari'

response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.content, 'html5lib')

div = soup.find_all('div', class_='paginationContentHolder')

# l = [(div.contents[0].strip(), span.text.strip())
#      for div in soup.select('div.league-data')
#      for span in div.select('span')]

divlist = soup.find_all("div", {'itemprop': 'review'})
# print(divlist)
# for i in divlist:
#     print(i)
#     print('---------------------------')

pagecount = soup.find('div', {'class': 'paginationBarHolder'})
pagelist = pagecount.find_all('ul')

pagelistnew = []
for i in pagelist[0]:
    if i.text.isdigit():
        pagelistnew.append(int(i.text))

maxpage = max(pagelistnew)
print(maxpage)

firstcard = divlist[0]

for i in divlist:
    if i.find('span', {'itemprop': 'description'}):
        print('Review: ' + i.find('span', {'itemprop': 'description'}).text)
        ratinglist = i.find_all('div', {'class': 'star', 'style': 'width:12px;height:12px'})
        print('Rating: ' + str(len(ratinglist)))
        print('Date: ' + i.find('span', {'itemprop': 'datePublished'}).text)
        # rating = 0
        # for j in ratinglist:
        #     rating = rating + 1
        # print(rating)
        print()

urlnew = 'https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari?sayfa='

for i in range(2, maxpage + 1):
    # payload = {'sayfa':i}
    req = requests.get(urlnew + str(i), headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246", 'Access-Control-Allow-Origin': '*'}, allow_redirects=False)
    print('URL: ' + urlnew + str(i))
    print(req.status_code)
    print()
    soup2 = BeautifulSoup(req.content, 'html5lib')
    divlist1 = soup2.find_all("div", {'itemprop': 'review'})
    # firstcard1 = divlist1[0]
    for l in divlist1:
        if l.find('span', {'itemprop': 'description'}):
            print('Review: ' + l.find('span', {'itemprop': 'description'}).text)
            ratinglist1 = l.find_all('div', {'class': 'star', 'style': 'width:12px;height:12px'})
            print('Rating: ' + str(len(ratinglist1)))
            print('Date: ' + l.find('span', {'itemprop': 'datePublished'}).text)
            print()
To get all the reviews you can use their API URL. For example:
import json
import requests

# HBV0000130VNO is from the URL in the question
url = "https://user-content-gw-hermes.hepsiburada.com/queryapi/v2/ApprovedUserContents?skuList=HBV0000130VNO&from={}&size=10"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
}

from_ = 0
while True:
    data = requests.get(url.format(from_), headers=headers).json()

    # uncomment to see all data:
    # print(json.dumps(data, indent=4))

    for d in data["data"]["approvedUserContent"]["approvedUserContentList"]:
        print(d["customer"]["name"], d["customer"]["surname"])
        print(d["review"]["content"])
        print()

    if data["currentItemCount"] != 10:
        break

    from_ += 10
Prints:
h********** z*********
Mükemmel kalitede bir ürün
A*** T*****
tek kelimeyle EFSANE. Her zaman olduğu gibi Apple kalitesi 1 numara. Benim kullanımımla 2 gün gidiyor şarjı günlük iş maillerimi yanıtlıyor raporlarımı inceliyorum. Akşamda belki bir dizi-film açıp şarjını bitiriyorum. Performansa diyecek hiç bir şey yok. Bu mimari ve altyapı son derece muazzam çalışacak şekilde tasarlanmış zaten.
E*** D*******
Kesinlikle fiyatıyla performansının birbirini fazlasıyla karşıladığı şahane bir iş istasyonu. Malzeme yapısı ve içindeki yeni M1 Yonga seti işlemcisi ile uzun süreli kullanımı ve iş gücünü vaat ediyor. Ayrıca içinden çıkan 61W'lık Adaptörü ile hızlı şarj olması ayrı bir güzelliği. Hepsiburada'ya ve Lojistiği olan HepsiJet'e hızlı ve güvenilir oldukları için çok teşekkür ederim.
O**** B****
Mükemmel bir ürün. Kullanmadan önce bu ekosisteme karşı yargıları olanlar, şimdi macbook'uma tapıyorlar.
A**** A*********
2019 macbook airden geçiş yaptığım için söyleyebilirim ki özellikle Apple'ın m1 çip için optimize ettiği safari süper hızlı çalışıyor chrome kullanmayı bıraktım diyebilirim görüntü ses kalitesi çok daha iyi ve işlemci çok hızlı çok memnunum pil 20 saat gidiyormuş ben genelde güç kaynağına bağlı kullanıyorum
Y**** G*****
Ben satın aldığımda MBA 512 GB ve MBP 256 fiyatları birbirine çok yakındı. O yüzden MBP tercih ettim. Arada oyun oynamak isteyenler için MBA yerine kesinlikle tecih edilmesi gereken ürün. Çünkü bazı oyunlarda throttling problemi yaşanabiliyor (Tomb Raider oyunları gibi)
ş***** a****
Tek kelimeyle Apple kalitesiyle tasarlanmış şahane bir ürün; fakat Türkiye şartlarından dolayı sırtımıza ağır yükler bindiren kur farkından kaynaklanan kallavi fiyatı biraz üzüyor.
...and so on.
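If the reviews need to be stored rather than printed, the same loop can collect them and write a CSV at the end. This is only a sketch reusing the API fields shown above; the file name reviews.csv is my own choice:

import csv
import requests

url = "https://user-content-gw-hermes.hepsiburada.com/queryapi/v2/ApprovedUserContents?skuList=HBV0000130VNO&from={}&size=10"
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"}

rows, from_ = [], 0
while True:
    data = requests.get(url.format(from_), headers=headers).json()
    for d in data["data"]["approvedUserContent"]["approvedUserContentList"]:
        rows.append({
            "name": f'{d["customer"]["name"]} {d["customer"]["surname"]}',
            "review": d["review"]["content"],
        })
    if data["currentItemCount"] != 10:
        break
    from_ += 10

# write the collected reviews to a CSV file
with open("reviews.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, ["name", "review"])
    writer.writeheader()
    writer.writerows(rows)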
I'm currently working on a web scraper for this website (https://www.allabolag.se). I would like to grab the title and link of every result on the page, and I'm currently stuck.
<a data-v-4565614c="" href="/5566435201/grenspecialisten-forvaltning-ab">Grenspecialisten Förvaltning AB</a>
This is an example from the website where I would like to grab the href and >Grenspecialisten Förvaltning AB< as it contains the title and link. How would I go about doing that?
The code I have currently looks like this
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15'}
url = 'https://www.allabolag.se/bransch/bygg-design-inredningsverksamhet/6/_/xv/BYGG-,%20DESIGN-%20&%20INREDNINGSVERKSAMHET/xv/JURIDIK,%20EKONOMI%20&%20KONSULTTJÄNSTER/xl/12/xb/AB/xe/4/xe/3'

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

questions = soup.findAll('div', {'class': 'tw-flex'})
for item in questions:
    title = item.find('a', {''}).text
    print(title)
Any help would be greatly appreciated!
Best regards :)
The results are embedded in the page in JSON form. To decode them, you can use the following example:
import json
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15"
}

url = "https://www.allabolag.se/bransch/bygg-design-inredningsverksamhet/6/_/xv/BYGG-,%20DESIGN-%20&%20INREDNINGSVERKSAMHET/xv/JURIDIK,%20EKONOMI%20&%20KONSULTTJÄNSTER/xl/12/xb/AB/xe/4/xe/3"

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

data = json.loads(
    soup.find(attrs={":search-result-default": True})[":search-result-default"]
)

# uncomment to print all data:
# print(json.dumps(data, indent=4))

for result in data:
    print(
        "{:<50} {}".format(
            result["jurnamn"], "https://www.allabolag.se/" + result["linkTo"]
        )
    )
Prints:
Grenspecialisten Förvaltning AB https://www.allabolag.se/5566435201/grenspecialisten-forvaltning-ab
Peab Fastighetsutveckling Syd AB https://www.allabolag.se/5566998430/peab-fastighetsutveckling-syd-ab
BayWa r.e. Nordic AB https://www.allabolag.se/5569701377/baywa-re-nordic-ab
Kronetorp Park Projekt AB https://www.allabolag.se/5567196539/kronetorp-park-projekt-ab
SVENSKA HUSCOMPAGNIET AB https://www.allabolag.se/5568155583/svenska-huscompagniet-ab
Byggnadsaktiebolaget Gösta Bengtsson https://www.allabolag.se/5561081869/byggnadsaktiebolaget-gosta-bengtsson
Tectum Byggnader AB https://www.allabolag.se/5562903582/tectum-byggnader-ab
Winthrop Engineering and Contracting AB https://www.allabolag.se/5592128176/winthrop-engineering-and-contracting-ab
SPI Global Play AB https://www.allabolag.se/5565082897/spi-global-play-ab
Trelleborg Offshore & Construction AB https://www.allabolag.se/5560557711/trelleborg-offshore-construction-ab
M.J. Eriksson Entreprenad AB https://www.allabolag.se/5567043814/mj-eriksson-entreprenad-ab
Solix Group AB https://www.allabolag.se/5569669574/solix-group-ab
Gripen Betongelement AB https://www.allabolag.se/5566646427/gripen-betongelement-ab
BLS Construction AB https://www.allabolag.se/5569814345/bls-construction-ab
We Construction AB https://www.allabolag.se/5590705116/we-construction-ab
Helsingborgs Fasad & Kakel AB https://www.allabolag.se/5567814248/helsingborgs-fasad-kakel-ab
Gat & Kantsten Sverige AB https://www.allabolag.se/5566564919/gat-kantsten-sverige-ab
Bjärno Byggsystem AB https://www.allabolag.se/5566743190/bjarno-byggsystem-ab
Bosse Sandells Bygg Aktiebolag https://www.allabolag.se/5564391158/bosse-sandells-bygg-aktiebolag
Econet Vatten & Miljöteknik AB https://www.allabolag.se/5567388953/econet-vatten-miljoteknik-ab
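If you also want to keep the results instead of just printing them, the same decoded list can be written out. This is a small sketch reusing the jurnamn and linkTo fields from above; it assumes the data variable from the example, and the file name companies.csv is my own choice:

import csv

# 'data' is the decoded list from the example above
with open("companies.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["title", "link"])
    for result in data:
        writer.writerow([result["jurnamn"], "https://www.allabolag.se/" + result["linkTo"]])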
Hi, I would love to be able to scrape multiple pages of this website.
Can someone help me with how I can scrape through all the pages? I am only able to get information from one page.
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})

for i in range(2000):
    Centris = 'https://www.centris.ca/en/commercial-units~for-rent~montreal-ville-marie/26349148?view=Summary'.format(i)
    r = get(Centris, headers=headers)
    soup = bs(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'id': 'divMainResult'})

    data = []
    for result in results:
        titre = result.find('span', attrs={'data-id': 'PageTitle'})
        titre = [str(titre.string).strip() for titre in titre]
        superficie = result.find('div', attrs={'class': 'carac-value'}, string=re.compile('sqft'))
        superficie = [str(superficie.string).strip() for superficie in superficie]
        emplacement = result.find_all('h2', attrs={'class': 'pt-1'})
        emplacement = [str(emplacement.string).strip() for emplacement in emplacement]
        prix = result.find_all('span', attrs={'class': 'text-nowrap'})
        prix = [(prix.text).strip('\w.') for prix in prix]
        description = result.find_all('div', attrs={'itemprop': 'description'})
        description = [str(description.string).strip() for description in description]
        lien = result.find_all('a', attrs={'class': 'dropdown-item js-copy-clipboard'})
To get the pagination working you can simulate the Ajax requests with the requests module:
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.centris.ca/Property/GetInscriptions"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
}

json_data = {"startPosition": 0}

with requests.session() as s:
    # load cookies:
    s.get(
        "https://www.centris.ca/en/commercial-units~for-rent?uc=0",
        headers=headers,
    )
    for page in range(0, 100, 20):  # <-- increase number of pages here
        json_data["startPosition"] = page
        data = s.post(url, headers=headers, json=json_data).json()
        soup = BeautifulSoup(data["d"]["Result"]["html"], "html.parser")
        for a in soup.select(".a-more-detail"):
            print(a.select_one(".category").get_text(strip=True))
            print(a.select_one(".address").get_text(strip=True, separator="\n"))
            print("https://www.centris.ca" + a["href"])
            print("-" * 80)
Prints:
Commercial unit for rent
6560, Avenue de l'Esplanade, suite 105
Montréal (Rosemont/La Petite-Patrie)
Neighbourhood La Petite-Patrie
https://www.centris.ca/en/commercial-units~for-rent~montreal-rosemont-la-petite-patrie/16168393?view=Summary
--------------------------------------------------------------------------------
Commercial unit for rent
75, Rue Principale
Gatineau (Aylmer)
Neighbourhood Vieux Aylmer, Des Cèdres, Marina
https://www.centris.ca/en/commercial-units~for-rent~gatineau-aylmer/22414903?view=Summary
--------------------------------------------------------------------------------
Commercial building for rent
53, Rue Saint-Pierre, suite D
Saint-Pie
https://www.centris.ca/en/commercial-buildings~for-rent~saint-pie/15771470?view=Summary
--------------------------------------------------------------------------------
...and so on.
Thank you so much! I came up with this and it worked perfectly:
import json
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.centris.ca/Property/GetInscriptions"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
}

json_data = {"startPosition": 0}

with requests.session() as s:
    Centris = []
    # load cookies:
    s.get(
        "https://www.centris.ca/en/commercial-units~for-rent?uc=0",
        headers=headers,
    )
    for page in range(0, 100, 20):  # <-- increase number of pages here
        json_data["startPosition"] = page
        data = s.post(url, headers=headers, json=json_data).json()
        soup = BeautifulSoup(data["d"]["Result"]["html"], "html.parser")
        for a in soup.select(".a-more-detail"):
            titre = a.select_one(".category").get_text(strip=True)
            emplacement = a.select_one(".address").get_text(strip=True, separator="\n")
            lien = "https://www.centris.ca" + a["href"]
            prix = a.select_one(".price").get_text(strip=True)
            Centris.append((titre, emplacement, lien, prix))

# build the DataFrame with explicit column names and save it to Excel
df = pd.DataFrame(Centris, columns=['Titre', 'Emplacement', 'Lien', 'Prix'])
writer = pd.ExcelWriter('Centris.xlsx')
df.to_excel(writer)
writer.save()
print('Data Saved To excel')
I'm working on a university project and need to get data online. I would like to get some data from this website.
https://www.footballdatabase.eu/en/transfers/-/2020-10-03
For the 3rd of October I managed to get the first 19 rows, but there are 6 pages and I'm struggling to activate the button that loads the next page.
The button in the HTML is the page-number link whose text is "2".
My code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.footballdatabase.eu/en/transfers/-/2020-10-03"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

Players = pageSoup.find_all("span", {"class": "name"})
Team = pageSoup.find_all("span", {"class": "firstteam"})
Values = pageSoup.find_all("span", {"class": "transferamount"})
Values[0].text

PlayersList = []
TeamList = []
ValuesList = []

j = 1
for i in range(0, 20):
    PlayersList.append(Players[i].text)
    TeamList.append(Team[i].text)
    ValuesList.append(Values[i].text)
    j = j + 1

df = pd.DataFrame({"Players": PlayersList, "Team": TeamList, "Values": ValuesList})
Thank you very much!
You can use the requests module to simulate the Ajax call. For example:
import requests
from bs4 import BeautifulSoup

data = {
    'date': '2020-10-03',
    'pid': 1,
    'page': 1,
    'filter': 'full',
}

url = 'https://www.footballdatabase.eu/ajax_transfers_show.php'

for data['page'] in range(1, 7):  # <--- adjust number of pages here.
    soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')
    for line in soup.select('.line'):
        name = line.a.text
        first_team = line.select_one('.firstteam').a.text if line.select_one('.firstteam').a else 'Free'
        second_team = line.select_one('.secondteam').a.text if line.select_one('.secondteam').a else 'Free'
        amount = line.select_one('.transferamount').text
        print('{:<30} {:<20} {:<20} {}'.format(name, first_team, second_team, amount))
Prints:
Bruno Amione Belgrano Hellas Vérone 1.7 M€
Ismael Gutierrez Betis Deportivo Atlético B 1 M€
Vitaly Janelt Bochum Brentford 500 k€
Sven Ulreich Bayern Munich Hambourg SV 500 k€
Salim Ali Al Hammadi Baniyas Khor Fakkan Prêt
Giovanni Alessandretti Ascoli U-20 Recanatese Prêt
Gabriele Bellodi AC Milan U-20 Alessandria Prêt
Louis Britton Bristol City B Torquay United Prêt
Juan Brunetta Godoy Cruz Parme Prêt
Bobby Burns Barrow Glentoran Prêt
Bohdan Butko Shakhtar Donetsk Lech Poznan Prêt
Nicolò Casale Hellas Vérone Empoli Prêt
Alessio Da Cruz Parme FC Groningue Prêt
Dalbert Henrique Inter Milan Rennes Prêt
...and so on.
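Since the question already builds a pandas DataFrame, the same loop can collect rows into one instead of printing. This is only a sketch reusing the fields from the example above; the column names are my own choice:

import pandas as pd
import requests
from bs4 import BeautifulSoup

rows = []
data = {'date': '2020-10-03', 'pid': 1, 'page': 1, 'filter': 'full'}
url = 'https://www.footballdatabase.eu/ajax_transfers_show.php'

for data['page'] in range(1, 7):  # <--- adjust number of pages here.
    soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')
    for line in soup.select('.line'):
        rows.append({
            'Players': line.a.text,
            'From': line.select_one('.firstteam').a.text if line.select_one('.firstteam').a else 'Free',
            'To': line.select_one('.secondteam').a.text if line.select_one('.secondteam').a else 'Free',
            'Values': line.select_one('.transferamount').text,
        })

# collect everything into a single DataFrame
df = pd.DataFrame(rows)
print(df.head())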
This is my first post here, so please be patient.
I'm trying to scrape all of the links that contain a particular word (the name of a city, Gdańsk) from my local news site.
The problem is that I'm receiving some links which don't have the name of the city.
import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import lxml
import re

url = 'http://www.trojmiasto.pl'
nazwa_pliku = 'testowyplik.txt'

user_agent = UserAgent()
strona = requests.get(url, headers={'user-agent': user_agent.chrome})

with open(nazwa_pliku, 'w') as plik:
    plik.write(strona.content.decode('utf-8')) if type(strona.content) == bytes else plik.write(strona.content)

def czytaj():
    plikk = open('testowyplik.txt')
    data = plikk.read()
    plikk.close()
    return data

soup = BeautifulSoup(czytaj(), 'lxml')

linki = [div.a for div in soup.find_all('div', class_='entry-letter')]
for lin in linki:
    print(lin)

rezultaty = soup.find_all('a', string=re.compile("Gdańsk"))
print(rezultaty)

l = []
s = []
for tag in rezultaty:
    l.append(tag.get('href'))
    s.append(tag.text)

for i in range(len(s)):
    print('url = ' + l[i])
    print('\n')
Here is a complete and simpler example in Python 3:
import requests
from bs4 import BeautifulSoup

city_name = 'Gdańsk'
url = 'http://www.trojmiasto.pl'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
}

with requests.get(url, headers=headers) as html:
    if html.ok:
        soup = BeautifulSoup(html.content, 'html.parser')
        links = soup('a')
        for link in links:
            if city_name in link.text:
                print('\t- (%s)[%s]' % (link.text, link.get('href')))
Here is the output of the code above (formatted as Markdown just for clarity):
- [ZTM Gdańsk](//ztm.trojmiasto.pl/)
- [ZTM Gdańsk](https://ztm.trojmiasto.pl/)
- [Polepsz Gdańsk i złóż projekt do BO](https://www.trojmiasto.pl/wiadomosci/Polepsz-Gdansk-i-zloz-projekt-do-BO-n132827.html)
- [Pomnik Pileckiego stanie w Gdańsku](https://www.trojmiasto.pl/wiadomosci/Pomnik-rotmistrza-Witolda-Pileckiego-jednak-stanie-w-Gdansku-n132806.html)
- [O Włochu, który pokochał Gdańsk](https://rozrywka.trojmiasto.pl/Roberto-M-Polce-Polacy-maja-w-sobie-cos-srodziemnomorskiego-n132686.html)
- [Plakaty z poezją na ulicach Gdańska](https://kultura.trojmiasto.pl/Plakaty-z-poezja-na-ulicach-Gdanska-n132696.html)
- [Uniwersytet Gdański skończył 49 lat](https://nauka.trojmiasto.pl/Uniwersytet-Gdanski-skonczyl-50-lat-n132797.html)
- [Zapisz się na Półmaraton Gdańsk](https://aktywne.trojmiasto.pl/Zapisz-sie-na-AmberExpo-Polmaraton-Gdansk-2019-n132785.html)
- [Groźby na witrynach barów w Gdańsku](https://www.trojmiasto.pl/wiadomosci/Celtyckie-krzyze-i-grozby-na-witrynach-barow-w-Gdansku-n132712.html)
- [Stadion Energa Gdańsk](https://www.trojmiasto.pl/Stadion-Energa-Gdansk-o25320.html)
- [Gdańsk Big Beat Day 2019 ](https://www.trojmiasto.pl/rd/?t=p&id_polecamy=59233&url=https%3A%2F%2Fimprezy.trojmiasto.pl%2FGdansk-Big-Beat-Day-2019-imp475899.html&hash=150ce9c9)
- [ZTM Gdańsk](https://ztm.trojmiasto.pl/)
You could try an attribute = value CSS selector with the contains operator (*):
rezultaty = [item['href'] for item in soup.select("[href*='Gdansk']")]
Full script
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.trojmiasto.pl')
soup = bs(r.content, 'lxml')
rezultaty = [item['href'] for item in soup.select("[href*='Gdansk']")]
print(rezultaty)
Without list comprehension:
for item in soup.select("[href*='Gdansk']"):
    print(item['href'])
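Note that the two answers match on different things: the first filters on the visible link text ("Gdańsk", with the diacritic), while the second filters on the href attribute ("Gdansk", without it, since the site's URLs use the ASCII form). If you want links matched by either rule, here is a small sketch combining both checks:

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0'}
r = requests.get('http://www.trojmiasto.pl', headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

for link in soup('a'):
    href = link.get('href') or ''
    # keep the link if either the visible text or the URL mentions the city
    if 'Gdańsk' in link.text or 'Gdansk' in href:
        print(link.text.strip(), href)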