I'm writing a Python script; the part below is where I'm having problems. It simply grabs the titles of the posts on a webpage.
Python does not understand the accents, and I've tried everything I know:
1 - putting this line at the top of the file: # -*- coding: utf-8 -*-
2 - calling .encode("utf-8") on the strings
Code:
# -*- coding: utf-8 -*-
import re
import requests

def opena(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
    lexdan1 = requests.get(url, headers=headers)
    lexdan2 = lexdan1.text
    lexdan1.close()
    return lexdan2

dan = []
a = opena('http://www.megafilmesonlinehd.com/filmes-lancamentos')
d = re.compile('<strong class="tt-filme">(.+?)</strong>').findall(a)
for name in d:
    name = name.encode("utf-8")
    dan.append(name)
print dan
This is what I got:
['Porta dos Fundos: Contrato Vital\xc3\xadcio HD 720p', 'Os 28 Homens de Panfilov Legendado HD', 'Estrelas Al\xc3\xa9m do Tempo Dublado', 'A Volta do Ju\xc3\xadzo Final Dublado Full HD 1080p', 'The Love Witch Legendado HD', 'Manchester \xc3\x80 Beira-Mar Legendado', 'Semana do P\xc3\xa2nico Dublado HD 720p', 'At\xc3\xa9 o \xc3\x9altimo Homem Legendado HD 720p', 'Arbor Demon Legendado HD 720p', 'Esquadr\xc3\xa3o de Elite Dublado Full HD 1080p', 'Ouija Origem do Mal Dublado Full HD 1080p', 'As Muitas Mulheres da Minha Vida Dublado HD 720p', 'Um Novo Desafio para Callan e sua Equipe Dublado Full HD 1080p', 'Terror Herdado Dublado DVDrip', 'Officer Downe Legendado HD', 'N\xc3\xa3o Bata Duas Vezes Legendado HD', 'Eu, Daniel Blake Legendado HD', 'Sangue Pela Gl\xc3\xb3ria Legendado', 'Quase 18 Legendado HD 720p', 'As Aventuras de Robinson Cruso\xc3\xa9 Dublado Full HD 1080p', 'Indigna\xc3\xa7\xc3\xa3o Dublado HD 720p']
Because you're telling the interpreter to print a list, the interpreter calls the list class's __str__ method. When you call a container's __str__ method, it uses the __repr__ method for each of the contained objects (in this case, str objects). The str type's __repr__ method doesn't convert the encoded characters, but its __str__ method (which gets called when you print an individual str object) does.
Here's a great question to help explain the difference:
Difference between __str__ and __repr__ in Python
If you print each string individually, you should get the results you want.
import re
import requests

def opena(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
    lexdan1 = requests.get(url, headers=headers)
    lexdan2 = lexdan1.text
    lexdan1.close()
    return lexdan2

dan = []
a = opena('http://www.megafilmesonlinehd.com/filmes-lancamentos')
d = re.compile('<strong class="tt-filme">(.+?)</strong>').findall(a)
for name in d:
    dan.append(name)

for item in dan:
    print item
When you print a list, whatever is inside it is represented (its __repr__ method is called) rather than printed (its __str__ method is called):
class test():
    def __repr__(self):
        print '__repr__'
        return ''
    def __str__(self):
        print '__str__'
        return ''
will get you:
>>> a = [test()]
>>> a
[__repr__
]
>>> print a
[__repr__
]
>>> print a[0]
__str__
And the __repr__ method of string does not convert special characters (not even \t or \n).
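Here is a minimal sketch of that difference with an accented string (Python 2, to match the code above; the byte values are just an illustration):

# 'Al\xc3\xa9m' is the UTF-8 byte sequence for "Além"
s = 'Al\xc3\xa9m'
print repr(s)  # -> 'Al\xc3\xa9m', which is what a list shows for its items
print s        # -> Além (on a UTF-8 terminal), which is what printing the item itself shows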
I'm trying to get all the data from all pages.
I used a counter, cast it to a string to put the page number into the URL, and then looped over that counter, but I always get the same result.
This is my code:
# Scraping job offers from the hellowork website
# import libraries
import random
import requests
import csv
from bs4 import BeautifulSoup
from datetime import date

# configure a Firefox-style user agent
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0",
    "Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0"
]
random_user_agent = random.choice(user_agents)
headers = {'User-Agent': random_user_agent}
Here is where I have used my counter:
i = 0
for i in range(1, 15):
    url = 'https://www.hellowork.com/fr-fr/emploi/recherche.html?p=' + str(i)
    print(url)
    page = requests.get(url, headers=headers)
    if (page.status_code == 200):
        soup = BeautifulSoup(page.text, 'html.parser')
        jobs = soup.findAll('div', class_=' new action crushed hoverable !tw-p-4 md:!tw-p-6 !tw-rounded-2xl')
        # config csv
        csvfile = open('jobList.csv', 'w+', newline='')
        row_list = []  # to append list of jobs
        try:
            writer = csv.writer(csvfile)
            writer.writerow(["ID", "Job Title", "Company Name", "Contract type", "Location", "Publish time", "Extract Date"])
            for job in jobs:
                id = job.get('id')
                jobtitle = job.find('h3', class_='!tw-mb-0').a.get_text()
                companyname = job.find('span', class_='tw-mr-2').get_text()
                contracttype = job.find('span', class_='tw-w-max').get_text()
                location = job.find('span', class_='tw-text-ellipsis tw-whitespace-nowrap tw-block tw-overflow-hidden 2xsOld:tw-max-w-[20ch]').get_text()
                publishtime = job.find('span', class_='md:tw-mt-0 tw-text-xsOld').get_text()
                extractdate = date.today()
                row_list = [[id, jobtitle, companyname, contracttype, location, publishtime, extractdate]]
                writer.writerows(row_list)
        finally:
            csvfile.close()
In newer code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
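For illustration, a minimal self-contained comparison of the two spellings plus select() (the HTML here is made up, purely to show the calls):

from bs4 import BeautifulSoup

html = '<div class="offer"><h3><a href="/job/1">Data Engineer</a></h3></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.findAll('h3'))             # legacy camelCase alias, still works
print(soup.find_all('h3'))            # documented snake_case name
print(soup.select('div.offer h3 a'))  # CSS selector route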
BeautifulSoup is not really needed here. You could get all of this information, and more, directly via the API using a mix of requests and pandas. Check all available information here:
https://www.hellowork.com/searchoffers/getsearchfacets?p=1
Example
import requests
import pandas as pd
from datetime import datetime

df = pd.concat(
    [
        pd.json_normalize(
            requests.get(f'https://www.hellowork.com/searchoffers/getsearchfacets?p={i}', headers={'user-agent': 'bond'}).json(),
            record_path=['Results']
        )[['ContractType', 'Localisation', 'OfferTitle', 'PublishDate', 'CompanyName']]
        for i in range(1, 15)
    ],
    ignore_index=True
)
df['extractdate'] = datetime.today().strftime('%Y-%m-%d')
df.to_csv('jobList.csv', index=False)
Output

|     | ContractType | Localisation           | OfferTitle                                                                | PublishDate             | CompanyName                     | extractdate |
| 0   | CDI          | Beaurepaire - 85       | Chef Gérant H/F                                                           | 2023-01-24T16:35:15.867 | Armonys Restauration - Morbihan | 2023-01-24  |
| 1   | CDI          | Saumur - 49            | Dessinateur Métallerie Débutant H/F                                       | 2023-01-24T16:35:14.677 | G2RH                            | 2023-01-24  |
| 2   | Franchise    | Villenave-d'Ornon - 33 | Courtier en Travaux de l'Habitat pour Particuliers et Professionnels H/F  | 2023-01-24T16:35:13.707 | Elysée Concept                  | 2023-01-24  |
| 3   | Franchise    | Montpellier - 34       | Courtier en Travaux de l'Habitat pour Particuliers et Professionnels H/F  | 2023-01-24T16:35:12.61  | Elysée Concept                  | 2023-01-24  |
| 4   | CDD          | Monaco                 | Spécialiste Senior Développement Matières Premières Cosmétique H/F        | 2023-01-24T16:35:06.64  | Expectra Monaco                 | 2023-01-24  |
| ... | ...          | ...                    | ...                                                                       | ...                     | ...                             | ...         |
| 275 | CDI          | Brétigny-sur-Orge - 91 | Magasinier - Cariste H/F                                                  | 2023-01-24T16:20:16.377 | DELPHARM                        | 2023-01-24  |
| 276 | CDI          | Lille - 59             | Technicien Helpdesk Français - Italien H/F                                | 2023-01-24T16:20:16.01  | Akkodis                         | 2023-01-24  |
| 277 | CDI          | Tours - 37             | Conducteur PL H/F                                                         | 2023-01-24T16:20:15.197 | Groupe Berto                    | 2023-01-24  |
| 278 | Franchise    | Nogent-le-Rotrou - 28  | Courtier en Travaux de l'Habitat pour Particuliers et Professionnels H/F  | 2023-01-24T16:20:12.29  | Elysée Concept                  | 2023-01-24  |
| 279 | CDI          | Cholet - 49            | Ingénieur Assurance Qualité H/F                                           | 2023-01-24T16:20:10.837 | Akkodis                         | 2023-01-24  |
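If you want to peek at what the endpoint returns before flattening it, here is a small sketch (same URL and 'Results' key as in the code above; the printed field names are simply whatever the API returns at the time):

import requests

# inspect one page of the endpoint before flattening it with pandas
rsp = requests.get('https://www.hellowork.com/searchoffers/getsearchfacets?p=1',
                   headers={'user-agent': 'bond'})
data = rsp.json()
print(len(data['Results']))        # number of offers on this page
print(sorted(data['Results'][0]))  # field names available per offer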
I am trying to scrape Google News with the gnews package. However, I don't know how to scrape older articles, for example articles from 2010.
from gnews import GNews
from newspaper import Article
import pandas as pd
import datetime
google_news = GNews(language='es', country='Argentina', period = '7d')
argentina_news = google_news.get_news('protesta clarin')
print(len(argentina_news))
This code works perfectly for recent articles, but I need older ones. I looked at https://github.com/ranahaani/GNews#todo, where something like the following appears:
google_news = GNews(language='es', country='Argentina', period='7d', start_date='01-01-2015', end_date='01-01-2016', max_results=10, exclude_websites=['yahoo.com', 'cnn.com'],
proxy=proxy)
but when I try start_date I get:
TypeError: __init__() got an unexpected keyword argument 'start_date'
Can anyone help me get articles for specific dates? Thank you very much, guys!
The example code doesn't work with gnews==0.2.7, which is the latest version you can install from PyPI via pip (or whichever installer you use). The documentation is for the unreleased mainline code that you can only get directly from their git source.
This is confirmed by inspecting the GNews.__init__ method: it has no keyword args for start_date or end_date:
In [1]: import gnews

In [2]: gnews.GNews.__init__??
Signature:
gnews.GNews.__init__(
    self,
    language='en',
    country='US',
    max_results=100,
    period=None,
    exclude_websites=None,
    proxy=None,
)
Docstring: Initialize self. See help(type(self)) for accurate signature.
Source:
    def __init__(self, language="en", country="US", max_results=100, period=None, exclude_websites=None, proxy=None):
        self.countries = tuple(AVAILABLE_COUNTRIES),
        self.languages = tuple(AVAILABLE_LANGUAGES),
        self._max_results = max_results
        self._language = language
        self._country = country
        self._period = period
        self._exclude_websites = exclude_websites if exclude_websites and isinstance(exclude_websites, list) else []
        self._proxy = {'http': proxy, 'https': proxy} if proxy else None
File: ~/src/news-test/.venv/lib/python3.10/site-packages/gnews/gnews.py
Type: function
If you want the start_date and end_date functionality, that was only added rather recently, so you will need to install the module off their git source.
# use whatever you use to uninstall any pre-existing gnews module
pip uninstall gnews
# install from the project's git main branch
pip install git+https://github.com/ranahaani/GNews.git
Now you can use the start/end functionality:
import datetime
import gnews

start = datetime.date(2015, 1, 15)
end = datetime.date(2015, 1, 16)
google_news = gnews.GNews(language='es', country='Argentina', start_date=start, end_date=end)
rsp = google_news.get_news("protesta")
print(rsp)
I get this as a result:
[{'title': 'Latin Roots: The Protest Music Of South America - NPR',
'description': 'Latin Roots: The Protest Music Of South America NPR',
'published date': 'Thu, 15 Jan 2015 08:00:00 GMT',
'url': 'https://www.npr.org/sections/world-cafe/2015/01/15/377491862/latin-roots-the-protest-music-of-south-america',
'publisher': {'href': 'https://www.npr.org', 'title': 'NPR'}}]
Also note:
period is ignored if you set start_date and end_date
Their documentation shows you can pass the dates as tuples like (2015, 1, 15). This doesn't seem to work - just be safe and pass a datetime object.
You can also use the Python requests module and XPath (via lxml) to get what you need, without any news-specific packages.
Here is a snapshot of the code:
import requests
from lxml.html import fromstring

url = 'https://www.google.com/search?q=google+news&&hl=es&sxsrf=ALiCzsZoYzwIP0ZR9d6LLa5U6IJ2WDo1sw%3A1660116293247&source=lnt&tbs=cdr%3A1%2Ccd_min%3A8%2F10%2F2010%2Ccd_max%3A8%2F10%2F2022&tbm=nws'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4758.87 Safari/537.36",
}
r = requests.get(url, headers=headers, timeout=30)
root = fromstring(r.text)
news = []
for i in root.xpath('//div[@class="xuvV6b BGxR7d"]'):
    item = {}
    item['title'] = i.xpath('.//div[@class="mCBkyc y355M ynAwRc MBeuO nDgy9d"]//text()')
    item['description'] = i.xpath('.//div[@class="GI74Re nDgy9d"]//text()')
    item['published date'] = i.xpath('.//div[@class="OSrXXb ZE0LJd"]//span/text()')
    item['url'] = i.xpath('.//a/@href')
    item['publisher'] = i.xpath('.//div[@class="CEMjEf NUnG9d"]//span/text()')
    news.append(item)
And here is what I get:
for i in news:
    print i
"""
{'published date': ['Hace 1 mes'], 'url': ['https://www.20minutos.es/noticia/5019464/0/google-news-regresa-a-espana-tras-ocho-anos-cerrado/'], 'publisher': ['20Minutos'], 'description': [u'"Google News ayuda a los lectores a encontrar noticias de fuentes \nfidedignas, desde los sitios web de noticias m\xe1s grandes del mundo hasta \nlas publicaciones...'], 'title': [u'Noticias de 20minutos en Google News: c\xf3mo seguir la \xfaltima ...']}
{'published date': ['14 jun 2022'], 'url': ['https://www.bbc.com/mundo/noticias-61803565'], 'publisher': ['BBC'], 'description': [u'C\xf3mo funciona LaMDA, el sistema de inteligencia artificial que "cobr\xf3 \nconciencia y siente" seg\xfan un ingeniero de Google. Alicia Hern\xe1ndez \n#por_puesto; BBC News...'], 'title': [u'C\xf3mo funciona LaMDA, el sistema de inteligencia artificial que "cobr\xf3 \nconciencia y siente" seg\xfan un ingeniero de Google']}
{'published date': ['24 mar 2022'], 'url': ['https://www.theguardian.com/world/2022/mar/24/russia-blocks-google-news-after-it-bans-ads-on-proukraine-invasion-content'], 'publisher': ['The Guardian'], 'description': [u'Russia has blocked Google News, accusing it of promoting \u201cinauthentic \ninformation\u201d about the invasion of Ukraine. The ban came just hours after \nGoogle...'], 'title': ['Russia blocks Google News after ad ban on content condoning Ukraine invasion']}
{'published date': ['2 feb 2021'], 'url': ['https://dircomfidencial.com/medios/google-news-showcase-que-es-y-como-funciona-el-agregador-por-el-que-los-medios-pueden-generar-ingresos-20210202-0401/'], 'publisher': ['Dircomfidencial'], 'description': [u'Google News Showcase: qu\xe9 es y c\xf3mo funciona el agregador por el que los \nmedios pueden generar ingresos. MEDIOS | 2 FEBRERO 2021 | ACTUALIZADO: 3 \nFEBRERO 2021 8...'], 'title': [u'Google News Showcase: qu\xe9 es y c\xf3mo funciona el ...']}
{'published date': ['4 nov 2021'], 'url': ['https://www.euronews.com/next/2021/11/04/google-news-returns-to-spain-after-the-country-adopts-new-eu-copyright-law'], 'publisher': ['Euronews'], 'description': ['News aggregator Google News will return to Spain following a change in \ncopyright law that allows online platforms to negotiate fees directly with \ncontent...'], 'title': ['Google News returns to Spain after the country adopts new EU copyright law']}
{'published date': ['27 may 2022'], 'url': ['https://indianexpress.com/article/technology/tech-news-technology/google-hit-with-fresh-uk-investigation-over-ad-tech-dominance-7938896/'], 'publisher': ['The Indian Express'], 'description': ['The Indian Express website has been rated GREEN for its credibility and \ntrustworthiness by Newsguard, a global service that rates news sources for \ntheir...'], 'title': ['Google hit with fresh UK investigation over ad tech dominance']}
{'published date': [u'Hace 1 d\xeda'], 'url': ['https://indianexpress.com/article/technology/tech-news-technology/google-down-outage-issues-user-error-8079170/'], 'publisher': ['The Indian Express'], 'description': ['The outage also impacted a range of other Google products such as Google \n... Join our Telegram channel (The Indian Express) for the latest news and \nupdates.'], 'title': ['Google, Google Maps and other services recover after global ...']}
{'published date': ['14 nov 2016'], 'url': ['https://www.reuters.com/article/us-alphabet-advertising-idUSKBN1392MM'], 'publisher': ['Reuters'], 'description': ["Google's move similarly does not address the issue of fake news or hoaxes \nappearing in Google search results. That happened in the last few days, \nwhen a search..."], 'title': ['Google, Facebook move to restrict ads on fake news sites']}
{'published date': ['27 sept 2021'], 'url': ['https://news.sky.com/story/googles-appeal-against-eu-record-3-8bn-fine-starts-today-as-us-cases-threaten-to-break-the-company-up-12413655'], 'publisher': ['Sky News'], 'description': ["Google's five-day appeal against the decision is being heard at European \n... told Sky News he expected there could be another appeal after the \nhearing in..."], 'title': [u"Google's appeal against EU record \xa33.8bn fine starts today, as US cases \nthreaten to break the company up"]}
{'published date': ['11 jun 2022'], 'url': ['https://www.washingtonpost.com/technology/2022/06/11/google-ai-lamda-blake-lemoine/'], 'publisher': ['The Washington Post'], 'description': [u"SAN FRANCISCO \u2014 Google engineer Blake Lemoine opened his laptop to the \ninterface for LaMDA, Google's artificially intelligent chatbot generator,..."], 'title': ["The Google engineer who thinks the company's AI has come ..."]}
"""
You can consider the parsel library, which is an XML/HTML parser with full XPath and CSS selector support.
Using this library, I want to demonstrate how to scrape Google News with pagination. One way is to use the start URL parameter, which is 0 by default: 0 means the first page, 10 the second, and so on.
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "protesta clarin",
    "hl": "es",    # language, es -> Spanish
    "gl": "AR",    # country of the search, AR -> Argentina
    "tbm": "nws",  # google news
    "start": 0,    # page offset: 0 = first page, 10 = second page, and so on
}
While the next-page button exists, you need to increment the ["start"] parameter value by 10 to access the next page; otherwise, break out of the while loop:
if selector.css('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
Also, make sure you send a user-agent request header so the visit looks like it comes from a "real" user. The default requests user-agent is python-requests, and websites can tell that such a request most likely comes from a script. Check what your user-agent is.
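A quick way to check both values yourself (httpbin.org is just a convenient echo service used here for illustration; it is not part of the original approach):

import requests

# the default user-agent that sites can flag as a bot
print(requests.utils.default_user_agent())  # e.g. 'python-requests/2.28.1'

# echo back whatever user-agent we send
r = requests.get("https://httpbin.org/headers",
                 headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
print(r.json()["headers"]["User-Agent"])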
Code and full example in online IDE:
from parsel import Selector
import requests

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "protesta clarin",
    "hl": "es",    # language, es -> Spanish
    "gl": "AR",    # country of the search, AR -> Argentina
    "tbm": "nws",  # google news
    "start": 0,    # page offset: 0 = first page, 10 = second page, and so on
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

page_num = 0
while True:
    page_num += 1
    print(f"{page_num} page:")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    selector = Selector(html.text)

    for result in selector.css(".WlydOe"):
        source = result.css(".NUnG9d span::text").get()
        title = result.css(".mCBkyc::text").get()
        link = result.attrib['href']
        snippet = result.css(".GI74Re::text").get()
        date = result.css(".ZE0LJd span::text").get()
        print(source, title, link, snippet, date, sep='\n', end='\n\n')

    if selector.css('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break
Output:
1 page:
Clarín
Protesta de productores rurales en Salliqueló ante la llegada ...
https://www.clarin.com/politica/protesta-productores-productores-rurales-salliquelo-llegada-presidente-ministro-economia_0_d6gT1lm0hy.html
Hace clic acá y te lo volvemos a enviar. Ya la active. Cancelar. Clarín.
Para comentar nuestras notas por favor completá los siguientes datos.
hace 9 horas
Clarín
Paro docente y polémica: por qué un conflicto en el sur afecta ...
https://www.clarin.com/sociedad/paro-docente-polemica-conflicto-sur-afecta-pais_0_7Kt1kRgL8F.html
Expertos en Educación explican a Clarín los límites –o la falta de ellos–
entre la protesta y el derecho a tener clases. Y por qué todos los...
hace 2 días
Clarín
El campo hace un paro en protesta por la presión impositiva y la falta de
gasoil
https://www.clarin.com/rural/protesta-presion-impositiva-falta-gasoil-campo-hara-paro-pais_0_z4VuXzllXP.html
Se viene la Rural de Palermo: precios, horarios y actividades de la
muestra. Newsletters Clarín Cosecha de noticias. Héctor Huergo trae lo
más...
Hace 1 mes
... other results from the 1st and subsequent pages.
31 page:
Ámbito Financiero
Hamburgo se blinda frente a protestas contra el G-20 y ...
http://www.ambito.com/888830-hamburgo-se-blinda-frente-a-protestas-contra-el-g-20-y-anticipa-tolerancia-cero
Fuerzas especiales alemanas frente al centro de convenciones donde se
desarrollará la cumbre. El ministro del Interior de Alemania, Thomas de...
4 jul 2017
Chequeado
Quién es Carlos Rosenkrantz, el nuevo presidente de la Corte ...
https://chequeado.com/el-explicador/quien-es-carlos-rosenkrantz-el-proximo-presidente-de-la-corte-suprema/
... entre otras, del Grupo Clarín y de Farmacity y Pegasus, las dos últimas
... más restrictiva del derecho a la protesta y a los cortes en la vía
pública.
3 nov 2018
ámbito.com
Echeverría: graves incidentes en protesta piquetera
https://www.ambito.com/informacion-general/echeverria-graves-incidentes-protesta-piquetera-n3594273
Una protesta de grupos piqueteros frente a la municipalidad de Esteban
Echeverría desató hoy serios incidentes que concluyeron con al menos...
20 nov 2009
... other results from 31st page.
If you need a more detailed explanation of scraping Google News, have a look at the Web Scraping Google News with Python blog post.
I'm trying to collect all the links by traversing the next pages of this website. My script below can parse the next-page links up to page 10. However, I can't go past the link shown as 10 at the bottom of that page.
I've tried with:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://www.icab.es'
link = 'https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc82c494d2c55856f1e25c06b4b6fcee5ddabebfe2d30057589a86e9750b459e9d60598cc6e5c52a4697030b2b8921f29f'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    p = 1
    while True:
        r = s.get(link)
        soup = BeautifulSoup(r.text, "lxml")
        """some data I can fetch myself from current pages, so ignore this portion"""
        p += 1
        next_page = soup.select_one(f"a[title='{p}']")
        if next_page:
            link = urljoin(base, next_page.get("href"))
            print("next page:", link)
        else:
            break
How can I get all the next page links from the website above?
PS: Selenium is not an option I would like to use.
You only need to follow the href of ">" (the "Següent" link) when (p-1) % 10 == 0, i.e. when the next page number is not shown in the pager; otherwise you can keep following the numbered links.
Code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://www.icab.es'
link = 'https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc82c494d2c55856f1e25c06b4b6fcee5ddabebfe2d30057589a86e9750b459e9d60598cc6e5c52a4697030b2b8921f29f'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    p = 1
    while True:
        r = s.get(link)
        soup = BeautifulSoup(r.text, "lxml")
        """some data I can fetch myself from current pages, so ignore this portion"""
        p += 1
        if not ((p - 1) % 10):
            # the pager only shows 10 numbers, so follow the "Següent" (») link
            next_page = soup.select_one("a[title='Següent']")
        else:
            next_page = soup.select_one(f"a[title='{p}']")
        if next_page:
            link = urljoin(base, next_page.get("href"))
            print("page", next_page.text, link)
        else:
            break  # no further page link found
Result (the page labelled » can be read as page 11):
D:\python37\python.exe E:/work/Compile/python/python_project/try.py
page 2 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64ddd1e9bee4bbc2e02ab2de1dfe88d7a8623a04a8617c3a28f3f17b03d0448cd1d399689a629d00f53e570c691ad10f5cfba28f6f9ee8d48ddb9b701d116d7c2a6d4ea403cef5d996fcb28a12b9f7778cd7521cfdf3d243cb2b1f3de9dfe304a10437e417f6c68df79efddd721f2ab8167085132c5e745958a3a859b9d9f04b63e402ec6e8ae29bee9f4791fed51e5758ae33460e9a12b6d73f791fd118c0c95180539f1db11c86a7ab97b31f94fb84334dce6867d519873cc3b80e182ff0b778
page 3 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64cc111efb5ef5c42ecde14350e1f5a2e0e8db36a9258d5c95496c2450f44ccd8c4074edb392b09c03e136988ff15e707fa0e01d1ee6c3198a166e2677b4b418e0b07cafd4d98a19364077e7ed2ea0341001481d8b9622a969a524a487e7d69f6b571f2cb03c2277ecd858c68a7848a0995c1c0e873d705a72661b69ab39b253bb775bc6f7f6ae3df2028114735a04dcb8043775e73420cb40c4a5eccb727438ea225b582830ce84eb959753ded1b3eb57a14b283c282caa7ad04626be8320b4ab
page 4 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64d8e9a9d04523d43bfb106098548163bfec74e190632187d135f2a0949b334acad719ad7c326481a43dfc6f966eb038e0a5a178968601ad0681d586ffc8ec21e414628f96755116e65b7962dfcf3a227fc1053d17701937d4f747b94c273ce8b9ccec178386585075c17a4cb483c45b85c1209329d1251767b8a0b4fa29969cf6ad42c7b04fcc1e64b9defd528753677f56e081e75c1cbc81d1f4cc93adbde29d06388474671abbab246160d0b3f03a17d1db2c6cd6c6d7a243d872e353200a35
page 5 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c643ba4bcf6634af323cf239c7ccf7eca827c3a245352a03532a91c0ced15db81dcfc52b6dfa69853a68cb320e29ca25e25fac3da0e85667145375c3fa1541d80b1b056c03c02400220223ad5766bd1a4824171188fd85a5412b59bd48fe604451cbd56d763be67b50e474befa78340d625d222f1bb6b337d8d2b335d1aa7d0374b1be2372e77948f22a073e5e8153c32202a219ed2ef3f695b5b0040ded1ca9c4a03462b5937182c004a1a425725d3d20a10b41fd215d551abf10ef5e8a76ace4f
page 6 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64418cdf5c38c01a1ac019cc46242eb9ba25f012f2e4bee18a2864a19dde58d6ee2ae93254aff239c70b7019526af1a435e0e89a7c81dc4842e365163d8f9e571ae4fc8b0fc7455f573abee020e21207a604f3d6b7c2015c300a7b1dbc75980b435bb1904535bed2610771fee5e3338a79fad6d024ec2684561c3376463b2cacc00a99659918b41a12c92233bca3eaa1e003dbb0a094b787244ef3c33688b4382f89ad64a92fa8b738dd810b6e32a087564a8db2422c5b2013e9103b1b57b4248d
page 7 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64f96d66b04d442c09e3b891f2a5f3fb235c1aa2786face046665066db9a63e7ca4523e5cf28f4f17898204642a7d5ef3f8474ecd5bf58b252944d148467b495ad2450ea157ce8f606d4b9a6bc2ac04bec3a666757eac42cbea0737e8191b0d375425e11c76990c82246cfb9cbe94daa46942d824ff9f144f6b51c768b50c3e35acfa81e5ebf95bcb5200f5b505595908907c99b8d59893724eb16d5694d36cd30d8a15744af616675b2d6a447c10b79ca04814aece8c7ab4d878b7955cd5cd9ef
page 8 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64d1c210208efbd4630d44a5a92d37e5eabccba6abf83c2584404a24d08e0ad738be3598a03bbec8975275b06378cc5f7c55a9b700eb5bd4ee243a3c30f781512c0ebd23800890cb150621caab21a7a879639331b369d92bb9668815465f5d3b6c061daa011784909fc09af75ab705612ba504b4c268b43f8a029e840b8c69531423e8b5e8fe91d7cc628c309ffb633e233932b7c1b57c5cf0a2f2f47618bca4837ce355f34ae154565b447cfffcecb66458d19e5e5f3547f6916cd1c30baec1a7
page 9 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c6415c187c4ac2cf9d4c982984e1b56faf34a31272a90b11557d1651ad92b01a2ecd3c719cfe78863f99e31b0fc1b6bc7b09e1e0e585ebdc0b04fc9dca8744bb66e8af86d65b39827f1265a82aea0286376456ccfa9cce638d72c494db2391127979eed3d349d725f2e60e2629512c388738fc26b1c9f16a2b478862469835474b305f1300c0aa53c2c4033e4b0967a542079915e30bb18418eb79a47a292ed835dd54689c1fd9ceda898678e7114fa95d559b55367e6f7f9d1ce3fb5ebb5d479c5
page 10 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c644ab59a0b943deffee8845c2521afef0ea3ff2d96cc2b65a071da1201603b54b15b5c4363e92285c60dffd0e893ba6a58ff528fb3278db8e746697dc8712936a560a3da8085e3dcab05949afecddaced326332986240624575c6b7f104182a8c57718ec62e728d8eaa886a611ad55e0d3dd0c1ba59b47cf89d1bd5b000f9fbc5bd7d6310742a53eedfa44383d62145c28ebcf9f180ca49a3616fcfaf7ecaaa0b2f7183fc1d10d18e0062613e73f9077d11a1dfaf044990c200ac10aac4f7cb332
page » https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64ff2c69157ff5cf4b8ccbc2674205f4fb3048dc10db0c7cb36c42fbc59aaa972b9fab70578ff58757fae7a1f1ca17076dfddb919cf92389ba66c8de7f6ea9ec08277b0228f8bd14ea82409ff7e5a051ea58940736b475c6f75c7eba096b711812ed5b6b8454ec11145b0ce10191a38068c6ca7e7c64a86b4c71819d55b3ab34233e9887c7bfa05f9f8bc488cb0986fb2680b8cb9278a437e7c91c7b9d15426e159c30c6c2351ed300925ef1b24bbf2dbf60cf9dea935d179235ed46640d2b0b54
page 12 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64346907383e54eae9d772c10d3600822205ff9b81665ff0f58fd876b4e0d9aeb6e0271904c5251d9cf6eb1fdd1ea16f8ea3f42ad3db66678bc538c444e0e5e4064946826aaf85746b3f87fb436d83a8eb6d6590c25dc7f208a16c1db7307921d79269591e036fed1ec78ec7351227f925a32d4d08442b9fd65b02f6ef247ca5f713e4faffe994bf26a14c2cb21268737bc2bc92bb41b3e3aaa05de10da4e38de3ab725adb5560eee7575cdf6d51d59870efacc1b9553609ae1e16ea25e6d6e9e6
page 13 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64afc9149ba3dadd6054f6d8629d1c750431a15f9c4048195cfc2823f61f6cfd1f2e4f78eb835829db8e7c88279bf3a38788d8feaf5327f1b42d863bba24d893ea5e033510dc2e0579474ac7efcc1915438eacb83f2a3b5416e64e3beb726d721eb79f55082be0371414ccd132e95cd53339cf7a8d6ec15b72595bf87107d082c9db7bba6cf45b8cfe7a9352abe2f289ae8591afcfd78e17486c25e94ea57c00e290613a18a8b991def7e1cd4cae517a4ee1b744036336fbc68b657cd33cc4c949
I had problems with SSL, so I changed the default ssl_context for this site:
import ssl
import requests
import requests.adapters
from bs4 import BeautifulSoup

# adapted from https://stackoverflow.com/questions/42981429/ssl-failure-on-windows-using-python-requests/50215614
class SSLContextAdapter(requests.adapters.HTTPAdapter):
    def init_poolmanager(self, *args, **kwargs):
        ssl_context = ssl.create_default_context()
        # Sets up old and insecure TLSv1.
        ssl_context.options &= ~ssl.OP_NO_TLSv1_3 & ~ssl.OP_NO_TLSv1_2 & ~ssl.OP_NO_TLSv1_1
        ssl_context.minimum_version = ssl.TLSVersion.TLSv1
        kwargs['ssl_context'] = ssl_context
        return super(SSLContextAdapter, self).init_poolmanager(*args, **kwargs)
base = 'https://www.icab.es'
link = 'https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc82c494d2c55856f1e25c06b4b6fcee5ddabebfe2d30057589a86e9750b459e9d60598cc6e5c52a4697030b2b8921f29f'

with requests.session() as s:
    s.mount('https://www.icab.es', SSLContextAdapter())
    p = 1
    while True:
        print('Page {}..'.format(p))

        # r = urllib.request.urlopen(link, context=ssl_context)
        r = s.get(link)
        soup = BeautifulSoup(r.content, "lxml")

        for li in soup.select('li.principal'):
            print(li.get_text(strip=True))

        p += 1
        link = soup.select_one('a[title="{}"]'.format(p))
        if not link:
            link = soup.select_one('a[title="Següent"]')
        if not link:
            break
        link = base + link['href']
Prints:
Page 1..
Sr./Sra. Martínez Gòmez, Marc
Sr./Sra. Eguinoa de San Roman, Roman
Sr./Sra. Morales Santiago, Maria Victoria
Sr./Sra. Bengoa Tortajada, Javier
Sr./Sra. Moralo Rodríguez, Xavier
Sr./Sra. Romagosa Huerta, Marta
Sr./Sra. Peña Moncho, Juan
Sr./Sra. Piñana Morera, Roman
Sr./Sra. Millán Sánchez, Antonio
Sr./Sra. Martínez Mira, Manel
Sr./Sra. Montserrat Rincón, Anna
Sr./Sra. Fernández Paricio, Maria Teresa
Sr./Sra. Ruiz Macián- Dagnino, Claudia
Sr./Sra. Barba Ausejo, Pablo
Sr./Sra. Bruna de Quixano, Jose Luis
Sr./Sra. Folch Estrada, Fernando
Sr./Sra. Gracia Castellón, Sonia
Sr./Sra. Sales Valls, Gemma Elena
Sr./Sra. Pastor Giménez-Salinas, Adolfo
Sr./Sra. Font Jané, Àlvar
Sr./Sra. García González, Susana
Sr./Sra. Garcia-Tornel Florensa, Xavier
Sr./Sra. Marín Granados, Alejandra
Sr./Sra. Albero Jové, José María
Sr./Sra. Galcerà Margalef, Montserrat
Page 2..
Sr./Sra. Chimenos Minguella, Sergi
Sr./Sra. Lacasta Casado, Ramón
Sr./Sra. Alcay Morandeira, Carlos
Sr./Sra. Ribó Massó, Ignacio
Sr./Sra. Fitó Baucells, Antoni
Sr./Sra. Paredes Batalla, Patricia
Sr./Sra. Prats Viñas, Francesc
Sr./Sra. Correig Ferré, Gerard
Sr./Sra. Subirana Freixas, Alba
Sr./Sra. Álvarez Crexells, Juan
Sr./Sra. Glaser Woloschin, Joan Nicolás
Sr./Sra. Nel-lo Padro, Francesc Xavier
Sr./Sra. Oliveras Dalmau, Rosa Maria
Sr./Sra. Badia Piqué, Montserrat
Sr./Sra. Fuentes-Lojo Rius, Alejandro
Sr./Sra. Argemí Delpuy, Marc
Sr./Sra. Espinoza Carrizosa, Pina
Sr./Sra. Ges Clot, Carla
Sr./Sra. Antón Tuneu, Beatriz
Sr./Sra. Schroder Vilalta, Andrea
Sr./Sra. Belibov, Mariana
Sr./Sra. Sole Lopez, Silvia
Sr./Sra. Reina Pardo, Luis
Sr./Sra. Cardenal Lagos, Manel Josep
Sr./Sra. Bru Galiana, David
...and so on.
I'm trying to write some code to extract data from transfermarkt (link here for the page I'm using). I'm stuck trying to print the clubs. I've figured out that I need to access the h2 and then the a element in order to get just the text. The HTML is below:
<div class="table-header" id="to-349"><a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018"><img src="https://tmssl.akamaized.net/images/wappen/small/349.png?lm=1574162298" title=" " alt="Barnsley FC" class="" /></a><h2><a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">Barnsley FC</a></h2></div>
As you can see, if I just try find_all("a", {"class": "vereinprofil_tooltip"}) it doesn't work properly, as it also returns the image link, which has no plain text. But if I could search for the h2 first and then run find_all("a", {"class": "vereinprofil_tooltip"}) within the returned h2, it would get me what I want. My code is below.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/league-one/transfers/wettbewerb/GB3/plus/?saison_id=2018&s_w=&leihe=1&intern=0&intern=1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
#Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Clubs = pageSoup.find_all("h2")
Club = Clubs.find("a", {"class": "vereinprofil_tooltip"})
print(Club)
I get this error in getattr:
raise AttributeError(
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I know what the error means but I've been going round in circles trying to find a way of actually doing it properly and getting what I want. Any help is appreciated.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.transfermarkt.co.uk/league-one/transfers/wettbewerb/GB3/plus/?saison_id=2018&s_w=&leihe=1&intern=0&intern=1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

#Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Clubs = pageSoup.find_all("h2")
print(type(Clubs))  # this can be removed, but I left it to expose how I figured this out

for club in Clubs:
    print(club.text)
Basically: Clubs is a list (technically, a ResultSet, but the behavior is very similar), you need to iterate it as such. .text gives just the text, other attributes could be retrieved as well.
Output looks like:
Transfer record 18/19
Barnsley FC
Burton Albion
Sunderland AFC
Shrewsbury Town
Scunthorpe United
Charlton Athletic
Plymouth Argyle
Portsmouth FC
Peterborough United
Southend United
Bradford City
Blackpool FC
Bristol Rovers
Fleetwood Town
Doncaster Rovers
Oxford United
Gillingham FC
AFC Wimbledon
Walsall FC
Rochdale AFC
Accrington Stanley
Luton Town
Wycombe Wanderers
Coventry City
Transfer record 18/19
There are, however, a bunch of blank lines (i.e., .text was '') that you should probably handle as well.
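One way to handle that, sketched as a continuation of the loop above, which also pulls the club link out of each <h2> via the same vereinprofil_tooltip class from the question's HTML:

for club in Clubs:
    a = club.find("a", {"class": "vereinprofil_tooltip"})
    if a and a.get_text(strip=True):  # skip headers without a club link or with empty text
        print(a.get_text(strip=True), a.get("href"))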
My guess is you might mean findAll instead of find_all. I tried the code below and it works:
content = """<div class="table-header" id="to-349">
<a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">
<img src="https://tmssl.akamaized.net/images/wappen/small/349.png?lm=1574162298" title=" " alt="Barnsley FC" class="" />
</a>
<h2>
<a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">
Barnsley FC
</a>
</h2>
</div>"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')

# get main_box
main_box = soup.findAll('a', {'class': 'vereinprofil_tooltip'})
# print(main_box)

for main_text in main_box:      # loop through the list
    if main_text.text.strip():  # only elements with body text
        print(main_text.text.strip())
Output is:
Barnsley FC
I'll edit this with a reference to the documentation about findAll; I can't remember it off the top of my head.
Edit:
I looked at the documentation, and it turns out find_all = findAll:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
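As a quick sanity check, the two names really do point at the same method (findAll is kept as a backwards-compatible alias):

from bs4 import BeautifulSoup

print(BeautifulSoup.findAll is BeautifulSoup.find_all)  # True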
now I feel dumb lol
I cannot parse Google search results:
# assuming the usual aliases for urllib.request and BeautifulSoup
import urllib.request as ur
from bs4 import BeautifulSoup as bs

def extracter(url, key, change):
    if " " in key:
        key = key.replace(" ", str(change))
    url = url + str(key)
    response = ur.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    sauce = ur.urlopen(response).read()
    soup = bs(sauce, "html.parser")
    return soup

def google(keyword):
    soup = extracter("https://www.google.com/search?q=", str(keyword), "+")
    search_result = soup.findAll("h3", attrs={"class": "LC20lb"})
    print(search_result)

google("tony stark")
Output:
[]
I simply changed the headers and it worked:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
Result:
[<h3 class="LC20lb"><span dir="ltr">Tony Stark (Marvel Cinematic Universe) - Wikipedia</span></h3>, <h3 class="LC20lb"><span dir="ltr">Tony Stark / Iron Man - Wikipedia</span></h3>, <h3 class="LC20lb"><span dir="ltr">Iron Man | Marvel Cinematic Universe Wiki | FANDOM ...</span></h3>, <h3 class="LC20lb"><span dir="ltr">Tony Stark (Earth-199999) | Iron Man Wiki | FANDOM ...</span></h3>, <h3 class="LC20lb"><span dir="ltr">Is Tony Stark Alive As AI? Marvel Fans Say Tony Stark ...</span></h3>, <h3 class="LC20lb"><span dir="ltr">'Avengers: Endgame' Might Not Have Been the End of Tony ...</span></h3>, <h3 class="LC20lb"><span dir="ltr">Robert Downey Jr to RETURN to MCU as AI Tony Stark - ...</span></h3>, <h3 class="LC20lb"><span dir="ltr">Avengers Endgame theory: Tony Stark is backed up as AI ...</span></h3>]
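For completeness, here is a sketch of how that header slots into the question's extracter() (assuming the same urllib.request and BeautifulSoup aliases as in the question):

import urllib.request as ur
from bs4 import BeautifulSoup as bs

# browser-like user agent instead of the bare 'Mozilla/5.0'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}

def extracter(url, key, change):
    if " " in key:
        key = key.replace(" ", str(change))
    request = ur.Request(url + str(key), headers=headers)
    sauce = ur.urlopen(request).read()
    return bs(sauce, "html.parser")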