Wrong encoding with Python BeautifulSoup + MySQL

I'm working with the BeautifulSoup Python library.
I used the urllib2 library to download the HTML code from a page, and then parsed it with BeautifulSoup.
I want to save some of the HTML content in a MySQL table, but I'm having problems with the encoding. The MySQL table uses the 'utf-8' charset.
Some examples:
When I download the HTML code and parse it with BeautifulSoup, I get something like:
"Ver las \xc3\xbaltimas noticias. Ent\xc3\xa9rate de las noticias de \xc3\xbaltima hora con la mejor cobertura con fotos y videos"
The correct text would be:
"Ver las últimas noticias. Entérate de las noticias de última hora con la mejor cobertura con fotos y videos"
I have tried encoding and decoding that text with multiple charsets, but when I insert it into MySQL I get something like:
"Ver las últimas noticias y todos los titulares de hoy en Yahoo! Noticias Argentina. Entérate de las noticias de última hora con la mejor cobertura con fotos y videos"
I'm having problems with the encoding, but I don't know how to solve them.
Any suggestions?

You have correct UTF-8 data coming out of BeautifulSoup, but it's stored in a plain byte string, not Python's native unicode string type. I think this is what you need to do:
codecs.decode(your_string, 'utf-8')
The result will then be the proper data type and encoding to send to MySQL.
An example:
>>> codecs.decode("Ver las \xc3\xbaltimas noticias. Ent\xc3\xa9rate de las noticias de \xc3\xbaltima hora con la mejor cobertura con fotos y videos", 'utf-8')
u'Ver las \xfaltimas noticias. Ent\xe9rate de las noticias de \xfaltima hora con la mejor cobertura con fotos y videos'
>>> print _
Ver las últimas noticias. Entérate de las noticias de última hora con la mejor cobertura con fotos y videos
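For the MySQL side, the connection itself must also be told to use UTF-8. A minimal sketch, assuming the pymysql driver; the table and column names are hypothetical, not from the question:
# Minimal sketch (Python 2, as in the question): decode the scraped bytes,
# then insert over a UTF-8 connection. pymysql is one possible driver;
# the noticias table and descripcion column are hypothetical.
import codecs
import pymysql

conn = pymysql.connect(host='localhost', user='user', password='secret',
                       db='mydb', charset='utf8mb4')

raw = "Ver las \xc3\xbaltimas noticias. Ent\xc3\xa9rate de las noticias de \xc3\xbaltima hora"
text = codecs.decode(raw, 'utf-8')  # byte string -> unicode

with conn.cursor() as cur:
    # Parameterized query: the driver encodes the unicode value itself.
    cur.execute("INSERT INTO noticias (descripcion) VALUES (%s)", (text,))
conn.commit()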

BeautifulSoup returns all data as unicode strings. First, triple-check that the unicode strings are correct. If they aren't, there is some issue with the encoding of the input data.
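A quick way to do that check, sketched for Python 2, where the str/unicode distinction matters:
# If BeautifulSoup gave you unicode, repr shows escaped code points like u'\xfa'.
text = soup.get_text()
print type(text)   # <type 'unicode'> means you're fine
print repr(text)   # u'Ver las \xfaltimas noticias...' is correct data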


Can't get text without tag using Selenium Python

First of all, I'll show the HTML that I'm having problems with, in order to better explain myself.
<div class="archivos"> ... </div>
<br>
<br>
<br>
<br>
THIS IS THE TEXT THAT I WANT TO CHECK
<div class="archivos"> ... </div>
...
I'm using Selenium in Python.
So, this is a piece of the HTML that I'm working with. My objective is: inside each div with class="archivos" there's a link that I want to click, but to decide whether to click it I first need to analyze the text above the div.
The problem is that the text has no tag around it, and I can't seem to find a way to extract it so I can search it for the information I want. The text changes every time, so I need to locate the possible texts preceding every class="archivos" div.
So far I've tried a lot of ways to find it, mainly using XPath, trying to get to the element preceding the div. I haven't come up with anything that works yet, as I'm not very experienced with Selenium and XPath.
I've found this https://chercher.tech/python/relative-xpath-selenium-python, which helped me try some XPaths, and several answers here on SO, but to no avail.
I've read somewhere that I can run JavaScript code from Python through Selenium to get it, but I don't know JavaScript and don't know how to do it. Maybe somebody understands what I'm talking about.
This is the webpage if it helps: http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERLST&DOCS=1-200&BASE=BOLE&SEC=FIRMA&SEPARADOR=&PUBL=20200901
Thanks in advance for the help, and I'll provide any further information if it's needed.
Here is an example of how to extract the preceding text with BeautifulSoup. I loaded the page with the requests module, but you can feed the HTML source from Selenium to BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = 'http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERLST&DOCS=1-200&BASE=BOLE&SEC=FIRMA&SEPARADOR=&PUBL=20200901'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for t in soup.select('.archivos'):
    previous_text = t.find_previous(text=True).strip()
    link = t.a['href']
    print(previous_text)
    print('http://www.boa.aragon.es' + link)
    print('-' * 80)
Prints:
ORDEN HAP/804/2020, de 17 de agosto, por la que se modifica la Relación de Puestos de Trabajo de los Departamentos de Industria, Competitividad y Desarrollo Empresarial y de Economía, Planificación y Empleo.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=1&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
ORDEN HAP/805/2020, de 17 de agosto, por la que se modifica la Relación de Puestos de Trabajo del Departamento de Agricultura, Ganadería y Medio Ambiente.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=2&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
ORDEN HAP/806/2020, de 17 de agosto, por la que se modifica la Relación de Puestos de Trabajo del Organismo Autónomo Instituto Aragonés de Servicios Sociales.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=3&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
ORDEN ECD/807/2020, de 24 de agosto, por la que se aprueba el expediente relativo al procedimiento selectivo de acceso al Cuerpo de Catedráticos de Música y Artes Escénicas.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=4&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
RESOLUCIÓN de 28 de julio de 2020, de la Dirección General de Justicia, por la que se convocan a concurso de traslado plazas vacantes entre funcionarios de los Cuerpos y Escalas de Gestión Procesal y Administrativa, Tramitación Procesal y
Administrativa y Auxilio Judicial de la Administración de Justicia.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=5&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
...and so on.
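If you want to stay in Selenium for the click itself, one possible approach (a sketch, assuming the div order is the same in BeautifulSoup's parse and the live DOM; the XPath is an assumption about the page structure, not tested against the live page) is to analyze driver.page_source with BeautifulSoup and use Selenium only for the interaction:
# Sketch: analyze the page with BeautifulSoup, click with Selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERLST&DOCS=1-200'
           '&BASE=BOLE&SEC=FIRMA&SEPARADOR=&PUBL=20200901')

soup = BeautifulSoup(driver.page_source, 'html.parser')
for i, t in enumerate(soup.select('.archivos'), start=1):
    previous_text = t.find_previous(text=True).strip()
    if 'RESOLUCIÓN' in previous_text:  # your own decision logic goes here
        # Click the link inside the i-th matching div via Selenium.
        driver.find_element(By.XPATH, '(//div[@class="archivos"])[%d]//a' % i).click()
        break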

Python Encode Ã³ as ó

I have a string like
'La empresa de capitales mixtos que opera el predio de residuos,
Ceamse, aclarÃ³ este martes que la responsabilidad del
desentendimiento con los recicladores informales que provocÃ³ un
nuevo bloqueo y hace peligrar la recolecciÃ³n'
and I need this
'La empresa de capitales mixtos que opera el predio de residuos,
Ceamse, aclaró este martes que la responsabilidad del
desentendimiento con los recicladores informales que provocó un
nuevo bloqueo y hace peligrar la recolección'
How can I do this with Python?
Thanks!
You need to fix your web-scraping script!
It looks like La Capital sends proper HTTP headers and HTML head information, and the content is UTF-8 encoded. So your script needs to handle that, and everything will work fine.
I know from experience that requests.get and BeautifulSoup 4 both handle Unicode well, so just debug your script and see where it goes wrong. Check the raw input, check whether you need the response's .content or .text, and fix it accordingly.
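That said, text that has already been mangled this way (UTF-8 bytes mistakenly decoded as Latin-1) can usually be repaired by reversing the wrong step. A small sketch; the real fix still belongs in the scraping script:
# 'ó' encoded as UTF-8 is the two bytes 0xC3 0xB3; decoded as Latin-1 they
# render as 'Ã³'. Undoing the wrong decode recovers the original text.
broken = u'Ceamse, aclar\xc3\xb3 este martes que provoc\xc3\xb3 un nuevo bloqueo'
fixed = broken.encode('latin-1').decode('utf-8')
print(fixed)  # Ceamse, aclaró este martes que provocó un nuevo bloqueo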

Yet another encoding issue with accented characters (scraping a Website with Python and BeautifulSoup)

(PREFACE: I know, this problem has been discussed hundreds of times, but I still don't understand it)
I am trying to load an HTML page and output the text. Even though I am getting the webpage correctly, BeautifulSoup somehow destroys the encoding of accented characters that are not part of the first 127 ASCII characters:
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup

url = "http://www.rtve.es/alacarta/interno/contenttable.shtml?ctx=29010&locale=es&module=&orderCriteria=DESC&pageSize=15&mode=TEXT&seasonFilter=40015"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
div = soup.find_all("span", class_="detalle")
# doesn't work: capitulo_detalle ends up as type str with utf-8, while div[0].text is type unicode
capitulo_detalle = div[0].text
Output of div[0].text should be something like:
Sátur se dirige al sur en busca de Estuarda y Gabi, pero un compañero de viaje inesperado hará que cambie de rumbo. Los hombres de Juan siguen presos. El enemigo comienza a realizar ejecuciones. Águila Roja tiene...
But the result I get is:
u'S\xe1tur se dirige al sur en busca de Estuarda y Gabi, pero un compa\xf1ero de
viaje inesperado har\xe1 que cambie de rumbo. Los hombres de Juan siguen presos
. El enemigo comienza a realizar ejecuciones. \xc1guila Roja tiene...'
--> What do I have to change to get the 'right' characters?
I know it must be a duplicate of these questions, but the answers don't seem to work here:
Python and BeautifulSoup encoding issues
How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?
I have also read the usual documentation about Unicode, UTF-8, and ASCII, e.g. https://docs.python.org/3/howto/unicode.html, obviously without success...
I believe I finally got it...
>>> div = soup.find("span", class_="detalle")
>>> div.text
u'S\xe1tur se dirige al sur en busca de Estuarda y Gabi, pero
---> this is unicode, \xe1 is the 'code' for 'á' (http://www.utf8-chartable.de/unicode-utf8-table.pl?start=4096&number=128&names=-&utf8=string-literal)
>>> print(div.text)
Sátur se dirige al sur en busca de Estuarda y Gabi, pero
---> 'print' evaluates the unicode code point correctly
>>> div.text.encode('utf-8')
'S\xc3\xa1tur se dirige al sur en busca de Estuarda y Gabi, pero
---> The unicode string is encoded to UTF-8 according to the table given at the URL cited above. The output shows \xc3\xa1 instead of 'á' because the shell displays the repr of the byte string, with non-ASCII bytes escaped.
>>> print div.text.encode('utf-8')
Sátur se dirige al sur en busca de Estuarda y Gabi, pero
---> ...and print writes the raw UTF-8 bytes to a console that expects cp850, which renders them as strange symbols.
>>> blurr = div.text.encode('cp850')
>>> blurr
'S\xa0tur se dirige al sur en busca de Estuarda y Gabi, pero
>>> type(blurr)
<type 'str'>
---> Unicode encoded to codepage 850, used within the python-shell under Windows
>>> print(blurr)
Sátur se dirige al sur en busca de Estuarda y Gabi, pero
---> Finally, it's right !!!
In Kodi I can use the UTF-8 representation, so that e.g. the character 'á' is saved within the variable as \xc3\xa1, but when the content of the variable is displayed, for example with xbmcgui.Dialog().ok(addonname, blurr), it is shown correctly on the screen with an 'á'......
And you're just supposed to know this stuff......
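In short, what print does with a byte string depends entirely on the console's encoding. A quick sketch of how to check which encoding your shell expects, and why encode('cp850') was the right call above:
import sys

print(sys.stdout.encoding)  # e.g. 'cp850' in an old Windows console, 'utf-8' elsewhere
# A unicode string encoded to the console's own codepage prints correctly;
# bytes in any other encoding are rendered as whatever the console decodes them to.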
import requests
from bs4 import BeautifulSoup

url = "http://www.rtve.es/alacarta/interno/contenttable.shtml?ctx=29010&locale=es&module=&orderCriteria=DESC&pageSize=15&mode=TEXT&seasonFilter=40015"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'lxml')
div = soup.find("span", class_="detalle")
capitulo_detalle = div.text
Output:
'Sátur se dirige al sur en busca de Estuarda y Gabi, pero un compañero de viaje inesperado hará que cambie de rumbo. Los hombres de Juan siguen presos. El enemigo comienza a realizar ejecuciones. Águila Roja tiene...'
Use requests and Python 3, and the problem will never show up.

Python Requests and Unicode

I am using the requests library to query the Diffbot API to get the contents of an article from a web page URL. When I visit a request URL that I create in my browser, it returns a JSON object with the text in Unicode (right?), for example (I shortened the text somewhat):
{"icon":"http://mexico.cnn.com/images/ico_mobile.jpg","text":"CIUDAD
DE MÉXICO (CNNMéxico) \u2014 Kassandra Guazo Cano tiene 32 años, pero
este domingo participó por primera vez en una elección.\n\"No había
sacado mi (credencial del) IFE (Instituto Federal Electoral) porque al
hacer el trámite hay mucha mofa cuando ven que tu nombre no coincide
con tu y otros documentos de acuerdo con su nueva identidad.\nSánchez
dice que los solicitantes no son discriminados, pero la experiencia de
Kassanda es diferente: \"hay que pagar un licenciado, dos peritos
(entre ellos un endocrinólogo). Además, el juez dicta sentencia para
el cambio de nombre y si no es favorable tienes que esperar otros
cuatro años para volver a demandar al registro civil\".\nAnte esta
situación, el Consejo para Prevenir y Eliminar la sculina, los
transgénero votan - México: Voto 2012 -
Nacional","url":"http://mexico.cnn.com/nacional/2012/07/02/con-apariencia-de-mujer-e-identidad-masculina-los-transexuales-votan","xpath":"/HTML[1]/BODY[1]/SECTION[5]/DIV[1]/ARTICLE[1]/DIV[1]/DIV[6]"}
When I use the Python requests library as follows:
def get_article(self, params={}):
    api_endpoint = 'http://www.diffbot.com/api/article'
    params.update({
        'token': self.dev_token,
        'format': self.output_format,
    })
    req = requests.get(api_endpoint, params=params)
    return json.loads(req.content)
It returns this (again note that I shortened the text somewhat):
{u'url':
u'http://mexico.cnn.com/nacional/2012/07/02/con-apariencia-de-mujer-e-identidad-masculina-los-transexuales-votan',
u'text': u'CIUDAD DE M\xc9XICO (CNNM\xe9xico) \u2014 Kassandra Guazo
Cano tiene 32 a\xf1os, pero este domingo particip\xf3 por primera vez
en una elecci\xf3n.\n"No hab\xeda sacado mi (credencial del) IFE
(Instituto Federal Electoral) porque al hacOyuky Mart\xednez Col\xedn,
tambi\xe9n transg\xe9nero, y que estaba acompa\xf1ada de sus dos hijos
y su mam\xe1.\nAmbas trabajan como activistas en el Centro de Apoyo a
las Identidades Trans, A.C., donde participan en una campa\xf1a de
prevenci\xf3n de enfermedades sexuales.\n"Quisi\xe9ramos que no solo
nos vean como trabajadoras sexuales o estilistas, sino que luchamos
por nuestros derechos", dice Kassandra mientras sonr\xede, sostiene su
credencial de elector y levanta su pulgar entintado.', u'title': u'Con
apariencia de mujer e identidad masculina, los transg\xe9nero votan -
M\xe9xico: Voto 2012 - Nacional', u'xpath':
u'/HTML[1]/BODY[1]/SECTION[5]/DIV[1]/ARTICLE[1]/DIV[1]/DIV[6]',
u'icon': u'http://mexico.cnn.com/images/ico_mobile.jpg'}
I don't quite understand Unicode. How can I make sure that what I get with requests is still Unicode?
You can use req.text instead of req.content to ensure that you get Unicode. This is described in:
https://requests.readthedocs.io/en/latest/api/#requests.Response.text
Concerning "I don't quite understand Unicode": there's an entertaining primer on Unicode by Joel Spolsky, and the official Python Unicode HOWTO, which is a 10-minute read and covers everything Python-specific.
The requests docs say that requests will always return Unicode, and the example content you posted is in fact Unicode (notice the u'' string syntax? That's Python's syntax for Unicode strings), so there's no problem. Note that if you view the JSON response in a web browser, the u'' will not be there, because it's a property of how Python displays a string.
If Unicode is important to your application, please don't try to cope without really understanding it. You're in for a world of pain; character-set issues are extremely frustrating to debug if you don't know what you're doing. Reading both articles mentioned above takes maybe half an hour.
Try response.content.decode('utf-8') if response.text doesn't work.
According to the documentation, the main problem is that the encoding guessed by requests is determined based solely on the HTTP headers. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you can set response.encoding before accessing response.text.
Credit goes to Jay Taylor for commenting on TTT's answer - I almost missed the comment and thought it deserved its own answer.
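A short sketch of that advice; the token and URL below are placeholders, and the iso-8859-1 check reflects requests' historical default when headers omit a charset:
import requests

# Placeholder token and URL, not real Diffbot credentials.
r = requests.get('http://www.diffbot.com/api/article',
                 params={'token': 'YOUR_TOKEN', 'url': 'http://example.com/article'})

# requests guesses the encoding from HTTP headers alone; if you know the
# payload is really UTF-8, say so *before* touching r.text.
if not r.encoding or r.encoding.lower() == 'iso-8859-1':
    r.encoding = 'utf-8'

data = r.json()  # parsed from the now correctly decoded text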

How can I detect whether a given line in a file is a proper English sentence?

I need to detect whether a given "line" in a file is an English sentence or not. I am using Python. An approximate answer would do. I understand this is an NLP question, but is there a lightweight tool that gives a reasonable approximation? I do not want to use a full-fledged NLP toolkit for this, though if that is the only way then it is fine.
If an NLP toolkit is the answer, then the one I am reading about is the Natural Language Toolkit (NLTK). If anyone has a simple example of how to detect a sentence handy, please point me to it.
Perhaps you are looking for the Punkt tokenizer from the nltk library, which can split a given text into English sentences. You can then act upon the sentences by doing a grammar check (as pointed out by Acron). A minimal sketch follows.
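A sketch of that suggestion; the nltk.download call fetches the Punkt model once. Note that Punkt only splits text into sentences, so deciding whether each one is actually English still needs a further check, such as the language guesser in the answer below:
import nltk
nltk.download('punkt')  # one-time download of the Punkt sentence model

from nltk.tokenize import sent_tokenize

line = "This looks like English. And here is a second sentence."
for sentence in sent_tokenize(line):
    print(sentence)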
Currently, computer software cannot tell you whether a given string of tokens is a grammatical English sentence with any reasonable degree of reliability. You might, however, look into Amazon's Mechanical Turk. If you present the sentence to five native English speakers, and the majority of them say it is grammatical, you can assume it is with a reasonable level of certainty.
Of course, while Mechanical Turk does have a Web Services API and thus could be used from Python, it will not be real-time.
You can use Python Reverend. It has less than 400 lines of code. Take a look at how you can use it:
>>> from reverend.thomas import Bayes
>>> guesser = Bayes()
>>> guesser.train("en", u"a about above after again against all am an and any are aren't as at be because been before being below between both but by can't cannot could couldn't did didn't do does doesn't doing don't down during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most mustn't my myself no nor not of off on once only or other ought our ours ourselves out over own same shan't she she'd she'll she's should shouldn't so some such than that that's the their theirs them themselves then there there's these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't you you'd you'll you're you've your yours yourself yourselves")
>>> guesser.train("pt", u"último é acerca agora algmas alguns ali ambos antes apontar aquela aquelas aquele aqueles aqui atrás bem bom cada caminho cima com como comprido conhecido corrente das debaixo dentro desde desligado deve devem deverá direita diz dizer dois dos e ela ele eles em enquanto então está estão estado estar estará este estes esteve estive estivemos estiveram eu fará faz fazer fazia fez fim foi fora horas iniciar inicio ir irá ista iste isto ligado maioria maiorias mais mas mesmo meu muito muitos nós não nome nosso novo o onde os ou outro para parte pegar pelo pessoas pode poderá podia por porque povo promeiro quê qual qualquer quando quem quieto são saber sem ser seu somente têm tal também tem tempo tenho tentar tentaram tente tentei teu teve tipo tive todos trabalhar trabalho tu um uma umas uns usa usar valor veja ver verdade verdadeiro você")
>>> guesser.train("es", u"un una unas unos uno sobre todo también tras otro algún alguno alguna algunos algunas ser es soy eres somos sois estoy esta estamos estais estan como en para atras porque por qué estado estaba ante antes siendo ambos pero por poder puede puedo podemos podeis pueden fui fue fuimos fueron hacer hago hace hacemos haceis hacen cada fin incluso primero desde conseguir consigo consigue consigues conseguimos consiguen ir voy va vamos vais van vaya gueno ha tener tengo tiene tenemos teneis tienen el la lo las los su aqui mio tuyo ellos ellas nos nosotros vosotros vosotras si dentro solo solamente saber sabes sabe sabemos sabeis saben ultimo largo bastante haces muchos aquellos aquellas sus entonces tiempo verdad verdadero verdadera cierto ciertos cierta ciertas intentar intento intenta intentas intentamos intentais intentan dos bajo arriba encima usar uso usas usa usamos usais usan emplear empleo empleas emplean ampleamos empleais valor muy era eras eramos eran modo bien cual cuando donde mientras quien con entre sin trabajo trabajar trabajas trabaja trabajamos trabajais trabajan podria podrias podriamos podrian podriais yo aquel")
>>> guesser.guess(u'what language am i speaking')
[('en', 0.99990000000000001)]
>>> guesser.guess(u'que língua eu estou falando')
[('pt', 0.99990000000000001)]
>>> guesser.guess(u'en qué idioma estoy hablando')
[('es', 0.99990000000000001)]
You should be very careful about choosing the best training data for your needs. Just to give you an idea, I collected some stop words from English, Portuguese, and Spanish.
