Efficient way to sentence-tokenize and clean text


I have a dataframe with two columns, one with dates and the other with a string of text. I'd like to split the text into sentences and then apply some preprocessing.
Here is a simplified example of what I have:
import nltk
import pandas as pd
from nltk.corpus import stopwords
from pandarallel import pandarallel
pandarallel.initialize()
example_df=pd.DataFrame({'date':['2022-09-01'],'text':'Europa tiene un plan. Son cosas distintas. Perdón, esta es imagen dentro, y el recorte a los beneficios los ingresos totales conocimiento del uso fraudulento Además, el riesgo ha bajado. de gases nocivos, como el CO2. -La justicia europea ha confirmado se ha dado acceso al público por lo menos, intentar entrar. para reducir los contagios, vestido de chaqué. Han tenido que establecer de despido según informa que podría pasar desapercibida El Tribunal Supremo confirma en nuestra página web'})
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
example_df['sentence']=example_df['text'].parallel_apply(lambda x: spanish_tokenizer.tokenize(x))
As you can see, I run the nltk tokenizer on the raw text to create a new column, "sentence", that contains the list of sentences.
print(example_df['sentence'])
0 [Europa tiene un plan, Son cosas distintas, Perdón, esta es imagen dentro, y el recorte a los beneficios los ingresos totales conocimiento del uso fraudulento Además, el riesgo ha bajado, de gases nocivos, como el CO2, -La justicia europea ha confirmado se ha dado acceso al público por lo menos, intentar entrar, para reducir los contagios, vestido de chaqué, Han tenido que establecer de despido según informa que podría pasar desapercibida El Tribunal Supremo confirma en nuestra página web]
1 [casi todas las restricciones, Socios como Esquerra le echan un servicio público; con terrazas llenas Los voluntarios piden a todos los cuatros juntos en una semana la sentencia de cárcel para Griñán que Griñán no conoció la trama, de las hipotecas, A las afueras de Westminster]
Name: sentence, dtype: object
# Since commas might be misleading:
example_df.sentence[1]
['casi todas las restricciones',
'Socios como Esquerra le echan un servicio público; con terrazas llenas Los voluntarios piden a todos los cuatros juntos en una semana la sentencia de cárcel para Griñán que Griñán no conoció la trama, de las hipotecas',
'A las afueras de Westminster']
My next goal is to clean those sentences. Since the tokenizer needs punctuation to work, I believe the cleaning has to happen afterwards, which means looping over every sentence of every dated text. First of all, I am not sure how to do this with the pandas structure. Here is one of my attempts to remove stopwords:
from nltk.corpus import stopwords
stop = stopwords.words('spanish')
example_df['sentence'] = example_df['sentence'].parallel_apply(lambda x: ' '.join(
[word for word in i.split() for i in x if word not in (stop)]))
This produces the following error: AttributeError: 'int' object has no attribute 'split'
Is there a more efficient/elegant way to do this?

Since the sentence column holds tokenized text (a list of strings per row), the list comprehension logic needs to be changed.
E.g.:
sentences = ['casi todas las restricciones', 'Socios como Esquerra le echan un servicio público; con terrazas llenas Los voluntarios piden a todos los cuatros juntos en una semana la sentencia de cárcel para Griñán que Griñán no conoció la trama, de las hipotecas', 'A las afueras de Westminster']
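stopwords_removed = [word for sent in sentences for word in sent.split() if word not in stop]
Here sent is each sentence inside the list and word is each individual word you obtain after splitting that sentence on whitespace. The loop over the sentences has to come before the loop over the words.
Your error is most likely caused by the order of the for clauses: in your comprehension, i.split() is evaluated before for i in x binds i, so Python picks up whatever i already exists in the enclosing scope (apparently an int). Note that apply on a Series takes no axis argument, so the call is simply
df.Column.parallel_apply(func)
where func is a function that returns your list comprehension result.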
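The per-row function could look something like the sketch below. It is only an illustration: the name remove_stopwords is made up, the tiny stopword set stands in for stopwords.words('spanish'), and lowercasing each word before the lookup is an added assumption that the question does not ask for.

```python
def remove_stopwords(sentences, stop):
    """Return one cleaned string per sentence, with stopwords dropped."""
    stop = set(stop)  # set membership is faster than scanning a list
    cleaned = []
    for sent in sentences:
        # Outer loop: sentences. Inner loop: words of that sentence.
        kept = [word for word in sent.split() if word.lower() not in stop]
        cleaned.append(" ".join(kept))
    return cleaned


# Hypothetical mini stopword list, only for the demo
demo_stop = {"las", "a", "de", "todas"}
demo_sentences = ["casi todas las restricciones", "A las afueras de Westminster"]
print(remove_stopwords(demo_sentences, demo_stop))
```

Assuming the column layout from the question, it would be applied as example_df['sentence'].apply(lambda sents: remove_stopwords(sents, stop)), or with parallel_apply in place of apply.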

Related

Replace all occurrences of a word with another specific word that must appear somewhere in the sentence before that word

import re
#example 1
input_text = "((PERSON)María Rosa) ((VERB)pasará) unos dias aqui, hay que ((VERB)mover) sus cosas viejas de aqui, ya que sus cosméticos ((VERB)estorban) si ((VERB)estan) tirados por aquí. ((PERSON)Cyntia) es una buena modelo, su cabello es muy bello, hay que ((VERB)lavar) su cabello"
#example 2
input_text = "Sus útiles escolares ((VERB)estan) aqui, me sorprende que ((PERSON)Juan Carlos) los haya olvidado siendo que suele ((VERB)ser) tan cuidadoso con sus útiles."
#I need replace "sus" or "su" but under certain conditions
subject_capture_pattern = r"\(\(PERSON\)((?:\w\s*)+)\)" #underlined in red in the image
associated_info_capture_pattern = r"(?:sus|su)\s+((?:\w\s*)+)(?:\s+(?:del|de )|\s*(?:\(\(VERB\)|[.,;]))" #underlined in green in the image
identification_pattern =
replacement_sequence =
input_text = re.sub(identification_pattern, replacement_sequence, input_text, flags = re.IGNORECASE)
this is the correct output:
#for example 1
"((PERSON)María Rosa) ((VERB)pasará) unos dias aqui, hay que ((VERB)mover) cosas viejas ((CONTEXT) de María Rosa) de aqui, ya que cosméticos ((CONTEXT) de María Rosa) ((VERB)estorban) si ((VERB)estan) tirados por aquí. ((PERSON)Cyntia) es una buena modelo, cabello ((CONTEXT) de Cyntia) ((VERB)es) muy bello, hay que ((VERB)lavar) cabello ((CONTEXT) de Cyntia)"
#for example 2
"útiles escolares ((CONTEXT) NO DATA) ((VERB)estan) aqui, me sorprende que ((PERSON)Juan Carlos) los haya olvidado siendo que suele ((VERB)ser) tan cuidadoso con útiles ((CONTEXT) Juan Carlos)."
Details:
Replace the possessive pronouns "sus" or "su" with "de " + the content inside the last ((PERSON) "THIS SUBSTRING"), and if there is no ((PERSON) "THIS SUBSTRING") before then replace sus or su with ((PERSON) NO DATA)
Sentences are read from left to right, so the replacement will be the substring inside the parentheses ((PERSON)the substring) before that "sus" or "su", as shown in the example.
In the end, the replaced substrings should end up with this structure:
associated_info_capture_pattern + "((CONTEXT)" + subject_capture_pattern + ")"

This shows a way to do the replacement of su/sus like you asked for (albeit not with just a single re.sub). I didn't move the additional info, but you could modify it to handle that as well.
import re
subject_capture_pattern = r"\(\(PERSON\)((?:\w\s*)+)\)"
def replace_su_and_sus(input_text):
    start = 0
    replacement = "((PERSON) NO DATA)"
    output_text = ""
    for m in re.finditer(subject_capture_pattern, input_text):
        output_text += re.sub(r"\b[Ss]us?\b", replacement, input_text[start:m.end()])
        start = m.end()
        replacement = m.group(0).replace("(PERSON)", "(CONTEXT) de ")
    output_text += re.sub(r"\b[Ss]us?\b", replacement, input_text[start:])
    return output_text
My strategy was:
Up until the first subject capture, replace su/sus with "NO DATA"
Up until the second subject capture, replace su/sus with the name from the first capture
Proceed similarly for each subsequent subject capture
Finally, replace any su/sus between the last subject capture and the end of the string

Create a regular expression that continues to extract information if and only if the following pattern is met using it as a condition

import re
def fun(x):
    match=re.search(r"(?<=hay que) ([\w\s,]+) ([\w\s]+)",x)
    if match:
        for i in re.split(',|y',match[1]):
            with open(f'{i}.txt','w') as file:
                file.write(match[2])
input_text = str(input())
fun(input_text)
I need a regular expression that keeps matching words only while the last matched word is followed by a comma (,) or by y. Each extracted word becomes the name of a text file, as shown in the examples below. The rest of the sentence must then be written on one line in each of the .txt files created.
I was having trouble with sentences like for example:
hay que pintar y decorar las paredes de ese lugar
generate: pintar.txt, decorar las paredes de ese.txt
write inside each of them: lugar
but it should be:
generate: pintar.txt, decorar.txt
write inside each of them: las paredes de ese lugar
Other Examples...
input_sense: hay que pintar y decorar las paredes y los techos
generate: pintar.txt, decorar.txt
write inside each of them: las paredes y los techos
input_sense: hay que correr, saltar y cantar para llegar alli
generate: correr.txt , saltar.txt, cantar.txt
write inside each of them: para llegar alli
input_sense: yo creo que hay que saltar y correr para ir a ese lugar
generate: saltar.txt, correr.txt
write inside each of them: para ir a ese lugar
IMPORTANT: the listed words may begin with no ser|ser|no, in which case that prefix is part of the file name:
input_sense: hay que esconderse y ser silenciosos para no ser descubiertos
generate: esconderse.txt, ser_silenciosos.txt
write inside each of them: para no ser descubiertos
input_sense: hay que trabajar, escalar y no temer si quieres llegar a la meta
generate: trabajar.txt, escalar.txt, no_temer.txt
write inside each of them: si quieres llegar a la meta

I would suggest first finding what should be written:
import re
x = 'hay que trabajar, escalar y no temer si quieres llegar a la meta'
s = re.search(r'( las .*)|( los .*)|( para .*)|( si .*)', x)
content = s.group(0)
Then remove what you do not want from your string:
x = x.replace('hay que','')
x = x.replace(content, '')
x = x.strip()
Replace the space after the special words with an underscore:
x = x.replace(' no ', ' no_')
x = x.replace(' ser ', ' ser_')
Finally, split out your file names:
filenames = [f'{name.strip()}.txt' for name in re.split(',| y ', x)]
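As an alternative to the replace-based steps above, the list of verbs can also be captured directly with one pattern that only continues past a word when a comma or y follows it. This is a different technique from the one in the answer, offered as a sketch: the names ITEM, PATTERN and split_actions are invented for the example, and it returns the file names and the text instead of writing any files.

```python
import re

# One list item: an optional "no ser" / "ser" / "no" prefix plus one word
ITEM = r"(?:(?:no ser|ser|no)\s+)?\w+"
# Items keep being consumed only while separated by "," or " y "
PATTERN = re.compile(
    rf"hay que\s+({ITEM}(?:\s*,\s*{ITEM}|\s+y\s+{ITEM})*)\s+(.+)"
)


def split_actions(sentence):
    """Return (file_names, content) for a 'hay que ...' sentence."""
    m = PATTERN.search(sentence)
    if m is None:
        return [], ""
    items = re.split(r"\s*,\s*|\s+y\s+", m.group(1))
    names = [item.replace(" ", "_") + ".txt" for item in items]
    return names, m.group(2)


names, content = split_actions(
    "hay que trabajar, escalar y no temer si quieres llegar a la meta"
)
print(names)
print(content)
```

Writing the files is then a loop over names that opens each one and writes content. Sentences in which the list is not followed by any further text are not handled specially in this sketch.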

How can I get the exact result of 10**20 + 10**-20 in Python? It gives me 1e+20

I am writing code to solve quadratic (second-degree) equations, and it works well.
However, when I input the following equation:
x^2 + (10^(20) + 10^(-20))x + 1 = 0
(Yes, my input is 10**20 + 10**(-20).)
I get:
x1 = 0 x2 = -1e+20
However, it is taking (10^(20) + 10^(-20)) as 1e+20, while if you do the math:
10^(20) + 10^(-20) = 100000000000000000000.00000000000000000001
Which is almost 10^20 but not 10^20.
How can I get the exact result of that operation, so that I can get the exact value of x2 for this equation?
My code is the following:
#===============================Function to read the coefficients===============================
#Two functions are chained: first the one that reads the coefficients, then the one that solves the equation.
#A recursive function reads each coefficient of the user's equation
def cof():
    #Ask whether the coefficient is a plain number or a chain of operations
    op = input("Si tu coeficiente es un número introduce 1, si es una cadena de operaciones introduce 2")
    #Compare the user's input with each option
    if op == str(1):
        #Ask for the number
        num = input("¿Cuál es tu número?")
        #Check that it really is a number
        try:
            #If the input can be converted to float,
            #use that float value as the coefficient
            coef = float(num)
            #Return the coefficient
            return coef
        #If it could not be converted to float...
        except ValueError:
            #Tell the user about the error
            print("No introdujiste un número. Inténtalo de nuevo")
            #Call the function again
            return cof()
    #If the coefficient is a chain of operations (as in 10**20 + 10**-20)
    elif op == str(2):
        #Read the expression typed by the user
        entrada = input("Input")
        #Try to...
        try:
            #Evaluate the input. If that works,
            #the coefficient is the result of the evaluation
            coef = eval(entrada)
            #Return the coefficient
            return coef
        #If it could not be evaluated...
        except Exception:
            #Tell the user
            print("No introdujiste una cadena de operaciones válida. Inténtalo de nuevo")
            #Call the function again
            return cof()
    #If neither 1 nor 2 was entered, tell the user
    else:
        #Print the message
        print("No introdujiste 1 ni 2, inténtalo de nuevo")
        #Call the function again
        return cof()
#===============================Function to solve the equation===============================
#Solves the equation
def sol_cuadratica():
    #Ask for the value of a
    print("Introduce el coeficiente para a")
    #Call cof and store the value of a
    a = cof()
    #Ask for b
    print("Introduce el coeficiente para b")
    #Call cof and store b
    b = cof()
    #Ask for c
    print("Introduce el coeficiente para c")
    #Call cof and store c
    c = cof()
    #Tell the user which equation will be solved
    print("Vamos a encontrar las raices de la ecuación {}x² + {}x + {} = 0".format(a, b, c))
    #Compute the discriminant
    discriminante = (b**2 - 4*a*c)
    #If the discriminant is negative, the roots are complex
    if discriminante < 0:
        #Tell the user
        print("Las raices son imaginarias. Prueba con otros coeficientes.")
        #Call the function again
        return sol_cuadratica()
    #If the discriminant is zero or positive, solve
    else:
        #Formula for x1
        x1 = (-b + discriminante**(1/2))/(2*a)
        #Formula for x2
        x2 = (-b - discriminante**(1/2))/(2*a)
        #Print the results
        print("X1 = " + str(x1))
        print("X2 = " + str(x2))
sol_cuadratica()
The prompts are in Spanish because I'm from a Spanish-speaking country.

The limited precision of the machine floating-point type is the reason why, when you add a very small number to a very big number, the small number is simply ignored.
This phenomenon is called absorption.
With software floating-point objects (like the ones in the decimal module) you can reach any precision you need (computations are slower, because floating point is now emulated and no longer relies on the machine's FPU).
From the decimal module docs:
Unlike hardware based binary floating point, the decimal module has a user alterable precision (defaulting to 28 places) which can be as large as needed for a given problem
This can be achieved by changing the following global parameter decimal.getcontext().prec
import decimal
decimal.getcontext().prec = 41 # min value for the required range
d = decimal.Decimal(10**20)
d2 = decimal.Decimal(10**-20)
now
>>> d+d2
Decimal('100000000000000000000.00000000000000000001')
As suggested in comments, for the small number it's safer to let decimal module handle the division by using power operator on an already existing Decimal object (even if here, the result is the same):
d2 = decimal.Decimal(10)**-20
So you can use decimal.Decimal objects for your computations instead of native floats.
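Applied to the equation from the question, that could look like the sketch below. The precision of 100 digits is an assumption chosen to be comfortably larger than the 81 digits needed to hold b*b exactly. The variable names are illustrative, and the coefficients are hard-coded instead of being read from input.

```python
from decimal import Decimal, getcontext

# Assumption: 100 significant digits, more than the 81 that b*b needs here
getcontext().prec = 100

a = Decimal(1)
b = Decimal(10) ** 20 + Decimal(10) ** -20
c = Decimal(1)

discriminant = b * b - 4 * a * c
root = discriminant.sqrt()  # Decimal has its own sqrt method

x1 = (-b + root) / (2 * a)
x2 = (-b - root) / (2 * a)
print("x1 =", x1)
print("x2 =", x2)

# For comparison: with native floats the small term is absorbed
print(10.0 ** 20 + 10.0 ** -20 == 10.0 ** 20)
```

Since x^2 + (10^20 + 10^-20)x + 1 factors as (x + 10^20)(x + 10^-20), the exact roots are -10^-20 and -10^20, which is a convenient check on the output.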

python delete images in string only obtain letters and numbers

I wrote a Python program that obtains strings, and found that some of them contain images, for example: 👉😢👈, or "Siempre en día de la Madre la pasábamos así todos en familia dando mucho cariño a nuestra preciosa madre pero hoy la vamos a pasar solos extrañando a mamá👩 pero siempre llevándola en nuestros corazones❤".
I want to delete these images from the strings, keeping only numbers and letters.
Please note: these strings are not only written in English; they may be written in all kinds of languages (for example, Arabic or Japanese).
My program:
for post_item in group_member_posts_list:
    if post_item['post_content']:
        post_item_content_str = post_item['post_content']
        print("post_item_content_str:" + post_item_content_str)
        post_item_content_str = filter(str.isalnum,post_item_content_str)
        print("after filter post_item_content_str:" + post_item_content_str )
        b = TextBlob(post_item_content_str)
        post_item_content_type = b.detect_language()
I tried to use the filter function, but it gives errors. Also, the isalnum function seems to find only English letters.
Could you please tell me how to solve this problem?

By images, I believe you mean emojis (😀😋👌). You can simply use re.sub to remove them from your string.
import re
emoji_finder = re.compile('[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]+')
tcase_1 = "Siempre en día de la Madre la pasábamos así todos en familia dando mucho cariño a nuestra preciosa madre pero hoy la vamos a pasar solos extrañando a mamá👩 pero siempre llevándola en nuestros corazones❤"
tcase_2 = "bet👉😢👈ween"
print(re.sub(emoji_finder, "", tcase_1))
print(re.sub(emoji_finder, "", tcase_2))
Output:
Siempre en día de la Madre la pasábamos así
todos en familia dando mucho cariño a nuestra
preciosa madre pero hoy la vamos a pasar
solos extrañando a mamá pero siempre
llevándola en nuestros corazones
# and
between
Test it here: https://repl.it/IIWG
Adapted from this post and modified to support python 3.
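If the goal is really "letters and numbers only, in any language", another option is to filter by Unicode category instead of listing emoji ranges. The sketch below is an assumption-laden illustration, not part of the answer above: the function name is made up, it keeps whitespace and combining marks (so accented or vocalised text survives), it drops punctuation as well, and it explicitly skips the variation selectors U+FE00 to U+FE0F that often trail emojis.

```python
import unicodedata


def keep_letters_and_digits(text):
    """Keep letters, digits, combining marks and whitespace; drop the rest."""
    kept = []
    for ch in text:
        # Variation selectors are marks too, but only make sense after an emoji
        if "\ufe00" <= ch <= "\ufe0f":
            continue
        major = unicodedata.category(ch)[0]  # "L" letter, "N" number, "M" mark
        if major in ("L", "N", "M") or ch.isspace():
            kept.append(ch)
    return "".join(kept)


print(keep_letters_and_digits("bet👉😢👈ween"))
print(keep_letters_and_digits("extrañando a mamá👩 pero siempre"))
```

Because the test is on categories, it does not depend on the script of the text, but it will also remove commas, periods and other symbols.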

How to get clean text from MediaWiki markup format using mwparserfromhell or a simple parser in python?

I am trying to get clean sentences from the Wikipedia page of a species.
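For instance Abies durangensis (pid = 1268312). Using the Wikipedia API in Python to obtain the Wikipedia page:
import requests
# endpoint assumed from the API URL used in the answer below; the original snippet never defines eswiki_URI
eswiki_URI = 'https://es.wikipedia.org/w/api.php'
pid = 1268312
q = {'action' : 'query',
     'pageids': pid,
     'prop' : 'revisions',
     'rvprop' : 'content',
     'format' : 'json'}
result = requests.get(eswiki_URI, params=q).json()
# list() makes this work on Python 3, where dict.values() is not indexable
wikitext = list(result["query"]["pages"].values())[0]["revisions"][0]["*"]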
gives:
{{Ficha de taxón
| name = ''Abies durangensis''
| image = Abies tamazula dgo.jpg
| status = LR/lc
| status_ref =<ref>Conifer Specialist Group 1998. [http://www.iucnredlist.org/search/details.php/42279/all ''Abies durangensis'']. [http://www.iucnredlist.org 2006 IUCN Red List of Threatened Species. ] Downloaded on 10 July 2007.</ref>
| regnum = [[Plantae]]
| divisio = [[Pinophyta]]
| classis = [[Pinopsida]]
| ordo = [[Pinales]]
| familia = [[Pinaceae]]
| genus = ''[[Abies]]''
| binomial = '''''Abies durangensis'''''
| binomial_authority = [[Maximino Martínez|Martínez]]<ref name=ipni>{{ cite web |url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal |title=Plant Name Details for ''Abies durangensis'' |publisher=[[International Plant Names Index|IPNI]] |accessdate=6 de octubre de 2009}}</ref>
| synonyms =
}}
'''''Abies durangensis''''' es una [[especie]] de [[conífera]] perteneciente a la familia [[Pinaceae]]. Son [[endémica]]s de [[México]] donde se encuentran en [[Durango]], [[Chihuahua]], [[Coahuila]], [[Jalisco]] y [[Sinaloa]]. También es conocido como 'Árbol de Coahuila' y 'pino mexicano'.<ref name=cje>{{ cite web |url=http://www.conifers.org/pi/ab/durangensis.htm |title=''Abies durangaensis'' description |author=Christopher J. Earle |date=11 de junio de 2006 |accessdate=6 de octubre de 2009}}</ref>
== Descripción ==
Es un [[árbol]] que alcanza los 40 metros de altura con un [[Tronco (botánica)|tronco]] recto que tiene 150 cm de diámetro. Las [[rama]]s son horizontales y la [[corteza (árbol)|corteza]] de color gris. Las [[hoja]]s son verde brillante de 20–35 mm de longitud por 1-1.5 mm de ancho. Tiene los conos de [[semilla]]s erectos en ramas laterales sobre un corto [[pedúnculo]]. Las [[semilla]]s son [[resina|resinosas]] con una [[núcula]] amarilla con alas.
== Taxonomía ==
''Abies durangensis'' fue descrita por [[Maximino Martínez]] y publicado en ''[[Anales del instituto de Biología de la Universidad Nacional de México]]'' 13: 2. 1942.<ref name = Trop>{{cita web |url=http://www.tropicos.org/Name/24901700 |título= ''{{PAGENAME}}''|fechaacceso=21 de enero de 2013 |formato= |obra= Tropicos.org. [[Missouri Botanical Garden]]}}</ref>
;[[Etimología]]:
'''''Abies''''': nombre genérico que viene del nombre [[latin]]o de ''[[Abies alba]]''.<ref>[http://www.calflora.net/botanicalnames/pageAB-AM.html En Nombres Botánicos]</ref>
'''''durangensis''''': [[epíteto]] geográfico que alude a su localización en [[Durango]].
;Variedades:
* ''Abies durangensis var. coahuilensis'' (I. M. Johnst.) Martínez
;[[sinonimia (biología)|Sinonimia]]:
* ''Abies durangensis subsp. neodurangensis'' (Debreczy, I.Rácz & R.M.Salazar) Silba'
* ''Abies neodurangensis'' Debreczy, I.Rácz & R.M.Salazar<ref>[http://www.theplantlist.org/tpl/record/kew-2609816 ''{{PAGENAME}}'' en PlantList]</ref><ref name = Kew>{{cita web|url=http://apps.kew.org/wcsp/namedetail.do?name_id=2609816 |título=''{{PAGENAME}}'' |work= World Checklist of Selected Plant Families}}</ref>
;''var. coahuilensis'' (I.M.Johnst.) Martínez
* ''Abies coahuilensis'' I.M.Johnst.
* ''Abies durangensis subsp. coahuilensis'' (I.M.Johnst.) Silba
== Véase también ==
* [[Terminología descriptiva de las plantas]]
* [[Anexo:Cronología de la botánica]]
* [[Historia de la Botánica]]
* [[Pinaceae#Descripción|Características de las pináceas]]
== Referencias ==
{{listaref}}
== Bibliografía ==
# CONABIO. 2009. Catálogo taxonómico de especies de México. 1. In Capital Nat. México. CONABIO, Mexico City.
== Enlaces externos ==
{{commonscat}}
{{wikispecies|Abies}}
* http://web.archive.org/web/http://ww.conifers.org/pi/ab/durangensis.htm
* http://www.catalogueoflife.org/search.php
[[Categoría:Abies|durangensis]]
[[Categoría:Plantas descritas en 1942]]
[[Categoría:Plantas descritas por Martínez]]
I am interested in the (unmarked) text just after the infobox, the gloss:
Abies durangensis es una especie de conífera perteneciente a la familia Pinaceae. Son endémicas de México donde se encuentran en Durango, Chihuahua, Coahuila, Jalisco y Sinaloa. También es conocido como 'Árbol de Coahuila' y 'pino mexicano'.
So far I have consulted https://www.mediawiki.org/wiki/Alternative_parsers and found that mwparserfromhell is the least complicated parser in Python. However, I don't see clearly how to do what I intend. When I use the example proposed in the documentation, I just can't see where the gloss is.
import mwparserfromhell
templates = mwparserfromhell.parse(wikitext).filter_templates()
for t in templates:
    print(t.name)
    print(t.params)
Ficha de taxón
[u" name = ''Abies durangensis''\n", u' image = Abies tamazula dgo.jpg \n', u' status = LR/lc\n', u" status_ref =<ref>Conifer Specialist Group 1998. [http://www.iucnredlist.org/search/details.php/42279/all ''Abies durangensis'']. [http://www.iucnredlist.org 2006 IUCN Red List of Threatened Species. ] Downloaded on 10 July 2007.</ref>\n", u' regnum = [[Plantae]]\n', u' divisio = [[Pinophyta]]\n', u' classis = [[Pinopsida]]\n', u' ordo = [[Pinales]]\n', u' familia = [[Pinaceae]]\n', u" genus = ''[[Abies]]'' \n", u" binomial = '''''Abies durangensis'''''\n", u" binomial_authority = [[Maximino Mart\xednez|Mart\xednez]]<ref name=ipni>{{ cite web |url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal |title=Plant Name Details for ''Abies durangensis'' |publisher=[[International Plant Names Index|IPNI]] |accessdate=6 de octubre de 2009}}</ref>\n", u' synonyms = \n']
cite web
[u'url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal ', u"title=Plant Name Details for ''Abies durangensis'' ", u'publisher=[[International Plant Names Index|IPNI]] ', u'accessdate=6 de octubre de 2009']
cite web
[u'url=http://www.conifers.org/pi/ab/durangensis.htm ', u"title=''Abies durangaensis'' description ", u'author=Christopher J. Earle ', u'date=11 de junio de 2006 ', u'accessdate=6 de octubre de 2009']
cita web
[u'url=http://www.tropicos.org/Name/24901700 ', u"t\xedtulo= ''{{PAGENAME}}''", u'fechaacceso=21 de enero de 2013 ', u'formato= ', u'obra= Tropicos.org. [[Missouri Botanical Garden]]']
PAGENAME
[]
PAGENAME
[]
cita web
[u'url=http://apps.kew.org/wcsp/namedetail.do?name_id=2609816 ', u"t\xedtulo=''{{PAGENAME}}'' ", u'work= World Checklist of Selected Plant Families']
PAGENAME
[]
listaref
[]
commonscat
[]
wikispecies
[u'Abies']

Instead of torturing yourself by parsing something that cannot even be expressed in a formal grammar, use the TextExtracts API:
https://es.wikipedia.org/w/api.php?action=query&prop=extracts&explaintext=1&titles=Abies%20durangensis&format=json
gives the following output:
Abies durangensis es una especie de conífera perteneciente a la
familia Pinaceae. Son endémicas de México donde se encuentran en
Durango, Chihuahua, Coahuila, Jalisco y Sinaloa. También es conocido
como 'Árbol de Coahuila' y 'pino mexicano'.
== Descripción ==
Es un árbol que alcanza los 40 metros de altura con un tronco recto que tiene 150 cm de diámetro. Las ramas son
horizontales y la corteza de color gris. Las hojas son verde brillante
de 20–35 mm de longitud por 1-1.5 mm de ancho. Tiene los conos de
semillas erectos en ramas laterales sobre un corto pedúnculo. Las
semillas son resinosas con una núcula amarilla con alas.
[...]
Append &exintro=1 to the URL if you need only the lead.
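Building that request from Python could be sketched as follows. The helper names and the sample response dictionary are made up for illustration (the dictionary only mimics the usual query/pages/extract layout of the JSON reply; it is not real API output), and the network call itself is left out so the snippet runs offline.

```python
from urllib.parse import urlencode

API_URL = "https://es.wikipedia.org/w/api.php"


def build_extract_url(title, intro_only=True):
    """Build a TextExtracts query URL for one page title."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }
    if intro_only:
        params["exintro"] = 1  # only the lead section
    return API_URL + "?" + urlencode(params)


def first_extract(response_json):
    """Return the extract of the first page in a decoded JSON reply."""
    pages = response_json["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("extract", "")


# Hand-written stand-in for a decoded reply, for demonstration only
sample = {"query": {"pages": {"1268312": {
    "title": "Abies durangensis",
    "extract": "Abies durangensis es una especie de conífera..."}}}}

print(build_extract_url("Abies durangensis"))
print(first_extract(sample))
```

With the requests library from the question, the reply would be obtained with requests.get(url).json() and passed to first_extract.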

You can use wikitextparser.
If wikitext holds your text string, all you need is:
import wikitextparser
parsed = wikitextparser.parse(wikitext)
Then you can get the plain text of the whole page or of a particular section. For example, parsed.sections[1].plain_text() gives you the plain text of the second section of the page. The lead, that is, the text before the first heading (which seems to be what you are looking for), should be parsed.sections[0].

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
