I have a list with many elements:
l = ['tribunavtplus.client.db.PagingResultSet/2610158785',
'java.util.ArrayList/4159755760',
'tribunavtplus.client.db.ERGEBNISSTORE/3664619045',
'605',
'I. Sozialversicherungsgerichtshof',
'',
'a514394f9f18429995d4cd5c64fbc6de',
'Arrêt de la Ie Cour des assurances sociales du Tribunal cantonal',
'605 2022 1',
'3a4604ff7f9b404193346990b98f7179',
'605 2021 234',
'Assurance-accidents, rente (capacité de travail, revenus retenus dans le calcul du taux d\\x27invalidité, exigibilité), atteinte à l\\x27intégrité.',
'608',
'D:\\\\InetPubData\\\\PublicationDocuments\\\\80749472d88e4d45a0d3c340acbb670a.pdf',
'Polizeirichter Greyerz',
'D:\\\\InetPubData\\\\PublicationDocuments\\\\412d9325978b487f8b15a13380ba4436.pdf',
'Assurance-invalidité (refus de rente et de mesures d\\x27ordre professionnel).',
'101',
'I. Zivilappellationshof',
'c321e55aa536438fb95a06eb798eff82',
'Arrêt de la Ie Cour d\\x27appel civil du Tribunal cantonal',
'101 2022 202',
'2022-09-16',
'D:\\\\InetPubData\\\\PublicationDocuments\\\\c321e55aa536438fb95a06eb798eff82.pdf',
'Eheschutzmassnahmen',
'Gericht Saane',
'Mesures protectrices de l\\x27union conjugale, droit de visite, contributions d\\x27entretien.',
'603',
'III. Verwaltungsgerichtshof',
'3a801995f04b48948c0d1d2224b8f4b9',
'Arrêt de la IIIe Cour administrative du Tribunal cantonal',
'603 2022 110',
'D:\\\\InetPubData\\\\PublicationDocuments\\\\3a801995f04b48948c0d1d2224b8f4b9.pdf',
'Strassenverkehr und Transportwesen',
'Amt für Strassenverkehr und Schifffahrt (ASS)',
'502 2022 24',
'2022-09-22',
'D:\\\\InetPubData\\\\PublicationDocuments\\\\d2c0687313f24491870571dcfd64d6ad.pdf',
'102',
'java.lang.Boolean/476441737']
I want to extract elements such as:
'101 2022 202'
'502 2022 24'
'605 2022 1'
etc.
That is, blocks of three numbers of 1 to 4 digits each, separated by spaces.
I have tried this, but it does not work:
regex_id = re.compile(r"(\d{1,5})(\s)(\d{1,5})(\s)(\d{1,5})")
id_list = [value for value in l if regex_id.search(str(value))==True]
print(id_list)
But I always get an empty list.
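For reference, `re.search` never returns the literal `True`; it returns a `Match` object or `None`, so the `== True` comparison is always false. A minimal sketch of the fix (using `fullmatch`, on the assumption that the IDs are whole-string matches; use `search` instead if substrings should count):

```python
import re

l = ['605 2022 1', 'Gericht Saane', '101 2022 202',
     '502 2022 24', 'java.lang.Boolean/476441737', '2022-09-16']

# re.search returns a Match object (truthy) or None, never the literal
# True, so `== True` always fails. Test for truthiness instead.
# \d{1,4} matches the 1-to-4-digit blocks described above.
regex_id = re.compile(r"\d{1,4} \d{1,4} \d{1,4}")
id_list = [value for value in l if regex_id.fullmatch(str(value))]
print(id_list)  # ['605 2022 1', '101 2022 202', '502 2022 24']
```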
My goal is to write a Python 3 function that takes last names as rows from a CSV and splits them correctly into lastname_1 and lastname_2.
Spanish names have the following structure: firstname + lastname_1 + lastname_2.
Forgetting about the first name, I would like code that splits a full last name into these two categories (lastname_1, lastname_2). What is the challenge?
Sometimes the last name(s) contain prepositions:
DE BLAS ZAPATA: "De Blas" is lastname_1 and "Zapata" lastname_2
MATIAS DE LA MANO: "Matias" is lastname_1, "de la Mano" lastname_2
LOPEZ FERNANDEZ DE VILLAVERDE: "Lopez Fernandez" is lastname_1, "de Villaverde" lastname_2
DE MIGUEL DEL CORRAL: "De Miguel" is lastname_1, "del Corral" lastname_2
VIDAL DE LA PEÑA SOLIS: "Vidal" is lastname_1, "de la Peña Solis" lastname_2
MONTAVA DEL ARCO: "Montava" is lastname_1, "Del Arco" lastname_2
and the list could go on and on.
I am currently stuck; I found code in Perl, but I struggle to understand the main idea behind it well enough to translate it to Python 3.
import re

preposition_lst = ['DE LO ', 'DE LA ', 'DE LAS ', 'DEL ', 'DELS ', 'DE LES ', 'DO ', 'DA ', 'DOS ', 'DAS', 'DE ']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PEÑA", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]

for case in cases:
    for prep in preposition_lst:
        m = re.search(f"(.*)({prep}[A-ZÀ-ÚÄ-Ü]+)", case, re.I)  # re.I makes it case-insensitive
        try:
            print(m.groups())
            print(prep)
        except AttributeError:  # m is None when there is no match
            pass
Try this one:
import re

preposition_lst = ['DE', 'LO', 'LA', 'LAS', 'DEL', 'DELS', 'LES', 'DO', 'DA', 'DOS', 'DAS']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PEÑA", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]

def last_name(name):
    case = re.findall(r'\w+', name)
    res = list(filter(lambda x: x not in preposition_lst, case))
    return res

list_final = []
for case in cases:
    list_final.append(last_name(case))
for i in range(len(list_final)):
    if len(list_final[i]) > 2:
        name1 = ' '.join(list_final[i][:2])
        name2 = ' '.join(list_final[i][2:])
        list_final[i] = [name1, name2]
print(list_final)
# [['BLAS', 'ZAPATA'], ['MATIAS', 'MANO'], ['LOPEZ FERNANDEZ', 'VILLAVERDE'], ['MIGUEL', 'CORRAL'], ['VIDAL', 'PEÑA'], ['MONTAVA', 'ARCO'], ['CASAS', 'VALLE']]
Does this suit your requirements?
import re

preposition_lst = ['DE LO', 'DE LA', 'DE LAS', 'DE LES', 'DEL', 'DELS', 'DO', 'DA', 'DOS', 'DAS', 'DE']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PENA SOLIS", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]

def split_name(name):
    f1 = re.compile("(.*)({preps}(.+))".format(preps="(" + " |".join(preposition_lst) + ")"))
    m1 = f1.match(name)  # match the argument, not the global loop variable `case`
    if m1:
        if len(m1.group(1)) != 0:
            return m1.group(1).strip(), m1.group(2).strip()
        else:
            return " ".join(name.split()[:-1]), name.split()[-1]
    else:
        return " ".join(name.split()[:-1]), name.split()[-1]

for case in cases:
    first, second = split_name(case)
    print("{} --> name 1 = {}, name 2 = {}".format(case, first, second))
# DE BLAS ZAPATA --> name 1 = DE BLAS, name 2 = ZAPATA
# MATIAS DE LA MANO --> name 1 = MATIAS, name 2 = DE LA MANO
# LOPEZ FERNANDEZ DE VILLAVERDE --> name 1 = LOPEZ FERNANDEZ, name 2 = DE VILLAVERDE
# DE MIGUEL DEL CORRAL --> name 1 = DE MIGUEL, name 2 = DEL CORRAL
# VIDAL DE LA PENA SOLIS --> name 1 = VIDAL, name 2 = DE LA PENA SOLIS
# MONTAVA DEL ARCO --> name 1 = MONTAVA, name 2 = DEL ARCO
# DOS CASAS VALLE --> name 1 = DOS CASAS, name 2 = VALLE
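One caveat on the pattern above: joining the prepositions with " |" leaves the final "DE" with no trailing space, so it can match inside a surname that happens to contain those letters. A sketch of a hardened variant (the extra test case "MONTEVERDE LOPEZ" is hypothetical), using a word boundary plus a required trailing space:

```python
import re

preposition_lst = ['DE LO', 'DE LA', 'DE LAS', 'DE LES', 'DEL', 'DELS', 'DO', 'DA', 'DOS', 'DAS', 'DE']
# \b plus an explicit trailing space keeps 'DE' from matching inside a
# surname such as the hypothetical 'MONTEVERDE LOPEZ'.
pattern = re.compile(r"(.*)\b((?:" + "|".join(preposition_lst) + r") .+)")

def split_name(name):
    m = pattern.match(name)
    if m and m.group(1).strip():
        return m.group(1).strip(), m.group(2).strip()
    # no usable preposition (or it starts the name): split off the last word
    head, _, tail = name.rpartition(" ")
    return head, tail

print(split_name("MONTEVERDE LOPEZ"))      # ('MONTEVERDE', 'LOPEZ')
print(split_name("DE MIGUEL DEL CORRAL"))  # ('DE MIGUEL', 'DEL CORRAL')
```

The greedy `(.*)` still picks the rightmost preposition, so the outputs listed above are unchanged for the original test cases.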
I am using TreeTagger to extract the lemma of each word. I have a function that does this, but for some words it raises a list index out of range error:
def treetagger_process(texte):
    '''Process the text and return a dictionary.'''
    d_tag = {}
    for line in texte:
        newvalues = tagger.tag_text(line)
        d_tag[line] = newvalues
    #print(d_tag)
    # lemmatisation
    d_lemma = defaultdict(list)
    d_lemma_1 = defaultdict(list)
    for k, v in d_tag.items():
        print(k)
        for p in v:
            #print(p)
            parts = p.split('\t')
            print(parts)
            forme = parts[0]
            cat = parts[1]
            lemme = parts[2]  # a single word
            print(lemme)
            try:
                if lemme == '':
                    d_lemma[k].append(forme)
                elif "|" in lemme:
                    print(lemme)
                    lemme_double = lemme.split("|")
                    lemme_final = lemme_double[1]
                    print(lemme_final)
                    d_lemma[k].append(lemme_final)
                else:
                    d_lemma[k].append(lemme)
            except:
                print(parts)
    for k, v in d_lemma.items():
        d_lemma_1[k] = " ".join(v)
    # remove the space
    print(d_lemma_1)
    return d_lemma_1
Is there a way to overcome that error? It seems the word "dns-remplacé" is causing the problem, but I expected that if "|" is not found, the word would automatically be sent to the dictionary.
error :
...
L'idée de départ est amusante mais elle est très mal exploitée en raison notamment d'une mise en scène paresseuse , d'une intrigue à laquelle on ne croit pas une seconde et d'acteurs qui semblent à l'abandon .
["L'", 'DET:ART', 'le']
le
['idée', 'NOM', 'idée']
idée
['de', 'PRP', 'de']
de
['départ', 'NOM', 'départ']
départ
['est', 'VER:pres', 'être']
être
['amusante', 'ADJ', 'amusant']
amusant
['mais', 'KON', 'mais']
mais
['elle', 'PRO:PER', 'elle']
elle
['est', 'VER:pres', 'être']
être
['très', 'ADV', 'très']
très
['mal', 'ADV', 'mal']
mal
['exploitée', 'VER:pper', 'exploiter']
exploiter
['en', 'PRP', 'en']
en
['raison', 'NOM', 'raison']
raison
['notamment', 'ADV', 'notamment']
notamment
["d'", 'PRP', 'de']
de
['une', 'DET:ART', 'un']
un
['mise', 'NOM', 'mise']
mise
['en', 'PRP', 'en']
en
['scène', 'NOM', 'scène']
scène
['paresseuse', 'ADJ', 'paresseux']
paresseux
[',', 'PUN', ',']
,
["d'", 'PRP', 'de']
de
['une', 'DET:ART', 'un']
un
['intrigue', 'NOM', 'intrigue']
intrigue
['à', 'PRP', 'à']
à
['laquelle', 'PRO:REL', 'lequel']
lequel
['on', 'PRO:PER', 'on']
on
['ne', 'ADV', 'ne']
ne
['croit', 'VER:pres', 'croire']
croire
['pas', 'ADV', 'pas']
pas
['une', 'DET:ART', 'un']
un
['seconde', 'NUM', 'second']
second
['et', 'KON', 'et']
et
["d'", 'PRP', 'de']
de
['acteurs', 'NOM', 'acteur']
acteur
['qui', 'PRO:REL', 'qui']
qui
['semblent', 'VER:pres', 'sembler']
sembler
['à', 'PRP', 'à']
à
["l'", 'DET:ART', 'le']
le
['abandon', 'NOM', 'abandon']
abandon
['.', 'SENT', '.']
.
"Faux cul ed wright dit que son film est un hommage aux buddy movies hollywoodiens et nottament bad boys.aussi se permet il de parodier des scènes de ce """"chef d'oeuvre""""au tics de caméra prés,et si on le savait pas déjà on se rend compte à quel point michael bay est un tacheron;Hot Fuzz ne se contente pas d'être une parodie de plus à la """"y'a t-il un flic..."
['"', 'PUN:cit', '"']
"
['Faux', 'ADJ', 'faux']
faux
['cul', 'NOM', 'cul']
cul
['ed', 'NOM', 'ed']
ed
['wright', 'NOM', 'wright']
wright
['dit', 'VER:pper', 'dire']
dire
['que', 'KON', 'que']
que
['son', 'DET:POS', 'son']
son
['film', 'NOM', 'film']
film
['est', 'VER:pres', 'être']
être
['un', 'DET:ART', 'un']
un
['hommage', 'NOM', 'hommage']
hommage
['aux', 'PRP:det', 'au']
au
['buddy', 'NOM', 'buddy']
buddy
['movies', 'NOM', 'movies']
movies
['hollywoodiens', 'ADJ', 'hollywoodien']
hollywoodien
['et', 'KON', 'et']
et
['nottament', 'VER:pres', 'nottament']
nottament
['bad', 'NOM', 'bad']
bad
['dns-remplacé', 'ADJ', 'dns-remplacé']
dns-remplacé
['<repdns text="boys.aussi" />']
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-25-a8bd3e42b86e> in <module>
21 #l_1_dict = spacy_process(l_1)
22 #l_1_dict = spacy_process_treet(l_1)
---> 23 l_1_dict = treetagger_process(l_1)
24
25 print(type(l_1_dict))
<ipython-input-24-20ed242ad8de> in treetagger_process(texte)
18 parts = p.split('\t')
19 forme=parts[0]
---> 20 cat=parts[1]
21 lemme = parts[2] # un seul mot
22 print(lemme)
IndexError: list index out of range
You're getting the list index out of range error because you're splitting a string on '\t' and expecting a list of 3 components back. The dns-substitute token seems to produce a list with only 1 item.
In this example you have 3 items:
bad
['dns-remplacé', 'ADJ', 'dns-remplacé']
Here you have only 1 item:
dns-remplacé
['<repdns text="boys.aussi" />']
I would recommend checking that splitting on '\t' returns a list of size 3, and if it doesn't, handling that scenario explicitly:
parts = p.split('\t')
if len(parts) == 3:
    ...  # do normal processing
else:
    ...  # handle this situation
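A sketch of that guard as a small helper (`safe_parts` is a hypothetical name; the fallback of reusing the raw token as its own lemma is an assumption about what "handle this situation" should do):

```python
def safe_parts(tagged_line):
    """Split one TreeTagger output line on tabs. Tokens such as
    '<repdns text="..." />' come back as a single field, so guard
    against lines with fewer than three columns."""
    parts = tagged_line.split('\t')
    if len(parts) == 3:
        forme, cat, lemme = parts
        return forme, cat, lemme
    # fallback: keep the raw token as its own lemma, with no category
    return parts[0], None, parts[0]

print(safe_parts('idée\tNOM\tidée'))  # ('idée', 'NOM', 'idée')
print(safe_parts('<repdns text="boys.aussi" />'))
```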
I am new to Python, and I'm trying to program a scraper.
First, I extract this kind of string into a variable (let's call it data[1], because it's contained in an array):
\"description\":\"Alors qu\\u0027ils montaient dans l\\u0027un des
couloirs du versant nord du Hohneck (\\"le premier couloir \u00E0
droite\\"), deux alpinistes ont d\u00E9clench\u00E9 une plaque et ont
\u00E9t\u00E9 emport\u00E9s tous les deux. L\\u0027un sera enseveli et
l\\u0027autre ne pourra d\u00E9clencher l\\u0027alerte qu\\u0027\u00E0
la nuit. La victime ne sera retrouv\u00E9e que d\u00E9but avril.
\u003cbr\u003e Sur la photo prise en f\u00E9vrier 2011, le
trac\u00E9 approximatif de l\\u0027avalanche a \u00E9t\u00E9
repr\u00E9sent\u00E9.\",
Then I use:
data = data[1].encode().decode('unicode-escape')
but it gives me :
"description":"Alors qu\u0027ils montaient dans l\u0027un des couloirs
du versant nord du Hohneck (\"le premier couloir à droite\"), deux
alpinistes ont déclenché une plaque et ont été emportés tous les deux.
L\u0027un sera enseveli et l\u0027autre ne pourra déclencher
l\u0027alerte qu\u0027à la nuit. La victime ne sera retrouvée que
début avril. \u003cbr\u003e Sur la photo prise en février 2011, le
tracé approximatif de l\u0027avalanche a été représenté.",
Indeed, the accented characters have been replaced, but the apostrophes stay unprocessed!
It seems the double backslashes are the cause.
I tried several methods:
decoding twice turns "\u0027" into "'", but "é" becomes "é";
data.replace('é', 'é') or data.replace(u'\u0027', u'é') don't work.
So, do you have any idea how I could fix this problem?
Problem fixed!
As user2357112 supports Monica said, I was trying to process a JSON string manually.
But doing:
data = html_page.find_all("script")
data = re.findall("(?<=JSON\.parse\(')[A-Za-z0-9'.?!/+=;:,()\\\ \"\-{_àâçéèêëíìîôùûæœÁÀÂÃÇÉÈÊËÍÌÎÔÙÛ©´<>]*", str(data))
data = data[0].encode().decode('unicode-escape') + "\"\"}"
data_dict = json.loads(data)
string_data_dict = json.dumps(data_dict)
for cle, val in data_dict.items():
    print(cle + " : " + str(val))
solves the bug.
With this code, an input string like :
<script>
var admin = false;
var avalanche = JSON.parse('{...\"description\":\"Alors qu\\u0027ils montaient dans l\\u0027un des couloirs du versant nord du Hohneck (\\\"le premier couloir \u00E0 droite\\\"), deux alpinistes ont d\u00E9clench\u00E9 une plaque et ont \u00E9t\u00E9 emport\u00E9s tous les deux. L\\u0027un sera enseveli et l\\u0027autre ne pourra d\u00E9clencher l\\u0027alerte qu\\u0027\u00E0 la nuit. La victime ne sera retrouv\u00E9e que d\u00E9but avril. \\u003cbr\\u003e Sur la photo prise en f\u00E9vrier 2011, le trac\u00E9 approximatif de l\\u0027avalanche a \u00E9t\u00E9 repr\u00E9sent\u00E9.\",...
gives an output like :
...
description : Alors qu'ils montaient dans l'un des couloirs du versant nord du Hohneck ("le premier couloir à droite"), deux alpinistes ont déclenché une plaque et ont été emportés tous les deux. L'un sera enseveli et l'autre ne pourra déclencher l'alerte qu'à la nuit. La victime ne sera retrouvée que début avril. <br> Sur la photo prise en février 2011, le tracé approximatif de l'avalanche a été représenté.
...
Thank you.
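For reference, the reason json.loads resolves the leftover sequences: the page embeds a JSON string that is escaped twice, so one unicode-escape pass still leaves JSON-level \uXXXX escapes behind, and a JSON parse decodes them. A minimal sketch with a made-up fragment:

```python
import json

# After one unicode-escape pass the text still contains JSON escapes
# like \u0027; parsing the fragment as JSON decodes them properly.
raw = r'"Alors qu\u0027ils montaient \u00e0 droite"'
print(json.loads(raw))  # Alors qu'ils montaient à droite
```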
I have two columns representing the correct sequence and the predicted sequence. I want to compute statistics on the number of deletions, substitutions and insertions by comparing each correct sequence with its predicted sequence.
I used the Levenshtein distance to get the number of differing characters (see the function below), and an error_dist function to get the most common errors (in terms of substitutions).
Here is a sample of my data:
de de
date date
pour pour
etoblissemenls etablissements
avec avec
code code
communications communications
r r
seiche seiche
titre titre
publiques publiques
ht ht
bain bain
du du
ets ets
premier premier
dans dans
snupape soupape
minimum minimum
blanc blanc
fr fr
nos nos
au au
bl bl
consommations consommations
somme somme
euro euro
votre votre
offre offre
forestier forestier
cs cs
de de
pour pour
de de
paye r
cette cette
votre votre
valeurs valeurs
des des
gfda gfda
tva tva
pouvoirs pouvoirs
de de
revenus revenus
offre offre
ht ht
card card
noe noe
montant montant
r r
comprises comprises
quantite quantite
nature nature
ticket ticket
ou ou
rapide rapide
de de
sous sous
identification identification
du du
document document
suicide suicide
bretagne bretagne
tribunal tribunal
services services
cif cif
moyen moyen
gaec gaec
total total
lorsque lorsque
contact contact
fermeture fermeture
la la
route route
tva tva
ia ia
noyal noyal
brie brie
de de
nanterre nanterre
charcutier charcutier
semestre semestre
de de
rue rue
le le
bancaire bancaire
martigne martigne
recouvrement recouvrement
la la
sainteny sainteny
de de
franc franc
rm rm
vro vro
Here is my code:
import pandas as pd
import collections
import numpy as np
import matplotlib.pyplot as plt
import distance

def error_dist():
    df = pd.read_csv('data.csv', sep=',')
    df = df.astype(str)
    df = df.replace(['é', 'è', 'È', 'É'], 'e', regex=True)
    df = df.replace(['à', 'â', 'Â'], 'a', regex=True)
    dictionnary = []
    for i in range(len(df)):
        if df.manual_raw_value[i] != df.raw_value[i]:
            text = df.manual_raw_value[i]
            text2 = df.raw_value[i]
            x = len(df.manual_raw_value[i])
            y = len(df.raw_value[i])
            z = min(x, y)
            for t in range(z):
                if text[t] != text2[t]:
                    d = (text[t], text2[t])
                    dictionnary.append(d)
    #print(dictionnary)
    dictionnary_new = dict(collections.Counter(dictionnary).most_common(25))
    pos = np.arange(len(dictionnary_new.keys()))
    width = 1.0
    ax = plt.axes()
    ax.set_xticks(pos + (width / 2))
    ax.set_xticklabels(dictionnary_new.keys())
    plt.bar(range(len(dictionnary_new)), dictionnary_new.values(), width, color='g')
    plt.show()
And the Levenshtein distance:
def levenstein_dist():
    df = pd.read_csv('data.csv', sep=',')
    df = df.astype(str)
    df['string diff'] = df.apply(lambda x: distance.levenshtein(x['raw_value'], x['manual_raw_value']), axis=1)
    plt.hist(df['string diff'])
    plt.show()
Now I want to make a histogram showing three bins: the number of substitutions, the number of insertions and the number of deletions. How can I proceed?
Thank you
Thanks to the suggestions of @YohanesGultom, the answer to the problem can be found here:
http://www.nltk.org/_modules/nltk/metrics/distance.html
or
https://gist.github.com/kylebgorman/1081951
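As a sketch of what those links implement: the Levenshtein table can be backtraced to count each operation type separately, which gives exactly the three bins for the histogram (`edit_ops` is a hypothetical helper name; nltk.metrics.distance and the gist above provide equivalent machinery):

```python
def edit_ops(ref, hyp):
    """Count (substitutions, insertions, deletions) turning ref into hyp,
    via a standard Levenshtein DP table plus a backtrace."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    # walk back through the table and classify each edit
    subs = ins = dels = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, ins, dels

print(edit_ops("etoblissemenls", "etablissements"))  # (2, 0, 0)
print(edit_ops("snupape", "soupape"))                # (1, 0, 0)
```

Summing the three counters over all row pairs gives the data for a three-bar plt.bar chart.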
I am trying to get clean sentences from the Wikipedia page of a species.
For instance Abies durangensis (pid = 1268312). Using the Wikipedia API in Python to obtain the Wikipedia page:
import requests
pid = 1268312
q = {'action' : 'query',
'pageids': pid,
'prop' : 'revisions',
'rvprop' : 'content',
'format' : 'json'}
result = requests.get(eswiki_URI, params=q).json()
wikitext = result["query"]["pages"].values()[0]["revisions"][0]["*"]
gives:
{{Ficha de taxón
| name = ''Abies durangensis''
| image = Abies tamazula dgo.jpg
| status = LR/lc
| status_ref =<ref>Conifer Specialist Group 1998. [http://www.iucnredlist.org/search/details.php/42279/all ''Abies durangensis'']. [http://www.iucnredlist.org 2006 IUCN Red List of Threatened Species. ] Downloaded on 10 July 2007.</ref>
| regnum = [[Plantae]]
| divisio = [[Pinophyta]]
| classis = [[Pinopsida]]
| ordo = [[Pinales]]
| familia = [[Pinaceae]]
| genus = ''[[Abies]]''
| binomial = '''''Abies durangensis'''''
| binomial_authority = [[Maximino Martínez|Martínez]]<ref name=ipni>{{ cite web |url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal |title=Plant Name Details for ''Abies durangensis'' |publisher=[[International Plant Names Index|IPNI]] |accessdate=6 de octubre de 2009}}</ref>
| synonyms =
}}
'''''Abies durangensis''''' es una [[especie]] de [[conífera]] perteneciente a la familia [[Pinaceae]]. Son [[endémica]]s de [[México]] donde se encuentran en [[Durango]], [[Chihuahua]], [[Coahuila]], [[Jalisco]] y [[Sinaloa]]. También es conocido como 'Árbol de Coahuila' y 'pino mexicano'.<ref name=cje>{{ cite web |url=http://www.conifers.org/pi/ab/durangensis.htm |title=''Abies durangaensis'' description |author=Christopher J. Earle |date=11 de junio de 2006 |accessdate=6 de octubre de 2009}}</ref>
== Descripción ==
Es un [[árbol]] que alcanza los 40 metros de altura con un [[Tronco (botánica)|tronco]] recto que tiene 150 cm de diámetro. Las [[rama]]s son horizontales y la [[corteza (árbol)|corteza]] de color gris. Las [[hoja]]s son verde brillante de 20–35 mm de longitud por 1-1.5 mm de ancho. Tiene los conos de [[semilla]]s erectos en ramas laterales sobre un corto [[pedúnculo]]. Las [[semilla]]s son [[resina|resinosas]] con una [[núcula]] amarilla con alas.
== Taxonomía ==
''Abies durangensis'' fue descrita por [[Maximino Martínez]] y publicado en ''[[Anales del instituto de Biología de la Universidad Nacional de México]]'' 13: 2. 1942.<ref name = Trop>{{cita web |url=http://www.tropicos.org/Name/24901700 |título= ''{{PAGENAME}}''|fechaacceso=21 de enero de 2013 |formato= |obra= Tropicos.org. [[Missouri Botanical Garden]]}}</ref>
;[[Etimología]]:
'''''Abies''''': nombre genérico que viene del nombre [[latin]]o de ''[[Abies alba]]''.<ref>[http://www.calflora.net/botanicalnames/pageAB-AM.html En Nombres Botánicos]</ref>
'''''durangensis''''': [[epíteto]] geográfico que alude a su localización en [[Durango]].
;Variedades:
* ''Abies durangensis var. coahuilensis'' (I. M. Johnst.) Martínez
;[[sinonimia (biología)|Sinonimia]]:
* ''Abies durangensis subsp. neodurangensis'' (Debreczy, I.Rácz & R.M.Salazar) Silba'
* ''Abies neodurangensis'' Debreczy, I.Rácz & R.M.Salazar<ref>[http://www.theplantlist.org/tpl/record/kew-2609816 ''{{PAGENAME}}'' en PlantList]</ref><ref name = Kew>{{cita web|url=http://apps.kew.org/wcsp/namedetail.do?name_id=2609816 |título=''{{PAGENAME}}'' |work= World Checklist of Selected Plant Families}}</ref>
;''var. coahuilensis'' (I.M.Johnst.) Martínez
* ''Abies coahuilensis'' I.M.Johnst.
* ''Abies durangensis subsp. coahuilensis'' (I.M.Johnst.) Silba
== Véase también ==
* [[Terminología descriptiva de las plantas]]
* [[Anexo:Cronología de la botánica]]
* [[Historia de la Botánica]]
* [[Pinaceae#Descripción|Características de las pináceas]]
== Referencias ==
{{listaref}}
== Bibliografía ==
# CONABIO. 2009. Catálogo taxonómico de especies de México. 1. In Capital Nat. México. CONABIO, Mexico City.
== Enlaces externos ==
{{commonscat}}
{{wikispecies|Abies}}
* http://web.archive.org/web/http://ww.conifers.org/pi/ab/durangensis.htm
* http://www.catalogueoflife.org/search.php
[[Categoría:Abies|durangensis]]
[[Categoría:Plantas descritas en 1942]]
[[Categoría:Plantas descritas por Martínez]]
I am interested in the (unmarked) text just after the infobox, the gloss:
Abies durangensis es una especie de conífera perteneciente a la familia Pinaceae. Son endémicas de México donde se encuentran en Durango, Chihuahua, Coahuila, Jalisco y Sinaloa. También es conocido como 'Árbol de Coahuila' y 'pino mexicano'.
Until now I have consulted https://www.mediawiki.org/wiki/Alternative_parsers, where I found that mwparserfromhell is the least complicated parser in Python. However, I don't see clearly how to do what I intend. When I use the example proposed in the documentation, I just can't see where the gloss is.
for t in templates:
    print(t.name.encode('utf-8'))
    print(t.params)
Ficha de taxón
[u" name = ''Abies durangensis''\n", u' image = Abies tamazula dgo.jpg \n', u' status = LR/lc\n', u" status_ref =<ref>Conifer Specialist Group 1998. [http://www.iucnredlist.org/search/details.php/42279/all ''Abies durangensis'']. [http://www.iucnredlist.org 2006 IUCN Red List of Threatened Species. ] Downloaded on 10 July 2007.</ref>\n", u' regnum = [[Plantae]]\n', u' divisio = [[Pinophyta]]\n', u' classis = [[Pinopsida]]\n', u' ordo = [[Pinales]]\n', u' familia = [[Pinaceae]]\n', u" genus = ''[[Abies]]'' \n", u" binomial = '''''Abies durangensis'''''\n", u" binomial_authority = [[Maximino Mart\xednez|Mart\xednez]]<ref name=ipni>{{ cite web |url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal |title=Plant Name Details for ''Abies durangensis'' |publisher=[[International Plant Names Index|IPNI]] |accessdate=6 de octubre de 2009}}</ref>\n", u' synonyms = \n']
cite web
[u'url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal ', u"title=Plant Name Details for ''Abies durangensis'' ", u'publisher=[[International Plant Names Index|IPNI]] ', u'accessdate=6 de octubre de 2009']
cite web
[u'url=http://www.conifers.org/pi/ab/durangensis.htm ', u"title=''Abies durangaensis'' description ", u'author=Christopher J. Earle ', u'date=11 de junio de 2006 ', u'accessdate=6 de octubre de 2009']
cita web
[u'url=http://www.tropicos.org/Name/24901700 ', u"t\xedtulo= ''{{PAGENAME}}''", u'fechaacceso=21 de enero de 2013 ', u'formato= ', u'obra= Tropicos.org. [[Missouri Botanical Garden]]']
PAGENAME
[]
PAGENAME
[]
cita web
[u'url=http://apps.kew.org/wcsp/namedetail.do?name_id=2609816 ', u"t\xedtulo=''{{PAGENAME}}'' ", u'work= World Checklist of Selected Plant Families']
PAGENAME
[]
listaref
[]
commonscat
[]
wikispecies
[u'Abies']
Instead of torturing yourself with parsing something that isn't even expressible in a formal grammar, use the TextExtracts API:
https://es.wikipedia.org/w/api.php?action=query&prop=extracts&explaintext=1&titles=Abies%20durangensis&format=json
gives the following output:
Abies durangensis es una especie de conífera perteneciente a la
familia Pinaceae. Son endémicas de México donde se encuentran en
Durango, Chihuahua, Coahuila, Jalisco y Sinaloa. También es conocido
como 'Árbol de Coahuila' y 'pino mexicano'.
== Descripción ==
Es un árbol que alcanza los 40 metros de altura con un tronco recto que tiene 150 cm de diámetro. Las ramas son
horizontales y la corteza de color gris. Las hojas son verde brillante
de 20–35 mm de longitud por 1-1.5 mm de ancho. Tiene los conos de
semillas erectos en ramas laterales sobre un corto pedúnculo. Las
semillas son resinosas con una núcula amarilla con alas.
[...]
Append &exintro=1 to the URL if you need only the lead.
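For completeness, the query above can be constructed programmatically (a sketch using only the standard library; fetch the resulting URL with any HTTP client and read the extract field of the returned page object):

```python
from urllib.parse import urlencode

# Parameters of the TextExtracts call shown above; exintro=1 limits
# the plain-text extract to the lead section before the first heading.
params = {'action': 'query', 'prop': 'extracts', 'explaintext': 1,
          'exintro': 1, 'titles': 'Abies durangensis', 'format': 'json'}
url = 'https://es.wikipedia.org/w/api.php?' + urlencode(params)
print(url)
```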
You can use wikitextparser.
If wikitext holds your text string, all you need is:
import wikitextparser
parsed = wikitextparser.parse(wikitext)
Then you can get the plain-text portion of the whole page or of a particular section; for example, parsed.sections[1].plain_text() will give you the plain text of the second section of the page, which seems to be what you are looking for.