Regex to split Spanish lastnames in first lastname and second lastname

Regex to split Spanish lastnames in first lastname and second lastname - python

My goal is to write a python 3 function that takes lastnames as row from a csv and splits them correctly into lastname_1 and lastname_2.
Spanish names have the following structure: firstname + lastname_1 + lastname_2
Forgettin about the firstname, I would like a code that splits a lastname in these 2 categories (lastname_1, lastname_2) Which is the challenge?
Sometimes the lastname(s) have prepositions.
DE BLAS ZAPATA "de blas" ist the first lastname and "Zapata" the second lastname
MATIAS DE LA MANO "Matias" is lastname_1, "de la mano" lastname_2
LOPEZ FERNANDEZ DE VILLAVERDE Lopez Fernandez is lastname_1, de villaverda lastname_2
DE MIGUEL DEL CORRAL De Miguel is lastname_1, del corral lastname_2 More: VIDAL DE LA PEÑA SOLIS Vidal is surtname_1, de la pena solis surname_2
MONTAVA DEL ARCO Montava is surname_1 Del Arco surname_2
and the list could go on and on.
I am currently stuck and I found this code in perl but I struggle to understand the main idea behind it to translate it to python 3.
import re
preposition_lst = ['DE LO ', 'DE LA ', 'DE LAS ', 'DEL ', 'DELS ', 'DE LES ', 'DO ', 'DA ', 'DOS ', 'DAS', 'DE ']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PEÑA", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]
for case in cases:
for prep in preposition_lst:
m = re.search(f"(.*)({prep}[A-ZÀ-ÚÄ-Ü]+)", case, re.I) # re.I makes it case insensitive
try:
print(m.groups())
print(prep)
except:
pass

Try this one:
import re
preposition_lst = ['DE','LO','LA','LAS','DEL','DELS','LES','DO','DA','DOS','DAS']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PEÑA", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]
def last_name(name):
case=re.findall(r'\w+',name)
res=list(filter(lambda x: x not in preposition_lst,case))
return res
list_final =[]
for case in cases:
list_final.append(last_name(case))
for i in range(len(list_final)):
if len(list_final[i])>2:
name1=' '.join(list_final[i][:2])
name2=' '.join(list_final[i][2:])
list_final[i]=[name1,name2]
print(list_final)
#[['BLAS', 'ZAPATA'], ['MATIAS', 'MANO'], ['LOPEZ FERNANDEZ', 'VILLAVERDE'], ['MIGUEL', 'CORRAL'], ['VIDAL', 'PE�A'], ['MONTAVA', 'ARCO'], ['CASAS', 'VALLE']]

Does this suit your requirements?
import re
preposition_lst = ['DE LO', 'DE LA', 'DE LAS', 'DE LES', 'DEL', 'DELS', 'DO', 'DA', 'DOS', 'DAS', 'DE']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PENA SOLIS", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]
def split_name(name):
f1 = re.compile("(.*)({preps}(.+))".format(preps = "(" + " |".join(preposition_lst) + ")"))
m1 = f1.match(case)
if m1:
if len(m1.group(1)) != 0:
return m1.group(1).strip(), m1.group(2).strip()
else:
return " ".join(name.split()[:-1]), name.split()[-1]
else:
return " ".join(name.split()[:-1]), name.split()[-1]
for case in cases:
first, second = split_name(case)
print("{} --> name 1 = {}, name 2 = {}".format(case, first, second))
# DE BLAS ZAPATA --> name 1 = DE BLAS, name 2 = ZAPATA
# MATIAS DE LA MANO --> name 1 = MATIAS, name 2 = DE LA MANO
# LOPEZ FERNANDEZ DE VILLAVERDE --> name 1 = LOPEZ FERNANDEZ, name 2 = DE VILLAVERDE
# DE MIGUEL DEL CORRAL --> name 1 = DE MIGUEL, name 2 = DEL CORRAL
# VIDAL DE LA PENA SOLIS --> name 1 = VIDAL, name 2 = DE LA PENA SOLIS
# MONTAVA DEL ARCO --> name 1 = MONTAVA, name 2 = DEL ARCO
# DOS CASAS VALLE --> name 1 = DOS CASAS, name 2 = VALLE

Related

Replace all occurrences of a word with another specific word that must appear somewhere in the sentence before that word

import re
#example 1
input_text = "((PERSON)María Rosa) ((VERB)pasará) unos dias aqui, hay que ((VERB)mover) sus cosas viejas de aqui, ya que sus cosméticos ((VERB)estorban) si ((VERB)estan) tirados por aquí. ((PERSON)Cyntia) es una buena modelo, su cabello es muy bello, hay que ((VERB)lavar) su cabello"
#example 2
input_text = "Sus útiles escolares ((VERB)estan) aqui, me sorprende que ((PERSON)Juan Carlos) los haya olvidado siendo que suele ((VERB)ser) tan cuidadoso con sus útiles."
#I need replace "sus" or "su" but under certain conditions
subject_capture_pattern = r"\(\(PERSON\)((?:\w\s*)+)\)" #underlined in red in the image
associated_info_capture_pattern = r"(?:sus|su)\s+((?:\w\s*)+)(?:\s+(?:del|de )|\s*(?:\(\(VERB\)|[.,;]))" #underlined in green in the image
identification_pattern =
replacement_sequence =
input_text = re.sub(identification_pattern, replacement_sequence, input_text, flags = re.IGNORECASE)
this is the correct output:
#for example 1
"((PERSON)María Rosa) ((VERB)pasará) unos dias aqui, hay que ((VERB)mover) cosas viejas ((CONTEXT) de María Rosa) de aqui, ya que cosméticos ((CONTEXT) de María Rosa) ((VERB)estorban) si ((VERB)estan) tirados por aquí. ((PERSON)Cyntia) es una buena modelo, cabello ((CONTEXT) de Cyntia) ((VERB)es) muy bello, hay que ((VERB)lavar) cabello ((CONTEXT) de Cyntia)"
#for example 2
"útiles escolares ((CONTEXT) NO DATA) ((VERB)estan) aqui, me sorprende que ((PERSON)Juan Carlos) los haya olvidado siendo que suele ((VERB)ser) tan cuidadoso con útiles ((CONTEXT) Juan Carlos)."
Details:
Replace the possessive pronouns "sus" or "su" with "de " + the content inside the last ((PERSON) "THIS SUBSTRING"), and if there is no ((PERSON) "THIS SUBSTRING") before then replace sus or su with ((PERSON) NO DATA)
Sentences are read from left to right, so the replacement will be the substring inside the parentheses ((PERSON)the substring) before that "sus" or "su", as shown in the example.
In the end, the replaced substrings should end up with this structure:
associated_info_capture_pattern + "((CONTEXT)" + subject_capture_pattern + ")"

This shows a way to do the replacement of su/sus like you asked for (albeit not with just a single re.sub). I didn't move the additional info, but you could modify it to handle that as well.
import re
subject_capture_pattern = r"\(\(PERSON\)((?:\w\s*)+)\)"
def replace_su_and_sus(input_text):
start = 0
replacement = "((PERSON) NO DATA)"
output_text = ""
for m in re.finditer(subject_capture_pattern, input_text):
output_text += re.sub(r"\b[Ss]us?\b", replacement, input_text[start:m.end()])
start = m.end()
replacement = m.group(0).replace("(PERSON)", "(CONTEXT) de ")
output_text += re.sub(r"\b[Ss]us?\b", replacement, input_text[start:])
return output_text
My strategy was:
Up until the first subject capture, replace su/sus with "NO DATA"
Up until the second subject capture, replace su/sus with the name from the first capture
Proceed similarly for each subsequent subject capture
Finally, replace any su/sus between the last subject capture and the end of the string

Error when using treetagger : list index out of range

I am using treetagger to extract lemma of word. I have a function which do that but for some words it is giving list out range error :
def treetagger_process(texte):
''' process le texte et renvoie un dictionnaire '''
d_tag = {}
for line in texte:
newvalues = tagger.tag_text(line)
d_tag[line] = newvalues
#print(d_tag)
#lemmatisation
d_lemma = defaultdict(list)
d_lemma_1 = defaultdict(list)
for k, v in d_tag.items():
print(k)
for p in v:
#print(p)
parts = p.split('\t')
print(parts)
forme=parts[0]
cat=parts[1]
lemme = parts[2] # un seul mot
print(lemme)
try:
if lemme =='':
d_lemma[k].append(forme)
elif "|" in lemme :
print(lemme)
lemme_double = lemme.split("|")
lemme_final = lemme_double[1]
print(lemme_final)
d_lemma[k].append(lemme_final)
else:
d_lemma[k].append(lemme)
except:
print(parts)
for k, v in d_lemma.items():
d_lemma_1[k]=" ".join(v)
# Suppression de l'espace
print(d_lemma_1)
return d_lemma_1
Is there a way to overcome that error, it seems the word "dns-remplacé" is causing problem but I requested that if "|" is not find , the word is automatically send to the dictionnary.
error :
...
L'idée de départ est amusante mais elle est très mal exploitée en raison notamment d'une mise en scène paresseuse , d'une intrigue à laquelle on ne croit pas une seconde et d'acteurs qui semblent à l'abandon .
["L'", 'DET:ART', 'le']
le
['idée', 'NOM', 'idée']
idée
['de', 'PRP', 'de']
de
['départ', 'NOM', 'départ']
départ
['est', 'VER:pres', 'être']
être
['amusante', 'ADJ', 'amusant']
amusant
['mais', 'KON', 'mais']
mais
['elle', 'PRO:PER', 'elle']
elle
['est', 'VER:pres', 'être']
être
['très', 'ADV', 'très']
très
['mal', 'ADV', 'mal']
mal
['exploitée', 'VER:pper', 'exploiter']
exploiter
['en', 'PRP', 'en']
en
['raison', 'NOM', 'raison']
raison
['notamment', 'ADV', 'notamment']
notamment
["d'", 'PRP', 'de']
de
['une', 'DET:ART', 'un']
un
['mise', 'NOM', 'mise']
mise
['en', 'PRP', 'en']
en
['scène', 'NOM', 'scène']
scène
['paresseuse', 'ADJ', 'paresseux']
paresseux
[',', 'PUN', ',']
,
["d'", 'PRP', 'de']
de
['une', 'DET:ART', 'un']
un
['intrigue', 'NOM', 'intrigue']
intrigue
['à', 'PRP', 'à']
à
['laquelle', 'PRO:REL', 'lequel']
lequel
['on', 'PRO:PER', 'on']
on
['ne', 'ADV', 'ne']
ne
['croit', 'VER:pres', 'croire']
croire
['pas', 'ADV', 'pas']
pas
['une', 'DET:ART', 'un']
un
['seconde', 'NUM', 'second']
second
['et', 'KON', 'et']
et
["d'", 'PRP', 'de']
de
['acteurs', 'NOM', 'acteur']
acteur
['qui', 'PRO:REL', 'qui']
qui
['semblent', 'VER:pres', 'sembler']
sembler
['à', 'PRP', 'à']
à
["l'", 'DET:ART', 'le']
le
['abandon', 'NOM', 'abandon']
abandon
['.', 'SENT', '.']
.
"Faux cul ed wright dit que son film est un hommage aux buddy movies hollywoodiens et nottament bad boys.aussi se permet il de parodier des scènes de ce """"chef d'oeuvre""""au tics de caméra prés,et si on le savait pas déjà on se rend compte à quel point michael bay est un tacheron;Hot Fuzz ne se contente pas d'être une parodie de plus à la """"y'a t-il un flic..."
['"', 'PUN:cit', '"']
"
['Faux', 'ADJ', 'faux']
faux
['cul', 'NOM', 'cul']
cul
['ed', 'NOM', 'ed']
ed
['wright', 'NOM', 'wright']
wright
['dit', 'VER:pper', 'dire']
dire
['que', 'KON', 'que']
que
['son', 'DET:POS', 'son']
son
['film', 'NOM', 'film']
film
['est', 'VER:pres', 'être']
être
['un', 'DET:ART', 'un']
un
['hommage', 'NOM', 'hommage']
hommage
['aux', 'PRP:det', 'au']
au
['buddy', 'NOM', 'buddy']
buddy
['movies', 'NOM', 'movies']
movies
['hollywoodiens', 'ADJ', 'hollywoodien']
hollywoodien
['et', 'KON', 'et']
et
['nottament', 'VER:pres', 'nottament']
nottament
['bad', 'NOM', 'bad']
bad
['dns-remplacé', 'ADJ', 'dns-remplacé']
dns-remplacé
['<repdns text="boys.aussi" />']
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-25-a8bd3e42b86e> in <module>
21 #l_1_dict = spacy_process(l_1)
22 #l_1_dict = spacy_process_treet(l_1)
---> 23 l_1_dict = treetagger_process(l_1)
24
25 print(type(l_1_dict))
<ipython-input-24-20ed242ad8de> in treetagger_process(texte)
18 parts = p.split('\t')
19 forme=parts[0]
---> 20 cat=parts[1]
21 lemme = parts[2] # un seul mot
22 print(lemme)
IndexError: list index out of range
Example of texte :

Youre getting the list index out of range error because youre splitting on an object for \t and expecting it to return an array of 3 components. the dns-substitute seems to only have 1 object returned in that list.
in this example you have 3 objects:
bad
['dns-remplacé', 'ADJ', 'dns-remplacé']
here you only have 1 object:
dns-remplacé
['<repdns text="boys.aussi" />']
i would recommend you check that splitting on '\t' returns a list of size 3 and if it doesnt then you need to handle that scenario accordingly
parts = p.split('\t')
if len(parts) == 3:
#do normal processing
else:
#handle this situation

Python .split() having trouble with txt line

I am trying to split lines in a txt file but having the following problem:
with open("KARDEX.txt", 'r', encoding="latin-1") as file:
data = []
for line in file:
data.append(line)
print(data[0])
>>ÿþ# Número de artículo Descripción del artículo Clase de operación Código de deudor/acreedor Nombre de deudor/acreedor
d = print(data[0].split(" "))
>> ['ÿþ#\x00', '\x00N\x00ú\x00m\x00e\x00r\x00o\x00 \x00d\x00e\x00 \x00a\x00r\x00t\x00í\x00c\x00u\x00l\x00o\x00', '\x00D\x00e\x00s\x00c\x00r\x00i\x00p\x00c\x00i\x00ó\x00n\x00 \x00d\x00e\x00l\x00 \x00a\x00r\x00t\x00í\x00c\x00u\x00l\x00o\x00', '\x00C\x00l\x00a\x00s\x00e\x00 \x00d\x00e\x00 \x00o\x00p\x00e\x00r\x00a\x00c\x00i\x00ó\x00n\x00']

If you want to split by 2 or more spaces then use:
re.split(r" {2,}", st)
st:
st = "ÿþ# Número de artículo Descripción del artículo Clase de operación Código de deudor/acreedor Nombre de deudor/acreedor"
re.split(r" {2,}", st)
Output:
['ÿþ#',
'Número de artículo',
'Descripción del artículo',
'Clase de operación',
'Código de deudor/acreedor',
'Nombre de deudor/acreedor']

How to remove newline from end and beginning of every list element [python]

I've got the following code:
from itemspecs import itemspecs
x = itemspecs.split('###')
res = []
for item in x:
res.append(item.split('##'))
print(res)
This imports a string from another document. And gives me the following:
[['\n1\nWie was de Nederlandse scheepvaarder die de Spaanse zilvervloot veroverde?\n',
'\nA. Michiel de Ruyter\nB. Piet Heijn\nC. De zilvervloot is nooit door de Nederlanders
onderschept\n', '\nB\n', '\nAntwoord B De Nederlandse vlootvoogd werd hierdoor bekend.\n'],
['\n2\nIn welk land ligt Upernavik?\n', '\nA. Antartica\nB. Canada\nC. Rusland\nD.
Groenland\nE. Amerika\n', '\nD\n', '\nAntwoord D Het is een dorp in Groenland met 1224
inwoners.\n']]
But now I want to remove all the \n from every end and beginning of every element in this list. How can I do this?

You can do that by stripping "\n" as follow
a = "\nhello\n"
stripped_a = a.strip("\n")
so, what you need to do is iterate through the list and then apply the strip on the string as shown below
res_1=[]
for i in res:
tmp=[]
for j in i:
tmp.append(j.strip("\n"))
res_1.append(tmp)
The above answer just removes \n from start and end. if you want to remove all the new lines in a string, just use .replace('\n"," ") as shown below
res_1=[]
for i in res:
tmp=[]
for j in i:
tmp.append(j.replace("\n"))
res_1.append(tmp)

strip() is easiest method in this case, but if you want to do any advanced text processing in the future, it isn't bad idea to learn regular expressions:
import re
from pprint import pprint
l = [['\n1\nWie was de Nederlandse scheepvaarder die de Spaanse zilvervloot veroverde?\n',
'\nA. Michiel de Ruyter\nB. Piet Heijn\nC. De zilvervloot is nooit door de Nederlandersonderschept\n',
'\nB\n',
'\nAntwoord B De Nederlandse vlootvoogd werd hierdoor bekend.\n'],
['\n2\nIn welk land ligt Upernavik?\n',
'\nA. Antartica\nB. Canada\nC. Rusland\nD.Groenland\nE. Amerika\n',
'\nD\n', '\nAntwoord D Het is een dorp in Groenland met 1224inwoners.\n']]
l = [[re.sub(r'^([\s]*)|([\s]*)$', '', j)] for i in l for j in i]
pprint(l, width=120)
Output:
[['1\nWie was de Nederlandse scheepvaarder die de Spaanse zilvervloot veroverde?'],
['A. Michiel de Ruyter\nB. Piet Heijn\nC. De zilvervloot is nooit door de Nederlandersonderschept'],
['B'],
['Antwoord B De Nederlandse vlootvoogd werd hierdoor bekend.'],
['2\nIn welk land ligt Upernavik?'],
['A. Antartica\nB. Canada\nC. Rusland\nD.Groenland\nE. Amerika'],
['D'],
['Antwoord D Het is een dorp in Groenland met 1224inwoners.']]

How to get clean text from MediaWiki markup format using mwparserfromhell or a simple parser in python?

I am trying to get clean sentences from the Wikipedia page of a species.
For instance Abeis durangensis (pid = 1268312). Using the Wikipedia API in python to obtain the Wikipedia page:
import requests
pid = 1268312
q = {'action' : 'query',
'pageids': pid,
'prop' : 'revisions',
'rvprop' : 'content',
'format' : 'json'}
result = requests.get(eswiki_URI, params=q).json()
wikitext = result["query"]["pages"].values()[0]["revisions"][0]["*"]
gives:
{{Ficha de taxón
| name = ''Abies durangensis''
| image = Abies tamazula dgo.jpg
| status = LR/lc
| status_ref =<ref>Conifer Specialist Group 1998. [http://www.iucnredlist.org/search/details.php/42279/all ''Abies durangensis'']. [http://www.iucnredlist.org 2006 IUCN Red List of Threatened Species. ] Downloaded on 10 July 2007.</ref>
| regnum = [[Plantae]]
| divisio = [[Pinophyta]]
| classis = [[Pinopsida]]
| ordo = [[Pinales]]
| familia = [[Pinaceae]]
| genus = ''[[Abies]]''
| binomial = '''''Abies durangensis'''''
| binomial_authority = [[Maximino Martínez|Martínez]]<ref name=ipni>{{ cite web |url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal |title=Plant Name Details for ''Abies durangensis'' |publisher=[[International Plant Names Index|IPNI]] |accessdate=6 de octubre de 2009}}</ref>
| synonyms =
}}
'''''Abies durangensis''''' es una [[especie]] de [[conífera]] perteneciente a la familia [[Pinaceae]]. Son [[endémica]]s de [[México]] donde se encuentran en [[Durango]], [[Chihuahua]], [[Coahuila]], [[Jalisco]] y [[Sinaloa]]. También es conocido como 'Árbol de Coahuila' y 'pino mexicano'.<ref name=cje>{{ cite web |url=http://www.conifers.org/pi/ab/durangensis.htm |title=''Abies durangaensis'' description |author=Christopher J. Earle |date=11 de junio de 2006 |accessdate=6 de octubre de 2009}}</ref>
== Descripción ==
Es un [[árbol]] que alcanza los 40 metros de altura con un [[Tronco (botánica)|tronco]] recto que tiene 150 cm de diámetro. Las [[rama]]s son horizontales y la [[corteza (árbol)|corteza]] de color gris. Las [[hoja]]s son verde brillante de 20–35 mm de longitud por 1-1.5 mm de ancho. Tiene los conos de [[semilla]]s erectos en ramas laterales sobre un corto [[pedúnculo]]. Las [[semilla]]s son [[resina|resinosas]] con una [[núcula]] amarilla con alas.
== Taxonomía ==
''Abies durangensis'' fue descrita por [[Maximino Martínez]] y publicado en ''[[Anales del instituto de Biología de la Universidad Nacional de México]]'' 13: 2. 1942.<ref name = Trop>{{cita web |url=http://www.tropicos.org/Name/24901700 |título= ''{{PAGENAME}}''|fechaacceso=21 de enero de 2013 |formato= |obra= Tropicos.org. [[Missouri Botanical Garden]]}}</ref>
;[[Etimología]]:
'''''Abies''''': nombre genérico que viene del nombre [[latin]]o de ''[[Abies alba]]''.<ref>[http://www.calflora.net/botanicalnames/pageAB-AM.html En Nombres Botánicos]</ref>
'''''durangensis''''': [[epíteto]] geográfico que alude a su localización en [[Durango]].
;Variedades:
* ''Abies durangensis var. coahuilensis'' (I. M. Johnst.) Martínez
;[[sinonimia (biología)|Sinonimia]]:
* ''Abies durangensis subsp. neodurangensis'' (Debreczy, I.Rácz & R.M.Salazar) Silba'
* ''Abies neodurangensis'' Debreczy, I.Rácz & R.M.Salazar<ref>[http://www.theplantlist.org/tpl/record/kew-2609816 ''{{PAGENAME}}'' en PlantList]</ref><ref name = Kew>{{cita web|url=http://apps.kew.org/wcsp/namedetail.do?name_id=2609816 |título=''{{PAGENAME}}'' |work= World Checklist of Selected Plant Families}}</ref>
;''var. coahuilensis'' (I.M.Johnst.) Martínez
* ''Abies coahuilensis'' I.M.Johnst.
* ''Abies durangensis subsp. coahuilensis'' (I.M.Johnst.) Silba
== Véase también ==
* [[Terminología descriptiva de las plantas]]
* [[Anexo:Cronología de la botánica]]
* [[Historia de la Botánica]]
* [[Pinaceae#Descripción|Características de las pináceas]]
== Referencias ==
{{listaref}}
== Bibliografía ==
# CONABIO. 2009. Catálogo taxonómico de especies de México. 1. In Capital Nat. México. CONABIO, Mexico City.
== Enlaces externos ==
{{commonscat}}
{{wikispecies|Abies}}
* http://web.archive.org/web/http://ww.conifers.org/pi/ab/durangensis.htm
* http://www.catalogueoflife.org/search.php
[[Categoría:Abies|durangensis]]
[[Categoría:Plantas descritas en 1942]]
[[Categoría:Plantas descritas por Martínez]]
I am interested in the (unmarked) text just after the infobox, the gloss:
Abies durangensis es una especie de conífera perteneciente a la familia Pinaceae. Son endémicas de México donde se encuentran en Durango, Chihuahua, Coahuila, Jalisco y Sinaloa. También es conocido como 'Árbol de Coahuila' y 'pino mexicano'.
Until now i consulted https://www.mediawiki.org/wiki/Alternative_parsers so i found that mwparserfromhell is the less complicated parser in python. However, i dont see clearly how to do what i pretend. When i use the example proposed in the documentation i just can't see where the gloss is.
for t in templates:
print(t.name).encode('utf-8')
print(t.params)
Ficha de taxón
[u" name = ''Abies durangensis''\n", u' image = Abies tamazula dgo.jpg \n', u' status = LR/lc\n', u" status_ref =<ref>Conifer Specialist Group 1998. [http://www.iucnredlist.org/search/details.php/42279/all ''Abies durangensis'']. [http://www.iucnredlist.org 2006 IUCN Red List of Threatened Species. ] Downloaded on 10 July 2007.</ref>\n", u' regnum = [[Plantae]]\n', u' divisio = [[Pinophyta]]\n', u' classis = [[Pinopsida]]\n', u' ordo = [[Pinales]]\n', u' familia = [[Pinaceae]]\n', u" genus = ''[[Abies]]'' \n", u" binomial = '''''Abies durangensis'''''\n", u" binomial_authority = [[Maximino Mart\xednez|Mart\xednez]]<ref name=ipni>{{ cite web |url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal |title=Plant Name Details for ''Abies durangensis'' |publisher=[[International Plant Names Index|IPNI]] |accessdate=6 de octubre de 2009}}</ref>\n", u' synonyms = \n']
cite web
[u'url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal ', u"title=Plant Name Details for ''Abies durangensis'' ", u'publisher=[[International Plant Names Index|IPNI]] ', u'accessdate=6 de octubre de 2009']
cite web
[u'url=http://www.conifers.org/pi/ab/durangensis.htm ', u"title=''Abies durangaensis'' description ", u'author=Christopher J. Earle ', u'date=11 de junio de 2006 ', u'accessdate=6 de octubre de 2009']
cita web
[u'url=http://www.tropicos.org/Name/24901700 ', u"t\xedtulo= ''{{PAGENAME}}''", u'fechaacceso=21 de enero de 2013 ', u'formato= ', u'obra= Tropicos.org. [[Missouri Botanical Garden]]']
PAGENAME
[]
PAGENAME
[]
cita web
[u'url=http://apps.kew.org/wcsp/namedetail.do?name_id=2609816 ', u"t\xedtulo=''{{PAGENAME}}'' ", u'work= World Checklist of Selected Plant Families']
PAGENAME
[]
listaref
[]
commonscat
[]
wikispecies
[u'Abies']

Instead of torturing yourself with parsing of something that's not even expressable in formal grammar, use the TextExtracts API:
https://es.wikipedia.org/w/api.php?action=query&prop=extracts&explaintext=1&titles=Abies%20durangensis&format=json
gives the following output:
Abies durangensis es una especie de conífera perteneciente a la
familia Pinaceae. Son endémicas de México donde se encuentran en
Durango, Chihuahua, Coahuila, Jalisco y Sinaloa. También es conocido
como 'Árbol de Coahuila' y 'pino mexicano'.
== Descripción ==
Es un árbol que alcanza los 40 metros de altura con un tronco recto que tiene 150 cm de diámetro. Las ramas son
horizontales y la corteza de color gris. Las hojas son verde brillante
de 20–35 mm de longitud por 1-1.5 mm de ancho. Tiene los conos de
semillas erectos en ramas laterales sobre un corto pedúnculo. Las
semillas son resinosas con una núcula amarilla con alas.
[...]
Append &exintro=1 to the URL if you need only the lead.

You can use wikitextparser
If wikitext has your text string, all you need is:
import wikitextparser
parsed = wikitextparser.parse(wikitext)
then you can get the plain text portion of the whole page or a particular section, for example: parsed.sections[1].plain_text() will give you the plain text of the second section of the page which seems to be what you are looking for.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regex to split Spanish lastnames in first lastname and second lastname - python

Related

Replace all occurrences of a word with another specific word that must appear somewhere in the sentence before that word

Error when using treetagger : list index out of range

Python .split() having trouble with txt line

How to remove newline from end and beginning of every list element [python]

How to get clean text from MediaWiki markup format using mwparserfromhell or a simple parser in python?

Categories

Resources