I wrote a script to extract, from a huge set of sentences, the ones that contain particular patterns. The problem is that, for some patterns, I check the value of an attribute at the beginning or end of the match to see whether the word is present in a particular list. I have 4 dictionaries, each with two lists of positive and negative words. So far the function I wrote works with one dictionary. I am wondering how I can improve it so that I can use all 4 dictionaries at the same time without duplicating the block that loops over the dictionary.
Here is an example with two dictionaries (since the script is quite long, I made a small example with all the necessary elements):
import os
import pandas as pd
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("fr_core_news_md")
##################### Lexicon lists
# Diko lexicon
data = pd.read_csv('/h/Ressources/Diko.txt', sep=";", header=None, encoding='utf-8')
data.columns = ["id", "terme", "pol"]
pol_diko_pos = data.loc[data.pol == 'positive', 'terme']
liste_pos_D = list(pol_diko_pos)
print(liste_pos_D[1])
pol_diko_neg = data.loc[data.pol == 'negative', 'terme']
liste_neg_D = list(pol_diko_neg)
# Polarimots lexicon
data_p = pd.read_csv('/h/Ressources/polarimots.txt', sep="\t", header=None, encoding='utf-8')
data_p.columns = ["ind", "terme", "cat", "pol", "fiabilité"]
pol_polarimot_pos = data_p.loc[data_p.pol == 'POS', 'terme']
liste_pos_P = list(pol_polarimot_pos)
print(liste_pos_P[1])
pol_polarimot_neg = data_p.loc[data_p.pol == 'NEG', 'terme']
liste_neg_P = list(pol_polarimot_neg)
############################# Result lists
sentence_not_extract_lexique_1 = []  # sentences whose matched token is absent from the first lexicon
sentence_extract_lexique_1 = []      # sentences whose matched token is present in the first lexicon
sentence_not_extract_lexique_2 = []  # sentences whose matched token is absent from the second lexicon
sentence_extract_lexique_2 = []      # sentences whose matched token is present in the second lexicon
list_token_pos = []        # tokens found in the positive list of the lexicon
list_token_neg = []        # tokens found in the negative list of the lexicon
list_token_not_found = []  # tokens not found in the lexicon
#PATTERN
pattern1 = [{"POS": {"IN": ["VERB", "AUX","ADV","NOUN","ADJ"]}}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "mais"} ]
pattern1_tup = (pattern1, 1, True)
pattern3 = [{"LOWER": {"IN": ["très", "trop"]}},
            {"POS": {"IN": ["ADV", "ADJ"]}}]
pattern3_tup = (pattern3, 0, True)
pattern4 = [{"POS": "ADV"},  # negation adverb
            {"POS": "PRON", "OP": "*"},
            {"POS": {"IN": ["VERB", "AUX"]}},
            {"TEXT": {"IN": ["pas", "plus", "aucun", "aucunement", "point", "jamais", "nullement", "rien"]}}]
pattern4_tup = (pattern4, None, False)
# Tuples of patterns
pattern_list_tup = [pattern1_tup, pattern3_tup, pattern4_tup]
pattern_name = ['first', 'second', 'third', 'fourth']
length_of_list = len(pattern_list_tup)
print('length', length_of_list)
# index (in the matched span) of the token to check in the lexicon
value_of_attribute = [0, -1, -1]
# Lists of lexicons to use: index 0 = negative list, index 1 = positive list
lexique_1 = [liste_neg_D, liste_pos_D]
lexique_2 = [liste_neg_P, liste_pos_P]
# text (example of some sentences)
file = ["Le film est superbe mais cette édition DVD est nulle !",
"J'allais dire déplorable, mais je serais peut-être un peu trop extrême.",
"Hélas, l'impression de violence, bien que très bien rendue, ne sauve pas cette histoire gothique moderne de la sécheresse scénaristique, le tout couvert d'un adultère dont le propos semble être gratuit, classique mais intéressant...",
"Tout ça ne me donne pas envie d'utiliser un pieu mais plutôt d'aller au pieu (suis-je drôle).",
"Oui biensur, il y a la superbe introduction des parapluies au debut, et puis lorsqu il sent des culs tout neufs et qu il s extase, j ai envie de faire la meme chose apres sur celui de ma voisine de palier (ma voisine de palier elle a un gros cul, mais j admets que je voudrais bien lui foute mon tarin), mais c est tout, apres c est un film tres noir, lent et qui te plonge dans le depression.",
"Et bien hélas ce DVD ne m'a pas appris grand chose par rapport à la doc des agences de voyages et la petite dame qui fait ses dessins est bien gentille mais tout tourne un peu trop autour d'elle.",
"Au final on passe de l'un a l'autre sans subtilité, et on n'arrive qu'à une caricature de plus : si Kechiche avait comme but initial déclaré de fustiger les préjugés, c'est le contraire qui ressort de ce ''film'' truffé de clichés très préjudiciables pour les quelques habitants de banlieue qui ne se reconnaîtront pas dans cette lourde farce.",
"-ci écorche les mots, les notes... mais surtout nos oreilles !"]
# Loop to check each sentence and extract the sentences with the specified pattern from above
for pat in range(0, length_of_list):
    matcher = Matcher(nlp.vocab)
    matcher.add("matching_2", None, pattern_list_tup[pat][0])
    for sent in file:
        doc = nlp(sent)
        matches = matcher(doc)
        for match_id, start, end in matches:
            span = doc[start:end].lemma_.split()
            #print(f"{pattern_name[pat]} pattern found: {span}")
This is the part I want to modify so that I can use it with another dictionary. The goal is to be able to retrieve the sentences extracted by 4 different dictionaries, compare them, and check which sentences are present in more than two lists.
            # Condition to use the lexicon and extract the sentence
            if pattern_list_tup[pat][2]:
                if span[value_of_attribute[pat]] in lexique_1[pattern_list_tup[pat][1]]:
                    if sent not in sentence_extract_lexique_1:
                        sentence_extract_lexique_1.append(sent)
                    if pattern_list_tup[pat][1] == 1:
                        list_token_pos.append(span[value_of_attribute[pat]])
                    if pattern_list_tup[pat][1] == 0:
                        list_token_neg.append(span[value_of_attribute[pat]])
                else:
                    list_token_not_found.append(span[value_of_attribute[pat]])  # the text form is not in the lexicon, the lemma form is needed
                    sentence_not_extract_lexique_1.append(sent)
            else:
                if sent not in sentence_extract_lexique_1:
                    sentence_extract_lexique_1.append(sent)
print(len(sentence_extract_lexique_1))
print(sentence_extract_lexique_1)
One solution I found is to duplicate the code above and change the name of the list where the sentences are stored, but since I have 2 dictionaries (actually 4 in the original), duplicating makes the code much longer. Is there a way to combine the looping over the dictionaries and append each result to the right list? For example, when I use lexique_1, all the extracted sentences are sent to "sentence_extract_lexique_1", and so on for the others.
In my opinion, try an if-elif-else chain, or just an if-elif block if the elif catches the specific condition of interest, since you're trying to catch one specific case to compare and check against the sentences. Keep in mind that the if-elif-else chain is a good method, but it only works when you need a single test to pass: once Python finds a test that passes, it skips the rest. That makes it efficient and lets you test for one specific condition.
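To make the routing concrete, here is a minimal, self-contained sketch of the idea the question asks about: pair each lexicon with its own result list, so that one block serves all of them. The sentences, lexicons, and names (lexiques, extracted, not_extracted) are toy stand-ins for the real script, and the last word of each sentence stands in for span[value_of_attribute[pat]]:

sentences = ["le film est superbe", "cette édition DVD est nulle"]
lexiques = {
    "lexique_1": {"superbe", "nulle"},  # toy stand-in for [liste_neg_D, liste_pos_D]
    "lexique_2": {"superbe"},           # toy stand-in for [liste_neg_P, liste_pos_P]
}
extracted = {name: [] for name in lexiques}      # plays the role of sentence_extract_lexique_N
not_extracted = {name: [] for name in lexiques}  # plays the role of sentence_not_extract_lexique_N

for sent in sentences:
    token = sent.split()[-1]  # stand-in for the matched span token
    for name, lexique in lexiques.items():
        if token in lexique:
            if sent not in extracted[name]:
                extracted[name].append(sent)
        else:
            not_extracted[name].append(sent)

# Sentences present in more than one result list can then be compared:
print(set(extracted["lexique_1"]) & set(extracted["lexique_2"]))

With 4 dictionaries, the only change is adding two more entries to lexiques; nothing in the loop is duplicated.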
This is my code:
import pandas as pd
import os
import glob as g

archivos = g.glob(r'C:\Users\Desktop\*.csv')
for archiv in archivos:
    nombre = os.path.splitext(archiv)[0]
    df = pd.read_csv(archiv, sep=",")
    d = pd.to_datetime(df['DATA_LEITURA'], format="%Y%m%d")
    df['FECHA_LECTURA'] = d.dt.date
    del df['DATA_LEITURA']
    df['CONSUMO'] = ""
    df['DIAS'] = ""
    df["SUMDIAS"] = ""
    df["SUMCONS"] = ""
    df["CONSANUAL"] = ""
    ordenado = df.sort_values(['NR_CPE', 'FECHA_LECTURA', 'HORA_LEITURA'], ascending=True)
    # Group by CPE
    agrupado = ordenado.groupby('NR_CPE')
    for name, group in agrupado:  # iterate over the groups
        indice = group.index.values
        inicio = indice[0]
        fin = indice[-1]
        # Fill the first reading of each CPE with zero (there is no previous reading)
        ordenado.CONSUMO.loc[inicio] = 0
        ordenado.DIAS.loc[inicio] = 0
        cont = 0
        for i in indice:  # iterate over the readings inside each CPE group
            if i > inicio and i <= fin:
                cont = cont + 1
                consumo = ordenado.VALOR_LEITURA[indice[cont]] - ordenado.VALOR_LEITURA[indice[cont - 1]]
                dias = (ordenado.FECHA_LECTURA[indice[cont]] - ordenado.FECHA_LECTURA[indice[cont - 1]]).days
                ordenado.CONSUMO.loc[i] = consumo
                ordenado.DIAS.loc[i] = dias
    # Do the sums; each result is a Series
    dias = agrupado['DIAS'].sum()
    consu = agrupado['CONSUMO'].sum()
    canu = (consu / dias) * 365
    # Counters for the number of occurrences of groups A, B and C
    conta = 0
    contb = 0
    contc = 0
    # These are Series, so iterate over them to do the comparison
    print "Groups:"
    for ind, sumdias in dias.iteritems():
        if sumdias <= 180:
            grupo = "A"
            conta = conta + 1
        elif sumdias > 180 and sumdias <= 365:
            grupo = "B"
            contb = contb + 1
        elif sumdias > 365:
            grupo = "C"
            contc = contc + 1
    print "group A: ", conta
    print "group B: ", contb
    print "group C: ", contc
    # Format the fields so that not all decimals are shown
    Fdias = dias.map('{:.0f}'.format)
    Fcanu = canu.map('{:.2f}'.format)
    frames = [Fdias, consu, Fcanu]
    concat = pd.concat(frames, axis=1).replace(['inf', 'nan'], [0, 0])
    with open(r'C:\Users\Documents\RPE_PORTUGAL\Datos.csv', 'a') as f:
        concat.to_csv(f, header=False, columns=['CPE', 'DIAS', 'CONSUMO', 'CONSUMO_ANUAL'])
    try:
        ordenado.to_excel(nombre + '.xls', columns=["NOME_DISTRITO",
            "NR_CPE", "MARCA_EQUIPAMENTO", "NR_EQUIPAMENTO", "VALOR_LEITURA", "REGISTADOR", "TIPO_REGISTADOR",
            "TIPO_DADOS_RECOLHIDOS", "FACTOR_MULTIPLICATIVO_FINAL", "NR_DIGITOS_INTEIRO", "UNIDADE_MEDIDA",
            "TIPO_LEITURA", "MOTIVO_LEITURA", "ESTADO_LEITURA", "HORA_LEITURA", "FECHA_LECTURA", "CONSUMO", "DIAS"],
            index=False)
        print archiv
        print "==============================================="
        print "*****The file was created successfully*****"
        print "==============================================="
    except IOError:
        print "==================================================="
        print "There was an error writing the file!"
        print "==================================================="
This takes a file where I have readings of energy consumption from different dates for every light meter ('NR_CPE') and does some calculations:
Calculate the energy consumption for every 'NR_CPE' by subtracting the previous reading from the next one, and put the result in a new column named 'CONSUMO'.
Calculate the number of days between readings and sum up the number of days.
Add up the consumption for every 'NR_CPE' and calculate the annual consumption.
Finally, I want to classify every light meter ('NR_CPE') by the number of days it has readings: A if less than 180 days, B between 180 days and 1 year, and C more than a year.
And finally write this result to two different files.
Any idea of how I should re-code this to get the same output faster?
Thank you all.
BTW this is my dataset:
,NOME_DISTRITO,NR_CPE,MARCA_EQUIPAMENTO,NR_EQUIPAMENTO,VALOR_LEITURA,REGISTADOR,TIPO_REGISTADOR,TIPO_DADOS_RECOLHIDOS,FACTOR_MULTIPLICATIVO_FINAL,NR_DIGITOS_INTEIRO,UNIDADE_MEDIDA,TIPO_LEITURA,MOTIVO_LEITURA,ESTADO_LEITURA,DATA_LEITURA,HORA_LEITURA
0,GUARDA,A002000642VW,101,1865411,4834,001,S,1,1,4,kWh,1,1,A,20150629,205600
1,GUARDA,A002000642VW,101,1865411,4834,001,S,1,1,4,kWh,2,2,A,20160218,123300
2,GUARDA,A002000642VJ,122,204534,25083,001,S,1,1,5,kWh,1,1,A,20150629,205700
3,GUARDA,A002000642VJ,122,204534,27536,001,S,1,1,5,kWh,2,2,A,20160218,123200
4,GUARDA,A002000642HR,101,1383899,11734,001,S,1,1,5,kWh,1,1,A,20150629,205600
5,GUARDA,A002000642HR,101,1383899,11800,001,S,1,1,5,kWh,2,2,A,20160218,123000
6,GUARDA,A002000995VM,101,97706436,12158,001,S,1,1,5,kWh,1,3,A,20150713,155300
7,GUARDA,A002000995VM,101,97706436,12163,001,S,1,1,5,kWh,2,2,A,20160129,162300
8,GUARDA,A002000995VM,101,97706436,12163,001,S,1,1,5,kWh,2,2,A,20160202,195800
9,GUARDA,A2000995VM,101,97706436,12163,001,S,1,1,5,kWh,1,3,A,20160404,145200
10,GUARDA,A002000996LV,168,5011703276,3567,001,V,1,1,6,kWh,1,1,A,20150528,205900
11,GUARDA,A02000996LV,168,5011703276,3697,001,V,1,1,6,kWh,2,2,A,20150929,163500
12,GUARDA,A02000996LV,168,5011703276,1287,002,P,1,1,6,kWh,1,1,A,20150528,205900
Generally you want to avoid for loops in pandas.
For example, the first loop, where you calculate total consumption and days, could be rewritten as a groupby-apply, something like:
def last_minus_first(df):
    columns_of_interest = df[['VALOR_LEITURA', 'days']]
    diff = columns_of_interest.iloc[-1] - columns_of_interest.iloc[0]
    return diff

df['date'] = pd.to_datetime(df['DATA_LEITURA'], format="%Y%m%d")
df['days'] = (df['date'] - pd.datetime(1970, 1, 1)).dt.days  # create days column
df.groupby('NR_CPE').apply(last_minus_first)
(By the way, I don't understand why you are subtracting each entry from the previous one; surely for meter readings this is the same as last minus first?)
Then given the result of the above as consumption, you can replace your second for loop (for ind, sumdias in dias.iteritems()) with something like:
pd.cut(consumption.days, [-1, 180, 365, np.inf], labels=['a', 'b', 'c']).value_counts()
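For instance, on a toy series standing in for dias (the summed days per meter), pd.cut reproduces the A/B/C counts without the manual counters:

import numpy as np
import pandas as pd

# Toy day totals per meter, standing in for the real `dias` series.
dias = pd.Series([120, 200, 400, 90, 370], name='DIAS')

# Same thresholds as the original if/elif chain: <=180 -> A, 181-365 -> B, >365 -> C.
grupos = pd.cut(dias, [-1, 180, 365, np.inf], labels=['A', 'B', 'C'])
print(grupos.value_counts())  # A: 2, B: 1, C: 2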
I'm coding a project in which I have 2 files (structureDonnees.py and calculUser.py) working together, plus a test file.
In structureDonnees.py I have this function, which reads a dataset of cars and builds data structures:
# -*-encoding:utf-8-*-
import csv
import sys  # for the float max and min values
from calculUser import *
from trajetUser import *

def recupVoiture():
    # name of the database file
    nomFichier = 'CO2_passenger_cars_v10.csv'
    # open the file nomFichier for reading
    opener = open(nomFichier, "r")
    lectureFichier = csv.reader(opener, delimiter='\t')
    # dict holding the fuel types
    fuelType = dict()
    # dict holding the cars
    voiture = dict()
    # dict holding the CO2 emissions in g/km
    emission = dict()
    # minimum and maximum emission
    min_emission = sys.float_info.max   # initialised to max(float) so we are sure every emission is smaller
    max_emission = -sys.float_info.max  # initialised to the most negative float so we are sure every emission is larger
    for row in lectureFichier:
        # if the row exists
        if row:
            # build the voiture dictionary
            if voiture.has_key(row[10]):
                if row[11].upper() not in voiture[row[10]]:
                    voiture[row[10]].append("%s" % row[11].upper())  # add the model
            else:
                voiture[row[10]] = []  # create an empty list holding the models and their versions
                voiture[row[10]].append("%s" % row[11])  # add the model and its version
            # build the fuelType dictionary
            if fuelType.has_key(row[10]):
                fuelType[row[10]].append(row[19].upper())  # add the fuel type used by the matching car in voiture{}
            else:
                fuelType[row[10]] = []  # create an empty list holding the fuel types
                fuelType[row[10]].append(row[19])
            # build the emission dictionary
            if emission.has_key(row[10]):
                emission[row[10]].append(row[14])  # add the CO2 quantity emitted by the matching car in voiture{}
                min_emission = minEmission(float(row[14]), min_emission)
                max_emission = maxEmission(float(row[14]), max_emission)
            else:
                emission[row[10]] = []  # create an empty list holding the CO2 emissions
                emission[row[10]].append(row[14])
                min_emission = minEmission(float(row[14]), min_emission)
                max_emission = maxEmission(float(row[14]), max_emission)
    # close the file
    opener.close()
    # the return value is a list containing the generated data structures
    res = [voiture, fuelType, emission, min_emission, max_emission]
    return res
In calculUser.py I defined the minEmission and maxEmission functions:
def minEmission(emissionFichier, min_emission):
    if emissionFichier < min_emission:
        min_emission = emissionFichier
    return min_emission

def maxEmission(emissionFichier, max_emission):
    if emissionFichier > max_emission:
        max_emission = emissionFichier
    return max_emission
When I execute test.py, I get an error at this line:
tableau = recupVoiture()
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    tableau = recupVoiture()
  File "/home/user/Polytech/ge3/ProjetPython/structureDonnees.py", line 60, in recupVoiture
    min_emission = minEmission(float(row[14]), min_emission)
NameError: global name 'minEmission' is not defined
I don't understand why I get this error. When I execute everything except test.py I get no error, but test.py fails because minEmission and maxEmission are not defined.
Is it because I'm calling one function while defining another?
How could I fix it?
I fixed the problem. It seems my functions minEmission() and maxEmission() couldn't reference max_emission and min_emission, since those variables are declared in structureDonnees.py and not in calculUser.py.
I fixed it by creating an intermediate variable that takes the value of min_emission or max_emission and is returned, instead of min_emission or max_emission themselves.
Plus, I had to do a from calculUser import minEmission, maxEmission directly in the recupVoiture() function. I know that's awful, but it solved the problem. I'll use it until I find a better, cleaner solution.
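For reference, a cleaner layout would be an explicit top-level import instead of import *. This is only a sketch, assuming the NameError came from the star import misfiring (for example through a circular import that leaves calculUser partially initialised); it requires that calculUser does not import structureDonnees back:

# calculUser.py -- pure helper functions, no import of structureDonnees
def minEmission(emissionFichier, min_emission):
    return min(emissionFichier, min_emission)  # equivalent to the if-based version

def maxEmission(emissionFichier, max_emission):
    return max(emissionFichier, max_emission)  # equivalent to the if-based version

# structureDonnees.py -- import the two names explicitly, once, at module level
from calculUser import minEmission, maxEmission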
Thanks for the help guys, I'll write a better post/code if I have to ask for help again! :)
I am trying to get clean sentences from the Wikipedia page of a species.
For instance Abies durangensis (pid = 1268312). Using the Wikipedia API in Python to obtain the Wikipedia page:
import requests

eswiki_URI = 'https://es.wikipedia.org/w/api.php'  # Spanish Wikipedia API endpoint
pid = 1268312
q = {'action': 'query',
     'pageids': pid,
     'prop': 'revisions',
     'rvprop': 'content',
     'format': 'json'}
result = requests.get(eswiki_URI, params=q).json()
wikitext = result["query"]["pages"].values()[0]["revisions"][0]["*"]
gives:
{{Ficha de taxón
| name = ''Abies durangensis''
| image = Abies tamazula dgo.jpg
| status = LR/lc
| status_ref =<ref>Conifer Specialist Group 1998. [http://www.iucnredlist.org/search/details.php/42279/all ''Abies durangensis'']. [http://www.iucnredlist.org 2006 IUCN Red List of Threatened Species. ] Downloaded on 10 July 2007.</ref>
| regnum = [[Plantae]]
| divisio = [[Pinophyta]]
| classis = [[Pinopsida]]
| ordo = [[Pinales]]
| familia = [[Pinaceae]]
| genus = ''[[Abies]]''
| binomial = '''''Abies durangensis'''''
| binomial_authority = [[Maximino Martínez|Martínez]]<ref name=ipni>{{ cite web |url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal |title=Plant Name Details for ''Abies durangensis'' |publisher=[[International Plant Names Index|IPNI]] |accessdate=6 de octubre de 2009}}</ref>
| synonyms =
}}
'''''Abies durangensis''''' es una [[especie]] de [[conífera]] perteneciente a la familia [[Pinaceae]]. Son [[endémica]]s de [[México]] donde se encuentran en [[Durango]], [[Chihuahua]], [[Coahuila]], [[Jalisco]] y [[Sinaloa]]. También es conocido como 'Árbol de Coahuila' y 'pino mexicano'.<ref name=cje>{{ cite web |url=http://www.conifers.org/pi/ab/durangensis.htm |title=''Abies durangaensis'' description |author=Christopher J. Earle |date=11 de junio de 2006 |accessdate=6 de octubre de 2009}}</ref>
== Descripción ==
Es un [[árbol]] que alcanza los 40 metros de altura con un [[Tronco (botánica)|tronco]] recto que tiene 150 cm de diámetro. Las [[rama]]s son horizontales y la [[corteza (árbol)|corteza]] de color gris. Las [[hoja]]s son verde brillante de 20–35 mm de longitud por 1-1.5 mm de ancho. Tiene los conos de [[semilla]]s erectos en ramas laterales sobre un corto [[pedúnculo]]. Las [[semilla]]s son [[resina|resinosas]] con una [[núcula]] amarilla con alas.
== Taxonomía ==
''Abies durangensis'' fue descrita por [[Maximino Martínez]] y publicado en ''[[Anales del instituto de Biología de la Universidad Nacional de México]]'' 13: 2. 1942.<ref name = Trop>{{cita web |url=http://www.tropicos.org/Name/24901700 |título= ''{{PAGENAME}}''|fechaacceso=21 de enero de 2013 |formato= |obra= Tropicos.org. [[Missouri Botanical Garden]]}}</ref>
;[[Etimología]]:
'''''Abies''''': nombre genérico que viene del nombre [[latin]]o de ''[[Abies alba]]''.<ref>[http://www.calflora.net/botanicalnames/pageAB-AM.html En Nombres Botánicos]</ref>
'''''durangensis''''': [[epíteto]] geográfico que alude a su localización en [[Durango]].
;Variedades:
* ''Abies durangensis var. coahuilensis'' (I. M. Johnst.) Martínez
;[[sinonimia (biología)|Sinonimia]]:
* ''Abies durangensis subsp. neodurangensis'' (Debreczy, I.Rácz & R.M.Salazar) Silba'
* ''Abies neodurangensis'' Debreczy, I.Rácz & R.M.Salazar<ref>[http://www.theplantlist.org/tpl/record/kew-2609816 ''{{PAGENAME}}'' en PlantList]</ref><ref name = Kew>{{cita web|url=http://apps.kew.org/wcsp/namedetail.do?name_id=2609816 |título=''{{PAGENAME}}'' |work= World Checklist of Selected Plant Families}}</ref>
;''var. coahuilensis'' (I.M.Johnst.) Martínez
* ''Abies coahuilensis'' I.M.Johnst.
* ''Abies durangensis subsp. coahuilensis'' (I.M.Johnst.) Silba
== Véase también ==
* [[Terminología descriptiva de las plantas]]
* [[Anexo:Cronología de la botánica]]
* [[Historia de la Botánica]]
* [[Pinaceae#Descripción|Características de las pináceas]]
== Referencias ==
{{listaref}}
== Bibliografía ==
# CONABIO. 2009. Catálogo taxonómico de especies de México. 1. In Capital Nat. México. CONABIO, Mexico City.
== Enlaces externos ==
{{commonscat}}
{{wikispecies|Abies}}
* http://web.archive.org/web/http://ww.conifers.org/pi/ab/durangensis.htm
* http://www.catalogueoflife.org/search.php
[[Categoría:Abies|durangensis]]
[[Categoría:Plantas descritas en 1942]]
[[Categoría:Plantas descritas por Martínez]]
I am interested in the (unmarked) text just after the infobox, the gloss:
Abies durangensis es una especie de conífera perteneciente a la familia Pinaceae. Son endémicas de México donde se encuentran en Durango, Chihuahua, Coahuila, Jalisco y Sinaloa. También es conocido como 'Árbol de Coahuila' y 'pino mexicano'.
So far I have consulted https://www.mediawiki.org/wiki/Alternative_parsers, where I found that mwparserfromhell is the least complicated parser in Python. However, I don't clearly see how to do what I intend. When I use the example proposed in the documentation, I just can't see where the gloss is.
for t in templates:
    print(t.name).encode('utf-8')
    print(t.params)
Ficha de taxón
[u" name = ''Abies durangensis''\n", u' image = Abies tamazula dgo.jpg \n', u' status = LR/lc\n', u" status_ref =<ref>Conifer Specialist Group 1998. [http://www.iucnredlist.org/search/details.php/42279/all ''Abies durangensis'']. [http://www.iucnredlist.org 2006 IUCN Red List of Threatened Species. ] Downloaded on 10 July 2007.</ref>\n", u' regnum = [[Plantae]]\n', u' divisio = [[Pinophyta]]\n', u' classis = [[Pinopsida]]\n', u' ordo = [[Pinales]]\n', u' familia = [[Pinaceae]]\n', u" genus = ''[[Abies]]'' \n", u" binomial = '''''Abies durangensis'''''\n", u" binomial_authority = [[Maximino Mart\xednez|Mart\xednez]]<ref name=ipni>{{ cite web |url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal |title=Plant Name Details for ''Abies durangensis'' |publisher=[[International Plant Names Index|IPNI]] |accessdate=6 de octubre de 2009}}</ref>\n", u' synonyms = \n']
cite web
[u'url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal ', u"title=Plant Name Details for ''Abies durangensis'' ", u'publisher=[[International Plant Names Index|IPNI]] ', u'accessdate=6 de octubre de 2009']
cite web
[u'url=http://www.conifers.org/pi/ab/durangensis.htm ', u"title=''Abies durangaensis'' description ", u'author=Christopher J. Earle ', u'date=11 de junio de 2006 ', u'accessdate=6 de octubre de 2009']
cita web
[u'url=http://www.tropicos.org/Name/24901700 ', u"t\xedtulo= ''{{PAGENAME}}''", u'fechaacceso=21 de enero de 2013 ', u'formato= ', u'obra= Tropicos.org. [[Missouri Botanical Garden]]']
PAGENAME
[]
PAGENAME
[]
cita web
[u'url=http://apps.kew.org/wcsp/namedetail.do?name_id=2609816 ', u"t\xedtulo=''{{PAGENAME}}'' ", u'work= World Checklist of Selected Plant Families']
PAGENAME
[]
listaref
[]
commonscat
[]
wikispecies
[u'Abies']
Instead of torturing yourself with parsing something that isn't even expressible in a formal grammar, use the TextExtracts API:
https://es.wikipedia.org/w/api.php?action=query&prop=extracts&explaintext=1&titles=Abies%20durangensis&format=json
gives the following output:
Abies durangensis es una especie de conífera perteneciente a la
familia Pinaceae. Son endémicas de México donde se encuentran en
Durango, Chihuahua, Coahuila, Jalisco y Sinaloa. También es conocido
como 'Árbol de Coahuila' y 'pino mexicano'.
== Descripción ==
Es un árbol que alcanza los 40 metros de altura con un tronco recto que tiene 150 cm de diámetro. Las ramas son
horizontales y la corteza de color gris. Las hojas son verde brillante
de 20–35 mm de longitud por 1-1.5 mm de ancho. Tiene los conos de
semillas erectos en ramas laterales sobre un corto pedúnculo. Las
semillas son resinosas con una núcula amarilla con alas.
[...]
Append &exintro=1 to the URL if you need only the lead.
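The same request can be issued from Python with requests, reusing the parameters from the URL above; the response handling below follows the standard action=query shape:

import requests

params = {'action': 'query',
          'prop': 'extracts',
          'explaintext': 1,
          'exintro': 1,  # drop this key to get the plain text of the whole page
          'titles': 'Abies durangensis',
          'format': 'json'}
result = requests.get('https://es.wikipedia.org/w/api.php', params=params).json()
page = next(iter(result['query']['pages'].values()))  # single page in the reply
print(page['extract'])  # the gloss as clean text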
You can use wikitextparser.
If wikitext holds your text string, all you need is:
import wikitextparser
parsed = wikitextparser.parse(wikitext)
Then you can get the plain-text portion of the whole page or of a particular section; for example, parsed.sections[1].plain_text() will give you the plain text of the second section of the page, which seems to be what you are looking for.
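For instance, on a small stand-in for the wikitext above (a toy string, not the full page), the sections and their plain text can be inspected like this:

import wikitextparser

# Toy stand-in for the fetched page source: an infobox template plus the lead.
wikitext = ("{{Ficha de taxón | name = ''Abies durangensis'' }}\n"
            "'''''Abies durangensis''''' es una [[especie]] de [[conífera]].\n"
            "== Descripción ==\n"
            "Es un [[árbol]] que alcanza los 40 metros de altura.")

parsed = wikitextparser.parse(wikitext)
for i, section in enumerate(parsed.sections):
    # the lead section has title None; templates are dropped by plain_text()
    print(i, section.title, '->', section.plain_text().strip())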