How can I merge the data of two JSON files in Python?

Suppose my first JSON file looks like this:
{"derivedFrom": "1e21781bfc33ae80e074369165368080", "text": "PAKIS RAPE KIDS: Mexico police in college raids: Vehicles are set ablaze and more than 120 peo... #Ewok_League #edl"}
{"derivedFrom": "1e26ebbfbfaea600e0746a3016b628a8", "text": "Bullfighting sparks animal rights protests in Mexico - video - The Guardian "}
and the second JSON file looks like this:
{"derivedFrom": "1e21292e9b4ca680e074bd999ef8cc3a","text": "#TheArkham No crea que me olvido de los amigos,espero todo este marchando bien, un abrazo."}
{"derivedFrom": "1e2602635130a980e0744bad1c470046", "text": "Avisale al IFE Que #SiDragonBallFueraMexicano Le hubiera dado a Noe Hernandez Semillas Del Ermita\u00f1o (ESO HUBIERA SIDO ESTUPENDO. QEPD)"}
My ultimate goal is to merge only the text field of both files, prefixing each line from the first file with "1," and each line from the second file with "0,".
I wrote the following script, but I'm sure this is not the right way to do it in Python:
import json
positiveFile = open('train_posi_tweets_2017.txt')
negativeFile = open('train_nega_tweets_2017.txt')
for linePos,lineNeg in positiveFile,negativeFile:
    distros_dictPos=json.loads(linePos)
    distros_dictNeg=json.loads(lineNeg)
    distros_dictPosVal = distros_dict['text'].encode('utf-8')
    distros_dictNegVal = distros_dict['text'].encode('utf-8')
    print distros_dictNegVal
So the final output should be like this.
1,"PAKIS RAPE KIDS: Mexico police in college raids: Vehicles are set ablaze and more than 120 peo... #Ewok_League #edl"
1,"Bullfighting sparks animal rights protests in Mexico - video - The Guardian "
0,"#TheArkham No crea que me olvido de los amigos,espero todo este marchando bien, un abrazo."
0,"Avisale al IFE Que #SiDragonBallFueraMexicano Le hubiera dado a Noe Hernandez Semillas Del Ermita\u00f1o (ESO HUBIERA SIDO ESTUPENDO. QEPD)"

Iterating over the tuple positiveFile,negativeFile unpacks the lines of each file into linePos and lineNeg, so it only runs on two-line files and mixes up the labels. Loop over each file separately instead:
import json
# Label every line of the first file with 1 and every line of the second with 0.
for label, path in ((1, 'test1.txt'), (0, 'test2.txt')):
    with open(path, encoding='utf-8') as f:
        for line in f:
            if line.strip():  # skip blank lines
                print('{},"{}"'.format(label, json.loads(line)['text']))
Output
1,"PAKIS RAPE KIDS: Mexico police in college raids: Vehicles are set ablaze and more than 120 peo... #Ewok_League #edl"
1,"Bullfighting sparks animal rights protests in Mexico - video - The Guardian "
0,"#TheArkham No crea que me olvido de los amigos,espero todo este marchando bien, un abrazo."
0,"Avisale al IFE Que #SiDragonBallFueraMexicano Le hubiera dado a Noe Hernandez Semillas Del Ermitaño (ESO HUBIERA SIDO ESTUPENDO. QEPD)"
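If the merged lines are meant to be read back as CSV later, the csv module can take care of the quoting, which matters as soon as a tweet contains a double quote of its own. The following is only a sketch under the assumption that each input file holds one JSON object per line, as in the question; merge_labeled and its arguments are hypothetical names.

```python
import csv
import json


def merge_labeled(sources, out_path):
    """Write `label,"text"` rows for every JSON line of every source file.

    `sources` is an iterable of (label, path) pairs (hypothetical layout);
    each path is expected to hold one JSON object with a "text" key per line.
    """
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        # QUOTE_NONNUMERIC leaves the integer label bare and quotes the text.
        writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC)
        for label, path in sources:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if line.strip():  # skip blank lines
                        writer.writerow([label, json.loads(line)["text"]])
```

It could be called as merge_labeled([(1, 'test1.txt'), (0, 'test2.txt')], 'merged.csv'), where merged.csv is again just an illustrative file name.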

Related

Efficient way to sentence-tokenize and clean text

I have a dataframe consisting of two columns, one with dates and the other with a string of text. I'd like to split the text in sentences and then apply some preprocessing.
Here is a simplified example of what I have:
import nltk
import pandas as pd
from nltk.corpus import stopwords
from pandarallel import pandarallel
pandarallel.initialize()
example_df=pd.DataFrame({'date':['2022-09-01'],'text':'Europa tiene un plan. Son cosas distintas. Perdón, esta es imagen dentro, y el recorte a los beneficios los ingresos totales conocimiento del uso fraudulento Además, el riesgo ha bajado. de gases nocivos, como el CO2. -La justicia europea ha confirmado se ha dado acceso al público por lo menos, intentar entrar. para reducir los contagios, vestido de chaqué. Han tenido que establecer de despido según informa que podría pasar desapercibida El Tribunal Supremo confirma en nuestra página web'})
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
example_df['sentence']=example_df['text'].parallel_apply(lambda x: spanish_tokenizer.tokenize(x))
As you can see, I apply the nltk tokenizer to the raw text to create a new column, "sentence", that contains the list of sentences.
print(example_df['sentence'])
0 [Europa tiene un plan, Son cosas distintas, Perdón, esta es imagen dentro, y el recorte a los beneficios los ingresos totales conocimiento del uso fraudulento Además, el riesgo ha bajado, de gases nocivos, como el CO2, -La justicia europea ha confirmado se ha dado acceso al público por lo menos, intentar entrar, para reducir los contagios, vestido de chaqué, Han tenido que establecer de despido según informa que podría pasar desapercibida El Tribunal Supremo confirma en nuestra página web]
1 [casi todas las restricciones, Socios como Esquerra le echan un servicio público; con terrazas llenas Los voluntarios piden a todos los cuatros juntos en una semana la sentencia de cárcel para Griñán que Griñán no conoció la trama, de las hipotecas, A las afueras de Westminster]
Name: sentence, dtype: object
# Since commas might be misleading:
example_df.sentence[1]
['casi todas las restricciones',
'Socios como Esquerra le echan un servicio público; con terrazas llenas Los voluntarios piden a todos los cuatros juntos en una semana la sentencia de cárcel para Griñán que Griñán no conoció la trama, de las hipotecas',
'A las afueras de Westminster']
My next goal is to clean those sentences. Since the tokenizer needs the punctuation to work, I believe I have to clean afterwards, which implies looping over every sentence of every row. First of all, I am not sure how to do this with the pandas structure. Here is one of my attempts to remove stopwords:
from nltk.corpus import stopwords
stop = stopwords.words('spanish')
example_df['sentence'] = example_df['sentence'].parallel_apply(lambda x: ' '.join(
[word for word in i.split() for i in x if word not in (stop)]))
This raises the following error: AttributeError: 'int' object has no attribute 'split'
Is there a more efficient/elegant way to do this?
Since the sentence column holds tokenized text (a list of strings per row), the list comprehension logic needs to change.
E.g.:
sentences = ['casi todas las restricciones', 'Socios como Esquerra le echan un servicio público; con terrazas llenas Los voluntarios piden a todos los cuatros juntos en una semana la sentencia de cárcel para Griñán que Griñán no conoció la trama, de las hipotecas', 'A las afueras de Westminster']
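stopwords_removed = [word for sent in sentences for word in sent.split() if word not in stop]
Here sent is each sentence in the list and word is each of the words obtained by splitting that sentence on whitespace. The outer loop (for sent in sentences) has to come first so that sent is bound before sent.split() is evaluated.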
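Your error is most likely caused by the order of the for clauses: i.split() is evaluated before "for i in x" binds i, so Python picks up an unrelated integer i from the enclosing scope. No axis parameter is needed, because parallel_apply on a single column already passes one list per row:
df.Column.parallel_apply(func)
where func is a function that returns your list comprehension result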
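To make the shape of such a func concrete, here is a small sketch that cleans one row (a list of sentences) and keeps the sentence boundaries. The name remove_stopwords and the tiny demo_stop set are assumptions for illustration; the question uses stopwords.words('spanish') instead. Lower-casing each word before the lookup is also a choice made here, not something the original code does.

```python
def remove_stopwords(sentences, stop):
    """Return each sentence with the words found in `stop` dropped."""
    stop = set(stop)  # set membership tests are cheaper than list scans
    return [
        " ".join(word for word in sent.split() if word.lower() not in stop)
        for sent in sentences
    ]


# Hand-picked stop list for illustration only (hypothetical, not nltk's list).
demo_stop = {"las", "de", "a", "todas"}
sentences = ["casi todas las restricciones", "A las afueras de Westminster"]
print(remove_stopwords(sentences, demo_stop))
# ['casi restricciones', 'afueras Westminster']
```

With the DataFrame from the question this would be passed as example_df['sentence'].apply(lambda x: remove_stopwords(x, stop)), or with parallel_apply in place of apply.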

Regex to split Spanish lastnames in first lastname and second lastname

My goal is to write a Python 3 function that takes last names, one per row of a CSV file, and splits them correctly into lastname_1 and lastname_2.
Spanish names have the following structure: firstname + lastname_1 + lastname_2
Leaving the first name aside, I would like code that splits the last names into these two parts (lastname_1, lastname_2). What is the challenge?
Sometimes the last names contain prepositions.
DE BLAS ZAPATA: "De Blas" is lastname_1 and "Zapata" is lastname_2
MATIAS DE LA MANO: "Matias" is lastname_1, "de la Mano" is lastname_2
LOPEZ FERNANDEZ DE VILLAVERDE: "Lopez Fernandez" is lastname_1, "de Villaverde" is lastname_2
DE MIGUEL DEL CORRAL: "De Miguel" is lastname_1, "del Corral" is lastname_2
VIDAL DE LA PEÑA SOLIS: "Vidal" is lastname_1, "de la Peña Solis" is lastname_2
MONTAVA DEL ARCO: "Montava" is lastname_1, "del Arco" is lastname_2
and the list could go on and on.
I am currently stuck. I found this code in Perl, but I struggle to understand the main idea behind it well enough to translate it to Python 3.
import re
preposition_lst = ['DE LO ', 'DE LA ', 'DE LAS ', 'DEL ', 'DELS ', 'DE LES ', 'DO ', 'DA ', 'DOS ', 'DAS', 'DE ']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PEÑA", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]
for case in cases:
    for prep in preposition_lst:
        m = re.search(f"(.*)({prep}[A-ZÀ-ÚÄ-Ü]+)", case, re.I) # re.I makes it case insensitive
        try:
            print(m.groups())
            print(prep)
        except:
            pass
Try this one:
import re
preposition_lst = ['DE','LO','LA','LAS','DEL','DELS','LES','DO','DA','DOS','DAS']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PEÑA", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]
def last_name(name):
    case=re.findall(r'\w+',name)
    res=list(filter(lambda x: x not in preposition_lst,case))
    return res
list_final =[]
for case in cases:
    list_final.append(last_name(case))
for i in range(len(list_final)):
    if len(list_final[i])>2:
        name1=' '.join(list_final[i][:2])
        name2=' '.join(list_final[i][2:])
        list_final[i]=[name1,name2]
print(list_final)
#[['BLAS', 'ZAPATA'], ['MATIAS', 'MANO'], ['LOPEZ FERNANDEZ', 'VILLAVERDE'], ['MIGUEL', 'CORRAL'], ['VIDAL', 'PEÑA'], ['MONTAVA', 'ARCO'], ['CASAS', 'VALLE']]
Does this suit your requirements?
import re
preposition_lst = ['DE LO', 'DE LA', 'DE LAS', 'DE LES', 'DEL', 'DELS', 'DO', 'DA', 'DOS', 'DAS', 'DE']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PENA SOLIS", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]
def split_name(name):
    f1 = re.compile("(.*)({preps}(.+))".format(preps = "(" + " |".join(preposition_lst) + ")"))
    m1 = f1.match(name)  # match the argument, not the global loop variable
    if m1:
        if len(m1.group(1)) != 0:
            return m1.group(1).strip(), m1.group(2).strip()
        else:
            return " ".join(name.split()[:-1]), name.split()[-1]
    else:
        return " ".join(name.split()[:-1]), name.split()[-1]
for case in cases:
    first, second = split_name(case)
    print("{} --> name 1 = {}, name 2 = {}".format(case, first, second))
# DE BLAS ZAPATA --> name 1 = DE BLAS, name 2 = ZAPATA
# MATIAS DE LA MANO --> name 1 = MATIAS, name 2 = DE LA MANO
# LOPEZ FERNANDEZ DE VILLAVERDE --> name 1 = LOPEZ FERNANDEZ, name 2 = DE VILLAVERDE
# DE MIGUEL DEL CORRAL --> name 1 = DE MIGUEL, name 2 = DEL CORRAL
# VIDAL DE LA PENA SOLIS --> name 1 = VIDAL, name 2 = DE LA PENA SOLIS
# MONTAVA DEL ARCO --> name 1 = MONTAVA, name 2 = DEL ARCO
# DOS CASAS VALLE --> name 1 = DOS CASAS, name 2 = VALLE
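One caveat with the regex above: its last alternative, DE, carries no trailing space, so it appears to match inside a word as well (FERNANDEZ LOPEZ would be cut after FERNAN). A token-based variant of the same rule avoids that. This is only a sketch: split_surnames and PARTICLES are hypothetical names, the particle list is an assumption derived from the lists in this thread, and the rule (split before the last run of particles that does not start the name, otherwise before the last word) is simply the behaviour of the outputs shown above.

```python
# Particles assumed for illustration; extend as needed for real data.
PARTICLES = {"DE", "DEL", "DELS", "LA", "LAS", "LO", "LES", "DO", "DA", "DOS", "DAS"}


def split_surnames(full):
    """Split a full surname string into (lastname_1, lastname_2), upper-cased."""
    tokens = full.upper().split()
    if len(tokens) < 2:
        return " ".join(tokens), ""
    split_at = None
    # Scan from the right for a particle that still has a word after it.
    for i in range(len(tokens) - 2, 0, -1):
        if tokens[i] in PARTICLES:
            # Walk left over a run of particles such as "DE LA".
            while i > 0 and tokens[i - 1] in PARTICLES:
                i -= 1
            split_at = i
            break
    if not split_at:
        # No particle inside the name (or the run starts the name):
        # fall back to splitting before the last word.
        split_at = len(tokens) - 1
    return " ".join(tokens[:split_at]), " ".join(tokens[split_at:])


for case in ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "FERNANDEZ LOPEZ"]:
    print(case, "-->", split_surnames(case))
```

Working on whole tokens means a particle can only match a complete word, at the cost of treating any surname that happens to equal a particle (for example DOS in the middle of a name) as a particle.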

How to replace " \\u0027 " by " ' " in python?

I am new to Python and am trying to write a scraper.
First, I extract this kind of string into a variable (let's call it data[1], because it is stored in an array):
\"description\":\"Alors qu\\u0027ils montaient dans l\\u0027un des
couloirs du versant nord du Hohneck (\\"le premier couloir \u00E0
droite\\"), deux alpinistes ont d\u00E9clench\u00E9 une plaque et ont
\u00E9t\u00E9 emport\u00E9s tous les deux. L\\u0027un sera enseveli et
l\\u0027autre ne pourra d\u00E9clencher l\\u0027alerte qu\\u0027\u00E0
la nuit. La victime ne sera retrouv\u00E9e que d\u00E9but avril.
\u003cbr\u003e Sur la photo prise en f\u00E9vrier 2011, le
trac\u00E9 approximatif de l\\u0027avalanche a \u00E9t\u00E9
repr\u00E9sent\u00E9.\",
Then I use:
data = data[1].encode().decode('unicode-escape')
but it gives me:
"description":"Alors qu\u0027ils montaient dans l\u0027un des couloirs
du versant nord du Hohneck (\"le premier couloir à droite\"), deux
alpinistes ont déclenché une plaque et ont été emportés tous les deux.
L\u0027un sera enseveli et l\u0027autre ne pourra déclencher
l\u0027alerte qu\u0027à la nuit. La victime ne sera retrouvée que
début avril. \u003cbr\u003e Sur la photo prise en février 2011, le
tracé approximatif de l\u0027avalanche a été représenté.",
Indeed, the accented characters have been replaced, but the apostrophes stay unprocessed!
It seems the two backslashes are the cause.
I tried several methods:
decoding twice, after which "\u0027" becomes "'", but "é" becomes "Ã©";
data.replace('Ã©', 'é') or data.replace(u'\u0027', u'é'), which don't work.
So, do you have any idea how I could fix this problem?
Problem fixed!
As user2357112 supports Monica said, I was trying to process JSON by hand.
But doing:
data = html_page.find_all("script")
data = re.findall("(?<=JSON\.parse\(')[A-Za-z0-9'.?!/+=;:,()\\\ \"\-{_àâçéèêëíìîôùûæœÁÀÂÃÇÉÈÊËÍÌÎÔÙÛ©´<>]*", str(data))
data = data[0].encode().decode('unicode-escape') + "\"\"}"
data_dict = json.loads(data)
string_data_dict = json.dumps(data_dict)
for cle, val in data_dict.items():
    print(cle + " : " + str(val))
solves the bug.
With this code, an input string like:
<script>
var admin = false;
var avalanche = JSON.parse('{...\"description\":\"Alors qu\\u0027ils montaient dans l\\u0027un des couloirs du versant nord du Hohneck (\\\"le premier couloir \u00E0 droite\\\"), deux alpinistes ont d\u00E9clench\u00E9 une plaque et ont \u00E9t\u00E9 emport\u00E9s tous les deux. L\\u0027un sera enseveli et l\\u0027autre ne pourra d\u00E9clencher l\\u0027alerte qu\\u0027\u00E0 la nuit. La victime ne sera retrouv\u00E9e que d\u00E9but avril. \\u003cbr\\u003e Sur la photo prise en f\u00E9vrier 2011, le trac\u00E9 approximatif de l\\u0027avalanche a \u00E9t\u00E9 repr\u00E9sent\u00E9.\",...
gives an output like:
...
description : Alors qu'ils montaient dans l'un des couloirs du versant nord du Hohneck ("le premier couloir à droite"), deux alpinistes ont déclenché une plaque et ont été emportés tous les deux. L'un sera enseveli et l'autre ne pourra déclencher l'alerte qu'à la nuit. La victime ne sera retrouvée que début avril. <br> Sur la photo prise en février 2011, le tracé approximatif de l'avalanche a été représenté.
...
Thank you.
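The two layers of escaping can also be peeled off with json.loads alone, which avoids the mojibake that unicode-escape can introduce for non-ASCII text. The sketch below assumes that the argument of JSON.parse('...') only uses escapes that are also valid inside a JSON string (\", \\, \uXXXX); a JavaScript-only escape such as \' would make it fail. parse_js_json_literal is a hypothetical helper name.

```python
import json


def parse_js_json_literal(literal):
    """Decode the text found between the quotes of JSON.parse('...').

    Step 1 undoes the JavaScript string-literal escaping by reading the text
    as a JSON string; step 2 parses the JSON document that step 1 yields.
    """
    inner = json.loads('"' + literal + '"')
    return json.loads(inner)


# Shortened sample in the same shape as the page source quoted above.
literal = r'{\"description\":\"Alors qu\\u0027ils montaient (\\\"le premier couloir \u00E0 droite\\\")\"}'
print(parse_js_json_literal(literal)["description"])
# Alors qu'ils montaient ("le premier couloir à droite")
```

The first json.loads turns \\u0027 into the six characters \u0027, and the second one turns those into the apostrophe, so no manual replace is needed.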

Histogram representing number of substitutions, insertions and deleting in sequences

I have two columns that represent the right sequence and the predicted sequence. I want to compute statistics on the number of deletions, substitutions and insertions by comparing each right sequence with its predicted sequence.
I computed the Levenshtein distance to get the number of characters that differ (see the function below) and wrote an error_dist function to get the most common errors (in terms of substitutions).
Here is a sample of my data:
de de
date date
pour pour
etoblissemenls etablissements
avec avec
code code
communications communications
r r
seiche seiche
titre titre
publiques publiques
ht ht
bain bain
du du
ets ets
premier premier
dans dans
snupape soupape
minimum minimum
blanc blanc
fr fr
nos nos
au au
bl bl
consommations consommations
somme somme
euro euro
votre votre
offre offre
forestier forestier
cs cs
de de
pour pour
de de
paye r
cette cette
votre votre
valeurs valeurs
des des
gfda gfda
tva tva
pouvoirs pouvoirs
de de
revenus revenus
offre offre
ht ht
card card
noe noe
montant montant
r r
comprises comprises
quantite quantite
nature nature
ticket ticket
ou ou
rapide rapide
de de
sous sous
identification identification
du du
document document
suicide suicide
bretagne bretagne
tribunal tribunal
services services
cif cif
moyen moyen
gaec gaec
total total
lorsque lorsque
contact contact
fermeture fermeture
la la
route route
tva tva
ia ia
noyal noyal
brie brie
de de
nanterre nanterre
charcutier charcutier
semestre semestre
de de
rue rue
le le
bancaire bancaire
martigne martigne
recouvrement recouvrement
la la
sainteny sainteny
de de
franc franc
rm rm
vro vro
Here is my code:
import pandas as pd
import collections
import numpy as np
import matplotlib.pyplot as plt
import distance
def error_dist():
    df = pd.read_csv('data.csv', sep=',')
    df = df.astype(str)
    df = df.replace(['é', 'è', 'È', 'É'], 'e', regex=True)
    df = df.replace(['à', 'â', 'Â'], 'a', regex=True)
    dictionnary = []
    for i in range(len(df)):
        if df.manual_raw_value[i] != df.raw_value[i]:
            text = df.manual_raw_value[i]
            text2 = df.raw_value[i]
            x = len(df.manual_raw_value[i])
            y = len(df.raw_value[i])
            z = min(x, y)
            for t in range(z):
                if text[t] != text2[t]:
                    d = (text[t], text2[t])
                    dictionnary.append(d)
    #print(dictionnary)
    dictionnary_new = dict(collections.Counter(dictionnary).most_common(25))
    pos = np.arange(len(dictionnary_new.keys()))
    width = 1.0
    ax = plt.axes()
    ax.set_xticks(pos + (width / 2))
    ax.set_xticklabels(dictionnary_new.keys())
    plt.bar(range(len(dictionnary_new)), dictionnary_new.values(), width, color='g')
    plt.show()
and the Levenshtein distance:
def levenstein_dist():
    df = pd.read_csv('data.csv', sep=',')
    df=df.astype(str)
    df['string diff'] = df.apply(lambda x: distance.levenshtein(x['raw_value'], x['manual_raw_value']), axis=1)
    plt.hist(df['string diff'])
    plt.show()
Now I want to make a histogram showing three bins: the number of substitutions, the number of insertions and the number of deletions. How can I proceed?
Thank you
Thanks to the suggestions of @YohanesGultom, the answer to the problem can be found here:
http://www.nltk.org/_modules/nltk/metrics/distance.html
or
https://gist.github.com/kylebgorman/1081951
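For readers who want the three counts without an extra dependency, the idea behind those links can be sketched as the classic dynamic-programming edit distance followed by a backtrace that labels each step. This is an illustrative implementation, not the code from either link; edit_ops is a hypothetical name, and which column counts as source (here: the predicted string) is an assumption that decides what is called an insertion versus a deletion.

```python
from collections import Counter


def edit_ops(source, target):
    """Count substitutions, insertions and deletions in one minimal edit
    script that turns `source` into `target`."""
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # match or substitution
    ops = Counter(substitution=0, insertion=0, deletion=0)
    i, j = m, n
    # Walk back from the bottom-right cell, preferring the diagonal.
    while i > 0 or j > 0:
        differs = i > 0 and j > 0 and source[i - 1] != target[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + int(differs):
            if differs:
                ops["substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops["deletion"] += 1
            i -= 1
        else:
            ops["insertion"] += 1
            j -= 1
    return ops


# A few (predicted, right) pairs taken from the sample data above.
pairs = [("etoblissemenls", "etablissements"), ("snupape", "soupape"), ("paye", "r")]
totals = Counter()
for predicted, right in pairs:
    totals.update(edit_ops(predicted, right))
print(dict(totals))
```

The three totals can then be drawn with plt.bar(list(totals.keys()), list(totals.values())). Note that several minimal edit scripts can exist for one pair, so the split between the three categories depends on the tie-breaking order used in the backtrace.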

How to get clean text from MediaWiki markup format using mwparserfromhell or a simple parser in python?

I am trying to get clean sentences from the Wikipedia page of a species.
For instance, Abies durangensis (pid = 1268312). Using the Wikipedia API in Python to obtain the Wikipedia page:
import requests
pid = 1268312
q = {'action' : 'query',
     'pageids': pid,
     'prop' : 'revisions',
     'rvprop' : 'content',
     'format' : 'json'}
result = requests.get(eswiki_URI, params=q).json()
wikitext = result["query"]["pages"].values()[0]["revisions"][0]["*"]
gives:
{{Ficha de taxón
| name = ''Abies durangensis''
| image = Abies tamazula dgo.jpg
| status = LR/lc
| status_ref =<ref>Conifer Specialist Group 1998. [http://www.iucnredlist.org/search/details.php/42279/all ''Abies durangensis'']. [http://www.iucnredlist.org 2006 IUCN Red List of Threatened Species. ] Downloaded on 10 July 2007.</ref>
| regnum = [[Plantae]]
| divisio = [[Pinophyta]]
| classis = [[Pinopsida]]
| ordo = [[Pinales]]
| familia = [[Pinaceae]]
| genus = ''[[Abies]]''
| binomial = '''''Abies durangensis'''''
| binomial_authority = [[Maximino Martínez|Martínez]]<ref name=ipni>{{ cite web |url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal |title=Plant Name Details for ''Abies durangensis'' |publisher=[[International Plant Names Index|IPNI]] |accessdate=6 de octubre de 2009}}</ref>
| synonyms =
}}
'''''Abies durangensis''''' es una [[especie]] de [[conífera]] perteneciente a la familia [[Pinaceae]]. Son [[endémica]]s de [[México]] donde se encuentran en [[Durango]], [[Chihuahua]], [[Coahuila]], [[Jalisco]] y [[Sinaloa]]. También es conocido como 'Árbol de Coahuila' y 'pino mexicano'.<ref name=cje>{{ cite web |url=http://www.conifers.org/pi/ab/durangensis.htm |title=''Abies durangaensis'' description |author=Christopher J. Earle |date=11 de junio de 2006 |accessdate=6 de octubre de 2009}}</ref>
== Descripción ==
Es un [[árbol]] que alcanza los 40 metros de altura con un [[Tronco (botánica)|tronco]] recto que tiene 150 cm de diámetro. Las [[rama]]s son horizontales y la [[corteza (árbol)|corteza]] de color gris. Las [[hoja]]s son verde brillante de 20–35 mm de longitud por 1-1.5 mm de ancho. Tiene los conos de [[semilla]]s erectos en ramas laterales sobre un corto [[pedúnculo]]. Las [[semilla]]s son [[resina|resinosas]] con una [[núcula]] amarilla con alas.
== Taxonomía ==
''Abies durangensis'' fue descrita por [[Maximino Martínez]] y publicado en ''[[Anales del instituto de Biología de la Universidad Nacional de México]]'' 13: 2. 1942.<ref name = Trop>{{cita web |url=http://www.tropicos.org/Name/24901700 |título= ''{{PAGENAME}}''|fechaacceso=21 de enero de 2013 |formato= |obra= Tropicos.org. [[Missouri Botanical Garden]]}}</ref>
;[[Etimología]]:
'''''Abies''''': nombre genérico que viene del nombre [[latin]]o de ''[[Abies alba]]''.<ref>[http://www.calflora.net/botanicalnames/pageAB-AM.html En Nombres Botánicos]</ref>
'''''durangensis''''': [[epíteto]] geográfico que alude a su localización en [[Durango]].
;Variedades:
* ''Abies durangensis var. coahuilensis'' (I. M. Johnst.) Martínez
;[[sinonimia (biología)|Sinonimia]]:
* ''Abies durangensis subsp. neodurangensis'' (Debreczy, I.Rácz & R.M.Salazar) Silba'
* ''Abies neodurangensis'' Debreczy, I.Rácz & R.M.Salazar<ref>[http://www.theplantlist.org/tpl/record/kew-2609816 ''{{PAGENAME}}'' en PlantList]</ref><ref name = Kew>{{cita web|url=http://apps.kew.org/wcsp/namedetail.do?name_id=2609816 |título=''{{PAGENAME}}'' |work= World Checklist of Selected Plant Families}}</ref>
;''var. coahuilensis'' (I.M.Johnst.) Martínez
* ''Abies coahuilensis'' I.M.Johnst.
* ''Abies durangensis subsp. coahuilensis'' (I.M.Johnst.) Silba
== Véase también ==
* [[Terminología descriptiva de las plantas]]
* [[Anexo:Cronología de la botánica]]
* [[Historia de la Botánica]]
* [[Pinaceae#Descripción|Características de las pináceas]]
== Referencias ==
{{listaref}}
== Bibliografía ==
# CONABIO. 2009. Catálogo taxonómico de especies de México. 1. In Capital Nat. México. CONABIO, Mexico City.
== Enlaces externos ==
{{commonscat}}
{{wikispecies|Abies}}
* http://web.archive.org/web/http://ww.conifers.org/pi/ab/durangensis.htm
* http://www.catalogueoflife.org/search.php
[[Categoría:Abies|durangensis]]
[[Categoría:Plantas descritas en 1942]]
[[Categoría:Plantas descritas por Martínez]]
I am interested in the (unmarked) text just after the infobox, the gloss:
Abies durangensis es una especie de conífera perteneciente a la familia Pinaceae. Son endémicas de México donde se encuentran en Durango, Chihuahua, Coahuila, Jalisco y Sinaloa. También es conocido como 'Árbol de Coahuila' y 'pino mexicano'.
So far I have consulted https://www.mediawiki.org/wiki/Alternative_parsers and found that mwparserfromhell is the least complicated parser in Python. However, I don't clearly see how to do what I intend. When I use the example proposed in the documentation, I just can't see where the gloss is.
for t in templates:
    print(t.name).encode('utf-8')
    print(t.params)
Ficha de taxón
[u" name = ''Abies durangensis''\n", u' image = Abies tamazula dgo.jpg \n', u' status = LR/lc\n', u" status_ref =<ref>Conifer Specialist Group 1998. [http://www.iucnredlist.org/search/details.php/42279/all ''Abies durangensis'']. [http://www.iucnredlist.org 2006 IUCN Red List of Threatened Species. ] Downloaded on 10 July 2007.</ref>\n", u' regnum = [[Plantae]]\n', u' divisio = [[Pinophyta]]\n', u' classis = [[Pinopsida]]\n', u' ordo = [[Pinales]]\n', u' familia = [[Pinaceae]]\n', u" genus = ''[[Abies]]'' \n", u" binomial = '''''Abies durangensis'''''\n", u" binomial_authority = [[Maximino Mart\xednez|Mart\xednez]]<ref name=ipni>{{ cite web |url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal |title=Plant Name Details for ''Abies durangensis'' |publisher=[[International Plant Names Index|IPNI]] |accessdate=6 de octubre de 2009}}</ref>\n", u' synonyms = \n']
cite web
[u'url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal ', u"title=Plant Name Details for ''Abies durangensis'' ", u'publisher=[[International Plant Names Index|IPNI]] ', u'accessdate=6 de octubre de 2009']
cite web
[u'url=http://www.conifers.org/pi/ab/durangensis.htm ', u"title=''Abies durangaensis'' description ", u'author=Christopher J. Earle ', u'date=11 de junio de 2006 ', u'accessdate=6 de octubre de 2009']
cita web
[u'url=http://www.tropicos.org/Name/24901700 ', u"t\xedtulo= ''{{PAGENAME}}''", u'fechaacceso=21 de enero de 2013 ', u'formato= ', u'obra= Tropicos.org. [[Missouri Botanical Garden]]']
PAGENAME
[]
PAGENAME
[]
cita web
[u'url=http://apps.kew.org/wcsp/namedetail.do?name_id=2609816 ', u"t\xedtulo=''{{PAGENAME}}'' ", u'work= World Checklist of Selected Plant Families']
PAGENAME
[]
listaref
[]
commonscat
[]
wikispecies
[u'Abies']
Instead of torturing yourself with parsing something that is not even expressible in a formal grammar, use the TextExtracts API:
https://es.wikipedia.org/w/api.php?action=query&prop=extracts&explaintext=1&titles=Abies%20durangensis&format=json
gives the following output:
Abies durangensis es una especie de conífera perteneciente a la
familia Pinaceae. Son endémicas de México donde se encuentran en
Durango, Chihuahua, Coahuila, Jalisco y Sinaloa. También es conocido
como 'Árbol de Coahuila' y 'pino mexicano'.
== Descripción ==
Es un árbol que alcanza los 40 metros de altura con un tronco recto que tiene 150 cm de diámetro. Las ramas son
horizontales y la corteza de color gris. Las hojas son verde brillante
de 20–35 mm de longitud por 1-1.5 mm de ancho. Tiene los conos de
semillas erectos en ramas laterales sobre un corto pedúnculo. Las
semillas son resinosas con una núcula amarilla con alas.
[...]
Append &exintro=1 to the URL if you need only the lead.
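The same request can be made from Python. The sketch below uses only the standard library and the query parameters shown in the URL above; the helper names (build_query, first_extract, fetch_extract) and the User-Agent string are hypothetical, and the response layout assumed in first_extract is the default JSON format, with pages keyed by page id.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://es.wikipedia.org/w/api.php"  # endpoint from the URL above


def build_query(title, intro_only=True):
    """Build the TextExtracts request URL for one page title."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }
    if intro_only:
        params["exintro"] = 1  # only the lead section, i.e. the gloss
    return API_URL + "?" + urllib.parse.urlencode(params)


def first_extract(response):
    """Pull the extract of the first page out of a decoded API response."""
    pages = response["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")


def fetch_extract(title, intro_only=True):
    """Perform the HTTP request (needs network access)."""
    request = urllib.request.Request(
        build_query(title, intro_only),
        headers={"User-Agent": "gloss-example/0.1"},  # hypothetical value
    )
    with urllib.request.urlopen(request) as resp:
        return first_extract(json.load(resp))
```

Calling fetch_extract("Abies durangensis") should then return the lead paragraph as plain text, while passing intro_only=False returns the whole page.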
You can use wikitextparser.
If wikitext holds your text string, all you need is:
import wikitextparser
parsed = wikitextparser.parse(wikitext)
Then you can get the plain text of the whole page or of a particular section. For example, parsed.sections[1].plain_text() will give you the plain text of the second section of the page, which seems to be what you are looking for.
