I wrote a script to extract, from a huge set of sentences, the ones that contain particular patterns. The problem is that, for some patterns, I check the value of an attribute at the beginning or end of the match to see whether the word is present in a particular list. I have 4 dictionaries, each with two lists of positive and negative words. So far the function I wrote works with one dictionary. I am wondering how I can improve it so that I can use all 4 dictionaries at the same time without duplicating the block that loops over the dictionary.
Here is an example with two dictionaries (since the script is quite long, I made a small example with all the necessary elements):
import os
import pandas as pd
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("fr_core_news_md")
##################### Lexicon lists
# Diko lexicon
data = pd.read_csv('/h/Ressources/Diko.txt', sep=";", header=None, encoding='utf-8')
data.columns = ["id", "terme", "pol"]
pol_diko_pos = data.loc[data.pol == 'positive', 'terme']
liste_pos_D = list(pol_diko_pos)
print(liste_pos_D[1])
pol_diko_neg = data.loc[data.pol == 'negative', 'terme']
liste_neg_D = list(pol_diko_neg)
# Polarimots lexicon
data_p = pd.read_csv('/h/Ressources/polarimots.txt', sep="\t", header=None, encoding='utf-8')
data_p.columns = ["ind", "terme", "cat", "pol", "fiabilité"]
pol_polarimot_pos = data_p.loc[data_p.pol == 'POS', 'terme']
liste_pos_P = list(pol_polarimot_pos)
print(liste_pos_P[1])
pol_polarimot_neg = data_p.loc[data_p.pol == 'NEG', 'terme']
liste_neg_P = list(pol_polarimot_neg)
############################# Result lists
sentence_not_extract_lexique_1 = []  # sentences whose matched token is absent from the first lexicon
sentence_extract_lexique_1 = []      # sentences whose matched token is present in the first lexicon
sentence_not_extract_lexique_2 = []  # sentences whose matched token is absent from the second lexicon
sentence_extract_lexique_2 = []      # sentences whose matched token is present in the second lexicon
list_token_pos = []        # tokens found in the positive list of the lexicon
list_token_neg = []        # tokens found in the negative list of the lexicon
list_token_not_found = []  # tokens not found in the lexicon
#PATTERN
pattern1 = [{"POS": {"IN": ["VERB", "AUX","ADV","NOUN","ADJ"]}}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "mais"} ]
pattern1_tup = (pattern1, 1, True)
pattern3 = [{"LOWER": {"IN": ["très", "trop"]}},
            {"POS": {"IN": ["ADV", "ADJ"]}}]
pattern3_tup = (pattern3, 0, True)
pattern4 = [{"POS": "ADV"},  # negation adverb
            {"POS": "PRON", "OP": "*"},
            {"POS": {"IN": ["VERB", "AUX"]}},
            {"TEXT": {"IN": ["pas", "plus", "aucun", "aucunement", "point", "jamais", "nullement", "rien"]}}]
pattern4_tup = (pattern4, None, False)
# Tuples of patterns
pattern_list_tup = [pattern1_tup, pattern3_tup, pattern4_tup]
pattern_name = ['first', 'second', 'third', 'fourth']
length_of_list = len(pattern_list_tup)
print('length', length_of_list)
# index (in the matched span) of the token to check in the lexicon
value_of_attribute = [0, -1, -1]
# Lists of lexicons to use: index 0 = negative list, index 1 = positive list
lexique_1 = [liste_neg_D, liste_pos_D]
lexique_2 = [liste_neg_P, liste_pos_P]
# text (example of some sentences)
file = ["Le film est superbe mais cette édition DVD est nulle !",
"J'allais dire déplorable, mais je serais peut-être un peu trop extrême.",
"Hélas, l'impression de violence, bien que très bien rendue, ne sauve pas cette histoire gothique moderne de la sécheresse scénaristique, le tout couvert d'un adultère dont le propos semble être gratuit, classique mais intéressant...",
"Tout ça ne me donne pas envie d'utiliser un pieu mais plutôt d'aller au pieu (suis-je drôle).",
"Oui biensur, il y a la superbe introduction des parapluies au debut, et puis lorsqu il sent des culs tout neufs et qu il s extase, j ai envie de faire la meme chose apres sur celui de ma voisine de palier (ma voisine de palier elle a un gros cul, mais j admets que je voudrais bien lui foute mon tarin), mais c est tout, apres c est un film tres noir, lent et qui te plonge dans le depression.",
"Et bien hélas ce DVD ne m'a pas appris grand chose par rapport à la doc des agences de voyages et la petite dame qui fait ses dessins est bien gentille mais tout tourne un peu trop autour d'elle.",
"Au final on passe de l'un a l'autre sans subtilité, et on n'arrive qu'à une caricature de plus : si Kechiche avait comme but initial déclaré de fustiger les préjugés, c'est le contraire qui ressort de ce ''film'' truffé de clichés très préjudiciables pour les quelques habitants de banlieue qui ne se reconnaîtront pas dans cette lourde farce.",
"-ci écorche les mots, les notes... mais surtout nos oreilles !"]
# Loop to check each sentence and extract the sentences with the specified pattern from above
for pat in range(0, length_of_list):
    matcher = Matcher(nlp.vocab)
    matcher.add("matching_2", None, pattern_list_tup[pat][0])
    for sent in file:
        doc = nlp(sent)
        matches = matcher(doc)
        for match_id, start, end in matches:
            span = doc[start:end].lemma_.split()
            #print(f"{pattern_name[pat]} pattern found: {span}")
This is the part I want to modify so that I can use it with another dictionary. The goal is to be able to retrieve the sentences extracted by 4 different dictionaries, compare them, and check which sentences are present in more than two lists.
            # Condition to use the lexicon and extract the sentence
            if pattern_list_tup[pat][2]:
                if span[value_of_attribute[pat]] in lexique_1[pattern_list_tup[pat][1]]:
                    if sent not in sentence_extract_lexique_1:
                        sentence_extract_lexique_1.append(sent)
                    if pattern_list_tup[pat][1] == 1:
                        list_token_pos.append(span[value_of_attribute[pat]])
                    if pattern_list_tup[pat][1] == 0:
                        list_token_neg.append(span[value_of_attribute[pat]])
                else:
                    list_token_not_found.append(span[value_of_attribute[pat]])  # the text form is not in the lexicon, the lemma form is needed
                    sentence_not_extract_lexique_1.append(sent)
            else:
                if sent not in sentence_extract_lexique_1:
                    sentence_extract_lexique_1.append(sent)
print(len(sentence_extract_lexique_1))
print(sentence_extract_lexique_1)
One solution I found is to duplicate the code above and change the name of the list where the sentences are stored, but since I have 2 dictionaries (actually 4 in the original), duplicating makes the code much longer. Is there a way to combine the looping over the dictionaries and append each result to the right list? For example, when I use lexique_1, all the extracted sentences are sent to "sentence_extract_lexique_1", and so on for the others.
In my opinion, try an if-elif-else chain, or just an if-elif block if the elif catches the specific condition of interest, since you're trying to catch one specific case to compare and check against the sentences. Keep in mind that the if-elif-else chain is a good method, but it only works when you need a single test to pass: once Python finds a test that passes, it skips the rest. That makes it efficient and lets you test for one specific condition.
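To make the routing concrete, here is a minimal, self-contained sketch of the idea the question asks about: pair each lexicon with its own result list, so that one block serves all of them. The sentences, lexicons, and names (lexiques, extracted, not_extracted) are toy stand-ins for the real script, and the last word of each sentence stands in for span[value_of_attribute[pat]]:

sentences = ["le film est superbe", "cette édition DVD est nulle"]
lexiques = {
    "lexique_1": {"superbe", "nulle"},  # toy stand-in for [liste_neg_D, liste_pos_D]
    "lexique_2": {"superbe"},           # toy stand-in for [liste_neg_P, liste_pos_P]
}
extracted = {name: [] for name in lexiques}      # plays the role of sentence_extract_lexique_N
not_extracted = {name: [] for name in lexiques}  # plays the role of sentence_not_extract_lexique_N

for sent in sentences:
    token = sent.split()[-1]  # stand-in for the matched span token
    for name, lexique in lexiques.items():
        if token in lexique:
            if sent not in extracted[name]:
                extracted[name].append(sent)
        else:
            not_extracted[name].append(sent)

# Sentences present in more than one result list can then be compared:
print(set(extracted["lexique_1"]) & set(extracted["lexique_2"]))

With 4 dictionaries, the only change is adding two more entries to lexiques; nothing in the loop is duplicated.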
This is my code:
import pandas as pd
import os
import glob as g

archivos = g.glob(r'C:\Users\Desktop\*.csv')
for archiv in archivos:
    nombre = os.path.splitext(archiv)[0]
    df = pd.read_csv(archiv, sep=",")
    d = pd.to_datetime(df['DATA_LEITURA'], format="%Y%m%d")
    df['FECHA_LECTURA'] = d.dt.date
    del df['DATA_LEITURA']
    df['CONSUMO'] = ""
    df['DIAS'] = ""
    df["SUMDIAS"] = ""
    df["SUMCONS"] = ""
    df["CONSANUAL"] = ""
    ordenado = df.sort_values(['NR_CPE', 'FECHA_LECTURA', 'HORA_LEITURA'], ascending=True)
    # Group by CPE
    agrupado = ordenado.groupby('NR_CPE')
    for name, group in agrupado:  # iterate over the groups
        indice = group.index.values
        inicio = indice[0]
        fin = indice[-1]
        # Fill the first reading of each CPE with zero (there is no previous reading)
        ordenado.CONSUMO.loc[inicio] = 0
        ordenado.DIAS.loc[inicio] = 0
        cont = 0
        for i in indice:  # iterate over the readings inside each CPE group
            if i > inicio and i <= fin:
                cont = cont + 1
                consumo = ordenado.VALOR_LEITURA[indice[cont]] - ordenado.VALOR_LEITURA[indice[cont - 1]]
                dias = (ordenado.FECHA_LECTURA[indice[cont]] - ordenado.FECHA_LECTURA[indice[cont - 1]]).days
                ordenado.CONSUMO.loc[i] = consumo
                ordenado.DIAS.loc[i] = dias
    # Do the sums; each result is a Series
    dias = agrupado['DIAS'].sum()
    consu = agrupado['CONSUMO'].sum()
    canu = (consu / dias) * 365
    # Counters for the number of occurrences of groups A, B and C
    conta = 0
    contb = 0
    contc = 0
    # These are Series, so iterate over them to do the comparison
    print "Groups:"
    for ind, sumdias in dias.iteritems():
        if sumdias <= 180:
            grupo = "A"
            conta = conta + 1
        elif sumdias > 180 and sumdias <= 365:
            grupo = "B"
            contb = contb + 1
        elif sumdias > 365:
            grupo = "C"
            contc = contc + 1
    print "group A: ", conta
    print "group B: ", contb
    print "group C: ", contc
    # Format the fields so that not all decimals are shown
    Fdias = dias.map('{:.0f}'.format)
    Fcanu = canu.map('{:.2f}'.format)
    frames = [Fdias, consu, Fcanu]
    concat = pd.concat(frames, axis=1).replace(['inf', 'nan'], [0, 0])
    with open(r'C:\Users\Documents\RPE_PORTUGAL\Datos.csv', 'a') as f:
        concat.to_csv(f, header=False, columns=['CPE', 'DIAS', 'CONSUMO', 'CONSUMO_ANUAL'])
    try:
        ordenado.to_excel(nombre + '.xls', columns=["NOME_DISTRITO",
            "NR_CPE", "MARCA_EQUIPAMENTO", "NR_EQUIPAMENTO", "VALOR_LEITURA", "REGISTADOR", "TIPO_REGISTADOR",
            "TIPO_DADOS_RECOLHIDOS", "FACTOR_MULTIPLICATIVO_FINAL", "NR_DIGITOS_INTEIRO", "UNIDADE_MEDIDA",
            "TIPO_LEITURA", "MOTIVO_LEITURA", "ESTADO_LEITURA", "HORA_LEITURA", "FECHA_LECTURA", "CONSUMO", "DIAS"],
            index=False)
        print archiv
        print "==============================================="
        print "*****The file was created successfully*****"
        print "==============================================="
    except IOError:
        print "==================================================="
        print "There was an error writing the file!"
        print "==================================================="
This takes a file where I have readings of energy consumption from different dates for every light meter ('NR_CPE') and does some calculations:
Calculate the energy consumption for every 'NR_CPE' by subtracting the previous reading from the next one, and put the result in a new column named 'CONSUMO'.
Calculate the number of days between readings and sum up the number of days.
Add up the consumption for every 'NR_CPE' and calculate the annual consumption.
Finally, I want to classify every light meter ('NR_CPE') by the number of days it has readings: A if less than 180 days, B between 180 days and 1 year, and C more than a year.
And finally write this result to two different files.
Any idea of how I should re-code this to get the same output faster?
Thank you all.
BTW this is my dataset:
,NOME_DISTRITO,NR_CPE,MARCA_EQUIPAMENTO,NR_EQUIPAMENTO,VALOR_LEITURA,REGISTADOR,TIPO_REGISTADOR,TIPO_DADOS_RECOLHIDOS,FACTOR_MULTIPLICATIVO_FINAL,NR_DIGITOS_INTEIRO,UNIDADE_MEDIDA,TIPO_LEITURA,MOTIVO_LEITURA,ESTADO_LEITURA,DATA_LEITURA,HORA_LEITURA
0,GUARDA,A002000642VW,101,1865411,4834,001,S,1,1,4,kWh,1,1,A,20150629,205600
1,GUARDA,A002000642VW,101,1865411,4834,001,S,1,1,4,kWh,2,2,A,20160218,123300
2,GUARDA,A002000642VJ,122,204534,25083,001,S,1,1,5,kWh,1,1,A,20150629,205700
3,GUARDA,A002000642VJ,122,204534,27536,001,S,1,1,5,kWh,2,2,A,20160218,123200
4,GUARDA,A002000642HR,101,1383899,11734,001,S,1,1,5,kWh,1,1,A,20150629,205600
5,GUARDA,A002000642HR,101,1383899,11800,001,S,1,1,5,kWh,2,2,A,20160218,123000
6,GUARDA,A002000995VM,101,97706436,12158,001,S,1,1,5,kWh,1,3,A,20150713,155300
7,GUARDA,A002000995VM,101,97706436,12163,001,S,1,1,5,kWh,2,2,A,20160129,162300
8,GUARDA,A002000995VM,101,97706436,12163,001,S,1,1,5,kWh,2,2,A,20160202,195800
9,GUARDA,A2000995VM,101,97706436,12163,001,S,1,1,5,kWh,1,3,A,20160404,145200
10,GUARDA,A002000996LV,168,5011703276,3567,001,V,1,1,6,kWh,1,1,A,20150528,205900
11,GUARDA,A02000996LV,168,5011703276,3697,001,V,1,1,6,kWh,2,2,A,20150929,163500
12,GUARDA,A02000996LV,168,5011703276,1287,002,P,1,1,6,kWh,1,1,A,20150528,205900
Generally you want to avoid for loops in pandas.
For example, the first loop, where you calculate total consumption and days, could be rewritten as a groupby-apply, something like:
def last_minus_first(df):
    columns_of_interest = df[['VALOR_LEITURA', 'days']]
    diff = columns_of_interest.iloc[-1] - columns_of_interest.iloc[0]
    return diff

df['date'] = pd.to_datetime(df['DATA_LEITURA'], format="%Y%m%d")
df['days'] = (df['date'] - pd.datetime(1970, 1, 1)).dt.days  # create days column
df.groupby('NR_CPE').apply(last_minus_first)
(By the way, I don't understand why you are subtracting each entry from the previous one; surely for meter readings this is the same as last minus first?)
Then given the result of the above as consumption, you can replace your second for loop (for ind, sumdias in dias.iteritems()) with something like:
pd.cut(consumption.days, [-1, 180, 365, np.inf], labels=['a', 'b', 'c']).value_counts()
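For instance, on a toy series standing in for dias (the summed days per meter), pd.cut reproduces the A/B/C counts without the manual counters:

import numpy as np
import pandas as pd

# Toy day totals per meter, standing in for the real `dias` series.
dias = pd.Series([120, 200, 400, 90, 370], name='DIAS')

# Same thresholds as the original if/elif chain: <=180 -> A, 181-365 -> B, >365 -> C.
grupos = pd.cut(dias, [-1, 180, 365, np.inf], labels=['A', 'B', 'C'])
print(grupos.value_counts())  # A: 2, B: 1, C: 2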
I'm coding a project in which I have 2 files (structureDonnees.py and calculUser.py) working together, plus a test file.
In structureDonnees.py I have this function, which reads a dataset of cars and builds data structures:
# -*-encoding:utf-8-*-
import csv
import sys  # for the float max and min values
from calculUser import *
from trajetUser import *

def recupVoiture():
    # name of the database file
    nomFichier = 'CO2_passenger_cars_v10.csv'
    # open the file nomFichier for reading
    opener = open(nomFichier, "r")
    lectureFichier = csv.reader(opener, delimiter='\t')
    # dict holding the fuel types
    fuelType = dict()
    # dict holding the cars
    voiture = dict()
    # dict holding the CO2 emissions in g/km
    emission = dict()
    # minimum and maximum emission
    min_emission = sys.float_info.max   # initialised to max(float) so we are sure every emission is smaller
    max_emission = -sys.float_info.max  # initialised to the most negative float so we are sure every emission is larger
    for row in lectureFichier:
        # if the row exists
        if row:
            # build the voiture dictionary
            if voiture.has_key(row[10]):
                if row[11].upper() not in voiture[row[10]]:
                    voiture[row[10]].append("%s" % row[11].upper())  # add the model
            else:
                voiture[row[10]] = []  # create an empty list holding the models and their versions
                voiture[row[10]].append("%s" % row[11])  # add the model and its version
            # build the fuelType dictionary
            if fuelType.has_key(row[10]):
                fuelType[row[10]].append(row[19].upper())  # add the fuel type used by the matching car in voiture{}
            else:
                fuelType[row[10]] = []  # create an empty list holding the fuel types
                fuelType[row[10]].append(row[19])
            # build the emission dictionary
            if emission.has_key(row[10]):
                emission[row[10]].append(row[14])  # add the CO2 quantity emitted by the matching car in voiture{}
                min_emission = minEmission(float(row[14]), min_emission)
                max_emission = maxEmission(float(row[14]), max_emission)
            else:
                emission[row[10]] = []  # create an empty list holding the CO2 emissions
                emission[row[10]].append(row[14])
                min_emission = minEmission(float(row[14]), min_emission)
                max_emission = maxEmission(float(row[14]), max_emission)
    # close the file
    opener.close()
    # the return value is a list containing the generated data structures
    res = [voiture, fuelType, emission, min_emission, max_emission]
    return res
In calculUser.py I defined the minEmission and maxEmission functions:
def minEmission(emissionFichier, min_emission):
    if emissionFichier < min_emission:
        min_emission = emissionFichier
    return min_emission

def maxEmission(emissionFichier, max_emission):
    if emissionFichier > max_emission:
        max_emission = emissionFichier
    return max_emission
When I execute test.py, I get an error at this line:
tableau = recupVoiture()
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    tableau = recupVoiture()
  File "/home/user/Polytech/ge3/ProjetPython/structureDonnees.py", line 60, in recupVoiture
    min_emission = minEmission(float(row[14]), min_emission)
NameError: global name 'minEmission' is not defined
I don't understand why I get this error. When I execute everything except test.py I get no error, but test.py fails because minEmission and maxEmission are not defined.
Is it because I'm calling one function while defining another?
How could I fix it?
I fixed the problem. It seems my functions minEmission() and maxEmission() couldn't reference max_emission and min_emission, since those variables are declared in structureDonnees.py and not in calculUser.py.
I fixed it by creating an intermediate variable that takes the value of min_emission or max_emission and is returned, instead of min_emission or max_emission themselves.
Plus, I had to do a from calculUser import minEmission, maxEmission directly in the recupVoiture() function. I know that's awful, but it solved the problem. I'll use it until I find a better, cleaner solution.
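For reference, a cleaner layout would be an explicit top-level import instead of import *. This is only a sketch, assuming the NameError came from the star import misfiring (for example through a circular import that leaves calculUser partially initialised); it requires that calculUser does not import structureDonnees back:

# calculUser.py -- pure helper functions, no import of structureDonnees
def minEmission(emissionFichier, min_emission):
    return min(emissionFichier, min_emission)  # equivalent to the if-based version

def maxEmission(emissionFichier, max_emission):
    return max(emissionFichier, max_emission)  # equivalent to the if-based version

# structureDonnees.py -- import the two names explicitly, once, at module level
from calculUser import minEmission, maxEmission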
Thanks for the help guys, I'll write a better post/code if I have to ask for help again! :)
I am trying to get clean sentences from the Wikipedia page of a species.
For instance Abies durangensis (pid = 1268312). Using the Wikipedia API in Python to obtain the Wikipedia page:
import requests

eswiki_URI = 'https://es.wikipedia.org/w/api.php'  # Spanish Wikipedia API endpoint
pid = 1268312
q = {'action': 'query',
     'pageids': pid,
     'prop': 'revisions',
     'rvprop': 'content',
     'format': 'json'}
result = requests.get(eswiki_URI, params=q).json()
wikitext = result["query"]["pages"].values()[0]["revisions"][0]["*"]
gives:
{{Ficha de taxón
| name = ''Abies durangensis''
| image = Abies tamazula dgo.jpg
| status = LR/lc
| status_ref =<ref>Conifer Specialist Group 1998. [http://www.iucnredlist.org/search/details.php/42279/all ''Abies durangensis'']. [http://www.iucnredlist.org 2006 IUCN Red List of Threatened Species. ] Downloaded on 10 July 2007.</ref>
| regnum = [[Plantae]]
| divisio = [[Pinophyta]]
| classis = [[Pinopsida]]
| ordo = [[Pinales]]
| familia = [[Pinaceae]]
| genus = ''[[Abies]]''
| binomial = '''''Abies durangensis'''''
| binomial_authority = [[Maximino Martínez|Martínez]]<ref name=ipni>{{ cite web |url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal |title=Plant Name Details for ''Abies durangensis'' |publisher=[[International Plant Names Index|IPNI]] |accessdate=6 de octubre de 2009}}</ref>
| synonyms =
}}
'''''Abies durangensis''''' es una [[especie]] de [[conífera]] perteneciente a la familia [[Pinaceae]]. Son [[endémica]]s de [[México]] donde se encuentran en [[Durango]], [[Chihuahua]], [[Coahuila]], [[Jalisco]] y [[Sinaloa]]. También es conocido como 'Árbol de Coahuila' y 'pino mexicano'.<ref name=cje>{{ cite web |url=http://www.conifers.org/pi/ab/durangensis.htm |title=''Abies durangaensis'' description |author=Christopher J. Earle |date=11 de junio de 2006 |accessdate=6 de octubre de 2009}}</ref>
== Descripción ==
Es un [[árbol]] que alcanza los 40 metros de altura con un [[Tronco (botánica)|tronco]] recto que tiene 150 cm de diámetro. Las [[rama]]s son horizontales y la [[corteza (árbol)|corteza]] de color gris. Las [[hoja]]s son verde brillante de 20–35 mm de longitud por 1-1.5 mm de ancho. Tiene los conos de [[semilla]]s erectos en ramas laterales sobre un corto [[pedúnculo]]. Las [[semilla]]s son [[resina|resinosas]] con una [[núcula]] amarilla con alas.
== Taxonomía ==
''Abies durangensis'' fue descrita por [[Maximino Martínez]] y publicado en ''[[Anales del instituto de Biología de la Universidad Nacional de México]]'' 13: 2. 1942.<ref name = Trop>{{cita web |url=http://www.tropicos.org/Name/24901700 |título= ''{{PAGENAME}}''|fechaacceso=21 de enero de 2013 |formato= |obra= Tropicos.org. [[Missouri Botanical Garden]]}}</ref>
;[[Etimología]]:
'''''Abies''''': nombre genérico que viene del nombre [[latin]]o de ''[[Abies alba]]''.<ref>[http://www.calflora.net/botanicalnames/pageAB-AM.html En Nombres Botánicos]</ref>
'''''durangensis''''': [[epíteto]] geográfico que alude a su localización en [[Durango]].
;Variedades:
* ''Abies durangensis var. coahuilensis'' (I. M. Johnst.) Martínez
;[[sinonimia (biología)|Sinonimia]]:
* ''Abies durangensis subsp. neodurangensis'' (Debreczy, I.Rácz & R.M.Salazar) Silba'
* ''Abies neodurangensis'' Debreczy, I.Rácz & R.M.Salazar<ref>[http://www.theplantlist.org/tpl/record/kew-2609816 ''{{PAGENAME}}'' en PlantList]</ref><ref name = Kew>{{cita web|url=http://apps.kew.org/wcsp/namedetail.do?name_id=2609816 |título=''{{PAGENAME}}'' |work= World Checklist of Selected Plant Families}}</ref>
;''var. coahuilensis'' (I.M.Johnst.) Martínez
* ''Abies coahuilensis'' I.M.Johnst.
* ''Abies durangensis subsp. coahuilensis'' (I.M.Johnst.) Silba
== Véase también ==
* [[Terminología descriptiva de las plantas]]
* [[Anexo:Cronología de la botánica]]
* [[Historia de la Botánica]]
* [[Pinaceae#Descripción|Características de las pináceas]]
== Referencias ==
{{listaref}}
== Bibliografía ==
# CONABIO. 2009. Catálogo taxonómico de especies de México. 1. In Capital Nat. México. CONABIO, Mexico City.
== Enlaces externos ==
{{commonscat}}
{{wikispecies|Abies}}
* http://web.archive.org/web/http://ww.conifers.org/pi/ab/durangensis.htm
* http://www.catalogueoflife.org/search.php
[[Categoría:Abies|durangensis]]
[[Categoría:Plantas descritas en 1942]]
[[Categoría:Plantas descritas por Martínez]]
I am interested in the (unmarked) text just after the infobox, the gloss:
Abies durangensis es una especie de conífera perteneciente a la familia Pinaceae. Son endémicas de México donde se encuentran en Durango, Chihuahua, Coahuila, Jalisco y Sinaloa. También es conocido como 'Árbol de Coahuila' y 'pino mexicano'.
So far I have consulted https://www.mediawiki.org/wiki/Alternative_parsers, where I found that mwparserfromhell is the least complicated parser in Python. However, I don't clearly see how to do what I intend. When I use the example proposed in the documentation, I just can't see where the gloss is.
for t in templates:
    print(t.name).encode('utf-8')
    print(t.params)
Ficha de taxón
[u" name = ''Abies durangensis''\n", u' image = Abies tamazula dgo.jpg \n', u' status = LR/lc\n', u" status_ref =<ref>Conifer Specialist Group 1998. [http://www.iucnredlist.org/search/details.php/42279/all ''Abies durangensis'']. [http://www.iucnredlist.org 2006 IUCN Red List of Threatened Species. ] Downloaded on 10 July 2007.</ref>\n", u' regnum = [[Plantae]]\n', u' divisio = [[Pinophyta]]\n', u' classis = [[Pinopsida]]\n', u' ordo = [[Pinales]]\n', u' familia = [[Pinaceae]]\n', u" genus = ''[[Abies]]'' \n", u" binomial = '''''Abies durangensis'''''\n", u" binomial_authority = [[Maximino Mart\xednez|Mart\xednez]]<ref name=ipni>{{ cite web |url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal |title=Plant Name Details for ''Abies durangensis'' |publisher=[[International Plant Names Index|IPNI]] |accessdate=6 de octubre de 2009}}</ref>\n", u' synonyms = \n']
cite web
[u'url=http://www.ipni.org:80/ipni/idPlantNameSearch.do;jsessionid=0B15264060FDA0DCF216D997C89185EC?id=676563-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Bjsessionid%3D0B15264060FDA0DCF216D997C89185EC%3Ffind_wholeName%3DAbies%2Bdurangensis%26output_format%3Dnormal ', u"title=Plant Name Details for ''Abies durangensis'' ", u'publisher=[[International Plant Names Index|IPNI]] ', u'accessdate=6 de octubre de 2009']
cite web
[u'url=http://www.conifers.org/pi/ab/durangensis.htm ', u"title=''Abies durangaensis'' description ", u'author=Christopher J. Earle ', u'date=11 de junio de 2006 ', u'accessdate=6 de octubre de 2009']
cita web
[u'url=http://www.tropicos.org/Name/24901700 ', u"t\xedtulo= ''{{PAGENAME}}''", u'fechaacceso=21 de enero de 2013 ', u'formato= ', u'obra= Tropicos.org. [[Missouri Botanical Garden]]']
PAGENAME
[]
PAGENAME
[]
cita web
[u'url=http://apps.kew.org/wcsp/namedetail.do?name_id=2609816 ', u"t\xedtulo=''{{PAGENAME}}'' ", u'work= World Checklist of Selected Plant Families']
PAGENAME
[]
listaref
[]
commonscat
[]
wikispecies
[u'Abies']
Instead of torturing yourself with parsing something that isn't even expressible in a formal grammar, use the TextExtracts API:
https://es.wikipedia.org/w/api.php?action=query&prop=extracts&explaintext=1&titles=Abies%20durangensis&format=json
gives the following output:
Abies durangensis es una especie de conífera perteneciente a la
familia Pinaceae. Son endémicas de México donde se encuentran en
Durango, Chihuahua, Coahuila, Jalisco y Sinaloa. También es conocido
como 'Árbol de Coahuila' y 'pino mexicano'.
== Descripción ==
Es un árbol que alcanza los 40 metros de altura con un tronco recto que tiene 150 cm de diámetro. Las ramas son
horizontales y la corteza de color gris. Las hojas son verde brillante
de 20–35 mm de longitud por 1-1.5 mm de ancho. Tiene los conos de
semillas erectos en ramas laterales sobre un corto pedúnculo. Las
semillas son resinosas con una núcula amarilla con alas.
[...]
Append &exintro=1 to the URL if you need only the lead.
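The same request can be issued from Python with requests, reusing the parameters from the URL above; the response handling below follows the standard action=query shape:

import requests

params = {'action': 'query',
          'prop': 'extracts',
          'explaintext': 1,
          'exintro': 1,  # drop this key to get the plain text of the whole page
          'titles': 'Abies durangensis',
          'format': 'json'}
result = requests.get('https://es.wikipedia.org/w/api.php', params=params).json()
page = next(iter(result['query']['pages'].values()))  # single page in the reply
print(page['extract'])  # the gloss as clean text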
You can use wikitextparser.
If wikitext holds your text string, all you need is:
import wikitextparser
parsed = wikitextparser.parse(wikitext)
Then you can get the plain-text portion of the whole page or of a particular section; for example, parsed.sections[1].plain_text() will give you the plain text of the second section of the page, which seems to be what you are looking for.
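For instance, on a small stand-in for the wikitext above (a toy string, not the full page), the sections and their plain text can be inspected like this:

import wikitextparser

# Toy stand-in for the fetched page source: an infobox template plus the lead.
wikitext = ("{{Ficha de taxón | name = ''Abies durangensis'' }}\n"
            "'''''Abies durangensis''''' es una [[especie]] de [[conífera]].\n"
            "== Descripción ==\n"
            "Es un [[árbol]] que alcanza los 40 metros de altura.")

parsed = wikitextparser.parse(wikitext)
for i, section in enumerate(parsed.sections):
    # the lead section has title None; templates are dropped by plain_text()
    print(i, section.title, '->', section.plain_text().strip())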