I'm looking for a one-liner (using str.join, I think) to cut a long string when it contains too many words. I have the beginning, but I don't know how to insert the \n:
example = "Au Fil Des Antilles De La Martinique a Saint Barthelemy"
nmbr_word = len(example.split(" "))
if nmbr_word >= 6:
    # cut the string to get
    result = "Au Fil Des Antilles De La\nMartinique a Saint Barthelemy"
Thanks
How about using the textwrap module?
>>> import textwrap
>>> s = "Au Fil Des Antilles De La Martinique a Saint Barthelemy"
>>> textwrap.wrap(s, 30)
['Au Fil Des Antilles De La', 'Martinique a Saint Barthelemy']
>>> "\n".join(textwrap.wrap(s, 30))
'Au Fil Des Antilles De La\nMartinique a Saint Barthelemy'
How about chunking the words six at a time? Note that nmbr_word in your code is a count, not a list, so split the string first (and use range on Python 3, where xrange no longer exists):
words = example.split(" ")
'\n'.join(' '.join(words[i:i+6]) for i in range(0, len(words), 6))
I want to count the occurrences of each word in a text so as to spot the key words that come up the most.
The script works quite well, but the problem is that the text is written in French, so important key words would be missed.
For example, the word Europe may appear in the text as l'Europe or en Europe.
In the first case, the code removes the apostrophe, and l'Europe is counted as the single word leurope in the final result.
How can I improve the code to split l' from Europe?
import string
# Open the file in read mode
#text = open("debat.txt", "r")
text = ["Monsieur Mitterrand, vous avez parlé une minute et demie de moins que Monsieur Chirac dans cette première partie. Je préfère ne pas avoir parlé une minute et demie de plus pour dire des choses aussi irréelles et aussi injustes que celles qui viennent d'être énoncées. Si vous êtes d'accord, nous arrêtons cette première partie, nous arrêtons les chronomètres et nous repartons maintenant pour une seconde partie en parlant de l'Europe. Pour les téléspectateurs, M. Mitterrand a parlé 18 minutes 36 et M. Chirac, 19 minutes 56. Ce n'est pas un drame !... On vous a, messieurs, probablement jamais vus plus proches à la fois physiquement et peut-être politiquement que sur les affaires européennes... les Français vous ont vus, en effet, à la télévision, participer ensemble à des négociations, au coude à coude... voilà, au moins, un domaine dans lequel, sans aucun doute, vous connaissez fort bien, l'un et l'autre, les opinions de l'un et de l'autre. Nous avons envie de vous demander ce qui, aujourd'hui, au -plan européen, vous sépare et vous rapproche ?... et aussi lequel de vous deux a le plus évolué au cours des quelques années qui viennent de s'écouler ?... #"]
# Create an empty dictionary
d = dict()
# Loop through each line of the file
for line in text:
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Remove the punctuation marks from the line
    line = line.translate(line.maketrans("", "", string.punctuation))
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1
sorted_tuples = sorted(d.items(), key=lambda item: item[1], reverse=True)
sorted_dict = {k: v for k, v in sorted_tuples}
# Print the contents of dictionary
for key in list(sorted_dict.keys()):
    print(key, ":", sorted_dict[key])
line = line.translate(line.maketrans("", "", string.punctuation))
removes all punctuation characters (l'Europe becomes lEurope). Instead, you may want to replace them with spaces, using for example:
for p in string.punctuation:
    line = line.replace(p, ' ')
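As a quick check (a minimal sketch of just this substitution step; the sample line is illustrative):

```python
import string

line = "en parlant de l'europe"
# Replace each punctuation character with a space instead of deleting it,
# so that "l'europe" splits into "l" and "europe".
for p in string.punctuation:
    line = line.replace(p, ' ')
print(line.split())  # -> ['en', 'parlant', 'de', 'l', 'europe']
```

Calling split() with no argument here also swallows the double space the replacement leaves behind, which plain split(" ") would turn into empty words.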
Where you currently have:
line = line.translate(line.maketrans("", "", string.punctuation))
... you can add the following line before it:
line = line.translate(line.maketrans("'", " "))
This will replace the ' character with a space wherever it's found, and the line using string.punctuation will behave exactly as before, except that it will not encounter any ' characters since we have already replaced them.
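A quick check of that order of operations (a sketch; the sample string is illustrative):

```python
import string

line = "l'europe en europe"
# First map the apostrophe to a space...
line = line.translate(line.maketrans("'", " "))
# ...then strip the remaining punctuation exactly as before.
line = line.translate(line.maketrans("", "", string.punctuation))
print(line.split())  # -> ['l', 'europe', 'en', 'europe']
```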
My goal is to write a Python 3 function that takes lastnames as rows from a CSV and splits them correctly into lastname_1 and lastname_2.
Spanish names have the following structure: firstname + lastname_1 + lastname_2
Forgetting about the firstname, I would like code that splits a lastname into these two categories (lastname_1, lastname_2). What is the challenge?
Sometimes the lastname(s) have prepositions:
DE BLAS ZAPATA: "De Blas" is the first lastname and "Zapata" the second lastname
MATIAS DE LA MANO: "Matias" is lastname_1, "de la Mano" lastname_2
LOPEZ FERNANDEZ DE VILLAVERDE: "Lopez Fernandez" is lastname_1, "de Villaverde" lastname_2
DE MIGUEL DEL CORRAL: "De Miguel" is lastname_1, "del Corral" lastname_2
VIDAL DE LA PEÑA SOLIS: "Vidal" is lastname_1, "de la Peña Solis" lastname_2
MONTAVA DEL ARCO: "Montava" is lastname_1, "del Arco" lastname_2
and the list could go on and on.
I am currently stuck; I found similar code in Perl but struggled to understand the main idea behind it well enough to translate it to Python 3. Here is my attempt:
import re
preposition_lst = ['DE LO ', 'DE LA ', 'DE LAS ', 'DEL ', 'DELS ', 'DE LES ', 'DO ', 'DA ', 'DOS ', 'DAS ', 'DE ']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PEÑA", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]
for case in cases:
    for prep in preposition_lst:
        m = re.search(f"(.*)({prep}[A-ZÀ-ÚÄ-Ü]+)", case, re.I)  # re.I makes it case-insensitive
        try:
            print(m.groups())
            print(prep)
        except AttributeError:  # m is None when there is no match
            pass
Try this one:
import re
preposition_lst = ['DE','LO','LA','LAS','DEL','DELS','LES','DO','DA','DOS','DAS']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PEÑA", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]
def last_name(name):
    case = re.findall(r'\w+', name)
    res = list(filter(lambda x: x not in preposition_lst, case))
    return res

list_final = []
for case in cases:
    list_final.append(last_name(case))
for i in range(len(list_final)):
    if len(list_final[i]) > 2:
        name1 = ' '.join(list_final[i][:2])
        name2 = ' '.join(list_final[i][2:])
        list_final[i] = [name1, name2]
print(list_final)
#[['BLAS', 'ZAPATA'], ['MATIAS', 'MANO'], ['LOPEZ FERNANDEZ', 'VILLAVERDE'], ['MIGUEL', 'CORRAL'], ['VIDAL', 'PEÑA'], ['MONTAVA', 'ARCO'], ['CASAS', 'VALLE']]
Does this suit your requirements?
import re
preposition_lst = ['DE LO', 'DE LA', 'DE LAS', 'DE LES', 'DEL', 'DELS', 'DO', 'DA', 'DOS', 'DAS', 'DE']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PENA SOLIS", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]
def split_name(name):
    f1 = re.compile("(.*)({preps}(.+))".format(preps="(" + " |".join(preposition_lst) + ")"))
    m1 = f1.match(name)  # match on the argument, not the loop variable
    if m1 and len(m1.group(1)) != 0:
        return m1.group(1).strip(), m1.group(2).strip()
    # No preposition found (or the name starts with one): split off the last word
    return " ".join(name.split()[:-1]), name.split()[-1]

for case in cases:
    first, second = split_name(case)
    print("{} --> name 1 = {}, name 2 = {}".format(case, first, second))
# DE BLAS ZAPATA --> name 1 = DE BLAS, name 2 = ZAPATA
# MATIAS DE LA MANO --> name 1 = MATIAS, name 2 = DE LA MANO
# LOPEZ FERNANDEZ DE VILLAVERDE --> name 1 = LOPEZ FERNANDEZ, name 2 = DE VILLAVERDE
# DE MIGUEL DEL CORRAL --> name 1 = DE MIGUEL, name 2 = DEL CORRAL
# VIDAL DE LA PENA SOLIS --> name 1 = VIDAL, name 2 = DE LA PENA SOLIS
# MONTAVA DEL ARCO --> name 1 = MONTAVA, name 2 = DEL ARCO
# DOS CASAS VALLE --> name 1 = DOS CASAS, name 2 = VALLE
I am trying to split lines in a txt file, but I am having the following problem:
with open("KARDEX.txt", 'r', encoding="latin-1") as file:
    data = []
    for line in file:
        data.append(line)
print(data[0])
>>ÿþ# Número de artículo Descripción del artículo Clase de operación Código de deudor/acreedor Nombre de deudor/acreedor
d = print(data[0].split(" "))
>> ['ÿþ#\x00', '\x00N\x00ú\x00m\x00e\x00r\x00o\x00 \x00d\x00e\x00 \x00a\x00r\x00t\x00í\x00c\x00u\x00l\x00o\x00', '\x00D\x00e\x00s\x00c\x00r\x00i\x00p\x00c\x00i\x00ó\x00n\x00 \x00d\x00e\x00l\x00 \x00a\x00r\x00t\x00í\x00c\x00u\x00l\x00o\x00', '\x00C\x00l\x00a\x00s\x00e\x00 \x00d\x00e\x00 \x00o\x00p\x00e\x00r\x00a\x00c\x00i\x00ó\x00n\x00']
If you want to split by 2 or more spaces then use:
re.split(r" {2,}", st)
st:
st = "ÿþ# Número de artículo Descripción del artículo Clase de operación Código de deudor/acreedor Nombre de deudor/acreedor"
re.split(r" {2,}", st)
Output:
['ÿþ#',
'Número de artículo',
'Descripción del artículo',
'Clase de operación',
'Código de deudor/acreedor',
'Nombre de deudor/acreedor']
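Separately, the ÿþ prefix and the \x00 bytes in the sample suggest the file is UTF-16-encoded rather than Latin-1, so opening it with encoding="utf-16" should make them disappear. A minimal sketch of the decoding (the byte string is reconstructed from the sample, so treat it as illustrative):

```python
# "ÿþ" is the UTF-16 little-endian byte-order mark read as Latin-1, and the
# "\x00" bytes are the high bytes of the 2-byte UTF-16 code units.
raw = b"\xff\xfe#\x00 \x00N\x00\xfa\x00m\x00e\x00r\x00o\x00"
print(raw.decode("utf-16"))  # -> '# Número'
```

With the file itself, that would be open("KARDEX.txt", "r", encoding="utf-16").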
I've got the following code:
from itemspecs import itemspecs
x = itemspecs.split('###')
res = []
for item in x:
res.append(item.split('##'))
print(res)
This imports a string from another document and gives me the following:
[['\n1\nWie was de Nederlandse scheepvaarder die de Spaanse zilvervloot veroverde?\n',
'\nA. Michiel de Ruyter\nB. Piet Heijn\nC. De zilvervloot is nooit door de Nederlanders
onderschept\n', '\nB\n', '\nAntwoord B De Nederlandse vlootvoogd werd hierdoor bekend.\n'],
['\n2\nIn welk land ligt Upernavik?\n', '\nA. Antartica\nB. Canada\nC. Rusland\nD.
Groenland\nE. Amerika\n', '\nD\n', '\nAntwoord D Het is een dorp in Groenland met 1224
inwoners.\n']]
But now I want to remove all the \n from every end and beginning of every element in this list. How can I do this?
You can do that by stripping "\n" as follows:
a = "\nhello\n"
stripped_a = a.strip("\n")
So what you need to do is iterate through the list and apply strip to each string, as shown below:
res_1 = []
for i in res:
    tmp = []
    for j in i:
        tmp.append(j.strip("\n"))
    res_1.append(tmp)
The above only removes \n from the start and the end. If you want to remove all the newlines in a string, use .replace("\n", " ") as shown below:
res_1 = []
for i in res:
    tmp = []
    for j in i:
        tmp.append(j.replace("\n", " "))
    res_1.append(tmp)
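Either version can also be written as a nested list comprehension, which keeps the two-level structure (a sketch with a shortened res for illustration):

```python
res = [['\n1\nWie was de Nederlandse scheepvaarder?\n', '\nB\n']]
# Strip leading and trailing newlines from every string, keeping the nesting.
res_1 = [[j.strip("\n") for j in i] for i in res]
print(res_1)  # -> [['1\nWie was de Nederlandse scheepvaarder?', 'B']]
```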
strip() is the easiest method in this case, but if you want to do any advanced text processing in the future, it isn't a bad idea to learn regular expressions:
import re
from pprint import pprint
l = [['\n1\nWie was de Nederlandse scheepvaarder die de Spaanse zilvervloot veroverde?\n',
'\nA. Michiel de Ruyter\nB. Piet Heijn\nC. De zilvervloot is nooit door de Nederlandersonderschept\n',
'\nB\n',
'\nAntwoord B De Nederlandse vlootvoogd werd hierdoor bekend.\n'],
['\n2\nIn welk land ligt Upernavik?\n',
'\nA. Antartica\nB. Canada\nC. Rusland\nD.Groenland\nE. Amerika\n',
'\nD\n', '\nAntwoord D Het is een dorp in Groenland met 1224inwoners.\n']]
l = [[re.sub(r'^([\s]*)|([\s]*)$', '', j)] for i in l for j in i]
pprint(l, width=120)
Output:
[['1\nWie was de Nederlandse scheepvaarder die de Spaanse zilvervloot veroverde?'],
['A. Michiel de Ruyter\nB. Piet Heijn\nC. De zilvervloot is nooit door de Nederlandersonderschept'],
['B'],
['Antwoord B De Nederlandse vlootvoogd werd hierdoor bekend.'],
['2\nIn welk land ligt Upernavik?'],
['A. Antartica\nB. Canada\nC. Rusland\nD.Groenland\nE. Amerika'],
['D'],
['Antwoord D Het is een dorp in Groenland met 1224inwoners.']]
I have two columns that represent the right sequence and the predicted sequence. I want to compute statistics on the number of deletions, substitutions and insertions by comparing each right sequence with its predicted sequence.
I used the Levenshtein distance to get the number of characters which differ (see the function below) and an error_dist function to get the most common errors (in terms of substitutions).
Here is a sample of my data:
de de
date date
pour pour
etoblissemenls etablissements
avec avec
code code
communications communications
r r
seiche seiche
titre titre
publiques publiques
ht ht
bain bain
du du
ets ets
premier premier
dans dans
snupape soupape
minimum minimum
blanc blanc
fr fr
nos nos
au au
bl bl
consommations consommations
somme somme
euro euro
votre votre
offre offre
forestier forestier
cs cs
de de
pour pour
de de
paye r
cette cette
votre votre
valeurs valeurs
des des
gfda gfda
tva tva
pouvoirs pouvoirs
de de
revenus revenus
offre offre
ht ht
card card
noe noe
montant montant
r r
comprises comprises
quantite quantite
nature nature
ticket ticket
ou ou
rapide rapide
de de
sous sous
identification identification
du du
document document
suicide suicide
bretagne bretagne
tribunal tribunal
services services
cif cif
moyen moyen
gaec gaec
total total
lorsque lorsque
contact contact
fermeture fermeture
la la
route route
tva tva
ia ia
noyal noyal
brie brie
de de
nanterre nanterre
charcutier charcutier
semestre semestre
de de
rue rue
le le
bancaire bancaire
martigne martigne
recouvrement recouvrement
la la
sainteny sainteny
de de
franc franc
rm rm
vro vro
Here is my code:
import pandas as pd
import collections
import numpy as np
import matplotlib.pyplot as plt
import distance
def error_dist():
    df = pd.read_csv('data.csv', sep=',')
    df = df.astype(str)
    df = df.replace(['é', 'è', 'È', 'É'], 'e', regex=True)
    df = df.replace(['à', 'â', 'Â'], 'a', regex=True)
    dictionnary = []
    for i in range(len(df)):
        if df.manual_raw_value[i] != df.raw_value[i]:
            text = df.manual_raw_value[i]
            text2 = df.raw_value[i]
            x = len(df.manual_raw_value[i])
            y = len(df.raw_value[i])
            z = min(x, y)
            for t in range(z):
                if text[t] != text2[t]:
                    d = (text[t], text2[t])
                    dictionnary.append(d)
    # print(dictionnary)
    dictionnary_new = dict(collections.Counter(dictionnary).most_common(25))
    pos = np.arange(len(dictionnary_new.keys()))
    width = 1.0
    ax = plt.axes()
    ax.set_xticks(pos + (width / 2))
    ax.set_xticklabels(dictionnary_new.keys())
    plt.bar(range(len(dictionnary_new)), dictionnary_new.values(), width, color='g')
    plt.show()
and the Levenshtein distance:
def levenstein_dist():
    df = pd.read_csv('data.csv', sep=',')
    df = df.astype(str)
    df['string diff'] = df.apply(lambda x: distance.levenshtein(x['raw_value'], x['manual_raw_value']), axis=1)
    plt.hist(df['string diff'])
    plt.show()
Now I want to make a histogram showing three bins: the number of substitutions, the number of insertions and the number of deletions. How can I proceed?
Thank you
Thanks to the suggestions of @YohanesGultom, the answer to the problem can be found here:
http://www.nltk.org/_modules/nltk/metrics/distance.html
or
https://gist.github.com/kylebgorman/1081951
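For reference, the three operation counts can also be obtained with a plain dynamic-programming alignment plus a backtrace (a minimal sketch; the name edit_ops and the sample words are illustrative, and the counts could then feed a three-bin plt.bar chart):

```python
# Sketch: count substitutions, insertions and deletions separately via the
# classic Levenshtein dynamic program followed by a backtrace.
def edit_ops(ref, hyp):
    m, n = len(ref), len(hyp)
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    # Backtrace from the bottom-right corner, counting each operation type.
    subs = ins = dels = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, ins, dels

print(edit_ops("soupape", "snupape"))  # -> (1, 0, 0): one substitution, 'o' -> 'n'
```

Summing the three counts over all (manual_raw_value, raw_value) pairs would give the totals for the histogram.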