How to extract specific content from a dataframe based on a condition - python

Consider the following pandas dataframe:
Here is an example of ingredients_text:
farine de blé 34% (france), pépites de chocolat 20g (ue) (sucre, pâte de cacao, beurre de cacao, émulsifiant lécithines (tournesol), arôme) (cacao : 44% minimum), matière grasse végétale (palme), sucre, 8,5% chocolat(sucre, pâte de cacao, cacao et cacao maigre en poudre) (cacao: 38% minimum), 5,5% éclats de noix de pécan (non ue), poudres à lever : diphosphates carbonates de sodium, blancs d’œufs, fibres d'acacia, lactose et protéines de lait, sel. dont lait.
oignon 18g oil hell: kartoffelstirke, milchzucker, maltodextrin, reismehl. 100g produkt enthalten: 1559KJ ,energie 369 kcal lt;0.5g lt;0.1g 909 fett davon gesättigte fettsāuren kohlenhydrate davon ,zucker 26g
I separated the ingredients of each line into words with the following code:
for i in df['ingredients_text'][:].index:
    words = df["ingredients_text"][i].split(',')
    df["ingredients_text"][i] = words
Any idea how to extract the ingredients containing % or g from the text into another column called 'ingredient'?
For instance, the desired output should be:
['farine de blé 34%', 'pépites de chocolat 20g','cacao : 44%' ,'8,5% chocolat' ,'cacao: 38%', '5,5% éclats de noix de pécan']
['oignon 18g oil hell', '100g produkt enthalten', 'lt;0.5g', 'lt;0.1g' , '26g zucker']

df = pd.DataFrame({'ingredient_text': ['a%bgC, abc, a%, cg', 'xyx']})
ingredient_text
0 a%bgC, abc, a%, cg
1 xyx
Split the ingredients into a list
df['ingredient_text'] = df['ingredient_text'].str.split(',')
ingredient_text
0 [a%bgC, abc, a%, cg]
1 [xyx]
Search for your strings in the list
df['ingredient'] = df['ingredient_text'].apply(lambda x: [s for s in x if ('%' in s) or ('g' in s)])
ingredient_text ingredient
0 [a%bgC, abc, a%, cg] [a%bgC, a%, cg]
1 [xyx] []
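A slightly stricter variant, offered only as a sketch, replaces the plain substring test with a regular expression so that a bare letter g does not match unless it follows a number; the column names mirror the question and the sample row is abbreviated.
import re
import pandas as pd

# Hedged sketch: keep only parts containing a number followed by % or g (e.g. "34%", "20g").
pattern = re.compile(r"\d+(?:[.,]\d+)?\s*(?:%|g\b)")

df = pd.DataFrame({'ingredients_text': [
    "farine de blé 34% (france), pépites de chocolat 20g (ue), sucre, sel"
]})

df['ingredient'] = (
    df['ingredients_text']
      .str.split(',')
      .apply(lambda parts: [p.strip() for p in parts if pattern.search(p)])
)
print(df['ingredient'][0])
# ['farine de blé 34% (france)', 'pépites de chocolat 20g (ue)']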

Related

Remove 2 or more consecutive non-caps words from strings stored within a list of strings using regex, and separate

import re
list_with_penson_names_in_this_input = ["María Sol", "María del Carmen Perez Agüiño", "Melina Saez Sossa", "el de juego es Alex" , "Harry ddeh jsdelasd Maltus ", "Ben White ddesddsh jsdelasd Regina yáshas asdelas Javier Ruben Rojas", "Robert", 'Melina presento el nuevo presupuesto', "presento el nuevo presupuesto, María del Carmén "]
aux_list = []
for i in list_with_penson_names_in_this_input:
    list_with_penson_names_in_this_input.remove(i)
    aux_list = re.sub(, , i)
    list_with_penson_names_in_this_input = list_with_penson_names_in_this_input + aux_list
    aux_list = []
print(list_with_penson_names_in_this_input) #print fixed name list
If there are 2 or more words in a row ((?:\w\s*)+) that do not start with a capital letter and that are not connectors of the type [del|de el|de], they should be removed from the detected (capitalized) name, and the next detected name (if it exists) should become a separate element within the list. For example,
["Harry ddeh jsdelasd Maltus "] --> ["Harry", "Maltus"]
["Ben White ddesddsh jsdelasd Regina yáshas asdelas Javier Ruben Rojas"] --> ["Ben White", "Regina", "Javier Ruben Rojas"]
and if there is only one name, any run of 2 or more consecutive words that do not start with a capital letter and that are not connectors of the type [del|de el|de] should still be removed:
["Melina Martinez presento el nuevo presupuesto"] --> ["Melina Martinez"]
["Melina presento el nuevo presupuesto"] --> ["Melina"]
["presento el nuevo presupuesto, María del Carmén "] --> ["María del Carmén"]
After fixing the elements that do not meet these specifications, the list should be:
["María Sol", "María del Carmen Perez Agüiño", "Melina Saez Sossa", "Alex" , "Harry", "Maltus", "Ben White", "Regina", "Javier Ruben Rojas", "Robert", 'Melina', "María del Carmén"]
For the words in between that don't start with a capital letter I tried something like ((?:\w\s*)+), but it does not restrict whether the matched words start with a capital letter.
Match runs of two or more consecutive words that start with a lowercase letter, and use re.split to split on the matched runs.
import re
name_lst = ["María Sol", "María del Carmen Perez Agüiño", "Melina Saez Sossa",
"el de juego es Alex", "Harry ddeh jsdelasd Maltus ",
"Ben White ddesddsh jsdelasd Regina yáshas asdelas Javier Ruben Rojas",
"Robert", 'Melina presento el nuevo presupuesto',
"presento el nuevo presupuesto, María del Carmén "]
cleaned_names = []
for i in name_lst:
    data = re.split(r"(?:\b[a-z]\S+\s){2,}(?:[a-z]+$)?", i)
    cleaned_names.extend(data)
output = [i.strip() for i in cleaned_names if i]
print(output)
>>> ['María Sol', 'María del Carmen Perez Agüiño', 'Melina Saez Sossa', 'Alex', 'Harry', 'Maltus', 'Ben White', 'Regina', 'Javier Ruben Rojas', 'Robert', 'Melina', 'María del Carmén']
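For reference, the same pattern can be written with re.VERBOSE so each piece is documented; the behaviour is identical to the one-liner above.
import re

pattern = re.compile(r"""
    (?:                  # a run of...
        \b[a-z]\S+\s     #   a word starting with a lowercase letter, followed by whitespace
    ){2,}                # ...at least two such words in a row
    (?:[a-z]+$)?         # optionally also swallow a trailing lowercase word at the end of the string
""", re.VERBOSE)

print([p.strip() for p in pattern.split("Harry ddeh jsdelasd Maltus ") if p.strip()])
# ['Harry', 'Maltus']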

Extract elements from list of strings that match regex

I have a long list of elements:
l = ['tribunavtplus.client.db.PagingResultSet/2610158785',
'java.util.ArrayList/4159755760',
'tribunavtplus.client.db.ERGEBNISSTORE/3664619045',
'605',
'I. Sozialversicherungsgerichtshof',
'',
'a514394f9f18429995d4cd5c64fbc6de',
'Arrêt de la Ie Cour des assurances sociales du Tribunal cantonal',
'605 2022 1',
'3a4604ff7f9b404193346990b98f7179',
'605 2021 234',
'Assurance-accidents, rente (capacité de travail, revenus retenus dans le calcul du taux d\\x27invalidité, exigibilité), atteinte à l\\x27intégrité.',
'608',
'D:\\\\InetPubData\\\\PublicationDocuments\\\\80749472d88e4d45a0d3c340acbb670a.pdf',
'Polizeirichter Greyerz',
'D:\\\\InetPubData\\\\PublicationDocuments\\\\412d9325978b487f8b15a13380ba4436.pdf',
'Assurance-invalidité (refus de rente et de mesures d\\x27ordre professionnel).',
'101',
'I. Zivilappellationshof',
'c321e55aa536438fb95a06eb798eff82',
'Arrêt de la Ie Cour d\\x27appel civil du Tribunal cantonal',
'101 2022 202',
'2022-09-16',
'D:\\\\InetPubData\\\\PublicationDocuments\\\\c321e55aa536438fb95a06eb798eff82.pdf',
'Eheschutzmassnahmen',
'Gericht Saane',
'Mesures protectrices de l\\x27union conjugale, droit de visite, contributions d\\x27entretien.',
'603',
'III. Verwaltungsgerichtshof',
'3a801995f04b48948c0d1d2224b8f4b9',
'Arrêt de la IIIe Cour administrative du Tribunal cantonal',
'603 2022 110',
'D:\\\\InetPubData\\\\PublicationDocuments\\\\3a801995f04b48948c0d1d2224b8f4b9.pdf',
'Strassenverkehr und Transportwesen',
'Amt für Strassenverkehr und Schifffahrt (ASS)',
'502 2022 24',
'2022-09-22',
'D:\\\\InetPubData\\\\PublicationDocuments\\\\d2c0687313f24491870571dcfd64d6ad.pdf',
'102',
'java.lang.Boolean/476441737']
I want to extract elements such as:
'101 2022 202'
'502 2022 24'
'605 2022 1'
etc.
So: blocks of 3 numbers of 1 to 4 digits each, separated by spaces.
I have tried this, but it does not work:
regex_id = re.compile(r"(\d{1,5})(\s)(\d{1,5})(\s)(\d{1,5})")
id_list = [value for value in l if regex_id.search(str(value))==True]
print(id_list)
But I always get an empty list.
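A likely cause is the == True comparison: re.search returns a Match object (or None), never the literal True, so the test is always False. A minimal sketch of a fix, using a shortened sample list:
import re

# re.search returns a Match object (truthy) or None, never the boolean True,
# so comparing it to True always fails; test its truthiness instead.
l = ['605 2022 1', 'I. Sozialversicherungsgerichtshof', '101 2022 202', '102', '502 2022 24']
regex_id = re.compile(r"^\d{1,4} \d{1,4} \d{1,4}$")  # three blocks of 1-4 digits separated by single spaces
id_list = [value for value in l if regex_id.search(str(value))]
print(id_list)
# ['605 2022 1', '101 2022 202', '502 2022 24']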

Regex to split Spanish lastnames in first lastname and second lastname

My goal is to write a python 3 function that takes lastnames as row from a csv and splits them correctly into lastname_1 and lastname_2.
Spanish names have the following structure: firstname + lastname_1 + lastname_2
Forgetting about the firstname, I would like code that splits a lastname into these 2 categories (lastname_1, lastname_2). What is the challenge?
Sometimes the lastname(s) have prepositions.
DE BLAS ZAPATA: "De Blas" is the first lastname and "Zapata" the second lastname
MATIAS DE LA MANO: "Matias" is lastname_1, "de la Mano" lastname_2
LOPEZ FERNANDEZ DE VILLAVERDE: "Lopez Fernandez" is lastname_1, "de Villaverde" lastname_2
DE MIGUEL DEL CORRAL: "De Miguel" is lastname_1, "del Corral" lastname_2. More: VIDAL DE LA PEÑA SOLIS: "Vidal" is lastname_1, "de la Peña Solis" lastname_2
MONTAVA DEL ARCO: "Montava" is lastname_1, "del Arco" lastname_2
and the list could go on and on.
I am currently stuck. I found this code in Perl, but I struggle to understand the main idea behind it well enough to translate it to Python 3.
import re
preposition_lst = ['DE LO ', 'DE LA ', 'DE LAS ', 'DEL ', 'DELS ', 'DE LES ', 'DO ', 'DA ', 'DOS ', 'DAS', 'DE ']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PEÑA", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]
for case in cases:
    for prep in preposition_lst:
        m = re.search(f"(.*)({prep}[A-ZÀ-ÚÄ-Ü]+)", case, re.I)  # re.I makes it case insensitive
        try:
            print(m.groups())
            print(prep)
        except:
            pass
Try this one:
import re
preposition_lst = ['DE','LO','LA','LAS','DEL','DELS','LES','DO','DA','DOS','DAS']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PEÑA", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]
def last_name(name):
    case = re.findall(r'\w+', name)
    res = list(filter(lambda x: x not in preposition_lst, case))
    return res

list_final = []
for case in cases:
    list_final.append(last_name(case))
for i in range(len(list_final)):
    if len(list_final[i]) > 2:
        name1 = ' '.join(list_final[i][:2])
        name2 = ' '.join(list_final[i][2:])
        list_final[i] = [name1, name2]
print(list_final)
#[['BLAS', 'ZAPATA'], ['MATIAS', 'MANO'], ['LOPEZ FERNANDEZ', 'VILLAVERDE'], ['MIGUEL', 'CORRAL'], ['VIDAL', 'PEÑA'], ['MONTAVA', 'ARCO'], ['CASAS', 'VALLE']]
Does this suit your requirements?
import re
preposition_lst = ['DE LO', 'DE LA', 'DE LAS', 'DE LES', 'DEL', 'DELS', 'DO', 'DA', 'DOS', 'DAS', 'DE']
cases = ["DE BLAS ZAPATA", "MATIAS DE LA MANO", "LOPEZ FERNANDEZ DE VILLAVERDE", "DE MIGUEL DEL CORRAL", "VIDAL DE LA PENA SOLIS", "MONTAVA DEL ARCO", "DOS CASAS VALLE"]
def split_name(name):
    f1 = re.compile("(.*)({preps}(.+))".format(preps="(" + " |".join(preposition_lst) + ")"))
    m1 = f1.match(name)
    if m1:
        if len(m1.group(1)) != 0:
            return m1.group(1).strip(), m1.group(2).strip()
        else:
            return " ".join(name.split()[:-1]), name.split()[-1]
    else:
        return " ".join(name.split()[:-1]), name.split()[-1]
for case in cases:
    first, second = split_name(case)
    print("{} --> name 1 = {}, name 2 = {}".format(case, first, second))
# DE BLAS ZAPATA --> name 1 = DE BLAS, name 2 = ZAPATA
# MATIAS DE LA MANO --> name 1 = MATIAS, name 2 = DE LA MANO
# LOPEZ FERNANDEZ DE VILLAVERDE --> name 1 = LOPEZ FERNANDEZ, name 2 = DE VILLAVERDE
# DE MIGUEL DEL CORRAL --> name 1 = DE MIGUEL, name 2 = DEL CORRAL
# VIDAL DE LA PENA SOLIS --> name 1 = VIDAL, name 2 = DE LA PENA SOLIS
# MONTAVA DEL ARCO --> name 1 = MONTAVA, name 2 = DEL ARCO
# DOS CASAS VALLE --> name 1 = DOS CASAS, name 2 = VALLE
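If the names come from a CSV, split_name from the answer above can be applied column-wise; the DataFrame and column names here (lastname, lastname_1, lastname_2) are illustrative assumptions, not part of the original question.
import pandas as pd

# Hypothetical usage of split_name (defined above) on a DataFrame column.
df = pd.DataFrame({"lastname": cases})
df[["lastname_1", "lastname_2"]] = df["lastname"].apply(lambda n: pd.Series(split_name(n)))
print(df)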

Histogram representing number of substitutions, insertions and deleting in sequences

I have two columns representing the right sequence and the predicted sequence. I want to compute statistics on the number of deletions, substitutions and insertions by comparing each right sequence with its predicted sequence.
I computed the Levenshtein distance to get the number of characters that differ (see the function below) and an error_dist function to get the most common errors (in terms of substitutions):
Here is a sample of my data:
de de
date date
pour pour
etoblissemenls etablissements
avec avec
code code
communications communications
r r
seiche seiche
titre titre
publiques publiques
ht ht
bain bain
du du
ets ets
premier premier
dans dans
snupape soupape
minimum minimum
blanc blanc
fr fr
nos nos
au au
bl bl
consommations consommations
somme somme
euro euro
votre votre
offre offre
forestier forestier
cs cs
de de
pour pour
de de
paye r
cette cette
votre votre
valeurs valeurs
des des
gfda gfda
tva tva
pouvoirs pouvoirs
de de
revenus revenus
offre offre
ht ht
card card
noe noe
montant montant
r r
comprises comprises
quantite quantite
nature nature
ticket ticket
ou ou
rapide rapide
de de
sous sous
identification identification
du du
document document
suicide suicide
bretagne bretagne
tribunal tribunal
services services
cif cif
moyen moyen
gaec gaec
total total
lorsque lorsque
contact contact
fermeture fermeture
la la
route route
tva tva
ia ia
noyal noyal
brie brie
de de
nanterre nanterre
charcutier charcutier
semestre semestre
de de
rue rue
le le
bancaire bancaire
martigne martigne
recouvrement recouvrement
la la
sainteny sainteny
de de
franc franc
rm rm
vro vro
Here is my code:
import pandas as pd
import collections
import numpy as np
import matplotlib.pyplot as plt
import distance
def error_dist():
    df = pd.read_csv('data.csv', sep=',')
    df = df.astype(str)
    df = df.replace(['é', 'è', 'È', 'É'], 'e', regex=True)
    df = df.replace(['à', 'â', 'Â'], 'a', regex=True)
    dictionnary = []
    for i in range(len(df)):
        if df.manual_raw_value[i] != df.raw_value[i]:
            text = df.manual_raw_value[i]
            text2 = df.raw_value[i]
            x = len(df.manual_raw_value[i])
            y = len(df.raw_value[i])
            z = min(x, y)
            for t in range(z):
                if text[t] != text2[t]:
                    d = (text[t], text2[t])
                    dictionnary.append(d)
    #print(dictionnary)
    dictionnary_new = dict(collections.Counter(dictionnary).most_common(25))
    pos = np.arange(len(dictionnary_new.keys()))
    width = 1.0
    ax = plt.axes()
    ax.set_xticks(pos + (width / 2))
    ax.set_xticklabels(dictionnary_new.keys())
    plt.bar(range(len(dictionnary_new)), dictionnary_new.values(), width, color='g')
    plt.show()
[bar chart of the 25 most common character substitutions]
and the Levenshtein distance:
def levenstein_dist():
    df = pd.read_csv('data.csv', sep=',')
    df = df.astype(str)
    df['string diff'] = df.apply(lambda x: distance.levenshtein(x['raw_value'], x['manual_raw_value']), axis=1)
    plt.hist(df['string diff'])
    plt.show()
[histogram of the Levenshtein distances between raw_value and manual_raw_value]
Now I want to make a histogram showing three bins: the number of substitutions, the number of insertions and the number of deletions. How can I proceed?
Thank you
Thanks to the suggestions of @YohanesGultom, the answer to the problem can be found here:
http://www.nltk.org/_modules/nltk/metrics/distance.html
or
https://gist.github.com/kylebgorman/1081951
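As a starting point, here is a minimal sketch that counts the three operation types with a classic Levenshtein dynamic-programming table plus a backtrace and plots one bar per operation; the sample word pairs are taken from the data above, and the backtrace picks just one of the possibly several optimal alignments.
import matplotlib.pyplot as plt

def edit_ops(ref, hyp):
    """Count (substitutions, insertions, deletions) turning the predicted string hyp into the right string ref."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion (ref char missing in hyp)
                          d[i][j - 1] + 1,         # insertion (extra char in hyp)
                          d[i - 1][j - 1] + cost)  # substitution or match
    # backtrace through the table to count each operation type
    subs = ins = dels = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, ins, dels

# word pairs taken from the sample above: (right sequence, predicted sequence)
pairs = [("etablissements", "etoblissemenls"), ("soupape", "snupape")]
totals = [0, 0, 0]
for ref, hyp in pairs:
    for k, count in enumerate(edit_ops(ref, hyp)):
        totals[k] += count

plt.bar(["substitutions", "insertions", "deletions"], totals, color="g")
plt.show()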

python - cut a string in 2 lines

I'm looking for a one-liner (using str.join I think) to cut a long string when the number of words is too high. I have the beginning but I don't know how to insert the \n:
example = "Au Fil Des Antilles De La Martinique a Saint Barthelemy"
nmbr_word = len(example.split(" "))
if nmbr_word >= 6:
    # cut the string to have
    result = "Au Fil Des Antilles De La\nMartinique a Saint Barthelemy"
Thanks
How about using the textwrap module?
>>> import textwrap
>>> s = "Au Fil Des Antilles De La Martinique a Saint Barthelemy"
>>> textwrap.wrap(s, 30)
['Au Fil Des Antilles De La', 'Martinique a Saint Barthelemy']
>>> "\n".join(textwrap.wrap(s, 30))
'Au Fil Des Antilles De La\nMartinique a Saint Barthelemy'
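Tying this back to the 6-word condition in the question, a small hedged variant (the width of 30 is just an example) could be:
import textwrap

example = "Au Fil Des Antilles De La Martinique a Saint Barthelemy"
# only wrap when the string has 6 or more words, as required in the question
if len(example.split(" ")) >= 6:
    result = "\n".join(textwrap.wrap(example, 30))
else:
    result = example
print(result)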
How about:
words = example.split(" ")
result = '\n'.join(' '.join(words[i:i+6]) for i in range(0, len(words), 6))
Note that nmbr_word in the question is a word count (an int), so it cannot be sliced; split the string into a list of words first.
