Python way to detect language ISO code - python

I have millions of sentence fragments and I am trying to determine whether each one is in English, French, Japanese, or German. Is there a Python library that does this?
s1 = 'This is where lies a person'
s2 = 'ボウリング・フォー・コロンバイン(字幕版)'
s3 = 'Ep. 2448 : épisode du 12 mars 2014 (Plus belle la vie, Saison 10, Vol. 6)'
language_of_string(s1) ==> EN
language_of_string(s2) ==> JP
language_of_string(s3) ==> FR

Try langid. The source code is here:
https://github.com/saffsd/langid.py
>>> import langid
>>> langid.classify("This is a test")
('en', 0.99999999099035441)

guess_language
s1 = 'This is where lies a person'
s2 = 'ボウリング・フォー・コロンバイン(字幕版)'
s3 = 'Ep. 2448 : épisode du 12 mars 2014 (Plus belle la vie, Saison 10, Vol. 6)'
import guess_language
print(guess_language.guessLanguage(s1))
print(guess_language.guessLanguage(s2))
print(guess_language.guessLanguage(s3))
en
ja
fr
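With millions of fragments, a cheap script check can short-circuit the Japanese case before a classifier is called. The following is only a sketch using the standard library; the function name contains_japanese_kana is hypothetical and is not part of langid or guess_language. It only looks for hiragana and katakana, so text written purely in kanji would still need a real classifier.

```python
def contains_japanese_kana(text):
    """Return True if the text has at least one hiragana or katakana character.

    U+3040..U+309F is the hiragana block and U+30A0..U+30FF the katakana block.
    """
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


s1 = 'This is where lies a person'
s2 = 'ボウリング・フォー・コロンバイン(字幕版)'
s3 = 'Ep. 2448 : épisode du 12 mars 2014 (Plus belle la vie, Saison 10, Vol. 6)'

for s in (s1, s2, s3):
    # 'ja' is the ISO 639-1 code for Japanese; None means "ask a classifier".
    print('ja' if contains_japanese_kana(s) else None)
```

Fragments for which this returns None would then go to langid.classify or guess_language as shown in the answers.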

Related

how to extract specific content from dataframe based on condition python

Consider the following pandas dataframe:
this is an example of ingredients_text :
farine de blé 34% (france), pépites de chocolat 20g (ue) (sucre, pâte de cacao, beurre de cacao, émulsifiant lécithines (tournesol), arôme) (cacao : 44% minimum), matière grasse végétale (palme), sucre, 8,5% chocolat(sucre, pâte de cacao, cacao et cacao maigre en poudre) (cacao: 38% minimum), 5,5% éclats de noix de pécan (non ue), poudres à lever : diphosphates carbonates de sodium, blancs d’œufs, fibres d'acacia, lactose et protéines de lait, sel. dont lait.
oignon 18g oil hell: kartoffelstirke, milchzucker, maltodextrin, reismehl. 100g produkt enthalten: 1559KJ ,energie 369 kcal lt;0.5g lt;0.1g 909 fett davon gesättigte fettsāuren kohlenhydrate davon ,zucker 26g
I separated the ingredients of each line into words with the following code:
for i in df['ingredients_text'].index:
    words = df["ingredients_text"][i].split(',')
    df["ingredients_text"][i] = words
Any idea how to extract the ingredients with % and g from the text into another column called 'ingredient'?
For instance, the desired output should be:
['farine de blé 34%', 'pépites de chocolat 20g','cacao : 44%' ,'8,5% chocolat' ,'cacao: 38%', '5,5% éclats de noix de pécan']
['oignon 18g oil hell', '100g produkt enthalten', 'lt;0.5g', 'lt;0.1g' , '26g zucker']
df = pd.DataFrame({'ingredient_text': ['a%bgC, abc, a%, cg', 'xyx']})
ingredient_text
0 a%bgC, abc, a%, cg
1 xyx
Split the ingredients into a list
df['ingredient_text'] = df['ingredient_text'].str.split(',')
ingredient_text
0 [a%bgC, abc, a%, cg]
1 [xyx]
Search for your strings in the list
df['ingredient'] = df['ingredient_text'].apply(lambda x: [s for s in x if ('%' in s) or ('g' in s)])
ingredient_text ingredient
0 [a%bgC, abc, a%, cg] [a%bgC, a%, cg]
1 [xyx] []
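The test above keeps any fragment containing the letter g, so a word such as "sugar" would pass too. A stricter variant, sketched here with only the standard library, keeps a fragment only when a number is directly followed by % or by a standalone g. The names QUANTITY and keep_quantified are hypothetical, and splitting on commas still breaks decimal commas such as "8,5%", exactly as in the original approach.

```python
import re

# A number (optionally with a decimal point or comma) followed by "%" or a standalone "g".
QUANTITY = re.compile(r"\d+(?:[.,]\d+)?\s*(?:%|g\b)")


def keep_quantified(fragments):
    """Return the stripped fragments that contain a quantity such as '34%' or '20g'."""
    return [f.strip() for f in fragments if QUANTITY.search(f)]


fragments = "a%bgC, abc, farine de blé 34% (france), sucre, oignon 18g".split(",")
print(keep_quantified(fragments))
```

On a DataFrame whose column already holds lists, this could be applied as df['ingredient'] = df['ingredient_text'].apply(keep_quantified).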

Python error: string indices must be integers

I need to work with a small JSON data object in Python, but when I use this code it doesn't work. What am I doing wrong?
This is for the newest version of Python.
import urllib, json
import requests
with open('locaties.json') as json_file:
    data = json.load(json_file)
    for parkeerlocaties in data['parkeerlocaties']:
        for locatie in parkeerlocaties['parkeerlocatie']:
            for title in locatie['title']:
                print("Hello World")
{"parkeerlocaties":[{"parkeerlocatie":{"title":"Fietsenstalling Tolhuisplein","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9032801,52.3824545]}","type":"Fietspunt","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/fiets\/fietsparkeren\/gemeentelijke\/","urltitle":"www.amsterdam.nl\/fiets","adres":"Buiksloterweg 3","postcode":"1031 CC","woonplaats":"Amsterdam","opmerkingen":"Alleen toegankelijk voor abonnementhouders van Tolhuisplein, automatische stalling"}},{"parkeerlocatie":{"title":"Fietsenstalling Paradiso","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8833735,52.3621851]}","type":"Fietspunt","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/fiets\/fietsparkeren\/gemeentelijke\/","urltitle":"www.amsterdam.nl\/fiets","adres":"Weteringschans 4 A","postcode":"1017 SG","woonplaats":"Amsterdam","opmerkingen":"Maximale parkeerduur 28 dagen, stalling met toezicht"}},{"parkeerlocatie":{"title":"Fietsenstalling Zuidplein","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8719467,52.3398642]}","type":"Fietspunt","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/fiets\/fietsparkeren\/gemeentelijke\/","urltitle":"www.amsterdam.nl\/fiets","adres":"Zuidplein 5","postcode":"1077 XV","woonplaats":"Amsterdam","opmerkingen":"Maximale parkeerduur 28 dagen, stalling met toezicht"}},{"parkeerlocatie":{"title":"Fietsenstalling Station Rai (gesloten tot februari 2019)","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8905079,52.339392]}","type":"Fietspunt","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/fiets\/fietsparkeren\/gemeentelijke\/","urltitle":"www.amsterdam.nl\/fiets","adres":"Europaboulevard 4","postcode":"1083 AD","woonplaats":"Amsterdam","opmerkingen":"Sluit voor renovatie op 21 juli 2018. 
Er zijn rond het station extra parkeerplekken voor fiets gemaakt."}},{"parkeerlocatie":{"title":"P+R Zeeburg","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9607015,52.3719632]}","type":"P+R","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeren-reizen\/#h4f9f93f8-875b-4d18-936a-c1eba9d6f198","urltitle":"www.amsterdam.nl\/penr ","adres":"Zuiderzeeweg 46 a","postcode":"1095KJ","woonplaats":"Amsterdam","opmerkingen":"","OV_bus":"bus 37 Noord - Amstelstation vv","OV_tram":"tram 26 Ijburg - Centraal Station vv","OV":"tram;GVB_26_1;08240, bus;GVB_37_2;08134"}},{"parkeerlocatie":{"title":"Weekend P+R VUmc","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8611063,52.3361167]}","type":"P+R","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeren-reizen\/#hdce18cfd-fc8f-4728-be57-2d9a23b494d9","urltitle":"www.amsterdam.nl\/penr","adres":"Gustav Mahlerlaan 3004","postcode":"1081 LA","woonplaats":"Amsterdam","opmerkingen":"","OV_metro":"metro 51 Isolatorweg - Centraal Station vv (maart 2019 t\/m eind 2020), metro 50 met overstap Overamstel op 51 Centraal Station","OV_tram":"tram 24 VU medisch centrum - Centraal Station vv, tram 5 Amstelveen - Van Hallstraat vv","OV":"metro;GVB_50_1;07343;09563, tram;GVB_24_1;07350, tram;GVB_5_1;07410"}},{"parkeerlocatie":{"title":"P+R Bos en Lommer","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8453671,52.379131]}","type":"P+R","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeren-reizen\/#h9434503d-d323-4331-b792-5210ce062c42","urltitle":"www.amsterdam.nl\/penr ","adres":"Leeuwendalersweg 23 b","postcode":"1055JE","woonplaats":"Amsterdam","opmerkingen":"","OV_bus":"bus 21 Geuzenveld - Centraal Station vv","OV_tram":"tram 7 Slotermeer - Azartplein vv","OV":"bus;GVB_21_1;03060, tram;GVB_7_1;03167"}},{"parkeerlocatie":{"title":"P+R 
Sloterdijk","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8384209,52.3900128]}","type":"P+R","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeren-reizen\/#h628fb483-dec3-4a9d-9d52-50136e9639ec","urltitle":"www.amsterdam.nl\/penr","adres":"Piarcoplein 1","postcode":"1043DW","woonplaats":"Amsterdam","opmerkingen":"","OV_bus":"bus 22 Station Sloterdijk - Muiderpoortstation vv","OV_metro":"metro 50 Isolatorweg - Gein vv, overstap 51 op Station Zuid \/ Station RAI \/ Overamstel","OV_tram":"tram 19 Station Sloterdijk - Diemen vv","OV_trein":"Treinen tussen station Sloterdijk en de stations CS, Muiderpoort en Amstel (GVB P+R-kaart niet geldig)","OV":"tram;GVB_19_1;02361;00014, metro;GVB_50_1;02295;09563, metro;GVB_51_1;*09563, bus;GVB_22_1;02367;00001"}},{"parkeerlocatie":{"title":"P+R Olympisch Stadion","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8539215,52.3440266]}","type":"P+R","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeren-reizen\/#h4567b083-9fea-4848-882a-280b6abc7853","urltitle":"www.amsterdam.nl\/penr ","adres":"Olympisch Stadion 44","postcode":"1076DE","woonplaats":"Amsterdam","opmerkingen":"","OV_tram":"tram 24 VU medisch centrum - Centraal Station vv","OV":"tram;GVB_24_1;07121"}},{"parkeerlocatie":{"title":"P+R Johan Cruijff ArenA","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9405734,52.3137551]}","type":"P+R","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeren-reizen\/#h1dfa5189-98e8-42ce-8119-ce74f2451969","urltitle":"www.amsterdam.nl\/penr","adres":"Burgemeester Stramanweg 130","postcode":"1101EP","woonplaats":"Amsterdam","opmerkingen":"","OV_metro":"metro 54 Gein - Centraal Station vv","OV_trein":"Treinen tussen station Bijlmer Arena en stations Amstel, Muiderpoort en Centraal Station (GVB P+R-kaart niet geldig)","OV":"metro;GVB_54_1;09522"}},{"parkeerlocatie":{"title":"Amsterdamse Poort (P21 t\/m 
24)","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9626214,52.3192019]}","type":"CommercieleParkeergarage","url":"https:\/\/www.q-park.nl\/nl-nl\/parkeren\/amsterdam\/amsterdamse-poort-p21\/","urltitle":"Amsterdamse Poort P21","adres":"Bijlmerdreef 700","postcode":"1103DS","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"P18 HES\/ ROC","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9466199,52.3152543]}","type":"Parkeergarage","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeergarages\/parkeergarages\/garage-p18-hes-roc\/","urltitle":"Bekijk P18 HES\/ ROC op www.amsterdam.nl\/parkeergarages","adres":"Fraijlemaborg 131","postcode":"1102CV","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"P1 ArenA","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9405851,52.3137433]}","type":"Parkeergarage","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeergarages\/parkeergarages\/parkeergarage-p1\/","urltitle":"Bekijk P1 ArenA op www.amsterdam.nl\/parkeergarages","adres":"Burgemeester Stramanweg 130","postcode":"1101EP","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"P10 Plaza ArenA","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9409531,52.3080762]}","type":"Parkeergarage","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeergarages\/parkeergarages\/p10-plaza-arena\/","urltitle":"Bekijk P10 Plaza ArenA op www.amsterdam.nl\/parkeergarages","adres":"Herikerbergweg 288","postcode":"1101CT","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"P 3 Mikado","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9413266,52.3103066]}","type":"Parkeergarage","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeergarages\/parkeergarages\/garage-p3-mikado\/","urltitle":"Bekijk P3 Mikado op www.amsterdam.nl\/parkeergarages","adres":"De entree 228","postcode":"1101EE","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"RAI 
Parking","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8921615,52.3383996]}","type":"CommercieleParkeergarage","url":"https:\/\/www.rai.nl\/nl\/contact-bereikbaarheid-en-parkeren\/parkeren-bij-rai-amsterdam\/","urltitle":"Rai Parking","adres":"Europaboulevard 24","postcode":"1078GZ","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Qpark Eurocenter","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8888094,52.3358123]}","type":"CommercieleParkeergarage","url":"https:\/\/www.q-park.nl\/nl-nl\/parkeren\/amsterdam\/eurocenter\/","urltitle":"Qpark Eurocenter ","adres":"Barbara Strozzilaan 342","postcode":"1083HN","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Qpark Mahler","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8723915,52.3377672]}","type":"CommercieleParkeergarage","url":"https:\/\/www.q-park.nl\/nl-nl\/parkeren\/amsterdam\/mahler\/","urltitle":"Qpark Mahler","adres":"Claude Debussylaan 42","postcode":"1082MD","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Olympisch Stadion","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8539215,52.3440266]}","type":"CommercieleParkeergarage","url":"http:\/\/www.p1.nl\/parkeren\/parkeergarage-olympisch-stadion\/","urltitle":"P1 Parkeergarage Olympisch Stadion ","adres":"Olympisch Stadion 44","postcode":"1076DE","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Interparking Oranjekwartier Amsterdam","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.839149,52.3546448]}","type":"CommercieleParkeergarage","url":"http:\/\/www.interparking.nl\/nl-NL\/find-parking\/Oranjekwartier\/","urltitle":"Interparking Oranjekwartier Amsterdam","adres":"Carnapstraat 200","postcode":"1062KZ","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Bomengarage P2 (Boven 't IJ) 
","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9388323,52.3994577]}","type":"Parkeergarage","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeergarages\/parkeergarages\/garage-p2bomengarage\/","urltitle":"Bekijk Bomengarage P2 op www.amsterdam.nl\/parkeergarages","adres":"Buikslotermeerplein 237","postcode":"1025 XB","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Qpark Westergasfabriek","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8662388,52.3847072]}","type":"CommercieleParkeergarage","url":"https:\/\/www.q-park.nl\/nl-nl\/parkeren\/amsterdam\/westergasfabriek\/","urltitle":"Qpark Amsterdam Westergasfabriek","adres":"Van Bleiswijkstraat 8","postcode":"1051DG","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Qpark Europarking","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8766781,52.3699218]}","type":"CommercieleParkeergarage","url":"https:\/\/www.q-park.nl\/nl-nl\/parkeren\/amsterdam\/europarking\/","urltitle":"Qpark Europarking","adres":"Marnixstraat 250","postcode":"1016TL","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Qpark Byzantium","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8793897,52.3618422]}","type":"CommercieleParkeergarage","url":"https:\/\/www.q-park.nl\/nl-nl\/parkeren\/amsterdam\/byzantium\/","urltitle":"Qpark Amsterdam Byzantium","adres":"Tesselschadestraat 1","postcode":"1054ET","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Piet Heingarage","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9173751,52.3773883]}","type":"Parkeergarage","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeergarages\/parkeergarages\/parkeergarage-piet\/","urltitle":"Bekijk Piet Heingarage op www.amsterdam.nl\/parkeergarages","adres":"Piet Heinkade 59","postcode":"1019GM","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Parking Centrum 
Oosterdok","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9092051,52.3761913]}","type":"CommercieleParkeergarage","url":"http:\/\/www.parkingcentrumoosterdok.nl\/","urltitle":"Parking Centrum Oosterdok","adres":"Oosterdoksstraat 150","postcode":"1011AD","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Markenhoven","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.908618,52.3696328]}","type":"Parkeergarage","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeergarages\/parkeergarages\/garage-markenhoven\/","urltitle":"Bekijk Markenhoven op www.amsterdam.nl\/parkeergarages","adres":"Anne Frankstraat 220","postcode":"1011 MP","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"(P1) Parking Waterlooplein","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9043352,52.3689665]}","type":"CommercieleParkeergarage","url":"http:\/\/www.parkereninwaterlooplein.nl\/","urltitle":"Parkeergarage Waterlooplein ","adres":"Valkenburgerstraat 238","postcode":"1011ND","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Stadhuis - Muziektheater","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9018035,52.3670615]}","type":"Parkeergarage","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeergarages\/parkeergarages\/garage-stadhuis\/","urltitle":" Bekijk Stadhuis-Muziektheater op www.amsterdam.nl\/parkeergarages","adres":"Waterlooplein 28","postcode":"1011PG","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Parkeergarage Prins & Keizer","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.891798,52.3622906]}","type":"CommercieleParkeergarage","url":"http:\/\/www.apcoa.nl\/parkeren-in\/amsterdam\/apcoa-parking-prins-keizer.html","urltitle":"Apcoa Parking Prins & Keizer","adres":"Prinsengracht 927","postcode":"1017HL","woonplaats":"Amsterdam","aantal":"140","opmerkingen":""}},{"parkeerlocatie":{"title":"Qpark De 
Bijenkorf","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.895162,52.373881]}","type":"CommercieleParkeergarage","url":"https:\/\/www.q-park.nl\/nl-nl\/parkeren\/amsterdam\/de-bijenkorf\/","urltitle":"Qpark De Bijenkorf","adres":"Beursplein 15","postcode":"1012JW","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Qpark Nieuwendijk","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8944693,52.3764423]}","type":"CommercieleParkeergarage","url":"https:\/\/www.q-park.nl\/nl-nl\/parkeren\/amsterdam\/nieuwendijk\/","urltitle":"Qpark Nieuwendijk","adres":"Nieuwezijds Kolk 18","postcode":"1012PV","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"(P1) Parking Amsterdam Centre","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8970068,52.3785141]}","type":"CommercieleParkeergarage","url":"http:\/\/www.p1.nl\/parkeren\/p1-parking-amsterdam-centre\/","urltitle":"P1 Parking Amsterdam Centre ","adres":"Prins Hendrikkade 20 a","postcode":"1012TL","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Parkeergarage Apcoa Heinekenplein","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8924871,52.3571537]}","type":"CommercieleParkeergarage","url":"http:\/\/www.apcoa.nl\/parkeren-in\/amsterdam\/apcoa-parking-heinekenplein.html","urltitle":"Apcoa garage Heinekenplein ","adres":"Eerste Van der Helststraat 6","postcode":"1072NV","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"Qpark Museumplein","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.8798246,52.3571347]}","type":"CommercieleParkeergarage","url":"https:\/\/www.q-park.nl\/nl-nl\/parkeren\/amsterdam\/museumplein\/","urltitle":"Qpark Museumplein ","adres":"Van Baerlestraat 33 B","postcode":"1071AP","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"P4 en P5 Villa 
ArenA","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9389632,52.3118578]}","type":"Parkeergarage","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeergarages\/parkeergarages\/","urltitle":"Bekijk P4 en P5 Villa ArenA op www.amsterdam.nl\/parkeergarages ","adres":"De entree 7","postcode":"1101BH","woonplaats":"Amsterdam","opmerkingen":""}},{"parkeerlocatie":{"title":"P4 en P5 Villa ArenA","Locatie":"{\"type\":\"Point\",\"coordinates\":[4.9389632,52.3118578]}","type":"Parkeergarage","url":"https:\/\/www.amsterdam.nl\/parkeren-verkeer\/parkeergarages\/parkeergarages\/","urltitle":"Bekijk P4 en P5 Villa ArenA op www.amsterdam.nl\/parkeergarages","adres":"De entree 7","postcode":"1101BH","woonplaats":"Amsterdam","opmerkingen":""}}
The current error message is "TypeError: string indices must be integers", but I think it should print the titles of all the parkeerlocatie entries.
parkeerlocaties['parkeerlocatie'] is not a list, it's a dictionary. You should use parkeerlocaties['parkeerlocatie']['title']. And the title is a string, so there's no reason to iterate over it (unless you want to process it character by character for some reason).
with open('locaties.json') as json_file:
    data = json.load(json_file)
    for parkeerlocaties in data['parkeerlocaties']:
        print('Title: ', parkeerlocaties['parkeerlocatie']['title'])
>>> type(data)
<class 'dict'>
>>> type(data['parkeerlocaties'])
<class 'list'>
So the code could be
import json
with open('locaties.json') as json_file:
    data = json.load(json_file)
    parkeerlocaties = data['parkeerlocaties']
    for parkeerlocatie in parkeerlocaties:
        print(parkeerlocatie['parkeerlocatie']['title'])
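To see both the fix and the cause of the error without the full file, here is a small self-contained sketch. The two-entry sample string is an assumption that mimics the shape of locaties.json with shortened values.

```python
import json

# Two-entry sample with the same nesting as locaties.json (values shortened).
sample = '''
{"parkeerlocaties": [
  {"parkeerlocatie": {"title": "Fietsenstalling Tolhuisplein", "type": "Fietspunt"}},
  {"parkeerlocatie": {"title": "P+R Zeeburg", "type": "P+R"}}
]}
'''

data = json.loads(sample)

# Each list element is a dict with a single "parkeerlocatie" key holding another dict.
titles = [entry["parkeerlocatie"]["title"] for entry in data["parkeerlocaties"]]
print(titles)

# Iterating over a dict yields its keys (strings), so indexing a key with
# ['title'] is what raises the TypeError from the question.
error_message = None
try:
    for locatie in data["parkeerlocaties"][0]["parkeerlocatie"]:
        locatie["title"]
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```

The exact wording of the TypeError differs slightly between Python versions, but it always starts with "string indices must be integers".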

Ideas to improve language detection between Spanish and Catalan

I'm working on a text mining script in python. I need to detect the language of a natural language field from the dataset.
The thing is, 98% of the rows are in Spanish and Catalan. I tried using some algorithms like the stopwords one or the langdetect library, but these languages share a lot of words so they fail a lot.
I'm looking for some ideas to improve this algorithm.
One thought is, make a dictionary with some words that are specific to Spanish and Catalan, so if one text has any of these words, it's tagged as that language.
Approach 1: Distinguishing characters
Some characters appear in only one of Spanish and Catalan (note: there will be exceptions for proper names and loanwords, e.g. Barça):
esp_chars = "ñÑáÁýÝ"
cat_chars = "çÇàÀèÈòÒ·ŀĿ"
Example:
sample_texts = ["El año que es abundante de poesía, suele serlo de hambre.",
                "Cal no abandonar mai ni la tasca ni l'esperança."]
for text in sample_texts:
    if any(char in text for char in esp_chars):
        print("Spanish: {}".format(text))
    elif any(char in text for char in cat_chars):
        print("Catalan: {}".format(text))
>>> Spanish: El año que es abundante de poesía, suele serlo de hambre.
Catalan: Cal no abandonar mai ni la tasca ni l'esperança.
If this isn't sufficient, you could expand this logic to search for language-exclusive digraphs, letter combinations, or words:

                     Spanish only             Catalan only
Words                como y su con él otro    com i seva amb ell altre
Initial digraphs     -                        d' l'
Digraphs             -                        ss tj qü l·l l.l
Terminal digraphs    -                        ig

Catalan letter combinations that only marginally appear in Spanish:
tx
tg             (Es. exceptions: postgrado, postgraduado, postguerra)
ny             (Es. exceptions: mostly prefixed in-, en-, con- + y-)
ll (terminal)  (Es. exceptions (loanwords): detall, nomparell)
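The character, word and digraph hints above can be combined into a simple score. This is a rough sketch rather than a tuned classifier: the function name guess_es_or_ca and the one-point-per-hit scoring are assumptions, and the "marginal" combinations (tx, tg, ny, terminal ll) are left out because of their Spanish exceptions.

```python
import re

ESP_CHARS = set("ñÑáÁýÝ")
CAT_CHARS = set("çÇàÀèÈòÒ·ŀĿ")
ESP_WORDS = {"como", "y", "su", "con", "él", "otro"}
CAT_WORDS = {"com", "i", "seva", "amb", "ell", "altre"}
# Catalan-only patterns from the table: initial d'/l', ss, tj, qü, l·l / l.l, terminal ig.
CAT_PATTERNS = re.compile(r"\b[dl]'|ss|tj|qü|l[·.]l|ig\b")


def guess_es_or_ca(text):
    """Return 'es', 'ca', or 'unknown' by counting language-specific hints."""
    lowered = text.lower()
    words = re.findall(r"[^\W\d_]+", lowered)  # runs of letters only
    esp = sum(c in ESP_CHARS for c in text) + sum(w in ESP_WORDS for w in words)
    cat = (sum(c in CAT_CHARS for c in text)
           + sum(w in CAT_WORDS for w in words)
           + len(CAT_PATTERNS.findall(lowered)))
    if esp == cat:
        return "unknown"
    return "es" if esp > cat else "ca"


sample_texts = ["El año que es abundante de poesía, suele serlo de hambre.",
                "Cal no abandonar mai ni la tasca ni l'esperança."]
for text in sample_texts:
    print(guess_es_or_ca(text), ":", text)
```

Returning "unknown" on a tie leaves room to fall back on a general-purpose detector for the rows the hints cannot decide.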
Approach 2: googletrans library
You could also use the googletrans library to detect the language:
from googletrans import Translator
translator = Translator()
for text in sample_texts:
    lang = translator.detect(text).lang
    print(lang, ":", text)
>>> es : El año que es abundante de poesía, suele serlo de hambre.
ca : Cal no abandonar mai ni la tasca ni l'esperança.
DicCat = ['amb','cap','dalt','damunt','des','dintre','durant','excepte','fins','per','pro','sense','sota','llei','hi','ha','més','mes','moment','órgans', 'segóns','Article','i','per','els','amb','és','com','dels','més','seu','seva','fou','també','però','als','després','aquest','fins','any','són','hi','pel','aquesta','durant','on','part','altres','anys','ciutat','cap','des','seus','tot','estat','qual','segle','quan','ja','havia','molt','rei','nom','fer','així','li','sant','encara','pels','seves','té','partit','està','mateix','pot','nord','temps','fill','només','dues','sota','lloc','això','alguns','govern','uns','aquests','mort','nou','tots','fet','sense','frança','grup','tant','terme','fa','tenir','segons','món','regne','exèrcit','segona','abans','mentre','quals','aquestes','família','catalunya','eren','poden','diferents','nova','molts','església','major','club','estats','seua','diversos','grans','què','arribar','troba','població','poble','foren','època','haver','eleccions','diverses','tipus','riu','dia','quatre','poc','regió','exemple','batalla','altre','espanya','joan','actualment','tenen','dins','llavors','centre','algunes','important','altra','terra','antic','tenia','obres','estava','pare','qui','ara','havien','començar','història','morir','majoria','qui','ara','havien','començar','història','morir','majoria']
DicEsp = ['los','y','bajo','con', 'entre','hacia','hasta','para','por','según','segun','sin','tras','más','mas','ley','capítulo','capitulo','título','titulo','momento','y','las','por','con','su','para','lo','como','más','pero','sus','le','me','sin','este','ya','cuando','todo','esta','son','también','fue','había','muy','años','hasta','desde','está','mi','porque','qué','sólo','yo','hay','vez','puede','todos','así','nos','ni','parte','tiene','él','uno','donde','bien','tiempo','mismo','ese','ahora','otro','después','te','otros','aunque','esa','eso','hace','otra','gobierno','tan','durante','siempre','día','tanto','ella','sí','dijo','sido','según','menos','año','antes','estado','sino','caso','nada','hacer','estaba','poco','estos','presidente','mayor','ante','unos','algo','hacia','casa','ellos','ayer','hecho','mucho','mientras','además','quien','momento','millones','esto','españa','hombre','están','pues','hoy','lugar','madrid','trabajo','otras','mejor','nuevo','decir','algunos','entonces','todas','días','debe','política','cómo','casi','toda','tal','luego','pasado','medio','estas','sea','tenía','nunca','aquí','ver','veces','embargo','partido','personas','grupo','cuenta','pueden','tienen','misma','nueva','cual','fueron','mujer','frente','josé','tras','cosas','fin','ciudad','he','social','tener','será','historia','muchos','juan','tipo','cuatro','dentro','nuestro','punto','dice','ello','cualquier','noche','aún','agua','parece','haber','situación','fuera','bajo','grandes','nuestra','ejemplo','acuerdo','habían','usted','estados','hizo','nadie','países','horas','posible','tarde','ley','importante','desarrollo','proceso','realidad','sentido','lado','mí','tu','cambio','allí','mano','eran','estar','san','número','sociedad','unas','centro','padre','gente','relación','cuerpo','incluso','través','último','madre','mis','modo','problema','cinco','carlos','hombres','información','ojos','muerte','nombre','algunas','público','mujeres','siglo','todavía','meses','mañana','esos','nosotros','ho
ra','muchas','pueblo','alguna','dar','don','da','tú','derecho','verdad','maría','unidos','podría','sería','junto','cabeza','aquel','luis','cuanto','tierra','equipo','segundo','director','dicho','cierto','casos','manos','nivel','podía','familia','largo','falta','llegar','propio','ministro','cosa','primero','seguridad','hemos','mal','trata','algún','tuvo','respecto','semana','varios','real','sé','voz','paso','señor','mil','quienes','proyecto','mercado','mayoría','luz','claro','iba','éste','pesetas','orden','español','buena','quiere','aquella','programa','palabras','internacional','esas','segunda','empresa','puesto','ahí','propia','libro','igual','político','persona','últimos','ellas','total','creo','tengo','dios','española','condiciones','méxico','fuerza','solo','único','acción','amor','policía','puerta','pesar','sabe','calle','interior','tampoco','ningún','vista','campo','buen','hubiera','saber','obras','razón','niños','presencia','tema','dinero','comisión','antonio','servicio','hijo','última','ciento','estoy','hablar','dio','minutos','producción','camino','seis','quién','fondo','dirección','papel','demás','idea','especial','diferentes','dado','base','capital','ambos','europa','libertad','relaciones','espacio','medios','ir','actual','población','empresas','estudio','salud','servicios','haya','principio','siendo','cultura','anterior','alto','media','mediante','primeros','arte','paz','sector','imagen','medida','deben','datos','consejo','personal','interés','julio','grupos','miembros','ninguna','existe','cara','edad','movimiento','visto','llegó','puntos','actividad','bueno','uso','niño','difícil','joven','futuro','aquellos','mes','pronto','soy','hacía','nuevos','nuestros','estaban','posibilidad','sigue','cerca','resultados','educación','atención','gonzález','capacidad','efecto','necesario','valor','aire','investigación','siguiente','figura','central','comunidad','necesidad','serie','organizació','nuevas','calidad']
DicEng = ['all','my','have','do','and', 'or', 'what', 'can', 'you', 'the', 'on', 'it', 'at', 'since', 'for', 'ago', 'before', 'past', 'by', 'next', 'from','with', 'wich','law','is','the','of','and','to','in','is','you','that','it','he','was','for','on','are','as','with','his','they','at','be','this','have','from','or','one','had','by','word','but','not','what','all','were','we','when','your','can','said','there','use','an','each','which','she','do','how','their','if','will','up','other','about','out','many','then','them','these','so','some','her','would','make','like','him','into','time','has','look','two','more','write','go','see','number','no','way','could','people','my','than','first','water','been','call','who','oil','its','now','find','long','down','day','did','get','come','made','may','part','may','part']
def WhichLanguage(text):
    # Split on spaces and collect the words found in each language's list
    Input = text.lower().split(" ")
    CatScore = []
    EspScore = []
    EngScore = []
    for e in Input:
        if e in DicCat:
            CatScore.append(e)
        if e in DicEsp:
            EspScore.append(e)
        if e in DicEng:
            EngScore.append(e)
    # English wins only with strictly more hits than both others;
    # otherwise Catalan needs strictly more hits than Spanish
    if (len(EngScore) > len(EspScore)) and (len(EngScore) > len(CatScore)):
        Language = 'English'
    else:
        if len(CatScore) > len(EspScore):
            Language = 'Catala'
        else:
            Language = 'Espanyol'
    print(text)
    print("ESP= ", len(EspScore), EspScore)
    print("Cat = ", len(CatScore), CatScore)
    print("ING= ", len(EngScore), EngScore)
    print('Language is =', Language)
    print("-----")
    return Language
print(WhichLanguage("Hola bon dia"))
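For large datasets, the same word-list scoring may be faster and more robust with sets (constant-time lookups) and a tokenizer that ignores punctuation. This is a sketch under assumptions: the three small sets are short excerpts of the lists above, the names score_languages and which_language are hypothetical, and ties return None instead of defaulting to Spanish.

```python
import re

# Short excerpts of DicCat, DicEsp and DicEng; real use would load the full lists.
WORD_SETS = {
    "ca": {"amb", "els", "dels", "també", "però", "aquest", "molt"},
    "es": {"los", "con", "para", "también", "pero", "este", "muy"},
    "en": {"the", "and", "with", "this", "have"},
}


def score_languages(text, word_sets=WORD_SETS):
    """Count, per language, how many tokens of the text are in its word set."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # letters only, no punctuation
    return {lang: sum(t in words for t in tokens) for lang, words in word_sets.items()}


def which_language(text, word_sets=WORD_SETS):
    """Return the best-scoring language code, or None on a tie or no hits."""
    scores = score_languages(text, word_sets)
    best = max(scores, key=scores.get)
    top = scores[best]
    if top == 0 or list(scores.values()).count(top) > 1:
        return None
    return best


print(which_language("Els nens juguen amb la pilota, però també llegeixen."))
print(which_language("Los niños juegan con la pelota, pero también leen."))
```

Because tokens are extracted with a regex, a word followed by a comma still matches its entry, which the plain split(" ") version would miss.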

python - cut a string in 2 lines

I'm looking for a one-liner (using str.join, I think) to split a long string across two lines if it has too many words. I have the beginning, but I don't know how to insert the \n.
example = "Au Fil Des Antilles De La Martinique a Saint Barthelemy"
nmbr_word = len(example.split(" "))
if nmbr_word >= 6:
    # cut the string to have
    result = "Au Fil Des Antilles De La\nMartinique a Saint Barthelemy"
Thanks
How about using the textwrap module?
>>> import textwrap
>>> s = "Au Fil Des Antilles De La Martinique a Saint Barthelemy"
>>> textwrap.wrap(s, 30)
['Au Fil Des Antilles De La', 'Martinique a Saint Barthelemy']
>>> "\n".join(textwrap.wrap(s, 30))
'Au Fil Des Antilles De La\nMartinique a Saint Barthelemy'
How about:
words = example.split(" ")
'\n'.join([' '.join(words[i:i+6]) for i in range(0, len(words), 6)])
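Both answers can be wrapped up as follows. This is a small sketch: the helper name wrap_by_words is hypothetical, and textwrap.fill is the standard-library shortcut for joining the result of textwrap.wrap with newlines.

```python
import textwrap


def wrap_by_words(text, words_per_line=6):
    """Insert a newline after every `words_per_line` words."""
    words = text.split()
    lines = [" ".join(words[i:i + words_per_line])
             for i in range(0, len(words), words_per_line)]
    return "\n".join(lines)


example = "Au Fil Des Antilles De La Martinique a Saint Barthelemy"

by_words = wrap_by_words(example)     # break by word count
by_width = textwrap.fill(example, 30)  # break by character width
print(by_words)
print(by_width)
```

The word-count version always breaks after six words, while the width-based version adapts to the length of the words, so the two only agree by coincidence on this example.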

How to convert string to seo-url?

I would like to convert an accented string to an SEO URL...
For instance:
"Le bébé (de 4 ans) a également un étrange "rire"" to :
"le-bebe-de-4-ans-a-egalement-un-etrange-rire"
Any solution, please ?
Thanks !
This is what I use:
def _doStringSEOptiomization(objectName, pageName, lang, objectId):
    """
    Takes the name of an offer as input and performs these steps:
    1- Converts all accented variants of the vowels
       into plain vowels
    2- Through a series of regexes, removes the unwanted characters and returns
       a string to put in a link that suits search engines and indexing
    """
    try:
        import re  # import the regex module
        # Speaker, GREEN and RED are not defined here (presumably logging helpers from the surrounding project)
        Speaker.log_debug(GREEN("core.ws_site.do_sites_offers_data_redux._doStringSEOptiomization() input: objectName=%s, pageName=%s, lang=%s, objectId=%s" % (objectName, pageName, lang, objectId)))
        # map of HTML-entity and Unicode characters
        vocalMap = {'a': ['à','á','â','ã','ä','å','æ','à','á','â','ã','ä','å','ā','æ'],
                    'e': ['è','é','ê','ë','è','é','ê','ë','ē'],
                    'i': ['ì','í','î','ï','ì','í','î','ï','ī'],
                    'o': ['ò','ó','ô','œ','õ','ö','ò','ó','ô','œ','õ','ö','ō'],
                    'u': ['ù','ú','û','ü','ù','ú','û','ü','ū']
                    }
        objectName = objectName.lower()  # convert the input string to lowercase
        for vocale, lista in vocalMap.items():  # each map entry has a key and a list
            for elemento in lista:  # iterate over every element of the list
                objectName = objectName.replace(elemento, vocale)  # replace the accented variant with the plain vowel
        objectName = objectName.replace("/", "-")
        # strip all the unwanted characters:
        objectName = re.sub(r"[^a-z0-9_\s-]", "", objectName)
        objectName = re.sub(r"[\s-]+", " ", objectName)
        objectName = re.sub(r"[\s_]", "-", objectName)
        objectName = pageName + "--" + objectName
        objectName += "-" + lang + "-" + str(objectId)  # append the language and the offer id
    except Exception as s:
        Speaker.log_error("_doStringSEOptiomization(): Error=%s" % RED(s))
    return objectName
You have to adapt it for your situation.
This might (or might not) be enough:
import re
import unidecode
def normalized_id(title):
    title = unidecode.unidecode(title).lower()
    return re.sub(r'\W+', '-', title.replace("'", '')).strip('-')
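>>> import re, unicodedata
>>> a = u'Le bébé (de 4 ans) a également un étrange "rire"'
>>> r = unicodedata.normalize('NFKD',a).encode('cp1256','ignore')
>>> r = unicode(re.sub('[^\w\s-]','',r).strip().lower())
>>> r = re.sub('[-\s]+','-',r)
>>> print r
le-bebe-de-4-ans-a-egalement-un-etrange-rire
I use cp1256 to handle accented characters: after NFKD normalization the combining accents cannot be encoded, so 'ignore' drops them. (cp1256 is actually the Windows Arabic code page, not Latin-1; encoding to 'ascii' drops the accents in the same way.)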
Perfect ! Thanks a lot !
If you have Django around, you can use its defaultfilter slugify (or adapt it for your needs).
[~]$ python
Python 2.7.1 (r271:86882M, Nov 30 2010, 10:35:34)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> import re
>>> def seo_string(x):
...     r = unicodedata.normalize('NFKD',x).encode('ascii','ignore')
...     r = unicode(re.sub('[^\w\s-]','',r).strip().lower())
...     return re.sub('[-\s]+','-',r)
...
>>> seo_string(u'Le bébé (de 4 ans) a également un étrange "rire"')
u'le-bebe-de-4-ans-a-egalement-un-etrange-rire'
With thanks to the great slugify of Django's built-in filters. Note that the regex must run on the normalized value r: if it runs on the original x instead, the accented letters are stripped (giving 'le-bb-de-4-ans-a-galement-un-trange-rire') rather than é being replaced with e as in the solution posted by @doncallisto.
Here are several ways to do so: "Generating Slugs" by Armin Ronacher.
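The interpreter sessions above are Python 2. For Python 3, the same normalize-then-strip idea can be sketched with the standard library only. The function name slugify is just illustrative here and is not Django's implementation; characters without an ASCII decomposition (for example œ) are simply dropped.

```python
import re
import unicodedata


def slugify(text):
    """Turn an accented string into a lowercase, hyphen-separated ASCII slug."""
    # NFKD splits 'é' into 'e' plus a combining accent; encoding to ASCII
    # with errors='ignore' then drops the accent and keeps the base letter.
    ascii_text = (unicodedata.normalize("NFKD", text)
                  .encode("ascii", "ignore")
                  .decode("ascii"))
    # Remove everything except word characters, whitespace and hyphens.
    cleaned = re.sub(r"[^\w\s-]", "", ascii_text).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", cleaned)


print(slugify('Le bébé (de 4 ans) a également un étrange "rire"'))
```

For languages where dropping characters is not acceptable, a transliteration library such as unidecode (used in one of the answers) is the more complete option.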
