Process subset of a Document with spaCy

When processing a text that contains multiple German sentences, such as text = 'Frau Dr. Peters ist heute nicht erreichbar. Kommen Sie bitte morgen wieder.', I want to do some further processing on each sentence. For this I need (in my case) only the tokens as text.
Here's an example:
import spacy
nlp = spacy.load('de_core_news_sm')
text = 'Frau Dr. Peters ist heute nicht erreichbar. Kommen Sie bitte morgen wieder.'
doc = nlp(text)
# Convert Document to list of tokens
def tokens_to_list(doc):
    tokens = []
    for token in doc:
        tokens.append(token.lower_)
    return tokens

sentences = list(doc.sents)
tokens_sent = []
for sent in sentences:
    tokens = tokens_to_list(sent.as_doc())
    tokens_sent.append(tokens)
print(tokens_sent)
I would expect to see this in my console:
[['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.'], ['kommen', 'sie', 'bitte', 'morgen', 'wieder', '.']]
Instead, this is the output (I added some formatting for better visibility):
['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.',
['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.',
[...],
'kommen', 'sie', 'bitte', 'morgen', 'wieder', '.',
[...]
],
'kommen', 'sie', 'bitte', 'morgen', 'wieder', '.'
['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.',
[...],
'kommen', 'sie', 'bitte', 'morgen', 'wieder', '.',
[...]
]
]
As you can see, there seems to be some kind of recursion in my list. Further inspection shows that each [...] element contains the same elements as the layer above and refers back into itself.
I can't figure out why or how to achieve the expected output.
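For reference, a minimal sketch that produces the expected nested list without as_doc(): a sentence Span is itself iterable, so its tokens can be collected directly (same model and text as in the question).
import spacy

nlp = spacy.load('de_core_news_sm')
text = 'Frau Dr. Peters ist heute nicht erreichbar. Kommen Sie bitte morgen wieder.'
doc = nlp(text)

# Each sentence is a Span; iterating it yields its tokens directly.
tokens_sent = [[token.lower_ for token in sent] for sent in doc.sents]
print(tokens_sent)
# [['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.'], ['kommen', 'sie', 'bitte', 'morgen', 'wieder', '.']]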

Related

NLTK find german nouns

I want to extract all German nouns from a German text in lemmatized form with NLTK.
I also looked at spaCy, but NLTK is much preferred here because, for English, it already works with the needed performance and the requested data structure.
I have the following working code for english:
import nltk
from nltk.stem import WordNetLemmatizer
#germanText='Jahrtausendelang ging man davon aus, dass auch der Sender einen geheimen Schlüssel, und zwar den gleichen wie der Empfänger, benötigt.'
text='For thousands of years it was assumed that the sender also needed a secret key, the same as the recipient.'
tokens = nltk.word_tokenize(text)
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']
print(tokens)
I get the print as expected:
['year', 'sender', 'key', 'recipient']
Now I tried to do this for German:
import nltk
from nltk.stem import WordNetLemmatizer
germanText='Jahrtausendelang ging man davon aus, dass auch der Sender einen geheimen Schlüssel, und zwar den gleichen wie der Empfänger, benötigt.'
#text='For thousands of years it was assumed that the sender also needed a secret key, the same as the recipient.'
tokens = nltk.word_tokenize(germanText, language='german')
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']
print(tokens)
And I get a wrong result:
['jahrtausendelang', 'man', 'davon', 'au', 'der', 'sender', 'einen', 'geheimen', 'zwar', 'den', 'gleichen', 'wie', 'der', 'empfänger', 'benötigt']
Neither the lemmatization nor the noun extraction worked.
What is the proper way to adapt this code to different languages?
I also checked other solutions like:
from nltk.stem.snowball import GermanStemmer
stemmer = GermanStemmer("german") # Choose a language
tokenGer=stemmer.stem(tokens)
But this would mean starting over from the beginning.
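(Note that GermanStemmer.stem() operates on a single word rather than a list, so it would have to be applied per token — a minimal sketch, with a placeholder token list:)
from nltk.stem.snowball import GermanStemmer

# stem() takes one word at a time, so map it over the token list
stemmer = GermanStemmer()
tokens = ['jahrtausendelang', 'sender', 'schlüssel', 'empfänger']
stems = [stemmer.stem(tok) for tok in tokens]
print(stems)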
I have found a way with the HanoverTagger:
import nltk
from HanTa import HanoverTagger as ht
tagger = ht.HanoverTagger('morphmodel_ger.pgz')
words = nltk.word_tokenize(germanText, language='german')
print(tagger.tag_sent(words))
tokens = [word for (word, x, pos) in tagger.tag_sent(words, taglevel=1) if pos == 'NN']
I get the outcome as expected: ['Jahrtausendelang', 'Sender', 'Schlüssel', 'Empfänger']
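Since the goal was the lemmatized form, it may help to note that tag_sent(words, taglevel=1) returns (word, lemma, POS) triples, so the lemma is the middle element of each tuple. A sketch under that assumption:
import nltk
from HanTa import HanoverTagger as ht

tagger = ht.HanoverTagger('morphmodel_ger.pgz')
words = nltk.word_tokenize(germanText, language='german')
# take the middle element (the lemma) of each (word, lemma, POS) triple
lemmas = [lemma for (word, lemma, pos) in tagger.tag_sent(words, taglevel=1) if pos == 'NN']
print(lemmas)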

Got Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)

I was getting the following error while I was trying to process a list of strings with spaCy.
TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)
Here is the code below
f= "MotsVides.txt"
file = open(f, 'r', encoding='utf-8')
stopwords = [line.rstrip() for line in file]
# stopwords =['alors', 'au', 'aucun', 'aussi', 'autre', 'avant', 'avec', 'avoir', 'bon', 'car', 'ce', 'cela', 'ces', 'ceux', 'chaque', 'ci', 'comme', 'comment', 'ça', 'dans', 'des', 'du', 'dedans', 'dehors', 'depuis', 'deux', 'devrait', 'doit', 'donc', 'dos', 'droite', 'début', 'elle', 'elles', 'en', 'encore', 'essai', 'est', 'et', 'eu', 'étaient', 'état', 'étions', 'été', 'être', 'fait', 'faites', 'fois', 'font', 'force', 'haut', 'hors', 'ici', 'il', 'ils', 's', 'juste', 'la', 'le', 'les', 'leur', 'là\t ma', 'maintenant', 'mais', 'mes', 'mine', 'moins', 'mon', 'mot', 'même', 'ni', 'nommés', 'notre', 'nous', 'nouveaux', 'ou', 'où', 'par', 'parce', 'parole', 'pas', 'personnes', 'peut', 'peu', 'pièce', 'plupart', 'pour', 'pourquoi', 'quand', 'que', 'quel', 'quelle', 'quelles', 'quels', 'qui\t sa', 'sans', 'ses', 'seulement', 'si', 'sien', 'son', 'sont', 'sous', 'soyez', 'sujet', 'sur', 'ta', 'tandis', 'tellement', 'tels', 'tes', 'ton', 'tous', 'tout', 'trop', 'très', 'tu', 'valeur', 'voie', 'voient', 'vont', 'votre', 'vous', 'vu']
def spacy_process(texte):
    for lt in texte:
        mytokens = nlp(lt)
        print(mytokens)
        mytokens2 = [word.lemma_.lower().strip() for word in mytokens if word.pos_ != "PUNCT" and word not in stopwords]
        print(type(mytokens2))
a = ['je suis la bonne personne et droit à la caricature.', 'Je suis la bonne personne et droit à la caricature.']
spacy_process(a)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-133-03cc18018278> in <module>
33
34 a = ['je suis la bonne personne et droit à la caricature.', 'Je suis la bonne personne et droit à la caricature.']
---> 35 spacy_process(a)
<ipython-input-133-03cc18018278> in spacy_process(texte)
28 mytokens = nlp(lt)
29 print(mytokens)
---> 30 mytokens2 = [word.lemma_.lower().strip() for word in mytokens if word.pos_ != "PUNCT" and word not in stopwords]
31
32 print(type(mytokens2))
<ipython-input-133-03cc18018278> in <listcomp>(.0)
28 mytokens = nlp(lt)
29 print(mytokens)
---> 30 mytokens2 = [word.lemma_.lower().strip() for word in mytokens if word.pos_ != "PUNCT" and word not in stopwords]
31
32 print(type(mytokens2))
TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)
The issue is that word in word not in stopwords is a Token, not a string. Python complains because it is comparing a Token against a list of strings, which doesn't work.
With spaCy you want to use word.text to get the string, not word.
The following code should work...
import spacy

stopwords = ['alors', 'au', 'aucun', 'aussi', 'autre']  # truncated for simplicity
nlp = spacy.load('en_core_web_sm')  # full model name; the 'en' shortcut was removed in spaCy 3

def spacy_process(texte):
    for lt in texte:
        mytokens = nlp(lt)
        mytokens2 = [word.lemma_.lower().strip() for word in mytokens if word.pos_ != "PUNCT" and word.text not in stopwords]
        print(mytokens2)

a = ['je suis la bonne personne et droit à la caricature.', 'Je suis la bonne personne et droit à la caricature.']
spacy_process(a)
BTW... Checking for a value in a list is fairly slow. You should convert your list to a set to speed things up.
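For example:
stopwords = set(stopwords)  # set membership tests are O(1) on average, vs O(n) for a list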

Retrieving the first list only (JSON-Python)

I have a complicated JSON object with dictionary information about words, and I want to get only the synonyms. I managed to retrieve them, but some words have two or more lists of synonyms (since they can be, for example, a verb and a noun at the same time). I would like to get only the first list of synonyms. Here is what I've done:
import requests
import json
with open(r'C:\Users...') as file:
    list = []
    for line in file.readlines():
        list += line.split()

for keyword in list:
    print(keyword)
    ship_api_url = "https://..."
    request_data = requests.get(ship_api_url)
    data = request_data.text
    parsed = json.loads(data)
    # print(json.dumps(parsed, indent=3))
    for item in parsed:
        print(item['meta']['syns'][0])
And here's what I get - note that the word 'watch' has three lists of synonyms, the word 'create' has only one list of synonyms and the word 'created' has two lists of synonyms:
watch
['custodian', 'guard', 'guardian', 'keeper', 'lookout', 'minder', 'picket', 'sentinel', 'sentry', 'warden', 'warder', 'watcher', 'watchman']
['eye', 'follow', 'observe']
['anticipate', 'await', 'expect', 'hope (for)']
create
['beget', 'breed', 'bring', 'bring about', 'bring on', 'catalyze', 'cause', 'do', 'draw on', 'effect', 'effectuate', 'engender', 'generate', 'induce', 'invoke', 'make', 'occasion', 'produce', 'prompt', 'result (in)', 'spawn', 'translate (into)', 'work', 'yield']
created
['begot', 'bred', 'brought', 'brought about', 'brought on', 'catalyzed', 'caused', 'did', 'drew on', 'effected', 'effectuated', 'engendered', 'generated', 'induced', 'invoked', 'made', 'occasioned', 'produced', 'prompted', 'resulted (in)', 'spawned', 'translated (into)', 'worked', 'yielded']
['beget', 'breed', 'bring', 'bring about', 'bring on', 'catalyze', 'cause', 'do', 'draw on', 'effect', 'effectuate', 'engender', 'generate', 'induce', 'invoke', 'make', 'occasion', 'produce', 'prompt', 'result (in)', 'spawn', 'translate (into)', 'work', 'yield']
If I add another [0] after the [0] I already have, I get the first word of each list, not the first whole list as I need...
If I got it right, you want to do something like this:
import requests
import json

with open(r'C:\Users...') as file:
    list = []
    for line in file.readlines():
        list += line.split()

for keyword in list:
    print(keyword)
    ship_api_url = "https://..."
    request_data = requests.get(ship_api_url)
    data = request_data.text
    parsed = json.loads(data)
    # print(json.dumps(parsed, indent=3))
    for item in parsed:
        for syns in item['meta']['syns']:
            print(syns)  # each element is already one synonym list
Also, don't name your variable list, as that shadows the built-in list type in Python.
As suggested in a comment by martineau, I solved the problem by adding a break statement after the print(item['meta']['syns'][0]) to stop the loop.
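A minimal sketch of that fix, using the same parsed structure as above:
for item in parsed:
    print(item['meta']['syns'][0])  # first synonym list of this entry
    break  # stop after the first entry, as suggested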

How can I solve an attribute error when using spacy?

I am using spaCy for natural language processing in German.
But I am running into this error:
AttributeError: 'str' object has no attribute 'text'
This is the text data I am working with:
tex = ['Wir waren z.B. früher auf\'m Fahrrad unterwegs in München (immer nach 11 Uhr).',
       'Nun fahren wir öfter mit der S-Bahn in München herum. Tja. So ist das eben.',
       'So bleibt mir nichts anderes übrig als zu sagen, vielen Dank für alles.',
       'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.']
My code:
data = [re.sub(r"\"", "", i) for i in tex]
data1 = [re.sub(r"\“", "", i) for i in data]
data2 = [re.sub(r"\„", "", i) for i in data1]
nlp = spacy.load('de')
spacy_doc1 = []
for line in data2:
    spac = nlp(line)
    lem = [tok.lemma_ for tok in spac]
    no_punct = [tok.text for tok in lem if re.match('\w+', tok.text)]
    no_numbers = [tok for tok in no_punct if not re.match('\d+', tok)]
I am writing every string into a separate list, because I need to assign the result of the processing back to the original specific string.
I also understand that the result written into lem is no longer in a format that spaCy can process.
So how can I do this correctly?
The problem here lies in the fact that spaCy's token.lemma_ returns a string, and strings have no text attribute (as the error states).
I suggest doing the same as you did when you wrote:
no_numbers = [tok for tok in no_punct if not re.match('\d+', tok)]
The only difference with this line in your code would be that you'd have to include the special string "-PRON-" in case you encounter English pronouns:
import re
import spacy

# using the English web model for practicality here
nlp = spacy.load('en_core_web_sm')

tex = ['I\'m going to get a cat tomorrow',
       'I don\'t know if I\'ll be able to get him a cat house though!']

data = [re.sub(r"\"", "", i) for i in tex]
data1 = [re.sub(r"\“", "", i) for i in data]
data2 = [re.sub(r"\„", "", i) for i in data1]

spacy_doc1 = []
for line in data2:
    spac = nlp(line)
    lem = [tok.lemma_ for tok in spac]
    no_punct = [tok for tok in lem if re.match(r'\w+', tok) or tok in ["-PRON-"]]
    no_numbers = [tok for tok in no_punct if not re.match(r'\d+', tok)]
    print(no_numbers)
# > ['-PRON-', 'be', 'go', 'to', 'get', 'a', 'cat', 'tomorrow']
# > ['-PRON-', 'do', 'not', 'know', 'if', '-PRON-', 'will', 'be', 'able', 'to', 'get', '-PRON-', 'a', 'cat', 'house', 'though']
Please tell me if this solved your problem as I may have misunderstood your issue.

Ideas to improve language detection between Spanish and Catalan

I'm working on a text mining script in python. I need to detect the language of a natural language field from the dataset.
The thing is, 98% of the rows are in Spanish and Catalan. I tried using some algorithms like the stopwords one or the langdetect library, but these languages share a lot of words so they fail a lot.
I'm looking for some ideas to improve this algorithm.
One idea is to build a dictionary of words that are specific to Spanish or Catalan, so that if a text contains any of these words it is tagged as that language.
Approach 1: Distinguishing characters
Characters that occur in only one of the two languages (note: there will be exceptions for proper names and loanwords, e.g. Barça):
esp_chars = "ñÑáÁýÝ"
cat_chars = "çÇàÀèÈòÒ·ŀĿ"
Example:
sample_texts = ["El año que es abundante de poesía, suele serlo de hambre.",
                "Cal no abandonar mai ni la tasca ni l'esperança."]

for text in sample_texts:
    if any(char in text for char in esp_chars):
        print("Spanish: {}".format(text))
    elif any(char in text for char in cat_chars):
        print("Catalan: {}".format(text))

>>> Spanish: El año que es abundante de poesía, suele serlo de hambre.
    Catalan: Cal no abandonar mai ni la tasca ni l'esperança.
If this isn't sufficient, you could expand this logic to search for language-exclusive digraphs, letter combinations, or words (a sketch follows the lists below):

                    Spanish only             Catalan only
Words               como y su con él otro    com i seva amb ell altre
Initial digraphs                             d' l'
Digraphs                                     ss tj qü l·l l.l
Terminal digraphs                            ig
Catalan letter combinations that only marginally appear in Spanish
tx
tg          (Es. exceptions postgrado, postgraduado, postguerra)
ny          (Es. exceptions mostly prefixed in-, en-, con- + y-)
ll (terminal) (Es. exceptions (loanwords): detall, nomparell)
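To illustrate, here is a hypothetical sketch of such an extension; the marker lists are assumptions drawn from the tables above, not an exhaustive inventory:
import re

# Assumed language-exclusive markers (regex), based on the lists above
cat_markers = [r"\bd'", r"\bl'", r"l·l", r"qü", r"\bcom\b", r"\bamb\b", r"\bi\b"]
esp_markers = [r"\bcomo\b", r"\bcon\b", r"\by\b", r"ñ"]

def guess_lang(text):
    # Count marker hits per language and return the side with more hits.
    text = text.lower()
    cat_hits = sum(len(re.findall(p, text)) for p in cat_markers)
    esp_hits = sum(len(re.findall(p, text)) for p in esp_markers)
    return "Catalan" if cat_hits > esp_hits else "Spanish"

print(guess_lang("Cal no abandonar mai ni la tasca ni l'esperança."))  # Catalan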
Approach 2: googletrans library
You could also use the googletrans library to detect the language:
from googletrans import Translator

translator = Translator()
for text in sample_texts:
    lang = translator.detect(text).lang
    print(lang, ":", text)

>>> es : El año que es abundante de poesía, suele serlo de hambre.
    ca : Cal no abandonar mai ni la tasca ni l'esperança.
DicCat = ['amb','cap','dalt','damunt','des','dintre','durant','excepte','fins','per','pro','sense','sota','llei','hi','ha','més','mes','moment','órgans', 'segóns','Article','i','per','els','amb','és','com','dels','més','seu','seva','fou','també','però','als','després','aquest','fins','any','són','hi','pel','aquesta','durant','on','part','altres','anys','ciutat','cap','des','seus','tot','estat','qual','segle','quan','ja','havia','molt','rei','nom','fer','així','li','sant','encara','pels','seves','té','partit','està','mateix','pot','nord','temps','fill','només','dues','sota','lloc','això','alguns','govern','uns','aquests','mort','nou','tots','fet','sense','frança','grup','tant','terme','fa','tenir','segons','món','regne','exèrcit','segona','abans','mentre','quals','aquestes','família','catalunya','eren','poden','diferents','nova','molts','església','major','club','estats','seua','diversos','grans','què','arribar','troba','població','poble','foren','època','haver','eleccions','diverses','tipus','riu','dia','quatre','poc','regió','exemple','batalla','altre','espanya','joan','actualment','tenen','dins','llavors','centre','algunes','important','altra','terra','antic','tenia','obres','estava','pare','qui','ara','havien','començar','història','morir','majoria','qui','ara','havien','començar','història','morir','majoria']
DicEsp = ['los','y','bajo','con', 'entre','hacia','hasta','para','por','según','segun','sin','tras','más','mas','ley','capítulo','capitulo','título','titulo','momento','y','las','por','con','su','para','lo','como','más','pero','sus','le','me','sin','este','ya','cuando','todo','esta','son','también','fue','había','muy','años','hasta','desde','está','mi','porque','qué','sólo','yo','hay','vez','puede','todos','así','nos','ni','parte','tiene','él','uno','donde','bien','tiempo','mismo','ese','ahora','otro','después','te','otros','aunque','esa','eso','hace','otra','gobierno','tan','durante','siempre','día','tanto','ella','sí','dijo','sido','según','menos','año','antes','estado','sino','caso','nada','hacer','estaba','poco','estos','presidente','mayor','ante','unos','algo','hacia','casa','ellos','ayer','hecho','mucho','mientras','además','quien','momento','millones','esto','españa','hombre','están','pues','hoy','lugar','madrid','trabajo','otras','mejor','nuevo','decir','algunos','entonces','todas','días','debe','política','cómo','casi','toda','tal','luego','pasado','medio','estas','sea','tenía','nunca','aquí','ver','veces','embargo','partido','personas','grupo','cuenta','pueden','tienen','misma','nueva','cual','fueron','mujer','frente','josé','tras','cosas','fin','ciudad','he','social','tener','será','historia','muchos','juan','tipo','cuatro','dentro','nuestro','punto','dice','ello','cualquier','noche','aún','agua','parece','haber','situación','fuera','bajo','grandes','nuestra','ejemplo','acuerdo','habían','usted','estados','hizo','nadie','países','horas','posible','tarde','ley','importante','desarrollo','proceso','realidad','sentido','lado','mí','tu','cambio','allí','mano','eran','estar','san','número','sociedad','unas','centro','padre','gente','relación','cuerpo','incluso','través','último','madre','mis','modo','problema','cinco','carlos','hombres','información','ojos','muerte','nombre','algunas','público','mujeres','siglo','todavía','meses','mañana','esos','nosotros','hora','muchas','pueblo','alguna','dar','don','da','tú','derecho','verdad','maría','unidos','podría','sería','junto','cabeza','aquel','luis','cuanto','tierra','equipo','segundo','director','dicho','cierto','casos','manos','nivel','podía','familia','largo','falta','llegar','propio','ministro','cosa','primero','seguridad','hemos','mal','trata','algún','tuvo','respecto','semana','varios','real','sé','voz','paso','señor','mil','quienes','proyecto','mercado','mayoría','luz','claro','iba','éste','pesetas','orden','español','buena','quiere','aquella','programa','palabras','internacional','esas','segunda','empresa','puesto','ahí','propia','libro','igual','político','persona','últimos','ellas','total','creo','tengo','dios','española','condiciones','méxico','fuerza','solo','único','acción','amor','policía','puerta','pesar','sabe','calle','interior','tampoco','ningún','vista','campo','buen','hubiera','saber','obras','razón','niños','presencia','tema','dinero','comisión','antonio','servicio','hijo','última','ciento','estoy','hablar','dio','minutos','producción','camino','seis','quién','fondo','dirección','papel','demás','idea','especial','diferentes','dado','base','capital','ambos','europa','libertad','relaciones','espacio','medios','ir','actual','población','empresas','estudio','salud','servicios','haya','principio','siendo','cultura','anterior','alto','media','mediante','primeros','arte','paz','sector','imagen','medida','deben','datos','consejo','personal','interés','julio','grupos','miembros','ninguna','existe','cara','edad','movimiento','visto','llegó','puntos','actividad','bueno','uso','niño','difícil','joven','futuro','aquellos','mes','pronto','soy','hacía','nuevos','nuestros','estaban','posibilidad','sigue','cerca','resultados','educación','atención','gonzález','capacidad','efecto','necesario','valor','aire','investigación','siguiente','figura','central','comunidad','necesidad','serie','organización','nuevas','calidad']
DicEng = ['all','my','have','do','and', 'or', 'what', 'can', 'you', 'the', 'on', 'it', 'at', 'since', 'for', 'ago', 'before', 'past', 'by', 'next', 'from','with', 'wich','law','is','the','of','and','to','in','is','you','that','it','he','was','for','on','are','as','with','his','they','at','be','this','have','from','or','one','had','by','word','but','not','what','all','were','we','when','your','can','said','there','use','an','each','which','she','do','how','their','if','will','up','other','about','out','many','then','them','these','so','some','her','would','make','like','him','into','time','has','look','two','more','write','go','see','number','no','way','could','people','my','than','first','water','been','call','who','oil','its','now','find','long','down','day','did','get','come','made','may','part','may','part']
def WhichLanguage(text):
    Input = text.lower().split(" ")
    CatScore = []
    EspScore = []
    EngScore = []
    for e in Input:
        if e in DicCat:
            CatScore.append(e)
        if e in DicEsp:
            EspScore.append(e)
        if e in DicEng:
            EngScore.append(e)
    if (len(EngScore) > len(EspScore)) and (len(EngScore) > len(CatScore)):
        Language = 'English'
    elif len(CatScore) > len(EspScore):
        Language = 'Catala'
    else:
        Language = 'Espanyol'
    print(text)
    print("ESP =", len(EspScore), EspScore)
    print("Cat =", len(CatScore), CatScore)
    print("ING =", len(EngScore), EngScore)
    print('Language is =', Language)
    print("-----")
    return Language
print(WhichLanguage("Hola bon dia"))
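As with the stopword tip earlier, converting these large lists to sets would speed up the membership tests inside WhichLanguage considerably:
# set lookups are O(1) on average; worthwhile given the size of these word lists
DicCat, DicEsp, DicEng = set(DicCat), set(DicEsp), set(DicEng)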
