How can I solve an attribute error when using spacy? - python

I am using spaCy for natural language processing in German, but I am running into this error:
AttributeError: 'str' object has no attribute 'text'
This is the text data I am working with:
tex = ['Wir waren z.B. früher auf\'m Fahrrad unterwegs in München (immer nach 11 Uhr).',
'Nun fahren wir öfter mit der S-Bahn in München herum. Tja. So ist das eben.',
'So bleibt mir nichts anderes übrig als zu sagen, vielen Dank für alles.',
'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.']
My code:
data = [re.sub(r"\"", "", i) for i in tex]
data1 = [re.sub(r"\“", "", i) for i in data]
data2 = [re.sub(r"\„", "", i) for i in data1]
nlp = spacy.load('de')
spacy_doc1 = []
for line in data2:
    spac = nlp(line)
    lem = [tok.lemma_ for tok in spac]
    no_punct = [tok.text for tok in lem if re.match('\w+', tok.text)]
    no_numbers = [tok for tok in no_punct if not re.match('\d+', tok)]
I am writing every string into a separate list because I need to assign the result of the processing back to the original string.
I also understand that the result written into lem is no longer in a format that spaCy can process.
So how can I do this correctly?

The problem here is that spaCy's token.lemma_ returns a string, and strings have no text attribute (as the error states).
I suggest doing the same as you did when you wrote:
no_numbers = [tok for tok in no_punct if not re.match('\d+', tok)]
The only difference with this line in your code would be that you'd have to include the special string "-PRON-" in case you encounter English pronouns:
import re
import spacy
# using the web English model for practicality here
nlp = spacy.load('en_core_web_sm')
tex = ['I\'m going to get a cat tomorrow',
'I don\'t know if I\'ll be able to get him a cat house though!']
data = [re.sub(r"\"", "", i) for i in tex]
data1 = [re.sub(r"\“", "", i) for i in data]
data2 = [re.sub(r"\„", "", i) for i in data1]
spacy_doc1 = []
for line in data2:
    spac = nlp(line)
    lem = [tok.lemma_ for tok in spac]
    no_punct = [tok for tok in lem if re.match('\w+', tok) or tok in ["-PRON-"]]
    no_numbers = [tok for tok in no_punct if not re.match('\d+', tok)]
    print(no_numbers)
# > ['-PRON-', 'be', 'go', 'to', 'get', 'a', 'cat', 'tomorrow']
# > ['-PRON-', 'do', 'not', 'know', 'if', '-PRON-', 'will', 'be', 'able', 'to', 'get', '-PRON-', 'a', 'cat', 'house', 'though']
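As a quick sanity check that needs no model download: once the lemmas are plain strings, the regex filters apply to the strings themselves, never to a .text attribute. The lemma list below is hypothetical stand-in data, not real spaCy output:

```python
import re

# Hypothetical stand-in for the strings that tok.lemma_ would return
lem = ["fahren", ".", "S-Bahn", "11", "Uhr", ","]

# Filter the strings directly -- a str has no .text attribute
no_punct = [tok for tok in lem if re.match(r"\w+", tok)]
no_numbers = [tok for tok in no_punct if not re.match(r"\d+", tok)]
print(no_numbers)  # -> ['fahren', 'S-Bahn', 'Uhr']
```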
Please tell me if this solved your problem as I may have misunderstood your issue.

Related

NLTK find german nouns

I want to extract all german nouns from a german text in lemmatized form with NLTK.
I also looked at spaCy, but I much prefer NLTK, because for English it already works with the performance and data structure I need.
I have the following working code for english:
import nltk
from nltk.stem import WordNetLemmatizer
#germanText='Jahrtausendelang ging man davon aus, dass auch der Sender einen geheimen Schlüssel, und zwar den gleichen wie der Empfänger, benötigt.'
text='For thousands of years it was assumed that the sender also needed a secret key, the same as the recipient.'
tokens = nltk.word_tokenize(text)
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']
print (tokens)
I get the print as expected:
['year', 'sender', 'key', 'recipient']
Now I tried to do this for German:
import nltk
from nltk.stem import WordNetLemmatizer
germanText='Jahrtausendelang ging man davon aus, dass auch der Sender einen geheimen Schlüssel, und zwar den gleichen wie der Empfänger, benötigt.'
#text='For thousands of years it was assumed that the sender also needed a secret key, the same as the recipient.'
tokens = nltk.word_tokenize(germanText, language='german')
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']
print (tokens)
And I get a wrong result:
['jahrtausendelang', 'man', 'davon', 'au', 'der', 'sender', 'einen', 'geheimen', 'zwar', 'den', 'gleichen', 'wie', 'der', 'empfänger', 'benötigt']
Neither the lemmatization nor the noun extraction worked. What is the proper way to adapt this code to other languages?
I also checked other solutions like:
I also checked other solutions, such as the Snowball stemmer (note that stem() works on a single word, so it must be applied per token):
from nltk.stem.snowball import GermanStemmer
stemmer = GermanStemmer()
tokenGer = [stemmer.stem(tok) for tok in tokens]
But this would make me start from the beginning.
I have found a way with the HanoverTagger:
from HanTa import HanoverTagger as ht
tagger = ht.HanoverTagger('morphmodel_ger.pgz')
words = nltk.word_tokenize(germanText)
print(tagger.tag_sent(words))
tokens = [word for (word, x, pos) in tagger.tag_sent(words, taglevel=1) if pos == 'NN']
I get the outcome as expected: ['Jahrtausendelang', 'Sender', 'Schlüssel', 'Empfänger']

Process subset of a Document with spacy

I am processing text with multiple German sentences, e.g. text = 'Frau Dr. Peters ist heute nicht erreichbar. Kommen Sie bitte morgen wieder.'. I want to do some further steps on each sentence, and for this I need (in my case) only the tokens as text.
Here's an example:
import spacy
nlp = spacy.load('de_core_news_sm')
text = 'Frau Dr. Peters ist heute nicht erreichbar. Kommen Sie bitte morgen wieder.'
doc = nlp(text)
# Convert Document to list of tokens
def token_to_list(doc):
    tokens = []
    for token in doc:
        tokens.append(token.lower_)
    return tokens

sentences = list(doc.sents)
tokens_sent = []
for sent in sentences:
    tokens = token_to_list(sent.as_doc())
    tokens_sent.append(tokens)
print(tokens_sent)
I would expect to see this in my console:
[['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.'], ['kommen', 'sie', 'bitte', 'morgen', 'wieder', '.']]
Instead this is the output (I added some format for better visibility):
['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.',
['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.',
[...],
'kommen', 'sie', 'bitte', 'morgen', 'wieder', '.',
[...]
],
'kommen', 'sie', 'bitte', 'morgen', 'wieder', '.'
['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.',
[...],
'kommen', 'sie', 'bitte', 'morgen', 'wieder', '.',
[...]
]
]
As you can see, there seems to be some kind of recursion over my list. Further inspection shows that the [...] element contains the same list of elements as the layer above and continues in itself.
I can't figure out why or how to achieve the expected output.

How to extract specific lemma or pos/tag using spacy?

I am using spaCy to lemmatize and parse a list of sentences; the data are contained in an Excel file.
I would like to write a function that lets me return different lemmas of my sentences, for example only the lemmas with a specific tag ("VERB", or "VERB" + "ADJ").
This is my code:
import spacy
from pandas import read_excel
from spacy.lang.fr import French
from spacy_lefff import LefffLemmatizer, POSTagger

nlp = spacy.load("fr_core_news_sm")
nlp = spacy.load('fr')
parser = French()

path = 'Gold.xlsx'
my_sheet = "Gold"
df = read_excel(path, sheet_name=my_sheet)

def tokenizeTexte(sample):
    tokens = parser(sample)
    lemmas = []
    for tok in tokens:
        lemmas.append((tok.lemma_.lower(), tok.tag_, tok.pos_))
    tokens = lemmas
    tokens = [tok for tok in tokens if tok not in stopwords]
    return tokens

df['Preprocess_verbatim'] = df.apply(lambda row: tokenizeTexte(row['verbatim']), axis=1)
print(df)
df.to_excel('output.xlsx')
I would like to be able to return all lemmas with, for example, the "VERB", "ADJ" or "ADV" tag, and later modify the function to return all lemmas. I also wish to return different combinations of lemmas ("PRON" + "VERB" + "ADJ"). How can I do that with spaCy?
This is what I obtain with my code:
id ... Preprocess_verbatim
0 463 ... [(ce, , ), (concept, , ), (résoudre, , ), (que...
1 2647 ... [(alors, , ), (ça, , ), (vouloir, , ), (dire, ...
2 5391 ... [(ça, , ), (ne, , ), (changer, , ), (rien, , )...
3 1120 ... [(sur, , ), (le, , ), (station, , ), (de, , ),
tok.tag and tok.pos do not appear; do you know why?
My file (an example of my data):
id verbatim
14 L'économe originellement est donc celui qui a la responsabilité, pour des personnes d'une maison, d'une unité d'organisation donnée .
25 De leur donner des rations de ressources au temps opportun.
56 Contrairement à l'idée qu'on se fait l'économe n'est pas axé sur le capital, c'est-à-dire sur l'action de capitaliser, mais sur les individus d'une unité organisation, c'est-à-dire sur l'action de partager, de redistribuer d'une façon juste et opportune des ressources aux différents membre
First, I think your model isn't working correctly because you're defining the nlp object twice. I believe you only need it once. I am also not sure what parser is doing and I'm not sure you need it. For this code, I would use something like the following:
nlp = spacy.load("fr_core_news_sm")
doc = nlp(sample)
tokens = [tok for tok in doc]
Then, doc is a spaCy Doc object, and tokens is a list of spaCy Token objects. From here, the loop that iterates over your tokens will work.
If you want to do the POS selection in your existing preprocessing function, I think you only need to change one line in your loop:
for tok in tokens:
    if tok.pos_ in ("VERB", "ADJ", "ADV"):
        lemmas.append((tok.lemma_.lower(), tok.tag_, tok.pos_))
This will only add tokens with those specific parts of speech to your lemmas list.
I also noticed another issue in your code on this line further down:
tokens = [tok for tok in tokens if tok not in stopwords]
At this point tok is a tuple of (lemma, tag, pos), so unless your stopword list contains tuples in the same format rather than plain lemmas or tokens, this step will not exclude anything.
Putting it all together, you'd have something like this, which would return a list of tuples of (lemma, tag, pos) if the POS is correct:
nlp = spacy.load("fr_core_news_sm")
stopwords = ["here", "are", "some", "stopwords"]

def tokenizeTexte(sample):
    doc = nlp(sample)
    lemmas = []
    for tok in doc:
        if tok.pos_ in ("VERB", "ADJ", "ADV"):
            lemmas.append((tok.lemma_.lower(), tok.tag_, tok.pos_))
    tokens = [(lemma, tag, pos) for (lemma, tag, pos) in lemmas if lemma not in stopwords]
    return tokens

Split string into list of two words, repeating the last word

I need to split a string into a list of two-word pairs, repeating the last word of each pair.
Here is what I tried, using examples I found in answers to other questions:
line = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."""
def split_line(in_line):
    line_sp = in_line.split(" ")
    line_two = [" ".join(line_sp[i:i + 2]) for i in range(0, len(line_sp), 2)]
    return line_two

print(split_line(line))
This results in:
['Lorem ipsum', 'dolor sit', 'amet, consectetur', 'adipiscing elit,', 'sed do', 'eiusmod tempor', 'incididunt ut', 'labore et', 'dolore magna', 'aliqua.']
But what I actually need is this:
['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet', 'amet, consectetur', 'consectetur adipiscing', ...]
How can I make it work?
Thanks!
You can use zip on the following two slices of words:
words = line.split()
print(list(map(' '.join, zip(words[:-1], words[1:]))))
This outputs:
['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet,', 'amet, consectetur', 'consectetur adipiscing', 'adipiscing elit,', 'elit, sed', 'sed do', 'do eiusmod', 'eiusmod tempor', 'tempor incididunt', 'incididunt ut', 'ut labore', 'labore et', 'et dolore', 'dolore magna', 'magna aliqua.']
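The zip pattern also generalizes to sliding windows of any size n; this small ngrams helper is a sketch of my own, not part of the original answer:

```python
def ngrams(line, n=2):
    """Join every sliding window of n consecutive words with spaces."""
    words = line.split()
    # zip over n staggered slices: words[0:], words[1:], ..., words[n-1:]
    return [" ".join(w) for w in zip(*(words[i:] for i in range(n)))]

print(ngrams("Lorem ipsum dolor sit"))       # -> ['Lorem ipsum', 'ipsum dolor', 'dolor sit']
print(ngrams("Lorem ipsum dolor sit", n=3))  # -> ['Lorem ipsum dolor', 'ipsum dolor sit']
```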
Simple for loop:
l = line.split(' ')
result = []
for i in range(len(l) - 1):
    result.append(l[i] + ' ' + l[i+1])
print(result)
# ['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet,', 'amet, consectetur', 'consectetur adipiscing', 'adipiscing elit,', 'elit, sed', 'sed do', 'do eiusmod', 'eiusmod tempor', 'tempor incididunt', 'incididunt ut', 'ut labore', 'labore et', 'et dolore', 'dolore magna', 'magna aliqua.']
What you are looking for is nltk.bigrams(); note that it yields tuples of consecutive words, which you can join back into strings with ' '.join:
import nltk
bigrm = list(nltk.bigrams(line.split()))
You can start by constructing a list of the words in the line:
words = line.split()
Then you can make a list of lists containing consecutive pairs, using slicing:
pairs = [words[i:i + 2] for i in range(len(words))]
Finally, you can join each pair with ' ' (the trailing one-word slice is filtered out):
result = [" ".join(pair) for pair in pairs if len(pair) > 1]
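Put together and run on a short sample line, the three steps give:

```python
line = "Lorem ipsum dolor sit amet,"

words = line.split()
# Slices of length 2; the last slice holds only 1 word and is dropped below
pairs = [words[i:i + 2] for i in range(len(words))]
result = [" ".join(pair) for pair in pairs if len(pair) > 1]
print(result)  # -> ['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet,']
```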
You can try something like the following. I don't know the syntax in Python, so I'm answering in Java; maybe you can convert it to Python:
String line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.";
String[] split = line.split(" ");
String[] line_two = new String[split.length - 1];
for (int i = 1; i < split.length; i++) {
    line_two[i - 1] = split[i - 1] + " " + split[i];
}
You can use a lazy generator with zip:
def split_line(in_line):
    line_sp = in_line.split()
    yield from map(' '.join, zip(line_sp, line_sp[1:]))

print(list(split_line(line)))
['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet,',
...
'labore et', 'et dolore', 'dolore magna', 'magna aliqua.']
You can try it with regex, too (after import re):
rslt = [" ".join(tup) for tup in re.findall(r"(\w+)\W+(?=(\w+))", line)]
\w+ one or more word characters;
(\w+) we capture the matched pattern;
\W+ one or more non-word characters;
(?=(\w+)) a lookahead: it matches without stepping forward, but still captures the next word.
For whatever it is worth, change the step of the loop from 2 to 1, and stop one word short of the end so the final slice still holds two words:
BEFORE:
line_sp = line.split(" ")
line_two = [" ".join(line_sp[i:i + 2]) for i in range(0, len(line_sp), 2)]
return line_two
FIXED:
line_sp = line.split(" ")
line_two = [" ".join(line_sp[i:i + 2]) for i in range(0, len(line_sp) - 1, 1)]
return line_two
print(split_line(line))

Ideas to improve language detection between Spanish and Catalan

I'm working on a text-mining script in Python, and I need to detect the language of a natural-language field in the dataset.
The thing is, 98% of the rows are in Spanish or Catalan. I tried some approaches, such as stopword matching and the langdetect library, but these two languages share a lot of words, so they fail often.
I'm looking for ideas to improve this detection.
One thought is to build a dictionary of words that are specific to Spanish or to Catalan, so that a text containing any of them is tagged with that language.
Approach 1: Distinguishing characters
Some characters appear in only one of the two languages (note: there will be exceptions for proper names and loanwords, e.g. Barça):
esp_chars = "ñÑáÁýÝ"
cat_chars = "çÇàÀèÈòÒ·ŀĿ"
Example:
sample_texts = ["El año que es abundante de poesía, suele serlo de hambre.",
"Cal no abandonar mai ni la tasca ni l'esperança."]
for text in sample_texts:
    if any(char in text for char in esp_chars):
        print("Spanish: {}".format(text))
    elif any(char in text for char in cat_chars):
        print("Catalan: {}".format(text))

>>> Spanish: El año que es abundante de poesía, suele serlo de hambre.
Catalan: Cal no abandonar mai ni la tasca ni l'esperança.
If this isn't sufficient, you could expand this logic to search for language-exclusive digraphs, letter combinations, or words:

                   Spanish only            Catalan only
Words              como y su con él otro   com i seva amb ell altre
Initial digraphs                           d' l'
Digraphs                                   ss tj qü l·l l.l
Terminal digraphs                          ig

Catalan letter combinations that appear only marginally in Spanish:
tx
tg            (Es. exceptions: postgrado, postgraduado, postguerra)
ny            (Es. exceptions mostly prefixed in-, en-, con- + y-)
ll (terminal) (Es. exceptions (loanwords): detall, nomparell)
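A minimal sketch of how such exclusive word markers could be scored; the marker sets below contain only the six example words from the table above, so this is illustrative rather than a complete detector:

```python
# Illustrative marker sets -- only the example words from the table
ESP_MARKERS = {"como", "y", "su", "con", "él", "otro"}
CAT_MARKERS = {"com", "i", "seva", "amb", "ell", "altre"}

def guess_language(text):
    """Tag a text by counting language-exclusive marker words."""
    words = text.lower().split()
    esp = sum(w in ESP_MARKERS for w in words)
    cat = sum(w in CAT_MARKERS for w in words)
    if esp > cat:
        return "es"
    if cat > esp:
        return "ca"
    return "unknown"

print(guess_language("vine con él y su perro"))          # -> es
print(guess_language("vaig amb ell i la seva germana"))  # -> ca
```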
Approach 2: googletrans library
You could also use the googletrans library to detect the language:
from googletrans import Translator
translator = Translator()
for text in sample_texts:
    lang = translator.detect(text).lang
    print(lang, ":", text)

>>> es : El año que es abundante de poesía, suele serlo de hambre.
ca : Cal no abandonar mai ni la tasca ni l'esperança.
DicCat = ['amb','cap','dalt','damunt','des','dintre','durant','excepte','fins','per','pro','sense','sota','llei','hi','ha','més','mes','moment','órgans', 'segóns','Article','i','per','els','amb','és','com','dels','més','seu','seva','fou','també','però','als','després','aquest','fins','any','són','hi','pel','aquesta','durant','on','part','altres','anys','ciutat','cap','des','seus','tot','estat','qual','segle','quan','ja','havia','molt','rei','nom','fer','així','li','sant','encara','pels','seves','té','partit','està','mateix','pot','nord','temps','fill','només','dues','sota','lloc','això','alguns','govern','uns','aquests','mort','nou','tots','fet','sense','frança','grup','tant','terme','fa','tenir','segons','món','regne','exèrcit','segona','abans','mentre','quals','aquestes','família','catalunya','eren','poden','diferents','nova','molts','església','major','club','estats','seua','diversos','grans','què','arribar','troba','població','poble','foren','època','haver','eleccions','diverses','tipus','riu','dia','quatre','poc','regió','exemple','batalla','altre','espanya','joan','actualment','tenen','dins','llavors','centre','algunes','important','altra','terra','antic','tenia','obres','estava','pare','qui','ara','havien','començar','història','morir','majoria','qui','ara','havien','començar','història','morir','majoria']
DicEsp = ['los','y','bajo','con', 'entre','hacia','hasta','para','por','según','segun','sin','tras','más','mas','ley','capítulo','capitulo','título','titulo','momento','y','las','por','con','su','para','lo','como','más','pero','sus','le','me','sin','este','ya','cuando','todo','esta','son','también','fue','había','muy','años','hasta','desde','está','mi','porque','qué','sólo','yo','hay','vez','puede','todos','así','nos','ni','parte','tiene','él','uno','donde','bien','tiempo','mismo','ese','ahora','otro','después','te','otros','aunque','esa','eso','hace','otra','gobierno','tan','durante','siempre','día','tanto','ella','sí','dijo','sido','según','menos','año','antes','estado','sino','caso','nada','hacer','estaba','poco','estos','presidente','mayor','ante','unos','algo','hacia','casa','ellos','ayer','hecho','mucho','mientras','además','quien','momento','millones','esto','españa','hombre','están','pues','hoy','lugar','madrid','trabajo','otras','mejor','nuevo','decir','algunos','entonces','todas','días','debe','política','cómo','casi','toda','tal','luego','pasado','medio','estas','sea','tenía','nunca','aquí','ver','veces','embargo','partido','personas','grupo','cuenta','pueden','tienen','misma','nueva','cual','fueron','mujer','frente','josé','tras','cosas','fin','ciudad','he','social','tener','será','historia','muchos','juan','tipo','cuatro','dentro','nuestro','punto','dice','ello','cualquier','noche','aún','agua','parece','haber','situación','fuera','bajo','grandes','nuestra','ejemplo','acuerdo','habían','usted','estados','hizo','nadie','países','horas','posible','tarde','ley','importante','desarrollo','proceso','realidad','sentido','lado','mí','tu','cambio','allí','mano','eran','estar','san','número','sociedad','unas','centro','padre','gente','relación','cuerpo','incluso','través','último','madre','mis','modo','problema','cinco','carlos','hombres','información','ojos','muerte','nombre','algunas','público','mujeres','siglo','todavía','meses','mañana','esos','nosotros','hora','muchas','pueblo','alguna','dar','don','da','tú','derecho','verdad','maría','unidos','podría','sería','junto','cabeza','aquel','luis','cuanto','tierra','equipo','segundo','director','dicho','cierto','casos','manos','nivel','podía','familia','largo','falta','llegar','propio','ministro','cosa','primero','seguridad','hemos','mal','trata','algún','tuvo','respecto','semana','varios','real','sé','voz','paso','señor','mil','quienes','proyecto','mercado','mayoría','luz','claro','iba','éste','pesetas','orden','español','buena','quiere','aquella','programa','palabras','internacional','esas','segunda','empresa','puesto','ahí','propia','libro','igual','político','persona','últimos','ellas','total','creo','tengo','dios','española','condiciones','méxico','fuerza','solo','único','acción','amor','policía','puerta','pesar','sabe','calle','interior','tampoco','ningún','vista','campo','buen','hubiera','saber','obras','razón','niños','presencia','tema','dinero','comisión','antonio','servicio','hijo','última','ciento','estoy','hablar','dio','minutos','producción','camino','seis','quién','fondo','dirección','papel','demás','idea','especial','diferentes','dado','base','capital','ambos','europa','libertad','relaciones','espacio','medios','ir','actual','población','empresas','estudio','salud','servicios','haya','principio','siendo','cultura','anterior','alto','media','mediante','primeros','arte','paz','sector','imagen','medida','deben','datos','consejo','personal','interés','julio','grupos','miembros','ninguna','existe','cara','edad','movimiento','visto','llegó','puntos','actividad','bueno','uso','niño','difícil','joven','futuro','aquellos','mes','pronto','soy','hacía','nuevos','nuestros','estaban','posibilidad','sigue','cerca','resultados','educación','atención','gonzález','capacidad','efecto','necesario','valor','aire','investigación','siguiente','figura','central','comunidad','necesidad','serie','organizació','nuevas','calidad']
DicEng = ['all','my','have','do','and', 'or', 'what', 'can', 'you', 'the', 'on', 'it', 'at', 'since', 'for', 'ago', 'before', 'past', 'by', 'next', 'from','with', 'wich','law','is','the','of','and','to','in','is','you','that','it','he','was','for','on','are','as','with','his','they','at','be','this','have','from','or','one','had','by','word','but','not','what','all','were','we','when','your','can','said','there','use','an','each','which','she','do','how','their','if','will','up','other','about','out','many','then','them','these','so','some','her','would','make','like','him','into','time','has','look','two','more','write','go','see','number','no','way','could','people','my','than','first','water','been','call','who','oil','its','now','find','long','down','day','did','get','come','made','may','part','may','part']
def WhichLanguage(text):
    Input = text.lower().split(" ")
    CatScore = []
    EspScore = []
    EngScore = []
    for e in Input:
        if e in DicCat:
            CatScore.append(e)
        if e in DicEsp:
            EspScore.append(e)
        if e in DicEng:
            EngScore.append(e)
    if (len(EngScore) > len(EspScore)) and (len(EngScore) > len(CatScore)):
        Language = 'English'
    elif len(CatScore) > len(EspScore):
        Language = 'Catala'
    else:
        Language = 'Espanyol'
    print(text)
    print("ESP= ", len(EspScore), EspScore)
    print("Cat = ", len(CatScore), CatScore)
    print("ING= ", len(EngScore), EngScore)
    print('Language is =', Language)
    print("-----")
    return Language
print(WhichLanguage("Hola bon dia"))
