spacy-udpipe with pytextrank to extract keywords from non-English text - python
I've been using pytextrank (https://github.com/DerwenAI/pytextrank/) with spaCy and English models for keyword extraction, and it works great!
Now I need to process non-English texts, and I found spacy-udpipe (https://github.com/TakeLab/spacy-udpipe), but it doesn't work out of the box. After
nlp = spacy_udpipe.load("sk")
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)
doc = nlp(text)
I get tokens with POS and DEP tags, but there is nothing in doc._.phrases (doc.noun_chunks is also empty), and nlp.pipe_names contains just ['textrank'].
What should I add to spaCy's pipeline to get it working? I assume pytextrank needs noun_chunks...
Any tip or suggestion where to look will help me - thanks!
Would you mind starting an issue about this on the PyTextRank repo?
https://github.com/DerwenAI/pytextrank/issues
Also, if you could please provide example text to use (in the language requested), we'll try to debug this integration.
Thanks for pointing it out!
Paco
I found a solution! I'm not sure how clean the nlp.Defaults.syntax_iterators = {"noun_chunks": get_chunks} override is, but it works (it's based on how noun_chunks are defined in syntax_iterators.py and __init__.py in spacy/lang/en):
import spacy_udpipe, spacy, pytextrank
from spacy.matcher import Matcher
from spacy.attrs import POS
def get_chunks(doc):
    np_label = doc.vocab.strings.add("NP")
    matcher = Matcher(doc.vocab)  # use doc.vocab so the iterator doesn't depend on the global nlp
    # one or more adjectives followed by one or more nouns / proper nouns
    pattern = [{POS: "ADJ", "OP": "+"}, {POS: {"IN": ["NOUN", "PROPN"]}, "OP": "+"}]
    matcher.add("Adjective(s), (p)noun", None, pattern)
    matches = matcher(doc)
    for match_id, start, end in matches:
        yield start, end, np_label
spacy_udpipe.download("sk") # download model
nlp = spacy_udpipe.load("sk")
nlp.Defaults.syntax_iterators = {"noun_chunks": get_chunks}  # noun_chunks replacement
tr = pytextrank.TextRank(logger=None)
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)
text = "Wikipédia je webová encyklopédia s otvoreným obsahom, ktorú možno slobodne čítať aj upravovať. Je sponzorovaná neziskovou organizáciou Wikimedia Foundation. Má 285 nezávislých jazykových vydaní vrátane slovenského a najrozsiahlejšieho anglického. Popri článkoch encyklopedického typu obsahuje, najmä anglická encyklopédia, aj články podobajúce sa almanachu, atlasu či stránky aktuálnych udalostí. Wikipédia je jedným z najpopulárnejších zdrojov informácií na webe s približne 13 miliardami zobrazení mesačne. Jej rast je skoro exponenciálny. Wikipédii (takmer 2 milióny). Wikipédia bola spustená 15. januára 2001 ako doplnok k expertmi písanej Nupedii. So stále rastúcou popularitou sa Wikipédia stala podhubím pre sesterské projekty ako Wikislovník (Wiktionary), Wikiknihy (Wikibooks) a Wikisprávy (Wikinews). Jej články sú upravované dobrovoľníkmi vo wiki štýle, čo znamená, že články môže meniť v podstate hocikto. Wikipediáni presadzujú politiku „nestranný uhol pohľadu“. Podľa nej relevantné názory ľudí sú sumarizované bez ambície určiť objektívnu pravdu. Vzhľadom na to, že Wikipédia presadzuje otvorenú filozofiu, jej najväčším problémom je vandalizmus a nepresnosť. "
doc = nlp(text)
print("Noun chunks:")
for nc in doc.noun_chunks:
    print(nc)
print("\nKeywords:")
for phrase in doc._.phrases:
    print("{:.4f} {:5d} {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)
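For reference: the code above uses the pytextrank 2.x style of registration. In later releases (pytextrank 3.x on spaCy 3.x), the component is registered as a string-named factory, so the wiring would look roughly like the sketch below. This assumes those versions and a spaCy-3-compatible spacy-udpipe; a noun_chunks override along the lines of get_chunks would still be needed for languages without a built-in syntax iterator.

import spacy_udpipe, pytextrank

nlp = spacy_udpipe.load("sk")
nlp.add_pipe("textrank", last=True)  # pytextrank >= 3.0 registers the "textrank" factory on import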
Related
NLTK find German nouns
I want to extract all German nouns from a German text in lemmatized form with NLTK. I also checked spaCy, but NLTK is much preferred because in English it already works with the needed performance and requested data structure. I have the following working code for English:

import nltk
from nltk.stem import WordNetLemmatizer

text = 'For thousands of years it was assumed that the sender also needed a secret key, the same as the recipient.'
tokens = nltk.word_tokenize(text)
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']
print(tokens)

I get the output as expected:

['year', 'sender', 'key', 'recipient']

Now I tried to do this for German:

import nltk
from nltk.stem import WordNetLemmatizer

# German equivalent of the English sentence above
germanText = 'Jahrtausendelang ging man davon aus, dass auch der Sender einen geheimen Schlüssel, und zwar den gleichen wie der Empfänger, benötigt.'
tokens = nltk.word_tokenize(germanText, language='german')
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']
print(tokens)

And I get a wrong result:

['jahrtausendelang', 'man', 'davon', 'au', 'der', 'sender', 'einen', 'geheimen', 'zwar', 'den', 'gleichen', 'wie', 'der', 'empfänger', 'benötigt']

Neither the lemmatization nor the noun extraction worked. What is the proper way to adapt this code to other languages? I also checked other solutions like:

from nltk.stem.snowball import GermanStemmer
stemmer = GermanStemmer("german")  # choose a language
tokenGer = stemmer.stem(tokens)

But this would make me start from the beginning.
I have found a way with the HanoverTagger:

from HanTa import HanoverTagger as ht

tagger = ht.HanoverTagger('morphmodel_ger.pgz')
words = nltk.word_tokenize(text)
print(tagger.tag_sent(words))
tokens = [word for (word, x, pos) in tagger.tag_sent(words, taglevel=1) if pos == 'NN']

I get the outcome as expected:

['Jahrtausendelang', 'Sender', 'Schlüssel', 'Empfänger']
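One note, since the question asked for lemmatized nouns: if I read HanTa's output right, tag_sent returns (word, lemma, tag) triples, so taking the middle element instead of the first would give the lemmas:

# assumption: HanTa's tag_sent yields (word, lemma, tag) triples
lemmas = [lemma for (word, lemma, pos) in tagger.tag_sent(words, taglevel=1) if pos == 'NN']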
How to extract specific lemma or pos/tag using spacy?
I am using spaCy to lemmatize and parse a list of sentences. The data are contained in an Excel file. I would like to write a function which allows me to return different lemmas of my sentences, for example returning only lemmas with a specific tag ("VERB", or "VERB" + "ADJ"). This is my code:

import spacy
from spacy.lang.fr import French
from spacy_lefff import LefffLemmatizer, POSTagger

nlp = spacy.load("fr_core_news_sm")
nlp = spacy.load('fr')
parser = French()

path = 'Gold.xlsx'
my_sheet = "Gold"
df = read_excel(path, sheet_name=my_sheet)

def tokenizeTexte(sample):
    tokens = parser(sample)
    lemmas = []
    for tok in tokens:
        lemmas.append((tok.lemma_.lower(), tok.tag_, tok.pos_))
    tokens = lemmas
    tokens = [tok for tok in tokens if tok not in stopwords]
    return tokens

df['Preprocess_verbatim'] = df.apply(lambda row: tokenizeTexte(row['verbatim']), axis=1)
print(df)
df.to_excel('output.xlsx')

I would like to be able to return all lemmas with, for example, a "VERB" or "ADJ" or "ADV" tag, and then modify the function to return all lemmas. I also wish to return different combinations of lemmas ("PRON" + "VERB" + "ADJ"). How can I do that with spaCy? This is what I obtain with my code:

id ... Preprocess_verbatim
0 463 ... [(ce, , ), (concept, , ), (résoudre, , ), (que...
1 2647 ... [(alors, , ), (ça, , ), (vouloir, , ), (dire, ...
2 5391 ... [(ça, , ), (ne, , ), (changer, , ), (rien, , )...
3 1120 ... [(sur, , ), (le, , ), (station, , ), (de, , ),

tok.tag_ and tok.pos_ do not appear; do you know why? Example of my data:

id  verbatim
14  L'économe originellement est donc celui qui a la responsabilité, pour des personnes d'une maison, d'une unité d'organisation donnée.
25  De leur donner des rations de ressources au temps opportun.
56  Contrairement à l'idée qu'on se fait, l'économe n'est pas axé sur le capital, c'est-à-dire sur l'action de capitaliser, mais sur les individus d'une unité d'organisation, c'est-à-dire sur l'action de partager, de redistribuer d'une façon juste et opportune des ressources aux différents membres
First, I think your model isn't working correctly because you're defining the nlp object twice; you only need it once. I am also not sure what parser is doing, and I'm not sure you need it. For this code, I would use something like the following:

nlp = spacy.load("fr_core_news_sm")
doc = nlp(sample)
tokens = [tok for tok in doc]

Then doc is a spaCy Doc object, and tokens is a list of spaCy Token objects. From here, the loop that iterates over your tokens would work.

If you want to do the POS selection in your existing preprocessing function, I think you only need to change one line in your loop:

for tok in tokens:
    if tok.pos_ in ("VERB", "ADJ", "ADV"):
        lemmas.append((tok.lemma_.lower(), tok.tag_, tok.pos_))

This will only add tokens with those specific parts of speech to your lemmas list.

I also noticed another issue in your code on this line further down:

tokens = [tok for tok in tokens if tok not in stopwords]

At this point tok is your tuple of (lemma, tag, pos), so unless your list of stopwords contains tuples of the same format, and not just lemmas or tokens you want to exclude, this step will not exclude anything.

Putting it all together, you'd have something like this, which returns a list of (lemma, tag, pos) tuples if the POS is correct:

nlp = spacy.load("fr_core_news_sm")
stopwords = ["here", "are", "some", "stopwords"]

def tokenizeTexte(sample):
    doc = nlp(sample)
    lemmas = []
    for tok in doc:  # iterate over the Doc (the earlier draft referenced an undefined `tokens` here)
        if tok.pos_ in ("VERB", "ADJ", "ADV"):
            lemmas.append((tok.lemma_.lower(), tok.tag_, tok.pos_))
    tokens = [(lemma, tag, pos) for (lemma, tag, pos) in lemmas if lemma not in stopwords]
    return tokens
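A quick usage check on one of the example verbatims (illustrative only; the exact tag_ strings depend on the fr_core_news_sm model version):

sample = "De leur donner des rations de ressources au temps opportun."
print(tokenizeTexte(sample))
# e.g. [('donner', ..., 'VERB'), ('opportun', ..., 'ADJ')]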
Spacy - Chunk NE tokens
Let's say that I have a document, like so:

import spacy
nlp = spacy.load('en')
doc = nlp('My name is John Smith')
[t for t in doc]
> [My, name, is, John, Smith]

spaCy is intelligent enough to realize that 'John Smith' is a multi-token named entity:

[e for e in doc.ents]
> [John Smith]

How can I make it chunk named entities into discrete tokens, like so:

> [My, name, is, John Smith]
The spaCy documentation on NER says that you can access token entity annotations using the token.ent_iob_ and token.ent_type_ attributes: https://spacy.io/usage/linguistic-features#accessing

Example:

import spacy

nlp = spacy.load('en')
doc = nlp('My name is John Smith')

ne = []
merged = []
for t in doc:
    # "O" -> current token is not part of a named entity
    if t.ent_iob_ == "O":
        if len(ne) > 0:
            merged.append(" ".join(ne))
            ne = []
        merged.append(t.text)
    else:
        ne.append(t.text)
if len(ne) > 0:
    merged.append(" ".join(ne))
print(merged)

This will print:

['My', 'name', 'is', 'John Smith']
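If you want the merged spans as single Token objects rather than joined strings, spaCy's retokenizer can merge each entity span in place. A minimal sketch (assuming the spaCy 2.1+ retokenize API):

import spacy

nlp = spacy.load('en')
doc = nlp('My name is John Smith')
# merge each named-entity span into one token, in place
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(ent)
print([t for t in doc])  # [My, name, is, John Smith]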
Truecasing - SpaCy
The intent is to capitalize based on POS tags, which I could achieve with the help of this question: How can I best determine the correct capitalization for a word? I'm trying to achieve similar results using spaCy:

def truecase(doc):
    truecased_sents = []  # list of truecased sentences
    tagged_sent = token.tag_([word.lower() for token in doc])
    normalized_sent = [w.capitalize() if t in ["NN", "NNS"] else w for (w, t) in tagged_sent]
    normalized_sent[0] = normalized_sent[0].capitalize()
    string = re.sub(" (?=[\.,'!?:;])", "", ' '.join(normalized_sent))
    return string

It throws this error:

tagged_sent = token.tag_([word.lower() for token in doc])
NameError: global name 'token' is not defined

How do I declare token as global and solve this issue? Is my approach correct?
import spacy, re

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'autonomous cars shift insurance liability toward manufacturers.')

tagged_sent = [(w.text, w.tag_) for w in doc]
normalized_sent = [w.capitalize() if t in ["NN", "NNS"] else w for (w, t) in tagged_sent]
normalized_sent[0] = normalized_sent[0].capitalize()
string = re.sub(r" (?=[\.,'!?:;])", "", ' '.join(normalized_sent))
print(string)

Output:

Autonomous Cars shift Insurance Liability toward Manufacturers.
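Wrapped into the truecase(doc) shape the question asked for (same logic as above, just as a function):

import re

def truecase(doc):
    # (text, tag) pairs from the already-parsed Doc
    tagged_sent = [(w.text, w.tag_) for w in doc]
    # capitalize singular/plural nouns, leave everything else as-is
    normalized_sent = [w.capitalize() if t in ["NN", "NNS"] else w for (w, t) in tagged_sent]
    normalized_sent[0] = normalized_sent[0].capitalize()
    # remove the space spaCy's tokenizer left before punctuation
    return re.sub(r" (?=[\.,'!?:;])", "", ' '.join(normalized_sent))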
Ideas to improve language detection between Spanish and Catalan
I'm working on a text mining script in Python. I need to detect the language of a natural-language field in the dataset. The thing is, 98% of the rows are in Spanish or Catalan, and since these languages share a lot of words, approaches like stopword matching or the langdetect library fail a lot. I'm looking for ideas to improve this. One thought is to build a dictionary of words that are specific to Spanish or to Catalan, so that a text containing any of these words is tagged as that language.
Approach 1: Distinguishing characters

Some characters occur in Spanish but not Catalan, and vice versa (note: there will be exceptions for proper names and loanwords, e.g. Barça):

esp_chars = "ñÑáÁýÝ"
cat_chars = "çÇàÀèÈòÒ·ŀĿ"

Example:

sample_texts = ["El año que es abundante de poesía, suele serlo de hambre.",
                "Cal no abandonar mai ni la tasca ni l'esperança."]

for text in sample_texts:
    if any(char in text for char in esp_chars):
        print("Spanish: {}".format(text))
    elif any(char in text for char in cat_chars):
        print("Catalan: {}".format(text))

Output:

Spanish: El año que es abundante de poesía, suele serlo de hambre.
Catalan: Cal no abandonar mai ni la tasca ni l'esperança.

If this isn't sufficient, you could expand this logic to search for language-exclusive digraphs, letter combinations, or words:

                    Spanish only               Catalan only
Words               como, y, su, con,          com, i, seva, amb,
                    él, otro                   ell, altre
Initial digraphs    -                          d', l'
Digraphs            -                          ss, tj, qü, l·l, l.l
Terminal digraphs   -                          ig

Catalan letter combinations that only marginally appear in Spanish:
- tx
- tg (Spanish exceptions: postgrado, postgraduado, postguerra)
- ny (Spanish exceptions mostly prefixed in-, en-, con- + y-)
- ll (terminal; Spanish exceptions are loanwords: detall, nomparell)

Approach 2: googletrans library

You could also use the googletrans library to detect the language:

from googletrans import Translator

translator = Translator()
for text in sample_texts:
    lang = translator.detect(text).lang
    print(lang, ":", text)

Output:

es : El año que es abundante de poesía, suele serlo de hambre.
ca : Cal no abandonar mai ni la tasca ni l'esperança.
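The character test and the exclusive-word lists from Approach 1 combine naturally into a single scoring function. A minimal sketch (the marker lists here are abbreviated stand-ins for the fuller tables above):

# abbreviated marker lists; extend with the characters, digraphs, and words from the tables above
esp_markers = ["ñ", "á", " y ", " con ", " él "]
cat_markers = ["ç", "l·l", " amb ", " i ", " seva "]

def guess_language(text):
    t = " {} ".format(text.lower())  # pad so word markers can match at sentence edges
    esp = sum(t.count(m) for m in esp_markers)
    cat = sum(t.count(m) for m in cat_markers)
    if esp > cat:
        return "es"
    if cat > esp:
        return "ca"
    return "unknown"

print(guess_language("Cal no abandonar mai ni la tasca ni l'esperança."))  # ca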
# Frequent words per language; a token scores for a language when it appears in that language's list
DicCat = ['amb','cap','dalt','damunt','des','dintre','durant','excepte','fins','per','pro','sense','sota','llei','hi','ha','més','mes','moment','órgans','segóns','Article','i','per','els','amb','és','com','dels','més','seu','seva','fou','també','però','als','després','aquest','fins','any','són','hi','pel','aquesta','durant','on','part','altres','anys','ciutat','cap','des','seus','tot','estat','qual','segle','quan','ja','havia','molt','rei','nom','fer','així','li','sant','encara','pels','seves','té','partit','està','mateix','pot','nord','temps','fill','només','dues','sota','lloc','això','alguns','govern','uns','aquests','mort','nou','tots','fet','sense','frança','grup','tant','terme','fa','tenir','segons','món','regne','exèrcit','segona','abans','mentre','quals','aquestes','família','catalunya','eren','poden','diferents','nova','molts','església','major','club','estats','seua','diversos','grans','què','arribar','troba','població','poble','foren','època','haver','eleccions','diverses','tipus','riu','dia','quatre','poc','regió','exemple','batalla','altre','espanya','joan','actualment','tenen','dins','llavors','centre','algunes','important','altra','terra','antic','tenia','obres','estava','pare','qui','ara','havien','començar','història','morir','majoria']
DicEsp = ['los','y','bajo','con','entre','hacia','hasta','para','por','según','segun','sin','tras','más','mas','ley','capítulo','capitulo','título','titulo','momento','y','las','por','con','su','para','lo','como','más','pero','sus','le','me','sin','este','ya','cuando','todo','esta','son','también','fue','había','muy','años','hasta','desde','está','mi','porque','qué','sólo','yo','hay','vez','puede','todos','así','nos','ni','parte','tiene','él','uno','donde','bien','tiempo','mismo','ese','ahora','otro','después','te','otros','aunque','esa','eso','hace','otra','gobierno','tan','durante','siempre','día','tanto','ella','sí','dijo','sido','según','menos','año','antes','estado','sino','caso','nada','hacer','estaba','poco','estos','presidente','mayor','ante','unos','algo','hacia','casa','ellos','ayer','hecho','mucho','mientras','además','quien','momento','millones','esto','españa','hombre','están','pues','hoy','lugar','madrid','trabajo','otras','mejor','nuevo','decir','algunos','entonces','todas','días','debe','política','cómo','casi','toda','tal','luego','pasado','medio','estas','sea','tenía','nunca','aquí','ver','veces','embargo','partido','personas','grupo','cuenta','pueden','tienen','misma','nueva','cual','fueron','mujer','frente','josé','tras','cosas','fin','ciudad','he','social','tener','será','historia','muchos','juan','tipo','cuatro','dentro','nuestro','punto','dice','ello','cualquier','noche','aún','agua','parece','haber','situación','fuera','bajo','grandes','nuestra','ejemplo','acuerdo','habían','usted','estados','hizo','nadie','países','horas','posible','tarde','ley','importante','desarrollo','proceso','realidad','sentido','lado','mí','tu','cambio','allí','mano','eran','estar','san','número','sociedad','unas','centro','padre','gente','relación','cuerpo','incluso','través','último','madre','mis','modo','problema','cinco','carlos','hombres','información','ojos','muerte','nombre','algunas','público','mujeres','siglo','todavía','meses','mañana','esos','nosotros','hora','muchas','pueblo','alguna','dar','don','da','tú','derecho','verdad','maría','unidos','podría','sería','junto','cabeza','aquel','luis','cuanto','tierra','equipo','segundo','director','dicho','cierto','casos','manos','nivel','podía','familia','largo','falta','llegar','propio','ministro','cosa','primero','seguridad','hemos','mal','trata','algún','tuvo','respecto','semana','varios','real','sé','voz','paso','señor','mil','quienes','proyecto','mercado','mayoría','luz','claro','iba','éste','pesetas','orden','español','buena','quiere','aquella','programa','palabras','internacional','esas','segunda','empresa','puesto','ahí','propia','libro','igual','político','persona','últimos','ellas','total','creo','tengo','dios','española','condiciones','méxico','fuerza','solo','único','acción','amor','policía','puerta','pesar','sabe','calle','interior','tampoco','ningún','vista','campo','buen','hubiera','saber','obras','razón','niños','presencia','tema','dinero','comisión','antonio','servicio','hijo','última','ciento','estoy','hablar','dio','minutos','producción','camino','seis','quién','fondo','dirección','papel','demás','idea','especial','diferentes','dado','base','capital','ambos','europa','libertad','relaciones','espacio','medios','ir','actual','población','empresas','estudio','salud','servicios','haya','principio','siendo','cultura','anterior','alto','media','mediante','primeros','arte','paz','sector','imagen','medida','deben','datos','consejo','personal','interés','julio','grupos','miembros','ninguna','existe','cara','edad','movimiento','visto','llegó','puntos','actividad','bueno','uso','niño','difícil','joven','futuro','aquellos','mes','pronto','soy','hacía','nuevos','nuestros','estaban','posibilidad','sigue','cerca','resultados','educación','atención','gonzález','capacidad','efecto','necesario','valor','aire','investigación','siguiente','figura','central','comunidad','necesidad','serie','organizació','nuevas','calidad']
DicEng = ['all','my','have','do','and','or','what','can','you','the','on','it','at','since','for','ago','before','past','by','next','from','with','wich','law','is','the','of','and','to','in','is','you','that','it','he','was','for','on','are','as','with','his','they','at','be','this','have','from','or','one','had','by','word','but','not','what','all','were','we','when','your','can','said','there','use','an','each','which','she','do','how','their','if','will','up','other','about','out','many','then','them','these','so','some','her','would','make','like','him','into','time','has','look','two','more','write','go','see','number','no','way','could','people','my','than','first','water','been','call','who','oil','its','now','find','long','down','day','did','get','come','made','may','part']

def WhichLanguage(text):
    Input = text.lower().split(" ")
    CatScore = []
    EspScore = []
    EngScore = []
    # collect every token that matches each language's word list
    for e in Input:
        if e in DicCat:
            CatScore.append(e)
        if e in DicEsp:
            EspScore.append(e)
        if e in DicEng:
            EngScore.append(e)
    # English wins only if it beats both; otherwise pick the higher of Catalan/Spanish
    if (len(EngScore) > len(EspScore)) and (len(EngScore) > len(CatScore)):
        Language = 'English'
    else:
        if len(CatScore) > len(EspScore):
            Language = 'Catala'
        else:
            Language = 'Espanyol'
    print(text)
    print("ESP= ", len(EspScore), EspScore)
    print("Cat = ", len(CatScore), CatScore)
    print("ING= ", len(EngScore), EngScore)
    print('Language is =', Language)
    print("-----")
    return Language

print(WhichLanguage("Hola bon dia"))
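One small design note on the answer above: membership tests against those long lists scan linearly for every token, while converting each list to a set once makes the lookups O(1) without changing the results:

# build the sets once, before scoring any text
DicCat, DicEsp, DicEng = set(DicCat), set(DicEsp), set(DicEng)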