Hey guys, I have a quick question about stemming text in a dataframe. I want to iterate through a dataframe of text, stem each word, and then return the dataframe to the form it was originally in (with the stemmed word changes, of course).
I will give an example:
import pandas as pd

dic = {0: ['He was caught eating by the tree'], 1: ['She was playing with her friends']}
dic = pd.DataFrame(dic)
dic = dic.T
dic.rename(columns={0: 'text'}, inplace=True)
When you run these lines, you will get a dataframe of text. I would like a method to iterate over it and stem each word, as I have a dataframe consisting of over 30k such sentences that I would like to stem. Thank you very much.
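One possible approach, not from the original thread: a minimal sketch assuming NLTK's PorterStemmer, which splits each sentence on whitespace, stems every token, and joins the tokens back together so the dataframe keeps its original shape.
import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

dic = {0: ['He was caught eating by the tree'], 1: ['She was playing with her friends']}
df = pd.DataFrame(dic).T.rename(columns={0: 'text'})

# stem every whitespace-separated token, then rebuild each sentence
df['text'] = df['text'].apply(
    lambda sentence: ' '.join(stemmer.stem(word) for word in sentence.split())
)
print(df)
For 30k short sentences this should finish quickly, since each row is only split, stemmed token by token, and rejoined.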
Related
I just started to learn Python. I have a question about matching some of the words in my dataset in Excel.
words_list includes some of the words I would like to find in the dataset.
words_list = ('tried','mobile','abc')
df is the extract from Excel; I picked up a single column.
df =
0 to make it possible or easier for someone to do ...
1 unable to acquire a buffer item very likely ...
2 The organization has tried to make...
3 Broadway tried a variety of mobile Phone for the..
I would like to get the result like this:
'None',
'None',
'tried',
'tried','mobile'
I tried this in Jupyter:
results = []
for word in df:
    if any(aa in word for aa in words_list):
        results.append(word)
    else:
        results.append('None')
print(results)
But the result shows the whole sentence from df:
'None'
'None'
'The organization has tried to make...'
'Broadway tried a variety of mobile Phone for the..'
Can I show only the words from the words list in the result?
Sorry for my English, and thank you all.
I'd suggest a manipulation on the DataFrame (that should always be your first thought; use the power of pandas):
import pandas as pd
words_list = {'tried', 'mobile', 'abc'}
df = pd.DataFrame({'col': ['to make it possible or easier for someone to do',
'unable to acquire a buffer item very likely',
'The organization has tried to make',
'Broadway tried a variety of mobile Phone for the']})
df['matches'] = df['col'].str.split().apply(lambda x: set(x) & words_list)
print(df)
col matches
0 to make it possible or easier for someone to do {}
1 unable to acquire a buffer item very likely {}
2 The organization has tried to make {tried}
3 Broadway tried a variety of mobile Phone for the {mobile, tried}
The reason it's printing the whole line has to do with your:
for word in df:
Your "word" variable is actually taking the whole line. Then it's checking the whole line to see if it contains your search word. If it does find it, then it basically says, "yes, I found ____ in this line, so append the line to your list.
What it sounds like you want to do is first split the line into words, and THEN check.
results = []
found = False
for line in df:
    words = line.split(" ")
    for word in words_list:
        if word in words:
            found = True
            results.append(word)
    # this is just to append "None" if nothing was found in the line
    if found:
        found = False
    else:
        results.append("None")
print(results)
As a side note, you may want to use pprint instead of print when working with lists. It prints lists, dictionaries, etc. in easier-to-read layouts. It's part of Python's standard library, so there is nothing extra to install. Usage would be something like:
from pprint import pprint
dictionary = {'firstkey':'firstval','secondkey':'secondval','thirdkey':'thirdval'}
pprint(dictionary)
I'm working on some kind of NLP task. I compare a dataframe of articles with input words. The main goal is to classify the text according to whether a set of words is found.
I've tried to extract the values of the dictionary, convert them into a list, and then apply stemming to that list. The problem is that later I'll do another process to split and compare according to the keys, so I think it is more practical to work directly on the dictionary.
search = {'Tecnology': ['computer', 'digital', 'sistem'], 'Economy': ['bank', 'money']}

words_list = []
for key in search.keys():
    words_list.append(search[key])

search_values = [val for sublist in words_list for val in sublist]
search_values_stem = [stemmer.stem(word) for word in search_values]
I expect a stemmed dictionary that I can compare directly with the column of stemmed articles.
If I understood your question correctly, you are looking to apply stemming to the values of your dictionary (and not the keys), and in addition the values in your dictionary are all lists of strings.
The following code should do that:
def stemList(l):
    return [stemmer.stem(word) for word in l]

# your initial dictionary is called search (as in your example code)
# the following creates a new dictionary where stemming has been applied to the values
stemmedSearch = {}
for key in search:
    stemmedSearch[key] = stemList(search[key])
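As a side note, the same transformation can be written as a single dict comprehension:
# equivalent one-liner: keep the keys, stem the value lists
stemmedSearch = {key: stemList(values) for key, values in search.items()}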
I'm working with text data that is handwritten, so it has lots of orthographic errors. I'm currently working with pyspellchecker to clean the data, and I'm using the correction() method to find the most likely word when a word doesn't exist. My approach was to create a dictionary with all poorly written words as keys and the most likely word as the value:
dic = {}
for i in df.text:
    misspelled = spell.unknown(i.split())
    for word in misspelled:
        dic[word] = spell.correction(word)
Even though this is working, it is doing so very slowly. Thus, I wanted to know if there's a faster option to implement this. Do you have any ideas?
Edit: there are 10,571 rows in df.text and strings are usually 5-15 words long. Each iteration takes around 3-5 seconds, which makes for a total of around 40,000 seconds to run the whole loop.
If all you want to do is create a mapping from the misspelled words you encountered to their suggested corrections, you can reduce the size of the dataset by removing duplicate words. This will minimize the number of calls to spell.unknown and spell.correction, as well as prevent unneeded updates to the contents of the dictionary.
uniquewords = set().union(*(sentence.split() for sentence in df.text))
corrections = {word: spell.correction(word) for word in spell.unknown(uniquewords)}
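A hypothetical follow-up, not part of the original answer: once the corrections mapping exists, it can be applied back to the text column, for example:
# hypothetical helper: replace each known-misspelled word, leave everything else untouched
def apply_corrections(sentence):
    return ' '.join(corrections.get(word, word) for word in sentence.split())

df['corrected'] = df.text.apply(apply_corrections)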
You could try pandas' apply instead of doing a loop:
import pandas as pd
from spellchecker import SpellChecker

eng = pd.Series(['EmpName', 'EMP_NAME', 'EMP.NAME', 'EMPName', 'CUSTOMIR', 'TIER187CAST', 'MultipleTIMESTAMPinTABLE', 'USD$'])
eng = eng.str.lower()
eng = eng.str.split()

spell = SpellChecker()

def msp(x):
    # return the subset of tokens the spell checker does not recognize
    return spell.unknown(x)

eng.apply(msp)
I have a large pandas dataframe. A column contains text broken down into sentences, one sentence per row. I need to check the sentences for the presence of terms used in various ontologies. Some of the ontologies are fairly large and have more than 100,000 entries. In addition, some of the ontologies contain molecule names with hyphens, commas, and other characters that may or may not be present in the text to be examined; hence the need for regular expressions.
I came up with the code below, but it's not fast enough to deal with my data. Any suggestions are welcome.
Thank you!
import pandas as pd
import re
sentences = ["""There is no point in driving yourself mad trying to stop
yourself going mad""",
"The ships hung in the sky in much the same way that bricks don’t"]
sentence_number = list(range(0, len(sentences)))
d = {'sentence' : sentences, 'number' : sentence_number}
df = pd.DataFrame(d)
regexes = [r'\bt\w+', r'\bs\w+']
big_regex = '|'.join(regexes)
compiled_regex = re.compile(big_regex, re.I)
df['found_regexes'] = df.sentence.str.findall(compiled_regex)
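One hedged tweak, not from the original post: wrapping each pattern in a non-capturing group before joining guards against two surprises as the individual patterns get more complex, namely top-level alternation precedence and findall returning group contents instead of whole matches when a pattern contains a capturing group:
# wrap each pattern in (?:...) so '|' precedence and capturing groups
# inside individual patterns cannot change what findall returns
big_regex = '|'.join('(?:{})'.format(r) for r in regexes)
compiled_regex = re.compile(big_regex, re.I)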
Hi, I'm trying this code in NLTK 3.
Somehow I managed to fix line 6 to work with version 3 of NLTK, but the for loop still doesn't return anything at all.
import nltk

sample = """ some random text content with names and countries etc"""

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.chunk.ne_chunk_sents(tagged_sentences)  # managed to fix this to work with version 3

for i in chunked_sentences:
    if hasattr(i, 'label'):
        if i.label() == 'NE':
            print(i)
Also, if I try to debug, I see this output:
for i in chunked_sentences:
    if hasattr(i, 'label') and i.label():
        print(i.label())
S
S
S
S
S
S
S
S
Then how do I check it for "NE"? There's something wrong with NLTK 3 that I'm really not able to figure out. Please help.
It seems you are iterating over sentences. I assume you want to iterate over the individual nodes contained in each sentence.
It should work like this:
for sentence in chunked_sentences:
    for token in sentence:
        if hasattr(token, 'label') and token.label() == 'NE':
            print(token)
Edit: For future reference, what tipped me off to the fact that you are iterating over sentences is simply that the root node for a sentence is commonly labeled 'S'.
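A further note, not part of the original answer: as far as I recall, NLTK 3's named-entity chunker only produces the generic 'NE' label when run in binary mode; otherwise subtrees carry specific labels such as 'PERSON' or 'ORGANIZATION', which is why a check for 'NE' finds nothing. A minimal sketch under that assumption (the sample text is hypothetical):
import nltk

sample = "Barack Obama visited Berlin last year."  # hypothetical sample text
tagged = [nltk.pos_tag(nltk.word_tokenize(s)) for s in nltk.sent_tokenize(sample)]

# binary=True collapses every entity type into the single label 'NE'
for tree in nltk.chunk.ne_chunk_sents(tagged, binary=True):
    for subtree in tree.subtrees():
        if subtree.label() == 'NE':
            print(subtree)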