Pandas: how to replace values of every sentence starting with a specific char? - python

I have a dataframe like this:
Sentences
0 "a) Sentence 1"
1 "b) Sentence 2"
I would like to remove "a) " and "b) " from the beginning of every row of the column Sentences.
I tried to code it: when the first three characters of a sentence are 'b) ', I take the [3:] slice of the sentence:
df.loc[df.Names[0:3] == 'b) ', "Names"] = row['Names'][3:]
But it doesn't work.
Expected output:
Sentences
0 "Sentence 1"
1 "Sentence 2"

Using the below as a sample:
Sentences
0 a) Sentence 1
1 b) Sentence 2
2 This is a test sentence
3 NaN
You can use pd.Series.str.startswith to check for rows starting with a) and b), and then assign directly:
df.loc[df['Sentences'].str.startswith(("a) ","b) "), na=False), "Sentences"] = df['Sentences'].str[3:]
print (df)
Sentences
0 Sentence 1
1 Sentence 2
2 This is a test sentence
3 NaN
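If your pandas version has Series.str.removeprefix (added in pandas 1.4, if I remember right), chaining it per prefix is another option; a minimal sketch:
import pandas as pd

df = pd.DataFrame({"Sentences": ["a) Sentence 1", "b) Sentence 2"]})
# removeprefix leaves a string untouched when the prefix is absent
df["Sentences"] = df["Sentences"].str.removeprefix("a) ").str.removeprefix("b) ")
print(df)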

Try the str.replace method on the column, like this:
df['Sentences'].str.replace(r"^.\) ", "", regex=True)
This will return the Series that you want, I think.
Reference str.replace():
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html?highlight=str%20replace#pandas.Series.str.replace
str.replace takes a regular expression or string and replaces it with another string. For more info please see the link above.
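For example, a runnable sketch on the sample above, anchoring the pattern to the start of the string (recent pandas needs regex=True to treat the pattern as a regex):
import pandas as pd

df = pd.DataFrame({"Sentences": ["a) Sentence 1", "b) Sentence 2"]})
# ^ anchors at the start, . matches the single prefix letter, \) the literal parenthesis
df["Sentences"] = df["Sentences"].str.replace(r"^.\) ", "", regex=True)
print(df)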

If every sentence is going to start with a letter followed by a ')', and assuming you don't want to replace additional occurrences of ')' after the first one, this will work:
df["Sentences"] = df["Sentences"].str.split("\) ", expand=True).drop(0, axis=1).to_numpy().sum(axis=1)

Related

Classify dataframe rows according to string occurrences from a list

With the following dataframe:
Sentence
0 This is an example of sentence
1 This is another example
2 This is a different example
3 A sentence is a bag of words
4 Random words
And the following list:
['sentence', 'another', 'words']
What is the most efficient way to summarize the occurrence of each word from the list in each row of the column 'Sentence'? I'm looking for the following result:
Sentence word_occurence
0 This is an example of sentence sentence
1 This is another example another
2 This is a different example
3 A sentence is a bag of words [sentence, words]
4 Random words words
Thanks in advance!
You can do it using the apply function as well (note the column is Sentence, and the word list needs to be in scope):
import numpy as np

w = ['sentence', 'another', 'words']
df = df.assign(word_occurence=lambda x: x['Sentence'].apply(lambda s: np.array([witem for witem in w if witem in s])))
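A vectorized alternative sketch using Series.str.findall (my suggestion; assumes the same df and word list w as above):
# findall collects every listed word occurring in each row, as a list
df['word_occurence'] = df['Sentence'].str.findall('|'.join(w))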

How to split sentences in a list?

I am trying to create a function to count the number of words and the mean length of words in any given sentence or sentences. I can't seem to split the string into separate sentences to be put into a list, assuming each sentence ends with a period.
Question marks and exclamation marks should be replaced by periods to be recognized as a new sentence in the list.
For example: "Haven't you eaten 8 oranges today? I don't know if you did." would be: ["Haven't you eaten 8 oranges today", "I don't know if you did"]
The mean length for this example would be 44/12 ≈ 3.67
import string

def word_length_list(text):
    text = text.replace('--', ' ')
    for p in string.punctuation + "‘’”“":
        text = text.replace(p, '')
    text = text.lower()
    words = text.split(".")
    word_length = []
    print(words)
    for i in words:
        count = 0
        for j in i:
            count = count + 1
        word_length.append(count)
    return word_length

testing1 = word_length_list("Haven't you eaten 8 oranges today? I don't know if you did.")
print(sum(testing1)/len(testing1))
One option might use re.split:
inp = "Haven't you eaten 8 oranges today? I don't know if you did."
sentences = re.split(r'(?<=[?.!])\s+', inp)
print(sentences)
This prints:
["Haven't you eaten 8 oranges today?", "I don't know if you did."]
We could also use re.findall:
inp = "Haven't you eaten 8 oranges today? I don't know if you did."
sentences = re.findall(r'.*?[?!.]', inp)
print(sentences) # prints same as above
Note that in both cases we are assuming that the period . only appears as a full stop, and not as part of an abbreviation. If the period can have multiple contexts, then it can be tricky to tease sentences apart. For example:
Jon L. Skeet earned more points than anyone. Gordon Linoff also earned a lot of points.
It is not clear here whether each period marks the end of a sentence or is part of an abbreviation.
An example of splitting using a regex:
import re

s = "Hello! How are you?"
print([x for x in re.split(r"[.?!]+", s.strip()) if x != ''])
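To then get the mean word length the question asks about, a minimal sketch of my own, building on the split above:
import re

def mean_word_length(text):
    # Drop everything except word characters and whitespace, then split into words
    words = re.sub(r"[^\w\s]", "", text).split()
    return sum(len(w) for w in words) / len(words)

# 44 characters across 12 words -> 44/12 ≈ 3.67
print(mean_word_length("Haven't you eaten 8 oranges today? I don't know if you did."))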

Pandas: replace all words in a row with a certain value except for words from a list

I have a dataframe as follows but larger:
df = {"text": ["it is two degrees warmer", "it is five degrees warmer today", "it was ten degrees warmer and not cooler", "it is ten degrees cooler", "it is too frosty today", "it is a bit icy and frosty today" ]}
allowed_list= ["cooler", "warmer", "frosty", "icy"]
I would like to replace all the words except for the words in the list with 'O', while keeping it comma separated like this:
desired output:
text
0 O,O,O,O,warmer
1 O,O,O,O,warmer,O
2 O,O,O,O,warmer,O,O,cooler
3 O,O,O,O,cooler
4 O,O,O,frosty,O
5 O,O,O,O,icy,O,frosty,O
What I have done so far is split the string rows into lists with str.split(' ') based on whitespace, but I'm not sure how to get rid of the words that are not in the list.
You could use a list comprehension and join back with ',' as the separator. Also, by building a set from allowed_list we get faster lookups:
allowed_set = {"cooler", "warmer", "frosty", "icy"}
df['text'] = [','.join([w if w in allowed_set else 'O' for w in s.split()])
              for s in df['text']]
print(df)
text
0 O,O,O,O,warmer
1 O,O,O,O,warmer,O
2 O,O,O,O,warmer,O,O,cooler
3 O,O,O,O,cooler
4 O,O,O,frosty,O
5 O,O,O,O,icy,O,frosty,O
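An equivalent sketch using Series.apply, assuming df has been built as a real DataFrame (shortened two-row sample for brevity):
import pandas as pd

df = pd.DataFrame({"text": ["it is two degrees warmer", "it is ten degrees cooler"]})
allowed_set = {"cooler", "warmer", "frosty", "icy"}

# Keep allowed words, mask everything else with 'O', rejoin with commas
df["text"] = df["text"].apply(lambda s: ",".join(w if w in allowed_set else "O" for w in s.split()))
print(df)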

Find the words in a list, then remove the word and any other trailing words in the column

How do I find the words in the list, then remove the found word and any other words after it?
For example:
remove_words = ['stack', 'over', 'flow']
Input:
0 abc test test stack yxz
1 cde test12 over ste
2 def123 flow test123
3 yup over 4562
I would like to find the words from the remove_words list in the pandas dataframe column, and remove those words and any words after them.
Results:
0 abc test test
1 cde test12
2 def123
3 yup
Use str.split with a pattern of all values joined by | (regex OR), and select the first element of each resulting list with str[0]:
remove_words = ['stack', 'over', 'flow']
#for a more general solution with word boundaries (the non-capturing group applies \b to every word)
pat = r'\b(?:{})\b'.format('|'.join(remove_words))
df['col'] = df['col'].str.split(pat, n=1).str[0]
print (df)
col
0 abc test test
1 cde test12
2 def123
3 yup
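A related sketch (my variation, not part of the answer): remove the matched word and everything after it in one pass with str.replace, then trim the trailing space:
# pat is the word-boundary pattern built above; .* consumes the rest of the line
df['col'] = df['col'].str.replace(pat + r'.*', '', regex=True).str.rstrip()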
The first step would be to check whether the input contains any of the words at all; if not, you can just return the entire input:
if not any(word in input for word in remove_words):
    return input
Now for the removing part. I think the best way to do this is to loop through each value in the input array (I am assuming it is an array) and call a replace method.
I have not written much pandas, but the concept should be the same in any language: just loop through all the words and use a replace method with an empty string.
remove_words = ['stack', 'over', 'flow']
inputline = "abc test test stack yxz"
for word in inputline.split(" "):
    if word in remove_words:
        print(inputline[:inputline.index(word)])
This splits the input string into a list, finds the index of the first word that is in the remove_words list, and slices everything from that word onward off the string. You just need a loop to replace the hardcoded string for your whole dataset, as shown in the sketch below.
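For example, wrapping that idea in a function and applying it to a DataFrame column (a sketch; the column name col is an assumption):
import pandas as pd

remove_words = ['stack', 'over', 'flow']
df = pd.DataFrame({'col': ['abc test test stack yxz', 'yup over 4562']})

def cut_at_first_match(line):
    # Slice the line just before the first word that appears in remove_words
    for word in line.split():
        if word in remove_words:
            return line[:line.index(word)].rstrip()
    return line

df['col'] = df['col'].apply(cut_at_first_match)
print(df)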

Regex named groups in R

For all intents and purposes, I am a Python user and use the Pandas library on a daily basis. The named capture groups in regex is extremely useful. So, for example, it is relatively trivial to extract occurrences of specific words or phrases and to produce concatenated strings of the results in new columns of a dataframe. An example of how this might be achieved is given below:
import numpy as np
import pandas as pd
import re
myDF = pd.DataFrame(['Here is some text',
                     'We all love TEXT',
                     'Where is the TXT or txt textfile',
                     'Words and words',
                     'Just a few works',
                     'See the text',
                     'both words and text'], columns=['origText'])
print("Original dataframe\n------------------")
print(myDF)
# Define regex to find occurrences of 'text' or 'word' as separate named groups
myRegex = """(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"""
myCompiledRegex = re.compile(myRegex,flags=re.I|re.X)
# Extract all occurrences of 'text' or 'word'
myMatchesDF = myDF['origText'].str.extractall(myCompiledRegex)
print("\nDataframe of matches (with multi-index)\n--------------------")
print(myMatchesDF)
# Collapse resulting multi-index dataframe into single rows with concatenated fields
myConcatMatchesDF = myMatchesDF.groupby(level = 0).agg(lambda x: '///'.join(x.fillna('')))
myConcatMatchesDF = myConcatMatchesDF.replace(to_replace = "^/+|/+$",value = "",regex = True) # Remove '///' at start and end of strings
print("\nCollapsed and concatenated matches\n----------------------------------")
print(myConcatMatchesDF)
myDF = myDF.join(myConcatMatchesDF)
print("\nFinal joined dataframe\n----------------------")
print(myDF)
This produces the following output:
Original dataframe
------------------
origText
0 Here is some text
1 We all love TEXT
2 Where is the TXT or txt textfile
3 Words and words
4 Just a few works
5 See the text
6 both words and text
Dataframe of matches (with multi-index)
--------------------
textOcc wordOcc
match
0 0 text NaN
1 0 TEXT NaN
2 0 TXT NaN
1 txt NaN
2 text NaN
3 0 NaN Word
1 NaN word
5 0 text NaN
6 0 NaN word
1 text NaN
Collapsed and concatenated matches
----------------------------------
textOcc wordOcc
0 text
1 TEXT
2 TXT///txt///text
3 Word///word
5 text
6 text word
Final joined dataframe
----------------------
origText textOcc wordOcc
0 Here is some text text
1 We all love TEXT TEXT
2 Where is the TXT or txt textfile TXT///txt///text
3 Words and words Word///word
4 Just a few works NaN NaN
5 See the text text
6 both words and text text word
I've printed each stage to try to make it easy to follow.
The question is: can I do something similar in R? I've searched the web but can't find anything that describes the use of named groups (although I'm an R newcomer and so might be searching for the wrong libraries or descriptive terms).
I've been able to identify those items that contain one or more matches but I cannot see how to extract specific matches or how to make use of the named groups. The code I have so far (using the same dataframe and regex as in the Python example above) is:
origText = c('Here is some text','We all love TEXT','Where is the TXT or txt textfile','Words and words','Just a few works','See the text','both words and text')
myDF = data.frame(origText)
myRegex = "(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"
myMatches = grep(myRegex,myDF$origText,perl=TRUE,value=TRUE,ignore.case=TRUE)
myMatches
[1] "Here is some text" "We all love TEXT" "Where is the TXT or txt textfile" "Words and words"
[5] "See the text" "both words and text"
myMatchesRow = grep(myRegex,myDF$origText,perl=TRUE,value=FALSE,ignore.case=TRUE)
myMatchesRow
[1] 1 2 3 4 6 7
The regex seems to be working and the correct rows are identified as containing a match (i.e. all except row 5 in the above example). However, my question is, can I produce an output that is similar to that produced by Python where the specific matches are extracted and listed in new columns in the dataframe that are named using the group names contained in the regex?
Base R does capture the information about the names, but it doesn't have a good helper to extract them by name. I wrote a wrapper called regcapturedmatches to help. You can use it with:
myRegex = "(?<textOcc>t[e]?xt)|(?<wordOcc>word)"
m <- regexpr(myRegex, origText, perl=TRUE, ignore.case=TRUE)
regcapturedmatches(origText,m)
Which returns
textOcc wordOcc
[1,] "text" ""
[2,] "TEXT" ""
[3,] "TXT" ""
[4,] "" "Word"
[5,] "" ""
[6,] "text" ""
[7,] "" "word"
