Regex named groups in R

For all intents and purposes, I am a Python user and use the Pandas library on a daily basis. Named capture groups in regex are extremely useful: it is relatively trivial to extract occurrences of specific words or phrases and to produce concatenated strings of the results in new columns of a dataframe. An example of how this might be achieved is given below:
import numpy as np
import pandas as pd
import re
myDF = pd.DataFrame(['Here is some text',
                     'We all love TEXT',
                     'Where is the TXT or txt textfile',
                     'Words and words',
                     'Just a few works',
                     'See the text',
                     'both words and text'], columns=['origText'])
print("Original dataframe\n------------------")
print(myDF)
# Define regex to find occurrences of 'text' or 'word' as separate named groups
myRegex = """(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"""
myCompiledRegex = re.compile(myRegex,flags=re.I|re.X)
# Extract all occurrences of 'text' or 'word'
myMatchesDF = myDF['origText'].str.extractall(myCompiledRegex)
print("\nDataframe of matches (with multi-index)\n--------------------")
print(myMatchesDF)
# Collapse resulting multi-index dataframe into single rows with concatenated fields
myConcatMatchesDF = myMatchesDF.groupby(level = 0).agg(lambda x: '///'.join(x.fillna('')))
myConcatMatchesDF = myConcatMatchesDF.replace(to_replace = "^/+|/+$",value = "",regex = True) # Remove '///' at start and end of strings
print("\nCollapsed and concatenated matches\n----------------------------------")
print(myConcatMatchesDF)
myDF = myDF.join(myConcatMatchesDF)
print("\nFinal joined dataframe\n----------------------")
print(myDF)
This produces the following output:
Original dataframe
------------------
                           origText
0                 Here is some text
1                  We all love TEXT
2  Where is the TXT or txt textfile
3                   Words and words
4                  Just a few works
5                      See the text
6               both words and text
Dataframe of matches (with multi-index)
--------------------
        textOcc wordOcc
  match
0 0        text     NaN
1 0        TEXT     NaN
2 0         TXT     NaN
  1         txt     NaN
  2        text     NaN
3 0         NaN    Word
  1         NaN    word
5 0        text     NaN
6 0         NaN    word
  1        text     NaN
Collapsed and concatenated matches
----------------------------------
            textOcc      wordOcc
0              text
1              TEXT
2  TXT///txt///text
3                    Word///word
5              text
6              text         word
Final joined dataframe
----------------------
                           origText           textOcc      wordOcc
0                 Here is some text              text
1                  We all love TEXT              TEXT
2  Where is the TXT or txt textfile  TXT///txt///text
3                   Words and words                    Word///word
4                  Just a few works               NaN          NaN
5                      See the text              text
6               both words and text              text         word
I've printed each stage to try to make it easy to follow.
The question is: can I do something similar in R? I've searched the web but can't find anything that describes the use of named groups (although I'm an R newcomer and so might be searching for the wrong libraries or descriptive terms).
I've been able to identify those items that contain one or more matches but I cannot see how to extract specific matches or how to make use of the named groups. The code I have so far (using the same dataframe and regex as in the Python example above) is:
origText = c('Here is some text','We all love TEXT','Where is the TXT or txt textfile','Words and words','Just a few works','See the text','both words and text')
myDF = data.frame(origText)
myRegex = "(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"
myMatches = grep(myRegex,myDF$origText,perl=TRUE,value=TRUE,ignore.case=TRUE)
myMatches
[1] "Here is some text" "We all love TEXT" "Where is the TXT or txt textfile" "Words and words"
[5] "See the text" "both words and text"
myMatchesRow = grep(myRegex,myDF$origText,perl=TRUE,value=FALSE,ignore.case=TRUE)
myMatchesRow
[1] 1 2 3 4 6 7
The regex seems to be working, and the correct rows are identified as containing a match (i.e. all except row 5 in the above example). However, can I produce output similar to Python's, where the specific matches are extracted and listed in new columns of the dataframe that are named using the group names contained in the regex?

Base R does capture the information about the names, but it doesn't have a good helper to extract them by name, so I wrote a wrapper called regcapturedmatches. You can use it with:
myRegex = "(?<textOcc>t[e]?xt)|(?<wordOcc>word)"
m <- regexpr(myRegex, origText, perl=TRUE, ignore.case=TRUE)
regcapturedmatches(origText, m)
Which returns
textOcc wordOcc
[1,] "text" ""
[2,] "TEXT" ""
[3,] "TXT" ""
[4,] "" "Word"
[5,] "" ""
[6,] "text" ""
[7,] "" "word"

Related

Use regex to extract number before a list of words in pandas dataframe

I want to extract only the numbers before a list of specific words. Then put the extracted numbers in a new column.
The list of words is: l = ["car", "truck", "van"]. I only put the singular form here, but it should also apply to plurals.
df = pd.DataFrame(columns=["description"], data=[["have 3 cars"], ["a 1-car situation"], ["may be 2 trucks"]])
We can call the new column for the extracted numbers df["extracted_num"].
Thank you!
You can use Series.str.extract:
l = ["car", "truck", "van"]
# Raw f-string avoids the invalid "\d" escape warning in a plain string literal
pat = rf"(\d+)[\s-](?:{'|'.join(l)})"
df['extracted_num'] = df['description'].str.extract(pat)
Output:
>>> print(pat)
(\d+)[\s-](?:car|truck|van)
>>> df
         description extracted_num
0        have 3 cars             3
1  a 1-car situation             1
2    may be 2 trucks             2
Explanation:
(\d+) - Matches one or more digits and captures the group;
[\s-] - Matches a single space or hyphen;
(?:{'|'.join(l)}) - Matches any word from the list l without capturing it.
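Note that because the pattern doesn't anchor the end of the word, plurals such as "cars" already match by prefix. If you want to make the plural explicit and avoid prefix false positives (e.g. "vanguard"), a hedged variant might look like this (the trailing s? and word boundary are my additions, not part of the original answer):
import pandas as pd

df = pd.DataFrame(columns=["description"],
                  data=[["have 3 cars"], ["a 1-car situation"], ["may be 2 trucks"]])
l = ["car", "truck", "van"]

# Optional plural "s" plus a word boundary: "2 vans" matches, "2 vanguard" does not
pat = rf"(\d+)[\s-](?:{'|'.join(l)})s?\b"
df["extracted_num"] = df["description"].str.extract(pat)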

Pandas isin() does not return anything even when the keywords exist in the dataframe

I'd like to search for a list of keywords in a text column and select all rows where the exact keywords exist. I know this question has many duplicates, but I can't understand why the solution is not working in my case.
keywords = ['fake', 'false', 'lie']
df1:
                                                                         text
19152                                     I think she is the Corona Virus....
19154  Boy you hate to see that. I mean seeing how it was contained and all.
19155        Tell her it’s just the fake flu, it will go away in a few days.
19235                                                      Is this fake news?
...                                                                       ...
20540                            She’ll believe it’s just alternative facts.
Expected results: I'd like to select rows that have the exact keywords in my list ('fake', 'false', 'lie'). For example, in the above df, it should return rows 19155 and 19235.
str.contains()
df1[df1['text'].str.contains("|".join(keywords))]
The problem with str.contains() is that the result is not limited to the exact keywords. For example, it returns sentences with "believe" (e.g., row 20540), because "lie" is a substring of "believe"!
pandas.Series.isin
To find the rows including the exact keywords, I used pd.Series.isin:
df1[df1.text.isin(keywords)]
#df1[df1['text'].isin(keywords)]
Even though I see there are matches in df1, it doesn't return anything.
import re

# Split each row into whole words and keep rows where any word is an exact keyword
df[df.text.apply(lambda x: any(i for i in re.findall(r'\w+', x) if i in keywords))]
Output:
text
2 Tell her it’s just the fake flu, it will go aw...
3 Is this fake news?
If text is as follows,
df1 = pd.DataFrame()
df1['text'] = [
"Dear Kellyanne, Please seek the help of Paula White I believe ...",
"trump saying it was under controll was a lie, ...",
"Her mouth should hanve been ... All the lies she has told ...",
"she'll believe ...",
"I do believe in ...",
"This value is false ...",
"This value is fake ...",
"This song is fakelove ..."
]
keywords = ['misleading', 'fake', 'false', 'lie']
First, a simple way is this:
df1[df1.text.apply(lambda x: pd.Series(x.split()).isin(keywords).any())]
text
5 This value is false ...
6 This value is fake ...
It won't falsely match words like "believe", but it also can't catch "lie," because of the trailing punctuation.
Second, if we remove the special characters from the text data:
new_text = df1.text.apply(lambda x: re.sub("[^0-9a-zA-Z]+", " ", x))
df1[new_text.apply(lambda x: pd.Series(x.split()).isin(keywords).any())]
Now it can catch the word "lie,".
text
1 trump saying it was under controll was a lie, ...
5 This value is false ...
6 This value is fake ...
Third, it still can't catch the word "lies". That can be solved by using a library that reduces different word forms to a common base form (lemmatization); a minimal sketch follows. You can find how to tokenize here: tokenize-words-in-a-list-of-sentences-python
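For illustration only, a hedged sketch using NLTK's WordNetLemmatizer (NLTK and its wordnet corpus are my assumptions; they are not part of the original answer):
import re
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the lemmatizer data
lemmatizer = WordNetLemmatizer()

def contains_keyword(text, keywords):
    # Strip punctuation, lowercase, and lemmatize each word before matching,
    # so "lies" reduces to "lie" and hits the keyword list
    words = re.findall(r"[a-zA-Z]+", text.lower())
    return any(lemmatizer.lemmatize(w) in keywords for w in words)

df1[df1.text.apply(lambda x: contains_keyword(x, keywords))]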
I think splitting into words and then matching is a better and more straightforward approach, e.g. if the df and keywords are
df = pd.DataFrame({'text': ['lama abc', 'cow def', 'foo bar', 'spam egg']})
keywords = ['foo', 'lama']
df
text
0 lama abc
1 cow def
2 foo bar
3 spam egg
This should return the correct result
df.loc[pd.Series(any(word in keywords for word in words) for words in df['text'].str.findall(r'\w+'))]
text
0 lama abc
2 foo bar
Explanation:
First, split df['text'] into words:
splits = df['text'].str.findall(r'\w+')
splits is
0 [lama, abc]
1 [cow, def]
2 [foo, bar]
3 [spam, egg]
Name: text, dtype: object
Then we need to find out, for each row, whether any of its words appears in the keywords:
# this is answer for a single row, if words is the split list of that row
any(word in keywords for word in words)
# for the entire dataframe, use a Series, `splits` from above is word split lists for every line
rows = pd.Series(any(word in keywords for word in words) for words in splits)
rows
0 True
1 False
2 True
3 False
dtype: bool
Now we can find the correct rows with
df.loc[rows]
text
0 lama abc
2 foo bar
Be aware that this approach could consume much more memory, as it needs to generate the split list for each line. So if you have huge data sets, this might be a problem.
I believe it's because pd.Series.isin() checks whether the entire string equals one of the list values, not whether the string in the column contains a specific word. I just tested this code snippet:
s = pd.Series(['lama abc', 'cow', 'lama', 'beetle', 'lama', 'hippo'], name='animal')
s.isin(['cow', 'lama'])
And as I expected, the first string, even though it contains the word 'lama', returns False.
Maybe try using regex? See this: searching a word in the column pandas dataframe python
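As a hedged sketch of that regex idea (my construction, not from the linked post): wrapping the keywords in word boundaries makes str.contains match whole words only, so "believe" is no longer hit by "lie":
keywords = ['fake', 'false', 'lie']
# \b word boundaries restrict matches to whole words
pattern = r'\b(?:' + '|'.join(keywords) + r')\b'
df1[df1['text'].str.contains(pattern, case=False, regex=True)]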

Pandas: how to replace values of every sentence starting with specific characters?

I have a dataframe like this:
Sentences
0 "a) Sentence 1"
1 "b) Sentence 2"
I would like to remove the "a) " and "b) " at the beginning of every row of the column Sentences.
I tried to code it: when the first three characters of a sentence are 'b) ', I take the [3:] slice of the sentence:
df.loc[df.Names[0:3] == 'b) ', "Names"] = row['Names'][3:]
But it doesn't work.
Expected output:
Sentences
0 "Sentence 1"
1 "Sentence 2"
Using below as sample:
Sentences
0 a) Sentence 1
1 b) Sentence 2
2 This is a test sentence
3 NaN
You can use pd.Series.str.startswith to check for rows starting with a) and b), and then assign directly:
df.loc[df['Sentences'].str.startswith(("a) ","b) "), na=False), "Sentences"] = df['Sentences'].str[3:]
print (df)
Sentences
0 Sentence 1
1 Sentence 2
2 This is a test sentence
3 NaN
Try the str.replace function on the column, like this (the ")" must be escaped, or the regex is invalid):
df["Sentences"] = df["Sentences"].str.replace(r"^.\) ", "", regex=True)
This will return the column that you want, I think.
Reference str.replace():
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html?highlight=str%20replace#pandas.Series.str.replace
str.replace takes a regular expression or string and replaces it with another string. For more info please see the link above.
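For instance, here is a self-contained sketch of that fixed call (my example data; the anchored pattern assumes the prefix is always one character followed by ") "):
import pandas as pd

df = pd.DataFrame({"Sentences": ["a) Sentence 1", "b) Sentence 2"]})

# Anchoring with ^ removes only a leading "x) " prefix; a ")" appearing
# later in the sentence is left untouched
df["Sentences"] = df["Sentences"].str.replace(r"^.\) ", "", regex=True)
print(df)
#     Sentences
# 0  Sentence 1
# 1  Sentence 2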
If every sentence starts with a letter followed by ')', and assuming you don't want to touch additional occurrences of ')' after the first one, this will also work:
df["Sentences"] = df["Sentences"].str.split(r"\) ", expand=True).drop(0, axis=1).to_numpy().sum(axis=1)

Remove mentions and special characters for twitter dataset

I am trying to remove from this dataframe the mentions and special characters such as "!?$...", and especially the character "#", while keeping the text of the hashtag.
Something like this is what I would like to have:
tweet                                        | clean_tweet
---------------------------------------------|-----------------------------------
"This is an example @user2 #Science ! #Tech" | "This is an example Science Tech"
"Hi How are you @user45 #USA"                | "Hi How are you USA"
I am not sure how to iterate and do this on the tweet column of my dataframe.
I tried this for the special characters:
df["clean_tweet"] = df.columns.str.replace('[@,#,&]', '')
But I have this error
ValueError: Length of values (38) does not match length of index (82702)
You are trying to process the column names. Try this instead:
df["clean_tweet"] = df["tweet"].str.replace('[@,#,&]', '', regex=True)
I see you want to remove @user as well, so I used a regex here:
df['clean_tweet'] = df['tweet'].replace(regex=r'(@\w+)|#|&|!', value='')
                                         tweet                        clean_tweet
0  This is an example @user2 #Science ! #Tech  This is an example  Science Tech
1                 Hi How are you @user45 #USA                 Hi How are you USA

Pandas: replace all words in a row with a certain value except for words from a list

I have a dataframe as follows but larger:
df = pd.DataFrame({"text": ["it is two degrees warmer",
                            "it is five degrees warmer today",
                            "it was ten degrees warmer and not cooler",
                            "it is ten degrees cooler",
                            "it is too frosty today",
                            "it is a bit icy and frosty today"]})
allowed_list = ["cooler", "warmer", "frosty", "icy"]
I would like to replace all the words except those in the list with 'O', keeping the result comma-separated like this:
desired output:
text
0 O,O,O,O,warmer
1 O,O,O,O,warmer,O
2 O,O,O,O,warmer,O,O,cooler
3 O,O,O,O,cooler
4 O,O,O,frosty,O
5 O,O,O,O,icy,O,frosty,O
What I have done so far is to split the string rows into lists with str.split(' ') based on whitespace, but I'm not sure how to get rid of the words that are not in the list.
You could use a list comprehension and join back using ',' as the separator. Also, by building a set from allowed_list we get a faster lookup:
allowed_set = set(["cooler", "warmer", "frosty", "icy"])
df['text'] = [','.join([w if w in allowed_set else 'O' for w in s.split()])
              for s in df['text']]
print(df)
                        text
0             O,O,O,O,warmer
1           O,O,O,O,warmer,O
2  O,O,O,O,warmer,O,O,cooler
3             O,O,O,O,cooler
4             O,O,O,frosty,O
5    O,O,O,O,icy,O,frosty,O
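If you prefer to stay inside pandas string methods, a hedged alternative sketch (my construction, not from the answer above) uses a negative lookahead to replace every word that is not in the allowed list, then swaps spaces for commas:
allowed = ["cooler", "warmer", "frosty", "icy"]
# \b(?!(?:...)\b)\w+ matches any whole word that is not one of the allowed words
pattern = r'\b(?!(?:' + '|'.join(allowed) + r')\b)\w+'
df['text'] = (df['text'].str.replace(pattern, 'O', regex=True)
                        .str.replace(' ', ',', regex=False))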
