Using regex to match keywords in DF - python

Good morning all, I am struggling a bit with regex :(
Scenario: I have loaded an Excel file into pandas as a DataFrame so that I can search for keywords across multiple columns.
Data:
Columns include title, scope, description and review. There are 6 keywords I need to search for.
Current approach:
Using numpy.where with str.contains I have found matches, but these include partial matches inside other strings. I need to find only whole words. The code below runs, but as I said it will also match substrings such as 'booking' or 'training'. I need a way to match only 'book' or 'train'.
keywords = ['book', 'train', 'job']
df["NewValue"] = np.where((df['title'].str.contains('|'.join(keywords))) |
                          (df['scope'].str.contains('|'.join(keywords))) |
                          (df['description'].str.contains('|'.join(keywords))) |
                          (df['review'].str.contains('|'.join(keywords))), 1, 0)

You can use the word boundary \b in a raw string. Note that the alternation has to be wrapped in a non-capturing group; otherwise \b applies only to the first and last keyword, and 'training' would still match via the bare train alternative:
keywords = ['book', 'train', 'job']
cols = ['title', 'scope', 'description', 'review']
pattern = r'\b(?:' + '|'.join(keywords) + r')\b'
m = df[cols].apply(lambda col: col.str.contains(pattern)).any(axis=1)
df['NewValue'] = np.where(m, 1, 0)
# or
df['NewValue'] = m.astype(int)
To extract the matched word, you can use:
out = (pd.concat([df[col].str.extract(r'\b(' + '|'.join(keywords) + r')\b')
                  for col in cols], axis=1)
         .groupby(level=0, axis=1).first())
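As a quick end-to-end check, here is a minimal sketch with made-up rows (note that 'booking' and 'training' must not count as hits):
import numpy as np
import pandas as pd
keywords = ['book', 'train', 'job']
cols = ['title', 'scope', 'description', 'review']
df = pd.DataFrame({
    'title': ['booking system', 'train schedule'],
    'scope': ['training plan', 'job listing'],
    'description': ['a book review', 'misc'],
    'review': ['ok', 'ok'],
})
pattern = r'\b(?:' + '|'.join(keywords) + r')\b'
m = df[cols].apply(lambda col: col.str.contains(pattern)).any(axis=1)
df['NewValue'] = m.astype(int)
print(df['NewValue'].tolist())  # [1, 1] -- via 'a book review' and 'train schedule', not 'booking'/'training'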


Filter list with regex multiple values from a JSON file

I am trying to get a list of filtered items using regex. I am trying to extract specific location codes from the results. I am able to read the results from a JSON file, but I am stuck on how to use multiple regex values to filter them.
This is how far I am:
import json
import re
file_path = './response.json'
result = []
with open(file_path) as f:
    data = json.loads(f.read())
    for d in data:
        result.append(d['location_code'])
result = list(dict.fromkeys(result))
re_list = ['.*dk*', '.*se*', '.*fi*', '.*no*']
matches = []
for r in re_list:
    matches += re.findall(r, result)
# r = re.compile('.*denmark*', '', '', '')
# filtered_list = list(filter(r.match, result))
print(matches)
Output from the first JSON pass. I need to filter on country initials like dk, no, lv, fi, ee etc. and keep only the entries that include those specific country codes.
[
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|74',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|87',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.nl.na.amsterdam|firefox|28',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|74',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|87',
...
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.de.na.frankfurt.amazon|chromium|87'
]
Would appreciate any help. Thanks!
In that case, here is a way that could work: set up multiple capture fields.
For the first pattern you could use:
"2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b\|([^"]+)"
or
"2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b\|.*"
or, for text:
.*?text"\s?:\s?"([\w\s]+)
for names:
.*?name"\s?:\s?"([\w\s]+)
Let me know if you are able to get it working.
This looks like a case where regex won't be the best tool; for example, .*fi.* will match sofia, which is probably not wanted. Even if we insist on periods before and after, all of the example rows contain .na., but presumably shouldn't match a search for Namibia.
A better approach is to parse the string more carefully, using one or more of (a) the csv module (if the fields can contain quoting and escaping), (b) the split method, and/or (c) regular expressions, to retrieve the country code from each row. Once we have the country code, we can compare it explicitly.
For example, using the split method:
DATA = [
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|74',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|87',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.nl.na.amsterdam|firefox|28',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|74',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|87',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.de.na.frankfurt.amazon|chromium|87'
]
COUNTRIES = ['dk', 'se', 'fi', 'no']

def extract_country(row):
    geo = row.split('|')[1]
    country = geo.split('.')[2]
    return country

filtered = [
    row for row in DATA
    if extract_country(row) in COUNTRIES
]
print(filtered)
or, if you prefer one-liners, you can skip the extract_country function:
filtered = [
    row for row in DATA
    if row.split('|')[1].split('.')[2] in COUNTRIES
]
Both of these split the row on | and take the second column to get the geographical area, then split the geo area on . and take the third item, which appears to be the country code. If you have documentation for your data source, you can check whether this actually holds.
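If the fields can contain quoting, option (a), the csv module, handles that for you; here is a minimal sketch under that assumption:
import csv

# csv.reader accepts any iterable of strings, so we can feed it DATA directly
reader = csv.reader(DATA, delimiter='|')
filtered = [
    '|'.join(fields) for fields in reader
    if fields[1].split('.')[2] in COUNTRIES
]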
One additional check might be to verify that the extracted country code has exactly two letters, as a partial check for irregularities in the data:
import re

def extract_country(row):
    geo = row.split('|')[1]
    country = geo.split('.')[2]
    if not re.match('^[a-z]{2}$', country):
        raise ValueError(
            'Expected a two-letter country code, got "%s" in row %s'
            % (country, row)
        )
    return country
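And a sketch of option (c): pull the code out with a single anchored regex instead of two splits (this assumes the geo field is always the second pipe-separated column, with the country code third in its dotted path):
COUNTRY_RE = re.compile(r'^[^|]*\|[^.|]+\.[^.|]+\.([a-z]{2})\.')

def extract_country(row):
    m = COUNTRY_RE.match(row)
    if not m:
        raise ValueError('Unexpected row format: %s' % row)
    return m.group(1)

filtered = [row for row in DATA if extract_country(row) in COUNTRIES]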

Python dataframe : strip part of string, on each column row, if it is in specific format [duplicate]

I have read some pricing data into a pandas DataFrame; the values appear as:
$40,000*
$40000 conditions attached
I want to strip them down to just the numeric values.
I know I can loop through and apply the regex
[0-9]+
to each field and then join the resulting list back together, but is there a non-loopy way?
Thanks
You could use Series.str.replace:
import pandas as pd
df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
# P
# 0 $40,000*
# 1 $40000 conditions attached
df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)
yields
P
0 40000
1 40000
since \D matches any character that is not a decimal digit.
You could use pandas' replace method; you may also want to keep the thousands separator ',' and the decimal separator '.':
import pandas as pd
df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
df['pricing'].replace(to_replace=r"\$([0-9,.]+).*", value=r"\1", regex=True, inplace=True)
print(df)
pricing
0 40,000.32
1 40000
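If you then need the kept values as actual numbers, a small follow-up sketch (assuming ',' only ever appears as a thousands separator):
df['pricing'] = df['pricing'].str.replace(',', '', regex=False).astype(float)
print(df)
#     pricing
# 0  40000.32
# 1  40000.00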
You could remove all the non-digits using re.sub():
value = re.sub(r"[^0-9]+", "", value)
regex101 demo
You don't need regex for this, though note that convert_objects has since been removed from pandas; the modern equivalent is:
df['col'] = pd.to_numeric(df['col'], errors='coerce')
This yields NaN for values such as '$40,000*' that are not already clean numbers, so you may still want one of the stripping approaches above first.
In case anyone is still reading this: I was working on a similar problem and needed to replace an entire column of pandas data using a regex I had figured out with re.sub.
To apply this to the entire column, here's the code.
# add_map holds the replacement rules for the strings in the pd df
add_map = dict([
    ("AV", "Avenue"),
    ("BV", "Boulevard"),
    ("BP", "Bypass"),
    ("BY", "Bypass"),
    ("CL", "Circle"),
    ("DR", "Drive"),
    ("LA", "Lane"),
    ("PY", "Parkway"),
    ("RD", "Road"),
    ("ST", "Street"),
    ("WY", "Way"),
    ("TR", "Trail"),
])
obj = data_909['Address'].copy()  # data_909['Address'] contains the original addresses
for k, v in add_map.items():  # based on the rules in the dict
    rule1 = r"(\b)(%s)(\b)" % k  # replace k only when it stands alone (hence \b)
    rule2 = lambda m: add_map.get(m.group(), m.group())  # look the matched text up in the dict, falling back to the match itself
    obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE)  # flags here to avoid the dictionary iteration problem
data_909['Address_n'] = obj  # store it!
Hope this helps anyone searching for the problem I had. Cheers
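A single-pass variant of the same idea (a sketch, assuming the add_map and data_909 from above): build one alternation over all the abbreviations and pass a callable replacement, so the column is scanned once instead of once per key:
import re

pattern = r'\b(%s)\b' % '|'.join(map(re.escape, add_map))
data_909['Address_n'] = data_909['Address'].str.replace(
    pattern,
    lambda m: add_map.get(m.group(1).upper(), m.group(1)),  # upper() because of IGNORECASE
    regex=True,
    flags=re.IGNORECASE,
)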

Pandas DataFrame - Extract string between two strings and include the first delimiter

I have the following strings in a column of a dataframe:
"LOCATION: FILE-ABC.txt"
"DRAFT-1-FILENAME-ADBCD.txt"
And I want to extract everything between the word FILE and the '.', including the first delimiter. Basically I am trying to return the following result:
"FILE-ABC"
"FILENAME-ABCD"
For that I am using the script below:
df['field'] = df.string_value.str.extract('FILE/(.w+)')
But I am not able to return the desired information (I always get NaN).
How can I do this?
You can accomplish this entirely within the regex, without having to use string slicing:
df['field'] = df.string_value.str.extract(r'(FILE.*(?=\.txt))')
FILE is what we begin the match on
.* grabs any number of characters
(?=...) is a lookahead assertion that matches without consuming
Handy regex tool https://pythex.org/
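A minimal check of that pattern against the question's two sample strings:
import pandas as pd

df = pd.DataFrame({'string_value': ['LOCATION: FILE-ABC.txt',
                                    'DRAFT-1-FILENAME-ADBCD.txt']})
df['field'] = df['string_value'].str.extract(r'(FILE.*(?=\.txt))')
print(df['field'].tolist())  # ['FILE-ABC', 'FILENAME-ADBCD']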
If the strings will always end in .txt then you can try with the following:
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Example:
import pandas as pd
text = ["LOCATION: FILE-ABC.txt","DRAFT-1-FILENAME-ADBCD.txt"]
data = {'index':[0,1],'string_value':text}
df = pd.DataFrame(data)
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Output:
index string_value field
0 0 LOCATION: FILE-ABC.txt FILE-ABC
1 1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD
You can make a capturing group that captures from (including) 'FILE' greedily to the last period. Or you can make it not greedy so it stops at the first . after FILE.
import pandas as pd
df = pd.DataFrame({'string_value': ["LOCATION: FILE-ABC.txt", "DRAFT-1-FILENAME-ADBCD.txt",
"BADFILENAME.foo.txt"]})
df['field_greedy'] = df['string_value'].str.extract(r'(FILE.*)\.')
df['field_not_greedy'] = df['string_value'].str.extract(r'(FILE.*?)\.')
print(df)
string_value field_greedy field_not_greedy
0 LOCATION: FILE-ABC.txt FILE-ABC FILE-ABC
1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD FILENAME-ADBCD
2 BADFILENAME.foo.txt FILENAME.foo FILENAME

Group .txt data into dataframe

I have a .txt file with data like such:
[12.06.17, 13:18:36] Name1: Test test test
[12.06.17, 13:20:20] Name2 ❤️: blabla
[12.06.17, 13:20:44] Name2 ❤️: words words words
words
words
words
[12.06.17, 13:29:03] Name1: more words more words
[12.06.17, 13:38:52] Name3 Surname Nickname: 👍🏼
[12.06.17, 13:40:37] Name1: message?
Note that there can be multiple names before the message, and multiline messages can also occur. I have tried many things over the last few days to split the data into the groups 'date', 'time', 'name' and 'message'.
I was able to figure out that the regex
(.)(\d+\.\d+\.\d+)(,)(\s)(\d+:\d+:\d+)(.)(\s)([^:]+)(:)
is able to capture everything up to the message (cf. https://regex101.com/r/hQlgeM/3). But I cannot figure out how to add the message so that multiline messages are grouped with the previous message.
Lastly: if I am able to capture each group from the .txt with regex, how do I actually pass each group into a separate column? I've been looking at examples for the last three days, but I still cannot figure out how to construct this dataframe.
Code that I tried to work with:
df = pd.read_csv('chat.txt', names = ['raw'])
data = df.iloc[:,0]
re.match(r'\[([^]]+)\] ([^:]+):(.*)', data)
Another try that did not work:
input_file = open("chat.txt", "r", encoding='utf-8')
content = input_file.read()
df = pd.DataFrame(content, columns = ['raw'])
df['date'] = df['raw'].str.extract(r'^(.)(\d+\.\d+\.\d+)', expand=True)
df['time'] = df['raw'].str.extract(r'(\s)(\d+:\d+:\d+)', expand=True)
df['name'] = df['raw'].str.extract(r'(\s)([^:]+)(:)', expand=True)
df['message'] = df['raw'].str.extract(r'^(.)(?<=:).*$', expand=True)
df
A complete solution will look like:
import pandas as pd
import io, re
file_path = 'chat.txt'
rx = re.compile(r'\[(?P<date>\d+(?:\.\d+){2}),\s(?P<time>\d+(?::\d+){2})]\s(?P<name>[^:]+):(?P<message>.*)')
col_list = []
date = time = name = message = ''
with io.open(file_path, "r", encoding="utf-8", newline="\n") as sr:
    for line in sr:
        m = rx.match(line)
        if m:
            if date:  # flush the previous record before starting a new one
                col_list.append([date, time, name, message])
            date = m.group("date")
            time = m.group("time")
            name = m.group("name")
            message = m.group("message")
        elif line:
            message += line  # continuation of a multiline message
if date:  # flush the final record
    col_list.append([date, time, name, message])
df = pd.DataFrame(col_list, columns=['date', 'time', 'name', 'message'])
Pattern details
\[ - a [ char
(?P<date>\d+(?:\.\d+){2}) - Group "date": 1+ digits, then two repetitions of a . followed by 1+ digits
,\s - , and a whitespace
(?P<time>\d+(?::\d+){2}) - Group "time": 1+ digits, then two repetitions of a : followed by 1+ digits
]\s - ] and a whitespace
(?P<name>[^:]+) - Group "name": one or more chars other than :
: - a colon
(?P<message>.*) - Group "message": any 0+ chars, as many as possible, up to the end of line.
Then, the logic is as follows:
A line is read in and tested against the pattern
If there is a match, the previous record (if one exists) is flushed to col_list, and the four variables - date, time, name and message - are re-initialized from the match
If the line does not match the pattern, it is considered part of a multiline message and is appended to the message variable; after the loop, the final record is flushed
Here is the solution that I found works in my case. The problem was that I was using read_csv() on what is plain .txt data. I also needed regex to build my records before passing them into pandas:
import re
import pandas as pd
chat = open('chat.txt', encoding='utf-8').read()
pattern = r'(?s)\[(?P<date>\d+(?:\.\d+){2}),\s(?P<time>\d+(?::\d+){2})]\s(?P<name>[^:]+):(?P<message>.*?)(?=\[\d+\.\d+\.\d+,\s\d+:\d+:\d+]|\Z)'
df = pd.DataFrame(re.findall(pattern, chat), columns=['date', 'time', 'name', 'message'])
print(df.tail())
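For a quick sanity check, here is the same findall approach with a chat snippet inlined (a minimal sketch; the emoji names are omitted for brevity):
import re
import pandas as pd

chat = ('[12.06.17, 13:18:36] Name1: Test test test\n'
        '[12.06.17, 13:20:44] Name2: words words words\nwords\nwords\n'
        '[12.06.17, 13:29:03] Name1: more words more words\n')
pattern = r'(?s)\[(?P<date>\d+(?:\.\d+){2}),\s(?P<time>\d+(?::\d+){2})]\s(?P<name>[^:]+):(?P<message>.*?)(?=\[\d+\.\d+\.\d+,\s\d+:\d+:\d+]|\Z)'
df = pd.DataFrame(re.findall(pattern, chat), columns=['date', 'time', 'name', 'message'])
print(df)  # three rows; the multiline message stays attached to its header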

Check if string is in pandas Dataframe column, and create new Dataframe

I am trying to check if a string is in a Pandas column. I tried doing it two ways but they both seem to check for a substring.
itemName = "eco drum ecommerce"
words = itemName.split(" ")
df.columns = ['key','word','umbrella', 'freq']
df = df.dropna()
df = df.loc[df['word'].isin(words)]
I also tried this way, but this also checks for substring
words = itemName.split(" ")
words = '|'.join(words)
df.columns = ['key','word','umbrella', 'freq']
df = df.dropna()
df = df.loc[df['word'].str.contains(words, case=False)]
The word was this: "eco drum".
Then I did this:
words = itemName.split(" ")
words = '|'.join(words)
To end up with this:
eco|drum
This is the "word" column: [screenshot omitted]
Thank you. Is it possible to avoid matching substrings this way?
You have the right idea. .contains has the regex pattern match option set to True by default. Therefore all you need to do is add anchors to your regex pattern e.g. "ball" will become "^ball$".
df = pd.DataFrame(columns=['key'])
df["key"] = ["largeball", "ball", "john", "smallball", "Ball"]
print(df.loc[df['key'].str.contains("^ball$", case=False)])
Referring more specifically to your question, since you want to search for multiple words, you will have to create the regex pattern to give to contains.
# Create dataframe
df = pd.DataFrame(columns=['word'])
df["word"] = ["ecommerce", "ecommerce", "ecommerce", "ecommerce", "eco", "drum"]
# Create regex pattern
word = "eco drum"
words = word.split(" ")
words = "|".join("^{}$".format(word) for word in words)
# Find matches in dataframe
print(df.loc[df['word'].str.contains(words, case=False)])
The code words = "|".join("^{}$".format(word) for word in words) is referred to as a generator expression. Given ['eco', 'drum'] it will return this pattern: ^eco$|^drum$.
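As an aside, a hedged alternative (assuming pandas >= 1.1): Series.str.fullmatch anchors the whole string for you, so the pattern can stay a plain alternation:
import pandas as pd

df = pd.DataFrame({'word': ['ecommerce', 'eco', 'drum', 'largeball']})
words = 'eco drum'.split(' ')
print(df.loc[df['word'].str.fullmatch('|'.join(words), case=False)])  # matches 'eco' and 'drum' only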
