Check if string is in pandas Dataframe column, and create new Dataframe - python

I am trying to check if a string is in a Pandas column. I tried doing it two ways, but both seem to match substrings.
itemName = "eco drum ecommerce"
words = itemName.split(" ")
df.columns = ['key','word','umbrella', 'freq']
df = df.dropna()
df = df.loc[df['word'].isin(words)]
I also tried the following, but it also matches substrings:
words = itemName.split(" ")
words = '|'.join(words)
df.columns = ['key','word','umbrella', 'freq']
df = df.dropna()
df = df.loc[df['word'].str.contains(words, case=False)]
The search string was "eco drum".
Then I did this:
words = itemName.split(" ")
words = '|'.join(words)
To end up with this:
eco|drum
This is the "word" column:
Thank you. Is it possible to avoid matching substrings this way?

You have the right idea. .contains has the regex pattern match option set to True by default. Therefore all you need to do is add anchors to your regex pattern, e.g. "ball" becomes "^ball$".
import pandas as pd

df = pd.DataFrame(columns=['key'])
df["key"] = ["largeball", "ball", "john", "smallball", "Ball"]
print(df.loc[df['key'].str.contains("^ball$", case=False)])
Referring more specifically to your question, since you want to search for multiple words, you will have to create the regex pattern to give to contains.
import pandas as pd

# Create dataframe
df = pd.DataFrame(columns=['word'])
df["word"] = ["ecommerce", "ecommerce", "ecommerce", "ecommerce", "eco", "drum"]
# Create regex pattern
word = "eco drum"
words = word.split(" ")
words = "|".join("^{}$".format(word) for word in words)
# Find matches in dataframe
print(df.loc[df['word'].str.contains(words, case=False)])
In words = "|".join("^{}$".format(word) for word in words), the argument passed to join is a generator expression. Given ['eco', 'drum'] it produces the pattern ^eco$|^drum$.
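As an alternative (a sketch, assuming pandas >= 1.1), Series.str.fullmatch anchors the whole pattern implicitly, so no ^/$ is needed around each alternative:

```python
import pandas as pd

df = pd.DataFrame({"word": ["ecommerce", "eco", "drum", "Drum"]})
words = "eco drum".split(" ")

# fullmatch requires the entire cell to match the pattern
mask = df["word"].str.fullmatch("|".join(words), case=False)
print(df.loc[mask, "word"].tolist())  # ['eco', 'drum', 'Drum']
```

If case sensitivity is not a concern, plain whole-word membership can also be done without regex at all via df["word"].isin(words).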

Related

Extracting a string from between two strings in a dataframe

I'm trying to extract a value from my data frame.
I have a column ['Desc'] that contains sentences in the following format:
_000it_ZZZ$$$-
_0780it_ZBZT$$$-
_011it_BB$$$-
_000it_CCCC$$$-
I want to extract the string between 'it_' and '$$$'.
I have tried this code, but it does not seem to work:
# initializing substrings
sub1 = "it_"
sub2 = "$$$"
# getting index of substrings
idx1 = df['DESC'].find(sub1)
idx2 = df['DESC'].find(sub2)
# length of substring 1 is added to
# get string from next character
df['results'] = df['DESC'][idx1 + len(sub1) + 1: idx2]
I would appreciate your help
You can use str.extract to get the desired output in your new column.
import pandas as pd
df = pd.DataFrame({
'DESC' : ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-", "_000it_123$$$-"]
})
pat = r"(?<=it_)(.+)(?=[\$]{3}-)"
df['results'] = df['DESC'].str.extract(pat)
print(df)
DESC results
0 _000it_ZZZ$$$- ZZZ
1 _0780it_ZBZT$$$- ZBZT
2 _011it_BB$$$- BB
3 _000it_CCCC$$$- CCCC
4 _000it_123$$$- 123
You can see the regex pattern on Regex101 for more details.
You could try using a regex pattern. It matches the cases you listed here, but I can't guarantee that it will generalize to all possible inputs.
import re
string = "_000it_ZZZ$$$-"
p = re.compile(r"(?<=it_)(.*)(?<!\W)")
m = p.findall(string)
print(m)  # ['ZZZ']
The pattern anchors just after it_ with a lookbehind, captures greedily, and then backtracks until the capture no longer ends in a non-word character, so the trailing $$$- is dropped.
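If the delimiters are fixed literals, a non-regex alternative is to split on them (a sketch; the regex=False keyword needs pandas >= 1.4 and guards against $ being interpreted as a regex anchor):

```python
import pandas as pd

df = pd.DataFrame({"DESC": ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-"]})
# split once on 'it_' and keep the right half,
# then split once on '$$$' and keep the left half
df["results"] = (df["DESC"].str.split("it_", n=1, regex=False).str[1]
                           .str.split("$$$", n=1, regex=False).str[0])
print(df["results"].tolist())  # ['ZZZ', 'ZBZT', 'BB']
```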

Using regex to match keywords in DF

Good morning all I am struggling a bit with regex :(
Scenario: I have loaded an excel file into Pandas as DF to enable me to search for keywords across multiple columns.
Data:
Columns include title, scope, description and review. There are 6 keywords I need to search for.
Current approach:
Using numpy.where with str.contains I have found matches, but these are partial matches within other strings. I need to find only whole words. The code below works, but as I said, it will also identify matches within strings such as 'booking' or 'training'. I need a way to match only 'book' or 'train'.
keywords = ['book','train','job']
df["NewValue"] = np.where((df['title'].str.contains('|'.join(keywords)))
(df['scope'].str.contains('|'.join(keywords))) |
(df['description'].str.contains('|'.join(keywords)))|
(df['review'].str.contains('|'.join(keywords))),1,0)
You can use a word boundary \b in a raw string:
keywords = ['book','train','job']
cols = ['title', 'scope', 'description', 'review']
m = df[cols].apply(lambda col: col.str.contains(r'\b(?:' + '|'.join(keywords) + r')\b')).any(axis=1)  # (?:...) ensures \b applies to every keyword
df['NewValue'] = np.where(m,1,0)
# or
df['NewValue'] = m.astype(int)
To extract matched word, you can use
out = (pd.concat([df[col].str.extract(r'\b(' + '|'.join(keywords) + r')\b')
for col in cols], axis=1)
.groupby(level=0, axis=1).first())
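A minimal, self-contained run of the word-boundary idea (the sample rows here are invented) shows that 'booking' no longer counts as a match for 'book':

```python
import numpy as np
import pandas as pd

keywords = ['book', 'train', 'job']
# the non-capturing group keeps \b applied to every alternative
pat = r'\b(?:' + '|'.join(keywords) + r')\b'

df = pd.DataFrame({'title': ['booking now', 'book now', 'train times', 'nothing'],
                   'scope': ['x', 'y', 'z', 'job done']})
m = df[['title', 'scope']].apply(lambda col: col.str.contains(pat)).any(axis=1)
df['NewValue'] = m.astype(int)
print(df['NewValue'].tolist())  # [0, 1, 1, 1]
```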

Pandas DataFrame - Extract string between two strings and include the first delimiter

I've the following strings in column on a dataframe:
"LOCATION: FILE-ABC.txt"
"DRAFT-1-FILENAME-ADBCD.txt"
And I want to extract everything that is between the word FILE and the ".", but including the first delimiter. Basically I am trying to return the following result:
"FILE-ABC"
"FILENAME-ADBCD"
For that I am using the script below:
df['field'] = df.string_value.str.extract('FILE/(.w+)')
But I am not able to return the desired information (always getting NA).
How can I do this?
You can accomplish this all within the regex without having to use string slicing.
df['field'] = df.string_value.str.extract(r'(FILE.*(?=\.txt))')
FILE is what we begin the match on
.* grabs any number of characters
(?=\.txt) is a lookahead assertion that matches without consuming
Handy regex tool https://pythex.org/
If the strings will always end in .txt, then you can try the following:
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Example:
import pandas as pd
text = ["LOCATION: FILE-ABC.txt","DRAFT-1-FILENAME-ADBCD.txt"]
data = {'index':[0,1],'string_value':text}
df = pd.DataFrame(data)
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Output:
index string_value field
0 0 LOCATION: FILE-ABC.txt FILE-ABC
1 1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD
You can make a capturing group that captures from (including) 'FILE' greedily to the last period. Or you can make it not greedy so it stops at the first . after FILE.
import pandas as pd
df = pd.DataFrame({'string_value': ["LOCATION: FILE-ABC.txt", "DRAFT-1-FILENAME-ADBCD.txt",
                                    "BADFILENAME.foo.txt"]})
df['field_greedy'] = df['string_value'].str.extract(r'(FILE.*)\.')
df['field_not_greedy'] = df['string_value'].str.extract(r'(FILE.*?)\.')
print(df)
string_value field_greedy field_not_greedy
0 LOCATION: FILE-ABC.txt FILE-ABC FILE-ABC
1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD FILENAME-ADBCD
2 BADFILENAME.foo.txt FILENAME.foo FILENAME
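An equivalent way to stop at the first dot, without a lazy quantifier, is a negated character class (a sketch using the question's two sample strings):

```python
import pandas as pd

df = pd.DataFrame({'string_value': ["LOCATION: FILE-ABC.txt", "DRAFT-1-FILENAME-ADBCD.txt"]})
# [^.]* matches any run of characters that contains no dot
df['field'] = df['string_value'].str.extract(r'(FILE[^.]*)', expand=False)
print(df['field'].tolist())  # ['FILE-ABC', 'FILENAME-ADBCD']
```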

Group .txt data into dataframe

I have a .txt file with data like such:
[12.06.17, 13:18:36] Name1: Test test test
[12.06.17, 13:20:20] Name2 ❤️: blabla
[12.06.17, 13:20:44] Name2 ❤️: words words words
words
words
words
[12.06.17, 13:29:03] Name1: more words more words
[12.06.17, 13:38:52] Name3 Surname Nickname: 👍🏼
[12.06.17, 13:40:37] Name1: message?
Note that there can be multiple names before the message, and multiline messages can also occur. I have tried many things over the last few days to split the data into the groups 'date', 'time', 'name', and 'message'.
I was able to figure out, that the regex
(.)(\d+\.\d+\.\d+)(,)(\s)(\d+:\d+:\d+)(.)(\s)([^:]+)(:)
is able to capture everything up to the message (cf.: https://regex101.com/r/hQlgeM/3). But I cannot figure out how to add the message so that multiline messages are grouped into the previous message.
Lastly: if I am able to capture each group from the .txt with regex, how do I actually pass each group into a separate column? I've been looking at examples for the last three days, but I still cannot figure out how to construct this dataframe.
Code that I tried to work with:
df = pd.read_csv('chat.txt', names = ['raw'])
data = df.iloc[:,0]
re.match(r'\[([^]]+)\] ([^:]+):(.*)', data)
Another try that did not work:
input_file = open("chat.txt", "r", encoding='utf-8')
content = input_file.read()
df = pd.DataFrame(content, columns = ['raw'])
df['date'] = df['raw'].str.extract(r'^(.)(\d+\.\d+\.\d+)', expand=True)
df['time'] = df['raw'].str.extract(r'(\s)(\d+:\d+:\d+)', expand=True)
df['name'] = df['raw'].str.extract(r'(\s)([^:]+)(:)', expand=True)
df['message'] = df['raw'].str.extract(r'^(.)(?<=:).*$', expand=True)
df
A complete solution will look like:
import pandas as pd
import io, re

file_path = 'chat.txt'
rx = re.compile(r'\[(?P<date>\d+(?:\.\d+){2}),\s(?P<time>\d+(?::\d+){2})]\s(?P<name>[^:]+):(?P<message>.*)')
col_list = []
date = time = name = message = ''
with io.open(file_path, "r", encoding="utf-8", newline="\n") as sr:
    for line in sr:
        m = rx.match(line)
        if m:
            if name:  # flush the previous record before starting a new one
                col_list.append([date, time, name, message])
            date = m.group("date")
            time = m.group("time")
            name = m.group("name")
            message = m.group("message")
        elif line:
            message += line
    if name:  # flush the final record
        col_list.append([date, time, name, message])
df = pd.DataFrame(col_list, columns=['date', 'time', 'name', 'message'])
Pattern details
\[ - a [ char
(?P<date>\d+(?:\.\d+){2}) - Group "date": 1+ digits and then two repetitions of . and 1+ digits
,\s - , and a whitespace
(?P<time>\d+(?::\d+){2}) - Group "time": 1+ digits and then two repetitions of : and 1+ digits
]\s - ] and a whitespace
(?P<name>[^:]+) - Group "name": one or more chars other than :
: - a colon
(?P<message>.*) - Group "message": any 0+ chars, as many as possible, up to the end of line.
Then, the logic is as follows:
A line is read in and tested against the pattern.
If there is a match, the record collected so far is saved, and the four variables - date, time, name and message - are re-initialized from the match groups.
If the line does not match the pattern, it is considered part of the current message and is appended to the message variable.
Here is the solution that I found works in my case. The problem was that I was using read_csv() on what is plain-text data. I also needed regex to build the records before passing them into pandas:
import re
import pandas as pd

chat = open('chat.txt', encoding='utf-8').read()
pattern = r'(?s)\[(?P<date>\d+(?:\.\d+){2}),\s(?P<time>\d+(?::\d+){2})]\s(?P<name>[^:]+):(?P<message>.*?)(?=\[\d+\.\d+\.\d+,\s\d+:\d+:\d+]|\Z)'
df = pd.DataFrame(re.findall(pattern, chat), columns=['date', 'time', 'name', 'message'])
print(df.tail())
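Run against an inline version of the sample chat (a sketch; the (?s) flag lets . cross newlines, and the lazy .*? stops at the next timestamp or at the end of input):

```python
import re
import pandas as pd

chat = ("[12.06.17, 13:18:36] Name1: Test test test\n"
        "[12.06.17, 13:20:44] Name2: words words words\nwords\nwords\n"
        "[12.06.17, 13:29:03] Name1: more words more words\n")
pattern = (r'(?s)\[(?P<date>\d+(?:\.\d+){2}),\s(?P<time>\d+(?::\d+){2})]'
           r'\s(?P<name>[^:]+):(?P<message>.*?)'
           r'(?=\[\d+\.\d+\.\d+,\s\d+:\d+:\d+]|\Z)')
# findall with multiple groups returns one tuple per record
df = pd.DataFrame(re.findall(pattern, chat), columns=['date', 'time', 'name', 'message'])
print(df['name'].tolist())  # ['Name1', 'Name2', 'Name1']
```

The multiline message of the second record is captured whole, because the lazy message group only stops at the lookahead, not at a newline.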

How can I optimize a search in a pandas dataframe

I need to search for the word 'mas' in a DataFrame. The column with the phrase is Corpo, and the text in this column is split into a list, for example: 'I like birds' ---> ['I', 'like', 'birds']. I need to search for 'mas' in a Portuguese phrase and keep just the words after 'mas'. The code below takes too long to execute:
df.Corpo.update(df.Corpo.str.split())  # tokenize phrase
df.Corpo = df.Corpo.fillna('')
for i in df.index:
    for j in range(len(df.Corpo[i])):
        lista_aux = []
        if df.Corpo[i][j] == 'mas' or df.Corpo[i][j] == 'porem' or df.Corpo[i][j] == 'contudo' or df.Corpo[i][j] == 'todavia':
            lista_aux = df.Corpo[i]
            df.Corpo[i] = lista_aux[j+1:]
            break
        if df.Corpo[i][j] == 'question':
            df.Corpo[i] = ['question']
            break
When working with pandas dataframes (or numpy arrays) you should always try to use vectorized operations instead of for-loops over individual dataframe elements. Vectorized operations are (nearly always) significantly faster than for-loops.
In your case you could use pandas built-in vectorized operation str.extract, which allows extraction of the string part that matches a regex search pattern. The regex search pattern mas (.+) should capture the part of a string that follows after 'mas'.
import pandas as pd
# Example dataframe with phrases
df = pd.DataFrame({'Corpo': ['I like birds', 'I mas like birds', 'I like mas birds']})
# Use regex search to extract phrase sections following 'mas'
# (expand=False returns a Series, so fillna aligns on the row index)
df2 = df.Corpo.str.extract(r'mas (.+)', expand=False)
# Fill gaps with full original phrase
df2 = df2.fillna(df.Corpo)
will give as result:
In [1]: df2
Out[1]:
0    I like birds
1      like birds
2           birds
Name: Corpo, dtype: object
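The same vectorized idea extends to the other conjunctions from the question ('porem', 'contudo', 'todavia') with an alternation (a sketch; the Portuguese sample phrases here are invented):

```python
import pandas as pd

df = pd.DataFrame({'Corpo': ['gosto de aves', 'gosto mas nao muito', 'sim porem talvez']})
# \b guards against matching inside longer words; (?:...) groups the alternation
pat = r'\b(?:mas|porem|contudo|todavia)\b\s+(.+)'
out = df.Corpo.str.extract(pat, expand=False).fillna(df.Corpo)
print(out.tolist())  # ['gosto de aves', 'nao muito', 'talvez']
```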
