Group .txt data into dataframe - python

I have a .txt file with data like such:
[12.06.17, 13:18:36] Name1: Test test test
[12.06.17, 13:20:20] Name2 ❤️: blabla
[12.06.17, 13:20:44] Name2 ❤️: words words words
words
words
words
[12.06.17, 13:29:03] Name1: more words more words
[12.06.17, 13:38:52] Name3 Surname Nickname: 👍🏼
[12.06.17, 13:40:37] Name1: message?
Note that the name before the message can consist of multiple words, and that multiline messages occur. I have tried many things over the last few days to split the data into the groups 'date', 'time', 'name' and 'message'.
I was able to figure out that the regex
(.)(\d+\.\d+\.\d+)(,)(\s)(\d+:\d+:\d+)(.)(\s)([^:]+)(:)
is able to capture everything up to the message (cf. https://regex101.com/r/hQlgeM/3), but I cannot figure out how to capture the message itself so that multiline messages are grouped with the previous message.
Lastly: if I am able to capture each group from the .txt with regex, how do I actually put each group into a separate column? I have been looking at examples for the last three days, but I still cannot figure out how to construct this dataframe.
Code that I tried to work with:
df = pd.read_csv('chat.txt', names = ['raw'])
data = df.iloc[:,0]
re.match(r'\[([^]]+)\] ([^:]+):(.*)', data)
Another try that did not work:
input_file = open("chat.txt", "r", encoding='utf-8')
content = input_file.read()
df = pd.DataFrame(content, columns = ['raw'])
df['date'] = df['raw'].str.extract(r'^(.)(\d+\.\d+\.\d+)', expand=True)
df['time'] = df['raw'].str.extract(r'(\s)(\d+:\d+:\d+)', expand=True)
df['name'] = df['raw'].str.extract(r'(\s)([^:]+)(:)', expand=True)
df['message'] = df['raw'].str.extract(r'^(.)(?<=:).*$', expand=True)
df

A complete solution will look like this:
import pandas as pd
import io, re

file_path = 'chat.txt'
rx = re.compile(r'\[(?P<date>\d+(?:\.\d+){2}),\s(?P<time>\d+(?::\d+){2})]\s(?P<name>[^:]+):(?P<message>.*)')
col_list = []
date = time = name = message = ''
with io.open(file_path, "r", encoding="utf-8", newline="\n") as sr:
    for line in sr:
        m = rx.match(line)
        if m:
            if date:  # flush the previous record before starting a new one
                col_list.append([date, time, name, message])
            date = m.group("date")
            time = m.group("time")
            name = m.group("name")
            message = m.group("message")
        elif line:  # a continuation line of a multiline message
            message += line
    if date:  # append the last collected record
        col_list.append([date, time, name, message])
df = pd.DataFrame(col_list, columns=['date', 'time', 'name', 'message'])
Pattern details
\[ - a [ char
(?P<date>\d+(?:\.\d+){2}) - Group "date": 1+ digits and then two repetitions of . and 1+ digits
,\s - , and a whitespace
(?P<time>\d+(?::\d+){2}) - Group "time": 1+ digits and then two repetitions of : and 1+ digits
]\s - ] and a whitespace
(?P<name>[^:]+) - Group "name": one or more chars other than :
: - a colon
(?P<message>.*) - Group "message": any 0+ chars, as many as possible, up to the end of line.
Then, the logic is as follows:
Each line is read in and tested against the pattern.
If it matches, the record collected so far is appended to the list, and the four variables - date, time, name and message - are re-initialized from the captured groups.
If it does not match, the line is considered part of a multiline message and is appended to the message variable.
After the loop, the last collected record is appended as well.
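Note that with this approach each captured message keeps the leading space after the colon and any trailing newlines from continuation lines. A small optional cleanup, assuming the df built above:
df['message'] = df['message'].str.strip()  # trim the leading space and trailing newlines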

Here is the solution that I figured out works in my case. The problem was that I was using read_csv() on .txt data. Also, I needed to use regex to build my format before passing it into pandas:
import re
import pandas as pd

chat = open('chat.txt', encoding='utf-8').read()
pattern = r'(?s)\[(?P<date>\d+(?:\.\d+){2}),\s(?P<time>\d+(?::\d+){2})]\s(?P<name>[^:]+):(?P<message>.*?)(?=\[\d+\.\d+\.\d+,\s\d+:\d+:\d+]|\Z)'
df = pd.DataFrame(re.findall(pattern, chat), columns=['date', 'time', 'name', 'message'])
print(df.tail())
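If you also want a proper datetime column, the date and time parts can be combined; a sketch assuming the df above and day-first dates as in the sample:
# Combine the date and time strings; assumes day.month.year order.
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'], format='%d.%m.%y %H:%M:%S')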

Related

Extracting a string from between two strings in a dataframe

I'm trying to extract a value from my data frame.
I have a column ['Desc'] that contains sentences in the following format:
_000it_ZZZ$$$-
_0780it_ZBZT$$$-
_011it_BB$$$-
_000it_CCCC$$$-
I want to extract the string between 'it_' and '$$$'.
I have tried this code but it does not seem to work:
# initializing substrings
sub1 = "it_"
sub2 = "$$$"
# getting index of substrings
idx1 = df['DESC'].find(sub1)
idx2 = df['DESC'].find(sub2)
# length of substring 1 is added to
# get string from next character
df['results'] = df['DESC'][idx1 + len(sub1) + 1: idx2]
I would appreciate your help
You can use str.extract to get the desired output in your new column.
import pandas as pd

df = pd.DataFrame({
    'DESC': ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-", "_000it_123$$$-"]
})
pat = r"(?<=it_)(.+)(?=[\$]{3}-)"
df['results'] = df['DESC'].str.extract(pat)
print(df)
DESC results
0 _000it_ZZZ$$$- ZZZ
1 _0780it_ZBZT$$$- ZBZT
2 _011it_BB$$$- BB
3 _000it_CCCC$$$- CCCC
4 _000it_123$$$- 123
You can see the regex pattern on Regex101 for more details.
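Rows that do not contain the it_...$$$- structure simply yield NaN, which makes unmatched rows easy to spot; a small sketch (the second row is made up for illustration):
df2 = pd.DataFrame({'DESC': ["_000it_ZZZ$$$-", "no_marker_here"]})
df2['results'] = df2['DESC'].str.extract(pat)
print(df2['results'].tolist())  # ['ZZZ', nan]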
You could try using a regex pattern. It matches the cases you listed here, but I can't guarantee that it will generalize to all possible patterns.
import re
string = "_000it_ZZZ$$$-"
p = re.compile(r"(?<=it_)(.*)(?<!\W)")
m = p.findall(string)
print(m) # ['ZZZ']
The pattern looks for it_ in the string and captures everything after it, backtracking so that the match does not end on a non-word character.
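The same pattern also works column-wise through str.extract; a sketch reusing the df from the previous answer:
df['results2'] = df['DESC'].str.extract(r"(?<=it_)(.*)(?<!\W)")
print(df['results2'].tolist())  # ['ZZZ', 'ZBZT', 'BB', 'CCCC', '123']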

Using regex to match keywords in DF

Good morning all, I am struggling a bit with regex :(
Scenario: I have loaded an excel file into Pandas as DF to enable me to search for keywords across multiple columns.
Data:
Columns include title, scope, description and review. There are 6 keywords I need to search for.
Current approach:
Using numpy.where with str.contains I have found matches, but these are partial matches within other strings. I need to find only whole words. The code below works, but as I said it will also identify matches inside strings such as 'booking' or 'training'. I need a way to find only 'book' or 'train'.
keywords = ['book','train','job']
df["NewValue"] = np.where((df['title'].str.contains('|'.join(keywords))) |
                          (df['scope'].str.contains('|'.join(keywords))) |
                          (df['description'].str.contains('|'.join(keywords))) |
                          (df['review'].str.contains('|'.join(keywords))), 1, 0)
You can use a word boundary \b in a raw string. Note the non-capturing group around the alternation so that the boundaries apply to every keyword (without it, \bbook|train|job\b would anchor only the first and last alternatives):
keywords = ['book','train','job']
cols = ['title', 'scope', 'description', 'review']
m = df[cols].apply(lambda col: col.str.contains(r'\b(?:' + '|'.join(keywords) + r')\b')).any(axis=1)
df['NewValue'] = np.where(m, 1, 0)
# or
df['NewValue'] = m.astype(int)
To extract the matched word, you can use
out = (pd.concat([df[col].str.extract(r'\b(' + '|'.join(keywords) + r')\b')
                  for col in cols], axis=1)
         .groupby(level=0, axis=1).first())
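Here is a self-contained sketch of the whole-word match; the sample data is made up for illustration:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'title':       ['booking flights', 'book a room', 'job offer', 'nothing here'],
    'scope':       ['training',        'n/a',         'n/a',       'n/a'],
    'description': ['n/a',             'n/a',         'n/a',       'n/a'],
    'review':      ['n/a',             'n/a',         'n/a',       'train ride'],
})
keywords = ['book', 'train', 'job']
cols = ['title', 'scope', 'description', 'review']
m = df[cols].apply(lambda col: col.str.contains(r'\b(?:' + '|'.join(keywords) + r')\b')).any(axis=1)
print(m.astype(int).tolist())  # [0, 1, 1, 1]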

Python dataframe : strip part of string, on each column row, if it is in specific format [duplicate]

I have read some pricing data into a pandas dataframe; the values appear as:
$40,000*
$40000 conditions attached
I want to strip it down to just the numeric values.
I know I can loop through and apply regex
[0-9]+
to each field and then join the resulting list back together, but is there a non-loopy way?
Thanks
You could use Series.str.replace:
import pandas as pd
df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
# P
# 0 $40,000*
# 1 $40000 conditions attached
df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)
yields
P
0 40000
1 40000
since \D matches any character that is not a decimal digit.
You could use pandas' replace method; also you may want to keep the thousands separator ',' and the decimal place separator '.'
import pandas as pd
df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
df['pricing'].replace(to_replace=r"\$([0-9,.]+).*", value=r"\1", regex=True, inplace=True)
print(df)
pricing
0 40,000.32
1 40000
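If you then need actual numbers rather than strings, drop the thousands separator and convert; a follow-up sketch assuming the df above:
# Remove the ',' separator and parse the cleaned strings as floats.
df['pricing'] = pd.to_numeric(df['pricing'].str.replace(',', '', regex=False))
print(df)
#     pricing
# 0  40000.32
# 1  40000.00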
You could remove all the non-digits using re.sub():
value = re.sub(r"[^0-9]+", "", value)
If the values are already plain numeric strings, you don't need regex at all; anything unparseable (such as '$40,000*') becomes NaN:
df['col'] = pd.to_numeric(df['col'], errors='coerce')
In case anyone is still reading this: I'm working on a similar problem and needed to replace an entire column of pandas data using a regex pattern I had figured out with re.sub.
To apply this to my entire column, here's the code.
#add_map holds the replacement rules for the strings in the pd df.
add_map = dict([
    ("AV", "Avenue"),
    ("BV", "Boulevard"),
    ("BP", "Bypass"),
    ("BY", "Bypass"),
    ("CL", "Circle"),
    ("DR", "Drive"),
    ("LA", "Lane"),
    ("PY", "Parkway"),
    ("RD", "Road"),
    ("ST", "Street"),
    ("WY", "Way"),
    ("TR", "Trail"),
])
obj = data_909['Address'].copy()  # data_909['Address'] contains the original addresses
for k, v in add_map.items():  # based on the rules in the dict
    rule1 = r"(\b)(%s)(\b)" % k  # replace k only when it stands alone (\b is a word boundary)
    rule2 = lambda m: add_map.get(m.group().upper(), m.group())  # look the matched abbreviation up in the dict; upper() so lower-case matches map too
    obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE)  # IGNORECASE so 'av' is matched as well as 'AV'
data_909['Address_n'] = obj  # store it!
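For what it's worth, the per-key loop can also be collapsed into a single pass over the column with one alternation pattern and a callable replacement. A sketch with a trimmed-down add_map (the sample addresses are made up):
import re
import pandas as pd

add_map = {"AV": "Avenue", "RD": "Road", "ST": "Street"}
addresses = pd.Series(["123 MAIN ST", "9 OAK av", "PINE RD EAST"])
# One alternation over all abbreviations, replaced in a single pass.
pattern = r"\b(?:%s)\b" % "|".join(map(re.escape, add_map))
expanded = addresses.str.replace(pattern,
                                 lambda m: add_map[m.group().upper()],  # upper() so 'av' also maps
                                 regex=True, flags=re.IGNORECASE)
print(expanded.tolist())
# ['123 MAIN Street', '9 OAK Avenue', 'PINE Road EAST']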
Hope this helps anyone searching for the problem I had. Cheers

Pandas DataFrame - Extract string between two strings and include the first delimiter

I've the following strings in column on a dataframe:
"LOCATION: FILE-ABC.txt"
"DRAFT-1-FILENAME-ADBCD.txt"
And I want to extract everything that is between the word FILE and the ".". But I want to include the first delimiter. Basically I am trying to return the following result:
"FILE-ABC"
"FILENAME-ABCD"
For that I am using the script below:
df['field'] = df.string_value.str.extract('FILE/(.w+)')
But I am not able to return the desired information (always getting NA).
How can I do this?
You can accomplish this all within the regex, without having to use string slicing.
df['field'] = df.string_value.str.extract(r'(FILE.*(?=\.txt))')
FILE is what we begin the match on
.* grabs any number of characters
(?=) is a lookahead assertion that matches without consuming.
Handy regex tool https://pythex.org/
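A runnable version of this approach, using the sample strings from the question:
import pandas as pd

df = pd.DataFrame({'string_value': ["LOCATION: FILE-ABC.txt", "DRAFT-1-FILENAME-ADBCD.txt"]})
df['field'] = df.string_value.str.extract(r'(FILE.*(?=\.txt))')
print(df['field'].tolist())  # ['FILE-ABC', 'FILENAME-ADBCD']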
If the strings will always end in .txt then you can try with the following:
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Example:
import pandas as pd
text = ["LOCATION: FILE-ABC.txt","DRAFT-1-FILENAME-ADBCD.txt"]
data = {'index':[0,1],'string_value':text}
df = pd.DataFrame(data)
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Output:
index string_value field
0 0 LOCATION: FILE-ABC.txt FILE-ABC
1 1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD
You can make a capturing group that captures from 'FILE' (inclusive) greedily up to the last period, or make it non-greedy so that it stops at the first . after FILE.
import pandas as pd

df = pd.DataFrame({'string_value': ["LOCATION: FILE-ABC.txt", "DRAFT-1-FILENAME-ADBCD.txt",
                                    "BADFILENAME.foo.txt"]})
df['field_greedy'] = df['string_value'].str.extract(r'(FILE.*)\.')
df['field_not_greedy'] = df['string_value'].str.extract(r'(FILE.*?)\.')
print(df)
string_value field_greedy field_not_greedy
0 LOCATION: FILE-ABC.txt FILE-ABC FILE-ABC
1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD FILENAME-ADBCD
2 BADFILENAME.foo.txt FILENAME.foo FILENAME

Check if string is in pandas Dataframe column, and create new Dataframe

I am trying to check if a string is in a Pandas column. I tried doing it two ways but they both seem to check for a substring.
itemName = "eco drum ecommerce"
words = self.itemName.split(" ")
df.columns = ['key','word','umbrella', 'freq']
df = df.dropna()
df = df.loc[df['word'].isin(words)]
I also tried this way, but this also checks for substring
words = self.itemName.split(" ")
words = '|'.join(words)
df.columns = ['key','word','umbrella', 'freq']
df = df.dropna()
df = df.loc[df['word'].str.contains(words, case=False)]
The word was this: "eco drum".
Then I did this:
words = self.itemName.split(" ")
words = '|'.join(words)
To end up with this:
eco|drum
This is the "word" column:
Thank you, is it possible this way to not match substrings?
You have the right idea. .contains has the regex pattern-match option set to True by default, so all you need to do is add anchors to your regex pattern, e.g. "ball" becomes "^ball$".
df = pd.DataFrame(columns=['key'])
df["key"] = ["largeball", "ball", "john", "smallball", "Ball"]
print(df.loc[df['key'].str.contains("^ball$", case=False)])
Referring more specifically to your question, since you want to search for multiple words, you will have to create the regex pattern to give to contains.
# Create dataframe
df = pd.DataFrame(columns=['word'])
df["word"] = ["ecommerce", "ecommerce", "ecommerce", "ecommerce", "eco", "drum"]
# Create regex pattern
word = "eco drum"
words = word.split(" ")
words = "|".join("^{}$".format(word) for word in words)
# Find matches in dataframe
print(df.loc[df['word'].str.contains(words, case=False)])
The code words = "|".join("^{}$".format(word) for word in words) uses a generator expression. Given ['eco', 'drum'], it produces this pattern: ^eco$|^drum$.
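An equivalent, slightly tidier alternative is Series.str.fullmatch, which anchors the whole string for you; a sketch, assuming the same df:
plain = "|".join("eco drum".split())  # 'eco|drum'
print(df.loc[df['word'].str.fullmatch(plain, case=False)])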
