I started to learn regex in python and I've got the following task:
I need to write a script taking those 2 strings:
string_1 = 'merchant ID 1234, device ID 45678, serial# 123456789'
string_2 = 'merchant ID 8765, user ID 531476, serial# 87654321'
and displays only the strings that contain both merchant ID #### and device ID ####.
To check for the first condition I wrote the following line:
ex_1 = re.findall(r'\merchant\b\s\ID\b\s\d+', string_1)
print (ex_1)
output: ['merchant ID 1234'] - works fine!
Problem is I can't get the other condition for some reason:
ex_2 = re.findall(r'\device\b\s\ID\b\s\d+', string_1)
output: [] - empty list.
What am I doing wrong?
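Because \d is a regex escape that matches a digit:
ex_2 = re.findall(r'\device\b\s\ID\b\s\d+', string_1)
                    ^^
So \device means "one digit followed by evice", which never occurs in string_1. \m and \I have no special meaning, so older Python versions treated them as a plain m and I, which is why the merchant pattern happened to work; from Python 3.6 on, such unknown escapes raise an error. You should remove the \ before device, merchant and ID, like this: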
>>> re.findall(r'device\b\sID\b\s\d+', string_1)
['device ID 45678']
Your grouping is wrong. Use parentheses to group the two alternatives:
(merchant ID \d+|device ID \d+)
e.g.
>>> re.findall(r'(merchant ID \d+|device ID \d+)', string_1)
['merchant ID 1234', 'device ID 45678']
Be careful with the special character '\'. '\device' matches a digit ([0-9]) followed by 'evice'.
With Pythex you can test your regex, and consult a great cheatsheet.
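Neither snippet above applies both conditions at once, which is what the task asks for. A minimal sketch, assuming "has both" means that a merchant ID and a device ID each appear somewhere in the string; the names MERCHANT, DEVICE and has_both_ids are made up for illustration:

```python
import re

# Hypothetical helper names; the patterns follow the corrected regexes above.
MERCHANT = re.compile(r'\bmerchant ID \d+')
DEVICE = re.compile(r'\bdevice ID \d+')


def has_both_ids(text):
    """Return True when text contains both a merchant ID and a device ID."""
    return bool(MERCHANT.search(text)) and bool(DEVICE.search(text))


string_1 = 'merchant ID 1234, device ID 45678, serial# 123456789'
string_2 = 'merchant ID 8765, user ID 531476, serial# 87654321'

# Keep only the strings that satisfy both conditions
matching = [s for s in (string_1, string_2) if has_both_ids(s)]
for s in matching:
    print(s)
```

Only string_1 is printed, because string_2 has a user ID instead of a device ID.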
Related
In my homework, I need to extract the first name, last name, ID code, phone number, date of birth and address of a person from a given string using Regex. The order of the parameters always remains the same. Each parameter requires a separate pattern.
Requirements are as follows:
Both first and last names always begin with a capital letter followed by at least one lowercase letter.
ID code is always 11 characters long and consists only of numbers.
The phone number itself is a combination of 7-8 numbers. The phone number might be separated from the area code with a whitespace, but not necessarily. It is also possible that there is no area code at all.
Date of birth is formatted as dd-MM-YYYY
Address is everything else that remains.
I got the following patterns for each parameter:
str1 = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
first_name_pattern = r"^[A-Z][a-z]+"
last_name_pattern = r"[A-z][a-z]+(?=[0-9])"
id_code_pattern = r"\d{11}(?=\+)"
phone_number_pattern = r"\+\d{3}?\s*\d{7,8}"
date_pattern = r"\d{1,2}\-\d{1,2}\-\d{1,4}"
address_pattern = r"[A-Z][a-z]*\s.*$"
first_name_match = re.findall(first_name_pattern, str1)
last_name_match = re.findall(last_name_pattern, str1)
id_code_match = re.findall(id_code_pattern, str1)
phone_number_match = re.findall(phone_number_pattern, str1)
date_match = re.findall(date_pattern, str1)
address_match = re.findall(address_pattern, str1)
So, given "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti", I get ['Heino'] ['Plekk'] ['69712047623'] ['+37256887364'] ['12-09-2020'] ['Tartu mnt 183,Tallinn,16881,Eesti'], which suits me perfectly.
The problem starts when the area code is missing: id_code_pattern no longer finds the ID code because of the (?=\+) lookahead, and adding |\d{11} as an alternative causes another problem, because it then finds both the ID code and the phone number (69712047623 and 37256887364). I also do not understand how to improve phone_number_pattern so that it finds only the 7 or 8 digits of the phone number.
A single expression with some well-crafted capture groups will help you immensely:
import re
str1 = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
pattern = r"^(?P<first_name>[A-Z][a-z]+)(?P<last_name>[A-Z][a-z]+)(?P<id_code>\d{11})(?P<phone>(?:\+\d{3})?\s*\d{7,8})(?P<dob>\d{1,2}\-\d{1,2}\-\d{1,4})(?P<address>.*)$"
print(re.match(pattern, str1).groupdict())
Result:
{'first_name': 'Heino', 'last_name': 'Plekk', 'id_code': '69712047623', 'phone': '+37256887364', 'dob': '12-09-2020', 'address': 'Tartu mnt 183,Tallinn,16881,Eesti'}
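One caveat with a loose date such as \d{1,2}: when the phone number has only 7 digits, the greedy \d{7,8} can swallow the first digit of the day. The following is a sketch of a stricter variant, not a definitive solution. It assumes the date is always two-digit day, two-digit month and four-digit year, and that the area code, when present, starts with + and has three digits. The area_code group and the parse_person helper are names invented here.

```python
import re

# Assumptions: dd-MM-YYYY is always fully padded; the area code is '+' plus 3 digits.
PERSON = re.compile(
    r"^(?P<first_name>[A-Z][a-z]+)"
    r"(?P<last_name>[A-Z][a-z]+)"
    r"(?P<id_code>\d{11})"
    r"(?:(?P<area_code>\+\d{3})\s*)?"   # optional area code, optional whitespace
    r"(?P<phone>\d{7,8})"               # only the 7 or 8 digits of the number
    r"(?P<dob>\d{2}-\d{2}-\d{4})"       # strict widths force correct backtracking
    r"(?P<address>.*)$"
)


def parse_person(text):
    """Return a dict of the named groups, or None when the text does not match."""
    m = PERSON.match(text)
    return m.groupdict() if m else None


with_code = parse_person(
    "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti")
without_code = parse_person(
    "HeinoPlekk697120476235688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti")
seven_digits = parse_person(
    "HeinoPlekk69712047623568873612-09-2020Tartu mnt 183,Tallinn,16881,Eesti")
print(with_code)
print(without_code)
print(seven_digits)
```

Keeping the area code in its own group means phone holds just the local number, and area_code is None when it is absent.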
I am trying to remove mentions and special characters such as "!?$..." from this dataframe, and especially the character "#", while keeping the text of the hashtag.
Something like this is what I would like to have:
tweet                                        | clean_tweet
---------------------------------------------|-----------
"This is an example @user2 #Science ! #Tech" | "This is an example Science Tech"
"Hi How are you @user45 #USA"                | "Hi How are you USA"
I am not sure how to iterate and do this in my dataframe in the column tweet
I tried with this for special characters
df["clean_tweet"] = df.columns.str.replace('[#,@,&]', '')
But I have this error
ValueError: Length of values (38) does not match length of index (82702)
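You are applying the replacement to the column names (df.columns), not to the tweet column.
Try this (regex=True is needed in recent pandas versions, where str.replace treats the pattern as a literal string by default):
df["clean_tweet"] = df["tweet"].str.replace('[#,@,&]', '', regex=True)
I see you also want to remove @user mentions, so I used a regex here:
df['clean_tweet'] = df['tweet'].replace(regex=r'(@\w+)|#|&|!', value='')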
  tweet                                       clean_tweet
0 This is an example @user2 #Science ! #Tech  This is an example Science Tech
1 Hi How are you @user45 #USA                 Hi How are you USA
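Removing a mention or a symbol leaves its surrounding spaces behind, so the cleaned text can contain runs of spaces. A minimal sketch of a cleaning function that also collapses whitespace, assuming mentions are written as @name and that "special characters" means anything that is neither a word character nor whitespace; clean_tweet and the three pattern names are hypothetical:

```python
import re

MENTION = re.compile(r'@\w+')
# Assumption: a "special character" is anything that is not a word character or whitespace
SPECIAL = re.compile(r'[^\w\s]')
SPACES = re.compile(r'\s+')


def clean_tweet(text):
    """Drop @mentions and special characters, keep hashtag text, tidy spaces."""
    text = MENTION.sub('', text)   # remove mentions entirely
    text = SPECIAL.sub('', text)   # remove '#', '!', '&', ... but keep the words
    return SPACES.sub(' ', text).strip()


print(clean_tweet("This is an example @user2 #Science ! #Tech"))
print(clean_tweet("Hi How are you @user45 #USA"))
```

On the dataframe from the question this could be applied with df["clean_tweet"] = df["tweet"].apply(clean_tweet).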
I have following text:
This is the foo test the date purchase id is /STAR2015A. This is another foo test the purchase is /STAR2022M. Yet another foo test, get it back by if u dont like, purchase id is /STAR2039K. You wont be surprised if i write another id /STAR2050L.
I want to get all the unique purchase IDs. Each one starts with /STAR and ends with a letter from A to M, and the number ranges from 2010 to 2050. I tried the following, but it doesn't return any result:
import re
dset = []
text = "This is the foo test the date purchase id is /STAR2015A. This is another foo test the purchase is /STAR2022M. Yet another foo test, get it back by if u dont like, purchase id is /STAR2039K. You wont be surprised if i write another id /STAR2050L. "
pattern = re.findall("[^\/STAR[20][10-50][A-M]]",text)
print(pattern)
Let me know how to solve this.
You could use
/STAR20(?:[1-4]\d|50)[A-M]
/STAR20 Match literally
(?: Non capture group
[1-4]\d Match 10 - 49
| or
50 Match 50
) Close group
[A-M] Match A - M
Example
result = re.findall(r"/STAR20(?:[1-4]\d|50)[A-M]", text)
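re.findall returns every occurrence, so repeated IDs would appear more than once. Since the question asks for unique IDs, here is a small sketch that removes duplicates while keeping first-seen order; the function name unique_ids and the sample string are made up for illustration:

```python
import re

PURCHASE_ID = re.compile(r"/STAR20(?:[1-4]\d|50)[A-M]")


def unique_ids(text):
    """Return the purchase IDs found in text, without duplicates, in order."""
    # dict.fromkeys drops repeats and preserves insertion order
    return list(dict.fromkeys(PURCHASE_ID.findall(text)))


# Hypothetical sample: one repeated ID, one year out of range, one letter out of range
sample = "a /STAR2015A b /STAR2022M c /STAR2015A d /STAR2051A e /STAR2039Z"
print(unique_ids(sample))
```

Using set(...) instead would also deduplicate, but the order of the IDs would no longer be guaranteed.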
I want to replace the text between the '|' and '/' in the string ("|伊士曼柯达公司/") with '!!!'.
s = '柯達⑀柯达⑀ /Kodak (brand, US film company)/full name Eastman Kodak Company 伊士曼柯達公司|伊士曼柯达公司/'
print(s)
s = re.sub(r'\|.*?\/.', '/!!!', s)
print('\t', s)
I tested the code first on https://regex101.com/, and it worked perfectly.
I can't quite figure out why it's not doing the replacement in python.
Variants of escaping I've also tried include:
s = re.sub(r'|.*?\/.', '!!!', s)
s = re.sub(r'|.*?/.', '!!!', s)
s = re.sub(r'\|.*?/.', '!!!', s)
Each time the string comes out unchanged.
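Your pattern fails on this string because the trailing . requires one more character after the /, but here the / is the last character, so nothing matches. You can change your regex to this one, which uses lookarounds to ensure that what you replace is preceded by | and followed by /: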
(?<=\|).*?(?=/)
Check this Python code,
import re
s = '柯達⑀柯达⑀ /Kodak (brand, US film company)/full name Eastman Kodak Company 伊士曼柯達公司|伊士曼柯达公司/'
print(s)
s = re.sub(r'(?<=\|).*?(?=/)', '!!!', s)
print(s)
Prints like you expect,
柯達⑀柯达⑀ /Kodak (brand, US film company)/full name Eastman Kodak Company 伊士曼柯達公司|伊士曼柯达公司/
柯達⑀柯达⑀ /Kodak (brand, US film company)/full name Eastman Kodak Company 伊士曼柯達公司|!!!/
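For comparison, a small sketch that runs the question's pattern next to an alternative without lookarounds. The alternative consumes the two delimiters and writes them back in the replacement; this is one possible variant, not the only way to do it.

```python
import re

s = '柯達⑀柯达⑀ /Kodak (brand, US film company)/full name Eastman Kodak Company 伊士曼柯達公司|伊士曼柯达公司/'

# The question's pattern: the final '.' needs a character after '/',
# and there is none here, so the string comes back unchanged.
unchanged = re.sub(r'\|.*?\/.', '/!!!', s)

# Match '|', any run of non-'/' characters, then '/', and re-insert the delimiters.
replaced = re.sub(r'\|[^/]*/', '|!!!/', s)

# The lookaround version from the answer, for comparison
with_lookarounds = re.sub(r'(?<=\|).*?(?=/)', '!!!', s)

print(unchanged == s)
print(replaced)
print(replaced == with_lookarounds)
```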
I have four speakers like this:
Team_A=[Fred,Bob]
Team_B=[John,Jake]
They are having a conversation and it is all represented by a string, ie. convo=
Fred
hello
John
hi
Bob
how is it going?
Jake
we are doing fine
How do I disassemble and reassemble the string so I can split it into 2 strings, 1 string of what Team_A said, and 1 string of what Team_B said?
output: team_A_said="hello how is it going?", team_B_said="hi we are doing fine"
The lines don't matter.
I have this awful find... then slice code that is not scalable. Can someone suggest something else? Any libraries to help with this?
I didn't find anything in nltk library
This code assumes that the contents of convo strictly conform to the
name\nstuff they said\n\n
pattern. The only tricky code it uses is zip(*[iter(lines)]*3), which creates a list of triplets of strings from the lines list. For a discussion on this technique and alternate techniques, please see How do you split a list into evenly sized chunks in Python?.
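To see what that idiom does in isolation, here is a short sketch on a made-up list. The same iterator object is passed to zip three times, so each tuple consumes three consecutive items:

```python
lines = ['Fred', 'hello', '', 'John', 'hi', '']

# One iterator, referenced three times: zip pulls three items per tuple
it = iter(lines)
triplets = list(zip(it, it, it))
print(triplets)

# The compact spelling used in the answer builds the same triplets
same = list(zip(*[iter(lines)] * 3))

# Leftover items that do not fill a whole triplet are silently dropped
short = list(zip(*[iter(['a', 'b', 'c', 'd'])] * 3))
```

That silent dropping is why the answer insists on the final blank line in convo.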
#!/usr/bin/env python3

team_ids = ('A', 'B')
team_names = (
    ('Fred', 'Bob'),
    ('John', 'Jake'),
)

# Build a dict to get the team id from a person's name
teams = {}
for team_id, names in zip(team_ids, team_names):
    for name in names:
        teams[name] = team_id

# Each block in convo MUST consist of <name>\n<one line of text>\n\n
# Do NOT omit the final blank line at the end
convo = '''Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine

'''

lines = convo.splitlines()

# Group lines into <name><text><empty> chunks
# and append the text to the appropriate list in `said`
said = {'A': [], 'B': []}
for name, text, _ in zip(*[iter(lines)]*3):
    team_id = teams[name]
    said[team_id].append(text)

for team_id in team_ids:
    print('Team %s said: %r' % (team_id, ' '.join(said[team_id])))
output
Team A said: 'hello how is it going?'
Team B said: 'hi we are doing fine'
You could use a regular expression to split up each entry. A list comprehension (or the built-in filter, which replaces Python 2's itertools.ifilter) can then be used to extract the required entries for each team.
import re

def get_team_conversation(entries, team):
    # Keep the entries whose first line is the name of a team member
    return [e for e in entries if e.split('\n')[0] in team]

Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']
convo = """
Fred
hello
John
hi
Bob
how is it going?
Jake
we are doing fine"""
# A speaker name alone on its own line
find_teams = '^(' + '|'.join(Team_A + Team_B) + r')$'
# Each entry runs from one name up to the next name or the end of the text
entries = [e[0].strip() for e in re.findall('(' + find_teams + '.*?)' + '(?=' + find_teams + r'|\Z)', convo, re.S | re.M)]
print('Team-A', get_team_conversation(entries, Team_A))
print('Team-B', get_team_conversation(entries, Team_B))
Giving the following output:
Team-A ['Fred\nhello', 'Bob\nhow is it going?']
Team-B ['John\nhi', 'Jake\nwe are doing fine']
It is a problem of language parsing.
Answer is a Work in progress
Finite state machine
A conversation transcript can be understood by imagining it as parsed by an automaton with the following states:
[start] ---> [Name] ----> [Text] -+-----> [end]
               ^                  |
               |                  | (whitespaces)
               +------------------+
You can parse your conversation by making it follow that state machine. If your parsing succeeds (ie. follows the states to end of text) you can browse your "conversation tree" to derive meaning.
Tokenizing your conversation (lexer)
You need functions to recognize the name state. This is straightforward
name = (Team_A | Team_B) + '\n'
Conversation alternation
In this answer, I did not assume that a conversation strictly alternates between speakers. Under that assumption, a conversation like this one would be an error:
Fred # author 1
hello
John # author 2
hi
Bob # author 3
how is it going ?
Bob # ERROR : author 3 again !
are we still on for saturday, Fred ?
This might be problematic if your transcript concatenates answers from the same author.
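The state machine described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a finished parser: a line that exactly equals a known name switches the current speaker, any other non-blank line is text from that speaker, and text before the first name is ignored. The function split_by_team and the sample transcript are invented for this example.

```python
Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']


def split_by_team(convo, teams):
    """teams maps a team label to its list of speaker names."""
    speaker_team = {name: label for label, names in teams.items() for name in names}
    said = {label: [] for label in teams}
    current = None  # team of the speaker whose text is being read
    for raw in convo.splitlines():
        line = raw.strip()
        if not line:
            continue  # whitespace between blocks
        if line in speaker_team:
            current = speaker_team[line]  # [Name] state
        elif current is not None:
            said[current].append(line)  # [Text] state
    return {label: ' '.join(parts) for label, parts in said.items()}


# Hypothetical transcript in which Bob speaks twice in a row
convo = "Fred\nhello\nJohn\nhi\nBob\nhow is it going?\nBob\nare we still on?\n\nJake\nwe are doing fine\n"
result = split_by_team(convo, {'A': Team_A, 'B': Team_B})
print(result)
```

Because the speaker is tracked as state, repeated speakers and multi-line replies need no special handling. The trade-off is that a line of text consisting only of a speaker's name would be read as a name.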