Extract sub-string between multiple certain words using regex in python

Regex sub-string
I want to extract the phone, fax, and mobile numbers from a string; if a number is missing, an empty string should be returned for it. I want three lists (phone, fax, mobile) from any given text string. Example strings are given below.
ex1 = "miramar road margie shoop san diego ca 12793 manager phone 6035550160 fax 6035550161 mobile 6035550178 marsgies travel wwwmarpiestravelcom"
ex2 = "david packard electrical engineering 350 serra mall room 170 phone 650 7259327 stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
ex3 = "stanford electrical engineering vijay chandrasekhar electrical engineering 17 comstock circle apt 101 stanford ca 94305 phone 9162210411"
It is possible with regex like this:
phone_regex = re.match(".*phone(.*)fax(.*)mobile(.*)",ex1)
phone = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][0]
mobile = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][2]
fax = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][1]
Result from ex1:
phone = 6035550160
fax = 6035550161
mobile = 6035550178
ex2 does not have a mobile entry, so I get:
Traceback (most recent call last):
phone = [re.sub("[^0-9]", "", x) for x in phone_regex.groups()][0]
AttributeError: 'NoneType' object has no attribute 'groups'
Question
I need either a better regex solution (I am new to regex),
or a way to catch the AttributeError and assign an empty string.

You may use a simple re.findall like this:
dict(re.findall(r'\b({})\s*(\d+)'.format("|".join(keys)), ex))
The regex will look like
\b(phone|fax|mobile)\s*(\d+)
See the regex demo online.
Pattern details
\b - a word boundary
(phone|fax|mobile) - Group 1: one of the words listed
\s* - 0+ whitespaces
(\d+) - Group 2: one or more digits
See the Python demo:
import re
exs = ["miramar road margie shoop san diego ca 12793 manager phone 6035550160 fax 6035550161 mobile 6035550178 marsgies travel wwwmarpiestravelcom",
"david packard electrical engineering 350 serra mall room 170 phone 650 7259327 stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu",
"stanford electrical engineering vijay chandrasekhar electrical engineering 17 comstock circle apt 101 stanford ca 94305 phone 9162210411"]
keys = ['phone', 'fax', 'mobile']
for ex in exs:
    res = dict(re.findall(r'\b({})\s*(\d+)'.format("|".join(keys)), ex))
    print(res)
Output:
{'fax': '6035550161', 'phone': '6035550160', 'mobile': '6035550178'}
{'fax': '650', 'phone': '650'}
{'phone': '9162210411'}
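The output for the second string shows that `\d+` stops at the first space, so only `650` is captured. Below is a minimal sketch of one way around that, assuming the digit groups of a number are separated by single spaces and that a missing key should map to an empty string. The names `KEYS`, `PATTERN` and `extract_numbers` are made up for this illustration.

```python
import re

KEYS = ("phone", "fax", "mobile")
# A keyword, optional whitespace, then digit groups separated by single spaces.
PATTERN = re.compile(r"\b({})\s*(\d+(?: \d+)*)".format("|".join(KEYS)))

def extract_numbers(text):
    """Return a dict with one entry per key; missing keys map to ''."""
    found = {key: "" for key in KEYS}
    for key, digits in PATTERN.findall(text):
        found[key] = digits.replace(" ", "")
    return found

ex2 = ("david packard electrical engineering 350 serra mall room 170 "
       "phone 650 7259327 stanford university fax 650 723 1882 "
       "stanford california 943059505 ulateecestanfordedu")
print(extract_numbers(ex2))
# {'phone': '6507259327', 'fax': '6507231882', 'mobile': ''}
```

One caveat of this assumption: a number that is directly followed by another, unrelated number (a ZIP code, for instance) would be merged into it.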

Use re.search
Demo:
import re
ex1 = "miramar road margie shoop san diego ca 12793 manager phone 6035550160 fax 6035550161 mobile 6035550178 marsgies travel wwwmarpiestravelcom"
ex2 = "david packard electrical engineering 350 serra mall room 170 phone 650 7259327 stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
ex3 = "stanford electrical engineering vijay chandrasekhar electrical engineering 17 comstock circle apt 101 stanford ca 94305 phone 9162210411"
for i in [ex1, ex2, ex3]:
    phone = re.search(r"(?P<phone>(?<=\bphone\b).*?(?=([a-z]|$)))", i)
    if phone:
        print("Phone:", phone.group("phone").strip())
    fax = re.search(r"(?P<fax>(?<=\bfax\b).*?(?=([a-z]|$)))", i)
    if fax:
        print("Fax:", fax.group("fax").strip())
    mob = re.search(r"(?P<mob>(?<=\bmobile\b).*?(?=([a-z]|$)))", i)
    if mob:
        print("mob:", mob.group("mob").strip())
    print("-----")
Output:
Phone: 6035550160
Fax: 6035550161
mob: 6035550178
-----
Phone: 650 7259327
Fax: 650 723 1882
-----
Phone: 9162210411
-----

I think I understand what you want: exactly the first match after a keyword. What you need in such a case is the question mark, ?:
"'?' is also a quantifier, short for {0,1}. It means 'match zero or one of the group preceding this question mark.' It can also be read as: the part preceding the question mark is optional."
Placed directly after another quantifier, as in .*?, the question mark makes that quantifier lazy, so the match stops at the first opportunity.
And here is some code that should work, in case the definition wasn't enough:
import re
res_dict = {}
# ex1 is the sample string from the question, which uses 'mobile', not 'cell'
list_keywords = ['phone', 'mobile', 'fax']
for i_key in list_keywords:
    temp_res = re.findall(i_key + '(.*?) [a-zA-Z]', ex1)
    res_dict[i_key] = temp_res

I think the following regexes should work fine:
mobile = re.findall('mobile([0-9]*)', ex1.replace(" ",""))[0]
fax = re.findall('fax([0-9]*)', ex1.replace(" ",""))[0]
phone = re.findall('phone([0-9]*)', ex1.replace(" ",""))[0]
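Indexing the result with [0] raises an IndexError when a keyword is absent, as mobile is in ex2. Here is a hedged sketch that returns an empty string instead, keeping the same space-stripping idea. `first_number` is a hypothetical helper name, not part of the answer.

```python
import re

def first_number(keyword, text):
    """Digits that directly follow keyword once spaces are removed, else ''."""
    match = re.search(re.escape(keyword) + r"(\d+)", text.replace(" ", ""))
    return match.group(1) if match else ""

ex2 = ("david packard electrical engineering 350 serra mall room 170 "
       "phone 650 7259327 stanford university fax 650 723 1882 "
       "stanford california 943059505 ulateecestanfordedu")

phone = first_number("phone", ex2)    # '6507259327'
fax = first_number("fax", ex2)        # '6507231882'
mobile = first_number("mobile", ex2)  # '' because ex2 has no mobile entry
print(phone, fax, repr(mobile))
```

Because all spaces are removed first, a keyword can also match across what used to be a word boundary, so this is best kept to text where that cannot happen.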

Related

How do I transform a non-CSV text file into a CSV using Python/Pandas?

I have a text file that looks like this:
Id Number: 12345678
Location: 1234561791234567090-8.9
Street: 999 Street AVE
Buyer: john doe
Id Number: 12345688
Location: 3582561791254567090-8.9
Street: 123 Street AVE
Buyer: Jane doe # buyer % LLC
Id Number: 12345689
Location: 8542561791254567090-8.9
Street: 854 Street AVE
Buyer: Jake and Bob: Owner%LLC: Inc
I'd like the file to look like this:
Id Number    Location                   Street            Buyer
12345678     1234561791234567090-8.9    999 Street AVE    john doe
12345688     3582561791254567090-8.9    123 Street AVE    Jane doe # buyer % LLC
12345689     8542561791254567090-8.9    854 Street AVE    Jake and Bob: Owner%LLC: Inc
I have tried the following:
# 1 Read text file and ignore bad lines (lines with extra colons thus reading as extra fields).
tr = pd.read_csv('C:\\File Path\\test.txt', sep=':', header=None, error_bad_lines=False)
# 2 Convert into a dataframe/pivot table.
ndf = pd.DataFrame(tr.pivot(index=None, columns=0, values=1))
# 3 Clean up the pivot table to remove NaNs and reset the index (line by line).
nf2 = ndf.apply(lambda x: x.dropna().reset_index(drop=True))
Here is where I got the last line (#3): https://stackoverflow.com/a/62481057/10448224
When I do the above and export to CSV, the headers are arranged like the following:
(index)    Street    Buyer    Id Number    Location
The data is filled in nicely, but at some point the Buyer field becomes inaccurate, while the rest of the fields are accurate through the entire DF.
My guesses:
When I run #1 part of my script I get the following errors 507 times:
b'Skipping line 500: expected 2 fields, saw 3\nSkipping line 728: expected 2 fields, saw 3\
At the tail end of the new DF I am missing exactly 507 entries for the Buyer field. So I think that when I drop my bad lines, the remaining Buyer values get pushed up.
Pain Points:
The Buyer field will sometimes have extra colons and other odd characters. So when I try to use a colon as a delimiter I run into problems.
I am new to Python and I am very new to using functions. I primarily use Pandas to manipulate data at a somewhat basic level. So in the words of the great Michael Scott: "Explain it to me like I'm five." Many many thanks to anyone willing to help.

Here's what I meant by reading in and using split. Very similar to other answers. Untested, and I don't recall whether inputline includes the EOL, so I stripped it too.
import pandas as pd
with open('myfile.txt') as f:
    data = []  # holds database
    record = {}  # holds built up record
    for inputline in f:
        # split on the first colon only, so colons inside a value are kept
        key, value = inputline.strip().split(':', 1)
        if key == "Id Number":  # new record starting
            if len(record):
                data.append(record)  # write previous record
                record = {}
        record.update({key: value.strip()})
    if len(record):
        data.append(record)  # out final record
df = pd.DataFrame(data)

This is a minimal example that demonstrates the basics:
cat split_test.txt
Id Number: 12345678
Location: 1234561791234567090-8.9
Street: 999 Street AVE
Buyer: john doe
Id Number: 12345688
Location: 3582561791254567090-8.9
Street: 123 Street AVE
Buyer: Jane doe # buyer % LLC
Id Number: 12345689
Location: 8542561791254567090-8.9
Street: 854 Street AVE
Buyer: Jake and Bob: Owner%LLC: Inc
import csv
with open("split_test.txt", "r") as f:
    id_val = "Id Number"
    list_var = []
    for line in f:
        # split on the first colon only, so colons inside a value survive
        split_line = line.strip().split(':', 1)
        print(split_line)
        if split_line[0] == id_val:
            d = {}
            d[split_line[0]] = split_line[1]
            list_var.append(d)
        else:
            d.update({split_line[0]: split_line[1]})
list_var
[{'Id Number': ' 12345678',
'Location': ' 1234561791234567090-8.9',
'Street': ' 999 Street AVE',
'Buyer': ' john doe'},
{'Id Number': ' 12345688',
'Location': ' 3582561791254567090-8.9',
'Street': ' 123 Street AVE',
'Buyer': ' Jane doe # buyer % LLC'},
{'Id Number': ' 12345689',
'Location': ' 8542561791254567090-8.9',
'Street': ' 854 Street AVE',
'Buyer': ' Jake and Bob: Owner%LLC: Inc'}]
with open("split_ex.csv", "w", newline="") as csv_file:
    field_names = list_var[0].keys()
    csv_writer = csv.DictWriter(csv_file, fieldnames=field_names)
    csv_writer.writeheader()
    for row in list_var:
        csv_writer.writerow(row)

I would try reading the file line by line, splitting the key-value pairs into a list of dicts to look something like:
data = [
    {
        "Id Number": "12345678",
        "Location": "1234561791234567090-8.9",
        ...
    },
    {
        "Id Number": ...
    }
]
# easy to create the dataframe from here
your_df = pd.DataFrame(data)
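The line-by-line idea can be sketched without pandas, using only the standard library. This is one possible reading of the approach, not a tested solution for the full file: it assumes every record starts with an "Id Number" line, and `parse_records` and `SAMPLE` are names invented for the example.

```python
import csv
import io

SAMPLE = """\
Id Number: 12345678
Location: 1234561791234567090-8.9
Street: 999 Street AVE
Buyer: john doe
Id Number: 12345689
Location: 8542561791254567090-8.9
Street: 854 Street AVE
Buyer: Jake and Bob: Owner%LLC: Inc
"""

def parse_records(lines, first_key="Id Number"):
    """Group 'key: value' lines into dicts; first_key starts a new record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition(":")  # split on the first colon only
        key = key.strip()
        if key == first_key or not records:
            records.append({})
        records[-1][key] = value.strip()
    return records

records = parse_records(io.StringIO(SAMPLE))

out = io.StringIO()  # stands in for open("out.csv", "w", newline="")
writer = csv.DictWriter(out, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)
print(out.getvalue())
```

Splitting on the first colon only is what keeps a Buyer value such as "Jake and Bob: Owner%LLC: Inc" intact. The same list of dicts can also be passed to pd.DataFrame, as in the line above.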

Get information from a xml imdb response with no tags

I'm working on a movie database, getting responses from IMDb. The response comes in XML format, but it has no tags for the individual fields, just the information mixed together. How can I get each piece of data out of it?
Here's how the response shows up:
<?xml version="1.0" encoding="UTF-8"?><root response="True">
<movie title="Batman" year="1989" rated="PG-13" released="23 Jun 1989" runtime="126 min" genre="Action, Adventure" director="Tim Burton" writer="Bob Kane (Batman characters), Sam Hamm (story), Sam Hamm (screenplay), Warren Skaaren (screenplay)" actors="Michael Keaton, Jack Nicholson, Kim Basinger, Robert Wuhl" plot="Gotham City. Crime boss Carl Grissom (Jack Palance) effectively runs the town but there's a new crime fighter in town - Batman (Michael Keaton). Grissom's right-hand man is Jack Napier (Jack Nicholson), a brutal man who is not entirely sane... After falling out between the two Grissom has Napier set up with the Police and Napier falls to his apparent death in a vat of chemicals. However, he soon reappears as The Joker and starts a reign of terror in Gotham City. Meanwhile, reporter Vicki Vale (Kim Basinger) is in the city to do an article on Batman. She soon starts a relationship with Batman's everyday persona, billionaire Bruce Wayne." language="English, French, Spanish" country="USA, UK" awards="Won 1 Oscar. Another 8 wins & 26 nominations." poster="https://m.media-amazon.com/images/M/MV5BMTYwNjAyODIyMF5BMl5BanBnXkFtZTYwNDMwMDk2._V1_SX300.jpg" metascore="69" imdbRating="7.6" imdbVotes="302,842" imdbID="tt0096895" type="movie" />
</root>

Here is my answer to your question
xmlRaw="""< ?xml
version = "1.0"
encoding = "UTF-8"? > < root
response = "True" >
< movie
title = "Batman"
year = "1989"
rated = "PG-13"
released = "23 Jun 1989"
runtime = "126 min"
genre = "Action, Adventure"
director = "Tim Burton"
writer = "Bob Kane (Batman characters), Sam Hamm (story), Sam Hamm (screenplay), Warren Skaaren (screenplay)"
actors = "Michael Keaton, Jack Nicholson, Kim Basinger, Robert Wuhl"
plot = "Gotham City. Crime boss Carl Grissom (Jack Palance) effectively runs the town but there's a new crime fighter in town - Batman (Michael Keaton). Grissom's right-hand man is Jack Napier (Jack Nicholson), a brutal man who is not entirely sane... After falling out between the two Grissom has Napier set up with the Police and Napier falls to his apparent death in a vat of chemicals. However, he soon reappears as The Joker and starts a reign of terror in Gotham City. Meanwhile, reporter Vicki Vale (Kim Basinger) is in the city to do an article on Batman. She soon starts a relationship with Batman's everyday persona, billionaire Bruce Wayne."
language = "English, French, Spanish"
country = "USA, UK"
awards = "Won 1 Oscar. Another 8 wins & 26 nominations."
poster = "https://m.media-amazon.com/images/M/MV5BMTYwNjAyODIyMF5BMl5BanBnXkFtZTYwNDMwMDk2._V1_SX300.jpg"
metascore = "69"
imdbRating = "7.6"
imdbVotes = "302,842"
imdbID = "tt0096895"
type = "movie" / >
< / root >"""
def getValue(xml, value):
    # return the first line of the given text that mentions the attribute name
    for line in xml.split('\n'):
        if value in line:
            return line
    return None  # nothing matched
print (getValue(xmlRaw, 'title'))
print (getValue(xmlRaw, 'year'))
print (getValue(xmlRaw, 'rated'))
print (getValue(xmlRaw, 'released'))
#add more as you need the data
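The values in this response are XML attributes of the movie element, so an XML parser can read them directly instead of searching lines for a substring. Here is a minimal sketch with the standard library's xml.etree.ElementTree, run on a shortened copy of the response (the variable names are illustrative). Note that a bare & character, as in the awards attribute of the full response, is not well-formed XML and would have to be escaped as &amp; before parsing.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the real response carries more attributes.
xml_raw = (
    '<root response="True">'
    '<movie title="Batman" year="1989" rated="PG-13" '
    'released="23 Jun 1989" imdbRating="7.6" imdbID="tt0096895" />'
    '</root>'
)

root = ET.fromstring(xml_raw)
movie = root.find("movie")         # first <movie> child of <root>
print(movie.get("title"))          # attribute lookup by name
print(movie.get("year"))
print(movie.attrib["imdbRating"])  # attrib is a plain dict of all attributes
```

Using get returns None for an attribute that is missing, which avoids a KeyError when a field is absent from a response.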

Python file parsing, can't catch strings in new line

I am parsing a large text file with 56,900 book titles, each with its authors and an etext number.
I am trying to find the authors by parsing the file.
The file looks like this:
TITLE and AUTHOR ETEXT NO.
Aspects of plant life; with special reference to the British flora,      56900
by Robert Lloyd Praeger
The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]
Nancy Brandon's Mystery, by Lillian Garis 56894
Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]
Pensées sans langage, par Francis Picabia 56892
[Language: French]
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]
Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]
The Blue Star, by Fletcher Pratt 56889
Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]
The Junior Classics, Volume 3: Tales from Greece and Rome, by Various 56887
~ ~ ~ ~ Posting Dates for the below eBooks: 1 Mar 2018 to 31 Mar 2018 ~ ~ ~ ~
TITLE and AUTHOR ETEXT NO.
The American Missionary, Volume 41, No. 1, January, 1887, by Various 56886
Morganin miljoonat, mennessä Sven Elvestad 56885
[Author a.k.a. Stein Riverton]
[Subtitle: Salapoliisiromaani]
[Language: Finnish]
"Trip to the Sunny South" in March, 1885, by L. S. D 56884
Balaam and His Master, by Joel Chandler Harris 56883
[Subtitle: and Other Sketches and Stories]
Susien saaliina, mennessä Jack London 56882
[Language: Finnish]
Forged Egyptian Antiquities, by T. G. Wakeling 56881
The Secret Doctrine, Vol. 3 of 4, by Helena Petrovna Blavatsky 56880
[Subtitle: Third Edition]
No Posting 56879
Author name usually starts after "by" or when there is no "by" in line then author name starts after a comma ","...However the "," can be a part of the title if the line has a by.
So, I parsed it for by first then for comma.
Here is what I tried:
def search_by_author():
    fhand = open('GUTINDEX.ALL')
    print("Search by Author:")
    for line in fhand:
        if not line.startswith(" [") and not line.startswith("TITLE"):
            if not line.startswith("~"):
                words = line.rstrip()
                words = line.lstrip()
                words = words[:-6]
                if ", by" in words:
                    words = words[words.find(', by'):]
                    words = words[5:]
                    print (words)
                else:
                    words = words[words.find(', '):]
                    words = words[5:]
                    if "," in words:
                        words = words[words.find(', '):]
                        if words.startswith(','):
                            words =words[words.find(','):]
                            print (words)
                        else:
                            print (words)
                    else:
                        print (words)
                    if " by" in words:
                        words = words[words.find('by')]
                        print(words)
search_by_author()
However it can't seem to find the author name for lines like
Aspects of plant life; with special reference to the British flora,      56900
by Robert Lloyd Praeger

As per your file, info about a book can be spread across multiple lines. There is a blank line after each book info. I used that to gather all info about a book and then parse it to get the author info.
import re
def search_by_author():
    fhand = open('GUTINDEX.ALL')
    book_info = ''
    for line in fhand:
        line = line.rstrip()
        if (line.startswith('TITLE') or line.startswith('~')):
            continue
        if (len(line) == 0):
            # remove info in square bracket from book_info
            book_info = re.sub(r'\[.*$', '', book_info)
            if ('by ' in book_info):
                tokens = book_info.split('by ')
            else:
                tokens = book_info.split(',')
            if (len(tokens) > 1):
                authors = tokens[-1].strip()
                print(authors)
            book_info = ''
        else:
            # remove ETEXT NO. from line
            line = re.sub(r'\d+$', '', line)
            book_info += ' ' + line.rstrip()
search_by_author()
Output:
Robert Lloyd Praeger
Sabine Baring-Gould
mennessä Charles T. Russell
mennessä Charles T. Russell
Horatio Alger, Jr.
Al Avery and George Rutherford Montgomery
Lillian Garis
Boris Sidis
par Francis Picabia
Frederick Strauss
di Luigi Amabile
Fletcher Pratt
di Teodoro Pascal
Various
Various
mennessä Sven Elvestad
L. S. D
Joel Chandler Harris
mennessä Jack London
T. G. Wakeling
Helena Petrovna Blavatsky
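Some entries in the output above keep the foreign-language word for "by" (mennessä, par, di) in front of the author. Below is a hedged sketch of one way to drop it, assuming the entry has already been joined into a single line with the etext number and bracketed notes removed. The marker list covers only the words seen in this sample, and `author_of` is a hypothetical helper.

```python
import re

# ", <marker> <author>" in a joined entry; markers taken from the sample only.
AUTHOR_RE = re.compile(r",\s+(?:by|par|di|mennessä)\s+(.+)$")

def author_of(entry):
    """Return the author part of a joined index entry, or '' if none is found."""
    match = AUTHOR_RE.search(entry.strip())
    return match.group(1).strip() if match else ""

print(author_of("Aspects of plant life; with special reference to "
                "the British flora, by Robert Lloyd Praeger"))
print(author_of("Raamatun tutkisteluja IV, mennessä Charles T. Russell"))
print(author_of("Tom Thatcher's Fortune, by Horatio Alger, Jr."))
print(repr(author_of("No Posting")))
```

Requiring the comma before the marker keeps commas inside the title (as in "Volume 41, No. 1, January, 1887, by Various") from being mistaken for the author separator.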

How to apply a regex to each sublists of a list?

Let's say I have a list of lists like this:
lis_ = [['"Fun is the enjoyment of pleasure"\t\t',
'#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t','Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware https://t.co/k9sOEpKjbg\t\t'],
['I just became the mayor of Porta Romana on #username! http://4sq.com/9QROVv\t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated http://t.co/heyOhpb1\t\t", "#username Don't use my family surname for your app ???? http://t.co/1yYLXIO9\t\t"]
]
I would like to remove the links of each sublist, so I tried with this regular expression:
new_list = re.sub(r'^https?:\/\/.*[\r\n]*', '', tweets, flags=re.MULTILINE)
I used the MULTILINE flag since when I print list_ it looks like:
[]
[]
[]
...
[]
The problem with the above approach is that I get a TypeError: expected string or buffer; clearly I cannot pass the sublists to the regex like this. How can I apply the above regex to each of the sublists in list_ in order to get something like this (i.e. the sublists without any type of link):
[['"Fun is the enjoyment of pleasure"\t\t',
'#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t','Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware'],
['I just became the mayor of Porta Romana on #username! \t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated \t\t", "#username Don't use my family surname for your app ????\t\t"]
]
Can this be done with a map, or is there any other efficient approach?
Thanks in advance guys

You need to use \b instead of start of the line anchor.
>>> lis_ = [['"Fun is the enjoyment of pleasure"\t\t',
'#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t','Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware https://t.co/k9sOEpKjbg\t\t'],
['I just became the mayor of Porta Romana on #username! http://4sq.com/9QROVv\t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated http://t.co/heyOhpb1\t\t", "#username Don't use my family surname for your app ???? http://t.co/1yYLXIO9\t\t"]
]
>>> [[re.sub(r'\bhttps?:\/\/.*[\r\n]*', '', i) for i in x] for x in lis_]
[['"Fun is the enjoyment of pleasure"\t\t', '#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t', 'Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware '], ['I just became the mayor of Porta Romana on #username! ', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated ", "#username Don't use my family surname for your app ???? "]]
OR
>>> l = []
>>> for i in lis_:
...     m = []
...     for j in i:
...         m.append(re.sub(r'\bhttps?:\/\/.*[\r\n]*', '', j))
...     l.append(m)
...
>>> l
[['"Fun is the enjoyment of pleasure"\t\t', '#username det fanns ett utvik med "sabrina without a stitch". acke nothing. #username\t\t', 'Report by #username - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware '], ['I just became the mayor of Porta Romana on #username! ', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated ", "#username Don't use my family surname for your app ???? "]]
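Both versions above delete everything from the link to the end of the string, including the trailing tabs that the expected output in the question keeps. Here is a minimal sketch that removes only the link itself, assuming a link runs until the next whitespace character. `URL_RE` and `strip_links` are names chosen for this illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def strip_links(nested):
    """Return a new list of lists with every link removed from every string."""
    return [[URL_RE.sub("", s) for s in sub] for sub in nested]

lis_ = [
    ['"Fun is the enjoyment of pleasure"\t\t'],
    ['I just became the mayor of Porta Romana on #username! http://4sq.com/9QROVv\t\t',
     "#username Don't use my family surname for your app ???? http://t.co/1yYLXIO9\t\t"],
]
print(strip_links(lis_))
```

The nested comprehension keeps the sublist structure intact, and the same function could be applied per sublist with map if that reads better.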

It seems that you have a list of lists of strings.
In that case, you simply need to iterate over these lists the proper way:
list_ = [['blablablalba', 'blabalbablbla', 'blablala', 'http://t.co/xSnsnlNyq5'], ['blababllba', 'blabalbla', 'blabalbal'],['http://t.co/xScsklNyq5'], ['blablabla', 'http://t.co/xScsnlNyq3']]
def remove_links(sublist):
    return [s for s in sublist if not re.search(r'https?:\/\/.*[\r\n]*', s)]
final_list = list(map(remove_links, list_))  # list() is needed on Python 3, where map is lazy
# [['blablablalba', 'blabalbablbla', 'blablala'], ['blababllba', 'blabalbla', 'blabalbal'], [], ['blablabla']]
If you want to remove any empty sub-lists afterwards:
final_final_list = [l for l in final_list if l]

How can I do a non-greedy (backtracking) match with OneOrMore etc. in pyparsing?

I am trying to parse a partially standardized street address into its components using pyparsing. I want a non-greedy match of a street name that may be N tokens long.
For example:
444 PARK GARDEN LN
Should be parsed into:
number: 444
street: PARK GARDEN
suffix: LN
How would I do this with PyParsing? Here's my initial code:
from pyparsing import *
def main():
    street_number = Word(nums).setResultsName('street_number')
    street_suffix = oneOf("ST RD DR LN AVE WAY").setResultsName('street_suffix')
    street_name = OneOrMore(Word(alphas)).setResultsName('street_name')
    address = street_number + street_name + street_suffix
    result = address.parseString("444 PARK GARDEN LN")
    print(result.dump())
if __name__ == '__main__':
    main()
but when I try parsing it, the street suffix gets gobbled up by the default greedy parsing behavior.

Use the negation, ~, to check to see if the upcoming street_name is actually a street_suffix.
from pyparsing import *
street_number = Word(nums)('street_number')
street_suffix = oneOf("ST RD DR LN AVE WAY")('street_suffix')
street_name = OneOrMore(~street_suffix + Word(alphas))('street_name')
address = street_number + street_name + street_suffix
result = address.parseString("444 PARK GARDEN LN")
print(result.dump())
In addition, you don't have to use setResultsName, you can simply use the syntax above. IMHO it leads to a much cleaner grammar definition.
