Python Parsing String by Single Quotes - python

I am trying to Parse the following string:
EDITED to account for spaces...
['THE LOCATION', 'THE NAME', 'THE JOB', 'THE AREA')]
Right now I use regular expressions and split the data with the comma into a list
InfoL = re.split(",", Info)
However my output is
'THE LOCATION'
'THE NAME'
'THE JOB'
'THE AREA')]
Looking to have the output as follows
THE LOCATION
THE NAME
THE JOB
THE AREA
Thoughts?

One possibility is to use strip() to remove the unwanted characters:
In [18]: s="['LOCATION', 'NAME', 'JOB', 'AREA')]"
In [19]: print '\n'.join(tok.strip("[]()' ") for tok in s.split(','))
LOCATION
NAME
JOB
AREA
Like your original solution, this will break if any of the strings are allowed to contain commas.
P.S. If that closing parenthesis in your example is a typo, you might be able to use ast.literal_eval():
In [22]: print '\n'.join(ast.literal_eval(s))
LOCATION
NAME
JOB
AREA
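If the strings may contain commas, one option is to pull out the quoted fields directly instead of splitting. This is only a minimal sketch: it assumes every field is wrapped in single quotes and never contains an escaped quote, and the sample string is made up for illustration.

```python
import re

# Hypothetical input: one field contains a comma, and the stray ')' is kept
s = "['THE LOCATION', 'NAME, THE', 'THE JOB', 'THE AREA')]"

# Capture whatever sits between each pair of single quotes
fields = re.findall(r"'([^']*)'", s)
print('\n'.join(fields))
```

Because the brackets and the stray parenthesis sit outside the quotes, they never end up in the result.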

InfoL = re.split("[, '()\[\]]*", Info)

Try this code snippet:
import re
tmpS = "['THE LOCATION', 'THE NAME', 'THE JOB', 'THE AREA')]"
# Drop everything except word characters, whitespace and commas
tmpS = re.sub(r'[^\w\s,]+', '', tmpS)
print(tmpS)  # Result -> 'THE LOCATION, THE NAME, THE JOB, THE AREA'
for s in tmpS.split(','):
    print(s.strip())
Output:
THE LOCATION
THE NAME
THE JOB
THE AREA

Related

python bypass re.finditer match when searched words are in a defined expression

I have a list of words (find_list) that I want to find in a text and a list of expressions containing those words that I want to bypass (scape_list) when it is in the text.
I can find all the words in the text using this code:
import re
find_list = ['name', 'small']
scape_list = ['small software', 'company name']
text = "My name is Klaus and my middle name is Smith. I work for a small company. The company name is Small Software. Small Software sells Software Name."
final_list = []
for word in find_list:
    s = r'\W{}\W'.format(word)
    matches = re.finditer(s, text, (re.MULTILINE | re.IGNORECASE))
    for word_ in matches:
        final_list.append(word_.group(0))
The final_list is:
[' name ', ' name ', ' name ', ' Name.', ' small ', ' Small ', ' Small ']
Is there a way to bypass expressions listed in scape_list and obtain a final_list like this one:
[' name ', ' name ', ' Name.', ' small ']
final_list and scape_list are always being updated. So I think that regex is a good approach.
You can capture the words before and after each find_list word with the regex and check that neither combination is present in scape_list. I have added comments where I changed the code. (It is also better to change scape_list to a set if it may grow large in the future.)
import re
find_list = ['name', 'small']
scape_list = ['small software', 'company name']
text = "My name is Klaus and my middle name is Smith. I work for a small company. The company name is Small Software. Small Software sells Software Name."
final_list = []
for word in find_list:
    s = r'(\w*\W)({})(\W\w*)'.format(word)  # change the regex to capture adjacent words
    matches = re.finditer(s, text, (re.MULTILINE | re.IGNORECASE))
    for word_ in matches:
        if ((word_.group(1) + word_.group(2)).strip().lower() not in scape_list
                and (word_.group(2) + word_.group(3)).strip().lower() not in scape_list):  # added this condition
            final_list.append(word_.group(2))  # changed here
final_list
['name', 'name', 'Name', 'small']
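Another way to think about it, sketched here under the assumption that every entry in scape_list is a plain phrase: put the phrases to skip and the words to find into one alternation, and keep only the matches where the capture group for a bare word is set. This is a different technique from the one above, and the results come out in text order rather than grouped by word.

```python
import re

find_list = ['name', 'small']
scape_list = ['small software', 'company name']
text = ("My name is Klaus and my middle name is Smith. I work for a small company. "
        "The company name is Small Software. Small Software sells Software Name.")

# The escape phrases come first, so they take priority over a bare word
# that could start at the same position; a matched phrase is consumed whole.
escaped = '|'.join(re.escape(p) for p in scape_list)
wanted = '|'.join(re.escape(w) for w in find_list)
pattern = re.compile(r'\b(?:{})\b|\b({})\b'.format(escaped, wanted), re.IGNORECASE)

# group(1) is only set when a bare find_list word matched
final_list = [m.group(1) for m in pattern.finditer(text) if m.group(1)]
print(final_list)  # ['name', 'name', 'small', 'Name']
```

Since both lists are only joined into the pattern, updating either one just means recompiling the regex.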

Remove Numbers and Turn into a List

A clearer way of asking the question is:
If I have a string as follows:
'PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5'
How do I turn that string into:
Palm Beach, Gavea, Maronas, Iowa, Orange Park
That is, make each item in the list title case (i.e. uppercase first letter and the rest lowercase), and delete the numbers and the word 'Race'.
I am setting up to export to Excel.
Thanks in advance - Angus
You can do it without importing any library:
races = """PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5"""
''.join([ch if not ch.isdigit() else 'xxx' for ch in races.replace('Race ','')]).split('xxx')
Output:
['PALM BEACH.', 'Gavea', 'Maronas', 'IOWA', 'ORANGE PARK.', '']
You can use re.split and some string manipulation:
>>> import re
>>> s = 'PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5'
>>> # Split on 'Race ' followed by one or more digits
>>> race_names = re.split(r'Race \d+', s)
>>> def format_name(name):
...     # Remove the trailing period on some race names
...     name = name.rstrip('.')
...     # Change name to title case
...     name = name.title()
...     return name
...
>>> # Format the name and remove any empty entries in the list
>>> race_names = [format_name(name) for name in race_names if name]
>>> list(race_names)
['Palm Beach', 'Gavea', 'Maronas', 'Iowa', 'Orange Park']
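Since the question asks for a single comma-separated line, the same idea can be condensed and joined at the end. A short sketch, assuming the separator is always 'Race ' followed by digits as in the sample:

```python
import re

s = 'PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5'

# Split on the race markers, skip empty pieces, tidy each remaining name
names = [part.rstrip('.').title() for part in re.split(r'Race \d+', s) if part]
line = ', '.join(names)
print(line)  # Palm Beach, Gavea, Maronas, Iowa, Orange Park
```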

Python -- split a string with multiple occurrences of same delimiter

How can I take a string that looks like this
string = 'Contact name: John Doe    Contact phone: 222-333-4444'
and split the string on both colons? Ideally the output would look like:
['Contact Name', 'John Doe', 'Contact phone','222-333-4444']
The real issue is that the name can be an arbitrary length. However, I think it might be possible to use re to split the string after a certain number of space characters (say at least 4, since there will likely always be at least 4 spaces between the end of any name and the beginning of Contact phone), but I'm not that good with regex. If someone could please provide a possible solution (and an explanation so I can learn), that would be thoroughly appreciated.
You can use re.split:
import re
s = 'Contact name: John Doe    Contact phone: 222-333-4444'
new_s = re.split(r':\s|\s{2,}', s)
Output:
['Contact name', 'John Doe', 'Contact phone', '222-333-4444']
Regex explanation:
:\s => matches an occurrence of ': '
| => evaluated as 'or', attempts to match either the pattern before or after it
\s{2,} => matches two or more whitespace characters
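If the labels and values are needed as pairs rather than a flat list, the split result can be zipped into a dict. A small sketch, assuming the fields are separated by at least two spaces and that labels and values strictly alternate:

```python
import re

s = 'Contact name: John Doe    Contact phone: 222-333-4444'
parts = re.split(r':\s|\s{2,}', s)

# Even positions hold the labels, odd positions hold the values
contact = dict(zip(parts[::2], parts[1::2]))
print(contact)
```

This gives {'Contact name': 'John Doe', 'Contact phone': '222-333-4444'} for the sample string.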

Python regex to match Ledger/hledger account journal entry

I am writing a program in Python to parse a Ledger/hledger journal file.
I'm having problems coming up with a regex that I'm sure is quite simple. I want to parse a string of the form:
expenses:food:food and wine 20.99
and capture the account sections (between colons, allowing any spaces), regardless of the number of sub-accounts, and the total, in groups. There can be any number of spaces between the final character of the sub-account name and the price digits.
expenses:food:wine:speciality 19.99 is also allowable (no space in sub-account).
So far I've got (\S+):|(\S+ \S+):|(\S+ (?!\d))|(\d+.\d+), which does not allow for an arbitrary number of sub-accounts or for possible spaces. I don't think I want OR operators in there either, as this is going to be concatenated with other regexes with .join() as part of the parsing function.
Any help greatly appreciated.
Thanks.
You can use the following:
((?:[^\s:]+)(?:\:[^\s:]+)*)\s*(\d+\.\d+)
Now we can use:
import re
s = 'expenses:food:wine:speciality 19.99'
rgx = re.compile(r'((?:[^\s:]+)(?:\:[^\s:]+)*)\s*(\d+\.\d+)')
mat = rgx.match(s)
if mat:
    categories, price = mat.groups()
    categories = categories.split(':')
Now categories will be a list containing the categories, and price a string with the price. For your sample input this gives:
>>> categories
['expenses', 'food', 'wine', 'speciality']
>>> price
'19.99'
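Note that [^\s:]+ excludes whitespace, so the pattern above does not match an account such as food and wine. One possible variation, sketched under the assumption that the amount is always a plain unsigned decimal at the end of the line (parse_posting is just an illustrative name):

```python
import re

# Lazily grab everything up to the last run of whitespace before the amount
LINE_RE = re.compile(r'^(?P<accounts>.+?)\s+(?P<amount>\d+\.\d+)\s*$')

def parse_posting(line):
    """Return (list of account parts, amount string), or None if no match."""
    mat = LINE_RE.match(line)
    if mat is None:
        return None
    accounts = [part.strip() for part in mat.group('accounts').split(':')]
    return accounts, mat.group('amount')

print(parse_posting('expenses:food:food and wine    20.99'))
print(parse_posting('expenses:food:wine:speciality 19.99'))
```

The pattern has no top-level OR, which may make it easier to combine with other regexes later.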
You don't need regex for such a simple thing at all, native str.split() is more than enough:
def split_ledger(line):
    entries = line.split(":")  # first split all the entries
    last = entries.pop()  # take the last entry
    return entries + last.rsplit(" ", 1)  # split on last space and return all together
print(split_ledger("expenses:food:food and wine  20.99"))
# ['expenses', 'food', 'food and wine ', '20.99']
print(split_ledger("expenses:food:wine:speciality  19.99"))
# ['expenses', 'food', 'wine', 'speciality ', '19.99']
Or if you don't want the leading/trailing whitespace in any of the entries:
def split_ledger(line):
    entries = [e.strip() for e in line.split(":")]
    last = entries.pop()
    return entries + [e.strip() for e in last.rsplit(" ", 1)]
print(split_ledger("expenses:food:food and wine 20.99"))
# ['expenses', 'food', 'food and wine', '20.99']
print(split_ledger("expenses:food:wine:speciality 19.99"))
# ['expenses', 'food', 'wine', 'speciality', '19.99']

Using Regex to 'Clean Up' a List of Names [closed]

I am using Regex to clean up a list of names so that they are normal. Let's say this list was...
000000AAAAAARob Alsod ## Notice multiple 0's and A's?
AAAPerson Person ## Here, too
Jeff the awesome Guy ## Four words...
Jenna DEeath ## A name like this can exist.
GEOFFERY EVERDEEN ## All caps
shy guy ## All lowercase
Theone Normalperson ## Example name. This one is fine.
Guywith Whitespace ## Trailing or leading whitespace is a nono.
So, as you can see, people don't format their names correctly, so I need a program to highlight all the unwanted stuff. This includes:
Numbers at the start of the name.
Any uppercase without lowercase after. i.e. AAAAAAAJosh
Anything that is all uppercase.
Anything that doesn't start with uppercase. i.e. josh
Trailing and leading whitespace.
I think that is all I need to filter out. The ending product should look something like this:
Rob Alsod ## No more 0's and A's.
Person Person ## No more leading A's (or other letters).
Jeff Guy ## No lowercase words in his name.
Jenna DEeath ## HASN'T removed the D in the middle.
## Name removed as it was all uppercase.
## Name removed as it was all lowercase.
Theone Normalperson ## Nothing changed.
Guywith Whitespace ## Removed whitespace.
EDIT: Sorry about that. Here is my current code:
# Enter your code for "Name Cleaning" here.
import re
namenum = []
num = 0
for sen in open('file.txt'):
    namenum += [sen.split(',')]
    namenum[num][0] = re.sub(r'\s[a-z]+', '', namenum[num][0])
    namenum[num][0] = re.sub(r'^([0-9]*)', '', namenum[num][0])
    namenum[num][0] = re.sub(r'^[A-Z]*?\s[A-Z]*?$', '', namenum[num][0])
    namenum[num][0] = re.sub(r'[^a-zA-Z ][A-Z]*(?=[A-Z])', '', namenum[num][0])
    namenum[num][0] = re.sub(r'\b[a-z]+\b', '', namenum[num][0])
    namenum[num][0] = re.sub(r'^\s*', '', namenum[num][0])
    namenum[num][0] = re.sub(r'\s*$', '', namenum[num][0])
    if namenum[num][0] == '':
        namenum[num][0] = 'Invalid Name'
    num += 1
for i in range(len(namenum)):
    namenum[i][1] = int(namenum[i][1].strip())
namenum = sorted(namenum, key=lambda item: (-item[1], item[0]))
for i in range(0, len(namenum)):
    print(namenum[i][0]+','+str(namenum[i][1]))
It does half the job, but it misses out on some stuff for some reason.
Here is the output:
AAAAAARob Alsod
AAAPerson Person
Guywith Whitespace
Invalid Name
Invalid Name
Jeff Guy
Jenna DEeath
Theone Normalperson
I also tried inputting a name like harry hamilton and it gave back harry, which it should have removed.
This regex removes all your invalid examples. None of your examples requires the for loop which filters banned words, but I think you will need it.
Although this code removes all invalid names from a list it should be easy to modify it to request a new input from the user. Also it doesn't let you know why a name is invalid, but you could just display all the rules.
from re import match
def rules(name):
    # Reject any name containing a banned word (case-insensitive)
    for badWord in bannedWords:
        if name.lower().find(badWord) >= 0:
            return False
    return match(r'^([A-Z][a-z]+(?:[A-Z]?[a-z]+)* ?){1,}$', name)
bannedWords = ('really', 'awesome')
input = ['000000AAAAAARob Alsod', 'AAAPerson Person', 'Jeff the awesome Guy', 'Jenna DEeath', 'GEOFFERY EVERDEEN', 'shy guy', 'Theone Normalperson', ' Guywith Whitespace', 'Someone Middlename MacIntyre', '', 'Jack Really Awesome']
results = list(filter(rules, input))  # list() so this also works on Python 3
print(results)
Produces the result:
['Theone Normalperson', 'Someone Middlename MacIntyre']
Without the for loop:
from re import match
def rules(name):
    return match(r'^([A-Z][a-z]+(?:[A-Z]?[a-z]+)* ?){1,}$', name)
input = ['000000AAAAAARob Alsod', 'AAAPerson Person', 'Jeff the awesome Guy', 'Jenna DEeath', 'GEOFFERY EVERDEEN', 'shy guy', 'Theone Normalperson', ' Guywith Whitespace', 'Someone Middlename MacIntyre', '', 'Jack Really Awesome']
results = list(filter(rules, input))  # list() so this also works on Python 3
print(results)
Produces the result:
['Theone Normalperson', 'Someone Middlename MacIntyre', 'Jack Really Awesome']
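The code above only filters names out. If the goal is to repair them as in the question's expected output, something along these lines might be a starting point. It is a rough sketch of the listed rules as I read them, not a complete solution, and clean_name is a made-up helper name.

```python
import re

def clean_name(raw):
    """Return a cleaned-up name, or None if the name should be dropped."""
    name = raw.strip()                      # leading/trailing whitespace
    if name.isupper() or name.islower():    # all caps or all lowercase
        return None
    name = re.sub(r'^\d+', '', name)        # digits at the start
    # Leading run of capitals that sits in front of a Capital+lowercase start
    name = re.sub(r'^[A-Z]+(?=[A-Z][a-z])', '', name)
    # Drop words that are entirely lowercase, e.g. 'the', 'awesome'
    words = [w for w in name.split() if not w.islower()]
    return ' '.join(words) or None

samples = ['000000AAAAAARob Alsod', 'AAAPerson Person', 'Jeff the awesome Guy',
           'Jenna DEeath', 'GEOFFERY EVERDEEN', 'shy guy',
           'Theone Normalperson', '  Guywith Whitespace  ']
for raw in samples:
    print(repr(raw), '->', clean_name(raw))
```

Only the start of the string is checked for stray capitals, which is why the D in DEeath is left alone.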
