Remove Numbers and Turn into a List - python

A clearer way of asking the question is:
If I have a string as follows:
'PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5'
How do I turn that string into:
Palm Beach, Gavea, Maronas, Iowa, Orange Park
That is: put each item in the list in title case (uppercase first letter, the rest lowercase) and delete the numbers and the word 'Race'.
I am setting up to export to Excel.
Thanks in advance - Angus

You can do it without importing any library:
races = """PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5"""
''.join([ch if not ch.isdigit() else 'xxx' for ch in races.replace('Race ','')]).split('xxx')
Output:
['PALM BEACH.', 'Gavea', 'Maronas', 'IOWA', 'ORANGE PARK.', '']
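The output above still carries the trailing periods, the original casing and an empty last entry. A minimal sketch of the remaining cleanup, still without imports, might look like this. The helper name clean_names and the '|' separator are assumptions for illustration; the separator must not occur in any name.

```python
races = "PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5"

def clean_names(text):
    # Drop the word 'Race ' so only names and race numbers remain.
    no_race = text.replace("Race ", "")
    # Turn every digit into a separator (assumed not to occur in the names).
    separated = "".join("|" if ch.isdigit() else ch for ch in no_race)
    # Split, skip empty pieces, strip trailing periods and apply title case.
    return [part.rstrip(".").title() for part in separated.split("|") if part]

names = clean_names(races)
print(", ".join(names))
```

Filtering out empty pieces also covers multi-digit race numbers, which would otherwise produce empty strings between two separators.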

You can use re.split and some string manipulation:
>>> import re
>>> s = 'PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5'
>>> # Split on the word 'Race' followed by a number
>>> race_names = re.split(r'Race \d+', s)
>>> def format_name(name):
...     # Remove the trailing period on some race names
...     name = name.rstrip('.')
...     # Change name to title case
...     return name.title()
...
>>> # Format the names and drop any empty entries in the list
>>> race_names = [format_name(name) for name in race_names if name]
>>> race_names
['Palm Beach', 'Gavea', 'Maronas', 'Iowa', 'Orange Park']

Related

How to remove multiple substrings at the end of a list of strings in Python?

I have a list of strings:
lst =['puppies com', 'company abc org', 'company a com', 'python limited']
If a string ends with the word limited, com or org, I would like to remove that word. How can I go about doing this?
I have tried:
for item in lst:
    j = item.strip('limited')
    j = item.strip('org')
I've also tried the replace function, to no avail.
Thanks
You can use this example to remove selected trailing words from a list of strings:
lst = ['dont strip this', 'puppies com', 'company abc org', 'company a com', 'python limited']
to_strip = {'limited', 'com', 'org'}
out = []
for item in lst:
    # Split off the last word only
    tmp = item.rsplit(maxsplit=1)
    if tmp[-1] in to_strip:
        out.append(tmp[0])
    else:
        out.append(item)
print(out)
Prints:
['dont strip this', 'puppies', 'company abc', 'company a', 'python']
If I understand this correctly, you always want to remove the last word in each sentence?
If that's the case, this should work:
lst = ['puppies com', 'company abc org', 'company a com', 'python limited']
for i in lst:
    f = i.rsplit(' ', 1)[0]
    print(f)
Returns:
puppies
company abc
company a
python
rsplit (short for "right split") works like split, but starts from the end of the string. The second argument is the maximum number of splits to make: a value of 1 yields at most two elements, since a single split produces two pieces of the input string. This is described in the Python documentation for str.rsplit.
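The strip attempt in the question does not do what was intended: str.strip treats its argument as a set of characters to remove from both ends, not as a word. As an alternative to the answers above, here is a minimal sketch using a regular expression anchored at the end of the string. The helper name strip_trailing_words is hypothetical.

```python
import re

def strip_trailing_words(items, words):
    """Remove one trailing word from each string if that word is in `words`."""
    # Builds e.g. r"\s+(?:limited|com|org)$"; re.escape guards special characters.
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = re.compile(r"\s+(?:" + alternatives + r")$")
    return [pattern.sub("", item) for item in items]

lst = ['dont strip this', 'puppies com', 'company abc org', 'company a com', 'python limited']
print(strip_trailing_words(lst, ['limited', 'com', 'org']))

# str.strip works on characters, not words: the leading 'm' is removed as well.
print(repr('memo com'.strip('com')))
```

Because the pattern requires whitespace before the word, a string that consists only of 'com' is left unchanged.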

Python regex to match Ledger/hledger account journal entry

I am writing a program in Python to parse a Ledger/hledger journal file.
I'm having problems coming up with a regex that I'm sure is quite simple. I want to parse a string of the form:
expenses:food:food and wine 20.99
and capture the account sections (between colons, allowing any spaces), regardless of the number of sub-accounts, and the total, in groups. There can be any number of spaces between the final character of the sub-account name and the price digits.
expenses:food:wine:speciality 19.99 is also allowable (no space in sub-account).
So far I've got (\S+):|(\S+ \S+):|(\S+ (?!\d))|(\d+.\d+), which does not allow for an arbitrary number of sub-accounts or for possible spaces. I don't think I want OR operators in there either, as this is going to be concatenated with other regexes with .join() as part of the parsing function.
Any help greatly appreciated.
Thanks.
You can use the following:
((?:[^\s:]+)(?:\:[^\s:]+)*)\s*(\d+\.\d+)
Now we can use:
import re

s = 'expenses:food:wine:speciality 19.99'
rgx = re.compile(r'((?:[^\s:]+)(?:\:[^\s:]+)*)\s*(\d+\.\d+)')
mat = rgx.match(s)
if mat:
    categories, price = mat.groups()
    categories = categories.split(':')
Now categories is a list of the account names and price is a string holding the amount. Note that [^\s:]+ does not allow spaces inside an account name, so this pattern matches the second sample but not one containing 'food and wine'. For the second sample input this gives:
>>> categories
['expenses', 'food', 'wine', 'speciality']
>>> price
'19.99'
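The pattern above rejects account names that contain spaces. A hedged sketch of a variant that accepts them: capture everything up to the final run of whitespace lazily, then split on colons. The names POSTING_RE and parse_posting are hypothetical, and the sketch assumes the amount is always written as digits, a dot, and digits.

```python
import re

# Lazy account part, at least one whitespace character, then the amount at the end.
POSTING_RE = re.compile(r"^(?P<account>.+?)\s+(?P<amount>\d+\.\d+)\s*$")

def parse_posting(line):
    """Return (list_of_account_parts, amount_string), or None if the line does not match."""
    match = POSTING_RE.match(line)
    if match is None:
        return None
    parts = [part.strip() for part in match.group("account").split(":")]
    return parts, match.group("amount")

print(parse_posting("expenses:food:food and wine   20.99"))
print(parse_posting("expenses:food:wine:speciality 19.99"))
```

The amount stays a string here; converting it with decimal.Decimal would be a reasonable next step for money values.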
You don't need regex for such a simple thing at all, native str.split() is more than enough:
def split_ledger(line):
    entries = line.split(":")  # first split all the entries
    last = entries.pop()  # take the last entry
    return entries + last.rsplit(" ", 1)  # split on last space and return all together
print(split_ledger("expenses:food:food and wine  20.99"))
# ['expenses', 'food', 'food and wine ', '20.99']
print(split_ledger("expenses:food:wine:speciality  19.99"))
# ['expenses', 'food', 'wine', 'speciality ', '19.99']
Or if you don't want the leading/trailing whitespace in any of the entries:
def split_ledger(line):
    entries = [e.strip() for e in line.split(":")]
    last = entries.pop()
    return entries + [e.strip() for e in last.rsplit(" ", 1)]
print(split_ledger("expenses:food:food and wine  20.99"))
# ['expenses', 'food', 'food and wine', '20.99']
print(split_ledger("expenses:food:wine:speciality  19.99"))
# ['expenses', 'food', 'wine', 'speciality', '19.99']

how to get the second word of an element in a list

Given an input string, search the list of tuples that store all the bus stop data and return a list of the tuples whose road name contains the matching string. UPPERCASE and lowercase are considered the same.
If no matches are found, return an empty list instead.
Assume that the bus stop data is already provided, i.e. that the following statement has been evaluated:
bus_stops = read_data('bus_stops.txt')
I am given
bus_stops.txt
01012,Victoria St,Hotel Grand Pacific
01013,Victoria St,St. Joseph's Ch
01019,Victoria St,Bras Basah Cplx
And when the following expression is executed:
lookup_bus_stop_by_road_name(bus_stops, 'st')
I should get:
[('01012', 'Victoria St', 'Hotel Grand Pacific'), ('01013', 'Victoria St', "St. Joseph's Ch"), ('01019', 'Victoria St', 'Bras Basah Cplx')]
Please help me check my code:
def lookup_bus_stop_by_road_name(bus_stops, name):
    matched = []
    for stops in bus_stops:
        new_name = name.lower()
        if stops[1] == new_name:
            matched.append(stops)
    return matched
An even shorter (and more Pythonic) way is to use a list comprehension:
def lookup_bus_stop_by_road_name(bus_stops, name):
    return [bus_stop for bus_stop in bus_stops if name.lower() in bus_stop[1].lower()]
Replace s with the contents of your file (for example, read via open()); a string is used here to keep the demonstration short.
>>> s = '''\
... 01012,Victoria St,Hotel Grand Pacific
... 01013,Victoria St,St. Joseph's Ch
... 01019,Victoria St,Bras Basah Cplx'''
>>> lines = s.split('\n')
>>> lines
['01012,Victoria St,Hotel Grand Pacific', "01013,Victoria St,St. Joseph's Ch", '01019,Victoria St,Bras Basah Cplx']
>>> l = [tuple(line.split(',')) for line in lines]
>>> l
[('01012', 'Victoria St', 'Hotel Grand Pacific'), ('01013', 'Victoria St', "St. Joseph's Ch"), ('01019', 'Victoria St', 'Bras Basah Cplx')]
You should change your function to
def lookup_bus_stop_by_road_name(bus_stops, name):
    matched = []
    new_name = name.lower()
    for stops in bus_stops:
        # stops is a tuple; compare against the road name (index 1), ignoring case
        if new_name in stops[1].lower():
            matched.append(stops)
    return matched
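Putting the pieces together, here is a minimal end-to-end sketch. The read_data function from the question is not shown, so the parser below (read_data_from_text, a hypothetical name) only assumes that the file is plain comma-separated text with three fields per line.

```python
import csv
import io

SAMPLE = """01012,Victoria St,Hotel Grand Pacific
01013,Victoria St,St. Joseph's Ch
01019,Victoria St,Bras Basah Cplx
"""

def read_data_from_text(text):
    # Stand-in for the question's read_data(); parses comma-separated lines into tuples.
    return [tuple(row) for row in csv.reader(io.StringIO(text)) if row]

def lookup_bus_stop_by_road_name(bus_stops, name):
    # Case-insensitive substring match on the road name (second field).
    needle = name.casefold()
    return [stop for stop in bus_stops if needle in stop[1].casefold()]

bus_stops = read_data_from_text(SAMPLE)
print(lookup_bus_stop_by_road_name(bus_stops, 'st'))
print(lookup_bus_stop_by_road_name(bus_stops, 'orchard'))
```

The key difference from the code in the question is the substring test (in) on a lowercased road name instead of an equality test against the original casing.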

Cut off middle word from a string python

I am trying to cut off a few words from the scraped data.
3 Bedroom, Residential Apartment in Velachery
There are many rows of data like this. I am trying to remove the word 'Bedroom' from the string. I am using Beautiful Soup and Python to scrape the webpage, and here I am using this:
for eachproperty in properties:
    print(eachproperty.string[2:])
I know what the above code will do. But I cannot figure out how to just remove the "Bedroom" which is between 3 and ,Residen....
>>> import re
>>> strs = "3 Bedroom, Residential Apartment in Velachery"
>>> re.sub(r'\s*Bedroom\s*', '', strs)
'3, Residential Apartment in Velachery'
or:
>>> strs.replace(' Bedroom', '')
'3, Residential Apartment in Velachery'
Note that strings are immutable, so you need to assign the result of re.sub or str.replace to a variable.
What you need is the replace method:
line = "3 Bedroom, Residential Apartment in Velachery"
line = line.replace(" Bedroom", "")
# For multiple lines, build a new list; reassigning the loop variable would not change the list
lines = [line.replace(" Bedroom", "") for line in lines]
A quick answer is
k = input_string.split()
if "Bedroom" in k:
    k.remove("Bedroom")
answer = ' '.join(k)
This won't handle punctuation like in your question, where "Bedroom," stays attached to its comma. To do that you need
rem = "Bedroom"
answer = input_string  # unchanged if rem is not found
for i in range(len(input_string) - len(rem) + 1):
    if input_string[i:i + len(rem)] == rem:
        # Keep everything before and after the first occurrence
        answer = input_string[:i] + input_string[i + len(rem):]
        break
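The approaches above match the text "Bedroom" wherever it occurs, including inside a longer word such as "Bedrooms". If only the whole word should go, a sketch using regex word boundaries could look like this. The helper name remove_word is hypothetical.

```python
import re

def remove_word(text, word):
    """Remove every whole-word occurrence of `word`, plus any whitespace before it."""
    # \b ensures 'Bedroom' does not match inside 'Bedrooms'.
    pattern = r"\s*\b" + re.escape(word) + r"\b"
    return re.sub(pattern, "", text).strip()

print(remove_word("3 Bedroom, Residential Apartment in Velachery", "Bedroom"))
print(remove_word("2 Bedrooms, Independent House", "Bedroom"))
```

The final strip() only tidies the result when the removed word was at the very start of the string.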

What is efficient way to match words in string?

Example:
names = ['James John', 'Robert David', 'Paul' ... the list has 5K items]
text1 = 'I saw James today'
text2 = 'I saw James John today'
text3 = 'I met Paul'
is_name_in_text(text1, names) # returns False: 'James' is not in the list
is_name_in_text(text2, names) # returns 'James John'
is_name_in_text(text3, names) # returns 'Paul'
is_name_in_text() checks whether any name from the list occurs in the text.
The easy way is to test each name against the text with the in operator, but the list has 5,000 items, so that is not efficient. I could split the text into words and check whether each word is in the list, but that does not work when a name spans more than one word. Line number 7 will fail in this case.
Make names into a set and use the in-operator for fast O(1) lookup.
You can use a regex to parse out the possible names in a sentence:
>>> import re
>>> findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')
>>> def is_name_in_text(text, names):
...     for possible_name in set(findnames.findall(text)):
...         if possible_name in names:
...             return possible_name
...     return False
...
>>> names = set(['James John', 'Robert David', 'Paul'])
>>> is_name_in_text('I saw James today', names)
False
>>> is_name_in_text('I saw James John today', names)
'James John'
>>> is_name_in_text('I met Paul', names)
'Paul'
Build a regular expression with all the alternatives. This way you don't have to worry about somehow pulling the names out of the phrases beforehand.
import re
names_re = re.compile(r'\b' +
                      r'\b|\b'.join(re.escape(name) for name in names) +
                      r'\b')
print(names_re.search('I saw James today'))
You may use Python's set in order to get good performance while using the in operator.
If you have a mechanism of pulling the names out of the phrases and don't need to worry about partial matches (the full name will always be in the string), you can use a set rather than a list.
Your code is exactly the same, with this addition at line 2:
names = set(names)
The in operation will now function much faster.
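One way to combine the set lookup with multi-word names is to test every run of consecutive words in the text against the set. This is a sketch under stated assumptions: words are separated by whitespace, names carry no punctuation, and matching is case-sensitive. The helper names build_name_index and find_name are hypothetical.

```python
def build_name_index(names):
    # Build the set once and remember the longest name, measured in words.
    name_set = set(names)
    max_words = max((len(name.split()) for name in name_set), default=0)
    return name_set, max_words

def find_name(text, index):
    """Return the first known name found in `text`, preferring longer names, else False."""
    name_set, max_words = index
    words = text.split()
    for size in range(max_words, 0, -1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in name_set:
                return candidate
    return False

index = build_name_index(['James John', 'Robert David', 'Paul'])
print(find_name('I saw James John today', index))
```

The work per text depends on the number of words in the text and the longest name, not on the 5,000 entries in the name list.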
