Cut off a middle word from a string in Python

I am trying to cut off a few words from scraped data like this:
3 Bedroom, Residential Apartment in Velachery
There are many rows of data like this. I am trying to remove the word 'Bedroom' from the string. I am using Beautiful Soup and Python to scrape the webpage, and here I am using this:
for eachproperty in properties:
    print eachproperty.string[2:]
I know what the above code will do, but I cannot figure out how to remove just the "Bedroom" that sits between 3 and ,Residen....

>>> import re
>>> strs = "3 Bedroom, Residential Apartment in Velachery"
>>> re.sub(r'\s*Bedroom\s*', '', strs)
'3, Residential Apartment in Velachery'
or:
>>> strs.replace(' Bedroom', '')
'3, Residential Apartment in Velachery'
Note that strings are immutable, so you need to assign the result of re.sub and str.replace back to a variable.
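For example, capturing the cleaned value back into a variable (the variable names here are illustrative):

```python
import re

strs = "3 Bedroom, Residential Apartment in Velachery"
# re.sub returns a new string; the original is unchanged
cleaned = re.sub(r'\s*Bedroom\s*', '', strs)
print(cleaned)  # 3, Residential Apartment in Velachery
```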

What you need is the replace method:
line = "3 Bedroom, Residential Apartment in Velachery"
line = line.replace("Bedroom", "")
# For multiple lines, build a new list; reassigning `line` inside
# a plain for loop would not change the original list
lines = [line.replace("Bedroom", "") for line in lines]

A quick answer:
k = input_string.split()
if "Bedroom" in k:
    k.remove("Bedroom")
answer = ' '.join(k)
This won't handle punctuation like in your question. To do that you need:
rem = "Bedroom"
answer = ""
for i in range(len(input_string) - len(rem) + 1):
    if input_string[i:i+len(rem)] == rem:
        answer = input_string[:i] + input_string[i+len(rem):]
        break
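The same first-occurrence removal can also be sketched with str.partition, which avoids the manual index arithmetic (the function name is illustrative):

```python
def remove_word(s, word):
    # Split around the first occurrence of word; if it is absent,
    # partition puts the whole string in `before` and sep is empty.
    before, sep, after = s.partition(word)
    return before + after if sep else s

print(remove_word("3 Bedroom, Residential Apartment in Velachery", " Bedroom"))
# 3, Residential Apartment in Velachery
```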


Remove Numbers and Turn into a List

A clearer way of asking the question is:
If I have a string as follows:
'PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5'
How do I turn that string into:
Palm Beach, Gavea, Maronas, Iowa, Orange Park
That is, convert each item to title case (i.e. uppercase first letter, rest lowercase), and delete the numbers and the word 'Race'.
I am setting up to export to Excel.
Thanks in advance - Angus
You can do it without importing any library:
races = """PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5"""
''.join([ch if not ch.isdigit() else 'xxx' for ch in races.replace('Race ','')]).split('xxx')
Output:
['PALM BEACH.', 'Gavea', 'Maronas', 'IOWA', 'ORANGE PARK.', '']
You can use re.split and some string manipulation:
import re
>>> s = 'PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5'
>>> # Split on 'Race' followed by digits
>>> race_names = re.split('Race \d+', s)
>>> def format_name(name):
...     # Remove the trailing period on some race names
...     name = name.rstrip('.')
...     # Change the name to title case
...     name = name.title()
...     return name
>>> # Format the name and remove any empty entries in the list
>>> race_names = [format_name(name) for name in race_names if name]
>>> race_names
['Palm Beach', 'Gavea', 'Maronas', 'Iowa', 'Orange Park']

Python regex to match Ledger/hledger account journal entry

I am writing a program in Python to parse a Ledger/hledger journal file.
I'm having problems coming up with a regex that I'm sure is quite simple. I want to parse a string of the form:
expenses:food:food and wine 20.99
and capture the account sections (between colons, allowing any spaces), regardless of the number of sub-accounts, and the total, in groups. There can be any number of spaces between the final character of the sub-account name and the price digits.
expenses:food:wine:speciality 19.99 is also allowable (no space in sub-account).
So far I've got (\S+):|(\S+ \S+):|(\S+ (?!\d))|(\d+.\d+) which is not allowing for any number of sub-accounts and possible spaces. I don't think I want to have OR operators in there either as this is going to concatenated with other regexes with .join() as part of the parsing function.
Any help greatly appreciated.
Thanks.
You can use the following:
((?:[^\s:]+)(?:\:[^\s:]+)*)\s*(\d+\.\d+)
Now we can use:
s = 'expenses:food:wine:speciality 19.99'
rgx = re.compile(r'((?:[^\s:]+)(?:\:[^\s:]+)*)\s*(\d+\.\d+)')
mat = rgx.match(s)
if mat:
    categories, price = mat.groups()
    categories = categories.split(':')
Now categories will be a list containing the categories, and price a string with the price. For your sample input this gives:
>>> categories
['expenses', 'food', 'wine', 'speciality']
>>> price
'19.99'
You don't need regex for such a simple thing at all, native str.split() is more than enough:
def split_ledger(line):
    entries = line.split(":")  # first split all the entries
    last = entries.pop()       # take the last entry
    return entries + last.rsplit(" ", 1)  # split on the last space and return everything together
print(split_ledger("expenses:food:food and wine 20.99"))
# ['expenses', 'food', 'food and wine ', '20.99']
print(split_ledger("expenses:food:wine:speciality 19.99"))
# ['expenses', 'food', 'wine', 'speciality ', '19.99']
Or if you don't want the leading/trailing whitespace in any of the entries:
def split_ledger(line):
    entries = [e.strip() for e in line.split(":")]
    last = entries.pop()
    return entries + [e.strip() for e in last.rsplit(" ", 1)]
print(split_ledger("expenses:food:food and wine 20.99"))
# ['expenses', 'food', 'food and wine', '20.99']
print(split_ledger("expenses:food:wine:speciality 19.99"))
# ['expenses', 'food', 'wine', 'speciality', '19.99']

What Regex to use in this example

I am parsing a string that I know will definitely only contain the following distinct phrases that I want to parse:
'Man of the Match'
'Goal'
'Assist'
'Yellow Card'
'Red Card'
The string that I am parsing could contain everything from none of the elements above to all of them (i.e. the string being parsed could be anything from None to 'Man of the Match Goal Assist Yellow Card Red Card').
For those of you that understand football, you will also realise that the elements 'Goal' and 'Assist' could in theory be repeated an infinite number of times. The element 'Yellow Card' could be repeated 0, 1 or 2 times also.
I have built the following regexes (where 'incident1' is the string being parsed), which I believed would return every occurrence of the phrases above, however all I am getting is single instances:
regex1 = re.compile("Man of the Match*", re.S)
regex2 = re.compile("Goal*", re.S)
regex3 = re.compile("Assist*", re.S)
regex4 = re.compile("Red Card*", re.S)
regex5 = re.compile("Yellow Card*", re.S)
mysearch1 = re.search(regex1, incident1)
mysearch2 = re.search(regex2, incident1)
mysearch3 = re.search(regex3, incident1)
mysearch4 = re.search(regex4, incident1)
mysearch5 = re.search(regex5, incident1)
#print mystring
print "incident1 = ", incident1
if mysearch1 is not None:
    print "Man of the match = ", mysearch1.group()
if mysearch2 is not None:
    print "Goal = ", mysearch2.group()
if mysearch3 is not None:
    print "Assist = ", mysearch3.group()
if mysearch4 is not None:
    print "Red Card = ", mysearch4.group()
if mysearch5 is not None:
    print "Yellow Card = ", mysearch5.group()
This works as long as there is only one instance of every element encountered in a string, however if a player was for example to score more than one goal, this code only returns one instance of 'Goal'.
Can anyone see what I am doing wrong?
You can try something like this:
import re
s = "here's an example Man of the Match match and a Red Card match, and another Red Card match"
patterns = [
'Man of the Match',
'Goal',
'Assist',
'Yellow Card',
'Red Card',
]
repattern = '|'.join(patterns)
matches = re.findall(repattern, s, re.IGNORECASE)
print matches # ['Man of the Match', 'Red Card', 'Red Card']
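If you then need to know how many times each phrase occurred (e.g. a player scoring two goals), collections.Counter over the findall result will tally them; a sketch with a made-up incident string:

```python
import re
from collections import Counter

patterns = ['Man of the Match', 'Goal', 'Assist', 'Yellow Card', 'Red Card']
incident = 'Man of the Match Goal Assist Goal Yellow Card'

# findall returns every non-overlapping match, in order; Counter tallies them
counts = Counter(re.findall('|'.join(patterns), incident))
print(counts['Goal'])  # 2
```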
Some general overview on regex methods in python:
re.search | re.match
In your previous attempt, you tried to use re.search. This only returned one result, and as you'll see, that isn't unusual: these two functions return at most one match, and are used to check whether a string contains (or starts with) a given pattern. You'd use them for something like:
s = subprocess.check_output('ipconfig') # calls ipconfig and sends output to s
for line in s.splitlines():
    if re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", str(line)):
        # if the line contains an IP address...
        print(line)
You use re.match to specifically check if the regex matches at the BEGINNING of the string. This is usually used with a regex that matches the WHOLE string. For example:
lines = ['Adam Smith, Age: 24, Male, Favorite Thing: Reading page: 16',
'Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example']
# two Adams, but we only want the one who is 16 years old.
repattern = re.compile(r'''Adam \w+, Age: 16, (?:Male|Female), Favorite Thing: [^,]*?''')
for line in lines:
    if repattern.match(line):
        print(line)
        # Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example
        # note if we'd used re.search for Age: 16, it would have found both lines!
The takeaway is that you use these two functions to select lines in a longer document (or any iterable).
re.findall | re.finditer
It seems in this case, you aren't trying to match a line, you're trying to pull some specifically-formatted information from the string. Let's see some examples of that.
s = """Phone book:
Adam: (555)123-4567
Joe: (555)987-6543
Alice:(555)135-7924"""
pat = r'''(?:\(\d{3}\))?\d{3}-?\d{4}'''
phone_numbers = re.findall(pat, s)
print(phone_numbers)
# ['(555)123-4567','(555)987-6543','(555)135-7924']
re.finditer returns a generator instead of a list. You'd use this the same way you'd use xrange instead of range in Python2. re.findall(some_pattern, some_string) can make a GIANT list if there are a TON of matches. re.finditer will not.
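A minimal finditer sketch, reusing the phone-book idea (the string and pattern here are illustrative); note that each item is a match object, so you also get positions, not just the matched text:

```python
import re

s = "Adam: (555)123-4567, Joe: (555)987-6543"
pat = r'\(\d{3}\)\d{3}-\d{4}'

# finditer yields match objects lazily instead of building a full list
matches = list(re.finditer(pat, s))
for m in matches:
    print(m.group(), "at index", m.start())
```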
other methods: re.split | re.sub
re.split is great if you have a number of things you need to split by. Imagine you had the string:
s = '''Hello, world! It's great that you're talking to me, and everything, but I'd really rather you just split me on punctuation marks. Okay?'''
There's no great way to do that with str.split like you're used to, so instead do:
separators = [".", "!", "?", ","]
splitpattern = '|'.join(map(re.escape, separators))
# re.escape takes a string and escapes out any characters that regex considers
# special, for instance that . would otherwise be "any character"!
split_s = re.split(splitpattern, s)
print(split_s)
# ['Hello', ' world', " It's great that you're talking to me", ' and everything', " but I'd really rather you just split me on punctuation marks", ' Okay', '']
re.sub is great in cases where you know something will be formatted regularly, but you're not sure exactly how. However, you REALLY want to make sure they're all formatted the same! This will be a little advanced and use several methods, but stick with me....
dates = ['08/08/2014', '09-13-2014', '10.10.1997', '9_29_09']
separators = list()
new_sep = "/"
match_pat = re.compile(r'''
    \d{1,2}          # one or two digits
    (.)              # followed by a separator (captured)
    \d{1,2}          # one or two more digits
    \1               # a backreference to that separator
    \d{2}(?:\d{2})?  # a two-digit or four-digit year''', re.X)
for idx, date in enumerate(dates):
    match = match_pat.match(date)
    if match:
        sep = match.group(1)  # the separator
        separators.append(sep)
    else:
        dates.pop(idx)  # this isn't really a date, is it?
repl_pat = '|'.join(map(re.escape, separators))
final_dates = re.sub(repl_pat, new_sep, '\n'.join(dates))
print(final_dates)
# 08/08/2014
# 09/13/2014
# 10/10/1997
# 9/29/09
A slightly less advanced example, you can use re.sub with any sort of formatted expression and pass it a function to return! For instance:
def get_department(dept_num):
    departments = {'1': 'I.T.',
                   '2': 'Administration',
                   '3': 'Human Resources',
                   '4': 'Maintenance'}
    if hasattr(dept_num, 'group'):  # then it's a match, not a number
        dept_num = dept_num.group(0)
    return departments.get(dept_num, "Unknown Dept")
file = r"""Name,Performance Review,Department
Adam,3,1
Joe,5,2
Alice,1,3
Eve,12,4""" # this looks like a csv file
dept_names = re.sub(r'''\d+$''', get_department, file, flags=re.M)
print(dept_names)
# Name,Performance Review,Department
# Adam,3,I.T.
# Joe,5,Administration
# Alice,1,Human Resources
# Eve,12,Maintenance
Without using regex here you could do:
replaced_lines = []
departments = {'1': 'I.T.',
               '2': 'Administration',
               '3': 'Human Resources',
               '4': 'Maintenance'}
for line in file.splitlines():
    the_split_line = line.split(',')
    replaced_lines.append(','.join(the_split_line[:-1] +
                          [departments.get(the_split_line[-1], "Unknown Dept")]))
new_file = '\n'.join(replaced_lines)
# LOTS OF STRING MANIPULATION, YUCK!
Instead we replace all that for loop and string splitting, list slicing, and string manipulation with a function and a re.sub call. In fact, if you use a lambda it's even easier!
departments = {'1': 'I.T.',
'2': 'Administration',
'3': 'Human Resources',
'4': 'Maintenance'}
re.sub(r'''\d+$''', lambda x: departments.get(x.group(0), "Unknown Dept"), file, flags=re.M)
# DONE!

Python Error "TypeError: coercing to Unicode: need string or buffer, list found"

The purpose of this code is to make a program that searches a person's name (on Wikipedia, specifically) and uses keywords to come up with reasons why that person is significant.
I'm having issues with this specific line "if fact_amount < 5 and (terms in sentence.lower()):" because I get this error ("TypeError: coercing to Unicode: need string or buffer, list found").
If you could offer some guidance it would be greatly appreciated, thank you.
import requests
import nltk
import re
#You will need to install requests and nltk
terms = ['pronounced',
         'was a significant',
         'major/considerable influence',
         'one of the (X) most important',
         'major figure',
         'earliest',
         'known as',
         'father of',
         'best known for',
         'was a major']
names = ["Nelson Mandela","Bill Gates","Steve Jobs","Lebron James"]
#List of people that you need to get info from
for name in names:
    print name
    print '==============='
    #Goes to the wikipedia page of the person
    r = requests.get('http://en.wikipedia.org/wiki/%s' % (name))
    #Parses the raw html into text
    raw = nltk.clean_html(r.text)
    #Tries to split each sentence.
    #sort of buggy though
    #For example St. Mary will split after St.
    sentences = re.split('[?!.][\s]*', raw)
    fact_amount = 0
    for sentence in sentences:
        #I noticed that important things came after 'he was' and 'she was'
        #Seems to work for my sample list
        #Also there may be buggy sentences, so I return 5 instead of 3
        if fact_amount < 5 and (terms in sentence.lower()):
            #remove the reference notation that wikipedia has
            #ex [ 33 ]
            sentence = re.sub('[ [0-9]+ ]', '', sentence)
            #removes newlines
            sentence = re.sub('\n', '', sentence)
            #removes trailing and leading whitespace
            sentence = sentence.strip()
            fact_amount += 1
            #sentence is formatted. Print it out
            print sentence + '.'
    print
You should be checking it the other way
sentence.lower() in terms
terms is a list and sentence.lower() is a string. You can check whether a particular string is in a list, but you cannot check whether a list is in a string.
You probably mean if any(t in sentence.lower() for t in terms), which checks whether any term from the terms list appears in the sentence string.
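That any(...) check can be sketched as a small helper (the terms here are a subset of the question's list, and the sample sentences are made up):

```python
terms = ['known as', 'father of', 'best known for']

def mentions_term(sentence):
    # True if any keyword occurs in the lowercased sentence
    return any(t in sentence.lower() for t in terms)

print(mentions_term('He is Best Known For his work on X'))  # True
print(mentions_term('Nothing relevant here'))               # False
```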

Converting a text file to a list

I have the following text file:
"""[' Hoffa remains Allen Iverson Bill Cosby WWE Payback results Juneteenth shooting Miss Utah flub Octopus pants Magna Carta Holy Grail China supercomputer Sibling bullying ']"""
I would like to create a list from it and apply a function to each name
this is my code so far:
listing = open(fileName, 'r')
lines = listing.read().split(',')
for line in lines:
#Function
First strip out the characters """['] from the start and end of the string using str.strip, then split the resulting string on runs of six spaces (' '*6). Splitting returns a list, but some items still have trailing and leading whitespace; you can remove it using str.strip again.
with open(fileName) as f:
lis = [x.strip() for x in f.read().strip('\'"[]').split(' '*6)]
print lis
...
['Hoffa remains', 'Allen Iverson', 'Bill Cosby', 'WWE Payback results', 'Juneteenth shooting', 'Miss Utah flub', 'Octopus pants', 'Magna Carta Holy Grail', 'China supercomputer', 'Sibling bullying']
Applying function to the above list:
List comprehension:
[func(x) for x in lis]
map:
map(func, lis)
I would first refer you to some other similar posts.
Also note that you can't split on a comma here: there is no comma between the items you want to separate. str.split splits the string into substrings on whatever delimiter you give it, such as a comma ','.
