Strip whitespace and newlines when reading from a file - Python

I have the following code, which successfully strips end-of-line characters when reading from a file, but doesn't remove leading and trailing whitespace (I want the spaces in between to be left alone!).
What is the best way to achieve this? (Note: this is a specific example, so it is not a duplicate of the general questions about stripping strings.)
My code is below. Try it with the test data: "Mr Moose" is not found, but "Mr Moose " (that is, with a space after Moose) works.
#A COMMON ERROR is leaving in blank spaces and then finding you cannot work with the data in the way you want!
"""Try the following program with the input: Mr Moose
...it doesn't work...
but if you try "Mr Moose " (that is, a space after Moose), it will work!
So how do I remove both newlines AND leading/trailing spaces when reading from a file into a list?
Note: the spaces between words must remain.
"""
alldata=[]
col_num=0
teacher_names=[]
delimiter=":"
with open("teacherbook.txt") as f:
for line in f.readlines():
alldata.append((line.strip()))
print(alldata)
print()
print()
for x in alldata:
teacher_names.append(x.split(delimiter)[col_num])
teacher=input("Enter teacher you are looking for:")
if teacher in teacher_names:
print("found")
else:
print("No")
Desired output when printing the list alldata:
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
i.e. remove all leading and trailing whitespace, including before or after the delimiter. The spaces between words such as "Mr Moose" must be left.
Contents of teacherbook.txt:
Mr Moose : Maths
Mr Goose: History
Mrs Congenelipilling: English
Thanks in advance

You could use a regex:
txt='''\
Mr Moose : Maths
Mr Goose: History
Mrs Congenelipilling: English'''
>>> [re.sub(r'\s*:\s*', ':', line).strip() for line in txt.splitlines()]
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
So your code becomes:
import re
col_num=0
teacher_names=[]
delimiter=":"
with open("teacherbook.txt") as f:
alldata=[re.sub(r'\s*{}\s*'.format(delimiter), delimiter, line).rstrip() for line in f]
print(alldata)
for x in alldata:
teacher_names.append(x.split(delimiter)[col_num])
print(teacher_names)
Prints:
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
['Mr Moose', 'Mr Goose', 'Mrs Congenelipilling']
The key part is the regex:
re.sub(r'\s*{}\s*'.format(delimiter), delimiter, line).rstrip()
The first \s* matches zero or more whitespace characters before the delimiter.
The {} is where the delimiter is substituted into the pattern.
The second \s* matches zero or more whitespace characters after the delimiter.
The final .rstrip() removes the trailing newline (and any trailing spaces).
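To see the two stages in isolation on a single line from the sample file:
>>> import re
>>> line = 'Mr Goose: History\n'
>>> re.sub(r'\s*:\s*', ':', line)
'Mr Goose:History\n'
>>> re.sub(r'\s*:\s*', ':', line).rstrip()
'Mr Goose:History'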
For an all-Python solution, I would use str.partition to get the left-hand and right-hand sides of the delimiter, then strip the whitespace as needed:
alldata=[]
with open("teacherbook.txt") as f:
for line in f:
lh,sep,rh=line.rstrip().partition(delimiter)
alldata.append(lh.rstrip() + sep + rh.lstrip())
Same output
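For illustration, this is what partition returns for one line of the sample file, before and after the stripping:
>>> line = 'Mr Moose : Maths\n'
>>> lh, sep, rh = line.rstrip().partition(':')
>>> (lh, sep, rh)
('Mr Moose ', ':', ' Maths')
>>> lh.rstrip() + sep + rh.lstrip()
'Mr Moose:Maths'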
Another suggestion. Your data is more suited to a dict than a list.
You can do:
di={}
with open("teacherbook.txt") as f:
for line in f:
lh,sep,rh=line.rstrip().partition(delimiter)
di[lh.rstrip()]=rh.lstrip()
Or comprehension version:
with open("teacherbook.txt") as f:
    di={lh.rstrip():rh.lstrip()
        for lh,_,rh in (line.rstrip().partition(delimiter) for line in f)}
Then access like this:
>>> di['Mr Moose']
'Maths'

No need to use readlines(); you can simply iterate over the file object to get each line, and use strip() to remove the \n and the surrounding whitespace. So you can use this list comprehension:
with open('teacherbook.txt') as f:
    alldata = [':'.join([value.strip() for value in line.split(':')])
               for line in f]
print(alldata)
Outputs:
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']

Change:
teacher_names.append(x.split(delimiter)[col_num])
to:
teacher_names.append(x.split(delimiter)[col_num].strip())
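For context, a minimal sketch of how that change sits in the question's code (stripping the input as well is an extra precaution here, not part of the change above):
for x in alldata:
    # strip() removes the spaces left around the name after splitting on ':'
    teacher_names.append(x.split(delimiter)[col_num].strip())

# also strip the user's input to guard against stray spaces typed at the prompt
teacher = input("Enter teacher you are looking for:").strip()
if teacher in teacher_names:
    print("found")
else:
    print("No")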

remove all leading and trailing white space at the start, and before or after the delimiter. The spaces in between words such as Mr Moose, must be left.
You can split your string at the delimiter, strip the whitespace from them, and concatenate them back together again:
for line in f.readlines():
    new_line = ':'.join([s.strip() for s in line.split(':')])
    alldata.append(new_line)
Example:
>>> lines = [' Mr Moose : Maths', ' Mr Goose : History ']
>>> lines
[' Mr Moose : Maths', ' Mr Goose : History ']
>>> data = []
>>> for line in lines:
...     new_line = ':'.join([s.strip() for s in line.split(':')])
...     data.append(new_line)
>>> data
['Mr Moose:Maths', 'Mr Goose:History']

You can do it easily with regex - re.sub:
import re
re.sub(r"[\n \t]+$", "", "aaa \t asd \n ")
Out[17]: 'aaa \t asd'
The first argument is the pattern: [...] lists all the characters you want to remove, + means one or more matches, and $ anchors the match to the end of the string.
https://docs.python.org/2/library/re.html
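As a rough sketch, the same pattern could be applied line by line to the question's file; note that it only removes trailing whitespace, so leading spaces and spaces around the delimiter still need separate handling:
import re

# strip trailing spaces, tabs and newlines from every line of the file
with open("teacherbook.txt") as f:
    cleaned = [re.sub(r"[\n \t]+$", "", line) for line in f]
print(cleaned)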

With str.rstrip('chars') you can remove any of those characters from the right end of the string, like this:
a = 'Mr Moose \n'
print a.rstrip(' \n') # prints 'Mr Moose' - both the trailing space and the newline are removed
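Note that rstrip with an argument removes any run of the listed characters from the right, not the exact substring; a quick check (Python 3 print syntax):
a = 'Mr Moose \n'
print(repr(a.rstrip(' \n')))  # 'Mr Moose' - space and newline both removed
print(repr(a.rstrip()))       # 'Mr Moose' - with no argument, all trailing whitespace goes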

Related

How to add a missing closing parenthesis to a string in Python?

I have multiple strings to postprocess, where a lot of the acronyms have a missing closing bracket. Assume the string text below, but also assume that this type of missing bracket happens often.
My code below only adds the closing bracket to the matched acronym on its own, not within the full string/sentence. Any tips on how to do this efficiently, preferably without needing to iterate?
import re
#original string
text = "The dog walked (ABC in the park"
#Desired output:
desired_output = "The dog walked (ABC) in the park"
#My code:
acronyms = re.findall(r'\([A-Z]*\)?', text)
for acronym in acronyms:
    if ')' not in acronym:  # find those without a closing bracket ')'
        print(acronym + ')')  # add the closing bracket ')'
#current output:
>>'(ABC)'
You may use
text = re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text)
With this approach, you can also get rid of the check for whether the text already contains a ); see a demo on regex101.com.
In full:
import re
#original string
text = "The dog walked (ABC in the park"
text = re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text)
print(text)
This yields
The dog walked (ABC) in the park
See a working demo on ideone.com.
For the typical example you have provided, I don't see the need to use regex.
You can just use some string methods:
text = "The dog walked (ABC in the park"
withoutClosing = [word for word in text.split() if word.startswith('(') and not word.endswith(')') ]
withoutClosing
Out[45]: ['(ABC']
Now you have the words without closing parenthesis, you can just replace them:
for eachWord in withoutClosing:
    text = text.replace(eachWord, eachWord+')')
text
Out[46]: 'The dog walked (ABC) in the park'

Extract a string after a text with regex in Python

I have a doc file that has the following structure:
This is a fairy tale written by
John Doe and Mary Smith
Auckland,somewhere
This story is awesome
I would like to extract the two lines of text which are:
John Doe and Mary Smith
Auckland,somewhere
and append those values into a list by using regex. The two lines that I want to extract are always between the lines This is a fairy tale written by and This story is awesome. How can I do that? I have tried some combinations with before_keyword,keyword,after_keyword=text.partition(regex), but no luck at all.
You can use a regex with re.DOTALL that enables . to match any character including newlines. Once you have the text between the two delimiters, you can use another regex without the re.DOTALL to extract lines that contain at least one non-whitespace character (\S).
import re
lst = []
with open('input.txt') as f:
    text = f.read()
match = re.search('This is a fairy tale written by(.*?)This story is awesome',
                  text, re.DOTALL)
if match:
    lst.extend(re.findall(r'.*\S.*', match.group(1)))
print(lst)
Gives:
[' John Doe and Mary Smith', ' Auckland,somewhere']
You may start with this:
re.search(r'(?<=This is a fairy tale written by\n).*?(?=\n\s*This story is awesome)', s, re.MULTILINE|re.DOTALL).group(0)
and fine-tune this regex. re.MULTILINE may be omitted as you do not have ^ or $ anyway, but re.DOTALL is required to let . match newlines as well. The regex above uses a lookbehind (?<=) and a lookahead (?=). If you do not like that, you can use parentheses for captures instead.
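For reference, a runnable sketch of that lookaround approach on the sample text from the question:
import re

text = """This is a fairy tale written by
John Doe and Mary Smith
Auckland,somewhere
This story is awesome"""

m = re.search(r'(?<=This is a fairy tale written by\n).*?(?=\n\s*This story is awesome)',
              text, re.DOTALL)
if m:
    print(m.group(0).splitlines())
    # ['John Doe and Mary Smith', 'Auckland,somewhere']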
If you can create a list of strings from your doc file, then there is no need to use a regex. Just use this simple program:
fileContent = ['This is a fairy tale written by','John Doe and Mary Smith','Auckland,somewhere','This story is awesome',
'Some other things', 'story texts', 'Not Important data',
'This is a fairy tale written by','Kem Cho?','Majama?','This story is awesome', 'Not important data']
authorsList = []
for i in range(len(fileContent)-3):
    if fileContent[i] == 'This is a fairy tale written by' and fileContent[i+3] == 'This story is awesome':
        authorsList.append([fileContent[i+1], fileContent[i+2]])
print(authorsList)
Here I simply check for 'This is a fairy tale written by' and 'This story is awesome', and if both are found, append the text between them to the list.
Output:
[['John Doe and Mary Smith', 'Auckland,somewhere'], ['Kem Cho?', 'Majama?']]
Try using this instead. With re.DOTALL it matches anything between those two strings, including across newlines:
re.search(r'(?<=This is a fairy tale).*?(?=This story is awesome)', text, re.DOTALL)

Iterate and match all elements with regex

So I have something like this:
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
I want to replace every element with the first name so it would look like this:
data = ['Alice Smith', 'Tim', 'Uncle Neo']
So far I got:
for i in range(len(data)):
    if re.match('(.*) and|with|\&', data[i]):
        a = re.match('(.*) and|with|\&', data[i])
        data[i] = a.group(1)
But it doesn't seem to work. I think it's because of my pattern, but I can't figure out the right way to do this.
Use a list comprehension with re.split:
result = [re.split(r' (?:and|with|&) ', x)[0] for x in data]
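Dropped into a runnable snippet with the sample data from the question:
import re

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
# split each string on the first ' and ', ' with ', or ' & ' and keep the left-hand part
result = [re.split(r' (?:and|with|&) ', x)[0] for x in data]
print(result)  # ['Alice Smith', 'Tim', 'Uncle Neo']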
The | needs grouping with parentheses in your attempt; as written, the alternation tries to match (.*) and, or with, or \& against the whole string, rather than (.*) followed by any of the three separators. Anyway, it's too complex.
I would just use re.sub to remove the separation word & the rest:
data = [re.sub(" (and|with|&) .*","",d) for d in data]
result:
['Alice Smith', 'Tim', 'Uncle Neo']
You can try this:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
final_data = [re.sub('\sand.*?$|\s&.*?$|\swith.*?$', '', i) for i in data]
Output:
['Alice Smith', 'Tim', 'Uncle Neo']
Simplify your approach to the following:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
data = [re.search(r'.*(?= (and|with|&))', i).group() for i in data]
print(data)
The output:
['Alice Smith', 'Tim', 'Uncle Neo']
.*(?= (and|with|&)) - positive lookahead assertion, ensures that name/surname .* is followed by any item from the alternation group (and|with|&)
Brief
I would suggest using Casimir's answer if possible, but, if you are not sure what word might follow (that is to say that and, with, and & are dynamic), then you can use this regex.
Note: This regex will not work for some special cases such as names with apostrophes ' or dashes -, but you can add them to the character list that you're searching for. This answer also depends on the name beginning with an uppercase character and the "union word" as I'll name it (and, with, &, etc.) not beginning with an uppercase character.
Code
See this regex in use here
Regex
^((?:[A-Z][a-z]*\s*)+)\s.*
Substitution
$1
Result
Input
Alice Smith and Bob
Tim with Sam Dunken
Uncle Neo & 31
Output
Alice Smith
Tim
Uncle Neo
Explanation
Assert position at the beginning of the string ^
Match a capital alpha character [A-Z]
Match any number of lowercase alpha characters [a-z]*
Match any number of whitespace characters \s* (you can use a literal space followed by * instead if you only want spaces)
Match the above conditions between one and unlimited times, all captured into capture group 1 (...)+: where ... contains everything above
Match a whitespace character, followed by any character (except newline) any number of times \s.*
$1: Replace with capture group 1
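A minimal Python sketch of the same substitution (Python's re.sub uses \1 in the replacement where the demo above shows $1):
import re

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
# keep only the leading run of capitalised words (capture group 1)
result = [re.sub(r'^((?:[A-Z][a-z]*\s*)+)\s.*', r'\1', x) for x in data]
print(result)  # ['Alice Smith', 'Tim', 'Uncle Neo']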

String comparison in python words ending with

I have a set of words as follows:
['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
In the above sentences I need to identify all sentences ending with ?, ., or 'gy', and print the final word.
My approach is as follows:
# words will contain the list I have pasted above.
word = [w for w in words if re.search('(?|.|gy)$', w)]
for i in word:
    print i
The result I get is:
Hey, how are you?
My name is Mathews.
I hate vegetables
French fries came out soggy
The expected result is:
you?
Mathews.
soggy
Use the endswith() method.
>>> for line in testList:
...     for word in line.split():
...         if word.endswith(('?', '.', 'gy')):
...             print word
Output:
you?
Mathews.
soggy
Use endswith with a tuple.
lines = ['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
for line in lines:
for word in line.split():
if word.endswith(('?', '.', 'gy')):
print word
Regular expression alternative:
import re
lines = ['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
for line in lines:
    for word in re.findall(r'\w+(?:\?|\.|gy\b)', line):
        print word
You were close.
You just need to escape the special characters (? and .) in the pattern:
re.search(r'(\?|\.|gy)$', w)
More details in the documentation.
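Putting that together with the expected output from the question (extracting the final word with split() is an addition here, not part of the fix above):
import re

words = ['Hey, how are you?\n', 'My name is Mathews.\n',
         'I hate vegetables\n', 'French fries came out soggy\n']

for w in words:
    # with ? and . escaped, only the lines that really end that way match
    if re.search(r'(\?|\.|gy)$', w.rstrip()):
        print(w.split()[-1])
# you?
# Mathews.
# soggy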

Converting a text file to a list

I have the following text file:
"""[' Hoffa remains Allen Iverson Bill Cosby WWE Payback results Juneteenth shooting Miss Utah flub Octopus pants Magna Carta Holy Grail China supercomputer Sibling bullying ']"""
I would like to create a list from it and apply a function to each name
This is my code so far:
listing = open(fileName, 'r')
lines = listing.read().split(',')
for line in lines:
    #Function
First strip out the characters " ' [ ] from the start and end of the string using str.strip, then split the resulting string on six spaces (' '*6). Splitting returns a list, but some items still have leading and trailing whitespace; you can remove that using str.strip again.
with open(fileName) as f:
    lis = [x.strip() for x in f.read().strip('\'"[]').split(' '*6)]
print lis
...
['Hoffa remains', 'Allen Iverson', 'Bill Cosby', 'WWE Payback results', 'Juneteenth shooting', 'Miss Utah flub', 'Octopus pants', 'Magna Carta Holy Grail', 'China supercomputer', 'Sibling bullying']
Applying a function to the above list:
List comprehension:
[func(x) for x in lis]
map:
map(func, lis)
I would first refer you to some other similar posts.
Also, you can't split on a comma here: there is no comma between the pieces of data you want to separate. str.split breaks the string into substrings at the delimiter you give it, in this case a comma ','.
