Most elegant way to get rid of white space and new lines? - python

I'm writing a script to parse specific data from each outlook email.
I wrote something to strip out all carriage returns, newlines, and whitespace from my string before parsing it, but it's very ugly. Any ideas on making it more elegant?
messageStr = messageStr.replace("\r","")
messageStr = messageStr.split('\n')
messageStr = [i for i in messageStr if i != '']
messageStr = [i for i in messageStr if i != ' ']

The .strip method of strings removes leading and trailing whitespace. If you want to get rid of the carriage returns on each line, along with any other leading/trailing whitespace, you could do this:
lines = [line.strip() for line in message.split('\n')]
If you want to remove all whitespace, not just leading/trailing, you could do something similar against a string containing all whitespace you want to filter. The string module has a helper for this. The following would remove all whitespace from a string s:
import string
filtered_string = ''.join(char for char in s if char not in string.whitespace)
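For the question's use case, the two ideas above could be combined roughly as follows. This is only a sketch: the function name clean_lines and the sample message are made up for illustration, and it assumes blank lines should simply be dropped.

```python
def clean_lines(message):
    """Return the non-empty lines of message with surrounding whitespace removed."""
    # splitlines() splits on \n, \r\n and \r, so no separate replace("\r", "") is needed
    stripped = (line.strip() for line in message.splitlines())
    # an empty string is falsy, so this drops blank and whitespace-only lines
    return [line for line in stripped if line]


print(clean_lines("From: a@example.com\r\n\r\n   \r\nSubject: hi  \n"))
# ['From: a@example.com', 'Subject: hi']
```

Using splitlines() instead of split('\n') also avoids a trailing empty string when the message ends with a newline.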

This is a data-cleaning task. Here is my approach:
Calling split() with no arguments splits on any run of whitespace (spaces, '\n', '\r', tabs) and never returns empty strings, so joining the pieces back together with a single space removes all the unwanted symbols.
dummy_string = 'Hello this is \n example \r to remove  the special symbols '
print(" ".join(dummy_string.split()))
output:
Hello this is example to remove the special symbols
P.S.: you don't need to list '\r' or '\n' explicitly, because split() with no arguments already removes them.

Related

How to remove whitespaces in individual list elements in nested lists from a text file?

I'm really struggling with this one, any help would be appreciated. I'm trying to get rid of whitespace around individual list elements. I have tried putting strip() in various places but I can't figure out how to get it to work properly:
DATAFILE = "unit-patterns.txt"
with open (DATAFILE, "r") as file:
    pattern_strings = []
    pattern_lists1 = []
    pattern_lists = []
    ex_whtspc_lst = []
    for line in file:
        pattern_strings.append(line.split("\t"))
    for index in pattern_strings:
        pattern_lists1.append(index[0])
    for ele in pattern_lists1:
        pattern_lists.append(ele.split("+"))
    print(str(pattern_lists[0:5]).strip())
I get whitespaces around each list element like this:
"[-CITS2224-1-]" - dashes to indicate whitespaces
The problem you are facing is that the .strip() method "returns a copy of the string with the leading and trailing characters removed".
In your example, in the [ CITS2224 1 ] string, the leading character is [ and the trailing character is ]. This causes a regular .strip() call to return the string unmodified.
If you pass the [] characters to the .strip() call, as suggested by @Tranbi, it will work correctly. Why? This makes Python strip any leading and trailing [ or ] characters until a different character is encountered.
Keep in mind that .strip("[] ") won't strip brackets or whitespace from inside the string. For example, from [ foo [] bar ] you will get foo [] bar. Again, this is because .strip() considers only the characters passed in the argument (or whitespace if no argument is given). As soon as it encounters a character outside that set, it stops.
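The behavior described above can be checked with a short sketch. The sample strings are taken from the discussion; the variable names are only illustrative.

```python
s = "[ CITS2224 1 ]"

# No argument: only whitespace is stripped, and the outermost characters are brackets
unchanged = s.strip()

# Brackets and spaces are stripped from both ends; the inner space is kept
cleaned = s.strip("[] ")

# Characters in the middle of the string are never touched
inner_kept = "[ foo [] bar ]".strip("[] ")

print(repr(unchanged))   # '[ CITS2224 1 ]'
print(repr(cleaned))     # 'CITS2224 1'
print(repr(inner_kept))  # 'foo [] bar'
```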

How to remove a character from a string within a list PYTHON

I currently have the following code but I am struggling to figure out the code that will remove the punctuation from the strings within the list. Please let me know if you know what I should input.
"""
This function checks to see if the last character is either punctuation or a
"\n" character and removes these unwanted characters.
Question: 1
"""
import string
result = string.punctuation
def myFun(listA):
    A = listA
    for element in A:
        for i in element:
            if i in result:
                #what do i put here
    print(A)
myFun(['str&', 'cat', 'dog\n', 'myStr.'])
In Python, a string is a sequence of characters, so you can access the last character with s[-1]. Note that "\n" is a single character, not two, so a trailing newline is also found with s[-1]. (Avoid naming the variable str, since that shadows the built-in type.)
First, check the last character for punctuation:
if s[-1] in string.punctuation
and remove it with s = s[:-1]
Then check for a trailing newline the same way:
if s[-1] == "\n"
and remove it with s = s[:-1]
Notes: After each check, remember to add a continue to move on to the next element; otherwise both checks run on the same string, and a string that ends with both a newline and punctuation, such as "test\n,", becomes "test".
p.s. I intentionally didn't put them in your code and left that part for you to do.
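For readers who want something to compare their own attempt against, here is one possible sketch of those checks. The function name strip_trailing is hypothetical, and returning a new list is an assumption: strings are immutable, so the cleaned values have to be stored somewhere rather than changed in place.

```python
import string


def strip_trailing(items):
    """Remove one trailing punctuation mark or newline from each string."""
    cleaned = []
    for s in items:
        # s[-1:] is '' for an empty string, where s[-1] would raise IndexError
        last = s[-1:]
        # the "last and" guard matters: '' in string.punctuation is True
        if last and (last in string.punctuation or last == "\n"):
            s = s[:-1]
        cleaned.append(s)
    return cleaned


print(strip_trailing(['str&', 'cat', 'dog\n', 'myStr.']))
# ['str', 'cat', 'dog', 'myStr']
```

Because each string goes through a single if, only one character is ever removed, which matches the effect of the continue described above.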

how to ignore punctuation when counting characters in string in python

My homework has a question about writing a function words_of_length(N, s) that picks the unique words of a certain length from a string, ignoring punctuation.
What I am trying to do is:
def words_of_length(N, s): #N as integer, s as string
    #this line i should remove punctuation inside the string but i don't know how to do it
    return [x for x in s if len(x) == N] #this line should return a list of unique words with certain length.
So my problem is that I don't know how to remove punctuation. I did look at "best way to remove punctuation from string" and related questions, but those look too difficult for my level, and my teacher requires that the function contain no more than 2 lines of code.
Sorry that I can't format the code in my question properly; it's my first time asking a question here and there is much I need to learn, but please help me with this one. Thanks.
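Use str.strip([chars]) (string.strip(s[, chars]) in the Python 2 string module)
https://docs.python.org/2/library/string.html
In your function, replace x with x.strip('.,:;!?'), and iterate over s.split() instead of s so that x is a word rather than a single character. Note that chars must be a string, not a list, and that strip only removes punctuation at the start and end of each word.
Add more punctuation if needed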
First of all, you need to create a new string without characters you want to ignore (take a look at string library, particularly string.punctuation), and then split() the resulting string (sentence) into substrings (words). Besides that, I suggest using type annotation, instead of comments like those.
def words_of_length(n: int, s: str) -> list:
    return [x for x in ''.join(char for char in s if char not in __import__('string').punctuation).split() if len(x) == n]
>>> words_of_length(3, 'Guido? van, rossum. is the best!')
['van', 'the']
Alternatively, instead of string.punctuation you can define a variable with the characters you want to ignore yourself.
You can remove punctuation by using string.punctuation.
>>> from string import punctuation
>>> text = "text,. has ;:some punctuation."
>>> text = ''.join(ch for ch in text if ch not in punctuation)
>>> text # with no punctuation
'text has some punctuation'
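Neither snippet above removes duplicate words, which the question also asks for. A possible sketch that does both, assuming that sorting the result is acceptable (sorted() is used here only to make the order predictable) and using str.translate instead of a character-by-character join:

```python
import string


def words_of_length(n, s):
    # str.maketrans('', '', chars) builds a table that deletes every character in chars
    words = s.translate(str.maketrans('', '', string.punctuation)).split()
    # the set comprehension removes duplicates
    return sorted({w for w in words if len(w) == n})


print(words_of_length(3, 'The cat, the dog; the cat!'))
# ['The', 'cat', 'dog', 'the']
```

The comparison is case-sensitive, so 'The' and 'the' count as different words; lowercase s first if that is not wanted.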

Python regular expression to remove space and capitalize letters where the space was?

I want to create a list of tags from a single user-supplied input box, separated by commas, and I'm looking for some expression(s) that can help automate this.
What I want is to take the input field and:
remove all double+ whitespace, tabs, and new lines (leaving just single spaces)
remove ALL (single and double+) quotation marks, except for commas, of which there can be only one
in between each comma, I want Something Like Title Case, but excluding the first word and not at all for single words, so that when the last spaces are removed, the tag comes out as 'somethingLikeTitleCase' or just 'something' or 'twoWords'
and finally, remove all remaining spaces
Here's what I have gathered around SO so far:
import re
def no_whitespace(s):
    """Remove all whitespace & newlines. """
    return re.sub(r"(?m)\s+", "", s)
# remove spaces, newlines, all whitespace
# http://stackoverflow.com/a/42597/523051
tag_list = ''.join(no_whitespace(tags_input))
# split into a list at commas
tag_list = tag_list.split(',')
# remove any empty strings (since I currently don't know how to remove double commas)
# http://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings
tag_list = filter(None, tag_list)
I'm lost, though, when it comes to modifying that regex to remove all the punctuation except commas, and I don't even know where to begin with the capitalizing.
Any thoughts to get me going in the right direction?
As suggested, here are some sample inputs = desired_outputs
form: 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps' should come out as
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
Here's an approach to the problem (that doesn't use any regular expressions, although there's one place where it could). We split up the problem into two functions: one function which splits a string into comma-separated pieces and handles each piece (parseTags), and one function which takes a string and processes it into a valid tag (sanitizeTag). The annotated code is as follows:
# This function takes a string with commas separating raw user input, and
# returns a list of valid tags made by sanitizing the strings between the
# commas.
def parseTags(str):
    # First, we split the string on commas.
    rawTags = str.split(',')
    # Then, we sanitize each of the tags. If sanitizing gives us back None,
    # then the tag was invalid, so we leave those cases out of our final
    # list of tags. We can use None as the predicate because sanitizeTag
    # will never return '', which is the only falsy string.
    # (list() is needed on Python 3, where filter returns an iterator.)
    return list(filter(None, map(sanitizeTag, rawTags)))
# This function takes a single proto-tag---the string in between the commas
# that will be turned into a valid tag---and sanitizes it. It either
# returns an alphanumeric string (if the argument can be made into a valid
# tag) or None (if the argument cannot be made into a valid tag; i.e., if
# the argument contains only whitespace and/or punctuation).
def sanitizeTag(str):
    # First, we turn non-alphanumeric characters into whitespace. You could
    # also use a regular expression here; see below.
    str = ''.join(c if c.isalnum() else ' ' for c in str)
    # Next, we split the string on spaces, ignoring leading and trailing
    # whitespace.
    words = str.split()
    # There are now three possibilities: there are no words, there was one
    # word, or there were multiple words.
    numWords = len(words)
    if numWords == 0:
        # If there were no words, the string contained only spaces (and/or
        # punctuation). This can't be made into a valid tag, so we return
        # None.
        return None
    elif numWords == 1:
        # If there was only one word, that word is the tag, no
        # post-processing required.
        return words[0]
    else:
        # Finally, if there were multiple words, we camel-case the string:
        # we lowercase the first word, capitalize the first letter of all
        # the other words and lowercase the rest, and finally stick all
        # these words together without spaces.
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
And indeed, if we run this code, we get:
>>> parseTags("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
There are two points in this code that are worth clarifying. First is the use of str.split() in sanitizeTag. This will turn ' a b c ' into ['a','b','c'], whereas str.split(' ') would produce ['','a','b','c','']. This is almost certainly the behavior you want, but there's one corner case. Consider the string tAG$. The $ gets turned into a space, and is stripped out by the split; thus, this gets turned into tAG instead of tag. This is probably what you want, but if it isn't, you have to be careful. What I would do is change that line to words = re.split(r'\s+', str), which will split the string on whitespace but leave in the leading and trailing empty strings; however, I would also change parseTags to use rawTags = re.split(r'\s*,\s*', str). You must make both these changes; 'a , b , c'.split(',') becomes ['a ', ' b ', ' c'], which is not the behavior you want, whereas r'\s*,\s*' deletes the space around the commas too. If you ignore leading and trailing white space, the difference is immaterial; but if you don't, then you need to be careful.
Finally, there's the non-use of regular expressions, and instead the use of str = ''.join(c if c.isalnum() else ' ' for c in str). You can, if you want, replace this with a regular expression. (Edit: I removed some inaccuracies about Unicode and regular expressions here.) Ignoring Unicode, you could replace this line with
str = re.sub(r'[^A-Za-z0-9]', ' ', str)
This uses [^...] to match everything but the listed characters: ASCII letters and numbers. However, it's better to support Unicode, and it's easy, too. The simplest such approach is
str = re.sub(r'\W', ' ', str, flags=re.UNICODE)
Here, \W matches non-word characters; a word character is a letter, a number, or the underscore. With flags=re.UNICODE specified (not available before Python 2.7; you can instead use r'(?u)\W' for earlier versions and 2.7), letters and numbers are both any appropriate Unicode characters; without it, they're just ASCII. If you don't want the underscore, you can add |_ to the regex to match underscores as well, replacing them with spaces too:
str = re.sub(r'\W|_', ' ', str, flags=re.UNICODE)
This last one, I believe, matches the behavior of my non-regex-using code exactly.
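The differences between these substitutions can be seen on a small sample. This is only a sketch, and it assumes Python 3, where patterns applied to str are Unicode-aware by default, so flags=re.UNICODE can be omitted; the sample string is made up.

```python
import re

# "\u00ef" is the single character i-with-diaeresis, written as an escape
sample = "snake_case na\u00efve tag!"

ascii_only = re.sub(r'[^A-Za-z0-9]', ' ', sample)   # non-ASCII letter and _ become spaces
word_chars = re.sub(r'\W', ' ', sample)             # keeps the underscore
no_underscore = re.sub(r'\W|_', ' ', sample)        # underscore becomes a space too
isalnum_based = ''.join(c if c.isalnum() else ' ' for c in sample)

print(ascii_only.split())                           # ['snake', 'case', 'na', 've', 'tag']
print('_' in word_chars, '_' in no_underscore)      # True False
print(no_underscore == isalnum_based)               # True
```

On this sample the last regex and the isalnum() version agree, which is consistent with the claim above, though one sample is of course not a proof.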
Also, here's how I'd write the same code without those comments; this also allows me to eliminate some temporary variables. You might prefer the code with the variables present; it's just a matter of taste.
def parseTags(str):
    # list() is needed on Python 3, where filter returns an iterator
    return list(filter(None, map(sanitizeTag, str.split(','))))
def sanitizeTag(str):
    words = ''.join(c if c.isalnum() else ' ' for c in str).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
To handle the newly-desired behavior, there are two things we have to do. First, we need a way to fix the capitalization of the first word: lowercase the whole thing if the first letter's lowercase, and lowercase everything but the first letter if the first letter's upper case. That's easy: we can just check directly. Secondly, we want to treat punctuation as completely invisible: it shouldn't uppercase the following words. Again, that's easy—I even discuss how to handle something similar above. We just filter out all the non-alphanumeric, non-whitespace characters rather than turning them into spaces. Incorporating those changes gives us
def parseTags(str):
    # list() is needed on Python 3, where filter returns an iterator
    return list(filter(None, map(sanitizeTag, str.split(','))))
def sanitizeTag(str):
    # ''.join() is needed on Python 3, where filter on a string no longer returns a string
    words = ''.join(filter(lambda c: c.isalnum() or c.isspace(), str)).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        words0 = words[0].lower() if words[0][0].islower() else words[0].capitalize()
        return words0 + ''.join(w.capitalize() for w in words[1:])
Running this code gives us the following output
>>> parseTags("tHiS iS a tAg, AnD tHIs, \t\n!&#^ , se#%condcomment$ , No!pUnc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'AndThis', 'secondcomment', 'NopUnc', 'ifNOSPACESthenPRESERVEcaps']
You could use a whitelist of characters allowed to be in a word; everything else is ignored:
import re
def camelCase(tag_str):
    words = re.findall(r'\w+', tag_str)
    nwords = len(words)
    if nwords == 1:
        return words[0] # leave unchanged
    elif nwords > 1: # make it camelCaseTag
        return words[0].lower() + ''.join(map(str.title, words[1:]))
    return '' # no word characters
This example uses \w word characters.
Example
tags_str = """ 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$,
ifNOSPACESthenPRESERVEcaps' """
print("\n".join(filter(None, map(camelCase, tags_str.split(',')))))
Output
thisIsATag
whitespace
secondcomment
noPunc
ifNOSPACESthenPRESERVEcaps
I think this should work
import re
def toCamelCase(s):
    # remove all punctuation
    # modify to include other characters you may want to keep
    s = re.sub(r"[^a-zA-Z0-9\s]", "", s)
    # remove leading spaces
    s = re.sub(r"^\s+", "", s)
    # camel case
    s = re.sub(r"\s[a-z]", lambda m: m.group(0)[1].upper(), s)
    # remove all punctuation and spaces
    s = re.sub(r"[^a-zA-Z0-9]", "", s)
    return s
tag_list = [s for s in (toCamelCase(s.lower()) for s in tag_list.split(',')) if s]
The key here is to make use of re.sub to make the replacements you want. The patterns are written as raw strings so that \s is passed to the regex engine unchanged.
EDIT : Doesn't preserve caps, but does handle uppercase strings with spaces
EDIT : Moved "if s" after the toCamelCase call

Replace the single quote (') character from a string

I need to strip the character "'" from a string in python. How do I do this?
I know there is a simple answer. Really what I am looking for is how to write ' in my code. for example \n = newline.
As for how to represent a single apostrophe as a string in Python, you can simply surround it with double quotes ("'") or you can escape it inside single quotes ('\'').
To remove apostrophes from a string, a simple approach is to just replace the apostrophe character with an empty string:
>>> "didn't".replace("'", "")
'didnt'
Here are a few ways of removing a single ' from a string in python.
str.replace
replace is usually used to return a string with all the instances of the substring replaced.
"A single ' char".replace("'","")
str.translate
In Python 2
To remove characters, pass None as the first argument to the function and a string containing all the characters to be removed as the second.
"A single ' char".translate(None,"'")
In Python 3
You will have to use str.maketrans
"A single ' char".translate(str.maketrans({"'":None}))
re.sub
Regular expressions using re are even more powerful (but slower) and can be used to replace characters that match a particular regex rather than a fixed substring.
re.sub("'","","A single ' char")
Other Ways
There are a few other ways that can be used but are not at all recommended. (Just to learn new ways). Here we have the given string as a variable string.
Using list comprehension
''.join([c for c in string if c != "'"])
Using generator Expression
''.join(c for c in string if c != "'")
One final method can also be used (again not recommended: list.remove deletes only the first occurrence, and raises ValueError if there is none).
Using a list call along with remove and join.
x = list(string)
x.remove("'")
''.join(x)
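When several different characters need to go at once in Python 3, the three-argument form of str.maketrans may be more convenient than chaining replace calls. A minimal sketch, with a made-up sample string and illustrative variable names:

```python
# With three arguments, str.maketrans maps nothing (the first two strings are
# empty) and deletes every character listed in the third argument.
remove_quotes = str.maketrans('', '', '\'"')

text = 'He said "it isn\'t" twice'
result = text.translate(remove_quotes)
print(result)   # He said it isnt twice
```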
Do you mean like this?
>>> mystring = "This isn't the right place to have \"'\" (single quotes)"
>>> mystring
'This isn\'t the right place to have "\'" (single quotes)'
>>> newstring = mystring.replace("'", "")
>>> newstring
'This isnt the right place to have "" (single quotes)'
You can escape the apostrophe with a \ character as well:
mystring.replace('\'', '')
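I ran into this problem on Codewars, where str.title() capitalizes the letter after an apostrophe ("aren't" becomes "Aren'T"), so I came up with a temporary workaround: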
pred = "aren't"
pred = pred.replace("'", "99o")
pred = pred.title()
pred = pred.replace("99O", "'")
print(pred)
You can use another character combination, such as 123456k, but the last character should be a letter.
