I have a string that kind of looks like this:
"stuff . // : /// more-stuff .. .. ...$%$% stuff -> DD"
and I want to strip off all punctuation, make everything uppercase and collapse all whitespace so that it looks like this:
"STUFF MORE STUFF STUFF DD"
Is this possible with one regex, or do I need to combine two or more? This is what I have so far:
def normalize(string):
    import re
    string = string.upper()
    rex = re.compile(r'\W')
    rex_s = re.compile(r'\s{2,}')
    result = rex.sub(' ', string)  # this produces a string with tons of whitespace padding
    result = rex.sub('', result)   # this reduces all those spaces
    return result
The only thing that doesn't work is the whitespace collapsing. Any ideas?
Here's a single-step approach (but the uppercasing actually uses a string method -- much simpler!):
rex = re.compile(r'\W+')
result = rex.sub(' ', strarg).upper()
where strarg is the string argument (don't use names that shadow builtins or standard library modules, please).
s = "$$$aa1bb2 cc-dd ee_ff ggg."
re.sub(r'\W+', ' ', s).upper()
# ' AA1BB2 CC DD EE_FF GGG '
Is _ punctuation?
re.sub(r'[_\W]+', ' ', s).upper()
# ' AA1BB2 CC DD EE FF GGG '
Don't want the leading and trailing space?
re.sub(r'[_\W]+', ' ', s).strip().upper()
# 'AA1BB2 CC DD EE FF GGG'
result = rex.sub(' ', string) # this produces a string with tons of whitespace padding
result = rex.sub('', result) # this reduces all those spaces
That's because of a typo: the second call uses rex again instead of rex_s. Also, you need to substitute at least one space back in, or you'll end up with any multiple-space gap becoming no gap at all instead of a single-space gap.
result = rex.sub(' ', string) # this produces a string with tons of whitespace padding
result = rex_s.sub(' ', result) # this reduces all those spaces
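Putting it together, a sketch of the corrected function (same logic as above, nothing new):
import re
def normalize(s):
    rex = re.compile(r'\W')
    rex_s = re.compile(r'\s{2,}')
    result = rex.sub(' ', s.upper())  # punctuation and friends become spaces
    result = rex_s.sub(' ', result)   # collapse runs of 2+ spaces into one
    return result
print(normalize("stuff . // : /// more-stuff .. .. ...$%$% stuff -> DD"))
# STUFF MORE STUFF STUFF DD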
Do you have to use regular expressions? Do you feel you must do it in one line?
>>> import string
>>> s = "stuff . // : /// more-stuff .. .. ...$%$% stuff -> DD"
>>> s2 = ''.join(c for c in s if c in string.letters + ' ')
>>> ' '.join(s2.split())
'stuff morestuff stuff DD'
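Note that string.letters is Python 2 only; in Python 3 it is string.ascii_letters. Also, the hyphen in more-stuff is removed without leaving a space, so MORE and STUFF run together, which doesn't quite match the desired output:
>>> import string
>>> s2 = ''.join(c for c in s if c in string.ascii_letters + ' ')
>>> ' '.join(s2.split()).upper()
'STUFF MORESTUFF STUFF DD'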
Works in Python 3. This will retain the same whitespace character you collapsed, so if you have a tab and a space next to each other they won't collapse into a single character:
def collapse_whitespace_characters(raw_text):
    ret = ''
    if len(raw_text) > 1:
        prev_char = raw_text[0]
        ret += prev_char
        for cur_char in raw_text[1:]:
            if not cur_char.isspace() or cur_char != prev_char:
                ret += cur_char
            prev_char = cur_char
    else:
        ret = raw_text
    return ret
This one will collapse each whitespace run into the first whitespace character it sees:
def collapse_whitespace(raw_text):
    ret = ''
    if len(raw_text) > 1:
        prev_char = raw_text[0]
        ret += prev_char
        for cur_char in raw_text[1:]:
            if not cur_char.isspace() or \
                    (cur_char.isspace() and not prev_char.isspace()):
                ret += cur_char
            prev_char = cur_char
    else:
        ret = raw_text
    return ret
>>> collapse_whitespace_characters('we like spaces and\t\t TABS\t\tAND WHATEVER\xa0\xa0IS')
'we like spaces and\t TABS\tAND WHATEVER\xa0IS'
>>> collapse_whitespace('we like spaces and\t\t TABS\t\tAND WHATEVER\xa0\xa0IS')
'we like spaces and\tTABS\tAND WHATEVER\xa0IS'
For punctuation:
def collapse_punctuation(raw_text):
    ret = ''
    if len(raw_text) > 1:
        prev_char = raw_text[0]
        ret += prev_char
        for cur_char in raw_text[1:]:
            if cur_char.isalnum() or cur_char != prev_char:
                ret += cur_char
            prev_char = cur_char
    else:
        ret = raw_text
    return ret
To actually answer the question:
orig = 'stuff . // : /// more-stuff .. .. ...$%$% stuff -> DD'
collapse_whitespace(''.join([(c.upper() if c.isalnum() else ' ') for c in orig]))
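For the original string this produces the desired result (assuming the collapse_whitespace defined above):
print(collapse_whitespace(''.join([(c.upper() if c.isalnum() else ' ') for c in orig])))
# STUFF MORE STUFF STUFF DD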
As said, the regexp would be something like
re.sub(r'\W+', ' ', orig).upper()
One can use a regular expression to substitute recurring whitespace.
Whitespace is matched by \s, with \s+ meaning: at least one.
import re
rex = re.compile(r'\s+')
test = " x  y   z    z"
res = rex.sub(' ', test)
print(f">{res}<")
> x y z z<
Note this also affects/includes carriage return, etc.
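For example, a mixed run of tab, carriage return and newline collapses to a single space just the same:
import re
print(re.sub(r'\s+', ' ', "a \t\r\n b"))
# a b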
Related
I need a way to copy all of the positions of the spaces of one string to another string that has no spaces.
For example:
string1 = "This is a piece of text"
string2 = "ESTDTDLATPNPZQEPIE"
output = "ESTD TD L ATPNP ZQ EPIE"
Insert characters as appropriate into a placeholder list, then concatenate it with str.join.
it = iter(string2)
output = ''.join(
    [next(it) if not c.isspace() else ' ' for c in string1]
)
print(output)
ESTD TD L ATPNP ZQ EPIE
This is efficient as it avoids repeated string concatenation.
You need to iterate over the indexes and characters in string1 using enumerate().
On each iteration, if the character is a space, add a space to the output string (note that repeated += concatenation is inefficient, since strings are immutable and each += creates a new object); otherwise add the character of string2 at that index, offset by the number of spaces seen so far.
So that code would look like:
output = ''
si = 0
for i, c in enumerate(string1):
    if c == ' ':
        si += 1
        output += ' '
    else:
        output += string2[i - si]
However, it would be more efficient to use a very similar method, but with a generator and then str.join. This removes the slow concatenations to the output string:
def chars(s1, s2):
    si = 0
    for i, c in enumerate(s1):
        if c == ' ':
            si += 1
            yield ' '
        else:
            yield s2[i - si]
output = ''.join(chars(string1, string2))
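For the example strings from the question, this produces the expected output:
string1 = "This is a piece of text"
string2 = "ESTDTDLATPNPZQEPIE"
print(''.join(chars(string1, string2)))
# ESTD TD L ATPNP ZQ EPIE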
You can try the insert method:
string1 = "This is a piece of text"
string2 = "ESTDTDLATPNPZQEPIE"
string3 = list(string2)
for j, i in enumerate(string1):
    if i == ' ':
        string3.insert(j, ' ')
print("".join(string3))
Output:
ESTD TD L ATPNP ZQ EPIE
How would I convert this string
'\\n this is a docstring for\\n the main function.\\n a,\\n b,\\n c\\n '
into
'\n this is a docstring for\n the main function.\n a,\n b,\n c\n '
keeping in mind I would also like to do this for '\t' and all other escaped characters.
The code for the reverse way is
def fix_string(s):
    """ takes the string and replaces any `\n` with `\\n` so that the read file will be recognized """
    # escape chars = \t , \b , \n , \r , \f , \' , \" , \\
    new_s = ''
    for i in s:
        if i == '\t':
            new_s += '\\t'
        elif i == '\b':
            new_s += '\\b'
        elif i == '\n':
            new_s += '\\n'
        elif i == '\r':
            new_s += '\\r'
        elif i == '\f':
            new_s += '\\f'
        elif i == '\'':
            new_s += "\\'"
        elif i == '\"':
            new_s += '\\"'
        else:
            new_s += i
    return new_s
Would I possibly need to look at the actual numeric values of the characters and check the next character, say, if I find a '\' (92) followed by an 'n' (110)?
Don't reinvent the wheel here. Python has your back. Besides, handling escape syntax correctly is harder than it looks.
The correct way to handle this
In Python 2, use the str-to-str string_escape codec:
string.decode('string_escape')
This interprets any Python-recognized string escape sequences for you, including \n and \t.
Demo:
>>> string = '\\n this is a docstring for\\n the main function.\\n a,\\n b,\\n c\\n '
>>> string.decode('string_escape')
'\n this is a docstring for\n the main function.\n a,\n b,\n c\n '
>>> print string.decode('string_escape')
this is a docstring for
the main function.
a,
b,
c
>>> '\\t\\n\\r\\xa0\\040'.decode('string_escape')
'\t\n\r\xa0 '
In Python 3, you'd have to use the codecs.decode() and the unicode_escape codec:
codecs.decode(string, 'unicode_escape')
as there is no str.decode() method and this is not a str -> bytes conversion.
Demo:
>>> import codecs
>>> string = '\\n this is a docstring for\\n the main function.\\n a,\\n b,\\n c\\n '
>>> codecs.decode(string, 'unicode_escape')
'\n this is a docstring for\n the main function.\n a,\n b,\n c\n '
>>> print(codecs.decode(string, 'unicode_escape'))
this is a docstring for
the main function.
a,
b,
c
>>> codecs.decode('\\t\\n\\r\\xa0\\040', 'unicode_escape')
'\t\n\r\xa0 '
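Incidentally, the same codec covers the escaping direction that the question's fix_string implements by hand; note that in Python 3 it produces bytes:
>>> '\t line one\n line two\n'.encode('unicode_escape')
b'\\t line one\\n line two\\n'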
Why straightforward str.replace() won't cut it
You could try to do this yourself with str.replace(), but then you also need to implement proper escape parsing; take \\\\n for example; this is \\n, escaped. If you naively apply str.replace() in sequence, you end up with \n or \\\n instead:
>>> '\\\\n'.decode('string_escape')
'\\n'
>>> '\\\\n'.replace('\\n', '\n').replace('\\\\', '\\')
'\\\n'
>>> '\\\\n'.replace('\\\\', '\\').replace('\\n', '\n')
'\n'
The \\ pair should be replaced by just one \ character, leaving the n uninterpreted. But the replace approach either ends up replacing the trailing \ together with the n with a newline character, or you end up with \\ replaced by \, and then the \ and the n are replaced by a newline. Either way, you end up with the wrong output.
The slow way to handle this, manually
You'll have to process the characters one by one instead, pulling in more characters as needed:
_map = {
    '\\\\': '\\',
    "\\'": "'",
    '\\"': '"',
    '\\a': '\a',
    '\\b': '\b',
    '\\f': '\f',
    '\\n': '\n',
    '\\r': '\r',
    '\\t': '\t',
}

def unescape_string(s):
    output = []
    i = 0
    while i < len(s):
        c = s[i]
        i += 1
        if c != '\\':
            output.append(c)
            continue
        c += s[i]
        i += 1
        if c in _map:
            output.append(_map[c])
            continue
        if c == '\\x' and i < len(s) - 2:  # hex escape
            point = int(s[i] + s[i + 1], 16)
            i += 2
            output.append(chr(point))
            continue
        if c == '\\0':  # octal escape
            while len(c) < 4 and i < len(s) and s[i].isdigit():
                c += s[i]
                i += 1
            point = int(c[1:], 8)
            output.append(chr(point))
    return ''.join(output)
This can now handle the \xhh hex escapes, the \0.. octal escapes and the standard 1-letter escapes, but not \uhhhh Unicode code points or \N{name} Unicode name references, nor does it handle malformed escapes in quite the same way as Python would.
But it does handle the escaped escape properly:
>>> unescape_string(string)
'\n this is a docstring for\n the main function.\n a,\n b,\n c\n '
>>> unescape_string('\\\\n')
'\\n'
Do know this is far slower than using the built-in codec.
The simplest solution to this is just to use a str.replace() call:
s = '\\n this is a docstring for\\n the main function.\\n a,\\n b,\\n c\\n '
s1 = s.replace('\\n','\n')
s1
output
'\n this is a docstring for\n the main function.\n a,\n b,\n c\n '
def convert_text(text):
    return text.replace("\\n", "\n").replace("\\t", "\t")
text = '\\n this is a docstring for\\n the main function.\\n a,\\n b,\\n c\\n '
print convert_text(text)
output:
this is a docstring for
the main function.
a,
b,
c
I've got a string broken into pairs of letters and I'm looking for a way to get rid of all the pairs of identical letters, by inserting characters in between them to form new pairs. Further, I'm looking to split them up one pair at a time. What I've managed to do so far is split all identical blocks simultaneously, but that's not what I'm looking for. So, for example, consider "fr ee tr ee". This should go to "fr eX et re e", not "fr eXe tr eXe".
Anyone got any ideas?
EDIT: To be more clear, I need to go through the string, and at the first instance of a "double block", insert an X, and form new pairs on everything to the right of the X. So "AA BB" goes to "AX AB B".
So far I have
def FUN(text):
    if len(text) < 2:
        return text
    result = ""
    for i in range(1, len(text), 2):
        if text[i] == text[i - 1]:
            result += text[i - 1] + "X" + text[i]
        else:
            result += text[i-1:i+1]
    if len(text) % 2 != 0:
        result += text[-1]
    return result
How about this?
r = list()
S = "free tree"
S = "".join(S.split())
s = list()
for i in range(0, len(S)):
    s.append(S[i])
while len(s) > 0:
    c1 = s.pop(0)
    c2 = 'X'
    if len(s) > 0:
        if s[0] != c1:
            c2 = s.pop(0)
    else:
        c2 = ''
    r.append("{0}{1}".format(c1, c2))
result = " ".join(r)
print(result)
Hope this helps :)
You could turn your string into a list and check each pairing in a loop, then insert another character in between where you find the same character. Working on the code now, will edit.
my_string = "freetreebreefrost"
my_parts = [my_string[i:i+2] for i in range(0, len(my_string), 2)]
final_list = []
while len(my_parts):
    part = my_parts.pop(0)
    if len(part) == 2 and part[0] == part[1]:  # a doubled pair like 'ee'
        tmp_str = part[1] + "".join(my_parts)  # re-pair everything after the X
        my_parts = [tmp_str[i:i+2] for i in range(0, len(tmp_str), 2)]
        final_list.append(part[0] + "X")
    else:
        final_list.append(part)
print final_list
# ['fr', 'eX', 'et', 're', 'eb', 're', 'ef', 'ro', 'st']
There is probably a much cooler way to do this.
Ok here it is:
s = "free tree aa"
def seperateStringEveryTwoChars(s):
    # get rid of any space
    s = s.replace(' ', '')
    x = ""
    for i, l in enumerate(s, 0):
        x += l
        if i % 2:
            x += ' '
    return x.rstrip()

def findFirstDuplicateEntry(stringList):
    for i, elem in enumerate(stringList, 0):
        if len(elem) > 1 and elem[0] == elem[1]:
            return i
    return None

def yourProgram(s):
    x = seperateStringEveryTwoChars(s)
    # print x  # debug only
    splitX = x.split(' ')
    # print splitX  # debug only
    i = findFirstDuplicateEntry(splitX)
    if i is None:
        return seperateStringEveryTwoChars(s)
    # print i  # debug only
    splitX[i] = splitX[i][0] + "X" + splitX[i][1]
    # print splitX  # debug only
    s = ''.join(splitX)
    # print s  # debug only
    # print "Done"  # debug only
    return yourProgram(s)

print yourProgram(s)
Output:
fr eX et re ea a
With an input string of "aabbccddd" it will output "aX ab bc cd dX d".
This is a simple 3-substitution, single-pass solution, as easy as it gets.
No splitting, joining, arrays, or for loops.
First, remove all spaces from the string: Replace_All \s+ with "".
Replace_All with callback ((.)(?:(?!\2)(.)|)(?!$))
a. if (matched $3) replace with $1
b. else replace with $1+"X"
Finally, put a space between every 2 chars. Replace_All (..) with $1 + " "
This is a test using Perl (don't know Python that well)
$str = 'ee ee rx xx tt bb ff fr ee tr ee';
$str =~ s/\s+//g;
$str =~ s/((.)(?:(?!\2)(.)|)(?!$))/ defined $3 ? "$1" : "$1X"/eg;
$str =~ s/(..)/$1 /g;
print $str,"\n";
# Output:
# eX eX eX er xX xX xt tb bf fX fr eX et re e
# Expanded regex
#
( # (1 start)
( . ) # (2)
(?:
(?! \2 ) # Not equal to the first char?
( . ) # (3) Grab the next one
|
# or matches the first, an X will be inserted here
)
(?! $ )
) # (1 end)
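For completeness, here is a rough Python translation of the same three substitutions (my sketch, not the original poster's code; Python's re supports the same backreference-in-lookahead construct):
import re
s = 'ee ee rx xx tt bb ff fr ee tr ee'
s = re.sub(r'\s+', '', s)                    # 1. remove all whitespace
s = re.sub(r'((.)(?:(?!\2)(.)|)(?!$))',      # 2. pair up, appending X after a doubled char
           lambda m: m.group(1) if m.group(3) is not None else m.group(1) + 'X',
           s)
s = re.sub(r'(..)', r'\1 ', s).strip()       # 3. a space between every 2 chars
print(s)
# eX eX eX er xX xX xt tb bf fX fr eX et re e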
I am trying to split a comma delimited string in python. The tricky part for me here is that some of the fields in the data themselves have a comma in them and they are enclosed within quotes (" or '). The resulting split string should also have the quotes around the fields removed. Also, some fields can be empty.
Example:
hey,hello,,"hello,world",'hey,world'
needs to be split into 5 parts like below
['hey', 'hello', '', 'hello,world', 'hey,world']
Any ideas/thoughts/suggestions/help with how to go about solving the above problem in Python would be much appreciated.
Thank You,
Vish
Sounds like you want the CSV module.
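For reference, a minimal sketch of what the csv module does with the question's input: it honors only one quotechar (double quotes by default), so the single-quoted field gets split, which is why the answers below roll their own parsers.
import csv
line = 'hey,hello,,"hello,world",\'hey,world\''
print(next(csv.reader([line])))
# ['hey', 'hello', '', 'hello,world', "'hey", "world'"]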
(Edit: The original answer had trouble with empty fields on the edges due to the way re.findall works, so I refactored it a bit and added tests.)
import re

def parse_fields(text):
    r"""
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\''))
    ['hey', 'hello', '', 'hello,world', 'hey,world']
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\','))
    ['hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(',hey,hello,,"hello,world",\'hey,world\','))
    ['', 'hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(''))
    ['']
    >>> list(parse_fields(','))
    ['', '']
    >>> list(parse_fields('testing,quotes not at "the" beginning \'of\' the,string'))
    ['testing', 'quotes not at "the" beginning \'of\' the', 'string']
    >>> list(parse_fields('testing,"unterminated quotes'))
    ['testing', '"unterminated quotes']
    """
    pos = 0
    exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")
    while True:
        m = exp.search(text, pos)
        result = m.group(2)
        separator = m.group(3)
        yield result
        if not separator:
            break
        pos = m.end(0)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
(['"]?) matches an optional single- or double-quote.
(.*?) matches the string itself. This is a non-greedy match, to match as much as necessary without eating the whole string. This is assigned to result, and it's what we actually yield as a result.
\1 is a backreference, to match the same single- or double-quote we matched earlier (if any).
(,|$) matches the comma separating each entry, or the end of the line. This is assigned to separator.
If separator is false (e.g. empty), that means there's no separator, so we're at the end of the string and we're done. Otherwise, we update the new start position based on where the regex finished (m.end(0)), and continue the loop.
The csv module won't handle the scenario of " and ' being quotes at the same time. Absent a module that provides that kind of dialect, one has to get into the parsing business. To avoid reliance on a third party module, we can use the re module to do the lexical analysis, using the re.MatchObject.lastindex gimmick to associate a token type with the matched pattern.
The following code when run as a script passes all the tests shown, with Python 2.7 and 2.2.
import re

# lexical token symbols
DQUOTED, SQUOTED, UNQUOTED, COMMA, NEWLINE = xrange(5)

_pattern_tuples = (
    (r'"[^"]*"', DQUOTED),
    (r"'[^']*'", SQUOTED),
    (r",", COMMA),
    (r"$", NEWLINE),  # matches end of string OR \n just before end of string
    (r"[^,\n]+", UNQUOTED),  # order in the above list is important
)

_matcher = re.compile(
    '(' + ')|('.join([i[0] for i in _pattern_tuples]) + ')',
).match

_toktype = [None] + [i[1] for i in _pattern_tuples]
# need dummy at start because re.MatchObject.lastindex counts from 1

def csv_split(text):
    """Split a csv string into a list of fields.
    Fields may be quoted with " or ' or be unquoted.
    An unquoted string can contain both a " and a ', provided neither is at
    the start of the string.
    A trailing \n will be ignored if present.
    """
    fields = []
    pos = 0
    want_field = True
    while 1:
        m = _matcher(text, pos)
        if not m:
            raise ValueError("Problem at offset %d in %r" % (pos, text))
        ttype = _toktype[m.lastindex]
        if want_field:
            if ttype in (DQUOTED, SQUOTED):
                fields.append(m.group(0)[1:-1])
                want_field = False
            elif ttype == UNQUOTED:
                fields.append(m.group(0))
                want_field = False
            elif ttype == COMMA:
                fields.append("")
            else:
                assert ttype == NEWLINE
                fields.append("")
                break
        else:
            if ttype == COMMA:
                want_field = True
            elif ttype == NEWLINE:
                break
            else:
                print "*** Error dump ***", ttype, repr(m.group(0)), fields
                raise ValueError("Missing comma at offset %d in %r" % (pos, text))
        pos = m.end(0)
    return fields

if __name__ == "__main__":
    tests = (
        ("""hey,hello,,"hello,world",'hey,world'\n""", ['hey', 'hello', '', 'hello,world', 'hey,world']),
        ("""\n""", ['']),
        ("""""", ['']),
        ("""a,b\n""", ['a', 'b']),
        ("""a,b""", ['a', 'b']),
        (""",,,\n""", ['', '', '', '']),
        ("""a,contains both " and ',c""", ['a', 'contains both " and \'', 'c']),
        ("""a,'"starts with "...',c""", ['a', '"starts with "...', 'c']),
    )
    for text, expected in tests:
        result = csv_split(text)
        print
        print repr(text)
        print repr(result)
        print repr(expected)
        print result == expected
I fabricated something like this. Very redundant I suppose, but it does the job for me. You have to adapt it a bit to your specifications:
import re

def csv_splitter(line):
    splitthese = [0]
    splitted = []
    splitpos = True
    for nr, i in enumerate(line):
        if i == "\"" and splitpos == True:
            splitpos = False
        elif i == "\"" and splitpos == False:
            splitpos = True
        if i == "," and splitpos == True:
            splitthese.append(nr)
    splitthese.append(len(line)+1)
    for i in range(len(splitthese)-1):
        splitted.append(re.sub("^,|\"", "", line[splitthese[i]:splitthese[i+1]]))
    return splitted
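For example, with double quotes only (single-quoted fields would need the adaptation mentioned above):
print(csv_splitter('hey,hello,,"hello,world",greetings'))
# ['hey', 'hello', '', 'hello,world', 'greetings']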
I have some strings from which I want to delete some unwanted characters.
For example: Adam'sApple ----> AdamsApple. (case insensitive)
Can someone help me? I need the fastest way to do it, because I have a couple million records that have to be polished.
Thanks
One simple way:
>>> s = "Adam'sApple"
>>> x = s.replace("'", "")
>>> print x
AdamsApple
... or take a look at regex substitutions.
Here is a function that removes all the irritating ascii characters, the only exception is "&" which is replaced with "and". I use it to police a filesystem and ensure that all of the files adhere to the file naming scheme I insist everyone uses.
def cleanString(incomingString):
    newstring = incomingString
    newstring = newstring.replace("!", "")
    newstring = newstring.replace("#", "")
    newstring = newstring.replace("$", "")
    newstring = newstring.replace("%", "")
    newstring = newstring.replace("^", "")
    newstring = newstring.replace("&", "and")
    newstring = newstring.replace("*", "")
    newstring = newstring.replace("(", "")
    newstring = newstring.replace(")", "")
    newstring = newstring.replace("+", "")
    newstring = newstring.replace("=", "")
    newstring = newstring.replace("?", "")
    newstring = newstring.replace("\'", "")
    newstring = newstring.replace("\"", "")
    newstring = newstring.replace("{", "")
    newstring = newstring.replace("}", "")
    newstring = newstring.replace("[", "")
    newstring = newstring.replace("]", "")
    newstring = newstring.replace("<", "")
    newstring = newstring.replace(">", "")
    newstring = newstring.replace("~", "")
    newstring = newstring.replace("`", "")
    newstring = newstring.replace(":", "")
    newstring = newstring.replace(";", "")
    newstring = newstring.replace("|", "")
    newstring = newstring.replace("\\", "")
    newstring = newstring.replace("/", "")
    return newstring
Any characters in the 2nd argument of the translate method are deleted:
>>> "Adam's Apple!".translate(None,"'!")
'Adams Apple'
NOTE: translate requires Python 2.6 or later to use None for the first argument, which otherwise must be a translation string of length 256. string.maketrans('','') can be used in place of None for pre-2.6 versions.
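In Python 3 the API changed: build the table with str.maketrans, whose third argument lists the characters to delete:
>>> "Adam's Apple!".translate(str.maketrans('', '', "'!"))
'Adams Apple'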
Try:
"Adam'sApple".replace("'", '')
One step further, to replace multiple characters with nothing:
import re
print re.sub(r'''['"x]''', '', '''a'"xb''')
Yields:
ab
"Adam'sApple".replace("'", "")
As has been pointed out several times now, you have to either use replace() or regular expressions (most likely you don't need regexes, though). But if you also have to make sure that the resulting string is plain ASCII (doesn't contain funky characters like é, ò, µ, æ or φ), you could finally do
>>> u'(like é, ò, µ, æ or φ)'.encode('ascii', 'ignore')
'(like , , ,  or )'
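The Python 3 spelling goes through bytes and back:
>>> '(like é, ò, µ, æ or φ)'.encode('ascii', 'ignore').decode('ascii')
'(like , , ,  or )'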
An alternative that takes in a string and an array of unwanted chars:
# function that removes unwanted signs from a string
# pass the string to the function and an array of unwanted chars
def removeSigns(str, arrayOfChars):
    charFound = False
    newstr = ""
    for letter in str:
        for char in arrayOfChars:
            if letter == char:
                charFound = True
                break
        if charFound == False:
            newstr += letter
        charFound = False
    return newstr
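A quick usage example:
print(removeSigns("Adam'sApple!", ["'", "!"]))
# AdamsApple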
Let's say we have the following list:
states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda', 'south carolina##', 'West virginia?']
Now we will define a function clean_strings()
import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result
When we call the function clean_strings(states), the result will look like:
['Alabama',
'Georgia',
'Georgia',
'Georgia',
'Florida',
'South Carolina',
'West Virginia']
I am probably late with an answer, but I think the code below would also do (to an extreme end); it will remove all the unnecessary chars:
import re
a = '; niraj kale 984wywn on 2/2/2017'
a = re.sub('[^a-zA-Z0-9.?]', ' ', a)
a = a.replace('  ', ' ').lstrip().rstrip()
which will give
'niraj kale 984wywn on 2 2 2017'