Replace escape sequence characters in a string in Python 3.x - python

I have used the following code to replace the escaped characters in a string. I first split on \n and then used re.sub(), but I still don't know what I am missing; the code is not working as expected. I am a newbie at Python, so please don't judge if there are optimisation problems. Here is my code:
#import sys
import re

String = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
splitString = String.split('\n')
replacedStrings = []
i = 0
for oneString in splitString:
    #oneString = oneString.replace(r'^(.?)*(\\[^n])+(.?)*$', "")
    oneString = re.sub(r'^(.?)*(\\[^n])+(.?)*$', "", oneString)
    print(oneString)
    replacedStrings.insert(i, oneString)
    i += 1
print(replacedStrings)
My aim here is: I need only the values (without the escape sequences) as the split strings.
My approach here is:
1. Split the string by \n, which gives me a list of separate strings.
2. Check each string against the regex; if the regex matches, replace the matched substring with "".
3. Push those strings to a collection, so that it stores the replaced strings in a new list.
So basically, I am through with 1 and 2, but currently I am stuck at 3. Following is my output:
1
2
3
4
['1\r\r\t\r', '2\r\r', '3\r\r\r\r', '\r', '\r4', '\r']

You might find it easier to use re.findall here with the simple pattern \S+:
import re

text = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
output = re.findall(r'\S+', text)
print(output)
['1', '2', '3', '4']
This approach will isolate and match any islands of one or more non-whitespace characters.
Edit:
Based on your new input data, we can try matching on the pattern [^\r\n\t]+:
text = "jkahdjkah \r\r\t\r\nA: B\r\r\nA : B\r\r\r\r\n\r\n\r4\n\r"
output = re.findall(r'[^\r\n\t]+', text)
print(output)
['jkahdjkah ', 'A: B', 'A : B', '4']
re.sub isn't really the right tool for the job here. What you want instead is split or re.findall, because you want to repeatedly match and isolate a certain part of your text. re.sub is useful for taking a string and transforming it into something else. It can be used to extract text, but it does not work so well for multiple matches.
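To make that distinction concrete, here is a minimal sketch using the question's own input string: findall extracts the pieces you want to keep, while sub transforms the string as a whole.

```python
import re

text = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"

# Extraction: findall pulls out every run of non-whitespace characters.
values = re.findall(r'\S+', text)
print(values)  # ['1', '2', '3', '4']

# Transformation: sub rewrites the string, e.g. collapsing every run
# of whitespace into a single space.
collapsed = re.sub(r'\s+', ' ', text).strip()
print(collapsed)  # '1 2 3 4'
```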

You were almost there; I would just use str.strip() to remove the \r, \n, and \t characters at the start and the end of each string:
String = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
splitString = String.split('\n')
replacedStrings = []
for oneString in splitString:
    s = oneString.strip()
    if s != '':
        print(s)
        replacedStrings.append(s)
print(replacedStrings)
The output will look like
1
2
3
4
['1', '2', '3', '4']
For "jkahdjkah \r\r\t\r\nA: B\r\r\nA : B\r\r\r\r\n\r\n\r4\n\r", the output will be ['jkahdjkah', 'A: B', 'A : B', '4']
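To see why a plain strip() is enough here, note that with no argument it removes *all* leading and trailing whitespace, which covers \r, \n and \t, not just spaces:

```python
# str.strip() with no argument removes all leading and trailing
# whitespace, including \r, \n and \t:
a = '\r4\n\r'.strip()
b = '1\r\r\t\r'.strip()
c = '\r'.strip()
print(a, b, repr(c))  # 4 1 ''  (all-whitespace pieces become empty)
```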

I have found one more method that seems to work fine. It might not be as optimised as the other answers, but it's just another way:
import re

String = "jhgdf\r\r\t\r\nA : B\r\r\nA : B\r\r\r\r\n\r\n\rA: B\n\r"
splitString = re.compile(r'[\r\t\n]+').split(String)
if "" in splitString:
    splitString.remove("")
print(splitString)
I added it here so that people running into the same trouble as me might want to look at this approach too.
Following is the Output that I have got after using the above code:
['jhgdf', 'A : B', 'A : B', 'A: B']

Related

How to fix 'replace' keyword when it is not working in python

I am writing code that needs to get four individual values, and one of the values has the newline character in addition to an extra apostrophe and bracket, like so: 11\n']. I only need the 11; I have been able to strip the '], but I am unable to remove the newline character.
I have tried various different set ups of strip and replace, and both strip and replace are not removing the part.
with open('gil200110raw.txt', 'r') as qcfile:
    txt = qcfile.readlines()
    line1 = txt[1:2]
    line2 = txt[2:3]
    line1 = str(line1)
    line2 = str(line2)
    sptline1 = line1.split(' ')
    sptline2 = line2.split(' ')
    totalobs = sptline1[39]
    qccalc1 = sptline2[2]
    qccalc2 = sptline2[9]
    qccalc3 = sptline2[16]
    qccalc4 = sptline2[22]
    qccalc4 = qccalc4.strip("\n']")
    qccalc4 = qccalc4.replace("\n", "")
I did not get an error, but the output of print(qccalc4) is 11\n. I expect the output to be 11.
Use rstrip instead!
>>> 'test string\n'.rstrip()
'test string'
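One likely cause worth noting (an assumption, based on the str(line1) calls in the question's code): converting a list to a string renders each item with repr(), so a real newline becomes a literal backslash followed by the letter 'n', which neither strip("\n']") nor replace("\n", "") will touch. A small sketch:

```python
# str() on a list renders each item with repr(), so a real newline
# becomes two literal characters: a backslash and the letter 'n'.
line = str(['11\n'])
print(line)  # ['11\n']

# strip("\n']") removes newline/quote/bracket characters but not the
# literal backslash, so two stray characters survive.  Including the
# backslash in the strip set fixes it:
cleaned = line.strip("[]'\\n")
print(cleaned)  # 11

# Cleaner still: index the list instead of stringifying it.
direct = ['11\n'][0].strip()
print(direct)  # 11
```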
You can use regex to match the outputs you're looking for.
From your description, I assume it is all integers; consider the following snippet:
import re
p = re.compile('[0-9]+')
sample = '11\n\'] dwqed 12 444'
results = p.findall(sample)
results would now contain the list ['11', '12', '444'].
re is the regex package for Python, and p is the compiled pattern we would like to find in our text. The pattern [0-9]+ simply means: match one or more characters from 0 to 9.
You can find the documentation for the re module in the Python standard library reference.

How to parse a CSV with commas between parenthesis and missing values

I tried using pyparsing to parse a CSV with:
Commas between parenthesis (or brackets, etc): "a(1,2),b" should return the list ["a(1,2)","b"]
Missing values: "a,b,,c," should return the list ['a','b','','c','']
I worked out a solution, but it seems "dirty". Mainly, the Optional sits inside only one of the possible atomics. I think the Optional should be independent of the atomics; that is, I feel it should go somewhere else, for example in the delimitedList optional arguments, but in my trial and error that was the only place that worked and made sense. It could be in any of the possible atomics, so I chose the first.
Also, I don't fully understand what originalTextFor is doing, but if I remove it, the code stops working.
Working example:
import pyparsing as pp

# Function that parses a line of columns separated by commas and returns a list of the columns
def fromLineToRow(line):
    sqbrackets_col = pp.Word(pp.printables, excludeChars="[],") | pp.nestedExpr(opener="[", closer="]")  # matches "a[1,2]"
    parens_col = pp.Word(pp.printables, excludeChars="(),") | pp.nestedExpr(opener="(", closer=")")  # matches "a(1,2)"
    # In the following line:
    # * The "^" means "choose the longest option"
    # * The "pp.Optional" can be in any of the expressions separated by "^".
    #   I put it only on the first. It's used for when there are missing values.
    atomic = pp.originalTextFor(pp.Optional(pp.OneOrMore(parens_col))) ^ pp.originalTextFor(pp.OneOrMore(sqbrackets_col))
    grammar = pp.delimitedList(atomic)
    row = grammar.parseString(line).asList()
    return row

file_str = \
"""YEAR,a(2,3),b[3,4]
1960,2.8,3
1961,4,
1962,,1
1963,1.27,3"""

for line in file_str.splitlines():
    row = fromLineToRow(line)
    print(row)
Prints:
['YEAR', 'a(2,3)', 'b[3,4]']
['1960', '2.8', '3']
['1961', '4', '']
['1962', '', '1']
['1963', '1.27', '3']
Is this the right way to do this? Is there a "cleaner" way to use the Optional inside the first atomic?
Working inside-out, I get this:
# chars not in ()'s or []'s - also disallow ','
non_grouped = pp.Word(pp.printables, excludeChars="[](),")
# grouped expressions in ()'s or []'s
grouped = pp.nestedExpr(opener="[",closer="]") | pp.nestedExpr(opener="(",closer=")")
# use OneOrMore to allow non_grouped and grouped together
atomic = pp.originalTextFor(pp.OneOrMore(non_grouped | grouped))
# or based on your examples, you *could* tighten this up to:
# atomic = pp.originalTextFor(non_grouped + pp.Optional(grouped))
originalTextFor recombines the original input text between the leading and trailing boundaries of the matched expressions and returns a single string. If you leave it out, you will get all the sub-expressions as a nested list of strings, like ['a', ['2,3']]. You could rejoin them with repeated calls to ''.join, but that would collapse out whitespace (or use ' '.join, which has the opposite problem of potentially introducing whitespace).
To optionalize the elements of the list, just say so in the definition of the delimited list:
grammar = pp.delimitedList(pp.Optional(atomic, default=''))
Be sure to add the default value, else the empty slots will just get dropped.
With these changes I get:
['YEAR', 'a(2,3)', 'b[3,4]']
['1960', '2.8', '3']
['1961', '4', '']
['1962', '', '1']
['1963', '1.27', '3']
You can also do this with a regex via the re module, for instance:
>>> import re
>>> line1, line2 = 'a(1,2),b', 'a,b,,c,'
>>> re.split(r',\s*(?![^()]*\))', line1)
['a(1,2)', 'b']
>>> re.split(r',\s*(?![^()]*\))', line2)
['a', 'b', '', 'c', '']
import re

with open('44289614.csv') as f:
    for line in map(str.strip, f):
        l = re.split(r',\s*(?![^()\[\]]*[)\]])', line)
        print(len(l), l)
Output:
3 ['YEAR', 'a(2,3)', 'b[3,4]']
3 ['1960', '2.8', '3']
3 ['1961', '4', '']
3 ['1962', '', '1']
3 ['1963', '1.27', '3']
Modified from this answer.
I also like this answer, which suggests modifying the input slightly and using quotechar of the csv module.
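A hedged sketch of that csv-module idea (the quote_grouped_fields helper and its regex are my own illustration, not taken from the linked answer): pre-quote any field that contains a (...) or [...] group, then let csv.reader handle the commas.

```python
import csv
import io
import re

# Wrap fields containing (...) or [...] in quotes so csv.reader treats
# the commas inside the group as data rather than delimiters.
def quote_grouped_fields(line):
    return re.sub(r'([^,]*(?:\([^)]*\)|\[[^\]]*\])[^,]*)', r'"\1"', line)

raw = "YEAR,a(2,3),b[3,4]\n1960,2.8,3\n1961,4,\n1962,,1\n1963,1.27,3"
quoted = "\n".join(quote_grouped_fields(line) for line in raw.splitlines())
rows = list(csv.reader(io.StringIO(quoted)))
for row in rows:
    print(row)
```

Missing values survive because csv.reader already yields empty strings for empty fields.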

Python Regex Simple Split - Empty at first index

I have a String that looks like
test = '20170125NBCNightlyNews'
I am trying to split it into two parts: the digits and the name. The format will always be [date][show]; the date is stripped of formatting and is digits only, in the form YYYYMMDD (don't think that matters).
I am trying to use re. I have a working version by writing:
re.split(r'(\d+)', test)
Simple enough, this gives me the values I need in a list.
['', '20170125', 'NBCNightlyNews']
However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.
I also tried telling it to match the beginning of the string as well, and got the same results.
>>> re.split('(^\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>> re.split('^(\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>>
Does anyone have any input as to why this is there / how I can avoid the empty string?
Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse a re.split to parse this data:
test[:8], test[8:]
Will split your strings just fine.
What you are actually doing by entering re.split('(^\d+)', test) is that your test string is split on any occurrence of a number with at least one digit.
So, if you have
test = '20170125NBCNightlyNews'
This is happening:
20170125 NBCNightlyNews
^^^^^^^^
The string is split into three parts, everything before the number, the number itself and everything after the number.
Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.
re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']
re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']
You're getting an empty result in the beginning because your input string starts with digits and you're splitting it by digits only. Hence you get an empty string which is before first set of digits.
To avoid that you can use filter:
>>> list(filter(None, re.split(r'(\d+)', test)))
['20170125', 'NBCNightlyNews']
Why re.split when you can just match and get the groups?...
import re

test = '20170125NBCNightlyNews'
pattern = re.compile(r'(\d+)(\w+)')
result = re.match(pattern, test)
print(result.groups()[0])  # the date part
print(result.groups()[1])  # the show name
I realize now the intention was to parse the text, not fix the regex usage. I'm with the others, you shouldn't use regex for this simple task when you already know the format won't change and the date is fixed size and will always be first. Just use string indexing.
From the documentation:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.
So if you have:
test = 'test20170125NBCNightlyNews'
The indexes would remain unaffected:
>>> re.split(r'(\d+)', test)
['test', '20170125', 'NBCNightlyNews']
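The end-of-string case mentioned in the quoted documentation behaves symmetrically (a quick sketch):

```python
import re

# A capturing separator that matches at the *end* of the string likewise
# produces a trailing empty string, keeping indices stable:
parts = re.split(r'(\d+)', 'NBCNightlyNews20170125')
print(parts)  # ['NBCNightlyNews', '20170125', '']
```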
If the date is always 8 digits long, I would access the substrings directly (without using regex):
>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']
If the length of the date might vary, I would use:
>>> s = re.search('^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']

Python detect character surrounded by spaces

Anyone know how I can find the character in the center that is surrounded by spaces?
1 + 1
I'd like to be able to separate the + in the middle to use in a if/else statement.
Sorry if I'm not too clear, I'm a Python beginner.
I think you are looking for something like the split() method, which will split on whitespace by default.
Suppose we have a string s
s = "1 + 1"
chunks = s.split()
print(chunks[1]) # Will print '+'
This regular expression will detect a single character surrounded by spaces, if the character is a plus, minus, multiplication, or division sign: r' ([-+*/]) '. Note the spaces inside the quotes, and that the - is listed first in the class so it is treated as a literal dash (writing [+-*/] would make +-* an invalid character range). The parentheses "capture" the character in the middle. If you need to recognize a different set of characters, change the set inside the brackets.
If you haven't dealt with regular expressions before, read up on the re module. They are very useful for simple text processing. The two relevant features here are "character classes" (the square brackets in my example) and "capturing parentheses" (the round parens).
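A minimal sketch of those two features together, using the r' ([-+*/]) ' pattern from above:

```python
import re

# ' ([-+*/]) ': a space, one captured operator from the character class,
# then another space.  The '-' comes first in the class so it is a
# literal dash rather than part of a range.
pattern = re.compile(r' ([-+*/]) ')

m = pattern.search('1 + 1')
op = m.group(1)
print(op)  # +

# An operator not padded by spaces simply fails to match:
print(pattern.search('1+1'))  # None
```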
You can use regex:
import re

s = "1 + 1"
a = re.compile(r' (?P<sym>.) ')
print(a.search(s).group('sym'))  # '+'
import re

def find_between(string, start_=' ', end_=' '):
    re_str = r'{}([-+*/%^]){}'.format(start_, end_)
    try:
        return re.search(re_str, string).group(1)
    except AttributeError:
        return None

print(find_between('9 * 5', ' ', ' '))
Not knowing how many spaces separate your central character, I'd use the following:
s = '1 +   1'
middle = list(filter(None, s.split(' ')))[1]
print(middle)  # +
The split(' ') works as in the solution provided by Zac, but if there is more than a single space, the returned list will have a bunch of '' elements, which we can get rid of with filter(None, ...) (wrapped in list() for Python 3).
Then it's just a matter of extracting your second element.
Check it in action at https://eval.in/636622
If we look at it step-by-step, here is how it all works in a Python console:
>>> s = '1 +   1'
>>> s.split(' ')
['1', '+', '', '', '1']
>>> list(filter(None, s.split(' ')))
['1', '+', '1']
>>> list(filter(None, s.split(' ')))[1]
'+'

Python - defining string split delimiter?

How could I define string delimiters for splitting in the most efficient way? I mean, without needing to use many ifs, etc.
I have strings that need to be split strictly into two-element lists. The problem is that those strings have different symbols by which I can split them. For example:
'Hello: test1'. This one has the split delimiter ': '. The other example would be:
'Hello - test1'. So this one would be ' - '. The split delimiter could also be ' -' or '- '. So if I know all the variations of delimiters, how could I define them most efficiently?
First I did something like this:
strings = ['Hello - test', 'Hello- test', 'Hello -test']
for s in strings:
    delim = ' - '
    if len(s.split('- ', 1)) == 2:
        delim = '- '
    elif len(s.split(' -', 1)) == 2:
        delim = ' -'
    print(s.split(delim, 1)[1])
But then I got new strings that had other unexpected delimiters. Doing it this way, I would have to add even more ifs to check other delimiters like ': '. So I wondered whether there is a better way to define them (there is no problem if I need to include new delimiters in some kind of list later on). Maybe regex would help, or some other tool?
Put all the delimiters inside the re.split function as below, joined with the alternation operator |.
re.split(r': | - | -|- ', string)
Add maxsplit=1, if you want to do an one time split.
re.split(r': | - | -|- ', string, maxsplit=1)
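A quick sketch of that call against the delimiter variants mentioned in the question:

```python
import re

# Order matters inside the alternation: ' - ' is tried before ' -' and
# '- ', so the longest delimiter wins where it applies.
pattern = r': | - | -|- '
results = [re.split(pattern, s, maxsplit=1)
           for s in ['Hello: test1', 'Hello - test1', 'Hello -test1', 'Hello- test1']]
for r in results:
    print(r)  # ['Hello', 'test1'] each time
```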
You can use the split function of the re module
>>> strings = ['Hello1 - test1', 'Hello2- test2', 'Hello3 -test3', 'Hello4 :test4', 'Hello5 : test5']
>>> for s in strings:
... re.split(" *[:-] *",s)
...
['Hello1', 'test1']
['Hello2', 'test2']
['Hello3', 'test3']
['Hello4', 'test4']
['Hello5', 'test5']
Between the [] you put all the possible delimiters. The * indicates that spaces may appear before or after.
\s*[:-]\s*
You can split on this. Use re.split(r"\s*[:-]\s*", string). See the demo:
https://regex101.com/r/nL5yL3/14
You should use this if you can have delimiters like ' - ', '- ', or ' -', where there can be multiple spaces.
This isn't the best way, but if you want to avoid using re for some (or no) reason, this is what I would do:
>>> strings = ['Hello - test', 'Hello- test', 'Hello -test', 'Hello : test']
>>> delims = [':', '-']  # all possible delimiters; don't worry about spaces.
>>>
>>> for string in strings:
...     delim = next((d for d in delims if d in string), None)  # finds the first delimiter in delims that's present in the string (if there is one)
...     if not delim:
...         continue  # No delimiter! (I don't know how you want to handle this possibility; this code will simply skip the string altogether.)
...     print([s.strip() for s in string.split(delim, 1)])  # assuming you want them in list form.
...
['Hello', 'test']
['Hello', 'test']
['Hello', 'test']
['Hello', 'test']
This uses Python's native .split() to break the string at the delimiter, and then .strip() to trim the white space off the results, if there is any. I've used next to find the appropriate delimiter, but there are plenty of things you can swap that out with (especially if you like for blocks).
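For readers who prefer explicit loops, the next(...) call can be swapped for a plain for block; here is a sketch with illustrative sample strings:

```python
strings = ['Hello - test', 'Hello : test', 'no delimiter here']
delims = [':', '-']

rows = []
for string in strings:
    # for-block equivalent of next((d for d in delims if d in string), None)
    delim = None
    for d in delims:
        if d in string:
            delim = d
            break
    if delim is None:
        continue  # skip strings with no known delimiter
    rows.append([s.strip() for s in string.split(delim, 1)])
print(rows)  # [['Hello', 'test'], ['Hello', 'test']]
```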
If you're certain that each string will contain at least one of the delimiters (preferably exactly one), then you can shave it down to this:
## with strings and delims defined...
>>> for string in strings:
...     delim = next(d for d in delims if d in string)  # raises StopIteration at this line if there is no delimiter in the string.
...     print([s.strip() for s in string.split(delim, 1)])
I'm not sure if this is the most elegant solution, but it uses fewer if blocks, and you won't have to import anything to do it.
