Anyone know how I can find the character in the center that is surrounded by spaces?
1 + 1
I'd like to be able to separate the + in the middle to use in a if/else statement.
Sorry if I'm not too clear, I'm a Python beginner.
I think you are looking for something like the split() method which will split on white space by default.
Suppose we have a string s
s = "1 + 1"
chunks = s.split()
print(chunks[1]) # Will print '+'
This regular expression will detect a single character surrounded by spaces, if the character is a plus or minus or mult or div sign: r' ([+-*/]) '. Note the spaces inside the apostrophes. The parentheses "capture" the character in the middle. If you need to recognize a different set of characters, change the set inside the brackets.
If you haven't dealt with regular expressions before, read up on the re module. They are very useful for simple text processing. The two relevant features here are "character classes" (the square brackets in my example) and "capturing parentheses" (the round parens).
You can use regex:
s="1 + 1"
a=re.compile(r' (?P<sym>.) ')
a.search(s).group('sym')
import re
def find_between(string, start_=' ', end_=' '):
re_str = r'{}([-+*/%^]){}'.format(start_, end_)
try:
return re.search(re_str, string).group(1)
except AttributeError:
return None
print(find_between('9 * 5', ' ', ' '))
Not knowing how many spaces separate your central character, then I'd use the following:
s = '1 + 1'
middle = filter(None, s.split())[1]
print middle # +
The split works as in the solution provided by Zac, but if there are more than a single space, then the returned list will have a bunch of '' elements, which we can get rid of with the filter(None, ) function.
Then it's just a matter of extracting your second element.
Check it in action at https://eval.in/636622
If we look at it step-by-step, then here is how it all works using a python console:
>>> s = '1 + 1'
>>> s.split()
['1', '+', '', '', '1']
>>> filter(None, s.split())
['1', '+', '1']
>>> filter(None, s.split())[1]
'+'
Related
I have used the following code to replace the escaped characters in a string. I have first done splitting by \n and the used re.sub(), but still I dont know what I am missing, the code is not working according to the expectations. I am a newbie at Python, so please don't judge if there are optimisation problems. Here is my code:
#import sys
import re
String = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
splitString = String.split('\n')
replacedStrings = []
i=0
for oneString in splitString:
#oneString = oneString.replace(r'^(.?)*(\\[^n])+(.?)*$', "")
oneString = re.sub(r'^(.?)*(\\[^n])+(.?)*$', "", oneString)
print(oneString)
replacedStrings.insert(i, oneString)
i += 1
print(replacedStrings)
My aim here is: I need the values only (without the escaped sequences) as the split strings.
My approach here is:
I have split the string by \n that gives me array list of separate strings.
Then, I have checked each string using the regex, if the regex matches, then the matched substring is replaced to "".
Then I have pushed those strings to a collection, thinking that it will store the replaced strings in the new array list.
So basically, I am through with 1 and 2, but currently I am stuck at 3. Following is my Output:
1
2
3
4
['1\r\r\t\r', '2\r\r', '3\r\r\r\r', '\r', '\r4', '\r']
You might find it easier to use re.findall here with the simple pattern \S+:
input = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
output = re.findall(r'\S+', input)
print(output)
['1', '2', '3', '4']
This approach will isolate and match any islands of one or more non whitespace characters.
Edit:
Based on your new input data, we can try matching on the pattern [^\r\n\t]+:
input = "jkahdjkah \r\r\t\r\nA: B\r\r\nA : B\r\r\r\r\n\r\n\r4\n\r"
output = re.findall(r'[^\r\n\t]+', input)
print(output)
['jkahdjkah ', 'A: B', 'A : B', '4']
re.sub isn't really the right tool for the job here. What would be on the table is split or re.findall, because you want to repeatedly match/isolate a certain part of your text. re.sub is useful for taking a string and transforming it to something else. It can be used to extract text, but does not work so well for multiple matches.
You were almost there, I would just use string.strip() to replace multiple \r and \n at the start and the end of the strings
String = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
splitString = String.split('\n')
replacedStrings = []
i=0
for oneString in splitString:
s = oneString.strip()
if s != '':
print(s)
replacedStrings.append(s)
print(replacedStrings)
The output will look like
1
2
3
4
['1', '2', '3', '4']
For "jkahdjkah \r\r\t\r\nA: B\r\r\nA : B\r\r\r\r\n\r\n\r4\n\r", the output will be ['jkahdjkah', 'A: B', 'A : B', '4']
I have found one more method, this seems to work fine, it might not be as optimised as the other answers, but its just another way:
import re
splitString = []
String = "jhgdf\r\r\t\r\nA : B\r\r\nA : B\r\r\r\r\n\r\n\rA: B\n\r"
splitString = re.compile('[\r\t\n]+').split(String)
if "" in splitString:
splitString.remove("")
print(splitString)
I added it here, so that people going through the same trouble as me, might want to overlook this approach too.
Following is the Output that I have got after using the above code:
['jhgdf', 'A : B', 'A : B', 'A: B']
I am trying to implement a tokeniser in python (without using NLTK libraries) that splits a string into words using blank spaces. Example usage is:
>> tokens = tokenise1(“A (small, simple) example”)
>> tokens
[‘A’, ‘(small,’, ‘simple)’, ‘example’]
I can get some of the way using regular expressions but my return value includes white spaces which I don't want. How do i get the correct return value as per example usage?
What i have so far is:
def tokenise1(string):
return re.split(r'(\S+)', string)
and it returns:
['', 'A', ' ', '(small,', ' ', 'simple)', ' ', 'example', '']
so i need to get rid of the white space in the return
The output is having spaces because you capture them using (). Instead you can split like
re.split(r'\s+', string)
['A', '(small,', 'simple)', 'example']
\s+ Matches one or more spaces.
How could I define string delimiter for splitting in most efficient way? I mean to not need to use many if's etc?
I have strings that need to be splited strictly into two element lists. The problem is those strings have different symbols by which I can split them. For example:
'Hello: test1'. This one has split delimiter ': '. The other example would be:
'Hello - test1'. So this one would be ' - '. Also split delimiter could be ' -' or '- '. So if I know all variations of delimiters, how could I define them most efficiently?
First I did something like this:
strings = ['Hello - test', 'Hello- test', 'Hello -test']
for s in strings:
delim = ' - '
if len(s.split('- ', 1)) == 2:
delim = '- '
elif len(s.split(' -', 1)) == 2:
delim = ' -'
print s.split(delim, 1)[1])
But then I got new strings that had another unexpected delimiters. So doing this way I should add even more ifs to check other delimiters like ': '. But then I wondered if there is some better way to define them (there is not problem if I should need to include new delimiters in some kind of list if I would need to later on). Maybe regex would help or some other tool?
Put all the delimiters inside re.split function like below using logical OR | operator.
re.split(r': | - | -|- ', string)
Add maxsplit=1, if you want to do an one time split.
re.split(r': | - | -|- ', string, maxsplit=1)
You can use the split function of the re module
>>> strings = ['Hello1 - test1', 'Hello2- test2', 'Hello3 -test3', 'Hello4 :test4', 'Hello5 : test5']
>>> for s in strings:
... re.split(" *[:-] *",s)
...
['Hello1', 'test1']
['Hello2', 'test2']
['Hello3', 'test3']
['Hello4', 'test4']
['Hello5', 'test5']
Where between [] you put all the possible delimiters. The * indicates that some spaces can be put before or after.
\s*[:-]\s*
You can split by this.Use re.split(r"\s*[:-]\s*",string).See demo.
https://regex101.com/r/nL5yL3/14
You should use this if you can have delimiters like - or - or -.wherein you have can have multiple spaces.
This isn't the best way, but if you want to avoid using re for some (or no) reason, this is what I would do:
>>> strings = ['Hello - test', 'Hello- test', 'Hello -test', 'Hello : test']
>>> delims = [':', '-'] # all possible delimiters; don't worry about spaces.
>>>
>>> for string in strings:
... delim = next((d for d in delims if d in string), None) # finds the first delimiter in delims that's present in the string (if there is one)
... if not delim:
... continue # No delimiter! (I don't know how you want to handle this possibility; this code will simply skip the string all together.)
... print [s.strip() for s in string.split(delim, 1)] # assuming you want them in list form.
['Hello', 'test']
['Hello', 'test']
['Hello', 'test']
['Hello', 'test']
This uses Python's native .split() to break the string at the delimiter, and then .strip() to trim the white space off the results, if there is any. I've used next to find the appropriate delimiter, but there are plenty of things you can swap that out with (especially if you like for blocks).
If you're certain that each string will contain at least one of the delimiters (preferably exactly one), then you can shave it down to this:
## with strings and delims defined...
>>> for string in strings:
... delim = next(d for d in delims if d in string) # raises StopIteration at this line if there is no delimiter in the string.
... print [s.strip() for s in string.split(delim, 1)]
I'm not sure if this is the most elegant solution, but it uses fewer if blocks, and you won't have to import anything to do it.
I am trying to write code that will take a string and remove specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take a Substring of a string within two characters to know the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
pat=re.compile('"([^"]*)"')
while True:
mat=pat.search(inputString)
if mat is None:
break
strings.append(mat.group(1))
inputString=inputString[mat.end():]
print strings
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print strings
Output for both versions:
['NN', '123..145', '1.0']
fields = inputString.split('"')
print fields[1], fields[3], fields[5]
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print input_string_split
[ 'type="NN"', 'span="123..145"', 'confidence="1.0"' ]
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
match_obj = search(pattern, substring)
print match_obj.group(1)
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.
I'm trying to split an inputted document at specific characters. I need to split them at [ and ] but I'm having a difficult time figuring this out.
def main():
for x in docread:
words = x.split('[]')
for word in words:
doclist.append(word)
this is the part of the code that splits them into my list. However, it is returning each line of the document.
For example, I want to convert
['I need to [go out] to lunch', 'and eat [some food].']
to
['I need to', 'go out', 'to lunch and eat', 'some food', '.']
Thanks!
You could try using re.split() instead:
>>> import re
>>> re.split(r"[\[\]]", "I need to [go out] to lunch")
['I need to ', 'go out', ' to lunch']
The odd-looking regular expression [\[\]] is a character class that means split on either [ or ]. The internal \[ and \] must be backslash-escaped because they use the same characters as the [ and ] to surround the character class.
str.split() splits at the exact string you pass to it, not at any of its characters. Passing "[]" would split at occurrences of [], but not at individual brackets. Possible solutions are
splitting twice:
words = [z for y in x.split("[") for z in y.split("]")]
using re.split().
string.split(s), the one you are using, treats the entire content of 's' as a separator. In other words, you input should've looked like "[]'I need to []go out[] to lunch', 'and eat []some food[].'[]" for it to give you the results you want.
You need to use split(s) from the re module, which will treat s as a regex
import re
def main():
for x in docread:
words = re.split('[]', x)
for word in words:
doclist.append(word)