I want to split a string in Python using multiple delimiters. I also want the delimiters that were actually used returned in a separate list.
Example:
string = '1000+20-12+123-165-564'
(Methods which split the string and return lists with numbers and delimiter)
numbers = ['1000', '20', '12', '123', '165', '564']
delimiter = ['+', '-', '+', '-', '-']
I hope my question is understandable.
You might use re.split for this task in the following way:
import re
string = '1000+20-12+123-165-564'
elements = re.split(r'(\d+)',string) # note capturing group
print(elements) # ['', '1000', '+', '20', '-', '12', '+', '123', '-', '165', '-', '564', '']
numbers = elements[1::2] # last 2 is step, get every 2nd element, start at index 1
delimiter = elements[2:-1:2] # every 2nd element from index 2, stopping before the trailing ''
print(numbers) # ['1000', '20', '12', '123', '165', '564']
print(delimiter) # ['+', '-', '+', '-', '-']
Just capture (...) the delimiter along with matching/splitting with re.split:
import re
s = '1000+20-12+123-165-564'
parts = re.split(r'([+-])', s)
numbers, delims = parts[::2], parts[1::2]
print(numbers, delims)
['1000', '20', '12', '123', '165', '564'] ['+', '-', '+', '-', '-']
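The same idea can be wrapped in a small helper that accepts any set of literal delimiters. This is only a sketch: the function name `split_keep_delimiters` is made up for illustration, and it assumes the string neither starts nor ends with a delimiter (otherwise empty strings appear among the pieces).

```python
import re

def split_keep_delimiters(text, delimiters):
    """Split text on any of the delimiters; return (pieces, delimiters_used)."""
    # re.escape makes characters such as '+' or '*' literal. Longer delimiters
    # go first so that a shorter one cannot match a prefix of a longer one.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = '(' + '|'.join(re.escape(d) for d in ordered) + ')'
    # The capturing group makes re.split keep the matched delimiters.
    parts = re.split(pattern, text)
    # Even positions hold the pieces, odd positions hold the delimiters.
    return parts[::2], parts[1::2]

numbers, delims = split_keep_delimiters('1000+20-12+123-165-564', ['+', '-'])
print(numbers)  # ['1000', '20', '12', '123', '165', '564']
print(delims)   # ['+', '-', '+', '-', '-']
```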
So I have this txt file:
Haiku
5 *
7 *
5 *
Limerick
8 A
8 A
5 B
5 B
8 A
And I want to write a function that returns something like this:
[['Haiku', '5', '*', '7', '*', '5', '*'], ['Limerick', '8', 'A', '8', 'A', '5', 'B', '5', 'B', '8' ,'A']]
I've tried this:
small_pf = open('datasets/poetry_forms_small.txt')
lst = []
for line in small_pf:
    lst.append(line.strip())
small_pf.close()
print(lst)
At the end I end up with this:
['Haiku', '5 *', '7 *', '5 *', '', 'Limerick', '8 A', '8 A', '5 B', '5 B', '8 A']
My problem is that this is one big list, and the elements of the list are attached together, like '5 *' or '8 A'.
I honestly don't know where to start, and that's why I need some guidance on how to handle those two problems.
Any help would be greatly appreciated.
When you see an empty line, don't add it: save the temporary list you've been filling, start a new one, and continue.
lst = []
with open('test.txt') as small_pf:
    tmp_list = []
    for line in small_pf:
        line = line.rstrip("\n")
        if line == "":
            lst.append(tmp_list)
            tmp_list = []
        else:
            tmp_list.extend(line.split())
    if tmp_list: # add last one
        lst.append(tmp_list)
print(lst)
# [['Haiku', '5', '*', '7', '*', '5', '*'],
# ['Limerick', '8', 'A', '8', 'A', '5', 'B', '5', 'B', '8', 'A']]
First split the file into sections on blank lines (\n\n), then split each section on any whitespace (newlines or spaces).
lst = [section.split() for section in small_pf.read().split('\n\n')]
Result:
[['Haiku', '5', '*', '7', '*', '5', '*'],
['Limerick', '8', 'A', '8', 'A', '5', 'B', '5', 'B', '8', 'A']]
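For a self-contained version of this idea, here is a sketch that takes the file contents as a string. The function name `parse_sections` and the sample text are made up for illustration. Unlike a plain split on '\n\n', it also tolerates runs of several blank lines and blank lines that contain spaces.

```python
import re

def parse_sections(text):
    """Return one flat list of whitespace-separated tokens per section."""
    # A section break is a newline followed by one or more blank lines,
    # where a blank line may still contain spaces or tabs.
    sections = re.split(r'\n(?:[ \t]*\n)+', text.strip())
    return [section.split() for section in sections if section.strip()]

sample = "Haiku\n5 *\n7 *\n5 *\n\nLimerick\n8 A\n8 A\n5 B\n5 B\n8 A\n"
print(parse_sections(sample))
# [['Haiku', '5', '*', '7', '*', '5', '*'],
#  ['Limerick', '8', 'A', '8', 'A', '5', 'B', '5', 'B', '8', 'A']]
```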
Solution without using extra modules
small_pf = small_pf.readlines()
result = []
tempList = []
for line in small_pf:
    if line.strip() == "":
        # blank line: close the current section
        if tempList:
            result.append(tempList)
            tempList = []
    else:
        tempList.extend(line.split())
# the last section has no blank line after it, so add it here
if tempList:
    result.append(tempList)
result
Solution with module
You can use regex to solve your problem:
import re
small_pf = small_pf.read()
[re.split(r"\s+", x) for x in re.split(r"\n\n", small_pf.strip())]
Output
[['Haiku', '5', '*', '7', '*', '5', '*'],
['Limerick', '8', 'A', '8', 'A', '5', 'B', '5', 'B', '8', 'A']]
This approach assumes that every non-empty line starts with either a decimal digit or some other character. A line that starts with a non-digit begins a new list, with the line (as a string, without trailing whitespace) as its first element. A line that starts with a digit is stripped of trailing whitespace and split on spaces, and its parts are added to the most recently created list.
lst = []
with open("blankpaper.txt") as f:
    for line in f:
        # ignore empty lines
        if line.rstrip() == '':
            continue
        if not line[0].isdecimal():
            new_list = [line.rstrip()]
            lst.append(new_list)
            continue
        new_list.extend(line.rstrip().split(" "))
print(lst)
Output
[['Haiku', '5', '*', '7', '*', '5', '*'], ['Limerick', '8', 'A', '8', 'A', '5', 'B', '5', 'B', '8', 'A']]
I hope this helps. If there are any questions, please let me know.
I have a list:
output = ['9', '-', '-', '7', '-', '4', '4', '-', '3', '-', '0', '2']
and I'm trying to reduce the '-', '-' section to just a single '-', but I haven't had much luck.
final = [output[i] for i in range(len(output)) if output[i] != output[i-1]]
final = 9-7-4-3-02
I've tried that above, but it also reduces the '4','4' to only '4'. So any help would be great.
You should check if the item is equal to the previous item and to '-', which can easily be done in Python using a == b == c.
Note that you should also handle the first character differently, since output[0] == output[0-1] will compare the first item with the last item, which might lead to invalid results.
The following code will handle this:
final = [output[0]] + [output[i] for i in range(1, len(output)) if not (output[i] == output[i-1] == '-')]
The zip() function is your friend for situations where you need to compare/process elements and their predecessor:
final = [a for a,b in zip(output,['']+output) if (a,b) != ('-','-')]
You can use itertools.groupby:
from itertools import groupby as gb
output = ['9', '-', '-', '7', '-', '4', '4', '-', '3', '-', '0', '2']
r = [j for a, b in gb(output) for j in ([a] if a == '-' else b)]
Output:
['9', '-', '7', '-', '4', '4', '-', '3', '-', '0', '2']
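If this comes up more than once, the logic can be put in a small function. A minimal sketch, with the hypothetical name `collapse_repeats`, that collapses consecutive repeats of one chosen value and leaves all other repeats (such as '4', '4') alone:

```python
def collapse_repeats(items, target='-'):
    """Drop an item when it equals target and the previously kept item does too."""
    result = []
    for item in items:
        if item == target and result and result[-1] == target:
            continue  # skip the repeated target
        result.append(item)
    return result

output = ['9', '-', '-', '7', '-', '4', '4', '-', '3', '-', '0', '2']
print(collapse_repeats(output))
# ['9', '-', '7', '-', '4', '4', '-', '3', '-', '0', '2']
```

Because it compares against the last item already kept rather than `output[i-1]`, it needs no special case for the first element and also works on an empty list.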
How can I split a string that includes "sign characters" but no spaces? For example:
aString = '1+20*40-3'
I want the output to be:
['1', '+', '20', '*', '40', '-', '3']
I tried this:
aString.split('+' and '*' and '-')
but that didn't work.
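You can use a regular expression for this task in Python. The hyphen goes first in the character class so that it is a literal '-' rather than part of a range. The code is:
import re
aString = '1+20*40-3'
print(re.findall(r'[-+/*]|\d+', aString))
Output:
['1', '+', '20', '*', '40', '-', '3']
See the documentation for Python's re module for details.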
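One detail worth checking when operators are put in a character class: a hyphen between two other characters is read as a range. In `[+-/*]`, the part `+-/` covers every character from `+` to `/`, which also includes `,` and `.`. Putting the hyphen first (or escaping it) keeps it literal. The input strings below are made up to show the difference.

```python
import re

# '+-/' inside a class is a range covering '+', ',', '-', '.', '/'
range_class = re.compile(r'[+-/*]|\d+')
# A leading hyphen is literal, so only the four operators match
literal_class = re.compile(r'[-+/*]|\d+')

print(range_class.findall('1,5+2.5'))    # ['1', ',', '5', '+', '2', '.', '5']
print(literal_class.findall('1,5+2.5'))  # ['1', '5', '+', '2', '5']
print(literal_class.findall('1+20*40-3'))
# ['1', '+', '20', '*', '40', '-', '3']
```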
I have a lot of python strings such as "A7*4", "Z3+8", "B6 / 11", and I want to split these strings so that they would be in a list, in the format ["A7", "*", "4"], ["B6", "/", "11"], etc. I have used a lot of different split methods but I think I need to just perform the split where there is a math symbol, such as /,*,+,-. I would also need to strip out the whitespace.
Currently I am using the code re.split(r'(\D)', "B6 / 11"), which is returning ['', 'B', '6', ' ', '', '/', '', ' ', '11']. Instead I want to get back ["B6", "/", "11"].
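You should split on the character set [-+/*] after removing the whitespace from the string. The hyphen is placed first so that it is a literal '-'; written as [+-/*], the part +-/ would be a range that also matches ',' and '.':
>>> import re
>>> def mysplit(mystr):
...     return re.split("([-+/*])", mystr.replace(" ", ""))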
...
>>> mysplit("A7*4")
['A7', '*', '4']
>>> mysplit("Z3+8")
['Z3', '+', '8']
>>> mysplit("B6 / 11")
['B6', '/', '11']
>>>
There is a way to solve this without regular expressions by using the Python tokenizer. I used a more complex formula to show the capabilities of this solution.
from io import StringIO
import tokenize
formula = "(A7*4) - (Z3+8) - ( B6 / 11)"
print([token[1] for token in tokenize.generate_tokens(StringIO(formula).readline) if token[1]])
Result:
['(', 'A7', '*', '4', ')', '-', '(', 'Z3', '+', '8', ')', '-', '(', 'B6', '/', '11', ')']
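A variant sketch of the same approach that selects tokens by type instead of filtering out empty strings. The function name `formula_tokens` is hypothetical, and the sketch assumes the formula is a single line that Python's own tokenizer accepts (no leading whitespace, balanced parentheses).

```python
import tokenize
from io import StringIO

def formula_tokens(formula):
    """Return the name, number and operator tokens of a formula string."""
    wanted = (tokenize.NAME, tokenize.NUMBER, tokenize.OP)
    readline = StringIO(formula).readline
    # generate_tokens also yields NEWLINE and ENDMARKER tokens; keep only
    # the token types that carry formula content.
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type in wanted]

print(formula_tokens("(A7*4) - (Z3+8) - ( B6 / 11)"))
# ['(', 'A7', '*', '4', ')', '-', '(', 'Z3', '+', '8', ')', '-', '(', 'B6', '/', '11', ')']
```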
If I have a string
'x+13.5*10x-4e1'
how can I split it into the following list of tokens?
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
Currently I'm using the shlex module:
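import shlex

def tokenize(expression):
    # wrapped in a function so that the return statement is valid;
    # the argument is no longer called str, which would shadow the built-in
    lexer = shlex.shlex(expression)
    tokenList = []
    for token in lexer:
        tokenList.append(str(token))
    return tokenList

tokenize('x+13.5*10x-4e1')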
But this returns:
['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']
So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers then somehow splitting them, but not sure about how to do this or how to add them all back into the list with the others afterwards. It's important that the tokens stay in order, and I can't have nested lists.
In an ideal world, e and E would not be recognised as letters in the same way, so
'-4e1'
would become
['-', '4e1']
but
'-4x1'
would become
['-', '4', 'x', '1']
Can anybody help?
Use the regular expression module's split() function to split at
'\d+' -- digits (number characters) and
'\W+' -- non-word characters:
CODE:
import re
print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
If you don't want to separate the dot (so that a floating-point number in the expression stays in one piece), use this instead:
[\d.]+ -- digit or dot characters (although this also accepts malformed input such as 13.5.5)
CODE:
print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
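For the "ideal world" behaviour described in the question, where 4e1 stays together but 4x1 is split, one option is to match tokens with re.findall instead of splitting. This is a sketch under assumptions: the function name `tokenize_expr` is made up, a number is taken to be a run of digits with an optional unsigned e/E exponent, every letter is its own token, and the dot stays a separate token as in the question's expected output.

```python
import re

# Alternatives are tried left to right at each position:
#   digits with an optional exponent, then a single letter,
#   then any other non-space character (operators, dots, ...).
TOKEN = re.compile(r'\d+(?:[eE]\d+)?|[A-Za-z]|\S')

def tokenize_expr(expression):
    """Return the tokens of expression in order, as one flat list."""
    return TOKEN.findall(expression)

print(tokenize_expr('x+13.5*10x-4e1'))
# ['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4e1']
print(tokenize_expr('-4e1'))  # ['-', '4e1']
print(tokenize_expr('-4x1'))  # ['-', '4', 'x', '1']
```

The exponent is only absorbed when digits follow the e, so an e that is followed by anything else is still treated as an ordinary letter.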
Another alternative not suggested here is to use the nltk.tokenize module.
The problem is not quite as simple as it seems. I think a good way to get a robust (but, unfortunately, not so short) solution is to use Python Lex-Yacc (PLY) to build a full tokenizer. Lex and Yacc are common tools for this, not only in Python, so ready-made grammars for a simple arithmetic tokenizer may already exist, and you only have to adapt one to your specific needs.