python: splitting a long string at distinct places in one run

I am entirely new to programming and just yesterday started learning python for scientific purposes.
Now, I would like to split a single very long string (174 chars) into several smaller as follows:
string = 'AA111-99XYZ '
split = ('AA', 11, 1, -99, 'XYZ')
Right now, the only thing I can think of is to use the slice syntax x-times, but maybe there is a more elegant way? Is there a way to use a list of integers to indicate the positions of where to split, e.g.
split_at = (2, 4, 5, 8, 11)
split = function(split_at, string)
I hope my question is not too silly - I couldn't find a similar example, but maybe I just don't know what I'm looking for?
Thanks,
Jan

Like this:
>>> string = 'AA111-99XYZ '
>>> split_at = [2, 4, 5, 8, 11]
>>> [string[i:j] for i, j in zip([0]+split_at, split_at+[None])]
['AA', '11', '1', '-99', 'XYZ', ' ']

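def split_string(string, points):
    # points must include both ends, e.g. [0, 2, 4, 5, 8, 11, len(string)];
    # otherwise the first and last pieces are dropped
    for left, right in zip(points, points[1:]):
        yield string[left:right]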

To avoid redundancy, you could take ATOzTOA's nice solution and put it in a lambda function:
st = 'AA111-99XYZ '
sa = [2, 4, 5, 8, 11]
res = lambda string,split_at:[string[i:j] for i, j in zip([0]+split_at, split_at+[None])]
print(res(st,sa))
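The question also asked for the numeric pieces as integers (11, 1, -99). A minimal sketch of that, assuming a piece should become an int whenever int() accepts it; the names split_at_positions and to_number are made up for this illustration:

```python
def split_at_positions(text, positions):
    """Split text at the given indices, keeping the first and last pieces."""
    starts = [0] + list(positions)
    ends = list(positions) + [None]  # None means "up to the end of the string"
    return [text[i:j] for i, j in zip(starts, ends)]


def to_number(piece):
    """Return int(piece) if it parses as an integer, else the piece unchanged."""
    try:
        return int(piece)
    except ValueError:
        return piece


pieces = split_at_positions('AA111-99XYZ ', [2, 4, 5, 8, 11])
converted = [to_number(p) for p in pieces]
print(pieces)     # ['AA', '11', '1', '-99', 'XYZ', ' ']
print(converted)  # ['AA', 11, 1, -99, 'XYZ', ' ']
```

A named function is usually easier to read and test than a lambda bound to a variable.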

Being relatively new to Python myself, I took the approach of a complete beginner here just to help guide someone who isn't yet familiar with the power of Python.
string = 'AA111-99XYZ '
split_at = [2, 4, 5, 8, 11]
for i in range(len(split_at)):
    if i == 0:
        print(string[:split_at[i]])
    if i < len(split_at)-1:
        print(string[split_at[i]:split_at[i+1]])
    if i == len(split_at)-1:
        print(string[split_at[i]:])

Related

Iterate through a string in chunks of different sizes python

So I am working with files in python, feel like there is a name for them but I'm not sure what it is. They are like csv files but with no separator. Anyway in my file I have lots of lines of data where the first 7 characters are an ID number then the next 5 are something else and so on. So I want to go through the file reading each line and splitting it up and storing it into a list. Here is an example:
From the file: "0030108102017033119080001010048000000"
These are the chunks I would like to split the string into: [7, 2, 8, 6, 2, 2, 5, 5] Each number represents the length of each chunk.
First I tried this:
n = [7, 2, 8, 6, 2, 2, 5, 5]
for i in range(0, 37, n):
    print(i)
Naturally this didn't work, so now I've started thinking about possible methods and they all seem quite complex. I looked around online and couldn't seem to find anything, only even sized chunks. So any input?
EDIT: The answer I'm looking for should in this case look like this:
['0030108', '10', '20170331', '190800', '01', '01', '00480', '00000']
Where each value in the list n represents the length of each chunk.

If these are ASCII strings (or rather, one byte per character), I might use struct.unpack for this.
>>> import struct
>>> sizes = [7, 2, 8, 6, 2, 2, 5, 5]
>>> struct.unpack(''.join("%ds" % x for x in sizes), "0030108102017033119080001010048000000")
('0030108', '10', '20170331', '190800', '01', '01', '00480', '00000')
>>>
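The session above looks like Python 2, where str is a byte string. On Python 3, struct.unpack needs a bytes-like buffer, so a rough equivalent (assuming the line is pure ASCII) encodes first and decodes each field afterwards:

```python
import struct

sizes = [7, 2, 8, 6, 2, 2, 5, 5]
line = "0030108102017033119080001010048000000"

# Build a format such as '7s2s8s...'; each 'Ns' field is N raw bytes.
fmt = ''.join('%ds' % n for n in sizes)

# The buffer length must match struct.calcsize(fmt) exactly (37 here).
fields = struct.unpack(fmt, line.encode('ascii'))
chunks = [f.decode('ascii') for f in fields]
print(chunks)
# ['0030108', '10', '20170331', '190800', '01', '01', '00480', '00000']
```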
Otherwise, you can construct the necessary slice objects from partial sums of the sizes, which is simple to do if you are using Python 3:
>>> import itertools
>>> s = "0030108102017033119080001010048000000"
>>> psums = list(itertools.accumulate([0] + sizes))
>>> [s[slice(*i)] for i in zip(psums, psums[1:])]
['0030108', '10', '20170331', '190800', '01', '01', '00480', '00000']
accumulate can be implemented in Python 2 with something like
def accumulate(itr):
    total = 0
    for x in itr:
        total += x
        yield total

from itertools import accumulate, chain
s = "0030108102017033119080001010048000000"
n = [7, 2, 8, 6, 2, 2, 5, 5]
ranges = list(accumulate(n))
list(map(lambda i: s[i[0]:i[1]], zip(chain([0], ranges), ranges)))
# ['0030108', '10', '20170331', '190800', '01', '01', '00480', '00000']

Could you try this?
for line in file:
    n = [7, 2, 8, 6, 2, 2, 5, 5]
    total = 0
    for i in n:
        print(line[total:total+i])
        total += i
This is how I might have done it. The code iterates through each line in the file and, for each line, iterates through the list of lengths in n. You can do something other than print with each piece; the point is that every step takes a slice of the line. The total variable keeps track of how far into the line we are.
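The same running-total idea can be wrapped in a function that returns a list instead of printing, which matches the output the question asked for. This is only a sketch; split_by_lengths is a name chosen for the example:

```python
def split_by_lengths(line, lengths):
    """Cut line into consecutive pieces whose sizes are given by lengths."""
    pieces = []
    start = 0  # how far into the line we are
    for length in lengths:
        pieces.append(line[start:start + length])
        start += length
    return pieces


n = [7, 2, 8, 6, 2, 2, 5, 5]
result = split_by_lengths("0030108102017033119080001010048000000", n)
print(result)
# ['0030108', '10', '20170331', '190800', '01', '01', '00480', '00000']
```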

Here's a generator that yields the chunks by iterating through the characters of the string and forming substrings from them. You can use this to process any iterable in this fashion:
def chunks(s, sizes):
    it = iter(s)
    for size in sizes:
        l = []
        try:
            for _ in range(size):
                l.append(next(it))
        finally:
            yield ''.join(l)
s="0030108102017033119080001010048000000"
n = [7, 2, 8, 6, 2, 2, 5, 5]
print(list(chunks(s, n)))
# ['0030108', '10', '20170331', '190800', '01', '01', '00480', '00000']
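A variant of the same generator idea, sketched with itertools.islice in place of the manual next() loop. If the input runs out early, islice simply returns fewer items, so no StopIteration is raised inside the generator (which Python 3.7+ would turn into a RuntimeError). The name chunks_islice is only for illustration:

```python
from itertools import islice


def chunks_islice(iterable, sizes):
    """Yield consecutive string chunks of the given sizes from any iterable of characters."""
    it = iter(iterable)
    for size in sizes:
        # islice takes up to `size` items and stops quietly if the input runs out
        yield ''.join(islice(it, size))


s = "0030108102017033119080001010048000000"
n = [7, 2, 8, 6, 2, 2, 5, 5]
print(list(chunks_islice(s, n)))
# ['0030108', '10', '20170331', '190800', '01', '01', '00480', '00000']
```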

Python: how to count the occurrence of specific pattern with overlap in a list or string?

My problem is logically easy, but hard for me to implement. I have a list of numbers (or a string of numbers; it is easy to convert between the two), and I'd like to count the occurrences of some specific patterns with overlap. For example, given the list below:
A = [0, 1, 2, 4, 5, 8, 4, 4, 5, 8, 2, 4, 4, 5, 5, 8, 9, 10, 3, 2]
If "4,5,8" occurs, I count a1 = 1, a2 = 1, a3 = 1. If "4,4,5,8" occurs, I count a1 = 2, a2 = 1, a3 = 1. If "4,4,5,5,5,8,8,8" occurs, I count a1 = 2, a2 = 3, a3 = 3. That is, a pattern counts only if it includes at least "4,5,8" in this order. "4,5,9" doesn't count. "4,4,4,5,5,2,8" doesn't count at all. For "4,5,4,5,8", a1 = 1, a2 = 1, a3 = 1.
Thank you all for your help.

You can match patterns like this using regular expressions.
https://regexr.com/ is a super helpful tool for experimenting with/learning about regular expressions.
The built in module re does the job:
import re
def make_regex_object(list_of_chars):
    # make a re object out of [a, b, ..., n]: 'a+b+...n+' (note the '+' at end)
    # the '+' means it matches one or more occurrences of each character
    return re.compile('+'.join([str(char) for char in list_of_chars]) + '+')
searcher = make_regex_object(['a', 'b', 'c', 'd'])
searcher.pattern # 'a+b+c+d+'
x = searcher.search('abczzzzabbbcddefaaabbbccceeabc')
# caution - search only matches first instance of pattern
print(x) # <_sre.SRE_Match object; span=(7, 14), match='abbbcdd'>
x.end() # 14
x.group() # 'abbbcdd'
You could then repeat this on the remainder of your string if you want to count multiple instances of the pattern. You can count character occurrences with x.group().count(char) or something better.
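Instead of calling search repeatedly on the remainder, re.finditer walks through all non-overlapping matches in one pass. The sketch below assumes every item is a single character (so a value like 10 would need a different encoding) and uses one capturing group per symbol so the run lengths can be read off directly:

```python
import re

# One group per symbol: runs of 4s, then 5s, then 8s, in that order.
pattern = re.compile(r'(4+)(5+)(8+)')

# Single-digit items from the question's list, joined into a string.
digits = "01245844582445589"

# For each match, the length of each group is the count for that symbol.
counts = [tuple(len(g) for g in m.groups()) for m in pattern.finditer(digits)]
print(counts)
# [(1, 1, 1), (2, 1, 1), (2, 2, 1)]
```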

I tried to use
re.findall(r'26+8',test_string)
This will output substrings like "266666668", "268", "26668" in a non-overlapping way. However, what if I want to search for a pattern like "2(6+8+)+7"? (This syntax doesn't work in re for me.) What I essentially want is a pattern like "266868688887", in which there is back-and-forth movement between 6 and 8; once you reach 7, the search is done. Does anyone know the right way to express this pattern in re? Thanks!
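One likely reason the grouped pattern appeared not to work: when a pattern contains a capturing group, re.findall returns the group's contents rather than the whole match. A non-capturing group, (?:...), keeps the repetition but makes findall return the full match again. A small sketch with a made-up test string:

```python
import re

test_string = "xx266868688887yy2687"

# Capturing group: findall returns only the last repetition of the group.
captured = re.findall(r'2(6+8+)+7', test_string)
print(captured)  # ['68888', '68']

# Non-capturing group: findall returns the whole matched substrings.
whole = re.findall(r'2(?:6+8+)+7', test_string)
print(whole)     # ['266868688887', '2687']
```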

Recreating a sentence from the position of each word in it

I am looking to develop a program that identifies individual words in a sentence and replaces each word with the position of each word in the list.
For example, this sentence:
HELLO I NEED SOME HELP IN PYTHON PLEASE CAN YOU HELP ME IN PYTHON
This contains 11 different words and I'd like my program to recreate the sentence from the positions of these 11 words in a list:
1,2,3,4,5,6,7,8,9,10,5,11,6,7
I'd then like to save this new list in a separate file. So far I have only gotten this:
#splitting my string to individual words
my_string = "HELLO I NEED SOME HELP IN PYTHON PLEASE CAN YOU HELP ME IN PYTHON"
splitted = my_string.split()

>>> my_string = "HELLO I NEED SOME HELP IN PYTHON PLEASE CAN YOU HELP ME IN PYTHON"
>>> splitted = my_string.split()
>>> order = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 5, 11, 6, 7
>>> new_str = ' '.join(splitted[el] for el in order)
>>> new_str
'I NEED SOME HELP IN PYTHON PLEASE CAN YOU HELP IN ME PYTHON PLEASE'
Updated according to your comment:
You are looking for index() method.
my_string = "HELLO I NEED SOME HELP IN PYTHON PLEASE CAN YOU HELP ME IN PYTHON"
splitted = my_string.split()
test = "I NEED SOME HELP IN PYTHON PLEASE CAN YOU HELP IN ME PYTHON PLEASE".split()
print(', '.join(str(splitted.index(el)) for el in test))
# 1, 2, 3, 4, 5, 6, 7, 8, 9, 4, 5, 11, 6, 7
** we suppose that there are no repeating words

my_string = "HELLO I NEED SOME HELP IN PYTHON PLEASE CAN YOU HELP ME IN PYTHON"
splitted = my_string.split()
d = {}
l=[]
for i,j in enumerate(splitted):
    if j in d:
        l.append(d[j])
    else:
        d[j]=i
        l.append(i)
print(l)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 4, 11, 5, 6]

Try:
>>> from collections import OrderedDict
>>> my_string = "HELLO I NEED SOME HELP IN PYTHON PLEASE CAN YOU HELP ME IN PYTHON"
>>> splitted = my_string.split()
>>> key_val = {elem : index + 1 for index, elem in enumerate(list(OrderedDict.fromkeys(splitted)))}
>>> [key_val[elem] for elem in splitted]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 5, 11, 6, 7]
list(OrderedDict.fromkeys(splitted)) creates a list of the unique elements of splitted, in order of first appearance.
key_val is a dictionary with these unique elements as keys and their 1-based positions as values.
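The same numbering can be sketched with a plain dictionary that hands out the next position the first time a word is seen. The function name word_positions is an assumption for this example, not something from the answers above:

```python
def word_positions(sentence):
    """Replace each word by the 1-based position of its first appearance among the unique words."""
    first_seen = {}
    positions = []
    for word in sentence.split():
        if word not in first_seen:
            # A new word gets the next free position number.
            first_seen[word] = len(first_seen) + 1
        positions.append(first_seen[word])
    return positions


my_string = "HELLO I NEED SOME HELP IN PYTHON PLEASE CAN YOU HELP ME IN PYTHON"
positions = word_positions(my_string)
print(positions)
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 5, 11, 6, 7]
```

To save the result, writing ','.join(map(str, positions)) to a file opened with open(path, 'w') would be enough.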

Try this:
sentence= "HELLO I NEED SOME HELP IN PYTHON PLEASE CAN YOU HELP ME IN PYTHON"
lst = sentence.split()
lst2= []
for i in lst:
    if i not in lst2:
        lst2.append(i)
inp = inputSentence.split()
output=[]
for i in inp:
    print lst2.index(i)+1,
    output.append(lst2.index(i)+1)
The unique words are stored in lst2, and the position of each input word is looked up there and appended to output. Pass your input string as inputSentence to test this code.

benstring = input ('enter a sentence ')
print('the sentence is', benstring)
ben2string = str.split(benstring)
print('the sentence now looks like this',ben2string)
x=len(ben2string)
for i in range(0,x):
if x is

Sentence = input("Enter a sentence!")
s = Sentence.split() #Splits the numbers/words up so you can see them individually.
another = [0]
for count, i in enumerate(s):
    if s.count(i) < 2:
        another.append(max(another) + 1) #adds +1 each time a word is used, showing the position of each word.
    else:
        another.append(s.index(i) +1)
another.remove(0)
print(another) #prints the number that word is in.

Can't seem to add percentage at last in this list?

I have a list like this:
la = [3, 7, 8, 9, 50, 100]
I am trying to convert the list into a string and add a percent sign to each element, like this:
3%, 7%, 8% and lastly 100%
I am doing it like this:
str1 = '%, '.join(str(e) for e in sorted_list)
It does the right thing, except there's no percent sign after the last element, i.e. I get only 100 but I want 100%. How can I do this? Thanks

', '.join('%d%%' % (x,) for x in la)

Yup, it's doing exactly what it's supposed to: putting '%, ' BETWEEN each element. How about simply ...
str1 = '%, '.join(str(e) for e in sorted_list) + '%'
(if sorted_list can ever be empty, you should handle that case)
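A sketch of the empty-list point: formatting each element before joining avoids the special case, while appending '%' after the join leaves a stray '%' for an empty list. The function name format_percentages is made up for the example:

```python
def format_percentages(values):
    """Join values as 'N%' items; an empty input gives an empty string."""
    return ', '.join('{}%'.format(v) for v in values)


la = [3, 7, 8, 9, 50, 100]
print(format_percentages(la))   # 3%, 7%, 8%, 9%, 50%, 100%
print(repr(format_percentages([])))  # ''

# The "append after join" approach needs a guard for the empty case:
appended = '%, '.join(str(e) for e in []) + '%'
print(repr(appended))  # '%'
```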

Try this:
print map(lambda n: str(n) + '%', [3, 7, 8, 9, 50, 100])
Which creates a list of all of them converted to strings with a percent.
If you still want a string, you can slightly modify it like this:
str1 = ', '.join(map(lambda n: str(n) + '%', [3, 7, 8, 9, 50, 100]))

la = [3, 7, 8, 9, 50, 100]
(lambda k: ' and lastly '.join([', '.join(k[:-1]), k[-1]]))(["%s%%" % a for a in la])
# '3%, 7%, 8%, 9%, 50% and lastly 100%'
EDIT: small subscripting mistake (then again, as Rohit Jain says... :D )

python 3.2
str1 = ','.join(str(e)+"%" for e in sorted_list)

pythonic format for indices

I am after a string format to efficiently represent a set of indices.
For example "1-3,6,8-10,16" would produce [1,2,3,6,8,9,10,16]
Ideally I would also be able to represent infinite sequences.
Is there an existing standard way of doing this? Or a good library? Or can you propose your own format?
thanks!
Edit: Wow! - thanks for all the well considered responses. I agree I should use ':' instead. Any ideas about infinite lists? I was thinking of using "1.." to represent all positive numbers.
The use case is for a shopping cart. For some products I need to restrict product sales to multiples of X, for others any positive number. So I am after a string format to represent this in the database.

You don't need a string for that, This is as simple as it can get:
from types import SliceType
class sequence(object):
    def __getitem__(self, item):
        for a in item:
            if isinstance(a, SliceType):
                i = a.start
                step = a.step if a.step else 1
                while True:
                    if a.stop and i > a.stop:
                        break
                    yield i
                    i += step
            else:
                yield a
print list(sequence()[1:3,6,8:10,16])
Output:
[1, 2, 3, 6, 8, 9, 10, 16]
I'm using Python slice type power to express the sequence ranges. I'm also using generators to be memory efficient.
Please note that I treat the slice stop as inclusive (the loop only breaks once i exceeds a.stop); otherwise the ranges would differ, because the stop of a regular slice is not included.
It supports steps:
>>> list(sequence()[1:3,6,8:20:2])
[1, 2, 3, 6, 8, 10, 12, 14, 16, 18, 20]
And infinite sequences:
sequence()[1:3,6,8:]
1, 2, 3, 6, 8, 9, 10, ...
If you have to give it a string, you can combine @ilya n.'s parser with this solution. I'll extend @ilya n.'s parser to support single indexes as well as ranges:
def parser(input):
    ranges = [a.split('-') for a in input.split(',')]
    return [slice(*map(int, a)) if len(a) > 1 else int(a[0]) for a in ranges]
Now you can use it like this:
>>> print list(sequence()[parser('1-3,6,8-10,16')])
[1, 2, 3, 6, 8, 9, 10, 16]

If you're into something Pythonic, I think 1:3,6,8:10,16 would be a better choice, as x:y is a standard notation for index range and the syntax allows you to use this notation on objects. Note that the call
z[1:3,6,8:10,16]
gets translated into
z.__getitem__((slice(1, 3, None), 6, slice(8, 10, None), 16))
Even though this is a TypeError if z is a built-in container, you're free to create the class that will return something reasonable, e.g. as NumPy's arrays.
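To see that translation directly, a tiny class whose __getitem__ just returns its argument is enough. This is only a demonstration; ShowKey is a made-up name:

```python
class ShowKey:
    """Return whatever the subscript syntax passes to __getitem__."""

    def __getitem__(self, key):
        return key


key = ShowKey()[1:3, 6, 8:10, 16]
print(key)
# (slice(1, 3, None), 6, slice(8, 10, None), 16)

# A single subscript arrives as a bare value, not as a one-element tuple.
single = ShowKey()[5:]
print(single)
# slice(5, None, None)
```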
You might also say that by convention 5: and :5 represent infinite index ranges (this is a bit stretched as Python has no built-in types with negative or infinitely large positive indexes).
And here's the parser (a beautiful one-liner that suffers from slice(16, None, None) glitch described below):
def parse(s):
    return [slice(*map(int, x.split(':'))) for x in s.split(',')]
There's one pitfall, however: 8:10 by definition includes only indices 8 and 9 -- without upper bound. If that's unacceptable for your purposes, you certainly need a different format and 1-3,6,8-10,16 looks good to me. The parser then would be
def myslice(start, stop=None, step=None):
    return slice(start, (stop if stop is not None else start) + 1, step)
def parse(s):
    return [myslice(*map(int, x.split('-'))) for x in s.split(',')]
Update: here's the full parser for a combined format:
from sys import maxsize as INF
def indices(s: 'string with indices list') -> 'indices generator':
    for x in s.split(','):
        splitter = ':' if (':' in x) or (x[0] == '-') else '-'
        ix = x.split(splitter)
        start = int(ix[0]) if ix[0] != '' else -INF
        if len(ix) == 1:
            stop = start + 1
        else:
            stop = int(ix[1]) if ix[1] != '' else INF
        step = int(ix[2]) if len(ix) > 2 else 1
        for y in range(start, stop + (splitter == '-'), step):
            yield y
This handles negative numbers as well, so
print(list(indices('-5, 1:3, 6, 8:15:2, 20-25, 18')))
prints
[-5, 1, 2, 6, 7, 8, 10, 12, 14, 20, 21, 22, 23, 24, 25, 18, 19]
Yet another alternative is to use ... (which Python recognizes as the built-in constant Ellipsis so you can call z[...] if you want) but I think 1,...,3,6, 8,...,10,16 is less readable.

This is probably about as lazily as it can be done, meaning it will be okay for even very large lists:
def makerange(s):
    for nums in s.split(","): # whole list comma-delimited
        range_ = nums.split("-") # number might have a dash - if not, no big deal
        start = int(range_[0])
        for i in xrange(start, start + 1 if len(range_) == 1 else int(range_[1]) + 1):
            yield i
s = "1-3,6,8-10,16"
print list(makerange(s))
output:
[1, 2, 3, 6, 8, 9, 10, 16]
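A Python 3 sketch of the same lazy generator, extended to the infinite sequences the question asked about. The open-ended notation "N-" (a start with nothing after the dash) is an assumption made for this example, not an established format, and negative numbers are not handled:

```python
from itertools import count, islice


def make_range(spec):
    """Lazily expand a spec like '1-3,6,8-10,16'; 'N-' counts upward forever."""
    for part in spec.split(','):
        start, dash, end = part.partition('-')
        if not dash:
            yield int(start)                  # single index such as '6'
        elif end == '':
            yield from count(int(start))      # assumed open-ended form 'N-'
        else:
            yield from range(int(start), int(end) + 1)  # inclusive upper bound


print(list(make_range("1-3,6,8-10,16")))
# [1, 2, 3, 6, 8, 9, 10, 16]

# An infinite spec must be consumed lazily, e.g. with islice.
print(list(islice(make_range("1-3,20-"), 6)))
# [1, 2, 3, 20, 21, 22]
```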

import sys
class Sequencer(object):
    def __getitem__(self, items):
        if not isinstance(items, (tuple, list)):
            items = [items]
        for item in items:
            if isinstance(item, slice):
                for i in xrange(*item.indices(sys.maxint)):
                    yield i
            else:
                yield item
>>> s = Sequencer()
>>> print list(s[1:3,6,8:10,16])
[1, 2, 6, 8, 9, 16]
Note that I am using the xrange builtin to generate the sequence. That seems awkward at first because it doesn't include the upper number of sequences by default, however it proves to be very convenient. You can do things like:
>>> print list(s[1:10:3,5,5,16,13:5:-1])
[1, 4, 7, 5, 5, 16, 13, 12, 11, 10, 9, 8, 7, 6]
Which means you can use the step part of xrange.
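The key call here is slice.indices(length), which resolves a slice against a sequence length and returns a (start, stop, step) triple suitable for range. A short Python 3 illustration, using sys.maxsize where the Python 2 code above uses sys.maxint:

```python
import sys

# A descending slice keeps its bounds and step when the length is huge.
start, stop, step = slice(13, 5, -1).indices(sys.maxsize)
values = list(range(start, stop, step))
print((start, stop, step))  # (13, 5, -1)
print(values)               # [13, 12, 11, 10, 9, 8, 7, 6]

# A missing stop is filled in from the length passed to indices().
open_ended = slice(8, None).indices(12)
print(open_ended)           # (8, 12, 1)
```

With sys.maxsize as the length, an open-ended slice such as 8: becomes a very long but still finite range.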

This looked like a fun puzzle to go with my coffee this morning. If you settle on your given syntax (which looks okay to me, with some notes at the end), here is a pyparsing converter that will take your input string and return a list of integers:
from pyparsing import *
integer = Word(nums).setParseAction(lambda t : int(t[0]))
intrange = integer("start") + '-' + integer("end")
def validateRange(tokens):
    if tokens.start > tokens.end:
        raise Exception("invalid range, start must be <= end")
intrange.setParseAction(validateRange)
intrange.addParseAction(lambda t: list(range(t.start, t.end+1)))
indices = delimitedList(intrange | integer)
def mergeRanges(tokens):
    ret = set()
    for item in tokens:
        if isinstance(item,int):
            ret.add(item)
        else:
            ret.update(item)
    return sorted(ret)
indices.setParseAction(mergeRanges)
test = "1-3,6,8-10,16"
print indices.parseString(test)
This also takes care of any overlapping or duplicate entries, such as "3-8,4,6,3,4", and returns a list of just the unique integers.
The parser takes care of validating that ranges like "10-3" are not allowed. If you really wanted to allow this, and have something like "1,5-3,7" return 1,5,4,3,7, then you could tweak the intrange and mergeRanges parse actions to get this simpler result (and discard the validateRange parse action altogether).
You are very likely to get whitespace in your expressions, and I assume that it is not significant: "1, 2, 3-6" would be handled the same as "1,2,3-6". Pyparsing skips whitespace by default, so you don't see any special whitespace handling in the code above (but it's there...)
This parser does not handle negative indices, but if that were needed too, just change the definition of integer to:
integer = Combine(Optional('-') + Word(nums)).setParseAction(lambda t : int(t[0]))
Your example didn't list any negatives, so I left it out for now.
Python uses ':' for a ranging delimiter, so your original string could have looked like "1:3,6,8:10,16", and Pascal used '..' for array ranges, giving "1..3,6,8..10,16" - meh, dashes are just as good as far as I'm concerned.
