Python regex to parse into a 2D array - python

I have a string like this that I need to parse into a 2D array:
str = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
the array equiv would be:
arr[0][0] = 813702104
arr[0][1] = 813702106
arr[1][0] = 813702141
arr[1][1] = 813702143
#... etc ...
I'm trying to do this by REGEX. The string above is buried in an HTML page but I can be certain it's the only string in that pattern on the page. I'm not sure if this is the best way, but it's all I've got right now.
imgRegex = re.compile(r"(?:'(?P<main>\d+)\[(?P<thumb>\d+)\]',?)+")
If I run imgRegex.match(str).groups() I only get one result (the first couplet). How do I either get multiple matches back or a 2d match object (if such a thing exists!)?
Note: Contrary to how it might look, this is not homework
Note part deux: The real string is embedded in a large HTML file and therefore splitting does not appear to be an option.
I'm still getting answers for this, so I thought I better edit it to show why I'm not changing the accepted answer. Splitting, though more efficient on this test string, isn't going to extract the parts from a whole HTML file. I could combine a regex and splitting but that seems silly.
If you do have a better way to find the parts from a load of HTML (the pattern \d+\[\d+\] is unique to this string in the source), I'll happily change accepted answers. Anything else is academic.

I would try findall or finditer instead of match.
Edit by Oli: Yeah, findall works brilliantly, but I had to simplify the regex to:
r"'(?P<main>\d+)\[(?P<thumb>\d+)\]',?"
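A minimal sketch of the findall approach applied to a larger document, written in Python 3 syntax. The `html` sample and the helper name `parse_pairs` are made up for illustration; only the pattern itself comes from the thread.

```python
import re

# Hypothetical page content; the real input would be the full HTML source.
html = """<html><body><script>
var imgs = ['813702104[813702106]','813702141[813702143]','813702172[813702174]'];
</script></body></html>"""

# One match per couplet; findall returns one tuple of groups per match.
PAIR_RE = re.compile(r"'(\d+)\[(\d+)\]'")

def parse_pairs(text):
    """Return a list of [main, thumb] integer pairs found in text."""
    return [[int(main), int(thumb)] for main, thumb in PAIR_RE.findall(text)]

arr = parse_pairs(html)
print(arr[0][0], arr[0][1])  # 813702104 813702106
print(arr)
```

The original `(?:...)+` pattern yields a single pair because a capturing group inside a repetition keeps only its last iteration. Matching one couplet at a time and letting findall collect the matches avoids that.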

I think I will not go for regex for this task. Python list comprehension is quite powerful for this
In [27]: s = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
In [28]: d=[[int(each1.strip(']\'')) for each1 in each.split('[')] for each in s.split(',')]
In [29]: d[0][1]
Out[29]: 813702106
In [30]: d[1][0]
Out[30]: 813702141
In [31]: d
Out[31]: [[813702104, 813702106], [813702141, 813702143], [813702172, 813702174]]

Modifying your regexp a little,
>>> str = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
>>> imgRegex = re.compile(r"'(?P<main>\d+)\[(?P<thumb>\d+)\]',?")
>>> print imgRegex.findall(str)
[('813702104', '813702106'), ('813702141', '813702143'), ('813702172', '813702174')]
Which is a "2 dimensional array" - in Python, "a list of 2-tuples".

I've got something that seems to work on your data set:
In [19]: str = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
In [20]: ptr = re.compile( r"'(?P<one>\d+)\[(?P<two>\d+)\]'" )
In [21]: ptr.findall( str )
Out[21]:
[('813702104', '813702106'),
('813702141', '813702143'),
('813702172', '813702174')]

Alternatively, you could use Python's [statement for item in list] syntax for building lists. You should find this to be considerably faster than a regex, particularly for small data sets. Larger data sets will show a less marked difference (it only has to load the regular expressions engine once no matter the size), but the listmaker should always be faster.
Start by splitting the string on commas:
>>> str = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
>>> arr = [pair for pair in str.split(",")]
>>> arr
["'813702104[813702106]'", "'813702141[813702143]'", "'813702172[813702174]'"]
Right now, this returns the same thing as just str.split(","), so isn't very useful, but you should be able to see how the listmaker works — it iterates through list, assigning each value to item, executing statement, and appending the resulting value to the newly-built list.
In order to get something useful accomplished, we need to put a real statement in, so we get a slice of each pair which removes the single quotes and closing square bracket, then further split on that conveniently-placed opening square bracket:
>>> arr = [pair[1:-2].split("[") for pair in str.split(",")]
>>> arr
[['813702104', '813702106'], ['813702141', '813702143'], ['813702172', '813702174']]
This returns a two-dimensional array like you describe, but the items are all strings rather than integers. If you're simply going to use them as strings, that's far enough. If you need them to be actual integers, you simply use an "inner" listmaker as the statement for the "outer" listmaker:
>>> arr = [[int(x) for x in pair[1:-2].split("[")] for pair in str.split(",")]
>>> arr
[[813702104, 813702106], [813702141, 813702143], [813702172, 813702174]]
This returns a two-dimensional array of the integers represented in a string like the one you provided, without ever needing to load the regular expressions engine.

Related

String slicing in numpy array

Say we have a numpy.ndarray with numpy.str_ elements. For example, below arr is a numpy.ndarray with two numpy.str_ elements:
arr = ['12345"""ABCDEFG' '1A2B3C"""']
I am trying to perform string slicing on each element of the array.
For example, how can we slice the first element '12345"""ABCDEFG' so that we replace its 10 last characters with the string REPL, i.e.
arr = ['12345REPL' '1A2B3C"""']
Also, is it possible to perform string substitutions, e.g. substitute all characters after a specific symbol?
Strings are immutable, so you should either create slices and manually recombine or use regular expressions. For example, to replace the last 10 characters of the first element in your array, arr, you could do:
import numpy as np
import re
arr = np.array(['12345"""ABCDEFG', '1A2B3C"""'])
arr[0] = re.sub(re.escape(arr[0][-10:]) + '$', 'REPL', arr[0])  # escape the literal text and anchor it at the end
print(arr)
#['12345REPL' '1A2B3C"""']
If you want to replace all characters after a specific character you could use a regular expression or find the index of that character in the string and use that as the slicing index.
EDIT: Your comment is more about regular expressions than simply Python slicing, but this is how you could replace everything after the triple quote:
re.sub('["]{3}(.+)', 'REPL', arr[0])
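This line essentially says, "Find the triple quote and everything after it, and replace the whole match, triple quote included, with REPL." Note that (.+) requires at least one character after the triple quote, so a string that ends with the triple quote is left unchanged.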
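The index-based alternative mentioned above can be sketched as follows. The helper name `replace_after` is hypothetical, and plain Python strings are used here; the same call could be applied to each element of the array in turn. Unlike the regex, this version also replaces a marker that sits at the very end of the string.

```python
def replace_after(text, marker, replacement):
    """Replace marker and everything after it; return text unchanged if marker is absent."""
    idx = text.find(marker)  # -1 when the marker does not occur
    if idx == -1:
        return text
    return text[:idx] + replacement

items = ['12345"""ABCDEFG', '1A2B3C"""', 'no marker']
result = [replace_after(s, '"""', 'REPL') for s in items]
print(result)  # ['12345REPL', '1A2B3CREPL', 'no marker']
```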
In Python, strings are immutable. Also, in NumPy, array scalars are immutable; your string is therefore immutable.
What you would want to do in order to slice is to treat your string like a list and access the elements.
Say we had a string where we wanted to slice at the 3rd letter, excluding the third letter:
my_str = 'purple'
sliced_str = my_str[:2]  # 'pu': the first two letters, excluding the third
Now that we have that part of the string, say we want to replace everything after the slice point. We keep the sliced part and build a new string by appending the replacement text to it:
# say I want to replace the end of 'my_str', from where we sliced, with a string named 's'
s = 'dandylion'
new_string = sliced_str + s # returns 'pudandylion'
Because string types are immutable, you have to store elements you want to keep, then combine the stored elements with the elements you would like to add in a new variable.
np.char has replace function, which applies the corresponding string method to each element of the array:
In [598]: arr = np.array(['12345"""ABCDEFG', '1A2B3C"""'])
In [599]: np.char.replace(arr,'"""ABCDEFG',"REPL")
Out[599]:
array(['12345REPL', '1A2B3C"""'],
dtype='<U9')
In this particular example it can be made to work, but it isn't nearly as general purpose as re.sub. Also these char functions are only modestly faster than iterating on the array. There are some good examples of that in @Divakar's link.

Python: removing specific lines from an object

I have a bit of a weird question here.
I am using iperf to test performance between a device and a server. I get the results of this test over SSH, which I then want to parse into values using a parser that has already been made. However, there are several lines at the top of the results (which I read into an object of lines) that I don't want to go into the parser. I know exactly how many lines I need to remove from the top each time though. Is there any way to drop specific entries out of a list? Something like this in pseudo-Python:
print list
["line1","line2","line3","line4"]
list = list.drop([0 - 1])
print list
["line3","line4"]
If anyone knows anything I could use I would really appreciate you helping me out. The only thing I can think of is writing a loop to iterate through and make a new list only putting in what I need. Anyway, thanks!
Michael
Slices:
l = ["line1","line2","line3","line4"]
print l[2:] # print from index 2 (inclusive) onwards, i.e. starting at the third element
['line3', 'line4']
Slice syntax is [from(included):to(excluded):step]. Each part is optional, so you can write [:] to copy the whole list (or any other sequence, such as a string or tuple from the built-ins). You can also use negative indexes, so [:-2] means everything from the beginning up to, but not including, the last two elements. You can also step backwards: [::-1] means get all elements, but in reversed order.
Also, don't use list as a variable name. It shadows the built-in list class.
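A short sketch of the slice forms described above, in Python 3 syntax. The variable name `lines` is only illustrative.

```python
lines = ["line1", "line2", "line3", "line4"]

print(lines[2:])    # drop the first two items
print(lines[:-2])   # everything except the last two items
print(lines[::2])   # every second item
print(lines[::-1])  # reversed copy
print(lines[:])     # shallow copy of the whole list

# Slicing also works on other sequences, such as strings and tuples.
print("line1"[:4])
print((1, 2, 3, 4)[1:3])
```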
This is what the slice operator is for:
>>> before = [1,2,3,4]
>>> after = before[2:]
>>> print after
[3, 4]
In this instance, before[2:] says 'give me the elements of the list before, starting at element 2 and all the way until the end.'
(also -- don't use built-in names like list or dict as variable names -- doing that can lead to confusing bugs)
You can use slices for that:
>>> l = ["line1","line2","line3","line4"] # don't use "list" as variable name, it's a built-in.
>>> print l[2:] # to discard items up to some point, specify a starting index and no stop point.
['line3', 'line4']
>>> print l[:1] + l[3:] # to drop items "in the middle", join two slices.
['line1', 'line4']
why not use a basic list slice? something like:
lines = lines[3:] # everything from index 3 to the end
You want del for that
del list[:2]
You can use the "del" statement to remove specific entries:
del(list[0]) # remove entry 0
del(list[0:2]) # remove entries 0 and 1
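The practical difference between slicing and del may be easier to see side by side. This is a sketch in Python 3 syntax with illustrative names (`lines`, `alias`, `trimmed`): slicing builds a new list, while del changes the existing list in place, so every name bound to that list sees the change.

```python
lines = ["line1", "line2", "line3", "line4"]
alias = lines        # a second name for the same list object

trimmed = lines[2:]  # slicing builds a new list; lines is unchanged so far
del lines[:2]        # del removes the items from the existing list

print(trimmed)       # ['line3', 'line4']
print(lines)         # ['line3', 'line4']
print(alias)         # ['line3', 'line4'], the alias sees the in-place change
```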

Python: Looping starts from final item and ends with the first one

Is there any "pythonic way" to tell python to loop in a string (or list) starting from the last item and ending with the first one?
For example, for the word Hans I want Python to read or sort it as snaH.
Next, how can I tell Python the following: in the resulting string, search for 'a'; if 'n' follows 'a', put '.' after 'n', and then print the letters in their original order?
The clearest and most pythonic way to do this is to use the reversed() builtin.
wrong_way = [1, 2, 3, 4]
for item in reversed(wrong_way):
    print(item)
Which gives:
4
3
2
1
This is the best solution as not only will it generate a reversed iterator naturally, but it can also call the dedicated __reversed__() method if it exists, allowing for a more efficient reversal in some objects.
You can use wrong_way[::-1] to reverse a list, but this is a lot less readable in code, and potentially less efficient. It does, however, show the power of list slicing.
Note that reversed() returns an iterator, so if you want to do this with a string, you will need to convert the result back to a string, which is fortunately easy:
"".join(iterator)
e.g:
"".join(reversed(word))
The str.join() method takes an iterable and joins its elements into a string, using the calling string as the separator, so here we use the empty string to place them back-to-back.
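To illustrate the point about `__reversed__()`, here is a sketch with a made-up class, `Countdown`, which is not from any library. When a class defines `__reversed__()`, reversed() calls it instead of falling back to `__len__()` and `__getitem__()`.

```python
class Countdown:
    """Hypothetical sequence-like class holding the numbers 1..n."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index + 1

    def __reversed__(self):
        # Called by reversed(); yields n..1 without building a copy.
        return iter(range(self.n, 0, -1))


print(list(reversed(Countdown(4))))  # [4, 3, 2, 1]

word = "Hans"
print("".join(reversed(word)))       # snaH
print(word[::-1])                    # snaH
```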
How about this?
>>> s = "Hans"
>>> for c in s[::-1]:
...     print c
...
s
n
a
H
Alternatively, if you want a new string that's the reverse of the first, try this:
>>> "".join(reversed("Hans"))
'snaH'
Sure, just use list_name[::-1]. e.g.
>>> l = ['one', 'two', 'three']
>>> for i in l[::-1]:
...     print i
...
three
two
one

What is the best way to create a string array in python?

I'm relatively new to Python and its libraries and I was wondering how I might create a string array with a preset size. It's easy in Java but I was wondering how I might do this in Python.
So far all I can think of is
strs = ['']*size
And somehow when I try to call string methods on it, the debugger gives me an error: X operation does not exist in object tuple.
And if it was in java this is what I would want to do.
String[] ar = new String[size];
Arrays.fill(ar,"");
Please help.
Error code
strs[sum-1] = strs[sum-1].strip('\(\)')
AttributeError: 'tuple' object has no attribute 'strip'
Question: How might I do what I can normally do in Java in Python while still keeping the code clean.
In python, you wouldn't normally do what you are trying to do. But, the below code will do it:
strs = ["" for x in range(size)]
In Python, the tendency is usually to use a non-fixed-size list (that is to say, items can be appended to or removed from it dynamically). If you follow this, there is no need to allocate a fixed-size collection ahead of time and fill it with empty values. Rather, as you get or create strings, you simply add them to the list. When it comes time to remove values, you simply remove the appropriate value from the list. I would imagine you can probably use this technique here. For example (in Python 2.x syntax):
>>> temp_list = []
>>> print temp_list
[]
>>>
>>> temp_list.append("one")
>>> temp_list.append("two")
>>> print temp_list
['one', 'two']
>>>
>>> temp_list.append("three")
>>> print temp_list
['one', 'two', 'three']
>>>
Of course, some situations might call for something more specific. In your case, a good idea may be to use a deque. Check out the post here: Python, forcing a list to a fixed size. With this, you can create a deque which has a fixed size. If a new value is appended to the end, the first element (head of the deque) is removed and the new item is appended onto the deque. This may work for what you need, but I don't believe this is considered the "norm" for Python.
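A minimal sketch of the fixed-size deque described above, using `collections.deque` with its `maxlen` argument (Python 3 syntax). The variable name `recent` is only illustrative.

```python
from collections import deque

recent = deque(maxlen=3)  # holds at most three items
for name in ["one", "two", "three", "four"]:
    recent.append(name)   # appending to a full deque drops the oldest item

print(list(recent))       # ['two', 'three', 'four']
print(len(recent))        # 3
```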
The simple answer is, "You don't." At the point where you need something to be of fixed length, you're either stuck on old habits or writing for a very specific problem with its own unique set of constraints.
The best and most convenient method for creating a string array in python is with the help of NumPy library.
Example:
import numpy as np
arr = np.chararray((rows, columns))
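This creates an array of the given shape with one character per entry (the default itemsize is 1). The entries are not initialized, so fill the array using indexing or slicing before reading from it. Note that the NumPy documentation describes chararray as existing for backwards compatibility and does not recommend it for new code.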
Are you trying to do something like this?
>>> strs = [s.strip('\(\)') for s in ['some\\', '(list)', 'of', 'strings']]
>>> strs
['some', 'list', 'of', 'strings']
But what is the reason to use a fixed size? There is no actual need in Python for fixed-size arrays (lists): you can always grow a list using append or extend, shrink it using pop, or at least use slicing.
x = ['' for x in xrange(10)]
strlist =[{}]*10
strlist[0] = set()
strlist[0].add("Beef")
strlist[0].add("Fish")
strlist[1] = {"Apple", "Banana"}
strlist[1].add("Cherry")
print(strlist[0])
print(strlist[1])
print(strlist[2])
print("Array size:", len(strlist))
print(strlist)
The error message says it all: strs[sum-1] is a tuple, not a string. If you show more of your code someone will probably be able to help you. Without that we can only guess.
Sometimes I need an empty char array. You cannot use "np.empty(size)" because an error will be reported when you fill in chars later. So I usually do something quite clumsy, but it is still one way to do it:
# Suppose you want a size N char array
charlist = [' ']*N # other preset character is fine as well, like 'x'
chararray = np.array(charlist)
# Then you change the content of the array
chararray[somecondition1] = 'a'
chararray[somecondition2] = 'b'
The bad part of this is that your array has default values (if you forget to change them).
import re

def _remove_regex(input_text, regex_pattern):
    # Remove each matched substring from input_text.
    findregs = re.finditer(regex_pattern, input_text)
    for i in findregs:
        input_text = re.sub(i.group().strip(), '', input_text)
    return input_text
regex_pattern = r"\buntil\b|\bcan\b|\bboat\b"
_remove_regex("row and row and row your boat until you can row no more", regex_pattern)
\w means that it matches word characters, a|b means match either a or b, \b represents a word boundary
If you want to take input from user here is the code
If each string is given in new line:
strs = [input() for i in range(size)]
If the strings are separated by spaces:
strs = list(input().split())

how to create a string which can be used as an array in python?

I want to create a string S which can be used as an array, so that each element can be accessed separately by index.
That's how Python strings already work:
>>> a = "abcd"
>>> a[0]
'a'
>>> a[2]
'c'
But keep in mind that this is read only access.
You can convert a string to a list of characters by using list, and to go the other way use join:
>>> s = 'Hello, world!'
>>> l = list(s)
>>> l[7] = 'f'
>>> ''.join(l)
'Hello, forld!'
I am a bit surprised that no one seems to have written a popular "MutableString" wrapper class for Python. I would think that you'd want it to store the string as a list, return it via ''.join(), and implement a suite of methods including those for strings (startswith, endswith, isalpha and all its ilk, and so on) and those for lists.
For simple operations, just operating on the list and using ''.join() as necessary is fine. However, something like 'foobar'.replace('oba', 'amca') gets ugly when you're working with a list representation (that = list(''.join(that).replace(something, it)) ... or something like that). The constant marshaling between list and string representations is visually distracting.
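As a rough illustration of the idea, here is a minimal sketch of such a wrapper. The class `MutableString` is hypothetical, not an existing library class, and it covers only a handful of operations. The `replace` method here works in place, unlike `str.replace`, and other read-only string methods are delegated to a temporary str.

```python
class MutableString:
    """Hypothetical wrapper that stores text as a list of characters."""

    def __init__(self, text=""):
        self._chars = list(text)

    def __str__(self):
        return "".join(self._chars)

    def __len__(self):
        return len(self._chars)

    def __getitem__(self, index):
        # Works for both integer indexes and slices; always returns a str.
        return str(self)[index]

    def __setitem__(self, index, value):
        if isinstance(index, slice):
            self._chars[index] = list(value)  # may change the length
        elif len(value) == 1:
            self._chars[index] = value
        else:
            raise ValueError("a single index takes exactly one character")

    def replace(self, old, new):
        # In-place counterpart of str.replace: round-trips through a str once.
        self._chars = list(str(self).replace(old, new))

    def __getattr__(self, name):
        # Fall back to read-only str methods (startswith, isalpha, ...).
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(str(self), name)


s = MutableString("Hello, world!")
s[7] = "f"
print(s)                      # Hello, forld!
s.replace("forld", "world")
print(s.startswith("Hello"))  # True
s[0:5] = "Howdy"
print(s)                      # Howdy, world!
```

Hiding the list-to-string round trip inside the class keeps the calling code free of the marshaling noise described above, at the cost of rebuilding a str on most operations.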
