From list of strings, extract only characters within brackets

From list of strings, extract only characters within brackets - python

I have a list of strings that have variable construction but have a character sequence enclosed in square brackets. I want to extract only the sequence enclosed by the square brackets. There is only one instance of square brackets per string, which simplifies the process.
I am struggling to do so in an elegant manner, and this is clearly a simple problem with Python's large string library.
What is a simple expression to do this?

Check regular expression, "re"
Something like this should do the trick
import re
s = "hello_from_adele[this_is_the_string_i_am_looking_for]this_is_not_it"
match = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print match.group(1)
If you provide an example, we can be more specific

You don't even need re to do this:
In [11]: strng = "This is some text [that has brackets] followed by more text"
In [12]: strng[strng.index("[")+1:strng.index("]")]
Out[12]: 'that has brackets'
This uses string slicing to return the characters inside the brackets. index() returns the 0-based position of its argument. Since we don't want to include the [ at the beginning, we add 1. The second argument of the slice is the stop position, but it is not included in the returned substring, so we don't need to add anything to it.

If you prefer not to use regex for whatever reason, it should be easy to do with string splitting since you're guaranteed to have one and only one instance of [ and ].
s = "some[string]to check"
_, midright = s.split("[")
target, _ = midright.split("]")
or
target = s.split("[")[1].split("]")[0] # ewww

Related

How to remove a substrings from a list of strings?

I have a list of strings, all of which have a common property, they all go like this "pp:actual_string". I do not know for sure what the substring "pp:" will be, basically : acts as a delimiter; everything before : shouldn't be included in the result.
I have solved the problem using the brute force approach, but I would like to see a clever method, maybe something like regex.
Note : Some strings might not have this "pp:string" format, and could be already a perfect string, i.e. without the delimiter.
This is my current solution:
ll = ["pp17:gaurav","pp17:sauarv","pp17:there","pp17:someone"]
res=[]
for i in ll:
g=""
for j in range(len(i)):
if i[j] == ':':
index=j+1
res.append(i[index:len(i)])
print(res)
Is there a way that I can do it without creating an extra list ?

Whilst regex is an incredibly powerful tool with a lot of capabilities, using a "clever method" is not necessarily the best idea you are unfamiliar with its principles.
Your problem is one that can be solved without regex by splitting on the : character using the str.split() method, and just returning the last part by using the [-1] index value to represent the last (or only) string that results from the split. This will work even if there isn't a :.
list_with_prefixes = ["pp:actual_string", "perfect_string", "frog:actual_string"]
cleaned_list = [x.split(':')[-1] for x in list_with_prefixes]
print(cleaned_list)
This is a list comprehension that takes each of the strings in turn (x), splits the string on the : character, this returns a list containing the prefix (if it exists) and the suffix, and builds a new list with only the suffix (i.e. item [-1] in the list that results from the split. In this example, it returns:
['actual_string', 'perfect_string', 'actual_string']

Here are a few options, based upon different assumptions.
Most explicit
if s.startswith('pp:'):
s = s[len('pp:'):] # aka 3
If you want to remove anything before the first :
s = s.split(':', 1)[-1]
Regular expressions:
Same as startswith
s = re.sub('^pp:', '', s)
Same as split, but more careful with 'pp:' and slower
s = re.match('(?:^pp:)?(.*)', s).group(1)

Problems when trying to filter strings in a list?

I have a large list with strings and I would like to filter everything inside a parenthesis, thus I am using the following regex:
text_list = [' 1__(this_is_a_string) 74_string__(anotherString_with_underscores) question__(stringWithAlot_of_underscores) 1.0__(another_withUnderscores) 23:59:59__(get_arguments_end) 2018-05-13 00:00:00__(get_arguments_start)']
import re
r = re.compile('\([^)]*\)')
a_lis = list(filter(r.search, text_list))
print(a_lis)
I test my regex here, and is working. However, when I apply the above regex I end up with an empty list:
[]
Any idea of how to filter all the tokens inside parenthesis from a list?

Your regex is OK (though perhaps you don't want to capture the parentheses as part of the match), but search() is the wrong method to use. You want findall() to get the text of all the matches, rather than the indices of the first match:
list(map(r.findall, text_list))
This will give you a list of lists, where each inner list contains the strings which were inside parentheses.
For example, given this input:
text_list = ['asdf (qwe) asdf (gdfd)', 'xx', 'gdfw(rgf)']
The result is:
[['(qwe)', '(gdfd)'], [], ['(rgf)']]
If you want to exclude the parentheses, change the regex slightly:
'\(([^)]*)\)'
The unescaped parentheses within the escaped ones indicate what to capture.

Dot notation string manipulation

Is there a way to manipulate a string in Python using the following ways?
For any string that is stored in dot notation, for example:
s = "classes.students.grades"
Is there a way to change the string to the following:
"classes.students"
Basically, remove everything up to and including the last period. So "restaurants.spanish.food.salty" would become "restaurants.spanish.food".
Additionally, is there any way to identify what comes after the last period? The reason I want to do this is I want to use isDigit().
So, if it was classes.students.grades.0 could I grab the 0 somehow, so I could use an if statement with isdigit, and say if the part of the string after the last period (so 0 in this case) is a digit, remove it, otherwise, leave it.

you can use split and join together:
s = "classes.students.grades"
print '.'.join(s.split('.')[:-1])
You are splitting the string on . - it'll give you a list of strings, after that you are joining the list elements back to string separating them by .
[:-1] will pick all the elements from the list but the last one
To check what comes after the last .:
s.split('.')[-1]
Another way is to use rsplit. It works the same way as split but if you provide maxsplit parameter it'll split the string starting from the end:
rest, last = s.rsplit('.', 1)
'classes.students'
'grades'
You can also use re.sub to substitute the part after the last . with an empty string:
re.sub('\.[^.]+$', '', s)
And the last part of your question to wrap words in [] i would recommend to use format and list comprehension:
''.join("[{}]".format(e) for e in s.split('.'))
It'll give you the desired output:
[classes][students][grades]

The best way to do this is using the rsplit method and pass in the maxsplit argument.
>>> s = "classes.students.grades"
>>> before, after = s.rsplit('.', maxsplit=1) # rsplit('.', 1) in Python 2.x onwards
>>> before
'classes.students'
>>> after
'grades'
You can also use the rfind() method with normal slice operation.
To get everything before last .:
>>> s = "classes.students.grades"
>>> last_index = s.rfind('.')
>>> s[:last_index]
'classes.students'
Then everything after last .
>>> s[last_index + 1:]
'grades'

if '.' in s, s.rpartition('.') finds last dot in s,
and returns (before_last_dot, dot, after_last_dot):
s = "classes.students.grades"
s.rpartition('.')[0]

If your goal is to get rid of a final component that's just a single digit, start and end with re.sub():
s = re.sub(r"\.\d$", "", s)
This will do the job, and leave other strings alone. No need to mess with anything else.
If you do want to know about the general case (separate out the last component, no matter what it is), then use rsplit to split your string once:
>>> "hel.lo.there".rsplit(".", 1)
['hel.lo', 'there']
If there's no dot in the string you'll just get one element in your array, the entire string.

You can do it very simply with rsplit (str.rsplit([sep[, maxsplit]]) , which will return a list by breaking each element along the given separator.
You can also specify how many splits should be performed:
>>> s = "res.spa.f.sal.786423"
>>> s.rsplit('.',1)
['res.spa.f.sal', '786423']
So the final function that you describe is:
def dimimak_cool_function(s):
if '.' not in s: return s
start, end = s.rsplit('.', 1)
return start if end.isdigit() else s
>>> dimimak_cool_function("res.spa.f.sal.786423")
'res.spa.f.sal'
>>> dimimak_cool_function("res.spa.f.sal")
'res.spa.f.sal'

Retrieve part of string, variable length

I'm trying to learn how to use Regular Expressions with Python. I want to retrieve an ID number (in parentheses) in the end from a string that looks like this:
"This is a string of variable length (561401)"
The ID number (561401 in this example) can be of variable length, as can the text.
"This is another string of variable length (99521199)"
My coding fails:
import re
import selenium
# [Code omitted here, I use selenium to navigate a web page]
result = driver.find_element_by_class_name("class_name")
print result.text # [This correctly prints the whole string "This is a text of variable length (561401)"]
id = re.findall("??????", result.text) # [Not sure what to do here]
print id

This should work for your example:
(?<=\()[0-9]*
?<= Matches something preceding the group you are looking for but doesn't consume it. In this case, I used \(. ( is a special character, so it has to be escaped with \. [0-9] matches any number. The * means match any number of the directly preceding rule, so [0-9]* means match as many numbers as there are.

Solved this thanks to Kaz's link, very useful:
http://regex101.com/
id = re.findall("(\d+)", result.text)
print id[0]

You can use this simple solution :
>>> originString = "This is a string of variable length (561401)"
>>> str1=OriginalString.replace("("," ")
'This is a string of variable length 561401)'
>>> str2=str1.replace(")"," ")
'This is a string of variable length 561401 '
>>> [int(s) for s in string.split() if s.isdigit()]
[561401]
First, I replace parantheses with space. and then I searched the new string for integers.

No need to really use regular expressions here, if it is always at the end and always in parenthesis you can split, extract last element and remove the parenthesis by taking the substring ([1:-1]). Regexes are relatively time expensive.
line = "This is another string of variable length (99521199)"
print line.split()[-1][1:-1]
If you did want to use regular expressions I would do this:
import re
line = "This is another string of variable length (99521199)"
id_match = re.match('.*\((\d+)\)',line)
if id_match:
print id_match.group(1)

Splice a string based on certain characters

I'm looking for a way to examine only certain characters within a string. For example:
#Given the string
s= '((hello+world))'
s[1:')'] #This obviously doesn't work because you can only splice a string using ints
Basically I want the program to start at the second occurence of ( and then from there splice until it hits the first occurence of ). So then maybe from there I can return it to another fucntion or whatever. Any solutions?

You can do it as follows: (assuming you want the innermost parenthesis)
s[s.rfind("("):s.find(")")+1] if you want "(hello+world)"
s[s.rfind("(")+1:s.find(")")] if you want "hello+world"

You can strip parenthesis (if, in your case, they always appear at the beginning and the end of the string):
>>> s= '((hello+world))'
>>> s.strip('()')
'hello+world'
Another option is to use regular expression to extract what is inside the double parenthesis:
>>> re.match('\(\((.*?)\)\)', s).group(1)
'hello+world'

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

From list of strings, extract only characters within brackets - python

Check regular expression, "re" Something like this should do the trick import re s = "hello_from_adele[this_is_the_string_i_am_looking_for]this_is_not_it" match = re.search(r"\[([A-Za-z0-9_]+)\]", s) print match.group(1) If you provide an example, we can be more specific

Related

How to remove a substrings from a list of strings?

Problems when trying to filter strings in a list?

Dot notation string manipulation

Retrieve part of string, variable length

Splice a string based on certain characters

Categories

Resources