I am trying to write a regex to match a string that has the following format:
12740(34,12) [abc (a1b2c3) (a2b3c4)......] myId123
Currently, I have something like this
\((?P<expression>\S+)\)
But with this, I can capture only the strings within square brackets.
Is there any way I can capture the integers before the square brackets and the id at the end, along with the strings within the square brackets?
The number of strings enclosed in parentheses will not always be the same. I could also have a string that looks like this:
10(3,2) [abc (a1b2c3)] myId1
I know that I can write a simple regex for the above expression using brute force. But could anyone please help me write one when the number of strings within the square bracket keeps changing.
Thanks in advance
You can capture the information by using ^ and $, which mean start and end respectively:
((?P<front>^\d+)|\((?P<expression>\S+)\)|(?P<id>[a-zA-Z0-9]+)$)
Regex101:
https://regex101.com/r/PoA5k4/1
To make the result more usable, I'd turn it into a dictionary:
import re
myStr = "12740(34,12) [abc (a1b2c3) (a2b3c4)......] myId123"
di = {}
# findall returns one tuple per match: (whole match, front, expression, id)
for find in re.findall(r"((?P<front>^\d+)|\((?P<expression>\S+)\)|(?P<id>[a-zA-Z0-9]+)$)", myStr):
    if find[1] != "":
        di["starter"] = find[1]
    elif find[3] != "":
        di["id"] = find[3]
    else:
        di.setdefault("expression", []).append(find[2])
print(di)
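Note that the pattern above also picks up (34,12) as an expression, because it sits in parentheses too. If only the parenthesised strings inside the square brackets are wanted, one alternative is to match the whole line first and then run findall on the bracketed part. This is a minimal sketch, assuming every line has the shape "digits(...) [ ... ] id"; the names LINE and parse are made up for illustration.

```python
import re

# Hypothetical helper: anchor the whole line, then search only inside [...]
LINE = re.compile(r"^(?P<front>\d+)\S*\s+\[(?P<body>.*)\]\s+(?P<id>\w+)$")

def parse(line):
    m = LINE.match(line)
    if m is None:
        return None  # line does not have the assumed shape
    return {
        "front": m.group("front"),
        # any number of (...) groups inside the square brackets
        "expression": re.findall(r"\((\S+)\)", m.group("body")),
        "id": m.group("id"),
    }

print(parse("12740(34,12) [abc (a1b2c3) (a2b3c4)......] myId123"))
print(parse("10(3,2) [abc (a1b2c3)] myId1"))
```

Because findall returns every match, the number of parenthesised strings inside the brackets does not need to be known in advance.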
While learning Python, I came across a demanding beginner's exercise.
Let's say you have a string constituted by "blocks" of characters separated by ';'. An example would be:
cdk;2(c)3(i)s;c
And you have to return a new string based on old one but in accordance to a certain pattern (which is also a string), for example:
c?*
This pattern means that each block must start with a 'c', the '?' character must be replaced by some other letter, and finally '*' by an arbitrary number of letters.
So when the pattern is applied you return something like:
cdk;cciiis
Another example:
string: 2(a)bxaxb;ab
pattern: a?*b
result: aabxaxb
My very crude attempt resulted in this:
def switch(string,pattern):
    d = []
    for v in range(0,string):
        r = float("inf")
        for m in range (0,pattern):
            if pattern[m] == string[v]:
                d.append(pattern[m])
            elif string[m]==';':
                d.append(pattern[m])
            elif (pattern[m]=='?' & Character.isLetter(string.charAt(v))):
                d.append(pattern[m])
    return d
Tips?
To split a string, you can use the split() method.
For pattern detection in strings, you can use regular expressions (regex) with the re module.
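Putting those two tips together might look like the sketch below. It rests on my reading of the exercise: "2(c)" expands to "cc", '?' stands for exactly one letter, '*' for zero or more letters, and blocks that do not fit the pattern are dropped. The function names expand, pattern_to_regex and apply_pattern are invented for this example.

```python
import re

def expand(block):
    # "2(c)3(i)s" -> "cciiis": repeat the text in parentheses n times
    return re.sub(r"(\d+)\(([^)]*)\)",
                  lambda m: m.group(2) * int(m.group(1)),
                  block)

def pattern_to_regex(pattern):
    # Translate the exercise's mini-pattern into a regular expression
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("[A-Za-z]")    # exactly one letter
        elif ch == "*":
            parts.append("[A-Za-z]*")   # any number of letters (assumed: zero or more)
        else:
            parts.append(re.escape(ch)) # literal character
    return re.compile("".join(parts))

def apply_pattern(string, pattern):
    regex = pattern_to_regex(pattern)
    blocks = (expand(b) for b in string.split(";"))
    # keep only the blocks that match the pattern from start to end
    return ";".join(b for b in blocks if regex.fullmatch(b))

print(apply_pattern("cdk;2(c)3(i)s;c", "c?*"))   # cdk;cciiis
print(apply_pattern("2(a)bxaxb;ab", "a?*b"))     # aabxaxb
```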
I have a list of strings that have variable construction but have a character sequence enclosed in square brackets. I want to extract only the sequence enclosed by the square brackets. There is only one instance of square brackets per string, which simplifies the process.
I am struggling to do so in an elegant manner, and this is clearly a simple problem with Python's large string library.
What is a simple expression to do this?
Take a look at the regular expression module, re.
Something like this should do the trick:
import re
s = "hello_from_adele[this_is_the_string_i_am_looking_for]this_is_not_it"
match = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print(match.group(1))
If you provide an example, we can be more specific.
You don't even need re to do this:
In [11]: strng = "This is some text [that has brackets] followed by more text"
In [12]: strng[strng.index("[")+1:strng.index("]")]
Out[12]: 'that has brackets'
This uses string slicing to return the characters inside the brackets. index() returns the 0-based position of its argument. Since we don't want to include the [ at the beginning, we add 1. The second argument of the slice is the stop position, but it is not included in the returned substring, so we don't need to add anything to it.
If you prefer not to use regex for whatever reason, it should be easy to do with string splitting since you're guaranteed to have one and only one instance of [ and ].
s = "some[string]to check"
_, midright = s.split("[")
target, _ = midright.split("]")
or
target = s.split("[")[1].split("]")[0] # ewww
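Both the slicing and the splitting versions raise an exception when a bracket is missing. If that can happen in your data, str.partition offers a way to check for it without regex. This is only a sketch; the function name between_brackets is made up, and returning None for a missing bracket is an arbitrary choice.

```python
def between_brackets(s):
    # partition returns (before, separator, after); the separator is ""
    # when it was not found in the string
    _, found_open, rest = s.partition("[")
    inner, found_close, _ = rest.partition("]")
    if not found_open or not found_close:
        return None  # no complete [...] pair in the string
    return inner

print(between_brackets("some[string]to check"))   # string
print(between_brackets("no brackets here"))       # None
```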
I know from the title you might think that this is a duplicate but it's not.
for id,row in enumerate(rows):
    columns = row.findall("td")
    teamName = columns[0].find("a").text, # Lag
    playedGames = columns[1].text, # S
    wins = columns[2].text,
    draw = columns[3].text,
    lost = columns[4].text,
    dif = columns[6].text, # GM-IM
    points = columns[7].text, # P - last column
    dict[divisionName].update({id :{"teamName":teamName, "playedGames":playedGames, "wins":wins, "draw":draw, "lost":lost, "dif":dif, "points":points }})
This is what my Python code looks like. Most of the code has been removed, but essentially I am extracting some information from a website and saving it as a dictionary. When I print the dictionary, every value has brackets around it, like ["blbal"], which causes trouble in my iPhone application. I know that I can convert the variables to strings, but I want to know if there is a way to get the information DIRECTLY as a string.
That looks like you have a string inside a list:
["blbal"]
To get the string, just index into the list: with l = ["blbal"], print(l[0]) gives "blbal".
If it is actually a string, use str.strip, as in '["blbal"]'.strip("[]"), or slicing, as in '["blbal"]'[1:-1], if the brackets are always present.
You can also use replace() to replace the text or symbols that you don't want with the empty string.
text = ["blbal","test"]
strippedText = str(text).replace('[','').replace(']','').replace('\'','').replace('\"','')
print(strippedText)
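One possible cause worth checking, judging from the trailing commas after each assignment in the question's loop: in Python, a trailing comma turns the right-hand side into a one-element tuple, and a tuple is written out as a list when it is serialized to JSON. The sketch below uses a made-up value and assumes the dictionary reaches the app as JSON, which the question does not state.

```python
import json

with_comma = "blbal",      # trailing comma: this is the tuple ('blbal',)
without_comma = "blbal"    # plain string

print(type(with_comma).__name__)                 # tuple
print(json.dumps({"teamName": with_comma}))      # {"teamName": ["blbal"]}
print(json.dumps({"teamName": without_comma}))   # {"teamName": "blbal"}
```

If that is what is happening, dropping the trailing commas would give plain strings without any stripping or replacing afterwards.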
import re
text = "some (string) [another string] in brackets"
re.sub(r"\(.*?\)", "", text)
# 'some  [another string] in brackets'
# works for () and will work for [] if you replace \( and \) with \[ and \].
The \(.*?\) pattern matches a pair of parentheses with text of any length inside them. The \[.*?\] pattern does the same for square brackets.
The output will contain neither the matched brackets nor the text inside them.
To match a different kind of bracket, swap the parentheses in the pattern for the bracket of your choice, and vice versa.
To match both () and [] brackets in one go, join the two patterns with the | character: \(.*?\)|\[.*?\]
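As a quick illustration of the combined pattern, here is a small sketch using the same sample text. The extra whitespace clean-up step is my own addition, since removing the brackets leaves runs of spaces behind.

```python
import re

text = "some (string) [another string] in brackets"

# remove (...) and [...] groups in a single pass
without_brackets = re.sub(r"\(.*?\)|\[.*?\]", "", text)

# collapse the leftover runs of spaces into single spaces
cleaned = " ".join(without_brackets.split())

print(cleaned)   # some in brackets
```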
So essentially I am trying to read lines from multiple files in a directory and use a regex to find the beginning of a sort of timestamp. I also want to put a list of months into the regex and then keep a counter for each month based on how many times it appears. I have some code below, but it is still a work in progress. I know I closed off date_parse, but that's why I'm asking. Please also suggest a more efficient method if you can think of one. Thanks.
months = ['Jan','Feb','Mar','Apr','May','Jun',\
          'Jul','Aug','Sep','Oct','Nov',' Dec']
date_parse = re.compile('[Date:\s]+[[A-Za-z]{3},]+[[0-9]{1,2}\s]')
counter=0
for line in sys.stdin:
    if data_parse.match(line):
        for month in months in line:
            print '%s %d' % (month, counter)
In a regular expression, you can have a list of alternative patterns, separated using vertical bars.
http://docs.python.org/library/re.html
import re
import sys
from collections import defaultdict
date_parse = re.compile(r'Date:\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)')
c = defaultdict(int)
for line in sys.stdin:
    m = date_parse.match(line)
    if m is None:
        # pattern did not match
        # could handle error or log it here if desired
        continue # skip to handling next input line
    month = m.group(1)
    c[month] += 1
Some notes:
I recommend you use a raw string (with r'' or r"") for a pattern, so that backslashes will not become string escapes. For example, inside a normal string, \s is not an escape and you will get a backslash followed by an 's', but \n is an escape and you will get a single character (a newline).
In a regular expression, when you enclose a series of characters in square brackets, you get a "character class" that matches any of the characters. So when you put [Date:\s]+ you would match Date: but you would also match taD:e or any other combination of those characters. It's perfectly okay to just put in a string that should match itself, like Date:.
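Both notes can be checked quickly in the interpreter. The snippet below is only an illustration with made-up sample strings.

```python
import re

char_class = re.compile(r"[Date:\s]+")   # any mix of the characters D, a, t, e, ':' and whitespace
literal = re.compile(r"Date:\s+")        # the literal text "Date:" followed by whitespace

class_matches_scrambled = bool(char_class.fullmatch("taD:e"))
literal_matches_scrambled = bool(literal.match("taD:e"))
literal_matches_real = bool(literal.match("Date: Jan 5"))

print(class_matches_scrambled)     # True: the character class accepts scrambled input
print(literal_matches_scrambled)   # False
print(literal_matches_real)        # True

# raw versus normal string: "\n" is one character, r"\n" is two
print(len("\n"), len(r"\n"))       # 1 2
```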
I need to create a regexp to match strings like this 999-123-222-...-22
The string can end with &Ns=(any number) or without it. So valid strings for me are:
999-123-222-...-22
999-123-222-...-22&Ns=12
999-123-222-...-22&Ns=12
And following are not valid:
999-123-222-...-22&N=1
I have been testing this for several hours already but did not manage to solve it, so I really need some help.
Not sure if you want to literally match 999-123-222-...-22 or if that can be any sequence of numbers/dashes. Here are two different regexes:
/^[\d-]+(&Ns=\d+)?$/
/^999-123-222-\.\.\.-22(&Ns=\d+)?$/
The key idea is the (&Ns=\d+)?$ part, which matches an optional &Ns=<digits>, and is anchored to the end of the string with $.
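The regexes above are written with /.../ delimiters, but the same idea carries over to Python's re module. Here is a small sketch of the first pattern; the sample strings are mine, with the elided "..." replaced by plain digit groups.

```python
import re

# digits and dashes, optionally followed by &Ns=<digits>, anchored at both ends
pattern = re.compile(r"^[\d-]+(&Ns=\d+)?$")

samples = [
    "999-123-222-22",
    "999-123-222-22&Ns=12",
    "999-123-222-22&N=1",
]

results = {s: bool(pattern.match(s)) for s in samples}
for s, ok in results.items():
    print(s, "valid" if ok else "invalid")
```

Only the last sample is rejected, because &N=1 is not the optional &Ns=<digits> suffix and & is not allowed anywhere else.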
If you just want to allow the strings 999-123-222-...-22 and 999-123-222-...-22&Ns=12, you would be better off using a string function.
If you want to allow any numbers between - you can use the regex:
^(\d+-){3}[.]{3}-\d+(&Ns=\d+)?$
If the numbers must be of only 3 digits and the last number of only 2 digits you can use:
^(\d{3}-){3}[.]{3}-\d{2}(&Ns=\d{2})?$
This looks like a phone number with extension information.
Why not make things simpler for yourself (and anyone who has to read this later) and split the input rather than use a complicated regex?
s = '999-123-222-...-22&Ns=12'
parts = s.split('&Ns=') # splits on '&Ns=' and removes it
If the piece before the "&" is a phone number, you could do another split and get the area code etc into separate fields, like so:
phone_parts = parts[0].split('-') # breaks up the digit string and removes the '-'
area_code = phone_parts[0]
The portion found after the optional '&Ns=' can be checked with the string method isdigit, which returns True if all characters in the string are digits and there is at least one character, and False otherwise.
if len(parts) > 1:
    extra_digits_ok = parts[1].isdigit()
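Pulling those pieces into one check could look like the following sketch. It assumes the elided "..." in the question stands for more digit groups separated by dashes; the function name is_valid is made up for this example.

```python
def is_valid(s):
    parts = s.split("&Ns=")
    if len(parts) > 2:
        return False                      # more than one '&Ns=' suffix
    # every dash-separated group before the suffix must be numeric
    if not all(group.isdigit() for group in parts[0].split("-")):
        return False
    if len(parts) == 2:
        return parts[1].isdigit()         # suffix present: it must be digits
    return True                           # no suffix at all is fine

print(is_valid("999-123-222-22"))         # True
print(is_valid("999-123-222-22&Ns=12"))   # True
print(is_valid("999-123-222-22&N=1"))     # False
```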