best way to find substring using regex in python 3

I am trying to find the best way to extract a specific substring as a key-value pair using re, for strings of the following form:
some_string-variable_length/some_no_variable_digit/some_no1_variable_digit/some_string1/some_string2
e.g. aba/101/11111/cde/xyz or aaa/111/1119/cde/xzx or ada/21111/5/cxe/yyz
Everything here is variable, and what I am looking for is key-value pairs like the following:
cde: 2, as there are two entries for cde
cxe: 1, as there is only one cxe
Note: everything is variable here except the / separators, i.e. cde or cxe or some other string always comes right after the third / in each case.
input: aba/101/11111/cde/xyz/blabla
output: cde:xyz/blabla
input: aaa/111/1119/cde/xzx/blabla
output: cde:xzx/blabla
input: aahjdsga/11231/1119/gfts/sjhgdshg/blabla
output: gfts:sjhgdshg/blabla
Notice that my key is always the first string after 3rd / and value is always the substring after key.

Here are a couple of solutions based on your description that "key is always the first string after 3rd / and value is always the substring after key". The first uses str.split with a maxsplit of 4 to collect everything after the fourth / into the value. The second uses regex to extract the two parts:
inp = ['aba/101/11111/cde/xyz/blabla',
       'aaa/111/1119/cde/xzx/blabla',
       'aahjdsga/11231/1119/gfts/sjhgdshg/blabla'
       ]
for s in inp:
    parts = s.split('/', 4)
    key = parts[3]
    value = parts[4]
    print(f'{key}:{value}')
import re
for s in inp:
    m = re.match(r'^(?:[^/]*/){3}([^/]*)/(.*)$', s)
    if m is not None:
        key = m.group(1)
        value = m.group(2)
        print(f'{key}:{value}')
For both pieces of code the output is
cde:xyz/blabla
cde:xzx/blabla
gfts:sjhgdshg/blabla

Others have already posted various regexes. A broader question: is this problem best solved using a regex? Depending on how the data is formatted overall, it may be better parsed using
the .split('/') method on the string; or
csv.reader(..., delimiter='/') or csv.DictReader(..., delimiter='/') in the csv module.
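A rough sketch of the csv route, assuming each record sits on its own line and the key is always the fourth /-separated field. The names lines, rows and key_counts are illustrative only; the Counter tally corresponds to the "cde: 2" style of output mentioned in the question.

```python
import csv
from collections import Counter

# Sample records taken from the question.
lines = [
    "aba/101/11111/cde/xyz",
    "aaa/111/1119/cde/xzx",
    "ada/21111/5/cxe/yyz",
]

# csv.reader accepts any iterable of strings; each row becomes a list of fields.
rows = list(csv.reader(lines, delimiter="/"))

# Assumption: the key is always the fourth field (index 3).
key_counts = Counter(row[3] for row in rows)

for key, count in key_counts.items():
    print(f"{key}: {count}")
```

For data read from a file, the open file object could be passed to csv.reader in place of the list.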

Try (?<!\S)[^\s/]*(?:/[^\s/]*){2}/([^\s/]*)
Updated per the comment, to also capture the value after the key:
(?<!\S)[^\s/]*(?:/[^\s/]*){2}/([^\s/]*)(?:/(\S*))?
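For reference, here is one way the second pattern might be used from Python. This is only a sketch: the helper name extract_pair and the constant KEY_VALUE are made up for illustration, and re.search is used on the assumption that the record may appear anywhere in a longer text.

```python
import re

# The second pattern from this answer: group 1 is the key (fourth field),
# group 2 is everything after the following slash, if present.
KEY_VALUE = re.compile(r"(?<!\S)[^\s/]*(?:/[^\s/]*){2}/([^\s/]*)(?:/(\S*))?")

def extract_pair(text):
    """Return (key, value) for the first record found, or None if there is none."""
    m = KEY_VALUE.search(text)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(extract_pair("aba/101/11111/cde/xyz/blabla"))
```

When nothing follows the key, the optional group does not participate and the value comes back as None.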

Related

Python: How to move the position of an output variable using the split() method

This is my first SO post, so go easy! I have a script that counts how many matches occur in a string named postIdent for the substring ff. Based on this it then iterates over postIdent and extracts all of the data following it, like so:
substring = 'ff'
global occurences
occurences = postIdent.count(substring)
x = 0
while x <= occurences:
    for i in postIdent.split("ff"):
        rawData = i
        required_Id = rawData[-8:]
        x += 1
To explain further, if we take the string "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff", it is clear there are 3 instances of ff. I need to get the 8 preceding characters at every instance of the substring ff, so for the first instance this would be 909a9090.
With the rawData, I essentially need to offset the variable required_Id by -1 when I get the data out of the split() method, as I am currently getting the last 8 characters of the current string, not the string I have just split. Another way of doing it could be to pass the current required_Id to the next iteration, but I've not been able to do this.
The split method gets everything after the matching string ff.
Using the partition method can get me the data I need, but does not allow me to iterate over the string in the same way.

Get the last 8 characters of each split segment using a slice in a list comprehension:
s = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
print([x[-8:] for x in s.split('ff') if x])
# ['909a9090', '90434390', 'sdfs9000']

Not a difficult problem, but tricky for a beginner.
If you split the string on 'ff' then you appear to want the eight characters at the end of every substring but the last. The last eight characters of string s can be obtained using s[-8:]. All but the last element of a sequence x can similarly be obtained with the expression x[:-1].
Putting both those together, we get
subject = '090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff'
for x in subject.split('ff')[:-1]:
    print(x[-8:])
This should print
909a9090
90434390
sdfs9000

I wouldn't do this with split myself, I'd use str.find. This code isn't fancy but it's pretty easy to understand:
fullstr = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
search = "ff"
found = None  # offset of the next match
last = 0      # where the next search starts
l = 8         # number of preceding characters to take
print(fullstr)
while True:
    found = fullstr.find(search, last)
    if found == -1:
        break
    # max() guards against a negative start index when the match is near the beginning
    preceding = fullstr[max(0, found - l):found]
    print("At position {} found preceding characters '{}' ".format(found, preceding))
    last = found + len(search)
Overall I like Austin's answer more; it's a lot more elegant.
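Since the surrounding questions are about regular expressions, the same extraction could also be sketched with re.findall and a lookahead. This is an alternative to the answers above, not something any of them proposed, and it assumes at least 8 characters precede each ff.

```python
import re

s = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"

# .{8} takes exactly eight characters; (?=ff) requires "ff" right after them
# without consuming it, so only the eight characters are returned.
preceding = re.findall(r".{8}(?=ff)", s)
print(preceding)
```

Unlike the slicing versions, this returns nothing for an ff that has fewer than 8 characters before it.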

duplicate item_01 and rename to item_02

I'm a complete beginner in Python, and I'm trying to use the language to make scripts in a program called Modo.
I'm attempting to make a script that duplicates an item which has a suffixed number, and adds 1 to the suffix of the new duplicated item.
So for example; duplicate 'item_01', which will create 'item_01 (2)', then rename 'item_01 (2)' to 'item_02'.
I'm having trouble finding out how to get Python to take the '_01' part of the previous item's name, add 1 to it, and use the result as the suffix for the new item name.
Also, this is my first question on this great site, so if additional information is needed, please let me know.

I'm interpreting your question as "I have a string ending in a digit sequence, for example "item_01". I want to get a string with the same form as the original string, but with the digit incremented by one, for example "item_02"."
You could use re.sub to replace the digit sequence with a new one:
>>> import re
>>>
>>> s = "item_01"
>>> result = re.sub(
...     r"\d+$",  # find all digits at the end of the string,
...     lambda m: str(  # replacing them with a string
...         int(m.group()) + 1  # equal to one plus the original value,
...     ).zfill(len(m.group())),  # with at least as much padding as the original value.
...     s
... )
>>>
>>> print(result)
item_02
In one line, that would be
result = re.sub(r"\d+$", lambda m: str(int(m.group())+1).zfill(len(m.group())), s)
Note that the resulting string may be longer than the original string, if the original value is all nines:
>>> re.sub(r"\d+$", lambda m: str(int(m.group())+1).zfill(len(m.group())), "item_99")
'item_100'
And it will only increment the digit sequence at the very end of the string, and not any intermediary sequences.
>>> re.sub(r"\d+$", lambda m: str(int(m.group())+1).zfill(len(m.group())), "item_23_42")
'item_23_43'
And if the string has no suffix digit sequence, it will simply return the original value unaltered.
>>> re.sub(r"\d+$", lambda m: str(int(m.group())+1).zfill(len(m.group())), "item_foobar")
'item_foobar'
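To tie this back to the duplicate naming in the question, here is a hedged sketch that turns a name like 'item_01 (2)' into 'item_02'. It assumes the duplicate marker always has the form " (N)" as in the question's example; the function name name_for_duplicate is hypothetical, and nothing here touches the Modo API.

```python
import re

def name_for_duplicate(duplicate_name):
    """Map e.g. 'item_01 (2)' to 'item_02'; return the input unchanged if it does not fit."""
    # Assumed form: <prefix><digits> (<copy number>)
    m = re.fullmatch(r"(.*?)(\d+) \(\d+\)", duplicate_name)
    if m is None:
        return duplicate_name
    prefix, digits = m.groups()
    # Increment the number and keep at least the original zero padding.
    return prefix + str(int(digits) + 1).zfill(len(digits))

print(name_for_duplicate("item_01 (2)"))
```

The result would then be passed to whatever rename call the host application provides.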

Getting the name out of a variable is not something you do in Python.
What you want to achieve here, reading what you wrote,
I'm attempting to make a script that duplicates an item which has a suffixed number, and adds 1 to the suffix of the new duplicated item.
So for example; duplicate 'item_01', which will create 'item_01 (2)', then rename 'item_01 (2)' to 'item_02'.
would be more conveniently done as below:
some_var = 1
some_other_var = some_var + 1
You could have a function that adds one to the parameter it receives and returns the result:
def add_one(var):
    return var + 1
some_var = 1
some_other_var = add_one(some_var)
If you want to "name" your variables and be able to change them, even though I don't see why you would want to do this, I believe what you are looking for is a dict.
I will let you look at the reference for dictionaries, though. :)

Is it possible to search and replace a string with "any" characters?

There are probably several ways to solve this problem, so I'm open to any ideas.
I have a file, within that file is the string "D133330593" Note: I do have the exact position within the file this string exists, but I don't know if that helps.
Following this string, there are 6 digits, I need to replace these 6 digits with 6 other digits.
This is what I have so far:
def editfile():
    f = open(filein,'r')
    filedata = f.read()
    f.close()
    #This is the line that needs help
    newdata = filedata.replace( -TOREPLACE- ,-REPLACER-)
    #Basically what I need is something that lets me say "D133330593******"
    #->"D133330593123456" Note: The following 6 digits don't need to be
    #anything specific, just different from the original 6
    f = open(filein,'w')
    f.write(newdata)
    f.close()

Use the re module to define your pattern and then use the sub() function to substitute occurrences of that pattern with your own string.
import re
...
pat = re.compile(r"D133330593\d{6}")
newdata = re.sub(pat, "D133330593abcdef", filedata)
The above defines a pattern as your string ("D133330593") followed by six decimal digits. The next line then replaces ALL occurrences of this pattern with your replacement string (the prefix followed by "abcdef" in this case), if that is what you want.
If you want a different replacement string for each occurrence of the pattern, you could call sub() repeatedly with the count keyword argument set to 1, which limits the number of replacements made per call. sub() also accepts a function as the replacement.
Check out this library for more info - https://docs.python.org/3.6/library/re.html
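A minimal sketch of how this could be applied to the asker's case with real digits as the replacement. The helper name replace_digits_after_marker and the constant MARKER are illustrative. A function is used as the replacement because a template such as r"\1123456" would be read as a reference to group 1123456, not group 1 followed by digits.

```python
import re

MARKER = "D133330593"

def replace_digits_after_marker(text, new_digits):
    """Replace the six digits that follow MARKER with new_digits, everywhere in text."""
    # Group 1 keeps the marker so it can be written back unchanged.
    pattern = re.compile("(" + re.escape(MARKER) + r")\d{6}")
    return pattern.sub(lambda m: m.group(1) + new_digits, text)

print(replace_digits_after_marker("zshisjD133330593090909fdjgsl", "123456"))
```

In editfile(), the call would take the place of the filedata.replace(...) line, with filedata as the text argument.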

Let's simplify your problem to you having a string:
s = "zshisjD133330593090909fdjgsl"
and you wanting to replace the 6 characters after "D133330593" with "123456" to produce:
"zshisjD133330593123456fdjgsl"
To achieve this, we first need to find the index of "D133330593". This is done using str.index:
i = s.index("D133330593")
Then replace the next 6 characters, but for this, we should first calculate the length of our string that we want to replace:
l = len("D133330593")
then do the replace:
s[:i+l] + "123456" + s[i+l+6:]
which gives us the desired result of:
'zshisjD133330593123456fdjgsl'
I am sure that you can now integrate this into your code to work with a file, but this is the heart of your problem.
Note that storing the index and length in variables, as above, avoids computing them more than once. Nevertheless, if your file isn't too long (i.e. efficiency isn't a big deal) you can do the whole process outlined above in one line:
s[:s.index("D133330593")+len("D133330593")] + "123456" + s[s.index("D133330593")+len("D133330593")+6:]
which gives the same result.

split by regex and add matches to dictionary

First time posting here.
I'd like to 1) parse the following text: "keyword: some keywords concept :some concepts"
and 2) store it in a dictionary: ['keyword']=>'some keywords', ['concept']=>'some concepts'.
There may be 0 or 1 'space' before each 'colon'. The following is what I've tried so far.
sample_text = "keyword: some keywords concept :some concepts"
p_res = re.compile(r"(\S+\s?):").split(sample_text) # Task 1
d_inc = dict([(k, v) for k,v in zip (p_res[::2], p_res[1::2])]) # Task 2
However, the resulting list p_res is wrong, with an empty entry at index 0, which consequently produces a wrong dict. Is there something wrong with my regex?

Use re.findall to capture a list of groups for each match, then apply dict to convert the list of tuples to a dict.
>>> import re
>>> s = 'keyword: some keywords concept :some concepts'
>>> dict(re.findall(r'(\S+)\s*:\s*(.*?)\s*(?=\S+\s*:|$)', s))
{'concept': 'some concepts', 'keyword': 'some keywords'}
>>>
The above regex captures each key and its corresponding value in two separate groups.
I assume that the input string contains only key-value pairs and that a key won't contain any space character.

Simply replace Task 1 with this line:
p_res = re.compile(r"(\S+\s?):").split(sample_text)[1:] # Task 1
This will always ignore the (normally empty) element that is returned by re.split.
Background: Why does re.split return the empty first result?
What should the program do with this input:
sample_text = "Hello! keyword: some keywords concept :some concepts"
The text Hello! at the beginning of the input doesn't fit into the definition of your problem (which assumes that the input starts with a key).
Do you want to ignore it? Do you want to raise an exception if it appears? Do you want to add it to your dictionary with a special key?
re.split doesn't want to decide this for you: it returns whatever information appears and you make your decision. In our solution, we simply ignore whatever appears before the first key.
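Putting the pieces together, a split-based version might look like the sketch below. It departs slightly from the asker's pattern by moving the optional space outside the capturing group, so the key comes back without a trailing space, and it strips the values; both tweaks are assumptions about the desired output.

```python
import re

sample_text = "keyword: some keywords concept :some concepts"

# With a capturing group, re.split returns the captured keys between the pieces:
# [text before first key, key, value, key, value, ...]
parts = re.split(r"(\S+)\s?:", sample_text)

keys = parts[1::2]                           # every captured key
values = [v.strip() for v in parts[2::2]]    # the text following each key
d_inc = dict(zip(keys, values))
print(d_inc)
```

Whatever precedes the first key ends up in parts[0] and is ignored here, as in the answer above.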

regular expressions to extract phone numbers

I am new to regular expressions and I am trying to write a pattern for phone numbers, in order to identify them and be able to extract them. My question can be summarized by the following simple example:
First I try to identify whether the string contains something like (+34), which should be optional:
prefixsrch = re.compile(r'(\(?\+34\)?)?')
which I test on the following string in the following way:
line0 = "(+34)"
print(prefixsrch.findall(line0))
which yields the result:
['(+34)','']
My first question is: why does it find two occurrences of the pattern? I guess that this is related to the fact that the prefix is optional, but I do not completely understand it. Anyway, now for my main question.
If we do a similar thing, searching for a pattern of 9 digits:
numsrch = re.compile(r'\d{9}')
line1 = "971756754"
print(numsrch.findall(line1))
this yields:
['971756754']
which is fine. Now what I want to do is identify a 9-digit number, optionally preceded by (+34). So to my understanding I should do something like:
phonesrch = re.compile(r'(\(?\+34\)?)?\d{9}')
If I test it on the following strings...
line0 = "(+34)971756754"
line1 = "971756754"
print(phonesrch.findall(line0))
print(phonesrch.findall(line1))
this is, to my surprise, what I get:
['(+34)']
['']
What I was expecting to get is ['(+34)971756754'] and ['971756754']. Does anybody have any insight into this? Thank you very much in advance.

Your capturing group is wrong. Put the country code in a non-capturing group and wrap the entire expression in the capturing group:
>>> line0 = "(+34)971756754"
>>> line1 = "971756754"
>>> re.findall(r'((?:\(?\+34\)?)?\d{9})',line0)
['(+34)971756754']
>>> re.findall(r'((?:\(?\+34\)?)?\d{9})',line1)
['971756754']
My first question is: why does it find two occurrences of the pattern?
This is because ? means match 0 or 1 repetitions, so an empty string is also a valid match. After '(+34)' has been consumed, findall finds one more (empty) match at the end of the string.
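The underlying behaviour is that re.findall returns the contents of the capturing group when the pattern has exactly one, and the whole match when it has none. A small sketch comparing the asker's pattern with a group-free variant (the variable names are illustrative):

```python
import re

lines = ["(+34)971756754", "971756754"]

with_group = re.compile(r"(\(?\+34\)?)?\d{9}")       # the asker's pattern
without_group = re.compile(r"(?:\(?\+34\)?)?\d{9}")  # prefix made non-capturing

for line in lines:
    # One capturing group: findall returns only that group's text.
    print(with_group.findall(line))
    # No capturing groups: findall returns the whole match.
    print(without_group.findall(line))
    # finditer gives match objects, so group(0) is always the whole match.
    print([m.group(0) for m in with_group.finditer(line)])
```

So the outer capturing group in the answer above is optional: dropping all capturing groups gives the same list of full numbers.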

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
