Python remove JSON substring

Python remove JSON substring - python

If I have a string where there is a valid JSON substring like this one:
mystr = '100{"1":2, "3":4}312'
What is the best way to do extract just the JSON string? The numbers outside can be anything (except a { or }), including newlines and things like that.
Just to be clear, this is the result I want
newStr = '{"1":2, "3":4}'
The best way I can think of do this is to use find and rfind and then take the substring. This seems too verbose to me and it isn't python 3.0 compliant (which I would prefer but is not essential)
Any help is appreciated.

Note that the following code very much assumes that there is nothing other than non-bracket material on either side of the JSON string.
import re
matcher = re.compile(r"""
^[^\{]* # Starting from the beginning of the string, match anything that isn't an opening bracket
( # Open a group to record what's next
\{.+\} # The JSON substring
) # close the group
[^}]*$ # at the end of the string, anything that isn't a closing bracket
""", re.VERBOSE)
# Your example
print matcher.match('100{"1":2, "3":4}312').group(1)
# Example with embedded hashmap
print matcher.match('100{"1":{"a":"b", "c":"d"}, "3":4}312').group(1)
The short, non-precompiled, non-commented version:
import re
print re.match("^[^\{]*(\{[^\}]+\})[^}]*$", '100{"1":2, "3":4}312').group(1)
Although for the sake of maintenance, commenting regular expressions is very much preferred.

Related

Extracting a word between two path separators that comes after a specific word

I have the following path stored as a python string 'C:\ABC\DEF\GHI\App\Module\feature\src' and I would like to extract the word Module that is located between words \App\ and \feature\ in the path name. Note that there are file separators '\' in between which ought not to be extracted, but only the string Module has to be extracted.
I had the few ideas on how to do it:
Write a RegEx that matches a string between \App\ and \feature\
Write a RegEx that matches a string after \App\ --> App\\[A-Za-z0-9]*\\, and then split that matched string in order to find the Module.
I think the 1st solution is better, but that unfortunately it goes over my RegEx knowledge and I am not sure how to do it.
I would much appreciate any help.
Thank you in advance!

The regex you want is:
(?<=\\App\\).*?(?=\\feature\\)
Explanation of the regex:
(?<=behind)rest matches all instances of rest if there is behind immediately before it. It's called a positive lookbehind
rest(?=ahead) matches all instances of rest where there is ahead immediately after it. This is a positive lookahead.
\ is a reserved character in regex patterns, so to use them as part of the pattern itself, we have to escape it; hence, \\
.* matches any character, zero or more times.
? specifies that the match is not greedy (so we are implicitly assuming here that \feature\ only shows up once after \App\).
The pattern in general also assumes that there are no \ characters between \App\ and \feature\.
The full code would be something like:
str = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
pattern = rf"(?<=\{start}\).*?(?=\{end}\)"
print(pattern) # (?<=\\App\\).*?(?=\\feature\\)
print(re.search(pattern, str)[0]) # Module
A link on regex lookarounds that may be helpful: https://www.regular-expressions.info/lookaround.html

We can do that by str.find somethings like
str = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
import re
start = '\\App\\'
end = '\\feature\\'
print( (str[str.find(start)+len(start):str.rfind(end)]))
print("\n")
output
Module

Your are looking for groups. With some small modificatians you can extract only the part between App and Feature.
(?:App\\\\)([A-Za-z0-9]*)(?:\\\\feature)
The brackets ( ) define a Match group which you can get by match.group(1). Using (?:foo) defines a non-matching group, e.g. one that is not included in your result. Try the expression here: https://regex101.com/r/24mkLO/1

remove special charecters in a string using python

I have a string = "msdjdgf(^&%*(Aroha Technologies&^$^&*^CHJdjg" with special characters.
what i am trying is to remove all special charecters in the string and then display the word 'Aroha Technologies'
i was able to do with hard coding using lstrip() function but can anyone help me out how can i display string 'Aroha Technologies' in a single line using regular expressions.
edit suggested:-
by using this lstrip() and rstrip() functions i was able to remove characters from the string.
str = "msdjdgf(^&%*(Aroha Technologies&^$^&*^CHJdjg"
str=str.lstrip('msdjdgf(^&%*(')
str=str.rstrip('&^$^&*^CHJdjg')

here, A bit more dirty approach
import re # A module in python for String matching/operations
a = "msdjdgf(^&%*(Aroha Technologies&^$^&*^CHJdjg"
stuff = re.findall('\W(\w+\s\w+)\W', a)
print(stuff[0]) # Aroha Technologies
hope this helps ;)

You don't provide a lot of information, so this may or may not be close to what you want:
import re
origstr = "msdjdgf(^&%(Aroha Technologies&^$^&^CHJdjg"
match = re.search("[A-Z][a-z]*(?: [A-Z][a-z]*)*", origstr)
if match:
newstr = match.group()
(looks for a series of capitalized words with spaces between them)

Regex End of Line and Specific Chracters

So I'm writing a Python program that reads lines of serial data, and compares them to a dictionary of line codes to figure out which specific lines are being transmitted. I am attempting to use a Regular Expression in order to filter out the extra garbage line serial read string has on it, but I'm having a bit of an issue.
Every single code in my dictionary looks like this: T12F8B0A22**F8. The asterisks are the two alpha numeric pieces that differentiate each string code.
This is what I have so far as my regex: '/^T12F8B0A22[A-Z0-9]{2}F8$/'
I am getting a few errors with this however. My first error, is that there are some characters are the end of the string I still need to get rid of, which is odd because I thought $/ denoted the end of the line in regex. However when I run my code through the debugger I notice that after running through the following code:
#regexString contains the serial read line data
regexString = re.sub('/^T12F8B0A22[A-Z0-9]{2}F8$/', '', regexString)
My string looks something like this: 'T12F8B0A2200F8\\r'
I need to get rid of the \\r.
If for some reason I can't get rid of this with regex, how in python do you send specific string character through an argument? In this case I suppose it would be length - 3?

Your problem is threefold:
1) your string contains extra \r (Carriage Return character) before \n (New Line character); this is common in Windows and in network communication protocols; it is probably best to remove any trailing whitespace from your string:
regexString = regexString.rstrip()
2) as mentioned by Wiktor Stribiżew, your regexp is unnecessarily surrounded with / characters - some languages, like Perl, define regexp as a string delimited by / characters, but Python is not one of them;
3) your instruction using re.sub is actually replacing the matching part of regexString with an empty string - I believe this is the exact opposite of what you want (you want to keep the match and remove everything else, right?); that's why fixing the regexp makes things "even worse".
To summarize, I think you should use this instead of your current code:
m = re.match('T12F8B0A22[A-Z0-9]{2}F8', regexString)
regexString = m.group(0)

There are several ways to get rid of the "\r", but first a little analysis of your code :
1. the special charakter for the end is just '$' not '$\' in python.
2. re.sub will substitute the matched pattern with a string ( '' in your case) wich would substitute the string you want to get with an empty string and you are left with the //r
possible solutions:
use simple replace:
regexString.replace('\\r','')
if you want to stick to regex the approach is the same
pattern = '\\\\r'
match = re.sub(pattern, '',regexString)
2.2 if you want the acces the different groubs use re.search
match = re.search('(^T12F8B0A22[A-Z0-9]{2}F8)(.*)',regexString)
match.group(1) # will give you the T12...
match.groupe(2) # gives you the \\r

Just match what you want to find. Couple of examples:
import re
data = '''lots of
otherT12F8B0A2212F8garbage
T12F8B0A2234F8around
T12F8B0A22ABF8the
stringsT12F8B0A22CDF8
'''
print(re.findall('T12F8B0A22..F8',data))
['T12F8B0A2212F8', 'T12F8B0A2234F8', 'T12F8B0A22ABF8', 'T12F8B0A22CDF8']
m = re.search('T12F8B0A22..F8',data)
if m:
print(m.group(0))
T12F8B0A2212F8

String Simple Substitution

What's the easiest way of me converting the simpler regex format that most users are used to into the correct re python regex string?
As an example, I need to convert this:
string = "*abc+de?"
to this:
string = ".*abc.+de.?"
Of course I could loop through the string and build up another string character by character, but that's surely an inefficient way of doing this?

Those don't look like regexps you're trying to translate, they look more like unix shell globs. Python has a module for doing this already. It doesn't know about the "+" syntax you used, but neither does my shell, and I think the syntax is nonstandard.
>>> import fnmatch
>>> fnmatch.fnmatch("fooabcdef", "*abcde?")
True
>>> help(fnmatch.fnmatch)
Help on function fnmatch in module fnmatch:
fnmatch(name, pat)
Test whether FILENAME matches PATTERN.
Patterns are Unix shell style:
* matches everything
? matches any single character
[seq] matches any character in seq
[!seq] matches any char not in seq
An initial period in FILENAME is not special.
Both FILENAME and PATTERN are first case-normalized
if the operating system requires it.
If you don't want this, use fnmatchcase(FILENAME, PATTERN).
>>>

.replacing() each of the wildcards is the quick way, but what if the wildcarded string contains other regex special characters? eg. someone searching for 'my.thing*' probably doesn't mean that '.' to match any character. And in the worst case things like match-group-creating parentheses are likely to break your final handling of the regex matches.
re.escape can be used to put literal characters into regexes. You'll have to split out the wildcard characters first though. The usual trick for that is to use re.split with a matching bracket, resulting in a list in the form [literal, wildcard, literal, wildcard, literal...].
Example code:
wildcards= re.compile('([?*+])')
escapewild= {'?': '.', '*': '.*', '+': '.+'}
def escapePart((parti, part)):
if parti%2==0: # even items are literals
return re.escape(part)
else: # odd items are wildcards
return escapewild[part]
def convertWildcardedToRegex(s):
parts= map(escapePart, enumerate(wildcards.split(s)))
return '^%s$' % (''.join(parts))

You'll probably only be doing this substitution occasionally, such as each time a user enters a new search string, so I wouldn't worry about how efficient the solution is.
You need to generate a list of the replacements you need to convert from the "user format" to a regex. For ease of maintenance I would store these in a dictionary, and like #Konrad Rudolph I would just use the replace method:
def wildcard_to_regex(wildcard):
replacements = {
'*': '.*',
'?': '.?',
'+': '.+',
}
regex = wildcard
for (wildcard_pattern, regex_pattern) in replacements.items():
regex = regex.replace(wildcard_pattern, regex_pattern)
return regex
Note that this only works for simple character replacements, although other complex code can at least be hidden in the wildcard_to_regex function if necessary.
(Also, I'm not sure that ? should translate to .? -- I think normal wildcards have ? as "exactly one character", so its replacement should be a simple . -- but I'm following your example.)

I'd use replace:
def wildcard_to_regex(str):
return str.replace("*", ".*").replace("?", .?").replace("#", "\d")
This probably isn't the most efficient way but it should be efficient enough for most purposes. Notice that some wildcard formats allow character classes which are more difficult to handle.

Here is a Perl example of doing this. It is simply using a table to replace each wildcard construct with the corresponding regular expression. I've done this myself previously, but in C. It shouldn't be too hard to port to Python.

regex, how to exlude search in match

Might be a bit messy title, but the question is simple.
I got this in Python:
string = "start;some;text;goes;here;end"
the start; and end; word is always at the same position in the string.
I want the second word which is some in this case. This is what I did:
import re
string = "start;some;text;goes;here;end"
word = re.findall("start;.+?;" string)
In this example, there might be a few things to modify to make it more appropriate, but in my actual code, this is the best way.
However, the string I get back is start;some;, where the search characters themselves is included in the output. I could index both ;, and extract the middle part, but there have to be a way to only get the actual word, and not the extra junk too?

No need for regex in my opinion, but all you need is a capture group here.
word = re.findall("start;(.+?);", string)
Another improvement I'd like to suggest is not using .. Rather be more specific, and what you are looking for is simply anything else than ;, the delimiter.
So I'd do this:
word = re.findall("start;([^;]+);", string)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python remove JSON substring - python

Related

Extracting a word between two path separators that comes after a specific word

remove special charecters in a string using python

Regex End of Line and Specific Chracters

String Simple Substitution

regex, how to exlude search in match

Categories

Resources