Regular expression in python - python

I am trying to match/sub the following line
line1 = '# Some text\n'
But avoid match/sub lines like this
'# Some text { .blah}\n'
So in other words, a # followed by any number of words, spaces and digits (no punctuation), and then the end of the line.
line2 = re.sub(r'# (\P+)$', r'# \1 { .text}', line1)
Puts the contents of line1 into line2 unchanged.
(I read somewhere that \P means everything except punctuation)
line2 = re.sub(r'# (\w*\d*\s*)+$', r'# \1 { .text}', line1)
Whereas the above gives
'# { .text}'
Any help is appreciated
Thanks
Tom

Your regex is a bit weird; expanded, it looks like
r"# ([a-zA-Z0-9_]*[0-9]*[ \t\n\r\f\v]*)+$"
Things to note:
It is not anchored to the beginning of the string, meaning it would match
print("Important stuff!") # Very important
The \d* is redundant, because it is already captured by \w*
Looking at your example, it seems you should be less worried about punctuation; the only thing you cannot have is a curly-brace ({).
Try
import re
def add_text(txt):
    return re.sub(r"^#([^{]*)$", r"#\1 { .text }", txt, flags=re.M)
text = "# Some text\n# More text { .blah}\nprint('abc') # but not me!\n# And once again"
print("===before===")
print(text)
print("\n===after===")
print(add_text(text))
which gives
===before===
# Some text
# More text { .blah}
print('abc') # but not me!
# And once again
===after===
# Some text { .text }
# More text { .blah}
print('abc') # but not me!
# And once again { .text }
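One caveat with the pattern above: [^{] also matches newline characters, so a comment line followed by a brace-free, non-comment line is matched as a single unit and the tag lands at the end of the following line. A minimal sketch of a stricter variant, assuming each line should be handled on its own (the name add_text_per_line and the sample string are made up for illustration):

```python
import re

def add_text_per_line(txt):
    # [^{\n]* cannot cross a line break, so each match stays on one line.
    return re.sub(r"^#([^{\n]*)$", r"#\1 { .text }", txt, flags=re.M)

sample = "# Some text\nprint('abc')\n# More text { .blah}"
result = add_text_per_line(sample)
print(result)
```

With the original [^{]* the same sample would put the tag after print('abc') instead of after the heading.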

If you only want lines which start with a # and continue with alphanumeric values, spaces and _, you want this:
/^#[\w ]+$/gm
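The /.../gm notation above is JavaScript-style. In Python the g flag is implicit in re.findall and re.sub, and m corresponds to re.MULTILINE. A rough equivalent, using a made-up sample string:

```python
import re

text = "# Some text\n# More text { .blah}\n# Section 2"

# re.MULTILINE makes ^ and $ match at the start and end of every line.
plain_headings = re.findall(r"^#[\w ]+$", text, flags=re.MULTILINE)
print(plain_headings)
```

Only the lines made up of word characters and spaces are returned; the line containing { .blah} is skipped.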

Related

How to remove text before a particular character or string in multi-line text?

I want to remove all the text before and including */ in a string.
For example, consider:
string = ''' something
other things
etc. */ extra text.
'''
Here I want extra text. as the output.
I tried:
string = re.sub("^(.*)(?=*/)", "", string)
I also tried:
string = re.sub(re.compile(r"^.\*/", re.DOTALL), "", string)
But when I print string, the operation has not been performed and the whole string is printed.
I suppose you're fine without regular expressions:
string[string.index("*/ ")+3:]
And if you want to strip that newline:
string[string.index("*/ ")+3:].rstrip()
The problem with your first regex is that . does not match newlines, as you noticed (the * inside its lookahead is also unescaped, which is a regex error). With your second one you were closer, but it is missing the * quantifier after the dot. This would work:
string = re.sub(re.compile(r"^.*\*/", re.DOTALL), "", string)
You can also just get the part of the string that comes after your "*/":
string = re.search(r"(\*/)(.*)", string, re.DOTALL).group(2)
Update: After doing some research, I found that the pattern (\n|.) to match everything including newlines is inefficient. I've updated the answer to use [\s\S] instead as shown on the answer I linked.
The problem is that . in a Python regex matches everything except newlines (unless re.DOTALL is set). For a regex solution, you can do the following:
import re
strng = ''' something
other things
etc. */ extra text.
'''
print(re.sub(r"[\s\S]+\*/", "", strng))
# extra text.
Add in a .strip() if you want to remove that remaining leading whitespace.
To keep the text up to that symbol instead, you can do:
split_str = string.split(' ')
boundary = split_str.index('*/')
new = ' '.join(split_str[0:boundary])
print(new)
which gives you:
something
other things
etc.
string_list = string.split('*/')[1:]
string = '*/'.join(string_list)
print(string)
which leaves string equal to
' extra text.\n'
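Another regex-free option is str.partition, which splits on the first occurrence of the delimiter and also reports whether it was found. A minimal sketch, assuming the original string should be kept unchanged when no */ is present:

```python
string = ''' something
other things
etc. */ extra text.
'''

# partition returns (text before, separator, text after);
# sep is empty when the delimiter does not occur.
before, sep, after = string.partition("*/")
result = after.strip() if sep else string
print(result)
```

Unlike string.index, this does not raise an exception when the delimiter is missing.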

Extract json values using just regex

I have a description field that is embedded within json and I'm unable to utilize json libraries to parse this data.
I use {0,23} in an attempt to extract the first 23 characters of the string. How can I extract the entire value associated with description?
import re
description = "'\description\" : \"this is a tesdt \n another test\" "
re.findall(r'description(?:\w+){0,23}', description, re.IGNORECASE)
For above code just ['description'] is displayed
You could try this code out:
import re
description = "description\" : \"this is a tesdt \n another test\" "
result = re.findall(r'(?<=description")(?:\s*:\s*)(".*?(?=")")', description, re.IGNORECASE | re.DOTALL)[0]
print(result)
Which gives you the result of:
"this is a tesdt
 another test"
Which is essentially:
\"this is a tesdt \n another test\"
And is what you have asked for in the comments.
Explanation -
(?<=description") is a positive look-behind that tells the regex to match the text preceded by description"
(?:\s*:\s*) is a non-capturing group that tells the regex that description" will be followed by zero-or-more spaces, a colon (:) and again zero-or-more spaces.
(".*?(?=")") is the actual match desired, which consists of a double quote ("), as few characters as possible (newlines included, because of re.DOTALL), and a double quote (") at the end. A bounded quantifier such as {0,23} fails here because the sample value is 30 characters long, so the closing quote is never reached.
# First just creating some test JSON
import json
data = {
    'items': [
        {
            'description': 'A "good" thing',
            # This is ignored because I'm assuming we only want the exact key 'description'
            'full_description': 'Not a good thing'
        },
        {
            'description': 'Test some slashes: \\ \\\\ \" // \\/ \n\r',
        },
    ]
}
j = json.dumps(data)
print(j)
# The actual code
import re
pattern = r'"description"\s*:\s*("(?:\\"|[^"])*?")'
descriptions = [
    # I'm using json.loads just to parse the matched string to interpret
    # escapes properly. If this is not acceptable then ast.literal_eval
    # will probably also work
    json.loads(d)
    for d in re.findall(pattern, j)]
# Testing that it works
assert descriptions == [item['description'] for item in data['items']]

Python Regex returning a list of all occurrences of block data

I needed some help regarding finding a block of text from a text file.
The text file is a structured one.
From the file, I want to extract blocks of data which start with a string and end with a } (curly bracket with no white space and \r\n)
Example -:
ABCD = XYZAHFJKBKFF
{
DATAFIELD1 = "TYPE1"
{
VALUE = 1
VALUE = 2
VALUE = 3
}
DATAFIELD1 = "TYPE2"
{
VALUE = 5
VALUE = 6
VALUE = 7
}
}
pattern = re.compile(r"ABCD.*}",re.DOTALL)
fafs = re.findall(pattern, data)
This does give me the result, but not as a list, even if I use a for loop like
for letters in re.findall(pattern, data):
    print(letters)
What I want to get is a list of all the blocks of data between "ABCD" and "}".
There can be many occurrences, and I want to get all of them as a list or some other iterable.
Can someone please help me with this?
Here, try this, it does what it sounds like you want:
txt = """ABCD = XYZAHFJKBKFF
{
sdfsd
sd
fsd
fsd
fsd
fsd
fsd
fsd
(This can be anything and including most common characters)
(This may Include Curly, Round brackets as well)
}"""
import re
pattern = re.compile(r".*")
_fafs = re.findall(pattern, txt[txt.index("ABCD")+4:txt.rindex("}")])
fafs = [faf for faf in _fafs if faf != ""]
for letters in fafs:
    print(letters)
The fafs = [faf for faf in _fafs if faf != ""] line is to remove the empty string items that appeared. Also, if you want to strip whitespace from the chunks of data, then replace that with fafs = [faf.strip(" \t\n\r") for faf in _fafs if faf != ""], and add any other whitespace characters (besides spaces, tabs, and two different newlines) into the str that is the argument to strip.
Oh, and replace txt's string literal with a call to whatever will procure the data you wish to parse. And if you want a starting flag other than "ABCD", then replace txt.index("ABCD")+4 with txt.index(flag)+len(flag)
And alternatively to _fafs = re.findall(pattern, txt[txt.index("ABCD")+4:txt.rindex("}")]):
ind = txt.index(flag)+len(flag)
_fafs = re.findall(pattern, txt[ind:txt.index("}", ind)])
BUT that'll stop at the first close brace after the starting flag.
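Since the blocks in the question contain nested braces, neither the first nor the last closing brace is a reliable end marker once the file holds several blocks. One possible approach is to count brace depth and cut each block where the depth returns to zero. The sketch below is only an illustration: extract_blocks is a made-up helper, the sample data is invented, and it assumes braces are balanced and never appear inside quoted values.

```python
def extract_blocks(text, flag="ABCD"):
    """Return every block that starts at `flag` and ends at its matching '}'."""
    blocks = []
    pos = text.find(flag)
    while pos != -1:
        depth = 0
        end = None
        for i in range(pos, len(text)):
            ch = text[i]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1  # include the closing brace
                    break
        if end is None:
            break  # unbalanced braces: stop rather than guess
        blocks.append(text[pos:end])
        pos = text.find(flag, end)
    return blocks

data = """ABCD = first
{
FIELD = "TYPE1"
{
VALUE = 1
}
}
ABCD = second
{
VALUE = 2
}
"""
blocks = extract_blocks(data)
print(len(blocks))
```

The result is an ordinary list, so it can be iterated over or indexed directly.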

Put all occurrences in the String in quotes using regex in python

I have a long string where I can find something like this: data() { <some data which is always different here> }. I want to put all occurrences in quotes. This is what I'm doing, but it has no effect:
string = re.sub(r'data \(\) {(.*)}', r'"/1"', string)
I suppose there should be something different between curly brackets but I have no idea what...
#EDIT
I realized my string looks like this:
data() {
<some white spaces> here is text
<some white spaces> }
Whitespace matters, the direction of the slashes matters (thanks Wiktor, I overlooked that before), and the quantifier should probably be lazy. Also, if there are newlines within your text, you need to allow for that:
string = re.sub(r'(?s)data\(\) {(.*?)}', r'"\1"', string)
Testing it on your sample text:
In [4]: string = """data() {
...: <some white spaces> here is text
...: <some white spaces> }"""
In [5]: print(re.sub(r'(?s)data\(\) {(.*?)}', r'"\1"', string))
"
<some white spaces> here is text
<some white spaces> "
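As the output shows, the whitespace and newlines inside the braces end up inside the quotes. If that is unwanted, re.sub also accepts a function as the replacement, which can trim the captured text first. A small sketch, assuming the surrounding whitespace should be dropped (quote_body is a made-up name, and \s* is used so that data() may or may not be followed by a space):

```python
import re

string = """data() {
    here is text
    }"""

def quote_body(match):
    # Trim whitespace around the captured body before quoting it.
    return '"' + match.group(1).strip() + '"'

result = re.sub(r"data\(\)\s*\{(.*?)\}", quote_body, string, flags=re.DOTALL)
print(result)
```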

Picking up field value using Python regex

This is an example of two lines in a file that I am trying to pick up information from.
...
{ "SubtitleSettings_REPOSITORY", FieldType_STRING, (int32_t)REPOSITORY},
{ "PREFERRED_SUBTITLE_LANGUAGE", FieldType_STRING,SUBTITLE_LANGUAGE},
...
What I want to do is find the 3rd field of this weird data structure for the line whose 1st field matches a given string, i.e.
SubtitleSettings_REPOSITORY => REPOSITORY
PREFERRED_SUBTITLE_LANGUAGE => SUBTITLE_LANGUAGE
The regex in my Python code can only handle the second line, but cannot cope with the first line. How can I improve it?
import re
...
#field is given a value in previous code, can be "SubtitleSettings_REPOSITORY", or "PREFERRED_SUBTITLE_LANGUAGE"
match = re.search(field+'"[, \t]+(\w+)[, \t]+(\w+)', src_file.read(), re.M|re.I)
return_value = match.group(2)
You can insert (?:\(\w+\))?, which allows (and ignores) an optional word in parentheses there:
match = re.search(field + r'"[, \t]+(\w+)[, \t]+(?:\(\w+\))?(\w+)', line, re.M|re.I)
With this, the line matches and you get 'REPOSITORY' as desired.
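The same idea can be wrapped in a small function. The sketch below is an illustration only: third_field is a hypothetical helper, re.escape is added in case the field name ever contains regex metacharacters, and the opening quote is included in the pattern (an assumption) so that a field name cannot match as the tail of a longer one.

```python
import re

def third_field(field, text):
    # "<field>", <second field>, optional (cast) and then the third field
    pattern = r'"' + re.escape(field) + r'"[, \t]+\w+[, \t]+(?:\(\w+\))?(\w+)'
    match = re.search(pattern, text, re.I)
    return match.group(1) if match else None

src = '''{ "SubtitleSettings_REPOSITORY", FieldType_STRING, (int32_t)REPOSITORY},
{ "PREFERRED_SUBTITLE_LANGUAGE", FieldType_STRING,SUBTITLE_LANGUAGE},'''

print(third_field("SubtitleSettings_REPOSITORY", src))
print(third_field("PREFERRED_SUBTITLE_LANGUAGE", src))
```

Returning None for a missing field avoids the AttributeError that match.group raises when re.search finds nothing.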
import re
with open("input.txt") as f:
    pattern = r"\{ \"(.+)\",.+,(.+)\}"
    for line in f:
        # findall returns a list of (first, third) tuples; take the only one
        first, third = re.findall(pattern, line.strip())[0]
        print(first.strip(), "=>", third.strip())
prints
SubtitleSettings_REPOSITORY => (int32_t)REPOSITORY
PREFERRED_SUBTITLE_LANGUAGE => SUBTITLE_LANGUAGE
where input.txt contains
{ "SubtitleSettings_REPOSITORY", FieldType_STRING, (int32_t)REPOSITORY},
{ "PREFERRED_SUBTITLE_LANGUAGE", FieldType_STRING,SUBTITLE_LANGUAGE}
Breakdown:
\{ \"(.+)\" matches strings with the structure { + space + " + text + " and extracts text
,.+,(.+)\} matches strings with the structure , + text1 + , + text2 + } and extracts text2
