How to extract part of a text file using Python & regular expressions

Using Python, I want to read a text file, search for a string, and print all lines between that matching string and another one.
The text file looks like the following:
Text=variables.Job_SalesDispatch.CaptionNew
Tab=0
TabAlign=0
}
}
}
[UserVariables]
User1=#StJid;IF(fields.Fieldtype="Artikel.Gerät" , STR$(fields.id,0,0) , #StJid)
[Parameters]
[#Parameters]
{
[Parameters]
{
LL.ProjectDescription=? (default)
LL.SortOrderID=
}
}
[PageLayouts]
[#PageLayouts]
{
[PageLayouts]
{
[PageLayout]
{
DisplayName=
Condition=Page() = 1
SourceTray=0
Now I want to print all "UserVariables", i.e. only the lines between [UserVariables] and the next line starting with a square bracket; in this example that would be [Parameters].
What I have done so far is:
import re

with open("path/testfile.lst", encoding="utf8", errors="ignore") as file:
    for line in file:
        uservars = re.findall('\b(\w*UserVariables\w*)\b', line)
        print(uservars)
which gives me only [].

If using regular expressions is not a mandatory requirement for you, you can go with something like this:
with open("path/testfile.lst", encoding="utf8", errors="ignore") as file:
inside_uservars = False
for line in file:
if inside_uservars:
if line.strip().startswith('['):
inside_uservars = False
else:
print(line)
if line.strip() == '[UserVariables]':
inside_uservars = True
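If you want the section's lines in a list instead of printing them, a small variation of the same state-flag idea could look like this (just a sketch, reusing the path from the question):
uservars = []
inside_uservars = False
with open("path/testfile.lst", encoding="utf8", errors="ignore") as file:
    for line in file:
        stripped = line.strip()
        if inside_uservars:
            if stripped.startswith('['):
                break  # next [Section] reached, stop collecting
            uservars.append(stripped)
        elif stripped == '[UserVariables]':
            inside_uservars = True
print(uservars)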

We can try using re.findall with the following regex pattern:
\[UserVariables\]\n((?:(?!\[.*?\]).)*)
This says to match a [UserVariables] tag, followed by a slightly complicated looking expression:
((?:(?!\[.*?\]).)*)
This expression is a tempered dot trick which matches any character, one at a time, so long as what lies immediately ahead is not another tag contained in square brackets.
matches = re.findall(r'\[UserVariables\]\n((?:(?!\[.*?\]).)*)', input, re.DOTALL)
print(matches)
[' User1=#StJid;IF(fields.Fieldtype="Artikel.Ger\xc3\xa4t" , STR$(fields.id,0,0) , #StJid)\n']
Edit:
My answer assumes that the entire file content sits in memory, in a single Python string. You may read the entire file using:
with open('Path/to/your/file.txt', 'r') as content_file:
    input = content_file.read()
matches = re.findall(r'\[UserVariables\]\n((?:(?!\[.*?\]).)*)', input, re.DOTALL)
print(matches)
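If you later need other sections as well, the same tempered-dot pattern can be built from an arbitrary section name. A small sketch (the helper name section_body is mine, not from the original answer):
import re

def section_body(text, section):
    # Same tempered-dot idea, with the [section] name escaped and parameterised.
    pattern = r'\[' + re.escape(section) + r'\]\n((?:(?!\[.*?\]).)*)'
    return re.findall(pattern, text, re.DOTALL)

print(section_body(input, 'UserVariables'))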

Related

Erases the following tag while trying to replace an HTML tag using Python

I have a Word doc in the following format:
test1.docx
["<h58>This is article|", "", ", "<s1>Author is|", "<s33>Research is on|", "<h4>CASE IS|", "<s6>1-3|"]
I tried to locate tags starting with <s.*?> and replace each tag and its contents with "":
import re
import json

def locatestag():
    fileis = open("test1.docx")
    for line in fileis:
        print(line)
        newfile = re.sub('<s.*?> .*? ', '', line)
        with open("new file.json", "w") as newoutput:
            json.dump(newfile, newoutput)
The final output file also makes tags like <h4> disappear.
The final contents are like:
["<h58>This is article|", "", ", ]
How do I remove the <s.*> tags and their contents only, while retaining the rest of the tags (i.e. retain the <h4> tag)?
Adjust your regex so that only the <s.*> tags and their contents are matched.
for line in fileis:
    print(line)
    newfile = re.sub('<s[^>]*>[^"]*(?=")', '', line)
The resulting newfile will be:
["<h58>This is article|", "", ", "", "", "<h4>CASE IS|", ""]
Of course this assumes that the "unclosed" double quote in what looks like an array of strings is not a typo but intentionally the content of your file.
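Put together as a minimal, self-contained sketch (the sample line is taken from the question; treating it as a plain text line rather than a real .docx is an assumption):
import re

line = '["<h58>This is article|", "", ", "<s1>Author is|", "<s33>Research is on|", "<h4>CASE IS|", "<s6>1-3|"]'
# Remove each <s...> tag together with its contents, up to (but not including) the closing quote.
print(re.sub('<s[^>]*>[^"]*(?=")', '', line))
# ["<h58>This is article|", "", ", "", "", "<h4>CASE IS|", ""]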
If you just want to remove the tags, and not everything after them, there is no need to add the extra .*?.
Here is the final code for you:
re.sub('<s.*?>', '', line)
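For comparison, a quick sketch of what this alternative leaves on the sample line (only the <s...> tags themselves are stripped; their contents remain):
import re

line = '["<h58>This is article|", "", ", "<s1>Author is|", "<s33>Research is on|", "<h4>CASE IS|", "<s6>1-3|"]'
print(re.sub('<s.*?>', '', line))
# ["<h58>This is article|", "", ", "Author is|", "Research is on|", "<h4>CASE IS|", "1-3|"]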

I am trying to write a Python script to find a string in a file with more than 1000 lines and delete a few lines (10) after that string match

In the Fastfile below (more than 1000 lines), I would like to search for the string "Validate repo test2" and delete the lines starting from "Validate repo test2" up to the string "end",
then rewrite the content to a new file.
Fastfile
desc "Validate repo test1"
lane :validate_repo do
lint_source
execute_tests
validate_docs
ensure_tool_name_formatting
ensure_code_samples
ensure_special_docs_code_samples
ensure_code_snippets
ensure_actions_config_items_formatting
end
desc "Validate repo test2"
lane :validate_repo do
lint_source
execute_tests
validate_docs
ensure_tool_name_formatting
ensure_code_samples
ensure_special_docs_code_samples
ensure_code_snippets
ensure_actions_config_items_formatting
end
desc "Validate repo test3"
lane :validate_repo do
lint_source
execute_tests
validate_docs
ensure_tool_name_formatting
ensure_code_samples
ensure_special_docs_code_samples
ensure_code_snippets
ensure_actions_config_items_formatting
end
You could do something like this:
with open('Fastfile', 'r') as f_orig, open('Fastfile_new', 'w') as f_new:
    skipping = False
    for line in f_orig:
        if 'Validate repo test2' in line:
            skipping = True
        if not skipping:
            f_new.write(line)
        if line[:3] == 'end':
            skipping = False
I'm new to this, so I'm not sure how to credit the author, but this was useful to me:
Regex Match all characters between two strings
Thanks #zx81
You can use the regex:
(?s)(?<="Validate repo test[\d]*").*(?=end)
http://www.rexegg.com/regex-modifiers.html#dotall
The (?s) at the start enables "dot all" mode; the rest of the regex selects all characters between "Validate repo test[\d]*" and "end".
From there you can use re.sub to remove all of them. Note that Python's re module does not accept the variable-width lookbehind above, so in code it is easier to match the whole block outright. All together it would look a bit like this:
import re

with open('Fastfile') as f:
    fileText = f.read()
regex = re.compile(r'"Validate repo test[\d]*".*?end', re.DOTALL)
result = regex.sub('', fileText)
with open('Fastfile_new', 'w') as f:
    f.write(result)
Maybe there are many solutions, but I think the following code can solve your problem too.
need_delete = False
with open(path_to_old_file, 'r') as fin, open(path_to_new_file, 'w+') as fout:
    for line in fin:
        if line.endswith('"Validate repo test2"\n'):
            need_delete = True
        if need_delete and line.strip() == 'end':
            need_delete = False
            continue
        if not need_delete:
            fout.write(line)
I hope this will help you.

Finding a pattern multiple times between start and end patterns python regex

I am trying to find a certain pattern between a start and an end pattern across multiple lines. Here is what I mean:
I read a file and saved it in the variable File; this is what the original file looks like:
File:
...
...
...
Keyword some_header_file {
XYZ g1234567S7894561_some_other_trash_underscored_text;
XYZ g1122334S9315919_different_other_trash_underscored_text;
}
...
...
...
I am trying to grab the 1234567 between the g and S, and also the 1122334. The some_header_file block can be any number of lines but always ends with }.
So I am trying to grab exactly 7 digits between the g and the S for all the lines from the "Keyword" till the "}" for that specific header.
This is what I used:
FirstSevenDigitPart = str(re.findall(r"Keyword\s%s.*\n.*XYZ\s[gd]([0-9]{7})[A-Z][0-9]{7}.*\}"%variable , str(File) , flags=re.MULTILINE))
but unfortunately it does not return anything, just a blank [].
What am I doing wrong? How can I accomplish this?
Thanks in advance.
You may read your file into a contents variable and use
import re
contents = "...\n...\n...\nKeyword some_header_file {\n XYZ g1234567S7894561_some_other_trash_underscored_text;\n XYZ g1122334S9315919_different_other_trash_underscored_text;\n}\n...\n...\n..."
results = []
variable = 'some_header_file'
block_rx = r'Keyword\s+{}\s*{{([^{{}}]*)}}'.format(re.escape(variable))
value_rx = r'XYZ\s[gd]([0-9]{7})[A-Z][0-9]{7}'
for block in re.findall(block_rx, contents):
    results.extend(re.findall(value_rx, block))
print(results)
# => ['1234567', '1122334']
The first regex (block_rx) will look like Keyword\s+some_header_file\s*{([^{}]*)} and will match all the blocks you need to search for values in. The second regex, XYZ\s[gd]([0-9]{7})[A-Z][0-9]{7}, matches what you need and returns the list of captures.
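To run the same code over your actual file rather than the hard-coded sample string, read it into contents first; a brief sketch (the file name is an assumption):
with open('header_defs.txt') as fh:  # hypothetical file name
    contents = fh.read()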
I think the simplest way here will be to use two expressions and run it in two steps. Here is a little example; of course you should optimize it for your needs.
import re
text = """Keyword some_header_file {
 XYZ g1234567S7894561_some_other_trash_underscored_text;
 XYZ g1122334S9315919_different_other_trash_underscored_text;
 }"""
all_lines_pattern = r'Keyword\s*%s\s*\{\n(?P<all_lines>(.|\s)*)\}'
first_match = re.match(all_lines_pattern % 'some_header_file', text)
if first_match is None:
    # some break logic here
    pass
found_lines = first_match.group(1)
print(found_lines)  # ' XYZ g1234567S7894561_some_other_trash_underscored_text;\n XYZ g1122334S9315919_different_other_trash_underscored_text;\n '
sub_pattern = r'(XYZ\s*[gd](?P<your_pattern>[0-9]{7})[A-Z]).*;'
found_groups = re.findall(sub_pattern, found_lines)
print(found_groups)  # [('XYZ g1234567S', '1234567'), ('XYZ g1122334S', '1122334')]

How can I read a text file and replace numbers?

If I have many of these in a text file:
<Vertex> 0 {
-0.597976 -6.85293 8.10038
<UV> { 0.898721 0.149503 }
<RGBA> { 0.92549 0.92549 0.92549 1 }
}
...
<Vertex> 1507 {
12 -5.3146 -0.000708352
<UV> { 5.7487 0.180395 }
<RGBA> { 0.815686 0.815686 0.815686 1 }
}
How can I read through the text file and add 25 to the first number in the second row? (-0.597976 in Vertex 0)
I have tried splitting the second line's text at each space with .split(' '), then using float() on the third element, and adding 25, but I don't know how to implicitly select the line in the text file.
Try to ignore the lines that start with "<", for example:
L = ["<Vertex> 0 {",
     "-0.597976 -6.85293 8.10038",
     "<UV> { 0.898721 0.149503 }",
     "<RGBA> { 0.92549 0.92549 0.92549 1 }"
    ]
for l in L:
    if not l.startswith("<"):
        print l.split(' ')[0]
Or if you read your data from a file:
f = open("test.txt", "r")
for line in f:
    line = line.strip().split(' ')
    try:
        print float(line[0]) + 25
    except:
        pass
f.close()
The hard way is to use Python Lex/Yacc tools.
The hardest (did you expect "easy"?) way is to make a custom function recognizing tokens (tokens would be <Vertex>, numbers, bracers, <UV> and <RGBA>; token separators would be spaces).
I'm sorry, but what you're asking for is a mini language if you cannot guarantee the entries respect the CRs and LFs.
Another ugly (and even harder!) way, since you don't use recursion in that mini language, is using regex. But the regex solution would be long and ugly in the same way and amount (trust me: a really long one).
Try using this library: Python Lex/Yacc, since what you need is to parse a language, and even though a regex is possible here, you'll end up with an ugly and unmaintainable one. You have to learn the tips of language parsing to use this.
If the vertices will always be on the line after <Vertex>, you can look for that as a marker, then read the next line. If you read the second line, .strip() leading and trailing whitespace, then .split() by the space character, you will have a list of your three vertices, like so (assuming you have read the line into a string variable line):
>>> line = line.strip()
>>> verticies = line.split(' ')
>>> verticies
['-0.597976', '-6.85293', '8.10038']
What now? Call float() on the first item in your list, then add 25 to the result.
The real challenge here is finding the <Vertex> marker and reading the subsequent line. This looks like a homework assignment, so I'll let you puzzle that out a bit first!
If your file is well-formatted, then you should be able to parse through the file pretty easily. Assuming <Vertex> is always on a line preceding a line with just the three numbers, you could do this:
newFile = []
for line in file:
    newFile.append(line)
    if '<Vertex>' in line:
        line = next(file)
        entries = line.strip().split()
        entries[0] = str(25 + float(entries[0]))
        line = ' ' + ' '.join(entries) + '\n'
        newFile.append(line)
with open(newFileName, 'w') as fileToWrite:
    fileToWrite.writelines(newFile)
This syntax looks like a Panda3d .egg file.
I suggest you use Panda's file load, modify, and save functions to work on the file safely; see https://www.panda3d.org/manual/index.php/Modifying_existing_geometry_data
Something like:
INPUT = "path/to/myfile.egg"

def processGeomNode(node):
    # something using modifyVertexData()
    pass

def main():
    model = loader.loadModel(INPUT)
    for nodePath in model.findAllMatches('**/+GeomNode').asList():
        processGeomNode(nodePath.node())

if __name__ == "__main__":
    main()
It is a Panda3D .egg file. The easiest and most reliable way to modify data in it is by using Panda3D's EggData API to parse the .egg file, modify the desired value through these structures, and write it out again, without loss of data.

Python Regex: find all lines that start with '{' and end with '}'

I am receiving data over a socket: a bunch of JSON strings. However, I receive a set amount of bytes, so sometimes the last of my JSON strings is cut off. I will typically get the following:
{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}
{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}
{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}
{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}
{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}
{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}
{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}
{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}
{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}
{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}
{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}
{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}
{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}
{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}
{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}
{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}
{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}
{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}
{"pitch":-30.816765,"yaw":-125
With Python, I would like to create a string array of the first 18 complete { data... } strings.
Here is what I have tried: cleanData = re.search('{.*}', data) but it seems like this is only giving me the very first { data... } entry. How can I get the full string array of complete { } sets?
To get all, you can use re.finditer or re.findall.
>>> re.findall(r'{.*}', s)
['{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}', '{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}', '{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}', '{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}', '{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}', '{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}', '{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}', '{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}', '{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}', '{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}', '{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}', '{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}', '{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}', '{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}', '{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}', '{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}', '{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}', '{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}']
>>>
OR
>>> [x.group() for x in re.finditer(r'{.*}', s)]
['{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}', '{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}', '{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}', '{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}', '{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}', '{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}', '{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}', '{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}', '{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}', '{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}', '{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}', '{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}', '{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}', '{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}', '{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}', '{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}', '{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}', '{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}']
>>>
You need re.findall() (or re.finditer)
>>> import re
>>> for r in re.findall(r'{.*}', data)[:18]:
...     print r
{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}
{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}
{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}
{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}
{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}
{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}
{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}
{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}
{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}
{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}
{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}
{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}
{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}
{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}
{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}
{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}
{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}
{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}
Extracting lines that start and end with a specific character can be done without any regex; use the str.startswith and str.endswith methods when iterating through the lines in a file:
results = []
with open(filepath, 'r') as f:
    for line in f:
        if line.startswith('{') and line.rstrip('\n').endswith('}'):
            results.append(line.rstrip('\n'))
Note the .rstrip('\n') is used before .endswith to make sure the final newline does not interfere with the } check at the end of the string.
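Since each complete line is itself valid JSON, another option (not from the answers above, just a hedged sketch) is to let json.loads reject the truncated final line instead of checking the braces yourself; here data is assumed to hold the received text:
import json

records = []
for line in data.splitlines():
    try:
        records.append(json.loads(line))  # complete lines parse into dicts
    except ValueError:
        pass  # the cut-off final line fails to parse and is skipped
print(len(records))  # 18 for the sample above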
