How can I read a text file and replace numbers? - python

If I have many of these in a text file:
<Vertex> 0 {
-0.597976 -6.85293 8.10038
<UV> { 0.898721 0.149503 }
<RGBA> { 0.92549 0.92549 0.92549 1 }
}
...
<Vertex> 1507 {
12 -5.3146 -0.000708352
<UV> { 5.7487 0.180395 }
<RGBA> { 0.815686 0.815686 0.815686 1 }
}
How can I read through the text file and add 25 to the first number in the second row (-0.597976 in Vertex 0)?
I have tried splitting the second line's text at each space with .split(' '), then using float() on the third element and adding 25, but I don't know how to select that specific line in the text file.

Try ignoring the lines that start with "<", for example:
L = ["<Vertex> 0 {",
     "-0.597976 -6.85293 8.10038",
     "<UV> { 0.898721 0.149503 }",
     "<RGBA> { 0.92549 0.92549 0.92549 1 }"]
for l in L:
    if not l.startswith("<"):
        print(l.split(' ')[0])
Or if you read your data from a file:
with open("test.txt", "r") as f:
    for line in f:
        fields = line.strip().split(' ')
        try:
            print(float(fields[0]) + 25)
        except ValueError:
            pass

The hard way is to use the Python Lex/Yacc tools.
The hardest (did you expect "easy"?) way is to write a custom function that recognizes tokens (the tokens would be <Vertex>, numbers, braces, <UV> and <RGBA>; the token separators would be spaces).
I'm sorry, but what you're asking for is a mini-language if you cannot guarantee that the entries respect line breaks.
Another ugly (and even harder!) option, since that mini-language has no recursion, is a regex. But the regex solution would be just as long and ugly (trust me: a really long one).
So try Python Lex/Yacc: what you need is to parse a language, and even where a regex is possible here, you'll end up with an ugly and unmaintainable one. You do have to learn the basics of language parsing to use it.
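For a taste of that route, here is a minimal ply.lex tokenizer sketch for this format; the token names and regular expressions are my own assumptions, not an existing grammar:
import ply.lex as lex

tokens = ('VERTEX', 'UV', 'RGBA', 'LBRACE', 'RBRACE', 'NUMBER')

t_VERTEX = r'<Vertex>'
t_UV = r'<UV>'
t_RGBA = r'<RGBA>'
t_LBRACE = r'\{'
t_RBRACE = r'\}'
t_ignore = ' \t\r\n'  # token separators: whitespace

def t_NUMBER(t):
    r'-?\d+(\.\d+)?([eE][-+]?\d+)?'
    t.value = float(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)  # skip anything unrecognized

lexer = lex.lex()
lexer.input('<Vertex> 0 { -0.597976 -6.85293 8.10038 }')
for tok in iter(lexer.token, None):
    print(tok.type, tok.value)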

If the vertices will always be on the line after <Vertex>, you can look for that as a marker, then read the next line. If you read the second line, .strip() leading and trailing whitespace, then .split() by the space character, you will have a list of your three vertices, like so (assuming you have read the line into a string variable line):
>>> line = line.strip()
>>> vertices = line.split(' ')
>>> vertices
['-0.597976', '-6.85293', '8.10038']
What now? Call float() on the first item in your list, then add 25 to the result.
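Continuing the session above (24.402024 is just -0.597976 + 25):
>>> float(vertices[0]) + 25
24.402024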
The real challenge here is finding the <Vertex> marker and reading the subsequent line. This looks like a homework assignment, so I'll let you puzzle that out a bit first!

If your file is well-formatted, then you should be able to parse through the file pretty easily. Assuming <Vertex> is always on the line preceding a line with just the three numbers, you could do this:
newFile = []
with open(fileName) as file:  # fileName is assumed to hold the input path
    for line in file:
        newFile.append(line)
        if '<Vertex>' in line:
            line = next(file)  # the numbers line; raises StopIteration if the file ends here
            entries = line.strip().split()
            entries[0] = str(25 + float(entries[0]))
            newFile.append(' ' + ' '.join(entries) + '\n')
with open(newFileName, 'w') as fileToWrite:
    fileToWrite.writelines(newFile)

This syntax looks like a Panda3D .egg file.
I suggest you use Panda3D's file load, modify, and save functions to work on the file safely; see https://www.panda3d.org/manual/index.php/Modifying_existing_geometry_data
Something like:
INPUT = "path/to/myfile.egg"
def processGeomNode(node):
    pass  # something using modifyVertexData()

def main():
    model = loader.loadModel(INPUT)  # 'loader' is the Panda3D ShowBase global
    for nodePath in model.findAllMatches('**/+GeomNode').asList():
        processGeomNode(nodePath.node())

if __name__ == "__main__":
    main()

It is a Panda3D .egg file. The easiest and most reliable way to modify data in it is to use Panda3D's EggData API: parse the .egg file, modify the desired value through the resulting structures, and write it out again, without loss of data.
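A minimal sketch of that route, assuming a recent Panda3D where the egg classes are exposed as panda3d.egg (the tree walk is left as a comment, since the node classes you encounter depend on the file):
from panda3d.core import Filename
from panda3d.egg import EggData

egg = EggData()
egg.read(Filename.fromOsSpecific("path/to/myfile.egg"))
# Walk the node tree here, find the EggVertexPool entries,
# and adjust the vertex positions as needed.
egg.writeEgg(Filename.fromOsSpecific("path/to/out.egg"))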

Related

How to extract text part from file using Python & Regular Expressions

Using Python I want to read a text file, search for a string and print all lines between this matching string and another one.
The text file looks like the following:
Text=variables.Job_SalesDispatch.CaptionNew
Tab=0
TabAlign=0
}
}
}
[UserVariables]
User1=#StJid;IF(fields.Fieldtype="Artikel.Gerät" , STR$(fields.id,0,0) , #StJid)
[Parameters]
[#Parameters]
{
[Parameters]
{
LL.ProjectDescription=? (default)
LL.SortOrderID=
}
}
[PageLayouts]
[#PageLayouts]
{
[PageLayouts]
{
[PageLayout]
{
DisplayName=
Condition=Page() = 1
SourceTray=0
Now I want to print all "UserVariables", i.e. only the lines between [UserVariables] and the next line starting with a square bracket; in this example, that is [Parameters].
What I have done so far is:
with open("path/testfile.lst", encoding="utf8", errors="ignore") as file:
for line in file:
uservars = re.findall('\b(\w*UserVariables\w*)\b', line)
print (uservars)
what gives me only [].
If using regular expressions is not a mandatory requirement for you, you can go with something like this:
with open("path/testfile.lst", encoding="utf8", errors="ignore") as file:
inside_uservars = False
for line in file:
if inside_uservars:
if line.strip().startswith('['):
inside_uservars = False
else:
print(line)
if line.strip() == '[UserVariables]':
inside_uservars = True
We can try using re.findall with the following regex pattern:
\[UserVariables\]\n((?:(?!\[.*?\]).)*)
This says to match a [UserVariables] tag, followed by a slightly complicated looking expression:
((?:(?!\[.*?\]).)*)
This expression is a tempered dot trick which matches any character, one at a time, so long as what lies immediately ahead is not another tag contained in square brackets.
matches = re.findall(r'\[UserVariables\]\n((?:(?!\[.*?\]).)*)', input, re.DOTALL)
print(matches)
[' User1=#StJid;IF(fields.Fieldtype="Artikel.Ger\xc3\xa4t" , STR$(fields.id,0,0) , #StJid)\n']
Edit:
My answer assumes that the entire file content sits in memory, in a single Python string. You may read the entire file using:
with open('Path/to/your/file.txt', 'r') as content_file:
    input = content_file.read()
matches = re.findall(r'\[UserVariables\]\n((?:(?!\[.*?\]).)*)', input, re.DOTALL)
print(matches)

optimal method to parse a json object in a datafile

I am trying to set up a simple data file format, and I am working with these files in Python for analysis. The format basically consists of header information, followed by the data. For syntax and future extensibility reasons, I want to use a JSON object for the header information. An example file looks like this:
{
"name": "my material",
"sample-id": null,
"description": "some material",
"funit": "MHz",
"filetype": "material_data"
}
18 6.269311533 0.128658208 0.962033017 0.566268827
18.10945274 6.268810641 0.128691962 0.961950095 0.565591807
18.21890547 6.268312637 0.128725463 0.961814928 0.564998228...
If the data length/structure is always the same, this is not hard to parse. However, it raised the question of the most flexible way to parse out the JSON object, given an unknown number of lines, an unknown number of nested curly braces, and potentially more than one JSON object in the file.
If there is only one JSON object in the file, one can use this regular expression:
with open(fname, 'r') as fp:
    fstring = fp.read()
json_string = re.search('{.*}', fstring, flags=re.S)
However, if there is more than one JSON string, and I want to grab the first one, I need to use something like this:
def grab_json(mystring):
    lbracket = 0
    rbracket = 0
    lbracket_pos = 0
    rbracket_pos = 0
    for i in range(len(mystring)):
        if mystring[i] == '{':
            lbracket = 1
            lbracket_pos = i
            break
    for i in range(lbracket_pos + 1, len(mystring)):
        if mystring[i] == '}':
            rbracket += 1
            if rbracket == lbracket:
                rbracket_pos = i
                break
        elif mystring[i] == '{':
            lbracket += 1
    json_string = mystring[lbracket_pos:rbracket_pos + 1]
    return json_string, lbracket_pos, rbracket_pos

json_string, beg_pos, end_pos = grab_json(fstring)
I guess the question as always: is there a better way to do this? Better meaning simpler code, more flexible code, more robust code, or really anything?
The easiest solution, as Klaus suggested, is just to use JSON for the entire file. That makes your life much simpler because then writing is just json.dump and reading is just json.load.
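For instance (the file name and the "data" key here are illustrative, not part of your format):
import json

record = {
    "name": "my material",
    "funit": "MHz",
    "data": [[18, 6.269311533, 0.128658208, 0.962033017, 0.566268827]],
}
with open("material.json", "w") as f:
    json.dump(record, f, indent=2)

with open("material.json") as f:
    record = json.load(f)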
A second solution is to put the metadata in a separate file, which keeps reading and writing simple at the expense of multiple files for each data set.
A third solution would be, when writing the file to disk, to prepend the length of the JSON data. So writing might look something like:
metadata_json = json.dumps(metadata)
myfile.write('%d\n' % len(metadata_json))
myfile.write(metadata_json)
myfile.write(data)
Then reading looks like:
with open('myfile') as fd:
    header_len = fd.readline()
    metadata_json = fd.read(int(header_len))
    metadata = json.loads(metadata_json)
    data = fd.read()
A fourth option is to adopt an existing storage format (maybe hdf?) that already has the features you are looking for in terms of storing both data and metadata in the same file.
I would store the headers separately; that would let you reuse the same header file for multiple data files.
Alternatively, you may want to take a look at the Apache Parquet format, especially if you want to process your data on a distributed cluster using Spark.
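As a rough sketch of the Parquet route (this assumes pandas with a Parquet engine such as pyarrow installed; the column names are made up):
import pandas as pd

df = pd.DataFrame({
    "freq": [18.0, 18.10945274],
    "eps": [6.269311533, 6.268810641],
})
df.to_parquet("material.parquet")  # the header could live in a sidecar JSON file
df2 = pd.read_parquet("material.parquet")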

Importing wrongly concatenated JSONs in python

I have a text document that contains several thousand JSON strings in the form "{...}{...}{...}". This is not valid JSON itself, but each {...} is.
I currently use the following a regular expression to split them:
fp = open('my_file.txt', 'r')
raw_dataset = (re.sub('}{', '}\n{', fp.read())).split('\n')
This basically inserts a line break wherever one curly bracket closes and another opens (}{ -> }\n{), so I can split the objects onto different lines.
The problem is that a few of them have a tags attribute written as "{tagName1}{tagName2}", which breaks my regular expression.
An example would be:
'{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
Is parsed into
'{"name":"Bob Dylan", "tags":"{Artist}'
'{Singer}"}'
'{"name": "Michael Jackson"}'
instead of
'{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}'
'{"name": "Michael Jackson"}'
What is the proper way to achieve this for further JSON parsing?
Use the raw_decode method of json.JSONDecoder
>>> import json
>>> d = json.JSONDecoder()
>>> x='{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
>>> d.raw_decode(x)
({'tags': '{Artist}{Singer}', 'name': 'Bob Dylan'}, 47)
>>> x=x[47:]
>>> d.raw_decode(x)
({'name': 'Michael Jackson'}, 27)
raw_decode returns a 2-tuple, the first element being the decoded JSON and the second being the offset in the string of the first character after the decoded JSON.
To loop until the end or until an invalid JSON element is encountered:
>>> while True:
... try:
... j,n = d.raw_decode(x)
... except ValueError:
... break
... print(j)
... x=x[n:]
...
{'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}
{'name': 'Michael Jackson'}
When the loop breaks, inspection of x will reveal if it has processed the whole string or had encountered a JSON syntax error.
With a very long file of short elements you might read a chunk into a buffer and apply the above loop, concatenating anything that's left over with the next chunk after the loop breaks.
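A sketch of that chunked variant (the function name and chunk size are my own choices, not a standard API):
import json

def iter_json_objects(fp, chunk_size=65536):
    # Yield decoded objects from a stream of concatenated JSON documents.
    decoder = json.JSONDecoder()
    buf = ''
    while True:
        chunk = fp.read(chunk_size)
        buf = (buf + chunk).lstrip()
        while buf:
            try:
                obj, end = decoder.raw_decode(buf)
            except ValueError:
                break  # incomplete object at the end of the buffer: read more
            yield obj
            buf = buf[end:].lstrip()
        if not chunk:
            if buf:
                raise ValueError('unparseable trailing data: %r' % buf[:40])
            return
With the sample above, iterating over iter_json_objects(open('my_file.txt')) yields one dict per {...} block.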
You can use the jq command-line utility to transform your input into valid JSON. Let's say you have the following input:
input.txt:
{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}
You can use jq -s, which reads multiple JSON documents from the input and wraps them in a single output array:
jq -s . input.txt
Gives you:
[
{
"name": "Bob Dylan",
"tags": "{Artist}{Singer}"
},
{
"name": "Michael Jackson"
}
]
I've just realized that there are Python bindings for libjq, meaning you don't need to use the command line; you can use jq directly from Python: https://github.com/mwilliamson/jq.py
However, I've not tried it so far. Let me give it a try :) ...
Update: The above library is nice, but so far it does not support slurp mode.
You need to write a parser; I don't think a regex can help you here:
import json

def get_dicts(file_text):
    data = ""
    curlies = []
    for letter in file_text:
        data += letter
        if letter == "{":
            curlies.append(letter)
        elif letter == "}":
            curlies.pop()  # remove the matching open brace
            if not curlies:
                yield json.loads(data)
                data = ""
Note that this does not actually solve the problem that {name:"bob"} is not valid JSON ({"name":"bob"} is).
It will also break if you have weird unbalanced braces inside strings, i.e. {"name":"{{}}}"} would break it.
Really, your JSON is so broken that, based on your example, your best bet is probably to edit it by hand and fix the code that is generating it. If that is not feasible, you may need to write a more complex parser using PLY or some other grammar library (effectively writing your own language parser).

How can I elegantly combine/concat files by section with python?

Like many an unfortunate programmer soul before me, I am currently dealing with an archaic file format that refuses to die. I'm talking ~1970 format specification archaic. If it were solely up to me, we would throw out both the file format and any tool that ever knew how to handle it, and start from scratch. I can dream, but unfortunately that won't resolve my issue.
The format: pretty loosely defined, as years of nonsensical revisions have destroyed almost all the backward compatibility it once had. Basically, the only constant is that there are section headings, with few rules about what comes before or after these lines. The headings are sequential (e.g. HEADING1, HEADING2, HEADING3, ...), but they are not numbered and not all are required (e.g. HEADING1, HEADING3, HEADING7). Thankfully, all possible heading permutations are known. Here's a fake example:
# Bunch of comments
SHOES # First heading
# bunch text and numbers here
HATS # Second heading
# bunch of text here
SUNGLASSES # Third heading
...
My problem: I need to concatenate multiple such files by section heading. I have a Perl script that does this quite nicely:
while (my $l = <>) {
    if    ($l =~ /^SHOES/i)      { $r = \$shoes; name($r); }
    elsif ($l =~ /^HATS/i)       { $r = \$hats;  name($r); }
    elsif ($l =~ /^SUNGLASSES/i) { $r = \$sung;  name($r); }
    elsif ($l =~ /^DRESS/i || $l =~ /^SKIRT/i) { $r = \$dress; name($r); }
    ...
    elsif ($l =~ /^END/i) { $r = \$end; name($r); }
    else {
        $$r .= $l;
    }
    print STDERR "Finished processing $ARGV\n" if eof;
}
As you can see, with the Perl script I basically just change where a reference points when I hit a pattern match, and concatenate each line of the file onto its respective string until I get to the next pattern match. These strings are then printed out later as one big concatenated file.
I would and could stick with Perl, but my needs are becoming more complex every day, and I would really like to see how this problem can be solved elegantly with Python (can it be?). As of right now, my method in Python is basically to load the entire file as a string, search for the heading locations, then split up the string based on the heading indices and concatenate the pieces. This requires a lot of regexes, if-statements and variables for something that seems so simple in another language.
It seems that this really boils down to a fundamental language issue. I found a very nice SO discussion about Python's "call-by-object" style as compared with languages that are call-by-reference:
How do I pass a variable by reference?
Yet, I still can't think of an elegant way to do this in Python. If anyone can help kick my brain in the right direction, it would be greatly appreciated.
That's not even elegant Perl.
my @headers = qw( shoes hats sunglasses dress );
my $header_pat = join "|", map quotemeta, @headers;
my $header_re = qr/$header_pat/i;
my ( $section, %sections );
while (<>) {
    if    (/($header_re)/) { name( $section = \$sections{$1} ); }
    elsif (/skirt/i)       { name( $section = \$sections{'dress'} ); }
    else                   { $$section .= $_; }
    print STDERR "Finished processing $ARGV\n" if eof;
}
Or if you have many exceptions:
my @headers = qw( shoes hats sunglasses dress );
my %aliases = ( 'skirt' => 'dress' );
my $header_pat = join "|", map quotemeta, @headers, keys(%aliases);
my $header_re = qr/$header_pat/i;
my ( $section, %sections );
while (<>) {
    if (/($header_re)/) {
        name( $section = \$sections{ $aliases{$1} // $1 } );
    } else {
        $$section .= $_;
    }
    print STDERR "Finished processing $ARGV\n" if eof;
}
Using a hash saves the countless my declarations you didn't show.
You could also do $header_name = $1; name(\$sections{$header_name}); and $sections{$header_name} .= $_ for a bit more readability.
I'm not sure if I understand your whole problem, but this seems to do everything you need:
import sys

headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = [[] for header in headers]
for arg in sys.argv[1:]:
    section_index = 0
    with open(arg) as f:
        for line in f:
            if (section_index + 1 < len(headers)
                    and line.startswith(headers[section_index + 1])):
                section_index += 1
            else:
                sections[section_index].append(line)
Obviously you could change this to read or mmap the whole file, then re.search or just buf.find for the next header. Something like this (untested pseudocode):
import sys
from collections import defaultdict

headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = defaultdict(list)
for arg in sys.argv[1:]:
    with open(arg) as f:
        buf = f.read()
    section = None
    start = 0
    for header in headers[1:]:
        idx = buf.find('\n' + header, start)
        if idx != -1:
            sections[section].append(buf[start:idx])
            section = header
            start = buf.find('\n', idx + 1)
            if start == -1:
                break
    else:
        sections[section].append(buf[start:])
And there are plenty of other alternatives, too.
But the point is, I can't see anywhere where you'd need to pass a variable by reference in any of those solutions, so I'm not sure where you're stumbling on whichever one you've chosen.
So, what if you want to treat two different headings as the same section?
Easy: create a dict mapping headers to sections. For example, for the second version:
headers_to_sections = {None: None, 'SHOES': 'SHOES', 'HATS': 'HATS',
                       'DRESSES': 'DRESSES', 'SKIRTS': 'DRESSES'}
Now, in the code that does sections[section], just do sections[headers_to_sections[section]].
For the first, just make this a mapping from strings to indices instead of strings to strings, or replace sections with a dict. Or just flatten the two collections by using a collections.OrderedDict.
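For example, extending the first version's headers list with a DRESSES entry, the string-to-index mapping might look like this (the values are positions in that headers list):
headers = [None, 'SHOES', 'HATS', 'SUNGLASSES', 'DRESSES']
headers_to_indices = {None: 0, 'SHOES': 1, 'HATS': 2,
                      'SUNGLASSES': 3, 'DRESSES': 4, 'SKIRTS': 4}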
My deepest sympathies!
Here's some code (please excuse minor syntax errors)
def foundSectionHeader(l, secHdrs):
    # Return the matching header itself (or None), so the caller
    # can use it as a dict key.
    for s in secHdrs:
        if s in l:
            return s
    return None

def main():
    fileList = ['file1.txt', 'file2.txt']  # ...
    sectionHeaders = ['SHOES', 'HATS']  # ...
    sectionContents = dict()
    for section in sectionHeaders:
        sectionContents[section] = []
    for file in fileList:
        with open(file) as fp:
            lines = fp.readlines()
        idx = 0
        while idx < len(lines):
            sec = foundSectionHeader(lines[idx], sectionHeaders)
            if sec:
                idx += 1
                while idx < len(lines) and not foundSectionHeader(lines[idx], sectionHeaders):
                    sectionContents[sec].append(lines[idx])
                    idx += 1
            else:
                idx += 1
This assumes that you don't have content lines which look like "SHOES"/"HATS" etc.
Assuming you're reading from stdin, as in the Perl script, this should do it:
import sys
import collections

headings = {'SHOES': 'SHOES', 'HATS': 'HATS', 'DRESS': 'DRESS', 'SKIRT': 'DRESS'}  # etc...
sections = collections.defaultdict(str)
key = None
for line in sys.stdin:
    sline = line.strip()
    if sline not in headings:
        sections[headings.get(key)] += line  # defaultdict(str): concatenate, don't append
    else:
        key = sline
You'll end up with a dictionary like this:
{
    None: <all lines as a single string before any heading>,
    'HATS': <all lines as a single string below the HATS heading and before the next heading>,
    etc...
}
The headings list does not have to be defined in the same order as the headings appear in the input.
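To turn that dictionary back into one big concatenated file, a small sketch (the output order is a made-up example):
order = [None, 'SHOES', 'HATS', 'DRESS']  # hypothetical output order
with open('combined.txt', 'w') as out:
    for key in order:
        if key is not None:
            out.write(key + '\n')
        out.write(sections[key])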

How to make python program extensible

So I have a bunch of lines of code like these in a row in my program:
str = str.replace('ten', '10s')
str = str.replace('twy', '20s')
str = str.replace('fy', '40s')
...
I want to make it so that I don't have to manually open my source file to add new cases, for example ('sy', '70'). I know I have to put all these in a function somehow, but I'd also like to add cases that are not in my "mapper lib" from the command line. A configuration file maybe? How?
Thanks!
You could use a config file in JSON format like this:
[
["ten", "10s"],
["twy", "20s"],
["fy", "40s"]
]
Save it as 'replacements.json' and then use it this way:
import json

with open('replacements.json') as i:
    replacements = json.load(i)

text = 'ten, twy, fy'
for r in replacements:
    text = text.replace(r[0], r[1])
Then when you need to change the values just edit the replacements.json file without touching any Python code.
The format for your replacements file could be anything, but JSON is easy to use and edit.
A simple solution could be to put those pairs in a file, read them in your program, and do your replaces in a loop.
There are many ways to do this; if it's a rarely changing thing, you could consider doing it with a Python dict:
mappings = {
    'ten': '10s',
    'twy': '20s',
    'fy': '40s',
}

def replace(str_):
    for s, r in mappings.items():
        str_ = str_.replace(s, r)  # str.replace returns a new string
    return str_
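For example:
>>> replace('ten and fy')
'10s and 40s'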
Alternatively, put them in a text file (make sure you use a safe delimiter which isn't used in any of the keys!):
mappings.txt
ten|10s
twy|20s
fy|40s
And the Python part:
mappings = {}
with open('mappings.txt') as f:
    for line in f:
        k, v = line.rstrip('\n').split('|', 1)
        mappings[k] = v
And use the replace from above :)
You could use csv to store the replacements in a human-editable form in a file:
import csv

with open('replacements.csv', newline='') as f:
    replacements = list(csv.reader(f))

for old, new in replacements:
    your_string = your_string.replace(old, new)
where replacements.csv:
ten,10s
twy,20s
fy,40s
It avoids unnecessary markup such as ", [] in the JSON format, and unlike the plain-text format from @WoLpH's answer it allows the delimiter (,) to appear inside a string itself.
