How can I elegantly combine/concat files by section with python?

Like many an unfortunate programmer soul before me, I am currently dealing with an archaic file format that refuses to die. I'm talking ~1970 format specification archaic. If it were solely up to me, we would throw out both the file format and any tool that ever knew how to handle it, and start from scratch. I can dream, but unfortunately that won't resolve my issue.
The format: Pretty loosely defined, as years of nonsensical revisions have destroyed almost all backward compatibility it once had. Basically, the only constant is that there are section headings, with few rules about what comes before or after these lines. The headings are sequential (e.g. HEADING1, HEADING2, HEADING3, ...), but not numbered and not required (e.g. HEADING1, HEADING3, HEADING7). Thankfully, all possible heading permutations are known. Here's a fake example:
# Bunch of comments
SHOES # First heading
# bunch text and numbers here
HATS # Second heading
# bunch of text here
SUNGLASSES # Third heading
...
My problem: I need to concatenate multiple of these files by these section headings. I have a perl script that does this quite nicely:
while (my $l = <>) {
    if    ($l =~ /^SHOES/i)      { $r = \$shoes; name($r); }
    elsif ($l =~ /^HATS/i)       { $r = \$hats;  name($r); }
    elsif ($l =~ /^SUNGLASSES/i) { $r = \$sung;  name($r); }
    elsif ($l =~ /^DRESS/i || $l =~ /^SKIRT/i) { $r = \$dress; name($r); }
    ...
    ...
    elsif ($l =~ /^END/i) { $r = \$end; name($r); }
    else {
        $$r .= $l;
    }
    print STDERR "Finished processing $ARGV\n" if eof;
}
As you can see, with the perl script I basically just change where a reference points when I get to a certain pattern match, and concatenate each line of the file to its respective string until I get to the next pattern match. These are then printed out later as one big concatenated file.
I would and could stick with perl, but my needs are becoming more complex every day and I would really like to see how this problem can be solved elegantly with python (can it?). As of right now, my method in python is basically to load the entire file as a string, search for the heading locations, then split up the string based on the heading indices and concatenate the strings. This requires a lot of regex, if-statements and variables for something that seems so simple in another language.
It seems that this really boils down to a fundamental language issue. I found a very nice SO discussion about python's "call-by-object" style as compared with that of other languages that are call-by-reference.
How do I pass a variable by reference?
Yet, I still can't think of an elegant way to do this in python. If anyone can help kick my brain in the right direction, it would be greatly appreciated.

That's not even elegant Perl.
my @headers = qw( shoes hats sunglasses dress );
my $header_pat = join "|", map quotemeta, @headers;
my $header_re = qr/$header_pat/i;

my ( $section, %sections );
while (<>) {
    if    (/($header_re)/) { name( $section = \$sections{$1} ); }
    elsif (/skirt/i)       { name( $section = \$sections{'dress'} ); }
    else                   { $$section .= $_; }
    print STDERR "Finished processing $ARGV\n" if eof;
}
Or if you have many exceptions:
my @headers = qw( shoes hats sunglasses dress );
my %aliases = ( 'skirt' => 'dress' );
my $header_pat = join "|", map quotemeta, @headers, keys(%aliases);
my $header_re = qr/$header_pat/i;

my ( $section, %sections );
while (<>) {
    if (/($header_re)/) {
        name( $section = \$sections{ $aliases{$1} // $1 } );
    } else {
        $$section .= $_;
    }
    print STDERR "Finished processing $ARGV\n" if eof;
}
Using a hash saves the countless my declarations you didn't show.
You could also do $header_name = $1; name(\$sections{$header_name}); and $sections{$header_name} .= $_ for a bit more readability.

I'm not sure if I understand your whole problem, but this seems to do everything you need:
import sys
headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = [[] for header in headers]
for arg in sys.argv[1:]:
    section_index = 0
    with open(arg) as f:
        for line in f:
            # Guard against running past the last known heading.
            if (section_index + 1 < len(headers)
                    and line.startswith(headers[section_index + 1])):
                section_index = section_index + 1
            else:
                sections[section_index].append(line)
Obviously you could change this to read or mmap the whole file, then re.search or just buf.find for the next header. Something like this (untested pseudocode):
import sys
from collections import defaultdict

headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = defaultdict(list)
for arg in sys.argv[1:]:
    with open(arg) as f:
        buf = f.read()
    section = None
    start = 0
    for header in headers[1:]:
        idx = buf.find('\n' + header, start)
        if idx != -1:
            sections[section].append(buf[start:idx])
            section = header
            start = buf.find('\n', idx + 1)
            if start == -1:
                break
    else:
        sections[section].append(buf[start:])
And there are plenty of other alternatives, too.
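For instance, since you already load the whole file as a string, re.split with a capturing group keeps the heading names alongside the section bodies. A minimal sketch, assuming the abbreviated heading list from above (note that any comment on the heading line itself is dropped):
import re

HEADING_RE = re.compile(r'^(SHOES|HATS|SUNGLASSES)\b.*$', re.M)

def sections_of(buf):
    # re.split keeps the captured heading names in the result:
    # [preamble, 'SHOES', body, 'HATS', body, ...]
    parts = HEADING_RE.split(buf)
    result = {None: parts[0]}
    for name, body in zip(parts[1::2], parts[2::2]):
        result[name] = result.get(name, '') + body
    return result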
But the point is, I can't see anywhere where you'd need to pass a variable by reference in any of those solutions, so I'm not sure where you're stumbling on whichever one you've chosen.
So, what if you want to treat two different headings as the same section?
Easy: create a dict mapping headers to sections. For example, for the second version:
headers_to_sections = {None: None, 'SHOES': 'SHOES', 'HATS': 'HATS',
                       'DRESSES': 'DRESSES', 'SKIRTS': 'DRESSES'}
Now, in the code that does sections[section], just do sections[headers_to_sections[section]].
For the first, just make this a mapping from strings to indices instead of strings to strings, or replace sections with a dict. Or just flatten the two collections by using a collections.OrderedDict.
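For example, a minimal sketch of that mapping for the index-based version (the header names echo the examples above; fill in your real list):
headers = [None, 'SHOES', 'HATS', 'SUNGLASSES', 'DRESSES']
headers_to_indices = {None: 0, 'SHOES': 1, 'HATS': 2,
                      'SUNGLASSES': 3, 'DRESSES': 4, 'SKIRTS': 4}
sections = [[] for header in headers]
# ...and wherever a line is appended:
# sections[headers_to_indices[section]].append(line)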

My deepest sympathies!
Here's some code:
def foundSectionHeader(line, secHdrs):
    # Return the matching section header, or None if this is not a heading line.
    for s in secHdrs:
        if s in line:
            return s
    return None

def main():
    fileList = ['file1.txt', 'file2.txt', ...]
    sectionHeaders = ['SHOES', 'HATS', ...]
    sectionContents = dict()
    for section in sectionHeaders:
        sectionContents[section] = []
    for file in fileList:
        with open(file) as fp:
            lines = fp.readlines()
        idx = 0
        while idx < len(lines):
            sec = foundSectionHeader(lines[idx], sectionHeaders)
            idx += 1
            if sec:
                while idx < len(lines) and not foundSectionHeader(lines[idx], sectionHeaders):
                    sectionContents[sec].append(lines[idx])
                    idx += 1
This assumes that you don't have content lines which look like "SHOES"/"HATS" etc.

Assuming you're reading from stdin, as in the perl script, this should do it:
import sys
import collections

headings = {'SHOES': 'SHOES', 'HATS': 'HATS', 'DRESS': 'DRESS', 'SKIRT': 'DRESS'}  # etc...
sections = collections.defaultdict(str)
key = None
for line in sys.stdin:
    sline = line.strip()
    if sline not in headings:
        sections[headings.get(key)] += line
    else:
        key = sline
You'll end up with a dictionary like this:
{
    None: <all lines before any heading, as a single string>,
    'HATS': <all lines below the HATS heading and before the next heading, as a single string>,
    ...
}
The headings mapping does not have to be defined in the same order as the headings appear in the input.
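If you then want to emit the merged result as one concatenated file, a small follow-on sketch (heading_order is an assumption; list the real headings in the order you want them written):
heading_order = [None, 'SHOES', 'HATS', 'DRESS']
for key in heading_order:
    if key is not None:
        sys.stdout.write(key + '\n')  # re-emit the heading line
    sys.stdout.write(sections.get(key, ''))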

Related

Adding text in the middle of the file

I have these files:
actions.js - append before }
import {constants} from "./constants";
export const setUser = (value) => ({
type: constants.SET_USER,
payload: value,
});
//here
constants.js - append to the end
export const constants = {
SET_USER: "SET_USER",
//here
};
reducers.js - add a const above export and inside the combineReducers object
import {constants} from "./constants";
import {combineReducers} from "redux";
const user = (state = null, action) => action.type === constants.SET_USER ? action.payload : state;
//here
export const reducers = combineReducers({
user,
// here
})
And I want to add code into these files in the places where I put //here. How can I do that with Python? I know I can overwrite a file with open('file', 'w').write('string'), but how can I add text without losing what is already there? I want to add the text to the existing file, not create it from scratch or overwrite it. I want it to keep the old text and add the new text to it. How can I achieve this with Python?
I made it append to the actions.js like this:
import sys

reducer = sys.argv[1]
constant = "SET_" + reducer.upper()  # e.g. "user" -> "SET_USER"
open("actions.js", "a").write("""export const set{reducer} = (value) => ({{
  type: constants.{constant},
  payload: value,
}});
""".format(reducer=reducer.capitalize(), constant=constant))
But I have no idea how to get the others done
Read the file, slice the string at the index you want, concatenate the pieces in order, and then write back to the file with the cursor at 0. Let x.txt be your file. "export" in the index() call below refers to a unique, non-repeating word. You can use unique comments to slice the string at the respective positions!
with open("x.txt", "r+") as f:
    old = f.read()
    print(old)
    constant_text = "What you want to add??"
    result = old[0:old.index("export")] + constant_text + old[old.index("export"):]
    # print(result)
    f.seek(0)
    f.write(result)
print("######################################")
print(result)
Make sure the index keywords are unique if you want to slice in multiple locations using keywords!
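The same idea generalizes to several unique markers. A sketch (the marker and inserted text are illustrative); the f.truncate() guards against leftover bytes if you ever insert less than you remove:
insertions = {"export": "// inserted before the export\n"}
with open("x.txt", "r+") as f:
    content = f.read()
    for marker, addition in insertions.items():
        i = content.index(marker)  # raises ValueError if a marker is missing
        content = content[:i] + addition + content[i:]
    f.seek(0)
    f.write(content)
    f.truncate()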
To my knowledge, this is not possible in a single operation the way you suggest. My solution of choice would be to iterate over the file's lines and, once you hit your // here marker, insert the code.
new_content = ""
with open(file_name) as f:
for line in f.readlines():
new_content += line
if line.strip() == "// here":
new_content += text_to_insert
After this loop, new_content should hold the old text and the new* inserted at the right place, which you can then write to any file you like.
*assuming that your input is properly formatted, including line breaks and so on.
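Wrapped into a helper, this covers all three files; the SET_AGE/age snippets below are made-up examples, and note that the files above use both //here and // here as markers:
def insert_at_marker(file_name, marker, text_to_insert):
    new_content = ""
    with open(file_name) as f:
        for line in f:
            new_content += line
            if line.strip() == marker:
                new_content += text_to_insert
    with open(file_name, "w") as f:
        f.write(new_content)

insert_at_marker("constants.js", "//here", '  SET_AGE: "SET_AGE",\n')
insert_at_marker("reducers.js", "// here", "  age,\n")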

How to extract a block of lines from given file using python

I have a file like this
grouping data-rate-parameters {
description
"Data rate configuration parameters.";
reference
"ITU-T G.997.2 clause 7.2.1.";
leaf maximum-net-data-rate {
type bbf-yang:data-rate32;
default "4294967295";
description
"Defines the value of the maximum net data rate (see clause
11.4.2.2/G.9701).";
reference
"ITU-T G.997.2 clause 7.2.1.1 (MAXNDR).";
}
leaf psd-level {
type psd-level;
description
"The PSD level of the referenced sub-carrier.";
}
}
}
grouping line-spectrum-profile {
description
"Defines the parameters contained in a line spectrum
profile.";
leaf profiles {
type union {
type enumeration {
enum "all" {
description
"Used to indicate that all profiles are allowed.";
}
}
type profiles;
}
Here I want to extract every leaf block. For example, the leaf maximum-net-data-rate block is
leaf maximum-net-data-rate {
type bbf-yang:data-rate32;
default "4294967295";
description
"Defines the value of the maximum net data rate (see clause
11.4.2.2/G.9701).";
reference
"ITU-T G.997.2 clause 7.2.1.1 (MAXNDR).";
}
This is how I want to extract each block.
I tried with this code, which reads the block based on counting braces ('{'):
with open(r'file.txt', 'r') as f:
    leaf_part = []
    count = 0
    c = 'psd-level'
    for line in f:
        if 'leaf %s {' % c in line:
            cur_line = line
            for line in f:
                pre_line = cur_line
                cur_line = line
                if '{' in pre_line:
                    leaf_part.append(pre_line)
                    count += 1
                elif '}' in pre_line:
                    leaf_part.append(pre_line)
                    count -= 1
                elif count == 0:
                    break
                else:
                    leaf_part.append(pre_line)
It worked for leaf maximum-net-data-rate, but it's not working for leaf psd-level: for psd-level it also includes lines from outside the block.
Help me to achieve this task.
It just needs a simple edit to your break condition: because of the multiple closing brackets '}', your count has already gone negative, so you need to change that line to
elif count <= 0:
    break
but it is still appending extra closing braces to your list, so you can handle that by keeping a record of the opening brackets. I changed the code as below:
with open(r'file.txt', 'r') as f:
    leaf_part = []
    braces_record = []
    count = 0
    c = 'psd-level'
    for line in f:
        if 'leaf %s {' % c in line:
            braces_record.append('{')
            cur_line = line
            for line in f:
                pre_line = cur_line
                cur_line = line
                if '{' in pre_line:
                    braces_record.append('{')
                    leaf_part.append(pre_line)
                    count += 1
                elif '}' in pre_line:
                    try:
                        braces_record.pop()
                        if len(braces_record) > 0:
                            leaf_part.append(pre_line)
                    except:
                        pass
                    count -= 1
                elif count <= 0:
                    break
                elif '}' not in pre_line:
                    leaf_part.append(pre_line)
Result of above code:
leaf psd-level {
type psd-level;
description
"The PSD level of the referenced sub-carrier.";
}
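For comparison, the same brace-counting idea fits in one loop with a single depth counter. A minimal sketch (extract_leaf and its arguments are illustrative names, not from the question):
def extract_leaf(path, name):
    # Collect lines from 'leaf <name> {' to its matching closing brace.
    block = []
    depth = 0
    with open(path) as f:
        for line in f:
            if not block and 'leaf %s {' % name not in line:
                continue  # still searching for the block's first line
            block.append(line)
            depth += line.count('{') - line.count('}')
            if depth <= 0:
                break  # the braces balanced out: block is complete
    return ''.join(block)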
You can use regex:
import re
reg = re.compile(r"leaf.+?\{.+?\}", re.DOTALL)
reg.findall(file)
It returns an array of all matched blocks
If you want to search for specific leaf names, you can use format (remember to double the curly brackets):
leafname = "maximum-net-data-rate"
reg = re.compile(r"leaf\s{0}.+?\{{.+?\}}".format(leafname), re.DOTALL)
EDIT: for python 2.7
reg = re.compile(r"leaf\s%s.+?\{.+?\}" % leafname, re.DOTALL)
EDIT2: totally missed that you have nested brackets in your last example.
This solution will be much more involved than a simple regex, so you might consider another approach. Still, it is possible to do.
First, you will need to install regex module, since built-in re does not support recursive patterns.
pip install regex
Second, here is your pattern:
import regex
reg = regex.compile(r"(leaf.*?)({(?>[^\{\}]|(?2))*})", regex.DOTALL)
reg.findall(file)
Now, this pattern will return a list of tuples, so you may want to do something like this
res = [el[0]+el[1] for el in reg.findall(file)]
This should give you the list of full results.
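A quick usage sketch (findall wants the file's contents as a string, so read the file first; file.txt is an assumed name):
with open('file.txt') as f:
    contents = f.read()
res = [head + body for head, body in reg.findall(contents)]
print(res[0])  # the first leaf block, braces balanced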

Collapse rows based on column 1

I want to parse InterProScan results for TopGO R package.
I would like to have a file in a format a bit distant of what I have.
# input file (gene_ID GO_ID1, GO_ID2, GO_ID3, ....)
Q97R95 GO:0004349, GO:0005737, GO:0006561
Q97R95 GO:0004349, GO:0006561
Q97R95 GO:0005737, GO:0006561
Q97R95 GO:0006561
# desired output (removed duplicates and rows collapsed)
Q97R95 GO:0004349,GO:0005737,GO:0006561
You can test your tool with the whole data file here:
https://drive.google.com/file/d/0B8-ZAuZe8jldMHRsbGgtZmVlZVU/view?usp=sharing
You can make use of gnu awk's (gawk 4+) true two-dimensional arrays:
awk -F'[, ]+' '{for(i=2;i<=NF;i++)r[$1][$i]}
END{for(x in r){
printf "%s ",x;b=0;
for(y in r[x]){printf "%s%s",(b?",":""),y;b=1}
print ""}
}' file
It gives:
Q97R95 GO:0005737,GO:0006561,GO:0004349
The duplicate fields are removed; however, the order is not kept.
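If the order matters, a small Python equivalent that de-duplicates while preserving first-seen order (assuming the input arrives on stdin):
import sys
from collections import OrderedDict

merged = OrderedDict()
for line in sys.stdin:
    key, _, rest = line.partition(' ')
    terms = merged.setdefault(key, OrderedDict())
    for go_id in rest.replace(',', ' ').split():
        terms[go_id] = None  # dict keys de-duplicate, keeping insertion order

for key, terms in merged.items():
    print('%s %s' % (key, ','.join(terms)))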
Here is a, hopefully tidy, Perl solution. It preserves order of keys and values as far as possible, and doesn't keep the whole file contents in memory, only as much as necessary to do the job.
#!perl
use strict;
use warnings;

my $prev_key = '';
my (@seen_values, %seen_values);

while (<>) {
    # Parse the input
    chomp;
    my ($key, $values) = split /\s+/, $_, 2;
    my @values = split /,\s*/, $values;

    # If we have a new key...
    if ($key ne $prev_key) {
        # output the old data, as long as there is some,
        if (@seen_values) {
            print "$prev_key\t", join(", ", @seen_values), "\n";
        }
        # clear it out,
        @seen_values = %seen_values = ();
        # and remember the new key for next time.
        $prev_key = $key;
    }

    # Merge this line's values with previous ones, de-duplicating
    # but preserving order.
    for my $value (@values) {
        push @seen_values, $value unless $seen_values{$value}++;
    }
}

# Output what's left after the last line
if (@seen_values) {
    print "$prev_key\t", join(", ", @seen_values), "\n";
}

How can I read a text file and replace numbers?

If I have many of these in a text file;
<Vertex> 0 {
-0.597976 -6.85293 8.10038
<UV> { 0.898721 0.149503 }
<RGBA> { 0.92549 0.92549 0.92549 1 }
}
...
<Vertex> 1507 {
12 -5.3146 -0.000708352
<UV> { 5.7487 0.180395 }
<RGBA> { 0.815686 0.815686 0.815686 1 }
}
How can I read through the text file and add 25 to the first number in the second row? (-0.597976 in Vertex 0)
I have tried splitting the second line's text at each space with .split(' '), then using float() on the third element, and adding 25, but I don't know how to implicitly select the line in the text file.
Try to ignore the lines that start with "<", for example:
L=["<Vertex> 0 {",
"-0.597976 -6.85293 8.10038",
"<UV> { 0.898721 0.149503 }",
"<RGBA> { 0.92549 0.92549 0.92549 1 }"
]
for l in L:
if not l.startswith("<"):
print l.split(' ')[0]
Or if you read your data from a file:
f = open("test.txt", "r")
for line in f:
    line = line.strip().split(' ')
    try:
        print float(line[0]) + 25
    except:
        pass
f.close()
The hard way is to use the Python Lex/Yacc tools.
The hardest (did you expect "easy"?) way is to write a custom function that recognizes tokens (the tokens would be <Vertex>, numbers, braces, <UV> and <RGBA>; the token separators would be spaces).
I'm sorry, but what you're asking for is a mini-language, if you cannot guarantee that the entries respect the CRs and LFs.
Another ugly (and even harder!) way, since that mini-language has no recursion, is to use regex. But the regex solution would be just as long and ugly (trust me: a really long one).
Try using this library: Python Lex/Yacc, since what you need is to parse a language, and even though regex could work here, you'll end up with an ugly and unmaintainable one. You have to learn the basics of language parsing to use it. Have a look here.
If the vertices will always be on the line after <Vertex>, you can look for that as a marker, then read the next line. If you read the second line, .strip() leading and trailing whitespace, then .split() by the space character, you will have a list of your three vertices, like so (assuming you have read the line into a string variable line):
>>> line = line.strip()
>>> verticies = line.split(' ')
>>> verticies
['-0.597976', '-6.85293', '8.10038']
What now? Call float() on the first item in your list, then add 25 to the result.
The real challenge here is finding the <Vertex> marker and reading the subsequent line. This looks like a homework assignment, so I'll let you puzzle that out a bit first!
If your file is well-formatted, then you should be able to parse through it pretty easily. Assuming <Vertex> is always on a line preceding a line with just the three numbers, you could do this:
newFile = []
with open(fileName) as f:
    line = f.readline()
    while line:
        newFile.append(line)
        if '<Vertex>' in line:
            line = f.readline()
            entries = line.strip().split()
            entries[0] = str(25 + float(entries[0]))
            line = ' ' + ' '.join(entries) + '\n'
            newFile.append(line)
        line = f.readline()
with open(newFileName, 'w') as fileToWrite:
    fileToWrite.writelines(newFile)
This syntax looks like a Panda3d .egg file.
I suggest you use Panda's file load, modify, and save functions to work on the file safely; see https://www.panda3d.org/manual/index.php/Modifying_existing_geometry_data
Something like:
INPUT = "path/to/myfile.egg"

def processGeomNode(node):
    # something using modifyVertexData()
    pass

def main():
    model = loader.loadModel(INPUT)
    for nodePath in model.findAllMatches('**/+GeomNode').asList():
        processGeomNode(nodePath.node())

if __name__ == "__main__":
    main()
It is a Panda3D .egg file. The easiest and most reliable way to modify data in it is by using Panda3D's EggData API to parse the .egg file, modify the desired value through these structures, and write it out again, without loss of data.

Search/Replace/Delete Jekyll YAML Front Matter Category Tags

I've inherited a Jekyll website and I'm coming from a .NET world so it's a learning curve for me.
This Jekyll site takes forever to build and I think it is because there are literally thousands of category tags that require those pages to be removed. I'm able to get a list of all the categories and created a CSV that I'd like to loop through and figure out if a category tag is still needed. The structure of the CSV is:
old_tag,new_tag
Clearly I'd like to update the tags based on those (e.g. make all C#, C-Sharp, C # and C Sharp categories just C-Sharp). But, I'd also like to delete some where the old tag field exists and the new one is blank:
old_tag,new_tag
C#, C-Sharp
C Sharp, C-Sharp
Crazy,
C #, C-Sharp
Using Ruby or Python I'd like to figure out how to loop through over 4000 markdown files and use the CSV to conditionally update each one. The database person in me just can't think how this would work with flat files.
I'd recommend starting with a Hash, using it like a translation table. Hash lookups are very fast, and can organize your tags and their replacements nicely.
hash = {
# old_tag => new_tag
'C#' => 'C-Sharp',
'C Sharp' => 'C-Sharp',
'Crazy' => '',
'C #' => 'C-Sharp',
}
You can see there's a lot of redundancy in the values, which could be fixed by reversing the hash, which reduces it nicely:
hash = {
# new_tag => old_tag
'C-Sharp' => ['C#', 'C Sharp', 'C #'],
}
'Crazy' is an outlier, but we will deal with that.
Ruby's String#gsub has a nice but little-used feature: we can pass it a regular expression and a hash, and it'll replace all regex matches with the equivalent value in the hash. We can build that regex easily:
regex = /(?:#{ Regexp.union(hash.keys).source })/
=> /(?:C\-Sharp)/
Now, you're probably saying, "but wait, I have a lot more tags to find!", and, because of the way the hash is built, they're hidden in the values. To remedy that, we'll reverse the hash's keys and values, exploding the value arrays into their individual elements:
reversed_hash = Hash[hash.flat_map{ |k,v| v.map{ |i| [i,k] } }]
=> {
"C#" => "C-Sharp",
"C Sharp" => "C-Sharp",
"C #" => "C-Sharp",
}
Adding in 'Crazy' is easy, by merging a second hash of the "special cases":
special_cases = {
'Crazy' => ''
}
reversed_hash = Hash[hash.flat_map{ |k,v| v.map{ |i| [i,k] } }].merge(special_cases)
=> {
"C#" => "C-Sharp",
"C Sharp" => "C-Sharp",
"C #" => "C-Sharp",
"Crazy" => ""
}
Using that with the regex buildin' code:
regex = /(?:#{ Regexp.union(reversed_hash.keys).source })/
=> /(?:C\#|C\ Sharp|C\ \#|Crazy)/
That will find the tags using an auto-generated regex. If it needs to be case-insensitive, use:
regex = /(?:#{ Regexp.union(reversed_hash.keys).source })/i
Creating some text to test against:
text =<<EOT
This is "#C#"
This is "C Sharp"
This is "C #"
This is "Crazy"
EOT
=> "This is \"#C#\"\nThis is \"C Sharp\"\nThis is \"C #\"\nThis is \"Crazy\"\n"
And testing the gsub:
puts text.gsub(regex, reversed_hash)
Which outputs:
This is "#C-Sharp"
This is "#C-Sharp"
This is "#C-Sharp"
This is "#"
Now, I'm not a big fan of slurping big files into memory, because that doesn't scale well. Today's machines usually have many GB of memory, but I see files that still exceed the RAM in a machine. So, instead of using a File.read to load the file, then a single gsub to process it, I recommend using File.foreach. Using that changes the code.
Here's how I'd do it:
file_to_read = '/path/to/file/to/read'
File.open(file_to_read + '.new', 'w') do |fo|
  File.foreach(file_to_read) do |li|
    fo.puts li.gsub(regex, reversed_hash)
  end
end
File.rename(file_to_read, file_to_read + '.bak')
File.rename(file_to_read + '.new', file_to_read)
This will create a .bak version of each file processed, so if something goes wrong you have a fall-back, which is always a good practice.
Edit: I forgot about the CSV file:
You can read/create one easily with Ruby using the CSV module, however I'd go with a YAML file because it allows you to easily create your hash layout in a file that is easy to edit by hand, or generate from a file.
Edit: More about CSV, YAML and generating one from the other
Here's how to read the CSV and convert it into the recommended hash format:
require 'csv'
text = <<EOT
C#, C-Sharp
C Sharp, C-Sharp
Crazy,
C #, C-Sharp
EOT
hash = Hash.new{ |h,k| h[k] = [] }
special_cases = []
CSV.parse(text) do |k,v|
  (
    (v.nil? || v.strip.empty?) ? special_cases : hash[v.strip]
  ) << k.strip
end
Picking up from before:
reversed_hash = Hash[hash.flat_map{ |k,v| v.map{ |i| [i,k] } }].merge(Hash[special_cases.map { |k| [k, ''] }])
puts reversed_hash
# => {"C#"=>"C-Sharp", "C Sharp"=>"C-Sharp", "C #"=>"C-Sharp", "Crazy"=>""}
To convert the CSV file to something more editable and useful, use the above code to create hash and special_cases, then:
require 'yaml'
puts ({
'hash' => hash,
'special_cases' => special_cases
}).to_yaml
Which looks like:
---
hash:
C-Sharp:
- C#
- C Sharp
- ! 'C #'
special_cases:
- Crazy
The rest you can figure out from the YAML docs.
Here's one possible approach; not sure how well it will work for large amounts of data:
require "stringio"
require "csv"
class MarkdownTidy
def initialize(rules)
#csv = CSV.new(rules.is_a?(IO) ? rules : StringIO.new(rules))
#from_to = {}.tap do |hsh|
#csv.each do |from, to|
re = Regexp.new(Regexp.escape(from.strip))
hsh[re] = to.strip
end
end
end
def tidy(str)
cpy = str.dup
#from_to.each do |re, canonical|
cpy.gsub! re, canonical
end
cpy
end
end
csv = <<-TEXT
C#, C-Sharp
C Sharp, C-Sharp
Crazy,
C #, C-Sharp
TEXT
markdown = <<-TEXT
C# some text C # some text Crazy
C#, C Sharp
TEXT
mt = MarkdownTidy.new(csv)
[markdown].each do |str|
  puts mt.tidy(str)
end
The idea is that you would replace the loop at the very end with one that opens up the files, reads them and then saves them back to disk.
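Since the question allows Python as well, here is a rough Python equivalent of the same approach (a sketch only: tags.csv and the _posts/*.md glob are assumptions, every CSV row is expected to have exactly two fields, and this rewrites matches anywhere in each file, not just in the front matter):
import csv
import glob
import re

# Build the old_tag -> new_tag mapping; a blank new_tag means "delete".
with open('tags.csv') as f:
    mapping = {old.strip(): new.strip() for old, new in csv.reader(f)}

# Longest tags first, so 'C Sharp' wins over any shorter overlapping tag.
pattern = re.compile('|'.join(
    re.escape(k) for k in sorted(mapping, key=len, reverse=True)))

for path in glob.glob('_posts/*.md'):
    with open(path) as f:
        text = f.read()
    new_text = pattern.sub(lambda m: mapping[m.group(0)], text)
    if new_text != text:  # only rewrite files that actually changed
        with open(path, 'w') as f:
            f.write(new_text)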
