Simplify SVGs by applying transform - reduce size - python

I quite often have SVGs with structures like this:
<svg:g
transform="translate(-251.5,36.5)"
id="g12578"
style="fill:#ffff00;fill-opacity:1">
<svg:rect
width="12"
height="12"
x="288"
y="35.999958"
id="rect12580"
style="fill:#ffff00;fill-opacity:1;stroke:#000000;stroke-width:1" />
</svg:g>
I would like to apply the translate directly to the coordinates and delete the transform attribute:
<svg:g
id="g12578"
style="fill:#ffff00;fill-opacity:1">
<svg:rect
width="12"
height="12"
x="36.5"
y="69.499958"
id="rect12580"
style="fill:#ffff00;fill-opacity:1;stroke:#000000;stroke-width:1" />
</svg:g>
Do you know a script / program for simplifying SVGs? Or a Python snippet for parsing SVGs?
This script works for my special case, but I would like one that always works:
#http://epydoc.sourceforge.net/stdlib/xml.dom.minidom.Element-class.html
from xml.dom.minidom import parse, parseString
import re

f = open('/home/moose/mathe/svg/Solitaire-Board.svg', 'r')
xmldoc = parse(f)
p = re.compile(r'translate\(([-\d.]+),([-\d.]+)\)', re.IGNORECASE)
for node in xmldoc.getElementsByTagName('svg:g'):
    transform_dict = node.attributes["transform"]
    m = p.match(transform_dict.value)
    if m:
        x = float(m.group(1))
        y = float(m.group(2))
        child_rectangles = node.getElementsByTagName('svg:rect')
        for rectangle in child_rectangles:
            x_dict = rectangle.attributes["x"]
            y_dict = rectangle.attributes["y"]
            new_x = float(x_dict.value) + x
            new_y = float(y_dict.value) + y
            rectangle.setAttribute('x', str(new_x))
            rectangle.setAttribute('y', str(new_y))
        node.removeAttribute('transform')
print(xmldoc.toxml())
I think the size of the SVG could be reduced considerably, without loss of quality, if the transform attribute could be removed.
It would be great if the tool could also reduce coordinate precision, delete unnecessary regions, and group and style elements sensibly.

I'd recommend using lxml. It's extremely fast and has a lot of nice features. You can parse your example if you properly declare the svg namespace prefix. You can do that pretty easily:
>>> svg = '<svg xmlns:svg="http://www.w3.org/2000/svg">' + example_svg + '</svg>'
Now you can parse it with lxml.etree (or xml.etree.ElementTree):
>>> doc = etree.fromstring(svg)
If you use lxml you can take advantage of XPath:
>>> ns = {'svg': 'http://www.w3.org/2000/svg'}
>>> doc.xpath('//svg:g/@transform', namespaces=ns)
<<< ['translate(-251.5,36.5)']
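From there, a minimal sketch of the whole transformation might look like this (building on the doc parsed above; this is a sketch rather than a tested tool, and it only handles plain translate(x,y) transforms on groups containing rects):

import re
from lxml import etree

ns = {'svg': 'http://www.w3.org/2000/svg'}
translate_re = re.compile(r'translate\(([-\d.]+),([-\d.]+)\)')

for g in doc.xpath('//svg:g[@transform]', namespaces=ns):
    m = translate_re.match(g.get('transform'))
    if m:
        dx, dy = float(m.group(1)), float(m.group(2))
        for rect in g.xpath('.//svg:rect', namespaces=ns):
            # bake the translation into the coordinates
            rect.set('x', str(float(rect.get('x')) + dx))
            rect.set('y', str(float(rect.get('y')) + dy))
        del g.attrib['transform']

print(etree.tostring(doc, pretty_print=True).decode())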

You might want to have a look at scour:
Scour aims to reduce the size of SVG files as much as possible, while retaining the original rendering of the files. It does not do so flawlessly for all files, therefore users are encouraged not to overwrite their original files.
Optimizations performed by Scour on SVG files include: removing empty elements, removing metadata elements, removing unused id= attribute values, removing unrenderable elements, trimming coordinates to a certain number of significant places, and removing vector editor metadata.
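If you just want a smaller file, the basic command-line invocation is as simple as this (assuming scour has been installed, e.g. via pip; the file names are placeholders):

scour -i input.svg -o output.svg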

1) It can be parsed and edited with regular expressions; you can easily extract the translate values and the x and y coordinates.
2) If you have checked minidom and are sure that your only problem is the ':' in the tag names, just replace the ':', edit what you need, and then change it back to ':'.
3) You can use this question: Is there any scripting SVG editor? to learn how to parse this XML format better.

Related

Python lxml find text efficiently

Using Python lxml I want to test whether an XML document contains EXPERIMENT_TYPE, and if it exists, extract the <VALUE>.
Example:
<EXPERIMENT_SET>
<EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
<TITLE>WGBS (whole genome bisulfite sequencing) analysis of SomeSampleA (library: SomeLibraryA).</TITLE>
<STUDY_REF accession="SomeStudy" refcenter="BCCA"/>
<EXPERIMENT_ATTRIBUTES>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://purl.obolibrary.org/obo/OBI_0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_CURIE</TAG><VALUE>obi:0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
</EXPERIMENT_ATTRIBUTES>
</EXPERIMENT>
</EXPERIMENT_SET>
Is there a faster way than iterating through all elements?
all = root.findall('EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG')
for e in all:
    if e.text == 'EXPERIMENT_TYPE':
        print("Found")
That attempt is also getting messy when I want to extract the <VALUE>.
Preferably you do this with XPath, which is bound to be incredibly fast. My suggestion (tested and working) is below. It will return a (possibly empty) list of VALUE elements from which you can extract the text.
PS: do not use "special" words such as all as variable names; that shadows the Python built-in, which is bad practice and may lead to unexpected bugs.
import lxml.etree as ET
from lxml.etree import Element
from typing import List
xml_str = """
<EXPERIMENT_SET>
<EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0">
<TITLE>WGBS (whole genome bisulfite sequencing) analysis of SomeSampleA (library: SomeLibraryA).</TITLE>
<STUDY_REF accession="SomeStudy" refcenter="BCCA"/>
<EXPERIMENT_ATTRIBUTES>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://purl.obolibrary.org/obo/OBI_0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_CURIE</TAG><VALUE>obi:0001863</VALUE></EXPERIMENT_ATTRIBUTE>
<EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE>
</EXPERIMENT_ATTRIBUTES>
</EXPERIMENT>
</EXPERIMENT_SET>
"""
tree = ET.ElementTree(ET.fromstring(xml_str))
vals: List[Element] = tree.xpath(".//EXPERIMENT_ATTRIBUTE/TAG[text()='EXPERIMENT_TYPE']/following-sibling::VALUE")
print(vals[0].text)
# DNA Methylation
An alternative XPath declaration was provided below by Michael Kay, which is identical to the answer by Martin Honnen.
.//EXPERIMENT_ATTRIBUTE[TAG='EXPERIMENT_TYPE']/VALUE
In terms of XPath it seems you simply want to select the VALUE element based on the TAG element with e.g. /EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE.
I think with Python and lxml people often use a text node selection with e.g. /EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE/text() as then the xpath function returns that as a Python string.
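As a small sketch of that variant (assuming the tree object from the answer above), the text() selection returns plain Python strings:

vals = tree.xpath(".//EXPERIMENT_ATTRIBUTE[TAG = 'EXPERIMENT_TYPE']/VALUE/text()")
print(vals)  # ['DNA Methylation']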
Using findall is the natural way to do it. I suggest the following code to find the VALUEs:
from lxml import etree

root = etree.parse('toto.xml').getroot()
all = root.findall('EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG')
for e in all:
    if e.text == 'EXPERIMENT_TYPE':
        v = e.getparent().find('VALUE')
        if v is not None:
            print(f'Found val="{v.text}"')
This outputs:
Found val="DNA Methylation"

How can I increase the speed of iterating over an array during the run time of my script?

My script cleans arrays of unwanted strings like "##$!" and similar junk.
The script works as intended, but it is extremely slow when the Excel file has many rows.
I tried using numpy to see if it could speed things up, but I'm not too familiar with it, so I might be using it incorrectly.
xls = pd.ExcelFile(path)
df = xls.parse("Sheet2")
TeleNum = np.array(df['telephone'].values)

def replace(orignstr):  # removes the unwanted strings from numbers
    for elem in badstr:
        if elem in orignstr:
            orignstr = orignstr.replace(elem, '')
    return orignstr

for UncleanNum in tqdm(TeleNum):
    newnum = replace(str(UncleanNum))  # calling replace function
    df['telephone'] = df['telephone'].replace(UncleanNum, newnum)  # store string back in data frame
I also tried removing the function to see if that would help, placing it all as one block of code, but the speed remained the same.
for UncleanNum in tqdm(TeleNum):
    orignstr = str(UncleanNum)
    for elem in badstr:
        if elem in orignstr:
            orignstr = orignstr.replace(elem, '')
    print(orignstr)
    df['telephone'] = df['telephone'].replace(UncleanNum, orignstr)
TeleNum = np.array(df['telephone'].values)
The current speed of the script on an Excel file of 200,000 rows is around 70 it/s, and it takes around an hour to finish, which is not great since this is just one function of many.
I'm not too advanced in Python; I'm just learning as I script, so any pointers would be appreciated.
Edit:
Most of the array elements I'm dealing with are numbers, but some have strings in them. I'm trying to remove all non-digit characters from each array element.
Ex.
FD3459002912
*345*9002912$
If you are trying to clear everything that isn't a digit from the strings, you can use re.sub directly like this:
import re

string = "FD3459002912"
regex_result = re.sub(r"\D", "", string)  # \D matches any non-digit character
print(regex_result)  # 3459002912
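Since the data lives in a pandas column anyway, the same idea can be applied to the whole column at once with pandas' vectorized string methods, avoiding the per-row Python loop entirely. A sketch, assuming the df from the question:

# cast to str first so purely numeric cells don't break the .str accessor
df['telephone'] = df['telephone'].astype(str).str.replace(r'\D', '', regex=True)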

How to save a list of lxml.etree._ElementTree to file

I'm running into an annoying problem with the lxml library and can't figure out how to get around it.
I have a list of lxml.etree._ElementTree trees and a list of lxml.html.HtmlElement objects which belong to those trees, with the corresponding paths stored in a list called paths:
element_found = [True if len(tree.xpath(path)) > 0 else False for tree,path in zip(trees,paths)]
print(element_found.count(False)) # == 0
The problem arises when I try to save the paths and trees so that I can later restore this state:
trees_to_save = [{'tree': lxml.etree.tostring(tree, pretty_print=True)} for tree in trees]
t2sdf = pd.DataFrame(trees_to_save)
t2sdf.to_csv('trees.csv')
EncodeForamt = lxml.html.HTMLParser(encoding='utf-8')
trees_from_file = pd.read_csv('trees.csv')
trees_from_file['tree'] = trees_from_file['tree'].apply(lambda x: etree.HTML(literal_eval(x),EncodeForamt).getroottree())
Then the same test is run:
element_found = [True if len(tree.xpath(path)) > 0 else False for tree,path in zip(trees_from_file,paths)]
print(element_found.count(False)) # == 6 (out of 12k)
In general I'm trying to get all paths to be found again; there is clearly a problem with either the to/from-string methods or with how I'm saving the trees. I've tried various methods in the lxml library, such as tree.write instead of tostring, just .encode('utf-8') instead of literal_eval, with and without pretty_print, and etree.fromstring(), all with the same result.
Worryingly, this also throws XML syntax errors:
trees = [etree.fromstring(etree.tostring(t)) for t in trees]
I'm at a bit of a loss as to how to get these trees properly saved...
OK, after a while of trying everything I could find, I figured out how to get this done; I needed to use parse instead of tostring:
trees_to_save = [{'tree': lxml.etree.tostring(tree,encoding='utf-8',method='html')} for tree in trees]
t2sdf = pd.DataFrame(trees_to_save)
t2sdf.to_csv('location_trees.csv')
trees_from_file = pd.read_csv('location_trees.csv')
EncodeForamt = lxml.etree.HTMLParser(encoding='utf-8')
trees_from_file['tree'] = trees_from_file['tree'].apply(lambda x: lxml.etree.parse(x,parser=EncodeForamt))
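If the CSV detour is not a hard requirement, a simpler sketch is to let lxml do its own file I/O: write each tree to its own file and parse the files back with the same parser (the file names here are hypothetical):

from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
# serialize each tree with lxml's own writer
for i, tree in enumerate(trees):
    tree.write('tree_{}.html'.format(i), method='html', encoding='utf-8')
# read them back with the matching parser
trees_from_file = [etree.parse('tree_{}.html'.format(i), parser) for i in range(len(trees))]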

performing fuzzy differences between 2 files

When doing some retrocomputing stuff, I sometimes have to compare 2 MC68000 disassembled executables of the same game.
Games are published in different languages (English, French, ...) or have slight modifications / revisions.
The code is roughly the same, but the global labels are shifted because of earlier code changes (or because data is wrongly interpreted as branches, which generates more or fewer fake labels depending on the data). So, for the first file, I can have:
LAB_0012:
MOVE #0,D0
MOVE #2,D2
LAB_0013:
RTS
and for the second file:
LAB_0015:
MOVE #0,D0
SUB #3,D1
MOVE #2,D2
LAB_0016:
RTS
If I perform a diff on both files, the labels scramble/pollute the desired result, which I'd like to be just the SUB #3,D1 added in file 2.
So I performed a pre-processing pass using a regex to change all labels to LAB_XXXX, like this:
import re

r = re.compile(r"LAB_\w+")  # label pattern; defined elsewhere in my script

def readlines(filepath):
    with open(filepath) as f:
        lines = list(f)
    return [x.rstrip() for x in lines], [r.sub("LAB_XXXX", l).partition(";")[0] for l in lines]
I then use difflib to print the diffs, and it kind of works, but of course it doesn't map back to the original label values. So I keep the original data and parse difflib's output to try to print the original data instead, but that's clumsy and doesn't work very well.
lines1, filtered_lines1 = readlines(file1)
lines2, filtered_lines2 = readlines(file2)
for line in difflib.unified_diff(filtered_lines1, filtered_lines2, fromfile=file1, tofile=file2, lineterm=''):
    m = re.match(r"##..(\d+),(\d+).*(\d+),(\d+)", line)
    if m:
        start, end, start2, end2 = [int(x) for x in m.groups()]
        print(line)
        for i in range(start, start + end):
            print("{} <=> {}".format(lines1[i], lines2[i - start2 + start]))
I've checked this answer Fuzzy file diff but that doesn't cut it for me: pre-processing both files is already what I'm doing.
I'd like to instruct difflib (or any other diff tool) to ignore this LAB_.... regex when comparing (a bit like you can compare while ignoring blanks, or case-insensitively), so the original file content is printed (either side would do) when showing the diffs. For my example above I'd like:
LAB_0015:
MOVE #0,D0
##added:235,1## <== this is just an example: 1 line added at line 235
> SUB #3,D1
MOVE #2,D2
LAB_0016:
RTS
I'd prefer to keep it within python, but if I have to perform system calls for external commands that's okay too.
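One way to keep the matching on the normalized lines while still printing the originals is to drive difflib.SequenceMatcher directly and use its opcodes to index back into the raw lines. A rough sketch of that idea (the label regex is an assumption, matching the LAB_ names above):

import difflib
import re

label_re = re.compile(r"LAB_\w+")  # assumed label pattern

def normalize(line):
    # same pre-processing as above: mask labels, drop ';' comments
    return label_re.sub("LAB_XXXX", line).partition(";")[0].rstrip()

def fuzzy_diff(lines1, lines2):
    norm1 = [normalize(l) for l in lines1]
    norm2 = [normalize(l) for l in lines2]
    sm = difflib.SequenceMatcher(None, norm1, norm2, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            continue  # matching blocks need no output
        print("##{}:{},{}##".format(tag, i1, i2 - i1))
        for line in lines1[i1:i2]:  # original, unmasked lines
            print("< " + line)
        for line in lines2[j1:j2]:
            print("> " + line)

Because the matching runs on norm1/norm2 but the printing indexes lines1/lines2, the output keeps the original label values.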

Python: Joining and writing (XML.etrees) trees stored in a list

I'm looping over some XML files and producing trees that I would like to store in a defaultdict(list). With each loop, the next child found is stored in a separate part of the dictionary.
d = defaultdict(list)
counter = 0
for child in root.findall(something):
    tree = ET.ElementTree(something)
    d[int(x)].append(tree)
    counter += 1
So repeating this for several files results in nicely indexed results: the set of trees that were in position 1 across the parsed files, and so on. The question is, how do I then join all of d and write the trees (as one cumulative tree) to a file?
I can loop through the dict to get each tree:
for x in d:
    for y in d[x]:
        print(y)
This gives a complete list of trees that were in my dict. Now, how do I produce one massive tree from this?
Sample input file 1
Sample input file 2
Required results from 1&2
Given the apparent difficulty in doing this, I'm happy to accept more general answers that show how I can otherwise get the result I am looking for from two or more files.
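For what it's worth, one straightforward sketch is to hang every stored tree's root under a new synthetic root element and write that single tree out; the 'merged' wrapper tag here is a placeholder, not something from the sample files:

import xml.etree.ElementTree as ET

merged_root = ET.Element('merged')  # hypothetical wrapper element
for key in sorted(d):  # keep the index order stable
    for tree in d[key]:
        merged_root.append(tree.getroot())
ET.ElementTree(merged_root).write('merged.xml', encoding='utf-8', xml_declaration=True)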
Use Spyne:
from spyne.model.primitive import *
from spyne.model.complex import *

class GpsInfo(ComplexModel):
    UTC = DateTime
    Latitude = Double
    Longitude = Double
    DopplerTime = Double
    Quality = Unicode
    HDOP = Unicode
    Altitude = Double
    Speed = Double
    Heading = Double
    Estimated = Boolean

class Header(ComplexModel):
    Name = Unicode
    Time = DateTime
    SeqNo = Integer

class CTrailData(ComplexModel):
    index = UnsignedInteger
    gpsInfo = GpsInfo
    Header = Header

class CTrail(ComplexModel):
    LastError = AnyXml
    MaxTrial = Integer
    Trail = Array(CTrailData)

from lxml import etree
from spyne.util.xml import *

file_1 = get_xml_as_object(etree.fromstring(open('file1').read()), CTrail)
file_2 = get_xml_as_object(etree.fromstring(open('file2').read()), CTrail)
file_1.Trail.extend(file_2.Trail)
file_1.Trail.sort(key=lambda x: x.index)
elt = get_object_as_xml(file_1, no_namespace=True)
print(etree.tostring(elt, pretty_print=True))
While doing this, Spyne also converts the data fields from strings to their native Python types, so it'll be much easier for you to work with the data from this XML document.
Also, if you don't mind using the latest version from git, you can do e.g.:
class GpsInfo(ComplexModel):
    # (...)
    doppler_time = Double(sub_name="DopplerTime")
    # (...)
so that you can get data from the CamelCased tags without having to violate PEP8.
Use lxml.objectify:
from lxml import etree, objectify

obj_1 = objectify.fromstring(open('file1').read())
obj_2 = objectify.fromstring(open('file2').read())
obj_1.Trail.CTrailData.extend(obj_2.Trail.CTrailData)
# .sort() won't work as objectify's lists are not regular python lists.
obj_1.Trail.CTrailData = sorted(obj_1.Trail.CTrailData, key=lambda x: x.index)
print(etree.tostring(obj_1, pretty_print=True))
It doesn't do the additional conversion work that the Spyne variant does, but for your use case, that might be enough.
