Tika python does not preserve the order of texts in pdf - python

I am using tika-python to extract text from a PDF. When a page contains multiple tables, the order of the text is not preserved: in my case, the table at the top of the page comes out at the end of the text extracted by Tika.
I tried the following custom config file, but it is not working. I have tried placing the statement <property name="sortByPosition" value="True"/> at various positions, but nothing has worked. I referred to this for the config.xml.
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
<!-- Default Parser for most things, except for 2 mime types, and never
use the Executable Parser -->
<parser class="org.apache.tika.parser.DefaultParser">
<mime-exclude>image/jpeg</mime-exclude>
<mime-exclude>application/pdf</mime-exclude>
<parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
<!-- property name="sortByPosition" value="True" -->
</parser>
<parser class="org.apache.tika.parser.EmptyParser">
<mime>application/pdf</mime>
<!-- here? -->
<property name="sortByPosition" value="True"/> <!-- this statement is for preserving the order -->
</parser>
</parsers>
</properties>
I use the following command to read the text:
from tika import parser
data = parser.from_file(file_path, xmlContent=True,
                        config_path='/path/to/tika_config.xml')
What am I doing wrong? Is there a way to change the config, or is preserving the order not possible?
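One thing that may be worth trying: the config above attaches the property to EmptyParser, whereas Tika's parser configuration usually passes options to a specific parser through a <params> block. The sketch below only builds such a config file with Python's standard library. The element layout (a <param name="sortByPosition" type="bool"> on org.apache.tika.parser.pdf.PDFParser) is an assumption based on Tika's parser configuration conventions and has not been checked against this PDF or Tika version; build_pdf_config is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_pdf_config(sort_by_position=True):
    """Return a Tika config (as a string) that configures PDFParser.

    Assumption: PDFParser accepts a boolean 'sortByPosition' param.
    """
    properties = ET.Element("properties")
    parsers = ET.SubElement(properties, "parsers")

    # Default parser for everything except PDFs.
    default = ET.SubElement(
        parsers, "parser", {"class": "org.apache.tika.parser.DefaultParser"}
    )
    ET.SubElement(
        default, "parser-exclude", {"class": "org.apache.tika.parser.pdf.PDFParser"}
    )

    # PDF parser with the option set through a <params> block.
    pdf = ET.SubElement(
        parsers, "parser", {"class": "org.apache.tika.parser.pdf.PDFParser"}
    )
    params = ET.SubElement(pdf, "params")
    param = ET.SubElement(params, "param", {"name": "sortByPosition", "type": "bool"})
    param.text = "true" if sort_by_position else "false"

    return ET.tostring(properties, encoding="unicode")


config_xml = build_pdf_config()
```

The resulting string could be written to tika_config.xml and its path passed as config_path, as in the call above.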

Related

How to replace xml lines using 'if statements' in python?

Hi, I'm new to XML files in general, but I am trying to replace specific lines in an XML file using 'if statements' in Python 3.6. I've been looking at suggestions to use ElementTree, but none of the posts online quite fit the problem I have, so here I am.
My file is as follows:
<?xml version="1.0" encoding="UTF-8"?>
-<StructureDefinition xmlns="http://hl7.org/fhir">
<url value="http://example.org/fhir/StructureDefinition/MyObservation"/>
<name value="MyObservation"/>
<status value="draft"/>
<fhirVersion value="3.0.1"/>
<kind value="resource"/>
<abstract value="false"/>
<type value="Observation"/>
<baseDefinition value="http://hl7.org/fhir/StructureDefinition/Observation"/>
<derivation value="constraint"/>
</StructureDefinition>
I want to replace
url value="http://example.org/fhir/StructureDefinition/MyObservation"/
to something like
url value="http://example.org/fhir/StructureDefinition/NewObservation"/
by using conditional statements - because these are repeated multiple times in other files.
I have tried looping through the XML file to find the exact string match (which succeeded), but I wasn't able to delete or replace the line (probably because this isn't a .txt file).
Any help is greatly appreciated!
Your sample file contains a stray "-" token on line 2 (just before <StructureDefinition>) that may be overlooked when copying and pasting in order to find a solution. It is removed in the input file below.
Input File
<?xml version="1.0" encoding="UTF-8"?>
<StructureDefinition xmlns="http://hl7.org/fhir">
<url value="http://example.org/fhir/StructureDefinition/MyObservation"/>
<name value="MyObservation"/>
<status value="draft"/>
<fhirVersion value="3.0.1"/>
<kind value="resource"/>
<abstract value="false"/>
<type value="Observation"/>
<baseDefinition value="http://hl7.org/fhir/StructureDefinition/Observation"/>
<derivation value="constraint"/>
</StructureDefinition>
Script
from xml.dom.minidom import parse  # use minidom for this task
dom = parse('june.xml')  # read in your file
search = "http://example.org/fhir/StructureDefinition/MyObservation"  # set search value
replace = "http://example.org/fhir/StructureDefinition/NewObservation"  # set replace value
res = dom.getElementsByTagName('url')  # collect all url tags
for element in res:
    if element.getAttribute('value') == search:  # in case of match
        element.setAttribute('value', replace)  # replace
with open('june_updated.xml', 'w') as f:
    f.write(dom.toxml())  # serialise the updated DOM and save it as a new XML file
Output file
<?xml version="1.0" ?><StructureDefinition xmlns="http://hl7.org/fhir">
<url value="http://example.org/fhir/StructureDefinition/NewObservation"/>
<name value="MyObservation"/>
<status value="draft"/>
<fhirVersion value="3.0.1"/>
<kind value="resource"/>
<abstract value="false"/>
<type value="Observation"/>
<baseDefinition value="http://hl7.org/fhir/StructureDefinition/Observation"/>
<derivation value="constraint"/>
</StructureDefinition>
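Since the question mentions ElementTree, here is a minimal sketch of the same conditional replacement with xml.etree.ElementTree instead of minidom. The helper name replace_url and the shortened sample are made up for illustration. Note that the elements live in the http://hl7.org/fhir namespace, so the tag has to be qualified when searching.

```python
import xml.etree.ElementTree as ET

FHIR_NS = "http://hl7.org/fhir"
# Serialise the FHIR namespace as the default namespace (no ns0: prefix).
ET.register_namespace("", FHIR_NS)


def replace_url(xml_text, search, replace):
    """Replace the value attribute of every <url> whose value equals search."""
    root = ET.fromstring(xml_text)
    for url in root.iter("{%s}url" % FHIR_NS):
        if url.get("value") == search:  # the "if statement" from the question
            url.set("value", replace)
    return ET.tostring(root, encoding="unicode")


# Shortened, hypothetical sample based on the input file above.
sample = (
    '<StructureDefinition xmlns="http://hl7.org/fhir">'
    '<url value="http://example.org/fhir/StructureDefinition/MyObservation"/>'
    '<name value="MyObservation"/>'
    '</StructureDefinition>'
)
updated = replace_url(
    sample,
    "http://example.org/fhir/StructureDefinition/MyObservation",
    "http://example.org/fhir/StructureDefinition/NewObservation",
)
```

Only the url element is touched; the name element keeps its MyObservation value because the comparison is made on the full attribute value of url elements only.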

How to extend an XSD schema to support SVG?

I am trying to extend the ISOSTS XSD schema to support SVG image tags.
I found an XSD schema for SVG and put it next to ISOSTS.xsd.
Now I am trying to extend ISOSTS.xsd:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:mml="http://www.w3.org/1998/Math/MathML"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:tbx="urn:iso:std:iso:30042:ed-1"
xmlns:xlink="http://www.w3.org/1999/xlink"
<!-- my line -->
xmlns:svg="http://www.w3.org/2000/svg"
elementFormDefault="qualified">
<xs:import namespace="http://www.w3.org/1998/Math/MathML"
schemaLocation="ncbi-mathml2/mathml2.xsd"/>
<xs:import namespace="http://www.w3.org/1999/xlink"
schemaLocation="xlink.xsd"/>
<!-- XSD import of namespace http://www.w3.org/2001/XMLSchema-instance suppressed (not necessary) -->
<xs:import namespace="http://www.w3.org/XML/1998/namespace"
schemaLocation="xml.xsd"/>
<xs:import namespace="urn:iso:std:iso:30042:ed-1"
schemaLocation="tbx.xsd"/>
<!-- my line -->
<xs:import namespace="http://www.w3.org/2000/svg"
schemaLocation="SVG.xsd"/>
....
<xs:element name="p">
<xs:complexType mixed="true">
<xs:choice minOccurs="0" maxOccurs="unbounded">
<!-- my line --> <xs:element ref="svg:svg"/>
<xs:element ref="email"/>
....
But I get an error when I try to load the schema:
from lxml.etree import parse, XMLSchema
schema_file = open(self._schema_filename)
schema_doc = parse(schema_file)
schema_file.close()
self._xmlschema = XMLSchema(schema_doc) # Error
Error message:
File "src/lxml/xmlschema.pxi", line 87, in lxml.etree.XMLSchema.__init__ (src/lxml/lxml.etree.c:197819)
lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}element', attribute 'ref': References from this schema to components in the namespace 'http://www.w3.org/2000/svg' are not allowed, since not indicated by an import statement., line 4664
What is wrong?
The message seems clear enough to me, I'm not sure which part of it you don't understand. Your schema document imports schema components for various namespaces (mathml, xlink, xml, etc) but it makes no attempt to import the schema for SVG, and the error message is telling you so.
I replicated your three modifications (declaring a namespace binding for the SVG namespace, importing the SVG namespace, and referring to the svg:svg element), but got no error from Xerces or Saxon EE.
So it seems to me that you've done everything right.
The error message suggests that your XSD validator is not picking up the import.
If I had to guess (and I suppose I have to, since while you've given a very concise statement of the problem, we don't have a reproducible error), your validator is looking at an interim version of the schema document in which the reference to svg:svg has been added to the content model of p, but the xs:import statement has not yet been added to the beginning of the schema document.
Possibly your Python bytecode is out of date and your Python needs to be recompiled? (Pure conjecture; I don't know how much schema information lxml generates at compile time and how much it generates at run time.)
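One way to test the guess that the loaded schema document lacks the import is to list the namespaces it actually declares via xs:import. This is only a diagnostic sketch using the standard library, not lxml's own resolution logic; imported_namespaces is a hypothetical helper and the sample is a cut-down stand-in for ISOSTS.xsd.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


def imported_namespaces(xsd_text):
    """Return the namespaces named by the top-level xs:import elements."""
    root = ET.fromstring(xsd_text)
    return {imp.get("namespace") for imp in root.findall("{%s}import" % XS)}


# Cut-down, hypothetical stand-in for the schema document in the question.
sample = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:svg="http://www.w3.org/2000/svg" elementFormDefault="qualified">
  <xs:import namespace="http://www.w3.org/1999/xlink" schemaLocation="xlink.xsd"/>
  <xs:import namespace="http://www.w3.org/2000/svg" schemaLocation="SVG.xsd"/>
</xs:schema>"""

namespaces = imported_namespaces(sample)
svg_imported = "http://www.w3.org/2000/svg" in namespaces
```

Running the same function on the text of the file that is actually passed to XMLSchema would show whether the SVG import is present in that copy.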
Problem solved by using the following XSD schema for SVG: https://github.com/dumistoklus/svg-xsd-schema

Python: appending new data to XML is overwriting existing data

I want to add an entire tag to an XML file. Below is my XML format.
<?xml version="1.0" encoding="UTF-8"?>
<ca st="true" name="XMLConfig">
<app>
<!-- I want to add this entire commented-out tag to the XML:
<ar ty="co" name="st">
<ly ty="pt">
<pt>value</pt>
</Layout>
</ar> -->
<roll name="roll" fN="file.log" fP="logs.gz">
<ly type="ptl">
<pt>value</pt>
</ly>
<po>
<!-- Comment /> -->
<si size="100 MB" />
<!-- Comment /> -->
</po>
<de fI="max" max="10"/>
</roll>
</app>
As shown in the file above, I want to add this tag to the file:
<ar ty="co" name="st">
<ly ty="pt">
<pt>value</pt>
</Layout>
</ar>
This is what I have so far:
for appenders in tree.xpath('//Appenders'):
    if appenders.getchildren():
        appenders.remove(appenders.getchildren()[0])
    appenders.insert(0, appenders.getparent().append(etree.fromstring('<ar ty="co" name="st"> <ly ty="pt"><pt>value</pt></Layout></ar>')))
This removes all the other content after the new content.
Any help will be appreciated!
In my opinion the first way you did it is much better. You just made some mistakes in your insert line; it should be this:
appenders.insert(0, etree.fromstring('<ar ty="co" name="st"> <ly ty="pt"><pt>value</pt></ly></ar>'))
I'm surprised it didn't throw an error for you, because append() returns None, so your insert line is basically this:
appenders.insert(0, None)
I also noticed two things that you do in all of your questions:
You leave out some line(s) of your XML file. (Why?)
You shorten the tag names in your XML but keep the long versions in the code, which is annoying because the person who wants to answer has to change the code again to see whether it works.
I got it working!
for appenders in tree.xpath('//app'):
    if appenders.tag == 'app':
        appenders.insert(0, etree.SubElement(appenders, 'ar', ty="Co", name="st"))
for appender in tree.xpath('//ar'):
    appender.insert(0, etree.SubElement(appender, 'ly', ty="pt"))
for layout in tree.xpath('//ly'):
    layout.insert(0, etree.SubElement(layout, 'pt'))
for pattern in tree.xpath('//pt'):
    pattern.text = 'value'
tree.write(r'C:\value.xml', xml_declaration=True, encoding='UTF-8')
If anyone has a better way to do this, please let me know so I can improve on it.
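A possibly simpler variant is to parse the whole new tag from a string once and insert it as the first child, so the existing children are left alone. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made so it runs without extra packages; insert() behaves the same way for this purpose), and the document is a shortened, hypothetical version of the one in the question.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the config file from the question.
doc = ET.fromstring(
    '<ca st="true" name="XMLConfig">'
    '<app><roll name="roll" fN="file.log" fP="logs.gz"/></app>'
    '</ca>'
)

# Build the complete new element in one step (note the matching </ly>).
new_ar = ET.fromstring(
    '<ar ty="co" name="st"><ly ty="pt"><pt>value</pt></ly></ar>'
)

app = doc.find("app")
app.insert(0, new_ar)  # becomes the first child; <roll> stays in place

result = ET.tostring(doc, encoding="unicode")
```

Because nothing is removed and no XPath loop touches the existing ly or pt elements, the roll block is written back unchanged.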

Blank XML Namespace processing With Python

I am trying to parse an XML file using Python. Example XML snippet:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<raml xmlns="raml21.xsd" version="2.1">
<series xmlns="" scope="USA" name="Arizona">
<header>
<log action="created"/>
</header>
<x_ns color="Blue">
<p name="timeZone">(GMT-10)</p>
</x_ns>
<x_ns color="Red">
<p name="AvgHeight">175</p>
</x_ns>
<x_ns color="black">
<p name="AvgWeight">235</p>
</x_ns>
The problem is that the namespace keeps changing, so as an alternative I tried to read the xmlns string first and then create a dictionary of namespaces using the code below:
root = raw_xml.getroot()
namespace_temp1=root.tag.split("}")
namespace_temp2=namespace_temp1[0].strip('{')
namespaces_auto={}
tag_name =["x","y","z","w","v"]
ns_name=[namespace_temp2,namespace_temp2,namespace_temp2,namespace_temp2,namespace_temp2]
namespace_temp3=zip(tag_name,ns_name)
for tag,ns in namespace_temp3:
    namespaces_auto[tag]=ns
namespaces=namespaces_auto
To access a particular tag with a namespace I am using the following code:
for data in raw_xml.findall('x:x_ns', namespaces):
    ...  # process each match
This pretty much solves the problem, but it gets stuck when a child node has a blank xmlns, as seen in the series tag (xmlns=""). I am not sure how to incorporate a check for this condition into the code.
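For context, xmlns="" resets the default namespace, so series and everything below it are in no namespace at all, and a prefixed lookup such as x:x_ns cannot match them. One possible workaround, sketched below with the standard library, is to compare local names and ignore whatever namespace (if any) a tag carries. The helper local_name and the trimmed sample are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical version of the snippet from the question.
xml_text = """<raml xmlns="raml21.xsd" version="2.1">
  <series xmlns="" scope="USA" name="Arizona">
    <x_ns color="Blue"><p name="timeZone">(GMT-10)</p></x_ns>
    <x_ns color="Red"><p name="AvgHeight">175</p></x_ns>
  </series>
</raml>"""


def local_name(tag):
    """Strip a leading '{namespace}' part from a tag, if there is one."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(xml_text)
values = {}
for element in root.iter():
    if local_name(element.tag) == "x_ns":
        for p in element:
            if local_name(p.tag) == "p":
                values[p.get("name")] = p.text
```

Here root.tag still carries the raml21.xsd namespace while series and its descendants do not, and the loop finds x_ns in either case.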

Remove xmlns information from generated file?

I am using ElementTree to parse an XML file, edit the contents and write to a new XML file. I have this all working apart from one issue: when I generate the file, there are a lot of extra lines containing namespace information. Here are some snippets of code:
import xml.etree.ElementTree as ET
ET.register_namespace("", "http://clish.sourceforge.net/XMLSchema")
tree = ET.parse('ethernet.xml')
root = tree.getroot()
commands = root.findall('{http://clish.sourceforge.net/XMLSchema}'
                        'VIEW/{http://clish.sourceforge.net/XMLSchema}COMMAND')
for command in commands:
    all1.append(list(command.iter()))
And a sample of the output file, with the unwanted xmlns="http://clish.sourceforge.net/XMLSchema" attribute:
<COMMAND xmlns="http://clish.sourceforge.net/XMLSchema" help="Interface specific description" name="description">
<PARAM help="Description (must be in double-quotes)" name="description" ptype="LINE" />
<CONFIG />
</COMMAND>
Can I remove this with ElementTree, and if so, how? Or will I have to use a regex (I am writing a string to the file)?
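One approach that avoids a regex, sketched here under assumptions, is to strip the namespace part from each tag before serialising the element. ElementTree writes an xmlns declaration only for tags that are still namespace-qualified, so unqualified tags come out without it. The root name CLISH_MODULE and the helper strip_namespace are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://clish.sourceforge.net/XMLSchema"

# Hypothetical miniature of ethernet.xml.
xml_text = (
    '<CLISH_MODULE xmlns="%s"><VIEW name="v">'
    '<COMMAND name="description" help="Interface specific description">'
    '<PARAM name="description" ptype="LINE"/><CONFIG/>'
    '</COMMAND></VIEW></CLISH_MODULE>' % NS
)
root = ET.fromstring(xml_text)
command = root.find("{%s}VIEW/{%s}COMMAND" % (NS, NS))


def strip_namespace(element):
    """Remove the '{namespace}' prefix from element and all descendants."""
    for el in element.iter():
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
    return element


snippet = ET.tostring(strip_namespace(command), encoding="unicode")
```

This modifies the tree in place, so it is best done on a copy or as the last step before writing, since later namespace-qualified findall calls will no longer match the stripped elements.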
