Python or PETL Parsing XML

I have been playing with PETL to see if I can extract multiple XML files and combine them into one.
I have no control over the structure of the XML files. Here are the variations I am seeing, and which are giving me trouble.
XML File 1 Example:
<?xml version="1.0" encoding="utf-8"?>
<Export>
<Info>
<Name>John Doe</Name>
<Date>01/01/2021</Date>
</Info>
<App>
<Description></Description>
<Type>Two</Type>
<Details>
<DetailOne>1</DetailOne>
<DetailTwo>2</DetailTwo>
</Details>
<Details>
<DetailOne>10</DetailOne>
<DetailTwo>11</DetailTwo>
</Details>
</App>
</Export>
XML File 2 Example:
<?xml version="1.0" encoding="utf-8"?>
<Export>
<Info>
<Name></Name>
<Date>01/02/2021</Date>
</Info>
<App>
<Description>Sample description here.</Description>
<Type>One</Type>
<Details>
<DetailOne>1</DetailOne>
<DetailTwo>2</DetailTwo>
<DetailOne>3</DetailOne>
<DetailTwo>4</DetailTwo>
</Details>
<Details>
<DetailOne>10</DetailOne>
<DetailTwo>11</DetailTwo>
</Details>
</App>
</Export>
My Python code scans the subfolder xmlfiles and then uses PETL to parse each file from there. Given the structure of the documents, I am loading three tables so far:
1. one to hold the Info name and date
2. one to hold the description and type
3. one to collect the details
import petl as etl
import os
from lxml import etree

for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
    if filename.endswith('.xml'):
        # Get the Info children
        table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })
        # Get the App children
        table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })
        # Get the App Details children
        table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App/Details', {
            'DetailOne': 'DetailOne',
            'DetailTwo': 'DetailTwo'
        })
        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)
        print('Results:\n', result)
I concat the three tables because I want the Info and App data on each line with each detail. This works until I get an XML file that has multiples of the DetailOne and DetailTwo elements.
What I am getting as results is:
Results:
+------------+----------+-------------+------+-----------+-----------+----------+
| Date       | Name     | Description | Type | DetailOne | DetailTwo | FileName |
+============+==========+=============+======+===========+===========+==========+
| 01/01/2021 | John Doe | None        | Two  | 1         | 2         | one.xml  |
+------------+----------+-------------+------+-----------+-----------+----------+
| 01/01/2021 | John Doe | None        | Two  | 10        | 11        | one.xml  |
+------------+----------+-------------+------+-----------+-----------+----------+
Results:
+------------+------+--------------------------+------+------------+------------+----------+
| Date       | Name | Description              | Type | DetailOne  | DetailTwo  | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One  | ('1', '3') | ('2', '4') | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 10         | 11         | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
The second file showing DetailOne being ('1','3') and DetailTwo being ('2', '4') is not what I want.
What I want is:
+------------+------+--------------------------+------+------------+------------+----------+
| Date       | Name | Description              | Type | DetailOne  | DetailTwo  | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One  | 1          | 2          | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 3          | 4          | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 10         | 11         | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
I believe XPath may be the way to go but after researching:
https://petl.readthedocs.io/en/stable/io.html#xml-files - doesn't go in depth on lxml and petl
some light reading here:
https://www.w3schools.com/xml/xpath_syntax.asp
some more reading here:
https://lxml.de/tutorial.html
Any assistance on this is appreciated!

First, thanks for taking the time to write a good question. I'm happy to spend the time answering it.
I've never used PETL, but I did scan the docs for XML processing. I think your main problem is that the <Details> tag sometimes contains one pair of tags, and sometimes multiple pairs. If only there were a way to extract a flat list of the <DetailOne> and <DetailTwo> tag values, without the enclosing <Details> tags getting in the way...
Fortunately there is. I used https://www.webtoolkitonline.com/xml-xpath-tester.html and the XPath expression //Details/DetailOne returns the list 1,3,10 when applied to your example XML.
So I suspect that something like this should work:
import petl as etl
import os
from lxml import etree

for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
    if filename.endswith('.xml'):
        # Get the Info children
        table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })
        # Get the App children
        table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })
        # Get the App Details children
        table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), '/App', {
            'DetailOne': '//DetailOne',
            'DetailTwo': '//DetailTwo'
        })
        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)
        print('Results:\n', result)
The leading // may be redundant. It is XPath syntax for 'at any level in the document'. I don't know how PETL processes the XPath so I'm trying to play safe. I agree btw - the documentation is rather light on details.
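If PETL still hands you tuples, you can also check what the XPath sees, and do the pairing yourself, with the standard library alone. A minimal sketch using xml.etree.ElementTree instead of lxml (the zip pairing assumes DetailOne and DetailTwo always appear in matching order, as they do in your samples):

```python
import xml.etree.ElementTree as ET

# Trimmed version of the second example file
xml_text = """<Export>
  <App>
    <Details>
      <DetailOne>1</DetailOne><DetailTwo>2</DetailTwo>
      <DetailOne>3</DetailOne><DetailTwo>4</DetailTwo>
    </Details>
    <Details>
      <DetailOne>10</DetailOne><DetailTwo>11</DetailTwo>
    </Details>
  </App>
</Export>"""

root = ET.fromstring(xml_text)
# Flat, document-order lists of every DetailOne/DetailTwo value,
# no matter how many pairs each <Details> holds
ones = [e.text for e in root.iter('DetailOne')]
twos = [e.text for e in root.iter('DetailTwo')]
rows = [('DetailOne', 'DetailTwo')] + list(zip(ones, twos))
# rows -> [('DetailOne', 'DetailTwo'), ('1', '2'), ('3', '4'), ('10', '11')]
```

A header-plus-rows list like rows can then be wrapped with etl.wrap() and cross-joined exactly as before.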

Related

Create Rows and Tables using BeautifulSoup python with xml to json convertion

Currently I'm creating a parser script that can convert XML to JSON; my plan is to modify the script so it creates rows and columns when I convert to a CSV file.
At the moment my script only creates new lines.
My Script
import json
from bs4 import BeautifulSoup

xml_parser = BeautifulSoup(open('SAMPLE.xml'), 'xml')
DESCRIPTION = xml_parser.DESCRIPTION
NAME = xml_parser.NAME
LOCATION = xml_parser.LOCATION
STATUS = xml_parser.STATUS
data = {
    'DESCRIPTION': DESCRIPTION.text,
    'NAME': NAME.text,
    'LOCATION': LOCATION.text,
    'STATUS': STATUS.text,
}
print(json.dumps(data).replace(",", "\n"))
Output
DESCRIPTION: MAIN FLOOR
NAME: FORT-0232
LOCATION: MIDDLE
STATUS: ACTIVE
Plan Output to be
DESCRIPTION | NAME      | LOCATION | STATUS |
MAIN FLOOR  | FORT-0232 | MIDDLE   | ACTIVE |
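For the CSV side of that plan, the standard csv module can produce exactly this row-and-column layout from the dict the script already builds. A sketch with the sample values hard-coded; csv.DictWriter writes the header from the dict keys and the data row from the values:

```python
import csv
import io

# Same dict shape the parser script builds, with the sample values
data = {
    'DESCRIPTION': 'MAIN FLOOR',
    'NAME': 'FORT-0232',
    'LOCATION': 'MIDDLE',
    'STATUS': 'ACTIVE',
}

buf = io.StringIO()  # swap in open('out.csv', 'w', newline='') for a real file
writer = csv.DictWriter(buf, fieldnames=data.keys())
writer.writeheader()   # DESCRIPTION,NAME,LOCATION,STATUS
writer.writerow(data)  # MAIN FLOOR,FORT-0232,MIDDLE,ACTIVE
print(buf.getvalue())
```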

Reorganize Pyspark dataframe: Create new column using row element

I am trying to map a document with this structure to a dataframe.
root
|-- Id: "a1"
|-- Type: "Work"
|-- Tag: Array
| |--0: Object
| | |-- Tag.name : "passHolder"
| | |-- Tag.value : "Jack Ryan"
| | |-- Tag.stat : "verified"
| |-- 1: Object
| | |-- Tag.name : "passNum"
| | |-- Tag.value : "1234"
| | |-- Tag.stat : "unverified"
|-- version: 1.5
By exploding the array with explode_outer, flattening the struct, and renaming via col + alias, the dataframe looks like:
df = df.withColumn("Tag", F.explode_outer("Tag"))
df = df.select(
    F.col("*"),
    F.col("Tag.name").alias("Tag_name"),
    F.col("Tag.value").alias("Tag_value"),
    F.col("Tag.stat").alias("Tag_stat"),
).drop("Tag")
+----+------+------------+-----------+------------+---------+
| Id | Type | Tag_name   | Tag_value | Tag_stat   | version |
+----+------+------------+-----------+------------+---------+
| a1 | Work | passHolder | Jack Ryan | verified   | 1.5     |
| a1 | Work | passNum    | 1234      | unverified | 1.5     |
+----+------+------------+-----------+------------+---------+
I am trying to reorganise the df structure so that it's more query-able, by making certain row elements into column names and populating them with the relevant values.
Can anyone give pointers on the steps required to arrive at the desired output format below? Thank you very much for the advice.
Target format:
+----+------+----------------+-----------------+-------------+--------------+---------+
| Id | Type | Tag_passHolder | passHolder_stat | Tag_passNum | passNum_stat | version |
+----+------+----------------+-----------------+-------------+--------------+---------+
| a1 | Work | Jack Ryan      | verified        | 1234        | unverified   | 1.5     |
+----+------+----------------+-----------------+-------------+--------------+---------+
Based on the output df you displayed, I would do something like this (note the column is named Id, not ID, in your dataframe):
from pyspark.sql import functions as F

passholder_df = df.select(
    "Id",
    "Type",
    F.col("Tag_value").alias("Tag_passHolder"),
    F.col("Tag_stat").alias("passHolder_stat"),
    "version",
).where("Tag_name = 'passHolder'")

passnum_df = df.select(
    "Id",
    "Type",
    F.col("Tag_value").alias("Tag_passNum"),
    F.col("Tag_stat").alias("passNum_stat"),
    "version",
).where("Tag_name = 'passNum'")

passholder_df.join(passnum_df, on=["Id", "Type", "version"], how="full")
You probably need to work a little bit on the join condition, depending on your business rules.
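The reshape itself is easy to sanity-check without a Spark session. This standalone plain-Python sketch uses the two rows from the question and mirrors the two filtered selects plus the full outer join on (Id, Type, version):

```python
# The exploded rows, as plain dicts (same data as the question's dataframe)
rows = [
    {"Id": "a1", "Type": "Work", "Tag_name": "passHolder",
     "Tag_value": "Jack Ryan", "Tag_stat": "verified", "version": 1.5},
    {"Id": "a1", "Type": "Work", "Tag_name": "passNum",
     "Tag_value": "1234", "Tag_stat": "unverified", "version": 1.5},
]

def pick(tag, value_col, stat_col):
    # Mirrors one df.select(...).where("Tag_name = '<tag>'") call:
    # keep only rows for this tag, keyed by the join columns
    return {
        (r["Id"], r["Type"], r["version"]):
            {value_col: r["Tag_value"], stat_col: r["Tag_stat"]}
        for r in rows if r["Tag_name"] == tag
    }

holders = pick("passHolder", "Tag_passHolder", "passHolder_stat")
nums = pick("passNum", "Tag_passNum", "passNum_stat")

# Full outer join on (Id, Type, version): union of keys, merge both sides
wide = []
for key in holders.keys() | nums.keys():
    rec = {"Id": key[0], "Type": key[1], "version": key[2]}
    rec.update(holders.get(key, {}))
    rec.update(nums.get(key, {}))
    wide.append(rec)
```

The single record in wide matches the target format: one row per (Id, Type, version) with the passHolder and passNum columns side by side.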

How to use behave context.table with key value table?

I saw that it is possible to access data from context.table in Behave when the table described in the BDD has a header, for example:
Scenario: Add new Expense
Given the user fill out the required fields
| item | name | amount |
| Wine | Julie | 30.00 |
To access this code it's simply:
for row in context.table:
    context.page.fill_item(row['item'])
    context.page.fill_name(row['name'])
    context.page.fill_amount(row['amount'])
That works well and it's very clean; however, I have to refactor the code when I have a huge number of lines of input data, for example:
Given I am on registration page
When I fill "test#test.com" for email address
And I fill "test" for password
And I fill "Didier" for first name
And I fill "Dubois" for last name
And I fill "946132795" for phone number
And I fill "456456456" for mobile phone
And I fill "Company name" for company name
And I fill "Avenue Victor Hugo 1" for address
And I fill "97123" for postal code
And I fill "Lyon" for city
And I select "France" country
...
15 more lines for filling the form
How could I use a table like the following in Behave:
| first name | didier     |
| last name  | Dubois     |
| phone      | 4564564564 |
and so on...
What would my step definition look like?
To use a vertical table rather than a horizontal table, you need to process each row as its own field. The table still needs a header row:
When I fill in the registration form with:
| Field      | Value  |
| first name | Didier |
| last name  | Dubois |
| country    | France |
| ...        | ...    |
In your step definition, loop over the table rows and call a method on your Selenium page model:
for row in context.table:
    context.page.fill_field(row['Field'], row['Value'])
The Selenium page model method needs to do something based on the field name:
def fill_field(self, field, value):
    if field == 'first name':
        self.first_name.send_keys(value)
    elif field == 'last name':
        self.last_name.send_keys(value)
    elif field == 'country':
        # should be an instance of SelectElement
        self.country.select_by_text(value)
    elif ...:
        ...
    else:
        raise NameError(f'Field {field} is not valid')
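If the if/elif chain grows with every new form field, a dictionary mapping field names to handler callables keeps fill_field flat. A standalone sketch of that design; FakeElement is a stand-in for a real Selenium element so the idea can be shown without a browser:

```python
class FakeElement:
    """Stand-in for a Selenium element: just records what was sent."""
    def __init__(self):
        self.value = None

    def send_keys(self, value):
        self.value = value

class PageModel:
    def __init__(self):
        self.first_name = FakeElement()
        self.last_name = FakeElement()
        # Map table field names to the action for that field
        self._fields = {
            'first name': self.first_name.send_keys,
            'last name': self.last_name.send_keys,
        }

    def fill_field(self, field, value):
        try:
            self._fields[field](value)
        except KeyError:
            raise NameError(f'Field {field} is not valid')

page = PageModel()
page.fill_field('first name', 'Didier')
page.fill_field('last name', 'Dubois')
```

Adding a field then means adding one dictionary entry rather than another elif branch.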

Parsing recursive templates with pyparsing

I'm currently trying to parse recursive templates with pyparsing. A template can look like this:
{{Attribute
| name=attr1
| description=First attribute.}}
The template has a name (Attribute) and defines some variables (name = attr1, description = First attribute.). However, there are also templates which can contain zero or more templates:
{{Enum
| name=MyEnum
| description=Just a test enum.
| example=Not given...
| attributes={{Attribute
| name=attr1
| description=First attribute.}}
{{Attribute
| name=attr2
| description=Second attribute.}}}}
To parse these templates I came up with the following:
from pyparsing import (CharsNotIn, Forward, Group, OneOrMore,
                       Suppress, Word, ZeroOrMore, alphas)

template = Forward()
lb = '{{'
rb = '}}'
template_name = Word(alphas)
variable = Word(alphas)
value = CharsNotIn('|{}=') | Group(ZeroOrMore(template))
member = Group(Suppress('|') + variable + Suppress('=') + value)
members = Group(OneOrMore(member))
template << Suppress(lb) + Group(template_name + members) + Suppress(rb)
This works quite well, but it does not allow me to use "|{}=" within a value, which is problematic if I want to use those characters. E.g.:
{{Enum
| name=MyEnum
| description=Just a test enum.
| example=<python>x = 1</python>
| attributes=}}
So, how can I change my code so that it allows these characters too? Unfortunately, I have no idea how I can achieve this.
I hope someone can give me some tips!
I found what I was looking for: https://github.com/earwig/mwparserfromhell

django make query

DB TABLE
select * from AAA;
id | Name | Class | Grade |
---------------------------
1  | john | 1     | A     |
2  | Jane | 2     | B     |
3  | Joon | 2     | A     |
4  | Josh | 3     | C     |
Code
Django
search_result = AAA.objects.filter(Grade='B').count()
print(search_result)
search_result -> 2
I want to change Grade to Class using a variable value.
Django
target_filter = 'Class'
search_result = AAA.objects.filter(__"target_filter..."__ = '3').count()
search_result -> 1
Q) How can I complete this code? Is it possible?
Maybe you can do it like this:
target_filter = 'Class'
filter_args = {target_filter: 3}
search_result = AAA.objects.filter(**filter_args).count()
You can shorten aeby's example by putting the kwargs in directly:
target_filter = 'Class'
search_result = AAA.objects.filter(**{target_filter:3}).count()
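The ** trick is plain Python rather than anything Django-specific, so it is easy to convince yourself it works without a database. The fake_filter function below is a hypothetical stand-in for QuerySet.filter that simply echoes the keyword arguments it receives:

```python
def fake_filter(**kwargs):
    # Stand-in for QuerySet.filter: return the keyword arguments it received
    return kwargs

# The dynamically-built name arrives as an ordinary keyword argument
target_filter = 'Class'
result = fake_filter(**{target_filter: 3})
# result -> {'Class': 3}
```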
It is much the same answer as I gave here: build the keyword dynamically with ** unpacking, since a keyword argument name must be a string supplied at call time:
target_filter = 'Class'
search_result = AAA.objects.filter(**{target_filter: '3'}).count()
EDIT: I originally suggested looking the name up with getattr on the module object (globals()[target_filter] in the same module, or getattr(somemodule, target_filter) from an imported one), but I got it wrong; that returns an object rather than a usable keyword name, so the ** form above is the way to go.
