Create Rows and Tables using BeautifulSoup python with XML to JSON conversion - python

Currently I'm writing a parser script that converts XML to JSON; my plan is to modify the script to also create rows and columns when converting to a CSV file.
At the moment, the script's output can only create newlines.
My Script
from bs4 import BeautifulSoup
import json

xml_parser = BeautifulSoup(open('SAMPLE.xml'), 'xml')
DESCRIPTION = xml_parser.DESCRIPTION
NAME = xml_parser.NAME
LOCATION = xml_parser.LOCATION
STATUS = xml_parser.STATUS

data = {
    'DESCRIPTION': DESCRIPTION.text,
    'NAME': NAME.text,
    'LOCATION': LOCATION.text,
    'STATUS': STATUS.text,
}
print(json.dumps(data).replace(",", "\n"))
Output
DESCRIPTION: MAIN FLOOR
NAME: FORT-0232
LOCATION: MIDDLE
STATUS: ACTIVE
Planned output
DESCRIPTION | NAME | LOCATION | STATUS |
MAIN FLOOR | FORT-0232 | MIDDLE | ACTIVE |
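One way to get that row/column layout is to join the parsed values yourself for the pipe-separated view, and use the standard library's csv module for the actual CSV file. A minimal sketch, assuming the same four fields; the data dict here just stands in for the BeautifulSoup lookups above:

```python
import csv
import io

# Parsed fields, standing in for the BeautifulSoup lookups above
data = {
    'DESCRIPTION': 'MAIN FLOOR',
    'NAME': 'FORT-0232',
    'LOCATION': 'MIDDLE',
    'STATUS': 'ACTIVE',
}

# Pipe-separated view matching the planned output
header = ' | '.join(data.keys()) + ' |'
row = ' | '.join(data.values()) + ' |'
print(header)  # DESCRIPTION | NAME | LOCATION | STATUS |
print(row)     # MAIN FLOOR | FORT-0232 | MIDDLE | ACTIVE |

# Proper CSV output: one header row, one value row
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(list(data.keys()))
writer.writerow(list(data.values()))
# buf.getvalue() is what would land in the .csv file
```

Replacing io.StringIO with open('SAMPLE.csv', 'w', newline='') writes the same two rows to disk.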

Python or PETL Parsing XML

I have been playing with PETL and seeing if I could extract multiple XML files and combine them into one.
I have no control over the structure of the XML files; here are the variations I am seeing, which are giving me trouble.
XML File 1 Example:
<?xml version="1.0" encoding="utf-8"?>
<Export>
  <Info>
    <Name>John Doe</Name>
    <Date>01/01/2021</Date>
  </Info>
  <App>
    <Description></Description>
    <Type>Two</Type>
    <Details>
      <DetailOne>1</DetailOne>
      <DetailTwo>2</DetailTwo>
    </Details>
    <Details>
      <DetailOne>10</DetailOne>
      <DetailTwo>11</DetailTwo>
    </Details>
  </App>
</Export>
XML File 2 Example:
<?xml version="1.0" encoding="utf-8"?>
<Export>
  <Info>
    <Name></Name>
    <Date>01/02/2021</Date>
  </Info>
  <App>
    <Description>Sample description here.</Description>
    <Type>One</Type>
    <Details>
      <DetailOne>1</DetailOne>
      <DetailTwo>2</DetailTwo>
      <DetailOne>3</DetailOne>
      <DetailTwo>4</DetailTwo>
    </Details>
    <Details>
      <DetailOne>10</DetailOne>
      <DetailTwo>11</DetailTwo>
    </Details>
  </App>
</Export>
My Python code just scans the subfolder xmlfiles and then uses PETL to parse each file from there. Given the structure of the documents, I am loading three tables so far:
1. one to hold the Info name and date
2. one to hold the description and type
3. one to collect the details
import petl as etl
import os
from lxml import etree

for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
    if filename.endswith('.xml'):
        # Get the Info children
        table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })
        # Get the App children
        table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })
        # Get the App Details children
        table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App/Details', {
            'DetailOne': 'DetailOne',
            'DetailTwo': 'DetailTwo'
        })
        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)
        print('Results:\n', result)
I crossjoin the three tables because I want the Info and App data on each line with each detail. This works until I get an XML file that has multiples of the DetailOne and DetailTwo elements.
What I am getting as results is:
Results:
+------------+----------+-------------+------+-----------+-----------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+==========+=============+======+===========+===========+==========+
| 01/01/2021 | John Doe | None | Two | 1 | 2 | one.xml |
+------------+----------+-------------+------+-----------+-----------+----------+
| 01/01/2021 | John Doe | None | Two | 10 | 11 | one.xml |
+------------+----------+-------------+------+-----------+-----------+----------+
Results:
+------------+------+--------------------------+------+------------+------------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One | ('1', '3') | ('2', '4') | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 10 | 11 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
The second file showing DetailOne being ('1','3') and DetailTwo being ('2', '4') is not what I want.
What I want is:
+------------+------+--------------------------+------+------------+------------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One | 1 | 2 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 3 | 4 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 10 | 11 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
I believe XPath may be the way to go but after researching:
https://petl.readthedocs.io/en/stable/io.html#xml-files - doesn't go in depth on lxml and petl
some light reading here:
https://www.w3schools.com/xml/xpath_syntax.asp
some more reading here:
https://lxml.de/tutorial.html
Any assistance on this is appreciated!
First, thanks for taking the time to write a good question. I'm happy to spend the time answering it.
I've never used PETL, but I did scan the docs for XML processing. I think your main problem is that the <Details> tag sometimes contains one pair of tags, and sometimes multiple pairs. If only there were a way to extract a flat list of the <DetailOne> and <DetailTwo> tag values, without the enclosing <Details> tags getting in the way...
Fortunately there is. I used https://www.webtoolkitonline.com/xml-xpath-tester.html and the XPath expression //Details/DetailOne returns the list 1,3,10 when applied to your example XML.
So I suspect that something like this should work:
import petl as etl
import os
from lxml import etree

for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
    if filename.endswith('.xml'):
        # Get the Info children
        table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })
        # Get the App children
        table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })
        # Get the App Details children
        table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), '/App', {
            'DetailOne': '//DetailOne',
            'DetailTwo': '//DetailTwo'
        })
        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)
        print('Results:\n', result)
The leading // may be redundant. It is XPath syntax for 'at any level in the document'. I don't know how PETL processes the XPath so I'm trying to play safe. I agree btw - the documentation is rather light on details.
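If PETL's XPath handling turns out to be too limited, another fallback is to flatten the pairs yourself before handing rows to PETL. A sketch using the standard library's ElementTree (lxml's etree exposes the same find/findall API), assuming DetailOne/DetailTwo values pair up in document order within each <Details> block; detail_rows is a hypothetical helper, not part of PETL:

```python
import xml.etree.ElementTree as ET

def detail_rows(xml_text):
    """Flatten every <Details> block into (DetailOne, DetailTwo) pairs,
    even when one block holds several pairs."""
    root = ET.fromstring(xml_text)
    rows = []
    for details in root.iter('Details'):
        ones = [e.text for e in details.findall('DetailOne')]
        twos = [e.text for e in details.findall('DetailTwo')]
        rows.extend(zip(ones, twos))
    return rows

# Cut-down version of the second example file
sample = """<Export><App><Details>
  <DetailOne>1</DetailOne><DetailTwo>2</DetailTwo>
  <DetailOne>3</DetailOne><DetailTwo>4</DetailTwo>
</Details><Details>
  <DetailOne>10</DetailOne><DetailTwo>11</DetailTwo>
</Details></App></Export>"""

print(detail_rows(sample))  # three (DetailOne, DetailTwo) pairs
```

The resulting list of tuples can be fed to PETL via etl.fromcolumns or a plain list-of-rows table, then crossjoined as before.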

Convert Excel to Yaml syntax in Python

I want to convert my data, which is in this form, to YAML syntax (preferably without using pandas or needing to install new libraries).
Sample data in excel :
users | name | uid | shell
user1 | nino | 8759 | /bin/ksh
user2 | vivo | 9650 | /bin/sh
Desired output format :
YAML Syntax output
You can do it using file operations, since you are keen on "preferably without using pandas or needing to install new libraries".
Assumption: the "|" symbol indicates columns and is not a delimiter or separator.
Step 1
Save the excel file as CSV
Then run the code
Code
# STEP 1 : Save your excel file as CSV
ctr = 0
excel_filename = "Book1.csv"
yaml_filename = excel_filename.replace('csv', 'yaml')
users = {}

with open(excel_filename, "r") as excel_csv:
    for line in excel_csv:
        if ctr == 0:
            ctr += 1  # Skip the column header
        else:
            # Save the csv row into a dictionary
            user, name, uid, shell = line.replace(' ', '').strip().split(',')
            users[user] = {'name': name, 'uid': uid, 'shell': shell}

with open(yaml_filename, "w+") as yf:
    yf.write("users: \n")
    for u in users:
        yf.write(f"  {u} : \n")
        for k, v in users[u].items():
            yf.write(f"    {k} : {v}\n")
Output
users:
  user1 :
    name : nino
    uid : 8759
    shell : /bin/ksh
  user2 :
    name : vivo
    uid : 9650
    shell : /bin/sh
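As a side note, the manual split-and-strip above can be replaced by the standard library's csv.DictReader, which handles the header row and quoting for you. A sketch under the same assumptions (comma-separated CSV saved from Excel, first column named users); the inline csv_text stands in for Book1.csv:

```python
import csv
import io

# Stand-in for the saved Book1.csv
csv_text = """users,name,uid,shell
user1,nino,8759,/bin/ksh
user2,vivo,9650,/bin/sh
"""

users = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    user = row.pop('users')  # remaining columns become the per-user mapping
    users[user] = row

print(users['user1']['shell'])  # /bin/ksh
```

Swapping io.StringIO(csv_text) for open("Book1.csv") reads the real file; the YAML-writing loop from the answer works unchanged on this users dict.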
You can do this with pandas and PyYAML; in your case you would just use pd.read_excel instead of pd.read_csv:
import pandas as pd
import yaml

df = pd.read_csv('test.csv', sep='|')
df['user_col'] = 'users'
data = df.groupby('user_col')[['users', 'name', 'uid', 'shell']].apply(lambda x: x.set_index('users').to_dict(orient='index')).to_dict()

with open('newtree.yaml', "w") as f:
    yaml.dump(data, f)
Yaml file looks like this:
users:
  user1:
    name: nino
    shell: /bin/ksh
    uid: 8759
  user2:
    name: vivo
    shell: /bin/sh
    uid: 9650

How to use behave context.table with key value table?

I saw that it is possible to access data from context.table in Behave when the table described in the BDD has a header, for example:
Scenario: Add new Expense
  Given the user fill out the required fields
    | item | name  | amount |
    | Wine | Julie | 30.00  |
Accessing this data is simple:
for row in context.table:
    context.page.fill_item(row['item'])
    context.page.fill_name(row['name'])
    context.page.fill_amount(row['amount'])
That works well and it's very clean. However, I have to refactor the code when I have a huge number of lines of input data, for example:
Given I am on registration page
When I fill "test#test.com" for email address
And I fill "test" for password
And I fill "Didier" for first name
And I fill "Dubois" for last name
And I fill "946132795" for phone number
And I fill "456456456" for mobile phone
And I fill "Company name" for company name
And I fill "Avenue Victor Hugo 1" for address
And I fill "97123" for postal code
And I fill "Lyon" for city
And I select "France" country
...
15 more lines for filling the form
How could I use the following table in behave:
| first name | didier     |
| last name  | Dubois     |
| phone      | 4564564564 |
So on ...
What would my step definition look like?
To use a vertical table rather than a horizontal table, you need to process each row as its own field. The table still needs a header row:
When I fill in the registration form with:
| Field | Value |
| first name | Didier |
| last name | Dubois |
| country | France |
| ... | ... |
In your step definition, loop over the table rows and call a method on your selenium page model:
for row in context.table:
    context.page.fill_field(row['Field'], row['Value'])
The Selenium page model method needs to do something based on the field name:
def fill_field(self, field, value):
    if field == 'first name':
        self.first_name.send_keys(value)
    elif field == 'last name':
        self.last_name.send_keys(value)
    elif field == 'country':
        # should be an instance of SelectElement
        self.country.select_by_text(value)
    # ... more fields ...
    else:
        raise NameError(f'Field {field} is not valid')
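If the if/elif chain grows long, a dict-based dispatch keeps fill_field flat: map each field name to its element action once, then look the action up per row. A standalone sketch of the idea; FakeElement is a stand-in for the real Selenium elements so the example runs on its own:

```python
class FakeElement:
    """Stand-in for a Selenium element, only for this sketch."""
    def __init__(self):
        self.value = None
    def send_keys(self, value):
        self.value = value

class RegistrationPage:
    def __init__(self):
        self.first_name = FakeElement()
        self.last_name = FakeElement()
        # One entry per supported field: name -> action that fills it
        self._fillers = {
            'first name': self.first_name.send_keys,
            'last name': self.last_name.send_keys,
        }
    def fill_field(self, field, value):
        try:
            self._fillers[field](value)
        except KeyError:
            raise NameError(f'Field {field} is not valid')

page = RegistrationPage()
page.fill_field('first name', 'Didier')
page.fill_field('last name', 'Dubois')
print(page.first_name.value)  # Didier
```

Select-style fields fit the same pattern by registering, e.g., self.country.select_by_text as the action.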

How to fix twint error "CRITICAL:root:twint.get:User:replace() argument 2 must be str, not None"

I'm trying to use the twint module to get some information from twitter, in particular the bio. The code example works just fine:
import twint
c = twint.Config()
c.Username = "twitter"
twint.run.Lookup(c)
yields
783214 | Twitter | #Twitter | Private: 0 | Verified: 1 | Bio: What’s happening?! | Location: Everywhere | Url: https://about.twitter.com/ | Joined: 20 Feb 2007 6:35 AM | Tweets: 10816 | Following: 140 | Followers: 56328970 | Likes: 5960 | Media: 1932 | Avatar: https://pbs.twimg.com/profile_images/1111729635610382336/_65QFl7B_400x400.png
Thing is, I only need the bio data. According to the site, you can use
c.Format = 'bio: {bio}'
Unfortunately, this yields
CRITICAL:root:twint.get:User:replace() argument 2 must be str, not None
I think this may be due to the following code line (from here):
output += output.replace("{bio}", u.bio)
Where the u.bio value is assigned here:
u.bio = card(ur, "bio")
The card function does the following when our type is "bio":
if _type == "bio":
    try:
        ret = ur.find("p", "ProfileHeaderCard-bio u-dir").text.replace("\n", " ")
    except:
        ret = None
I think the problem may lie in the second part, where a value is assigned to u.bio, either not even being called or returning None for some reason. Unfortunately, I do not know how to fix that or call the function.
I've had a similar problem before with a different function, twint.run.Following(c), but was able to solve it by not setting c.User_full = True.
Could anyone help me out?
The format should be of the form
c.Format = "{bio}"
If you want multiple fields:
c.Format = "{bio} | {name}"
I find you get a rate limit of 250 items before a blocker drops down and you need to wait for a few minutes for it to lift.

Pygal - Click bar and post data?

I am trying to create a simple charting web app using pygal and flask to chart some financial data I have stored in a mysql database. The data in the DB is hierarchical, and I want the app to start at the top of the hierarchy and allow the user to drill down by simply clicking on the relevant parts of the chart.
I am using flask to display the dynamically generated pygal charts.
Sample DB data:
guid | name | value | parentGuid | Description
------------------------------------------------
1234 | cat1 | 1 | null | Blah, blah
4567 | cat2 | 55 | null | asfdsa
8901 | cat3 | 22 | 1234 | asfdsa
5435 | cat4 | 3 | 8901 | afdsa
etc...
I have no problem drilling down the hierarchy using python + sql, but where I'm stumped is how I drill down using links in my pygal chart.
@app.route('/', methods=['GET', 'POST'])
def index():
    if request.method == 'POST':
        chart = graph_sub(request.form['guid'])
        return render_template('Graph3.html', chart=chart)
    else:
        chart = graph_main()
        return render_template('Graph3.html', chart=chart)

def graph_main():
    """ render svg graph """
    line_chart = pygal.HorizontalBar()
    line_chart.title = 'Root Accounts'
    RootAccts = GetRootAccounts()  # Returns a list of lists containing accounts and account data.
    for Acct in RootAccts:
        line_chart.add({
            'title': Acct[1],       # Acct Name
            'tooltip': Acct[4],     # Acct description
            'xlink': {'href': '/'}  # Hyperlink that I want to pass POST data back to the form.
        }, [{
            'value': Acct[2],       # Acct Value
            'label': Acct[4],       # Acct Description
            'xlink': {'href': '/'}  # Hyperlink that I want to pass POST data back to the form.
        }])
    return line_chart.render()

def graph_sub(parentGuid):
    ### This works fine if I pass a parent GUID to it
    ### Now the question is how do I pass the guid to it from my pygal chart?
    return line_chart.render()
So when I click on the links embedded in the pygal chart
'xlink': {'href': '/'}
How can I make it redirect back to the same page and pass the GUID of the selected account as POST data? Or is there another way to do this that doesn't involve POST?
The page reloading every time they click something is fine, so I'm hoping to keep this as simple as possible without having to involve Ajax/JavaScript/etc., though if there are no other options I am open to it.
I don't have it coded yet, but there will also be some additional form controls added to the page that will allow the user to set date ranges to control how much data is displayed. I was planning to use POST to pass user input around, but before I get too far down that road, I need to figure out how I can manage this base functionality.
Thoughts?
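One simple option, not tested against pygal, is to sidestep POST entirely and carry the GUID in the link itself as a query parameter, e.g. 'xlink': {'href': '/?guid=' + Acct[0]} (assuming Acct[0] holds the guid), then branch in the view on request.args.get('guid') instead of request.method. The URL mechanics can be sketched with the standard library alone; drill_link and guid_from are hypothetical helpers:

```python
from urllib.parse import urlencode, urlparse, parse_qs

def drill_link(guid):
    """Build the drill-down href to embed in the pygal 'xlink'."""
    return '/?' + urlencode({'guid': guid})

def guid_from(url):
    """What request.args.get('guid') would return for this URL."""
    return parse_qs(urlparse(url).query).get('guid', [None])[0]

link = drill_link('8901')
print(link)             # /?guid=8901
print(guid_from(link))  # 8901
```

In the Flask view, a present guid would route to graph_sub(guid) and an absent one to graph_main(), so the same page serves every level of the hierarchy without any form or Ajax.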
