wxPython print string with formatting - python

So I have the following class, which prints the text and header you call it with. I have also included the code I use to call the Print function. The string outputstring contains the text I want to print. My expected output and my actual output are below: the spaces needed for legibility are being removed. How can I print while keeping the spaces?
Class:
#Printer Class
class Printer(HtmlEasyPrinting):
    def __init__(self):
        HtmlEasyPrinting.__init__(self)

    def GetHtmlText(self, text):
        "Simple conversion of text. Use a more powerful version"
        html_text = text.replace('\n\n', '<P>')
        html_text = html_text.replace('\n', '<BR>')  # chain on html_text so the first replace isn't discarded
        return html_text

    def Print(self, text, doc_name):
        self.SetHeader(doc_name)
        self.PrintText(self.GetHtmlText(text), doc_name)

    def PreviewText(self, text, doc_name):
        self.SetHeader(doc_name)
        HtmlEasyPrinting.PreviewText(self, self.GetHtmlText(text))
Expected Print:
+-------------------+---------------------------------+------+-----------------+-----------+
| Domain: | Mail Server: | TLS: | # of Employees: | Verified: |
+-------------------+---------------------------------+------+-----------------+-----------+
| bankofamerica.com | ltwemail.bankofamerica.com | Y | 239000 | Y |
| | rdnemail.bankofamerica.com | Y | | Y |
| | kcmemail.bankofamerica.com | Y | | Y |
| | rchemail.bankofamerica.com | Y | | Y |
| citigroup.com | mx-b.mail.citi.com | Y | 248000 | N |
| | mx-a.mail.citi.com | Y | | N |
| bnymellon.com | cluster9bny.us.messagelabs.com | ? | 51400 | N |
| | cluster9bnya.us.messagelabs.com | Y | | N |
| usbank.com | mail1.usbank.com | Y | 65565 | Y |
| | mail2.usbank.com | Y | | Y |
| | mail3.usbank.com | Y | | Y |
| | mail4.usbank.com | Y | | Y |
| us.hsbc.com | vhiron1.us.hsbc.com | Y | 255200 | Y |
| | vhiron2.us.hsbc.com | Y | | Y |
| | njiron1.us.hsbc.com | Y | | Y |
| | njiron2.us.hsbc.com | Y | | Y |
| | nyiron1.us.hsbc.com | Y | | Y |
| | nyiron2.us.hsbc.com | Y | | Y |
| pnc.com | cluster5a.us.messagelabs.com | Y | 49921 | N |
| | cluster5.us.messagelabs.com | ? | | N |
| tdbank.com | cluster5.us.messagelabs.com | ? | 0 | N |
| | cluster5a.us.messagelabs.com | Y | | N |
+-------------------+---------------------------------+------+-----------------+-----------+
Actual Print:
The same content as expected, but with the runs of spaces collapsed, making the table very hard to read.
Function call:
def printFile():
    outputstring = txt_tableout.get(1.0, 'end')
    print(outputstring)
    app = wx.PySimpleApp()
    p = Printer()
    p.Print(outputstring, "Data Results")
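One common fix (a sketch, not taken from this thread): HTML collapses runs of whitespace, so instead of substituting <BR> tags, escape the text and wrap it in a <pre> block, which preserves whitespace and renders in a monospace font. The helper name here is hypothetical:

```python
import html

def get_pre_html(text):
    # HTML collapses runs of spaces, which is what destroys the ASCII-table
    # alignment.  <pre> preserves whitespace and uses a monospace font, so
    # the table prints exactly as it appears on screen.
    return "<pre>" + html.escape(text) + "</pre>"

table = "| Domain:           | TLS: |\n| bankofamerica.com | Y    |"
print(get_pre_html(table))
```

The result of this helper could then be handed to PrintText in place of GetHtmlText's output.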

For anyone else struggling, this is the modified class function I used to generate a nice table with all rows and columns.
def GetHtmlText(self, text):
    html_text = '<h3>Data Results:</h3><p><table border="2">'
    html_text += "<tr><td>Domain:</td><td>Mail Server:</td><td>TLS:</td><td># of Employees:</td><td>Verified</td></tr>"
    for row in root.ptglobal.to_csv():
        html_text += "<tr>"
        for x in range(len(row)):
            html_text += "<td>" + str(row[x]) + "</td>"
        html_text += "</tr>"
    return html_text + "</table></p>"

Maybe try
`html_text = text.replace(' ', '&nbsp;').replace('\n', '<br/>')`
That would replace your spaces with HTML non-breaking-space entities, but the result still won't line up, because it isn't rendered in a monospace font. This will be hard to automate; you really want to put the data in a table structure, but that requires some work.
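As a sketch of that suggestion (the <tt> wrapper is my addition to request a monospace font, and this assumes the text itself contains no HTML markup):

```python
def spaces_to_html(text):
    # Swap literal spaces for non-breaking-space entities so the HTML
    # renderer cannot collapse them, and newlines for <br/> tags.
    body = text.replace(' ', '&nbsp;').replace('\n', '<br/>')
    # <tt> asks for a monospace font so the columns still line up.
    return '<tt>' + body + '</tt>'

print(spaces_to_html("| a | b |\n| 1 | 2 |"))
```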
You probably want to invest a little more time in your HTML conversion. Perhaps something like this (making assumptions based on what you have shown):
def GetHtmlText(self, text):
    "Simple conversion of text. Use a more powerful version"
    text_lines = text.splitlines()
    html_text = "<table>"
    html_text += "<tr><th>" + "</th><th>".join(text_lines[0].split(":")) + "</th></tr>"
    for line in text_lines[1:]:
        html_text += "<tr><td>" + "</td><td>".join(line.split()) + "</td></tr>"
    return html_text + "</table>"

Web Scrape table data from this webpage

I'm trying to scrape the data from the table in the specifications section of this webpage:
Lochinvar Water Heaters
I'm using Beautiful Soup 4. I've tried searching for it by class (for example, class="Table__Cell-sc-1e0v68l-0 kdksLO"), but bs4 can't find that class on the webpage. I listed all the classes it could find, and nothing useful turns up. Any help is appreciated.
Here's the code I tried to get the classes
import requests
from bs4 import BeautifulSoup

URL = "https://www.lochinvar.com/products/commercial-water-heaters/armor-condensing-water-heater"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find_all("div", class_='Table__Wrapper-sc-1e0v68l-3 iFOFNW')
classes = [value
           for element in soup.find_all(class_=True)
           for value in element["class"]]
classes = sorted(classes)
for cls in classes:
    print(cls)
The page is populated with JavaScript, but fortunately in this case much of the data [including the specs table you want] seems to be inside a script tag within the fetched HTML. The script has just one statement, so it's fairly easy to extract it as JSON:
import json
### copied from your q ####
import requests
from bs4 import BeautifulSoup
URL = "https://www.lochinvar.com/products/commercial-water-heaters/armor-condensing-water-heater"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
###########################
wrInf = soup.find(lambda l: l.name == 'script' and '__routeInfo' in l.text)
wrInf = wrInf.text.replace('window.__routeInfo = ', '', 1) # remove variable name
wrInf = wrInf.strip()[:-1] # get rid of ; at end
wrInf = json.loads(wrInf) # convert to python dictionary
specsTables = wrInf['data']['product']['specifications'][0]['table'] # get table (tsv string)
specsTables = [tuple(row.split('\t')) for row in specsTables.split('\n')] # convert rows to tuples
To view it, you could use pandas,
import pandas
headers = specsTables[0]
st_df = pandas.DataFrame([dict(zip(headers, r)) for r in specsTables[1:]])
# or just
# st_df = pandas.DataFrame(specsTables[1:], columns=headers)
print(st_df.head())
or you could simply print it
for i, r in enumerate(specsTables):
    print(" | ".join([f'{c:^18}' for c in r]))
    if i == 0: print()
output:
Model Number | Btu/Hr Input | Thermal Efficiency | GPH # 100ºF Rise | A | B | C | D | E | F | G | H | I | J | K | L | M | Gas Conn. | Water Conn. | Air Inlet | Vent Size | Ship. Wt.
AWH0400NPM | 399,000 | 99% | 479 | 45" | 24" | 30-1/2" | 42-1/2" | 29-3/4" | 20-1/4" | 12" | 20" | 38" | 3-1/2" | 10-1/2" | 19-1/4" | 20" | 1" | 2" | 4" | 4" | 326
AWH0500NPM | 500,000 | 99% | 600 | 45" | 24" | 30-1/2" | 42-1/2" | 29-3/4" | 20-1/4" | 12" | 20" | 38" | 3-1/2" | 10-1/2" | 19-1/4" | 20" | 1" | 2" | 4" | 4" | 333
AWH0650NPM | 650,000 | 98% | 772 | 45" | 24" | 41" | 53" | 30-1/2" | 15-1/4" | 12" | 20" | 38" | 3-1/2" | 10-1/2" | 19-1/4" | 20" | 1-1/4" | 2" | 4" | 6" | 424
AWH0800NPM | 800,000 | 98% | 950 | 45" | 24" | 41" | 53" | 30-1/2" | 15-1/4" | 12" | 20" | 38" | 3-1/2" | 10-1/2" | 19-1/4" | 20" | 1-1/4" | 2" | 4" | 6" | 434
AWH1000NPM | 999,000 | 98% | 1,187 | 45" | 24" | 48" | 62" | 30-1/2" | 15-3/4" | 12" | 20" | 38" | 3-1/2" | 10-1/2" | 19-1/4" | 20" | 1-1/4" | 2-1/2" | 6" | 6" | 494
AWH1250NPM | 1,250,000 | 98% | 1,485 | 51-1/2" | 34" | 49" | 59" | 5-1/2" | 5-1/2" | 13-1/2" | 6-3/4" | 46-3/4" | 5-3/4" | 19-3/4" | 23" | 22-1/2" | 1-1/2" | 2-1/2" | 8" | 8" | 1,568
AWH1500NPM | 1,500,000 | 98% | 1,782 | 51-1/2" | 34" | 52-3/4" | 62-3/4" | 4-1/2" | 4-1/2" | 13-1/2" | 6-3/4" | 46-3/4" | 5-3/4" | 19-3/4" | 23" | 22-1/2" | 1-1/2" | 2-1/2" | 8" | 8" | 1,649
AWH2000NPM | 1,999,000 | 98% | 2,375 | 51-1/2" | 34" | 65-1/2" | 75-1/2" | 7" | 5-3/4" | 14-3/4" | 7-1/4" | 46-3/4" | 6-3/4" | 18-3/4" | 23" | 23-1/2" | 1-1/2" | 2-1/2" | 8" | 8" | 1,911
AWH3000NPM | 3,000,000 | 98% | 3,564 | 67-1/4" | 48-1/4" | 79-3/4" | 93-3/4" | 4-3/4" | 6-3/4" | 17-3/4" | 8-3/4" | 60-1/4" | 8-1/2" | 25-1/2" | 29-1/2" | 40" | 2" | 4" | 10" | 10" | 3,147
AWH4000NPM | 4,000,000 | 98% | 4,752 | 67-1/4" | 48-1/4" | 96" | 110" | 5" | 7-1/2" | 17-3/4" | 8-3/4" | 60-1/4" | 8-1/2" | 25-1/2" | 29-1/2" | 40" | 2-1/2" | 4" | 12" | 12" | 3,694
If you wanted a specific model's specs:
modelNo = 'AWH1000NPM'
mSpecs = [r for r in specsTables if r[0] == modelNo]
mSpecs = [[]] if mSpecs == [] else mSpecs # in case there is no match
mSpecs = dict(zip(specsTables[0], mSpecs[0])) # convert to dictionary
print(mSpecs)
output:
{'Model Number': 'AWH1000NPM', 'Btu/Hr Input': '999,000', 'Thermal Efficiency': '98%', 'GPH # 100ºF Rise': '1,187', 'A': '45"', 'B': '24"', 'C': '48"', 'D': '62"', 'E': '30-1/2"', 'F': '15-3/4"', 'G': '12"', 'H': '20"', 'I': '38"', 'J': '3-1/2"', 'K': '10-1/2"', 'L': '19-1/4"', 'M': '20"', 'Gas Conn.': '1-1/4"', 'Water Conn.': '2-1/2"', 'Air Inlet': '6"', 'Vent Size': '6"', 'Ship. Wt.': '494'}
The contents for constructing the table are within a script tag. You can extract the relevant string and re-create the table through string manipulation.
import requests, re
import pandas as pd
r = requests.get('https://www.lochinvar.com/products/commercial-water-heaters/armor-condensing-water-heater/').text
s = re.sub(r'\\"', '"', re.search(r'table":"([\s\S]+?)(?:","tableFootNote)', r).group(1))
lines = [i.split('\\t') for i in s.split('\\n')]  # the tab/newline separators are escaped in the page source
df = pd.DataFrame(lines[1:], columns=lines[0])
df.head(5)

Python Pandas Extracting specific data from column

I have one column (object dtype) that contains multiple values separated by |. I would like to extract only the customer order number, which starts with 44. Sometimes the order number is at the beginning, sometimes in the middle, sometimes at the end, and sometimes it is duplicated.
44019541285_P_002 | 0317209757 | 87186978110350851 | 387186978103840544 |
87186978110202440 | 44019119315 | 87186978110202440 | 44019119315
87186978110326832 | 44019453624 | 87186978110326832 | 44019453624
44019406029 | 0317196878 | 87186978110313085 | 387186978120481881|
44019480564 | 0317202711 | 87186978110335810 | 387186978103844160 |
Desired result:
44019541285
44019119315
44019453624
44019406029
44019480564
My code:
import pandas as pd
from io import StringIO
data = '''
Order_Numbers
44019541285_P_002 | 0317209757 | 87186978110350851 | 387186978103840544 | 0652569032
87186978110202440 | 44019119315 | 87186978110202440 | 44019119315
87186978110326832 | 44019453624 | 87186978110326832 | 44019453624
44019406029 | 0317196878 | 87186978110313085 | 387186978120481881|
44019480564 | 0317202711 | 87186978110335810 | 387186978103844160 | 630552498
'''
df = pd.read_csv(StringIO(data.replace(' ','')))
df
'''
Order_Numbers
0 44019541285_P_002|0317209757|87186978110350851...
1 87186978110202440|44019119315|8718697811020244...
2 87186978110326832|44019453624|8718697811032683...
3 44019406029|0317196878|87186978110313085|38718...
4 44019480564|0317202711|87186978110335810|38718...
'''
Final code:
(
    df.Order_Numbers.str.split('|', expand=True)
    .astype(str)
    .where(lambda x: x.applymap(lambda y: y[:2] == '44'))
    .bfill(axis=1)
    [0]
    .str.split('_').str.get(0)
)
0 44019541285
1 44019119315
2 44019453624
3 44019406029
4 44019480564
Name: 0, dtype: object
import pandas as pd

df = pd.DataFrame({
    'order_number': [
        '44019541285_P_002 | 0317209757 | 87186978110350851 | 387186978103840544 | 0652569032',
        '87186978110202440 | 44019119315 | 87186978110202440 | 44019119315',
        '87186978110326832 | 44019453624 | 87186978110326832 | 44019453624',
        '44019406029 | 0317196878 | 87186978110313085 | 387186978120481881|',
        '44019480564 | 0317202711 | 87186978110335810 | 387186978103844160 | 630552498'
    ]
})

def extract_customer_order(order_number):
    order_number = order_number.replace(' ', '')   # remove spaces: '44019541285_P_002 | 0317209757' -> '44019541285_P_002|0317209757'
    order_number_list = order_number.split('|')    # split on |: '44019541285_P_002|0317209757' -> ['44019541285_P_002', '0317209757']
    result = []
    for order in order_number_list:
        if order.startswith('44'):     # keep only order numbers starting with '44'
            if order not in result:    # skip duplicates
                result.append(order)
    # if you want the result as a string separated by '|', uncomment the line below
    # result = '|'.join(result)
    return result

df['customer_order'] = df['order_number'].apply(extract_customer_order)
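A regex-based alternative, sketched under the assumption that a customer order number is always 11 digits starting with 44 (true of the sample data): Series.str.extract returns the first match per row, which also drops the _P_002 suffix and the duplicates.

```python
import pandas as pd

df = pd.DataFrame({'order_number': [
    '44019541285_P_002 | 0317209757 | 87186978110350851 | 387186978103840544 | 0652569032',
    '87186978110202440 | 44019119315 | 87186978110202440 | 44019119315',
]})
# \b keeps the pattern from matching a '44' buried inside a longer number;
# {9} pins the match to 11 digits total, so any suffix like _P_002 is dropped.
df['customer_order'] = df['order_number'].str.extract(r'\b(44\d{9})', expand=False)
print(df['customer_order'].tolist())
```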

What does NavigableString refer to in this error, and why did that happen?

from bs4 import BeautifulSoup
import numpy as np
import requests
from selenium import webdriver
from nltk.tokenize import sent_tokenize,word_tokenize
html = webdriver.Firefox(executable_path=r'D:\geckodriver.exe')
html.get("https://www.tsa.gov/coronavirus/passenger-throughput")
def TSA_travel_numbers(html):
    soup = BeautifulSoup(html, 'lxml')
    for i, rows in enumerate(soup.find('div', class_='view-content'), 1):
        # print(rows.content)
        for header in rows.find('tr'):
            number = rows.find_all('td', class_='views-field views-field field-2021-throughput views-align-center')
            print(number.text)
TSA_travel_numbers(html.page_source)
My error is as follows:
Traceback (most recent call last):
  File "TSA_travel.py", line 23, in <module>
    TSA_travel_numbers(html.page_source)
  File "TSA_travel.py", line 15, in TSA_travel_numbers
    for header in rows.find('tr'):
TypeError: 'int' object is not iterable
What is happening here? I can't iterate through the 'tr' tags. Please help me solve this problem. Sorry for your time, and thanks in advance!
As the error says, you can't iterate over an int, which is what rows.find('tr') returned. Iterating over the div tag yields its children, and some of those children are NavigableString objects: the stretches of bare text (here, whitespace) between tags. A NavigableString behaves like a plain Python str, so its find method is str.find, which returns an int index, and the inner for loop then tries to iterate over that int.
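A minimal sketch of why the TypeError mentions an int (using a tiny hand-made document rather than the TSA page):

```python
from bs4 import BeautifulSoup, NavigableString

# Iterating over a Tag yields its children, including the bare text
# between tags, which bs4 wraps as NavigableString objects.
soup = BeautifulSoup("<div> <tr><td>x</td></tr></div>", "html.parser")
for child in soup.div:
    if isinstance(child, NavigableString):
        # On a NavigableString, .find is plain str.find: it returns an
        # int index (-1 when not found), which a for loop cannot iterate.
        print(type(child.find('tr')))
```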
Also, there's no need for a webdriver as data on the page is static.
Here's my take on it:
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
def get_page(url):
    return requests.get(url).text

def get_data(page):
    soup = BeautifulSoup(page, 'lxml')
    return [
        item.getText(strip=True) for item in soup.select(".views-align-center")
    ]

def build_table(table_rows):
    t = [table_rows[i:i + 4] for i in range(0, len(table_rows[1:]), 4)]
    h = t[0]
    return t[1:], h

if __name__ == '__main__':
    source = "https://www.tsa.gov/coronavirus/passenger-throughput"
    table, header = build_table(get_data(get_page(source)))
    print(tabulate(table, headers=header, tablefmt="pretty"))
Output:
+------------+--------------------------+--------------------------+--------------------------+
| Date | 2021 Traveler Throughput | 2020 Traveler Throughput | 2019 Traveler Throughput |
+------------+--------------------------+--------------------------+--------------------------+
| 5/9/2021 | 1,707,805 | 200,815 | 2,419,114 |
| 5/8/2021 | 1,429,657 | 169,580 | 1,985,942 |
| 5/7/2021 | 1,703,267 | 215,444 | 2,602,631 |
| 5/6/2021 | 1,644,050 | 190,863 | 2,555,342 |
| 5/5/2021 | 1,268,938 | 140,409 | 2,270,662 |
| 5/4/2021 | 1,134,103 | 130,601 | 2,106,597 |
| 5/3/2021 | 1,463,672 | 163,692 | 2,470,969 |
| 5/2/2021 | 1,626,962 | 170,254 | 2,512,598 |
| 5/1/2021 | 1,335,535 | 134,261 | 1,968,278 |
| 4/30/2021 | 1,558,553 | 171,563 | 2,546,029 |
| 4/29/2021 | 1,526,681 | 154,695 | 2,499,461 |
| 4/28/2021 | 1,184,326 | 119,629 | 2,256,442 |
| 4/27/2021 | 1,077,199 | 110,913 | 2,102,068 |
| 4/26/2021 | 1,369,410 | 119,854 | 2,412,770 |
| 4/25/2021 | 1,571,220 | 128,875 | 2,506,809 |
| 4/24/2021 | 1,259,724 | 114,459 | 1,990,464 |
| 4/23/2021 | 1,521,393 | 123,464 | 2,521,897 |
| 4/22/2021 | 1,509,649 | 111,627 | 2,526,961 |
| 4/21/2021 | 1,164,099 | 98,968 | 2,254,209 |
| 4/20/2021 | 1,082,443 | 92,859 | 2,227,475 |
| 4/19/2021 | 1,412,500 | 99,344 | 2,594,171 |
| 4/18/2021 | 1,572,383 | 105,382 | 2,356,802 |
| 4/17/2021 | 1,277,815 | 97,236 | 1,988,205 |
| 4/16/2021 | 1,468,218 | 106,385 | 2,457,133 |
| 4/15/2021 | 1,491,435 | 95,085 | 2,616,158 |
| 4/14/2021 | 1,152,703 | 90,784 | 2,317,381 |
| 4/13/2021 | 1,085,034 | 87,534 | 2,208,688 |
| 4/12/2021 | 1,468,972 | 102,184 | 2,484,580 |
| 4/11/2021 | 1,561,495 | 90,510 | 2,446,801 |
and so on ...
Or, an even shorter approach, just use pandas:
import pandas as pd
import requests
from tabulate import tabulate

if __name__ == '__main__':
    source = "https://www.tsa.gov/coronavirus/passenger-throughput"
    df = pd.read_html(requests.get(source).text, flavor="bs4")[0]
    print(tabulate(df.head(10), tablefmt="pretty", showindex=False))
Output:
+-----------+-----------+--------+---------+
| 5/9/2021 | 1707805.0 | 200815 | 2419114 |
| 5/8/2021 | 1429657.0 | 169580 | 1985942 |
| 5/7/2021 | 1703267.0 | 215444 | 2602631 |
| 5/6/2021 | 1644050.0 | 190863 | 2555342 |
| 5/5/2021 | 1268938.0 | 140409 | 2270662 |
| 5/4/2021 | 1134103.0 | 130601 | 2106597 |
| 5/3/2021 | 1463672.0 | 163692 | 2470969 |
| 5/2/2021 | 1626962.0 | 170254 | 2512598 |
| 5/1/2021 | 1335535.0 | 134261 | 1968278 |
| 4/30/2021 | 1558553.0 | 171563 | 2546029 |
+-----------+-----------+--------+---------+

Beautiful Soup returning None

I am trying to scrape the th element, but the result keeps returning None. What am I doing wrong?
This is the code I have tried:
import requests
import bs4
import urllib3
dateList = []
openList = []
closeList = []
highList = []
lowList = []

r = requests.get(
    'https://coinmarketcap.com/currencies/bitcoin/historical-data/')
soup = bs4.BeautifulSoup(r.text, 'lxml')
td = soup.find('th')
print(td)
There's an API endpoint so you can fetch the data from there.
Here's how:
import pandas as pd
import requests
from tabulate import tabulate
api_endpoint = "https://web-api.coinmarketcap.com/v1/cryptocurrency/ohlcv/historical?id=1&convert=USD&time_start=1609804800&time_end=1614902400"
bitcoin = requests.get(api_endpoint).json()
df = pd.DataFrame([q["quote"]["USD"] for q in bitcoin["data"]["quotes"]])
print(tabulate(df, headers="keys", showindex=False, disable_numparse=True, tablefmt="pretty"))
Output:
+----------------+----------------+----------------+----------------+--------------------+-------------------+--------------------------+
| open | high | low | close | volume | market_cap | timestamp |
+----------------+----------------+----------------+----------------+--------------------+-------------------+--------------------------+
| 34013.614533 | 36879.69856854 | 33514.03374162 | 36824.36441009 | 75289433810.59091 | 684671246323.6501 | 2021-01-06T23:59:59.999Z |
| 36833.87435728 | 40180.3679073 | 36491.18981083 | 39371.04235311 | 84762141031.49448 | 732062681138.1346 | 2021-01-07T23:59:59.999Z |
| 39381.76584266 | 41946.73935079 | 36838.63599637 | 40797.61071993 | 88107519479.50471 | 758625941266.7522 | 2021-01-08T23:59:59.999Z |
| 40788.64052286 | 41436.35000639 | 38980.87690625 | 40254.54649816 | 61984162837.0747 | 748563483043.1383 | 2021-01-09T23:59:59.999Z |
| 40254.21779758 | 41420.19103255 | 35984.62712175 | 38356.43950662 | 79980747690.35463 | 713304617760.9486 | 2021-01-10T23:59:59.999Z |
| 38346.52950301 | 38346.52950301 | 30549.59876946 | 35566.65594049 | 123320567398.62296 | 661457321418.0524 | 2021-01-11T23:59:59.999Z |
| 35516.36114084 | 36568.52697414 | 32697.97662163 | 33922.9605815 | 74773277909.4566 | 630920422745.0479 | 2021-01-12T23:59:59.999Z |
| 33915.11958124 | 37599.96059774 | 32584.66767186 | 37316.35939997 | 69364315979.27992 | 694069582193.7559 | 2021-01-13T23:59:59.999Z |
| 37325.10763475 | 39966.40524241 | 36868.5632453 | 39187.32812109 | 63615990033.01017 | 728904366964.3611 | 2021-01-14T23:59:59.999Z |
| 39156.7080858 | 39577.71118833 | 34659.58974449 | 36825.36585131 | 67760757880.723885 | 685005864471.3622 | 2021-01-15T23:59:59.999Z |
| 36821.64873201 | 37864.36887891 | 35633.55401669 | 36178.13890106 | 57706187875.104546 | 673000645230.8221 | 2021-01-16T23:59:59.999Z |
| 36163.64923243 | 36722.34987621 | 34069.32218533 | 35791.27792129 | 52359854336.21185 | 665831621390.9865 | 2021-01-17T23:59:59.999Z |
| 35792.23666766 | 37299.28580604 | 34883.84404829 | 36630.07568284 | 49511702429.3542 | 681470030572.0747 | 2021-01-18T23:59:59.999Z |
| 36642.23272357 | 37755.89185872 | 36069.80639361 | 36069.80639361 | 57244195485.50075 | 671081200699.8711 | 2021-01-19T23:59:59.999Z |
and so on ...
The find returns None because the table on that page is rendered by JavaScript, so the th you're after isn't in the HTML that requests fetches (the Response object does have a text attribute). Passing bytes instead, soup = bs4.BeautifulSoup(r.content, 'lxml'), lets the parser handle encoding detection, but it won't make a JavaScript-rendered table appear; use the API endpoint above.

Convert Decision tree from text 2 visual

I have a decision tree output in a text format which is very hard to read and interpret. There are a ton of pipes and indentation to follow through the tree/nodes/leaves. I was wondering if there are tools out there where I can feed in a decision tree like the one below and get a tree diagram like Weka, Python, etc. produce?
Since my decision tree is very large, below is a sample/partial tree to give an idea of my text decision tree. Thanks a bunch!
"bio" <= 0.5:
| "ml" <= 0.5:
| | "algorithm" <= 0.5:
| | | "bioscience" <= 0.5:
| | | | "microbial" <= 0.5:
| | | | | "assembly" <= 0.5:
| | | | | | "nano-tech" <= 0.5:
| | | | | | | "smith" <= 0.5:
| | | | | | | | "neurons" <= 0.5:
| | | | | | | | | "process" <= 1.5:
| | | | | | | | | | "program" <= 1.5:
| | | | | | | | | | | "mammal" <= 1.0:
| | | | | | | | | | | | "lab" <= 0.5:
| | | | | | | | | | | | | "human-machine" <= 1.5:
| | | | | | | | | | | | | | "tech" <= 0.5:
| | | | | | | | | | | | | | | "smith" <= 0.5:
I'm not aware of any tool that interprets that format, so I think you're going to have to write something, either to interpret the text format or to retrieve the tree structure using the DecisionTree class in MALLET's Java API.
Interpreting the text in Python shouldn't be too hard: for example, if
line = '| | | | | "assembly" <= 0.5:'
then you can get the indent level, the predictor name and the split point with
parts = line.split('"')
indent = parts[0].count('| ')
predictor = parts[1]
splitpoint = float(parts[2].rstrip(':').split()[-1])
To create graphical output, I would use GraphViz. There are Python APIs for it, but it's simple enough to build a file in its text-based dot format and create a graphic from it with the dot command. For example, the file for a simple tree might look like
digraph MyTree {
    Node_1 [label="Predictor1"]
    Node_1 -> Node_2 [label="< 0.335"]
    Node_1 -> Node_3 [label=">= 0.335"]
    Node_2 [label="Predictor2"]
    Node_2 -> Node_4 [label="< 1.42"]
    Node_2 -> Node_5 [label=">= 1.42"]
    Node_3 [label="Class1\n(p=0.897, n=26)", shape=box, style=filled, color=lightgray]
    Node_4 [label="Class2\n(p=0.993, n=17)", shape=box, style=filled, color=lightgray]
    Node_5 [label="Class3\n(p=0.762, n=33)", shape=box, style=filled, color=lightgray]
}
and the resulting output from dot is the rendered tree diagram (image omitted here).
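Putting the parsing and the dot output together, here is a sketch (the function name is my own) that walks the text line by line, tracks the current path by indent depth, and emits dot for the split nodes. Leaf/class lines would need an extra branch, which the sample above doesn't include:

```python
def tree_text_to_dot(lines):
    # Parse MALLET-style '| | "pred" <= x:' lines into a GraphViz dot graph.
    dot = ['digraph MyTree {']
    stack = []      # node name at each depth along the current path
    counter = 0
    for line in lines:
        parts = line.split('"')
        depth = parts[0].count('| ')
        predictor = parts[1]
        splitpoint = float(parts[2].rstrip(':').split()[-1])
        counter += 1
        node = f'Node_{counter}'
        dot.append(f'    {node} [label="{predictor}"]')
        if depth > 0:
            # the parent is the node one level up on the current path
            dot.append(f'    {stack[depth - 1]} -> {node} [label="<= {splitpoint}"]')
        del stack[depth:]           # leaving a subtree discards deeper path entries
        stack.append(node)
    dot.append('}')
    return '\n'.join(dot)

sample = ['"bio" <= 0.5:', '| "ml" <= 0.5:', '| | "algorithm" <= 0.5:']
print(tree_text_to_dot(sample))
```

Feeding the result to the dot command (dot -Tpng tree.dot -o tree.png) would then produce the diagram.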
