File seeking issue in Python - python

I am trying to write a J48 parseTree algorithm in Python.
However, I have run into a weird problem:
def parseTree(f1):
    line = f1.readline()
    while not line.startswith("attribute"):
        f2.write(line)
        save = f1.tell()
        line = f1.readline()
    print(f1.tell())
    print(f1.readline())
    f1.seek(1518)
    print(f1.readline())
the result is:
1518
attribute22 > 0
te14 = Y
I am confused about why the two f1.readline() calls do not return the same line.
This is part of the J48 tree:
=== Run information ===
Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: cls-weka.filters.unsupervised.attribute.Remove-R1,25-27,48-56
Instances: 60818
Attributes: 43
cert_category
attribute1
attribute2
attribute3
attribute4
attribute5
attribute6
attribute7
attribute8
attribute9
attribute10
attribute11
attribute12
attribute13
attribute14
attribute15
attribute16
attribute17
attribute18
attribute19
attribute20
attribute21
attribute22
attribute26
attribute27
attribute28
attribute29
attribute23_days
attribute24_days
attribute25_days
attribute30
attribute31
attribute32
attribute33
attribute34
attribute35
attribute36
attribute37
attribute38
attribute39
attribute40
attribute41
attribute42_num
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
------------------
attribute22 <= 0: 4 (406.0)
attribute22 > 0
| attribute23_days <= 1
| | attribute14 = Y
| | | attribute37 = Y: 0 (60.0/2.0)
| | | attribute37 = N: 5 (17.0/1.0)
| | | attribute37 = A: 0 (0.0)
| | attribute14 = N
| | | attribute23_days <= 0: 5 (45.0)
| | | attribute23_days > 0
| | | | attribute2 <= 26: 5 (20.0)
| | | | attribute2 > 26
| | | | | attribute3 = Y: 5 (13.0)
| | | | | at
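
The post does not show how f1 was opened or on which platform, so the cause cannot be confirmed from the question alone. One possibility (an assumption, not something stated above) is that the file was opened in text mode, where the value returned by tell() is not guaranteed to be a plain byte offset. A minimal sketch that opens the file in binary mode, so that tell() and seek() work on byte offsets; the helper names find_first_attribute_offset and read_line_at are made up for illustration:

```python
def find_first_attribute_offset(path):
    """Return (offset, line) for the first line starting with b'attribute'."""
    # Binary mode: tell() reports the byte offset of the next unread byte.
    with open(path, "rb") as f:
        while True:
            offset = f.tell()      # position before reading the line
            line = f.readline()
            if not line or line.startswith(b"attribute"):
                return offset, line


def read_line_at(path, offset):
    """Seek to a byte offset and return the line that starts there."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline()
```

With this layout, read_line_at(path, offset) returns the same bytes as the line reported by find_first_attribute_offset, because the offset is recorded before the line is read.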

Related

Create a forecast matrix from time series samples

I would like to create a matrix of delayed (lagged) values from a time series.
For example if
y = [y_0, y_1, y_2, ..., y_N] and W = 5
I would like to create the matrix
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | y_0 |
| 0 | 0 | 0 | y_0 | y_1 |
| ... | | | | |
| y_{N-4} | y_{N-3} | y_{N-2} | y_{N-1} | y_N |
I know that the function timeseries_dataset_from_array from tensorflow does approximately the same thing when configured correctly, but I would like to avoid using tensorflow.
This is my current function for this task:
import numpy as np
from numpy import ndarray


def get_warm_up_matrix(_data: ndarray, W: int) -> ndarray:
    """
    Return a warm-up matrix
    If _data = [y_1, y_2, ..., y_N]
    The output matrix will be
        +---------+-----+---------+---------+-----+
        | 0       | ... | 0       | 0       | 0   |
        | 0       | ... | 0       | 0       | y_1 |
        | 0       | ... | 0       | y_1     | y_2 |
        | ...     | ... | ...     | ...     | ... |
        | y_1     | ... | y_{W-2} | y_{W-1} | y_W |
        | ...     | ... | ...     | ...     | ... |
        | y_{N-W} | ... | y_{N-2} | y_{N-1} | y_N |
        +---------+-----+---------+---------+-----+
    :param _data: 1-D array of samples
    :param W: window width (number of columns)
    :return: array of shape (N, W)
    """
    N = len(_data)
    warm_up = np.zeros((N, W), dtype=_data.dtype)
    # Prepend W zeros so the first rows are left-padded
    raw_data_with_zeros = np.concatenate((np.zeros(W, dtype=_data.dtype), _data), dtype=_data.dtype)
    for k in range(W, N + W):
        warm_up[k - W, :] = raw_data_with_zeros[k - W:k]
    return warm_up
It works well, but it is quite slow, since both the concatenate operation and the for loop take time. It also takes a lot of memory, since the data has to be duplicated before the matrix is filled.
I would like a faster and more memory-friendly way to do the same thing. Thanks for your help :)
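
One possible direction, sketched under the assumption that NumPy 1.20 or later is available: np.lib.stride_tricks.sliding_window_view builds all windows as a view on the padded array, so no Python loop is needed and the (N, W) matrix is not copied. The name warm_up_view is made up for illustration. The sketch reproduces the output of the loop in the question as written, including the fact that the last row stops one sample before the end of the data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def warm_up_view(data: np.ndarray, W: int) -> np.ndarray:
    """Return an (N, W) read-only view of left-padded sliding windows."""
    N = len(data)
    # Only the padded 1-D array (N + W elements) is allocated.
    padded = np.concatenate((np.zeros(W, dtype=data.dtype), data))
    # sliding_window_view yields N + 1 windows; keep the first N rows
    # to match the loop-based function in the question.
    return sliding_window_view(padded, W)[:N]
```

The result is a read-only view; call .copy() on it if the matrix has to be modified afterwards, at the cost of the extra memory.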

Issues appending to a list in python

I have the following data file.
| --- |
| Adelaide |
| --- |
| 2021 |
| --- |
| Rnd | T | Opponent | Scoring | F | Scoring | A | R | M | W-D-L | Venue | Crowd | Date |
| R1 | H | Geelong | 4.4 11.7 13.9 15.13 | 103 | 2.3 5.5 10.8 13.13 | 91 | W | 12 | 1-0-0 | Adelaide Oval | 26985 | Sat 20-Mar-2021 4:05 PM |
| R2 | A | Sydney | 3.2 4.6 6.14 11.22 | 88 | 4.1 9.6 15.11 18.13 | 121 | L | -33 | 1-0-1 | S.C.G. | 23946 | Sat 27-Mar-2021 1:45 PM |
I wrote some code to turn that data into the result I want, which is a list. When I print my variable row at its current spot, it prints correctly.
However, when I append my list row to another list, my_array, I have issues: I get an empty list back.
I think the issue is where I am appending?
My code is this:
with open('adelaide.md', 'r') as f:
    my_array = []
    team = ''
    year = ''
    for line in f:
        row=[]
        line = line.strip()
        fields = line.split('|')
        num_fields = len(fields)
        if len(fields) == 3:
            val = fields[1].strip()
            if val.isnumeric():
                year = val
            elif val != '---':
                team = val
        elif num_fields == 15:
            row.append(team)
            row.append(year)
            for i in range(1, 14):
                row.append(fields[i].strip())
            print(row)
    my_array.append(row)
You need to append the row list inside the for loop.
I think the last line should be inside the for loop. As written, your code appends only the last 'row' list. Just indent it one more level.
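
A minimal sketch of that fix, wrapped in a function (the name parse_rows is made up here) so that it can be tried on a list of lines without the file. The append sits inside the branch that builds the row, so blank rows are not collected. Note that any line with exactly 15 fields is collected, which would include a header line of the same width.

```python
def parse_rows(lines):
    """Collect one list per 15-field line, prefixed with team and year."""
    rows = []
    team = ''
    year = ''
    for line in lines:
        fields = line.strip().split('|')
        if len(fields) == 3:
            val = fields[1].strip()
            if val.isnumeric():
                year = val
            elif val != '---':
                team = val
        elif len(fields) == 15:
            row = [team, year] + [f.strip() for f in fields[1:14]]
            rows.append(row)  # appended inside the loop, once per data line
    return rows
```

Reading from the file then becomes `with open('adelaide.md') as f: my_array = parse_rows(f)`.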

How to aggregate and restructure dataframe data in pyspark (column wise)

I am trying to aggregate data in a PySpark dataframe according to a particular criterion. I want to pair accounts based on their switchOUT and switchIN amounts, so that accounts with money switching out become from_acct and the other accounts become to_acct.
This is the data in the dataframe to begin with:
+--------+------+-----------+----------+----------+-----------+
| person | acct | close_amt | open_amt | switchIN | switchOUT |
+--------+------+-----------+----------+----------+-----------+
| A      | 1    | 125       | 50       | 75       | 0         |
+--------+------+-----------+----------+----------+-----------+
| A      | 2    | 100       | 75       | 25       | 0         |
+--------+------+-----------+----------+----------+-----------+
| A      | 3    | 200       | 300      | 0        | 100       |
+--------+------+-----------+----------+----------+-----------+
To this table
+--------+-----------+---------+----------+-----------+
| person | from_acct | to_acct | switchIN | switchOUT |
+--------+-----------+---------+----------+-----------+
| A      | 3         | 1       | 75       | 100       |
+--------+-----------+---------+----------+-----------+
| A      | 3         | 2       | 25       | 100       |
+--------+-----------+---------+----------+-----------+
Also, how can I make it work for any number of rows (not just 3 accounts)?
So far I have used this code:
import operator
from pyspark.sql import functions as F

# define udf
def sorter(l):
    res = sorted(l, key=operator.itemgetter(1))
    return [item[0] for item in res]

def list_to_string(l):
    res = 'from_fund_' +str(l[0]) + '_to_fund_'+str(l[1])
    return res

def listfirstAcc(l):
    res = str(l[0])
    return res

def listSecAcc(l):
    res = str(l[1])
    return res

sort_udf = F.udf(sorter)
list_str = F.udf(list_to_string)
extractFirstFund = F.udf(listfirstAcc)
extractSecondFund = F.udf(listSecAcc)

# Add additional columns
df= df.withColumn("move", sort_udf("list_col").alias("sorted_list"))
df= df.withColumn("move_string", list_str("move"))
df= df.withColumn("From_Acct",extractFirstFund("move"))
df= df.withColumn("To_Acct",extractSecondFund("move"))
Current outcome I am getting:
+--------+-----------+---------+----------+-----------+
| person | from_acct | to_acct | switchIN | switchOUT |
+--------+-----------+---------+----------+-----------+
| A      | 3         | 1,2     | 75       | 100       |
+--------+-----------+---------+----------+-----------+
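
The desired table pairs every account that has money switching out with every account of the same person that has money switching in. The sketch below shows only that pairing logic in plain Python, not PySpark code; in PySpark the same shape would presumably be a self-join on person between the rows with switchOUT > 0 and the rows with switchIN > 0. The function name pair_accounts is made up for illustration.

```python
def pair_accounts(rows):
    """Pair each outgoing account with each incoming account per person.

    rows: list of dicts with keys person, acct, switchIN, switchOUT.
    """
    outgoing = [r for r in rows if r["switchOUT"] > 0]
    incoming = [r for r in rows if r["switchIN"] > 0]
    return [
        {
            "person": o["person"],
            "from_acct": o["acct"],
            "to_acct": i["acct"],
            "switchIN": i["switchIN"],
            "switchOUT": o["switchOUT"],
        }
        for o in outgoing
        for i in incoming
        if o["person"] == i["person"] and o["acct"] != i["acct"]
    ]
```

Because the pairing is driven by the two filters rather than by list positions, it does not depend on there being exactly three accounts.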

Convert Decision tree from text 2 visual

I have a decision tree output in a 'text' format which is very hard to read and interpret. There are a ton of pipes and indentation levels to follow to trace the tree, nodes and leaves. I was wondering if there are tools out there where I can feed in a decision tree like the one below and get a tree diagram like the ones Weka, Python, etc. produce?
Since my decision tree is very large, below is a partial sample to give an idea of my text decision tree. Thanks a bunch!
"bio" <= 0.5:
| "ml" <= 0.5:
| | "algorithm" <= 0.5:
| | | "bioscience" <= 0.5:
| | | | "microbial" <= 0.5:
| | | | | "assembly" <= 0.5:
| | | | | | "nano-tech" <= 0.5:
| | | | | | | "smith" <= 0.5:
| | | | | | | | "neurons" <= 0.5:
| | | | | | | | | "process" <= 1.5:
| | | | | | | | | | "program" <= 1.5:
| | | | | | | | | | | "mammal" <= 1.0:
| | | | | | | | | | | | "lab" <= 0.5:
| | | | | | | | | | | | | "human-machine" <= 1.5:
| | | | | | | | | | | | | | "tech" <= 0.5:
| | | | | | | | | | | | | | | "smith" <= 0.5:
I'm not aware of any tool to interpret that format so I think you're going to have to write something, either to interpret the text format or to retrieve the tree structure using the DecisionTree class in MALLET's Java API.
Interpreting the text in Python shouldn't be too hard: for example, if
line = '| | | | | "assembly" <= 0.5:'
then you can get the indent level, the predictor name and the split point with
parts = line.split('"')
indent = parts[0].count('| ')
predictor = parts[1]
splitpoint = float(parts[2][parts[2].rfind(' ') + 1:-1])
To create graphical output, I would use GraphViz. There are Python APIs for it, but it's simple enough to build a file in its text-based dot format and create a graphic from it with the dot command. For example, the file for a simple tree might look like
digraph MyTree {
    Node_1 [label="Predictor1"]
    Node_1 -> Node_2 [label="< 0.335"]
    Node_1 -> Node_3 [label=">= 0.335"]
    Node_2 [label="Predictor2"]
    Node_2 -> Node_4 [label="< 1.42"]
    Node_2 -> Node_5 [label=">= 1.42"]
    Node_3 [label="Class1\n(p=0.897, n=26)", shape=box,style=filled,color=lightgray]
    Node_4 [label="Class2\n(p=0.993, n=17)", shape=box,style=filled,color=lightgray]
    Node_5 [label="Class3\n(p=0.762, n=33)", shape=box,style=filled,color=lightgray]
}
and the resulting output from dot
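
Putting the two ideas together, here is a rough sketch that turns the pipe-indented text into dot text. It assumes that every non-blank line is one node, that its depth is the number of leading pipes, and that its parent is the nearest preceding line with a smaller depth; the function name tree_text_to_dot is made up, and each whole condition is used as the node label rather than being split into predictor and threshold.

```python
def tree_text_to_dot(lines, graph_name="MyTree"):
    """Convert pipe-indented tree lines into GraphViz dot text."""
    out = ["digraph %s {" % graph_name]
    stack = []  # stack[d] is the most recent node name seen at depth d
    for node_id, raw in enumerate(l for l in lines if l.strip()):
        stripped = raw.lstrip("| ")
        prefix = raw[:len(raw) - len(stripped)]
        depth = prefix.count("|")
        # Drop the trailing colon and escape quotes for the dot label
        label = stripped.strip().rstrip(":").replace('"', '\\"')
        name = "Node_%d" % node_id
        out.append('  %s [label="%s"]' % (name, label))
        del stack[depth:]          # forget nodes at this depth or deeper
        if stack:
            out.append("  %s -> %s" % (stack[-1], name))
        stack.append(name)
    out.append("}")
    return "\n".join(out)
```

The returned string can be written to a file and rendered with the dot command, for example `dot -Tpng tree.dot -o tree.png`.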

wxPython print string with formatting

I have the following class, which prints the text and header it is called with. I have also included the code I use to call the Print function. I have a string 'outputstring' which contains the text I want to print. My expected output and my actual output are both shown below. The printout seems to be removing spaces that are needed for legibility. How can I print while keeping the spaces?
Class:
#Printer Class
from wx.html import HtmlEasyPrinting

class Printer(HtmlEasyPrinting):
    def __init__(self):
        HtmlEasyPrinting.__init__(self)

    def GetHtmlText(self,text):
        "Simple conversion of text. Use a more powerful version"
        html_text = text.replace('\n\n','<P>')
        # Chain on html_text so the first replacement is not discarded
        html_text = html_text.replace('\n', '<BR>')
        return html_text

    def Print(self, text, doc_name):
        self.SetHeader(doc_name)
        self.PrintText(self.GetHtmlText(text),doc_name)

    def PreviewText(self, text, doc_name):
        self.SetHeader(doc_name)
        HtmlEasyPrinting.PreviewText(self, self.GetHtmlText(text))
Expected Print:
+-------------------+---------------------------------+------+-----------------+-----------+
| Domain:           | Mail Server:                    | TLS: | # of Employees: | Verified: |
+-------------------+---------------------------------+------+-----------------+-----------+
| bankofamerica.com | ltwemail.bankofamerica.com      | Y    | 239000          | Y         |
|                   | rdnemail.bankofamerica.com      | Y    |                 | Y         |
|                   | kcmemail.bankofamerica.com      | Y    |                 | Y         |
|                   | rchemail.bankofamerica.com      | Y    |                 | Y         |
| citigroup.com     | mx-b.mail.citi.com              | Y    | 248000          | N         |
|                   | mx-a.mail.citi.com              | Y    |                 | N         |
| bnymellon.com     | cluster9bny.us.messagelabs.com  | ?    | 51400           | N         |
|                   | cluster9bnya.us.messagelabs.com | Y    |                 | N         |
| usbank.com        | mail1.usbank.com                | Y    | 65565           | Y         |
|                   | mail2.usbank.com                | Y    |                 | Y         |
|                   | mail3.usbank.com                | Y    |                 | Y         |
|                   | mail4.usbank.com                | Y    |                 | Y         |
| us.hsbc.com       | vhiron1.us.hsbc.com             | Y    | 255200          | Y         |
|                   | vhiron2.us.hsbc.com             | Y    |                 | Y         |
|                   | njiron1.us.hsbc.com             | Y    |                 | Y         |
|                   | njiron2.us.hsbc.com             | Y    |                 | Y         |
|                   | nyiron1.us.hsbc.com             | Y    |                 | Y         |
|                   | nyiron2.us.hsbc.com             | Y    |                 | Y         |
| pnc.com           | cluster5a.us.messagelabs.com    | Y    | 49921           | N         |
|                   | cluster5.us.messagelabs.com     | ?    |                 | N         |
| tdbank.com        | cluster5.us.messagelabs.com     | ?    | 0               | N         |
|                   | cluster5a.us.messagelabs.com    | Y    |                 | N         |
+-------------------+---------------------------------+------+-----------------+-----------+
Actual Print:
The same thing as expected but the spaces are removed making it very hard to read.
Function call:
def printFile():
    outputstring = txt_tableout.get(1.0, 'end')
    print(outputstring)
    app = wx.PySimpleApp()
    p = Printer()
    p.Print(outputstring, "Data Results")
For anyone else struggling, this is the modified class function I used to generate a nice table with all rows and columns.
def GetHtmlText(self,text):
    html_text = '<h3>Data Results:</h3><p><table border="2">'
    html_text += "<tr><td>Domain:</td><td>Mail Server:</td><td>TLS:</td><td># of Employees:</td><td>Verified</td></tr>"
    for row in root.ptglobal.to_csv():
        html_text += "<tr>"
        for x in range(len(row)):
            html_text += "<td>"+str(row[x])+"</td>"
        html_text += "</tr>"
    return html_text + "</table></p>"
Maybe try
`html_text = text.replace(' ','&nbsp;').replace('\n','<br/>')`
That would replace your spaces with HTML non-breaking space entities, but it would still not look right, since the output is not in a monospace font. This will be hard to automate. You probably want to put the data in a table structure, but that would require some work.
You probably want to invest a little more time in your HTML conversion, perhaps something like this (making assumptions based on what you have shown):
def GetHtmlText(self,text):
    "Simple conversion of text. Use a more powerful version"
    text_lines = text.splitlines()
    html_text = "<table>"
    html_text += "<tr><th>" + "</th><th>".join(text_lines[0].split(":")) + "</th></tr>"
    for line in text_lines[1:]:
        html_text += "<tr><td>"+"</td><td>".join(line.split()) +"</td></tr>"
    return html_text + "</table>"
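
As a variation on the same idea, here is a sketch that assumes the text looks like the expected printout in the question: border lines starting with '+' and data lines delimited by '|'. Splitting on the pipes instead of on whitespace keeps empty cells in their columns. The function name ascii_table_to_html is made up, and the result is plain HTML text that could be returned from GetHtmlText.

```python
from html import escape


def ascii_table_to_html(text):
    """Convert a pipe-delimited ASCII table into HTML table markup."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +---+ border lines and blank lines
        inner = line[1:-1] if line.endswith("|") else line[1:]
        rows.append([cell.strip() for cell in inner.split("|")])
    html = ['<table border="1">']
    for index, cells in enumerate(rows):
        tag = "th" if index == 0 else "td"  # first data line is the header
        html.append("<tr>" + "".join(
            "<%s>%s</%s>" % (tag, escape(c) or "&nbsp;", tag) for c in cells
        ) + "</tr>")
    html.append("</table>")
    return "\n".join(html)
```

Empty cells are emitted as `&nbsp;` so that each row keeps the same number of visible columns.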
