XML format in Python - python

I needed to take an XML file and replace certain values with other values.
This was easy enough: I parsed the XML as text and replaced the old values with the new ones.
The issue is that the new text file is in the wrong format.
It is all encased in square brackets and has "\n" characters instead of line breaks.
I did try the xml.dom.minidom library, but it is not working.
I could parse the resulting file as well and remove the "\n" sequences and square brackets, but I don't want to do that, as I am not sure these are the only things that have been added.
Source code:
import json
import shutil
import itertools
import datetime
import time
import calendar
import sys
import string
import random
import uuid
import xml.dom.minidom
inputfile = open('data.txt')
outputfile = open('output.xml','w')
sess_id = "d87c2b8e063e5e5c789d277c34ea"
new_sess_id = ""
number_of_sessions = 4
my_text = str(inputfile.readlines())
my_text2 = ''
#print (my_text)
#Just replicate the session logs x times ...
#print ("UUID random : " + str(uuid.uuid4()).replace("-","")[0:28])
for i in range (0,number_of_sessions):
    new_sess_id = str(uuid.uuid4()).replace("-","")[0:28]
    my_text2 = my_text + my_text2
    my_text2 = my_text2.replace(sess_id,new_sess_id)
#xml = xml.dom.minidom.parseString(my_text2)
outputfile.write(my_text2)
print (my_text)
inputfile.close()
outputfile.close()
The original text is XML format but the output is like
time is it</span></div><div
class=\\"di_transcriptAvatarAnswerEntry\\"><span
class=\\"di_transcriptAvatarTitle\\">[AVATAR] </span> <span
class=\\"di_transcriptAvatarAnswerText\\">My watch says it\'s 6:07
PM.<br/>Was my answer helpful? No Yes</span></div>\\r\\n"
</variable>\n', '</element>\n', '</path>\n', '</transaction>\n',
'</session>\n', '<session>']

You are currently using readlines(). This reads each line of your file and returns a Python list, one line per entry (complete with \n on the end of each entry). You then use str() to convert the list into its string representation, for example:
text = str(['line1\n', 'line2\n', 'line3\n'])
text is now a string that looks like your list, complete with all the [ and quote characters. Rather than using readlines(), you probably just need read(), which returns the whole file contents as a single string for you to work with.
Try the following approach, which also uses the preferred with context manager for dealing with files (it closes them automatically for you).
import uuid
sess_id = "d87c2b8e063e5e5c789d277c34ea"
new_sess_id = ""
number_of_sessions = 4
with open('data.txt') as inputfile, open('output.xml','w') as outputfile:
    my_text = inputfile.read()
    my_text2 = ''
    #Just replicate the session logs x times ...
    for i in range (0,number_of_sessions):
        new_sess_id = str(uuid.uuid4()).replace("-","")[0:28]
        my_text2 = my_text + my_text2
        my_text2 = my_text2.replace(sess_id,new_sess_id)
    outputfile.write(my_text2)
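To see where the square brackets and \n sequences come from, a small side-by-side comparison may help. This is only a sketch: it uses an in-memory io.StringIO as a stand-in for data.txt, and the sample content is made up for illustration.

```python
import io

# Made-up stand-in for the contents of data.txt.
sample = "<session>\n<id>abc</id>\n</session>\n"

# readlines() returns a list of lines; str() of that list is its repr,
# so the brackets, quotes and escaped newlines become part of the text.
as_list_repr = str(io.StringIO(sample).readlines())

# read() returns the file contents unchanged as a single string.
as_text = io.StringIO(sample).read()

print(as_list_repr)
print(as_text)
```

The first print shows the list-style text described in the question; the second shows the XML exactly as it was stored.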

Related

multiple modification to a list at once

I have a text file of some IPs and MACs. The MACs are in the format xxxx.xxxx.xxxx, and I need to change them all to xx:xx:xx:xx:xx:xx.
I am already reading the file and putting it into a list. Now I am looping through each line of the list and need to make multiple modifications: I need to remove the IPs and then change the MAC format.
The problem I am running into is that I can't figure out how to do this in one pass unless I copy the list to a new list for every modification.
How can I loop through the list once and update each element with all my modifications?
count = 0
output3 = []
for line in output:
    #print(line)
    #removes any extra spaces between words in a string.
    output[count] = (str(" ".join(line.split())))
    #create a new list with just the MAC addresses
    output3.append(str(output[count].split(" ")[3]))
    #create a new list with MAC's using a ":"
    count += 1
print(output3)
It appears you are overthinking the problem, which may be where the frustration comes from.
First, you should always consider whether you need a count variable in Python. Usually you do not, and the enumerate() function is your friend here.
Second, there is no need to process the data multiple times. You can use variables to your advantage and lean on Python's expressiveness, rather than working around the language.
Here is an implementation example that may help you think through your approach. Good luck with your harder problems, and I hope Python will help you out with them!
#! /usr/bin/env python3
import re
from typing import Iterable
# non-regex reformat mac to be xx:xx:xx:xx:xx:xx
# NOTE: this assumes a source with '.' separators only
# reformat_mac = lambda _: ':'.join(_ for _ in _.split('.') for _ in (_[:2], _[2:]))
# regex reformat mac to be xx:xx:xx:xx:xx:xx
# NOTE: Only requires at least two hex digits adjacent at a time
reformat_mac = lambda _: ":".join(re.findall(r"(?i)[\da-f]{2}", _))
def generate_output3(output: Iterable[str]) -> Iterable[str]:
    for line in output:
        col1, col2, col3, mac, *cols = line.split()
        mac = reformat_mac(mac)
        yield " ".join((col1, col2, col3, mac, *cols))
if __name__ == "__main__":
    output = [
        "abc def ghi 1122.3344.5566",
        "jklmn op qrst 11a2.33c4.55f6 uv wx yz",
        "zyxwu 123 next 11a2.33c4.55f6 uv wx yz",
    ]
    for line in generate_output3(output):
        print(line)
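The point about enumerate() replacing a manual counter can be sketched as follows. The sample lines and the column position of the MAC are assumptions for illustration only (the question's own data puts the MAC in a different column).

```python
# Made-up sample lines; the MAC is assumed to be the second column here.
lines = [
    "10.0.0.1   aabb.ccdd.eeff   Vlan1",
    "10.0.0.2  1122.3344.5566  Vlan2",
]

# enumerate() yields (index, item) pairs, so no separate count variable
# is needed when an element has to be updated in place.
for index, line in enumerate(lines):
    lines[index] = " ".join(line.split())  # collapse runs of whitespace

# When the index is not needed at all, a comprehension builds the new list.
macs = [line.split()[1] for line in lines]

print(lines)
print(macs)
```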
Solution
You can use the regex (regular expression) module to extract any pattern that matches that of the
mac-ids: "xxxx:xxxx:xxxx" and then process it to produce the expected output ("xx-xx-xx-xx-xx-xx")
as shown below.
Note: I have used a dummy data file (see section: Dummy Data below) to make this answer
reproducible. It should work with your data as well.
# import re
filepath = "input.txt"
content = read_file(filepath)
mac_ids = extract_mac_ids(content, format=True) # format=False --> "xxxx:xxxx:xxxx"
print(mac_ids)
## OUTPUT:
#
# ['a0-b1-ff-33-ac-d5',
# '11-b9-33-df-55-f6',
# 'a4-d1-e7-33-ff-55',
# '66-a1-b2-f3-b9-c5']
Code: Convenience Functions
def read_file(filepath: str):
    """Reads and returns the content of a file."""
    with open(filepath, "r") as f:
        content = f.read() # read in one attempt
    return content
def format_mac_id(mac_id: str):
    """Returns a formatted mac_id.
    INPUT FORMAT: "xxxxxxxxxxxx"
    OUTPUT FORMAT: "xx-xx-xx-xx-xx-xx"
    """
    mac_id = list(mac_id)
    mac_id = ''.join([ f"-{v}" if (i % 2 == 0) else v for i, v in enumerate(mac_id)])[1:]
    return mac_id
def extract_mac_ids(content: str, format: bool=True):
    """Extracts and returns a list of mac_ids, formatted if format is True.
    INPUT FORMAT: "xxxx:xxxx:xxxx"
    OUTPUT FORMAT: "xx-xx-xx-xx-xx-xx"
    """
    import re
    # pattern = "(" + ':'.join([r"\w{4}"]*3) + "|" + ':'.join([r"\w{2}"]*6) + ")"
    # pattern = r"(\w{4}:\w{4}:\w{4}|\w{2}:\w{2}:\w{2}:\w{2}:\w{2}:\w{2})"
    pattern = r"(\w{4}:\w{4}:\w{4})"
    pat = re.compile(pattern)
    mac_ids = pat.findall(content) # returns a list of all mac-ids
    # Replaces the ":" with "" and then formats
    # each mac-id as: "xx-xx-xx-xx-xx-xx"
    if format:
        mac_ids = [format_mac_id(mac_id.replace(":", "")) for mac_id in mac_ids]
    return mac_ids
Dummy Data
The following code block creates a dummy file with some sample mac-ids.
filepath = "input.txt"
s = """
a0b1:ff33:acd5 ghwvauguvwi ybvakvi
klasilvavh; 11b9:33df:55f6
haliviv
a4d1:e733:ff55
66a1:b2f3:b9c5
"""
# Create dummy data file
with open(filepath, "w") as f:
    f.write(s)
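Note that the question asks for xxxx.xxxx.xxxx to become xx:xx:xx:xx:xx:xx, whereas the answer above matches colon-separated input and produces dash-separated output. A possible adaptation to the question's format, rewriting each line in a single pass with re.sub, might look like this. The names MAC_DOTTED, dotted_to_colon and convert_line are hypothetical, and the sample line is made up.

```python
import re

# Three dot-separated groups of four hex digits, e.g. aabb.ccdd.eeff.
MAC_DOTTED = re.compile(
    r"\b[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}\b", re.IGNORECASE
)

def dotted_to_colon(mac):
    """Turn 'aabb.ccdd.eeff' into 'aa:bb:cc:dd:ee:ff'."""
    digits = mac.replace(".", "")
    return ":".join(digits[i:i + 2] for i in range(0, len(digits), 2))

def convert_line(line):
    """Rewrite every dotted MAC found in the line, leaving the rest untouched."""
    return MAC_DOTTED.sub(lambda m: dotted_to_colon(m.group(0)), line)

converted = convert_line("10.0.0.1 aabb.ccdd.eeff Vlan1")
print(converted)
```

Because the pattern requires groups of exactly four hex digits, a dotted IP address such as 10.0.0.1 is left alone.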

Turning text into a dictionary

I've successfully extracted my sitemap, and I would like to turn the URLs into a list. I can't quite figure out how to do that, separating the https URLs from the modification dates. Ideally I would also like to turn it into a dictionary, with the associated date stamp. In the end, I plan to iterate over the list and create text files of the web pages, and save the date-time stamp at the top of each text file.
I will settle for the next step of turning this into a list. This is my code:
import urllib.request
import inscriptis
from inscriptis import get_text
sitemap = "https://grapaes.com/sitemap.xml"
i=0
url = sitemap
html=urllib.request.urlopen(url).read().decode('utf-8')
text=get_text(html)
dicto = {text}
print(dicto)
for i in dicto:
    if i.startswith ("https"):
        print (i + '/n')
The output is basically a row with the date stamp, space, and the url.
You can split the text on whitespace first, then proceed like this:
text = text.split() # no argument: splits on any run of spaces or newlines
dicto = {}
for i in range(0, len(text), 2):
    dicto[text[i+1]] = text[i]
This gives a dictionary with the timestamp as key and the URL as value, as follows:
{
'2020-01-12T09:19+00:00': 'https://grapaes.com/',
'2020-01-12T12:13+00:00': 'https://grapaes.com/about-us-our-story/',
...,
'2019-12-05T12:59+00:00': 'https://grapaes.com/211-retilplast/',
'2019-12-01T08:29+00:00': 'https://grapaes.com/fruit-logistica-berlin/'
}
I believe you can do further processing from here onward.
In addition to the answer above: you could also use an XML parser (a standard-library module) to achieve what you are trying to do:
# Save your XML to disk. Write the raw download (html in the question's code);
# the text returned by get_text() has had its tags stripped and will not parse.
with open('sitemap.xml', 'w') as f:
    f.write(html)
# Import XML-Parser
import xml.etree.ElementTree as ET
# Load xml and obtain the root node
tree = ET.parse('sitemap.xml')
root_node = tree.getroot()
From here you can access your XML's nodes just like any other list-like object:
print(root_node[1][0].text) # output: 'https://grapaes.com/about-us-our-story/'
print(root_node[1][1].text) # output: '2020-01-12T12:13+00:00'
Creating a dict from this is straightforward:
dicto = dict()
for child in root_node:
    # store the text of each node (URL -> timestamp), not the Element objects
    dicto[child[0].text] = child[1].text
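Index-based access relies on loc and lastmod always appearing in the same order. A slightly more defensive sketch looks the elements up by name instead. It assumes the sitemap uses the standard sitemaps.org namespace and parses a small inline sample (built from the URLs shown above) rather than the real download; SITEMAP_NS and url_to_lastmod are names chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap declares the standard sitemaps.org namespace.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Small inline sample standing in for the downloaded sitemap.
sample_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://grapaes.com/</loc>
    <lastmod>2020-01-12T09:19+00:00</lastmod>
  </url>
  <url>
    <loc>https://grapaes.com/about-us-our-story/</loc>
    <lastmod>2020-01-12T12:13+00:00</lastmod>
  </url>
</urlset>"""

root = ET.fromstring(sample_xml)

# Map each URL to its last-modified timestamp, looking children up by tag name.
url_to_lastmod = {}
for url in root.findall("sm:url", SITEMAP_NS):
    loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
    lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
    url_to_lastmod[loc] = lastmod

print(url_to_lastmod)
```

Keying by URL avoids losing entries when two pages share the same timestamp.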

Adding double quotes to string is giving me incorrect data in Python

I am trying to add double quotes to each line in the file. In the output file (I want a .tsv output file) I am getting four double quotes around the string, although I get the proper result when I write to .csv format. The code is as follows:
import re
import pandas as pd
df = pd.read_csv('C:/Users/name/Documents/TA/sample.tsv',delimiter='\t',encoding='utf-8')
re_vin = re.compile(r'^.*\s')
vin_quotes = []
with open('C:/Users/name/Documents/TA/sample.tsv') as f:
    for line in f:
        line = line.rstrip('\n')
        line_quotes = '"{}"'.format(line)
        vin_quotes.append(line_quotes)
vin_df = pd.DataFrame(data = vin_quotes[1:])
vin_df.to_csv('C:/Users/name/Documents/TA/processed.tsv', sep='\t', encoding='utf-8',index= False)
Sample data is as follows
**cvdt35 Output from code**
1GADP5B """1GADP5B"""
1GADP5G """1GADP5G"""
1GAHP2G """1GAHP2G"""
1GM5K8D """1GM5K8D"""
1GM5K8H """1GM5K8H"""
1GMCU0G """1GMCU0G"""
1GMCU9G """1GMCU9G"""
1GMJK1J """1GMJK1J"""
1GTEW1E """1GTEW1E"""
2GMPK4A """2GMPK4A"""
3GA6P0H """3GA6P0H"""
3GA6P0L """3GA6P0L"""
3GA6P0L """3GA6P0L"""
3GAHP0H """3GAHP0H"""
expected output
"1GADP5B","1GADP5G","1GAHP2G","1GM5K8D","1GM5K8H","1GMCU0G","1GMCU9G","1GMJK1J","1GTEW1E","2GMPK4A","3GA6P0H","3GA6P0L","3GA6P0L","3GAHP0H"
Thank you in advance
You can use the following which simply matches any character that isn't " or a whitespace character \s one or more times, then joins the result together.
import re
s = '1GADP5B """1GADP5B"""\n1GADP5G """1GADP5G"""\n1GAHP2G """1GAHP2G"""\n1GM5K8D """1GM5K8D"""\n1GM5K8H """1GM5K8H"""\n1GMCU0G """1GMCU0G"""\n1GMCU9G """1GMCU9G"""\n1GMJK1J """1GMJK1J"""\n1GTEW1E """1GTEW1E"""\n2GMPK4A """2GMPK4A"""\n3GA6P0H """3GA6P0H"""\n3GA6P0L """3GA6P0L"""\n3GA6P0L """3GA6P0L"""\n3GAHP0H """3GAHP0H"""'
r = re.findall(r'[^\s"]+', s)
r = ",".join(['"{0}"'.format(x) for x in r])
print(r)
Outputs the following:
"1GADP5B","1GADP5B","1GADP5G","1GADP5G","1GAHP2G","1GAHP2G","1GM5K8D","1GM5K8D","1GM5K8H","1GM5K8H","1GMCU0G","1GMCU0G","1GMCU9G","1GMCU9G","1GMJK1J","1GMJK1J","1GTEW1E","1GTEW1E","2GMPK4A","2GMPK4A","3GA6P0H","3GA6P0H","3GA6P0L","3GA6P0L","3GA6P0L","3GA6P0L","3GAHP0H","3GAHP0H"
To extract "word" from """ word """:
import re
data = []
# extract all words between quotes
with open('C:/Users/name/Documents/TA/sample.tsv') as f:
    text = f.read()
    data = re.findall(r'"\w+"', text)
print(data) # ['"1GADP5B"', '"1GADP5G"', '"1GAHP2G"',...'"3GA6P0L"', '"3GAHP0H"']
with open('C:/Users/name/Documents/TA/processed.tsv', 'w', encoding='utf-8') as w_f:
    w_f.write('\t'.join(data)) # or ','.join(data)
Since you want to write the result to processed.tsv and you have a list of words, it is up to you to choose the separator for join.
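As for why the extra quotes show up in the first place: a CSV/TSV writer treats a value that already contains a double quote as needing escaping, so it wraps the field in quotes and doubles the embedded ones. The sketch below shows this with the standard-library csv module, on a made-up subset of the sample values; pandas' to_csv accepts the same csv.QUOTE_* constants through its quoting parameter.

```python
import csv
import io

vins = ["1GADP5B", "1GADP5G", "1GAHP2G"]  # subset of the sample data

# Quoting the values by hand first: the writer escapes the embedded quotes
# by doubling them and then wraps each field in quotes again.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(['"{}"'.format(v) for v in vins])
pre_quoted = buf.getvalue()

# Leaving the values unquoted and asking the writer to quote every field.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=",", quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(vins)
quoted_once = buf.getvalue()

print(pre_quoted)
print(quoted_once)
```

So one option is to drop the manual '"{}"'.format(...) step and let the writer add the quotes.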

Python - to output contents in a HTML file to spreadsheet

Part of the code below is sourced from another example. It has been modified a bit and is used to read an HTML file and output the contents into a spreadsheet.
As it's just a local file, using Selenium may be overkill, but I just want to learn through this example.
from selenium import webdriver
import lxml.html as LH
import lxml.html.clean as clean
import xlwt
book = xlwt.Workbook(encoding='utf-8', style_compression = 0)
sheet = book.add_sheet('SeaWeb', cell_overwrite_ok = True)
driver = webdriver.PhantomJS()
ignore_tags=('script','noscript','style')
results = []
driver.get("source_file.html")
content = driver.page_source
cleaner = clean.Cleaner()
content = cleaner.clean_html(content)
doc = LH.fromstring(content)
for elt in doc.iterdescendants():
    if elt.tag in ignore_tags: continue
    text = elt.text or '' #question 1
    tail = elt.tail or '' #question 1
    words = ''.join((text,tail)).strip()
    if words: # extra question
        words = words.encode('utf-8') #question 2
        results.append(words) #question 3
        results.append('; ') #question 3
sheet.write (0, 0, results)
book.save("C:\\ source_output.xls")
1. The lines text=elt.text or '' and tail=elt.tail or '': why do both .text and .tail hold text? And why is the or '' part important here?
2. The text in the HTML file contains special characters like ° (temperature degrees). With .encode('utf-8') the output is not right, neither in IDLE nor in the Excel spreadsheet. What's the alternative?
3. Is it possible to join the output into a string instead of a list? At the moment, to append to the list, I have to call .append twice to add the text and the ; separator.
elt is an HTML node. It contains certain attributes and a text section. lxml provides a way to extract all the attributes and text, using .text or .tail depending on where the text is.
<a attribute1='abc'>
some text ----> .text gets this
<p attributeP='def'> </p>
some tail ---> .tail gets this
</a>
The idea behind the or '' is that if no text/tail is found in the current HTML node, the attribute is None, and concatenating or appending None later will raise an error. So, to avoid that, an empty string '' is used whenever the text/tail is None.
The degree character is a one-character Unicode string, but when you call .encode('utf-8') it becomes a 2-byte UTF-8 byte string, \xc2\xb0, which shows up as Â° when those bytes are displayed with a single-byte encoding such as Latin-1. So you do not need to encode the ° character yourself; Python handles it correctly. If it does not, declare the source encoding at the top of your Python script, as described in PEP 263:
# -*- coding: UTF-8 -*-
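The byte-level behaviour described above can be checked with a few lines. This sketch assumes Python 3, where str is already Unicode; the variable names are chosen here for illustration.

```python
degree = "\u00b0"  # the degree sign as a one-character Unicode string

# Encoding to UTF-8 produces two bytes.
encoded = degree.encode("utf-8")

# Reading those two bytes back with the wrong (single-byte) encoding gives
# the stray 'Â' in front of the degree sign.
misread = encoded.decode("latin-1")

# Decoding with the matching encoding restores the original character.
roundtrip = encoded.decode("utf-8")

print(encoded, misread, roundtrip)
```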
Yes, you can also join the output into a string. Just use +, as there is no append for string types, e.g.
results = ''
results = results + 'whatever you want to join'
You can keep the list and combine your 2 lines:
results.append(words + '; ')
Note: I just checked the xlwt documentation, and sheet.write() appears to accept only strings, so you may not be able to pass results, a list, directly.
A simple example for Q1
from lxml import etree
test = etree.XML("<main>placeholder</main>")
print(test.text) #prints placeholder
print(test.tail) #prints None
print(test.tail or '') #prints empty string
test.text = "texter"
print(etree.tostring(test).decode()) #prints <main>texter</main>
test.tail = "tailer"
print(etree.tostring(test).decode()) #prints <main>texter</main>tailer
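The .text/.tail split and the '; ' joining from question 3 can also be tried without lxml. The sketch below uses the standard-library xml.etree.ElementTree as a stand-in, on the assumption that it follows the same .text/.tail model as lxml; the tiny document is made up to mirror the diagram above.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a>some text<p>inner</p>some tail</a>")
p = root[0]

# .text is the text directly after an element's opening tag;
# .tail is the text that follows its closing tag.
print(root.text)  # text inside <a>, before <p>
print(p.tail)     # text after </p>, still inside <a>
print(root.tail)  # nothing follows </a>, so this is None

# Collect the non-empty pieces in a list and join once at the end,
# instead of appending the '; ' separator as a separate list item.
pieces = []
for elt in root.iter():
    words = "".join((elt.text or "", elt.tail or "")).strip()
    if words:
        pieces.append(words)
joined = "; ".join(pieces)
print(joined)
```

Joining once at the end yields a single string, which is easier to hand to a cell-writing call than a list.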

How to use Python to find all isbn in a text file?

I have a text file text_isbn with loads of ISBN in it. I want to write a script to parse it and write it to a new text file with each ISBN number in a new line.
Thus far I could write the regular expression for finding the ISBN, but could not process any further:
import re
list = open("text_isbn", "r")
regex = re.compile('(?:[0-9]{3}-)?[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,6}-[0-9]')
I tried to use the following but got an error (I guess the list is not in proper format...)
parsed = regex.findall(list)
How to do the parsing and write it to a new file (output.txt)?
Here is a sample of the text in text_isbn
Praxisguide Wissensmanagement - 978-3-540-46225-5
Programmiersprachen - 978-3-8274-2851-6
Effizient im Studium - 978-3-8348-8108-3
How about
import re
isbn = re.compile("(?:[0-9]{3}-)?[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,6}-[0-9]")
matches = []
with open("text_isbn") as isbn_lines:
    for line in isbn_lines:
        matches.extend(isbn.findall(line))
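The snippet above collects the matches but stops short of writing them out. One way to finish the job, one ISBN per line in output.txt, is sketched below. The helper names find_isbns and write_isbns are hypothetical, and the sample lines are taken from the question.

```python
import re

# Same pattern as in the question.
ISBN = re.compile(r"(?:[0-9]{3}-)?[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,6}-[0-9]")

def find_isbns(lines):
    """Return every ISBN found in an iterable of text lines."""
    matches = []
    for line in lines:
        matches.extend(ISBN.findall(line))
    return matches

def write_isbns(source_path, target_path):
    """Read source_path and write each ISBN found on its own line."""
    with open(source_path) as source, open(target_path, "w") as target:
        for isbn in find_isbns(source):
            target.write(isbn + "\n")

sample = [
    "Praxisguide Wissensmanagement - 978-3-540-46225-5\n",
    "Programmiersprachen - 978-3-8274-2851-6\n",
    "Effizient im Studium - 978-3-8348-8108-3\n",
]
found = find_isbns(sample)
print(found)
```

With the real files this would be called as write_isbns("text_isbn", "output.txt").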
Try this regex (from the Regular Expressions Cookbook):
import re
regex = re.compile(r"(?:ISBN(?:-1[03])?:? )?(?=[-0-9 ]{17}$|[-0-9X ]{13}$|[0-9X]{10}$)(?:97[89][- ]?)?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]$")
with open("text_isbn") as data, open("output.txt", "w") as outfile:
    for line in data:
        match = regex.search(line)
        if match: # skip lines that contain no ISBN
            outfile.write('%s\n' % match.group())
Tested with your sample data. This assumes that each line contains only one ISBN, at the end of the line.
