The current problem I'm facing is converting a text file into an XML file.
The text file is in this format:
Serial Number: Operator ID: test Time: 00:03:47 Test Step 2 TP1: 17.25 TP2: 2.46
Serial Number: Operator ID: test Time: 00:03:47 Test Step 2 TP1: 17.25 TP2: 2.46
I want to convert it into XML with this format:
<?xml version="1.0" encoding="utf-8"?>
<root>
<filedata>
<serialnumber></serialnumber>
<operatorid>test</operatorid>
<time>00:00:42 Test Step 2</time>
<tp1>17.25</tp1>
<tp2>2.46</tp2>
</filedata>
...
</root>
I was using code like this to convert my previous text file to XML, but right now I'm facing problems splitting the lines.
import xml.etree.ElementTree as ET
import fileinput
import os
import itertools as it
root = ET.Element('root')
with open('text.txt') as f:
    lines = f.read().splitlines()

celldata = ET.SubElement(root, 'filedata')
for line in it.groupby(lines):
    line = line[0]
    if not line:
        celldata = ET.SubElement(root, 'filedata')
    else:
        tag = line.split(":")
        el = ET.SubElement(celldata, tag[0].replace(" ", ""))
        tag = ' '.join(tag[1:]).strip()
        if 'File Name' in line:
            tag = line.split("\\")[-1].strip()
        elif 'File Size' in line:
            splist = filter(None, line.split(" "))
            tag = splist[splist.index('Low:')+1]
            #splist[splist.index('High:')+1]
        el.text = tag

import xml.dom.minidom as minidom
formatedXML = minidom.parseString(
    ET.tostring(root)).toprettyxml(indent=" ", encoding='utf-8').strip()

with open("test.xml", "wb") as f:
    f.write(formatedXML)
I saw a similar question on Stack Overflow, "Python text file to xml", but the problem is I couldn't change the file into .csv format, as it is generated by a certain machine.
If anyone knows how to solve this, please do help.
Thank you.
Here is a better method of splitting the lines.
Note that the text variable stands in for your loaded .txt file, and that I purposely modified it so we have greater context for the output.
from collections import OrderedDict
from pprint import pprint
# Text would be our loaded .txt file.
text = """Serial Number: test Operator ID: test1 Time: 00:03:47 Test Step 1 TP1: 17.25 TP2: 2.46
Serial Number: Operator ID: test2 Time: 00:03:48 Test Step 2 TP1: 17.24 TP2: 2.47"""
# Headers of the intended break-points in the text files.
headers = ["Serial Number:", "Operator ID:", "Time:", "TP1:", "TP2:"]
information = []
# Split our text by lines.
for line in text.split("\n"):
    # Split our text up so we only have the information per header.
    default_header = headers[0]
    for header in headers[1:]:
        line = line.replace(header, default_header)
    info = [i.strip() for i in line.split(default_header)][1:]
    # Compile our header+information together into OrderedDict's.
    compiled_information = OrderedDict()
    for header, value in zip(headers, info):
        compiled_information[header] = value
    # Append to our overall information list.
    information.append(compiled_information)
# Pretty print the information (not needed, only for better display of data.)
pprint(information)
Outputs:
[OrderedDict([('Serial Number:', 'test'),
('Operator ID:', 'test1'),
('Time:', '00:03:47 Test Step 1'),
('TP1:', '17.25'),
('TP2:', '2.46')]),
OrderedDict([('Serial Number:', ''),
('Operator ID:', 'test2'),
('Time:', '00:03:48 Test Step 2'),
('TP1:', '17.24'),
('TP2:', '2.47')])]
This method should generalize better than what you are currently writing; the idea of the code is something I had saved from another project. I recommend going through the code and understanding its logic.
From here you should be able to loop through the information list and create your custom .xml file, for example as sketched below. I would also recommend checking out dicttoxml, as it might make your life much easier on the final step.
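To illustrate that final step, here is a minimal sketch that builds the XML from the information list with xml.etree.ElementTree and pretty-prints it the same way your original script does. It assumes you want tag names derived from the headers by stripping the colon and spaces and lower-casing them (so "Serial Number:" becomes serialnumber); adjust as needed.
import xml.etree.ElementTree as ET
import xml.dom.minidom as minidom

root = ET.Element('root')
for entry in information:  # 'information' is the list of OrderedDicts built above
    filedata = ET.SubElement(root, 'filedata')
    for header, value in entry.items():
        # "Serial Number:" -> "serialnumber", "TP1:" -> "tp1", etc. (assumed naming)
        tag = header.rstrip(':').replace(' ', '').lower()
        ET.SubElement(filedata, tag).text = value

pretty = minidom.parseString(ET.tostring(root)).toprettyxml(indent="  ", encoding='utf-8')
with open('test.xml', 'wb') as f:
    f.write(pretty)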
Regarding your code, remember: breaking down fundamental tasks is easier than trying to incorporate them all into one. By trying to create the XML file while you split your txt file, you've created a monster that is hard to tackle when it revolts with bugs. Instead, take it one step at a time: create "checkpoints" that you are 100% certain work, and then move on to the next task.
Related
I'm trying to loop through some unstructured text data in Python. The end goal is to structure it in a DataFrame. For now I'm just trying to get the relevant data into an array and understand the line/readline() functionality in Python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
1. Concatenate this line with each line after it, until I reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, and add the concatenated full text to the list array.
2. Continue the for loop within which all of this sits.
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords - even though Python technically doesn't - just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - though it would let you use a more classic while-loop for the whole thing (a rough sketch follows below). For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge cases.
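For completeness, here is a minimal sketch of that iterator-based variant. It assumes the same Title/Full text/Subject layout as above and is not tested against your real data:
with open('sample.txt', encoding="utf8") as f:
    lines = iter(f)
    titles, texts, subjects = [], [], []
    try:
        line = next(lines)
        while True:
            if line.startswith("Title:"):
                titles.append(line)
                line = next(lines)
            elif line.startswith("Full text:"):
                # Collect lines until the "Subject:" line shows up.
                full_text = line
                line = next(lines)
                while not line.startswith("Subject:"):
                    full_text += line
                    line = next(lines)
                texts.append(full_text)
                subjects.append(line)
                line = next(lines)
            else:
                line = next(lines)
    except StopIteration:
        pass  # reached the end of the file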
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np
# read all file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()
keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python
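To see why the reshape trick works, here is a small standalone sketch (with made-up two-article text) showing the intermediate chunks list that re.split produces and how it is regrouped:
import re
import numpy as np

demo_text = ("Title: t1\nFull text: body one\nSubject: Python\n"
             "Title: t2\nFull text: body two\nSubject: Python")
keys = ['Subject', 'Title', 'Full text']

# re.split keeps each captured key in front of its value, so chunks alternates key, value, key, value, ...
chunks = re.split('(?:^|\n)(%s): ' % '|'.join(keys), demo_text)[1:]
# e.g. ['Title', 't1', 'Full text', 'body one', 'Subject', 'Python', 'Title', 't2', ...]

# Grouping into (articles, keys, 2) pairs lets dict() rebuild one record per article.
rows = [dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)]
print(rows)  # [{'Title': 't1', 'Full text': 'body one', 'Subject': 'Python'}, ...]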
I am new to Python and am trying to read a PDF file to pull out the ID No. So far I have been successful in extracting the text from the PDF file using pdfplumber. Below is the code block:
import pdfplumber
with pdfplumber.open('ABC.pdf') as pdf_file:
    firstpage = pdf_file.pages[0]
    raw_text = firstpage.extract_text()
    print(raw_text)
Here is the text output:
Welcome to ABC
01 January, 1991
ID No. : 10101010
Welcome to your ABC portal. Learn
More text here..
Even more text here..
Mr Jane Doe
Jack & Jill Street Learn more about your
www.abc.com
....
....
....
However, I am unable to find the optimal way to parse this unstructured text further. The final output I am expecting is just the ID No., i.e. 10101010. On a side note, the script would be run against a fairly huge set of PDFs, so performance is a concern.
Try using a regular expression:
import pdfplumber
import re
with pdfplumber.open('ABC.pdf') as pdf_file:
    firstpage = pdf_file.pages[0]
    raw_text = firstpage.extract_text()
    m = re.search(r'ID No\. : (\d+)', raw_text)
    if m:
        print(m.group(1))
Of course you'll have to iterate over all the PDF's contents - not just the first page! Also ask yourself if it's possible that there's more than one match per page. Anyway: you know the structure of the input better than I do (and we don't have access to the sample file), so I'll leave it as an exercise for you.
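If you do end up scanning every page, a minimal sketch of that loop (assuming the same 'ID No.' pattern and that one match per document is enough) could look like this:
import pdfplumber
import re

pattern = re.compile(r'ID No\. : (\d+)')

with pdfplumber.open('ABC.pdf') as pdf_file:
    for page in pdf_file.pages:
        text = page.extract_text() or ""  # extract_text() may return nothing on empty pages
        match = pattern.search(text)
        if match:
            print(match.group(1))
            break  # assumes one ID per document; drop this if there can be several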
If the length of the ID number is always the same, I would try to find its location with the find function. position = raw_text.find('ID No. : ') should return the position of the I in ID No., so position + 9 should be the first digit of the ID. If the number always has a length of 8, you could get it with int(raw_text[position+9:position+17]).
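Put together, that find-based approach might look like the following sketch, reusing raw_text from the snippet above and assuming the 'ID No. : ' label is always present and followed by an 8-digit number:
label = 'ID No. : '
position = raw_text.find(label)
if position != -1:
    start = position + len(label)           # first digit of the ID
    id_no = int(raw_text[start:start + 8])  # assumes the ID is always 8 digits
    print(id_no)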
If you are new to Python and actually need to process serious amounts of data, I suggest that you look at Scala as an alternative.
For data processing in general, and regular expression matching in particular, the time it takes to get results is much reduced.
Here is an answer to your question in Scala instead of Python:
import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.parser.PdfTextExtractor
val fil = "ABC.pdf"
val textFromPage = (1 to (new PdfReader(fil)).getNumberOfPages).par.map(page => PdfTextExtractor.getTextFromPage(new PdfReader(fil), page)).mkString
val r = "ID No\\. : (\\d+)".r
val res = for (m <- r.findAllMatchIn(textFromPage)) yield m.group(1)
res.foreach(println)
I've successfully extracted my sitemap, and I would like to turn the URLs into a list. I can't quite figure out how to do that, separating the https URLs from the last-modified dates. Ideally I would also like to turn it into a dictionary with the associated date stamp. In the end, I plan to iterate over the list and create text files of the web pages, saving the date-time stamp at the top of each text file.
I will settle for the next step of turning this into a list. This is my code:
import urllib.request
import inscriptis
from inscriptis import get_text
sitemap = "https://grapaes.com/sitemap.xml"
i=0
url = sitemap
html=urllib.request.urlopen(url).read().decode('utf-8')
text=get_text(html)
dicto = {text}
print(dicto)
for i in dicto:
    if i.startswith("https"):
        print(i + '\n')
The output is basically a row with the date stamp, space, and the url.
You can split the text around whitespaces first, then proceed like this:
text = text.split(' ')
dicto = {}
for i in range(0, len(text), 2):
    dicto[text[i+1]] = text[i]
gives a dictionary with timestamp as key and URL as value, as follows:
{
'2020-01-12T09:19+00:00': 'https://grapaes.com/',
'2020-01-12T12:13+00:00': 'https://grapaes.com/about-us-our-story/',
...,
'2019-12-05T12:59+00:00': 'https://grapaes.com/211-retilplast/',
'2019-12-01T08:29+00:00': 'https://grapaes.com/fruit-logistica-berlin/'
}
I believe you can do further processing from here onward.
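Since your stated end goal is one text file per page with the date stamp at the top, a rough sketch of that last step (reusing urllib and inscriptis from your code; the filename scheme here is only an assumption) could be:
import urllib.request
from inscriptis import get_text

for timestamp, page_url in dicto.items():
    page_html = urllib.request.urlopen(page_url).read().decode('utf-8')
    page_text = get_text(page_html)

    # Derive a crude filename from the URL path (an assumption; adjust to taste).
    filename = page_url.rstrip('/').split('/')[-1] or 'index'
    with open(filename + '.txt', 'w', encoding='utf-8') as f:
        f.write(timestamp + '\n')  # date stamp at the top of the file
        f.write(page_text)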
In addition to the answer above: You could also use an XML Parser (standard module) to achieve what you are trying to do:
# Save your xml on disk (write the raw XML you downloaded, i.e. the 'html'
# variable from your code, rather than the inscriptis-extracted text)
with open('sitemap.xml', 'w') as f:
    f.write(html)
# Import XML-Parser
import xml.etree.ElementTree as ET
# Load xml and obtain the root node
tree = ET.parse('sitemap.xml')
root_node = tree.getroot()
From here you can access your xml's nodes just like every other list-like object:
print(root_node[1][0].text) # output: 'https://grapaes.com/about-us-our-story/'
print(root_node[1][1].text) # output: '2020-01-12T12:13+00:00'
Creating a dict from this is as easy as that:
dicto = dict()
for child in root_node:
    dicto.setdefault(child[0].text, child[1].text)
I needed to take an XML file and replace certain values with other values.
This was easy enough, parsing through the XML (as text) and replacing the old values with the new.
The issue is the new txt file is in the wrong format.
It's all encased in square brackets and has "\n" characters instead of line breaks.
I did try the xml.dom.minidom lib but it's not working ...
I could parse the resulting file as well and remove the "\n" and square brackets, but I don't want to do that, as I am not sure this is the only thing that has been added in this format.
source code :
import json
import shutil
import itertools
import datetime
import time
import calendar
import sys
import string
import random
import uuid
import xml.dom.minidom
inputfile = open('data.txt')
outputfile = open('output.xml','w')
sess_id = "d87c2b8e063e5e5c789d277c34ea"
new_sess_id = ""
number_of_sessions = 4
my_text = str(inputfile.readlines())
my_text2 = ''
#print (my_text)
#Just replicate the session logs x times ...
#print ("UUID random : " + str(uuid.uuid4()).replace("-","")[0:28])
for i in range (0,number_of_sessions):
    new_sess_id = str(uuid.uuid4()).replace("-","")[0:28]
    my_text2 = my_text + my_text2
    my_text2 = my_text2.replace(sess_id,new_sess_id)
#xml = xml.dom.minidom.parseString(my_text2)
outputfile.write(my_text2)
print (my_text)
inputfile.close()
outputfile.close()
The original text is in XML format, but the output looks like this:
time is it</span></div><div
class=\\"di_transcriptAvatarAnswerEntry\\"><span
class=\\"di_transcriptAvatarTitle\\">[AVATAR] </span> <span
class=\\"di_transcriptAvatarAnswerText\\">My watch says it\'s 6:07
PM.<br/>Was my answer helpful? No Yes</span></div>\\r\\n"
</variable>\n', '</element>\n', '</path>\n', '</transaction>\n',
'</session>\n', '<session>']
You are currently using readlines(). This will read each line of your file and return you a Python list, one line per entry (complete with \n on the end of each entry). You were then using str() to convert the list representation into a string, for example:
text = str(['line1\n', 'line2\n', 'line3\n'])
text would now be a string that looks like your list, complete with all the [ and quote characters. Rather than using readlines(), you probably just need to use read(), which returns the whole file contents as a single text string for you to work with.
Try the following type of approach, which also uses the preferred with context manager for dealing with files (it closes them automatically for you).
import uuid
sess_id = "d87c2b8e063e5e5c789d277c34ea"
new_sess_id = ""
number_of_sessions = 4
with open('data.txt') as inputfile, open('output.xml','w') as outputfile:
    my_text = inputfile.read()
    my_text2 = ''

    #Just replicate the session logs x times ...
    for i in range (0,number_of_sessions):
        new_sess_id = str(uuid.uuid4()).replace("-","")[0:28]
        my_text2 = my_text + my_text2
        my_text2 = my_text2.replace(sess_id,new_sess_id)

    outputfile.write(my_text2)
For starters, I am actually a medical student, so I don't know the first thing about programming, but I found myself in desperate need of this, so pardon my complete ignorance of the subject.
I have 2 XML files containing text, each of which contains nearly 2 million lines. The first one looks like this:
<TEXT>
<Unknown1>-65535</Unknown1>
<autoId>1</autoId>
<autoId2>0</autoId2>
<alias>Name2.Boast_Duel_Season01_sudden_death_1vs1</alias>
<original>Уникальная массовая дуэль: Битва один на один до полного уничтожения в один раунд</original>
</TEXT>
<TEXT>
<Unknown1>-65535</Unknown1>
<autoId>2</autoId>
<autoId2>0</autoId2>
<alias>Name2.Boast_Duel_Season01_sudden_death_3vs3</alias>
<original>Уникальная массовая дуэль: Битва трое на трое до полного уничтожения в один раунд</original>
and the second one looks like this:
<TEXT>
<Unknown1>-65535</Unknown1>
<autoId>1</autoId>
<autoId2>0</autoId2>
<alias>Name2.Boast_Duel_Season01_sudden_death_1vs1</alias>
<replacement>Unique mass duel one on one battle to the complete destruction of one round</replacement>
</TEXT>
<TEXT>
<Unknown1>-65535</Unknown1>
<autoId>2</autoId>
<autoId2>0</autoId2>
<alias>Name2.Boast_Duel_Season01_sudden_death_3vs3</alias>
<replacement>Unique mass duel Battle three against three to the complete destruction of one round</replacement>
</TEXT>
Those blocks are repeated throughout the files about half a million times, netting the 2 million lines I told you about.
Now what I need to do is merge both files so that the final product looks like this:
<TEXT>
<Unknown1>-65535</Unknown1>
<autoId>1</autoId>
<autoId2>0</autoId2>
<alias>Name2.Boast_Duel_Season01_sudden_death_1vs1</alias>
<original>Уникальная массовая дуэль: Битва один на один до полного уничтожения в один раунд</original>
<replacement>Unique mass duel one on one battle to the complete destruction of one round</replacement>
</TEXT>
<TEXT>
<Unknown1>-65535</Unknown1>
<autoId>2</autoId>
<autoId2>0</autoId2>
<alias>Name2.Boast_Duel_Season01_sudden_death_3vs3</alias>
<original>Уникальная массовая дуэль: Битва трое на трое до полного уничтожения в один раунд</original>
<replacement>Unique mass duel Battle three against three to the complete destruction of one round</replacement>
</TEXT>
So, basically, I want to add the "replacement" line under each respective "original" line while the rest of the file is kept intact (it's the same in both). Doing this manually would take me about 2 weeks, and I only have 1 day to do it!
Any help is appreciated, and again, sorry if I sound like a total idiot at this, because I kind of am!
P.S.: I can't even choose a proper tag! I will totally understand if I just get lashed in the answers now; this job is way too big for me!
The truth about "where to start" is to learn basic Python string manipulation. I was feeling nice and I like these sorts of problems, however, so here's a (quick and dirty) solution. The only things you'll need to change are the "original.xml" and "replacement.xml" file names. You'll also need a working Python version, of course; that's up to you to figure out.
A couple caveats about my code:
Parsing XML is a solved problem. Using regular expressions to do it is frowned upon, but it works, and when you're doing something as simple and fixed as this, it really doesn't matter.
I made a few assumptions when building the outputted XML file (for example an indentation style of 4 spaces), but it outputs valid XML. The application you're using should play nice with it.
import re

def loadfile(filename):
    '''
    Returns a string containing all data from file
    '''
    infile = open(filename, 'r')
    infile_string = infile.read()
    infile.close()
    return infile_string

def main():
    #load the files into strings
    original = loadfile("original.xml")
    replacement = loadfile("replacement.xml")

    #grab all of the "replacement" lines from the replacement file
    replacement_regex = re.compile("(<replacement>.*?</replacement>)")
    replacement_list = replacement_regex.findall(replacement)

    #grab all of the "TEXT" blocks from the original file
    original_regex = re.compile("(<TEXT>.*?</TEXT>)", re.DOTALL)
    original_list = original_regex.findall(original)

    #a string to write out to the new file
    outfile_string = ""

    to_find = "</original>" #this is the point where the replacement text is going to be appended after
    additional_len = len(to_find)

    for i in range(len(original_list)): #loop through all of the original text blocks
        #build a new string with the replacement text after the original
        build_string = ""
        build_string += original_list[i][:original_list[i].find(to_find)+additional_len]
        build_string += "\n" + " "*4
        build_string += replacement_list[i]
        build_string += "\n</TEXT>\n"
        outfile_string += build_string

    #write the outfile string out to a file
    outfile = open("outfile.txt", 'w')
    outfile.write(outfile_string)
    outfile.close()

if __name__ == "__main__":
    main()
Edit (reply to comment): The IndexError: list index out of range means the regex isn't working properly (it's not finding exactly the right number of replacement lines and grabbing each item into a list). I tested what I wrote on the blurbs you provided, so there's a discrepancy between those blurbs and the full-blown XML files. If there isn't the same number of original/replacement tags, or anything like that, the code will break. That's impossible for me to figure out without access to the files themselves.
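As a quick sanity check, a small sketch using the same regexes as in main() above would tell you whether the two files really contain the same number of blocks:
import re

original = loadfile("original.xml")
replacement = loadfile("replacement.xml")

original_blocks = re.findall("(<TEXT>.*?</TEXT>)", original, re.DOTALL)
replacement_tags = re.findall("(<replacement>.*?</replacement>)", replacement)

# If these two counts differ, the IndexError above is expected.
print("TEXT blocks in original:", len(original_blocks))
print("replacement tags in replacement:", len(replacement_tags))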
Here I present a straightforward way to do that (without XML parsing).
def parse_org(file_handle):
    record = None
    for line in file_handle:
        if "<TEXT>" in line:
            record = line  ## start a new record when we find the tag <TEXT>
        elif "</TEXT>" in line:
            yield record  ## end the record when we find the tag </TEXT>
            record = None
        elif record is not None:
            record += line

def parse_rep(file_handle):
    record = None
    for line in file_handle:
        if "<TEXT>" in line:
            record = None
        elif "</TEXT>" in line:
            yield record
            record = None
        elif "<replacement>" in line:
            record = line

if __name__ == "__main__":
    orginal_file = open("filepath/original.xml")
    replacement_file = open("filepath/replacement.xml")
    a_new_file = open("result_file", "w")

    END = "NOT"
    while END == "NOT":
        try:
            org = next(parse_org(orginal_file))
            rep = next(parse_rep(replacement_file))
            new_record = org + rep + "</TEXT>\n"
            a_new_file.write(new_record)
        except StopIteration:
            END = "YES"

    a_new_file.close()
    orginal_file.close()
    replacement_file.close()
The code is written in Python and uses the keyword yield. Use http://www.codecademy.com/ if you want to learn Python, and google "yield python" to learn how yield works. If you would like to process such txt files in the future, you should learn a scripting language; Python may be the easiest one. If you run into questions you can post them on this website, but don't just do nothing and ask "write this program for me".