Transforming data from file into a structured format (python)

Transforming data from file into a structured format (python) - python

I've just started to write little programs in Python, so my experience level is very low. At the moment I'm trying to read a file into a data structure in Python3, but I have no idea how to do it fast & easy to understand.
First, I have to explain the content of the file. There are headings and the lines following is additional information belonging to the heading.
Booklist.txt
Programming----------------
Python Cookbook
Python in a nutshell
Recipes--------------------
Slow Cooking
Clean Eating
Low Carb
Sports---------------------
Mastering Mountain Bike Skills
My idea is to have a structure like this:
{'Programming': ['Python Cookbook', 'Python in a nutshell'],
'Recipes': ['Slow Cooking', 'Clean Eating', 'Low Carb'], ... }
So far, I did something that seems to work:
f = open('Booklist.txt')
myDict = dict()
for ind, line in enumerate(f):
match = re.search(r"(^[\w ]+)([-]+)$", line)
if match is not None:
category = match.group(1)
myDict[category] = []
else:
myDict[category].append(line)
f.close()
But what could I do with the index? Can I use it to sort the keys in any way? Dictionaries are unsorted, right?

It may be overkill, but you can use a python PEG parser like parsimonious to parse the booklist.txt. It will take you some time to learn the PEG syntax, but it is much easier to write robust code with an established library than doing everything yourself.
Basic usage:
from parsimonious.grammar import Grammar
grammar = Grammar(
"""
body = ( category '\n' name+ '\n' ) +
category = name '-'+
name = ~"[a-zA-Z]*"i
""")
with open('booklist.txt','r') as f:
ast = grammar.parse(f.read())
print( ast )
After SO update the question
Yes, dict is unsorted. If you want to keep the origin order, use OrderedDict. Also if match is not None: can be simplified to if match:

Related

Python Readline Loop and Subloop

I'm trying to loop through some unstructured text data in python. End goal is to structure it in a dataframe. For now I'm just trying to get the relevant data in an array and understand the line, readline() functionality in python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
for line in unstr:
if a in line:
titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
for line in unstr:
if a in line:
list.append(line)
if b in line:
1. Concatenate this line with each line after it, until i reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, add the concatenated full text to the list array.<br>
2. Continue the for loop within which all of this sits
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.

If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
inside_fulltext = False
for line in f:
if line.startswith("Title:"):
inside_fulltext = False
titles.append(line)
elif line.startswith("Full text:"):
inside_fulltext = True
full_text = line
elif line.startswith("Subject:"):
inside_fulltext = False
texts.append(full_text)
subjects.append(line)
elif inside_fulltext:
full_text += line
else:
# Possibly throw a format error here?
pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords - even thought Python technically doesn't - just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - but it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.

As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np
# read all file
with open('sample.txt', encoding="utf8") as f:
text = f.read()
keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python

Parse unstructured text in python

Am new to python and am trying to read a PDF file to pull the ID No.. I have been successful so far to extract the text out of the PDF file using pdfplumber. Below is the code block:
import pdfplumber
with pdfplumber.open('ABC.pdf') as pdf_file:
firstpage = pdf_file.pages[0]
raw_text = firstpage.extract_text()
print (raw_text)
Here is the text output:
Welcome to ABC
01 January, 1991
ID No. : 10101010
Welcome to your ABC portal. Learn
More text here..
Even more text here..
Mr Jane Doe
Jack & Jill Street Learn more about your
www.abc.com
....
....
....
However, am unable to find the optimum way to parse this unstructured text further. The final output am expecting to be is just the ID No. i.e. 10101010. On a side note, the script would be using against fairly huge set of PDFs so performance would be of concern.

Try using a regular expression:
import pdfplumber
import re
with pdfplumber.open('ABC.pdf') as pdf_file:
firstpage = pdf_file.pages[0]
raw_text = firstpage.extract_text()
m = re.search(r'ID No\. : (\d+)', raw_text)
if m:
print(m.group(1))
Of course you'll have to iterate over all the PDF's contents - not just the first page! Also ask yourself if it's possible that there's more than one match per page. Anyway: you know the structure of the input better than I do (and we don't have access to the sample file), so I'll leave it as an exercise for you.

If the length of the id number is always the same, I would try to find the location of it with the find-function. position = raw_text.find('ID No. : ')should return the position of the I in ID No. position + 9 should be the first digit of the id. When the number has always a length of 8 you could get it with int(raw_text[position+9:position+17])

If you are new to Python and actually need to process serious amounts of data, I suggest that you look at Scala as an alternative.
For data processing in general, and regular expression matching in particular, the time it takes to get results is much reduced.
Here is an answer to your question in Scala instead of Python:
import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.parser.PdfTextExtractor
val fil = "ABC.pdf"
val textFromPage = (1 until (new PdfReader(fil)).getNumberOfPages).par.map(page => PdfTextExtractor.getTextFromPage(new PdfReader(fil), page)).mkString
val r = "ID No\\. : (\\d+)".r
val res = for (m <- r.findAllMatchIn(textFromPage )) yield m.group(0)
res.foreach(println)

Extract time values from a list and add to a new list or array

I have a script that reads through a log file that contains hundreds of these logs, and looks for the ones that have a "On, Off, or Switch" type. Then I output each log into its own list. I'm trying to find a way to extract the Out and In times into a separate list/array and then subtract the two times to find the duration of each separate log. This is what the outputted logs look like:
['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
This is my current code:
logfile = '/path/to/my/logfile'
with open(logfile, 'r') as f:
text = f.read()
words = ["On", "Off", "Switch"]
text2 = text.split('\n')
for l in text.split('\n'):
if (words[0] in l or words[1] in l or words[2] in l):
log = l.split(',')[0:3]
I'm stuck on how to target only the Out and In time values from the logs and put them in an array and convert to a time value to find duration.
Initial log before script: everything after the "In" time is useless for what I'm looking for so I only have the first three indices outputted
2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a","Type":"Switch,"In":"2020-01-31T00:30:20.140Z","Path":"interface","message":"interface changed status from unknown to normal","severity":"INFORMATIONAL","display":true,"json_map":"{\"severity\":null,\"eventId\":\"65e-64d9-45-ab62-8ef98ac5e60d\",\"componentPath\":\"interface_css\",\"displayToGui\":false,\"originalState\":\"unknown\",\"closed\":false,\"eventType\":\"InterfaceStateChange\",\"time\":\"2019-04-18T07:04:32.747Z\",\"json_map\":null,\"message\":\"interface_css changed status from unknown to normal\",\"newState\":\"normal\",\"info\":\"Event created with current status\"}","closed":false,"info":"Event created with current status","originalState":"unknown","newState":"normal"}

Below is a possible solution. The wordmatch line is a bit of a hack, until I find something clearer: it's just a one-liner that create an empty or 1-element set of True if one of the words matches.
(Untested)
import re
logfile = '/path/to/my/logfile'
words = ["On", "Off", "Switch"]
dateformat = r'\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[Zz]?'
pattern = fr'Out:\s*\[(?P<out>{dateformat})\].*In":\s*\"(?P<in>{dateformat})\"'
regex = re.compile(pattern)
with open(logfile, 'r') as f:
for line in f:
wordmatch = set(filter(None, (word in s for word in words)))
if wordmatch:
match = regex.search(line)
if match:
intime = match.group('in')
outtime = match.group('out')
# whatever to store these strings, e.g., append to list or insert in a dict.
As noted, your log example is very awkward, so this works for the example line, but may not work for every line. Adjust as necessary.
I have also not included (if so wanted), a conversion to a datetime.datetime object. For that, read through the datetime module documentation, in particular datetime.strptime. (Alternatively, you may want to store your results in a Pandas table. In that case, read through the Pandas documentation on how to convert strings to actual datetime objects.)
You also don't need to read nad split on newlines yourself: for line in f will do that for you (provided f is indeed a filehandle).

Regex is probably the way to go (fastness, efficiency etc.) ... but ...
You could take a very simplistic (if very inefficient) approach of cleaning your data:
join all of it into a string
replace things that hinder easy parsing
split wisely and filter the split
like so:
data = ['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
all_text = " ".join(data)
# this is inefficient and will create throwaway intermediate strings - if you are
# in a hurry or operate on 100s of MB of data, this is NOT the way to go, unless
# you have time
# iterate pairs of ("bad thing", "what to replace it with") (or list of bad things)
for thing in [ (": ",":"), (list('[]{}"'),"") ]:
whatt = thing[0]
withh = thing[1]
# if list, do so for each bad thing
if isinstance(whatt, list):
for p in whatt:
# replace it
all_text = all_text.replace(p,withh)
else:
all_text = all_text.replace(whatt,withh)
# format is now far better suited to splitting/filtering
cleaned = [a for a in all_text.split(" ")
if any(a.startswith(prefix) or "Switch" in a
for prefix in {"In:","Switch:","Out:"})]
print(cleaned)
Outputs:
['Out:2020-01-31T00:30:20.150Z', 'Type:Switch', 'In:2020-01-31T00:30:20.140Z']
After cleaning your data would look like:
2020-01-31T12:04:57.976Z 1234 Out:2020-01-31T00:30:20.150Z Id:Id:4-f-4-9-6a Type:Switch In:2020-01-31T00:30:20.140Z
You can transform the clean list into a dictionary for ease of lookup:
d = dict( part.split(":",1) for part in cleaned)
print(d)
will produce:
{'In': '2020-01-31T00:30:20.140Z',
'Type': 'Switch',
'Out': '2020-01-31T00:30:20.150Z'}
You can use datetime module to parse the times from your values as shown in 0 0 post.

Trying to write to multiple files with a single command with Python

EDIT I will try to clarify this question. I want to make two csv files. One with the text "Greetings", the other with the text "Greetings earth". The problem is I can't find a way to ask python to write to multiple files with one write command. I am trying to find a way to make things more efficient.
This question was identified as a possible duplicate of this. write multiple files at a time but there are a lot more parts to that question that I don't understand. I am trying to isolate this problem in as simple a question as I can.
hello = open("hello.csv","w")
world = open("world.csv","w")
everything = ['hello','world']
half = ['world']
everything.write("Greetings")
half.write("Earth")
hello.close()
world.close()

It is not entirely clear what any why you try to achieve.
If you need a function which manages 'all your file creation needs' you should probably approach this by creating the setup (file names -> contents) of your files then just write them. Alternatively you can label the files and generate their contents based on the preset 'flags'.
approach 1 is something like this:
file_dict = {'hello.csv': 'Greetings', 'world.csv': 'Greetings earth'}
for f in file_dict:
with open(f) as working:
working.write(file_dict[f])
approach 2 is something like this:
files = {'common': 'hello', 'custom': 'world'}
common_text = 'Greetings'
custom_text = ' earth'
for f in files.keys():
with open(files[f]+'.csv', 'w') as working_file:
text = common_text
if f is 'custom':
text += custom_text
working_file.write(text)
If you are happy with your implementation, you can migrate the 'writing' part to a separate function (something like this):
def write_my_stuffs():
for f in file_dict:
with open(f) as working:
working.write(file_dict[f])
file_dict = {'animal.csv': 'I like dogs',
'candy.csv': 'I like chocolate cake'}
write_my_stuffs()

csv","w")
world = open("world.csv","w")
everything = [hello,world]
half = [world]
for x in everything:
x.write("Greetings")
for x in half:
x.write(" Earth")
hello.close()
world.close()

optimal method to parse a json object in a datafile

I am trying to setup a simple data file format, and I am working with these files in Python for analysis. The format basically consists of header information, followed by the data. For syntax and future extensibility reasons, I want to use a JSON object for the header information. An example file looks like this:
{
"name": "my material",
"sample-id": null,
"description": "some material",
"funit": "MHz",
"filetype": "material_data"
}
18 6.269311533 0.128658208 0.962033017 0.566268827
18.10945274 6.268810641 0.128691962 0.961950095 0.565591807
18.21890547 6.268312637 0.128725463 0.961814928 0.564998228...
If the data length/structure is always the same, this is not hard to parse. However, it brought up in my mind a question about the most flexible way to parse out the JSON object, given an unknown number of lines, and an unknown number of nested curly braces, and potentially more than one JSON object in the file.
If there is only one JSON object in the file, one can use this regular expression:
with open(fname, 'r') as fp:
fstring = fp.read()
json_string = re.search('{.*}', fstring, flags=re.S)
However, if there is more than one JSON string, and I want to grab the first one, I need to use something like this:
def grab_json(mystring):
lbracket = 0
rbracket = 0
lbracket_pos = 0
rbracket_pos = 0
for i in range(len(mystring)):
if mystring[i] == '{':
lbracket = 1
lbracket_pos = i
break
for i in range(lbracket_pos+1, len(mystring)):
if mystring[i] == '}':
rbracket += 1
if rbracket == lbracket:
rbracket_pos = i
break
elif mystring[i] == '{':
lbracket += 1
json_string = mystring[lbracket_pos : rbracket_pos + 1]
return json_string, lbracket_pos, rbracket_pos
json_string, beg_pos, end_pos = grab_json(fstring)
I guess the question as always: is there a better way to do this? Better meaning simpler code, more flexible code, more robust code, or really anything?

The easiest solution, as Klaus suggested, is just to use JSON for the entire file. That makes your life much simpler because than writing is just json.dump and reading is just json.load.
A second solution is to put the metadata in a separate file, which keeps reading and writing simple at the expense of multiple files for each data set.
A third solution would be, when writing the file to disk, to prepend the length of the JSON data. So writing might look something like:
metadata_json = json.dumps(metadata)
myfile.write('%d\n' % len(metadata_json))
myfile.write(metadata_json)
myfile.write(data)
Then reading looks like:
with open('myfile') as fd:
len = fd.readline()
metadata_json = fd.read(int(len))
metadata = json.loads(metadata)
data = fd.read()
A fourth option is to adopt an existing storage format (maybe hdf?) that already has the features you are looking for in terms of storing both data and metadata in the same file.

I would store headers separately. It'll give you a possibility to use the same header file for multiple data files
Alternatively you may want to take a look at Apache Parquet Format especially if you want to process your data on distributed cluster(s) using Spark power

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Transforming data from file into a structured format (python) - python

Related

Python Readline Loop and Subloop

Parse unstructured text in python

Extract time values from a list and add to a new list or array

Trying to write to multiple files with a single command with Python

optimal method to parse a json object in a datafile

Categories

Resources