Reading substrings from string in Python - python

I am doing some research where I have more than 25,000 reports in one large text file. Each report is delimited by "TEXTSTART[UNIQUE-ID]" and "TEXTEND".
So far I have succeeded in reading a single report (that is, the text between the identifiers) from the text file with this code:
f = open("samples_combined_incomplete.txt","r" )
report = f.read()
f.close()
rstart = "TEXTSTART"
rend = "TEXTEND"
a = ((report.split(rstart))[1].split(rend)[0])
print (a)
My question is this: how can I divide the text document into uniquely identifiable substrings, based on TEXTSTART[UNIQUE-ID]? And how should the ID be returned?
I am just starting, so any advice on documentation, useful functions, etc. would be much appreciated.
Thank you, works like a charm! The IDs are a combination of numbers and characters FYI.
import re
f = open("samples_combined_incomplete.txt","r" )
report = f.read()
f.close()
rstart = "TEXTSTART"
rend = "TEXTEND"
a = 0
# named "reports" rather than "dict" so the built-in dict is not shadowed
reports = re.findall(r'TEXTSTART\[(.*?)\](.*?)TEXTEND', report, re.DOTALL)
while a < 10:
    print (reports[a])
    a += 1
If I want to search within the containers for a specific keyword and have the keys returned, how could I do that?

import re
# report is the full file content read earlier; the result maps each ID to its report text
print(dict(re.findall(r'TEXTSTART\[([^\]]+)\](.*?)TEXTEND', report, re.DOTALL)))
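For the follow-up question about searching inside the reports for a keyword and getting the IDs back, a minimal sketch might look like the following. The helper names build_reports and ids_containing, the sample text, and the choice of a case-insensitive match are all assumptions made for illustration, not part of the original answer.

```python
import re

def build_reports(text):
    """Map each report ID to its body text (hypothetical helper)."""
    pairs = re.findall(r"TEXTSTART\[([^\]]+)\](.*?)TEXTEND", text, re.DOTALL)
    return {report_id: body.strip() for report_id, body in pairs}

def ids_containing(reports, keyword):
    """Return the IDs of all reports whose text contains keyword, ignoring case."""
    needle = keyword.lower()
    return [rid for rid, body in reports.items() if needle in body.lower()]

# Made-up sample in the TEXTSTART[ID] ... TEXTEND layout described in the question.
sample = (
    "TEXTSTART[A1]First report about turbines.TEXTEND\n"
    "TEXTSTART[B2]Second report,\nabout pumps.TEXTEND\n"
    "TEXTSTART[C3]Turbines again.TEXTEND\n"
)
reports = build_reports(sample)
print(ids_containing(reports, "turbines"))  # ['A1', 'C3']
```

Keeping the reports in a dictionary means the same structure serves both lookups by ID and keyword scans; for whole-word matching, a regex with \b around the keyword could replace the plain substring test.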

Related

How would I extract & organize data from a txt file using python?

Situation: I have a flat file of data with various elements in it and I need to extract specific portions. I am a beginner in Python and wrote it out using Regular Expressions and other functions. Here is a sample of the data from the txt file I receive:
**ACCESSORID = FS01234** TYPE = USER SIZE = 1024 BYTES
**NAME = JOHN SMITH** FACILITY = TSO
DEPT ACID = D12RGRD DEPARTMENT = TRAINING
DIV ACID = NR DIVISION = NRE
CREATED = 01/17/05 00:00 LAST MOD = 11/16/21 10:42
**PROFILES = VPSNRE P11NR00A**
LAST USED = 12/02/21 09:03 CPU(SYSB) FAC(SUPRSESS) COUNT(06051)
**XA SSN = 123456789** OWNER(JB112)
XA TSOACCT = 123456789 OWNER(JB112 )
XA TSOAUTH = JCL OWNER(JB112 )
XA TSOAUTH = RECOVER OWNER(JB112 )
XA TSOPROC = NR005PROC OWNER(JB112 )
----------- SEGMENT TSO
TRBA = NON-DISPLAY FIELD
TSOCOMMAND =
TSODEFPRFG =
TSOLACCT = 111111111
TSOLPROC = NR9923PROC
TSOLSIZE = 0004096
TSOOPT = MAIL,NONOTICES,NOOIDCARD
TSOUDATA = 0000
TSOUNIT = SYSDD
TUPT = NON-DISPLAY FIELD
----------- SEGMENT USER
**EMAIL ADDR = john.smith#nre.ago.com**
The portions I need to extract are bolded. I know I need to provide what I have done so far and without posting my entire script, here is what I am doing to extract the ACCESSORID = FS01234 and NAME = JOHN SMITH portion.
def RemoveSpace():
    f = open("PROJECTFILE.txt","r")
    f1 = open("RemoveSpace.txt", "w")
    data1 = f.read()
    word = data1.split()
    s = ' '.join(word)
    f1.write(s)
    print("Data Written Successfully")
RemoveSpace()
f = open(r"C:\Users\user\Desktop\HR\PROJECTFILE\RemoveSpace.txt".format(g), "r").read()
TSS = []
contents = re.split(r"ACCESSORID =",f)
contents.pop(0)
for item in contents:
    TSS_DICT = {}
    emplid = re.search(r"FS.*", item)
    if emplid is not None:
        s_emplid = re.search("FS\w*", emplid.group())
    else:
        s_emplid = None
    if s_emplid is not None:
        s_emplid = s_emplid.group()
    else:
        s_emplid = None
    TSS_DICT["EMPLOYEE ID"] = s_emplid
    name = re.search(r"NAME =.*", item)
    if name is not None:
        emp_name = re.search("[^NAME = ][^,]*", name.group())
    else:
        emp_name = None
    if emp_name is not None:
        emp_name = emp_name.group()
    else:
        emp_name = None
    TSS_DICT["EMPLOYEE NAME"] = emp_name
Question: I am having some difficulty getting JOHN SMITH. It keeps bringing in everything after JOHN SMITH, down to the very last line with the email address. My end goal is a CSV file with each bolded item as its own column. More directly: how would experts approach this kind of data clean-up to simplify the process? If needed I can post the full code, but I didn't want to muddle this up any more than necessary.
For practising your Regex, I recommend using a website like RegExr. Here, you can paste the text that you want to match and you can play around with different matching expressions to get the result that you intend.
Assuming that you want to use this code for multiple files of the same organisation and that the data is formatted the same way in each, you can simplify your code a lot.
Let's say we wanted to extract NAME = JOHN SMITH from the text file. We could write the following Python code to do this:
import re
pattern = "NAME = \\w+ \\w+"
name = re.findall(pattern, text_to_search)[0][7:]
print(name)
pattern is our Regex search expression. text_to_search is your text file that you have read into your Python script. re.findall() returns a list of matched items that we then access the first index of with [0]. We can then use string slicing ([7:]) to remove the NAME = bit.
The above code would output the following:
JOHN SMITH
You should be able to apply the same principles to the other bold sections of your text file.
In terms of writing your extracted data out to a CSV file, it is probably worth reading a good tutorial on this. For example Reading and Writing CSV Files in Python. There are a few different ways of storing your information before writing, such as lists vs dictionaries. But you can write CSV files either with built-in Python tools or manually.
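To make the last two points concrete, here is a rough sketch that pulls every bolded field out of one record and writes the rows with the standard csv module. The patterns in FIELD_PATTERNS are guesses based only on the sample record shown in the question (for example, that NAME is always followed by FACILITY and PROFILES by LAST USED), and extract_record and records_to_csv are hypothetical helper names.

```python
import csv
import io
import re

# Assumed patterns, inferred from the single sample record in the question.
FIELD_PATTERNS = {
    "ACCESSORID": r"ACCESSORID = (\S+)",
    "NAME": r"NAME = (.+?)\s+FACILITY =",
    "PROFILES": r"PROFILES = (.+?)\s+LAST USED =",
    "XA SSN": r"XA SSN = (\d+)",
    "EMAIL ADDR": r"EMAIL ADDR = (\S+)",
}

def extract_record(record_text):
    """Return one dict per record; fields that are not found become ''."""
    row = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, record_text)
        row[field] = match.group(1).strip() if match else ""
    return row

def records_to_csv(text):
    """Split the text into records at each ACCESSORID and return CSV text."""
    chunks = re.findall(r"ACCESSORID = .*?(?=ACCESSORID = |\Z)", text, flags=re.DOTALL)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(extract_record(chunk) for chunk in chunks)
    return out.getvalue()

# Shortened copy of the sample record from the question.
sample = """ACCESSORID = FS01234 TYPE = USER SIZE = 1024 BYTES
NAME = JOHN SMITH FACILITY = TSO
PROFILES = VPSNRE P11NR00A
LAST USED = 12/02/21 09:03 CPU(SYSB) FAC(SUPRSESS) COUNT(06051)
XA SSN = 123456789 OWNER(JB112)
----------- SEGMENT USER
EMAIL ADDR = john.smith#nre.ago.com
"""
print(records_to_csv(sample))
```

Each pattern stops at the label that follows the value rather than at the end of the line, so it should behave the same whether or not the whitespace in the file has been collapsed first. To write a real file, open it with newline="" and pass it to csv.DictWriter in place of the StringIO buffer.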

Python Readline Loop and Subloop

I'm trying to loop through some unstructured text data in python. End goal is to structure it in a dataframe. For now I'm just trying to get the relevant data in an array and understand the line, readline() functionality in python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            1. Concatenate this line with each line after it, until I reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, add the concatenated full text to the list array.
            2. Continue the for loop within which all of this sits
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually shadowing the built-in name for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords - even though Python technically doesn't - just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - but it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
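For comparison, the iterator-based alternative could be sketched roughly as below. It uses next(it, None) with a default value so that the end of input shows up as None instead of a StopIteration exception. The function name parse_articles and the dict-per-article layout are assumptions made for this illustration.

```python
def parse_articles(lines):
    """Collect articles as dicts by pulling lines from an iterator by hand."""
    it = iter(lines)
    articles = []
    article = {}
    line = next(it, None)  # None marks the end of input
    while line is not None:
        if line.startswith("Title:"):
            article = {"Title": line[len("Title:"):].strip()}
        elif line.startswith("Full text:"):
            parts = [line[len("Full text:"):].strip()]
            line = next(it, None)
            # Sub-loop: keep consuming lines until the Subject line (or end of input).
            while line is not None and not line.startswith("Subject:"):
                parts.append(line.strip())
                line = next(it, None)
            article["Full text"] = " ".join(parts)
            continue  # 'line' already holds the Subject line; handle it next
        elif line.startswith("Subject:"):
            article["Subject"] = line[len("Subject:"):].strip()
            articles.append(article)
        line = next(it, None)
    return articles

# Shortened sample in the same layout as the question's text.
sample_lines = [
    "Title: title of an article\n",
    "Full text: first line,\n",
    "second line.\n",
    "Subject: Python\n",
    "Title: title of another article\n",
    "Full text: only line.\n",
    "Subject: Python\n",
]
articles = parse_articles(sample_lines)
print(articles[0]["Full text"])  # first line, second line.
```

An open file object can be passed in place of sample_lines, since iterating over a file yields its lines. The resulting list of dicts can also be handed to a DataFrame constructor directly.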
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np
# read all file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()
keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python

Parse unstructured text in python

I am new to Python and am trying to read a PDF file to pull out the ID No. So far I have managed to extract the text from the PDF file using pdfplumber. Below is the code block:
import pdfplumber
with pdfplumber.open('ABC.pdf') as pdf_file:
    firstpage = pdf_file.pages[0]
    raw_text = firstpage.extract_text()
    print (raw_text)
Here is the text output:
Welcome to ABC
01 January, 1991
ID No. : 10101010
Welcome to your ABC portal. Learn
More text here..
Even more text here..
Mr Jane Doe
Jack & Jill Street Learn more about your
www.abc.com
....
....
....
However, I am unable to find the best way to parse this unstructured text further. The final output I expect is just the ID No., i.e. 10101010. On a side note, the script will be run against a fairly large set of PDFs, so performance is a concern.
Try using a regular expression:
import pdfplumber
import re
with pdfplumber.open('ABC.pdf') as pdf_file:
    firstpage = pdf_file.pages[0]
    raw_text = firstpage.extract_text()
    m = re.search(r'ID No\. : (\d+)', raw_text)
    if m:
        print(m.group(1))
Of course you'll have to iterate over all the PDF's contents - not just the first page! Also ask yourself if it's possible that there's more than one match per page. Anyway: you know the structure of the input better than I do (and we don't have access to the sample file), so I'll leave it as an exercise for you.
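As a starting point for that exercise, the sketch below collects every match from a sequence of page texts, so it covers both several pages and several IDs per page. The function name find_ids and the sample strings are made up for illustration, and the code deliberately works on plain strings so it can be tried without a PDF. Treating a missing page text as empty is a defensive assumption.

```python
import re

# Compiled once, since the same pattern is applied to many pages.
ID_PATTERN = re.compile(r"ID No\. : (\d+)")

def find_ids(page_texts):
    """Return every ID found in an iterable of page texts, in page order."""
    ids = []
    for text in page_texts:
        # Treat a page with no extractable text (None or '') as empty.
        ids.extend(ID_PATTERN.findall(text or ""))
    return ids

# Made-up page texts standing in for the output of extract_text().
pages = [
    "Welcome to ABC\n01 January, 1991\nID No. : 10101010\nMore text here..",
    None,
    "ID No. : 20202020 and again ID No. : 30303030",
]
print(find_ids(pages))  # ['10101010', '20202020', '30303030']

# With pdfplumber, the page texts could come from the calls already used above:
#     with pdfplumber.open('ABC.pdf') as pdf_file:
#         ids = find_ids(page.extract_text() for page in pdf_file.pages)
```

The IDs are kept as strings so that any leading zeros survive.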
If the length of the ID number is always the same, I would try to find its location with the find function. position = raw_text.find('ID No. : ') should return the position of the I in ID No., so position + 9 should be the first digit of the ID. If the number always has a length of 8, you could get it with int(raw_text[position+9:position+17])
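A small sketch of this position-based idea follows, with two additions that are not in the answer above: it checks for the -1 that str.find returns when the prefix is missing, and it returns the ID as a string instead of calling int(), so leading zeros are not lost. The function name and default arguments are hypothetical.

```python
def extract_id_by_position(raw_text, prefix="ID No. : ", id_length=8):
    """Return the fixed-length ID that follows prefix, or None if it is absent."""
    position = raw_text.find(prefix)
    if position == -1:  # str.find returns -1 when the prefix does not occur
        return None
    start = position + len(prefix)
    candidate = raw_text[start:start + id_length]
    # Only accept the slice if it is exactly id_length digits.
    if len(candidate) == id_length and candidate.isdigit():
        return candidate
    return None

# Text copied from the sample output in the question.
raw_text = "Welcome to ABC\n01 January, 1991\nID No. : 10101010\nWelcome to your ABC portal."
print(extract_id_by_position(raw_text))  # 10101010
```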
If you are new to Python and actually need to process serious amounts of data, I suggest that you look at Scala as an alternative.
For data processing in general, and regular expression matching in particular, the time it takes to get results is much reduced.
Here is an answer to your question in Scala instead of Python:
import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.parser.PdfTextExtractor
val fil = "ABC.pdf"
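// iText pages are numbered from 1, so use `to` to include the last page
val textFromPage = (1 to (new PdfReader(fil)).getNumberOfPages).par.map(page => PdfTextExtractor.getTextFromPage(new PdfReader(fil), page)).mkString
val r = "ID No\\. : (\\d+)".r
// group(1) is the captured number; group(0) would be the whole match including the label
val res = for (m <- r.findAllMatchIn(textFromPage )) yield m.group(1)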
res.foreach(println)

python newbie - where is my if/else wrong?

Complete beginner so I'm sorry if this is obvious!
I have a file which is name | +/- or IG_name | 0 in a long list like so -
S1 +
IG_1 0
S2 -
IG_S3 0
S3 +
S4 -
dnaA +
IG_dnaA 0
Everything which starts with IG_ has a corresponding name. I want to add the + or - to the IG_name. e.g. IG_S3 is + like S3 is.
The information is gene names and strand information, IG = intergenic region. Basically I want to know which strand the intergenic region is on.
What I think I want:
open file
for every line, if the line starts with IG_*
    find the line with *
    print("IG_" and the line it found)
else
    print line
What I have:
with open(sys.argv[2]) as geneInfo:
    with open(sys.argv[1]) as origin:
        for line in origin:
            if line.startswith("IG_"):
                name = line.split("_")[1]
                nname = name[:-3]
                for newline in geneInfo:
                    if re.match(nname, newline):
                        print("IG_"+newline)
            else:
                print(line)
where origin is the mixed list and geneInfo has only the names not IG_names.
With this code I end up with a list containing only the else statements.
S1 +
S2 -
S3 +
S4 -
dnaA +
My problem is that I don't know what is wrong, so I don't know what to search for to (attempt to) fix it!
Below is some step-by-step annotated code that hopefully does what you want (though instead of using print I have aggregated the results into a list so you can actually make use of it). I'm not quite sure what happened with your existing code (especially how you're processing two files?)
s_dict = {}
ig_list = []
with open('genes.txt', 'r') as infile: # Simulating reading the file you pass in sys.argv
    for line in infile:
        if line.startswith('IG_'):
            ig_list.append(line.split()[0]) # Collect all our IG values for later
        else:
            s_name, value = line.split() # Separate out the S value and its operator
            s_dict[s_name] = value.strip() # Add to dictionary to map S to operator
# Now you can go back through your list of IG values and append the appropriate operator
pulled_together = []
for item in ig_list:
    s_value = item.split('_')[1]
    # The following will look for the operator mapped to the S value. If it is
    # not found, it will instead give you 'Not found'
    corresponding_operator = s_dict.get(s_value, 'Not found')
    pulled_together.append([item, corresponding_operator])
print ('List structure')
print (pulled_together)
print ('\n')
print('Printout of each item in list')
for item in pulled_together:
    print(item[0] + '\t' + item[1])
nname = name[:-3]
Python's slicing is very powerful, but it can be tricky to understand correctly.
When you write [:-3], you take everything except the last three items. The catch is that if the sequence has fewer than three elements, it does not raise an error; it returns an empty sequence (here an empty string, since name is a string).
I think this is where things go wrong: if there are not many characters after the underscore, the slice may return an empty or truncated string. If you could say exactly what you want it to return there, with an example, it would help a lot, as I don't really know what you're trying to get with your slicing.
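To see what the slice does on lines like the ones in the question, here is a short experiment. It assumes the name and the 0 are separated by a single space and that each line ends with a newline, which is a guess about the file. The two helper names are hypothetical.

```python
def name_by_slice(line):
    """Mimic the question's approach: drop the last three characters."""
    return line.split("_")[1][:-3]

def name_by_split(line):
    """Alternative: take the first whitespace-separated token after 'IG_'."""
    return line.split("_", 1)[1].split()[0]

print(repr(name_by_slice("IG_S3 0\n")))  # 'S3'  (strips ' 0' and the newline)
print(repr(name_by_slice("IG_S3 0")))    # 'S'   (no newline, so one letter too many is cut)
print(repr("ab"[:-3]))                   # ''    (fewer than three characters)
print(repr(name_by_split("IG_S3 0\n")))  # 'S3'
print(repr(name_by_split("IG_S3 0")))    # 'S3'
```

The slice only gives the right name when exactly three characters follow it, whereas splitting on whitespace does not depend on the separator width or on a trailing newline.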
Does this do what you want?
from __future__ import print_function
import sys
# Read and store all the gene info lines, keyed by name
gene_info = dict()
with open(sys.argv[2]) as gene_info_file:
    for line in gene_info_file:
        tokens = line.split()
        name = tokens[0].strip()
        gene_info[name] = line
# Read the other file and lookup the names
with open(sys.argv[1]) as origin_file:
    for line in origin_file:
        if line.startswith("IG_"):
            name = line.split("_")[1]
            nname = name[:-3].strip()
            if nname in gene_info:
                lookup_line = gene_info[nname]
                print("IG_" + lookup_line)
            else:
                pass # what do you want to do in this case?
        else:
            print(line)
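One likely reason the original nested loop printed only the else lines, offered here as an interpretation and not something the answers state, is that a file object can be iterated only once. The inner loop over geneInfo reads the file to the end for the first IG_ line (IG_1, which matches nothing), and every later inner loop then has no lines left. The sketch below shows the effect with io.StringIO standing in for an open file.

```python
import io

# StringIO behaves like an open text file for iteration purposes.
gene_info = io.StringIO("S1 +\nS2 -\nS3 +\n")

first_pass = [line.split()[0] for line in gene_info]
second_pass = [line.split()[0] for line in gene_info]  # nothing left to read

print(first_pass)   # ['S1', 'S2', 'S3']
print(second_pass)  # []

# Rewinding makes the lines available again.
gene_info.seek(0)
third_pass = [line.split()[0] for line in gene_info]
print(third_pass)   # ['S1', 'S2', 'S3']
```

Reading the gene info into a dictionary once, as the code above does, avoids both the exhausted iterator and the cost of rescanning the file for every IG_ line.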

Analysing a text file in Python

I have a text file that needs to be analysed. Each line in the file is of this form:
7:06:32 (slbfd) IN: "lq_viz_server" aqeela#nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj#nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj#nabwmps3
I need to skip the timestamp and the (slbfd) and only keep a count of the lines with the IN and OUT. Further, depending on the name in quotes, I need to increase a variable count for different variables if a line starts with OUT and decrease the variable count otherwise. How would I go about doing this in Python?
The other answers with regex and splitting the line will get the job done, but if you want a fully maintainable solution that will grow with you, you should build a grammar. I love pyparsing for this:
S ='''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela#nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj#nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj#nabwmps3'''
from pyparsing import *
from collections import defaultdict
# Define the grammar
num = Word(nums)
marker = Literal(":").suppress()
timestamp = Group(num + marker + num + marker + num)
label = Literal("(slbfd)")
flag = Word(alphas)("flag") + marker
name = QuotedString(quoteChar='"')("name")
line = timestamp + label + flag + name + restOfLine
grammar = OneOrMore(Group(line))
# Now parsing is a piece of cake!
P = grammar.parseString(S)
counts = defaultdict(int)
for x in P:
    if x.flag=="IN": counts[x.name] += 1
    if x.flag=="OUT": counts[x.name] -= 1
for key in counts:
    print(key, counts[key])
This gives as output:
lq_viz_server 1
OFM32 -1
This would look more impressive if your sample log file were longer. The beauty of a pyparsing solution is the ability to adapt to a more complex query in the future (e.g. grab and parse the timestamp, pull the email address, parse error codes...). The idea is that you write the grammar independently of the query: you simply convert the raw text to a computer-friendly format, abstracting the parsing implementation away from its usage.
If the file is divided into lines (I don't know whether that is true), you can apply the split() function to each line. For the first sample line you will get this:
["7:06:32", "(slbfd)", "IN:", '"lq_viz_server"', "aqeela#nabltas1"]
Then you should be able to apply whatever logic you need by comparing those values.
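A minimal sketch of that split-based logic is shown below. It follows the sign convention stated in the question (OUT increases the count, IN decreases it) and assumes the quoted names contain no spaces, since split() would otherwise break them apart. The function name count_checkouts is made up for illustration.

```python
from collections import Counter

def count_checkouts(lines):
    """Tally names per line: +1 for OUT, -1 for IN, everything else ignored."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # too short to hold timestamp, daemon, flag and name
        flag, name = fields[2], fields[3].strip('"')
        if flag == "OUT:":
            counts[name] += 1
        elif flag == "IN:":
            counts[name] -= 1
    return counts

# The three sample lines from the question.
log_lines = [
    '7:06:32 (slbfd) IN: "lq_viz_server" aqeela#nabltas1',
    '7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj#nabwmps3 (License server system does not support this feature. (-18,327))',
    '7:08:21 (slbfd) OUT: "OFM32" Albahraj#nabwmps3',
]
print(dict(count_checkouts(log_lines)))  # {'lq_viz_server': -1, 'OFM32': 1}
```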
I made some wild assumptions about your specification, and here is some sample code to help you start:
objects = {}
with open("data.txt") as data:
    for line in data:
        if "IN:" in line or "OUT:" in line:
            try:
                name = line.split("\"")[1]
            except IndexError:
                print("No double quoted name on line: {}".format(line))
                name = "PARSING_ERRORS"
            if "OUT:" in line:
                diff = 1
            else:
                diff = -1
            try:
                objects[name] += diff
            except KeyError:
                objects[name] = diff
print(objects) # for debug only, not advisable to print huge number of names
You have two options:
Use the .split() function of the string (as pointed out in the comments)
Use the re module for regular expressions.
I would suggest using the re module and create a pattern with named groups.
Recipe:
first create a pattern with re.compile() containing named groups
do a for loop over the file to get the lines
use .match() of the created pattern object on each line
use .groupdict() of the returned match object to access your values of interest
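The recipe above could be sketched as follows. The group names (time, daemon, flag, name) and the function name tally are chosen here for illustration, and the sign convention follows the question (OUT counts up, IN counts down); other answers in this thread use the opposite sign.

```python
import re
from collections import Counter

# Named groups make each captured part of the line self-describing.
LINE_PATTERN = re.compile(
    r'\s*(?P<time>\d+:\d+:\d+) \((?P<daemon>\w+)\) (?P<flag>IN|OUT): "(?P<name>[^"]+)"'
)

def tally(lines):
    """Count +1 for each OUT line and -1 for each IN line, keyed by quoted name."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # e.g. UNSUPPORTED lines do not match the IN|OUT alternation
        fields = match.groupdict()
        counts[fields["name"]] += 1 if fields["flag"] == "OUT" else -1
    return counts

# The three sample lines from the question.
sample_log = [
    '7:06:32 (slbfd) IN: "lq_viz_server" aqeela#nabltas1',
    '7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj#nabwmps3',
    '7:08:21 (slbfd) OUT: "OFM32" Albahraj#nabwmps3',
]
print(dict(tally(sample_log)))  # {'lq_viz_server': -1, 'OFM32': 1}
```

Because the fields come back as a dictionary, using the timestamp or daemon name later only means reading another key, not renumbering groups.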
In the mode of just get 'er done with the standard distribution, this works:
import re
from collections import Counter
# open your file as inF...
count=Counter()
for line in inF:
    match=re.match(r'\d+:\d+:\d+ \(slbfd\) (\w+): "(\w+)"', line)
    if match:
        if match.group(1) == 'IN': count[match.group(2)]+=1
        elif match.group(1) == 'OUT': count[match.group(2)]-=1
print(count)
Prints:
Counter({'lq_viz_server': 1, 'OFM32': -1})
