Extracting data elements from large unstructured text files with Python

I am trying to extract data elements from large unstructured text files (1,000,000 to 15,000,000 lines per file) with no consistent delimiter. The order of the data elements is consistent.
Sample data:
NAME FIRSTNAME LASTNAME DATE-OF-BIRTH 01/01/2019 ID-NUMBER 123
ADDRESS-1 1234 FAKE STREET COUNTY-CODE 123
ADDRESS-2
CITY NOWHERE STATE OH ZIP 12345
RANDOM DATA .... 700+ LINES
NAME FIRSTNAME2 LASTNAME2 DATE-OF-BIRTH 01/01/2019 ID-NUMBER 4567
ADDRESS-1 123456 OTHER STREET COUNTY-CODE 45678
ADDRESS-2
CITY SOMEWHERE STATE MI ZIP 65432
RANDOM DATA .... 700+ LINES
I'm looking for a way to create a CSV output with the values of a few fields listed below:
NAME, COUNTY-CODE, ZIP
FIRSTNAME LASTNAME, 123, 12345
FIRSTNAME2 LASTNAME2, 45678, 65432
The data is NOT tab delimited and spacing will vary. Any help would be greatly appreciated!

Hmm...
I assume that you have a bunch of lines, each containing ID VALUE pairs, and that each chunk starts with the id NAME.
So I would use the re module to search for the expected patterns, with each occurrence of NAME starting a new element. As real first and last names can span more than one word (John Fitzgerald Kennedy), I would take the NAME to be everything between NAME and DATE-OF-BIRTH.
IMHO a simple way is to build a dict while parsing the lines, and write the dict with a csv.DictWriter whenever a new NAME is reached, and again at end of file. I would only keep the first occurrence of each keyword if more than one is found, but you could also raise an exception.
Code could be
import re
import csv

# prepare the patterns to search for
name = re.compile(r"NAME\s+(.*)\s+DATE")
zip_code = re.compile(r"ZIP\s*([0-9]+)")
county_code = re.compile(r"COUNTY-CODE\s*([0-9]+)")

with open("input.txt") as fdin, open("output.csv", "w", newline='') as fdout:
    wr = csv.DictWriter(fdout, fieldnames=['NAME', 'COUNTY-CODE', 'ZIP'])
    elt = {}
    wr.writeheader()
    for line in fdin:
        # process NAME
        mx = name.search(line)
        if mx:
            if elt:                  # write previous dict if any
                wr.writerow(elt)
            elt = {'NAME': mx.group(1).strip()}   # initialize a new dict
        # process other keywords
        if 'COUNTY-CODE' not in elt:              # only keep first one
            mx = county_code.search(line)
            if mx:
                elt['COUNTY-CODE'] = mx.group(1).strip()  # update the dict with it
        if 'ZIP' not in elt:
            mx = zip_code.search(line)
            if mx:
                elt['ZIP'] = mx.group(1)
    if elt:
        wr.writerow(elt)             # don't forget the last dict

The problem is very similar to one found in another SO question.
The solution is to construct a partial grammar that parses the structure of the recognized construct while skipping whatever can't be recognized.
In your case, using textX, that would be something along these lines (I haven't tested it but you get the picture):
from textx import metamodel_from_str

mm = metamodel_from_str(r'''
File:
    ( /(?s:.*?(?=NAME))/ persons*=Person | 'NAME' )*
    /(?s).*/
;
Person:
    'NAME' first_name=Name last_name=Name
    'DATE-OF-BIRTH' birth_date=Date 'ID-NUMBER' id_number=/\d+/
    'ADDRESS-1' address_1=UntilEOL
    'ADDRESS-2' address_2=UntilEOL
    'CITY' city=UntilEOL
;
Name: /\w+/;
Date: /\d{2}\/\d{2}\/\d{4}/;
UntilEOL[noskipws]: /.*?\n/;
''')

data_model = mm.model_from_file('some_input_file.txt')
# Here data_model is an object with attribute `persons`,
# where each person has attributes `first_name`, `last_name`, ...
# from the `Person` rule above.
Note: This solution assumes that each structural part starts with the keyword NAME, but it is OK for the keyword to also appear in the random data: on an invalid parse of the Person rule, the parser will just consume the word NAME and continue.
Depending on your actual data you will have to tune the grammar a bit (e.g. particular regexes).
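To produce the CSV from the question, one could then walk the parsed model. Here is a minimal, untested sketch that mirrors the attribute names from the Person rule above; since COUNTY-CODE and ZIP end up inside the captured ADDRESS-1 and CITY lines, plain regexes pull them back out:
import csv
import re

# Untested sketch: walk the textX model and emit the CSV from the question.
with open('output.csv', 'w', newline='') as f:
    wr = csv.writer(f)
    wr.writerow(['NAME', 'COUNTY-CODE', 'ZIP'])
    for p in data_model.persons:
        county = re.search(r'COUNTY-CODE\s*(\d+)', p.address_1)
        zip_match = re.search(r'ZIP\s*(\d+)', p.city)
        wr.writerow(['%s %s' % (p.first_name, p.last_name),
                     county.group(1) if county else '',
                     zip_match.group(1) if zip_match else ''])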

Related

How to create multiline lists from raw file and process them in python?

I have the following code that takes raw_file.txt and turns it into processed_file.txt.
Problem 1:
Besides item_location I also need the item_id (as str, not int) to be in the processed file, perhaps as a list, so it would look like WANTED_processed_file.txt
def process_file(raw_file, processed_file, target1):
    with open(raw_file, 'r') as raw_file:
        with open(processed_file, 'a') as processed_file:
            for line in raw_file:
                if target1 in line:
                    processed_file.write(line.split(target1)[1])

process_file('raw_file.txt', 'processed_file.txt', 'item_location: ')
By adding another if statement with target2, the content is appended below target1 (as expected), but I don't actually know how to make it a list.
Problem 2:
With my current code I'm only able to process the string corresponding to the line, but since WANTED_processed_file.txt contains multiple lists I need to adapt it.
def my_function():
    print(i)

with open('processed_file.txt', "r") as processed_file:
    items = processed_file.read().splitlines()

for i in items:
    my_function()
This is what I've tried but I'm not getting the desired results:
def my_function():
    print(f'Item {i[0]} is located at {i[1]}')

with open('WANTED_processed_file.txt', "r") as processed_file:
    items = processed_file.read()

for i in items:
    my_function()
raw_file.txt:
ITEM:
item_id: 0001
item_location: first location
item_description: something
ITEM:
item_id: 0002
item_location: second location
item_description: something else
processed_file.txt:
first location
second location
WANTED_processed_file.txt:
['0001', 'first location']
['0002', 'second location']
Thank you and apologies for the lengthy post
You want to parse multiline blocs from a text file. The robust way would be to process it line by line, searching for start-of-bloc markers and data markers, filling a data structure, and then storing the data to a file.
If you are sure that your items will always be in the same order, you could use a multiline regex:
import re
...

with open('raw_file.txt') as fd:
    text = fd.read()

rx = re.compile(r'ITEM:.*?item_id: ([^\n]*).*?item_location: ([^\n]*)',
                re.MULTILINE | re.DOTALL)

with open('processed_file', 'w') as fd:
    for record in rx.finditer(text):
        print(list(record.groups()), file=fd)
But beware, it will be less robust than a true parser...
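For completeness, here is a sketch of that more robust line-by-line approach (assuming, as in the sample, that item_id always precedes item_location within a bloc):
records = []
current = {}

# Start a new record at each 'ITEM:' marker; emit the [id, location] pair
# as soon as the location line is seen.
with open('raw_file.txt') as fd:
    for line in fd:
        line = line.strip()
        if line == 'ITEM:':
            current = {}
        elif line.startswith('item_id: '):
            current['id'] = line[len('item_id: '):]
        elif line.startswith('item_location: '):
            records.append([current.get('id'), line[len('item_location: '):]])

with open('WANTED_processed_file.txt', 'w') as fd:
    for record in records:
        print(record, file=fd)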

Extract time values from a list and add to a new list or array

I have a script that reads through a log file containing hundreds of these logs and looks for the ones that have an "On", "Off", or "Switch" type. Then I output each matching log into its own list. I'm trying to find a way to extract the Out and In times into a separate list/array and then subtract the two times to find the duration of each separate log. This is what the outputted logs look like:
['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
This is my current code:
logfile = '/path/to/my/logfile'
with open(logfile, 'r') as f:
    text = f.read()
words = ["On", "Off", "Switch"]
text2 = text.split('\n')
for l in text.split('\n'):
    if (words[0] in l or words[1] in l or words[2] in l):
        log = l.split(',')[0:3]
I'm stuck on how to target only the Out and In time values from the logs and put them in an array and convert to a time value to find duration.
Initial log before the script runs: everything after the "In" time is useless for what I'm looking for, so I only output the first three indices
2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a","Type":"Switch,"In":"2020-01-31T00:30:20.140Z","Path":"interface","message":"interface changed status from unknown to normal","severity":"INFORMATIONAL","display":true,"json_map":"{\"severity\":null,\"eventId\":\"65e-64d9-45-ab62-8ef98ac5e60d\",\"componentPath\":\"interface_css\",\"displayToGui\":false,\"originalState\":\"unknown\",\"closed\":false,\"eventType\":\"InterfaceStateChange\",\"time\":\"2019-04-18T07:04:32.747Z\",\"json_map\":null,\"message\":\"interface_css changed status from unknown to normal\",\"newState\":\"normal\",\"info\":\"Event created with current status\"}","closed":false,"info":"Event created with current status","originalState":"unknown","newState":"normal"}
Below is a possible solution. The wordmatch line is a bit of a hack, until I find something clearer: it's just a one-liner that creates an empty or one-element set of True if one of the words matches.
(Untested)
import re

logfile = '/path/to/my/logfile'
words = ["On", "Off", "Switch"]
dateformat = r'\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[Zz]?'
pattern = fr'Out:\s*\[(?P<out>{dateformat})\].*In":\s*\"(?P<in>{dateformat})\"'
regex = re.compile(pattern)

with open(logfile, 'r') as f:
    for line in f:
        wordmatch = set(filter(None, (word in line for word in words)))
        if wordmatch:
            match = regex.search(line)
            if match:
                intime = match.group('in')
                outtime = match.group('out')
                # whatever to store these strings, e.g., append to a list or insert in a dict
As noted, your log example is very awkward, so this works for the example line, but may not work for every line. Adjust as necessary.
I have also not included (if so wanted), a conversion to a datetime.datetime object. For that, read through the datetime module documentation, in particular datetime.strptime. (Alternatively, you may want to store your results in a Pandas table. In that case, read through the Pandas documentation on how to convert strings to actual datetime objects.)
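For illustration, a minimal sketch of that conversion, assuming the timestamps keep exactly the layout seen in the sample:
from datetime import datetime

# Sketch only: parse the two sample timestamps and subtract them.
fmt = '%Y-%m-%dT%H:%M:%S.%fZ'
outtime = datetime.strptime('2020-01-31T00:30:20.150Z', fmt)
intime = datetime.strptime('2020-01-31T00:30:20.140Z', fmt)
duration = outtime - intime           # a datetime.timedelta
print(duration.total_seconds())       # 0.01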
You also don't need to read and split on newlines yourself: for line in f will do that for you (provided f is indeed a filehandle).
Regex is probably the way to go (speed, efficiency, etc.) ... but ...
You could take a very simplistic (if very inefficient) approach of cleaning your data:
join all of it into a string
replace things that hinder easy parsing
split wisely and filter the split
like so:
data = ['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']

all_text = " ".join(data)

# this is inefficient and will create throwaway intermediate strings - if you are
# in a hurry or operate on 100s of MB of data, this is NOT the way to go, unless
# you have time

# iterate pairs of ("bad thing", "what to replace it with") (or list of bad things)
for thing in [(": ", ":"), (list('[]{}"'), "")]:
    whatt = thing[0]
    withh = thing[1]
    # if list, do so for each bad thing
    if isinstance(whatt, list):
        for p in whatt:
            # replace it
            all_text = all_text.replace(p, withh)
    else:
        all_text = all_text.replace(whatt, withh)

# format is now far better suited to splitting/filtering
cleaned = [a for a in all_text.split(" ")
           if any(a.startswith(prefix) or "Switch" in a
                  for prefix in {"In:", "Switch:", "Out:"})]

print(cleaned)
Outputs:
['Out:2020-01-31T00:30:20.150Z', 'Type:Switch', 'In:2020-01-31T00:30:20.140Z']
After cleaning your data would look like:
2020-01-31T12:04:57.976Z 1234 Out:2020-01-31T00:30:20.150Z Id:Id:4-f-4-9-6a Type:Switch In:2020-01-31T00:30:20.140Z
You can transform the clean list into a dictionary for ease of lookup:
d = dict(part.split(":", 1) for part in cleaned)
print(d)
will produce:
{'In': '2020-01-31T00:30:20.140Z',
'Type': 'Switch',
'Out': '2020-01-31T00:30:20.150Z'}
You can use the datetime module to parse the times from your values, as shown in 0 0's post.

RegEx for capturing groups using dictionary key

I'm having trouble displaying the right named capture in my dictionary function. My program reads a .txt file and then transforms the text in that file into a dictionary. I already have the right regex formula to capture them.
Here is my File.txt:
file Science/Chemistry/Quantum 444 1
file Marvel/CaptainAmerica 342 0
file DC/JusticeLeague/Superman 300 0
file Math 333 0
file Biology 224 1
Here is the regex link that is able to capture the ones I want:
By looking at the link, the ones I want to display are highlighted in green and orange.
This part of my code works:
rx = re.compile(r'file (?P<path>.*?)( |\/.*?)? (?P<views>\d+).+')
i = rx.match(data)  # 'data' is a line from the .txt file
x = (i.group(1), i.group(3))
print(x)
But since I'm turning the .txt into a dictionary, I couldn't figure out how to use .group(1) and .group(3) as keys in my display function. I don't know how to get those groups to show when I use print("Title: %s | Number: %s" % (key[1], key[3])). I hope someone can help me implement that in my dictionary function.
Here is my dictionary function:
def create_dict(data):
    dictionary = {}
    for line in data:
        line_pattern = re.findall(r'file (?P<path>.*?)( |\/.*?)? (?P<views>\d+).+', line)
        dictionary[line] = line_pattern
        content = dictionary[line]
        print(content)
    return dictionary
I'm trying to make my output look like this from my text file:
Science 444
Marvel 342
DC 300
Math 333
Biology 224
You may create and populate a dictionary with your file data using
def create_dict(data):
    dictionary = {}
    for line in data:
        m = re.search(r'file\s+([^/\s]*)\D*(\d+)', line)
        if m:
            dictionary[m.group(1)] = m.group(2)
    return dictionary
Basically, it does the following:
Defines an empty dictionary named dictionary
Reads data line by line
Searches for a file\s+([^/\s]*)\D*(\d+) match, and if there is a match, the two capturing group values are used to form a dictionary key-value pair.
The regex I suggest is
file\s+([^/\s]*)\D*(\d+)
Then, you may use it like
res = {}
with open(filepath, 'r') as f:
    res = create_dict(f)
print(res)
You already used named groups in your line_pattern; simply put them into your dictionary. re.findall would not work here. Also, the character escape '\' before '/' is redundant. Thus your dictionary function would be:
def create_dict(data):
    dictionary = {}
    for line in data:
        line_pattern = re.search(r'file (?P<path>.*?)( |/.*?)? (?P<views>\d+).+', line)
        dictionary[line_pattern.group('path')] = line_pattern.group('views')
        content = dictionary[line_pattern.group('path')]
        print(content)
    return dictionary
This RegEx might help you to divide your inputs into four groups, where group 2 and group 4 are your target groups; they can simply be extracted and joined with a space:
(file\s)([A-Za-z]+(?=\/|\s))(.*)(\d{3})
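A hypothetical usage sketch (the file name and the iteration are assumptions, not part of the original answer):
import re

# Hypothetical sketch: apply the four-group pattern line by line and print
# group 2 (the top-level name) and group 4 (the three-digit number).
rx = re.compile(r'(file\s)([A-Za-z]+(?=\/|\s))(.*)(\d{3})')
with open('File.txt') as f:
    for line in f:
        m = rx.search(line)
        if m:
            print(m.group(2), m.group(4))   # e.g. 'Science 444'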

Parsing through sequence output- Python

I have this data from sequencing a bacterial community.
I know some basic Python and am in the midst of completing the codecademy tutorial.
For practical purposes, please think of OTU as another word for "species"
Here is an example of the raw data:
OTU ID OTU Sum Lineage
591820 1083 k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
532752 517 k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
218456 346 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__Bordetella; s__
590248 330 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__; s__
343284 321 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Comamonadaceae; g__Limnohabitans; s__
The data includes three things: a reference number for the species, how many times that species is in the sample, and the taxonomy of said species.
What I'm trying to do is add up all the times a sequence is found for a taxonomic family (designated as f_x in the data).
Here is an example of the desired output:
f__Fusobacteriaceae 1600
f__Alcaligenaceae 676
f__Comamonadaceae 321
This isn't for a class. I started learning python a few months ago, so I'm at least capable of looking up any suggestions. I know how it works out from doing it the slow way (copy & paste in excel), so this is for future reference.
If the lines in your file really look like this, you can do
from collections import defaultdict
import re

nums = defaultdict(int)
with open("file.txt") as f:
    for line in f:
        items = line.split(None, 2)  # Split twice on any whitespace
        if items[0].isdigit():
            key = re.search(r"f__\w+", items[2]).group(0)
            nums[key] += int(items[1])
Result:
>>> nums
defaultdict(<type 'int'>, {'f__Comamonadaceae': 321, 'f__Fusobacteriaceae': 1600,
'f__Alcaligenaceae': 676})
Yet another solution, using collections.Counter:
from collections import Counter

counter = Counter()

with open('data.txt') as f:
    # skip header line
    next(f)
    for line in f:
        # Strip line of extraneous whitespace
        line = line.strip()
        # Only process non-empty lines
        if line:
            # Split by consecutive whitespace, into 3 chunks (2 splits)
            otu_id, otu_sum, lineage = line.split(None, 2)
            # Split the lineage tree into a list of nodes
            lineage = [node.strip() for node in lineage.split(';')]
            # Extract family node (assuming there's only one)
            family = [node for node in lineage if node.startswith('f__')][0]
            # Increase count for this family by `otu_sum`
            counter[family] += int(otu_sum)

for family, count in counter.items():
    print("%s %s" % (family, count))
See the docs for str.split() for the details of the None argument (matching consecutive whitespace).
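For example, on one of the sample rows:
>>> '591820  1083  k__Bacteria; p__Fusobacteria; c__Fusobacteria'.split(None, 2)
['591820', '1083', 'k__Bacteria; p__Fusobacteria; c__Fusobacteria']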
Get all your raw data and process it first; that is, structure it, and then use the structured data for whatever operations you desire.
In case you have GBs of data, you can use Elasticsearch: feed it your raw data, query for your required string (in this case f_*), get all the entries, and add them up.
That's very doable with basic python. Keep the Library Reference under your pillow, as you'll want to refer to it often.
You'll likely end up doing something similar to this (I'll write it the longer, more-readable way -- there are ways to compress the code and do this quicker).
# Open up a file handle
file_handle = open('myfile.txt')

# Discard the header line
file_handle.readline()

# Make a dictionary to store sums
sums = {}

# Loop through the rest of the lines
for line in file_handle.readlines():
    # Strip off the pesky newline at the end of each line.
    line = line.strip()
    # Put each white-space delimited ... whatever ... into items of a list.
    line_parts = line.split()
    # Get the first column
    reference_number = line_parts[0]
    # Get the second column, convert it to an integer
    sum = int(line_parts[1])
    # Loop through the taxonomies (the rest of the 'columns' separated by whitespace)
    for taxonomy in line_parts[2:]:
        # skip it if it doesn't start with 'f_'
        if not taxonomy.startswith('f_'):
            continue
        # remove the pesky semi-colon
        taxonomy = taxonomy.strip(';')
        if taxonomy in sums:
            sums[taxonomy] += int(sum)
        else:
            sums[taxonomy] = int(sum)

# All done, do some fancy reporting. We'll leave sorting as an exercise to the reader.
for taxonomy in sums.keys():
    print("%s %d" % (taxonomy, sums[taxonomy]))

Modifying a txt file in Python 3

I am working on a school project to make a video club management program and I need some help. Here is what I am trying to do:
I have a txt file with the client data, in which there is this:
clientId:clientFirstName:clientLastName:clientPhoneNumber
The : is the separator in every data file.
And in the movie title data file I got this:
movieid:movieKindFlag:MovieName:MovieAvalaible:MovieRented:CopieInTotal
Where this is going: the rentedData file should contain this:
idClient:IdMovie:DateOfReturn
I am able to do this part. Where I fail due to lack of experience:
I need to actually make a container with 3 levels for the movie data file because I want to track the available and rented numbers (changing them when I rent a movie and when I return one).
The first level represents the whole file (calling it will print the whole file), the second level should hold each line in its own container, and the third should hold every word of the line in a container.
Here is an example of what I mean:
dataMovie = [[[movie id],[movie title],[MovieAvailable],[MovieRented],[CopieInTotal]],[[movie id],[movie title],[MovieAvailable],[MovieRented],[CopieInTotal]]]
I actually know that I can do this for a two-layer structure in this way:
DataMovie = []
MovieInfo = open('Data_Movie', 'r')

# Reading the file and putting it into a container
for ligne in MovieInfo:
    print(ligne, end='')
    words = ligne.split(":")
    DataMovie.append(words)

print(DataMovie)
MovieInfo.close()
It separates all the words into this:
[[MovieID],[MovieTitle],[movie id],[movie title],[MovieAvailable],[MovieRented],[CopieInTotal], [MovieID],[MovieTitle],[movie id],[movie title],[MovieAvailable],[MovieRented],[CopieInTotal]]
Each line is in the same container (second layer) but the lines are not separated. That's not very helpful, since I need to change specific information about the quantity available and the quantity rented, so that I can refuse to rent a movie when all of its copies are rented.
I think you should be using dictionaries to store your data, rather than just embedding lists on top of one another.
Here is a quick page about dictionaries.
http://www.network-theory.co.uk/docs/pytut/Dictionaries.html
So your data might look like
movieDictionary = {"movie_id": 234234, "movie title": "Iron Man",
                   "MovieAvailable": True, "MovieRented": False,
                   "CopieInTotal": 20}
Then when you want to retrieve a value.
movieDictionary["movie_id"]
would yield the value.
234234
you can also embed lists inside of a dictionary value.
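For instance, here is a sketch (with made-up values) of per-movie dictionaries collected under their ids, so one movie's counters can be looked up and updated in place:
# Sketch with made-up values: an outer dict keyed by movie_id, one inner
# dict per movie, so a specific movie can be modified in place.
movies = {
    234234: {"movie title": "Iron Man", "MovieAvailable": True,
             "MovieRented": False, "CopieInTotal": 20},
}
movies[234234]["MovieRented"] = True    # mark the movie as rented
print(movies[234234]["CopieInTotal"])   # 20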
Does this help answer your question?
If you have to use a txt file, storing it in XML format might make the task easier, since there already are several good XML parsers for Python.
For example ElementTree:
You could structure your data like this:
<?xml version="1.0"?>
<movies>
    <movie id="1">
        <type>movieKind</type>
        <name>firstmovie</name>
        <MovieAvalaible>True</MovieAvalaible>
        <MovieRented>False</MovieRented>
        <CopieInTotal>2</CopieInTotal>
    </movie>
    <movie id="2">
        <type>movieKind</type>
        <name>firstmovie2</name>
        <MovieAvalaible>True</MovieAvalaible>
        <MovieRented>False</MovieRented>
        <CopieInTotal>3</CopieInTotal>
    </movie>
</movies>
and then access and modify it like this:
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
search = root.findall('.//movie[#id="2"]')
for element in search:
rented = element.find('MovieRented')
rented.text = "False"
tree.write('data.xml')
What you are actually doing is creating three databases:
one for clients
one for movies
one for rentals
A relatively easy way to read text files with one record per line and a : separator is to create a csv.reader object. For storing the databases in your program, I would recommend lists of collections.namedtuple objects for the clients and the rentals.
from collections import namedtuple
import csv

Rental = namedtuple('Rental', ['client', 'movie', 'returndate'])

with open('rentals.txt', newline='') as rentalsfile:
    rentalsreader = csv.reader(rentalsfile, delimiter=':')
    rentals = [Rental(int(row[0]), int(row[1]), row[2]) for row in rentalsreader]
And a list of dictionaries for the movies:
with open('movies.txt', newline='') as moviesfile:
    moviesreader = csv.reader(moviesfile, delimiter=':')
    movies = [{'id': int(row[0]), 'kind': row[1], 'name': row[2],
               'rented': int(row[3]), 'total': int(row[4])} for row in moviesreader]
The main reason for using a list of dictionaries for the movies is that a named tuple is a tuple, and therefore immutable, and presumably you want to be able to change rented.
Referring to your comment on Daniel Rasmuson's answer: since you only put the values of the fields in the text files, you will have to hardcode the names of the fields into your program one way or another.
An alternative solution is to store the data in JSON files. Those are easily mapped to Python data structures.
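A minimal sketch of that alternative (the file name movies.json and the field names are assumptions):
import json

# Sketch only: the whole structure round-trips through json.dump/json.load.
movies = [{'id': 1, 'kind': 'movieKind', 'name': 'firstmovie',
           'rented': 0, 'total': 2}]
with open('movies.json', 'w') as f:
    json.dump(movies, f, indent=2)
with open('movies.json') as f:
    movies = json.load(f)    # back to the same Python structure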
This might be what you're looking for
# Using OrderedDict so we always get the items in the right order when iterating,
# so the values match up with the categories/headers.
from collections import OrderedDict as Odict

class DataContainer(object):

    def __init__(self, fileName):
        '''
        Loads the text file into a list. The first line is assumed to be a
        header line and is used to set the dictionary keys.
        Using OrderedDict fixes the iteration order of each dict, so the
        values match up with the headers again when written back out.
        '''
        self.file = fileName
        self.data = []
        with open(self.file, 'r') as content:
            self.header = next(content).rstrip('\n').split(':')
            for line in content:
                words = line.rstrip('\n').split(':')
                self.data.append(Odict(zip(self.header, words)))

    def __call__(self):
        '''Outputs the contents as a string that can be written back to the file'''
        lines = []
        lines.append(':'.join(self.header))
        for i in self.data:
            this_line = ':'.join(i.values())
            lines.append(this_line)
        newContent = '\n'.join(lines)
        return newContent

    def __getitem__(self, index):
        '''Allows index access: self[index]'''
        return self.data[index]

    def __setitem__(self, index, value):
        '''Allows editing of values: self[index]'''
        self.data[index] = value

d = DataContainer('data.txt')
d[0]['MovieAvalaible'] = 'newValue'  # Example of how to set the values

# Will print out a string with the contents
print(d())
