Python List to Dictionary from a file - python

I have a file of notes that I'm trying to convert to a dictionary. I got the script working, but it fails to output the data I'm looking for when there are repeated keys.
In short, the file holds commands or terms and their comments, separated by # as shown below. I take each line and split on the first #: the first column is the "key" and the rest is the comment or definition. Then I check for the magic word I'm looking for, parse and match it, and print it out.
Flashcards file as per below
> car # automobile 4 wheels and run
> washington dc # the capital of United States
> fedora # an operating distro
> cat file # reads the file
> car nissan # altima
> car nissan # altima ## first car
> car nissan # maxima
> car nissan # rougue
flashcard_dict = dict()
flashcard_file = open('FlashCards', 'r')
enter = input("Searching nemo: ")
firstcolumn_str_list = list()

for x in flashcard_file:
    flashcard_sprint = x.strip()
    flascard_clean = flashcard_sprint.split("#", 1)
    firstcolumn_str = flascard_clean[0]
    firstcolumn = firstcolumn_str.strip()
    firstcolumn_str_list.append(firstcolumn)
    secondcolumn = flascard_clean[1]
    flashcard_dict[firstcolumn] = secondcolumn

print
print ("###" * 3)
lista = list()

# this is version 4 - where lambda works but fails as it matches the string in all words.
# so if the word is "es" all patterns are matched that has "es" AND NOT the specific word
filter_object = filter(lambda a: enter in a, firstcolumn_str_list)
for x in filter_object:
    lista.append(x)
print (lista)

cc = 0
if cc < len(lista):
    for lambdatodictmatch in lista:
        if lambdatodictmatch in flashcard_dict:
            print (flashcard_dict[lambdatodictmatch])
        else:
            print ("NONEsense... nothing here")
else:
    print ("NONEsense... nothing here")
Again, it mostly works, but when I search for car nissan I expect four responses and instead I either get only the last value ("rougue") or that same value ("rougue") repeated four times.
What's the best way to accomplish this?

If you may have repeated keys then you should always use lists to keep the values, even a single value, because a plain assignment overwrites the previous entry for the same key:
if firstcolumn not in flashcard_dict:
    flashcard_dict[firstcolumn] = []
flashcard_dict[firstcolumn].append(secondcolumn)
instead of
flashcard_dict[firstcolumn] = secondcolumn
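The same pattern can be written with collections.defaultdict, which creates the empty list automatically on first access. A minimal sketch (not from the original answer) using a few of the question's sample lines:

import collections

flashcard_dict = collections.defaultdict(list)

lines = [
    "car nissan # altima",
    "car nissan # maxima",
    "car nissan # rougue",
]
for line in lines:
    key, value = (part.strip() for part in line.split("#", 1))
    flashcard_dict[key].append(value)

print(flashcard_dict["car nissan"])  # ['altima', 'maxima', 'rougue']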
EDIT:
Full working code with other changes:
First, I used shorter and more readable variable names.
I read the file at the start and later use a loop to ask for different cards.
I added a command !keys to display all keys, and !exit to exit the loop and finish the program.
list(sorted(flashcards.keys())) gives all keys from the dictionary, sorted and without repeats.
I used io only to simulate the file in memory, so everyone can simply copy and run this code (without creating the file FlashCards), but you should use open(...).
text = '''car # automobile 4 wheels and run
washington dc # the capital of United States
fedora # an operating distro
cat file # reads the file
car nissan # altima
car nissan # altima ## first car
car nissan # maxima
car nissan # rougue
'''

import io

# --- constants ---

DEBUG = True

# --- functions ---

def read_data(filename='FlashCards'):
    if DEBUG:
        print('[DEBUG] reading file')

    flashcards = dict()  # with `s` at the end because it keeps many flashcards

    #file_handler = open(filename)
    file_handler = io.StringIO(text)

    for line in file_handler:
        line = line.strip()
        parts = line.split("#", 1)
        key = parts[0].strip()
        value = parts[1].strip()
        if key not in flashcards:
            flashcards[key] = []
        flashcards[key].append(value)

    all_keys = list(sorted(flashcards.keys()))

    return flashcards, all_keys

# --- main ---

# - before loop -

# because the words `key` and `keys` are very similar and it is easy to make a mistake in code, I added the prefix `all_`
flashcards, all_keys = read_data()

print("#########")

# - loop -

while True:
    print()  # empty line to make output more readable
    enter = input("Searching nemo (or command: !keys, !exit): ").strip().lower()
    print()  # empty line to make output more readable

    if enter == '!exit':
        break
    elif enter == '!keys':
        #print( "\n".join(all_keys) )
        for key in all_keys:
            print('key>', key)
    elif enter.startswith('!'):
        print('unknown command:', enter)
    else:
        # keys which have `enter` only at the beginning
        #selected_keys = list(filter(lambda text: text.startswith(enter), all_keys))

        # keys which have `enter` in any place (at the beginning, in the middle, at the end)
        selected_keys = list(filter(lambda text: enter in text, all_keys))

        print('selected_keys:', selected_keys)

        if selected_keys:  # instead of `0 < len(selected_keys)`
            for key in selected_keys:
                # every key in `selected_keys` exists in `flashcards`, so there is no need to check it again
                print(key, '=>', flashcards[key])
        else:
            print("NONEsense... nothing here")

# - after loop -

print('bye')

Related

Extract Text from a word document

I am trying to scrape data from a Word document available at:
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the code below.
import docx

content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
    stat = content.paragraphs[i].text
    if 'Email' in stat:
        location.append(i)
for i in location:
    print(content.paragraphs[i].text)
I tried to follow the steps mentioned in:
How to read data from .docx file in python pandas?
I need to convert this into a data frame with all the columns mentioned above, but I am still facing issues with it.
There are some inconsistencies in the document - phone numbers start with Tel: sometimes, Tel.: other times, and even Te: once; one of the emails is just on the last line for its distributor without the Email: prefix; and the State isn't always on the last line. Still, for the most part, most of the data can be extracted with regex and/or splits.
The distributors are separated by empty lines, and the names are in a different color, so I defined this function to get the font color of any paragraph from its xml:
# from bs4 import BeautifulSoup
def getParaColor(para):
    try:
        return BeautifulSoup(
            para.paragraph_format.element.xml, 'xml'
        ).find('color').get('w:val')
    except:
        return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
# import re
def splitParas(paras):
    ptc = [(
        p.text, getParaColor(p), p.paragraph_format.element.xml
    ) for p in paras]

    curSectn = 'UNKNOWN'
    splitBlox = [{}]
    for pt, pc, px in ptc:
        # double-check for missing text
        xmlText = BeautifulSoup(px, 'xml').text
        xmlText = ' '.join([s for s in xmlText.split() if s != ''])
        if len(xmlText) > len(pt): pt = xmlText

        # initiate
        if not pt:
            if splitBlox[-1] != {}:
                splitBlox.append({})
            continue
        if pc == '20752E':
            curSectn = pt.strip()
            continue
        if splitBlox[-1] == {}:
            splitBlox[-1]['section'] = curSectn
            splitBlox[-1]['raw'] = []
            splitBlox[-1]['Name'] = []
            splitBlox[-1]['address_raw'] = []

        # collect
        splitBlox[-1]['raw'].append(pt)
        if pc == 'D12229':
            splitBlox[-1]['Name'].append(pt)
        elif re.search("^Te.*:.*", pt):
            splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
        elif re.search("^Mob.*:.*", pt):
            splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
        elif pt.startswith('Email:') or re.search(".*[@].*[.].*", pt):
            splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
        else:
            splitBlox[-1]['address_raw'].append(pt)

    # some cleanup
    if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
    for i in range(len(splitBlox)):
        addrsParas = splitBlox[i]['address_raw']  # for later

        # join lists into strings
        splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
        for k in ['raw', 'address_raw']:
            splitBlox[i][k] = '\n'.join(splitBlox[i][k])

        # search address for City, State and PostCode
        apLast = addrsParas[-1].split(',')[-1]
        maybeCity = [ap for ap in addrsParas if '–' in ap]
        if '–' not in apLast:
            splitBlox[i]['State'] = apLast.strip()
        if maybeCity:
            maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
            maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
            splitBlox[i]['City'] = maybeCity.strip()
            splitBlox[i]['PostCode'] = maybePIN.strip()

        # add mobile to tel
        if 'mobile_raw' in splitBlox[i]:
            if 'tel_raw' not in splitBlox[i]:
                splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
            else:
                splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
            del splitBlox[i]['mobile_raw']

        # split tel [as needed]
        if 'tel_raw' in splitBlox[i]:
            tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
            telNum = []
            for t in range(len(tel_i)):
                if '/' in tel_i[t]:
                    tns = [t.strip() for t in tel_i[t].split('/')]
                    tel1 = tns[0]
                    telNum.append(tel1)
                    for tn in tns[1:]:
                        telNum.append(tel1[:-1*len(tn)]+tn)
                else:
                    telNum.append(tel_i[t])
            splitBlox[i]['Tel_1'] = telNum[0]
            splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
    return splitBlox
(Since I was getting font color anyway, I decided to add another
column called "section" to put East/West/etc in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can be double checked manually at least.
The function combines "Mobile" into "Tel" even though they're extracted with separate regex.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
After this, you can just view as DataFrame with:
#import docx
#import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs)) # <--all Columns
pandas.DataFrame(splitParas(content.paragraphs))[[
    'section', 'Name', 'address_raw', 'City',
    'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]
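To keep the extracted table around rather than just viewing it, the same frame can be written to disk; a small sketch (the filename is arbitrary), assuming pandas is imported as in the commented lines above:

df = pandas.DataFrame(splitParas(content.paragraphs))
df.to_csv('distributors.csv', index=False)  # or df.to_excel(...), df.to_json(...), etc.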

Python tkinter - Searching for key word that is not a sub string

import keyword
from tkinter import END

def highlight(text):
    keywords = keyword.kwlist
    for kw in keywords:
        text.tag_remove(kw, 1.0, END)
        first = 1.0
        while True:
            first = text.search(
                kw, first,
                nocase=False,
                stopindex=END
            )
            if first is None or first == "":
                break
            first_splitted = first.split(".")
            if len(first_splitted) == 1:
                break
            last = f"{first_splitted[0]}.{int(first_splitted[1]) + len(kw)}"
            character_position_before_first = f"{first_splitted[0]}.{int(first_splitted[1]) - 1}"
            character_before_first = text.get(character_position_before_first)
            last_splitted = last.split(".")
            character_position_after_last = f"{last_splitted[0]}.{int(last_splitted[1])}"
            character_after_last = text.get(character_position_after_last)
            if not character_before_first.isspace() and not character_after_last.isspace():
                break
            text.tag_add(kw, first, last)
            first = last
        text.tag_config(
            kw,
            foreground="#aa71eb"
        )
Given the code above, I'm trying to highlight keywords in a text widget. The issue is that substrings are also being marked.
Example:
hello this is a test open works too lmao lol lol lol
would mark the is inside this as well as the standalone is.
I only want it to mark the second is, because the first is is a substring of this.
I have no clue why the code above is not working. Help would be appreciated.
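One way to avoid substring matches (a sketch, not from the original post) is to let the Text widget's own search do the word matching: with regexp=True, Tcl's \y escape matches a word boundary, so only whole-word occurrences are returned:

import keyword
import tkinter as tk

def highlight_whole_words(text):
    # sketch: tag Python keywords only when they appear as whole words
    for kw in keyword.kwlist:
        text.tag_remove(kw, "1.0", tk.END)
        first = "1.0"
        while True:
            # \y is Tcl's word-boundary escape; keywords are plain words,
            # so no further regex escaping is needed here
            first = text.search(rf"\y{kw}\y", first,
                                stopindex=tk.END, regexp=True)
            if not first:
                break
            line, col = first.split(".")
            last = f"{line}.{int(col) + len(kw)}"
            text.tag_add(kw, first, last)
            first = last
        text.tag_config(kw, foreground="#aa71eb")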

Building Abbreviations Dictionary from Text file

I would like to build a dictionary of abbreviations.
I have a text file with a lot of abbreviations. The text file looks like this (after import):
with open('abreviations.txt') as ab:
    ab_words = ab.read().splitlines()
An extract:
'ACE',
'Access Control Entry',
'ACK',
'Acknowledgement',
'ACORN',
'A Completely Obsessive Really Nutty person',
Now I want to build the dictionary, with every odd line as a dictionary key and every even line as the dictionary value.
Hence I should be able to write at the end:
ab_dict['ACE']
and get the result:
'Access Control Entry'
Also, how can I make it case-insensitive?
ab_dict['ace']
should yield the same result
'Access Control Entry'
In fact, it would be perfect if the output were also lower case:
'access control entry'
Here is a link to the text file: https://www.dropbox.com/s/91afgnupk686p9y/abreviations.txt?dl=0
Complete solution with custom ABDict class and Python's generator functionality:
class ABDict(dict):
    ''' Class representing a dictionary of abbreviations'''
    def __getitem__(self, key):
        v = dict.__getitem__(self, key.upper())
        return v.lower() if key.islower() else v

with open('abbreviations.txt') as ab:
    ab_dict = ABDict()
    while True:
        try:
            k = next(ab).strip()  # `key` line
            v = next(ab).strip()  # `value` line
            ab_dict[k] = v
        except StopIteration:
            break
Now, testing (with case-relative access):
print(ab_dict['ACE'])
print(ab_dict['ace'])
print('*' * 10)
print(ab_dict['WYTB'])
print(ab_dict['wytb'])
The output (consecutively):
Access Control Entry
access control entry
**********
Wish You The Best
wish you the best
Here's another solution based on the pairwise function from this solution:
from requests.structures import CaseInsensitiveDict

def pairwise(iterable):
    "s -> (s0, s1), (s2, s3), (s4, s5), ..."
    a = iter(iterable)
    return zip(a, a)

with open('abreviations.txt') as reader:
    abr_dict = CaseInsensitiveDict()
    for abr, full in pairwise(reader):
        abr_dict[abr.strip()] = full.strip()
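If the file has already been read into a list with splitlines() (as in the question), the same odd/even pairing can also be done with slicing; a minimal sketch (lower-casing the values as the question asks, and upper-casing the stored keys so a lookup only needs .upper(), unlike the case-insensitive dicts above):

with open('abreviations.txt') as ab:
    ab_words = ab.read().splitlines()

# odd lines (index 0, 2, 4, ...) are keys, even lines are values
ab_dict = {k.strip().upper(): v.strip().lower()
           for k, v in zip(ab_words[::2], ab_words[1::2])}

print(ab_dict['ace'.upper()])  # access control entry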
Here is an answer that also allows abbreviations within a sentence to be replaced with words from the dictionary:
import re
from requests.structures import CaseInsensitiveDict

def read_file_dict(filename):
    """
    Reads file data into CaseInsensitiveDict
    """
    # lists for keys and values
    keys = []
    values = []
    # case-insensitive dict
    data = CaseInsensitiveDict()
    # count used for deciding which line we're on
    count = 1
    with open(filename) as file:
        temp = file.read().splitlines()
        for line in temp:
            # if the line count is even, a value is being read
            if count % 2 == 0:
                values.append(line)
            # otherwise, a key is being read
            else:
                keys.append(line)
            count += 1
    # Add to dictionary
    # perhaps some error checking here would be good
    for key, value in zip(keys, values):
        data[key] = value
    return data

def replace_word(ab_dict, sentence):
    """
    Replaces abbreviations in a sentence with words found in dictionary
    """
    # not necessarily words, but you get the idea
    words = re.findall(r"[\w']+|[.,!?; ]", sentence)
    new_words = []
    for word in words:
        # if word is in dictionary, replace it and add it to resulting list
        if word in ab_dict:
            new_words.append(ab_dict[word])
        # otherwise add it as normal
        else:
            new_words.append(word)
    # return sentence with replaced words
    return "".join(x for x in new_words)

def main():
    ab_dict = read_file_dict("abreviations.txt")
    print(ab_dict)
    print(ab_dict['ACE'])
    print(ab_dict['Ace'])
    print(ab_dict['ace'])
    print(replace_word(ab_dict, "The ACE is not easy to understand"))

if __name__ == '__main__':
    main()
Which outputs:
{'ACE': 'Access Control Entry', 'ACK': 'Acknowledgement', 'ACORN': 'A Completely Obsessive Really Nutty person'}
Access Control Entry
Access Control Entry
Access Control Entry
The Access Control Entry is not easy to understand

Reading data from a file in python, but taking specific data from the file throughout

Hi, I am trying to read data from a .dat file; the file has information like this:
1
Carmella Henderson
24.52
13.5
21.76
2
Christal Piper
14.98
11.01
21.75
3
Erma Park
12.11
13.51
18.18
4
Dorita Griffin
20.05
10.39
21.35
5
Marlon Holmes
18.86
13.02
13.36
From this data I need the person number, name and the first number, like so:
1 #person number
Marlon Holmes #Name
18.86 # First number
13.02
13.36
However, my code currently reads the data from the file but does not extract these specific parts; it simply prints the whole file.
This is my code currently for this specific part:
def Cucumber_Scoreboard():
    with open('veggies_2015.dat', 'r') as f:
        count = 0
        for line in f:
            count **= 1
            if count % 2 == 0:
                print (line)
I'm not sure where it's going wrong. I tried to put the data from the file into a list and work from there, but had no success; any help would be greatly appreciated.
Whole file code if needed:
def menu():
    exit = False
    while not exit:
        print("To enter new competitior data, type new")
        print("To view the competition score boards, type Scoreboard")
        print("To view the Best Overall Growers Scoreboard, type Podium")
        print("To review this years and previous data, type Data review")
        print("Type quit to exit the program")
        choice = raw_input("Which option would you like?")
        if choice == 'new':
            new_competitor()
        elif choice == 'Scoreboard':
            scoreboard_menu()
        elif choice == 'Podium':
            podium_place()
        elif choice == 'Data review':
            data_review()
        elif choice == 'quit':
            print("Goodbye")
            raise SystemExit

"""Entering new competitor data: record competitor's name and vegtables lengths"""

def competitor_data():
    global competitor_num
    l = []
    print("How many competitors would you like to enter?")
    competitors = raw_input("Number of competitors:")
    num_competitors = int(competitors)
    for i in range(num_competitors):
        name = raw_input("Enter competitor name:")
        Cucumber = raw_input("Enter length of Cucumber:")
        Carrot = raw_input("Enter length of Carrot:")
        Runner_Beans = raw_input("Enter length of Runner Beans:")
        l.append(competitor_num)
        l.append(name)
        l.append(Cucumber)
        l.append(Carrot)
        l.append(Runner_Beans)
        competitor_num += 1
    return (l)

def new_competitor():
    with open('veggies_2016.txt', 'a') as f:
        for item in competitor_data():
            f.write("%s\n" %(item))

def scoreboard_menu():
    exit = False
    print("Which vegetable would you like the scoreboard for?")
    vegetable = raw_input("Please type either Cucumber, Carrot or Runner Beans:")
    if vegetable == "Cucumber":
        Cucumber_Scoreboard()
    elif vegetable == "Carrot":
        Carrot_Scoreboard()
    elif vegetable == "Runner Beans":
        Runner_Beans_Scoreboard()

def Cucumber_Scoreboard():
    with open('veggies_2015.dat', 'r') as f:
        count = 0
        for line in f:
            count **= 1
            if count % 2 == 0:
                print (line)
This doesn't feel like the most elegant way of doing it, but if you're going line by line, you need an extra counter which does nothing for a set number of "surplus" lines before resetting your counters. Note that excess_count only needs to be incremented once, because you want the final else to reset both counters; again, that branch does not print anything but still consumes a skipped line.
def Cucumber_Scoreboard():
    with open('name_example.txt', 'r') as f:
        count = 0
        excess_count = 0
        for line in f:
            if count < 3:
                print (line)
                count += 1
            elif count == 3 and excess_count < 1:
                excess_count += 1
            else:
                count = 0
                excess_count = 0
EDIT: Based on your comments, I have extended this answer. Really, what you have asked should be raised as another question because it is detached from your main question. As pointed out by jDo, this is not ideal code: it will fail instantly if a blank line or missing data causes a line to be skipped artificially. Also, the new code is stuffed in around my initial answer. Use it only as an illustration of resetting counters and lists in loops; it's not stable enough for serious things.
from operator import itemgetter

def Cucumber_Scoreboard():
    with open('name_example.txt', 'r') as f:
        count = 0
        excess_count = 0
        complete_person_list = []
        sublist = []
        for line in f:
            if count < 3:
                print (line)
                sublist.append(line.replace('\n', ''))
                count += 1
            elif count == 3 and excess_count < 1:
                excess_count += 1
            else:
                count = 0
                excess_count = 0
                complete_person_list.append(sublist)
                sublist = []
    sorted_list = sorted(complete_person_list, key=itemgetter(2), reverse = True)
    return sorted_list

a = Cucumber_Scoreboard()
You could make the program read the file line by line, gathering all the information. Then, because the data is in a known format (e.g. position, name, ...), skip the unneeded lines with file.readline(), which simply moves you to the next line; a sketch of this idea follows.
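A minimal sketch of that readline-based approach, assuming the five-line record layout from the question (the function and variable names here are my own):

def read_scores(filename='veggies_2015.dat'):
    records = []
    with open(filename) as f:
        while True:
            number = f.readline().strip()
            if not number:              # readline() returns '' at end of file
                break
            name = f.readline().strip()
            cucumber = float(f.readline().strip())
            f.readline()                # skip the carrot length
            f.readline()                # skip the runner-beans length
            records.append((number, name, cucumber))
    return records

for number, name, cucumber in read_scores():
    print(number, name, cucumber)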
Someone recently asked how to save the player scores for his/her game and I ended up writing a quick demo. I didn't post it though since it would be a little too helpful. In its current form, it doesn't fit your game exactly but maybe you can use it for inspiration. Whatever you do, not relying on line numbers, modulo and counting will save you a headache down the line (what if someone added an empty/extra line?).
There are advantages and drawbacks to every data format. If we compare your current format (newline-separated values with no keys or category/column labels) to json, yours is actually more efficient in terms of space usage: you don't have any repetition. In key/value formats like json and python dictionaries, you often repeat the keys over and over again. This makes them human-readable, makes order insignificant, and means the entire thing could be written on a single line; but repeating all the keys for every player is clearly not efficient. If there were 100,000 players and they all had a firstname, lastname, highscore and last_score, you'd be repeating these 4 words 100,000 times. This is where actual databases become the sane choice for data storage. In your case, though, I think json will suffice.
import json
import pprint

def scores_load(scores_json_file):
    """ you hand this function a filename and it returns a dictionary of scores """
    with open(scores_json_file, "r") as scores_json:
        return json.loads(scores_json.read())

def scores_save(scores_dict, scores_json_file):
    """ you hand this function a dictionary and a filename to save the scores """
    with open(scores_json_file, "w") as scores_json:
        scores_json.write(json.dumps(scores_dict))

# main dictionary of dictionaries - empty at first
scores_dict = {}

# a single player stat dictionary.
# add/remove keys and values at will
scores_dict["player1"] = {
    "firstname" : "whatever",
    "lastname" : "whateverson",
    "last_score" : 3,
    "highscore" : 12,
}

# another player stat dictionary
scores_dict["player2"] = {
    "firstname" : "something",
    "lastname" : "somethington",
    "last_score" : 5,
    "highscore" : 15,
}

# here, we save the main dictionary containing stats
# for both players in a json file called scores.json
scores_save(scores_dict, "scores.json")

# here, we load them again and turn them into a
# dictionary that we can easily read and write to
scores_dict = scores_load("scores.json")

# add a third player
scores_dict["player3"] = {
    "firstname" : "person",
    "lastname" : "personton",
    "last_score" : 2,
    "highscore" : 3,
}

# save the whole thing again
scores_save(scores_dict, "scores.json")

# print player2's highscore
print scores_dict["player2"]["highscore"]

# we can also invent a new category (key/value pair) on the fly if we want to
# it doesn't have to exist for every player
scores_dict["player2"]["disqualified"] = True

# print the scores dictionary in a pretty/easily read format.
# this isn't necessary but just looks nicer
pprint.pprint(scores_dict)

"""
The contents of the scores.json pretty-printed in my terminal:
$ cat scores.json | json_pp
{
   "player3" : {
      "firstname" : "person",
      "last_score" : 2,
      "lastname" : "personton",
      "highscore" : 3
   },
   "player2" : {
      "highscore" : 15,
      "lastname" : "somethington",
      "last_score" : 5,
      "firstname" : "something"
   },
   "player1" : {
      "firstname" : "whatever",
      "last_score" : 3,
      "lastname" : "whateverson",
      "highscore" : 12
   }
}
"""
Create a function that reads one "record" (5 lines) at a time, then call it repeatedly:
def read_data(in_file):
    rec = {}
    rec["num"] = in_file.next().strip()
    rec["name"] = in_file.next().strip()
    rec["cucumber"] = float(in_file.next().strip())
    # skip 2 lines
    in_file.next()
    in_file.next()
    return rec
EDIT: improved the code + added a usage example
The read_data() function reads one 5-line record from a file and returns its data as a dictionary. An example of using this function:
def Cucumber_Scoreboard():
    with open('veggies_2015.dat', 'r') as in_file:
        data = []
        try:
            while True:
                rec = read_data(in_file)
                data.append(rec)
        except StopIteration:
            pass
    data_sorted = sorted(data, key = lambda x: x["cucumber"])
    return data_sorted

cucumber = Cucumber_Scoreboard()

from pprint import pprint
pprint(cucumber)
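The answer above is written for Python 2 (the question uses raw_input, and in_file.next() only exists there). Under Python 3, file objects have no .next() method, so a sketch of the same record reader would use the built-in next() instead; everything else, including the StopIteration handling, stays the same:

def read_data(in_file):
    rec = {}
    rec["num"] = next(in_file).strip()
    rec["name"] = next(in_file).strip()
    rec["cucumber"] = float(next(in_file).strip())
    # skip the carrot and runner-bean lines of the record
    next(in_file)
    next(in_file)
    return rec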

Python: How to loop through blocks of lines and copy specific text within lines

Input file:
DATE: 07/01/15 # 0800 HYRULE HOSPITAL PAGE 1
USER: LINK Antibiotic Resistance Report
--------------------------------------------------------------------------------------------
Activity Date Range: 01/01/15 - 02/01/15
--------------------------------------------------------------------------------------------
HH0000000001 LINK,DARK 30/M <DIS IN 01/05> (UJ00000001) A001-01 0A ZELDA,PRINCESS MD
15:M0000001R COMP, Coll: 01/02/15-0800 Recd: 01/02/15-0850 (R#00000001) ZELDA,PRINCESS MD
Source: SPUTUM
PSEUDOMONAS FLUORESCENS LEVOFLOXACIN >=8 R
--------------------------------------------------------------------------------------------
HH0000000002 FAIRY,GREAT 25/F <DIS IN 01/06> (UJ00000002) A002-01 0A ZELDA,PRINCESS MD
15:M0000002R COMP, Coll: 01/03/15-2025 Recd: 01/03/15-2035 (R#00000002) ZELDA,PRINCESS MD
Source: URINE- STRAIGHT CATH
PROTEUS MIRABILIS CEFTRIAXONE-other R
--------------------------------------------------------------------------------------------
HH0000000003 MAN,OLD 85/M <DIS IN 01/07> (UJ00000003) A003-01 0A ZELDA,PRINCESS MD
15:M0000003R COMP, Coll: 01/04/15-1800 Recd: 01/04/15-1800 (R#00000003) ZELDA,PRINCESS MD
Source: URINE-CLEAN VOIDED SPEC
ESCHERICHIA COLI LEVOFLOXACIN >=8 R
--------------------------------------------------------------------------------------------
Completely new to programming/scripting and Python. How do you recommend looping through this sample input to grab specific text in the fields?
Each patient has a unique identifier (e.g. HH0000000001). I want to grab specific text from each line.
Output should look like:
Date|Time|Name|Account|Specimen|Source|Antibiotic
01/02/15|0800|LINK, DARK|HH0000000001|PSEUDOMONAS FLUORESCENS|SPUTUM|LEVOFLOXACIN
01/03/15|2025|FAIRY, GREAT|HH0000000002|PROTEUS MIRABILIS|URINE- STRAIGHT CATH|CEFTRIAXONE-other
Edit: My current code looks like this:
(Disclaimer: I am fumbling around in the dark, so the code is not going to be pretty at all.)
input = open('report.txt')
output = open('abx.txt', 'w')

date = ''  # Defining global variables outside of the loop
time = ''
name = ''
name_last = ''
name_first = ''
account = ''
specimen = ''
source = ''

output.write('Date|Time|Name|Account|Specimen|Source\n')

lines = input.readlines()
for index, line in enumerate(lines):
    print index, line
    if last_line_location:
        new_patient = True
        if not first_time_through:
            output.write("{}|{}|{}, {}|{}|{}|{}\n".format(
                'Date',      # temporary placeholder
                'Time',      # temporary placeholder
                name_last.capitalize(),
                name_first.capitalize(),
                account,
                'Specimen',  # temporary placeholder
                'Source'     # temporary placeholder
            ) )
        last_line_location = False
        first_time_through = False
    for each in lines:
        if line.startswith('HH'):  # Extract account and name
            account = line.split()[0]
            name = line.split()[1]
            name_last = name.split(',')[0]
            name_first = name.split(',')[1]
            last_line_location = True

input.close()
output.close()
Currently, the output will skip the first patient and will only display information for the 2nd and 3rd patient. Output looks like this:
Date|Time|Name|Account|Specimen|Source
Date|Time|Fairy, Great|HH0000000002|Specimen|Source
Date|Time|Man, Old|HH0000000003|Specimen|Source
Please feel free to make suggestions on how to improve any aspect of this, including output style or overall strategy.
Your code actually works if you add...
last_line_location = True
first_time_through = True
...before your for loop
You asked for pointers as well though...
As has been suggested in the comments, you could look at the re module.
I've knocked something together that shows this. It may not be suitable for all data because three records is a very small sample, and I've made some assumptions.
The last item is also quite contrived because there's nothing definite to search for (such as Coll, Source). It will fail if there are no spaces at the start of the final line, for example.
This code is merely a suggestion of another way of doing things:
import re

startflag = False
with open('report.txt','r') as infile:
    with open('abx.txt','w') as outfile:
        outfile.write('Date|Time|Name|Account|Specimen|Source|Antibiotic\n')
        for line in infile:
            if '---------------' in line:
                if startflag:
                    outfile.write('|'.join((date, time, name, account, spec, source, anti))+'\n')
                else:
                    startflag = True
                continue
            if 'Activity' in line:
                startflag = False
            acc_name = re.findall('HH\d+ \w+,\w+', line)
            if acc_name:
                account, name = acc_name[0].split(' ')
            date_time = re.findall('(?<=Coll: ).+(?= Recd:)', line)
            if date_time:
                date, time = date_time[0].split('-')
            source_re = re.findall('(?<=Source: ).+',line)
            if source_re:
                source = source_re[0].strip()
            anti_spec = re.findall('^ +(?!Source)\w+ *\w+ + \S+', line)
            if anti_spec:
                stripped_list = anti_spec[0].strip().split()
                anti = stripped_list[-1]
                spec = ' '.join(stripped_list[:-1])
Output
Date|Time|Name|Account|Specimen|Source|Antibiotic
01/02/15|0800|LINK,DARK|HH0000000001|PSEUDOMONAS FLUORESCENS|SPUTUM|LEVOFLOXACIN
01/03/15|2025|FAIRY,GREAT|HH0000000002|PROTEUS MIRABILIS|URINE- STRAIGHT CATH|CEFTRIAXONE-other
01/04/15|1800|MAN,OLD|HH0000000003|ESCHERICHIA COLI|URINE-CLEAN VOIDED SPEC|LEVOFLOXACIN
Edit:
Obviously, the variables should be reset to some dummy value between writes in case of a corrupt record. Also, if there is no line of dashes after the last record, it won't get written as it stands.
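A minimal sketch of both fixes (the field names match the answer's variables; the placeholder string is an assumption):

# reset the fields after every write, and flush the final record after the
# loop in case the report does not end with a line of dashes
FIELDS = ('date', 'time', 'name', 'account', 'spec', 'source', 'anti')

def blank_record():
    return dict.fromkeys(FIELDS, 'N/A')   # the placeholder value is arbitrary

rec = blank_record()
# inside the '---------------' branch, right after the outfile.write(...):
#     rec = blank_record()
# and after the for-loop over infile:
#     if startflag:
#         outfile.write('|'.join(rec[f] for f in FIELDS) + '\n')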
