Python: Multiple Search in same Text-File

I have a huge text file, with data something like this:
Name : ABC
Bank : Bank1
Account-No : 01234567
Amount: 123456
Spouse : CDF
Name : ABD
Bank : Bank1
Account-No : 01234568
Amount: 12345
Spouse : BDF
Name : ABE
Bank : Bank2
Account-No : 01234569
Amount: 12344
Spouse : CDG
.
.
.
.
.
I need to fetch Account-No and Amount and then write them to a new file:
Account-No: 01234567
Amount : 123456
Account-No: 01234568
Amount : 12345
Account-No: 01234569
Amount : 12344
.
.
.
I tried to search the text file through mmap to get the position of Account-No, but I am not able
to get the next Account-No this way.
import mmap

fname = input("Enter the file name")
f1 = open(fname)
s = mmap.mmap(f1.fileno(), 0, access=mmap.ACCESS_READ)
if s.find(b'Account-No') != -1:
    r = s.find(b'Account-No')
f1.close()
In 'r' I have the first location of Account-No, but I am not able to search from (r+1) to get
the next Account-No.
I can put this in a loop, but I cannot get the exact mmap syntax to work.
Can anyone please help me with this, through mmap or any other method?
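For the mmap route specifically, mmap.find() accepts an optional start offset, so the search can resume just past the previous hit. A minimal Python 3 sketch (the file accounts.txt and its two records are invented for the demo):

```python
import mmap

# create a small demo file (stands in for the real input)
with open("accounts.txt", "w") as f:
    f.write("Name : ABC\nAccount-No : 01234567\nAmount: 123456\n"
            "Name : ABD\nAccount-No : 01234568\nAmount: 12345\n")

positions = []
with open("accounts.txt", "rb") as f1:
    s = mmap.mmap(f1.fileno(), 0, access=mmap.ACCESS_READ)
    r = s.find(b"Account-No")
    while r != -1:
        positions.append(r)
        # resume the search one byte past the previous match
        r = s.find(b"Account-No", r + 1)
    s.close()

print(positions)  # byte offsets of every "Account-No" in the file
```

From each offset you can then read forward to the end of the line to pull out the value itself.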

With pandas, we can do the following:
import pandas as pd

rowsOfLines = pd.read_table('my_file.txt', header=None)
with open('output_file.txt', 'w+') as file:
    for index, row in rowsOfLines.iterrows():
        splitLine = row.str.split()[0]
        if 'Account-No' in splitLine:
            file.write('{} \n'.format(row.to_string(index=False)))
        elif 'Amount:' in splitLine:
            file.write('{} \n'.format(row.to_string(index=False)))

Solution for huge files:
Here is a working example that you can easily customize by adding or removing field names in the "required_fields" list.
This solution can handle a massive file because the whole file is not read into memory at the same time.
import tempfile

# reproduce your input file
# for the purpose of having a
# working example
input_filename = None
with tempfile.NamedTemporaryFile(mode='w+', delete=False) as f_orig:
    input_filename = f_orig.name
    f_orig.write("""Name : ABC
Bank : Bank1
Account-No : 01234567
Amount: 123456
Spouse : CDF
Name : ABD
Bank : Bank1
Account-No : 01234568
Amount: 12345
Spouse : BDF
Name : ABE
Bank : Bank2
Account-No : 01234569
Amount: 12344
Spouse : CDG""")
    # start looking from the beginning of the file again
    f_orig.seek(0)
    # list the fields you want to keep
    required_fields = [
        'Account-No',
        'Amount',
    ]
    # filter and write, line by line
    result_filename = None
    with tempfile.NamedTemporaryFile(mode='w', delete=False) as f_result:
        result_filename = f_result.name
        # process one line at a time (memory efficient)
        while True:
            line = f_orig.readline()
            # check if we have reached the end of the file
            if not line:
                break
            for field_name in required_fields:
                # write fields of interest to the new file
                if field_name in line:
                    f_result.write(line)
        f_result.write('\n')  # just for formatting

# show the result
with open(result_filename, 'r') as f:
    print(f.read())
The result of this is:
Account-No : 01234567
Amount: 123456
Account-No : 01234568
Amount: 12345
Account-No : 01234569
Amount: 12344

Code:
listOfAllAccountsAndAmounts = []  # list to save all the accounts and amounts
searchTexts = ['Account-No', 'Amount']  # all the things you want to search for

with open('a.txt', 'r') as inFile:
    allLines = inFile.readlines()  # read all the lines
    # save the indexes of all lines that contain any of the words from the searchTexts list
    indexOfAccounts = [i for i, line in enumerate(allLines) if any(x in line for x in searchTexts)]
    for index in indexOfAccounts:
        listOfAllAccountsAndAmounts.append(allLines[index][:-1].split(': '))

print(listOfAllAccountsAndAmounts)
Output:
[['Account-No ', '01234567'], ['Amount', '123456'], ['Account-No ', '01234568'], ['Amount', '12345'], ['Account-No ', '01234569'], ['Amount', '12344']]
If you don't want to split the lines, save them as they are:
listOfAllAccountsAndAmounts.append(allLines[index])
Output:
['Account-No : 01234567\n', 'Amount: 123456\n', 'Account-No : 01234568\n', 'Amount: 12345\n', 'Account-No : 01234569\n', 'Amount: 12344\n']
I have written to a list in case you want to process the information further. You can also simply write the string directly to the new file, without using a list, as shown by @Arda.

You can read the whole text file into a list, then iterate over the list searching each line for the strings "Account-No" and "Amount", and write the matching lines to the other file.
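A minimal sketch of that readlines-based idea (the file names a.txt and out.txt and the demo content are assumptions):

```python
# fields to keep; lines starting with either string are copied through
wanted = ("Account-No", "Amount")

with open("a.txt", "w") as f:  # demo input, stands in for the real file
    f.write("Name : ABC\nAccount-No : 01234567\nAmount: 123456\nSpouse : CDF\n")

# read the whole file into a list of lines
with open("a.txt") as src:
    lines = src.readlines()

# keep only the lines that start with one of the wanted fields
matches = [line for line in lines if line.startswith(wanted)]

with open("out.txt", "w") as dst:
    dst.writelines(matches)
```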


Convert Excel to Yaml syntax in Python

I want to convert my data, which is in this form, to YAML syntax (preferably without using pandas or needing to install new libraries).
Sample data in excel :
users | name | uid | shell
user1 | nino | 8759 | /bin/ksh
user2 | vivo | 9650 | /bin/sh
Desired output format :
(YAML syntax output was shown as an image)
You can do it using file operations, since you are keen on "preferably without using pandas or needing to install new libraries".
Assumption: the "|" symbol indicates columns and is not a delimiter or separator.
Step 1
Save the excel file as CSV
Then run the code
Code
# STEP 1 : Save your excel file as CSV
ctr = 0
excel_filename = "Book1.csv"
yaml_filename = excel_filename.replace('csv', 'yaml')
users = {}
with open(excel_filename, "r") as excel_csv:
    for line in excel_csv:
        if ctr == 0:
            ctr += 1  # Skip the column header
        else:
            # save the csv as a dictionary
            user, name, uid, shell = line.replace(' ', '').strip().split(',')
            users[user] = {'name': name, 'uid': uid, 'shell': shell}
with open(yaml_filename, "w+") as yf:
    yf.write("users: \n")
    for u in users:
        yf.write(f"  {u} : \n")
        for k, v in users[u].items():
            yf.write(f"    {k} : {v}\n")
Output
users:
  user1 :
    name : nino
    uid : 8759
    shell : /bin/ksh
  user2 :
    name : vivo
    uid : 9650
    shell : /bin/sh
You can do this with pandas; in your case you would just use pd.read_excel instead of pd.read_csv:
import pandas as pd
import yaml

df = pd.read_csv('test.csv', sep='|')
df['user_col'] = 'users'
data = df.groupby('user_col')[['users', 'name', 'uid', 'shell']].apply(lambda x: x.set_index('users').to_dict(orient='index')).to_dict()
with open('newtree.yaml', "w") as f:
    yaml.dump(data, f)
Yaml file looks like this:
users:
  user1:
    name: nino
    shell: /bin/ksh
    uid: 8759
  user2:
    name: vivo
    shell: /bin/sh
    uid: 9650

How to count the number of occurrences of a string in a file and append it within another file

I need to count the number of occurrences of 'Product ID' in the .txt file and have the count written into that same file. I'm new to Python and trying to wrap my head around this. I have the counting working separately, but it prints the number to the command line after running the program (hence the print). I tried print(count) >> "hardDriveSummary.txt" and print >> count, "hardDriveSummary.txt", but can't get either to work.
# Read .xml file and put lines with row_name and Product ID into a new .txt file
search = 'row_name', 'Product ID'
# source file
with open('20190211-131516_chris_Hard_Drive_Order.xml') as f1:
    # output file
    with open('hardDriveSummary.txt', 'wt') as f2:
        lines = f1.readlines()
        for i, line in enumerate(lines):
            if line.startswith(search):
                f2.write("\n" + line)

# count how many occurrences of 'Product ID' in the .txt file
def main():
    file = open('hardDriveSummary.txt', 'r').read()
    team = "Product ID"
    count = file.count(team)
    print(count)

main()
Sample of hardDriveSummary.txt:
Name Country 1
Product ID : 600GB
Name Country 2
Product ID : 600GB
Name Country 1
Product ID : 450GB
Contents of .xml file:
************* Server Summary *************
Server serv01
label R720
asset_no CNT3NW1
Name Country 1
name.1 City1
Unnamed: 6 NaN
************* Drive Summary **************
ID : 0:1:0
State : Failed
Product ID : 600GB
Serial No. : 6SL5KF5G
************* Server Summary *************
Server serv02
label R720
asset_no BZYGT03
Name Country 2
name.1 City2
Unnamed: 6 NaN
************* Drive Summary **************
ID : 0:1:0
State : Failed
Product ID : 600GB
Serial No. : 6SL5K75G
************* Server Summary *************
Server serv03
label R720
asset_no 5GT4N51
Name Country 1
name.1 City1
Unnamed: 6 NaN
************* Drive Summary **************
ID : 0:1:0
State : Failed
Product ID : 450GB
Serial No. : 6S55K5MG
If you simply want to tack the counter value onto the end of the file, the following code should work:
def main():
    # 'a+' opens the file for reading and appending; the position starts at the
    # end of the file, so seek to the beginning before reading
    with open('hardDriveSummary.txt', 'a+') as f:
        term = "Product ID"
        f.seek(0)
        count = f.read().count(term)
        # in append mode every write goes to the end of the file
        f.write('\n' + str(count))
Since "Product ID" is two separate words, you can split the entire text into two-word groups (bigrams); the following code gives the expected result:
from collections import Counter

f = open(r"sample.py", "r")
words = f.read().split()
bigrams = zip(words, words[1:])
counts = Counter(bigrams)
data = {' '.join(k): v for k, v in dict(counts).items()}
if 'Product ID' in data:
    print('Count of "Product ID": ', data['Product ID'])
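To see the bigram counting in isolation, here is a small self-contained run on invented sample text:

```python
from collections import Counter

# sample text standing in for the file contents
text = "ID : 0:1:0\nState : Failed\nProduct ID : 600GB\nProduct ID : 450GB\n"

words = text.split()
# pair each word with its successor: ("ID", ":"), (":", "0:1:0"), ...
bigrams = zip(words, words[1:])
# count each two-word phrase
counts = Counter(" ".join(pair) for pair in bigrams)

print(counts["Product ID"])
```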

Reading data from a file in python, but taking specific data from the file throughout

Hi, I am trying to read data from a .dat file; the file has information like this:
1
Carmella Henderson
24.52
13.5
21.76
2
Christal Piper
14.98
11.01
21.75
3
Erma Park
12.11
13.51
18.18
4
Dorita Griffin
20.05
10.39
21.35
5
Marlon Holmes
18.86
13.02
13.36
From this data I need the person number, name and the first number, like so:
1 #person number
Marlon Holmes #Name
18.86 # First number
13.02
13.36
However, my code currently reads the data from the file but does not pick out these specific parts; it simply prints the whole file.
This is my code currently for this specific part:
def Cucumber_Scoreboard():
    with open('veggies_2015.dat', 'r') as f:
        count = 0
        for line in f:
            count **= 1
            if count % 2 == 0:
                print(line)
I'm not sure where it's going wrong. I tried to put the data from the file into a list and work from there, but had no success; any help would be greatly appreciated.
Whole file code if needed:
def menu():
    exit = False
    while not exit:
        print("To enter new competitor data, type new")
        print("To view the competition score boards, type Scoreboard")
        print("To view the Best Overall Growers Scoreboard, type Podium")
        print("To review this years and previous data, type Data review")
        print("Type quit to exit the program")
        choice = raw_input("Which option would you like?")
        if choice == 'new':
            new_competitor()
        elif choice == 'Scoreboard':
            scoreboard_menu()
        elif choice == 'Podium':
            podium_place()
        elif choice == 'Data review':
            data_review()
        elif choice == 'quit':
            print("Goodbye")
            raise SystemExit
"""Entering new competitor data: record competitor's name and vegtables lengths"""
def competitor_data():
    global competitor_num
    l = []
    print("How many competitors would you like to enter?")
    competitors = raw_input("Number of competitors:")
    num_competitors = int(competitors)
    for i in range(num_competitors):
        name = raw_input("Enter competitor name:")
        Cucumber = raw_input("Enter length of Cucumber:")
        Carrot = raw_input("Enter length of Carrot:")
        Runner_Beans = raw_input("Enter length of Runner Beans:")
        l.append(competitor_num)
        l.append(name)
        l.append(Cucumber)
        l.append(Carrot)
        l.append(Runner_Beans)
        competitor_num += 1
    return l
def new_competitor():
    with open('veggies_2016.txt', 'a') as f:
        for item in competitor_data():
            f.write("%s\n" % item)

def scoreboard_menu():
    exit = False
    print("Which vegetable would you like the scoreboard for?")
    vegetable = raw_input("Please type either Cucumber, Carrot or Runner Beans:")
    if vegetable == "Cucumber":
        Cucumber_Scoreboard()
    elif vegetable == "Carrot":
        Carrot_Scoreboard()
    elif vegetable == "Runner Beans":
        Runner_Beans_Scoreboard()

def Cucumber_Scoreboard():
    with open('veggies_2015.dat', 'r') as f:
        count = 0
        for line in f:
            count **= 1
            if count % 2 == 0:
                print(line)
This doesn't feel like the most elegant way of doing it, but if you're going line by line, you need an extra counter that makes nothing happen for a set number of "surplus" lines before resetting your counters. Note that excess_count only needs to be incremented once, because you want the final else to reset both counters; that branch also prints nothing, so it still results in a skipped line.
def Cucumber_Scoreboard():
    with open('name_example.txt', 'r') as f:
        count = 0
        excess_count = 0
        for line in f:
            if count < 3:
                print(line)
                count += 1
            elif count == 3 and excess_count < 1:
                excess_count += 1
            else:
                count = 0
                excess_count = 0
EDIT: Based on your comments, I have extended this answer. Really, what you have asked should be raised as a separate question because it is detached from your main question. As pointed out by jDo, this is not ideal code, because it will fail instantly if a blank line or missing data causes a line to be skipped artificially. Also, the new code is stuffed in around my initial answer. Use this only as an illustration of resetting counters and lists in loops; it's not stable enough for serious use.
from operator import itemgetter

def Cucumber_Scoreboard():
    with open('name_example.txt', 'r') as f:
        count = 0
        excess_count = 0
        complete_person_list = []
        sublist = []
        for line in f:
            if count < 3:
                print(line)
                sublist.append(line.replace('\n', ''))
                count += 1
            elif count == 3 and excess_count < 1:
                excess_count += 1
            else:
                count = 0
                excess_count = 0
                complete_person_list.append(sublist)
                sublist = []
        sorted_list = sorted(complete_person_list, key=itemgetter(2), reverse=True)
        return sorted_list

a = Cucumber_Scoreboard()
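An alternative to juggling counters is to slice the file into fixed-size records with itertools.islice. This is only a sketch under the assumption that every record is exactly five lines with no blanks (veggies.txt and its contents are demo data):

```python
from itertools import islice

with open("veggies.txt", "w") as f:  # demo data in the question's format
    f.write("1\nCarmella Henderson\n24.52\n13.5\n21.76\n"
            "2\nChristal Piper\n14.98\n11.01\n21.75\n")

records = []
with open("veggies.txt") as f:
    while True:
        chunk = list(islice(f, 5))  # next five lines, or an empty list at EOF
        if not chunk:
            break
        # keep person number, name and the first (cucumber) measurement
        number = chunk[0].strip()
        name = chunk[1].strip()
        cucumber = float(chunk[2])
        records.append((number, name, cucumber))

# sort by cucumber length, longest first (numeric, not string, comparison)
records.sort(key=lambda r: r[2], reverse=True)
print(records)
```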
You could make the program read the file line by line, collecting all the information. Since the data is in a known format (position, name, ...), you can skip the unneeded lines with file.readline(), which moves you to the next line.
Someone recently asked how to save the player scores for his/her game, and I ended up writing a quick demo. I didn't post it, though, since it would have been a little too helpful. In its current form it doesn't fit your game exactly, but maybe you can use it for inspiration. Whatever you do, not relying on line numbers, modulo and counting will save you a headache down the line (what if someone added an empty/extra line?).
There are advantages and drawbacks associated with all datatypes. If we compare your current data format (newline-separated values with no keys or category/column labels) to json, yours is actually more efficient in terms of space usage. You don't have any repetition. In key/value pair formats like json and Python dictionaries, you often repeat the keys over and over again. This makes it a human-readable format, it makes order insignificant, and it means that the entire thing could be written on a single line. However, it goes without saying that repeating all the keys for every player is not efficient. If there were 100,000 players and they all had a firstname, lastname, highscore and last_score, you'd be repeating these 4 words 100,000 times. This is where actual databases become the sane choice for data storage. In your case, though, I think json will suffice.
import json
import pprint

def scores_load(scores_json_file):
    """ you hand this function a filename and it returns a dictionary of scores """
    with open(scores_json_file, "r") as scores_json:
        return json.loads(scores_json.read())

def scores_save(scores_dict, scores_json_file):
    """ you hand this function a dictionary and a filename to save the scores """
    with open(scores_json_file, "w") as scores_json:
        scores_json.write(json.dumps(scores_dict))

# main dictionary of dictionaries - empty at first
scores_dict = {}

# a single player stat dictionary.
# add/remove keys and values at will
scores_dict["player1"] = {
    "firstname" : "whatever",
    "lastname" : "whateverson",
    "last_score" : 3,
    "highscore" : 12,
}

# another player stat dictionary
scores_dict["player2"] = {
    "firstname" : "something",
    "lastname" : "somethington",
    "last_score" : 5,
    "highscore" : 15,
}

# here, we save the main dictionary containing stats
# for both players in a json file called scores.json
scores_save(scores_dict, "scores.json")

# here, we load them again and turn them into a
# dictionary that we can easily read and write to
scores_dict = scores_load("scores.json")

# add a third player
scores_dict["player3"] = {
    "firstname" : "person",
    "lastname" : "personton",
    "last_score" : 2,
    "highscore" : 3,
}

# save the whole thing again
scores_save(scores_dict, "scores.json")

# print player2's highscore
print(scores_dict["player2"]["highscore"])

# we can also invent a new category (key/value pair) on the fly if we want to
# it doesn't have to exist for every player
scores_dict["player2"]["disqualified"] = True

# print the scores dictionary in a pretty/easily read format.
# this isn't necessary but just looks nicer
pprint.pprint(scores_dict)

"""
The contents of scores.json pretty-printed in my terminal:
$ cat scores.json | json_pp
{
   "player3" : {
      "firstname" : "person",
      "last_score" : 2,
      "lastname" : "personton",
      "highscore" : 3
   },
   "player2" : {
      "highscore" : 15,
      "lastname" : "somethington",
      "last_score" : 5,
      "firstname" : "something"
   },
   "player1" : {
      "firstname" : "whatever",
      "last_score" : 3,
      "lastname" : "whateverson",
      "highscore" : 12
   }
}
"""
Create a function that reads one "record" (5 lines) at a time, then call it repeatedly:
def read_data(in_file):
    rec = {}
    rec["num"] = next(in_file).strip()
    rec["name"] = next(in_file).strip()
    rec["cucumber"] = float(next(in_file).strip())
    # skip 2 lines
    next(in_file)
    next(in_file)
    return rec
EDIT: improved the code + added a usage example
The read_data() function reads one 5-line record from a file and returns its data as a dictionary. An example of using this function:
def Cucumber_Scoreboard():
    with open('veggies_2015.dat', 'r') as in_file:
        data = []
        try:
            while True:
                rec = read_data(in_file)
                data.append(rec)
        except StopIteration:
            pass
        data_sorted = sorted(data, key=lambda x: x["cucumber"])
        return data_sorted

cucumber = Cucumber_Scoreboard()
from pprint import pprint
pprint(cucumber)

Python: How to loop through blocks of lines and copy specific text within lines

Input file:
DATE: 07/01/15 # 0800 HYRULE HOSPITAL PAGE 1
USER: LINK Antibiotic Resistance Report
--------------------------------------------------------------------------------------------
Activity Date Range: 01/01/15 - 02/01/15
--------------------------------------------------------------------------------------------
HH0000000001 LINK,DARK 30/M <DIS IN 01/05> (UJ00000001) A001-01 0A ZELDA,PRINCESS MD
15:M0000001R COMP, Coll: 01/02/15-0800 Recd: 01/02/15-0850 (R#00000001) ZELDA,PRINCESS MD
Source: SPUTUM
PSEUDOMONAS FLUORESCENS LEVOFLOXACIN >=8 R
--------------------------------------------------------------------------------------------
HH0000000002 FAIRY,GREAT 25/F <DIS IN 01/06> (UJ00000002) A002-01 0A ZELDA,PRINCESS MD
15:M0000002R COMP, Coll: 01/03/15-2025 Recd: 01/03/15-2035 (R#00000002) ZELDA,PRINCESS MD
Source: URINE- STRAIGHT CATH
PROTEUS MIRABILIS CEFTRIAXONE-other R
--------------------------------------------------------------------------------------------
HH0000000003 MAN,OLD 85/M <DIS IN 01/07> (UJ00000003) A003-01 0A ZELDA,PRINCESS MD
15:M0000003R COMP, Coll: 01/04/15-1800 Recd: 01/04/15-1800 (R#00000003) ZELDA,PRINCESS MD
Source: URINE-CLEAN VOIDED SPEC
ESCHERICHIA COLI LEVOFLOXACIN >=8 R
--------------------------------------------------------------------------------------------
Completely new to programming/scripting and Python. How do you recommend looping through this sample input to grab specific text in the fields?
Each patient has a unique identifier (e.g. HH0000000001). I want to grab specific text from each line.
Output should look like:
Date|Time|Name|Account|Specimen|Source|Antibiotic
01/02/15|0800|LINK, DARK|HH0000000001|PSEUDOMONAS FLUORESCENS|SPUTUM|LEVOFLOXACIN
01/03/15|2025|FAIRY, GREAT|HH0000000002|PROTEUS MIRABILIS|URINE- STRAIGHT CATH|CEFTRIAXONE-other
Edit: My current code looks like this (disclaimer: I am fumbling around in the dark, so the code is not going to be pretty at all):
input = open('report.txt')
output = open('abx.txt', 'w')

date = ''  # Defining global variables outside of the loop
time = ''
name = ''
name_last = ''
name_first = ''
account = ''
specimen = ''
source = ''

output.write('Date|Time|Name|Account|Specimen|Source\n')

lines = input.readlines()
for index, line in enumerate(lines):
    print index, line
    if last_line_location:
        new_patient = True
        if not first_time_through:
            output.write("{}|{}|{}, {}|{}|{}|{}\n".format(
                'Date',  # temporary placeholder
                'Time',  # temporary placeholder
                name_last.capitalize(),
                name_first.capitalize(),
                account,
                'Specimen',  # temporary placeholder
                'Source'  # temporary placeholder
            ))
        last_line_location = False
        first_time_through = False
    for each in lines:
        if line.startswith('HH'):  # Extract account and name
            account = line.split()[0]
            name = line.split()[1]
            name_last = name.split(',')[0]
            name_first = name.split(',')[1]
            last_line_location = True

input.close()
output.close()
Currently, the output will skip the first patient and will only display information for the 2nd and 3rd patient. Output looks like this:
Date|Time|Name|Account|Specimen|Source
Date|Time|Fairy, Great|HH0000000002|Specimen|Source
Date|Time|Man, Old|HH0000000003|Specimen|Source
Please feel free to make suggestions on how to improve any aspect of this, including output style or overall strategy.
Your code actually works if you add...
last_line_location = True
first_time_through = True
...before your for loop.
You asked for pointers as well, though...
As has been suggested in the comments, you could look at the re module.
I've knocked something together that shows this. It may not be suitable for all data, because three records is a very small sample and I've made some assumptions.
The last item is also quite contrived, because there's nothing definite to search for (such as Coll or Source). It will fail if there are no spaces at the start of the final line, for example.
This code is merely a suggestion of another way of doing things:
import re

startflag = False
with open('report.txt', 'r') as infile:
    with open('abx.txt', 'w') as outfile:
        outfile.write('Date|Time|Name|Account|Specimen|Source|Antibiotic\n')
        for line in infile:
            if '---------------' in line:
                if startflag:
                    outfile.write('|'.join((date, time, name, account, spec, source, anti)) + '\n')
                else:
                    startflag = True
                continue
            if 'Activity' in line:
                startflag = False
            acc_name = re.findall(r'HH\d+ \w+,\w+', line)
            if acc_name:
                account, name = acc_name[0].split(' ')
            date_time = re.findall(r'(?<=Coll: ).+(?= Recd:)', line)
            if date_time:
                date, time = date_time[0].split('-')
            source_re = re.findall(r'(?<=Source: ).+', line)
            if source_re:
                source = source_re[0].strip()
            anti_spec = re.findall(r'^ +(?!Source)\w+ *\w+ + \S+', line)
            if anti_spec:
                stripped_list = anti_spec[0].strip().split()
                anti = stripped_list[-1]
                spec = ' '.join(stripped_list[:-1])
Output
Date|Time|Name|Account|Specimen|Source|Antibiotic
01/02/15|0800|LINK,DARK|HH0000000001|PSEUDOMONAS FLUORESCENS|SPUTUM|LEVOFLOXACIN
01/03/15|2025|FAIRY,GREAT|HH0000000002|PROTEUS MIRABILIS|URINE- STRAIGHT CATH|CEFTRIAXONE-other
01/04/15|1800|MAN,OLD|HH0000000003|ESCHERICHIA COLI|URINE-CLEAN VOIDED SPEC|LEVOFLOXACIN
Edit:
Obviously, the variables should be reset to some dummy value between writes in case of a corrupt record. Also, if there is no line of dashes after the last record, it won't get written as the code stands.
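To illustrate the lookbehind/lookahead pattern used above on a single sample line from the report:

```python
import re

line = "15:M0000001R COMP, Coll: 01/02/15-0800 Recd: 01/02/15-0850 (R#00000001)"

# (?<=Coll: ) requires "Coll: " immediately before the match,
# (?= Recd:) requires " Recd:" immediately after it; neither is captured
date_time = re.findall(r"(?<=Coll: ).+(?= Recd:)", line)
date, time = date_time[0].split("-")
print(date, time)
```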

Are there any tools that can build an object from text directly, like Google Protocol Buffers?

In most log-processing systems, the log is a tab-separated text file, with the schema of the file provided separately.
For example:
12 tom tom@baidu.com
3 jim jim@baidu.com
the schema is
id : uint64
name : string
email : string
In order to find records matching person.name == 'tom', the code is:
for each_line in sys.stdin:
    fields = each_line.strip().split('\t')
    if fields[1] == 'tom':  # magic number
        print each_line
There are a lot of magic numbers (1, 2, 3).
Are there tools like Google Protocol Buffers (which is for binary data), so that we can build the object from text directly?
message Person {
    uint64 id = 1;
    string name = 2;
    string email = 3;
}
So we can then build a person like this: person = lib.BuildFromText(line)
for each_line in sys.stdin:
    person = lib.BuildFromText(each_line)  # no magic number
    if person.name == 'tom':
        print each_line
import csv

Person = {
    'id': int,
    'name': str,
    'email': str
}

persons = []
for row in csv.reader(open('CSV_FILE_NAME', 'r'), delimiter='\t'):
    persons.append({item[0]: item[1](row[index]) for index, item in enumerate(Person.items())})
How is the lib.BuildFromText() function supposed to know how to name the fields? They are just values in the line you pass to it, right? Here is how to do it in Python:
import sys
from collections import namedtuple

Person = namedtuple('Person', 'id, name, email')

for each_line in sys.stdin:
    person = Person._make(each_line.strip().split('\t'))
    if person.name == 'tom':
        print each_line
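A self-contained Python 3 variant of the namedtuple approach that also converts field types; the converters tuple and the build_from_text helper are my additions for illustration, not part of the answer above:

```python
from collections import namedtuple

Person = namedtuple("Person", "id name email")
converters = (int, str, str)  # one converter per schema field (hypothetical schema)

def build_from_text(line):
    """Split a tab-separated line and apply per-field type conversion."""
    raw = line.rstrip("\n").split("\t")
    return Person._make(conv(value) for conv, value in zip(converters, raw))

# demo line invented for the example
person = build_from_text("12\ttom\ttom@baidu.com\n")
print(person.name, person.id)
```

With this, person.id is a real int and comparisons like person.name == 'tom' read naturally, with no magic indexes.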
