Need help writing a regex - python

I need a regex that reads a file with blast information.
The file looks like:
****ALIGNMENT****
Sequence: gi|516137619|ref|WP_017568199.1| hypothetical protein [Nocardiopsis synnemataformans]
length: 136
E_value: 8.9548e-11
score: 153.0
bit_score: 63.5438
identities: 35
positives: 42
gaps: 6
align_length: 70
query: MIRIHPASRDPQTLLDPENWRSAAWNGAPIRDCRGCIDCCDDDWNRSEPEWRRCYGEHLAEDVRHGVAVC...
match: MIRI A+RD LLDP NW S W+ A R CRGC DC + +CYGE + +DVRHGV+VC...
sbjct: MIRIDRANRDHAELLDPANWLSFHWSNAT-RACRGCDDC-----GGTTETLVQCYGEGVVDDVRHGVSVC...
I already have some code, but this file contains extra data. The variable names, with their corresponding values in this example, are:
hitsid = 516137619
protein = hypothetical protein
organism = Nocardiopsis synnemataformans
length = 136
evalue = 8.9548e-11
score = 153.0
bitscore = 63.5438
identities = 35
positives = 42
gaps = 6
query = MIRIHPASRDPQTLLDPENWRSAAWNGAPIRDCRGCIDCCDDDWNRSEPEWRRCYGEHLAEDVRHGVAVC...
match = MIRI A+RD LLDP NW S W+ A R CRGC DC + +CYGE + +DVRHGV+VC...
subject = MIRIDRANRDHAELLDPANWLSFHWSNAT-RACRGCDDC-----GGTTETLVQCYGEGVVDDVRHGVSVC...
I'm looking for something like the regex below, which I already have, but the file now contains some extra fields:
p = re.compile(r'^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)', re.MULTILINE)
File looks like:
****ALIGNMENT****
Sequence: gi|516137619|ref|WP_017568199.1| hypothetical protein [Nocardiopsis synnemataformans]
length: 136
E_value: 8.9548e-11
score: 153.0
bit_score: 63.5438
identities: 35
positives: 42
gaps: 6
align_length: 70
query: MIRIHPASRDPQTLLDPENWRSAAWNGAPIRDCRGCIDCCDDDWNRSEPEWRRCYGEHLAEDVRHGVAVC...
match: MIRI A+RD LLDP NW S W+ A R CRGC DC + +CYGE + +DVRHGV+VC...
sbjct: MIRIDRANRDHAELLDPANWLSFHWSNAT-RACRGCDDC-----GGTTETLVQCYGEGVVDDVRHGVSVC...
****ALIGNMENT****
Sequence: gi|962700925|ref|BC_420072443.1| Protein crossbronx-like [Nocardiopsis synnemataformans]
length: 136
E_value: 8.9548e-11
score: 153.0
bit_score: 63.5438
identities: 35
positives: 42
gaps: 6
align_length: 70
query: MIRIHPASRDPQTLLDPENWRSAAWNGAPIRDCRGCIDCCDDDWNRSEPEWRRCYGEHLAEDVRHGVAVC...
match: MIRI A+RD LLDP NW S W+ A R CRGC DC + +CYGE + +DVRHGV+VC...
sbjct: MIRIDRANRDHAELLDPANWLSFHWSNAT-RACRGCDDC-----GGTTETLVQCYGEGVVDDVRHGVSVC...
****ALIGNMENT****
Sequence: gi|516137619|ref|WP_017568199.1| hypothetical protein [Nocardiopsis synnemataformans]
length: 136
E_value: 8.9548e-11
score: 153.0
bit_score: 63.5438
identities: 35
positives: 42
gaps: 6
align_length: 70
query: MIRIHPASRDPQTLLDPENWRSAAWNGAPIRDCRGCIDCCDDDWNRSEPEWRRCYGEHLAEDVRHGVAVC...
match: MIRI A+RD LLDP NW S W+ A R CRGC DC + +CYGE + +DVRHGV+VC...
sbjct: MIRIDRANRDHAELLDPANWLSFHWSNAT-RACRGCDDC-----GGTTETLVQCYGEGVVDDVRHGVSVC...

You don't need a regex for this:
parsed = []
with open('tmp9.txt', 'r') as f:
    raw_parts = f.read().split('****ALIGNMENT****')
for raw_part in raw_parts:
    parsed_dict = {}
    for line in raw_part.split('\n'):
        # Split on the first colon only; lines without a colon are skipped
        key, sep, value = line.partition(':')
        if sep:
            parsed_dict[key.strip()] = value.strip()
    if parsed_dict:
        parsed.append(parsed_dict)
print(parsed)
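If a regex is still preferred, note that the pattern in the question looks for `E-value:` while the file spells the field `E_value:`, so it can never match. The following is a minimal sketch of an adjusted pattern, assuming every record has a `Sequence:` line followed later by an `E_value:` line; the names `RECORD_RE`, `sample` and `records` are only illustrative.

```python
import re

# Same named groups as in the question, but matching the "E_value:" spelling
# that actually appears in the file.
RECORD_RE = re.compile(
    r'^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*'
    r'(?P<protein>[^\[\]]*?)\s*\[(?P<organism>[^\[\]]*)\]'
    r'[\s\S]*?^E_value:\s*(?P<evalue>\S+)',
    re.MULTILINE,
)

sample = """****ALIGNMENT****
Sequence: gi|516137619|ref|WP_017568199.1| hypothetical protein [Nocardiopsis synnemataformans]
length: 136
E_value: 8.9548e-11
score: 153.0
****ALIGNMENT****
Sequence: gi|962700925|ref|BC_420072443.1| Protein crossbronx-like [Nocardiopsis synnemataformans]
length: 136
E_value: 8.9548e-11
score: 153.0
"""

# One dict per alignment record, keyed by the group names.
records = [m.groupdict() for m in RECORD_RE.finditer(sample)]
for record in records:
    print(record)
```

The remaining fields (score, identities, and so on) could be added as further named groups in the same way, or taken from the split-based dictionary above.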


dataframe put must be a unicode string, not 0, how give the string not the dataframe

I'm trying to manipulate a DataFrame, and I wrote a function to calculate the distance between two cities.
from opencage.geocoder import OpenCageGeocode
from geopy.distance import geodesic

def find_distance(A, B):
    key = 'YOUR_API_KEY'  # get api key from: https://opencagedata.com
    geocoder = OpenCageGeocode(key)
    result_A = geocoder.geocode(A)
    lat_A = result_A[0]['geometry']['lat']
    lng_A = result_A[0]['geometry']['lng']
    result_B = geocoder.geocode(B)
    lat_B = result_B[0]['geometry']['lat']
    lng_B = result_B[0]['geometry']['lng']
    return int(geodesic((lat_A, lng_A), (lat_B, lng_B)).kilometers)
This is my DataFrame:
2 32 Mulhouse 1874.0 2 797 16.8 16,3 € 10.012786
13 13 Saint-Étienne 1994.0 3 005 14.3 13,5 € 8.009882
39 39 Roubaix 2845.0 2 591 17.4 15,0 € 6.830968
27 27 Perpignan 2507.0 3 119 15.1 13,3 € 6.727255
40 40 Tourcoing 3089.0 2 901 17.5 15,3 € 6.327547
25 25 Limoges 2630.0 2 807 14.2 12,5 € 6.030424
20 20 Le Mans 2778.0 3 202 14.4 12,3 € 5.789559
Here is my code:
import pandas as pd

def clean_text(row):
    # return the list of decoded cells in the Series instead
    return [r.decode('unicode_escape').encode('ascii', 'ignore') for r in row]

def main():
    inFile = "prix_m2_france.xlsx"  # open the Excel file
    inSheetName = "Sheet1"  # name of the sheet
    cols = ['Ville', 'Prix_moyen', 'Loyer_moyen']  # the columns
    df = pd.read_excel(inFile, sheet_name=inSheetName)
    df[cols] = df[cols].replace({'€': '', ",": ".", " ": "", "\u202f": ""}, regex=True)
    # df['Prix_moyen'] = df.apply(clean_text)
    # df['Loyer_moyen'] = df.apply(clean_text)
    df['Prix_moyen'] = df['Prix_moyen'].astype(float)
    df['Loyer_moyen'] = df['Loyer_moyen'].astype(float)
    # df["Prix_moyen"] += 1
    df["revenu"] = (df['Loyer_moyen'] * 12) / (df["Prix_moyen"] * 1.0744) * 100
    # df['Ville'].replace({'Le-Havre': 'Le Havre', 'Le-Mans': 'Le Mans'})
    df["Ville"] = df['Ville'].replace(['Le-Havre', 'Le-Mans'], ['Le Havre', 'Le Mans'])
    df["distance"] = find_distance("Paris", df["Ville"])
    df2 = df.sort_values(by='revenu', ascending=False)
    print(df2.head(90))

main()
df["distance"] = find_distance("Paris", df["Ville"]) fails and gives me this error:
opencage.geocoder.InvalidInputError: Input must be a unicode string, not 0 Paris
1 Marseille
2 Lyon
3 T
I imagined it working like a loop that computes the distance between Paris and each city, but I guess it passes the whole DataFrame column as a single value.
Thanks for your help.
(Edit: I only pasted part of my DataFrame.)
find_distance expects a single city name, but df["Ville"] is a whole column, so call it once per city, for example:
df["distance"] = [find_distance("Paris", city) for city in df["Ville"]]
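Because each call to the real `find_distance` makes network requests, it may be worth caching results so that a city appearing twice is only looked up once. The sketch below illustrates the per-city loop with `functools.lru_cache`; the `find_distance` here is a hypothetical offline stand-in with placeholder distances (not real measurements), and `distances_from` is an illustrative helper name.

```python
from functools import lru_cache

# Placeholder values so the sketch runs offline; these are NOT real distances.
_FAKE_KM = {("Paris", "Lyon"): 392, ("Paris", "Marseille"): 661}
calls = []  # records every uncached lookup, to show the cache working


@lru_cache(maxsize=None)
def find_distance(a, b):
    """Hypothetical stand-in for the geocoding-based function in the question."""
    calls.append((a, b))
    return _FAKE_KM[(a, b)]


def distances_from(origin, cities):
    """Call find_distance once per city name, never on the whole column."""
    return [find_distance(origin, str(city)) for city in cities]


cities = ["Lyon", "Marseille", "Lyon"]
distances = distances_from("Paris", cities)
print(distances)
print(len(calls), "uncached lookups")
```

With a DataFrame, the same helper could be used as `df["distance"] = distances_from("Paris", df["Ville"])`, since iterating over a column yields one city name at a time.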

how to sum/aggregate by group without using pandas or import

I am not allowed to use any imports or libraries such as pandas, or groupby.
I have to categorize the data and sum the corresponding values. The data is in a CSV file.
For example:
**S** C **T**
A T 100
A. B 102
A. T. 200
A B. 100
C T 203
C. T. 200
C B 200
C T 200
C. B 200
My expected result is:
S C T
A T 300
A B. 202
C T 403
C B. 200
C T. 200
C B. 200
Considering that you have a csv file (i.e., columns split by comma):
with open('myfile.csv', 'r') as file:
    header = file.readline().rstrip()
    data = {}
    for row in file:
        state, candidate, value = row.split(',')
        k, value = (state, candidate), int(value)
        data[k] = data.get(k, 0) + value
result_csv = '\n'.join([header] + [f"{','.join(k)},{v}" for k,v in data.items()])
print(result_csv)
Output:
state,candidate,total votes
Alaska,Trump,300
Alaska,Biden,202
colorado,Trump,403
colorado,Biden,200
California,Trump,200
California,Biden,200
Original content of myfile.csv is (use str.replace if necessary):
state,candidate,total votes
Alaska,Trump,100
Alaska,Biden,102
Alaska,Trump,200
Alaska,Biden,100
colorado,Trump,203
colorado,Trump,200
colorado,Biden,200
California,Trump,200
California,Biden,200
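The same dictionary-based grouping can be wrapped in a function that works on any iterable of lines, which makes it easy to try without a file and lets it tolerate blank lines and stray spaces. This is only a sketch under the assumption that each row has exactly three comma-separated fields; `sum_by_group` and `rows` are illustrative names, and no imports are used.

```python
def sum_by_group(lines):
    """Sum the last column for each (state, candidate) pair, in first-seen order."""
    totals = {}
    for row in lines:
        row = row.strip()
        if not row:  # skip blank lines, e.g. a trailing newline at end of file
            continue
        state, candidate, value = [field.strip() for field in row.split(',')]
        key = (state, candidate)
        totals[key] = totals.get(key, 0) + int(value)
    return totals


rows = [
    "Alaska,Trump,100",
    "Alaska,Biden,102",
    "Alaska,Trump,200",
    "Alaska,Biden,100",
    "colorado,Trump,203",
    "colorado,Trump,200",
    "colorado,Biden,200",
    "",
]
totals = sum_by_group(rows)
for (state, candidate), total in totals.items():
    print(f"{state},{candidate},{total}")
```

With a file, the header would be read first with `readline()` and the open file object passed as `lines`.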
Alternatively, for whitespace-separated data:
mylist = []
with open("data", "r") as msg:
    for line in msg:
        mylist.append(line.strip().replace(".", ""))
headers = mylist[0].replace("*", "").split()
del mylist[0]
headers[2] = headers[2] + " " + headers[3]
mydict = {}
for line in mylist:
    state = line.split()[0]
    mydict[state] = {}
for line in mylist:
    state = line.split()[0]
    candidate = line.split()[1]
    mydict[state][candidate] = 0
for line in mylist:
    state = line.split()[0]
    candidate = line.split()[1]
    votes = line.split()[2]
    mydict[state][candidate] = mydict[state][candidate] + int(votes)
print("%-15s %-15s %-15s \n\n" % (headers[0], headers[1], headers[2]))
for state in mydict.keys():
    for candidate in mydict[state].keys():
        print("%-15s %-15s %-15s" % (state, candidate, str(mydict[state][candidate])))
Output:
state candidate total votes
Alaska Trump 300
Alaska Biden 202
colorado Trump 403
colorado Biden 200
California Trump 200
California Biden 200

How to combine three columns into one in python

The Excel screenshot shows how the final result should look. Please take a closer look at the "Lifestyle" section.
I can't figure out how to make my Python output match the Excel screenshot. The "Lifestyle" section needs two more sub-columns combined under it, as in the picture. Any help would be appreciated.
Here is my code:
#convert inches to feet-inches
def inch_to_feet(x):
    feet = x // 12
    inch = x % 12
    return str(feet)+"'"+str(inch)+'"'

#file opened
print("Hello")
roster = input("Please enter a roster file: ")
if roster != "roster_extended.csv":
    print("Invalid name")
elif roster == "roster_extended.csv":
    additional_name = input("There are 13 lines in this file. Would you like to enter an additional names? (Y/N): ")
    if additional_name == "Y":
        input("How many more names?: ")
infile = open("roster_extended.csv", 'r')
b = infile.readline()
b = infile.readlines()
header = '{0:>12} {1:>35} {2:>3} {3:>16} {4:>5} {5:>3} {6:>9}'.format("FirstName","LastName","Age","Occupation","Ht","Wt","lifestyle")
print(header)
with open("roster_extended.csv", "a+") as infile:
    b = infile.write(input("Enter first name: "))
for person in b:
    newperson = person.replace("\n", "").split(",")
    newperson[4] = eval(newperson[4])
    newperson[4] = inch_to_feet(newperson[4])
    newperson
    formatted='{0:>12} {1:>35} {2:>3} {3:>16} {4:>5} {5:>3} {6:>9}'.format(newperson[0],newperson[1],newperson[2],newperson[3],newperson[4],newperson[5],newperson[6])
    print(formatted)
Here is the output I get:
FirstName LastName Age Occupation Ht Wt lifestyle
Anna Barbara 35 nurse 5'3" 129
Catherine Do 45 physicist 5'5" 135
Eric Frederick 28 teacher 5'5" 140
Gabriel Hernandez 55 surgeon 5'7" 150 x
Ivy Joo 31 engineer 5'2" 126 x
Kelly Marks 21 student 5'4" 132
Nancy Owens 60 immunologist 5'8" 170 x
Patricia Qin 36 dental assistant 4'11" 110 x
Roderick Stevenson 51 bus driver 5'6" 160 x
Tracy Umfreville 42 audiologist 5'7" 156 x
Victoria Wolfeschlegelsteinhausenbergerdorff 38 data analyst 5'8" 158
Lucy Xi 49 professor 5'9" 161
Yolanda Zachary 58 secretary 5'10" 164 x
Brief explanation of the solution:
You gave tabulated data as input (there are several ways to tabulate it). Since you are starting out with Python, the solution stays within the standard library and does not resort to external libraries. Only format() and class variables are used to keep track of column widths (if you delete elements, you will want to update those variables). This automates the tabulation programmatically.
Since you are starting out, I recommend putting a breakpoint in __init__() and __new__() to observe their behavior.
I used Enum because conceptually it is the right tool for the job. You only need to understand Enum.name and Enum.value; for everything else, consider it a normal class.
There are two output files, one in tabulated form and the other in bare-bones CSV.
(For the most part the solution is "canonical", or close to it. The procedural part was rushed, but it gives a sufficient idea.)
import csv
import codecs
from enum import Enum
from pathlib import Path

IN_FILE = Path("C:\\your_path\\input.csv")
OUT_FILE = Path("C:\\your_path\\output1.csv")
OUT_FILE_TABULATE = Path("C:\\your_path\\output2.csv")

def read_csv(file) -> list:
    with open(file) as csv_file:
        reader_csv = csv.reader(csv_file, delimiter=',')
        for row in reader_csv:
            yield row

def write_file(file, result_ordered):
    with codecs.open(file, "w+", encoding="utf-8") as file_out:
        for s in result_ordered:
            file_out.write(s + '\n')
class LifeStyle(Enum):
    Sedentary = 1
    Active = 2
    Moderate = 3

    def to_list(self):
        list_life_style = list()
        for one_style in LifeStyle:
            if one_style is self:
                list_life_style.append('x')
            else:
                list_life_style.append('')
        return list_life_style

    def tabulate(self):
        str_list_life_style = list()
        for one_style in LifeStyle:
            if one_style is not self:
                str_list_life_style.append('{: ^{width}}'.format(' ', width=len(one_style.name)))
            else:
                str_list_life_style.append('{: ^{width}}'.format('x', width=len(self.name)))
        return str_list_life_style

    def tabulate_single_column(self):
        return '{: >{width}}'.format(str(self.name), width=len(LifeStyle.Sedentary.name))

    @staticmethod
    def header_single_column():
        return ' {}'.format(LifeStyle.__name__)

    @staticmethod
    def header():
        return ' {} {} {}'.format(
            LifeStyle.Sedentary.name,
            LifeStyle.Active.name,
            LifeStyle.Moderate.name,
        )
class Person:
    _FIRST_NAME = "First Name"
    _LAST_NAME = "Last Name"
    _AGE = "Age"
    _OCCUPATION = "Occupation"
    _HEIGHT = "Height"
    _WEIGHT = "Weight"
    max_len_first_name = len(_FIRST_NAME)
    max_len_last_name = len(_LAST_NAME)
    max_len_occupation = len(_OCCUPATION)

    def __new__(cls, first_name, last_name, age, occupation, height, weight, lifestyle):
        cls.max_len_first_name = max(cls.max_len_first_name, len(first_name))
        cls.max_len_last_name = max(cls.max_len_last_name, len(last_name))
        cls.max_len_occupation = max(cls.max_len_occupation, len(occupation))
        return super().__new__(cls)

    def __init__(self, first_name, last_name, age, occupation, height, weight, lifestyle):
        self.first_name = first_name
        self.last_name = last_name
        self.age = age
        self.occupation = occupation
        self.height = height
        self.weight = weight
        self.lifestyle = lifestyle

    @classmethod
    def _tabulate_(cls, first_name, last_name, age, occupation, height, weight):
        first_part = '{: >{m_first}} {: >{m_last}} {: >{m_age}} {: <{m_occup}} {: <{m_height}} {: >{m_weight}}'.format(
            first_name,
            last_name,
            age,
            occupation,
            height,
            weight,
            m_first=Person.max_len_first_name,
            m_last=Person.max_len_last_name,
            m_occup=Person.max_len_occupation,
            m_age=len(Person._AGE),
            m_height=len(Person._HEIGHT),
            m_weight=len(Person._WEIGHT))
        return first_part

    @classmethod
    def header(cls, header_life_style):
        first_part = Person._tabulate_(Person._FIRST_NAME, Person._LAST_NAME, Person._AGE, Person._OCCUPATION,
                                       Person._HEIGHT, Person._WEIGHT)
        return '{}{}'.format(first_part, header_life_style)

    def __str__(self):
        first_part = Person._tabulate_(self.first_name, self.last_name, self.age, self.occupation, self.height,
                                       self.weight)
        return '{}{}'.format(first_part, ' '.join(self.lifestyle.tabulate()))

    def single_column(self):
        first_part = Person._tabulate_(self.first_name, self.last_name, self.age, self.occupation, self.height,
                                       self.weight)
        return '{} {}'.format(first_part, self.lifestyle.tabulate_single_column())
def populate(persons_populate):
    for line in read_csv(IN_FILE):
        life_style = ''
        if line[6] == 'x':
            life_style = LifeStyle.Sedentary
        elif line[7] == 'x':
            life_style = LifeStyle.Moderate
        elif line[8] == 'x':
            life_style = LifeStyle.Active
        persons_populate.append(Person(line[0], line[1], line[2], line[3], line[4], line[5], life_style))
    return persons_populate

persons = populate(list())
print(Person.header(LifeStyle.header()))
for person in persons:
    print(person)
write_file(OUT_FILE_TABULATE, [str(item) for item in persons])

# add new persons here
persons.append(Person("teste", "teste", "22", "worker", "5'8\"", "110", LifeStyle.Active))

final_list = list()
for person in persons:
    one_list = [person.first_name, person.last_name, person.age, person.occupation, person.height,
                person.weight]
    one_list.extend([item.strip() for item in person.lifestyle.tabulate()])
    final_list.append(','.join(one_list))
write_file(OUT_FILE, final_list)

print("\n", Person.header(LifeStyle.header_single_column()))
for person in persons:
    print(person.single_column())
output1.csv:
Anna,Barbara,35,nurse,5'3",129,,,x
Catherine,Do,45,physicist,5'5",135,,x,
Eric,Frederick,28,teacher,5'5",140,,,x
Gabriel,Hernandez,55,surgeon,5'7",150,x,,
Ivy,Joo,31,engineer,5'2",126,x,,
Kelly,Marks,21,student,5'4",132,,x,
Nancy,Owens,60,immunologist,5'8",170,x,,
Patricia,Qin,36,dental assistant,4'11",110,x,,
Roderick,Stevenson,51,bus driver,5'6",160,x,,
Tracy,Umfreville,42,audiologist,5'7",156,x,,
Victoria,Wolfeschlegelsteinhausenbergerdorff,38,data analyst ,5'8",158,,,x
Lucy,Xi,49,professor,5'9",161,,,x
Yolanda,Zachary,58,secretary,5'10",164,x,,
teste,teste,22,worker,5'8",110,,x,
output2.csv:
Anna Barbara 35 nurse 5'3" 129 x
Catherine Do 45 physicist 5'5" 135 x
Eric Frederick 28 teacher 5'5" 140 x
Gabriel Hernandez 55 surgeon 5'7" 150 x
Ivy Joo 31 engineer 5'2" 126 x
Kelly Marks 21 student 5'4" 132 x
Nancy Owens 60 immunologist 5'8" 170 x
Patricia Qin 36 dental assistant 4'11" 110 x
Roderick Stevenson 51 bus driver 5'6" 160 x
Tracy Umfreville 42 audiologist 5'7" 156 x
Victoria Wolfeschlegelsteinhausenbergerdorff 38 data analyst 5'8" 158 x
Lucy Xi 49 professor 5'9" 161 x
Yolanda Zachary 58 secretary 5'10" 164 x
single_column:
Anna Barbara 35 nurse 5'3" 129 Moderate
Catherine Do 45 physicist 5'5" 135 Active
Eric Frederick 28 teacher 5'5" 140 Moderate
Gabriel Hernandez 55 surgeon 5'7" 150 Sedentary
Ivy Joo 31 engineer 5'2" 126 Sedentary
Kelly Marks 21 student 5'4" 132 Active
Nancy Owens 60 immunologist 5'8" 170 Sedentary
Patricia Qin 36 dental assistant 4'11" 110 Sedentary
Roderick Stevenson 51 bus driver 5'6" 160 Sedentary
Tracy Umfreville 42 audiologist 5'7" 156 Sedentary
Victoria Wolfeschlegelsteinhausenbergerdorff 38 data analyst 5'8" 158 Moderate
Lucy Xi 49 professor 5'9" 161 Moderate
Yolanda Zachary 58 secretary 5'10" 164 Sedentary
teste teste 22 worker 5'8" 110 Active
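For comparison, the core idea (one "Lifestyle" group made of three sub-columns, with an x under the matching one) can also be sketched without classes. This is a simplified illustration rather than a replacement for the solution above; `LIFESTYLES`, `lifestyle_cells` and the `divmod`-based `inch_to_feet` are illustrative names, and the height is assumed to arrive as a whole number of inches.

```python
LIFESTYLES = ("Sedentary", "Active", "Moderate")


def inch_to_feet(inches):
    """Convert inches (int or numeric string) to a feet'inches" string."""
    feet, inch = divmod(int(inches), 12)
    return f"{feet}'{inch}\""


def lifestyle_cells(chosen):
    """Return one cell per lifestyle, with 'x' centred under the chosen one."""
    return " ".join(
        ("x" if name == chosen else "").center(len(name)) for name in LIFESTYLES
    )


header = " ".join(LIFESTYLES)
row = lifestyle_cells("Active")
print(header)
print(row)
print(inch_to_feet("63"))
```

Each cell is padded to the width of its own sub-column heading, so the row always lines up with the header regardless of which lifestyle is marked. Using `int()` instead of `eval()` also avoids executing arbitrary text read from the file.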

Unable to format output in Python correctly

I am unable to format my output correctly in Python. My code is below, followed by a sample of the output. I am not sure why the spacing is off in some of the fields.
def main():
    golf_file = open('golf.txt', 'r') #open file
    first_name = golf_file.readline() #read first line
    print('First Name\tLast Name\tHandicap\tGolf Score\tOver, Under or Par') #print headings
    while first_name != '': #while statement for loop
        last_name = golf_file.readline()
        handicap = golf_file.readline()
        golf_score = golf_file.readline()
        #stripping newline from each string
        first_name = first_name.rstrip('\n')
        last_name = last_name.rstrip('\n')
        handicap = handicap.rstrip('\n')
        golf_score = golf_score.rstrip('\n')
        handicap_num = float(handicap)
        golfscore_num = int(golf_score)
        #if statement to determine if golf score is over, under or par
        if golfscore_num == 80:
            OverUnderPar = ('Par')
        elif golfscore_num < 80:
            OverUnderPar = ('Under Par')
        else:
            OverUnderPar = ('Over Par')
        #print info with two tabs for positioning.
        print( first_name, '\t''\t', last_name, '\t''\t', handicap_num, '\t', '\t', golfscore_num, '\t', '\t', OverUnderPar)
        first_name = golf_file.readline()
    golf_file.close() #close file
main()
First Name Last Name Handicap Golf Score Over, Under or Par
Andrew Marks 11.2 72 Under Par
Betty Franks 12.8 89 Over Par
Connie William 14.6 92 Over Par
Donny Ventura 9.9 78 Under Par
Ernie Turner 10.1 81 Over Par
Fred Smythe 8.1 75 Under Par
Greg Tucker 7.2 72 Under Par
Henry Zebulon 8.3 83 Over Par
Ian Fleming 4.2 72 Under Par
Jan Holden 7.7 84 Over Par
Kit Possum 7.9 79 Under Par
Landy Bern 10.3 93 Over Par
Mona Docker 11.3 98 Over Par
Kevin Niles 7.1 80 Par
Pam Stiles 10.9 87 Over Par
Russ Hunt 5.6 73 Under Par
If a name is too long (over 6 characters), the text after it is pushed one tab stop further along. You can either check whether the name is too long and emit one tab fewer, using something like
numTabs = '\t' * (2 - len(last_name)//6)
or, as a better approach, use something like str.format, as mentioned by @Michael Butscher in the comments.
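A minimal sketch of the fixed-width approach follows, using f-strings (which share the format specification of str.format). The column widths (12, 12, 10, 12) and the helper name `format_row` are assumptions chosen for this example; they would need to be widened if longer names appear.

```python
def format_row(first, last, handicap, score, par=80):
    """Build one output line with left-aligned, fixed-width columns."""
    if score == par:
        status = 'Par'
    elif score < par:
        status = 'Under Par'
    else:
        status = 'Over Par'
    # '<12' pads the value with spaces on the right up to 12 characters
    return f"{first:<12}{last:<12}{handicap:<10}{score:<12}{status}"


header = f"{'First Name':<12}{'Last Name':<12}{'Handicap':<10}{'Golf Score':<12}Over, Under or Par"
row = format_row("Connie", "William", 14.6, 92)
print(header)
print(row)
print(format_row("Ian", "Fleming", 4.2, 72))
```

Because every column is padded to a fixed width, the alignment no longer depends on where the terminal's tab stops fall.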

How to read a particular line of interest from a text file?

Here I have a text file. I want to read the Adress, Beneficiary, Beneficiary Bank, Acc Nbr, Total US$, the Date at the top, RUT, and BOX. I tried writing some code myself, but I am not able to reliably get the required information; moreover, if the length of the text changes, I will not get the correct output. How should I do this so that each required piece of information ends up in its own string?
The main problem arises when my slices go wrong. For example, I am using line[31:] for Acc Nbr, but if the address changes, my slicing will also be wrong.
My Text.txt
2014-11-09 BOX 1531 20140908123456 RUT 21 654321 0123
Girry S.A. CONTADO
G 5 Y Serie A
NO 098765
11 al Rayo 321 - Oqwerty 108 Monteaudio - Gruguay
Pharm Cosco, Inc - Britania PO Box 43215
Dirección Hot Springs AR 71903 - Estados Unidos
Oescripción Importe
US$
DO 7640183 - 50% of the Production Degree 246,123
Beneficiary Bank: Bankue Heritage (Gruguay) S.A Account Nbr: 1234563 Swift: MANIUYMM
Adress: Tencon 108 Monteaudio, Gruguay.
Beneficiary: Girry SA Acc Nbr: 1234567
Servicios prestados en el exterior, exentos de IVA o IRAE
Subtotal US$ 102,500
Iva US$ ---------------
Total US$ 102,500
I.V.A AL DIA Fecha de Vencimiento
IMPRENTA IRIS LTDA. - RUT 210161234015 - 0/40987 17/11/2015
CONSTANCIA N9 1234559842 -04/2013
CONTADO A 000.001/ A 000.050 x 2 VIAS
QWERTYAS ZXCVBIZADA
R. U.T. Bamprador Asdfumldor Final
Fecha 12/12/2014
1º ORIGINAL CLLLTE (Blanco) 2º CASIA AQWERVO (Rosasd)
My Code:
txt = 'Text.txt'
lines = [line.rstrip('\n') for line in open(txt)]
for line in lines:
    if 'BOX' in line:
        Date = line.split("BOX")[0]
        BOX = line.split('BOX ', 1)[-1].split("RUT")[0]
        RUT = line.split('RUT ', 1)[-1]
        print('Date : ' + Date)
        print('BOX : ' + BOX)
        print('RUT : ' + RUT)
    if 'Adress' in line:
        Adress = line[8:]
        print('Adress : ' + Adress)
    if 'NO ' in line:
        Invoice_No = line.split('NO ', 1)[-1]
        print('Invoice_No : ' + Invoice_No)
    if 'Swift:' in line:
        Swift = line.split('Swift: ', 1)[-1]
        print('Swift : ' + Swift)
    if 'Fecha' in line and '/' in line:
        Invoice_Date = line.split('Fecha ', 1)[-1]
        print('Invoice_Date : ' + Invoice_Date)
    if 'Beneficiary Bank' in line:
        Beneficiary_Bank = line[18:]
        Ben_Acc_Nbr = line.split('Nbr: ', 1)[-1]
        print('Beneficiary_Bank : ' + Beneficiary_Bank.split("Acc")[0])
        print('Ben_Acc_Nbr : ' + Ben_Acc_Nbr.split("Swift")[0])
    if 'Beneficiary' in line and 'Beneficiary Bank' not in line:
        Beneficiary = line[13:]
        print('Beneficiary : ' + Beneficiary.split("Acc")[0])
    if 'Acc Nbr' in line:
        Acc_Nbr = line.split('Nbr: ', 1)[-1]
        print('Acc_Nbr : ' + Acc_Nbr)
    if 'Total US$' in line:
        Total_US = line.split('US$ ', 1)[-1]
        print('Total_US : ' + Total_US)
Output:
Date : 2014-11-09
BOX : 1531 20140908123456
RUT : 21 654321 0123
Invoice_No : 098765
Swift : MANIUYMM
Beneficiary_Bank : Bankue Heritage (Gruguay) S.A
Ben_Acc_Nbr : 1234563
Adress : Tencon 108 Monteaudio, Gruguay.
Beneficiary : Girry SA
Acc_Nbr : 1234567
Total_US : 102,500
Invoice_Date : 12/12/2014
Some Code Changes
I have made some changes but still I am not convinced as I need to provide spaces also in split.
I would recommend using regular expressions to extract the information you need. That avoids having to calculate character offsets.
import re
with open(r'C:\Quad.txt') as f:
    for line in f:
        # \S+ captures the run of non-space characters after the label;
        # a lazy (.*?) at the end of a pattern would match an empty string
        match = re.search(r"Acc Nbr: (\S+)", line)
        if match is not None:
            Acc_Nbr = match.group(1)
            print(Acc_Nbr)
        # etc...
You can also search for the label to obtain its index. For example:
if 'Acc Nbr:' in line:
    label = 'Acc Nbr:'
    Acc_Nbr = line[line.find(label) + len(label):].strip()
    print(Acc_Nbr)
Note that find gives you the index of the first character of the item you searched for.
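To cover all the requested fields without any fixed offsets, one option is a table of patterns, one per field, searched against the whole text. The sketch below assumes the labels appear exactly as in the sample file from the question; `FIELD_PATTERNS`, `extract_fields` and the shortened `sample` are illustrative names and data.

```python
import re

# One pattern per field; group 1 is the value. ^ and $ match per line
# because re.MULTILINE is passed to re.search below.
FIELD_PATTERNS = {
    "Date": r"^(\d{4}-\d{2}-\d{2})\s+BOX\b",
    "BOX": r"\bBOX\s+(.+?)\s+RUT\b",
    "RUT": r"\bBOX\b.*?\bRUT\s+(.+)$",
    "Beneficiary Bank": r"Beneficiary Bank:\s*(.+?)\s+Account Nbr:",
    "Adress": r"^Adress:\s*(.+)$",
    "Beneficiary": r"^Beneficiary:\s*(.+?)\s+Acc Nbr:",
    "Acc Nbr": r"\bAcc Nbr:\s*(\S+)",
    "Total US$": r"^Total US\$\s*(\S+)",
}


def extract_fields(text):
    """Return a dict of field name -> first matching value found in text."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            found[name] = match.group(1).strip()
    return found


sample = """2014-11-09 BOX 1531 20140908123456 RUT 21 654321 0123
NO 098765
Beneficiary Bank: Bankue Heritage (Gruguay) S.A Account Nbr: 1234563 Swift: MANIUYMM
Adress: Tencon 108 Monteaudio, Gruguay.
Beneficiary: Girry SA Acc Nbr: 1234567
Subtotal US$ 102,500
Total US$ 102,500
IMPRENTA IRIS LTDA. - RUT 210161234015 - 0/40987 17/11/2015
"""

fields = extract_fields(sample)
for name, value in fields.items():
    print(f"{name} : {value}")
```

Since each pattern is anchored on its label rather than on a column position, a longer address or bank name does not shift the other fields.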
