Slice two ranges from one line in text file - python

I'm trying to extract two ranges per line of an opened text file in python 3 by looping through.
The application has an Entry widget whose value is stored in self.search. I then loop through a text file which contains the values I want, write out the results to self.searchresults, and then display them in a Textbox.
I've tried variations of the third line below but am not getting anywhere.
I want to write out the characters at positions 3:24 and 81:83 of each line ...
for line in self.searchfile:
    if self.search in line:
        line = line[3:24]+[81:83]
        self.searchresults.write(line+"\n")
Here's an abridged version of the text file I'm working with (original here):
! Author: Greg Thompson NCAR/RAP
! please mail corrections to gthompsn (at) ucar (dot) edu
! Date: 24 Feb 2015
! This file is continuously maintained at:
! http://www.rap.ucar.edu/weather/surface/stations.txt
!
! [... more comments ...]
ALASKA 19-SEP-14
CD STATION ICAO IATA SYNOP LAT LONG ELEV M N V U A C
AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7 US
AK AKHIOK PAKH AKK 56 56N 154 11W 14 X 8 US
AK AKUTAN PAUT 54 09N 165 36W 25 X 7 US
AK AMBLER PAFM AFM 67 06N 157 51W 88 X 7 US
AK ANAKTUVUK PASS PAKP AKP 68 08N 151 44W 642 X 7 US
AK ANCHORAGE INTL PANC ANC 70273 61 10N 150 01W 38 X T X A 5 US

You're not far off - your problem is that you need to specify what you're slicing each time you slice it:
for line in self.searchfile:
    if self.search in line:
        line = line[3:24] + line[81:83]
        self.searchresults.write(line+"\n")
... but you'll probably want to separate the two fields with a space:
line = line[3:24] + " " + line[81:83]
However, if you find yourself using + more than once to construct a string, you should think about using Python's built-in string-formatting abilities instead (and while you're at it, you can add that newline in the same operation):
for line in self.searchfile:
    if self.search in line:
        formatted = "%s %s\n" % (line[3:24], line[81:83])
        self.searchresults.write(formatted)
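On Python 3.6 and newer you can get the same result with an f-string, which folds the slices, the separator and the newline into a single readable expression; a sketch of the same loop:
for line in self.searchfile:
    if self.search in line:
        self.searchresults.write(f"{line[3:24]} {line[81:83]}\n")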

Related

Python PyPDF - getting additional spaces when reading text using ExtractText

I'm trying to extract text from a PDF file that has Address information, shown as below
CALIFORNIA EYE SPECIALISTS MED GRP INC
1900 W GARVEY AVE S # 335
WEST COVINA CA 91790
and I'm using below logic to extract the data
import PyPDF2

f = open('addressPath.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(f)
first_page = pdf_reader.getPage(0)
mytext = first_page.extractText().split('\n')
but I'm getting below output, logic is introducing additional spaces.
Any idea why this is happening?
C A L I F O RN IA E YE SP E C I A L I STS M ED GRP INC
19 00 W G A R V EY A VE S # 3 35
WE S T CO VI NA C A 91 7 90
For a long time, PyPDF2 did not handle spaces at all. In April 2022 I improved the situation with very simple logic; it was too simple and got many cases wrong.
The contributor pubpub-zz then changed that. Today, version 2.1.0 was released, which improves spacing a lot.
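If upgrading is an option, a minimal sketch of the question's snippet against the newer 2.x API looks roughly like this (the file name is just the placeholder from the question):
from PyPDF2 import PdfReader  # requires PyPDF2 >= 2.0

reader = PdfReader('addressPath.pdf')           # placeholder path from the question
first_page = reader.pages[0]
mytext = first_page.extract_text().split('\n')  # newer spelling of extractText()
print(mytext)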

How to reformat a text file data formatting in python (reading a csv/txt with multiple delimiters and rows)

Coding noob here with a question. I have a text file that has the following format:
img1.jpg 468,3,489,16,5 510,37,533,51,2 411,3,433,17,5 ....
img2.jpg 255,397,267,417,2 ....
.
.
.
The data is a series of images with information on co-ordinates where there are 5 variables separated by commas, and then a new set of co-ordinates is separated by a space. There are about 500 files and for each file there are variable numbers of co-ordinate groups. I'm wanting to convert this text file into the following kind of format:
File name    Co-ord 1    Co-ord 2    Co-ord 3    Co-ord 4    Co-ord 5
img1.jpg     468         3           489         16          5
img1.jpg     510         37          533         51          2
img1.jpg     411         3           433         17          5
img2.jpg     255         397         267         417         2
How can I do this in python?
Since each image name and each group of co-ordinates are separated by spaces, we can use split() to break the line into a list, which is basically your expected output.
Here is an example that splits your input into a list of lists; each inner list represents one row of your example output, with the first element naming the image the co-ordinates belong to:
test = "img1.jpg 468,3,489,16,5 510,37,533,51,2 411,3,433,17,5"
str_list = test.split()
res = list()
recent_img = ''
for item in str_list:
if item.endswith("jpg"):
# find a new image name
recent_img = item
continue
co_ordinates_list = item.split(",")
if len(co_ordinates_list) == 5:
co_ordinates_list.insert(0, recent_img)
res.append(co_ordinates_list)
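If you want to run the same splitting over the whole text file and save the result as a table, something along these lines should work; the input and output file names here are only placeholders:
import csv

res = []
recent_img = ''
with open('annotations.txt') as f:               # placeholder input file name
    for item in f.read().split():
        if item.endswith('jpg'):
            recent_img = item                    # remember which image we are in
            continue
        co_ordinates_list = item.split(',')
        if len(co_ordinates_list) == 5:
            co_ordinates_list.insert(0, recent_img)
            res.append(co_ordinates_list)

with open('out.csv', 'w', newline='') as f:      # placeholder output file name
    writer = csv.writer(f)
    writer.writerow(['File name', 'Co-ord 1', 'Co-ord 2', 'Co-ord 3', 'Co-ord 4', 'Co-ord 5'])
    writer.writerows(res)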

Using the right python package to achieve result

I have a fixed width text file that I must convert to a .csv where all numbers have to be converted to integers (no commas, dollar signs, quotes, etc). I have currently parsed the text file using plain python, but when utilizing the right package I seem to be at an impasse.
With csv, I can use writer.writerows in place of my print statement to write the output into my csv file, but the problem is that I have more columns (such as the date and time) that I must add after these rows, and I can't seem to do that with csv. I also can't find a way to translate the blank columns in my text document into blank columns in the output; csv seems to write the fields strictly in order.
I was reading the documentation on xlsxwriter and I see how you can write to individual columns with a set formatting, but I am unsure if it would work with my .csv requirement
My input text has a series of random groupings throughout the 50k line document but follows the below format
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
1--------------------
1ANTECR09 CHEK DPCK_R_009
TRANSIT EXTRACT SUB-SYSTEM
CURRENT DATE = 08/03/2017 JOURNAL REPORT PAGE 1
PROCESS DATE =
ID = 022000046-MNT
FILE HEADER = H080320171115
+____________________________________________________________________________________________________________________________________
R T SEQUENCE CR BT A RSN ITEM ITEM CHN USER REASO
NBR NBR OR PIC NBR DB NBR NBR COD AMOUNT SERIAL IND .......FIELD.. DESCR
5,556 01 7450282689 C 538196640 9835177743 15 $9,064.81 00 CREDIT
5,557 01 7450282690 D 031301422 362313705 38 $592.35 43431 DR CR
5,558 01 7450282691 D 021309379 601298839 38 $1,491.04 44896 DR CR
5,559 01 7450282692 D 071108834 176885 38 $6,688.00 1454 DR CR
5,560 01 7450282693 D 031309123 1390001566241 38 $293.42 6878 DR CR
My code currently parses this document, pulls the date, time, and prints only the lines where the sequence number starts with 42 and the CR is "C"
lines = []
a = 'PRINT DATE:'
b = 'ARCHIVE'
c = 'PRINT TIME:'
with open(r'textfile.txt') as in_file:
    for line in in_file:
        values = line.split()
        if 'PRINT DATE:' in line:
            dtevalue = line.split(a,1)[-1].split(b)[0]
            lines.append(dtevalue)
        elif 'PRINT TIME:' in line:
            timevalue = line.split(c,1)[-1].split(b)[0]
            lines.append(timevalue)
        elif (len(values) >= 4 and values[3] == 'C'
              and len(values[2]) >= 2 and values[2][:2] == '41'):
            print(line)
print(lines[0])
print(lines[1])
What would be the cleanest way to achieve this result, and am I headed in the right direction by writing out the parsing first or should I have just done everything within a package first?
Any help is appreciated
Edit:
the header block (between 1----------, and +___________) is repeated throughout the document, as well as different sized groupings separated by -------
--------------------
34,615 207 4100223726 C 538196620 9866597322 10 $645.49 00 CREDIT
34,616 207 4100223727 D 022000046 8891636675 31 $645.49 111583 DR ON-
--------------------
34,617 208 4100223728 C 538196620 11701364 10 $756.19 00 CREDIT
34,618 208 4100223729 D 071923828 00 54 $305.31 11384597 BAD AC
34,619 208 4100223730 D 071923828 35110011 30 $450.88 10913052 6 DR SEL
--------------------
I would recommend slicing fixed width blocks to parse through the fixed width fields. Something like the following (incomplete) code:
data = """ 5,556 01 4250282689 C 538196640 9835177743 15 $9,064.81 00
CREDIT
5,557 01 7450282690 D 031301422 362313705 38 $592.35 43431
DR CR
5,558 01 7450282691 D 021309379 601298839 38 $1,491.04 44896
DR CR
"""
# list of data layout tuples (start_index, stop_index, field name)
# TODO add missing data layout tuples
data_layout = [(0, 12, 'r_nbr'), (12, 22, 't_nbr'), (22, 39, 'seq'), (39, 42, 'cr_db')]
for line in data.splitlines():
# skip "separator" lines
# NOTE this may be an iterative process to identify these
if line.startswith('-----'):
continue
record = {}
for start_index, stop_index, name in data_layout:
record[name] = line[start_index:stop_index].strip()
# your conditional (seems inconsistent with text)
if record['seq'].startswith('42') and record['cr_db'] == 'C':
# perform any special handling for each column
record['r_nbr'] = record['r_nbr'].replace(',', '')
# TODO other special handling (like $)
print('{r_nbr},{t_nbr},{seq},{cr_db},...'.format(**record))
Output is:
5556,01,4250282689,C,...
Update based on seemingly spurious values in undefined columns
Based on the new information provided about the "spurious" columns/fields (appear only rarely), this will likely be an iterative process.
My recommendation would be to narrow (appropriately!) the width of the desired fields. For example, if spurious data is in line[12:14] above, you could change the tuple for (12, 22, 't_nbr') to (14, 22, 't_nbr') to "skip" the spurious field.
An alternative is to add a "garbage" field in the list of tuples to handle those types of lines. Wherever the "spurious" fields appear, the "garbage" field would simply consume it.
If you need these fields, the same general approach to the "garbage" field approach still applies, but you save the data.
Update based on random separators
If they are relatively consistent, I'd simply add some logic (as I did above) to "detect" the separators and skip over them.
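To get from these parsed records to the .csv the question asks for, the standard csv module will happily take extra columns such as the date and time; here is a rough sketch that builds on the snippet above (the output file name and the date/time placeholder values are assumptions, and data_layout still needs the remaining fields):
import csv

# assumes `data` and `data_layout` from the snippet above; dtevalue and
# timevalue stand in for the strings your own loop already extracts
dtevalue = '08/03/2017'   # placeholder value for the sketch
timevalue = '11:15'       # placeholder value for the sketch

with open('output.csv', 'w', newline='') as out_file:   # placeholder output name
    writer = csv.writer(out_file)
    for line in data.splitlines():
        if line.startswith('-----'):
            continue
        record = {name: line[start:stop].strip()
                  for start, stop, name in data_layout}
        if record['seq'].startswith('42') and record['cr_db'] == 'C':
            record['r_nbr'] = record['r_nbr'].replace(',', '')
            writer.writerow([record['r_nbr'], record['t_nbr'], record['seq'],
                             record['cr_db'], dtevalue, timevalue])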

printing column numbers in python

How can I print the first 52 numbers in the same column, and so on, in 6 columns in total (repeating)? I have lots of float numbers and I want to keep each group of 52 numbers in the same column before starting a new column that will contain the next 52 numbers. The numbers are listed on lines separated by one space in a file.txt document. So in the end I want to have:
1 53 105 157 209 261
2
...
52 104 156 208 260 312
313 ... ... ... ... ...
...(another 52 numbers and so on)
I have tried this:
with open('file.txt') as f:
    line = f.read().split()
line1 = "\n".join("%-20s %s" % (line[i+len(line)/52], line[i+len(line)/6]) for i in range(len(line)/6))
print(line1)
However, this of course only prints 2 columns of numbers. I have tried adding line[i+len(line)/52] six times, but the code is still not working.
for row in range(52):
    for col in range(6):
        print line[row + 52*col],  # Dangling comma to stay on this line
    print  # Now go to the next line
Granted, you can do this in more Pythonic ways, but this will show you the algorithm structure and let you tighten the code as you wish.
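If you are on Python 3 (as the print() calls in your own attempt suggest), the same structure works with the print function's end parameter; a minimal sketch that also loads the numbers from the file, assuming there are at least 52 * 6 values:
with open('file.txt') as f:
    numbers = f.read().split()

for row in range(52):
    for col in range(6):
        print(numbers[row + 52 * col], end=' ')  # stay on the same line
    print()  # move to the next line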

Files and Exceptions

So in Python class we are going over files and exceptions, but the professor didn't explain it thoroughly, so I'm lost as to what exactly he wants me to do; I would appreciate any help. I understand that he wants us to reproduce the Table 2 example, but I'm not quite sure. Here is the question.
The file ALW.txt contains the information shown in Table 1. Write a program to use the file to produce a text file containing the information in Table 2, in which the baseball teams list W-L percentage, as well as the total percentage.
Table 1:
ALW,W,L,W-L%
Oakland Athletics,96,66
Texas Rangers,91,72
Los Angeles,78,84
Seattle Mariners,71,91
Houston Astros,51,111
Table 2:
Team.........................................W L W-L%
Oakland Athletics....................96 66 0.593
Texas Rangers.........................91 72 0.558
Los Angeles.............................78 84 0.481
Seattle Mariners.......................71 91 0.438
Houston Astros.......................51 111 0.315
Total:........................................387 423 0.484
so I came up with this code but I don't think I'm doing it right.
fob = open("C:/Users/Manny/Documents/Chapter 5 Assignments/ALW.txt", "r")
fob.readline()
print("Total number of teams: 5 ")
print("Teams")
Oakland_Athletics_win = 96
Texas_Rangers_win = 91
Los_Angeles_win = 78
Seattle_Mariners_win = 71
Houston_Astros_win = 51
Oakland_Athletics_lose = 66
Texas_Rangers_lose = 72
Los_Angeles_lose = 84
Seattle_Mariners_lose = 91
Houston_Astros_lose = 111
total_win = Oakland_Athletics_win + Texas_Rangers_win + Los_Angeles_win + Seattle_Mariners_win + Houston_Astros_win
total_lose = Oakland_Athletics_lose + Texas_Rangers_lose + Los_Angeles_lose + Seattle_Mariners_lose + Houston_Astros_lose
win_lose_ratio = (Oakland_Athletics_lose + Oakland_Athletics_win)
win_lose_ratio2 = Oakland_Athletics_win / win_lose_ratio
total_ratio = total_win + total_lose
total_ratio2 = total_win / total_ratio
for line in fob:
    x = line.split(",")
    x2 = win_lose_ratio2
    print('\t', '\t', '\t', '\t', "Wins", '\t', "Losses", '\t', '\t', "Win-Lose%")
    print(x[0], '\t', x[1], '\t', x[2], '\t', '\t', (x2))
print("Total: ", '\t', '\t', total_win, '\t', total_lose, '\t', total_ratio2)
He basically wants you to convert how the statistics are displayed from one file and copy it to a different file in a different format.
I'd suggest that you first read the first file line by line and split each line on commas. The elements at each index of the returned list can then be formatted and written to an output file in the format he wants.
Your teacher wants you to parse the file and produce a result similar to Table 2.
You will need to:
open the file
go through each line until you find the first team
add the strings separated by commas in that line to a list
calculate the percentages
carry on till the end
write everything to a new file
It's a good exercise, and Python is well suited to it; a rough sketch of those steps follows.
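A minimal sketch, assuming ALW.txt holds the header line followed by one comma-separated line per team (adjust the file names and column widths to whatever the assignment expects):
with open('ALW.txt') as in_file, open('ALW_out.txt', 'w') as out_file:
    in_file.readline()                                    # skip the "ALW,W,L,W-L%" header
    out_file.write('{:<25}{:>4}{:>5}{:>8}\n'.format('Team', 'W', 'L', 'W-L%'))
    total_wins = total_losses = 0
    for line in in_file:
        team, wins, losses = line.strip().split(',')
        wins, losses = int(wins), int(losses)
        total_wins += wins
        total_losses += losses
        pct = wins / (wins + losses)                      # W-L percentage for this team
        out_file.write('{:<25}{:>4}{:>5}{:>8.3f}\n'.format(team, wins, losses, pct))
    total_pct = total_wins / (total_wins + total_losses)  # overall percentage
    out_file.write('{:<25}{:>4}{:>5}{:>8.3f}\n'.format('Total:', total_wins, total_losses, total_pct))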
