Python PyPDF - getting additional spaces when reading text using extractText

I'm trying to extract text from a PDF file that has address information, shown below:
CALIFORNIA EYE SPECIALISTS MED GRP INC
1900 W GARVEY AVE S # 335
WEST COVINA CA 91790
and I'm using the logic below to extract the data:
import PyPDF2

f = open('addressPath.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(f)
first_page = pdf_reader.getPage(0)
mytext = first_page.extractText().split('\n')
but I'm getting the output below; the logic is introducing additional spaces.
Any idea why this is happening?
C A L I F O RN IA E YE SP E C I A L I STS M ED GRP INC
19 00 W G A R V EY A VE S # 3 35
WE S T CO VI NA C A 91 7 90

For a long time, PyPDF2 did not handle spaces at all. In April 2022, I improved the situation with a very simple heuristic, but it was too simple and got many cases wrong.
The contributor pubpub-zz changed that. Today, version 2.1.0 was released, which improves spacing a lot.
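For anyone landing here, a minimal sketch of what the upgrade path looks like (assuming pip install --upgrade PyPDF2 to get 2.1.0 or later, and using the snake_case API introduced in the 2.x line; 'addressPath.pdf' stands in for the asker's file):
import PyPDF2

reader = PyPDF2.PdfReader('addressPath.pdf')
first_page = reader.pages[0]
mytext = first_page.extract_text().split('\n')
# with 2.1.0+ the spurious intra-word spaces should largely disappear
print(mytext)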

Related

Filtering and parsing text over Solar Region Summary files

I am trying to filter some .txt files that are named after a date in YYYYMMDD format and contain data about active regions on the Sun. I wrote code that, given a date in YYYYMMDD format, lists the files within the time range in which I expect the active region I am looking for to appear, and parses the information based on that entry. An example of these txt files can be seen below, and more information about them (if you feel curious) is available on the SWPC website.
:Product: 0509SRS.txt
:Issued: 2012 May 09 0030 UTC
# Prepared jointly by the U.S. Dept. of Commerce, NOAA,
# Space Weather Prediction Center and the U.S. Air Force.
#
Joint USAF/NOAA Solar Region Summary
SRS Number 130 Issued at 0030Z on 09 May 2012
Report compiled from data received at SWO on 08 May
I. Regions with Sunspots. Locations Valid at 08/2400Z
Nmbr Location Lo Area Z LL NN Mag Type
1470 S19W68 284 0030 Cro 02 02 Beta
1471 S22W60 277 0120 Cso 05 03 Beta
1474 N14W13 229 0010 Axx 00 01 Alpha
1476 N11E35 181 0940 Fkc 17 33 Beta-Gamma-Delta
1477 S22E73 144 0060 Hsx 03 01 Alpha
IA. H-alpha Plages without Spots. Locations Valid at 08/2400Z May
Nmbr Location Lo
1472 S28W80 297
1475 N05W05 222
II. Regions Due to Return 09 May to 11 May
Nmbr Lat Lo
1460 N16 126
1459 S16 110
The code I am using to parse these txt files is:
import glob

def seeker(noaa_number, t_start, path=None):
    '''
    This function will open an SRS file
    and look for each line if the given AR
    (specified by its NOAA number) is there.
    If so, this function should grab the
    entries and return them.
    '''
    # defaulting path if none is given
    if path is None:
        # assigning
        path = 'defaultpath'
    # listing the items within the directory
    files = sorted(glob.glob(path + '*.txt'))
    # finding the index in the list of
    # the starting time
    index = files.index(path + str(t_start) + 'SRS.txt')
    # looping over each file
    for file in files[index:index + 20]:
        # opening file
        f = open(file, 'r')
        # reading the lines
        text = f.readlines()
        # looping over each line in the text
        for line in text:
            # checking if the noaa number is mentioned
            # in the given line
            if noaa_number in line:
                # test print
                print('Original line: ', line)
                # slicing the text to get the column values
                nbr = line[:4]
                Location = line[5:11]
                Lo = line[14:18]
                Area = line[19:23]
                Z = line[24:28]
                LL = line[29:31]
                NN = line[34:36]
                MagType = line[37:]
                # test prints
                print('nbr: ', nbr)
                print('location: ', Location)
                print('Lo: ', Lo)
                print('Area: ', Area)
                print('Z: ', Z)
                print('LL: ', LL)
                print('NN: ', NN)
                print('MagType: ', MagType)
    return
I tested this and it is working, but I feel a bit dumb for two reasons:
Despite these files following a standard, one extra space is all it takes to crash the code, given the way I am slicing the lines by index. Is there a better option?
The information in tables IA and II is not relevant for me, so ideally I would like to prevent my code from scanning them. Since the number of lines in the first table varies, is it possible to tell the code when to stop reading a given document?
Thanks for your time!
Robustness:
Instead of slicing by absolute position, you could split each line into a list using the .split() method. This will be robust against extra spaces.
So instead of
Location = line[5:11]
Lo = line[14:18]
Area = line[19:23]
Z = line[24:28]
LL = line[29:31]
NN = line[34:36]
You could use
Location = line.split()[1]
Lo = line.split()[2]
Area = line.split()[3]
Z = line.split()[4]
LL = line.split()[5]
NN = line.split()[6]
If you wanted it to be faster you could split the list once and then just pull the relevant data from the same list rather than splitting it every time:
data = line.split()
Location = data[1]
Lo = data[2]
Area = data[3]
Z = data[4]
LL = data[5]
NN = data[6]
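A small variation on the same idea, as a sketch: unpack the split fields in one statement (this assumes a sunspot row always has at least seven whitespace-separated fields, with the magnetic type last):
data = line.split()
nbr, Location, Lo, Area, Z, LL, NN = data[:7]
MagType = ' '.join(data[7:])  # anything after the seventh field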
Stopping:
To stop it from continuing to read the file after it has passed the relevant data, you could have something that exits the loop once it no longer finds the noaa_number in the line:
# In the file function but before looping through the lines.
started_reading = False  # set this to False so that it doesn't
                         # exit before it gets to the relevant data
for line in text:
    if noaa_number in line:
        started_reading = True
        # parsing stuff
    elif started_reading is True:
        break  # exits the loop
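Putting both suggestions together, a hypothetical rewrite of seeker's inner loop (parse_regions is my name, and matching the NOAA number against the first field only is an assumption that holds for the sample files shown):
def parse_regions(text, noaa_number):
    started_reading = False
    for line in text:
        data = line.split()
        # compare against the first field so a longitude or area
        # containing the same digits can't trigger a false match
        if data and data[0] == str(noaa_number):
            started_reading = True
            print('nbr:', data[0], 'location:', data[1], 'Lo:', data[2])
        elif started_reading:
            break  # past the block of matching rows; stop scanning this file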

Using the right python package to achieve result

I have a fixed-width text file that I must convert to a .csv where all numbers are converted to integers (no commas, dollar signs, quotes, etc.). I have parsed the text file using plain Python so far, but when it comes to utilizing the right package I seem to be at an impasse.
With csv, I can use writer.writerows in place of my print statement to write the output into my csv file, but the problem is that I have more columns (such as the date and time) that I must add after these rows, which I cannot seem to do with csv. I also cannot find a way to translate the blank columns in my text document into blank columns in the output; csv seems to write strictly in order.
I was reading the documentation on xlsxwriter and I see how you can write to individual columns with a set format, but I am unsure whether it would work with my .csv requirement.
My input text has a series of random groupings throughout the 50k-line document but follows the format below:
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
1--------------------
1ANTECR09 CHEK DPCK_R_009
TRANSIT EXTRACT SUB-SYSTEM
CURRENT DATE = 08/03/2017 JOURNAL REPORT PAGE 1
PROCESS DATE =
ID = 022000046-MNT
FILE HEADER = H080320171115
+____________________________________________________________________________________________________________________________________
R T SEQUENCE CR BT A RSN ITEM ITEM CHN USER REASO
NBR NBR OR PIC NBR DB NBR NBR COD AMOUNT SERIAL IND .......FIELD.. DESCR
5,556 01 7450282689 C 538196640 9835177743 15 $9,064.81 00 CREDIT
5,557 01 7450282690 D 031301422 362313705 38 $592.35 43431 DR CR
5,558 01 7450282691 D 021309379 601298839 38 $1,491.04 44896 DR CR
5,559 01 7450282692 D 071108834 176885 38 $6,688.00 1454 DR CR
5,560 01 7450282693 D 031309123 1390001566241 38 $293.42 6878 DR CR
My code currently parses this document, pulls the date and time, and prints only the lines where the sequence number starts with 42 and the CR is "C":
lines = []
a = 'PRINT DATE:'
b = 'ARCHIVE'
c = 'PRINT TIME:'
with open(r'textfile.txt') as in_file:
    for line in in_file:
        values = line.split()
        if 'PRINT DATE:' in line:
            dtevalue = line.split(a, 1)[-1].split(b)[0]
            lines.append(dtevalue)
        elif 'PRINT TIME:' in line:
            timevalue = line.split(c, 1)[-1].split(b)[0]
            lines.append(timevalue)
        elif (len(values) >= 4 and values[3] == 'C'
              and len(values[2]) >= 2 and values[2][:2] == '41'):
            print(line)
print(lines[0])
print(lines[1])
What would be the cleanest way to achieve this result, and am I headed in the right direction by writing out the parsing first, or should I have done everything within a package from the start?
Any help is appreciated
Edit:
the header block (between 1----------, and +___________) is repeated throughout the document, as are different-sized groupings separated by --------------------
--------------------
34,615 207 4100223726 C 538196620 9866597322 10 $645.49 00 CREDIT
34,616 207 4100223727 D 022000046 8891636675 31 $645.49 111583 DR ON-
--------------------
34,617 208 4100223728 C 538196620 11701364 10 $756.19 00 CREDIT
34,618 208 4100223729 D 071923828 00 54 $305.31 11384597 BAD AC
34,619 208 4100223730 D 071923828 35110011 30 $450.88 10913052 6 DR SEL
--------------------
I would recommend slicing fixed-width blocks to parse the fixed-width fields. Something like the following (incomplete) code:
data = """ 5,556 01 4250282689 C 538196640 9835177743 15 $9,064.81 00
CREDIT
5,557 01 7450282690 D 031301422 362313705 38 $592.35 43431
DR CR
5,558 01 7450282691 D 021309379 601298839 38 $1,491.04 44896
DR CR
"""
# list of data layout tuples (start_index, stop_index, field name)
# TODO add missing data layout tuples
data_layout = [(0, 12, 'r_nbr'), (12, 22, 't_nbr'), (22, 39, 'seq'), (39, 42, 'cr_db')]
for line in data.splitlines():
# skip "separator" lines
# NOTE this may be an iterative process to identify these
if line.startswith('-----'):
continue
record = {}
for start_index, stop_index, name in data_layout:
record[name] = line[start_index:stop_index].strip()
# your conditional (seems inconsistent with text)
if record['seq'].startswith('42') and record['cr_db'] == 'C':
# perform any special handling for each column
record['r_nbr'] = record['r_nbr'].replace(',', '')
# TODO other special handling (like $)
print('{r_nbr},{t_nbr},{seq},{cr_db},...'.format(**record))
Output is:
5556,01,4250282689,C,...
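If the end goal is a real .csv with the date and time appended to each row, a hedged sketch of handing the parsed records to csv.writer (out.csv and records are illustrative names; dtevalue and timevalue are assumed to come from your earlier header parsing):
import csv

# 'records' would be the record dicts collected in the loop above
# (e.g. records.append(record) in place of the print call)
with open('out.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    for record in records:
        writer.writerow([record['r_nbr'], record['t_nbr'], record['seq'],
                         record['cr_db'], dtevalue, timevalue])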
Update based on seemingly spurious values in undefined columns
Based on the new information provided about the "spurious" columns/fields (appear only rarely), this will likely be an iterative process.
My recommendation would be to narrow (appropriately!) the width of the desired fields. For example, if spurious data is in line[12:14] above, you could change the tuple for (12, 22, 't_nbr') to (14, 22, 't_nbr') to "skip" the spurious field.
An alternative is to add a "garbage" field to the list of tuples to handle those types of lines. Wherever the "spurious" fields appear, the "garbage" field would simply consume them.
If you need these fields, the same "garbage" field approach still applies, but you save the data.
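To make the "garbage" idea concrete, a hypothetical layout (the 12:14 span is just the example position mentioned above):
data_layout = [(0, 12, 'r_nbr'), (12, 14, 'garbage'), (14, 22, 't_nbr'),
               (22, 39, 'seq'), (39, 42, 'cr_db')]
# record['garbage'] absorbs the spurious characters; drop or keep it as needed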
Update based on random separators
If they are relatively consistent, I'd simply add some logic (as I did above) to "detect" the separators and skip over them.
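For the repeated header blocks described in the edit, the same detect-and-skip idea could look like this sketch (assuming, per the sample, that a header block opens with a line starting with '1' and closes with the '+____' underline):
skipping_header = False
for line in data.splitlines():
    if line.startswith('1'):      # '1-----' / '1ANTECR09' open a header block
        skipping_header = True
    if skipping_header:
        if line.startswith('+'):  # the '+____' underline closes the block
            skipping_header = False
        continue                  # skip everything inside the header block
    if line.startswith('-----'):  # group separator
        continue
    # ... fixed-width parsing as above ...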

Values missing when loaded from Pandas HDF5 file

I have some twitter feeds loaded in a Pandas Series that I want to store in HDF5 format. Here's a sample:
>>> feeds[80:90]
80 BØR MAN STARTE en tweet med store bokstaver? F...
81 #NRKSigrid #audunlysbakken Har du husket Per S...
82 Lurer på om IS har fått med seg kaoset ved Eur...
83 synes han hørte på P3 at Opoku uttales Opoko. ...
84 De statsbærende partiene Ap og Høyre må ta sky...
85 April 2014. Blir MDG det nye arbeider #partiet...
86 MDG: Hasj for kjøtt. #valg2015
87 Grønt skifte.. https://t.co/OuM8quaMz0
88 Kinderegg https://t.co/AsECmw2sV9
89 MDG for honning, frukt og grønt. https://t.co/...
Name: feeds, dtype: object
Whenever I try to load the above data from a saved HDF5 file, some values are missing and are replaced by ''... and the same values reappear when I change the indexing. For example, when storing rows with index 84-85:
>>> store = pd.HDFStore('feed.hd5')
>>> store.append('feed', feeds[84:86], min_itemsize=200, encoding='utf-8')
>>> store.close()
when I read the file, the value of the 84th row is missing:
>>> pd.read_hdf('feed.hd5', 'feed')
84
85 April 2014. Blir MDG det nye arbeider #partiet...
Name: feeds, dtype: object
I get the same output as above if I do it this way too:
>>> feeds[84:86].to_hdf('feed.hd5', 'feed', format='table', data_columns=True)
>>> pd.read_hdf('feed.hd5', 'feed')
But if I change the index to, say, [84:87] from [84:86], the 84th row is now loaded:
>>> feeds[84:87].to_hdf('feed.hd5', 'feed', format='table', data_columns=True)
>>> res = pd.read_hdf('feed.hd5', 'feed')
>>> res
84 De statsbærende partiene Ap og Høyre må ta sky...
85 April 2014. Blir MDG det nye arbeider #partiet...
86 MDG: Hasj for kjøtt. #valg2015
Name: feeds, dtype: object
But now the loaded string is missing some characters compared with the original tweet. Here's the tweet in that 84th row:
>>> # Original tweet (Length: 140)
>>> print (feeds[84])
De statsbærende partiene Ap og Høyre må ta skylda for Miljøpartiets fremgang. Velgerne har sett at SV og V ikke vinner frem i miljøspørsmål.
>>> # Loaded tweet (Length: 134)
>>> print (res[84])
De statsbærende partiene Ap og Høyre må ta skylda for Miljøpartiets fremgang. Velgerne har sett at SV og V ikke vinner frem i miljøspø
I plan to use Python 3.3.x mainly for the unicode column support in PyTables (am I wrong?) but have not yet been able to store all the data successfully. Can anyone explain this and let me know how I can avoid it?
I am using OS: Mac OS X Yosemite, Pandas: 0.16.2, Python: 3.3.5, PyTables: 3.2.0
UPDATE: I confirmed with HDFView (http://www.hdfgroup.org/products/java/hdfview/) that the data is indeed always stored (although with some trailing characters missing), but I am unable to load it successfully every time.
Thanks.
See the doc-string here.
You need to provide encoding='utf-8'; otherwise the data will be stored with your default Python encoding (which might or might not work). Reading will use the written encoding.
The data
In [13]: df[84:86]
Out[13]:
tweet_id username tweet_time tweet
84 641437756275720192 #nicecap 2015-09-09T02:27:33+00:00 De statsbærende partiene Ap og Høyre må ta sky...
85 641434661391101952 #nicecap 2015-09-09T02:15:15+00:00 April 2014. Blir MDG det nye arbeider #partiet...
When appending, supply the encoding:
In [11]: store.append('feed',df[84:86],encoding='utf-8')
Supply the encoding when reading as well:
In [12]: store.select('feed',encoding='utf-8')
Out[12]:
tweet_id username tweet_time tweet
84 641437756275720192 #nicecap 2015-09-09T02:27:33+00:00 De statsbærende partiene Ap og Høyre må ta sky...
85 641434661391101952 #nicecap 2015-09-09T02:15:15+00:00 April 2014. Blir MDG det nye arbeider #partiet...
Here's how it's stored:
In [14]: store.get_storer('feed')
Out[14]: frame_table (typ->appendable,nrows->2,ncols->4,indexers->[index])
In [15]: store.get_storer('feed').attrs
Out[15]:
/feed._v_attrs (AttributeSet), 15 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := [],
encoding := 'utf-8',
index_cols := [(0, 'index')],
info := {1: {'names': [None], 'type': 'Index'}, 'index': {}},
levels := 1,
metadata := [],
nan_rep := 'nan',
non_index_axes := [(1, ['tweet_id', 'username', 'tweet_time', 'tweet'])],
pandas_type := 'frame_table',
pandas_version := '0.15.2',
table_type := 'appendable_frame',
values_cols := ['values_block_0', 'values_block_1']]
So, I suppose this is a bug, in that reading should use the stored encoding by default. I created an issue here
I have found the issue and was partly able to correct it. If someone tweets something close to 140 characters and another person retweets it, the retweet does not contain the full text, because text like RT @username: is prepended to retweets. As a result the tweet exceeds 140 characters, is stripped down to 140, and is returned as such by twitter APIs for Python like tweepy or Python Twitter Tools (these are the two I have tested). Sometimes the last character of such a tweet is '…', which has a length of 1 and an ordinal value of 8230 (try chr(8230) in Python 3.x or unichr(8230) in Python 2.x). When these tweets are stored in an HDF5 file and read back via pd.read_hdf, the read fails and pandas replaces the whole tweet with just ''.
This can be worked around as below:
>>> # replace that '…' character with '' (an empty string)
>>> ch = chr(8230)
>>> feeds = feeds.str.replace(ch, '')  # str.replace returns a new Series
>>> # Store it in HDF5 now... # Not sure if it preserves the encoding...
>>> feeds.to_hdf('feed.h5', 'feed', format='table', append=True, encoding='utf-8', data_columns=True)
>>> # If you prefer this way
>>> with pd.HDFStore('feed.h5') as store:
...     store.append('feed', feeds, min_itemsize=200, encoding='utf-8')
>>> # Now read it safely
>>> pd.read_hdf('feed.h5', 'feed')
However, the problem still appears sometimes when there are other unicode characters... Giving the encoding='utf-8' option didn't really help, at least in my case. Any help in this regard is appreciated... :)
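One more hedged thing to try (an assumption on my part, not a confirmed fix): min_itemsize is effectively a byte budget once the strings are utf-8 encoded, and characters like 'ø' or '…' take several bytes each, so a 140-character tweet can outgrow 200 bytes. A larger value at least rules truncation out:
>>> with pd.HDFStore('feed.h5') as store:
...     # 140 chars can need up to ~560 bytes in utf-8
...     store.append('feed', feeds, min_itemsize=600, encoding='utf-8')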

Slice two ranges from one line in text file

I'm trying to extract two ranges per line of an opened text file in Python 3 by looping through it.
The application has an Entry widget whose value is stored in self.search. I then loop through a text file that contains the values I want, write out the results to self.searchresults, and then display them in a Textbox.
I've tried variations of the third line below but am not getting anywhere.
I want to write out characters 3:24 and 81:83 from each line ...
for line in self.searchfile:
    if self.search in line:
        line = line[3:24]+[81:83]
        self.searchresults.write(line+"\n")
Here's an abridged version of the text file I'm working with (original here):
! Author: Greg Thompson NCAR/RAP
! please mail corrections to gthompsn (at) ucar (dot) edu
! Date: 24 Feb 2015
! This file is continuously maintained at:
! http://www.rap.ucar.edu/weather/surface/stations.txt
!
! [... more comments ...]
ALASKA 19-SEP-14
CD STATION ICAO IATA SYNOP LAT LONG ELEV M N V U A C
AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7 US
AK AKHIOK PAKH AKK 56 56N 154 11W 14 X 8 US
AK AKUTAN PAUT 54 09N 165 36W 25 X 7 US
AK AMBLER PAFM AFM 67 06N 157 51W 88 X 7 US
AK ANAKTUVUK PASS PAKP AKP 68 08N 151 44W 642 X 7 US
AK ANCHORAGE INTL PANC ANC 70273 61 10N 150 01W 38 X T X A 5 US
You're not far off - your problem is that you need to specify what you're slicing each time you slice it:
for line in self.searchfile:
    if self.search in line:
        line = line[3:24] + line[81:83]
        self.searchresults.write(line+"\n")
... but you'll probably want to separate the two fields with a space:
line = line[3:24] + " " + line[81:83]
However, if you find yourself using + more than once to construct a string, you should think about using Python's built-in string-formatting abilities instead (and while you're at it, you can add that newline in the same operation):
for line in self.searchfile:
    if self.search in line:
        formatted = "%s %s\n" % (line[3:24], line[81:83])
        self.searchresults.write(formatted)
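On Python 3.6+, an f-string is an equally idiomatic way to write the same formatting step (a minor modernization, not part of the original answer):
formatted = f"{line[3:24]} {line[81:83]}\n"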

Python Regular Expression of long complex string

So I am scraping data from a webpage, and the received data usually looks like the following:
233989 001 0 / 49 T R 4:15 PM - 5:30 PM 205 IST Building 01/13/14 - 05/02/14 Controls View (814) 865-8947 266200 002 0 / 43 M W F 10:10 AM - 11:00 AM 110 IST Building 01/13/14 - 05/02/14 Controls View (814) 865-8947
I am trying to split the data from the pattern ###### (6 numbers, i.e. 233989) to the phone number, which marks the end of the current data line (i.e. (814) 865-8947). Because I know it will always end with 4 numbers, I came up with the expression:
(^[0-9]{1,6}$[^[0-9]{1,4}$]*[0-9]{1,4}$+)+
This does not seem to work, though. Can anyone lend a helping hand?
You could use this:
r'(\d{6}.*?\(\d{3}\) \d{3}-\d{4}) ?'
Then rebuild it on $1\n
Like so: http://regex101.com/r/lG4gG5
Python:
import re

s = '233989 001 0 / 49 T R 4:15 PM - 5:30 PM 205 IST Building 01/13/14 - 05/02/14 Controls View (814) 865-8947 266200 002 0 / 43 M W F 10:10 AM - 11:00 AM 110 IST Building 01/13/14 - 05/02/14 Controls View (814) 865-8947'
spl = re.split(r'(\d{6}.*?\(\d{3}\) \d{3}-\d{4}) ?', s)
for line in spl:
    print(line)
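One caveat: because the whole record is the capture group, re.split leaves empty strings between (and around) the matches, so a small filter is useful (a follow-up sketch, not part of the original answer):
records = [chunk for chunk in spl if chunk]
for record in records:
    print(record)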
