How to complete for loop with pdfplumber?

How to complete for loop with pdfplumber? - python

Problem
I was following this tutorial https://www.youtube.com/watch?v=eTz3VZmNPSE&list=PLxEus0qxF0wciRWRHIRck51EJRiQyiwZT&index=16
when the code has returned my this error.
Goal
I need to scrape a pdf that looks like this (I wanted to attach the pdf but I do not know how):
170001WO01
English (US) into Arabic (DZ)
Trans./Edit/Proof. 22.117,00 Words 1,350 29.857,95
TM - Fuzzy Match 2.941,00 Words 0,500 1.470,50
TM - Exact Match 353,00 Words 0,100 35,30
Approach
I am following the tutorial aforementioned with pdfplumber.
import re
import pdfplumber
import PyPDF2
import pandas as pd
from collections import namedtuple
ap = open('test.pdf', 'rb')
I name the column of the dataframe that I want as a final product.
Serv = namedtuple('Serv', 'case_number language num_trans num_fuzzy num_exact')
Issues
I have 5 different lines compared to the tutorial example which has 2.
case_li = re.compile(r'(\d{6}\w{2}\d{2})')
language_li = re.compile(r'(nglish \(US\) into )(.*)')
trans_li = re.compile(r'(Trans./Edit/Proof. )(\d{2}\.\d{3})')
fuzzy_li = re.compile(r'(TM - Fuzzy Match )(\d{1}\.\d{3})')
exact_li = re.compile(r'(M - Exact Match )(\d{3})')
Issue
When I introduce the third line in the code, I got an error which I do not know. I have modified the code as 2e0byo suggested but I still get an error.
This is the new code:
line_items = []
with pdfplumber.open(ap) as pdf:
page = pdf.pages
for page in pdf.pages:
text = page.extract_text()
for line in text.split('\n'):
line = case_li.search(line)
if line:
case_number = line
line = language_li.search(line)
if line:
language = line.group(2)
line = trans_li.search(line)
if line:
num_trans = line.group(2)
line = fuzzy_li.search(line)
if line:
num_fuzzy = line.group(2)
line = exact_li.search(line)
if line:
num_exact = line.group(2)
line_items.append(Serv(case_number, language, num_trans, num_fuzzy, num_exact))```
---------------------------------------------------------------------------
and this is the new error:
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13992/1572426536.py in <module>
10 case_number = line
11
---> 12 line = language_li.search(line)
13 if line:
14 language = line.group(2)
TypeError: expected string or bytes-like object
TypeError: expected string or bytes-like object
# GOAL
It would be to append the lines to line_items and eventually
df = pd.DataFrame(line_items)

You have reassigned line, here:
for line in text.split("\n"):
# line is a str (the line)
line = language_li.search(line)
# line is no longer a str, but the result of a re.search
so line is no longer the text line, but the result of that match. Thus trans_li.search(line) is not searching the line you thought it was.
To fix your code, adopt a consistent pattern:
for line in text.split("\n"):
match = language_li.search(line)
# line is still a str (the line)
# match is the result of re.search
if match:
do_something(match.groups())
...
# line is *still* a str
match = trans_li.search(line):
if match:
...
For completeness' sake, with the dreaded walrus operator you can now write this:
if match := language_li.search(line) is not None:
do_something(match.groups())
Which I briefly thought was neater, but now think ugly. I fully expect to get downvoted just for mentioning the walrus operator. (If you look at the edit history of this post you will see that I have even forgotten how to use it and wrote it backwards first.)
PS: you may wish to read up on variable scope in python, although no language I know would allow this particular scope collision (overwriting a loop variable within the loop). Incidentally doing this kind of thing by mistake is why conventionally we avoid similarly-named variables (like line and Line) and go with things like line and match instead.

Related

How do I solve this IndexError inside a for Loop?

I have been playing around with Python for over a year now and written several automation codes which I use on a daily basis. I have been writing this auto typer for Python, here is the code:
import pyautogui as pt
from time import sleep
empty_file = "C:\\Users\\Lucas\\Desktop\\PycharmProjects\\Automate\\main\\screenshots\\empty_file.png"
text_write = "C:\\Users\\Lucas\\Desktop\\PycharmProjects\\Automate\\main\\text_write.txt"
with open(text_write, 'r') as f:
text = f.read()
sentence = text.split("\n")
position0 = pt.locateOnScreen(empty_file, confidence=.8)
x = position0[0]
y = position0[1]
pt.moveTo(x, y, duration=.05)
pt.leftClick()
def post_text():
pt.moveTo(x-370, y+95, duration=.1)
for lines in range(len(text)):
pt.typewrite(str(sentence[lines],) + "\n", interval=.01)
with pt.hold('shift'):
pt.press('tab', presses=5)
sleep(2)
post_text()
The code completely works but at the end instead of the code breaking it gives me this error:
C:\Users\Lucas\PycharmProjects\wechat_bot\venv\Scripts\python.exe C:/Users/Lucas/Desktop/PycharmProjects/Automate/main/auto_typer.py
Traceback (most recent call last):
File "C:\Users\Lucas\Desktop\PycharmProjects\Automate\main\auto_typer.py", line 26, in <module>
post_text()
File "C:\Users\Lucas\Desktop\PycharmProjects\Automate\main\auto_typer.py", line 20, in post_text
pt.typewrite(str(sentence[lines],) + "\n", interval=.01)
IndexError: list index out of range
Process finished with exit code 1
I suspect the issue has to do specifically with the line:
str(sentence[lines],
I haven't found a solution yet. Am I supposed to be using len() or would and if else statement be better?

The problem is that you are doing
for lines in range(len(text)):
but then you later do
sentence[lines]
This will only work text is shorter than or the same length as sentence. But there is no such guarantee in your code.
Instead, you should do
for lines in range(len(sentence)):
Or better yet loop over the lines without indexes:
for line in sentence:
And then you can just do line instead of sentence[lines].

Correct mistakes in a python program dealing with CSV

I'm trying to edit a CSV file using informations from a first one. That doesn't seem simple to me as I should filter multiple things. Let's explain my problem.
I have two CSV files, let's say patch.csv and origin.csv. Output csv file should have the same pattern as origin.csv, but with corrected values.
I want to replace trip_headsign column fields in origin.csv using forward_line_name column in patch.csv if direction_id field in origin.csv row is 0, or using backward_line_name if direction_id is 1.
I want to do this only if the part of the line_id value in patch.csv between ":" and ":" symbols is the same as the part of route_id value in origin.csv before the ":" symbol.
I know how to replace a whole line, but not only some parts, especially that I sometimes have to look only part of a value.
Here is a sample of origin.csv:
route_id,service_id,trip_id,trip_headsign,direction_id,block_id
210210109:001,2913,70405957139549,70405957,0,
210210109:001,2916,70405961139553,70405961,1,
and a sample of patch.csv:
line_id,line_code,line_name,forward_line_name,forward_direction,backward_line_name,backward_direction,line_color,line_sort,network_id,commercial_mode_id,contributor_id,geometry_id,line_opening_time,line_closing_time
OIF:100110010:10OIF439,10,Boulogne Pont de Saint-Cloud - Gare d'Austerlitz,BOULOGNE / PONT DE ST CLOUD - GARE D'AUSTERLITZ,OIF:SA:8754700,GARE D'AUSTERLITZ - BOULOGNE / PONT DE ST CLOUD,OIF:SA:59400,DFB039,91,OIF:439,metro,OIF,geometry:line:100110010:10,05:30:00,25:47:00
OIF:210210109:001OIF30,001,FFOURCHES LONGUEVILLE PROVINS,Place Mérot - GARE DE LONGUEVILLE,,GARE DE LONGUEVILLE - Place Mérot,OIF:SA:63:49,000000 1,OIF:30,bus,OIF,,05:39:00,19:50:00
Each file has hundred of lines I need to parse and edit this way.
Separator is comma in my csv files.
Based on mhopeng answer to a previous question, I obtained that code:
#!/usr/bin/env python2
from __future__ import print_function
import fileinput
import sys
# first get the route info from patch.csv
f = open(sys.argv[1])
d = open(sys.argv[2])
# ignore header line
#line1 = f.readline()
#line2 = d.readline()
# get line of data
for line1 in f.readline():
line1 = f.readline().split(',')
route_id = line1[0].split(':')[1] # '210210109'
route_forward = line1[3]
route_backward = line1[5]
line_code = line1[1]
# process origin.csv and replace lines in-place
for line in fileinput.input(sys.argv[2], inplace=1):
line2 = d.readline().split(',')
num_route = line2[0].split(':')[0]
# prevent lines with same route_id but different line_code to be considered as the same line
if line.startswith(route_id) and (num_route == line_code):
if line.startswith(route_id):
newline = line.split(',')
if newline[4] == 0:
newline[3] = route_backward
else:
newline[3] = route_forward
print('\t'.join(newline),end="")
else:
print(line,end="")
But unfortunately, that doesn't push the right forward or backward_line_name in trip_headsign (always forward is used), the condition to compare patch.csv line_code to the end of route_id of origin.csv (after the ":") doesn't work, and the script finally triggers that error, before finishing parsing the file:
Traceback (most recent call last):
File "./GTFS_enhancer_headsigns.py", line 28, in
if newline[4] == 0:
IndexError: list index out of range
Could you please help me fixing these three problems?
Thanks for your help :)

You really should consider using the python csv module instead of split().
Out of experience , everything is much easier when working with csv files and the csv module.
This way you can iterate through the dataset in a structured way without the risk of getting index out of range errors.

Two regex functions together do not work

I am trying to get the index for the start of a tag and the end of another tag. However, when I use one regex it works absolutely fine but for two regex functions, it gives an error for the second one.
Kindly help in explaining the reason
The below code works fine:
import re
f = open('C:/Users/Jyoti/Desktop/PythonPrograms/try.xml','r')
opentag = re.search('<TEXT>',f.read())
begin = opentag.start()+6
print begin
But when I add another similar regex it give me the error
AttributeError: 'NoneType' object has no attribute 'start'
which I understand is due to the start() function returning None
Below is the code:
import re
f = open('C:/Users/Jyoti/Desktop/PythonPrograms/try.xml','r')
opentag = re.search('<TEXT>',f.read())
begin = opentag.start()+6
print begin
closetag = re.search('</TEXT>',f.read())
end = closetag.start() - 1
print end
Please provide a solution to how can I get this working. Also I am a newbie here so please don't mind if I ask more questions on the solution.

You are reading the file in f.read() which reads the whole file, and so the file descriptor moves forward, which means the text can't be read again when you do f.read() the next time.
If you need to search on the same text again, save the output of f.read(), and then do a regular expression search on it as below:
import re
f = open('C:/Users/Jyoti/Desktop/PythonPrograms/try.xml','r')
text = f.read()
opentag = re.search('<TEXT>',text)
begin = opentag.start()+6
print begin
closetag = re.search('</TEXT>',text)
end = closetag.start() - 1
print end

f.read() reads the whole file. So there's nothing left to read on the second f.read() call.
See https://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects

First of all you have to know that f.read() after read file sets the pointer to the EOF so if you again use f.read() it gives you empty string ''. Secondly you should use r before string passed as a pattern of re.search function, which means raw, and automatically escapes special characters. So you have to do something like this:
import re
f = open('C:/Users/Jyoti/Desktop/PythonPrograms/try.xml','r')
data = f.read()
opentag = re.search(r'<TEXT>',data)
begin = opentag.start()+6
print begin
closetag = re.search(r'</TEXT>',data)
end = closetag.start() - 1
print end
gl & hf with Python :)

Python/IPython strange non reproducible list index out of range error

I have recently been learning some Python and how to apply it to my work. I have written a couple of scripts successfully, but I am having an issue I just cannot figure out.
I am opening a file with ~4000 lines, two tab separated columns per line. When reading the input file, I get an index error saying that the list index is out of range. However, while I get the error every time, it doesn't happen on the same line every time (as in, it will throw the error on different lines everytime!). So, for some reason, it works generally but then (seemingly) randomly fails.
As I literally only started learning Python last week, I am stumped. I have looked around for the same problem, but not found anything similar. Furthermore I don't know if this is a problem that is language specific or IPython specific. Any help would be greatly appreciated!
input = open("count.txt", "r")
changelist = []
listtosort = []
second = str()
output = open("output.txt", "w")
for each in input:
splits = each.split("\t")
changelist = list(splits[0])
second = int(splits[1])
print second
if changelist[7] == ";":
changelist.insert(6, "000")
va = "".join(changelist)
var = va + ("\t") + str(second)
listtosort.append(var)
output.write(var)
elif changelist[8] == ";":
changelist.insert(6, "00")
va = "".join(changelist)
var = va + ("\t") + str(second)
listtosort.append(var)
output.write(var)
elif changelist[9] == ";":
changelist.insert(6, "0")
va = "".join(changelist)
var = va + ("\t") + str(second)
listtosort.append(var)
output.write(var)
else:
#output.write(str("".join(changelist)))
va = "".join(changelist)
var = va + ("\t") + str(second)
listtosort.append(var)
output.write(var)
output.close()
The error
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/home/a/Desktop/sharedfolder/ipytest/individ.ins.count.test/<ipython-input-87-32f9b0a1951b> in <module>()
57 splits = each.split("\t")
58 changelist = list(splits[0])
---> 59 second = int(splits[1])
60
61 print second
IndexError: list index out of range
Input:
ID=cds0;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr 12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C 50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C 36
Desired output:
ID=cds0000;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr 12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C 50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C 36

The reason you're getting the IndexError is that your input-file is apparently not entirely tab delimited. That's why there is nothing at splits[1] when you attempt to access it.
Your code could use some refactoring. First of all you're repeating yourself with the if-checks, it's unnecessary. This just pads the cds0 to 7 characters which is probably not what you want. I threw the following together to demonstrate how you could refactor your code to be a little more pythonic and dry. I can't guarantee it'll work with your dataset, but I'm hoping it might help you understand how to do things differently.
to_sort = []
# We can open two files using the with statement. This will also handle
# closing the files for us, when we exit the block.
with open("count.txt", "r") as inp, open("output.txt", "w") as out:
for each in inp:
# Split at ';'... So you won't have to worry about whether or not
# the file is tab delimited
changed = each.split(";")
# Get the value you want. This is called unpacking.
# The value before '=' will always be 'ID', so we don't really care about it.
# _ is generally used as a variable name when the value is discarded.
_, value = changed[0].split("=")
# 0-pad the desired value to 7 characters. Python string formatting
# makes this very easy. This will replace the current value in the list.
changed[0] = "ID={:0<7}".format(value)
# Join the changed-list with the original separator and
# and append it to the sort list.
to_sort.append(";".join(changed))
# Write the results to the file all at once. Your test data already
# provided the newlines, you can just write it out as it is.
output.writelines(to_sort)
# Do what else you need to do. Maybe to_list.sort()?
You'll notice that this code is reduces your code down to 8 lines but achieves the exact same thing, does not repeat itself and is pretty easy to understand.
Please read the PEP8, the Zen of python, and go through the official tutorial.

This happens when there is a line in count.txt which doesn't contain the tab character. So when you split by tab character there will not be any splits[1]. Hence the error "Index out of range".
To know which line is causing the error, just add a print(each) after splits in line 57. The line printed before the error message is your culprit. If your input file keeps changing, then you will get different locations. Change your script to handle such malformed lines.

str.startswith() not working as I intended

I'm trying to test for a /t or a space character and I can't understand why this bit of code won't work. What I am doing is reading in a file, counting the loc for the file, and then recording the names of each function present within the file along with their individual lines of code. The bit of code below is where I attempt to count the loc for the functions.
import re
...
else:
loc += 1
for line in infile:
line_t = line.lstrip()
if len(line_t) > 0 \
and not line_t.startswith('#') \
and not line_t.startswith('"""'):
if not line.startswith('\s'):
print ('line = ' + repr(line))
loc += 1
return (loc, name)
else:
loc += 1
elif line_t.startswith('"""'):
while True:
if line_t.rstrip().endswith('"""'):
break
line_t = infile.readline().rstrip()
return(loc,name)
Output:
Enter the file name: test.txt
line = '\tloc = 0\n'
There were 19 lines of code in "test.txt"
Function names:
count_loc -- 2 lines of code
As you can see, my test print for the line shows a /t, but the if statement explicitly says (or so I thought) that it should only execute with no whitespace characters present.
Here is my full test file I have been using:
def count_loc(infile):
""" Receives a file and then returns the amount
of actual lines of code by not counting commented
or blank lines """
loc = 0
for line in infile:
line = line.strip()
if len(line) > 0 \
and not line.startswith('//') \
and not line.startswith('/*'):
loc += 1
func_loc, func_name = checkForFunction(line);
elif line.startswith('/*'):
while True:
if line.endswith('*/'):
break
line = infile.readline().rstrip()
return loc
if __name__ == "__main__":
print ("Hi")
Function LOC = 15
File LOC = 19

\s is only whitespace to the re package when doing pattern matching.
For startswith, an ordinary method of ordinary strings, \s is nothing special. Not a pattern, just characters.

Your question has already been answered and this is slightly off-topic, but...
If you want to parse code, it is often easier and less error-prone to use a parser. If your code is Python code, Python comes with a couple of parsers (tokenize, ast, parser). For other languages, you can find a lot of parsers on the internet. ANTRL is a well-known one with Python bindings.
As an example, the following couple of lines of code print all lines of a Python module that are not comments and not doc-strings:
import tokenize
ignored_tokens = [tokenize.NEWLINE,tokenize.COMMENT,tokenize.N_TOKENS
,tokenize.STRING,tokenize.ENDMARKER,tokenize.INDENT
,tokenize.DEDENT,tokenize.NL]
with open('test.py', 'r') as f:
g = tokenize.generate_tokens(f.readline)
line_num = 0
for a_token in g:
if a_token[2][0] != line_num and a_token[0] not in ignored_tokens:
line_num = a_token[2][0]
print(a_token)
As a_token above is already parsed, you can easily check for function definition, too. You can also keep track where the function ends by looking at the current column start a_token[2][1]. If you want to do more complex things, you should use ast.

You string literals aren't what you think they are.
You can specify a space or TAB like so:
space = ' '
tab = '\t'

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to complete for loop with pdfplumber? - python

Related

How do I solve this IndexError inside a for Loop?

Correct mistakes in a python program dealing with CSV

Two regex functions together do not work

Python/IPython strange non reproducible list index out of range error

str.startswith() not working as I intended

Categories

Resources