I found this code on the internet; it is meant to search through text files in a zipped folder to find matches. I ran it in IDLE to see how it worked, but I have a problem, and it seems to be this line:
fname = seed + ".txt"
The error message returns this:
Traceback (most recent call last):
File "C:/Users/[name]/AppData/Local/Programs/Python/Python36-32/zip2.py", line 10, in <module>
fname = seed + ".txt"
TypeError: can only concatenate tuple (not "str") to tuple
Here is the code:
import re
from zipfile import *
findnothing = re.compile(r"Next nothing is (\d+)").match
comments = []
z = ZipFile("channel.zip", "r")
seed = "90052"
while True:
    fname = seed + ".txt"
    comments.append(z.getinfo(fname).comment)
    guts = z.read(fname)
    m = findnothing(guts.decode('utf-8'))
    if m:
        seed = m.groups(1)
    else:
        break
print("".join(comments))
I've searched stackoverflow, and have found nothing similar to my issue. Most of them state that a comma in a variable usually causes the compiler to treat it as a tuple. I don't understand why it is saying seed is a tuple. There is no comma, no parenthesis, or anything else that would define it as a tuple to the Python compiler. How can I fix this?
Thanks in advance
Change m.groups(1) to m.group(1) (singular, not plural). Per the docs at https://docs.python.org/3.6/library/re.html#re.match.groups , group returns a single match but groups returns a tuple of all matches. You are getting the error the second time through the loop, when seed has been replaced by the output of groups, which is a tuple.
First, re.match only matches at the start of the string. Make sure you didn't mean to use re.search instead!
Second, m.groups(1) returns a tuple like ('12345',). Try seed = m.groups(1)[0] instead.
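To see the difference concretely, here is a minimal sketch. The sample string is made up for illustration and stands in for the contents of one of the text files. Note that the argument passed to groups() is a default value for groups that did not take part in the match, not an index, which is why m.groups(1) still returns the whole tuple.

```python
import re

# Hypothetical sample text standing in for one file's contents.
text = "Next nothing is 94191"
m = re.match(r"Next nothing is (\d+)", text)

all_groups = m.groups()   # tuple holding every captured group
first_group = m.group(1)  # the first captured group, as a str

print(type(all_groups).__name__, all_groups)
print(type(first_group).__name__, first_group)

# Only the str can be concatenated with ".txt".
fname = first_group + ".txt"
print(fname)
```

With group(1), seed stays a str on every pass through the loop, so seed + ".txt" keeps working.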
Problem
I was following this tutorial https://www.youtube.com/watch?v=eTz3VZmNPSE&list=PLxEus0qxF0wciRWRHIRck51EJRiQyiwZT&index=16
when the code returned this error.
Goal
I need to scrape a pdf that looks like this (I wanted to attach the pdf but I do not know how):
170001WO01
English (US) into Arabic (DZ)
Trans./Edit/Proof. 22.117,00 Words 1,350 29.857,95
TM - Fuzzy Match 2.941,00 Words 0,500 1.470,50
TM - Exact Match 353,00 Words 0,100 35,30
Approach
I am following the tutorial aforementioned with pdfplumber.
import re
import pdfplumber
import PyPDF2
import pandas as pd
from collections import namedtuple
ap = open('test.pdf', 'rb')
I name the columns of the DataFrame that I want as the final product.
Serv = namedtuple('Serv', 'case_number language num_trans num_fuzzy num_exact')
Issues
I have 5 different lines compared to the tutorial example which has 2.
case_li = re.compile(r'(\d{6}\w{2}\d{2})')
language_li = re.compile(r'(nglish \(US\) into )(.*)')
trans_li = re.compile(r'(Trans./Edit/Proof. )(\d{2}\.\d{3})')
fuzzy_li = re.compile(r'(TM - Fuzzy Match )(\d{1}\.\d{3})')
exact_li = re.compile(r'(M - Exact Match )(\d{3})')
Issue
When I introduce the third line in the code, I got an error which I do not know. I have modified the code as 2e0byo suggested but I still get an error.
This is the new code:
line_items = []
with pdfplumber.open(ap) as pdf:
    page = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            line = case_li.search(line)
            if line:
                case_number = line
            line = language_li.search(line)
            if line:
                language = line.group(2)
            line = trans_li.search(line)
            if line:
                num_trans = line.group(2)
            line = fuzzy_li.search(line)
            if line:
                num_fuzzy = line.group(2)
            line = exact_li.search(line)
            if line:
                num_exact = line.group(2)
                line_items.append(Serv(case_number, language, num_trans, num_fuzzy, num_exact))
and this is the new error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13992/1572426536.py in <module>
10 case_number = line
11
---> 12 line = language_li.search(line)
13 if line:
14 language = line.group(2)
TypeError: expected string or bytes-like object
# GOAL
It would be to append the lines to line_items and eventually
df = pd.DataFrame(line_items)
You have reassigned line, here:
for line in text.split("\n"):
    # line is a str (the line)
    line = language_li.search(line)
    # line is no longer a str, but the result of a re.search
so line is no longer the text line, but the result of that match. Thus trans_li.search(line) is not searching the line you thought it was.
To fix your code, adopt a consistent pattern:
for line in text.split("\n"):
    match = language_li.search(line)
    # line is still a str (the line)
    # match is the result of re.search
    if match:
        do_something(match.groups())
    ...
    # line is *still* a str
    match = trans_li.search(line)
    if match:
        ...
For completeness' sake, with the dreaded walrus operator you can now write this:
if (match := language_li.search(line)) is not None:
    do_something(match.groups())
Which I briefly thought was neater, but now think ugly. I fully expect to get downvoted just for mentioning the walrus operator. (If you look at the edit history of this post you will see that I have even forgotten how to use it and wrote it backwards first.)
PS: you may wish to read up on variable scope in python, although no language I know would allow this particular scope collision (overwriting a loop variable within the loop). Incidentally doing this kind of thing by mistake is why conventionally we avoid similarly-named variables (like line and Line) and go with things like line and match instead.
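As a rough sketch of that pattern applied to the sample lines from the question: the text is pasted inline here rather than extracted with pdfplumber, the regular expressions are loosened versions of the question's (an assumption, so that they capture the whole number), and a record is assumed to end at the "Exact Match" line.

```python
import re
from collections import namedtuple

Serv = namedtuple("Serv", "case_number language num_trans num_fuzzy num_exact")

# Sample text copied from the question; in the real script this would
# come from page.extract_text().
text = """170001WO01
English (US) into Arabic (DZ)
Trans./Edit/Proof. 22.117,00 Words 1,350 29.857,95
TM - Fuzzy Match 2.941,00 Words 0,500 1.470,50
TM - Exact Match 353,00 Words 0,100 35,30"""

# Assumed patterns, loosened from the question's to capture the full number.
case_li = re.compile(r"^(\d{6}\w{2}\d{2})$")
language_li = re.compile(r"English \(US\) into (.*)")
trans_li = re.compile(r"Trans\./Edit/Proof\. ([\d.,]+)")
fuzzy_li = re.compile(r"TM - Fuzzy Match ([\d.,]+)")
exact_li = re.compile(r"TM - Exact Match ([\d.,]+)")

line_items = []
fields = {}
for line in text.split("\n"):
    # `line` stays a str throughout; every search result goes into `match`.
    match = case_li.search(line)
    if match:
        fields = {"case_number": match.group(1)}  # a new record starts here
        continue
    for key, pattern in (("language", language_li), ("num_trans", trans_li),
                         ("num_fuzzy", fuzzy_li), ("num_exact", exact_li)):
        match = pattern.search(line)
        if match:
            fields[key] = match.group(1)
    # Assumption: the exact-match line is the last line of each record.
    if "num_exact" in fields and len(fields) == 5:
        line_items.append(Serv(**fields))
        fields = {}

print(line_items)
```

Because line_items is a list of namedtuples, pd.DataFrame(line_items) should pick up the field names as column names.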
I am trying to execute a python script which is giving me an IndexError. I understood that the rsplit() method failed to split the string. I don't exactly know why it is showing index out of range. Could anyone tell me how to solve this problem ?
code
raw_directory = 'results/'
for name in glob.glob(raw_directory + '*.x*'):
    try:
        #with open(name) as g:
        #    pass
        print(name)
        reaction_mechanism = 'gri30.xml' #'mech.cti'
        gas = ct.Solution(reaction_mechanism)
        f = ct.CounterflowDiffusionFlame(gas, width=1.)
        name_only = name.rsplit('\\',1)[1] #delete directory in filename
        file_name = name_only
        f.restore(filename=raw_directory + file_name, name='diff1D', loglevel=0)
Output
If I delete the file strain_loop_07.xml, I got the same error with another file.
results/strain_loop_07.xml
Traceback (most recent call last):
File "code.py", line 38, in <module>
name_only = name.rsplit('\\',1)[1] #delete directory in filename
IndexError: list index out of range
If rsplit fails to split the string, it returns a list with only one element, so only index [0] exists and [1] is out of range.
From the output in your post, I understand that the name variable holds text like "results/strain_loop_07.xml", so you want to rsplit on the forward slash, with a line more like
name_only = name.rsplit('/', 1)[1]
That gives you the "strain_loop_07.xml" element, which is probably what you wanted, because name.rsplit('/', 1) returns something like
['results', 'strain_loop_07.xml']
By the way, don't hesitate to print your variables midway for debugging; it is often the quickest way to understand the state of a variable at a specific point. Here, print right before your split!
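A small sketch of what happens with each separator, using the path printed in the question. The os.path.basename call at the end is a suggested alternative and not something the original script uses.

```python
import os.path

name = "results/strain_loop_07.xml"  # the value printed in the question

# There is no backslash in the string, so nothing is split off and the
# result is a one-element list; indexing it with [1] raises IndexError.
print(name.rsplit("\\", 1))

# Splitting on the forward slash gives the directory and the file name.
print(name.rsplit("/", 1))

# Assumption: os.path.basename is an acceptable substitute here; it
# avoids hard-coding a separator.
name_only = os.path.basename(name)
print(name_only)
```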
I hope someone can point out where I have gone wrong. I am looking to iterate through the 'mylist' list to grab the first entry and use that first entry as a search string, then perform a search and gather particular information once the string is found and post it to an Excel worksheet. Then I am hoping to iterate to the next 'mylist' entry and perform another search. The first iteration performs ok, but with the second iteration of the loop I get the following CMD window error...
2014 Apr 25 09:43:42.080 INFORMATION FOR A
14.01
Traceback (most recent call last):
File "C:\TEST.py", line 362, in <module>
duta()
File "C:\TEST.py", line 128, in duta
if split[10] == 'A':
IndexError: list index out of range
Exception RuntimeError: RuntimeError('sys.meta_path must be a list of
import hooks',) in <bound method Workbook.__del__ of
<xlsxwriter.workbook.Workbook object at 0x0238C310>> ignored
Here's my code...
for root, subFolders, files in chain.from_iterable(os.walk(path) for path in paths):
    for filename in files:
        if filename.endswith('.txt'):
            with open(os.path.join(root, filename), 'r') as fBMA:
                searchlinesBMA = fBMA.readlines()
                fBMA.close()
                row_numBMAA+=1
                num = 1
                b = 1
                print len(mylist)
                print (mylist[num])
                while b<len(mylist):
                    for i, line in enumerate(searchlinesBMA):
                        for word in [mylist[num]]:
                            if word in line:
                                keylineBMA = searchlinesBMA[i-2]
                                Rline = searchlinesBMA[i+10]
                                Rline = re.sub('[()]', '', Rline)
                                valueR = Rline.split()
                                split = keylineBMA.split()
                                if split[6] == 'A':
                                    print keylineBMA
                                    print valueR[3]
                                    worksheetFILTERA.write(row_numBMAA,3,valueR[3], decimal_format)
                                    row_numBMAA+=1
                                    break
                    num+=1
                    b=+1
Any ideas as to what I am doing wrong? Is my loop out of position, or am I not inputting the correct list pointer?
Thanks,
MikG
In my experience, this error is related to garbage collecting out of order. I saw it once when I was debugging code where someone was writing to files in a __del__ method. (Bad idea). I'm pretty sure you're getting the error because you're closing the file inside a with: block, which does the open and close for you.
On the second run, you got split = keylineBMA.split() with a result shorter than you expected. You try to access index 10 which is outside the list.
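One way to avoid the IndexError is to check the length of the split result before indexing it. The sketch below is written for Python 3 and uses made-up lines in place of the real log file; only the first line is taken from the output in the question. It also shows that b=+1, as written in the question's loop, assigns positive one rather than incrementing, which looks like a typo for b+=1.

```python
# Hypothetical lines standing in for the contents of one log file.
lines = [
    "2014 Apr 25 09:43:42.080 INFORMATION FOR A",
    "short line",
]

matched = []
for line in lines:
    parts = line.split()
    # Check the length first so short lines are skipped instead of
    # raising IndexError.
    if len(parts) > 6 and parts[6] == "A":
        matched.append(line)

print(matched)

# The two statements below look alike but behave differently.
b = 5
b = +1   # assigns positive one; the previous value is discarded
b += 1   # increments b
print(b)
```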
I have the code below, in which the filenames are FR1.1.csv, FR2.0.csv, etc. I am using these names in the header row, but I want to change them to FR1.1, FR2.0 and so on, so I am using the strip function to remove .csv. When I tried it at the command prompt it worked fine, but when I added it to the main script it does not give the expected output.
for fname in filenames:
    print "fname : ", fname
    fname.strip('.csv');
    print "after strip fname: ", fname
    headerline.append(fname+' Compile');
    headerline.append(fname+' Run');
output i am getting
fname :FR1.1.csv
after strip fname: FR1.1.csv
required output-->
fname :FR1.1.csv
after strip fname: FR1.1
I guess there is some indentation problem in my code after the for loop.
Please tell me the correct way to achieve this.
Strings are immutable, so string methods can't change the original string, they return a new one which you need to assign again:
fname = fname.strip('.csv') # no semicolons in Python!
But this call doesn't do what you probably expect it to. It will remove all the leading and trailing characters c, s, v and . from your string:
>>> "cross.csv".strip(".csv")
'ro'
So you probably want to do
import re
fname = re.sub(r"\.csv$", "", fname)
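If you would rather not use a regular expression, a plain suffix check does the same job. This is only a sketch in Python 3 syntax; the variable name stem is made up for illustration.

```python
fname = "FR1.1.csv"

# strip() treats its argument as a set of characters, not as a suffix,
# so it can eat into the name itself.
stripped = "cross.csv".strip(".csv")
print(stripped)

# Remove the suffix only when it is really there.
suffix = ".csv"
if fname.endswith(suffix):
    stem = fname[:-len(suffix)]
else:
    stem = fname
print(stem)
```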
Strings are immutable. strip() returns a new string.
>>> "FR1.1.csv".strip('.csv')
'FR1.1'
>>> m = "FR1.1.csv".strip('.csv')
>>> print(m)
FR1.1
You need to do fname = fname.strip('.csv').
And get rid of the semicolons in the end!
P.S - Please see Jon Clement's comment and Tim Pietzcker's answer to know why this code should not be used.
You probably should use os.path for path manipulations:
import os
#...
for fname in filenames:
    print "fname : ", fname
    fname = os.path.splitext(fname)[0]
    #...
The particular reason why your code fails is provided in other answers.
change
fname.strip('.csv')
with
fname = fname.strip('.csv')
I'm trying to write some code that searches through a directory and pulls out all the items that start with a certain numbers (defined by a list) and that end with '.labels.txt'. This is what I have so far.
lbldir = '/musc.repo/Data/shared/my_labeled_images/labeled_image_maps/'
picnum = []
for ii in os.listdir(picdir):
    num = ii.rstrip('.png')
    picnum.append(num)
lblpath = []
for file in os.listdir(lbldir):
    if fnmatch.fnmatch(file, '*.labels.txt') and fnmatch.fnmatch(file, ii in picnum + '.*'):
        lblpath.append(os.path.abspath(file))
Here is the error I get
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-a03c65e65a71> in <module>()
3 lblpath = []
4 for file in os.listdir(lbldir):
----> 5 if fnmatch.fnmatch(file, '*.labels.txt') and fnmatch.fnmatch(file, ii in picnum + '.*'):
6 lblpath.append(os.path.abspath(file))
TypeError: can only concatenate list (not "str") to list
I realize the ii in picnum part won't work but I don't know how to get around it. Can this be accomplished with the fnmatch module or do I need regular expressions?
The error comes because you are trying to add ".*" (a string) to the end of picnum, which is a list, and not a string.
Also, ii in picnum isn't giving you back each item of picnum, because you are not iterating over ii. It just has the last value that it was assigned in your first loop.
Instead of testing both at once with the and, you might have a nested test that operates when you find a file matching .labels.txt, as below. This uses re instead of fnmatch to extract the digits from the beginning of the file name, instead of trying to match each picnum. This replaces your second loop:
import re
for file in os.listdir(lbldir):
    if file.endswith('.labels.txt'):
        startnum = re.match(r"\d+", file)
        if startnum and startnum.group(0) in picnum:
            lblpath.append(os.path.abspath(file))
I think that should work, but it is obviously untested without your actual file names.
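For what it's worth, this can also be done with fnmatch alone by testing each file against every picture number. The sketch below uses made-up file names in place of the os.listdir() results, so treat it as an illustration and not as tested against the real directories. It also uses os.path.splitext instead of rstrip('.png'), because rstrip removes trailing characters from a set and not a suffix, and it joins the name with lbldir, because os.path.abspath(file) would resolve the bare name against the current working directory.

```python
import fnmatch
import os

# Hypothetical file names standing in for os.listdir(picdir) and
# os.listdir(lbldir).
pic_files = ["101.png", "102.png"]
lbl_files = ["101.labels.txt", "102.notes.txt", "999.labels.txt"]
lbldir = "/musc.repo/Data/shared/my_labeled_images/labeled_image_maps/"

# Drop the extension from each picture name to get its number.
picnum = {os.path.splitext(f)[0] for f in pic_files}

lblpath = []
for file in lbl_files:
    if not fnmatch.fnmatch(file, "*.labels.txt"):
        continue
    # Test the file name against each picture number in turn.
    if any(fnmatch.fnmatch(file, num + ".*") for num in picnum):
        lblpath.append(os.path.join(lbldir, file))

print(lblpath)
```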