I'm trying to write some code that searches through a directory and pulls out all the items that start with certain numbers (defined by a list) and end with '.labels.txt'. This is what I have so far.
lbldir = '/musc.repo/Data/shared/my_labeled_images/labeled_image_maps/'
picnum = []
for ii in os.listdir(picdir):
    num = ii.rstrip('.png')
    picnum.append(num)
lblpath = []
for file in os.listdir(lbldir):
    if fnmatch.fnmatch(file, '*.labels.txt') and fnmatch.fnmatch(file, ii in picnum + '.*'):
        lblpath.append(os.path.abspath(file))
Here is the error I get
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-a03c65e65a71> in <module>()
3 lblpath = []
4 for file in os.listdir(lbldir):
----> 5 if fnmatch.fnmatch(file, '*.labels.txt') and fnmatch.fnmatch(file, ii in picnum + '.*'):
6 lblpath.append(os.path.abspath(file))
TypeError: can only concatenate list (not "str") to list
I realize the ii in picnum part won't work but I don't know how to get around it. Can this be accomplished with the fnmatch module or do I need regular expressions?
The error comes because you are trying to concatenate ".*" (a string) onto picnum, which is a list, not a string.
Also, ii in picnum isn't giving you each item of picnum; ii is not being iterated here, so it just holds the last value it was assigned in your first loop.
Instead of testing both conditions at once with and, you can nest the tests so the second one only runs when you find a file matching .labels.txt, as below. This uses re instead of fnmatch to extract the digits from the beginning of the file name, rather than trying to match against each entry of picnum. This replaces your second loop:
import re

for file in os.listdir(lbldir):
    if file.endswith('.labels.txt'):
        startnum = re.match(r"\d+", file)  # digits at the start of the file name
        if startnum and startnum.group(0) in picnum:
            lblpath.append(os.path.abspath(file))
I think that should work, but it is obviously untested without your actual file names.
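To answer the fnmatch question directly: you don't strictly need re. An fnmatch-only sketch (my own untested variant, which assumes the files are named like '<number>.labels.txt' and that joining with lbldir is what you actually want for the path) would be:
lblpath = []
for file in os.listdir(lbldir):
    for num in picnum:
        # assumes names look exactly like '<number>.labels.txt';
        # use num + '.*.labels.txt' if there is extra text in between
        if fnmatch.fnmatch(file, num + '.labels.txt'):
            lblpath.append(os.path.abspath(os.path.join(lbldir, file)))
            break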
Problem
I was following this tutorial https://www.youtube.com/watch?v=eTz3VZmNPSE&list=PLxEus0qxF0wciRWRHIRck51EJRiQyiwZT&index=16
when the code returned this error.
Goal
I need to scrape a pdf that looks like this (I wanted to attach the pdf but I do not know how):
170001WO01
English (US) into Arabic (DZ)
Trans./Edit/Proof. 22.117,00 Words 1,350 29.857,95
TM - Fuzzy Match 2.941,00 Words 0,500 1.470,50
TM - Exact Match 353,00 Words 0,100 35,30
Approach
I am following the aforementioned tutorial, which uses pdfplumber.
import re
import pdfplumber
import PyPDF2
import pandas as pd
from collections import namedtuple
ap = open('test.pdf', 'rb')
I name the columns of the dataframe that I want as the final product.
Serv = namedtuple('Serv', 'case_number language num_trans num_fuzzy num_exact')
Issues
I have 5 different line patterns, compared to the tutorial example, which has 2.
case_li = re.compile(r'(\d{6}\w{2}\d{2})')
language_li = re.compile(r'(nglish \(US\) into )(.*)')
trans_li = re.compile(r'(Trans./Edit/Proof. )(\d{2}\.\d{3})')
fuzzy_li = re.compile(r'(TM - Fuzzy Match )(\d{1}\.\d{3})')
exact_li = re.compile(r'(M - Exact Match )(\d{3})')
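As a quick sanity check (not part of the tutorial; it just runs the patterns against the sample lines pasted above), the expressions behave like this:
print(case_li.search('170001WO01').group(1))                        # 170001WO01
print(language_li.search('English (US) into Arabic (DZ)').group(2)) # Arabic (DZ)
print(trans_li.search('Trans./Edit/Proof. 22.117,00 Words 1,350 29.857,95').group(2))  # 22.117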
Issue
When I introduce the third line pattern in the code, I get an error I do not understand. I have modified the code as 2e0byo suggested, but I still get an error.
This is the new code:
line_items = []

with pdfplumber.open(ap) as pdf:
    page = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            line = case_li.search(line)
            if line:
                case_number = line

            line = language_li.search(line)
            if line:
                language = line.group(2)

            line = trans_li.search(line)
            if line:
                num_trans = line.group(2)

            line = fuzzy_li.search(line)
            if line:
                num_fuzzy = line.group(2)

            line = exact_li.search(line)
            if line:
                num_exact = line.group(2)

            line_items.append(Serv(case_number, language, num_trans, num_fuzzy, num_exact))
and this is the new error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13992/1572426536.py in <module>
10 case_number = line
11
---> 12 line = language_li.search(line)
13 if line:
14 language = line.group(2)
TypeError: expected string or bytes-like object
Goal
The goal is to append the matched lines to line_items and eventually build
df = pd.DataFrame(line_items)
You have reassigned line, here:
for line in text.split("\n"):
    # line is a str (the line)
    line = language_li.search(line)
    # line is no longer a str, but the result of a re.search
so line is no longer the text line, but the result of that match. Thus trans_li.search(line) is not searching the line you thought it was.
To fix your code, adopt a consistent pattern:
for line in text.split("\n"):
    match = language_li.search(line)
    # line is still a str (the line)
    # match is the result of re.search
    if match:
        do_something(match.groups())
        ...
    # line is *still* a str
    match = trans_li.search(line)
    if match:
        ...
For completeness' sake, with the dreaded walrus operator you can now write this:
if (match := language_li.search(line)) is not None:
    do_something(match.groups())
I briefly thought that was neater, but now think it is ugly. I fully expect to get downvoted just for mentioning the walrus operator. (If you look at the edit history of this post you will see that I had even forgotten how to use it and wrote it backwards first.)
PS: you may wish to read up on variable scope in Python, although no language I know would allow this particular scope collision (overwriting a loop variable within the loop). Incidentally, doing this kind of thing by mistake is why we conventionally avoid similarly-named variables (like line and Line) and go with things like line and match instead.
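Putting that pattern together with the loop from the question, a full version might look like the sketch below. It is untested against the real PDF, keeps the question's regex names, and appending on the exact-match row (the last row of each block in the sample) is an assumption on my part about the PDF layout.
line_items = []
with pdfplumber.open(ap) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            # keep `line` as the str; store each search result in `match`
            match = case_li.search(line)
            if match:
                case_number = match.group(1)
            match = language_li.search(line)
            if match:
                language = match.group(2)
            match = trans_li.search(line)
            if match:
                num_trans = match.group(2)
            match = fuzzy_li.search(line)
            if match:
                num_fuzzy = match.group(2)
            match = exact_li.search(line)
            if match:
                num_exact = match.group(2)
                # assumed trigger: one Serv per block, emitted on its last row
                line_items.append(Serv(case_number, language, num_trans, num_fuzzy, num_exact))

df = pd.DataFrame(line_items)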
I am trying to execute a Python script which is giving me an IndexError. I understand that the rsplit() method failed to split the string, but I don't exactly know why it is reporting index out of range. Could anyone tell me how to solve this problem?
Code
raw_directory = 'results/'
for name in glob.glob(raw_directory + '*.x*'):
    try:
        #with open(name) as g:
        #    pass
        print(name)
        reaction_mechanism = 'gri30.xml' #'mech.cti'
        gas = ct.Solution(reaction_mechanism)
        f = ct.CounterflowDiffusionFlame(gas, width=1.)
        name_only = name.rsplit('\\',1)[1] #delete directory in filename
        file_name = name_only
        f.restore(filename=raw_directory + file_name, name='diff1D', loglevel=0)
Output
If I delete the file strain_loop_07.xml, I get the same error with another file.
results/strain_loop_07.xml
Traceback (most recent call last):
File "code.py", line 38, in <module>
name_only = name.rsplit('\\',1)[1] #delete directory in filename
IndexError: list index out of range
If rsplit fails to split the string, it returns a list with only one element, so only index [0] exists, not [1].
As the output in this post shows, the name variable holds text like "results/strain_loop_07.xml", so you want to rsplit on '/', with a line more like
name_only = name.rsplit('/', 1)[1]
That way you'll get the "strain_loop_07.xml" element, which is probably what you wanted, because name.rsplit('/', 1) returns something like
['results', 'strain_loop_07.xml']
By the way, don't hesitate to print your variables midway through for debugging; that is often the thing to do to understand the state of a variable at a specific point. Here, right before your split!
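As an alternative (my own suggestion, not from the original answer), os.path.basename strips the directory part using the separator conventions of the current platform, so you don't have to pick '/' or '\\' yourself:
import os

name_only = os.path.basename(name)  # 'results/strain_loop_07.xml' -> 'strain_loop_07.xml'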
I found this code on the internet, meant to search through text files in a zipped folder to find matches. I ran it in IDLE to see how it worked, but I have a problem, and it seems to be this line:
fname = seed + ".txt"
The error message returns this:
Traceback (most recent call last):
File "C:/Users/[name]/AppData/Local/Programs/Python/Python36-32/zip2.py", line 10, in <module>
fname = seed + ".txt"
TypeError: can only concatenate tuple (not "str") to tuple
Here is the code:
import re
from zipfile import *
findnothing = re.compile(r"Next nothing is (\d+)").match
comments = []
z = ZipFile("channel.zip", "r")
seed = "90052"
while True:
    fname = seed + ".txt"
    comments.append(z.getinfo(fname).comment)
    guts = z.read(fname)
    m = findnothing(guts.decode('utf-8'))
    if m:
        seed = m.groups(1)
    else:
        break
print("".join(comments))
I've searched Stack Overflow and have found nothing similar to my issue. Most of the posts I found state that a comma in an assignment usually causes Python to treat the value as a tuple. I don't understand why it is saying seed is a tuple; there is no comma, no parenthesis, or anything else that would define it as a tuple. How can I fix this?
Thanks in advance
Change m.groups(1) to m.group(1) (singular, not plural). Per the docs at https://docs.python.org/3.6/library/re.html#re.match.groups, group returns a single captured group as a string, while groups returns a tuple of all the groups. You are getting the error the second time through the loop, when seed has been replaced by the output of groups, which is a tuple.
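A quick illustration of the difference (my own example, with a made-up number in place of whatever the zip's text files actually contain):
import re
m = re.match(r"Next nothing is (\d+)", "Next nothing is 12345")
print(repr(m.group(1)))   # '12345'   -> a str, safe to concatenate with ".txt"
print(repr(m.groups()))   # ('12345',) -> a tuple, which is what caused your error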
First, re.match only matches at the start of the string. Make sure you didn't mean to use re.search instead!
Second, m.groups(1) returns a tuple like ('12345',). Try seed = m.groups(1)[0] instead.
To introduce you to the context of my problem: I have two files containing information about genes:
pos.bed contains positions of specific genes, and hg19-genes.txt contains all the existing genes of the species, with some attributes such as the position of the gene (start and end), its name, its symbol, etc.
The problem is that in pos, only the position of the gene is indicated, but not its name/symbol. I would like to read through both files and compare the start and end in each line. If there is a match, I would like to get the symbol of the corresponding gene.
I wrote this little python code:
pos=open('C:/Users/Claire/Desktop/Arithmetics/pos.bed','r')
gen=open('C:/Users/Claire/Desktop/Arithmetics/hg19-genes.txt','r')
for row in pos:
    row=row.split()
    start=row[11]
    end=row[12]
    for row2 in gen:
        row2=row2.split()
        start2=row2[3]
        end2=row2[4]
        sym=row2[10]
        if start==start2 and end==end2:
            print sym
pos.close()
gen.close()
But it seems like this only compares the two files line by line (like line 2 in file pos with line 2 in file gen only). So I tried adding an else to the if block, but I get an error message:
else:
    gen.next()
StopIteration Traceback (most recent call last)
<ipython-input-9-a309fdca7035> in <module>()
14 print sym
15 else:
---> 16 gen.next()
17
18 pos.close()
StopIteration:
I know it is possible to compare all the lines of 2 files, no matter the position of the line, by doing something like:
same = set(file1).intersection(file2)
but in my case I only want to compare some columns of each line as the lines have different information in each file (except for the start and the end). Is there a similar way to compare lines in files, but only for some specified items?
gen is an iterator that iterates over the lines of the file exactly once, that is, when processing the first row in pos. The simplest workaround for that is to open the gen file inside the outer loop:
pos=open('C:/Users/Claire/Desktop/Arithmetics/pos.bed','r')
for row in pos:
    row=row.split()
    start=row[11]
    end=row[12]
    gen=open('C:/Users/Claire/Desktop/Arithmetics/hg19-genes.txt','r')
    for row2 in gen:
        row2=row2.split()
        start2=row2[3]
        end2=row2[4]
        sym=row2[10]
        if start==start2 and end==end2:
            print sym
    gen.close()
pos.close()
Another option would be to read all lines of gen into a list and use that list in the inner loop.
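A sketch of that second option (untested, and assuming the same column layout as above): read gen into a list once, then reuse the in-memory list for every row of pos, which avoids reopening the file repeatedly.
gen=open('C:/Users/Claire/Desktop/Arithmetics/hg19-genes.txt','r')
gen_rows=[row2.split() for row2 in gen]   # read the whole file once
gen.close()

pos=open('C:/Users/Claire/Desktop/Arithmetics/pos.bed','r')
for row in pos:
    row=row.split()
    start=row[11]
    end=row[12]
    for row2 in gen_rows:                 # reuse the in-memory list
        if start==row2[3] and end==row2[4]:
            print row2[10]                # the gene symbol
pos.close()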
I hope someone can point out where I have gone wrong. I am looking to iterate through the 'mylist' list, grab the first entry and use it as a search string, then perform a search, gather particular information once the string is found, and post it to an Excel worksheet. Then I want to move on to the next 'mylist' entry and perform another search. The first iteration works OK, but on the second iteration of the loop I get the following error in the CMD window...
2014 Apr 25 09:43:42.080 INFORMATION FOR A
14.01
Traceback (most recent call last):
File "C:\TEST.py", line 362, in <module>
duta()
File "C:\TEST.py", line 128, in duta
if split[10] == 'A':
IndexError: list index out of range
Exception RuntimeError: RuntimeError('sys.meta_path must be a list of
import hooks',) in <bound method Workbook.__del__ of
<xlsxwriter.workbook.Workbook object at 0x0238C310>> ignored
Here's my code...
for root, subFolders, files in chain.from_iterable(os.walk(path) for path in paths):
    for filename in files:
        if filename.endswith('.txt'):
            with open(os.path.join(root, filename), 'r') as fBMA:
                searchlinesBMA = fBMA.readlines()
                fBMA.close()
                row_numBMAA+=1
                num = 1
                b = 1
                print len(mylist)
                print (mylist[num])
                while b<len(mylist):
                    for i, line in enumerate(searchlinesBMA):
                        for word in [mylist[num]]:
                            if word in line:
                                keylineBMA = searchlinesBMA[i-2]
                                Rline = searchlinesBMA[i+10]
                                Rline = re.sub('[()]', '', Rline)
                                valueR = Rline.split()
                                split = keylineBMA.split()
                                if split[6] == 'A':
                                    print keylineBMA
                                    print valueR[3]
                                    worksheetFILTERA.write(row_numBMAA,3,valueR[3], decimal_format)
                                    row_numBMAA+=1
                                break
                    num+=1
                    b=+1
Any ideas as to what I am doing wrong? Is my loop in the wrong place, or am I not using the correct list index?
Thanks,
MikG
In my experience, this error is related to garbage collection happening out of order. I saw it once when I was debugging code where someone was writing to files in a __del__ method (bad idea). I'm pretty sure you're getting the RuntimeError because you're closing the file inside a with: block, which already does the open and close for you.
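A minimal sketch of that fix (with a hypothetical file name; the point is simply to let the with block manage the file and drop the explicit close):
with open('example.txt', 'r') as fBMA:       # hypothetical file name
    searchlinesBMA = fBMA.readlines()
    # do the per-file processing here; do NOT call fBMA.close() yourself
# the file is closed automatically when the with block exits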
On the second run, split = keylineBMA.split() produced a result shorter than you expected, so trying to access index 10 falls outside the list.
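One way to guard against that (a sketch of my own, not an explanation of why the line is shorter) is to check the length before indexing, and print the offending line so you can see what you actually got:
split = keylineBMA.split()
if len(split) > 10 and split[10] == 'A':
    print keylineBMA
    print valueR[3]
    worksheetFILTERA.write(row_numBMAA,3,valueR[3], decimal_format)
    row_numBMAA+=1
else:
    print 'unexpected line:', keylineBMA   # inspect lines that do not have 11 fields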