Find a phrase in defined list with a file - python

I have a list named my = ['cbs is down', 'abnormal']
and I have opened a file in read mode.
Now I want to check whether any of the strings in the list occurs in a line of that file and, if so, perform the if action:
fopen = open("test.txt","r")
my =['cbs is down', 'abnormal']
for line in fopen:
    if my in line:
        print ("down")
When I execute it, I get the following:
Traceback (most recent call last):
  File "E:/python/fileread.py", line 4, in <module>
    if my in line:
TypeError: 'in <string>' requires string as left operand, not list

This should work things out:
if any(i in line for i in my):
    ...
Basically you are going through my and checking whether any of its elements is present in line.
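A minimal, self-contained sketch of the same check, assuming the file contents are held in a list of strings. The sample lines and the name matching_lines are made up for illustration and are not part of the original code:

```python
my = ['cbs is down', 'abnormal']

# Stand-in for the lines of test.txt (illustrative data, not the real file).
lines = [
    "Some text cbs is down\n",
    "Yes, abnormal\n",
    "not in my list\n",
]

# Keep every line that contains at least one of the phrases in my.
matching_lines = [line for line in lines if any(i in line for i in my)]

for line in matching_lines:
    print("down")
```

Because any() stops at the first phrase that matches, each matching line is reported once, even if it contains several of the phrases.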

fopen = open("test.txt","r")
my =['cbs is down', 'abnormal']
for line in fopen:
    for x in my:
        if x in line:
            print ("down")
Sample input
Some text cbs is down
Yes, abnormal
not in my list
cbs is down
Output
down
down
down

The reason for your error:
The in operator as used in:
if my in line: ...
   ^     ^
   |     |_ right hand side
   |
   |_ left hand side
requires that, when the operand on the right-hand side is a string (i.e. line), the operand on the left-hand side is a string as well. This operand consistency check is implemented by the str.__contains__ method, which is called on the string on the right-hand side (see the CPython implementation). It is the same as:
if line.__contains__(my): ...
You are, however, passing a list, my, instead of a string.
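The difference between the two kinds of left operand can be seen in a small sketch. The sample line is made up, and the names substring_found and raised are only there for illustration:

```python
my = ['cbs is down', 'abnormal']
line = "Some text cbs is down"

# A str on the left works: this is an ordinary substring test.
substring_found = my[0] in line

# A list on the left raises TypeError, because str.__contains__
# only accepts a str argument.
try:
    my in line
    raised = False
except TypeError:
    raised = True

print(substring_found, raised)
```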
An easy way to resolve this is to check whether any of the items in the list is contained in the current line, using the builtin any function:
for line in fopen:
    if any(item in line for item in my):
        ...
Or, since you have just two items, use the or operator (pun unintended), which short-circuits in the same way as any:
for line in fopen:
    if 'cbs is down' in line or 'abnormal' in line:
        ...

You could also join the terms in my to a regular expression like \b(cbs is down|abnormal)\b and use re.findall or re.search to find the terms. This way, you can also enclose the pattern in word-boundaries \b...\b so it does not match parts of longer words, and you also see which term was matched, and where.
>>> import re
>>> my = ['cbs is down', 'abnormal']
>>> line = "notacbs is downright abnormal"
>>> p = re.compile(r"\b(" + "|".join(map(re.escape, my)) + r")\b")
>>> p.findall(line)
['abnormal']
>>> p.search(line).span()
(21, 29)
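To report every matched term together with its position, re.finditer can be used in place of findall and search. This is a small sketch along the same lines; the helper name find_terms is hypothetical:

```python
import re

my = ['cbs is down', 'abnormal']
pattern = re.compile(r"\b(" + "|".join(map(re.escape, my)) + r")\b")

def find_terms(line):
    """Return (term, (start, end)) for every whole-word match in line."""
    return [(m.group(1), m.span()) for m in pattern.finditer(line)]

print(find_terms("cbs is down and abnormal"))
```

An empty list means that none of the terms occurs in the line as a whole word.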


How to complete for loop with pdfplumber?

Problem
I was following this tutorial: https://www.youtube.com/watch?v=eTz3VZmNPSE&list=PLxEus0qxF0wciRWRHIRck51EJRiQyiwZT&index=16
when the code returned the error shown below.
Goal
I need to scrape a pdf that looks like this (I wanted to attach the pdf but I do not know how):
170001WO01
English (US) into Arabic (DZ)
Trans./Edit/Proof. 22.117,00 Words 1,350 29.857,95
TM - Fuzzy Match 2.941,00 Words 0,500 1.470,50
TM - Exact Match 353,00 Words 0,100 35,30
Approach
I am following the aforementioned tutorial, which uses pdfplumber.
import re
import pdfplumber
import PyPDF2
import pandas as pd
from collections import namedtuple
ap = open('test.pdf', 'rb')
I name the columns of the dataframe that I want as the final product.
Serv = namedtuple('Serv', 'case_number language num_trans num_fuzzy num_exact')
Issues
I have 5 different lines compared to the tutorial example which has 2.
case_li = re.compile(r'(\d{6}\w{2}\d{2})')
language_li = re.compile(r'(nglish \(US\) into )(.*)')
trans_li = re.compile(r'(Trans./Edit/Proof. )(\d{2}\.\d{3})')
fuzzy_li = re.compile(r'(TM - Fuzzy Match )(\d{1}\.\d{3})')
exact_li = re.compile(r'(M - Exact Match )(\d{3})')
Issue
When I introduce the third line in the code, I get an error that I do not understand. I have modified the code as 2e0byo suggested, but I still get an error.
This is the new code:
line_items = []
with pdfplumber.open(ap) as pdf:
    page = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            line = case_li.search(line)
            if line:
                case_number = line
            line = language_li.search(line)
            if line:
                language = line.group(2)
            line = trans_li.search(line)
            if line:
                num_trans = line.group(2)
            line = fuzzy_li.search(line)
            if line:
                num_fuzzy = line.group(2)
            line = exact_li.search(line)
            if line:
                num_exact = line.group(2)
                line_items.append(Serv(case_number, language, num_trans, num_fuzzy, num_exact))
and this is the new error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13992/1572426536.py in <module>
     10 case_number = line
     11
---> 12 line = language_li.search(line)
     13 if line:
     14 language = line.group(2)
TypeError: expected string or bytes-like object
Goal
The goal is to append the lines to line_items and eventually build a dataframe:
df = pd.DataFrame(line_items)
You have reassigned line, here:
for line in text.split("\n"):
    # line is a str (the line)
    line = language_li.search(line)
    # line is no longer a str, but the result of a re.search
so line is no longer the text line, but the result of that match. Thus trans_li.search(line) is not searching the line you thought it was.
To fix your code, adopt a consistent pattern:
for line in text.split("\n"):
    match = language_li.search(line)
    # line is still a str (the line)
    # match is the result of re.search
    if match:
        do_something(match.groups())
    ...
    # line is *still* a str
    match = trans_li.search(line)
    if match:
        ...
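The same pattern can be tried without a PDF at hand. The following is only a sketch under stated assumptions: it works on a plain string shaped like the sample in the question instead of calling pdfplumber, and the PATTERNS table, the parse_services helper and the sample_text variable are hypothetical names, with regexes simplified from the ones in the question:

```python
import re
from collections import namedtuple

Serv = namedtuple("Serv", "case_number language num_trans num_fuzzy num_exact")

# Hypothetical patterns, one per field; group(1) is the value to keep.
PATTERNS = {
    "case_number": re.compile(r"^(\d{6}\w{2}\d{2})$"),
    "language": re.compile(r"^English \(US\) into (.*)$"),
    "num_trans": re.compile(r"^Trans\./Edit/Proof\. ([\d.,]+) Words"),
    "num_fuzzy": re.compile(r"^TM - Fuzzy Match ([\d.,]+) Words"),
    "num_exact": re.compile(r"^TM - Exact Match ([\d.,]+) Words"),
}

def parse_services(text):
    """Collect one Serv record per complete group of five fields."""
    items = []
    current = {}
    for line in text.split("\n"):
        for field, pattern in PATTERNS.items():
            match = pattern.search(line)  # line itself is never reassigned
            if match:
                current[field] = match.group(1)
                break
        if len(current) == len(PATTERNS):
            items.append(Serv(**current))
            current = {}
    return items

# Text shaped like the sample in the question (stand-in for page.extract_text()).
sample_text = "\n".join([
    "170001WO01",
    "English (US) into Arabic (DZ)",
    "Trans./Edit/Proof. 22.117,00 Words 1,350 29.857,95",
    "TM - Fuzzy Match 2.941,00 Words 0,500 1.470,50",
    "TM - Exact Match 353,00 Words 0,100 35,30",
])

print(parse_services(sample_text))
```

The resulting list of namedtuples can be passed to pd.DataFrame in the same way as line_items.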
For completeness' sake, with the dreaded walrus operator you can now write this:
if (match := language_li.search(line)) is not None:
    do_something(match.groups())
Which I briefly thought was neater, but now think ugly. I fully expect to get downvoted just for mentioning the walrus operator. (If you look at the edit history of this post you will see that I have even forgotten how to use it and wrote it backwards first.)
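One detail worth checking with the walrus operator is precedence: := binds more loosely than is not, so without parentheses the name is bound to the result of the comparison rather than to the match object. A short sketch, assuming Python 3.8 or later; the sample line and the name unparenthesized are made up for illustration:

```python
import re

language_li = re.compile(r'(nglish \(US\) into )(.*)')
line = "English (US) into Arabic (DZ)"

# Without parentheses, the name is bound to the outcome of `... is not None`.
if unparenthesized := language_li.search(line) is not None:
    pass

# With parentheses, match is bound to the match object itself.
if (match := language_li.search(line)) is not None:
    language = match.group(2)

print(unparenthesized, language)
```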
PS: you may wish to read up on variable scope in Python, although scoping would not have saved you here: reassigning a loop variable inside the loop body is perfectly legal in Python. Incidentally, making this kind of mistake is why we conventionally avoid similarly-named variables (like line and Line) and go with things like line and match instead.

Calling variable inside insert()

I have two text files which I'm trying to work with in python 2.7.7, structured as in these examples:
sequence_file.txt:
MKRPGGAGGGGGSPSLVTMANSSDDGYGGVGMEAEGDVEEEMMACGGGGE
positions.txt
10
7
4
What I want to do is insert a # symbol into the sequence at every position indicated in positions.txt:
MKR#PGG#AGGG#GGSPSLVTMANSSDDGYGGVGMEAEGDVEEEMMACGGGGE
At the moment, my code is as follows:
# Open sequence file, remove newlines:
with open ("sequence_file.txt", "r") as seqfile:
    seqstring=seqfile.read().replace('\n', '').replace('\r', '')
# Turn sequence into list
seqlist = list(sequence)
# Open positions.txt, and use each line as a parameter for the insert() function.
with open("positions.txt") as positions:
    for line in positions:
        insertpoint = line.rstrip('\n')
        seqlist.insert(insertpoint, '#')
seqlist = list(sequence)
The last block of that code is where it falls down. I'm trying to have it read the first line, trim the newline character (\n) and then use that line as a variable (insertpoint) in the insert() command. However, whenever I try this it tells me:
Traceback (most recent call last):
File "<pyshell#8>", line 4, in <module>
seqlist.insert(insertpoint, '#')
TypeError: an integer is required
If I test it out and try 'print insertpoint' it produces the number correctly, and so my interpretation of the error is that when I use the insert() command it is reading 'insertpoint' as text rather than the variable that was just set.
Can anyone suggest what might be going wrong with this?
What happens is that str.rstrip() returns a string, but insert() expects an integer.
Solution: Convert that string into an integer:
insertpoint = int(line.rstrip('\n'))
Note: When you print insertpoint it is shown without quotes, but it is still a string. You can check this by printing its type:
print(type(insertpoint)) # <type 'str'>
It appears you might need to put int() around insertpoint:
seqlist.insert(int(insertpoint), '#')
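Putting the conversion into a small function gives something that can be tested without the two files. This is a sketch only: insert_markers is a hypothetical helper, the positions are treated as the 0-based indices that list.insert uses, and the sample sequence is made up, so the output may not line up with the exact placement the question asks for:

```python
def insert_markers(sequence, position_lines, marker="#"):
    """Insert marker at each position, using list.insert's 0-based index."""
    # int() tolerates surrounding whitespace, so "10\n" converts to 10.
    positions = [int(p) for p in position_lines if p.strip()]
    chars = list(sequence)
    # Work from the highest index down so earlier inserts do not shift later ones.
    for pos in sorted(positions, reverse=True):
        chars.insert(pos, marker)
    return "".join(chars)

print(insert_markers("ABCDEFGHIJKL", ["10\n", "7\n", "4\n"]))
```

Sorting in descending order means the result does not depend on the order of the lines in positions.txt.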

finding and replacing a string within a line with an if statement

I am trying to parse a particular text file. I open the file and, line by line, check whether a particular string is present (in the following example, the number 01 in the curly brackets), then manipulate a particular substring (forwards, backwards, or leave it the same). Here is that example, with one line arbitrarily named "go" (other lines in the full file have a similar format but contain {01}, {00}, etc.):
go = 'USC_45774-1111-0 <hkxhk> {10} ; 78'
go = go.replace(go[22:24],go[23:21:-1])
>>> go
'USC_45774-1111-0 <khxkh> {10} ; 78'
I am trying to manipulate the first "hk" (go[22:24]) by replacing it with the same letters backwards (go[23:21:-1]). What I want to see is khxhk but, as you can see, both occurrences are reversed, giving khxkh.
I am also having a problem executing the specific if statement for each line. Many lines that don't have {01} are being manipulated as if they did.
with open('c:/LG 1A.txt', 'r') as rfp:
    with open('C:/output5.txt', 'w') as wfp:
        for line in rfp.readlines():
            if "{01}" or "{-1}" in line:
                line = line.replace(line[25:27],line[26:24:-1])
                line = line.replace("<"," ")
                line = line.replace(">"," ")
                line = line.replace("x"," ")
                wfp.write(line)
            elif "{10}" or "{1-}" in line:
                line = line.replace(line[22:24],line[23:21:-1])
                line = line.replace("<"," ")
                line = line.replace(">"," ")
                line = line.replace("x"," ")
                wfp.write(line)
            elif "{11}" in line:
                line = line.replace(line[22:27],line[26:21:-1])
                line = line.replace("<"," ")
                line = line.replace(">"," ")
                line = line.replace("x"," ")
                wfp.write(line)
        wfp.close()
Am I missing something simple?
The string replace method does not replace characters by position, it replaces them by what characters they are.
>>> 'apple aardvark'.replace('a', '!')
'!pple !!rdv!rk'
So in your first case, you are telling to replace "hk" with "kh". It doesn't "know" that you want to only replace one of the occurrences; it just knows you want to replace "hk" with "kh", so it replaces all occurrences.
You can use the count argument to replace to specify that you only want to replace the first occurrence:
>>> go = 'USC_45774-1111-0 <hkxhk> {10} ; 78'
>>> go.replace(go[22:24],go[23:21:-1],1)
'USC_45774-1111-0 <khxhk> {10} ; 78'
Note, though, that this will always replace the first occurrence, not necessarily the occurrence at the position in the string you specified. In this case I guess that's what you want, but it may not work directly for other similar tasks. (That is, there is no way to use this method as-is to replace the second occurrence or the third occurrence; you can only replace the first, or the first two, or the first three, etc. To replace the second or third occurrence you'd need to do a bit more.)
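If the goal really is to change the characters at a given position, slicing avoids the search that replace performs. A minimal sketch; reverse_slice is a hypothetical helper, and the short tag string is used instead of the full line so the indices are easy to check:

```python
def reverse_slice(s, start, stop):
    """Return a copy of s with the characters in s[start:stop] reversed."""
    return s[:start] + s[start:stop][::-1] + s[stop:]

tag = "<hkxhk>"
print(reverse_slice(tag, 1, 3))   # reverses only the first "hk"
print(reverse_slice(tag, 4, 6))   # reverses only the second "hk"
```

Since strings are immutable, the function builds a new string from three slices rather than modifying the original.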
As for the second part of your question, you are misunderstanding what if "{01}" or "{-1}" in line means. It means, in layman's terms, if "{01}" or if "{-1}" in line. Since if "{01}" is always true (i.e., the string "{01}" is not a false value), the whole condition is always true. What you want is if "{01}" in line or "{-1}" in line.
I don't know what it is about Python, but your problem is one that gets posted here at least a couple times every day.
if "{01}" or "{-1}" in line:
This doesn't do what you think it does. It asks, "is "{01}" true"? Because it's a non-zero-length string, it is. Because or short-circuits, the rest of the condition is not tested because the first argument is true. Therefore the body of your if statement is always executed.
In other words, Python evaluates as if you'd written this:
if ("{01}") or ("{-1}" in line):
You want something like:
if "{01}" in line or "{-1}" in line:
Or if you have a lot of similar conditions:
if any(x in line for x in ("{01}", "{-1}")):
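The difference is easy to see by evaluating the three expressions on a line that contains neither marker. The variable names buggy, fixed and with_any are made up for this sketch:

```python
line = "USC_45774-1111-0 <hkxhk> {10} ; 78"

# `or` returns its first truthy operand, so this is just the string "{01}".
buggy = "{01}" or "{-1}" in line

# Each substring test has to be written out, or generated with any().
fixed = "{01}" in line or "{-1}" in line
with_any = any(x in line for x in ("{01}", "{-1}"))

print(repr(buggy), fixed, with_any)
```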
You can use the count argument of replace():
'USC_45774-1111-0 <hkxhk> {10} ; 78'.replace("hk","kh",1)
For your second question, you need to change the condition to:
if "{01}" in line or "{-1}" in line:
    ...

In what order does Python resolve functions? (why does string.join(lst.append('a')) fail?)

How does string.join resolve? I tried using it as below:
import string
list_of_str = ['a','b','c']
string.join(list_of_str.append('d'))
But got this error instead (exactly the same error in 2.7.2):
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.6/string.py", line 318, in join
return sep.join(words)
TypeError
The append does happen, as you can see if you try to join list_of_str again:
print string.join(list_of_str)
-->'a b c d'
here's the code from string.py (couldn't find the code for the builtin str.join() for sep):
def join(words, sep = ' '):
    """join(list [,sep]) -> string
    Return a string composed of the words in list, with
    intervening occurrences of sep. The default separator is a
    single space.
    (joinfields and join are synonymous)
    """
    return sep.join(words)
What's going on here? Is this a bug? If it's expected behavior, how does it resolve/why does it happen? I feel like I'm either about to learn something interesting about the order in which python executes its functions/methods OR I've just hit a historical quirk of Python.
Sidenote: of course it works to just do the append beforehand:
list_of_str.append('d')
print string.join(list_of_str)
-->'a b c d'
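list_of_str.append('d') does not return the new list_of_str.
The method append modifies the list in place, has no return value, and so returns None. Python evaluates the argument first, so string.join receives None instead of a list, and sep.join(None) raises the TypeError.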
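The order of evaluation can be checked step by step. This sketch uses the str.join method rather than the string.join function from the question (an assumption made so that it also runs on Python 3, where string.join no longer exists); the names result, join_failed and joined are illustrative:

```python
list_of_str = ['a', 'b', 'c']

# The argument is evaluated first: append mutates the list and returns None.
result = list_of_str.append('d')

# Joining that return value therefore means joining None, which fails.
try:
    ' '.join(result)
    join_failed = False
except TypeError:
    join_failed = True

# The list itself was updated, so joining it afterwards works.
joined = ' '.join(list_of_str)

print(result, join_failed, joined)
```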
To make it work you can do this:
>>> import string
>>> list_of_str = ['a','b','c']
>>> string.join(list_of_str + ['d'])
Although that is not very Pythonic and there is no need to import string... this way is better (note the single-space separator, which matches the default of string.join):
>>> list_of_str = ['a','b','c']
>>> ' '.join(list_of_str + ['d'])

pl/python TypeError: sequence item 21: expected string, int found

Friends: in PostgreSQL PL/Python, I am trying to do an iterative search/replace in a text block 'data'.
I use re.sub with a match pattern, which calls a function 'replace' to do the work.
The objective is to have the 'replace' function called repeatedly, as some replacements generate further 'rule' matches, which require further replacements.
All works well through many, many replacements, and I manage to trigger the second pass of the repeat loop. Then something causes the regex pattern to return an integer(?), apparently at the point where it finds no matches. I've tried testing for 'None' and '0', with no luck. Ideas?
data = (a_huge_block of_text)
# ====================== THE FUNCTION ==============
def replace(matchobj):
    tag = matchobj.group(1)
    plpy.info("-------- matchobj.group(1), tag: ", tag)
    if matchobj.group(1) != '':
        (do all the replacement work in here)
# ====================== END FUNCTION ==============
passnumber = 0
# If _any_ pattern match is found, process all of data for _all_ matches:
while re.search('(rule:[A-Za-z#]+)', data) != '':
    # BEGIN repeat loop:
    passnumber = passnumber + 1
    plpy.info(' ================================ BEGIN PASS: ', passnumber)
    data = re.sub('(rule:[A-Za-z#]+)', replace, data)
    plpy.info(' =================================== END PASS: ', passnumber)
Above code seems to be running OK, into a second iteration... then:
ERROR: TypeError: sequence item 21: expected string, int found
CONTEXT: Traceback (most recent call last):
PL/Python function "myfunction", line 201, in <module>
data = re.sub('(rule:[A-Za-z#]+)', replace, data)
PL/Python function "myfunction", line 150, in sub
PL/Python function "myfunction"
Have also tried re.search (...) != '' -- and re.search (...) != 'None' --- with same result.
I do realize I must find the syntax to represent the match object in some readable form...
The answer to this turned out to be quite simple, of course, once you know Python! (I don't!)
To initiate the repeat loop, I had been doing this test:
while re.search('(rule:[A-Za-z#]+)', data) != '':
Had also tried this one, which will also not work:
while re.search('(rule:[A-Za-z#]+)', data) != 'None':
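The None result can be trapped, of course, but the quotes are not needed: re.search returns the object None, not the string 'None', when nothing matches (and is not is the idiomatic way to test for it). It's as simple as that:
while re.search('(rule:[A-Za-z#]+)', data) is not None: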
It's all so simple, once you know!
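For anyone who wants to try the loop outside PostgreSQL, here is a sketch in plain Python. Everything specific in it is an assumption made for illustration: the EXPANSIONS table stands in for the real replacement work, the expand helper and its max_passes guard are hypothetical, and the callback is written so that it returns a string for every match:

```python
import re

RULE = re.compile(r"(rule:[A-Za-z#]+)")

# Hypothetical rule table; the first expansion produces another rule,
# which forces a second pass.
EXPANSIONS = {"rule:outer": "<rule:inner>", "rule:inner": "done"}

def replace(matchobj):
    tag = matchobj.group(1)
    # Always return a str; unknown tags are dropped so the loop can finish.
    return EXPANSIONS.get(tag, "")

def expand(data, max_passes=10):
    """Apply replace repeatedly until no rule is left (or max_passes is hit)."""
    passes = 0
    while RULE.search(data) is not None and passes < max_passes:
        data = RULE.sub(replace, data)
        passes += 1
    return data, passes

print(expand("x rule:outer y"))
```

The max_passes limit is only a safety net against rules that keep generating each other.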
