Python Regex, Get Actual Line Back - python

I am trying to regex over an entire file, however I keep ending up with a list like this:
NONE
NONE
NONE
NONE
<_sre.SRE_Match object at 0x7f89b0152db0>
NONE
<_sre.SRE_Match object at 0x7f89b0152db0>
How do I get the actual line back?
Here is my code:
import re

dictionaryFile = "file.txt"
patternMatch = re.compile(r'^(\w{6,8})(\s+)(\d+)(\s+)(.+)(\s+)(\d{1,3}\s*-\s*\d{1,3})')
with open(dictionaryFile) as file:
    for line in file:
        result = patternMatch.search(line)
        print(result)
Here is an example of the file I am regex'ing over:
HETELAVL 2 IS THERE A TELEPHONE ELSEWHERE ON 35 - 36
WHICH PEOPLE IN THIS HOUSEHOLD CAN
BE CONTACTED?
EDITED UNIVERSE: HETELHHD = 2
VALID ENTRIES
1 YES
2 NO
HEPHONEO 2 IS A TELEPHONE INTERVIEW ACCEPTABLE? 37 - 38
EDITED UNIVERSE: HETELHHD = 1 OR HETELAVL = 1
VALID ENTRIES
1 YES
2 NO
I would like to get this back:
HETELAVL 2 IS THERE A TELEPHONE ELSEWHERE ON 35 - 36
HEPHONEO 2 IS A TELEPHONE INTERVIEW ACCEPTABLE? 37 - 38

search() returns None if no position in the string matches the pattern.
Check if the result is not None and print line:
result = patternMatch.search(line)
if result is not None:
    print(line)

search() returns a match object (or None), so printing the result directly only shows the object's representation. Use result.group(0) to get the text that the pattern matched.
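Putting the two answers together, here is a minimal sketch in Python 3 syntax that keeps only the lines the pattern matches. The sample lines are copied from the question; the helper name matching_lines is made up for illustration and reads from a list rather than from file.txt.

```python
import re

# Same pattern as in the question, written as a raw string.
pattern = re.compile(r'^(\w{6,8})(\s+)(\d+)(\s+)(.+)(\s+)(\d{1,3}\s*-\s*\d{1,3})')

def matching_lines(lines):
    """Return the lines (without trailing newline) that the pattern matches."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]

# Sample lines copied from the question; a file object could be passed instead.
sample = [
    "HETELAVL 2 IS THERE A TELEPHONE ELSEWHERE ON 35 - 36\n",
    "WHICH PEOPLE IN THIS HOUSEHOLD CAN\n",
    "BE CONTACTED?\n",
    "EDITED UNIVERSE: HETELHHD = 2\n",
    "VALID ENTRIES\n",
    "1 YES\n",
    "2 NO\n",
    "HEPHONEO 2 IS A TELEPHONE INTERVIEW ACCEPTABLE? 37 - 38\n",
]

hits = matching_lines(sample)
for hit in hits:
    print(hit)
```

Printing the line itself avoids depending on group(0), which returns only the part of the line that the pattern matched.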


How to extract text between the matching pattern in python [duplicate]

This question already has answers here:
Extract Values between two strings in a text file using python
(9 answers)
Closed 3 years ago.
I am new to Python and want to use it to extract the text between two matching patterns on each line of my tab-delimited text file (mydata).
mydata.txt:
Sequence tRNA Bounds tRNA Anti Intron Bounds Cove
Name tRNA # Begin End Type Codon Begin End Score
-------- ------ ---- ------ ---- ----- ----- ---- ------
lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_73[locus_tag=SS1G_20131][db_xref=GeneID:33 1 1 73 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_72[locus_tag=SS1G_20130][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_71[locus_tag=SS1G_20129][db_xref=GeneID:33 1 1 72 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_62[locus_tag=SS1G_20127][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1
Code I tried:
lines = []  # Declare an empty list named "lines"
with open('/media/owner/c3c5fbb4-73f6-45dc-a475-988ad914056e/phasing/trna/test.txt') as input_data:
    # Skips text before the beginning of the interesting block:
    for line in input_data:
        # print(line)
        if line.strip() == "locus_tag=":  # Or whatever test is needed
            break
    # Reads text until the end of the block:
    for line in input_data:  # This keeps reading the file
        if line.strip() == "][db":
            break
        print(line)  # Line is extracted (or block_of_lines.append(line), etc.)
I want to grab texts between [locus_tag= and ][db_xre and get these as my results:
SS1G_20133
SS1G_20131
SS1G_20130
SS1G_20129
SS1G_20127
If I'm understanding correctly, this should work for a given line of your data:
data = line.split("locus_tag=")[1].split("][db_xref")[0]
The idea is to split the string on locus_tag=, take the 2nd element, then split that string on ][db_xref and take the first element.
If you want help with the outer loop it could look like:
for line in open(file_path, 'r'):
    if "locus_tag" in line:
        data = line.split("locus_tag=")[1].split("][db_xref")[0]
        print(data)
You can use re.search with positive lookbehind and positive lookahead patterns:
import re
...
for line in input_data:
    match = re.search(r'(?<=\[locus_tag=).*(?=\]\[db_xre)', line)
    if match:
        print(match.group())
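As a self-contained variation on the regex answer, the sketch below uses a capturing group instead of lookbehind and lookahead. That is a swapped-in technique, not what the answer above uses. The names LOCUS_TAG and extract_locus_tags are made up for illustration, and the sample lines are copied from the question rather than read from the file.

```python
import re

# Capture everything between "[locus_tag=" and the next "]".
LOCUS_TAG = re.compile(r'\[locus_tag=([^\]]+)\]')

def extract_locus_tags(lines):
    """Return the locus_tag value of every line that contains one."""
    tags = []
    for line in lines:
        match = LOCUS_TAG.search(line)
        if match:
            tags.append(match.group(1))
    return tags

# Header line plus two data lines copied from the question.
sample_lines = [
    "Name tRNA # Begin End Type Codon Begin End Score",
    "lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1",
    "lcl|NC_035155.1_gene_73[locus_tag=SS1G_20131][db_xref=GeneID:33 1 1 73 Pseudo ??? 0 0 -1",
]

tags = extract_locus_tags(sample_lines)
for tag in tags:
    print(tag)
```

Header lines have no locus_tag field, so they are skipped without any special handling.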

SyntaxError: unexpected EOF while parsing (using .format())

I have used format in Python many times, but this one I am having trouble.
The solution should be simple, but I'm not getting it...
Here is the code:
test_list = df.groupby(['gender', 'admitted'])['student_id'].count()
print('The quantity of female students are {}.'.format(test_list[0] + test_list[1])
The output of test_list is:
gender admitted
female False 183
True 74
male False 125
True 118
Name: student_id, dtype: int64
So, test_list[0] is 183 and test_list[1] is 74.
The result expected from print is:
The quantity of female students are 257.
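You forgot the closing ")" of the print call: the parenthesis opened by print( is never closed. Because of that, the parser reached the end of the file while still expecting more input, which raises a SyntaxError (not an EOFError, which is a different, runtime exception).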
All you need to do is change it to this:
test_list = df.groupby(['gender', 'admitted'])['student_id'].count()
print('The quantity of female students are {}.'.format(test_list[0] + test_list[1]))
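To see that the unbalanced parenthesis alone causes the error, here is a small sketch that compiles both versions of the line without running them. The counts 183 and 74 are hard-coded from the output shown in the question so that pandas is not needed, and the helper name syntax_error_of is made up for illustration. The exact message text varies between Python versions, so the sketch only checks the exception type.

```python
# The line from the question (missing ")") and the corrected line.
broken = "print('The quantity of female students are {}.'.format(183 + 74)\n"
fixed = "print('The quantity of female students are {}.'.format(183 + 74))\n"

def syntax_error_of(source):
    """Compile source without executing it; return the SyntaxError or None."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as exc:
        return exc
    return None

err = syntax_error_of(broken)
ok = syntax_error_of(fixed)

print(type(err).__name__)  # the broken line fails at compile time
print(ok)                  # the fixed line compiles cleanly
```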

Python exception "too many values to unpack" thrown when assigning a string of numbers to a dictionary

I have a function that reads a file which contains a name followed by a space, then multiple numbers, each separated by a space. I want to parse the name into one string and all the numbers into another, then put them in a dictionary (with the name as the key). I have written the following code:
def read_users (user_file):
    try:
        file_in = open(user_file)
    except:
        return None
    user_scores = {}
    for line in file_in:
        temp_lst = line.strip().split(' ', 1)
        user_scores = [temp_lst[0]] = temp_lst[1]
    return user_scores
This seems to do everything I need, but when it puts it into a dictionary it throws the exception "Too many values to unpack". I'm confused as to why this is thrown because I think I should be passing the dictionary a string with the name as the key, and a string with a bunch of numbers as the value.
If it's important the lines in the input file are formatted as follows:
Ben 1 0 2 3 4 -2 5 5 6 6 1
I have tried printing the list before I pass it to the dictionary and it appears as follows:
['Ben', '1 0 2 3 4 -1 5 5 6 6 1']
Anyone have any ideas? Thanks!
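The way you add the entry to the dictionary is not quite right. The line user_scores = [temp_lst[0]] = temp_lst[1] is a chained assignment: it rebinds user_scores to the string of numbers and then tries to unpack that string into a one-element list target, which raises "too many values to unpack" whenever the string is longer than one character. Index the dictionary with the name instead: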
user_scores[temp_lst[0]] = temp_lst[1]
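For context, a sketch of the whole function with that one-line fix applied, in Python 3 syntax. Everything beyond the fix is an assumption on my part rather than something from the question: catching IOError instead of using a bare except, closing the file with a with block, skipping lines that have no numbers, and the temporary file used for the demonstration.

```python
import os
import tempfile

def read_users(user_file):
    """Map each name to the rest of its line; return None if the file cannot be opened."""
    try:
        file_in = open(user_file)
    except IOError:
        return None
    user_scores = {}
    with file_in:  # make sure the file is closed when done
        for line in file_in:
            parts = line.strip().split(' ', 1)
            if len(parts) == 2:  # skip blank lines and names without numbers
                user_scores[parts[0]] = parts[1]
    return user_scores

# Demonstration with a temporary file holding the sample line from the question.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("Ben 1 0 2 3 4 -2 5 5 6 6 1\n")
    path = tmp.name

scores = read_users(path)
os.remove(path)
missing = read_users(path)  # the file is gone now, so this is None
print(scores)
```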

Python CSV search [duplicate]

This question already has answers here:
Reading from CSVs in Python repeatedly?
(2 answers)
Closed 6 years ago.
I am fairly new to python and have run into an issue which I believe to be strange. I am searching through the first column of a csvfile.
I am using the csv module and have some code with nested for loops. My intentions were for the loop in the middle to restart from the first row in the csvfile every time it finds a match. But instead it always starts from the last row being searched in the csv.
My code below and the results will make my problem more apparent.
number = [1,3,5,6,7,8,1234,324,5,2,35]
import csv
with open('...../Documents/pycharm/testcsv.csv', 'rb') as csvimport:
    csvfile = csv.reader(csvimport)
    for num in number:
        print 'looking for ' + str(num)
        is_in_file = False
        print 'set to false'
        for row in csvfile:
            print 'looking at value ' + row[0]
            if row[0] == str(num):
                is_in_file = True
                print 'match, set to true'
                break
        print 'test1'
        if is_in_file == False:
            print str(num) + ' not found in file!'
Here is what gets printed in the IDE:
looking for 1
set to false
looking at value a
looking at value 1
match, set to true
test1
looking for 3
set to false
NOTE: here I wish to look at the first line of the csvfile (value a). Instead it looks at the third line of the csvfile (value '').
looking at value
looking at value 1234
looking at value 7
looking at value 1
looking at value 3
match, set to true
test1
From here on out it skips my inner for loop altogether, as it has already gone through the last row:
looking for 5
set to false
looking at value 5
match, set to true
test1
looking for 6
set to false
looking at value 6
match, set to true
test1
looking for 7
set to false
looking at value 77
looking at value 23
looking at value 87
test1
7 not found in file!
looking for 8
set to false
test1
8 not found in file!
looking for 1234
set to false
test1
1234 not found in file!
looking for 324
set to false
test1
324 not found in file!
looking for 5
set to false
test1
5 not found in file!
looking for 2
set to false
test1
2 not found in file!
looking for 35
set to false
test1
35 not found in file!
Here is the csvfile
a,c,b,d,e
1,3,4,5,6
,7,7,,87
1234,1,98,7,76
7,8,90,0,8
1,3,98,0,0
3,cat,food,20,39
5,%,3,6,90
6,2,2,2,3
77,3,4,3,5
23,3,4,3,6
87,5,5,5,
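csvfile is an iterator over the open file, so it is consumed as you loop over it: each pass of the inner loop resumes where the previous one stopped, and once the last row has been read the loop body never runs again.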
Consider doing this:
csvfile = list(csv.reader(csvimport))
Then you can scan csvfile as much as you want.
However, this code is still inefficient, because it does a linear scan of the rows for every number. Consider building a dictionary keyed on the first column instead:
d = dict()
for r in csvfile:
    d[r[0]] = r[1:]
Then replace your inner loop with the following (the keys are strings, so convert num before the lookup):
if str(num) in d:
    is_in_file = True
    print("%d is in file" % num)
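Both points can be checked with a short Python 3 sketch. It reads the CSV content from the question out of an in-memory string instead of testcsv.csv, and it uses a set of first-column values rather than a dictionary, since only membership is tested; both choices are mine, as are the names CSV_TEXT, first_column and missing.

```python
import csv
import io

# CSV content copied from the question.
CSV_TEXT = """a,c,b,d,e
1,3,4,5,6
,7,7,,87
1234,1,98,7,76
7,8,90,0,8
1,3,98,0,0
3,cat,food,20,39
5,%,3,6,90
6,2,2,2,3
77,3,4,3,5
23,3,4,3,6
87,5,5,5,
"""

numbers = [1, 3, 5, 6, 7, 8, 1234, 324, 5, 2, 35]

# A csv.reader can only be walked once: the second pass yields nothing.
reader = csv.reader(io.StringIO(CSV_TEXT))
first_pass = list(reader)
second_pass = list(reader)

# Collect the first column once, then test membership per number.
first_column = {row[0] for row in first_pass if row}
missing = [n for n in numbers if str(n) not in first_column]

for n in missing:
    print(str(n) + ' not found in file!')
```

The set is built with one pass over the rows, so each lookup no longer depends on where a previous search stopped.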

Search user input from a string and print

I have a string called new_file that I read from a file with these contents:
;ASP718I
;AspA2I
;AspBHI 0 6 9 15 ...
;AspCNI
;AsuI 37 116 272 348
...
I am using name = raw_input ("enter the enzyme ")
to get data from the user and I am trying to print the corresponding fields from the above file (new_file).
For the input ;AspBHI I'd like the program to print the corresponding line from the file:
;AspBHI 0 6 9 15 ...
How can I achieve this?
This is a start:
db = dict((x.split(" ")[0], x) for x in new_file.split("\n"))
name = raw_input("enter the enzyme ")
print db[name]
Also, try to be nice next time; people might help you with more enthusiasm and even explain their approach.
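Since the answer gives no explanation, here is a hedged Python 3 sketch of the same idea: index every line by its first whitespace-separated field, then look the name up. The helper names build_index and lookup are made up for illustration, and the name is passed as an argument instead of being read with raw_input so the example runs unattended.

```python
# Sample contents copied from the question (including its literal "...").
new_file = ";ASP718I\n;AspA2I\n;AspBHI 0 6 9 15 ...\n;AspCNI\n;AsuI 37 116 272 348"

def build_index(text):
    """Map the first field of each non-empty line to the whole line."""
    index = {}
    for line in text.splitlines():
        if line.strip():
            index[line.split()[0]] = line
    return index

def lookup(index, name):
    """Return the line for name, or None if the name is not present."""
    return index.get(name)

index = build_index(new_file)
print(lookup(index, ";AspBHI"))
```

Using get() returns None for an unknown name, whereas db[name] in the answer above raises a KeyError.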
