Python functions and for loops

I'm new to Python programming and I don't seem to be getting the right behavior from a for loop.
I've got a list of IDs, and I want to iterate over a ".gtf" file (tab-separated, multi-line) and extract from it some values corresponding to those IDs.
It seems that the construction of the regex is not working correctly inside the findgtf function. From the second iteration onward, the "id" variable passed to the function is not used for the regex pattern of the "sc" variable, and consequently the pattern matching doesn't work. Do I need to reinitialize the variables "id" and/or "sc" before each iteration?
If so, could you tell me how to achieve that?
Here's the code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys, os, re

# Usage: gtf_parser_4.py [path_to_dir] [IDlist]

####### FUNCTIONS ######################################
def findgtf(id, gtf):
    id = id.strip()  # remove \n
    #print "Received Id: *"+id+"* post-stripped"
    for line in gtf:
        seq, source, feat, start, end, score, strand, frame, attribute = line.strip().split("\t")
        sc = re.search(str(id), str(attribute))
        if sc:
            print "Coord of "+id+" -> Start: "+str(start)+" End: "+str(end)
########################### MAIN #########################
# Arguments retrieval
mydir = sys.argv[1]
#print "Directory : "+mydir
IDlist = sys.argv[2]
#print "IDlist : "+IDlist
path2ID = os.path.join(mydir, IDlist)
#print "Full IdList: "+path2ID

# lines to list
IDlines = [line.rstrip('\n') for line in open(path2ID)]

# Open and read dir
for file in os.listdir(mydir):
    if file.endswith(".gtf"):
        path2file = os.path.join(mydir, file)
        #print "Full gtf : "+path2file
        gtf = open(path2file, "r")
        for id in IDlines:
            print "ID submitted to findgtf: "+id
            fg = findgtf(id, gtf)
        gtf.close()
And here are the results retrieved from the console (submitted an IDlist with 3 IDs: LX00_00030, gyrB, LX00_00065):
ID submitted to findgtf: LX00_00030
Coord of LX00_00030 -> Start: 4299 End: 5303
ID submitted to findgtf: gyrB
ID submitted to findgtf: LX00_00065
As you can see, the first ID worked correctly but the second and third do not yield any result (although they do if their order is switched in the IDlist).
Thanks in advance for your help.

Your code is not working because you are trying to repeatedly iterate over the same file object. A file object keeps track of its read position internally, so once you've read to the end, you can't read any more!
To make your code work, you need to seek back to the start of the file before iterating over it again:
for id in IDlines:
    print "ID submitted to findgtf: "+id
    gtf.seek(0)  # seek to the start of the file
    fg = findgtf(id, gtf)
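An alternative, if the .gtf files are small enough to hold in memory, is to read each file into a list once; unlike a file object, a list can be iterated any number of times without seeking. A sketch, reusing the names from the question:

with open(path2file, "r") as gtf_file:
    gtf_lines = gtf_file.readlines()  # a list of lines, reusable on every pass
for id in IDlines:
    print "ID submitted to findgtf: "+id
    fg = findgtf(id, gtf_lines)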

Related

Using python and regex to find strings and put them together to rename a .pdf file - the rename fails when using more than one group

I have several thousand PDFs which I need to rename based on their content. The layouts of the PDFs are inconsistent. To rename them I need to locate a specific string "MEMBER". I need the value after the string "MEMBER" and the values from the two lines above MEMBER, which are time and date values respectively.
So:
STUFF
STUFF
STUFF
DD/MM/YY
HH:MM:SS
MEMBER ######
STUFF
STUFF
STUFF
I have been using regex101.com and have ((.*(\n|\r|\r\n)){2})(MEMBER.\S+), which matches all of the values I need. But it puts them across four groups, with group 3 just showing a carriage return.
What I have so far looks like this:
import fitz
from os import DirEntry, curdir, chdir, getcwd, rename
from glob import glob as glob
import re

failed_pdfs = []
count = 0
pdf_regex = r'((.*(\n|\r|\r\n)){2})(MEMBER.\S+)'
text = ""
get_curr = getcwd()
directory = 'PDF_FILES'
chdir(directory)
pdf_list = glob('*.pdf')

for pdf in pdf_list:
    with fitz.open(pdf) as pdf_obj:
        for page in pdf_obj:
            text += page.get_text()
    new_file_name = re.search(pdf_regex, text).group().strip().replace(":","").replace("-","") + '.pdf'
    text = ""
    #clean_new_file_name = new_file_name.translate({(":"): None})
    print(new_file_name)
    # Tries to rename a pdf. If the filename doesn't already exist
    # then rename. If it does exist then throw an error and add to failed list
    try:
        rename(pdf, new_file_name)
    except WindowsError:
        count += 1
        failed_pdfs.append(str(count) + ' - FAILED TO RENAME: [' + pdf + " ----> " + str(new_file_name) + "]")
If I specify a group in the re.search portion, like for instance group 4, which contains the MEMBER ##### value, then the file renames successfully with just that value. Similarly, group 2 renames with the TIME value. I think the multiple lines are preventing it from using all of the values I need. When I run it with group(), the print value shows as
DATE
TIME
MEMBER ######.pdf
And the log count reflects the failures.
I am very new at this, and stumbling around trying to piece together how things work. Is the issue with how I built the regex, or with the re.search portion? Or everything?
I have tried redoing the regular expression, but I end up with multiple lines in the results, which seems connected to the rename failure.
The strategy is to read the page's text by words and sort them adequately.
If we then find "MEMBER", the word following it represents the hashes ####, and the two preceding words must be the time and the date respectively.
found = False
for page in pdf_obj:
    words = page.get_text("words", sort=True)
    # all space-delimited strings on the page, sorted vertically,
    # then horizontally
    for i, word in enumerate(words):
        if word[4] == "MEMBER":
            hashes = words[i+1][4]  # this must be the word after MEMBER!
            time_string = words[i-1][4]  # the time
            date_string = words[i-2][4]  # the date
            found = True
            break
    if found:  # no need to look in other pages
        break
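From there the pieces can be joined into the new file name. A minimal sketch, assuming the same cleanup as the original code (the exact format is an assumption, and the "/" characters in the date are stripped too, since they are not legal in Windows file names):

if found:
    # strip ":", "-" and "/" so the result is a valid Windows file name
    new_file_name = (date_string + " " + time_string + " " + hashes).replace(":", "").replace("-", "").replace("/", "") + ".pdf"
    print(new_file_name)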

Iterate previous lines after finding a pattern

I am searching for a pattern, and if I find that pattern (which can occur multiple times in a single file) then I want to iterate backwards, capture another pattern, and pick the first instance.
For Example, if content of the file is as below:
SetSearchExpr("This is the Search Spec 1");
...
ExecuteQuery (ForwardOnly);
var Rec2=FirstRecord();
if(Rec2!=null);
{
Then the expected output is:
ExecuteQuery Search Spec = "This is the Search Spec 1"
I have figured out how to check if ExecuteQuery is present, as below, but I'm unable to get the logic to iterate back. My code so far:
import sys
import os

file = open("Sample_code.txt", 'r')
for line in file:
    if "ExecuteQuery (" in line:
        pass  # if found then check previous lines for another pattern
If anyone could help me with a steer, it would be of great help.
No need to go backwards. Just save the SetSearchExpr() line in a variable and use that when you find ExecuteQuery():
import re  # needed for re.match

for line in file:
    if 'SetSearchExpr(' in line:
        search_line = line
    elif 'ExecuteQuery (' in line:
        m = re.match(r'SetSearchExpr\((".*")\)', search_line)
        search_spec = m.group(1)
        print(f'ExecuteQuery Search Spec = {search_spec}')
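Wrapped up with the file handling from the question, a minimal sketch might look like this (initialising search_line guards against an ExecuteQuery with no preceding SetSearchExpr, and re.search is used in case the line has leading whitespace):

import re

search_line = None
with open("Sample_code.txt", 'r') as file:
    for line in file:
        if 'SetSearchExpr(' in line:
            search_line = line
        elif 'ExecuteQuery (' in line and search_line is not None:
            # re.search tolerates leading whitespace before SetSearchExpr
            m = re.search(r'SetSearchExpr\((".*")\)', search_line)
            if m:
                print(f'ExecuteQuery Search Spec = {m.group(1)}')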

Categorising data according to one column in python

Hi, I have a dataset as follows, e.g.:
sample pos mutation
2fec2 40 TC
1f3c 40 TC
19b0 40 TC
tld3 60 CG
I want to be able to find a Python way to, for example, find every instance where 2fec2 and 1f3c have the same mutation and print the code. So far I have tried the following, but it simply returns everything. I am completely new to Python and trying to wean myself off awk - please help!
from sys import argv
script, vcf_file = argv
import vcf

vcf_reader = vcf.Reader(open(vcf_file, 'r'))
for record.affected_start in vcf_reader: # .affected_start is this module's way of calling data from the parsed pos column of a particular type of bioinformatics file
    if record.sample == 2fec2 and 1f3c != 19b0 !=t1d3: # ditto .sample
        print record.affected_start
I'm assuming your data is in the format you describe and not VCF.
You can try to simply parse the file with standard python techniques and for each (pos, mutation) pair, build the set of samples having it as follows:
from sys import argv
from collections import defaultdict

# More convenient than a normal dict: an empty set will be
# automatically created whenever a new key is accessed
# keys will be (pos, mutation) pairs
# values will be sets of sample names
mutation_dict = defaultdict(set)

# This "with" syntax ("context manager") is recommended
# because file closing will be handled automatically
with open(argv[1], "r") as data_file:
    # Read first line and check headers
    # (assert <something False>, "message"
    # will make the program exit and display "message")
    assert data_file.readline().strip().split() == ["sample", "pos", "mutation"], "Unexpected column names"
    # .strip() removes end-of-line character
    # .split() splits into a list of words
    # (by default using "blanks" as separator)

    # .readline() has "consumed" a first line.
    # Now we can loop over the rest of the lines
    # that should contain the data
    for line in data_file:
        # Extract the fields
        [sample, pos, mutation] = line.strip().split()
        # add the sample to the set of samples having
        # this (pos, mutation) combination
        mutation_dict[(pos, mutation)].add(sample)

# Now loop over the key, value pairs in our dict:
for (pos, mutation), samples in mutation_dict.items():
    # True if both samples are in the set (<= is the subset test)
    if {"2fec2", "1f3c"} <= samples:
        print("2fec2 and 1f3c share mutation %s at position %s" % (mutation, pos))
With your example data as first argument of the script, this outputs:
2fec2 and 1f3c share mutation TC at position 40
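The same dictionary makes it easy to go beyond the two hard-coded samples; for example, a hypothetical extension that lists every group of samples sharing a mutation:

for (pos, mutation), samples in mutation_dict.items():
    if len(samples) > 1:
        print("%s share mutation %s at position %s"
              % (", ".join(sorted(samples)), mutation, pos))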
How about this:
from sys import argv
script, vcf_file = argv
import vcf

vcf_reader = vcf.Reader(open(vcf_file, 'r'))

# Store our results outside of the loop
fecResult = ""
f3cResult = ""

# For each record
for record in vcf_reader:
    if record.sample == "2fec2":
        fecResult = record.mutation
    if record.sample == "1f3c":
        f3cResult = record.mutation

# Outside of the loop compare the results and if they match print the record.
if fecResult == f3cResult:
    print record.affected_start

Can't get unique word/phrase counter to work - Python

I'm having trouble getting anything to write to my output file (word_count.txt).
I expect the script to review all 500 phrases in my phrases.txt document and output a list of all the words and how many times they appear.
from re import findall,sub
from os import listdir
from collections import Counter

# path to folder containing all the files
str_dir_folder = '../data'

# name and location of output file
str_output_file = '../data/word_count.txt'

# the list where all the words will be placed
list_file_data = '../data/phrases.txt'

# loop through all the files in the directory
for str_each_file in listdir(str_dir_folder):
    if str_each_file.endswith('data'):
        # open file and read
        with open(str_dir_folder+str_each_file,'r') as file_r_data:
            str_file_data = file_r_data.read()
        # add data to list
        list_file_data.append(str_file_data)

# clean all the data so that we don't have all the nasty bits in it
str_full_data = ' '.join(list_file_data)
str_clean1 = sub('t','',str_full_data)
str_clean_data = sub('n',' ',str_clean1)

# find all the words and put them into a list
list_all_words = findall('w+',str_clean_data)

# dictionary with all the times a word has been used
dict_word_count = Counter(list_all_words)

# put data in a list, ready for output file
list_output_data = []
for str_each_item in dict_word_count:
    str_word = str_each_item
    int_freq = dict_word_count[str_each_item]
    str_out_line = '"%s",%d' % (str_word,int_freq)
    # populates output list
    list_output_data.append(str_out_line)

# create output file, write data, close it
file_w_output = open(str_output_file,'w')
file_w_output.write('n'.join(list_output_data))
file_w_output.close()
Any help would be great (especially if I'm able to actually output 'single' words within the output list).
Thanks very much.
It would be helpful if we got more information, such as what you've tried and what sorts of error messages you received. As kaveh commented above, this code has some major indentation issues. Once I got around those, there were a number of other logic errors to work through. I've made some assumptions:

- list_file_data is assigned to '../data/phrases.txt', but there is then a loop through all files in a directory. Since you don't have any handling for multiple files elsewhere, I've removed that logic and referenced the file listed in list_file_data (and added a small bit of error handling). If you do want to walk through a directory, I'd suggest using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm).
- You named your file 'phrases.txt' but then check for files that end with 'data'. I've removed this logic.
- You've placed the data set into a list when findall works just fine with strings and ignores the special characters that you've manually removed. Test at https://regex101.com/ to make sure.
- Changed 'w+' to '\w+' - check out the above link.
- Converting to a list outside of the output loop isn't necessary - your dict_word_count is a Counter object, which has an 'iteritems' method to roll through each key and value. Also changed the variable name to 'counter_word_count' to be slightly more accurate.
- Instead of manually generating CSVs, I've imported csv and utilized the writerow method (and quoting options).

Code below, hope this helps:
import csv
import os
from collections import Counter
from re import findall,sub

# name and location of output file
str_output_file = '../data/word_count.txt'

# the file where all the words will be placed
list_file_data = '../data/phrases.txt'

if not os.path.exists(list_file_data):
    raise OSError('File {} does not exist.'.format(list_file_data))

with open(list_file_data, 'r') as file_r_data:
    str_file_data = file_r_data.read()

# find all the words and put them into a list
list_all_words = findall('\w+', str_file_data)

# dictionary with all the times a word has been used
counter_word_count = Counter(list_all_words)

with open(str_output_file, 'w') as output_file:
    fieldnames = ['word', 'freq']
    writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
    writer.writerow(fieldnames)
    for key, value in counter_word_count.iteritems():
        output_row = [key, value]
        writer.writerow(output_row)
Something like this?
from collections import Counter
from glob import glob

def extract_words_from_line(s):
    # make this as complicated as you want for extracting words from a line
    return s.strip().split()

tally = sum(
    (Counter(extract_words_from_line(line))
     for infile in glob('../data/*.data')
     for line in open(infile)),
    Counter())

for k in sorted(tally, key=tally.get, reverse=True):
    print k, tally[k]
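The sum() call starts from an empty Counter() and adds one Counter per line. Written as an explicit loop, an equivalent (and arguably easier to read) version would be:

tally = Counter()
for infile in glob('../data/*.data'):
    for line in open(infile):
        # Counter.update adds the counts of the given words to the tally
        tally.update(extract_words_from_line(line))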

Unable to create a dictionary

I am trying to write a script for log parsing.
I've got a file in which logs are jumbled up. The first line of each log set has a timestamp, so I want to sort the sets using that.
For example:
10:48 Start
.
.
10:50 start
.
.
10:42 start
The first line of each set contains the keyword 'Start' and the timestamp. The lines between one 'Start' and the next 'start' are one set. I want to sort all of these sets in the log file based on their timestamps.
Code logic:
I thought of creating a dictionary where I pick this time and assign it as the 'key', with the text of that log set as the 'value'. Then I will sort the 'keys' of the dictionary and print their 'values' in that sorted order to a file.
However I am getting the error "TypeError: unhashable type: 'list'".
write1 = False
x = 0
search3 = "start"
matched = dict()
matched = {}

# fo is a list which is defined elsewhere in the code.
for line in fo:
    if search3 in line:
        # got the Hello2 printed which indicates script enters this loop
        print('hello2')
        write1 = True
        x += 1
        time = line.split()[3]
        name1 = [time.split(':')[0] + ":" + time.split(':')[1] + ":" + time.split(':')[2]]
        matched[name1] = line
    elif write1:
        matched[name1] += line

print(matched.keys())
Please let me know if my logic and the way I am doing this are correct.
You set name1 as a list. Lists aren't hashable, only tuples are. However, I assume that you want name1 to be a string, so you just need to remove the brackets:
name1 = time.split(':')[0] +":"+time.split(':')[1] + ":"+ time.split(':')[2]
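With that fixed, the sorting step described in the question is straightforward. A minimal sketch, assuming zero-padded HH:MM:SS timestamps (so lexicographic order matches chronological order); the output file name is just an example:

# a simpler way to build the key:
name1 = ":".join(time.split(':')[:3])

# after the loop over fo has finished, write the sets in timestamp order
with open("sorted_logs.txt", "w") as out:
    for key in sorted(matched):
        out.write(matched[key])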
