Get certain files from list by pattern - python

I have the list below. I want to write a for loop that passes combinations of these file names to a function.
I am not sure how to build the combinations so that each 'check' gets its own matching files.
Without the loop, the function call would look like this:
erase('check3_dwg_Polyline','check3_dwg_Polyline_feat_to_polyg_feat_to_line','output_name')
What I've tried:
Here's the list.
li=['check3_dwg_Polyline', 'check2_dwg_Polyline',
'check3_dwg_Polyline_feat_to_polyg',# this will not be needed to extracted
'check2_dwg_Polyline_feat_to_polyg',# >> >>
'check3_dwg_Polyline_feat_to_polyg_feat_to_line',
'check2_dwg_Polyline_feat_to_polyg_feat_to_line']
I start with this:
a=[li[i:i+3] for i in range(0, len(li), 3)]
which returns:
[['check3_dwg_Polyline',
'check2_dwg_Polyline',
'check3_dwg_Polyline_feat_to_polyg'],
['check2_dwg_Polyline_feat_to_polyg',
'check3_dwg_Polyline_feat_to_polyg_feat_to_line',
'check2_dwg_Polyline_feat_to_polyg_feat_to_line']]
Finally:
for base, base_f, base_line in a:
    print(base, base_line, base + "_output")
gives:
check3_dwg_Polyline check3_dwg_Polyline_feat_to_polyg check3_dwg_Polyline_output
check2_dwg_Polyline_feat_to_polyg check2_dwg_Polyline_feat_to_polyg_feat_to_line check2_dwg_Polyline_feat_to_polyg_output
Other method:
base = [f for f in li if not f.endswith(("_polyg", "_to_line"))]
base_f = {f.strip("_feat_to_polyg"): f for f in li if f.endswith("_polyg")}
base_line = {f.strip("_feat_to_polyg_feat_to_line"): f for f in li if f.endswith("_to_line")}
[(b, base_f[b], base_line[b]) for b in base]
gives:
KeyError: 'check3_dwg_Polyline'
I have tried sorting the list but it just ruins it in a different way when put through the processes mentioned above.
Ideally, this loop:
for base, base_f, base_line in a:
    print(base, base_line, base + "_output")
would print:
check3_dwg_Polyline check3_dwg_Polyline_feat_to_polyg_feat_to_line check3_dwg_Polyline_output
check2_dwg_Polyline check2_dwg_Polyline_feat_to_polyg_feat_to_line check2_dwg_Polyline_output
and each line would then be passed to the function like this:
erase('check3_dwg_Polyline','check3_dwg_Polyline_feat_to_polyg_feat_to_line','output_name')

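Split the list into three equal chunks (the base names, the _feat_to_polyg names and the _feat_to_line names) and zip them, so that each tuple holds the files for one check (check3, check2…). Then you can do your for loop. This relies on every chunk listing the checks in the same order.
n = len(li) // 3
a = zip(*[li[i:i+n] for i in range(0, len(li), n)])
pprint(list(a)) would output: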
[('check3_dwg_Polyline',
'check3_dwg_Polyline_feat_to_polyg',
'check3_dwg_Polyline_feat_to_polyg_feat_to_line'),
('check2_dwg_Polyline',
'check2_dwg_Polyline_feat_to_polyg',
'check2_dwg_Polyline_feat_to_polyg_feat_to_line')]
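A minimal sketch of how the zipped triples could feed the loop from the question. It also shows an order-independent alternative that looks up each `_feat_to_line` name by its base name. The names `LINE_SUFFIX`, `line_by_base` and `erase_args` are illustrative assumptions, and `erase` itself is not called because its implementation is not shown in the question.

```python
li = ['check3_dwg_Polyline', 'check2_dwg_Polyline',
      'check3_dwg_Polyline_feat_to_polyg',
      'check2_dwg_Polyline_feat_to_polyg',
      'check3_dwg_Polyline_feat_to_polyg_feat_to_line',
      'check2_dwg_Polyline_feat_to_polyg_feat_to_line']

# Positional approach from the answer: three chunks, transposed with zip.
n = len(li) // 3
triples = list(zip(*[li[i:i + n] for i in range(0, len(li), n)]))

# Order-independent alternative (assumption: every line file ends with LINE_SUFFIX).
LINE_SUFFIX = "_feat_to_polyg_feat_to_line"
line_by_base = {f[:-len(LINE_SUFFIX)]: f for f in li if f.endswith(LINE_SUFFIX)}
bases = [f for f in li if not f.endswith(("_polyg", "_to_line"))]

# One (input, line file, output name) tuple per check, ready for erase(*args).
erase_args = [(b, line_by_base[b], b + "_output") for b in bases]
for args in erase_args:
    print(*args)
```

The suffix is cut off by slicing rather than with `str.strip`, because `strip` removes any of the given characters from both ends of the string instead of a literal suffix. That behaviour is the likely cause of the KeyError in the question's second attempt.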

Related

Counting the command line arguments and removing the not needed one in python

I want to write Python code that will be run with the following command:
python3 myProgram.py 4 A B C D stemfile
Here 4 is the number of files and A, B, C, D are the 4 files. I want to generate all combinations of A, B, C, D except the empty one (A, B, C, D, AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD). Before that, the program reads stemfile.names. Only if stemfile.names contains the line | Final Pseudo Deletion Count is 0. does it generate the 15 combinations above. Otherwise it reports noisy data, ignores D, and prints only the combinations of the remaining 3 files, so the output is: (A, B, C, AB, AC, BC, ABC)
In my code I always assumed D was the last file argument and ran the loop one time fewer. But D will not always be the last argument. The command can also be: python3 myProgram.py 4 B D C A stemfile
In this case my code leaves A out of the combinations. What I want is to remove the D file whenever that line is not found in stemfile.names, wherever D appears in the arguments. How should I do that?
Later in the code, when the combination is just A, it stores A in a separate output file; when it is AB, it stores the union of the A and B files in a separate file, and so on for all the combinations. Here too, if the data is noisy, the D file must not end up in any of the output files.
One more example: if I run python3 myProgram.py 3 A D B stemfile
and stemfile.names doesn't have the line | Final Pseudo Deletion Count is 0. then the output combinations are A, B, AB and it will create 2 output files only.
My code is below:
import sys
import itertools
from itertools import combinations
def union(files):
    lines = set()
    for file in files:
        with open(file) as fin:
            lines.update(fin.readlines())
    return lines
def main():
    number = int(sys.argv[1])
    dataset = sys.argv[number+2]
    with open(dataset+'.names') as myfile:
        if '| Final Pseudo Deletion Count is 0.' in myfile.read():
            a_list = sys.argv[2:number+2]
            print("All possible combinations:\n")
            for L in range(1, len(a_list)+1):
                for subset in itertools.combinations(a_list, L):
                    print(*list(subset), sep=',')
            print("...............................")
            matrix = [itertools.combinations(a_list, r)
                      for r in range(1, len(a_list) + 1)]
            combinations = [c for combinations in matrix for c in combinations]
            for combination in combinations:
                filenames = [f'{name}' for name in combination]
                output = f'{"".join(combination)}_output'
                print(f'Writing union of {filenames} to {output}')
                with open(output, 'w') as fout:
                    fout.writelines(union(filenames))
        else:
            a_list = sys.argv[2:number+1]
            # Here I am reducing a number only
            print("Noisy data.\n")
            print("So all possible combinations:\n")
            for L in range(1, len(a_list)+1):
                for subset in itertools.combinations(a_list, L):
                    print(*list(subset), sep=',')
            print("................................")
            matrix = [itertools.combinations(a_list, r)
                      for r in range(1, len(a_list) + 1)]
            combinations = [c for combinations in matrix for c in combinations]
            for combination in combinations:
                filenames = [f'{name}' for name in combination]
                output = f'{"".join(combination)}_output'
                print(f'Writing union of {filenames} to {output}')
                with open(output, 'w') as fout:
                    fout.writelines(union(filenames))
if __name__ == '__main__':
    main()
Please help me out.
I think you should probably break this down into smaller, more specific questions. It seems like there is a lot of detail here that's not focused on the specific problem you're facing. I took a shot at what I think you're asking, however.
I think you're trying to figure out how to remove an item from the command line arguments. If that's the case, there's nothing you can do about what's passed to the program, but you can modify the list of inputs after you parse. I really think you should try reading about the argparse library, as I stated in my comment. I'm not sure if it's exactly what you're looking for, but here's some code using argparse that expects full filenames for each input file. The last argument must be the stemfile.
Once the arguments are parsed, you have a list of pathlib.Path objects. You can simply remove the D file from the list, wherever it appears.
import argparse
import itertools
import pathlib
# If this line appears in the stemfile, the data is clean and D is kept.
CLEAN_DATA_LINE = '| Final Pseudo Deletion Count is 0.'
def get_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('filenames', type=pathlib.Path, nargs='+')
    parser.add_argument('stemfile', type=pathlib.Path)
    return parser
def union(files):
    lines = set()
    for file in files:
        with open(file) as fin:
            lines.update(fin.readlines())
    return lines
def main():
    parser = get_parser()
    args = parser.parse_args()
    if CLEAN_DATA_LINE in args.stemfile.read_text():
        filenames = args.filenames
    else:
        # Noisy data: drop the D file, whatever its position in the arguments.
        filenames = [p for p in args.filenames if p.stem != 'D']
    matrix = [itertools.combinations(filenames, r) for r in range(1, len(filenames) + 1)]
    combinations = [c for combinations in matrix for c in combinations]
    print(' '.join([str([p.stem for p in c]) for c in combinations]))
    for combination in combinations:
        output = f'{"".join([p.stem for p in combination])}_output.txt'
        print(f'Writing union of {[p.stem for p in combination]} to {output}')
        with open(output, 'w') as fout:
            # Write the union of this combination only, not of every input file.
            fout.writelines(union(combination))
if __name__ == '__main__':
    main()
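The filtering step can also be tried on its own, without argparse or any files. The sketch below assumes the file arguments are plain strings and that the file to drop is literally named "D". The helper names `usable_files` and `non_empty_combinations` are hypothetical.

```python
from itertools import combinations

def usable_files(files, data_is_clean, excluded="D"):
    """Return all files for clean data, otherwise every file except `excluded`."""
    if data_is_clean:
        return list(files)
    return [f for f in files if f != excluded]

def non_empty_combinations(files):
    """All combinations of 1..len(files) items, shortest first."""
    return [c for r in range(1, len(files) + 1) for c in combinations(files, r)]

# D sits in the middle of the arguments, and the data is noisy.
files = usable_files(["B", "D", "C", "A"], data_is_clean=False)
for combo in non_empty_combinations(files):
    print(",".join(combo))
```

Filtering by name rather than by position is what makes the order of the command-line arguments irrelevant.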

Python: list index out of range - while/for loop

I have a list
abc = ['date1','sentence1','date2','sentence2'...]
I want to do sentiment analysis on the sentences. After that I want to store the results in a list that looks like:
xyz =[['date1','sentence1','sentiment1'],['date2','sentence2','sentiment2']...]
For this I have tried following code:
def result(doc):
    x = 2
    i = 3
    for lijn in doc:
        sentiment = classifier.classify(word_feats_test(doc[i]))
        xyz.extend([doc[x], doc[i], sentiment])
        x = x + 2
        i = i + 2
The len(abc) is about 7500. I start out with x as 2 and i as 3, as I don't want to use the first two elements of the list.
I keep on getting the error 'list index out of range', no matter what I try (while, for loops...)
Can anybody help me out? Thank you!
As the comments mentioned, we can't help you find the error in your code without a stack trace. But it is easy to solve your problem like this:
xyz = []
def result(abc):
    for item in xrange(0, len(abc), 2):  # replace xrange with range in Python 3
        # sentiment = classifier.classify(word_feats_test(abc[item + 1]))
        sentiment = "sentiment" + str(item // 2 + 1)
        xyz.append([abc[item], abc[item + 1], sentiment])
You might want to read about the built-in functions that make a programmer's life easy. (Why worry about incrementing when range already does that?)
#output
[['date1', 'sentence1', 'sentiment1'],
['date2', 'sentence2', 'sentiment2'],
['date3', 'sentence3', 'sentiment3'],
['date4', 'sentence4', 'sentiment4'],
['date5', 'sentence5', 'sentiment5']]
Try this:
for i in xrange(0, len(doc) - 1, 2):  # step by 2; use range in Python 3
    date = doc[i]
    sentence = doc[i + 1]
    sentiment = classifier.classify(word_feats_test(sentence))
    xyz.append([date, sentence, sentiment])
You only need one index. The important thing is knowing when to stop: the loop ends before len(doc) - 1, so doc[i + 1] always exists.
Also, check out the difference between extend and append.
Finally, I would suggest you store your data as a list of dictionaries rather than a list of lists. That lets you access the items by field name rather than by index, which makes for cleaner code.
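A short sketch of the list-of-dictionaries idea, written for Python 3. The real `classifier` and `word_feats_test` are not shown in the question, so `classify_stub` is a made-up stand-in, and `build_records` is a hypothetical helper name.

```python
def classify_stub(sentence):
    """Placeholder for classifier.classify(word_feats_test(sentence))."""
    return "pos" if "good" in sentence else "neg"

def build_records(doc, classify, skip=0):
    """Pair up date/sentence items, starting at index `skip`."""
    records = []
    # Stop before len(doc) - 1 so that doc[i + 1] always exists.
    for i in range(skip, len(doc) - 1, 2):
        date, sentence = doc[i], doc[i + 1]
        records.append({"date": date,
                        "sentence": sentence,
                        "sentiment": classify(sentence)})
    return records

abc = ["ignore", "ignore", "date1", "good day", "date2", "bad day"]
for record in build_records(abc, classify_stub, skip=2):
    print(record["date"], record["sentiment"])
```

Each record is then read by field name, for example `record["sentiment"]`, instead of by position.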
If you want two elements from your list at a time, you can use a generator and then pass the elements to your classifier:
abc = ["ignore","ignore",'date1','sentence1','date2','sentence2']
from itertools import islice
def iter_doc(doc, skip=0):
    it = iter(doc)
    if skip:  # if skip is set, start from doc[skip:]
        it = islice(it, skip, None)
    date, sent = next(it, ""), next(it, "")
    while date and sent:
        yield date, sent
        date, sent = next(it, ""), next(it, "")
for d, sen in iter_doc(abc, 2):  # skip is set to 2 so we ignore the first two elements
    print(d, sen)
This prints:
date1 sentence1
date2 sentence2
So to create your list of lists xyz you can use a list comprehension:
xyz = [[d, sen, classifier.classify(word_feats_test(sen))] for d, sen in iter_doc(abc, 2)]
It's simple. You can try this:
>>> abc = ['date1','sentence1','date2','sentence2'...]
>>> xyz = [[abc[i], abc[i+1], "sentiment" + str(i // 2 + 1)] for i in range(0, len(abc), 2)]
>>> xyz
output : [['date1', 'sentence1', 'sentiment1'], ['date2', 'sentence2', 'sentiment2'], .....]

How to perform summation for regular expression match results in python

I'm trying to extract certain numbers from multiple files and sum the extracted numbers. Here is what I have written so far:
import re, os
path = "F:/s"
in_files = os.listdir(path)
for g in in_files:
    file = os.path.join(path, g)
    text = open(file, "r")
    a = text.readlines()
    b = a[6]
    m = re.search('\t(.+?)\n', b)
    if m:
        found = m.group()
        print (found)
The extraction works, and I get results like this:
122
74
97
Now I want to sum all these numbers.
Let's do it using re.findall():
count = 0  # initialise once, before the loop over the files
for number in re.findall('\t(.+?)\n', b):
    count += int(number.strip())  # add each extracted number to the running total
You can create an empty list above your loop and instead of printing, just append found to that list. You can then sum the contents of that list (if everything goes well you should end up with a list of 'strings of integers').
import re, os
path = "F:/s"
in_files = os.listdir(path)
l = []
for g in in_files:
    ...
    ...
    if m:
        found = m.group()
        l.append(found)
Your list should look like this now: ['122', '74', '97'],
so you can use map() and sum() to find the total (outside the loop):
print(sum(map(int, l)))  # 293
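To try the idea without touching the file system, here is a small self-contained sketch. The sample lines are invented for illustration and `extract_number` is a hypothetical helper. It uses `m.group(1)` so that only the captured digits are converted, not the surrounding tab and newline.

```python
import re

def extract_number(line):
    """Return the integer between the first tab and the newline, or None."""
    m = re.search(r"\t(.+?)\n", line)
    return int(m.group(1)) if m else None

# Made-up stand-ins for line 7 (a[6]) of each file.
lines = ["count\t122\n", "count\t74\n", "count\t97\n", "no number here\n"]

numbers = [n for n in map(extract_number, lines) if n is not None]
total = sum(numbers)
print(total)  # 293
```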

Create multiple dictionaries from a single iterator in nested for loops

I have a nested list comprehension which has created a list of six lists of ~29,000 items. I'm trying to parse this list of final data, and create six separate dictionaries from it. Right now the code is very unpythonic, I need the right statement to properly accomplish the following:
1.) Create six dictionaries from a single statement.
2.) Scale to any length list, i.e., not hardcoding a counter shown as is.
I've run into multiple issues, and have tried the following:
1.) Using while loops
2.) Using break statements, will break out of the inner most loop, but then does not properly create other dictionaries. Also break statements set by a binary switch.
3.) if, else conditions for n number of indices, indices iterate from 1-29,000, then repeat.
Note the ellipses designate code omitted for brevity.
# Parse csv files for samples, creating a dictionary of key, value pairs and multiple lists.
with open('genes_1') as f:
    cread_1 = list(csv.reader(f, delimiter = '\t'))
sample_1_values = [j for i, j in (sorted([x for x in {i: float(j)
    for i, j in cread_1}.items()], key = lambda v: v[1]))]
sample_1_genes = [i for i, j in (sorted([x for x in {i: float(j)
    for i, j in cread_1}.items()], key = lambda v: v[1]))]
...
# Compute row means.
mean_values = []
for i, (a, b, c, d, e, f) in enumerate(zip(sample_1_values, sample_2_values, sample_3_values, sample_4_values, sample_5_values, sample_6_values)):
    mean_values.append((a + b + c + d + e + f)/6)
# Provide proper gene names for mean values and replace original data values by corresponding means.
sample_genes_list = [i for i in sample_1_genes, sample_2_genes, sample_3_genes, sample_4_genes, sample_5_genes, sample_6_genes]
sample_final_list = [sorted(zip(sg, mean_values)) for sg in sample_genes_list]
# Create multiple dictionaries from normalized values for each dataset.
class BreakIt(Exception): pass
try:
    count = 1
    for index, items in enumerate(sample_final_list):
        sample_1_dict_normalized = {}
        for index, (genes, values) in enumerate(items):
            sample_1_dict_normalized[genes] = values
            count = count + 1
            if count == 29595:
                raise BreakIt
except BreakIt:
    pass
...
try:
    count = 1
    for index, items in enumerate(sample_final_list):
        sample_6_dict_normalized = {}
        for index, (genes, values) in enumerate(items):
            if count > 147975:
                sample_6_dict_normalized[genes] = values
            count = count + 1
            if count == 177570:
                raise BreakIt
except BreakIt:
    pass
# Pull expression values to qualify overexpressed proteins.
print 'ERG values:'
print 'Sample 1:', round(sample_1_dict_normalized.get('ERG'), 3)
print 'Sample 6:', round(sample_6_dict_normalized.get('ERG'), 3)
Your code is too long for me to give an exact answer, so I will answer very generally.
First, you are using enumerate for no reason. If you don't need both the index and the value, you probably don't need enumerate.
This part:
with open('genes.csv') as f:
    cread_1 = list(csv.reader(f, delimiter = '\t'))
sample_1_dict = {i: float(j) for i, j in cread_1}
sample_1_list = [x for x in sample_1_dict.items()]
sample_1_values_sorted = sorted(sample_1_list, key=lambda expvalues: expvalues[1])
sample_1_genes = [i for i, j in sample_1_values_sorted]
sample_1_values = [j for i, j in sample_1_values_sorted]
sample_1_graph_raw = [float(j) for i, j in cread_1]
should be (a) using a list named samples and (b) much shorter, since you don't really need to extract all this information from sample_1_dict and move it around right now. It can be something like:
samples = [None] * 6
for k in range(6):
    with open('genes.csv') as f: #but something specific to k
        cread = list(csv.reader(f, delimiter = '\t'))
    samples[k] = {i: float(j) for i, j in cread}
After that, calculating the sum and mean will be way more natural.
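As a rough illustration of that loop, the sketch below reads each sample into its own dictionary and then computes a per-gene mean. The in-memory `io.StringIO` objects are invented stand-ins for the tab-separated files, and `load_sample` is a hypothetical helper name.

```python
import csv
import io

def load_sample(handle):
    """Read gene<TAB>value rows from an open file into a {gene: float} dict."""
    return {gene: float(value) for gene, value in csv.reader(handle, delimiter="\t")}

# Hypothetical stand-ins for the real files (two samples instead of six).
fake_files = [io.StringIO("ERG\t2.0\nTP53\t4.0\n"),
              io.StringIO("ERG\t6.0\nTP53\t1.0\n")]

samples = [load_sample(f) for f in fake_files]

# With one dict per sample, a per-gene mean is a single comprehension.
gene_means = {gene: sum(s[gene] for s in samples) / len(samples)
              for gene in samples[0]}
print(gene_means)
```

The number of samples is taken from the length of the list, so nothing needs to be hardcoded.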
In this part:
class BreakIt(Exception): pass
try:
    count = 1
    for index, items in enumerate(sample_final_list):
        sample_1_dict_normalized = {}
        for index, (genes, values) in enumerate(items):
            sample_1_dict_normalized[genes] = values
            count = count + 1
            if count == 29595:
                raise BreakIt
except BreakIt:
    pass
you should be (a) iterating over the samples list mentioned earlier, and (b) not using count at all, since you can iterate naturally over samples or over the items of samples[i].
Your code has several problems. You should put your code in functions that preferably do one thing each. Then you can call a function for each sample without repeating the same code six times (I assume that is what the ellipsis is hiding). Give each function a self-describing name and a docstring that explains what it does. There is quite a bit of unnecessary code. Some of this might become obvious once you have it in functions. Since functions take arguments, you can pass in your 29595, for example.
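Putting these suggestions together, one possible sketch replaces the counters and the BreakIt exception with a single function. It assumes every sample is a dict with the same number of genes and follows the question's steps: rank each sample by value, average across samples at each rank, and give each gene the mean for its rank. The function name and the tiny data set are made up for illustration.

```python
def normalize(samples):
    """Return one {gene: rank mean} dict per input sample."""
    # Genes of each sample, ordered by that sample's own values.
    ranked_genes = [sorted(s, key=s.get) for s in samples]
    # Values of each sample in ascending order.
    ranked_values = [sorted(s.values()) for s in samples]
    # Mean across samples at each rank (the "row means" in the question).
    mean_values = [sum(row) / len(row) for row in zip(*ranked_values)]
    # One dictionary per sample, built in a single statement.
    return [dict(zip(genes, mean_values)) for genes in ranked_genes]

samples = [{"ERG": 2.0, "TP53": 4.0},
           {"ERG": 6.0, "TP53": 1.0}]
normalized = normalize(samples)
for k, sample in enumerate(normalized, start=1):
    print("Sample", k, round(sample["ERG"], 3))
```

Because the result is a list, it scales to any number of samples and any number of genes without hardcoded limits such as 29595.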

writing a range output lists into a text file or four array

I have a function and its output is a selection of lists [a,b,c,d] [a,b,c,d] [a,b,c,d] [a,b,c,d]
and I want [a,a,a,a] [b,b,b,b] [c,c,c,c] [d,d,d,d]
def meanarr(image, res=None):
    "construct code which runs over a single ccd to get the means"
    a = pyfits.getdata(image).MAG_AUTO
    q = numpy.mean(a)
    s = pyfits.getdata(image).X2WIN_IMAGE
    j = numpy.mean(s)
    f = pyfits.getdata(image).Y2WIN_IMAGE
    z = numpy.mean(f)
    g = pyfits.getdata(image).XYWIN_IMAGE
    h = abs(numpy.mean(g))
    a = [q, j, z, h]
    print a
    s0 = ''
    return res
for arg in sys.argv[1:]:
    #print arg
    s = meanarr(arg)
This is my function and program. How would I get the code to collect all of the q's in one list, and all of the j's, z's and h's in their own lists? I know I could separate the function into four different functions, but that still doesn't return my results in a list; it just outputs them individually.
You might be looking for zip. Try this:
data = [['a','b','c','d'], ['a','b','c','d'], ['a','b','c','d'], ['a','b','c','d']]
print data
print zip(*data)
You can write it this way:
def meanarr(image, res=None):
    "construct code which runs over a single ccd to get the means"
    a=pyfits.getdata(image).MAG_AUTO
    q=numpy.mean(a)
    s=pyfits.getdata(image).X2WIN_IMAGE
    j=numpy.mean(s)
    f=pyfits.getdata(image).Y2WIN_IMAGE
    z=numpy.mean(f)
    g=pyfits.getdata(image).XYWIN_IMAGE
    h= abs(numpy.mean(g))
    a=[q,j,z,h]
    return a
data =[meanarr(arg) for arg in sys.argv[1:]]
print zip(*data)
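For reference, a Python 3 sketch of the same transposition with invented numbers in place of the pyfits means. In Python 3, `zip` returns an iterator, so each column is turned into a list explicitly.

```python
# Each row stands for one image's [q, j, z, h]; the values are made up.
rows = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0]]

# zip(*rows) groups the first items together, then the second items, and so on.
columns = [list(col) for col in zip(*rows)]
q_values, j_values, z_values, h_values = columns
print(q_values)  # [1.0, 5.0, 9.0]
```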
