Please help: I am trying to write separate files from a list of strings based on a string match.
The list X below has 3 comma-separated strings; based on a match from filter, I want to filter the lines and write them to separate files.
X = ['apple,banana,fruits,orange', 'dog,cat,animals,horse', 'mouse,elephant,animals,peacock']
filter = ('fruits', 'animals')
From the strings in X, I want to write CSV files separately, based on the match found in filter.
I tried the incomplete code below:
def write(Y):
    temp = []
    for elem in Y:
        for n in filter:
            if n in elem:
                temp.append(elem)
Expected output:
cat fruits.csv:
apple,banana,fruits,orange
cat animals.csv:
dog,cat,animals,horse
mouse,elephant,animals,peacock
Please help or advise on the best method to do this.
Thanks in advance.
You can create a dictionary from your list with keys as filenames and then iterate the dictionary to write to file:
import re
from collections import defaultdict

X = ['apple,banana,fruits,orange', 'dog,cat,animals,horse', 'mouse,elephant,animals,peacock']
filters = ('fruits', 'animals')  # renamed to avoid shadowing the built-in filter()

d = defaultdict(list)
for x in X:
    for f in filters:
        if re.search(fr'\b{f}\b', x):  # whole-word match
            d[f].append(x)

for k, v in d.items():
    with open(f'{k}.csv', 'w') as fi:
        for y in v:
            fi.write(y + '\n')  # add a newline so each record lands on its own line
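Since the outputs are CSV files, an alternative is to split each stored line back into fields and let the csv module handle quoting and line endings; a minimal sketch reusing the dictionary d built above:

import csv

for k, v in d.items():
    with open(f'{k}.csv', 'w', newline='') as fi:
        writer = csv.writer(fi)
        for y in v:
            writer.writerow(y.split(','))  # one stored string = one CSV record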
I have a list of strings that I want to group together if they contain a specific substring from a master list.
Example Input:
["Audi_G71Q3E5T7_Coolant", "Volt_Battery_G9A2B4C6D8E", "Speaker_BMW_G71Q3E5T7", "Engine_Benz_G9A2B4C6D8E", "Ford_G9A2B4C6D8E_Wheel", "Toyota_Exhaust_G71Q3E5T7"]
Master List:
["G71Q3E5T7", "G9A2B4C6D8E"]
Expected Output:
[["Audi_G71Q3E5T7_Coolant", "Speaker_BMW_G71Q3E5T7", "Toyota_Exhaust_G71Q3E5T7"], ["Volt_Battery_G9A2B4C6D8E", "Engine_Benz_G9A2B4C6D8E", "Ford_G9A2B4C6D8E_Wheel"]]
I haven't found any example or solution online. I am aware that the itertools.groupby() function is useful in these scenarios, but I am struggling to make it work.
Example inputs:
in_list = ["Audi_G71Q3E5T7_Coolant", "Volt_Battery_G9A2B4C6D8E", "Speaker_BMW_G71Q3E5T7",
           "Engine_Benz_G9A2B4C6D8E", "Ford_G9A2B4C6D8E_Wheel", "Toyota_Exhaust_G71Q3E5T7"]
master_list = ["G71Q3E5T7", "G9A2B4C6D8E"]
out_list = []

# For every master item, check whether it is in any input item; add matches to out_list[index].
for index, m in enumerate(master_list):
    out_list.append([])
    for i in in_list:
        if m in i:
            out_list[index].append(i)
print(out_list)
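The same grouping can also be written as a nested list comprehension; a minimal sketch using the variables above:

out_list = [[i for i in in_list if m in i] for m in master_list]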
itertools.groupby is not appropriate here, as you would first need to sort the data.
You can use a regex to extract the IDs, and a dictionary/defaultdict to collect the data:
import re
from collections import defaultdict

L = ["Audi_G71Q3E5T7_Coolant", "Volt_Battery_G9A2B4C6D8E",
     "Speaker_BMW_G71Q3E5T7", "Engine_Benz_G9A2B4C6D8E",
     "Ford_G9A2B4C6D8E_Wheel", "Toyota_Exhaust_G71Q3E5T7"]
ids = ["G71Q3E5T7", "G9A2B4C6D8E"]

regex = re.compile('|'.join(map(re.escape, ids)))
# re.compile(r'G71Q3E5T7|G9A2B4C6D8E', re.UNICODE)

out = defaultdict(list)
for item in L:                       # for each item
    m = regex.search(item)           # try to find an ID
    if m:                            # if an ID was found
        out[m.group()].append(item)  # add to the appropriate list

out = list(out.values())
Output:
[['Audi_G71Q3E5T7_Coolant',
'Speaker_BMW_G71Q3E5T7',
'Toyota_Exhaust_G71Q3E5T7'],
['Volt_Battery_G9A2B4C6D8E',
'Engine_Benz_G9A2B4C6D8E',
'Ford_G9A2B4C6D8E_Wheel']]
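For comparison, a groupby-based version would have to sort the items by the same key first, which is why the dictionary approach above is simpler. A minimal sketch reusing L and regex from above (it assumes every item matches an ID):

from itertools import groupby

key = lambda item: regex.search(item).group()  # assumes every item contains an ID
out = [list(group) for _, group in groupby(sorted(L, key=key), key=key)]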
I have a large number of files that are named according to gradually more specific criteria.
Each part of the filename separated by '_' relates to a drilled-down categorization of that file.
The naming convention looks like this:
TEAM_STRATEGY_ATTRIBUTION_TIMEFRAME_DATE_FILEVIEW
What I am trying to do is iterate through all these files and pull out a list of how many different occurrences of each naming category exist.
So essentially, this is what I've done so far: I iterated through all the files and made a list of each name, then separated each name by '_' and appended each part to its respective category list.
Now I'm trying to export them to a CSV file separated by columns, and this is where I'm running into problems:
L = [teams, strategies, attributions, time_frames, dates, file_types]
columns = zip(*L)
list(columns)
with open(_outputfolder_, 'w') as f:
    writer = csv.writer(f)
    for column in columns:
        print(column)
This is a rough estimation of the list I'm getting out:
[{'TEAM1'},
{'STRATEGY1', 'STRATEGY2', 'STRATEGY3', 'STRATEGY4', 'STRATEGY5', 'STRATEGY6', 'STRATEGY7', 'STRATEGY8', 'STRATEGY9', 'STRATEGY10','STRATEGY11', 'STRATEGY12', 'STRATEGY13', 'STRATEGY14', 'STRATEGY15'},
{'ATTRIBUTION1','ATTRIBUTION1','Attribution3','Attribution4','Attribution5', 'Attribution6', 'Attribution7', 'Attribution8', 'Attribution9', 'Attribution10'},
{'TIME_FRAME1', 'TIME_FRAME2', 'TIME_FRAME3', 'TIME_FRAME4', 'TIME_FRAME5', 'TIME_FRAME6', 'TIME_FRAME7'},
{'DATE1'},
{'FILE_TYPE1', 'FILE_TYPE2'}]
What I want the final result to look like is something like:
Team1 STRATEGY1 ATTRIBUTION1 TIME_FRAME1 DATE1 FILE_TYPE1
STRATEGY2 ATTRIBUTION2 TIME_FRAME2 FILE_TYPE2
... ... ...
etc. etc. etc.
But only the first line actually gets stored in the CSV file.
Can anyone help me understand how to iterate past the first line? I'm sure this is happening because the Team category has only one option, but I don't want that to hinder it.
You have to transpose the result and use it; refer to the post below:
Python - Transposing a list (rows with different length) using numpy fails.
I have used natural sorting to sort the entries and padded the shorter lists with blanks to get the expected outcome.
Natural sorting is slower for larger lists; you can also use third-party libraries, see:
Does Python have a built in function for string natural sort?
import re

def natural_sort(l):
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key=alphanum_key)
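For example:

natural_sort(['STRATEGY10', 'STRATEGY2', 'STRATEGY1'])
# -> ['STRATEGY1', 'STRATEGY2', 'STRATEGY10']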
import csv

res = [[] for _ in range(max(len(sl) for sl in columns))]  # 'columns' here holds the category collections
for sl in columns:
    sorted_sl = natural_sort(sl)
    for x, res_sl in zip(sorted_sl, res):
        res_sl.append(x)

# every row after the first gets a leading blank for the single-entry TEAM column
for count, result in enumerate(res):
    if count > 0:
        result.insert(0, '')

with open("test.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(res)  # no explicit close needed; the with block handles it
The columns should be converted into lists before writing to the CSV file; the writerows method can be leveraged to write multiple rows. You can find more information at https://docs.python.org/2/library/csv.html.
TEAM1,STRATEGY1,ATTRIBUTION1,TIME_FRAME1,DATE1,FILE_TYPE1
,STRATEGY2,Attribution3,TIME_FRAME2,FILE_TYPE2
,STRATEGY3,Attribution4,TIME_FRAME3
,STRATEGY4,Attribution5,TIME_FRAME4
,STRATEGY5,Attribution6,TIME_FRAME5
,STRATEGY6,Attribution7,TIME_FRAME6
,STRATEGY7,Attribution8,TIME_FRAME7
,STRATEGY8,Attribution9
,STRATEGY9,Attribution10
,STRATEGY10
,STRATEGY11
,STRATEGY12
,STRATEGY13
,STRATEGY14
,STRATEGY15
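As an aside, itertools.zip_longest can do the padded transpose in one step, since plain zip stops at the shortest list; a minimal sketch assuming the same category lists as in the question:

import csv
from itertools import zip_longest

L = [teams, strategies, attributions, time_frames, dates, file_types]
rows = zip_longest(*L, fillvalue='')  # pads the shorter columns with empty strings
with open('test.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

Note that if the categories are sets, you may want to sort each one first for a stable column order.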
I have files in a directory that I want to match against a list of strings, collecting the matches in dictionaries.
The files in the directory look like the following:
DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz
DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz
DB_DEF_S1_001_MM_R1.faq.gz
DB_DEF_S1_001_MM_R2.faq.gz
The list contains parts of the filenames, like so:
ABC
DEF
So here is what I tried:
import os
import re

dir = '/user/home/files'
list = '/user/home/list'

samp1 = {}
samp2 = {}
FH_sample = open(list, 'r')
for line in FH_sample:
    samp1[line.strip().split('\n')[0]] = []
    samp2[line.strip().split('\n')[0]] = []
FH_sample.close()

for file in os.listdir(dir):
    m1 = re.search('(.*)_R1', file)
    m2 = re.search('(.*)_R2', file)
    if m1 and m1.group(1) in samp1:
        samp1[m1.group(1)].append(file)
    if m2 and m2.group(1) in samp2:
        samp2[m2.group(1)].append(file)
I wanted the above script to find the matches from m1 and m2 and collect them in the dictionaries samp1 and samp2, but it finds no matches inside the if blocks, and samp1 and samp2 end up empty.
This is what the output should look like for samp1 and samp2:
{'ABC': ['DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz', 'DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz'], 'DEF': ['DB_DEF_S1_001_MM_R1.faq.gz', 'DB_DEF_S1_001_MM_R2.faq.gz']}
Any help would be greatly appreciated.
A lot of this code you probably don't need: you can simply check whether each substring from your list occurs in the filename. (The original attempt fails because (.*)_R1 greedily captures everything before _R1, e.g. DB_ABC_2_T_bR_r1_v1_0_S1, which never equals a key like ABC.)
The code below reads in the data as lists. You seem to have already done this, so it is simply a matter of replacing files with the file names you read in from the directory and my_strings with the substrings from your list (which you shouldn't name list, since that shadows a Python built-in).
files = ["BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz",
"BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz",
"BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz",
"BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz"]
my_strings = ["MOHUA", "MSJLF"]
res = {s: [] for s in my_strings}
for k in my_strings:
    for file in files:
        if k in file:
            res[k].append(file)
print(res)
You can pass the Python script the directory path, collect the fastq filenames, add the IDs from id_list as dictionary keys, and append each fastq whose name contains the key:
import os
import sys

dir_path = sys.argv[1]

fastqs = []
for x in os.listdir(dir_path):
    if x.endswith(".faq.gz"):
        fastqs.append(x)

id_list = ['MOHUA', 'MSJLF']
sample_dict = dict((sample, []) for sample in id_list)
print(sample_dict)

for k in sample_dict:
    for z in fastqs:
        if k in z:
            sample_dict[k].append(z)
print(sample_dict)
to run:
python3.6 fq_finder.py /path/to/fastqs
output from above to show what is going on:
{'MOHUA': [], 'MSJLF': []} # first print creates dict with empty list as vals for keys
{'MOHUA': ['BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz', 'BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz'], 'MSJLF': ['BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz', 'BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz']}
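If you would rather keep the question's original two-dictionary regex approach, here is a hedged sketch. It assumes the question's DB_<ID>_ naming (the pattern below is hypothetical; adjust it to your real naming scheme), the samp1/samp2 dictionaries set up earlier, and that '_R1' only ever appears in R1 filenames:

import os
import re

pattern = re.compile(r'^DB_([A-Z]+)_')  # hypothetical: ID assumed to be the second '_'-separated field
for fname in os.listdir(dir_path):
    m = pattern.match(fname)
    if not m or m.group(1) not in samp1:
        continue
    # route R1 files to samp1 and R2 files to samp2, as in the question
    target = samp1 if '_R1' in fname else samp2
    target[m.group(1)].append(fname)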
So I have a text file with around 400,000 lists that mostly look like this.
100005 127545 202036 257630 362970 376927 429080
10001 27638 51569 88226 116422 126227 159947 162938 184977 188045
191044 246142 265214 290507 296858 300258 341525 348922 359832 365744
382502 390538 410857 433453 479170 489980 540746
10001 27638 51569 88226 116422 126227 159947 162938 184977 188045
191044 246142 265214 290507 300258 341525 348922 359832 365744 382502
So far I have a for loop that goes line by line and turns the current line into a temporary list.
How would I create a top-ten list of the lists with the most elements in the whole file?
This is the code I have now:
file = open('node.txt', 'r')
adj = {}
top_ten = []
at_least_3 = 0
for line in file:
    data = line.split()
    adj[data[0]] = data[1:]
And this is what one of the lists looks like:
['99995', '110038', '330533', '333808', '344852', '376948', '470766', '499315']
# collect the lines
lines = []
with open("so.txt") as f:
    for line in f:
        # split each line into a list
        lines.append(line.split())

# sort the lines by length, descending
lines = sorted(lines, key=lambda x: -len(x))

# print the first 10 lines
print(lines[:10])
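With around 400,000 lines, heapq.nlargest avoids building and sorting the full list just to keep ten entries; a minimal sketch under the same assumptions:

import heapq

with open("so.txt") as f:
    top_ten = heapq.nlargest(10, (line.split() for line in f), key=len)
print(top_ten)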
Why not use collections.Counter to display the 10 most common numbers? I.e.:
import re
import collections
file = open('numbers.txt', 'r')
content = file.read()
numbers = re.findall(r"\d+", content)
counter = collections.Counter(numbers)
print(counter.most_common(10))
When you want to count things and then find the one(s) with the highest counts, collections.Counter comes to mind (note that because each line is used as the key, duplicate lines collapse into a single entry):
from collections import Counter

lists = Counter()
with open('node.txt', 'r') as file:
    for line in file:
        values = line.split()
        lists[tuple(values)] = len(values)

print('Length Data')
print('====== ====')
for values, length in lists.most_common(10):
    print('{:2d} {}'.format(length, list(values)))
Output (using sample file data):
Length Data
====== ====
10 ['191044', '246142', '265214', '290507', '300258', '341525', '348922', '359832', '365744', '382502']
10 ['191044', '246142', '265214', '290507', '296858', '300258', '341525', '348922', '359832', '365744']
10 ['10001', '27638', '51569', '88226', '116422', '126227', '159947', '162938', '184977', '188045']
7 ['382502', '390538', '410857', '433453', '479170', '489980', '540746']
7 ['100005', '127545', '202036', '257630', '362970', '376927', '429080']
Use a for loop and max() maybe? You say you've got a for loop that places each line's values into a temporary list. From that you could use max() to pick out the largest value and put it into a new list.
As an open for loop, appending max() to a new list:
newlist = []
for x in data:
    largest = max(x)
    newlist.append(largest)
Or as a list comprehension:
newlist = [max(x) for x in data]
Then from there you have to do the same process on the new list(s) until you get to the desired top 10 scenario.
EDIT: I've just realised that I've misread your question. You want the lists with the most elements, not the highest values. Ok.
len() is a good one for this:

longest = []
for templist in data:
    # keep whichever list has the most elements so far
    if len(templist) > len(longest):
        longest = templist

That would give you the current longest list, and from there you could build a top-10 of lengths, of the lists themselves, or both.
If your data is really as shown, with each number the same length, then I would make a dictionary with key = line and value = length, get the top value/key pairs in the dictionary, and voilà. Sounds easy enough.
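A minimal sketch of that dictionary idea (as above, identical lines collapse into one key):

lengths = {}
with open('node.txt') as f:
    for line in f:
        lengths[line.strip()] = len(line.split())

# the ten lines with the largest element counts
top_ten = sorted(lengths, key=lengths.get, reverse=True)[:10]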
I need to write a program which looks for words with the same three middle characters (each word is 5 characters long) in a list, then writes them into a file like this:
wasdy
casde
tasdf

gsadk
csade
hsadi
Between the groups of similar words I need to leave an empty line. I am kind of stuck.
Is there a way to do this? I am using Python 3.2.
Thanks for your help.
I would use the itertools.groupby function for this. Assuming wordlist is a list of the words you want to group, this code does the trick:
import itertools

for k, v in itertools.groupby(wordlist, lambda word: word[1:4]):
    # here, k is the key the words are grouped by, i.e. word[1:4]
    # and v is an iterable of the words in the group
    for word in v:
        print(word)
    print()
itertools.groupby(wordlist, lambda word: word[1:4]) takes all the words and groups them by word[1:4], i.e. the three middle characters. Here's the output of the above code with your sample data:
wasdy
casde
tasdf

gsadk
csade
hsadi
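One caveat worth knowing: groupby only merges consecutive items, so if the word list is not already ordered by its middle characters, sort it with the same key first:

wordlist = sorted(wordlist, key=lambda word: word[1:4])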
To get you started: try using the built-in sorted function on the list of words, and for the key experiment with slicing, e.g. x[1:4].
For example:
some_list = ['wasdy', 'casde', 'tasdf', 'gsadk', 'other', 'csade', 'hsadi']
sorted(some_list, key = lambda x: sorted(x[1:4]))
# outputs ['gsadk', 'csade', 'hsadi', 'wasdy', 'casde', 'tasdf', 'other']
edit: It was unclear to me whether you wanted "same three middle characters, in order" or just "same three middle characters". If the former, then you could look at sorted(some_list, key = lambda x: x[1:4]) instead.
try:
from collections import defaultdict

dict_of_words = defaultdict(list)
for word in list_of_words:
    dict_of_words[word[1:-1]].append(word)
then, to write to an output file:
with open('outfile.txt', 'w') as f:
    for key in dict_of_words:
        f.write('\n'.join(dict_of_words[key]))
        f.write('\n\n')  # blank line between groups
word_list = ['wasdy', 'casde', 'tasdf', 'gsadk', 'csade', 'hsadi']

def test_word(word):
    # check that 'a', 's' and 'd' all appear among the middle characters
    return all([x in word[1:4] for x in ['a', 's', 'd']])

f = open('yourfile.txt', 'w')
f.write('\n'.join([word for word in word_list if test_word(word)]))
f.close()
returns:
wasdy
casde
tasdf
gsadk
csade
hsadi