Concatenating fasta files from different folders - python

I have a large numbers of fasta files (these are just text files) in different subfolders. What I need is a way to search through the directories for files that have the same name and concatenate these into a file with the name of the input files. I can't do this manually as I have 10000+ genes that I need to do this for.
So far I have the following Python code that looks through one of the directories and then uses those file names to search through the other directories. This returns a list that has the full path for each file.
import os
from os.path import join, abspath
path = '/directoryforfilelist/' #Directory for source list
listing = os.listdir(path)
for x in listing:
for root, dirs, files in os.walk('/rootdirectorytosearch/'):
if x in files:
pathlist = abspath(join(root,x))
Where I am stuck is how to concatenate the files it returns that have the same name. The results from this script look like this.
/directory1/file1.fasta
/directory2/file1.fasta
/directory3/file1.fasta
/directory1/file2.fasta
/directory2/file2.fasta
/directory3/file2.fasta
In this case I would need the end result to be two files named file1.fasta and file2.fasta that contain the text from each of the same named files.
Any leads on where to go from here would be appreciated. While I did this part in Python anyway that gets the job done is fine with me. This is being run on a Mac if that matters.

Not tested, but here's roughly what I'd do:
from itertools import groupby
import os
def conc_by_name(names):
for tail, group in groupby(names, key=os.path.split):
with open(tail, 'w') as out:
for name in group:
with open(name) as f:
out.writelines(f)
This will create the files (file1.fasta and file2.fasta in your example) in the current folder.

For each file of your list, allocate the target file in append mode, read each line of your source file and write it to the target file.
Assuming that the target folder is empty to start with, and is not in /rootdirectorytosearch.

Related

Extract a list of files with a certain criteria within subdirectory of zip archive in python

I want to access some .jp2 image files inside a zip file and create a list of their paths. The zip file contains a directory folder named S2A_MSIL2A_20170420T103021_N0204_R108_T32UNB_20170420T103454.SAFE and I am currently reading the files using glob, after having extracted the folder.
I don't want to have to extract the contents of the zip file first. I read that I cannot use glob within a zip directory, nor I can use wildcards to access files within it, so I am wondering what my options are, apart from extracting to a temporary directory.
The way I am currently getting the list is this:
dirr = r'C:\path-to-folder\S2A_MSIL2A_20170420T103021_N0204_R108_T32UNB_20170420T103454.SAFE'
jp2_files = glob.glob(dirr + '/**/IMG_DATA/**/R60m/*B??_??m.jp2', recursive=True)
There are additional different .jp2 files in the directory, for which reason I am using the glob wildcards to filter the ones I need.
I am hoping to make this work so that I can automate it for many different zip directories. Any help is highly appreciated.
I made it work with zipfile and fnmatch
from zipfile import ZipFile
import fnmatch
zip = path_to_zip.zip
with ZipFile(zipaki, 'r') as zipObj:
file_list = zipObj.namelist()
pattern = '*/R60m/*B???60m.jp2'
filtered_list = []
for file in file_list:
if fnmatch.fnmatch(file, pattern):
filtered_list.append(file)

Python list certain files in different folders

I've got 2 folders, each with a different CSV file inside (both have the same format):
I've written some python code to search within the "C:/Users/Documents" directory for CSV files which begin with the word "File"
import glob, os
inputfile = []
for root, dirs, files in os.walk("C:/Users/Documents/"):
for datafile in files:
if datafile.startswith("File") and datafile.endswith(".csv"):
inputfile.append([os.path.join(root, datafile)])
print(inputfile)
That almost worked as it returns:
[['C:/Users/Documents/Test A\\File 1.csv'], ['C:/Users/Documents/Test B\\File 2.csv']]
Is there any way I can get it to return this instead (no sub list and shows / instead of \):
['C:/Users/Documents/Test A/File 1.csv', 'C:/Users/Documents/Test B/File 2.csv']
The idea is so I can then read both CSV files at once later, but I believe I need to get the list in the format above first.
okay, I will paste an option here.
I made use of os.path.abspath to get the the path before join.
Have a look and see if it works.
import os
filelist = []
for folder, subfolders, files in os.walk("C:/Users/Documents/"):
for datafile in files:
if datafile.startswith("File") and datafile.endswith(".csv"):
filePath = os.path.abspath(os.path.join(folder, datafile))
filelist.append(filePath)
filelist
Result:
['C:/Users/Documents/Test A/File 1.csv','C:/Users/Documents/Test B/File 2.csv']

Reading csv file with partially variable name

I want to read a csv file into a data frame from a certain folder with pandas. This folder contains several csv files. They contain different information.
df = pd.read_csv(r'C:\User\Username\Desktop\Statistic\12345678_Reference.csv')
The first part in the filename (1 - 8 is variable). I want to read it in the file which ends with '_Reference.csv', but I have no clue how to manage it. I googled, but could not find a solution if there are more than one csv file in the same folder.
If you import os, then you can use functions for for navigating the file system.
os.listdir(path) will return a list of all of the file names in a directory.
[f for f in os.listdir(path) if f.endswith("Reference.csv")]
Will return a list of all files names ending with "Reference.csv". In your scenario, it sounds like there will be only one item in the list.
So, [f for f in os.listdir(path) if f.endswith("Reference.csv")][0] would return the filename that you're looking for.
Then you can construct a path using the filename, and feed it to pd.read_csv().

Errors with Glob while outputting file names

I am combining two questions here because they are related to each other.
Question 1: I am trying to use glob to open all the files in a folder but it is giving me "Syntax Error". I am using Python 3.xx. Has the syntax changed for Python 3.xx?
Error Message:
File "multiple_files.py", line 29
files = glob.glob(/src/xyz/rte/folder/)
SyntaxError: invalid syntax
Code:
import csv
import os
import glob
from pandas import DataFrame, read_csv
#extracting
files = glob.glob(/src/xyz/rte/folder/)
for fle in files:
with open (fle) as f:
print("output" + fle)
f_read.close()
Question 2: I want to read input files, append "output" to the names and print out the names of the files. How can I do that?
Example: Input file name would be - xyz.csv and the code should print output_xyz.csv .
Your help is appreciated.
Your first problem is that strings, including pathnames, need to be in quotes. This:
files = glob.glob(/src/xyz/rte/folder/)
… is trying to divide a bunch of variables together, but the leftmost and rightmost divisions are missing operands, so you've confused the parser. What you want is this:
files = glob.glob('/src/xyz/rte/folder/')
Your next problem is that this glob pattern doesn't have any globs in it, so the only thing it's going to match is the directory itself.
That's perfectly legal, but kind of useless.
And then you try to open each match as a text file. Which you can't do with a directory, hence the IsADirectoryError.
The answer here is less obvious, because it's not clear what you want.
Maybe you just wanted all of the files in that directory? In that case, you don't want glob.glob, you want listdir (or maybe scandir): os.listdir('/src/xyz/rte/folder/').
Maybe you wanted all of the files in that directory or any of its subdirectories? In that case, you could do it with rglob, but os.walk is probably clearer.
Maybe you did want all the files in that directory that match some pattern, so glob.glob is right—but in that case, you need to specify what that pattern is. For example, if you wanted all .csv files, that would be glob.glob('/src/xyz/rte/folder/*.csv').
Finally, you say "I want to read input files, append "output" to the names and print out the names of the files". Why do you want to read the files if you're not doing anything with the contents? You can do that, of course, but it seems pretty wasteful. If you just want to print out the filenames with output appended, that's easy:
for filename in os.listdir('/src/xyz/rte/folder/'):
print('output'+filename)
This works in http://pyfiddle.io:
Doku: https://docs.python.org/3/library/glob.html
import csv
import os
import glob
# create some files
for n in ["a","b","c","d"]:
with open('{}.txt'.format(n),"w") as f:
f.write(n)
print("\nFiles before")
# get all files
files = glob.glob("./*.*")
for fle in files:
print(fle) # print file
path,fileName = os.path.split(fle) # split name from path
# open file for read and second one for write with modified name
with open (fle) as f,open('{}{}output_{}'.format(path,os.sep, fileName),"w") as w:
content = f.read() # read all
w.write(content.upper()) # write all modified
# check files afterwards
print("\nFiles after")
files = glob.glob("./*.*") # pattern for all files
for fle in files:
print(fle)
Output:
Files before
./d.txt
./main.py
./c.txt
./b.txt
./a.txt
Files after
./d.txt
./output_c.txt
./output_d.txt
./main.py
./output_main.py
./c.txt
./b.txt
./output_b.txt
./a.txt
./output_a.txt
I am on windows and would use os.walk (Doku) instead.
for d,subdirs,files in os.walk("./"): # deconstruct returned aktDir, all subdirs, files
print("AktDir:", d)
print("Subdirs:", subdirs)
print("Files:", files)
Output:
AktDir: ./
Subdirs: []
Files: ['d.txt', 'output_c.txt', 'output_d.txt', 'main.py', 'output_main.py',
'c.txt', 'b.txt', 'output_b.txt', 'a.txt', 'output_a.txt']
It also recurses into subdirs.

Running a python script on all the files in a directory

I have a Python script that reads through a text csv file and creates a playlist file. However I can only do one at a time, like:
python playlist.py foo.csv foolist.txt
However, I have a directory of files that need to be made into a playlist, with different names, and sometimes a different number of files.
So far I have looked at creating a txt file with a list of all the names of the file in the directory, then loop through each line of that, however I know there must be an easier way to do it.
for f in *.csv; do
python playlist.py "$f" "${f%.csv}list.txt"
done
Will that do the trick? This will put foo.csv in foolist.txt and abc.csv in abclist.txt.
Or do you want them all in the same file?
Just use a for loop with the asterisk glob, making sure you quote things appropriately for spaces in filenames
for file in *.csv; do
python playlist.py "$file" >> outputfile.txt;
done
Is it a single directory, or nested?
Ex.
topfile.csv
topdir
--dir1
--file1.csv
--file2.txt
--dir2
--file3.csv
--file4.csv
For nested, you can use os.walk(topdir) to get all the files and dirs recursively within a directory.
You could set up your script to accept dirs or files:
python playlist.py topfile.csv topdir
import sys
import os
def main():
files_toprocess = set()
paths = sys.argv[1:]
for p in paths:
if os.path.isfile(p) and p.endswith('.csv'):
files_toprocess.add(p)
elif os.path.isdir(p):
for root, dirs, files in os.walk(p):
files_toprocess.update([os.path.join(root, f)
for f in files if f.endswith('.csv')])
if you have directory name you can use os.listdir
os.listdir(dirname)
if you want to select only a certain type of file, e.g., only csv file you could use glob module.

Categories