Loop through files in a subdirectory, append filenames as dictionary keys - python

I have a directory of text files in a subdirectory, /directory_name/. I would like to loop through all files within this subdirectory, and append the filename as a string as the dictionary key. Then I would like to take the first word in each text file, and use this as the value for each key. I'm a bit stuck on this part:
import os
os.path.expanduser('~/directory_name/') # go to subdirectory
file_dict = {} # create dictionary
for i in file_directory:
    file_dict[str(filename)] = {} # creates keys based on each filename
    # do something here to get the dictionary values
Is it possible to do this in two separate steps? That is, create dictionary keys first, then do an operation on all text files to extract the values?

To change directories, use os.chdir(). Assuming the first word in each file is followed by a space,
import os
file_dict = {} # create a dictionary
os.chdir(os.path.join(os.path.expanduser('~'), 'directory_name'))
for key in [file for file in os.listdir(os.getcwd()) if os.path.isfile(file)]:
    value = open(key).readlines()[0].split()[0]
    file_dict[key] = value
works for me. And if you really want to do it in two steps,
import os
os.chdir(os.path.join(os.path.expanduser('~'), 'directory_name'))
keys = [file for file in os.listdir(os.getcwd()) if os.path.isfile(file)] # step 1
# Do something else in here...
values = [open(key).readlines()[0].split()[0] for key in keys] # step 2
file_dict = dict(zip(keys, values)) # Map values onto keys to create the dictionary
gives the same output.
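For completeness, here is a self-contained sketch of the same idea that avoids os.chdir() by joining paths explicitly; the directory and file contents below are invented for the demo:

```python
import os
import tempfile

# Build a throwaway directory with two sample files (names/contents are made up).
demo_dir = tempfile.mkdtemp()
for name, text in [("a.txt", "alpha beta"), ("b.txt", "bravo charlie")]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write(text)

file_dict = {}
for fname in os.listdir(demo_dir):
    path = os.path.join(demo_dir, fname)
    if os.path.isfile(path):
        with open(path) as f:
            # key: filename, value: first word of the first line
            file_dict[fname] = f.readline().split()[0]

print(file_dict)  # {'a.txt': 'alpha', 'b.txt': 'bravo'} (order may vary)
```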

Related

Making a dictionary from files in which keys are filenames and values are strings with specific character

so my problem is - I have proteomes in FASTA format, which look like this:
Name of the example file:
GCA_003547095.1_protein.faa
Contents:
>CAG77607.1
ABCDEF
>CAG72141.1
CSSDAS
And I also have files that contain just names of the proteins, i.e.:
Filename:
PF00001
Contents:
CAG77607.1
CAG72141.1
My task is to iterate through proteomes using list of proteins to find out how many proteins are in each proteome. PE told me that it should be a dictionary made from filenames of proteomes as keys and sequence names after ">" as values.
My approach was as follows:
import pandas as pd
file_names = open("proteomes_list").readlines()
d = {x: pd.read_csv("/proteomes/" + "GCA_003547095.1_protein.faa").columns.tolist() for x in file_names}
print (d)
As you can see, I've made the proteome filenames into a list (using a simple bash "ls"; these are ONLY the names of the proteomes) and then created a dictionary with the sequence names as values. Unfortunately, each proteome (including the tested one) gets only one value.
I would be grateful if you could shed some light on my case.
My goal was to make a dictionary where the key would be e.g. GCA_003547095.1_protein.faa and the value e.g. CAG77607.1, CAG72141.1.
Is this the output you expect? This function iterates over your file and grabs the FASTA headers, i.e. the names of the proteins expected in the file. Here is a quick function that creates a list of the FASTA headers.
You can create the dictionary you mentioned by iterating over the file names and updating the parent dictionary:
import os
def extract_proteomes(folder: str, filename: str) -> list[str]:
    with open(os.path.join(folder, filename), mode='r') as file:
        content: list[str] = file.read().split('\n')
    protein_names = [i[1:] for i in content if i.startswith('>')]
    if not protein_names:
        protein_names = [i for i in content if i]
    return protein_names
folder = "/Users/user/Downloads/"
files = ["GCA_003547095.1_protein.faa", "PF00001"]
d = {}
for i in files:
    d.update({i: extract_proteomes(folder=folder, filename=i)})
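As a quick sanity check of the function above, the following sketch writes a miniature .faa file to a temporary folder and verifies that the headers come back as values (the folder is generated here, not the /Users/user/Downloads/ path from the answer):

```python
import os
import tempfile

def extract_proteomes(folder, filename):
    # Same logic as the answer: keep '>' header lines, stripped of the '>'.
    with open(os.path.join(folder, filename)) as file:
        content = file.read().split('\n')
    protein_names = [line[1:] for line in content if line.startswith('>')]
    if not protein_names:
        protein_names = [line for line in content if line]
    return protein_names

folder = tempfile.mkdtemp()
with open(os.path.join(folder, "GCA_003547095.1_protein.faa"), "w") as f:
    f.write(">CAG77607.1\nABCDEF\n>CAG72141.1\nCSSDAS\n")

d = {name: extract_proteomes(folder, name) for name in os.listdir(folder)}
print(d)  # {'GCA_003547095.1_protein.faa': ['CAG77607.1', 'CAG72141.1']}
```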

How can I iterate over a list of .txt files using numpy?

I'm trying to iterate over a list of .txt files in Python. I would like to load each file individually, create an array, find the maximum value in a certain column of each array, and append it to an empty list. Each file has three columns and no headers or anything apart from numbers.
My problem is starting the iteration. I've received error messages such as "No such file or directory", followed by the name of the first .txt file in my list.
I used os.listdir() to display each file in the directory that I'm working with. I assigned this to the variable filenamelist, which I'm trying to iterate over.
Here is one of my attempts to iterate:
for f in filenamelist:
    x, y, z = np.array(f)
    currentlist.append(max(z))
I expect it to make an array of each file, find the maximum value of the third column (which I have assigned to z) and then append that to an empty list, then move onto the next file.
Edit: Here is the code that I have written so far:
import os
import numpy as np
from glob import glob
path = 'C://Users//chand//06072019'
filenamelist = os.listdir(path)
currentlist = []
for f in filenamelist:
    file_array = np.fromfile(f, sep=",")
    z_column = file_array[:,2]
    max_z = z_column.max()
    currentlist.append(max_z)
Edit 2: Here is a snippet of one file that I'm trying to extract a value from:
0, 0.996, 0.031719
5.00E-08, 0.996, 0.018125
0.0000001, 0.996, 0.028125
1.50E-07, 0.996, 0.024063
0.0000002, 0.996, 0.023906
2.50E-07, 0.996, 0.02375
0.0000003, 0.996, 0.026406
Each column is of length 1000. I'm trying to extract the maximum value of the third column and append it to an empty list.
The main issue is that np.array(filename) does not load the file for you. Depending on the format of your file, something like np.loadtxt() will do the trick (see the docs).
Edit: As others have mentioned, there is another issue with your implementation. os.listdir() returns a list of file names, but you need file paths. You could use os.path.join() to get the path that you need.
Below is an example of how you might do what you want, but it really depends on the file format. In this example I'm assuming a CSV (comma separated) file.
Example input file:
1,2,3
4,5,6
Example code:
import os
import numpy as np

path = 'C://Users//chand//06072019'
filenames = os.listdir(path)
currentlist = []
for f in filenames:
    # get the full path of the filename
    filepath = os.path.join(path, f)
    # load the file
    file_array = np.loadtxt(filepath, delimiter=',')
    # get the whole third column
    z_column = file_array[:, 2]
    # get the max of that column
    max_z = z_column.max()
    # add the max to our list
    currentlist.append(max_z)
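To see that np.loadtxt() handles the exact number formats from the question's snippet (including the 5.00E-08 style), here is a minimal check using an in-memory file object instead of a real path:

```python
import io
import numpy as np

# Stand-in for one of the comma-separated .txt files in the question.
sample = io.StringIO(
    "0, 0.996, 0.031719\n"
    "5.00E-08, 0.996, 0.018125\n"
    "0.0000001, 0.996, 0.028125\n"
)
file_array = np.loadtxt(sample, delimiter=',')
max_z = file_array[:, 2].max()
print(max_z)  # 0.031719
```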

Python create dict from CSV and use the file name as key

I have a simple CSV text file called "allMaps.txt". It contains the following:
cat, dog, fish
How can I take the file name "allMaps" and use it as a key, along with the contents being values?
I wish to achieve this format:
{"allMaps": "cat", "dog", "fish"}
I have a range of txt files all containing values separated by commas so a more dynamic method that does it for all would be beneficial!
The other txt files are:
allMaps.txt
fiveMaps.txt
tenMaps.txt
sevenMaps.txt
They all contain comma separated values. Is there a way to look into the folder and convert each one on the text files into a key-value dict?
Assuming you have the file names in a list.
files = ["allMaps.txt", "fiveMaps.txt", "tenMaps.txt", "sevenMaps.txt"]
You can do the following:
my_dict = {}
for file in files:
    with open(file) as f:
        items = [i.strip() for i in f.read().split(",")]
    my_dict[file.replace(".txt", "")] = items
If the files are all in the same folder, you could do the following instead of maintaining a list of files:
import os
files = os.listdir("<folder>")
Given the file names, you can create a dictionary where the key stores the filenames with a corresponding value of a list of the file data:
files = ['allMaps.txt', 'fiveMaps.txt', 'tenMaps.txt', 'sevenMaps.txt']
final_results = {i:[b.strip('\n').split(', ') for b in open(i)] for i in files}
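Putting the pieces together, here is a runnable sketch that builds the folder-wide dictionary; a temporary directory stands in for the real folder of .txt files:

```python
import os
import tempfile

# A throwaway folder with one sample file, mimicking allMaps.txt.
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "allMaps.txt"), "w") as f:
    f.write("cat, dog, fish")

my_dict = {}
for fname in os.listdir(folder):
    if fname.endswith(".txt"):
        with open(os.path.join(folder, fname)) as f:
            # strip whitespace around each comma-separated value
            my_dict[fname[:-4]] = [item.strip() for item in f.read().split(",")]

print(my_dict)  # {'allMaps': ['cat', 'dog', 'fish']}
```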

How would I read and write from multiple files in a single directory? Python

I am writing a Python code and would like some more insight on how to approach this issue.
I am trying to read in, in order, multiple files that end with .log. From these, I hope to write specific values to a .csv file.
Within the text file, there are X/Y values that are extracted below:
Textfile.log:
X/Y = 5
X/Y = 6
Textfile.log.2:
X/Y = 7
X/Y = 8
DesiredOutput in the CSV file:
5
6
7
8
Here is the code I've come up with so far:
def readfile():
    import os
    i = 0
    for file in os.listdir("\mydir"):
        if file.endswith(".log"):
            return file

def main():
    import re
    list = []
    list = readfile()
    for line in readfile():
        x = re.search(r'(?<=X/Y = )\d+', line)
        if x:
            list.append(x.group())
        else:
            break
    f = csv.write(open(output, "wb"))
    while 1:
        if (i > len(list-1)):
            break
        else:
            f.writerow(list(i))
            i += 1

if __name__ == '__main__':
    main()
I'm confused on how to make it read the .log file, then the .log.2 file.
Is it possible to just have it automatically read all the files in 1 directory without typing them in individually?
Update: I'm using Windows 7 and Python V2.7
The simplest way to read files sequentially is to build a list and then loop over it. Something like:
for fname in list_of_files:
    with open(fname, 'r') as f:
        ...  # Do all the stuff you do to each file
This way, whatever you do to read each file is repeated and applied to every file in list_of_files. Since lists are ordered, the files are processed in the same order as the list is sorted.
Borrowing from @The2ndSon's answer, you can pick up the files with os.listdir(dir). This simply lists all files and directories within dir in an arbitrary order. From this you can pull out and order all of your log files like this:
allFiles = os.listdir(some_dir)
logFiles = [fname for fname in allFiles if "log" in fname.split('.')]
logFiles.sort(key = lambda x: x.split('.')[-1])
logFiles[0], logFiles[-1] = logFiles[-1], logFiles[0]
The above code will work with files named like "somename.log", "somename.log.2" and so on. You can then take logFiles and plug it in as list_of_files. Note that the last line is only necessary if the first file is "somename.log" instead of "somename.log.1". If the first file has a number on the end, just leave out the last step.
Line By Line Explanation:
allFiles = os.listdir(some_dir)
This line takes all files and directories within some_dir and returns them as a list
logFiles = [fname for fname in allFiles if "log" in fname.split('.')]
Perform a list comprehension to gather all of the files with log in the name as part of the extension. "something.log.somethingelse" will be included, "log_something.somethingelse" will not.
logFiles.sort(key = lambda x: x.split('.')[-1])
Sort the list of log files in place by the last extension. x.split('.')[-1] splits the file name into a list of period delimited values and takes the last entry. If the name is "name.log.5", it will be sorted as "5". If the name is "name.log", it will be sorted as "log".
logFiles[0], logFiles[-1] = logFiles[-1], logFiles[0]
Swap the first and last entries of the list of log files. This is necessary because the sorting operation will put "name.log" as the last entry and "name.log.1" as the first.
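Note that sorting on the raw extension string compares "10" before "2"; if that matters, a numeric sort key (an alternative sketch, not what the answer above literally does, treating the bare .log as number 1) avoids both the swap and the lexicographic pitfall:

```python
def log_sort_key(name):
    # "Textfile.log" -> 1; "Textfile.log.2" -> 2; "Textfile.log.10" -> 10
    last = name.split('.')[-1]
    return int(last) if last.isdigit() else 1

log_files = ["Textfile.log.10", "Textfile.log.2", "Textfile.log"]
log_files.sort(key=log_sort_key)
print(log_files)  # ['Textfile.log', 'Textfile.log.2', 'Textfile.log.10']
```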
If you change the naming scheme for your log files you can easily return of list of files that have the ".log" extension. For example if you change the file names to Textfile1.log and Textfile2.log you can update readfile() to be:
import os
def readfile():
    my_list = []
    for file in os.listdir("."):
        if file.endswith(".log"):
            my_list.append(file)
    return my_list
print my_list will output ['Textfile1.log', 'Textfile2.log']. Using the word 'list' as a variable name is generally avoided, as it shadows the built-in list type in Python.

Extract number from file name in python

I have a directory where I have many data files, but the data file names have arbitrary numbers. For example
data_T_1e-05.d
data_T_7.2434.d
data_T_0.001.d
and so on. Because of the decimals in the file names they are not sorted according to the value of the numbers. What I want to do is the following:
I want to open every file, extract the number from the file name, put it in a array and do some manipulations using the data. Example:
a = np.loadtxt("data_T_1e-05.d", unpack=True)
res[i][0] = 1e-05
res[i][1] = np.sum(a)
I want to do this for every file by running a loop. I think it could be done by creating an array containing all the file names (using import os) and then doing something with it.
How can it be done?
If your files all start with the same prefix and end with the same suffix, simply slice and pass to float():
number = float(filename[7:-2])
This removes the first 7 characters (i.e. data_T_) and the last 2 (.d).
This works fine for your example filenames:
>>> for example in ('data_T_1e-05.d', 'data_T_7.2434.d', 'data_T_0.001.d'):
... print float(example[7:-2])
...
1e-05
7.2434
0.001
import os
# create the list containing all files from the current dir
filelistall = os.listdir(os.getcwd())
# create the list containing only data files.
# I assume that data file names end with ".d"
filelist = filter(lambda x: x.endswith('.d'), filelistall)
for filename in filelist:
    f = open(filename, "r")
    number = float(filename[7:-2])
    # ... and any other code dealing with the file
    f.close()
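Since the slicing yields a float, it can double as a sort key, so the files are processed in order of their temperature value rather than lexicographically (a small extension of the answer above, using the question's example names):

```python
filenames = ["data_T_7.2434.d", "data_T_1e-05.d", "data_T_0.001.d"]
# Strip the "data_T_" prefix and ".d" suffix, then sort by the numeric value.
ordered = sorted(filenames, key=lambda name: float(name[7:-2]))
print(ordered)  # ['data_T_1e-05.d', 'data_T_0.001.d', 'data_T_7.2434.d']
```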
