Extract number from file name in Python

I have a directory where I have many data files, but the data file names have arbitrary numbers. For example
data_T_1e-05.d
data_T_7.2434.d
data_T_0.001.d
and so on. Because of the decimals in the file names they are not sorted according to the value of the numbers. What I want to do is the following:
I want to open every file, extract the number from the file name, put it in a array and do some manipulations using the data. Example:
a = np.loadtxt("data_T_1e-05.d",unpack=True)
res[i][0] = 1e-05
res[i][1] = np.sum(a)
I want to do this for every file by running a loop. I think it could be done by creating an array containing all the file names (using import os) and then doing something with it.
How can it be done?

If your files all start with the same prefix and end with the same suffix, simply slice and pass to float():
number = float(filename[7:-2])
This removes the first 7 characters (i.e. data_T_) and the last 2 (.d).
This works fine for your example filenames:
>>> for example in ('data_T_1e-05.d', 'data_T_7.2434.d', 'data_T_0.001.d'):
... print(float(example[7:-2]))
...
1e-05
7.2434
0.001
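If the prefix or suffix might vary, a small regular expression is more robust than fixed slicing. A minimal sketch, assuming names of the form data_T_<number>.d as in the question:
import re

def extract_number(filename):
    # assumes names like "data_T_<number>.d"; returns None if no match
    match = re.match(r"data_T_(.+)\.d$", filename)
    return float(match.group(1)) if match else None

print(extract_number("data_T_1e-05.d"))  # 1e-05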

import os

# create the list containing all files from the current dir
filelistall = os.listdir(os.getcwd())
# create the list containing only data files;
# I assume that data file names end with ".d"
filelist = [name for name in filelistall if name.endswith('.d')]
for filename in filelist:
    with open(filename, "r") as f:
        number = float(filename[7:-2])
        # ... and any other code dealing with the file
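Putting this together with the res array from the question: a minimal sketch that sorts the files by the embedded number before loading them (the shape of res is an assumption):
import os
import numpy as np

filelist = [x for x in os.listdir(os.getcwd()) if x.endswith('.d')]
# sort by the numeric value in the name, not alphabetically
filelist.sort(key=lambda name: float(name[7:-2]))

res = np.zeros((len(filelist), 2))
for i, filename in enumerate(filelist):
    a = np.loadtxt(filename, unpack=True)
    res[i][0] = float(filename[7:-2])  # the number from the file name
    res[i][1] = np.sum(a)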

Related

Separating a file by lines in python

I have a .fastq file (cannot use Biopython) that consists of multiple samples in different lines. The file contents look like this:
#sample1
ACGTC.....
+
IIIIDDDDDFF
#sample2
AGCGC....
+
IIIIIDFDFD
.
.
.
#sampleX
ACATAG
+
IIIIIDDDFFF
I want to take the file and separate out each individual set of samples (i.e. lines 1-4, 5-8 and so on until the end of the file) and write each of them to a separate file (i.e. sample1.fastq contains that contents of sample 1 lines 1-4 and so on). Is this doable using loops in python?
You can use defaultdict and regex for this
import re
from collections import defaultdict

# Get file contents
with open("test.fastq", "r") as f:
    content = f.read()

samples = defaultdict(list)  # Make defaultdict of empty lists
identifier = ""
# Iterate through every line in the file
for line in content.split("\n"):
    # Find strings which start with #
    if re.match("^#.*", line):
        # Set identifier to match following lines to this section
        identifier = line.replace("#", "")
    else:
        # Add the line to its identifier
        samples[identifier].append(line)
Now all you have to do is save the contents of this default dictionary into multiple files:
# Loop through all samples (and their contents)
for sample_name, sample_items in samples.items():
    # Create a new file with the name of its sample_name.fastq
    # (You might want to change the naming)
    with open(f"{sample_name}.fastq", "w") as f:
        # Write each element of the sample_items on a new line
        f.write("\n".join(sample_items))
It might be helpful for you to also include #sample_name in the beginning of the file (first line), but I'm not sure you want that so I haven't added that.
Note that you can tighten the regex to match only #sample[number] instead of anything starting with #; for that, use re.match(r"^#sample\d+", line) instead.
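For example, a variant of the loop that uses the stricter pattern and keeps the #sampleX line as the first line of each output file might look like this (a sketch, reusing the file names from the question):
import re
from collections import defaultdict

samples = defaultdict(list)
identifier = ""
with open("test.fastq") as f:
    for line in f:
        line = line.rstrip("\n")
        if re.match(r"^#sample\d+", line):
            identifier = line.replace("#", "")
            samples[identifier].append(line)  # keep the header line too
        else:
            samples[identifier].append(line)

for sample_name, sample_items in samples.items():
    with open(f"{sample_name}.fastq", "w") as out:
        out.write("\n".join(sample_items))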

How can I iterate over a list of .txt files using numpy?

I'm trying to iterate over a list of .txt files in Python. I would like to load each file individually, create an array, find the maximum value in a certain column of each array, and append it to an empty list. Each file has three columns and no headers or anything apart from numbers.
My problem is starting the iteration. I've received error messages such as "No such file or directory", followed by the name of the first .txt file in my list.
I used os.listdir() to display each file in the directory that I'm working with. I assigned this to the variable filenamelist, which I'm trying to iterate over.
Here is one of my attempts to iterate:
for f in filenamelist:
    x, y, z = np.array(f)
    currentlist.append(max(z))
I expect it to make an array of each file, find the maximum value of the third column (which I have assigned to z) and then append that to an empty list, then move onto the next file.
Edit: Here is the code that I have written so far:
import os
import numpy as np
from glob import glob
path = 'C://Users//chand//06072019'
filenamelist = os.listdir(path)
currentlist = []
for f in filenamelist:
    file_array = np.fromfile(f, sep=",")
    z_column = file_array[:,2]
    max_z = z_column.max()
    currentlist.append(max_z)
Edit 2: Here is a snippet of one file that I'm trying to extract a value from:
0, 0.996, 0.031719
5.00E-08, 0.996, 0.018125
0.0000001, 0.996, 0.028125
1.50E-07, 0.996, 0.024063
0.0000002, 0.996, 0.023906
2.50E-07, 0.996, 0.02375
0.0000003, 0.996, 0.026406
Each column is of length 1000. I'm trying to extract the maximum value of the third column and append it to an empty list.
The main issue is that np.array(filename) does not load the file for you. Depending on the format of your file, something like np.loadtxt() will do the trick (see the docs).
Edit: As others have mentioned, there is another issue with your implementation. os.listdir() returns a list of file names, but you need file paths. You could use os.path.join() to get the path that you need.
Below is an example of how you might do what you want, but it really depends on the file format. In this example I'm assuming a CSV (comma separated) file.
Example input file:
1,2,3
4,5,6
Example code:
path = 'C://Users//chand//06072019'
filenames = os.listdir(path)
currentlist = []
for f in filenames:
    # get the full path of the filename
    filepath = os.path.join(path, f)
    # load the file
    file_array = np.loadtxt(filepath, delimiter=',')
    # get the whole third column
    z_column = file_array[:,2]
    # get the max of that column
    max_z = z_column.max()
    # add the max to our list
    currentlist.append(max_z)
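As an aside, the original code already imports glob, and glob.glob can do the directory listing, the .txt filtering, and the path joining in one step. A sketch under the same assumptions:
import numpy as np
from glob import glob

currentlist = []
# glob returns full paths when the pattern includes the directory
for filepath in glob('C://Users//chand//06072019/*.txt'):
    file_array = np.loadtxt(filepath, delimiter=',')
    currentlist.append(file_array[:, 2].max())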

Python: Appending file outputs from different directories into one overall list

I have n directories (labeled 0 to n), each that has a file (all the files have the same name), from which I want to grab certain lines from each file. I then want to append these grabbed lines together in order (from 0 to n) in a list.
This is my set-up:
for i in range(0, nfolders):
    folder = "%02d" % i
    os.system("cd " + folder)
    myFile = open("myOutputFile", "r")
    lines = myFile.readlines()
    firstLine = float(lines[0])
    # I then write a loop to store the next 5 lines in a list
    # using append and call this list nextLines
My question is, is there an easy way to append firstLine from all the directories into one list (that my function returns), as well as append nextLines from all the directories into one list (again, that my function returns)?
I know there is the extend function, would I loop over that here (because let's say I have nfolders = 300, making it hard to manually add things together)?
Thanks!
You've got a couple of problems to deal with. os.system("cd " + folder) changes the working directory of the subshell it invokes (which then immediately exits), but not the directory of the running script. Use os.chdir for that. Or, far better, just prepend the folder to the file name and use that.
You don't need to read the entire file to get its first line; .readline() or the next() function does that for you. Finally, just append to a list.
my_list = []
for i in range(0, nfolders):
    filename = "%02d/myOutputFile" % i
    with open(filename) as myFile:
        firstLine = float(next(myFile))
        my_list.append(firstLine)
UPDATE
Suppose you want 4 + i lines from each file. You could tighten this up with
my_list = []
for i in range(0, nfolders):
    filename = "%02d/myOutputFile" % i
    with open(filename) as myFile:
        my_list += (next(myFile) for _ in range(4 + i))
Note that we only use range to count iterations and don't care about its value, so we use the variable _ as a quick visual cue that the value is not needed.
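An equivalent spelling uses itertools.islice to take the first 4 + i lines, which avoids the throwaway variable entirely (a sketch, assuming nfolders is defined as in the question):
from itertools import islice

my_list = []
for i in range(0, nfolders):
    filename = "%02d/myOutputFile" % i
    with open(filename) as myFile:
        # islice(myFile, n) yields the first n lines of the file
        my_list.extend(islice(myFile, 4 + i))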

In Python, how can I find the location of information in a CSV file?

I have three very long CSV files, and I need some advice/help with manipulating the code. Basically, I want the program to be broad/basic enough where I can add any limitations and it'll work.
For example, if I set the code to find where column 1 == x and column 2 == y, I want it to also work when I ask for something like column 1 != r instead.
import csv
file = input('csv files: ').split(',')
filters = input('Enter the filters: ').split(',')
f = open(csv_file,'r')
p=csv.reader(f)
header_eliminator = next(p,[])
I run into issues with the "file" part because if I choose to only use one file rather than the three I want to use now, it won't work. Same goes for the filters. The filters could be like
4==10,5>=4
this means that column 4 of the file(s) would equal 10 and column 5 of the files would be greater than or equal to 4. However, I might also want the filters to look like this:
1==4.333, 5=="6/1/2014 0:00:00", 6<=60.0, 7!=6
So I want to be able to use it for other things! I'm having so much trouble with this, do you have any advice on how to get started? Thanks!
Pandas is excellent for dealing with csv files. I'd recommend installing it. pip install pandas
Then, if you want to open 3 csv files and run checks on their columns, you'll just need to familiarize yourself with indexing in pandas. The only method you need to know for now is .iloc, since it seems you are indexing by the integer position of the columns.
import pandas as pd

files = input('Enter the csv files: ').split(',')
# Keeping a list of the files allows us to accept a different number of files.
# We use pandas to read each file into a DataFrame, which is stored as an
# element of the list; the length of the list is the number of files.
data = []
for name in files:
    data.append(pd.read_csv(name))
# You can then perform checks like this, to see whether column 2
# of every file is equal to 3 in every row:
print(all((df.iloc[:, 2] == 3).all() for df in data))
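To express the question's own filters (say, column 4 == 10 and column 5 >= 4) in pandas, boolean indexing works well. A minimal sketch, assuming data was built as above:
# combine conditions with & (and) or | (or); the parentheses are required
for df in data:
    matching_rows = df[(df.iloc[:, 4] == 10) & (df.iloc[:, 5] >= 4)]
    print(matching_rows)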
You can write a generator that takes a bunch of filenames and yields their lines one by one, then feed that into csv.reader. The tricky part is the filter: if you let the filter be a single line of Python code, you can use eval for that part. As an example:
import csv

#filenames = input('csv files: ').split(',')
#filters = input('Enter the filters: ').split(',')
# todo: for debug
# in this implementation, filters is a single python expression that can
# reference the 'col' variable, which is a list of the current columns
filenames = 'a.csv,b.csv,c.csv'
filters = '"a" in col[0] and "2" in col[2]'

# todo: debug - generate test files
for name in 'abc':
    with open('{}.csv'.format(name), 'w') as fp:
        fp.write('the header row\n')
        for row in range(3):
            fp.write(','.join('{}{}{}'.format(name, row, col)
                              for col in range(3)) + '\n')

def header_squash(filenames):
    """Iterate multiple files line by line after squashing the header line
    and any empty lines.
    """
    for filename in filenames:
        with open(filename) as fp:
            next(fp)
            for line in fp:
                if line.strip():
                    yield line

for col in csv.reader(header_squash(filenames.split(','))):
    # eval's namespace limits the damage untrusted code can do...
    if eval(filters, {'col': col}):
        # passed the filter, do the work
        print(col)
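If you'd rather not eval user input at all, the filter syntax from the question (4==10, 5>=4) is regular enough to parse into (column, operator, value) triples with the standard operator module. A hedged sketch; parse_filter and row_passes are hypothetical helpers, and the comparisons below are string comparisons, so convert to float where your data is numeric:
import operator

# two-character operators must be tried before one-character ones
OPS = [('==', operator.eq), ('!=', operator.ne),
       ('<=', operator.le), ('>=', operator.ge),
       ('<', operator.lt), ('>', operator.gt)]

def parse_filter(text):
    # hypothetical helper: turns "5>=4" into (5, operator.ge, "4")
    for symbol, func in OPS:
        if symbol in text:
            column, value = text.split(symbol, 1)
            return int(column), func, value.strip()
    raise ValueError("no operator found in %r" % text)

def row_passes(col, parsed_filters):
    # col is a list of strings from csv.reader
    return all(func(col[i], value) for i, func, value in parsed_filters)

# usage:
# parsed = [parse_filter(s) for s in '4==10,5>=4'.split(',')]
# matches = [col for col in reader if row_passes(col, parsed)]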

How would I read and write from multiple files in a single directory? Python

I am writing a Python code and would like some more insight on how to approach this issue.
I am trying to read in, in order, multiple files that end with .log. From these, I hope to write specific values to a .csv file.
Within the text file, there are X/Y values that are extracted below:
Textfile.log:
X/Y = 5
X/Y = 6
Textfile.log.2:
X/Y = 7
X/Y = 8
DesiredOutput in the CSV file:
5
6
7
8
Here is the code I've come up with so far:
def readfile():
    import os
    i = 0
    for file in os.listdir("\mydir"):
        if file.endswith(".log"):
            return file

def main():
    import re
    list = []
    list = readfile()
    for line in readfile():
        x = re.search(r'(?<=X/Y = )\d+', line)
        if x:
            list.append(x.group())
        else:
            break
    f = csv.write(open(output, "wb"))
    while 1:
        if (i > len(list - 1)):
            break
        else:
            f.writerow(list(i))
            i += 1

if __name__ == '__main__':
    main()
I'm confused on how to make it read the .log file, then the .log.2 file.
Is it possible to just have it automatically read all the files in 1 directory without typing them in individually?
Update: I'm using Windows 7 and Python V2.7
The simplest way to read files sequentially is to build a list and then loop over it. Something like:
for fname in list_of_files:
    with open(fname, 'r') as f:
        # Do all the stuff you do to each file
This way, whatever you do to each file will be repeated for every file in list_of_files. Since lists are ordered, the files will be processed in the same order the list is sorted in.
Borrowing from #The2ndSon's answer, you can pick up the files with os.listdir(dir). This will simply list all files and directories within dir in an arbitrary order. From this you can pull out and order all of your files like this:
allFiles = os.listdir(some_dir)
logFiles = [fname for fname in allFiles if "log" in fname.split('.')]
logFiles.sort(key = lambda x: x.split('.')[-1])
logFiles[0], logFiles[-1] = logFiles[-1], logFiles[0]
The above code will work with file names like "somename.log", "somename.log.2" and so on. You can then take logFiles and plug it in as list_of_files. Note that the last line is only necessary if the first file is "somename.log" instead of "somename.log.1"; if the first file has a number on the end, just skip that last step.
Line By Line Explanation:
allFiles = os.listdir(some_dir)
This line takes all files and directories within some_dir and returns them as a list
logFiles = [fname for fname in allFiles if "log" in fname.split('.')]
Perform a list comprehension to gather all of the files with log in the name as part of the extension. "something.log.somethingelse" will be included, "log_something.somethingelse" will not.
logFiles.sort(key = lambda x: x.split('.')[-1])
Sort the list of log files in place by the last extension. x.split('.')[-1] splits the file name into a list of period delimited values and takes the last entry. If the name is "name.log.5", it will be sorted as "5". If the name is "name.log", it will be sorted as "log".
logFiles[0], logFiles[-1] = logFiles[-1], logFiles[0]
Swap the first and last entries of the list of log files. This is necessary because the sorting operation will put "name.log" as the last entry and "name.log.1" as the first.
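One caveat with the string sort: "name.log.10" would sort before "name.log.2". A numeric key sidesteps that and also puts the bare "name.log" first, removing the need for the swap. A self-contained sketch:
logFiles = ["name.log.10", "name.log", "name.log.2"]  # example input

def log_sort_key(fname):
    # "name.log"   -> 0 (no trailing number: treat it as the first file)
    # "name.log.2" -> 2 (compare numerically, so 10 sorts after 2)
    last = fname.split('.')[-1]
    return int(last) if last.isdigit() else 0

logFiles.sort(key=log_sort_key)
print(logFiles)  # ['name.log', 'name.log.2', 'name.log.10']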
If you change the naming scheme for your log files, you can easily return a list of files that have the ".log" extension. For example, if you change the file names to Textfile1.log and Textfile2.log, you can update readfile() to be:
import os
def readfile():
    my_list = []
    for file in os.listdir("."):
        if file.endswith(".log"):
            my_list.append(file)
    return my_list
readfile() will then return ['Textfile1.log', 'Textfile2.log']. Using the word list as a variable name is generally avoided, as it is also the name of a built-in type in Python.
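Tying it back to the original goal of writing the X/Y values to a .csv, here is a minimal end-to-end sketch. The "X/Y = " line format comes from the question; the directory, output name, and helper are assumptions, and the code targets Python 3 (on the asker's Python 2.7, open the output with "wb" and drop the newline argument):
import csv
import os
import re

def list_log_files(directory):
    # hypothetical helper: .log and .log.N files, bare .log first,
    # then in numeric order of the trailing number
    names = [f for f in os.listdir(directory) if "log" in f.split(".")]
    names.sort(key=lambda f: int(f.split(".")[-1])
               if f.split(".")[-1].isdigit() else 0)
    return names

directory = "."
values = []
for fname in list_log_files(directory):
    with open(os.path.join(directory, fname)) as f:
        for line in f:
            match = re.search(r'(?<=X/Y = )\d+', line)
            if match:
                values.append(match.group())

with open("output.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for value in values:
        writer.writerow([value])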
