Creating a match between string and file name in lists - Python

Let's say I have a list containing strings of the form "{N} word" (without quotation marks), where N is some integer and word is some string. In some folder D:\path\folder, I have plenty of files with names of the form "{N}name.filetype". With the aforementioned list as input, how would I get as output a list where every element has the following form: "{N} word D:\path\folder\{N}name.filetype"?
For example...
InputList = ['{75} Hello', '{823} World', ...]
OutputList = ['{75} Hello D:\path\folder\{75}Stuff.docx', '{823} World D:\path\folder\{823}Things.docx', ...]
if folder at D:\path\folder contains, among other files, {75}Stuff.docx and {823}Things.docx.
To generalize, my question is fundamentally:
How do I get Python to read a folder, take the absolute path of any file whose name contains some part of an element in the list (in this case, we look for {N} in the file names and disregard the word), and add that path to the corresponding element to build the output list?
I understand this is a bit of a long question that combines a couple of concepts, so many thanks in advance to anyone willing to help!

The important step is to convert your InputList to a dict of {number: word} - this makes it significantly easier to work with. After that it's just a matter of looping through the files in the folder, extracting the number from each file name and looking it up in the dict:
from pathlib import Path

InputList = ['{75} Hello', '{823} World']
folder_path = r'D:\path\folder'

# define a function to extract the number between curly braces
def extract_number(text):
    return text[1:text.find('}')]

# convert the InputList to a dict for easy and efficient lookup
words = {extract_number(name): name for name in InputList}

OutputList = []
# iterate through the folder to find matching files
for path in Path(folder_path).iterdir():
    # extract the file name from the path, e.g. "{75}Stuff.docx"
    name = path.name
    # extract the number from the file name and find the matching word
    number = extract_number(name)
    try:
        word = words[number]
    except KeyError:  # if no matching word exists, skip this file
        continue
    # put the path and the word together and add them to the output list
    path = '{} {}'.format(word, path)
    OutputList.append(path)

print(OutputList)
# output: ['{75} Hello D:\\path\\folder\\{75}Stuff.docx', '{823} World D:\\path\\folder\\{823}Things.docx']
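A variant that starts from the input list instead of the folder listing is also possible; this is only a sketch, assuming each number matches at most one file in the folder:

import glob
import os

InputList = ['{75} Hello', '{823} World']
folder_path = r'D:\path\folder'

OutputList = []
for entry in InputList:
    number = entry[1:entry.find('}')]                      # e.g. '75'
    pattern = os.path.join(folder_path, '{' + number + '}*')
    matches = glob.glob(pattern)                           # files whose names start with "{75}"
    if matches:                                            # skip entries with no matching file
        OutputList.append('{} {}'.format(entry, matches[0]))

print(OutputList)

Note that this globs the folder once per list entry, so for large folders the single pass with a dict shown above scales better.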

Related

I have a list of partial filenames; for each one, I want to go through the files in a directory, find the file name that contains that part, and return the full filename.

So, let's say I have a directory with a bunch of filenames.
for example:
Scalar Product or Dot Product (Hindi)-fodZTqRhC24.m4a
AP Physics C - Dot Product-Wvhn_lVPiw0.m4a
An Introduction to the Dot Product-X5DifJW0zek.m4a
Now let's say I have a list, of only the keys, which are at the end of the file names:
['fodZTqRhC24', 'Wvhn_lVPiw0', 'X5DifJW0zek']
How can I iterate through my list to go into that directory and search for a file name containing that key, and then return me the filename?
Any help is greatly appreciated!
I thought about it, and I think I was making it harder than I had to with regex. Sorry about not trying it first. I have done it this way:
audio = ['Scalar Product or Dot Product (Hindi)-fodZTqRhC24.m4a',
         'An Introduction to the Dot Product-X5DifJW0zek.m4a',
         'AP Physics C - Dot Product-Wvhn_lVPiw0.m4a']
keys = ['fodZTqRhC24', 'Wvhn_lVPiw0', 'X5DifJW0zek']
file_names = []
for Id in keys:
    for name in audio:
        if Id in name:
            file_names.append(name)
combined = zip(keys, file_names)
combined
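One small note on the last line: in Python 3, zip() returns a lazy iterator, so the bare combined won't display the pairs; materializing it, for example as a dict, also makes lookups easy (using the lists above):

combined = dict(zip(keys, file_names))
print(combined['fodZTqRhC24'])
# Scalar Product or Dot Product (Hindi)-fodZTqRhC24.m4a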
Here is an example:
ls: list of files in a given directory
names: list of strings to search for
import os

ls = os.listdir("/any/folder")
names = ['Py', 'sql']
for file in ls:
    for name in names:
        if name in file:
            print(file)
Results:
.PyCharm50
.mysql_history
zabbix2.sql
.mysql
PycharmProjects
zabbix.sql
Assuming you know which directory that you will be looking in, you could try something like this:
import os

to_find = ['word 1', 'word 2']           # list containing words that you are searching for
all_files = os.listdir('/path/to/file')  # creates list with files from given directory
for file in all_files:                   # loops through all files in directory
    for word in to_find:                 # loops through target words
        if word in file:
            print(file)                  # prints file name if the target word is found
I tested this in my directory which contained these files:
Helper_File.py
forms.py
runserver.py
static
app.py
templates
... and I set to_find to ['runserver', 'static']...
and when I ran this code it returned:
runserver.py
static
For future reference, you should make at least some sort of attempt at solving a problem prior to posting a question on Stack Overflow. It's not common for people to assist you like this if you can't provide proof of an attempt.
Here's a way to do it that lets you choose whether to match based on where the text appears in the file name (start, middle, or end).
import os

def scan(dir, match_key, bias=2):
    '''
    bias: 0 = startswith, 1 = contains, 2 = endswith
    '''
    matches = []
    if not isinstance(match_key, (tuple, list)):
        match_key = [match_key]
    if os.path.exists(dir):
        for file in os.listdir(dir):
            for match in match_key:
                if (file.startswith(match) and bias == 0) or \
                   (file.endswith(match) and bias == 2) or \
                   (match in file and bias == 1):
                    matches.append(file)
                    break  # stop after the first matching key so a file is only added once
    return matches

print(scan(os.curdir, '.py'))
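For illustration, here is a hypothetical usage of scan() with the three bias values; the file names are assumed (borrowed from the directory listing in the previous answer), and the order of the results depends on os.listdir():

# assuming the current directory contains runserver.py, forms.py and static
print(scan(os.curdir, '.py', bias=2))                # endswith   -> ['runserver.py', 'forms.py']
print(scan(os.curdir, 'run', bias=0))                # startswith -> ['runserver.py']
print(scan(os.curdir, ['static', 'forms'], bias=1))  # contains   -> ['static', 'forms.py']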

Python Re-ordering the lines in a dat file by string

Sorry if this is a repeat but I can't find it for now.
Basically I am opening and reading a dat file which contains a load of paths that I need to loop through to get certain information.
Each of the lines in the base.dat file contains m.somenumber. For example some lines in the file might be:
Volumes/hard_disc/u14_cut//u14m12.40_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m12.50_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m11.40_all.beta/beta8
I need to be able to re-write the dat file so that all the lines are re-ordered from the largest m-number to the smallest m-number. Then, when I loop through PATH in database (shown in the code), I am looping in order of decreasing m.
Here is the relevant part of the code
import os
import re
import numpy as np

base = open('base8.dat', 'r')
database = base.read().splitlines()
base.close()

counter = 0
mu_list = np.array([])
delta_list = np.array([])
offset = 0.00136
beta = 0
for PATH in database:
    if os.path.exists(str(PATH) + '/CHI/optimal_spectral_function_CHI.dat'):
        n1_array = np.loadtxt(str(PATH) + '/AVERAGES/av-err.n.dat')
        n7_array = np.loadtxt(str(PATH) + '/AVERAGES/av-err.npx.dat')
        n1_mean = n1_array[0]
        delta = round(float(5.0 + offset - (n1_array[0] * 2. + 4. * n7_array[0])), 6)
        par = open(str(PATH) + "/params10", "r")
        for line in par:
            counter = counter + 1
            if re.match("mu", line):
                mioMU = re.findall(r'\d+', line.replace(';', ''))
                mioMU2 = line.split()[2][:-1]
                mu = mioMU2
                print(mu, delta, PATH)
                mu_list = np.append(mu_list, mu)
                delta_list = np.append(delta_list, delta)
                optimal_counter = 0
print(delta_list, mu_list)
I have checked the possible flagged duplicate but I can't seem to get it to work for mine because my file doesn't technically contain strings and numbers. The 'number' I need to sort by is contained in the string as a whole:
Volumes/data_disc/u14_cut/from_met/u14m11.40_all.beta/beta16
and I need to sort the entire line by just the m(somenumber) part
Assuming that the number part of your line has the form of a float you can use a regular expression to match that part and convert it from string to float.
After that you can use this information to sort all the lines read from your file. I added an invalid line in order to show how invalid data is handled.
As a quick example I would suggest something like this:
import re

# TODO: Read file and get list of lines
l = ['Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8']

regex = r'^.+\*{2}m{1}(?P<criterion>[0-9\.]*)\*{2}.+$'
p = re.compile(regex)

criterion_list = []
for s in l:
    m = p.match(s)
    if m:
        crit = m.group('criterion')
        try:
            crit = float(crit)
        except Exception as e:
            crit = 0
    else:
        crit = 0
    criterion_list.append(crit)

tuples_list = list(zip(criterion_list, l))
output = [element[1] for element in sorted(tuples_list, key=lambda t: t[0])]
print(output)
# TODO: Write output to new file or overwrite existing one.
Giving:
['Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8']
This snippet starts after all lines have been read from the file and stored in a list (called l here). The regex group criterion captures the float part contained in **m12.50**, as you can see on regex101. So iterating through all the lines gives you a new list containing all matching groups as floats. If the regex does not match a given string, or casting the group to a float fails, crit is set to zero so that those invalid lines end up at the very beginning of the sorted list.
After that, zip() is used to get a list of tuples containing the extracted floats and the corresponding strings. Now you can sort this list of tuples by each tuple's first element and write the corresponding strings to a new list, output.
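The same ordering can also be expressed more compactly with sorted() and a key function, which avoids building the list of tuples by hand; this is a sketch reusing the list l and the regex idea from the snippet above (pass reverse=True for the largest-to-smallest order the question asks for):

import re

def m_number(line):
    # pull the float out of the "**m12.40**" part; unmatched lines get 0 so they sort first
    m = re.search(r'\*\*m([0-9.]+)\*\*', line)
    return float(m.group(1)) if m else 0.0

output = sorted(l, key=m_number)
print(output)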

How to read a text file by line in python

I want to randomly retrieve and print a whole line from a text file.
The text file is basically a list and so each item on that list needs to be searched.
import random
a= random.random
prefix = ["CYBER-", "up-", "down-", "joy-"]
suprafix = ["with", "in", "by", "who", "thus", "what"]
suffix = ["boy", "girl", "bread", "hippy", "box", "christ"]
print (random.choice(prefix), random.choice(suprafix), random.choice(prefix), random.choice(suffix))
This is my code for when I just manually input the words into lists, but I can't seem to find how to read the text file line by line into a list and use that instead.
I'm not sure I completely understand what you're asking, but I'll try to help.
If you're trying to choose a random line from a file, you can use open(), then readlines(), then random.choice():
import random
line = random.choice(open("file").readlines())
If you're trying to choose a random element from each of three lists, you can use random.choice():
import random
choices=[random.choice(i) for i in lists]
lists is a list of lists to choose from here.
Use Python's file readlines() method:
with open("file_name.txt") as f:
    prefix = f.readlines()
Now you should be able to iterate through the list prefix.
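For example, to print what was read (note that readlines() keeps the trailing newline on each line, which you can remove with strip()):

with open("file_name.txt") as f:
    prefix = f.readlines()

for line in prefix:
    print(line.strip())  # strip() removes the trailing newline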
Those answers have helped me grab things from a list in a text file, as you can see in my code below. But I have three lists stored as text files and I am trying to generate a four-word message randomly, choosing from the 'prefix' and 'suprafix' lists for the first three words and the 'suffix' file for the fourth word. However, I want to prevent it, when printing, from picking a word that random.choice has already chosen.
import random
a= random.random
prefix = open('prefix.txt','r').readlines()
suprafix = open('suprafix.txt','r').readlines()
suffix = open('suffix.txt','r').readlines()
print (random.choice(prefix + suprafix), random.choice(prefix + suprafix), random.choice(prefix + suprafix), random.choice(suffix))
As you can see, it chooses randomly from those two lists for the first three words.
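One way to keep the three picks distinct is random.sample(), which draws without replacement; a sketch assuming one word per line in each of the three files:

import random

# read the word lists; split() also drops the trailing newlines
prefix = open('prefix.txt').read().split()
suprafix = open('suprafix.txt').read().split()
suffix = open('suffix.txt').read().split()

# three distinct words from the combined prefix/suprafix pool, then one suffix word
first_three = random.sample(prefix + suprafix, 3)
print(*first_three, random.choice(suffix))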

Using a dictionary as regex in Python

I had a Python question I was hoping for some help on.
Let's start with the important part, here is my current code:
import re             # for regex
import numpy as np    # for matrix

f1 = open('file-to-analyze.txt', 'r')  # file to analyze

# convert files of words into arrays.
# These words are used to be matched against in the "file-to-analyze"
math = open('sample_math.txt', 'r')
matharray = list(math.read().split())
math.close()

logic = open('sample_logic.txt', 'r')
logicarray = list(logic.read().split())
logic.close()

priv = open('sample_priv.txt', 'r')
privarray = list(priv.read().split())
priv.close()

# ... Read in 5 more files and make associated arrays

# convert arrays into dictionaries
math_dict = dict()
math_dict.update(dict.fromkeys(matharray, 0))

logic_dict = dict()
logic_dict.update(dict.fromkeys(logicarray, 1))

# ... Make more dictionaries from the arrays (8 total dictionaries - the same number as there are arrays)

# create big dictionary of all keys
word_set = dict(math_dict.items() + logic_dict.items() + priv_dict.items() ... )

statelist = list()
for line in f1:
    for word in word_set:
        for m in re.finditer(word, line):
            print word.value()
The goal of the program is to take a large text file and perform analysis on it. Essentially, I want the program to loop through the text file and match words found in Python dictionaries and associate them with a category and keep track of it in a list.
So for example, let's say I was parsing through the file and I ran across the word "ADD". ADD is listed under the "math" or '0' category of words. The program should then add it to a list that it ran across a 0 category and then continue to parse the file. Essentially generating a large list that looks like [0,4,6,7,4,3,4,1,2,7,1,2,2,2,4...] with each of the numbers corresponding to a particular state or category of words as illustrated above. For the sake of understanding, we'll call this large list 'statelist'
As you can tell from my code, so far I can take as input the file to analyze, take and store the text files that contain the list of words into arrays and from there into dictionaries with their correct corresponding list value (a numerical value from 1 - 7). However, I'm having trouble with the analysis portion.
As you can tell from my code, I'm trying to go line by line through the text file and regex any of the found words with the dictionaries. This is done through a loop and regexing with an additional, 9th dictionary that is more or less a "super" dictionary to help simplify the parsing.
However, I'm having trouble matching all the words in the file and, when I find a word, matching it to the dictionary value rather than the key. That is, when it runs across "ADD", it should add 0 to the list because ADD is part of the 0 or "math" category.
Would someone be able to help me figure out how to write this script? I really appreciate it! Sorry for the long post, but the code requires a lot of explanation so you know what's going on. Thank you so much in advance for your help!
The simplest change to your existing code would just be to just keep track of both the word and the category in the loop:
for line in f1:
    for word, category in word_set.iteritems():
        for m in re.finditer(word, line):
            print word, category
            statelist.append(category)
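If the keywords should only match as whole words (so that "ADD" does not also fire inside "ADDRESS"), a possible refinement under that assumption is to pre-compile one escaped, word-bounded pattern per entry, reusing word_set and f1 from the question:

import re

# one compiled pattern per keyword, paired with its category
compiled = [(re.compile(r'\b' + re.escape(word) + r'\b'), category)
            for word, category in word_set.items()]

statelist = []
for line in f1:
    for pattern, category in compiled:
        for m in pattern.finditer(line):
            statelist.append(category)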

Scan through txt, append certain data to an empty list in Python

I have a text file that I am reading in Python. I'm trying to extract certain elements that follow keywords in the text file and append them to empty lists. The file looks like this:
So I want to make two empty lists:
The 1st list will hold the sequence names.
The 2nd list will be a list of lists in the format [Bacteria, Phylum, Class, Order, Family, Genus, Species].
Most of the organisms will be Uncultured bacterium. I am trying to add the Uncultured bacterium with the following IDs, which are separated by ;
Is there any way to scan for a certain word and, when the word is found, take the word that comes after it (separated by a '\t')?
I need it to create a dictionary that translates the Sequence Name to the taxonomic data.
I know I will need an empty list to append the names to:
seq_names=[ ]
a second list to put the taxonomy lists into
taxonomy=[ ]
and a 3rd list that will be reset after every iteration
temp = [ ]
I'm sure it can be done in Biopython but I'm working on my Python skills.
Yes there is a way.
You can split the string you get from reading the file into an array using the built-in function split. From this you can find the index of the word you are looking for and then use that index plus one to get the word after it. For example, using a text file called test.txt that looks like so (the formatting is a bit weird because SO doesn't seem to like hard tabs):
one two three four five six seven eight nine
The following code
f = open('test.txt','r')
string = f.read()
words = string.split('\t')
ind = words.index('seven')
desired = words[ind+1]
will return desired as 'eight'
Edit: To return every following word in the list
f = open('test.txt','r')
string = f.read()
words = string.split('\t')
desired = [words[ind+1] for ind, word in enumerate(words) if word == "seven"]
This is using list comprehensions. It enumerates the list of words and if the word is what you are looking for includes the word at the next index in the list.
Edit2: To split it on both new lines and tabs you can use regular expressions
import re
f = open('testtest.txt','r')
string = f.read()
words = re.split('\t|\n',string)
desired = [words[ind+1] for ind, word in enumerate(words) if word == "seven"]
It sounds like you might want a dictionary indexed by sequence name. For instance,
my_data = {
    'some_sequence': [Bacteria, Phylum, Class, Order, Family, Genus, Species],
    'some_other_sequence': [Bacteria, Phylum, Class, Order, Family, Genus, Species]
}
Then, you'd just access my_data['some_sequence'] to pull up the data about that sequence.
To populate your data structure, I would just loop over the lines of the files, .split('\t') to break them into "columns" and then do something like my_data[the_row[0]] = [the_row[10], the_row[11], the_row[13]...] to load the row into the dictionary.
So,
for row in inp_file.readlines():
    row = row.split('\t')
    my_data[row[0]] = [row[10], row[11], row[13], ...]
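Putting that together with the two lists the question asks for, here is a sketch; the file name taxonomy.txt and the column layout (sequence name in column 0, the seven taxonomy fields immediately after it) are assumptions for illustration:

my_data = {}
with open('taxonomy.txt') as inp_file:      # hypothetical file name
    for row in inp_file:
        cols = row.rstrip('\n').split('\t')
        my_data[cols[0]] = cols[1:8]        # assumed column layout

seq_names = list(my_data)                   # 1st list: the sequence names
taxonomy = list(my_data.values())           # 2nd list: a list of taxonomy lists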
