Automatically read files with similar names - python

Sorry, I am a newbie in Python.
How can I automatically read different files with similar names (only the number changes, in increments) in the same folder? For example, I need to make an array out of the file cons.1.00, and I have other files that I want to make other arrays out of, such as cons.1.01, cons.1.02, etc. Are there functions or loops that would accomplish this task?
Here is my attempt, but since I will have to do this for many files, I was wondering if there is a better way to do it.
c0 = np.loadtxt(r'C:\Users\holde\OneDrive\Desktop\Python\cons.1.00.dat')
c1 = np.loadtxt(r'C:\Users\holde\OneDrive\Desktop\Python\cons.1.01.dat')
c2 = np.loadtxt(r'C:\Users\holde\OneDrive\Desktop\Python\cons.1.02.dat')

Use the glob module to get a list of files matching a specific pattern:
import glob
import numpy as np
list_c = []
# Raw string: in a normal string literal, "\U" starts a Unicode escape and is a syntax error
for x in glob.glob(r"C:\Users\holde\OneDrive\Desktop\Python\cons.1.*.dat"):
    c = np.loadtxt(x)
    list_c.append(c)
Or using list comprehension:
[np.loadtxt(x) for x in glob.glob(r"C:\Users\holde\OneDrive\Desktop\Python\cons.1.*.dat")]
Reference: https://docs.python.org/3/library/glob.html
Note that this also works on Windows, even though glob is often associated with Unix:
The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order. No tilde expansion is done, but *, ?, and character ranges expressed with [] will be correctly matched. This is done by using the os.scandir() and fnmatch.fnmatch() functions in concert, and not by actually invoking a subshell. Note that unlike fnmatch.fnmatch(), glob treats filenames beginning with a dot (.) as special cases. (For tilde and shell variable expansion, use os.path.expanduser() and os.path.expandvars().)
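Because the documentation quoted above says results come back in arbitrary order, it may help to sort them so that the first array corresponds to cons.1.00, the second to cons.1.01, and so on. The following is a minimal sketch using only the standard library. The helper name find_numbered_files and the throwaway demo folder are assumptions made for illustration, not part of the original answer; each returned path could then be passed to np.loadtxt.

```python
import glob
import os
import tempfile


def find_numbered_files(folder, pattern="cons.1.*.dat"):
    """Return the matching paths sorted by name (hypothetical helper)."""
    # Plain string sorting is enough here because the numbers are zero-padded.
    return sorted(glob.glob(os.path.join(folder, pattern)))


# Demo in a temporary folder holding empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("cons.1.02.dat", "cons.1.00.dat", "cons.1.01.dat", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    found = [os.path.basename(p) for p in find_numbered_files(folder)]

print(found)
```

If the numbers were not zero-padded (for example cons.1.2 and cons.1.10), sorting by name would no longer give numeric order, and a numeric sort key would be needed instead.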

Any time you're doing the same thing over and over and only one thing is changing, you want to use a loop:
c = [
    # rf'...' is a raw f-string: backslashes stay literal and {i:02d} is filled in
    np.loadtxt(rf'C:\Users\holde\OneDrive\Desktop\Python\cons.1.{i:02d}.dat')
    for i in range(3)
]
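A raw string literal cannot end in a single backslash, so gluing a folder string to a file name is easy to get wrong. As one possible alternative, here is a small sketch that builds the same paths with pathlib. PureWindowsPath is used only so the example behaves the same on any operating system; on the asker's machine a plain Path would presumably do.

```python
from pathlib import PureWindowsPath

# Folder taken from the question; the "/" operator joins path components.
folder = PureWindowsPath(r"C:\Users\holde\OneDrive\Desktop\Python")

# {i:02d} formats the counter as a zero-padded two-digit number: 00, 01, 02.
paths = [folder / f"cons.1.{i:02d}.dat" for i in range(3)]

for p in paths:
    print(p)
```

Each element of paths can be converted with str(p) and handed to np.loadtxt in the same list comprehension.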

Related

How can I easily create a for loop that only continues if the characters are parenthesis followed by integers in Python?

As you can already see, I'm fairly new to coding, specifically Python, and I was wondering how I could continue a loop over a string only if the current character is a parenthesis followed by integers.
I'm attempting to write a duplicate file finder, and when it comes to file names, I could have the same file copied and named slightly differently, such as iryan(1).mp4 and iryan(2).mp4, so using splitext[0] isn't enough for it to be detected as a duplicate. For now, I'm using os.stat().st_size to at least ensure they're the same size before diving into the filename comparison, but I'm struggling with ideas on how to continue a for loop only if the iteration is on an opening parenthesis that is followed by an integer.
Is "regex" something I should deep-dive into to solve this problem?
Assuming you have a list of file names (it could also be any other iterable):
file_names = ['iryan.mp4', 'iryan(1).mp4', 'iryan(2).mp4']
You can do the following to find all the duplicate names:
import re
# This regex only matches names that end with
# a number in parentheses followed by `.mp4`
dup_regex = re.compile(r'\(\d+\)\.mp4$')
for duplicate in filter(dup_regex.search, file_names):
    print(duplicate)
Output
iryan(1).mp4
iryan(2).mp4
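For a duplicate finder it may be more useful to know which original each copy belongs to. One way to sketch that is to strip the "(number)" part and group names by what remains. The names COPY_SUFFIX and canonical_name, and the extra file other.mp4, are made up for this illustration.

```python
import re
from collections import defaultdict

# "(digits)" directly before the final extension, e.g. the "(1)" in "iryan(1).mp4".
COPY_SUFFIX = re.compile(r'\(\d+\)(?=\.[^.]+$)')


def canonical_name(file_name):
    """Remove a trailing "(n)" copy marker so copies share one key."""
    return COPY_SUFFIX.sub('', file_name)


file_names = ['iryan.mp4', 'iryan(1).mp4', 'iryan(2).mp4', 'other.mp4']

groups = defaultdict(list)
for name in file_names:
    groups[canonical_name(name)].append(name)

# Keep only the base names that have more than one candidate file.
duplicates = {base: names for base, names in groups.items() if len(names) > 1}
print(duplicates)
```

A matching name is only a hint; comparing sizes (as the question already does) or file contents would still be needed to confirm a real duplicate.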
You can use something like this:
import re
pattern = re.compile(r'\(\d+\)')
name = 'iryan(2).mp4'
if pattern.search(name):
    # the name contains "(number)"; put your loop here
    ...
Using a regular expression, your code might look something like this:
import re
pattern = re.compile(r'\(\d+\)')
for f in files:
    # search() scans the whole name; match() would only test the start
    if pattern.search(f):
        # duplicate file found...
        # do something
        ...
Another alternative is to use the filter function. That code might look something like this:
pattern = re.compile(r'\(\d+\)')
filtered_files = filter(lambda f: pattern.search(f) is None, files)
for f in filtered_files:
    # load file
    # do something
    ...
The interesting thing here is that the filter function will return an iterator, which you can access easily in the for loop later. This avoids using the if statement in the for-loop, which could make your code look a little nicer.
This method will return to you all of the files without the parenthesis+number combination in them, giving you, ideally, all of the original non-duplicate files, and excluding all of the duplicates. If you want just the duplicates (files with names that contain the parenthesis+number combination), you can change the filter command to this:
duplicate_files = filter(pattern.search, files)
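One detail worth checking when filtering like this is the difference between a compiled pattern's match and search methods: match only tests at the start of the string, while search looks anywhere in it. Since the "(number)" part sits in the middle of these file names, the choice matters. A short sketch with the file names from the question:

```python
import re

pattern = re.compile(r'\(\d+\)')
files = ['iryan.mp4', 'iryan(1).mp4', 'iryan(2).mp4']

# match() is anchored at position 0, and no name starts with "(".
anchored = [f for f in files if pattern.match(f)]

# search() scans the whole name, so it finds the copies.
anywhere = [f for f in files if pattern.search(f)]

# Names without the "(number)" part are the presumed originals.
originals = [f for f in files if not pattern.search(f)]

print(anchored, anywhere, originals)
```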

glob syntax not working as expected ([ ] *)

I have a folder containing 4 files.
Keras_entity_20210223-2138.h5
intent_tokens.pickle
word_entity_set_20210223-2138.pickle
LSTM_history.h5
I used code:
NER_MODEL_FILEPATH = glob.glob("model/[Keras_entity]*.h5")[0]
It works correctly: the list returned by glob contains only the path of the Keras_entity file, not the other .h5 file.
But when I use this code:
WORD_ENTITY_SET_FILEPATH = glob.glob("model/[word_entity_set]*.pickle")[0]
It's not working as expected: rather than picking up only the word_entity_set file, the list contains both pickle files.
Why would this happen?
Simply remove the square brackets: word_entity_set*.pickle
Per the docs:
[seq] matches any character in seq
So word_entity_set_20210223-2138.pickle is matched because it starts with a w, and intent_tokens.pickle is matched because it starts with an i.
To be clear, it is working as expected. Your expectations were incorrect.
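The behaviour can be reproduced without touching the file system, since glob applies the same wildcard rules as the fnmatch module. The sketch below uses fnmatchcase (the case-sensitive variant, chosen here so the result does not depend on the operating system) on the four file names from the question.

```python
from fnmatch import fnmatchcase

files = [
    "Keras_entity_20210223-2138.h5",
    "intent_tokens.pickle",
    "word_entity_set_20210223-2138.pickle",
    "LSTM_history.h5",
]

# [word_entity_set] is a character set: it matches ONE character out of
# w, o, r, d, _, e, n, t, i, y, s. It does not match the whole word.
bracketed = [f for f in files if fnmatchcase(f, "[word_entity_set]*.pickle")]

# Without brackets the text is matched literally.
plain = [f for f in files if fnmatchcase(f, "word_entity_set*.pickle")]

print(bracketed)
print(plain)
```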
Your code selects intent_tokens.pickle and word_entity_set_20210223-2138.pickle because your glob is incorrect. Change the glob to "word_entity_set*.pickle".
When you use [<phrase>]*.pickle, you're telling the globber to match exactly one of the characters in <phrase>, then any characters, then ".pickle". So with <phrase> set to word_entity_set, "wordwordword.pickle" will match, and so will:
wwww.pickle
w.pickle
But
.pickle
xw.pickle
foobar.pickle
will not, because none of them starts with a character from <phrase>.
There are truly infinite permutations.

Regular expression, glob, Python

I have a folder that contains many files.
One group of files is named pc_0.txt, pc_1.txt, ..., pc_699.txt.
I want to select all files from pc_200.txt to pc_699.txt.
How?
for filename in glob.glob("pc*.txt"):
    global_list.append(filename)
For this specific case, glob already supports what you need (see fnmatch docs for glob wildcards). You can just do:
for filename in glob.glob("pc_[23456]??.txt"):
    global_list.append(filename)
If you need to be extra specific that the two trailing characters are numbers (some files might have non-numeric characters there), you can replace the ?s with [0123456789], but otherwise, I find the ? a little less distracting.
In a more complicated scenario, you might be forced to resort to regular expressions, and you could do so here with:
import os
import re
for filename in filter(re.compile(r'^pc_[2-6]\d\d\.txt$').match, os.listdir('.')):
    global_list.append(filename)
but given that glob-style wildcards work well enough, you don't need to break out the big guns just yet.
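If the bounds were less convenient than 200 to 699 (say 150 to 432), character classes would get awkward. In that case one option is to extract the number and compare it as an integer. This is only a sketch; the names NAME and in_range and the generated file list are assumptions for the demo.

```python
import re

# Capture the digits between "pc_" and ".txt".
NAME = re.compile(r'^pc_(\d+)\.txt$')


def in_range(file_name, low=200, high=699):
    """True if file_name looks like pc_<n>.txt with low <= n <= high."""
    m = NAME.match(file_name)
    return m is not None and low <= int(m.group(1)) <= high


# Stand-in for os.listdir('.'): pc_0.txt ... pc_699.txt plus one stray file.
names = [f"pc_{i}.txt" for i in range(700)] + ["pc_notes.txt"]
selected = [n for n in names if in_range(n)]

print(len(selected), selected[0], selected[-1])
```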

How to make an absolute path relative to another?

Suppose I have two file paths as strings in Python, as an example, let's say they are these two:
C:/Users/testUser/Program/main.py
C:/Users/testUser/Program/data/somefile.txt
Is there a way, using the os module, to generate a relative path based off of the first one? For example, feeding the two above to produce:
data/somefile.txt
I realize this is possible with string manipulation, by splitting off the files at the ends and cutting the first string out of the second, but is there a more robust way, probably using the python os module?
Thanks to MPlanchard in the comment below, here is the full answer:
import os
string1 = "C:/Users/testUser/Program/main.py"
string2 = "C:/Users/testUser/Program/data/somefile.txt"
os.path.relpath(string2, os.path.dirname(string1))
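For comparison, roughly the same result can be obtained with pathlib. The sketch below uses PureWindowsPath so that the drive letter is understood on any operating system, and as_posix() to print forward slashes as in the question. Note one difference: relative_to raises ValueError when the target is not located under the base folder, whereas os.path.relpath would produce a path containing "..".

```python
from pathlib import PureWindowsPath

string1 = "C:/Users/testUser/Program/main.py"
string2 = "C:/Users/testUser/Program/data/somefile.txt"

# The base is the folder containing main.py, not main.py itself.
base = PureWindowsPath(string1).parent
relative = PureWindowsPath(string2).relative_to(base).as_posix()

print(relative)
```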

split twice in the same expression?

Imagine I have the following:
inFile = "/adda/adas/sdas/hello.txt"
# this instruction gives me hello.txt
Name = inFile.split("/")[-1]
# this one gives me the name I want - just hello
Name1 = Name.split(".")[0]
Is there any way to simplify this and do the same job in just one expression?
You can get what you want platform independently by using os.path.basename to get the last part of a path and then use os.path.splitext to get the filename without extension.
from os.path import basename, splitext
pathname = "/adda/adas/sdas/hello.txt"
name, extension = splitext(basename(pathname))
print(name)  # --> "hello"
Using os.path.basename and os.path.splitext instead of str.split or re.split is more proper (and therefore received more points than any other answer) because it does not break down on other platforms that use different path separators (you would be surprised how varied this can be).
It also carries most points because it answers your question for "one line" precisely and is aesthetically more pleasing than your example (even though that is debatable, as are all questions of taste).
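On Python 3 the same idea is also available through pathlib, where the stem attribute gives the final path component without its last suffix. A brief sketch follows; PurePosixPath is used so the example behaves identically everywhere, and the backup.tar.gz path is an invented example to show how multiple suffixes are treated.

```python
from pathlib import PurePosixPath

pathname = "/adda/adas/sdas/hello.txt"

# stem drops the directory part and the last extension in one step.
name = PurePosixPath(pathname).stem

# Only the final suffix is removed, so ".tar" stays in the stem here,
# whereas split(".")[0] would have returned just "backup".
archive_stem = PurePosixPath("/adda/backup.tar.gz").stem

print(name, archive_stem)
```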
Answering the question in the topic rather than trying to analyze the example...
You really want to use Florian's solution if you want to split paths, but if you promise not to use this for path parsing...
You can use re.split() to split on several separators by OR-ing them with a '|'. Have a look at this:
import re
inFile = "/adda/adas/sdas/hello.txt"
print(re.split(r'\.|/', inFile)[-2])
>>> inFile = "/adda/adas/sdas/hello.txt"
>>> inFile.split('/')[-1]
'hello.txt'
>>> inFile.split('/')[-1].split('.')[0]
'hello'
If it is always going to be a path like the above, you can use os.path.split and os.path.splitext.
The following example will print just hello:
from os.path import split, splitext
path = "/adda/adas/sdas/hello.txt"
print(splitext(split(path)[1])[0])
For more info see https://docs.python.org/library/os.path.html
I'm pretty sure some Regex-Ninja* would give you a more or less sane way to do that (or, as I now see others have posted, ways to write two expressions on one line...)
But I'm wondering why you want to split it with just one expression?
For such a simple split, it's probably faster to do two than to create some advanced either-or logic. If you split twice it's safer too:
I guess you want to separate the path, the file name and the file extension. If you split on '/' first, you know the filename should be in the last array index; then you can try to split just the last index to see if you can find the file extension or not. Then you don't need to care whether there are dots in the path names.
*(Any sane users of regular expressions should not be offended. ;)
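The point about dots in path names can be illustrated with a small comparison. In this sketch the two helper functions and the dotted example path are made up for the demonstration; posixpath is used so the "/" separator is handled the same way on every platform.

```python
import posixpath
import re


def name_by_regex(path):
    """Single expression: split on '.' and '/' at once, take the second-to-last piece."""
    return re.split(r'\.|/', path)[-2]


def name_by_two_steps(path):
    """Two steps: isolate the file name first, then strip its extension."""
    return posixpath.splitext(posixpath.basename(path))[0]


plain = "/adda/adas/sdas/hello.txt"
dotted = "/adda/v1.2/hello"  # a dot in a folder name, no file extension

print(name_by_regex(plain), name_by_two_steps(plain))
print(name_by_regex(dotted), name_by_two_steps(dotted))
```

Both approaches agree on the first path, but on the second one the single-expression version returns a fragment of the folder name, while the two-step version still returns the file name.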
