Python use count in list - python

I want to make a list of all of my saved wifi files, with a number in front of each file, but my output is not what I want from my code.
import os
path = '/etc/NetworkManager/system-connections/'
dirs = os.listdir(path)
count = sum([len(files) for r, d, files in os.walk(path)])
for file in dirs:
    for item in range(count):
        print(item, file)
Expected output:
1 wifi-test
2 androidAP
3 androidAPtest
Actual output:
0 wifi-test
1 androidAP
2 androidAPtest
0 wifi-test
1 androidAP
2 androidAPtest
and then it starts over

How a Loop inside a Loop works
I think you have a misunderstanding about what happens when you put a loop inside a loop, so let me explain that first.
If you have, for example
for item_a in ['a', 'b']:
    for item_b in ['1', '2']:
        print(item_a + item_b)
then your output would be:
a1
a2
b1
b2
The code starts with a in the outer loop, then goes over both items in the inner loop. Once that finishes, the next item in the outer loop is b, and the inner loop runs over both items again.
If you want to keep track of how many items you've gone over in your loop, you could do so with this type of pattern:
count = 0
for item_a in ['a', 'b']:
    count = count + 1
    print(str(count) + item_a)
This results in
1a
2b
But there is a shortcut. You can use a nifty function called enumerate to get the count of each item in the for loop. Note that enumerate counts from 0 by default, so pass start=1 to match the numbering above.
for count, item_a in enumerate(['a', 'b'], start=1):
    print(str(count) + item_a)
Which will also give you
1a
2b
Solution to your problem
With all this said, you can create your list of files like so
# First we loop over os.walk to get the files in the current directory and all sub-directories
for root, dirs, files in os.walk(path):
    # Then enumerate consolidates those two for loops into one loop that counts properly
    # (start=1 gives the 1-based numbering from your expected output)
    for item, file in enumerate(files, start=1):
        print(item, os.path.join(root, file))
And if you don't care about sub-directories, you can just do
for item, file in enumerate(os.listdir(path), start=1):
    print(item, file)

It's not quite clear what you want with your code. What's that count for?
Maybe this is what you want:
import os
path = '/etc/NetworkManager/system-connections/'
dirs = os.listdir(path)
for num, file in enumerate(dirs):
    print(num + 1, file)

I'm not sure what count is supposed to do here, but if you want the files in the directory (not subdirectories) you just need os.listdir.
import os
path = '/etc/NetworkManager/system-connections/'
dirs = os.listdir(path)
for i in range(len(dirs)):
    print(i + 1, dirs[i])

This is exactly what a nested loop will do.
Check the outputs of the loops independently:
It says you have 2 files in the directory.
So this is the first thing you want to look at: if that count is not what you expect, the loop is counting something else inside dirs. What is it counting? Debug it by printing the output of each loop separately.
Also, for the next problem (numbering from 0), you can solve it by hardcoding an innocent +1:
print(item + 1, file)

Related

Python: Search for Substring Inside List Using Unique Values

I have two lists, both containing file paths to PDFs. The first list contains PDFs that have unique file names. The second list contains the file names with the same unique file names that need to be matched up to the first list, although it is possible that there could be multiple PDFs in the second list that could be matched to the first. It is a one to many relationship from ListA to ListB. Below is an example.
List A: C:\FolderA\A.pdf, C:\FolderA\B.pdf, C:\FolderA\C.pdf
List B: C:\FolderB\A_1.pdf, C:\FolderB\B_1.pdf, C:\FolderB\C_1.pdf, C:\FolderB\C_2.pdf
I need to find a way to iterate through both lists and combine the PDFs by matching the unique filename. If I can find a way to iterate and match the files, then I think I can combine the PDFs on my own. Below is the code I have so far.
folderA = "C:\\FolderA"
ListA = []
for root, dirs, filenames in os.walk(folderA):
    for filename in filenames:
        ListA.append(str(filename))
        filepath = os.path.join(root, filename)
        ListA.append(str(filepath))

folderB = "C:\\FolderB"
ListB = []
for root, dirs, filenames in os.walk(folderB):
    for filename in filenames:
        filepath = os.path.join(root, filename)
        ListB.append(str(filepath))
# Split ListB to file name only, without the "_#", so it can be matched to the PDFs in ListA.
for pdfValue in ListB:
    pdfsplit = pdfValue.split(".")[0]
    pdfsplit1 = pdfsplit.split("\\")[-1]
    pdfsplit2 = pdfsplit1.rsplit("_", 1)[0]
    for pdfValue2 in ListA:
        if pdfsplit2 in ListA:
            # combine PDF code
I have verified everything works up to the last if statement. From there on, I am not sure how to go about it. I know how to search for a substring within a string, but I cannot get it to work correctly with a list. No matter how I code it, I either end up in an endless loop or it does not successfully match.
Any ideas on how to make this work, if it is possible?
It would be better to gather all the information together in one data structure, rather than separate lists. That should allow you to reduce your code to a single function.
Completely untested, but something like this should work.
import os
from collections import defaultdict

pdfs = defaultdict(lambda: defaultdict(list))

def find_pdfs(pdfs, folder, split=False):
    for root, dirs, filenames in os.walk(folder):
        for filename in filenames:
            basename, ext = os.path.splitext(filename)
            if ext == '.pdf':
                if split:
                    basename = basename.partition('_')[0]
                pdfs[basename][root].append(filename)

find_pdfs(pdfs, folderA)
find_pdfs(pdfs, folderB, True)
This should produce a data structure like this:
pdfs = {
    'A': {'C:\\FolderA': ['A.pdf'],
          'C:\\FolderB': ['A_1.pdf']},
    'B': {'C:\\FolderA': ['B.pdf'],
          'C:\\FolderB': ['B_1.pdf']},
    'C': {'C:\\FolderA': ['C.pdf'],
          'C:\\FolderB': ['C_1.pdf', 'C_2.pdf']},
}
I think what you want to do is create a collections.defaultdict and set it up to hold lists of matching names.
import collections
matching_files = collections.defaultdict(list)
You can then strip the filenames in folder B down to base names, and put the paths into the dict:
matching_files[pdfsplit2].append(pdfValue)
Now you have the pdf files from folder B, grouped by base name. Go back to folder A and do the same thing (split off the path and extension, use that for the key, add the full path to the list). You'll end up with lists of files sharing a common base name.
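For instance, a minimal sketch of that folder A pass (an illustration, assuming ListA holds full paths as in the question):
import os

for pdfValue in ListA:
    # strip the directory and the extension to get the base-name key
    base = os.path.splitext(os.path.basename(pdfValue))[0]
    matching_files[base].append(pdfValue)
You can then print each group: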
for key, file_list in matching_files.items():  # use .iteritems() for py-2.x
    print("Files with base name '%s':" % key)
    print(' ', '\n '.join(file_list))
To compare the two file names, rather than splitting on the '_', you could try the str.startswith() method:
A.startswith(B) returns True if string A begins with string B.
In your case, your code would be:
match = {}  # the dictionary where you will store the matching names
for pdfValue in ListA:
    match[pdfValue] = []  # create an entry in the dictionary with the wanted keyword
    A = pdfValue.split("\\")[-1].split(".")[0]  # just the filename part, without the extension
    for pdfValue2 in ListB:
        B = pdfValue2.split("\\")[-1]
        if B.startswith(A):  # then B has the same unique filename as A
            match[pdfValue].append(pdfValue2)  # so you associate it with A in the dictionary
I hope it works for you
One more solution
lista = ['C:\\FolderA\\A.pdf', 'C:\\FolderA\\B.pdf', 'C:\\FolderA\\C.pdf']
listb = ['C:\\FolderB\\A_1.pdf', 'C:\\FolderB\\B_1.pdf', 'C:\\FolderB\\C_1.pdf', 'C:\\FolderB\\C_2.pdf']

# get the filenames for folder a and folder b
lista_filenames = [l.split('\\')[-1].split('.')[0] for l in lista]
listb_filenames = [l.split('\\')[-1].split('.')[0] for l in listb]

# create a dictionary to store lists of mappings
from collections import defaultdict
data_structure = defaultdict(list)
for i in lista_filenames:
    for j in listb_filenames:
        if i in j:
            data_structure['C:\\FolderA\\' + i + '.pdf'].append('C:\\FolderB\\' + j + '.pdf')

# this is how the mapping dictionary looks
print(data_structure)
results in:
defaultdict(<class 'list'>, {'C:\\FolderA\\C.pdf': ['C:\\FolderB\\C_1.pdf', 'C:\\FolderB\\C_2.pdf'], 'C:\\FolderA\\A.pdf': ['C:\\FolderB\\A_1.pdf'], 'C:\\FolderA\\B.pdf': ['C:\\FolderB\\B_1.pdf']})

How to loop reading files group by group?

Generally, I loop through files one by one in Python. Now I want to loop through them group by group. How do I read them efficiently?
Here's an example to explain my question.
Given files like these:
group1: m2000_01, m2000_02,..., m2000_12
group2: m2001_01, m2001_02,...., m2001_12
.....
group17: m2016_01, m2016_02,...., m2016_12
I want to read the files from the same year together for a calculation, and loop along the time series for batching. Pseudo-code as follows:
for year in [2000, 2001, ..., 2016]:
    A = open(m2000_01), B = open(m2000_02), C = open(m2000_03), ...  # reading files section
    mean2000 = (A + B + C + ...) / 12
    # calculation body: how do I set variables for each file, such as A=m2000_01, B=m2000_02, ...?
    # use a dict to set these files?
print mean2000, mean2001, ..., mean2016  # result I want
Maybe I could make a list, and then loop over its elements for matching (sieving) and extracting each group of files. But what if there are many groups of files and the group keywords (such as 2000 in the above example) are irregular? Is there any common method for solving similar problems? I think there is a proven method, but I don't know how to describe and search for it. Please forgive me if this problem is simple.
This will do
import os
path = "your\\path"
all_files = [x for x in os.listdir(path) if os.path.isfile(path + "\\" + x)]
for year in range(2000, 2017):
    for file_name in [y for y in all_files if str(year) in y]:
        sub_file_path = path + "\\" + file_name
        # read file, insert appropriate code yourself
You can find and group the files for processing using os.listdir(), along with the re regex module, and the itertools.groupby() function to do something along these lines:
from itertools import groupby
import os
import re
folder_path = 'data_folder'
pattern = r'm\d\d\d\d_\d\d'
filenames = [filename for filename in sorted(os.listdir(folder_path))
             if re.match(pattern, filename)]

for k, g in groupby(filenames, lambda filename: filename.split('_')[0]):
    year = int(k[1:])
    year_files = list(g)
    print('{}: {}'.format(year, year_files))
Sample output:
2000: ['m2000_01', 'm2000_02', 'm2000_03', 'm2000_04', 'm2000_05', 'm2000_06', 'm2000_07', 'm2000_08', 'm2000_09', 'm2000_10', 'm2000_11', 'm2000_12']
2001: ['m2001_01', 'm2001_02', 'm2001_03', 'm2001_04', 'm2001_05', 'm2001_06', 'm2001_07', 'm2001_08', 'm2001_09', 'm2001_10', 'm2001_11', 'm2001_12']
2002: ['m2002_01', 'm2002_02', 'm2002_03', 'm2002_04', 'm2002_05', 'm2002_06', 'm2002_07', 'm2002_08', 'm2002_09', 'm2002_10', 'm2002_11', 'm2002_12']
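If you then want the per-year means from the question's pseudo-code, you can extend that loop. The sketch below is an untested illustration that assumes each file contains one numeric value per line (the question doesn't say what the files hold):
import os

for k, g in groupby(filenames, lambda filename: filename.split('_')[0]):
    year = int(k[1:])
    values = []
    for name in g:
        # collect every number from every file in this year's group
        with open(os.path.join(folder_path, name)) as f:
            values.extend(float(line) for line in f if line.strip())
    print('{}: mean = {}'.format(year, sum(values) / len(values)))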

Count all files in all folders/subfolders with Python

What is the most efficient way to count all files in all folders and subfolders in Python? I want to use this on Linux systems.
Example output:
(Path files)
/ 2
/bin 100
/boot 20
/boot/efi/EFI/redhat 1
....
/root 34
....
Paths without a file should be ignored.
Thanks.
You can do it with os.walk():
import os
for root, dirs, files in os.walk('/some/path'):
    if files:
        print('{0} {1}'.format(root, len(files)))
Note that this will also include hidden files, i.e. those that begin with a dot (.).
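If you want to leave hidden files out of the count, you could filter them; a sketch, assuming "hidden" means a name starting with a dot:
import os

for root, dirs, files in os.walk('/some/path'):
    dirs[:] = [d for d in dirs if not d.startswith('.')]  # prune hidden directories in place
    visible = [f for f in files if not f.startswith('.')]
    if visible:
        print('{0} {1}'.format(root, len(visible)))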
import os
print([(item[0], len(item[2])) for item in os.walk('/path') if item[2]])
It returns a list of tuples of folders/subfolders and their file counts in /path.
OR
import os
for item in os.walk('/path'):
    if item[2]:
        print(item[0], len(item[2]))
It prints each folder/subfolder and its file count in /path.
If you want a faster solution, you could try combining os.scandir() (available since Python 3.5) with a recursive walk, and use itertools.count as a running counter:
from itertools import count

counter = count()
next(counter)  # returns 0 first, then 1, 2, 3, ...
if next(counter) > 1000:
    print('dir with file count over 1000')  # and use continue in the for loop
Maybe that will be faster, because I think the os.walk function does unnecessary things for you.
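For concreteness, here is a minimal, untested sketch of that idea, assuming Python 3.5+ (the function name count_files is made up for this example):
import os

def count_files(path):
    # count regular files in this directory, remembering subdirectories for recursion
    total = 0
    subdirs = []
    for entry in os.scandir(path):
        if entry.is_file(follow_symlinks=False):
            total += 1
        elif entry.is_dir(follow_symlinks=False):
            subdirs.append(entry.path)
    if total:
        print(path, total)
    for sub in subdirs:
        count_files(sub)

count_files('/some/path')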

How can I avoid getting duplicate paths from os.walk

I have the following directory structure
/mnt/type/split/v2/doc/RESOURCE_ID/YYYY/FY/DOCUMENT_ID
for example, one path might be
/mnt/type/split/v2/doc/100045/2008/FY/28
where
RESOURCE_ID = 100045
YYYY = 2008
DOCUMENT_ID = 28
Note, DOCUMENT_ID is the last directory in the path - there will be files in the DOCUMENT_ID directory
I was trying to take inventory of this structure using the following code
def survey():
    magic_paths = []
    for (resource_id, dirname, filename) in os.walk('/mnt/type/split/v2/doc'):
        if resource_id:
            for (magic_path, dirname2, filename2) in os.walk(resource_id):
                if len(magic_path.split(os.sep)) == 10:
                    magic_paths.append(magic_path + os.linesep)
    write_survey(magic_paths)
    x = len(magic_paths)
    return x
I am getting five copies of each path in my magic_paths list. I have 1,500,000 paths, so I am getting 7,500,000 items in my list.
The first 1,500,000 are the unique values. The next 6,000,000 consist of groups that are rooted on the RESOURCE_ID, repeated 4 times
/mnt/type/split/v2/doc/100045/2008/FY/28 #obs_1
/mnt/type/split/v2/doc/100045/2008/FY/29 #obs_2
/mnt/type/split/v2/doc/100045/2008/FY/30 #obs_3
/mnt/type/split/v2/doc/100045/2008/FY/31 #obs_4
/mnt/type/split/v2/doc/1028/2008/FY/28 #obs_5 # see the new RESOURCE_ID
.
. 1,499,995 more unique values
.
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of first repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of second repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of third repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of fourth repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/1028/2008/FY/28 #series of 4 repetitions based on RESOURCE ID 1028
There are various files in the directories and subs at each level, I just need to inventory the paths to the DOCUMENT_IDs.
I do not understand why the results are patterned as they are. I believed that I was starting at RESOURCE_ID and finding only the directories that were 9 deep since splitting on os.sep gives me a list with ten items.
'/mnt/type/split/v2/doc/100045/2008/FY/31'.split(os.sep) == ['', 'mnt', 'type', 'split', 'v2', 'doc', '100045', '2008', 'FY', '31']
In response to the questions in the comments
I believed that I was getting each RESOURCE_ID directory and then walking it, and that the other items returned from the first os.walk (dirnames and filenames) would be ignored.
I did not think os.listdir would work; I can make this work with glob, but I am worried about it eating my memory.
os.walk() will recursively walk a directory structure, but for each directory your outer loop encounters, you start another full recursive walk. So every DOCUMENT_ID directory is reached once from each of its ancestors and once from itself: by kicking off inner walks from /mnt/type/split/v2/doc, /mnt/type/split/v2/doc/100045, /mnt/type/split/v2/doc/100045/2008, /mnt/type/split/v2/doc/100045/2008/FY and from the DOCUMENT_ID directory itself, you produce 5 matches per document ID.
Call os.walk() just once:
def survey():
    magic_paths = []
    for (resource_id, dirnames, filenames) in os.walk('/mnt/type/split/v2/doc'):
        if len(resource_id.split(os.sep)) == 10:
            magic_paths.append(resource_id + os.linesep)
    write_survey(magic_paths)
    x = len(magic_paths)
    return x
You may want to prune the search after finding a match; there is no point in searching through further subdirectories once you find a DOCUMENT_ID directory:
def survey():
    magic_paths = []
    for (resource_id, dirnames, filenames) in os.walk('/mnt/type/split/v2/doc'):
        if len(resource_id.split(os.sep)) == 10:
            magic_paths.append(resource_id + os.linesep)
            dirnames[:] = []  # clear the subdirs list in place to stop further recursion here
    write_survey(magic_paths)
    x = len(magic_paths)
    return x

batch renaming 100K files with python

I have a folder with over 100,000 files, all numbered with the same stub, but without leading zeros, and the numbers aren't always contiguous (usually they are, but there are gaps) e.g:
file-21.png,
file-22.png,
file-640.png,
file-641.png,
file-642.png,
file-645.png,
file-2130.png,
file-2131.png,
file-3012.png,
etc.
I would like to batch process this to create padded, contiguous files. e.g:
file-000000.png,
file-000001.png,
file-000002.png,
file-000003.png,
When I parse the folder with for filename in os.listdir('.'): the files don't come up in the order I'd like them to. Understandably, they come up
file-1,
file-1x,
file-1xx,
file-1xxx,
etc. then
file-2,
file-2x,
file-2xx,
etc. How can I get it to go through in the order of the numeric value? I am a complete Python noob, but looking at the docs I'm guessing I could use map to create a new list filtering out only the numerical part, then sort that list and iterate it? With over 100K files this could be heavy. Any tips welcome!
import os
import re

thenum = re.compile(r'^file-(\d+)\.png$')

def bynumber(fn):
    mo = thenum.match(fn)
    if mo:
        return int(mo.group(1))
    return -1  # non-matching names sort first; the key must always return a number

allnames = os.listdir('.')
allnames.sort(key=bynumber)
Now you have the files in the order you want them and can loop
for i, fn in enumerate(allnames):
    ...
using the progressive number i (which will be 0, 1, 2, ...) padded as you wish in the destination-name.
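For example, a possible completion of that loop (my own illustration, padding to 6 digits as in the question's desired output):
for i, fn in enumerate(allnames):
    os.rename(fn, 'file-%06d.png' % i)  # 0 -> file-000000.png, 1 -> file-000001.png, ...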
There are three steps. The first is getting all the filenames. The second is converting the filenames. The third is renaming them.
If all the files are in the same folder, then glob should work.
import glob
filenames = glob.glob("/path/to/folder/*.txt")
Next, you want to change the name of the file. You can print with padding to do this.
>>> filename = "file-338.txt"
>>> import os
>>> fnpart = os.path.splitext(filename)[0]
>>> fnpart
'file-338'
>>> _, num = fnpart.split("-")
>>> num.rjust(5, "0")
'00338'
>>> newname = "file-%s.txt" % num.rjust(5, "0")
>>> newname
'file-00338.txt'
Now, you need to rename them all. os.rename does just that.
os.rename(filename, newname)
To put it together:
for filename in glob.glob("/path/to/folder/*.txt"):  # loop through each file
    newname = make_new_filename(filename)  # create a function that does step 2, above
    os.rename(filename, newname)
Thank you all for your suggestions, I will try them all to learn the different approaches. The solution I went for is based on using a natural sort on my filelist, and then iterating that to rename. This was one of the suggested answers but for some reason it has disappeared now so I cannot mark it as accepted!
import os

files = os.listdir('.')
natsort(files)

index = 0
for filename in files:
    os.rename(filename, str(index).zfill(7) + '.png')
    index += 1
where natsort is defined in http://code.activestate.com/recipes/285264-natural-string-sorting/
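If you would rather not pull in the whole recipe, a minimal natural-sort key in the same spirit (my own sketch, not the recipe's code, assuming the filenames share the same text/number structure) could be:
import re

def natural_key(s):
    # split into digit and non-digit runs so the numeric parts compare as numbers
    return [int(part) if part.isdigit() else part for part in re.split(r'(\d+)', s)]

files.sort(key=natural_key)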
Why don't you do it in a two-step process? Parse all the files and rename them with padded numbers, and then run another script that takes those files, which are now sorted correctly, and renames them so they're contiguous.
1) Take the number in the filename.
2) Left-pad it with zeros
3) Save name.
def renamer():
    for iname in os.listdir('.'):
        first, second = iname.replace(" ", "").split("-")
        number, ext = second.split('.')
        first, number, ext = first.strip(), number.strip(), ext.strip()
        number = '0' * (6 - len(number)) + number  # pad the number to be 6 digits long
        oname = first + "-" + number + '.' + ext
        os.rename(iname, oname)
    print("Done")
Hope this helps
The simplest method is given below. You can also modify this script for a recursive search.
use os module.
get filenames
os.rename
import os

class Renamer:
    def __init__(self, pattern, extension):
        self.ext = extension
        self.pat = pattern

    def rename(self):
        p, e = (self.pat, self.ext)
        number = 0
        for x in os.listdir(os.getcwd()):
            if str(x).endswith(f".{e}"):
                os.rename(x, f'{p}_{number}.{e}')
                number += 1

if __name__ == "__main__":
    pattern = "myfile"
    extension = "txt"
    r = Renamer(pattern=pattern, extension=extension)
    r.rename()
