Creating arrays based on folder name - python

I have data that has been collected and organized in multiple folders.
In each folder, there can be multiple similar runs -- e.g. data collected under the same conditions, at different times. These filenames contain an incrementing number. Each folder holds similar data collected under different conditions. For example, I can have an idle folder containing files named idle_1.csv, idle_2.csv, idle_3.csv, etc. Then I can have another folder, pos1, containing pos1_1.csv, pos1_2.csv, etc.
In order to keep track of what folder and what file the data in the arrays came from, I want to use the folder name, "idle", "pos1", etc, as the array name. Then, each file within that folder (or the data resulting from processing each file in that folder, rather) becomes another row in that array.
For example, if the name of the folder is stored in variable arrname, and the file index is stored in variable arrndx, I want to write the value into that array:
arrname[arrndx]=value
This doesn't work, giving the following error:
TypeError: 'str' object does not support item assignment
Then I thought about using a dictionary, but I think I would still run into the same issue: each dictionary's name would have to be derived from the folder name. If I instead use the folder name as a key in a dictionary, the entries get overwritten by every file from the same folder, since the key is the same:
arrays['name']=arrname
arrays['index']=int(arrndx)
arrays['val']=value
arrays['name': arrname, 'index':arrndx, 'val':value]
I can't use 'index' as the key either, since it is not unique across folders.
So, I'm stumped. I guess I could predefine all the arrays and then write to the correct one based on the variable name, but that would mean a large case statement (is there such a thing in Python?) or a big if statement. Maybe there is no avoiding this in my case, but I'm thinking there has to be a more elegant way...
EDIT
I was able to work around my issue using globals():
globals()[arrname].insert(int(arrndx),value)
However, I believe this is not the "correct" solution, although I don't understand why it is frowned upon to do this.
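(For comparison, a dictionary of lists avoids globals() entirely; a minimal sketch using setdefault, which is not from the original post:)
arrays = {}
arrays.setdefault(arrname, []).insert(int(arrndx), value)  # one list per folder name
# arrays['idle'] then holds the processed values from idle_1.csv, idle_2.csv, ...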

Use a nested dictionary with the folder names at the first level and the file indices (or names) at the second.
from pathlib import Path

data = {}
base_dir = 'base'

for folder in Path(base_dir).resolve().glob('*'):
    if not folder.is_dir():
        continue
    data[folder.name] = {}
    for csv in folder.glob('*.csv'):
        file_id = csv.stem.split('_')[1]
        data[folder.name][file_id] = csv
The above example just saves the file's path in the structure, but you could alternatively load the file's data (e.g. using Pandas) and save that to the dictionary. It all depends on what you want to do with it afterwards.
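For instance, a minimal variation that stores each file's parsed data instead of its path (assuming the CSVs load with Pandas defaults):
import pandas as pd

# inside the inner loop, replacing the assignment above:
data[folder.name][file_id] = pd.read_csv(csv)
# data['idle']['1'] is then the DataFrame parsed from idle_1.csv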

What about:
foldername = 'idle'  # say your folder name is idle, for example
files = {}
files[foldername] = [foldername + "_" + str(i) + ".csv" for i in range(1, number_of_files_inside_folder + 1)]
Does that solve your problem?

Related

AttributeError: 'DataFrame' object has no attribute 'path'

I'm trying incrementally to build a financial statement database. The first steps center around collecting 10-Ks from the SEC's EDGAR database. I have code for pulling the relevant 8-Ks, 10-Ks, and 10-Qs by CIK number and accession number, and for retrieving the relevant Excel spreadsheet. The code below centers on creating a folder within a directory, naming the folder with the CIK code, pulling the spreadsheet from the EDGAR database, and saving the spreadsheet to the folder with the CIK code. My example is a csv file I'm calling "accessionnumtest.csv", which has headings:
company_name,report_type,cik,date,cik_accession
and data:
4Less Group, Inc.,10K/A,1438901,11/27/2019,edgar/data/1438901/000121390019024801.txt
AB INTERNATIONAL GROUP CORP.,10K,1605331,10/22/2019,edgar/data/1605331/000166357719000384.txt
ABM INDUSTRIES INC /DE/,10K,771497,12/20/2019,edgar/data/771497/000162828019015259.txt
ACTUANT CORP,10K,6955,10/29/2019,edgar/data/6955/000000695519000033.txt
My code is below:
import os
import pandas as pd

path = os.getcwd()
folder_path = "C:/metricdatadb/"
df = pd.read_csv("accessionnumtest.csv")

folder_name = df['cik']
print(folder_name)

for row in df.iterrows():
    dir = df.path.join(folder_path, folder_name)
    os.makedirs(dir)
This code is giving me an AttributeError: 'DataFrame' object has no attribute 'path' error. I have renamed the path and checked for whitespace in the headers. Any suggestions are appreciated.
Regarding the error: it is os.path.join, not df.path.join. You are calling the wrong module.
That being said, your code is not doing what you are trying to do regardless of the error. folder_name will not update for each row. You could use row.cik to get the value for each row from iterrows():
dir = os.path.join(folder_path, row.cik)
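Putting it together, a minimal sketch of the corrected loop (note that iterrows() yields (index, row) pairs, and cik is parsed as an integer, so it needs str(); exist_ok=True is my addition so re-runs don't fail):
import os
import pandas as pd

folder_path = "C:/metricdatadb/"
df = pd.read_csv("accessionnumtest.csv")

for _, row in df.iterrows():
    dir_path = os.path.join(folder_path, str(row.cik))  # one folder per CIK
    os.makedirs(dir_path, exist_ok=True)                # no error if it already exists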
It is relatively unclear what you're working towards accomplishing, particularly with the .csv files and Pandas. The code you have contains a number of curious errors, which I think might be remedied by going back and learning some of the simpler Python concepts before attempting something as difficult as web scraping. Note I don't mean give up; rather, building up the fundamentals is a necessary step in this type of project.
That said, if I'm understanding your intent correctly, you want to create a file hierarchy for 10-K, 10-Q, etc. filings for several CIKs.
There shouldn't be any need to use .csv files, or pandas for this.
Probably the simplest way to do this would be to do it in the same step you download them.
Pseudocode for this would be as follows:
for cik in list_of_ciks:
    first_file = find_first_file_online()
    if first_file is 10-K:
        save to 10-K folder for CIK
    if first_file is 10-Q:
        save to 10-Q folder for CIK
As I said above, you can skip the .csv file. (Also, note that CSV stands for "comma-separated values". Some of the entries in your data contain commas, e.g. "4Less Group, Inc." Unless such fields are quoted, the comma will split the single entry into two columns, shifting all of your data over by one column.)
When you process the data, you'll want to build the folders as you go.
When you iterate through a new CIK, create the master folder for that CIK. When you encounter a 10-K, create a folder for 10-K's and save it with a unique name. Since you need to use the accession numbers to get the excel sheets, that wouldn't be a bad naming convention to follow.
It would be something like this:
import requests
import pathlib
cik_list = [cik_1, cik_2... cik_n]

for cik in cik_list:
    response = requests.get("cik/accession/Report.xlsx")  # placeholder URL
    out_path = pathlib.Path(cik, report_type, accession_number + ".xlsx")
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create CIK/report_type folders
    with open(out_path, "wb") as excel_file:
        excel_file.write(response.content)  # the response body is in .content, not .data
The above sketch will definitely not run as-is, and does not include everything you would need to make it work, since that information is on your side. Integrating the above concepts into your code is up to you.
To reiterate, you have the CIK, the accession number, and the report type. To save the files in folders, you need only create the folders as you go, with the form "CIK/report_type/accession.xlsx"

How can I search in a subdir for a specific file?

Let's say start.py is located in C:\.
import os

path = "C:\\Users\\Downloads\\00005.tex"
file = open(path, "a+")
file.truncate(0)
file.write("""Hello
""")
file.close()
os.startfile("C:\\Users\\Downloads\\00005.tex")
In the subdirectory there could be some files, for example: 00001.tex, 00002.tex, 00003.tex, 00004.tex.
I first want to search the subdir for the file with the highest number (00004.tex), create a new one with the next number (00005.tex), write "Hello" to it, and save it in the subdir as 00005.tex.
Are the zeros necessary, or can I also just name them 1.tex, 2.tex, 3.tex, ...?
Textually, "2" is greater than "100", but of course numerically it's the opposite. The reason for writing files as, say, "00002.txt" and "00100.txt" is that for files numbered up to 99999, the lexical sort order is the same as the numerical one. If you write them as "2.txt" and "100.txt", then you need to convert the non-extension part of the name to an integer before sorting.
In your case, since you want the next highest number, you need to convert the filenames to integers anyway so that you can take the maximum and add 1. Since you are converting to an integer anyway, your program doesn't care whether you prepend zeroes or not.
So the choice is based on external reasons. Is there some reason to make a textual sort work? If not, then the choice is arbitrary; do whatever you think looks better.
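To illustrate the difference (a quick interactive check, not from the original answer):
names = ["100", "2", "30"]
print(sorted(names))           # ['100', '2', '30']  -- lexical order
print(sorted(names, key=int))  # ['2', '30', '100']  -- numeric order
print("%05d.tex" % 2)          # '00002.tex': zero-padded, sorts correctly both ways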
You can use glob:
import glob, os
os.chdir(r"Path")
files = glob.glob("*.tex")
entries = sorted([int(entry.split(".", 1)[0]) for entry in files])
next_entry = entries[-1]+1
next_entry can be used as the new file number. You can then create a new file with this name and write your new content to it, as sketched below.
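For example, a minimal sketch of that final step (zero-padded to five digits to match the existing names; this assumes the directory contains at least one .tex file, so entries is non-empty):
new_name = "%05d.tex" % next_entry
with open(new_name, "w") as f:
    f.write("Hello\n")
os.startfile(new_name)  # open the new file, as in the question (Windows only)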

Find and remove duplicate files using Python

I have several folders which contain duplicate files that have slightly different names (e.g. file_abc.jpg and file_abc(1).jpg), i.e. a "(1)" suffix on the end. I am trying to develop a relatively simple method to search through a folder, identify duplicates, and then delete them. The criterion for a duplicate is "(1)" at the end of a file name, so long as the original also exists.
I can identify the duplicates okay; however, I am having trouble creating the text string in the right format to delete them. It needs to be "C:\Data\temp\file_abc(1).jpg"; however, using the code below I end up with r"C:\Data\temp''file_abc(1).jpg".
I have looked at answers such as Finding duplicate files and removing them; however, this seems far more sophisticated than what I need.
If there are better (and simpler) ways to do this, then let me know. However, I only have around 10,000 files in total in 50-odd folders, so not a great deal of data to crunch through.
My code so far is:
import os

file_path = r"C:\Data\temp"
file_list = os.listdir(file_path)
print(file_list)

for file in file_list:
    if "(1)" in file:
        index_no = file_list.index(file)
        print("!! Duplicate file, number in list: " + str(file_list.index(file)))
        file_remove = ('r"%s' %file_path+"'\'"+file+'"')
        print("The text string is: " + file_remove)
        os.remove(file_remove)
Your code is just a little more complex than necessary, and you didn't use the proper way (os.path.join) to create a file path out of a directory and a file name. Also, I think you should not remove files which have no original (i.e. which aren't duplicates even though their name looks like it).
Try this:
for file_name in file_list:
    if "(1)" not in file_name:
        continue
    original_file_name = file_name.replace('(1)', '')
    if not os.path.exists(os.path.join(file_path, original_file_name)):
        continue  # do not remove files which have no original
    os.remove(os.path.join(file_path, file_name))
Mind though, that this doesn't work properly for files which have multiple occurrences of (1) in them, and files with (2) or higher numbers also aren't handled at all. So my real proposition would be this:
Make a list of all files in the whole directory tree below a given start (use os.walk() to get this), then
sort all files by size, then
walk linearly through this list, identify the doubles (which are neighbours in this list) and
yield each such double-group (i. e. a small list of files (typically just two) which are identical).
Of course you should check the contents of these few files then to be sure that not just two of them are accidentally the same size without being identical. If you are sure you have a group of identical ones, remove all but the one with the simplest names (e. g. without suffixes (1) etc.).
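A minimal sketch of that proposition (the grouping uses a dictionary keyed by size instead of an explicit sort, and an MD5 hash stands in for the content check; the names here are mine):
import os
import hashlib
from collections import defaultdict

def find_duplicate_groups(root_dir_path):
    # group every file in the tree by size first (cheap)...
    by_size = defaultdict(list)
    for dir_path, _, file_names in os.walk(root_dir_path):
        for name in file_names:
            full_path = os.path.join(dir_path, name)
            by_size[os.path.getsize(full_path)].append(full_path)
    # ...then confirm same-size candidates by hashing their contents
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            with open(path, 'rb') as f:
                by_hash[hashlib.md5(f.read()).hexdigest()].append(path)
        for group in by_hash.values():
            if len(group) > 1:
                yield group  # identical files; keep the one with the simplest name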
By the way, I would call the file_path something like dir_path or root_dir_path (because it is a directory and a complete path to it).

How can I improve performance of finding all files in a folder created at a certain date?

There are 10,000 files in a folder. Some files were created on 2018-06-01, some on 2018-06-09, and so on.
I need to find all files which were created on 2018-06-09. But it is taking too much time (almost 2 hours) to read each file, get the file creation date, and then collect the files created on 2018-06-09.
for file in os.scandir(Path):
    if file.is_file():
        file_ctime = datetime.fromtimestamp(os.path.getctime(file)).strftime('%Y-%m-%d %H:%M:%S')
        if file_ctime[0:10] == '2018-06-09':
            # ...
You could try using os.listdir(path) to get all the files and directories from the given path.
Once you have all the files and directories, you could use filter() and a lambda function to create a new list of only the files with the desired creation date.
You could then iterate through that list to do whatever work you need on the correct files.
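A minimal sketch of that suggestion (comparing dates rather than strings; note it still stats every file, so by itself it is unlikely to be faster):
import os
from datetime import date

target_day = date(2018, 6, 9)
names = os.listdir(Path)  # 'Path' is the folder variable from the question
matches = list(filter(
    lambda n: os.path.isfile(os.path.join(Path, n))
              and date.fromtimestamp(os.path.getctime(os.path.join(Path, n))) == target_day,
    names))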
Let's start with the most basic thing: why are you building a datetime only to re-format it as a string and then do a string comparison?
Then there is the whole point of using os.scandir() over os.listdir(): os.scandir() returns os.DirEntry objects, which cache file stats through the os.DirEntry.stat() call.
Depending on which checks you need to perform, os.listdir() might even perform better if you expect to do a lot of filtering on the filename, as then you won't need to build up a whole os.DirEntry just to discard it.
So, to optimize your loop, if you don't expect a lot of filtering on the name:
for entry in os.scandir(Path):
    # the bounds are POSIX timestamps covering 2018-06-09 in a UTC+2 zone
    if entry.is_file() and 1528495200 <= entry.stat().st_ctime < 1528581600:
        pass  # do whatever you need with it
If you do, then it's better to stick with os.listdir():
import stat

for entry in os.listdir(Path):
    # do your filtering on the entry name first...
    path = os.path.join(Path, entry)  # build the path to the listed entry...
    stats = os.stat(path)             # cache the file entry statistics
    if stat.S_ISREG(stats.st_mode) and 1528495200 <= stats.st_ctime < 1528581600:
        pass  # do whatever you need with it
If you want to be flexible with the timestamps, use datetime.datetime.timestamp() beforehand to get the POSIX timestamps and then you can compare them against what stat_result.st_ctime returns directly without conversion.
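For example, to derive the two bounds used above from local-time dates (the exact values depend on your time zone):
from datetime import datetime

start = datetime(2018, 6, 9).timestamp()   # 00:00 local time, 2018-06-09
end = datetime(2018, 6, 10).timestamp()    # 00:00 local time, 2018-06-10
# then filter with: start <= stats.st_ctime < end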
However, even your original, non-optimized approach should be significantly faster than 2 hours for a mere 10k entries. I'd check the underlying filesystem, too; something seems wrong there.

Python File Creation Date & Rename - Request for Critique

Scenario: When I photograph an object, I take multiple images from several angles. Multiplied by the number of objects I "shoot", I can generate a large number of images. Problem: the camera generates images identified as 'DSCN100001', 'DSCN100002', etc. Cryptic.
I put together a script that prompts for a directory specification (Windows), as well as a "Prefix". The script reads each file's creation date and time and renames the file accordingly. The prefix is added to the front of the file name. So 'DSCN100002.jpg' can become "FatMonkey 20110721 17:51:02". The time detail is important to me for chronology.
The script follows. Please tell me whether it is Pythonic, whether or not it is poorly written, and, of course, whether there is a cleaner, more efficient way of doing this. Thank you.
import os
import datetime

target = raw_input('Enter full directory path: ')
prefix = raw_input('Enter prefix: ')

os.chdir(target)
allfiles = os.listdir(target)

for filename in allfiles:
    t = os.path.getmtime(filename)
    v = datetime.datetime.fromtimestamp(t)
    x = v.strftime('%Y%m%d-%H%M%S')
    os.rename(filename, prefix + x + ".jpg")
The way you're doing it looks Pythonic. A few alternatives (not necessarily suggestions):
You could skip os.chdir(target) and do os.path.join(target, filename) in the loop.
You could do strftime('{0}-%Y-%m-%d-%H-%M-%S.jpg'.format(prefix)) to avoid string concatenation (colons swapped for dashes here, since ':' is not a legal character in Windows file names). This is the only one I'd recommend.
You could reuse a variable name like temp_date instead of t, v, and x. This would be OK.
You could skip storing temporary variables and just do:
for filename in os.listdir(target):
    # with 'import datetime', this needs the qualified datetime.datetime.fromtimestamp
    os.rename(filename, datetime.datetime.fromtimestamp(
        os.path.getmtime(filename)).strftime(
        '{0}-%Y-%m-%d-%H-%M-%S.jpeg'.format(prefix)))
You could generalize your function to work recursively on directories by using os.walk().
You could detect the file extension of each file so it would be correct not just for .jpegs.
You could make sure you only rename files of the form DSCN1#####.jpeg.
Your code is nice and simple. A few possible improvements I can suggest:
Command-line arguments are preferable for directory names because of TAB autocompletion.
EXIF is a more accurate source of the date and time of photo creation. If you modify a photo in an image editor, the modification time will change while the EXIF information is preserved. Here is a discussion about EXIF libraries for Python: Exif manipulation library for python
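For instance, a minimal sketch using Pillow (my choice of library, not named in the answer; tag 306 is the EXIF/TIFF DateTime tag):
from PIL import Image

with Image.open('DSCN100002.jpg') as img:
    exif = img.getexif()
    taken = exif.get(306)  # e.g. '2011:07:21 17:51:02', or None if absent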
My only thought is that if you are going to have the computer do the work for you, let it do more of the work. My assumption is that you are going to shoot one object several times, then either move to another object or move another object into place. If so, you could consider grouping the photos by how close the timestamps are together (maybe any delta over 2 minutes is considered a new object). Then based on these pseudo clusters, you could name the photos by object.
This may not be what you are looking for, but I thought I'd add the suggestion; see the sketch below.
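A minimal sketch of that clustering idea (the function name and the 2-minute threshold are illustrative, not from the answer):
import os

def group_by_time(filenames, gap_seconds=120):
    # sort files by modification time, then start a new group whenever
    # the gap to the previous file exceeds the threshold
    stamped = sorted((os.path.getmtime(f), f) for f in filenames)
    groups, current, last = [], [], None
    for t, f in stamped:
        if last is not None and t - last > gap_seconds:
            groups.append(current)
            current = []
        current.append(f)
        last = t
    if current:
        groups.append(current)
    return groups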
