Writing a Python script that loops over different directories

It's kind of hard to explain, but I'm using a directory that has a number of different files, and essentially I want to loop over files whose names change at irregular intervals.
So in pseudocode I guess it would be written like:
A = 1E4, 1E5, 5E5, 7E5, 1E6, 1.05E6, 1.1E6, 1.2E6, 1.5E6, 2E6
For A in range(start(A), end(A)):
    inputdir = "../../../COMBI_Output/Noise Studies/[A] Macro Particles/10KT_[A]MP_IP1hoN0.0025/"
    Run rest of code
Because at the moment I'm doing it manually by changing the value in [A], and it's a nightmare and time-consuming. I'm using Python on a MacBook, so I wonder if writing a bash script that is called within Python would be the right idea?
Or should I replace A with values read from a text file, such that it's:
import numpy as np
mpnum = np.loadtxt("mp.txt")
for A in range(0, len(A)):
    for B in range(0, len(A)):
        inputdir = "../../../COMBI_Output/Noise Studies/", [A] "Macro Particles/10KT_", [A] "MP_IP1hoN0.0025/"
But I tried this first and still had no luck.

You are almost there. You don't need a range; just iterate over the list, then substitute each value into the string using str.format (with two {} placeholders, the value has to be passed twice):
A = ['1E4', '1E5', '5E5', '7E5', '1E6', '1.05E6', '1.1E6', '1.2E6', '1.5E6', '2E6']
for a in A:
    inputdir = "../../../COMBI_Output/Noise Studies/{} Macro Particles/10KT_{}MP_IP1hoN0.0025/".format(a, a)

The idea of putting the file names in a list and simply iterating over them using
for a in A:
seems to be the best approach. However, one small suggestion, if I may: if you're going to have a large number of files inside this list, why not make it a dictionary? That way you can iterate through your files easily as well as keep a count of them, as sketched below.
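A minimal sketch of that suggestion, reusing the labels from the question (pairing each label with a running count via enumerate is just one way to do it):
labels = ['1E4', '1E5', '5E5', '7E5', '1E6', '1.05E6', '1.1E6', '1.2E6', '1.5E6', '2E6']
# map each label to a running count so iterating and counting happen together
files = {label: count for count, label in enumerate(labels, start=1)}
for label, count in files.items():
    inputdir = "../../../COMBI_Output/Noise Studies/{} Macro Particles/10KT_{}MP_IP1hoN0.0025/".format(label, label)
    print(count, inputdir)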


Python list comprehension - Directly returning large list is faster than storing, then returning

I was working on a problem recently that required me to go through a very large folder (~600,000 files) and return the list of filenames that matched a certain criterion. The original version was a normal list comprehension stored in a variable. This isn't the actual code, but it gives the gist:
import os

def filter_files(file_path):
    filtered = [f.path for f in os.scandir(file_path) if f.path.endswith('.png')]
    return filtered
When monitoring this one, it would start out fast and then get progressively slower and slower. I presume that's because it's trying to store so much data in the variable.
I then rewrote it to be:
def filter_files(file_path):
    return [f.path for f in os.scandir(file_path) if f.path.endswith('.png')]
And called it like:
def test(file_path):
    filtered = filter_files(file_path)
This one never slows down. It maintains the same speed the entire time.
My question is what under the hood causes this difference? The data is still being stored in a variable and it's still being processed as a list comprehension. What about writing the comprehension in a return avoids the issues of the first version? Thanks!
There is no difference between those two pieces of code. None at all. Both of them are creating a list, and then managing a reference to that list.
The likely cause of your issue is caching. In the first case, the file system has to keep going out to the disk over and over to fetch more entries. After you finished that run, the directory was in the file cache and could be read immediately. Reboot and try again, and you'll see the second one takes the same amount of time.
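A quick way to test that explanation (a sketch; the path is a placeholder): time the same call twice in one process. The first call pays the disk cost; the second hits the warm cache and should be much faster, whichever version of the function you use.
import os
import time

def filter_files(file_path):
    return [f.path for f in os.scandir(file_path) if f.path.endswith('.png')]

for attempt in (1, 2):
    start = time.perf_counter()
    matches = filter_files('.')  # placeholder; point this at the big folder
    print(attempt, len(matches), time.perf_counter() - start)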

When converting XML to SEVERAL dataframes, how do I name these dfs in a dynamic way?

My code is at the bottom.
The parse_XML function can convert an XML file to a df; for example, df = parse_XML("example.xml", lst_level2_tags) works.
But since I want to save several dfs, I want to have names like df_first_level_tag, etc.
When I run the bottom code, I get an error:
f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)
^
SyntaxError: can't assign to literal
I also tried the .format method instead of an f-string, but it hasn't worked either.
There are at least 30 dfs to save and I don't want to do it one by one. I've always succeeded with f-strings in Python outside pandas, though.
Is the problem here with the f-string/format method, or does my code have some other logic problem?
If necessary for you, the parse_XML function definition is taken directly from this link. Here is the loop that fails:
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    f'df_{first_level_tag}' = parse_XML("example.xml", lst_level2_tags)
This seems like a situation where you'd be best served by putting them into a dictionary:
dfs = {}
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    dfs[first_level_tag] = parse_XML("example.xml", lst_level2_tags)
There's nothing structurally wrong with your f-string, but you generally can't get dynamic variable names in Python without doing ugly things. In general, storing the values in a dictionary ends up being a much cleaner solution when you want something like that.
One advantage of working with them this way is that you can then just iterate over the dictionary later on if you want to do something to each of them. For example, if you wanted to write each of them to disk as a CSV with a name matching the tag, you could do something like:
for key, df in dfs.items():
    df.to_csv(f'{key}.csv')
You can also just refer to them individually (so if there was a tag named a, you could refer to dfs['a'] to access it in your code later).

I can't delete cases from .sav files using SPSS with Python

I have some .sav files that I want to check for bad data. What I mean by bad data is irrelevant to the problem. I have written a script in Python using the spss module to check the cases and then delete them if they are bad. I do that within a data step by defining a dataset object and then getting its case list. I then use
del datasetObj.cases[k]
to delete the problematic cases within the datastep.
Here is my problem:
Say I have a data set foo.sav and it is the active data set in SPSS; then I can run something like:
BEGIN PROGRAM PYTHON.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
del caselist[k]
spss.EndDataStep()
END PROGRAM.
from within the SPSS client and it will delete the case k from the data set foo.sav. But if I run something like the following, using the directory of foo.sav as the working directory:
import os, spss
pathname = os.curdir
foopathname = os.path.join(pathname, 'foo.sav')
spss.Submit("""
GET FILE='%(foopathname)s'.
DATASET NAME file1.
DATASET ACTIVATE file1.
""" % locals())
spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
del caselist[3]
spss.EndDataStep()
from the command line, then it doesn't delete case k. Similar code which gets values works fine. E.g.,
print caselist[3]
will print case k (when it is in the data step). I can even change the values for the various entries of a case. But it will not delete cases. Any ideas?
I am new to python and spss, so there may be something that I am not seeing which is obvious to others; hence why I am asking the question.
Your first piece of code did not work for me. I adjusted it as follows to get it working:
BEGIN PROGRAM PYTHON.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
del datasetObj.cases[k]
spss.EndDataStep()
END PROGRAM.
Notice that, in your code, caselist is just a list, containing values taken from the datasetObj in SPSS. The attribute .cases belongs to datasetObj.
With spss.Submit, you can also delete cases (or actually, not select them) using the SPSS command SELECT IF. For example, if your file has a variable (column) named age, with values ranging from 0 to 100, you can delete all cases with an age lower than (in SPSS: lt or <) 25 using:
BEGIN PROGRAM PYTHON.
import spss
spss.Submit("""
SELECT IF age lt 25.
""")
END PROGRAM.
Don't forget to add some code to save the edited file.
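For example (a sketch, with a placeholder file name), an SPSS SAVE OUTFILE command after the SELECT IF writes the filtered data back to disk:
BEGIN PROGRAM PYTHON.
import spss
spss.Submit("""
SELECT IF age lt 25.
SAVE OUTFILE='foo_filtered.sav'.
""")
END PROGRAM.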
caselist is not actually a regular list containing the dataset values. Although its interface is the list interface, it actually works directly with the dataset, so it does not contain a list of values. It just accesses operations on the SPSS side to retrieve, change, or delete values. The most important difference is that since Statistics is not keeping the data in memory, the size of the caselist is not limited by memory.
However, if you are trying to iterate over the cases with a loop using
range(spss.GetCaseCount())
and deleting some, the loop will eventually fail, because the actual case count reflects the deletions, but the loop limit doesn't reflect that. And datasetObj.cases[k] might not be the case you expect if an earlier case has been deleted. So you need to keep track of the deletions and adjust the limit or the k value appropriately.
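One way to keep track of that (a sketch; is_bad() is a hypothetical stand-in for your own check) is to walk the case indices backwards, so a deletion never renumbers the cases you have yet to visit:
import spss
n = spss.GetCaseCount()  # take the count before any deletions
spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
for k in range(n - 1, -1, -1):  # backwards: deleting case k leaves cases 0..k-1 untouched
    if is_bad(caselist[k]):     # hypothetical predicate for "bad data"
        del caselist[k]
spss.EndDataStep()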
HTH

Most efficient way to compare all values in a dictionary?

I have a dictionary I created by reading in a whole lot of image files. It looks like this:
files = { 'file1.png': [data...], 'file2.png': [data...], ... 'file1000.png': [data...]}
I am trying to process these images to see how similar each of them are to each other. The thing is, with 1000s of files worth of data this is taking forever. I'm sure I have 20 different places I could optimize but I am trying to work through it one piece at a time to see how I can better optimize it.
My original method tested file1 against all of the rest of the files. Then I tested file2 against all of the files. But I still tested it against file1. So, by the time I get to file1000 in the above example I shouldn't even need to test anything at that point since it has already been tested 999 times.
This is what I tried:
answers = {}
for x in files:
    for y in files:
        if y not in answers or x not in answers[y]:
            if compare(files[x], files[y]) < 0.01:
                answers.setdefault(x, []).append(y)
This doesn't work, as I am getting the wrong output now. The compare function is just this:
import math, functools, operator
def compare(h1, h2):
    rms = math.sqrt(functools.reduce(operator.add, map(lambda a, b: (a - b)**2, h1[0], h2[0])) / len(h1[0]))
    return rms
I just didn't want to put that huge equation into the if statement.
Does anyone have a good method for comparing each of the data segments of the files dictionary without overlapping the comparisons?
Edit:
After trying ShadowRanger's answer I have realized that I may not have fully understood what I needed. My original answers dictionary looked like this:
{ 'file1.png': ['file1.png', 'file23.png', 'file333.png'],
'file2.png': ['file2.png'],
'file3.png': ['file3.png', 'file4.png', 'file5.png'],
'file4.png': ['file3.png', 'file4.png', 'file5.png'],
...}
And for now I am storing my results in a file like this:
file1.png file23.png file33.png
file2.png
file3.png file4.png file5.png
file6.png
...
I thought that by using combinations and only testing individual files once I would save a lot of time retesting files and not have to waste time getting rid of duplicate answers. But as far as I can tell, the combinations have actually reduced my ability to find matches and I'm not sure why.
You can avoid redundant comparisons with itertools.combinations to get order-insensitive unique pairs. Just import itertools and replace your doubly nested loop:
for x in files:
    for y in files:
with a single loop that gets the combinations:
for x, y in itertools.combinations(files, 2):
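Putting that together with the comparison from the question (a sketch): note that combinations yields each unordered pair exactly once, so if you want each match recorded under both file names, as in your original answers dict, you have to append it in both directions yourself:
import itertools

answers = {}
for x, y in itertools.combinations(files, 2):
    if compare(files[x], files[y]) < 0.01:
        answers.setdefault(x, []).append(y)
        answers.setdefault(y, []).append(x)  # record the match under both names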

Separating a string by the first occurrence of a separator

I just got into Python very recently, and now I'm practicing by creating small tools (which I imagine to be rather simple, but challenging enough for me) to sort files into folders.
So far it has been going pretty well, but now I've encountered a problem:
My files are in the following format:
myAsset_prefix1_prefix2_prettyName.ext
(i.e. Tiger_texture_spec_brightOrange.png)
myAsset always has a different length, since it depends on the name.
I want to sort every file of the same asset (the "myAsset_" tag) into a separate folder.
The copying to a separate folder etc. is no challenge, but I don't want to update an array by hand every time I create or receive a new asset.
So instead of using the startswith operation and running it through a hand-maintained list, I'd like to build that array when my script runs, by making the script look at the name of each file and store everything up to and including the first "_" in a variable/array.
Is that possible?
I think you want the glob module, which lets you list the files that match a certain pattern, combined with str.split to take everything before the first "_".
For example:
import glob

for filename in glob.glob("*.ext"):
    asset_tag = filename.split("_", 1)[0]  # everything before the first "_"
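And a minimal sketch of the full sort, assuming the files sit in the current directory and use the .png extension (both placeholders):
import glob
import os
import shutil

for filename in glob.glob("*.png"):
    asset_tag = filename.partition("_")[0]  # "Tiger" for "Tiger_texture_spec_brightOrange.png"
    os.makedirs(asset_tag, exist_ok=True)   # one folder per asset, created on demand
    shutil.move(filename, os.path.join(asset_tag, filename))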
