I have many different strings (which are files) that look like this:
20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc
20201225_00_ec_op_2m_temp_chinawheat_timeseries.nc
20201225_00_ec_op_snowfall_romaniawheat_timeseries.nc
And many more. I want to be able to loop through all of these files and store their file path in a dictionary. To do that, I want the key to be the text that is between the last two instances of an underscore. For example, if this was the file 20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc, then the dict would be
{'argentinacorn': 'path/to/file/20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc'}
How can I loop through and do this pythonically?
You can use regexes to extract the key from the strings like this:
import re
input_string = "20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc"
dict_key = re.findall(".*_(.+)_[^_]+", input_string)[0]
gives
'argentinacorn'
Or with just a simple split:
dict_key = input_string.split("_")[-2]
Regarding the file names, you can get the list of files in the current working directory like this:
import os
file_names = os.listdir()
You can just loop through this list and apply the split/regex as shown above.
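Putting the pieces together, here is a minimal sketch; the directory name is an assumption:

import os

directory = "path/to/file"  # hypothetical directory containing the .nc files
paths = {}
for name in os.listdir(directory):
    if name.endswith(".nc"):
        key = name.split("_")[-2]  # text between the last two underscores
        paths[key] = os.path.join(directory, name)

# paths -> {'argentinacorn': 'path/to/file/20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc', ...}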
A simple split and pick:
parts = "20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc".split("_")
key = parts[-2:-1]
I need to search by part of a filename.
I have a column in my database that contains the names of products, and each product has a file with its description, so I need to add this file to the corresponding row in my database.
I want to add a new column called description and fill it with the contents of the file that matches the product name in the name column, but the names in the column and in the file are different: for example, the product is called cs12 in the database, while the file is called datasheet-cs12 or guide-cs12 or something like that.
You will need to figure out how to:
1. get a list of files in a folder
2. look for a substring in each element of that list
Re 1.: https://stackoverflow.com/search?q=%5Bpython%5D+get+list+of+files+in+directory
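For step 1, a minimal sketch (the folder name here is an assumption):

import os

folder = "product_descriptions"  # hypothetical folder holding the description files
file_names = os.listdir(folder)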
Re 2.:
You have a logical problem. There might be multiple files that match any given string, so I don't think you can solve this fully automatically at all. What if you have two files, datasheet_BAT54alternative.txt and info_BAT54A, and two rows containing the strings BAT54 and BAT54A? A "BAT54" is not the same as a "BAT54A". So you'll always have to deal with a list of candidates. If you're lucky, that list has only one entry:
def give_candidates(list_of_file_names, substring):
    return [fname for fname in list_of_file_names if substring.lower() in fname.lower()]
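For example, using the hypothetical file names from above plus one for the cs12 product:

files = ["datasheet_BAT54alternative.txt", "info_BAT54A", "guide-cs12.txt"]

print(give_candidates(files, "cs12"))   # ['guide-cs12.txt']
print(give_candidates(files, "BAT54"))  # both BAT54 files match, so you get two candidates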
I am trying to go through a list of files, find all the Excel, txt, or csv files, and append them to a list:
goodAttachments = [i for i in attachments if str(i).split('.')[1].find(['xlsx','csv','txt'])]
This is obviously not working because find() needs a string and not a list. Should I try a list comprehension inside of a list comprehension?
There's no need to split or use a double list comprehension. You can use str.endswith, which takes a tuple of strings to check as an argument:
goodAttachments = [i for i in attachments if str(i).endswith(('.xlsx', '.csv', '.txt'))]
If you really want to split:
goodAttachments = [i for i in attachments if str(i).split('.')[-1] in ('xlsx', 'csv', 'txt')]
The first way is better as it accounts for files with no extension.
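As a quick illustration (the attachment names here are made up):

attachments = ['report.xlsx', 'notes.txt', 'archive.tar.gz', 'README']

good = [a for a in attachments if str(a).endswith(('.xlsx', '.csv', '.txt'))]
print(good)  # ['report.xlsx', 'notes.txt']
# the original split('.')[1] version would raise IndexError on 'README'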
You could try something like this:
goodAttachments = [i for i in attachments if str(i).split('.')[-1] in ['xlsx', 'csv', 'txt']]
This will check if the extension after the last '.' matches one of 'xlsx', 'csv', or 'txt' exactly.
[i for i in attachments if any([e in str(i).split('.')[1] for e in ['xlsx','csv','txt']])]
Like you said, nested list comprehension.
Edit: This will work without splitting, I was trying to replicate the logic in find.
You can check that everything after the last dot is present in a second list. Using [-1] instead of [1] ensures that a file named like.this.txt will return the last split part, txt, and not this.
goodAttachments = [i for i in attachments if str(i).split('.')[-1] in ['xlsx','csv','txt']]
I would suggest adding a few more lines rather than trying to create a one-liner with nested list comprehensions. Though that would work, I think it makes for more readable code to split these comprehensions out onto separate lines.
import os

attachments = ['sadf.asdf', 'asd/asd/asd.xlsx']
whitelist = {'.xlsx', '.csv'}
extensions = (os.path.splitext(fp)[1] for fp in attachments)  # yields '.asdf', '.xlsx'
good_attachments = [fp for fp, ext in zip(attachments, extensions) if ext in whitelist]
I've also used os.path.splitext over str.split, as the file may have multiple dots present and splitext is designed for this exact job.
I am a beginner in Python and I have been working on code to access two types of files (dcd and inp files), combine them, and create a new list with the matching strings.
I got stuck somewhere at the beginning. I want to get all the dcd files here. They have the .dcd extension, but the first part of the name is not the same, so I was wondering if there is a way to access them after I have split the string.
import glob

# collect all dcd files into a list
list1 = []
for filename1 in glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'):
    filename1 = filename1.split('/')
    filename1.sort()
    list1.append(filename1)
I want to get only the names with the dcd extension that are indexed [5] and create a new list or mutate this one, but I am not sure how to do that.
P.S. I have just posted the first part of the code.
Thank you!
[screenshots in the original post: the oddly sorted output, a better-looking one, and the desired result, sorted and without the eq* files]
Just use sort with a sort key: os.path.basename (it extracts only the base name of the file to perform the sort):
import os, glob
list1 = sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key = os.path.basename)
So this worked. I just added del filename1[:5] to get rid of the other, unnecessary string parts:

import os, glob

list1 = sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key=os.path.basename)

for filename1 in sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key=os.path.basename):
    filename1 = filename1.split('/')
    filename1.sort()
    list1.append(filename1)
    del filename1[:5]
    print filename1
Your sort function is applied to the file name parts, which is not what you want. If I understand correctly, you want to sort the list of file names, not the parts of each file name.
The code given by Jean François is great, but I guess you'd like to get your own code working.
You need to extract the file name by using only the last part of the split
A split returns a list of strings. Each item is a part of the original.
filename = filename.split('/')[len(filename.split('/')) - 1]
This line will get you the last part of the split
Then you can add that part to your list
And after all that you can sort your list
Hope this helps!
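Putting that advice together, a rough sketch of the loop described above, using the glob pattern from the question:

import glob

list1 = []
for path in glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'):
    name = path.split('/')[-1]  # keep only the last part of the split, i.e. the file name
    list1.append(name)
list1.sort()  # sort the collected file names at the end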
I have a folder with *.txt files whose names follow a specific format (c is a character, d is a digit, and yyyy-mm-dd-hh-mm-ss is the date format):
cccccd_ddd_cc_ccc_c_dd-ddd_yyyy-mm-dd-hh-mm-ss.txt
or
cccccd_ddd_cc_ccc_c_dd-dddd_yyyy-mm-dd-hh-mm-ss.txt
or
cccccd_ddd_cc_ccc_c_d_yyyy-mm-dd-hh-mm-ss.txt
when the single digit d is equal to 0.
I would like to create a Python script to extract the dates and sort the files by that date.
So far I have done:
import os

list_files = []
for file in os.listdir():
    if file.endswith(".txt"):
        # print(file)
        list_files.append(file)
But I am a bit new to regular expressions. Thanks.
You can use .split() to split a string.
It seems that we can split at the last occurrence of "_" and remove the part after "." to get the timestamp.
So, a method to return the timestamp from the file name is:
def get_timestamp(file_name):
    return file_name.split("_")[-1].split('.')[0]
As all the dates are in the same format, Python can sort them using the timestamp string itself.
To get the sorted list of filenames using that timestamp, you can do:
sorted_list = sorted(list_files, key=get_timestamp)
More about the key function can be found in the official Python documentation.
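For instance, with two made-up file names that follow the pattern above:

list_files = [
    'abcde1_123_ab_abc_x_42-123_2021-03-05-10-00-00.txt',
    'abcde1_123_ab_abc_x_0_2020-12-25-08-30-00.txt',
]
print(sorted(list_files, key=get_timestamp))
# the 2020-12-25 file comes first because its timestamp string sorts lower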
I'm trying to set up an environment variable via Python:
os.environ["myRoot"]="/home/myName"
os.environ["subDir"]="$myRoot/subDir"
I expect the subDir environment variable to hold /home/myName/subDir; however, it holds the string '$myRoot/subDir'. How do I get this functionality?
(Bigger picture : I'm reading a json file of environment variables and the ones lower down reference the ones higher up)
Use os.environ to fetch the value, and os.path.join to put the slashes in the right places:
os.environ["myRoot"]="/home/myName"
os.environ["subDir"] = os.path.join(os.environ['myRoot'], "subDir")
You can use os.path.expandvars to expand environment variables like so:
>>> import os
>>> print os.path.expandvars("My home directory is $HOME")
My home directory is /home/Majaha
>>>
For your example, you might do:
os.environ["myRoot"] = "/home/myName"
os.environ["subDir"] = os.path.expandvars("$myRoot/subDir")
I think #johntellsall's answer is the better one for the specific example you gave, but I don't doubt you'll find this useful for your json work.
Edit: I would now recommend using #johntellsall's answer, as os.path.expandvars() is designed explicitly for use with paths, so using it for arbitrary strings may work but is kinda hacky.
import re
import json

def fix_text(txt, data):
    '''txt is the string to fix, data is the dictionary with the variable names/values'''
    def fixer(m):  # takes a regex match
        match = m.groups()[0]  # since there's only one group, that's all we worry about
        # return a replacement, or the variable name if it's not in the dictionary
        return data.get(match, "$%s" % match)
    return re.sub(r"\$([a-zA-Z]+)", fixer, txt)  # regular expression to match a "$" followed by 1 or more letters

with open("some.json") as f:  # open the json file to read
    file_text = f.read()

data = json.loads(file_text)  # load it into a json object
# try to ensure you evaluate them in the order you found them
keys = sorted(data.keys(), key=file_text.index)
# create a new dictionary by mapping our ordered keys above to "fixed" strings that support simple variables
data2 = dict(map(lambda k: (k, fix_text(data[k], data)), keys))
# sanity check
print data2
[edited to fix a typo that would cause it not to work]
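For example, with a made-up some.json whose second entry references the first, the code above would expand the variable:

import json

# hypothetical contents of some.json
file_text = '{"myRoot": "/home/myName", "subDir": "$myRoot/subDir"}'

data = json.loads(file_text)
keys = sorted(data.keys(), key=file_text.index)
data2 = dict(map(lambda k: (k, fix_text(data[k], data)), keys))

print(data2['subDir'])  # /home/myName/subDir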