Access last string after split function to create new list - python

I am a beginner in Python and I have been working on code that reads two types of files (dcd and inp files), combines them and creates a new list with the matching strings.
I got stuck near the beginning. I want to collect all the dcd files. They all have the .dcd extension, but the first part of each path differs, so I was wondering whether there is a way to access just the file name after I have split the string.
#collect all dcd files into a list
list1 = []
for filename1 in glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'):
    filename1 = filename1.split('/')
    filename1.sort()
    list1.append(filename1)
I want to get only names with dcd extension that are indexed [5] and create a new list or mutate this one, but I am not sure how to do that.
p.s I have just posted first part of the code
Thank you !
[Screenshots omitted: the oddly sorted output; a better-looking version; and the desired result, which should be sorted and should not include the eq* files.]

Just use sorted with a sort key: os.path.basename extracts only the base name of each file, and the sort is performed on that:
import os, glob
list1 = sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key = os.path.basename)
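To see what the sort key does without touching the file system, here is a minimal sketch. The paths are hypothetical stand-ins for whatever glob.glob would return; only the file names at the end matter for the ordering.

```python
import os

# Hypothetical paths standing in for the glob results.
paths = [
    "run2/FEP_SYAF014_b/FEP1/298/0.5/md_0002.dcd",
    "run1/FEP_SYAF014_a/FEP1/298/1.0/md_0003.dcd",
    "run3/FEP_SYAF014_c/FEP1/298/0.0/md_0001.dcd",
]

# Sort the full paths, but compare only the final path component.
by_name = sorted(paths, key=os.path.basename)

# The bare file names, in the same order.
basenames = [os.path.basename(p) for p in by_name]
print(basenames)
```

The full paths are kept, so the files can still be opened later; only the comparison uses the base name.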

So this worked. I just added del filename1[:5] to get rid of the other unnecessary string parts:
import os, glob
list1 = sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key = os.path.basename)
for filename1 in sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key = os.path.basename):
    filename1 = filename1.split('/')
    filename1.sort()
    list1.append(filename1)
    del filename1[:5]
    print(filename1)

Your sort is applied to the parts of each file name. That is not what you want. If I understand correctly, you want to sort the list of file names, not the parts of a single name.
The code given by Jean François is great, but I guess you'd like to get your own code working.
You need to extract the file name by keeping only the last part of the split.
A split returns a list of strings. Each item is one part of the original.
filename = filename.split('/')[-1]
This line gets you the last part of the split (index -1 is the last item).
Then you can append that part to your list.
After all that, you can sort your list.
Hope this helps!
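The steps above can be sketched as follows. The function name collect_dcd_names and the sample paths are made up for illustration, and skipping names that start with "eq" is an assumption based on the question's wish to leave out the eq* files.

```python
def collect_dcd_names(paths):
    """Return the sorted file names (last path component) of the given paths."""
    names = []
    for path in paths:
        name = path.split("/")[-1]   # last part of the split
        if name.startswith("eq"):    # assumption: eq* files are unwanted
            continue
        names.append(name)
    names.sort()                     # sort the list of names, not the parts
    return names


# Hypothetical paths standing in for glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd').
sample_paths = [
    "a/FEP_SYAF014_x/FEP1/298/1.0/md_0002.dcd",
    "a/FEP_SYAF014_x/FEP1/298/1.0/eq1.dcd",
    "a/FEP_SYAF014_x/FEP1/298/0.5/md_0001.dcd",
]
names = collect_dcd_names(sample_paths)
print(names)
```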

Related

Strip str between last 2 instances of common character

I have many different strings (which are files) that look like this:
20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc
20201225_00_ec_op_2m_temp_chinawheat_timeseries.nc
20201225_00_ec_op_snowfall_romaniawheat_timeseries.nc
And many more. I want to be able to loop through all of these files and store their file path in a dictionary. To do that, I want the key to be the text that is between the last two instances of an underscore. For example, if this was the file 20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc, then the dict would be
{'argentinacorn': 'path/to/file/20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc'}
How can I loop through and do this pythonically?
You can use regexes to extract the key from the strings like this:
import re
input_string = "20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc"
dict_key = re.findall(".*_(.+)_[^_]+", input_string)[0]
gives
'argentinacorn'
Or with just a simple split:
dict_key = input_string.split("_")[-2]
Regarding file names, you can get the list from current working directory like this:
import os
file_names = os.listdir()
You can just loop through this list and apply the split/regex as shown above.
A simple split and pick:
parts = "20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc".split("_")
key = parts[-2]  # a plain string; the slice parts[-2:-1] would give a one-item list, which cannot be a dict key
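Putting the split together with the loop, a minimal sketch might look like this. The function name index_by_key and the default directory "path/to/file" are placeholders, and the file names are passed in as a list so the example does not depend on what os.listdir() finds.

```python
import os


def index_by_key(file_names, directory="path/to/file"):
    """Map the text between the last two underscores to the file's path."""
    index = {}
    for name in file_names:
        parts = name.split("_")
        if len(parts) < 2:
            continue  # no underscore, so there is no key to extract
        index[parts[-2]] = os.path.join(directory, name)
    return index


file_names = [
    "20201225_00_ec_op_2m_temp_24hdelta_argentinacorn_timeseries.nc",
    "20201225_00_ec_op_2m_temp_chinawheat_timeseries.nc",
    "20201225_00_ec_op_snowfall_romaniawheat_timeseries.nc",
]
index = index_by_key(file_names)
for key in sorted(index):
    print(key, index[key])
```

If two files share the same key, the later one silently replaces the earlier one; whether that is acceptable depends on the data.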

Cleaning list to remove semi-duplicate values

I have a list of video links. Some of these links are near-duplicates: they are the same link except that one ends in x_480.mp4 instead of x.mp4. Not all links have a "_480" version.
How can I clean the list to get only the ones that end with _480.mp4, removing their alternate versions, and keep the ones without a _480.mp4 version?
Example:
videos=["VfeHB0sga.mp4","G9uKZiNm.mp4","VfeHB0sga_480.mp4","kvlX4Fa4.mp4"]
Expected result:
["G9uKZiNm.mp4","VfeHB0sga_480.mp4","kvlX4Fa4.mp4"]
Note: all links end with .mp4. Also, there is no _480.mp4 link without its original.
By the way len(videos) is 243.
You can do it in two lines of code:
to_remove = {fn[:-8] + '.mp4' for fn in videos if fn.endswith('_480.mp4')}
cleaned = [fn for fn in videos if fn not in to_remove]
The first line uses a set comprehension to extract all of the _480.mp4
filenames, converting them to their unwanted short versions. They are
stored in a set for quick searching.
The second line uses a list comprehension to filter out the unwanted
filenames.
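For reference, here are the same two lines wrapped in a function and run on the example from the question. The function name drop_low_res_duplicates is only illustrative.

```python
def drop_low_res_duplicates(videos):
    """Drop 'x.mp4' whenever 'x_480.mp4' is also present; keep input order."""
    # len('_480.mp4') == 8, so fn[:-8] is the bare name without the suffix.
    to_remove = {fn[:-8] + ".mp4" for fn in videos if fn.endswith("_480.mp4")}
    return [fn for fn in videos if fn not in to_remove]


videos = ["VfeHB0sga.mp4", "G9uKZiNm.mp4", "VfeHB0sga_480.mp4", "kvlX4Fa4.mp4"]
cleaned = drop_low_res_duplicates(videos)
print(cleaned)
```

Unlike the dictionary approach, this keeps the remaining links in their original order.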
I'd probably go the dict route to not have to check for existence of items in a list (would become a (performance) problem for large lists). For instance:
list({v[:-8] if v.endswith("_480.mp4") else v[:-4]: v
      for v in sorted(videos)}.values())
That is a compact way of saying:
Build a dictionary whose key is the incoming v without its last 8 characters for values ending in "_480.mp4", or otherwise without its last four characters, and whose value is the full incoming string.
Then give me just the values of that dictionary. Since the input was a list, I've passed them to the list constructor to get the same type as output.
Or broken down for easier reading, it could look something like this:
videos=["x.mp4","y.mp4","z.mp4","x_480.mp4"]
video_d = {}
for video_name in sorted(videos):
    if video_name.endswith("_480.mp4"):
        video_d[video_name[:-8]] = video_name
    else:
        video_d[video_name[:-4]] = video_name
new_videos = list(video_d.values())
It uses a virtual base name (stripping _480.mp4 or .mp4) as dictionary key. Since you do not care about resulting list order, we've made sure _480 suffixed entries are sorted after the "plain" entries. That way if they appear, they overwrite keys created for values without _480 suffix.
This should work. It loops through the videos until it finds one that ends with "_480.mp4". It then strips the "_480" part to build the title of the video you want to remove, and removes that title from the list if it is present. The loop runs over a copy of the list, because removing items from a list while iterating over it can skip entries.
videos=["x.mp4","y.mp4","z.mp4","x_480.mp4"]
#Loops through a copy of the list so that removing items is safe
for video in list(videos):
    if video.endswith("_480.mp4"):
        #Removes the "_480" part of the video title
        start = video[:-8] + ".mp4"
        if start in videos:
            videos.remove(start)
print(videos)
You can even do this with a one-liner list comprehension:
[x for x in videos if x.split('.')[0] + '_480.mp4' not in videos]

Pythonic way to find if a string contains multiple values?

I am trying to find through a list of files all the excel, txt or csv files and append them to a list
goodAttachments = [i for i in attachments if str(i).split('.')[1].find(['xlsx','csv','txt'])]
This is obviously not working because find() needs a string and not a list. Should I try a list comprehension inside of a list comprehension?
There's no need to split or use double list comprehension. You can use str.endswith which takes a tuple of strings to check as an argument:
goodAttachments = [i for i in attachments if str(i).endswith(('.xlsx', '.csv', '.txt'))]
If you really want to split:
goodAttachments = [i for i in attachments if str(i).split('.')[-1] in ('xlsx', 'csv', 'txt')]
The first way is better as it accounts for files with no extension.
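A small sketch of the difference, using made-up attachment names. One of them is a file literally called "csv" with no extension, which is the case the two approaches treat differently.

```python
# Hypothetical attachment names; "csv" is a file with no extension at all.
attachments = ["report.xlsx", "notes.txt", "archive.tar.gz", "csv", "data.csv"]

allowed_suffixes = (".xlsx", ".csv", ".txt")
allowed_extensions = ("xlsx", "csv", "txt")

# str.endswith accepts a tuple and requires the dot to be present.
by_suffix = [name for name in attachments if name.endswith(allowed_suffixes)]

# Splitting on "." returns the whole name when there is no dot,
# so the extension-less file "csv" slips through.
by_split = [name for name in attachments
            if name.split(".")[-1] in allowed_extensions]

print(by_suffix)
print(by_split)
```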
You could try something like this:
goodAttachments = [i for i in attachments if str(i).split('.')[-1] in ['xlsx', 'csv', 'txt']]
This will check if the extension after the last '.' matches one of 'xlsx', 'csv', or 'txt' exactly.
[i for i in attachments if any(e in str(i).split('.')[1] for e in ['xlsx','csv','txt'])]
Like you said, nested list comprehension.
Edit: This will work without splitting, I was trying to replicate the logic in find.
You can check that everything after the last dot is present in a second list. Using [-1] instead of [1] ensures that files named like.this.txt will return the last split, txt, and not this.
goodAttachments = [i for i in attachments if str(i).split('.')[-1] in ['xlsx','csv','txt']]
I would suggest adding a few more lines rather than trying to create a one-liner with nested list comprehensions. Though that would work, I think splitting these comprehensions out onto separate lines makes for more readable code.
import os
attachments = ['sadf.asdf', 'asd/asd/asd.xlsx']
whitelist = {'.xlsx', '.csv'}
# os.path.splitext returns (root, extension); the extension includes the dot
extensions = (os.path.splitext(fp)[1] for fp in attachments)
good_attachments = [fp for fp, ext in zip(attachments, extensions) if ext in whitelist]
I've also used os.path.splitext rather than str.split, as the file name may contain multiple dots and splitext is designed for this exact job.

Python: Using str.split and getting list index out of range

I just started using Python and am trying to convert some of my R code into Python. The task is relatively simple: I have many csv files with a variable name (in this case cell lines) and values (IC50s). I need to pull out all variables and their values shared in common among all files. Some of these files share the same variables but are formatted differently. For example, in some files a variable is just "Cell_line" and in others it is MEL:Cell_line. So, first things first: to make a direct string comparison I need to format them the same way, and hence am trying to use str.split() to do so. There is probably a much better way to do this, but for now I am using the following code:
import csv
import os
# Change working directory
os.chdir("/Users/joshuamannheimer/downloads")
file_name="NCI60_Bleomycin.csv"
with open(file_name) as csvfile:
    NCI_data=csv.reader(csvfile, delimiter=',')
    alldata={}
    for row in NCI_data:
        name_str=row[0]
        splt=name_str.split(':')
        n_name=splt[1]
        alldata[n_name]=row
name_str.split returns a list of length 2. Since the portion I want is after the ":", I want the second element, which should be indexed as splt[1], as splt[0] is the first in Python. However, when I run the code I get this error message: "IndexError: list index out of range".
I'm trying to get the second element out of a list of length 2, so I have no idea why it is out of range. Any help or suggestions would be appreciated.
I am pretty sure that there are some rows where name_str does not contain a ':'. From your own example, it would fail when name_str is Cell_line.
If you are sure that name_str contains at most one ':' (or, when there are several, that you want the part after the last one), use splt[-1] instead of splt[1]. Index -1 takes the last element of the list, and str.split with an explicit separator always returns at least one element, so it cannot raise IndexError here.
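A quick sketch of why index -1 covers both name formats. The helper name cell_line_name is made up for illustration.

```python
def cell_line_name(name_str):
    """Return the part after the last ':' or the whole string if there is none."""
    return name_str.split(":")[-1]


# With a prefix, split gives ['MEL', 'Cell_line'] and [-1] is the second item.
print(cell_line_name("MEL:Cell_line"))

# Without a colon, split gives ['Cell_line'] and [-1] is that single item,
# whereas [1] would raise IndexError.
print(cell_line_name("Cell_line"))
```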
The simple answer is that sometimes the data isn't following the specification being assumed when you write this code (i.e. that there is a colon and two fields).
The easiest way to deal with this is to add an if block, if len(splt)==2:, and do the subsequent lines within that block.
Optionally, add an else: and print the lines that are not to spec, or save them somewhere so you can diagnose.
Like this:
import csv
import os
# Change working directory
os.chdir("/Users/joshuamannheimer/downloads")
file_name="NCI60_Bleomycin.csv"
with open(file_name) as csvfile:
    NCI_data=csv.reader(csvfile, delimiter=',')
    alldata={}
    for row in NCI_data:
        name_str=row[0]
        splt=name_str.split(':')
        if len(splt)==2:
            n_name=splt[1]
            alldata[n_name]=row
        else:
            print("invalid name: "+name_str)
Alternatively, you can use try/except, which in this case is a bit more robust because we can handle IndexError anywhere, in either row[0] or in splt[1], with the one exception handler, and we don't have to specify that the length of the : split field should be 2.
In addition we could explicitly check that there actually is a : before splitting, and assign the name appropriately.
import csv
import os
# Change working directory
os.chdir("/Users/joshuamannheimer/downloads")
file_name="NCI60_Bleomycin.csv"
with open(file_name) as csvfile:
    NCI_data=csv.reader(csvfile, delimiter=',')
    alldata={}
    for row in NCI_data:
        try:
            name_str=row[0]
            if ':' in name_str:
                splt=name_str.split(':')
                n_name=splt[1]
            else:
                n_name = name_str
            alldata[n_name]=row
        except IndexError:
            print("bad row:"+str(row))

Split words and creating new files with different names(python)

I need to write a program like this:
Write a program that reads a .picasa.ini file and copies pictures into new files whose names are the identification numbers of the people in those pictures (e.g. 8ff985a43603dbf8.jpg). If there are several people in a picture, it makes several copies. If a person appears in several pictures, later copies overwrite earlier ones; that is, if person 8ff985a43603dbf8 appears in several pictures, only one file with this name will exist. You may assume that we have a simple .picasa.ini file.
I have an .ini file that contains:
[img_8538.jpg]
faces=rect64(4ac022d1820c8624),**d5a2d2f6f0d7ccbc**
backuphash=46512
[img_8551.jpg]
faces=rect64(acb64583d1eb84cb),**2623af3d8cb8e040**;rect64(58bf441388df9592),**d85d127e5c45cdc2**
backuphash=8108
...
Is this a good way to start this program?
for line in open(r'C:\Users\Admin\Desktop\podatki-picasa\.picasa.ini'):
    if line.startswith('faces'):
        line.split() # what must I do here to split the bolded words?
Is there a better way to do this? Remember the .jpg file must be created with a new name, so I think I should link the current .jpg file with the bolded one.
Consider using ConfigParser. Then you will have to split each value by hand, as you describe.
import configparser  # this module is named ConfigParser in Python 2
config = configparser.ConfigParser()
config.read(r'C:\Users\Admin\Desktop\podatki-picasa\.picasa.ini')
imgs = []
for item in config.sections():
    imgs.append(config.get(item, 'faces'))
This is still work in progress. Just want to ask if it's correct.
edit:
I still don't know how to split the bolded words out of there. This split function really is a pain for me.
Suggestions:
Your lines don't start with 'faces', so your second line won't work the way you want it to. Depending on how the rest of the file looks, you might only need to check whether the line is empty or not at that point.
To get the information you need, first split at ',' and work from there.
Attempt at a solution: the elements you need always seem to have a ',' before them, so you can start by splitting at ',' and taking everything from the 1-index element onwards ([1::]). Then, if what I am thinking is correct, you split those elements twice more: at ';', taking the 0-index element, and at ' ', again taking the 0-index element.
for line in open('thingy.ini'):
    if line != "\n":
        personelements = line.split(",")[1::]
        for person in personelements:
            personstring = person.split(";")[0].split(" ")[0]
            print(personstring)
works for me to get:
d5a2d2f6f0d7ccbc
2623af3d8cb8e040
d85d127e5c45cdc2
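Combining the ConfigParser idea with the splitting, one possible sketch is shown here. It assumes Python 3, assumes the ** markers in the question are only forum emphasis and not part of the file, and parses an inline sample instead of the real .picasa.ini. The names face_ids and plan_copies are hypothetical, and the sketch only works out which picture ends up under which new name; it does not copy anything.

```python
import configparser

# Inline sample with the same layout as the file in the question.
SAMPLE_INI = """\
[img_8538.jpg]
faces=rect64(4ac022d1820c8624),d5a2d2f6f0d7ccbc
backuphash=46512
[img_8551.jpg]
faces=rect64(acb64583d1eb84cb),2623af3d8cb8e040;rect64(58bf441388df9592),d85d127e5c45cdc2
backuphash=8108
"""


def face_ids(faces_value):
    """Return the person IDs found in one 'faces' value."""
    ids = []
    for entry in faces_value.split(";"):
        # Each entry looks like 'rect64(...),<person id>'.
        _, _, person_id = entry.partition(",")
        if person_id.strip():
            ids.append(person_id.strip())
    return ids


def plan_copies(ini_text):
    """Map each new file name '<person id>.jpg' to the picture to copy."""
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(ini_text)
    targets = {}
    for image in config.sections():
        if config.has_option(image, "faces"):
            for person_id in face_ids(config.get(image, "faces")):
                # A later picture overwrites an earlier one for the same person.
                targets[person_id + ".jpg"] = image
    return targets


copies = plan_copies(SAMPLE_INI)
for target in sorted(copies):
    print(target, "<-", copies[target])
```

The actual copying could then be done with shutil.copy(image, target) for each pair, run from the folder that holds the pictures.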
