Extract string in list based on character in Python [duplicate] - python

This question already has answers here:
Splitting on first occurrence
(5 answers)
Closed 29 days ago.
I have a list of filenames in Python that looks like this, except that the list is much longer:
filenames = ['BETON\\map (120).png',
'BETON\\map (125).png',
'BETON\\map (134).png',
'BETON\\map (137).png',
'TUILES\\map (885).png',
'TUILES\\map (892).png',
'TUILES\\map (924).png',
'TUILES\\map (936).png',
'TUILES\\map (954).png',
'TUILES\\map (957).png',
'TUILES\\map (97).png',
'TUILES\\map (974).png',
'TUILES\\map (987).png']
I would like to only keep the first part of the filename strings in my list in order to only keep its type, like so:
filenames = ['BETON',
'BETON',
'BETON',
'BETON',
'TUILES',
'TUILES',
'TUILES',
'TUILES',
'TUILES',
'TUILES',
'TUILES',
'TUILES',
'TUILES']
I have been using a workaround that grabs the first 5 characters:
def Extract(files):
    return [item[:5] for item in files]
# Driver code
files2 = Extract(filenames)
However, this is becoming an issue because I have many more types (indicated in the list of filenames) of varying lengths, so I cannot just take the first five characters.
How can I extract everything up to the first backslash (\\)?
Many thanks!

Split the filenames on a backslash, and take only the first item from the split.
filenames = [n.split('\\')[0] for n in filenames]
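A small sketch of the same idea, assuming only the text before the first backslash is wanted: str.partition stops at the first separator, and split with maxsplit=1 behaves the same way. The sample list is a shortened copy of the one in the question.

```python
# Shortened copy of the list from the question.
filenames = ['BETON\\map (120).png', 'TUILES\\map (885).png', 'TUILES\\map (97).png']

# partition() returns (head, separator, tail); keep only the head.
types_partition = [name.partition('\\')[0] for name in filenames]

# Equivalent: split at most once, then take the first piece.
types_split = [name.split('\\', 1)[0] for name in filenames]

print(types_partition)  # ['BETON', 'TUILES', 'TUILES']
```

Both variants leave a name unchanged if it contains no backslash at all.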

string.split()
Yeah, you indeed can split every string and take only the part you need.
Try this:
for index in range(len(filenames)):
    # Only take the name
    filenames[index] = filenames[index].split('\\')[0]
The code above does not depend on the length of the file name; it just takes the part of the string before the character you pass to split(), which is '\\' in your case.

import os
list(map(os.path.dirname, filenames))
['BETON', 'BETON', 'BETON', 'BETON', 'TUILES', 'TUILES', 'TUILES', 'TUILES', 'TUILES', 'TUILES', 'TUILES', 'TUILES', 'TUILES']
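One caveat, offered as a hedged note: os.path follows the conventions of the platform the script runs on, so on Linux or macOS os.path.dirname does not treat the backslash as a separator and returns an empty string for these names. A sketch that should behave the same on any platform uses ntpath (the Windows flavour of os.path) or pathlib.PureWindowsPath. The sample list is a shortened copy of the one in the question.

```python
import ntpath
from pathlib import PureWindowsPath

filenames = ['BETON\\map (120).png', 'TUILES\\map (885).png']

# ntpath always applies Windows path rules, whatever the host OS is.
types_ntpath = [ntpath.dirname(name) for name in filenames]

# PureWindowsPath does the same without touching the filesystem.
types_pathlib = [str(PureWindowsPath(name).parent) for name in filenames]

print(types_ntpath)  # ['BETON', 'TUILES']
```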

Alternate solution producing output as desired by OP.
Use Python's operator.methodcaller() function.
methodcaller() helps keep the code maintainable as your use case expands and you start defining functions. You can use it with a list comprehension, with map, or even inside a lambda (if required).
Note that you need to import methodcaller from the operator module, which is part of the standard library, so there is nothing to install.
methodcaller
Return a callable object that calls the method name on its operand. If additional arguments and/or keyword arguments are given, they will be given to the method as well.
## import methodcaller
from operator import methodcaller
method_to_use = 'split'
arg1 = '\\'
## use methodcaller with a list comprehension to split filenames and return the first split
## similar to @John Gordon above
# [methodcaller('split', '\\')(n)[0] for n in filenames]
[methodcaller(method_to_use, arg1)(n)[0] for n in filenames]
## alternatively, build the callable once and reuse it
# f = methodcaller('split', '\\')
f = methodcaller(method_to_use, arg1)
filenames_split_caller = [f(n)[0] for n in filenames]
filenames_split_caller
methodcaller() with map()
## get the first element of each inner list (map yields one list per filename after the split)
[e[0] for e in map(methodcaller(method_to_use, arg1), filenames)]
[Edit]
Each of the three snippets above produces the output the OP asked for:
I would like to only keep the first part of the filename strings in my list in order to only keep its type.
PS: The author's context is not properly addressed by the 'duplicate'.

Related

Python how to get rid of a nested list

Good evening,
I have a python variable like so
myList = ["['Ben'", " 'Dillon'", " 'Rawr'", " 'Mega'", " 'Tote'", " 'Case']"]
I would like it to look like this instead
myList = ['Ben', 'Dillon', 'Rawr', 'Mega', 'Tote', 'Case']
If I do something like this
','.join(myList)
It gives me what I want, but the result is a string.
I would also like it to remain a list. I have tried the join and split methods, and I have been debugging with type(), which tells me that the original variable is a list.
I appreciate any and all help on this.
Join the inner list elements, then call ast.literal_eval() to parse it as a list of strings.
import ast
myList = ast.literal_eval(",".join(myList))
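It can also be done by slicing the strings, which avoids importing ast.
# drop the trailing ']' from the last element
myList[5] = myList[5][:-1]
for n in range(len(myList)):
    # drop the two leading characters ("['" or " '") and the closing quote
    myList[n] = myList[n][2:-1]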
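A third option, sketched here under the assumption that every fragment is a name surrounded only by spaces, brackets, and single quotes: strip those characters from both ends of each element. Unlike the slicing version, it does not depend on the exact position of each stray character.

```python
myList = ["['Ben'", " 'Dillon'", " 'Rawr'", " 'Mega'", " 'Tote'", " 'Case']"]

# strip() removes any of the listed characters from both ends of the string.
cleaned = [item.strip(" []'") for item in myList]

print(cleaned)  # ['Ben', 'Dillon', 'Rawr', 'Mega', 'Tote', 'Case']
```

The result is still a list, and characters inside a name are left untouched.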

How to extract paths in a Url string in Python?

I need to extract the path components from a URL string at different depth levels.
If the input is:
http//10.6.7.9:5647/folder1/folder2/folder3/folder4/df.csv
Output should be:
folder1_path = 'http//10.6.7.9:5647/folder1'
folder2_path = 'http//10.6.7.9:5647/folder1/folder2'
folder3_path = 'http//10.6.7.9:5647/folder1/folder2/folder3'
folder4_path = 'http//10.6.7.9:5647/folder1/folder2/folder3/folder4'
The goal is to create these new string variables by doing string operations on my_url_path.
You can use a clever combination of string split and join. Something like this should work:
def path_to_folder_n(url, n):
    """
    url: str, full url as string
    n: int, level of directories to include from root
    """
    base = 3
    s = url.split('/')
    return '/'.join(s[:base+n])
my_url_path = 'http//10.6.7.9:5647/folder1/folder2/folder3/folder4/df.csv'
# folder 1
print(path_to_folder_n(my_url_path, 1))
# folder 4
print(path_to_folder_n(my_url_path, 4))
# folder 3
print(path_to_folder_n(my_url_path, 3))
Output:
>> http//10.6.7.9:5647/folder1
>> http//10.6.7.9:5647/folder1/folder2/folder3/folder4
>> http//10.6.7.9:5647/folder1/folder2/folder3
Keep in mind you may want to add error checks so that n cannot exceed the number of folders in the URL.
See it in action here: https://repl.it/repls/BelovedUnhealthyBase#main.py
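The error check mentioned above could look like the following sketch. The function name path_to_folder_n_checked and the choice of raising ValueError are assumptions made for illustration; the last path component is treated as the file name, so it is not counted as a folder.

```python
def path_to_folder_n_checked(url, n):
    """Return the url truncated after its n-th folder, validating n first."""
    base = 3  # 'http', '' and the host:port piece produced by split('/')
    parts = url.split('/')
    # Everything between the host and the final component counts as a folder.
    folder_count = len(parts) - base - 1
    if not 1 <= n <= folder_count:
        raise ValueError(f"n must be between 1 and {folder_count}, got {n}")
    return '/'.join(parts[:base + n])


my_url_path = 'http//10.6.7.9:5647/folder1/folder2/folder3/folder4/df.csv'
print(path_to_folder_n_checked(my_url_path, 2))
# http//10.6.7.9:5647/folder1/folder2
```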
To get the name of the parent directory from a string in this format, you could simply do
my_url_path.split('/')[-2]
For a parent further up, subtract the number of levels from the index.
I've made this function, which addresses your problem.
It just uses the split() and join() methods of the str class, and also the takewhile() function of the itertools module, which takes elements from an iterable while the predicate (its first argument) is true.
from itertools import takewhile
def manipulate_path(target, url):
    path_parts = url.split('/')
    partial_output = takewhile(lambda x: x != target, path_parts)
    return "/".join(partial_output) + f'/{target}'
You can use it as follows:
manipulate_path('folder1', my_url_path) # returns 'http//10.6.7.9:5647/folder1'
manipulate_path('folder2', my_url_path) # returns 'http//10.6.7.9:5647/folder1/folder2'
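If all depth levels are needed at once, a list comprehension over the split parts avoids one call per folder. This is a sketch under the same assumptions as the answers above: the first three pieces of split('/') form the scheme and host, and the last piece is a file name. The helper name all_folder_paths is hypothetical.

```python
def all_folder_paths(url):
    """Return the url truncated after each folder, from shallowest to deepest."""
    base = 3  # scheme, empty piece and host:port
    parts = url.split('/')
    folder_count = len(parts) - base - 1  # the last piece is the file name
    return ['/'.join(parts[:base + n]) for n in range(1, folder_count + 1)]


my_url_path = 'http//10.6.7.9:5647/folder1/folder2/folder3/folder4/df.csv'
for path in all_folder_paths(my_url_path):
    print(path)
```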

Conditional copying of files in python

I'm trying to copy files to another directory if their filename starts with the same 4-digit ID as one of the values in my list.
I'm either getting the wrong data written to the file or nothing at all.
What I have so far:
import shutil
import os
ok_ids = [5252,
8396,
8397,
8397,
8556,
8004,
6545,
6541,
4392,
4392,
6548,
1363,
1363,
1363,
8489,
8652,
1368,
1368]
source = os.listdir("/Users/amm/Desktop/mypath1/")
destination = "/Users/amm/Desktop/mypath2/"
for files in source:
    for x in ok_ids:
        if files[:4] == x:
            shutil.copy(files,destination)
        else:
            print("not working")
Sample of the files I'm trying to copy i.e. source
0000_051123_192805.txt
0000_051123_192805.txt
8642_060201_113220.txt
8652_060204_152839.txt
8652_060204_152839.txt
309-_060202_112353.txt
x104_051203_064013.txt
The destination directory stays empty.
A few important things: ok_ids does not contain distinct values, but I'd like the program to treat the list as if it did. For example, 8397 appears in ok_ids twice and does not need to be checked twice in the ok_ids loop (it's a very long list and I don't fancy editing it). The source directory can also contain several files with the same ID (in the example above, 0000 and 8652), where the rest of the filename is different.
So in summary: if 0000 is in my ok_ids list and there are filenames beginning with 0000 in my source directory, then I want to copy them into my destination folder.
I've looked at using .startswith but its not happy using a list as the argument even if i cast it to a tuple and then a str. Any help would be amazing.
UPDATE
Could the reason for this not working be that some of the IDs contain a hyphen, and others start with the character x rather than a digit?
The first 4 values are the ID, for example these are still valid:
309-_060202_112353.txt
x104_051203_064013.txt
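This should work:
for file in source:
    for x in set(ok_ids):
        if file.startswith(str(x)):
            # os.listdir() returns bare names, so prepend the source directory
            shutil.copy(os.path.join("/Users/amm/Desktop/mypath1/", file), destination)
Use set() to make the numbers unique and str() to convert them to strings. For better performance, build the set once before the loop instead of on every iteration.
Or better yet, given your naming constraints:
if int(file.split("_")[0]) in ok_ids:
Note that int() raises ValueError for prefixes such as 309- or x104, so this variant only works when every filename starts with digits.
Why doesn't your code work?
if files[:4] == x:
You're comparing a str with an int, which always evaluates to False.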
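import os
import shutil
for root, dirs, files in os.walk("/Users/amm/Desktop/mypath1/"):
    for file in files:
        try:
            if int(file[:4]) in ok_ids:
                # os.walk yields bare file names, so join them with root
                shutil.copy(os.path.join(root, file), destination)
        except ValueError:
            # the first four characters are not a number (e.g. 'x104'); skip
            pass
This worked for me. The only catch is that it also descends into every subfolder of the source directory.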
Your code works for me with the slight modification of str(x) instead of x.
Try using this to see what it is doing with each file:
for files in source:
    for x in ok_ids:
        if files[:4] == str(x):
            print("File '{}' matched".format(files))
            break
    else:
        print("File '{}' not matched".format(files))
Or, alternatively, convert all the items in ok_ids to strings and then see what this produces:
ok_ids = [str(id) for id in ok_ids]
files_matched = [file for file in source if file[:4] in ok_ids]
files[:4] == x can never be true because x is an integer and files[:4] is a string. It does not matter if the string representation of x matches:
>>> 123 == '123'
False
I've looked at using .startswith but its not happy using a list as the argument even if i cast it to a tuple and then a str. Any help would be amazing.
This is arguably the best way to solve the problem, but you don't just need a tuple - you need the individual ID values to be strings. There is no possible "cast" (they are not really casts) you can perform on ok_ids that affects the elements.
The simplest way to do that is to make a tuple in the first place, and have the elements of the tuple be strings in the first place:
ok_ids = (
'5252',
'8396',
# ...
'1368'
)
If you do not control this data, you can use a generator expression passed to tuple to create the tuple:
ok_ids = tuple(str(x) for x in ok_ids)
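Putting the pieces together, a minimal sketch: convert the IDs to a tuple of distinct strings once, let str.startswith test them all, and join each name with the source directory before copying. The function names matching_files and copy_matching are hypothetical, and the sketch assumes the IDs are exactly the leading characters of the file names.

```python
import os
import shutil


def matching_files(names, ok_ids):
    """Return the names that start with one of the IDs (duplicates in ok_ids ignored)."""
    prefixes = tuple({str(x) for x in ok_ids})  # the set removes repeated IDs
    return [name for name in names if name.startswith(prefixes)]


def copy_matching(source_dir, destination_dir, ok_ids):
    """Copy every matching file from source_dir to destination_dir."""
    for name in matching_files(os.listdir(source_dir), ok_ids):
        # os.listdir returns bare names, so rebuild the full path first.
        shutil.copy(os.path.join(source_dir, name), destination_dir)


sample = ['0000_051123_192805.txt', '8642_060201_113220.txt', '8652_060204_152839.txt']
print(matching_files(sample, [8652, 8652, 1363]))  # ['8652_060204_152839.txt']
```

Because the prefixes are strings, IDs such as '309-' or 'x104' can be added to the list without any conversion errors.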

Python List Comprehensions - Join with For loop

I am trying to generate URLs as follows:
http://ergast.com/api/f1/2000/qualifying?limit=10000
I am using Python to generate URLs for the years 2000 to 2015, and to that end, wrote this code snippet:
url = "http://ergast.com/api/f1/"
year = url.join([str(i) + "/qualifying?limit=10000" + "\n" for i in range(1999, 2016)])
print(year)
The output is:
1999/qualifying?limit=10000
http://ergast.com/api/f1/2000/qualifying?limit=10000
http://ergast.com/api/f1/2001/qualifying?limit=10000
http://ergast.com/api/f1/2002/qualifying?limit=10000
http://ergast.com/api/f1/2003/qualifying?limit=10000
http://ergast.com/api/f1/2004/qualifying?limit=10000
......
http://ergast.com/api/f1/2012/qualifying?limit=10000
http://ergast.com/api/f1/2013/qualifying?limit=10000
http://ergast.com/api/f1/2014/qualifying?limit=10000
http://ergast.com/api/f1/2015/qualifying?limit=10000
How do I get rid of the first line? I tried making the range (2000, 2016), but the same thing happened with the first line being 2000 instead of 1999. What am I doing wrong? How can I fix this?
You can use string formatting for this:
url = 'http://ergast.com/api/f1/{0}/qualifying?limit=10000'
print('\n'.join(url.format(year) for year in range(2000, 2016)))
# http://ergast.com/api/f1/2000/qualifying?limit=10000
# http://ergast.com/api/f1/2001/qualifying?limit=10000
# ...
# http://ergast.com/api/f1/2015/qualifying?limit=10000
UPDATE:
Based on OP's comments to pass these urls in requests.get:
import requests
url_tpl = 'http://ergast.com/api/f1/{0}/qualifying?limit=10000'
# use a list comprehension to get all the urls
all_urls = [url_tpl.format(year) for year in range(2000, 2016)]
for url in all_urls:
    response = requests.get(url)
Instead of using the URL to join the string, use a list comprehension to create the different URLs.
>>> ["http://ergast.com/api/f1/%d/qualifying?limit=10000" % i for i in range(1999, 2016)]
['http://ergast.com/api/f1/1999/qualifying?limit=10000',
'http://ergast.com/api/f1/2000/qualifying?limit=10000',
...
'http://ergast.com/api/f1/2014/qualifying?limit=10000',
'http://ergast.com/api/f1/2015/qualifying?limit=10000']
You could then still use '\n'.join(...) to join all of those into one big string, if you like.
You could use the cleaner and more powerful string formatting as follows:
fmt = "http://ergast.com/api/f1/{y}/qualifying?limit=10000"
urls = [fmt.format(y=y) for y in range(2000, 2016)]
In your code, the use of str.join is questionable because its semantics differ from what you are trying to accomplish. s.join(ls) joins the items of the list ls with the string s as separator: if ls = [l1, l2, ...], it returns l1 + s + l2 + s + ..., and the items must already be strings.
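To see those join semantics in action, here is a small sketch. The separator is placed only between items, never before the first one, which is why the first line of the question's output lacks the base URL. The three-year range is shortened for illustration.

```python
url = "http://ergast.com/api/f1/"
pieces = [str(i) + "/qualifying?limit=10000" for i in range(2000, 2003)]

# join() inserts the separator between items only.
print("-".join(["a", "b", "c"]))  # a-b-c

# Using the URL as the separator leaves the first item without a prefix.
joined = url.join(pieces)

# Prefixing each item instead gives one complete URL per year.
urls = [url + piece for piece in pieces]
print("\n".join(urls))
```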
It's good to understand why this is happening.
For that you need to understand the join function; look at the docs:
Concatenate a list or tuple of words with intervening occurrences of
sep.
That means your url string is repeated between the words you want to concatenate, which produces the output above, where the first element lacks the URL.
What you want is not join but plain concatenation of the strings, as you are already doing with the year.
There are several ways to do that, as the other answers show.
For example, string formatting, as pointed out by @AKS, works.

How do you sort files numerically?

I'm processing some files in a directory and need the files to be sorted numerically. I found some examples on sorting—specifically with using the lambda pattern—at wiki.python.org, and I put this together:
import re
file_names = """ayurveda_1.tif
ayurveda_11.tif
ayurveda_13.tif
ayurveda_2.tif
ayurveda_20.tif
ayurveda_22.tif""".split('\n')
num_re = re.compile(r'_(\d{1,2})\.')
file_names.sort(
key=lambda fname: int(num_re.search(fname).group(1))
)
Is there a better way to do this?
This is called "natural sorting" or "human sorting" (as opposed to lexicographical sorting, which is the default). Ned B wrote up a quick version of one.
import re
def tryint(s):
    try:
        return int(s)
    except ValueError:
        return s
def alphanum_key(s):
    """ Turn a string into a list of string and number chunks.
        "z23a" -> ["z", 23, "a"]
    """
    return [ tryint(c) for c in re.split('([0-9]+)', s) ]
def sort_nicely(l):
    """ Sort the given list in the way that humans expect.
    """
    l.sort(key=alphanum_key)
It's similar to what you're doing, but perhaps a bit more generalized.
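For reference, here is how a key of this kind behaves on the file names from the question. This is a self-contained sketch that declares a compact variant of the key function (natural_key is a name chosen here), not the code quoted above verbatim.

```python
import re


def natural_key(s):
    """Split a string into text and integer chunks: 'z23a' -> ['z', 23, 'a']."""
    return [int(chunk) if chunk.isdecimal() else chunk
            for chunk in re.split(r'([0-9]+)', s)]


file_names = ['ayurveda_1.tif', 'ayurveda_11.tif', 'ayurveda_13.tif',
              'ayurveda_2.tif', 'ayurveda_20.tif', 'ayurveda_22.tif']

print(sorted(file_names, key=natural_key))
# ['ayurveda_1.tif', 'ayurveda_2.tif', 'ayurveda_11.tif',
#  'ayurveda_13.tif', 'ayurveda_20.tif', 'ayurveda_22.tif']
```

Using sorted() returns a new list and leaves file_names untouched; list.sort(key=natural_key) would sort in place instead.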
Just use:
tiffFiles.sort(key=lambda var:[int(x) if x.isdigit() else x for x in re.findall(r'[^0-9]|[0-9]+', var)])
This is faster than using try/except.
If you are using key= in your sort method, you shouldn't use cmp, which has been removed from the latest versions of Python. key should be set to a function that takes a record as input and returns any object that compares in the order you want your list sorted. It doesn't need to be a lambda function and might be clearer as a standalone function. Also, regular expressions can be slow to evaluate.
You could try something like the following to isolate and return the integer part of the file name:
def getint(name):
    basename = name.partition('.')
    alpha, num = basename.split('_')
    return int(num)
tiffiles.sort(key=getint)
@April provided a good solution in "How is Pythons glob.glob ordered?" that you could try.
First, get the files:
import glob
import re
files = glob.glob1(img_folder,'*'+output_image_format)
# Sort files according to the digits included in the filename
files = sorted(files, key=lambda x:float(re.findall(r"(\d+)",x)[0]))
partition() returns a tuple, so unpack it:
def getint(name):
    (basename, part, ext) = name.partition('.')
    (alpha, num) = basename.split('_')
    return int(num)
This is a modified version of @Don O'Donnell's answer, because I couldn't get it working as-is, but I think it's the best answer here as it's well-explained.
def getint(name):
    _, num = name.split('_')
    num, _ = num.split('.')
    return int(num)
print(sorted(tiffFiles, key=getint))
Changes:
1) The alpha string doesn't get stored, as it's not needed (hence _, num)
2) Use num.split('.') to separate the number from the file extension
3) Use sorted instead of list.sort, per https://docs.python.org/2/howto/sorting.html
