How to extract paths in a Url string in Python? - python

I need to extract path component from url string at different depth levels.
If the input is:
http//10.6.7.9:5647/folder1/folder2/folder3/folder4/df.csv
Output should be:
folder1_path = 'http//10.6.7.9:5647/folder1'
folder2_path = 'http//10.6.7.9:5647/folder1/folder2'
folder3_path = 'http//10.6.7.9:5647/folder1/folder2/folder3'
folder4_path = 'http//10.6.7.9:5647/folder1/folder2/folder3/folder4'
The goal is to create these new string variables by doing string operations on my_url_path.

You can use a clever combination of string split and join. Something like this should work:
def path_to_folder_n(url, n):
    """
    url: str, full url as string
    n: int, level of directories to include from root
    """
    # The first three parts of the split are the scheme, an empty string and host:port
    base = 3
    s = url.split('/')
    return '/'.join(s[:base+n])
my_url_path = 'http//10.6.7.9:5647/folder1/folder2/folder3/folder4/df.csv'
# folder 1
print(path_to_folder_n(my_url_path, 1))
# folder 4
print(path_to_folder_n(my_url_path, 4))
# folder 3
print(path_to_folder_n(my_url_path, 3))
Output:
>> http//10.6.7.9:5647/folder1
>> http//10.6.7.9:5647/folder1/folder2/folder3/folder4
>> http//10.6.7.9:5647/folder1/folder2/folder3
Keep in mind you may want to add error checks so that n cannot exceed the number of folders in the path.
See it in action here: https://repl.it/repls/BelovedUnhealthyBase#main.py
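The error check mentioned above could look something like the following. This is only a sketch: the name path_to_folder_n_checked and the assumption that the last component of the URL is a file name are mine, not part of the original answer.

```python
def path_to_folder_n_checked(url, n, base=3):
    """Return url cut off after its n-th folder; raise ValueError if n is out of range."""
    parts = url.split('/')
    # Assumption: the first `base` parts are the scheme, an empty string and
    # host:port, and the last part is a file name rather than a folder.
    folder_count = len(parts) - base - 1
    if not 1 <= n <= folder_count:
        raise ValueError(f"n must be between 1 and {folder_count}, got {n}")
    return '/'.join(parts[:base + n])


my_url_path = 'http//10.6.7.9:5647/folder1/folder2/folder3/folder4/df.csv'
print(path_to_folder_n_checked(my_url_path, 2))  # http//10.6.7.9:5647/folder1/folder2
```

Raising an exception is just one option; returning the deepest available folder path would work too, depending on what the caller expects.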

To get the name of the parent directory from a string in this format, you can simply do:
my_url_path.split('/')[-2]
For a parent further up, decrease the index accordingly (for example, [-3] for the grandparent).
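As a small illustration with the URL from the question (the rsplit line is an extra, in case the full parent path is wanted rather than just the folder name):

```python
my_url_path = 'http//10.6.7.9:5647/folder1/folder2/folder3/folder4/df.csv'

parts = my_url_path.split('/')
parent_name = parts[-2]       # 'folder4'
grandparent_name = parts[-3]  # 'folder3'

# rsplit with maxsplit=1 drops only the last component and keeps the rest of the URL.
parent_path = my_url_path.rsplit('/', 1)[0]

print(parent_name, grandparent_name)
print(parent_path)
```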

I've made this function that addresses your problem.
It just uses split() and join() methods of the str class, and also the takewhile() function of the itertools module, which basically takes elements from an iterable while the predicate (its first argument) is true.
from itertools import takewhile
def manipulate_path(target, url):
    path_parts = url.split('/')
    partial_output = takewhile(lambda x: x != target, path_parts)
    return "/".join(partial_output) + f'/{target}'
You can use it as follows:
manipulate_path('folder1', my_url_path) # returns 'http//10.6.7.9:5647/folder1'
manipulate_path('folder2', my_url_path) # returns 'http//10.6.7.9:5647/folder1/folder2'
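One thing to watch with the takewhile approach: if target never occurs in the URL, takewhile keeps every part and the function returns the full URL with target appended. A possible guarded variant is sketched below; the name path_up_to is made up for this example and it uses list.index instead of takewhile.

```python
def path_up_to(target, url):
    """Return url cut off right after the first path component equal to target."""
    parts = url.split('/')
    if target not in parts:
        raise ValueError(f"{target!r} is not a component of {url!r}")
    # index() finds the first occurrence; +1 keeps the target itself in the result.
    return '/'.join(parts[:parts.index(target) + 1])


my_url_path = 'http//10.6.7.9:5647/folder1/folder2/folder3/folder4/df.csv'
print(path_up_to('folder2', my_url_path))  # http//10.6.7.9:5647/folder1/folder2
```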

Related

Join function is not returning a string value from a list

I am trying to convert a list into a string so I can print it and show my desired outcome. However, for some reason, the join function is not working like it was before.
This is my code:
xno = 5
yno = 10
exis =xno*'x'
list1 = ('.',exis,'-')*yno
str1 = ''.join(list1)
strop = str(str1)
flist = strop.split('-')
'\n'.join(flist)
print(flist)
In Python, join returns a new string; it does not modify the list passed in.
You could set a variable as such:
list_str = '\n'.join(flist)
print(list_str)
Or just pass the result of join straight to print:
print('\n'.join(flist))

Find a specific type of URL when 'n' URLs are provided

This will be my sample data:
lis = ['http://wiki.dbpedia.org/about','http://dbpedia.org/data/Category:Cybercrime.rdf',
'http://dbpedia.org/resource/Stop_Cyberbullying_Day',
'http://dbpedia.org/resource/Category:Cybercrime_in_Canada',
'http://dbpedia.org/resource/Political_repression_of_cyber-dissidents',
'http://creativecommons.org/licenses/by-sa/3.0/']
I have used the following code to filter only those URLs that contain http://dbpedia.org/resource/
c = 'http://dbpedia.org/resource/'
for i in lis:
    if i[:27] is c:
        print (i)
The expected output should be:
http://dbpedia.org/resource/Stop_Cyberbullying_Day
http://dbpedia.org/resource/Category:Cybercrime_in_Canada
http://dbpedia.org/resource/Political_repression_of_cyber-dissidents
But it prints NULL.
There are two issues in your code:
You're using is for comparison, which compares the identity of two objects, not their equality. You want to use == instead.
Your string ('http://dbpedia.org/resource/') is 28 characters long, but i[:27] takes only the first 27 characters of i. Replace i[:27] with i[:28], or better yet use i[:len(c)] so that it changes dynamically with the c string.
All this being said, you should use str.startswith() which essentially does all of this for you:
for i in lis:
    if i.startswith(c):
        print(i)
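The length issue is easy to check directly. The snippet below is just a quick sanity check using one URL from the sample data:

```python
c = 'http://dbpedia.org/resource/'
url = 'http://dbpedia.org/resource/Stop_Cyberbullying_Day'

print(len(c))             # 28
print(url[:27] == c)      # False: the slice is one character too short
print(url[:len(c)] == c)  # True
print(url.startswith(c))  # True
```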
The is operator checks the identity of its operands, not their equality.
Just use str.startswith for your simple case:
lst = ['http://wiki.dbpedia.org/about','http://dbpedia.org/data/Category:Cybercrime.rdf',
'http://dbpedia.org/resource/Stop_Cyberbullying_Day',
'http://dbpedia.org/resource/Category:Cybercrime_in_Canada',
'http://dbpedia.org/resource/Political_repression_of_cyber-dissidents',
'http://creativecommons.org/licenses/by-sa/3.0/']
c = 'http://dbpedia.org/resource/'
for url in lst:
    if url.startswith(c):
        print(url)
The output:
http://dbpedia.org/resource/Stop_Cyberbullying_Day
http://dbpedia.org/resource/Category:Cybercrime_in_Canada
http://dbpedia.org/resource/Political_repression_of_cyber-dissidents
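If the matching URLs are needed later rather than just printed, the same test fits into a list comprehension. A minimal sketch (the name matches is arbitrary):

```python
lst = ['http://wiki.dbpedia.org/about', 'http://dbpedia.org/data/Category:Cybercrime.rdf',
       'http://dbpedia.org/resource/Stop_Cyberbullying_Day',
       'http://dbpedia.org/resource/Category:Cybercrime_in_Canada',
       'http://dbpedia.org/resource/Political_repression_of_cyber-dissidents',
       'http://creativecommons.org/licenses/by-sa/3.0/']
c = 'http://dbpedia.org/resource/'

# Keep only the URLs that begin with the prefix c.
matches = [url for url in lst if url.startswith(c)]
print('\n'.join(matches))
```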

Manipulate the output of if any for loop

I need to compare the following sequences list items:
sequences = ['sphere_v002_', 'sphere_v002_0240_', 'test_single_abc_f401']
to:
folder = 'sphere_v002'
and then work on the list items containing folder.
I have a working function for this but I want to improve it.
Current code is:
foundSeq = False
for seq in sequences:
    headName = os.path.splitext(seq.head())[0]
    # Check name; added exception for when name has a trailing underscore
    if headName == folder or headName[:-1] == folder:
        foundSeq = True
        sequence = seq
if not foundSeq:
    ...
My improvement looks like this:
if any(folder in os.path.splitext(seq.head())[0] for seq in sequences):
    print seq
But then I get the following error:
local variable seq referenced before the assignment
How can I get the correct output working with the improved solution?
any returns only a Boolean value; it does not store in seq the element of sequences for which your condition is satisfied.
What you can do is use a generator that yields only the matching elements:
def get_seq(sequences, folder):
    for seq in sequences:
        if folder in os.path.splitext(seq.head())[0]:
            yield seq
for seq in get_seq(sequences, folder):
    print seq
You can rewrite this, if you wish, as a generator expression:
for seq in (i for i in sequences if folder in os.path.splitext(i.head())[0]):
    print seq
If the condition is never satisfied, the generator or generator expression will not yield any values and the logic within your loop will not be processed.
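If only the first match is wanted, the generator expression can be combined with next() and a default of None, which is falsy and so works naturally in an if test. The sketch below uses the plain strings from the question in place of objects with a head() method, which is a simplification on my part:

```python
sequences = ['sphere_v002_', 'sphere_v002_0240_', 'test_single_abc_f401']
folder = 'sphere_v002'

# next() returns the first item the generator yields, or None if it yields nothing.
first_match = next((seq for seq in sequences if folder in seq), None)

if first_match:
    print(first_match)
else:
    print('no sequence found')
```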
As pointed out by jpp, any just returns a Boolean, so if any is not a good solution in this particular case.
As suggested by thebjorn, the most efficient code for us so far uses the filter function.
sequences = ['sphere_v002_', 'sphere_v002_0240_', 'test_single_abc_f401']
match = list(filter(lambda x: 'sphere_v002' == x[:-1] or 'sphere_v002' == x, sequences))
print(match)
['sphere_v002_']

Python substitute elements inside a list

I have the following code that filters and prints a list. The final output is JSON with entries in the form name.example.com. I want to replace those with name.sub.example.com, but I'm having a hard time actually doing that. filterIP is a working bit of code that removes elements entirely, and I have been trying to reuse it to also modify elements, but it doesn't have to be handled this way.
def filterIP(fullList):
    regexIP = re.compile(r'\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}$')
    return filter(lambda i: not regexIP.search(i), fullList)
def filterSub(fullList2):
    regexSub = re.compile(r'example\.com, sub.example.com')
    return filter(lambda i: regexSub.search(i), fullList2)
groups = {key : filterSub(filterIP(list(set(items)))) for (key, items) in groups.iteritems() }
print(self.json_format_dict(groups, pretty=True))
This is what I get without filterSub
"type_1": [
"server1.example.com",
"server2.example.com"
],
This is what I get with filterSub
"type_1": [],
This is what I'm trying to get
"type_1": [
"server1.sub.example.com",
"server2.sub.example.com"
],
The statement:
regexSub = re.compile(r'example\.com, sub.example.com')
doesn't do what you think it does. It creates a compiled regular expression that matches the string "example.com" followed by a comma, a space, the string "sub", an arbitrary character, the string "example", an arbitrary character, and the string "com". It does not create any sort of substitution.
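A quick way to see this is to try the pattern against a couple of strings. The second test string below is invented purely to show what the pattern does match:

```python
import re

regexSub = re.compile(r'example\.com, sub.example.com')

# The host names from the question never contain ", sub", so nothing matches
# and the filter drops every element.
no_match = regexSub.search('server1.example.com')
print(no_match)  # None

# The unescaped dots match any character, so this odd string does match.
odd_match = regexSub.search('example.com, sub-example-com')
print(bool(odd_match))  # True
```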
Instead, you want to write something like this, using the re.sub function to perform the substitution and using map to apply it:
def filterSub(fullList2):
    regexSub = re.compile(r'example\.com')
    return map(lambda i: re.sub(regexSub, "sub.example.com", i),
               filter(lambda i: re.search(regexSub, i), fullList2))
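On Python 3, map and filter return lazy iterators rather than lists, so a list comprehension may be more convenient if the result is going to be serialized to JSON. The following is a sketch of the same idea, not a drop-in replacement; the names filter_sub and REGEX_SUB are mine:

```python
import re

REGEX_SUB = re.compile(r'example\.com')


def filter_sub(full_list):
    """Keep entries containing example.com and insert 'sub.' in front of it."""
    return [REGEX_SUB.sub('sub.example.com', item)
            for item in full_list
            if REGEX_SUB.search(item)]


hosts = ['server1.example.com', 'server2.example.com', 'other.host.org']
print(filter_sub(hosts))  # ['server1.sub.example.com', 'server2.sub.example.com']
```

Note that running the substitution on an already converted name would add a second "sub.", so it should be applied only once per list.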
If the examples are all truly as simple as those you listed, a regex is probably overkill. A simple solution would be to use string .split and .join. This would likely give better performance.
First split the url at the first period:
url = 'server1.example.com'
split_url = url.split('.', 1)
# ['server1', 'example.com']
Then you can use the sub to rejoin the url:
subbed_url = '.sub.'.join(split_url)
# 'server1.sub.example.com'
Of course you can do the split and the join at the same time
'.sub.'.join(url.split('.', 1))
Or create a simple function:
def sub_url(url):
return '.sub.'.join(url.split('.', 1))
To apply this to the list you can take several approaches.
A list comprehension:
subbed_list = [sub_url(url) for url in url_list]
Map it:
subbed_list = map(sub_url, url_list)
Or my favorite, a generator:
gen_subbed = (sub_url(url) for url in url_list)
The last looks like a list comprehension but gives the added benefit that you don't rebuild the entire list. It processes the elements one item at a time as the generator is iterated through. If you decide you do need the list later you can simply convert it to a list as follows:
subbed_list = list(gen_subbed)

How do you sort files numerically?

I'm processing some files in a directory and need the files to be sorted numerically. I found some examples on sorting—specifically with using the lambda pattern—at wiki.python.org, and I put this together:
import re
file_names = """ayurveda_1.tif
ayurveda_11.tif
ayurveda_13.tif
ayurveda_2.tif
ayurveda_20.tif
ayurveda_22.tif""".split('\n')
num_re = re.compile(r'_(\d{1,2})\.')
file_names.sort(
    key=lambda fname: int(num_re.search(fname).group(1))
)
Is there a better way to do this?
This is called "natural sorting" or "human sorting" (as opposed to lexicographical sorting, which is the default). Ned B wrote up a quick version of one.
import re
def tryint(s):
    try:
        return int(s)
    except ValueError:
        return s
def alphanum_key(s):
    """ Turn a string into a list of string and number chunks.
        "z23a" -> ["z", 23, "a"]
    """
    return [ tryint(c) for c in re.split('([0-9]+)', s) ]
def sort_nicely(l):
    """ Sort the given list in the way that humans expect.
    """
    l.sort(key=alphanum_key)
It's similar to what you're doing, but perhaps a bit more generalized.
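For a sense of the difference, here is a compact, self-contained variant of the same key function applied to the file names from the question. It uses str.isdigit() instead of the tryint helper, which is my simplification:

```python
import re


def alphanum_key(s):
    """Split s into text and number chunks, e.g. 'a_11.tif' -> ['a_', 11, '.tif']."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r'([0-9]+)', s)]


file_names = ['ayurveda_1.tif', 'ayurveda_11.tif', 'ayurveda_13.tif',
              'ayurveda_2.tif', 'ayurveda_20.tif', 'ayurveda_22.tif']

lexicographic = sorted(file_names)               # ..._1, ..._11, ..._13, ..._2, ...
natural = sorted(file_names, key=alphanum_key)   # ..._1, ..._2, ..._11, ..._13, ...
print(natural)
```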
Just use:
tiffFiles.sort(key=lambda var:[int(x) if x.isdigit() else x for x in re.findall(r'[^0-9]|[0-9]+', var)])
This is faster than using try/except.
If you are using key= in your sort method, you shouldn't use cmp, which has been removed in Python 3. key should be set to a function which takes a record as input and returns any object which will compare in the order you want your list sorted. It doesn't need to be a lambda function and might be clearer as a standalone function. Also, regular expressions can be slow to evaluate.
You could try something like the following to isolate and return the integer part of the file name:
def getint(name):
    basename = name.partition('.')
    alpha, num = basename.split('_')
    return int(num)
tiffiles.sort(key=getint)
@April provided a good solution in "How is Pythons glob.glob ordered?" that you could try.
# First, get the files:
import glob
import re
files = glob.glob1(img_folder,'*'+output_image_format)
# Sort files according to the digits included in the filename
files = sorted(files, key=lambda x:float(re.findall(r"(\d+)",x)[0]))
partition returns a tuple, so unpack it:
def getint(name):
    (basename, part, ext) = name.partition('.')
    (alpha, num) = basename.split('_')
    return int(num)
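Used as a sort key on a few of the file names from the question, the unpacking version behaves as expected. The list name tiff_files is just a placeholder for this example:

```python
def getint(name):
    # partition('.') returns (text before the dot, '.', text after the dot)
    basename, _, _ = name.partition('.')
    _, num = basename.split('_')
    return int(num)


tiff_files = ['ayurveda_11.tif', 'ayurveda_2.tif', 'ayurveda_1.tif']
tiff_files.sort(key=getint)
print(tiff_files)  # ['ayurveda_1.tif', 'ayurveda_2.tif', 'ayurveda_11.tif']
```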
This is a modified version of @Don O'Donnell's answer, because I couldn't get it working as-is, but I think it's the best answer here as it's well-explained.
def getint(name):
    _, num = name.split('_')
    num, _ = num.split('.')
    return int(num)
print(sorted(tiffFiles, key=getint))
Changes:
1) The alpha string doesn't get stored, as it's not needed (hence _, num)
2) Use num.split('.') to separate the number from the file extension
3) Use sorted instead of list.sort, per https://docs.python.org/2/howto/sorting.html
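Both split-based versions assume exactly one underscore and one dot in the file name. If that might not hold, a variant along these lines could be more forgiving. It is only a sketch, and the example name 'my_scan_7.tif' is invented:

```python
from pathlib import PurePath


def getint(name):
    # stem drops the final extension: 'my_scan_7.tif' -> 'my_scan_7'
    stem = PurePath(name).stem
    # rsplit with maxsplit=1 takes only the text after the last underscore.
    return int(stem.rsplit('_', 1)[1])


names = ['my_scan_7.tif', 'ayurveda_11.tif', 'ayurveda_2.tif']
print(sorted(names, key=getint))  # ['ayurveda_2.tif', 'my_scan_7.tif', 'ayurveda_11.tif']
```

A name without any underscore would still raise an IndexError here, so some validation may be needed for arbitrary directories.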
