Split string every nth character from the right? - python

I have different very large sets of files which I'd like to put in different subfolders. I already have an consecutive ID for every folder I want to use.
I want to split the ID from the right to always have 1000 folders in the deeper levels.
Example:
id: 100243 => resulting_path: './100/243'
id: 1234567890 => resulting path: '1/234/567/890'
I found Split string every nth character?, but all solutions are from left to right and I also did not want to import another module for one line of code.
My current (working) solution looks like this:
import os
base_path = '/home/made'
n=3 # take every 'n'th from the right
max_id = 12345678900
test_id = 24102442
# current algorithm
str_id = str(test_id).zfill(len(str(max_id)))
ext_path = list(reversed([str_id[max(i-n,0):i] for i in range(len(str_id),0,-n)]))
print(os.path.join(base_path, *ext_path))
Output is: /home/made/00/024/102/442
The current algorithm looks awkward and complicated for the simple thing I want to do.
I wonder if there is a better solution. If not it might help others, anyway.
Update:
I really like Joe Iddons solution. Using .join and mod makes it faster and more readable.
In the end I decided that I never want to have a /in front. To get rid of the preceeding /in case len(s)%3is zero, I changed the line to
'/'.join([s[max(0,i):i+3] for i in range(len(s)%3-3*(len(s)%3 != 0), len(s), 3)])
Thank you for your great help!
Update 2:
If you are going to use os.path.join (like in my previous code) its even simpler since os.path.jointakes care of the format of the args itself:
ext_path = [s[0:len(s)%3]] + [s[i:i+3] for i in range(len(s)%3, len(s), 3)]
print(os.path.join('/home', *ext_path))

You can adapt the answer you linked, and use the beauty of mod to create a nice little one-liner:
>>> s = '1234567890'
>>> '/'.join([s[0:len(s)%3]] + [s[i:i+3] for i in range(len(s)%3, len(s), 3)])
'1/234/567/890'
and if you want this to auto-add the dot for the cases like your first example of:
s = '100243'
then you can just add a mini ternary use or as suggested by #MosesKoledoye:
>>> '/'.join(([s[0:len(s)%3] or '.']) + [s[i:i+3] for i in range(len(s)%3, len(s), 3)])
'./100/243'
This method will also be faster than reversing the string before hand or reversing a list.

Then if you got a solution for the direction left to right, why not simply reverse the input and output ?
str = '1234567890'
str[::-1]
Output:
'0987654321'
You can use the solution you found for left to right and then, you simply need to reverse it again.

You could use regex and modulo to split the strings into groups of three. This solution should get you started:
import re
s = [100243, 1234567890]
final_s = ['./'+'/'.join(re.findall('.{2}.', str(i))) if len(str(i))%3 == 0 else str(i)[:len(str(i))%3]+'/'+'/'.join(re.findall('.{2}.', str(i)[len(str(i))%3:])) for i in s]
Output:
['./100/243', '1/234/567/890']

Try this:
>>> line = '1234567890'
>>> n = 3
>>> rev_line = line[::-1]
>>> out = [rev_line[i:i+n][::-1] for i in range(0, len(line), n)]
>>> ['890', '567', '234', '1']
>>> "/".join(reversed(out))
>>> '1/234/567/890'

Related

PySpark / Python Slicing and Indexing Issue

Can someone let me know how to pull out certain values from a Python output.
I would like the retrieve the value 'ocweeklyreports' from the the following output using either indexing or slicing:
'config': '{"hiveView":"ocweeklycur.ocweeklyreports"}
This should be relatively easy, however, I'm having problem defining the Slicing / Indexing configuation
The following will successfully give me 'ocweeklyreports'
myslice = config['hiveView'][12:30]
However, I need the indexing or slicing modified so that I will get any value after'ocweeklycur'
I'm not sure what output you're dealing with and how robust you're wanting it but if it's just a string you can do something similar to this (for a quick and dirty solution).
input = "Your input"
indexStart = input.index('.') + 1 # Get the index of the input at the . which is where you would like to start collecting it
finalResponse = input[indexStart:-2])
print(finalResponse) # Prints ocweeklyreports
Again, not the most elegant solution but hopefully it helps or at least offers a starting point. Another more robust solution would be to use regex but I'm not that skilled in regex at the moment.
You could almost all of it using regex.
See if this helps:
import re
def search_word(di):
st = di["config"]["hiveView"]
p = re.compile(r'^ocweeklycur.(?P<word>\w+)')
m = p.search(st)
return m.group('word')
if __name__=="__main__":
d = {'config': {"hiveView":"ocweeklycur.ocweeklyreports"}}
print(search_word(d))
The following worked best for me:
# Extract the value of the "hiveView" key
hive_view = config['hiveView']
# Split the string on the '.' character
parts = hive_view.split('.')
# The value you want is the second part of the split string
desired_value = parts[1]
print(desired_value) # Output: "ocweeklyreports"

In Python, how to remove items in a list based on the specific string format?

I have a Python list as below:
merged_cells_lst = [
'P19:Q19
'P20:Q20
'P21:Q21
'P22:Q22
'P23:Q23
'P14:Q14
'P15:Q15
'P16:Q16
'P17:Q17
'P18:Q18
'AU9:AV9
'P10:Q10
'P11:Q11
'P12:Q12
'P13:Q13
'A6:P6
'A7:P7
'D9:AJ9
'AK9:AQ9
'AR9:AT9'
'A1:P1'
]
I only want to unmerge the cells in the P and Q columns. Therefore, I seek to remove any strings/items in the merged_cells_lst that does not have the format "P##:Q##".
I think that regex is the best and most simple way to go about this. So far I have the following:
for item in merge_cell_lst:
if re.match(r'P*:Q*'):
pass
else:
merged_cell_lst.pop(item)
print(merge_cell_lst)
The code however is not working. I could use any additional tips/help. Thank you!
Modifying a list while looping over it causes troubles. You can use list comprehension instead to create a new list.
Also, you need a different regex expression. The current pattern P*:Q* matches PP:QQQ, :Q, or even :, but not P19:Q19.
import re
merged_cells_lst = ['P19:Q19', 'P20:Q20', 'P21:Q21', 'P22:Q22', 'P23:Q23', 'P14:Q14', 'P15:Q15', 'P16:Q16', 'P17:Q17', 'P18:Q18', 'AU9:AV9', 'P10:Q10', 'P11:Q11', 'P12:Q12', 'P13:Q13', 'A6:P6', 'A7:P7', 'D9:AJ9', 'AK9:AQ9', 'AR9:AT9', 'A1:P1']
p = re.compile(r"P\d+:Q\d+")
output = [x for x in merged_cells_lst if p.match(x)]
print(output)
# ['P19:Q19', 'P20:Q20', 'P21:Q21', 'P22:Q22', 'P23:Q23', 'P14:Q14', 'P15:Q15',
# 'P16:Q16', 'P17:Q17', 'P18:Q18', 'P10:Q10', 'P11:Q11', 'P12:Q12', 'P13:Q13']
Your list has some typos, should look something like this:
merged_cells_lst = [
'P19:Q19',
'P20:Q20',
'P21:Q21', ...]
Then something as simple as:
x = [k for k in merged_cells_lst if k[0] == 'P']
would work. This is assuming that you know a priori that the pattern you want to remove follows the Pxx:Qxx format. If you want a dynamic solution then you can replace the condition in the list comprehension with a regex match.

How to traverse dictionary keys in sorted order

I am reading a cfg file, and receive a dictionary for each section. So, for example:
Config-File:
[General]
parameter1="Param1"
parameter2="Param2"
[FileList]
file001="file1.txt"
file002="file2.txt" ......
I have the FileList section stored in a dictionary called section. In this example, I can access "file1.txt" as test = section["file001"], so test == "file1.txt". To access every file of FileList one after the other, I could try the following:
for i in range(1, (number_of_files + 1)):
access_key = str("file_00" + str(i))
print(section[access_key])
This is my current solution, but I don't like it at all. First of all, it looks kind of messy in python, but I will also face problems when more than 9 files are listed in the config.
I could also do it like:
for i in range(1, (number_of_files + 1)):
if (i <= 9):
access_key = str("file_00" + str(i))
elif (i > 9 and i < 100):
access_key = str("file_0" + str(i))
print(section[access_key])
But I don't want to start with that because it becomes even worse. So my question is: What would be a proper and relatively clean way to go through all the file names in order? I definitely need the loop because I need to perform some actions with every file.
Use zero padding to generate the file number (for e.g. see this SO question answer: https://stackoverflow.com/a/339013/3775361). That way you don’t have to write the logic of moving through digit rollover yourself—you can use built-in Python functionality to do it for you. If you’re using Python 3 I’d also recommend you try out f-strings (one of the suggested solutions at the link above). They’re awesome!
If we can assume the file number has three digits, then you can do the followings to achieve zero padding. All of the below returns "015".
i = 15
str(i).zfill(3)
# or
"%03d" % i
# or
"{:0>3}".format(i)
# or
f"{i:0>3}"
Start by looking at the keys you actually have instead of guessing what they might be. You need to filter out the ones that match your pattern, and sort according to the numerical portion.
keys = [key for key in section.keys() if key.startswith('file') and key[4:].isdigit()]
You can add additional conditions, like len(key) > 4, or drop the conditions entirely. You might also consider learning regular expressions to make the checking more elegant.
To sort the names without having to account for padding, you can do something like
keys = sorted(keys, key=lambda s: int(s[4:]))
You can also try a library like natsort, which will handle the custom sort key much more generally.
Now you can iterate over the keys and do whatever you want:
for key in sorted((k for k in section if k.startswith('file') and k[4:].isdigit()), key=lambda s: int(s[4:])):
print(section[key])
Here is what a solution equipt with re and natsort might look like:
import re
from natsort import natsorted
pattern = re.compile(r'file\d+')
for key in natsorted(k for k in section if pattern.fullmatch(k)):
print(section[key])

Get the last word after / in url python

I like simplify my code for get the last word after /
any suggestion?
def downloadRepo(repo):
pos1=repo[::-1].index("/")
salida=repo[::-1][:pos1]
print(salida[::-1])
downloadRepo("https://github.com/byt3bl33d3r/arpspoof")
Thanks in advance!
You can use str.rsplit and negative indexing:
"https://github.com/byt3bl33d3r/arpspoof".rsplit('/', 1)[-1]
# 'arpspoof'
You can also stick with indexes and use str.rfind:
s = "https://github.com/byt3bl33d3r/arpspoof"
index = s.rfind('/')
s[index+1:]
# 'arpspoof'
The latter is more memory efficient, since the split methods build in-memory lists which contain all the split tokens, including the spurious ones from the front that we don't use.
You may use
string = "https://github.com/byt3bl33d3r/arpspoof"
last_part = string.split("/")[-1]
print(last_part)
Which yields
arpspoof
Timing rsplit() vs split() yields (on my Macbook Air) the following results:
import timeit
def schwobaseggl():
return "https://github.com/byt3bl33d3r/arpspoof".rsplit('/', 1)[-1]
def jan():
return "https://github.com/byt3bl33d3r/arpspoof".split("/")[-1]
print(timeit.timeit(schwobaseggl, number=10**6))
print(timeit.timeit(jan, number=10**6))
# 0.347005844116
# 0.379151821136
So the rsplit alternative is indeed slightly faster (running it a 1.000.000 times that is).

Sorting with two digits in string - Python

I am new to Python and I have a hard time solving this.
I am trying to sort a list to be able to human sort it 1) by the first number and 2) the second number. I would like to have something like this:
'1-1bird'
'1-1mouse'
'1-1nmouses'
'1-2mouse'
'1-2nmouses'
'1-3bird'
'10-1birds'
(...)
Those numbers can be from 1 to 99 ex: 99-99bird is possible.
This is the code I have after a couple of headaches. Being able to then sort by the following first letter would be a bonus.
Here is what I've tried:
#!/usr/bin/python
myList = list()
myList = ['1-10bird', '1-10mouse', '1-10nmouses', '1-10person', '1-10cat', '1-11bird', '1-11mouse', '1-11nmouses', '1-11person', '1-11cat', '1-12bird', '1-12mouse', '1-12nmouses', '1-12person', '1-13mouse', '1-13nmouses', '1-13person', '1-14bird', '1-14mouse', '1-14nmouses', '1-14person', '1-14cat', '1-15cat', '1-1bird', '1-1mouse', '1-1nmouses', '1-1person', '1-1cat', '1-2bird', '1-2mouse', '1-2nmouses', '1-2person', '1-2cat', '1-3bird', '1-3mouse', '1-3nmouses', '1-3person', '1-3cat', '2-14cat', '2-15cat', '2-16cat', '2-1bird', '2-1mouse', '2-1nmouses', '2-1person', '2-1cat', '2-2bird', '2-2mouse', '2-2nmouses', '2-2person']
def mysort(x,y):
x1=""
y1=""
for myletter in x :
if myletter.isdigit() or "-" in myletter:
x1=x1+myletter
x1 = x1.split("-")
for myletter in y :
if myletter.isdigit() or "-" in myletter:
y1=y1+myletter
y1 = y1.split("-")
if x1[0]>y1[0]:
return 1
elif x1[0]==y1[0]:
if x1[1]>y1[1]:
return 1
elif x1==y1:
return 0
else :
return -1
else :
return -1
myList.sort(mysort)
print myList
Thanks !
Martin
You have some good ideas with splitting on '-' and using isalpha() and isdigit(), but then we'll use those to create a function that takes in an item and returns a "clean" version of the item, which can be easily sorted. It will create a three-digit, zero-padded representation of the first number, then a similar thing with the second number, then the "word" portion (instead of just the first character). The result looks something like "001001bird" (that won't display - it'll just be used internally). The built-in function sorted() will use this callback function as a key, taking each element, passing it to the callback, and basing the sort order on the returned value. In the test, I use the * operator and the sep argument to print it without needing to construct a loop, but looping is perfectly fine as well.
def callback(item):
phrase = item.split('-')
first = phrase[0].rjust(3, '0')
second = ''.join(filter(str.isdigit, phrase[1])).rjust(3, '0')
word = ''.join(filter(str.isalpha, phrase[1]))
return first + second + word
Test:
>>> myList = ['1-10bird', '1-10mouse', '1-10nmouses', '1-10person', '1-10cat', '1-11bird', '1-11mouse', '1-11nmouses', '1-11person', '1-11cat', '1-12bird', '1-12mouse', '1-12nmouses', '1-12person', '1-13mouse', '1-13nmouses', '1-13person', '1-14bird', '1-14mouse', '1-14nmouses', '1-14person', '1-14cat', '1-15cat', '1-1bird', '1-1mouse', '1-1nmouses', '1-1person', '1-1cat', '1-2bird', '1-2mouse', '1-2nmouses', '1-2person', '1-2cat', '1-3bird', '1-3mouse', '1-3nmouses', '1-3person', '1-3cat', '2-14cat', '2-15cat', '2-16cat', '2-1bird', '2-1mouse', '2-1nmouses', '2-1person', '2-1cat', '2-2bird', '2-2mouse', '2-2nmouses', '2-2person']
>>> print(*sorted(myList, key=callback), sep='\n')
1-1bird
1-1cat
1-1mouse
1-1nmouses
1-1person
1-2bird
1-2cat
1-2mouse
1-2nmouses
1-2person
1-3bird
1-3cat
1-3mouse
1-3nmouses
1-3person
1-10bird
1-10cat
1-10mouse
1-10nmouses
1-10person
1-11bird
1-11cat
1-11mouse
1-11nmouses
1-11person
1-12bird
1-12mouse
1-12nmouses
1-12person
1-13mouse
1-13nmouses
1-13person
1-14bird
1-14cat
1-14mouse
1-14nmouses
1-14person
1-15cat
2-1bird
2-1cat
2-1mouse
2-1nmouses
2-1person
2-2bird
2-2mouse
2-2nmouses
2-2person
2-14cat
2-15cat
2-16cat
You need leading zeros. Strings are sorted alphabetically with the order different from the one for digits. It should be
'01-1bird'
'01-1mouse'
'01-1nmouses'
'01-2mouse'
'01-2nmouses'
'01-3bird'
'10-1birds'
As you you see 1 goes after 0.
The other answers here are very respectable, I'm sure, but for full credit you should ensure that your answer fits on a single line and uses as many list comprehensions as possible:
import itertools
[''.join(r) for r in sorted([[''.join(x) for _, x in
itertools.groupby(v, key=str.isdigit)]
for v in myList], key=lambda v: (int(v[0]), int(v[2]), v[3]))]
That should do nicely:
['1-1bird',
'1-1cat',
'1-1mouse',
'1-1nmouses',
'1-1person',
'1-2bird',
'1-2cat',
'1-2mouse',
...
'2-2person',
'2-14cat',
'2-15cat',
'2-16cat']

Categories