Extracting file name without number from the end - python

How do extract file name from with numbers on both end?
I have extracted file name 56flybox007 using :
filter(lambda x: x.isalpha(), 56flybox007)
Results in flybox but I want to remove numbers from prefix part so result be like: 56flybox

Try Using this code:
import string
Sample = "56flybox00"
cleaned = Sample.rstrip(string.digits)
print(cleaned)
Output:
56flybox

I would use a regex here, even if there are a lot of other ways to achieve this, regex are powerful and could be easily modified if your needs change:
import re
rx = re.search(r'(\d*\D+)\d*', '123abc456')
print(rx.group(1)) # >>> '123abc'

Try this.
file = "56flybox007"
file[:file.find(filter(lambda x: x.isalpha(), file))]+filter(lambda x: x.isalpha(), file)

You can use the rstrip method of string to remove characters from the right hand side of the string. In this case you pass all the digits to rstrip and it will remove them from just the right hand side.
files = ["56flybox007", "45NotherFile456", "78LasstFile45"]
out_files = [file.rstrip("0123456789") for file in files]
print(files)
print(out_files)
OUTPUT
['56flybox007', '45NotherFile456', '78LasstFile45']
['56flybox', '45NotherFile', '78LasstFile']

Related

Delete all characters that come after a given string

how exactly can I delete characters after .jpg? is there a way to differentiate between the extension I take with python and what follows?
for example I have a link like that
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC
How can I delete everything after .jpg?
I tried replacing but it didn't work
another way?
Use a forum to count strings or something like ?
I tried to get jpg files with this
for link in links:
res = requests.get(link).text
soup = BeautifulSoup(res, 'html.parser')
img_links = []
for img in soup.select('a.thumbnail img[src]'):
print(img["src"])
with open('links'+'.csv', 'a', encoding = 'utf-8', newline='') as csv_file:
file_is_empty = os.stat(self.filename+'.csv').st_size == 0
fieldname = ['links']
writer = csv.DictWriter(csv_file, fieldnames = fieldname)
if file_is_empty:
writer.writeheader()
writer.writerow({'links':img["src"]})
img_links.append(img["src"])
You could use split (assuming the string has 'jpg', otherwise the code below will just return the original url).
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
jpg_removed = string.split('.jpg')[0]+'.jpg'
Example
string = 'www.google.com'
com_removed = string.split('.com')[0]
# com_removed = 'www.google'
You can make use of regular expression. You just want to ignore the characters after .jpg so you can some use of something like this:
import re
new_url=re.findall("(.*\.jpg).*",old_url)[0]
(.*\.jpg) is like a capturing group where you're matching any number of characters before .jpg. Since . has a special meaning you need to escape the . in jpg with a \. .* is used to match any number of character but since this is not inside the capturing group () this will get matched but won't get extracted.
You can use the .find function to find the characters .jpg then you can index the string to get everything but that. Ex:
string = https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC
index = string.find(".jpg")
new_string = string[:index+ 4]
You have to add four because that is the length of jpg so it does not delete that too.
The find() method returns the lowest index of the substring if it is found in given string. If its is not found then it returns -1.
str ='https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
result = str.find('jpg')
print(result)
new_str = str[:result]
print(new_str+'jpg')
See: Extracting extension from filename in Python
Instead of extracting the extension, we extract the filename and add the extension (if we know it's always .jpg, it's fine!)
import os
filename, file_extension = os.path.splitext('/path/to/somefile.jpg_corruptedpath')
result = filename + '.jpg'
Now, outside of the original question, I think there might be something wrong with how you got that piece of information int he first place. There must be a better way to extract that jpeg without messing around with the path. Sadly I can't help you with that since I a novice with BeautifulSoup.
You could use a regular expression to replace everything after .jpg with an empty string:
import re
url ='https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
name = re.sub(r'(?<=\.jpg).*',"",url)
print(name)
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg

Reverse a list in python with a condition

I want to sort this list in this way:
.log suffix should be the first item
and .gz file should be in a descending order
my_list = [
'/abc/a.log.1.gz',
'/abc/a.log',
'/abc/a.log.30.gz',
'/abc/a.log.2.gz',
'/abc/a.log.5.gz',
'/abc/a.log.3.gz',
'/abc/a.log.6.gz',
'/abc/a.log.4.gz',
]
expected_result:
my_list = ['/abc/a.log',
'/abc/a.log.30.gz',
'/abc/a.log.6.gz',
'/abc/a.log.5.gz',
'/abc/a.log.4.gz',
'/abc/a.log.3.gz',
'/abc/a.log.2.gz'
'/abc/a.log.1.gz']
reversed(mylist) is also not getting me the desired solution.
I would use this sort helper:
import re
def sort_helper(i:str):
m = re.search('^.*?(\d+)(\.gz)',i)
try:
return int(m.group(1))
except AttributeError:
return 0
print(sorted(my_list, key= sort_helper, reverse=True))
Perhaps using a re is overkill, but it is a flexible tool.
To get around the lack of a leading zero in your filenames, return an int.
also, note the use of the lazy operator at the start of the regular expression:
^.*?
not just
^.*
this matches a little as possible, letting the match for the numbers in the filename be as greedy as possible.
it seems like you are trying to sort file names. I would recommend using os.path to manipulate these strings.
First you can use os.path.splitext split out the extension to compare between .log or .gz. Then strip off the extension again to get the file number, and convert it to an integer.
For example:
import os
def get_sort_keys(filepath):
split_file_path = os.path.splitext(filepath)
sort_key = (split_file_path[1], *os.path.splitext(split_file_path[0]))
return (sort_key[0], sort_key[1], int(sort_key[2].strip(".")) if sort_key[2] else 0)
print(sorted(my_list, key=get_sort_keys, reverse=
I am relying on the fact that the log extension will sort after gz lexicographically.

Regex to match strings in a list without .csv extension

How can i write a regular expression to only match the string names without .csv extension. This should be the required output
Required Output:
['ap_2010', 'class_size', 'demographics', 'graduation','hs_directory', 'sat_results']
Input:
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"]
I tried but it return a empty list.
for i in data_files:
regex = re.findall(r'/w+/_[/d{4}][/w*]?', i)
If you really want to use a regular expression, you can use re.sub to remove the extension if it exists, and if not, leave the string alone:
[re.sub(r'\.csv$', '', i) for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
A better approach in general is using the os module to handle anything to do with filenames:
[os.path.splitext(i)[0] for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
If you want regex, the solution is r'(.*)\.csv:
for i in data_files:
regex = re.findall(r'(.*)\.csv', i)
print(regex)
Split the string at '.' and then take the last element of the split (using index [-1]). If this is 'csv' then it is a csv file.
for i in data_files:
if i.split('.')[-1].lower() == 'csv':
# It is a CSV file
else:
# Not a CSV
# Input
data_files = [ 'ap_2010.csv', 'class_size.csv', 'demographics.csv', 'graduation.csv', 'hs_directory.csv', 'sat_results.csv' ]
import re
pattern = '(?P<filename>[a-z0-9A-Z_]+)\.csv'
prog = re.compile(pattern)
# `map` function yields:
# - a `List` in Python 2.x
# - a `Generator` in Python 3.x
result = map(lambda data_file: re.search(prog, data_file).group('filename'), data_files)
l = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"]
print([i.rstrip('.'+i.split('.')[-1]) for i in l])

Extract int between two different strings in python

I have a list files of strings of the following format:
files = ['/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_418000.caffemodel.h5',
'/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_502000.caffemodel.h5', ...]
I want to extract the int between iter_ and .caffemodel and return a list of those ints.
After some research I came up with this solution that does the trick, but I was wondering if there is a more elegant/pythonic way to do it, possibly using a list comprehension?
li = []
for f in files:
tmp = re.search('iter_[\d]+.caffemodel', f).group()
li.append(int(re.search(r'\d+', tmp).group()))
Just to add another possible solution: join the file names together into one big string (looks like the all end with h5, so there is no danger of creating unwanted matches) and use re.findall on that:
import re
li = [int(d) for d in re.findall(r'iter_(\d+)\.caffemodel', ''.join(files))]
Use just:
li = []
for f in files:
tmp = int(re.search('iter_(\d+)\.caffemodel', f).group(1))
li.append(tmp)
If you put an expression into parenthesis it creates another group of matched expressions.
You can also use a lookbehind assertion:
regex = re.compile("(?<=iter_)\d+")
for f in files:
number = regex.search(f).group(0)
Solution with list comprehension, as you wished:
import re
re_model_id = re.compile(r'iter_(?P<model_id>\d+).caffemodel')
li = [int(re_model_id.search(f).group('model_id')) for f in files]
Without a regex:
files = [
'/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_418000.caffemodel.h5',
'/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_502000.caffemodel.h5']
print([f.rsplit("_", 1)[1].split(".", 1)[0] for f in files])
['418000', '502000']
Or if you want to be more specific:
print([f.rsplit("iter_", 1)[1].split(".caffemodel", 1)[0] for f in files])
But your pattern seems to repeat so the first solution is probably sufficient.
You can also slice using find and rfind:
print( [f[f.find("iter_")+5: f.rfind("caffe")-1] for f in files])
['418000', '502000']

Split stock quote to tokens in Python

I need to read a text file of stock quotes and do some processing with each stock data (i.e. a line in the file).
The stock data looks like this :
[class,'STOCK'],[symbol,'AAII'],[open,2.60],[high,2.70],[low,2.53],[close,2.60],[volume,458500],[date,'21-Dec-04'],[openClosePDiff,0.0],[highLowPDiff,0.067],[closeEqualsLow,'false'],[closeEqualsHigh,'false']
How do I split the line into tokens where each token is what is enclosed in the square braces, .i.e. for the above line, the tokens should be "class, 'STOCK'" , "symbol, 'AAII'" etc.
print(re.findall("\[(.*?)\]", inputline))
Or perhaps without regex:
print(inputline[1:-1].split("],["))
Try this code:
#!/usr/bin/env python3
import re
str="[class,'STOCK'],[symbol,'AAII'],[open,2.60],[high,2.70],[low,2.53],[close,2.60],[volume,458500],[date,'21-Dec-04'],[openClosePDiff,0.0],[highLowPDiff,0.067],[closeEqualsLow,'false'],[closeEqualsHigh,'false']"
str = re.sub('^\[','',str)
str = re.sub('\]$','',str)
array = str.split("],[")
for line in array:
print(line)
Start with:
re.findall("[^,]+,[^,]+", a)
This would give you a list of:
[class,'STOCK'], [symbol,'AAII'] and such, then you could cut the brackets.
If you want a functional one liner, use:
map(lambda x: x[1:-1], re.findall("[^,]+,[^,]+", a))
The first part splits every second ,, the map (for each item in the list, use the lambda function..) cuts the brackets.
import re
s = "[class,'STOCK'],[symbol,'AAII'],[open,2.60],[high,2.70],[low,2.53],[close,2.60],[volume,458500],[date,'21-Dec-04'],[openClosePDiff,0.0],[highLowPDiff,0.067],[closeEqualsLow,'false'],[closeEqualsHigh,'false']"
m = re.findall(r"([a-zA-Z0-9]+),([a-zA-Z0-9']+)", s)
d= { x[0]:x[1] for x in m }
print d
you can run the snippet here : http://liveworkspace.org/code/EZGav$35

Categories