Delete all characters that come after a given string - python

How exactly can I delete the characters after .jpg? Is there a way in Python to separate the extension from what follows it?
For example, I have a link like this:
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC
How can I delete everything after .jpg?
I tried replace, but it didn't work.
Is there another way?
Use a function to count characters, or something like that?
I collected the .jpg links with this:
for link in links:
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'html.parser')
    img_links = []
    for img in soup.select('a.thumbnail img[src]'):
        print(img["src"])
        with open('links'+'.csv', 'a', encoding = 'utf-8', newline='') as csv_file:
            file_is_empty = os.stat(self.filename+'.csv').st_size == 0
            fieldname = ['links']
            writer = csv.DictWriter(csv_file, fieldnames = fieldname)
            if file_is_empty:
                writer.writeheader()
            writer.writerow({'links':img["src"]})
        img_links.append(img["src"])

You could use split. This assumes the string contains '.jpg'; if it does not, split returns the whole string as its only element, so the code below would append '.jpg' to the unchanged URL.
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
jpg_removed = string.split('.jpg')[0]+'.jpg'
Example:
string = 'www.google.com'
com_removed = string.split('.com')[0]
# com_removed = 'www.google'
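A minimal sketch of a variant that leaves the string untouched when the marker is missing, using str.partition instead of split. The helper name truncate_after is made up for illustration.

```python
def truncate_after(text, marker=".jpg"):
    """Cut text off right after the first occurrence of marker.

    If marker does not occur, text is returned unchanged.
    """
    head, sep, _tail = text.partition(marker)
    # sep is the empty string when marker was not found
    return head + sep


url = ("https://s13emagst.akamaized.net/products/29146/29145166/images/"
       "res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC")
print(truncate_after(url))
print(truncate_after("www.google.com"))  # no ".jpg", so printed unchanged
```

partition always returns three parts, so no index check is needed.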

You can use a regular expression. You just want to ignore the characters after .jpg, so you can use something like this:
import re
new_url = re.findall(r"(.*\.jpg).*", old_url)[0]
(.*\.jpg) is a capturing group that matches any number of characters followed by .jpg. Since . has a special meaning in a regex, the . in .jpg has to be escaped as \. (the pattern is written as a raw string so Python passes the backslash through unchanged). The trailing .* matches the remaining characters, but because it is outside the capturing group () they are matched without being extracted. Note that .* is greedy, so if .jpg occurs more than once the group extends to the last occurrence.
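A small sketch of the same regex idea with a non-greedy quantifier, so the match stops at the first .jpg, and with a fallback when there is no match at all. The names JPG_PREFIX and cut_after_jpg are hypothetical.

```python
import re

# .*? is non-greedy: it stops at the first ".jpg" instead of the last one.
JPG_PREFIX = re.compile(r".*?\.jpg")


def cut_after_jpg(url):
    """Return url up to and including the first '.jpg', or url itself if absent."""
    match = JPG_PREFIX.match(url)
    return match.group(0) if match else url


print(cut_after_jpg("http://example.com/a.jpg10DCC3DD"))
print(cut_after_jpg("http://example.com/a.png"))
```

Using match and checking for None avoids the IndexError that indexing an empty findall result would raise.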

You can use the .find method to locate the characters .jpg, then slice the string to keep everything up to and including them. Ex:
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
index = string.find(".jpg")
new_string = string[:index + 4]
You have to add four because that is the length of .jpg, so the slice does not cut that off too.

The find() method returns the lowest index of the substring if it is found in the given string. If it is not found, it returns -1.
url = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
result = url.find('jpg')
print(result)
new_url = url[:result]  # everything before 'jpg', including the dot
print(new_url + 'jpg')
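Since find() returns -1 when the substring is missing, slicing with that value would silently chop characters off the end. A hedged sketch that checks for -1 first (cut_after is a hypothetical helper name):

```python
def cut_after(text, marker=".jpg"):
    """Keep text up to and including the first marker; return text unchanged if absent."""
    index = text.find(marker)
    if index == -1:
        # Marker not present: do not slice with -1
        return text
    return text[:index + len(marker)]


print(cut_after("res_cd1fa80f.jpg10DCC3DD9E74"))
print(cut_after("no_extension_here"))
```

Using len(marker) instead of a hard-coded 4 keeps the slice correct for other extensions such as ".jpeg".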

See: Extracting extension from filename in Python
Instead of extracting the extension, we extract the filename and add the extension (if we know it's always .jpg, it's fine!)
import os
filename, file_extension = os.path.splitext('/path/to/somefile.jpg_corruptedpath')
result = filename + '.jpg'
Now, outside of the original question, I think there might be something wrong with how you got that piece of information in the first place. There must be a better way to extract that JPEG link without messing around with the path. Sadly, I can't help you with that since I am a novice with BeautifulSoup.

You could use a regular expression to replace everything after .jpg with an empty string:
import re
url ='https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
name = re.sub(r'(?<=\.jpg).*',"",url)
print(name)
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg

Related

Replacing specific substrings in a specific part of a string

I have the following text file that is to be edited in a certain manner. Only the part of the file inside the (:init section is to be overwritten; nothing else should be edited.
File:
(define (problem bin-picking-doosra)
(:domain bin-picking-second)
;(:requirements :typing :negative-preconditions)
(:objects
)
(:init
(batsmen first_batsman)
(bowler none_bowler)
(umpire third_umpire)
(spectator no_spectator)
)
(:goal (and
(batsmen first_batsman)
(bowler last_bowler)
(umpire third_umpire)
(spectator full_spectator)
)
)
)
In this file I want to replace every line inside the (:init section with the required string. In this case, I want to replace:
(batsmen first_batsman) with (batsmen none_batsmen)
(bowler none_bowler) with (bowler first_bowler)
(umpire third_umpire) with (umpire leg_umpire)
(spectator no_spectator) with (spectator empty_spectator)
The code I currently have is the following:
file_path = "/home/mus/problem_turtlebot.pddl"
s = open(file_path).read()
s = s.replace('(batsmen first_batsman)', '(batsmen '+ predicate_batsmen + '_batsman)')
f = open(file_path, 'w')
f.write(s)
f.close()
The variable predicate_batsmen here contains the word none. It works fine this way, but this code only handles the first replacement listed above.
There are three problems that I have:
1. This code also changes the '(batsmen first_batsman)' line in the (:goal part, which I don't want. I only want it to change the (:init part.
2. Currently, for the other strings in the (:init part, I have to repeat this code with a different statement. E.g. for '(bowler none_bowler)', i.e. the second replacement above, I need another copy of the same lines, which I don't think is good coding practice. Is there a better way?
3. Consider the first string in (:init that is to be overwritten, i.e. (batsmen first_batsman). Is there a way in Python to replace whatever is written in the question-mark part of (batsmen ??????_batsman) with none? For now it is 'first', but even if it were 'second' ((batsmen second_batsman)) or 'last' ((batsmen last_batsman)), I want the result to be (batsmen none_batsman).
Any ideas on these issues?
Thanks
First of all you need to find the init-group. The init-group seems to have the structure:
(:init
...
)
where ... is some recurrence of text contained inside parentheses, e.g. "(batsmen first_batsman)". Regular expressions are a powerful way to locate this kind of pattern in text. If you are not familiar with regular expressions (or regex for short), it is worth reading an introduction first, for example the Regular Expression HOWTO in the Python documentation.
The following regex locates this group:
import re
#Matches the items in the init-group:
item_regex = r"\([\w ]+\)\s+"
#Matches the init-group including items:
init_group_regex = re.compile(r"(\(:init\s+({})+\))".format(item_regex))
init_group = init_group_regex.search(s).group()
Now you have the init-group in init_group. The next step is to locate the terms you want to replace, and actually replace them. re.sub can do just that! First store the mappings in a dictionary:
mappings = {'batsmen first_batsman': 'batsmen '+ predicate_batsmen + '_batsman',
            'bowler none_bowler': 'bowler first_bowler',
            'umpire third_umpire': 'umpire leg_umpire',
            'spectator no_spectator': 'spectator empty_spectator'}
Finding the occurrences and replacing them by their corresponding value one-by-one:
for key, val in mappings.items():
    init_group = re.sub(key, val, init_group)
Finally you can replace the init-group in the original string:
s = init_group_regex.sub(init_group, s)
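This is really flexible! The keys in mappings are regular expressions, so they can match anything you like, for example:
mappings = {r'batsmen \w+_batsman': 'batsmen '+ predicate_batsmen + '_batsman'}
to match 'batsmen none_batsman', 'batsmen first_batsman' etc. The surrounding parentheses are not part of the match, so they must not be repeated in the replacement.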
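Putting the steps together, here is a hedged end-to-end sketch. The text in s is a shortened stand-in for the file contents, the fixed replacement values are chosen for illustration, and the helper name rewrite_init is hypothetical. It passes a function to sub, so the (:init block is located and rewritten in one pass while (:goal is left alone.

```python
import re

# Shortened stand-in for the PDDL file contents (assumption for illustration).
s = """(:init
(batsmen first_batsman)
(bowler none_bowler)
)
(:goal (and
(batsmen first_batsman)
(bowler last_bowler)
)
)
"""

# One "(word word)" item followed by whitespace.
item_regex = r"\([\w ]+\)\s+"
# The whole (:init ... ) block: header, one or more items, closing parenthesis.
init_group_regex = re.compile(r"\(:init\s+(?:{})+\)".format(item_regex))

# Keys are regex patterns, values are the plain replacement text.
mappings = {
    r"batsmen \w+_batsman": "batsmen none_batsman",
    r"bowler \w+_bowler": "bowler first_bowler",
}


def rewrite_init(match):
    """Apply every mapping to the matched (:init block and return the new block."""
    block = match.group(0)
    for pattern, replacement in mappings.items():
        block = re.sub(pattern, replacement, block)
    return block


# Only text matched by init_group_regex is rewritten; (:goal ...) is untouched.
result = init_group_regex.sub(rewrite_init, s, count=1)
print(result)
```

Because the replacement is a function, the rewritten block is inserted literally, with no backslash processing.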

Python regular expression, ignoring characters until some character is matched a number of times

I'm renaming a batch of files I downloaded from a torrent and wanted to get each episode's name, so I figured regex would do the trick. I'm kind of new to regex, so I'd appreciate the help. This is what I came up with:
I have a class with other renaming functions, so the function defined here lives in that class, which is initialized with the path to the files directory, the expression to rename to, and the file extension.
I'm using glob to access all files with the extension ".mkv".
For debugging I printed out all the file names:
Mr.Robot.S02E01.eps2.0_unm4sk-pt1.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E02.eps2.0_unm4sk-pt2.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E03.eps2.1_k3rnel-pan1c.ksd.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E04.eps2.2_init_1.asec.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E05.eps2.3.logic-b0mb.hc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E06.eps2.4.m4ster-s1ave.aes.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E07.eps2.5_h4ndshake.sme.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E08.eps2.6.succ3ss0r.p12.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E09.eps2.7_init_5.fve.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E10.eps2.8_h1dden-pr0cess.axx.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E11.eps2.9_pyth0n-pt1.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E12.eps2.9_pyth0n-pt2.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
def strip_ep_name(self):
    for i, f in enumerate(self.files):
        f_list = f.split("\\")
        name, ext = os.path.splitext(f_list[-1])
        ep_name = name.strip(r'(.*?)".720p.WEB-DL.x264-[MULVAcoded]"')
        print(ep_name)
For me, the goal is to get the episode's name, either with or without the episode's number, because I can give the episode a new name later on.
The output is:
r.Robot.S02E01.eps2.0_unm4sk-pt1.t
r.Robot.S02E02.eps2.0_unm4sk-pt2.t
r.Robot.S02E03.eps2.1_k3rnel-pan1c.ks
r.Robot.S02E04.eps2.2_init_1.as
r.Robot.S02E05.eps2.3.logic-b0mb.h
r.Robot.S02E06.eps2.4.m4ster-s1ave.aes
r.Robot.S02E07.eps2.5_h4ndshake.sm
r.Robot.S02E08.eps2.6.succ3ss0r.p1
r.Robot.S02E09.eps2.7_init_5.fv
r.Robot.S02E10.eps2.8_h1dden-pr0cess.a
r.Robot.S02E11.eps2.9_pyth0n-pt1.p7z
r.Robot.S02E12.eps2.9_pyth0n-pt2.p7z
I wanted to strip everything like ".eps2.2" before the episode's name, but these prefixes don't follow a consistent pattern.
Now I don't know how to move on from here. Can anyone help?
Do it all in one step:
\.eps\d+\.\d+[-_.](.+?)(?:\.720p.+)\.(\w+)$
Broken down, this reads:
\.eps\d+\.\d+ # ".eps", followed by digits, a dot and other digits
[-_.] # one of -, _ or .
(.+?) # anything else lazily afterwards
(?:\.720p.+) # until .720p is found (might need some tweaking)
\. # a dot
(\w+)$ # some word characters (aka the file extension) at the end
This needs to be replaced by .\1.\2 to get your desired format in the end.
Everything in Python:
import re
filenames = """
Mr.Robot.S02E01.eps2.0_unm4sk-pt1.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E02.eps2.0_unm4sk-pt2.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E03.eps2.1_k3rnel-pan1c.ksd.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E04.eps2.2_init_1.asec.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E05.eps2.3.logic-b0mb.hc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E06.eps2.4.m4ster-s1ave.aes.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E07.eps2.5_h4ndshake.sme.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E08.eps2.6.succ3ss0r.p12.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E09.eps2.7_init_5.fve.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E10.eps2.8_h1dden-pr0cess.axx.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E11.eps2.9_pyth0n-pt1.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E12.eps2.9_pyth0n-pt2.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
"""
rx = re.compile(r'\.eps\d+\.\d+[-_.](.+?)(?:\.720p.+)\.(\w+)$', re.M)
filenames = rx.sub(r".\1.\2", filenames)
print(filenames)
Which yields
Mr.Robot.S02E01.unm4sk-pt1.tc.mkv
Mr.Robot.S02E02.unm4sk-pt2.tc.mkv
Mr.Robot.S02E03.k3rnel-pan1c.ksd.mkv
Mr.Robot.S02E04.init_1.asec.mkv
Mr.Robot.S02E05.logic-b0mb.hc.mkv
Mr.Robot.S02E06.m4ster-s1ave.aes.mkv
Mr.Robot.S02E07.h4ndshake.sme.mkv
Mr.Robot.S02E08.succ3ss0r.p12.mkv
Mr.Robot.S02E09.init_5.fve.mkv
Mr.Robot.S02E10.h1dden-pr0cess.axx.mkv
Mr.Robot.S02E11.pyth0n-pt1.p7z.mkv
Mr.Robot.S02E12.pyth0n-pt2.p7z.mkv
See a demo on regex101.com.
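When the names come one at a time (for example from glob), the same pattern can be applied per file name without the re.M flag. A minimal sketch, with clean_name as a hypothetical helper; it only computes the new name and does not rename anything on disk.

```python
import re

# Same pattern as above; re.M is not needed for a single file name.
rx = re.compile(r'\.eps\d+\.\d+[-_.](.+?)(?:\.720p.+)\.(\w+)$')


def clean_name(filename):
    """Return the shortened name, or filename unchanged if the pattern does not match."""
    return rx.sub(r'.\1.\2', filename)


print(clean_name("Mr.Robot.S02E05.eps2.3.logic-b0mb.hc.720p.WEB-DL.x264-[MULVAcoded].mkv"))
```

A name that does not match is returned as is, so the function is safe to call on every file in a directory.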
First import Python's regex module:
import re
Then use this to strip the prefix from a name such as "r.Robot.S02E01.eps2.0_unm4sk-pt1.t":
ep_name = re.sub(r"eps2\.\d{1,2}(\.|_)", "", episode_name)
Run this inside the loop, passing each episode name in as episode_name, and then print ep_name.
The output will look like:
r.Robot.S02E01.unm4sk-pt1.t
I'm not sure I understand correctly; I don't know the series, nor its episode titles. But do you really need re?
for f in files:
    print(f[23:-35].split('.')[0])
results in
unm4sk-pt1
unm4sk-pt2
k3rnel-pan1c
init_1
logic-b0mb
m4ster-s1ave
h4ndshake
succ3ss0r
init_5
h1dden-pr0cess
pyth0n-pt1
pyth0n-pt2
Edit:
I still don't see an actual target format definition in your post, but just in case @Jan is right, here's the re-less solution for that, too:
for f in files:
    print(f[:16] + '.'.join(f[23:].split('.')[:2]) + '.mkv')
Mr.Robot.S02E01.unm4sk-pt1.tc.mkv
Mr.Robot.S02E02.unm4sk-pt2.tc.mkv
Mr.Robot.S02E03.k3rnel-pan1c.ksd.mkv
Mr.Robot.S02E04.init_1.asec.mkv
Mr.Robot.S02E05.logic-b0mb.hc.mkv
Mr.Robot.S02E06.m4ster-s1ave.aes.mkv
Mr.Robot.S02E07.h4ndshake.sme.mkv
Mr.Robot.S02E08.succ3ss0r.p12.mkv
Mr.Robot.S02E09.init_5.fve.mkv
Mr.Robot.S02E10.h1dden-pr0cess.axx.mkv
Mr.Robot.S02E11.pyth0n-pt1.p7z.mkv
Mr.Robot.S02E12.pyth0n-pt2.p7z.mkv

Python Parse through String to create variable

I have a variable that reads in a data file:
dfPort = pd.read_csv(r"E:...\Portfolios\ConsDisc_20160701_Q.csv")
I was hoping to create three variables, portName, inceptionDate, and frequency, by parsing the "E:..." path string above and using the underscore as the delimiter between one variable and the next. Example after parsing the string:
portName = "ConsDisc"
inceptionDate = "2016-07-01"
frequency = "Q"
Any tips would be appreciated!
You can use os.path.basename, os.path.splitext and str.split:
import os
filename = r'E:...\Portfolios\ConsDisc_20160701_Q.csv'
parts = os.path.splitext(os.path.basename(filename.replace('\\', os.sep)))[0].split('_')
print(parts)
outputs ['ConsDisc', '20160701', 'Q']. You can then manipulate this list as you like, for example extract it into variables with port_name, inception_date, frequency = parts, etc.
The .replace('\\', os.sep) there is used to "normalize" Windows-style backslash-separated paths into whatever is the convention of the system the code is being run on (i.e. forward slashes on anything but Windows :) )
import os
def parse_filename(path):
    # Note: os.path.basename only treats "\" as a separator on Windows
    filename = os.path.basename(path)
    filename_no_ext = os.path.splitext(filename)[0]
    return filename_no_ext.split("_")
path = r"Portfolios\ConsDisc_20160701_Q.csv"
portName, inceptionDate, frequency = parse_filename(path)
How about an alternative solution, in case you want to store them in a dictionary and use them like so:
import re
str1 = r"E:...\Portfolios\ConsDisc_20160701_Q.csv"
re.search(r'Portfolios\\(?P<portName>.*)_(?P<inceptionDate>.*)_(?P<frequency>.)', str1).groupdict()
# result
# {'portName': 'ConsDisc', 'inceptionDate': '20160701', 'frequency': 'Q'}
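The question asks for the date as "2016-07-01" rather than "20160701". A hedged sketch that adds that conversion with datetime.strptime and uses ntpath so that backslash-separated paths are handled on any operating system. The function name parse_portfolio_path is made up for illustration, and it assumes the file name always has exactly three underscore-separated parts.

```python
import ntpath
from datetime import datetime


def parse_portfolio_path(path):
    """Split 'Name_YYYYMMDD_F.csv' into (name, 'YYYY-MM-DD', frequency)."""
    # ntpath treats both "\\" and "/" as separators regardless of the OS
    stem = ntpath.splitext(ntpath.basename(path))[0]
    port_name, raw_date, frequency = stem.split("_")
    # Parse the compact date and re-emit it in ISO format
    inception_date = datetime.strptime(raw_date, "%Y%m%d").date().isoformat()
    return port_name, inception_date, frequency


print(parse_portfolio_path(r"Portfolios\ConsDisc_20160701_Q.csv"))
```

strptime also validates the date, so a malformed value such as "20161301" raises ValueError instead of passing through silently.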

regex.sub unexpectedly modifying the substituting string with some kind of encoding?

I have a path string "...\\JustStuff\\2017GrainHarvest_GQimagesTestStand\\..." that I am inserting into an existing text file in place of another string. I compile a regex pattern and find bounding text to get the location to insert, and then use regex.sub to replace it. I'm doing something like this...
with open(imextXML, 'r') as file:
    filedata = file.read()
redirpath = re.compile("(?<=<directoryPath>).*(?=</directoryPath>)", re.ASCII)
filedatatemp = redirpath.sub(newdir,filedata)
The inserted text is messed up though, with "\\201" being replaced with "\x81" and "\\" replaced with "\" (single slash)
i.e.
"...\\JustStuff\\2017GrainHarvest_GQimagesTestStand\\..." becomes
"...\\JustStuff\x817GrainHarvest_GQimagesTestStand\..."
What simple thing am I missing here to fix it?
Update:
to break this down even further to copy and paste to reproduce the issue...
t2 = r'\JustStuff\2017GrainHarvest_GQimagesTestStand\te'
redirpath = re.compile("(?<=<directoryPath>).*(?=</directoryPath>)", re.ASCII)
temp = r"<directoryPath>aasdfgsdagewweags</directoryPath>"
redirpath.sub(t2,temp)
produces...
>>'<directoryPath>\\JustStuff\x817GrainHarvest_GQimagesTestStand\te</directoryPath>'
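When you define the string that you want to insert, prefix it with an r to indicate that it is a raw string literal, and keep the backslashes doubled. re.sub processes backslash escapes in the replacement string, so it has to see \\ in order to insert a single backslash; a sequence like \201 is otherwise read as an octal escape. (The output below is from an older Python version; since Python 3.7, an unknown escape such as \w in the replacement raises re.error instead.)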
>>> rex = re.compile('a')
>>> s = 'path\\with\\2017'
>>> sr = r'path\\with\\2017'
>>> rex.sub(s, 'ab')
'path\\with\x817b'
>>> rex.sub(sr, 'ab')
'path\\with\\2017b'
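Another way to sidestep escape processing entirely is to pass a function as the replacement: whatever the function returns is inserted literally. A minimal sketch using the strings from the question's reproduction (the variable names newdir and fixed are illustrative):

```python
import re

redirpath = re.compile(r"(?<=<directoryPath>).*(?=</directoryPath>)", re.ASCII)
newdir = r"\JustStuff\2017GrainHarvest_GQimagesTestStand\te"
temp = "<directoryPath>aasdfgsdagewweags</directoryPath>"

# A callable replacement is used as-is: backslashes in newdir are not interpreted.
fixed = redirpath.sub(lambda match: newdir, temp)
print(fixed)
```

An equivalent option is to escape the backslashes before substituting, e.g. redirpath.sub(newdir.replace("\\", "\\\\"), temp).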

How to remove .txt or .docx at end of string in python

I am trying to create a list of all file names from a specific directory. My code is below:
import os
#dir = input('Enter the directory: ')
dir = 'C:/Users/brian/Documents/Moeller'
r = os.listdir(dir)
for fnam in os.listdir(dir):
    print(fnam.split())
    sep = fnam.split()
My output is:
['50', 'OP', '856101P02.txt']
['856101P02', 'OP', '040.txt']
['856101P02', 'OP', '50.txt']
['OP', '040', '856101P02.txt']
How would I be able to remove anything to the right of a "." in a string, while keeping the text to the left of the period?
Basically, what you do is start splitting from the right with rsplit and instruct it to split only once.
print("a.b.c.d".rsplit('.', 1)[0])
prints a.b.c
You can use os.path.splitext to split a filename into two parts,
keeping only the extension on the right, and everything else on the left.
For example,
a path like some/path/file.tar.gz will be split to some/path/file.tar and .gz:
base, ext = os.path.splitext('path/to/hello.tar.gz')
If you want to get rid of the . in the ext part,
simply use ext[1:].
If the file has no extension, for example path/to/file,
then the ext part will be the empty string.
This is a nice feature,
so that os.path.splitext always returns a tuple of two elements,
and this way the base, ext = ... example above always works.
I am trying to create a list of all file names from a specific directory.
[...]
How would I be able to remove anything to the right of a "." in a string, while keeping the text to the left of the period?
To get the base names (filenames without the extension) of a specific directory somedir, you could use this list comprehension:
basenames = [os.path.splitext(f)[0] for f in os.listdir(somedir)]
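The same result can be sketched with pathlib, whose stem attribute drops only the final extension. The list of names below is made up for illustration (in practice it would come from os.listdir); PurePath is used so nothing touches the file system.

```python
from pathlib import PurePath

# Hypothetical file names standing in for the output of os.listdir(somedir)
names = ["50 OP 856101P02.txt", "856101P02 OP 040.txt", "archive.tar.gz", "README"]

# .stem removes only the last suffix; names without a dot are kept whole
stems = [PurePath(name).stem for name in names]
print(stems)
```

Like os.path.splitext, this keeps "archive.tar" for "archive.tar.gz" rather than cutting at the first dot.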
From there, find the last period and take everything up to that position. In simple steps ...
for fnam in os.listdir(dir):
    nam_split = fnam.split()  # "sep" is usually the separator character
    print(nam_split)
    ext_split = fnam.rsplit('.', 1)  # Split the file name at only one dot, from the right
    file_no_ext = ext_split[0]  # The first part of the split is the file name
    print(file_no_ext)
