Python Parse through String to create variable - python

I have a variable that reads in a datafile
dfPort = pd.read_csv("E:...\Portfolios\ConsDisc_20160701_Q.csv")
I was hoping to create three variables, portName, inceptionDate, and frequency, by reading the "E:..." string above and pulling out the wanted parts, using the underscore as an indicator to move on to the next variable. Example after parsing the string:
portName = "ConsDisc"
inceptionDate = "2016-07-01"
frequency = "Q"
Any tips would be appreciated!

You can use os.path.basename, os.path.splitext and str.split:
import os
filename = r'E:...\Portfolios\ConsDisc_20160701_Q.csv'
parts = os.path.splitext(os.path.basename(filename.replace('\\', os.sep)))[0].split('_')
print(parts)
outputs ['ConsDisc', '20160701', 'Q']. You can then manipulate this list as you like, for example extract it into variables with port_name, inception_date, frequency = parts, etc.
The .replace('\\', os.sep) there is used to "normalize" Windows-style backslash-separated paths into whatever is the convention of the system the code is being run on (i.e. forward slashes on anything but Windows :) )
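The question also asks for the date as "2016-07-01" rather than "20160701". Here is a minimal sketch of that extra step, assuming the middle part is always in YYYYMMDD form. The helper name parse_portfolio_path is hypothetical, and pathlib.PureWindowsPath is used so that the backslash-separated path is split the same way on any operating system:

```python
from datetime import datetime
from pathlib import PureWindowsPath


def parse_portfolio_path(path):
    """Split '<name>_<YYYYMMDD>_<frequency>.csv' into its three parts."""
    # .stem is the final path component without its extension
    stem = PureWindowsPath(path).stem
    port_name, raw_date, frequency = stem.split("_")
    # Assumption: the date part is always eight digits, YYYYMMDD
    inception_date = datetime.strptime(raw_date, "%Y%m%d").date().isoformat()
    return port_name, inception_date, frequency


port_name, inception_date, frequency = parse_portfolio_path(
    r"E:...\Portfolios\ConsDisc_20160701_Q.csv"
)
print(port_name, inception_date, frequency)
```

If a file name does not have exactly three underscore-separated parts, the unpacking raises ValueError, which is probably what you want for a malformed name.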

import os
def parse_filename(path):
    filename = os.path.basename(path)
    filename_no_ext = os.path.splitext(filename)[0]
    return filename_no_ext.split("_")
path = r"Portfolios\ConsDisc_20160701_Q.csv"
portName, inceptionDate, frequency = parse_filename(path)

Here is an alternative solution, in case you want to store the parts in a dictionary:
import re
# raw string, so the backslashes in the path are not treated as escape sequences
str1 = r"E:...\Portfolios\ConsDisc_20160701_Q.csv"
re.search(r'Portfolios\\(?P<portName>.*)_(?P<inceptionDate>.*)_(?P<frequency>.)', str1).groupdict()
# result
# {'portName': 'ConsDisc', 'inceptionDate': '20160701', 'frequency': 'Q'}

Related

Delete all characters that come after a given string

How exactly can I delete the characters after .jpg? Is there a way in Python to tell the extension apart from what follows it?
For example, I have a link like this:
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC
How can I delete everything after .jpg?
I tried replacing, but it didn't work.
Is there another way?
Maybe by counting characters, or something like that?
I tried to get the jpg files with this:
for link in links:
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'html.parser')
    img_links = []
    for img in soup.select('a.thumbnail img[src]'):
        print(img["src"])
        with open('links'+'.csv', 'a', encoding = 'utf-8', newline='') as csv_file:
            file_is_empty = os.stat(self.filename+'.csv').st_size == 0
            fieldname = ['links']
            writer = csv.DictWriter(csv_file, fieldnames = fieldname)
            if file_is_empty:
                writer.writeheader()
            writer.writerow({'links':img["src"]})
        img_links.append(img["src"])
You could use split. This assumes the string contains '.jpg'; if it does not, the code below returns the original URL with '.jpg' appended.
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
jpg_removed = string.split('.jpg')[0]+'.jpg'
Example
string = 'www.google.com'
com_removed = string.split('.com')[0]
# com_removed = 'www.google'
You can use a regular expression. You just want to ignore the characters after .jpg, so you can use something like this:
import re
new_url = re.findall(r"(.*\.jpg).*", old_url)[0]
(.*\.jpg) is a capturing group that matches any number of characters up to and including .jpg. Since . has a special meaning in a regular expression, you need to escape the . in .jpg with a backslash. The trailing .* matches whatever follows, but since it is outside the capturing group () it is matched without being extracted.
You can use the .find function to locate the characters .jpg, then slice the string to keep everything up to and including them. Ex:
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
index = string.find(".jpg")
new_string = string[:index + 4]
You have to add four because that is the length of ".jpg", so the extension itself is not deleted too.
The find() method returns the lowest index of the substring if it is found in the given string. If it is not found, it returns -1.
str ='https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
result = str.find('jpg')
print(result)
new_str = str[:result]
print(new_str+'jpg')
See: Extracting extension from filename in Python
Instead of extracting the extension, we extract the filename and add the extension (if we know it's always .jpg, it's fine!)
import os
filename, file_extension = os.path.splitext('/path/to/somefile.jpg_corruptedpath')
result = filename + '.jpg'
Now, outside of the original question, I think there might be something wrong with how you got that piece of information in the first place. There must be a better way to extract that JPEG link without messing around with the path. Sadly I can't help you with that, since I am a novice with BeautifulSoup.
You could use a regular expression to replace everything after .jpg with an empty string:
import re
url ='https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
name = re.sub(r'(?<=\.jpg).*',"",url)
print(name)
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg
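Several of the snippets above misbehave when the URL does not contain .jpg at all (split appends an extra '.jpg', and find returns -1). A minimal sketch that leaves such URLs untouched, using str.partition; the function name trim_after_extension is hypothetical, and it cuts at the first occurrence of the extension:

```python
def trim_after_extension(url, ext=".jpg"):
    """Return url cut off right after the first occurrence of ext."""
    head, found, _tail = url.partition(ext)
    # partition returns an empty separator when ext is absent,
    # in which case the URL is returned unchanged
    return head + found if found else url


url = ("https://s13emagst.akamaized.net/products/29146/29145166/images/"
       "res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC")
print(trim_after_extension(url))
```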

How to remove .txt or .docx at end of string in python

I am trying to create a list of all file names from a specific directory. My code is below:
import os
#dir = input('Enter the directory: ')
dir = 'C:/Users/brian/Documents/Moeller'
r = os.listdir(dir)
for fnam in os.listdir(dir):
    print(fnam.split())
    sep = fnam.split()
My output is:
['50', 'OP', '856101P02.txt']
['856101P02', 'OP', '040.txt']
['856101P02', 'OP', '50.txt']
['OP', '040', '856101P02.txt']
How would I be able to remove anything to the right of a "." in a string, while keeping the text to the left of the period?
Basically, what you do is start splitting from the right with rsplit and then instruct it to split only once.
print("a.b.c.d".rsplit('.', 1)[0])
prints a.b.c
You can use os.path.splitext to split a filename into two parts,
keeping only the extension on the right, and everything else on the left.
For example,
a path like some/path/file.tar.gz will be split to some/path/file.tar and .gz:
base, ext = os.path.splitext('path/to/hello.tar.gz')
If you want to get rid of the . in the ext part,
simply use ext[1:].
If the file has no extension, for example path/to/file,
then the ext part will be the empty string.
This is a nice feature,
so that os.path.splitext always returns a tuple of two elements,
and this way the base, ext = ... example above always works.
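To make the cases above concrete, here is a short sketch that prints what os.path.splitext returns for a double extension, for a path with no extension, and for one of the file names from the question:

```python
import os

for name in ["path/to/hello.tar.gz", "path/to/file", "856101P02 OP 040.txt"]:
    # splitext always returns a (base, ext) pair; ext is '' if there is no extension
    base, ext = os.path.splitext(name)
    # ext[1:] drops the leading dot and is safe even when ext is empty
    print(repr(base), repr(ext), repr(ext[1:]))
```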
I am trying to create a list of all file names from a specific directory.
[...]
How would I be able to remove anything to the right of a "." in a string, while keeping the text to the left of the period?
To get the base names (filenames without the extension) of a specific directory somedir, you could use this list comprehension:
basenames = [os.path.splitext(f)[0] for f in os.listdir(somedir)]
From there, find the period and take everything up to that position. In simple steps ...
for fnam in os.listdir(dir):
    nam_split = fnam.split()  # "sep" is usually the separator character
    print(nam_split)
    # rsplit is a string method, so call it on the file name, not on the list
    ext_split = fnam.rsplit('.', 1)  # Split at only one dot, from the right
    file_no_ext = ext_split[0]  # The first part of the split is the file name
    print(file_no_ext)

Inverse re.split

I have this C:/Users/nash08/Desktop/NUKE_OITO_MEDIA/PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx
I want to remove everything in the path up to the word PRODUTORAS, like this:
/PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx
My knowledge of regular expressions is pretty mediocre; the only way I am used to is to split the path by / like this
rpath = path.rsplit('/',1)[0]
rpath2 = re.split('/',path)
and index to where I want.
You can use a regular expression.
This works:
import re
t = "C:/Users/nash08/Desktop/NUKE_OITO_MEDIA/PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx"
output = re.search(".*(/PRODUTORAS.*)", t)
print(output.group(1))
>'/PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx'
Here is a simple approach:
if '/PRODUTORAS/' in mypath:
    newpath = '/PRODUTORAS/' + mypath.split('/PRODUTORAS/', 1)[1]
This only works if you are using forward slashes for your path separator and PRODUTORAS is capitalized.
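If those two restrictions are a problem, one possible variation is to normalize the separators first and search case-insensitively. This is only a sketch: the function name tail_from_marker is hypothetical, the result always uses forward slashes, and it cuts at the first occurrence of the marker directory.

```python
import re


def tail_from_marker(path, marker="PRODUTORAS"):
    """Return the part of path starting at '/<marker>/', or None if absent."""
    # Treat backslashes and forward slashes alike
    normalized = path.replace("\\", "/")
    # re.escape keeps any special characters in the marker literal
    match = re.search("/" + re.escape(marker) + "/", normalized, re.IGNORECASE)
    return normalized[match.start():] if match else None


print(tail_from_marker(
    "C:/Users/nash08/Desktop/NUKE_OITO_MEDIA/PRODUTORAS/PYTHON/"
    "WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx"
))
```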
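If your prefix doesn't change, this code will work:
path = "C:/Users/nash08/Desktop/NUKE_OITO_MEDIA/PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx"
prefix = "C:/Users/nash08/Desktop/NUKE_OITO_MEDIA/"
# slice the prefix off; path.strip(prefix) would instead trim any of the prefix's characters from both ends
if path.startswith(prefix):
    print(path[len(prefix):])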
#Output:
>>> 'PRODUTORAS/PYTHON/WORK_INTERNO/_CENAS_FX/N10/N01_V01_NK08.%04d.dpx'

Replace recursively from a replacement map

I have a dictionary in the form
{'from.x': 'from.changed.x',...}
possibly quite big, and I have to substitute in text files accordingly to that dictionary in a quite big directory structure.
I didn't find any nice ready-made solution, so I ended up:
using os.walk
iterating through the dictionary
writing everything out
With something like:
from difflib import unified_diff
from os import path, walk
def fix_imports(top_dir, not_ui_keys):
    """Walk through the directory and substitute the wrong imports
    """
    repl = {}
    for n in not_ui_keys:
        # interleave a model in between
        dotted = extract_dotted(n)
        if dotted:
            repl[dotted] = add_model(dotted)
    for root, dirs, files in walk(top_dir):
        py_files = [path.join(root, x) for x in files if x.endswith('.py')]
        for py in py_files:
            res = replace_text(open(py).read(), repl)
def replace_text(orig_text, replace_map):
    res = orig_text
    # now try to grep all the keys, using a translate maybe
    # with a dictionary of the replacements
    for to_replace in replace_map:
        # str.replace returns a new string, so the result must be reassigned
        res = res.replace(to_replace, replace_map[to_replace])
    # now print the differences
    for un in unified_diff(res.splitlines(), orig_text.splitlines()):
        print(un)
    return res
Is there any better/nicer/faster way to do it?
EDIT:
To clarify the problem a bit: the substitutions are generated by a function, and they are all of the form:
{'x.y.z': 'x.y.added.z', 'x.b.a': 'x.b.added.a'}
And yes, I probably should use regexps; I just thought I didn't need them this time.
I don't think they can help much, however, because I can't really formalize the whole range of substitutions with only one (or a few) regexps.
I would write the first function using generators:
def fix_imports(top_dir, not_ui_keys):
    """Walk through the directory and substitute the wrong imports """
    # In Python 3 the built-in filter and map are already lazy
    # (itertools.ifilter and itertools.imap were the Python 2 equivalents)
    gen = filter(None, map(extract_dotted, not_ui_keys))
    repl = dict((dotted, add_model(dotted)) for dotted in gen)
    py_files = (path.join(root, x)
                for root, dirs, files in walk(top_dir)
                for x in files if x[-3:] == '.py')
    for py in py_files:
        with open(py) as opf:
            res = replace_text(opf.read(), repl)
x[-3:]=='.py' can be slightly faster than x.endswith('.py'), although the difference is small and endswith is the more readable choice.
Thank you everyone, and about the problem of substituting from a mapping in many files, I think I have a working solution:
import logging
import re
logger = logging.getLogger(__name__)
def replace_map_to_text(repl_map, text_lines):
    """Take a dictionary with the replacements needed and a list of
    lines and return a list with the substituted lines
    """
    res = []
    # '.' in a regexp matches any single character, so every key is
    # escaped with re.escape before being joined into one pattern
    concat_st = "(%s)" % "|".join(re.escape(key) for key in repl_map)
    combined_regexp = re.compile(concat_st)
    for line in text_lines:
        found = combined_regexp.search(line)
        if found:
            expr = found.group(1)
            # escape again here, otherwise the dots in expr act as wildcards
            new_line = re.sub(re.escape(expr), repl_map[expr], line)
            logger.info("from line %s to line %s" % (line, new_line))
            res.append(new_line)
        else:
            res.append(line)
    return res
def test_replace_string():
    lines = ["from psi.io.api import x",
             "from psi.z import f"]
    expected = ["from psi.io.model.api import x",
                "from psi.model.z import f"]
    mapping = {'psi.io.api': 'psi.io.model.api',
               'psi.z': 'psi.model.z'}
    assert replace_map_to_text(mapping, lines) == expected
In short I compose a big regexp in the form
(first|second|third)
Then I search for it in every line and substitute with re.sub if something was found.
Still a bit rough but the simple test after works fine.
EDIT: fixed a nasty bug in the concatenation: in a regexp an unescaped '.' matches any single character, not a literal '.'
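The same "one big alternation" idea can also be written so that every occurrence on a line is replaced in a single pass, by giving re.sub a function instead of a replacement string. This is a sketch rather than the code used above: the name replace_all is hypothetical, and sorting the keys longest first is an added assumption so that a short key never shadows a longer key that starts with it.

```python
import re


def replace_all(text, mapping):
    """Replace every key of mapping found in text with its value, in one pass."""
    if not mapping:
        return text
    # Longest keys first, so 'a.b.c' is tried before 'a.b'
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape makes the dots in the keys match literal dots only
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # The callback looks up the replacement for whichever key matched
    return pattern.sub(lambda match: mapping[match.group(0)], text)


mapping = {"psi.io.api": "psi.io.model.api", "psi.z": "psi.model.z"}
print(replace_all("from psi.io.api import x\nfrom psi.z import f", mapping))
```

Because the text is scanned only once, a replacement value is never itself rewritten by another key, which can happen when str.replace is called in a loop.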

batch renaming 100K files with python

I have a folder with over 100,000 files, all numbered with the same stub, but without leading zeros, and the numbers aren't always contiguous (usually they are, but there are gaps) e.g:
file-21.png,
file-22.png,
file-640.png,
file-641.png,
file-642.png,
file-645.png,
file-2130.png,
file-2131.png,
file-3012.png,
etc.
I would like to batch process this to create padded, contiguous files. e.g:
file-000000.png,
file-000001.png,
file-000002.png,
file-000003.png,
When I parse the folder with for filename in os.listdir('.'): the files don't come up in the order I'd like them to. Understandably they come up
file-1,
file-1x,
file-1xx,
file-1xxx,
etc. then
file-2,
file-2x,
file-2xx,
etc. How can I get it to go through them in order of numeric value? I am a complete Python noob, but looking at the docs I'm guessing I could use map to create a new list containing only the numerical part, sort that list, and then iterate over it? With over 100K files this could be heavy. Any tips welcome!
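import os
import re
thenum = re.compile(r'^file-(\d+)\.png$')
def bynumber(fn):
    mo = thenum.match(fn)
    return int(mo.group(1))
# keep only the names that match, so every sort key is an int
allnames = [fn for fn in os.listdir('.') if thenum.match(fn)]
allnames.sort(key=bynumber)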
Now you have the files in the order you want them and can loop
for i, fn in enumerate(allnames):
    ...
using the progressive number i (which will be 0, 1, 2, ...) padded as you wish in the destination-name.
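One way to fill in that loop is to first build the list of (old name, new name) pairs and inspect it before touching any file. This is a sketch under the question's assumptions (names of the form file-N.png, six-digit padding); the function name plan_renames is hypothetical:

```python
import re

NUMBERED = re.compile(r"^file-(\d+)\.png$")


def plan_renames(names, width=6):
    """Return (old, new) pairs giving the files contiguous, zero-padded numbers."""
    numbered = []
    for name in names:
        match = NUMBERED.match(name)
        if match:  # names that do not fit the pattern are skipped
            numbered.append((int(match.group(1)), name))
    numbered.sort()  # numeric order, not string order
    return [(old, f"file-{index:0{width}d}.png")
            for index, (_number, old) in enumerate(numbered)]


sample = ["file-640.png", "file-21.png", "file-2130.png", "notes.txt", "file-22.png"]
for old, new in plan_renames(sample):
    print(old, "->", new)

# To apply the plan in the current directory (after checking it):
# import os
# for old, new in plan_renames(os.listdir(".")):
#     os.rename(old, new)
```

Sorting 100K short tuples is cheap, so the size of the folder should not be a concern for this step.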
There are three steps. The first is getting all the filenames. The second is converting the filenames. The third is renaming them.
If all the files are in the same folder, then glob should work.
import glob
filenames = glob.glob("/path/to/folder/*.txt")
Next, you want to change the name of the file. You can print with padding to do this.
>>> filename = "file-338.txt"
>>> import os
>>> fnpart = os.path.splitext(filename)[0]
>>> fnpart
'file-338'
>>> _, num = fnpart.split("-")
>>> num.rjust(5, "0")
'00338'
>>> newname = "file-%s.txt" % num.rjust(5, "0")
>>> newname
'file-00338.txt'
Now, you need to rename them all. os.rename does just that.
os.rename(filename, newname)
To put it together:
for filename in glob.glob("/path/to/folder/*.txt"):  # loop through each file
    newname = make_new_filename(filename)  # create a function that does step 2, above
    os.rename(filename, newname)
Thank you all for your suggestions, I will try them all to learn the different approaches. The solution I went for is based on using a natural sort on my filelist, and then iterating that to rename. This was one of the suggested answers but for some reason it has disappeared now so I cannot mark it as accepted!
import os
files = os.listdir('.')
natsort(files)
index = 0
for filename in files:
    os.rename(filename, str(index).zfill(7)+'.png')
    index += 1
where natsort is defined in http://code.activestate.com/recipes/285264-natural-string-sorting/
Why don't you do it in a two step process. Parse all the files and rename with padded numbers and then run another script that takes those files, which are sorted correctly now, and renames them so they're contiguous?
1) Take the number in the filename.
2) Left-pad it with zeros
3) Save name.
def renamer():
    for iname in os.listdir('.'):
        first, second = iname.replace(" ", "").split("-")
        number, ext = second.split('.')
        first, number, ext = first.strip(), number.strip(), ext.strip()
        number = '0'*(6-len(number)) + number  # pad the number to be 6 digits long
        oname = first + "-" + number + '.' + ext
        os.rename(iname, oname)
    print("Done")
Hope this helps
The simplest method is given below. You can also modify this script to search recursively.
use os module.
get filenames
os.rename
import os
class Renamer:
    def __init__(self, pattern, extension):
        self.ext = extension
        self.pat = pattern
    def rename(self):
        p, e = (self.pat, self.ext)
        number = 0
        for x in os.listdir(os.getcwd()):
            if str(x).endswith(f".{e}"):
                os.rename(x, f'{p}_{number}.{e}')
                number += 1
if __name__ == "__main__":
    pattern = "myfile"
    extension = "txt"
    r = Renamer(pattern=pattern, extension=extension)
    r.rename()
