I'm saving the recording of a set of sentences to a corresponding set of audio files.
Sentences include:
Ich weiß es nicht!
¡No lo sé!
Ég veit ekki!
How would you recommend I convert each sentence to a human-readable filename that will later be served from an online server? I'm not sure yet which languages I might be dealing with in the future.
UPDATE:
Please note that two sentences can't clash with each other. For example:
É bär icke dej.
E bår icke dej.
can't resolve to the same filename as these will overwrite each other. This is the problem with the slugify function mentioned here: Turn a string into a valid filename?
The best I have come up with is to use urllib.parse.quote. However, the resulting output is harder to read than I had hoped. Any suggestions?
Ich%20wei%C3%9F%20es%20nicht%21
%C2%A1No%20lo%20s%C3%A9%21
%C3%89g%20veit%20ekki%21
What about unidecode?
import unidecode
a = ['Ich weiß es nicht!', '¡No lo sé!', 'Ég veit ekki!']
for s in a:
    print(unidecode.unidecode(s).replace(' ', '_'))
This gives pure ASCII strings that can readily be processed if they still contain unwanted characters. Keeping spaces distinct in the form of underscores helps with readability.
Ich_weiss_es_nicht!
!No_lo_se!
Eg_veit_ekki!
If uniqueness is a problem, a hash or something like that might be added to the strings.
Edit:
Some clarification seems to be required with respect to the hashing. Many hash functions are explicitly designed to give very different outputs for similar inputs. For example, the built-in hash function of Python gives:
In [1]: hash('¡No lo sé!')
Out[1]: 6428242682022633791
In [2]: hash('¡No lo se!')
Out[2]: 4215591310983444451
With that you can do something like
unidecode.unidecode(s).replace(' ', '_') + '_' + str(hash(s))[:10]
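to keep the strings from getting too long. Even with such shortened hashes, clashes are pretty unlikely. Be aware, though, that Python 3 randomizes the hash of a str for each process (unless PYTHONHASHSEED is fixed) and that the value can be negative, so use a digest from hashlib if the file names must be reproducible between runs.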
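The idea of a readable name plus a short hash can be sketched with hashlib, whose digests do not change between runs. This is only a sketch under assumptions: the function name readable_filename, the ".wav" extension and the digest length are hypothetical, and unicodedata is used as a cruder stand-in for unidecode (it simply drops characters such as ß instead of transliterating them).

```python
import hashlib
import unicodedata


def readable_filename(sentence, ext=".wav", digest_len=10):
    """Build an ASCII file name plus a short, stable hash of the original sentence."""
    # Stand-in for unidecode: decompose accents, then drop anything non-ASCII.
    ascii_part = (
        unicodedata.normalize("NFKD", sentence)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Keep letters, digits, '-' and '_'; spaces become underscores, the rest is dropped.
    safe = "".join(
        c if c.isalnum() or c in "-_" else "_" if c == " " else ""
        for c in ascii_part
    )
    # hashlib digests are identical across runs, unlike hash() on a str.
    digest = hashlib.sha1(sentence.encode("utf-8")).hexdigest()[:digest_len]
    return f"{safe}_{digest}{ext}"


print(readable_filename("É bär icke dej."))
print(readable_filename("E bår icke dej."))
```

Both sentences share the readable prefix E_bar_icke_dej, but the suffix is computed from the original text, so the two files do not overwrite each other.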
You should probably convert the spaces into another symbol, so that your string looks like É-bär-icke-dej.
If you're using Python, I would do it like this.
Replace spaces with another symbol such as a hyphen (-). Don't use a slash (/), because it is not allowed in file names.
mystring = mystring.replace(' ', '-')
Detect your character encoding using chardet, a Python package that detects encodings.
Decode your string using Python's
mystring.decode(*the detected encoding*)
(this step applies only if mystring is still a bytes object; a Python 3 str is already decoded)
Check whether the file name already exists in your directory using Python's os module, something like
files = os.listdir(*path to directory*)
# count how many times the file name has already been used
redundance = 0
for name in files:
    if mystring in name:
        redundance += 1
Append redundance to your string
if redundance != 0:
    mystring = mystring + str(redundance)
Use your string as a file name!
Hope this helps!
The only disallowed characters in traditional Unix / Linux file names are slash (/ U+002F) and the null character (U+0000). There is no need to convert your example human-readable strings to anything else.
If you need to make the files available to systems which do not use the same file name encoding, such as for downloading over FTP or from a web server, perhaps you want to expose them as explicitly UTF-8. On most modern Unix-like systems, this should be the default out of the box anyway. This would correspond to the results you get from urllib quoting, where the percent-encoding is a safe and reasonably standard way of producing a machine-readable and unambiguous representation of the encoding. If you embed these in a snippet of HTML or something, you can keep the display text human-readable, and just keep the link machine-readable.
Ég veit ekki!
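As a rough illustration of keeping the link machine-readable and the display text human-readable, the link could be generated like this. The helper name audio_link and the ".mp3" extension are assumptions made for the example, not something from the question.

```python
import html
from urllib.parse import quote


def audio_link(sentence, ext=".mp3"):
    """Return an HTML link: percent-encoded href, human-readable text."""
    href = quote(sentence + ext)  # UTF-8 percent-encoding, e.g. %C3%89 for É
    return f'<a href="{href}">{html.escape(sentence)}</a>'


print(audio_link("Ég veit ekki!"))
```

The file on disk keeps its UTF-8 name; only the href carries the percent-encoded form.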
I am trying to save a figure from Matplotlib to a folder on a network drive, and I am getting some unwanted behavior from the file path.
This is what I have set up, using a raw string to handle the "\" escape character.
save_path = r"\\nemesis\Network Planning\Team Members\Taylor\2020_04_23 - COVID Impact Adjustment\Test Stores\State and Region Growth - " +str(Store_ID)+ ".jpg"
print(save_path)
plt.savefig(save_path)
The print statement displays the correct file path string
However, when I run savefig, Python appears to add an extra backslash next to every existing one in the string and raises a FileNotFoundError. Full error transcript below.
FileNotFoundError: [Errno 2] No such file or directory: '\\\\nemesis\\Network Planning\\Team Members\\Taylor\\2020_04_23 - COVID Impact Adjustment\\Test Stores\\State and Region Growth - 17062.jpg'
I am at a loss for the reasons as to why this is occurring and have tried a bunch of different string methods and none have seemed to work.
Any help is much appreciated
To answer your question, I'll need to explain some background on raw strings. Raw strings are just an easier way to include backslashes in a normal string without you needing to escape them. For example, defining a string that would be printed as "a\b\c" using normal string syntax, you would need to write my_string = "a\\b\\c", but with raw strings, you only need to write r"a\b\c", but the resulting string is equal in both cases:
s = r"a\b\c"
s2 = "a\\b\\c"
s == s2 # Evaluates to True
When you print the string, print() excludes the extra backslashes required to recreate the string using normal syntax:
print(s) # -> a\b\c
To view a representation of the string suitable for recreating it, use repr(s):
print(repr(s)) # -> 'a\\b\\c'
Now for your question. Python does not add any backslashes when you call savefig: the error message shows the path in its repr() form, which doubles every backslash, while the string itself is exactly what print() displays. The two backslashes at the beginning are correct if \\nemesis is a network (UNC) share; if that is not what you meant, fix that first.
save_path = r"\\nemesis\Network Planning\..."
print(save_path) # Prints the path as it really is, with two leading backslashes
print(repr(save_path)) # Shows the escaped representation, with four leading backslashes, as in the error message
Since the string is fine, a FileNotFoundError most likely means that one of the folders in the path does not exist or cannot be reached from your machine. If you prefer, you can also represent the file path differently. Either use a normal string and escape all the backslashes manually (a UNC path then starts with four): "\\\\nemesis\\Network Planning\\Team Members\\Taylor\\2020_04_23 - COVID Impact Adjustment\\Test Stores\\State and Region Growth - " + str(Store_ID) + ".jpg", or use os.path.join(r"\\nemesis\Network Planning", "Team Members", "Taylor", "2020_04_23 - COVID Impact Adjustment", "Test Stores", "State and Region Growth - " + str(Store_ID) + ".jpg") to join the directories with the proper backslashes automatically (I can't test that second one because I'm on Linux).
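Here is a small check of how raw strings, escaped strings and repr() relate, as a sketch. It assumes the path is meant as a UNC network path and uses only its first components; ntpath is the standard-library module that implements Windows path rules, so the snippet also runs on Linux.

```python
import ntpath  # Windows path rules, importable on any platform

unc_root = r"\\nemesis\Network Planning"   # raw string: two leading backslashes
escaped = "\\\\nemesis\\Network Planning"  # normal string: every backslash doubled

same = unc_root == escaped
shown = repr(unc_root)  # this escaped form is what appears in error messages
joined = ntpath.join(unc_root, "Team Members", "Taylor")

print(same)
print(shown)
print(joined)
```

The repr() output contains twice as many backslashes as the string itself, which is why the error message looks as if slashes had been added.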
Hope this helped!
I scraped data about fundraising from the web and put it into a table.
As I started to clean the data, I saw that some elements, for instance "2 000000", are read as "2\xa0000000" by the machine.
1/ What does that mean?
2/ How can I remove it? (I want to convert the whole column to integers.)
Best,
To fix a DataFrame column, use:
df['col'] = df['col'].str.replace(r'\D', '', regex=True).astype(int)
The issue is that the string contains Unicode characters that are displayed as escape sequences. The easiest way to deal with them without calling replace for each specific character is the unicodedata module.
Specifically:
from unicodedata import normalize
string1 = "2\xa0000000"
new_string = normalize('NFKD', string1)
print(new_string)
Output:
2 000000
The unicodedata module is part of Python's standard library, so there is nothing to install. I find this better because the normalization handles many kinds of formatting, so you do not need a separate replace for each badly formatted character you come across. Note that NFKD turns the non-breaking space into an ordinary space, which still has to be removed before converting to an integer.
It's an escape sequence.
The character with hex code A0 is a non-breaking space, so in most cases you can just treat it as a space. In my experience, it mostly comes up when processing data generated by Microsoft Office products, or data from the web where people have pasted in HTML.
Unfortunately, in Python 2, split() on a byte string (just as an example, since I don't know how you process your data) will not treat it as a space; in Python 3, str.split() with no argument does. Either way, as it is just a distinct character, you can handle it explicitly with:
longstring.replace('\xA0', ' ').split()
PS: Reading your question again, it seems the character should simply be dropped so that the value becomes the number two million. So you might want to replace '\xA0' with an empty string.
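For a single value, the replacement with an empty string could look like the following minimal sketch. The variable names are made up for the example; the second option mirrors the \D regular expression used for the DataFrame column.

```python
import re

raw = "2\xa0000000"  # scraped value containing a non-breaking space (U+00A0)

# Option 1: remove the non-breaking space explicitly.
amount = int(raw.replace("\xa0", ""))

# Option 2: drop every non-digit character, whatever it is.
amount_digits_only = int(re.sub(r"\D", "", raw))

print(amount, amount_digits_only)
```

Option 2 is more forgiving, but it would also silently drop a minus sign or a decimal separator, so it only fits columns of plain non-negative integers.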
I created about 200 csv files in Python and now need to download them all.
I created the files from a single file using:
g = df.groupby("col")
for n,g in df.groupby('col'):
    g.to_csv(n+'stars'+'.csv')
When I try to use this same statement to export to my machine I get a syntax error and I'm not sure what I'm doing wrong:
g = df.groupby("col")
for n,g in df.groupby('col'):
    g.to_csv('C:\Users\egagne\Downloads\'n+'stars'+'.csv'')
Error:
File "<ipython-input-27-43a5bfe55259>", line 3
g.to_csv('C:\Users\egagne\Downloads\'n+'stars'+'.csv'')
^
SyntaxError: invalid syntax
I'm in Jupyter lab, so I can download each file individually but I really don't want to have to do that.
You're possibly mixing up integers and strings, and the use of backslashes in literals is dangerous anyway. Consider using the following:
import os
and, inside the loop,
f_name = os.path.join('C:\\', 'Users', 'egagne', 'Downloads', str(n) + 'stars.csv')
g.to_csv(f_name)
with os.path.join taking care of the backslashes for you.
g.to_csv('C:\Users\egagne\Downloads\'n+'stars'+'.csv'')
needs to be
g.to_csv('C:\\Users\\egagne\\Downloads\\'+n+'stars.csv').
There were two things wrong -- the backslash is an escape character so if you put a ' after it, it will be treated as part of your string instead of a closing quote as you intended it. Using \\ instead of a single \ escapes the escape character so that you can include a backslash in your string.
Also, you did not pair your quotes correctly. n is a variable name but from the syntax highlighting in your question it is clear that it is part of the string. Similarly you can see that stars and .csv are not highlighted as part of a string, and the closing '' should be a red flag that something has gone wrong.
Edit: I addressed what is causing the problem, but Ami Tavory's answer is the right one. Even though you know this is going to run on Windows, it is better practice to use os.path.join() with directory names than to write out a path in a string. str(n) is also the right way to go if you are at all unsure about the type of n.
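The path building discussed in these answers can be sketched as a small helper. The name csv_path is hypothetical, and ntpath (the Windows implementation behind os.path) is used so that the result is the same on any platform.

```python
import ntpath  # Windows path rules; on Windows, os.path is this module


def csv_path(folder, n):
    """Join folder and '<n>stars.csv'; str(n) covers non-string group keys."""
    return ntpath.join(folder, str(n) + "stars.csv")


downloads = r"C:\Users\egagne\Downloads"
print(csv_path(downloads, 5))
# Inside the loop this would be used as: g.to_csv(csv_path(downloads, n))
```

Keeping the folder in a raw string and letting the join add the final separator avoids the backslash-quote problem entirely.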
I want to open a file in python 3.5 in its default application, specifically 'screen.txt' in Notepad.
I have searched the internet, and found os.startfile(path) on most of the answers. I tried that with the file's path os.startfile(C:\[directories n stuff]\screen.txt) but it returned an error saying 'unexpected character after line continuation character'. I tried it without the file's path, just the file's name but it still didn't work.
What does this error mean? I have never seen it before.
Please provide a solution for opening a .txt file that works.
EDIT: I am on Windows 7 on a restricted (school) computer.
It's hard to be certain from your question as it stands, but I bet your problem is backslashes.
[EDITED to add:] Or actually maybe it's something simpler. Did you put quotes around your pathname at all? If not, that will certainly not work -- but once you do, you will find that then you need the rest of what I've written below.
In a Windows filesystem, the backslash \ is the standard way to separate directories.
In a Python string literal, the backslash \ is used for putting things into the string that would otherwise be difficult to enter. For instance, if you are writing a single-quoted string and you want a single quote in it, you can do this: 'don\'t'. Or if you want a newline character, you can do this: 'First line.\nSecond line.'
So if you take a Windows pathname and plug it into Python like this:
os.startfile('C:\foo\bar\baz')
then the string actually passed to os.startfile will not contain those backslashes; it will contain a form-feed character (from the \f) and two backspace characters (from the \bs), which is not what you want at all.
You can deal with this in three ways.
You can use forward slashes instead of backslashes. Although Windows prefers backslashes in its user interface, forward slashes work too, and they don't have special meaning in Python string literals.
You can "escape" the backslashes: two backslashes in a row mean an actual backslash. os.startfile('C:\\foo\\bar\\baz')
You can use a "raw string literal". Put an r before the opening single or double quotes. This will make backslashes not get interpreted specially. os.startfile(r'C:\foo\bar\baz')
The last is maybe the nicest, except for one annoying quirk: backslash-quote is still special in a raw string literal, so that you can still write r'don\'t' (the backslash then stays in the string), which means you can't end a raw string literal with a backslash.
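The three options can be compared directly in a short sketch. It uses the example path from above and ntpath, the standard-library module with Windows path rules, so it runs on any platform.

```python
import ntpath  # Windows path rules, importable on any platform

broken = 'C:\foo\bar\baz'      # \f is a form feed, \b is a backspace
escaped = 'C:\\foo\\bar\\baz'  # doubled backslashes
raw = r'C:\foo\bar\baz'        # raw string literal
forward = 'C:/foo/bar/baz'     # forward slashes

print(repr(broken))
print(escaped == raw)
print(ntpath.normpath(forward) == raw)
```

The first string contains no backslash at all, while the escaped and raw literals are the same string, and the forward-slash form normalizes to it.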
The recommended way to open a file with the default program is os.startfile. You can do something a bit more manual using os.system or subprocess though:
os.system('start ' + path_to_file)
or
subprocess.Popen('{start} {path}'.format(
start='start', path=path_to_file), shell=True)
Of course, this won't work cross-platform, but it might be enough for your use case.
For example, I created the file "test file.txt" on my drive D:, so the file path is 'D:/test file.txt'.
Now I can open it in its associated program with this script:
import os
os.startfile('d:/test file.txt')
Say I have the string "blöt träbåt", which has a few a's and o's with umlauts and a ring above. I want it to become "blot trabat" as simply as possible. I've done some digging and found the following method:
import unicodedata
unicode_string = unicodedata.normalize('NFKD', unicode(string))
This will give me the string in unicode format with the international characters split into base letter and combining character (\u0308 for umlauts.) Now to get this back to an ASCII string I could do ascii_string = unicode_string.encode('ASCII', 'ignore') and it'll just ignore the combining characters, resulting in the string "blot trabat".
The question here is: is there a better way to do this? It feels like a roundabout way, and I was thinking there might be something I don't know about. I could of course wrap it up in a helper function, but I'd rather check if this doesn't exist in Python already.
It would be better if you created an explicit table and then used the unicode.translate method (str.translate in Python 3). The advantage would be that the transliteration is more precise, e.g. transliterating "ö" to "oe" and "ß" to "ss", as should be done in German.
There are several transliteration packages on PyPI: translitcodec, Unidecode, and trans.
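A minimal Python 3 sketch that combines an explicit table with the NFKD approach from the question might look like this. The GERMAN table and the to_ascii helper are hypothetical names for the example; a real table would need entries for every language you care about.

```python
import unicodedata

# Hypothetical table for German; extend it per language as needed.
GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def to_ascii(text, table=None):
    """Apply an optional explicit table, then strip remaining accents."""
    if table is not None:
        text = text.translate(table)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("blöt träbåt"))
print(to_ascii("Ich weiß es nicht!", GERMAN))
```

Without a table the Swedish example becomes "blot trabat", as asked; with the German table, "ß" becomes "ss" instead of being dropped.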