I'm trying to train a chatbot, and most of the data is in text files.
I pull:
Matt said you have a "shit load" of dining dollars. I have almost none so if you're willing to sell, I'm willing to buy.
from the text file, but when the chatterbot corpus tries to train the bot, it reads the above as:
'Matt said you have a "shit load" of dining dollars\\ I have almost none so if you\'re willing to sell, I\'m willing to buy\\\n'
How can I fix this?
This is my code:
def train_from_text():
    #chatbot.set_trainer(ListTrainer)
    directory = basedir + "Text Trainers"
    files = find_files_in_directory(directory)
    for file in files:
        conversation = []
        file_name = directory+"/"+file
        with open(file_name, 'r') as to_read:
            for line in to_read:
                conversation.append(line)
        chatbot.train(conversation)
Please excuse the swearing; it's the data I was given.
Edit: Full error
Traceback (most recent call last):
File "E:/Jason Chatterbot/Jason Chat.py", line 102, in <module>
control()
File "E:/Jason Chatterbot/Jason Chat.py", line 96, in control
train_from_text()
File "E:/Jason Chatterbot/Jason Chat.py", line 58, in train_from_text
chatbot.train(conversation)
File "C:\Python27\lib\site-packages\chatterbot\trainers.py", line 119, in train
corpora = self.corpus.load_corpus(corpus_path)
File "C:\Python27\lib\site-packages\chatterbot_corpus\corpus.py", line 98, in load_corpus
corpus_data = self.read_corpus(file_path)
File "C:\Python27\lib\site-packages\chatterbot_corpus\corpus.py", line 63, in read_corpus
with io.open(file_name, encoding='utf-8') as data_file:
IOError: [Errno 22] Invalid argument: 'Matt said you have a "shit load" of dining dollars\\ I have almost none so if you\'re willing to sell, I\'m willing to buy\\\r\n'
Without seeing a larger subset of the data, it looks like single quotes (') are being replaced with escaped single quotes (\'), actual newline characters with escaped newlines (\n), and periods with double backslashes (\\).
A simple string replace might fix it for you, depending on how bad the data is getting munged. Try changing
conversation.append(line)
to
conversation.append(line.replace("\\'","'").replace('\\\\','.').replace("\\n","\n"))
We're basically trying to reverse those substitutions that are being made automatically.
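The same reversal can be wrapped in a small helper so it is easy to try on one line at a time. This is only a sketch: `undo_escapes` is a hypothetical name, and it assumes the backslashes really are present as literal characters in the string rather than just appearing in its printed representation.

```python
def undo_escapes(line):
    """Reverse the three substitutions described above (hypothetical helper)."""
    return (
        line.replace("\\'", "'")      # backslash + quote -> plain quote
            .replace("\\\\", ".")     # two backslashes   -> period
            .replace("\\n", "\n")     # backslash + 'n'   -> real newline
    )

# Raw string, so every backslash below is a literal character.
munged = r"dining dollars\\ I have almost none so if you\'re willing to sell, I\'m willing to buy\\\n"
print(undo_escapes(munged))
```

The order of the replacements matters: the double backslashes are handled before the escaped newline so that the backslash belonging to `\n` is left in place for the last step.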
I'm trying to translate a yml file (which has over 4000 lines) to Dutch using the Googletrans API.
This is my python code:
from googletrans import Translator
import re

translator = Translator()
with open("Results2/ValuesfileNotTranslated.yml") as a_file: # Not translated File
    for object in a_file:
        stripped_object = object.rstrip()
        found = False
        file = open("Results2/ValuesfileTranslated.yml", "a") #Translated file
        if not stripped_object.strip():
            file.writelines("\n")
        elif "# Do not translate" in stripped_object: #Skips lines with "# Do Not Translates"
            counter_DoNotTranslate += 1
            file.writelines(stripped_object + "\n")
        else: #Translates english to dutch
            counter_Translate += 1
            results = translator.translate(stripped_object, src='en', dest='nl')
            translatedText = results.text
            file.writelines(re.split('|=', translatedText, maxsplit=1)[-1].strip() + "\n" )
But when I try to run the code it works until line 276.
This is the YAML file I want to translate.
...
E-mail
E-Mail oder Name
Password
Toggle dropdown
Are you sure?
Yes, sure
Cancel
Back
Download
Submit
Add
Edit
Delete
Beta
"This new functionality is not yet optimized for all browsers. Please create a helpdesk ticket to inform us about errors, providing details including the browser and browser version you are using. As a workaround, we ask you to try another browser to continue (Chrome or Firefox)"
Please enable Javascript in your browser!! # This is line 276
http://www.enable-javascript.com/
Read more
Read less
Close
...
After line 276 I get this error:
Traceback (most recent call last):
File "/Users/AndreB/Library/Mobile Documents/com~apple~CloudDocs/Work/Freelance/ProjectYML/Programma/python/vertaling.py", line 29, in <module>
results = translator.translate(stripped_object, src='en', dest='nl') #Choose a source and a destination
File "/Users/AndreB/Library/Python/3.9/lib/python/site-packages/googletrans/client.py", line 222, in translate
translated_parts = list(map(lambda part: TranslatedPart(part[0], part[1] if len(part) >= 2 else []), parsed[1][0][0][5]))
IndexError: list index out of range
I can't figure out what the problem is with my code.
Does anyone have an idea how I can fix this?
I know this is an old post but I just had this problem and found a solution. The issue was that Googletrans didn't know what to do with empty lines. I was translating by line so I just added:
if line != "\n":
before
translator.translate(line)
That's it - it worked!
You may need to modify if you're not translating by line, but that's the idea.
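Here is a rough sketch of that idea as a reusable function. The names `translate_lines` and `translate` are made up for illustration; `translate` stands in for whatever actually does the translation, so the example runs without a network connection.

```python
def translate_lines(lines, translate):
    """Translate non-blank lines; pass blank lines through untouched.

    `translate` is any callable that takes a string and returns a string
    (a hypothetical stand-in for the real translator call).
    """
    out = []
    for line in lines:
        if line.strip():                      # only translate lines with content
            out.append(translate(line.rstrip("\n")))
        else:                                 # blank or whitespace-only line
            out.append("")
    return out

# Demo with a fake translator that just upper-cases the text.
print(translate_lines(["Cancel\n", "\n", "Back\n"], str.upper))
```

With Googletrans, the callable could be something like `lambda s: translator.translate(s, src='en', dest='nl').text`, reusing the call from the question.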
Good luck.
I am using Python 2.7.18
The idea is to use python to gather songs from specified directories, then create and run the commands to run them through a bunch of converters and sound processors.
Some of my songs have characters with accents and any song with a ? in the title gets changed to a ¿ (Inverted Question Mark) in the file name.
My convert_song function works correctly when run on its own, but when I try to run it in a Pool and the file name or directory has a non-ASCII character in it, it fails with:
Traceback (most recent call last):
File "C:\StreamLine.py", line 270, in <module>
result = pool.map(convert_song, qTheStack)
File "C:\Python27\lib\multiprocessing\pool.py", line 253, in map
return self.map_async(func, iterable, chunksize).get()
File "C:\Python27\lib\multiprocessing\pool.py", line 572, in get
raise self._value
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbf' in position 27: ordinal not in range(128)
Here's my main where I set up the pool:
if __name__ == '__main__':
    print('Reading artists.')
    predir = 'G:\\Vault\\The Music\\'
    artistfile = open('C:\\Controls\\ArtistList.txt', 'r')
    artistlist = artistfile.readlines()
    dirs = []
    for artist in artistlist:
        dirs.append(predir + artist.strip())
    qTheStack = []
    for currentPath in dirs:
        for wFile in generate_next_file(currentPath):
            print(repr(wFile))
            #print(convert_song(wFile))
            qTheStack.append(wFile)
    print('List loaded.')
    pool = Pool(12)
    result = pool.map(convert_song, qTheStack)
    for item in result:
        print(item)
The print(repr(wFile)) output looks like this when run:
'G:\\Vault\\The Music\\Chicago\\1989 - Greatest Hits 1982-1989\\04 - Will You Still Love Me\xbf.flac'
'G:\\Vault\\The Music\\Chicago\\1989 - Greatest Hits 1982-1989\\06 - What Kind of Man Would I Be\xbf [Remix].flac'
How can I get the built-in Pool from multiprocessing to accept my input?
Change to Python 3, dude.
As much as I wanted there to be an answer that stayed on Python 2.7, I tried Python 3 and it didn't disappoint.
I did have to go back through the obscure steps I found to generate a file that will run a COM/DLL in Python, and I had to remove all the str.decode and encode calls throughout my script. After only one import change, I hit run and it ran as expected.
I'm trying to open a file in Python, but I got an error, and at the beginning of the string there is a \u202a character... Does anyone know how to remove it?
def carregar_uml(arquivo, variaveis):
    cadastro_uml = {}
    id_uml = 0
    for i in open(arquivo):
        linha = i.split(",")

carregar_uml("H:\\7 - Script\\teste.csv", variaveis)
OSError: [Errno 22] Invalid argument: '\u202aH:\7 - Script\teste.csv'
When you initially created your .py file, your text editor introduced a non-printing character.
Consider this line:
carregar_uml("H:\\7 - Script\\teste.csv", variaveis)
Let's carefully select the string, including the quotes, and copy-paste it into an interactive Python session:
$ python
Python 3.6.1 (default, Jul 25 2017, 12:45:09)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "H:\\7 - Script\\teste.csv"
'\u202aH:\\7 - Script\\teste.csv'
>>>
As you can see, there is a character with code point U+202A immediately before the H.
As someone else pointed out, the character at code point U+202A is LEFT-TO-RIGHT EMBEDDING. Returning to our Python session:
>>> s = "H:\\7 - Script\\teste.csv"
>>> import unicodedata
>>> unicodedata.name(s[0])
'LEFT-TO-RIGHT EMBEDDING'
>>> unicodedata.name(s[1])
'LATIN CAPITAL LETTER H'
>>>
This further confirms that the first character in your string is not H, but the non-printing LEFT-TO-RIGHT EMBEDDING character.
I don't know what text editor you used to create your program. Even if I knew, I'm probably not an expert in that editor. Regardless, some text editor that you used inserted, unbeknownst to you, U+202A.
One solution is to use a text editor that won't insert that character, and/or will highlight non-printing characters. For example, in vim that line appears like so:
carregar_uml("<202a>H:\\7 - Script\\teste.csv", variaveis)
Using such an editor, simply delete the character between " and H.
carregar_uml("H:\\7 - Script\\teste.csv", variaveis)
Even though this line is visually identical to your original line, I have deleted the offending character. Using this line will avoid the OSError that you report.
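If your editor cannot show such characters, Python can find them for you. The sketch below reports every character in Unicode category "Cf" (format characters), which includes U+202A. The function name `find_format_chars` is made up for this example, and the leading character is written as an explicit `\u202a` escape so it stays visible here.

```python
import unicodedata

def find_format_chars(text):
    """Return (index, code point, name) for each Unicode 'Cf' character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

# The invisible character is spelled out as an escape for the demo.
path = "\u202aH:\\7 - Script\\teste.csv"
print(find_format_chars(path))
```

An empty list means the string contains no format characters, so the path is at least free of this particular problem.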
You can use this sample code to remove \u202a from a file path.
st="F:\\somepath\\filename.xlsx"
data = pd.read_excel(st)
If I try this, it raises an OSError.
In detail:
Traceback (most recent call last):
File "F:\CodeRepo\PythonWorkSpace\demo\removepartofstring.py", line 14, in <module>
data = pd.read_excel(st)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\util\_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\util\_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel.py", line 350, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel.py", line 653, in __init__
self._reader = self._engines[engine](self._io)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel.py", line 424, in __init__
self.book = xlrd.open_workbook(filepath_or_buffer)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\xlrd\__init__.py", line 111, in open_workbook
with open(filename, "rb") as f:
OSError: [Errno 22] Invalid argument: '\u202aF:\\somepath\\filename.xlsx'
But if I do it like this:
st="F:\\somepath\\filename.xlsx"
data = pd.read_excel(st.strip("\u202a")) #replace your string here
it works for me.
The problem is the directory path of the file is not read properly. Use raw strings to pass it as argument and it should work.
carregar_uml(r'H:\7 - Script\teste.csv', variaveis)
Try strip() on the path before opening it:
def carregar_uml(arquivo, variaveis):
    arquivo = arquivo.strip("\u202a")  # drop the invisible U+202A before opening
    cadastro_uml = {}
    id_uml = 0
    for i in open(arquivo):
        linha = i.split(",")

carregar_uml("H:\\7 - Script\\teste.csv", variaveis)
Or you can slice out that character
file_path = r"C:\Test3\Accessing_mdb.txt"
file_path = file_path[1:]
with open(file_path, 'a') as f_obj:
    f_obj.write('some words')
Use a lowercase letter when you write the drive name, not an uppercase one!
ex) H: -> error
ex) h: -> not error
I tried all of the above solutions. The problem is that when we copy a path or any string from right to left, an extra character is added. It does not show in our IDE. This extra character denotes the Right-to-left mark (RLM),
https://en.wikipedia.org/wiki/Right-to-left_mark
i.e. you selected the text from right to left at the time of copying.
Check the image linked to my answer.
I also tried copying left to right, and then this extra character is not added. So either type your path manually or copy it left to right to avoid this type of issue.
The following is a simple function to remove the "\u202a" and "\u202c" characters.
You can add any other characters you want removed to the list.
def cleanup(inp):
    new_char = ""
    for char in inp:
        if char not in ["\u202a", "\u202c"]:
            new_char += char
    return new_char

example = '\u202a7551\u202c'
print(cleanup(example)) # prints 7551
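If you would rather not maintain the list by hand, one possible variation is to drop every character in Unicode category "Cf" (format characters), which covers both "\u202a" and "\u202c". This is a sketch, and `strip_format_chars` is a made-up name. Be aware that "Cf" also includes characters such as the zero-width joiner, so this may be too aggressive for text that relies on them.

```python
import unicodedata

def strip_format_chars(text):
    """Return text without any Unicode 'Cf' (format) characters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

example = "\u202a7551\u202c"
print(strip_format_chars(example))
```

For file paths copied from a dialog box this is usually safe, since a real path is unlikely to contain format characters on purpose.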
I'm new to writing questions here, so please feel free to point out how I can improve the quality of future questions!
Edit: More code included as was asked in the comments
I'm trying to read GuitarPro files into python. These files essentially contain the sheet music for songs, but contain more information than e.g. MIDI files.
I want to parse the notes and their durations into e.g. a list structure. Further, I hope other effects can also be parsed from the GuitarPro files, such as bends, slides, hammer-ons etc.
I have been trying to use the library PyGuitarPro, but get stuck:
import guitarpro
import os
# 'wet_sand.gp5' is the guitar pro file
parsed_song = guitarpro.parse('wet_sand.gp5')
song = guitarpro.gp5.GP5File(parsed_song,encoding='UTF-8')
song.readSong()
I get the following error from readSong() (documentation here):
Traceback (most recent call last):
File "<ipython-input-15-e1663229852d>", line 8, in <module>
song.readSong()
File "C:\Python27\lib\site-packages\guitarpro\gp5.py", line 62, in readSong
song.version = self.readVersion()
File "C:\Python27\lib\site-packages\guitarpro\iobase.py", line 114, in readVersion
self.version = self.readByteSizeString(30)
File "C:\Python27\lib\site-packages\guitarpro\iobase.py", line 97, in readByteSizeString
return self.readString(size, self.readByte())
File "C:\Python27\lib\site-packages\guitarpro\iobase.py", line 47, in readByte
return (self.read(*args, default=default) if count == 1 else
File "C:\Python27\lib\site-packages\guitarpro\iobase.py", line 35, in read
data = self.data.read(count)
AttributeError: 'Song' object has no attribute 'read'
Looking at the examples provided (e.g. this one), I don't think you need this portion:
song = guitarpro.gp5.GP5File(parsed_song,encoding='UTF-8')
The following should be enough, as parse already calls readSong here.
song = guitarpro.parse('wet_sand.gp5')
Finally, it looks like the file format is automatically determined by parse here.
As an example, you could do something like this:
import guitarpro
song = guitarpro.parse('test.gp5')
for track in song.tracks:
    for measure in track.measures:
        for voice in measure.voices:
            for beat in voice.beats:
                for note in beat.notes:
                    print(note.durationPercent)
                    print(note.effect)
I have 112 music files in a folder. All of them start with the type of music in brackets, like 【House】Supermans Feinde Shine.
I want to rename them to the form House - Supermans Feinde Shine.
I have tried:
import os
for filename in os.listdir("C:/MYMUSICSFOLDER"):
    if filename.startswith("【"):
        os.rename(filename, filename[7:])
but I get:
Error: sys:1: DeprecationWarning: Non-ASCII character '\xe3' in file C:\MYPROGRAMSFOLDER\ne11.py on line 6, but no encoding declared
How do I rename all of the music files this way?
I have tried various approaches, but none of them worked.
I have a program that plays music when I say "songs", but when I try to do it I get an error; all other functions work perfectly.
Here's the code ...
import os,sys,random
import webbrowser
import speech
import sys

def callback(phrase, listener):
    print ": %s" % phrase
    if phrase == "songs":
        folder = os.listdir("C:/users/william/desktop/music/xkito music")
        file = random.choice(folder)
        ext3= ['.mp3','.mp4','.wmv']
        while file[-4:] not in ext3 :
            file = random.choice(folder)
        else:
            os.startfile(file)
        speech.say('Playing Music')
    if phrase == "open opera":
        webbrowser.open('http://www.google.com')
        speech.say("Opening opera")
    if phrase == "turn off":
        speech.say("Goodbye.")
        listener.stoplistening()
        sys.exit()

print "Anything you type, speech will say back."
print "Anything you say, speech will print out."
print "Say or type 'turn off' to quit."
print
listener= speech.listenforanything(callback)
while listener.islistening():
    text = raw_input("> ")
    if text == "turn off":
        listener.stoplistening()
        sys.exit()
    else:
        speech.say(text)
And I'm getting this error when trying to execute the music:
pythoncom error: Python error invoking COM method.
Traceback (most recent call last):
File "C:\Python24\Lib\site-packages\win32com\server\policy.py", line 277, in _Invoke_
return self._invoke_(dispid, lcid, wFlags, args)
File "C:\Python24\Lib\site-packages\win32com\server\policy.py", line 282, in _invoke_
return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None)
File "C:\Python24\Lib\site-packages\win32com\server\policy.py", line 585, in _invokeex_
return func(*args)
File "C:\Users\william\Desktop\speech-0.5.2\speech.py", line 138, in OnRecognition
self._callback(phrase, self._listener)
File "C:\Users\william\Desktop\speech-0.5.2\example.py", line 21, in callback
os.startfile(file)
WindowsError: [Errno 2] The system can not find the specified file: '?Glitch Hop?Chinese Man - I Got That Tune (Tha Trickaz Remix) [Free Download].mp4
That ? in the beginning of the name is 【 and 】
The error is about handling Unicode (UTF-8) characters.
I suspect that filename[7:] even splits a UTF-8 character between two of its bytes, so that rename() sees a partial character.
The right way to fix it is to handle UTF-8 correctly, of course.
But one way to work around it altogether is to work not with individual bytes of the string but with whole strings only, so that the encoding is not relevant:
To convert 【House】Superfoo to House - Superfoo, you can
replace 【 with the empty string, and 】 with " - " (a hyphen with a space on each side).
Use the result of that as the new file name. If the original name is not of the expected format, the name is not changed, and nothing happens. It's not an error; the program does not even notice.
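A minimal sketch of that replace-based approach, assuming Python 3 (where file names are handled as text rather than bytes). The helper names `new_name` and `rename_all` are made up for this example, and the folder is passed in by the caller.

```python
import os

def new_name(filename):
    """Drop the opening bracket and turn the closing one into ' - '."""
    return filename.replace("【", "").replace("】", " - ")

def rename_all(folder):
    """Rename every file in `folder` whose name contains the brackets."""
    for filename in os.listdir(folder):
        target = new_name(filename)
        if target != filename:  # names without brackets are left alone
            os.rename(os.path.join(folder, filename),
                      os.path.join(folder, target))

print(new_name("【House】Supermans Feinde Shine"))
```

Joining the folder onto both names matters: os.listdir returns bare file names, so renaming without the folder only works when the script happens to run from inside that directory.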