Writing Escape Characters to a Csv File in Python - python

I'm using the csv module in python and escape characters keep messing up my csv's. For example, if I had the following:
import csv
rowWriter = csv.writer(open('bike.csv', 'w'), delimiter = ",")
text1 = "I like to \n ride my bike"
text2 = "pumpkin sauce"
rowWriter.writerow([text1, text2])
rowWriter.writerow(['chicken','wings'])
I would like my csv to look like:
I like to \n ride my bike,pumpkin sauce
chicken,wings
But instead it turns out as
I like to
ride my bike,pumpkin sauce
chicken,wings
I've tried combinations of quoting, doublequote, escapechar and other parameters of the csv module, but I can't seem to make it work. Does anyone know what's up with this?
*Note - I'm also using codecs encode("utf-8"), so text1 really looks like "I like to \n ride my bike".encode("utf-8")

The problem is not with writing them to the file. The problem is that \n is a line break when inside '' or "". What you really want is either 'I like to \\n ride my bike' or r'I like to \n ride my bike' (notice the r prefix).
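A minimal sketch of the difference, assuming Python 3 and an in-memory buffer in place of bike.csv (the helper name write_rows is made up for illustration). When the text arrives at runtime rather than as a literal, the r prefix is not available, so the last part shows one possible alternative: replacing the newline character with a backslash followed by n.

```python
import csv
import io

def write_rows(rows):
    """Write rows with csv.writer into a string and return it (illustrative helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=",", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

# "\n" in a normal literal is one newline character, so the field spans two
# lines and the writer wraps it in quotes.
with_newline = write_rows([["I like to \n ride my bike", "pumpkin sauce"]])

# A raw literal keeps the backslash and the n as two separate characters.
with_literal = write_rows([[r"I like to \n ride my bike", "pumpkin sauce"]])

# For text that is already in a variable, convert the newline explicitly.
text1 = "I like to \n ride my bike"
converted = write_rows([[text1.replace("\n", "\\n"), "pumpkin sauce"]])

print(with_literal, end="")
```

Whatever reads the file later would have to turn the two-character sequence back into a newline itself; the csv module does not do that.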

Firstly, it is not obvious why you want r"\n" (two bytes) to appear in your file instead of "\n" (one byte). What is the consumer of the output file meant to do? Use ast.literal_eval() on each input field? If your actual data contains any of (non-ASCII characters, apostrophes, quotes), then I'd be very wary of serialising it using repr().
Secondly, you have misreported either your code or your output (or both). The code that you show actually produces:
"I like to
ride my bike",pumpkin sauce
chicken,wings
Thirdly, about your "I like to \n ride my bike".encode("utf-8"): on a Python 2 byte string, str_object.encode("utf-8") is absolutely pointless if str_object contains only ASCII bytes -- it does nothing. Otherwise it raises an exception (a UnicodeDecodeError from the implicit ASCII decode).
Fourthly, this comment:
I don't need to call encode anymore, now that I'm using the raw
string. There are a lot of unicode characters in the text that I am
using, so before I started using the raw string I was using encode so
that csv could read the unicode text
doesn't make any sense -- as I've said, "ascii string".encode('utf8') is pointless.
Consider taking a step or two backwards, and explain what you are really trying to do: where does your data come from, what's in it, and most importantly, what is the process that reads the file going to do with it?

Related

How do I get python to interpret the ANSI escape codes for colors in a string read from a text file

All the codes I've tried work in the VS Code terminal and the Windows Terminal (PowerShell and Command Prompt), so I'm pretty happy about that. However, when I read a string from a text file and print it, the escape codes are printed in plain view and no colour is applied to the strings.
I've tried the octal, hexadecimal and unicode versions. I had the same problem with "\n" until I realised that the string read would contain "\\n", where the first backslash effectively escapes the "\" char, so calling .replace("\\n","\n") on the string solved that issue, but I got no joy with the colour codes.
This is the code that I use to read the file:
with open('ascii_art_with_color.txt','r') as file:
    for line in file.readlines() :
        text_line = line
        print( text_line , end='' )
Sample from the ascii file:
encounter = You \033[31mencounter\033[0m a wolf howling at the moonlight
Printing using the print function works just fine, either the string constant or from a variable
print('The wolf \033[31mgrowls\033[0m at you as you try to get closer')
winning = 'The wolf lets out a \033[34mpiercing\033[0m cry, then falls to the ground'
print(winning)
Ideas? The main problem that got me stumped is that the codes are not interpreted/applied for the strings I read from the text file, anything else seems to work.
Update:
As was suggested in the comments, the file contained the literal text '\033' (4 chars) instead of the single escape character that '\033' denotes in a Python string literal. I was hoping python would take the line, then apply/translate/encode it into the ANSI escape sequence code while printing it, as it does with the string in the example above.
In the meantime, I managed to get the colours in the text file using a script that replaces a specific string with the escape sequence (I guess python does the encoding behind the scenes before writing it to file)
# file_name (the source path) is defined elsewhere
file_dest = 'ascii_monster_wolf_dest.txt'
# the output handle is named dest so it does not shadow the file_dest path
with open(file_name,'r') as file, open(file_dest,'w+') as dest:
    for line in file.readlines():
        line = line.replace('{#}','\033[31m')
        line = line.replace('{*}','\033[0m')
        dest.write(line)
This is some progress, but not what I really wanted though.
Coming back to my question, is there a way to read the file and have the sequence '\033' (4 characters) being interpreted as the 1 char escape sequence, the way it seems to do with strings?
There are a couple of ways to do what you ask.
If you wrap the individual lines with quote marks, so they look like Python string constants, you can use the ast literal evaluator to decode it:
import ast
s = '"\\x61\\x62"'
# That string has 10 characters.
print( ast.literal_eval(s) )
# Prints ab
Alternatively, you can convert the strings to byte strings, and use the "unicode-escape" codec:
s = '\\x61\\x62'
s = s.encode('utf-8').decode('unicode-escape')
print( s )
# Prints ab
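Applied to the sample line from the question, the codec approach might look like the sketch below (Python 3 assumed; the helper name decode_escapes is hypothetical). It encodes with latin-1 and backslashreplace rather than utf-8, which is a swapped-in variation intended to keep any non-ASCII characters in the text intact.

```python
def decode_escapes(text):
    """Interpret backslash escapes such as \\033 or \\n found literally in text."""
    # latin-1 + backslashreplace turns every character into bytes that the
    # unicode_escape codec maps back to the same character.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")

# A raw literal stands in for a line read from the text file: it holds the
# four characters \, 0, 3, 3 rather than a single ESC character.
raw_line = r"You \033[31mencounter\033[0m a wolf howling at the moonlight"
decoded = decode_escapes(raw_line)

print(decoded)  # a terminal that understands ANSI codes shows "encounter" in red
```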
In my humble opinion, however, you would be better served by using some other kind of markup to denote your colors. By that, I mean something like:
<red>This is red</red> <blue>This is blue</blue>
Maybe not exactly an HTML-type syntax, but something with code markers that YOU understand, that can be read by humans, and can be interpreted by all languages.
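One possible way to interpret such markers, as a rough sketch only: the tag names, the ANSI_CODES table and the render_markup function are all made up here, and unknown tags are simply left untouched.

```python
import re

# Hypothetical tag names mapped to ANSI colour codes.
ANSI_CODES = {"red": "\033[31m", "blue": "\033[34m"}
RESET = "\033[0m"

def render_markup(text):
    """Replace <name>...</name> markers with ANSI colour sequences."""
    def open_tag(match):
        # Known tag: emit its colour code. Unknown tag: keep the text as-is.
        return ANSI_CODES.get(match.group(1), match.group(0))

    def close_tag(match):
        return RESET if match.group(1) in ANSI_CODES else match.group(0)

    text = re.sub(r"<(\w+)>", open_tag, text)
    return re.sub(r"</(\w+)>", close_tag, text)

line = render_markup("<red>This is red</red> <blue>This is blue</blue>")
print(line)
```

The text file then stays readable in any editor, and the same file could be rendered by a program written in another language.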
Open the file in binary format. Then use decode() as Tim Roberts suggested.
with open('ascii_art_with_color.txt','rb') as file:
    for line in file.readlines() :
        print( line.decode('unicode-escape') , end='' )

How to remove special characters in json data python

I am reading a set of data from a json file. Content of the json file looks like:
"Address":"4820 ALCOA AVE� ",
"City":"VERNON� "
As you can see, it contains a special character � and white space at the end. When I read this json data, it comes out like:
'address': '4820 ALCOA AVE� '
'city': 'VERNON� '
I can remove the whitespace easily but I am not sure how I can remove the �. I do not have direct access to the json file so I cannot edit it, and even if I had access, it would take a couple of hours to edit the file. Is there any way in python to remove these special characters? Please help. Thanks
You can use a regular expression:
import re
address = re.sub(r"[^\x20-\x7E]", "", "4820 ALCOA AVE� ")
print(address)
Looks like somewhere upstream wasn't handling character encoding properly and ended up with replacement characters... You may need to keep an eye out in case it mangled more important parts of the text (eg. accented characters, non-English letters, emoji).
For the immediate problem, you can load the JSON data with the utf-8 encoding, then strip the character '\ufffd'.
value = value.strip().strip('\ufffd')
If the replacement characters also appear in the middle (and you want to delete them), you might want to use replace() instead.
value = value.replace('\ufffd', '').strip()
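To apply this to every field rather than one value at a time, something along these lines may work. It is a sketch assuming Python 3; the function name strip_replacement_chars and the inline sample record are invented for illustration, with the JSON text standing in for the real file.

```python
import json

def strip_replacement_chars(value):
    """Recursively remove U+FFFD and surrounding whitespace from all strings."""
    if isinstance(value, str):
        return value.replace("\ufffd", "").strip()
    if isinstance(value, list):
        return [strip_replacement_chars(item) for item in value]
    if isinstance(value, dict):
        return {key: strip_replacement_chars(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged

# Stand-in for the file contents; \\ufffd is the JSON escape for the � character.
raw = '{"Address": "4820 ALCOA AVE\\ufffd ", "City": "VERNON\\ufffd "}'
record = strip_replacement_chars(json.loads(raw))
print(record)
```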

Python CSV writer, how to handle quotes in order to avoid triple quotes in output

I am working with Python's CSV module, specifically the writer. My question is how can I add double quotes to a single item in a list and have the writer write the string the same way as a print statement would?
for example:
import csv
#test "data"
test = ['item1','01','001',1]
csvOut = csv.writer(open('file.txt','a')) #'a' used for keeping past results
test[1] = '"'+test[1]+'"'
print test
#prints: ['item1', '"01"', '001', 1]
csvOut.writerow(test)
#written in the output file: item1,"""01""",001,1
#I was expecting: item1,"01",001,1
del csvOut
I tried adding a quoting=csv.QUOTE_NONE option, but that raised an error. I am guessing this is related to the many csv dialects, I was hoping to avoid digging too far into that.
In retrospect I could probably have built my initial data set smarter and perhaps avoided the need for this situation but at this point curiosity is really getting the better of me (this is a simplified example): how do you keep the written output from adding those extra quotes?
It's not actually triple-quoting, although it looks that way. Try it with another example to see:
test = ['item1', 'abc"def']
Now you'll see that it writes this:
"abc""def"
In other words, it's just wrapping quotes around your string, and escaping the literal quote characters by doubling them, because that's how default Excel-style CSV handles quote characters.
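A short round-trip sketch (Python 3, with an in-memory buffer standing in for file.txt) suggests why the doubled quotes are harmless: a CSV reader using the same default dialect undoes them.

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(["item1", '"01"', "001", 1])
written = buf.getvalue()

# Reading the line back removes the wrapping quotes and un-doubles the rest.
# Note that every field comes back as a string, including the integer 1.
parsed = next(csv.reader(io.StringIO(written)))

print(written, end="")
print(parsed)
```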
The question is, what format do you want here? Almost anything you want (within reason) is doable, but you have to pick something. Backslash-escaping quotes? Backslash-escaping everything instead of using quotes in the first place? Single quotes instead of double quotes?
For example, this looks like an answer:
csvOut = csv.writer(open('file.txt','a'), quotechar="'")
… until you have an item like Filet O'Fish and the whole thing gets single-quoted and the ' gets doubled and you have the exact same problem you were trying to avoid. If you're aiming for human readability, and ' is a lot less common in your data than ", that may actually be the right answer, but it's not a perfect answer.
And really, no answer can be perfect: you need some way to either quote or escape commas—and other things, like newlines—and the way you do that is going to add at least one more character that needs to be quote-doubled or escaped. If you know there are never any commas, newlines, etc. in your data, and there's at least one other character you know will never show up, you can get away with setting either quotechar to that other character, or escapechar to that other character and quoting=QUOTE_NONE. But the first time someone unexpectedly uses the character you were sure would never appear, your code will break, so you'd better actually be sure.
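As a sketch of the escapechar route, assuming Python 3 and that a backslash is acceptable as the escape character: with quoting=csv.QUOTE_NONE and quotechar=None, the double quote is treated as ordinary data, while delimiters inside a field get a backslash in front of them.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(
    buf,
    quoting=csv.QUOTE_NONE,  # never wrap fields in quotes
    quotechar=None,          # so '"' is written as ordinary data
    escapechar="\\",         # placed before delimiters found inside a field
    lineterminator="\n",
)
writer.writerow(["item1", '"01"', "001", 1])
writer.writerow(["Filet O'Fish", "fries, large"])
output = buf.getvalue()

print(output, end="")
```

A program reading this file would need the same settings, and a backslash in the data would itself be escaped, so this only moves the problem to a different character.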
Quotes get escaped because your data could contain a comma. You probably don't want a CSV file if you don't want quotes escaped. Just join on a comma (this will break downstream if your data has a comma in it)

Python: Removing particular character (u"\u2610") from string

I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over xml text files (sample) that are apparently coded in utf-8, using Beautiful Soup to parse each file, then looking to see if any sentence in the file contains one or more words from two different lists of words. Because the xml files are from the eighteenth century, I need to retain the em dashes that are in the xml. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is this character.
(You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.)
To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove.
for work in glob.glob(pathtofiles):
    openfile = open(work)
    readfile = openfile.read()
    stringfile = str(readfile)
    decodefile = stringfile.decode('utf-8', 'strict') #is this the dodgy line?
    soup = BeautifulSoup(decodefile)
    textwithtags = soup.findAll('text')
    textwithtagsasstring = str(textwithtags)
    #this method strips everything between anglebrackets as it should
    textwithouttags = stripTags(textwithtagsasstring)
    #clean text
    nonewlines = textwithouttags.replace("\n", " ")
    noextrawhitespace = re.sub(' +',' ', nonewlines)
    print noextrawhitespace #the boxes appear
I tried to remove the boxes by using
noboxes = noextrawhitespace.replace(u"\u2610", "")
But Python threw an error flag:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128)
Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.
The problem is that you're mixing unicode and str. Whenever you do that, Python has to convert one to the other, which it does by using sys.getdefaultencoding(), which is usually ASCII, which is almost never what you want.*
If the exception comes from this line:
noboxes = noextrawhitespace.replace(u"\u2610", "")
… the fix is simple… except that you have to know whether noextrawhitespace is supposed to be a unicode object or a UTF-8-encoded str object. If the former, it's this:
noboxes = noextrawhitespace.replace(u"\u2610", u"")
If the latter, it's this:
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
But really, you have to get all of the strings consistent in your code; mixing the two up is going to cause problems in more places than this one.
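For comparison, here is how the same rule might look in Python 3, where the two types are bytes and str and mixing them raises a TypeError instead of triggering an implicit ASCII conversion. This is a sketch, not the asker's code; the sample bytes are made up to stand in for a file's contents.

```python
# Stand-in for bytes read from a UTF-8 file: a ballot box and an em dash.
raw_bytes = "abc\u2610def \u2014 ghi".encode("utf-8")

# Mixing types fails immediately in Python 3.
try:
    raw_bytes.replace("\u2610", "")
    mixing_failed = False
except TypeError:
    mixing_failed = True

# Decode once at the boundary, then work with text only.
text = raw_bytes.decode("utf-8")
noboxes = text.replace("\u2610", "")

print(noboxes)
```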
Since I don't have your XML files to test, I wrote my own:
<xml>
<text>abc☐def</text>
</xml>
Then, I added these two lines to the bottom of your code (and a bit to the top to just open my file instead of globbing for whatever):
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
print noboxes
The output is now:
[<text>abc☐def</text>]
[<text>abc☐def</text>]
[<text>abcdef</text>]
So, I think that's what you want here.
* Sure sometimes you want ASCII… but those aren't usually the times when you have unicode objects…
Give this a try:
noextrawhitespace.replace("\\u2610", "")
I think you are just missing that extra '\'
This might also work.
print(noextrawhitespace.decode('unicode_escape').encode('ascii','ignore'))
Reading your sample, the following are the non-ASCII characters in the document:
0x2223 DIVIDES
0x2022 BULLET
0x3009 RIGHT ANGLE BRACKET
0x25aa BLACK SMALL SQUARE
0x25ca LOZENGE
0x3008 LEFT ANGLE BRACKET
0x2014 EM DASH
0x2026 HORIZONTAL ELLIPSIS
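A listing like the one above can be produced with the standard unicodedata module. The following is only a sketch (Python 3; the function name non_ascii_report and the sample string are invented), not necessarily how the list was generated.

```python
import unicodedata

def non_ascii_report(text):
    """Map each distinct non-ASCII character in text to its Unicode name."""
    report = {}
    for ch in text:
        if ord(ch) > 127 and ch not in report:
            # The second argument is returned for characters without a name.
            report[ch] = unicodedata.name(ch, "<unnamed>")
    return report

sample = "Bride\u2223groom \u2014 well\u2026"
report = non_ascii_report(sample)
for ch, name in report.items():
    print("0x%04x %s" % (ord(ch), name))
```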
\u2223 is the actual character in question in line 3682, and it is being used as a soft hyphen. The others are used in markup for tagging illegible characters, such as:
<GAP DESC="illegible" RESP="oxf" EXTENT="4+ letters" DISP="\u2022\u2022\u2022\u2022\u2026"/>
Here's some code to do what your code is attempting. Make sure to process in Unicode:
from bs4 import BeautifulSoup
import re
with open('k000039.000.xml') as f:
    soup = BeautifulSoup(f) # BS figures out the encoding
text = u''.join(soup.strings) # strings is a generator for just the text bits.
text = re.sub(ur'\s+',ur' ',text) # Simplify all white space.
text = text.replace(u'\u2223',u'') # Get rid of the DIVIDES character.
print text
Output:
[[truncated]] reckon my self a Bridegroom too. Buckle. I doubt Kickey won't find him such. [Aside.] Mrs. Sago. Well,—poor Keckky's bound to good Behaviour, or she had lost quite her Puddy's Favour. Shall I for this repine at Fortune?—No. I'm glad at Heart that I'm forgiven so. Some Neighbours Wives have but too lately shown, When Spouse had left 'em all their Friends were flown. Then all you Wives that wou'd avoid my Fate. Remain contented with your present State FINIS.

How to recognize special eol character when I see it, using Python?

I'm scraping a set of originally pdf files, using Python. Having gotten them to text, I had a lot of trouble getting the line endings out. I couldn't figure out what the line separator was. The trouble is, I still don't know.
It's not a '\n', or, I don't think, '\r\n'. However, I've managed to isolate one of these special characters. I literally have it in memory, and by doing a call to my_str.replace(eol, ''), I can remove all of these characters from one of my files.
So my question is open-ended. I'm a bit lost when it comes to unicode and such. How can I identify this character in my files without resorting to something ridiculous, like serializing it and then reading it in? Is there a way I can refer to it as a code, perhaps? I can't get Python to yield what it actually IS. All I ever see if I print it, or call unicode(special_eol), is the character in its functional usage as a newline.
Please help! Thanks, and sorry if I'm missing something obvious.
To determine what specific character that is, you can use str.encode('unicode_escape') or repr() to get (in Python 2) an ASCII-printable representation of the character:
>>> print u'☃'.encode('unicode_escape')
\u2603
>>> print repr(u'☃')
u'\u2603'
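In Python 3 the equivalent tools are str.encode('unicode_escape') and the built-in ascii(). The sketch below uses U+2028 LINE SEPARATOR purely as an assumed example of a character that behaves like a line break without being '\n'; the asker's actual character is unknown, and the helper code_points is made up.

```python
def code_points(text):
    """Return the code point of every character in text as 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in text]

special_eol = "\u2028"  # assumed example, not necessarily the character in the files

escaped = special_eol.encode("unicode_escape")  # bytes holding a backslash escape
as_ascii = ascii(special_eol)                   # repr() restricted to ASCII
points = code_points(special_eol)

# str.splitlines() treats this character as a line boundary, which may explain
# why it acts like a newline even though it is not '\n'.
pieces = "first\u2028second".splitlines()

print(escaped, as_ascii, points, pieces)
```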
