I am trying to read a file and replace every "<a ... </a>" occurrence with '\footnotemark':
import re

with open('myfile', 'r') as myfile:
    data = myfile.read()

data = re.sub('<a.+?</a>', '\footnotemark', data)
Somehow Python always turns '\footnotemark' into '\x0cootnotemark' ('\f' becomes '\x0c'). So far I have tried:

escaping: '\\footnotemark'
raw strings: r'\footnotemark' and r'"\footnotemark"'

None of these worked.
Example input:
foo<a>asdasd</a> bar
Example output:
foo\footnotemark bar
Assuming Python 2, since you haven't mentioned the version:

#!/usr/bin/python
import re
# myfile is saved with utf-8 encoding
with open('myfile', 'r') as myfile:
    text = myfile.read()

print text

data = re.sub('<a.+?</a>', r'\\footnotemark', text)
print data
outputs
foo<a>asdasd</a> bar
foo\footnotemark bar
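The reason the raw string alone fails is that re.sub() processes backslash escapes in the replacement string a second time, so \f becomes a form feed even inside r'...'. A minimal, self-contained sketch of the pitfall (the sample text here is made up):

```python
import re

# Made-up sample input for illustration.
text = 'foo<a>asdasd</a> bar'

# In an ordinary string literal, '\f' is already the form-feed character:
assert '\footnotemark' == '\x0cootnotemark'

# A raw string keeps the backslash in the literal, but re.sub() interprets
# escapes in the *replacement* string too, so the backslash must be doubled:
result = re.sub('<a.+?</a>', r'\\footnotemark', text)
print(result)  # foo\footnotemark bar
```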
Related
I have a .csv file which is encoded in UTF-8.
I am working with Python 2.7.
Something interesting happens on Ubuntu.
When I print out the results of the file like this:
import csv

with open("file.csv", "r") as file:
    myFile = csv.reader(file, delimiter=",")
    for row in myFile:
        print row
I get escape sequences like \xc3\xa1 in the output. Note that row is a list, and all its elements appear as quoted strings ('') in the output.
When I print out the results like this:
with open("file.csv", "r") as file:
    myFile = csv.reader(file, delimiter=",")
    for row in myFile:
        print ",".join(row)
Everything is decoded fine. Note that every row from my original file is one big string here.
Why is that?
This is because when printing a list, Python uses repr() on each element, but when printing a string directly it uses str(). Example:
# -*- coding: utf-8 -*-
unicode_str = 'åäö'  # in Python 2 this is actually a UTF-8 byte string
unicode_str_list = [unicode_str, unicode_str]
print 'unwrapped:', unicode_str
print 'in list:', unicode_str_list
print 'repr:', repr(unicode_str)
print 'str:', str(unicode_str)
Produces:
unwrapped: åäö
in list: ['\xc3\xa5\xc3\xa4\xc3\xb6', '\xc3\xa5\xc3\xa4\xc3\xb6']
repr: '\xc3\xa5\xc3\xa4\xc3\xb6'
str: åäö
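The same repr()-vs-str() distinction carries over to Python 3, where it is easiest to see with bytes objects; a small sketch (the sample string is illustrative):

```python
# Encoding a string gives bytes; printing a list shows each element's repr(),
# so the escape sequences become visible.
raw = 'åäö'.encode('utf-8')
print([raw, raw])  # [b'\xc3\xa5\xc3\xa4\xc3\xb6', b'\xc3\xa5\xc3\xa4\xc3\xb6']

# Decoding each element first gives readable output, as with ",".join above.
decoded = [b.decode('utf-8') for b in [raw, raw]]
print(','.join(decoded))  # åäö,åäö
```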
Hopefully someone can help me out with the following. It is probably not too complicated, but I haven't been able to figure it out. My "output.txt" file is created with:
f = open('output.txt', 'w')
print(tweet['text'].encode('utf-8'))
print(tweet['created_at'][0:19].encode('utf-8'))
print(tweet['user']['name'].encode('utf-8'))
f.close()
If I don't encode it when writing to the file, it gives me errors. So "output.txt" contains 3 rows of UTF-8 encoded output:
b'testtesttest'
b'line2test'
b'\xca\x83\xc9\x94n ke\xc9\xaan'
In "main.py", I am trying to convert this back to a string:
f = open("output.txt", "r", encoding="utf-8")
text = f.read()
print(text)
f.close()
Unfortunately, the b'' format is still not removed. Do I still need to decode it? If possible, I would like to keep the 3-row structure.
My apologies for the newbie question, this is my first one on SO :)
Thank you so much in advance!
With the help of the people answering my question, I have been able to get it to work. The solution is to change how the file is written:
tweet = json.loads(data)
tweet_text = tweet['text'] # content of the tweet
tweet_created_at = tweet['created_at'][0:19] # tweet created at
tweet_user = tweet['user']['name'] # tweet created by
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(tweet_text + '\n')
    f.write(tweet_created_at + '\n')
    f.write(tweet_user + '\n')
Then read it like:
f = open("output.txt", "r", encoding='utf-8')
text = f.read()
print(text)
f.close()
Instead of specifying the encoding when opening the file, use it to decode as you read.
f = open("output.txt", "rb")
text = f.read().decode(encoding="utf-8")
print(text)
f.close()
If b and the quote ' are in your file, that means there is a problem with your file; someone probably did write(print(line)) instead of write(line). Now, to decode it, you can use literal_eval. Otherwise, @m_callens' answer should be fine.
import ast

with open("b.txt", "r") as f:
    text = [ast.literal_eval(line) for line in f]

for l in text:
    print(l.decode('utf-8'))
# testtesttest
# line2test
# ʃɔn keɪn
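To see what ast.literal_eval does with one such line, here is a self-contained check that needs no file (the sample line mirrors the third row above):

```python
import ast

# One line exactly as it appears in the broken file: the repr of a bytes object.
line = "b'\\xca\\x83\\xc9\\x94n ke\\xc9\\xaan'"

# literal_eval evaluates the literal back into an actual bytes object,
# which can then be decoded as UTF-8.
obj = ast.literal_eval(line)
print(obj.decode('utf-8'))  # ʃɔn keɪn
```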
The content of the text file is:
u'\u26be\u26be\u26be'
When I run the script...
import codecs
f1 = codecs.open("test1.txt", "r", "utf-8")
text = f1.read()
print text
str1 = u'\u26be\u26be\u26be'
print(str1)
I get the output...
u'\u26be\u26be\u26be'
⚾⚾⚾
Question: why is it that a string with the same content as the file produces the emojis properly?
The file content u'\u26be\u26be\u26be' is like r"u'\u26be\u26be\u26be'". In other words, it consists of the literal characters u, ', \, u, 2, and so on.
You can convert such string to the string ⚾⚾⚾ using ast.literal_eval:
import ast
import codecs
with codecs.open("test1.txt", "r", "utf-8") as f1:
    text = ast.literal_eval(f1.read())
    print text
...
But why does the file contain such a string (u'\u26be\u26be\u26be') instead of ⚾⚾⚾? Maybe you should consider redesigning the file-saving part.
If the input file is required to have unicode escapes, you will need to filter it like so:

with open("test1.txt", "r") as f1:
    text = f1.read()
    print unicode(text, 'unicode_escape')
str1 = u'\u26be\u26be\u26be'
print(str1)
No need to import other libraries.
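In Python 3, where the unicode() builtin is gone, the same filtering can be done with the 'unicode_escape' codec; a sketch on the literal file content (note the u'...' wrapper characters survive, since they are ordinary text, not escapes):

```python
# The file line as literal characters, no escapes interpreted yet.
raw = r"u'\u26be\u26be\u26be'"

# 'unicode_escape' is a bytes-to-str codec, so encode to ASCII bytes first;
# each \u26be escape then becomes the baseball character.
decoded = raw.encode('ascii').decode('unicode_escape')
print(decoded)  # u'⚾⚾⚾'
```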
I have an input file containing JavaScript code with many five-digit IDs. I want to have these IDs in a list, like:
53231, 53891, 72829, etc.
This is my current Python file:
import re
fobj = open("input.txt", "r")
text = fobj.read()
output = re.findall(r'[0-9][0-9][0-9][0-9][0-9]' ,text)
outp = open("output.txt", "w")
How can I get these IDs into the output file the way I want?
Thanks
import re

# Use "with" so the file will automatically be closed
with open("input.txt", "r") as fobj:
    text = fobj.read()

# Use word boundary anchors (\b) so only five-digit numbers are matched.
# Otherwise, 123456 would also be matched (and the match result would be 12345)!
output = re.findall(r'\b\d{5}\b', text)

# Join the matches together
out_str = ",".join(output)

# Write them to a file, again using "with" so the file will be closed.
with open("output.txt", "w") as outp:
    outp.write(out_str)
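A quick self-contained check of the word-boundary behaviour described in the comments (the sample string is made up):

```python
import re

# 123456 is six digits and 1234 is four, so \b\d{5}\b matches neither.
sample = "ids: 53231, 53891, 72829 -- but not 123456 or 1234"
ids = re.findall(r'\b\d{5}\b', sample)
print(",".join(ids))  # 53231,53891,72829
```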