Read txt file into python [duplicate] - python

This question already has answers here:
Wrong encoding when reading file in Python 3?
(1 answer)
Read special characters from .txt file in python
(3 answers)
Closed last year.
I have a German word list which contains special characters like ä, ö, ü, for example the word "Nährstoffe". But when I read the text file and create a dict from it, I get a garbled word out of it.
Here is my code in python3:
import random
import csv
import os
permanettxtfile='wortliste.txt'
newlines = open(permanettxtfile, "r")
lines=newlines.read().split('\n')
random.shuffle(lines)
linkdict=dict.fromkeys(lines)
print(linkdict)
I get as output:
'Nährstoffe': None
But i want:
'Nährstoffe': None
How can I solve this issue? Is this a UTF-8 issue?

Try opening file in utf-8 encoding:
import random
import csv
import os
permanettxtfile='wortliste.txt'
with open(permanettxtfile, 'r', encoding='utf-8') as file:
    lines = file.read().split('\n')
random.shuffle(lines)
linkdict = dict.fromkeys(lines)
print(linkdict)
Also, don't forget to close the file, either with a context manager as in my example or by calling newlines.close() in your code.

Specify the encoding using
open(permanettxtfile, "r", encoding="UTF-8")

It is most likely an encoding issue. You can try this:
with open("filename.txt", "rb") as f:
    contents = f.read().decode("UTF-8")
or
with open("filename.txt", encoding='utf-8') as f:
    contents = f.read()

Related

Opening a CSV explicitly saved as UTF-8 still shows its encoding as cp1252 [duplicate]

This question already has answers here:
python 3 open() default encoding
(2 answers)
Closed 1 year ago.
I have a Python script that generates a .csv file from a given pandas DataFrame.
Even though in python3 the default pandas.to_csv() sets the encoding to 'utf-8', I also specify it in the code (after the file is generated):
df.to_csv(filename, index=False, encoding='utf-8')
I check the encoding type using:
with open(filename) as f:
    print(f)
after which I get: encoding='cp1252'
Could anyone explain why this is the case?
Try this:
with open(filename, encoding="utf8") as f:
    print(f)
Try
with open(filename, encoding='utf-8') as f:
    print(f)
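The reason this works, as far as I understand it: the encoding shown when printing the file object is the one open() was asked to use for decoding, not something detected from the bytes on disk. When no encoding is passed, open() falls back to a platform-dependent default, which is cp1252 on many Windows setups. A small sketch, using a hypothetical temporary file instead of the pandas output:

```python
import os
import tempfile

# Hypothetical example file, written explicitly as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write("name\nJosé\n")

# The file object reports the encoding it was opened with.
with open(path, encoding="utf-8") as f:
    reported_utf8 = f.encoding

# Opening the very same file with another encoding changes the report,
# even though the bytes on disk are unchanged.
with open(path, encoding="cp1252") as f:
    reported_cp1252 = f.encoding

print(reported_utf8, reported_cp1252)
```

So the cp1252 in the question says nothing about how to_csv() wrote the file; it only describes how that particular open() call would decode it.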

Using regex in Python script [duplicate]

This question already has answers here:
How to input a regex in string.replace?
(7 answers)
Closed 4 years ago.
I am trying to write a Python script to search for this line and replace it:
time residue 3 Total
with an empty line.
This is my script:
import glob
read_files = glob.glob("*.agr")
with open("out.txt", "w") as outfile:
    for f in read_files:
        with open(f, "r") as infile:
            outfile.write(infile.read())
with open("out.txt", "r") as file:
    filedata = file.read()
filedata = filedata.replace(r'#time\s+residue\s+[0-9]\s+Total', 'x')
with open("out.txt", "w") as file:
    file.write(filedata)
Using this, I am not able to get any replacement. Why could that be? The rest of the code is working fine. The output file is unchanged, so I suspect that the pattern cannot be found.
Thank you.
The str.replace method replaces a fixed substring. Use re.sub instead if you're looking to replace a match of a regex pattern:
import re
filedata = re.sub(r'#time\s+residue\s+[0-9]\s+Total', 'x', filedata)
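A quick way to see the difference between the two calls is to run both on a small sample. The sample text below is made up for illustration, and the replacement is an empty string so that the matched line becomes an empty line:

```python
import re

# Made-up sample standing in for the contents of out.txt.
filedata = "header\n#time   residue 3   Total\ndata 1 2 3\n"
pattern = r'#time\s+residue\s+[0-9]\s+Total'

# str.replace looks for the pattern text literally (backslashes and all),
# finds nothing, and returns the string unchanged.
literal_result = filedata.replace(pattern, '')

# re.sub interprets the pattern as a regular expression and replaces the match.
regex_result = re.sub(pattern, '', filedata)

print(literal_result == filedata)
print(repr(regex_result))
```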

Python3: Convert Latin-1 to UTF-8 [duplicate]

This question already has answers here:
Python: Converting from ISO-8859-1/latin1 to UTF-8
(5 answers)
Closed 1 year ago.
My code looks like the following:
for file in glob.iglob(os.path.join(dir, '*.txt')):
    print(file)
    with codecs.open(file,encoding='latin-1') as f:
        infile = f.read()
    with codecs.open('test.txt',mode='w',encoding='utf-8') as f:
        f.write(infile)
The files I work with are encoded in Latin-1 (I could not open them in UTF-8 obviously). But I want to write the resulting files in utf-8.
But this:
<Trans audio_filename="VALE_M11_070.MP3" xml:lang="español">
<Datos clave_texto=" VALE_M11_070" tipo_texto="entrevista_semidirigida">
<Corpus corpus="PRESEEA" subcorpus="ESESUMA" ciudad="Valencia" pais="España"/>
Instead becomes this (in gedit):
<Trans audio_filename="VALE_M11_070.MP3" xml:lang="espa뇃漀氀∀㸀ഀ਀㰀䐀愀琀`漀猀 挀氀愀瘀攀开琀攀砀琀漀㴀∀ 嘀䄀䰀䔀开䴀㄀㄀开 㜀
If I print it on the Terminal, it shows up normal.
Even more confusing is what I get when I open the resulting file with LibreOffice Writer:
<#T#r#a#n#s# (and so on)
So how do I properly convert a latin-1 string to a utf-8 string? In python2, it's easy, but in python3, it seems confusing to me.
I tried already these in different combinations:
#infile = bytes(infile,'utf-8').decode('utf-8')
#infile = infile.encode('utf-8').decode('utf-8')
#infile = bytes(infile,'utf-8').decode('utf-8')
But somehow I always end up with the same weird output.
Thanks in advance!
Edit: This question is different to the questions linked in the comment, as it concerns Python 3, not Python 2.7.
I have found a partial solution to this. It is not exactly what you want or need, but it might point others in the right direction...
# First read the file
txt = open("file_name", "r", encoding="latin-1") # r = read, w = write & a = append
items = txt.readlines()
txt.close()
# and write the changes to file
output = open("file_name", "w", encoding="utf-8")
for string_fin in items:
    if "é" in string_fin:
        string_fin = string_fin.replace("é", "é")
    if "ë" in string_fin:
        string_fin = string_fin.replace("ë", "ë")
    # this works if not too much needs changing...
    output.write(string_fin)
output.close()
*note for detection
For python 3.6:
your_str = your_str.encode('utf-8').decode('latin-1')
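For the conversion the question asks about, a minimal Python 3 sketch could look like the following. It assumes the source files really are Latin-1; the function name convert_latin1_to_utf8 and both paths are hypothetical. The built-in open() takes an encoding argument, so codecs is not needed here.

```python
def convert_latin1_to_utf8(src_path, dst_path):
    """Decode src_path as Latin-1 and write the same text to dst_path as UTF-8."""
    # newline="" keeps the original line endings untouched in both directions.
    with open(src_path, "r", encoding="latin-1", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Usage would be something like convert_latin1_to_utf8('input.txt', 'test.txt'). If the result still looks wrong in an editor, it may be worth checking whether the input is actually in a different encoding than assumed.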

Python: Problems with latin characters in output

I have a document in Spanish I'd like to format using Python. Problem is that in the output file, the accented characters are messed up, in this manner: \xc3\xad.
I succeeded in keeping the proper characters when I did some similar editing a while back, and although I've tried everything I did then and more, somehow it won't work this time.
This is current version of the code:
# -*- coding: utf-8 -*-
import re
import pickle
inputfile = open("input.txt").read()
pat = re.compile(r"(#.*\*)")
mylist = pat.findall(inputfile)
outputfile = open("output.txt", "w")
pickle.dump(mylist, outputfile)
outputfile.close()
I'm using Python 2.7 on Windows 7.
Can anyone see any obvious problems? The inputfile is encoded in utf-8, but I've tried encoding it latin-1 too. Thanks.
To clarify: My problem is that the Latin characters don't show up properly in the output.
It's solved now, I just had to add this line as suggested by mata:
inputfile = inputfile.decode('utf-8')
If the input file is encoded in utf-8, then you should decode it first to work with it:
import re
import pickle
inputfile = open("input.txt").read()
inputfile = inputfile.decode('utf-8')
pat = re.compile(r"(#.*\*)")
mylist = pat.findall(inputfile)
outputfile = open("output.txt", "w")
pickle.dump(mylist, outputfile)
outputfile.close()
The file created this way will contain a pickled version of your list. If you would rather have a human-readable file, then you might want to just use a plain file.
Also, a good way to deal with different encodings is using the codecs module:
import re
import codecs
with codecs.open("input.txt", "r", "utf-8") as infile:
    inp = infile.read()
pat = re.compile(r"(#.*\*)")
mylist = pat.findall(inp)
with codecs.open("output.txt", "w", "utf-8") as outfile:
    outfile.write("\n".join(mylist))
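For anyone on Python 3, roughly the same thing can be sketched with the built-in open(), which accepts an encoding argument directly. The function name extract_matches is made up for this example; the pattern is the one from the question.

```python
import re

PATTERN = re.compile(r"(#.*\*)")

def extract_matches(in_path, out_path):
    """Read UTF-8 text, collect every pattern match, write one match per line."""
    with open(in_path, "r", encoding="utf-8") as infile:
        text = infile.read()  # already decoded to str in Python 3
    matches = PATTERN.findall(text)
    with open(out_path, "w", encoding="utf-8") as outfile:
        outfile.write("\n".join(matches))
    return matches
```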

Replace string within file contents [duplicate]

This question already has answers here:
Replacing instances of a character in a string
(17 answers)
How to search and replace text in a file?
(22 answers)
How to read a large file - line by line?
(11 answers)
Writing a list to a file with Python, with newlines
(26 answers)
Closed 7 months ago.
How can I open a file, Stud.txt, and then replace any occurrences of "A" with "Orange"?
with open("Stud.txt", "rt") as fin:
    with open("out.txt", "wt") as fout:
        for line in fin:
            fout.write(line.replace('A', 'Orange'))
If you'd like to replace the strings in the same file, you probably have to read its contents into a local variable, close it, and re-open it for writing:
I am using the with statement in this example, which closes the file after the with block is terminated - either normally when the last command finishes executing, or by an exception.
def inplace_change(filename, old_string, new_string):
    # Safely read the input filename using 'with'
    with open(filename) as f:
        s = f.read()
        if old_string not in s:
            print('"{old_string}" not found in {filename}.'.format(**locals()))
            return
    # Safely write the changed content, if found in the file
    with open(filename, 'w') as f:
        print('Changing "{old_string}" to "{new_string}" in {filename}'.format(**locals()))
        s = s.replace(old_string, new_string)
        f.write(s)
It is worth mentioning that if the filenames were different, we could have done this more elegantly with a single with statement.
#!/usr/bin/python
with open(FileName) as f:
    newText=f.read().replace('A', 'Orange')
with open(FileName, "w") as f:
    f.write(newText)
Using pathlib (https://docs.python.org/3/library/pathlib.html)
from pathlib import Path
file = Path('Stud.txt')
file.write_text(file.read_text().replace('A', 'Orange'))
If the input and output files were different, you would use two different variables for read_text and write_text.
If you wanted a change more complex than a single replacement, you would assign the result of read_text to a variable, process it and save the new content to another variable, and then save the new content with write_text.
If your file was large, you would prefer an approach that does not read the whole file into memory but rather processes it line by line, as shown by Gareth Davidson in another answer (https://stackoverflow.com/a/4128192/3981273), which of course requires using two distinct files for input and output.
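One way to combine the line-by-line approach with an apparent in-place edit is to write to a temporary file and then swap it into place. This is only a sketch under a few assumptions: the function name replace_in_large_file is made up, the file is assumed to be UTF-8, and a match that spans a line break will not be found.

```python
import os
import tempfile

def replace_in_large_file(path, old, new, encoding="utf-8"):
    """Replace old with new one line at a time, then swap the result into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives next to the original so that os.replace
    # does not have to move data across filesystems.
    with open(path, "r", encoding=encoding, newline="") as src, \
            tempfile.NamedTemporaryFile("w", encoding=encoding, newline="",
                                        dir=directory, delete=False) as dst:
        for line in src:
            dst.write(line.replace(old, new))
        temp_name = dst.name
    # Both files are closed at this point; overwrite the original.
    os.replace(temp_name, path)
```

For the question's case this would be called as replace_in_large_file('Stud.txt', 'A', 'Orange').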
Something like
with open('Stud.txt') as file:
    contents = file.read()
replaced_contents = contents.replace('A', 'Orange')
# <do stuff with the result>
with open('Stud.txt','r') as f:
    newlines = []
    for line in f.readlines():
        newlines.append(line.replace('A', 'Orange'))
with open('Stud.txt', 'w') as f:
    for line in newlines:
        f.write(line)
If you are on Linux and just want to replace the word dog with cat, you can do:
test.txt:
Hi, i am a dog and dog's are awesome, i love dogs! dog dog dogs!
Linux Command:
sed -i 's/dog/cat/g' test.txt
Output:
Hi, i am a cat and cat's are awesome, i love cats! cat cat cats!
Original Post: https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands
The easiest way is to do it with regular expressions. Assuming that you want to iterate over each line in the file (where 'A' would be stored), you do...
import re
# Opening a file in write mode clears it and writes only what you tell it to,
# so we have to save the contents first.
with open(r'C:\full_path\Stud.txt', 'r') as infile:
    saved_input = infile.readlines()
# Now we change entries with 'A' to 'Orange'
for i in range(len(saved_input)):
    saved_input[i] = re.sub('A', 'Orange', saved_input[i])
# Now we open the file in write mode (clearing it) and write saved_input back to it
with open(r'C:\full_path\Stud.txt', 'w') as outfile:
    for each in saved_input:
        outfile.write(each)
