The code below does a simple regex-based string replacement on a file, replacing two consecutive tab characters in the source text file with "TT" and writing the result back. The replacement itself works fine, but for some reason the operation also adds weird special characters (a question mark in a diamond) to the file.
How can I avoid this?
From the documentation of file.truncate:
The current file position is not changed.
The file position stays where it was, so a write after truncating starts at the old offset, and the system fills the gap with null bytes; those are what your editor renders as the question-mark-in-a-diamond replacement character. You have to f.seek(0) after truncating, before you write.
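Since the original code is not shown, here is a minimal sketch of the fix; the two-tab replacement is taken from the question, while the filename and the in-place rewrite are assumptions:

```python
import re

# Create a sample file for the demo (stands in for the real source file).
with open("source.txt", "w") as f:
    f.write("a\t\tb\t\tc")

with open("source.txt", "r+") as f:
    content = re.sub("\t\t", "TT", f.read())  # two tabs -> "TT"
    f.truncate(0)  # discard the old contents
    f.seek(0)      # without this, the write starts at the old offset
                   # and the gap is padded with null bytes
    f.write(content)

with open("source.txt") as f:
    print(f.read())  # aTTbTTc
```

Leaving out the seek(0) is exactly what produces the diamond characters: the file is truncated to zero bytes, but the next write lands at the old end-of-file offset, and everything in between becomes \x00.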
I am trying to write a Python script to practice the re.sub method, but when I run the script with python3, I find that the string in the file doesn't change.
Here is my location.txt file,
34.3416,108.9398
this is what regex.py contains,
import re
with open('location.txt', 'r+') as second:
    content = second.read()
    content = re.sub('([-+]?\d{2}\.\d{4},[-+]?\d{2}\.\d{4})', '44.9740,-93.2277', content)
    print(content)
I set up a print statement to test the output, and it gives me
34.3416,108.9398
which is not what I want.
Then I changed the "r+" to "w+", and it completely removed the content of location.txt. Can anyone tell me the reason?
Your regexp has a problem, as pointed out by Andrej Kesely in the other answer: \d{2} should be \d{2,3}:
content = re.sub(r'([-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4})', '44.9740,-93.2277', content)
After fixing that, there is a second problem: you changed the string, but you never wrote it back to the file; you were only modifying the variable in memory. Add, inside the with block:
second.seek(0)         # return to the beginning of the file
second.write(content)  # write the data back to the file
second.truncate()      # remove extraneous bytes (in case the content shrank)
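Putting both fixes together, the whole script would look like this (a sketch; the first block just recreates the sample location.txt from the question so the example is self-contained):

```python
import re

# Recreate the sample location.txt from the question.
with open('location.txt', 'w') as f:
    f.write('34.3416,108.9398')

with open('location.txt', 'r+') as second:
    content = second.read()
    # \d{2,3} allows two- or three-digit integer parts, so 108.9398 matches
    content = re.sub(r'([-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4})',
                     '44.9740,-93.2277', content)
    second.seek(0)        # back to the start of the file
    second.write(content)
    second.truncate()     # drop leftover bytes if the text shrank

print(open('location.txt').read())  # 44.9740,-93.2277
```

The truncate() call matters whenever the replacement is shorter than the original text; without it, the tail of the old contents would remain after the new data.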
The second number in your location.txt is 108.9398, which has three digits before the dot, so it doesn't match your regexp. Change your regexp to:
([-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4})
When exporting Excel/LibreOffice sheets as CSV, where cells can contain new lines, the resulting file preserves those new lines as literal newline characters, not as something like the two-character string "\n".
The standard csv module in Python 3 apparently does not handle this as would be necessary. The documentation says: "The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future." Well, duh.
Is there some other way to read in such CSV files properly? What csv really should do is ignore any new lines within quoted text fields and only recognise newline characters outside a field, but since it does not, is there a different way to solve this short of implementing my own CSV parser?
Try using pandas with something like df = pandas.read_csv('my_data.csv'). You'll have more granular control over how the data is read in. If you're worried about formatting, you can also set the CSV delimiter in LibreOffice to something that doesn't occur in nature, like ;;.
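A minimal sketch of the pandas approach, using an in-memory CSV so it is self-contained (the column names are made up): read_csv keeps a quoted field that spans several lines together as a single value.

```python
import io
import pandas

# A quoted field containing a literal newline, as exported by LibreOffice.
raw = 'name,notes\nalice,"line one\nline two"\n'

df = pandas.read_csv(io.StringIO(raw))
print(repr(df.loc[0, 'notes']))  # 'line one\nline two'
```

The same call works on a file path instead of a StringIO, so the LibreOffice export can be read directly.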
I am processing, with Python, a long list of data that looks like this.
(The strange digraphs are probably due to encoding problems; I am not sure whether these characters will be preserved on this site.)
29/07/2016 04:00:12 0.125143
Now, when I read such a file into a script using something like open and readlines, there is an error, reading
SyntaxError: EOL while scanning string literal
I know how to use (or can look up) replace and regex functions, but I cannot apply them in my script. The biggest problem is that anywhere I include or read such a strange character, the error occurs, pointing at the very line where it is read. So I cannot do anything to them.
Are you reading a file? If so, try to extract the values using regexps rather than removing the extra characters:
re.search(r'^([\d/: ]{19})', line).group(1)
re.search(r'([\d.]{7})', line).group(1)
I find that re.findall works. (I am sorry I do not have time to test all the other methods; the significance of this job has vanished, and I have even forgotten the question itself.)
import re

def extract_numbers(str_i):
    pat = r"(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)"
    match_h = re.findall(pat, str_i)
    return match_h[0]

# ....
# `f` is the handle of the file in question
lines = f.readlines()
for l in lines:
    ls_f = extract_numbers(l)
    # process them....
Following up on my earlier question here: Row limit in read.table.ffdf?
I have a text file with >285 million records, but about two-thirds of the way through there are several non-ASCII characters that are interpreted by AWK, as well as by several R packages (ff, data.table), as EOF bytes. It appears that the characters were originally entered as degree signs, but they appear in text editors as boxes (see example here). When I try to read in the text file using these methods, it just stops when it encounters the first such character, with no error message, as if the read were complete.
For now I was able to open the file in a text editor to remove these characters. But this is not a long-term solution for this dataset given its size; I need to be able to remove or bypass them without having to open the whole file. I've tried using the quote option in R, and tried replacing all non-ASCII and 'CTRL-M' characters specifically during an awk import, but the read process always stops at the first character. Any solutions? I'm using R and awk now, but am open to other options (python?). Thanks!
gawk -v BINMODE=3 '{ gsub(/[[:cntrl:]]/, "") } 1' infile > outfile
will remove them (the trailing 1 prints every record; BINMODE=3 tells gawk on Windows to do binary input and output, so those bytes are not mistaken for EOF).
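Since you mentioned being open to Python, here is a stream filter along the same lines; it reads the file in fixed-size binary chunks, so no line-oriented parser ever sees the control bytes and the whole file never has to fit in memory. The filenames are placeholders, and the first block just creates a small sample input with a stray control byte (\x1a, an old DOS EOF marker):

```python
# Create a small sample input with one stray control byte.
with open('input.txt', 'wb') as f:
    f.write(b'29/07/2016 04:00:12\t0.125143\x1a\n')

# Control bytes to delete: everything below 32 except tab, LF, CR, plus DEL.
BAD = bytes(b for b in range(32) if b not in (9, 10, 13)) + b'\x7f'
identity = bytes.maketrans(b'', b'')  # identity translation table

with open('input.txt', 'rb') as src, open('clean.txt', 'wb') as dst:
    while True:
        chunk = src.read(1 << 20)  # 1 MiB at a time
        if not chunk:
            break
        dst.write(chunk.translate(identity, BAD))  # delete the bad bytes

print(open('clean.txt', 'rb').read())  # b'29/07/2016 04:00:12\t0.125143\n'
```

Because it works on raw bytes, this is safe to run before handing the cleaned file to R, awk, or anything else.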
I want to keep track of the file pointer on a simple text file (just a few lines), after having used readline() on it. I observed that the tell() function also counts the line endings.
My questions:
How can I instruct the code to skip counting the line endings?
How can I do that regardless of the line-ending type (so it works the same whether the text file uses just \n, just \r, or both)?
You are navigating into trouble.
Don't do that: either use the number tell gives you, or count what you have in memory, regardless of the file contents.
You won't be able to correlate a position in text read into memory with a physical place in a text file: text files are not meant for that. They are meant to be read one line at a time, or as a whole: your program consumes the text and lets the OS worry about the file position.
You can open your file in binary mode, read its contents as they are into memory, and have some method of retrieving readable text from those contents as needed; doing this with a proper class can make it not that messy.
Consider the problem you already have with the line endings, which can be either "\n" or "\r\n" and still count as a single character, and now imagine that situation a hundredfold more complex if the file has a single utf-8 encoded character that takes more than one byte to encode.
And even in binary files, knowing the absolute file-pointer position is only useful in a handful of situations where, usually, one would be better off using a database engine to start with.
tell is tell. It counts the number of bytes from the start of the file to the cursor. \n and \r are bytes, so they get counted. If you want to count the number of bytes, but not count certain characters, you will have to do it manually:
data_read = … # data you have already read
len([b for b in data_read if b not in '\r\n'])
The bad news is that it's far more annoying to do this than just looking at tell. The good news is that it answers both your questions.
or, I suppose you could do
yourfile.tell() - data_read.count('\r') - data_read.count('\n')
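A quick sketch of both approaches side by side, on an in-memory file opened in binary mode so that tell() reports plain byte offsets (on a text-mode file, tell() returns an opaque cookie rather than a character count):

```python
import io

# A file with Windows line endings, opened in binary mode.
f = io.BytesIO(b'ab\r\ncd\r\n')
data_read = f.readline().decode('ascii')  # 'ab\r\n'

print(f.tell())  # 4 -- tell() counts the \r and \n bytes too

# Manual count, skipping line-ending characters:
print(len([c for c in data_read if c not in '\r\n']))  # 2

# Equivalent shortcut using tell():
print(f.tell() - data_read.count('\r') - data_read.count('\n'))  # 2
```

Because both counts subtract \r and \n individually, the result is the same whether the file uses \n, \r, or \r\n line endings.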
result = re.sub("[\r\n]", "", subject)
http://regex101.com/r/kM6dA1
Match a single character present in the list below «[\r\n]»
A carriage return character «\r»
A line feed character «\n»