I have a text file and would like to replace certain elements that are "NaN".
I have usually used the file.replace function to change NaNs to a certain number throughout the entire text file.
Now, I would like to replace NaNs with a certain number in only the first line of the text file, not the whole text.
Could you give me a hint for this problem?
You can read the whole file, call .replace() on the first line only, and write everything to a new file.
with open('in.txt') as fin:
    lines = fin.readlines()
lines[0] = lines[0].replace('old_value', 'new_value')
with open('out.txt', 'w') as fout:
    for line in lines:
        fout.write(line)
If your file isn't really big, you can just use .join():
with open('out.txt', 'w') as fout:
    fout.write(''.join(lines))
And if it is really big, you would probably be better off reading and writing the lines one at a time instead of loading them all first.
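A minimal sketch of that line-by-line variant, for files too large to hold in memory. The function name replace_in_first_line and its parameters are made up for illustration; only the first line is passed through .replace(), and the rest is copied unchanged.

```python
def replace_in_first_line(src_path, dst_path, old, new):
    """Copy src_path to dst_path, replacing old with new in the first line only."""
    with open(src_path) as fin, open(dst_path, 'w') as fout:
        first = fin.readline()               # empty string if the file is empty
        fout.write(first.replace(old, new))
        for line in fin:                     # remaining lines are copied as-is
            fout.write(line)
```

Something like replace_in_first_line('in.txt', 'out.txt', 'NaN', '0') would then only ever hold one line in memory at a time.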
You can hack this provided you accept a few constraints. The replacement string needs to be the same length as the original string. If the replacement string is shorter than the original, pad it with spaces to make it the same length (this only works if extra spaces in your data are acceptable). If the replacement string is longer than the original, you cannot do the replacement in place and need to follow Harold's answer.
with open('your_file.txt', 'r+') as f:
    line = f.readline()  # grab first line
    old = 'NaN'
    new = '0  '  # padded with spaces to make same length as old
    f.seek(0)  # move file pointer to beginning of file
    f.write(line.replace(old, new))
This will be fast on any length file.
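The padding rule can be wrapped in a small helper so the length check is not left to the caller. This is only a sketch: the name overwrite_first_line is made up, and it assumes old and new are plain ASCII so that character counts and byte counts agree.

```python
def overwrite_first_line(path, old, new):
    """Replace old with new in the first line of path, in place."""
    if len(new) > len(old):
        raise ValueError("replacement is longer than the original")
    padded = new.ljust(len(old))        # pad with spaces to the same length
    with open(path, 'r+b') as f:        # binary mode keeps byte offsets exact
        line = f.readline()
        f.seek(0)
        # the rewritten line has exactly the same number of bytes as before
        f.write(line.replace(old.encode('ascii'), padded.encode('ascii')))
```

Because the first line keeps its byte length, nothing after it has to move, which is why the file size does not matter.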
I have a text file consisting of multiline strings (hundreds of lines, actually). Each of the strings starts with an '&' sign. I want to change my text file so that only the first 300 characters of each string remain in the new file. How can I do this using Python?
You can read a file and loop over the lines to do what you want. Strings are easily sliceable in Python, so you can take the first 300 characters and write them to another file.
with open(path, "r") as file, open(newPath, "w") as newFile:
    for line in file:
        # line[:300] keeps the first 300 characters (line[0:301] would keep 301)
        newFile.write(line[:300])
Hope this is what you meant
You could do something like this:
# Open output file in append mode
with open('output.txt', 'a') as out_file:
    # Open input file in read mode
    with open("input.txt", "r") as in_file:
        for line in in_file:
            # Take first 300 characters from line
            # (slicing is safe even when the line is shorter than 300 characters)
            new_line = line[0:300]
            # Write new line to output
            # (You might need to add '\n' for new lines)
            out_file.write(new_line)
            print(new_line)
You can use the string method split to split your lines, then you can use slices to keep only the 300 first characters of each split.
with open("oldFile.txt", "rt") as old_file, open("newFile.txt", "wt") as new_file:
    for line in old_file.read().split("&"):
        if line:  # skip the empty piece before the first "&"
            new_file.write("&{}\n".format(line[:300]))
This version preserves ends of line \n within your strings.
If you want to remove ends of line in each individual string, you can use replace:
with open("oldFile.txt", "rt") as old_file, open("newFile.txt", "wt") as new_file:
    for line in old_file.read().split("&"):
        if line:  # skip the empty piece before the first "&"
            new_file.write("&{}\n".format(line.replace("\n", "")[:300]))
Note that your new file will end with an empty line.
Another note: depending on the size of your file, you may prefer a generator function over split, since split loads the whole file content into memory as a list of strings.
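A rough sketch of such a generator version. The names iter_records and truncate_records are made up, and it assumes each '&' that starts a string sits at the beginning of a line. Unlike the split version, the 300 characters here include the leading '&'.

```python
def iter_records(lines, marker="&"):
    """Yield each record (the marker line plus its continuation lines) as one string."""
    record = []
    for line in lines:
        if line.startswith(marker) and record:
            yield "".join(record)       # a new record starts, emit the previous one
            record = []
        record.append(line)
    if record:
        yield "".join(record)           # emit the last record


def truncate_records(src_path, dst_path, limit=300):
    """Write the first `limit` characters of every record, one record per block."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for record in iter_records(src):
            dst.write(record[:limit].rstrip("\n") + "\n")
```

Only one record is held in memory at a time, so the file size stops mattering as long as individual strings stay reasonably small.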
I have a long text file with 20k+ lines. There are four distinct patterns which identify the beginning of the lines I want to write to a file. These lines are repeated in the entry file. There are lines which don't start with one of those patterns, these lines shall be skipped. I want to grab the lines starting with the four patterns in order and write them to a file-output in the same order as in the base file.
For example:
random text
specific start of the first line, random text A
random text B
specific start of the second line, random text C
random text D
etc.
I want the output to look like:
specific start of the first line, random text A
specific start of the second line, random text C
I was thinking about reg-exp, but I'm quite unfamiliar with them. I thought maybe a line-by-line executed function could be better, and maybe even faster. The important thing is, I must retain the original line order.
file = open("input_file", "r")
outfile = open("out_file", "w")
# startswith() accepts a tuple of prefixes, not several separate arguments
specific_start = ("specific start pattern1", "specific start pattern2",
                  "specific start pattern3", "specific start pattern4")
for line in file:
    if not line.startswith(specific_start):
        continue
    else:
        outfile.write(line)
Use string's method startswith() to check if the beginning of the line is what you want.
This will write all lines from input.txt, beginning with "aaaa", to output.txt:
wanted = "aaaa"
with open("input.txt", "r") as f_in, open("output.txt", "w") as f_out:
    for line in f_in:
        if line.startswith(wanted):
            f_out.write(line)
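For several start patterns at once, startswith() also accepts a tuple of prefixes, so a single pass is enough and the original line order is kept. A small sketch, where keep_matching_lines is a made-up name and the prefixes are placeholders:

```python
def keep_matching_lines(lines, prefixes):
    """Yield, in their original order, the lines starting with any of the prefixes."""
    prefixes = tuple(prefixes)          # str.startswith needs a tuple, not a list
    for line in lines:
        if line.startswith(prefixes):
            yield line
```

With open files this could be used as f_out.writelines(keep_matching_lines(f_in, wanted_prefixes)), which still reads the input one line at a time.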
I've got a huge csv file (around 10GB of data) and I want to delete its header.
Searching this site, I found this solution:
with open("test.csv",'r') as f, open("updated_test.csv",'w') as f1:
    next(f) # skip header line
    for line in f:
        f1.write(line)
But this would imply creating a new csv file. Is there a way to delete just the header without looping over all the csv lines?
The point is this: you want to delete a line at the beginning of a file. Done in the straightforward way, this means shifting everything after the header to the front, which in turn means copying the whole file.
But this is of course far too costly when we are talking about 10GB files.
In your case I propose to read the first two lines, store their sizes, open the file for reading and writing without creating it (so no truncation takes place), write the second(!) line at the beginning of the file, and pad it with as many spaces as are necessary to overwrite the original first and second lines.
This way you overwrite the first two lines with one very long line which semantically contains only the data from the second line (the first data line) and syntactically just has some additional trailing spaces (which normally do no harm in CSV files).
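with open('a', 'r+') as f:  # 'r+' reads and writes without truncating
    headers = f.readline()
    firstData = f.readline()
    f.seek(0)
    # pad the first data line so it covers both original lines exactly
    firstData = firstData[:-1] + ' ' * len(headers) + '\n'
    f.write(firstData)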
My input, spaces displayed as dots here:
one.two.three.four.five
1.2.3.4.5
6.7.8.9.10
My output, spaces displayed as dots here:
1.2.3.4.5........................
6.7.8.9.10
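If the trailing padding is not acceptable and a copy cannot be avoided, the copy at least does not need a Python-level loop over every line. A sketch under the assumption that the header is a single physical line; drop_header and its chunk_size default are made up for illustration, while shutil.copyfileobj is the standard library call that copies in blocks.

```python
import shutil


def drop_header(src_path, dst_path, chunk_size=16 * 1024 * 1024):
    """Copy src_path to dst_path without its first line, in large binary blocks."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        src.readline()                              # consume the header line
        shutil.copyfileobj(src, dst, chunk_size)    # copy the rest block by block
```

This still writes a second file of nearly the same size, so it trades disk space for clean output.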
Using pandas with header=0:
import pandas as pd
df = pd.read_csv('yourfile.csv', sep='yoursep', header=0)
Using Python 2.7.1, I read in a file:
input = open(file, "rU")
tmp = input.readlines()
which looks like this:
>name -----meoidoad
>longname -lksowkdkfg
>nm --kdmknskoeoe---
>nmee dowdbnufignwwwwcds--
That is, each line has a short substring of whitespaces, but the length of this substring varies by line.
I would like to write a script that edits my tmp object such that when I write tmp to file, the result is
>name
-----meoidoad
>longname
-lksowkdkfg
>nm
--kdmknskoeoe---
>nmee
dowdbnufignwwwwcds--
I.e. I would like to break each line into two lines, at that substring of whitespaces (and get rid of the spaces in the process).
The starting position of the string after the whitespaces is always the same within a file, but may vary among a large batch of files I am working with. So, I need a solution that does not rely on positions.
I've seen many similar questions on here, with many well-liked answers that use short regex scripts to do so, so it is possible I am duplicating a previous question. However, none of what I've seen so far has worked for me.
import re
with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        # raw string so that \s reaches the regex engine unchanged
        outfile.write(re.sub(r'\s\s+', '\n', line))
If the file isn't huge (i.e. hundreds of MB), you can do this concisely with split() and join():
with open(file, 'rU') as f, open(outfilename, 'w') as o:
    o.write('\n'.join(f.read().split()))
I would also recommend against naming anything input, as that will mask the built-in.
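For files that are too big to read in one go, the same split() and join() idea can be applied one line at a time. A minimal sketch; the function name split_lines_on_whitespace is made up, and it assumes every whitespace run in a line should become a line break.

```python
def split_lines_on_whitespace(src_path, dst_path):
    """Write every whitespace-separated field of the input on its own line."""
    with open(src_path) as infile, open(dst_path, 'w') as outfile:
        for line in infile:
            fields = line.split()       # splits on any run of whitespace
            if fields:                  # blank lines produce no output
                outfile.write('\n'.join(fields) + '\n')
```

Unlike the one-shot version, this also ends the output with a newline.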
I have two text files (that are not equal in number of lines or size). I would like to compare each line of the shorter text file with every line of the longer text file. As it compares, if there are any duplicate strings, I would like to have those removed. Lastly, I would like to write the result to a new text file and print the contents.
Is there a simple script that can do this for me?
Any help would be much appreciated.
The text files are not very large. One has about 10 lines and the other has about 5. The code I have tried (that failed miserably) is below:
for line in file2:
    line1 = line
    for line in file1:
        requested3 = file('request2.txt','a')
        if fnmatch.fnmatch(line1,line):
            line2 = line.replace(line,"")
            requested3.write(line2)
        if not fnmatch.fnmatch(line1,line):
            requested3.write(line+'\n')
        requested3.close()
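with open(longfilename) as longfile, open(shortfilename) as shortfile, open(newfilename, 'w') as newfile:
    longlines = set(longfile)  # read the long file once, before the loop
    newfile.writelines(line for line in shortfile if line not in longlines)
It's as simple as that. This will copy lines from shortfile to newfile, without having to keep them all in memory, unless they also exist in longfile. The set has to be built once up front: calling set(longfile) inside the generator would exhaust the file on the first comparison and leave an empty set for every later line.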
If you're on Python 2.6 or older, you would need to nest the with statements:
with open(longfilename) as longfile:
    with open(shortfilename) as shortfile:
        with open(newfilename, 'w') as newfile:
If you're on Python 2.5, you need to either:
from __future__ import with_statement
at the very top of your file, or just use
longfile = open(longfilename)
etc. and close each file yourself.
If you need to manipulate the lines, an explicit for loop is fine; the important part is set(). Looking up an item in a set is fast, whereas looking up a line in a long list is slow.
longlines = set(line.strip_or_whatever() for line in longfile)
for line in shortfile:
    # apply the same manipulation to both sides before comparing
    if line.strip_or_whatever() not in longlines:
        newfile.write(line)
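As a concrete stand-in for the strip_or_whatever placeholder, here is a sketch that uses str.strip as the manipulation. The function name lines_not_in and the normalize parameter are made up; the point is only that the set is built once and the same normalization is applied to both files.

```python
def lines_not_in(short_lines, long_lines, normalize=str.strip):
    """Return the lines of short_lines whose normalized form is absent from long_lines."""
    seen = {normalize(line) for line in long_lines}     # built once, fast lookups
    return [line for line in short_lines if normalize(line) not in seen]
```

With open files, newfile.writelines(lines_not_in(shortfile, longfile)) would write the surviving lines in their original form.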
Assuming the files are both plain text, each string is on a new line delimited with \n newline characters:
small_file = open('file1.txt','r')
long_file = open('file2.txt','r')
output_file = open('output_file.txt','w')
try:
    small_lines = small_file.readlines()
    long_lines = long_file.readlines()
    small_lines_cleaned = [line.rstrip().lower() for line in small_lines]
    long_lines_cleaned = [line.rstrip().lower() for line in long_lines]
    for line in small_lines_cleaned:
        if line not in long_lines_cleaned:
            output_file.write(line + '\n')
finally:
    small_file.close()
    long_file.close()
    output_file.close()
Explanation:
Since you can't get 'with' statements working, we open the files first using regular open functions, then use a try...finally clause to close them at the end of the program.
We take the small file and the long file and first remove any trailing '\n' (newline) characters with .rstrip(), then make all the characters lower-case with .lower(). If you have two sentences identical in every aspect except that one has upper-case letters and the other doesn't, they won't match. Forcing them to lower case avoids that; if you prefer a case-sensitive comparison, remove the .lower() method.
We go line by line through small_lines_cleaned (for line in...) and see if each line is in the larger file.
We output each line if it is not in the longer file; we add the '\n' newline character so that each line will appear on a new line, insteadOfOneGiantLongSetOfStrings.
I'd use difflib, it makes it easy to do comparisons/diffs. There is a nice tutorial for it here. If you just wanted the lines that were unique to the shorter file:
from difflib import ndiff
short = open('short.txt').readlines()
long = open('long.txt').readlines()
with open('unique.txt', 'w') as f:
    f.write(''.join(x[2:] for x in ndiff(short, long) if x.startswith('-')))
Your code as it stands checks each line against the line in the other file. But that's not what you want. For each line in the first file, you need to check whether any line in the other file matches and then print it out if there are no matches.
The following code reads file two and checks file one against it. Anything that's in file one but not in file two will get printed and also written to a new text file.
If you wanted to do the opposite, you'd just get rid of the "not" in the if statement below. Then it would print anything that's in both file one and file two.
It works by putting the contents of the shorter file (file two) in a variable and then reading the longer file (file one) line by line. Each line is checked against the variable and then the line is either written or not written to the text file according to its presence in the variable.
fileOne = open("LONG FILE.ext","r")
fileTwo = open("SHORT FILE.ext","r")
fileThree = open("Results.txt","a+")
contents = fileTwo.read()
for line in fileOne:
    # substring test: the line counts as present if it occurs anywhere in contents
    if line not in contents:
        print(line)
        fileThree.write(line)
fileOne.close()
fileTwo.close()
fileThree.close()