I am running Python 2.7.
I have three text files: data.txt, find.txt, and replace.txt. Now, find.txt contains several lines that I want to search for in data.txt and replace that section with the content in replace.txt. Here is a simple example:
data.txt
pumpkin
apple
banana
cherry
himalaya
skeleton
apple
banana
cherry
watermelon
fruit
find.txt
apple
banana
cherry
replace.txt
1
2
3
So, in the above example, I want to search for all occurrences of apple, banana, and cherry in the data and replace those lines with 1, 2, 3.
I am having some trouble with the right approach to this as my data.txt is about 1MB so I want to be as efficient as possible. One dumb way is to concatenate everything into one long string and use replace, and then output to a new text file so all the line breaks will be restored.
import re
data = open("data.txt", 'r')
find = open("find.txt", 'r')
replace = open("replace.txt", 'r')
data_str = ""
find_str = ""
replace_str = ""
for line in data: # concatenate it into one long string
    data_str += line
for line in find: # concatenate it into one long string
    find_str += line
for line in replace:
    replace_str += line
new_data = data_str.replace(find_str, replace_str) # pass the strings, not the file objects
new_file = open("new_data.txt", "w")
new_file.write(new_data)
But this seems so convoluted and inefficient for a large data file like mine. Also, the replace function appears to be deprecated so that's not good.
Another way is to step through the lines and keep a track of which line you found a match.
Something like this:
location = 0
LOOP1:
for find_line in find:
    for i, data_line in enumerate(data).startingAtLine(location):
        if find_line == data_line:
            location = i # found possibility
            for idx in range(NUMBER_LINES_IN_FIND):
                if find_line[idx] != data_line[idx+location] # compare line by line
                    #if the subsequent lines don't match, then go back and search again
                    goto LOOP1
Not fully formed code, I know. I don't even know if it's possible to search through a file from a certain line on or between certain lines but again, I'm just a bit confused in the logic of it all. What is the best way to do this?
Thanks!
If the file is large, you want to read and write one line at a time, so the whole thing isn't loaded into memory at once.
# create a dict of find keys and replace values
findlines = open('find.txt').read().split('\n')
replacelines = open('replace.txt').read().split('\n')
find_replace = dict(zip(findlines, replacelines))
with open('data.txt') as data:
    with open('new_data.txt', 'w') as new_data:
        for line in data:
            for key in find_replace:
                if key in line:
                    line = line.replace(key, find_replace[key])
            new_data.write(line)
Edit: I changed the code to use read().split('\n') instead of readlines() so that \n isn't included in the find and replace strings.
A couple of things here:
replace is not deprecated; see this discussion for details:
Python 2.7: replace method of string object deprecated
If you are worried about reading data.txt into memory all at once, you should be able to just iterate over data.txt one line at a time:
data = open("data.txt", 'r')
for line in data:
    # fix the line
So all that's left is coming up with a whole bunch of find/replace pairs and fixing each line. Check out the zip function for a handy way to do that:
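# read().splitlines() keeps the trailing newline out of the tokens
find = open("find.txt", 'r').read().splitlines()
replace = open("replace.txt", 'r').read().splitlines()
data = open("data.txt", 'r')
new_data = open("new_data.txt", 'w')
for line in data:
    for find_token, replace_token in zip(find, replace):
        line = line.replace(find_token, replace_token)
    new_data.write(line) # line still ends with its original newline
data.close()
new_data.close()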
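Both snippets above replace each find line independently. The example in the question reads more like "replace this block of consecutive lines with that block". Here is a minimal sketch of that variant, assuming all three files fit comfortably in memory as lists of lines (a 1 MB file does). The name replace_block is a hypothetical helper, not something from the answers:

```python
def replace_block(data_lines, find_lines, replace_lines):
    """Return a new list in which every run of lines equal to
    find_lines is replaced by replace_lines."""
    if not find_lines:
        return list(data_lines)
    result = []
    n = len(find_lines)
    i = 0
    while i < len(data_lines):
        if data_lines[i:i + n] == find_lines:
            # the whole block matches: emit the replacement and skip past it
            result.extend(replace_lines)
            i += n
        else:
            result.append(data_lines[i])
            i += 1
    return result


# sample data taken from the question
data = ["pumpkin", "apple", "banana", "cherry", "himalaya", "skeleton",
        "apple", "banana", "cherry", "watermelon", "fruit"]
new_data = replace_block(data, ["apple", "banana", "cherry"], ["1", "2", "3"])
print(new_data)
```

With real files, each list could come from open(...).read().splitlines(), and the result could be written back with '\n'.join(new_data).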
Related
I have two text files with numbers that I want to do some very easy calculations on (for now). I thought I would go with Python. I have two file readers for the two text files:
with open('one.txt', 'r') as one:
    one_txt = one.readline()
    print(one_txt)
with open('two.txt', 'r') as two:
    two_txt = two.readline()
    print(two_txt)
Now to the fun (and for me hard) part. I would like to loop through all the numbers in the second text file and then subtract it with the second number in the first text file.
I have done this (extending the code above):
with open('two.txt') as two_txt:
    for line in two_txt:
        print line;
I don't know how to proceed now, because I think that the second text file would need to be converted to a string in order to do some parsing so I get the numbers I want. The text file (two.txt) looks like this:
Start,End
2432009028,2432009184,
2432065385,2432066027,
2432115011,2432115211,
2432165329,2432165433,
2432216134,2432216289,
2432266528,2432266667,
I want to loop through this, ignore the Start,End header (first line), and on each iteration pick only the first value before the comma. The result would be:
2432009028
2432065385
2432115011
2432165329
2432216134
2432266528
Which I would then subtract with the second value in one.txt (it contains numbers only and no strings whatsoever) and print the result.
There are many ways to do string operations and I feel lost, for instance I don't know if the methods to read everything to memory are good or not.
Any examples on how to solve this problem would be very appreciated (I am open to different solutions)!
Edit: Forgot to point out, one.txt has values without any comma, like this:
102582
205335
350365
133565
Something like this:
with open('one.txt', 'r') as one, open('two.txt', 'r') as two:
    next(two) # skip first line in two.txt
    for line_one, line_two in zip(one, two):
        one_a = int(line_one.strip()) # one.txt holds one number per line
        two_b = int(line_two.split(",")[0]) # first value before the comma
        print(two_b - one_a)
Try this:
onearray = []
file = open("one.txt", "r")
for line in file:
    onearray.append(int(line.replace("\n", "")))
file.close()
twoarray = []
file = open("two.txt", "r")
for line in file:
    if line != "Start,End\n":
        twoarray.append(int(line.split(",")[0]))
file.close()
for i in range(0, len(onearray)):
    print(twoarray[i] - onearray[i])
It should do the job!
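The same idea can be checked without touching any files by working on lists of lines. This is only a sketch. The function name first_column_minus is made up for illustration, and the direction of the subtraction (value from two.txt minus value from one.txt) is an assumption based on the answer above:

```python
def first_column_minus(two_lines, one_lines):
    """Subtract each number in one_lines from the first comma-separated
    value of the matching line in two_lines (header line skipped)."""
    two_iter = iter(two_lines)
    next(two_iter)  # skip the "Start,End" header
    results = []
    for two_line, one_line in zip(two_iter, one_lines):
        start = int(two_line.split(",")[0])  # value before the first comma
        results.append(start - int(one_line.strip()))
    return results


# a few lines copied from the question's sample files
two = ["Start,End\n", "2432009028,2432009184,\n", "2432065385,2432066027,\n"]
one = ["102582\n", "205335\n"]
differences = first_column_minus(two, one)
print(differences)
```

Because an open file is itself an iterable of lines, the two file objects could be passed in directly instead of the lists.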
In a file I have the names of planets:
sun moon jupiter saturn uranus neptune venus
I would like to say "replace saturn with sun". I have tried to write it as a list. I've tried different modes (write, append etc.)
I think I am struggling to understand the concept of iteration, especially when it comes to iterating over a list, dict, or str in file. I know it can be done using csv or json or even pickle module. But my objective is to get the grasp of iteration using for...loop to modify a txt file. And I want to do that using .txt file only.
with open('planets.txt', 'r+') as myfile:
    for line in myfile.readlines():
        if 'saturn' in line:
            a = line.replace('saturn', 'sun')
            myfile.write(str(a))
        else:
            print(line.strip())
Try this. Keep in mind, though, that the string replace method also rewrites substrings (for example, testsaturntest becomes testsuntest); if that matters, you should use a regex instead:
In [1]: cat planets.txt
saturn
In [2]: s = open("planets.txt").read()
In [3]: s = s.replace('saturn', 'sun')
In [4]: f = open("planets.txt", 'w')
In [5]: f.write(s)
In [6]: f.close()
In [7]: cat planets.txt
sun
This replaces the data in the file with the replacement you want and prints the values out:
with open('planets.txt', 'r+') as myfile:
    lines = myfile.readlines()
modified_lines = map(lambda line: line.replace('saturn', 'sun'), lines)
with open('planets.txt', 'w') as f:
    for line in modified_lines:
        f.write(line)
        print(line.strip())
Replacing lines in the file in place is quite tricky, so instead I read the file, replaced the lines, and wrote them back to the file.
If you just want to replace the word in the file, you can do it like this:
import re
lines = open('planets.txt', 'r').readlines()
newlines = [re.sub(r'\bsaturn\b', 'sun', l) for l in lines]
open('planets.txt', 'w').writelines(newlines)
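To see why the \b word boundaries matter, here is a small comparison on an in-memory string rather than a file. The sample text is made up for illustration:

```python
import re

text = "saturn testsaturntest\n"

# str.replace rewrites every occurrence, including ones inside longer words
plain = text.replace("saturn", "sun")

# \b matches only at word boundaries, so the embedded "saturn" is left alone
whole_word = re.sub(r"\bsaturn\b", "sun", text)

print(plain)
print(whole_word)
```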
f = open("planets.txt","r+")
lines = f.readlines() #Read all lines
f.seek(0, 0) # Go to first char position
for line in lines: # get a single line
    f.write(line.replace("saturn", "sun")) #replace and write
f.truncate() # drop leftover bytes, since "sun" is shorter than "saturn"
f.close()
I think it's a clear guide :) You can find everything for this.
I have not tested your code, but the issue with r+ is that you need to keep track of where you are in the file, so that you can reset the file position and replace the current line instead of writing the replacement afterwards. I suggest creating a variable to keep track of where you are in the file so that you can call myfile.seek().
I'm trying to parse through a file with the structure:
0 rs41362547 MT 10044
1 rs28358280 MT 10550
...
and so forth, where I want the second thing in each line to be put into an array. I know it should be pretty easy, but after a lot of searching, I'm still lost. I'm really new to Python; what would be the script to do this?
Thanks!
You can split the lines using str.split:
with open('file.txt') as infile:
    result = []
    for line in infile: #loop through the lines
        data = line.split(None, 2)[1] #split, get the second column
        result.append(data) #append it to our results
        print data #Just confirming
This will work:
with open('/path/to/file') as myfile: # Open the file
    data = [] # Make a list to hold the data
    for line in myfile: # Loop through the lines in the file
        data.append(line.split(None, 2)[1]) # Get the data and add it to the list
print (data) # Print the finished list
The important parts here are:
str.split, which breaks up the lines based on whitespace.
The with-statement, which auto-closes the file for you when done.
Note that you could also use a list comprehension:
with open('/path/to/file') as myfile:
    data = [line.split(None, 2)[1] for line in myfile]
print (data)
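To make it concrete what split(None, 2) returns, here is a small sketch run on the two sample lines from the question (held in a list instead of a file):

```python
lines = ["0 rs41362547 MT 10044\n", "1 rs28358280 MT 10550\n"]

# sep=None splits on any run of whitespace; maxsplit=2 stops after two
# splits, so the rest of the line stays together in the third element
first_split = lines[0].split(None, 2)

# index 1 is the second column
ids = [line.split(None, 2)[1] for line in lines]

print(first_split)
print(ids)
```

Limiting the number of splits is only a small optimization. A plain line.split()[1] would pick out the same column.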
I am attempting to parse a space delimited text file in python 2.7.5 which looks kind of like:
variable description useless data
a1 asdfsdf 2342354
Sometimes it goes into further detail about the
variable/description here
a2 asdsfda 32123
EDIT: Sorry about the spaces added at the beginning, I did not see them.
I want to be able to split the text file into an array with variable and description in 2 separate columns, and cut all the useless data and skip any lines that do not start with a string. The way I have set up my code to start is:
import os
import pandas
import numpy
os.chdir(r'C:\folderwithfiles') # raw string, so \f is not read as a form feed
f = open('Myfile.txt', 'r')
lines = f.readlines()
for line in lines:
    if not line.strip():
        continue
    else:
        print(line)
print(lines)
As of right now, this code skips most of the descriptive lines between variable lines; however, some still pop up in the parsing. If I could get any help with either troubleshooting my line skips or getting started on the column-forming part, that would be great! I also do not have a lot of experience in Python. Thanks!
EDIT: A part of the file before code
CASEID (id) Case Identification 1 15 AN
MIDX (id) Index to Birth History 16 1 No
1:6
After:
CASEID (id) Case Identification 1 15 AN
MIDX (id) Index to Birth History 16 1 No
1:6
You want to filter out lines that start with spaces, and split all other lines to get the first two columns.
Translating those two rules into code:
with open('Myfile.txt') as f:
    for line in f:
        if not line.startswith(' '):
            variable, description, _ = line.split(None, 2)
            print(variable, description)
That's all there is to it.
Or, translating even more directly:
with open('Myfile.txt') as f:
    non_descriptions = filter(lambda line: not line.startswith(' '), f)
    values = (line.split(None, 2)[:2] for line in non_descriptions)
Now values is an iterator over [variable, description] pairs (consume it inside the with block, while the file is still open). And it's nice and declarative. The first line means "filter out lines that start with space". The second means "split each line to get the first two columns". (You could write the first as a genexpr instead of filter, or the second as map instead of a genexpr, but I think this is the closest to the English description.)
Assuming no spaces in your variables or descriptions, this will work:
with open('path/to/file') as infile:
    answer = []
    for line in infile:
        if not line.strip():
            continue
        if line.startswith(' '): # skipping descriptions
            continue
        splits = line.split()
        var, desc = splits[:2]
        answer.append([var, desc])
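The same filtering rules can be tried on an in-memory sample. This is a sketch under the assumption stated in the question that continuation lines start with spaces. The helper name parse_variables and the exact sample lines are made up for illustration:

```python
def parse_variables(lines):
    """Keep [variable, description] from lines that do not start with a
    space; skip blank lines and indented continuation lines."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith(' '):
            continue
        parts = line.split()
        if len(parts) >= 2:  # guard against lines with a single field
            rows.append(parts[:2])
    return rows


sample = [
    "a1 asdfsdf 2342354\n",
    "    Sometimes it goes into further detail about the\n",
    "    variable/description here\n",
    "\n",
    "a2 asdsfda 32123\n",
]
rows = parse_variables(sample)
print(rows)
```

The len(parts) check is an addition to the answer's logic. It avoids an unpacking error on a line that has only one field.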
If you are using pandas try this:
from pandas import read_csv
data = read_csv('file.txt', error_bad_lines=False).drop(['useless data'], axis=1)
If your file is fixed-width (as opposed to comma-separated-values) then use pandas.read_fwf
I'm somewhat new to Python. I'm trying to sort a list of strings and integers. The list contains some symbols that need to be filtered out (i.e. ro!ad should end up as road). Also, they are all on one line separated by a space. So I need to use 2 arguments: one for the input file and one for the output file. The output should be sorted with numbers first and then the words without the special characters, each on a different line. I've been looking at loads of list functions but am having some trouble putting this together as I've never had to do anything like this. Any takers?
So far I have the basic stuff
#!/usr/bin/python
import sys
try:
    infilename = sys.argv[1]
    #outfilename = sys.argv[2]
except:
    print "Usage: ",sys.argv[0], "infile outfile"; sys.exit(1)
ifile = open(infilename, 'r')
#ofile = open(outfilename, 'w')
data = ifile.readlines()
r = sorted(data, key=lambda item: (int(item.partition(' ')[0])
                                   if item[0].isdigit() else float('inf'), item))
ifile.close()
print '\n'.join(r)
#ofile.writelines(r)
#ofile.close()
The output shows exactly what was in the file but exactly as the file is written and not sorted at all. The goal is to take a file (arg1.txt) and sort it and make a new file (arg2.txt) which will be cmd line variables. I used print in this case to speed up the editing but need to have it write to a file. That's why the output file areas are commented but feel free to tell me I'm stupid if I screwed that up, too! Thanks for any help!
When you have an issue like this, it's usually a good idea to check your data at various points throughout the program to make sure it looks the way you want it to. The issue here seems to be in the way you're reading in the file.
data = ifile.readlines()
is going to read in the entire file as a list of lines. But since all the entries you want to sort are on one line, this list will only have one entry. When you try to sort the list, you're passing a list of length 1, which is going to just return the same list regardless of what your key function is. Try changing the line to
data = ifile.readlines()[0].split()
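You may not even need the key function any more, since digits are placed before letters by default. Note, however, that the entries are strings and are compared character by character, so "10" sorts before "9" unless you sort the numbers by their integer value. I don't see anything in your code to remove special characters, though.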
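A quick way to see what the default ordering does with numeric strings (the sample values are made up):

```python
items = ["10", "9", "2", "road"]

# default sort compares strings character by character
default_order = sorted(items)

# sorting only the numeric entries by their integer value
numeric_order = sorted((s for s in items if s.isdigit()), key=int)

print(default_order)
print(numeric_order)
```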
Since they are all on the same line, you don't really need readlines:
with open('some.txt') as f:
    data = f.read() #now data = "item 1 item2 etc..."
You can use re to filter out unwanted characters:
import re
data = "ro!ad"
fixed_data = re.sub("[!?#$]","",data)
partition may be overkill:
data = "hello 23frank sam wilbur"
my_list = data.split() # ["hello","23frank","sam","wilbur"]
print sorted(my_list)
However, you will need to do more to force the numbers to sort numerically, maybe something like:
numbers = [x for x in my_list if x[0].isdigit()]
strings = [x for x in my_list if not x[0].isdigit()]
sorted_list = sorted(numbers,key=lambda x:int(re.sub("[^0-9]","",x))) + sorted(strings)
Also, they are all on one line separated by a space.
So your file contains a single line?
data = ifile.readlines()
This makes data into a list of the lines in your file. All 1 of them.
r = sorted(...)
This makes r the sorted version of that list.
To get the words from the line, you can .read() the entire file as a single string, and .split() it (by default, it splits on whitespace).
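Putting the pieces from the answers together, a rough sketch might look like the following. It assumes that "special characters" means anything other than ASCII letters and digits, and that a token counts as a number only if it consists entirely of digits. Both are guesses about the input, and clean_and_sort is a made-up name:

```python
import re


def clean_and_sort(line):
    """Split one line on whitespace, strip non-alphanumeric characters,
    and return the numbers (sorted by value) followed by the words."""
    tokens = [re.sub(r"[^A-Za-z0-9]", "", tok) for tok in line.split()]
    tokens = [tok for tok in tokens if tok]  # drop tokens that became empty
    numbers = sorted((t for t in tokens if t.isdigit()), key=int)
    words = sorted(t for t in tokens if not t.isdigit())
    return numbers + words


result = clean_and_sort("ro!ad 10 app#le 9 zebra 2\n")
print("\n".join(result))
```

In the script from the question, the line would come from ifile.read(), and the joined result could be written with ofile.write instead of being printed.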