Multiple string replacements on a 100mb file in python 2.6

I have a large 100 MB file on which I would like to perform about 5000 string replacements. What is the most efficient way of achieving this?
Is there no better way than reading the file line by line and performing the 5000 replacements on each line?
I also tried reading the whole file into a string with .read() and performing the 5000 replacements on that string, but this is even slower, since it makes 5000 copies of the whole file.
This script has to run on Windows using Python 2.6.
Thanks in advance.

Try the following, in this order, until you get one that is fast enough.
Read the file into a large string and do each replacement in turn, overwriting the same variable.
with open(..., 'r+') as f:  # 'r+' allows reading and writing; 'w' would truncate the file first
    s = f.read()
    for src, dest in replacements:
        s = s.replace(src, dest)
    f.seek(0)
    f.write(s)
    f.truncate()  # drop leftover bytes if the result is shorter
Memory map the file, and write a custom replacement function that does the replacements.

I suggest, instead of doing 5000 searches, do one search for 5000 items:
import re
replacements = {
    "Abc-2454": "Gb-43",
    "This": "that",
    "you": "me"
}
pat = re.compile('(' + '|'.join(re.escape(key) for key in replacements.iterkeys()) + ')')
repl = lambda match: replacements[match.group(0)]
You can now apply re.sub either to the entire file,
with open("input.txt") as inf:
    s = inf.read()
s = pat.sub(repl, s)
with open("result.txt", "w") as outf:
    outf.write(s)
or line-by-line (the with statements are nested because Python 2.6 does not accept several context managers in one with),
with open("input.txt") as inf:
    with open("result.txt", "w") as outf:
        outf.writelines(pat.sub(repl, line) for line in inf)
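One detail worth checking with the single-pattern approach: a regex alternation tries its alternatives from left to right, so if one key is a prefix of another (say "you" and "your"), the shorter key can win. The following is a minimal sketch, assuming the replacements are plain literal strings and that trying the longest keys first is the desired behavior. The name make_replacer and the sample mapping are made up for illustration.

```python
import re

def make_replacer(replacements):
    """Build a function that applies all literal replacements in one pass."""
    # Longest keys first, so "your" is tried before its prefix "you".
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: replacements[m.group(0)], text)

# Hypothetical sample data, not taken from the original question.
replace_all = make_replacer({"you": "me", "your": "my", "Abc-2454": "Gb-43"})
print(replace_all("Is Abc-2454 your file or do you share it?"))
```

Because the text is scanned only once, replaced text is never scanned again, so a mapping such as {"a": "b", "b": "a"} swaps the two letters instead of chaining the replacements.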

You should read in the text using open() and read(), and then use (compiled) regular expressions to do the string replacement. A short example:
import re
# read data
f = open("file.txt", "r")
txt = f.read()
f.close()
# list of patterns and what to replace them with
xs = [("foo","bar"), ("baz","foo")]
# do replacements
for (x,y) in xs:
    regexp = re.compile(x)
    txt = regexp.sub(y, txt)
# write back data
f = open("file.txt", "w")
f.write(txt)
f.close()

Related

Removing duplicate characters from a file - Python

I have a file as below,
this is Rdaaaa
thissss Is Sethaaa
hiii
I want to remove all the duplicate characters from this file.
I tried two approaches.
The first completely removes duplicate characters but does not seem to be efficient.
with open("test.txt", "r") as f1:
    with open("test1.txt", "w") as f2:
        #content = f1.readlines()
        char_set = set()
        while True:
            char = f1.read(1)
            if char not in char_set:
                char_set.add(char)
                f2.write(char)
            if not char:
                break
        print(char_set)
I also tried using a regex, following a Stack Overflow post:
import re
with open("test.txt", "r") as f1:
    with open("test1.txt", "w") as f2:
        content = f1.read()
        f2.write(re.sub(r'([a-z])\1+',r'\1',content))
But this turns thiish into thish, not this.
Any suggestions for code with improved efficiency?

For "medium" sized files that fit into memory, this approach is a bit faster and takes fewer lines. Load the whole file into memory, then create a dictionary from it whose keys are the individual characters in the file. This keeps the output characters in the order in which they were first seen (dicts preserve insertion order in Python 3.7+).
This ran in about 100 ms for a 2 MB file with 11501 distinct characters. Your use case may make another approach better.
# replace in_file and out_file with actual paths or file names
with open(in_file, "r") as f1, open(out_file, "w") as f2:
    txt = f1.read()
    ordered_set = ''.join(dict.fromkeys(txt).keys())
    f2.write(ordered_set)

If you have a big file and you don't want to load it into memory, you can read it line by line instead of character by character, which is much faster:
file_input = open("old_file.txt", "r")
file_output = open("new_file.txt", "w")
memory = set()
while True:
    line = file_input.readline()
    if not line:
        break
    new_line = ""
    for char in line:
        if char == " ":
            # spaces are always kept
            new_line += char
            continue
        if char not in memory:
            memory.add(char)
            new_line += char
    file_output.write(new_line)
file_input.close()
file_output.close()
But if the file is small, you can read it all at once and apply the same logic.

It looks like you want to delete letters that appear repeatedly in succession. Try using itertools.groupby:
>>> from itertools import groupby
>>> from operator import itemgetter
>>> s = '''this is Rdaaaa
... thissss Is Sethaaa
... hiii'''
>>> print(''.join(map(itemgetter(0), groupby(s))))
this is Rda
this Is Setha
hi

Try this:
import re
# read the file
with open('file.txt', 'r') as f:
    data = f.read()
# collapse each run of a repeated character into a single character
data = re.sub(r'(.)\1+', r'\1', data)
# write the file
with open('file.txt', 'w') as f:
    f.write(data)
The output is:
this is Rda
this Is Setha
hi
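
The answers above read the question in two different ways: collapsing runs of the same consecutive character, or keeping only the first occurrence of each character in the whole text. A small sketch that puts the two side by side may help to pick the right one. The function names are made up for illustration, and the second function assumes Python 3.7+ for ordered dicts.

```python
from itertools import groupby

def collapse_runs(text):
    """Keep one character from each run of identical consecutive characters."""
    return "".join(char for char, _run in groupby(text))

def first_occurrences(text):
    """Keep only the first occurrence of each character, in original order."""
    # dict preserves insertion order in Python 3.7+.
    return "".join(dict.fromkeys(text))

sample = "thissss Is Sethaaa"
print(collapse_runs(sample))      # this Is Setha
print(first_occurrences(sample))  # this ISea
```

Note that first_occurrences also drops repeated spaces and newlines, which is usually not wanted for multi-line text.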

Python program for writing length of list to file

I have a file list.txt that contains a single list only e.g.
[asd,ask,asp,asq]
The list might be a very long one. I want to create a Python program len.py that reads list.txt and writes the length of the list it contains to the file num.txt. Something like the following:
fin = open("list.txt", "rt")
fout = open("num.txt", "wt")
for list in fin:
    fout.write(len(list))
fin.close()
fout.close()
However this does not work. Can someone point out what needs to be changed? Many thanks.

Use:
with open("list.txt") as f1, open("num.txt", "w") as f2:
    for line in f1:
        line = line.strip('\n[]')
        f2.write(str(len(line.split(','))) + '\n')
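
One edge case in the split-based count: an empty list "[]" is reported as 1, because splitting an empty string on "," returns a list with one empty string. A minimal sketch that guards against this, assuming the items themselves never contain commas (count_items is a hypothetical helper name):

```python
def count_items(line):
    """Count comma-separated items in a line such as "[asd,ask,asp,asq]"."""
    inner = line.strip().strip("[]").strip()
    if not inner:
        return 0  # "[]" or a blank line holds no items
    return len(inner.split(","))

print(count_items("[asd,ask,asp,asq]\n"))  # 4
print(count_items("[]\n"))                 # 0
```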

with open("list.txt") as fin, open("num.txt", "w") as fout:
    input_data = fin.readline()
    # check if there was any info read from input file
    if input_data:
        # split the string into one chunk per (nested) list
        strs = input_data.replace('[','').split('],')
        lists = [map(int, s.replace(']','').split(',')) for s in strs]
        print(len(lists))
        fout.write(str(len(lists)))
I updated the code to use the with statement from another answer. I also used some code from this answer (How can I convert this string to list of lists?) to (more?) correctly count nested lists.

When Python reads a file, it treats the content as a string, so the first step is to convert that string into the appropriate type. A plain type cast cannot do this.
You can use the ast module from the standard library to parse the data. Note that ast.literal_eval only accepts valid Python literals, so the items must be quoted strings or numbers, e.g. ['asd', 'ask'].
import ast
fin = open("list.txt", "r")
fout = open("num.txt", "w")
for line in fin.readlines():
    # write() needs a string, so convert the length with str()
    fout.write(str(len(ast.literal_eval(line.strip()))))
fin.close()
fout.close()

Python regex from txt file

I have a text file, that has data.
PAS_BEGIN_3600000
CMD_VERS=2
CMD_TRNS=O
CMD_REINIT=
CMD_OLIVIER=
I want to extract data from that file, where nothing is after the equal sign.
So in my new text file, I want to get
CMD_REINIT
CMD_OLIVIER
How do I do this?
My code looks like this right now.
import os, os.path
DIR_DAT = "dat"
DIR_OUTPUT = "output"
print("Psst go check in the output folder ;)")
for roots, dir, files in os.walk(DIR_DAT):
    for filename in files:
        filename_output = "/" + os.path.splitext(filename)[0]
        with open(DIR_DAT + "/" + filename) as infile, open(DIR_OUTPUT + "/bonjour.txt", "w") as outfile:
            for line in infile:
                if not line.strip().split("=")[-1]:
                    outfile.write(line)
I want to collect all the data in a single file. It doesn't work. Can anyone help me?
The third step is to go through that new file and keep only unique values. As four files are appended into a single one, some data might appear four, three, or two times.
And I need to keep, in a new file that I will call output.txt, only the lines that are common to all the files.

You can use regex:
import re
data = """PAS_BEGIN_3600000
CMD_VERS=2
CMD_TRNS=O
CMD_REINIT=
CMD_OLIVIER="""
found = re.findall(r"^\s*(.*)=\s*$",data,re.M)
print( found )
Output:
['CMD_REINIT', 'CMD_OLIVIER']
The expression looks for
^\s* line start + optional whitespace
(.*)= anything before a = which is captured as a group
\s*$ followed by optional whitespace and line end
using the re.M (multiline) flag.
Read your file's text like so:
with open("yourfile.txt","r") as f:
    data = f.read()
Write your new file like so:
with open("newfile.txt","w") as f:
    f.write("\n".join(found))
You can use http://www.regex101.com to test text against regex patterns; make sure to switch to its Python mode.

I suggest the following short solution using a generator expression:
with open('file.txt', 'r') as f, open('newfile.txt', 'w') as newf:
    for x in (line.strip()[:-1] for line in f if line.strip().endswith("=")):
        newf.write(f'{x}\n')

Try this pattern: \w+(?==$) (used with the re.M flag, so that $ matches at the end of each line).

Using a simple iteration.
Example:
with open(filename) as infile, open(filename2, "w") as outfile:
    for line in infile:                        # iterate over each line
        if not line.strip().split("=")[-1]:    # nothing after the last "="
            print(line.strip().strip("="))
            outfile.write(line)                # write to new file
Output:
CMD_REINIT
CMD_OLIVIER
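
The question also asks for a last step that keeps only the entries common to all files, which none of the answers cover. One possible sketch uses a set intersection. The helper names and the in-memory sample data are hypothetical; with real files, each list would be replaced by an open file object.

```python
def empty_keys(lines):
    """Return the set of names on lines that have nothing after '='."""
    keys = set()
    for line in lines:
        name, sep, value = line.strip().partition("=")
        if sep and not value:
            keys.add(name)
    return keys

def common_empty_keys(files):
    """Intersect the empty-valued names found in each iterable of lines."""
    key_sets = [empty_keys(lines) for lines in files]
    return sorted(set.intersection(*key_sets)) if key_sets else []

# Hypothetical in-memory stand-ins for the contents of two .dat files.
file_a = ["PAS_BEGIN_3600000", "CMD_VERS=2", "CMD_REINIT=", "CMD_OLIVIER="]
file_b = ["CMD_VERS=", "CMD_REINIT=", "CMD_TRNS=O"]
print(common_empty_keys([file_a, file_b]))  # ['CMD_REINIT']
```

Collecting one set per file avoids appending everything into an intermediate file and counting duplicates afterwards.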

Send keylogger log files to e-mail [duplicate]

I have a text file that looks like:
ABC
DEF
How can I read the file into a single-line string without newlines, in this case creating a string 'ABCDEF'?
For reading the file into a list of lines, but removing the trailing newline character from each line, see How to read a file without newlines?.

You could use:
with open('data.txt', 'r') as file:
    data = file.read().replace('\n', '')
Or, if the file content is guaranteed to be a single line:
with open('data.txt', 'r') as file:
    data = file.read().rstrip()

In Python 3.5 or later, using pathlib you can copy text file contents into a variable and close the file in one line:
from pathlib import Path
txt = Path('data.txt').read_text()
and then you can use str.replace to remove the newlines:
txt = txt.replace('\n', '')

You can read from a file in one line:
str = open('very_Important.txt', 'r').read()
Please note that this does not close the file explicitly.
CPython will close the file when it exits as part of the garbage collection.
But other python implementations won't. To write portable code, it is better to use with or close the file explicitly. Short is not always better. See https://stackoverflow.com/a/7396043/362951

To join all lines into a string and remove new lines, I normally use :
with open('t.txt') as f:
    s = " ".join([l.rstrip("\n") for l in f])

with open("data.txt") as myfile:
    data="".join(line.rstrip() for line in myfile)
join() will join a list of strings, and rstrip() with no arguments will trim whitespace, including newlines, from the end of strings.

This can be done using the read() method :
text_as_string = open('Your_Text_File.txt', 'r').read()
Or as the default mode itself is 'r' (read) so simply use,
text_as_string = open('Your_Text_File.txt').read()

I'm surprised nobody mentioned splitlines() yet.
with open("data.txt", "r") as myfile:
    data = myfile.read().splitlines()
Variable data is now a list that looks like this when printed:
['LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN', 'GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE']
Note there are no newlines (\n).
At that point, it sounds like you want to print the lines back to the console, which you can achieve with a for loop:
for line in data:
    print(line)

It's hard to tell exactly what you're after, but something like this should get you started:
with open("data.txt", "r") as myfile:
    data = ' '.join([line.replace('\n', '') for line in myfile.readlines()])

I have fiddled around with this for a while and prefer to use read in combination with rstrip. Without rstrip("\n"), the string keeps the file's trailing newline, which in most cases is not very useful.
with open("myfile.txt") as f:
    file_content = f.read().rstrip("\n")
    print(file_content)

Here are four options to choose from:
with open("my_text_file.txt", "r") as file:
    data = file.read().replace("\n", "")
or
with open("my_text_file.txt", "r") as file:
    data = "".join(file.read().split("\n"))
or
with open("my_text_file.txt", "r") as file:
    data = "".join(file.read().splitlines())
or
with open("my_text_file.txt", "r") as file:
    # strip the newline from each line, otherwise join() keeps them
    data = "".join([line.rstrip("\n") for line in file])

You can compress this into two lines of code:
content = open('filepath','r').read().replace('\n',' ')
print(content)
If your file reads:
hello how are you?
who are you?
blank blank
the Python output is:
hello how are you? who are you? blank blank

You can also strip each line and concatenate into a final string.
myfile = open("data.txt","r")
data = ""
lines = myfile.readlines()
for line in lines:
    data = data + line.strip()
myfile.close()
This would also work out just fine.

This is a one line, copy-pasteable solution that also closes the file object:
_ = open('data.txt', 'r'); data = _.read(); _.close()

f = open('data.txt','r')
string = ""
while 1:
    line = f.readline()
    if not line:break
    string += line
f.close()
print(string)

Python 3: search for "list comprehension" if the square bracket syntax is new to you.
with open('data.txt') as f:
    lines = [ line.strip('\n') for line in list(f) ]

One-liner:
List: "".join([line.rstrip('\n') for line in open('file.txt')])
Generator: "".join((line.rstrip('\n') for line in open('file.txt')))
A list is faster than a generator but heavier on memory; a generator is slower but lighter on memory, like iterating over the lines. In the case of "".join(), I think both should work well. Remove the "".join() call to get the list or the generator itself.
Note: close() / closing of file descriptor probably not needed

Have you tried this?
x = "yourfilename.txt"
y = open(x, 'r').read()
print(y)

To remove line breaks using Python you can use replace function of a string.
This example removes all 3 types of line breaks:
my_string = open('lala.json').read()
print(my_string)
my_string = my_string.replace("\r","").replace("\n","")
print(my_string)
Example file is:
{
"lala": "lulu",
"foo": "bar"
}
You can try it in this repl:
https://repl.it/repls/AnnualJointHardware
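
As an alternative to chaining replace calls, str.splitlines() can do the splitting. It treats "\n", "\r\n" and "\r" as line boundaries (along with a few rarer separators such as form feed), so joining its result drops all of them. A minimal sketch follows; join_lines is a hypothetical helper name and the sample string stands in for file contents.

```python
def join_lines(text, sep=""):
    """Join the lines of text, dropping LF, CRLF and CR line endings."""
    return sep.join(text.splitlines())

# Sample text mixing the three line-ending styles.
sample = "ABC\r\nDEF\rGHI\n"
print(join_lines(sample))           # ABCDEFGHI
print(join_lines(sample, sep=" "))  # ABC DEF GHI
```

Passing a separator keeps the words on adjacent lines from running together.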

I don't feel that anyone addressed the [ ] part of your question. When you read each line into your variable, because there were multiple lines before you replaced the \n with '' you ended up creating a list. If you have a variable of x and print it out just by
x
or print(x)
or str(x)
You will see the entire list with the brackets. If you call each element of the (array of sorts)
x[0]
then it omits the brackets. If you use the str() function you will see just the data and not the '' either.
str(x[0])

Maybe you could try this? I use this in my programs.
Data= open ('data.txt', 'r')
data = Data.readlines()
for i in range(len(data)):
    data[i] = data[i].strip()+ ' '
data = ''.join(data).strip()

Regular expression works too:
import re
with open("depression.txt") as f:
    l = re.split(' ', re.sub('\n',' ', f.read()))[:-1]
print (l)
['I', 'feel', 'empty', 'and', 'dead', 'inside']

with open('data.txt', 'r') as file:
    data = [line.strip('\n') for line in file.readlines()]
data = ''.join(data)

from pathlib import Path
line_lst = Path("to/the/file.txt").read_text().splitlines()
This is the best way to get all the lines of a file; the '\n' are already stripped by splitlines() (which recognizes Windows, Mac and Unix line endings).
But if you nonetheless want to strip each line:
line_lst = [line.strip() for line in Path("to/the/file.txt").read_text().splitlines()]
strip() is just a useful example; you can process each line as you please.
And if, in the end, you just want the concatenated text:
txt = ''.join(Path("to/the/file.txt").read_text().splitlines())

This works:
Change your file to:
LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE
Then:
file = open("file.txt")
line = file.read()
words = line.split()
This creates a list named words that equals:
['LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN', 'GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE']
That got rid of the "\n". To answer the part about the brackets getting in your way, just do this:
for word in words: # Assuming words is the list above
    print(word) # Prints each word in the file on a different line
Or:
print(words[0] + ",", words[1]) # The "+" joins the strings with no space in between;
# the comma between the arguments makes print insert a space
This returns:
LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN, GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE

with open(player_name, 'r') as myfile:
    data=myfile.readline()
    words=data.split(" ")
    word=words[0]
This code reads the first line and then uses split to store the space-separated words of that line in a list.
Then you can easily access any word, or even store it in a string.
You can also do the same thing using a for loop.

file = open("myfile.txt", "r")
lines = file.readlines()
text = '' # string that accumulates the result
for i in range(len(lines)):
    text += lines[i].rstrip('\n') + ' '
print(text)

Try the following:
with open('data.txt', 'r') as myfile:
    data = myfile.read()
    sentences = data.split('\\n')
    for sentence in sentences:
        print(sentence)
Caution: It does not remove the \n. It is just for viewing the text as if there were no \n
