I have to parse a large (2GB) log file using regular expressions in Python. A regular expression matches the lines in the log file that I am interested in; the file can also contain unwanted data.
Here is a sample from the file:
"#DEBUG:: BFM [L4] 5.4401e+08ps MSG DIR:TX SCB_CB TYPE:DATA_REQ CPortID:'h8 SIZE:'d20 NumSeg:'h0001 Msg_Id:'h00000000"
My regular expression is ".DEBUG.*MSG."
First I split each matching line on whitespace, then the "field:value" pairs are inserted into an sqlite3 database; but for a file this large, parsing takes around 10 to 15 minutes.
Please suggest the best way to do the above task in minimal time.
As others have said, profile your code to see why it is slow. The cProfile module, in conjunction with the gprof2dot tool, can produce nice, readable information.
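For reference, a minimal way to capture a profile from inside Python and inspect it (the parse_log_file entry point and the file names here are hypothetical placeholders for your own code) could be something like:

import cProfile
import pstats

# parse_log_file is a stand-in for your own top-level parsing function
cProfile.run("parse_log_file('big.log')", "parse.prof")
stats = pstats.Stats("parse.prof")
stats.sort_stats("cumulative").print_stats(20)

The saved parse.prof file can also be fed to gprof2dot using its pstats input format.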
Without seeing your slow code, I can guess a few things that might help:
First, you can probably get away with using the built-in string methods instead of a regex - this might be marginally quicker. If you do need to use regexes, it's worthwhile precompiling them outside the main loop using re.compile.
Second, don't do one insert query per line; instead, do the insertions in batches, e.g. add the parsed info to a list, and when it reaches a certain size perform one INSERT query with the executemany method.
Some incomplete code, as an example of the above:
import fileinput

parsed_info = []
for linenum, line in enumerate(fileinput.input()):
    if not line.startswith("#DEBUG"):
        continue  # Skip line
    msg = line.partition("MSG")[2]  # Get everything after MSG
    words = msg.split()  # Split into words
    info = {}
    for w in words:
        k, _, v = w.partition(":")  # Split each word on the first :
        info[k] = v
    parsed_info.append(info)
    if linenum % 10000 == 0:  # Or maybe: if len(parsed_info) > 500:
        # Insert everything in parsed_info to database
        ...
        parsed_info = []  # Clear
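As a rough sketch of what the batched insert itself could look like (the messages table and its columns here are made up; adjust them to whichever fields you actually keep from the parsed dicts):

import sqlite3

conn = sqlite3.connect("log.db")
conn.execute("CREATE TABLE IF NOT EXISTS messages (dir TEXT, type TEXT, size TEXT)")

def flush_batch(conn, batch):
    # A single executemany call inserts the whole batch in one go
    conn.executemany(
        "INSERT INTO messages (dir, type, size) VALUES (?, ?, ?)",
        [(info.get("DIR"), info.get("TYPE"), info.get("SIZE")) for info in batch],
    )
    conn.commit()

Committing once per batch rather than once per row is typically where most of the time is saved.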
Paul's answer makes sense: you first need to understand where you "lose" time.
The easiest way, if you don't have a profiler, is to record a timestamp in milliseconds before and after each "step" of your algorithm: opening the file, reading it line by line (and, inside the loop, the time taken by the split/regexp to recognise the debug lines), inserting into the DB, and so on (see the sketch after the list below).
Without further knowledge of your code, there are possible "traps" that would be very time consuming:
- opening the log file several times
- opening the DB every time you need to insert data, instead of opening one connection and then writing as you go
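A minimal sketch of that kind of manual timing, using time.perf_counter (the step comments are placeholders for your own code):

import time

t0 = time.perf_counter()
# ... step 1: open the file, read it line by line and parse ...
t1 = time.perf_counter()
# ... step 2: insert the parsed rows into the DB ...
t2 = time.perf_counter()

print("read/parse: {:.0f} ms".format((t1 - t0) * 1000))
print("db insert:  {:.0f} ms".format((t2 - t1) * 1000))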
"The best way to do the above task in minimal time" is to first figure out where the time is going. Look into how to profile your Python script to find what parts are slow. You may have an inefficient regex. Writing to sqlite may be the problem. But there are no magic bullets - in general, processing 2GB of text line by line, with a regex, in Python, is probably going to run in minutes, not seconds.
Here is a test script that will show how long it takes to read a file, line by line, and do nothing else:
from datetime import datetime

start = datetime.now()
for line in open("big_honkin_file.dat"):
    pass
end = datetime.now()
print(end - start)
How to get/extract number of lines added and deleted?
(Just like we do using git diff --numstat).
from git import Repo  # GitPython

repo_ = Repo('git-repo-path')
git_ = repo_.git
log_ = git_.diff('--numstat', 'HEAD~1')
print(log_)
This prints the entire output (lines added/deleted and file names) as a single string. Can this output format be modified or changed so as to extract useful information?
Output format: num(added) num(deleted) file-name
For all files modified.
If I understand you correctly, you want to extract data from your log_ variable and then re-format it and print it? If that's the case, then I think the simplest way to do it is with a regular expression:
import re

for line in log_.split('\n'):
    m = re.match(r"(\d+)\s+(\d+)\s+(.+)", line)
    if m:
        print("{}: rows added {}, rows deleted {}".format(m[3], m[1], m[2]))
You can of course modify the exact output any way you want once you have the data in the match m. Getting the hang of regular expressions may take a while, but they can be very helpful for small scripts.
Be advised, though, that regexes tend to be write-only code and can be very hard to debug. For extracting small parts like this, however, they work very well.
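If you want the numbers as values rather than printed text, a small hedged variation (the changes list and the totals are my own names) could collect them like this:

import re

changes = []  # (added, deleted, filename) per modified file
for line in log_.split('\n'):
    m = re.match(r"(\d+)\s+(\d+)\s+(.+)", line)
    if m:
        changes.append((int(m.group(1)), int(m.group(2)), m.group(3)))

total_added = sum(a for a, d, f in changes)
total_deleted = sum(d for a, d, f in changes)
print("added {}, deleted {} across {} files".format(total_added, total_deleted, len(changes)))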
I am trying to extract some information from a set of files sent to me by a collaborator. Each file contains some python code which names a sequence of lists. They look something like this:
#PHASE = 0
x = np.array(1,2,...)
y = np.array(3,4,...)
z = np.array(5,6,...)
#PHASE = 30
x = np.array(1,4,...)
y = np.array(2,5,...)
z = np.array(3,6,...)
#PHASE = 40
...
And so on. There are 12 files in total, each with 7 phase sets. My goal is to convert each phase into its own file, which can then be read by ascii.read() as a Table object for manipulation in a different section of code.
My current method is extremely inefficient, both in terms of resources and time/energy required to assemble. It goes something like this: Start with a function
def makeTable(a, b, c):
    output = Table()
    output['x'] = a
    output['y'] = b
    output['z'] = c
    return output
Then for each phase, I have manually copy-pasted the relevant part of the text file into a cell and appended a line of code
fileName_phase = makeTable(a,b,c)
Repeat ad nauseam. It would take 84 iterations of this to process all the data, and naturally each would need some minor adjustments to match the specific fileName and phase.
Finally, at the end of my code, I have a few lines of code set up to ascii.write each of the tables into .dat files for later manipulation.
This entire method is extremely exhausting to set up. If it's the only way to handle the data, I'll do it. I'm hoping I can find a quicker way to set it up, however. Is there one you can suggest?
If efficiency and code reuse (instead of copy-paste) is the goal, I think classes might provide a good way. Here's my idea: create a class called FileWithArrays and use a parser to read the lines and put them inside a FileWithArrays object you create from the class. Once that's done, you can add a method to transform the object into a table.
P.S. A good idea for the parser is to store all the lines in a list and parse them one by one, using list.pop() to shrink the list as you go. Hope it helps; try to rewrite/reformat the question if I misunderstood anything, as it's not very easy to read.
I will suggest a way which will be scorned by many but will get your work done.
So apologies to everyone.
The prerequisite for this method is that you absolutely trust the correctness of the input files, which I guess you do (after all, they come from your collaborator).
The key point here is that the text in the file is code, which means it can be executed.
So you can do something like this
import re
import numpy as np  # needed by the actual code in the files; you might have to install the numpy library for this to work

with open("xyz.txt") as file:
    content = file.read()
Now that you have all the content, you have to separate it by phase.
For this we will use the re.split function.
phase_data = re.split("#PHASE = .*\n", content)
Now we have the content of each phase in a list.
Now comes the part where we execute it.
for phase in phase_data:
    if len(phase.strip()) == 0:
        continue
    exec(phase)
    table = makeTable(x, y, z)  # the x, y and z are defined by the exec
    # do whatever you want with the table
I will reiterate that you have to absolutely trust the contents of the file. Since you are executing it as code.
But your work seems like a scripting one and I believe this will get your work done.
PS: The other, "safer" alternative to exec is to use a sandboxing library which takes the string and executes it without affecting the parent scope.
To avoid the safety issue of using exec as suggested by @Ajay Brahmakshatriya, but keeping his first processing step, you can create your own minimal 'phase parser', something like:
VARS = 'xyz'

def makeTable(phase):
    assert len(phase) >= 3
    output = Table()
    for i in range(3):
        line = [s.strip() for s in phase[i].split('=')]
        assert len(line) == 2
        var, arr = line
        assert var == VARS[i]
        assert arr[:10] == 'np.array([' and arr[-2:] == '])'
        output[var] = np.fromstring(arr[10:-2], sep=',')
    return output
and then call
table = makeTable(phase)
instead of
exec(phase)
table = makeTable(x, y, z)
You could also skip all these assert statements without compromising safety; if the file is corrupted or not formatted as expected, the error that gets thrown might just be harder to understand...
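Putting the two answers together, a rough end-to-end sketch could look like the following. It assumes astropy's ascii module, that the phase headers really look like "#PHASE = <number>", and that makeTable above receives the list of lines of one phase; the file names are made up:

import re
from astropy.io import ascii

with open("xyz.txt") as f:
    content = f.read()

# Keep the phase labels so each output file can be named after its phase
blocks = re.split(r"#PHASE = (\d+)\s*\n", content)[1:]  # [label, body, label, body, ...]
for label, body in zip(blocks[0::2], blocks[1::2]):
    table = makeTable(body.strip().splitlines())
    ascii.write(table, "xyz_phase{}.dat".format(label))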
I have a folder with 300+ .txt files with total size of 15GB+. These files contain tweets. Each line is a different tweet. I have a list of keywords I'd like to search the tweets for. I have created a script that searches each line of every file for every item on my list. If the tweet contains the keyword, then it writes the line into another file. This is my code:
# Search each file for every item in keywords
print("Searching the files of " + filename + " for the appropriate keywords...")
for file in os.listdir(file_path):
    f = open(file_path + file, 'r')
    for line in f:
        for key in keywords:
            if re.search(key, line, re.IGNORECASE):
                db.write(line)
This is the format each line has:
{"created_at":"Wed Feb 03 06:53:42 +0000 2016","id":694775753754316801,"id_str":"694775753754316801","text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF","source":"\u003ca href=\"http:\/\/www.facebook.com\/twitter\" rel=\"nofollow\"\u003eFacebook\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":5981342,"id_str":"5981342","name":"Lava Kafle","screen_name":"lkafle","location":"Kathmandu, Nepal","url":"http:\/\/about.me\/lavakafle","description":"#deerwalkinc 24000+ tweeps bigdata #Team #Genomics http:\/\/deerwalk.com #Genetic #Testing #population #health #management #BigData #Analytics #java #hadoop","protected":false,"verified":false,"followers_count":24742,"friends_count":23169,"listed_count":1481,"favourites_count":147252,"statuses_count":171880,"created_at":"Sat May 12 04:49:14 +0000 2007","utc_offset":20700,"time_zone":"Kathmandu","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_link_color":"088253","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/5981342\/1416802075","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/MqmDwbCDAF","expanded_url":"http:\/\/fb.me\/Yj1JW9bJ","display_url":"fb.me\/Yj1JW9bJ","indices":[45,68]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1454482422661"}
The script works but it takes a lot of time. For ~40 keywords it needs more than 2 hours. Obviously my code is not optimized. What can I do to improve the speed?
p.s. I have read some relevant questions regarding searching and speed but I suspect that the problem in my script lies in the fact that I'm using a list for the keywords. I've tried some of the suggested solutions but to no avail.
1) External library
If you're willing to lean on external libraries (and time to execute is more important than the one-off time cost to install), you might be able to gain some speed by loading each file into a simple Pandas DataFrame and performing the keyword search as a vector operation. To get the matching tweets, you would do something like:
import pandas as pd

# Read each line of the file into a single-column dataframe
with open("/path/to/file.txt") as f:
    dataframe_from_text = pd.DataFrame({"tweet": f.readlines()})
matched_tweets_index = dataframe_from_text["tweet"].str.contains("keyword_a|keyword_b", case=False)
matched_tweets = dataframe_from_text[matched_tweets_index]  # Uses the boolean mask above to filter the full dataframe
# You'd then have a mini dataframe of matching tweets in `matched_tweets`.
# You could loop through these to save them out to a file using the `.to_dict(orient="records")` format.
Dataframe operations within Pandas can be really quick so might be worth investigating.
2) Group your regex
Looks like you're not logging which keyword you matched against. If this is true, you could group your keywords into a single regex query like so:
keywords_combined = "|".join(keywords)  # Build the combined pattern once, outside the loop
for line in f:
    if re.search(keywords_combined, line, re.IGNORECASE):
        db.write(line)
I've not tested this, but by reducing the number of regex searches per line it could trim some time off.
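If your keywords can contain regex metacharacters, it is probably safer to escape them and precompile the combined pattern once; a small hedged tweak to the snippet above:

import re

pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
for line in f:
    if pattern.search(line):
        db.write(line)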
Why it's slow
You are regex searching through a JSON dump, which is not always a good idea. For example, if your keywords include words like user, time, profile or image, every line will result in a match, because the JSON format for tweets has all these terms as dictionary keys.
Besides, the raw JSON is huge: each tweet will be more than 1 kB in size (this one is 2.1 kB), but the only part that's relevant in your sample is:
"text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF",
And this is less than 100 bytes; a typical tweet is still less than 140 characters despite recent changes to the API.
Things to try:
precompile the regex, as suggested by Padraic Cunningham
Option 1. Load this data into a postgresql JSONB field. JSONB fields are indexable and can be searched very quickly.
Option 2. Load this into any old database, with the content of the text field having its own column so that this column can be searched easily.
Option 3. Last but not least, extract just the text field into its own file. You can have a CSV file where the first column is the screen name and the second is the text of the tweet. Your 15GB will shrink to about 1GB (see the sketch below).
In short what you are doing now is searching the whole farm for the needle when you only need to search the haystack.
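A minimal sketch of Option 3, assuming each line is one JSON-encoded tweet; the directory and output file names are made up:

import csv
import json
import os

file_path = "tweets/"  # directory containing the raw .txt files
with open("tweets_text.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for name in os.listdir(file_path):
        with open(file_path + name) as f:
            for line in f:
                try:
                    tweet = json.loads(line)
                except ValueError:
                    continue  # skip blank or malformed lines
                writer.writerow([tweet["user"]["screen_name"], tweet["text"]])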
I've been learning Python over the last couple of days in order to complete a data extraction. I'm not getting anywhere and hope one of you lovely people can advise.
I need to extract data that follows: RESP, CRESP, RTTime and RT.
Here's a snippet as an example of the mess I have to deal with.
Thoughts?
Level: 4
*** LogFrame Start ***
Procedure: ActProcScenarios
No: 1
Line1: It is almost time for your town's spring festival. A friend of yours is
Line2: on the committee and asks if you would be prepared to help out with the
Line3: barbecue in the park. There is a large barn for use if it rains.
Line4: You hope that on that day it will be
pfrag: s-n-y
pword: sunny
pletter: u
Quest: Does the town have an autumn festival?
Correct: {LEFTARROW}
ScenarioListPract: 1
Topic: practice
Subtheme: practice
ActPracScenarios: 1
Running: ActPracScenarios
ActPracScenarios.Cycle: 1
ActPracScenarios.Sample: 1
DisplayFragInstr.OnsetDelay: 17
DisplayFragInstr.OnsetTime: 98031
DisplayFragInstr.DurationError: -999999
DisplayFragInstr.RTTime: 103886
DisplayFragInstr.ACC: 0
DisplayFragInstr.RT: 5855
DisplayFragInstr.RESP: {DOWNARROW}
DisplayFragInstr.CRESP:
FragInput.OnsetDelay: 13
FragInput.OnsetTime: 103899
FragInput.DurationError: -999999
FragInput.RTTime: 104998
I think regular expressions would be the right tool here because the \b word boundary anchors allow you to make sure that RESP only matches a whole word RESP and not just part of a longer word (like CRESP).
Something like this should get you started:
>>> import re
>>> for line in myfile:
...     match = re.search(r"\b(RT|RTTime|RESP|CRESP): (.*)", line)
...     if match:
...         print("Matched {0} with value {1}".format(match.group(1),
...                                                   match.group(2)))
Output:
Matched RTTime with value 103886
Matched RT with value 5855
Matched RESP with value {DOWNARROW}
Matched CRESP with value
Matched RTTime with value 104998
Transform it into a dict first, then just get items from the dict as you wish:
d = {k.strip(): v.strip() for (k, v) in
     [line.split(':', 1) for line in s.split('\n') if line.find(':') != -1]}
print(d['DisplayFragInstr.RESP'], d['DisplayFragInstr.CRESP'],
      d['DisplayFragInstr.RTTime'], d['DisplayFragInstr.RT'])
>>> ('{DOWNARROW}', '', '103886', '5855')
I think you may be making things harder for yourself than needed. E-Prime has a file format called .edat that is designed for the purpose you are describing. An edat file is another format that contains the same information as the .txt file, but in a way that makes extracting variables easier. I personally only use the type of text file you have posted here as a form of data storage redundancy.
If you are doing things this way because you do not have a software key, it might help to know that the E-Merge and E-DataAid programs for E-Prime don't require a key. You only need the key for editing build files. Whoever provided you with the .txt files should probably have an install disk for these programs. If not, it is available on the PST website (I believe you need a serial code to create an account, but I'm not certain).
E-Prime generally creates a .edat file that matches the content of the text file you have posted an example of. Sometimes, though, if E-Prime crashes you don't get the edat file and only have the .txt. Luckily you can generate the edat file from the .txt file.
Here's how I would approach this issue: if you do not have the edat files available, first use E-DataAid to recover them.
Then, presuming you have multiple participants, you can use E-Merge to merge all of the edat files together for all participants who completed this task.
Open the merged file. It might look a little chaotic depending on how much you have in the file. Go to Tools -> Arrange Columns; this will show a list of all your variables. Adjust it so that only the desired variables are in the right-hand box, then hit OK.
Looking at the file you posted, it says Level: 4 at the top, so I'm guessing there are a lot of procedures in this experiment. If you have many procedures in the program, you might at this point have lines that just contain startup info and NULL in the locations where your variables of interest are. You can fix this by going to Tools -> Filter and creating a filter to eliminate those lines. Sometimes, depending on file structure, you might also end up with duplicate lines of the same data. You can also fix this with filtering.
You can then export this file as a CSV.
import re
import pprint

def parse_logs(file_name):
    with open(file_name, "r") as f:
        lines = [line.strip() for line in f.readlines()]
    base_regex = r'^.*{0}: (.*)$'
    match_terms = ["RESP", "CRESP", "RTTime", "RT"]
    regexes = {term: base_regex.format(term) for term in match_terms}
    output_list = []
    for line in lines:
        for key, regex in regexes.items():
            match = re.match(regex, line)
            if match:
                match_tuple = (key, match.groups()[0])
                output_list.append(match_tuple)
    return output_list

pprint.pprint(parse_logs("respregex"))
Edit: Tim and Guy's answers are both better. I was in a hurry to write something and missed two much more elegant solutions.
I have a number of files where I want to replace all instances of a specific string with another one.
I currently have this code:
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# Open file for substitution
replaceFile = open('file', 'r+')
# read in all the lines
lines = replaceFile.readlines()
# seek to the start of the file and truncate
# (this is because I want to do an "inline" replace)
replaceFile.seek(0)
replaceFile.truncate()
# Loop through each line from file
for line in lines:
    # Loop through each key in the mappings dict
    for i in mappings.keys():
        # if the key appears in the line
        if i in line:
            # do replacement
            line = line.replace(i, mappings[i])
    # Write the line to the file and move to next line
    replaceFile.write(line)
This works ok, but it is very slow for the size of the mappings and the size of the files I am dealing with.
For instance, in the "mappings" dict there are 60728 key value pairs.
I need to process up to 50 files and replace all instances of "key" with the corresponding value, and each of the 50 files is approximately 250000 lines.
There are also multiple instances where several keys need to be replaced on the same line, hence I can't just find the first match and then move on.
So my question is:
Is there a faster way to do the above?
I have thought about using a regex, but I am not sure how to craft one that will do multiple in-line replaces using key/value pairs from a dict.
If you need more info, let me know.
If this performance is slow, you'll have to find something fancy, because it's just about all running at C level:
for filename in filenames:
    with open(filename, 'r+') as f:
        data = f.read()
        f.seek(0)
        f.truncate()
        for k, v in mappings.items():
            data = data.replace(k, v)
        f.write(data)
Note that you can run multiple processes where each process tackles a portion of the total list of files. That should make the whole job a lot faster. Nothing fancy, just run multiple instances off the shell, each with a different file list.
Apparently str.replace is faster than re.sub.
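If you'd rather drive that parallelism from Python than from the shell, a minimal sketch with a process pool (assuming mappings and filenames are defined at module level; the function name is my own) could look like:

import multiprocessing as mp

def replace_in_file(filename):
    # Whole-file read/replace/write, same as the loop above
    with open(filename, 'r+') as f:
        data = f.read()
        for k, v in mappings.items():
            data = data.replace(k, v)
        f.seek(0)
        f.truncate()
        f.write(data)

if __name__ == '__main__':
    with mp.Pool() as pool:  # defaults to one worker per CPU
        pool.map(replace_in_file, filenames)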
So I got to thinking about this a bit more: suppose you have a really huge mappings dict, so much so that the likelihood of any one key in mappings being detected in your files is very low. In this scenario, all the time will be spent doing the searching (as pointed out by @abarnert).
Before resorting to exotic algorithms, it seems plausible that multiprocessing could at least be used to do the searching in parallel, and thereafter do the replacements in one process (you can't do replacements in multiple processes for obvious reasons: how would you combine the result?).
So I decided to finally get a basic understanding of multiprocessing, and the code below looks like it could plausibly work:
import multiprocessing as mp

def split_seq(seq, num_pieces):
    # Splits a list into roughly equal pieces
    start = 0
    for i in range(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

def detect_active_keys(keys, data, queue):
    # This function MUST be at the top level, or
    # it can't be pickled (multiprocessing uses pickling)
    queue.put([k for k in keys if k in data])

def mass_replace(data, mappings):
    queue = mp.Queue()
    # Split the mapping keys up into multiple lists,
    # same number as CPUs
    key_batches = split_seq(list(mappings.keys()), mp.cpu_count())
    # Start the key detections; the data string is pickled
    # and sent to each worker process
    processes = []
    for keys in key_batches:
        p = mp.Process(target=detect_active_keys, args=(keys, data, queue))
        # This is non-blocking
        p.start()
        processes.append(p)
    # Consume the output from the queue; we expect one result
    # (a list of keys) per process exactly (queue.get() blocks)
    active_keys = []
    for p in processes:
        active_keys.extend(queue.get())
    # Wait for the processes to finish. Note that you MUST only
    # call join() after calling queue.get()
    for p in processes:
        p.join()
    # Same as original submission, now with MUCH fewer keys
    for key in active_keys:
        data = data.replace(key, mappings[key])
    return data

if __name__ == '__main__':
    # You MUST call the mass_replace function from
    # here, due to how multiprocessing works
    filenames = <...obtain filenames...>
    mappings = <...obtain mappings...>
    for filename in filenames:
        with open(filename, 'r+') as f:
            data = mass_replace(f.read(), mappings)
            f.seek(0)
            f.truncate()
            f.write(data)
Some notes:
I have not executed this code yet! I hope to test it out sometime but it takes time to create the test files and so on. Please consider it as somewhere between pseudocode and valid python. It should not be difficult to get it to run.
Conceivably, it should be pretty easy to use multiple physical machines, i.e. a cluster with the same code. The docs for multiprocessing show how to work with machines on a network.
This code is still pretty simple. I would love to know whether it improves your speed at all.
There seem to be a lot of hackish caveats with using multiprocessing, which I tried to point out in the comments. Since I haven't been able to test the code yet, it may be the case that I haven't used multiprocessing correctly anyway.
According to http://pravin.paratey.com/posts/super-quick-find-replace, regex is the fastest way to go in Python (building a trie data structure would be fastest in C++):
import re

class Regex:
    # Regex implementation of find/replace for a massive word list.
    def __init__(self, mappings):
        self._mappings = mappings

    def replace_func(self, matchObj):
        # Called for every token re.sub matches; swap it if it is a known key
        key = matchObj.group(0)
        if key in self._mappings:
            return self._mappings[key]
        return key

    def replace_all(self, filename):
        with open(filename, 'r') as fp:
            text = fp.read()
        # The token pattern must cover your keys; keys like 'original-1'
        # need more than [a-zA-Z]+, hence the digits and hyphen here
        text = re.sub(r"[a-zA-Z0-9-]+", self.replace_func, text)
        with open(filename, "w") as fp:
            fp.write(text)

# mapping dictionary of find/replace pairs
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}
# initialize the Regex class with the mapping dictionary
r = Regex(mappings)
# replace in file
r.replace_all('file')
The slow part of this is the searching, not the replacing. (Even if I'm wrong, you can easily speed up the replacing part by first searching for all the indices, then splitting and replacing from the end; it's only the searching part that needs to be clever.)
Any naive mass string search algorithm is obviously going to be O(NM) for an N-length string and M substrings (and maybe even worse, if the substrings are long enough to matter). An algorithm that searched M times at each position, instead of M times over the whole string, might offer some cache/paging benefits, but it'll be a lot more complicated for probably only a small benefit.
So, you're not going to do much better than cjrh's implementation if you stick with a naive algorithm. (You could try compiling it as Cython or running it in PyPy to see if it helps, but I doubt it'll help much—as he explains, all the inner loops are already in C.)
The way to speed it up is to somehow look for many substrings at a time. The standard way to do that is to build a prefix tree (or suffix tree), so that, e.g., "original-1" and "original-2" are both branches off the same subtree "original-" and don't need to be handled separately until the very last character.
The standard implementation of a prefix tree is a trie. However, as Efficient String Matching: An Aid to Bibliographic Search and the Wikipedia article Aho-Corasick string matching algorithm explain, you can optimize further for this use case by using a custom data structure with extra links for fallbacks. (IIRC, this improves the average case by logM.)
Aho and Corasick further optimize things by compiling a finite state machine out of the fallback trie, which isn't appropriate to every problem, but sounds like it would be for yours. (You're reusing the same mappings dict 50 times.)
There are a number of variant algorithms with additional benefits, so it might be worth a bit of further research. (Common use cases are things like virus scanners and package filters, which might help your search.) But I think Aho-Corasick, or even just a plain trie, is probably good enough.
Building any of these structures in pure Python might add so much overhead that, at M~60000, the extra cost will defeat the M/logM algorithmic improvement. But fortunately, you don't have to. There are many C-optimized trie implementations, and at least one Aho-Corasick implementation, on PyPI. It also might be worth looking at something like SuffixTree instead of using a generic trie library upside-down if you think suffix matching will work better with your data.
Unfortunately, without your data set, it's hard for anyone else to do a useful performance test. If you want, I can write test code that uses a few different modules, which you can then run against your data. But here's a simple example using ahocorasick for the search and a dumb replace-from-the-end implementation for the replace:
import ahocorasick  # assumes the PyPI module that provides KeywordTree

tree = ahocorasick.KeywordTree()
for key in mappings:
    tree.add(key)
tree.make()

# Replace from the end so earlier match offsets stay valid
for start, end in reversed(list(tree.findall(target))):
    target = target[:start] + mappings[target[start:end]] + target[end:]
This uses a with block to prevent leaking file descriptors. The str.replace function will ensure all instances of each key get replaced within the text.
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# Open file for substitution
with open('file', 'r+') as fd:
    # read in all the data
    text = fd.read()
    # seek to the start of the file and truncate so the file will be edited inline
    fd.seek(0)
    fd.truncate()
    for key in mappings.keys():
        text = text.replace(key, mappings[key])
    fd.write(text)