I have a large 11 GB .txt file with email addresses. I would like to keep only the part of each line up to the # symbol, one result per line. My output only generates the first line. I reused this code from an earlier project. I would like to save the output in a different .txt file. I hope someone can help me out.
my code:
import re

def get_html_string(file, start_string, end_string):
    answer = "nothing"
    with open(file, 'rb') as open_file:
        for line in open_file:
            line = line.rstrip()
            if re.search(start_string, line):
                answer = line
                break
    start = answer.find(start_string) + len(start_string)
    end = answer.find(end_string)
    #print(start, end, answer)
    return answer[start:end]

beginstr = ''
end = '#'
file = 'test.txt'

readstring = str(get_html_string(file, beginstr, end))
print readstring
Your file is quite big (11 GB), so you shouldn't keep all those strings in memory. Instead, process the file line by line and write each result before reading the next line.
This should be efficient:
with open('test.txt', 'r') as input_file:
    with open('result.txt', 'w') as output_file:
        for line in input_file:
            prefix = line.split('#')[0]
            output_file.write(prefix + '\n')
If your file looks like this example:
user#google.com
user2#jshds.com
Useruser#jsnl.com
You can use this:
def get_email_name(file_name):
    with open(file_name) as file:
        lines = file.readlines()
    result = list()
    for line in lines:
        result.append(line.split('#')[0])
    return result
get_email_name('emails.txt')
Out:
['user', 'user2', 'Useruser']
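Since the question also wants the output saved to a different .txt file, a minimal sketch (the result.txt name here is just an assumption) would write the returned list out, one name per line:

# Sketch: write the names returned by get_email_name() to a separate file.
# "result.txt" is an assumed output name.
names = get_email_name('emails.txt')
with open('result.txt', 'w') as out_file:
    for name in names:
        out_file.write(name + '\n')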
This is my original .txt data:
HKEY_CURRENT_USER\SOFTWARE\7-Zip
HKEY_CURRENT_USER\SOFTWARE\AppDataLow
HKEY_CURRENT_USER\SOFTWARE\Chromium
HKEY_CURRENT_USER\SOFTWARE\Clients
HKEY_CURRENT_USER\SOFTWARE\CodeBlocks
HKEY_CURRENT_USER\SOFTWARE\Discord
HKEY_CURRENT_USER\SOFTWARE\Dropbox
HKEY_CURRENT_USER\SOFTWARE\DropboxUpdate
HKEY_CURRENT_USER\SOFTWARE\ej-technologies
HKEY_CURRENT_USER\SOFTWARE\Evernote
HKEY_CURRENT_USER\SOFTWARE\GNU
And I need a new file where each line contains only the last part of those strings, like:
7-Zip
AppDataLow
Chromium
Clients
...
How do I do it in Python?
Try this:
## read file content as string
with open("file.txt", "r") as file:
    string = file.read()

## convert each line to list
lines = string.split("\n")

## write only last part after "\" in each line
with open("new.txt", "w") as file:
    for line in lines:
        file.write(line.split("\\")[-1] + "\n")
One approach would be to read the entire text file into a Python string. Then use split on each line to find the final path component.
import re

with open('file.txt', 'r') as file:
    data = file.read()

lines = re.split(r'\r?\n', data)
output = [x.split("\\")[-1] for x in lines]

# write to file if desired
text = '\n'.join(output)
f_out = open('output.txt', 'w')
f_out.write(text)
f_out.close()
How do I remove lines from a txt file which start with ">"?
For example, the txt file has about 250k+ lines, and if I were to use the code below, it will take quite some time.
data = ""
with open(fileName) as f:
for line in f:
if ">" not in line:
line = line.replace("\n", "")
data += line
An example of the txt file is:
> version 1.0125 revision 0... # This is the line to be removed
some random line 1
some random line 2
> version 1.0126 revision 0... # This is the line to be removed
...
I have tried using data = f.read(); it is instant, but the data will contain the lines that start with ">".
Any help is appreciated. Thank you :)
Not knowing what you want to do with the data afterwards, this should be fast and correct:
with open(fileName) as f:
    data = "".join(line for line in f if not line.startswith(">"))
If you just want to remove these lines from the file, I would honestly not do it in Python, but in your shell directly, e.g. on Linux:
$ grep -v '^>' original_file.txt >fixed_file.txt
If you insist on Python, do it on a line-by-line basis:
with open(original_file) as f:
    with open(new_file, "w") as g:
        for line in f:
            if not line.startswith(">"):
                g.write(line)
Use two files, one for reading, second for appending:
with open(fileName, 'r') as f, open(fileName.replace('.txt', '_1.txt'), 'a+') as df:
    for line in f.readlines():
        if not line.startswith('>'):
            df.write(line)
I've scraped some comments from a webpage using Selenium and saved them to a text file. Now I would like to perform multiple edits on the text file and save it again. I've tried to combine the following into one smooth flow, but I'm fairly new to Python and just couldn't get it right (examples of what happened to me are at the bottom). The only way I could get it to work is to open and close the file over and over.
These are the actions I want to perform, in the order they need to happen:
import re

with open('results.txt', 'r') as f:
    lines = f.readlines()
with open("results.txt", "w") as f:
    for line in lines:
        f.write(line.replace("a sample text line", ' '))

with open('results.txt', 'r') as f:
    lines = f.readlines()
with open("results.txt", "w") as f:
    pattern = r'\d in \d example text'
    for line in lines:
        f.write(re.sub(pattern, "", line))

with open('results.txt', 'r') as f:
    lines = f.readlines()
with open('results.txt', 'w') as file:
    for line in lines:
        if not line.isspace():
            file.write(line)

with open('results.txt', 'r') as f:
    lines = f.readlines()
with open("results.txt", "w") as f:
    for line in lines:
        f.write(line.replace(" ", '-'))
I've tried to combine them into one loop, but I get doubled lines, words, or extra spaces.
Any help is appreciated, thank you.
If you want to do these in one smooth pass, you are better off opening another file to write the desired results, i.e.
import re

pattern = r"\d in \d example text"

# Open your results file for reading and another one for writing
with open("results.txt", "r") as fh_in, open("output.txt", "w") as fh_out:
    for line in fh_in:
        # Process the line
        line = line.replace("a sample text line", " ")
        line = re.sub(pattern, "", line)
        if line.isspace():
            continue
        line = line.replace(" ", "-")
        # Write out
        fh_out.write(line)
We process each line in the order you described, and the resulting line goes to the output file.
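If the cleaned text needs to end up back in results.txt afterwards (as in the original snippets), one possible follow-up, assuming output.txt was just written by the loop above, is to replace the original file:

import os

# Assumption: output.txt holds the processed text from the loop above;
# swap it in place of the original results.txt.
os.replace("output.txt", "results.txt")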
I am trying to do what for many will be a very straightforward thing but for me is just infuriatingly difficult.
I am trying to search for a line in a file that contains certain words or phrases and modify that line... that's it.
I have been through the forum and the suggested similar questions and have found many hints, but none do quite what I want, or they are beyond my current ability to grasp.
This is the test file:
# 1st_word 2nd_word
# 3rd_word 4th_word
And this is my script so far:
############################################################
file = 'C:\lpthw\\text'

f1 = open(file, "r+")
f2 = open(file, "r+")
############################################################

def wrline():
    lines = f1.readlines()
    for line in lines:
        if "1st_word" in line and "2nd_word" in line:
            #f2.write(line.replace('#\t', '\t'))
            f2.write((line.replace('#\t', '\t')).rstrip())
    f1.seek(0)

wrline()
My problem is that the line below inserts a \n after the line every time and adds a blank line to the file.
f2.write(line.replace('#\t', '\t'))
The file becomes:
1st_word 2nd_word
# 3rd_word 4th_word
An extra blank line between the lines of text.
If I use the following:
f2.write((line.replace('#\t', '\t')).rstrip())
I get this:
1st_word 2nd_wordd
# 3rd_word 4th_word
No new blank line is inserted, but there is an extra "d" at the end instead.
What am I doing wrong?
Thanks
Your blank line is coming from the original blank line in the file. Writing a line with nothing in it writes a newline to the file. Instead of not putting anything into the written line, you have to completely skip the iteration, so it does not write that newline. Here's what I suggest:
def wrline():
    lines = open('file.txt', 'r').readlines()
    f2 = open('file.txt', 'w')
    for line in lines:
        if '1st_word' in line and '2nd_word' in line:
            f2.write((line.replace('# ', ' ')).rstrip('\n'))
        else:
            if line != '\n':
                f2.write(line)
    f2.close()
I would keep read and write operations separate.
# read
with open(file, 'r') as f:
    lines = f.readlines()

# parse, change and write back
with open(file, 'w') as f:
    for line in lines:
        if line.startswith('#\t'):
            line = line[1:]
        f.write(line)
You have not closed the files, and there is no need for the \t.
Also get rid of the rstrip().
Read in the file, replace the data and write it back, opening and closing the file each time.
fn = 'example.txt'
new_data = []

# Read in the file
with open(fn, 'r+') as file:
    filedata = file.readlines()

# Replace the target string
for line in filedata:
    if "1st_word" in line and "2nd_word" in line:
        line = line.replace('#', '')
    new_data.append(line)

# Write the file out again
with open(fn, 'w+') as file:
    for line in new_data:
        file.write(line)
I wish to create a function that will search each line of an input file for a particular string sequence and, if it finds it, delete the whole line from the input file and output that line into a newly created text file with a similar format.
The input files format is always like so:
Firstname:DOB
Firstname:DOB
Firstname:DOB
Firstname:DOB
etc...
I want this file to be the input, then search for the DOB (19111991); if it finds this string in a line, delete that line from the input file and finally dump it into a new .txt document.
I'm pretty clueless if I'm being honest, but this is my logical attempt, even though some of the code may be wrong:
def snipper(iFile)
    with open(iFile, "a") as iFile:
        lines = iFile.readlines()
        for line in lines:
            string = line.split(':')
            if string[1] == "19111991":
                iFile.strip_line()
                with open("newfile.txt", "w") as oFile:
                    iFile.write(string[0] + ':' + '19 November' + '\n')
Any help would be great.
Try this code instead:
def snipper(filename):
    with open(filename, "r") as f:
        lines = f.readlines()

    new_data = filter(lambda x: "19111991" in x, lines)
    remaining_old_data = filter(lambda x: "19111991" not in x, lines)

    with open("newfile.txt", "w") as oFile:
        for line in new_data:
            oFile.write(line.replace("19111991", "19th November 1991"))

    with open(filename, "w") as iFile:
        for line in remaining_old_data:
            iFile.write(line)
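A quick, hypothetical usage example (the file name and the Firstname:DOB entries below are made up to match the format described in the question):

# Hypothetical input: names and dates are invented for illustration.
with open("people.txt", "w") as f:
    f.write("Alice:19111991\nBob:01011990\n")

snipper("people.txt")
# Afterwards, newfile.txt contains "Alice:19th November 1991"
# and people.txt keeps only "Bob:01011990".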