Reading and Stripping a text file - python

import re
f = 'HelloHowAreYou'
f = re.sub(r"([a-z\d])([A-Z])", r'\1 \2', f)
# Makes the string space separated. You can use split to convert it to a list
f = f.split()
print(f)
This works fine to split the string at each capital letter. However, when I change the code to read a text file, I have issues. Can anyone shed some light on why?
To read a file I'm using:
f = open('words.txt','r')

But that code doesn't read the file; it only opens it. Try:
my_file = open('words.txt', 'r')
f = my_file.read()
my_file.close()
Or:
with open('words.txt', 'r') as my_file:
    f = my_file.read()
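Putting the question's regex together with the file read might look like the sketch below. The helper names split_on_capitals and split_file_on_capitals are made up for illustration, and the sketch assumes the file holds run-together words such as HelloHowAreYou.

```python
import re

def split_on_capitals(text):
    # Insert a space wherever a lowercase letter or digit is followed
    # by a capital letter, then split on whitespace.
    spaced = re.sub(r"([a-z\d])([A-Z])", r"\1 \2", text)
    return spaced.split()

def split_file_on_capitals(path):
    # Read the whole file into one string before applying the regex;
    # the file object itself cannot be passed to re.sub.
    with open(path, "r") as my_file:
        return split_on_capitals(my_file.read())

print(split_on_capitals("HelloHowAreYou"))  # ['Hello', 'How', 'Are', 'You']
```

Calling split_file_on_capitals('words.txt') would then return the same kind of list for the file contents.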

Related

Slice a given txtfile and write only part of it in a newfile in python

This is my original .txt data:
HKEY_CURRENT_USER\SOFTWARE\7-Zip
HKEY_CURRENT_USER\SOFTWARE\AppDataLow
HKEY_CURRENT_USER\SOFTWARE\Chromium
HKEY_CURRENT_USER\SOFTWARE\Clients
HKEY_CURRENT_USER\SOFTWARE\CodeBlocks
HKEY_CURRENT_USER\SOFTWARE\Discord
HKEY_CURRENT_USER\SOFTWARE\Dropbox
HKEY_CURRENT_USER\SOFTWARE\DropboxUpdate
HKEY_CURRENT_USER\SOFTWARE\ej-technologies
HKEY_CURRENT_USER\SOFTWARE\Evernote
HKEY_CURRENT_USER\SOFTWARE\GNU
And I need to have a new file where the new lines contain only part of those strings, like:
7-Zip
AppDataLow
Chromium
Clients
...
How can I do this in Python?
Try this:
## read file content as string
with open("file.txt", "r") as file:
    string = file.read()
## convert each line to list
lines = string.split("\n")
## write only last part after "\" in each line
with open("new.txt", "w") as file:
    for line in lines:
        file.write(line.split("\\")[-1] + "\n")
One approach is to read the entire text file into a Python string, split it into lines, and then split each line on the backslash to get the final path component.
import re
with open('file.txt', 'r') as file:
    data = file.read()
lines = re.split(r'\r?\n', data)
output = [x.split("\\")[-1] for x in lines]
# write to file if desired
text = '\n'.join(output)
with open('output.txt', 'w') as f_out:
    f_out.write(text)
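For a large file, the same idea can be applied one line at a time instead of reading everything first. This is only a sketch: last_component and strip_prefixes are hypothetical names, and it assumes every line uses backslashes as separators.

```python
def last_component(line):
    # Drop the line ending, then keep whatever follows the final backslash.
    return line.rstrip("\r\n").rsplit("\\", 1)[-1]

def strip_prefixes(in_path, out_path):
    # Stream the input so only one line is held in memory at a time.
    with open(in_path, "r") as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(last_component(line) + "\n")

print(last_component("HKEY_CURRENT_USER\\SOFTWARE\\7-Zip\n"))  # 7-Zip
```

Using rsplit with a limit of 1 stops after the last separator, so the rest of the path is never split into pieces that are thrown away.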

Regex to delete multi line content between two specific words

I have multiple instances of Fortran subroutines within a text file like the following:
SUBROUTINE ABCDEF(STRING1)
STRING2
STRING3
.
.
.
STRINGN
END
How can I delete the subroutines with their content in python using regex?
I have already tried this piece of code without success:
with open(input, 'r') as file:
    output = open(stripped, 'w')
    try:
        for line in file:
            result = re.sub(r"(?s)SUBROUTINE [A-Z]{6}(.*?)\bEND\b", input)
            output.write("\n")
    finally:
        output.close()
Does this work? I replaced input with input_file because input is a built-in function, so shadowing it is bad practice. The whole file is read at once, since the pattern has to match across several lines and cannot do so when applied line by line.
import re
pattern = r"(?s)SUBROUTINE [A-Z]{6}(.*?)\bEND\b"
regex = re.compile(pattern, re.MULTILINE|re.DOTALL)
with open(input_file, 'r') as file:
    with open(stripped, 'w') as output_file:
        result = regex.sub('', file.read())
        output_file.write(result)
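One thing to watch with \bEND\b is that it also matches the END in lines such as END IF inside a subroutine body. The sketch below is a stricter variant, not the answer's pattern: it assumes each subroutine is closed by a line containing only END, and the names SUBROUTINE_RE and strip_subroutines are made up for illustration.

```python
import re

# Assumption: a subroutine ends at the first line that contains only END.
# MULTILINE makes ^ and $ match at line boundaries; DOTALL lets .*? span lines.
SUBROUTINE_RE = re.compile(
    r"^[ \t]*SUBROUTINE\s+\w+.*?^[ \t]*END[ \t]*$\n?",
    re.MULTILINE | re.DOTALL,
)

def strip_subroutines(source):
    # Remove every SUBROUTINE ... END block from the source text.
    return SUBROUTINE_RE.sub("", source)

sample = (
    "PROGRAM MAIN\n"
    "SUBROUTINE ABCDEF(STRING1)\n"
    "  IF (X) THEN\n"
    "  END IF\n"
    "END\n"
    "KEEP ME\n"
)
print(strip_subroutines(sample))
```

Code that closes subroutines with END SUBROUTINE would need a different terminator in the pattern.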

Python 3.6 - Read encoded text from file and convert to string

Hopefully someone can help me out with the following. It is probably not too complicated but I haven't been able to figure it out. My "output.txt" file is created with:
f = open('output.txt', 'w')
print(tweet['text'].encode('utf-8'))
print(tweet['created_at'][0:19].encode('utf-8'))
print(tweet['user']['name'].encode('utf-8'))
f.close()
If I don't encode it for writing to file, it will give me errors. So "output" contains 3 rows of utf-8 encoded output:
b'testtesttest'
b'line2test'
b'\xca\x83\xc9\x94n ke\xc9\xaan'
In "main.py", I am trying to convert this back to a string:
f = open("output.txt", "r", encoding="utf-8")
text = f.read()
print(text)
f.close()
Unfortunately, the b'' - format is still not removed. Do I still need to decode it? If possible, I would like to keep the 3 row structure.
My apologies for the newbie question, this is my first one on SO :)
Thank you so much in advance!
With the help of the people answering my question, I have been able to get it to work. The solution is to change how the file is written:
import json
tweet = json.loads(data)
tweet_text = tweet['text'] # content of the tweet
tweet_created_at = tweet['created_at'][0:19] # tweet created at
tweet_user = tweet['user']['name'] # tweet created by
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(tweet_text + '\n')
    f.write(tweet_created_at + '\n')
    f.write(tweet_user + '\n')
Then read it like this:
f = open("output.txt", "r", encoding='utf-8')
text = f.read()
print(text)
f.close()
Instead of specifying the encoding when opening the file, use it to decode as you read.
f = open("output.txt", "rb")
text = f.read().decode(encoding="utf-8")
print(text)
f.close()
If b and the quote ' are in your file, that means there is a problem with how the file was written. Someone probably wrote the printed form of a bytes object (for example write(str(line))) instead of the decoded text. To decode it, you can use ast.literal_eval. Otherwise @m_callens's answer should be OK.
import ast
with open("b.txt", "r") as f:
    text = [ast.literal_eval(line) for line in f]
for l in text:
    print(l.decode('utf-8'))
# testtesttest
# line2test
# ʃɔn keɪn
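To see where the b'...' text comes from, the round trip can be reproduced in memory. This sketch reuses the third sample row from the question; the variable names are illustrative only.

```python
import ast

encoded = b'\xca\x83\xc9\x94n ke\xc9\xaan'  # UTF-8 bytes, as in the question

# str() of a bytes object gives its printed form, including the b'' wrapper.
# Writing this to a text file is what leaves b'...' in the output.
as_text = str(encoded)
print(as_text)

# literal_eval parses the printed form back into bytes, which can be decoded.
recovered = ast.literal_eval(as_text).decode("utf-8")
print(recovered.encode("utf-8") == encoded)  # True
```

Writing the already decoded string to a file opened with encoding='utf-8' avoids the detour through literal_eval altogether.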

Read lines from a text file, reverse and save in a new text file

So far I have this code:
f = open("text.txt", "rb")
s = f.read()
f.close()
f = open("newtext.txt", "wb")
f.write(s[::-1])
f.close()
The text in the original file is:
This is Line 1
This is Line 2
This is Line 3
This is Line 4
And when it reverses it and saves it the new file looks like this:
4 eniL si sihT 3 eniL si sihT 2 eniL si sihT 1 eniL si sihT
When I want it to look like this:
This is line 4
This is line 3
This is line 2
This is line 1
How can I do this?
You can do something like:
with open('test.txt') as f, open('output.txt', 'w') as fout:
    fout.writelines(reversed(f.readlines()))
read() returns the whole file as a single string. That's why reversing it also reverses the characters within each line, not just the order of the lines. To reverse only the order of the lines, use readlines() to get a list of them (as a first approximation, it is equivalent to s = f.read().split('\n'), except that readlines() keeps the trailing newline on each line):
s = f.readlines()
...
f.writelines(s[::-1])
# or f.writelines(reversed(s))
f = open("text.txt", "rb")
s = f.readlines()
f.close()
f = open("newtext.txt", "wb")
s.reverse()
for item in s:
    # each item already ends with its own newline
    f.write(item)
f.close()
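The method file.read() returns the whole file as one string, not a list of lines.
And since s is a string of the whole file, you're reversing the characters, not the lines!
First, you'll have to split it into lines:
s = f.read()
lines = s.split('\n')
Or:
lines = f.readlines()
Your slicing approach is then already correct, as long as you write the list with writelines(), because write() only accepts a single string:
f.writelines(lines[::-1])
If you used split('\n'), the newlines have been removed, so write '\n'.join(lines[::-1]) instead.
Hope this helps!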
There are a couple of steps here. First we want to get all the lines from the first file, and then we want to write them in reversed order to the new file. The code for doing this is as follows:
lines = []
with open('text.txt') as f:
    lines = f.readlines()
with open('newtext.txt', 'w') as f:
    for line in reversed(lines):
        f.write(line)
Firstly, we initialize a variable to hold our lines. Then we read all the lines from the 'text.txt' file.
Secondly, we open our output file. Here we loop through the lines in reversed order, writing them to the output file as we go.
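One edge case with readlines(): if the last line of the input has no trailing newline, reversing the list glues that line onto the one written after it. A minimal sketch of a guard follows; reversed_lines is a hypothetical helper, not part of any answer here.

```python
def reversed_lines(lines):
    # Make sure every line ends with a newline before reversing the order,
    # so the former last line does not run into the line that follows it.
    fixed = [line if line.endswith("\n") else line + "\n" for line in lines]
    return fixed[::-1]

print(reversed_lines(["This is Line 1\n", "This is Line 2"]))
# ['This is Line 2\n', 'This is Line 1\n']
```

The result can be passed straight to writelines() on the output file.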
A sample using a list, which makes it much easier.
I'm sure there are more elegant answers, but this way is easy to understand.
f = open(r"c:\test.txt", "r")
s = f.read()
f.close()
rowList = []
# iterate over lines; iterating over s directly would yield single characters
for value in s.splitlines():
    rowList.append(value + "\n")
rowList.reverse()
f = open(r"c:\test.txt", "w")
for value in rowList:
    f.write(value)
f.close()
You have to work line by line.
f = open("text.txt", "r")
s = f.read()
f.close()
f = open("newtext.txt", "w")
lines = s.split('\n')
f.write('\n'.join(lines[::-1]))
f.close()
Use it like this if your OS uses \n to break lines:
f = open("text.txt", "r")
s = f.read()
f.close()
f = open("newtext.txt", "w")
f.write("\n".join(reversed(s.split("\n"))))
f.close()
The main thing here is "\n".join(reversed(s.split("\n"))).
It does the following:
Splits your string at line breaks (\n), resulting in a list
Reverses the list
Merges the list back into a string, with line breaks (\n) between the items
Here are the states:
string: line1 \n line2 \n line3
list: ["line1", "line2", "line3"]
list: ["line3", "line2", "line1"]
string: line3 \n line2 \n line1
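The intermediate states can be checked directly. The sketch below also shows what happens when the text ends with a newline: split("\n") then produces a trailing empty string, which moves to the front after reversing, whereas splitlines() does not. The variable names are illustrative only.

```python
s = "line1\nline2\nline3\n"

parts = s.split("\n")                # ['line1', 'line2', 'line3', '']
joined = "\n".join(reversed(parts))  # '\nline3\nline2\nline1'

clean = s.splitlines()               # ['line1', 'line2', 'line3']
result = "\n".join(reversed(clean)) + "\n"  # 'line3\nline2\nline1\n'

print(repr(joined))
print(repr(result))
```

With splitlines() the final newline has to be added back by hand, as done for result above.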
If your input file is too big to fit in memory, here is an efficient way to reverse it:
Split input file into partial files (still in original order).
Read each partial file from last to first, reverse it and append to output file.
Implementation:
import os
from itertools import islice
input_path = "mylog.txt"
output_path = input_path + ".rev"
with open(input_path) as fi:
    for i, sli in enumerate(iter(lambda: list(islice(fi, 100000)), []), 1):
        with open(f"{output_path}.{i:05}", "w") as fo:
            fo.writelines(sli)
with open(output_path, "w") as fo:
    for file_index in range(i, 0, -1):
        path = f"{output_path}.{file_index:05}"
        with open(path) as fi:
            lines = fi.readlines()
        os.remove(path)
        for line in reversed(lines):
            fo.write(line)
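A variation on the same two-pass idea, wrapped in a function so the chunk size can be tuned, is sketched below. The function name reverse_large_file and the chunk_lines parameter are assumptions for illustration. Keeping the partial files in a temporary directory means they are cleaned up even if an error occurs, and an empty input simply produces an empty output.

```python
import os
import tempfile
from itertools import islice

def reverse_large_file(input_path, output_path, chunk_lines=100000):
    with tempfile.TemporaryDirectory() as tmp_dir:
        chunk_paths = []
        # Pass 1: split the input into partial files, still in original order.
        with open(input_path) as fi:
            while True:
                chunk = list(islice(fi, chunk_lines))
                if not chunk:
                    break
                path = os.path.join(tmp_dir, f"{len(chunk_paths):05}")
                with open(path, "w") as fo:
                    fo.writelines(chunk)
                chunk_paths.append(path)
        # Pass 2: read the partial files last to first, reversing each one.
        with open(output_path, "w") as fo:
            for path in reversed(chunk_paths):
                with open(path) as fi:
                    fo.writelines(reversed(fi.readlines()))
```

Only chunk_lines lines are held in memory at any time, so the value can be lowered for very long lines or raised to reduce the number of partial files.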

Export entire regex group to text file

When I print the group "print(a)" the entire group is shown. When I save it to a text file "open("sirs1.txt", "w").write(a)" only the last row is saved to the file.
import re
def main():
    f = open('sirs.txt')
    for lines in f:
        match = re.search('(AA|BB|CC|DD)......', lines)
        if match:
            a = match.group()
            print(a)
            open("sirs1.txt", "w").write(a)
How do I save the entire group to the text file?
nosklo is correct: the main problem is that you are overwriting the whole file each time you write to it. mehmattski is also correct that you will need to explicitly add \n to each write in order to make the output file readable.
Try this:
import re
def main():
    f = open('sirs.txt')
    outputfile = open('sirs1.txt','w')
    for lines in f:
        match = re.search('(AA|BB|CC|DD)......', lines)
        if match:
            a = match.group()
            print(a)
            outputfile.write(a+"\n")
    f.close()
    outputfile.close()
Opening a file in "w" mode creates a new, empty file, so you're creating a new file on every iteration.
Try creating the file outside the for loop:
import re
def main():
    with open('sirs.txt') as f:
        with open("sirs1.txt", "w") as fw:
            for lines in f:
                match = re.search('(AA|BB|CC|DD)......', lines)
                if match:
                    a = match.group()
                    print(a)
                    fw.write(a)
You need to add a newline character after each string to get them to print on separate lines:
import re
def main():
    f = open('sirs.txt')
    outputfile = open('sirs1.txt','w')
    for lines in f:
        match = re.search('(AA|BB|CC|DD)......', lines)
        if match:
            a = match.group()
            print(a)
            outputfile.write(a+'\n')
    f.close()
    outputfile.close()
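Another way to avoid the overwrite problem is to collect the matches first and write them in one go. This is a sketch under the same pattern as the question; the names PATTERN, extract_matches and export_matches are made up for illustration.

```python
import re

PATTERN = re.compile(r"(AA|BB|CC|DD)......")

def extract_matches(lines):
    # Keep the first match from each line that has one.
    matches = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            matches.append(match.group())
    return matches

def export_matches(in_path, out_path):
    with open(in_path) as f:
        matches = extract_matches(f)
    # Open the output once: "w" truncates the file, so reopening it
    # for every match would keep only the last one.
    with open(out_path, "w") as out:
        out.write("\n".join(matches) + ("\n" if matches else ""))

print(extract_matches(["xxAA123456yy\n", "nothing here\n", "BBabcdef\n"]))
# ['AA123456', 'BBabcdef']
```

Separating the matching from the writing also makes the matching easy to check on a small list of lines, as the print call shows.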
