RegEx Python Find and Print to a new document

Sorry if this is a dumb lump of questions, but there are a couple of things I was hoping to ask about. Basically, I'm being sent a file in which a bunch of data that is supposed to be on separate lines is clumped all together; I want to sort through it and print each statement on its own line. What I don't know is how to create a new document for everything to be dumped into, nor how to print into that document so that each item lands on its own line.
I've decided to tackle this task using regular expressions in Python. I want my code to look for any of four specific strings (MTH|, SCN|, ENG|, or HST|) and copy everything after it until it runs into one of those four strings again. At that point it needs to stop, record everything it copied, and then start copying the new string. I also need it to read past newlines and ignore them, which I hope to accomplish with
re.DOTALL
Basically, I want my code to take something like this:
MTH|stuffstuffstuffSCN|stuffstuffstuffENG|stuffstuffstuffHST|stuffstu
ffstuffSCN|stuffstuffstuffENG|stuffstuffstuffHST|stuffstuffstuffMTH|s
tuffstuffstuffSCN|stuffstuffstuffENG|stuffstuffstuff
And turn it into something nice and readable like this:
MTH|stuffstuffstuff
SCN|stuffstuffstuff
ENG|stuffstuffstuff
HST|stuffstuffstuff
SCN|stuffstuffstuff
ENG|stuffstuffstuff
HST|stuffstuffstuff
MTH|stuffstuffstuff
SCN|stuffstuffstuff
ENG|stuffstuffstuff
While also creating a new document and pasting it all into that .txt file. My code looks like this so far:
import re
re.DOTALL
from __future__ import print_function
NDoc = raw_input("Enter name of to-be-made document")
log = open("C:\Users\XYZ\Desktop\Python\NDoc.txt", "w")
#Need help with this^ How do I make new file instead of opening a file?
nl = list()
file = raw_input("Enter a file to be sorted")
xfile = open(file)
for line in xfile:
    l = line.strip()
    n = re.findall('^([MTH|SCN|ENG|HST][|].)$[MTH|SCN|ENG|HST][|]', l)
    #Edited out some x's here that I left in, sorry
    if len(n) > 0:
        nl.append(n)
for item in nl:
    print(item, file=log)
In the starting file, stuffstuffstuff can be numbers, letters, and various symbols (including |), but nowhere except where they are supposed to be will MTH|, SCN|, ENG|, or HST| occur, so I want to look specifically for those four strings as my starts and ends.
Aside from being able to create a new document and paste into it, with each item of the list on its own line, will the above code accomplish what I am trying to do? Can I scan .txt files and Excel files? I don't have a file to test it on until Friday, but I am supposed to have it mostly done by then.
Oh, also, to do things like:
import re
re.DOTALL
from __future__ import print_function
do I have to set up anything external? Are these add-ons I need to install, or are these all just built into Python?

This regex will take your string and put newlines in between each string you wanted to separate:
re.sub(r"(\B)(?=((MTH|SCN|ENG|HST)[|]))", "\n\n", line)
Here is the code I was testing with:
from __future__ import print_function
import re

#NDoc = raw_input("Enter name of to-be-made document")
#log = open("C:\Users\XYZ\Desktop\Python\NDoc.txt", "w")
#Need help with this^ How do I make new file instead of opening a file?
#nl = list()
#file = raw_input("Enter a file to be sorted")
xfile = open("file2")
for line in xfile:
    l = line.strip()
    n = re.sub(r"(\B)(?=((MTH|SCN|ENG|HST)[|]))", "\n\n", line)
    #Edited out some x's here that I left in, sorry
    if len(n) > 0:
        nl = n.split("\n")
        for item in nl:
            print(item)
I've tested this version with input data that has no newlines. I also have a version that works with newlines. If this doesn't work, let me know and I'll post that version.
The main environment changes I made are that I'm reading from a file named "file2" in the same directory as the Python script, and I'm writing the output to the screen.
This version assumes that there are newlines in your data and just reads the whole file in:
from __future__ import print_function
import re

#NDoc = raw_input("Enter name of to-be-made document")
#log = open("C:\Users\XYZ\Desktop\Python\NDoc.txt", "w")
#Need help with this^ How do I make new file instead of opening a file?
#nl = list()
#file = raw_input("Enter a file to be sorted")
xfile = open("file")
line = xfile.read()
l = line.strip()
l = re.sub(r"\n", "", l)
n = re.sub(r"(\B)(?=((MTH|SCN|ENG|HST)[|]))", "\n\n", l)
print(n)
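As for landing the result in a brand-new .txt file: opening a path in "w" mode creates the file if it does not exist (and truncates it if it does). Here is a minimal sketch built on the version above; the prompt text and file names are placeholders, and it uses a single newline per tag so each record sits on its own line, as in your desired output:
from __future__ import print_function
import re

out_name = raw_input("Enter name of to-be-made document")  # e.g. "sorted" -> creates sorted.txt
log = open(out_name + ".txt", "w")   # "w" creates the file if it does not exist

xfile = open("file")                 # same input name as above; adjust as needed
text = xfile.read()
text = re.sub(r"\n", "", text)       # join the wrapped lines back together
out = re.sub(r"\B(?=(MTH|SCN|ENG|HST)[|])", "\n", text)  # newline before each tag
print(out, file=log)

log.close()
xfile.close()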


Writing code to search many text files for certain words

I'm new to coding. I have about 50 Notepad files, which are transcripts with timestamps, that I want to search for certain words or phrases. Ideally I would input a word or phrase and the program would find every instance of its occurrence and print it along with its timestamp. The lines in my files look like this:
"then he went to the office
00:03
but where did he go afterwards?
00:06
to the mall
00:08"
I just want a starting point. I have fun in figuring things out on my own, but I need somewhere to start.
A few questions I have:
How can I ensure that the user input will not be case sensitive?
How can I open many text files systematically with a loop? I have titled the text files {1,2,3,4...} to try to make this easy to implement but I still don't know how to do it.
from os import listdir
import re

def find_word(text, search):
    # \b gives an exact-word match; re.IGNORECASE removes case sensitivity
    result = re.findall(r'\b' + re.escape(search) + r'\b', text, flags=re.IGNORECASE)
    return len(result) > 0

#please put all the txt files in one folder (for my case it was: D:/onr/1738349/txt/)
search = 'word'  #or the phrase you want to search for

result_file = "D:/onr/1738349/result.txt"  #make sure you change these accordingly, I put them this way so you can understand
transcript_folder = "D:/onr/1738349/txt"   #all the transcript files live here

with open(result_file, "w") as f:  #change to where you want the result, but not the txt folder where you keep the other files
    for filename in listdir(transcript_folder):  #the folder that contains all the txt files (50 files, as you said)
        with open(transcript_folder + '/' + filename) as currentFile:
            i = 0
            for line in currentFile:
                i = i + 1
                if find_word(line, search):  #exact match only: 'word' and 'testword' are different
                    f.write('Found in (' + filename[:-4] + '.txt) at line number: (' + str(i) + ') ')
                    #the timestamp sits on the next line; read it exactly once so no lines get skipped
                    timestamp = next(currentFile, None)
                    if timestamp is not None:
                        f.write('time: (' + timestamp.rstrip() + ') \n')
                        i = i + 1  #account for the extra line we consumed
Make sure you follow the comments.
(1) create result.txt
(2) keep all the transcript files in one folder
(3) make sure (1) and (2) are not in the same folder
(4) the directory paths will look different on a Unix-based system (mine is Windows)
(5) just run this script after making the suitable changes; it will find all the transcript files and collect every result into a single file for your convenience
The output (in result.txt) would be:
Found in (fileName.txt) at line number: (#lineNumber) time: (Time)
.......
How can I ensure that the user input will not be case sensitive?
As soon as you read a file in, convert the whole thing to capital letters. Similarly, when you accept user input, immediately convert it to capital letters.
How can I open many text files systematically with a loop?
import os

files = os.listdir(folder_path)  # folder_path is wherever you keep the numbered files
for filename in files:
    ...  # open and search each file here
I have titled the text files {1,2,3,4...} to try to make this easy to implement but I still don't know how to do it.
Have fun.
Maybe these thoughts are helpful for you:
First, I would try to find a way to get hold of all the files I need (even if it's only their paths).
Filtering can be done with regex (there are websites where you can test your regex constructs, like regex101, which I found super helpful when I first looked at regex).
Then I would mess around with extracting all the timestamps/places etc. to check that you are filtering correctly.
Have fun!
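Putting those thoughts together, here is one possible starting-point sketch. It assumes the numbered .txt files sit in a folder called transcripts (a made-up name) and that text lines and timestamp lines strictly alternate, as in your sample:
import os

search = input("Word or phrase to find: ").lower()  # lower-case once so the comparison is case-insensitive
folder = "transcripts"                              # assumed folder holding the files named 1.txt, 2.txt, ...

for name in sorted(os.listdir(folder)):
    if not name.endswith(".txt"):
        continue                                    # ignore anything that is not a transcript
    with open(os.path.join(folder, name)) as f:
        lines = [line.strip() for line in f]
    for i in range(0, len(lines) - 1, 2):           # text on one line, its timestamp on the next
        if search in lines[i].lower():
            print("%s: %s [%s]" % (name, lines[i], lines[i + 1]))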
Here is code which works like you expect:
searched_text = input("Type searched text: ")
with open("file.txt", "r") as file:
    lines = file.readlines()
for i in range(len(lines)):
    lines[i] = lines[i].replace("\n", "")
for i in range(len(lines) - 1):
    if searched_text in lines[i]:
        # in the sample data the timestamp sits on the line right after the text
        print(searched_text + " appeared in " + str(lines[i]) + " at timestamp " + str(lines[i + 1]))

Python Script to copy define line in C files

I am having some trouble with a script. My task is to read in a bunch of .h files (headers written in C), and I am capable of doing that using:
myfiles = glob.glob('*.h')
The struggle I am having is that once these files are read in, I need to take the #define line, paste a copy of it below, and change that copy. Confusing, I know, but an example would be:
#ifndefine _THIS_CODE_NEEDS_COPIED_H
#define _THIS_CODE_NEEDS_COPIED_H
#define THIS_CODE_NEEDS_COPIED_VERSION "10" <----thats what I need to add!
Noting: it loses one underscore after define, and the H is changed to VERSION with a String "10" at the end.
Yes, it seems as though it would be simple, but I am not sure how to read files in Python character by character. Any suggestions? Minding, this is a new line, copied and then edited, below those others. Also, this is many files; there are hundreds of them! And all of their #defines read something different after them (for example, another might be #define _THIS_IS_DIFFERENT_H), so they would not all say the same thing. Please, help! My brain can't take any more!
This sounds like a job for the fileinput and re modules:
import fileinput
import glob
import re
import sys

files = glob.glob('*.h')
pattern = re.compile(r'#define\s+_([_A-Z]+)_H\s+$')
realstdout = sys.stdout
for line in fileinput.input(files, inplace=True, backup='.bak'):
    sys.stdout.write(line)
    m = pattern.match(line)
    if m:
        sys.stdout.write('\n#define %s_VERSION "10"\n' % m.group(1))
        realstdout.write('%s: %s\n' % (fileinput.filename(), m.group(1)))
Notes:
The call to fileinput.input() iterates over files in the list that is passed in as the first argument.
The inplace parameter to fileinput.input() indicates that you are editing the files in place. That is, the files will be replaced by whatever your program writes to standard output.
The regular expression matches the sort of #define that you say you are looking for. Additionally, the parentheses () in it capture a substring of that match.
Inside the loop, we maintain the existing content by writing every single line of every file. Additionally, if we see the magic #define, then we write one extra line.
The business with realstdout provides a log of which files were modified, and what patterns were detected.
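If the inplace mechanics seem opaque, this stripped-down sketch (the file name demo.h is a placeholder) shows the idea: while the loop runs, standard output is redirected into the file, so whatever you write becomes its new content:
import fileinput
import sys

# every write to standard output inside this loop replaces the original
# content of demo.h; the untouched original is kept as demo.h.bak
for line in fileinput.input(['demo.h'], inplace=True, backup='.bak'):
    sys.stdout.write(line)                 # copy the existing line through unchanged
    if line.startswith('#define'):
        sys.stdout.write('/* added */\n')  # and append one extra line after it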
You don't need to read character by character to do this in Python. It could be done in fewer lines of code, but it would be even uglier and harder to follow than it is now:
with open("input_file.txt", "r") as f:  # you can use glob or os.walk as needed to get the input files
    with open("output_file.txt", "w") as out:  # output file, adjust as needed
        for line in f:  # iterate through each line
            out.write(line)  # write the original line through unchanged
            if line.startswith("#define"):
                tokens = line.strip().split(' ')
                sp = 1 if tokens[-1].startswith('_') else 0  # skip the initial underscore if present by adjusting the start position (sp)
                def_list = tokens[-1][sp:].split('_')
                if def_list[-1] == 'H':
                    def_list[-1] = 'VERSION'
                    out.write('#define %s "10"\n' % '_'.join(def_list))  # add the edited copy on the line below
This writes every line to the new file as is and, right after a matching #define, adds the edited copy below it. If an underscore prefix is required on the ones needing to be changed, that can be modified too; currently the script handles it either way by adjusting the start position (sp).
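To run this over hundreds of headers, the same loop can sit inside a glob pass. A sketch, where the .new output-naming scheme is just an assumption (adjust or rename over the originals as you prefer):
import glob

for path in glob.glob('*.h'):  # every header in the current directory
    with open(path) as f, open(path + '.new', 'w') as out:
        for line in f:
            out.write(line)  # keep every original line
            tokens = line.strip().split()
            if tokens[:1] == ['#define'] and tokens[-1].endswith('_H'):
                macro = tokens[-1].lstrip('_')  # drop the leading underscore
                out.write('#define %s_VERSION "10"\n' % macro[:-2])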

Python - Calling lines from a text file to compile a pattern search of a second file

Forgive me if this is asked and answered. If so, chalk it up to my being new to programming and not knowing enough to search properly.
I have a need to read in a file containing several hundred phrases, such as names or email addresses, one per line, to be used as part of a compiled search pattern (pattern = re.compile(name)). The 'pattern' variable will be used to search another file of over 5 million lines to identify and extract select fields from relevant lines.
The text of the name file being read in for variable would be in the format of:
John\n
Bill\n
Harry#helpme.com\n
Sally\n
So far I have the below code, which does not error out but also does not finish and close out. If I pass the names manually using slightly different code with sys.argv[1], everything works fine. The area I am having problems with starts at lines = open("Filewithnames.txt", 'r').
import sys
import re
import csv
import os

searchdata = open("reallybigfile", "r")
Certfile = csv.writer(open('Certfile.csv', 'ab'), delimiter=',')
lines = open("Filewithnames.txt", 'r')
while True:
    for line in lines:
        line.rstrip('\n')
    lines.seek(0)
    for nam in lines:
        pat = re.compile(nam)
        for f in searchdata.readlines():
            if pat.search(f):
                fields = f.strip().split(',')
                Certfile.writerow([nam, fields[3], fields[4]])
lines.close()
The code at the bottom (starting with "for f in searchdata.readlines():") locates, extracts, and writes the fields fine. I have been unable to find a way to read in the Filewithnames.txt file and use each of its lines. It either hangs, as with this code, or it reads all lines of the file to the last line and returns data only for the last line, e.g. 'Sally'.
Thanks in advance.
while True is an infinite loop, and there is no way to break out of it that I can see. That will definitely cause the program to run forever without throwing an error.
Remove the while True line and de-indent that loop's code, and see what happens.
EDIT:
I have resolved a few issues, as commented, but I will leave you to figure out the precise regex you need to accomplish your goal.
import sys
import re
import csv
import os

searchdata = open("c:\\dev\\in\\1.txt", "r")
# Certfile = csv.writer(open('c:\\dev\\Certfile.csv', 'ab'), delimiter=',')  # moved lower so the file gets closed
lines = open("c:\\dev\\in\\2.txt", 'r')

pats = []  # an array of compiled patterns
# Add additional conditioning/escaping of the input here.
for nam in lines:
    pats.append(re.compile(nam.rstrip('\n')))  # strip the newline so it does not become part of the pattern

with open('c:\\dev\\Certfile.csv', 'ab') as outfile:  # opening via "with" ensures the file gets closed
    Certfile = csv.writer(outfile, delimiter=',')  # this line interprets the output into CSV
    for f in searchdata.readlines():
        for pat in pats:  # a loop for processing all of the patterns
            if pat.search(f) is not None:
                fields = f.strip().split(',')
                Certfile.writerow([pat.pattern, fields[3], fields[4]])

lines.close()
searchdata.close()
searchdata.close()
First of all, make sure to close all the files, including your output file.
As stated before, the while True loop was making the program run forever.
You need a regex, or a set of regexes, to cover all of your possible "names." The code is simpler with a set of regexes, so that is what I have done here; it may not be the most efficient. This adds a loop for processing all of the patterns.
I believe you also need additional parsing of the input file to give you clean regular expressions. I have left some space for you to do that.
Hope that helps!
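For that conditioning/escaping step left open above, re.escape is a safe default: it keeps characters like the '.' in Harry#helpme.com from being treated as regex metacharacters. A sketch of how the pattern-building loop could look:
import re

pats = []
with open("Filewithnames.txt") as lines:
    for nam in lines:
        nam = nam.strip()
        if nam:                                      # skip blank lines
            pats.append(re.compile(re.escape(nam)))  # match the name literally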

How do I delete multiple lines in a text file with python?

I am practicing my Python skills by writing a phone book program. I am able to search and add entries, but I am having a lot of trouble deleting entries.
I am trying to find a line that matches my search and delete it as well as the next 4 lines. I have a function to delete here:
def delete():
    del_name = raw_input("What is the first name of the person you would like to delete? ")
    with open("phonebook.txt", "r+") as f:
        deletelines = f.readlines()
        for i, line in enumerate(deletelines):
            if del_name in line:
                for l in deletelines[i+1:i+4]:
                    f.write(l)
This does not work.
How would I delete multiple entries from a text file like this?
Answering your direct question: you can use fileinput to easily alter a text file in-place:
import fileinput

file = fileinput.input('phonebook.txt', inplace=True)
for line in file:
    if word_to_find in line:  # word_to_find holds the name you searched for
        for _ in range(4):    # skip this line and the next 4 lines
            next(file, None)
    else:
        print line,
In order to avoid reading the entire file into memory, fileinput handles some things in the background for you: it moves your original file to a tempfile, writes the new file, and then deletes the tempfile.
Probably better answer: it looks like you have rolled a homemade serialization solution. Consider using a built-in library like csv, json, shelve, or even sqlite3 to persist your data in an easier-to-work-with format.
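As one example of that suggestion, here is a json-backed phone book sketch (the file path and field names are made up); deleting an entry becomes a single call instead of line bookkeeping:
import json

def load_book(path="phonebook.json"):
    try:
        with open(path) as f:
            return json.load(f)  # a dict mapping first name -> entry
    except IOError:
        return {}                # no file yet: start with an empty book

def save_book(book, path="phonebook.json"):
    with open(path, "w") as f:
        json.dump(book, f, indent=2)

book = load_book()
book["Sally"] = {"last": "Jones", "phone": "555-0199"}  # add or update an entry
book.pop("Harry", None)                                 # delete: one call, no line counting
save_book(book)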

Save Outfile with Python Loop in SPSS

OK, so I've been playing with Python and SPSS to achieve almost what I want. I am able to open the file and make the changes; however, I am having trouble saving the files (and those changes). What I have (using only one school in the schoollist):
begin program.
import spss, spssaux
import os

schoollist = ['brow']
for x in schoollist:
    school = 'brow'
    school2 = school + '06.sav'
    filename = os.path.join("Y:\...\Data", school2)  #In this instance, Y:\...\Data\brow06.sav
    spssaux.OpenDataFile(filename)

    #--This block makes the changes and is not particularly relevant to the question--#
    cur = spss.Cursor(accessType='w')
    cur.SetVarNameAndType(['name'], [8])
    cur.CommitDictionary()
    for i in range(cur.GetCaseCount()):
        cur.fetchone()
        cur.SetValueChar('name', school)
        cur.CommitCase()
    cur.close()

    #-- What am I doing wrong here? --#
    spss.Submit("save outfile = filename".)
end program.
Any suggestions on how to get the save outfile to work with the loop? Thanks. Cheers
In your save call, you are not resolving filename to its actual value. It should be something like this:
spss.Submit("""save outfile="%s".""" % filename)
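If the triple-quote and %s combination looks opaque, this is all it does (the path below is a made-up example, not your real one):
filename = "Y:\\Data\\brow06.sav"        # stand-in for the path built inside the loop
cmd = """save outfile="%s".""" % filename
print(cmd)                               # -> save outfile="Y:\Data\brow06.sav".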
I'm unfamiliar with spssaux.OpenDataFile and can't find any documentation on it (besides references to working with SPSS data files in Unicode mode). My guess at the problem is that it grabs the SPSS data file for use in the Python program block, but the file isn't actually opened for further submitted commands.
Here I make a test case that, instead of using spssaux.OpenDataFile to grab the file, does it all with SPSS commands and just inserts the necessary parts via Python. So first let's create some fake data to work with.
*Prepping the example data files.
FILE HANDLE save /NAME = 'C:\Users\andrew.wheeler\Desktop\TestPython'.
DATA LIST FREE / A .
BEGIN DATA
1
2
3
END DATA.
SAVE OUTFILE = "save\Test1.sav".
SAVE OUTFILE = "save\Test2.sav".
SAVE OUTFILE = "save\Test3.sav".
DATASET CLOSE ALL.
Now here is a pared-down version of what your code is doing. I have inserted the LIST ALL. command so you can check in the output that it is adding the variable of interest to the file.
*Sequentially open the data files and append the data name.
BEGIN PROGRAM.
import spss
import os

schoollist = ['1','2','3']
for x in schoollist:
    school2 = 'Test' + x + '.sav'
    filename = os.path.join("C:\\Users\\andrew.wheeler\\Desktop\\TestPython", school2)
    #opens the SPSS file and makes a new variable for the school name
    spss.Submit("""
GET FILE = "%s".
STRING Name (A20).
COMPUTE Name = "%s".
LIST ALL.
SAVE OUTFILE = "%s".
""" % (filename, x, filename))
END PROGRAM.
