I am a beginner in Python and I am using it for my master's thesis, so I don't know that much. I have a bunch of annual reports (10-K filings in txt format) and I want to select all the text between "ITEM1." and "ITEM2.". I am using the re module. My problem is that sometimes these 10-Ks contain a section called "ITEM1A.". I want the code to recognize this, stop at "ITEM1A.", and output the text between "ITEM1." and "ITEM1A.". In the code attached to this post, I tried to make it stop at "ITEM1A.", but it does not; it continues further, because "ITEM1A." appears multiple times throughout the file. It would be ideal to make it stop at the first occurrence. The code is the following:
import os
import re

# path to where the 10-Ks are
saved_path = "C:/Users/Adrian PC/Desktop/Thesis stuff/10k abbot/python/Multiple 10k/saved files/"
# path to where to save the txt with the selected text between ITEM 1 and ITEM 2
selected_path = "C:/Users/Adrian PC/Desktop/Thesis stuff/10k abbot/python/Multiple 10k/10k_select/"

# get a list of all the items in that specific folder
list_txt = os.listdir(saved_path)

for text in list_txt:
    file_path = saved_path + text
    file = open(file_path, "r+", encoding="utf-8")
    file_read = file.read()
    # looking between ITEM 1 and ITEM 2
    res = re.search(r'(ITEM[\s\S]*1\.[\w\W]*)(ITEM+[\s\S]*1A\.)', file_read)
    item_text_section = res.group(1)
    saved_file = open(selected_path + text, "w+", encoding="utf-8")  # save the file under its complete name
    saved_file.write(item_text_section)  # write the selected text to the new file
    saved_file.close()  # close the file
    print(text)  # show the progress
    file.close()
If anyone has any suggestions on how to tackle this, it would be great. Thank you!
Try the following regex:
ITEM1\.([\s\S]*?)ITEM1A\.
Adding the question mark makes the quantifier non-greedy, so it will stop at the first occurrence.
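A quick illustration with a made-up snippet (the text here is hypothetical, not from an actual 10-K):

```python
import re

# hypothetical stand-in for a 10-K where "ITEM1A." appears more than once
text = "ITEM1. Business info here ITEM1A. Risk factors ... ITEM1A. appears again"

# greedy: [\s\S]* runs on to the LAST "ITEM1A." in the text
greedy = re.search(r'ITEM1\.([\s\S]*)ITEM1A\.', text)

# non-greedy: [\s\S]*? stops at the FIRST "ITEM1A."
lazy = re.search(r'ITEM1\.([\s\S]*?)ITEM1A\.', text)

print(greedy.group(1))  # " Business info here ITEM1A. Risk factors ... "
print(lazy.group(1))    # " Business info here "
```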
I have a problem. I want to read this file
https://cdn.discordapp.com/attachments/852226751832653864/870341903923695626/map.txt
and find all "targetname" lines, and then find out what comes after each targetname. For example,
"targetname" "rope01"
I want to extract "rope01", and in the .txt file there are multiple such substrings. Then I want to manipulate rope01. How can I do it?
import re

value = '"targetname" "test"'
text = value[13:]
print(text)

# map_itself = open('map.txt', 'r')
with open('map.txt', 'rb') as map_itself:
    for line in map_itself:
        if '"targetname"'.encode() in map_itself:
            print("I found it!")
This code is pretty stupid and it can't even find any "targetname", even though they do exist in the .txt (use Ctrl+F to find "targetname", since it's not at the top).
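One likely reason it never matches: the `in` test checks `map_itself` (the file object) rather than `line`. A minimal sketch of one way to pull all the values out with a regex instead, using a made-up sample in place of map.txt:

```python
import re

# hypothetical sample standing in for the contents of map.txt
sample = '''
"classname" "move_rope"
"targetname" "rope01"
"classname" "keyframe_rope"
"targetname" "rope02"
'''

# capture whatever sits inside the quotes after "targetname"
names = re.findall(r'"targetname"\s+"([^"]+)"', sample)
print(names)  # ['rope01', 'rope02']
```

With the real file, you would read it with `open('map.txt', encoding='utf-8')` and run `re.findall` on its contents.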
I'm new to coding. I have about 50 notepad files, which are transcripts with timestamps, that I want to search for certain words or phrases. Ideally I would input a word or phrase, and the program would find every instance of its occurrence and print it along with its timestamp. The lines in my notepad files look like this:
"then he went to the office
00:03
but where did he go afterwards?
00:06
to the mall
00:08"
I just want a starting point; I have fun figuring things out on my own, but I need somewhere to begin.
A few questions I have:
How can I ensure that the user input will not be case sensitive?
How can I open many text files systematically with a loop? I have titled the text files {1, 2, 3, 4, ...} to try to make this easy to implement, but I still don't know how to do it.
from os import listdir
import re

def find_word(text, search):
    # \b on both sides forces an exact-word match
    result = re.findall('\\b' + search + '\\b', text, flags=re.IGNORECASE)
    return len(result) > 0

# please put all the txt files in one folder (in my case it was D:/onr/1738349/txt/)
search = 'word'  # or the phrase you want to search for
search = search.lower()  # converted to lower case

result_folder = "D:/onr/1738349/result.txt"  # change these accordingly; I put it this way so you can understand
transcript_folder = "D:/onr/1738349/txt"  # all the transcript files go here

# the result file should not sit in the txt folder where you keep the transcripts
with open(result_folder, "w") as f:
    for filename in listdir(transcript_folder):  # the folder that contains all the txt files (50, as you said)
        with open(transcript_folder + '/' + filename) as currentFile:
            i = 0
            for line in currentFile:
                line = line.lower()  # since you want to omit case-sensitivity
                i = i + 1
                if find_word(line, search):  # exact match, so 'word' and 'testword' are different
                    f.write('Found in (' + filename[:-4] + '.txt) at line number: (' + str(i) + ') ')
                    # the timestamp sits on the line right after the match; read it exactly once
                    timestamp = next(currentFile, '').rstrip()
                    i = i + 1
                    if timestamp != '':
                        f.write('time: (' + timestamp + ') \n')
Make sure you follow the comments.
(1) create result.txt
(2) keep all the transcript files in a folder
(3) make sure (1) and (2) are not in the same folder
(4) The directory paths will look different if you are using a Unix-based system (mine is Windows)
(5) Just run this script after making the suitable changes; it will take care of the rest (it will find all the transcript files and collect the results into a single file for your convenience)
The output (in result.txt) would be:
Found in (fileName.txt) at line number: (#lineNumber) time: (Time)
.......
How can I ensure that the user input will not be case sensitive?
As soon as you open a file, convert the whole thing to capital letters. Similarly, when you accept user input, immediately convert it to capital letters.
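A tiny sketch of that idea (the strings here are placeholders for the real user input and file contents):

```python
# normalize both sides once; after that, every comparison is case-insensitive
user_input = "Office"                # stands in for input("Search for: ")
line = "then he went to the OFFICE"  # stands in for a line read from a file

if user_input.upper() in line.upper():
    print("match")  # prints: match
```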
How can I open many text files systematically with a loop?
import os

Files = os.listdir(FolderFilepath)  # FolderFilepath is the folder holding your files
for file in Files:
    # open and search each file here
    ...
I have titled the text files {1,2,3,4...} to try to make this easy to implement but I still don't know how to do it.
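Since the files are named 1, 2, 3, 4, ..., one way (sketched here with a throwaway temp folder standing in for your transcript folder) is to sort the directory listing numerically before opening each file; a plain alphabetical sort would put 10.txt before 2.txt:

```python
import os
import tempfile

# hypothetical setup: three numbered transcript files in a temp folder
folder = tempfile.mkdtemp()
for name in ["1.txt", "2.txt", "10.txt"]:
    with open(os.path.join(folder, name), "w", encoding="utf-8") as fh:
        fh.write("sample transcript " + name)

# sort on the number in the name, not on the name itself
names = sorted(
    (f for f in os.listdir(folder) if f.endswith(".txt")),
    key=lambda f: int(os.path.splitext(f)[0]),
)
print(names)  # ['1.txt', '2.txt', '10.txt']

for name in names:
    with open(os.path.join(folder, name), encoding="utf-8") as fh:
        text = fh.read().upper()  # normalized once for case-insensitive search
```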
Have fun.
Maybe these thoughts are helpful for you:
First, I would try to find a way to get hold of all the files I need (even if it's only their paths).
Filtering can be done with regex (there are websites where you can test your regex constructs, like regex101, which I found super helpful when I first looked at regex).
Then I would mess around with collecting all the timestamps/places etc. to check that you are filtering correctly.
Have fun!
Here is code which works the way you expect:

searched_text = input("Type searched text: ")

with open("file.txt", "r") as file:
    lines = file.readlines()

for i in range(len(lines)):
    lines[i] = lines[i].replace("\n", "")

# the timestamp sits on the line right after the matching line
for i in range(len(lines) - 1):
    if searched_text in lines[i]:
        print(searched_text + " appeared in " + str(lines[i]) + " at datetime " + str(lines[i + 1]))
I have the following situation:
I have several hundred word files that contain company information. I would like to search these files for specific words to find specific paragraphs and copy just these paragraphs to new word files. Basically I just need to reduce the original couple hundred documents to a more readable size each.
The documents that I have are located in one directory and carry different names. In each of them I want to extract particular information that I need to define individually.
To go about this I started with the following code to first write all file names into a .csv file:
# list all transcript files and print names to .csv
import os
import csv

with open("C:\\Users\\Stef\\Desktop\\Files.csv", 'w') as f:
    writer = csv.writer(f)
    for path, dirs, files in os.walk("C:\\Users\\Stef\\Desktop\\Files"):
        for filename in files:
            writer.writerow([filename])
This works perfectly. Next I open Files.csv and edit the second column for the keywords that I need to search for in each document.
See picture below for how the .csv file looks:
CSV file
The couple hundred Word files I have are structured with different layers of headings. What I wanted to do now was to search for specific headings with the keywords I manually defined in the .csv, and then copy the content of the following passage to a new file. I uploaded an extract from a Word file: "Presentation" is a 'Heading 1', and "North America" and "China" are 'Heading 2's.
Word example
In this case I would like, for example, to search for the 'Heading 2' "North America" and then copy the text below it ("In total [...] diluted basis.") to a new Word file that has the same name as the old one, just with "_clean.docx" appended.
I started with my code as follows:
import os
import glob
import csv
import docx

os.chdir('C:\\Users\\Stef\\Desktop')

f = open('Files.csv')
csv_f = csv.reader(f)

file_name = []
matched_keyword = []

for row in csv_f:
    file_name.append(row[0])
    matched_keyword.append(row[1])

filelist = file_name
filelist2 = matched_keyword

for i, j in zip(filelist, filelist2):
    rootdir = 'C:\\Users\\Stef\\Desktop\\Files'
    doc = docx.Document(os.path.join(rootdir, i))
After this I was not able to find any working solution. I tried a few things but could not succeed at all. I would greatly appreciate further help.
I think the end should then again look something like this, however not quite sure.
output =
output.save(i +"._clean.docx")
Have considered the following questions and ideas:
Extracting MS Word document formatting elements along with raw text information
extracting text from MS word files in python
How can I search a word in a Word 2007 .docx file?
Just figured out something similar for myself, so here is a complete working example for you. There might be a more pythonic way of doing it…
from docx import Document

inputFile = 'soTest.docx'

try:
    doc = Document(inputFile)
except:
    print(
        "There was some problem with the input file.\nThings to check…\n"
        "- Make sure the file is a .docx (with no macros)"
    )
    exit()

outFile = inputFile.split("/")[-1].split(".")[0] + "_clean.docx"
strFind = 'North America'

# paraOffset is used in the event the paragraphs are not adjacent
paraOffset = 1

# doc.paragraphs returns a list of paragraph objects;
# use the list index to find the paragraph immediately after the known string,
# and keep a list of found paras in the event there is more than 1 match
paras = doc.paragraphs
parasFound = [paras[index + paraOffset]
              for index in range(len(paras))
              if (paras[index].text == strFind)]

# add the found paras to a new document
docOut = Document()
for para in parasFound:
    docOut.add_paragraph(para.text)
docOut.save(outFile)
exit()
I've also added an image of the input file, showing that North America appears in more than one place.
Sorry if this is a dumb lump of questions, but I had a couple of things I was hoping to ask about. Basically, what I am trying to do is take a file where a bunch of data that is supposed to be on separate lines gets clumped all together, sort through it, and print each statement on its own line. What I don't know is how to create a new document for everything to be dumped into, nor how to print to that document so that each item lands on its own line.
I've decided to try and tackle this task while using Regular Expressions and Python. I want my code to look for any of four specific strings (MTH|, SCN|, ENG|, or HST|) and copy everything after it UNTIL it runs into one of those four strings again. At that point I need it to stop, record everything it copied, and then start copying the new string. I need to make it read past new lines and ignore them, which I hope to accomplish with
re.DOTALL
Basically, I want my code to take something like this:
MTH|stuffstuffstuffSCN|stuffstuffstuffENG|stuffstuffstuffHST|stuffstu
ffstuffSCN|stuffstuffstuffENG|stuffstuffstuffHST|stuffstuffstuffMTH|s
tuffstuffstuffSCN|stuffstuffstuffENG|stuffstuffstuff
And turn into something nice and readable like this:
MTH|stuffstuffstuff
SCN|stuffstuffstuff
ENG|stuffstuffstuff
HST|stuffstuffstuff
SCN|stuffstuffstuff
ENG|stuffstuffstuff
HST|stuffstuffstuff
MTH|stuffstuffstuff
SCN|stuffstuffstuff
ENG|stuffstuffstuff
While also creating a new document and pasting it all in that .txt file. My code looks like this so far:
import re
re.DOTALL
from __future__ import print_function

NDoc = raw_input("Enter name of to-be-made document")
log = open("C:\Users\XYZ\Desktop\Python\NDoc.txt", "w")
# Need help with this^ How do I make a new file instead of opening a file?
nl = list()
file = raw_input("Enter a file to be sorted")
xfile = open(file)
for line in xfile:
    l = line.strip()
    n = re.findall('^([MTH|SCN|ENG|HST][|].)$[MTH|SCN|ENG|HST][|]', l)
    # Edited out some x's here that I left in, sorry
    if len(n) > 0:
        nl.append(n)
for item in nl:
    print(item, file=log)
In the starting file, stuffstuffstuff can be number, letters, and various symbols (including | ), but no where except where they are supposed to be will MTH| SCN| ENG| HST| occur, so I want to look specifically for those 4 strings as my starts and ends.
Aside from being able to create a new document and paste into it on separate lines for each item in the list, will the above code accomplish what I am trying to do? Can I scan .txt files and Excel files? I don't have a file to test it on until Friday, but I am supposed to have it mostly done by then.
Oh, also, to do things like:
import re
re.DOTALL
from __future__ import print_function
do I have to set up anything external? Are these add-ons or things I need to import, or are they all just built into Python?
This regex will take your string and put newlines in between each string you wanted to separate:
re.sub("(\B)(?=((MTH|SCN|ENG|HST)[|]))","\n\n",line)
Here is the code I was testing with:
from __future__ import print_function
import re

# NDoc = raw_input("Enter name of to-be-made document")
# log = open("C:\Users\XYZ\Desktop\Python\NDoc.txt", "w")
# Need help with this^ How do I make new file instead of opening a file?
# nl = list()
# file = raw_input("Enter a file to be sorted")
xfile = open("file2")
for line in xfile:
    l = line.strip()
    n = re.sub("(\B)(?=((MTH|SCN|ENG|HST)[|]))", "\n\n", line)
    # Edited out some x's here that I left in, sorry
    if len(n) > 0:
        nl = n.split("\n")
        for item in nl:
            print(item)
I've tested this version with input data that has no newlines. I also have a version that works with newlines. If this doesn't work, let me know and I'll post that version.
The main environmental changes I made are that I'm reading from a file named "file2" in the same directory as the python script and I'm just writing the output to the screen.
This version assumes that there are newlines in your data and just reads the whole file in:
from __future__ import print_function
import re
#NDoc = raw_input("Enter name of to-be-made document")
#log = open("C:\Users\XYZ\Desktop\Python\NDoc.txt", "w")
#Need help with this^ How do I make new file instead of opening a file?
#nl = list()
#file = raw_input("Enter a file to be sorted")
xfile = open("file")
line = xfile.read()
l = line.strip()
l = re.sub("\n", "", l)
n = re.sub("(\B)(?=((MTH|SCN|ENG|HST)[|]))", "\n\n", l)
print(n)
OK, so I've been playing with Python and SPSS to achieve almost what I want. I am able to open the file and make the changes; however, I am having trouble saving the files (and those changes). What I have (using only one school in schoollist):
begin program.
import spss, spssaux
import os

schoollist = ['brow']

for x in schoollist:
    school = 'brow'
    school2 = school + '06.sav'
    filename = os.path.join("Y:\...\Data", school2)  # in this instance, Y:\...\Data\brow06.sav
    spssaux.OpenDataFile(filename)

    #--This block makes the changes and is not particularly relevant to the question--#
    cur = spss.Cursor(accessType='w')
    cur.SetVarNameAndType(['name'], [8])
    cur.CommitDictionary()
    for i in range(cur.GetCaseCount()):
        cur.fetchone()
        cur.SetValueChar('name', school)
        cur.CommitCase()
    cur.close()

    #-- What am I doing wrong here? --#
    spss.Submit("save outfile = filename".)
end program.
Any suggestions on how to get the save outfile to work with the loop? Thanks. Cheers
In your save call, you are not resolving filename to its actual value. It should be something like this:
spss.Submit("""save outfile="%s".""" % filename)
I'm unfamiliar with spssaux.OpenDataFile and can't find any documentation on it (besides references to working with SPSS data files in unicode mode). But my guess at the problem is that it grabs the SPSS data file for use in the Python program block, but the file isn't actually opened for further submitted commands.
Here I make a test case that, instead of using spssaux.OpenDataFile to grab the file, does it all with SPSS commands and just inserts the necessary parts via Python. So first let's create some fake data to work with.
*Prepping the example data files.
FILE HANDLE save /NAME = 'C:\Users\andrew.wheeler\Desktop\TestPython'.
DATA LIST FREE / A .
BEGIN DATA
1
2
3
END DATA.
SAVE OUTFILE = "save\Test1.sav".
SAVE OUTFILE = "save\Test2.sav".
SAVE OUTFILE = "save\Test3.sav".
DATASET CLOSE ALL.
Now here is a pared-down version of what your code is doing. I have inserted a LIST ALL. command so you can check in the output that the variable of interest is being added to each file.
*Sequential opening of the data files and appending the data name.
BEGIN PROGRAM.
import spss
import os

schoollist = ['1', '2', '3']

for x in schoollist:
    school2 = 'Test' + x + '.sav'
    filename = os.path.join("C:\\Users\\andrew.wheeler\\Desktop\\TestPython", school2)
    # opens the SPSS file and makes a new variable for the school name
    spss.Submit("""
    GET FILE = "%s".
    STRING Name (A20).
    COMPUTE Name = "%s".
    LIST ALL.
    SAVE OUTFILE = "%s".
    """ % (filename, x, filename))
END PROGRAM.