Weird symbols / encoding showing up in output Python txt files - python

I'm having a frustrating issue outputting to text files from Python. The files appear perfectly normal when opened in a text editor, but I am uploading them into QDA Miner, a data analysis suite, and once they are uploaded into QDA Miner, this is what the text looks like:

"This problem really needs to be focused in a way that is particular to its cultural dynamics and tending in the industry,"

As you can see, many of these weird symbols (they render here as blanks) show up throughout the texts. The text that my Python script parses is initially an RTF file that I convert to plain text using OS X's built-in text editor.
Is there an easy way to remove these symbols? I am parsing single 100+ MB text files and separating them into thousands of separate articles, so I need a way to batch-convert them; otherwise it will be near impossible. I should also mention that the origin of these text files is text copied from web pages.
Here is some relevant code from the script I wrote:
test1 = filedialog.askopenfile()
newFolder = ((str(test1)[25:])[:-32])
folderCreate(newFolder)
masterFileName = newFolder + "/" + "MASTER_FILE"
masterOutput = open(masterFileName, "w")
edit = test1.readlines()
for i, line in enumerate(edit):
    for j in line.split():
        if j in ["Author", "Author:"]:
            try:
                outputFileName = "-".join(edit[i-2].lower().title().split()) + ".txt"
                output = open(newFolder + "/" + outputFileName, "w")  # create file with article name # backslashes changed to forward slashes for Windows
                print("File created - ", "-".join(edit[i-2].lower().title().split()))
                counter2 = counter2 + 1
            except:
                print("Filename error.")
                counter = counter + 1
                pass
            # Count number of words in each article
            wordCount = 0
            for word in edit[i+1].split():
                wordCount += 1
            fileList.append((outputFileName, str(wordCount)))
            # Now write to file
            output.write(edit[i-2])
            output.write("\n")
            author = line
            output.write(author)  # write article author
            output.write("\n")
            output.write("\n")
            content = edit[i+1]
            output.write(content)  # write article content
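Since the source text was copied from web pages, the invisible characters are most likely non-ASCII punctuation (curly quotes, non-breaking spaces, Unicode line separators) that QDA Miner's import encoding doesn't match. One batch-friendly approach is to normalize each article to plain ASCII before writing it out; a sketch (the replacement table is illustrative, not exhaustive):

```python
import unicodedata

# Common web-page characters that survive a naive RTF-to-text conversion.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u00a0": " ",                  # non-breaking space
    "\u2028": "\n",                 # Unicode line separator
}

def to_plain_ascii(text):
    """Replace known troublemakers, then drop anything still non-ASCII."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    # decompose accented characters (e.g. é -> e + combining accent),
    # then encode to ASCII, silently dropping what's left over
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")
```

You would call `to_plain_ascii(content)` just before each `output.write(...)`. If QDA Miner actually accepts UTF-8, an alternative is simply `open(..., "w", encoding="utf-8")` on every output file so the bytes on disk are unambiguous.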
Thanks

Related

Seeking and deleting elements in lists of a parsed file and saving result to another file

I have a large .txt file that is the result of a C file being parsed, containing various blocks of data, but about 90% of them are useless to me. I'm trying to get rid of them and then save the result to another file, but I'm having a hard time doing so. At first I tried to delete all the useless information in the unparsed file, but then it won't parse. My .txt file is built like this:
//Update: The files I'm trying to work on come from the pycparser module, which I found on GitHub.
File before being parsed looks like this:
And after using pycparser
file_to_parse = pycparser.parse_file(current_directory + r"\D_Out_Clean\file.d_prec")
I want to delete all blocks that start with the word Typedef. The module stores these in one big list that I can access via its `ext` attribute.
Currently my code looks like this:
len_of_ext_list = len(file_to_parse.ext)
i = 0
while i < len_of_ext_list:
    if 'TypeDecl' not in file_to_parse.ext[i]:
        print("NOT A TYPEDECL")
        print(file_to_parse.ext[i], type(file_to_parse.ext[i]))
        parsed_file_2 = open(current_directory + r"\Zadanie\D_Out_Clean_Parsed\clean_file.d_prec", "w+")
        parsed_file_2.write("%s%s\n" % ("", file_to_parse.ext[i]))
        parsed_file_2.close
        #file_to_parse_2 = file_to_parse.ext[i]
    i += 1
But the above code only saves the last FuncDef from the unparsed file, and I don't know how to change it.
So, now I'm trying to get rid of all typedefs in the parsed file, as they don't hold any valuable information for me. I want to know what function definitions and declarations are in the file, and what types of global variables are stored in the parsed file. Hope this is clearer now.
I suggest reading the entire input file into a string, and then doing a regex replacement:
with open(current_directory + r"\D_Out\file.txt", "r+") as file:
    with open(current_directory + r"\D_Out_Clean\clean_file.txt", "w+") as output:
        data = file.read()
        data = re.sub(r'type(?:\n\{.*?\}|[^;]*?;)\n?', '', data, flags=re.S)
        output.write(data)
Here is a regex demo showing that the replacement logic is working.
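Separately from the regex approach: the "only the last node is saved" symptom in the question comes from reopening the output with "w+" inside the loop, which truncates the file on every iteration. Opening it once, outside the loop, fixes that. A minimal sketch with plain strings standing in for the pycparser AST nodes:

```python
# Made-up stand-ins for the entries of file_to_parse.ext
nodes = ["Typedef size_t", "FuncDef main", "Typedef uint8", "FuncDef helper"]

# Keep everything that is not a typedef
kept = [node for node in nodes if "Typedef" not in node]

with open("clean_file.txt", "w") as out:  # opened once, outside the loop
    for node in kept:
        out.write("%s\n" % node)
```

The same pattern applies directly to the question's code: hoist the `open(...)` above the `while` loop (or use mode `"a"`), and remember that `parsed_file_2.close` without parentheses does not actually call `close()`.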

Python Writing huge string in text file with a new line character every 240 characters

I need to convert a word document into html code and then save it into a .txt file with lines of no longer than 100 characters (there's a process later on that won't pick up more than 255 characters if they're not in separate lines).
So far, I've successfully (though a better solution is welcome) managed to convert the .docx file into html and deploy that variable into a .txt file. However, I'm not able to figure out how to separate the lines. Is there any integrated function which could achieve this?
import mammoth

with open(r'C:\Users\uXXXXXX\Downloads\Test_Script.docx', "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any messages, such as warnings during conversion

with open(r'C:\Users\uXXXXXX\Downloads\Output.txt', 'w') as text_file:
    text_file.write(html)
In that case, you can just do
html = "..."
i = 100
while i < len(html):
    html = html[:i] + "\n" + html[i:]
    i += 101
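A slicing helper does the same thing without repeatedly rebuilding the string (width 100 per the question; the function name is illustrative):

```python
def chunk(text, width=100):
    """Join `width`-character slices of `text` with newlines."""
    return "\n".join(text[i:i + width] for i in range(0, len(text), width))
```

For example, `chunk("abcdef", width=2)` gives `"ab\ncd\nef"`. Note this counts raw characters, so it can split an HTML tag across lines; if the downstream process cares about tag boundaries, `textwrap.wrap` with whitespace-based breaking may be a better fit.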

Selective text using Python

I am a beginner in Python and I am using it for my master's thesis, so I don't know that much. I have a bunch of annual reports (in txt format) and I want to select all the text between "ITEM1." and "ITEM2.". I am using the re package. My problem is that sometimes, in those 10-Ks, there is a section called "ITEM1A.". I want the code to recognize this, stop at "ITEM1A.", and put in the output the text between "ITEM1." and "ITEM1A.". In the code I attached to this post, I tried to make it stop at "ITEM1A.", but it does not; it continues further because "ITEM1A." appears multiple times throughout the file. It would be ideal to make it stop at the first one it sees. The code is the following:
import os
import re
#path to where 10k are
saved_path = "C:/Users/Adrian PC/Desktop/Thesis stuff/10k abbot/python/Multiple 10k/saved files/"
#path to where to save the txt with the selected text between ITEM 1 and ITEM 2
selected_path = "C:/Users/Adrian PC/Desktop/Thesis stuff/10k abbot/python/Multiple 10k/10k_select/"
#get a list of all the items in that specific folder and put it in a variable
list_txt = os.listdir(saved_path)
for text in list_txt:
    file_path = saved_path + text
    file = open(file_path, "r+", encoding="utf-8")
    file_read = file.read()
    # looking between ITEM 1 and ITEM 2
    res = re.search(r'(ITEM[\s\S]*1\.[\w\W]*)(ITEM+[\s\S]*1A\.)', file_read)
    item_text_section = res.group(1)
    saved_file = open(selected_path + text, "w+", encoding="utf-8")  # save the file with the complete name
    saved_file.write(item_text_section)  # write the selected text to the new text file
    saved_file.close()  # close the file
    print(text)  # show the progress
    file.close()
If anyone has any suggestions on how to tackle this, it would be great. Thank you!
Try the following regex:
ITEM1\.([\s\S]*?)ITEM1A\.
Adding the question mark makes it non-greedy, so it will stop at the first occurrence.
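The non-greedy version can be verified on a small sample (the ITEM markers follow the question's 10-K layout; the sample text is made up):

```python
import re

sample = "ITEM1. Business overview here. ITEM1A. Risk factors. ITEM1A. Repeat. ITEM2."

# *? is lazy: it captures as little as possible, so the match
# ends at the FIRST "ITEM1A." rather than the last one.
match = re.search(r'ITEM1\.([\s\S]*?)ITEM1A\.', sample)
section = match.group(1)  # " Business overview here. "
```
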

Finding a heading in word file and copying entire paragraph thereafter to new word file with python

I have the following situation:
I have several hundred word files that contain company information. I would like to search these files for specific words to find specific paragraphs and copy just these paragraphs to new word files. Basically I just need to reduce the original couple hundred documents to a more readable size each.
The documents that I have are located in one directory and carry different names. In each of them I want to extract particular information that I need to define individually.
To go about this I started with the following code to first write all file names into a .csv file:
# list all transcript files and print names to .csv
import os
import csv

with open("C:\\Users\\Stef\\Desktop\\Files.csv", 'w') as f:
    writer = csv.writer(f)
    for path, dirs, files in os.walk("C:\\Users\\Stef\\Desktop\\Files"):
        for filename in files:
            writer.writerow([filename])
This works perfectly. Next I open Files.csv and edit the second column for the keywords that I need to search for in each document.
See picture below for how the .csv file looks:
CSV file
The couple hundred word files I have, are structured with different layers of headings. What I wanted to do now was to search for specific headings with the keywords I manually defined in the .csv and then copy the content of the following passage to a new file. I uploaded an extract from a word file, "Presentation" is a 'Heading 1' and "North America" and "China" are 'Heading 2'.
Word example
In this case I would like, for example, to search for the 'Heading 2' "North America" and then copy the text below it ("In total [...] diluted basis.") to a new word file that has the same name as the old one, just with "_clean.docx" added.
I started with my code as follows:
import os
import glob
import csv
import docx

os.chdir('C:\\Users\\Stef\\Desktop')
f = open('Files.csv')
csv_f = csv.reader(f)

file_name = []
matched_keyword = []
for row in csv_f:
    file_name.append(row[0])
    matched_keyword.append(row[1])

filelist = file_name
filelist2 = matched_keyword

for i, j in zip(filelist, filelist2):
    rootdir = 'C:\\Users\\Stef\\Desktop\\Files'
    doc = docx.Document(os.path.join(rootdir, i))
After this I was not able to find any working solution. I tried a few things but could not succeed at all. I would greatly appreciate further help.
I think the end should then look something like this, though I'm not quite sure:
output =
output.save(i +"._clean.docx")
Have considered the following questions and ideas:
Extracting MS Word document formatting elements along with raw text information
extracting text from MS word files in python
How can I search a word in a Word 2007 .docx file?
Just figured out something similar for myself, so here is a complete working example for you. There might be a more pythonic way of doing it…
from docx import Document

inputFile = 'soTest.docx'
try:
    doc = Document(inputFile)
except:
    print(
        "There was some problem with the input file.\nThings to check…\n"
        "- Make sure the file is a .docx (with no macros)"
    )
    exit()

outFile = inputFile.split("/")[-1].split(".")[0] + "_clean.docx"

strFind = 'North America'
# paraOffset used in the event the paragraphs are not adjacent
paraOffset = 1

# doc.paragraphs returns a list of paragraph objects
paras = doc.paragraphs
# use the list index to find the paragraph immediately after the known string;
# keep a list of found paras, in the event there is more than 1 match
parasFound = [paras[index + paraOffset]
              for index in range(len(paras))
              if (paras[index].text == strFind)]

# Add paras to new document
docOut = Document()
for para in parasFound:
    docOut.add_paragraph(para.text)
docOut.save(outFile)
exit()
I've also added an image of the input file, showing that North America appears in more than one place.
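The index-offset idea in the answer generalizes beyond python-docx: given an ordered list of (style, text) pairs (made-up stand-ins for `doc.paragraphs` here), collect the paragraph after each matching heading. With real documents you would compare `paragraph.text` as above, or `paragraph.style.name` if the headings use Word's built-in styles:

```python
# Hypothetical sample data mirroring the question's Word extract
paras = [
    ("Heading 1", "Presentation"),
    ("Heading 2", "North America"),
    ("Normal", "In total revenue grew on a diluted basis."),
    ("Heading 2", "China"),
    ("Normal", "Sales in China were flat."),
]

def body_after(paras, heading_text):
    # text of the paragraph immediately following each matching heading
    return [paras[i + 1][1]
            for i in range(len(paras) - 1)
            if paras[i][1] == heading_text]
```
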

Break a large text file into separate files using a double carriage return

I’m using Python 2.7 with Windows 7. I have a single large text file that I want to break into several smaller files. The format of the file currently looks like this . . .
Double carriage return
Header line
Body (consisting of several lines)
Double carriage return
Header line
Body (consisting of several lines)
I want to create separate text files using the Header line as the file name and the Body as the file content. The Double carriage return identifies the start of a new file.
I’ve searched Stack Overflow but haven’t found what I’m looking for. I’m very new to Python so any help would be much appreciated.
The code I have so far is . . .
fh = open(path/file.txt)
data = fh.read()
doc = re.split(r'[\r\n\r\n]', data)
for para in doc:
    header = re.search('^[1-9].+Chapter', para)
    filename = str(header) + ".txt"
    fwrite = open(filename, "w")
    fwrite.write(para)
    fwrite.close()
I'd like to use the first line as the text file title.
The first line does not open the file properly; this should work, assuming everything else exists. Best practice is to keep the file opening in a try/except block.
fh = open('path/file.txt', 'r')
data = fh.read()
doc = re.split(r'[\r\n\r\n]', data)
for para in doc:
    header = re.search('^[1-9].+Chapter', para)
    filename = str(header) + ".txt"
    fwrite = open(filename, "w")
    fwrite.write(para)
    fwrite.close()
The argument to open is a quoted string; you omitted the quotes.
Your code will needlessly pull the entire file into memory -- this is obviously not a problem with small files, but needlessly restricts your program. If there is no need to analyze the lines together, it is better to read one at a time into memory, and then forget it after writing it out again.
Your code hard-codes DOS carriage returns, which is not only tasteless...
Your code does not enforce the requirement that the first line after the separator has to contain the chapter title. If this is not a hard requirement, the replacement code will need some changes. I figured it's better to alert and abort than pull stuff from further down in the file which just happens to match; but with the refactored code, the latter approach isn't really even feasible.
with open('path/file', 'Ur') as input:
    output = None
    for line in input:
        if output is None:
            if 'Chapter' in line and line[0:1].isdigit():
                output = open('.'.join([line.rstrip(), 'txt']), 'w')
            else:
                raise ValueError(
                    'First line in paragraph is not chapter header: '
                    '{}'.format(line.rstrip()))
        elif line == '\n':
            output.close()
            output = None
            continue
        output.write(line)
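The splitting logic can also be checked in isolation. Note that the question's `[\r\n\r\n]` is a character class (it matches any single `\r` or `\n`), not the intended blank-line sequence; `r'\r?\n\r?\n'` is what was meant. A minimal sketch using the question's chapter-header pattern on made-up data:

```python
import re

text = "1 Chapter One\nbody A\nbody B\n\n2 Chapter Two\nbody C\n"

# split on blank lines, tolerating DOS line endings
paras = re.split(r'\r?\n\r?\n', text)

sections = {}
for para in paras:
    header, _, body = para.partition('\n')
    if re.match(r'^[1-9].+Chapter', header):
        sections[header + '.txt'] = body  # filename -> file content
```
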
