Cleaning text for multiple files in Python

I am working on a script to clean up a .txt file, create a list of words, count the frequencies of unique words, and output a .csv file with the frequencies. I would like to open multiple files and combine them to still output a single .csv file.
Would it be more efficient to write code that would combine the text across the .txt files first or to read/clean all the unique files and combine the lists/dictionaries afterwards? What would the syntax look like for the optimal scenario?
I have been trying to research it on my own but have very limited coding skills and can't seem to find an answer that fits my specific question. I appreciate any and all input. Thanks!
import re
import string
import csv
from collections import Counter

filename = 'testtext.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

words = re.split(r'\W+', text)
words = [word.lower() for word in words]

table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]

countlist = Counter(stripped)

w = csv.writer(open("testtext.csv", "w"))
for key, val in countlist.items():
    w.writerow([key, val])

If you want to count the word frequencies for multiple files and output them into one CSV file, you won't need to change much in your code; just add a loop, e.g.:
import re
import string
from collections import Counter
import csv
files = ['testtext.txt', 'testtext2.txt', 'testtext3.txt']
stripped = []
for filename in files:
    file = open(filename, 'rt')
    text = file.read()
    file.close()
    words = re.split(r'\W+', text)
    words = [word.lower() for word in words]
    table = str.maketrans('', '', string.punctuation)
    stripped += [w.translate(table) for w in words]  # concatenate the parsed data

countlist = Counter(stripped)
w = csv.writer(open("testtext.csv", "w"))
for key, val in countlist.items():
    w.writerow([key, val])
I don't know if this is the most optimal way of doing it.
It will depend on factors such as how big the files are, how many files you want to parse, and how frequently you want to parse x files of y size.
When you have figured that out, you can start thinking of ways to optimize the process.
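For instance, one possible refinement is to update a single Counter per file instead of accumulating one large list, so memory grows with the vocabulary rather than the total word count. A minimal sketch, assuming the same file names as above:
import re
import string
import csv
from collections import Counter

files = ['testtext.txt', 'testtext2.txt', 'testtext3.txt']
table = str.maketrans('', '', string.punctuation)  # built once, outside the loop
countlist = Counter()
for filename in files:
    with open(filename, 'rt') as file:
        text = file.read()
    # count each file's cleaned words directly; no intermediate global list
    countlist.update(w.translate(table) for w in re.split(r'\W+', text.lower()))

with open('testtext.csv', 'w', newline='') as out:
    csv.writer(out).writerows(countlist.items())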

If you need to calculate the frequency, it's better to combine the strings from the multiple .txt files first. To gauge performance, you can record the time at the beginning and at the end of the processing.
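For example, a minimal timing sketch along those lines (the processing step is a placeholder):
from datetime import datetime

start = datetime.now()  # timestamp at the beginning
# ... read, clean, and count the combined text here ...
print('processing took', datetime.now() - start)  # timestamp at the end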

Related

Replace only exact string

I am trying to replace only exact strings in a csv file, using another csv file as a dictionary.
This is my code:
import re
text = open("input.csv", "r", encoding="ISO-8859-1")
replacelist = open("replace.csv", "r", encoding="ISO-8859-1").readlines()
for r in replacelist:
    r = r.split(",")
    text = ''.join([i for i in text]).replace(r[0], r[1])
    print({r[0]})
    print({r[1]})
x = open("new.csv", "w")
x.writelines(text)
x.close()
Is it possible to use the replace method to replace only exact string matches? Or should I import re and use re.sub() instead of replace?
input.csv example
ciao123;xxxxx;0
ciao12345;xxzzx;2
replace.csv example
ciao123,ok
aaaa,no
bbb,cc
Only first line in input.csv should be replaced.
Well, as per your comments, your task is much simpler than that, and you don't need to play with regex at all!
Basically, you are trying to replace something in a csv column only if it is an exact match. If that is the case, you should not be treating the file as raw text; treat it as column data.
If you do so, you could use one example like below:
text = open("input.csv", "r", encoding="ISO-8859-1").readlines()
replacelist = open("replace.csv","r", encoding="ISO-8859-1").readlines()
# make a replace word dictionary with O(n) time complexity
replace_data = {i.split(',')[0]: i.split(',')[1] for i in replacelist}
# Now treat data in input.csv as tabular data to replace the words
# Start another loop of O(n) time complexity
for idx, line in enumerate(text):
line_lis = line.split(';')
if line_lis[0] in replace_data:
only replace word if it is meant to be replaced
line_lis[0] = replace_data.get(line_lis[0])
text[idx] = ';'.join(line_lis)
# write results
with open("new.csv","w") as f:
f.writelines(text)
Result would be as:
ok;xxxxx;0
ciao12345;xxzzx;2
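For what it's worth, since both files are delimited, the csv module could do the splitting as well; a hedged sketch, assuming the same file names, encoding, and delimiters as above:
import csv

# build the replacement dictionary from the comma-delimited replace.csv
with open('replace.csv', encoding='ISO-8859-1', newline='') as f:
    replace_data = dict(csv.reader(f))

# rewrite the first column of the semicolon-delimited input.csv on exact matches
with open('input.csv', encoding='ISO-8859-1', newline='') as fin, \
     open('new.csv', 'w', encoding='ISO-8859-1', newline='') as fout:
    writer = csv.writer(fout, delimiter=';')
    for row in csv.reader(fin, delimiter=';'):
        row[0] = replace_data.get(row[0], row[0])
        writer.writerow(row)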

How to check how many times a word appears in a text file

I've seen a few people ask how this would be done, but their questions were 'too broad', so I decided to find out how to do it. I've posted how below.
So to do this, first you must open the file (assuming you have a text file called 'text.txt'). We do this by calling the open function.
file = open('text.txt', 'r')
The open function uses the syntax: open(file, mode)
The file is the text document, and the mode is how it's opened ('r' means read-only). The read function reads the file, then split separates each of the words into a list object. Lastly, we use the count function to find how many times the word appears.
word = input('word: ')
print(file.read().split().count(word))
And there you have it, counting words in a text file!
Word counts can be tricky. At a minimum, one would like to avoid differences in capitalization and punctuation. A simple way to take the next step in word counts is to use regular expressions and to convert its resulting words to lower case before we do the count. We could even use collections.Counter and count all of the words.
import re
# `word_finder(somestring)` emits all words in string as list
word_finder = re.compile(r'\w+').findall
filename = input('filename: ')
word = input('word: ')
# remove case for compare
lword = word.lower()
# `word_finder` emits all of the words excluding punctuation
# `filter` removes the lower cased words we don't want
# `len` counts the result
count = len(list(filter(lambda w: w.lower() == lword,
                        word_finder(open(filename).read()))))
print(count)
# we could go crazy and count all of the words in the file
# and do it line by line to reduce memory footprint.
import collections
import itertools
from pprint import pprint
word_counts = collections.Counter(itertools.chain.from_iterable(
    word_finder(line.lower()) for line in open(filename)))
pprint(word_counts)
Splitting on whitespace isn't sufficient -- split on everything you're not counting and get your case under control:
import re
import sys
file = open(sys.argv[1])
word = sys.argv[2]
print(re.split(r"[^a-z]+", file.read().casefold()).count(word.casefold()))
You can add apostrophes to the inverted pattern [^a-z'] or whatever else you want to include in your count.
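For instance, a small variation of the snippet above that keeps contractions such as "don't" intact:
import re
import sys

# same approach as above, with an apostrophe added to the inverted class
file = open(sys.argv[1])
word = sys.argv[2]
print(re.split(r"[^a-z']+", file.read().casefold()).count(word.casefold()))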
Hogan: Colonel, you're asking and answering your own questions. That's tops in German efficiency.
def words_frequency_counter(filename):
    """Print how many times the word appears in the text."""
    try:
        with open(filename) as file_object:
            contents = file_object.read()
    except FileNotFoundError:
        pass
    else:
        word = input("Give me a word: ")
        print("'" + word + "'" + ' appears ' +
              str(contents.lower().count(word.lower())) + ' times.\n')
First, you want to open the file. Do this with:
your_file = open('file.txt', 'r')
Next, you want to count the word. Let's set your word as 'brian' under the variable life. No reason.
life = 'brian'
your_file.read().split().count(life)
What that does is read the file, split it into individual words, and count the instances of the word 'brian'. Hope this helps!

How to track number of occurrences with set.union()

So, I'm looping through a bunch of docs creating a list of all unique words and sequential groups of words in the documents (obviously, the strings I'm looking at are pretty short).
globallist = []
for filename in glob.glob(os.path.join(path, '*.html')):
    mystr = "some text I want"
    stuff = re.sub(r"[^\w]", " ", mystr).split()
    wordlist = [' '.join(stuff[i:j]) for i in range(len(stuff)) for j in range(i+1, len(stuff)+1)]
    globallist = set.union(set(globallist), set(wordlist))
I want to track occurrences with globallist as I go so that at the end I'll already have a count of how many documents contain each string in the list. I plan to delete any element that only occurs in one document. What's the best way to do this?
The script below should help to give you some ideas.
You are attempting to parse HTML files, so ideally you need to extract only the text from each file without any of the HTML markup. This can be done using a library such as BeautifulSoup. Next it is best to lowercase all of the words to ensure you catch words using different case. Python's collections.Counter can be used to count all of the words, and from that a list containing only words with a count of one can be constructed. Finally a count of your phrases can be made.
All of this information can then be stored on a per file basis into file_stats. The results are then displayed at the end.
From that, you will be able to see how many of the documents contain the text you were looking for.
from bs4 import BeautifulSoup
import collections
import glob
import re
import os
path = r'mypath'
file_stats = []
search_list = ['some text I want', 'some other text']
search_list = [phrase.lower() for phrase in search_list] # Ensure list is all lowercase
for filename in glob.glob(os.path.join(path, '*.html')):
    with open(filename, 'r') as f_input:
        html = f_input.read()
    soup = BeautifulSoup(html, 'html.parser')
    # Remove style and script sections from the HTML
    for script in soup(["style", "script"]):
        script.extract()
    # Extract all text
    text = soup.get_text()
    # Create a word list in lowercase
    word_list = [word.lower() for word in re.sub(r"[^\w]", " ", text).split()]
    # Search for matching phrases
    phrase_counts = dict()
    text = ' '.join(word_list)
    for search in search_list:
        phrase_counts[search] = text.count(search)
    # Calculate the word counts
    word_counts = collections.Counter(word_list)
    # Filter unique words
    unique_words = sorted(word for word, count in word_counts.items() if count == 1)
    # Create a list of unique words and phrase matches for each file
    file_stats.append([filename, unique_words, phrase_counts])

# Display the results for all files
for filename, unique_words, phrase_counts in file_stats:
    print('{:30} {}'.format(filename, unique_words))
    for phrase, count in phrase_counts.items():
        print('    {} : {}'.format(phrase, count))
Create a set of words for each document, and update a collections.Counter with the per-file words. The set per file avoids counting words more than once per file, while the Counter seamlessly sums across files. For a super simple example counting individual words (without tracking which file they came from):
from collections import Counter
totals = Counter()
for file in allfiles:
    with open(file) as f:
        totals.update(set(f.read().split()))
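To get the asker's final step (dropping strings that occur in only one document), a one-line follow-up on the totals built above could look like this:
# keep only the words that appeared in more than one document
repeated = {word: count for word, count in totals.items() if count > 1}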

Python character and word counts

I'm a beginner with Python and would like to know how to use two .txt files to count the characters as well as find the 10 most common characters, and also how to convert all characters in the files to lower case and eliminate all characters other than a-z.
Here's what I've tried and had no luck with:
from string import ascii_lowercase
from collections import Counter
with open('document1.txt', 'document2.txt') as f:
    print(Counter(letter for line in f
                  for letter in line.lower()
                  if letter in ascii_lowercase))
Try it like this:
>>> from collections import Counter
>>> import re
>>> words = re.findall(r'\w+', "{} {}".format(open('your_file1').read().lower(), open('your_file2').read().lower()))
>>> Counter(words).most_common(10)
Here is a simple example. You can adapt this code to fit your needs:
from string import ascii_lowercase
from collections import Counter
with open('file1.txt', 'r') as file1data:  # opening and reading file one
    file1 = file1data.read().lower()       # convert the entire file contents to lower case
with open('file2.txt', 'r') as file2data:  # opening and reading file two
    file2 = file2data.read().lower()

# The contents of both files are stored in the file1 and file2 variables
# Example of how to work with one file; repeat for two files
file1_list = []
for ch in file1:
    if ch in ascii_lowercase:  # makes sure only lowercase alphabet is appended; all non-alphabet characters are removed
        file1_list.append(ch)
    elif ch in [" ", ".", ",", "'"]:  # remove this elif block if you just want the letters
        file1_list.append(ch)         # make sure basic punctuation is kept

print("".join(file1_list))  # this line is not needed; just to show what the text looks like now
print(Counter(file1_list).most_common(10))  # prints the top ten
print(Counter(file1_list))  # prints the characters and how many times they repeat
Now that you've reviewed that mess above and have an idea of what each line is doing, here is a cleaner version that does what you were looking for:
from string import ascii_lowercase
from collections import Counter
with open('file1.txt', 'r') as file1data:
    file1 = file1data.read().lower()
with open('file2.txt', 'r') as file2data:
    file2 = file2data.read().lower()

file1_list = []
for ch in file1:
    if ch in ascii_lowercase:
        file1_list.append(ch)

file2_list = []
for ch in file2:
    if ch in ascii_lowercase:
        file2_list.append(ch)

all_counter = Counter(file1_list + file2_list)
top_ten_counter = Counter(file1_list + file2_list).most_common(10)

print(sorted(all_counter.items()))
print(sorted(top_ten_counter))
Unfortunately there is no way to insert into the middle of a file without re-writing it. As previous posters have indicated, you can append to a file or overwrite part of it using seek but if you want to add stuff at the beginning or the middle, you'll have to rewrite it.
This is an operating system thing, not a Python thing. It is the same in all languages.
What I usually do is read from the file, make the modifications and write it out to a new file called myfile.txt.tmp or something like that. This is better than reading the whole file into memory because the file may be too large for that. Once the temporary file is completed, I rename it the same as the original file.
This is a good, safe way to do it because if the file write crashes or aborts for any reason, you still have your untouched original file.
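A minimal sketch of that rewrite-and-rename pattern (the file name and the inserted line are illustrative assumptions):
import os

src = 'myfile.txt'
tmp = src + '.tmp'
with open(src) as fin, open(tmp, 'w') as fout:
    for i, line in enumerate(fin):
        if i == 2:                       # insert before the third line, say
            fout.write('inserted line\n')
        fout.write(line)
os.replace(tmp, src)  # swap the finished temp file in place of the original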
To find the most common words from multiple files:
from collections import Counter
import re

with open('document1.txt') as f1, open('document2.txt') as f2:
    words = re.findall(r'\w+', f1.read().lower()) + re.findall(r'\w+', f2.read().lower())
    print(Counter(words).most_common(10))  # the 10 most common words
If you want the 10 most common characters instead:
with open('document1.txt') as f1, open('document2.txt') as f2:
    print(Counter(f1.read() + f2.read()).most_common(10))

Why did my method of writing list items to a .txt file not work?

I've written a short program that will take an input file, remove punctuation, sort the contents by number of occurrences per word, and then write the 100 most common results to an output file.
I had some trouble on the last part (writing the results to an output file), and though I've fixed it, I don't know what the problem was.
The full code looks like so:
from collections import Counter
from itertools import chain
import sys
import string
wordList = []
#this file contains text from a number of reviews
file1 = open('reviewfile', 'r+')
reviewWords = file1.read().lower()
#this file contains a list of the 1000 most common English words
file2 = open('commonwordsfile', 'r')
commonWords = file2.read().lower()
#remove punctuation
for char in string.punctuation:
    reviewWords = reviewWords.replace(char, " ")
#create a list of individual words from file1
splitWords = reviewWords.split()
for w in splitWords:
    if w not in commonWords and len(w) > 2:
        wordList.append(w)
#sort the resulting list by length
wordList = sorted(wordList, key=len)
#return a list containing the 100
#most common words and number of occurrences
words_to_count = (word for word in wordList)
c = Counter(words_to_count)
commonHundred = c.most_common(100)
#create new file for results and write
#the 100 most common words to it
fileHandle = open("outcome", 'w')
for listItem in commonHundred:
    fileHandle.write(str(listItem) + "\n")
fileHandle.close()
I previously had this following code snippet attempting to write the 100 most common terms to a .txt file, but it didn't work. Can anyone explain why not?
makeFile = open("outputfile", "w")
for item in CommonHundred:
    makeFile.write("[0]\n".format(item))
makeFile.close()
Those should be curly braces, like:
makeFile.write("{0}\n".format(item))
Run this and see what happens:
a = "[0]".format("test")
print(a)
b = "{0}".format("test")
print(b)
Then go search for "Format String Syntax" here if you'd like to know more: http://docs.python.org/3/library/string.html.
