Split txt file into multiple new files with regex - python

I am calling on the collective wisdom of Stack Overflow because I am at my wits' end trying to figure out how to do this, and I'm a newbie self-taught coder.
I have a txt file of Letters to the Editor that I need to split into their own individual files.
The letters are all formatted in roughly the same way, like this:
For once, before offering such generous but the unasked for advice, put yourselves in...
Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...
Why is it that The Times does not urge totalitarian Arab slates and terrorist...
PAUL STONEHILL Los Angeles
There you go again. Your editorial again makes groundless criticisms of the Israeli...
On Dec. 7 you called proportional representation “bizarre," despite its use in the...
Proportional representation distorts Israeli politics? Huh? If Israel changes the...
MATTHEW SHUGART Laguna Beach
Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...
Although the mayor did not support Proposition U (the slow-growth initiative) his...
If West Los Angeles is any indication of the no-growth policy, where do we go from here?
MARJORIE L. SCHWARTZ Los Angeles
I thought that the best way to go about it would be to try and use regex to identify the lines that started with a name that's all in capital letters since that's the only way to really tell where one letter ends and another begins.
I have tried quite a few different approaches but nothing seems to work quite right. All the other answers I have seen are based on a repeatable line or word (for example, the answers posted here: how to split single txt file into multiple txt files by Python, and here: Python read through file until match, read until next pattern). None of them seem to work once I adjust them to use my regex for all-capital words.
The closest I've managed to get is the code below. It creates the right number of files. But after the second file is created it all goes wrong. The third file is empty and in all the rest the text is all out of order and/or incomplete. Paragraphs that should be in file 4 are in file 5 or file 7 etc or missing entirely.
import re
thefile = raw_input('Filename to split: ')
name_occur = []
full_file = []
pattern = re.compile("^[A-Z]{4,}")
with open (thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line)
totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)","",thefile)
while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
        if pattern.search(line):
            doIHaveToCopyTheLine = True
            ignoreLines = True
    letters += 1
    f1.close()
I am open to completely scrapping this approach and doing it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am if you are awesome enough to take your time to help me.

I took a simpler approach and avoided regex. The tactic here is essentially to count the uppercase letters in the first three words and make sure they pass certain logic. I went for: the first word is uppercase and either the second or third word is uppercase too, but you can adjust this if needed. It then writes each letter to a new file with the same name as the original file (note: it assumes your file has an extension like .txt) but with an incremented integer appended. Try it out and see how it works for you.
import string

def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)
    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        words = line.split()
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)
        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
        if first_word_upper and (second_word_upper or third_word_upper):
            current_letter.append(line)
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1
        else:
            current_letter.append(line)
I tested it on your sample input and it worked fine.
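For instance, a quick way to try it (assuming the sample above is saved as letters.txt; that name is just an assumption for illustration):

# Hypothetical usage: with the sample saved as letters.txt, this should
# produce letters1.txt, letters2.txt and letters3.txt, one per letter.
split_letters('letters.txt')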

While the other answer is suitable, you may still be curious about using a regex to split up a file.
import re

smallfile = None
buf = ""
with open('input_file.txt', 'rt') as f:
    for line in f:
        buf += str(line)
        if re.search(r'^([A-Z\s\.]+\b)', line) is not None:
            if smallfile:
                smallfile.close()
            match = re.findall(r'^([A-Z\s\.]+\b)', line)
            smallfile_name = '{}.txt'.format(match[0])
            smallfile = open(smallfile_name, 'w')
            smallfile.write(buf)
            buf = ""
if smallfile:
    smallfile.close()
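For reference, here is roughly what the pattern captures on one of the signature lines from the sample; note that the captured group keeps a trailing space, which then carries into the output file name:

# Illustrative check of the pattern against a sample signature line.
# The capture includes a trailing space, so the resulting file would be
# named 'PAUL STONEHILL .txt'.
import re
line = "PAUL STONEHILL Los Angeles"
print(re.findall(r'^([A-Z\s\.]+\b)', line))  # ['PAUL STONEHILL ']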

If you run on Linux, use csplit.
Otherwise, check out these two threads:
How can I split a text file into multiple text files using python?
How to match "anything up until this sequence of characters" in a regular expression?

Related

Counting word pairs from a text file [Python]

So, from a text file with this content:
Lemonade juice whiskey beer soda vodka
In Python, reading from that same .txt file, I would like to output the word pairs in the following order:
juice-lemonade
whiskey-juice
beer-whiskey
soda-beer
vodka-soda
I managed to output something like that by using a list instead of opening the file in Python, but with a large .txt file that is not really a practical solution.
Also, the bonus task for this would be to output the probability for each of those pairs. Any kind of hint would be highly appreciated.
To read large files efficiently, you should read them line-by-line, or (if you have really long lines, which is what the snippet below assumes) token-by-token.
A clean way to do this while keeping an open handle on a file is by using generators that yield a word at a time.
You can have another generator that combines 2 words at a time and yields pairs.
from typing import Iterator

def memory_efficient_word_generator(text_file: str) -> Iterator[str]:
    word = ''
    with open(text_file) as text:
        while True:
            character = text.read(1)
            if not character:
                return
            if character.isspace():
                yield word.lower()
                word = ''
            else:
                word += character

def pair_generator(text_file: str) -> Iterator[str]:
    previous_word = ''
    for word in memory_efficient_word_generator(text_file):
        if previous_word and word:
            yield f'{previous_word}-{word}'
        previous_word = word or previous_word

for pair in pair_generator('filename.txt'):
    print(pair)
Assuming filename.txt contains:
Lemonade juice whiskey beer soda vodka
cola tequila lemonade juice
You should see something like:
lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice
Of course, there's a lot more you should handle depending on your desired behaviour (for example, handling non-alphabetic characters in your input).
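For example, one possible way to handle that (an assumption about the desired behaviour, not part of the code above) would be to strip non-alphabetic characters from each word before yielding it:

# Sketch: drop anything that isn't a letter before yielding the word.
# clean_word is a hypothetical helper, not part of the original answer.
def clean_word(word: str) -> str:
    return ''.join(c for c in word if c.isalpha())

# e.g. inside memory_efficient_word_generator, replace
#     yield word.lower()
# with
#     yield clean_word(word.lower())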
Thank you very much for the feedback.
That's pretty much it, I just added encoding = 'utf-8' here:
with open(text_file, encoding='utf-8') as text:
since it was throwing a 'charmap' error for me otherwise.
And just one more thing: I also wanted to output the number of elements (words) in the text file by using:
file = open("filename.txt", "rt", encoding="utf8")
data = file.read()
words = data.split()
print('Number of words :', len(words))
which I did. Now I'm trying to do the same with those word pairs that you sent; basically each of those pairs would be one element, for example:
lemonade-juice ---> one element
So if we would to count all of these from a text file:
lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice
we would get the output of 9 elements or
Number of word-pairs: 9
I was thinking of trying to do that with the len function applied to the file contents.
Correct me if I'm looking in the wrong direction.
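Something like this is roughly what I have in mind, reusing your pair_generator (filename.txt is the same assumed input file as before):

# Rough sketch: count the word-pairs, and (as a bonus) the probability of
# each pair, by reusing pair_generator from the answer above.
from collections import Counter

pair_counts = Counter(pair_generator('filename.txt'))
total_pairs = sum(pair_counts.values())
print('Number of word-pairs:', total_pairs)

# probability of each pair = its count divided by the total number of pairs
for pair, count in pair_counts.items():
    print(pair, count / total_pairs)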
Once again, thank you for your time.

Fetch the lines in a file that mention specific keywords

I have two files. keywords_file (5MB) contains one keyword per line. Here's a sample:
prénom
pseudonyme
surnom
accès aux origines
puissance paternelle
droit de l’enfant
devoir de protection
droit à la survie
travail des enfants
obligation scolaire
assistance éducative
mauvais traitements
filiation adultérine
r_isa (205MB) contains terms that share an "isa" relationship. Here's a sample, where \t represents a literal tab character:
égalité de Parseval\tformule_0.9333\tégalité_1.0
filiation illégitime\tfiliation_1.0
Loi reconnaissant l'égalité\tloi_1.0
égalité entre les sexes\tégalité_1.0
liberté égalité fraternité\tliberté_1.0
This means "égalité de Parseval" isa "formule" with a score of 0.9333 and isa "égalité" with a score of 1, and so on.
I want to fetch from the r_isa file, the words that have the hypernym relationship with the keywords mentioned on the keywords_file. Here's what I did:
import pandas as pd

keywords = pd.read_csv("keywords_file.txt", sep="\t", encoding='utf8', header=None)

with open("r_isa.txt", encoding="utf-8") as openfile:
    for line in openfile:
        for k in keywords[0]:
            if k in line:
                file = open('isa.txt', 'a', encoding='utf-8')
                file.write(("".join(line) + "\n"))
                file.close()
This kept running non-stop through the entire night. I'm guessing something must be wrong. Any help?
PS: I wanted to add a regular expression like this:
...
for k in keywords[0]:
    if re.search(r'\b' + k + r'\b', line):
        ...
to look for the exact word on each line, but that threw me the following error, so I left it as it is for now:
error: missing ), unterminated subpattern at position 69
Probably the main bottleneck is the repeated opening for appending inside the tight loop. The operating system will need to open the file and seek to the end every time you write to it. If you need several writers to have access to the end of the file, maybe run each of them with output to a separate file, then combine the result files when all the writers are done.
I'm also a bit suspicious about the order in which you read the files. Apparently the raw r_isa.txt file is bigger, but if it contains fewer lines than the keywords.txt file, perhaps switch them. Generally, try to read the smaller set of data into memory, then loop over the bigger file, one line at a time.
Here's an attempt completely without Pandas. There's probably nothing wrong with using it, but it also doesn't really provide much value here.
I also switched to using a regex; it's not clear to me whether that's going to be a performance improvement, but at least it should show you how to get this going so you can measure and compare.
import re

keywords = []
with open("keywords_file.txt") as kwfile:
    keywords = [line.rstrip('\n') for line in kwfile]

regex = re.compile(r'\b(?:' + '|'.join(keywords) + r')\b')

with open("r_isa.txt") as readfile, open('isa.txt', 'w') as writefile:
    for line in readfile:
        firstfield = line.split('\t')[0]
        m = regex.match(firstfield)
        if m:
            writefile.write(line)
Regular expressions are good for looking for substring matches and variations; if you simply want every line where exactly the first field exists verbatim as a line in the keywords file, this is almost certainly going to be quicker:
for line in readfile:
    firstfield = line.split('\t')[0]
    if firstfield in keywords:
        writefile.write(line)
and then of course take out import re and the regex assignment. Maybe also then convert keywords to a set().
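A rough sketch of that set-based variant, with the same assumed file names as above:

# Sketch: exact first-field lookup against a set of keywords, no regex needed.
with open("keywords_file.txt") as kwfile:
    keywords = {line.rstrip('\n') for line in kwfile}

with open("r_isa.txt") as readfile, open('isa.txt', 'w') as writefile:
    for line in readfile:
        if line.split('\t')[0] in keywords:
            writefile.write(line)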

Counting how many times a string appears in a CSV file

I have a piece of code that is supposed to tell me how many times a word occurs in a CSV file. Note: the file is pretty large (2 years of text messages).
This is my code:
key_word1 = 'Exmple_word1'
key_word2 = 'Example_word2'
counter = 0

with open('PATH_TO_FILE.csv', encoding='UTF-8') as a:
    for line in a:
        if (key_word1 or key_word2) in line:
            counter = counter + 1
print(counter)
There are two words because I did not know how to make the search case-insensitive.
To test it I used the Find function in Word on the whole file (using only one of the words, since I was able to do a case-insensitive search there) and I received more than double what my code calculated.
At first I used the value_counts() function, BUT I received different values for the same word (searching for Exmple_word1 it appeared 32 times, 56 times, 2 times, and so on). I got stuck there for a while, but it got me thinking: I use two keyboards on my phone which I change regularly - could it be that the same words are actually different, and that would explain why I am getting these results?
Also, I pretty much checked all sources regarding this matter and found different approaches that did not actually do what I want them to do (the value_counts() method, for example).
If that is the case, how can I fix this?
I notice some mistakes in your code:
1. key_word1 or key_word2 - this is "lazy": if the left part, key_word1, evaluates to True (and any non-empty string does), it won't even look at key_word2. This causes the code to check only whether key_word1 appears in the line.
An example to emphasize this:
w1 = 'word1'
w2 = 'word2'
s = 'bla word2'
(w1 or w2) in s
>> False
(w2 or w1) in s
>> True
2. Reading the CSV file: I recommend using the csv package (just import it), something like:
import csv

with open('PATH_TO_FILE.csv') as f:
    for line in csv.reader(f):
        ...  # do your logic here
3. Case sensitivity - don't work too hard; you can probably just lowercase each line you read, so you don't have to hold two words.
I guess the solution you are looking for should look something like this:
import csv

word_to_search = 'donald'
counter = 0

with open('PATH_TO_FILE.csv', encoding='UTF-8') as f:
    for line in csv.reader(f):
        if any(word_to_search in l for l in map(str.lower, line)):
            counter += 1
Running on input:
bla,some other bla,donald rocks
make,who,great
again, donald is here, hura
will result:
counter=2
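If you do need both of the original placeholder words, a rough sketch of the same idea with two search terms could look like this (the word list is just the question's placeholders, lowercased):

# Sketch: count rows that mention either placeholder word, case-insensitively.
import csv

key_words = ['exmple_word1', 'example_word2']
counter = 0
with open('PATH_TO_FILE.csv', encoding='UTF-8') as f:
    for row in csv.reader(f):
        if any(w in field.lower() for field in row for w in key_words):
            counter += 1
print(counter)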

Output comes twice - Update of a Q asked 30 minutes before posting this one

Here is my code
import re

with open('newfiles.txt') as f:
    k = f.read()

p = re.compile(r'[\w\:\-\.\,\']+|[^[\w\:\-\.\'\,]\s]')
originaltext = p.findall(k)

uniquelist = []
for word in originaltext:
    if word not in uniquelist:
        uniquelist.append(word)

indexes = ' '.join(str(uniquelist.index(word)+1) for word in originaltext)
n = p.findall(indexes)

file = open("newfiletwo.txt", "w")
file.write(' '.join(str(e) for e in n))
file.close()

file = open("newfilethree.txt", "w")
file.write(' '.join(uniquelist))
file.close()

with open('newfiletwo.txt') as f:
    indexess = f.read()
with open('newfilethree.txt') as f:
    differentwords = f.read()

differentwords = p.findall(differentwords)
indexess = [uniquelist.index(word) for word in originaltext]

for word in originaltext:
    if not word in differentwords:
        differentwords.append(word)
    i = differentwords.index(word)
    indexess.append(i)

s = ""  # the reconstructed sentence
for i in indexess:
    s = s + differentwords[i] + " "
print(s)
The program basically takes an external text file and returns the index of each word's position (if a word repeats, the first position is used), then saves the positions to an external file. While doing this, I have split up the text file, including splitting off punctuation, and saved the different words and punctuation that occur in the file to an external file too. Now for the hard part: using both of these external files - the indexes and the different separated words - I am trying to recreate the original text file, including the punctuation. But the error shown below occurs:
Traceback (most recent call last):
File "E:\Python\Index.py", line 31, in <module>
s = s + differentwords[i] + " "
IndexError: list index out of range
Not trying to sound rude, but I am something of a beginner; please try to change as little as possible and keep it simple, as I created this myself. You may well know a far shorter way to do this, but this is the level of simplicity I can handle, as shown by the length of the code. I have tried shortening the original text file, but that was no use. Does anyone know why the error occurs and how to fix it? I am not looking for efficiency right now, maybe after another couple of months of learning, but the simplest (I don't mind long) answer will be the best. Sorry if I have repeated myself a lot :-)
'newfiles' - A bunch of sentences with punctuation
UPDATE
The code no longer shows the error, but it prints the original sentence twice. The error went away after removing the +1 on line 23. Does anyone know why the output repeats twice, though?
The problem is how you decide what is a word and what is not. For instance, is a comma part of a word? In your case it is not treated as such, while it is also not a separator, so you end up with the comma, or the dot, as a separate word, and so on. I have no access to your input, so I can only provide a sample:
p = re.compile(r'[\w\:\-\.\,]+|[^[\w\:\-\.\,]\s]')
There is one catch - in this case 'Word', 'word', 'Word.' and 'word,' are all separate words, since the dot and the comma are now parts of the word. You can't have your cake and eat it. To fix that... you need to store information about whether there was whitespace before each separator.
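A rough sketch of that idea (the text and variable names are illustrative only): tokenize with the pattern, but also remember whether each token was preceded by whitespace, so the original spacing can be rebuilt later.

# Sketch: keep punctuation attached to words, but record whether each token
# had whitespace before it, so the sentence can be reconstructed exactly.
import re

text = "Hello, world. It's fine."
tokens = []                # the words (with attached punctuation)
preceded_by_space = []     # parallel list: was there whitespace before this token?
for match in re.finditer(r"[\w:\-.,']+|[^\w\s]", text):
    tokens.append(match.group())
    preceded_by_space.append(match.start() > 0 and text[match.start() - 1].isspace())

# Reconstruct the original spacing:
rebuilt = ''.join((' ' if space else '') + tok
                  for tok, space in zip(tokens, preceded_by_space)).lstrip()
print(rebuilt)  # Hello, world. It's fine.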
UPDATE:
Oh, yes. The double output. The intermediate files that are stored are OK, so something goes wrong after that. Look at these two lines:
i = differentwords.index(word)
indexess.append(i)
They need to be inside the preceding if statement.
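In other words, the loop near the end of the script should read like this, with only the indentation of those two lines changed:

# The suggested fix: only record an index when the word was newly added,
# so words are not appended to indexess a second time.
for word in originaltext:
    if not word in differentwords:
        differentwords.append(word)
        i = differentwords.index(word)
        indexess.append(i)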

Location String Munging in Python

(Python 2.7, using datasets from http://www.policemisconduct.net/databases/, 2009, 2010)
[[YOU CAN SKIP DOWN TO ____ IF YOU DON'T CARE ABOUT NATURE OF MY DATA]]
I'm fairly new to Python and programming in general - I'd like someone to explain the results I'm getting from my loop.
I'm trying to loop through the 'location' column of a police misconduct dataset. Its format is as follows:
city, state, USA
(I'm aware the URL above has the data broken into separate 2009 and 2010 files, where the location is already in 2 separate columns, as well as a Google Fusion to which I am referring. This question is specifically about how to make A look like B, as well as the errors I'm throwing and why.)
Allow me a simplified version of my question. Consider the following five locations in test.csv:
Tallahassee, Florida, USA
Denver, Colorado, USA
Watertown, New York, USA
Kalamazoo, Michigan, USA
Toronto, Ontario, Canada
I run the following script:
def censor(text, word):
    texts = str(text)
    words = texts.split()  # creates the list
    x = "" * len(word)  # creates the stars with correct length
    for i in range(len(words)):
        if words[i] == word:
            words[i] = x  # replace
    return "".join(words)

places = pd.read_csv("test.csv")  # the 5-place list above
censor(places, "USA")
And I get the following:
'Tallahassee,Florida,0Denver,Colorado,1Watertown,NewYork,2Kalamazoo,Michigan,3Toronto,Ontario,Canada'
Obviously, the numbers shouldn't be there; it's one big long string (but an array [] instead of a "" string throws errors when trying to use the .split method...), and even the spaces I want were dropped.
Adding an alpha character to the joining string in the return line as I tinkered made me even more confused about the loop I had written... (so now line 8 reads: return "a".join(words))
'Tallahassee,aFlorida,aa0aDenver,aColorado,aa1aWatertown,aNewaYork,aa2aKalamazoo,aMichigan,aa3aToronto,aOntario,Canada'
...and the only thing that does well is make me sound like Luigi when I read it.
How can I make a) two separate n x 1 arrays, where n is the number of observations, one for City and one for State, and b) one n x 2 array with analogous columns?
Thanks! (And sorry for n3wb? :(
I suggest this.
def censor_word(word, word_to_censor):
    word = word.strip()
    if word.lower() == word_to_censor.lower():
        return '*' * len(word)
    else:
        return word

def censor(line, word_to_censor):
    words = str(line).split(',')  # creates the list
    words = [censor_word(w, word_to_censor) for w in words]
    return ", ".join(words)

with open("test.csv", "rt") as f:
    for line in f:
        print(censor(line, "USA"))
Sorry, I have to run out the door. Usually I explain the code but cannot right now. If you have questions, I will answer them later.
If you want to process this as a string, there's a much easier way to do your "censorship". You can use the replace method of a string to remove all instances of "USA" (or any other substring for that matter).
f = open('places.csv')
text = str(f.read())
f.close()
places = text.replace(', USA','')
It's then very simple to recreate your dataframe using string operations:
t1 = places.split('\n')
t2 = [p.replace(' ','').strip() for p in t1]
final_places = [p.split(',') for p in t2]
This gives you the result in list form:
[['Tallahassee', 'Florida'], ['Denver', 'Colorado'], ['Watertown', 'NewYork'], ['Kalamazoo', 'Michigan'], ['Toronto', 'Ontario', 'Canada']]
To get cities/states:
cities = [p[0] for p in final_places]
states = [p[1] for p in final_places]
