I'm trying to insert an increment after each occurrence of ~||~ in my .txt file. I have this working, but I want to split it up so that after each semicolon, the counter starts back over at 1.
So far I have the following, which does everything except reset at semicolons.
inputfile = "output2.txt"
outputfile = "/output3.txt"

f = open(inputfile, "r")
words = f.read().split('~||~')
f.close()

count = 1
for i in range(len(words)):
    if ';' in words[i]:
        count = 1
    words[i] += "~||~" + str(count)
    count = count + 1

f2 = open(outputfile, "w")
f2.write("".join(words))
Why not first split the file contents on the semicolon, then count the occurrences of '~||~' in each segment?
import re

with open(inputfile) as f:
    semicolon_separated_chunks = f.read().split(';')

for chunk in semicolon_separated_chunks:
    count = len(re.findall(re.escape('~||~'), chunk))
    # if the file text is 'hello there ~||~ what is that; what ~||~ do you ~|| mean; nevermind ~||~'
    # then count is 1 for each of the three chunks (note that '|' has to be
    # escaped, since it means alternation in a regular expression, and that
    # findall wants a single string, not a list)
Instead of resetting the counter the way you are now, you could do the initial split on ';', and then split each substring on '~||~'. You'd have to store your words another way, since you're no longer doing words = f.read().split('~||~'), but it's safer to build an entirely new list anyway.
inputfile = "output2.txt"
outputfile = "/output3.txt"

all_lines = []
f = open(inputfile, "r")
lines = f.read().split(';')
f.close()

for line in lines:
    count = 1
    words = line.split('~||~')
    new_line = words[0]
    for word in words[1:]:
        # re-insert the separator, numbered per segment
        new_line += "~||~" + str(count) + word
        count += 1
    all_lines.append(new_line)

f2 = open(outputfile, "w")
f2.write(";".join(all_lines))  # re-insert the semicolons lost in the split
See if this works for you. You may also want to put some strategically placed newlines in there, to make the output more readable.
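For instance, a newline after each semicolon segment (an assumption about the layout you want, not something from the question):

f2.write(";\n".join(all_lines))  # one numbered segment per line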
New here!
I am searching for the word that follows a given word, in this case "I". For example, in "I am new here" the next word is "am".
import re

word = 'i'
with open('tedtalk.txt', 'r') as words:
    pat = re.compile(r'\b{}\b \b(\w+)\b'.format(word))
    print(pat.findall(words))
with open('tedtalk.txt','r') as f:
    for line in f:
        phrase = 'I'
        if phrase in line:
            next(f)
These are the snippets I have developed so far, but I am kind of stuck already. Thanks in advance!
You have two options.
First, with split:
with open('tedtalk.txt','r') as f:
    data = f.read()

search_word = "I"
list_of_words = data.split()
next_word = list_of_words[list_of_words.index(search_word) + 1]
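Note that .index() only finds the first occurrence. A minimal sketch, using the same variables as above, that collects the word after every occurrence instead:

next_words = [list_of_words[i + 1]
              for i, w in enumerate(list_of_words[:-1])
              if w == search_word]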
Second, with regex:
import re

regex = re.compile(r"\bI\b\s*?\b(\w+)\b")
with open('tedtalk.txt','r') as f:
    data = f.read()  # findall needs a single string, not the list readlines() returns
result = regex.findall(data)
In your first piece of code, words is a file object, not a string, and there will be problems with line-by-line processing. For example, in the following case, am2 would not be found.
tedtalk.txt
I am1 new here, I
am2 new here, I am3 new here
So I modified the program to read 4096 bytes at a time, to prevent a very large file from exhausting memory.
To avoid missing a match when the data is truncated mid-phrase, each chunk of data is scanned from the end for the word I; if it is found, the data from that point on is cut off and prepended to the next read.
import re

regex = re.compile(r"\bI\b\s*?\b(\w+)\b")

def find_index(data, target_value="I"):
    """Look for spaces from the back; the intention is to find a value
    sitting between two spaces and check whether it equals `target_value`."""
    index = data.rfind(" ")
    while index != -1:
        index2 = data.rfind(" ", 0, index)
        if index2 == -1:
            break
        if data[index2 + 1:index] == target_value:
            return index2
        index = index2
    return None

result = []
with open('tedtalk.txt', 'r') as f:
    # Save data that might have been truncated last time.
    prev_data = ""
    while True:
        new_data = f.read(4096)
        once_read_data = prev_data + new_data
        if not once_read_data:
            break
        # Only hold back a tail while more data may follow; at EOF, process
        # everything that is left (otherwise the loop would never finish).
        index = find_index(once_read_data) if new_data else None
        if index is not None:
            # Slicing based on the found index.
            prev_data = once_read_data[index:]
            once_read_data = once_read_data[:index]
        else:
            prev_data = ""
        result += regex.findall(once_read_data)
print(result)
Output:
['am1', 'am2', 'am3']
search_word = 'I'
prev_data = ""
result = []
with open('tedtalk.txt', 'r') as f:
    while True:
        data = prev_data + f.readline()
        if data == prev_data:  # reached EOF
            break
        list_of_words = data.split()
        for word_pos, word in enumerate(list_of_words[:-1]):
            if word == search_word:
                result.append(list_of_words[word_pos + 1])
        # carry the last word over, in case its successor is on the next line
        prev_data = list_of_words[-1] + ' ' if list_of_words else ''
print(result)
I modified the code to read the text file line by line, so it should handle arbitrarily large files. The code also addresses the case where the search word is the last word on a line, by taking the first word of the next line as the next word.
If you would rather treat each line independently, and ignore the search word when it is the last word on a line, the code can be simplified as follows:
search_word = 'I'
result = []
with open('tedtalk.txt', 'r') as f:
    while True:
        data = f.readline()
        if not data:  # reached EOF
            break
        list_of_words = data.split()
        for word_pos, word in enumerate(list_of_words[:-1]):
            if word == search_word:
                result.append(list_of_words[word_pos + 1])
print(result)
I would like to know how I could get all the lines after the first one in a file, using Python.
I've tried this:
fr = open("numeri.txt", "r")
count = 0
while True:
    line = fr.readline(count)
    if line == "":
        break
    count += 1
    print(line)
fr.close()
Could anyone help me? Thanks
You could add an extra if statement to check that count != 0, since on the first loop it will be 0.
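A minimal sketch of that suggestion applied to your loop (note that readline() should be called without the count argument, which limits how many characters are read):

fr = open("numeri.txt", "r")
count = 0
while True:
    line = fr.readline()
    if line == "":
        break
    if count != 0:  # skip the first line
        print(line)
    count += 1
fr.close()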
I don't know if I understood well, but to obtain all the lines while skipping the first one, you can simply do:
with open("numeri.txt") as fobj:
    lines = fobj.readlines()[1:]
count = len(lines) + 1 if lines else 0  # if you want to maintain the same counting as in your example
with open('numeri.txt', 'r') as f:
    next(f)  # skip the first line
    for count, line in enumerate(f, start=1):  # read remaining lines with a count
        print(count, line)
Just read the first line without using it:
with open('numeri.txt') as f:
    f.readline()
    lines = f.readlines()
print(*lines, sep='')
To ignore the first line you can also use next(f) (instead of f.readline()).
This is also fine:
with open('numeri.txt') as f:
    lines = f.readlines()[1:]
print(*lines, sep='')
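If the file might be large, you can also skip the first line without reading everything into memory; a small sketch using itertools:

from itertools import islice

with open('numeri.txt') as f:
    for line in islice(f, 1, None):  # lines 2..end, streamed lazily
        print(line, end='')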
Try using l[1:]. It returns a subset of l consisting of all the elements of l except the first one.
with open("numeri.txt", "r") as f:
    content = f.readlines()[1:]
    for line in content:
        print(line.strip('\n'))  # avoid introducing a double '\n', since print already ends with one
EDIT: Based on @riccardo-bucco's solution:
with open("numeri.txt", "r") as f:
    content = f.readlines()[1:]
    print(*content, sep='')
To print all but the first line:
with open('numeri.txt', 'r') as f:
    output = ''.join(f.readlines()[1:])
print(output)
Start count at 1 so it skips the first line:
...
count = 1
...
Getting the desired output so far.
The program prompts the user to search for a word; the user enters it, and the program reads the file and gives the output:
'ashwin: 2'
Now I want it to ignore case. For example, "Ashwin" and "ashwin" should both return 2, as the text file contains two occurrences of "ashwin".
def word_count():
    file = "test.txt"
    word = input("Enter word to be searched:")
    k = 0
    with open(file, 'r') as f:
        for line in f:
            words = line.split()
            for i in words:
                if i == word:
                    k = k + 1
    print(word + ": " + str(k))

word_count()
You could use lower() to compare the strings, changing the comparison to if i.lower() == word.lower():
For example:
def word_count():
    file = "test.txt"
    word = input("Enter word to be searched:")
    k = 0
    with open(file, 'r') as f:
        for line in f:
            words = line.split()
            for i in words:
                if i.lower() == word.lower():
                    k = k + 1
    print(word + ": " + str(k))

word_count()
You can either use .lower() on both the line and the word to eliminate case, or you can use the built-in re module:
len(re.findall(r'\b' + re.escape(word) + r'\b', text, flags=re.IGNORECASE))
(re.escape and the \b word boundaries keep it counting whole words only, like the split() comparison in your code.)
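Put into the same shape as your function, that looks roughly like this (a sketch; "test.txt" and the whole-word behaviour are assumptions carried over from the question):

import re

def word_count():
    word = input("Enter word to be searched:")
    with open("test.txt", 'r') as f:
        text = f.read()
    # \b matches word boundaries, mirroring the split() comparison above
    k = len(re.findall(r'\b' + re.escape(word) + r'\b', text, flags=re.IGNORECASE))
    print(word + ": " + str(k))

word_count()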
Use the Counter class from collections, which returns a dictionary-like object whose key-value pairs can be accessed in O(1) time.
from collections import Counter

def word_count():
    file = "test.txt"
    with open(file, 'r') as f:
        words = f.read().lower().split()  # split() already handles newlines
    count = Counter(words)
    word = input("Enter word to be searched:")
    print(word, ":", count.get(word.lower(), 0))

word_count()
How do I compare word frequencies from two text files in Python? If a word occurs in both file1 and file2, it should be written only once, without adding the two frequencies together: it should look like {'The': [3, 5]}, where 3 is the frequency in file1 and 5 is the frequency in file2. And if a word exists in only one of the files, the entry for the other file should be 0. Please help.
Here is what I have done so far:
f1 = open('file1.txt','r')  # file 1
f2 = open('file2.txt','r')  # file 2

wordlist = []
wordlist2 = []

for line in f1:
    for word in line.split():
        wordlist.append(word)

for line in f2:
    for word in line.split():
        wordlist2.append(word)

worddictionary = {}
for word in wordlist:
    if word in worddictionary:
        worddictionary[word] += 1
    else:
        worddictionary[word] = 1

worddictionary2 = {}
for word in wordlist2:
    if word in worddictionary2:
        worddictionary2[word] += 1
    else:
        worddictionary2[word] = 1

print(worddictionary)
print(worddictionary2)
Edit: Here's the more general way you would do this for any list of files (explanation in comments):
f1 = open('file1.txt','r')  # file 1
f2 = open('file2.txt','r')  # file 2

file_list = [f1, f2]  # This would hold all your open files
num_files = len(file_list)

frequencies = {}  # We'll just make one dictionary to hold the frequencies

for i, f in enumerate(file_list):  # Loop over the files, keeping an index i
    for line in f:  # Get the lines of that file
        for word in line.split():  # Get the words of that file
            if not word in frequencies:
                frequencies[word] = [0] * num_files  # a list of 0's for any word not seen yet -- one 0 per file
            frequencies[word][i] += 1  # Increment the frequency count for that word and file

print(frequencies)
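For example, with file1.txt containing "the cat saw the dog" and file2.txt containing "the dog ran" (made-up sample data), this would print something like:

{'the': [2, 1], 'cat': [1, 0], 'saw': [1, 0], 'dog': [1, 1], 'ran': [0, 1]}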
Keeping with the code you wrote, here's how you could create a combined dictionary:
f1 = open('file1.txt','r')  # file 1
f2 = open('file2.txt','r')  # file 2

wordlist = []
wordlist2 = []

for line in f1:
    for word in line.split():
        wordlist.append(word)

for line in f2:
    for word in line.split():
        wordlist2.append(word)

worddictionary = {}
for word in wordlist:
    if word in worddictionary:
        worddictionary[word] += 1
    else:
        worddictionary[word] = 1

worddictionary2 = {}
for word in wordlist2:
    if word in worddictionary2:
        worddictionary2[word] += 1
    else:
        worddictionary2[word] = 1

# Create a combined dictionary
combined_dictionary = {}
all_word_set = set(worddictionary.keys()) | set(worddictionary2.keys())
for word in all_word_set:
    combined_dictionary[word] = [0, 0]
    if word in worddictionary:
        combined_dictionary[word][0] = worddictionary[word]
    if word in worddictionary2:
        combined_dictionary[word][1] = worddictionary2[word]

print(worddictionary)
print(worddictionary2)
print(combined_dictionary)
Edit: I misunderstood the problem; the code now works for your question.
f1 = open('file1.txt','r')  # file 1
f2 = open('file2.txt','r')  # file 2

wordList = {}

for line in f1.readlines():  # for each line (file.readlines() returns a list)
    for word in line.split():  # for each word in each line
        if not word in wordList:  # if the word is not already in our dictionary
            wordList[word] = 0  # add the word to the dictionary

for line in f2.readlines():
    for word in line.split():
        if word in wordList:  # if the word is already in our dictionary
            wordList[word] = wordList[word] + 1  # add one to its value

f1.close()  # close files
f2.close()

f1 = open('file1.txt','r')  # have to re-open because we are at the end of the file
                            # (there might be an easier way of doing this)
for line in f1.readlines():  # removing keys whose values are 0
    for word in line.split():  # for each word in each line
        try:
            if wordList[word] == 0:  # if its value is 0
                del wordList[word]  # remove it from the dictionary
            else:
                wordList[word] = wordList[word] + 1  # otherwise add one for each occurrence in file1
        except KeyError:
            pass  # the word was already deleted on an earlier occurrence

f1.close()
print(wordList)
It adds the first file's words, and if a word is also in the second file, adds one to its value.
After that, it checks each word, and if its value is 0, removes it.
This can't be done by iterating over the dictionary itself, because it would change size while being iterated over.
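A safe alternative (an assumption on my part, not the poster's code) is to build a new dictionary instead of deleting entries from the one you are reading:

wordList = {word: count for word, count in wordList.items() if count != 0}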
This is how you would implement it for multiple files (more complex):
fileList = ["file1.txt", "file2.txt"]
openList = []
for i in range(len(fileList)):
    openList.append(open(fileList[i], 'r'))

fileWords = []
for i, file in enumerate(openList):  # for each file
    fileWords.append({})  # add a dictionary to our list
    for line in file:  # for each line in each file
        for word in line.split():  # for each word in each line
            if word in fileWords[i]:  # if the word is already in our dictionary
                fileWords[i][word] += 1  # add one to it
            else:
                fileWords[i][word] = 1  # add it to our dictionary with value 1

for i in openList:
    i.close()

for i, wL in enumerate(fileWords):
    print(f"File: {fileList[i]}")
    for l in wL.items():
        print(l)
You might find the following demonstration program to be a good starting point for getting the word frequencies of your files:
#! /usr/bin/env python3
import collections
import pathlib
import pprint
import re
import sys


def main():
    # as a demonstration, count the words of this script itself (sys.argv[0])
    freq = get_freq(sys.argv[0])
    pprint.pprint(freq)


def get_freq(path):
    if isinstance(path, str):
        path = pathlib.Path(path)
    return collections.Counter(
        match.group() for match in re.finditer(r'\b\w+\b', path.open().read())
    )


if __name__ == '__main__':
    main()
In particular, you will want to use the get_freq function to get a Counter object that tells you what the word frequencies are. Your program can call the get_freq function multiple times with different file names, and you should find the Counter objects to be very similar to the dictionaries you were previously using.
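For instance, to combine two calls into the [count_in_file1, count_in_file2] layout used earlier in this thread (the file names here are assumptions):

freq1 = get_freq('file1.txt')
freq2 = get_freq('file2.txt')
combined = {word: [freq1.get(word, 0), freq2.get(word, 0)]
            for word in freq1.keys() | freq2.keys()}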
I'm relatively new to Python and coding, and I'm trying to write a program that counts the number of times each different character appears in a text file, disregarding case.
What I have so far is
letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o',
           'p','q','r','s','t','u','v','w','x','y','z']

prompt = "Enter filename: "
titles = "char count\n---- -----"
itemfmt = "{0:5s}{1:10d}"
totalfmt = "total{0:10d}"
whiteSpace = {' ':'space', '\t':'tab', '\n':'nline', '\r':'crtn'}

filename = input(prompt)
fname = filename
numberCharacters = 0
fname = open(filename, 'r')
for line in fname:
    linecount += 1
    word = line.split()
    word += words
    for word in words:
        for char in word:
            numberCharacters += 1
return numberCharacters
Something seems wrong about this. Is there a more efficient way to do my desired task?
Thanks!
from collections import Counter
frequency_per_character = Counter(open(filename).read().lower())
Then you can display them as you wish.
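For example, one way to display them (a sketch; the sort order is my choice):

for char, count in sorted(frequency_per_character.items()):
    print(repr(char), count)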
A better way would be to use str methods such as isalpha:
chars = {}
for l in open('filename', 'r'):  # 'rU' mode is deprecated; plain 'r' is fine in Python 3
    for c in l.lower():  # lowercase first, so case is disregarded
        if not c.isalpha():
            continue
        chars[c] = chars.get(c, 0) + 1
And then use the chars dict to write the final histogram.
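For example, a plain-text histogram from that dict (a sketch; the bar scale is arbitrary):

for c in sorted(chars):
    print(f"{c} {'#' * chars[c]} ({chars[c]})")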
You're overcomplicating it; you can just convert your file content to a set to eliminate duplicate characters:
number_diff_value = len(set(open("file_path").read().lower()))  # .lower() so case is disregarded