In Python 2, how do I limit the number of words read in when importing all .txt files from a directory? For example, wordlength = 6000.
import glob

raw_text = ""
path = "/workspace/simple/*.txt"
for filename in glob.glob(path):
    with open(filename, 'r') as f:
        for line in f:
            raw_text += line
words = raw_text.split()
print(words)
This code reads in all the .txt files and prints everything to the screen. How do I limit it to 6000 words and print only those 6000 words?
import glob

raw_text = ""
path = "/workspace/simple/*.txt"
N = 6000  # here you put your number
for filename in glob.glob(path):
    with open(filename, 'r') as f:
        for line in f:
            if len(raw_text.split()) < N:
                raw_text += line
            else:
                break
words = raw_text.split()
print(words)
Assuming you want 6000 or fewer words from each file?
import glob, sys

path = sys.argv[1]
count = int(sys.argv[2]) if len(sys.argv) > 2 else 60
words = []
for file in glob.glob(path):
    with open(file) as f:
        words += f.read().split()[:count]
print(words)
Run it as: python test.py "/workspace/simple/*.txt" 6000
You could also set up a dictionary mapping each file to its words:
import glob, sys

path = sys.argv[1]
count = int(sys.argv[2]) if len(sys.argv) > 2 else 60
fwords = {}
for file in glob.glob(path):
    with open(file) as f:
        fwords[file] = f.read().split()[:count]
print(fwords)
If you want only the files that contain exactly that count of words:
for file in glob.glob(path):
    with open(file) as f:
        tmp = f.read().split()
        if len(tmp) == count:  # only files with exactly `count` words
            fwords[file] = tmp
That depends on your definition of a word. If it's simply text separated by white space, it's fairly easy: count the words as they go past, and stop when you have enough. For instance:
word_limit = 6000
word_count = 0
for line in f:
    word_count += len(line.split())
    if word_count > word_limit:
        break
    raw_text += line
If you want exactly 6000 words, you can modify the loop to grab enough words from the last line to make 6000 exactly.
If you want to make it a little more efficient, then drop raw_text and build words within the loop, one line at a time, with
line_words = line.split()
words.extend(line_words)
In this case, you'll want to use len(line_words) for your check.
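The two refinements above can be sketched together: build a words list line by line, and take only enough words from the final line to land exactly on the limit. This is a minimal sketch with illustrative names, not code from the answer:

```python
def read_words(lines, word_limit=6000):
    """Collect at most word_limit words from an iterable of lines."""
    words = []
    for line in lines:
        line_words = line.split()
        if len(words) + len(line_words) >= word_limit:
            # take just enough words from the last line to hit the limit
            words.extend(line_words[:word_limit - len(words)])
            break
        words.extend(line_words)
    return words
```

Because the check uses len(line_words) per line rather than re-splitting the accumulated text, this stays linear in the input size.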
Try replacing your code with this:
for filename in glob.glob(path):
    with open(filename, 'r') as f:
        word_limit = 6000
        word_count = 0
        for line in f:
            word_count += len(line.split())  # count words, not characters
            if word_count > word_limit:
                break
            raw_text += line
I want to search for groups of strings inside a text file (.txt or .log). A line must include group A or B (or C, D, E, ...). For a group to match, each of its words must appear in the same line, but not necessarily next to each other (e.g. ["123456", "Login"] or ["123457", "Login"]); if a line matches, save it to a new txt file.
Some example output lines:
20221110,1668057560.965,AE111,123457,0,"Action=Account Login,XXX,XXX",XXX,XXX
20221110,1668057560.965,AE112,123458,0,"Action=Account Login,XXX,XXX",XXX,XXX
20221111,1668057560.965,AE113,123458,0,"Action=Order,XXX,XXX",XXX,XXX
Below is my code:
import os, re

path = "Log\\"
file_list = [path + f for f in os.listdir(path) if f.endswith('.log')]

keep_phrases1 = ["123456", "Login"]
keep_phrases2 = ["123457", "Login"]

pat = r"\b.*?\b".join([re.escape(word) for word in keep_phrases1])
pat = re.compile(r"\b" + pat + r"\b")
pat2 = r"\b.*?\b".join([re.escape(word) for word in keep_phrases2])
pat2 = re.compile(r"\b" + pat2 + r"\b")
print(pat2, pat)

if len(file_list) != 0:
    for infile in sorted(file_list):
        with open(infile, encoding="latin-1") as f:
            f = f.readlines()
            for line in f:
                found1 = pat.search(line)
                found2 = pat2.search(line)
                if found1 or found2:
                    with open(outfile, "a") as wf:
                        wf.write(line)
It works for me, but it is not easy to add more groups of words, and I don't think the code is easy to understand.
My problem is: how can I simplify the code?
And how can I make it easier to add other groups to search for? e.g. ["123458", "Login"], ["123456", "order"], ["123457", "order"]
import os, re
path = "Log\\"
file_list = [path + f for f in os.listdir(path) if f.endswith('.log')]
Put all keep_phrases in one container. I chose a dictionary, but since the groups are only identified by their order, a list would also work:
keep_phrases = {'keep_phrases1': ["123456", "Login"], 'keep_phrases2':["123457", "Login"]}
# Alternative, a list would work:
# keep_phrases = [["123456", "Login"], ["123457", "Login"]]
Now let's generate a list with the compiled patterns:
def compile_pattern(keep_phrase):
    pat = r"\b.*?\b".join([re.escape(word) for word in keep_phrase])
    pat = re.compile(r"\b" + pat + r"\b")
    return pat
patterns = [compile_pattern(keep_phrases[keep_phrase]) for keep_phrase in keep_phrases.keys()]
# if keep_phrases had been a list, we would do
# patterns = [compile_pattern(keep_phrase) for keep_phrase in keep_phrases]
Finally, we look for matches for every pattern, and if we get any match, we write the line to the output file.
if len(file_list) != 0:
    for infile in sorted(file_list):
        with open(infile, encoding="latin-1") as f:
            f = f.readlines()
            for line in f:
                findings = [pat.search(line) for pat in patterns]  # works because patterns is a list
                if any(findings):
                    with open(outfile, "a") as wf:
                        wf.write(line)
Try this. I read the whole file into a single string to make the code fast and readable; findall returns a list of all matching lines in the file.
If memory is a concern, the pattern also works on individual lines:
import re

file_list = ["sit.txt"]
keep_phrases = [["123456", "Login"], ["123457", "Login"]]

pat = [r"(?:.*?(?:" + p1 + r"\b.*?" + p2 + r".*?(?:\n|$)))" for p1, p2 in keep_phrases]
pat = r"|".join(pat)

for infile in sorted(file_list):
    with open(infile, encoding="latin-1") as f:
        text = f.read()
    print(re.findall(pat, text))
Without regex
def match_words(line, words):
    # every word of the group must appear among the line's fields
    return all(word in line for word in words)

with open(infile, encoding="latin-1") as f:
    f = f.readlines()
    for line in f:
        split_line = line.split(",")
        if any(match_words(split_line, words) for words in [keep_phrases1, keep_phrases2]):
            with open(outfile, "a") as wf:
                wf.write(line)
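As a quick sanity check of the membership test (the sample line is made up for illustration; note that this field-level test only matches a keyword that is an entire comma-separated field):

```python
def match_words(fields, words):
    # every word of the group must appear among the line's fields
    return all(word in fields for word in words)

line = "20221110,1668057560.965,AE111,123457,0,Login,XXX"
fields = line.split(",")
groups = [["123456", "Login"], ["123457", "Login"]]
matched = any(match_words(fields, g) for g in groups)  # second group matches
```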
I have a series of files, and I have to separate out and display part of the text. My code is:
import glob

path = 'C:\\Bot\\*.log'
files = glob.glob(path)
nlines = 0
for name in files:
    try:
        with open(name) as f:
            for line in f:
                nlines += 1
                if line.find("Total") >= 0:
                    print(line)
    except IOError:
        pass  # skip files that cannot be opened
I need the text that is stored in the file after the matching line. With the code above I can get the line, but I cannot access the subsequent lines.
How do I access the next line's value?
import glob

path = 'C:\\Bot\\*.log'
files = glob.glob(path)
nlines = 0
for name in files:
    try:
        with open(name) as f:
            for line in f:
                nlines += 1
                if line.find("Total") >= 0:
                    print(next(f))
    except IOError:
        pass
I think this is a better solution to the problem.
Use next() to read:
import glob

path = 'C:\\Bot\\*.log'
files = glob.glob(path)
nlines = 0
for name in files:
    try:
        with open(name) as f:
            for line in f:
                nlines += 1
                if line.find("Total") >= 0:
                    for i in range(6):
                        print(next(f))
    except IOError:
        pass
I'm making a program that opens a txt file and replaces the first 0 with a 1 on a given line. At the moment it only keeps the edited line, but I want it to keep all the lines. I'm using Python 3.1.
line_number = 3

with open(filename, "r") as f:
    number = 0
    for line in f:
        number += 1
        if line_number == number:
            content = line.replace("0", "1", 1)

savefile = filename[:4] + ".tmp"
with open(savefile, "w") as f:
    f.write(content)
os.remove(filename)
os.rename(savefile, filename)
Text file:
0 Dog
0 Cat
0 Giraffe
0 Leopard
0 Bear
You need to write each unchanged line to the savefile:
import os

filename = 'input.txt'
line_number = 3
savefile = filename[:4] + ".tmp"

with open(filename, "r") as f:
    with open(savefile, "w") as fout:
        number = 0
        for line in f:
            number += 1
            if line_number == number:
                content = line.replace("0", "1", 1)
                fout.write(content)
            else:
                # Write unchanged lines here
                fout.write(line)
os.remove(filename)
os.rename(savefile, filename)
Did you try something like this:
filename = "./test.txt"

with open(filename) as f:
    lines = f.readlines()

# the element with index 2 is the 3rd element
lines[2] = lines[2].replace("0", "1", 1)

with open(filename, 'w') as f:
    f.writelines(lines)
Output(./test.txt):
0 Dog
0 Cat
1 Giraffe
0 Leopard
0 Bear
You can read the file into a list, then perform an action on each item (or on a specific element) and save the result back to the same file. You don't need a .tmp file, nor do you need to remove and rename files.
Edit:
There is another approach with fileinput (thanks to @PeterWood):
import fileinput

with fileinput.input(files=('test.txt',), inplace=True) as f:
    for line in f:
        if fileinput.lineno() == 3:  # '==', not 'is': identity checks on ints are unreliable
            print(line.replace("0", "1", 1).strip())
        else:
            print(line.strip())
I need to read a file in BED format that contains coordinates for every chromosome in a genome and split it into different files according to the chromosome. I tried this approach, but it doesn't work; it doesn't create any files. Any ideas why this happens, or alternative approaches to solve this problem?
import sys

def make_out_file(dir_path, chr_name, extension):
    file_name = dir_path + "/" + chr_name + extension
    out_file = open(file_name, "w")
    out_file.close()
    return file_name

def append_output_file(line, out_file):
    with open(out_file, "a") as f:
        f.write(line)
        f.close()

in_name = sys.argv[1]
dir_path = sys.argv[2]

with open(in_name, "r") as in_file:
    file_content = in_file.readlines()
    chr_dict = {}
    out_file_dict = {}
    line_count = 0
    for line in file_content[:0]:
        line_count += 1
        elems = line.split("\t")
        chr_name = elems[0]
        chr_dict[chr_name] += 1
        if chr_dict.get(chr_name) = 1:
            out_file = make_out_file(dir_path, chr_name, ".bed")
            out_file_dict[chr_name] = out_file
            append_output_file(line, out_file)
        elif chr_dict.get(chr_name) > 1:
            out_file = out_file_dict.get(chr_name)
            append_output_file(line, out_file)
        else:
            print "There's been an Error"
    in_file.close()
This line:
for line in file_content[:0]:
says to iterate over an empty list. The empty list comes from the slice [:0] which says to slice from the beginning of the list to just before the first element. Here's a demonstration:
>>> l = ['line 1\n', 'line 2\n', 'line 3\n']
>>> l[:0]
[]
>>> l[:1]
['line 1\n']
Because the list is empty, no iteration takes place, so the code in the body of your for loop is not executed.
To iterate over each line of the file you do not need the slice:
for line in file_content:
However, it is better still to iterate over the file object itself, as this does not require the whole file to be read into memory first:
with open(in_name, "r") as in_file:
    chr_dict = {}
    out_file_dict = {}
    line_count = 0
    for line in in_file:
        ...
Following that there are numerous problems, including syntax errors, with the code in the for loop which you can begin debugging.
I am having problems deleting a specific line/entry in a text file. With the code I have, the top line in the file is deleted no matter which line number I select.
def erase():
    contents = {}
    f = open('members.txt', 'a')
    f.close()
    f = open('members.txt', 'r')
    index = 0
    for line in f:
        index = index + 1
        contents[index] = line
        print("{0:3d}) {1}".format(index, line))
    f.close()
    total = index
    entry = input("Enter number to be deleted")
    f = open('members.txt', 'w')
    index = 0
    for index in range(1, total):
        index = index + 1
        if index != entry:
            f.write(contents[index])
Try this:
import sys
import os

def erase(file):
    assert os.path.isfile(file)
    with open(file, 'r') as f:
        content = f.read().split("\n")
    # print(content)
    entry = int(input("Enter number to be deleted:"))  # int() so the comparisons below work
    assert entry >= 0 and entry < len(content)
    new_file = content[:entry] + content[entry+1:]
    # print(new_file)
    with open(file, 'w') as f:
        f.write("\n".join(new_file))

if __name__ == '__main__':
    erase(sys.argv[1])
As already noted, you were starting the range from 1, which is incorrect. The list slicing I used in new_file = content[:entry] + content[entry+1:] makes the code more readable, and it is an approach less prone to similar errors.
Also, you open and close the input file at the beginning for no reason, and you should use with whenever possible when doing operations on files.
Finally, I used join and split to simplify the code, so you don't need a for loop to process the lines of the file.
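A tiny illustration of the slice-based deletion, using the sample lines from the question above:

```python
content = ["0 Dog", "0 Cat", "0 Giraffe", "0 Leopard", "0 Bear"]
entry = 2  # 0-based index of the line to delete
new_file = content[:entry] + content[entry + 1:]
# new_file keeps everything except "0 Giraffe"
```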