How to read data from several files with python - python

I am new to Python and this may be a foolish question, but it has been troubling me for several days.
I have about 30 log files, and each of them contains strings and data. They are almost identical except for a few data values, and their names follow a regular pattern, like 'log10.lammps', 'log20.lammps', etc. (the '10' and '20' represent the temperature of the simulation). I want to write a Python script that loops over all these files and reads their data at a specific line (say line 3900). Then I want to write these data to another data file, arranged like this:
10 XXX
20 XXX
30 XXX
.
.
.
I can read and write from a single file, but I cannot achieve the loop. Could anyone please tell me how to do that? Thanks a lot!
PS. Another difficulty is that the data in line 3900 is presented like this: "The C11 is 180.1265465616", and the part I want to extract is 180.1265465616. How can I extract the number without the surrounding text?

This answer covers how to get all the files in a folder in Python. To summarize the top answer, to get all the files in a single folder, you would do this:
import os
import os.path

def get_files(folder_path):
    return [f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]
The next step would be to extract the number from the line "The C11 is 180.1265465616".
I'm assuming that you have a function called get_line that, given a filename, will return that exact line.
You can do one of three things. If the length of the number at the end is constant, you can grab the last n characters of the string and convert them to a number. Alternatively, you could split the string on spaces and grab the last item -- the number. Finally, you could use a regular expression.
I'm just going to go with the second option since it looks the most straightforward for now.
def get_numbers():
    numbers = []
    for file in get_files('folder'):
        line = get_line(file)
        components = line.split(' ')
        number = float(components[-1])
        numbers.append(number)
    return numbers
I wasn't sure how you wanted to write the numbers to the file, but hopefully these should help you get started.
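For the writing step, a minimal sketch (assuming you extend get_numbers to also return the temperature parsed from each filename, e.g. as (temperature, value) pairs, and that the output path is up to you):

def write_numbers(pairs, output_path):
    # pairs: list of (temperature, value) tuples, e.g. [(10, 180.1265465616), (20, ...)]
    with open(output_path, 'w') as out:
        for temperature, value in sorted(pairs):
            out.write('{} {}\n'.format(temperature, value))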

# assuming files is a list of filenames
for filename in files:
    with open(filename) as f:
        ...  # do stuff with file f

P.S. float(line.split(' ')[-1]) will extract the number from the line.

Well, I can give you a hint about the path I would have taken (though there might be a better one):
Get all the files in the directory into a list with os.listdir.
Loop over every one and perform the following:
Use the re module to extract the temperature from the filename; if the name does not match the pattern, skip the file, otherwise add the temperature to a list (to_write_out).
Read the right line with linecache.
Get the value (line.split()[-1]).
Append the value to the list to_write_out.
Join the list to_write_out into a string with join.
Write the string to a file.
(A sketch tying these steps together follows the regex example below.)
Help with the regex
Regular expressions can be a bit tricky if you haven't used them before. To extract the temperature from your filenames (the first step inside the loop above) you would use something like:
import re

pattern = r'log(\d+)\.lammps'
for fname in filenames:
    match = re.search(pattern, fname)
    if not match:
        continue  # skip files that don't match the pattern
    temp = match.group(1)
    # Append the temperature to the list.
    # Continue reading the right line etc.
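Putting those hints together, a rough sketch (assuming, as in the question, that the value sits on line 3900 of each log, and writing to a hypothetical results.txt):

import os
import re
import linecache

to_write_out = []
pattern = r'log(\d+)\.lammps'

for fname in sorted(os.listdir('.')):
    match = re.search(pattern, fname)
    if not match:
        continue                             # skip files that don't follow the naming scheme
    temp = match.group(1)                    # e.g. '10', '20', ...
    line = linecache.getline(fname, 3900)    # read line 3900 of this log file
    value = line.split()[-1]                 # "The C11 is 180.126..." -> "180.126..."
    to_write_out.append(temp + ' ' + value)

with open('results.txt', 'w') as out:
    out.write('\n'.join(to_write_out) + '\n')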

Related

Writing code to search many text files for certain words [duplicate]

I'm new to coding. I have about 50 notepad files which are transcripts with timestamps, and I want to search through them for certain words or phrases. Ideally I would be able to input a word or phrase and the program would find every instance of its occurrence and then print it along with its timestamp. The lines in my notepad files look like this:
"then he went to the office
00:03
but where did he go afterwards?
00:06
to the mall
00:08"
I just want a starting point. I have fun figuring things out on my own, but I need somewhere to start.
A few questions I have:
How can I ensure that the user input will not be case sensitive?
How can I open many text files systematically with a loop? I have titled the text files {1,2,3,4...} to try to make this easy to implement but I still don't know how to do it.
from os import listdir
import re

def find_word(text, search):
    # Whole-word, case-insensitive match; re.escape guards against special characters in the search term.
    return re.search(r'\b' + re.escape(search) + r'\b', text, flags=re.IGNORECASE) is not None

# Please put all the txt files in one folder (for my case it was: D:/onr/1738349/txt/)
search = 'word'  # or the phrase you want to search for

result_file = "D:/onr/1738349/result.txt"  # change these paths accordingly; I spelled them out so you can follow
transcript_folder = "D:/onr/1738349/txt"   # all the transcript files live here

with open(result_file, "w") as f:  # output file; keep it outside the folder that holds the transcripts
    for filename in listdir(transcript_folder):  # the folder that contains all the txt files (50 files as you said)
        with open(transcript_folder + '/' + filename) as currentFile:
            i = 0
            for line in currentFile:
                i += 1
                if find_word(line, search):  # exact match, so 'word' and 'testword' are different
                    f.write('Found in (' + filename[:-4] + '.txt) at line number: (' + str(i) + ') ')
                    # The timestamp sits on the line right after the matching text.
                    timestamp = next(currentFile, '').rstrip()
                    i += 1  # account for the timestamp line we just consumed
                    f.write('time: (' + timestamp + ') \n')
Make sure you follow the comments.
(1) Create result.txt.
(2) Keep all the transcript files in one folder.
(3) Make sure (1) and (2) are not in the same folder.
(4) The paths would be different if you are using a Unix-based system (mine is Windows).
(5) Just run this script after making suitable changes; it will take care of the rest (it will find all the transcript files and write the results to a single file for your convenience).
The output (in result.txt) would be:
Found in (fileName.txt) at line number: (#lineNumber) time: (Time)
.......
How can I ensure that the user input will not be case sensitive?
As soon as you open a file, convert the whole thing to capital letters. Similarly, when you accept user input, immediately convert it to capital letters.
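For example, a minimal sketch of that idea (the filename and prompt are just placeholders):

# Normalize both the file contents and the user input so case never matters.
with open("transcript.txt") as f:
    contents = f.read().upper()

query = input("Search for: ").upper()
print(query in contents)  # True if the phrase occurs, regardless of case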
How can I open many text files systematically with a loop?
import os

Files = os.listdir(FolderFilepath)
for file in Files:
    ...  # open and process each file here
I have titled the text files {1,2,3,4...} to try to make this easy to implement but I still don't know how to do it.
Have fun.
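A fuller sketch of that loop (assuming the numbered files are plain .txt files in one folder, and reusing the upper-casing idea from above):

import os

folder = "transcripts"                  # assumed folder holding 1.txt, 2.txt, ...
query = input("Search for: ").upper()

for name in sorted(os.listdir(folder)):
    if not name.endswith(".txt"):
        continue
    with open(os.path.join(folder, name)) as f:
        for lineno, line in enumerate(f, start=1):
            if query in line.upper():   # case-insensitive match
                print(name, "line", lineno, ":", line.strip())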
Maybe these thoughts are helpful for you:
First I would try to find a way to get hold of all the files I need (even if it's only their paths).
Filtering can be done with regex (there are online websites where you can test your regex constructs, like regex101, which I found super helpful when I first looked at regex).
Then I would try to mess around with getting all the timestamps/places etc. to see that you filter correctly.
Have fun!
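To illustrate the regex idea (the mm:ss pattern is inferred from the sample transcript in the question):

import re

sample = """then he went to the office
00:03
but where did he go afterwards?
00:06"""

# Lines that consist only of an mm:ss timestamp.
timestamps = re.findall(r'^\d{2}:\d{2}$', sample, flags=re.MULTILINE)
print(timestamps)  # ['00:03', '00:06']

From there, each matching transcript line can be paired with the timestamp on the line that follows it.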
searched_text = input("Type searched text: ")

with open("file.txt", "r") as file:
    lines = file.readlines()

for i in range(len(lines)):
    lines[i] = lines[i].replace("\n", "")

for i in range(len(lines) - 1):
    if searched_text.lower() in lines[i].lower():  # case-insensitive match
        # The timestamp is on the line directly below the matching text.
        print(searched_text + " appeared in " + str(lines[i]) + " at timestamp " +
              str(lines[i + 1]))

Here is code which works like you expect for a single file; wrap it in a loop over the files in your folder to cover all the transcripts.

How can I read and search multiple textfiles so that I can store a list of files that match my search?

I hope you can help out a new learner of Python. I could not find my problem in other questions, but if so: apologies. What I basically want to do is this:
Read a large number of text files and search each for a number of string terms.
If the search terms are matched, store the corresponding file name to a new file called "filelist", so that I can tell the good files from the bad files.
Export "filelist" to Excel or CSV.
Here is the code that I have so far:
# textfiles all contain only simple text e.g. "6 Apples"
import os
import re
import pandas as pd

filelist = []
for file in os.listdir('C:/mydirectory/'):
    with open('C:/mydirectory/' + file, encoding="Latin1") as f:
        fine = f.read()
        if re.search('APPLES', fine) or re.search('ORANGE', fine) or re.search('BANANA', fine):
            filelist.append(file)

listoffiles = pd.DataFrame(filelist)
writer = pd.ExcelWriter('ListofFiles.xlsx', engine='xlsxwriter')
listoffiles.to_excel(writer, sheet_name='welcome', index=False)
writer.save()
print(filelist)
Questions:
Surely there is a more elegant or time-efficient way? I need to do this for a large number of files :D
Related to the former, is there a way to handle the reading-in of files using pandas? Or is it less time efficient? For me as a STATA user, having a dataframe feels a bit more like home...
I added the "Latin1" option, as some characters in the raw data create encoding conflicts. Is there a way to see which characters are causing the problem? Can I get rid of this easily, e.g. by cutting off the first line beforehand (skiprows maybe)?
Just a couple of things to speed up the script:
1.) Compile your regex beforehand, not every time in the loop (and use | to combine multiple strings into one regex).
2.) Read files line by line, not all at once.
3.) Use any() to terminate the search when you get the first positive.
For example:
import re
import os

filelist = []
r = re.compile(r'APPLES|ORANGE|BANANA')  # you can add flags=re.I for case-insensitive search
for file in os.listdir('C:/mydirectory/'):
    with open('C:/mydirectory/' + file, 'r', encoding='latin1') as f:
        if any(r.search(line) for line in f):  # read the file line by line, not all content at once
            filelist.append(file)  # add to list
# convert list to pandas, etc...
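The export step was already in the question's own code; a minimal way to finish (CSV shown here to avoid the xlsxwriter dependency, though to_excel accepts a path in the same way):

import pandas as pd

listoffiles = pd.DataFrame(filelist, columns=['filename'])
listoffiles.to_csv('ListofFiles.csv', index=False)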

combining txt files and counting words

def word_counter(s):
    word_list = s.split()
    return len(word_list)

f = open("a.txt")
total = 0
for i in f.readlines():
    total += word_counter(i)
print(total)
I want to count the number of characters (without blanks), the number of words used, and the average word length for each of 'a.txt', 'b.txt', 'c.txt', 'd.txt', 'e.txt'. Finally, I want a 'total.txt' combining all the txt files.
I don't know how to go further.
Please help.
You actually have the concept right. Just need to add a little more to reach your desired output.
Remember when you use f = open("a.txt"), make sure you call f.close(). Or, use the with keyword, like I did in the example. It automatically closes the file for you, even if you forget to.
I won't give the exact code as it is, but will provide the steps so that you learn the concepts.
Put all the .txt file names in a list.
For example: list_FileNames = ["a.txt", "b.txt"]
Then open each file and read the entire file into a string:
for file in list_FileNames:
    with open(file, 'r') as inFile:
        myFileInOneString = inFile.read().replace('\n', '')
You have the right function to count words. For characters: len(myFileInOneString) - myFileInOneString.count(' ')
Save all these values into a variable and write them to another file. Check how to write to a file: How to Write to a File in Python
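For reference, a rough sketch of those steps (file names taken from the question; the exact layout of total.txt is an assumption):

list_FileNames = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]

with open("total.txt", "w") as outFile:
    for name in list_FileNames:
        with open(name, "r") as inFile:
            text = inFile.read()
        words = text.split()
        num_words = len(words)
        num_chars = len(text) - text.count(' ') - text.count('\n')  # characters excluding blanks
        avg_len = num_chars / num_words if num_words else 0
        outFile.write("%s: %d characters, %d words, %.2f average word length\n"
                      % (name, num_chars, num_words, avg_len))
        outFile.write(text + "\n")  # append the file's contents so total.txt also combines them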

How do I read the last modified file that contains a certain string?

I want to find the last modified file that contains a certain string using Python. I have many files in my directory, so comparing dates by hand is not appealing. I am looking for a way to go through the files by modification date until I hit a file containing the string I want. Or if there is a better way to do this, that would be great.
To search for the last modified file in the current directory that contains the string search_string, you can use this code:
import os

files = os.listdir(".")
files.sort(key=os.path.getmtime, reverse=True)
for name in files:
    with open(name) as f:
        if search_string in f.read():
            print(name)
            break
This will first list all the files, sort them by modification time (newest first), and then iterate over the list of files to see whether they contain search_string.

Confusing loop problem (python)

This is similar to the question in merge sort in python.
I'm restating it because I don't think I explained the problem very well over there.
Basically I have a series of about 1000 files, all containing domain names. Altogether the data is > 1 GB, so I'm trying to avoid loading all of it into RAM. Each individual file has been sorted using .sort(get_tld), which has sorted the data according to its TLD (not according to its domain name: all the .coms together, all the .orgs together, etc.).
A typical file might look like this:
something.ca
somethingelse.ca
somethingnew.com
another.net
whatever.org
etc.org
but obviously longer.
I now want to merge all the files into one, maintaining the sort so that in the end the one large file will still have all the .coms together, all the .orgs together, etc.
What I want to do basically is:
open all the files
loop:
    read 1 line from each open file
    put them all in a list and sort with .sort(get_tld)
    write each item from the list to a new file
The problem I'm having is that I can't figure out how to loop over the files.
I can't use "with open() as" because I don't have one file open to loop over, I have many. Also they're all of variable length, so I have to make sure to get all the way through the longest one.
Any advice is much appreciated.
Whether you're able to keep 1000 files open at once is a separate issue and depends on your OS and its configuration; if not, you'll have to proceed in two steps -- merge groups of N files into temporary ones, then merge the temporary ones into the final-result file (two steps should suffice, as they let you merge a total of N squared files; as long as N is at least 32, merging 1000 files should therefore be possible). In any case, this is a separate issue from the "merge N input files into one output file" task (it's only an issue of whether you call that function once, or repeatedly).
The general idea for the function is to keep a priority queue (module heapq is good at that;-) with small lists containing the "sorting key" (the current TLD, in your case) followed by the last line read from the file, and finally the open file ready for reading the next line (and something distinct in between to ensure that the normal lexicographical order won't accidentally end up trying to compare two open files, which would fail). I think some code is probably the best way to explain the general idea, so next I'll edit this answer to supply the code (however I have no time to test it, so take it as pseudocode intended to communicate the idea;-).
import heapq

def merge(inputfiles, outputfile, key):
    """inputfiles: list of input, sorted files open for reading.
    outputfile: output file open for writing.
    key: callable supplying the "key" to use for each line.
    """
    # Prepare the heap: items are lists with [thekey, k, theline, thefile]
    # where k is an arbitrary int guaranteed to be different for all items,
    # theline is the last line read from thefile and not yet written out
    # (guaranteed to be a non-empty string), thekey is key(theline), and
    # thefile is the open file.
    h = [(k, i.readline(), i) for k, i in enumerate(inputfiles)]
    h = [[key(s), k, s, i] for k, s, i in h if s]
    heapq.heapify(h)
    while h:
        # get and output the lowest available item (== available item w/ lowest key)
        item = heapq.heappop(h)
        outputfile.write(item[2])
        # replenish the item with the _next_ line from its file (if any)
        item[2] = item[3].readline()
        if not item[2]:
            continue  # don't reinsert finished files
        # compute the key, and re-insert the item appropriately
        item[0] = key(item[2])
        heapq.heappush(h, item)
Of course, in your case, as the key function you'll want one that extracts the top-level domain given a line that's a domain name (with trailing newline) -- in a previous question you were already pointed to the urlparse module as preferable to string manipulation for this purpose. If you do insist on string manipulation,
def tld(domain):
    return domain.rsplit('.', 1)[-1].strip()
or something along these lines is probably a reasonable approach under this constraint.
If you use Python 2.6 or better, heapq.merge is the obvious alternative, but in that case you need to prepare the iterators yourself (including ensuring that "open file objects" never end up being compared by accident...) with a "decorate / undecorate" approach similar to the one I use in the more portable code above.
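In Python 3.5 and later, heapq.merge also accepts a key argument directly, so the decorate step is no longer needed; a minimal sketch reusing the tld key function above:

import heapq

def merge_by_tld(paths, outpath):
    # Each input file is already sorted by TLD, so heapq.merge can lazily
    # interleave them one line at a time without loading everything into RAM.
    files = [open(p) for p in paths]
    try:
        with open(outpath, 'w') as out:
            for line in heapq.merge(*files, key=tld):
                out.write(line)
    finally:
        for f in files:
            f.close()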
You want to use merge sort, e.g. heapq.merge. I'm not sure if your OS allows you to open 1000 files simultaneously. If not you may have to do it in 2 or more passes.
Why don't you divide the domains by first letter, so you would just split the source files into 26 or more files which could be named something like: domains-a.dat, domains-b.dat. Then you can load these entirely into RAM and sort them and write them out to a common file.
So:
The ~1000 input files split into 26+ source files
26+ source files could be loaded individually, sorted in RAM and then written to the combined file.
If 26 files are not enough, I'm sure you could split into even more files... domains-ab.dat. The point is that files are cheap and easy to work with (in Python and many other languages), and you should use them to your advantage.
Your algorithm for merging sorted files is incorrect. What you should do is read one line from each file, find the lowest-ranked item among all the lines read, write it to the output file, and then replace it with the next line read from the same file. Repeat this process (ignoring any files that are at EOF) until the end of all files has been reached.
#! /usr/bin/env python
"""Usage: unconfuse.py file1 file2 ... fileN

Reads a list of domain names from each file, and writes them to standard
output grouped by TLD.
"""
import sys, os

spools = {}

for name in sys.argv[1:]:
    for line in file(name):
        if line == "\n":
            continue
        tld = line[line.rindex(".") + 1:-1]
        spool = spools.get(tld, None)
        if spool is None:
            spool = file(tld + ".spool", "w+")
            spools[tld] = spool
        spool.write(line)

for tld in sorted(spools.iterkeys()):
    spool = spools[tld]
    spool.seek(0)
    for line in spool:
        sys.stdout.write(line)
    spool.close()
    os.remove(spool.name)
