Hello beautiful people, I have a text file like this:
user = user447
pass = 455555az
type = registred
date = 1 year
and I want to read the file and rewrite it like this:
user|pass|type|date
line by line.
I tried many ways, but I seem to be stuck since I have to deal with 1 million accounts.
with open(file, "r") as f:
    data = []
    for line in f:
        key = line.rstrip('\n').split('=')
        key1 = key[1:2]
You don't need to read the entire file all at once. Instead, you can read it in parts and write as you go (note that a single with block is used for the two open() context managers, though you could just as easily nest them inside each other)
with open(source) as fh_src, open(destination, "w") as fh_dest:
    block = []
    for lineno, line in enumerate(fh_src, 1):
        # .split("=", 1)[-1] captures everything after the first =
        # this is also an opportunity to verify the key
        block.append(line.split("=", 1)[-1].strip())
        if len(block) == 4:
            fh_dest.write("{}|{}|{}|{}\n".format(*block))
            block = []  # reset block after each write
It's definitely worth creating some safeguards, however!
checking that lines really start with a known key, if you have a set of known keys or some you intend to omit, or if you have a dynamic set of keys (say some users have a collection of previous password hashes, or different comments)
checking that block is empty at the end (the last record should have been written and cleared!)
checking that = is really in each line, and deciding whether any comments are kept or discarded
opening with "w" will truncate destination if it already exists (perhaps from a botched previous run), which may be undesirable
(lineno is only included to simplify discovering bad lines)
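The safeguards listed above could be sketched roughly as follows. This is only an illustration: the function name convert, the EXPECTED_KEYS tuple, and the assumption that every record has exactly these four keys in this order are mine, not part of the question.

```python
EXPECTED_KEYS = ("user", "pass", "type", "date")  # assumed fixed key order


def convert(source, destination):
    """Rewrite 4-line "key = value" records as user|pass|type|date lines."""
    records = 0
    # mode "x" raises FileExistsError instead of overwriting an earlier run
    with open(source) as fh_src, open(destination, "x") as fh_dest:
        block = []
        for lineno, line in enumerate(fh_src, 1):
            if not line.strip():
                continue  # ignore blank lines between records
            key, sep, value = line.partition("=")
            if not sep:
                raise ValueError("line {}: no '=' found".format(lineno))
            expected = EXPECTED_KEYS[len(block)]
            if key.strip() != expected:
                raise ValueError("line {}: expected key {!r}, got {!r}".format(
                    lineno, expected, key.strip()))
            block.append(value.strip())
            if len(block) == len(EXPECTED_KEYS):
                fh_dest.write("|".join(block) + "\n")
                block = []  # reset block after each write
                records += 1
        if block:
            raise ValueError("file ended in the middle of a record")
    return records
```

If a ValueError is raised, the partially written destination is left on disk, so it has to be removed before the next attempt.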
I have a log file that has what's known as a header section, and then the rest of it is a lot of data. The header section contains certain key-value pairs that tell a db table information about said file.
One of my tasks is to parse out some of this header info. The other task is to go through the entire file and count how often certain strings occur. For the latter part I have a function, attached below:
with open(filename, 'rb') as f:
    time_data_count = 0
    while True:
        memcap = f.read(102400)
        # f.seek(-tdatlength, 1)
        poffset_set = set(config_offset.keys())
        # need logic to check if key value exists
        time_data_count += memcap.count(b'TIME_DATA')
        if len(memcap) <= 8:
            break
if time_data_count > 20:
    print("time_data complete")
else:
    print("incomplete time_data data")
print(time_data_count)
The issue with this now is that it does not process the file line by line, since that would take a lot of time. I want to get only the first 50 lines of this log and parse them, then have the rest of the function go through the entire file without going line by line and do the counting.
Is it possible to extract the first 50 lines without going through the entire file?
The first 50 lines have header info of the form
ProdID: A785X
What I really need is to get the value of ProdID in that log file
You can read line-by-line for the first 50, by using a for loop or a list comprehension to just read the next line 50 times. This moves the read pointer down through the file, so when you call .read() or any other method, you'll not get anything you've already consumed. You can then process the rest as batch, or however else you need to:
with open(filename, 'rb') as f:
    first_50_lines = [next(f) for _ in range(50)]  # first 50 lines
    remainder_of_file = f.read()  # however much of the file remains
You can alternate various methods of reading the file, as long as the same file object (f in this case) is in play the entire time. Line-by-line, sized-chunk by chunk, or all at once (though .read() is always going to preclude further processing, on account of consuming the whole thing at once).
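As a rough sketch of how the two phases might be combined on one file object: the function name read_header_and_count and its parameters are made up for illustration, and it assumes header lines look like "Key: value". It uses itertools.islice so a file with fewer than 50 lines does not raise StopIteration, and it carries a few bytes over between chunks so a TIME_DATA that straddles a chunk boundary is still counted. Note that it counts occurrences after the header only.

```python
from itertools import islice


def read_header_and_count(filename, needle=b"TIME_DATA", header_lines=50,
                          chunk_size=102400):
    """Return (header dict, number of needle occurrences after the header)."""
    header = {}
    count = 0
    with open(filename, "rb") as f:
        # islice stops quietly if the file has fewer than header_lines lines
        for raw in islice(f, header_lines):
            key, sep, value = raw.partition(b":")
            if sep:
                name = key.strip().decode(errors="replace")
                header[name] = value.strip().decode(errors="replace")
        # Keep the last len(needle) - 1 bytes so that a match split across
        # two chunks is seen whole, but never counted twice
        tail = b""
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            count += buf.count(needle)
            tail = buf[-(len(needle) - 1):] if len(needle) > 1 else b""
    return header, count
```

With this, the ProdID value would be available as header.get("ProdID"), assuming that line really appears within the first 50 lines.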
I have a text file, which has the following:
20
15
10
And I have the following code:
test_file = open("test.txt","r")
n = 21
line1 = test_file.readline(1)
line2 = test_file.readline(2)
line3 = test_file.readline(3)
test_file.close()
line1 = int(line1)
line2 = int(line2)
line3 = int(line3)
test_file = open("test.txt","a")
if n > line1:
    test_file.write("\n")
    n = str(n)
    test_file.write(n)
test_file.close()
This code checks if the variable 'n' is bigger than line 1. What I wanted it to do is if it is bigger than line 1, it should be written in a line before the previous line 1. However this code will write it at the bottom of the file. Is there anything I can do to write something where I want to and not at the bottom of the file?
Any help is appreciated.
You can put your whole data in a variable, edit that variable then overwrite the information in the file.
with open('test.txt', 'r') as file:
    # read a list of lines into data
    data = file.readlines()
# now change the 2nd line, note that you have to add a newline
data[1] = "42\t\n"
# and write everything back
with open('test.txt', 'w') as file:
    file.writelines( data )
This is a short answer, implement your own algorithm to solve your own problem.
As correctly pointed out by Amadan in a comment, the only way to obtain this result is a complete rewrite of the file.
This, clearly depending on how strict your requirements are, is fairly inefficient.
If you want to understand more about inefficiency just imagine the actions you would have to manually take to write a new 1st line in a physical notebook page.
Since the 1st line is already written you would have to turn the page, write the new first line, then copy again all the lines from the old page and, finally, tear the 1st page out and have your perfect notebook with a perfect page again.
You are writing with pen so there is no possibility to delete, only a new page will do the trick.
That is quite some work!
This is - well, more or less - what Python is doing behind the scenes when it is opening for reading (the 'r' part in my examples below) and then opening for writing (the 'w' part) the same file again.
As a general idea imagine that when you see for loops there is a lot of work to do.
I will clumsily over-simplify by saying that the more for loops there are, the slower the code (countless pages have been written by brilliant minds on performance; I suggest you dive deeper by searching for "Big O notation" using your preferred search engine. Here's an example: https://www.freecodecamp.org/news/all-you-need-to-know-about-big-o-notation-to-crack-your-next-coding-interview-9d575e7eec4/).
A better solution would be to change your data file and make sure that the last value is also the most recent one.
Rewriting the file is as easy as writing an empty file, code and result are identical.
The trick here is that we have in memory (in the variables data and new_data) everything we need.
In data we store the whole content of the file before the change.
In new_data we can easily apply the needed modification because it is just a list containing a number and a newline (\n) for each list item.
Once new_data contains the data in the desired order all we need to do is write that list into a file.
Here's a possible solution, as close as possible to your code:
n = 21
with open('test.txt', 'r') as file:
    data = file.readlines()
first_entry = int(data[0])
if (n > first_entry):
    new_data = []
    new_value = str(n) + "\n"
    new_data.append(new_value)
    for item in data:
        new_data.append(item)
    with open('test.txt', 'w') as file:
        file.writelines(new_data)
Here's a more portable one:
def prepend_to_file_if_bigger_than_first_line(filename, value):
    """Checks if value is bigger than the one found in the 1st line of the specified file,
    if true prepends it to the file

    Args:
        filename (str): The file name to check.
        value (int): The value to check.
    """
    with open(filename, 'r') as file:
        data = file.readlines()
    first_entry = int(data[0])
    if (value > first_entry):
        new_value = "{}\n".format(value)
        new_data = []
        new_data.append(new_value)
        for old_value in data:
            new_data.append(old_value)
        with open(filename, 'w') as file:
            file.writelines(new_data)

prepend_to_file_if_bigger_than_first_line("test.txt", 301)
As bonus some food for thought and exercises to learn:
What if instead of rewriting everything you just add a new line to the end of the page? Wouldn't it be more efficient and effective?
How would you re-implement my function above just to check the last line in file and append a new value?
Try bench-marking the prepend and the append solution, which one is best?
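For comparison, here is one possible shape for the append variant from the second exercise. It is only a sketch: the function name is made up, and it assumes the file holds one integer per line with the most recent value last. It still reads the file once to find the last line, but it never rewrites the existing content.

```python
def append_to_file_if_bigger_than_last_line(filename, value):
    """Append value on a new line if it is bigger than the last number.

    Returns True if the file was changed, False otherwise.
    """
    last = None
    ends_with_newline = True
    with open(filename, "r") as file:
        for line in file:  # one pass, remembering only the latest number
            if line.strip():
                last = int(line)
            ends_with_newline = line.endswith("\n")
    if last is not None and value <= last:
        return False
    with open(filename, "a") as file:  # "a" always writes at the end
        if not ends_with_newline:
            file.write("\n")
        file.write("{}\n".format(value))
    return True
```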
Good afternoon, I have a nested list of IP and MAC addresses, of arbitrary length:
A = [['10.0.0.1','00:4C:3S:**:**:**', 0], ['10.0.0.2', '00:5C:4S:**:**:**', 0], [....], [....]]
I want to check if this MAC is in the oui file:
E043DB (base 16) Shenzhen
2405f5 (base 16) Integrated
3CD92B (base 16) Hewlett Packard
...
If the MAC from the list is in the file, I want to write the name of the manufacturer as the 3rd list item. I tried the code below, but it only checks the first element; the remaining ones are not checked. How can I do this?
f = open('oui.txt', 'r')
for values in A:
    for line in f.readlines():
        if values[1][0:8].replace(':','') in line:
            values[2]=(line.split('(base 16)')[1].strip())
f.close()
print (A)
And I get this result:
A = [['10.0.0.1','00:4C:3S:**:**:**', 'Firm Name'], ['10.0.0.2', '00:5C:4S:**:**:**', 0], [....], [....]]
The Problem
Consider the "shape" of your code:
f = open('a file')
for values in [ 'some list' ]:
    for line in f.readlines():
Your two loops are doing this:
Start with first value in list
Read all lines remaining in file object f
Move to next value in list
Read all lines remaining in file object f
Except that the first time you told it to "read all lines remaining" it would do so.
So, unless you have some way to put more lines into f (which can happen with async files like stdin!) you are going to get one "good" pass through the file, and then every subsequent pass the file object will point to the end of the file, so you'll get nothing.
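The exhaustion described above is easy to reproduce in isolation. In this small demonstration, io.StringIO stands in for the real oui.txt (the two sample lines are taken from the question), and seek(0) is shown as one way to rewind:

```python
import io

# io.StringIO stands in for open('oui.txt'); it behaves like a text file
f = io.StringIO("E043DB (base 16) Shenzhen\n3CD92B (base 16) Hewlett Packard\n")

first_pass = f.readlines()   # consumes everything up to end of file
second_pass = f.readlines()  # nothing left, so this is an empty list

f.seek(0)                    # rewinding makes the lines available again
third_pass = f.readlines()

print(len(first_pass), len(second_pass), len(third_pass))  # prints: 2 0 2
```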
A Solution
When you are dealing with a file, you want to only process it one time. File I/O is expensive compared to other operations. So you can choose to either (a) read the entire file into memory, and do whatever you want since it's not a file any more; or (b) scan through it only one time.
If you choose to scan through it only once, the easy solution is just to invert the two for loops. Instead of doing this:
for item in list:
    for line in file:
Do this instead:
for line in file:
    for item in list:
And presto! You are now only reading the file one time.
Other Considerations
If I look at your code, and your examples, it seems like you are trying for an exact match on a particular key. You trim down the MAC addresses in your list to check them against the manufacturer ids.
This suggests to me that you might well have many, many more list values (source MAC addresses) than you have manufacturers. So perhaps you should consider reading the contents of the file into memory, rather than processing it one line at a time.
Once you have the file in memory, consider building a proper dictionary. You have a key (MAC prefix) and a value (manufacturer). So build something like:
mac_to_mfg = {}
for line in f:
    mac = line.split('(base 16)')[0].strip()
    mfg = line.split('(base 16)')[1].strip()
    mac_to_mfg[mac] = mfg
Then you can make one pass through the source addresses and use the dict's O(1) lookup to your advantage:
for src in A:
    prefix = src[1][:8].replace(':', '')
    if prefix in mac_to_mfg:
        # etc...
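Putting the two fragments together, a self-contained version might look like the sketch below. The helper names load_oui and annotate and the sample addresses are made up for illustration. Two details go beyond the fragments: lines without "(base 16)" are skipped, and prefixes are upper-cased on both sides, since the sample file in the question mixes cases (2405f5 versus E043DB).

```python
import io


def load_oui(lines):
    """Map upper-case 6-digit hex prefixes to manufacturer names."""
    mac_to_mfg = {}
    for line in lines:
        prefix, sep, mfg = line.partition("(base 16)")
        if sep:  # skip lines that are not "(base 16)" entries
            mac_to_mfg[prefix.strip().upper()] = mfg.strip()
    return mac_to_mfg


def annotate(addresses, mac_to_mfg):
    """Fill the third item of each [ip, mac, 0] entry when the prefix is known."""
    for entry in addresses:
        prefix = entry[1][:8].replace(":", "").upper()
        if prefix in mac_to_mfg:
            entry[2] = mac_to_mfg[prefix]
    return addresses


# Made-up sample data; io.StringIO stands in for open('oui.txt')
oui = io.StringIO("E043DB (base 16) Shenzhen\n2405f5 (base 16) Integrated\n")
A = [["10.0.0.1", "24:05:F5:00:00:01", 0], ["10.0.0.2", "AA:BB:CC:00:00:02", 0]]
print(annotate(A, load_oui(oui)))
```

Entries whose prefix is not in the file keep their original 0.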
The problem is that you have the order of the loops reversed. Usually this isn't that big of a problem, but objects that are consumed as you iterate (like the IO file object) will no longer produce anything once they have been iterated over.
You'll need to iterate over the lines first, and then within each line iterate through A to check the values:
with open('oui.txt', 'r') as f:
    for line in f.readlines():
        for values in A:
            if values[1][0:8].replace(':','') in line:
                values[2]=(line.split('(base 16)')[1].strip())
print (A)
Notice I changed your file opening to use the with context manager instead: once your code exits the with block it will automatically close() the file for you. It is recommended over manually opening the file, as you might forget to close() it afterwards.
In Think Python by Allen Downey, exercise 13-2 asks you to process any .txt file from gutenberg.org and skip the header information, which ends with something like "Produced by". This is the solution that the author gives:
import string

def process_file(filename, skip_header):
    """Makes a dict that contains the words from a file.

    box = temp storage unit to combine two following word in one string
    res = dict
    filename: string
    skip_header: boolean, whether to skip the Gutenberg header
    returns: map from string of two word from file to list of words that comes
    after them
    Last two word in text maps to None"""
    res = {}
    fp = open(filename)
    if skip_header:
        skip_gutenberg_header(fp)
    for line in fp:
        process_line(line, res)
    return res

def process_line(line, res):
    for word in line.split():
        word = word.lower().strip(string.punctuation)
        if word.isalpha():
            res[word] = res.get(word, 0) + 1

def skip_gutenberg_header(fp):
    """Reads from fp until it finds the line that ends the header.

    fp: open file object
    """
    for line in fp:
        if line.startswith('Produced by'):
            break
I really don't understand the flow of execution in this code. Once the code starts reading the file using skip_gutenberg_header(fp), which contains "for line in fp:", it finds the needed line and breaks. However, the next loop picks up right where the break statement left off. But why? My vision of it is that there are two independent iterations here, both containing "for line in fp:", so shouldn't the second one start from the beginning?
No, it shouldn't re-start from the beginning. An open file object maintains a file position indicator, which gets moved as you read (or write) the file. You can also move the position indicator via the file's .seek method, and query it via the .tell method.
So if you break out of a for line in fp: loop you can continue reading where you left off with another for line in fp: loop.
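A small demonstration of this resuming behaviour, using io.StringIO as a stand-in for an open text file (the sample lines are made up):

```python
import io

# io.StringIO stands in for an open text file
fp = io.StringIO("header 1\nProduced by someone\nbody 1\nbody 2\n")

for line in fp:                       # first loop: stops at the marker line
    if line.startswith("Produced by"):
        break

rest = [line.strip() for line in fp]  # second loop: resumes after the marker

fp.seek(0)                            # move the position back to the start
first_again = fp.readline().strip()

print(rest, first_again)  # prints: ['body 1', 'body 2'] header 1
```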
BTW, this behaviour of files isn't specific to Python: all modern languages that inherit C's notion of streams and files work like this.
The .seek and .tell methods are mentioned briefly in the tutorial.
For a more in-depth treatment of file / stream handling in Python, please see the docs for the io module. There's a lot of info in that document, and some of that information is mainly intended for advanced coders. You will probably need to read it several times and write a few test programs to absorb what it says, so feel free to skim through it the first time you try to read... or the first few times. ;)
My vision of it is that there are two independent iterations here, both containing "for line in fp:", so shouldn't the second one start from the beginning?
If fp were a list, then of course they would. However, it's not; it's an iterator. In this case it's a file-like object that has methods like seek, tell, and read. File-like objects keep state: when you read a line from them, the read position in the file advances, so the next read starts at the following line.
This is commonly used to skip the header of tabular data (when you're not using a csv.reader, at least)
with open("/path/to/file") as f:
    headers = next(f).strip()  # first line
    for line in f:
        # iterate by-line for the rest of the file
        ...
Most of what I do involves writing simple parsing scripts that read search terms from one file and search, line by line, another file. Once the search term is found, the line and sometimes the following line are written to another output file. The code I use is rudimentary and likely crude.
#!/usr/bin/env python
data = open("data.txt", "r")
search_terms = ids.read().splitlines()
data.close()
db = open("db.txt", "r")
output = open("output.txt", "w")
for term in search_terms:
    for line in db:
        if line.find(term) > -1:
            next_line = db.next()
            output.write(">" + head + "\n" + next_line)
            print("Found %s" % term)
There are a few problems here. First, I don't think it's the most efficient and fastest to search line by line, but I'm not exactly sure about that. Second, I often run into issues with cursor placement and the cursor doesn't reset to the beginning of the file when the search term is found. Third, while I am usually confident that all of the terms can be found in the db, there are rare times when I can't be sure, so I would like to write to another file whenever it iterates through the entire db and can't find the term. I've tried adding a snippet that counts the number of lines of the db so if the find() function gets to the last line and the term isn't found, then it outputs to another "not found" file, but I haven't been able to get my elif and else loops right.
Overall, I'd just like any hints or corrections that could make this sort of script more efficient and robust.
Thanks.
Unless it's a really big file, why not iterate line by line? If the input file's size is some significant portion of your machine's available resources (memory), then you might want to look into buffered input and other, more low-level abstractions of what the computer is doing. But if you're talking about a few hundred MB or less on a relatively modern machine, let the computer do the computing ;)
Off the bat you might want to get into the habit of using the built-in context manager with. For instance, in your snippet, you don't have a call to output.close().
with open('data.txt', 'r') as f_in:
    search_terms = f_in.read().splitlines()
Now search_terms is a handle to a list that has each line from data.txt as a string (but with the newline characters removed). And data.txt is closed thanks to with.
In fact, I would do that with the db.txt file, also.
with open('db.txt', 'r') as f_in:
    lines = f_in.read().splitlines()
Context managers are cool.
As a side note, you could open your destination file now, and do your parsing and results-tracking with it open the whole time, but I like leaving as many files closed as possible for as long as possible.
I would suggest setting the biggest object on the outside of your loop, which I'm guessing is the db.txt contents. The outermost loop usually only gets iterated once, so you might as well put the biggest thing there.
results = []
for i, line in enumerate(lines):
    for term in search_terms:
        if term in line:
            # Use something not likely to appear in your line as a separator
            # for these "second lines". I used three pipe characters, but
            # you could just as easily use something even more random
            # The last line has no following line, so fall back to ''
            next_line = lines[i + 1] if i + 1 < len(lines) else ''
            results.append('{}|||{}'.format(line, next_line))
if results:
    with open('output.txt', 'w') as f_out:
        for result in results:
            # Don't forget to replace your custom field separator
            f_out.write('> {}\n'.format(result.replace('|||', '\n')))
else:
    with open('no_results.txt', 'w') as f_out:
        # This will write an empty file to disk
        pass
The nice thing about this approach is that each line in db.txt is checked once for each search_term in search_terms. However, the downside is that any line will be recorded once for each search term it contains, i.e., if it has three search terms in it, that line will appear in your output.txt three times.
And all the files are magically closed.
Context managers are cool.
Good luck!
search_terms keeps the whole of data.txt in memory. That's not good in general, but in this case it's not too bad.
Searching line by line is not efficient, but if the case is simple and the files are not too big it's not a big deal. If you want more efficiency you should sort the data.txt file and put it into some tree-like structure. It depends on the data inside.
You have to use seek to move the pointer back after using next.
Probably the easiest way here is to generate two lists of lines and search using in, like:
db = open('db.txt').readlines()
db_words = [x.split() for x in db]
# splitlines() drops the trailing newlines so terms can match whole words
data = open('data.txt').read().splitlines()
print('Lines in db {}'.format(len(db)))
for item in data:
    for words in db_words:
        if item in words:
            print("Found {}".format(item))
Your key issue is that you may be looping in the wrong order -- in your code as posted, you'll always exhaust the db looking for the first term, so after the first pass of the outer for loop db will be at end, no more lines to read, no other term will ever be found.
Other improvements include using the with statement to guarantee file closure, and a set to track which search terms were not found. (There are also typos in your posted code, such as opening a file as data but then reading it as ids).
So, for example, something like:
with open("data.txt", "r") as data:
    search_terms = data.read().splitlines()
missing_terms = set(search_terms)
with open("db.txt", "r") as db, open("output.txt", "w") as output:
    for line in db:
        for term in search_terms:
            if term in line:
                missing_terms.discard(term)
                # next(db, "") also works on Python 3 (db.next() is Python 2
                # only) and yields "" if the match is on the last line
                next_line = next(db, "")
                # note: head is undefined here, as in the question's code
                output.write(">" + head + "\n" + next_line)
                print("Found {}".format(term))
                break
if missing_terms:
    diagnose_not_found(missing_terms)
where the diagnose_not_found function does whatever you need to do to warn the user about missing terms.
There are assumptions embedded here, such as the fact that you don't care if some other search term is present in a line where you've found a previous one, or the very next one; they might take substantial work to fix if not applicable and it will require that you edit your Q with a very complete and unambiguous list of specifications.
If your db is actually small enough to comfortably fit in memory, slurping it all in as a list of lines once and for all would allow easier accommodation for more demanding specs (as in that case you can easily go back and forth, while iterating on a file means you can only go forward one line at a time). So if your specs are indeed more demanding, please also clarify whether this crucial condition holds, or whether you need this script to process potentially humongous db files (say gigabyte-plus sizes, so as to not "comfortably fit in memory", depending on your platform of course).
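For the in-memory case, a sketch along these lines might be a starting point. The function name search and the sample data are made up, and it assumes the "following line" is simply the next element in the list. Unlike the file-based loop above, it does not break after the first hit, so a line containing several terms is reported once per term.

```python
def search(db_lines, search_terms):
    """Return (hits, missing) for search_terms looked up in db_lines.

    hits is a list of (term, matching line, following line) tuples and
    missing is the set of terms that never matched.
    """
    hits = []
    missing = set(search_terms)
    for i, line in enumerate(db_lines):
        # the last line has no successor, so use an empty string there
        following = db_lines[i + 1] if i + 1 < len(db_lines) else ""
        for term in search_terms:
            if term in line:
                hits.append((term, line, following))
                missing.discard(term)
    return hits, missing


# Made-up sample data standing in for db.txt and data.txt
db_lines = [">id1 alpha", "ACGT", ">id2 beta", "TTGA"]
hits, missing = search(db_lines, ["id2", "id9"])
print(hits, missing)  # prints: [('id2', '>id2 beta', 'TTGA')] {'id9'}
```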