Python - Handling an nth line hop with readlines()

I'm having a go at fixing a broken lib on GitHub that I want to use.
I have locally "fixed" the problem, but I don't think it's a very clean method...
I'm poking at the WARC library by the Internet Archive, specifically the arc.py part (https://github.com/internetarchive/warc/blob/master/warc/arc.py).
Since the lib was written, the tools that make the ARC files have changed a bit, and as a result the built-in parser fails, as it's not expecting to see some metadata in the file.
My local fix looks like this:
if header.startswith("<arcmetadata"):
    while not header.endswith("</arcmetadata>\n"):
        header = self.fileobj.readline()
    header = self.fileobj.readline()
    header = self.fileobj.readline()
And I'm not sure that calling readline() twice to strip off the next two empty lines (containing "\n") is the cleanest way of advancing through the file object.
Is this good Python, or is there a better way?

The code looks like a copy/paste error. There is nothing wrong with using .readline(), just document what you are doing:
# skip metadata
if header.startswith("<arcmetadata"):
    while not header.endswith("</arcmetadata>\n"):
        header = self.fileobj.readline()
    # NOTE: header ends with "</arcmetadata>\n" here, i.e., it is not blank
    # skip blank lines
    while not header.strip():
        header = self.fileobj.readline()
By the way, if the file contains XML, use an XML parser to parse it. Don't do it by hand.

Although there's nothing inherently wrong with what you're doing, it might be more semantic to write:
next(self.fileobj, None)
without a variable assignment to signify that you are tossing the next line.
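For example, a minimal sketch of the same skip logic rewritten with next() (assuming a Python 3 io file object, where readline() and next() can safely be mixed on self.fileobj):
# skip the metadata block, then discard the two blank lines that follow it
if header.startswith("<arcmetadata"):
    while not header.endswith("</arcmetadata>\n"):
        header = self.fileobj.readline()
    next(self.fileobj, None)  # toss the first blank line
    next(self.fileobj, None)  # toss the second blank line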

itertools may be of use here:
from itertools import islice, dropwhile

if header.startswith("<arcmetadata"):
    fileobj = dropwhile(lambda x: not x.endswith("</arcmetadata>\n"), fileobj)
    fileobj = islice(fileobj, 2, None)
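One thing to keep in mind: after the wrapping, fileobj is an itertools iterator rather than the original file object, so later code has to pull lines from it with next() or a for loop instead of readline(). A rough usage sketch (skip_arc_metadata is a made-up helper name):
from itertools import islice, dropwhile

def skip_arc_metadata(fileobj, header):
    # hypothetical helper wrapping the snippet above
    if header.startswith("<arcmetadata"):
        # drop lines until the one ending with the closing tag is reached...
        fileobj = dropwhile(lambda x: not x.endswith("</arcmetadata>\n"), fileobj)
        # ...then skip that line plus one more
        fileobj = islice(fileobj, 2, None)
    return fileobj

# subsequent reads come from the returned iterator, e.g.:
# header = next(skip_arc_metadata(fileobj, header), "")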

Related

Webbrowser() reading through a text file for URLS

I am trying to write a script to automate browsing to my most commonly visited websites. I have put the websites into a list and am trying to open it using the webbrowser() module in Python. My code looks like the following at the moment:
import webbrowser

f = open("URLs", "r")
list = f.readline()
for line in list:
    webbrowser.open_new_tab(list)
This only reads the first line from my file "URLs" and opens it in the browser. Could anyone please help me understand how I can read through the entire file and also open the URLs in different tabs?
Any other options that achieve the same would also help.
You have two main problems.
The first problem you have is that you are using readline and not readlines. readline will give you the first line in the file, while readlines gives you a list of your file contents.
Take this file as an example:
# urls.txt
http://www.google.com
http://www.imdb.com
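A quick sketch of the difference, assuming urls.txt contains just the two URLs above:
f = open("urls.txt")
f.readline()   # -> 'http://www.google.com\n'   (a single line, as one string)
f.close()

f = open("urls.txt")
f.readlines()  # -> ['http://www.google.com\n', 'http://www.imdb.com\n']   (all lines, as a list)
f.close()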
Also, get into the habit of using a context manager, as this will close the file for you once you have finished reading from it. Right now, even though there is no real danger for what you are doing, you are leaving your file open.
Here is the information from the documentation on files. There is a mention about best practices with handling files and using with.
The second problem in your code is that, when you are iterating over list (which you should not use as a variable name, since it shadows the builtin list), you are passing list into your webbrowser call. This is definitely not what you are trying to do. You want to pass the loop variable instead.
So, taking all this in to mind, your final solution will be:
import webbrowser

with open("urls.txt") as f:
    for url in f:
        webbrowser.open_new_tab(url.strip())
Note the strip that is called in order to ensure that newline characters are removed.
You're not reading the file properly. You're only reading the first line. Also, assuming you were reading the file properly, you're still trying to open list, which is incorrect. You should be trying to open line.
This should work for you:
import webbrowser

with open('file name goes here') as f:
    all_urls = f.read().split('\n')
    for each_url in all_urls:
        webbrowser.open_new_tab(each_url)
My answer assumes that you have the URLs one per line in the text file. If they are separated by spaces, simply change the line to all_urls = f.read().split(' '). If they're separated in another way, just change the split accordingly.

How to copy a JSON file in another JSON file, with Python

I want to copy the contents of a JSON file into another JSON file, with Python.
Any ideas?
Thank you :)
Given the lack of research effort, I normally wouldn't answer, but given the poor suggestions in comments, I'll bite and give a better option.
Now, this largely depends on what you mean: do you wish to overwrite the contents of one file with another, or insert? The latter can be done like so:
with open("from.json", "r") as from, open("to.json", "r") as to:
to_insert = json.load(from)
destination = json.load(to)
destination.append(to_insert) #The exact nature of this line varies. See below.
with open("to.json", "w") as to:
json.dump(to, destination)
This uses python's json module, which allows us to do this very easily.
We open the two files for reading, then open the destination file again in writing mode to truncate it and write to it.
The marked line depends on the JSON data structure; here I am appending it to the root list element (which might not exist), but you may want to place it at a particular dict key, or some such.
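For instance, if the destination document is a JSON object rather than a list, the marked line might instead look something like this (the "imported" key is just an illustrative name):
destination["imported"] = to_insert  # attach the loaded data under a chosen key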
In the case of replacing the contents, it becomes easier:
with open("from.json", "r") as from, open("to.json", "w") as to:
to.write(from.read())
Here we literally just read the data out of one file and write it into the other file.
Of course, you may wish to check the data is JSON, in which case, you can use the JSON methods as in the first solution, which will throw exceptions on invalid data.
Another, arguably better, solution to this could also be shutil's copy methods, which would avoid actually reading or writing the file contents manually.
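A minimal sketch of that approach, reusing the same file names:
import shutil

# copies the raw file contents; nothing is parsed or validated as JSON
shutil.copyfile("from.json", "to.json")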
Using the with statement gives us the benefit of automatically closing our files - even if exceptions occur. It's best to always use them where we can.
Note that in versions of Python before 2.7, multiple context managers are not handled by a single with statement; instead you will need to nest them:
with open("from.json", "r") as from:
with open("to.json", "r+") as to:
...

python join "large" file

In Python I have read a file into a list using file.readlines(). Later on, after some logic, I would like to put it back together into a string using fileString = ''.join(file). For some reason, even without a print function, it prints fileString out to the console up to a certain point, and then it just stops. It does not run the rest of the program, which is not useful for me.
Why does join do this, and how do I perhaps pre-allocate how much memory I would like my list/string to use so that it does not stop? Or is there some other solution?
Thank you
File is your file pointer in memory. When you attempt to join on it, you don't actually have a string to work with.
How about this?
with open(file, 'rb') as myfile:
    strings = myfile.readlines()

# do your stuff to strings
filestring = ''.join(strings)
Note that strings is a list of lines like this:
['my line\n', 'my other line!\n']
And as such, a large file will require quite a bit of memory. You may be better served by building a mini filter.
You should also consider what you are going to do with the resulting string. If you just want to write the contents back to a file, there is no need to join the parts first; you can use file.writelines(strings) directly.
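A small sketch of that shortcut (input.txt and output.txt are placeholder names):
with open("input.txt") as src:
    strings = src.readlines()

# ... modify individual entries of strings here ...

with open("output.txt", "w") as dst:
    dst.writelines(strings)  # no ''.join() needed; each line is written as-is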

Delineating a Read File

Not really too sure how to word this question, so if you don't particularly understand it then I can try again.
I have a file called example.txt and I'd like to import this into my Python program. Here I will do some calculations with what it contains, and other things that are irrelevant.
Instead of me importing this file, going through it line-by-line and extracting the information I want... can Python do it instead? As in, if I structure the .txt correctly (whether it be key/value pairs separated by an equals sign on each line), is there a current Python 'way' where it can handle it all and I work with that?
with open("example.txt") as f:
for line in f:
key, value = line.strip().split("=")
do_something(key,value)
looks like a starting point if I understand you correctly. You need Python 2.6 or 3.x for this.
Another place to look is the csv module that can parse comma-separated value files - and you can tell it to use = as a separator instead. This will abstract away some of the "manual work" in that previous example - but it seems your example doesn't especially need that kind of abstraction.
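A rough sketch of that idea, reusing the do_something placeholder from the snippet above:
import csv

with open("example.txt") as f:
    for key, value in csv.reader(f, delimiter="="):
        do_something(key, value)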
Another idea:
with open("example.txt") as f:
d = dict([line.strip().split("=") for line in f])
Now that's concise and pythonic :)
for line in open("file"):
    key, value = line.strip().split("=")
    key = key.strip()
    value = value.strip()
    do_something(key, value)
There's also another method - you can create a valid python file (let it be a list, dict definition or whatever else), read its content using
f = open('file.txt', 'r')
content = f.read()  # assuming the file isn't too long
And then just parse it:
parsedContent = eval(content)
You can pass any environment to eval (see the docs), so it might not have access to your globals and locals. This is evil and wrong, but in a small program that won't be distributed and won't get 'file.txt' from the network or from a so-called malicious user, you can use it.
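For example, a minimal sketch of passing a restricted environment to eval (this limits which names the expression can see, though it is still not a real sandbox):
content = "{'host': 'localhost', 'port': 8080}"   # e.g. the text read from file.txt

# empty globals/locals: the expression cannot touch our own variables or builtins
parsed = eval(content, {"__builtins__": {}}, {})
print(parsed["port"])   # 8080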

How can I change a huge file into csv in python

I'm a beginner in Python. I have a huge text file (hundreds of GB) and I want to convert it into a csv file. In my text file, I know the row delimiter is the string "<><><><><><><>". If a line contains that string, I want to replace it with ". Is there a way to do it without having to read the old file and rewrite a new file?
Normally I thought I need to do something like this:
fin = open("input", "r")
fout = open("outpout", "w")
line = f.readline
while line != "":
if line.contains("<><><><><><><>"):
fout.writeline("\"")
else:
fout.writeline(line)
line = f.readline
but copying hundreds of GB is wasteful. Also I don't know if open will eat lots of memory (does it treat file handler as a stream?)
Any help is greatly appreciated.
Note: an example of the file would be
file.txt
<><><><><><><>
abcdefeghsduai
asdjliwa
1231214 ""
<><><><><><><>
would be one row and one column in csv.
@richard-levasseur
I agree, sed seems like the right way to go. Here's a rough cut at what the OP describes:
sed -i -e's/<><><><><><><>/"/g' foo.txt
This will do the replacement in-place in the existing foo.txt. For that reason, I recommend having the original file under some sort of version control; any of the DVCS should fit the bill.
Yes, open() treats the file as a stream, as does readline(). It'll only read the next line. If you call read(), however, it'll read everything into memory.
Your example code looks ok at first glance. Almost every solution will require you to copy the file elsewhere. It's not exactly easy to modify the contents of a file in place without a 1:1 replacement.
It may be faster to use some standard unix utilities (awk and sed most likely), but I lack the unix and bash-fu necessary to provide a full solution.
It's only wasteful if you don't have disk to spare. That is, fix it when it's a problem. Your solution looks ok as a first attempt.
It's not wasteful of memory because a file handler is a stream.
Reading lines is simply done using a file iterator:
for line in fin:
    if "<><><><><><><>" in line:
        fout.write('"\n')
Also consider the CSV writer object to write CSV files, e.g:
import csv
writer = csv.writer(open("some.csv", "wb"))
writer.writerows(someiterable)
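A rough Python 3 sketch of how the file iterator and csv.writer could fit together for this input format (each block between delimiter lines becomes one CSV row; input.txt and some.csv are placeholder names):
import csv

row_delim = "<><><><><><><>"

with open("input.txt") as fin, open("some.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    row = []
    for line in fin:
        if row_delim in line:
            if row:
                writer.writerow(row)   # csv module handles quoting/escaping
            row = []
        else:
            row.append(line.strip())
    if row:
        writer.writerow(row)   # flush the final row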
With Python you will have to create a new file for safety's sake; it will cause a lot fewer headaches than trying to write in place.
The code below reads your input one line at a time and buffers the columns (from what I understood, your test input file was one row), and then once the end-of-row delimiter is hit it writes that buffer to disk, flushing manually every 1000 lines of the original file. This will save some IO as well: instead of writing every segment, 1000 writes of 32 bytes each will be faster than 4000 writes of 8 bytes.
fin = open(input_fn, "rb")
fout = open(output_fn, "wb")
row_delim = "<><><><><><><>"
write_buffer = []

for i, line in enumerate(fin):
    if not i % 1000:
        fout.flush()
    if row_delim in line and i:
        fout.write('"%s"\r\n' % '","'.join(write_buffer))
        write_buffer = []
    else:
        write_buffer.append(line.strip())
Hope that helps.
EDIT: Forgot to mention that while using .readline() is not a bad thing, don't use .readlines(), which will read the entire content of the file into a list containing each line; that is incredibly inefficient. Using the built-in iterator that comes with a file object gives the best memory usage and speed.
@Constatin suggests that if you would be satisfied with replacing '<><><><><><><>\n' by '"' padded with spaces up to the '\n', then the replacement string is the same length, and in that case you can craft a solution for in-place editing with mmap. You will need Python 2.6. It's vital that the file is opened in the right mode!
import mmap, os

CHUNK = 2**20

oldStr = '<><><><><><><>\n'
newStr = '"' + ' ' * (len(oldStr) - 2) + '\n'  # padded with spaces to the same length
strLen = len(oldStr)
assert strLen == len(newStr)

f = open("myfilename", "r+")
size = os.fstat(f.fileno()).st_size

for offset in range(0, size, CHUNK):
    map = mmap.mmap(f.fileno(),
                    length=min(CHUNK + strLen, size - offset),  # not beyond EOF
                    offset=offset)
    index = 0  # start at beginning
    while 1:
        index = map.find(oldStr, index)  # find next match
        if index == -1:  # no more matches in this map
            break
        map[index:index + strLen] = newStr
f.close()
This code is not debugged! It works for me on a 3 MB test case, but it may not work on a large ( > 2GB) file - the mmap module still seems a bit immature, so I wouldn't rely on it too much.
Looking at the bigger picture, from what you've posted it isn't clear that your file will end up as valid CSV. Also be aware that the tool you're planning to use to actually process the CSV may be flexible enough to deal with the file as it stands.
If you're delimiting fields with double quotes, it looks like you need to escape the double quotes that occur in your elements (for example, 1231214 "" will need to become 1231214 \"\").
Something like
fin = open("input", "r")
fout = open("output", "w")
for line in fin:
if line.contains("<><><><><><><>"):
fout.writeline("\"")
else:
fout.writeline(line.replace('"',r'\"')
fin.close()
fout.close()
[For the problem exactly as stated] There's no way that this can be done without copying the data, in python or any other language. If your processing always replaced substrings with new substrings of equal length, maybe you could do it in-place. But whenever you replace <><><><><><><> with " you are changing the position of all subsequent characters in the file. Copying from one place to another is the only way to handle this.
EDIT:
Note that the use of sed won't actually save any copying...sed doesn't really edit in-place either. From the GNU sed manual:
-i[SUFFIX]
--in-place[=SUFFIX]
This option specifies that files are to be edited in-place. GNU sed does this by creating a temporary file and sending output to this file rather than to the standard output.
(emphasis mine.)
