I am using Python to crawl web pages, and I am doing it iteratively, so I am using 3 HTML files to store the pages. Somehow these files are not getting overwritten and I am still getting the old content. Here is the code I am using:
def Vals(a,b):
    file1="C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file1.html"
    file2="C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file22.html"
    file3="C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file33.html"

    Query1='"http://scholar.google.com/scholar?q=%22'+a+'%22&btnG=&hl=en&as_sdt=0%2C24"'
    URL1='wget --user-agent Mozilla '+Query1+' -O '+file1
    Query2='"http://scholar.google.com/scholar?q=%22'+b+'%22&btnG=&hl=en&as_sdt=0%2C24"'
    URL2='wget --user-agent Mozilla '+Query2+' -O '+file2
    Query3='"http://scholar.google.com/scholar?q=%22'+a+'%22+%22'+b+'%22&btnG=&hl=en&as_sdt=0%2C24"'
    URL3='wget --user-agent Mozilla '+Query3+' -O '+file3

    ## print Query1
    ## print Query2
    ## print Query3
    ##
    ## print URL1
    ## print URL2
    ## print URL3

    os.system("wget "+ URL1)
    os.system("wget "+ URL2)
    os.system("wget "+ URL3)

    f1 = open(file1,'r+')
    f2 = open(file2,'r+')
    f3 = open(file3,'r+')

    S1=str(f1.readlines())
    start=S1.find("About")+6
    stop=S1.find("results",start)-1
    try:
        val1=float((S1[start:stop]).replace(",",""))
    except ValueError:
        val1=Reads('C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file1.html')

    S1=str(f2.readlines())
    #f2.close()
    start=S1.find("About")+6
    stop=S1.find("results",start)-1
    try:
        val2=float((S1[start:stop]).replace(",",""))
    except ValueError:
        val2=Reads('C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file22.html')

    S1=str(f3.readlines())
    #f3.close()
    start=S1.find("About")+6
    stop=S1.find("results",start)-1
    try:
        val3=float((S1[start:stop]).replace(",",""))
    except ValueError:
        val3=Reads('C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file33.html')

    f1.truncate()
    f2.truncate()
    f3.truncate()

    f1.close()
    f2.close()
    f3.close()

    return (val1,val2,val3)
Can anyone tell me if there is some error in how I am closing the files, or how I should close them for my purpose?
Thanks
You're using the -O (capital O) option, which concatenates everything into one file.
‘-O file’
‘--output-document=file’
The documents will not be written to the appropriate files, but all will be concatenated together and written to file. If ‘-’ is used as file, documents will be printed to standard output, disabling link conversion. (Use ‘./-’ to print to a file literally named ‘-’.)
Use of ‘-O’ is not intended to mean simply “use the name file instead of the one in the URL;” rather, it is analogous to shell redirection: wget -O file http://foo is intended to work like wget -O - http://foo > file; file will be truncated immediately, and all downloaded content will be written there.
For this reason, ‘-N’ (for timestamp-checking) is not supported in combination with ‘-O’: since file is always newly created, it will always have a very new timestamp. A warning will be issued if this combination is used.
Similarly, using ‘-r’ or ‘-p’ with ‘-O’ may not work as you expect: Wget won't just download the first file to file and then download the rest to their normal names: all downloaded content will be placed in file. This was disabled in version 1.11, but has been reinstated (with a warning) in 1.11.2, as there are some cases where this behavior can actually have some use.
Note that a combination with ‘-k’ is only permitted when downloading a single document, as in that case it will just convert all relative URIs to external ones; ‘-k’ makes no sense for multiple URIs when they're all being downloaded to a single file; ‘-k’ can be used only when the output is a regular file.
This snippet was taken from wget's manual.
Hope this helps.
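For reference, here is a minimal sketch (not the original code) of how each query could be fetched with one wget call per URL and its own -O target, so each file is freshly truncated on every run; building the command as a list for subprocess also sidesteps the shell-quoting issues of concatenating strings for os.system. The example URL and file name are placeholders.
import subprocess

def fetch(url, outfile):
    # One URL, one -O target: wget truncates outfile and writes the fresh download there.
    subprocess.call(['wget', '--user-agent', 'Mozilla', url, '-O', outfile])

fetch('http://scholar.google.com/scholar?q=%22' + 'termA' + '%22&btnG=&hl=en&as_sdt=0%2C24', 'file1.html')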
import requests

res = requests.get("https://google.com")
res.raise_for_status()
print(len(res.text))
print(res.text)

with open('mygoogle.html', 'w', encoding='utf-8') as f:
    f.write(res.text)
The with open(...) as f part doesn't work. It should create a new file, but it doesn't.
I have tested this code and it works properly: mygoogle.html gets written and contains the expected content. The issue is likely outside the script, then. Since the script uses a relative path, the file could be getting written somewhere you don't expect because the working directory is not set how you expect. Windows often defaults to C:\Windows\system32 or your user profile directory (%userprofile%, i.e. C:\users\<username>) for new cmd instances.
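If it helps to confirm where the relative path is resolving, a quick check (just a sketch to print the target location) is:
import os

print(os.getcwd())                        # directory a relative filename resolves against
print(os.path.abspath('mygoogle.html'))   # absolute path that open('mygoogle.html', 'w') will write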
My specific question is about what exactly zipfile.ZipFile('myfile.zip').extractall(pwd=thepassword) actually does. I'm working through some pentest tutorials and have written the following snippet.
There are two files that the script acts upon. The first is a zip file that was created using zip -P very_long_random_string_password -r myfile.zip myfiles.txt. This is on Kali Linux from the CLI with Python 3 (3.9). The second file is a list of common passwords from GitHub (passlist.txt).
#Python3
from zipfile import ZipFile

with open('passlist.txt', 'r') as f:
    for line in f:
        password = line.strip('\n')
        password = password.encode('utf-8')
        try:
            foundpass = ZipFile('myfile.zip').extractall(pwd=password)
            if foundpass == None:
                print("\nPassword: ", password.decode())
                break
        except RuntimeError:
            pass
The expected result would be for the script to try every password in passlist.txt and, if the extractall method didn't throw a RuntimeError, to print out the successful password. If a password is wrong, the program continues.
This does in fact work, and I've tried multiple passwords that are caught as expected based on passwords in the passlist.txt. BUT... I wanted to have the script run unsuccessfully and got an unexpected result.
Using -P pashdshGivgisudhfagn9879te6rtq6rr in the zip process, a password which doesn't exist in passlist.txt, resulted in successfully unlocking the zip file with the password rabbit11. As it turns out, I can unzip the file with either rabbit11 or pashdshGivgisudhfagn9879te6rtq6rr.
I've run this several times now with different very_long_random_string_passwords and been able to find corresponding simple_passwords that will unlock the zip file.
Why? I don't see any pattern to encoded text and this seems to be odd behavior.
Other examples (I actually haven't found one that doesn't work):
Complex: 444hdshGivgisudhfagn9879te6rtq6rr
Simple: rugby12
Complex: rszv8FoGGM6JRWGX
Simple: lickme
First of all, thank you to everyone on Stack Overflow for past, present, and future help. You've all saved me from disaster (both of my own design and otherwise) too many times to count.
The present issue is part of a decision at my firm to transition from a Microsoft SQL Server 2005 database to PostgreSQL 9.4. We have been following the notes on the Postgres wiki (https://wiki.postgresql.org/wiki/Microsoft_SQL_Server_to_PostgreSQL_Migration_by_Ian_Harding), and these are the steps we're following for the table in question:
Download table data [on Windows client]:
bcp "Carbon.consensus.observations" out "Carbon.consensus.observations" -k -S [servername] -T -w
Copy to Postgres server [running CentOS 7]
Run Python pre-processing script on Postgres server to change encoding and clean:
import sys
import os
import re
import codecs
import fileinput

base_path = '/tmp/tables/'
cleaned_path = '/tmp/tables_processed/'
files = os.listdir(base_path)

for filename in files:
    source_path = base_path + filename
    temp_path = '/tmp/' + filename
    target_path = cleaned_path + filename
    BLOCKSIZE = 1048576  # or some other, desired size in bytes
    with open(source_path, 'r') as source_file:
        with open(target_path, 'w') as target_file:
            start = True
            while True:
                contents = source_file.read(BLOCKSIZE).decode('utf-16le')
                if not contents:
                    break
                if start:
                    if contents.startswith(codecs.BOM_UTF8.decode('utf-8')):
                        contents = contents.replace(codecs.BOM_UTF8.decode('utf-8'), ur'')
                contents = contents.replace(ur'\x80', u'')
                contents = re.sub(ur'\000', ur'', contents)
                contents = re.sub(ur'\r\n', ur'\n', contents)
                contents = re.sub(ur'\r', ur'\\r', contents)
                target_file.write(contents.encode('utf-8'))
                start = False
    for line in fileinput.input(target_path, inplace=1):
        if '\x80' in line:
            line = line.replace(r'\x80', '')
        sys.stdout.write(line)
Execute SQL to load table:
COPY consensus.observations FROM '/tmp/tables_processed/Carbon.consensus.observations';
The issue is that the COPY command is failing with a unicode error:
[2015-02-24 19:52:24] [22021] ERROR: invalid byte sequence for encoding "UTF8": 0x80
Where: COPY observations, line 2622420: "..."
Given that this could very likely be because of bad data in the table (which also contains legitimate non-ASCII characters), I'm trying to find the actual byte sequence in context, and I can't find it anywhere (using sed to look at the line in question, regexes to replace the character as part of the preprocessing, etc.). For reference, this grep returns nothing:
cat /tmp/tables_processed/Carbon.consensus.observations | grep --color='auto' -P "[\x80]"
What am I doing wrong in tracking down where this byte sequence sits in context?
I would recommend loading the SQL file (which appears to be /tmp/tables_processed/Carbon.consensus.observations) into an editor that has a hex mode. This should allow you to see it (depending on the exact editor) in context.
gVim (or terminal-based Vim) is one option I would recommend.
For example, if I open in gVim an SQL copy file that has this content:
1 1.2
2 1.1
3 3.2
I can then convert it into hex mode via the command %!xxd (in gVim or terminal Vim) or the menu option Tools > Convert to HEX.
That yields this display:
0000000: 3109 312e 320a 3209 312e 310a 3309 332e 1.1.2.2.1.1.3.3.
0000010: 320a 2.
You can then run %!xxd -r to convert it back, or the Menu option Tools > Convert back.
Note: This actually modifies the file, so it would be advisable to do this to a copy of the original, just in case the changes somehow get written (you would have to explicitly save the buffer in Vim).
This way, you can see both the hex sequences on the left, and their ASCII equivalent on the right. If you search for 80, you should be able to see it in context. With gVim, the line numbering will be different for both modes, though, as is evidenced by this example.
It's likely the first 80 you find will be that line, though, since if there were earlier ones, it likely would've failed on those instead.
Another tool which might help that I've used in the past is the graphical hex editor GHex. Since that's a GNOME project, not quite sure it'll work with CentOS. wxHexEditor supposedly works with CentOS and looks promising from the website, although I've not yet used it. It's pitched as a "hex editor for massive files", so if your SQL file is large, that might be the way to go.
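If you prefer to stay on the command line, a small Python sketch (the path is assumed from the question; adjust as needed, and note it reads the whole file into memory) can report the byte offset and line number of the first raw 0x80 byte along with some surrounding bytes:
path = '/tmp/tables_processed/Carbon.consensus.observations'
with open(path, 'rb') as f:
    data = f.read()

offset = data.find(b'\x80')
if offset == -1:
    print('no 0x80 byte found')
else:
    line_no = data.count(b'\n', 0, offset) + 1   # 1-based line number
    context = data[max(0, offset - 30):offset + 30]
    print('first 0x80 at byte offset %d, line %d' % (offset, line_no))
    print(repr(context))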
I'm adding functionality to our website so that users can download files stored in a database. The problem is that I cannot properly specify the filename for the user; the user is instead prompted to save the file with the name of the main Python script running the website. I am setting the Content-Disposition header but it's not working as expected. I've edited the code down to the following, which still fails to work:
import sys, os
import mydatabasemodule
PDFReport = [...read file from database ...]
print('Content-Type: application/octet-stream\n')
print('Content-Disposition: attachment; filename=\"mytest.pdf\"\n')
print(PDFReport)
sys.stdout.close()
Running this code prompts the user to download the file as mysite.py. The PDF downloads correctly just with the wrong filename.
Can anyone tell what I'm doing wrong here? In the full version of the code, I also set Content-Description and Content-Length but that also fails. The files are small and I am trying to avoid saving them to disk but even when I do so, the same problem happens.
[edit]
The webserver is running CentOS 5.5, Python 2.4.3, Apache 2.2.3, and mod_python. I've tested this on an Ubuntu 11.04 client using Google Chrome 17.0.963.46 beta and Firefox 13. If I instead try to show the PDF inline:
print('Content-type: application/pdf\n')
print('Content-Disposition: inline; filename=\"mytest.pdf\"\n')
print("Content-Length: %d" % len(report))
then Chrome shows the PDF (with a plugin) and Firefox asks to save the file, recognizing it as a PDF but still with the wrong filename i.e. the filename is still the script name.
[edit]
The solution was given below by Mike. I think the problem was the newline I added in the first line above. Since print adds a newline, this second newline signaled the end of the header so the Content-Disposition line was never read. Thanks to all for the quick help!
In Python versions < 3.0, print is not a function and automatically adds a newline character. Try this:
import sys, os
import mydatabasemodule
PDFReport = [...read file from database ...]
print 'Content-Type: application/octet-stream'
print 'Content-Disposition: attachment; filename="mytest.pdf"'
print
sys.stdout.write(PDFReport)
sys.stdout.flush()
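For what it's worth, an equivalent sketch that makes the header/body boundary explicit by writing the line endings by hand (so no implicit newlines from print are involved) would be:
import sys

sys.stdout.write('Content-Type: application/octet-stream\r\n')
sys.stdout.write('Content-Disposition: attachment; filename="mytest.pdf"\r\n')
sys.stdout.write('\r\n')        # blank line: end of headers, start of body
sys.stdout.write(PDFReport)     # PDFReport assumed to already hold the PDF bytes
sys.stdout.flush()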
I have a script which connects to a database and gets all records which satisfy the query. These record results are files present on a server, so now I have a text file which has all the file names in it.
I want a script which would tell me:
What is the size of each file in the output.txt file?
What is the total size of all the files present in that text file?
Update:
I would like to know how I can achieve my task using the Perl programming language; any inputs would be highly appreciated.
Note: I do not have any specific language constraint; it could be either a Perl or a Python script which I can run from the Unix prompt. Currently I am using the bash shell and have an sh and a py script. How can this be done?
My scripts:
#!/usr/bin/ksh
export ORACLE_HOME=database specific details
export PATH=$ORACLE_HOME/bin:path information
sqlplus database server information<<EOF
SET HEADING OFF
SET ECHO OFF
SET PAGESIZE 0
SET LINESIZE 1000
SPOOL output.txt
select * from my table_name;
SPOOL OFF
EOF
I know du -h is the command I should be using, but I am not sure what my script should look like, so I have tried something in Python. I am totally new to Python and this is my first effort.
Here it is:
import os

folderpath='folder_path'
file=open('output file which has all listing of query result','r')

for line in file:
    filename=line.strip()
    filename=filename.replace(' ', '\ ')
    fullpath=folderpath+filename
    # print (fullpath)
    os.system('du -h '+fullpath)
File names in the output text file for example are like: 007_009_Bond Is Here_009_Yippie.doc
Any guidance would be highly appreciated.
Update:
How can I move all the files which are present in the output.txt file to some other folder location using Perl?
After doing step 1, how can I delete all the files which are present in the output.txt file?
Any suggestions would be highly appreciated.
In Perl, the -s filetest operator is probably what you want.
use strict;
use warnings;
use File::Copy;
my $folderpath = 'the_path';
my $destination = 'path/to/destination/directory';
open my $IN, '<', 'path/to/infile';
my $total;
while (<$IN>) {
    chomp;
    my $size = -s "$folderpath/$_";
    print "$_ => $size\n";
    $total += $size;
    move("$folderpath/$_", "$destination/$_") or die "Error when moving: $!";
}
print "Total => $total\n";
Note that -s gives size in bytes not blocks like du.
On further investigation, Perl's -s is equivalent to du -b. You should probably read the man pages on your specific du to make sure that you are actually measuring what you intend to measure.
If you really want the du values, change the assignment to $size above to:
my ($size) = split(' ', `du "$folderpath/$_"`);
Eyeballing it, you can make your script work this way:
1) Delete the line filename=filename.replace(' ', '\ '). Escaping is more complicated than that; you should just quote the full path or use a Python library to escape it based on the specific OS;
2) You are probably missing a delimiter between the path and the file name;
3) You need single quotes around the full path in the call to os.system.
This works for me:
#!/usr/bin/python
import os
folderpath='/Users/andrew/bin'
file=open('ft.txt','r')
for line in file:
    filename=line.strip()
    fullpath=folderpath+"/"+filename
    os.system('du -h '+"'"+fullpath+"'")
The file "ft.txt" has file names with no path and the path part is '/Users/andrew/bin'. Some of the files have names that would need to be escaped, but that is taken care of with the single quotes around the file name.
That will run du -h on each file in the .txt file, but does not give you the total. This is fairly easy in Perl or Python.
Here is a Python script (based on yours) to do that:
#!/usr/bin/python
import os
folderpath='/Users/andrew/bin/testdir'
file=open('/Users/andrew/bin/testdir/ft.txt','r')
blocks=0
i=0
template='%d total files in %d blocks using %d KB\n'
for line in file:
    i+=1
    filename=line.strip()
    fullpath=folderpath+"/"+filename
    if(os.path.exists(fullpath)):
        info=os.stat(fullpath)
        blocks+=info.st_blocks
        print `info.st_blocks`+"\t"+fullpath
    else:
        print '"'+fullpath+"'"+" not found"

print `blocks`+"\tTotal"
print " "+template % (i,blocks,blocks*512/1024)
Notice that you do not have to quote or escape the file name this time, since os.stat is called directly instead of going through the shell. This calculates file sizes using allocation blocks, the same way that du does it. If I run du -ahc against the same files that I have listed in ft.txt I get roughly the same number (du reports it as 25M and I get 24324 KB), but it reports the same number of blocks. (Side note: "blocks" are always assumed to be 512 bytes under Unix, even though the actual block size on larger disks is larger.)
Finally, you may want to consider making your script so that it can read a command line group of files rather than hard coding the file and the path in the script. Consider:
#!/usr/bin/python
import os, sys
total_blocks=0
total_files=0
template='%d total files in %d blocks using %d KB\n'
print
for arg in sys.argv[1:]:
    print "processing: "+arg
    blocks=0
    i=0
    file=open(arg,'r')
    for line in file:
        abspath=os.path.abspath(arg)
        folderpath=os.path.dirname(abspath)
        i+=1
        filename=line.strip()
        fullpath=folderpath+"/"+filename
        if(os.path.exists(fullpath)):
            info=os.stat(fullpath)
            blocks+=info.st_blocks
            print `info.st_blocks`+"\t"+fullpath
        else:
            print '"'+fullpath+"'"+" not found"
    print "\t"+template % (i,blocks,blocks*512/1024)
    total_blocks+=blocks
    total_files+=i

print template % (total_files,total_blocks,total_blocks*512/1024)
You can then execute the script (after chmod +x [script_name].py) with ./script.py ft.txt, and it will use the directory of the command-line file as the assumed path to the files listed in "ft.txt". You can process multiple files as well.
You can do it in your shell script itself.
You have all the file names in your spooled file output.txt; all you have to add at the end of the existing script is something like (GNU xargs shown, with -d '\n' so file names containing spaces survive):
xargs -d '\n' du -ch < output.txt
The -c flag makes du print the size of each file and also a grand total at the end.
You can use the Python skeleton that you've sketched out and add os.path.getsize(fullpath) to get the size of each individual file.
For example, if you wanted a dictionary with the file name and size you could:
dict((f.strip(), os.path.getsize(f.strip())) for f in file)
Keep in mind that the result from os.path.getsize(...) is in bytes so you'll have to convert it to get other units if you want.
In general os.path is a key module for manipulating files and paths.
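Putting that together, a minimal sketch (assuming output.txt holds one file name per line and the files live under folderpath, both taken from the question) that prints each size and the total in bytes:
import os

folderpath = 'folder_path'
with open('output.txt') as listing:
    names = [line.strip() for line in listing if line.strip()]

# map each file name to its size in bytes, then sum for the total
sizes = dict((name, os.path.getsize(os.path.join(folderpath, name))) for name in names)
total = sum(sizes.values())

for name, size in sizes.items():
    print('%s => %d bytes' % (name, size))
print('Total => %d bytes' % total)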