I am running the following Python file, which is supposed to concatenate the contents of files from multiple folders into one CSV with a common header file. The contents of the files are being concatenated as needed, but the results are being added, not appended.
This is an example of the result I am getting:
This is an example of what I need:
Here is the code of the python file, with the variables removed for clarity:
import sys
import re
import os
frequencyCount = {}
with open(sys.argv[1] + '/Concatenated.csv', 'w+') as outfile:
    try:
        with open(sys.argv[1] + '/MatrixHeader.csv') as headerfile:
            for line in headerfile:
                outfile.write(line + '\n')
    except:
        print 'No Header File'
    # vars were in here
    # built columnTuple
    if url in frequencyCount:
        frequencyCount[url] = tuple(sum(t) for t in zip(frequencyCount[url], columnTuple))
    else:
        frequencyCount[url] = columnTuple
    # write the tuples to stdout
    # Note: they are unsorted
    for url in frequencyCount.keys():
        writeString = url
        for col in frequencyCount[url]:
            writeString = writeString + "," + str(col)
        writeString = writeString + "\n"
        outfile.write(writeString)
Isn't it because of the sum you are using here: tuple(sum(t) for t in zip(frequencyCount[url], columnTuple))? I think you should get rid of the call to sum.
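To make the difference concrete, here is a minimal sketch contrasting the two behaviours. The function names and the sample rows are made up for illustration; only the `tuple(sum(t) ...)` expression comes from the question. It assumes that "appended" means keeping every row for a URL rather than adding the columns together.

```python
from collections import defaultdict


def sum_columns(rows):
    """Element-wise sum of the column tuples per url (what the question's code does)."""
    totals = {}
    for url, cols in rows:
        if url in totals:
            totals[url] = tuple(sum(t) for t in zip(totals[url], cols))
        else:
            totals[url] = cols
    return totals


def append_columns(rows):
    """Keep every column tuple per url instead of adding them together."""
    collected = defaultdict(list)
    for url, cols in rows:
        collected[url].append(cols)
    return dict(collected)


# Hypothetical sample data: (url, columnTuple) pairs.
rows = [("a.com", (1, 2)), ("a.com", (3, 4)), ("b.com", (5, 6))]
summed = sum_columns(rows)
appended = append_columns(rows)
```

With the summing version the two `a.com` rows collapse into one tuple; with the appending version both rows survive and can be written out one per line.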
I am trying to download files using python and then add lines at the end of the downloaded files, but it returns an error:
f.write(data + """<auth-user-pass>
TypeError: can't concat str to bytes
Edit: Thanks, it works now when I do this b"""< auth-user-pass >""", but I only want to add the string at the end of the file. When I run the code, it adds the string for every line.
I also tried something like this but it also did not work: f.write(str(data) + "< auth-user-pass >")
here is my full code:
import requests
from multiprocessing.pool import ThreadPool
def download_url(url):
    print("downloading: ", url)
    # assumes that the last segment after the / represents the file name
    # if url is abc/xyz/file.txt, the file name will be file.txt
    file_name_start_pos = url.rfind("/") + 1
    file_name = url[file_name_start_pos:]
    save_path = 'ovpns/'
    complete_path = os.path.join(save_path, file_name)
    print(complete_path)
    r = requests.get(url, stream=True)
    if r.status_code == requests.codes.ok:
        with open(complete_path, 'wb') as f:
            for data in r:
                f.write(data + """<auth-user-pass>
username
password
</auth-user-pass>""")
    return url
servers = [
    "us-ca72.nordvpn.com",
    "us-ca73.nordvpn.com"
]
urls = []
for server in servers:
    urls.append("https://downloads.nordcdn.com/configs/files/ovpn_legacy/servers/" + server + ".udp1194.ovpn")
# Run 5 multiple threads. Each call will take the next element in urls list
results = ThreadPool(5).imap_unordered(download_url, urls)
for r in results:
    print(r)
Try this:
import os
import requests
from multiprocessing.pool import ThreadPool
def download_url(url):
    print("downloading: ", url)
    # assumes that the last segment after the / represents the file name
    # if url is abc/xyz/file.txt, the file name will be file.txt
    file_name_start_pos = url.rfind("/") + 1
    file_name = url[file_name_start_pos:]
    save_path = 'ovpns/'
    complete_path = os.path.join(save_path, file_name)
    print(complete_path)
    r = requests.get(url, stream=True)
    if r.status_code == requests.codes.ok:
        with open(complete_path, 'wb') as f:
            for data in r:
                f.write(data)
        # reopen in append-binary mode and add the block once, after the download;
        # this has to happen here because complete_path is local to this function
        with open(complete_path, 'ab') as f:
            f.write(b"""<auth-user-pass>
username
password
</auth-user-pass>""")
    return url
servers = [
    "us-ca72.nordvpn.com",
    "us-ca73.nordvpn.com"
]
urls = []
for server in servers:
    urls.append("https://downloads.nordcdn.com/configs/files/ovpn_legacy/servers/" + server + ".udp1194.ovpn")
# Run 5 multiple threads. Each call will take the next element in urls list
results = ThreadPool(5).imap_unordered(download_url, urls)
for r in results:
    print(r)
You are using binary mode, so encode your string before concatenating. That is, replace
for data in r:
    f.write(data + """<auth-user-pass>
username
password
</auth-user-pass>""")
with
for data in r:
    f.write(data + """<auth-user-pass>
username
password
</auth-user-pass>""".encode())
You open the file for writing in binary mode.
Because of that you can't write normal strings, as the comment from @user56700 said.
You either need to convert the string to bytes or open the file another way (e.g. 'a' for appending in text mode).
I'm not completely sure, but it is also possible that opening in write-binary mode deletes the file's existing data. Opening with 'w' normally truncates the file, so if you reopen it to add the block you will likely need an append mode such as 'ab'; note that 'rwb' is not a valid mode string.
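As a quick check of the mode question, here is a small sketch that writes to a temporary file twice and compares 'wb' with 'ab' on the second open. The helper name `write_then` and the file names are made up for the example.

```python
import os
import tempfile


def write_then(mode, path):
    """Write b"first\\n", reopen the file with `mode`, write b"second\\n", return the bytes on disk."""
    with open(path, "wb") as f:
        f.write(b"first\n")
    with open(path, mode) as f:
        # In binary mode only bytes can be written; a str must be encoded first.
        f.write("second\n".encode())
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    truncated = write_then("wb", os.path.join(tmp, "w.bin"))
    appended = write_then("ab", os.path.join(tmp, "a.bin"))
```

Reopening with 'wb' throws away what was written before, while 'ab' keeps it and adds to the end, which is what is wanted for a block that should appear once after the downloaded content.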
I'm converting text directly to epub and I'm having a problem automatically splitting the HTML book file into separate header/chapter files. At the moment, the code below partially works but only creates every other chapter file. So half the header/chapter files are missing from the output. Here is the code:
def splitHeaderstoFiles(fpath):
    infp = open(fpath, 'rt', encoding=('utf-8'))
    for line in infp:
        # format and split headers to files
        if '<h1' in line:
            #-----------format header file names and other stuff ------------#
            # create a new file for the header/chapter section
            path = os.getcwd() + os.sep + header
            with open(path, 'wt', encoding=('utf-8')) as outfp:
                # write html top meta headers
                outfp = addMetaHeaders(outfp)
                # add the header
                outfp = outfp.write(line)
                # add the chapter/header bodytext
                for line in infp:
                    if '<h1' not in line:
                        outfp.write(line)
                    else:
                        outfp.write('</body>\n</html>')
                        break
        else:
            continue
    infp.close()
The problem occurs in the second 'for loop' at the bottom of the code, when I look for the next h1 tag to stop the split. I cannot use seek() or tell() to rewind or move back one line so the program can find the next header/chapter on the next iteration. Apparently you cannot use these in python in a for loop containing an implicit iter or next object in operation. Just gives a 'can't do non-zero cur-relative seeks' error.
I've also tried the while line != ' ' + readline() combination in the code which also gives the same error as above.
Does anyone know an easy way to split HTML headers/chapters of varying lengths into separate files in python? Are there any special python modules(such as pickles) that could help make this task easier?
I'm using Python 3.4
My grateful thanks in advance for any solutions to this problem...
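I ran into a similar problem a while ago; here is a simplified solution:
from itertools import count
chapter_number = count(1)
# text mode, because the lines read from the input file are str, not bytes
output_file = open('000-intro.html', 'wt', encoding='utf-8')
with open('index.html', 'rt', encoding='utf-8') as input_file:
    for line in input_file:
        if '<h1' in line:
            # a new chapter starts: close the previous file and open the next one
            output_file.close()
            output_file = open('{:03}-chapter.html'.format(next(chapter_number)), 'wt', encoding='utf-8')
        output_file.write(line)
output_file.close()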
In this approach, the first block of text leading to the first h1 block is written into 000-intro.html, the first chapter will be written into 001-chapter.html and so on. Please modify it to taste.
The solution is a simple one: Upon encountering the h1 tag, close the last output file and open a new one.
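The same idea can be wrapped in a function and tried on a few sample lines. This is only a sketch: `split_on_h1`, the sample HTML, and the use of a temporary directory are assumptions made for the demonstration, not part of the original answer.

```python
import os
import tempfile


def split_on_h1(lines, out_dir):
    """Write lines to numbered files in out_dir, starting a new file at every line containing '<h1'."""
    chapter = 0
    path = os.path.join(out_dir, "000-intro.html")
    written = [path]
    out = open(path, "w", encoding="utf-8")
    try:
        for line in lines:
            if "<h1" in line:
                # Close the previous chapter file and open the next one.
                out.close()
                chapter += 1
                path = os.path.join(out_dir, "{:03}-chapter.html".format(chapter))
                out = open(path, "w", encoding="utf-8")
                written.append(path)
            out.write(line)
    finally:
        out.close()
    return written


sample = ["<p>preface</p>\n", "<h1>One</h1>\n", "<p>a</p>\n", "<h1>Two</h1>\n", "<p>b</p>\n"]
with tempfile.TemporaryDirectory() as tmp:
    names = [os.path.basename(p) for p in split_on_h1(sample, tmp)]
    with open(os.path.join(tmp, "002-chapter.html"), encoding="utf-8") as f:
        second = f.read()
```

Because there is only one loop over the input, the header line that starts a chapter is never consumed by anything else, so no chapter is skipped.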
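You are looping over your input file in two nested loops, which is likely causing your problems:
for line in infp:
    ...
    with open(path, 'wt', encoding=('utf-8')) as outfp:
        ...
        for line in infp:
            ...
Both for loops draw from the same iterator, because a file object is its own iterator. The <h1 line that makes the inner loop break has therefore already been consumed when the outer loop resumes, so the outer loop never sees that header and every other chapter is skipped.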
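A small sketch of why headers go missing with two nested for loops over one file object. It uses `io.StringIO` as a stand-in for the real file, and the sample text is invented for the demonstration.

```python
import io

text = "<h1>A</h1>\nbody\n<h1>B</h1>\nbody\n<h1>C</h1>\nbody\n"
infp = io.StringIO(text)  # stand-in for the opened HTML file

# A file-like object is its own iterator, so nested loops share one position.
shares_iterator = iter(infp) is infp

seen = []
for line in infp:
    if '<h1' in line:
        seen.append(line.strip())
        for line in infp:
            if '<h1' in line:
                # This header has now been consumed; the outer loop will not see it.
                break
```

Here the outer loop records only the first and third headers, which matches the "every other chapter" symptom described in the question.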
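You might try transforming your for loops into while loops that read explicitly with readline(), so the header line that ends one chapter is still in hand when the next chapter starts:
line = infp.readline()
while line:  # readline() returns '' at end of file
    if '<h1' in line:
        with open(...) as outfp:
            outfp.write(line)
            line = infp.readline()
            while line and '<h1' not in line:
                outfp.write(line)
                line = infp.readline()
        # line now holds the next <h1 line (or '' at EOF), so do not read again here
    else:
        line = infp.readline()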
Alternatively, you may wish to use an HTML parser (i.e., BeautifulSoup). Then you can do something like what is described here: https://stackoverflow.com/a/8735688/65295.
Update from comment - essentially, read the entire file all at once so you can freely move back or forward as necessary. This probably won't be a performance issue unless you have a really really big file (or very little memory).
lines = infp.readlines() # read the entire file
i = 0
while i < len(lines):
    if '<h1' in lines[i]:
        with open(...) as outfp:
            outfp.write(lines[i])  # the header itself
            j = i + 1
            while j < len(lines):
                if '<h1' in lines[j]:
                    break
                outfp.write(lines[j])
                j += 1
        # line j has an <h1> (or j is past the end); set i to j so we
        # detect it at the top of the next loop iteration.
        i = j
    else:
        i += 1
I eventually found the answer to the above problem. The code below does a lot more than just get the file header. It also loads two parallel lists at the same time, one with the formatted file names (with extension) and one with the pure header names, so I can use them to fill in the title and the formatted file name in these HTML files within a single while loop. The code now works well and is shown below.
def splitHeaderstoFiles(dir, inpath):
    count = 1
    t_count = 0
    out_path = ''
    header = ''
    write_bodytext = False
    file_path_names = []
    pure_header_names = []
    inpath = dir + os.sep + inpath
    with open(inpath, 'rt', encoding=('utf-8')) as infp:
        for line in infp:
            if '<h1' in line:
                #strip html tags, convert to start caps
                p = re.compile(r'<.*?>')
                header = p.sub('', line)
                header = capwords(header)
                line_save = header
                # Add 0 for count below 10
                if count < 10:
                    header = '0' + str(count) + '_' + header
                else:
                    header = str(count) + '_' + header
                # remove all spaces + add extension in header
                header = header.replace(' ', '_')
                header = header + '.xhtml'
                count = count + 1
                #create two parallel lists used later
                out_path = dir + os.sep + header
                outfp = open(out_path, 'wt', encoding=('utf-8'))
                file_path_names.insert(t_count, out_path)
                pure_header_names.insert(t_count, line_save)
                t_count = t_count + 1
                # Add html meta headers and write it
                outfp = addMainHeaders(outfp)
                outfp.write(line)
                write_bodytext = True
            # add header bodytext
            elif write_bodytext == True:
                outfp.write(line)
    # now add html titles and close the html tails on all files
    max_num_files = len(file_path_names)
    tmp = dir + os.sep + 'temp1.tmp'
    i = 0
    while i < max_num_files:
        outfp = open(tmp, 'wt', encoding=('utf-8'))
        infp = open(file_path_names[i], 'rt', encoding=('utf-8'))
        for line in infp:
            if '<title>' in line:
                line = line.strip(' ')
                line = line.replace('<title></title>', '<title>' + pure_header_names[i] + '</title>')
                outfp.write(line)
            else:
                outfp.write(line)
        # add the html tail
        if '</body>' in line or '</html>' in line:
            pass
        else:
            outfp.write(' </body>' + '\n</html>')
        # clean up
        infp.close()
        outfp.close()
        shutil.copy2(tmp, file_path_names[i])
        os.remove(tmp)
        i = i + 1
    # now rename just the title page
    if os.path.isfile(file_path_names[0]):
        title_page_name = file_path_names[0]
        new_title_page_name = dir + os.sep + '01_Title.xhtml'
        os.rename(title_page_name, new_title_page_name)
        file_path_names[0] = '01_Title.xhtml'
    else:
        logmsg27(DEBUG_FLAG)
        os._exit(0)
    # xhtml file is no longer needed
    if os.path.isfile(inpath):
        os.remove(inpath)
    # returned list values are also used
    # later to create epub opf and ncx files
    return(file_path_names, pure_header_names)
@Hai Vu and @Seth -- Thanks for all your help.
I'm trying to write different things to a text file in a while loop, but it only writes once. I want to write something to unmigrated.txt
import urllib.request
import json
Txtfile = input("Name of the TXT file: ")
fw = open(Txtfile + ".txt", "r")
red = fw.read()
blue = red.split("\n")
i=0
while i<len(blue):
    try:
        url = "https://api.mojang.com/users/profiles/minecraft/" + blue[i]
        rawdata = urllib.request.urlopen(url)
        newrawdata = rawdata.read()
        jsondata = json.loads(newrawdata.decode('utf-8'))
        results = jsondata['id']
        url_uuid = "https://sessionserver.mojang.com/session/minecraft/profile/" + results
        rawdata_uuid = urllib.request.urlopen(url_uuid)
        newrawdata_uuid = rawdata_uuid.read()
        jsondata_uuid = json.loads(newrawdata_uuid.decode('utf-8'))
        try:
            results = jsondata_uuid['legacy']
            print (blue[i] + " is " + "Unmigrated")
            wf = open("unmigrated.txt", "w")
            wring = wf.write(blue[i] + " is " + "Unmigrated\n")
        except:
            print(blue[i] + " is " + "Migrated")
    except:
        print(blue[i] + " is " + "Not-Premium")
    i+=1
You keep reopening the file with w inside the loop, which truncates it each time, so you only see the last data that was written. Either open the file once outside the loop or open it with a to append. Opening once is the simplest approach. You can also use range instead of your while or, better again, just iterate over the list:
with open("unmigrated.txt", "w") as f: # with closes your file automatically
    for ele in blue:
        .....
Also, wring = wf.write(blue[i] + " is " + "Unmigrated\n") sets wring to whatever write returns (None in Python 2, the number of characters written in Python 3), so it is probably not of any real use.
Lastly, using a bare except is rarely a good idea. Catch the specific exceptions you expect, and log or at least print when you get an error.
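The effect of reopening with "w" inside the loop can be seen in a short sketch that needs no network access. The names in `names` and the temporary file paths are invented for the demonstration.

```python
import os
import tempfile

names = ["alice", "bob", "carol"]  # hypothetical player names

with tempfile.TemporaryDirectory() as tmp:
    # Variant 1: reopen with "w" on every iteration; each open truncates the file.
    reopened = os.path.join(tmp, "reopened.txt")
    for name in names:
        with open(reopened, "w") as f:
            f.write(name + " is Unmigrated\n")
    with open(reopened) as f:
        reopened_lines = f.read().splitlines()

    # Variant 2: open once, outside the loop; every line is kept.
    once = os.path.join(tmp, "once.txt")
    with open(once, "w") as f:
        for name in names:
            f.write(name + " is Unmigrated\n")
    with open(once) as f:
        once_lines = f.read().splitlines()
```

Only the last name survives in the first variant, while the second keeps all three lines.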
Using the requests library, I would break up your code doing something like:
import requests
def get_json(url):
    try:
        rawdata = requests.get(url)
        return rawdata.json()
    except requests.exceptions.RequestException as e:
        print(e)
    except ValueError as e:
        print(e)
    return {}
txt_file = input("Name of the TXT file: ")
with open(txt_file + ".txt") as fw, open("unmigrated.txt", "w") as f: # with closes your files automatically
    for line in map(str.rstrip, fw): # remove newlines
        url = "https://api.mojang.com/users/profiles/minecraft/{}".format(line)
        results = get_json(url).get("id")
        if not results:
            continue
        url_uuid = "https://sessionserver.mojang.com/session/minecraft/profile/{}".format(results)
        results = get_json(url_uuid).get('legacy')
        print("{} is Unmigrated".format(line))
        f.write("{} is Unmigrated\n".format(line))
I am not sure where 'legacy' fits into the code, that logic I will leave to you. You can also iterate directly over the file object so you can forget about splitting the lines into blue.
Try:
with open("filename", "w") as f:
    f.write("your content")
But that will overwrite all contents of the file.
Instead, if you want to append to the file, use:
with open("filename", "a") as f:
    f.write("your content")
If you choose not to use the with syntax, remember to close the file.
Read more here:
https://docs.python.org/2/library/functions.html#open
I have written a piece of code which lets me extract the table from a file named 195775.html. I save the output in a text file. Now I need to iterate this code for all the 20,000 files which I have in the same directory. In addition, I also want the files to be tagged with their respective file names. i.e. each file should have a column (in the table) which takes the filename as its value. Also, I want the output text files to be named as per the input files (i.e. the names should match).
Here is my code:
import urllib2
import os
import time
import traceback
from bs4 import BeautifulSoup
outfile= open('C:/Users/Manvendra/Dropbox/Python/195775.txt','wb')
rfile = open('C:/Users/Manvendra/Dropbox/PRI/Data/AP/195775.html')
rsoup = BeautifulSoup(rfile)
nodes = rsoup.find('div',{'class':'frmhdtitle'})
if nodes!= None:
    #print "div present"
    x = nodes.findNext('table')
    if x!= None:
        #print "table present"
        y = x.find('tbody')
        if y!= None:
            #print "tbody present"
            z= y.findAll('tr')
            if z!= None:
                #print "tr present"
                for wx in z[1:]:
                    num= wx.find('td').get_text()
                    print num
                    name= wx.find('td').findNext('td').get_text()
                    print name
                    age = wx.find('td').findNext('td').findNext('td').get_text()
                    print age
                    caste= wx.find('td').findNext('td').findNext('td').findNext('td').get_text()
                    print caste
                    gender= wx.find('td').findNext('td').findNext('td').findNext('td').findNext('td').get_text()
                    print gender
                    quali = wx.find('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').get_text()
                    print quali
                    occu = wx.find('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').get_text()
                    print occu
                    #email = wx.find('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').get_text()
                    #print email
                    #ward = wx.find('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').get_text()
                    #print ward
                    resr = wx.find('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').findNext('td').get_text()
                    print resr
                    outfile.write(str(num) +"\t" + str(name) +"\t" + str(age) +"\t" + str (caste) +"\t" + str(quali) +"\t" + str(occu) + "\t" + str(resr) + str(infile) +"\n")
outfile.close()
Put your code into a separate function and call it for each html file in the directory:
#!/usr/bin/env python2
import os
from glob import glob
dest_dir = 'C:/Users/Manvendra/Dropbox/Python'
for html_filename in glob('C:/Users/Manvendra/Dropbox/PRI/Data/AP/*.html'):
    basename = os.path.splitext(os.path.basename(html_filename))[0]
    with open(html_filename, 'rb') as html_file, \
         open(os.path.join(dest_dir, basename + '.txt'), 'wb') as csv_file:
        html2csv(html_file, csv_file)
where html2csv() is:
import csv
import logging
from bs4 import BeautifulSoup
log = logging.getLogger(__name__)
def html2csv(html_file, csv_file):
    writerow = csv.writer(csv_file, dialect=csv.excel_tab).writerow
    div = BeautifulSoup(html_file).find('div', 'frmhdtitle')
    try:
        rows = div.find_next('table').tbody.find_all('tr')[1:]
    except AttributeError:
        log.warning("No info in %s file", html_file.name)
    else:
        for tr in rows:
            # one tab-separated row per <tr>, with the source file name as the last column
            writerow([td.get_text().encode('utf-8')
                      for td in tr.find_all('td')[:8]] + [html_file.name])
Note: findNext('td') method in your code searches the html document without any regard for elements boundaries i.e., it may find td that belongs to a different row or even a different table as long as it is further in the document. I rewrote the loop assuming that you want to find eight adjacent <td> elements in each row.
Do something like this:
files = os.listdir(directoryPath)
for file in files:
    *your code*
Note that if you want to open the files you need to open the path: directoryPath + "/" + file.
Regarding all the tags and filenames you want to name the files, "file" is now a variable which contains the name of the file you are now processing so do with it what you want.
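A minimal sketch of that loop, written for Python 3 and using `os.path.join` rather than string concatenation. The function name `process_directory`, the tab-separated output, and the sample files are assumptions for the demonstration; the real per-file work would be the table extraction from the question.

```python
import os
import tempfile


def process_directory(directory_path, dest_dir):
    """For each .html file, write a same-named .txt file whose rows end with the source file name."""
    outputs = []
    for file_name in sorted(os.listdir(directory_path)):
        if not file_name.endswith(".html"):
            continue
        in_path = os.path.join(directory_path, file_name)
        base = os.path.splitext(file_name)[0]
        out_path = os.path.join(dest_dir, base + ".txt")
        with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
            for line in src:
                # Placeholder for the real per-row work: copy the line and tag it.
                dst.write(line.rstrip("\n") + "\t" + file_name + "\n")
        outputs.append(os.path.basename(out_path))
    return outputs


with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as dst_dir:
    for sample_name in ("195775.html", "195776.html"):
        with open(os.path.join(src_dir, sample_name), "w", encoding="utf-8") as f:
            f.write("row one\nrow two\n")
    produced = process_directory(src_dir, dst_dir)
    with open(os.path.join(dst_dir, "195775.txt"), encoding="utf-8") as f:
        first_output = f.read()
```

Each output file takes its name from the input file, and every row carries the input file name as its last column.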
I am trying to iterate through a number .rtf files and for each file: read the file, perform some operations, and then write new files into a sub-directory as plain text files with the same name as the original file, but with .txt extensions. The problem I am having is with the file naming.
If a file is named foo.rtf, I want the new file in the subdirectory to be foo.txt. here is my code:
import glob
import os
import numpy as np
dir_path = '/Users/me/Desktop/test/'
file_suffix = '*.rtf'
output_dir = os.mkdir('sub_dir')
for item in glob.iglob(dir_path + file_suffix):
    with open(item, "r") as infile:
        reader = infile.readlines()
        matrix = []
        for row in reader:
            row = str(row)
            row = row.split()
            row = [int(value) for value in row]
            matrix.append(row)
        np_matrix = np.array(matrix)
        inv_matrix = np.transpose(np_matrix)
        new_file_name = item.replace('*.rtf', '*.txt') # i think this line is the problem?
        os.chdir(output_dir)
        with open(new_file_name, mode="w") as outfile:
            outfile.write(inv_matrix)
When I run this code, I get a Type Error:
TypeError: coercing to Unicode: need string or buffer, NoneType found
How can I fix my code to write new files into a subdirectory and change the file extensions from .rtf to .txt? Thanks for the help.
Instead of item.replace, check out some of the functions in the os.path module (http://docs.python.org/library/os.path.html). They're made for splitting up and recombining parts of filenames. For instance, os.path.splitext will split a filename into a file path and a file extension.
Let's say you have a file /tmp/foo.rtf and you want to move it to /tmp/foo.txt:
old_file = '/tmp/foo.rtf'
(file,ext) = os.path.splitext(old_file)
print 'File=%s Extension=%s' % (file,ext)
new_file = '%s%s' % (file,'.txt')
print 'New file = %s' % (new_file)
Or if you want the one line version:
old_file = '/tmp/foo.rtf'
new_file = '%s%s' % (os.path.splitext(old_file)[0],'.txt')
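Putting the pieces together for the question's case, here is a hedged Python 3 sketch that builds the output path in a subdirectory. The helper name `txt_path_in_subdir` is made up. It also sidesteps the likely source of the TypeError: `os.mkdir` returns None, so `output_dir` in the question is None by the time it is passed to `os.chdir`.

```python
import os
import tempfile


def txt_path_in_subdir(path, sub_dir_name="sub_dir"):
    """Return the path of a .txt file with the same base name, inside a subdirectory next to `path`."""
    directory, file_name = os.path.split(path)
    out_dir = os.path.join(directory, sub_dir_name)
    # Keep the directory name ourselves; os.mkdir/os.makedirs return None.
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(file_name)[0]
    return os.path.join(out_dir, base + ".txt")


with tempfile.TemporaryDirectory() as tmp:
    result = txt_path_in_subdir(os.path.join(tmp, "foo.rtf"))
    relative = os.path.relpath(result, tmp)
    created = os.path.isdir(os.path.join(tmp, "sub_dir"))
```

Opening the returned path directly avoids the need for `os.chdir` altogether.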
I've never used glob, but here's an alternative way without using a module:
You can easily strip the suffix using
name = name[:name.rfind('.')]
and then add the new suffix:
name = name + '.txt'
Why not use a function?
def change_suffix(string, new_suffix):
    i = string.rfind('.')
    if i < 0:
        raise ValueError('string does not have a suffix')
    if not new_suffix[0] == '.':
        # make sure the new suffix starts with a dot
        new_suffix = '.' + new_suffix
    return string[:i] + new_suffix
glob.iglob() yields pathnames, without the character '*'.
therefore your line should be:
new_file_name = item.replace('.rtf', '.txt')
consider working with clearer names (reserve 'filename' for a file name and use 'path' for a complete path to a file; use 'path_original' instead of 'item'), os.extsep ('.' in Windows) and os.path.splitext():
path_txt = os.extsep.join([os.path.splitext(path_original)[0], 'txt'])
now the best hint of all:
numpy can probably read your file directly:
data = np.genfromtxt(filename, unpack=True)
To better understand where your TypeError comes from, wrap your code in the following try/except block:
try:
    (your code)
except:
    import traceback
    traceback.print_exc()