Naming multiple files in python and scrapy - python

I'm trying to save files to a directory after scraping them from the web using scrapy. I'm extracting a date from the file and using that as the file name. The problem I'm running into, however, is that some files have the same date, i.e. there are two files that would take the name "June 2, 2009". So, what I'm looking to do is somehow check whether there is already a file with the same name, and if so, name it something like "June 2, 2009.1" or some such.
The code I'm using is as follows:
def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)
    response = response.replace(body=response.body.replace('<br />', '\n'))
    hxs = HtmlXPathSelector(response)
    date = hxs.select("//div[@id='content']").extract()[0]
    dateStrip = re.search(r"([A-Z]*|[A-z][a-z]+)\s\d*\d,\s[0-9]+", date)
    newDate = dateStrip.group()
    content = hxs.select("//div[@id='content']")
    content = content.select('string()').extract()[0]
    filename = ("/path/to/a/folder/ %s.txt") % (newDate)
    with codecs.open(filename, 'w', encoding='utf-8') as output:
        output.write(content)

You can use os.listdir to get a list of existing files and allocate a filename that will not cause conflict.
import os
def get_file_store_name(path, fname):
    # Count the existing files whose names contain fname
    count = 0
    for f in os.listdir(path):
        if fname in f:
            count += 1
    return os.path.join(path, fname + str(count))
# Example usage
print(get_file_store_name(".", "README") + ".txt")

The usual way to check for existence of a file in the C library is with a function called stat(). Python offers a thin wrapper around this function in the form of os.stat(). I suggest you use that.
http://docs.python.org/library/stat.html
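import os
import stat
def file_exists(fname):
    try:
        stat_info = os.stat(fname)
    except OSError:
        # os.stat() raises OSError when the path does not exist
        return False
    # S_ISREG() lives in the stat module and takes the st_mode field
    return stat.S_ISREG(stat_info.st_mode)  # true for regular file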

One other solution is to append the time to the date when naming the file, like this:
from datetime import datetime
filename = ("/path/to/a/folder/ %s_%s.txt") % (newDate,datetime.now().strftime("%H%M%S"))

The other answer pointed me in the correct direction by checking into the os tools in Python, but I think the way I found is perhaps more straightforward. See the question "How do I check whether a file exists using Python?" for more.
The following is the code I came up with:
existence = os.path.isfile(filename)
if existence == False:
    with codecs.open(filename, 'w', encoding='utf-8') as output:
        output.write(content)
else:
    newFilename = ("/path/.../.../- " + '%s' ".1.txt") % (newDate)
    with codecs.open(newFilename, 'w', encoding='utf-8') as output:
        output.write(content)
Edited to Add:
I didn't like this solution too much, and thought the other answer's solution was probably better but didn't quite work. The main part I didn't like about my solution was that it only worked with 2 files of the same name; if three or four files had the same name the initial problem would occur. The following is what I came up with:
filename = ("/Users/path/" + " " + "title " + '%s' + " " + "-1.txt") % (date)
filename = str(filename)
while True:
    # Strip the extension, then bump the trailing number by one
    newName = filename.replace(".txt", "")
    newName = str.split(newName)
    newName[-1] = str(int(newName[-1]) + 1)
    filename = " ".join(newName) + ".txt"
    if os.path.isfile(filename) == False:
        with codecs.open(filename, 'w', encoding='utf-8') as output:
            output.write(texts)
        break
It probably isn't the most elegant and might be kind of a hackish approach, but it has worked so far and seems to have addressed my problem.
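For comparison, here is a minimal sketch of the same "try the next number until a name is free" idea that lets open() do the existence check. It is not the code from any answer above: write_unique is a made-up name, the ".1", ".2" suffix format is an assumption taken from the question, and it relies on the "x" (exclusive creation) mode, which needs Python 3.3 or later, whereas the code in this thread is Python 2.

```python
import os


def write_unique(directory, stem, text, ext=".txt"):
    """Write `text` to a new file and return the path that was used.

    Tries `stem.txt`, then `stem.1.txt`, `stem.2.txt`, ... (hypothetical
    naming scheme). Mode "x" makes open() fail if the file already
    exists, so an existing file is never overwritten.
    """
    counter = 0
    while True:
        suffix = "" if counter == 0 else ".%d" % counter
        path = os.path.join(directory, stem + suffix + ext)
        try:
            with open(path, "x", encoding="utf-8") as output:
                output.write(text)
            return path
        except FileExistsError:
            # Name already taken: try the next number
            counter += 1
```

Calling write_unique(folder, newDate, content) three times with the same date should leave three separate files rather than one overwritten file.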

Related

Use Python Regex to search files and return filename

Please help.
I'm searching several .txt files, in several directories for a pattern. If there is a match, I would like to print the filename and location of the match.
Here is my code:
a = ('Z:/rodney/020year/2020-04/')
b = []
for y in os.listdir(a):
    b.append(a+y+'/')
for filename in b:
    path = filename
    for filenames in listdir(path):
        with open(path+filenames) as currentfile:
            text = currentfile.read()
            loan = re.compile(r'2 NNN \d LOANS')
            bb = loan.search(text)
            with open('z:/rodney/results.txt','a') as f:
                f.write(os.path.dirname(path)+' ')
                f.write(filenames[:-4]+'\n')
                f.write(bb)
Error message = "TypeError: write() argument must be str, not None"
If there is a match, I would like to see only the filename and location of a match. I do not need to see "None" in every file where there is no match.
You have:
bb = loan.search(text)
But if the string you are looking for is not found in text, bb will be None, and consequently f.write(bb) will raise an exception (you did not indicate which line of code was raising the exception, so this is an educated guess).
You need to modify your code to be:
bb = loan.search(text)
if bb:
    with open('z:/rodney/results.txt','a') as f:
        f.write(os.path.dirname(path)+' ')
        f.write(filenames[:-4]+'\n')
As an aside:
You have the statement loan = re.compile(r'2 NNN \d LOANS') in a loop. There is no need for that to be in a loop since it is invariant.
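Both points (compile once, and only write when search() found something) can be sketched in Python 3 like this. The names files_matching, LOAN_RE and demo are made up for the example, the pattern is a simplified stand-in for the real one, and rglob walks every subdirectory, which is an assumption about the folder layout.

```python
import re
import tempfile
from pathlib import Path

# Compiled once, outside any loop. Stand-in pattern, for illustration only.
LOAN_RE = re.compile(r"\d LOANS")


def files_matching(root, pattern=LOAN_RE):
    """Return (directory, file stem) for each .txt file under `root` that matches."""
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        match = pattern.search(path.read_text())
        if match:  # search() returns None when nothing matches
            hits.append((str(path.parent), path.stem))
    return hits


# Tiny demonstration on throwaway files
demo = Path(tempfile.mkdtemp())
(demo / "a.txt").write_text("total: 3 LOANS")
(demo / "b.txt").write_text("nothing here")
print(files_matching(demo))
```

Only a.txt is reported; b.txt produces no output at all, which is the behaviour the question asks for.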
You can avoid string slicing and a bunch of path-parsing functions by using pathlib, which already implements most of what you need. You can also optimize your code by moving re.compile() out of the loop (create the pattern once and reuse it). The same goes for writing the results: you don't need to reopen the file every time, just open it once before the loop starts.
Optimized code:
from pathlib import Path
import re
src_dir = Path(r"Z:\rodney\020year\2020-04")
res_fn = r"z:\rodney\results.txt"
with open(res_fn, "w+") as res_f:
    search_re = re.compile(r"2\sN{3}\s{28}\d\sLOANS")
    for directory in src_dir.iterdir():
        if directory.is_dir():
            for file in directory.iterdir():
                if file.is_file():
                    with open(file) as of:
                        bb = search_re.search(of.read())
                    if bb:
                        print(file.parent, file.stem, file=res_f)
                        print(bb.group(), file=res_f)
                        # res_f.write(file.parent + " " + file.stem + "\n" + bb.group())
I optimized your code, starting from your source.
I use os.walk to visit each .txt file, read it into a list of lines, and then go through the lines with enumerate, checking each one against the regex (which I took from Olvin Roght-san's answer). If there is a match, it prints the exact file location and the line number.
import os
import re
extension = [".txt"]
regex = r"2\sN{3}\s{28}\d\sLOANS"
re_Search = re.compile(regex)
# Raw string, so the backslashes are not read as escape sequences
path = r"Z:\rodney\020year\2020-04"
for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = os.path.join(subdir, file)
        ext = os.path.splitext(file)[-1].lower()
        if ext in extension:
            with open(file_path, "r") as f:
                try:
                    f_content = f.readlines()
                except Exception as e:
                    print(e)
                    continue  # skip files that cannot be read
            for l_idx, line in enumerate(f_content):
                if re_Search.search(line):
                    print(file_path)
                    print("At line: {l_idx}".format(l_idx = l_idx+1))
        else:
            print("Nothing!!")

Change CSV name to CSV date time python

I want to change the CSV file name (in this case Example.csv) to a name based on the current date. I have imported datetime with from datetime import datetime.
This is the statement I use to create the CSV file:
with open('Example.csv', 'w') as csvFile:
I want my output to be:
20180820.csv
20180821.csv
20180822.csv ... etc
And if I run the script more than once on the same day, I want my output to be:
20180820.csv (first time I run the script)
20180820(2).csv (second run)
... etc
Something like this:
import pandas as pd
import datetime
current_date = datetime.datetime.now()
filename = current_date.strftime('%Y%m%d')  # zero-padded year, month, day, e.g. 20180820
df.to_csv(str(filename + '.csv'))
Since you know how to create the file name you just have to check whether it already exists or not :
def exists(filename):
    try:
        with open(filename) as f:
            file_exists = True
    except FileNotFoundError:
        file_exists = False
    return file_exists
name = 'some_date.csv'
c = 0
while exists(name):
    c += 1
    name = 'some_date({}).csv'.format(c)
# do stuff with name
Here is a solution that works if you can maintain a 'progressive' counter that keeps track of the files already written. Otherwise you need to check the contents of the disk, which is somewhat more complex.
import datetime
progressive = 0
today = datetime.date.today()
todaystr = str(today)
rootname = todaystr
progressive += 1
if progressive > 1:
    rootname = todaystr + '_' + str(progressive)
filename = rootname + '.csv'
Count the number of files in the directory with the same date in its name and use that information to create the file name. Here is a solution for both your problems.
import datetime
import os
now = datetime.datetime.now().strftime("%Y%m%d")
# count the number of files already in the output dir with date (now)
count = len([name for name in os.listdir('./output/') if (os.path.isfile(os.path.join('./output/', name)) and now in name)])
csv_name = './output/' + now
if count > 0:
    csv_name = csv_name + "(" + str(count+1) +")"
csv_name = csv_name + ".csv"
with open(csv_name, 'w') as csvFile:
    pass
Good Luck.
I found the solution:
I just store the current time in a variable and concatenate it with .csv (I also put this CSV output in a specific folder called output). Finally I open the CSV file using that variable as the name.
now = datetime.now().strftime('%Y%m%d-%Hh%M')
csvname = './output/' + now + '.csv'
with open(csvname, 'w') as csvFile:
I still cannot do the second part of my problem: if I run the code more than once, I want the date-time name to change or to get (2), (3), etc. appended.
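One way the remaining part could be approached is to keep checking candidate names until a free one turns up. This is only a sketch: dated_csv_path is a made-up helper, the (2), (3) suffix format is copied from the question, and the demo uses a temporary folder and a fixed date instead of ./output/ and today's date.

```python
import datetime
import os
import tempfile


def dated_csv_path(directory, today=None):
    """Return YYYYMMDD.csv, or YYYYMMDD(2).csv, YYYYMMDD(3).csv, ... if taken."""
    today = today or datetime.date.today()
    stamp = today.strftime("%Y%m%d")
    path = os.path.join(directory, stamp + ".csv")
    run = 1
    while os.path.exists(path):
        run += 1
        path = os.path.join(directory, "%s(%d).csv" % (stamp, run))
    return path


# Demonstration in a throwaway directory with a fixed date
out_dir = tempfile.mkdtemp()
day = datetime.date(2018, 8, 20)
first = dated_csv_path(out_dir, day)
open(first, "w").close()          # simulate the first run of the script
second = dated_csv_path(out_dir, day)
print(os.path.basename(first), os.path.basename(second))
```

In the real script the returned path would simply replace csvname in the with open(...) line.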

Script that reads PDF metadata and writes to CSV

I wrote a script to read PDF metadata to ease a task at work. The current working version is not very usable in the long run:
from pyPdf import PdfFileReader
BASEDIR = ''
PDFFiles = []
def extractor():
    output = open('windoutput.txt', 'r+')
    for file in PDFFiles:
        try:
            pdf_toread = PdfFileReader(open(BASEDIR + file, 'r'))
            pdf_info = pdf_toread.getDocumentInfo()
            #print str(pdf_info) #print full metadata if you want
            x = file + "~" + pdf_info['/Title'] + " ~ " + pdf_info['/Subject']
            print x
            output.write(x + '\n')
        except:
            x = file + '~' + ' ERROR: Data missing or corrupt'
            print x
            output.write(x + '\n')
            pass
    output.close()
if __name__ == "__main__":
    extractor()
Currently, as you can see, I have to manually input the working directory and manually populate the list of PDF files. It also just prints out the data in the terminal in a format that I can copy/paste/separate into a spreadsheet.
I'd like the script to work automatically in whichever directory I throw it in and populate a CSV file for easier use. So far:
from pyPdf import PdfFileReader
import csv
import os
def extractor():
    basedir = os.getcwd()
    extension = '.pdf'
    pdffiles = [filter(lambda x: x.endswith('.pdf'), os.listdir(basedir))]
    with open('pdfmetadata.csv', 'wb') as csvfile:
        for f in pdffiles:
            try:
                pdf_to_read = PdfFileReader(open(f, 'r'))
                pdf_info = pdf_to_read.getDocumentInfo()
                title = pdf_info['/Title']
                subject = pdf_info['/Subject']
                csvfile.writerow([file, title, subject])
                print 'Metadata for %s written successfully.' % (f)
            except:
                print 'ERROR reading file %s.' % (f)
                #output.writerow(x + '\n')
                pass
if __name__ == "__main__":
    extractor()
In its current state it seems to just print a single error message (as in, the error message in the except block, not an error returned by Python) and then stop. I've been staring at it for a while and I'm not really sure where to go from here. Can anyone point me in the right direction?
writerow([file, title, subject]) should be writerow([f, title, subject])
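Something else worth checking in the question's code: writerow is a method of the object returned by csv.writer, not of the file object returned by open(), so csvfile.writerow(...) raises AttributeError, which the bare except then hides. A minimal sketch of the csv.writer pattern in Python 3 follows; write_metadata, the header row and the sample values are made up for illustration.

```python
import csv
import os
import tempfile


def write_metadata(csv_path, rows):
    """Write (filename, title, subject) rows to `csv_path` via csv.writer."""
    # newline="" is what the csv module documentation recommends on Python 3
    with open(csv_path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["file", "title", "subject"])
        for row in rows:
            writer.writerow(row)


demo_csv = os.path.join(tempfile.mkdtemp(), "pdfmetadata.csv")
write_metadata(demo_csv, [("a.pdf", "Title A", "Subject A")])
```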
You can use sys.exc_info() to print the details of your error
http://docs.python.org/2/library/sys.html#sys.exc_info
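A sketch of what that can look like in Python 3: instead of a bare except that prints a fixed message, report the exception type and message. describe_failure is a made-up helper, and int("not a number") merely stands in for the PDF-reading call that fails.

```python
import sys
import traceback


def describe_failure(func, *args):
    """Call func(*args); return None on success or a one-line error description."""
    try:
        func(*args)
    except Exception:
        exc_type, exc_value, _tb = sys.exc_info()
        traceback.print_exc()  # full traceback goes to stderr
        return "%s: %s" % (exc_type.__name__, exc_value)
    return None


print(describe_failure(int, "not a number"))
```

With the real error text in hand, problems such as a wrong variable name or an unexpected file name show up immediately.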
Did you check the pdffiles variable contains what you think it does? I was getting a list inside a list... so maybe try:
for files in pdffiles:
    for f in files:
        #do stuff with f
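The nesting comes from the square brackets around filter(...): they build a new list whose only element is the filter result. A short Python 3 sketch of the effect and of a flat alternative; list_pdfs and the demo file names are made up, and matching the extension case-insensitively is my own assumption.

```python
import os
import tempfile


def list_pdfs(directory):
    """Return the names of the .pdf files in `directory` as a flat, sorted list."""
    return sorted(name for name in os.listdir(directory)
                  if name.lower().endswith(".pdf"))


demo_dir = tempfile.mkdtemp()
for name in ("a.pdf", "b.PDF", "notes.txt"):
    open(os.path.join(demo_dir, name), "w").close()

wrapped = [filter(lambda x: x.endswith(".pdf"), os.listdir(demo_dir))]
print(len(wrapped))         # one element: the filter result itself
print(list_pdfs(demo_dir))  # flat list of file names
```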
I personally like glob. Notice I add * before the .pdf in the extension variable:
import os
import glob
basedir = os.getcwd()
extension = '*.pdf'
pdffiles = glob.glob(os.path.join(basedir, extension))
Figured it out. The script I used to download the files was saving the files with '\r\n' trailing after the file name, which I didn't notice until I actually ls'd the directory to see what was up. Thanks for everyone's help.

Write file to directory based on variable in Python

The script will generate multiple files using the year and id variable. These files need to be placed into a folder matching year and id. How do I write them to the correct folders?
file_root_name = row["file_root_name"]
year = row["year"]
id = row["id"]
path = year+'-'+id
try:
    os.makedirs(path)
except:
    pass
output = open(row['file_root_name']+'.smil', 'w')
output.write(prettify(doctype, root))
If I understand your question correctly, you want to do this:
import os.path
file_name = row['file_root_name']+'.smil'
full_path = os.path.join(path, file_name)
output = open(full_path, 'w')
Please note that it's not very common in Python to use the + operator for string concatenation. Although not in your case, with large strings the method is not very fast.
I'd prefer:
file_name = '%s.smil' % row['file_root_name']
and:
path = '%s-%s' % (year, id)  # %s, since year and id are concatenated as strings in the question
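Putting the pieces together, a minimal Python 3 sketch might look like the following. output_path and the sample values are made up for illustration, and exist_ok=True (Python 3.2+) takes the place of the try/except around os.makedirs in the question.

```python
import os
import tempfile


def output_path(base_dir, year, item_id, file_root_name):
    """Create <base_dir>/<year>-<id>/ if needed and return the .smil path inside it."""
    folder = os.path.join(base_dir, "%s-%s" % (year, item_id))
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, file_root_name + ".smil")


base = tempfile.mkdtemp()
target = output_path(base, "2012", "42", "clip")
with open(target, "w") as output:
    output.write("<smil/>")
```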

File naming problem with Python

I am trying to iterate through a number of .rtf files and for each file: read the file, perform some operations, and then write new files into a sub-directory as plain text files with the same name as the original file, but with .txt extensions. The problem I am having is with the file naming.
If a file is named foo.rtf, I want the new file in the subdirectory to be foo.txt. Here is my code:
import glob
import os
import numpy as np
dir_path = '/Users/me/Desktop/test/'
file_suffix = '*.rtf'
output_dir = os.mkdir('sub_dir')
for item in glob.iglob(dir_path + file_suffix):
    with open(item, "r") as infile:
        reader = infile.readlines()
        matrix = []
        for row in reader:
            row = str(row)
            row = row.split()
            row = [int(value) for value in row]
            matrix.append(row)
        np_matrix = np.array(matrix)
        inv_matrix = np.transpose(np_matrix)
        new_file_name = item.replace('*.rtf', '*.txt') # i think this line is the problem?
        os.chdir(output_dir)
        with open(new_file_name, mode="w") as outfile:
            outfile.write(inv_matrix)
When I run this code, I get a Type Error:
TypeError: coercing to Unicode: need string or buffer, NoneType found
How can I fix my code to write new files into a subdirectory and change the file extensions from .rtf to .txt? Thanks for the help.
Instead of item.replace, check out some of the functions in the os.path module (http://docs.python.org/library/os.path.html). They're made for splitting up and recombining parts of filenames. For instance, os.path.splitext will split a filename into a file path and a file extension.
Let's say you have a file /tmp/foo.rtf and you want to move it to /tmp/foo.txt:
old_file = '/tmp/foo.rtf'
(file,ext) = os.path.splitext(old_file)
print 'File=%s Extension=%s' % (file,ext)
new_file = '%s%s' % (file,'.txt')
print 'New file = %s' % (new_file)
Or if you want the one line version:
old_file = '/tmp/foo.rtf'
new_file = '%s%s' % (os.path.splitext(old_file)[0],'.txt')
I've never used glob, but here's an alternative way without using a module:
You can easily strip the suffix using
name = name[:name.rfind('.')]
and then add the new suffix:
name = name + '.txt'
Why not use a function?
def change_suffix(string, new_suffix):
    i = string.rfind('.')
    if i < 0:
        raise ValueError('string does not have a suffix')
    if not new_suffix[0] == '.':
        new_suffix = '.' + new_suffix
    return string[:i] + new_suffix
glob.iglob() yields pathnames, which do not contain the '*' character.
Therefore your line should be:
new_file_name = item.replace('.rtf', '.txt')
Consider working with clearer names (reserve 'filename' for a file name and use 'path' for a complete path to a file; use 'path_original' instead of 'item'), os.extsep ('.' on Windows) and os.path.splitext():
path_txt = os.extsep.join([os.path.splitext(path_original)[0], 'txt'])
Now the best hint of all:
numpy can probably read your file directly:
data = np.genfromtxt(filename, unpack=True)
To better understand where your TypeError comes from, wrap your code in the following try/except block:
try:
    (your code)
except:
    import traceback
    traceback.print_exc()
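One likely source of the NoneType in that message: os.mkdir() creates the directory but returns None, so output_dir in the question is None when it reaches os.chdir(). The Python 3 sketch below shows that, and one way to build the output path without changing directory at all. txt_path_in_subdir is a made-up helper and the temporary folder stands in for the real working directory.

```python
import os
import tempfile


def txt_path_in_subdir(rtf_path, output_dir):
    """Map /some/dir/foo.rtf to <output_dir>/foo.txt, creating the folder if needed."""
    os.makedirs(output_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(rtf_path))[0]
    return os.path.join(output_dir, stem + ".txt")


work = tempfile.mkdtemp()
result = os.mkdir(os.path.join(work, "probe"))  # mkdir creates the folder...
print(result)                                   # ...but returns None

out_dir = os.path.join(work, "sub_dir")         # keep the path string yourself
print(txt_path_in_subdir("/Users/me/Desktop/test/foo.rtf", out_dir))
```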
