I tried to run this code:
from tqdm.auto import tqdm
import os
from datasets import load_dataset

dataset = load_dataset('oscar', 'unshuffled_deduplicated_ar', split='train[:25%]')

text_data = []
file_count = 0

for sample in tqdm(dataset['train']):
    sample = sample['text'].replace('\n', ' ')
    text_data.append(sample)
    if len(text_data) == 10_000:
        # once we hit the 10K mark, save to file
        filename = f'/data/text/oscar_ar/text_{file_count}.txt'
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1

# after saving in 10K chunks, we will have ~2082 leftover samples; we save those now too
with open(f'data/text/oscar_ar/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))
and I get the following PermissionError:

PermissionError

I've tried changing the permissions on this directory and running Jupyter with sudo privileges, but it still doesn't work.
You are opening:

with open(f'data/text/oscar_ar/text_{file_count}.txt')

But you are writing:

filename = f'/Dane/text/oscar_ar/text_{file_count}.txt'

And your screenshot says:

filename = f'/date/text/oscar_ar/text_{file_count}.txt'

You have to make a choice between data, /date or /Dane :)

Also, it seems you should remove the first / in /data/text/oscar_ar/text_{file_count}.txt.

Explanation: when you put a slash (/) at the beginning of a path, that means to look from the root of the filesystem, the top level. If you don't put the slash, it will start looking from your current directory.
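A minimal sketch of the difference (the paths here are just illustrations, not your actual data):

```python
import os

relative = 'data/text/oscar_ar/file.txt'   # resolved against the current working directory
absolute = '/data/text/oscar_ar/file.txt'  # resolved from the root of the filesystem

print(os.path.isabs(relative))    # False
print(os.path.isabs(absolute))    # True
print(os.path.abspath(relative))  # os.getcwd() joined with the relative path
```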
I am working on a project where I need to scrape images off the web. To do this, I write the image links to a file, and then I download each of them to a folder with requests. At first I used Google as the scrape site, but due to several reasons I have decided that Wikipedia is a much better alternative. However, after I tried the first time, many of the images couldn't be opened, so I tried again with the change that, when I downloaded the images, I saved them under names with endings that matched the endings of the links. More images were accessible like this, but many still couldn't be opened. When I tested downloading the images myself (individually, outside of the function), they downloaded perfectly, and when I used my function to download them afterwards, they kept downloading correctly (i.e. I could access them). I am not sure if it is important, but the image endings that I generally come across are svg.png and png. I want to know why this is occurring and what I may be able to do to prevent it. I have left some of my code below. Thank you.
Function:
import os
import requests

def download_images(file):
    object = file[0:file.index("IMAGELINKS") - 1]
    folder_name = object + "_images"
    dir = os.path.join("math_obj_images/original_images/", folder_name)
    if not os.path.exists(dir):
        os.mkdir(dir)
    with open("math_obj_image_links/" + file, "r") as f:
        count = 1
        for line in f:
            try:
                if line[len(line) - 1] == "\n":
                    line = line[:len(line) - 1]
                if line[0] != "/":
                    last_chunk = line.split("/")[len(line.split("/")) - 1]
                    endings = last_chunk.split(".")[1:]
                    image_ending = ""
                    for ending in endings:
                        image_ending += "." + ending
                    if image_ending == "":
                        continue
                    with open("math_obj_images/original_images/" + folder_name + "/" + object + str(count) + image_ending, "wb") as f:
                        f.write(requests.get(line).content)
                    file = object + "_IMAGEENDINGS.txt"
                    path = "math_obj_image_endings/" + file
                    with open(path, "a") as f:
                        f.write(image_ending + "\n")
                    count += 1
            except:
                continue
    f.close()
Doing this outside of it worked:
with open("test" + image_ending, "wb") as f:
f.write(requests.get(line).content)
Example of image link file:
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
https://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
https:/static/images/footer/wikimedia-button.png
https:/static/images/footer/poweredby_mediawiki_88x31.png
If all the files are indeed in PNG format and the suffix is always .png, you could try something like this:
import requests
from pathlib import Path
u1 = "https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png"
r = requests.get(u1)
Path('u1.png').write_bytes(r.content)
My previous answer works for PNGs only.

For SVG files you need to check whether the file contents start with the string "<svg" and create a file with the .svg suffix.
The code below saves the downloaded files in the "downloads" subdirectory.
import requests
from pathlib import Path

# urls are stored in a file 'urls.txt'.
Path('downloads').mkdir(exist_ok=True)  # make sure the target folder exists

with open('urls.txt') as f:
    for i, url in enumerate(f.readlines()):
        url = url.strip()  # MUST strip the line-ending char(s)!
        try:
            content = requests.get(url).content
        except Exception:
            print('Cannot download url:', url)
            continue
        # Check if this is an SVG file
        # Note that content is bytes, hence the b in b'<svg'
        if content.startswith(b'<svg'):
            ext = 'svg'
        elif url.endswith('.png'):
            ext = 'png'
        else:
            print('Cannot process contents of url:', url)
            continue
        Path('downloads', f'url{i}.{ext}').write_bytes(content)
Contents of the urls.txt file:
(the last url is an svg)
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
I am trying to create some temporary files and do some operations on them inside a loop. Then I want to access the information in all of the temporary files and do some operations with that information. For simplicity I brought the following code, which reproduces my issue:
import tempfile

tmp_files = []
for i in range(40):
    tmp = tempfile.NamedTemporaryFile(suffix=".txt")
    with open(tmp.name, "w") as f:
        f.write(str(i))
    tmp_files.append(tmp.name)

string = ""
for tmp_file in tmp_files:
    with open(tmp_file, "r") as f:
        data = f.read()
        string += data
print(string)
ERROR:
    with open(tmp_file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpynh0kbnw.txt'

When I look in the /tmp directory (with a time.sleep(2) in the loop) I see that each file is deleted and only one is preserved, hence the error.

Of course I could keep all the files with the flag tempfile.NamedTemporaryFile(suffix=".txt", delete=False), but that is not the idea: I would like to keep the temporary files only for the running time of the script. I could also delete the files with os.remove, but my question is more about why this happens. I expected the files to persist until the end of the run, since I don't close them during execution (or do I?).

A lot of thanks in advance.
tdelaney has already answered your actual question. I would just like to offer you an alternative to NamedTemporaryFile: why not create a temporary folder which is removed (with all files in it) at the end of the script?

Instead of using a NamedTemporaryFile, you could use tempfile.TemporaryDirectory. The directory will be deleted when closed.

The example below uses the with statement, which closes the file handle automatically when the block ends (see John Gordon's comment).
import os
import tempfile

with tempfile.TemporaryDirectory() as temp_folder:
    tmp_files = []
    for i in range(40):
        tmp_file = os.path.join(temp_folder, f"{i}.txt")
        with open(tmp_file, "w") as f:
            f.write(str(i))
        tmp_files.append(tmp_file)

    string = ""
    for tmp_file in tmp_files:
        with open(tmp_file, "r") as f:
            data = f.read()
            string += data
    print(string)
By default, a NamedTemporaryFile deletes its file when closed. It's a bit subtle, but tmp = tempfile.NamedTemporaryFile(suffix=".txt") in the loop causes the previous file to be deleted when tmp is reassigned. One option is to use the delete=False parameter. Or, just keep the file open and seek to the beginning after the write.

NamedTemporaryFile is already a file object - you can write to it directly without reopening. Just make sure the mode is "write plus" ("w+") and text, not binary. Put the code in a try/finally block to make sure the files are really deleted at the end.
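A minimal sketch of that deletion-on-close behavior (the file name is whatever tempfile generates):

```python
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(suffix=".txt")
name = tmp.name
print(os.path.exists(name))  # True: the file exists while the object is open

tmp.close()                  # closing (or garbage collection on reassignment) deletes it
print(os.path.exists(name))  # False
```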
import tempfile

tmp_files = []
try:
    for i in range(40):
        tmp = tempfile.NamedTemporaryFile(suffix=".txt", mode="w+")
        tmp.write(str(i))
        tmp.seek(0)
        tmp_files.append(tmp)

    string = ""
    for tmp_file in tmp_files:
        data = tmp_file.read()
        string += data
finally:
    for tmp_file in tmp_files:
        tmp_file.close()

print(string)
My code is working correctly to scour a directory of PDFs, download weblinks embedded within those PDFs, and sequentially name them with appropriate file extension.
That being said - I am getting a few random files that download but DON'T have an extension associated with them. In doing quality checks, I have all the attachments that matter - these extra files are truly garbage.
Is there a way to not download them or build in a check in the code so that I don't end up with these phantom files?
#!/usr/bin/env python3
import os
import glob
import pdfx
import wget
import urllib.parse
import requests

## Accessing and Creating Six Digit File Code
pdf_dir = "./"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)

for file in pdf_files:
    ## Identify File Name and Limit to Digits
    filename = os.path.basename(file)
    newname = filename[0:6]

    ## Run PDFX to identify and download links
    pdf = pdfx.PDFx(filename)
    url_list = pdf.get_references_as_dict()
    attachment_counter = 1

    for x in url_list["url"]:
        if x[0:4] == "http":
            parsed_url = urllib.parse.quote(x)
            extension = os.path.splitext(x)[1]
            r = requests.get(x)
            with open('temporary', 'wb') as f:
                f.write(r.content)
            ## Concatenate File Name Once Downloaded
            os.rename('./temporary', str(newname) + '_attach' + str(attachment_counter) + str(extension))
            ## Increase Attachment Count
            attachment_counter += 1

    for x in url_list["pdf"]:
        parsed_url = urllib.parse.quote(x)
        extension = os.path.splitext(x)[1]
        r = requests.get(x)
        with open('temporary', 'wb') as f:
            f.write(r.content)
        ## Concatenate File Name Once Downloaded
        os.rename('./temporary', str(newname) + '_attach' + str(attachment_counter) + str(extension))
        ## Increase Attachment Count
        attachment_counter += 1
It's not clear which part of your code produces these "phantom" files, but anyplace you want to avoid downloading a file which doesn't have an extension, you can make the download conditional. If the component after the last slash doesn't contain a dot, do nothing.

if '.' in x.split('/')[-1]:
    ...  # download(x) etc.
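A runnable sketch of that check (the URLs here are made up for illustration):

```python
urls = [
    "https://example.com/files/report.pdf",  # dot after the last slash -> download
    "https://example.com/files/download",    # no dot after the last slash -> skip
]

for x in urls:
    if '.' in x.split('/')[-1]:
        print("would download:", x)
    else:
        print("skipping:", x)
```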
I'm trying to use the code below to read 5 files from a source directory, write them to a destination, and then delete the files in the source. I get the following error: [Errno 13] Permission denied: 'c:\\data\\AM\\Desktop\\tester1'. The file, by the way, looks like this:
import os
import time

source = r'c:\data\AM\Desktop\tester'
destination = r'c:\data\AM\Desktop\tester1'

for file in os.listdir(source):
    file_path = os.path.join(source, file)
    if not os.path.isfile:
        continue
    print(file_path)
    with open(file_path, 'r') as IN, open(destination, 'w') as OUT:
        data = {
            'Power': None,
        }
        for line in IN:
            splitter = (ID, Item, Content, Status) = line.strip().split()
            if Item in data == "Power":
                Content = str(int(Content) * 10)
    os.remove(IN)
I have re-written your entire code. I assume you want to update the value of Power by a multiple of 10 and write the updated content into a new file; the code below will do just that.

Your code had multiple issues. First and foremost, much of what you intended never made it into the code (like writing into a new file, and providing what and where to write). The original permission issue was because you were trying to open a directory for writing instead of a file.
import os

source = r'c:\data\AM\Desktop\tester'
destination = r'c:\data\AM\Desktop\tester1'

for file in os.listdir(source):
    source_file = os.path.join(source, file)
    destination_file = os.path.join(destination, file)
    if not os.path.isfile(source_file):
        continue
    print(source_file)
    with open(source_file, 'r') as IN, open(destination_file, 'w') as OUT:
        data = {
            'Power': None,
        }
        for line in IN:
            splitter = (ID, Item, Content, Status) = line.strip().split()
            if Item in data:  # == "Power": #Changed
                Content = str(int(Content) * 10)
                OUT.write(ID + '\t' + Item + '\t' + Content + '\t' + Status + '\n')  # Added to write the content into the destination file.
            else:
                OUT.write(line)  # Added to write the content into the destination file.
    os.remove(source_file)
Hope this works for you.
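The directory-versus-file point above can be seen in a minimal sketch (using a throwaway temp directory rather than the question's paths):

```python
import tempfile

# Opening a directory path for writing fails, which is what happens when
# `destination` names a folder rather than a file.
d = tempfile.mkdtemp()
try:
    open(d, 'w')
except OSError as e:  # IsADirectoryError on POSIX, PermissionError on Windows
    print(type(e).__name__)
```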
I'm not sure what you're going for here, but here's what I could come up with the question put into the title.
import os
# Takes the text from the old file
with open('old file path.txt', 'r') as f:
text = f.read()
# Takes text from old file and writes it to the new file
with open('new file path.txt', 'w') as f:
f.write(text)
# Removes the old text file
os.remove('old file path.txt')
Sounds from your description like this line fails:
with open (file_path, 'r') as IN, open (destination, 'w') as OUT:
Because of this operation:
open (destination, 'w')
So, you might not have write-access to
c:\data\AM\Desktop\tester1
Set file permission on Windows systems:
https://www.online-tech-tips.com/computer-tips/set-file-folder-permissions-windows/
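One way to check this from Python before opening the file (a sketch; the destination path comes from the question, so on another machine the check simply reports False):

```python
import os

destination = r'c:\data\AM\Desktop\tester1'  # path from the question

# os.access reports whether the current user may write at that location;
# it returns False for paths that don't exist or aren't writable.
parent = os.path.dirname(destination)
print(os.access(parent, os.W_OK))
```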
@Sherin Jayanand
One more question bro, I wanted to try something out with some pieces of your code. I made this out of it:
import os
import time
from datetime import datetime

# Make source, destination and archive paths.
source = r'c:\data\AM\Desktop\Source'
destination = r'c:\data\AM\Desktop\Destination'
archive = r'c:\data\AM\Desktop\Archive'

for root, dirs, files in os.walk(source):
    for f in files:
        pads = (root + '\\' + f)
        # print(pads)
        for file in os.listdir(source):
            dst_path = os.path.join(destination, file)
            print(dst_path)
            with open(pads, 'r') as IN, open(dst_path, 'w') as OUT:
                data = {
                    'Power': None,
                }
                for line in IN:
                    (ID, Item, Content, Status) = line.strip().split()
                    if Item in data:
                        Content = str(int(Content) * 10)
                        OUT.write(ID + '\t' + Item + '\t' + Content + '\t' + Status + '\n')
                    else:
                        OUT.write(line)
But again I received the same error: Permission denied: 'c:\\data\\AM\\Desktop\\Destination\\C'

How come? Thank you very much!
I am working on a simple task: appending an extra column to multiple CSV files.

The following code works perfectly in the Python prompt shell:
import csv
import glob
import os

data_path = "C:/Users/mmorenozam/Documents/Python Scripts/peptidetestivory/"
outfile_path = "C:/Users/mmorenozam/Documents/Python Scripts/peptidetestivory/alldata.csv"
filewriter = csv.writer(open(outfile_path, 'wb'))
file_counter = 0

for input_file in glob.glob(os.path.join(data_path, '*.csv')):
    with open(input_file, 'rU') as csv_file:
        filereader = csv.reader(csv_file)
        name, ext = os.path.splitext(input_file)
        ident = name[-29:-17]
        for i, row in enumerate(filereader):
            row.append(ident)
            filewriter.writerow(row)
    file_counter += 1
However, when I run this code in Spyder, in order to get the desired .csv file I have to add exit() or type "%reset" in the IPython console.

Is there a better way to finish this part of the script? The following parts of my code work with the .csv file generated here, and using the options above is annoying.
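The behaviour comes from the file handle passed to csv.writer never being closed, so the buffered rows are only flushed to disk when the interpreter exits. A minimal sketch of the principle (using a temp directory instead of the question's paths, and a made-up row):

```python
import csv
import os
import tempfile

tmpdir = tempfile.mkdtemp()
outfile_path = os.path.join(tmpdir, "alldata.csv")

# Managing the output file with a with block closes it at the end of the
# block, flushing the buffered rows -- no exit() or %reset needed.
with open(outfile_path, 'w', newline='') as outfile:
    filewriter = csv.writer(outfile)
    filewriter.writerow(['peptide', 'score', 'ident'])

with open(outfile_path, newline='') as f:
    print(f.read().strip())  # -> peptide,score,ident
```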