I have a task of pulling images of items based on SKU and writing them to an Excel sheet. I can download each image fine and write it out, but when workbook.close() is called, xlsxwriter only writes the last image. This is because, to save space, I overwrite the cached image before each new write. Here is my write function:
def writeExcel(url, asin, imgLink, number):
    if url == -1:  # in case the image can't be retrieved
        worksheet.write("A{}".format(number), asin)
        worksheet.write("C{}".format(number), "N/A")
        return
    worksheet.write_string("A{}".format(number), asin)
    imgPath = os.getcwd() + "/cache/img.jpg"
    deleteCache()      # remove the previously downloaded image to make room for the new one
    getImage(imgLink)  # download the image into ./cache/img.jpg
    fixImage(imgPath)  # fix the aspect ratio of the image to fit into the cell
    worksheet.insert_image("C{}".format(number), imgPath, {
        "y_scale": 0.2,
        "x_scale": 0.5,
        "object_position": 1,
        "url": url
    })
It takes in the SKU of the item and the image link, calls getImage() to download the image into ./cache/img.jpg, fixes the aspect ratio with fixImage(), and finally writes the image to the file.
This function is called in another function's for loop, once per SKU.
Here is the function for reference.
def amazonSearch(asinList):
    number = 0
    for asin in asinList:
        number += 1
        if number % 25 == 0:  # feedback to make sure it isn't stuck
            print("Finished {}. Currently at {}".format(number, asin))
        for region in regions:
            req = requests.get(HOST.format(region, asin))
            counter = 0
            while req.status_code == 503:
                req = requests.get(HOST.format(region, asin))
                time.sleep(1)  # don't spam
                counter += 1
                if counter >= 25:
                    break
            if req.status_code == 200:
                break
        if req.status_code != 200:
            writeExcel(-1, asin, "", number)
            continue
        soup = bs(req.content, "html.parser")
        imgTag = soup.find_all(id="landingImage")
        imgLink = imgTag[0]["src"]
        writeExcel(req.url, asin, imgLink, number)
After the script finishes, the file is written, but the last SKU's image shows up for all the other SKUs. This is probably because xlsxwriter only writes its changes when workbook.close() is called.
My question is: how can I fix this without having to keep every single image on disk and write them all at the end? The input file is pretty big (over 8k items). I have thought of closing and reopening the workbook every time writeExcel() is called, but that seems unfeasible, since xlsxwriter overwrites the file every time.
insert_image only adds the image path or URL to a buffer. Later, when the workbook is closed/saved, the images are loaded from their paths (all the same in your case) and written to the output.
You can fix this by reading the image binary yourself and inserting it with the image_data option:
from io import BytesIO

with open(filename, 'rb') as image_file:
    image_data = BytesIO(image_file.read())

# Write the byte-stream image to a cell. The filename must still be specified.
worksheet.insert_image('B8', filename, {'image_data': image_data})
Note: when image_data is present, the file at the path/URL given as the filename argument does not need to exist, so you can treat the filename argument as an identifier or URI.
Since you are reading from the same cached file every time, you can make the filename passed to insert_image unique by using a distinctive attribute such as asin or url. For example:
filename_to_insert = asin + filename
or
filename_to_insert = url
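Applied to the question's writeExcel, a minimal sketch of the adapted function (deleteCache, getImage and fixImage are the question's own helpers; using asin + ".jpg" as the unique name is just one option):

import os
from io import BytesIO

def writeExcel(url, asin, imgLink, number):
    if url == -1:  # in case the image can't be retrieved
        worksheet.write("A{}".format(number), asin)
        worksheet.write("C{}".format(number), "N/A")
        return
    worksheet.write_string("A{}".format(number), asin)
    imgPath = os.getcwd() + "/cache/img.jpg"
    deleteCache()
    getImage(imgLink)
    fixImage(imgPath)
    # Read the cached file into memory so the workbook no longer depends on
    # the cache file still holding this image when workbook.close() runs.
    with open(imgPath, "rb") as f:
        image_data = BytesIO(f.read())
    worksheet.insert_image("C{}".format(number), asin + ".jpg", {
        "y_scale": 0.2,
        "x_scale": 0.5,
        "object_position": 1,
        "url": url,
        "image_data": image_data
    })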
See:
Example: Inserting images from a URL or byte stream into a worksheet — XlsxWriter Documentation
I have already checked other questions here on (almost) the same topic, but I did not find anything that solves my problem.
Basically, I have a piece of Python code that opens each file as a data frame and runs some eye-tracking functions (PyGaze) on it. I have 1000 files to analyse and wanted to create a for loop to execute the code on all of them automatically.
The code is the following:
import os
import glob
import pandas as pd
import matplotlib.pyplot as plt
from detectors import fixation_detection

os.chdir("/Users/Documents//Analyse/Eye movements/Python - Eye Analyse")
directory = '/Users/Documents/Analyse/Eye movements/R - Filtering Data/Filtered_data/Filtered_data_test'

for files in glob.glob(os.path.join(directory, "*.csv")):
    # Load the csv
    df = pd.read_csv(files, parse_dates=True)

    # Plot raw data
    plt.plot(df['eye_x'], df['eye_y'], 'ro', c="red")
    plt.ylim([0, 1080])
    plt.xlim([0, 1920])

    # Fixation analysis
    fixations_data = fixation_detection(df['eye_x'], df['eye_y'], df['time'], maxdist=25, mindur=100)
    Efix_data = fixations_data[1]
    numb_fixations = len(Efix_data)  # number of fixations
    fixation_start = [i[0] for i in Efix_data]
    fixation_stop = [i[1] for i in Efix_data]
    fixation = {'start': fixation_start, 'stop': fixation_stop}
    fixation_frame = pd.DataFrame(data=fixation)
    fixation_frame['difference'] = fixation_frame['stop'] - fixation_frame['start']
    mean_fixation_time = fixation_frame['difference'].mean()  # mean fixation time
    final = {'number_fixations': [numb_fixations], 'mean_fixation_time': [mean_fixation_time]}
    final_frame = pd.DataFrame(data=final)

    # Write everything to one document
    final_frame.to_csv("/Users/Documents/Analyse/Eye movements/final_data.csv")
The code runs without errors, but it only produces output for the first file; it does not appear to run for the other files in the directory. Where is my mistake?
Your output file name is constant, so the file gets overwritten on each iteration of the for loop. Try the following instead of your final line; it opens the file in "append" mode:
# Write everything to one document
with open("/Users/Documents/Analyse/Eye movements/final_data.csv", "a") as f:
    final_frame.to_csv(f, header=False)
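An alternative sketch (not part of the original answer): collect one summary frame per file and write a single csv after the loop, which also keeps a header row. The placeholder line stands in for the fixation analysis from the question:

import glob
import os
import pandas as pd

directory = '/Users/Documents/Analyse/Eye movements/R - Filtering Data/Filtered_data/Filtered_data_test'

frames = []
for files in glob.glob(os.path.join(directory, "*.csv")):
    # ... run the fixation analysis from the question here, producing final_frame ...
    final_frame = pd.DataFrame({'file': [os.path.basename(files)]})  # placeholder summary row
    frames.append(final_frame)

# Concatenate all per-file summaries and write them out once.
pd.concat(frames, ignore_index=True).to_csv(
    "/Users/Documents/Analyse/Eye movements/final_data.csv", index=False
)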
I am making a program that should be able to extract the notes, rests, and chords from a certain MIDI file and write the respective pitches (as MIDI note numbers, which run from 0 to 127) of the notes and chords to a csv file for later use.
For this project, I am using the Python Library "Music21".
from music21 import *
import pandas as pd

# SETUP
path = r"Pirates_TheCarib_midi\1225766-Pirates_of_The_Caribbean_Medley.mid"

# Create a function for parsing the file and extracting the notes
def extract_notes(path):
    stm = converter.parse(path)
    treble = stm[0]  # access the first part
    bass = stm[1]    # access the second part
    # Note extraction
    notes_treble = []
    notes_bass = []
    for thisNote in treble.getElementsByClass("Note"):
        indiv_note = [thisNote.name, thisNote.pitch.midi, thisNote.offset]
        notes_treble.append(indiv_note)  # add the note and its offset to the treble list
    for thisNote in bass.getElementsByClass("Note"):
        indiv_note = [thisNote.name, thisNote.pitch.midi, thisNote.offset]
        notes_bass.append(indiv_note)  # add the notes to the bass list
    return notes_treble, notes_bass

# Write to csv
def to_csv(notes_array):
    df = pd.DataFrame(notes_array, index=None, columns=None)
    df.to_csv("attempt1_v1.csv")

# Using the functions
notes_array = extract_notes(path)
#to_csv(notes_array)

# DEBUGGING
stm = converter.parse(path)
print(stm.parts)
Here is the link to the score I am using as a test.
https://musescore.com/user/1699036/scores/1225766
When I run the extract_notes function, it returns two empty arrays, and the line print(stm.parts) returns:
<music21.stream.iterator.StreamIterator for Score:0x1b25dead550 #:0>
I am confused as to why it does this. The piece should have two parts, treble and bass. How can I get each note, chord and rest into an array so I can put it in a csv file?
Here is a small snippet showing how I did it. I needed to get all notes, chords, and rests for a specific instrument, so I first iterated through the parts to find that instrument, then checked the type of each element and appended it accordingly.
You can call the method like this:
notes = get_notes_chords_rests(keyboard_instruments, "Pirates_of_The_Caribbean.mid")
where keyboard_instruments is a list of instrument names:
keyboard_instruments = ["KeyboardInstrument", "Piano", "Harpsichord", "Clavichord", "Celesta"]
def get_notes_chords_rests(instrument_type, path):
    try:
        midi = converter.parse(path)
        parts = instrument.partitionByInstrument(midi)
        note_list = []
        for music_instrument in range(len(parts)):
            if parts.parts[music_instrument].id in instrument_type:
                for element_by_offset in stream.iterator.OffsetIterator(parts[music_instrument]):
                    for entry in element_by_offset:
                        if isinstance(entry, note.Note):
                            note_list.append(str(entry.pitch))
                        elif isinstance(entry, chord.Chord):
                            note_list.append('.'.join(str(n) for n in entry.normalOrder))
                        elif isinstance(entry, note.Rest):
                            note_list.append('Rest')
        return note_list
    except Exception:
        print("failed on ", path)
P.S. The try block is important because a lot of MIDI files on the web are corrupted.
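To get the result into a csv file (the question's end goal), a minimal sketch using pandas as in the question; the column name is arbitrary:

import pandas as pd

keyboard_instruments = ["KeyboardInstrument", "Piano", "Harpsichord", "Clavichord", "Celesta"]
notes = get_notes_chords_rests(keyboard_instruments, "Pirates_of_The_Caribbean.mid")

# One row per note/chord/rest event.
df = pd.DataFrame({"pitch": notes})
df.to_csv("notes.csv", index=False)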
I have a script designed to place inset maps onto specific pages while exporting Data Driven Pages; it is an amalgamation of a friend's work and some of my own code from other projects.
The issue is that the code exports pages 15 and 16 twice: once with my inset maps and once without, and I can't figure out why.
I think it is something to do with the indentation within the loop, but I can't get it to behave in any other way. Any help would be appreciated!
import arcpy, os, time, datetime
from datetime import datetime

start_time = datetime.now()
PageNumber = "Page "

# Create an output directory variable, i.e. the location of your maps folder
outDir = r"C:\Users\support\Desktop\Python\Book of Reference"

# Create a new, empty pdf document in the specified output directory
# This will be your final product
finalpdf_filename = outDir + r"\FinalMapBook.pdf"
if os.path.exists(finalpdf_filename):  # Check to see if the file already exists, delete it if it does
    os.remove(finalpdf_filename)
finalPdf = arcpy.mapping.PDFDocumentCreate(finalpdf_filename)

# Create a Data Driven Pages object from the mxd you wish to export
mxdPath = r"C:\Users\support\Desktop\Python\Book Of Reference\Book_Of_Reference_20160526_Python_Test.mxd"
tempMap = arcpy.mapping.MapDocument(mxdPath)
tempDDP = tempMap.dataDrivenPages

# Create objects for the layout elements that will be moving, e.g. inset data frame, scale text
Page15 = arcpy.mapping.ListDataFrames(tempMap)[1]
Page16 = arcpy.mapping.ListDataFrames(tempMap)[2]

# Instead of exporting all pages at once, use a loop to export one at a time
# This allows you to check each index and execute code to add inset maps to the correct pages
for pgIndex in range(1, tempDDP.pageCount + 1, 1):
    # Create a name for the pdf file you will create for each page
    temp_filename = r"C:\Users\support\Desktop\Python\Book of Reference\Book of Reference" + \
                    str(pgIndex) + ".pdf"
    if os.path.exists(temp_filename):
        os.remove(temp_filename)  # Remove the pdf if it is already in the folder

    # Code for setting up the inset map on page 15
    if (pgIndex == 15):
        # Set the position of the inset map to place it on the page layout
        Page15.elementPositionX = 20.1717
        Page15.elementPositionY = 2.0382
        # Set the desired size of the inset map for this page
        Page15.elementHeight = 9.7337
        Page15.elementWidth = 12.7115
        # Set the desired extent for the inset map
        Page15insetExtent = arcpy.Extent(518878, 108329, 519831, 107599)
        Page15.extent = Page15insetExtent
        arcpy.RefreshActiveView()
        tempDDP.exportToPDF(temp_filename, "RANGE", pgIndex)
        finalPdf.appendPages(temp_filename)
        Page15.elementPositionX = 50  # Move the inset back off the page
        arcpy.RefreshActiveView()     # Refresh to ensure the inset has been removed
        print PageNumber + str(pgIndex)
    if (pgIndex == 16):
        # Set up the inset map
        Page16.elementPositionX = 2.1013
        Page16.elementPositionY = 18.1914
        Page16.elementHeight = 9.7337
        Page16.elementWidth = 12.7115
        Page16insetExtent = arcpy.Extent(520012, 107962, 521156, 107086)
        Page16.extent = Page16insetExtent
        arcpy.RefreshActiveView()
        print PageNumber + str(pgIndex)
        tempDDP.exportToPDF(temp_filename, "RANGE", pgIndex)
        finalPdf.appendPages(temp_filename)
        print PageNumber + str(pgIndex)
        Page16.elementPositionX = 50
        arcpy.RefreshActiveView()
    # The else clause takes care of the pages that don't have insets and just iterates through the loop above
    else:
        tempDDP.exportToPDF(temp_filename, "RANGE", pgIndex)
        finalPdf.appendPages(temp_filename)
        print PageNumber + str(pgIndex)

# Clean up
del tempMap

# Update the properties of the final pdf
finalPdf.updateDocProperties(pdf_open_view="USE_THUMBS",
                             pdf_layout="SINGLE_PAGE")

# Save your result
finalPdf.saveAndClose()

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
I believe your problem is that when pgIndex is 15, the script performs the export as intended, then checks whether pgIndex is 16. Since it is not 16, it drops into the else and re-exports the page without the inset maps. I would recommend changing the second if to an elif.
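For illustration, a sketch of the corrected control flow; setUpPage15Inset and setUpPage16Inset are hypothetical helpers standing in for the inset-positioning code from the question:

for pgIndex in range(1, tempDDP.pageCount + 1, 1):
    temp_filename = outDir + r"\Book of Reference" + str(pgIndex) + ".pdf"
    if pgIndex == 15:
        setUpPage15Inset()  # hypothetical helper: position and size the page-15 inset
        tempDDP.exportToPDF(temp_filename, "RANGE", pgIndex)
        finalPdf.appendPages(temp_filename)
    elif pgIndex == 16:  # elif: page 15 no longer falls through into the else branch
        setUpPage16Inset()  # hypothetical helper: position and size the page-16 inset
        tempDDP.exportToPDF(temp_filename, "RANGE", pgIndex)
        finalPdf.appendPages(temp_filename)
    else:
        # Pages without insets are exported exactly once.
        tempDDP.exportToPDF(temp_filename, "RANGE", pgIndex)
        finalPdf.appendPages(temp_filename)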
I am trying to make a video from a large number of images using MoviePy. The approach works fine for small numbers of images, but the process is killed for large numbers. At about 500 images added, the Python process is using about half of the available memory, and there are many more images than that.
How should I address this? I want the processing to complete, and I don't mind if it takes a bit longer, but it would be good if I could limit the memory and CPU usage in some way. With the current approach, the machine becomes almost unusable while processing.
The code is as follows:
import os
import time
from moviepy.editor import *

def ls_files(path="."):
    return [fileName for fileName in os.listdir(path)
            if os.path.isfile(os.path.join(path, fileName))]

def main():
    listOfFiles = ls_files()
    listOfTileImageFiles = [fileName for fileName in listOfFiles
                            if "_tile.png" in fileName]
    numberOfTiledImages = len(listOfTileImageFiles)

    # Create a video clip for each image.
    print("create video")
    videoClips = []
    imageDurations = []
    for imageNumber in range(0, numberOfTiledImages):
        imageFileName = str(imageNumber) + "_tile.png"
        print("add image {fileName}".format(fileName=imageFileName))
        imageClip = ImageClip(imageFileName)
        duration = 0.1
        videoClip = imageClip.set_duration(duration)
        # Determine the image start time by calculating the sum of the
        # durations of all previous images.
        if imageNumber != 0:
            videoStartTime = sum(imageDurations[0:imageNumber])
        else:
            videoStartTime = 0
        videoClip = videoClip.set_start(videoStartTime)
        videoClips.append(videoClip)
        imageDurations.append(duration)
    fullDuration = sum(imageDurations)
    video = concatenate(videoClips)
    video.write_videofile(
        "video.mp4",
        fps=30,
        codec="mpeg4",
        audio_codec="libvorbis"
    )

if __name__ == "__main__":
    main()
If I understood correctly, you want to use the different images as the frames of your video. In this case you should use ImageSequenceClip() (it's in the library, but not yet in the web docs; see here for the doc).
Basically, you just write:
clip = ImageSequenceClip("some/directory/", fps=10)
clip.write_videofile("result.mp4")
It will read the images in the directory in alpha-numerical order, keeping only one frame in memory at a time.
While I am at it: you can also provide a list of filenames or a list of numpy arrays to ImageSequenceClip.
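Since the question's files are numbered 0_tile.png, 1_tile.png, ..., here is a sketch that builds the list in numeric order (plain alpha-numerical sorting would put 10_tile.png before 2_tile.png); the image count is an assumed stand-in for the directory scan in the question:

from moviepy.editor import ImageSequenceClip

numberOfTiledImages = 500  # assumed count; derive it from the directory listing as in the question

# Build the filename list in numeric order so the frames stay in sequence.
fileNames = ["{}_tile.png".format(i) for i in range(numberOfTiledImages)]

clip = ImageSequenceClip(fileNames, fps=10)
clip.write_videofile("video.mp4")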
Note that if you just want to turn images into a video, without anything fancier like adding titles or compositing with another video, then you can do it directly with ffmpeg. From memory, the command is along these lines:
ffmpeg -framerate 10 -i %d_tile.png result.mp4
Using a list, I am able to get all the URLs from a webpage into the list imgs_urls. I now need to know how to save all the images from that webpage, given that the number of images changes from report to report. Depending on what report I run, there can be any number of URLs in the list. The code below already works for a single list item:
html = lxml.html.fromstring(data)
imgs = html.cssselect('img.graph')

imgs_urls = []
for x in imgs:
    imgs_urls.append('http://statseeker%s' % (x.attrib['src']))

lnum = len(imgs_urls)
link = urllib2.Request(imgs_urls[0])
output = open('sla1.jpg', 'wb')
response = urllib2.urlopen(link)
output.write(response.read())
output.close()
The URLs in the list are full URLs. If printed, the list reads back something like this:
img_urls = ['http://site/2C2302.png','http://site/2C22101.png','http://site/2C2234.png']
Here is the basic premise of what I think it would look like, though I know the syntax is not correct:
lnum = len(imgs_urls)
link = urllib2.Request(imgs_urls[0-(lnum)])
output = open('sla' + (0-(lnum)).jpg','wb')
response = urllib2.urlopen(link)
output.write(response.read())
output.close()
It would then save all the images, with filenames like:
sla1.png, sla2.png, sla3.png, sla4.png
Any ideas? I think a loop would fix this, but I don't know how to save sla<n>.jpg lnum times, incrementing the filename and the list index together.
I like to use Python's enumerate to get the index of the iterable in addition to its value; you can use the index to auto-increment the filenames you give to the output files. Something like this should work:
import urllib2

img_urls = ['http://site/2C2302.png', 'http://site/2C22101.png', 'http://site/2C2234.png']

for index, url in enumerate(img_urls):
    link = urllib2.urlopen(url)
    try:
        name = "sla%s.jpg" % (index + 1)
        with open(name, "wb") as output:
            output.write(link.read())
    except IOError:
        print "Unable to create %s" % name
You may need to catch other exceptions too, such as permission errors, but that should get you started. Note that I incremented the index by 1 as it is zero-based.
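As an aside (my own suggestion, since your URLs end in .png while the output names use .jpg): you may want to keep each image's own extension. A sketch using Python 2's urlparse module:

import os
import urlparse
import urllib2

img_urls = ['http://site/2C2302.png', 'http://site/2C22101.png', 'http://site/2C2234.png']

for index, url in enumerate(img_urls):
    # Take the extension from the URL path itself (e.g. ".png").
    extension = os.path.splitext(urlparse.urlparse(url).path)[1]
    name = "sla%s%s" % (index + 1, extension)
    with open(name, "wb") as output:
        output.write(urllib2.urlopen(url).read())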
See also:
http://www.blog.pythonlibrary.org/2012/06/07/python-101-how-to-download-a-file/
How do I download a file over HTTP using Python?