I have a zipped folder containing 15 000 YAML files. I'd like to iterate through it, loading each file with yaml.safe_load so that I get a dictionary and can extract the information I need. I've written some code using zipfile.ZipFile and yaml.safe_load, but it only works for the first file in the archive. Could anyone take a look and explain what I'm misunderstanding?
import zipfile
import yaml

zip_file = zipfile.ZipFile("D:/export.zip")
files = zip_file.namelist()
print(files)
for i in range(10):
    # Open each archive member and parse it into a dictionary
    with zip_file.open(files[i]) as yamlfile:
        yamlreader = yaml.safe_load(yamlfile)
        print(yamlreader["identifier"])
For now I'm just iterating through 10 files to make life easier; eventually I'd like to do all 15 000. "identifier" is a key in each YAML file.
This is the error:
10.5281/zenodo.1014773
Traceback (most recent call last):
File "C:/Users/estho/PycharmProjects/GSOC3/testing_dataextraction.py", line 20, in <module>
yamlreader = yaml.safe_load(yamlfile)
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\__init__.py", line 162, in safe_load
return load(stream, SafeLoader)
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\__init__.py", line 114, in load
return loader.get_single_data()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\constructor.py", line 41, in get_single_data
node = self.get_single_node()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\composer.py", line 36, in get_single_node
document = self.compose_document()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\composer.py", line 127, in compose_mapping_node
while not self.check_event(MappingEndEvent):
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\parser.py", line 98, in check_event
self.current_event = self.state()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\parser.py", line 428, in parse_block_mapping_key
if self.check_token(KeyToken):
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\scanner.py", line 116, in check_token
self.fetch_more_tokens()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\scanner.py", line 260, in fetch_more_tokens
self.get_mark())
yaml.scanner.ScannerError: while scanning for the next token
found character '\t' that cannot start any token
in "yamlfile_10_5281_zenodo_1745362.yaml", line 4, column 1
Thank you.
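It looks like the file "yamlfile_10_5281_zenodo_1745362.yaml" contains a character the parser cannot handle. Your loop itself is fine: an identifier was printed before the traceback, so at least one file was parsed, and the error comes from a different file. The '\t' in the message is a tab character at line 4, column 1; YAML does not allow tabs for indentation, so the scanner cannot start a token there. Try running it without this file, or replace the tab with spaces.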
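A minimal sketch of how the loop could keep going past a malformed file, assuming PyYAML is installed. The function name load_yaml_members and its limit parameter are hypothetical, not part of the original script; failures are collected by member name instead of stopping the run.

```python
import zipfile

import yaml


def load_yaml_members(zip_path, limit=None):
    """Parse YAML members of a zip archive, collecting failures instead of stopping.

    Returns (documents, errors): one dict mapping member name to parsed data
    and one mapping member name to the parser's error message.
    """
    documents, errors = {}, {}
    with zipfile.ZipFile(zip_path) as archive:
        # Only look at members that are named like YAML files
        names = [n for n in archive.namelist() if n.endswith((".yaml", ".yml"))]
        for name in names[:limit]:  # limit=None means "all members"
            with archive.open(name) as member:
                try:
                    documents[name] = yaml.safe_load(member)
                except yaml.YAMLError as exc:
                    # ScannerError (e.g. a tab used for indentation) lands here
                    errors[name] = str(exc)
    return documents, errors
```

Calling load_yaml_members("D:/export.zip", limit=10) would mirror the ten-file test in the question; the errors dictionary then lists which files need fixing.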
Related
I'm trying to merge multiple .xls files into a single workbook, where each file is inserted into its own sheet, named after the .xls filename.
While searching the web, I found the pyexcel documentation and a specific module which, according to the documentation, could do the job easily.
Here's the code.
from pyexcel.cookbook import merge_all_to_a_book
import glob
merge_all_to_a_book(glob.glob("Dir\*.xls"),"output.xls")
As expected, it doesn't work. Here's the console output.
File "..\Desktop\scripts\provaimport.py", line 48, in <module>
merge_all_to_a_book(glob.glob("C:\Users\Tesisti\Desktop\forpythonscript\*.xls"),"output.xls")
File "C:\Users\Tesisti\Anaconda2\lib\site-packages\pyexcel\cookbook.py", line 148, in merge_all_to_a_book
merged.save_as(outfilename)
File "C:\Users\Tesisti\Anaconda2\lib\site-packages\pyexcel\internal\meta.py", line 339, in save_as
return save_book(self, file_name=filename, **keywords)
File "C:\Users\Tesisti\Anaconda2\lib\site-packages\pyexcel\internal\core.py", line 51, in save_book
return _save_any(a_source, book)
File "C:\Users\Tesisti\Anaconda2\lib\site-packages\pyexcel\internal\core.py", line 55, in _save_any
a_source.write_data(instance)
File "C:\Users\Tesisti\Anaconda2\lib\site-packages\pyexcel\plugins\sources\file_output.py", line 38, in write_data
**self._keywords)
File "C:\Users\Tesisti\Anaconda2\lib\site-packages\pyexcel\plugins\renderers\excel.py", line 30, in render_book_to_file
save_data(file_name, book.to_dict(), **keywords)
File "C:\Users\Tesisti\Anaconda2\lib\site-packages\pyexcel_io\io.py", line 119, in save_data
**keywords)
File "C:\Users\Tesisti\Anaconda2\lib\site-packages\pyexcel_io\io.py", line 141, in store_data
writer.write(data)
File "C:\Users\Tesisti\Anaconda2\lib\site-packages\pyexcel_io\book.py", line 58, in __exit__
self.close()
File "C:\Users\Tesisti\Anaconda2\lib\site-packages\pyexcel_xls\xlsw.py", line 86, in close
self.work_book.save(self._file_alike_object)
File "C:\Users\Tesisti\Anaconda2\lib\site-packages\xlwt\Workbook.py", line 710, in save
doc.save(filename_or_stream, self.get_biff_data())
File "C:\Users\Tesisti\Anaconda2\lib\site-packages\xlwt\Workbook.py", line 680, in get_biff_data
self.__worksheets[self.__active_sheet].selected = True
Any idea on how to fix?
It seems to me that glob.glob("Dir\*.xls") returned an empty list of files. Hence pyexcel's plugin pyexcel-xls fails to create an empty file.
For now, I would recommend installing the latest pyexcel-xls and wrapping merge_all_to_a_book in a try-except statement to catch the empty-file case.
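As a complement to the try-except, the empty list can be caught before pyexcel is called at all. The sketch below assumes Python 3, and find_xls_files is a hypothetical helper. One possible cause of the empty list, offered as a guess: in an ordinary string literal, the "\f" in "Desktop\forpythonscript" (the path in the traceback) is read as a form-feed character, so the pattern no longer names the real folder. Building the pattern with os.path.join, or writing the path as a raw string, avoids that.

```python
import glob
import os


def find_xls_files(directory):
    """Return the sorted .xls paths in `directory`, or raise if there are none."""
    pattern = os.path.join(directory, "*.xls")
    files = sorted(glob.glob(pattern))
    if not files:
        # Fail early with a readable message instead of deep inside the writer
        raise FileNotFoundError("no .xls files match %r" % pattern)
    return files
```

The result could then be passed on as merge_all_to_a_book(find_xls_files(r"C:\Users\Tesisti\Desktop\forpythonscript"), "output.xls").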
I'm getting a strange error when parsing a YAML file:
yaml.scanner.ScannerError: mapping values are not allowed here
The YAML file I'm trying to read is valid according to YAML Lint.
Another strange thing is that it works fine on my laptop (Arch Linux) but not on the server (Ubuntu), even though the PyYAML version is the same on both machines.
I have seen the other posts on Stack Overflow where people were missing the space after the colon, but I'm not missing any spaces.
This is the complete YAML file:
pipeline:
  - read:
      input: /home/omnibrain/projects/company/data/data.csv
      output: some_data
  - filter:
      input: some_data
      filtername: latlng_filter
      minlat: 32.5
      maxlat: 32.9
      minlng: -117.4
      maxlng: -117.0
  - enhance:
      input: some_data
      enhancername: geo_enhancer
      fields: zip
  - write:
      input: some_data
      writername: csv_writer
      output_dir: /home/omnibrain/outputs
      columns: [id, latitude, longitude, zip, networktype]
      filename: example1 # the output filename
And this is the complete stack trace:
Traceback (most recent call last):
File "/usr/local/bin/someproject", line 9, in <module>
load_entry_point('someproject==0.0.1', 'console_scripts', 'someproject')()
File "/usr/local/lib/python3.4/dist-packages/someproject-0.0.1-py3.4.egg/someproject/__init__.py", line 19, in main
pipeline.Pipeline(parser.parse_args().scriptfile).start()
File "/usr/local/lib/python3.4/dist-packages/someproject-0.0.1-py3.4.egg/someproject/pipeline/pipeline.py", line 20, in __init__
self._raw_pipeline = self._parse_yaml(yamlscript)
File "/usr/local/lib/python3.4/dist-packages/someproject-0.0.1-py3.4.egg/someproject/pipeline/pipeline.py", line 55, in _parse_yaml
data = yaml.load(yamlscript)
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/__init__.py", line 72, in load
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/constructor.py", line 35, in get_single_data
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/composer.py", line 36, in get_single_node
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/composer.py", line 55, in compose_document
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/composer.py", line 84, in compose_node
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/composer.py", line 133, in compose_mapping_node
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/composer.py", line 82, in compose_node
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/composer.py", line 111, in compose_sequence_node
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/composer.py", line 84, in compose_node
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/composer.py", line 133, in compose_mapping_node
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/composer.py", line 84, in compose_node
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/composer.py", line 127, in compose_mapping_node
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/parser.py", line 98, in check_event
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/parser.py", line 428, in parse_block_mapping_key
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/scanner.py", line 116, in check_token
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/scanner.py", line 220, in fetch_more_tokens
File "/usr/local/lib/python3.4/dist-packages/PyYAML-3.11-py3.4-linux-x86_64.egg/yaml/scanner.py", line 580, in fetch_value
yaml.scanner.ScannerError: mapping values are not allowed here
in "./test1.yaml", line 3, column 93
You are not missing any spaces after the colon; you have too many spaces in the line starting with input: /home/omnibrain/projects/company/data/data.csv. That is why the error points at line 3, column 93.
The whole line reads something like:
input: /home/omnibrain/projects/company/data/data.csv output: some_data
There are probably also some odd characters interfering with your display, because normally you would see a string like
... output: some_data
below the "mapping values are not allowed here" message.
These kinds of differences usually occur when files look the same but in reality are not, e.g. after copying and pasting from one terminal to another, or after pasting into a website like YAML Lint.
Generate an md5sum of the file on both systems to check whether they are really the same. Use od -c on the YAML file to inspect it for strange characters.
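If md5sum or od is not at hand on one of the machines, roughly the same two checks can be done from Python. This is a sketch using only the standard library; the function names md5_of_file and find_suspicious_bytes are made up for illustration, and "suspicious" is assumed here to mean any byte that is neither printable ASCII nor a line feed.

```python
import hashlib


def md5_of_file(path):
    """Return the hex MD5 digest of a file, comparable with md5sum output."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_suspicious_bytes(path):
    """Yield (line_number, column, byte_value) for every byte that is not
    printable ASCII or a line feed (tabs, carriage returns, non-ASCII bytes)."""
    with open(path, "rb") as handle:
        for line_number, line in enumerate(handle, start=1):
            for column, byte in enumerate(line, start=1):
                if byte != 0x0A and not 0x20 <= byte <= 0x7E:
                    yield line_number, column, byte
```

Different digests on the laptop and the server would confirm that the two copies of the file are not byte-for-byte identical; the second function points at the positions worth inspecting.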
I'm quite new to coding in general.
What I want to achieve is a script that runs through a list of employers in Excel and generates a new hour sheet every week. By generating I mean copying an empty hour sheet for every employer, renaming it, and changing the week number and employer name in the new copy.
I didn't start with a loop, because I first wanted to write the part that changes the employer name and week number. I've already searched the internet for answers, but I can't get the code to work and keep getting error messages.
So here is my code so far:
import os
import shutil
import time
from openpyxl import load_workbook

# Calculate the year and week number
year = time.strftime("%Y")
week = str(int(time.strftime("%W")) + 1)
year_week = year + "_" + week

# Create weekly hour sheets per employer
employer = "Adam"
hsheets_dir = "C:\\test\\"
old_file_name = "blanco.xlsx"
new_file_name = employer + "_" + year_week + ".xlsx"
dest_filename = hsheets_dir + new_file_name
shutil.copy2(hsheets_dir + old_file_name, dest_filename)

# Change employer name and week number
def insert_xlsx(dest, empl, wk):
    # Open the copied xlsx file
    print(dest)
    wb = load_workbook(filename=dest)
    # Get the sheet named "Auto"
    ws = wb.get_sheet_by_name("Auto")
    ws.cell(row=1, column=2).value = empl
    ws.cell(row=2, column=2).value = wk
    wb.save(dest)

insert_xlsx(dest_filename, employer, week)
And here is the error message i keep getting:
Traceback (most recent call last):
File "G:\ALL\Urenverantwoording\Wekelijks\Genereer_weekstaten.py", line 46, in <module>
insert_xlsx(dest_filename, employer, week)
File "G:\ALL\Urenverantwoording\Wekelijks\Genereer_weekstaten.py", line 44, in insert_xlsx
wb.save(dest)
File "C:\Python34\lib\site-packages\openpyxl\workbook\workbook.py", line 298, in save
save_workbook(self, filename)
File "C:\Python34\lib\site-packages\openpyxl\writer\excel.py", line 198, in save_workbook
writer.save(filename, as_template=as_template)
File "C:\Python34\lib\site-packages\openpyxl\writer\excel.py", line 181, in save
self.write_data(archive, as_template=as_template)
File "C:\Python34\lib\site-packages\openpyxl\writer\excel.py", line 87, in write_data
self._write_worksheets(archive)
File "C:\Python34\lib\site-packages\openpyxl\writer\excel.py", line 114, in _write_worksheets
write_worksheet(sheet, self.workbook.shared_strings,
File "C:\Python34\lib\site-packages\openpyxl\writer\worksheet.py", line 302, in write_worksheet
xf.write(comments)
File "C:\Python34\lib\contextlib.py", line 66, in __exit__
next(self.gen)
File "C:\Python34\lib\site-packages\openpyxl\xml\xmlfile.py", line 51, in element
self._write_element(el)
File "C:\Python34\lib\site-packages\openpyxl\xml\xmlfile.py", line 78, in _write_element
xml = tostring(element)
File "C:\Python34\lib\xml\etree\ElementTree.py", line 1126, in tostring
short_empty_elements=short_empty_elements)
File "C:\Python34\lib\xml\etree\ElementTree.py", line 778, in write
short_empty_elements=short_empty_elements)
File "C:\Python34\lib\xml\etree\ElementTree.py", line 943, in _serialize_xml
short_empty_elements=short_empty_elements)
File "C:\Python34\lib\xml\etree\ElementTree.py", line 943, in _serialize_xml
short_empty_elements=short_empty_elements)
File "C:\Python34\lib\xml\etree\ElementTree.py", line 935, in _serialize_xml
v = _escape_attrib(v)
File "C:\Python34\lib\xml\etree\ElementTree.py", line 1093, in _escape_attrib
_raise_serialization_error(text)
File "C:\Python34\lib\xml\etree\ElementTree.py", line 1059, in _raise_serialization_error
"cannot serialize %r (type %s)" % (text, type(text).__name__)
TypeError: cannot serialize 3 (type int)
Can someone point me in the right direction?
Many thanks
Based on your responses, I think the problem lies with your existing hour-sheet Excel spreadsheet:
Try starting with a copy of your existing spreadsheet and removing all of the entries. Hopefully this will work too.
If this fails, start with a new blank spreadsheet.
Bit by bit, copy the existing data across and rerun your script.
By doing this you might be able to isolate the feature that is not compatible with openpyxl.
Alternatively, you might be able to write the whole thing from your Python script, and skip trying to modify a semi-filled in one. This would then be 100% compatible.
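A rough sketch of that last option, assuming openpyxl is installed. The function name create_hour_sheet and the labels in column A are hypothetical; only the sheet name "Auto" and the B1/B2 positions are taken from the script in the question. A real hour sheet would need its full layout written the same way.

```python
from openpyxl import Workbook, load_workbook


def create_hour_sheet(dest, employer, week):
    """Write a fresh hour sheet with the employer name in B1 and the week in B2."""
    wb = Workbook()
    ws = wb.active
    ws.title = "Auto"  # same sheet name the question's script looks up
    # Hypothetical labels; the original template's layout is not known
    ws.cell(row=1, column=1).value = "Employer"
    ws.cell(row=2, column=1).value = "Week"
    ws.cell(row=1, column=2).value = employer
    ws.cell(row=2, column=2).value = week
    wb.save(dest)
```

Because the workbook is built entirely by openpyxl, it contains no features from the Excel template that openpyxl might fail to write back.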
My app on Google App Engine uses a YAML settings file. I've used the same approach in several similar App Engine apps for a couple of years. App settings that only need to change at deployment time go into that YAML file.
Opening and loading that YAML file turned out to be a big source of problems when many users hit the API at the same time, so a year or two ago I quickly fixed the issue by memcaching the contents of the YAML file. That worked well for a long time.
I realized recently that I still get occasional DeadlineExceededError errors when trying to open() that file. The number of open attempts should be very small (only when I manually change the memcache key, which happens even less often than deployments). What actually happens, sometimes, is that the request times out after 60 seconds, failing to open that file; this happens when memcache has for some reason lost the content stored under that key. That's okay, because that's how memcache works, and it still keeps the cached content around most of the time. I only use it for that YAML file and never felt a need to use it for anything else. Still, the open() often fails after 60 seconds with a DeadlineExceededError.
Any leads?
The error log goes like this:
File "/base/data/home/apps/s~mjpuroland/1.385586677613659867/mjconfig.py", line 81, in loadversionsettings
versionsettings = yaml.load(open(versionsettingsfile).read())
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/__init__.py", line 71, in load
return loader.get_single_data()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/constructor.py", line 37, in get_single_data
node = self.get_single_node()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/composer.py", line 36, in get_single_node
document = self.compose_document()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/composer.py", line 133, in compose_mapping_node
item_value = self.compose_node(node, item_key)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/composer.py", line 133, in compose_mapping_node
item_value = self.compose_node(node, item_key)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/composer.py", line 82, in compose_node
node = self.compose_sequence_node(anchor)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/composer.py", line 111, in compose_sequence_node
node.value.append(self.compose_node(node, index))
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/composer.py", line 127, in compose_mapping_node
while not self.check_event(MappingEndEvent):
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/parser.py", line 98, in check_event
self.current_event = self.state()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/parser.py", line 428, in parse_block_mapping_key
if self.check_token(KeyToken):
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/scanner.py", line 116, in check_token
self.fetch_more_tokens()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/scanner.py", line 159, in fetch_more_tokens
self.stale_possible_simple_keys()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/yaml-3.10/yaml/scanner.py", line 283, in stale_possible_simple_keys
for level in self.possible_simple_keys.keys():
DeadlineExceededError
Update: Looking at the log again, I realized that what is timing out is the yaml.load() call, not open(). Is YAML that problematic? The YAML file is 249 KB at the moment.
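One possible lead, sketched under assumptions rather than taken from the original app: keep the parsed settings in a module-level dictionary, so that each running instance parses the file at most once and memcache becomes a second layer rather than the only one. The names load_settings and _settings_cache are hypothetical, and parse stands for whatever loader the app uses (yaml.load in the traceback above).

```python
# In-process cache: filled on first use, reused for as long as the process lives
_settings_cache = {}


def load_settings(path, parse):
    """Parse the file at `path` at most once per process and reuse the result.

    `parse` is any callable that turns the file's text into settings,
    for example a YAML loader.
    """
    if path not in _settings_cache:
        with open(path) as handle:
            _settings_cache[path] = parse(handle.read())
    return _settings_cache[path]
```

This does not make the first parse any faster; it only reduces how often the slow yaml.load() call has to run.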
I tried to open an RDF file (the DMOZ RDF dump), but I get this error message:
Traceback (most recent call last):
File "/media/_dev_/ODP_RDF_get_links.py", line 4, in <module>
result = g.parse("data/content.rdf")
File "/usr/local/lib/python2.7/dist-packages/rdflib/graph.py", line 1033, in parse
parser.parse(source, self, **args)
File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/parsers/rdfxml.py", line 577, in parse
self._parser.parse(source)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 210, in feed
self._parser.Parse(data, isFinal)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 352, in end_element_ns
self._cont_handler.endElementNS(pair, None)
File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/parsers/rdfxml.py", line 160, in endElementNS
self.current.end(name, qname)
File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/parsers/rdfxml.py", line 331, in node_element_end
self.error("Repeat node-elements inside property elements: %s"%"".join(name))
File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/parsers/rdfxml.py", line 185, in error
raise ParserError(info + message)
file:///media/_dev_/data/content.rdf:5:12: Repeat node-elements inside property elements: http://dmoz.org/rdf/catid
My simple code is as follows:
import rdflib
g = rdflib.Graph()
result = g.parse("data/content.rdf")
print("graph has %s statements." % len(g))
I need to be able to read the file and extract all links in the World category.
Thanks for any possible help.
EDIT:
PS: I found this on Wikipedia about the RDF dumps: developing custom scripts is necessary to use this dump.
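A minimal sketch of such a custom script, using the standard library's streaming XML parser instead of rdflib. It rests on an assumption about the dump's layout that the question does not show: each link is an ExternalPage element with an about attribute and a topic child such as "Top/World/...". The function names are hypothetical, and tags are matched by local name so the exact namespace URIs do not matter.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on names."""
    return tag.rsplit("}", 1)[-1]


def iter_links(source, category_prefix="Top/World"):
    """Yield (url, topic) for ExternalPage elements under category_prefix.

    `source` is a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "ExternalPage":
            continue
        # Look the attribute up by local name in case it carries a namespace
        url = next(
            (value for key, value in elem.attrib.items() if local_name(key) == "about"),
            None,
        )
        topic = next(
            (child.text for child in elem if local_name(child.tag) == "topic"),
            None,
        )
        if url and topic and topic.startswith(category_prefix):
            yield url, topic
        elem.clear()  # drop the element's children and text once processed
```

Something like "for url, topic in iter_links("data/content.rdf"): print(url)" would then list the links; if the dump also contains characters that are invalid in XML, those would need cleaning first.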