Cannot read Statistics Canada sdmx file - python

I am trying to read some Canadian census data from Statistics Canada
(the XML option for the "Canada, provinces and territories" geographic level). I see that the XML file is in the SDMX format and that there is a structure file provided, but I cannot figure out how to read the data from the XML file.
It seems there are two options in Python, pandasdmx and sdmx1, both of which say they can read local files. When I try:
import sdmx
datafile = '~/Documents/Python/Generic_98-401-X2016059.xml'
canada = sdmx.read_sdmx(datafile)
it appears to read the first 903 lines and then produces the following:
Traceback (most recent call last):
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/xml.py", line 238, in read_message
raise NotImplementedError(element.tag, event) from None
NotImplementedError: ('{http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message}GenericData', 'start')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/__init__.py", line 126, in read_sdmx
return reader().read_message(obj, **kwargs)
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/xml.py", line 259, in read_message
raise XMLParseError from exc
sdmx.exceptions.XMLParseError: NotImplementedError: ('{http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message}GenericData', 'start')
Is this happening because I've not loaded the structure of the SDMX file (Structure_98-401-X2016059.xml in the zip file from the StatsCan link above)? If so, how do I go about loading that and telling sdmx to use it when reading datafile?
The documentation for sdmx and pandasdmx only shows examples of loading files from online providers, not from local files, so I'm stuck. I have limited familiarity with Python, so any help is much appreciated.
For reference, I can read the file in R using the instructions from the rsdmx GitHub page. I would like to be able to do the same (or similar) in Python.
Thanks in advance.

From a cursory inspection of the documentation, it seems that Statistics Canada is not one of the sources included by default. There is, however, an sdmx.add_source function. I suggest you try that (before loading the data).
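For what it's worth, a sketch of what registering a source might look like; the id, name, and url below are placeholders rather than a verified StatsCan endpoint:

import sdmx
# Placeholder source definition -- id/name/url are illustrative and
# would need to be replaced with a real SDMX web service endpoint.
sdmx.add_source({
    "id": "STATCAN",
    "name": "Statistics Canada",
    "url": "https://example.org/sdmx/rest",
})

Note that sources registered this way are used for web requests via sdmx.Client; registering one does not by itself change how a local file is parsed.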

As per the sdmx1 developer, StatsCan is using an older, unsupported version of SDMX (v2.0). The current version is 2.1, and sdmx1 only supports that (development is also moving toward the upcoming v3.0).
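If you just need the numbers out of the file, one workaround is to parse the generic XML directly with the standard library. A rough sketch, assuming the usual SDMX-ML 2.0 generic layout (Series/SeriesKey/Obs elements in the v2_0 generic namespace); verify the element names against your file:

import xml.etree.ElementTree as ET
# The message namespace appears in the traceback above; the generic
# namespace is an assumption based on the standard SDMX-ML 2.0 schema.
GEN = '{http://www.SDMX.org/resources/SDMXML/schemas/v2_0/generic}'
tree = ET.parse('Generic_98-401-X2016059.xml')  # large file; consider ET.iterparse
rows = []
for series in tree.getroot().iter(GEN + 'Series'):
    key = {}
    series_key = series.find(GEN + 'SeriesKey')
    if series_key is not None:
        for value in series_key.iter(GEN + 'Value'):
            key[value.get('concept')] = value.get('value')
    for obs in series.iter(GEN + 'Obs'):
        time = obs.find(GEN + 'Time')
        obs_value = obs.find(GEN + 'ObsValue')
        row = dict(key)
        row['time'] = time.text if time is not None else None
        row['value'] = obs_value.get('value') if obs_value is not None else None
        rows.append(row)
print(len(rows), rows[:3])

The dimension names come from the structure file (Structure_98-401-X2016059.xml), so you can map the concept codes in key back to human-readable labels from there.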

Related

Using Python, NLTK, to analyse German text

I am a beginner in Python, currently trying to use NLTK to analyze German text (extracting German nouns and their frequencies from German text) by following this tutorial: https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/
I have faced several issues during the process that I am not able to solve.
When I follow the website and execute the code below:
import random
# corp is the TIGER corpus reader created earlier in the tutorial
tagged_sents = list(corp.tagged_sents())
random.shuffle(tagged_sents)
split_perc = 0.1  # hold out 10% of the sentences for testing
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]
it comes out with this:
Traceback (most recent call last):
File "test2.py", line 7, in <module>
tagged_sents = list(corp.tagged_sents())
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\conll.py", line 130, in tagged_sents
return LazyMap(get_tagged_words, self._grids(fileids))
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\conll.py", line 215, in _grids
return concat(
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\util.py", line 433, in concat
raise ValueError("concat() expects at least one object!")
ValueError: concat() expects at least one object!
Then I tried to fix it by following this solution: https://teamtreehouse.com/community/randomshuffle-crashes-when-passed-a-range-somenums-randomshufflerange5250
altering
tagged_sents = list(corp.tagged_sents())
to
tagged_sents = list(range(5,250))
The ValueError no longer appears, but I don't know what (5,250) means, even though I have read the explanation.
Then I continue with the following step:
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
tagger = ClassifierBasedGermanTagger(train=train_sents)
And it shows:
Traceback (most recent call last):
File "test1.py", line 90, in <module>
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
ModuleNotFoundError: No module named 'ClassifierBasedGermanTagger'
I have already downloaded ClassifierBasedGermanTagger.py and __init__.py and put them in the folder that is linked to VS Code, but I don't know if that is correct. As the passage says:
'Using his Python class ClassifierBasedGermanTagger (which you can download from the github page) we can create a tagger and train it with the data from the TIGER corpus:'
Please help me fix these problems, thanks!
First of all, welcome to StackOverflow! Before posting a question, please make sure that you have done your own research; most of the time that solves the problem.
Secondly, range(start, end) is a very basic Python function that produces the integers from start up to, but not including, end, and I don't think using it the way you did is going to solve the problem. I would suggest using print to see what kind of data is being populated in corp and starting to debug from there. Maybe corp is just empty, and that's why you don't get any tagged_sents.
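For example:

>>> list(range(5, 250))[:3]
[5, 6, 7]
>>> len(range(5, 250))
245

Replacing your tagged sentences with a list of integers makes the error go away only because the list is no longer empty; it does not give you anything you can train a tagger on.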
As for the import part, it is not clear to me where you put ClassifierBasedGermanTagger.py, but wherever it is, it is not visible to your code. You can try putting your code (test2.py) and ClassifierBasedGermanTagger.py in the same directory. Read the link below for more details on how to properly import modules in Python.
https://docs.python.org/3/reference/import.html
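Going back to corp: the tutorial builds it with NLTK's ConllCorpusReader over the TIGER corpus export. If the root directory or filename is wrong, the reader finds no data and tagged_sents() ends up empty, which is consistent with the concat() error above. A sketch following the tutorial, assuming the TIGER file sits in the current directory (adjust the root and filename to match your download):

import nltk
# Reader for the TIGER corpus export; the column layout (word in
# column 2, POS tag in column 5) follows the tutorial's setup.
corp = nltk.corpus.ConllCorpusReader(
    '.',
    'tiger_release_aug07.corrected.16012013.conll09',
    ['ignore', 'words', 'ignore', 'ignore', 'pos'],
    encoding='utf-8',
)
tagged_sents = list(corp.tagged_sents())
print(len(tagged_sents))  # should be well above zero before shuffling/splitting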

How to read the .mat file from Visual SFM in Python Code?

Can someone help me with Python code to read the .mat file generated by Visual SFM? You can download the .mat file from this link:
https://github.com/cvlab-epfl/tf-lift/tree/master/example
The zip at that link contains a .mat file, and that file is what I am asking for help with.
It seems to be an ASCII file, but I do not know how to read the data in it.
I tried to load the data in the .mat file with scipy.io.loadmat(), but an error occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
raise ValueError('Unknown mat file type, version %s, %s' % ret)
ValueError: Unknown mat file type, version 20, 0
Can someone help me load the data in this file with Python? Thank you sincerely for your help and replies.
If you mean this VisualSFM (http://ccwu.me/vsfm/doc.html), then the .mat file isn't a MATLAB .mat file but a 'match' file.
From the website:
[name].sift stores all the detected SIFT features, and [name].mat stores the feature matches.
It seems there is C++ code for reading this file (http://ccwu.me/vsfm/MatchFile.zip), which you could use to write a Python parser.
Additionally, there seems to be a Python socket interface to VSFM, which may allow you to do what you want: https://github.com/nrhine1/vsfm_util
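As a quick sanity check, you can also peek at the file's first bytes: a genuine MATLAB Level 5 .mat file starts with a text header beginning 'MATLAB 5.0 MAT-file', which is one reason scipy rejects this file. A minimal snippet (the filename is a placeholder):

# A real MATLAB v5 file starts with b'MATLAB 5.0 MAT-file';
# a VisualSFM match file will show something else entirely.
with open('example.mat', 'rb') as f:
    print(f.read(19))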

Invalid Signature Exception with .sav file

New to scipy but not to Python. I am trying to import a .sav file into scipy so I can do some basic work on it, but each time I try to import the file using scipy.io.readsav(), Python throws an error:
Traceback (most recent call last):
File "<ipython-input-7-743be643d8a1>", line 1, in <module>
dataset = io.readsav("c:/users/me/desktop/survey.sav")
File "C:\Users\me\Anaconda3\lib\site-packages\scipy\io\idl.py", line 726, in readsav
raise Exception("Invalid SIGNATURE: %s" % signature)
Exception: Invalid SIGNATURE: b'$F'
Any idea what's happening? I can open the file in R and manipulate the data, but I'd like to do it in Python. I'm running Anaconda on Windows.
scipy.io.readsav() reads IDL SAVE files. You have tagged this question spss, so I assume you are trying to read an SPSS file. The format of an SPSS .sav file is not the same as the format of an IDL SAVE file.
Look on PyPI for savReaderWriter for Python code to read and write .sav files.
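A minimal sketch of the documented savReaderWriter usage (pip install savReaderWriter; the path is taken from your traceback):

from savReaderWriter import SavReader
# Iterate over the cases in the SPSS file; with returnHeader=True the
# first row yielded is the list of variable names.
with SavReader('c:/users/me/desktop/survey.sav', returnHeader=True) as reader:
    for record in reader:
        print(record)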

Read .net pajek file using python igraph library

I am trying to load a .net file using python igraph library. Here is the sample code:
import igraph
g = igraph.read("s.net",format="pajek")
But when I tried to run this script, I got the following error:
Traceback (most recent call last):
File "demo.py", line 2, in <module>
g = igraph.read('s.net',format="pajek")
File "C:\Python27\lib\site-packages\igraph\__init__.py", line 3703, in read
return Graph.Read(filename, *args, **kwds)
File "C:\Python27\lib\site-packages\igraph\__init__.py", line 2062, in Read
return reader(f, *args, **kwds)
igraph._igraph.InternalError: Error at .\src\foreign.c:574: Parse error in Pajek file, line 1 (syntax error, unexpected ARCSLINE, expecting VERTICESLINE), Parse error
Kindly provide some hints.
Either your file is not a regular Pajek file, or igraph's Pajek parser is not able to read this particular Pajek file. (Writing a Pajek parser is a bit hit-and-miss, since the Pajek file format has no formal specification.) If you send me your Pajek file via email, I'll take a look at it.
Update: you were missing the *Vertices section of the Pajek file. Adding a line like *Vertices N (where N is the number of vertices in the graph) resolves your problem. I cannot state that this line is mandatory in Pajek files because of the lack of a formal specification for the file format, but all the Pajek files I have seen so far included this line, so I guess it's pretty standard.
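For illustration, here is a minimal Pajek file with the *Vertices line the parser expects (the three vertices and two arcs are made up), written out and read back with igraph:

import igraph
# Minimal Pajek graph: a vertex count line, the vertex list,
# then the arcs section.
pajek = """*Vertices 3
1 "a"
2 "b"
3 "c"
*Arcs
1 2
2 3
"""
with open('s.net', 'w') as f:
    f.write(pajek)
g = igraph.read('s.net', format='pajek')
print(g.summary())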

Errors during downloading data from Google App Engine using bulkloader

I am trying to download some data from the datastore using the following command:
appcfg.py download_data --config_file=bulkloader.yaml --application=myappname --kind=mykindname --filename=myappname_mykindname.csv --url=http://myappname.appspot.com/_ah/remote_api
When I didn't have much data in this particular kind/table, I could download the data in one shot, occasionally running into the following error:
.................................[ERROR ] [Thread-11] ExportProgressThread:
Traceback (most recent call last):
File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkloader.py", line 1448, in run
self.PerformWork()
File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkloader.py", line 2216, in PerformWork
item.key_end)
File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkloader.py", line 2011, in StoreKeys
(STATE_READ, unicode(kind), unicode(key_start), unicode(key_end)))
OperationalError: unable to open database file
This is what I see in the server log:
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/remote_api/handler.py", line 277, in post
response_data = self.ExecuteRequest(request)
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/remote_api/handler.py", line 308, in ExecuteRequest
response_data)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 86, in MakeSyncCall
return stubmap.MakeSyncCall(service, call, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 286, in MakeSyncCall
rpc.CheckSuccess()
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 126, in CheckSuccess
raise self.exception
ApplicationError: ApplicationError: 4 no matching index found.
When that error appeared, I would simply re-run the download and things would work out.
Of late, I am noticing that as the size of my kind increases, the download tool fails much more often. For instance, with a kind of ~3500 entities I had to run the command 5 times, and only the last attempt succeeded. Is there a way around this error? Previously, my only worry was that I wouldn't be able to automate downloads in a script because of the occasional failures; now I am scared I won't be able to get my data out at all.
This issue was discussed previously here, but the post is old and I am not sure what the suggested flag does, hence posting my similar query again.
Some additional details.
As mentioned here, I tried the suggestion to proceed with interrupted downloads (in the section "Downloading Data from App Engine"). When I resume after the interruption, I get no errors, but the number of rows downloaded is less than the entity count the datastore admin shows me. This is the message I get:
[INFO ] Have 3220 entities, 3220 previously transferred
[INFO ] 3220 entities (1003 bytes) transferred in 2.9 seconds
The datastore admin tells me this particular kind has ~4300 entities. Why aren't the remaining entities being downloaded?
Thanks!
I am going to make a completely uneducated guess at this, based solely on the fact that I saw the word "unicode" in the first error: I had an issue that was related to my data being user-generated from the web. A user put in a few unicode characters and a whole load of stuff started breaking - probably my fault, as I had implemented pretty-looking repr functions and a load of other stuff. If you can, take a quick scan of your data via the console utility in your live app; maybe (since it's only ~4k records) try converting all of the data to ASCII strings to find any that don't conform.
After that, I started "sanitising" user inputs (sorry, but my "public handle" field needs to be ASCII-only, players!).
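If you want to automate that scan, here is a rough sketch for the (Python 2) App Engine remote shell; MyKind is a stand-in for your real model, and the sketch flags any string property that fails ASCII encoding:

from google.appengine.ext import db

class MyKind(db.Expando):
    pass  # stand-in so .all() can query the kind by name

for entity in MyKind.all():
    for name in entity.dynamic_properties():
        value = getattr(entity, name)
        if isinstance(value, basestring):
            try:
                value.encode('ascii')
            except UnicodeEncodeError:
                print '%s %s %r' % (entity.key(), name, value)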
