Using Python, NLTK, to analyse German text

I am a beginner in Python and am trying to use NLTK to analyze German text (extracting the German nouns and their frequencies) by following this tutorial: https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/
I ran into several issues along the way that I have not been able to solve.
When I run the following code from the tutorial:
import random
tagged_sents = list(corp.tagged_sents())
random.shuffle(tagged_sents)
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]
I get this error:
Traceback (most recent call last):
File "test2.py", line 7, in <module>
tagged_sents = list(corp.tagged_sents())
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\conll.py", line 130, in tagged_sents
return LazyMap(get_tagged_words, self._grids(fileids))
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\conll.py", line 215, in _grids
return concat(
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\util.py", line 433, in concat
raise ValueError("concat() expects at least one object!")
ValueError: concat() expects at least one object!
I then tried to fix it by following this solution: https://teamtreehouse.com/community/randomshuffle-crashes-when-passed-a-range-somenums-randomshufflerange5250
and changed
tagged_sents = list(corp.tagged_sents())
to
tagged_sents = list(range(5,250))
The ValueError no longer appears, but I don't understand what (5,250) means, even though I have read the explanation.
I then went on to the next step:
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
tagger = ClassifierBasedGermanTagger(train=train_sents)
which fails with:
Traceback (most recent call last):
File "test1.py", line 90, in <module>
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
ModuleNotFoundError: No module named 'ClassifierBasedGermanTagger'
I have already downloaded ClassifierBasedGermanTagger.py and __init__.py and put them in the folder that is open in VS Code. I don't know whether that is correct; the tutorial says:
'Using his Python class ClassifierBasedGermanTagger (which you can download from the github page) we can create a tagger and train it with the data from the TIGER corpus:'
Please help me to fix these problems, thanks!

First of all, welcome to StackOverflow! Before posting a question, please make sure you have done your own research; most of the time that solves the problem.
Secondly, range(start, end) is a basic Python built-in that produces the integers from start up to, but not including, end. Using it the way you did does not solve the problem; it only hides the error, because tagged_sents then holds plain numbers instead of tagged sentences. I suggest using print to see what data corp actually contains and debugging from there. Most likely corp is empty (the corpus reader found no files), and that is why you don't get any tagged_sents.
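To make the point about range concrete, here is a small sketch. The two tagged sentences are made up for illustration; in the real script they would come from corp.tagged_sents(). The split logic is the one from the question.

```python
import random

# range(5, 250) is just the integers 5, 6, ..., 249 (the end value is excluded).
numbers = list(range(5, 250))
print(numbers[:3], numbers[-1], len(numbers))  # [5, 6, 7] 249 245

# What a tagger needs instead: a list of sentences,
# each sentence being a list of (word, tag) pairs.
# These sentences are invented stand-ins for the corpus data.
tagged_sents = [
    [("Das", "ART"), ("Haus", "NN")],
    [("Er", "PPER"), ("liest", "VVFIN")],
] * 10

random.shuffle(tagged_sents)
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]
print(len(train_sents), len(test_sents))
```

The shuffle and split work on any list, which is why the range version did not raise an error; the failure only moves to the point where the tagger tries to read words and tags out of integers.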
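For the import part, it is not clear to me where you put ClassifierBasedGermanTagger.py, but wherever it is, your code cannot see it. You can try putting your script (test2.py) and ClassifierBasedGermanTagger.py in the same directory; the import then becomes from ClassifierBasedGermanTagger import ClassifierBasedGermanTagger. The two-level form used in the tutorial only works if the file sits inside a folder that is itself named ClassifierBasedGermanTagger, next to your script. Read the link below for more details on how imports work in Python.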
https://docs.python.org/3/reference/import.html
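Both checks suggested above can be done with the standard library before NLTK is involved at all. This is only a sketch: corpus_file_status and module_is_importable are hypothetical helper names, and "tiger.conll09" is a placeholder for whatever path was passed to the corpus reader.

```python
import importlib.util
from pathlib import Path


def corpus_file_status(path):
    """Describe whether the file a corpus reader is pointed at is usable."""
    p = Path(path).expanduser()
    if not p.is_file():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    return "ok"


def module_is_importable(name):
    """Return True if `import name` would find a top-level module or package."""
    return importlib.util.find_spec(name) is not None


# "tiger.conll09" is a placeholder; use the path you gave the corpus reader.
print(corpus_file_status("tiger.conll09"))
print(module_is_importable("ClassifierBasedGermanTagger"))
```

If the first line prints "missing", the reader was pointed at a file that is not where the script expects it; if the second prints False, the tagger file is not in the script's directory or anywhere else on sys.path.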

Related

Cannot read Statistics Canada sdmx file

I am trying to read some Canadian census data from Statistics Canada
(the XML option for the "Canada, provinces and territories" geographic level). I see that the xml file is in the SDMX format and that there is a structure file provided, but I cannot figure out how to read the data from the xml file.
It seems there are 2 options in Python, pandasdmx and sdmx1, both of which say they can read local files. When I try
import sdmx
datafile = '~/Documents/Python/Generic_98-401-X2016059.xml'
canada = sdmx.read_sdmx(datafile)
It appears to read the first 903 lines and then produces the following:
Traceback (most recent call last):
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/xml.py", line 238, in read_message
raise NotImplementedError(element.tag, event) from None
NotImplementedError: ('{http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message}GenericData', 'start')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/__init__.py", line 126, in read_sdmx
return reader().read_message(obj, **kwargs)
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/xml.py", line 259, in read_message
raise XMLParseError from exc
sdmx.exceptions.XMLParseError: NotImplementedError: ('{http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message}GenericData', 'start')
Is this happening because I've not loaded the structure of the sdmx file (Structure_98-401-X2016059.xml in the zip file from the StatsCan link above)? If so, how do I go about loading that and telling sdmx to use that when reading datafile?
The documentation for sdmx and pandasdmx only show examples of loading files from online providers and not from local files, so I'm stuck. I have limited familiarity with python so any help is much appreciated.
For reference, I can read the file in R using the instructions from the rsdmx github. I would like to be able to do the same/similar in Python.
Thanks in advance.

From a cursory inspection of the documentation, it seems that Statistics Canada is not one of the sources that is included by default. There is however an sdmx.add_source function. I suggest you try that (before loading the data).

As per the sdmx1 developer, StatsCan is using the older, unsupported SDMX version 2.0. The current version is 2.1, and sdmx1 supports only that (development is also moving towards the upcoming v3).
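The version can be confirmed without any SDMX library, because the namespace of the root element carries it (the traceback above shows .../schemas/v2_0/message). A minimal sketch using only the standard library; sdmx_ml_version is a hypothetical helper name, and the sample document is a one-element stand-in for the real file.

```python
import io
import re
import xml.etree.ElementTree as ET


def sdmx_ml_version(source):
    """Return e.g. '2.0' or '2.1' based on the root element's namespace.

    `source` is a binary file object. Only the first element is inspected,
    so large data files are not read in full.
    """
    for _event, elem in ET.iterparse(source, events=("start",)):
        match = re.search(r"/schemas/v(\d+)_(\d+)/", elem.tag)
        return f"{match.group(1)}.{match.group(2)}" if match else None
    return None


# Stand-in for the StatsCan file: same root element and namespace as in the traceback.
sample = io.BytesIO(
    b'<GenericData xmlns="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message"/>'
)
print(sdmx_ml_version(sample))  # 2.0
```

For the real file, open it in binary mode (open(path, "rb")) and pass the file object in.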

How to get past the AttributeError in Gnu Radio version 3.8

Here is the error I am receiving:
Traceback (most recent call last):
File "/home/awilhelmy5/Downloads/qpsk-adaptive-master/gnuradio/qpsk_usrp.py", line 354, in <module>
main()
File "/home/awilhelmy5/Downloads/qpsk-adaptive-master/gnuradio/qpsk_usrp.py", line 332, in main
tb = top_block_cls()
File "/home/awilhelmy5/Downloads/qpsk-adaptive-master/gnuradio/qpsk_usrp.py", line 148, in __init__
self.mapper_preamble_sync_demapper_hard_0 = mapper.preamble_sync_dehard(0, per_bits, mapper.QPSK, [0,1,3,2], 0, 3, False)
AttributeError: module 'mapper' has no attribute 'preamble_sync_dehard'
The very last line of the error is the part that is bugging me. I have tried a multitude of things: installing swig version 4.0, running sudo ldconfig, reversing the order of the commands in the .h file, renaming the yml file to match the name it had in the xml file, and even experimenting with the target_link_libraries command. Any help would be greatly appreciated!

Is it possible that your code has an error (e.g. accidental deletion of some characters)? I ask because I was not able to find the text "preamble_sync_dehard" anywhere on the web, but there is a class called gr::mapper::preamble_sync_demapper_hard. Perhaps some person or program did an unwise find-and-replace that intended to remove a prefix mapper_ but ended up removing a substring instead. You can check whether this is the problem: run Python and enter
import mapper
dir(mapper)
This will list all the names in the Python module, one of which (if this is the only problem) will be preamble_sync_demapper_hard, or some related spelling which you can then adjust your qpsk_usrp.py to use.
But if it is not there at all, then you may have a bigger problem. From what I have heard, OOT modules generally have to have their structure updated in order to work with GNU Radio 3.8, and the repository I found doesn't seem to have been touched since 2018 (before 3.8 was released in 2019).
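Scanning the output of dir(mapper) by eye is error-prone when the module is large, so the lookup can be automated with difflib. This is a sketch: suggest_names is a hypothetical helper, and the mapper object below is a stand-in built with types.ModuleType so the example runs without GNU Radio; with the real module installed you would pass the imported mapper instead.

```python
import difflib
import types


def suggest_names(module, wanted, n=3):
    """Return public names in `module` that are spelled similarly to `wanted`."""
    public = [name for name in dir(module) if not name.startswith("_")]
    return difflib.get_close_matches(wanted, public, n=n, cutoff=0.6)


# Stand-in for the real OOT module. The first attribute name is the class
# mentioned above; the second is invented for illustration.
mapper = types.ModuleType("mapper")
mapper.preamble_sync_demapper_hard = object()
mapper.made_up_other_block = object()

print(suggest_names(mapper, "preamble_sync_dehard"))
```

An empty result would point to the second case described above: the block is missing from the module altogether, not just misspelled in qpsk_usrp.py.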

Google search issue in Python

I have implemented a Python program that performs a Google search and captures the top ten links from the results. I am using the 'pygoogle' library for the search. The first two or three times I run the program, it gets proper hits and the whole project works fine. But after a certain number of links have been downloaded, it gives the following error (gui_two.py is my program's name):
Exception in Tkinter callback
Traceback (most recent call last):
File "/usr/lib/python2.7/lib-tk/Tkinter.py", line 1413, in __call__
return self.func(*args)
File "gui_two.py", line 113, in action
result = uc.utilcorpus(self.fn1,"")
File "/home/ci/Desktop/work/corpus/corpus.py", line 125, in utilcorpus
for url in g1.get_urls(): #this is key sentence based search loop
File "/home/ci/Desktop/work/corpus/pygoogle.py", line 132, in get_urls
for result in data['responseData']['results']:
TypeError: 'NoneType' object has no attribute '__getitem__'
I know this is one of the most common errors in Python, but I cannot do anything about it since it occurs inside a library. I wonder whether my program is spamming Google, whether I need a custom Google search API, or whether there is some other reason. Please tell me precisely how to perform the search without this issue. I would be very grateful for your help.
Thanks.
Edited: My code is very large, so here is the small piece where the problem arises.
g1 = pygoogle(query)
g1.pages = 1
for url in g1.get_urls():  # error is in this line
    print "URL : ", url
It may work if we simply copy it into a standalone .py file, but if we execute it many times, the program gives the error.

Here's the culprit code from pygoogle.py (from http://pygoogle.googlecode.com/svn/trunk/pygoogle.py)
def get_urls(self):
    """Returns list of result URLs"""
    results = []
    search_results = self.__search__()
    if not search_results:
        self.logger.info('No results returned')
        return results
    for data in search_results:
        if data and data.has_key('responseData') and data['responseData']['results']:
            for result in data['responseData']['results']:
                if result:
                    results.append(urllib.unquote(result['unescapedUrl']))
    return results
Unlike every other place where data['responseData']['results'] is used, they're not both being checked for existence using has_key().
I suspect that your responseData is missing results, hence the for loop fails.
Since you have the source, you can edit this yourself.
Also make an issue for the project - very similar to this one in fact.

I fixed the issue by modifying the source code of the pygoogle.py library. The bug is that the code never checks whether an element actually contains data or is None. The modified code is:
def get_urls(self):
    """Returns list of result URLs"""
    results = []
    for data in self.__search__():
        # the following two lines were added to fix the issue
        if data['responseData'] is None or data['responseData']['results'] is None:
            break
        for result in data['responseData']['results']:
            if result:
                results.append(urllib.unquote(result['unescapedUrl']))
    return results
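For comparison, the same defensive idea can be written as a standalone Python 3 function that does not depend on pygoogle at all. This is a sketch, not the library's code: extract_urls is a hypothetical name, the sample pages are invented, and unlike the fix above it skips an empty page and carries on instead of stopping at the first one.

```python
from urllib.parse import unquote


def extract_urls(pages):
    """Collect result URLs from a list of response dicts, tolerating gaps.

    Any of the page, its 'responseData', or its 'results' may be None or absent.
    """
    urls = []
    for page in pages or []:
        response = (page or {}).get("responseData") or {}
        for result in response.get("results") or []:
            if result and result.get("unescapedUrl"):
                urls.append(unquote(result["unescapedUrl"]))
    return urls


# Invented sample data: one good page, one with responseData set to None
# (the case that triggers the TypeError in the traceback), and one missing page.
pages = [
    {"responseData": {"results": [{"unescapedUrl": "http://example.com/a%20b"}]}},
    {"responseData": None},
    None,
]
print(extract_urls(pages))  # ['http://example.com/a b']
```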

Python package: Bioservices, error using UniChem() command

I was following the tutorial on the webpage:
http://pythonhosted.org/bioservices/compound_tutorial.html
Everything worked well until I reached the following command:
uni = UniChem()
and then I received the error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "P:\Anaconda\lib\site-packages\bioservices\unichem.py", line 84, in __init__
maxid_service = int(self.get_all_src_ids()[-1]['src_id'])
TypeError: list indices must be integers, not str
As a minimum working example:
from bioservices import *
uni = UniChem()
and then I receive the error. I understand the error (for the most part) but I don't know how to fix it. So my question is how do I fix the function or work around it?
The overall aim is to map a list of 1000 drug names (and hopefully more in the near future) to ChEMBL IDs.

The error you saw is probably because the UniChem service was down for maintenance, or took too long to initialize, when you tried to connect. As a result the service was not started, hence the error message you got.
I've just tried (bioservices 1.2.6)
from bioservices import *
uni = UniChem()
and it worked. The following request also worked:
>>> mapping = uni.get_mapping("kegg_ligand", "chembl")
'CHEMBL278315'
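If the cause really is a temporarily unavailable service, one workaround is to retry the constructor a few times before giving up. The sketch below is generic and assumes nothing about bioservices internals: create_with_retry is a hypothetical helper, and FlakyService is a stand-in class that fails twice so the example runs offline.

```python
import time


def create_with_retry(factory, attempts=3, delay=2.0, retry_on=(Exception,)):
    """Call factory() until it succeeds or `attempts` tries have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return factory()
        except retry_on:
            if attempt == attempts:
                raise          # out of tries: re-raise the last error
            time.sleep(delay)  # give the remote service time to recover


class FlakyService:
    """Stand-in for a web-service client whose constructor fails twice."""
    calls = 0

    def __init__(self):
        FlakyService.calls += 1
        if FlakyService.calls < 3:
            raise TypeError("service not ready")


service = create_with_retry(FlakyService, attempts=5, delay=0, retry_on=(TypeError,))
print(FlakyService.calls)  # 3
```

With bioservices installed, the equivalent call would presumably be create_with_retry(UniChem, retry_on=(TypeError,)); if the error persists across retries, the cause is more likely a version mismatch than a service outage.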

How to change the color of single characters in a cell in excel with python win32com?

I have a question regarding the win32com bindings for excel. I set up early bindings and followed some examples from the "Python Programming on Win32" book from O'Reilly.
The following code works fine:
book2.xlApp.Worksheets('Sheet1').Cells(1,1).Font.ColorIndex = 1
book2.xlApp.Worksheets('Sheet1').Cells(1,1).Font.ColorIndex = 2
It changes the font color of the whole cell according to the number.
However this does not work:
book2.xlApp.Worksheets('Sheet1').Cells(1,1).Characters(start,length).Font.ColorIndex = 1
I get the following traceback:
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
AttributeError: Characters instance has no __call__ method
However, in Excel's VBA the code works. Can anybody point me to the solution?
I really need to change parts of a string in an excel cell.
Thank you very much.

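Use GetCharacters instead. With early binding, the generated wrappers expose properties that take arguments, such as Characters, as Get... methods: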
Cells(1,1).GetCharacters(start,length).Font.ColorIndex = 1
