Passing regex to coloumn attribute in happybase scan

Passing regex to coloumn attribute in happybase scan - python

I am trying to pass in a list of regexes to the columns attribute in my happybase scan calls. This is because, my coloumn names are made by dynamically appending ids which i dont have acces to at scan time.
Is this possible?

HappyBase author here.
According to the Thrift API you can pass regular expressions in the columns argument for the ScannerOpen() API family (see http://svn.apache.org/viewvc/hbase/trunk/hbase-thrift/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift?view=markup#l717). However, the Thrift API used by HappyBase is ScannerOpenWithScan(), which uses the TScan struct (see http://svn.apache.org/viewvc/hbase/trunk/hbase-thrift/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift?view=markup#l141), which does not contain any remark about regular expressions. Actually I don't know (without testing) whether this works.
A more flexible and powerful way is to specify a filter string using the filter argument to happybase.Table.scan(). See http://hbase.apache.org/book/thrift.html for the filter string syntax. In your case, something like "ColumnPrefixFilter('theprefix')" should do the trick. See http://happybase.readthedocs.org/en/latest/api.html#happybase.Table.scan for the HappyBase API.

I am not familiar with HBase's syntax. Here is the happybase-python code I used, and it works for me. Thanks to Wouter Bolsterlee!! Not like the 'columns' statement, you don't have to put 'columnFamily' in 'ColumnPrefixFilter'.
import happybase
pool = happybase.ConnectionPool(size=3, host='172.xx.xx.xx')
with pool.connection() as conn1:
hbaseTable = conn1.table('HBase_table_name_here')
for rowKey, rowData in hbaseTable.scan(row_prefix= 'year-2015-', filter="ColumnPrefixFilter('month-06')", limit = 6):
print rowData

Related

Python parsing update statements using regex

I'm trying to find a regex expression in python that will be able to handle most of the UPDATE queries that I throw at if from my DB. I can't use sqlparse or any other libraries that may be useful with for this, I can only use python's built-in modules or cx_Oracle, in case it has a method I'm not aware of that could do something like this.
Most update queries look like this:
UPDATE TABLE_NAME SET COLUMN_NAME=2, OTHER_COLUMN=to_date('31-DEC-202023:59:59','DD-MON-YYYYHH24:MI:SS'), COLUMN_STRING='Hello, thanks for your help', UPDATED_BY=-100 WHERE CODE=9999;
Most update queries I use have a version of these types of updates. The output has to be a list including each separate SQL keyword (UPDATE, SET, WHERE), each separate update statement(i.e COLUMN_NAME=2) and the final identifier (CODE=9999).
Ideally, the result would look something like this:
list = ['UPDATE', 'TABLE_NAME', 'SET', 'COLUMN_NAME=2', 'OTHER_COLUMN=("31-DEC-2020 23:59:59","DD-MON-YYYY HH24:MI:SS")', COLUMN_STRING='Hello, thanks for your help', 'UPDATED_BY=-100', 'WHERE', 'CODE=9999']
Initially I tried doing this using a string.split() splitting on the spaces, but when dealing with one of my slightly more complex queries like the one above, the split method doesn't deal well with string updates such as the one I'm trying to make in COLUMN_STRING or those in OTHER_COLUMN due to the blank spaces in those updates.

Let's use the shlex module :
import shlex
test="UPDATE TABLE_NAME SET COLUMN_NAME=2, OTHER_COLUMN=to_date('31-DEC-202023:59:59','DD-MON-YYYYHH24:MI:SS'), COLUMN_STRING='Hello, thanks for your help', UPDATED_BY=-100 WHERE CODE=9999;"
t=shlex.split(test)
Up to here, we won't get rid of comma delimiters and the last semi one, so maybe we can do this :
for i in t:
if i[-1] in [',',';']:
i=i[:-1]
If we print every element of that list we'll get :
UPDATE
TABLE_NAME
SET
COLUMN_NAME=2
OTHER_COLUMN=to_date(31-DEC-202023:59:59,DD-MON-YYYYHH24:MI:SS)
COLUMN_STRING=Hello, thanks for your help
UPDATED_BY=-100
WHERE
CODE=9999
Not a proper generic answer, but serves the purpose i hope.

Layer between data extraction and storage

What I am doing:
Get data from data source (could be from API or scraping) in form of a dictionary
Clean/manipulate some of the fields
Combine fields from data source dictionary into new dictionaries that represent objects
Save the created dictionaries into database
Is there a pythonic way to do this? I am wondering about the whole process but I'll give some guiding questions:
What classes should I have?
What methods/classes should the cleaning of fields from the data source to objects be in?
What methods/classes should the combining/mapping of fields from the data source to objects be in?
If the method is different in scraping vs. api, please explain how and why
Here is an example:
API returns:
{data: {
name: "<b>asd</b>",
story: "tame",
story2: "adjet"
}
}
What you want to do:
Clean name
Create a name_story object
Set name_story.name = dict['data']['name']
Set name_story.story = dict['data']['story'] + dict['data']['story2']
Save name_story to database
(and consider that there could be multiple objects to create and multiple incoming data sources)
How would you structure this process? An interface of all classes/methods would be enough for me without any explanation.

What classes should I have?
In Python, there is no strong need to use classes. Classes are the way to manage complexity. If your solution is not complex, use functions (or, maybe, module-level code, if it is one-time solution)
If the method is different in scraping vs. api, please explain how and why
I prefer to organize my code in respect with modularity and principle of least knowledge and define clear interfaces between parts of modules system.
Example of modular solution
You can have module (either function or class) for fetching information, and it should return dictionary with specified fields, no matter what exactly it does.
Another module should process dictionary and return dictionary too (for example).
Third module can save information from that dictionary to database.
There is great possibility, that this plan far from what you need or want and you should develop your modules system yourself.
And some words about your wants:
Clean name
Consider this stackoverflow answer
Create a name_story object
Set name_story.name = dict['data']['name']
Set name_story.story = dict['data']['story'] + dict['data']['story2']
If you want to have access to attributes of object through dot (as you specified in 3 and 4 items, you could use either python namedtuple or plain python class. If indexed access is OK for you, use python dictionary.
In case of namedtuple, it will be:
from collections import namedtuple
NameStory = namedtuple('NameStory', ['name', 'story'])
name_story1 = NameStory(name=dict['data']['name'], story=dict['data']['story'] + dict['data']['story2'])
name_story2 = NameStory(name=dict2['data']['name'], story=dict2['data']['name'])
If your choice if dictionary, it's easier:
name_story = {
'name': dict['data']['name'],
'story': dict['data']['story'] + dict['data']['story2'],
}
Save name_story to database
This is much more complex question.
You can use raw SQL. Specific instructions depends on your database. Google for 'python sqlite' or 'python postgresql' or what you want, there are plenty of good tutorials.
Or you can utilize one of python ORMs:
peewee
SQLAlchemy
google for more options
By the way
It's strongly recommended to not override python built-in types (list, dict, str etc), as you did in this line:
name_story.name = dict['data']['name']

Django, SQLite - Accurate ordering of strings with accented letters

Main problem:
I have a Python (3.4) Django (1.6) web app using an SQLite (3) database containing a table of authors. When I get the ordered list of authors some names with accented characters like ’Čapek’ and ’Örkény’ are the end of list instead of at (or directly after) section ’c’ and ’o’ of the list.
My 1st try:
SQLite can accept collation definitions. I searched for one that was made to order UTF-8 strings correctly for example Localized and Unicode collation in Android (Accented Search in sqlite (android)) but found none.
My 2nd try:I found an old closed Django ticket about my problem: https://code.djangoproject.com/ticket/8384 It suggests sorting with Python as workaround. I found it quite unsatisfying. Firstly if I sort with a Python method (like below) instead of ordering at model level I cannot use generic views. Secondly ordering with a Python method returns the very same result as the SQLite order_by does: ’Čapek’ and ’Örkény’ are placed after section 'z'.
author_list = sorted(Author.objects.all(), key=lambda x: (x.lastname, x.firstname))
How could I get the queryset ordered correctly?

Thanks to the link CL wrote in his comment, I managed to overcome the difficulties that I replied about. I answer my question to share the piece of code that worked because using Pyuca to sort querysets seems to be a rare and undocumented case.
# import section
from pyuca import Collator
# Calling Collator() takes some seconds so you should create it as reusable variable.
c = Collator()
# ...
# main part:
author_list = sorted(Author.objects.all(), key=lambda x: (c.sort_key(x.lastname), c.sort_key(x.firstname)))
The point is to use sort_key method with the attribute you want to sort by as argument. You can sort by multiple attributes as you see in the example.
Last words: In my language (Hungarian) we use four different accented version of the Latin letter ‘o’: ‘o’, ’ó’, ’ö’, ’ő’. ‘o’ and ‘ó’ are equal in sorting, and ‘ö’ and ‘ő’ are equal too, and ‘ö’/’ő’ are after ‘o’/’ó’. In the default collation table the four letters are equal. Now I try to find a way to define or find a localized collation table.

You could create a new field in the table, fill it with the result of unidecode, then sort according to it.
Using a property to provide get/set methods could help in keeping the fields in sync.

Output list of links grouped by extension or base URL - built on regex using python.

Working on this assignment for a while now. The regex is not particularly difficult, but I don't quite follow how to get the output they want
Your program should:
Read the html of a webpage (which has been stored as textfile);
Extract all the domains referred to and list all the full http addresses related to these domains;
Extract all the resource types referred to and list all the full http * addresses related to these resource types.
Please solve the task using regular expressions and re functions/methods. I suggest using ‘finditer’ and ‘groups’ (there might be other possibilities). Please do not use string functions where re is better suited."
The output is supposed to look like this
www.fairfaxmedia.co.nz
http://www.fairfaxmedia.co.nz
www.essentialmums.co.nz
http://www.essentialmums.co.nz/
http://www.essentialmums.co.nz/
http://www.essentialmums.co.nz/
www.nzfishingnews.co.nz
http://www.nzfishingnews.co.nz/
www.nzlifeandleisure.co.nz
http://www.nzlifeandleisure.co.nz/
www.weatherzone.co.nz
http://www.weatherzone.co.nz/
www.azdirect.co.nz
http://www.azdirect.co.nz/
i.stuff.co.nz
http://i.stuff.co.nz/
ico
http://static.stuff.co.nz/781/3251781.ico
zip
http://static2.stuff.co.nz/1392867595/static/jwplayer/skin/Modieus.zip
mp4
http://file2.stuff.co.nz/1394587586/272/9819272.mp4
I really need help with how to filter stuff out so the output shows up like that?

create list of tuples (keyword, url)
sort it according to keyword
using itertools.groupby group per keyword
for each keyword, print keyword and then all urls (these to be printed indentend).

Python: Need to replace a series of different substrings in HTML template with additional HTML or database results

Situation:
I am writing a basic templating system in Python/mod_python that reads in a main HTML template and replaces instances of ":value:" throughout the document with additional HTML or db results and then returns it as a view to the user.
I am not trying to replace all instances of 1 substring. Values can vary. There is a finite list of what's acceptable. It is not unlimited. The syntax for the values is [colon]value[colon]. Examples might be ":gallery: , :related: , :comments:". The replacement may be additional static HTML or a call to a function. The functions may vary as well.
Question:
What's the most efficient way to read in the main HTML file and replace the unknown combination of values with their appropriate replacement?
Thanks in advance for any thoughts/solutions,
c

There are dozens of templating options that already exist. Consider genshi, mako, jinja2, django templates, or more.
You'll find that you're reinventing the wheel with little/no benefit.

If you can't use an existing templating system for whatever reason, your problem seems best tackled with regular expressions:
import re
valre = re.compile(r':\w+:')
def dosub(correspvals, correspfuns, lastditch):
def f(value):
v = value.group()[1:-1]
if v in correspvals:
return correspvals[v]
if v in correspfuns:
return correspfuns[v]() # or whatever args you need
# what if a value has neither a corresponding value to
# substitute, NOR a function to call? Whatever...:
return lastditch(v)
return f
replacer = dosub(adict, another, somefun)
thehtml = valre.sub(replacer, thehtml)
Basically you'll need two dictionaries (one mapping values to corresponding values, another mapping values to corresponding functions to be called) and a function to be called as a last-ditch attempt for values that can't be found in either dictionary; the code above shows you how to put these things together (I'm using a closure, a class would of course do just as well) and how to apply them for the required replacement task.

This is probably a job for a templating engine and for Python there are a number of choices. In this stackoveflow question people have listed their favourites and some helpfully explain why: What is your single favorite Python templating engine?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Passing regex to coloumn attribute in happybase scan - python

I am trying to pass in a list of regexes to the columns attribute in my happybase scan calls. This is because, my coloumn names are made by dynamically appending ids which i dont have acces to at scan time. Is this possible?

Related

Python parsing update statements using regex

Layer between data extraction and storage

Django, SQLite - Accurate ordering of strings with accented letters

Output list of links grouped by extension or base URL - built on regex using python.

Python: Need to replace a series of different substrings in HTML template with additional HTML or database results

Categories

Resources