Using CombinePerKey in Google Cloud Dataflow Python - python

I'm trying to run a simple Dataflow Python pipeline that gets certain user events from BigQuery and produces a per-user event count.
# The old Dataflow Python SDK, typically imported as: import google.cloud.dataflow as df
p = df.Pipeline(argv=pipeline_args)
result_query = "..."
data = p | df.io.Read(df.io.BigQuerySource(query=result_query))
user_events = data | df.Map(lambda x: (x['users_user_id'], 1))
user_event_counts = user_events | df.CombinePerKey(sum)
Running this gives me an error:
TypeError: Expected tuple, got int [while running 'Map(<lambda at user_stats.py:...>)']
Data before the CombinePerKey transform is in this form:
(u'55107178236374', 1)
(u'55107178236374', 1)
(u'55107178236374', 1)
(u'2296845644499670', 1)
(u'2296845644499670', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)
If I instead calculate user_event_counts with this:
user_event_counts = (user_events|df.GroupByKey()|
df.Map('count', lambda (user, ones): (user, sum(ones))))
then there are no errors and I get the result I expect.
Based on the docs I would have expected similar behaviour from both approaches. I'm obviously missing something with respect to CombinePerKey, but I can't see what it is. Any tips appreciated!

I am guessing you are running a version of the SDK lower than 0.2.4.
This is a bug in how we handle combining operations in some scenarios. The issue is fixed with the latest release of the SDK (v0.2.4): https://github.com/GoogleCloudPlatform/DataflowPythonSDK/releases/tag/v0.2.4
Sorry about that. Let us know if you still experience the issue with the latest release.
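For anyone hitting this today: the per-key count from the question works as expected with CombinePerKey in the current Apache Beam Python SDK (the successor of the Dataflow Python SDK). A minimal sketch, using an in-memory source in place of the BigQuery query:

import apache_beam as beam

with beam.Pipeline() as p:
    user_event_counts = (
        p
        # In-memory stand-in for the BigQuery rows from the question.
        | beam.Create([{'users_user_id': '55107178236374'},
                       {'users_user_id': '55107178236374'},
                       {'users_user_id': '1489727796186326'}])
        | beam.Map(lambda x: (x['users_user_id'], 1))
        | beam.CombinePerKey(sum)
        | beam.Map(print))  # -> ('55107178236374', 2), ('1489727796186326', 1)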

Related

Django: too many values to unpack (expected 2) while making dynamic Q OR model query

I am currently trying to dynamically generate a Django OR query using Q objects. I have written some code to do so based on approaches I've read others having success with, but I'm not having any success with my implementation. My code is as follows:
query = reduce(or_, (Q(target[search_class] + '__icontains=' + keyword) for search_class in range(2, len(target))))
model.objects.filter(query) # Error happens while making the query itself (too many values to unpack (expected 2))
This is the simplest implementation of this method that I could find. Any help solving this issue would be greatly appreciated. Thanks in advance!
If anyone is having a similar problem, after tons of tinkering, I was able to fix this issue using the following code:
matching = []
matching.extend(model.filter(reduce(or_, (Q(**{target[search_class] + '__icontains': keyword}) for search_class in range(3, len(target))))))
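Spread over a few lines, the working expression is easier to read. A sketch (model, target, and keyword are the placeholders from the question; range(3, len(target)) matches the answer above):

from functools import reduce
from operator import or_

from django.db.models import Q

# Build one Q(<field>__icontains=keyword) per field name in target[3:], then
# OR them together. The keyword-argument dict is the key point: passing a
# single string positionally, as in Q('field__icontains=' + keyword), is what
# triggers "too many values to unpack (expected 2)".
filters = (Q(**{target[search_class] + '__icontains': keyword})
           for search_class in range(3, len(target)))
matching = list(model.objects.filter(reduce(or_, filters)))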

Wait.On() on Apache Beam Python SDK version

I am using Apache Beam with Python and would like to ask: what is the equivalent of Apache Beam Java's Wait.on() in the Python SDK?
Currently I am having a problem with the code snippet below:
if len(output_pcoll) > 1:
    merged = (tuple(output_pcoll) |
              'MergePCollections1' >> beam.Flatten())
else:
    merged = output_pcoll[0]

outlier_side_input = self.construct_outlier_side_input(merged)

(merged |
 "RemoveOutlier" >>
 beam.ParDo(utils.Remove_Outliers(),
            beam.pvalue.AsDict(outlier_side_input)) |
 "WriteToCSV" >>
 beam.io.WriteToText('../../ML-DATA/{0}.{1}'.format(self.BUCKET, self.OUTPUT),
                     num_shards=1))
It seems Apache Beam does not wait until the code in self.construct_outlier_side_input has finished executing, which results in an empty side input when executing "RemoveOutlier" in the next step of the pipeline. In the Java version you can use Wait.on() to wait for construct_outlier_side_input to finish executing; however, I could not find the equivalent method in the Python SDK.
--Edit--
What I am trying to achieve is almost the same as in this link:
https://rmannibucau.metawerx.net/post/apache-beam-initialization-destruction-task
You can use the additional outputs feature of Beam to do this.
A sample code snippet is as follows:
results = (words | beam.ParDo(ProcessWords(), cutoff_length=2, marker='x')
                   .with_outputs('above_cutoff_lengths', 'marked strings',
                                 main='below_cutoff_strings'))

below = results.below_cutoff_strings
above = results.above_cutoff_lengths
marked = results['marked strings']  # indexing works as well
Once you run the above code snippet you get multiple PCollections such as below, above, and marked. You can then use side inputs to further filter or join the results, as sketched below.
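Any of these PCollections (or any other PCollection) can be fed to a downstream step as a side input via beam.pvalue.AsSingleton / AsDict / AsIter; the runner will not run the consuming step for a window until that side input has been fully computed, which is the closest Python analogue to Java's Wait.on(). A self-contained sketch (all names here are made up):

import apache_beam as beam

with beam.Pipeline() as p:
    main = p | 'Main' >> beam.Create([1, 2, 3, 40, 50])
    # The side input is a separately computed PCollection; the runner
    # materializes it before invoking the Filter below, which is the
    # waiting behaviour Wait.on() gives you explicitly in Java.
    mean = (p | 'Side' >> beam.Create([1, 2, 3, 40, 50])
              | beam.combiners.Mean.Globally())
    (main
     | beam.Filter(lambda x, m: x <= m, m=beam.pvalue.AsSingleton(mean))
     | beam.Map(print))  # -> 1, 2, 3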
Hope that helps.
Update
Based on the comments, I would like to mention that Apache Beam has capabilities to do stateful processing with the help of ValueState and BagState. If the requirement is to read through a PCollection and then make decisions based on whether a prior value is present or not, then such requirements can be handled through BagState, as shown below:
def process(self,
            element,
            timestamp=beam.DoFn.TimestampParam,
            window=beam.DoFn.WindowParam,
            buffer_1=beam.DoFn.StateParam(BUFFER_STATE_1),
            buffer_2=beam.DoFn.StateParam(BUFFER_STATE_2),
            watermark_timer=beam.DoFn.TimerParam(WATERMARK_TIMER)):
    # Do your processing here
    key, value = element
    # Read all the data from buffer_1
    all_values_in_buffer_1 = [x for x in buffer_1.read()]

    if StatefulDoFn._is_clear_buffer_1_required(all_values_in_buffer_1):
        # Clear the buffer data if the required conditions are met.
        buffer_1.clear()

    # Add the value to buffer_2
    buffer_2.add(value)

    if StatefulDoFn._all_condition_met():
        # Clear the timer if a certain condition is met and you don't want
        # to trigger the callback method.
        watermark_timer.clear()

    yield element

@on_timer(WATERMARK_TIMER)
def on_expiry_1(self,
                timestamp=beam.DoFn.TimestampParam,
                window=beam.DoFn.WindowParam,
                key=beam.DoFn.KeyParam,
                buffer_1=beam.DoFn.StateParam(BUFFER_STATE_1),
                buffer_2=beam.DoFn.StateParam(BUFFER_STATE_2)):
    # Window and key parameters are really useful, especially for debugging issues.
    yield 'expired1'
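For context, the snippet above references BUFFER_STATE_1, BUFFER_STATE_2, WATERMARK_TIMER, and the on_timer decorator without showing their declarations; they would typically be class-level specs on the stateful DoFn, roughly like this (a sketch; VarIntCoder is just an assumed element coder):

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer

class StatefulDoFn(beam.DoFn):
    # Bag-state cells and a watermark timer; these are the objects the
    # process() signature above refers to via beam.DoFn.StateParam and
    # beam.DoFn.TimerParam.
    BUFFER_STATE_1 = BagStateSpec('buffer_1', VarIntCoder())
    BUFFER_STATE_2 = BagStateSpec('buffer_2', VarIntCoder())
    WATERMARK_TIMER = TimerSpec('watermark_timer', TimeDomain.WATERMARK)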

How to use iterview function from python couchdb

I have been working with the couchdb module in Python to meet some project needs. I was happily using the view method from couchdb to retrieve result sets from my database until recently:
for row in db.view(mapping_function):
print row.key
However, lately I have been needing to work with databases a lot bigger than before (~15-20 GB). This is when I ran into an unfortunate issue.
The db.view() method loads all rows into memory before you can do anything with them. This is not an issue with small databases, but it is a big problem with large ones.
That is when I came across the iterview function. It looks promising, but I couldn't find an example usage of it. Can someone share or point me to an example usage of the iterview function in python-couchdb?
Thanks - A
Doing this almost works for me:
import couchdb.client

server = couchdb.client.Server()
db = server['db_name']

for row in db.iterview('my_view', 10, group=True):
    print row.key + ': ' + row.value
I say it almost works because it does return all of the data and all the rows are printed. However, at the end of the batch, it throws a KeyError exception inside couchdb/client.py (line 884) in iterview
This worked for me. You need to add include_docs=True to the iterview call, and then you will get a doc attribute on each row which can be passed to the database delete method:
import couchdb

server = couchdb.Server("http://127.0.0.1:5984")
db = server['your_db_name']  # the database, not the view

for row in db.iterview('your_view/your_view', 10, include_docs=True):
    # print(type(row))
    # print(type(row.doc))
    # print(dir(row))
    # print(row.id)
    # print(row.keys())
    db.delete(row.doc)

Azure Streaming Analytics input/output

I implemented a very simple Streaming Analytics query:
SELECT
Collect()
FROM
Input TIMESTAMP BY ts
GROUP BY
TumblingWindow(second, 3)
I produce on an event hub input with a python script:
...
iso_ts = datetime.fromtimestamp(ts).isoformat()
data = dict(ts=iso_ts, value=value)
msg = json.dumps(data, encoding='utf-8')
# bus_service is a ServiceBusService instance
bus_service.send_event(HUB_NAME, msg)
...
I consume from a queue:
...
while True:
    msg = bus_service.receive_queue_message(Q_NAME, peek_lock=False)
    print msg.body
...
The problem is that I cannot see any error from any point in the Azure portal (the input and the output are tested and are ok), but I cannot get any output from my running process!
I share a picture of the diagnostic while the query is running:
Can somebody give me an idea for where to start troubleshooting?
Thank you so much!
UPDATE
Ok, I guess I isolated the problem.
First of all, the query format should be like this:
SELECT
Collect()
INTO
[output-alias]
FROM
[input-alias] TIMESTAMP BY ts
GROUP BY
TumblingWindow(second, 3)
I tried to remove the TIMESTAMP BY clause and everything goes well; so, I guess that the problem is with that clause.
I paste an example of JSON-serialized input data:
{
"ts": "1970-01-01 01:01:17",
"value": "foo"
}
One could argue that the timestamp is too old (the seventies), but I also tried with current timestamps and I didn't get any output or any error on the input.
Can somebody imagine what is going wrong? Thank you!
I discovered that my question was a duplicate of Basic query with TIMESTAMP by not producing output.
So, the solution is that you cannot use data from the seventies, because Streaming Analytics will consider all the tuples late and drop them.
I retried producing in-time tuples and, after a long latency, I could see the output.
Thanks to everybody!
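For reference, a minimal sketch of the producer emitting a current timestamp so that TIMESTAMP BY does not classify every tuple as late (HUB_NAME and bus_service are the same objects as in the question's snippet):

import json
from datetime import datetime

# Stamp the event with "now" (UTC, ISO 8601). Events stamped far in the past,
# such as 1970, fall outside the late-arrival tolerance and are silently dropped.
data = dict(ts=datetime.utcnow().isoformat(), value='foo')
bus_service.send_event(HUB_NAME, json.dumps(data))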
Can you check the Service Bus queue in the Azure portal for the number of messages received?

PyODBC Cursor.fetchall() causes python to crash (segfault)

I am using Python 2.7 on Windows XP.
I have a simple python script on a schedule that uses pyodbc to grab data from an AR database which has worked perfectly until today. I get a segfault once the cursor reaches a particular row. I have similar code in C++ which has no problem retrieving the results, so I figure this is an issue with pyodbc. Either way, I'd like to "catch" this error. I've tried to use the subprocess module, but it doesn't seem to work since once the script hits a segfault it just hangs on the "python.exe has encountered a problem and needs to close." message. I guess I could set some arbitrary time frame for it to complete in and, if it doesn't, force close the process, but that seems kind of lame.
I have reported the issue here as well - http://code.google.com/p/pyodbc/issues/detail?id=278
@paulsm4 - I have answered your questions below, thanks!
Q: You're on Windows/XP (32-bit, I imagine), Python 2.7, and BMC
Remedy AR. Correct?
A: Yes, it fails on Win XP 32 bit and Win Server 2008 R2 64 bit.
Q: Is there any chance you (or perhaps your client, if they purchased
Remedy AR) can open a support call with BMC?
A: Probably not...
Q: Can you isolate which column causes the segfault? "What's
different" when the segfault occurs?
A: Just this particular row...but I have now isolated the issue with your suggestions below. I used a loop to fetch each field until a segfault occurred.
cursor.columns(table="mytable")
result = cursor.fetchall()
columns = [x[3] for x in result]

for x in columns:
    print x
    cursor.execute("""select "{0}"
                      from "mytable"
                      where id = 'abc123'""".format(x))
    cursor.fetchall()
Once I identified the column that causes the segfault I tried a query for all columns EXCEPT that one and sure enough it worked no problem.
The column's data type was CHAR(1024). I used C++ to grab the data and noticed that the column for that row had the most characters in it out of any other row...1023! This makes me think there is a buffer in the C code for PyODBC that is getting written beyond its boundaries.
2) Enable ODBC tracing: http://support.microsoft.com/kb/274551
3) Post back the results (including the log trace of the failure)
Ok, I have created a pastebin with the results of the ODBC trace - http://pastebin.com/6gt95rB8. To protect the innocent, I have masked some string values.
Looks like it may have been due to data truncation.
Does this give us enough info as to how to fix the issue? I'm thinking it's a bug within PyODBC since using the C ODBC API directly works fine.
Update
So I compiled PyODBC for debugging and I got an interesting message -
Run-Time Check Failure #2 - Stack around the variable 'tempBuffer' was corrupted.
While I don't currently understand it, the call stack is as follows -
pyodbc.pyd!GetDataString(Cursor * cur=0x00e47100, int iCol=0) Line 410 + 0xf bytes C++
pyodbc.pyd!GetData(Cursor * cur=0x00e47100, int iCol=0) Line 697 + 0xd bytes C++
pyodbc.pyd!Cursor_fetch(Cursor * cur=0x00e47100) Line 1032 + 0xd bytes C++
pyodbc.pyd!Cursor_fetchlist(Cursor * cur=0x00e47100, int max=-1) Line 1063 + 0x9 bytes C++
pyodbc.pyd!Cursor_fetchall(_object * self=0x00e47100, _object * args=0x00000000) Line 1142 + 0xb bytes C++
Resolved!
Problem was solved by ensuring that the buffer had enough space.
In getdata.cpp on line 330
char tempBuffer[1024];
Was changed to
char tempBuffer[1025];
Compiled and replaced the old pyodbc.pyd file in site-packages and we're all good!
Thanks for your help!
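As a quick sanity check after dropping the rebuilt pyodbc.pyd into site-packages, re-running the previously failing fetch should return the 1023-character value instead of crashing. A sketch (the DSN, table, and column names are placeholders):

import pyodbc

conn = pyodbc.connect('DSN=my_ar_dsn')  # placeholder for the BMC Remedy AR DSN
cursor = conn.cursor()
cursor.execute("""select "the_char_1024_column"
                  from "mytable"
                  where id = 'abc123'""")
print(cursor.fetchall())  # previously segfaulted inside Cursor_fetchall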
Q: You're on Windows/XP (32-bit, I imagine), Python 2.7, and BMC Remedy AR. Correct?
Q: Is there any chance you (or perhaps your client, if they purchased Remedy AR) can open a support call with BMC?
Q: Can you isolate which column causes the segfault? "What's different" when the segfault occurs?
Please do the following:
1) Try different "select a,b,c" statements using Python/ODBC to see if you can reproduce the problem (independent of your program) and isolate a specific column (or, ideally, a specific column and row!)
2) Enable ODBC tracing:
http://support.microsoft.com/kb/274551
3) Post back the results (including the log trace of the failure)
4) If that doesn't work - and if you can't get BMC Technical Support involved - then Plan B might be to debug at the ODBC library level:
How to debug C extensions for Python on Windows
Q: What C/C++ compiler would work best for you?
For anyone else who might get this error, check the data type of the value returned. In my case, it was a datetime column. I used SELECT CONVERT(varchar, getdate(), 20) FROM xxx (or any of the CONVERT formats) to get the desired result.
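For example, with pyodbc the conversion can be pushed into the query itself so the driver only ever sees a varchar. A sketch (connection string, table, and column names are made up; style 20 is 'yyyy-mm-dd hh:mi:ss'):

import pyodbc

conn = pyodbc.connect('DSN=my_dsn')
cursor = conn.cursor()
# Cast the datetime column to varchar on the server side so the client
# never has to marshal the datetime type that was causing trouble.
cursor.execute("SELECT CONVERT(varchar(19), my_datetime_col, 20) FROM my_table")
rows = cursor.fetchall()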
