Sample MapReduce script in Python for Hive produces exception

I am learning Hive. I have set up a table named records with the following schema:
year : string
temperature : int
quality : int
Here are some sample rows:
1999 28 3
2000 28 3
2001 30 2
Now I wrote a sample MapReduce script in Python exactly as specified in the book Hadoop: The Definitive Guide:
import re
import sys

for line in sys.stdin:
    (year, tmp, q) = line.strip().split()
    if tmp != '9999' and re.match("[01459]", q):
        print "%s\t%s" % (year, tmp)
I run this using the following Hive commands:
ADD FILE /usr/local/hadoop/programs/sample_mapreduce.py;
SELECT TRANSFORM(year, temperature, quality)
USING 'sample_mapreduce.py'
AS year, temperature
FROM records;
Execution fails. On the terminal I get this:
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2012-08-23 18:30:28,506 Stage-1 map = 0%, reduce = 0%
2012-08-23 18:30:59,647 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201208231754_0005 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201208231754_0005_m_000002 (and more) from job job_201208231754_0005
Exception in thread "Thread-103" java.lang.RuntimeException: Error while reading from task log url
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://master:50060/tasklog?taskid=attempt_201208231754_0005_m_000000_2&start=-8193
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
at java.net.URL.openStream(URL.java:1010)
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
... 3 more
I went to the failed job list and this is the stack trace:
java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:226)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hit error while closing ..
at org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:452)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:193)
... 8 more
The same trace is repeated three more times.
Can someone please help me with this? What is wrong here? I am following the book exactly. There seem to be two errors: on the terminal it says that it can't read from the task log URL, while in the failed job list the exception says something different. Please help.

I went to the stderr log from the Hadoop admin interface and saw that there was a syntax error from Python. Then I found that the field delimiter was a tab when I created the Hive table, but I hadn't specified it in split(). So I changed it to split('\t') and it worked fine!
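For reference, the corrected mapper (the same script as above, with the tab delimiter passed explicitly to split()):

import re
import sys

for line in sys.stdin:
    # split on the tab field delimiter instead of arbitrary whitespace
    (year, tmp, q) = line.strip().split('\t')
    if tmp != '9999' and re.match("[01459]", q):
        print "%s\t%s" % (year, tmp)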

Just use DESCRIBE FORMATTED <table name>; near the bottom of the output you'll find 'Storage Desc Params:', which describes any delimiters used.

Related

Where can I find a detailed description of Snowflake error 255005?

The answer I'm looking for is a reference to documentation. I'm still debugging, but can't find a reference for the error code.
The full error message is:
snowflake.connector.errors.OperationalError: 255005: Failed to read next arrow batch: b'Array length did not match record batch length'
Some background, if it helps:
The error is in response to the call to fetchall, as shown here (Python):
cs.execute(f'SELECT {specific_column} FROM {table};')
all_starts = cs.fetchall()
The code context: when running from cron (i.e., as a timed job), the connection succeeds, then, looping over a list of tables, the error occurs on the third pass (i.e., two tables are "successful"). When the same script is run at other times (from the command line, not cron), there's no error and all tables are "successful".
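For context, a minimal sketch of the pattern described above; the connection parameters, table names, and column name are hypothetical placeholders, not details from the original post:

import snowflake.connector

# hypothetical connection details
con = snowflake.connector.connect(user='me', password='...', account='my_account')
cs = con.cursor()

tables = ['table_a', 'table_b', 'table_c']   # error reportedly appears on the third table
specific_column = 'start_time'               # hypothetical column name

for table in tables:
    cs.execute(f'SELECT {specific_column} FROM {table};')
    all_starts = cs.fetchall()   # OperationalError 255005 is raised here when run from cron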

How to efficiently join and groupBy a PySpark dataframe?

I am working with a PySpark dataframe. I have approximately 70 GB of JSON files, which are basically emails. The aim is to perform TF-IDF on the body of the emails. First, I converted the record-oriented JSON and put it on HDFS. For the TF-IDF implementation, I did some data cleaning using Spark NLP, followed by computing tf, idf and tf-idf, which involves several groupBy() and join() operations. The implementation works just fine with a small sample dataset, but when I run it with the entire dataset I get the following error:
Py4JJavaError: An error occurred while calling o506.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 107 in stage 7.0 failed 4 times, most recent failure: Lost task 107.3 in stage 7.0 : ExecutorLostFailure (executor 72 exited caused by one of the running tasks) Reason: Container from a bad node: container_1616585444828_0075_01_000082 on host: Exit status: 143. Diagnostics: [2021-04-08 01:02:20.700]Container killed on request. Exit code is 143
[2021-04-08 01:02:20.701]Container exited with a non-zero exit code 143.
[2021-04-08 01:02:20.702]Killed by external signal
Sample code:
df_1 = df.select(df.id, explode(df.final).alias("words"))
df_1 = df_1.filter(length(df_1.words) > '3')
df_2 = df_1.groupBy("id", "words").agg(count("id").alias("count_id"))
I get the error on the third step, i.e. df_2 = df_1.groupBy("id","words").agg(count("id").alias("count_id")).
Things I have tried:
set("spark.sql.broadcastTimeout", "36000")
df_1.coalesce(20) before groupBy()
checked for null values -> no null values in any df
Nothing has worked so far. Since I am very new to PySpark, I would really appreciate some help on how to make the implementation more efficient and faster.
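For illustration, a minimal sketch of the kind of groupBy/join-based TF-IDF computation described above, assuming a DataFrame df with columns id and final as in the sample code (names beyond that are hypothetical):

from pyspark.sql import functions as F

# one row per (document id, word)
tokens = df.select(df.id, F.explode(df.final).alias("words"))
tokens = tokens.filter(F.length(tokens.words) > 3)

# term frequency: occurrences of each word in each document
tf = tokens.groupBy("id", "words").agg(F.count("id").alias("tf"))

# document frequency: number of distinct documents containing each word
doc_freq = tokens.groupBy("words").agg(F.countDistinct("id").alias("df"))

# inverse document frequency over the total number of documents
n_docs = tokens.select("id").distinct().count()
idf = doc_freq.withColumn("idf", F.log((F.lit(n_docs) + 1) / (F.col("df") + 1)))

# tf-idf via a join on the word column
tfidf = tf.join(idf, on="words", how="inner") \
          .withColumn("tf_idf", F.col("tf") * F.col("idf"))

Repartitioning tokens by "words" before the aggregations (tokens.repartition("words")) is one common way to spread a skewed groupBy across executors; exit code 143 in the diagnostics above means the container was killed externally, which is often a sign an executor exceeded its memory limits.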

Error message: Protocol wrong type for socket with py2neo

I am using py2neo to query a set of data from a Neo4j graph database and to create relationships between nodes once the appropriate information has been obtained within a Python script.
Here is a basic structure of the script.
from py2neo import Graph
### get a set of data from the graph database
graph = Graph()
query = graph.cypher.execute("MATCH (p:Label1)-[:relationship]->(c:Label2 {Name:'property'}) \
                              RETURN p.Id, p.value1, c.value2 LIMIT 100")
### do a data analysis within a python script here
~ ~ ~
### update newly found information through the analysis to the graph database
tx = graph.cypher.begin()
qs1 = "UNWIND range(0,{size}-1) AS r \
MATCH (p1:Label1 {Id:{xxxx}[r]}),(p2:Label2 {Id:{yyyy}[r]}) \
MERGE (p1)-[:relationship {property:{value}[r]]->(p2)"
tx.append(qs1, parameters = {"size":size, \
"xxxx":xxxx, \
"yyyy":yyyy, \
"value":value})
tx.commit()
The script performs as it is supposed to when the query results are limited to 100, but when I increase the limit to 200 or above, the program crashes, leaving the following error message:
--> 263 tx.commit()
SocketError: Protocol wrong type for socket
Unfortunately, besides the above statement, there is no other useful information hinting at what the problem might be. Has anyone had this sort of problem, and could you suggest what the underlying issue may be?
Thank you.
I haven't looked into this in much depth.
I ran into the same error when using:
tx = graph.cypher.begin()
tx.append("my cypher statement, i.e. MATCH STUFF")
tx.process()
....
tx.commit()
I was appending 1000 statements between each process call, and every 2000 statements I'd commit.
When I reduced the number of statements between process and commit to 250/500 it worked fine (similar to you limiting the results to 100).
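For illustration, a minimal sketch of that batching pattern using the same legacy py2neo transaction API (the Cypher statement, data, and batch sizes here are hypothetical):

from py2neo import Graph

graph = Graph()
tx = graph.cypher.begin()

statement = "MERGE (n:Thing {Id:{id}})"   # hypothetical Cypher statement
ids = range(10000)                        # hypothetical data

for i, record_id in enumerate(ids, start=1):
    tx.append(statement, parameters={"id": record_id})
    if i % 250 == 0:       # send accumulated statements to the server in small batches
        tx.process()
    if i % 500 == 0:       # commit periodically and start a fresh transaction
        tx.commit()
        tx = graph.cypher.begin()

tx.commit()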
The actual error in the stack trace is from python's http/client:
File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 888, in send
self.sock.sendall(data)
OSError: [Errno 41] Protocol wrong type for socket
This is an EPROTOTYPE error, so it's happening on the client side and not in Neo4j.
Are you on OS X? (I am.) This looks like it could be the issue:
Chasing an EPROTOTYPE Through Rust, Sendto, and the OSX Kernel With C-Reduce. I don't know enough about underlying socket behaviour to tell you why reducing the amount we're sending over the connection removes the race condition.

Python for Keithley

I hooked up the Keithley 2701 DMM, installed the software and set the IPs right. I can access and control the instrument via the Internet Explorer web page and the Keithley communicator. When I try to use Python, it detects the instrument, i.e. a = visa.instrument("COM1") doesn't give an error.
I can write to the instrument as well:
a.write("*RST")
a.write("DISP:ENAB ON/OFF")
a.write("DISP:TEXT:STAT ON/OFF")
etc. None of these give any error, but no change is seen on the instrument screen.
However, when I try to read back, a.ask("*IDN?") etc. gives me an error saying the timeout expired before the operation completed.
I tried redefining as:
a=visa.instrument("COM1",timeout=None)
a=visa.instrument("TCPIP::<the IP adress>::1354::SOCKET")
and a few other possible combinations but I'm getting the same error.
Please do help.
The issue with communicating to the 2701 might be an invalid termination character. By default the termination character has the value CR+LF, which is "\r\n".
The Python code to set the termination character is:
theInstrument = visa.instrument("TCPIP::<IPaddress>::1394::SOCKET", term_chars = "\n")
or
theInstrument = visa.instrument("TCPIP::<IPaddress>::1394::SOCKET")
theInstrument.term_chars = "\n"
I hope this helps,
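A quick sketch putting that together with the query from the question, assuming the same legacy PyVISA visa.instrument() API (the <IPaddress> is a placeholder):

import visa

# open the instrument over a raw socket, terminating messages on "\n"
keithley = visa.instrument("TCPIP::<IPaddress>::1394::SOCKET", term_chars="\n")

keithley.write("*RST")
idn = keithley.ask("*IDN?")   # should now return the identification string instead of timing out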

PyODBC Cursor.fetchall() causes python to crash (segfault)

I am using Python 2.7 on Windows XP.
I have a simple python script on a schedule that uses pyodbc to grab data from an AR database which has worked perfectly until today. I get a segfault once the cursor reaches a particular row. I have similar code in C++ which has no problem retrieving the results, so I figure this is an issue with pyodbc. Either way, I'd like to "catch" this error. I've tried to use the subprocess module, but it doesn't seem to work since once the script hits a segfault it just hangs on the "python.exe has encountered a problem and needs to close." message. I guess I could set some arbitrary time frame for it to complete in and, if it doesn't, force close the process, but that seems kind of lame.
I have reported the issue here as well - http://code.google.com/p/pyodbc/issues/detail?id=278
#paulsm4 - I have answered your questions below, thanks!
Q: You're on Windows/XP (32-bit, I imagine), Python 2.7, and BMC Remedy AR. Correct?
A: Yes, it fails on Win XP 32 bit and Win Server 2008 R2 64 bit.
Q: Is there any chance you (or perhaps your client, if they purchased Remedy AR) can open a support call with BMC?
A: Probably not...
Q: Can you isolate which column causes the segfault? "What's different" when the segfault occurs?
A: Just this particular row... but I have now isolated the issue with your suggestions below. I used a loop to fetch each field until a segfault occurred.
cursor.columns(table="mytable")
result = cursor.fetchall()
columns = [x[3] for x in result]
for x in columns:
    print x
    cursor.execute("""select "{0}"
                      from "mytable"
                      where id = 'abc123'""".format(x))
    cursor.fetchall()
Once I identified the column that causes the segfault, I tried a query for all columns EXCEPT that one and sure enough it worked with no problem.
The column's data type was CHAR(1024). I used C++ to grab the data and noticed that the column for that row had the most characters in it of any row: 1023! I'm thinking that maybe there is a buffer in the C code for pyodbc that is getting written beyond its boundaries.
2) Enable ODBC tracing: http://support.microsoft.com/kb/274551
3) Post back the results (including the log trace of the failure)
Ok, I have created a pastebin with the results of the ODBC trace - http://pastebin.com/6gt95rB8. To protect the innocent, I have masked some string values.
Looks like it may have been due to data truncation.
Does this give us enough info as to how to fix the issue? I'm thinking it's a bug within PyODBC since using the C ODBC API directly works fine.
Update
So I compiled PyODBC for debugging and I got an interesting message -
Run-Time Check Failure #2 - Stack around the variable 'tempBuffer' was corrupted.
While I don't currently understand it, the call stack is as follows -
pyodbc.pyd!GetDataString(Cursor * cur=0x00e47100, int iCol=0) Line 410 + 0xf bytes C++
pyodbc.pyd!GetData(Cursor * cur=0x00e47100, int iCol=0) Line 697 + 0xd bytes C++
pyodbc.pyd!Cursor_fetch(Cursor * cur=0x00e47100) Line 1032 + 0xd bytes C++
pyodbc.pyd!Cursor_fetchlist(Cursor * cur=0x00e47100, int max=-1) Line 1063 + 0x9 bytes C++
pyodbc.pyd!Cursor_fetchall(_object * self=0x00e47100, _object * args=0x00000000) Line 1142 + 0xb bytes C++
Resolved!
The problem was solved by ensuring that the buffer had enough space. In getdata.cpp on line 330,
char tempBuffer[1024];
was changed to
char tempBuffer[1025];
I compiled and replaced the old pyodbc.pyd file in site-packages and we're all good!
Thanks for your help!
Q: You're on Windows/XP (32-bit, I imagine), Python 2.7, and BMC Remedy AR. Correct?
Q: Is there any chance you (or perhaps your client, if they purchased Remedy AR) can open a support call with BMC?
Q: Can you isolate which column causes the segfault? "What's different" when the segfault occurs?
Please do the following:
1) Try different "select a,b,c" statements using Python/ODBC to see if you can reproduce the problem (independent of your program) and isolate a specific column (or, ideally, a specific column and row!)
2) Enable ODBC tracing:
http://support.microsoft.com/kb/274551
3) Post back the results (including the log trace of the failure)
4) If that doesn't work - and if you can't get BMC Technical Support involved - then Plan B might be to debug at the ODBC library level:
How to debug C extensions for Python on Windows
Q: What C/C++ compiler would work best for you?
For anyone else who might get this error, check the data type of the returned column. In my case, it was a datetime column. I used select convert(varchar, getdate(), 20) from xxx, or any of the convert formats, to get the desired result.
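For illustration, a minimal pyodbc sketch of that workaround; the DSN, table, and column names are hypothetical:

import pyodbc

# hypothetical connection details
conn = pyodbc.connect("DSN=mydsn;UID=user;PWD=secret")
cursor = conn.cursor()

# convert the datetime column to varchar on the server side
# (style 20 is 'yyyy-mm-dd hh:mi:ss') so the driver returns a plain string
cursor.execute("select convert(varchar, my_datetime_col, 20) from mytable")
rows = cursor.fetchall()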
