Writing a UDF in Python using Pandas throwing error - python

We are trying to write Hive UDFs in Python to clean our data. The UDF we tried uses Pandas, and it throws an error.
When we run other Python code that does not use Pandas, it works fine. Please help us understand the problem.
We have already tried several Pandas-based variants without success. Since the Python code without Pandas works, we are confused about why this one fails. The Pandas code is below:
import sys
import pandas as pd
import numpy as np
for line in sys.stdin:
    df = line.split('\t')
    df1 = pd.DataFrame(df)
    df2 = df1.T
    df2[0] = np.where(df2[0].str.isalpha(), df2[0], np.nan)
    df2[1] = np.where(df2[1].astype(str).str.isdigit(), df2[1], np.nan)
    df2[2] = np.where(df2[2].astype(str).str.len() != 10, np.nan,
                      df2[2].astype(str))
    #df2[3] = np.where(df2[3].astype(str).str.isdigit(), df2[3], np.nan)
    df2 = df2.dropna()
    print(df2)
I get this error:
FAILED: Execution Error, return code 20003 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. An error occurred when trying to close the Operator running your custom script.
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

I think you'll need to look at the detailed job logs for more information.
My first guess is that Pandas is not installed on every data node, so the script fails at import on whichever node runs the task.
This answer looks appropriate for you if you intend to bundle dependencies with your job: https://stackoverflow.com/a/2869974/7379644
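If installing Pandas on every node is not an option, the same cleaning rules can be written with the standard library alone. What follows is a minimal sketch, not the asker's code: it assumes the first three tab-separated fields are the ones checked in the question (alphabetic, numeric, exactly 10 characters), and the name clean_rows is made up for illustration. It also emits rows as tab-separated text, which is what a Hive streaming script is expected to write by default, instead of printing a DataFrame.

```python
import sys


def clean_rows(lines):
    """Yield tab-separated rows whose first three fields pass the checks."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not enough columns to validate
        if not fields[0].isalpha():
            continue  # first field must be alphabetic
        if not fields[1].isdigit():
            continue  # second field must be numeric
        if len(fields[2]) != 10:
            continue  # third field must be exactly 10 characters
        yield "\t".join(fields)


# Sample input for illustration; in the real script pass sys.stdin instead.
sample = ["alice\t42\t0123456789\n", "b0b\t42\t0123456789\n", "carol\t7\t123\n"]
for row in clean_rows(sample):
    sys.stdout.write(row + "\n")
```

Rows that fail any check are dropped, which mirrors the dropna() step in the question.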

Related

python not showing values in max()

I ran the following code in the PyCharm editor:
import pandas as pd
df=pd.read_csv('Train_UWu5bXk.csv')
df.min()
The output does not show any value:
C:\Users\krishna\PycharmProjects\pdproject\Scripts\python.exe
C:/Users/krishna/Python37-32/Scripts/phythonncode/load1pandas.py
Process finished with exit code 0
Can someone help me?
The problem is solved by printing the result explicitly. A script, unlike an interactive console, does not echo the value of a bare expression:
import pandas as pd
df=pd.read_csv('Train_UWu5bXk.csv')
print(df.min())
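The same behaviour can be reproduced without the CSV file. This is a small sketch with a made-up DataFrame (the column names and values are invented for illustration); the bare expression is evaluated and thrown away, while the printed one shows up in the run window.

```python
import pandas as pd

# Invented stand-in for the CSV contents.
df = pd.DataFrame({"price": [3.5, 1.25, 9.0], "qty": [4, 7, 2]})

df.min()            # evaluated, but a script discards the result silently

col_min = df.min()  # keep the result ...
print(col_min)      # ... and print it explicitly
```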

Error: No module named 'xlrd'. How to import Excel with Python and pandas properly?

I just realized that something may be wrong with my local dev environment.
I tried my code on Colab, and it worked well.
import pandas as pd
df = pd.read_excel('hurun-2018-top50.xlsx')
Thank you all.
Please close this question.
------- Original description follows -------
I am trying to import an Excel file with Python and pandas.
I have already installed the "xlrd" module with pip.
I googled a lot and tried several different methods; none of them worked.
Here is my code:
import pandas as pd
from pandas import ExcelFile
from pandas import ExcelWriter
df = pd.read_excel('hurun-2018-top50.xlsx', index_col=0)
df = pd.read_excel('hurun-2018-top50.xlsx', sheet_name='Sheet1')
df = pd.read_excel('hurun-2018-top50.xlsx')
Any response will be appreciated.
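Since the code worked on Colab, the likely culprit is that pip installed xlrd into a different interpreter than the one running the script. A quick way to check, sketched here with the standard library only (the helper name module_available is made up for illustration):

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the running interpreter can find module `name`."""
    return importlib.util.find_spec(name) is not None


# Shows which Python is actually executing the script.
print("interpreter:", sys.executable)
for name in ("pandas", "xlrd"):
    print(name, "importable:", module_available(name))
```

If xlrd is reported as not importable, installing it with that same interpreter (python -m pip install xlrd, run with the path printed above) should put it in the right place.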

Read SAS file with pandas

I'm trying to use the pandas read_sas() function.
First, I create a SAS dataset by running this code in SAS:
libname tmp 'c:\temp';
data tmp.test;
  do i=1 to 100;
    x=rannor(0);
    output;
  end;
run;
Now, in IPython, I do this:
import numpy as np
import pandas as pd
%cd C:\temp
pd.read_sas('test.sas7bdat')
Pretty straightforward and seems like it should work. But I just get this error:
TypeError: read() takes at most 1 argument (2 given)
What am I missing here? I'm using pandas version 0.18.0.
According to the issue report linked below, this bug will be fixed in pandas 0.18.1.
https://github.com/pydata/pandas/issues/12647

Unable to write my dataframe using feather (strided data not supported)

I am using the feather package (http://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/) to write a simple 20x20 dataframe, and I keep getting an error stating that strided data isn't supported yet. I don't believe my data is strided (or otherwise out of the ordinary). I can run the sample code given on the website, but I can't get it to work with my own data. Here is some sample code:
import feather
import numpy as np
import pandas as pd
tempArr = np.reshape(np.arange(400), (20,20))
df = pd.DataFrame(tempArr)
feather.write_dataframe(df, 'test.feather')
The last line returns the following error:
FeatherError: Invalid: no support for strided data yet
I am running this on Ubuntu 14.04. Am I perhaps misunderstanding something about how pandas dataframes are stored?
Please come to GitHub: https://github.com/wesm/feather/issues/97
Bug reports do not belong on Stack Overflow.
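On the "is my data strided?" part of the question: one plausible reading, which the thread itself does not confirm, is that the columns are strided because they come from a single 2-D array. The NumPy side of that can be checked directly; this sketch only shows how a column slice of a row-major array is laid out and says nothing about feather's internals.

```python
import numpy as np

temp_arr = np.reshape(np.arange(400), (20, 20))

# A column of a row-major 2-D array skips a full row between elements,
# so it is a strided (non-contiguous) view rather than a packed buffer.
column = temp_arr[:, 0]
print(column.strides, column.flags["C_CONTIGUOUS"])

# Copying the column yields a packed, contiguous array with the same values.
packed = np.ascontiguousarray(column)
print(packed.strides, packed.flags["C_CONTIGUOUS"])
```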

Pickling a DataFrame

I am trying to pickle a DataFrame with:
import pandas as pd
from pandas import DataFrame
data = pd.read_table('Purchases.tsv',index_col='coreuserid')
data.to_pickle('Purchases.pkl')
I have been working with "data" for a while without any issues, so I know it is not a data corruption problem. I suspect a syntax error, but I have tried a number of variants. I hesitate to give the whole error message, but it ends with:
\pickle.pyc in to_pickle(obj, path)
13 """
14 with open(path, 'wb') as f:
15 pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
SystemError: error return without exception set
The Purchases.pkl file is created, but if I call
data = pd.read_pickle('Purchases.pkl')
I get an EOFError. I am using Canopy 1.4, and therefore pandas 0.13.1, which should be recent enough to have this functionality.
Fast forward a few years, and now it works fine. Thanks pandas ;)
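For reference, a round trip along these lines runs cleanly on a recent pandas. This is a sketch with a tiny made-up frame standing in for Purchases.tsv (the amount column and its values are invented); the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

import pandas as pd

# Invented stand-in for the purchases data, indexed like the original.
data = pd.DataFrame(
    {"coreuserid": [101, 102, 103], "amount": [9.99, 24.50, 3.00]}
).set_index("coreuserid")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Purchases.pkl")
    data.to_pickle(path)             # write the pickle
    restored = pd.read_pickle(path)  # read it back

# Check that the restored frame matches the original.
print(restored.equals(data))
```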
You can try wrapping your DataFrame in a class and pickling that instead.
This question may help: "Pass pandas dataframe into class"
