pyspark taking a log of column - python

length = df.count()
df = df.withColumn("log", log(col("power"),lit(length)))
These lines throw the error below. Can you please help me take the log of a column, using another value or another column as the base?
TypeError Traceback (most recent call last)
<ipython-input-102-c0894b6127d1> in <module>()
1 #df.show()
2
----> 3 df = df.withColumn("log", log(col("power"),lit(2)))
/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/column.py in __iter__(self)
342
343 def __iter__(self):
--> 344 raise TypeError("Column is not iterable")
345
346 # string methods
TypeError: Column is not iterable

If you want to use functions that are not built-in on Spark DataFrames, you can use user-defined functions (UDFs). In your case it would look like this:
from pyspark.sql.functions import udf
from math import log

@udf("float")
def log_udf(s):
    return log(s, 2)

df.withColumn("log", log_udf("power")).show()
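If you would rather avoid a UDF, here is a hedged alternative sketch. It assumes the built-in two-argument form of pyspark.sql.functions.log, which takes a numeric base and a column; for a base that is itself computed (such as the row count), the change-of-base identity log_b(x) = log(x) / log(b) works with plain column expressions (log_base_len is just an illustrative column name):
from pyspark.sql import functions as F
# Fixed numeric base: log base 2 of the "power" column.
df = df.withColumn("log", F.log(2.0, F.col("power")))
# Base computed at runtime (e.g. the row count): use log(x) / log(base).
length = df.count()
df = df.withColumn("log_base_len",
                   F.log(F.col("power")) / F.log(F.lit(float(length))))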

Related

Pandas interpolation function fails to interpolate after replacing values with .nan

I am working with pandas' interpolation function, and I am trying to interpolate a missing value after removing an entry that isn't numeric. However, I still see one NA value when calling isna().sum(). A better explanation is below.
The input .csv file can be found here.
Here is what I have done:
#Import modules
import pandas as pd
import numpy as np
#Import data
df = pd.read_csv('example.csv')
df.isna().sum() #Shows no NA values, but I know that one of them is not numeric.
pd.to_numeric(df['example'])
The following error is produced, indicating the presence of an entry at position 949 that needs to be removed:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~libs\lib.pyx:2315, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "asdf"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Input In [111], in <cell line: 3>()
1 df1 = pd.read_csv('example.csv')
2 df1.isna().sum()
----> 3 pd.to_numeric(df1['example'])
File ~numeric.py:184, in to_numeric(arg, errors, downcast)
182 coerce_numeric = errors not in ("ignore", "raise")
183 try:
--> 184 values, _ = lib.maybe_convert_numeric(
185 values, set(), coerce_numeric=coerce_numeric
186 )
187 except (ValueError, TypeError):
188 if errors == "raise":
File ~libs\lib.pyx:2357, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "asdf" at position 949
Here is my attempt to remove this value and interpolate a new one in its place:
idx_missing = df== 'asdf'
df[idx_missing] = np.nan
df['example'].isnull().sum() #This line confirms that there is one value missing
#Perform interpolation with a linear method
df1.iloc[:, -1] = df.iloc[:, -1].interpolate(method='linear') #Specifying the last column in the dataframe with the 'iloc' command
df1.isna().sum()
Apparently, there is still a missing value and the value was not interpolated:
example 1
dtype: int64
How can I correctly interpolate this value?
If you first find and replace any value that is not a digit, that should fix your issue.
#Import modules
import pandas as pd
import numpy as np
#Import data
df = pd.read_csv('example.csv')
df['example'] = df.example.replace(r'[^\d]', np.nan, regex=True)
pd.to_numeric(df.example)
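To then fill the gap the question is actually about, a hedged follow-up sketch (assuming the column is named example and a linear fill is acceptable):
df['example'] = pd.to_numeric(df['example'], errors='coerce')  # unparseable entries become NaN
df['example'] = df['example'].interpolate(method='linear')     # fill the gap linearly
df['example'].isna().sum()  # expected to report 0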

Get the latest data for each customer from pandas dataframe

I am trying to get the latest data for every customer regardless of other attributes in the dataframe.
My dataframe looks like this
My output should look like this
I have tried df.iloc[df.groupby('customer')['date'].idxmax()] but I am getting a ValueError.
"ValueError Traceback (most recent call last)
in
----> 1 df = df.iloc[df.groupby('cutomer')['date'].idxmax()]
~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\groupby\groupby.py in wrapper(*args, **kwargs)
653 if self.obj.ndim == 1:
654 # this can be called recursively, so need to raise ValueError
--> 655 raise ValueError
656
657 # GH#3688 try to operate item-by-item
ValueError: "
I think it's really the same as this one: similar problem
In this case the code would look like this:
df['date'] = pd.to_datetime(df.date)
idx = df.groupby('customer')['date'].transform('max') == df['date']
df[idx]
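Another common pattern for the same task, as a hedged sketch (assuming the columns are named customer and date as in the question; latest is just an illustrative name), is to sort by date and keep the last row per customer:
import pandas as pd
df['date'] = pd.to_datetime(df['date'])
# Keep the most recent row for each customer.
latest = (
    df.sort_values('date')
      .drop_duplicates(subset='customer', keep='last')
)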

filtering dataframe in LAMBDA function in python [duplicate]

I am trying to filter an RDD like below:
spark_df = sc.createDataFrame(pandas_df)
spark_df.filter(lambda r: str(r['target']).startswith('good'))
spark_df.take(5)
But I got the following error:
TypeError Traceback (most recent call last)
<ipython-input-8-86cfb363dd8b> in <module>()
1 spark_df = sc.createDataFrame(pandas_df)
----> 2 spark_df.filter(lambda r: str(r['target']).startswith('good'))
3 spark_df.take(5)
/usr/local/spark-latest/python/pyspark/sql/dataframe.py in filter(self, condition)
904 jdf = self._jdf.filter(condition._jc)
905 else:
--> 906 raise TypeError("condition should be string or Column")
907 return DataFrame(jdf, self.sql_ctx)
908
TypeError: condition should be string or Column
Any idea what I missed? Thank you!
DataFrame.filter, which is an alias for DataFrame.where, expects a SQL expression expressed either as a Column:
spark_df.filter(col("target").like("good%"))
or equivalent SQL string:
spark_df.filter("target LIKE 'good%'")
I believe you're trying here to use RDD.filter, which is a completely different method:
spark_df.rdd.filter(lambda r: r['target'].startswith('good'))
and does not benefit from SQL optimizations.
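Since the goal here is a prefix match, an equivalent variant (a hedged sketch relying on the standard Column.startswith method) is:
from pyspark.sql.functions import col
filtered_df = spark_df.filter(col("target").startswith("good"))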
I have been through this and have settled on using a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

filtered_df = spark_df.filter(udf(lambda target: target.startswith('good'),
                                  BooleanType())(spark_df.target))
More readable would be to use a normal function definition instead of the lambda.
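For example, a hedged sketch of that more readable form (assuming the same spark_df and target column; starts_with_good is just an illustrative name):
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def starts_with_good(target):
    # Guard against null values before calling startswith.
    return target is not None and target.startswith('good')

starts_with_good_udf = udf(starts_with_good, BooleanType())
filtered_df = spark_df.filter(starts_with_good_udf(spark_df.target))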
Convert the DataFrame into an RDD:
spark_df = sc.createDataFrame(pandas_df)
filtered_rdd = spark_df.rdd.filter(lambda r: str(r['target']).startswith('good'))
filtered_rdd.take(5)
I think it may work!

simple dask map_partitions example

I read the following SO thread and am now trying to understand it. Here is my example:
import dask.dataframe as dd
import pandas as pd
from dask.multiprocessing import get
import random
df = pd.DataFrame({'col_1':random.sample(range(10000), 10000), 'col_2': random.sample(range(10000), 10000) })
def test_f(col_1, col_2):
    return col_1 * col_2
ddf = dd.from_pandas(df, npartitions=8)
ddf['result'] = ddf.map_partitions(test_f, columns=['col_1', 'col_2']).compute(get=get)
It generates the error below. What am I doing wrong? Also, I am not clear on how to pass additional parameters to the function in map_partitions.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\utils.py in raise_on_meta_error(funcname)
136 try:
--> 137 yield
138 except Exception as e:
~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in _emulate(func, *args, **kwargs)
3130 with raise_on_meta_error(funcname(func)):
-> 3131 return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
3132
TypeError: test_f() got an unexpected keyword argument 'columns'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-9-913789c7326c> in <module>()
----> 1 ddf['result'] = ddf.map_partitions(test_f, columns=['col_1', 'col_2']).compute(get=get)
~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in map_partitions(self, func, *args, **kwargs)
469 >>> ddf.map_partitions(func).clear_divisions() # doctest: +SKIP
470 """
--> 471 return map_partitions(func, self, *args, **kwargs)
472
473 @insert_meta_param_description(pad=12)
~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in map_partitions(func, *args, **kwargs)
3163
3164 if meta is no_default:
-> 3165 meta = _emulate(func, *args, **kwargs)
3166
3167 if all(isinstance(arg, Scalar) for arg in args):
~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in _emulate(func, *args, **kwargs)
3129 """
3130 with raise_on_meta_error(funcname(func)):
-> 3131 return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
3132
3133
~\AppData\Local\conda\conda\envs\tensorflow\lib\contextlib.py in __exit__(self, type, value, traceback)
75 value = type()
76 try:
---> 77 self.gen.throw(type, value, traceback)
78 except StopIteration as exc:
79 # Suppress StopIteration *unless* it's the same exception that
~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\utils.py in raise_on_meta_error(funcname)
148 ).format(" in `{0}`".format(funcname) if funcname else "",
149 repr(e), tb)
--> 150 raise ValueError(msg)
151
152
ValueError: Metadata inference failed in `test_f`.
Original error is below:
------------------------
TypeError("test_f() got an unexpected keyword argument 'columns'",)
Traceback:
---------
File "C:\Users\some_user\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\utils.py", line 137, in raise_on_meta_error
yield
File "C:\Users\some_user\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py", line 3131, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
There is an example in the map_partitions docs to achieve exactly what you are trying to do:
ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
When you call map_partitions (just like when you call .apply() on a pandas.DataFrame), the function that you try to map (or apply) will be given a dataframe as its first argument.
In the case of dask.dataframe.map_partitions this first argument will be a partition, and in the case of pandas.DataFrame.apply it will be the whole dataframe.
This means that your function has to accept a dataframe (partition) as its first argument and, in your case, could look like this:
def test_f(df, col_1, col_2):
    return df.assign(result=df[col_1] * df[col_2])
Note that assignment of a new column in this case happens (i.e. gets scheduled to happen) BEFORE you call .compute().
In your example you assign the column AFTER you call .compute(), which kind of defeats the purpose of using dask: after you call .compute(), the results of that operation are loaded into memory if there is enough space for them (if not, you just get a MemoryError).
So for your example to work you could:
1) Use a function (with column names as arguments):
def test_f(df, col_1, col_2):
    return df.assign(result=df[col_1] * df[col_2])

ddf_out = ddf.map_partitions(test_f, 'col_1', 'col_2')
# Here is a good place to do something with the BIG ddf_out dataframe before calling .compute()
result = ddf_out.compute(get=get)  # Will load the whole dataframe into memory
2) Use a lambda (with column names hardcoded in the function):
ddf_out = ddf.map_partitions(lambda df: df.assign(result=df.col_1 * df.col_2))
# Here is a good place to do something with the BIG ddf_out dataframe before calling .compute()
result = ddf_out.compute(get=get)  # Will load the whole dataframe into memory
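As a side note (this is an assumption about your dask version, not something from the question): in newer dask releases the get= keyword of .compute() was replaced by scheduler=, so the final call might instead look like:
result = ddf_out.compute(scheduler='processes')  # replaces compute(get=get) in newer dask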
Update:
To apply a function on a row-by-row basis, here is a quote from the post you linked:
map / apply
You can map a function row-wise across a series with map
df.mycolumn.map(func)
You can map a function row-wise across a dataframe with apply
df.apply(func, axis=1)
I.e. for the example function in your question, it might look like this:
def test_f(dds, col_1, col_2):
    return dds[col_1] * dds[col_2]
Since you will be applying it on a row-by-row basis the function's first argument will be a series (i.e. each row of a dataframe is a series).
To apply this function then you might call it like this:
dds_out = ddf.apply(
    test_f,
    args=('col_1', 'col_2'),
    axis=1,
    meta=('result', int)
).compute(get=get)
This will return a series named 'result'.
I guess you could also call .apply on each partition with a function, but it does not look to be any more efficient than calling .apply on the dataframe directly. But maybe your tests will prove otherwise.
Your test_f takes two arguments: col_1 and col_2. You pass a single argument, ddf.
Try something like
In [5]: dd.map_partitions(test_f, ddf['col_1'], ddf['col_2'])
Out[5]:
Dask Series Structure:
npartitions=8
0 int64
1250 ...
...
8750 ...
9999 ...
dtype: int64
Dask Name: test_f, 32 tasks
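If you want that result back as a column (a hedged sketch assuming the same ddf and test_f as above), assign the mapped series before computing:
ddf['result'] = dd.map_partitions(test_f, ddf['col_1'], ddf['col_2'])
result = ddf.compute()  # materialize the dataframe once, at the end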
