I need to iterate over a pandas DataFrame in order to pass each row as an argument to a function (actually, a class constructor) via **kwargs. This means each row should behave like a dictionary whose keys are the column names and whose values are the corresponding row values.
The following works, but it performs very poorly:
import pandas as pd

def myfunc(**kwargs):
    try:
        area = kwargs.get('length', 0) * kwargs.get('width', 0)
        return area
    except TypeError:
        return 'Error : length and width should be int or float'

df = pd.DataFrame({'length': [1, 2, 3], 'width': [10, 20, 30]})

for i in range(len(df)):
    print myfunc(**df.iloc[i])
Any suggestions on how to make this perform better? I have tried iterating with df.iterrows(), but I get the following error:
TypeError: myfunc() argument after ** must be a mapping, not tuple
I have also tried df.itertuples() and df.values, but either I am missing something, or it means I have to convert each tuple / np.array to a pd.Series or dict, which will also be slow.
My constraint is that the script has to work with Python 2.7 and pandas 0.14.1.
One clean option is this:
for row_dict in df.to_dict(orient="records"):
    print(row_dict['column_name'])
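For the question's use case, each of those dicts can be unpacked straight into the function. A minimal sketch, reusing the myfunc and df from the question (assuming your pandas version supports orient="records"):
# each record is a plain dict {'length': ..., 'width': ...}, so ** unpacking works directly
for row_dict in df.to_dict(orient="records"):
    print(myfunc(**row_dict))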
You can try:
for k, row in df.iterrows():
    myfunc(**row)
Here k is the DataFrame index and row behaves like a dict (it is actually a Series), so you can access any column with row["my_column_name"].
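If your pandas version still raises the "must be a mapping" error when unpacking a Series, converting each row to a plain dict sidesteps it. A hedged sketch:
# row is a Series; to_dict() turns it into a real mapping before ** unpacking
for k, row in df.iterrows():
    print(myfunc(**row.to_dict()))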
Defining a separate function for this will be inefficient, as you are applying row-wise calculations. It is more efficient to compute a new series and then iterate over it:
df = pd.DataFrame({'length': [1, 2, 3, 'test'], 'width': [10, 20, 30, 'hello']})
df2 = df.iloc[:].apply(pd.to_numeric, errors='coerce')
error_str = 'Error : length and width should be int or float'
print(*(df2['length'] * df2['width']).fillna(error_str), sep='\n')
10.0
40.0
90.0
Error : length and width should be int or float
I have encountered a bizarre error when trying to use an RDD to create a PySpark DataFrame. Normally spark.createDataFrame(df.rdd, new_schema) works fine as long as the schema is compatible with the RDD. In the following case, though, the RDD has values that should be integers but are instead floats, which causes an error.
I believe that the RDD is trying to do some tricky memory optimization by only storing unique values once (per partition? per block?) and having each cell in the DataFrame point to the same address in memory. It seems to consider 1 and 1.0 to be the same value despite having different data types.
I would like to fundamentally understand WHY this is happening, and whether it is a bug. But more pertinently, how can I work around it? Can I prevent the RDD from doing this in the first place? Can I somehow access df.rdd without running into this? Can I "typecast" the columns in the RDD before passing it to the dataframe constructor?
Reproducible case:
from pyspark.sql import types as T
from pyspark.sql import functions as F
# Create a simple dataframe with an integer and a float column.
# This has the important traits:
#   a) both columns contain the same value: 1 or 1.0
#   b) that value is repeated twice in the first column, which seems to trigger Spark
#      to optimize memory by pointing to the same address in memory
df = spark.createDataFrame(
    data=[
        (1, 3.1415),
        (2, 1.0000),
        (1, 3.1415),
    ],
    schema=T.StructType(
        [
            T.StructField("x", T.IntegerType(), True),
            T.StructField("y", T.DoubleType(), True),
        ]
    ),
)
df = df.repartition(1) # this is needed to get all the data into one block(?)
df.show()
# Build a second dataframe with same data, but nested within structs
df2 = df.withColumn("x2", F.struct(F.col("x")))
df2 = df2.withColumn("y2", F.struct(F.col("y")))
df2 = df2.select("x2", "y2")
df2.show()
# Print the rows in each dataframe. The second will have row 3, column 1 pointing to
# the same value in memory as row 2, column 2.
print("df1:")
for row in df.collect():
    print(row)

print("\ndf2:")
for row in df2.collect():
    print(row)
# Try to build a new dataframe from the RDD. We will get an error
spark.createDataFrame(df2.rdd, df2.schema).show()
This prints:
df1:
Row(x=1, y=3.1415)
Row(x=2, y=1.0)
Row(x=1, y=3.1415)

df2:
Row(x2=Row(x=1), y2=Row(y=3.1415))
Row(x2=Row(x=2), y2=Row(y=1.0))
Row(x2=Row(y=1.0), y2=Row(y=3.1415))
Note that in df2, which has the values nested under structs, the third value for x2 is 1.0, even though that column has IntegerType! (Printing the memory addresses confirms that it's the same pointer as y2 row 2). This results in the following error when the RDD is used:
TypeError: field x in field x2: IntegerType can not accept object 1.0 in type <class 'float'>
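One possible workaround, not from the original post, is to rebuild each row with explicit Python-level casts so that the RDD matches the declared schema before it is handed back to createDataFrame. A rough sketch:
# hedged sketch: coerce the nested values back to the types the schema declares;
# struct fields can be passed as plain tuples when a schema is supplied
fixed_rdd = df2.rdd.map(
    lambda r: (
        (int(r["x2"]["x"]),),    # force x back to a Python int
        (float(r["y2"]["y"]),),  # keep y as a Python float
    )
)
spark.createDataFrame(fixed_rdd, df2.schema).show()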
I want to convert a Python dictionary to a pandas DataFrame, but since the dictionary values are not all of the same length, when I do:
recomm = pd.DataFrame(recommendation.items(),columns=['id','recId1','recId2','recId3','recId4','recId5'])
I get:
6 columns passed, passed data had 2 columns
which means that one of the provided values has length 2.
To correct it, I did:
for key in recommendation.keys():
    while True:
        l1 = recommendation[key]
        l1.append(0)
        recommendation[key] = l1
        if len(l1) < 5:
            break
But I still get the error when converting to DF.
I checked the dictionary as follows:
for key in recommendation.keys():
    if len(recommendation[key]) != 5:
        print key
and discovered that 0 was added to lists of length 5 too, which means some of the values now have length 6.
e.g. a dictionary value:
[12899423, 12907790, 12443129, 12558006, 12880407, 0]
How can I correct the while loop so that it ONLY appends 0 to a value list if its length is < 5?
And is there a better way to convert the dictionary to a pandas DataFrame?
Dictionary keys are: int and str.
You could use the following:
In Python 2.7, use iteritems(), as it returns an iterator over the dictionary; in Python 3.x, items() has the same behavior.
import numpy as np
import pandas as pd

# Your dictionary
d = dict(A=np.array([1, 2]), B=np.array([1, 2, 3, 4]))

df = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in d.iteritems()]))
This fills the DataFrame with NaN values for the missing entries; then you just have to call fillna:
df.fillna(0, inplace=True)
Your missing data will now be filled with zeros.
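A small end-to-end sketch of the same idea, shaped like the question (the dictionary below is a made-up stand-in for recommendation):
import pandas as pd

# hypothetical dictionary with values of unequal length
recommendation = {'a': [1, 2, 3, 4, 5], 'b': [10, 20]}

# wrapping each value in a Series lets pandas align on the index and pad with NaN,
# and fillna(0) then replaces the padding with zeros (no manual while loop needed)
df = pd.DataFrame({k: pd.Series(v) for k, v in recommendation.items()})
df = df.fillna(0)
print(df)
The columns are the dictionary keys, as in the answer above; transpose with df.T if you want one row per key.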
I want to apply the .nunique() function to a full DataFrame.
It contains 130 features (screenshot of the DataFrame's shape and columns).
The goal is to get the number of different values per feature.
I use the following code (which worked on another DataFrame):
def nbDifferentValues(data):
    total = data.nunique()
    total = total.sort_values(ascending=False)
    percent = total / data.shape[0] * 100
    return pd.concat([total, percent], axis=1, keys=['Total', 'Pourcentage'])

diffValues = nbDifferentValues(dataFrame)
The code fails at the first line, and I get the following error, which I don't know how to solve:
("unhashable type : 'list'", 'occured at index columns')
[Trace of the error]
You probably have a column whose contents are lists.
Since lists in Python are mutable, they are unhashable.
import pandas as pd
df = pd.DataFrame([
    (0, [1, 2]),
    (1, [2, 3])
])
# raises "unhashable type : 'list'" error
df.nunique()
SOLUTION: Don't use mutable structures (like lists) in your dataframe:
df = pd.DataFrame([
    (0, (1, 2)),
    (1, (2, 3))
])
df.nunique()
# 0 2
# 1 2
# dtype: int64
To get nunique or unique in a pandas.Series, my preferred approaches are:
Quick Approach
NOTE: This also works if the column values are lists or strings. Nested lists might need to be flattened first.
_unique_items = df.COL_LIST.explode().unique()
or
_unique_count = df.COL_LIST.explode().nunique()
Alternate Approach
Alternatively, if I do not wish to explode the items:
# if col values are strings
_unique_items = df.COL_STR_LIST.apply("|".join).unique()

# the lambda version handles non-string values
_unique_items = df.COL_LIST.apply(lambda _l: "|".join([str(_y) for _y in _l])).unique()
Bonus
df.COL.apply(json.dumps) might handle all the cases.
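For instance, a tiny illustrative sketch of that json.dumps trick (COL is a placeholder column name):
import json
import pandas as pd

# serialising each list cell to a JSON string makes it hashable, so nunique() can count it
df = pd.DataFrame({"COL": [[1, 2], [2, 3], [1, 2]]})
print(df["COL"].apply(json.dumps).nunique())  # 2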
OP's solution
df['uniqueness'] = df.apply(lambda _x: json.dumps(_x.to_list()), axis=1)
...
# Plug more code
...
I have come across this problem with .nunique() when converting results from a REST API from dict (or list) to a pandas DataFrame. The problem is that one of the columns is stored as a list or dict (a common situation with nested JSON results). Here is some sample code to identify the columns causing the error.
# this is the dataframe that is causing your issues
df = data.copy()
print(f"Rows and columns: {df.shape} \n")
print(f"Null values per column: \n{df.isna().sum()} \n")

# check which columns raise an error when counting the number of unique values
ls_cols_nunique = []
ls_cols_error_nunique = []
for each_col in df.columns:
    try:
        df[each_col].nunique()
        ls_cols_nunique.append(each_col)
    except Exception:
        ls_cols_error_nunique.append(each_col)

print(f"Unique values per column: \n{df[ls_cols_nunique].nunique()} \n")
print(f"Columns error nunique: \n{ls_cols_error_nunique} \n")
This code should split your dataframe columns into 2 lists:
Columns on which .nunique() works
Columns that raise an error when running .nunique()
Then just calculate .nunique() on the columns without errors.
As for converting the columns with errors, there are other resources that address that with .apply(pd.Series); a sketch follows below.
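A minimal, hedged sketch of that .apply(pd.Series) idea, using a made-up list column:
import pandas as pd

# hypothetical frame where 'tags' holds lists (the kind of column that breaks nunique)
df = pd.DataFrame({"id": [1, 2], "tags": [[10, 20], [30]]})

# expand the list column into separate scalar columns; missing entries become NaN
expanded = df["tags"].apply(pd.Series)
df = df.drop(columns=["tags"]).join(expanded.add_prefix("tag_"))
print(df.nunique())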
EDIT: here are the first lines:
df = pd.read_csv(os.path.join(path, file), dtype=str, delimiter=';', error_bad_lines=False, nrows=50)
df["CALDAY"] = df["CALDAY"].apply(lambda x: dt.datetime.strptime(x, '%d/%m/%Y'))
df = df.fillna(0)
I have a csv file that has 1500 columns and 35000 rows. It contains values, but in the form 1.700,35 for example, whereas in Python I need 1700.35. When I read the csv, all values come in as str.
To solve this, I wrote this function:
def format_nombre(df):
    for i in range(length):
        for j in range(width):
            element = df.iloc[i, j]
            if type(element) != type(df.iloc[1, 0]):
                a = df.iloc[i, j].replace(".", "")
                b = float(a.replace(",", "."))
                df.iloc[i, j] = b
Basically, I visit every row/column intersection, replace the problematic characters, convert the element to a float, and write it back into the dataframe. The if ensures that the function skips the dates, which are in the first column of my dataframe.
The problem is that although the function does exactly what I want, it takes approximately 1 minute to cover 10 rows, so transforming my whole csv would take a little less than 60 hours.
I realize this is far from optimized, but I struggled and failed to find an approach that suited my needs and (scarce) skills.
How about:
import numpy as np

def to_numeric(column):
    if np.issubdtype(column.dtype, np.datetime64):
        # leave the date column untouched
        return column
    else:
        return column.str.replace('.', '').str.replace(',', '.').astype(float)

df = df.apply(to_numeric)
That's assuming all strings are valid. Otherwise use pd.to_numeric instead of astype(float).
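A hedged sketch of that pd.to_numeric variant, where malformed strings become NaN instead of raising:
import numpy as np
import pandas as pd

def to_numeric_safe(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    # strip the thousands separator, swap the decimal comma, then coerce bad values to NaN
    cleaned = column.str.replace('.', '', regex=False).str.replace(',', '.', regex=False)
    return pd.to_numeric(cleaned, errors='coerce')

df = df.apply(to_numeric_safe)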
I always get this error:
AnalysisException: u"cannot resolve 'substring(l,1,-1)' due to data type mismatch: argument 1 requires (string or binary) type, however, 'l' is of array type.;"
I am quite confused, because l[0] is a string and does match argument 1.
The dataframe has only one column, named 'value', which is a comma-separated string.
I want to convert this original dataframe into another dataframe of LabeledPoint objects, with the first element as the 'label' and the rest as the 'features'.
from pyspark.mllib.regression import LabeledPoint

def parse_points(dataframe):
    df1 = df.select(split(dataframe.value, ',').alias('l'))
    u_label_point = udf(LabeledPoint)
    df2 = df1.select(u_label_point(col('l')[0], col('l')[1:-1]))
    return df2

parsed_points_df = parse_points(raw_data_df)
I think you want to create LabeledPoint objects in a dataframe. You can do:
def parse_points(df):
    df1 = df.select(split(df.value, ',').alias('l'))
    # map applies the lambda to each row; seq[0] is the split list 'l'
    df2 = df1.map(lambda seq: LabeledPoint(float(seq[0][0]), seq[0][1:]))
    return df2.toDF()  # this converts the resulting PipelinedRDD back to a dataframe

parsed_points_df = parse_points(raw_data_df)
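If you are on Spark 2.x or later, DataFrame no longer exposes .map directly, so a hedged variant of the same idea goes through .rdd first (a sketch, not tested against the original data):
from pyspark.sql.functions import split
from pyspark.mllib.regression import LabeledPoint

def parse_points_rdd(df):
    # split the comma-separated 'value' column into a list column 'l'
    df1 = df.select(split(df.value, ',').alias('l'))
    # build LabeledPoint(label, features) per row, then convert the RDD back to a dataframe
    rdd = df1.rdd.map(lambda row: LabeledPoint(float(row.l[0]), [float(v) for v in row.l[1:]]))
    return rdd.toDF()

parsed_points_df = parse_points_rdd(raw_data_df)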