I'm trying to do the following: "# drop all rows where tag == train_loop and start is NaN".
Here's my current attempt (thanks Copilot):
# drop all rows where tag == train_loop and start is NaN
# apply filter function to each row
# return True if row should be dropped
def filter_fn(row):
return row["tag"] == "train_loop" and pd.isna(row["start"]):
old_len = len(df)
df = df[~df.apply(filter_fn, axis=1)]
It works well, but I'm wondering if there is a less verbose way.
Using apply is actually a really bad way to do this, since it loops over every row and calls your Python function for each one. Instead, use vectorized operations that you call on the entire dataframe; under the hood these dispatch to optimized code written in C.
df = df[~((df["tag"] == "train_loop") & df["start"].isnull())]
If your data is large (>~100k rows), then even faster would be to use the pandas query method, where you can write both conditions in one expression:
df = df.query(
    '~((tag == "train_loop") and (start != start))'
)
This makes use of the fact that NaN never equals anything, including itself, so we can use simple comparison operators to find NaNs (.isnull() isn't available in the compiled query mini-language). For the query method to be faster, you need to have numexpr installed, which will compile your queries on the fly before they're run on the data.
See the docs on enhancing performance for more info and examples.
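As a quick sanity check, here is a minimal sketch on a made-up toy frame confirming that the boolean-mask version and the query version keep the same rows:
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) to compare both approaches
df = pd.DataFrame({
    "tag": ["train_loop", "train_loop", "eval"],
    "start": [np.nan, 1.0, np.nan],
})

mask_version = df[~((df["tag"] == "train_loop") & df["start"].isnull())]
query_version = df.query('~((tag == "train_loop") and (start != start))')

# Both drop only the first row and keep the other two
assert mask_version.equals(query_version)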
You can do
df = df.loc[~(df['tag'].eq('train_loop') & df['start'].isna())]
I have some code and my dataframe contains almost 800k rows, so it is impossible to iterate over it using standard methods. I searched a little bit and saw a method called iterrows(), but I couldn't understand how to use it. Basically, this is my code; can you help me update it to use iterrows()?
for i in range(len(x["Value"])):
    if x.loc[i, "PP_Name"] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']:
        x.loc[i, "Santral_Type"] = "HES"
    elif x.loc[i, "PP_Name"] in ['BND','BND2','TFB','TFB3','TFB4','KNT']:
        x.loc[i, "Santral_Type"] = "TERMIK"
    elif x.loc[i, "PP_Name"] in ['BRS','ÇKL','DPZ']:
        x.loc[i, "Santral_Type"] = "RES"
    else:
        x.loc[i, "Santral_Type"] = "SOLAR"
How to iterate over very big dataframes -- In general, you don't. You should apply some sort of vectorized operation to the column as a whole. For example, your case can be handled with map and fillna:
map_dict = {
    'HES': ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay'],
    'TERMIK': ['BND','BND2','TFB','TFB3','TFB4','KNT'],
    'RES': ['BRS','ÇKL','DPZ'],
}
inv_map_dict = {x:k for k,v in map_dict.items() for x in v}
df['Santral_Type'] = df['PP_Name'].map(inv_map_dict).fillna('SOLAR')
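For illustration, a minimal usage sketch on a small made-up frame (the inverse map is abbreviated here; build it from map_dict as shown above):
import pandas as pd

# Made-up sample data using the question's column name
df = pd.DataFrame({"PP_Name": ["ARK", "BND", "DPZ", "GES1"]})

# Abbreviated inverse map for the sketch
inv_map_dict = {"ARK": "HES", "BND": "TERMIK", "DPZ": "RES"}

df["Santral_Type"] = df["PP_Name"].map(inv_map_dict).fillna("SOLAR")
print(df["Santral_Type"].tolist())  # ['HES', 'TERMIK', 'RES', 'SOLAR']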
It is not advised to iterate through DataFrames for these things. Here is one possible way of doing it, applied to all rows of the DataFrame x at once:
# Default value
x["Santral_Type"] = "SOLAR"
x.loc[x.PP_Name.isin(['BRS','ÇKL','DPZ']), 'Santral_Type'] = "RES"
x.loc[x.PP_Name.isin(['BND','BND2','TFB','TFB3','TFB4','KNT']), 'Santral_Type'] = "TERMIK"
hes_list = ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
x.loc[x.PP_Name.isin(hes_list), 'Santral_Type'] = "HES"
Note that 800k rows cannot be considered a large table when using standard pandas methods.
I would strongly advise against using iterrows and for loops when you have vectorised solutions available that take advantage of the pandas API.
Here is your code adapted to numpy, which should run much faster than your current method.
import numpy as np
col = 'PP_Name'
conditions = [
    x[col].isin(
        ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
    ),
    x[col].isin(["BND", "BND2", "TFB", "TFB3", "TFB4", "KNT"]),
    x[col].isin(["BRS", "ÇKL", "DPZ"]),
]
outcomes = ["HES", "TERMIK", "RES"]
x["Santral_Type"] = np.select(conditions, outcomes, default='SOLAR')
According to the documentation, df.iterrows() yields a tuple (index, Series) for each row.
You can use it like this:
for index, row in df.iterrows():
    if row['PP_Name'] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']:
        df.loc[index, 'Santral_Type'] = "HES"
    # and so on
By the way, I must say, using iterrows is going to be very slow, and looking at your sample code it's clear you can use simple pandas selection techniques to do this without explicit loops.
It is better to do it as @mcsoini suggested.
The simplest method could be .values. For example:
def f(x0,...xn):
return('hello or some complicated operation')
df['newColumn']=[f(r[0],r[1],...,r[n]) for r in df.values]
The drawback of this method, as far as I know, is that you cannot refer to the column values by name, only by position, and there is no info about the index of the df.
The advantage is that it is faster than iterrows, itertuples and apply.
Hope it helps.
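For concreteness, here is a minimal sketch of that .values pattern applied to the question's column layout, on a small made-up frame (column positions depend on your actual frame):
import pandas as pd

# Toy frame standing in for the question's x (hypothetical values)
x = pd.DataFrame({"PP_Name": ["ARK", "KNT", "DPZ", "GES1"],
                  "Value": [10, 20, 30, 40]})

hes = ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
termik = ['BND','BND2','TFB','TFB3','TFB4','KNT']
res = ['BRS','ÇKL','DPZ']

def classify(pp_name):
    # Same branching as the original loop, applied to one value at a time
    if pp_name in hes:
        return "HES"
    if pp_name in termik:
        return "TERMIK"
    if pp_name in res:
        return "RES"
    return "SOLAR"

# .values yields plain rows; position 0 is PP_Name in this toy frame
x["Santral_Type"] = [classify(r[0]) for r in x.values]
print(x["Santral_Type"].tolist())  # ['HES', 'TERMIK', 'RES', 'SOLAR']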
I want to sum different columns in a spark dataframe.
Code
from pyspark.sql import functions as F
cols = ["A.p1","B.p1"]
df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols)
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))
Why aren't approaches #2 and #3 working?
I am on Spark 2.2
Because,
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using Python's built-in sum function, which takes an iterable as input, so it works. https://docs.python.org/2/library/functions.html#sum
# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using the PySpark sum function, which takes a column as input, but you are trying to apply it at row level.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))
Here, df.select() returns a DataFrame, and you are trying to sum over a DataFrame. In this case, I think, you would have to iterate row-wise and apply sum over it.
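If you want the addition to go through PySpark Column operations explicitly rather than Python's builtin sum, one equivalent sketch (using the question's column names) reduces over the Column objects:
from functools import reduce
from operator import add

# Row-wise sum of the two columns by chaining Column.__add__,
# which is exactly what the builtin sum does in approach #1
df = df.withColumn('sum1', reduce(add, [df[c] for c in ["`A.p1`", "`B.p1`"]]))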
TL;DR builtins.sum is just fine.
Following your comments:
Using native Python sum() is not benefiting from Spark optimization, so what's the Spark way of doing it?
and
it's not a PySpark function, so it won't really be completely benefiting from Spark, right?
I can see you are making incorrect assumptions.
Let's decompose the problem:
[df[col] for col in ["`A.p1`","`B.p1`"]]
creates a list of Columns:
[Column<b'A.p1'>, Column<b'B.p1'>]
Let's call it iterable.
sum reduces this list by taking its elements and calling the __add__ method (+). The imperative equivalent is:
accum = iterable[0]
for element in iterable[1:]:
    accum = accum + element
This gives Column:
Column<b'(A.p1 + B.p1)'>
which is the same as calling
df["`A.p1`"] + df["`B.p1`"]
No data has been touched, and when evaluated it benefits from all Spark optimizations.
Addition of multiple columns from a list into one column
I tried a lot of methods and the following are my observations:
PySpark's sum function doesn't support column addition (PySpark version 2.3.1)
Python's built-in sum function works for some folks but gives an error for others (possibly because of a conflict in names)
In your 3rd approach, the expression (inside Python's sum function) returns a PySpark DataFrame.
So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression to be computed as an input.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns. We can also use any other complex expression to get other output.
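As a usage note, the same expr pattern extends to any arithmetic the SQL expression language supports; for example, a weighted combination over hypothetical columns a, b and c:
from pyspark.sql.functions import expr

# Illustrative only: a weighted combination instead of a plain sum,
# assuming numeric columns a, b and c exist in df
df = df.withColumn('weighted_sum', expr('0.5 * a + 2 * b - c'))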
I have data in a Spark data frame, with a column col that contains structured data of the form:
------ col ------- # Column whose elements are structures
field0 field1 … # StructType with StructFields (variable names and count)
[1,2,3] [4,5] [6] # Each field is of type ArrayType
[1,2] [3] []
…
where the number and the names of the fields are not fixed.
What is the most efficient way of calculating the total number of elements in each row? In the above example, the expected resulting data frame would thus be:
num_elements
6
3
…
There is always the solution of a user defined function:
import pyspark.sql.functions
from pyspark.sql.types import IntegerType

def num_elements(all_arrays_in_row):
    return sum(map(len, all_arrays_in_row))

num_elements = pyspark.sql.functions.udf(num_elements, IntegerType())
data_frame.select(num_elements(data_frame.col)).show() # Number of elements in each row
Now, I am not sure whether this is generally efficient, because:
Function num_elements() is in Python.
If the fields happen to not be stored together for some reason, the map() forces a fetch of each array before calculating their length.
More generally, a "pure" Spark approach would be more efficient, but it is eluding me. What I tried so far is the following, but this is way more cumbersome than the approach above, and is also not complete:
Get the field names field0, etc. with [field.name for field in data_frame.select("col").schema.fields[0].dataType.fields] (cumbersome).
For each field name, efficiently calculate the size of its array:
sizes_one_field = data_frame.select(pyspark.sql.functions.size(
    data_frame.col.getField(field_name)))
Now, I am stuck at this point because I am not sure how to sum together the 1-column data frames sizes_one_field (there is one for each field name). Plus, maybe there is a way of directly applying the size() function to each field of column col in Spark (through some kind of map?)? Or some completely different approach to getting the total number of elements in each row?
You can try something like the following:
from pyspark.sql import functions as f
result = df.select(sum((f.size(df[col_name]) for col_name in df.columns), f.lit(0)))
This solution uses the pyspark.sql built-in functions and will be executed in an optimized way. For more information about these functions, you can check the PySpark documentation.
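If the arrays live inside the single struct column col, as in the question, a possible variant of the same idea is to discover the field names from the struct first and then sum the sizes per row (a sketch, assuming col is a struct whose fields are all arrays):
from pyspark.sql import functions as f

# Expand the struct to discover its (not fixed) field names from the schema
field_names = df.select("col.*").columns

# Sum the per-field array sizes row-wise using the builtin sum with a Column start value
result = df.select(
    sum((f.size(f.col("col").getField(name)) for name in field_names),
        f.lit(0)).alias("num_elements")
)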
In Python's Pandas, I am using the Data Frame as such:
drinks = pandas.read_csv(data_url)
Where data_url is a string URL to a CSV file
When indexing the frame for all "light drinkers", where a light drinker is defined as having 1 drink, the following is written:
drinks.light_drinker[drinks.light_drinker == 1]
Is there a more DRY-like way to self-reference the "parent"? I.e. something like:
drinks.light_drinker[self == 1]
You can now use query or assign depending on what you need:
drinks.query('light_drinker == 1')
or to mutate the df:
df.assign(strong_drinker = lambda x: x.light_drinker + 100)
Old answer
Not at the moment, but an enhancement along the lines of your idea is being discussed here. For simple cases, where might be enough. The new API might look like this:
df.set(new_column=lambda self: self.light_drinker*2)
In the most current version of pandas, .where() also accepts a callable!
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html?highlight=where#pandas.DataFrame.where
So, the following is now possible:
drinks.light_drinker.where(lambda x: x == 1)
which is particularly useful in method-chains. However, this will return only the Series (not the DataFrame filtered based on the values in the light_drinker column). This is consistent with your question, but I will elaborate for the other case.
To get a filtered DataFrame, use:
drinks.where(lambda x: x.light_drinker == 1)
Note that this will keep the shape of the DataFrame (meaning you will have rows where all entries are NaN, because the condition failed for the light_drinker value at that index).
If you don't want to preserve the shape of the DataFrame (i.e you wish to drop the NaN rows), use:
drinks.query('light_drinker == 1')
Note that the items in DataFrame.index and DataFrame.columns are placed in the query namespace by default, meaning that you don't have to reference self.
I don't know of any way to reference parent objects like self or this in pandas, but perhaps another way of doing what you want, which could be considered more DRY, is where():
drinks.where(drinks.light_drinker == 1, inplace=True)