Function within a function involving each column of a DataFrame in Python

As the question title states, I'm trying to learn how to run a function on each element of a column within a DataFrame without having to name that column directly. The point is that I would like to be able to pass in any given DataFrame and find every element within each column that fulfils a particular condition.
The sample I've included illustrates what I'm trying to do. I know the code below doesn't work; I thought that writing def fun(dataframe[column]) would do the trick, but that syntax is incorrect, unfortunately.
Basically, the reason for this is that I have multiple sets of data where I'd like to locate each element that is above a set threshold.
Thanks a lot in advance!
df = pd.DataFrame(np.random.randint(0, 100, size=(3, 3)), columns=list('ABC'))

def fun(dataframe):
    for column in dataframe:
        def fun(column):
            mean = sum(column) / len(column)
            print(mean)
            for element in column:
                if element < mean * 1.1:
                    element = 0
                    print(element)

fun(df)

As @MadPhysicist mentioned in a comment, pandas was created to reduce the need for explicit for-looping.
If I understand your specific case correctly, you intend to replace with zero any element that is less than 1.1 times the mean value of its column. Here's one way to do that in idiomatic pandas:
import numpy as np
import pandas as pd

# Set a random seed for repeatability
np.random.seed(314159)

# Create example data
df = pd.DataFrame(np.random.randint(0, 100, size=(3, 3)), columns=list('ABC'))
df
A B C
0 11 34 93
1 79 0 81
2 66 43 71
# By default, df.mean() computes the mean of each numeric column (not row)
df.mean()
A 52.000000
B 25.666667
C 81.666667
dtype: float64
# We can use boolean indexing to replace values less than
# 1.1 * column mean with zero
# docs: https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
df[df < 1.1 * df.mean()] = 0
df
A B C
0 0 34 93
1 79 0 0
2 66 43 0
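The same replacement can also be written with DataFrame.mask, which replaces values where a condition holds; a minimal equivalent sketch, assuming the same df as above:

import numpy as np
import pandas as pd

np.random.seed(314159)
df = pd.DataFrame(np.random.randint(0, 100, size=(3, 3)), columns=list('ABC'))

# mask() replaces values where the condition is True, so this is equivalent
# to the boolean-indexing assignment above
df = df.mask(df < 1.1 * df.mean(), 0)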

Related

Faster way to find all columns with no missing values?

Currently I am using this statement to find all columns in a dataframe that have no missing values. It works fine, but I'm wondering if there is a more concise (yet still efficient) way to do the same thing?
df.columns[ np.sum(df.isnull()) == 0 ]
To answer the question properly, one would need access to the dataframe in question. Without it, there are various methods one can use.
Let's consider the following dataframe as example
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
df.iloc[0:10, 0] = np.nan
[Out]:
A B C D
0 NaN 89 63 41
1 NaN 12 47 8
2 NaN 79 76 67
3 NaN 87 61 38
4 NaN 28 31 30
Method 1 - As the OP indicated (used as the reference below)
df.columns[ np.sum(df.isnull()) == 0 ]
Method 2 - Similar to Method 1, with numpy.sum and pandas.isnull, but wrapped in a lambda
df.columns[ df.apply(lambda x: np.sum(x.isnull()) == 0) ]
Method 3 - Using numpy.all and pandas.DataFrame.notnull
columns = df.columns[ np.all(df.notnull(), axis=0) ]
Method 4 - Using only pandas built-in modules
columns = df.columns[ df.isnull().sum() == 0 ]
Method 5 - Using pandas.DataFrame.isna
columns = df.columns[ df.isna().any() == False ]
The output in all cases is the one the OP wants, more specifically
Index(['B', 'C', 'D'], dtype='object')
If one times each of the methods with time.perf_counter() (there are additional ways to measure the time of execution), one will get the following
method time
0 method 1 2.999996e-07
1 method 2 3.000005e-07
2 method 3 2.000006e-07
3 method 4 6.000000e-07
4 method 5 3.999994e-07
Again, this might change depending on the dataframe that one uses. Also, depending on the requirements (hardware, and business requirements), there might be other ways to achieve the same goal.
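For reference, a minimal sketch of how such a timing can be taken with time.perf_counter(), shown here for Method 4 only (the other methods can be timed the same way):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
df.iloc[0:10, 0] = np.nan

start = time.perf_counter()
columns = df.columns[df.isnull().sum() == 0]   # Method 4
elapsed = time.perf_counter() - start
print(columns, elapsed)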
You can use this:
df.isna().any()  # returns a boolean per column: True if the column has missing values, False if it has none
df.columns[df.isna().any()]  # returns only the column names with missing values
df.columns[~df.isna().any()]  # the tilde negates the condition, returning the columns with no missing values
df.columns[~df.isna().any()].tolist()  # .tolist() converts the result to a list, if you wish

Pandas apply function and update copy of dataframe

I have these data frames:
df = pd.DataFrame({'A':[1,2,2,1],'B':[20,21,22,32],'C':[4,5,6,7],'D':[99,98,97,96]})
dfcopy = df.copy()
I want to apply a function to values in df columns 'B' and 'C' based on value in col 'A' and then update the result in corresponding rows in dfcopy.
For example, for each row where 'A' is 1, get the 'B' and 'C' values for that row, apply function, and store results in dfcopy. For the first row where 'A'==2, the value for 'B' is 21 and 'C' is 5. Assume the function is to multiply by 2x2 ones matrix: np.dot(np.ones((2,2)),np.array([[21],[5]])). Then we want df[1,'B']=26 and df[1,'C']=26. Then I want to repeat for a different value in A until the function has been applied uniquely based on each value in A.
Lastly, I don't want to iterate row by row, check the value in A, and apply the function. This is because there will be a different operation for each value of A (i.e. the np.ones((2,2)) will be replaced by values from a file corresponding to the value in A), and I don't want to repeat it.
I'm sure I can force a solution (e.g. by looping and setting values), but I'm guessing there is an elegant way to do this with Pandas API. I just can't find it.
In the example below I picked different matrices so it's obvious that I have applied them.
df = pd.DataFrame({'A':[1,2,2,1],'B':[20,21,22,32],'C':[4,5,6,7],'D':[99,98,97,96]})
matrices = [None,pd.DataFrame([[1,0],[0,0]],index=["B","C"]),pd.DataFrame([[0,0],[0,1]],index=["B","C"])]
df[["B","C"]] = pd.concat((df[df["A"] == i][["B","C"]].dot(matrices[i]) for i in set(df["A"])))
A B C D
0 1 20 0 99
1 2 0 5 98
2 2 0 6 97
3 1 32 0 96
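Since the question asks to store the result in dfcopy rather than modify df, the same assignment can target the copy instead; a minimal sketch, assuming matrices is defined as above and df still holds the original values:

dfcopy = df.copy()
dfcopy[["B", "C"]] = pd.concat(
    (df[df["A"] == i][["B", "C"]].dot(matrices[i]) for i in set(df["A"]))
)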

Identify increasing features in a data frame

I have a data frame in which some features hold cumulative values. I need to identify those features in order to revert the cumulative values.
This is how my dataset looks (plus about 50 variables):
a b
346 17
76 52
459 70
680 96
679 167
246 180
What I wish to achieve is:
a b
346 17
76 35
459 18
680 26
679 71
246 13
I've seen this answer, but it first reverts the values and then tries to identify the columns. Can't I do it the other way around? First identify the features and then revert the values?
Finding cumulative features in dataframe?
What I do at the moment is run the following code in order to get the names of the features with cumulative values:
def accmulate_col(value):
    count = 0
    count_1 = False
    name = []
    for i in range(len(value)-1):
        if value[i+1] - value[i] >= 0:
            count += 1
        if value[i+1] - value[i] > 0:
            count_1 = True
    name.append(1) if count == len(value)-1 and count_1 else name.append(0)
    return name

df.apply(accmulate_col)
Afterwards, I save these feature names manually in a list called cum_features and revert the values, creating the desired dataset:
df_clean = df.copy()
df_clean[cum_features] = df_clean[cum_features].apply(lambda col: np.diff(col, prepend=0))
Is there a better way to solve my problem?
To identify which columns have increasing* values throughout the whole column, you need to apply a condition to all of the values. So in that sense, you have to use the values first to figure out which columns fit the condition.
With that out of the way, given a dataframe such as:
import pandas as pd

d = {'a': [1, 2, 3, 4],
     'b': [4, 3, 2, 1]}
df = pd.DataFrame(d)
#Output:
a b
0 1 4
1 2 3
2 3 2
3 4 1
Figuring out which columns contain increasing values is just a question of using diff on all values in the dataframe, and checking which ones are increasing throughout the whole column.
That can be written as:
out = (df.diff().dropna()>0).all()
#Output:
a True
b False
dtype: bool
Then, you can just use the column names to select only the columns marked True:
new_df = df[df.columns[out]]
#Output:
a
0 1
1 2
2 3
3 4
*(The term cumulative doesn't really describe the condition you used. Did you want it to be cumulative or just increasing? Cumulative implies that the value in a particular row/index is the sum of all previous values up to that index, while increasing just means that the value in the current row/index is greater than the previous one.)
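Putting detection and reversal together, here is a minimal sketch that treats a column as cumulative when it never decreases and increases at least once (mirroring the count/count_1 logic in the question), then reverts it with diff:

import pandas as pd

df = pd.DataFrame({'a': [346, 76, 459, 680, 679, 246],
                   'b': [17, 52, 70, 96, 167, 180]})

# A column counts as "cumulative" if it is non-decreasing and actually increases somewhere
is_cum = df.apply(lambda col: col.is_monotonic_increasing and col.nunique() > 1)

df_clean = df.copy()
cols = df.columns[is_cum]
# diff() leaves NaN in the first row; filling it with the original value
# matches np.diff(col, prepend=0)
df_clean[cols] = df_clean[cols].diff().fillna(df_clean[cols])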

pandas DataFrame sum method works counterintuitively

import numpy as np
from pandas import DataFrame

my_df = DataFrame(np.arange(1, 13).reshape(4, 3), columns=list('abc'))
my_df.sum(axis="rows")
Output is:
a 22
b 26
c 30
# I expect it to sum by rows, thereby giving
0 6
1 15
2 24
3 33
my_df.sum(axis="columns") //helps achieve this
Why does it work counterintutively?
In a similar context, the drop method works as I expect, i.e. when I write
my_df.drop(['a'], axis="columns")
# this drops column "a"
Am I missing something? Please enlighten.
Short version
It is a naming convention: the sum of the columns gives a row-wise sum. You are looking for axis='columns'.
Long version
Ok, that was interesting. In pandas, axis 0 is the index (rows) and axis 1 is the columns; a reduction along axis 0 is computed down each column, while a reduction along axis 1 is computed across each row.
Looking in the docs we find that the allowed params are:
axis : {index (0), columns (1)}
The string "rows" is accepted as an alias for axis 0 (the index axis), so my_df.sum(axis="rows") sums down each column and returns one value per column. This can be read as: the sum along the index returns the column sums, and the sum along the columns returns the row sums. What you want is axis=1 or axis='columns', which gives your desired output:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,13).reshape(4,3), columns=list('abc'))
print(df.sum(axis=1))
Returns:
0 6
1 15
2 24
3 33
dtype: int64
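For comparison, a minimal sketch showing the two axis spellings side by side on the same frame:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape(4, 3), columns=list('abc'))

print(df.sum(axis='index'))    # same as axis=0: one sum per column
print(df.sum(axis='columns'))  # same as axis=1: one sum per row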

Accessing first column of pandas value_counts

I'm trying to use the value_counts() function from the pandas package to find the frequency of items in a column. This works and outputs the following:
57 1811
62 630
71 613
53 217
59 185
68 88
52 70
Name: hospitalized, dtype: int64
Here the left column is the item and the right column is its frequency in the original column.
From there, I wanted to access the first column of items and iterate through that in a for loop. I want to be able to access the item of each row and check if it is equal to another value. If this is true, I want to be able to access the second column and divide it by another number.
My big issue is accessing the first column from the .value_counts() output. Is it possible to access this column and if so, how? The columns aren't named anything specific (since it's just the value_counts() output) so I'm unsure how to access them.
Use pandas' Series.iteritems():
df = pd.DataFrame({'mycolumn': [1, 2, 2, 2, 3, 3, 4]})
for val, cnt in df.mycolumn.value_counts().iteritems():
    print('value', val, 'was found', cnt, 'times')
value 2 was found 3 times
value 3 was found 2 times
value 4 was found 1 times
value 1 was found 1 times
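Note that in recent pandas versions Series.iteritems() has been removed in favour of Series.items(); the same loop then reads:

import pandas as pd

df = pd.DataFrame({'mycolumn': [1, 2, 2, 2, 3, 3, 4]})
for val, cnt in df['mycolumn'].value_counts().items():
    print('value', val, 'was found', cnt, 'times')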
value_counts returns a Pandas Series:
df = pd.DataFrame(np.random.choice(list("abc"), size=10), columns = ["X"])
df["X"].value_counts()
Out[243]:
c 4
b 3
a 3
Name: X, dtype: int64
For the array of individual values, you can use the index of the Series:
vl_list = df["X"].value_counts().index
Index(['c', 'b', 'a'], dtype='object')
It is of type "Index" but you can iterate over it:
for idx in vl_list:
    print(idx)
c
b
a
Or for the numpy array, you can use df["X"].value_counts().index.values
You can access the first column by using .keys() or index as below:
df.column_name.value_counts().keys()
df.column_name.value_counts().index
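If the goal is simply to look up the count for one particular value and divide it by another number, as described in the question, indexing the Series by label avoids the loop entirely; a minimal sketch with made-up values for the target and divisor:

import pandas as pd

df = pd.DataFrame({'hospitalized': [57, 57, 62, 71, 57, 62]})
counts = df['hospitalized'].value_counts()

target, divisor = 57, 100   # hypothetical values
if target in counts.index:
    print(counts[target] / divisor)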
