creating a column based on missing value in pandas - python

I have a DataFrame for which I want to create a column that represents the pattern of missing values in each row.
For example, for the CSV file,
A,B,C,D
1,NaN,NaN,NaN
NaN,2,3,NaN
3,2,2,3
3,2,NaN,3
3,2,1,NaN
I want to create a column E whose value is assigned in the following way:
If A, B, C and D are all missing, E = 4,
If A, B, C and D are all present, E = 0,
If only A and B are missing, E = 1, and so on. The encoding of E need not be exactly as described; it only has to indicate the pattern. How can I approach this problem in pandas?

Use isnull in combination with sum(axis=1) to count the missing values in each row.
Example:
import pandas as pd
df = pd.DataFrame({'A': [1, None, 3, 3, 3],
                   'B': [None, None, 1, 1, 1]})
df['C'] = df.isnull().sum(axis=1)  # number of missing values per row
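The count above gives the same value to different patterns that have the same number of missing columns. If the goal is to tell the patterns apart, one possible sketch is to join the isnull flags of each row into a string. The column name E comes from the question; the frame reproduces the question's CSV, and the "1 = missing" encoding is only an assumption about what a useful pattern label looks like.

```python
import numpy as np
import pandas as pd

# Same data as the CSV in the question
df = pd.DataFrame({
    "A": [1, np.nan, 3, 3, 3],
    "B": [np.nan, 2, 2, 2, 2],
    "C": [np.nan, 3, 2, np.nan, 1],
    "D": [np.nan, np.nan, 3, 3, np.nan],
})

# Boolean frame: True where a value is missing
missing = df[["A", "B", "C", "D"]].isnull()

# One character per column, in column order: "1" = missing, "0" = present
df["E"] = missing.apply(
    lambda row: "".join("1" if is_missing else "0" for is_missing in row),
    axis=1,
)
print(df["E"].tolist())
# ['0111', '1001', '0000', '0010', '0001']
```

Rows with the same string share the same missing-value pattern, so the column can be used directly with groupby or value_counts.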

Related

Appending list with highest value from two columns using for loop in pandas

I have two columns: column A, and column B.
I would like to check, row by row, whether the value in column A is larger than the value in column B and, if it is, append that value to a list.
I'm able to append the list if the value in column A is higher than a set value, but I'm unsure how to compare it to the value from column B.
The below code appends the list if the value in column A is higher than 4. Hopefully I'm on the right track and can just substitute 4 with some other code?
values = []  # renamed from `list` to avoid shadowing the built-in
for x in A:  # A holds the values of column A
    if x > 4:
        values.append(x)
print(values)
Any help would be greatly appreciated.
Thank you!
An approach could be:
import pandas as pd
df = pd.DataFrame({"A":[2, 3, 4, 5], "B":[1, 4, 6, 3]}) # Test DataFrame
print(list(df[df["A"] > df["B"]]["A"]))
OUTPUT
[2, 5]
FOLLOW UP
If, as described in the comments, you want to check conditions on multiple columns:
import pandas as pd
df = pd.DataFrame({"A":[2, 3, 4, 5], "B":[1, 4, 6, 3], "C":[1, 1, 1, 10]}) # Test DataFrame
print(list(df[(df["A"] > df["B"]) & (df["A"] > df["C"])]["A"]))
OUTPUT
[2]
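If the number of comparison columns grows, chaining one condition per column becomes verbose. A minimal sketch, assuming the intent is "A is larger than every other listed column", compares A against the row-wise maximum of the others. The list name `others` is illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 4, 5], "B": [1, 4, 6, 3], "C": [1, 1, 1, 10]})

others = ["B", "C"]  # columns that A must exceed (illustrative name)

# A is larger than all of them exactly when it is larger than their row-wise maximum
mask = df["A"] > df[others].max(axis=1)
result = df.loc[mask, "A"].tolist()
print(result)
# [2]
```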

Speeding up complex functions on pandas

I am filling in NaN values in one column of my dataframe using the following code:
for i in tqdm(range(nadf.shape[0])):
    a = nadf["primary"][i]
    nadf["count"][i] = np.ceil(d[a]*a)
This code replaces the NaN values in "count" with the value that the dictionary d holds for the row's "primary", multiplied by "primary" and rounded up. nadf has 16 million rows. I understand that the execution will be slow, but is there a way to speed this up?
If I understood your question and the dataframe values correctly, the problem can be solved with pandas' built-in functionality as follows.
Please follow the comments in the code, and feel free to ask questions.
import pandas as pd
import numpy as np
import math
def fill_nan(row, _d):
    """Fill NaN values in the "count" column from the "primary" column value and the dictionary _d."""
    if math.isnan(row["count"]):
        # same formula as in the question: ceil(d[a] * a)
        return np.ceil(_d[row["primary"]] * row["primary"])
    return row["count"]  # not NaN, keep the existing value
if __name__ == "__main__":
    d = {1: 10, 2: 20, 3: 30}
    df = pd.DataFrame({
        "primary": [1, 2, 3, 1, 2, 1, 2],
        "count": [10.1, 4, 5, np.nan, np.nan, 4, np.nan]
    })
    df["count"] = df.apply(lambda row: fill_nan(row, d), axis=1)  # NaN values are replaced here
    print(df)
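A row-wise apply still calls a Python function once per row, which may remain slow at 16 million rows. A vectorized sketch, assuming every "primary" value is a key of d, maps the dictionary over the column and assigns only into the rows where "count" is NaN. The sample data is the same as in the example above; the variable name `mask` is illustrative.

```python
import numpy as np
import pandas as pd

d = {1: 10, 2: 20, 3: 30}
df = pd.DataFrame({
    "primary": [1, 2, 3, 1, 2, 1, 2],
    "count": [10.1, 4, 5, np.nan, np.nan, 4, np.nan],
})

# Rows whose "count" still needs to be filled
mask = df["count"].isna()

# Look up d[primary] for those rows only, multiply by primary, round up
primary = df.loc[mask, "primary"]
df.loc[mask, "count"] = np.ceil(primary.map(d) * primary)

print(df["count"].tolist())
# [10.1, 4.0, 5.0, 10.0, 40.0, 4.0, 40.0]
```

Assigning through df.loc with a boolean mask also avoids the chained assignment nadf["count"][i] = ... used in the question.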

obtaining indices of n max absolute values in dataframe row

Suppose I create a pandas DataFrame as below:
import pandas as pd
import numpy as np
np.random.seed(0)
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
As an example, this generates a 5x5 frame of random values (it is printed in the first answer below).
For each row, I am looking for a way to readily obtain the indices corresponding to the largest n (say 3) values in absolute value terms. For example, for the first row, I would expect [0,3,4]. We can assume that the results don't need to be ordered.
I tried searching for solutions similar to idxmax and argmax, but it seems these do not readily handle multiple values.
You can use argsort along axis 1 on the absolute values.
Given dataset:
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
           0          1         2          3          4
0  17.640523   4.001572  9.787380  22.408932  18.675580
1  -9.772779   9.500884 -1.513572  -1.032189   4.105985
2   1.440436  14.542735  7.610377   1.216750   4.438632
3   3.336743  14.940791 -2.051583   3.130677  -8.540957
4 -25.529898   6.536186  8.644362  -7.421650  22.697546
df.abs().values.argsort(1)[:, -3:][:, ::-1]
array([[3, 4, 0],
[0, 1, 4],
[1, 2, 4],
[1, 4, 0],
[0, 4, 2]])
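When the order of the returned indices does not matter, as the question allows, a full sort is more work than needed. The sketch below uses np.argpartition instead of argsort; it is an alternative to the answer above rather than part of it, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(10 * np.random.randn(5, 5))

n = 3
abs_values = df.abs().to_numpy()

# Partition each row so that the n largest absolute values occupy the last n
# positions (in no particular order), then keep those column indices
top_idx = np.argpartition(abs_values, -n, axis=1)[:, -n:]

# Sort the indices within each row only to make the printed result stable
top_sets = [sorted(row) for row in top_idx.tolist()]
print(top_sets[0])
# [0, 3, 4]
```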
Try this (it is not the optimal code):
idx_nmax = {}
n = 3
for index, row in df.iterrows():
    # take absolute values first, since the question asks for the largest values in absolute terms
    idx_nmax[index] = list(row.abs().nlargest(n).index)
At the end you will have a dictionary with:
as keys, the index of each row
and as values, the column indices of the n largest absolute values in that row

python dataframe how to convert set column to list

I tried to convert a column of sets to lists in a pandas DataFrame, but failed. I'm not sure what the best way to do so is. Thanks.
Here is the example:
I tried to create a column 'c' that converts the sets in column 'b' to lists, but 'c' still contains sets.
data = [{'a': [1,2,3], 'b':{11,22,33}},{'a':[2,3,4],'b':{111,222}}]
tdf = pd.DataFrame(data)
tdf['c'] = list(tdf['b'])
tdf
           a             b             c
0  [1, 2, 3]  {33, 11, 22}  {33, 11, 22}
1  [2, 3, 4]    {222, 111}    {222, 111}
You could do:
import pandas as pd
data = [{'a': [1,2,3], 'b':{11,22,33}},{'a':[2,3,4],'b':{111,222}}]
tdf = pd.DataFrame(data)
tdf['c'] = [list(e) for e in tdf.b]
print(tdf)
Use apply:
tdf['c'] = tdf['b'].apply(list)
This works because apply calls list on each element, whereas list(tdf['b']) only turns the column itself into a list whose elements are still sets.
Or do:
tdf['c'] = tdf['b'].map(list)
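Sets are unordered, so the element order of the resulting lists is not guaranteed. If a deterministic order matters, one option (an assumption about the use case, not something the question asks for) is to pass sorted instead of list, since sorted also returns a list. The sketch also shows what list(tdf['b']) actually produces.

```python
import pandas as pd

data = [{'a': [1, 2, 3], 'b': {11, 22, 33}}, {'a': [2, 3, 4], 'b': {111, 222}}]
tdf = pd.DataFrame(data)

# list() on the column returns a list of its elements, which are still sets
whole = list(tdf['b'])
print(type(whole[0]).__name__)
# set

# sorted() is applied per element and returns a list in ascending order
tdf['c'] = tdf['b'].apply(sorted)
print(tdf['c'].tolist())
# [[11, 22, 33], [111, 222]]
```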

Detect Missing Column Labels in Pandas

I'm working with the dataset outlined here:
https://archive.ics.uci.edu/ml/datasets/Balance+Scale
I'm trying to create a general function that can parse any categorical data following these two rules:
Must have a column labeled class containing the class of the object
Each row must have the same numbers of columns
Minimal example of the data that I'm working with:
Class,LW,LD,RW,RD
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
This provides 3 unique classes: B, L, R. It also provides 4 features which pertain to each entry: LW, LD, RW and RD.
The following is a part of my function to handle generic cases, but my issue with it is that I don't know how to check if any column labels are simply missing:
import pandas as pd
import sys
dataframe = pd.read_csv('Balance_Data.csv')
columns = list(dataframe.columns.values)
if "Class" not in columns:
    sys.exit("'Class' is not a column in the data")
if "Class.1" in columns:  # pandas renames a duplicated 'Class' label to 'Class.1'
    sys.exit("Cannot specify more than one 'Class' column")
columns.remove("Class")
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
inputX = dataframe.loc[:, columns].to_numpy()
inputY = dataframe.loc[:, ['Class']].to_numpy()
At this point, the correct values are:
inputX = array([[1, 1, 1, 1],
                [1, 2, 1, 1],
                [1, 2, 1, 3],
                [2, 2, 4, 5]])
inputY = array([['B'],
                ['L'],
                ['R'],
                ['R']], dtype=object)
But if I remove the last column label (RD) and reprocess,
Class,LW,LD,RW
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
I get:
inputX = array([[1, 1, 1],
                [2, 1, 1],
                [2, 1, 3],
                [2, 4, 5]])
inputY = array([[1],
                [1],
                [1],
                [2]])
This suggests that the labels are matched to the values from right to left instead of left to right, which means that any data passed to this function without the right number of labels will not be handled correctly.
How can I check that the dimension of the rows is the same as the number of columns? (It can be assumed that there are no gaps in the data itself, that each row of data beyond the columns always has the same number of elements in it)
I would pull it out as follows:
In [11]: df = pd.read_csv('Balance_Data.csv', index_col=0)
In [12]: df
Out[12]:
       LW  LD  RW  RD
Class
B       1   1   1   1
L       1   2   1   1
R       1   2   1   3
R       2   2   4   5
That way the assertion check can be:
if "Class" in df.columns:
    sys.exit("'Class' must be the first column and appear only once, and the number of column labels must match all rows")
and then check that there are no NaNs in the last column:
In [21]: df.iloc[:, -1].notnull().all()
Out[21]: True
Note: this happens e.g. with the following (bad) csv:
In [31]: !cat bad.csv
A,B,C
1,2
3,4
In [32]: df = pd.read_csv('bad.csv', index_col=0)
In [33]: df
Out[33]:
   B   C
A
1  2 NaN
3  4 NaN
In [34]: df.iloc[:, -1].notnull().all()
Out[34]: False
I think these are the only two failing cases (but I think the error messages can be made clearer)...
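Another way to catch a missing or extra label, before pandas interprets the file at all, is to compare the number of fields in the header with the number of fields in every data row. The sketch below uses the standard csv module on in-memory text; the function name find_ragged_rows is hypothetical, and a real file would be opened with open() instead of io.StringIO.

```python
import csv
import io


def find_ragged_rows(text):
    """Return (header_length, [(line_number, field_count), ...]) for every
    data row whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    bad = [
        (line_no, len(row))
        for line_no, row in enumerate(reader, start=2)  # header is line 1
        if len(row) != len(header)
    ]
    return len(header), bad


good = "Class,LW,LD,RW,RD\nB,1,1,1,1\nL,1,2,1,1\n"
missing_label = "Class,LW,LD,RW\nB,1,1,1,1\nL,1,2,1,1\n"

print(find_ragged_rows(good))
# (5, [])
print(find_ragged_rows(missing_label))
# (4, [(2, 5), (3, 5)])
```

An empty list means every row matches the header, so the file can then be handed to pd.read_csv with more confidence that labels and values line up.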
