Pandas where condition - python

I have a dataset as shown below
import numpy as np
import pandas as pd

test = pd.DataFrame({'number': [1, 2, 3, 4, 5, 6, 7],
                     'A': [0, 0, 0, 0, 0, 0, 1],
                     'B': [0, 0, 1, 0, 0, 0, 0],
                     'C': [0, 1, 0, 0, 0, 0, 1],
                     'D': [0, 0, 1, 0, 0, 0, 1],
                     'E': [0, 0, 0, 0, 0, 0, 1],
                     })
I am trying to create a flag column at the end with the condition: if 'number' <= 5 and (A != 0 | B != 0 | C != 0 | D != 0 | E != 0), then 1, else 0.
np.where(((test['number'] <= 5) &
          ((test['A'] != 0) |
           (test['B'] != 0) |
           (test['C'] != 0) |
           (test['D'] != 0) |
           (test['E'] != 0))), 1, 0)
This worked, but I am trying to simplify the query by not hard-coding the column names A/B/C/D/E, since they change (the names may change and the number of columns may also change). Only one column remains static: the 'number' column.

Let's try any on axis=1 instead of joining the conditions with |:
test['flag'] = np.where(
    test['number'].le(5) &
    test.iloc[:, 1:].ne(0).any(axis=1), 1, 0
)
test:
number A B C D E flag
0 1 0 0 0 0 0 0
1 2 0 0 1 0 0 1
2 3 0 1 0 1 0 1
3 4 0 0 0 0 0 0
4 5 0 0 0 0 0 0
5 6 0 0 0 0 0 0
6 7 1 0 1 1 1 0
Lots of options to select the columns (a sketch using one of them follows below):
- Select by position with iloc, everything after the first column -> test.iloc[:, 1:]
- Select by label with loc, column 'A' and after -> test.loc[:, 'A':]
- Select all columns except 'number' with Index.difference -> test[test.columns.difference(['number'])]
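For example, a minimal sketch dropping the Index.difference selection into the flag computation (using the example frame above; any of the three selections works the same way):

import numpy as np

# Keep every column except the static 'number' column.
cols = test.columns.difference(['number'])
test['flag'] = np.where(
    test['number'].le(5) & test[cols].ne(0).any(axis=1), 1, 0
)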


How to pivot dataframe into ML format

My head is spinning trying to figure out if I have to use pivot_table, melt, or some other function.
I have a DF that looks like this:
month day week_day classname_en origin destination
0 1 7 2 1 2 5
1 1 2 6 2 1 167
2 2 1 5 1 2 54
3 2 2 6 4 1 6
4 1 2 6 5 6 1
But I want to turn it into something like:
month_1 month_2 ...classname_en_1 classname_en_2 ... origin_1 origin_2 ...destination_1
0 1 0 1 0 0 1 0
1 1 0 0 1 1 0 0
2 0 1 1 0 0 1 0
3 0 1 0 0 1 0 0
4 1 0 0 0 0 0 1
Basically, turn all values into columns and then have binary rows: 1 if the value is present in that row, 0 if not.
I don't know if it is at all possible to do with a single function, but I would appreciate any help!
To expand on @Corralien's answer:
It is indeed a way to do it, but since you are doing this for ML purposes, you might introduce a bug. With the code above you get a matrix with 20 features. Now, say you want to predict on some data which suddenly has one more month than your training data; then your prediction matrix would have 21 features, so you cannot pass it into your fitted model. To overcome this you can use OneHotEncoder from scikit-learn. It will make sure that you always have the same number of features on "new data" as in your training data.
import pandas as pd

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
pd.get_dummies(df_train)
# output
   age  color_blue  color_red
0   10           0          1
1   15           1          0

df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})
pd.get_dummies(df_new)
# output
   age  color_blue  color_green  color_red
0   10            0            0          1
1   15            1            0          0
2   20            0            1          0
And as you can see, the order of the binary color representation has also changed. If, on the other hand, we use OneHotEncoder, you can avoid all of these issues:
from sklearn.preprocessing import OneHotEncoder

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
ohe = OneHotEncoder(handle_unknown="ignore")
color_ohe_transformed = ohe.fit_transform(df_train[["color"]])  # creates a sparse matrix
ohe_features = ohe.get_feature_names_out()  # ['color_blue', 'color_red']
pd.DataFrame(color_ohe_transformed.todense(), columns=ohe_features, dtype=int)
# output
   color_blue  color_red
0           0          1
1           1          0

# now transform new data
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})
new_data_ohe_transformed = ohe.transform(df_new[["color"]])
pd.DataFrame(new_data_ohe_transformed.todense(), columns=ohe_features, dtype=int)
# output
   color_blue  color_red
0           0          1
1           1          0
2           0          0
Note in the last row that both blue and red are zero, since its color is "green", which was not present in the training data.
Note that the todense() function is only used here to illustrate how it works. Usually you would keep it as a sparse matrix and use e.g. scipy.sparse.hstack to append your other features, such as age, to it.
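As an illustration, here is a minimal sketch of that sparse workflow on the same toy data; the hstack usage is an assumption about how you would combine the features, not part of the original answer:

import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
ohe = OneHotEncoder(handle_unknown="ignore")
color_sparse = ohe.fit_transform(df_train[["color"]])  # stays sparse

# Append the numeric 'age' feature without ever densifying the matrix.
X_train = hstack([color_sparse, csr_matrix(df_train[["age"]].values)])
print(X_train.shape)  # (2, 3): color_blue, color_red, age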
Use pd.get_dummies:
out = pd.get_dummies(df, columns=df.columns)
print(out)
# Output
month_1 month_2 day_1 day_2 day_7 week_day_2 week_day_5 ... origin_2 origin_6 destination_1 destination_5 destination_6 destination_54 destination_167
0 1 0 0 0 1 1 0 ... 1 0 0 1 0 0 0
1 1 0 0 1 0 0 0 ... 0 0 0 0 0 0 1
2 0 1 1 0 0 0 1 ... 1 0 0 0 0 1 0
3 0 1 0 1 0 0 0 ... 0 0 0 0 1 0 0
4 1 0 0 1 0 0 0 ... 0 1 1 0 0 0 0
[5 rows x 20 columns]
You can use the get_dummies function of pandas to convert the row values to columns. For your data the code would be:
import pandas as pd

df = pd.DataFrame({
    'month': [1, 1, 2, 2, 1],
    'day': [7, 2, 1, 2, 2],
    'week_day': [2, 6, 5, 6, 6],
    'classname_en': [1, 2, 1, 4, 5],
    'origin': [2, 1, 2, 1, 6],
    'destination': [5, 167, 54, 6, 1]
})
response = pd.get_dummies(df, columns=df.columns)
print(response)
The result matches the get_dummies output shown in the previous answer.

Comparing two pandas dataframes with different sizes

I want to compare two dataframes whose contents are 1s and 0s. I run for loops to check every element of the dataframes, and at the end I want to replace the values in dataframe out that equal 1 in both dataframes with the letter "d", and the values that differ between the dataframes with the letter "i". This code is too slow and I need some input to make it efficient and faster; does anyone have any idea? Also, the df dataframe is 420x420 and out is 410x410.
a1 = out.columns.values
a2 = df.columns.values
b1 = out.index.values
b2 = df.index.values
for a in a1:
    for b in b1:
        for c in a2:
            for d in b2:
                if a == c and b == d:
                    if out.loc[b, a] == 1 and df.loc[d, c] == 1:
                        out.loc[b, a] = "d"
                    elif out.loc[b, a] != df.loc[d, c]:
                        out.loc[d, c] = "i"
                    else:
                        pass
A small example for better understanding:
Dataframe df
| 1 | 2 | 3 | 4 |
|---|---|---|---|
| 1 | 0 | 1 | 1 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 |
Dataframe out
| 1 | 2 | 3 | 4 |
|---|---|---|---|
| 1 | 0 | 1 | 1 |
| 2 | 1 | 0 | 1 |
| 3 | 1 | 1 | 0 |
| 4 | 0 | 0 | 0 |
And the resulting dataframe out should look like this:
| 1 | 2 | 3 | 4 |
|---|---|---|---|
| 1 | 0 | d | d |
| 2 | d | 0 | i |
| 3 | d | i | 0 |
| 4 | 0 | 0 | 0 |
I created your dataframes like these:
import numpy as np
import pandas as pd

# df creation
data1 = [
    [1, 0, 1, 1],
    [2, 1, 0, 0],
    [3, 1, 0, 0],
    [4, 0, 0, 0]
]
df = pd.DataFrame(data1, columns=[1, 2, 3, 4])
| 1 | 2 | 3 | 4 |
|---|---|---|---|
| 1 | 0 | 1 | 1 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 |
# df_out creation
data2 = [
    [1, 0, 1, 1],
    [2, 1, 0, 1],
    [3, 1, 1, 0],
    [4, 0, 0, 0]
]
df_out = pd.DataFrame(data2, columns=[1, 2, 3, 4])
| 1 | 2 | 3 | 4 |
|---|---|---|---|
| 1 | 0 | 1 | 1 |
| 2 | 1 | 0 | 1 |
| 3 | 1 | 1 | 0 |
| 4 | 0 | 0 | 0 |
# Then I used 'np.where' on all intersected columns.
intersected_columns = set(df.columns).intersection(df_out.columns)
for col in intersected_columns:
    if col != 1:  # I think the first column is the index
        df_out[col] = np.where(  # first condition
            (df[col] == 1) & (df_out[col] == 1),
            "d",  # if the first condition is true
            np.where(  # if the first condition is false, apply the second condition
                df[col] != df_out[col],
                "i",
                df_out[col]
            )
        )
Output like this:
| 1 | 2 | 3 | 4 |
|----:|:----|:----|:----|
| 1 | 0 | d | d |
| 2 | d | 0 | i |
| 3 | d | i | 0 |
| 4 | 0 | 0 | 0 |
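For completeness, a vectorized sketch that avoids the per-column Python loop entirely; it assumes the row/column labels are real axis labels (not a data column like column 1 above) and that the smaller frame's labels are a subset of the larger one's, as in the 420x420 vs 410x410 case:

# Compare only the labels present in both frames.
rows = out.index.intersection(df.index)
cols = out.columns.intersection(df.columns)

sub_out, sub_df = out.loc[rows, cols], df.loc[rows, cols]
both_one = (sub_out == 1) & (sub_df == 1)  # 1 in both -> "d"
differ = sub_out != sub_df                 # values disagree -> "i"

out = out.astype(object)  # allow letters alongside the remaining 0s
out.loc[rows, cols] = sub_out.mask(both_one, "d").mask(differ, "i")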

Parsing values to specific columns in Pandas

I would like to use Pandas to parse Q26 Challenges into the subsequent columns, with a "1" representing its presence in the original unparsed column. So the data frame initially looks like this:
ID  Q26 Challenges  Q26_1  Q26_2  Q26_3  Q26_4  Q26_5  Q26_6  Q26_7
1   5               0      0      0      0      0      0      0
2   1,2             0      0      0      0      0      0      0
3   1,3,7           0      0      0      0      0      0      0
And I want it to look like this:
ID  Q26 Challenges  Q26_1  Q26_2  Q26_3  Q26_4  Q26_5  Q26_6  Q26_7
1   5               0      0      0      0      1      0      0
2   1,2             1      1      0      0      0      0      0
3   1,3,7           1      0      1      0      0      0      1
You can iterate over the range of values in Q26 Challenges, using str.contains to check if the current value is contained in the string and then converting that boolean value to an integer. For example:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'Q26 Challenges': ['0', '1,2', '2', '1,2,6,7', '3,4,5,11']})
for i in range(1, 12):
    df[f'Q26_{i}'] = df['Q26 Challenges'].str.contains(rf'\b{i}\b').astype(int)
df
Output:
id Q26 Challenges Q26_1 Q26_2 Q26_3 Q26_4 Q26_5 Q26_6 Q26_7 Q26_8 Q26_9 Q26_10 Q26_11
0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1,2 1 1 0 0 0 0 0 0 0 0 0
2 3 2 0 1 0 0 0 0 0 0 0 0 0
3 4 1,2,6,7 1 1 0 0 0 1 1 0 0 0 0
4 5 3,4,5,11 0 0 1 1 1 0 0 0 0 0 1
str.get_dummies can be used on the 'Q26 Challenges' column to create the indicator values. This indicator DataFrame can be reindexed to include the complete result range (note column headers will be of type string). add_prefix can be used to add the 'Q26_' to the column headers. Lastly, join back to the original DataFrame:
df = df.join(
    df['Q26 Challenges'].str.get_dummies(sep=',')
    .reindex(columns=map(str, range(1, 8)), fill_value=0)
    .add_prefix('Q26_')
)
The reindexing can also be done dynamically based on the resulting columns. It is necessary to convert the resulting column headers to numbers first to ensure numeric order, rather than lexicographic ordering:
s = df['Q26 Challenges'].str.get_dummies(sep=',')
# Convert to numbers to correctly access min and max
s.columns = s.columns.astype(int)
# Add back to DataFrame
df = df.join(s.reindex(
    # Build a range from the min to the max column value
    columns=range(min(s.columns), max(s.columns) + 1),
    fill_value=0
).add_prefix('Q26_'))
Both options produce:
ID Q26 Challenges Q26_1 Q26_2 Q26_3 Q26_4 Q26_5 Q26_6 Q26_7
0 1 5 0 0 0 0 1 0 0
1 2 1,2 1 1 0 0 0 0 0
2 3 1,3,7 1 0 1 0 0 0 1
Given initial input:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Q26 Challenges': ['5', '1,2', '1,3,7']
})
ID Q26 Challenges
0 1 5
1 2 1,2
2 3 1,3,7

Discard rows and corresponding columns of a matrix that are all 0 [duplicate]

This question already has answers here:
Efficiently test matrix rows and columns with numpy
(2 answers)
Closed 4 years ago.
I have a square matrix that looks something like
0 0 0 0 0 0 0
1 0 0 0 1 0 0
1 1 1 0 0 0 0
0 0 0 0 0 0 0
1 1 1 0 1 1 0
1 1 1 0 1 1 0
0 0 0 0 0 0 0
E.g., the output of this would be:
0 0 0 | 0 0 |
1 0 0 | 1 0 |
1 1 1 | 0 0 |
- - - + - - +
1 1 1 | 1 1 |
1 1 1 | 1 1 |
- - - + - - +
0 0 0 0 0
1 0 0 1 0
1 1 1 0 0
1 1 1 1 1
1 1 1 1 1
Notice how the 4th row and column are all 0, as well as the last. I would like to delete rows and columns if and only if the ith row and the ith column are all 0s. (Also note that the first row of 0s remains since the first column contains non-zero elements.)
Is there a clean and easy way to do this without looping through each one?
Assume a is a square numpy array (same size on both dimensions):
# find out the index to keep
keep_idx = a.any(0) | a.any(1)
# subset the array
a[keep_idx][:, keep_idx]
#array([[0, 0, 0, 0, 0],
# [1, 0, 0, 1, 0],
# [1, 1, 1, 0, 0],
# [1, 1, 1, 1, 1],
# [1, 1, 1, 1, 1]])
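Equivalently, the two subsetting steps can be collapsed into one indexing call with np.ix_, which accepts boolean masks for both axes (a small sketch on the matrix from the question):

import numpy as np

a = np.array([[0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 1, 0, 0],
              [1, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0],
              [1, 1, 1, 0, 1, 1, 0],
              [1, 1, 1, 0, 1, 1, 0],
              [0, 0, 0, 0, 0, 0, 0]])

keep_idx = a.any(axis=0) | a.any(axis=1)  # keep i if row i or column i has a nonzero
print(a[np.ix_(keep_idx, keep_idx)])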
Suppose we have a 7x7 data frame, similar to your matrix; then the following code does the work:
row_sum = df.sum(axis=1)
col_sum = df.sum(axis=0)
lst = []
for i in range(len(df)):
    if (row_sum[i] == 0) & (col_sum[i] == 0):
        lst.append(i)
df1 = df.drop(lst, axis=1).drop(lst, axis=0)
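The same idea can also be vectorized in pandas without the explicit loop, assuming a square frame whose row and column labels match:

# Keep label i if row i or column i contains a nonzero value.
keep = (df.sum(axis=1) != 0) | (df.sum(axis=0) != 0)
df1 = df.loc[keep, keep]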

Finding columns which are unique to a row in Pandas dataframe

I have a dataframe of the below structure. I want to get the column numbers which are unique to a particular row.
1 1 0 1 1 1 0 0 0
0 1 0 1 0 0 0 0 0
0 1 0 0 1 0 0 0 0
1 0 0 0 1 0 0 0 1
0 0 0 0 0 0 1 1 0
1 0 0 0 1 0 0 0 0
In the above example I should get coln6, coln7, coln8 and coln9 (as there is only one row that has a value in each of these columns). I should also be able to distinguish among the columns: for example, coln7 and coln8 should group together, as they are unique to the same row. Is there an efficient solution in Python for this?
You can call sum on the df, compare the result against 1, and use this to mask the columns:
In [19]:
df.columns[df.sum(axis=0) == 1]
Out[19]:
Int64Index([5, 6, 7, 8], dtype='int64')
Here is my first approach:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([
    1, 1, 0, 1, 1, 1, 0, 0, 0,
    0, 1, 0, 1, 0, 0, 0, 0, 0,
    0, 1, 0, 0, 1, 0, 0, 0, 0,
    1, 0, 0, 0, 1, 0, 0, 0, 1,
    0, 0, 0, 0, 0, 0, 1, 1, 0,
    1, 0, 0, 0, 1, 0, 0, 0, 0]).reshape(6, 9))
print(df.sum(axis=0).apply(lambda x: True if x == 1 else False))
Output:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
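Neither snippet addresses the grouping part of the question. One sketch, building on the same mask and the DataFrame defined above, uses idxmax to find the single row owning each unique column and then groups the columns by that row (columns are 0-indexed here):

unique_cols = df.columns[df.sum(axis=0) == 1]
# For each single-occurrence column, idxmax returns the row holding its 1.
owner = df[unique_cols].idxmax()
groups = owner.groupby(owner).groups  # maps row label -> column labels
print(groups)  # e.g. {0: [5], 3: [8], 4: [6, 7]} -- columns 6 and 7 share row 4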
