Comparing two pandas dataframes with different sizes - python

I want to compare two dataframes whose contents are 1s and 0s. I run for loops to check every element, and at the end I want to mark, in dataframe out, the cells that are 1 in both out and df with the letter "d", and the cells where the two dataframes disagree with the letter "i". This code is far too slow and I need some input to make it efficient; does anyone have an idea? Note that df is 420x420 and out is 410x410.
a1 = out.columns.values
a2 = df.columns.values
b1 = out.index.values
b2 = df.index.values
for a in a1:
    for b in b1:
        for c in a2:
            for d in b2:
                if a == c and b == d:
                    if out.loc[b, a] == 1 and df.loc[d, c] == 1:
                        out.loc[b, a] = "d"
                    elif out.loc[b, a] != df.loc[d, c]:
                        out.loc[b, a] = "i"
                    else:
                        pass
A small example for better understanding:
Dataframe df:
1  2  3  4
1  0  1  1
2  1  0  0
3  1  0  0
4  0  0  0
Dataframe out:
1  2  3  4
1  0  1  1
2  1  0  1
3  1  1  0
4  0  0  0
And the resulting dataframe out should look like this:
1  2  3  4
1  0  d  d
2  d  0  i
3  d  i  0
4  0  0  0

I created your dataframes like these:
# df creation
data1 = [
    [1, 0, 1, 1],
    [2, 1, 0, 0],
    [3, 1, 0, 0],
    [4, 0, 0, 0]
]
df = pd.DataFrame(data1, columns=[1, 2, 3, 4])
1  2  3  4
1  0  1  1
2  1  0  0
3  1  0  0
4  0  0  0
# df_out creation
data2 = [
    [1, 0, 1, 1],
    [2, 1, 0, 1],
    [3, 1, 1, 0],
    [4, 0, 0, 0]
]
df_out = pd.DataFrame(data2, columns=[1, 2, 3, 4])
1  2  3  4
1  0  1  1
2  1  0  1
3  1  1  0
4  0  0  0
# Then I used 'np.where' on all intersected columns.
intersected_columns = set(df.columns).intersection(df_out.columns)
for col in intersected_columns:
    if col != 1:  # I think the first column is the index
        df_out[col] = np.where(  # First condition
            (df[col] == 1) & (df_out[col] == 1),
            "d",  # If the first condition is true
            np.where(  # If the first condition is false, apply the second condition
                df[col] != df_out[col],
                "i",
                df_out[col]
            )
        )
Output like this:
| 1 | 2 | 3 | 4 |
|---|---|---|---|
| 1 | 0 | d | d |
| 2 | d | 0 | i |
| 3 | d | i | 0 |
| 4 | 0 | 0 | 0 |
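If you want to skip the per-column loop as well, the same rule can be expressed with two boolean masks over the whole frames. This is only a minimal sketch: it assumes out's row and column labels are a subset of df's (the 410x410 vs 420x420 case), so the reindex introduces no NaNs.
# Restrict df to the labels that exist in out, then build both masks at once.
common = df.reindex(index=out.index, columns=out.columns)

both_one = out.eq(1) & common.eq(1)  # 1 in both frames -> "d"
differ = out.ne(common)              # frames disagree  -> "i"

out = out.astype(object).mask(both_one, "d").mask(differ, "i")
Because a cell that is 1 in both frames is never unequal, the two masks are disjoint, so the order of the two mask calls does not matter.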

Related

How to get the name of a column at a specific index and print it into another column using python and pandas

I'm a python newbie and need help with a specific task. My main goal is to identify all indices within a row whose values are greater than 0, together with their column names, and to write these values, one below the other, into another column in the same row.
Here is what I tried:
import pandas as pd
import numpy as np
table = {
    'A': [0, 2, 0, 5],
    'B': [4, 1, 3, 0],
    'C': [2, 9, 0, 6],
    'D': [1, 0, 1, 6]
}
df = pd.DataFrame(table)
print(df)
# create a new column that sums up the row
df['summary'] = 'NoData'
# print the header
print(df.columns.values)
A B C D summary
0 0 4 2 1 NoData
1 2 1 9 0 NoData
2 0 3 0 1 NoData
3 5 0 6 6 NoData
# get length of rows and columns
row = len(df.index)
column = len(df.columns)
# If a value at a specific index is greater than 0, take the column name
# and the value at that index and print it into the column 'summary'.
# Also write all values greater than 0 within a row below each other.
for i in range(row):
    for j in range(column):
        if df.iloc[i][j] > 0:
            df.at[i, 'summary'] = df.columns(df.iloc[i][j]) + '\n'
I hope it is a bit clear what I want to achieve. Here is a picture of how the result should look in the column 'summary'.
You don't really need a for loop.
Starting with df:
A B C D
0 0 4 2 1
1 2 1 9 0
2 0 3 0 1
3 5 0 6 6
You can do:
# Define a helper function
def f(val, col_name):
    # You can modify this function in order to customize the summary string
    return "" if val == 0 else str(val) + col_name + "\n"

# Assign summary column
df["summary"] = df.apply(lambda x: x.apply(f, args=(x.name,))).sum(axis=1).str[:-1]
Output:
A B C D summary
0 0 4 2 1 4B\n2C\n1D
1 2 1 9 0 2A\n1B\n9C
2 0 3 0 1 3B\n1D
3 5 0 6 6 5A\n6C\n6D
It works for longer column names as well:
one two three four summary
0 0 4 2 1 4two\n2three\n1four
1 2 1 9 0 2one\n1two\n9three
2 0 3 0 1 3two\n1four
3 5 0 6 6 5one\n6three\n6four
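For what it's worth, the nested apply can also be replaced by a plain list comprehension over the rows, which some find easier to read. This is a sketch under the same assumptions; select_dtypes guards against the already-added string 'summary' column:
num = df.select_dtypes("number")
df["summary"] = [
    "\n".join(f"{v}{c}" for c, v in row.items() if v > 0)
    for _, row in num.iterrows()
]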
Try this:
import pandas as pd
import numpy as np
table = {
    'A': [0, 2, 0, 5],
    'B': [4, 1, 3, 0],
    'C': [2, 9, 0, 6],
    'D': [1, 0, 1, 6]
}
df = pd.DataFrame(table)
print(df)
print(f'\n\n-------------BREAK-----------\n\n')
def func(line):
    templist = ''
    list_col = line.index.values.tolist()
    temp = line.values.tolist()
    for x in range(0, len(temp)):
        if (temp[x] <= 0):
            pass
        else:
            if (x == 0):
                templist = f"{temp[x]}{list_col[x]}"
            else:
                templist = f"{templist}\n{temp[x]}{list_col[x]}"
    return templist

df['summary'] = df.apply(func, axis=1)
print(df)
Output:
A B C D
0 0 4 2 1
1 2 1 9 0
2 0 3 0 1
3 5 0 6 6
-------------BREAK-----------
A B C D summary
0 0 4 2 1 \n4B\n2C\n1D
1 2 1 9 0 2A\n1B\n9C
2 0 3 0 1 \n3B\n1D
3 5 0 6 6 5A\n6C\n6D
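Note the leading \n in rows 0 and 2 of this output: the x == 0 check tests the position of the first column, not whether anything has been accumulated yet, so rows whose first value is 0 start with a stray separator. A small fix (a sketch, not the original author's code) is to collect the parts first and join once:
def func(line):
    # collect non-zero entries, then join once: no stray leading separator
    parts = [f"{val}{col}" for col, val in line.items() if val > 0]
    return "\n".join(parts)

df['summary'] = df.apply(func, axis=1)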

Find the rows that share the value

I need to find the rows where columns A, B and C all have the value 1, and then create a new column that holds the result.
My idea is to use np.where() with some condition, but I don't know the correct way to deal with this problem. From what I have read, I'm not supposed to iterate through a dataframe, but rather use one of pandas' vectorized methods?
df1 = pd.DataFrame({'A': [0, 1, 1, 0],
                    'B': [1, 1, 0, 1],
                    'C': [0, 1, 1, 1]},
                   index=[0, 1, 2, 4])
print(df1)
what I am after is this:
A B C TRUE
0 0 1 0 0
1 1 1 1 1 <----
2 1 0 1 0
4 0 1 1 0
If the data is always 0/1, you can simply take the product per row:
df1['TRUE'] = df1.prod(axis=1)
output:
A B C TRUE
0 0 1 0 0
1 1 1 1 1
2 1 0 1 0
4 0 1 1 0
This is what you are looking for:
df1["TRUE"] = (df1==1).all(axis=1).astype(int)

Pandas where condition

I have a dataset as shown below
test = pd.DataFrame({'number': [1, 2, 3, 4, 5, 6, 7],
                     'A': [0, 0, 0, 0, 0, 0, 1],
                     'B': [0, 0, 1, 0, 0, 0, 0],
                     'C': [0, 1, 0, 0, 0, 0, 1],
                     'D': [0, 0, 1, 0, 0, 0, 1],
                     'E': [0, 0, 0, 0, 0, 0, 1],
                     })
I am trying to create a flag column at the end with the condition: if 'number' <= 5 and (A != 0 | B != 0 | C != 0 | D != 0 | E != 0), then 1, else 0.
np.where(((test['number'] <= 5) &
          ((test['A'] != 0) |
           (test['B'] != 0) |
           (test['C'] != 0) |
           (test['D'] != 0) |
           (test['E'] != 0))), 1, 0)
This works, but I am trying to simplify the query by not hard-coding the column names A/B/C/D/E, since they change (both the names and the number of columns may change). Only one column is static: the 'number' column.
Let's try with any on axis=1 instead of joining with |:
test['flag'] = np.where(
    test['number'].le(5) &
    test.iloc[:, 1:].ne(0).any(axis=1), 1, 0
)
test:
number A B C D E flag
0 1 0 0 0 0 0 0
1 2 0 0 1 0 0 1
2 3 0 1 0 1 0 1
3 4 0 0 0 0 0 0
4 5 0 0 0 0 0 0
5 6 0 0 0 0 0 0
6 7 1 0 1 1 1 0
Lots of options to select the columns:
- Select by position with iloc, all columns after the first -> test.iloc[:, 1:]
- Select by label with loc, all columns from 'A' on -> test.loc[:, 'A':]
- Select all columns except 'number' with Index.difference -> test[test.columns.difference(['number'])]
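Another way to avoid hard-coding the data columns is to drop the one static column by name; a sketch equivalent to the iloc version above:
test['flag'] = (
    test['number'].le(5)
    & test.drop(columns='number').ne(0).any(axis=1)
).astype(int)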

Discard rows and corresponding columns of a matrix that are all 0 [duplicate]

This question already has answers here: Efficiently test matrix rows and columns with numpy (2 answers). Closed 4 years ago.
I have a square matrix that looks something like
0 0 0 0 0 0 0
1 0 0 0 1 0 0
1 1 1 0 0 0 0
0 0 0 0 0 0 0
1 1 1 0 1 1 0
1 1 1 0 1 1 0
0 0 0 0 0 0 0
e.g. the output of this would be:
0 0 0 | 0 0 |
1 0 0 | 1 0 |
1 1 1 | 0 0 |
- - - + - - +
1 1 1 | 1 1 |
1 1 1 | 1 1 |
- - - + - - +
0 0 0 0 0
1 0 0 1 0
1 1 1 0 0
1 1 1 1 1
1 1 1 1 1
Notice how the 4th row and column are all 0, as well as the last. I would like to delete rows and columns if and only if the ith row and the ith column are all 0s. (Also note that the first row of 0s remains since the first column contains non-zero elements.)
Is there a clean and easy way to do this without looping through each one?
Assume a is a numpy array with same sizes on both dimensions:
# find out the index to keep
keep_idx = a.any(0) | a.any(1)
# subset the array
a[keep_idx][:, keep_idx]
#array([[0, 0, 0, 0, 0],
# [1, 0, 0, 1, 0],
# [1, 1, 1, 0, 0],
# [1, 1, 1, 1, 1],
# [1, 1, 1, 1, 1]])
Suppose we have a 7x7 dataframe similar to your matrix; then the following code does the job:
row_sum = df.sum(axis=1)
col_sum = df.sum(axis=0)
lst = []
for i in range(len(df)):
    if (row_sum[i] == 0) & (col_sum[i] == 0):
        lst.append(i)
df1 = df.drop(lst, axis=1).drop(lst, axis=0)
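The loop can be avoided on the DataFrame side too. Assuming the row index and the column labels coincide (0 through 6 here) so the two boolean masks align, this sketch mirrors the numpy answer above:
keep = df.any(axis=1) | df.any(axis=0)
df1 = df.loc[keep, keep]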

Add values for matching column and row names

Quick question that I'm brain-farting on how to best implement. I am generating a matrix to add up how many times two items are found next to each other in a list across a large number of permutations of this list. My code looks something like this:
agreement_matrix = pandas.DataFrame(0, index=names, columns=names)
for list in bunch_of_lists:
    for i in range(len(list)-1):
        agreement_matrix[list[i]][list[i+1]] += 1
It generates an array like:
A B C D
A 0 2 1 1
B 2 0 1 1
C 1 1 0 2
D 1 1 2 0
And because I don't care about order as much I want to add up values so it's like this:
A B C D
A 0 4 2 2
B 0 0 2 2
C 0 0 0 4
D 0 0 0 0
Is there any fast/simple way to achieve this? I've been toying around with both doing it after generation and trying to do it as I add values.
Use np.tri*:
np.triu(df) + np.tril(df).T
array([[0, 4, 2, 2],
[0, 0, 2, 2],
[0, 0, 0, 4],
[0, 0, 0, 0]])
Call the DataFrame constructor:
pd.DataFrame(np.triu(df) + np.tril(df).T, df.index, df.columns)
A B C D
A 0 4 2 2
B 0 0 2 2
C 0 0 0 4
D 0 0 0 0
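One caveat: np.triu and np.tril both include the main diagonal, so their sum doubles any non-zero diagonal entries. Here the diagonal is all zeros (an item is never counted next to itself in this example), so the sum is safe.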
Since the matrix here is symmetric, you can also just double it and take the upper triangle:
np.triu(df.values * 2)  # equivalent to np.triu(df.values.T + df.values)
Out[595]:
array([[0, 4, 2, 2],
[0, 0, 2, 2],
[0, 0, 0, 4],
[0, 0, 0, 0]], dtype=int64)
Then you do
pd.DataFrame(np.triu(df.values*2), df.index, df.columns)
Out[600]:
A B C D
A 0 4 2 2
B 0 0 2 2
C 0 0 0 4
D 0 0 0 0
A pandas solution to avoid the first loop:
values = ['ABCD'[i] for i in np.random.randint(0, 4, 100)]  # data
df = pd.DataFrame(values)
df[1] = df[0].shift()
df = df.iloc[1:]
df.values.sort(axis=1)  # sort each pair so (B, A) counts as (A, B)
df[2] = 1
res = df.pivot_table(2, 0, 1, np.sum, 0)
#
#1 A B C D
#0
#A 2 14 11 16
#B 0 5 9 13
#C 0 0 10 17
#D 0 0 0 2
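As a readability note, the positional pivot_table call is a little cryptic; the same call spelled out with keyword arguments (no behavior change):
res = df.pivot_table(values=2, index=0, columns=1, aggfunc=np.sum, fill_value=0)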
