Add values for matching column and row names - python

Quick question that I'm brain-farting on how to best implement. I am generating a matrix to add up how many times two items are found next to each other in a list across a large number of permutations of this list. My code looks something like this:
import pandas

agreement_matrix = pandas.DataFrame(0, index=names, columns=names)
for lst in bunch_of_lists:
    for i in range(len(lst) - 1):
        agreement_matrix[lst[i]][lst[i + 1]] += 1
It generates an array like:
   A  B  C  D
A  0  2  1  1
B  2  0  1  1
C  1  1  0  2
D  1  1  2  0
And because I don't care about order, I want to add the mirrored values together so it looks like this:

   A  B  C  D
A  0  4  2  2
B  0  0  2  2
C  0  0  0  4
D  0  0  0  0
Is there any fast/simple way to achieve this? I've been toying around with both doing it after generation and trying to do it as I add values.
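For the "as I add values" option: one way is to order each adjacent pair before incrementing, so only the upper triangle ever fills up. A minimal sketch of that idea, assuming the names sort in the same order as the matrix labels:

for lst in bunch_of_lists:
    for i in range(len(lst) - 1):
        a, b = sorted((lst[i], lst[i + 1]))  # order the pair so (b, a) counts as (a, b)
        agreement_matrix.loc[a, b] += 1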

Use np.tri*:
np.triu(df) + np.tril(df).T
array([[0, 4, 2, 2],
       [0, 0, 2, 2],
       [0, 0, 0, 4],
       [0, 0, 0, 0]])
Call the DataFrame constructor:
pd.DataFrame(np.triu(df) + np.tril(df).T, df.index, df.columns)
   A  B  C  D
A  0  4  2  2
B  0  0  2  2
C  0  0  0  4
D  0  0  0  0
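Note that np.triu and np.tril both include the main diagonal, so the sum above counts it twice; that is harmless here because the diagonal is all zeros. If the diagonal could be non-zero, a variant with the k offsets keeps it intact (a sketch, not needed for this data):

a = df.values
# strict triangles (k=1 / k=-1) exclude the diagonal, which is added back once
np.triu(a, 1) + np.tril(a, -1).T + np.diag(np.diag(a))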

To solve the problem: since the matrix is symmetric, df.values.T + df.values is just df.values * 2, so take the upper triangle of the doubled values:

np.triu(df.values * 2)  # same as np.triu(df.values.T + df.values)
Out[595]:
array([[0, 4, 2, 2],
       [0, 0, 2, 2],
       [0, 0, 0, 4],
       [0, 0, 0, 0]], dtype=int64)
Then you do
pd.DataFrame(np.triu(df.values*2), df.index, df.columns)
Out[600]:
   A  B  C  D
A  0  4  2  2
B  0  0  2  2
C  0  0  0  4
D  0  0  0  0

A pandas solution to avoid the first loop:
import numpy as np
import pandas as pd

values = ['ABCD'[i] for i in np.random.randint(0, 4, 100)]  # sample data
df = pd.DataFrame(values)
df[1] = df[0].shift()   # pair each item with its predecessor
df = df.iloc[1:]        # drop the first row, which has no predecessor
df.values.sort(axis=1)  # sort each pair in place (works because .values is a view here)
df[2] = 1               # counter column to aggregate
res = df.pivot_table(2, 0, 1, np.sum, 0)
#
# 1   A   B   C   D
# 0
# A   2  14  11  16
# B   0   5   9  13
# C   0   0  10  17
# D   0   0   0   2
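For readability, the positional arguments of that pivot_table call map to keywords as follows (an equivalent call):

res = df.pivot_table(values=2, index=0, columns=1, aggfunc=np.sum, fill_value=0)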

Related

Comparing two pandas dataframes with different sizes

I want to compare two dataframes containing 1s and 0s. I run nested for loops to check every element, and at the end I want to replace the 1 values in dataframe out that are equal to those in dataframe df with the letter d, and the values that differ between the dataframes with the letter i, in dataframe out. This code is too slow and I need some input to make it efficient and faster; does anyone have any idea? Also, df is 420x420 and out is 410x410.
a1 = out.columns.values
a2 = df.columns.values
b1 = out.index.values
b2 = df.index.values
for a in a1:
    for b in b1:
        for c in a2:
            for d in b2:
                if a == c and b == d:
                    if out.loc[b, a] == 1 and df.loc[d, c] == 1:
                        out.loc[b, a] = "d"
                    elif out.loc[b, a] != df.loc[d, c]:
                        out.loc[d, c] = "i"
                    else:
                        pass
A small example for better understanding:
Dataframe df

 1  2  3  4
 1  0  1  1
 2  1  0  0
 3  1  0  0
 4  0  0  0
Dataframe out

 1  2  3  4
 1  0  1  1
 2  1  0  1
 3  1  1  0
 4  0  0  0
And the resulting dataframe out should look like this:

 1  2  3  4
 1  0  d  d
 2  d  0  i
 3  d  i  0
 4  0  0  0
I created your dataframes like these:

# df creation
data1 = [
    [1, 0, 1, 1],
    [2, 1, 0, 0],
    [3, 1, 0, 0],
    [4, 0, 0, 0]
]
df = pd.DataFrame(data1, columns=[1, 2, 3, 4])
 1  2  3  4
 1  0  1  1
 2  1  0  0
 3  1  0  0
 4  0  0  0
# df_out creation
data2 = [
    [1, 0, 1, 1],
    [2, 1, 0, 1],
    [3, 1, 1, 0],
    [4, 0, 0, 0]
]
df_out = pd.DataFrame(data2, columns=[1, 2, 3, 4])

 1  2  3  4
 1  0  1  1
 2  1  0  1
 3  1  1  0
 4  0  0  0
# Then I used the 'np.where' method on all intersected columns.
intersected_columns = set(df.columns).intersection(df_out.columns)
for col in intersected_columns:
    if col != 1:  # I think the first column is the index
        df_out[col] = np.where(  # first condition
            (df[col] == 1) & (df_out[col] == 1),
            "d",             # if the first condition is true
            np.where(        # if it is false, apply the second condition
                df[col] != df_out[col],
                "i",
                df_out[col]
            )
        )
Output like this:
| 1 | 2 | 3 | 4 |
|----:|:----|:----|:----|
| 1 | 0 | d | d |
| 2 | d | 0 | i |
| 3 | d | i | 0 |
| 4 | 0 | 0 | 0 |
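Since the frames have different shapes (420x420 vs 410x410), the per-column loop can also be dropped entirely by aligning on the shared labels and applying both masks at once. A minimal vectorized sketch, assuming the overlapping row and column labels identify the cells to compare:

# restrict both frames to their shared labels
rows = df_out.index.intersection(df.index)
cols = df_out.columns.intersection(df.columns)
sub_df = df.loc[rows, cols]
sub_out = df_out.loc[rows, cols]

# "d" where both are 1, "i" where the frames differ, else keep the value
result = sub_out.astype(object)
result[(sub_df == 1) & (sub_out == 1)] = "d"
result[sub_df != sub_out] = "i"
df_out.loc[rows, cols] = result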

Dataframe column: to find local maxima

In the dataframe below, the column "CumRetperTrade" consists of a few vertical vectors (sequences of numbers) separated by zeros; these vectors correspond to the non-zero elements of column "Portfolio".
I would like to find the local maximum of every non-zero vector contained in column "CumRetperTrade".
To be precise, I would like to transform (using vectorization, or other methods) column "CumRetperTrade" into the column "PeakCumRet" (desired result), which holds for every row the maximum of the vector that row belongs to. The numeric example is below. Thanks in advance.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"Portfolio":      [1, 1, 1, 0, 0, 0, 1, 1, 1],
                    "CumRetperTrade": [3, 2, 1, 0, 0, 0, 4, 2, 1],
                    "PeakCumRet":     [3, 3, 3, 0, 0, 0, 4, 4, 4]})
df1
Portfolio  CumRetperTrade  PeakCumRet
        1               3           3
        1               2           3
        1               1           3
        0               0           0
        0               0           0
        0               0           0
        1               4           4
        1               2           4
        1               1           4
You can use:

df1['PeakCumRet'] = (df1.groupby(df1['Portfolio'].ne(df1['Portfolio'].shift()).cumsum())
                        ['CumRetperTrade'].transform('max'))
Output:
   Portfolio  CumRetperTrade  PeakCumRet
0          1               3           3
1          1               2           3
2          1               1           3
3          0               0           0
4          0               0           0
5          0               0           0
6          1               4           4
7          1               2           4
8          1               1           4
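Why this works: Portfolio.ne(Portfolio.shift()) is True on every row where the value changes, and cumsum() turns those change points into run numbers, so transform('max') takes the maximum within each consecutive run:

group_id = df1['Portfolio'].ne(df1['Portfolio'].shift()).cumsum()
print(group_id.tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 3]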

Find the rows that share the value

I need to find where the rows in ABC all have the value 1 and then create a new column that has the result.
My idea is to use np.where() with some condition, but I don't know the correct way to approach this problem. From what I have read, I'm not supposed to iterate through a dataframe, but to use some of the creative pandas methods?
df1 = pd.DataFrame({'A': [0, 1, 1, 0],
                    'B': [1, 1, 0, 1],
                    'C': [0, 1, 1, 1]},
                   index=[0, 1, 2, 4])
print(df1)
what I am after is this:
   A  B  C  TRUE
0  0  1  0  0
1  1  1  1  1  <----
2  1  0  1  0
4  0  1  1  0
If the data is always 0/1, you can simply take the product per row:
df1['TRUE'] = df1.prod(axis=1)
output:
   A  B  C  TRUE
0  0  1  0  0
1  1  1  1  1
2  1  0  1  0
4  0  1  1  0
This is what you are looking for:
df1["TRUE"] = (df1==1).all(axis=1).astype(int)

python pandas - .replace - nothing happens

I have a df called all_data and within it a column called 'Neighborhood'. all_data["Neighborhood"].head() looks like
0    CollgCr
1    Veenker
2    CollgCr
3    Crawfor
4    NoRidge
I want to replace certain neighborhood names with 0, and others with 1 to get
0    1
1    1
2    1
3    0
4    1
So I did this:
all_data["Neighb_Good"] = all_data["Neighborhood"].copy().replace(
    {'Neighborhood': {'StoneBr': 1, 'NrdigHt': 1,
                      'Veenker': 1, 'Somerst': 1,
                      'Timber': 1, 'CollgCr': 1,
                      'Blmngtn': 1, 'NoRidge': 1,
                      'Mitchel': 1, 'ClearCr': 1,
                      'ClearCr': 0, 'Crawfor': 0,
                      'SawyerW': 0, 'Gilbert': 0,
                      'Sawyer': 0, 'NPkVill': 0,
                      'NAmes': 0, 'NWAmes': 0,
                      'BrkSide': 0, 'MeadowV': 0,
                      'Edwards': 0, 'Blueste': 0,
                      'BrDale': 0, 'OldTown': 0,
                      'IDOTRR': 0, 'SWISU': 0,
                      }})
It doesn't give me an error, but nothing happens. Instead, all_data["Neighb_Good"] looks exactly like all_data["Neighborhood"].
I've been trying to figure it out for a while now, and I swear I can't see what's wrong, because I used this same method yesterday on some other columns and it worked perfectly.
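One thing to check before anything else: 'ClearCr' appears twice in the mapping, first as 1 and then as 0. A Python dict literal silently keeps only the last occurrence:

d = {'ClearCr': 1, 'ClearCr': 0}
print(d)  # {'ClearCr': 0} -- the later duplicate key wins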
UPDATE: you seem to need Series.map():
In [196]: df['Neighb_Good'] = df['Neighborhood'].map(d['Neighborhood'])

In [197]: df
Out[197]:
  Neighborhood  Neighb_Good
0      CollgCr            1
1      Veenker            1
2      CollgCr            1
3      Crawfor            0
4      NoRidge            1
Using the data set from the comments:

In [201]: df["ExterQual_Good"] = df["ExterQual"].map(d)

In [202]: df
Out[202]:
  ExterQual  ExterQual_Good
0        TA               1
1        Fa               0
2        Gd               1
3        Po               0
4        Ex               1
Old answer:
Use DataFrame.replace() instead of Series.replace() if you have a nested dict containing column names:

In [81]: df['Neighb_Good'] = df.replace(d)

In [82]: df
Out[82]:
  Neighborhood  Neighb_Good
0      CollgCr            1
1      Veenker            1
2      CollgCr            1
3      Crawfor            0
4      NoRidge            1
or use Series.replace() with a flat (not nested dict):
In [85]: df['Neighb_Good'] = df['Neighborhood'].replace(d['Neighborhood'])
In [86]: df
Out[86]:
Neighborhood Neighb_Good
0 CollgCr 1
1 Veenker 1
2 CollgCr 1
3 Crawfor 0
4 NoRidge 1
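A practical difference between the two approaches: Series.map turns any value missing from the mapping into NaN, whereas Series.replace leaves unmapped values untouched. A small illustration with a made-up 'Unknown' entry:

s = pd.Series(['CollgCr', 'Unknown'])
print(s.map({'CollgCr': 1}))      # 'Unknown' becomes NaN
print(s.replace({'CollgCr': 1}))  # 'Unknown' is kept as-is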
How about:

    A   B   C
P   0   1   2
Q   3   4   5
R   6   7   8
S   9  10  11
T  12  13  14
U  15  16  17

data1.A.replace({0: "A"..and so on})

pandas DataFrame set non-contiguous sections

I have a DataFrame like the one below and would like B to be 1 for n rows after a 1 in column A (here n = 2):

index  A  B
    0  0  0
    1  1  0
    2  0  1
    3  0  1
    4  1  0
    5  0  1
    6  0  1
    7  0  0
    8  1  0
    9  0  1
I think I can do it using .ix similar to this example, but I'm not sure how. I'd like to do it in a single pandas-style selection command if possible. (Ideally not using rolling_apply.)
Modifying a subset of rows in a pandas dataframe
EDIT: the application is that the 1 in column A is "ignored" if it falls within n rows of the previous 1. As per the comments, for n = 2 these examples hold:
A = [1, 0, 1, 0, 1], B should be [0, 1, 1, 0, 0]
A = [1, 1, 0, 0], B should be [0, 1, 1, 0]
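No vectorized answer is recorded here, but the described semantics can at least be pinned down with a plain loop: each accepted 1 in A sets B to 1 for the next n rows, and any 1 within n rows of the previously accepted 1 is ignored. A minimal sketch (hypothetical list-based helper, not a pandas one-liner):

def flag_after_ones(A, n):
    B = [0] * len(A)
    last = -n - 1  # index of the last accepted 1
    for i, a in enumerate(A):
        if a == 1 and i - last > n:  # ignore a 1 within n rows of the last accepted 1
            last = i
            for j in range(i + 1, min(i + n + 1, len(A))):
                B[j] = 1
    return B

print(flag_after_ones([1, 0, 1, 0, 1], 2))  # [0, 1, 1, 0, 0]
print(flag_after_ones([1, 1, 0, 0], 2))     # [0, 1, 1, 0]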
