Delete row and column from pandas dataframe - python

I have a CSV file which is contains a symmetric adjacency matrix which means row and column have equivalent labels.
I would like to import this into a pandas dataframe, ideally have some GUI pop up and ask for a list of items to delete....and then take that list in and set the values in the relative row and column as zero's and return a separate altered dataframe.
In short, something that takes the following matrix
a b c d e
a 0 3 5 3 5
b 3 0 2 4 5
c 5 2 0 1 7
d 3 4 1 0 9
e 5 5 7 9 0
Pops up a simple interface asking "which regions should be deleted" and a line to enter those regions
and say c and e are entered
returns
a b c d e
a 0 3 0 3 0
b 3 0 0 4 0
c 0 0 0 0 0
d 3 4 0 0 0
e 0 0 0 0 0
with the altered entries as shown in bold
it should be able to do this for as many areas as entered which can be up to 379....ideally seperated by commas

Set columns and rows by index values with DataFrame.loc:
vals = ['c','e']
df.loc[vals, :] = 0
df[vals] = 0
#alternative
#df.loc[:, vals] = 0
print (df)
a b c d e
a 0 3 0 3 0
b 3 0 0 4 0
c 0 0 0 0 0
d 3 4 0 0 0
e 0 0 0 0 0
Another solution is create boolean mask with numpy broadcasting and set values by DataFrame.mask:
mask = df.index.isin(vals) | df.columns.isin(vals)[:, None]
df = df.mask(mask, 0)
print (df)
a b c d e
a 0 3 0 3 0
b 3 0 0 4 0
c 0 0 0 0 0
d 3 4 0 0 0
e 0 0 0 0 0

Start by importing the csv:
import pandas as pd
adj_matrix = pd.read_csv("file/name/to/your.csv", index_col=0)
Then request the input:
regions = input("Please enter the regions that you want deleted (as an array of strings)")
adj_matrix.loc[regions, :] = 0
adj_matrix.loc[:, regions] = 0
Now adj_matrix should be in the form you want.

Related

Count combination of values in pandas dataframe

Let's say we have the following df:
id
A
B
C
D
123
1
1
0
0
456
0
1
1
0
786
1
0
0
0
The id column represents a unique client.
Columns A, B, C, and D represent a product. These columns' values are binary.
1 means the client has that product.
0 means the client doesn't have that product.
I want to create a matrix table of sorts that counts the number of combinations of products that exist for all users.
This would be the desired output, given the df provided above:
A
B
C
D
A
2
1
0
0
B
0
2
1
0
C
0
1
1
0
D
0
0
1
0
import pandas as pd
df = pd.read_fwf('table.dat', infer_nrows=1001)
cols = ['A', 'B', 'C', 'D']
df2 = df[cols]
df2.T.dot(df2)
Result:
A B C D
A 2 1 0 0
B 1 2 1 0
C 0 1 1 0
D 0 0 0 0
I think you want a dot product:
df2 = df.set_index('id')
out = df2.T.dot(df2)
Output:
A B C D
A 2 1 0 0
B 1 2 1 0
C 0 1 1 0
D 0 0 0 0

Select rows where two or more columns are bigger than 0 in pandas

I am working with a dataframe in pandas. My dataframe had 55 columns and 70.000 rows.
How can I select the rows where two or more values are bigger than 0?
It now looks like this:
A B C D E
a 0 2 0 8 0
b 3 0 0 0 0
c 6 2 5 0 0
And would like to make this:
A B C D E F
a 0 2 0 8 0 true
b 3 0 0 0 0 false
c 6 2 5 0 0 true
Have tried converting it to just 0's and 1's and summing that, like so:
df[df > 0] = 1
df[(df > 0).sum(axis=1) >= 2]
But then I lose all the other info in the dataframe and I still want to be able to see the original values.
Try assigning to a column like this:
>>> df['F'] = df.gt(0).sum(axis=1).ge(2)
>>> df
A B C D E F
a 0 2 0 8 0 True
b 3 0 0 0 0 False
c 6 2 5 0 0 True
Or try with astype(bool):
>>> df['F'] = df.astype(bool).sum(axis=1).ge(2)
>>> df
A B C D E F
a 0 2 0 8 0 True
b 3 0 0 0 0 False
c 6 2 5 0 0 True
>>>
You are close, only assign mask to new column:
df['F'] = (df > 0).sum(axis=1) >= 2
Or:
df['F'] = np.count_nonzero(df, axis=1) >= 2
print (df)
A B C D E F
a 0 2 0 8 0 True
b 3 0 0 0 0 False
c 6 2 5 0 0 True

Create a categorical column based on different binary columns in python

I have a dataset that looks like this:
df = pd.DataFrame(data= [[0,0,1],[1,0,0],[0,1,0]], columns = ['A','B','C'])
A B C
0 0 0 1
1 1 0 0
2 0 1 0
I want to create a new column where on each row appears the value of the previous column where there is a 1:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
Use dot:
df['value'] = df.values.dot(df.columns)
Output:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
Using pd.DataFrame.idxmax:
df['value'] = df.idxmax(1)
print(df)
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B

shopping basket analysis in python with pandas

I have a dataframe containing transaction data. Each row represents one transaction and the columns indicate whether a product has been bought from a category (categories are A-F) or not (one = yes, zero = no). Now I would like to compute the pairs of transactions within each category. My dataframe looks as follows:
A B C D E F
1 1 0 0 0 0
1 0 1 1 0 0
The output should be a matrix counting each pairs of the categories in the dataframe like so:
A B C D E F
A 4 2 1 0 4 2
B 5 6 7 3 5 1
C 1 6 5 8 7 9
D ...
E ...
F ...
Anyone knows a solution on how to solve this?
Thank you very much!
Use the dot product with its transpose:
df.T.dot(df)
Out:
A B C D E F
A 2 1 1 1 0 0
B 1 1 0 0 0 0
C 1 0 1 1 0 0
D 1 0 1 1 0 0
E 0 0 0 0 0 0
F 0 0 0 0 0 0
Note that looking for pairwise occurrences is not scalable though. You might want to look at apriori algorithm.

Condition based on First Element of a group in python

I have a dataframe df with columns [ShowOnAir, AfterPremier, ID, EverOnAir].
My condition is that
if it is the first element of groupby(df.ID)
then if (df.ShowOnAir ==0 or df.AfterPremier == 0), then EverOnAir = 0
else EverOnAir = 1
I am not sure how to compare the first element of the groupby, with elements of the orignal dataframe df.
would really appreciate if I could get help in it ,
Thank you
You can get a row number for your groups by using cumsum, then you can do your logic on the resulting dataframe:
df = pd.DataFrame([[1],[1],[2],[2],[2]])
df['n']=1
df.groupby(0).cumsum()
n
0 1
1 2
2 1
3 2
4 3
You can first create new column EverOnAir filled 1. Then groupby by ID and apply custom function f, where find first element of columns by iat and fill 0:
print df
ShowOnAir AfterPremier ID
0 0 0 a
1 0 1 a
2 1 1 a
3 1 1 b
4 1 0 b
5 0 0 b
6 0 1 c
7 1 0 c
8 0 0 c
def f(x):
#print x
x['EverOnAir'].iat[0] = np.where((x['ShowOnAir'].iat[0] == 0 ) |
(x['AfterPremier'].iat[0] == 0), 0, 1)
return x
df['EverOnAir'] = 1
print df.groupby('ID').apply(f)
ShowOnAir AfterPremier ID EverOnAir
0 0 0 a 0
1 0 1 a 1
2 1 1 a 1
3 1 1 b 1
4 1 0 b 1
5 0 0 b 1
6 0 1 c 0
7 1 0 c 1
8 0 0 c 1

Categories