Get row numbers from duplicate rows - python

I need to read an Excel file and highlight duplicate rows, without editing the Excel file or adding new columns/rows. I read the Excel file with:
df = pd.read_excel(path2, sheet_name='Sheet1')
and with
df.drop_duplicates(subset=df.columns.difference(['Mark 4']))
I get all duplicate rows, excluding 'Mark 4'. The problem is that I can't extract those row numbers to use them with
df.style.applymap(color_negative_red)
to highlight those rows in Excel, since they are not included in the df.
I've tried
dfToList = redovi['unique_row_to_index'].tolist()
but since there's no unique row I can't extract the data.
Output of df.drop_duplicates(subset=df.columns.difference(['Mark 4'])) is:
Type1 Type2
0 w A
11 w A
12 w A
18 w A
19 w A
20 w A
[6 rows x 170 columns]
I need to extract those row numbers, which are not part of the Excel columns, and use them as a list for future formatting.

You can use a custom function with DataFrame.duplicated and keep=False to get a mask of the duplicated rows by the specified column names:
df = pd.DataFrame({'Type1': ['w'] * 3 + ['a'],
                   'Type2': ['A'] * 3 + ['b'],
                   'Mark 4': range(4)})
print (df)
Type1 Type2 Mark 4
0 w A 0
1 w A 1
2 w A 2
3 a b 3
Test:
print (df.duplicated(subset=df.columns.difference(['Mark 4']), keep=False))
0 True
1 True
2 True
3 False
dtype: bool
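If you also need those row numbers as a plain list, the True positions of the mask can be pulled from the index — a minimal sketch using the sample df above:
m = df.duplicated(subset=df.columns.difference(['Mark 4']), keep=False)
dup_rows = df.index[m].tolist()
print (dup_rows)
[0, 1, 2]
The same mask then drives the highlighting function: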
def highlight(x):
    c = 'background-color: red'
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    m = x.duplicated(subset=x.columns.difference(['Mark 4']), keep=False)
    df1 = df1.mask(m, c)
    return df1
df.style.apply(highlight, axis=None)
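To write the highlighted frame back to a new Excel file, the Styler object can be exported directly — a sketch assuming openpyxl is installed; the output file name is arbitrary:
df.style.apply(highlight, axis=None).to_excel('highlighted.xlsx', engine='openpyxl')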

Related

Python : Remove all data in a column of a dataframe and keep the last value in the first row

Let's say that I have a simple DataFrame:
import pandas as pd
data1 = [12,34,'fsdf',678,'','','dfs','','']
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4
5
6 dfs
7
8
I want to delete all the data in the column except the last value found, which I want to keep in the first row. The column can have thousands of rows. So I would like this result:
Data
0 dfs
1
2
3
4
5
6
7
8
And I have to keep the shape of this DataFrame, so rows must not be removed.
What are the simplest functions to do that efficiently?
Thank you
Get the index of the last non-empty string value and assign that value to the first row of the column. Reversing the Series keeps the original labels, and idxmax returns the label of the first True, i.e. the last non-empty value:
s = df1.loc[df1['Data'].iloc[::-1].ne('').idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
If the empty strings are actually missing values (NaN):
import numpy as np

data1 = [12,34,'fsdf',678,np.nan,np.nan,'dfs',np.nan,np.nan]
df1 = pd.DataFrame(data1, columns=['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4 NaN
5 NaN
6 dfs
7 NaN
8 NaN
s = df1.loc[df1['Data'].iloc[::-1].notna().idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
A simple pandas condition check like this can help:
df1['Data'] = [df1.loc[df1['Data'].ne(""), "Data"].iloc[-1]] + [''] * (len(df1) - 1)
You can replace '' with NaN using df.replace, then use df.last_valid_index:
val = df1.loc[df1.replace('', np.nan).last_valid_index(), 'Data']
# Below two lines taken from #jezrael's answer
df1.loc[0, 'Data'] = val
df1.loc[1:, 'Data'] = ''
Or
You can use np.full with fill_value set to np.nan here.
val = df1.loc[df1.replace("", np.nan).last_valid_index(), "Data"]
df1 = pd.DataFrame(np.full(df1.shape, np.nan),
                   index=df1.index,
                   columns=df1.columns)
df1.loc[0, "Data"] = val

Name group of columns and rows in Pandas DataFrame

I would like to give a name to groups of columns and rows in my Pandas DataFrame, to achieve the same result as a merged Excel table. However, I can't find any way to give an overarching name to groups of columns/rows the way merged Excel cells do.
I tried wrapping the tables in an array, but the dataframes don't display:
labels = ['a', 'b', 'c']
df = pd.DataFrame(np.ones((3,3)), index=labels, columns=labels)
labeledRowsCols = pd.DataFrame([df, df])
labeledRowsCols = pd.DataFrame(labeledRowsCols.T, index=['actual'], columns=['predicted 1', 'predicted 2'])
print(labeledRowsCols)
predicted 1 predicted 2
actual NaN NaN
You can set hierarchical indices for both the rows and columns.
import pandas as pd
df = pd.DataFrame([[3,1,0,3,1,0],[0,3,0,0,3,0],[2,1,3,2,1,3]])
col_ix = pd.MultiIndex.from_product([['Predicted: Set 1', 'Predicted: Set 2'], list('abc')])
row_ix = pd.MultiIndex.from_product([['True label'], list('abc')])
df = df.set_index(row_ix)
df.columns = col_ix
df
# returns:
Predicted: Set 1 Predicted: Set 2
a b c a b c
True label a 3 1 0 3 1 0
b 0 3 0 0 3 0
c 2 1 3 2 1 3
Exporting this to Excel should have the merged cells as in your example.
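A sketch of that export step, assuming openpyxl is installed; merge_cells=True is the default in to_excel and is what produces the merged header cells:
df.to_excel('labeled.xlsx', merge_cells=True)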

Python: Pivot Table/group by specific conditions

I'm trying to change the structure of my data from a text file (.txt), which looks like this:
:1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J
And I would like to transform it into this format (like a pivot table in Excel, where the column name is the character between ':' and each group always starts with :1:):
Group :1: :2: :3: :4:
1 A B C
2 D E F G
3 H I J
Does anyone have any idea? Thanks in advance.
First create a DataFrame with read_csv and header=None, because there is no header in the file:
import pandas as pd
temp=u""":1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J"""
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), header=None)
print (df)
0
0 :1:A
1 :2:B
2 :3:C
3 :1:D
4 :2:E
5 :3:F
6 :4:G
7 :1:H
8 :3:I
9 :4:J
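Note that pd.compat.StringIO was removed in newer pandas releases; on a recent version the standard library equivalent works the same way:
from io import StringIO

df = pd.read_csv(StringIO(temp), header=None)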
Extract the original column with DataFrame.pop, strip the leading/trailing : with Series.str.strip, and Series.str.split the values into 2 new columns. Then create group numbers by comparing column a to the string '1' with Series.eq and taking the Series.cumsum, create a MultiIndex with DataFrame.set_index, and finally reshape with Series.unstack:
df[['a','b']] = df.pop(0).str.strip(':').str.split(':', expand=True)
df1 = df.set_index([df['a'].eq('1').cumsum(), 'a'])['b'].unstack(fill_value='')
print (df1)
a 1 2 3 4
a
1 A B C
2 D E F G
3 H I J
Use:
# Reading text file (assuming stored in CSV format, you can also use pd.read_fwf)
df = pd.read_csv('SO.csv', header=None)
# Splitting data into two columns
ndf = df.iloc[:, 0].str.split(':', expand=True).iloc[:, 1:]
# Grouping and creating a dataframe. Later dropping NaNs
res = ndf.groupby(1)[2].apply(pd.DataFrame).apply(lambda x: pd.Series(x.dropna().values))
# Post processing (optional)
res.columns = [':' + ndf[1].unique()[i] + ':' for i in range(ndf[1].nunique())]
res.index.name = 'Group'
res.index = range(1, res.shape[0] + 1)
res
Group :1: :2: :3: :4:
1 A B C
2 D E F G
3 H I J
Another way to do this:
# read the file
with open("t.txt") as f:
    content = f.readlines()

# Create a dictionary from the file, keeping the column names (e.g. ':1:') as keys and the row values (e.g. 'A') as values.
my_dict = {}
for v in content:
    v = v.strip()
    key = v[0:3]    # take the key ':1:'
    value = v[3:]   # take the value 'A'
    my_dict.setdefault(key, []).append(value)

# convert the dictionary to a dataframe and transpose it
df = pd.DataFrame.from_dict(my_dict, orient='index').transpose()
df
The output will look like this (note that this approach groups by key only, so values are compacted upward and the original :1:-delimited groups are not preserved):
:1: :2: :3: :4:
0 A B C G
1 D E F J
2 H None I None
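If you need to keep the :1:-delimited groups from the question, a small change to the loop starts a new row whenever a ':1:' key appears — a sketch under the same file-format assumptions:
rows = []
for v in content:
    v = v.strip()
    key, value = v[0:3], v[3:]
    if key == ':1:':
        rows.append({})
    rows[-1][key] = value

df = pd.DataFrame(rows).fillna('')
print (df)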

Remove column index from dataframe

I extracted multiple dataframes from an Excel sheet by passing coordinates (start & end). I used the function below to extract according to the coordinates, but when I convert the result into a dataframe I'm not sure where the integer column indices in df come from. I want to remove these indices and make the second row the columns. This is my dataframe:
0 1 2 3 4 5 6
Cols/Rows A A2 B B2 C C2
0 A 50 50 150 150 200 200
1 B 200 200 250 300 300 300
2 C 350 500 400 400 450 450
def extract_dataframes(sheet):
    ws = sheet['pivots']
    cordinates = [('A1', 'M8'), ('A10', 'Q17'), ('A19', 'M34'), ('A36', 'Q51')]
    multi_dfs_list = []
    for i in cordinates:
        data_rows = []
        for row in ws[i[0]:i[1]]:
            data_cols = []
            for cell in row:
                data_cols.append(cell.value)
            data_rows.append(data_cols)
        multi_dfs_list.append(data_rows)
    multi_dfs = {i: pd.DataFrame(df) for i, df in enumerate(multi_dfs_list)}
    return multi_dfs
I tried to delete the index but it isn't working.
Note: when I run
>>> multi_dfs[0].columns # first dataframe
RangeIndex(start=0, stop=13, step=1)
Change
multi_dfs = {i: pd.DataFrame(df) for i, df in enumerate(multi_dfs_list)}
to
multi_dfs = {i: pd.DataFrame(df[1:], columns=df[0]) for i, df in enumerate(multi_dfs_list)}
From the Docs,
columns : Index or array-like
Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided
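With that change the first extracted row becomes the header. For the 7-column sample shown in the question (the actual A1:M8 block has 13 columns), the columns would then look roughly like:
multi_dfs[0].columns
Index(['Cols/Rows', 'A', 'A2', 'B', 'B2', 'C', 'C2'], dtype='object')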
I think you need:
df = pd.read_excel(file, skiprows=1)

pandas dataframe drop columns by number of nan

I have a dataframe with some columns containing NaN. I'd like to drop the columns with a certain number of NaN. For example, in the following code, I'd like to drop any column with 2 or more NaN. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement it?
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan
print(dff)
There is a thresh param for dropna: it is the minimum number of non-NaN values a column must have to be kept, so pass the length of your df minus the number of NaN values you want to tolerate as your threshold:
In [13]:
dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
A B
0 0.517199 -0.806304
1 -0.643074 0.229602
2 0.656728 0.535155
3 NaN -0.162345
4 -0.309663 -0.783539
5 1.244725 -0.274514
6 -0.254232 NaN
7 -1.242430 0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416
So the above drops any column that does not have at least len(dff) - 2 non-NaN values, i.e. any column with more than 2 NaN.
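For the strict reading of "drop any column with 2 or more NaN" you would require at least len(dff) - 1 non-NaN values; with this sample data both thresholds drop only C, since A and B each have a single NaN:
dff.dropna(thresh=len(dff) - 1, axis=1)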
You can use a conditional list comprehension:
>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
A B
0 -0.819004 0.919190
1 0.922164 0.088111
2 0.188150 0.847099
3 NaN -0.053563
4 1.327250 -0.376076
5 3.724980 0.292757
6 -0.319342 NaN
7 -1.051529 0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
Here is a possible solution:
s = dff.isnull().apply(sum, axis=0)  # count the number of NaN in each column
print(s)
A 1
B 1
C 3
dtype: int64
for col in list(dff):
    if s[col] >= 2:
        del dff[col]
Or
for c in list(dff):
    if sum(dff[c].isnull()) >= 2:
        dff.drop(c, axis=1, inplace=True)
I recommend the drop method. This is an alternative solution, selecting the labels of the columns with 2 or more NaN:
dff.drop(dff.columns[dff.isnull().sum() >= 2], axis=1)
Say you have to drop columns having more than 70% null values:
data.drop(data.loc[:, list(100 * data.isnull().sum() / len(data.index) > 70)].columns, axis=1)
You can take another approach as well, like below, for dropping columns having a certain number of NaN values:
df = df.drop(columns=[x for x in df if df[x].isna().sum() > 5])
For dropping columns having a certain percentage of NaN values:
df = df.drop(columns=[x for x in df if round(df[x].isna().sum() / len(df) * 100, 2) > 20])
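A compact boolean-mask equivalent for the original "2 or more NaN" case — a sketch that keeps only columns with fewer than 2 NaN:
dff.loc[:, dff.isnull().sum() < 2]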
