I'm processing a dataset with around 2000 columns and I noticed many of them are empty. I want to know specifically how many of them are empty and how many are not. I use the following code:
df.isnull().sum()
This gives the number of empty rows in each column. However, since I'm looking at about 2000 columns with 7593 rows, the output in IPython looks like this:
FEMALE_DEATH_YR3_RT 7593
FEMALE_COMP_ORIG_YR3_RT 7593
PELL_COMP_4YR_TRANS_YR3_RT 7593
PELL_COMP_2YR_TRANS_YR3_RT 7593
...
FIRSTGEN_YR4_N 7593
NOT1STGEN_YR4_N 7593
It doesn't show all the columns because there are too many of them, which makes it very difficult to tell how many columns are completely empty and how many are not.
Is there any way to identify the non-empty columns quickly? Thanks!
To find the number of empty columns (subtract this from len(df.columns) to get the number of non-empty ones):
len(df.columns) - len(df.dropna(axis=1,how='all').columns)
3
df
Country movie name rating year Something
0 thg John 3 NaN NaN NaN
1 thg Jan 4 NaN NaN NaN
2 mol Graham lob NaN NaN NaN
df=df.dropna(axis=1,how='all')
Country movie name
0 thg John 3
1 thg Jan 4
2 mol Graham lob
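For the counting part of the question, here is a minimal sketch on the small frame above. The column contents are copied from the example; `all_empty` and the other variable names are only illustrative. `isnull().all()` flags the columns that are entirely NaN, so both counts and the list of non-empty column names come from one boolean Series:

```python
import numpy as np
import pandas as pd

# Small frame mirroring the example above: three filled and three empty columns.
df = pd.DataFrame({
    "Country": ["thg", "thg", "mol"],
    "movie": ["John", "Jan", "Graham"],
    "name": ["3", "4", "lob"],
    "rating": [np.nan] * 3,
    "year": [np.nan] * 3,
    "Something": [np.nan] * 3,
})

# True for every column in which all values are missing.
all_empty = df.isnull().all()

n_empty = int(all_empty.sum())          # columns that are completely empty
n_non_empty = int((~all_empty).sum())   # columns with at least one value
non_empty_cols = df.columns[~all_empty].tolist()

print(n_empty, n_non_empty, non_empty_cols)
```

With 2000 columns, printing `non_empty_cols` (or its length) avoids the truncated IPython display entirely.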
I created a data-capturing template. When imported into Python (as DataFrame), I noticed that some of the records spanned multiple rows.
I need to clean up the spanned record (see expected representation).
The 'Entity' column is the anchor column. Currently, it is not the definitive column, as one can see the row(s) underneath with NaN.
NB: Along the line, I'll be dropping the 'Unnamed:' column.
Essentially, for every row where df.Entity.isnull(), the value(s) must be joined to the row above where df.Entity.notnull().
NB: I can adjust the source; however, I'd like to keep the source template for ease of capturing and #reproducibility.
[dummy representation]
Unnamed: 0  Entity   Country  Naming       Type:     Mode:  Duration  Page  Reg           Elig             Unit  Structure              Notes
6.          Sur...   UK       ...Publ      Pros...   FT     Standard  Yes   Guid... 2021  General          All   Intro & discussion...  Formal
NaN         NaN      NaN      NaN          NaN       NaN    NaN       NaN   NaN           NaN              NaN   NaN                    Assessment
7.          War...   UK       by Publ...   Retro...  FT     1 yr      Yes   Reg 38...     Staff            All   Covering Doc...        Formal
NaN         NaN      NaN      NaN          NaN       NaN    NaN       NaN   NaN           General          NaN   5000 <10000            3-8 publ...
8.          EAng...  UK       Publ...      Retro...  PT     6 mths    Yes   Reg...        General (Cat B)  All   Critical Anal...       Formal as
NaN         NaN      NaN      NaN          NaN       NaN    NaN       NaN   NaN           Staff (Cat A)    NaN   15000                  *g...
NaN         NaN      NaN      NaN          NaN       NaN    NaN       NaN   NaN           Edu & LLL l...   NaN   NaN                    LLL not...
[expected representation] I expect to have
Unnamed  Entity   Country  Naming       Type:     Mode:  Duration  Page  Reg           Elig                                          Unit  Structure                    Notes
6.       Sur...   UK       ...Publ      Pros...   FT     Standard  Yes   Guid... 2021  General                                       All   Intro & discussion...        Formal Assessment
7.       War...   UK       by Publ...   Retro...  FT     1 yr      Yes   Reg 38...     Staff General                                 All   Covering Doc... 5000 <10000  Formal 3-8 publ...
8.       EAng...  UK       Publ...      Retro...  PT     6 mths    Yes   Reg...        General (Cat B) Staff (Cat A) Edu & LLL l...  All   Critical Anal... 15000       Formal as *g... LLL not...
My instinct is to test for isnull() on the [Entity] column. I would prefer not to do an if...then/'loop' check.
My mind wandered along stack, groupby or stack, merge/join, pop. Not sure these approaches are 'right'.
My preference will be some 'vectorisation' as much as it's possible; taking advantage of pandas' DataFrame
I took note of
Merging Two Rows (one with a value, the other NaN) in Pandas
Pandas dataframe merging rows to remove NaN
Concatenate column values in Pandas DataFrame with "NaN" values
Merge DF of different Sizes with NaN values in between
In my case, my anchor column [Entity] has the key values on one row; however, its values are on one row or span multiple rows.
NB: I'm dealing with one DataFrame and not two df.
I should also mention that I took note of the SO solution that 'explode' newline across multiple rows. This is the opposite for my own scenario. However, I take note as it might provide hints.
Pandas: How to read a DataFrame from excel-file where multiple rows are sometimes separated by line break (\n)
[UPDATE: Workaround 1]
NB: This workaround is not a solution. It simply provides an alternative!
With leads from a Medium and a SO post,
I attempted with success reading my dataset directly from the table in the Word document. For this, I installed the python-docx library.
## code snippet; #Note: incomplete
from docx import Document as docs
... ...
document = docs("datasets/IDA..._AppendixA.docx")

def read_docx_table(document, tab_id: int = None, nheader: int = 1, start_row: int = 0):
    ... ...
    data = [[cell.text for cell in row.cells] for i, row in enumerate(table.rows)
            if i >= start_row]
    ... ...
    if nheader == 1:  ## first row as column header
        df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
    ... ...
    return df
... ...
## parse and show dataframe
df_table = read_docx_table(document, tab_id=3, nheader=1, start_row=0)
df_table
The rows are no longer spilling over multiple rows. The columns with newline are now showing the '\n' character.
I can, if I use df['col'].str.replace(), remove newlines '\n' or other delimiters, if I so desire.
Replacing part of string in python pandas dataframe
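As a rough sketch of that clean-up step: the two sample cells below are taken from the Notes column of the python-docx output, and the regular expression is my own assumption about which delimiters should go (a newline, an optional `|` after it, and the surrounding spaces).

```python
import pandas as pd

# Two cells as they appear after reading the Word table with python-docx.
df = pd.DataFrame({"Notes": ["Formal \n| Assessment", "Formal \n| 3-8 publ..."]})

# Collapse "<spaces>\n<optional |><spaces>" into a single space.
df["Notes"] = df["Notes"].str.replace(r"\s*\n\|?\s*", " ", regex=True)

print(df["Notes"].tolist())
```

Passing `regex=True` explicitly matters here, because recent pandas versions treat the pattern as a literal string by default.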
[dataframe representation: importing and parsing using python-docx library] Almost a true representation of the original table in Word
Unnamed  Entity   Country  Naming       Type:     Mode:  Duration  Page  Reg           Elig                                            Unit  Structure                        Notes
6.       Sur...   UK       ...Publ      Pros...   FT     Standard  Yes   Guid... 2021  General                                         All   Intro & discussion...            Formal \n| Assessment
7.       War...   UK       by Publ...   Retro...  FT     1 yr      Yes   Reg 38...     Staff \nGeneral                                 All   Covering Doc... \n| 5000 <10000  Formal \n| 3-8 publ...
8.       EAng...  UK       Publ...      Retro...  PT     6 mths    Yes   Reg...        General (Cat B) Staff (Cat A) \nEdu & LLL l...  All   Critical Anal... \n| 15000       Formal as \n|*g... \n| LLL not...
[UPDATE 2]
After my update (workaround 1), I saw #J_H's comments. Whilst it is not 'data corruption' in the true sense, it is nonetheless an ETL strategy issue. Thanks #J_H. Absolutely, well-thought-through #design is of the essence.
Going forward, I'll either leave the source template practically as-is with minor modifications and use python-docx as I've used; or
I modify the source template for easy capture in Excel or 'csv' type repository.
Despite the two approaches outlined here or any other, I'm still keen on 'data cleaning code' that can clean-up the df, to give the expected df.
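One possible vectorised clean-up, offered as a sketch rather than a definitive answer. It assumes every record starts on a row where Entity is not null, that the continuation rows follow directly below it, and that joining the pieces with a single space is acceptable. The three-column frame is a cut-down stand-in for the real one, and `record_id` and `join_non_null` are names I made up:

```python
import numpy as np
import pandas as pd

# Cut-down version of the spanned layout: rows with NaN in Entity continue the row above.
df = pd.DataFrame({
    "Entity": ["Sur...", np.nan, "War...", np.nan],
    "Elig": ["General", np.nan, "Staff", "General"],
    "Notes": ["Formal", "Assessment", "Formal", "3-8 publ..."],
})

# A non-null Entity starts a new record, so the running count labels each row's record.
record_id = df["Entity"].notna().cumsum().rename("record_id")

def join_non_null(col: pd.Series) -> str:
    """Join the non-missing pieces of one column within one record."""
    return " ".join(col.dropna().astype(str))

merged = df.groupby(record_id).agg(join_non_null).reset_index(drop=True)
print(merged)
```

There is no explicit row loop: the `notna().cumsum()` trick builds the group key, and `groupby` does the rest. Columns that are entirely NaN within a record come out as empty strings, which may or may not be what you want.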
I have two dataframes, which I am calling df1 and df2.
df1 has columns like KPI and context and it looks like this.
KPI Context
0 Does the company have a policy in place to man... Anti-Bribery Policy\nBroadridge does not toler...
1 Does the company have a supplier code of conduct? Vendor Code of Conduct Our vendors play an imp...
2 Does the company have a grievance/complaint ha... If you ever have a question or wish to report ...
3 Does the company have a human rights policy ? Human Rights Statement of Commitment Broadridg...
4 Does the company have a policies consistent wi... Anti-Bribery Policy\nBroadridge does not toler...
df2 has a single column 'keyword'
df2:
Keyword
0 1.5 degree
1 1.5°
2 2 degree
3 2°
4 accident
I want to create another dataframe from these two, in which, for each value in the 'Keyword' column of df2 that occurs in the 'Context' of df1, I record how many times it occurs.
I used pd.crosstab() for this, but I suspect it is not giving me the expected output.
Here's what I have tried so far:
new_df = df1.explode('Context')
new_df1 = df2.explode('Keyword')
new_df = pd.crosstab(new_df['KPI'], new_df1['Keyword'], values=new_df['Context'], aggfunc='count').reset_index().rename_axis(columns=None)
print(new_df.head())
the new_df looks like this.
KPI 1.5 degree 1.5° \
0 Does the Supplier code of conduct cover one or... NaN NaN
1 Does the companies have sites/operations locat... NaN NaN
2 Does the company have a due diligence process ... NaN NaN
3 Does the company have a grievance/complaint ha... NaN NaN
4 Does the company have a grievance/complaint ha... NaN NaN
2 degree 2° accident
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 1.0 NaN NaN
4 NaN NaN NaN
The expected output which I want is something like this.
0 KPI 1.5 degree 1.5° 2 degree 2° accident
1 Does the company have a policy in place to man 44 2 3 5 9
What exactly am I missing? Please let me know, thanks!
There are multiple problems. First, explode works on list-like (already split) values, not on plain strings. Second, to extract the keywords from Context you need Series.str.findall. Third, crosstab should use columns from the same DataFrame, not from two different ones:
import re
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['Keyword'])
df1['new'] = df1['Context'].str.findall(pat, flags=re.I)
new_df = df1.explode('new')
out = pd.crosstab(new_df['KPI'], new_df['new'])
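One caveat with the pattern above: `\b` only matches between a word character and a non-word character, so a keyword ending in `°` will not match when it is followed by a space or punctuation. Also, with `flags=re.I` the matches keep their original case, so "Accident" and "accident" end up in separate crosstab columns. Below is a hedged variant that uses lookarounds instead of `\b` and lower-cases the matches. The two-row df1 is invented purely for illustration:

```python
import re
import pandas as pd

# Invented sample data; only the keywords come from the question.
df1 = pd.DataFrame({
    "KPI": ["kpi A", "kpi B"],
    "Context": ["An accident at 2° C; another Accident.",
                "Target: 1.5 degree, not 2 degree."],
})
df2 = pd.DataFrame({"Keyword": ["1.5 degree", "1.5°", "2 degree", "2°", "accident"]})

# Longest keywords first, as a precaution in case one keyword is a prefix of another.
keywords = sorted(df2["Keyword"], key=len, reverse=True)

# (?<!\w) and (?!\w) require "no word character next to the match",
# which also works when the keyword itself ends in a non-word character such as °.
pat = "|".join(r"(?<!\w){}(?!\w)".format(re.escape(k)) for k in keywords)

df1["new"] = df1["Context"].str.findall(pat, flags=re.I)
long_df = df1.explode("new").dropna(subset=["new"]).reset_index(drop=True)
long_df["new"] = long_df["new"].str.lower()  # merge case variants into one column

out = pd.crosstab(long_df["KPI"], long_df["new"])
print(out)
```

Keywords that never occur in any Context simply do not appear as columns; `out.reindex(columns=df2["Keyword"], fill_value=0)` would add them back as zeros if the full keyword list is needed.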
Here is a sample of a df I am working with. I am particularly interested in these two columns rusher and receiver.
rusher receiver
0 A.Ekeler NaN
1 NaN S.Barkley
2 C.Carson NaN
3 J.Jacobs NaN
4 NaN K.Drake
I want to run a groupby that considers all of these names in both columns (because the same name can show up in both columns).
My idea is to create a new column player, and then I can just groupby player, if that makes sense. Here is what I want my output to look like
rusher receiver player
0 A.Ekeler NaN A.Ekeler
1 NaN S.Barkley S.Barkley
2 C.Carson NaN C.Carson
3 J.Jacobs NaN J.Jacobs
4 NaN K.Drake K.Drake
I would like to take the name from whichever column it is listed under in that particular row and place it into the player column, so I can then run a groupby.
I have tried various string methods but I don't know how to work around the NaNs
Check with fillna
df['player'] = df['rusher'].fillna(df['receiver'])
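A short sketch of how this feeds into the groupby. The `yards` column and its numbers are hypothetical, added only so there is something to aggregate:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rusher": ["A.Ekeler", np.nan, "C.Carson", "A.Ekeler", np.nan],
    "receiver": [np.nan, "S.Barkley", np.nan, np.nan, "A.Ekeler"],
    "yards": [5, 12, 3, 7, 20],  # hypothetical stat column
})

# Take rusher where present, otherwise fall back to receiver.
df["player"] = df["rusher"].fillna(df["receiver"])

# The same name is now grouped together regardless of which column it came from.
totals = df.groupby("player")["yards"].sum()
print(totals)
```

If a row could ever have both columns filled, `fillna` keeps the rusher and ignores the receiver, so that case would need separate handling.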
I have the following data:
df =
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 Rocky 10 Casual kkkk 22.4
2 jenifer 50 Emergency 2500.6 '51.6'
3 Tom 10 sick Nan 46.2
4 Harry nn Casual 1800.1 '58.3'
5 Julie 22 sick 3600.2 'unknown'
6 Sam 5 Casual Nan 47.2
7 Mady 6 sick unknown Nan
Output:
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 jenifer 50 Emergency 2500.6 51.6
2 Tom 10 sick Nan 46.2
3 Sam 5 Casual Nan 47.2
4 Mady 6 sick unknown Nan
I want to delete records where there is a datatype error in the numerical columns (Leaves, Salary, Performance).
If a numerical column contains a string, that row should be deleted from the data frame.
df[['Leaves','Salary','Performance']].apply(pd.to_numeric, errors = 'coerce')
but this will convert the values to NaN.
Let's start from a note concerning your sample data:
It contains Nan strings, which are not among strings automatically
recognized as NaNs.
To treat them as NaN, I read the source text with read_fwf,
passing na_values=['Nan'].
And now get down to the main task:
Define a function to check whether a cell is acceptable:
def isAcceptable(cell):
    if pd.isna(cell) or cell == 'unknown':
        return True
    return all(c.isdigit() or c == '.' for c in cell)
I noticed that you accept NaN values.
You also accept a cell if it contains only the string unknown, but you don't
accept a cell if that word is enclosed in e.g. quotes.
If you change your mind about what is / is not acceptable, change the
above function accordingly.
Then, to leave only rows with all acceptable values in all 3 mentioned
columns, run:
df[df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)]
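An alternative sketch that reuses pd.to_numeric from the question, but only to build a mask, so the original values are left untouched. This is a different technique from the character check above: to_numeric also accepts things like negative numbers or "1e3". The frame is a shortened, made-up version of the sample in which every cell is a string or NaN. As a side note, applymap is deprecated in recent pandas releases in favour of DataFrame.map.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the sample data; cells are strings or NaN.
df = pd.DataFrame({
    "Emp_Name": ["Christy", "Rocky", "Tom", "Harry", "Mady"],
    "Leaves": ["20", "10", "10", "nn", "6"],
    "Salary": ["3000.0", "kkkk", np.nan, "1800.1", "unknown"],
    "Performance": ["56.6", "22.4", "46.2", "58.3", np.nan],
})
num_cols = ["Leaves", "Salary", "Performance"]

raw = df[num_cols]
converted = raw.apply(pd.to_numeric, errors="coerce")

# A cell is fine if it parses as a number, was already missing, or is exactly 'unknown'.
ok = converted.notna() | raw.isna() | raw.eq("unknown")

clean = df[ok.all(axis=1)]
print(clean)
```

Here the rows for Rocky (Salary "kkkk") and Harry (Leaves "nn") are dropped, while the missing and 'unknown' cells are kept as they were.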
I have an Excel file and the first two rows are:
Weekly Report
December 1-7, 2014
And after that comes the relevant table.
When I use
filename = r'excel.xlsx'
df = pd.read_excel(filename)
print(df)
I get
        Weekly Report Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0  December 1-7, 2014        NaN        NaN        NaN        NaN        NaN
1                 NaN        NaN        NaN        NaN        NaN        NaN
2                Date        App   Campaign    Country       Cost   Installs
What I mean is that the column names come out as 'Unnamed' because the header is taken from the first, irrelevant row.
If pandas read only the table, my columns would be Installs, Cost, etc., which is what I want.
How can I tell pandas to start reading from line 3?
Use skiprows to your advantage -
df = pd.read_excel(filename, skiprows=[0,1])
This should do it. pandas ignores the first two rows in this case -
skiprows : list-like
Rows to skip at the beginning (0-indexed)
More details are in the pandas.read_excel documentation.
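A small self-contained sketch of the same idea. It uses read_csv on an in-memory string rather than read_excel, purely so that it runs without an Excel file; that swap is my own stand-in, and the sample rows are made up. Both readers accept skiprows and header:

```python
import io
import pandas as pd

# Made-up file content: two title lines, then the real table.
raw = (
    "Weekly Report,,\n"
    '"December 1-7, 2014",,\n'
    "Date,App,Cost\n"
    "2014-12-01,foo,10\n"
    "2014-12-02,bar,12\n"
)

by_list = pd.read_csv(io.StringIO(raw), skiprows=[0, 1])  # skip rows 0 and 1 explicitly
by_count = pd.read_csv(io.StringIO(raw), skiprows=2)      # skip the first two rows
by_header = pd.read_csv(io.StringIO(raw), header=2)       # use row 2 as the header

print(by_list.columns.tolist())
```

If the real sheet has a blank row between the title lines and the table, as the printed output in the question suggests, the number of rows to skip may need to be one higher.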