drop_duplicates() with double brackets [[ ]] - python

I know "not working" is a horrible problem description, but it really is that simple. I have a data set with a year and a group identifier (year and gvkey).
The code I used to do this was
df = df.reset_index().drop_duplicates([['year', 'gvkey']]).set_index(['year', 'gvkey'], drop=True)
However, df.index.is_unique would return False. Puzzled, I looked at a slice of the data, and indeed:
>>> asd = df.head().reset_index()
>>> asd
Out[575]:
year gvkey sic state naics
0 1966 1000 3089 NaN NaN
1 1966 1000 3089 NaN NaN
2 1972 1000 3089 NaN NaN
3 1976 1000 3089 NaN NaN
4 1984 1001 5812 OK 722
>>> asd.drop_duplicates([['year', 'gvkey']])
Out[576]:
year gvkey sic state naics
0 1966 1000 3089 NaN NaN
1 1966 1000 3089 NaN NaN
4 1984 1001 5812 OK 722
However, following a random twitch, I also tried:
>>> asd.drop_duplicates(['year', 'gvkey'])
Out[577]:
year gvkey sic state naics
0 1966 1000 3089 NaN NaN
2 1972 1000 3089 NaN NaN
3 1976 1000 3089 NaN NaN
4 1984 1001 5812 OK 722
which gave me what I expected. Now I am thoroughly confused. What exactly is the difference between the two notations? I always used the double brackets [[ ]] for slicing etc. in Python. Do I need to revise all my code, or is this specific to drop_duplicates()?

From the documentation: when you pass a sequence to the first argument (called cols in pandas 0.13.1), you are giving the names of the columns to be considered when identifying the duplicates.
Therefore, the right syntax uses single brackets [] (or parentheses ()), because they produce the sequence of labels that you want. Using double brackets produces a sequence of lists in your case, and that does not represent the column labels you are looking for.

drop_duplicates expects a label or list of labels for its first argument. What you've created by putting two sets of brackets is a list of lists of labels. Pandas doesn't know what it's looking at when you do that.
I always used the double brackets [[]] for slicing etc in python
Most likely, either you haven't been doing that as much as you thought, or your code is full of awkwardly formed data structures and weird code to work with them. Under normal circumstances (such as here), double brackets would be an error, and you would already have noticed. I would recommend rechecking the places you've used double brackets; I can't tell whether they should be changed just from this information.
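To illustrate the difference with a tiny, made-up frame (a sketch, not taken from the original post):

import pandas as pd

df = pd.DataFrame({'year':  [1966, 1966, 1972],
                   'gvkey': [1000, 1000, 1000],
                   'sic':   [3089, 3089, 3089]})

df[['year', 'gvkey']]                      # double brackets: select a two-column sub-DataFrame
df.drop_duplicates(['year', 'gvkey'])      # single list: the labels of the columns to compare
# df.drop_duplicates([['year', 'gvkey']])  # list of lists: not valid labels; newer pandas versions raise an error here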


DataFrame cleanup: join/concatenate non-NaN values when there is NaN in-between rows (spin-off from row with 'unique' record)

I created a data-capturing template. When imported into Python (as a DataFrame), I noticed that some of the records spanned multiple rows.
I need to clean up the spanned records (see expected representation).
The 'Entity' column is the anchor column. Currently it is not definitive, as one can see from the row(s) underneath it that contain NaN.
NB: Along the line, I'll be dropping the 'Unnamed:' column.
Essentially, for every row where df.Entity.isnull(), the value(s) must be joined to the row above where df.Entity.notnull().
NB: I can adjust the source; however, I'd like to keep the source template because of ease of capturing and reproducibility.
[dummy representation]

| Unnamed: 0 | Entity | Country | Naming | Type: | Mode: | Duration | Page | Reg | Elig | Unit | Structure | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6. | Sur... | UK | ...Publ | Pros... | FT | Standard | Yes | Guid... 2021 | General | All | Intro & discussion... | Formal |
| NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Assessment |
| 7. | War... | UK | by Publ... | Retro... | FT | 1 yr | Yes | Reg 38... | Staff | All | Covering Doc... | Formal |
| NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | General | NaN | 5000 <10000 | 3-8 publ... |
| 8. | EAng... | UK | Publ... | Retro... | PT | 6 mths | Yes | Reg... | General (Cat B) | All | Critical Anal... | Formal as |
| NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Staff (Cat A) | NaN | 15000 | *g... |
| NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Edu & LLL l... | NaN | NaN | LLL not... |
[expected representation] I expect to have

| Unnamed | Entity | Country | Naming | Type: | Mode: | Duration | Page | Reg | Elig | Unit | Structure | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6. | Sur... | UK | ...Publ | Pros... | FT | Standard | Yes | Guid... 2021 | General | All | Intro & discussion... | Formal Assessment |
| 7. | War... | UK | by Publ... | Retro... | FT | 1 yr | Yes | Reg 38... | Staff General | All | Covering Doc... 5000 <10000 | Formal 3-8 publ... |
| 8. | EAng... | UK | Publ... | Retro... | PT | 6 mths | Yes | Reg... | General (Cat B) Staff (Cat A) Edu & LLL l... | All | Critical Anal... 15000 | Formal as *g... LLL not... |
My instinct is to test for isnull() on the [Entity] column. I would prefer not to do an if...then / loop check.
My mind wandered over stack, groupby, merge/join, pop. I'm not sure these approaches are 'right'.
My preference is for 'vectorisation' as much as possible, taking advantage of pandas' DataFrame operations.
I took note of
Merging Two Rows (one with a value, the other NaN) in Pandas
Pandas dataframe merging rows to remove NaN
Concatenate column values in Pandas DataFrame with "NaN" values
Merge DF of different Sizes with NaN values in between
In my case, my anchor column [Entity] has its key value on one row; however, the rest of the record's values sit on that row or span multiple rows.
NB: I'm dealing with one DataFrame and not two df.
I should also mention that I took note of the SO solution that 'explodes' newlines across multiple rows. This is the opposite of my scenario; however, I note it as it might provide hints.
Pandas: How to read a DataFrame from excel-file where multiple rows are sometimes separated by line break (\n)
[UPDATE: Workaround 1]
NB: This workaround is not a solution. It simply provides an alternative!
With leads from a Medium post and an SO post,
I successfully read my dataset directly from the table in the Word document. For this, I installed the python-docx library.
## code snippet; #Note: incomplete
from docx import Document as docs
... ...
document = docs("datasets/IDA..._AppendixA.docx")

def read_docx_table(document, tab_id: int = None, nheader: int = 1, start_row: int = 0):
    ... ...
    ## 'table' is presumably obtained in the elided code, e.g. from document.tables[tab_id]
    data = [[cell.text for cell in row.cells]
            for i, row in enumerate(table.rows) if i >= start_row]
    ... ...
    if nheader == 1:  ## first row as column header
        df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
    ... ...
    return df
... ...

## parse and show dataframe
df_table = read_docx_table(document, tab_id=3, nheader=1, start_row=0)
df_table
The records no longer spill over multiple rows. The columns that contained newlines now show the '\n' character.
I can, if I use df['col'].str.replace(), remove newlines '\n' or other delimiters, if I so desire.
Replacing part of string in python pandas dataframe
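For example (the column name here is only illustrative):

df_table['Notes'] = df_table['Notes'].str.replace('\n', ' ', regex=False)  # 'Notes' is just an example column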
[dataframe representation: importing and parsing using the python-docx library] Almost a true representation of the original table in Word

| Unnamed | Entity | Country | Naming | Type: | Mode: | Duration | Page | Reg | Elig | Unit | Structure | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6. | Sur... | UK | ...Publ | Pros... | FT | Standard | Yes | Guid... 2021 | General | All | Intro & discussion... | Formal \n Assessment |
| 7. | War... | UK | by Publ... | Retro... | FT | 1 yr | Yes | Reg 38... | Staff \nGeneral | All | Covering Doc... \n 5000 <10000 | Formal \n 3-8 publ... |
| 8. | EAng... | UK | Publ... | Retro... | PT | 6 mths | Yes | Reg... | General (Cat B) Staff (Cat A) \nEdu & LLL l... | All | Critical Anal... \n 15000 | Formal as \n *g... \n LLL not... |
[UPDATE 2]
After my Workaround 1 update, I saw @J_H's comments. Whilst it is not 'data corruption' in the true sense, it is nonetheless an ETL strategy issue. Thanks @J_H. Absolutely, well-thought-through design is of the essence.
Going forward, I'll either leave the source template practically as-is with minor modifications and use python-docx as I have done, or
modify the source template for easy capture in an Excel or 'csv'-type repository.
Whichever of the two approaches (or any other) I take, I'm still keen on data-cleaning code that can clean up the df to give the expected df.
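As for the clean-up itself, a minimal vectorised sketch (not the original poster's code; it assumes df is the imported frame with the columns shown above): every non-null Entity starts a new group, the NaN spill-over rows fall into the group of the anchor row above them, and the non-NaN cells of each group are joined per column.

import numpy as np
import pandas as pd

group_key = df['Entity'].notna().cumsum()          # anchor rows open a new group; NaN rows join the row above
df_clean = (df.groupby(group_key)
              .agg(lambda s: ' '.join(s.dropna().astype(str)))   # join the non-NaN cells of each column
              .reset_index(drop=True))
df_clean = df_clean.replace('', np.nan)            # groups that were all NaN in a column end up as '' - restore NaN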

Can't fill nan values in pandas even with inplace flag

I have a pandas dataframe containing NaN values for some column.
I'm trying to fill them with a default value (30), but it doesn't work.
Original dataframe:
type avg_speed
0 CAR 32.0
1 CAR NaN
2 CAR NaN
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR NaN
8 BIKE NaN
9 BIKE 35.1
...
Desired result:
type avg_speed
0 CAR 32.0
1 CAR 30
2 CAR 30
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR 30
8 BIKE 30
9 BIKE 35.1
My code:
def fill_with_default(pandas_df, column_name, default_value):
    print(f"Total count: {pandas_df.count()}")
    print(f"Count of Nan BEFORE: {pandas_df[column_name].isna().sum()}")
    pandas_df[column_name].fillna(default_value, inplace=True)
    print(f"Count of Nan AFTER: {pandas_df[column_name].isna().sum()}")
    return pandas_df
df = fill_with_default(df, "avg_speed", 30)
Output:
Total count: 105018
Count of Nan BEFORE: 49514
Count of Nan AFTER: 49514
The chain of dataframe transformations and the list of columns are too long, so it's difficult to show all the steps (join with another dataframe, drop useless columns, add useful columns, join with other dataframes, filter, etc.).
I've tried other options but they also don't work:
#pandas_df.fillna({column_name: default_value}, inplace=True)
#pandas_df.loc[pandas_df[column_name].isnull(),column_name] = default_value
...
The type of the column before applying "fillna" is float64, the same as default_value.
Therefore, my question is: what could be the potential reasons for this problem?
What kind of transformation can lead to it? The same method works for another, similar data frame; the only difference between them lies in the chain of transformations.
BTW, this warning is logged at that point:
/home/hadoop/.local/lib/python3.6/site-packages/pandas/core/generic.py:6287: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)
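The SettingWithCopyWarning suggests that the frame handed to fill_with_default is itself a copy of a slice produced earlier in the chain of transformations, so the in-place fill lands on a temporary object. A minimal sketch of the usual remedies (illustrative only, not the original pipeline):

def fill_with_default(pandas_df, column_name, default_value):
    pandas_df = pandas_df.copy()             # work on an explicitly owned copy (or call .copy() on the upstream slice)
    pandas_df[column_name] = pandas_df[column_name].fillna(default_value)   # assign instead of inplace=True
    return pandas_df

df = fill_with_default(df, "avg_speed", 30)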

Transform a DataFrame back to integers after np.nan?

I have a DataFrame that looks like this (original is a lot longer):
Country Energy Supply Energy Supply per Capita % Renewable
0 Afghanistan 321 10 78.6693
1 Albania 102 35 100
2 Algeria 1959 51 0.55101
3 American Samoa ... ... 0.641026
4 Andorra 9 121 88.6957
5 Angola 642 27 70.9091
I am trying to replace those pesky '...' with NaN using np.nan. But I want to change only those specific '...' values, because once np.nan appears in a column, all of its integers are changed to float. I am not sure if I am getting this right, so please correct me if I'm wrong. The reason I don't want all the numbers in the df to be float is that I will have to multiply the integers by large numbers, and the result comes up in scientific notation. I tried using this:
energy = energy.replace('...', np.nan)
But as I said, all numbers from df are turned into float.
If you want to write the data back to a file as integers, df.astype({'col1': 'int32'}) may help.
In NumPy, you may need to split the integer columns and float columns and operate on them separately; my_npArray.astype(int) may help you.
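A short sketch of that suggestion (the column names are taken from the sample above; the nullable 'Int64' dtype is a pandas feature available since 0.24 and is an alternative, not part of the answer above):

import numpy as np
import pandas as pd

energy = energy.replace('...', np.nan)

# columns that still contain NaN cannot hold plain int; the nullable 'Int64' dtype keeps
# integers next to missing values, while columns without NaN can simply use astype(int)
energy['Energy Supply'] = pd.to_numeric(energy['Energy Supply']).astype('Int64')
energy['Energy Supply per Capita'] = pd.to_numeric(energy['Energy Supply per Capita']).astype('Int64')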

how to remove rows in python data frame with condition?

I have the following data:
df =
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 Rocky 10 Casual kkkk 22.4
2 jenifer 50 Emergency 2500.6 '51.6'
3 Tom 10 sick Nan 46.2
4 Harry nn Casual 1800.1 '58.3'
5 Julie 22 sick 3600.2 'unknown'
6 Sam 5 Casual Nan 47.2
7 Mady 6 sick unknown Nan
Output:
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 jenifer 50 Emergency 2500.6 51.6
2 Tom 10 sick Nan 46.2
3 Sam 5 Casual Nan 47.2
4 Mady 6 sick unknown Nan
I want to delete records where there is a datatype error in the numerical columns (Leaves, Salary, Performance).
If a numerical column contains a string, that row should be deleted from the data frame. I tried:
df[['Leaves','Salary','Performance']].apply(pd.to_numeric, errors='coerce')
but this will convert the values to NaN.
Let's start from a note concerning your sample data:
It contains Nan strings, which are not among the strings automatically recognized as NaNs.
To treat them as NaN, I read the source text with read_fwf, passing na_values=['Nan'].
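That loading step looks roughly like this (the file name is an assumption):

import pandas as pd

df = pd.read_fwf('employees.txt', na_values=['Nan'])   # file name is illustrative; 'Nan' is not auto-recognised as missing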
And now get down to the main task:
Define a function to check whether a cell is acceptable:
def isAcceptable(cell):
    if pd.isna(cell) or cell == 'unknown':
        return True
    return all(c.isdigit() or c == '.' for c in cell)
I noticed that you accept NaN values.
You also accept a cell if it contains only the word unknown, but you don't accept a cell if that word is enclosed in e.g. quotes.
If you change your mind about what is / is not acceptable, change the above function accordingly.
Then, to leave only rows with all acceptable values in all 3 mentioned columns, run:
df[df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)]

How to identify non-empty columns in pandas dataframe?

I'm processing a dataset with around 2000 columns and I noticed many of them are empty. I want to know specifically how many of them are empty and how many are not. I use the following code:
df.isnull().sum()
This gives the number of empty (NaN) entries in each column. However, given that I'm investigating about 2000 columns with 7593 rows, the output in IPython looks like the following:
FEMALE_DEATH_YR3_RT 7593
FEMALE_COMP_ORIG_YR3_RT 7593
PELL_COMP_4YR_TRANS_YR3_RT 7593
PELL_COMP_2YR_TRANS_YR3_RT 7593
...
FIRSTGEN_YR4_N 7593
NOT1STGEN_YR4_N 7593
It doesn't show all the columns because there are too many of them. This makes it very difficult to tell how many columns are entirely empty and how many are not.
I wonder, is there any way to identify the non-empty columns quickly? Thanks!
To find the number of completely empty columns:
len(df.columns) - len(df.dropna(axis=1,how='all').columns)
3
df
Country movie name rating year Something
0 thg John 3 NaN NaN NaN
1 thg Jan 4 NaN NaN NaN
2 mol Graham lob NaN NaN NaN
df=df.dropna(axis=1,how='all')
Country movie name
0 thg John 3
1 thg Jan 4
2 mol Graham lob
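A small sketch that lists the empty and non-empty columns directly (checking isna().all() per column):

empty_cols = df.columns[df.isna().all()]        # columns where every value is NaN
non_empty_cols = df.columns[df.notna().any()]   # columns with at least one non-NaN value
print(len(empty_cols), "empty columns,", len(non_empty_cols), "non-empty columns")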
