How do I count the number of columns without NaN ?
import pandas as pd
df = pd.DataFrame(columns = ['name', 'age', 'favorite_color', 'grade','NaN'])
print(len(df.columns))
Output: 5
But looking Output: 4
How do I count the number of columns without NaN
Few additional ways are:
Method1 :
len(df.columns ^ ['NaN'])
Method2 pd.Index.difference:
len(df.columns.difference(['NaN']))
Method3 set: (almost same as 2)
len(set(df.columns)-{'NaN'})
Try this
res = [x for x in df.columns if str(x) not in ['nan','NaN','NAN'] ]
print(res)
#['name', 'age', 'favorite_color', 'grade']
print(len(res)
#4
if you mean to drop columns whose name is np.nan or NaN, respectively; try this:
df = df.loc[:, df.columns.notnull()]
df = df.loc[:, (filter(lambda x: x != 'NaN',df.columns))]
print(len(df.columns))
On the other side, you maight want to drop columns that include Nan:
# drop the columns where at least one element is NaN:
df.dropna(axis=1)
# drop the columns where all elements are NaN:
df.dropna(axis=1, how='all')
Use basic python to get the result:-
import pandas as pd
df = pd.DataFrame(columns = ['name', 'age', 'favorite_color', 'grade','NaN'])
col = df.columns
count = 0
for var in col:
if var not in ['NaN', 'nan']:
count+=1
print(f"The length of columns without NaN values is\t {count} ")
Output:
The length of columns without NaN values is 4
Related
I am trying to replace my column names that have quotations and simply remove the quotations but when I try this:
for x in df.columns:
x = x.replace('"', '')
print(x)
Nothing happens and the quotations are still there.
I would do something like this
cols = [column_name.replace('"','') for column_name in df.columns]
df.columns = cols
CODE
import pandas as pd
df=pd.DataFrame({"a":[1,2],'"b"':[3,4]})
print('BEFORE')
print(df)
cols = [column_name.replace('"','') for column_name in df.columns]
df.columns = cols
print('AFTER')
print(df)
OUTPUT
BEFORE
a "b"
0 1 3
1 2 4
AFTER
a b
0 1 3
1 2 4
you can remove it by writing the following code:
col=[]
for x in df.columns:
x = x.replace('"', '')
col.append(x)
df.columns=col
To know more about column renaming: Check this Renaming columns in pandas
One canonical solution to this problem is using pandas str.replace on the header directly (this is "vectorized"):
df = pd.DataFrame({"a": [1, 2], '"b"': [3, 4]})
df.columns = df.columns.str.replace('"', '')
df
a b
0 1 3
1 2 4
I have below dataframe
clm1, clm2, clm3
10, a, clm4=1|clm5=5
11, b, clm4=2
My desired result is
clm1, clm2, clm4, clm5
10, a, 1, 5
11, b, 2, Nan
I have tried below method
rows = list(df.index)
dictlist = []
for index in rows: #loop through each row to convert clm3 to dict
i = df.at[index, "clm3"]
mydict = dict(map(lambda x: x.split('='), [x for x in i.split('|') if '=' in x]))
dictlist.append(mydict)
l=json_normalize(dictlist) #convert dict column to flat dataframe
resultdf = example.join(l).drop('clm3',axis=1)
This is giving me desired result but I am looking for a more efficient way to convert clm3 to dict which does not involve looping through each row.
two steps :
idea is to create a double split and then group by the index and unstack the values as columns
s = (
df["clm3"]
.str.split("|", expand=True)
.stack()
.str.split("=", expand=True)
.reset_index(level=1, drop=True)
)
final = pd.concat([df, s.groupby([s.index, s[0]])[1].sum().unstack()], axis=1).drop(
"clm3", axis=1
)
print(final)
clm1 clm2 clm4 clm5
0 10 a 1 5
1 11 b 2 NaN
Using str.extractall to get your values and unstack to pivot them to a column for each unique value.
And str.get_dummies to get a column for each unique clm.
values = (
df['clm3'].str.extractall('(=\d)')[0]
.str.replace('=', '')
.unstack()
.rename_axis(None, axis=1)
)
columns = df['clm3'].str.replace('=\d', '').str.get_dummies(sep='|').columns
values.columns = columns
dfnew = pd.concat([df[['clm1', 'clm2']], values], axis=1)
clm1 clm2 0 1
0 10 a 1 5
1 11 b 2 NaN
I have the following data frame:
df = pd.DataFrame()
df['Name'] = ['A','B','C']
df['Value'] = ['2+0.5','1+0.2','2-0.06']
What I wanted to do is to split the value and assign to two new columns.
It means my desired output will be as follow:
The element in value column will be splitted into two and I will re-use the sign in the columns.
I am very grateful for your advice.
Thanks.
import pandas as pd
df = pd.DataFrame()
df['Name'] = ['A','B','C']
df['Value'] = ['2+0.5','1+0.2','2-0.06']
df[['value1','value2']]=df.Value.str.split('[-+]+',expand=True)
contain_symbol = df.Value.str.contains('-',regex=False)
df.loc[contain_symbol,"value2"] = -df.loc[contain_symbol,"value2"].astype(float)
Say you have a column source_name with values like : 'Anand_VUNagar_DC (Gujarat)' and you want to create 3 new columns A,B,B with values : 'Mother', 'Teresa', 'India'
You can do by-
df1[['A','B','C']]=df1['source_name'].str.rsplit('_',2, expand=True)
result:
source_name A B C
0 Anand_VUNagar_DC (Gujarat) Anand VUNagar DC (Gujarat)
IIUC
newdf=df.Value.str.split('([-+])+',expand=True)
newdf[2]=newdf[1].map({'+':1,'-':-1})*newdf[2].astype(float)
df[['value1','value2']]=newdf[[0,2]]
df
Out[30]:
Name Value value1 value2
0 A 2+0.5 2 0.50
1 B 1+0.2 1 0.20
2 C 2-0.06 2 -0.06
I have the following dataframe:
a b x y
0 1 2 3 -1
1 2 4 6 -2
2 3 6 9 -3
3 4 8 12 -4
How can I move columns b and x such that they are the last 2 columns in the dataframe? I would like to specify b and x by name, but not the other columns.
You can rearrange columns directly by specifying their order:
df = df[['a', 'y', 'b', 'x']]
In the case of larger dataframes where the column titles are dynamic, you can use a list comprehension to select every column not in your target set and then append the target set to the end.
>>> df[[c for c in df if c not in ['b', 'x']]
+ ['b', 'x']]
a y b x
0 1 -1 2 3
1 2 -2 4 6
2 3 -3 6 9
3 4 -4 8 12
To make it more bullet proof, you can ensure that your target columns are indeed in the dataframe:
cols_at_end = ['b', 'x']
df = df[[c for c in df if c not in cols_at_end]
+ [c for c in cols_at_end if c in df]]
cols = list(df.columns.values) #Make a list of all of the columns in the df
cols.pop(cols.index('b')) #Remove b from list
cols.pop(cols.index('x')) #Remove x from list
df = df[cols+['b','x']] #Create new dataframe with columns in the order you want
For example, to move column "name" to be the first column in df you can use insert:
column_to_move = df.pop("name")
# insert column with insert(location, column_name, column_value)
df.insert(0, "name", column_to_move)
similarly, if you want this column to be e.g. third column from the beginning:
df.insert(2, "name", column_to_move )
You can use to way below. It's very simple, but similar to the good answer given by Charlie Haley.
df1 = df.pop('b') # remove column b and store it in df1
df2 = df.pop('x') # remove column x and store it in df2
df['b']=df1 # add b series as a 'new' column.
df['x']=df2 # add b series as a 'new' column.
Now you have your dataframe with the columns 'b' and 'x' in the end. You can see this video from OSPY : https://youtu.be/RlbO27N3Xg4
similar to ROBBAT1's answer above, but hopefully a bit more robust:
df.insert(len(df.columns)-1, 'b', df.pop('b'))
df.insert(len(df.columns)-1, 'x', df.pop('x'))
This function will reorder your columns without losing data. Any omitted columns remain in the center of the data set:
def reorder_columns(columns, first_cols=[], last_cols=[], drop_cols=[]):
columns = list(set(columns) - set(first_cols))
columns = list(set(columns) - set(drop_cols))
columns = list(set(columns) - set(last_cols))
new_order = first_cols + columns + last_cols
return new_order
Example usage:
my_list = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']
reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
# Output:
['fourth', 'third', 'first', 'sixth', 'second']
To assign to your dataframe, use:
my_list = df.columns.tolist()
reordered_cols = reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
df = df[reordered_cols]
Simple solution:
old_cols = df.columns.values
new_cols= ['a', 'y', 'b', 'x']
df = df.reindex(columns=new_cols)
An alternative, more generic method;
from pandas import DataFrame
def move_columns(df: DataFrame, cols_to_move: list, new_index: int) -> DataFrame:
"""
This method re-arranges the columns in a dataframe to place the desired columns at the desired index.
ex Usage: df = move_columns(df, ['Rev'], 2)
:param df:
:param cols_to_move: The names of the columns to move. They must be a list
:param new_index: The 0-based location to place the columns.
:return: Return a dataframe with the columns re-arranged
"""
other = [c for c in df if c not in cols_to_move]
start = other[0:new_index]
end = other[new_index:]
return df[start + cols_to_move + end]
You can use pd.Index.difference with np.hstack, then reindex or use label-based indexing. In general, it's a good idea to avoid list comprehensions or other explicit loops with NumPy / Pandas objects.
cols_to_move = ['b', 'x']
new_cols = np.hstack((df.columns.difference(cols_to_move), cols_to_move))
# OPTION 1: reindex
df = df.reindex(columns=new_cols)
# OPTION 2: direct label-based indexing
df = df[new_cols]
# OPTION 3: loc label-based indexing
df = df.loc[:, new_cols]
print(df)
# a y b x
# 0 1 -1 2 3
# 1 2 -2 4 6
# 2 3 -3 6 9
# 3 4 -4 8 12
You can use movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'b')
mc.MoveToLast(df,'x')
Hope that helps.
P.S : The package can be found here. https://pypi.org/project/movecolumn/
You can also do this as a one-liner:
df.drop(columns=['b', 'x']).assign(b=df['b'], x=df['x'])
This will move any column to the last column :
Move any column to the last column of dataframe :
df= df[ [ col for col in df.columns if col != 'col_name_to_moved' ] + ['col_name_to_moved']]
Move any column to the first column of dataframe:
df= df[ ['col_name_to_moved'] + [ col for col in df.columns if col != 'col_name_to_moved' ]]
where col_name_to_moved is the column that you want to move.
I use Pokémon database as an example, the columns for my data base are
['Name', '#', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']
Here is the code:
import pandas as pd
df = pd.read_html('https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6')[0]
cols = df.columns.to_list()
cos_end= ["Name", "Total", "HP", "Defense"]
for i, j in enumerate(cos_end, start=(len(cols)-len(cos_end))):
cols.insert(i, cols.pop(cols.index(j)))
print(cols)
df = df.reindex(columns=cols)
print(df)
I'm struggling with properly loading a csv that has a multi lines header with blanks. The CSV looks like this:
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
What I would like to get is:
When I try to load with pd.read_csv(file, header=[0,1], sep=','), I end up with the following:
Is there a way to get the desired result?
Note: alternatively, I would accept this as a result:
Versions used:
Python: 2.7.8
Pandas 0.16.0
Here is an automated way to fix the column index. First,
pull the column level values into a DataFrame:
columns = pd.DataFrame(df.columns.tolist())
then rename the Unnamed: columns to NaN:
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
and then forward-fill the NaNs:
columns[0] = columns[0].fillna(method='ffill')
so that columns now looks like
In [314]: columns
Out[314]:
0 1
0 NaN A
1 NaN B
2 C X
3 C Y
4 C Z
5 D X
6 D Y
7 D Z
Now we can find the remaining NaNs and fill them with empty strings:
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
To make the first two columns, A and B, indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
Now you can build a new MultiIndex and assign it to df.columns:
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
Putting it all together, if data is
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
then
import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].fillna(method='ffill')
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
print(df)
yields
A B C D
X Y Z X Y Z
0 1 2 3 4 5 6 7 8
1 3 4 5 6 7 8 9 0
There is no magical way of making pandas aware of how you want your index to look, the closest way you can do this is by specifying a lot yourself, like this:
names = ['A', 'B',
('C','X'), ('C', 'Y'), ('C', 'Z'),
('D','X'), ('D','Y'), ('D', 'Z')]
pd.read_csv(file, mangle_dupe_cols=True,
header=1, names=names, index_col=[0, 1])
Gives:
C D
X Y Z X Y Z
A B
1 2 3 4 5 6 7 8
To do this in a dynamic fashion, you could read the first two lines of the CSV as they are and loop through the columns you get to generate the names variable dynamically before loading the full dataset.
pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])
Then access the columns and loop to create your header.
Again, not a very clean solution, but should work.
you can read using :
df = pd.read_csv('file.csv', header=[0, 1], skipinitialspace=True, tupleize_cols=True)
and then
df.columns = pd.MultiIndex.from_tuples(df.columns)
Load the dataframe, with multiindex:
df = pd.read_csv(filelist,header=[0,1], sep=',')
Write a function to replace the index:
def replace_index(df):
arr = df.columns.values
l = [list(x) for x in arr]
for i in range(len(l)):
if l[i][0][:7] == 'Unnamed':
if l[i-1][0][:7] != 'Unnamed':
l[i][0] = l[i-1][0]
for i in range(len(l)):
if l[i][0][:7] == 'Unnamed':
l[i][0] = l[i][1]
l[i][1] = ''
index = pd.MultiIndex.from_tuples(l)
df.columns = index
return df
Return the new dataframe properly indexed:
replace_index(df)
I used a technique to flatten from the multi-index columns and make one column. It works well for me.
your_df.columns = ['_'.join(col).strip() for col in your_df.columns.values]
Import your csv file providing the header row indexes:
df = pd.read_csv('file.csv', header=[0, 1, 2])
Then, you can iterate over each column header, clean it up, assign it to a tuple, the re-assign the dataframe columns using pd.MultiIndex.from_tuples(list_of_tuples)
df.columns = pd.MultiIndex.from_tuples(
[tuple(['' if y.find('Unnamed')==0 else y for y in x]) for x in df.columns]
)
this is the quick one liner I was looking for when trying to figure this out.