I want to count, per column, the occurrences of two different values: the number of nulls and the number of \N strings in my dataframe. For example, given:
A B C D E D
1 \N 1 \N 12 1
2 4 \N 3 0 \N
3 4 M \N 1
I expect the following result:
A 2
B 1
C 2
D 1
E 1
F 2
I already succeeded in counting the missing values with the following code:
df = pd.read_csv("mypath/myFile", sep=',')
null_value = df.isnull().sum()
But the following code doesn't work:
break_line = df[df == '\N'].count()
return break_line + null_value
I get the following error:
TypeError: Could not compare ['\N'] with block values
A one-liner:
ns = df.applymap(lambda x: x == r'\N').sum(axis=0)  # raw string: a plain '\N' is a syntax error in Python 3
null_value + ns
A 2
B 1
C 2
D 1
E 1
F 2
You can simply do the following using applymap (note that recent pandas versions deprecate applymap in favour of DataFrame.map):
df.applymap(lambda x: x == r'\N').sum() + df.isnull().sum()
which gives you the desired output:
A 2
B 1
C 2
D 1
E 1
F 2
dtype: int64
Note: you used D twice as a column header; I replaced the second occurrence with F.
I assume you only want to count values where the string ends with '\N'. If not, you can use str.contains instead.
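If you did want the substring behaviour instead, a minimal sketch along the same lines (the two-column frame here is purely illustrative):

```python
import pandas as pd

# Illustrative frame; only these two columns are assumed here
df = pd.DataFrame({'A': [r'\N', 4, None], 'C': [r'x\N', r'\N', 'M']})

# Count cells containing the literal characters backslash-N;
# regex=False stops \N from being read as a regex escape.
counts = {col: df[col].astype(str).str.contains(r'\N', regex=False).sum()
          for col in df}
print(counts)
```

astype(str) avoids the NaN results that str.contains would return for non-string cells.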
I use a dictionary comprehension to loop through the columns of the dataframe and a vectorized str function to count the number of rows with \N at the end.
df = pd.DataFrame({'A': [r'\N', 4, None],
                   'B': [1, None, 4],
                   'C': [r'\N', r'\N', 'M'],
                   'D': [12, 3, None],
                   'E': [1, 0, r'\N'],
                   'F': [None, r'\N', 1]})
>>> df
A B C D E F
0 \N 1 \N 12 1 None
1 4 NaN \N 3 0 \N
2 None 4 M NaN \N 1
>>> pd.Series({col: df[col].str.endswith(r'\N').sum()
               if df[col].dtype == 'object' else 0
               for col in df}) + df.isnull().sum()
A 2
B 1
C 2
D 1
E 1
F 2
dtype: int64
A solution which uses only vectorized calculations:
df.isna().sum() + (df == '\\N').sum()
Output:
A 2
B 1
C 2
D 1
E 1
F 2
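For reference, the vectorized version runs end-to-end like this (same data as the constructed frame above; raw strings keep \N literal):

```python
import pandas as pd

df = pd.DataFrame({'A': [r'\N', 4, None],
                   'B': [1, None, 4],
                   'C': [r'\N', r'\N', 'M'],
                   'D': [12, 3, None],
                   'E': [1, 0, r'\N'],
                   'F': [None, r'\N', 1]})

# Missing values per column plus literal '\N' strings per column
result = df.isna().sum() + df.eq(r'\N').sum()
print(result)
```

df.eq compares elementwise; NaN never equals the string, so the two counts do not overlap.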
How do I split a column into rows if the values are separated with a comma? I am stuck here. I have used the following code:
xd = df.assign(var1=df['var1'].str.split(',')).explode('var1')
xd = xd.assign(var2=xd['var2'].str.split(',')).explode('var2')
xd
But the above code generates multiple irrelevant rows. Please suggest an answer.
From the DataFrame.explode docs:
"For multiple columns, specify a non-empty list with each element be str or tuple, and all specified columns their list-like data on same row of the frame must have matching length."
df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],
'B': 1,
'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})
df
A B C
0 [0, 1, 2] 1 [a, b, c]
1 foo 1 NaN
2 [] 1 []
3 [3, 4] 1 [d, e]
Multi-column explode.
df.explode(list('AC'))
A B C
0 0 1 a
0 1 1 b
0 2 1 c
1 foo 1 NaN
2 NaN 1 NaN
3 3 1 d
3 4 1 e
For your specific question:
xd = df.assign(
var1=df['var1'].str.split(','),
var2=df['var2'].str.split(',')
).explode(['var1', 'var2'])
xd
var1 var2 var3
0 a e 1
0 b f 1
0 c g 1
0 d h 1
1 p s 2
1 q t 2
1 r u 2
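The question never shows its input frame, so here is a hypothetical one (columns var1, var2, var3 assumed from the output above) that reproduces the result; note that explode with a list of columns needs pandas >= 1.3:

```python
import pandas as pd

# Hypothetical input; the question did not include the actual data
df = pd.DataFrame({'var1': ['a,b,c,d', 'p,q,r'],
                   'var2': ['e,f,g,h', 's,t,u'],
                   'var3': [1, 2]})

# Split both columns, then explode them together so the pieces stay paired
xd = df.assign(var1=df['var1'].str.split(','),
               var2=df['var2'].str.split(',')).explode(['var1', 'var2'])
print(xd)
```

The lists must have matching lengths row by row, otherwise explode raises a ValueError.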
I was doing a project in nlp.
My input is:
index name lst
0 a c
0 d
0 e
1 f
1 b g
I need output like this:
index name lst combine
0 a c a c
0 d a d
0 e a e
1 f b f
1 b g b g
How can I achieve this?
You can use groupby + transform('max') to fill the empty cells with each group's letter, since any letter compares greater than the empty string. The rest is simple string concatenation per column:
df['combine'] = df.groupby('index')['name'].transform('max') + ' ' + df['lst']
Used input:
df = pd.DataFrame({'index': [0,0,0,1,1],
'name': ['a','','','','b'],
'lst': list('cdefg'),
})
NB: I treated "index" as a regular column here; if it is actually the index, use df.index in the groupby.
Output:
index name lst combine
0 0 a c a c
1 0 d a d
2 0 e a e
3 1 f b f
4 1 b g b g
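As a self-contained check of why 'max' works here: the empty string compares less than any letter, so the group maximum is the letter. Using the input above:

```python
import pandas as pd

df = pd.DataFrame({'index': [0, 0, 0, 1, 1],
                   'name': ['a', '', '', '', 'b'],
                   'lst': list('cdefg')})

# '' < 'a' in string ordering, so max of each group is the letter
df['combine'] = df.groupby('index')['name'].transform('max') + ' ' + df['lst']
print(df['combine'].tolist())
```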
Suppose I have the following pandas dataframe:
df = pd.DataFrame([['A','B'],[8,'s'],[5,'w'],['e',1],['n',3]])
print(df)
0 1
0 A B
1 8 s
2 5 w
3 e 1
4 n 3
If there is an integer in column 1, then I want to swap the value with the value from column 0, so in other words I want to produce this dataframe:
0 1
0 A B
1 8 s
2 5 w
3 1 e
4 3 n
Create a mask of the numeric values in the second column with to_numeric(errors='coerce') and Series.notna:
m = pd.to_numeric(df[1], errors='coerce').notna()
Another solution: convert to strings with Series.astype and test Series.str.isnumeric, but this works only for integers:
m = df[1].astype(str).str.isnumeric()
Then swap with DataFrame.loc, using DataFrame.values to get a NumPy array and avoid column alignment:
df.loc[m, [0, 1]] = df.loc[m, [1, 0]].values
print(df)
0 1
0 A B
1 8 s
2 5 w
3 1 e
4 3 n
Finally, if possible, it is better to promote the first row to column names:
df.columns = df.iloc[0]
df = df.iloc[1:].rename_axis(None, axis=1)
print(df)
A B
1 8 s
2 5 w
3 1 e
4 3 n
or, alternatively, remove header=None in read_csv.
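Put together with the example frame, the to_numeric approach runs as:

```python
import pandas as pd

df = pd.DataFrame([['A', 'B'], [8, 's'], [5, 'w'], ['e', 1], ['n', 3]])

# True where column 1 holds a number
m = pd.to_numeric(df[1], errors='coerce').notna()

# Swap the two columns on those rows; .values drops the column
# labels so pandas does not align them back to their original columns
df.loc[m, [0, 1]] = df.loc[m, [1, 0]].values
print(df)
```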
sorted with a key that tests for int:
df.loc[:] = [
sorted(t, key=lambda x: not isinstance(x, int))
for t in zip(*map(df.get, df))
]
df
0 1
0 A B
1 8 s
2 5 w
3 1 e
4 3 n
You can be explicit with the columns if you'd like:
df[[0, 1]] = [
sorted(t, key=lambda x: not isinstance(x, int))
for t in zip(df[0], df[1])
]
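Made self-contained with the same example frame (the sort is stable, so rows where neither or both values are ints keep their original order):

```python
import pandas as pd

df = pd.DataFrame([['A', 'B'], [8, 's'], [5, 'w'], ['e', 1], ['n', 3]])

# False sorts before True, so ints move to the front of each pair
df[[0, 1]] = [sorted(t, key=lambda x: not isinstance(x, int))
              for t in zip(df[0], df[1])]
print(df)
```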
I have a dataframe with mixed string and float/int values in column 'k':
>>> df
a b k
0 1 a q
1 2 b 1
2 3 c e
3 4 d r
When I do this to remove any whitespaces from all columns:
df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
It converts the integer 1 to a NaN:
a b k
0 1 a q
1 2 b NaN
2 3 c e
3 4 d r
How can I overcome this?
You can do it with mask and to_numeric: to_numeric with errors='coerce' turns every non-numeric value into NaN, and mask then replaces exactly those cells with their stripped string versions:
df = df.mask(df.apply(pd.to_numeric, errors='coerce').isnull(),
             df.astype(str).apply(lambda x: x.str.strip()))
df
Out[572]:
a b k
0 1 a q
1 2 b 1
2 3 c e
3 4 d r
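A self-contained version with whitespace actually present, to show the stripping (the data here is made up for the demo):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [' a', 'b ', ' c ', 'd'],
                   'k': ['q ', 1, ' e', 'r']})

# Cells that fail to_numeric get replaced by their stripped string
# form; numeric cells (like the 1 in 'k') are kept untouched
num = df.apply(pd.to_numeric, errors='coerce')
df = df.mask(num.isnull(), df.astype(str).apply(lambda x: x.str.strip()))
print(df)
```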
Assuming that I have the following pandas dataframe:
>>> data = pd.DataFrame({ 'X':['a','b'], 'Y':['c','d'], 'Z':['e','f']})
X Y Z
0 a c e
1 b d f
The desired output is:
0 a c e
1 b d f
When I run the following code, I get:
>>> data.sum(axis=1)
0 ace
1 bdf
So how do I add columns of strings with space between them?
Use apply per row (axis=1) with join:
a = data.apply(' '.join, axis=1)
print (a)
0 a c e
1 b d f
dtype: object
Another solution: add a trailing space to every cell, sum, and finally str.rstrip:
a = data.add(' ').sum(axis=1).str.rstrip()
#same as
#a = (data + ' ').sum(axis=1).str.rstrip()
print (a)
0 a c e
1 b d f
dtype: object
You can do it as follows:
data.apply(lambda x:x + ' ').sum(axis=1)
The output is :
0 a c e
1 b d f
dtype: object
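One more variant using Series.str.cat, which concatenates the first column with the others directly (assuming all columns are strings):

```python
import pandas as pd

data = pd.DataFrame({'X': ['a', 'b'], 'Y': ['c', 'd'], 'Z': ['e', 'f']})

# Concatenate X with Y and Z, space-separated
a = data['X'].str.cat([data['Y'], data['Z']], sep=' ')
print(a.tolist())
```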