How to replace specific character in pandas column with null? - python

I have a column in a dataset holding categorical company sizes. It currently looks like this, where the '-' hyphens represent missing data:
I want to replace the '-' placeholders with nulls so I can analyse the missing data. However, when I use pandas' replace (see the following code) with a None value, it seems to also null the genuine entries, because they contain hyphens too (e.g. 51-200).
df['Company Size'].replace({'-': None}, inplace=True, regex=True)
How can I replace only lone standing hyphens and leave the other entries untouched?

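You do not need regex=True. With it, '-' is treated as a pattern and matches any value that contains a hyphen; without it, replace only matches cells whose whole value is '-'.
df['Company Size'] = df['Company Size'].replace({'-': None})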
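
As a minimal sketch of the difference, assuming a small made-up Series (only 51-200 comes from the question; the other values are illustrative): a plain replace matches whole cell values, and if a regex is wanted anyway, anchoring it with ^ and $ restricts it to lone hyphens.

```python
import numpy as np
import pandas as pd

# Illustrative data; only '51-200' appears in the question.
s = pd.Series(['1-10', '-', '51-200', '-', '1000+'], name='Company Size')

# Without regex=True, only cells whose whole value is '-' are replaced.
exact = s.replace('-', np.nan)

# With regex=True, anchor the pattern so it matches a lone hyphen only.
anchored = s.replace(r'^-$', np.nan, regex=True)

# Count the missing entries after the replacement.
print(exact.isna().sum())
```

Either result can then be passed to isna() or dropna() for the missing-data analysis.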

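You could also assign the result back to the column. The replacement has to be a real null such as np.nan; the quoted string 'None' would only store the text "None":
import numpy as np
df['column_name'] = df['column_name'].replace('-', np.nan)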

import numpy as np
df.replace('-', np.nan, inplace=True)
This code worked for me. Note that it replaces lone hyphens in every column of df, not just one.

You can do it like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', '-', 'c--', 'd', 'e']})
# replace cells that are exactly '-' with NaN ('c--' is left alone)
df['C'] = df['C'].replace('-', np.nan)
# optionally turn NaN into None
df = df.where(pd.notnull(df), None)
# can also use this -> df['C'] = df['C'].where(pd.notnull(df['C']), None)
print(df)
Output:
   A  B     C
0  0  5     a
1  1  6  None
2  2  7   c--
3  3  8     d
4  4  9     e
Another example:
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': ['5-5', '-', 7, 8, 9],
                   'C': ['a', 'b', 'c--', 'd', 'e']})
df['B'] = df['B'].replace('-', np.nan)
df = df.where(pd.notnull(df), None)
print(df)
Output:
   A     B    C
0  0   5-5    a
1  1  None    b
2  2     7  c--
3  3     8    d
4  4     9    e

Related

How to calculate number of rows between 2 indexes of pandas dataframe

I have the following Pandas dataframe in Python:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d)
df.index=['A', 'B', 'C', 'D', 'E']
df
which gives the following output:
   col1  col2
A     1     6
B     2     7
C     3     8
D     4     9
E     5    10
I need to write a function (say the name will be getNrRows(fromIndex) ) that will take an index value as input and will return the number of rows between that given index and the last index of the dataframe.
For instance:
nrRows = getNrRows("C")
print(nrRows)
> 2
Because it takes 2 steps (rows) from the index C to the index E.
How can I write such a function in the most elegant way?
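The simplest way might be a label slice, which includes the starting row (hence the - 1); row_index is the index label, e.g. 'C':
len(df[row_index:]) - 1
Alternatively, the index has a built-in method, get_indexer_for, which returns the positions as an array: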
len(df)-df.index.get_indexer_for(['C'])-1
Out[179]: array([2], dtype=int64)
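
Wrapped up as the function the question asks for, this could look roughly like the sketch below. The name get_nr_rows and the extra df parameter are assumptions made for illustration, and the sketch assumes the index labels are unique so that Index.get_loc returns a single integer position.

```python
import pandas as pd

def get_nr_rows(df, from_index):
    """Count the rows after from_index, up to and including the last row.

    Assumes the index labels are unique, so get_loc returns one integer.
    """
    return len(df) - df.index.get_loc(from_index) - 1

d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d, index=['A', 'B', 'C', 'D', 'E'])

nr_rows = get_nr_rows(df, 'C')
print(nr_rows)
```

A label that is not in the index raises a KeyError, which may or may not be the behaviour you want.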

Replace values of a dataframe with the value of another dataframe

I have two pandas dataframes
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
Column 'A' acts as the key in both dataframes.
How can I replace the values of df1's column 'B' with the values of df2's column 'B'?
Desired result for df1:
A   B
1   8
3  10
5  12
Maybe isin() is what you're looking for. Note that this relies on the matching rows of df2 appearing in the same order as the rows of df1:
df1['B'] = df2[df2['A'].isin(df1['A'])]['B'].values
print(df1)
Prints:
   A   B
0  1   8
1  3  10
2  5  12
One possible solution:
wrk = df1.set_index('A').B
wrk.update(df2.set_index('A').B)
df1 = wrk.reset_index()
The result is:
   A   B
0  1   8
1  3  10
2  5  12
Another solution, based on merge:
df1 = df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=('_x', ''))\
    .drop(columns=['B_x'])
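
A further option, not taken from the answers above, is to build a lookup Series from df2 and map df1's keys through it. This is only a sketch: it assumes the values in df2['A'] are unique, and any key in df1 that is missing from df2 would come back as NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                    'B': [8, 9, 10, 11, 12],
                    'C': ['K', 'D', 'E', 'F', 'G']})

# Series indexed by key 'A' holding the replacement values from df2.
lookup = df2.set_index('A')['B']

# Look up each key of df1; row order in df2 does not matter here.
df1['B'] = df1['A'].map(lookup)
print(df1)
```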

Pandas sample by filter criteria

I have a data frame like the one below
d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)
df
   var1  var2 class
0     1     5     a
1     2     6     a
2     3     7     c
3     4     8     b
I would like to be able to change the proportions of the class column. For example, I would like to randomly down-sample class a by 50% but keep the number of rows for the other classes the same. The result would be:
df
   var1  var2 class
0     1     5     a
1     3     7     c
2     4     8     b
How would this be done?
I first split the DataFrame into df_selection and df_remaining.
I then reduced df_selection by REMOVE_PERCENTAGE and concatenated the resulting DataFrame with df_remaining again.
import numpy as np
import pandas as pd
d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)
REMOVE_PERCENTAGE = 0.5 # between 0 and 1
df = df.set_index(['class'])
# the list ['a'] makes .loc return a DataFrame even if only one row matches
df_selection = df.loc[['a']] \
    .reset_index()
df_remaining = df.drop('a') \
    .reset_index()
# number of rows of class 'a' to drop, picked at random without replacement
rows_to_remove = int(REMOVE_PERCENTAGE * len(df_selection.index))
drop_indices = np.random.choice(df_selection.index, rows_to_remove, replace=False)
df_selection_reduced = df_selection.drop(drop_indices)
df_result = pd.concat([df_selection_reduced, df_remaining]) \
    .reset_index(drop=True)
print(df_result)
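
The same idea can be sketched more compactly with a boolean mask and DataFrame.sample. This is an alternative to the code above rather than a rewrite of it; the name KEEP_FRACTION is made up for illustration, and random_state is fixed only to make the sketch repeatable.

```python
import pandas as pd

d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)

KEEP_FRACTION = 0.5  # hypothetical name: share of class 'a' rows to keep

# Boolean mask selecting the class to down-sample.
is_a = df['class'] == 'a'

# Randomly keep a fraction of the 'a' rows; all other rows stay untouched.
kept_a = df[is_a].sample(frac=KEEP_FRACTION, random_state=0)

# Recombine, restore the original row order, and renumber the index.
df_result = pd.concat([kept_a, df[~is_a]]).sort_index().reset_index(drop=True)
print(df_result)
```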

how to add pound(#) symbol in pandas 'to_csv' header?

I have a pandas DataFrame and I would like to save the DataFrame in a tab separated file format with pound(#) symbol at the beginning of the header.
Here is my demo code:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
file_name = 'test.tsv'
df.to_csv(file_name, sep='\t', index=False)
The above code creates a dataframe and saves it in tab-separated format, which looks like:
a b c
1 2 3
4 5 6
7 8 9
But how can I add a pound symbol to the header while saving the DataFrame?
I want the output to look like below:
#a b c
1 2 3
4 5 6
7 8 9
I hope the question is clear, and thanks in advance for the help.
Note: I would like to keep the DataFrame's header definition the same.
Using your code, just rename the a column to #a, like below:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['#a', 'b', 'c'])
file_name = 'test.tsv'
df.to_csv(file_name, sep='\t', index=False)
Edit
If you don't want to adjust the starting dataframe, use .rename before sending to csv:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
file_name = 'test.tsv'
df.rename(columns={
    'a': '#a'
}).to_csv(file_name, sep='\t', index=False)
Use the header argument to create aliases for the columns.
df.to_csv(file_name, sep='\t', index=False,
          header=[f'#{x}' if x == df.columns[0] else x for x in df.columns])
#a b c
1 2 3
4 5 6
7 8 9
Here's another way to get your column aliases:
from itertools import zip_longest
header = [''.join(x) for x in zip_longest('#', df.columns, fillvalue='')]
#['#a', 'b', 'c']
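
Another possibility, sketched here as an alternative to the answers above, is to leave the column names alone and write the '#' to the file yourself before handing the open file object to to_csv. The temporary-directory path is an assumption made so the sketch does not write into the working directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
# Assumed location for the demo file.
file_name = os.path.join(tempfile.gettempdir(), 'test.tsv')

# Write the '#' first, then let to_csv append the header and rows
# to the same open file handle.
with open(file_name, 'w', newline='') as f:
    f.write('#')
    df.to_csv(f, sep='\t', index=False)

# Read the header line back to check the result.
with open(file_name) as f:
    first_line = f.readline().rstrip('\r\n')
print(first_line)
```

The DataFrame's own column names stay 'a', 'b', 'c'; only the file gets the prefixed header.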

Replicating rows in pandas dataframe by column value and add a new column with repetition index

My question is similar to one asked here. I have a dataframe and I want to repeat each row of the dataframe k number of times. Along with it, I also want to create a column with values 0 to k-1. So
import pandas as pd
df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n': [1, 2, 3],
    'v': [10, 13, 8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'n': [1, 2, 2, 3, 3, 3],
    'v': [10, 13, 13, 8, 8, 8],
    'repeat_id': [0, 0, 1, 0, 1, 2]
})
The command below does half of the job. I am looking for a pandas way of adding the repeat_id column.
df.loc[df.index.repeat(df.n)]
Use GroupBy.cumcount, and copy to avoid SettingWithCopyWarning:
Without the copy, if you modify values in df1 later you will find that the modifications do not propagate back to the original data (df), and that pandas issues a warning.
df1 = df.loc[df.index.repeat(df.n)].copy()
df1['repeat_id'] = df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print(df1)
  id  n   v  repeat_id
0  A  1  10          0
1  B  2  13          0
2  B  2  13          1
3  C  3   8          0
4  C  3   8          1
5  C  3   8          2
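
For comparison, the repeat_id values can also be built directly with NumPy instead of groupby. This is a sketch of an alternative, not the method from the answer above, and it assumes df has at least one row and that n holds non-negative integers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n': [1, 2, 3],
    'v': [10, 13, 8]
})

# Repeat each row n times and renumber the index.
df1 = df.loc[df.index.repeat(df['n'])].reset_index(drop=True)

# For every original row, generate 0..n-1 and join the pieces in row order.
df1['repeat_id'] = np.concatenate([np.arange(k) for k in df['n']])
print(df1)
```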
