Create Combined Column From Three Columns With The Most Data (Not NA) - python

I have data like:
import pandas as pd
df = pd.DataFrame(data=[[1,-2,3,0,0], [0,0,0,4,0], [0,0,0,0,5]]).T
df.columns = ['col1', 'col2', 'col3']
> df
   col1  col2  col3
0     1     0     0
1    -2     0     0
2     3     0     0
3     0     4     0
4     0     0     5
I want to create a fourth column ("col4") that takes the value of whichever column is non-zero.
So result would be:
   col1  col2  col3  col4
0     1     0     0     1
1    -2     0     0    -2
2     3     0     0     3
3     0     4     0     4
4     0     0     5     5
EDIT: If two columns are non-zero, always use col1. Also, the numbers may be negative. I have updated the df to reflect this.

Using the maximum of the columns is a possibility, though it only works while all values are non-negative (the edit allows negatives):
df['col4'] = df.max(axis=1)

Here's an example:
def func(a):
    a = set(a)
    assert len(a) == 2  # 0 and another number
    for i in a:
        if i != 0:
            return i

df['col4'] = df.apply(func, axis=1)
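A vectorized alternative is also possible; a sketch, assuming the three columns above. It honors the edit by preferring the leftmost non-zero column (so col1 wins ties), and it works with negative values since it never relies on max:
import numpy as np

cols = ['col1', 'col2', 'col3']
arr = df[cols].to_numpy()
# argmax over the boolean mask returns the first non-zero column per row
first_nonzero = (arr != 0).argmax(axis=1)
df['col4'] = arr[np.arange(len(df)), first_nonzero]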

Related

Adding new rows with default value based on dataframe values into dataframe

I have data with a large number of columns:
df:
 ID  col1  col2  col3  ...  col100
  1     a     x     0  ...       1
  1     a     x     1  ...       1
  2     a     y     1  ...       1
  4     a     z     1  ...       0
...
 98     a     z     1  ...       1
100     a     x     1  ...       0
I want to fill in the missing ID values with a default row that indicates the data is missing. Here the missing IDs would be 3 and 99; hypothetically speaking, let's say the missing row data looks like the row for ID 100:
ID  col1  col2  col3  ...  col100
 3     a     x     1  ...       0
99     a     x     1  ...       0
Expected output:
df:
 ID  col1  col2  col3  ...  col100
  1     a     x     0  ...       1
  1     a     x     1  ...       1
  2     a     y     1  ...       1
  3     a     x     1  ...       0
  4     a     z     1  ...       0
...
 98     a     z     1  ...       1
 99     a     x     1  ...       0
100     a     x     1  ...       0
I'm also ok with the 3 and 99 being at the bottom.
I have tried several ways of appending new rows:
noresponse = df[filterfornoresponse].head(1).copy()  # assume this nets us row 100

for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0:  # IDs with no rows, i.e. missing data
        temp = noresponse.copy()
        temp['ID'] = i
        df.append(temp, ignore_index=True)
This method doesn't seem to append anything.
I have also tried
pd.concat([df, temp], ignore_index = True)
instead of df.append
I have also tried adding the rows to a list noresponserows with the intention of concating the list with df:
noresponserows = []
for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0:  # IDs with no rows, i.e. missing data
        temp = noresponse.copy()
        temp['ID'] = i
        noresponserows.append(temp)
But here the list always ends up with only one row, when I know my data has more than one row that needs to be appended.
I'm not sure why I am having trouble appending more than one instance of noresponse into the list, or why I can't append directly to the dataframe. I feel like I am missing something here.
I think it might have to do with me taking a copy of a row in the df vs constructing a new one. The reason I take a copy of an existing row for noresponse is that there are a large number of columns, so it is easier than building a row from scratch.
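A note on the first attempt: df.append returns a new DataFrame rather than modifying df in place, so without reassigning the result the original df never changes (and append was removed in pandas 2.0 in favor of pd.concat). A minimal sketch of the loop rewritten around that, reusing the names above:
rows_to_add = []
for i in range(1, maxID):
    if (df['ID'] == i).sum() == 0:  # no rows for this ID, i.e. missing data
        temp = noresponse.copy()
        temp['ID'] = i
        rows_to_add.append(temp)
# concatenate once at the end and keep the result
df = pd.concat([df] + rows_to_add, ignore_index=True)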
Say you have a dataframe like this:
>>> df
  col1 col2  col100  ID
0    a    x       0   1
1    a    y       3   2
2    a    z       1   4
First, set the ID column to be the index:
>>> df = df.set_index('ID')
>>> df
   col1 col2  col100
ID
1     a    x       0
2     a    y       3
4     a    z       1
Now you can use df.loc to easily add rows.
Let's select the last row as the default row:
>>> default_row = df.iloc[-1]
>>> default_row
col1      a
col2      z
col100    1
Name: 4, dtype: object
We can add it right into the dataframe at ID 3:
>>> df.loc[3] = default_row
>>> df
   col1 col2  col100
ID
1     a    x       0
2     a    y       3
4     a    z       1
3     a    z       1
Then use sort_index to sort the rows by index:
>>> df = df.sort_index()
>>> df
   col1 col2  col100
ID
1     a    x       0
2     a    y       3
3     a    z       1
4     a    z       1
And, optionally, reset the index:
>>> df = df.reset_index()
>>> df
   ID col1 col2  col100
0   1    a    x       0
1   2    a    y       3
2   3    a    z       1
3   4    a    z       1
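If many IDs are missing, reindex can fill them all in one step; a sketch, assuming IDs are unique (as in this answer's example) and should run from 1 to 100, with the last existing row as the default:
default_row = df.set_index('ID').iloc[-1]
out = df.set_index('ID').reindex(range(1, 101))  # absent IDs become all-NaN rows
missing = out.isna().all(axis=1)
out.loc[missing] = default_row.values            # broadcast the default row values
out = out.rename_axis('ID').reset_index()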

Combine Pandas' startswith and isin

I need to get the dataframe (all columns) for rows where a specific isin value appears in any of the columns whose names start with a given prefix.
Currently I have code that does both steps separately, but I need them to happen in one line:
Dataframe:
   col1  col2  filter_col3  filter_col4
0     0     0            1            0
1     0     0            0            1
2     0     0            0            0
# get only columns that start with the prefix
filter_col = [c for c in df if c.startswith('filter_')]
# return rows where 1 appears in any of the columns
df[df.isin([1]).any(axis=1)]
Expected result:
   col1  col2  filter_col3  filter_col4
0     0     0            1            0
1     0     0            0            1
Try with pd.DataFrame.filter:
# also `.eq(1)` instead of `.isin([1])`
df[df.filter(regex='^filter_').isin([1]).any(axis=1)]
Output:
   col1  col2  filter_col3  filter_col4
0     0     0            1            0
1     0     0            0            1
You can also use df.filter with Series.eq:
In [2567]: df[df.filter(like='filter').eq(1).any(axis=1)]
Out[2567]:
   col1  col2  filter_col3  filter_col4
0     0     0            1            0
1     0     0            0            1
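For reference, a self-contained sketch of the idea, reconstructing the example data above:
import pandas as pd

df = pd.DataFrame({'col1': [0, 0, 0],
                   'col2': [0, 0, 0],
                   'filter_col3': [1, 0, 0],
                   'filter_col4': [0, 1, 0]})

# Restrict the membership test to the 'filter_' columns, then use the
# resulting row mask to select full rows from the original frame.
mask = df.filter(regex='^filter_').eq(1).any(axis=1)
print(df[mask])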

Select Multiple Columns in DataFrame Pandas. Slice + Select

I have a DataFrame with almost 100 columns.
I need to select col2 to col4 and col54. How can I do it?
I tried:
df = df.loc[:, 'col2':'col4']
but I can't add col54.
You can do this in a couple of different ways:
Using the same format you are currently trying to use, I think doing a join of col54 will be necessary.
df = df.loc[:,'col2':'col4'].join(df.loc[:,'col54'])
Another method, given that col2 is close to col4, would be to list the columns explicitly:
df = df.loc[:,['col2','col3','col4', 'col54']]
or simply
df = df[['col2','col3','col4','col54']]
You can simply do this:
df = df.loc[:, ['col2', 'col3', 'col4', 'col54']]
loc takes the column names as a list as well.
Or this:
df[['col2', 'col3', 'col4', 'col54']]
You can use a list or a pandas.IndexSlice object:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(1,index=[0,1,2],columns=["col1","col2","col3","col4","col5"])
In [3]: df
Out[3]:
   col1  col2  col3  col4  col5
0     1     1     1     1     1
1     1     1     1     1     1
2     1     1     1     1     1
In [4]: df.loc[:, ['col1', 'col2', 'col4', 'col5']]
Out[4]:
   col1  col2  col4  col5
0     1     1     1     1
1     1     1     1     1
2     1     1     1     1
In [5]: slicer = pd.IndexSlice
In [6]: df.loc[:, slicer["col3":"col5"]]
Out[6]:
   col3  col4  col5
0     1     1     1
1     1     1     1
2     1     1     1
edit: I see I misread the OP. This is a bit tough. You can get 'col2', 'col3', 'col4' using the pandas.IndexSlice as demonstrated above; including 'col54' in that slice is the tricky part.
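One way to include the extra column alongside a contiguous slice is to select by integer position with numpy's np.r_ and iloc; a sketch, assuming the five-column frame above, with col5 standing in for col54:
import numpy as np

# np.r_ concatenates integer ranges, so a contiguous block plus an
# extra column can be selected positionally in one go.
df.iloc[:, np.r_[1:4, 4]]  # col2:col4 plus col5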

Pandas dataframe - adding a column from a dict

I have
df = pd.DataFrame.from_dict({'col1':['A','B', 'B', 'A']})
  col1
0    A
1    B
2    B
3    A
other_dict = {'A':1, 'B':0}
I want to append a column to df, so that it looks like this:
  col1  col2
0    A     1
1    B     0
2    B     0
3    A     1
You can also use map:
In [3]:
df['col2'] = df['col1'].map(other_dict)
df
Out[3]:
  col1  col2
0    A     1
1    B     0
2    B     0
3    A     1
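Note that map returns NaN for any col1 value missing from other_dict; a fillna can supply a default. A sketch, with -1 as a hypothetical default:
# values absent from other_dict would map to NaN without the fillna
df['col2'] = df['col1'].map(other_dict).fillna(-1)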
One option is to use an apply:
In [11]: df["col1"].apply(other_dict.get)
Out[11]:
0    1
1    0
2    0
3    1
Name: col1, dtype: int64
then assign it to the column:
df["col2"] = df["col1"].apply(other_dict.get)
Another which may be more efficient (if you have larger groups) is to use a transform:
In [21]: g = df.groupby("col1")
In [22]: g["col1"].transform(lambda x: other_dict[x.name])
Out[22]:
0    1
1    0
2    0
3    1
Name: col1, dtype: object
It's also worth linking to the categorical section of the docs.

Quickest way to make a get_dummies type dataframe from a column with a multiple of strings

I have a column, 'col2', that holds comma-separated strings. The current code I have is too slow; there are about 2000 unique strings (the letters in the example below) and 4000 rows, so the result ends up as 2000 columns and 4000 rows.
In [268]: df.head()
Out[268]:
   col1   col2
0     6    A,B
1    15  C,G,A
2    25      B
Is there a fast way to make this in a get_dummies format, where each string has its own column and that column holds a 1 if the row's col2 contains that string, and 0 otherwise?
def get_list(df):
    d = []
    for row in df.col2:
        row_list = row.split(',')
        for string in row_list:
            if string not in d:
                d.append(string)
    return d

df_list = get_list(df)

def make_cols(df, lst):
    for string in lst:
        df[string] = 0
    return df

df = make_cols(df, df_list)

for idx in range(0, len(df['col2'])):
    row_list = df['col2'].iloc[idx].split(',')
    for string in row_list:
        df[string].iloc[idx] += 1
Out[113]:
   col1   col2  A  B  C  G
0     6    A,B  1  1  0  0
1    15  C,G,A  1  0  1  1
2    25      B  0  1  0  0
This is my current code for it, but it's too slow. Thanks for any help!
You can use:
>>> df['col2'].str.get_dummies(sep=',')
   A  B  C  G
0  1  1  0  0
1  1  0  1  1
2  0  1  0  0
To join the DataFrames:
>>> pd.concat([df, df['col2'].str.get_dummies(sep=',')], axis=1)
   col1   col2  A  B  C  G
0     6    A,B  1  1  0  0
1    15  C,G,A  1  0  1  1
2    25      B  0  1  0  0
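If str.get_dummies is still too slow at this scale, scikit-learn's MultiLabelBinarizer is a common faster alternative; a sketch, assuming scikit-learn is installed:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# split each cell into a list of labels, then binarize in one pass
dummies = pd.DataFrame(mlb.fit_transform(df['col2'].str.split(',')),
                       columns=mlb.classes_, index=df.index)
result = pd.concat([df, dummies], axis=1)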
