subset columns based on partial match and group level in python

I am trying to split my dataframe based on a partial match of the column name, using a group level stored in a separate dataframe. The dataframes and the expected output are shown below.
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'a19-76': [0,1,2],
                        'a23pz': [0,1,2],
                        'a23pze': [0,1,2],
                        'b887': [0,1,2],
                        'b59lp': [0,1,2],
                        'c56-6u': [0,1,2],
                        'c56-6uY': [np.nan, np.nan, np.nan]})
ids = pd.DataFrame(data={'id': ['a19', 'a23', 'b8', 'b59', 'c56'],
                         'group': ['test', 'sub', 'test', 'pass', 'fail']})
Desired output:
test_ids = 'a19-76', 'b887'
sub_ids = 'a23pz', 'a23pze', 'c56-6u'
pass_ids = 'b59lp'
fail_ids = 'c56-6u', 'c56-6uY'
I have written this one-liner, which assigns the group to each column name, but it doesn't create the separate lists required above:
gb = ids.groupby([[col for col in df.columns if col.startswith(tuple(i for i in ids.id))], 'group']).agg(lambda x: list(x)).reset_index()
gb.groupby('group').agg({'level_0':lambda x: list(x)})
thanks for reading

This may not be exactly what you are looking for, but here it is anyway.
An open question is what to do with columns that match no id; the answer depends on what you do after matching.
Plain Python solution
Simple collections wrangling; there may well be a simpler way.
from collections import defaultdict
groups = defaultdict(list)
idsr = ids.to_records(index=False)
for col in df.columns:
    for id, group in idsr:
        if col.startswith(id):
            groups[group].append(col)
            break
    # the following 'else' clause is optional; it creates a group for unmatched columns
    else:  # for ... else ... runs only when the inner loop did not break
        groups['UNGROUPED'].append(col)
Groups =
{'sub': ['a23pz', 'c56-6u'], 'test': ['a19-76', 'b887', 'b59lp']}
Then after
df.columns = pd.MultiIndex.from_tuples(sorted([(k, col) for k,id in groups.items() for col in id]))
df =
sub test
a23pz c56-6u a19-76 b59lp b887
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
pandas solution
The steps are:
1. Turn the column names into a dataframe.
2. Take the cross product of the two dataframes (a join).
3. Filter the resulting dataframe.
There is surely a better way.
df1 = ids.copy()
df2 = df.columns.to_frame(index=False)
df2.columns = ['col']
# Not tested enhancement:
# with pandas version >= 1.2, the four following lines may be replaced by a single one :
# dfm = df1.merge(df2, how='cross')
df1['join'] = 1
df2['join'] = 1
dfm = df1.merge(df2, on='join').drop('join', axis=1)
df1.drop('join', axis=1, inplace = True)
dfm['match'] = dfm.apply(lambda x: x.col.find(x.id), axis=1).ge(0)
dfm = dfm[dfm.match][['group', 'col']].sort_values(by=['group', 'col'], axis=0)
dfm =
group col
6 sub a23pz
24 sub c56-6u
0 test a19-76
18 test b59lp
12 test b887
# Note 1: The index can be dropped, for example with reset_index(drop=True)
# Note 2: Unmatched columns are not taken into account
then after
df.columns = pd.MultiIndex.from_frame(dfm)
df =
group sub test
col a23pz c56-6u a19-76 b59lp b887
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
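The cross-join idea mentioned in the comments above can be sketched end to end as follows. This is only an illustration under a few assumptions: it uses the dataframes from the question, requires pandas 1.2 or later for how="cross", and matches with startswith (a prefix test) rather than find, which accepts a match anywhere in the name. The names cols, pairs, is_prefix and matched are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a19-76': [0, 1, 2], 'a23pz': [0, 1, 2], 'a23pze': [0, 1, 2],
                   'b887': [0, 1, 2], 'b59lp': [0, 1, 2], 'c56-6u': [0, 1, 2],
                   'c56-6uY': [np.nan, np.nan, np.nan]})
ids = pd.DataFrame({'id': ['a19', 'a23', 'b8', 'b59', 'c56'],
                    'group': ['test', 'sub', 'test', 'pass', 'fail']})

# One row per column name.
cols = pd.DataFrame({"col": df.columns})

# Every (id, column) pair; how="cross" needs pandas 1.2 or later.
pairs = ids.merge(cols, how="cross")

# Keep only the pairs where the column name starts with the id.
is_prefix = [c.startswith(i) for c, i in zip(pairs["col"], pairs["id"])]
matched = (
    pairs.loc[is_prefix, ["group", "col"]]
    .sort_values(["group", "col"])
    .reset_index(drop=True)
)
print(matched)
```

Columns that match no id simply do not appear in matched, so they would need separate handling if they matter.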

You can use a regex generated from the values in iddf (the ids dataframe from the question) and filter:
Example with "test":
s = iddf.set_index('group')['id']
regex_test = '^(%s)' % '|'.join(s.loc['test'])
# the generated regex is: '^(a19|b8|b59)'
df.filter(regex=regex_test)
output:
a19-76 b887 b59lp
0 0 0 0
1 1 1 1
2 2 2 2
To get a list of columns for each unique group in iddf, apply the same process in a dictionary comprehension:
{x: list(df.filter(regex='^(%s)' % '|'.join(s.loc[x])).columns)
for x in s.index.unique()}
output:
{'test': ['a19-76', 'b887', 'b59lp'],
'sub': ['a23pz', 'c56-6u']}
NB. This should generalize to any number of groups. However, if there are many groups, it is preferable to loop over the column names once rather than calling filter repeatedly.
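A single pass over the column names could look like the sketch below. It is an assumption-laden illustration, not part of the original answer: it builds one alternation regex from all ids (longest first, escaped with re.escape), extracts the matching prefix of each column name, and maps that prefix to its group. The names prefixes, pattern, matched_prefix, col_group and groups are hypothetical.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({'a19-76': [0, 1, 2], 'a23pz': [0, 1, 2], 'a23pze': [0, 1, 2],
                   'b887': [0, 1, 2], 'b59lp': [0, 1, 2], 'c56-6u': [0, 1, 2],
                   'c56-6uY': [np.nan, np.nan, np.nan]})
ids = pd.DataFrame({'id': ['a19', 'a23', 'b8', 'b59', 'c56'],
                    'group': ['test', 'sub', 'test', 'pass', 'fail']})

# Longest ids first, so a longer id wins if one id is a prefix of another.
prefixes = sorted(ids["id"], key=len, reverse=True)
pattern = "^(" + "|".join(re.escape(p) for p in prefixes) + ")"

# Series whose index and values are both the column names.
cols = df.columns.to_series()

# The id each column starts with (NaN when nothing matches).
matched_prefix = cols.str.extract(pattern, expand=False)

# Translate the id into its group, then collect column names per group.
col_group = matched_prefix.map(ids.set_index("id")["group"])
groups = cols.groupby(col_group).agg(list).to_dict()
print(groups)
```

Columns without a matching id get a NaN group and are dropped by groupby, which mirrors the behaviour of the filter-based version.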

A straightforward groupby(...).apply(...) can achieve this result:
def id_match(group, to_match):
    # Try each id of the group as an alternative; str.match anchors at the start of the name.
    # (A character class such as "[a19|b8]" would instead match any single listed character.)
    regex = "(?:{})".format("|".join(group))
    matches = to_match.str.match(regex)
    return pd.Series(to_match[matches])
matched_ids = ids.groupby("group")["id"].apply(id_match, df.columns)
print(matched_ids)
group
fail   0     c56-6u
       1    c56-6uY
pass   0      b59lp
sub    0      a23pz
       1     a23pze
test   0     a19-76
       1       b887
You can treat this Series as a dictionary-like object to access each of the groups independently:
print(matched_ids["fail"])
0     c56-6u
1    c56-6uY
Name: id, dtype: object
print(matched_ids["pass"])
0    b59lp
Name: id, dtype: object
You can then take it a step further and subset your original DataFrame with this new Series like so:
print(df[matched_ids["fail"]])
   c56-6u  c56-6uY
0       0      NaN
1       1      NaN
2       2      NaN
print(df[matched_ids["pass"]])
   b59lp
0      0
1      1
2      2

Related

Python : Remove all data in a column of a dataframe and keep the last value in the first row

Let's say that I have a simple Dataframe.
import pandas as pd
data1 = [12,34,'fsdf',678,'','','dfs','','']
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4
5
6 dfs
7
8
I want to delete all the data except the last value found in the column, which I want to keep in the first row. The column may have thousands of rows. So I would like this result:
Data
0 dfs
1
2
3
4
5
6
7
8
And I have to keep the shape of this dataframe, so not removing rows.
What are the simplest functions to do that efficiently ?
Thank you

Get the index of the last non-empty string value, then assign that value to the first row of the column:
s = df1.loc[df1['Data'].iloc[::-1].ne('').idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
If empty strings are missing values:
data1 = [12,34,'fsdf',678,np.nan,np.nan,'dfs',np.nan,np.nan]
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4 NaN
5 NaN
6 dfs
7 NaN
8 NaN
s = df1.loc[df1['Data'].iloc[::-1].notna().idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
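The two variants above (empty strings and missing values) can be folded into one helper. The sketch below is only an illustration: keep_last_value and its blank parameter are hypothetical names, and it assumes that both NaN and the blank marker count as "empty".

```python
import numpy as np
import pandas as pd


def keep_last_value(s, blank=""):
    """Return a Series shaped like s, holding only its last filled value in row 0."""
    # A cell counts as filled when it is neither NaN nor the blank marker.
    is_filled = s.notna() & s.ne(blank)
    # Start from an all-blank column with the same index.
    out = pd.Series(blank, index=s.index, dtype=object)
    if is_filled.any():
        # Move the last filled value to the first row.
        out.iloc[0] = s[is_filled].iloc[-1]
    return out


data1 = [12, 34, 'fsdf', 678, '', '', 'dfs', '', '']
df1 = pd.DataFrame(data1, columns=['Data'])
df1['Data'] = keep_last_value(df1['Data'])
print(df1)
```

If the column contains no filled value at all, the helper returns an all-blank column rather than raising.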

A simple pandas condition check like this can help,
df1['Data'] = [df1.loc[df1['Data'].ne(""), "Data"].iloc[-1]] + [''] * (len(df1) - 1)

You can replace '' with NaN using df.replace and then use last_valid_index:
val = df1.loc[df1.replace('', np.nan).last_valid_index(), 'Data']
# The two lines below are taken from @jezrael's answer
df1.loc[0, 'Data'] = val
df1.loc[1:, 'Data'] = ''
Or
You can use np.full with fill_value set to np.nan here.
val = df1.loc[df1.replace("", np.nan).last_valid_index(), "Data"]
# dtype=object so that a string can be stored in the column afterwards
df1 = pd.DataFrame(np.full(df1.shape, np.nan, dtype=object),
                   index=df1.index,  # the original used df.index, which is not defined here
                   columns=df1.columns)
df1.loc[0, "Data"] = val

How to apply a function on a series of columns, based on the values in a corresponding series of columns?

I have a df where I have several columns, that, based on the value (1-6) in these columns, I want to assign a value (0-1) to its corresponding column. I can do it on a column by column basis but would like to make it a single function. Below is some example code:
import pandas as pd
df = pd.DataFrame({'col1': [1,3,6,3,5,2], 'col2': [4,5,6,6,1,3], 'col3': [3,6,5,1,1,6],
'colA': [0,0,0,0,0,0], 'colB': [0,0,0,0,0,0], 'colC': [0,0,0,0,0,0]})
(col1 corresponds with colA, col2 with colB, col3 with colC)
This code works on a column by column basis:
df.loc[(df.col1 != 1) & (df.col1 < 6), 'colA'] = (df['colA']+ 1)
But I would like to be able to have a list of columns, so to speak, and have it correspond with another. Something like this, (but that actually works):
m = df['col1' : 'col3'] != 1 & df['col1' : 'col3'] < 6
df.loc[m, 'colA' : 'colC'] += 1
Thank You!

The idea is to select both blocks of columns with DataFrame.loc, build the boolean mask from the first block, rename its columns to those of the second block, and finally use DataFrame.add on the second block's columns only:
df1 = df.loc[:, 'col1' : 'col3']
df2 = df.loc[:, 'colA' : 'colC']
d = dict(zip(df1.columns,df2.columns))
df1 = ((df1 != 1) & (df1 < 6)).rename(columns=d)
df[df2.columns] = df[df2.columns].add(df1)
print (df)
col1 col2 col3 colA colB colC
0 1 4 3 0 1 1
1 3 5 6 1 1 0
2 6 6 5 0 0 1
3 3 6 1 1 0 0
4 5 1 1 1 0 0
5 2 3 6 1 1 0
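When the source and target columns are not contiguous, an explicit mapping may be easier to read. The following sketch assumes a hand-written dictionary (here called pairs, a hypothetical name) from each source column to its target column and applies the question's condition one pair at a time.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 6, 3, 5, 2], 'col2': [4, 5, 6, 6, 1, 3],
                   'col3': [3, 6, 5, 1, 1, 6],
                   'colA': [0, 0, 0, 0, 0, 0], 'colB': [0, 0, 0, 0, 0, 0],
                   'colC': [0, 0, 0, 0, 0, 0]})

# Source column -> target column (assumed pairing, taken from the question).
pairs = {'col1': 'colA', 'col2': 'colB', 'col3': 'colC'}

for src, dst in pairs.items():
    # Same condition as the column-by-column version in the question.
    mask = (df[src] != 1) & (df[src] < 6)
    df.loc[mask, dst] += 1

print(df)
```

The loop runs once per column pair, not once per row, so it stays fast for tall dataframes.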

Here's what I would do:
# split up dataframe
sub_df = df.iloc[:,:3]
abc = df.iloc[:,3:]
# make numpy array truth table
truth_table = (sub_df.to_numpy() > 1) & (sub_df.to_numpy() < 6)
# redefine abc based on numpy truth table
new_abc = pd.DataFrame(truth_table.astype(int), columns=['colA', 'colB', 'colC'])
# join the updated dataframe subgroups
new_df = pd.concat([sub_df, new_abc], axis=1)

count unique values in groups pandas

I have a dataframe like this:
data = {'id': [1,1,1,2,2,3],
'value': ['a','a','a','b','b','c'],
'obj_id': [1,2,3,3,3,4]
}
df = pd.DataFrame (data, columns = ['id','value','obj_id'])
I would like to get the unique counts of obj_id groupby id and value:
1 a 3
2 b 1
3 c 1
But when I do:
result=df.groupby(['id','value'])['obj_id'].nunique().reset_index(name='obj_counts')
the result I got was:
1 a 2
1 a 1
2 b 1
3 c 1
so the first two rows with same id and value don't group together.
How can I fix this? Many thanks!

For me, your solution works fine with the sample data.
As @YOBEN_S mentioned in the comments, a possible cause is trailing whitespace in the values; in that case the solution is to add Series.str.strip:
data = {'id': [1,1,1,2,2,3],
'value': ['a ','a','a','b','b','c'],
'obj_id': [1,2,3,3,3,4]
}
df = pd.DataFrame (data, columns = ['id','value','obj_id'])
df['value'] = df['value'].str.strip()
df = df.groupby(['id','value'])['obj_id'].nunique().reset_index(name='obj_counts')
print (df)
id value obj_counts
0 1 a 3
1 2 b 1
2 3 c 1
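One way to check whether stray whitespace is really the cause is to list the values that change when stripped. This is a small diagnostic sketch on the sample data above; suspicious is a hypothetical name.

```python
import pandas as pd

data = {'id': [1, 1, 1, 2, 2, 3],
        'value': ['a ', 'a', 'a', 'b', 'b', 'c'],
        'obj_id': [1, 2, 3, 3, 3, 4]}
df = pd.DataFrame(data, columns=['id', 'value', 'obj_id'])

# Values that differ from their stripped form carry leading or trailing whitespace.
suspicious = df.loc[df['value'] != df['value'].str.strip(), 'value']

# repr() makes the otherwise invisible whitespace visible.
print(suspicious.map(repr).tolist())
```

An empty list here would suggest the duplicate groups come from something else, such as differing dtypes or look-alike characters.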

How to select and order columns in a dataframe using an array in Python

I have a fairly large dataframe, df2 (~50,000 rows x 2,000 columns). The column headings are sample names. Separately, I have a dataframe, df1, with a list of samples I want to include in my analysis as the df1 index. I want to use the list of samples from the df1 index to select only the columns from df2 for those selected samples, discarding the rest. I also want to preserve the sample order from the df1 index.
Example data:
# df1
data1 = {'Sample': ['Sample_A','Sample_D', 'Sample_E'],
'Location': ['Bangladesh', 'Myanmar', 'Thailand'],
'Year':[2012, 2014, 2015]}
df1 = pd.DataFrame(data1)
df1.set_index('Sample')
# df2
data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
'Sample_A': [0,1,0,0,1],
'Sample_B':[0,0,1,0,0],
'Sample_C':[1,0,0,0,1],
'Sample_D':[0,0,1,1,0]}
df2 = pd.DataFrame(data2)
df2.set_index('Num')
First I generate the list of samples I want from the index of df1, e.g.
samples = df1['Sample'].tolist()
'samples' is then,
['Sample_A', 'Sample_D', 'Sample_E']
And using 'samples', my desired output dataframe, df3, should look like:
index Sample_A Sample_D
Value_1 0 0
Value_2 1 0
Value_3 0 1
Value_4 0 1
Value_5 1 0
But if I use
df3 = df2[samples]
Then I get the error message:
"['Sample_E'] not in index"
So how do I ignore samples that are not found in df2 to avoid this error message?
UPDATE
The solution that worked -
# 1. Define samples to use from df1
samples = df1['Sample'].tolist()
# Only include samples that are found in df2 as well
final_samples = list(set(list(df2.columns)) & set(samples ))
# Make new df with columns corresponding to final_samples
df3 = df2.loc[:, final_samples]
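Note that a Python set does not keep the order of the samples, while the question asks for that order to be preserved. A minimal order-preserving sketch, assuming the same df2 and a deliberately reordered samples list (the order here is made up to make the effect visible):

```python
import pandas as pd

data2 = {'Num': ['Value_1', 'Value_2', 'Value_3', 'Value_4', 'Value_5'],
         'Sample_A': [0, 1, 0, 0, 1],
         'Sample_B': [0, 0, 1, 0, 0],
         'Sample_C': [1, 0, 0, 0, 1],
         'Sample_D': [0, 0, 1, 1, 0]}
df2 = pd.DataFrame(data2).set_index('Num')

# Hypothetical sample order; 'Sample_E' does not exist in df2.
samples = ['Sample_D', 'Sample_A', 'Sample_E']

# Keep the order of samples and skip names that are not columns of df2.
final_samples = [s for s in samples if s in df2.columns]
df3 = df2[final_samples]
print(df3)
```

The list comprehension keeps the samples in the order given and silently skips any that df2 lacks.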

Try it like this:
df = pd.read_csv("data.csv", usecols=['Sample_A','Sample_D']).fillna('')
print(df)
To select all of the rows and only some of the columns, use a single colon for the rows:
>>> df.loc[:, ['Sample_A','Sample_D']]
Your answer from the dataset you provided:
>>> data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
... 'Sample_A': [0,1,0,0,1],
... 'Sample_B':[0,0,1,0,0],
... 'Sample_C':[1,0,0,0,1],
... 'Sample_D':[0,0,1,1,0]}
>>> df2 = pd.DataFrame(data2)
>>> df2.set_index('Num').loc[:, ['Sample_A','Sample_D']]
Sample_A Sample_D
Num
Value_1 0 0
Value_2 1 0
Value_3 0 1
Value_4 0 1
Value_5 1 0
=====================================
>>> df3 = df2.loc[:, samples]
>>> df3
Sample_A Sample_D Sample_E
0 0 0 NaN
1 1 0 NaN
2 0 1 NaN
3 0 1 NaN
4 1 0 NaN
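OR (required on pandas 1.0 and later, where passing a missing label such as 'Sample_E' to .loc raises a KeyError):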
>>> df3 = df2.reindex(columns=samples)
>>> df3
Sample_A Sample_D Sample_E
0 0 0 NaN
1 1 0 NaN
2 0 1 NaN
3 0 1 NaN
4 1 0 NaN

You can do it this way. The list of columns is in the order you actually want.
import pandas as pd
data = {'index': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
'Sample_A': [0,1,0,0,1],
'Sample_B':[0,0,1,0,0],
'Sample_C':[1,0,0,0,1],
'Sample_D':[0,0,1,1,0]}
df = pd.DataFrame(data)
df.set_index('index')
df1 = df[['index']+['Sample_A','Sample_D']]
output:
index Sample_A Sample_D
0 Value_1 0 0
1 Value_2 1 0
2 Value_3 0 1
3 Value_4 0 1
4 Value_5 1 0
To ignore the missing columns, keep only the names that actually exist in the dataframe you are analysing:
samples = ['index', 'Sample_A', 'Sample_D','Extra_Sample']
final_samples = list(set(list(df2.columns)) & set(samples ))
Now you can pass final_samples, which contains only columns of df2:
df3 = df2[final_samples]

How to factorize two data frame meanwhile with python-pandas?

I have two data frames: one holds user-item-rating triples and the other holds side information about the items:
#df1
A12VH45Q3H5R5I B000NWJTKW 5.0
A3J8AQWNNI3WSN B000NWJTKW 4.0
A1XOBWIL4MILVM BDASK99000 1.0
#df2
B000NWJTKW ....
BDASK99000 ....
Now I would like to map the item and user names to integer IDs. I know that factorize can do this:
df.apply(lambda x: pd.factorize(x)[0] + 1)
But I would like to ensure that the integer IDs of the items are consistent across the two data frames, so that the resulting data frames are:
#df1
1 1 5.0
2 1 4.0
3 2 1.0
#df2
1 ...
2 ...
Do you know how to ensure that? Thanks in advance!

Concatenate the common column(s), and apply pd.factorize (or pd.Categorical) on that:
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
For example,
import pandas as pd
df1 = pd.DataFrame(
[('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0),
('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0),
('A1XOBWIL4MILVM', 'BDASK99000', 1.0)], columns=['user', 'item', 'rating'])
df2 = pd.DataFrame(
[('B000NWJTKW', 10),
('BDASK99000', 20)], columns=['item', 'extra'])
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
codes, uniques = pd.factorize(df1['user'])
df1['user'] = codes + 1
print(df1)
print(df2)
yields
# df1
user item rating
0 1 1 5
1 2 1 4
2 3 2 1
# df2
item extra
0 1 10
1 2 20
Another way to work-around the problem (if you have enough memory) would be to merge the two DataFrames: df3 = pd.merge(df1, df2, on='item', how='outer'), and then factorize df3['item']:
df3 = pd.merge(df1, df2, on='item', how='outer')
for col in ['item', 'user']:
df3[col] = pd.factorize(df3[col])[0] + 1
print(df3)
yields
user item rating extra
0 1 1 5 10
1 2 1 4 10
2 3 2 1 20

Another option could be to apply factorize on the first dataframe, and then apply the resulting mapping to the second dataframe:
# create factorization:
idx, levels = pd.factorize(df1['item'])
# replace the item codes in the first dataframe with the new index value
df1['item'] = idx
# create a dictionary mapping the original code to the new index value
# (iterate over the unique levels returned by factorize)
d = {code: i for i, code in enumerate(levels)}
# apply this mapping to the second dataframe
df2['item'] = df2.item.apply(lambda code: d[code])
This approach only works if every item in the second dataframe is also present in the first; otherwise the lookup raises a KeyError.
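If some items may appear in only one of the two dataframes, the mapping can instead be built from the union of both item columns. The sketch below is an illustration under assumptions: the column names follow the earlier example, 'C000000000' is a made-up item that exists only in df2, and item_id and item_to_id are hypothetical names.

```python
import pandas as pd

df1 = pd.DataFrame(
    [('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0),
     ('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0),
     ('A1XOBWIL4MILVM', 'BDASK99000', 1.0)], columns=['user', 'item', 'rating'])
# 'C000000000' is a made-up item that appears only in df2.
df2 = pd.DataFrame(
    [('B000NWJTKW', 10),
     ('BDASK99000', 20),
     ('C000000000', 30)], columns=['item', 'extra'])

# Unique items across both frames, in order of first appearance.
all_items = pd.concat([df1['item'], df2['item']]).unique()

# Shared mapping from item name to a 1-based integer ID.
item_to_id = {item: i for i, item in enumerate(all_items, start=1)}

df1['item_id'] = df1['item'].map(item_to_id)
df2['item_id'] = df2['item'].map(item_to_id)
print(df1)
print(df2)
```

Keeping the mapping in a dictionary also makes it easy to translate the integer IDs back to item names later.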
