Adding rows based on column value - python

Data frame--->with only columns ['234','apple','banana','orange']
now i have a list like
l=['apple', 'banana']
extracting from another data frame column
I am taking unique values of columns from column fruits.
fruits.unique()
which results in array[()]
to get the list of items simply looping over index values and store them in list
loop over the list to check whether the values in the list are presented in columns of data frame.
If present,then add 1 for the values that match column headers else add 0 for one that matching.
In the above case data frame after matching should look like:
234 apple banana orange
0 1 1 0

If need one row DataFrame compare columns names converted to DataFrame by Index.to_frame with DataFrame.isin, then for mapping True, False to 1,0 convert to integers and transpose:
df = pd.DataFrame(columns=['234','apple','banana','orange'])
l=['apple', 'banana']
df = df.columns.to_frame().isin(l).astype(int).T
print (df)
234 apple banana orange
0 0 1 1 0
If it is nested list use MultiLabelBinarizer:
df = pd.DataFrame(columns=['234','apple','banana','orange'])
L= [['apple', 'banana'], ['apple', 'orange', 'apple']]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(L),columns=mlb.classes_)
.reindex(df.columns, fill_value=0, axis=1))
print (df)
234 apple banana orange
0 0 1 1 0
1 0 1 0 1
EDIT: If data are from another DataFrame column solution is very similar like second one:
df = pd.DataFrame(columns=['234','apple','banana','orange'])
df1 = pd.DataFrame({"col":[['apple', 'banana'],['apple', 'orange', 'apple']]})
print (df1)
col
0 [apple, banana]
1 [apple, orange, apple]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(df1['col']),columns=mlb.classes_)
.reindex(df.columns, fill_value=0, axis=1))
print (df)
234 apple banana orange
0 0 1 1 0
1 0 1 0 1

Related

Converting a dataframe stringcolumn into multiple columns and rearrange each column based on the labels

I want to convert a stringcolumn with multiple labels into separate columns for each label and rearrange the dataframe that identical labels are in the same column. For e.g.:
ID
Label
0
apple, tom, car
1
apple, car
2
tom, apple
to
ID
Label
0
1
2
0
apple, tom, car
apple
car
tom
1
apple, car
apple
car
None
2
tom, apple
apple
None
tom
df["Label"].str.split(',',3, expand=True)
0
1
2
apple
tom
car
apple
car
None
tom
apple
None
I know how to split the stringcolumn, but I can't really figure out how to sort the label columns, especially since the number of labels per sample is different.
Here's a way to do this.
First call df['Label'].apply() to replace the csv strings with lists and also to populate a Python dict mapping labels to new column index values.
Then create a second data frame df2 that fills new label columns as specified in the question.
Finally, concatenate the two DataFrames horizontally and drop the 'Label' column.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'ID' : [0,1,2],
'Label' : ['apple, tom, car', 'apple, car', 'tom, apple']
})
labelInfo = [labels := {}, curLabelIdx := 0]
def foo(x, labelInfo):
theseLabels = [s.strip() for s in x.split(',')]
labels, curLabelIdx = labelInfo
for label in theseLabels:
if label not in labels:
labels[label] = curLabelIdx
curLabelIdx += 1
labelInfo[1] = curLabelIdx
return theseLabels
df['Label'] = df['Label'].apply(foo, labelInfo=labelInfo)
df2 = pd.DataFrame(np.array(df['Label'].apply(lambda x: [s if s in x else 'None' for s in labels]).to_list()),
columns = list(labels.values()))
df = pd.concat([df, df2], axis=1).drop(columns=['Label'])
print(df)
Output:
ID 0 1 2
0 0 apple tom car
1 1 apple None car
2 2 apple tom None
If you'd prefer to have the new columns named using the labels they contain, you can replace the df2 assignment line with this:
df2 = pd.DataFrame(np.array(df['Label'].apply(lambda x: [s if s in x else 'None' for s in labels]).to_list()),
columns = list(labels))
Now the output is:
ID apple tom car
0 0 apple tom car
1 1 apple None car
2 2 apple tom None
Try:
df = df.assign(xxx=df.Label.str.split(r"\s*,\s*")).explode("xxx")
df["Col"] = df.groupby("xxx").ngroup()
df = (
df.set_index(["ID", "Label", "Col"])
.unstack(2)
.droplevel(0, axis=1)
.reset_index()
)
df.columns.name = None
print(df)
Prints:
ID Label 0 1 2
0 0 apple, tom, car apple car tom
1 1 apple, car apple car NaN
2 2 tom, apple apple NaN tom
I believe what you want is something like this:
import pandas as pd
data = {'Label': ['apple, tom, car', 'apple, car', 'tom, apple']}
df = pd.DataFrame(data)
print(f"df: \n{df}")
def norm_sort(series):
mask = []
for line in series:
mask.extend([l.strip() for l in line.split(',')])
mask = sorted(list(set(mask)))
labels = []
for line in series:
labels.append(', '.join([m if m in line else 'None' for m in mask]))
return labels
df.Label = norm_sort(df.loc[:, 'Label'])
df = df.Label.str.split(', ', expand=True)
print(f"df: \n{df}")
The goal of your program is not clear. If you are curious which elements are present in the different rows, then we can just get them all and stack the dataframe like such:
df = pd.DataFrame({'label': ['apple, banana, grape', 'apple, banana', 'banana, grape']})
final_df = df['label'].str.split(', ', expand=True).stack()
final_df.reset_index(drop=True, inplace=True)
>>> final_df
0 apple
1 banana
2 grape
3 apple
4 banana
5 banana
6 grape
At this point we can drop the duplicates or count the occurrence of each, depending on your use case.

subset columns based on partial match and group level in python

I am trying to split my dataframe based on a partial match of the column name, using a group level stored in a separate dataframe. The dataframes are here, and the expected output is below
df = pd.DataFrame(data={'a19-76': [0,1,2],
'a23pz': [0,1,2],
'a23pze': [0,1,2],
'b887': [0,1,2],
'b59lp':[0,1,2],
'c56-6u': [0,1,2],
'c56-6uY': [np.nan, np.nan, np.nan]})
ids = pd.DataFrame(data={'id': ['a19', 'a23', 'b8', 'b59', 'c56'],
'group': ['test', 'sub', 'test', 'pass', 'fail']})
desired output
test_ids = 'a19-76', 'b887'
sub_ids = 'a23pz', 'a23pze', 'c56-6u'
pass_ids = 'b59lp'
fail_ids = 'c56-6u', 'c56-6uY'
I have written thise onliner, which assigned the group to each column name, but doesnt create two seperate lists as required above
gb = ids.groupby([[col for col in df.columns if col.startswith(tuple(i for i in ids.id))], 'group']).agg(lambda x: list(x)).reset_index()
gb.groupby('group').agg({'level_0':lambda x: list(x)})
thanks for reading
May be not what you are looking for, but anyway.
A pending question is what to do with not matched columns, the answer obviously depends on what you will do after matching.
Plain python solution
Simple collections wrangling, but there may be a simpler way.
from collections import defaultdict
groups = defaultdict(list)
idsr = ids.to_records(index=False)
for col in df.columns:
for id, group in idsr:
if col.startswith(id):
groups[group].append(col)
break
# the following 'else' clause is optional, it creates a group for not matched columns
else: # for ... else ...
groups['UNGROUPED'].append(col)
Groups =
{'sub': ['a23pz', 'c56-6u'], 'test': ['a19-76', 'b887', 'b59lp']}
Then after
df.columns = pd.MultiIndex.from_tuples(sorted([(k, col) for k,id in groups.items() for col in id]))
df =
sub test
a23pz c56-6u a19-76 b59lp b887
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
pandas solution
Columns to dataframe
product of dataframes (join )
filtering of the resulting dataframe
There is surely a better way
df1 = ids.copy()
df2 = df.columns.to_frame(index=False)
df2.columns = ['col']
# Not tested enhancement:
# with pandas version >= 1.2, the four following lines may be replaced by a single one :
# dfm = df1.merge(df2, how='cross')
df1['join'] = 1
df2['join'] = 1
dfm = df1.merge(df2, on='join').drop('join', axis=1)
df1.drop('join', axis=1, inplace = True)
dfm['match'] = dfm.apply(lambda x: x.col.find(x.id), axis=1).ge(0)
dfm = dfm[dfm.match][['group', 'col']].sort_values(by=['group', 'col'], axis=0)
dfm =
group col
6 sub a23pz
24 sub c56-6u
0 test a19-76
18 test b59lp
12 test b887
# Note 1: The index can be removed
# note 2: Unmatched columns are not taken in account
then after
df.columns = pd.MultiIndex.from_frame(dfm)
df =
group sub test
col a23pz c56-6u a19-76 b59lp b887
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
You can use a regex generated from the values in iidf and filter:
Example with "test":
s = iddf.set_index('group')['id']
regex_test = '^(%s)' % '|'.join(s.loc['test'])
# the generated regex is: '^(a19|b8|b59)'
df.filter(regex=regex_test)
output:
a19-76 b887 b59lp
0 0 0 0
1 1 1 1
2 2 2 2
To get a list of columns for each unique group in iidf, apply the same process in a dictionary comprehension:
{x: list(df.filter(regex='^(%s)' % '|'.join(s.loc[x])).columns)
for x in s.index.unique()}
output:
{'test': ['a19-76', 'b887', 'b59lp'],
'sub': ['a23pz', 'c56-6u']}
NB. this should generalize to any number of groups, however, if really there are many groups, it will be preferable to loop on the columns names rather than using filter repeatedly
A straightforward groupby(...).apply(...) can achieve this result:
def id_match(group, to_match):
regex = "[{}]".format("|".join(group))
matches = to_match.str.match(regex)
return pd.Series(to_match[matches])
matched_groups = ids.groupby("group")["id"].apply(id_match, df.columns)
print(matched_groups)
group
fail 0 c56-6u
1 c56-6uY
pass 0 b887
1 b59lp
sub 0 a19-76
1 a23pz
2 a23pze
test 0 a19-76
1 a23pz
2 a23pze
3 b887
4 b59lp
You can treat this Series as a dictionary-like entity to access each of the groups independently:
print(matched_ids["fail"])
0 c56-6u
1 c56-6uY
Name: id, dtype: object
print(matched_ids["pass"])
0 b887
1 b59lp
Name: id, dtype: object
Then you can take it a step further to can subset your original DataFrame with this new Series like so:
print(df[matched_ids["fail"]])
c56-6u c56-6uY
0 0 NaN
1 1 NaN
2 2 NaN
print(df[matched_ids["pass"]])
b887 b59lp
0 0 0
1 1 1
2 2 2

How do you match the value of one dataframe's column with another dataframe's column using conditionals?

I have two dataframes:
Row No. Subject
1 Apple
2 Banana
3 Orange
4 Lemon
5 Strawberry
row_number Subjects Special?
1 Banana Yes
2 Lemon No
3 Apple No
4 Orange No
5 Strawberry Yes
6 Cranberry Yes
7 Watermelon No
I want to change the Row No. of the first dataframe to match the second. It should be like this:
Row No. Subject
3 Apple
1 Banana
4 Orange
2 Lemon
5 Strawberry
I have tried this code:
for index, row in df1.iterrows():
if df1['Subject'] == df2['Subjects']:
df1['Row No.'] = df2['row_number']
But I get the error:
ValueError: Can only compare identically-labeled Series objects
Does that mean the dataframes have to have the same amount of rows and columns? Do they have to be labelled the same too? Is there a way to bypass this limitation?
Edit: I have found a promising alternative formula:
for x in df1['Subject']:
if x in df2['Subjects'].values:
df2.loc[df2['Subjects'] == x]['row_number'] = df1.loc[df1['Subject'] == x]['Row No.']
But it appears it doesn't modify the first dataframe like I want it to. Any tips why?
Furthermore, I get this warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would avoid using for loops especially when pandas has such great methods to handle these types of problems already.
Using pd.Series.replace
Here is a vectorized way of doing this -
d is the dictionary that maps the fruit to the number in second dataframe
You can use df.Subject.replace(d) to now simply replace the keys in the dict d to their values.
Overwrite the Row No. column with this now.
d = dict(zip(df2['Subjects'], df2['row_number']))
df1['Row No.'] = df1.Subject.replace(d)
print(df1)
Subject Row No.
0 Apple 3
1 Banana 1
2 Orange 4
3 Lemon 2
4 Strawberry 5
Using pd.merge
Let's try simply merging the 2 dataframe and replace the column completely.
ddf = pd.merge(df1['Subject'],
df2[['row_number','Subjects']],
left_on='Subject',
right_on='Subjects',
how='left').drop('Subjects',1)
ddf.columns = df1.columns[::-1]
print(ddf)
Subject Row No.
0 Apple 3
1 Banana 1
2 Orange 4
3 Lemon 2
4 Strawberry 5
Assuming the first is df1 and the second is df2, this should do what you want it to:
import pandas as pd
d1 = {'Row No.': [1, 2, 3, 4, 5], 'Subject': ['Apple', 'Banana', 'Orange',
'Lemon', 'Strawberry']}
df1 = pd.DataFrame(data=d1)
d2 = {'row_number': [1, 2, 3, 4, 5, 6, 7], 'Subjects': ['Banana', 'Lemon', 'Apple',
'Orange', 'Strawberry', 'Cranberry', 'Watermelon'], 'Special?': ['Yes', 'No',
'No', 'No',
'Yes', 'Yes', 'No']}
df2 = pd.DataFrame(data=d2)
for x in df1['Subject']:
if x in df2['Subjects'].values:
df1.loc[df1['Subject'] == x, 'Row No.'] = (df2.loc[df2['Subjects'] == x]['row_number']).item()
#print(df1)
#print(df2)
In your edited answer it looks like you had the dataframes swapped and you were missing the item() to get the actual row_number value and not the Series object.

Name group of columns and rows in Pandas DataFrame

I would like to give a name to groups of columns and rows in my Pandas DataFrame to achieve the same result as a merged Excel table:
However, I can't find any way to give an overarching name to groups of columns/rows like what is shown.
I tried wrapping the tables in an array, but the dataframes don't display:
labels = ['a', 'b', 'c']
df = pd.DataFrame(np.ones((3,3)), index=labels, columns=labels)
labeledRowsCols = pd.DataFrame([df, df])
labeledRowsCols = pd.DataFrame(labeledRowsCols.T, index=['actual'], columns=['predicted 1', 'predicted 2'])
print(labeledRowsCols)
predicted 1 predicted 2
actual NaN NaN
You can set hierarchical indices for both the rows and columns.
import pandas as pd
df = pd.DataFrame([[3,1,0,3,1,0],[0,3,0,0,3,0],[2,1,3,2,1,3]])
col_ix = pd.MultiIndex.from_product([['Predicted: Set 1', 'Predicted: Set 2'], list('abc')])
row_ix = pd.MultiIndex.from_product([['True label'], list('abc')])
df = df.set_index(row_ix)
df.columns = col_ix
df
# returns:
Predicted: Set 1 Predicted: Set 2
a b c a b c
True label a 3 1 0 3 1 0
b 0 3 0 0 3 0
c 2 1 3 2 1 3
Exporting this to Excel should have the merged cells as in your example.

Pandas split CSV into multiple CSV's (or DataFrames) by a column

I'm very lost with a problem and some help or tips will be appreciated.
The problem: I've a csv file with a column with the possibility of multiple values like:
Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
Orange;Green;something2
Apple;Red;something2
Apple;Red;something3
I've loaded the data into a dataframe and i need to split that dataframe into multiple dataframes based on the value of the column "The_evil_column":
df1
Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
df2
Fruit;Color;The_evil_column
Orange;Green;something2
Apple;Red;something2
df3
Fruit;Color;The_evil_column
Apple;Red;something3
After reading some posts i'm even more confused and i need some tip about this please.
you can generate a dictionary of DataFrames:
d = {g:x for g,x in df.groupby('The_evil_column')}
In [95]: d.keys()
Out[95]: dict_keys(['something1', 'something2', 'something3'])
In [96]: d['something1']
Out[96]:
Fruit Color The_evil_column
0 Apple Red something1
1 Apple Green something1
2 Orange Orange something1
or a list of DataFrames:
In [103]: l = [x for _,x in df.groupby('The_evil_column')]
In [104]: l[0]
Out[104]:
Fruit Color The_evil_column
0 Apple Red something1
1 Apple Green something1
2 Orange Orange something1
In [105]: l[1]
Out[105]:
Fruit Color The_evil_column
3 Orange Green something2
4 Apple Red something2
In [106]: l[2]
Out[106]:
Fruit Color The_evil_column
5 Apple Red something3
UPDATE:
In [111]: g = pd.read_csv(filename, sep=';').groupby('The_evil_column')
In [112]: g.ngroups # number of unique values in the `The_evil_column` column
Out[112]: 3
In [113]: g.apply(lambda x: x.to_csv(r'c:\temp\{}.csv'.format(x.name)))
Out[113]:
Empty DataFrame
Columns: []
Index: []
will produce 3 files:
In [115]: glob.glob(r'c:\temp\something*.csv')
Out[115]:
['c:\\temp\\something1.csv',
'c:\\temp\\something2.csv',
'c:\\temp\\something3.csv']
you can just filter the frame by the value of the column:
frame=pd.read_csv('file.csv',delimiter=';')
frame['The_evil_column']=='something1'
this returns:
0 True
1 True
2 True
3 False
4 False
5 False
Name: The_evil_column, dtype: bool
Therefore you access these columns:
frame1 = frame[frame['The_evil_column']=='something1']
Later you can drop the column:
frame1 = frame1.drop('The_evil_column', axis=1)
Simpler but less efficient way is:
data = pd.read_csv('input.csv')
out = []
for evil_element in list(set(list(data['The_evil_column']))):
out.append(data[data['The_evil_column']==evil_element])
out will have list of all data dataframes.

Categories