Merging multiple dataframe lines into aggregate lines - python

For the following dataframe:
df = pd.DataFrame({'Name': {0: "A", 1: "A", 2:"A", 3: "B"},
'Spec1': {0: '1', 1: '3', 2:'5',
3: '1'},
'Spec2': {0: '2a', 1: np.nan, 2:np.nan,
3: np.nan}
}, columns=['Name', 'Spec1', 'Spec2'])
Name Spec1 Spec2
0 A 1 2a
1 A 3 NaN
2 A 5 NaN
3 B 1 NaN
I would like to aggregate the columns into:
Name Spec
0 A 1,3,5,2a
1 B 1
Is there a more "pandas" way of doing this than just looping and keeping track of the values?

Or using melt
df.melt('Name').groupby('Name').value.apply(lambda x:','.join(pd.Series(x).dropna())).reset_index().rename(columns={'value':'spec'})
Out[2226]:
Name spec
0 A 1,3,5,2a
1 B 1
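The melt one-liner above can also be unpacked into separate steps. This is only a sketch, assuming the sample frame from the question; the value column is named Spec directly through value_name, and the variable name result is made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'B'],
    'Spec1': ['1', '3', '5', '1'],
    'Spec2': ['2a', np.nan, np.nan, np.nan],
})

result = (
    df.melt(id_vars='Name', value_name='Spec')  # one row per (Name, column) cell
      .dropna(subset=['Spec'])                  # discard the empty cells
      .groupby('Name')['Spec']
      .agg(','.join)                            # join the remaining strings per Name
      .reset_index()
)
print(result)
```

Because melt stacks Spec1 before Spec2, the values of Spec1 come first in each joined string.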

Another way
In [966]: (df.set_index('Name').unstack()
.dropna().reset_index()
.groupby('Name')[0].apply(','.join))
Out[966]:
Name
A 1,3,5,2a
B 1
Name: 0, dtype: object

Group rows by name, combine column values as a list, dropping NaN:
df = df.groupby('Name').agg(lambda x: list(x.dropna()))
Spec1 Spec2
Name
A [1, 3, 5] [2a]
B [1] []
Now merge Spec1 and Spec2 lists. Bring Name back as a column. Name the new Spec column.
df = (df.Spec1 + df.Spec2).reset_index().rename(columns={0:"Spec"})
Name Spec
0 A [1, 3, 5, 2a]
1 B [1]
Finally, convert Spec lists to string representations:
df.Spec = df.Spec.apply(','.join)
Name Spec
0 A 1,3,5,2a
1 B 1

Related

Dropping columns before a matching string in the following column

Is there a direct way to drop all columns before a matching string in a pandas DataFrame? For example, if column 8 contains the string 'Matched', I want to drop columns 0 to 7.
You did not say where or how to look for 'Matched', but let's say that the integer col_num holds the position of the matched column:
col_num = np.where(df == 'Matched')[1][0]
df.drop(columns=df.columns[0:col_num],inplace=True)
This does the drop in place.
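Note that np.where(...)[1][0] raises an IndexError when no cell equals 'Matched'. A guarded variant might look like the sketch below; the sample frame and the names has_match, first_pos and trimmed are assumptions made up for illustration, not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the string sits somewhere in column 'C'.
df = pd.DataFrame({
    'A': ['x', 'y'],
    'B': ['p', 'q'],
    'C': ['Matched', 'r'],
    'D': ['s', 't'],
})

# True for every column that contains the string in at least one row.
has_match = df.eq('Matched').any(axis=0)

if has_match.any():
    # Position of the first matching column; keep it and everything after it.
    first_pos = int(np.argmax(has_match.to_numpy()))
    trimmed = df.iloc[:, first_pos:]
else:
    # Nothing matched: leave the frame untouched instead of raising.
    trimmed = df

print(trimmed)
```

Unlike the in-place drop, this returns a new selection and leaves df as it was.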
Example
data = {'A': {0: 1}, 'B': {0: 2}, 'C': {0: 3}, 'Match1': {0: 4}, 'D': {0: 5}}
df = pd.DataFrame(data)
df
A B C Match1 D
0 1 2 3 4 5
Code
Remove every column in front of the first column whose name starts with 'Match', using boolean indexing:
df.loc[:, df.columns.str.startswith('Match').cumsum() > 0]
result
Match1 D
0 4 5

Remove all the rows having same column values of another column which is duplicated

1. Input: we have a dataframe
ID name
1 a
1 b
2 a
2 c
3 d
2. Next, I keep the first occurrence of each 'name' and remove the later duplicates (here the 'a' with ID '2'). Output:
ID name
1 a
1 b
2 c
3 d
Code I used:
df.loc[~df.duplicated(keep='first', subset=['name'])]
3. Now I want to remove all the rows sharing the same 'ID' as a removed row (the removed 'a' had ID '2', so all rows with ID '2' go, which removes [2 c]). Final expected output:
ID name
1 a
1 b
3 d
Code I tried, but it is not working:
dt = df.name.duplicated(keep='first')
df.loc[~df.groupby(['ID','dt']).size().reset_index().drop(columns={0})]
You can use some kind of blacklist for the IDs:
Sample data:
import pandas as pd
d = {'ID':[1, 1, 2, 2, 3], 'name':['a', 'b', 'a', 'c', 'd']}
df = pd.DataFrame(d)
Code:
df[~df['ID'].isin(df[df['name'].duplicated()]['ID'])]
Output:
ID name
0 1 a
1 1 b
4 3 d
Code simplified:
blacklist = df[df['name'].duplicated()]['ID']
mask = ~df['ID'].isin(blacklist)
df[mask]
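The same blacklist idea can also be written with groupby, without building the intermediate list of IDs. A minimal sketch, assuming the sample data above; the names is_repeat and id_has_repeat are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3], 'name': ['a', 'b', 'a', 'c', 'd']})

# True for every 'name' that already appeared in an earlier row.
is_repeat = df['name'].duplicated()

# Broadcast per ID: True for all rows of an ID that owns at least one repeat.
id_has_repeat = is_repeat.groupby(df['ID']).transform('any')

result = df[~id_has_repeat]
print(result)
```

Both rows with ID 2 are dropped because one of them carries the repeated 'a'.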
If the DataFrame is ordered by ID, these two approaches should work:
df = pd.DataFrame(data={'ID': [1, 1, 1, 2, 3], 'name': ['a', 'b', 'a', 'c', 'd']})
df1 = df.loc[~df.duplicated(keep='first', subset=['ID'])]
df2 = df1.loc[~df1.duplicated(keep='first', subset=['name'])]
print(df2)
print(df.drop_duplicates(keep='first', subset=['ID']).drop_duplicates(keep='first', subset=['name']))
ID name
0 1 a
3 2 c
4 3 d
If it's ordered by name, you should use subset=['name'] first and then subset=['ID'].

Identifying the columns having duplicate column value with Different column name in python

How can I identify the columns in a data frame that have the same values but different column names? I need to list both columns of each pair, but I am only able to list one of them.
from pandas import DataFrame
import numpy as np
import pandas as pd
raw_data = {
'id': ['1', '2', '2', '3', '3'],
'name': ['A', 'B', 'B', 'C', 'D'],
'age' : [1, 2, 2, 3, 3],
'name_dup': ['A', 'B', 'B', 'C', 'D'],
'age_dup': [1, 2, 2, 3, 3]}
df = pd.DataFrame(raw_data, columns = ['id', 'name','age','name_dup','age_dup'])
As in the image, one can observe that name and name_dup have the same values but different column names. With the function below I get only name as output (shown below), where name_dup is expected.
def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)
        for i in range(lcs):
            iv = vs.iloc[:,i].tolist()
            for j in range(i+1, lcs):
                jv = vs.iloc[:,j].tolist()
                if iv == jv:
                    dups.append(cs[i])
                    break
    return dups
duplicate_columns(df)
Output of the above code is shown below:
Expected list of duplicate columns:
name and name_dup, age and age_dup.
Further to this, I want to keep one column of each pair, drop the other, and rename the kept column from list_check, if we have a list of column names:
list_check = ['name','age']
Expected DataFrame
Note: The duplicate is not always named colname_dup; it can also be something like lname.
Do you mean something like this:
s = df.T.duplicated().reset_index()
vals = s.loc[s[0], 'index'].tolist()
colk = df.columns.drop(vals)
print(vals)
print(colk)
print(df.drop(vals, axis=1))
Output:
['name_dup', 'age_dup']
['id', 'name', 'age']
id name age
0 1 A 1
1 2 B 2
2 2 B 2
3 3 C 3
4 3 D 3
You can try this:
df.T.drop_duplicates().T
output:
id name age
0 1 A 1
1 2 B 2
2 2 B 2
3 3 C 3
4 3 D 3
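Neither answer lists both members of each duplicated pair, which is what the question asks for. One way to get the pairs is to group the column names by their values; this is only a sketch on the question's data, and the names by_values and duplicate_groups are made up for illustration. It assumes an exact, order-sensitive comparison is wanted and that the columns hold no NaN (NaN never compares equal to itself).

```python
import pandas as pd

raw_data = {
    'id': ['1', '2', '2', '3', '3'],
    'name': ['A', 'B', 'B', 'C', 'D'],
    'age': [1, 2, 2, 3, 3],
    'name_dup': ['A', 'B', 'B', 'C', 'D'],
    'age_dup': [1, 2, 2, 3, 3],
}
df = pd.DataFrame(raw_data, columns=['id', 'name', 'age', 'name_dup', 'age_dup'])

# Map each distinct tuple of column values to the names of the columns holding it.
by_values = {}
for col in df.columns:
    by_values.setdefault(tuple(df[col].tolist()), []).append(col)

# Keep only the value tuples shared by more than one column.
duplicate_groups = [cols for cols in by_values.values() if len(cols) > 1]
print(duplicate_groups)
```

The first name in each group is the column that appears first in the frame, so the remaining names are the ones to drop or rename.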

Add suffix to columns in pandas dataframe [duplicate]

I want to add _x suffix to each column name like so:
featuresA = myPandasDataFrame.columns.values + '_x'
How do I do this? Additionally, if I wanted to add x_ as a prefix, how would the solution change?
The following is the nicest way to add suffix in my opinion.
df = df.add_suffix('_some_suffix')
Since it is a method that is called on a DataFrame and returns a DataFrame, you can use it in a chain of calls.
You can use a list comprehension:
df.columns = [str(col) + '_x' for col in df.columns]
There are also built-in methods like .add_suffix() and .add_prefix() as mentioned in another answer.
Elegant In-place Concatenation
If you're trying to modify df in-place, then the cheapest (and simplest) option is in-place addition directly on df.columns (i.e., using Index.__iadd__).
df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df
A B
0 9 12
1 4 7
2 2 5
3 1 4
df.columns += '_some_suffix'
df
A_some_suffix B_some_suffix
0 9 12
1 4 7
2 2 5
3 1 4
To add a prefix, you would similarly use
df.columns = 'some_prefix_' + df.columns
df
some_prefix_A some_prefix_B
0 9 12
1 4 7
2 2 5
3 1 4
Another cheap option is using a list comprehension with f-string formatting (available on python3.6+).
df.columns = [f'{c}_some_suffix' for c in df]
df
A_some_suffix B_some_suffix
0 9 12
1 4 7
2 2 5
3 1 4
And for prefix, similarly,
df.columns = [f'some_prefix_{c}' for c in df]
Method Chaining
It is also possible to add *fixes while method chaining. To add a suffix, use DataFrame.add_suffix
df.add_suffix('_some_suffix')
A_some_suffix B_some_suffix
0 9 12
1 4 7
2 2 5
3 1 4
This returns a copy of the data. IOW, df is not modified.
Adding prefixes is also done with DataFrame.add_prefix.
df.add_prefix('some_prefix_')
some_prefix_A some_prefix_B
0 9 12
1 4 7
2 2 5
3 1 4
Which also does not modify df.
Critique of add_*fix
These are good methods if you're trying to perform method chaining:
df.some_method1().some_method2().add_*fix(...)
However, add_prefix (and add_suffix) creates a copy of the entire dataframe, just to modify the headers. If you believe this is wasteful, but still want to chain, you can call pipe:
def add_suffix(df):
    df.columns += '_some_suffix'
    return df
df.some_method1().some_method2().pipe(add_suffix)
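One caveat with the pipe helper is that it renames the columns of whatever object it receives. When earlier methods in the chain already returned a new frame this is harmless, but when pipe is called directly on df, df itself is changed. A small sketch to illustrate, using made-up data:

```python
import pandas as pd

def add_suffix(frame):
    # Rebinds the column labels of the frame that was passed in.
    frame.columns += '_some_suffix'
    return frame

df = pd.DataFrame({"A": [9, 4], "B": [12, 7]})

out = df.pipe(add_suffix)

print(out is df)          # the helper returned the very same object
print(list(df.columns))   # the original frame now carries the suffix
```

If the original must stay untouched, DataFrame.add_suffix remains the safer choice despite the copy.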
I know 4 ways to add a suffix (or prefix) to your column names:
1- df.columns = [str(col) + '_some_suffix' for col in df.columns]
or
2- df.rename(columns= lambda col: col+'_some_suffix')
or
3- df.columns += '_some_suffix' (much easier)
or, the nicest:
4- df.add_suffix('_some_suffix')
I haven't seen this solution proposed above so adding this to the list:
df.columns += '_x'
And you can easily adapt for the prefix scenario.
Using DataFrame.rename
df = pd.DataFrame({'A': range(3), 'B': range(4, 7)})
print(df)
A B
0 0 4
1 1 5
2 2 6
Using rename with axis=1 and string formatting:
df.rename('col_{}'.format, axis=1)
# or df.rename(columns='col_{}'.format)
col_A col_B
0 0 4
1 1 5
2 2 6
To actually overwrite your column names, we can assign the returned values to our df:
df = df.rename('col_{}'.format, axis=1)
or use inplace=True:
df.rename('col_{}'.format, axis=1, inplace=True)
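The same rename trick covers the suffix case from the question by moving the placeholder to the front of the format string. A short sketch, with the names suffixed and prefixed made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': range(3), 'B': range(4, 7)})

# '{}_x'.format('A') gives 'A_x', so each label gets the suffix appended.
suffixed = df.rename(columns='{}_x'.format)

# Placeholder at the end gives the prefix variant.
prefixed = df.rename(columns='x_{}'.format)

print(list(suffixed.columns))
print(list(prefixed.columns))
```

Both calls return new frames; df keeps its original labels unless the result is assigned back.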
I figured that this is what I would use quite often, for example:
df = pd.DataFrame({'silverfish': range(3), 'silverspoon': range(4, 7),
'goldfish': range(10, 13),'goldilocks':range(17,20)})
My way of dynamically renaming:
color_list = ['gold','silver']
for i in color_list:
    df[f'color_{i}']=df.filter(like=i).sum(axis=1)
OUTPUT:
{'silverfish': {0: 0, 1: 1, 2: 2},
'silverspoon': {0: 4, 1: 5, 2: 6},
'goldfish': {0: 10, 1: 11, 2: 12},
'goldilocks': {0: 17, 1: 18, 2: 19},
'color_gold': {0: 27, 1: 29, 2: 31},
'color_silver': {0: 4, 1: 6, 2: 8}}
Pandas also has an add_prefix method and an add_suffix method to do this.

Explode a row to multiple rows in pandas dataframe

I have a dataframe with the following header:
id, type1, ..., type10, location1, ..., location10
and I want to convert it as follows:
id, type, location
I managed to do this using embedded for loops but it's very slow:
new_format_columns = ['ID', 'type', 'location']
new_format_dataframe = pd.DataFrame(columns=new_format_columns)
print(data.head())
new_index = 0
for index, row in data.iterrows():
    ID = row["ID"]
    for i in range(1, 11):
        # NaN never compares equal to anything, so test with pd.isna instead of == np.nan
        if pd.isna(row["type"+str(i)]):
            continue
        else:
            new_row = pd.Series([ID, row["type"+str(i)], row["location"+str(i)]])
            new_format_dataframe.loc[new_index] = new_row.values
            new_index += 1
Any suggestions for improvement using native pandas features?
You can use lreshape:
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
Sample:
import pandas as pd
df = pd.DataFrame({
'type1': {0: 1, 1: 4},
'id': {0: 'a', 1: 'a'},
'type10': {0: 1, 1: 8},
'location1': {0: 2, 1: 9},
'location10': {0: 5, 1: 7}})
print (df)
id location1 location10 type1 type10
0 a 2 5 1 1
1 a 9 7 4 8
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
id Location Type
0 a 2 1
1 a 9 4
2 a 5 1
3 a 7 8
Another solution with double melt:
print (pd.concat([pd.melt(df, id_vars='id', value_vars=types, value_name='type'),
pd.melt(df, value_vars=location, value_name='Location')], axis=1)
.drop('variable', axis=1))
id type Location
0 a 1 2
1 a 4 9
2 a 1 5
3 a 8 7
EDIT:
lreshape is now undocumented, and it may be removed in a future version (together with pd.wide_to_long).
A possible solution is merging all 3 functions into one, maybe melt, but that is not implemented yet. Maybe it will be in some new version of pandas. Then my answer will be updated.
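For reference, the pd.wide_to_long function mentioned above can do this reshape in a single call. The sketch below is not part of the original answer: it assumes that id identifies each row uniquely (wide_to_long raises an error otherwise, so a row number would have to be added first), and the level name slot is made up for illustration.

```python
import pandas as pd

# Sample frame with one row per id (an assumption of this sketch).
df = pd.DataFrame({
    'id': ['a', 'b'],
    'type1': [1, 4],
    'type10': [1, 8],
    'location1': [2, 9],
    'location10': [5, 7],
})

long_df = (
    # Columns named <stub><number> are stacked; the number goes into 'slot'.
    pd.wide_to_long(df, stubnames=['type', 'location'], i='id', j='slot')
      .reset_index()
      .drop(columns='slot')
      .dropna(subset=['type'])   # skip slots without a type, as the loop intended
)
print(long_df)
```

Each (type, location) pair stays together because both stubs are matched on the same numeric suffix.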
