Merging multiple dataframe lines into aggregate lines

For the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Name': {0: "A", 1: "A", 2: "A", 3: "B"},
                   'Spec1': {0: '1', 1: '3', 2: '5', 3: '1'},
                   'Spec2': {0: '2a', 1: np.nan, 2: np.nan, 3: np.nan}},
                  columns=['Name', 'Spec1', 'Spec2'])
Name Spec1 Spec2
0 A 1 2a
1 A 3 NaN
2 A 5 NaN
3 B 1 NaN
I would like to aggregate the columns into:
Name Spec
0 A 1,3,5,2a
1 B 1
Is there a more "pandas" way of doing this than just looping and keeping track of the values?

Or using melt
df.melt('Name').groupby('Name').value.apply(lambda x:','.join(pd.Series(x).dropna())).reset_index().rename(columns={'value':'spec'})
Out[2226]:
Name spec
0 A 1,3,5,2a
1 B 1

Another way
In [966]: (df.set_index('Name').unstack()
.dropna().reset_index()
.groupby('Name')[0].apply(','.join))
Out[966]:
Name
A 1,3,5,2a
B 1
Name: 0, dtype: object

Group rows by name, combine column values as a list, dropping NaN:
df = df.groupby('Name').agg(lambda x: list(x.dropna()))
Spec1 Spec2
Name
A [1, 3, 5] [2a]
B [1] []
Now merge Spec1 and Spec2 lists. Bring Name back as a column. Name the new Spec column.
df = (df.Spec1 + df.Spec2).reset_index().rename(columns={0:"Spec"})
Name Spec
0 A [1, 3, 5, 2a]
1 B [1]
Finally, convert Spec lists to string representations:
df.Spec = df.Spec.apply(','.join)
Name Spec
0 A 1,3,5,2a
1 B 1
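For reference, the melt-based idea from the answers above can be written as a self-contained script. This is only a sketch: the sample frame is rebuilt from the question, and the intermediate name long_form is made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({'Name': ['A', 'A', 'A', 'B'],
                   'Spec1': ['1', '3', '5', '1'],
                   'Spec2': ['2a', np.nan, np.nan, np.nan]})

# melt stacks all Spec1 values first, then all Spec2 values,
# so the Spec2 entries end up after the Spec1 entries of each name.
long_form = df.melt(id_vars='Name', value_name='Spec').dropna(subset=['Spec'])

# Join the remaining strings per name.
result = long_form.groupby('Name')['Spec'].agg(','.join).reset_index()
print(result)
```

Dropping the NaN rows before grouping keeps the join step free of special cases, since ','.join only ever sees strings.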

Related

Dropping columns before a matching string in the following column

Is there a direct way to drop all columns before a matching string in a pandas DataFrame? For example, if column 8 contains the string 'Matched', I want to drop columns 0 to 7.

Well, you did not say where or how to look for 'Matched', but let's say the integer col_num holds the position of the matched column:
col_num = np.where(df == 'Matched')[1][0]
df.drop(columns=df.columns[0:col_num],inplace=True)
This drops every column before the matched one.
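A slightly more defensive variant of the same idea might look like the sketch below. The sample frame and its column names c0 to c3 are hypothetical. Note that np.where reports matches row by row, so its first entry is not necessarily the leftmost column; the sketch takes the minimum column position instead and leaves the frame untouched when nothing matches.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame; the column names c0..c3 are made up for illustration.
df = pd.DataFrame({'c0': ['x', 'y'],
                   'c1': ['p', 'q'],
                   'c2': ['r', 'Matched'],
                   'c3': ['Matched', 't']})

# Positions of every cell equal to 'Matched' (row indices, column indices).
rows, cols = np.where((df == 'Matched').to_numpy())

if cols.size == 0:
    trimmed = df  # nothing matched, so nothing is dropped
else:
    first_col = cols.min()  # leftmost column that contains a match
    trimmed = df.iloc[:, first_col:]

print(trimmed)
```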

Example
data = {'A': {0: 1}, 'B': {0: 2}, 'C': {0: 3}, 'Match1': {0: 4}, 'D': {0: 5}}
df = pd.DataFrame(data)
df
A B C Match1 D
0 1 2 3 4 5
Code
Remove the columns in front of the first column whose name starts with 'Match', using boolean indexing:
df.loc[:, df.columns.str.startswith('Match').cumsum() > 0]
result
Match1 D
0 4 5
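To see why the cumulative sum works, it can help to look at the intermediate masks. The sketch below reuses the example frame; the names is_match and keep are introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'Match1': [4], 'D': [5]})

# One boolean per column: does the column name start with 'Match'?
is_match = df.columns.str.startswith('Match')

# The running count of matches is 0 before the first match and >= 1 from it onwards.
keep = is_match.cumsum() > 0

result = df.loc[:, keep]
print(list(result.columns))
```

If no column name starts with 'Match', keep is all False and the result has no columns.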

Remove all the rows having same column values of another column which is duplicated

1. Input: we have a dataframe
ID name
1 a
1 b
2 a
2 c
3 d
2. Now I keep the first occurrence of each duplicated 'name' value and remove the rest (here the 'a' with ID '2' is removed). Output:
ID name
1 a
1 b
2 c
3 d
Code I used:
df.loc[~df.duplicated(keep='first', subset=['name'])]
3. Now I want to remove all rows sharing the ID of a removed row (here the removed 'a' had ID '2', so all rows with ID '2' are removed, which drops [2 c]). Final expected output:
ID name
1 a
1 b
3 d
Code I tried, which does not work:
dt = df.name.duplicated(keep='first')
df.loc[~df.groupby(['ID','dt']).size().reset_index().drop(columns={0})]

You can use some kind of blacklist for the IDs:
Sample data:
import pandas as pd
d = {'ID':[1, 1, 2, 2, 3], 'name':['a', 'b', 'a', 'c', 'd']}
df = pd.DataFrame(d)
Code:
df[~df['ID'].isin(df[df['name'].duplicated()]['ID'])]
Output:
ID name
0 1 a
1 1 b
4 3 d
Code simplified:
blacklist = df[df['name'].duplicated()]['ID']
mask = ~df['ID'].isin(blacklist)
df[mask]
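The same result can probably also be reached without an explicit blacklist, by flagging whole ID groups with groupby and transform. This is an alternative sketch, not the answer's method; the names is_repeat and id_has_repeat are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3], 'name': ['a', 'b', 'a', 'c', 'd']})

# True for every repeat of a name after its first occurrence.
is_repeat = df['name'].duplicated()

# Flag a whole ID group as soon as one of its rows is a repeat.
id_has_repeat = is_repeat.groupby(df['ID']).transform('any')

result = df[~id_has_repeat]
print(result)
```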

If the DataFrame is ordered by ID, these two approaches should work:
df = pd.DataFrame(data={'ID': [1, 1, 1, 2, 3], 'name': ['a', 'b', 'a', 'c', 'd']})
df1 = df.loc[~df.duplicated(keep='first', subset=['ID'])]
df2 = df1.loc[~df1.duplicated(keep='first', subset=['name'])]
print(df2)
print(df.drop_duplicates(keep='first', subset=['ID']).drop_duplicates(keep='first', subset=['name']))
ID name
0 1 a
3 2 c
4 3 d
If it's ordered by name, you should use subset=['name'] first and then subset=['ID'].

Identifying the columns having duplicate column value with Different column name in python

How can I identify the columns in a DataFrame that have the same values but different column names? I need to list both columns of each pair; at the moment I am only able to list one of them.
from pandas import DataFrame
import numpy as np
import pandas as pd
raw_data = {
'id': ['1', '2', '2', '3', '3'],
'name': ['A', 'B', 'B', 'C', 'D'],
'age' : [1, 2, 2, 3, 3],
'name_dup': ['A', 'B', 'B', 'C', 'D'],
'age_dup': [1, 2, 2, 3, 3]}
df = pd.DataFrame(raw_data, columns = ['id', 'name','age','name_dup','age_dup'])
As the data shows, name and name_dup have the same values but different column names. With the function below I only get name in the output, where name_dup is expected.
def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)
        for i in range(lcs):
            iv = vs.iloc[:,i].tolist()
            for j in range(i+1, lcs):
                jv = vs.iloc[:,j].tolist()
                if iv == jv:
                    dups.append(cs[i])
                    break
    return dups
duplicate_columns(df)
Expected list of duplicate columns:
name and name_dup, age and age_dup.
Beyond this, I want to drop one column of each pair and rename the remaining column according to list_check, if we have a list of column names:
list_check = ['name','age']
Expected DataFrame
Note: the duplicate of colname is not always named colname_dup; it can also be something like lname.

Do you mean something like this?
s = df.T.duplicated().reset_index()
vals = s.loc[s[0], 'index'].tolist()
colk = df.columns.drop(vals)
print(vals)
print(colk)
print(df.drop(vals, axis=1))
Output:
['name_dup', 'age_dup']
['id', 'name', 'age']
id name age
0 1 A 1
1 2 B 2
2 2 B 2
3 3 C 3
4 3 D 3

You can try this:
df.T.drop_duplicates().T
output:
id name age
0 1 A 1
1 2 B 2
2 2 B 2
3 3 C 3
4 3 D 3
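For the part of the question that asks for both names of each duplicate pair, a simple pairwise comparison may be enough. The sketch below is one possible approach under that reading; the helper name duplicate_column_pairs is hypothetical, and Series.equals is used so that columns only count as duplicates when both values and dtype agree.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['1', '2', '2', '3', '3'],
    'name': ['A', 'B', 'B', 'C', 'D'],
    'age': [1, 2, 2, 3, 3],
    'name_dup': ['A', 'B', 'B', 'C', 'D'],
    'age_dup': [1, 2, 2, 3, 3],
})

def duplicate_column_pairs(frame):
    """Return (kept, duplicate) pairs of column names whose values are identical."""
    pairs = []
    already_duplicate = set()
    cols = list(frame.columns)
    for i, first in enumerate(cols):
        if first in already_duplicate:
            continue  # this column was already reported as a copy of an earlier one
        for other in cols[i + 1:]:
            if other not in already_duplicate and frame[first].equals(frame[other]):
                pairs.append((first, other))
                already_duplicate.add(other)
    return pairs

pairs = duplicate_column_pairs(df)
# Drop the second column of every pair, keeping the first.
deduped = df.drop(columns=[dup for _, dup in pairs])
print(pairs)
print(list(deduped.columns))
```

The comparison is quadratic in the number of columns, which is usually fine for frames with a modest column count.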

Add suffix to columns in pandas dataframe [duplicate]

I want to add _x suffix to each column name like so:
featuresA = myPandasDataFrame.columns.values + '_x'
How do I do this? Additionally, if I wanted to add x_ as a prefix, how would the solution change?

The following is the nicest way to add a suffix, in my opinion.
df = df.add_suffix('_some_suffix')
Because it is a method that is called on a DataFrame and returns a DataFrame, you can use it in a chain of calls.

You can use a list comprehension:
df.columns = [str(col) + '_x' for col in df.columns]
There are also built-in methods like .add_suffix() and .add_prefix() as mentioned in another answer.

Elegant In-place Concatenation
If you're trying to modify df in-place, then the cheapest (and simplest) option is in-place addition directly on df.columns (i.e., using Index.__iadd__).
df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df
A B
0 9 12
1 4 7
2 2 5
3 1 4
df.columns += '_some_suffix'
df
A_some_suffix B_some_suffix
0 9 12
1 4 7
2 2 5
3 1 4
To add a prefix, you would similarly use
df.columns = 'some_prefix_' + df.columns
df
some_prefix_A some_prefix_B
0 9 12
1 4 7
2 2 5
3 1 4
Another cheap option is using a list comprehension with f-string formatting (available on python3.6+).
df.columns = [f'{c}_some_suffix' for c in df]
df
A_some_suffix B_some_suffix
0 9 12
1 4 7
2 2 5
3 1 4
And for prefix, similarly,
df.columns = [f'some_prefix_{c}' for c in df]
Method Chaining
It is also possible to add *fixes while method chaining. To add a suffix, use DataFrame.add_suffix:
df.add_suffix('_some_suffix')
A_some_suffix B_some_suffix
0 9 12
1 4 7
2 2 5
3 1 4
This returns a copy of the data. In other words, df is not modified.
Adding prefixes is also done with DataFrame.add_prefix.
df.add_prefix('some_prefix_')
some_prefix_A some_prefix_B
0 9 12
1 4 7
2 2 5
3 1 4
Which also does not modify df.
Critique of add_*fix
These are good methods if you're trying to perform method chaining:
df.some_method1().some_method2().add_*fix(...)
However, add_prefix (and add_suffix) creates a copy of the entire dataframe, just to modify the headers. If you believe this is wasteful, but still want to chain, you can call pipe:
def add_suffix(df):
    df.columns += '_some_suffix'
    return df
df.some_method1().some_method2().pipe(add_suffix)
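A runnable version of the pipe idea might look like the sketch below. The helper name with_suffix and the use of sort_values as the preceding chain step are assumptions made only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})

def with_suffix(frame, suffix):
    """Rename the columns of `frame` in place and return it for chaining."""
    frame.columns = frame.columns + suffix
    return frame

# sort_values returns a new frame, so only that intermediate result is renamed.
result = df.sort_values('A').pipe(with_suffix, '_x')
print(list(result.columns))
print(list(df.columns))
```

Because the helper modifies the frame it receives, calling df.pipe(with_suffix, '_x') directly would rename the columns of df itself.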

I know 4 ways to add a suffix (or prefix) to your column names:
1- df.columns = [str(col) + '_some_suffix' for col in df.columns]
or
2- df.rename(columns= lambda col: col+'_some_suffix')
or
3- df.columns += '_some_suffix' (much easier)
or, the nicest:
4- df.add_suffix('_some_suffix')
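As a quick check, the four options can be compared side by side on a small frame. This is a sketch using the suffix '_x'; the variable names are made up, and none of the four expressions below modifies df.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 4], "B": [12, 7]})

# 1: list comprehension over the column labels
via_list = [str(col) + '_x' for col in df.columns]

# 2: rename with a function applied to each label (returns a new frame)
via_rename = df.rename(columns=lambda col: col + '_x').columns.tolist()

# 3: string addition on the column Index (not assigned back here)
via_index = (df.columns + '_x').tolist()

# 4: add_suffix (returns a new frame)
via_add_suffix = df.add_suffix('_x').columns.tolist()

print(via_list, via_rename, via_index, via_add_suffix)
```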

I haven't seen this solution proposed above so adding this to the list:
df.columns += '_x'
And you can easily adapt for the prefix scenario.

Using DataFrame.rename
df = pd.DataFrame({'A': range(3), 'B': range(4, 7)})
print(df)
A B
0 0 4
1 1 5
2 2 6
Using rename with axis=1 and string formatting:
df.rename('col_{}'.format, axis=1)
# or df.rename(columns='col_{}'.format)
col_A col_B
0 0 4
1 1 5
2 2 6
To actually overwrite your column names, we can assign the returned values to our df:
df = df.rename('col_{}'.format, axis=1)
or use inplace=True:
df.rename('col_{}'.format, axis=1, inplace=True)

I figured that this is what I would use quite often, for example:
df = pd.DataFrame({'silverfish': range(3), 'silverspoon': range(4, 7),
'goldfish': range(10, 13),'goldilocks':range(17,20)})
My way of dynamically renaming:
color_list = ['gold','silver']
for i in color_list:
    df[f'color_{i}']=df.filter(like=i).sum(axis=1)
OUTPUT:
{'silverfish': {0: 0, 1: 1, 2: 2},
'silverspoon': {0: 4, 1: 5, 2: 6},
'goldfish': {0: 10, 1: 11, 2: 12},
'goldilocks': {0: 17, 1: 18, 2: 19},
'color_gold': {0: 27, 1: 29, 2: 31},
'color_silver': {0: 4, 1: 6, 2: 8}}

Pandas also has an add_prefix method and an add_suffix method to do this.

Explode a row to multiple rows in pandas dataframe

I have a dataframe with the following header:
id, type1, ..., type10, location1, ..., location10
and I want to convert it as follows:
id, type, location
I managed to do this using nested for loops, but it's very slow:
new_format_columns = ['ID', 'type', 'location']
new_format_dataframe = pd.DataFrame(columns=new_format_columns)
print(data.head())
new_index = 0
for index, row in data.iterrows():
    ID = row["ID"]
    for i in range(1,11):
        # NaN never compares equal to anything, so test with pd.isna() instead of == np.nan
        if pd.isna(row["type"+str(i)]):
            continue
        else:
            new_row = pd.Series([ID, row["type"+str(i)], row["location"+str(i)]])
            new_format_dataframe.loc[new_index] = new_row.values
            new_index += 1
Any suggestions for improvement using native pandas features?

You can use lreshape:
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
Sample:
import pandas as pd
df = pd.DataFrame({
'type1': {0: 1, 1: 4},
'id': {0: 'a', 1: 'a'},
'type10': {0: 1, 1: 8},
'location1': {0: 2, 1: 9},
'location10': {0: 5, 1: 7}})
print (df)
id location1 location10 type1 type10
0 a 2 5 1 1
1 a 9 7 4 8
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
id Location Type
0 a 2 1
1 a 9 4
2 a 5 1
3 a 7 8
Another solution with double melt:
print (pd.concat([pd.melt(df, id_vars='id', value_vars=types, value_name='type'),
pd.melt(df, value_vars=location, value_name='Location')], axis=1)
.drop('variable', axis=1))
id type Location
0 a 1 2
1 a 4 9
2 a 1 5
3 a 8 7
EDIT:
lreshape was undocumented at the time of writing, and it is possible that it will be removed in the future (along with pd.wide_to_long).
A possible solution is merging all 3 functions into one, maybe melt, but this is not implemented yet. Maybe it will come in some new version of pandas; then my answer will be updated.
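Since pd.wide_to_long is mentioned above, here is a minimal sketch of how it could handle the same reshape. The sample data is hypothetical and uses only two type/location pairs; the name slot for the numeric suffix is made up, and the id column must uniquely identify each row for wide_to_long to accept it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two type/location pairs; ids are unique per row.
df = pd.DataFrame({
    'id': ['a', 'b'],
    'type1': [1, 4],
    'type2': [1, np.nan],
    'location1': [2, 9],
    'location2': [5, np.nan],
})

long_form = (
    pd.wide_to_long(df, stubnames=['type', 'location'], i='id', j='slot')
      .dropna(subset=['type'])      # skip the empty type/location pairs
      .reset_index()
      .sort_values(['id', 'slot'])
      .reset_index(drop=True)
      [['id', 'type', 'location']]
)
print(long_form)
```

If id is not unique in the real data, one option is to call reset_index() first and use the resulting index column as i.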
