Pandas implode Dataframe with values separated by char

I was wondering what the best approach is to implode a DataFrame, joining the values of each column with a given character.
For example, imagine this dataframe:
A B C D E
1 z a q p
2 x s w l
3 c d e k
4 v f r m
5 b g t n
And we want to implode it using # as the separator:
A B C D E
1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
Maybe create a copy of the original DataFrame and process it column by column with pandas Series.str.cat?
Thanks in advance!

Use DataFrame.agg with join, then convert the resulting Series to a one-row DataFrame with Series.to_frame and transpose it with DataFrame.T:
df = df.astype(str).agg('#'.join).to_frame().T
print (df)
A B C D E
0 1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
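For a self-contained version of the answer above, the sketch below rebuilds the example DataFrame from the question and wraps the same agg/join chain in a small helper. The function name implode and its sep parameter are illustrative assumptions, not part of pandas.

```python
import pandas as pd


def implode(frame: pd.DataFrame, sep: str = "#") -> pd.DataFrame:
    """Join each column's values into one string (hypothetical helper)."""
    # astype(str) is needed because str.join only accepts strings
    return frame.astype(str).agg(sep.join).to_frame().T


# Example data copied from the question
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": list("zxcvb"),
    "C": list("asdfg"),
    "D": list("qwert"),
    "E": list("plkmn"),
})

imploded = implode(df)
print(imploded)
```

Passing a different sep, for example implode(df, sep=","), should give comma-separated cells instead.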

Related

Groupby pandas dataframe keeping unique values for some columns and list other columns

I want to group my data by material_id, keeping the unique values of material_description and MPN, but listing the plant_id values.
def search_output(materials):
    df = pd.DataFrame(materials)
    # drop the search score, then collect the remaining columns into lists per material_id
    df_ref = df.loc[:, df.columns != '#search.score'].groupby('material_id').agg(lambda x: list(x))
    return df_ref
This currently groups by material_id and lists the other columns.
I use the following code to keep the unique values grouped by material_id, but then the plant_id list column is missing:
df_t = df.loc[:, df.columns!='#search.score'].groupby('material_id')[['material_description','MPN']].agg(['unique'])
I'm looking for a way to combine the two. A way to group by a column, keep unique values of specific columns and list other columns at the same time.
Hope you can help - and sorry for the pictures, but can't figure out how to add output otherwise :)

You can build a dictionary that maps each column to an aggregation: 'unique' for the columns listed in u_cols and list for all other columns. Then pass it to GroupBy.agg:
print (df)
material_id material_description MPN A B
0 1 descr1 a b c
1 1 descr2 a d e
2 1 descr1 b b c
3 2 descr3 a b c
4 2 descr4 a b c
5 2 descr4 a b c
u_cols = ['material_description','MPN']
d = {c: 'unique' if c in u_cols else list for c in df.columns.drop('material_id')}
df_ref = df.loc[:, df.columns!='#search.score'].groupby('material_id').agg(d)
print (df_ref)
material_description MPN A B
material_id
1 [descr1, descr2] [a, b] [b, d, b] [c, e, c]
2 [descr3, descr4] [a] [b, b, b] [c, c, c]
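The same idea can be sketched end to end on data shaped like the question's. The plant_id and #search.score values below are made up for illustration; only the column names come from the question. The score column is dropped before the aggregation dictionary is built, so the dictionary never refers to a column that is no longer present.

```python
import pandas as pd

# Hypothetical sample data; only the column names are taken from the question
df = pd.DataFrame({
    "material_id": [1, 1, 1, 2, 2, 2],
    "material_description": ["descr1", "descr2", "descr1", "descr3", "descr4", "descr4"],
    "MPN": ["a", "a", "b", "a", "a", "a"],
    "plant_id": ["p1", "p2", "p1", "p3", "p3", "p4"],
    "#search.score": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
})

# Remove the score column first so it cannot end up in the aggregation dict
data = df.drop(columns="#search.score")

u_cols = ["material_description", "MPN"]
agg_map = {
    c: "unique" if c in u_cols else list
    for c in data.columns.drop("material_id")
}

df_ref = data.groupby("material_id").agg(agg_map)
print(df_ref)
```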

Make a new dataframe from multiple dataframes

Suppose I have 3 dataframes that are wrapped in a list. The dataframes are:
df_1 = pd.DataFrame({'text':['a','b','c','d','e'],'num':[2,1,3,4,3]})
df_2 = pd.DataFrame({'text':['f','g','h','i','j'],'num':[1,2,3,4,3]})
df_3 = pd.DataFrame({'text':['k','l','m','n','o'],'num':[6,5,3,1,2]})
The list of the dfs is:
df_list = [df_1, df_2, df_3]
Now I want to write a for loop that iterates over df_list and, for each DataFrame, takes the text column and merges it into a new DataFrame under a new column header called topic. Since each text column comes from a different DataFrame, I want to name the headers topic_1, topic_2, etc. The desired outcome is as follows:
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o
I can easily extract the text columns as:
lst = []
for i in range(len(df_list)):
    lst.append(df_list[i]['text'].tolist())
It is just that I am stuck on the last part, namely bringing the columns into 1 df without using brute force.

You can extract the wanted columns with a list comprehension and concat them:
pd.concat([d['text'].rename(f'topic_{i}')
           for i, d in enumerate(df_list, start=1)],
          axis=1)
output:
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o

Generally speaking you want to avoid looping anything on a pandas DataFrame. However, in this solution I do use a loop to rename your columns. This should work assuming you just have these 3 dataframes:
import pandas as pd
df_1 = pd.DataFrame({'text':['a','b','c','d','e'],'num':[2,1,3,4,3]})
df_2 = pd.DataFrame({'text':['f','g','h','i','j'],'num':[1,2,3,4,3]})
df_3 = pd.DataFrame({'text':['k','l','m','n','o'],'num':[6,5,3,1,2]})
df_list = [df_1.text, df_2.text, df_3.text]
df_combined = pd.concat(df_list,axis=1)
df_combined.columns = [f"topic_{i+1}" for i in range(len(df_combined.columns))]
>>> df_combined
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o
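A variant of the two answers above, offered as a sketch rather than the answerers' method: pd.concat also accepts a dict, and with axis=1 the dict keys become the column labels, so the renaming step can be folded into the concatenation. The variable name topics is an assumption for illustration.

```python
import pandas as pd

df_1 = pd.DataFrame({'text': ['a', 'b', 'c', 'd', 'e'], 'num': [2, 1, 3, 4, 3]})
df_2 = pd.DataFrame({'text': ['f', 'g', 'h', 'i', 'j'], 'num': [1, 2, 3, 4, 3]})
df_3 = pd.DataFrame({'text': ['k', 'l', 'm', 'n', 'o'], 'num': [6, 5, 3, 1, 2]})
df_list = [df_1, df_2, df_3]

# Dict keys become the column labels when Series are concatenated along axis=1
topics = pd.concat(
    {f"topic_{i}": d["text"] for i, d in enumerate(df_list, start=1)},
    axis=1,
)
print(topics)
```

This works for any number of DataFrames in df_list, as long as each one has a text column.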

How Best to Unpack a Pandas Dataframe of Tuples?

Probably really straightforward but I'm having no luck with Google. I have a 2 column dataframe of tuples, and I'm looking to unpack each tuple then pair up the contents from the same position in each column. For example:
Col1 Col2
(a,b,c) (d,e,f)
my desired output is
a d
b e
c f
I have a solution using loops but I would like to know a better way to do it - firstly because I am trying to eradicate loops from my life and secondly because it's potentially not as flexible as I may need it to be.
l1=[('a','b'),('c','d'),('e','f','g'),('h','i')]
l2=[('j','k'),('l','m'),('n','o','p'),('q','r')]
df = pd.DataFrame(list(zip(l1,l2)),columns=['Col1','Col2'])
df
Out[547]:
Col1 Col2
0 (a, b) (j, k)
1 (c, d) (l, m)
2 (e, f, g) (n, o, p)
3 (h, i) (q, r)
for i in range(len(df)):
    for j in range(len(df.iloc[i, 1])):
        print(df.iloc[i, 0][j], df.iloc[i, 1][j])
a j
b k
c l
d m
e n
f o
g p
h q
i r
All pythonic suggestions and guidance hugely appreciated. Many thanks.
Addition: an example including a row with differing length tuples, per Ch3steR's request below - my loop would not work in this instance ('d2' would not be included, where I would want it to be outputted paired with a null).
l1=[('a','b'),('c','d','d2'),('e','f','g'),('h','i')]
l2=[('j','k'),('l','m'),('n','o','p'),('q','r')]
df = pd.DataFrame(list(zip(l1,l2)),columns=['Col1','Col2'])

Convert each Series to a list with tolist, reconstruct a DataFrame from it and stack it, then concat the results back together. This leaves you with a MultiIndex whose first level is the original DataFrame index and whose second level is the position in the tuple.
This works for older versions of pandas (pd.__version__ < '1.3.0') and for cases where the tuples have an unequal number of elements (where explode will fail).
import pandas as pd
df1 = pd.concat([pd.DataFrame(df[col].tolist()).stack().rename(col)
                 for col in df.columns], axis=1)
Col1 Col2
0 0 a j
1 b k
1 0 c l
1 d m
2 0 e n
1 f o
2 g p
3 0 h q
1 i r
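To see how this approach behaves on the unequal-length example from the question's addition (the row containing 'd2'), here is a minimal sketch. The explicit dropna() is an added safeguard and an assumption on my part: how stack treats missing values differs between pandas versions, and dropping them explicitly should give the same result either way. The names wide, parts and paired are illustrative.

```python
import pandas as pd

# Data from the question's addition: row 1 has three items in Col1 but two in Col2
l1 = [('a', 'b'), ('c', 'd', 'd2'), ('e', 'f', 'g'), ('h', 'i')]
l2 = [('j', 'k'), ('l', 'm'), ('n', 'o', 'p'), ('q', 'r')]
df = pd.DataFrame(list(zip(l1, l2)), columns=['Col1', 'Col2'])

parts = []
for col in df.columns:
    # One column per tuple position; shorter tuples are padded with missing values
    wide = pd.DataFrame(df[col].tolist(), index=df.index)
    # Long format indexed by (original row, position); drop the padding explicitly
    parts.append(wide.stack().dropna().rename(col))

# Aligning on the (row, position) index pairs 'd2' with a missing value in Col2
paired = pd.concat(parts, axis=1)
print(paired)
```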

If the tuple lengths always match and you don't have a newer version of pandas that lets you pass a list of columns to explode, do something like this:
import pandas as pd
pd.concat([df.Col1.explode(), df.Col2.explode()], axis=1).reset_index(drop=True)
Col1 Col2
0 a j
1 b k
2 c l
3 d m
4 e n
5 f o
6 g p
7 h q
8 i r
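For comparison, on pandas 1.3.0 or later the multi-column form of explode mentioned above can be sketched as follows. It assumes the tuples in each row have matching lengths; the variable name paired is illustrative.

```python
import pandas as pd

l1 = [('a', 'b'), ('c', 'd'), ('e', 'f', 'g'), ('h', 'i')]
l2 = [('j', 'k'), ('l', 'm'), ('n', 'o', 'p'), ('q', 'r')]
df = pd.DataFrame(list(zip(l1, l2)), columns=['Col1', 'Col2'])

# Requires pandas >= 1.3.0; both columns are exploded in lockstep
paired = df.explode(['Col1', 'Col2'], ignore_index=True)
print(paired)
```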

Renaming columns on slice of dataframe not performing as expected

I was trying to clean up column names in a dataframe but only a part of the columns.
It doesn't work when trying to replace column names on a slice of the dataframe somehow, why is that?
Let's say we have the following dataframe (copy-able code to reproduce the data is at the bottom):
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually concat the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value':[1,2,3,4],
                   'ColAfjkj':['a', 'b', 'c', 'd'],
                   'ColBhuqwa':['e', 'f', 'g', 'h'],
                   'ColCouiqw':['i', 'j', 'k', 'l']})

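This happens for two reasons. df.iloc[:, 1:] returns a new DataFrame, so assigning to its .columns only renames that temporary object and leaves df untouched. You also cannot change part of df.columns in place, because a pandas Index is immutable. If you check the documentation for class pandas.Index, you'll see that it is defined as:
Immutable ndarray implementing an ordered, sliceable set
So in order to modify the names you'll have to assign a complete new list of column names, for instance with: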
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace:
df = df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
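The behaviour can be checked with a short sketch on the question's data. It first shows that renaming the slice leaves df alone, then renames with a callable passed to rename, which avoids spelling out the first column's position. The names tmp, keep and renamed are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})

# The slice is a separate DataFrame object: renaming it does not touch df
tmp = df.iloc[:, 1:]
tmp.columns = tmp.columns.str[:4]
print(list(tmp.columns))  # shortened names on the slice
print(list(df.columns))   # df still has the long names

# rename accepts a function that is applied to every column label
keep = {'Value'}  # labels to leave unchanged
renamed = df.rename(columns=lambda name: name if name in keep else name[:4])
print(list(renamed.columns))
```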

To overwrite column names you can use the .rename() method.
It will look like this:
df.rename(columns={'ColAfjkj':'ColA',
                   'ColBhuqwa':'ColB',
                   'ColCouiqw':'ColC'},
          inplace=True)
More info regarding rename here in docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename
mask = df.iloc[:,1:4].columns
Then, use a list comprehension with a conditional to rename just the columns you want
df.columns = [x[:4] if x in mask else x for x in df.columns]

Apply a function to a specific row using the index value

I have the following table:
import pandas as pd
import numpy as np
#Dataframe with random numbers and with an a,b,c,d,e index
df = pd.DataFrame(np.random.randn(5,5), index = ['a','b','c','d','e'])
#Now i name the columns the same
df.columns = ['a','b','c','d','e']
#Resulting dataframe:
a b c d e
a 2.214229 1.621352 0.083113 0.818191 -0.900224
b -0.612560 -0.028039 -0.392266 0.439679 1.596251
c 1.378928 -0.309353 -0.651817 1.499517 0.515772
d -0.061682 1.141558 -0.811471 0.242874 0.345159
e -0.714760 -0.172082 0.205638 0.220528 1.182013
How can I apply a function to a single row, selected by its index label? I want to round the numbers in every column of the row whose index is "c".
#Numbers to round to 2 decimals:
a b c d e
c 1.378928 -0.309353 -0.651817 1.499517 0.515772
What is the best way to do this?

For label based indexing use loc:
In [22]:
df = pd.DataFrame(np.random.randn(5,5), index = ['a','b','c','d','e'])
#Now i name the columns the same
df.columns = ['a','b','c','d','e']
df
Out[22]:
a b c d e
a -0.051366 1.856373 -0.224172 -0.005668 0.986908
b -1.121298 -1.018863 2.328420 -0.117501 -0.231463
c 2.241418 -0.838571 -0.551222 0.662890 -1.234716
d 0.275063 0.295788 0.689171 0.227742 0.091928
e 0.269730 0.326156 0.210443 -0.494634 -0.489698
In [23]:
df.loc['c'] = np.round(df.loc['c'],decimals=2)
df
Out[23]:
a b c d e
a -0.051366 1.856373 -0.224172 -0.005668 0.986908
b -1.121298 -1.018863 2.328420 -0.117501 -0.231463
c 2.240000 -0.840000 -0.550000 0.660000 -1.230000
d 0.275063 0.295788 0.689171 0.227742 0.091928
e 0.269730 0.326156 0.210443 -0.494634 -0.489698
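A reproducible sketch of the same loc-based assignment, using a fixed seed (an assumption added here so the run is repeatable) and Series.round instead of np.round:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
labels = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(rng.standard_normal((5, 5)), index=labels, columns=labels)

before = df.copy()  # kept only to compare against afterwards

# Select the row by label, round it, and write it back into the same row
df.loc['c'] = df.loc['c'].round(2)
print(df)
```

Only row c changes; the other rows keep their full precision.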

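To round the values of column c (this returns a new Series; assign it back to keep the result):
df['c'].round(decimals=2)
To round the values of row c (this likewise returns a new Series):
df.loc['c'].round(decimals=2)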
