pandas combine columns without null keep string values - python

I want to combine columns without null and keep string values.
Example data:
a,b,c
123.jpg,213.jpg,987.jpg
,159.jpg,
Here is my code:
cols = ['a','b','c']
df['combine_columns'] = df[cols].stack().groupby(level=0).agg(','.join)
print(df)
And the result:
a,b,c,combine_columns
123.jpg,213.jpg,987.jpg,"123.jpg,213.jpg,987.jpg"
,159.jpg,,159.jpg
But I want something like this:
a,b,c,combine_columns
123.jpg,213.jpg,987.jpg,""123.jpg","213.jpg","987.jpg""
,159.jpg,,"159.jpg"
How can I do this?

You can use apply with a list comprehension and pandas.notna as a filter:
df['combine_columns'] = df.apply(lambda x: ','.join([e for e in x if pd.notna(e)]),
                                 axis=1)
output:
a b c combine_columns
0 123.jpg 213.jpg 987.jpg 123.jpg,213.jpg,987.jpg
1 NaN 159.jpg NaN 159.jpg
To add extra " quotes around the string:
df['combine_columns'] = df.apply(lambda x: '"%s"' % ','.join([e for e in x if pd.notna(e)]),
                                 axis=1)
output:
a b c combine_columns
0 123.jpg 213.jpg 987.jpg "123.jpg,213.jpg,987.jpg"
1 NaN 159.jpg NaN "159.jpg"
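Note that the desired output above quotes each filename individually, not the whole joined string. A small tweak to the comprehension produces that (a sketch, reusing cols from the question):
df['combine_columns'] = df[cols].apply(
    lambda x: ','.join('"%s"' % e for e in x if pd.notna(e)),
    axis=1)
Row 0 then becomes "123.jpg","213.jpg","987.jpg" and row 1 becomes "159.jpg".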

Related

Split string column based on delimiter and convert it to dict in Pandas without loop

I have the below dataframe:
clm1, clm2, clm3
10, a, clm4=1|clm5=5
11, b, clm4=2
My desired result is
clm1, clm2, clm4, clm5
10, a, 1, 5
11, b, 2, NaN
I have tried the below method:
from pandas import json_normalize

rows = list(df.index)
dictlist = []
for index in rows:  # loop through each row to convert clm3 to a dict
    i = df.at[index, "clm3"]
    mydict = dict(map(lambda x: x.split('='), [x for x in i.split('|') if '=' in x]))
    dictlist.append(mydict)
l = json_normalize(dictlist)  # convert the dict list to a flat dataframe
resultdf = df.join(l).drop('clm3', axis=1)
This gives me the desired result, but I am looking for a more efficient way to convert clm3 that does not involve looping through each row.
Two steps: the idea is to do a double split, then group by the index and unstack the values as columns.
s = (
    df["clm3"]
    .str.split("|", expand=True)
    .stack()
    .str.split("=", expand=True)
    .reset_index(level=1, drop=True)
)
final = pd.concat([df, s.groupby([s.index, s[0]])[1].sum().unstack()], axis=1).drop(
    "clm3", axis=1
)
print(final)
clm1 clm2 clm4 clm5
0 10 a 1 5
1 11 b 2 NaN
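For intuition, the intermediate s looks like this for the sample data: one row per key=value pair, with the original row number as a repeated index, and columns 0/1 coming from the second split:
print(s)
#       0  1
# 0  clm4  1
# 0  clm5  5
# 1  clm4  2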
Use str.extractall to get your values and unstack to pivot them to a column for each unique value, and str.get_dummies to get a column for each unique clm.
values = (
    df['clm3'].str.extractall(r'(=\d)')[0]
    .str.replace('=', '')
    .unstack()
    .rename_axis(None, axis=1)
)
columns = df['clm3'].str.replace(r'=\d', '', regex=True).str.get_dummies(sep='|').columns
values.columns = columns
dfnew = pd.concat([df[['clm1', 'clm2']], values], axis=1)
clm1 clm2 clm4 clm5
0 10 a 1 5
1 11 b 2 NaN
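If the loop is the only concern, another option is to build the dicts with a single apply and flatten them with pandas.json_normalize (a sketch; it is still row-wise under the hood and assumes every pair is a well-formed key=value):
import pandas as pd

# one dict per row, e.g. {'clm4': '1', 'clm5': '5'}
dicts = df['clm3'].apply(lambda s: dict(kv.split('=') for kv in s.split('|')))
resultdf = df.drop('clm3', axis=1).join(pd.json_normalize(dicts.tolist()))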

Python: Pivot Table/group by specific conditions

I'm trying to change the structure of my data from a text file (.txt); the data looks like this:
:1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J
And I would like to transform it into this format (like a pivot table in Excel, where the column name is the character between the ":" marks, and each group always starts with :1:):
Group  :1:  :2:  :3:  :4:
1       A    B    C
2       D    E    F    G
3       H         I    J
Does anyone have any idea? Thanks in advance.
First, create the DataFrame with read_csv and header=None, because the file has no header:
import pandas as pd
from io import StringIO
temp=u""":1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), header=None)
print (df)
0
0 :1:A
1 :2:B
2 :3:C
3 :1:D
4 :2:E
5 :3:F
6 :4:G
7 :1:H
8 :3:I
9 :4:J
Extract the original column with DataFrame.pop, then remove the leading and trailing : with Series.str.strip and split the values into 2 new columns with Series.str.split. Then create group numbers by comparing to the string '1' with Series.eq (==) and taking Series.cumsum, create a MultiIndex with DataFrame.set_index, and finally reshape with Series.unstack:
df[['a','b']] = df.pop(0).str.strip(':').str.split(':', expand=True)
df1 = df.set_index([df['a'].eq('1').cumsum(), 'a'])['b'].unstack(fill_value='')
print (df1)
a  1  2  3  4
a
1  A  B  C
2  D  E  F  G
3  H     I  J
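For intuition, the grouping key built by eq('1').cumsum() increments at every '1' row, which is why each group starts at a :1: line. A quick check on the sample data:
print(df['a'].eq('1').cumsum().tolist())
# [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]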
Use:
# Reading text file (assuming stored in CSV format, you can also use pd.read_fwf)
df = pd.read_csv('SO.csv', header=None)
# Splitting data into two columns
ndf = df.iloc[:, 0].str.split(':', expand=True).iloc[:, 1:]
# Grouping and creating a dataframe. Later dropping NaNs
res = ndf.groupby(1)[2].apply(pd.DataFrame).apply(lambda x: pd.Series(x.dropna().values))
# Post processing (optional)
res.columns = [':' + ndf[1].unique()[i] + ':' for i in range(ndf[1].nunique())]
res.index = range(1, res.shape[0] + 1)
res.index.name = 'Group'
res
Group  :1:  :2:  :3:  :4:
1       A    B    C
2       D    E    F    G
3       H         I    J
Another way to do this:
# read the file
with open("t.txt") as f:
    content = f.readlines()

# Create a dictionary: read each line from the file, keeping the column
# names (e.g. ':1:') as keys and the row values (e.g. 'A') as values.
my_dict = {}
for v in content:
    key = v[0:3]            # take the label ':1:'
    value = v.rstrip()[3:]  # take the value 'A'
    my_dict.setdefault(key, []).append(value)

# convert the dictionary to a dataframe and transpose it
df = pd.DataFrame.from_dict(my_dict, orient='index').transpose()
df
The output will look like this (note the values are packed upward per key, so they do not align with the original :1: groups):
  :1:   :2:  :3:   :4:
0   A     B    C     G
1   D     E    F     J
2   H  None    I  None

how to remove 0's from a string without impacting other cells in pandas data frame?

I have a data frame which has "0's" and looks as below:
df = pd.DataFrame({
    'WARNING': ['4402,43527,0,7628,54337', 4402, 0, 0, '0,1234,56437,76252', 0, 3602],
    'FAILED': [0, 0, '5555,6753,0', '4572,0,8764,8753', 9876, 0, '0,4579,7514']
})
I want to remove the zeroes from the strings where there are multiple values, such that the resulting df looks like this:
df = pd.DataFrame({
    'WARNING': ['4402,43527,7628,54337', 4402, 0, 0, '1234,56437,76252', 0, 3602],
    'FAILED': [0, 0, '5555,6753', '4572,8764,8753', 9876, 0, '4579,7514']
})
However the ones which have individual 0's in a cell should remain intact. How do I achieve this?
df = pd.DataFrame({
    'WARNING': ['0,0786,1230,01234,0', 4402, 0, 0, '0,1234,56437,76252', 0, 3602],
    'FAILED': [0, 0, '5555,6753,0', '4572,0,8764,8753', 9876, 0, '0,4579,7514']
})
df.apply(lambda x: x.str.strip('0,|,0')).replace(",0,", ",")
Output:
WARNING FAILED
0 786,1230,01234 NaN
1 NaN NaN
2 NaN 5555,6753
3 NaN 4572,0,8764,8753
4 1234,56437,76252 NaN
5 NaN NaN
6 NaN 4579,7514
I would solve it with a list comprehension.
In [1]: df.apply(lambda col: col.astype(str).apply(lambda x: ','.join([y for y in x.split(',') if y != '0']) if ',' in x else x), axis=0)
Out[1]: 
FAILED WARNING
0 0 4402,43527,7628,54337
1 0 4402
2 5555,6753 0
3 4572,8764,8753 0
4 9876 1234,56437,76252
5 0 0
6 4579,7514 3602
Breaking it down:
Iterate over all columns with df.apply(lambda col: ..., axis=0)
Convert each column's values to string with col.astype(str)
Apply a function to each "cell" of col with .apply(lambda x: ...)
If ',' is not in x, the lambda returns the original value of x unchanged
If ',' is in x, it splits x by ',', which creates a list of items y
It keeps only the items where y != '0'
It joins everything at the end with a ','.join(...)
You can use a regex with a negative lookbehind to replace '0,' only if it is not preceded by another digit.
import re
df.applymap(lambda x: re.sub(r'(?<![0-9])0,', '', str(x)))
WARNING FAILED
0 4402,43527,7628,54337 0
1 4402 0
2 0 5555,6753,0
3 0 4572,8764,8753
4 1234,56437,76252 9876
5 0 0
6 3602 4579,7514
For the test case W-B points out:
s = '0,0999,9990,999'
re.sub(r'(?<![0-9])0,', '', s)
#'0999,9990,999'
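Note that this pattern only removes a 0 followed by a comma, so a trailing ',0' (as in '5555,6753,0' above) survives. An alternation covers the trailing case as well (a sketch; it handles the cells in these examples, lone-0 cells included):
import re

# drop a standalone 0 item at the start or middle ('0,') or at the end (',0'),
# while leaving cells that are just '0' untouched
df.applymap(lambda x: re.sub(r'(?<!\d)0,|,0(?!\d)', '', str(x)))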

How to merge strings pandas df

I am trying to merge specific strings in a pandas df. The df below is just an example. The values in my df will differ, but the basic rules will apply. I basically want to merge all rows until there's a 4-letter string.
Whilst the 4-letter string in this df is always Excl, my df will contain numerous 4-letter strings.
import pandas as pd
d = ({
    'A': ['Include', 'Inclu', 'Incl', 'Inc'],
    'B': ['Excl', 'de', 'ude', 'l'],
    'C': ['X', 'Excl', 'Excl', 'ude'],
    'D': ['', 'Y', 'ABC', 'Excl'],
})
df = pd.DataFrame(data=d)
Out:
         A     B     C     D
0  Include  Excl     X
1    Inclu    de  Excl     Y
2     Incl   ude  Excl   ABC
3      Inc     l   ude  Excl
Intended Output:
         A     B     C     D
0  Include  Excl     X
1  Include        Excl     Y
2  Include        Excl   ABC
3  Include              Excl
So row 0 stays the same, as col B has 4 letters. Row 1 merges cols A and B, as col C has 4 letters. Row 2 works the same as row 1. Row 3 merges cols A, B and C, as col D has 4 letters.
I have tried to do this manually by merging all columns and then going back and removing unwanted values.
df["Com"] = df["A"].map(str) + df["B"] + df["C"]
But I would have to manually go through each row and remove different lengths of letters.
The above df is just an example; the common thread is that I need to merge everything before the 4-letter string.
You could do something like the following: build a mask of the cells that come before the first 4-letter value in each row (searching from column B), concatenate those cells onto column A, and then blank them out:
mask = (df.iloc[:, 1:].applymap(len) == 4).cumsum(1) == 0
df.A = df.A + df.iloc[:, 1:][mask].apply(lambda x: x.str.cat(), 1)
df.iloc[:, 1:] = df.iloc[:, 1:][~mask].fillna('')
Try this (sorry for the clumsy solution, I'll try to improve the performance):
import numpy as np

temp = df.eq('Excl').shift(-1, axis=1)
df['end'] = temp.apply(lambda x: x.argmax(), axis=1)
res = df.apply(lambda x: x.loc[:x['end']].sum(), axis=1)
mask = temp.replace(False, np.NaN).fillna(method='ffill').fillna(False).astype(bool)
del df['end']
df[:] = np.where(mask, '', df)
df['A'] = res
print(df)
Output:
         A     B     C     D
0  Include  Excl     X
1  Include        Excl     Y
2  Include        Excl   ABC
3  Include              Excl
Improved solution:
res= df.apply(lambda x:x.loc[:x.eq('Excl').shift(-1).argmax()].sum(),axis=1)
mask=df.eq('Excl').shift(-1,axis=1).replace(False,np.NaN).fillna(method='ffill').fillna(False).astype(bool)
df[:]=np.where(mask,'',df)
df['A']=res
A further simplified solution:
t=df.eq('Excl').shift(-1,axis=1)
res= df.apply(lambda x:x.loc[:x.eq('Excl').shift(-1).argmax()].sum(),axis=1)
df[:]=np.where(t.fillna(0).astype(int).cumsum() >= 1,'',df)
df['A']=res
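One caveat, flagged as an assumption about newer pandas versions: Series.argmax now returns an integer position rather than an index label, so the label-based slices above would fail on a recent install. Using idxmax (and fill_value to keep the shifted mask boolean) is the label-safe spelling, a sketch:
# idxmax returns the column label of the first True
res = df.apply(lambda x: x.loc[:x.eq('Excl').shift(-1, fill_value=False).idxmax()].sum(),
               axis=1)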
I am giving you a rough approach. Here, we find the location of the 'Excl' and merge the column values up to it, to obtain the desired output.
ls = []
for i in range(len(df)):
    end = df.loc[i, :].index[(df.loc[i, :] == 'Excl')][0]
    ls.append(''.join(df.loc[i, :end].replace({'Excl': ''}).values))
df['A'] = ls
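One more note: the last three answers hard-code 'Excl', but the question says the stop marker can be any 4-letter string. Swapping the equality test for a length test (as the first answer's mask already does) generalizes them; for example, the rough approach becomes (a sketch, assuming at least one 4-letter value per row):
ls = []
for i in range(len(df)):
    row = df.loc[i, :]
    # first column whose value is exactly 4 characters long
    end = row.index[row.str.len() == 4][0]
    # merge everything before that column into one string
    ls.append(''.join(row.loc[:end].iloc[:-1]))
df['A'] = ls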

Replacing empty values with NaN in object/categorical variables

So I've searched SO for this and found a bunch of useful threads on how to replace empty values with NaN. However I can't get any of them to work on my DataFrame.
I've used:
df.replace('', np.NaN)
df3 = df.applymap(lambda x: np.nan if x == '' else x)
and even:
df.iloc[:,86:350] = df.iloc[:,86:350].apply(lambda x: x.str.strip()).replace('', np.nan)
and the code runs fine without error, but when I look in my dataframe I still have b'' values instead of NaN. Any ideas on what I am missing?
I'm sorry for not giving code to reproduce this; I don't know how to, as I suspect it's specific to my dataframe, which I imported from SPSS (these values were string variables in SPSS, if that helps).
You were close with your second try:
df = df.applymap(lambda x: np.NaN if not x else x)
To show that both '' and b'' are falsy, so not x evaluates to True for both:
l = ['', b'']
for x in l:
    if x:
        print('Not empty')
    else:
        print('Empty')
>>> Empty
>>> Empty
Sample:
from pandas import DataFrame
from numpy import NaN
df = DataFrame([[1,2,''], ['',b'',3], [4, 5, b'']])
print (df)
# Output
   0    1    2
0  1    2
1     b''    3
2  4    5  b''
df2 = df.applymap(lambda x: NaN if not x else x)
print (df2)
# Output
     0    1    2
0    1    2  NaN
1  NaN  NaN    3
2    4    5  NaN
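One caveat with not x: it is also True for 0, 0.0 and False, so legitimate zeros would be replaced too. If the frame can contain such values, testing the two empties explicitly is safer (a sketch):
# only replace empty str/bytes, leaving other falsy values like 0 intact
df2 = df.applymap(lambda x: NaN if x in ('', b'') else x)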
