split multiple columns in pandas dataframe by delimiter - python

I have survey data that, annoyingly, returns multiple-choice answers in the following way. It's in an Excel sheet with about 60 columns, each holding anywhere from one to several responses separated by /. This is what I have so far. Is there a quicker way to do this without repeating it for each individual column?
import pandas as pd
data = {'q1': ['one', 'two', 'three'],
        'q2': ['one/two/three', 'a/b/c', 'd/e/f'],
        'q3': ['a/b/c', 'd/e/f', 'g/h/i']}
df = pd.DataFrame(data)
# split each multi-answer column into one column per answer
df[['q2a', 'q2b', 'q2c']] = df['q2'].str.split('/', expand=True)
df[['q3a', 'q3b', 'q3c']] = df['q3'].str.split('/', expand=True)
# drop the original, unsplit columns
clean_df = df.drop(columns=['q2', 'q3'])

We can use list comprehension with add_prefix, then we use pd.concat to concatenate everything to your final df:
splits = [df[col].str.split(pat='/', expand=True).add_prefix(col) for col in df.columns]
clean_df = pd.concat(splits, axis=1)
q10 q20 q21 q22 q30 q31 q32
0 one one two three a b c
1 two a b c d e f
2 three d e f g h i
If you actually want your column names to be suffixed by a letter, you can do the following with string.ascii_lowercase:
from string import ascii_lowercase
dfs = []
for col in df.columns:
    d = df[col].str.split('/', expand=True)
    c = d.shape[1]
    d.columns = [col + l for l in ascii_lowercase[:c]]
    dfs.append(d)
clean_df = pd.concat(dfs, axis=1)
q1a q2a q2b q2c q3a q3b q3c
0 one one two three a b c
1 two a b c d e f
2 three d e f g h i
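With around 60 columns, the loop above could also be wrapped in a small helper that leaves columns without the delimiter untouched, so q1 keeps its name. This is only a sketch: split_delimited_columns is a hypothetical name, and it assumes every column holds strings and that no cell has more than 26 parts.

```python
from string import ascii_lowercase

import pandas as pd


def split_delimited_columns(df, sep="/"):
    """Hypothetical helper: split each column on `sep` into lettered columns.

    Assumes all columns contain strings and no cell has more than 26 parts.
    """
    pieces = []
    for col in df.columns:
        parts = df[col].str.split(sep, expand=True)
        if parts.shape[1] == 1:
            # Nothing to split: keep the original column under its own name.
            pieces.append(df[[col]])
        else:
            parts.columns = [col + letter for letter in ascii_lowercase[:parts.shape[1]]]
            pieces.append(parts)
    return pd.concat(pieces, axis=1)


df = pd.DataFrame({'q1': ['one', 'two', 'three'],
                   'q2': ['one/two/three', 'a/b/c', 'd/e/f'],
                   'q3': ['a/b/c', 'd/e/f', 'g/h/i']})
clean_df = split_delimited_columns(df)
print(clean_df)
```

Rows with fewer answers than the widest row in a column end up with missing values in the trailing columns.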

You can create a dict d that transforms numbers to letters. Then loop through the columns and dynamically change their names:
input:
import pandas as pd
df = pd.DataFrame({'q1': ['one', 'two', 'three'],
'q2' : ['one/two/three', 'a/b/c', 'd/e/f'],
'q3' : ['a/b/c', 'd/e/f','g/h/i']})
code:
from string import ascii_lowercase
# map the split positions 0, 1, 2, ... to the letters a, b, c, ...
d = dict(enumerate(ascii_lowercase))
cols = df.columns[1:]
for col in cols:
    df1 = df[col].str.split('/', expand=True)
    df1.columns = df1.columns.map(d)
    df1 = df1.add_prefix(col)
    df = pd.concat([df, df1], axis=1)
# drop the original, unsplit columns once all of them are processed
df = df.drop(cols, axis=1)
df
output:
Out[1]:
q1 q2a q2b q2c q3a q3b q3c
0 one one two three a b c
1 two a b c d e f
2 three d e f g h i

Related

How to replace data in one pandas df by the data of another one?

I want to replace some rows of some columns in a bigger pandas DataFrame with data from a smaller one. The column names are the same in both.
I tried combine_first, but it only updates the null values.
For example, let's say df1.shape is (100, 25) and df2.shape is (10, 5).
df1
A B C D E F G ...Z Y Z
1 abc 10.20 0 pd.NaT
df2
A B C D E
1 abc 15.20 1 10
Now after replacing df1 should look like:
A B C D E F G ...Z Y Z
1 abc 15.20 1 10 ...
To replace values in df1 the condition is where df1.A = df2.A and df1.B = df2.B
How can it be achieved in the most pythonic way? Any help will be appreciated.
I'm not sure I fully understood your question, but does this solve your problem?
df1 = pd.DataFrame(data={'A':[1],'B':[2],'C':[3],'D':[4]})
df2 = pd.DataFrame(data={'A':[1],'B':[2],'C':[5],'D':[6]})
new_df=pd.concat([df1,df2]).drop_duplicates(['A','B'],keep='last')
print(new_df)
output:
A B C D
0 1 2 5 6
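You could use a MultiIndex.
First, let us create DataFrames like the ones you are working with:
from string import ascii_uppercase
import numpy as np
import pandas as pd
cols = pd.Index(list(ascii_uppercase))
# float values, so the fractional replacement values assigned later fit the column dtype
vals = np.arange(100*len(cols), dtype=float).reshape(100, len(cols))
df = pd.DataFrame(vals, columns=cols)
df1 = pd.DataFrame(vals[:10,:5], columns=cols[:5])
Then move A and B into the index, assign the matching rows and columns, and restore the index: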
df = df.set_index(["A","B"])
df1 = df1.set_index(["A","B"])*1.5 # multiply just to make the other values different
df.loc[df1.index, df1.columns] = df1
df = df.reset_index()
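As an alternative to the .loc assignment, DataFrame.update can do the aligned overwrite once both frames share the (A, B) index. The following is a minimal sketch on made-up data (the names big and small and all values are illustrative only). It assumes the (A, B) pairs are unique in both frames, and note that update skips missing values in the smaller frame instead of copying them.

```python
import pandas as pd

# Illustrative data: `big` has extra columns, `small` covers a subset of rows.
big = pd.DataFrame({"A": [1, 2, 3],
                    "B": ["abc", "def", "ghi"],
                    "C": [10.2, 20.5, 30.1],
                    "D": [0.0, 0.0, 0.0],
                    "F": ["x", "y", "z"]})
small = pd.DataFrame({"A": [1, 3],
                      "B": ["abc", "ghi"],
                      "C": [15.2, 35.0],
                      "D": [1.0, 3.0]})

# Align both frames on the key columns, then overwrite in place.
big = big.set_index(["A", "B"])
big.update(small.set_index(["A", "B"]))
big = big.reset_index()
print(big)
```

Only the rows and columns present in the smaller frame change; column F and the row with A == 2 are left as they were.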

Python: Pivot Table/group by specific conditions

I'm trying to change the structure of my data from a text file (.txt), which looks like this:
:1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J
I would like to transform it into this format (like a pivot table in Excel, where each column name is the text between the ":" characters and each group always starts with :1:):
Group  :1:  :2:  :3:  :4:
1      A    B    C
2      D    E    F    G
3      H         I    J
Does anyone have any idea? Thanks in advance.
First create a DataFrame with read_csv and header=None, because the file has no header:
import pandas as pd
from io import StringIO
temp=u""":1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), header=None)
print (df)
0
0 :1:A
1 :2:B
2 :3:C
3 :1:D
4 :2:E
5 :3:F
6 :4:G
7 :1:H
8 :3:I
9 :4:J
Extract the original column with DataFrame.pop, remove the leading : with Series.str.strip, and split the values into 2 new columns with Series.str.split. Then create the groups by comparing with the string '1' using Series.eq (the method form of ==) followed by Series.cumsum, build a MultiIndex with DataFrame.set_index, and finally reshape with Series.unstack:
df[['a','b']] = df.pop(0).str.strip(':').str.split(':', expand=True)
df1 = df.set_index([df['a'].eq('1').cumsum(), 'a'])['b'].unstack(fill_value='')
print (df1)
a  1  2  3  4
a
1  A  B  C
2  D  E  F  G
3  H     I  J
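The same grouping idea (start a new group at every :1:) can also be written with DataFrame.pivot. This is a sketch rather than a drop-in replacement: the names raw, parts and wide are made up for illustration, and it assumes every line has the form :key:value with no key repeated inside a group.

```python
from io import StringIO

import pandas as pd

raw = ":1:A\n:2:B\n:3:C\n:1:D\n:2:E\n:3:F\n:4:G\n:1:H\n:3:I\n:4:J"

# Read the single unnamed column, then pull out key and value with a regex.
lines = pd.read_csv(StringIO(raw), header=None)[0]
parts = lines.str.extract(r"^:(?P<key>[^:]+):(?P<value>.*)$")

# A new group starts every time the key is '1'.
parts["Group"] = parts["key"].eq("1").cumsum()

# One row per group, one column per key; missing combinations become ''.
wide = parts.pivot(index="Group", columns="key", values="value").fillna("")
wide.columns = [f":{key}:" for key in wide.columns]
print(wide)
```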
Use:
# Reading text file (assuming stored in CSV format, you can also use pd.read_fwf)
df = pd.read_csv('SO.csv', header=None)
# Splitting data into two columns
ndf = df.iloc[:, 0].str.split(':', expand=True).iloc[:, 1:]
# Grouping and creating a dataframe. Later dropping NaNs
res = ndf.groupby(1)[2].apply(pd.DataFrame).apply(lambda x: pd.Series(x.dropna().values))
# Post processing (optional)
res.columns = [':' + ndf[1].unique()[i] + ':' for i in range(ndf[1].nunique())]
res.index = range(1, res.shape[0] + 1)
res.index.name = 'Group'  # set the name after replacing the index, otherwise it is lost
res
Group :1: :2: :3: :4:
1 A B C
2 D E F G
3 H I J
Another way to do this:
import pandas as pd
#read the file
with open("t.txt") as f:
    content = f.readlines()
#Create a dictionary that maps each column name (e.g. ':1:') to the list of its values (e.g. 'A'), in file order.
my_dict={}
for v in content:
    key = v[0:3] # take the column name, e.g. ':1:'
    value = v[3] # take the value, e.g. 'A'
    my_dict.setdefault(key,[]).append(value)
#convert dictionary to dataframe and transpose it
df = pd.DataFrame.from_dict(my_dict,orient='index').transpose()
df
The output will look like this (note that the values are packed column by column, so they are not aligned to the groups):
:1: :2: :3: :4:
0 A B C G
1 D E F J
2 H None I None
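To keep the values aligned with their groups in a plain-Python approach, one option is to start a new row every time a :1: line is seen. A minimal sketch, assuming the lines have already been read (the list below stands in for the file contents) and that each line has the form :key:value:

```python
import pandas as pd

# Stand-in for the lines read from the text file.
lines = [":1:A", ":2:B", ":3:C", ":1:D", ":2:E",
         ":3:F", ":4:G", ":1:H", ":3:I", ":4:J"]

rows = []
for line in lines:
    # ':1:A' splits into ['', '1', 'A'].
    _, key, value = line.strip().split(":", 2)
    if key == "1" or not rows:
        rows.append({})  # every ':1:' line opens a new group
    rows[-1][f":{key}:"] = value

df = pd.DataFrame(rows).fillna("")
df = df.reindex(columns=sorted(df.columns))  # fine for single-digit keys
df.index = pd.RangeIndex(1, len(df) + 1, name="Group")
print(df)
```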

ffill not filling data in pandas dataframe

I have a dataframe like this :
A B C E D
---------------
0 a r g g
1 x
2 x f f r
3 t
3 y
I am trying to forward fill using ffill, but it is not working:
cols = df.columns[:4].tolist()
df[cols] = df[cols].ffill()
I also tried :
df[cols] = df[cols].fillna(method='ffill')
But it is not getting filled.
Is it the empty columns in data causing this issue?
Data is mocked. Exact data is different (contains strings,numbers and empty columns)
desired o/p:
A B C E D
---------------
0 a r g g
1 a r g x
2 x f f r
3 x f f t
3 x f f y
Replace empty values in subset of columns by NaN:
df[cols] = df[cols].replace('', np.nan).ffill()
You should replace the empty strings with np.nan first:
df = df.replace('', np.nan)
df[cols] = df[cols].ffill()
Replace '' with np.nan first:
df[df == ''] = np.nan
df[cols] = df[cols].ffill()
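The reason ffill appears to do nothing is that an empty string is an ordinary value, not a missing one. The sketch below shows this on made-up data. It also treats whitespace-only cells as empty via a regex, which is an assumption about the real data and goes slightly beyond the answers above.

```python
import numpy as np
import pandas as pd

# Made-up data with empty and whitespace-only cells.
df = pd.DataFrame({"A": ["a", "", "x", " "],
                   "B": ["r", "", "f", ""]})

# ffill only fills NaN/None, so the frame comes back unchanged.
unchanged = df.ffill()

# Turn empty or whitespace-only strings into NaN first, then forward fill.
cleaned = df.replace(r"^\s*$", np.nan, regex=True).ffill()
print(cleaned)
```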

Pandas - Interleave / Zip two DataFrames by row

Suppose I have two dataframes:
>> df1
0 1 2
0 a b c
1 d e f
>> df2
0 1 2
0 A B C
1 D E F
How can I interleave the rows? i.e. get this:
>> interleaved_df
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
(Note my real DFs have identical columns, but not the same number of rows).
What I've tried
inspired by this question (very similar, but asks on columns):
import pandas as pd
from itertools import chain, zip_longest
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
concat_df = pd.concat([df1,df2])
new_index = chain.from_iterable(zip_longest(df1.index, df2.index))
# new_index now holds the interleaved row indices
interleaved_df = concat_df.reindex(new_index)
ValueError: cannot reindex from a duplicate axis
The last call fails because df1 and df2 have some identical index values (which is also the case with my real DFs).
Any ideas?
You can sort the index after concatenating and then reset it, i.e.:
import pandas as pd
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
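# a stable sort keeps each df1 row ahead of the df2 row that has the same index value
concat_df = pd.concat([df1,df2]).sort_index(kind='mergesort').reset_index(drop=True)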
Output :
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
EDIT (OmerB): To interleave by row position regardless of the index values, while keeping the original index:
import pandas as pd
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']]).reset_index()
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']]).reset_index()
concat_df = pd.concat([df1,df2]).sort_index().set_index('index')
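When the frames have different numbers of rows, as in the question, the same idea can be made explicit by sorting on a row position plus a source number instead of relying on the index. This is a sketch only: interleave_rows and the helper columns _pos and _src are hypothetical names, and it assumes the frames share the same columns and do not already use those two names.

```python
import numpy as np
import pandas as pd


def interleave_rows(dfs):
    """Hypothetical helper: interleave rows of several frames by position."""
    tagged = [
        # _pos is the row position inside its frame, _src the frame number.
        df.assign(_pos=np.arange(len(df)), _src=i)
        for i, df in enumerate(dfs)
    ]
    out = pd.concat(tagged).sort_values(["_pos", "_src"])
    return out.drop(columns=["_pos", "_src"]).reset_index(drop=True)


df1 = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])
df2 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']])
interleaved_df = interleave_rows([df1, df2])
print(interleaved_df)
```

Leftover rows of the longer frame simply follow at the end, in their original order.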
Use toolz.interleave
In [1024]: from toolz import interleave
In [1025]: pd.DataFrame(interleave([df1.values, df2.values]))
Out[1025]:
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
Here's an extension of @Bharath's answer that can be applied to DataFrames with user-defined indexes without losing them, using pd.MultiIndex.
Define Dataframes with the full set of column/ index labels and names:
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])
df1.columns.name = 'cols'
df1.index.name = 'rows'
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])
df2.columns.name = 'cols'
df2.index.name = 'rows'
Add DataFrame ID to MultiIndex:
df1.index = pd.MultiIndex.from_product([[1], df1.index], names=["df_id", df1.index.name])
df2.index = pd.MultiIndex.from_product([[2], df2.index], names=["df_id", df2.index.name])
Then use @Bharath's concat() and sort_index():
data = pd.concat([df1, df2], axis=0, sort=True)
data.sort_index(axis=0, level=data.index.names[::-1], inplace=True)
Output:
cols col_a col_b col_c
df_id rows
1 one a b c
2 one A B C
1 two d e f
2 two D E F
You could also preallocate a new DataFrame, and then fill it using a slice.
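import numpy as np
import pandas as pd
def interleave(dfs):
    # preallocate one empty array per column dtype of the first DataFrame
    data = np.transpose(np.array([np.empty(dfs[0].shape[0]*len(dfs), dtype=dt) for dt in dfs[0].dtypes]))
    out = pd.DataFrame(data, columns=dfs[0].columns)
    # fill every len(dfs)-th row, starting at offset ix, from the ix-th DataFrame
    for ix, df in enumerate(dfs):
        out.iloc[ix::len(dfs),:] = df.values
    return out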
The preallocation code is taken from this question.
While there's a chance it could outperform the index method for certain data types / sizes, it won't behave gracefully if the DataFrames have different sizes.
Note - for ~200000 rows with 20 columns of mixed string, integer and floating types, the index method is around 5x faster.
You can try this way :
In [31]: import pandas as pd
...: from itertools import chain, zip_longest
...:
...: df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
...: df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
In [32]: concat_df = pd.concat([df1,df2]).sort_index()
...:
In [33]: interleaved_df = concat_df.reset_index(drop=1)
In [34]: interleaved_df
Out[34]:
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F

Apply a function to a specific row using the index value

I have the following table:
import pandas as pd
import numpy as np
#Dataframe with random numbers and with an a,b,c,d,e index
df = pd.DataFrame(np.random.randn(5,5), index = ['a','b','c','d','e'])
#Now i name the columns the same
df.columns = ['a','b','c','d','e']
#Resulting dataframe:
a b c d e
a 2.214229 1.621352 0.083113 0.818191 -0.900224
b -0.612560 -0.028039 -0.392266 0.439679 1.596251
c 1.378928 -0.309353 -0.651817 1.499517 0.515772
d -0.061682 1.141558 -0.811471 0.242874 0.345159
e -0.714760 -0.172082 0.205638 0.220528 1.182013
How can I apply a function to a specific row of the DataFrame, selected by its index label? I want to round the numbers in every column for the row whose index is "c".
#Numbers to round to 2 decimals:
a b c d e
c 1.378928 -0.309353 -0.651817 1.499517 0.515772
What is the best way to do this?
For label based indexing use loc:
In [22]:
df = pd.DataFrame(np.random.randn(5,5), index = ['a','b','c','d','e'])
#Now i name the columns the same
df.columns = ['a','b','c','d','e']
df
Out[22]:
a b c d e
a -0.051366 1.856373 -0.224172 -0.005668 0.986908
b -1.121298 -1.018863 2.328420 -0.117501 -0.231463
c 2.241418 -0.838571 -0.551222 0.662890 -1.234716
d 0.275063 0.295788 0.689171 0.227742 0.091928
e 0.269730 0.326156 0.210443 -0.494634 -0.489698
In [23]:
df.loc['c'] = np.round(df.loc['c'],decimals=2)
df
Out[23]:
a b c d e
a -0.051366 1.856373 -0.224172 -0.005668 0.986908
b -1.121298 -1.018863 2.328420 -0.117501 -0.231463
c 2.240000 -0.840000 -0.550000 0.660000 -1.230000
d 0.275063 0.295788 0.689171 0.227742 0.091928
e 0.269730 0.326156 0.210443 -0.494634 -0.489698
To round values of column c:
df['c'].round(decimals=2)
To round values of row c:
df.loc['c'].round(decimals=2)
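Note that round returns a new object, so the result has to be assigned back for the DataFrame itself to change. A minimal sketch, using a seeded random generator (an assumption made here so the example is reproducible):

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e']
rng = np.random.default_rng(0)  # seeded only to make the example repeatable
df = pd.DataFrame(rng.standard_normal((5, 5)), index=labels, columns=labels)
before = df.copy()

# Round row 'c' to 2 decimals and write the result back into the frame.
df.loc['c'] = df.loc['c'].round(2)
print(df)
```

The other rows keep their full precision; the same pattern with df['c'] = df['c'].round(2) would round the column instead.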
