Insert space in Pandas Data Frame Column String for each character - python

I have a data frame like the one below. I need to insert a space between each letter of the strings in the same column.
import pandas as pd
df = pd.DataFrame({'sequence': ['ABCAD', 'DBAACR']})
df
Expected Output
sequence
A B C A D
D B A A C R

import pandas as pd
df = pd.DataFrame({'sequence':['ABCAD','DBAACR']})
A = []
for i in df['sequence']:
    a = (" ".join(i))
    A.append(a)
df = pd.DataFrame({'sequence':A})
df
Running the cell above returns the following pandas DataFrame:
sequence
0 A B C A D
1 D B A A C R

pd.DataFrame({'sequence':[' '.join('ABCAD'),' '.join('DBAACR')]})

You can use apply with a lambda function to process a column in a pandas DataFrame:
df.sequence.apply(lambda x: ' '.join(list(x)))
Output:
0 A B C A D
1 D B A A C R
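The same result can also be sketched with the vectorized string accessor, which avoids the explicit loop or lambda. This assumes every value in the column is a plain string; the column name `spaced` is only for illustration and is not part of the question.

```python
import pandas as pd

df = pd.DataFrame({'sequence': ['ABCAD', 'DBAACR']})

# Series.str.join treats each string as a sequence of characters
# and joins those characters with the given separator.
df['spaced'] = df['sequence'].str.join(' ')

# Equivalent element-wise version using the built-in str.join.
same = df['sequence'].map(' '.join)

print(df)
```

Assign the result back to `df['sequence']` instead if the original column should be overwritten.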

Related

Concatenate two columns

I have two text columns, A and B. I want to take the first non-empty string, or, if both A and B have values, take the value from A. C is the column I'm trying to create:
import pandas as pd
cols = ['A','B']
data = [['data','data'],
['','data'],
['',''],
['data1','data2']]
df = pd.DataFrame.from_records(data=data, columns=cols)
       A      B
0   data   data
1          data
2
3  data1  data2
My attempt:
df['C'] = df[cols].apply(lambda row: sorted([val if val else '' for val in row], reverse=True)[0], axis=1) #Reverse sort to avoid picking an empty string
       A      B      C
0   data   data   data
1          data   data
2
3  data1  data2  data2  #I want data1 here
Expected output:
       A      B      C
0   data   data   data
1          data   data
2
3  data1  data2  data1
I think I want the pandas equivalent of SQL coalesce.
You can also use numpy.where:
In [1022]: import numpy as np
In [1023]: df['C'] = np.where(df['A'].eq(''), df['B'], df['A'])
In [1024]: df
Out[1024]:
       A      B      C
0   data   data   data
1          data   data
2
3  data1  data2  data1
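Let's try idxmax + lookup (DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this only runs on older versions):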
df['C'] = df.lookup(df.index, df.ne('').idxmax(1))
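A rough equivalent that avoids DataFrame.lookup can be sketched with positional NumPy indexing. The sample frame is rebuilt from the question; the helper names `first_col` and `col_pos` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

cols = ['A', 'B']
data = [['data', 'data'], ['', 'data'], ['', ''], ['data1', 'data2']]
df = pd.DataFrame(data, columns=cols)

# Label of the first non-empty column in each row. idxmax falls back to
# the first column ('A') when every value in the row is empty.
first_col = df[cols].ne('').idxmax(axis=1)

# Translate the labels into integer positions, then pick one value per row.
col_pos = df[cols].columns.get_indexer(first_col.to_numpy())
df['C'] = df[cols].to_numpy()[np.arange(len(df)), col_pos]

print(df)
```

Restricting the selection to `df[cols]` keeps the result stable even if column C already exists in the frame.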
Alternatively you can use Series.where:
df['C'] = df['A'].where(lambda x: x.ne(''), df['B'])
       A      B      C
0   data   data   data
1          data   data
2
3  data1  data2  data1
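For more than two columns, a coalesce-like pattern can be sketched by treating empty strings as missing and back-filling across the row. This is one possible approach under the assumption that '' is the only "empty" marker; the intermediate name `masked` is illustrative.

```python
import pandas as pd

cols = ['A', 'B']
data = [['data', 'data'], ['', 'data'], ['', ''], ['data1', 'data2']]
df = pd.DataFrame(data, columns=cols)

# Hide empty strings as missing values so they can be skipped.
masked = df[cols].mask(df[cols].eq(''))

# Back-fill across columns: the first column then holds the first
# non-missing value of each row. Rows with no value at all get '' again.
df['C'] = masked.bfill(axis=1).iloc[:, 0].fillna('')

print(df)
```

Adding more names to `cols` extends the same priority order from left to right.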

Pandas implode Dataframe with values separated by char

I was wondering what the best approach is to implode a DataFrame, joining the values of each column with a given character.
For example, imagine this dataframe:
A B C D E
1 z a q p
2 x s w l
3 c d e k
4 v f r m
5 b g t n
And we want to implode by #
A B C D E
1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
Maybe create a copy of the original DataFrame and process it column by column with pandas Series.str.cat?
Thanks in advance!
Use DataFrame.agg with join, then convert the resulting Series to a one-row DataFrame with Series.to_frame and transpose it with DataFrame.T:
df = df.astype(str).agg('#'.join).to_frame().T
print (df)
A B C D E
0 1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
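The steps above can be followed one at a time in a self-contained sketch. The sample frame is rebuilt from the question; the names `joined`, `imploded` and `explicit` are illustrative, and the dictionary-comprehension variant is just an alternative spelling of the same idea.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': list('zxcvb'),
    'C': list('asdfg'),
    'D': list('qwert'),
    'E': list('plkmn'),
})

# Step 1: cast everything to str and join each column.
# The result is a Series indexed by column name.
joined = df.astype(str).agg('#'.join)

# Step 2: turn that Series into a single-row DataFrame.
imploded = joined.to_frame().T

# The same result spelled out explicitly, one column at a time.
explicit = pd.DataFrame(
    [{col: '#'.join(df[col].astype(str)) for col in df.columns}]
)

print(imploded)
```

The cast to str matters because str.join raises a TypeError on the integer values in column A.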

Python: Pivot Table/group by specific conditions

I'm trying to change the structure of my data from a text file (.txt), which looks like this:
:1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J
And I would like to transform it into this format (like a pivot table in Excel, where the column name is the character between the ":" and each group always starts with :1:):
Group  :1:  :2:  :3:  :4:
1      A    B    C
2      D    E    F    G
3      H         I    J
Does anyone have any idea? Thanks in advance.
First create a DataFrame with read_csv and header=None, because the file has no header:
import pandas as pd
from io import StringIO
temp=u""":1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
#(pd.compat.StringIO no longer exists in recent pandas; io.StringIO is the standard-library equivalent)
df = pd.read_csv(StringIO(temp), header=None)
print (df)
0
0 :1:A
1 :2:B
2 :3:C
3 :1:D
4 :2:E
5 :3:F
6 :4:G
7 :1:H
8 :3:I
9 :4:J
Extract the original column with DataFrame.pop, then remove the leading : with Series.str.strip and split the values into 2 new columns with Series.str.split. Then create groups by comparing with the string '1' using Series.eq (equivalent to ==) followed by Series.cumsum, create a MultiIndex with DataFrame.set_index, and finally reshape with Series.unstack:
df[['a','b']] = df.pop(0).str.strip(':').str.split(':', expand=True)
df1 = df.set_index([df['a'].eq('1').cumsum(), 'a'])['b'].unstack(fill_value='')
print (df1)
a  1  2  3  4
a
1  A  B  C
2  D  E  F  G
3  H     I  J
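The same pipeline can be sketched step by step with named intermediates, which makes the grouping logic easier to inspect. The column names `raw`, `key`, `value` and the variable names are illustrative assumptions; the data is the sample from the question.

```python
from io import StringIO

import pandas as pd

raw = ":1:A\n:2:B\n:3:C\n:1:D\n:2:E\n:3:F\n:4:G\n:1:H\n:3:I\n:4:J"
df = pd.read_csv(StringIO(raw), header=None, names=['raw'])

# ':1:A' -> '1:A' -> ['1', 'A']
parts = df['raw'].str.strip(':').str.split(':', expand=True)
parts.columns = ['key', 'value']

# Every row whose key is '1' starts a new group; the running count of
# those rows labels the group each line belongs to.
group = parts['key'].eq('1').cumsum().rename('Group')

# Index by (group, key) and move the key level into the columns.
wide = parts.set_index([group, 'key'])['value'].unstack(fill_value='')

# Restore the ':n:' style column names from the question.
wide.columns = [f':{c}:' for c in wide.columns]

print(wide)
```

Because the group label comes from the position of each :1: line, a group with a missing key (such as :2: in the third group) simply gets an empty cell.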
Use:
# Reading text file (assuming stored in CSV format, you can also use pd.read_fwf)
df = pd.read_csv('SO.csv', header=None)
# Splitting data into two columns
ndf = df.iloc[:, 0].str.split(':', expand=True).iloc[:, 1:]
# Grouping and creating a dataframe. Later dropping NaNs
res = ndf.groupby(1)[2].apply(pd.DataFrame).apply(lambda x: pd.Series(x.dropna().values))
# Post processing (optional)
res.columns = [':' + ndf[1].unique()[i] + ':' for i in range(ndf[1].nunique())]
res.index.name = 'Group'
res.index = range(1, res.shape[0] + 1)
res
Group  :1:  :2:  :3:  :4:
1      A    B    C
2      D    E    F    G
3      H         I    J
Another way to do this:
import pandas as pd
#read the file
with open("t.txt") as f:
    content = f.readlines()
#Create a dictionary and read each line from the file, keeping the column names (e.g. :1:) as keys and the row values (e.g. A) as values.
my_dict={}
for v in content:
    key = v[0:3] # take the value ':1:'
    value = v[3] # take value 'A'
    my_dict.setdefault(key,[]).append(value)
#convert dictionary to dataframe and transpose it
df = pd.DataFrame.from_dict(my_dict,orient='index').transpose()
df
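The output will look like this. Note that each column is filled from the top, so the values are not aligned by group (G and J end up in rows 0 and 1):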
:1: :2: :3: :4:
0 A B C G
1 D E F J
2 H None I None
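A group-aware variant of the dictionary approach could start a new row whenever a :1: line appears. This is only a sketch: the `content` list stands in for `f.readlines()`, and the names `rows`, `key` and `value` are illustrative.

```python
import pandas as pd

# Stand-in for f.readlines(), using the sample data from the question.
content = [":1:A\n", ":2:B\n", ":3:C\n", ":1:D\n", ":2:E\n",
           ":3:F\n", ":4:G\n", ":1:H\n", ":3:I\n", ":4:J\n"]

rows = []
for line in content:
    # ':1:A' splits into ['', '1', 'A'].
    _, key, value = line.strip().split(':', 2)
    # Every ':1:' line opens a new group (one dict per output row).
    if key == '1' or not rows:
        rows.append({})
    rows[-1][f':{key}:'] = value

# Keys missing from a group become NaN; show them as empty strings.
df = pd.DataFrame(rows).fillna('')
df.index = range(1, len(df) + 1)
df.index.name = 'Group'

print(df)
```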

How to initialize a two dimensional string DataFrame array in python

I want to initialize a 31756x2 data frame of strings.
I want it to look like this:
index column1 column2
0 A B
1 A B
.
.
31756 A B
I wrote:
content_split = [["A", "B"] for x in range(31756)]
The result is a two-dimensional list, but I want the columns to be separated like in a data frame (column1: A..., column2: B...), and I can't seem to get it to work.
Would love some help.
Use DataFrame constructor only:
df = pd.DataFrame([["A", "B"] for x in range(31756)], columns=['col1','col2'])
print (df.head())
col1 col2
0 A B
1 A B
2 A B
3 A B
4 A B
Or:
N = 31756
df = pd.DataFrame({'col1':['A'] * N, 'col2':['B'] * N})
print (df.head())
col1 col2
0 A B
1 A B
2 A B
3 A B
4 A B
import pandas as pd
df = pd.DataFrame(index=range(31756))
df.loc[:,'column1'] = 'A'
df.loc[:,'column2'] = 'B'
Using numpy.tile:
import numpy as np
df = pd.DataFrame(np.tile(list('AB'), (31756, 1)), columns=['col1','col2'])
Or just passing a dictionary:
df = pd.DataFrame({'A':['A']*31756, 'B':['B']*31756})
If you use this latter method on a Python version older than 3.7, where dictionaries do not guarantee order, you may want to sort the columns explicitly (newer versions keep the insertion order):
df = pd.DataFrame({'A':['A']*31756, 'B':['B']*31756}).sort_index(axis=1)
For fun
pd.DataFrame(index=range(31756)).assign(**dict(col1='A', col2='B'))
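Another option, sketched here as an assumption-light variant of the dictionary approach, is to pass scalars together with an explicit index and let the constructor broadcast them. The column names follow the question.

```python
import pandas as pd

N = 31756

# Scalar values are broadcast to every row when an index is supplied.
df = pd.DataFrame({'column1': 'A', 'column2': 'B'}, index=range(N))

print(df.shape)
print(df.head(3))
```

Without the `index` argument the constructor cannot know how many rows to create and raises a ValueError for all-scalar input.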

Apply a function to a specific row using the index value

I have the following table:
import pandas as pd
import numpy as np
#Dataframe with random numbers and with an a,b,c,d,e index
df = pd.DataFrame(np.random.randn(5,5), index = ['a','b','c','d','e'])
#Now i name the columns the same
df.columns = ['a','b','c','d','e']
#Resulting dataframe:
a b c d e
a 2.214229 1.621352 0.083113 0.818191 -0.900224
b -0.612560 -0.028039 -0.392266 0.439679 1.596251
c 1.378928 -0.309353 -0.651817 1.499517 0.515772
d -0.061682 1.141558 -0.811471 0.242874 0.345159
e -0.714760 -0.172082 0.205638 0.220528 1.182013
How can I apply a function to a specific row of the DataFrame, selected by its index label? I want to round the numbers in every column for the row whose index is "c".
#Numbers to round to 2 decimals:
a b c d e
c 1.378928 -0.309353 -0.651817 1.499517 0.515772
What is the best way to do this?
For label based indexing use loc:
In [22]:
df = pd.DataFrame(np.random.randn(5,5), index = ['a','b','c','d','e'])
#Now i name the columns the same
df.columns = ['a','b','c','d','e']
df
Out[22]:
a b c d e
a -0.051366 1.856373 -0.224172 -0.005668 0.986908
b -1.121298 -1.018863 2.328420 -0.117501 -0.231463
c 2.241418 -0.838571 -0.551222 0.662890 -1.234716
d 0.275063 0.295788 0.689171 0.227742 0.091928
e 0.269730 0.326156 0.210443 -0.494634 -0.489698
In [23]:
df.loc['c'] = np.round(df.loc['c'],decimals=2)
df
Out[23]:
a b c d e
a -0.051366 1.856373 -0.224172 -0.005668 0.986908
b -1.121298 -1.018863 2.328420 -0.117501 -0.231463
c 2.240000 -0.840000 -0.550000 0.660000 -1.230000
d 0.275063 0.295788 0.689171 0.227742 0.091928
e 0.269730 0.326156 0.210443 -0.494634 -0.489698
To round the values of column c (round returns a new Series, so assign it back to keep the result):
df['c'] = df['c'].round(decimals=2)
To round the values of row c:
df.loc['c'] = df.loc['c'].round(decimals=2)
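The same pattern works for any function, not only rounding. The sketch below uses a seeded random generator purely for reproducibility, and `scale_and_round` is a hypothetical function invented for illustration.

```python
import numpy as np
import pandas as pd

labels = list('abcde')
rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
df = pd.DataFrame(rng.standard_normal((5, 5)), index=labels, columns=labels)
original = df.copy()

def scale_and_round(row):
    # Hypothetical function: double the values, then round to 2 decimals.
    return (row * 2).round(2)

# .loc selects the row by its index label; assigning the result back
# updates only that row and leaves the others untouched.
df.loc['c'] = scale_and_round(df.loc['c'])

print(df)
```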
