Concat string Series when some values are empty strings - python

How can I vectorize the concatenation of strings in two columns when there are empty strings present? Here is the problem:
My columns in DF:
col1 = pd.Series(['text1','','text3'])
col2 = pd.Series(['text1','text2','text3'])
When I do:
new_col = col1.str.cat(col2, sep='/')
it gives:
new_col = pd.Series(['text1/text1','','text3/text3'])
but it should give:
new_col = pd.Series(['text1/text1','/text2','text3/text3'])
How can I do this?
Pandas version 0.24.2

If there are missing values (NaN) instead of empty strings, the na_rep parameter of Series.str.cat is necessary:
col1 = pd.Series(['text1',np.nan,'text3'])
col2 = pd.Series(['text1','text2','text3'])
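With NaN present, a minimal sketch of the na_rep fix might look like this:

```python
import numpy as np
import pandas as pd

col1 = pd.Series(['text1', np.nan, 'text3'])
col2 = pd.Series(['text1', 'text2', 'text3'])

# na_rep replaces missing values before concatenation,
# so the NaN row produces '/text2' instead of NaN
new_col = col1.str.cat(col2, sep='/', na_rep='')
print(new_col)
```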
With empty strings it works as expected:
col1 = pd.Series(['text1','','text3'])
col2 = pd.Series(['text1','text2','text3'])
new_col = col1.str.cat(col2, sep='/')
print (new_col)
0    text1/text1
1         /text2
2    text3/text3
dtype: object
An alternative is also possible:
new_col = col1 + '/' + col2
print (new_col)
0    text1/text1
1         /text2
2    text3/text3
dtype: object
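Note that with actual NaN values the + operator propagates NaN; a sketch that fills the missing values first:

```python
import numpy as np
import pandas as pd

col1 = pd.Series(['text1', np.nan, 'text3'])
col2 = pd.Series(['text1', 'text2', 'text3'])

# '+' would turn the NaN row into NaN, so replace missing values with '' first
new_col = col1.fillna('') + '/' + col2
print(new_col)
```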

Related

How to trim strings based on values from another column in pandas?

I have a pandas data frame like below:
df:
col1       col2
ACDCAAAAA  4
CDACAAA    2
ADDCAAAAA  3
I need to trim the col1 strings based on the corresponding col2 values, like below:
dfout:
col1    col2
ACDCA   4
CDACA   2
ADDCAA  3
I tried df['col1'].str[:-(df['col2'])] but I get NaN in the output.
Does anyone know how to do that?
Thanks for your time.
Use a list comprehension with zip:
df['new'] = [a[:-b] for a, b in zip(df['col1'], df['col2'])]
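A runnable sketch of the list-comprehension approach on the sample data:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['ACDCAAAAA', 'CDACAAA', 'ADDCAAAAA'],
                   'col2': [4, 2, 3]})

# slice off the last col2 characters of each col1 string
df['new'] = [a[:-b] for a, b in zip(df['col1'], df['col2'])]
print(df['new'])
```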
A regex option, applied row by row with re.sub, might be:
import re
df["col1"] = [re.sub(rf'.{{{n}}}$', '', s) for s, n in zip(df["col1"], df["col2"])]
Use df.apply, slicing off the last col2 characters:
df['col1'] = df.apply(lambda x: x['col1'][:-x['col2']], axis=1)

check if row value value is equal to column name and access the value of the column

Sample dataframe -
col1  col2  col3  col4  colfromvaluestobepicked  new col
1     1     0     1     'col1'                   1
0     0     1     1     'col2'                   0
I want to create a new column whose values are based on colfromvaluestobepicked: if it is 'col1', pick that row's col1 value and assign it to the new column, and so on.
I am not sure how to achieve this?
Use DataFrame.melt as an alternative to lookup:
df1 = df.melt('colfromvaluestobepicked', ignore_index=False)
df['new'] = df1.loc[df1['colfromvaluestobepicked'].str.strip("'") == df1['variable'], 'value']
Try this:
df['new col'] = df.apply(lambda row: row[row['colfromvaluestobepicked']], axis=1)
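A runnable sketch of the apply lookup, assuming the picker column holds plain column names (the melt answer above strips surrounding quotes with str.strip("'")):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 0], 'col2': [1, 0], 'col3': [0, 1],
                   'col4': [1, 1],
                   'colfromvaluestobepicked': ['col1', 'col2']})

# for each row, look up the column named in 'colfromvaluestobepicked'
df['new col'] = df.apply(lambda row: row[row['colfromvaluestobepicked']], axis=1)
print(df['new col'])
```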

How to reorder this DataFrame in Python without hard-coding column values?

I have a df ('COL3 SUM' is the full name, with a space):
COL1  COL2  COL3 SUM  COL4  COL5
1     2     3         4     5
How can I reorder this df so that 'COL3 SUM' always comes at the end, without reordering any of the rest of the df?
COL1  COL2  COL4  COL5  COL3 SUM
1     2     4     5     3
If you want to move all columns containing some keyword, e.g. "SUM", to the end:
1. create a list of the unmoved columns
2. append the excluded columns to that list
3. reorder the dataframe by indexing with the new column list
code:
new_cols = [col for col in df.columns if "SUM" not in col]
moved_cols = list(set(df.columns) - set(new_cols))
new_cols.extend(moved_cols)
df = df[new_cols]
output:
COL1 COL2 COL4 COL5 COL3 SUM
0 1 2 4 5 3
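The set difference above does not guarantee the order of the moved columns; a variant that preserves their original order too might be:

```python
import pandas as pd

df = pd.DataFrame({'COL1': [1], 'COL2': [2], 'COL3 SUM': [3],
                   'COL4': [4], 'COL5': [5]})

# keep columns without 'SUM' in their original order, then append the rest
new_cols = [c for c in df.columns if 'SUM' not in c]
new_cols += [c for c in df.columns if 'SUM' in c]
df = df[new_cols]
print(list(df.columns))
```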
newDF = df[sorted(df.columns, key=lambda x: (" " in x, x))]
would do it, I think.
If the only criterion is the space, just use that as the key:
newDF = df[sorted(df.columns, key=lambda x: " " in x)]
Since sorted is stable, ties keep their original relative order. If it still changes the original order, try:
newDF = df[sorted(df.columns, key=lambda x: (" " in x, list(df.columns).index(x)))]
This assumes the key bit is the space; if it is "SUM", just change what you are checking for.
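A runnable sketch of the space-as-key variant on the sample df:

```python
import pandas as pd

df = pd.DataFrame({'COL1': [1], 'COL2': [2], 'COL3 SUM': [3],
                   'COL4': [4], 'COL5': [5]})

# sorted() is stable, so columns with equal keys keep their original order;
# the key moves any column name containing a space to the end
new_df = df[sorted(df.columns, key=lambda x: ' ' in x)]
print(list(new_df.columns))
```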

create a column which is the difference of two string columns in pandas

I have a pandas dataframe like below:
df = pd.DataFrame({'col1': ['apple;orange;pear', 'grape;apple;kiwi;pear'], 'col2': ['apple', 'grape;kiwi']})
                    col1        col2
0      apple;orange;pear       apple
1  grape;apple;kiwi;pear  grape;kiwi
I need the data like below:
                    col1        col2         col3
0      apple;orange;pear       apple  orange;pear
1  grape;apple;kiwi;pear  grape;kiwi   apple;pear
Does anyone know how to do that? Thanks.
In this example, the sub-strings of the second row of col2, grape;kiwi, sit at different positions in the second row of col1, grape;apple;kiwi;pear, so How do I create a new column in pandas from the difference of two string columns? does not work in my case.
You can use set to find the differences. As a first step, convert the strings to sets:
df['col3'] = df.apply(
    lambda x: ';'.join(set(x.col1.split(';')).difference(x.col2.split(';'))),
    axis=1
)
Note that Python sets are unordered, so the joined tokens may come out in any order.
                    col1        col2         col3
0      apple;orange;pear       apple  orange;pear
1  grape;apple;kiwi;pear  grape;kiwi   apple;pear
Magic of str.get_dummies:
s = df.col1.str.get_dummies(';').sub(df.col2.str.get_dummies(';'), fill_value=0)
df['col3'] = s.eq(1).dot(s.columns + ';').str[:-1]
df
                    col1        col2         col3
0      apple;orange;pear       apple  orange;pear
1  grape;apple;kiwi;pear  grape;kiwi   apple;pear
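Since sets do not preserve order, a sketch that keeps col1's token order while dropping anything present in col2 might be:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['apple;orange;pear', 'grape;apple;kiwi;pear'],
                   'col2': ['apple', 'grape;kiwi']})

# keep col1's token order; membership test against col2's token set
df['col3'] = [
    ';'.join(t for t in a.split(';') if t not in set(b.split(';')))
    for a, b in zip(df['col1'], df['col2'])
]
print(df['col3'])
```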

Pandas dataframe get columns names and value_counts

How do I get all column names where the values in the column are 'f' or 't' into an array?
df['FTI'].value_counts()
Instead of hard-coding 'FTI', I need an array of the matching columns. Is that possible?
Reproducible example:
df = pd.DataFrame({'col1':[1,2,3], 'col2':['f', 'f', 'f'], 'col3': ['t','t','t'], 'col4':['d','d','d']})
   col1 col2 col3 col4
0     1    f    t    d
1     2    f    t    d
2     3    f    t    d
Then, using eq and all:
>>> s = (df.eq('t') | df.eq('f')).all()
col1 False
col2 True
col3 True
col4 False
dtype: bool
To get the names:
>>> s[s].index.values
array(['col2', 'col3'], dtype=object)
To get the positions:
>>> np.flatnonzero(s) + 1
array([2, 3])
Yes, it is possible. Here is one way to get the columns:
cols = []
for col in df.columns:
    # astype(str) avoids an AttributeError on non-string columns like col1
    if df[col].astype(str).str.contains('f|t').any():
        cols.append(col)
Then you can just use this for the frequencies:
f = pd.Series(dtype='int64')
for col in cols:
    f = pd.concat([f, df[col].value_counts()])
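Putting the two answers together, a minimal end-to-end sketch detecting the columns and counting their values:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['f', 'f', 'f'],
                   'col3': ['t', 't', 't'], 'col4': ['d', 'd', 'd']})

# columns where every value is 't' or 'f'
mask = df.isin(['t', 'f']).all()
cols = mask[mask].index.tolist()
print(cols)

# frequency counts for just those columns
counts = pd.concat([df[c].value_counts() for c in cols])
print(counts)
```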
