How to trim strings based on values from another column in pandas? - python

I have a pandas data frame like below:
df:
col1 col2
ACDCAAAAA 4
CDACAAA 2
ADDCAAAAA 3
I need to trim the col1 strings based on the corresponding col2 values (removing that many characters from the end), like below:
dfout:
col1 col2
ACDCA 4
CDACA 2
ADDCAA 3
I tried: df['col1'].str[:-(df['col2'])] but I am getting NaN in the output.
Does anyone know how to do that?
Thanks for your time.

Use a list comprehension with zip:
df['new'] = [a[:-b] for a, b in zip(df['col1'], df['col2'])]
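One caveat with the slice a[:-b]: when b is 0 it becomes a[:0], which is an empty string rather than the untouched value. A minimal sketch that slices up to an explicit end position instead, using the question's data (the helper name trim_tail and the column name trimmed are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["ACDCAAAAA", "CDACAAA", "ADDCAAAAA"],
    "col2": [4, 2, 3],
})

def trim_tail(text, n):
    # Hypothetical helper: drop the last n characters of text.
    # Slicing up to len(text) - n avoids the text[:-0] == "" pitfall.
    return text[:max(len(text) - n, 0)]

df["trimmed"] = [trim_tail(a, b) for a, b in zip(df["col1"], df["col2"])]
print(df)
```

If col2 is guaranteed to be at least 1, the plain a[:-b] slice is enough.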

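A regex option might be the following (str.replace expects a single pattern, not a Series of patterns, so the pattern is built per row):
import re
df["col1"] = [re.sub(r'.{%d}$' % n, '', s) for s, n in zip(df["col1"], df["col2"])]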

Use df.apply with axis=1 and slice off the last col2 characters:
df['col1'] = df.apply(lambda x: x['col1'][:-x['col2']], axis=1)

Related

Truncate string and replace with "X" Python Pandas DataFrame

I have a df such as:
d = {'col1': [11111111, 2222222]}
df = pd.DataFrame(data=d)
df
col1
0 11111111
1 2222222
I need to replace everything before the last four characters with something like "X" such that the new df would be
d = {'col1': ['XXXX1111', 'XXX2222']}
df = pd.DataFrame(data=d)
df
col1
0 XXXX1111
1 XXX2222
I am still new to Python. I have been able to slice the last four characters, for example, but have not been able to replace everything else with X's.
Also, the strings can be different lengths, so the number of X's depends on the length of each string. That in particular is what has given me trouble. If they were all the same length this would be much easier.
You can use .str.replace() with regex:
df.col1 = df.col1.astype(str).str.replace(
    r"^(.*)(.{4})$", lambda g: "X" * len(g.group(1)) + g.group(2), regex=True
)
print(df)
Prints:
col1
0 XXXX1111
1 XXX2222
df['col1'] = list(map(lambda l: 'X'*(l-4), df['col1'].astype(str).apply(len))) + df['col1'].astype(str).str[-4:]
map() repeats X n-4 times, where n is the length of each element in col1.
.str[-4:] gets the last 4 characters of each element in the col1 column.
# print(df)
col1
0 XXXX1111
1 XXX2222
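For comparison, the same masking can be sketched without a regex by mapping a plain Python function over the column. The helper name mask_head and its keep parameter are illustrative assumptions, not part of the answers above:

```python
import pandas as pd

df = pd.DataFrame({"col1": [11111111, 2222222]})

def mask_head(value, keep=4):
    # Hypothetical helper: replace all but the last `keep` characters with "X".
    text = str(value)
    hidden = max(len(text) - keep, 0)  # 0 when the value is shorter than `keep`
    return "X" * hidden + text[hidden:]

df["col1"] = df["col1"].map(mask_head)
print(df)
```

Values shorter than four characters come back unchanged, since there is nothing to mask.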

How to switch column values in the same Pandas DataFrame

I have the following DataFrame:
I need to switch values of col2 and col3 with the values of col4 and col5. Values of col1 will remain the same. The end result needs to look as the following:
Is there a way to do this without looping through the DataFrame?
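Use rename in pandas. Note that this swaps the column labels rather than moving the values, and it returns a new DataFrame: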
In [160]: df = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]})
In [161]: df
Out[161]:
A B
0 1 3
1 2 4
2 3 5
In [167]: df.rename({'B':'A','A':'B'},axis=1)
Out[167]:
B A
0 1 3
1 2 4
2 3 5
This should do:
og_cols = df.columns
new_cols = [df.columns[0], *df.columns[3:], *df.columns[1:3]]
df = df[new_cols] # Sort columns in the desired order
df.columns = og_cols # Use original column names
If you want to swap the column values:
df.iloc[:, 1:3], df.iloc[:, 3:] = df.iloc[:,3:].to_numpy(copy=True), df.iloc[:,1:3].to_numpy(copy=True)
Pandas reindex could help :
cols = df.columns
#reposition the columns
df = df.reindex(columns=['col1','col4','col5','col2','col3'])
#pass in new names
df.columns = cols
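The question's frame was not shown, so here is a small sketch on made-up data with the same five column names. It swaps the values by assigning a NumPy array back to a column list, which is one possible variant of the approaches above rather than the only way to do it:

```python
import pandas as pd

# Hypothetical data; the question's original frame was not shown.
df = pd.DataFrame({
    "col1": [1, 2],
    "col2": [10, 20],
    "col3": [30, 40],
    "col4": [50, 60],
    "col5": [70, 80],
})

left, right = ["col2", "col3"], ["col4", "col5"]
# The right-hand side is turned into a NumPy array first, so both blocks
# are read before anything is overwritten and no index alignment happens.
df[left + right] = df[right + left].to_numpy()
print(df)
```

Converting to an array matters: assigning a DataFrame instead would involve pandas matching up column labels, which may undo the swap.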

Count across dataframe columns based on str.contains (or similar)

I would like to count the number of cells within each row that contain a particular character string. Cells that contain the string more than once should be counted only once.
I can count the number of cells across a row that equal a given value, but when I extend this logic to use str.contains, I run into issues, as shown below:
d = {'col1': ["a#", "b","c#"], 'col2': ["a", "b","c#"]}
df = pd.DataFrame(d)
#can correctly count across rows using equality
thisworks =( df =="a#" ).sum(axis=1)
#can count across a column using str.contains
thisworks1=df['col1'].str.contains('#').sum()
#but cannot use str.contains with a dataframe so what is the alternative
thisdoesnt =( df.str.contains('#') ).sum(axis=1)
Output should be a series showing the number of cells in each row that contain the given character string.
str.contains is a Series method. To apply it to a whole DataFrame you need either agg or apply, such as:
df.agg(lambda x: x.str.contains('#')).sum(1)
Out[2358]:
0 1
1 0
2 2
dtype: int64
If you want to avoid both agg and apply, you can use np.char.find to work directly on the underlying NumPy array of df:
(np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1)
Out[2360]: array([1, 0, 2])
To turn the result into a Series (or assign it to a column of df):
pd.Series((np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1), index=df.index)
Out[2361]:
0 1
1 0
2 2
dtype: int32
A solution using df.apply:
df = pd.DataFrame({'col1': ["a#", "b","c#"],
'col2': ["a", "b","c#"]})
df
col1 col2
0 a# a
1 b b
2 c# c#
df['sum'] = df.apply(lambda x: x.str.contains('#'), axis=1).sum(axis=1)
col1 col2 sum
0 a# a 1
1 b b 0
2 c# c# 2
Something like this should work:
df = pd.DataFrame({'col1': ['#', '0'], 'col2': ['#', '#']})
df['totals'] = df['col1'].str.contains('#', regex=False).astype(int) +\
df['col2'].str.contains('#', regex=False).astype(int)
df
# col1 col2 totals
# 0 # # 2
# 1 0 # 1
It should generalize to as many columns as you want.
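To make the "any number of columns" point concrete, here is a minimal sketch on the question's data that applies str.contains column by column and then sums across each row. The variable names has_hash and counts are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a#", "b", "c#"], "col2": ["a", "b", "c#"]})

# apply() hands each column to the lambda as a Series, where .str is available.
# regex=False treats "#" as a literal substring; a cell containing it several
# times still yields a single True, so it is counted once.
has_hash = df.apply(lambda col: col.str.contains("#", regex=False))
counts = has_hash.sum(axis=1)
print(counts)
```

This assumes every column holds strings; non-string columns would need converting first, for example with astype(str).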

Concat string Series when some values are empty strings

How can I vectorize the concatenation of strings in two columns when there are empty strings present? Here is the problem:
My columns in DF:
col1 = pd.Series(['text1','','text3'])
col2 = pd.Series(['text1','text2','text3'])
When I do:
new_col = col1.str.cat(col2, sep='/')
it gives:
new_col = pd.Series(['text1/text1','','text3/text3'])
but it should give:
new_col = pd.Series(['text1/text1','/text2','text3/text3'])
How can I do this?
Pandas version 0.24.2
If the value is missing (NaN) rather than an empty string, you need the na_rep parameter of Series.str.cat:
col1 = pd.Series(['text1',np.nan,'text3'])
col2 = pd.Series(['text1','text2','text3'])
With an empty string it already works as expected:
col1 = pd.Series(['text1','','text3'])
col2 = pd.Series(['text1','text2','text3'])
new_col = col1.str.cat(col2, sep='/')
print (new_col)
0 text1/text1
1 /text2
2 text3/text3
dtype: object
An alternative is plain string concatenation:
new_col = col1 + '/' + col2
print (new_col)
0 text1/text1
1 /text2
2 text3/text3
dtype: object
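The missing-value case mentioned above can be sketched as follows, assuming the empty entry in col1 is really NaN. The variable names without_na_rep and with_na_rep are illustrative:

```python
import numpy as np
import pandas as pd

col1 = pd.Series(["text1", np.nan, "text3"])
col2 = pd.Series(["text1", "text2", "text3"])

# Without na_rep, a row with a missing value stays missing in the result.
without_na_rep = col1.str.cat(col2, sep="/")

# With na_rep="", the missing value is treated as an empty string.
with_na_rep = col1.str.cat(col2, sep="/", na_rep="")
print(with_na_rep)
```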

how to create a dataframe aggregating (grouping?) a dataframe containing only strings

I would like to create a dataframe "aggregating" a larger data set.
Starting:
df:
col1 col2
1 A B
2 A C
3 A B
and getting:
df_aggregated:
col1 col2
1 A B
2 A C
without using any calculation (count())
I would write:
df_aggreagated = df.groupby('col1')
but I do not get anything
print ( df_aggregated )
"error"
any help appreciated
You can accomplish this by simply dropping the duplicate entries using the df.drop_duplicates function (the default keep='first' retains one copy of each row, whereas keep=False would drop every duplicated row):
df_aggregated = df.drop_duplicates(subset=['col1', 'col2'])
print(df_aggregated)
col1 col2
1 A B
2 A C
You can use groupby with a function:
In [849]: df.groupby('col2', as_index=False).max()
Out[849]:
col2 col1
0 B A
1 C A
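A small sketch on the question's data, showing why a bare groupby prints nothing useful and how the keep argument of drop_duplicates changes the result. The variable names grouped, first_kept and none_kept are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": ["A", "A", "A"], "col2": ["B", "C", "B"]},
    index=[1, 2, 3],
)

# groupby alone returns a lazy GroupBy object, not a DataFrame;
# it only produces data once an aggregation is applied.
grouped = df.groupby("col1")

# keep="first" is the default: one copy of each distinct row survives.
first_kept = df.drop_duplicates()

# keep=False drops every row that has a duplicate, including the first copy.
none_kept = df.drop_duplicates(keep=False)

print(first_kept)
print(none_kept)
```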
