Truncate string and replace with "X" in a Python pandas DataFrame

I have a df such as:
d = {'col1': [11111111, 2222222]}
df = pd.DataFrame(data=d)
df
col1
0 11111111
1 2222222
I need to replace everything before the last four characters with "X", such that the new df would be
d = {'col1': ['XXXX1111', 'XXX2222']}
df = pd.DataFrame(data=d)
df
col1
0 XXXX1111
1 XXX2222
I'm still new to Python and have, for example, been able to slice out the last four characters, but I have not been able to replace everything else with X's.
Also, the strings can be different lengths, so the number of X's depends on the length of each string. That in particular is what has given me trouble; if they were all the same length this would be much easier.

You can use .str.replace() with a regex and a callable replacement (a callable requires regex=True):
df.col1 = df.col1.astype(str).str.replace(
    r"^(.*)(.{4})$", lambda m: "X" * len(m.group(1)) + m.group(2), regex=True
)
print(df)
Prints:
col1
0 XXXX1111
1 XXX2222
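Because (.*) is greedy, it matches as much as it can and then backtracks just enough to leave exactly four characters for the second group, so the number of X's always equals the length of the masked prefix.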

df['col1'] = list(map(lambda l: 'X' * (l - 4), df['col1'].astype(str).apply(len))) + df['col1'].astype(str).str[-4:]
map() repeats X n-4 times, where n is the length of each element in col1.
.str[-4:] gets the last 4 characters of the col1 column.
print(df)
col1
0 XXXX1111
1 XXX2222
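As a minimal alternative sketch, a plain list comprehension does the same masking without map():
df['col1'] = ['X' * (len(s) - 4) + s[-4:] for s in df['col1'].astype(str)]
For a value shorter than four characters this simply leaves the value unmasked, since 'X' * (len(s) - 4) is empty.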

Related

How to trim strings based on values from another column in pandas?

I have a pandas data frame like below:
df:
col1 col2
ACDCAAAAA 4
CDACAAA 2
ADDCAAAAA 3
I need to trim the col1 strings based on the corresponding col2 values, like below:
dfout:
col1 col2
ACDCA 4
CDACA 2
ADDCAA 3
I tried df['col1'].str[:-(df['col2'])] but I am getting NaN in the output.
Does anyone know how to do that?
Thanks for your time.
Use a list comprehension with zip:
df['new'] = [a[:-b] for a, b in zip(df['col1'], df['col2'])]
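One caveat, assuming col2 could ever be 0 (it is not in the example data): a[:-0] evaluates as a[:0] and returns an empty string, so a guard is safer:
df['new'] = [a[:-b] if b else a for a, b in zip(df['col1'], df['col2'])]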
A regex option needs a per-row pattern, and str.replace only accepts a scalar pattern, so use re.sub in a comprehension:
import re
df["col1"] = [re.sub(r'.{%d}$' % n, '', s) for s, n in zip(df["col1"], df["col2"])]
Use df.apply:
In [2613]: df['col1'] = df.apply(lambda x: x['col1'][:-x['col2']], axis=1)

How can I remove string after last underscore in python dataframe?

I want to remove everything after the last underscore from each value in the dataframe. If my data in the dataframe looks like
AA_XX,
AAA_BB_XX,
AA_BB_XYX,
AA_A_B_YXX
I would like to get this result
AA,
AAA_BB,
AA_BB,
AA_A_B
You can do this simply using Series.str.split and Series.str.join:
In [2381]: df
Out[2381]:
col1
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
In [2386]: df['col1'] = df['col1'].str.split('_').str[:-1].str.join('_')
In [2387]: df
Out[2387]:
col1
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
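A single rsplit with n=1 is an alternative sketch that splits only at the last underscore and leaves values without an underscore untouched:
df['col1'] = df['col1'].str.rsplit('_', n=1).str[0]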
pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
Explanation:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})
Creates
col
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
Use apply in order to loop through the column you want to edit.
I split the string at each _ and then joined all parts back together with _, leaving out the last part.
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
print(df)
Results:
col
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
If your dataset contains values without an underscore, like AA, change the lambda like this:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]) if len(r.split('_')) > 1 else r)
print(df)
Here is another way of going about it.
import pandas as pd

data = {'s': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']}
df = pd.DataFrame(data)

def cond1(s):
    parts = s.split('_')
    if len(parts) == 1:
        return s                    # no underscore: keep the value as-is
    return '_'.join(parts[:-1])     # drop the last part and rejoin

df['result'] = df['s'].apply(cond1)
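Applied to the example data, this produces the same trimmed values (AA, AAA_BB, AA_BB, AA_A_B) in the result column.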

How to remove double quotes while assigning columns to dataframe

I have below list
ColumnName = 'Emp_id','Emp_Name','EmpAGe'
While I am trying to read the above columns and assign them inside a dataframe, I am getting extra double quotes.
df = pd.DataFrame(data, columns=[ColumnName])
With columns=[ColumnName]
I am getting columns = ["'Emp_id','Emp_Name','EmpAGe'"]
How can I handle these extra double quotes and remove them while assigning the header to the data?
This code
ColumnName = 'Emp_id','Emp_Name','EmpAGe'
is a tuple, not a list.
In case you want three columns, each named after an element of the tuple above, you need
df = pd.DataFrame(data, columns=list(ColumnName))
The problem is how you define the columns for the pandas DataFrame.
The example below will build a correct data frame:
import pandas as pd
ColumnName1 = 'Emp_id','Emp_Name','EmpAGe'
df1 = [['A1','A1','A2'],['1','2','1'],['a0','a1','a3']]
df = pd.DataFrame(data=df1,columns=ColumnName1 )
df
Result :
Emp_id Emp_Name EmpAGe
0 A1 A1 A2
1 1 2 1
2 a0 a1 a3
As the output shows, the headers contain no double quotes.
For the sake of understanding, you can also use col.replace to get the desired result.
Take an example:
>>> df
"col1" "col2"
0 1 1
1 2 2
Result:
>>> df.columns = [col.replace('"', '') for col in df.columns]
# df.columns = df.columns.str.replace('"', '') <-- can use this as well
>>> df
col1 col2
0 1 1
1 2 2
OR
>>> df = pd.DataFrame({ '"col1"':[1, 2], '"col2"':[1,2]})
>>> df
"col1" "col2"
0 1 1
1 2 2
>>> df.columns = [col.replace('"', '') for col in df.columns]
>>> df
col1 col2
0 1 1
1 2 2
Your input is not quite right. ColumnName is already list-like, so it should be passed directly rather than wrapped in another list; in the latter case it is interpreted as a single column name.
df = pd.DataFrame(data, columns=ColumnName)
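A minimal sketch of the correct call, with hypothetical data rows for illustration:
import pandas as pd

ColumnName = 'Emp_id', 'Emp_Name', 'EmpAGe'  # a tuple, despite the missing parentheses
data = [[1, 'A1', 30], [2, 'A2', 41]]        # hypothetical rows

df = pd.DataFrame(data, columns=ColumnName)  # pass the tuple directly, not [ColumnName]
print(df.columns.tolist())                   # ['Emp_id', 'Emp_Name', 'EmpAGe']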

Count across dataframe columns based on str.contains (or similar)

I would like to count the number of cells within each row that contain a particular character string; cells which have the string more than once should be counted only once.
I can count the number of cells across a row which equal a given value, but when I expand this logic to use str.contains, I have issues, as shown below:
d = {'col1': ["a#", "b","c#"], 'col2': ["a", "b","c#"]}
df = pd.DataFrame(d)
# can correctly count across rows using equality
thisworks = (df == "a#").sum(axis=1)
# can count down a single column using str.contains
thisworks1 = df['col1'].str.contains('#').sum()
# but a DataFrame has no .str accessor, so this raises an error; what is the alternative?
thisdoesnt = (df.str.contains('#')).sum(axis=1)
Output should be a series showing the number of cells in each row that contain the given character string.
str.contains is a Series method. To apply it to the whole DataFrame you need either agg or apply, such as:
df.agg(lambda x: x.str.contains('#')).sum(1)
Out[2358]:
0 1
1 0
2 2
dtype: int64
If you don't like agg or apply, you may use np.char.find to work directly on the underlying NumPy array of df:
import numpy as np
(np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1)
Out[2360]: array([1, 0, 2])
Passing it to a Series (or a column of df):
pd.Series((np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1), index=df.index)
Out[2361]:
0 1
1 0
2 2
dtype: int32
A solution using df.apply:
df = pd.DataFrame({'col1': ["a#", "b", "c#"],
                   'col2': ["a", "b", "c#"]})
df
col1 col2
0 a# a
1 b b
2 c# c#
df['sum'] = df.apply(lambda x: x.str.contains('#'), axis=1).sum(axis=1)
col1 col2 sum
0 a# a 1
1 b b 0
2 c# c# 2
Something like this should work:
df = pd.DataFrame({'col1': ['#', '0'], 'col2': ['#', '#']})
df['totals'] = df['col1'].str.contains('#', regex=False).astype(int) +\
df['col2'].str.contains('#', regex=False).astype(int)
df
# col1 col2 totals
# 0 # # 2
# 1 0 # 1
It should generalize to as many columns as you want.
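If some cells may be NaN or otherwise non-string (an assumption beyond the example data), a hedged variant casts everything to str first:
counts = df.astype(str).apply(lambda col: col.str.contains('#', regex=False)).sum(axis=1)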

Split CSV file on multiple delimiters and then detect duplicate rows

I am reading a csv file with pandas. I need to duplicate rows according to the number of strings in a given column (there could be several). Example, using col1 and separator "|":
in_csv:
col1, col2, col3
ABC|EFG, 1, a
ABC|EFG, 1, bb
ABC|EFG, 2, c
out_csv:
col1, col2, col3
ABC, 1, a
EFG, 1, a
ABC, 1, bb
EFG, 1, bb
ABC, 2, c
EFG, 2, c
I tried reading through the file in a loop row by row, using incsv_dt.row1.iloc[ii].split('|'), but I believe there should be an easier way to do it. The strings in col1 separated by | could be several.
Thanks
Unsorted, and it might not work if there are entries without the '|' in the first column. This creates two dataframes based on 'col1' and then concatenates them together. It also might not work if there are multiple '|'s in col1.
df = pd.DataFrame()
df['col1'] = ['APC|EFG', 'APC|EFG','APC|EFG']
df['col2'] = [1,1,2]
df['col3'] = ['a','bb','c']
# split into two columns based on '|' delimiter
df = pd.concat([df, df['col1'].str.split('|', expand = True)], axis=1)
# create two dataframes with new labels
df2 = df.drop(['col1',1], axis=1)
df2.rename(columns={0: 'col1'}, inplace=True)
df3 = df.drop(['col1',0], axis=1)
df3.rename(columns={1: 'col1'}, inplace=True)
# concatenate the two frames (DataFrame.append was removed in pandas 2.0)
df = pd.concat([df2, df3])
Setup for the example:
df = pd.DataFrame()
df['col1'] = ['APC|EFG', 'APC', 'APC|EFG|XXX']
df['col2'] = [1, 1, 2]
df['col3'] = ['a', 'bb', 'c']
You can first create a new data frame with the split columns,
then drop the empty values. This works fine if some values have
multiple splits and some have none.
dfs = (df['col1'].str.split('|', expand=True)
       .unstack().reset_index()
       .set_index('level_1')[0]
       .dropna().to_frame())
To merge this with the original dataframe, make sure the indexes are the same.
When I tried, the original dataframe had a RangeIndex, so I convert it to an
integer index:
df.index = list(df.index)
Then you can merge the data frames on the index and rename the new column back to 'col1'
df_result = pd.merge(dfs,
                     df[['col2', 'col3']],
                     left_index=True, right_index=True,
                     how='outer').rename(columns={0: 'col1'})
print(df_result)
Results in
col1 col2 col3
0 APC 1 a
0 EFG 1 a
1 APC 1 bb
2 APC 2 c
2 EFG 2 c
2 XXX 2 c
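On pandas 1.1 or newer (an assumption about your environment), the whole task is a short sketch using str.split plus DataFrame.explode, which repeats each row once per list element:
out = df.assign(col1=df['col1'].str.split('|')).explode('col1', ignore_index=True)
print(out)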
