Split CSV file on multiple delimiters and then detect duplicate rows - python

I am reading a CSV file with pandas. I need to duplicate rows according to the number of strings in a given column (there could be several). For example, using col1 and the separator "|":
in_csv:
col1, col2, col3
ABC|EFG, 1, a
ABC|EFG, 1, bb
ABC|EFG, 2, c
out_csv:
col1, col2, col3
ABC, 1, a
EFG, 1, a
ABC, 1, bb
EFG, 1, bb
ABC, 2, c
EFG, 2, c
I tried looping row by row, using incsv_dt.row1.iloc[ii].split('|'), but I believe there should be an easier way to do it. col1 can contain more than two strings separated by |.
Thanks

The result is unsorted, and this might not work if there are entries without the '|' in the first column. It creates two dataframes based on 'col1' and then appends them together. It also might not work if there are multiple '|'s in col1.
import pandas as pd
df = pd.DataFrame()
df['col1'] = ['APC|EFG', 'APC|EFG','APC|EFG']
df['col2'] = [1,1,2]
df['col3'] = ['a','bb','c']
# split into two columns based on '|' delimiter
df = pd.concat([df, df['col1'].str.split('|', expand = True)], axis=1)
# create two dataframes with new labels
df2 = df.drop(['col1',1], axis=1)
df2.rename(columns={0: 'col1'}, inplace=True)
df3 = df.drop(['col1',0], axis=1)
df3.rename(columns={1: 'col1'}, inplace=True)
# append them together (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df2, df3])

Setup for the example:
df = pd.DataFrame()
df['col1'] = ['APC|EFG', 'APC', 'APC|EFG|XXX']
df['col2'] = [1, 1, 2]
df['col3'] = ['a', 'bb', 'c']
You can first create a new data frame with the split columns,
then drop the empty values. This works fine if some values have
multiple splits and some have none.
dfs = df['col1'].str.split('|',
expand = True).unstack().reset_index().set_index('level_1')[0].dropna().to_frame()
To merge this with the original dataframe, make sure the indexes are the same.
When I tried, the original dataframe had a RangeIndex, so I converted it to
an integer index:
df.index = list(df.index)
Then you can merge the data frames on the index and rename the new column back to 'col1'
df_result = pd.merge(dfs,
df[['col2', 'col3']],
left_index=True, right_index=True,
how='outer').rename(columns={0: 'col1'})
print(df_result)
Results in
col1 col2 col3
0 APC 1 a
0 EFG 1 a
1 APC 1 bb
2 APC 2 c
2 EFG 2 c
2 XXX 2 c
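As an alternative to the unstack-and-merge steps above, here is a minimal sketch using Series.str.split together with DataFrame.explode. It assumes pandas 0.25 or later (where explode is available); the variable name exploded is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["APC|EFG", "APC", "APC|EFG|XXX"],
    "col2": [1, 1, 2],
    "col3": ["a", "bb", "c"],
})

# Turn each col1 string into a list of parts, then emit one row per part.
exploded = (
    df.assign(col1=df["col1"].str.split("|"))
      .explode("col1")
      .reset_index(drop=True)
)
print(exploded)
```

Rows without a "|" pass through unchanged, and rows with several separators produce one output row per part, with col2 and col3 repeated.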

Related

Truncate string and replace with "X" Python Pandas DataFrame

I have a df such as:
d = {'col1': [11111111, 2222222]}
df = pd.DataFrame(data=d)
df
col1
0 11111111
1 2222222
I need to replace everything before the last four characters with something like "X", so that the new df would be
d = {'col1': ['XXXX1111', 'XXX2222']}
df = pd.DataFrame(data=d)
df
col1
0 XXXX1111
1 XXX2222
I am still new to Python and have been able to, for example, slice the last four characters, but I have not been able to replace everything else with X's.
Also, strings can be different lengths. So the number of X's is dependent on the length of the string. That particularly is what has given me trouble. If they were all the same length this would be much easier.
You can use .str.replace() with regex:
df.col1 = df.col1.astype(str).str.replace(
r"^(.*)(.{4})$", lambda g: "X" * len(g.group(1)) + g.group(2),
regex=True, # needed in pandas 2.0+, where regex defaults to False
)
print(df)
Prints:
col1
0 XXXX1111
1 XXX2222
df['col1'] = list(map(lambda l: 'X'*(l-4), df['col1'].astype(str).apply(len))) + df['col1'].astype(str).str[-4:]
map() repeats X n-4 times, where n is the length of each element in col1.
.str[-4:] gets the last 4 characters of each value in col1.
# print(df)
col1
0 XXXX1111
1 XXX2222
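The same idea can also be written as a small plain-Python helper applied with Series.map. This is only a sketch: the function name mask_prefix and its keep and fill parameters are made up for illustration, and values shorter than keep are left unmasked by assumption.

```python
import pandas as pd

def mask_prefix(value, keep=4, fill="X"):
    """Replace everything except the last `keep` characters with `fill`."""
    text = str(value)
    hidden = max(len(text) - keep, 0)  # number of characters to mask
    return fill * hidden + text[hidden:]

df = pd.DataFrame({"col1": [11111111, 2222222]})
df["col1"] = df["col1"].map(mask_prefix)
print(df)
```

Because the number of masked characters is computed per value, strings of different lengths are handled without any special casing.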

How to switch column values in the same Pandas DataFrame

I have the following DataFrame:
I need to switch values of col2 and col3 with the values of col4 and col5. Values of col1 will remain the same. The end result needs to look as the following:
Is there a way to do this without looping through the DataFrame?
Use rename in pandas
In [160]: df = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]})
In [161]: df
Out[161]:
A B
0 1 3
1 2 4
2 3 5
In [167]: df.rename({'B':'A','A':'B'},axis=1)
Out[167]:
B A
0 1 3
1 2 4
2 3 5
This should do:
og_cols = df.columns
new_cols = [df.columns[0], *df.columns[3:], *df.columns[1:3]]
df = df[new_cols] # Sort columns in the desired order
df.columns = og_cols # Use original column names
If you want to swap the column values:
df.iloc[:, 1:3], df.iloc[:, 3:] = df.iloc[:,3:].to_numpy(copy=True), df.iloc[:,1:3].to_numpy(copy=True)
Pandas reindex could help :
cols = df.columns
#reposition the columns
df = df.reindex(columns=['col1','col4','col5','col2','col3'])
#pass in new names
df.columns = cols
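Since the original tables are not shown, here is a small sketch with made-up values that swaps the contents of col2/col3 with col4/col5 in one assignment. Converting the right-hand side with to_numpy() drops the column labels, so pandas assigns by position rather than re-aligning by name.

```python
import pandas as pd

# Hypothetical data; the question's actual values are not available.
df = pd.DataFrame({
    "col1": [1, 2],
    "col2": [10, 20],
    "col3": [30, 40],
    "col4": [50, 60],
    "col5": [70, 80],
})

# The right-hand side is evaluated first, so no temporary copy is needed.
df[["col2", "col3", "col4", "col5"]] = df[["col4", "col5", "col2", "col3"]].to_numpy()
print(df)
```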

How to remove double quotes while assigning columns to dataframe

I have the list below:
ColumnName = 'Emp_id','Emp_Name','EmpAGe'
While I am trying to read the above columns and assign them to a dataframe, I am getting extra double quotes:
df = pd.DataFrame(data,columns=[ColumnName])
columns=[ColumnName]
I am getting columns = ["'Emp_id','Emp_Name','EmpAGe'"]
How can I handle these extra double quotes and remove them while assigning the header to the data?
This code
ColumnName = 'Emp_id','Emp_Name','EmpAGe'
is a tuple, not a list.
If you want three columns, each named after one of the values in the tuple above, you will need
df = pd.DataFrame(data,columns=list(ColumnName))
The problem is how you define the columns for pandas DataFrame.
The example below will build a correct data frame :
import pandas as pd
ColumnName1 = 'Emp_id','Emp_Name','EmpAGe'
df1 = [['A1','A1','A2'],['1','2','1'],['a0','a1','a3']]
df = pd.DataFrame(data=df1,columns=ColumnName1 )
df
Result :
Emp_id Emp_Name EmpAGe
0 A1 A1 A2
1 1 2 1
2 a0 a1 a3
Just for the sake of understanding: you can use col.replace to get the desired result.
Let's take an example:
>>> df
col1" col2"
0 1 1
1 2 2
Result:
>>> df.columns = [col.replace('"', '') for col in df.columns]
# df.columns = df.columns.str.replace('"', '') <-- can use this as well
>>> df
col1 col2
0 1 1
1 2 2
OR
>>> df = pd.DataFrame({ '"col1"':[1, 2], '"col2"':[1,2]})
>>> df
"col1" "col2"
0 1 1
1 2 2
>>> df.columns = [col.replace('"', '') for col in df.columns]
>>> df
col1 col2
0 1 1
1 2 2
Your input is not quite right. ColumnName is already list-like and it should be passed on directly rather than wrapped in another list. In the latter case it would be interpreted as one single column.
df = pd.DataFrame(data, columns=ColumnName)
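To make the point concrete, here is a short sketch showing that the comma-separated assignment produces a tuple, and that passing it on as a flat sequence yields three separate columns. The contents of data are invented for illustration.

```python
import pandas as pd

ColumnName = 'Emp_id', 'Emp_Name', 'EmpAGe'
print(type(ColumnName))  # a tuple, because of the commas

# Hypothetical rows; the question does not show its data.
data = [[1, 'Ann', 30], [2, 'Bob', 41]]

# Pass the names as a flat sequence, not wrapped in another list.
df = pd.DataFrame(data, columns=list(ColumnName))
print(df.columns.tolist())
```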

Count across dataframe columns based on str.contains (or similar)

I would like to count the number of cells within each row that contain a particular character string. Cells that contain the string more than once should be counted only once.
I can count the number of cells across a row that equal a given value, but when I extend this logic to use str.contains, I run into issues, as shown below:
d = {'col1': ["a#", "b","c#"], 'col2': ["a", "b","c#"]}
df = pd.DataFrame(d)
#can correctly count across rows using equality
thisworks =( df =="a#" ).sum(axis=1)
#can count across a column using str.contains
thisworks1=df['col1'].str.contains('#').sum()
#but cannot use str.contains with a dataframe so what is the alternative
thisdoesnt =( df.str.contains('#') ).sum(axis=1)
Output should be a series showing the number of cells in each row that contain the given character string.
str.contains is a Series method. To apply it to a whole dataframe you need either agg or apply, such as:
df.agg(lambda x: x.str.contains('#')).sum(1)
Out[2358]:
0 1
1 0
2 2
dtype: int64
If you like neither agg nor apply, you may use np.char.find to work directly on the underlying numpy array of df:
(np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1)
Out[2360]: array([1, 0, 2])
Wrapping the result in a Series (or assigning it to a column of df):
pd.Series((np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1), index=df.index)
Out[2361]:
0 1
1 0
2 2
dtype: int32
A solution using df.apply:
df = pd.DataFrame({'col1': ["a#", "b","c#"],
'col2': ["a", "b","c#"]})
df
col1 col2
0 a# a
1 b b
2 c# c#
df['sum'] = df.apply(lambda x: x.str.contains('#'), axis=1).sum(axis=1)
col1 col2 sum
0 a# a 1
1 b b 0
2 c# c# 2
Something like this should work:
df = pd.DataFrame({'col1': ['#', '0'], 'col2': ['#', '#']})
df['totals'] = df['col1'].str.contains('#', regex=False).astype(int) +\
df['col2'].str.contains('#', regex=False).astype(int)
df
# col1 col2 totals
# 0 # # 2
# 1 0 # 1
It should generalize to as many columns as you want.
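One way the column-by-column addition might be generalized is to sum over a list of column names. This is a sketch using the data from the question; the cols list is an assumption about which columns should be searched.

```python
import pandas as pd

df = pd.DataFrame({'col1': ["a#", "b", "c#"], 'col2': ["a", "b", "c#"]})

cols = ['col1', 'col2']  # assumed: the columns to search
# Each term is a 0/1 Series, so a cell counts once even if '#' appears several times in it.
df['totals'] = sum(df[c].str.contains('#', regex=False).astype(int) for c in cols)
print(df)
```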

Add new column in Pandas DataFrame Python [duplicate]

I have dataframe in Pandas for example:
Col1 Col2
A 1
B 2
C 3
Now I would like to add one more column named Col3 whose value is based on Col2: if Col2 > 1, then Col3 is 0, otherwise it is 1. So, in the example above, the output would be:
Col1 Col2 Col3
A 1 1
B 2 0
C 3 0
Any idea on how to achieve this?
Just do the opposite comparison, Col2 <= 1. This returns a boolean Series with False for the values greater than 1 and True for the others. If you convert it to an integer dtype, True becomes 1 and False becomes 0:
df['Col3'] = (df['Col2'] <= 1).astype(int)
If you want a more general solution, where you can assign any number to Col3 depending on the value of Col2 you should do something like:
df['Col3'] = df['Col2'].map(lambda x: 42 if x > 1 else 55)
Or:
df['Col3'] = 0
condition = df['Col2'] > 1
df.loc[condition, 'Col3'] = 42
df.loc[~condition, 'Col3'] = 55
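The two-way choice can also be expressed in a single step with numpy.where, which picks between two values element by element. A minimal sketch using the data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'B', 'C'], 'Col2': [1, 2, 3]})

# Where Col2 > 1 use 0, otherwise use 1.
df['Col3'] = np.where(df['Col2'] > 1, 0, 1)
print(df)
```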
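The easiest way that I found for adding a column to a DataFrame was to use the DataFrame.add method. Note that add performs element-wise addition, so it only works as a way of adding columns when the data is numeric and the column names do not overlap; with fill_value=0, the columns missing from one frame are treated as 0. Here's a snippet of code, which also writes the output to a CSV file. Note that the "columns" argument sets the name of the column (which happens to be the same as the name of the np.array that I used as the source of the data).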
# now to create a PANDAS data frame
df = pd.DataFrame(data = FF_maxRSSBasal, columns=['FF_maxRSSBasal'])
# from here on, we use the trick of creating a new dataframe and then "add"ing it
df2 = pd.DataFrame(data = FF_maxRSSPrism, columns=['FF_maxRSSPrism'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = FF_maxRSSPyramidal, columns=['FF_maxRSSPyramidal'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = deltaFF_strainE22, columns=['deltaFF_strainE22'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = scaled, columns=['scaled'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = deltaFF_orientation, columns=['deltaFF_orientation'])
df = df.add( df2, fill_value=0 )
#print(df)
df.to_csv('FF_data_frame.csv')
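For comparison, here is a sketch of two more conventional ways to attach new columns: direct assignment and pd.concat along axis=1. The arrays below are hypothetical stand-ins for the ones in the snippet above, and they are assumed to have the same length.

```python
import numpy as np
import pandas as pd

# Hypothetical arrays standing in for the answer's source data.
basal = np.array([0.1, 0.2, 0.3])
prism = np.array([1.0, 2.0, 3.0])

df = pd.DataFrame({'basal': basal})

# Direct assignment adds one column in place.
df['prism'] = prism

# pd.concat with axis=1 places whole frames side by side, aligned on the index.
extra = pd.DataFrame({'scaled': basal * 10})
df = pd.concat([df, extra], axis=1)
print(df)
```

Unlike the add approach, neither of these performs arithmetic, so they also work for non-numeric columns.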
