How to extract duplicate values in each column separately? - python

I want to extract only the values that occur two or more times in each column, treating each column separately, and write them to a separate file with the column header.
Example file (the actual CSV file is 1.5 GB; this is a small sample of it):
The first row is the header row.
AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3
I have tried writing code in R and in Python with pandas, but failed to get the result.
Expected outcome:
AO1   BO1   CO1   DO1   EO1   FO1
pep2  xcv4  iop3  typ3  ert3  rtf5
pep2  xcv4  iop3  typ3  ert3  rtf5
pep2  xcv4        typ3        rtf5
                  wer3        rtf5
                  wer3        rtf5

import pandas as pd
from io import StringIO  # Python 3; the standalone StringIO module is Python 2 only
df = pd.read_csv(StringIO("""AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3"""))
d = {}
for col in df.columns:
    # count each value once, then keep the values seen at least twice
    counts = df[col].value_counts()
    repeated_values = counts[counts >= 2].index.tolist()
    cond = df[col].isin(repeated_values)
    d[col] = df[cond][col]
# one column per original column; rows without a repeated value are NaN
final = pd.concat(d, axis=1)
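
The question also asks for the repeated values to be written to separate files with the column header, which the loop above does not do. A minimal sketch of that step follows, assuming one output file per column; the helper names and the `<column>_repeated.csv` naming scheme are hypothetical, and a shortened two-column sample stands in for the real file.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd


def repeated_values(df):
    """Map each column name to the values that occur at least twice in it."""
    # duplicated(keep=False) flags every occurrence of a repeated value,
    # not only the second and later ones.
    return {col: df.loc[df[col].duplicated(keep=False), col] for col in df.columns}


def write_repeated(df, out_dir):
    """Write one CSV per column; the column name becomes the header line."""
    paths = []
    for col, values in repeated_values(df).items():
        path = Path(out_dir) / f"{col}_repeated.csv"  # hypothetical naming scheme
        values.to_csv(path, index=False, header=True)
        paths.append(path)
    return paths


# Shortened sample standing in for the real 1.5 GB file.
sample = io.StringIO("AO1,BO1\npep2,red2\nghp2,xcv4\npep2,xcv4\ndfg4,xcv4\n")
df = pd.read_csv(sample)
out_dir = tempfile.mkdtemp()  # throwaway directory for the demonstration
paths = write_repeated(df, out_dir)
print(paths[0].read_text())
```

For a file of that size it may help to read one column at a time with the `usecols` argument of `pd.read_csv`, so that only a single column is held in memory; whether that is needed depends on the available RAM.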

df <- data.table::fread('AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3'
, data.table = FALSE)
lapply(df, function (x) x[duplicated(x) | duplicated(x, fromLast = TRUE)])
You could also write a CSV directly inside the lapply call.

Related

Is there any method to replace specific data in a column without breaking its structure or splitting it

Hi there, I am trying to figure out how to replace specific data in a CSV file. I have a file that holds the base (location) data for the IDs:
https://store8.gofile.io/download/5b031959-e0b0-4dbf-aec6-264e0b87fd09/service%20block.xlsx (sheet 2 has the data).
The file in which I want to replace data, using the ID, is below:
https://store8.gofile.io/download/6e13a19a-bac8-4d16-8692-e4435eed2a08/Serp.csv
The highlighted part needs to be deleted after the location is filled in.
import pandas as pd
df1 = pd.read_excel("serp.xlsx", header=None)
df2 = pd.read_excel("flocnam.xlsx", header=None)
# each row is one ";"-separated string; split it into real columns
df1 = df1[0].str.split(";", expand=True)
# keep only the last whitespace-separated token of column 4
df1[4] = df1[4].apply(lambda x: v[-1] if (v := x.split()) else "")
df2[1] = df2[1].apply(lambda x: x.split("-")[0])
m = dict(zip(df2[1], df2[0]))
df1[4] = df1[4].replace(m)
print(df1)
df1.to_csv("test.csv")
It worked, but not how I wanted:
https://store8.gofile.io/download/c0ae7e05-c0e2-4f43-9d13-da12ddf73a8d/test.csv
I am trying to replace it like this (desired output).
Thank you for being a supportive community ❤️
If I understand correctly, you simply need to specify ; as the separator when writing the file:
>>> df.to_csv('test.csv', sep=';', index_label=False)
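
To illustrate why the separator matters, here is a minimal sketch of the round trip: split the `;`-delimited strings, replace a value, and write the result back with the same delimiter. The sample rows and the replacement mapping are made up for illustration, and the output goes to an in-memory buffer instead of `test.csv`.

```python
import io

import pandas as pd

# Made-up rows standing in for the ";"-delimited source file.
raw = pd.DataFrame({0: ["1;alpha;x", "2;beta;y"]})

# Split each string into separate columns, as in the question.
parts = raw[0].str.split(";", expand=True)

# Replace whole values in the last column using a lookup dict (hypothetical mapping).
parts[2] = parts[2].replace({"x": "LOC-X"})

# Write back with the original delimiter so the file keeps its structure.
buf = io.StringIO()
parts.to_csv(buf, sep=";", index=False, header=False)
lines = buf.getvalue().splitlines()
print(lines)
```

Without `sep=";"` the default comma would be used, which is likely why the earlier output no longer looked like the input.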

.replace codes will not replace column with new column in python

I am trying to read a column from a CSV file in Python and create a new column from it.
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
df = pd.DataFrame(data={'WT_RESIDUE':['']})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv (r'C:\Users\User\Documents\Research\output.csv')
I tried this, but it will not create a new column no matter what I do.
It seems like you made a silly mistake
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
df = pd.DataFrame(data={'WT_RESIDUE':['']}) # Why do you have this line?
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv (r'C:\Users\User\Documents\Research\output.csv')
Try removing the line with the comment. AFAIK, it replaces the DataFrame you just read from the CSV with a new one-row DataFrame whose WT_RESIDUE value is an empty string, so there is nothing left to replace.
Consider a sample of the provided input.
We can use the map function to look up each value of the existing column in the dict and store the corresponding value in a new column. Values without a matching key, such as REMARK, become NaN.
df = pd.DataFrame({
    'WT_RESIDUE': ['ALA', "REMARK", 'VAL', "LYS"]
})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df.WT_RESIDUE.map(codes)
Input
  WT_RESIDUE
0        ALA
1     REMARK
2        VAL
3        LYS
Output
  WT_RESIDUE MUTATION_CODE
0        ALA             A
1     REMARK           NaN
2        VAL             V
3        LYS             K
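
The two answers differ in how unmatched values are handled, which a short side-by-side sketch may make clearer. It uses the same sample values with a shortened codes dict (an assumption made for brevity).

```python
import pandas as pd

# Shortened lookup table; the full dict from the question works the same way.
codes = {"ALA": "A", "VAL": "V", "LYS": "K"}
df = pd.DataFrame({"WT_RESIDUE": ["ALA", "REMARK", "VAL", "LYS"]})

# map: values without a key in the dict become NaN.
mapped = df["WT_RESIDUE"].map(codes)

# replace: values without a key in the dict are left unchanged.
replaced = df["WT_RESIDUE"].replace(codes)

print(pd.DataFrame({"map": mapped, "replace": replaced}))
```

Which one fits depends on whether rows such as REMARK should be flagged as missing or carried through as they are.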

Pandas DataFrame combine rows by column value, where Date Rows are NULL

Scenario:
Parse a PDF bank statement and transform it into a clean, formatted CSV file.
What I've tried:
I managed to parse the PDF file (tabular format) using the camelot library, but failed to produce the desired formatting.
Code:
import camelot
import pandas as pd
tables = camelot.read_pdf('test.pdf', pages='3')
for i, table in enumerate(tables):
    print(f'table_id:{i}')
    print(f'page:{table.page}')
    print(f'coordinates:{table._bbox}')
tables = camelot.read_pdf('test.pdf', flavor='stream', pages='3')
# assumption: the original never defines df; take the first parsed table
df = tables[0].df
# promote the first row to the header
columns = df.iloc[0]
df.columns = columns
df = df.drop(0)
df.head()
for c in df.select_dtypes('object').columns:
    # regex=False so that '$' is a literal dollar sign, not an end-of-string anchor
    df[c] = df[c].str.replace('$', '', regex=False)
    df[c] = df[c].str.replace('-', '', regex=False)
def convert_to_float(num):
    try:
        return float(num.replace(',', ''))
    except (AttributeError, ValueError):
        # blank or non-numeric cells become 0
        return 0
for col in ['Deposits', 'Withdrawals', 'Balance']:
    df[col] = df[col].map(convert_to_float)
My_Result:
Desired_Output:
The logic I came up with is to merge each row whose Date column is NaN into the row above it (row n-1), but I don't know whether this logic is right. Can anyone help me sort this out properly?
I tried pandas groupby and aggregation functions, but they only merge the whole data set and remove NaN and duplicate dates, which is not suitable because every entry is necessary.
Using transform:
df.loc[~df.Date.isna(), 'group'] = 1
g = df.group.fillna(0).cumsum()
df['Description'] = df.groupby(g)['Description'].transform(' '.join)
new_df = df.loc[~df['Date'].isna()]
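
The grouping trick above can be seen on a small made-up statement, where continuation rows have NaN in Date. This is only a sketch: the sample rows are invented, and it assumes missing dates are real NaN values and that Description has no missing entries.

```python
import numpy as np
import pandas as pd

# Invented rows: a dated row starts an entry, NaN-dated rows continue it.
df = pd.DataFrame({
    "Date": ["01/02", np.nan, "01/03", np.nan, np.nan],
    "Description": ["POS PURCHASE", "STORE 123", "TRANSFER", "TO SAVINGS", "REF 42"],
    "Balance": [100.0, np.nan, 50.0, np.nan, np.nan],
})

# Every dated row increments the counter, so each entry and its
# continuation rows share one group id.
group_id = df["Date"].notna().cumsum()

# Join all description fragments of a group and broadcast the result.
df["Description"] = df.groupby(group_id)["Description"].transform(" ".join)

# Keep only the dated rows, which now carry the full description.
merged = df.loc[df["Date"].notna()].reset_index(drop=True)
print(merged)
```

If the parser yields empty strings instead of NaN for missing dates, they would need to be converted first, for example with `df["Date"].replace("", np.nan)`.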

Splitting text and numbers in dataframe in python

I have a dataframe df whose second column is named 'col', and the data looks like:
Dataframe
I want to separate the text part into one column named "Casing Size" and the numerical part into another column named "DepthTo".
Desired Output
import pandas as pd
import io
from google.colab import files
uploaded = files.upload()
df = pd.read_excel(io.BytesIO(uploaded['Test-Checking.xlsx']))
#Method 1
df2 = pd.DataFrame(data=df, columns=['col'])
df2 = df2.col.str.extract('([a-zA-Z]+)([^a-zA-Z]+)', expand=True)
df2.columns = ['CasingSize', 'DepthTo']
df2
#Method 2
def split_col(x):
    try:
        numb = float(x.split()[0])
        txt = x.split()[1]
    except:
        numb = float(x.split()[1])
        txt = x.split()[0]
    x['col1'] = txt
    x['col2'] = numb
df2['col1'] = df.col.apply(split_col)
df2
I tried two methods, but neither of them works correctly. Can anyone help me?
Code in Google Colab
Excel File Attached
Try this.
First, you need to return the values from your function. Then you can unpack them into your columns using tolist():
def sample(x):
    b, y = x.split()
    return b, y
temp_df = df2['col'].apply(sample)
df2[['col1', 'col2']] = pd.DataFrame(temp_df.tolist())
You could try splitting the values into a list, then sorting them so that the numerical part comes first (digits sort before letters). Then you could apply pd.Series and assign back to the two columns, number first and text second.
import pandas as pd
df = pd.DataFrame({'col':["PWT 69.2", '283.5 HWT', '62.9 PWT', '284 HWT']})
# after sorting, position 0 is the number and position 1 is the text
df[['DepthTo','Casing Size']] = df['col'].str.split().apply(lambda x: sorted(x)).apply(pd.Series)
print(df)
Output
         col DepthTo Casing Size
0   PWT 69.2    69.2         PWT
1  283.5 HWT   283.5         HWT
2   62.9 PWT    62.9         PWT
3    284 HWT     284         HWT
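
As an alternative that does not depend on token order or sort order, the two parts can be pulled out with separate regular expressions. This is a sketch under the assumption that every cell contains exactly one run of letters and one plain decimal number, as in the sample values above.

```python
import pandas as pd

df = pd.DataFrame({"col": ["PWT 69.2", "283.5 HWT", "62.9 PWT", "284 HWT"]})

# First run of letters, wherever it appears in the cell.
df["Casing Size"] = df["col"].str.extract(r"([A-Za-z]+)", expand=False)

# First integer or decimal number, converted to float.
df["DepthTo"] = df["col"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)

print(df)
```

Cells that do not match a pattern come back as NaN, which makes malformed rows easy to spot.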

Parsing data in Excel using python

In Excel, I have to separate the following value from one cell into two:
2016-12-12 (r=0.1)
2016-12-13* (r=0.7)
How do I do that in Python so that in the Excel file, dates and "r=#" will be in different cells? And also, is there a way to automatically remove the "*" sign?
This task is pretty straightforward if you use pandas.
Build a test file:
import pandas as pd
df_out = pd.DataFrame(
    ['2016-12-12 (r=0.1)', '2016-12-13* (r=0.7)'], columns=['data'])
df_out.to_excel('test.xlsx')
Code to convert string:
def convert_date(row):
    # split on whitespace, then strip '*', '(' and ')' from each piece
    return pd.Series([c.strip('*').strip('(').strip(')')
                      for c in row.split()])
Test code:
# read in test file
df_in = pd.read_excel('test.xlsx')
print(df_in)
# build a new dataframe
df_new = df_in['data'].apply(convert_date)
df_new.columns = ['date', 'r']
print(df_new)
# save the dataframe
df_new.to_excel('test2.xlsx')
Results:
                  data
0   2016-12-12 (r=0.1)
1  2016-12-13* (r=0.7)
         date      r
0  2016-12-12  r=0.1
1  2016-12-13  r=0.7
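
The same split can also be sketched without a row-wise apply, using a single regular expression with named groups. The pattern below is an assumption based on the two sample values (an ISO date, an optional "*", then the parenthesised part) and would need adjusting for other layouts.

```python
import pandas as pd

df = pd.DataFrame({"data": ["2016-12-12 (r=0.1)", "2016-12-13* (r=0.7)"]})

# Named groups become the column names of the extracted frame;
# the optional "*" sits outside both groups, so it is dropped.
pattern = r"^(?P<date>\d{4}-\d{2}-\d{2})\*?\s*\((?P<r>[^)]*)\)$"
parts = df["data"].str.extract(pattern)

print(parts)
```

The resulting frame can be saved with `parts.to_excel(...)` in the same way as `df_new` above.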
