Pandas not able to merge the files - Python

I am trying to merge two files. I am supplying the headers myself because they are not picked up when I merge the files using concatenate, and I get an error when I try to drop a column:
ValueError: labels ['lh.aparc.a2009s.meancurv'] not contained in axis
Therefore I am trying the method below.
The headers are important because I want to compute the average, mean, etc. on the basis of these headers.
But currently the result file looks like this:
CSV 1 looks like this (screenshot omitted); CSV 2 looks the same, just with rh.
#!/bin/bash
ls -d */ | sed -e "s/\///g" | grep -v "Results" | grep -v "Output">> subjects.txt;
module unload freesurfer
module load freesurfer/5.3.0
module load python
export SUBJECTS_DIR=/N/u/shrechak/Karst/GENFL_FREESURFER53_KARST_RES
source $FREESURFER_HOME/FreeSurferEnv.sh
aparcstats2table --hemi lh --subjectsfile=subjects.txt --parc aparc.a2009s --meas meancurv --tablefile lh.a2009s.meancurv.txt
aparcstats2table --hemi rh --subjectsfile=subjects.txt --parc aparc.a2009s --meas meancurv --tablefile rh.a2009s.meancurv.txt
for f in *.txt; do
mv "$f" "${f%.txt}.csv"
done
python <<END_OF_PYTHON
import csv
import pandas as pd
names= ["meancurv",
"lh_G_and_S_frontomargin_meancurv",
"lh_G_and_S_occipital_inf_meancurv",
"lh_G_and_S_paracentral_meancurv",
"lh_G_and_S_subcentral_meancurv",
"lh_G_and_S_transv_frontopol_meancurv",
"lh_G_and_S_cingul-ant_meancurv",
"lh_G_and_S_cingul-Mid-Ant_meancurv",
"lh_G_and_S_cingul-Mid-Post_meancurv",
"lh_G_cingul-Post-dorsal_meancurv",
"lh_G_cingul-Post-ventral_meancurv",
"lh_G_cuneus_meancurv",
"lh_G_front_inf-Opercular_meancurv",
"lh_G_front_inf-orbital_meancurv",
"lh_G_front_inf-Triangul_meancurv",
"lh_G_front_middle_meancurv",
"lh_G_front_sup_meancurv",
"lh_G_Ins_lg_and_S_cent_ins_meancurv",
"lh_G_insular_short_meancurv",
"lh_G_occipital_middle_meancurv",
"lh_G_occipital_sup_meancurv",
"lh_G_oc-temp_lat-fusifor_meancurv",
"lh_G_oc-temp_med-Lingual_meancurv",
"lh_G_oc-temp_med-Parahip_meancurv",
"lh_G_orbital_meancurv",
"lh_G_pariet_infoangular_meancurv",
"lh_G_pariet_infSupramar_meancurv",
"lh_G_parietal_sup_meancurv",
"lh_G_postcentral_meancurv",
"lh_G_precentral_meancurv",
"lh_G_precuneus_meancurv",
"lh_G_rectus_meancurv",
"lh_G_subcallosal_meancurv",
"lh_G_temp_sup-G_T_transv_meancurv",
"lh_G_temp_sup-Lateral_meancurv",
"lh_G_temp_sup-Plan_polar_meancurv",
"lh_G_temp_supPlan_tempo_meancurv",
"lh_G_temporal_inf_meancurv",
"lh_G_temporal_middle_meancurv",
"lh_Lat_Fis-ant-Horizont_meancurv",
"lh_Lat_Fis-ant-Vertical_meancurv",
"lh_Lat_Fispost_meancurv",
"lh_Pole_occipital_meancurv",
"lh_Pole_temporal_meancurv",
"lh_S_calcarine_meancurv",
"lh_S_central_meancurv",
"lh_S_cingulMarginalis_meancurv",
"lh_S_circular_insula_ant_meancurv",
"lh_S_circular_insula_inf_meancurv",
"lh_S_circular_insula_sup_meancurv",
"lh_S_collat_transv_ant_meancurv",
"lh_S_collat_transv_post_meancurv",
"lh_S_front_inf_meancurv",
"lh_S_front_middle_meancurv",
"lh_S_front_sup_meancurv",
"lh_S_interm_prim-Jensen_meancurv",
"lh_S_intrapariet_and_P_trans_meancurv",
"lh_S_oc_middle_and_Lunatus_meancurv",
"lh_S_oc_sup_and_transversal_meancurv",
"lh_S_occipital_ant_meancurv",
"lh_S_oc-temp_lat_meancurv",
"lh_S_oc-temp_med_and_Lingual_meancurv",
"lh_S_orbital_lateral_meancurv",
"lh_S_orbital_med-olfact_meancurv",
"lh_S_orbital-H_Shaped_meancurv",
"lh_S_parieto_occipital_meancurv",
"lh_S_pericallosal_meancurv",
"lh_S_postcentral_meancurv",
"lh_S_precentral-inf-part_meancurv",
"lh_S_precentral-sup-part_meancurv",
"lh_S_suborbital_meancurv",
"lh_S_subparietal_meancurv",
"lh_S_temporal_inf_meancurv",
"lh_S_temporal_sup_meancurv",
"lh_S_temporal_transverse_meancurv"]
df1 = pd.read_csv('lh.a2009s.meancurv.csv', header = None, names = names)
names1 = ["meancurv",
"rh_G_and_S_frontomargin_meancurv",
"rh_G_and_S_occipital_inf_meancurv",
"rh_G_and_S_paracentral_meancurv",
"rh_G_and_S_subcentral_meancurv",
"rh_G_and_S_transv_frontopol_meancurv",
"rh_G_and_S_cingul-Ant_meancurv",
"rh_G_and_S_cingul-Mid-Ant_meancurv",
"rh_G_and_S_cingul-Mid-Post_meancurv",
"rh_G_cingul-Post-dorsal_meancurv",
"rh_G_cingul-Post-ventral_meancurv",
"rh_G_cuneus_meancurv",
"rh_G_front_inf-Opercular_meancurv",
"rh_G_front_inf-Orbital_meancurv",
"rh_G_front_inf-Triangul_meancurv",
"rh_G_front_middle_meancurv",
"rh_G_front_sup_meancurv",
"rh_G_Ins_lg_and_S_cent_ins_meancurv",
"rh_G_insular_short_meancurv",
"rh_G_occipital_middle_meancurv",
"rh_G_occipital_sup_meancurv",
"rh_G_oc-temp_lat-fusifor_meancurv",
"rh_G_oc-temp_med-Lingual_meancurv",
"rh_G_oc-temp_med-Parahip_meancurv",
"rh_G_orbital_meancurv",
"rh_G_pariet_inf-Angular_meancurv",
"rh_G_pariet_inf-Supramar_meancurv",
"rh_G_parietal_sup_meancurv",
"rh_G_postcentral_meancurv",
"rh_G_precentral_meancurv",
"rh_G_precuneus_meancurv",
"rh_G_rectus_meancurv",
"rh_G_subcallosal_meancurv",
"rh_G_temp_sup-G_T_transv_meancurv",
"rh_G_temp_sup-Lateral_meancurv",
"rh_G_temp_sup-Plan_polar_meancurv",
"rh_G_temp_sup-Plan_tempo_meancurv",
"rh_G_temporal_inf_meancurv",
"rh_G_temporal_middle_meancurv",
"rh_Lat_Fis-ant-Horizont_meancurv",
"rh_Lat_Fis-ant-Vertical_meancurv",
"rh_Lat_Fis-post_meancurv",
"rh_Pole_occipital_meancurv",
"rh_Pole_temporal_meancurv",
"rh_S_calcarine_meancurv",
"rh_S_central_meancurv",
"rh_S_cingulMarginalis_meancurv",
"rh_S_circular_insula_ant_meancurv",
"rh_S_circular_insula_inf_meancurv",
"rh_S_circular_insula_sup_meancurv",
"rh_S_collat_transv_ant_meancurv",
"rh_S_collat_transv_post_meancurv",
"rh_S_front_inf_meancurv",
"rh_S_front_middle_meancurv",
"rh_S_front_sup_meancurv",
"rh_S_interm_prim-Jensen_meancurv",
"rh_S_intrapariet_and_P_trans_meancurv",
"rh_S_oc_middle_and_Lunatus_meancurv",
"rh_S_oc_sup_and_transversal_meancurv",
"rh_S_occipital_ant_meancurv",
"rh_S_oc-temp_lat_meancurv",
"rh_S_oc-temp_med_and_Lingual_meancurv",
"rh_S_orbital_lateral_meancurv",
"rh_S_orbital_med-olfact_meancurv",
"rh_S_orbital-H_Shaped_meancurv",
"rh_S_parieto_occipital_meancurv",
"rh_S_pericallosal_meancurv",
"rh_S_postcentral_meancurv",
"rh_S_precentral-inf-part_meancurv",
"rh_S_precentral-sup-part_meancurv",
"rh_S_suborbital_meancurv",
"rh_S_subparietal_meancurv",
"rh_S_temporal_inf_meancurv",
"rh_S_temporal_sup_meancurv",
"rh_S_temporal_transverse_meancurv"
]
df2 = pd.read_csv('rh.a2009s.meancurv.csv', header = None, names = names1)
result = pd.merge(df1, df2, on='meancurv', how='outer')
result.to_csv('result.csv')
END_OF_PYTHON
echo "goodbye!";

So you want to skip the first row and only pull the data parts.
Here's an MCVE.
Code:
import io
import pandas as pd
csv1 = io.StringIO(u'''
a,b,c
1,4,7
2,5,8
3,6,9
''')
df = pd.read_csv(csv1, names=['d', 'e', 'f'], skiprows=[1])
print(df)
Output:
d e f
0 1 4 7
1 2 5 8
2 3 6 9
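Applied to the question's own files, a minimal sketch (assuming the lh/rh CSVs and the names/names1 lists from the question) would skip the header line that aparcstats2table writes instead of reading it as data:
import pandas as pd

# Minimal sketch, assuming the files and the names/names1 lists from the question.
# skiprows=[0] discards the original header row so the supplied names are the
# only header, and the first column (the subject identifier) keeps the common
# name 'meancurv' in both frames.
df1 = pd.read_csv('lh.a2009s.meancurv.csv', header=None, names=names, skiprows=[0])
df2 = pd.read_csv('rh.a2009s.meancurv.csv', header=None, names=names1, skiprows=[0])

result = pd.merge(df1, df2, on='meancurv', how='outer')
result.to_csv('result.csv', index=False)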

Here's a way you can merge two files together while keeping the headers from one of the files after merging.
Say you're keeping files in a list 'files':
import pandas as pd

files = ['file1.csv', 'file2.csv']  # keep files here
finalDF = pd.DataFrame()  # this is an empty dataframe
for file in files:
    thisDF = pd.read_csv(file)
    # note: DataFrame.append is deprecated in newer pandas; pd.concat is the replacement
    finalDF = finalDF.append(thisDF, ignore_index=True)
Now, if you want, try these two lines. Say you want to check the headers with a simple head():
print(finalDF.head())
And if you want to write this merged data frame to a CSV file:
finalDF.to_csv('merged-file.csv', encoding="utf-8", index=False)
As for skipping rows: are you trying to skip rows before or after merging? Let me know and I can try to help with that too.
Example:
file1.csv:
,column1,column2,column3,column4,Date,Device,sample_site
2,14888,0.060011931,248084,13.40535464,3/15/2017,DESKTOP,http://www.example1.com
11,1358,0.033212679,40888,7.465099785,3/15/2017,MOBILE,http://www.example2.com
23,130,0.02998155,4336,8.337638376,3/15/2017,TABLET,http://www.example3.com
file2.csv:
,column1,column2,column3,column4,Date,Device,sample_site
35,2685,0.034564882,77680,10.97812822,3/15/2017,DESKTOP,https://www.example4.com
45,280,0.026197605,10688,7.801272455,3/15/2017,MOBILE,https://www.example5.com
54,24,0.022878932,1049,8.202097235,3/15/2017,TABLET,https://www.example6.com
merged-file.csv:
Unnamed: 0,column1,column2,column3,column4,Date,Device,sample_site
2,14888,0.060011931,248084,13.40535464,3/15/2017,DESKTOP,http://www.example1.com
11,1358,0.033212679,40888,7.465099785,3/15/2017,MOBILE,http://www.example2.com
23,130,0.02998155,4336,8.337638376,3/15/2017,TABLET,http://www.example3.com
35,2685,0.034564882,77680,10.97812822,3/15/2017,DESKTOP,https://www.example4.com
45,280,0.026197605,10688,7.801272455,3/15/2017,MOBILE,https://www.example5.com
54,24,0.022878932,1049,8.202097235,3/15/2017,TABLET,https://www.example6.com
Reply:
Are you trying to merge data based on a column? In that case you can concat, or merge with a join, based on an axis.
Say, for example:
pd.concat([df1, df2])  # add axis and join type if necessary
Here's the documentation to help you understand: merging and concat in pandas
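For the two hemisphere tables in the original question, a hedged sketch of the merge-on-a-key approach (file names taken from the question; the first column of each table is assumed to hold the subject identifier) might look like:
import pandas as pd

# Read both tables with their own header rows intact.
lh = pd.read_csv('lh.a2009s.meancurv.csv')
rh = pd.read_csv('rh.a2009s.meancurv.csv')

# Give the subject-identifier column a common name in both frames.
lh = lh.rename(columns={lh.columns[0]: 'subject'})
rh = rh.rename(columns={rh.columns[0]: 'subject'})

# One row per subject, lh and rh measures side by side.
merged = pd.merge(lh, rh, on='subject', how='outer')
merged.to_csv('lh_rh_meancurv.csv', index=False)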

Related

Is there any method to replace specific data from a column without breaking its structure or splitting it

Hi there, I am trying to figure out how to replace specific data in a CSV file. I have a file which is the base or location data of IDs:
https://store8.gofile.io/download/5b031959-e0b0-4dbf-aec6-264e0b87fd09/service%20block.xlsx (sheet 2 has the data).
The file in which I want to replace data using the ID is below:
https://store8.gofile.io/download/6e13a19a-bac8-4d16-8692-e4435eed2a08/Serp.csv
The highlighted part needs to be deleted after filling in the location.
import pandas as pd
df1 = pd.read_excel("serp.xlsx", header=None)
df2 = pd.read_excel("flocnam.xlsx", header=None)
df1 = df1[0].str.split(";", expand=True)
df1[4] = df1[4].apply(lambda x: v[-1] if (v := x.split()) else "")
df2[1] = df2[1].apply(lambda x: x.split("-")[0])
m = dict(zip(df2[1], df2[0]))
df1[4] = df1[4].replace(m)
print(df1)
df1.to_csv("test.csv")
It worked, but not how I wanted:
https://store8.gofile.io/download/c0ae7e05-c0e2-4f43-9d13-da12ddf73a8d/test.csv
I am trying to replace it like this (desired output).
Thank you for being a supportive community ❤️
If I understand correctly, you simply need to specify the separator ;
>>> df.to_csv('test.csv', sep=';', index_label=False)
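As a slightly fuller sketch (hypothetical, reusing df1 from the question's code), writing with the ; separator re-joins the split columns into the original one-record-per-line layout, and reading with the same separator restores them:
import pandas as pd

# Hypothetical sketch: write the split frame back out semicolon-delimited.
df1.to_csv("test.csv", sep=";", index=False, header=False)

# Reading it back with the same separator restores the columns.
check = pd.read_csv("test.csv", sep=";", header=None)
print(check.head())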

How to extract duplicate values in each column separately?

I want to extract only the values with two or more occurrences in each column separately and write them to a separate file with the column header.
Example file (the actual CSV file is 1.5 GB; here is a summary of it):
The first row is the header row for each column.
AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3
I have tried to write code in R and even Python pandas but failed to get the result.
Expected outcome:
AO1 BO1 CO1 DO1 EO1 FO1
pep2 xcv4 iop3 typ3 ert3 rtf5
pep2 xcv4 iop3 typ3 ert3 rtf5
pep2 xcv4 typ3 rtf5
wer3 rtf5
wer3 rtf5
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("""AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3"""))

d = {}
for col in df.columns:
    # keep only the values that appear at least twice in this column
    repeated_values = df[col].value_counts()[df[col].value_counts() >= 2].index.tolist()
    cond = df[col].isin(repeated_values)
    d[col] = df[cond][col]

final = pd.concat(d, axis=1)
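To also write the result out, as the question asks, here is a minimal sketch assuming the final frame built above (file names are hypothetical):
# One file per column, dropping the NaN padding that pd.concat adds for
# columns with fewer duplicates.
for col in final.columns:
    final[col].dropna().to_csv('duplicates_{}.csv'.format(col), index=False, header=True)

# Or everything in a single file with the original column headers.
final.to_csv('duplicates_all.csv', index=False)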
Or, in R:
df <- data.table::fread('AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3'
, data.table = FALSE)
lapply(df, function (x) x[duplicated(x) | duplicated(x, fromLast = T)])
You could write a csv directly in the lapply call as well

Parsing data in Excel using Python

In Excel, I have to separate the following value from one cell into two:
2016-12-12 (r=0.1)
2016-12-13* (r=0.7)
How do I do that in Python so that in the Excel file, dates and "r=#" will be in different cells? And also, is there a way to automatically remove the "*" sign?
This task is pretty straightforward if you use pandas:
Build a test file:
import pandas as pd
df_out = pd.DataFrame(
    ['2016-12-12 (r=0.1)', '2016-12-13* (r=0.7)'], columns=['data'])
df_out.to_excel('test.xlsx')
Code to convert string:
def convert_date(row):
    return pd.Series([c.strip('*').strip('(').strip(')')
                      for c in row.split()])
Test code:
# read in test file
df_in = pd.read_excel('test.xlsx')
print(df_in)
# build a new dataframe
df_new = df_in['data'].apply(convert_date)
df_new.columns = ['date', 'r']
print(df_new)
# save the dataframe
df_new.to_excel('test2.xlsx')
Results:
data
0 2016-12-12 (r=0.1)
1 2016-12-13* (r=0.7)
date r
0 2016-12-12 r=0.1
1 2016-12-13 r=0.7
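An alternative sketch, under the same assumptions about the input format, using a vectorized regular expression instead of the helper function (the pattern below also drops a trailing * on the date):
import pandas as pd

df_in = pd.read_excel('test.xlsx')

# Extract the date and the r=... value in one pass; '\*?' swallows an
# optional trailing asterisk on the date.
df_new = df_in['data'].str.extract(r'^(?P<date>\d{4}-\d{2}-\d{2})\*?\s+\((?P<r>r=[\d.]+)\)$')
print(df_new)
df_new.to_excel('test2.xlsx')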

How to Copy the Matching Columns between CSV Files Using Pandas?

I have two dataframes(f1_df and f2_df):
f1_df looks like:
ID,Name,Gender
1,Smith,M
2,John,M
f2_df looks like:
name,gender,city,id
Problem:
I want the code to compare the header of f1_df with f2_df by itself and copy the data of the matching columns using pandas.
Output:
the output should be like this:
name,gender,city,id # name,gender,and id are the only matching columns btw f1_df and f2_df
Smith,M, ,1 # the data copied for name, gender, and id columns
John,M, ,2
I am new to Pandas and not sure how to handle the problem. I have tried to do an inner join to the matching columns, but that did not work.
Here is what I have so far:
import pandas as pd

f1_df = pd.read_csv("file1.csv")
f2_df = pd.read_csv("file2.csv")

for i in f1_df:
    for j in f2_df:
        i = i.lower()
        if i == j:
            joined = f1_df.join(f2_df)
            print(joined)
Any idea how to solve this?
Try this if you want to merge/join your DFs on common columns.
First, let's convert all columns to lower case:
df1.columns = df1.columns.str.lower()
df2.columns = df2.columns.str.lower()
Now we can join on the common columns:
common_cols = df2.columns.intersection(df1.columns).tolist()
joined = df1.set_index(common_cols).join(df2.set_index(common_cols)).reset_index()
Output:
In [259]: joined
Out[259]:
id name gender city
0 1 Smith M NaN
1 2 John M NaN
export to CSV:
In [262]: joined.to_csv('c:/temp/joined.csv', index=False)
c:/temp/joined.csv:
id,name,gender,city
1,Smith,M,
2,John,M,
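Another hedged sketch for the same result: instead of a join, lower-case f1_df's headers and reindex to f2_df's column order, so columns that exist only in f2_df (here city) simply come out empty:
import pandas as pd

f1_df = pd.read_csv("file1.csv")
f2_df = pd.read_csv("file2.csv")

# Lower-case f1_df's headers so they line up with f2_df's.
f1_df.columns = f1_df.columns.str.lower()

# Reindex to f2_df's column order; columns missing from f1_df are filled
# with an empty string rather than raising an error.
result = f1_df.reindex(columns=f2_df.columns, fill_value="")
result.to_csv("merged.csv", index=False)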

How to convert a link list to a matrix in Python

My input data (input.txt) looks like:
AGAP2 TCGA-BL-A0C8-01A-11R-A10U-07 66.7328
AGAP2 TCGA-BL-A13I-01A-11R-A13Y-07 186.8366
AGAP3 TCGA-BL-A13J-01A-11R-A10U-07 183.3767
AGAP3 TCGA-BL-A3JM-01A-12R-A21D-07 33.2927
AGAP3 TCGA-BT-A0S7-01A-11R-A10U-07 57.9040
AGAP3 TCGA-BT-A0YX-01A-11R-A10U-07 99.8540
AGAP4 TCGA-BT-A20J-01A-11R-A14Y-07 88.8278
AGAP4 TCGA-BT-A20N-01A-11R-A14Y-07 129.7021
I want output.txt to look like:
TCGA-BL-A0C8-01A-11R-A10U-07 TCGA-BL-A13I-01A-11R-A13Y-07 ...
AGAP2 66.7328 186.8366
AGAP3 0 0
Using pandas: read the CSV, create a pivot, and write the CSV.
import pandas as pd
df = pd.read_table("input.txt", names="xy", sep=r'\s+')
# reset index first - we need named column
new = df.reset_index().pivot(index="index", columns='x', values='y')
new.fillna(0, inplace=True)
new.to_csv("output.csv", sep='\t') # tab separated
Reshaping and Pivot Tables
EDIT: filling empty values
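An equivalent sketch using pivot_table, which builds the matrix and fills the gaps in one call (the column names gene, sample, and value are illustrative assumptions; the file has three whitespace-separated columns):
import pandas as pd

# Read the three whitespace-separated columns.
df = pd.read_csv("input.txt", sep=r"\s+", names=["gene", "sample", "value"])

# pivot_table builds the gene-by-sample matrix directly; missing
# combinations are filled with 0 (duplicate entries would be averaged).
matrix = df.pivot_table(index="gene", columns="sample", values="value", fill_value=0)
matrix.to_csv("output.txt", sep="\t")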
