Pandas not able to merge the files - Python
I am trying to merge two files. I have to supply the headers myself because they are not picked up when I merge the files using concatenate, and I get an error when I try to drop a column:
ValueError: labels ['lh.aparc.a2009s.meancurv'] not contained in axis
Therefore I am trying the method below. The headers are important because I want to compute the average, mean, etc. on the basis of these headers.
But currently the result file does not come out as expected. (Screenshots of the result file and of CSV 1 were attached; CSV 2 looks the same as CSV 1, just with rh instead of lh.)
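For context, the drop error above usually means the label being dropped does not exactly match any column name in the frame. A minimal, hypothetical sketch of how to check the real column names and drop defensively (this is not part of the original script; the file name is taken from the script below, and it assumes the file keeps its own header row):

import pandas as pd

df = pd.read_csv('lh.a2009s.meancurv.csv')   # assumes the file has a header row
print(df.columns.tolist())                   # inspect the exact column labels first

# drop only if the label really exists; errors='ignore' avoids the ValueError
df = df.drop(columns=['lh.aparc.a2009s.meancurv'], errors='ignore')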
#!/bin/bash
ls -d */ | sed -e "s/\///g" | grep -v "Results" | grep -v "Output">> subjects.txt;
module unload freesurfer
module load freesurfer/5.3.0
module load python
export SUBJECTS_DIR=/N/u/shrechak/Karst/GENFL_FREESURFER53_KARST_RES
source $FREESURFER_HOME/FreeSurferEnv.sh
aparcstats2table --hemi lh --subjectsfile=subjects.txt --parc aparc.a2009s --meas meancurv --tablefile lh.a2009s.meancurv.txt
aparcstats2table --hemi rh --subjectsfile=subjects.txt --parc aparc.a2009s --meas meancurv --tablefile rh.a2009s.meancurv.txt
for f in *.txt; do
mv "$f" "${f%.txt}.csv"
done
python <<END_OF_PYTHON
import csv
import pandas as pd
names= ["meancurv",
"lh_G_and_S_frontomargin_meancurv",
"lh_G_and_S_occipital_inf_meancurv",
"lh_G_and_S_paracentral_meancurv",
"lh_G_and_S_subcentral_meancurv",
"lh_G_and_S_transv_frontopol_meancurv",
"lh_G_and_S_cingul-ant_meancurv",
"lh_G_and_S_cingul-Mid-Ant_meancurv",
"lh_G_and_S_cingul-Mid-Post_meancurv",
"lh_G_cingul-Post-dorsal_meancurv",
"lh_G_cingul-Post-ventral_meancurv",
"lh_G_cuneus_meancurv",
"lh_G_front_inf-Opercular_meancurv",
"lh_G_front_inf-orbital_meancurv",
"lh_G_front_inf-Triangul_meancurv",
"lh_G_front_middle_meancurv",
"lh_G_front_sup_meancurv",
"lh_G_Ins_lg_and_S_cent_ins_meancurv",
"lh_G_insular_short_meancurv",
"lh_G_occipital_middle_meancurv",
"lh_G_occipital_sup_meancurv",
"lh_G_oc-temp_lat-fusifor_meancurv",
"lh_G_oc-temp_med-Lingual_meancurv",
"lh_G_oc-temp_med-Parahip_meancurv",
"lh_G_orbital_meancurv",
"lh_G_pariet_infoangular_meancurv",
"lh_G_pariet_infSupramar_meancurv",
"lh_G_parietal_sup_meancurv",
"lh_G_postcentral_meancurv",
"lh_G_precentral_meancurv",
"lh_G_precuneus_meancurv",
"lh_G_rectus_meancurv",
"lh_G_subcallosal_meancurv",
"lh_G_temp_sup-G_T_transv_meancurv",
"lh_G_temp_sup-Lateral_meancurv",
"lh_G_temp_sup-Plan_polar_meancurv",
"lh_G_temp_supPlan_tempo_meancurv",
"lh_G_temporal_inf_meancurv",
"lh_G_temporal_middle_meancurv",
"lh_Lat_Fis-ant-Horizont_meancurv",
"lh_Lat_Fis-ant-Vertical_meancurv",
"lh_Lat_Fispost_meancurv",
"lh_Pole_occipital_meancurv",
"lh_Pole_temporal_meancurv",
"lh_S_calcarine_meancurv",
"lh_S_central_meancurv",
"lh_S_cingulMarginalis_meancurv",
"lh_S_circular_insula_ant_meancurv",
"lh_S_circular_insula_inf_meancurv",
"lh_S_circular_insula_sup_meancurv",
"lh_S_collat_transv_ant_meancurv",
"lh_S_collat_transv_post_meancurv",
"lh_S_front_inf_meancurv",
"lh_S_front_middle_meancurv",
"lh_S_front_sup_meancurv",
"lh_S_interm_prim-Jensen_meancurv",
"lh_S_intrapariet_and_P_trans_meancurv",
"lh_S_oc_middle_and_Lunatus_meancurv",
"lh_S_oc_sup_and_transversal_meancurv",
"lh_S_occipital_ant_meancurv",
"lh_S_oc-temp_lat_meancurv",
"lh_S_oc-temp_med_and_Lingual_meancurv",
"lh_S_orbital_lateral_meancurv",
"lh_S_orbital_med-olfact_meancurv",
"lh_S_orbital-H_Shaped_meancurv",
"lh_S_parieto_occipital_meancurv",
"lh_S_pericallosal_meancurv",
"lh_S_postcentral_meancurv",
"lh_S_precentral-inf-part_meancurv",
"lh_S_precentral-sup-part_meancurv",
"lh_S_suborbital_meancurv",
"lh_S_subparietal_meancurv",
"lh_S_temporal_inf_meancurv",
"lh_S_temporal_sup_meancurv",
"lh_S_temporal_transverse_meancurv"]
df1 = pd.read_csv('lh.a2009s.meancurv.csv', header = None, names = names)
names1 = ["meancurv",
"rh_G_and_S_frontomargin_meancurv",
"rh_G_and_S_occipital_inf_meancurv",
"rh_G_and_S_paracentral_meancurv",
"rh_G_and_S_subcentral_meancurv",
"rh_G_and_S_transv_frontopol_meancurv",
"rh_G_and_S_cingul-Ant_meancurv",
"rh_G_and_S_cingul-Mid-Ant_meancurv",
"rh_G_and_S_cingul-Mid-Post_meancurv",
"rh_G_cingul-Post-dorsal_meancurv",
"rh_G_cingul-Post-ventral_meancurv",
"rh_G_cuneus_meancurv",
"rh_G_front_inf-Opercular_meancurv",
"rh_G_front_inf-Orbital_meancurv",
"rh_G_front_inf-Triangul_meancurv",
"rh_G_front_middle_meancurv",
"rh_G_front_sup_meancurv",
"rh_G_Ins_lg_and_S_cent_ins_meancurv",
"rh_G_insular_short_meancurv",
"rh_G_occipital_middle_meancurv",
"rh_G_occipital_sup_meancurv",
"rh_G_oc-temp_lat-fusifor_meancurv",
"rh_G_oc-temp_med-Lingual_meancurv",
"rh_G_oc-temp_med-Parahip_meancurv",
"rh_G_orbital_meancurv",
"rh_G_pariet_inf-Angular_meancurv",
"rh_G_pariet_inf-Supramar_meancurv",
"rh_G_parietal_sup_meancurv",
"rh_G_postcentral_meancurv",
"rh_G_precentral_meancurv",
"rh_G_precuneus_meancurv",
"rh_G_rectus_meancurv",
"rh_G_subcallosal_meancurv",
"rh_G_temp_sup-G_T_transv_meancurv",
"rh_G_temp_sup-Lateral_meancurv",
"rh_G_temp_sup-Plan_polar_meancurv",
"rh_G_temp_sup-Plan_tempo_meancurv",
"rh_G_temporal_inf_meancurv",
"rh_G_temporal_middle_meancurv",
"rh_Lat_Fis-ant-Horizont_meancurv",
"rh_Lat_Fis-ant-Vertical_meancurv",
"rh_Lat_Fis-post_meancurv",
"rh_Pole_occipital_meancurv",
"rh_Pole_temporal_meancurv",
"rh_S_calcarine_meancurv",
"rh_S_central_meancurv",
"rh_S_cingulMarginalis_meancurv",
"rh_S_circular_insula_ant_meancurv",
"rh_S_circular_insula_inf_meancurv",
"rh_S_circular_insula_sup_meancurv",
"rh_S_collat_transv_ant_meancurv",
"rh_S_collat_transv_post_meancurv",
"rh_S_front_inf_meancurv",
"rh_S_front_middle_meancurv",
"rh_S_front_sup_meancurv",
"rh_S_interm_prim-Jensen_meancurv",
"rh_S_intrapariet_and_P_trans_meancurv",
"rh_S_oc_middle_and_Lunatus_meancurv",
"rh_S_oc_sup_and_transversal_meancurv",
"rh_S_occipital_ant_meancurv",
"rh_S_oc-temp_lat_meancurv",
"rh_S_oc-temp_med_and_Lingual_meancurv",
"rh_S_orbital_lateral_meancurv",
"rh_S_orbital_med-olfact_meancurv",
"rh_S_orbital-H_Shaped_meancurv",
"rh_S_parieto_occipital_meancurv",
"rh_S_pericallosal_meancurv",
"rh_S_postcentral_meancurv",
"rh_S_precentral-inf-part_meancurv",
"rh_S_precentral-sup-part_meancurv",
"rh_S_suborbital_meancurv",
"rh_S_subparietal_meancurv",
"rh_S_temporal_inf_meancurv",
"rh_S_temporal_sup_meancurv",
"rh_S_temporal_transverse_meancurv"
]
df2 = pd.read_csv('rh.a2009s.meancurv.csv', header = None, names = names1)
result = pd.merge(df1, df2, on='meancurv', how='outer')
result.to_csv('result.csv')
END_OF_PYTHON
echo "goodbye!";
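As an aside for anyone reading the script above: aparcstats2table normally writes its own header row, and its output is typically tab-delimited unless a different delimiter was requested, which may be why read_csv does not pick the headers up. A hedged sketch of reading the tables with their own headers (verify the delimiter by opening one of the files first):

import pandas as pd

# assumes the tables are tab-delimited and keep their original header row
lh = pd.read_csv('lh.a2009s.meancurv.csv', sep='\t')
rh = pd.read_csv('rh.a2009s.meancurv.csv', sep='\t')

# assumes the first column of each table holds the subject identifier
merged = pd.merge(lh, rh,
                  left_on=lh.columns[0], right_on=rh.columns[0],
                  how='outer')
merged.to_csv('result.csv', index=False)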
So you want to skip the first row and only pull the data parts.
Here's an MCVE.
Code:
import io
import pandas as pd
csv1 = io.StringIO(u'''
a,b,c
1,4,7
2,5,8
3,6,9
''')
df = pd.read_csv(csv1, names=['d', 'e', 'f'], skiprows=[1])
print(df)
Output:
d e f
0 1 4 7
1 2 5 8
2 3 6 9
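Applying the same idea to the two tables from the question, a rough sketch might look like the following. It assumes the files are comma-separated (as the question's code implies) and reuses the names and names1 lists defined in the question's script:

import pandas as pd

# skip each file's original header row and supply our own column names,
# using the names and names1 lists from the question's script
df1 = pd.read_csv('lh.a2009s.meancurv.csv', names=names, skiprows=[0])
df2 = pd.read_csv('rh.a2009s.meancurv.csv', names=names1, skiprows=[0])

# both lists call the first (subject) column 'meancurv', so merge on it
result = pd.merge(df1, df2, on='meancurv', how='outer')
result.to_csv('result.csv', index=False)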
Here's a way you can merge two files together while keeping the headers from one of the files after merging.
Say you're keeping files in a list 'files':
import pandas as pd

files = ['file1.csv', 'file2.csv']  # keep files here
frames = []                         # collect each file's dataframe here
for file in files:
    thisDF = pd.read_csv(file)
    frames.append(thisDF)
# DataFrame.append was removed in pandas 2.0, so concatenate the collected frames instead
finalDF = pd.concat(frames, ignore_index=True)
Now, if you want, try these two lines.
Say you want to check the header with a simple head():
print(finalDF.head())
And if you want to write this merged data frame to a CSV file:
finalDF.to_csv('merged-file.csv', encoding='utf-8', index=False)
As for skipping rows: are you trying to skip rows before or after merging? Let me know and I can try to help with that too.
Example:
file1.csv:
,column1,column2,column3,column4,Date,Device,sample_site
2,14888,0.060011931,248084,13.40535464,3/15/2017,DESKTOP,http://www.example1.com
11,1358,0.033212679,40888,7.465099785,3/15/2017,MOBILE,http://www.example2.com
23,130,0.02998155,4336,8.337638376,3/15/2017,TABLET,http://www.example3.com
file2.csv:
,column1,column2,column3,column4,Date,Device,sample_site
35,2685,0.034564882,77680,10.97812822,3/15/2017,DESKTOP,https://www.example4.com
45,280,0.026197605,10688,7.801272455,3/15/2017,MOBILE,https://www.example5.com
54,24,0.022878932,1049,8.202097235,3/15/2017,TABLET,https://www.example6.com
merged-file.csv:
Unnamed: 0,column1,column2,column3,column4,Date,Device,sample_site
2,14888,0.060011931,248084,13.40535464,3/15/2017,DESKTOP,http://www.example1.com
11,1358,0.033212679,40888,7.465099785,3/15/2017,MOBILE,http://www.example2.com
23,130,0.02998155,4336,8.337638376,3/15/2017,TABLET,http://www.example3.com
35,2685,0.034564882,77680,10.97812822,3/15/2017,DESKTOP,https://www.example4.com
45,280,0.026197605,10688,7.801272455,3/15/2017,MOBILE,https://www.example5.com
54,24,0.022878932,1049,8.202097235,3/15/2017,TABLET,https://www.example6.com
Reply:
Are you trying to merge data based on a column? In that case you can use concat or merge, with a join type and axis as needed.
Say for example:
pd.concat([df1, df2]) #add axis and join type if necessary.
Here's the documentation to help you understand: merging and concat in pandas
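Since the question's lh and rh tables have one row per subject, joining them side by side (axis=1) rather than stacking rows may be closer to what is wanted. A hedged sketch, assuming the first column of each table identifies the subject (header and delimiter handling as discussed in the answers above):

import pandas as pd

df1 = pd.read_csv('lh.a2009s.meancurv.csv')
df2 = pd.read_csv('rh.a2009s.meancurv.csv')

# put the subject column on the index so concat can align the two tables row by row
combined = pd.concat(
    [df1.set_index(df1.columns[0]), df2.set_index(df2.columns[0])],
    axis=1,
)
combined.to_csv('combined.csv')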
Related
Is there any method to replace specific data in a column without breaking its structure or splitting it?
Hi there, I am trying to figure out how to replace specific data in a CSV file. I have a file with the base or location data of the IDs: https://store8.gofile.io/download/5b031959-e0b0-4dbf-aec6-264e0b87fd09/service%20block.xlsx (sheet 2 has the data). The file in which I want to replace data using the ID is here: https://store8.gofile.io/download/6e13a19a-bac8-4d16-8692-e4435eed2a08/Serp.csv. The highlighted part needs to be deleted after filling in the location.

import pandas as pd
df1 = pd.read_excel("serp.xlsx", header=None)
df2 = pd.read_excel("flocnam.xlsx", header=None)
df1 = df1[0].str.split(";", expand=True)
df1[4] = df1[4].apply(lambda x: v[-1] if (v := x.split()) else "")
df2[1] = df2[1].apply(lambda x: x.split("-")[0])
m = dict(zip(df2[1], df2[0]))
df1[4] = df1[4].replace(m)
print(df1)
df1.to_csv("test.csv")

It worked, but not how I wanted: https://store8.gofile.io/download/c0ae7e05-c0e2-4f43-9d13-da12ddf73a8d/test.csv. I am trying to replace it like this (desired output). Thank you for being a supportive community ❤️
If I understand correctly, you simply need to specify the separator ;:

>>> df.to_csv('test.csv', sep=';', index_label=False)
How to extract duplicate values in each column separately?
I want to extract only the values with two or more occurrences in each column separately, and write them to a separate file with the column header. Example file (the actual CSV file is 1.5 GB; a summary of it is included here), where the first row is the header row of each column:

AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3

I have tried to write code in R and also in Python pandas, but failed to get the result. Expected outcome:

AO1   BO1   CO1   DO1   EO1   FO1
pep2  xcv4  iop3  typ3  ert3  rtf5
pep2  xcv4  iop3  typ3  ert3  rtf5
pep2  xcv4        typ3        rtf5
                  wer3        rtf5
                  wer3        rtf5
import pandas as pd
from io import StringIO  # StringIO lives in io on Python 3

df = pd.read_csv(StringIO("""AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3"""))

d = {}
for col in df.columns:
    # keep only the values that occur at least twice in this column
    repeated_values = df[col].value_counts()[df[col].value_counts() >= 2].index.tolist()
    cond = df[col].isin(repeated_values)
    d[col] = df[cond][col]

final = pd.concat(d, axis=1)
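One small addition not in the original answer: the question also asks to write the result to a separate file with the column headers, which the snippet above stops short of. A single extra line along these lines should do it (the output file name is just an example):

final.to_csv('duplicates.csv', index=False)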
df <- data.table::fread('AO1,BO1,CO1,DO1,EO1,FO1
pep2,red2,ter3,typ3,ghl4,rtf5
ghp2,asd2,ghj3,typ3,ghj3,ert4
typ2,sdf2,rty3,ert4,asd2,sdf2
pep2,xcv2,bnm3,wer3,vbn3,wer2
dfg4,fgh3,uio2,wer3,ghj2,rtf5
dfg6,xcv4,dfg3,ret5,ytu2,rtf5
pep2,xcv4,ert1,dgf2,ert3,fgh3
okj2,xcv4,jkl3,ghr4,cvb3,rtf5
poi2,tyu2,iop3,cvb3,hjk5,rtf5
qwe2,wer2,iop3,typ3,ert3,cvb3', data.table = FALSE)

lapply(df, function(x) x[duplicated(x) | duplicated(x, fromLast = TRUE)])

You could write a CSV directly in the lapply call as well.
Parsing data in Excel using Python
In Excel, I have to separate the following values from one cell into two:

2016-12-12 (r=0.1)
2016-12-13* (r=0.7)

How do I do that in Python so that in the Excel file, dates and "r=#" will be in different cells? And also, is there a way to automatically remove the "*" sign?
This task is pretty straightforward if you use pandas.

Build a test file:

import pandas as pd

df_out = pd.DataFrame(
    ['2016-12-12 (r=0.1)', '2016-12-13* (r=0.7)'],
    columns=['data'])
df_out.to_excel('test.xlsx')

Code to convert the string:

def convert_date(row):
    return pd.Series([c.strip('*').strip('(').strip(')') for c in row.split()])

Test code:

# read in test file
df_in = pd.read_excel('test.xlsx')
print(df_in)

# build a new dataframe
df_new = df_in['data'].apply(convert_date)
df_new.columns = ['date', 'r']
print(df_new)

# save the dataframe
df_new.to_excel('test2.xlsx')

Results:

                  data
0   2016-12-12 (r=0.1)
1  2016-12-13* (r=0.7)

         date      r
0  2016-12-12  r=0.1
1  2016-12-13  r=0.7
How to Copy the Matching Columns between CSV Files Using Pandas?
I have two dataframes (f1_df and f2_df).

f1_df looks like:

ID,Name,Gender
1,Smith,M
2,John,M

f2_df looks like:

name,gender,city,id

Problem: I want the code to compare the header of f1_df with f2_df by itself and copy the data of the matching columns using pandas.

Output: the output should look like this:

name,gender,city,id   # name, gender, and id are the only matching columns between f1_df and f2_df
Smith,M, ,1           # the data copied for the name, gender, and id columns
John,M, ,2

I am new to pandas and not sure how to handle the problem. I have tried to do an inner join on the matching columns, but that did not work. Here is what I have so far:

import pandas as pd

f1_df = pd.read_csv("file1.csv")
f2_df = pd.read_csv("file2.csv")

for i in f1_df:
    for j in f2_df:
        i = i.lower()
        if i == j:
            joined = f1_df.join(f2_df)
            print(joined)

Any idea how to solve this?
Try this if you want to merge / join your DFs on common columns.

First, let's convert all columns to lower case:

df1.columns = df1.columns.str.lower()
df2.columns = df2.columns.str.lower()

Now we can join on the common columns:

common_cols = df2.columns.intersection(df1.columns).tolist()
joined = df1.set_index(common_cols).join(df2.set_index(common_cols)).reset_index()

Output:

In [259]: joined
Out[259]:
   id   name gender  city
0   1  Smith      M   NaN
1   2   John      M   NaN

Export to CSV:

In [262]: joined.to_csv('c:/temp/joined.csv', index=False)

c:/temp/joined.csv:

id,name,gender,city
1,Smith,M,
2,John,M,
How to convert a link list to a matrix in Python
My input data (input.txt) looks like:

AGAP2 TCGA-BL-A0C8-01A-11R-A10U-07 66.7328
AGAP2 TCGA-BL-A13I-01A-11R-A13Y-07 186.8366
AGAP3 TCGA-BL-A13J-01A-11R-A10U-07 183.3767
AGAP3 TCGA-BL-A3JM-01A-12R-A21D-07 33.2927
AGAP3 TCGA-BT-A0S7-01A-11R-A10U-07 57.9040
AGAP3 TCGA-BT-A0YX-01A-11R-A10U-07 99.8540
AGAP4 TCGA-BT-A20J-01A-11R-A14Y-07 88.8278
AGAP4 TCGA-BT-A20N-01A-11R-A14Y-07 129.7021

I want output.txt to look like:

       TCGA-BL-A0C8-01A-11R-A10U-07  TCGA-BL-A13I-01A-11R-A13Y-07  ...
AGAP2  66.7328                       186.8366
AGAP3  0                             0
Using pandas: read the table, create a pivot, and write the CSV.

import pandas as pd

df = pd.read_table("input.txt", names="xy", sep=r'\s+')

# reset index first - we need a named column
new = df.reset_index().pivot(index="index", columns='x', values='y')
new.fillna(0, inplace=True)

new.to_csv("output.csv", sep='\t')  # tab separated

See Reshaping and Pivot Tables in the pandas documentation.

EDIT: filling empty values.