I have a pandas DataFrame built from multiple .fits files, each of which contains multiple columns with individual labels. I'd like to extract one column and create variables that contain the first and last rows of that column, but I'm having a hard time doing this per .fits file rather than for the entire DataFrame. Any help would be appreciated! :)
Here is how I read in my files:
path = '/Users/myname/folder/'
m = [os.path.join(dirpath, f)
     for dirpath, dirnames, files in os.walk(path)
     for f in fnmatch.filter(files, '*.fits')]
This recursively searches through my directory, which contains multiple .fits files in many subfolders.
dataframes = []
for ii in range(len(m)):
    data = pd.read_csv(m[ii], header='infer', delimiter='\t')
    d = pd.DataFrame(data)
    top = d['desired_column'].head()
    bottom = d['desired_column'].tail()
    First_and_Last = pd.concat([top, bottom])
I tried using the .head and .tail methods for pandas DataFrames, but I am unsure how to use them properly for what I want. With the way I read in my fits files, the code above gives me the very first and very last few rows of the whole DataFrame (5 each, the default for head and tail), as seen here:
0 2.456849e+06
1 2.456849e+06
2 2.456849e+06
3 2.456849e+06
4 2.456849e+06
1118 2.456852e+06
1119 2.456852e+06
1120 2.456852e+06
1121 2.456852e+06
1122 2.456852e+06
What I want is the first and last row of the specific column for each .fits file, not just for the DataFrame containing all the files. With the way I am reading in my .fits files, the DataFrame seems to sort of concatenate all the files together. Any tips on how I can accomplish this goal?
If you want only the first row:
top = d['desired_column'].head(1)
If you want only the last row:
bottom = d['desired_column'].tail(1)
I don't see the problem of the "Dataframe seems to sort of concatenate all the files together." Would you please clarify that part of the question?
Btw, after data = pd.read_csv(m[ii], header = 'infer', delimiter = '\t'), data is already a DataFrame. Therefore, d = pd.DataFrame(data) is unnecessary.
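If the goal is one first/last pair per .fits file, you can collect them inside the loop instead of after it. Here is a minimal sketch (not the original poster's code), assuming, as in the question, that each .fits file is really tab-separated text that read_csv can parse and that the column is named desired_column:

import os
import fnmatch
import pandas as pd

path = '/Users/myname/folder/'
m = [os.path.join(dirpath, f)
     for dirpath, dirnames, files in os.walk(path)
     for f in fnmatch.filter(files, '*.fits')]

rows = []
for fname in m:
    d = pd.read_csv(fname, header='infer', delimiter='\t')
    col = d['desired_column']
    # .iloc[0] / .iloc[-1] give this file's first and last value of the column
    rows.append({'file': fname, 'first': col.iloc[0], 'last': col.iloc[-1]})

first_and_last = pd.DataFrame(rows)  # one row per .fits file

This way nothing is concatenated across files; each file contributes exactly one row.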
The .iloc function should easily pull the top and bottom row, where df["col_1"] here below represents the column of interest:
In [28]: import pandas as pd
In [29]: import numpy as np
In [30]: np.random.seed(42)
In [31]: df = pd.DataFrame(np.random.randn(6,3), columns=["col_1", "col_2", "col_3"])
In [32]: df
Out[32]:
col_1 col_2 col_3
0 0.496714 -0.138264 0.647689
1 1.523030 -0.234153 -0.234137
2 1.579213 0.767435 -0.469474
3 0.542560 -0.463418 -0.465730
4 0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831 0.314247
In [33]: pd.Series([df["col_1"].iloc[0], df["col_1"].iloc[-1]]) # pd.Series([top, bottom]) ; or pd.DataFrame([top, bottom]), if data frame needed.
Out[33]:
0 0.496714
1 -0.562288
dtype: float64
Related
I currently have 2 CSV files and am reading them both in. I need to take the IDs from one CSV and find them in the other so that I can get their rows of data. Currently I have the following code, which I believe goes through the first dataframe, but only the last match ends up in the new dataframe. I need it to keep all of the matching rows.
Here is my code:
patientSet = pd.read_csv("794_chips_RMA.csv")
affSet = probeset[probeset['Analysis']==1].reset_index(drop=True)
houseGenes = probeset[probeset['Analysis']==0].reset_index(drop=True)
for x in affSet['Probeset']:
    #patients = patientSet[patientSet['ID']=='1557366_at'].reset_index(drop=True)
    #patients = patientSet[patientSet['ID']=='224851_at'].reset_index(drop=True)
    patients = patientSet[patientSet['ID']==x].reset_index(drop=True)
print(affSet['Probeset'])
print(patientSet['ID'])
print(patients)
The following is the output:
0 1557366_at
1 224851_at
2 1554784_at
3 231578_at
4 1566643_a_at
5 210747_at
6 231124_x_at
7 211737_x_at
Name: Probeset, dtype: object
0 1007_s_at
1 1053_at
2 117_at
3 121_at
4 1255_g_at
...
54670 AFFX-ThrX-5_at
54671 AFFX-ThrX-M_at
54672 AFFX-TrpnX-3_at
54673 AFFX-TrpnX-5_at
54674 AFFX-TrpnX-M_at
Name: ID, Length: 54675, dtype: object
ID phchp003v1 phchp003v2 phchp003v3 ... phchp367v1 phchp367v2 phchp368v1 phchp368v2
0 211737_x_at 12.223453 11.747159 9.941889 ... 14.828389 9.322779 10.609053 10.771162
As you can see, it is only matching the very last ID from the first dataframe, not all of them. How can I get all of them to match and end up in patients? Thank you.
You probably want to use the merge function:
df_inner = pd.merge(df1, df2, on='id', how='inner')
Check here: https://www.datacamp.com/community/tutorials/joining-dataframes-pandas and search for "inner join".
--edit--
you can specify the columns to join on (using left_on and right_on); look here: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
@Rui Lima already posted the correct approach, but since the key column is named 'ID' in one frame and 'Probeset' in the other, you'll need the following to make it work:
df = pd.merge(patientSet, affSet, left_on='ID', right_on='Probeset', how='inner')
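To illustrate with toy data (the values below are made up; only the column names follow the question):

import pandas as pd

# stand-ins for patientSet and affSet with the question's column names
patientSet = pd.DataFrame({'ID': ['1557366_at', '224851_at', '1007_s_at'],
                           'phchp003v1': [12.2, 9.9, 11.1]})
affSet = pd.DataFrame({'Probeset': ['1557366_at', '224851_at', '231578_at']})

# inner join keeps only the IDs that appear in both frames
patients = pd.merge(patientSet, affSet,
                    left_on='ID', right_on='Probeset', how='inner')
print(patients)

Unlike the loop in the question, every matching row is kept, not just the last one.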
I need to load multiple Excel files; each one is named for its starting date, e.g. "20190114".
Then I need to append them into one DataFrame.
For this, I use the following code:
all_data = pd.DataFrame()
for f in glob.glob('C:\\path\\*.xlsx'):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
In fact, I do not need all the data, only rows filtered by multiple columns. Then I would like to create an additional column ('from') holding the file name (which is a date) for each respective file.
Example: data from the Excel files named '20190101' and '20190115' (toy versions are given below).
The final dataframe must keep only rows whose 'price' is not equal to 0 and whose 'code' is 'r' (I do not know if it's possible to filter the data while reading, to avoid loading a huge volume of data?), and it then needs a 'from' column with the respective date coming from each file's name.
Dataframes for trial:
import pandas as pd

df1 = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5'],
                    'price': [0, 12.5, 17.5, 24.5, 7.5],
                    'code': ['r', 'r', 'r', 'c', 'r']})

df2 = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5'],
                    'price': [7.5, 24.5, 0, 149.5, 7.5],
                    'code': ['r', 'r', 'r', 'c', 'r']})
IIUC, you can filter the necessary rows and then concat; for the file name you can use os.path.split() and slice off the extension:
import glob
import os
import pandas as pd

l = []
for f in glob.glob('C:\\path\\*.xlsx'):
    df = pd.read_excel(f)
    df['from'] = os.path.split(f)[1][:-5]  # file name without the '.xlsx' extension
    l.append(df[df['code'].eq('r') & df['price'].ne(0)])

pd.concat(l, ignore_index=True)
id price code from
0 id_2 12.5 r 20190101
1 id_3 17.5 r 20190101
2 id_5 7.5 r 20190101
3 id_1 7.5 r 20190115
4 id_2 24.5 r 20190115
5 id_5 7.5 r 20190115
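On the "filter while reading" part of the question: read_excel cannot filter rows at read time, but it can skip unneeded columns via usecols, which reduces what gets loaded when the files are wide. A sketch under that assumption (column names taken from the trial frames above):

import glob
import os
import pandas as pd

frames = []
for f in glob.glob('C:\\path\\*.xlsx'):
    # usecols limits which columns are read; rows still have to be filtered afterwards
    df = pd.read_excel(f, usecols=['id', 'price', 'code'])
    df = df[df['code'].eq('r') & df['price'].ne(0)]
    df['from'] = os.path.split(f)[1][:-5]
    frames.append(df)

result = pd.concat(frames, ignore_index=True)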
I have 37 data frames of different shapes that I need to concatenate. The following is what I tried:
path = '/Users/data_frames/data'
all_files = [os.path.join(path, i) for i in os.listdir(path) if i.endswith('pc.tsv')]

main = []
for files in all_files:
    dfs = pd.read_csv(files, sep='\t', index_col=0)
    dfs.reset_index(drop=True, inplace=True)
    main.append(dfs)

merged_pr_matrix = pd.concat(main, axis=1)
The above script only runs because of this line:
dfs.reset_index(drop=True, inplace=True)
However, with it I lose the original index values (row names), which I would like to keep. For example, this is the final matrix after concatenating:
ABV TCG FGH HKL MK MYT JUJN MTPTA
0 5130132,5 22778,703125 675790,6875 4846942,5 106934,4453125 2884897,25 2777415 3487836
1 3478507,5 898987,375 2825588,5 5006338,5 119250,765625 4393944,5 3111324,25 2594582,75
2 18402,615234375 56879,6484375 524456,3125 323671,4063 166333,4375 78539,921875 233480,0625 35772,69140625
3 2310551,5 587836,1875 241836,5 5488325 29411,296875 517361,46875 190795,078125 67885,640625
4 95646,140625 1106308 1356453 17681780 592893,9375 1857957 1224196 1417179,25
In the original inputs I had values in the index which I would like to keep.
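A possible fix (not from the original thread): drop the reset_index call and let concat align the rows on the shared index, assuming the row names live in the first column of each .tsv:

import os
import pandas as pd

path = '/Users/data_frames/data'
all_files = [os.path.join(path, i) for i in os.listdir(path) if i.endswith('pc.tsv')]

main = []
for f in all_files:
    # keep the first column as the index (the row names) instead of resetting it
    dfs = pd.read_csv(f, sep='\t', index_col=0)
    main.append(dfs)

# rows are aligned on the index; row names missing from some files become NaN there
merged_pr_matrix = pd.concat(main, axis=1)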
I am a beginner in Python. I have a hundred pairs of CSV files. The files look like this:
25_13oct_speed_0.csv
26_13oct_speed_0.csv
25_13oct_speed_0.1.csv
26_13oct_speed_0.1.csv
25_13oct_speed_0.2.csv
26_13oct_speed_0.2.csv
and others
I want to concatenate each pair of files, the 25_ file with the matching 26_ file. Each pair of files has a speed threshold (speed_0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0) which is labeled in the file name. These files have the same data structure.
Mac Annotation X Y
A first 0 0
A last 0 0
B first 0 0
B last 0 0
Therefore, a simple concatenation is enough to join each pair. I use this method:
df1 = pd.read_csv('25_13oct_speed_0.csv')
df2 = pd.read_csv('26_13oct_speed_0.csv')
frames = [df1, df2]
result = pd.concat(frames)
for each pair of files, but it takes time and is not an elegant way. Is there a good way to combine the paired files automatically and save each result?
The idea is to create a DataFrame from the list of files and add 2 new columns by splitting with Series.str.split on the first _:
print (files)
['25_13oct_speed_0.csv', '26_13oct_speed_0.csv',
'25_13oct_speed_0.1.csv', '26_13oct_speed_0.1.csv',
'25_13oct_speed_0.2.csv', '26_13oct_speed_0.2.csv']
df1 = pd.DataFrame({'files': files})
df1[['g','names']] = df1['files'].str.split('_', n=1, expand=True)
print (df1)
files g names
0 25_13oct_speed_0.csv 25 13oct_speed_0.csv
1 26_13oct_speed_0.csv 26 13oct_speed_0.csv
2 25_13oct_speed_0.1.csv 25 13oct_speed_0.1.csv
3 26_13oct_speed_0.1.csv 26 13oct_speed_0.1.csv
4 25_13oct_speed_0.2.csv 25 13oct_speed_0.2.csv
5 26_13oct_speed_0.2.csv 26 13oct_speed_0.2.csv
Then loop over the groups defined by the names column, iterate each group's rows with DataFrame.itertuples, create a new DataFrame with read_csv (adding, if necessary, a new column filled by the values from g), append it to a list, concat, and finally save to a new file named from the names column:
for i, g in df1.groupby('names'):
    out = []
    for n in g.itertuples():
        df = pd.read_csv(n.files).assign(source=n.g)
        out.append(df)
    dfbig = pd.concat(out, ignore_index=True)
    print(dfbig)
    dfbig.to_csv(g['names'].iat[0])
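If the numeric row index should not end up in the output files, pass index=False, e.g. dfbig.to_csv(g['names'].iat[0], index=False).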
My goal is to group .csv files in a directory by shared characteristics in the file name. My directory contains files with names:
After_Source1_Receiver1.csv
After_Source1_Receiver2.csv
Before_Source1_Receiver1.csv
Before_Source1_Receiver2.csv
During1_Source1_Receiver1.csv
During1_Source1_Receiver2.csv
During2_Source1_Receiver1.csv
During2_Source1_Receiver2.csv
I would like to sort these files into groups based on the numbers following the "Source" and "Receiver" sections of the file name (as shown below) so I can later concatenate them.
Group 1
Before_Source1_Receiver1.csv
During1_Source1_Receiver1.csv
During2_Source1_Receiver1.csv
After_Source1_Receiver1.csv
Group 2
Before_Source1_Receiver2.csv
During1_Source1_Receiver2.csv
During2_Source1_Receiver2.csv
After_Source1_Receiver2.csv
Any ideas?
Since you want to do this in pandas, here is a pandas solution.
fnames = ['After_Source1_Receiver1.csv',
          'After_Source1_Receiver2.csv',
          'Before_Source1_Receiver1.csv',
          'Before_Source1_Receiver2.csv',
          'During1_Source1_Receiver1.csv',
          'During1_Source1_Receiver2.csv',
          'During2_Source1_Receiver1.csv',
          'During2_Source1_Receiver2.csv']

df = pd.DataFrame(fnames, columns=['names'])
I don't know what you want to do with your end results, but this is how you group them:
pattern = r'Source(\d+)_Receiver(\d+)'

for _, g in pd.concat([df, df['names'].str.extract(pattern)], axis=1).groupby([0, 1]):
    print(g.names)
0 After_Source1_Receiver1.csv
2 Before_Source1_Receiver1.csv
4 During1_Source1_Receiver1.csv
6 During2_Source1_Receiver1.csv
Name: names, dtype: object
1 After_Source1_Receiver2.csv
3 Before_Source1_Receiver2.csv
5 During1_Source1_Receiver2.csv
7 During2_Source1_Receiver2.csv
Name: names, dtype: object
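Since the stated goal is to concatenate each group afterwards, here is a sketch building on the grouping above (assuming the files live in the working directory and share columns; the output name Combined_Source{src}_Receiver{rec}.csv is made up):

import pandas as pd

pattern = r'Source(\d+)_Receiver(\d+)'
keys = df['names'].str.extract(pattern)

for (src, rec), g in pd.concat([df, keys], axis=1).groupby([0, 1]):
    # read every file in this Source/Receiver group and stack their rows
    combined = pd.concat((pd.read_csv(f) for f in g['names']), ignore_index=True)
    combined.to_csv(f'Combined_Source{src}_Receiver{rec}.csv', index=False)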