Join multiple CSV files using Python pandas

I am trying to create a CSV file from multiple CSV files using Python pandas.
accreditation.csv:
"pid","accreditation_body","score"
"25799","TAAC","4.5"
"25796","TAAC","5.6"
"25798","DAAC","5.7"
ref_university.csv:
"id","pid","survey_year","end_year"
"1","25799","2018","2018"
"2","25797","2016","2018"
I want to create a new table by reading the instructions from table_structure.csv: join the two tables and rewrite accreditation.csv. The REFERENCES ref_university(id, survey_year) entry points at ref_university.csv and means the id and survey_year column values should be pulled in by matching on the pid column value.
table_structure.csv:
table_name,attribute_name,attribute_type,Description
,,,
accreditation,accreditation_body,varchar,
,grading,varchar,
,pid,int4, "REFERENCES ref_university(id, survey_year)"
,score,float8,
The modified CSV file should look like this:
New accreditation.csv:
"accreditation_body","grading","pid","id","survey_year","score"
"TAAC","","25799","1","2018","4.5"
"TAAC","","25797","2","2016","5.6"
"DAAC","","25798","","","5.7"
I can read the CSV in pandas:
df = pd.read_csv("accreditation.csv")
But what is the recommended way to read the REFERENCES instruction and pick up the column values? If there is no matching value, the column should be blank.
We cannot hardcode pid in the pandas call. We have to read table_structure.csv and, wherever there is a REFERENCES entry, pull in the mentioned columns. The tables should not be fully merged; only the specific columns should be added.

A dynamic solution is possible, but not so easy:
df = pd.read_csv("table_structure.csv")
#remove only NaNs rows
df = df.dropna(how='all')
#repalce NaNs by forward filling
df['table_name'] = df['table_name'].ffill()
#create for each table_name one row
df = (df.dropna(subset=['Description'])
.join(df.groupby('table_name')['attribute_name'].apply(list)
.rename('cols'), 'table_name'))
#get name of DataFrame and new columns names
df['df1'] = df['Description'].str.extract('REFERENCES\s*(.*)\s*\(')
df['new_cols'] = df['Description'].str.extract('\(\s*(.*)\s*\)')
df['new_cols'] = df['new_cols'].str.split(', ')
#remove unnecessary columns
df = df.drop(['attribute_type','Description'], axis=1).set_index('table_name')
print (df)
              attribute_name                                        cols  \
table_name                                                                 
accreditation            pid  [accreditation_body, grading, pid, score]   

                          df1           new_cols  
table_name                                        
accreditation  ref_university  [id, survey_year]  
# to select by name, create a dictionary of DataFrames
data = {'accreditation': pd.read_csv("accreditation.csv"),
        'ref_university': pd.read_csv("ref_university.csv")}
# select by index
v = df.loc['accreditation']
print (v)
attribute_name pid
cols [accreditation_body, grading, pid, score]
df1 ref_university
new_cols [id, survey_year]
Name: accreditation, dtype: object
Selecting from the dictionary by the Series v:
df = pd.merge(data[v.name],
              data[v['df1']][v['new_cols'] + [v['attribute_name']]],
              on=v['attribute_name'],
              how='left')
is equivalent to:
df = pd.merge(data['accreditation'],
              data['ref_university'][['id', 'survey_year'] + ['pid']],
              on='pid',
              how='left')
and returns:
print (df)
pid accreditation_body score id survey_year
0 25799 TAAC 4.5 1.0 2018.0
1 25796 TAAC 5.6 NaN NaN
2 25798 DAAC 5.7 NaN NaN
Last, add the new columns by union and reindex:
df = df.reindex(columns=df.columns.union(v['cols']))
print (df)
accreditation_body grading id pid score survey_year
0 TAAC NaN 1.0 25799 4.5 2018.0
1 TAAC NaN NaN 25796 5.6 NaN
2 DAAC NaN NaN 25798 5.7 NaN
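To rewrite every file listed in table_structure.csv, the same steps can be wrapped in a loop. A minimal sketch, assuming the parsed structure frame from the first block is kept under a separate name (struct is my name for it, since df is reused above) and that each table has exactly one REFERENCES row:

# struct: the parsed table_structure frame indexed by table_name (first block)
# data: the dictionary of DataFrames keyed by table name
for table_name, v in struct.iterrows():
    merged = pd.merge(data[table_name],
                      data[v['df1']][v['new_cols'] + [v['attribute_name']]],
                      on=v['attribute_name'],
                      how='left')
    merged = merged.reindex(columns=merged.columns.union(v['cols']))
    merged.to_csv(table_name + '.csv', index=False)  # overwrite the source CSV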

Here is working code; try it! When files are huge, set low_memory=False in pd.read_csv().
import pandas as pd
import glob

# path to the folder holding the data files
path = r"C:\Users\data_folder"
# collect all files with the .csv extension
filenames = glob.glob(path + r"\*.csv")
print('File names:', filenames)

df = pd.DataFrame()
# iterate over the csv files and concat them
for file in filenames:
    temp = pd.read_csv(file, low_memory=False)
    df = pd.concat([df, temp], axis=1)  # set axis=0 if you want to join rows
df.to_csv('output.csv')

Related

Explode a cell populated with multiple values into unique rows

I want to "explode" each cell that has multiple words in it into distinct rows while retaining it's rating and sysnet value when being conjoined. I attempted to import someone's pandas_explode library but VS code just does not want to recognize it. Is there any way for me in pandas documentation or some nifty for loop that'll extract and redistribute these words? Example csv is in the img link
import json
import pandas as pd  # version 1.01

df = pd.read_json('result.json')
df.to_csv('jsonToCSV.csv', index=False)
df = pd.read_csv('jsonToCSV.csv')
df = df.explode('words')
print(df)
df.to_csv(r'C:\Users\alant\Desktop\test.csv', index=None, header=True)
Output when running above:
synset rating words
0 1034312 0.0 ['discourse', 'talk about', 'discuss']
1 146856 0.0 ['merging', 'meeting', 'coming together']
2 829378 0.0 ['care', 'charge', 'tutelage', 'guardianship']
3 8164585 0.0 ['administration', 'governance', 'governing bo...
4 1204318 0.0 ['nonhierarchical', 'nonhierarchic']
... ... ... ...
8605 7324673 1.0 ['emergence', 'outgrowth', 'growth']
If you have columns that need to be kept from exploding, I suggest setting them as the index first and then exploding.
For your example, see if this works for you.
df = df.set_index(['synset','rating']).apply(pd.Series.explode) # this would work for exploding multiple columns as well
# then reset the index
df = df.reset_index()
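For context, the round trip through to_csv/read_csv in the question turns the lists in words into plain strings, which is likely why explode appeared to do nothing. A self-contained sketch of the set_index/explode pattern on made-up rows shaped like the question's output (values assumed):

import pandas as pd

df = pd.DataFrame({'synset': [1034312, 146856],
                   'rating': [0.0, 0.0],
                   'words': [['discourse', 'talk about', 'discuss'],
                             ['merging', 'meeting', 'coming together']]})

# keep synset/rating by moving them into the index, then explode
out = df.set_index(['synset', 'rating']).apply(pd.Series.explode).reset_index()
print(out)
#     synset  rating            words
# 0  1034312     0.0        discourse
# 1  1034312     0.0       talk about
# 2  1034312     0.0          discuss
# ...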

Import multiple Excel files, create a column, and get values from the Excel file's name

I need to load multiple Excel files; each one is named with its starting date, e.g. "20190114".
Then I need to append them into one DataFrame.
For this, I use the following code:
all_data = pd.DataFrame()
for f in glob.glob('C:\\path\\*.xlsx'):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
In fact, I do not need all the data, only rows filtered on multiple columns.
Then I would like to create an additional column ('from') holding the file name (which is a date) for each respective file.
Example:
Data from the Excel file named '20190101'
Data from the Excel file named '20190115'
The final dataframe must have values in the 'price' column not equal to 0 and in the 'code' column equal to 'r' (I do not know if it is possible to export the data already filtered, to avoid exporting a huge volume of data?), and then I need to add a 'from' column with the respective date coming from the file's name,
like this:
DataFrames for trial:
import pandas as pd
df1 = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5'],
                    'price': [0, 12.5, 17.5, 24.5, 7.5],
                    'code': ['r', 'r', 'r', 'c', 'r']})
df2 = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5'],
                    'price': [7.5, 24.5, 0, 149.5, 7.5],
                    'code': ['r', 'r', 'r', 'c', 'r']})
IIUC, you can filter the necessary rows, then concat; for the file name you can use os.path.split() and take the date with string slicing:
import os
import glob

l = []
for f in glob.glob('C:\\path\\*.xlsx'):
    df = pd.read_excel(f)
    df['from'] = os.path.split(f)[1][:-5]  # file name without directory and .xlsx extension
    l.append(df[df['code'].eq('r') & df['price'].ne(0)])
pd.concat(l, ignore_index=True)
id price code from
0 id_2 12.5 r 20190101
1 id_3 17.5 r 20190101
2 id_5 7.5 r 20190101
3 id_1 7.5 r 20190115
4 id_2 24.5 r 20190115
5 id_5 7.5 r 20190115
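The [1][:-5] slice assumes a literal .xlsx suffix; pathlib expresses the same intent a little more robustly. A sketch of the same loop, under the same path assumption:

from pathlib import Path
import pandas as pd

l = []
for f in Path(r'C:\path').glob('*.xlsx'):
    df = pd.read_excel(f)
    df['from'] = f.stem  # file name without extension, e.g. '20190101'
    l.append(df[df['code'].eq('r') & df['price'].ne(0)])
result = pd.concat(l, ignore_index=True)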

Concatenating and saving multiple pairs of CSVs in pandas

I am a beginner in Python. I have a hundred pairs of CSV files. The files look like this:
25_13oct_speed_0.csv
26_13oct_speed_0.csv
25_13oct_speed_0.1.csv
26_13oct_speed_0.1.csv
25_13oct_speed_0.2.csv
26_13oct_speed_0.2.csv
and others
I want to concatenate each pair of 25 and 26 files. Each pair of files has a speed threshold (speed_0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0) which is labeled in the file name. These files have the same data structure.
Mac  Annotation  X  Y
A    first       0  0
A    last        0  0
B    first       0  0
B    last        0  0
Therefore, a plain concatenation is enough to join the two files. I use this method:
df1 = pd.read_csv('25_13oct_speed_0.csv')
df2 = pd.read_csv('26_13oct_speed_0.csv')
frames = [df1, df2]
result = pd.concat(frames)
for each pair of files, but it takes time and is not an elegant way. Is there a good way to combine the paired files automatically and save each result as it goes?
The idea is to create a DataFrame from the list of files and add 2 new columns by splitting on the first _ with Series.str.split:
print (files)
['25_13oct_speed_0.csv', '26_13oct_speed_0.csv',
'25_13oct_speed_0.1.csv', '26_13oct_speed_0.1.csv',
'25_13oct_speed_0.2.csv', '26_13oct_speed_0.2.csv']
df1 = pd.DataFrame({'files': files})
df1[['g','names']] = df1['files'].str.split('_', n=1, expand=True)
print (df1)
                     files   g                names
0     25_13oct_speed_0.csv  25    13oct_speed_0.csv
1     26_13oct_speed_0.csv  26    13oct_speed_0.csv
2   25_13oct_speed_0.1.csv  25  13oct_speed_0.1.csv
3   26_13oct_speed_0.1.csv  26  13oct_speed_0.1.csv
4   25_13oct_speed_0.2.csv  25  13oct_speed_0.2.csv
5   26_13oct_speed_0.2.csv  26  13oct_speed_0.2.csv
Then loop over the groups of names, iterate each group with DataFrame.itertuples, create a new DataFrame with read_csv, add a new column filled from g, append it to a list, concat, and finally save to a new file named from the names column:
for i, g in df1.groupby('names'):
    out = []
    for n in g.itertuples():
        df = pd.read_csv(n.files).assign(source=n.g)
        out.append(df)
    dfbig = pd.concat(out, ignore_index=True)
    print (dfbig)
    dfbig.to_csv(g['names'].iat[0])
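For completeness, a runnable version of the whole pipeline; the glob pattern and the combined_ output prefix are my assumptions (the prefix avoids clobbering any input file):

import glob
import pandas as pd

files = sorted(glob.glob('*_13oct_speed_*.csv'))  # assumed location: current directory

df1 = pd.DataFrame({'files': files})
df1[['g', 'names']] = df1['files'].str.split('_', n=1, expand=True)

for name, g in df1.groupby('names'):
    parts = [pd.read_csv(row.files).assign(source=row.g) for row in g.itertuples()]
    pd.concat(parts, ignore_index=True).to_csv('combined_' + name, index=False)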

Finding first and last rows in Pandas DataFrames for individual files

I have a Pandas DataFrame built from multiple .fits files, each one containing multiple columns with individual labels. I'd like to extract one column and create variables that contain the first and last rows of that column, but I'm having a hard time accomplishing that for the individual .fits files rather than the entire DataFrame. Any help would be appreciated! :)
Here is how I read in my files:
path = '/Users/myname/folder/'
m = [os.path.join(dirpath, f)
     for dirpath, dirnames, files in os.walk(path)
     for f in fnmatch.filter(files, '*.fits')]
^^^ This recursively searches through my directory containing multiple .fits files in many subfolders.
dataframes = []
for ii in range(0, len(m)):
    data = pd.read_csv(m[ii], header='infer', delimiter='\t')
    d = pd.DataFrame(data)
    top = d['desired_column'].head()
    bottom = d['desired_column'].tail()
    First_and_Last = pd.concat([top, bottom])
I tried the .head and .tail methods on Pandas DataFrames, but I am unsure how to use them properly for what I want. With the way I read in my fits files, the following code gives me the very first few rows and the very last few rows (5 each, the default for head and tail), as seen here:
0 2.456849e+06
1 2.456849e+06
2 2.456849e+06
3 2.456849e+06
4 2.456849e+06
1118 2.456852e+06
1119 2.456852e+06
1120 2.456852e+06
1121 2.456852e+06
1122 2.456852e+06
What I want is the first and last row of the specific column for each .fits file, not just for the DataFrame containing all the files. With the way I am reading in my .fits files, the DataFrame seems to concatenate all the files together. Any tips on how I can accomplish this?
If you want only the first row:
top = d['desired_column'].head(1)
If you want only the last row:
bottom = d['desired_column'].tail(1)
I don't see the "DataFrame seems to sort of concatenate all the files together" problem. Would you please clarify the question?
Btw, after data = pd.read_csv(m[ii], header = 'infer', delimiter = '\t'), data is already a DataFrame. Therefore, d = pd.DataFrame(data) is unnecessary.
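To get per-file values rather than per-DataFrame ones, the first/last extraction has to happen inside the read loop; a sketch reusing the m list from the question ('desired_column' stands in for the real column name):

import pandas as pd

first_last = []
for path in m:  # m: the list of .fits file paths built in the question
    d = pd.read_csv(path, header='infer', delimiter='\t')
    col = d['desired_column']
    first_last.append({'file': path, 'first': col.iloc[0], 'last': col.iloc[-1]})

summary = pd.DataFrame(first_last)  # one row per file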
The .iloc indexer easily pulls the top and bottom rows, where df["col_1"] below represents the column of interest:
In [28]: import pandas as pd
In [29]: import numpy as np
In [30]: np.random.seed(42)
In [31]: df = pd.DataFrame(np.random.randn(6,3), columns=["col_1", "col_2", "col_3"])
In [32]: df
Out[32]:
col_1 col_2 col_3
0 0.496714 -0.138264 0.647689
1 1.523030 -0.234153 -0.234137
2 1.579213 0.767435 -0.469474
3 0.542560 -0.463418 -0.465730
4 0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831 0.314247
In [33]: pd.Series([df["col_1"].iloc[0], df["col_1"].iloc[-1]]) # pd.Series([top, bottom]) ; or pd.DataFrame([top, bottom]), if data frame needed.
Out[33]:
0 0.496714
1 -0.562288
dtype: float64

Merging 2 dataframes and saving, taking inputs for all files in a directory

I have 2 directories containing these files:
Dir1                  Dir2
abc_complete.xlsx     abc_before.xlsx
file2_complete.xlsx   file2_before.xlsx
xyz_complete.xlsx     xyz_before.xlsx
pqr_complete.xlsx     pqr_before.xlsx
I brought abc_complete.xlsx into a pandas DataFrame (df1); it contains these columns:
id  name  sex
1   jon   m
2   sam   m
3   elle  f
4   bob   m
For abc_before.xlsx the pandas df would be (df2):
new_sex
f
f
f
f
I have to delete 'sex' from df1 and merge the 'new_sex' from df2 into df1.
My approach:
df1 = pd.read_excel('path/to/file1_complete.xlsx')
df2 = pd.read_excel('path/to/file1_before.xlsx')
df1.drop('sex', axis=1, inplace=True)  # drop column 'sex' from df1
df1['sex'] = df2['new_sex']  # join the new_sex column from df2
df1.to_excel('path/to/file1_new.xlsx')
This is working fine, but I wanted an automated process that takes files from my Dir1 and Dir2, processes one pair at a time (file2_complete.xlsx with file2_before.xlsx, xyz_complete.xlsx with xyz_before.xlsx, and so on), and saves each new DataFrame under the respective file name: abc_new.xlsx, file2_new.xlsx, and so on.
Is there any way to achieve this automation?
Consider this approach:
import glob
import os
import pandas as pd

complete_dir = r'C:\Temp\.data\Dir1'
before_dir = r'C:\Temp\.data\Dir2'
# result dir
new_dir = r'C:\Temp\.data\Dir3'

files = glob.glob(os.path.join(complete_dir, '*.xlsx'))

def process_file(fn, complete='_complete', before='_before', new='_new'):
    fn2 = os.path.join(before_dir, os.path.basename(fn.replace(complete, before)))
    new = os.path.join(new_dir, os.path.basename(fn.replace(complete, new)))
    print('processing [{}] ...'.format(fn))
    pd.read_excel(fn, usecols=['id', 'name']) \
      .assign(new_sex=pd.read_excel(fn2, usecols=['new_sex'], squeeze=True)) \
      .to_excel(new, index=False)

[process_file(f) for f in files]
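One caveat: the squeeze argument of pd.read_excel was removed in pandas 2.0. On current versions the same single-column read can be squeezed explicitly; a sketch of just that call:

# pandas >= 2.0: read the one column, then squeeze the frame to a Series
new_sex = pd.read_excel(fn2, usecols=['new_sex']).squeeze('columns')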
