I have a list of 50+ Excel files that I loop through and consolidate into one dataframe. However, I need to know the source of the data, since the data will repeat across these files.
Each file name is the date of the report. Since this data is time series data, I need to pull this date into the dataframe to do further manipulations.
import os
import glob
import pandas as pd
path = r"path"
extension = 'xls*'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))
files_df = pd.concat([pd.read_excel(fp, usecols=[0,15], header=None) for fp in files], ignore_index=True)
I get the expected dataframe. I just do not know how to include the source file name as a third column. I thought there was an argument to do this in pd.read_excel(), but I couldn't find it.
For example, I have a list of the following files:
02-2019.xlsx
03-2011.xls
04-2014.xls
etc
I want to include those file names next to the data that comes from that file in the combined dataframe.
Maybe use the keys= parameter in pd.concat()?
import os
import glob
import pandas as pd
path = r"path"
extension = 'xls*'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))
# remove ignore_index=True otherwise keys parameter won't work
files_df = pd.concat([pd.read_excel(fp, usecols=[0,15], header=None) for fp in files], keys=[f"{fp.split('.')[0]}" for fp in files])
You can then reset_index() to turn the file-name level of the index into a column, and convert it with to_datetime():
files_df = files_df.reset_index(level=0).rename(columns={'level_0': 'date'})
files_df['date'] = pd.to_datetime(files_df['date'], format='%m-%Y')
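For a concrete picture of the keys=/reset_index flow, here is a minimal sketch with small in-memory frames standing in for the pd.read_excel calls (the file names and values are made up to match the MM-YYYY pattern above):

```python
import pandas as pd

# Stand-ins for the per-file dataframes returned by pd.read_excel
reports = {
    "02-2019.xlsx": pd.DataFrame({"value": [1, 2]}),
    "03-2011.xls": pd.DataFrame({"value": [3]}),
}

# keys= tags each chunk with its file name (minus the extension)
combined = pd.concat(reports.values(), keys=[k.split(".")[0] for k in reports])

# Promote the unnamed file-name level of the MultiIndex to a regular
# column (pandas calls it 'level_0'), then parse it as a date
combined = combined.reset_index(level=0).rename(columns={"level_0": "date"})
combined["date"] = pd.to_datetime(combined["date"], format="%m-%Y")
```

Each row now carries the report month it came from, so the frame is ready for time-series work.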
I want to read all CSV files present in a Linux path and store them in a single dataframe using Python.
I am able to read the files, but when storing them, each file ends up as a separate dictionary entry, e.g. df['file1'], df['file2'], and so on.
Please let me know how I can read each CSV file into its own dataframe dynamically and then combine them into a single dataframe.
Thanks in advance.
from pathlib import Path
import pandas as pd
dataframes = []
for p in Path("path/to/data").iterdir():
    if p.suffix == ".csv":
        dataframes.append(pd.read_csv(p))
df = pd.concat(dataframes)
Or if you want to include subdirectories
from pathlib import Path
import pandas as pd
path = Path("path/to/data")
df = pd.concat([pd.read_csv(f) for f in path.glob("**/*.csv")], ignore_index=True)
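As a self-contained check of the recursive version, this sketch builds a throwaway directory tree (all file names here are invented) and confirms that "**/*.csv" picks up files in subdirectories too:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a disposable tree: one CSV at the top level, one in a subdirectory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    pd.DataFrame({"x": [1, 2]}).to_csv(root / "a.csv", index=False)
    pd.DataFrame({"x": [3]}).to_csv(root / "sub" / "b.csv", index=False)

    # "**/*.csv" matches CSVs at any depth under root; sorted() makes
    # the concatenation order deterministic
    df = pd.concat(
        [pd.read_csv(f) for f in sorted(root.glob("**/*.csv"))],
        ignore_index=True,
    )
```

With ignore_index=True the combined frame gets a fresh 0..n-1 index instead of repeating each file's own row numbers.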
Hi there stack overflow community,
I have several CSV files in a folder, and I need to append a column containing the first 8 characters of each file name as an additional column of the CSV. After this step I want to save the dataframe, including the new column, to the same file.
I get the right output, but it doesn't save the changes in the CSV file :/
Maybe someone has some inspiration for me. Thanks a lot!
import pandas as pd
import glob, os

files = glob.glob(r'path\*.csv')
for fp in files:
    df = pd.concat([pd.read_csv(fp).assign(date=os.path.basename(fp).split('.')[0][:8])])
    #for i in df('date'):
    #Decoder problem
    print(df)
Use df.to_csv() to write each dataframe back, like this:
for fp in files:
    df = pd.concat([pd.read_csv(fp).assign(date=os.path.basename(fp).split('.')[0][:8])])
    df.to_csv(fp, index=False)  # index=False if you don't want to save the index as a new column in the csv
btw, I think this may also work and is more readable:
for fp in files:
    df = pd.read_csv(fp)
    df['date'] = os.path.basename(fp).split('.')[0][:8]
    df.to_csv(fp, index=False)
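To see the whole round trip in one runnable piece, here is a sketch against a single throwaway file (the file name is invented so that its first 8 characters look like a date):

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    fp = os.path.join(tmp, "20220131_report.csv")
    pd.DataFrame({"x": [1, 2]}).to_csv(fp, index=False)

    # Tag the data with the first 8 characters of the file name...
    df = pd.read_csv(fp)
    df["date"] = os.path.basename(fp).split(".")[0][:8]
    # ...and write it back to the same file so the change persists
    df.to_csv(fp, index=False)

    # Re-read to confirm the new column survived the round trip
    out = pd.read_csv(fp)
```

Without the to_csv() call at the end, the new column only exists in memory, which is exactly the symptom described above.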
I've searched for about an hour for an answer to this and none of the solutions I've found are working. I'm trying to get a folder full of CSVs into a single dataframe, to output to one big csv. Here's my current code:
import os
import pandas as pd

sourceLoc = "SOURCE"
destLoc = sourceLoc + "MasterData.csv"
masterDF = pd.DataFrame([])
for file in os.listdir(sourceLoc):
    workingDF = pd.read_csv(sourceLoc + file)
    print(workingDF)
    masterDF.append(workingDF)
print(masterDF)
The SOURCE is a folder path, but I've had to remove it as it's a work network path. The loop is reading the CSVs into the workingDF variable, since when I run it, it prints the data to the console. However, it's also finding 349 rows for each file, and none of the files have that many rows of data in them.
When I print masterDF it prints Empty DataFrame Columns: [] Index: []
My code is from this solution, but that example uses xlsx files, and I'm not sure what changes, if any, are needed to get it to work with CSVs. The pandas documentation on .append and read_csv is quite limited and doesn't indicate anything specific I'm doing wrong.
Any help would be appreciated.
There are a couple of things wrong with your code, but the main thing is that DataFrame.append returns a new dataframe instead of modifying in place. So you would have to do:
masterDF = masterDF.append(workingDF)
I also like the approach taken by I_Al-thamary - concat will probably be faster.
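The pitfall in miniature, shown with pd.concat on toy frames (worth knowing: DataFrame.append was deprecated and then removed in pandas 2.0, so concat is the durable spelling either way):

```python
import pandas as pd

a = pd.DataFrame({"x": [1]})
b = pd.DataFrame({"x": [2]})

# Both append and concat return a NEW dataframe; without the
# reassignment, the combined result is silently thrown away
combined = pd.concat([a, b], ignore_index=True)
```

Calling `pd.concat([a, b], ignore_index=True)` on its own line, without binding the result to a name, would leave `a` and `b` unchanged and the result lost, which is why the loop above printed an empty masterDF.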
One last thing I would suggest, is instead of using glob, check out pathlib.
import pandas as pd
from pathlib import Path
path = Path("your path")
df = pd.concat(map(pd.read_csv, path.rglob("*.csv")))
you can use glob
import glob
import pandas as pd
import os
path = "your path"
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join(path,'*.csv'))))
print(df)
You may store them all in a list and pd.concat them at last.
dfs = [
pd.read_csv(os.path.join(sourceLoc, file))
for file in os.listdir(sourceLoc)
]
masterDF = pd.concat(dfs)
I am currently working on importing and formatting a large number of excel files (all the same format/scheme, but different values) with Python.
I have already read in and formatted one file and everything worked fine so far.
I would now like to do the same for all the other files and combine everything in one dataframe: read the first Excel file into a dataframe, append the second at the bottom of the dataframe, then the third, and so on until all the Excel files are in one dataframe.
So far my script looks something like this:
import pandas as pd
import numpy as np
import xlrd
import os
path = os.getcwd()
path = "path of the directory"
wbname = "name of the excel file"
files = os.listdir(path)
files
wb = xlrd.open_workbook(path + wbname)
# I only need the second sheet
df = pd.read_excel(path + wbname, sheet_name="sheet2", skiprows = 2, header = None,
skipfooter=132)
# here is where all the formatting is happening ...
df
So, "files" is a list with all the relevant file names. Now I have to put one file after the other into a loop (?) so that they all eventually end up in df.
Has anyone ever done something like this, or can anyone help me here?
Something like this might work:
import os
import pandas as pd
folder = 'path_to_all_xlsx'
list_dfs = []
for file in os.listdir(folder):
    # join with the folder, since os.listdir returns bare file names
    df = pd.read_excel(os.path.join(folder, file), <the rest of your config to parse>)
    list_dfs.append(df)
all_dfs = pd.concat(list_dfs)
You read all the dataframes and add them to a list, and then the concat method joins them all together into one big dataframe.
I am trying to open multiple xls files in a folder from a particular directory. I wish to read these files and combine all of them into one dataframe. So far I am able to access the directory and put all the xls files into a list like this:
import os
import pandas as pd
path = r'D:\Anaconda Hub\ARK analysis\data\year2021\february'
files = os.listdir(path)
files
# outputting the variable files which appears to be a list.
Output:
['ARK_Trade_02012021_0619PM_EST_601875e069e08.xls',
'ARK_Trade_02022021_0645PM_EST_6019df308ae5e.xls',
'ARK_Trade_02032021_0829PM_EST_601b2da2185c6.xls',
'ARK_Trade_02042021_0637PM_EST_601c72b88257f.xls',
'ARK_Trade_02052021_0646PM_EST_601dd4dc308c5.xls',
'ARK_Trade_02082021_0629PM_EST_6021c739595b0.xls',
'ARK_Trade_02092021_0642PM_EST_602304eebdd43.xls',
'ARK_Trade_02102021_0809PM_EST_6024834cc5c8d.xls',
'ARK_Trade_02112021_0639PM_EST_6025bf548f5e7.xls',
'ARK_Trade_02122021_0705PM_EST_60270e4792d9e.xls',
'ARK_Trade_02162021_0748PM_EST_602c58957b6a8.xls']
I am now trying to get it into one dataframe like this:
frame = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1')
    frame.append(data)
df = pd.concat(frame, axis=0, ignore_index=True)
However, when doing this I sometimes obtain a blank data frame or it throws an error like this:
FileNotFoundError: [Errno 2] No such file or directory: 'ARK_Trade_02012021_0619PM_EST_601875e069e08.xls'
Help would truly be appreciated with this task.
Thanks in advance.
The issue happens because if you simply pass the file name, the interpreter assumes it is in the current working directory, so you need to use the os module to build the proper location:
import os
import pandas as pd

path = r'D:\Anaconda Hub\ARK analysis\data\year2021\february'
files = os.listdir(path)

#frame = pd.DataFrame() ...This will not work!
frame = []  # Do this instead
for f in files:
    data = pd.read_excel(os.path.join(path, f), 'Sheet1')  # Here join filename with folder location
    frame.append(data)
df = pd.concat(frame, axis=0, ignore_index=True)
The other issue is that frame should be a list or some other iterable. Pandas has append method for dataframes but if you want to use concat then it will need to be a list.
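The os.listdir behaviour that causes the FileNotFoundError can be demonstrated in isolation with a throwaway folder:

```python
import os
import tempfile

# os.listdir returns bare file names, not full paths; opening them
# only works if the current working directory happens to be that folder
with tempfile.TemporaryDirectory() as folder:
    # Create one empty file inside the folder
    open(os.path.join(folder, "a.xls"), "w").close()

    names = os.listdir(folder)                        # bare names like 'a.xls'
    full = [os.path.join(folder, n) for n in names]   # paths that actually resolve
    found = [os.path.exists(p) for p in full]
```

Passing a bare name like 'a.xls' to pd.read_excel makes pandas look in the current working directory, which is exactly the error in the question.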
You can also add a middle step to check whether the path exists or not. I suspect this is an isolated issue with your server; from memory, when working on older Windows servers (namely 2012), I would sometimes have issues where the path couldn't be found even though it 100% existed.
import pandas as pd
from pathlib import Path
# assuming you want xls and xlsx.
files = Path('folder_location').glob('*.xls*')
dfs = []
for file in files:
    if file.is_file():
        df = pd.read_excel(file, sheet_name='sheet')
        dfs.append(df)
final_df = pd.concat(dfs)
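A quick runnable check of why the is_file() guard matters: glob('*.xls*') will happily match a directory whose name fits the pattern (the names below are invented):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.xls").write_text("")   # a real file
    (root / "backup.xls").mkdir()     # a directory that matches '*.xls*' too

    matched = sorted(p.name for p in root.glob("*.xls*"))
    kept = sorted(p.name for p in root.glob("*.xls*") if p.is_file())
```

Without the guard, pd.read_excel would be handed the directory and raise an error, so filtering with is_file() keeps the loop robust.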