I have data exports that land in a particular directory every hour, and I'm hoping to write a script that reads all the files and appends them into one master dataframe in Python. The only issue is that, since new files arrive every hour, I don't want to re-append CSV files that have already been added to the master dataframe.
I'm very new to Python, and so far have only been able to load all the files in the directory and append them all, using the code below:
import pandas as pd
import os
import glob
path = os.environ['HOME'] + "/file_location/"
allFiles = glob.glob(os.path.join(path,"name_of_files*.csv"))
df = pd.concat((pd.read_csv(f) for f in allFiles), sort=False)
The above code looks in file_location and imports any file whose name starts with "name_of_files", using a wildcard because the tail of each file name is different.
I could keep doing this, but I'm going to end up with hundreds of files and don't want to import and concatenate all of them every hour. Instead, I'd like to keep the master dataframe mentioned above and have only the new CSV files that arrive each hour appended to it.
Again, I'm super new to Python, so I'm not even sure what to do next. Any advice would be greatly appreciated!
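One common pattern for this (a rough sketch, not tested against your setup): persist the master dataframe as a CSV and keep a small record of which source files have already been loaded, then only read files that are missing from that record. The names master.csv and processed_files.txt below are placeholders.
import os
import glob
import pandas as pd

path = os.environ['HOME'] + "/file_location/"
master_path = os.path.join(path, "master.csv")           # placeholder name
log_path = os.path.join(path, "processed_files.txt")     # placeholder name

# Load the list of files that were appended on previous runs
if os.path.exists(log_path):
    with open(log_path) as fh:
        processed = set(line.strip() for line in fh)
else:
    processed = set()

all_files = glob.glob(os.path.join(path, "name_of_files*.csv"))
new_files = [f for f in all_files if f not in processed]

if new_files:
    new_df = pd.concat((pd.read_csv(f) for f in new_files), sort=False)
    # Append to the master CSV, writing the header only if the master doesn't exist yet
    new_df.to_csv(master_path, mode="a", index=False,
                  header=not os.path.exists(master_path))
    # Remember which files have now been handled
    with open(log_path, "a") as fh:
        fh.write("\n".join(new_files) + "\n")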
First-time poster here! I have perused these forums for a while and I am taken aback by how supportive this community is.
My problem involves several Excel files with the same naming convention, column headers, and data types that I am trying to read in with pandas. After reading them in, I want to compare the 'Agreement Date' column across all the dataframes and create a yes/no column indicating whether they match. I then want to export the resulting dataframe.
I am still learning Python and Pandas so I am struggling with this task. This is my code so far:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# read .xlsx file into a list
allfiles = glob.glob(xlpath + "*.xls")
# for loop to read in all files
for excelfiles in allfiles:
    raw_excel = pd.read_excel(allfiles)
# place all the pulled dataframe into a list
list = [raw_excel]
From here, though, I am quite lost. I do not know how to join all of my files together on my id column and then compare the 'Agreement Date' column. Any help would be greatly appreciated!
THANKS!!
In your loop you need to hand the looped value, not the whole list, to read_excel.
You have to append to the list within the loop, otherwise only the last item will end up in it.
Do not overwrite Python builtins such as list, or you can run into behavior that is difficult to debug.
Here's what I would change:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# get file name list of .xlsx files in the directory
allfiles = glob.glob(xlpath + "*.xls")
# for loop to read in all files & place all the pulled dataframe into a list
dataframes_list = []
for file in allfiles:
    dataframes_list.append(pd.read_excel(file))
You can then append the DataFrames like this:
merged_df = dataframes_list[0]
for df in dataframes_list[1:]:
    merged_df = merged_df.append(df, ignore_index=True)
Use ignore_index=True if the indexes overlap and cause problems. If they are already distinct and you want to keep them, set it to False.
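Note that DataFrame.append was deprecated and has been removed in pandas 2.0, so on a recent pandas version the same result is simpler with pd.concat:
merged_df = pd.concat(dataframes_list, ignore_index=True)
The 'Agreement Date' comparison from the question is not covered above; one possible sketch, assuming every file has an 'ID' column to merge on (the column name and the output file name are guesses, not from the post):
# Merge two of the frames on the assumed 'ID' column and flag whether the dates agree
pair = dataframes_list[0].merge(dataframes_list[1], on='ID', suffixes=('_left', '_right'))
pair['Dates Match'] = (pair['Agreement Date_left'] == pair['Agreement Date_right']).map({True: 'yes', False: 'no'})
pair.to_excel(xlpath + "comparison.xlsx", index=False)  # placeholder output name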
I am working on a shared network drive with one main folder containing many subfolders, one for each date (over 1,700 of them), and within each subfolder a CSV file (results.csv) with a common name ending and the same format. Each CSV contains well over 30k rows.
I want to read in all the CSVs and append them into one dataframe to perform some minor calculations. I used the code below. It ran for 3+ days so I killed it, although looking at the dataframe it had actually got about 80% of the way through. It seems inefficient because it takes ages, and when I want to add the latest day's file the whole thing will have to re-run. I also only need a handful of the columns within each CSV, so I want to use the usecols=['A', 'B', 'C'] argument but am not sure how to incorporate it. Could someone please shed some light on a better solution?
import glob
import os
import pandas as pd
file_source = glob.glob(r"//location//main folder//**//*results.csv", recursive=True)
appended_file = []
for i in file_source:
    df = pd.read_csv(i)
    appended_file.append(df)
combined = pd.concat(appended_file, axis=0, ignore_index=True, sort=False)
Thanks.
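One way to speed this up and avoid re-reading everything each day (a rough sketch, not a drop-in answer): pass usecols directly to read_csv so only the needed columns are parsed, and cache the combined result, e.g. as a Parquet file (requires pyarrow or fastparquet), so later runs only read files that are not in the cache yet. The column names and the cache file name below are placeholders.
import glob
import os
import pandas as pd

cache_path = r"//location//main folder//combined.parquet"  # placeholder cache file
wanted_cols = ['A', 'B', 'C']                              # placeholder column names

file_source = glob.glob(r"//location//main folder//**//*results.csv", recursive=True)

# Load the cache from the previous run, if there is one
if os.path.exists(cache_path):
    combined = pd.read_parquet(cache_path)
    done = set(combined['source_file'])
else:
    combined = pd.DataFrame()
    done = set()

new_frames = []
for f in file_source:
    if f in done:
        continue                                # already in the cache
    df = pd.read_csv(f, usecols=wanted_cols)    # only parse the needed columns
    df['source_file'] = f                       # remember where each row came from
    new_frames.append(df)

if new_frames:
    pieces = ([combined] if not combined.empty else []) + new_frames
    combined = pd.concat(pieces, ignore_index=True, sort=False)
    combined.to_parquet(cache_path)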
I have a folder that has a few hundred files and is growing every hour. I am trying to consolidate all the data into a single file for analysis. But the script I wrote is not very effective, as it reads all the content in the folder and appends it to an xlsx file, so the processing time is simply too long.
What I am seeking is to improve my script so that it can:
1) Read and extract data only from new files that have not been previously read
2) Append that data to the existing xlsx file.
I just need some enlightenment to help me improve the script.
Part of my code is as follows
import pandas as pd
import numpy as np
import os
import dask.dataframe as dd
import glob
import schedule
import time
import re
import datetime as dt
def job():
    # Select the path to download the files
    path = r'V:\DB\ABCD\BEFORE\8_INCHES'
    files = glob.glob(path + "/*.csv")
    df = None
    # Extract information from the files
    for i, file in enumerate(files):
        if i == 0:
            df = np.transpose(pd.read_csv(file, delimiter="|", index_col=False))
            df['Path'] = file
            df['Machine No'] = re.findall("MC-11", str(df["Path"]))
            df['Process'] = re.findall("ABCD", str(df["Path"]))
            df['Before/After'] = re.findall("BEFORE", str(df["Path"]))
            df['Wafer Size'] = re.findall("8_INCHES", str(df["Path"]))
            df['Employee ID'] = df["Path"].str.extract(r'(?<!\d)(\d{6})(?!\d)', expand=False)
            df['Date'] = df["Path"].str.extract(r'(\d{4}_\d{2}_\d{2})', expand=False)
            df['Lot Number'] = df["Path"].str.extract(r'(\d{7}\D\d)', expand=False)
            df['Part Number'] = df["Path"].str.extract(r'([A-Z]{2,3}\d{3,4}[A-Z][A-Z]\d{2,4}[A-Z])', expand=False)
            df["Part Number"].fillna("ENGINEERING SAMPLE", inplace=True)
        else:
            tmp = np.transpose(pd.read_csv(file, delimiter="|", index_col=False))
            tmp['Path'] = file
            tmp['Machine No'] = tmp["Path"].str.extract(r'(\D{3}\d{2})', expand=False)
            tmp['Process'] = tmp["Path"].str.extract(r'(\w{8})', expand=False)
            tmp['Before/After'] = tmp["Path"].str.extract(r'([B][E][F][O][R][E])', expand=False)
            tmp['Wafer Size'] = tmp["Path"].str.extract(r'(\d\_\D{6})', expand=False)
            tmp['Employee ID'] = tmp["Path"].str.extract(r'(?<!\d)(\d{6})(?!\d)', expand=False)
            tmp['Date'] = tmp["Path"].str.extract(r'(\d{4}_\d{2}_\d{2})', expand=False)
            tmp['Lot Number'] = tmp["Path"].str.extract(r'(\d{7}\D\d)', expand=False)
            tmp['Part Number'] = tmp["Path"].str.extract(r'([A-Z]{2,3}\d{3,4}[A-Z][A-Z]\d{2,4}[A-Z])', expand=False)
            tmp["Part Number"].fillna("ENGINEERING SAMPLE", inplace=True)
            df = df.append(tmp)
    export_excel = df.to_excel(r'C:\Users\hoosk\Documents\Python Scripts\hoosk\test26_feb_2020.xlsx')
#schedule to run every hour
schedule.every(1).hour.do(job)
while True:
    schedule.run_pending()
    time.sleep(1)
In general terms you'll want to do the following:
Read in the xlsx file at the start of your script.
Extract a set with all the filenames that were already parsed (the Path column).
For each file you iterate over, check whether it is in that set of already processed files and skip it if so (see the sketch below).
This assumes that existing files don't have their content updated. If that could happen, you may want to track metrics like last change date (a checksum would be most reliable, but probably too expensive to compute).
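A minimal sketch of those steps, reusing the paths and the Path column from the question (hedged, since the exact layout of the workbook isn't shown):
import os
import glob
import pandas as pd

output_path = r'C:\Users\hoosk\Documents\Python Scripts\hoosk\test26_feb_2020.xlsx'

# 1. Read the existing workbook (if any) and collect the already-parsed file paths
if os.path.exists(output_path):
    existing = pd.read_excel(output_path)
    processed = set(existing['Path'])
else:
    existing = None
    processed = set()

# 2. Only keep files that have not been processed before
files = glob.glob(r'V:\DB\ABCD\BEFORE\8_INCHES' + "/*.csv")
new_files = [f for f in files if f not in processed]

# 3. Read and enrich only new_files as in the original loop, append the result
#    to `existing`, and write the workbook back out once at the end.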
I am a beginner in Python. I have about 1000 CSV files (1.csv, 2.csv, ..., 1000.csv). Each CSV file has about 3,000,000,000 rows and 14 variables. I would like to clean the data in each CSV file first, using the same process for each file:
sum variable A and variable B,
count C by sorting date, and if the number of records in C for one day is greater than 50, drop it.
Next, save the cleaned data into a new CSV file. Finally, append all 1000 new CSV files into one CSV file.
I have some code as follows, but it imports all the CSV files first and then cleans the data, which is very inefficient. I would like to clean the data in each CSV file first, then append the new CSV files. Can anyone help me with this? Any help will be appreciated.
This is what I understand from your question. I read each file and add a new column for the sum. Then I sort the values and drop any row where C is greater than 50. After that, I save the update. Before you do this you should copy your original files, or you can save the results under a different file name.
import glob
import os
import pandas as pd
path = "./data/"
all_files = glob.glob(os.path.join(path, "*.csv")) #make list of paths
for file in all_files:
    # Get the file name without the extension
    file_name = os.path.splitext(os.path.basename(file))[0]
    df = pd.read_csv(file)
    df['new_column'] = df['A'] + df['B']
    df = df.sort_values(by='C')
    df.drop(df.loc[df['C'] > 50].index, inplace=True)
    # Save the cleaned data under a new name (example suffix) so the original is kept
    df.to_csv(os.path.join(path, file_name + "_cleaned.csv"), index=False)
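The last step from the question, appending all of the cleaned files into a single CSV, can then be done in one pass once the loop has finished (this reuses the example "_cleaned" suffix from above; the output name is a placeholder):
cleaned_files = glob.glob(os.path.join(path, "*_cleaned.csv"))
combined = pd.concat((pd.read_csv(f) for f in cleaned_files), ignore_index=True)
combined.to_csv(os.path.join(path, "combined.csv"), index=False)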
I am pretty new to Python in general, but am trying to make a script that takes data from certain files in a folder and puts it into an Excel spreadsheet.
The code I have will find the file type that I want in my specified folder, and then make a list with the full file paths.
import os
file_paths = []
for folder, subs, files in os.walk('C://Users/Dir'):
    for filename in files:
        if filename.endswith(".log") or filename.endswith(".txt"):
            file_paths.append(os.path.abspath(os.path.join(folder, filename)))
It will also take a specific file path, pull data from the correct column, and put it into excel in the correct cells.
import pandas as pd
import numpy
for i in range(len(file_paths)):
    fields = ['RDCR']
    data = pd.read_table(file_paths[i], sep="\s+", names=fields, usecols=[3])
Where I am having trouble is getting read_table to iterate through my list of files and put the data into an Excel sheet, moving over one column each time it reads a new file.
Ideally, the for loop would see how long the file_paths list is, and use that as the range. It would then use the file_paths[i] to input the file names into the read_table one by one.
What happens is that it finds the length of file_paths, and instead of iterating through the files in it one by one, it just inputs the data from the last file on the list.
Any help would be much appreciated! Thank you!
Try concatenating all of them at once and writing to Excel a single time.
from glob import glob
import pandas as pd
files = glob('C://Users/Dir/*.log') + glob('C://Users/Dir/*.txt')
def read_file(f):
    fields = ['RDCR']
    return pd.read_table(
        f, sep=r"\s+",
        names=fields, usecols=[3])

df = pd.concat([read_file(f) for f in files], axis=1)
df.to_excel('out.xlsx')
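Here axis=1 is what makes the result move over one column per file: each read_file(f) call returns a single 'RDCR' column, and pd.concat lines them up side by side before the single to_excel call writes the sheet.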