I have a folder that has, say, a few hundred files and is growing every hour. I am trying to consolidate all the data into a single file for analysis. But the script I wrote is not very efficient, as it reads all the content in the folder and appends it to an xlsx file, so the processing time is simply too long.
What I am seeking is to enhance and improve my script:
1) To only read and extract data from new files that have not been previously read
2) To append that data to the existing xlsx file.
I just need some enlightenment to help me improve the script.
Part of my code is as follows:
import pandas as pd
import numpy as np
import os
import dask.dataframe as dd
import glob
import schedule
import time
import re
import datetime as dt
def job():
    # Select the path containing the downloaded files
    path = r'V:\DB\ABCD\BEFORE\8_INCHES'
    files = glob.glob(path + "/*.csv")
    df = None
    # Extract information from each file
    for i, file in enumerate(files):
        if i == 0:
            df = np.transpose(pd.read_csv(file, delimiter="|", index_col=False))
            df['Path'] = file
            df['Machine No'] = re.findall("MC-11", str(df["Path"]))
            df['Process'] = re.findall("ABCD", str(df["Path"]))
            df['Before/After'] = re.findall("BEFORE", str(df["Path"]))
            df['Wafer Size'] = re.findall("8_INCHES", str(df["Path"]))
            df['Employee ID'] = df["Path"].str.extract(r'(?<!\d)(\d{6})(?!\d)', expand=False)
            df['Date'] = df["Path"].str.extract(r'(\d{4}_\d{2}_\d{2})', expand=False)
            df['Lot Number'] = df["Path"].str.extract(r'(\d{7}\D\d)', expand=False)
            df['Part Number'] = df["Path"].str.extract(r'([A-Z]{2,3}\d{3,4}[A-Z][A-Z]\d{2,4}[A-Z])', expand=False)
            df["Part Number"].fillna("ENGINEERING SAMPLE", inplace=True)
        else:
            tmp = np.transpose(pd.read_csv(file, delimiter="|", index_col=False))
            tmp['Path'] = file
            tmp['Machine No'] = tmp["Path"].str.extract(r'(\D{3}\d{2})', expand=False)
            tmp['Process'] = tmp["Path"].str.extract(r'(\w{8})', expand=False)
            tmp['Before/After'] = tmp["Path"].str.extract(r'([B][E][F][O][R][E])', expand=False)
            tmp['Wafer Size'] = tmp["Path"].str.extract(r'(\d\_\D{6})', expand=False)
            tmp['Employee ID'] = tmp["Path"].str.extract(r'(?<!\d)(\d{6})(?!\d)', expand=False)
            tmp['Date'] = tmp["Path"].str.extract(r'(\d{4}_\d{2}_\d{2})', expand=False)
            tmp['Lot Number'] = tmp["Path"].str.extract(r'(\d{7}\D\d)', expand=False)
            tmp['Part Number'] = tmp["Path"].str.extract(r'([A-Z]{2,3}\d{3,4}[A-Z][A-Z]\d{2,4}[A-Z])', expand=False)
            tmp["Part Number"].fillna("ENGINEERING SAMPLE", inplace=True)
            df = df.append(tmp)
    export_excel = df.to_excel(r'C:\Users\hoosk\Documents\Python Scripts\hoosk\test26_feb_2020.xlsx')
# Schedule the job to run every hour
schedule.every(1).hour.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
In general terms you'll want to do the following:
Read in the xlsx file at the start of your script.
Extract a set of all the filenames already parsed (the Path column).
For each file you iterate over, check whether it is contained in the set of already processed files.
This assumes that existing files don't have their content updated. If that could happen, you may want to track a metric like the last modification date (a checksum would be the most reliable, but is probably too expensive to compute). A sketch of this approach follows.
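A minimal sketch of that idea, assuming the consolidated workbook is the test26_feb_2020.xlsx written by job() above and that it keeps the Path column (the paths and per-file enrichment come from the question; adjust them to your layout):
import glob
import os

import numpy as np
import pandas as pd

MASTER = r'C:\Users\hoosk\Documents\Python Scripts\hoosk\test26_feb_2020.xlsx'
SOURCE = r'V:\DB\ABCD\BEFORE\8_INCHES'

def job():
    # Load the paths consolidated on previous runs, if the master file already exists.
    if os.path.exists(MASTER):
        master = pd.read_excel(MASTER)
        processed = set(master['Path'])
    else:
        master = pd.DataFrame()
        processed = set()

    # Only parse files that have not been seen before.
    new_files = [f for f in glob.glob(os.path.join(SOURCE, '*.csv')) if f not in processed]

    frames = []
    for file in new_files:
        tmp = np.transpose(pd.read_csv(file, delimiter="|", index_col=False))
        tmp['Path'] = file
        # ... same per-file enrichment as in the question ...
        frames.append(tmp)

    if frames:
        # Append only the new rows and rewrite the workbook once per run.
        master = pd.concat([master] + frames)
        master.to_excel(MASTER, index=False)
This way each hourly run only reads files whose paths are not yet in the workbook, and the Excel file is written once per run instead of once per file.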
Related
I am new to coding. I basically have a bunch of files in "nifti" format. I wanted to simply load them, apply a thresholding function to them, and then save them. I was able to write the few lines of code to do it for one file (it worked), but I have many, so I created another Python file and tried to write a for loop. I think it does everything fine, but the last step that saves my files just keeps overwriting, so in the end I only get one output file.
import numpy as np
import nibabel as nb
import glob
import os
path = 'subjects'
all_files = glob.glob(path + '/*.nii')

for filename in all_files:
    image = nb.load(filename)
    data = image.get_fdata()
    data[data < 0.1] = 0
    new_image = nb.Nifti1Image(data, affine=image.affine, header=image.header)
    nb.save(new_image, filename+1)
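The overwriting comes from the output path: each iteration needs its own filename (and note that filename + 1 mixes a string and an integer, which Python will reject). A minimal sketch, assuming a hypothetical _thresholded suffix for the output names:
import glob
import os

import nibabel as nb

path = 'subjects'

for filename in glob.glob(path + '/*.nii'):
    image = nb.load(filename)
    data = image.get_fdata()
    data[data < 0.1] = 0
    new_image = nb.Nifti1Image(data, affine=image.affine, header=image.header)

    # Derive a distinct output name, e.g. subjects/foo.nii -> subjects/foo_thresholded.nii
    root, ext = os.path.splitext(filename)
    nb.save(new_image, f"{root}_thresholded{ext}")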
I have a data frame which I read in from a locally saved CSV file.
I then want to loop over said file and create several CSV files based on a string in one column.
Lastly, I want to add all those files to a zip file, but without saving them locally. I just want one zip archive including all the different CSV files.
All my attempts using the io or zipfile modules only resulted in one zip file with one CSV file in it (pretty much what I started with).
Any help would be much appreciated!
Here is my code so far, which works but saves all CSV files just to my hard drive.
import pandas as pd
from zipfile import ZipFile
df = pd.read_csv("myCSV.csv")
channelsList = df["Turn one column to list"].values.tolist()
channelsList = list(set(channelsList))  # delete duplicates from the list
for channel in channelsList:
    newDf = df.loc[df['Something to match'] == channel]
    newDf.to_csv(f"{channel}.csv")  # saves csv files to disk
DataFrame.to_csv() can write to any file-like object, and ZipFile.writestr() can accept a string (or bytes), so it is possible to avoid writing the CSV files to disk using io.StringIO. See the example code below.
Note: If the channel is simply stored in a single column of your input data, then the more idiomatic (and more efficient) way to iterate over the partitions of your data is to use groupby().
from io import StringIO
from zipfile import ZipFile
import numpy as np
import pandas as pd
# Example data
df = pd.DataFrame(np.random.random((100,3)), columns=[*'xyz'])
df['channel'] = np.random.randint(5, size=len(df))
with ZipFile('/tmp/output.zip', 'w') as zf:
    for channel, channel_df in df.groupby('channel'):
        s = StringIO()
        channel_df.to_csv(s, index=False, header=True)
        zf.writestr(f"{channel}.csv", s.getvalue())
My dataset looks at flight delays and cancellations from 2009 to 2018. Here are the important points to consider:
Each year is its own csv file, so '2009.csv', '2010.csv', all the way to '2018.csv'
Each file is roughly 700 MB
I used the following to combine csv files
import pandas as pd
import numpy as np
import os, sys
import glob
os.chdir('c:\\folder')
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_airline_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
combined_airline_csv.to_csv('combined_airline_csv.csv', index=False, encoding='utf-8-sig')
When I run this, I receive the following message:
MemoryError: Unable to allocate 43.3 MiB for an array with shape (5674621,) and data type float64.
I am presuming that my file is too large and that I will need to run this on a virtual machine (e.g. AWS).
Any thoughts?
Thank you!
This is a duplicate of how to merge 200 csv files in Python.
Since you just want to combine them into one file, there is no need to load all the data into a dataframe at the same time. Since they all have the same structure, I would advise creating one file writer, then opening each file with a file reader and writing (if we want to be fancy, let's call it streaming) the data line by line. Just be careful not to copy the headers each time, since you only want them once. Pandas is simply not the best tool for this task :)
In general, this is a typical task that can also be done easily, and even faster, directly on the command line (the exact command depends on the OS). A Python sketch of the streaming approach is shown below.
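A minimal sketch of that line-by-line approach, assuming all the yearly files share the same header; the combined_airline.csv output name is just an illustration:
import glob

output = 'combined_airline.csv'
filenames = [f for f in sorted(glob.glob('*.csv')) if f != output]

with open(output, 'w', encoding='utf-8-sig', newline='') as out:
    for i, name in enumerate(filenames):
        with open(name, 'r', encoding='utf-8') as f:
            header = f.readline()
            # Copy the header only once, from the first file.
            if i == 0:
                out.write(header)
            # Stream the remaining lines without loading the whole file into memory.
            for line in f:
                out.write(line)
Peak memory use stays at roughly one line at a time, regardless of how many 700 MB files there are.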
I have these data exports that are populating every hour in a particular directory, and I'm hoping to have a script that reads all the files and appends them into one master dataframe in Python. The only issue is, since they are populating every hour, I don't want to append existing or already-added csv files to the master dataframe.
I'm very new to Python, and so far have only been able to load all the files in the directory and append them all, using the code below:
import pandas as pd
import os
import glob
path = os.environ['HOME'] + "/file_location/"
allFiles = glob.glob(os.path.join(path,"name_of_files*.csv"))
df = pd.concat((pd.read_csv(f) for f in allFiles), sort=False)
With the above code, it looks in file_location and imports any files with the name "name_of_files", using a wildcard since the tail of each of the files will be different.
I could continue to do this, but I'm literally going to have hundreds of files and don't want to import and append/concat them all each and every hour. To avoid this, I'd like to have that master dataframe mentioned above and just have the new csv files that populate each hour be automatically appended to that existing master df.
Again, super new to Python, so I'm not even sure what to do next. Any advice would be greatly appreciated!
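One possible approach, sketched under the assumption that the master data can live in a CSV on disk alongside a small sidecar file listing which inputs have already been appended (master_df.csv and processed_files.txt are placeholder names):
import glob
import os

import pandas as pd

path = os.environ['HOME'] + "/file_location/"
master_path = os.path.join(path, "master_df.csv")      # consolidated data (placeholder name)
log_path = os.path.join(path, "processed_files.txt")   # filenames already appended (placeholder name)

# Load the list of files that were appended on previous runs.
processed = set()
if os.path.exists(log_path):
    with open(log_path) as log:
        processed = {line.strip() for line in log}

all_files = glob.glob(os.path.join(path, "name_of_files*.csv"))
new_files = [f for f in all_files if f not in processed]

if new_files:
    new_df = pd.concat((pd.read_csv(f) for f in new_files), sort=False)
    # Append to the master CSV, writing the header only if the file does not exist yet.
    new_df.to_csv(master_path, mode='a', header=not os.path.exists(master_path), index=False)
    # Record which files have now been processed.
    with open(log_path, 'a') as log:
        log.write('\n'.join(new_files) + '\n')
Each hourly run then only reads the files it has not logged yet and appends their rows to the master CSV.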
I am pretty new to Python in general, but am trying to make a script that takes data from certain files in a folder and puts it into an Excel spreadsheet.
The code I have will find the file type that I want in my specified folder, and then make a list with the full file paths.
import os
file_paths = []
for folder, subs, files in os.walk('C://Users/Dir'):
    for filename in files:
        if filename.endswith(".log") or filename.endswith(".txt"):
            file_paths.append(os.path.abspath(os.path.join(folder, filename)))
It will also take a specific file path, pull data from the correct column, and put it into Excel in the correct cells.
import pandas as pd
import numpy
for i in range(len(file_paths)):
    fields = ['RDCR']
    data = pd.read_table(file_paths[i], sep=r"\s+", names=fields, usecols=[3])
Where I am having trouble is making read_table iterate through my list of files and put the data into an Excel sheet so that every time it reads a new file, it moves over one column in the spreadsheet.
Ideally, the for loop would see how long the file_paths list is and use that as the range. It would then use file_paths[i] to feed the file names into read_table one by one.
What happens instead is that it finds the length of file_paths, but rather than iterating through the files one by one, it only ends up with the data from the last file in the list.
Any help would be much appreciated! Thank you!
Try concatenating all of them at once and writing to Excel a single time.
from glob import glob
import pandas as pd
files = glob('C://Users/Dir/*.log') + glob('C://Users/Dir/*.txt')
def read_file(f):
    fields = ['RDCR']
    # Read the fourth whitespace-separated column from each file.
    return pd.read_table(f, sep=r"\s+", names=fields, usecols=[3])

# Concatenate side by side (one column per file) and write to Excel once.
df = pd.concat([read_file(f) for f in files], axis=1)
df.to_excel('out.xlsx')