I have around 500 .txt files on my local system and would like to merge them into a dataframe in Google Colab. I have already uploaded them via the Upload option, where I uploaded a zipped folder containing the .txt files and later unzipped them in Google Colab. Each .txt file holds one row of data, e.g. 0 12 34.3 423
I tried the following code to upload directly from my local system, but it did not work.
Colab cannot access your local files through the typical built-ins as far as I know. You have to use Colab-specific modules. The guide is here.
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
This will prompt you to select the files to upload.
EDIT: Since you need the file names, you can just use the loop above and then concatenate, as you correctly mentioned.
import pandas as pd

# create a list of file names (avoid calling it `files`, which would shadow the Colab module)
file_names = []
for fn in uploaded.keys():
    file_names.append(fn)

# create a list of dataframes; the files from the question are space-separated with no header
frames = []
for name in file_names:
    frames.append(pd.read_csv(name, sep=r'\s+', header=None))

# concat all of your frames at once
df = pd.concat(frames)
Alternatively, depending on the size of your files, you could also join the two loops and concatenate each file onto the existing frame as you load it, so that memory has to hold less data at once.
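A minimal sketch of that combined loop, assuming the space-separated, headerless one-row files described in the question:
import pandas as pd

df = pd.DataFrame()
for fn in uploaded.keys():
    new = pd.read_csv(fn, sep=r'\s+', header=None)  # one row per file
    df = pd.concat([df, new], ignore_index=True)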
I am trying to read in data from text files that are moved into a network share drive over a VPN. The overall intent is to be able to loop through the files with yesterday's date (either in the file name, or by the Modified Date) and extract the pipe delimited data separated by "|" and concat it into a pandas df. The issue I am having is actually being able to read files from the network drive. So far I have only been able to figure out how to use os.listdir to identify the file names, but not actually read them. Anyone have any ideas?
This is what I've tried so far that has actually started to pan out: os.listdir can correctly see the network folder and the files inside. But how would I read the actual files (filtered by date or not) to get the loop to work?
import os
import pandas as pd

#folder = os.listdir(r'\\fileshare.com\PATH\TO\FTP\FILES')
folder = (r'\\fileshare.com\PATH\TO\FTP\FILES')
main_dataframe = pd.DataFrame(pd.read_csv(folder[0]))
for i in range(1, len(folder)):
    data = pd.read_csv(folder[i])
    df = pd.DataFrame(data)
    main_dataframe = pd.concat([main_dataframe, df], axis=1)
print(main_dataframe)
I'm pretty new at Python and doing things like this, so I apologize if I refer to anything wrong. Any advice would be greatly appreciated!
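A minimal sketch of one way to do this, assuming the share is reachable via os.listdir as above; the UNC path is the placeholder from the question, and the Modified Date filter uses each file's mtime:
import datetime
import os
import pandas as pd

folder = r'\\fileshare.com\PATH\TO\FTP\FILES'  # placeholder path from the question
yesterday = datetime.date.today() - datetime.timedelta(days=1)

frames = []
for name in os.listdir(folder):
    full_path = os.path.join(folder, name)
    modified = datetime.date.fromtimestamp(os.path.getmtime(full_path))
    if modified == yesterday:
        # pipe-delimited files, as described in the question
        frames.append(pd.read_csv(full_path, sep='|'))

main_dataframe = pd.concat(frames, ignore_index=True)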
I am working with an S3 bucket which has multiple levels, and every subfolder has multiple files.
I am trying to run a Python script in Glue that needs to combine all these files into one dataframe per folder, run a process on it, and then save the result in another S3 bucket under a similar file path.
Here is the hierarchy of the folders:
long_path/folder1:
long_path/folder1/A: A1.csv, A2.csv, A3.csv
long_path/folder1/B: B1.csv, B2.csv
long_path/folder2:
long_path/folder2/C: C1.csv, C2.csv... C5.csv
long_path/folder3:
long_path/folder3/D: D1.csv
long_path/folder3/E: E1.csv...E4.csv
I would like to combine all the CSVs in folders A, B, C, D, and E and create individual dataframes called df_a, df_b, df_c, df_d, df_e.
So far, my approach has been to create a list of these paths and build the dataframes by iterating over it:
import boto3
import pandas as pd

s3_client = boto3.client("s3")

prefixes = ["long_path/folder1/A", "long_path/folder1/B", "long_path/folder2/C",
            "long_path/folder3/D", "long_path/folder3/E"]
for prefix in prefixes:
    files = []
    for item in s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)['Contents']:
        if item['Key'].endswith(".csv"):
            files.append(item['Key'])
    list_df = []
    for key in files:
        path = "s3://bucket-name/" + key
        df = pd.read_csv(path, engine='pyarrow')
        list_df.append(df)
    final_df = pd.concat(list_df)
And then I do the process within this loop.
This code, however, looks very clunky. Is there a more efficient and cleaner way to do this task?
How do I combine all the files in a folder for multiple folders?
Thanks in advance!
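One possible cleaner shape, sketched under the same assumptions as the code above (same bucket and prefixes; the helper and dict names are placeholders), is to pull the per-folder work into a function and collect the results in a dict keyed by folder letter:
import boto3
import pandas as pd

s3_client = boto3.client("s3")
bucket = "bucket-name"  # placeholder

def load_folder(prefix):
    """Concatenate every .csv object under one prefix into a single dataframe."""
    keys = [item["Key"]
            for item in s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
            if item["Key"].endswith(".csv")]
    return pd.concat(
        (pd.read_csv(f"s3://{bucket}/{key}", engine="pyarrow") for key in keys),
        ignore_index=True)

prefixes = ["long_path/folder1/A", "long_path/folder1/B",
            "long_path/folder2/C", "long_path/folder3/D", "long_path/folder3/E"]
dfs = {prefix.rsplit("/", 1)[-1].lower(): load_folder(prefix) for prefix in prefixes}
# dfs["a"] plays the role of df_a, dfs["b"] of df_b, and so on.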
I have code running as an AWS Lambda function that queries an internal database and generates files in different formats. The files are generated in parts and uploaded to S3 using multipart upload:
self.mpu = self.s3_client.create_multipart_upload(
    Bucket=self.bucket_name,
    ContentType=self.get_content_type(),
    Expires=self.expire_daytime,
    Key=self.filename,
)
and
response = self.s3_client.upload_part(
    Bucket=self.bucket_name,
    Key=self.filename,
    PartNumber=self.part_number,
    UploadId=self.upload_id,
    Body=data
)
self.current_part_info.update({
    'PartNumber': self.part_number,
    'ETag': response['ETag']
})
One of the formats I need to support is XLS or XLSX. It's fairly easy to create multiple CSV files on S3. But is it possible to combine them directly on S3 into XLS/XLSX without downloading them?
My current code generates an XLSX file in memory, creates a local file, and then uploads it to S3:
import xlsxwriter

self.workbook = xlsxwriter.Workbook(self.filename)
# download CSV files...
for sheet_name, sheet_info in sheets.items():
    sheet = self.workbook.add_worksheet(name=sheet_name)
    # code that does formatting
    for ...  # loop through rows
        for ...  # loop through columns
            sheet.write(row, col, col_str)
self.workbook.close()
This works fine for small queries, but the users will want to use it for a large amount of data.
When I run it with large queries, it runs out of memory. AWS Lambda has limited memory and limited disk space, and I'm hitting those limits.
Is it possible to combine CSV files into XLS or XLSX somehow without holding the entire file in local space (both memory and disk space are a problem)?
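Not a full answer, but one memory-side option worth noting: xlsxwriter has a constant_memory mode that keeps only the current row in memory and streams finished rows to temporary files, trading RAM for /tmp disk space (which is still limited on Lambda). A minimal sketch, with the output path and sample rows as placeholders:
import xlsxwriter

workbook = xlsxwriter.Workbook(
    "/tmp/output.xlsx",  # Lambda's writable scratch directory; placeholder name
    {"constant_memory": True, "tmpdir": "/tmp"},
)
sheet = workbook.add_worksheet("data")
# in constant_memory mode, rows must be written in order, top to bottom
for row, values in enumerate([[1, 2, 3], [4, 5, 6]]):  # stand-in for the CSV rows
    for col, value in enumerate(values):
        sheet.write(row, col, value)
workbook.close()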
I have a Databricks notebook set up that works as follows:
PySpark connection details to the Blob storage account
Read the file into a Spark dataframe
Convert to a pandas dataframe
Data modelling on the pandas dataframe
Convert back to a Spark dataframe
Write to Blob storage as a single file
My problem is that you cannot name the output file, and I need a static CSV filename.
Is there a way to rename this in PySpark?
## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""
## File location and File type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"
## Connection string to connect to blob storage
spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key)
Followed by outputting the file after data transformation:
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
The file is then written as "part-00000-tid-336943946930983.....csv", whereas the goal is to have "Output.csv".
Another approach I looked at was just recreating this in plain Python, but I have not yet come across anything in the documentation on how to write the file back to Blob storage.
I know the method to retrieve from Blob storage is .get_blob_to_path via microsoft.docs
Any help here is greatly appreciated.
Hadoop/Spark writes the compute result of each partition to its own file in parallel, so you will see many part-<number>-... files in an HDFS output path such as the Output/ you named.
If you want to write all results of a computation into one file, you can either merge them afterwards with the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or reduce the number of output partitions to 1, e.g. with the coalesce(1) function.
So in your scenario, you only need to make sure coalesce(1) is called on the dataframe before write/save, as below.
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
coalesce and repartition did not help me save the dataframe as one normally named file.
I ended up just renaming the single CSV file and deleting the folder with the logs:
import os

def save_csv(df, location, filename):
    outputPath = os.path.join(location, filename + '_temp.csv')
    df.repartition(1).write.format("com.databricks.spark.csv").mode("overwrite") \
        .options(header="true", inferSchema="true").option("delimiter", "\t").save(outputPath)
    csv_files = os.listdir(os.path.join('/dbfs', outputPath))
    # move the part-like temp csv file to the normally named one
    for file in csv_files:
        if file.endswith('.csv'):
            dbutils.fs.mv(os.path.join(outputPath, file), os.path.join(location, filename))
    dbutils.fs.rm(outputPath, True)

# using save_csv
save_csv_location = 'mnt/.....'
save_csv(df, save_csv_location, 'name.csv')
I have multiple Excel spreadsheets in a given folder and its subfolders. All have the same file name string, with a date-and-time suffix. How do I merge them all into one single file, using the worksheet name and titles as an index when appending the dataframes? Typically there would be either ~100 small files of about 200 KB each in the subfolders, or ~10 files of about 20 MB each.
This may help you merge all the xlsx files in the current directory.
import glob
import os
import pandas as pd

frames = []
for file in glob.glob(os.path.join(os.getcwd(), "*.xlsx")):
    frames.append(pd.read_excel(file))

output = pd.concat(frames, ignore_index=True)
output.to_csv(os.path.join(os.getcwd(), "outPut.csv"), index=False, na_rep="NA", header=None)
print("Completed")
Note: you need the xlrd-1.1.0 library along with pandas to read xlsx files (newer pandas versions use openpyxl for .xlsx instead).
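Since the question also mentions subfolders and worksheet names, here is a hedged extension of the idea above; the root path is a placeholder, and sheet_name=None is used to pull every worksheet from each workbook:
import glob
import os
import pandas as pd

root = os.getcwd()  # placeholder root folder
frames = []
for path in glob.glob(os.path.join(root, "**", "*.xlsx"), recursive=True):
    # sheet_name=None returns a dict of {worksheet name: dataframe}
    for sheet_name, sheet_df in pd.read_excel(path, sheet_name=None).items():
        frames.append(sheet_df.assign(source_file=os.path.basename(path), sheet=sheet_name))

merged = pd.concat(frames, ignore_index=True)
merged.to_csv(os.path.join(root, "merged.csv"), index=False)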
I have tried operating with static file name definitions; it would be better if it did the consolidation by column header from a dynamically picked file list, taking whatever ends with .xls* (xls / xlsx / xlsb / xlsm), .csv, or .txt.
import pandas as pd

db = pd.read_excel("/data/Sites/Cluster1 0815.xlsx")
db1 = pd.read_excel("/data/Sites/Cluster2 0815.xlsx")
db2 = pd.read_excel("/data/Sites/Cluster3 0815.xlsx")
sdb = pd.concat([db, db1, db2])
sdb.to_csv("/data/Sites/sites db.csv", index=False, na_rep="NA", header=None)
The dynamic file list merge was found to produce the output shown in the screenshot below; however, the processing time has to be taken into account...
[screenshot of the merged output]
When running it on batches of files, the code generated the error below (please note that these files are asymmetric in the information they carry); attached is a snap of the error.
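For the dynamic consolidation described above, a minimal hedged sketch (the root folder and patterns are placeholders; pd.concat aligns frames by column header and fills missing columns with NaN, which is one way to cope with the asymmetric files):
import glob
import os
import pandas as pd

root = "/data/Sites"  # placeholder root folder
patterns = ("*.xls", "*.xlsx", "*.xlsb", "*.xlsm", "*.csv", "*.txt")

frames = []
for pattern in patterns:
    for path in glob.glob(os.path.join(root, "**", pattern), recursive=True):
        if path.lower().endswith((".csv", ".txt")):
            frames.append(pd.read_csv(path))
        else:
            frames.append(pd.read_excel(path))  # .xlsb files may additionally need engine="pyxlsb"

# concat aligns on column headers; columns missing from a file become NaN
sdb = pd.concat(frames, ignore_index=True, sort=False)
sdb.to_csv(os.path.join(root, "sites db.csv"), index=False, na_rep="NA")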