How could I import and append all files in a directory?
files = os.listdir(r"C:\Users\arv\Desktop\pickle_files")
data = []
for i in files:
    data.append(pd.read_pickle(i))
df = pd.concat(['data'])
Almost exactly as you tried to do it yourself:
# os.listdir() returns bare file names, so join the directory back on
df = pd.concat([pd.read_pickle(os.path.join(r"C:\Users\arv\Desktop\pickle_files", f)) for f in files])
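If you prefer pathlib, here is a minimal sketch of the same idea; the "*.pkl" pattern is an assumption about how the pickle files are named:

import pandas as pd
from pathlib import Path

folder = Path(r"C:\Users\arv\Desktop\pickle_files")
# Path.glob() yields full paths, so no manual joining is needed;
# "*.pkl" is an assumed extension for the pickle files
df = pd.concat([pd.read_pickle(p) for p in folder.glob("*.pkl")], ignore_index=True)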
I have several csv.gz files I am trying to convert and save as plain csv files. I'm able to do it for an individual file using:
with gzip.open('Pool.csv.gz') as f:
    Pool = pd.read_csv(f)
Pool.to_csv("Pool.csv")
I'm trying to create a loop to convert all files in directory but I'm failing. Here is my code:
import gzip
import glob
import os
os.chdir('/home/path')
extension = 'csv.gz'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
for i in range(len(all_filenames)):
    with gzip.open(i) as f:
        pool_1 = pd.read_csv(f)
You can use os.listdir() to create your list of files and then loop through it:
import os
import gzip
import pandas as pd
dir_path = "/home/path"
all_files = [f for f in os.listdir(dir_path) if f.endswith('csv.gz')]
for file in all_files:
    with gzip.open(f"{dir_path}/{file}") as f:
        df = pd.read_csv(f)
    df.to_csv(f"{file.split('.')[0]}.csv")
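As a side note, pd.read_csv can decompress gzip by itself (compression is inferred from the .gz suffix), so the explicit gzip.open is optional. A minimal sketch under the same assumed dir_path:

import os
import pandas as pd

dir_path = "/home/path"
for file in os.listdir(dir_path):
    if file.endswith(".csv.gz"):
        # compression='infer' is the default, so the .gz is handled automatically
        df = pd.read_csv(os.path.join(dir_path, file))
        # strip the ".gz" to get e.g. "Pool.csv"
        df.to_csv(os.path.join(dir_path, file[:-3]), index=False)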
I have a folder that includes folders and these folders include many csv files. I want to import and concatenate all of them in Python.
Let's say main folder: /main
subfolders: /main/main_1
csv: /main/main_1/first.csv
path='/main'
df_list = []
for file in os.listdir(path):
    df = pd.read_csv(file)
    df_list.append(df)
final_df = df.append(df for df in df_list)
What about this:
import pandas as pd
from pathlib import Path
directory = "path/to/root_dir"
# Read each CSV file in dir "path/to/root_dir"
dfs = []
for file in Path(directory).glob("**/*.csv"):
    dfs.append(pd.read_csv(file))
# Combine the dataframes into a single dataframe
df = pd.concat(dfs)
Change path/to/root_dir to wherever your CSV files are.
I found a way to concat all of them, but it doesn't satisfy me because it takes too much time due to computational complexity.
path = "/main"
folders = []
directory = os.path.join(path)
for root,dirs,files in os.walk(directory):
    folders.append(root)
del folders[0]
final = []
for folder in folders:
    df = pd.concat(map(pd.read_csv, glob.glob(os.path.join(folder, "*.csv"))))
    final.append(df)
Remember to add the main path back when reading each file:
df = pd.read_csv(path + "/" + file)
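Putting that fix together with a recursive walk gives a minimal sketch (os.listdir only returns the top-level entries, so os.walk is used here to reach CSVs inside subfolders like /main/main_1):

import os
import pandas as pd

path = '/main'
df_list = []
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.csv'):
            # root carries the subfolder, so the joined path is always valid
            df_list.append(pd.read_csv(os.path.join(root, file)))

final_df = pd.concat(df_list, ignore_index=True)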
I want to read csvs from different sub-directories in my working directory to create a combined csv file. The combined csv should have a column containing the sub-directory name from which that particular csv was read.
This is what I tried.
import os
import glob
import pandas as pd
all_filenames = [i for i in glob.glob('*/*.csv', recursive=True)]
list_subfolder = [f.name for f in os.scandir(ride_path) if f.is_dir()]
df_list = []
for i in range(len(all_filenames)):
    dir_name = list_subfolder[i]
    current_csv = all_filenames[i]
    data = pd.read_csv(current_csv)
    data["sub_folder"] = dir_name
    df_list.append(data)
combined_df = pd.concat(df_list)
combined_df.to_csv("combined_csv.csv", index=False)
The problem is that it adds sub-directories that do not have CSVs in them, which is wrong and problematic. What is the best way to do this right?
You can do this via the pathlib module:
from pathlib import Path
inp_path = Path('.')  # specify the input path; '.' means the current working dir
df_list = []
for csv_file in inp_path.glob('**/*.csv'):  # glob returns a generator that yields csv files one by one
    df = pd.read_csv(csv_file)
    df['file_name'] = csv_file.parent  # pathlib gives easy access to the parent dir
    df_list.append(df)
combined_df = pd.concat(df_list)
combined_df.to_csv("combined_csv.csv", index=False)
Notes:
1. Use csv_file.parent.name if you just need the folder name.
2. Use csv_file.parent.absolute() if you want the full path of the parent dir.
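For example, with a hypothetical path matching the question's layout:

from pathlib import Path

p = Path("main/main_1/first.csv")
print(p.parent)             # main/main_1
print(p.parent.name)        # main_1
print(p.parent.absolute())  # e.g. /home/user/main/main_1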
You can use os.path.split():
import os
import glob
import pandas as pd
all_filenames = [i for i in glob.glob("**/*.csv", recursive=True)]
df_list = []
for f in all_filenames:
    data = pd.read_csv(f)
    data["sub_folder"] = os.path.split(f)[0]  # <-- [0] is directory, [1] is filename
    df_list.append(data)
combined_df = pd.concat(df_list)
print(combined_df)
combined_df.to_csv("combined_csv.csv", index=False)
Another option with glob and os:
import os
import glob
import pandas as pd
df_list = []
for csv in glob.glob('**/*.csv', recursive=True):
    parent_folder = os.path.split(os.path.dirname(csv))[-1]
    df = pd.read_csv(csv)
    df['subfolder'] = parent_folder
    df_list.append(df)
combined_df = pd.concat(df_list)
combined_df.to_csv("combined_csv.csv", index=False)
A one-liner method (adapted from the answer by nk03).
import pandas as pd
import pathlib as pth
pd.concat([pd.read_csv(csvfile).assign(file_name=csvfile.parent)
           for csvfile in pth.Path(".").glob("**/*.csv")]) \
  .to_csv("combined_csv.csv", index=False)
I have several csv files in several zip files in one folder, so for example:
A.zip (contains csv1,csv2,csv3)
B.zip (contains csv4, csv5, csv6)
which are in the folder path C:/Folder/. When I load normal csv files in a folder, I use the following code:
import glob
import pandas as pd
files = glob.glob("C/folder/*.csv")
dfs = [pd.read_csv(f, header=None, sep=";") for f in files]
df = pd.concat(dfs,ignore_index=True)
I followed this post: Reading csv zipped files in python.
One csv inside a zip works like this:
import pandas as pd
import zipfile
zf = zipfile.ZipFile('C:/Users/Desktop/THEZIPFILE.zip')
df = pd.read_csv(zf.open('intfile.csv'))
Any idea how to optimize this loop for me?
Use ZipFile.namelist() to get the list of files inside the zip.
Ex:
import glob
import zipfile
import pandas as pd
for zip_file in glob.glob("C/folder/*.zip"):
    zf = zipfile.ZipFile(zip_file)
    dfs = [pd.read_csv(zf.open(f), header=None, sep=";") for f in zf.namelist()]
    df = pd.concat(dfs, ignore_index=True)
    print(df)
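Note that this builds one df per zip file. If you want a single frame across all the archives, a small variation (a sketch, keeping the question's assumed path and separator):

import glob
import zipfile
import pandas as pd

all_dfs = []
for zip_file in glob.glob("C/folder/*.zip"):
    with zipfile.ZipFile(zip_file) as zf:
        # collect every CSV member of every archive into one flat list
        all_dfs.extend(pd.read_csv(zf.open(name), header=None, sep=";")
                       for name in zf.namelist())

df = pd.concat(all_dfs, ignore_index=True)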
I would try to tackle it in two passes. First pass: extract the contents of the zip files onto the filesystem. Second pass: read all those extracted CSVs using the method you already have above:
import glob
import pandas as pd
import zipfile
def extract_files(file_path, dest="extracted"):
    # ZipFile.extractall() returns None, so extract into a known
    # destination directory and return that path instead
    with zipfile.ZipFile(file_path, 'r') as archive:
        archive.extractall(dest)
    return dest

zipped_files = glob.glob("C/folder/*.zip")
for zf in zipped_files:
    extract_files(zf)

# second pass: read the extracted CSVs using the method from the question
dfs = [pd.read_csv(f, header=None, sep=";") for f in glob.glob("extracted/*.csv")]
df = pd.concat(dfs, ignore_index=True)
Scenario: I have a list of files in a folder (including the file paths). I am trying to get the content of each of those files into a dataframe (one for each file), then further perform some operations and later merge these dataframes.
From various other questions on SO, I found multiple ways to iterate over the files in a folder and get the data, but all of those I found usually read the files in a loop and concatenate them into a single dataframe automatically, which does not work for me.
For example:
import os
import pandas as pd
path = os.getcwd()
files = os.listdir(path)
files_xls = [f for f in files if f[-3:] == 'xls']
df = pd.DataFrame()
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
or
import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("*.xls*"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
The only piece of code I could put together from what I found is:
import os
import glob
import pandas as pd
mypath = "/DGMS/Destop/uploaded"
listoffiles = glob.glob(os.path.join(mypath, "*.xls*"))
contentdataframes = (pd.read_excel(f) for f in listoffiles)
These lines run without error, but they appear not to do anything: no variables are created or changed.
Question: What am I doing wrong here? Is there a better way to do this?
You are really close. A generator expression is lazy, so nothing is read until something consumes it; you just need to join all the data with concat, which iterates the generator for you:
contentdataframes = (pd.read_excel(f) for f in listoffiles)
df = pd.concat(contentdataframes, ignore_index=True)
If you need a list of DataFrames instead:
contentdataframes = [pd.read_excel(f) for f in listoffiles]
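Since the goal was to operate on each frame before merging, here is a minimal sketch of that workflow; the source_file column is a placeholder for whatever per-file operation you actually need:

import glob
import os
import pandas as pd

mypath = "/DGMS/Destop/uploaded"
listoffiles = glob.glob(os.path.join(mypath, "*.xls*"))

contentdataframes = [pd.read_excel(f) for f in listoffiles]
# operate on each frame individually before merging,
# e.g. tag each one with its source file (placeholder operation)
for f, frame in zip(listoffiles, contentdataframes):
    frame["source_file"] = os.path.basename(f)

merged = pd.concat(contentdataframes, ignore_index=True)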