I have several csv files in several zip files in one folder, for example:
A.zip (contains csv1,csv2,csv3)
B.zip (contains csv4, csv5, csv6)
which are in the folder path C:/Folder/. When I load normal csv files in a folder I use the following code:
import glob
import pandas as pd
files = glob.glob("C:/Folder/*.csv")
dfs = [pd.read_csv(f, header=None, sep=";") for f in files]
df = pd.concat(dfs, ignore_index=True)
I followed this post: Reading csv zipped files in python
One csv in a zip works like this:
import pandas as pd
import zipfile
zf = zipfile.ZipFile('C:/Users/Desktop/THEZIPFILE.zip')
df = pd.read_csv(zf.open('intfile.csv'))
Any idea how to optimize this loop for me?
Use ZipFile.namelist() to get the list of files inside the zip.
Ex:
import glob
import zipfile
import pandas as pd
for zip_file in glob.glob("C:/Folder/*.zip"):
    zf = zipfile.ZipFile(zip_file)
    dfs = [pd.read_csv(zf.open(f), header=None, sep=";") for f in zf.namelist()]
    df = pd.concat(dfs, ignore_index=True)
    print(df)
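Note that df is rebuilt on every pass through the loop, so only the last zip survives it. If you want a single DataFrame across all zips, collect the per-file frames first and concat once at the end. A minimal sketch, assuming every csv shares the same layout:

import glob
import zipfile
import pandas as pd

dfs = []
for zip_file in glob.glob("C:/Folder/*.zip"):
    with zipfile.ZipFile(zip_file) as zf:
        # read each member of the archive straight into a DataFrame, no extraction needed
        dfs.extend(pd.read_csv(zf.open(f), header=None, sep=";") for f in zf.namelist())
df = pd.concat(dfs, ignore_index=True)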
I would try to tackle it in two passes. First pass: extract the contents of the zip files onto the filesystem. Second pass: read all those extracted CSVs using the method you already have above:
import glob
import pandas as pd
import zipfile

def extract_files(file_path, dest="extracted"):
    # extractall() returns None, so return the destination directory instead
    with zipfile.ZipFile(file_path, 'r') as archive:
        archive.extractall(dest)
    return dest

zipped_files = glob.glob("C:/Folder/*.zip")
for zf in zipped_files:
    extract_files(zf)
dfs = [pd.read_csv(f, header=None, sep=";") for f in glob.glob("extracted/*.csv")]
df = pd.concat(dfs, ignore_index=True)
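If you'd rather not leave the extracted files on disk, the same two-pass idea works inside a temporary directory. A sketch using tempfile from the standard library, assuming the csv layout above:

import glob
import tempfile
import zipfile
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # pass 1: extract every archive into the temp dir
    for zip_path in glob.glob("C:/Folder/*.zip"):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
    # pass 2: read everything that was extracted; the temp dir is removed afterwards
    dfs = [pd.read_csv(f, header=None, sep=";") for f in glob.glob(f"{tmp}/*.csv")]
df = pd.concat(dfs, ignore_index=True)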
Related
I have several csv.gz files I am trying to convert and save to csv files. I'm able to do it for an individual file using:
with gzip.open('Pool.csv.gz') as f:
    Pool = pd.read_csv(f)
Pool.to_csv("Pool.csv")
I'm trying to create a loop to convert all files in the directory, but I'm failing. Here is my code:
import gzip
import glob
import os
import pandas as pd

os.chdir('/home/path')
extension = 'csv.gz'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
for i in range(len(all_filenames)):
    with gzip.open(i) as f:  # this fails: i is an index, not a filename
        pool_1 = pd.read_csv(f)
You can use os.listdir() to create your list of files and then loop through it:
import os
import gzip
import pandas as pd

dir_path = "/home/path"
all_files = [f for f in os.listdir(dir_path) if f.endswith('csv.gz')]
for file in all_files:
    with gzip.open(f"{dir_path}/{file}") as f:
        df = pd.read_csv(f)
        df.to_csv(f"{file.split('.')[0]}.csv")
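As an aside, pandas can read gzipped csvs directly: read_csv infers the compression from the .gz extension, so the gzip module isn't strictly needed. A minimal sketch of the same loop:

import os
import pandas as pd

dir_path = "/home/path"
for file in os.listdir(dir_path):
    if file.endswith('csv.gz'):
        # compression='infer' is the default, so the .gz is handled automatically
        df = pd.read_csv(f"{dir_path}/{file}")
        df.to_csv(f"{file.split('.')[0]}.csv")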
I'm trying to get all 'link' data from several .csv files and create .txt files with the link data, without merging everything into one file (currently result_df.txt). The first step works well (thanks Ali!), but I would like to keep the name of each csv file (each name is different) for its txt file:
name1.csv --> name1.txt
name2.csv --> name2.txt
name3.csv --> name3.txt
...
Any suggestions here please?
Many thanks
from os.path import abspath, join
from os import listdir
import pandas as pd

result_df = pd.DataFrame(columns=['link'])
abs_path = abspath(path)  # path of your folder
for filename in listdir(abs_path):
    df = pd.read_csv(join(abs_path, filename), usecols=['link'])
    result_df = pd.concat([result_df, df], ignore_index=True)
result_df.to_csv('result_df.txt', header=None, index=None, sep=' ', mode='w')
So why do you concat? Just save each DataFrame to its own .txt file, updating the name: df.to_csv(filename[:-4] + '.txt')
from os.path import abspath, join
from os import listdir
import pandas as pd

abs_path = abspath(path)  # path of your folder
for filename in listdir(abs_path):
    df = pd.read_csv(join(abs_path, filename), usecols=['link'])
    df.to_csv(filename[:-4] + '.txt')
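The filename[:-4] slice assumes every input name ends in exactly ".csv". A slightly more defensive sketch using os.path.splitext, which also keeps the space-separated, headerless output format of the original result_df.txt:

from os.path import abspath, join, splitext
from os import listdir
import pandas as pd

abs_path = abspath(path)  # path of your folder
for filename in listdir(abs_path):
    df = pd.read_csv(join(abs_path, filename), usecols=['link'])
    base, _ = splitext(filename)  # strip the extension, whatever it is
    df.to_csv(base + '.txt', header=None, index=None, sep=' ')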
I want to read csvs from different sub-directories in my working directory to create a combined csv file. The combined csv should have a column containing the sub-directory name from which that particular csv was read.
This is what I tried.
import os
import glob
import pandas as pd
all_filenames = [i for i in glob.glob('**/*.csv', recursive=True)]
list_subfolder = [f.name for f in os.scandir(ride_path) if f.is_dir()]
df_list = []
for i in range(len(all_filenames)):
    dir_name = list_subfolder[i]
    current_csv = all_filenames[i]
    data = pd.read_csv(current_csv)
    data["sub_folder"] = dir_name
    df_list.append(data)
combined_df = pd.concat(df_list)
combined_df.to_csv("combined_csv.csv", index=False)
The problem is that it picks up sub-directories that do not have any csvs in them, so the pairing of folders and files goes wrong. What is the best way to do this right?
You can do this via the pathlib module:
from pathlib import Path
import pandas as pd

inp_path = Path('.')  # specify the input path; ('.') means the current working dir
df_list = []
for csv_file in inp_path.glob('**/*.csv'):  # glob returns a generator that yields csv files one by one
    df = pd.read_csv(csv_file)
    df['file_name'] = csv_file.parent  # pathlib gives the parent dir directly
    df_list.append(df)
combined_df = pd.concat(df_list)
combined_df.to_csv("combined_csv.csv", index=False)
Notes:
1. Use csv_file.parent.name if you just need the directory name.
2. Use csv_file.parent.absolute() if you want the full path of the parent dir.
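A quick illustration of the difference, assuming a file at sub1/data.csv under the working directory:

from pathlib import Path

csv_file = Path('sub1/data.csv')
print(csv_file.parent)             # sub1 (a relative Path object)
print(csv_file.parent.name)        # 'sub1' (just the directory name, as a string)
print(csv_file.parent.absolute())  # e.g. /home/user/project/sub1 (the full path)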
You can use os.path.split():
import os
import glob
import pandas as pd
all_filenames = [i for i in glob.glob("**/*.csv", recursive=True)]
df_list = []
for f in all_filenames:
    data = pd.read_csv(f)
    data["sub_folder"] = os.path.split(f)[0]  # <-- [0] is directory, [1] is filename
    df_list.append(data)
combined_df = pd.concat(df_list)
print(combined_df)
combined_df.to_csv("combined_csv.csv", index=False)
Another option with glob and os:
import os
import glob
import pandas as pd
df_list = []
for csv in glob.glob('**/*.csv', recursive=True):
    parent_folder = os.path.split(os.path.dirname(csv))[-1]
    df = pd.read_csv(csv)
    df['subfolder'] = parent_folder
    df_list.append(df)
combined_df = pd.concat(df_list)
combined_df.to_csv("combined_csv.csv", index=False)
One-line method (adapted from the #nk03 answer).
import pandas as pd
import pathlib as pth
pd.concat([pd.read_csv(csvfile).assign(file_name=csvfile.parent)
           for csvfile in pth.Path(".").glob("**/*.csv")]) \
  .to_csv("combined_csv.csv", index=False)
So I have multiple CSVs which I combined using this script:
import os
import glob
import numpy as np
import pandas as pd
#set working directory
os.chdir("Path to CSVs")
#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#export to csv
combined_csv.to_csv("mz_all.csv", index=False, encoding='utf-8-sig')
What I want to do is convert it into an xlsx file, which I already did, BUT all of the headers from the csv get put into one column, which looks like an absolute mess. The code for the conversion looks like this:
# Reading the csv file
df_new = pd.read_csv('mz_all.csv')
# saving xlsx file
GFG = pd.ExcelWriter('MZ_EXCEL.xlsx')
df_new.to_excel(GFG, index=False)
GFG.save()
Here's how the Excel file looks at the moment: all the headers got pushed into the first column, but I just want it to be organized like it was in the csv.
Have you tried saving it to Excel directly, like this? Note the sep=';' in read_csv, which keeps the semicolon-delimited data from landing in a single column:
import os
import glob
import numpy as np
import pandas as pd
#set working directory
os.chdir("Path to CSVs")
#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f, sep=';') for f in all_filenames ])
combined_csv.to_excel('MZ_EXCEL.xlsx', index=False)
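If you're not sure which delimiter a csv uses, sep=None with engine='python' lets pandas sniff it per file. A minimal sketch ('some_file.csv' is just a stand-in name):

import pandas as pd

# engine='python' is required when sep=None; pandas then detects the delimiter
df = pd.read_csv('some_file.csv', sep=None, engine='python')
print(df.head())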
I currently have a folder that contains multiple files with similar names that I am trying to read from.
For example:
Folder contains files:
apple_2019_08_26_23434.xls
apple_2019_08_25_55345.xls
apple_2019_08_24_99345.xls
the name format of the file is very simple:
apple_<date>_<5 random numbers>.xls
How can I read the excel file into a pandas df if I do not care about the random 5 digits at the end?
e.g.
df = pd.read_excel('e:\Document\apple_2019_08_26_<***wildcard***>.xls')
Thank you!
You could use Unix-style pathname expansion via glob.
import glob
# get .txt files in current directory
txt_files = glob.glob('./*.txt')
# get .xls files in some_dir
xls_files = glob.glob('some_dir/*.xls')
# do stuff with files
# ...
Here, * basically means "anything".
Example with pandas:
import glob
import pandas as pd

for xls_file in glob.glob('e:/Document/apple_2019_08_26_*.xls'):
    df = pd.read_excel(xls_file)
    # do stuff with df
    # ...
Change your directory with os.chdir, then read all files which start with the correct name:
import os
import pandas as pd

os.chdir(r'e:\Document')
dfs = [pd.read_excel(file) for file in os.listdir() if file.startswith('apple_2019_08')]
Now you can access each dataframe by index:
print(dfs[0])
print(dfs[1])
Or combine them into one large dataframe if they have the same format:
df_all = pd.concat(dfs, ignore_index=True)
If you want the 5-digit part to be changeable in the code, you could try something like this:
from os import listdir
from os.path import isfile, join
import pandas as pd

mypath = '/Users/username/aPath'
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
fiveDigitNumber = onlyfiles[0][17:22]  # characters 17-21 hold the 5 digits in 'apple_2019_08_26_23434.xls'
filename = onlyfiles[0][:17] + fiveDigitNumber + onlyfiles[0][22:]  # rebuild the name around the chosen digits
df = pd.read_excel(join(mypath, filename))  # join with mypath so it works from any working directory
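The fixed slice positions break as soon as the prefix or date changes length. A regex is less brittle; a sketch, assuming the apple_<date>_<5 random numbers>.xls pattern from the question:

import re
from os import listdir
from os.path import join
import pandas as pd

mypath = '/Users/username/aPath'
pattern = re.compile(r'apple_2019_08_26_\d{5}\.xls$')  # any 5 digits in the last slot
matches = [f for f in listdir(mypath) if pattern.match(f)]
dfs = [pd.read_excel(join(mypath, f)) for f in matches]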