I'm trying to extract the 'link' column from several .csv files and write it to .txt files, one per input file, instead of merging everything into a single file (currently result_df.txt). The first step works well (thanks Ali!), but I would like each .txt file to keep the name of the .csv file it came from (every name is different):
name1.csv --> name1.txt
name2.csv --> name2.txt
name3.csv --> name3.txt
...
Any suggestions here please?
Many thanks
from os.path import abspath, join
from os import listdir
import pandas as pd
result_df = pd.DataFrame(columns=['link'])
abs_path = abspath(path) # path of your folder
for filename in listdir(abs_path):
    df = pd.read_csv(join(abs_path, filename), usecols=['link'])
    result_df = pd.concat([result_df, df], ignore_index=True)
result_df.to_csv('result_df.txt', header=None, index=None, sep=' ', mode='w')
Why concat at all? Just save each DataFrame to its own .txt file inside the loop, deriving the name from the CSV name: df.to_csv(filename[:-4] + '.txt')
from os.path import abspath, join
from os import listdir
import pandas as pd
abs_path = abspath(path) # path of your folder
for filename in listdir(abs_path):
    df = pd.read_csv(join(abs_path, filename), usecols=['link'])
    # name1.csv -> name1.txt (strip the 4-character '.csv' suffix)
    df.to_csv(filename[:-4] + '.txt')
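The same idea can be sketched with pathlib, assuming every CSV has a 'link' column. The function name export_links and the out_dir parameter are illustrative and not part of the answer above. Path.stem drops the extension without relying on it being exactly four characters long, and the glob pattern skips non-CSV files in the folder.

```python
from pathlib import Path

import pandas as pd


def export_links(src_dir, out_dir):
    """Write the 'link' column of each CSV in src_dir to <same name>.txt in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(Path(src_dir).glob("*.csv")):
        df = pd.read_csv(csv_path, usecols=["link"])
        txt_path = out_dir / (csv_path.stem + ".txt")  # name1.csv -> name1.txt
        # one link per line, no header row and no index column
        df.to_csv(txt_path, header=False, index=False)
        written.append(txt_path)
    return written
```

Calling export_links("csv_folder", "txt_folder") (both folder names hypothetical) would return the list of .txt paths it wrote.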
Related
I want to concat all csv files that have the specific word 'tables' in the filename.
The code below loads every csv file, without filtering on the word I want.
# importing the required modules
import glob
import pandas as pd
# specifying the path to csv files
#path = "csvfoldergfg"
path = "folder_directory"
# csv files in the path
files = glob.glob(path + "/*.csv")
# defining an empty list to store
# content
data_frame = pd.DataFrame()
content = []
# checking all the csv files in the
# specified path
for filename in files:
    # reading content of csv file
    # content.append(filename)
    df = pd.read_csv(filename, index_col=None)
    content.append(df)
# converting content to data frame
data_frame = pd.concat(content)
print(data_frame)
Example filenames are:
abcd-tables.csv
abcd-text.csv
abcd-forms.csv
defg-tables.csv
defg-text.csv
defg-forms.csv
From the example filenames, the expected output is the concatenation of
abcd-tables.csv
defg-tables.csv
into a single dataframe, assuming the headers are the same.
I'd really appreciate it if you could help solve this.
You can use:
import pandas as pd
import pathlib
path = 'folder_directory'
content = []
for filename in pathlib.Path(path).glob('*-tables.csv'):
    df = pd.read_csv(filename, index_col=None)
    content.append(df)
df = pd.concat(content, ignore_index=True)
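If the word can appear anywhere in the name rather than only as a '-tables.csv' suffix, a substring test on the file name is one option. This is a sketch under that assumption; the function name concat_matching and its word parameter are made up for illustration.

```python
from pathlib import Path

import pandas as pd


def concat_matching(folder, word="tables"):
    """Concatenate every CSV in folder whose file name contains word."""
    paths = sorted(p for p in Path(folder).glob("*.csv") if word in p.name)
    if not paths:
        # pd.concat raises ValueError on an empty list, so fail with a clearer message
        raise FileNotFoundError(f"no CSV file containing {word!r} in {folder}")
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
```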
I have a pandas script, shown below, that reads multiple csv files in a given folder. All the csv files have a similar format and the same columns.
For a given column (Area), I want to sum all the rows of each file, then save the results into a new CSV file.
This is the code so far.
import pandas as pd
import glob
path = r'C:\Users\kundemj\Desktop\Post_Processing\HEA517_2000' # path
all_files = glob.glob(path + "/*.csv")
for filename in all_files:
    df = pd.read_csv(filename)
    area_sum = df['Area'].sum()
    print(area_sum)
I could get this working by writing to Excel, but I want to use 'to_csv' with mode='a' (append), as I have a bunch of folders with the same filenames.
The CSV file format I am looking for is as follows:
filename1, filename2, filename3,.....
area_sum1, area_sum2, area_sum3,.....
You could try this:
import pandas as pd
import glob
path = r'C:\Users\kundemj\Desktop\Post_Processing\HEA517_2000' # path
all_files = glob.glob(path + "/*.csv")
# Create a dict with one empty list per output column
results = {"filename": [], "sum": []}
# Iterate on files and populate the newly created dict
for filename in all_files:
    results["filename"].append(filename)
    df = pd.read_csv(filename)
    results["sum"].append(df['Area'].sum())
# Save to csv file (the first argument of to_csv is path_or_buf, not path)
results = pd.DataFrame(results)
results.to_csv("path_to_file.csv", index=False)
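The answer above produces one row per file. For the wide layout the question asks for (file names as the header, one row of sums per folder, appended with mode='a'), one possible sketch follows. The function name append_area_sums and the out_csv parameter are illustrative. It assumes every folder holds the same file names, so the columns line up between runs, and that out_csv lives outside the folders being scanned.

```python
import glob
import os

import pandas as pd


def append_area_sums(folder, out_csv, column="Area"):
    """Append one row of per-file column sums to out_csv (one column per file name)."""
    sums = {}
    # sorted() keeps the column order stable from one folder to the next
    for filename in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        sums[os.path.basename(filename)] = pd.read_csv(filename)[column].sum()
    row = pd.DataFrame([sums])
    # write the header only when the output file does not exist yet
    row.to_csv(out_csv, mode="a", index=False, header=not os.path.exists(out_csv))
    return row
```

Calling it once per folder adds one line of sums to the output file each time.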
Say I have 200 csv files. I want to read them all in one go and store each one in its own data frame, like df1 for the first file and so on up to df200. Writing df1=pd.read_csv(...) by hand 200 times takes far too long. How do I do this using pandas?
I have tried using a for loop, but I got stuck.
import pandas as pd
import glob
all_files = glob.glob("file_path" + "/*.csv")
dfs_dict = {}
for idx, filename in enumerate(all_files):
    df = pd.read_csv(filename, index_col=None, header=0)
    dfs_dict["df" + str(idx)] = df
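Keys such as "df0" depend on the order in which glob happens to return the files. A variant that keys the dictionary by file name may be easier to work with. This is only a sketch; the function name load_csvs is made up, and it assumes the file stems are unique within the folder.

```python
from pathlib import Path

import pandas as pd


def load_csvs(folder):
    """Return {file stem: DataFrame} for every CSV directly inside folder."""
    return {p.stem: pd.read_csv(p) for p in sorted(Path(folder).glob("*.csv"))}
```

With a file called, say, sales.csv in the folder (a hypothetical name), its frame would then be available as load_csvs("file_path")["sales"].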
Try using this :
import pandas as pd
import glob
path = r'path of the folder where all csv exists'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
li will then hold one dataframe per csv file. You can process them further one by one, for example to write them back out as separate files,
or, if all the csv files have the same columns and you want a single dataframe, you can pass li to the pandas concat function.
import pandas as pd
import os
dfs = []  # empty list of dataframes
dirname = 'path/to/your/files'  # placeholder: where your files are
# os.walk also descends into sub-folders
for root, folders, files in os.walk(dirname):
    for file in files:
        fp = os.path.join(root, file)
        df = pd.read_csv(fp)
        dfs.append(df)
df = pd.concat(dfs)
I currently have a folder that contains multiple files with similar names that I am trying to read from.
For example:
Folder contains files:
apple_2019_08_26_23434.xls
apple_2019_08_25_55345.xls
apple_2019_08_24_99345.xls
the name format of the file is very simple:
apple_<date>_<5 random numbers>.xls
How can I read the excel file into a pandas df if I do not care about the random 5 digits at the end?
e.g.
df = pd.read_excel('e:\Document\apple_2019_08_26_<***wildcard***>.xls')
Thank you!
You could use unix style pathname expansions via glob.
import glob
# get .txt files in current directory
txt_files = glob.glob('./*.txt')
# get .xls files in some_dir
xls_files = glob.glob('some_dir/*.xls')
# do stuff with files
# ...
Here, * basically means "anything".
Example with pandas:
import glob
import pandas as pd
for xls_file in glob.glob('e:/Document/apple_2019_08_26_*.xls'):
    df = pd.read_excel(xls_file)
    # do stuff with df
    # ...
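Since * matches anything, the pattern above would also pick up a name such as apple_2019_08_26_backup.xls. glob patterns support character classes too, so the "exactly five digits" rule can be spelled out. A small sketch, with find_daily_files as a made-up helper name:

```python
import glob
import os


def find_daily_files(folder, date_part):
    """Return files named apple_<date_part>_<exactly five digits>.xls in folder."""
    # glob.escape protects any special characters in the folder path itself
    pattern = os.path.join(
        glob.escape(folder), f"apple_{date_part}_" + "[0-9]" * 5 + ".xls"
    )
    return sorted(glob.glob(pattern))
```

Each returned path can then be passed to pd.read_excel as in the loop above.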
Change your directory with os.chdir, then read every file whose name starts with the right prefix:
import os
import pandas as pd
os.chdir(r'e:\Document')
dfs = [pd.read_excel(file) for file in os.listdir() if file.startswith('apple_2019_08')]
Now you can access each dataframe by index:
print(dfs[0])
print(dfs[1])
Or combine them to one large dataframe if they have the same format
df_all = pd.concat(dfs, ignore_index=True)
If you want the 5-digit part to be changeable in the code, you could try something like this:
from os import listdir
from os.path import isfile, join
import pandas as pd
mypath = '/Users/username/aPath'
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
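# characters 17..21 of 'apple_<date>_<5 digits>.xls' are the five digits
fiveDigitNumber = onlyfiles[0][17:22]
filename = onlyfiles[0][:17]+fiveDigitNumber+onlyfiles[0][22:]
# listdir returns bare names, so join with the folder before reading
df = pd.read_excel(join(mypath, filename))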
I have several csv files inside several zip files in one folder, for example:
A.zip (contains csv1,csv2,csv3)
B.zip (contains csv4, csv5, csv6)
The zip files are in the folder C:/Folder/. When I load plain csv files from a folder, I use the following code:
import glob
import pandas as pd
files = glob.glob("C:/Folder/*.csv")
dfs = [pd.read_csv(f, header=None, sep=";") for f in files]
df = pd.concat(dfs,ignore_index=True)
Following the post "Reading csv zipped files in python", reading one csv from a zip works like this:
import pandas as pd
import zipfile
zf = zipfile.ZipFile('C:/Users/Desktop/THEZIPFILE.zip')
df = pd.read_csv(zf.open('intfile.csv'))
Any idea how to optimize this loop for me?
Use ZipFile.namelist() to get the list of files inside the zip.
Example:
import glob
import zipfile
import pandas as pd
dfs = []
for zip_file in glob.glob("C:/Folder/*.zip"):
    zf = zipfile.ZipFile(zip_file)
    # collect the frames from every zip, not just the last one
    dfs += [pd.read_csv(zf.open(f), header=None, sep=";") for f in zf.namelist()]
df = pd.concat(dfs, ignore_index=True)
print(df)
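A slightly more defensive sketch of the same in-memory approach closes each archive with a context manager and skips members that are not CSV files (directory entries, readme files and so on). The function name read_zipped_csvs is illustrative, and any keyword arguments are simply forwarded to pd.read_csv.

```python
import glob
import os
import zipfile

import pandas as pd


def read_zipped_csvs(folder, **read_csv_kwargs):
    """Concatenate every .csv member of every .zip file in folder."""
    dfs = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as zf:
            for member in zf.namelist():
                if not member.lower().endswith(".csv"):
                    continue  # skip directories and non-CSV members
                with zf.open(member) as fh:
                    dfs.append(pd.read_csv(fh, **read_csv_kwargs))
    # pd.concat raises ValueError if no CSV member was found at all
    return pd.concat(dfs, ignore_index=True)
```

For the files in the question, the call would look like read_zipped_csvs("C:/Folder", header=None, sep=";").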
I would try to tackle it in two passes. First pass, extract the contents of the zipfile onto the filesystem. Second Pass, read all those extracted CSVs using the method you already have above:
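import glob
import os
import pandas as pd
import zipfile
def extract_files(file_path, out_dir="extracted"):  # out_dir is an assumed target folder
    # extractall() returns None, so build the extracted paths from namelist()
    with zipfile.ZipFile(file_path, 'r') as archive:
        archive.extractall(out_dir)
        return [os.path.join(out_dir, n) for n in archive.namelist() if n.endswith('.csv')]
zipped_files = glob.glob("C:/Folder/*.zip")
# flatten the per-zip lists into one list of csv paths
file_paths = [p for zf in zipped_files for p in extract_files(zf)]
dfs = [pd.read_csv(f, header=None, sep=";") for f in file_paths]
df = pd.concat(dfs, ignore_index=True)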