Extract single column from multiple CSVs and save to new CSV - python

I would like to read out a specific column from over 100 CSV files to create a new CSV file. The source column's header will be renamed with the filename the column is extracted from.
I can get the individual columns, but I have been unable to rename each column's header without the ".csv" extension:
import os
import pandas as pd
folder = "C:/Users/Doc/Data"
files = os.scandir(folder)
E2080 = []
with os.scandir(folder) as files:
for file in files:
#print(file)
df = pd.read_csv(file, index_col=None)
dist = {file: (df['lnt_dist'])}
E = pd.DataFrame(dist)
E2080.append(E)
dist = pd.concat(E2080, ignore_index=False, axis=1)
dist.head()
dist.to_csv('E2080', index=False)

This is the final code that worked for me (see output 1):
E2080 = []
with os.scandir(folder) as files:
for file in files:
#print(file)
df = pd.read_csv(file, index_col=None)
dist = {file: (df['lnt_dist'])}
E = pd.DataFrame(dist)
E_1 = E.rename(columns={file: file.name.split('.')[0]}) # rename df header while dropping the ext **[.csv]** and the `os.scandir` attribute `<DirEntry>`
E2080.append(E_1)
dist = pd.concat(E_28, ignore_index=False, axis=1)
#dist.head()
dist.to_csv('E2080.csv', index=False)

You should use file.name instead of file to get string with name.
And with string you can use .split(".") to get name without extension.
for file in os.scandir(folder):
print(file.name, '=>', file.name.split(".")[0])
Or you could use pathlib.Path instead of os.scandir() to have more functions.
for file in pathlib.Path('test').iterdir():
print(file.name, '=>', file.stem)

Related

How to create a data frame from different Text files

I was working on a project where I have to scrape the some text files from a source. I completed this task and I have 140 text file.
This is one of the text file I have scraped.
I am trying to create a dataframe where I should have one row for each text file. So I wrote the below code:-
import pandas as pd
import os
txtfolder = r'/home/spx072/Black_coffer_assignment/' #Change to your folder path
#Find the textfiles
textfiles = []
for root, folder, files in os.walk(txtfolder):
for file in files:
if file.endswith('.txt'):
fullname = os.path.join(root, file)
textfiles.append(fullname)
# textfiles.sort() #Sort the filesnames
#Read each of them to a dataframe
for filenum, file in enumerate(textfiles, 1):
if filenum==1:
df = pd.read_csv(file, names=['data'], sep='delimiter', header=None)
df['Samplename']=os.path.basename(file)
else:
tempdf = pd.read_csv(file, names=['data'], sep='delimiter', header=None)
tempdf['Samplename']=os.path.basename(file)
df = pd.concat([df, tempdf], ignore_index=True)
df = df[['Samplename','data']] #
The code runs fine, but the dataframe I am getting is some thing like this :-
I want that each text file should be inside a single row like:-
1.txt should be in df['data'][0],
2.txt should be in df'data' and so on.
I tried different codes and also check several questions but still unable to get the desired result. Can anyone help.
I'm not shure why you need pd.read_csv() for this. Try it with pure python:
result = pd.DataFrame(columns=['Samplename', 'data'])
for file in textfiles:
with open(file) as f:
data = f.read()
result = pd.concat([result, pd.DataFrame({'Samplename' : file, 'data': data}, index=[0])], axis=0, ignore_index=True)

How to upload all csv files that have specific name inside filename in python

I want to concat all csv file that have this specific word 'tables' on the filename.
Below code is upload all csv file without filter the specific word that i want.
# importing the required modules
import glob
import pandas as pd
# specifying the path to csv files
#path = "csvfoldergfg"
path = "folder_directory"
# csv files in the path
files = glob.glob(path + "/*.csv")
# defining an empty list to store
# content
data_frame = pd.DataFrame()
content = []
# checking all the csv files in the
# specified path
for filename in files:
# reading content of csv file
# content.append(filename)
df = pd.read_csv(filename, index_col=None)
content.append(df)
# converting content to data frame
data_frame = pd.concat(content)
print(data_frame)
example filename are:
abcd-tables.csv
abcd-text.csv
abcd-forms.csv
defg-tables.csv
defg-text.csv
defg-forms.csv
From the example filenames. The expected output is concat filenames
abcd-tables.csv
defg-tables.csv
into single dataframe. Assuming the header are same.
*Really appreciate you guys can solve this
You can use:
import pandas as pd
import pathlib
path = 'folder_directory'
content = []
for filename in pathlib.Path(path).glob('*-tables.csv'):
df = pd.read_csv(filename, index_col=None)
content.append(df)
df = pd.concat(content, ignore_index=True)

Pandas - Move output columns into a new sheet

I have many dataframes as txt files that I'm converting into xlsx. For each file, I want to take my output columns and move them into a new sheet called "Analyzed Data". I'm not sure how to do this with ExcelWriter:
writer = pd.ExcelWriter('filepath', engine = 'xlsxwriter')
df.to_excel(writer, sheet_name = ' Data Analyzed')
writer.save()
My understanding is that this requires my file to be xlsx, I have to write the filepath separately for each xlsx file, and I'm not sure how to select only my output columns as the ones to move to the new sheet. Each file has a different amount of columns with different column names. My code is below:
import os
import pandas as pd
path = r'C:\Users\Me\1Test'
filelist = []
for root, dirs, files in os.walk(path):
for f in files:
if not f.endswith('.txt'):
continue
filelist.append(os.path.join(root, f))
for f in filelist:
df = pd.read_table(f)
col = df.iloc[ : , : -3]
df['Average'] = col.mean(axis = 1)
out = (df.join(df.drop(df.columns[[-3,-1]], axis=1)
.sub(df[df.columns[-3]], axis=0)
.add_suffix(' - Background')))
out.to_excel(f.replace('txt', 'xlsx'), 'Sheet1')

Creating one unique file from many others saved in a folder

I've a list of csv files (approx. 100) that I'd like to include in one single csv file.
The list is found using
PATH_DATA_FOLDER = 'mypath/'
list_files = os.listdir(PATH_DATA_FOLDER)
for f in list_files:
list_columns = list(pd.read_csv(os.path.join(PATH_DATA_FOLDER, f)).columns)
df = pd.DataFrame(columns=list_columns)
print(df)
Which returns the files (it is just a sample, since I have 100 and more files):
['file1.csv', 'name2.csv', 'example.csv', '.DS_Store']
This, unfortunately, includes also hidden files, that I'd like to exclude.
Each file has the same columns:
Columns: [Name, Surname, Country]
I'd like to find a way to create one unique file with all these fields, plus information of the original file (e.g., adding a new column with the file name).
I've tried with
df1 = pd.read_csv(os.path.join(PATH_DATA_FOLDER, f))
df1['File'] = f # file name
df = df.append(df1)
df = df.reset_index(drop=True).drop_duplicates() # I'd like to drop duplicates in both Name and Surname
but it returns a dataframe with the last entry, so I guess the problem is in the for loop.
I hope you can provide some help.
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#drop duplicates and reset index
combined_csv.drop_duplicates().reset_index(drop=True)
#Save the combined file
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')
Have you tried using glob?
filenames = glob.glob("mypath/*.csv") #list of all you csv files.
df = pd.DataFrame(columns=["Name", "Surname", "Country"])
for filename in filenames:
df = df.append(pd.read_csv(filename))
df = df.drop_duplicates().reset_index(drop=True)
Another way would be concatenating the csv files using the cat command after removing the headers and then read the concatenated csv file using pd.read_csv.

Merge all the csv file in one folder and add a new column to pandas dataframe with partial file name in Python

I have a folder inside have over 100 CSV files. They all have the same prefix name.
eg:
shcool.Math001.csv
School.Math002.csv.
School.Physics001.csv. etc... They all contain the same number of columns.
How can I merge all the CSV files in one data frame in Python and add a new column with those files names but the prefix name "School." needs to be removed?
I found some code example online but did not sovle my problem:
path = r'C:\\Users\\me\\data\\'
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_csv(f, sep='\t') for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)
Try this, haven't tested:
import os
import pandas as pd
path ='<folder path to CSVs>'
dfs = []
for filename in os.listdir(path):
sample_df = pd.read_csv(filename)
sample_df['filename'] = ''.join(filename[7:])
dfs.append(sample_df)
df = pd.concat(dfs, axis=0, ignore_index=True)
Add DataFrame.assign in generator comprehension for add new column:
path = r'C:\\Users\\me\\data\\'
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_csv(f, sep='\t').assign(New=+os.path.basename(f[7:]).split('.')[0]) for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)

Categories