Copy a column, add some text and write to a new CSV file - Python

I want to write a script that copies the 2nd column from multiple CSV files in a folder and adds some text before saving everything to a single CSV file.
Here is what I want to do:
1.) Grab the data in the 2nd column from all CSV files
2.) Add the text "Hello" at the start and "welcome" at the end of each row
3.) Write the data into a single file
I tried creating it using pandas:
import os
import pandas as pd
dataframes = [pd.read_csv(p, index_col=2, header=None) for p in ('1.csv','2.csv','3.csv')]
merged_dataframe = pd.concat(dataframes, axis=0)
merged_dataframe.to_csv("all.csv", index=False)
The problems are:
In the code above I have to list the file names manually, which is impractical; I need to include all CSV files (*.csv) instead.
I need to use something like writer.writerow(("Hello"+r[1]+"welcome"))
There are multiple CSV files with many rows (around 100k) in each, so I also need it to be fast.
Here is a sample of the CSV files:
1.csv: a,Jac
2.csv: b,William
3.csv: c,James
And here is how I would like the output file all.csv to look:
Hello Jac welcome
Hello William welcome
Hello James welcome
Is there a solution using .merge(), .append() or .concat()?
How can I achieve this using Python?

You don't need pandas for this. Here's a really simple way of doing this with csv:
import csv
import glob
with open("path/to/output", 'w') as outfile:
    for fpath in glob.glob('path/to/directory/*.csv'):
        with open(fpath) as infile:
            for row in csv.reader(infile):
                # row[1] is the 2nd column
                outfile.write("Hello {} welcome\n".format(row[1]))

1) If you would like to import all .csv files in a folder, you can just use
import os
for i in [a for a in os.listdir() if a[-4:] == '.csv']:
    # code to read in the .csv file and concatenate it to the existing dataframe
    ...
2) To add the text and write to a file, you can map a function over each element of the dataframe's 2nd column.
# existing dataframe called df
# index=False, header=False write only the text lines, without row labels or a column name
df[df.columns[1]].map(lambda x: "Hello {} welcome".format(x)).to_csv(<targetpath>, index=False, header=False)
# replace <targetpath> with your target path
See http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.to_csv.html for all the various parameters you can pass in to to_csv.
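Putting the two steps together, a minimal pandas sketch might look like the following. The function name merge_second_column and its parameters are made up for illustration. It assumes the files have no header row (as in the question's sample) and that the second column holds plain text. usecols=[1] makes read_csv parse only the second column, which should help with the ~100k-row files, though that has not been benchmarked here.

```python
import glob
import os

import pandas as pd


def merge_second_column(folder, out_path, prefix="Hello", suffix="welcome"):
    """Collect column 2 of every .csv in `folder` and write one greeting per row.

    Hypothetical helper: assumes headerless CSV files with at least two columns.
    Returns the number of lines written.
    """
    out_abs = os.path.abspath(out_path)
    # sorted() makes the output order deterministic; skip the output file itself
    paths = [p for p in sorted(glob.glob(os.path.join(folder, "*.csv")))
             if os.path.abspath(p) != out_abs]
    # usecols=[1] parses only the second column; dtype=str keeps values as text
    columns = [pd.read_csv(p, header=None, usecols=[1], dtype=str)[1] for p in paths]
    if not columns:
        return 0
    # empty cells are read as NaN, so turn them into empty strings first
    names = pd.concat(columns, ignore_index=True).fillna("")
    lines = prefix + " " + names + " " + suffix
    with open(out_path, "w") as out:
        out.write("\n".join(lines) + "\n")
    return len(lines)
```

It could be called as merge_second_column("path/to/directory", "all.csv"). The output is written as plain lines rather than through to_csv so that no quoting or index column is added.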

Here is a non-pandas solution using the built-in csv module. Not sure about speed.
import os
import csv
path_to_files = "path to files"
all_csv = os.path.join(path_to_files, "all.csv")
file_list = os.listdir(path_to_files)
names = []
for file in file_list:
    # skip the output file in case it already exists from an earlier run
    if file.endswith(".csv") and file != "all.csv":
        path_to_current_file = os.path.join(path_to_files, file)
        with open(path_to_current_file, "r") as current_csv:
            reader = csv.reader(current_csv, delimiter=',')
            for row in reader:
                names.append(row[1])
with open(all_csv, "w", newline="") as out_csv:
    # write to the output file, not to the (already closed) input file
    writer = csv.writer(out_csv, delimiter=',')
    for name in names:
        writer.writerow(["Hello {} welcome".format(name)])

Related

Python - for all csv files in folder make new csv file with header names and the file they came from

I have code that writes five CSV files, and after all of them are created I would like to run a function that puts all of the headers into a CSV or XLSX file in which each row represents one header of one file.
So in a folder called "Example" there are 5 CSV files, called "1.csv", "2.csv" ... "5.csv". The code I would like to have would create a new file called "Headers of files in Example", where the first column is the name of the CSV file the header came from and the second column contains the header. For example, if the headers of 1.csv are a, b, c, the file would contain the rows (1.csv, a), (1.csv, b), (1.csv, c), and so on for the other files.
My Python coding is fairly basic at this point, but I definitely think what I would like to do is possible. Any suggestions would be greatly appreciated!
After some more digging I was able to find some code that did what I wanted after some slight modifications:
import csv
import glob
import pandas as pd
def headers():
    path = r'path to folder containing csv files/'
    all_files = glob.glob(path + "*.csv")
    files = all_files
    myheaders = ['filename', 'header']
    with open("Headers of foldername.csv", "w", newline='') as fw:
        cw = csv.writer(fw, delimiter=",")
        for filename in files:
            with open(filename, 'r') as f:
                cr = csv.reader(f)
                # get title
                for column_name in (x.strip() for x in next(cr)):
                    cw.writerow([filename, column_name])
    # header=None: the file has no header row yet, so its first line is kept as data
    file = pd.read_csv("Headers of foldername.csv", header=None)
    file.to_csv("Headers of foldername.csv", header=myheaders, index=False)
Given that you have the DataFrames in memory, you just need to create a new DataFrame. I like to build it from a dictionary of lists: for each file/DataFrame you extract the columns and add them to the dictionary.
Later you can save the new DataFrame to a file.
# list_of_files holds the DataFrames, list_of_names the matching file names
summary_df = {
    'file_name': list(),
    'headers': list()}
for file, filename in zip(list_of_files, list_of_names):
    aux_headers = file.columns.to_list()
    summary_df['headers'] += aux_headers
    # repeat the file name once per header so both lists stay the same length
    summary_df['file_name'] += [filename] * len(aux_headers)
summary_df = pd.DataFrame(summary_df)
I hope this piece of code helps. Essentially, it iterates over all the files you want (their names are in file_names) and reads them using pandas. Once a CSV is loaded, you extract the headers with df.columns and store them in a list, which pandas then saves as a new CSV.
import pandas as pd
header_names = []
file_names = ['1.csv', '2.csv']
for file_name in file_names:
    df = pd.read_csv(file_name)
    header_names.extend(list(df.columns))
new_df = pd.DataFrame(header_names)
new_df.to_csv("headers.csv")
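If the files are large, loading each one completely just to look at its header is wasteful. A possible variation, sketched below, asks pandas for zero data rows so that only the header line is parsed. The function name collect_headers and the output column names are assumptions for illustration, and the sketch assumes every file is non-empty and has a header line (read_csv raises EmptyDataError on a completely empty file).

```python
import glob
import os

import pandas as pd


def collect_headers(folder):
    """Return a DataFrame with one row per (file name, header) pair.

    Hypothetical helper: assumes each .csv in `folder` starts with a header line.
    """
    records = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        # nrows=0 parses only the header line, so the data rows are never loaded
        columns = pd.read_csv(path, nrows=0).columns
        for column in columns:
            records.append({"filename": os.path.basename(path),
                            "header": str(column).strip()})
    return pd.DataFrame(records, columns=["filename", "header"])
```

The result can then be saved with something like collect_headers("Example").to_csv("headers.csv", index=False). Writing the summary outside the scanned folder avoids it being picked up on a later run.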

Sort out columns of multiple csv files at once in Python

I would really appreciate your help.
I have around 200 CSV files with the same header.
E.g. the headers are x, y, z, time, id, type
I would like to sort all the CSV files by the time column and save them again.
This is what I have tried so far, but it doesn't work.
Could you please help me?
Thank you
import csv
import operator
import glob
import pandas as pd
data = dict() # filename : lists
path="./*.csv"
files=glob.glob(path)
for filename in files:
    # process each file
    with open(filename, 'r') as f:
        # read file to a list of lists
        lists = [row for row in csv.reader(f, delimiter=',')]
        # sort and save into a dict
        sorted_df = lists.sort_values(by=["time"], ascending=True)
        sorted_df.to_csv('%.csv', index=False)
I don't have much knowledge of the csv module, but you're already using pandas, which can read CSV files with pd.read_csv, so why not use that:
for filename in files:
    df = pd.read_csv(filename)
    df.sort_values('time', inplace=True)
    df.to_csv(filename, index=False)
This would overwrite each file with the same data sorted by time.
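Since overwriting 200 files in place leaves no way back if something goes wrong, one cautious variation is to write the sorted copies to a separate folder. The sketch below is only an illustration: the function name sort_csvs_by_column and the folder arguments are hypothetical, and it assumes every file has a column with the given name.

```python
import glob
import os

import pandas as pd


def sort_csvs_by_column(src_dir, dst_dir, column="time"):
    """Write a copy of each CSV in src_dir to dst_dir, sorted by `column`.

    Hypothetical helper: the originals are left untouched as long as
    dst_dir is a different folder from src_dir.
    """
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    for path in sorted(glob.glob(os.path.join(src_dir, "*.csv"))):
        df = pd.read_csv(path)
        # kind="stable" keeps the original order of rows that share a value
        df = df.sort_values(column, kind="stable")
        out_path = os.path.join(dst_dir, os.path.basename(path))
        df.to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

For example, sort_csvs_by_column(".", "sorted") would leave the originals in place and put the sorted versions in a sorted subfolder.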

Finding the number of rows for all files within a folder

Hello, I am trying to find the number of rows for all files within a folder. I am trying to do this for a folder that contains only ".txt" files and for a folder that contains only ".csv" files.
I know that the way to get the number of rows for a SINGLE ".txt" file is something like this:
file = open("sample.txt","r")
Counter = 0
Content = file.read()
CoList = Content.split("\n")
for i in CoList:
    if i:
        Counter += 1
print("This is the number of lines in the file")
print(Counter)
Whereas for a SINGLE ".csv" file is something like this:
import csv
file = open("sample.csv")
reader = csv.reader(file)
lines = len(list(reader))
print(lines)
But how can I do this for ALL files within a folder? That is, how can I loop each of these procedures across all files within a folder and, ideally, export the output into an excel sheet with columns akin to these:
Filename Number of Rows
1.txt 900
2.txt 653
and so on and so on.
Thank you so much for your help.
You can use glob to find the files and then just iterate over them.
For other methods, see: How do I list all files of a directory?
import glob
# 1. list all text files in the directory
rel_filepaths = glob.glob("*.txt")
# 2. (optional) create a function to read the number of rows in a file
def count_rows(filepath):
    res = 0
    f = open(filepath, 'r')
    res = len(f.readlines())
    f.close()
    return res
# 3. iterate over your files and use the count_rows function
counts = [count_rows(filepath) for filepath in rel_filepaths]
print(counts)
Then, if you want to export this result to a .csv or .xlsx file, I recommend using pandas.
import pandas as pd
# 1. create a new table and add your two columns filled with the previous values
df = pd.DataFrame()
df["Filename"] = rel_filepaths
df["Number of rows"] = counts
# 2. export this dataframe to `.csv`
df.to_csv("results.csv")
You can also use pandas.ExcelWriter() if you want to use the .xlsx format. Link to documentation & examples : Pandas - ExcelWriter doc
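The question asks about both ".txt" and ".csv" folders, so here is a rough standard-library sketch that handles both and writes the table as a CSV file (which Excel can open). All function names are made up for illustration. It keeps the question's two counting rules: non-empty lines for text files, parsed records for CSV files. Files are read lazily, line by line, rather than loaded whole.

```python
import csv
import glob
import os


def count_text_lines(path):
    """Count non-empty lines, reading the file lazily instead of all at once."""
    with open(path, "r") as f:
        return sum(1 for line in f if line.rstrip("\n"))


def count_csv_rows(path):
    """Count parsed CSV records, so a quoted field with a newline counts once."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f))


def count_rows_in_folder(folder, out_path):
    """Write a 'Filename, Number of rows' table for the .txt and .csv files in folder."""
    counters = {".txt": count_text_lines, ".csv": count_csv_rows}
    results = []
    for path in sorted(glob.glob(os.path.join(folder, "*"))):
        ext = os.path.splitext(path)[1].lower()
        if ext in counters:
            results.append((os.path.basename(path), counters[ext](path)))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Filename", "Number of rows"])
        writer.writerows(results)
    return results
```

It is safest to pass an out_path outside the scanned folder, otherwise the results file would itself be counted on the next run.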

Pandas - Trying to store multiple .txt files in a .csv

I have a folder with about 500 .txt files. I would like to store the content in a csv file, with 2 columns, column 1 being the name of the file and column 2 being the file content in string. So I'd end up with a CSV file with 501 rows.
I've snooped around SO and tried to find similar questions, and came up with the following code:
import pandas as pd
from pandas.io.common import EmptyDataError
import os
def Aggregate_txt_csv(path):
    for files in os.listdir(path):
        with open(files, 'r') as file:
            try:
                df = pd.read_csv(file, header=None, delim_whitespace=True)
            except EmptyDataError:
                df = pd.DataFrame()
            return df.to_csv('file.csv', index=False)
However it returns an empty .csv file. Am I doing something wrong?
There are several problems in your code. One of them is that pd.read_csv cannot open the file because you are not passing the full path to it. I think you should start from this code:
import os
import pandas as pd
# in current pandas versions the exception lives in pandas.errors
from pandas.errors import EmptyDataError
def Aggregate_txt_csv(path):
    files = os.listdir(path)
    df = []
    for file in files:
        try:
            # sep=r"\s+" splits on any whitespace (same effect as delim_whitespace=True)
            d = pd.read_csv(os.path.join(path, file), header=None, sep=r"\s+")
            d["file"] = file
        except EmptyDataError:
            d = pd.DataFrame({"file":[file]})
        df.append(d)
    df = pd.concat(df, ignore_index=True)
    df.to_csv('file.csv', index=False)
Use pathlib:
Path.glob() finds all the files.
When using Path objects, file.stem returns the file name without its extension.
Use pandas.concat to combine the dataframes in df_list.
from pathlib import Path
import pandas as pd
p = Path('e:/PythonProjects/stack_overflow')  # path to files
files = p.glob('*.txt')  # get all txt files
df_list = list()  # create an empty list for the dataframes
for file in files:  # iterate through each file
    with file.open('r') as f:
        text = '\n'.join([line.strip() for line in f.readlines()])  # join all rows in list as a single string separated with \n
    df_list.append(pd.DataFrame({'filename': [file.stem], 'contents': [text]}))  # create and append a dataframe
df_all = pd.concat(df_list)  # concat all the dataframes
df_all.to_csv('files.csv', index=False)  # save to csv
I noticed there's already an answer, but I've gotten it to work with a relatively simple piece of code. I've only edited the file read-in a little bit, and the dataframe is outputting successfully.
import pandas as pd
# in current pandas versions the exception lives in pandas.errors
from pandas.errors import EmptyDataError
import os
def Aggregate_txt_csv(path):
    result = []
    print(os.listdir(path))
    for files in os.listdir(path):
        fullpath = os.path.join(path, files)
        if not os.path.isfile(fullpath):
            continue
        with open(fullpath, 'r', errors='replace') as file:
            try:
                content = '\n'.join(file.readlines())
                result.append({'title': files, 'body': content})
            except EmptyDataError:
                result.append({'title': files, 'body': None})
    df = pd.DataFrame(result)
    return df
df = Aggregate_txt_csv('files')
print(df)
df.to_csv('result.csv')
Most importantly, I append to a list so as not to call pandas' concat function repeatedly, which would be pretty bad for performance. Additionally, reading the file should not need read_csv, as the files have no set format. Using '\n'.join(file.readlines()) lets you read the file plainly and put all of its lines into one string.
At the end, I convert the list of dictionaries into a final dataframe and return it.
EDIT: for paths other than the current directory, I updated the code to join the path to each file name so that it can find the files. Apologies for the confusion.

How to extract a single row from multiple CSV files to a new file

I have hundreds of CSV files on my disk, with one file added daily, and I want to extract one row from each of them and put the rows in a new file. After that I want to add new values to that same file every day. The CSV files look like this:
business_day,commodity,total,delivery,total_lots
.
.
20160831,CTC,,201710,10
20160831,CTC,,201711,10
20160831,CTC,,201712,10
20160831,CTC,Total,,385
20160831,HTC,,201701,30
20160831,HTC,,201702,30
.
.
I want to fetch the row that contains 'Total' from each file. The new file should look like:
business_day,commodity,total,total_lots
20160831,CTC,Total,385
20160901,CTC,Total,555
.
.
The raw files on my disk are named '20160831_foo.CSV', '20160901_foo.CSV', etc.
After Googling this I have not yet seen any examples of how to extract only one value from a CSV file. Any hints/help are much appreciated. Happy to use pandas if that makes life easier.
I ended up with the following:
import pandas as pd
import glob
list_ = []
filenames = glob.glob('c:\\Financial Data\\*_DAILY.csv')
for filename in filenames:
    df = pd.read_csv(filename, index_col = None, usecols = ['business_day', 'commodity', 'total', 'total_lots'], parse_dates = ['business_day'], infer_datetime_format = True)
    df = df[((df['commodity'] == 'CTC') & (df['total'] == 'Total'))]
    list_.append(df)
df = pd.concat(list_, ignore_index = True)
df['total_lots'] = df['total_lots'].astype(int)
df = df.sort_values(['business_day'])
df = df.set_index('business_day')
Then I save it as my required file.
Read the CSV files and process them directly like so:
import csv
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        # do something here with `row`
        break
I would recommend appending the rows you want onto a list after processing them, and then passing that list to a pandas DataFrame, which will simplify your data manipulations a lot.
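As a rough sketch of that idea, the function below reads each file with the csv module, stops at the first matching row, collects the rows in a list and builds the DataFrame once at the end. The function name collect_total_rows, the pattern argument and the commodity default are illustrative assumptions. The column names come from the sample in the question. Note that glob patterns are case-sensitive on some systems, so the pattern has to match the actual file extension.

```python
import csv
import glob
import os

import pandas as pd


def collect_total_rows(folder, pattern="*.csv", commodity="CTC"):
    """Gather the 'Total' row for one commodity from every matching file.

    Hypothetical helper: assumes each file has the header
    business_day,commodity,total,delivery,total_lots.
    """
    wanted = ["business_day", "commodity", "total", "total_lots"]
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(path, newline="") as f:
            # DictReader maps each row to the header names on the first line
            for row in csv.DictReader(f):
                if row["commodity"] == commodity and row["total"] == "Total":
                    rows.append({key: row[key] for key in wanted})
                    break  # one row per file is enough, so stop reading early
    # build the DataFrame once, from the finished list of rows
    df = pd.DataFrame(rows, columns=wanted)
    df["total_lots"] = df["total_lots"].astype(int)
    return df
```

The daily update could then rerun the function over the whole folder, or run it on just the newest file and append the result to the existing output.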
