I obtain multiple CSV files from an API, in which I need to remove stray newlines and join the broken records back together; consider the data provided below.
My code to remove the newlines:

## Loading necessary libraries
import glob
import os
import shutil
import csv

## Assigning necessary paths
source_path = "/home/Desktop/Space/"
dest_path = "/home/Desktop/Output/"

# Assigning the file_read path used to modify the copied CSV files
file_read_path = "/home/Desktop/Output/*.csv"

## Copy the .csv files from one folder to another
for csv_file in glob.iglob(os.path.join(source_path, "*.csv"), recursive=True):
    shutil.copy(csv_file, dest_path)

## Remove newline characters inside every field of all .csv files
for filename in glob.glob(file_read_path):
    with open(filename, "r", encoding='ISO-8859-1') as file:
        reader = list(csv.reader(file, delimiter=","))
    for i in range(len(reader)):
        reader[i] = [row_space.replace("\n", "") for row_space in reader[i]]
    with open(filename, "w") as output:
        writer = csv.writer(output, delimiter=",", dialect='unix')
        for row in reader:
            writer.writerow(row)
I copy the CSV files into a new folder and then use the above code to remove any newlines present in the files.
You are fixing the CSV files because they contain stray \n characters. The problem here is knowing whether a line is a continuation of the previous line or not. If all records start with a specific prefix, like SV_a5d15EwfI8Zk1Zr in your example, or just SV_, you can do something like this:
import glob

# this is the FIX part
# I have the file ./data.csv (contains your example); the fixed version is in data.csv.FIXED
file_read_path = "./*.csv"

for filename in glob.glob(file_read_path):
    with open(filename, "r", encoding='ISO-8859-1') as file, open(filename + '.FIXED', "w", encoding='ISO-8859-1') as target:
        previous_line = ''
        for line in file:
            # check if it's a new record or a continuation of the previous line
            if line.startswith('SV_'):
                if previous_line:
                    target.write(previous_line + '\n')
                previous_line = line[:-1]  # remove \n
            else:
                # concatenate the broken part with previous_line
                previous_line += line[:-1]  # remove \n
        # write the last accumulated record
        target.write(previous_line + '\n')
Output:
SV_a5d15EwfI8Zk1Zr;QID4;"<span style=""font-size:16px;""><strong>HOUR</strong> Interview completed at:</span>";HOUR;TE;SL;;;true;ValidNumber;0;23.0;0.0;882;-873;0
SV_a5d15EwfI8Zk1Zr;QID6;"<span style=""font-size:16px;""><strong>MINUTE</strong> Interview completed:</span>";MIN;TE;SL;;;true;ValidNumber;0;59.0;0.0;882;-873;0
SV_a5d15EwfI8Zk1Zr;QID8;Number of Refusals - no language<br />For <strong>Zero Refusals - no language</strong> use 0;REFUSAL1;TE;SL;;;true;ValidNumber;0;99.0;0.0;882;-873;0
SV_a5d15EwfI8Zk1Zr;QID10;<strong>DAY OF WEEK:</strong>;WEEKDAY;MC;SACOL;TX;;true;;0;;;882;-873;0
SV_a5d15EwfI8Zk1Zr;QID45;"<span style=""font-size:16px;"">Using points from 0 to 10, how likely would you be recommend Gatwick Airport to a friend or colleague?</span><div> </div>";NPSCORE;MC;NPS;;;true;;0;;;882;-873;
EDIT:
It can be simpler using split too; this will fix the file itself:
import glob

# this is the FIX part
# I have the file ./data.csv; the fixed version is written back to the same file
file_read_path = "./*.csv"

# assuming that all records start with SV_
STARTING_KEYWORD = 'SV_'

for filename in glob.glob(file_read_path):
    with open(filename, "r", encoding='ISO-8859-1') as file:
        lines = file.read().split(STARTING_KEYWORD)
    with open(filename, 'w', encoding='ISO-8859-1') as file:
        file.write('\n'.join(STARTING_KEYWORD + l.replace('\n', '') for l in lines if l))
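For example, a quick check of the split-and-rejoin idea on a small made-up string (the prefix and field values here are only for illustration):

broken = "SV_1;QID4;first part\nof a broken field;x\nSV_2;QID6;ok;y\n"
parts = broken.split("SV_")
fixed = "\n".join("SV_" + p.replace("\n", "") for p in parts if p)
print(fixed)
# SV_1;QID4;first partof a broken field;x
# SV_2;QID6;ok;y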
Well, I'm not sure what restrictions you have, but if you can use the pandas library, this is simple.
import pandas as pd

data_set = pd.read_csv(data_file, skip_blank_lines=True)
data_set.to_csv(target_file, index=False)
This will create a CSV file with all blank lines removed. You can save a lot of time with available libraries.
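If the stray newlines sit inside quoted fields rather than appearing as blank lines, a hedged extension of the same idea is to strip them after parsing; the paths below are assumptions for illustration:

import pandas as pd

# hypothetical paths for illustration
data_file = "/home/Desktop/Output/data.csv"
target_file = "/home/Desktop/Output/data_clean.csv"

data_set = pd.read_csv(data_file, skip_blank_lines=True)
# additionally strip newlines that survive inside quoted fields
data_set = data_set.replace(r'\n', ' ', regex=True)
data_set.to_csv(target_file, index=False)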
Related
I'm trying to find a way to make my script ignore or delete the first line of my CSV files. I know we can do that with pandas, but is it possible without?
Many thanks for your help.
Here is my code -
from os import mkdir
from os.path import join, splitext, isdir
from glob import iglob
from csv import DictReader
from collections import defaultdict
from urllib.request import urlopen
from shutil import copyfileobj

csv_folder = r"/Users/folder/PycharmProjects/pythonProject/CSVfiles/"
glob_pattern = "*.csv"

for file in iglob(join(csv_folder, glob_pattern)):
    with open(file) as csv_file:
        reader = DictReader(csv_file)
        save_folder, _ = splitext(file)
        if not isdir(save_folder):
            mkdir(save_folder)
        title_counter = defaultdict(int)
        for row in reader:
            url = row["link"]
            title = row["title"]
            title_counter[title] += 1
            _, ext = splitext(url)
            save_filename = join(save_folder, f"{title}_{title_counter[title]}{ext}".replace('/', '-'))
            print(f"'{save_filename}'")
            with urlopen(url) as req, open(save_filename, "wb") as save_file:
                copyfileobj(req, save_file)
Use the next() function to skip the first row of your CSV.
with open(file) as csv_file:
    reader = DictReader(csv_file)
    # skip first row
    next(reader)
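Note that DictReader consumes the header line to build its field names, so next(reader) here skips the first data row rather than the first physical line. If the goal is to drop the very first line of the file itself, a small variant (a sketch, reusing the question's loop variable) is to advance the file handle before handing it to DictReader:

with open(file) as csv_file:
    next(csv_file)  # discard the very first physical line
    reader = DictReader(csv_file)  # field names are now read from the second line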
You could just read the raw text from the file as normal, then split the text on the newline character and delete the first line:

with open(filename, 'r') as file:  # open and read the file
    content = file.read()
lines = content.split("\n")  # split the text on the newline character
del lines[0]  # delete the first index from the resulting list, i.e. delete the first line

This may take a long time for larger CSV files, though, so it may not be the best solution.
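To complete that approach, the remaining lines still need to be written somewhere; a minimal sketch, assuming you want to overwrite the original file:

with open(filename, 'r') as file:
    lines = file.read().split("\n")
del lines[0]  # drop the first line
with open(filename, 'w') as file:
    file.write("\n".join(lines))  # write the remaining lines back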
Or you could simply skip the first row in your for loop.
Instead of:

...
for row in reader:
    ...

could you use:

...
for row_num, row in enumerate(list(reader)):
    if row_num == 0:
        continue
    ...

instead? I think that should skip the first row.
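Equivalently, itertools.islice expresses the same skip without the manual counter; a small sketch reusing the question's file variable and DictReader:

from csv import DictReader
from itertools import islice

with open(file) as csv_file:
    reader = DictReader(csv_file)
    for row in islice(reader, 1, None):  # skip the first row, yield the rest
        print(row)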
Using the following code to merge CSV files, the data will at times end up in the wrong columns. Rather than being in columns A-D, it will be put in columns F-J. From what I can tell, it's the first line of a new CSV that gets put in the wrong columns, though not for every CSV file.
import glob
import codecs
import csv

my_files = glob.glob("*.csv")

header_saved = False
with codecs.open('Final-US-Allies-Expects.csv', 'w', "UTF-8", 'ignore') as file_out:  # save data to
    for filename in my_files:
        with codecs.open(filename, 'r', 'UTF-8', 'ignore') as file_in:
            header = next(file_in)
            if not header_saved:
                file_out.write(header)  # write header
                header_saved = True
            for line in file_in:
                file_out.write(line)  # write next line
Original code available at Merging multiple CSV files without headers being repeated (using Python) (my reputation is not high enough to add to the original question).
Visual of issue
I've attached a visual of the issue. I need every line to be written into the column it is meant to go in.
Thanks for your help in advance.
It looks like you are not checking whether each line ends in a newline character before writing it to the output file. This could mess up the alignment. Could you try this?
import glob
import codecs
import csv

my_files = glob.glob("*.csv")

header_saved = False
with codecs.open('output.csv', 'w', "UTF-8", 'ignore') as file_out:
    for filename in my_files:
        with codecs.open(filename, 'r', 'UTF-8', 'ignore') as file_in:
            header = next(file_in)
            if not header_saved:
                file_out.write(header if "\n" == header[-1] else header + "\n")
                header_saved = True
            for line in file_in:
                file_out.write(line if "\n" == line[-1] else line + "\n")
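To see why this matters: if a file's last line has no trailing newline, the first line of the next file gets glued onto it, so its fields land in extra columns. A tiny illustration with made-up rows:

# last line of file A, missing its newline, followed by first line of file B
file_a_tail = "a,b,c,d"
file_b_head = "e,f,g,h\n"
print(file_a_tail + file_b_head)
# a,b,c,de,f,g,h  -> one row with seven columns instead of two rows of four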
I have a folder that has over 15,000 CSV files. They all have different numbers of columns.
Most files have their first row as column names (attributes of the data), like this:
Name    Date    Contact    Email
a       b       c          d
a2      b2      c2         d2
What I want to do is read the first row of every file, store them in a list, and write that list out as a new CSV file.
Here is what I have done so far:
import csv
import glob

list=[]
files=glob.glob('C:/example/*.csv')
for file in files :
    f = open(file)
    a=[file,f.readline()]
    list.append(a)

with open('test.csv', 'w') as testfile:
    csv_writer = csv.writer(testfile)
    for i in list:
        csv_writer.writerow(i)
When I try this code, the result comes out like this:

[('C:/example\\example.csv', 'Name,Date,Contact,Email\n'), ('C:/example\\example2.csv', 'Address,Date,Name\n')]

Therefore in the resulting CSV, all the attributes of each file go into the second column, making it look like this (for some reason, there's an empty row in between):

New CSV file made

Moreover, when going through the files, I encountered another error:

UnicodeDecodeError: 'cp949' codec can't decode byte 0xed in position 6: illegal multibyte sequence

So I included this code at the top, but it didn't work, saying the files are invalid.
import codecs
files=glob.glob('C:/example/*.csv')
fileObj = codecs.open( files, "r", "utf-8" )
I read answers on Stack Overflow but couldn't find one related to my problem. I appreciate your answers.
Ok, so
import csv
import glob

list=[]
files=glob.glob('C:/example/*.csv')
for file in files :
    f = open(file)
    a=[file,f.readline()]
    list.append(a)
Here you're opening the file and then building a list of the file name plus the header row as one single string (note that means it'll look like "Column1,Column2"). So each entry is:

["filename", "Column1,Column2\n"]

So you're going to need to split that on the ',' like:

for file in files:
    f = open(file)
    a = [file, f.readline().split(',')]

Now we have:

["filename", ["Column1", "Column2"]]

So it's still going to print to the file wrong. We need to concatenate the lists:

a = [file] + f.readline().split(',')

So we get:

["filename", "Column1", "Column2"]
And you should be closing each file after you open it with f.close(), or use a context manager inside your loop like:

for file in files:
    with open(file) as f:
        a = [file] + f.readline().split(',')
        list.append(a)
Better solution and how I would write it:

import csv
import glob

files = glob.glob('mydir/*.csv')
lst = list()
for file in files:
    with open(file) as f:
        reader = csv.reader(f)
        lst.append(next(reader))
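To mirror the original goal (one output row per file, filename first), a sketch that also writes the collected headers back out; prepending the filename is my addition:

import csv
import glob

files = glob.glob('mydir/*.csv')
rows = []
for file in files:
    with open(file) as f:
        reader = csv.reader(f)
        rows.append([file] + next(reader))  # filename, then its header fields

with open('test.csv', 'w', newline='') as testfile:
    csv.writer(testfile).writerows(rows)

Opening the output with newline='' is also what removes the empty row between records that you saw on Windows.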
For the UnicodeDecodeError, you can try one encoding and fall back to another, for example:

try:
    with open(file, 'r', encoding='utf8') as f:
        # do things
        pass
except UnicodeError:
    with open(file, 'r', encoding='cp949') as f:
        # do things
        pass
A little bit of tidying, proper context management, and using csv.reader:
import csv
import glob

files = glob.glob('C:/example/*.csv')

with open('test.csv', 'w') as testfile:
    csv_writer = csv.writer(testfile)
    for file in files:
        with open(file, 'r') as infile:
            reader = csv.reader(infile)
            headers = next(reader)
            lst = [file] + headers
            csv_writer.writerow(lst)
This will write a new CSV with one row per input file, each row being filename, column1, column2, ...
I am new to data processing using the csv module. I have an input file, and I am using this code:
import csv

path1 = "C:\\Users\\apple\\Downloads\\Challenge\\raw\\charity.a.data"
csv_file_path = "C:\\Users\\apple\\Downloads\\Challenge\\raw\\output.csv.bak"

with open(path1, 'r') as in_file:
    in_file.__next__()
    stripped = (line.strip() for line in in_file)
    lines = (line.split(":$%:") for line in stripped if line)
    with open(csv_file_path, 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('id', 'donor_id', 'last_name', 'first_name', 'year', 'city', 'state', 'postal_code', 'gift_amount'))
        writer.writerows(lines)
Is it possible to remove the (:) in the first and last columns of the CSV file? And I want the output to look like this.
Please help me.
If you just want to eliminate the ':' at the first and last columns, this should work. Keep in mind that your dataset should be tab-separated (or separated by something other than a comma) before you read it, because, as I commented on your question, there are commas ',' in your dataset.
path1 = '/path/input.csv'
path2 = '/path/output.csv'

with open(path1, 'r') as input, open(path2, 'w') as output:
    file = iter(input.readlines())
    output.write(next(file))
    for row in file:
        output.write(row[1:][:-2] + '\n')
Update
So after you shared your code, I added a small change to do the whole process starting from the initial file. The idea is the same: just exclude the leading and trailing characters of each line. So instead of line.strip() you should have line.strip()[1:][:-2].
import csv

path1 = "C:\\Users\\apple\\Downloads\\Challenge\\raw\\charity.a.data"
csv_file_path = "C:\\Users\\apple\\Downloads\\Challenge\\raw\\output.csv.bak"

with open(path1, 'r') as in_file:
    in_file.__next__()
    stripped = (line.strip()[1:][:-2] for line in in_file)
    lines = (line.split(":$%:") for line in stripped if line)
    with open(csv_file_path, 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('id', 'donor_id', 'last_name', 'first_name', 'year', 'city', 'state', 'postal_code', 'gift_amount'))
        writer.writerows(lines)
I'm writing a script with a for loop that extracts a list of variables from each 'data_i.csv' file in a folder, then appends that list as a new row in a single 'output.csv' file.
My objective is to define the headers of the file once and then append data to the 'output.csv' container file, so it will function as a backlog for a standard measurement.
The first time I run the script, it will add all the files in the folder. The next time I run it, I want it to only append files that have been added since. I thought one way of doing this would be to check for duplicates, but the code I found for that so far only searches for consecutive duplicates.
Do you have suggestions?
Here's what I have so far:

import csv, os

# Find csv files
for csvFilename in os.listdir('.'):
    if not csvFilename.endswith('.csv'):
        continue

    # Read in the csv file and choose certain cells
    csvRows = []
    csvFileObj = open(csvFilename)
    csvData = csv.reader(csvFileObj, delimiter=' ', skipinitialspace='True')
    csvLines = list(csvData)
    cellID = csvLines[4][3]
    # Read in several variables...
    csvRows = [cellID]
    csvFileObj.close()

    resultFile = open("Output.csv", 'a')  # open in 'append' mode
    wr = csv.writer(resultFile)
    wr.writerows([csvRows])
    resultFile.close()
This is the final script after mgc's answer:
import csv, os

f = open('Output.csv', 'r+')
merged_files = csv.reader(f)
merged_files = list()

for csvFilename in os.listdir('.'):
    if not csvFilename.endswith('_spm.txt'):
        continue
    if csvFilename in merged_files:
        continue
    csvRows = []
    csvFileObj = open(csvFilename)
    csvData = csv.reader(csvFileObj, delimiter=' ', skipinitialspace='True')
    csvLines = list(csvData)
    waferID = csvLines[4][3]
    temperature = csvLines[21][2]
    csvRows = [waferID, temperature]
    merged_files.append(csvRows)
    csvFileObj.close()

wr = csv.writer(f)
wr.writerows(merged_files)
f.close()
You can keep track of the name of each file already handled. If this log file doesn't need to be human-readable, you can use pickle. At the start of your script, you can do:
import pickle

try:
    with open('merged_log', 'rb') as f:
        merged_files = pickle.load(f)
except FileNotFoundError:
    merged_files = set()
Then you can add a condition to skip files previously processed:

if filename in merged_files: continue

Then, when you are processing a file, you can do:

merged_files.add(filename)

And save the variable at the end of your script (so it can be used on the next run):

with open('merged_log', 'wb') as f:
    pickle.dump(merged_files, f)
(However, there are other options for your problem; for example, you can slightly change the name of a file once it has been processed, like changing the extension from .csv to .csv_, or move processed files into a subfolder, etc.)
Also, in the example in your question, I don't think you need to open (and close) your output file on each iteration of your for loop. Open it once before the loop, write what you have to write, then close it when you have left the loop. A sketch putting these pieces together follows below.
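A minimal sketch of the whole flow, combining the pickle log with a single open of the output file (the cellID extraction is taken from your script; everything else follows the steps above):

import csv, os, pickle

# load the set of already-merged file names, if any
try:
    with open('merged_log', 'rb') as f:
        merged_files = pickle.load(f)
except FileNotFoundError:
    merged_files = set()

# open the output once, outside the loop
with open('Output.csv', 'a', newline='') as result_file:
    wr = csv.writer(result_file)
    for csvFilename in os.listdir('.'):
        if csvFilename == 'Output.csv' or not csvFilename.endswith('.csv'):
            continue
        if csvFilename in merged_files:
            continue
        with open(csvFilename) as csv_file:
            csvLines = list(csv.reader(csv_file, delimiter=' ', skipinitialspace=True))
        wr.writerow([csvLines[4][3]])  # cellID, as in the question's script
        merged_files.add(csvFilename)

# persist the log for the next run
with open('merged_log', 'wb') as f:
    pickle.dump(merged_files, f)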