How to combine CSV files without using pandas - python

I have 3 csv files that I need to merge
all the 3 files have the first three columns being equal while like name, age, sex but other columns are different for all.
I am new to python. I need assistance on this. I can comprehend any code written. Thanks
I have tried some codes but not working
file 1
firstname,secondname,age,address,postcode,height
gdsd,gas,uugd,gusa,uuh,hhuuw
kms,kkoil,jjka,kja,kaja,loj
iiow,uiuw,iue,oijw,uow,oiujw
ujis,oiiw,ywuq,sax,cxv,ywf
file 2
firstname,secondname,age,home-town,spousename,marital_staus
gdsd,gas,uugd,vbs,owu,nsvc
kms,kkoil,jjka,kja,kaja,loj
iiow,uiuw,iue,xxfaf,owuq,pler
ujis,oiiw,ywuq,gfhd,lzac,oqq
file 3
firstname,secondname,age,drive,educated,
gdsd,gas,uugd,no,yes
kms,kkoil,jjka,no,no
iiow,uiuw,iue,yes,no
ujis,oiiw,ywuq,yes,yes
desired result
firstname,secondname,age,hometown,spousename,marital_status,adress,post_code,height,drive,educated
note that firstname,secondname,age is the same across the 3 tables
I need valid codes please

Here's a generic solution for concatenating CSV files that have heterogeneous headers with Python.
What you need to do first is to read the header of each CSV file for determining the "unique" field names.
Then, you just have to read each input record and output it while transforming it to match the new header (which is the unique fields).
#!/usr/bin/env python3
import csv
paths = [ 'file1.csv', 'file2.csv', 'file3.csv' ]
fieldnames = set()
for p in paths:
with open(p,'r') as f:
reader = csv.reader(f)
fieldnames.update( next(reader) )
with open('combined.csv', 'w') as o:
writer = csv.DictWriter(o, fieldnames = fieldnames)
writer.writeheader()
for p in paths:
with open(p,'r') as f:
reader = csv.DictReader(f)
writer.writerows( reader )
remark: I open the files twice, so it won't work for inputs that are streams (for ex. sys.stdin)

Related

Csv, Python, separating elements in one column to different columns

So I have a CSV file like this,
how can I separate them into different columns like this,
using python without using the pandas lib.
Implementation that should work in python 3.6+.
import csv
with open("input.csv", newline="") as inputfile:
with open("output.csv", "w", newline="") as outputfile:
reader = csv.DictReader(inputfile) # reader
fieldnames = reader.fieldnames
writer = csv.DictWriter(outputfile, fieldnames=fieldnames) # writer
# make header
writer.writeheader()
# loop over each row in input CSV
for row in reader:
# get first column
column: str = str(row[fieldnames[0]])
numbers: list = column.split(",")
if len(numbers) != len(fieldnames):
print("Error: Lengths not equal")
# write row in output CSV
writer.writerow({field: num for field, num in zip(fieldnames, numbers)})
Explanation of the code:
The above code takes two file names input.csv and output.csv. The names being verbose don't need any further explanation.
It reads each row from input.csv and writes corresponding row in output.csv.
The last line is a "dictionary comprehension" combined with zip (similar to "list comprehensions" for lists). It's a nice way to do a lot of stuff in a single line but same code in expanded form looks like:
row = {}
for field, num in zip(fieldnames, numbers):
row[field] = num
writer.writerow(row)
It is already separated into different columns by , as separator, but the european version of excel usually uses ; as separator. You can specify the separator, when you import the csv:
https://support.microsoft.com/en-us/office/import-or-export-text-txt-or-csv-files-5250ac4c-663c-47ce-937b-339e391393ba
If you really want to change the file content with python use the replace function and replace , with ;: How to search and replace text in a file?

Writing to CSV file choosing the column to write in

I'm trying to import the data from X (6 in this case) CSV Files containing some data about texts, and putting one specific row from each Document onto a new one, in such a way that they appear next to each other (export from document 1 on column 1, from the second document on row 2 and so on). I've been unsuccessful so far.
# I have a list containing the path to all relevant files
files = ["path1", "path2", ...]
# then I tried cycling through the folder like this
for file in files:
with open(file, "r") as csvfile:
reader = csv.reader(csvfile, delimiter=",")
for row on reader:
# I'm interested in the stuff stored on Column 2
print([row[2]])
# as you can see, I can get the info from the files, but from here
# on, I can't find a way to then write that information on the
# appropiate coloumn of the newly created CSV file
I know how to open a writer, what I don't know is how to write a script that writes the info it fetches from the original 6 documents on a DIFFERENT COLUMN every time a new file is processed.
# I have a list containing the path to all relevant files
files = ["path1", "path2", ...]
newfile = "newpath1"
# then I tried cycling through the folder like this
for file in files:
with open(file, "r") as csvfile:
reader = csv.reader(csvfile, delimiter=",")
with open(newfile, "a") as wcsvfile:
writer = csv.writer(wcsvfile)
for row on reader:
# I'm interested in the stuff stored on Column 2
writer.writerow([row[2]])

Python: removing old records and duplicates between two CSV files

Total beginner with Python, need help to achieve a task!
I have two csv files. old.csv and new.csv
Both have same structure (A to Z columns), and each record has a unique identifier which is a number, in column F (sixth column). Between these two CSVs there are a few duplicates records.
I’m looking for a way to eliminate records that are also in the old.csv, from the new.csv and output to a new file that has the same structure, so the new output.csv has truly only the new records.
What's a good way to achieve this? I need to be able to run this on a windows machine through a command line.
Any help is appreciated! Thanks in advance!
Read the csv file and map it to tuple
import csv
f = open('old.csv', 'rb')
reader = csv.reader(f)
reader = map(tuple,reader)
fn = open('new.csv', 'rb')
readern = csv.reader(fn)
readern = map(tuple,readern)
Get all unique identifiers in old.csv
from operator import itemgetter
reader= map(itemgetter(5), reader)
add all items whose identifier is not in old.csv, add them to unique list
unique= [item for item in readern if item[5] not in reader]
Write the rows to output.csv
output = open("output.csv", 'wb')
wr = csv.writer(output)
for row in unique:
wr.writerow(row)
output.close()
Hope this helps!
A simple approach would be:
collect all identifiers from old.csv in a set
loop through new.csv
if a record has an identifier that's noz in your set, write it to output.csv
You will probably want to use the csv module for reading and writing the files.

Merging two csv files where common column matches

I have a csv of users, and a csv of virtual machines, and i need to merge the users into their vms only where their id match.
But all im getting is a huge file containing everything.
file_names = ['vms.csv', 'users.csv']
o_data = []
for afile in file_names:
file_h = open(afile)
a_list = []
a_list.append(afile)
csv_reader = csv.reader(file_h, delimiter=';')
for row in csv_reader:
a_list.append(row[0])
o_data.append((n for n in a_list))
file_h.close()
with open('output.csv', 'w') as op_file:
csv_writer = csv.writer(op_file, delimiter=';')
for row in list(zip(*o_data)):
csv_writer.writerow(row)
op_file.close()
Im relatively new to python, am i missing something?
I've always found pandas really helpful for tasks like this. You can simply load the datasets into pandas data frames and then use the merge function to merge them where the values in a column are same.
import pandas
vms = pandas.read_csv('vms.csv')
users = pandas.read_csv('users.csv')
output = pandas.merge(vms, users)
output.to_csv('output.tsv')
You can find the documentation for the different options at http://pandas.pydata.org/pandas-docs/stable/merging.html

Combining multiple CSV file into a single one

I have CSV files in which Data is formatted as follows:
file1.csv
ID,NAME
001,Jhon
002,Doe
fille2.csv
ID,SCHOOLS_ATTENDED
001,my Nice School
002,His lovely school
file3.csv
ID,SALARY
001,25
002,40
ID field is kind of primary key that will be used to fetch record.
What is the most efficient way to read 3 to 4 files and get corresponding data and store in another CSV file having headings (ID,NAME,SCHOOLS_ATTENDED,SALARY)?
The file sizes are in the hundreds of MBs (100, 200 Mb).
Hundreds of megabytes aren't that much. Why not go for a simple approach using the csv module and collections.defaultdict:
import csv
from collections import defaultdict
result = defaultdict(dict)
fieldnames = {"ID"}
for csvfile in ("file1.csv", "file2.csv", "file3.csv"):
with open(csvfile, newline="") as infile:
reader = csv.DictReader(infile)
for row in reader:
id = row.pop("ID")
for key in row:
fieldnames.add(key) # wasteful, but I don't care enough
result[id][key] = row[key]
The resulting defaultdict looks like this:
>>> result
defaultdict(<type 'dict'>,
{'001': {'SALARY': '25', 'SCHOOLS_ATTENDED': 'my Nice School', 'NAME': 'Jhon'},
'002': {'SALARY': '40', 'SCHOOLS_ATTENDED': 'His lovely school', 'NAME': 'Doe'}})
You could then combine that into a CSV file (not my prettiest work, but good enough for now):
with open("out.csv", "w", newline="") as outfile:
writer = csv.DictWriter(outfile, sorted(fieldnames))
writer.writeheader()
for item in result:
result[item]["ID"] = item
writer.writerow(result[item])
out.csv then contains
ID,NAME,SALARY,SCHOOLS_ATTENDED
001,Jhon,25,my Nice School
002,Doe,40,His lovely school
Following is the working code for combining multiple csv files with specific keywords in their names into 1 final csv file. I have set the default keyword to "file" but u can set it blank if u want to combine all csv files from a folder_path. This code will take header from your first csv file and use it as a header in final combined csv file. It will ignore headers of all other csv files.
import glob,os
#staticmethod
def Combine_multiple_csv_files_thatContainsKeywordInTheirNames_into_one_csv_file(folder_path,keyword='file'):
#takes header only from 1st csv, all other csv headers are skipped and data is appened to final csv
fileNames = glob.glob(folder_path + "*" + keyword + "*"+".csv") # fileNames INCLUDES FOLDER_PATH TOO
with open(folder_path+"Combined_csv.csv", "w", newline='') as fout:
print('Combining multiple csv files into 1')
csv_write_file = csv.writer(fout, delimiter=',')
# a.writerows(op)
with open(fileNames[0], mode='rt') as read_file: # utf8
csv_read_file = csv.reader(read_file, delimiter=',') # CSVREADER READS FILE AS 1 LIST PER ROW. SO WHEN WRITIN TO ANOTHER CSV FILE WITH FUNCTION WRITEROWS, IT INTRODUCES ANOTHER NEW LINE '\N' CHARACTER. SO TO AVOID DOUBLE NEWLINES , WE SET NEWLINE AS '' WHEN WE OPEN CSV WRITER OBJECT
csv_write_file.writerows(csv_read_file)
for num in range(1, len(fileNames)):
with open(fileNames[num], mode='rt') as read_file: # utf8
csv_read_file = csv.reader(read_file, delimiter=',') # CSVREADER READS FILE AS 1 LIST PER ROW. SO WHEN WRITIN TO ANOTHER CSV FILE WITH FUNCTION WRITEROWS, IT INTRODUCES ANOTHER NEW LINE '\N' CHARACTER. SO TO AVOID DOUBLE NEWLINES , WE SET NEWLINE AS '' WHEN WE OPEN CSV WRITER OBJECT
next(csv_read_file) # ignore header
csv_write_file.writerows(csv_read_file)

Categories