Combining multiple CSV files into a single one - python

I have CSV files in which Data is formatted as follows:
file1.csv
ID,NAME
001,Jhon
002,Doe
file2.csv
ID,SCHOOLS_ATTENDED
001,my Nice School
002,His lovely school
file3.csv
ID,SALARY
001,25
002,40
The ID field acts as a primary key that is used to look up a record.
What is the most efficient way to read 3 or 4 such files, match the corresponding data by ID, and store the result in another CSV file with the headings (ID,NAME,SCHOOLS_ATTENDED,SALARY)?
The file sizes are in the hundreds of MB (100 to 200 MB).

Hundreds of megabytes aren't that much. Why not go for a simple approach using the csv module and collections.defaultdict:
import csv
from collections import defaultdict
result = defaultdict(dict)
fieldnames = {"ID"}
for csvfile in ("file1.csv", "file2.csv", "file3.csv"):
    with open(csvfile, newline="") as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            id = row.pop("ID")
            for key in row:
                fieldnames.add(key) # wasteful, but I don't care enough
                result[id][key] = row[key]
The resulting defaultdict looks like this:
>>> result
defaultdict(<class 'dict'>,
            {'001': {'NAME': 'Jhon', 'SCHOOLS_ATTENDED': 'my Nice School', 'SALARY': '25'},
             '002': {'NAME': 'Doe', 'SCHOOLS_ATTENDED': 'His lovely school', 'SALARY': '40'}})
You could then combine that into a CSV file (not my prettiest work, but good enough for now):
with open("out.csv", "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, sorted(fieldnames))
    writer.writeheader()
    for item in result:
        result[item]["ID"] = item
        writer.writerow(result[item])
out.csv then contains
ID,NAME,SALARY,SCHOOLS_ATTENDED
001,Jhon,25,my Nice School
002,Doe,40,His lovely school
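Note that sorting the field names gives ID,NAME,SALARY,SCHOOLS_ATTENDED rather than the column order asked for in the question. A minimal sketch of a variant that keeps columns in first-seen order follows; the helper name merge_by_id is made up for illustration, and io.StringIO objects stand in for the real files so the example is self-contained:

```python
import csv
import io

def merge_by_id(sources, key="ID"):
    """Merge CSV sources on `key`; columns keep first-seen order (hypothetical helper)."""
    fieldnames = [key]
    merged = {}
    for source in sources:
        reader = csv.DictReader(source)
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)
        for row in reader:
            # One dict per ID; later files add their columns to it.
            merged.setdefault(row[key], {key: row[key]}).update(row)
    return fieldnames, list(merged.values())

# In-memory stand-ins for file1.csv, file2.csv and file3.csv.
files = [
    io.StringIO("ID,NAME\n001,Jhon\n002,Doe\n"),
    io.StringIO("ID,SCHOOLS_ATTENDED\n001,my Nice School\n002,His lovely school\n"),
    io.StringIO("ID,SALARY\n001,25\n002,40\n"),
]
fieldnames, rows = merge_by_id(files)

out = io.StringIO()  # stand-in for open("out.csv", "w", newline="")
writer = csv.DictWriter(out, fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

With real files, each source would be a file object opened with newline="". IDs that are missing from some files simply get empty cells because of restval="".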

The following is working code for combining multiple CSV files that have a specific keyword in their names into one final CSV file. The default keyword is "file", but you can set it to an empty string if you want to combine all CSV files in folder_path. The code takes the header from the first CSV file and uses it as the header of the combined CSV file. It ignores the headers of all other CSV files.
import csv
import glob

def Combine_multiple_csv_files_thatContainsKeywordInTheirNames_into_one_csv_file(folder_path, keyword='file'):
    # The header is taken from the first CSV only; the headers of all other
    # CSVs are skipped and their data rows are appended to the final CSV.
    # folder_path must end with a path separator; fileNames include folder_path.
    # sorted() makes "the first CSV" deterministic, since glob order is arbitrary.
    fileNames = sorted(glob.glob(folder_path + "*" + keyword + "*" + ".csv"))
    # newline='' lets the csv writer control line endings, which avoids
    # doubled newlines (blank rows) in the output on Windows.
    with open(folder_path + "Combined_csv.csv", "w", newline='') as fout:
        print('Combining multiple csv files into 1')
        csv_write_file = csv.writer(fout, delimiter=',')
        with open(fileNames[0], mode='rt', newline='') as read_file:
            csv_read_file = csv.reader(read_file, delimiter=',')  # yields one list per row
            csv_write_file.writerows(csv_read_file)  # header + data of the first file
        for num in range(1, len(fileNames)):
            with open(fileNames[num], mode='rt', newline='') as read_file:
                csv_read_file = csv.reader(read_file, delimiter=',')
                next(csv_read_file)  # ignore header
                csv_write_file.writerows(csv_read_file)

Related

How to combine CSV files without using pandas

I have 3 CSV files that I need to merge.
All 3 files have the same first three columns (like name, age, sex), but the other columns differ in each file.
I am new to Python and need assistance with this. I can follow any code that is written. Thanks.
I have tried some code, but it is not working.
file 1
firstname,secondname,age,address,postcode,height
gdsd,gas,uugd,gusa,uuh,hhuuw
kms,kkoil,jjka,kja,kaja,loj
iiow,uiuw,iue,oijw,uow,oiujw
ujis,oiiw,ywuq,sax,cxv,ywf
file 2
firstname,secondname,age,home-town,spousename,marital_staus
gdsd,gas,uugd,vbs,owu,nsvc
kms,kkoil,jjka,kja,kaja,loj
iiow,uiuw,iue,xxfaf,owuq,pler
ujis,oiiw,ywuq,gfhd,lzac,oqq
file 3
firstname,secondname,age,drive,educated,
gdsd,gas,uugd,no,yes
kms,kkoil,jjka,no,no
iiow,uiuw,iue,yes,no
ujis,oiiw,ywuq,yes,yes
desired result
firstname,secondname,age,hometown,spousename,marital_status,adress,post_code,height,drive,educated
note that firstname,secondname,age is the same across the 3 tables
I need valid codes please
Here's a generic solution for concatenating CSV files that have heterogeneous headers with Python.
What you need to do first is to read the header of each CSV file for determining the "unique" field names.
Then, you just have to read each input record and output it while transforming it to match the new header (which is the unique fields).
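#!/usr/bin/env python3
import csv

paths = ['file1.csv', 'file2.csv', 'file3.csv']

# First pass: collect the unique field names, keeping first-seen order
# (a set would make the output column order arbitrary).
fieldnames = []
for p in paths:
    with open(p, 'r', newline='') as f:
        reader = csv.reader(f)
        for name in next(reader):
            if name not in fieldnames:
                fieldnames.append(name)

# Second pass: write every record under the combined header.
with open('combined.csv', 'w', newline='') as o:
    writer = csv.DictWriter(o, fieldnames=fieldnames)
    writer.writeheader()
    for p in paths:
        with open(p, 'r', newline='') as f:
            reader = csv.DictReader(f)
            writer.writerows(reader)
Remark: the files are opened twice, so this won't work for inputs that are streams (e.g. sys.stdin).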
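To illustrate what "transforming it to match the new header" means in practice, here is a small self-contained sketch of how csv.DictWriter fills in the columns a record does not have. The three field names are taken from the question's files; the io.StringIO buffer is only a stand-in for a real output file:

```python
import csv
import io

out = io.StringIO()  # stand-in for open('combined.csv', 'w', newline='')
writer = csv.DictWriter(out, fieldnames=['firstname', 'height', 'drive'], restval='')
writer.writeheader()
writer.writerow({'firstname': 'gdsd', 'height': 'hhuuw'})  # record without 'drive'
writer.writerow({'firstname': 'gdsd', 'drive': 'no'})      # record without 'height'

print(out.getvalue())
# firstname,height,drive
# gdsd,hhuuw,
# gdsd,,no
```

Each input row stays a separate output row, so this approach concatenates the files. Joining rows that share firstname, secondname and age into one record would need an extra step that groups the rows by those three columns.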

Append to a CSV file which does not end with newline

Suppose I have sample data in an Excel document:
header1      header2    header3
some data    testing    123
moar data    hello!     456
I export this data to csv format with Excel, with File > Save as > .csv
This is my data sample.csv:
$ cat sample.csv
header1,header2,header3
some data,testing,123
moar data,hello!,456%
Note that Excel apparently does not add a newline at the end, by default -- this is indicated by % at the end.
Now let's say I want to append a row(s) to the CSV file. I can use csv module to do that:
import csv

def append_to_csv_file(file: str, row: dict, encoding=None) -> None:
    # open file for reading and writing
    with open(file, 'a+', newline='', encoding=encoding) as out_file:
        # retrieve field names (CSV file headers)
        reader = csv.reader(out_file)
        out_file.seek(0)
        field_names = next(reader, None)
        # add new row to the CSV file
        writer = csv.DictWriter(out_file, field_names)
        writer.writerow(row)

row = {'header1': 'new data', 'header2': 'blah', 'header3': 789}
append_to_csv_file('sample.csv', row)
Now a newline is added after the new row, but the problem is that the row is appended to the end of the last existing line rather than on a separate line:
$ cat sample.csv
header1,header2,header3
some data,testing,123
moar data,hello!,456new data,blah,789
This causes issue when I want to read back the updated data from the file:
with open('sample.csv', newline='') as f:
    print(list(csv.DictReader(f)))
# [{..., 'header3': '456new data', None: ['blah', '789']}]
Question: what is the best way to handle a CSV file that might not end with a newline when appending one or more rows to it?
Current attempt
This is my workaround for appending to a CSV file that may not end with a newline character:
import csv

def append_to_csv_file(file: str, row: dict, encoding=None) -> None:
    with open(file, 'a+', newline='', encoding=encoding,) as out_file:
        # get current file position
        pos = out_file.tell()
        print('pos:', pos)
        # seek to one character back
        out_file.seek(pos - 1)
        # read in last character
        c = out_file.read(1)
        print(out_file.tell(), repr(c))
        if c != '\n':
            delta = out_file.write('\n')
            pos += delta
            print('new_pos:', pos)
        # retrieve field names (CSV file headers)
        reader = csv.reader(out_file)
        out_file.seek(0)
        field_names = next(reader, None)
        # add new row to the CSV file
        writer = csv.DictWriter(out_file, field_names)
        # out_file.seek(pos + 1)
        writer.writerow(row)

row = {'header1': 'new data', 'header2': 'blah', 'header3': 789}
append_to_csv_file('sample.csv', row)
This is output from running the script:
pos: 68
68 '6'
new_pos: 69
The contents of CSV file now look as expected:
$ cat sample.csv
header1,header2,header3
some data,testing,123
moar data,hello!,456
new data,blah,789
I am wondering if anyone knows of an easier way to do this. I feel like I might be overthinking this a bit. I basically want to account for cases where CSV file might need a newline added to end, before a new row is appended to end of file.
If it helps, I am running this on a Mac OS environment.
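One possibly simpler arrangement is to split the job in two: first check the last byte of the file in binary mode and add a newline if it is missing, then append the row in a separate step. The sketch below is only one way to do it, not a definitive answer; the helper names ensure_trailing_newline and append_row are made up for illustration, and it assumes the file already exists and has a header row:

```python
import csv
import os

def ensure_trailing_newline(path):
    """Add b'\\n' to the file if it is non-empty and does not end with one."""
    with open(path, 'rb+') as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return  # empty file: nothing to terminate
        f.seek(-1, os.SEEK_END)
        if f.read(1) != b'\n':
            f.seek(0, os.SEEK_END)
            f.write(b'\n')

def append_row(path, row, encoding=None):
    """Append `row` (a dict) using the header already present in the file."""
    ensure_trailing_newline(path)
    # Read the header in its own step so the append handle is write-only.
    with open(path, newline='', encoding=encoding) as f:
        field_names = next(csv.reader(f), None)
    with open(path, 'a', newline='', encoding=encoding) as f:
        csv.DictWriter(f, field_names).writerow(row)
```

Calling append_row('sample.csv', {'header1': 'new data', 'header2': 'blah', 'header3': 789}) would then put the new row on its own line. Working on raw bytes avoids seeking by character offsets in a text-mode file.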

Prints to console. Now I want to print to CSV file

I can read a text file with names and print in ascending order to console. I simply want to write the sorted names to a column in a CSV file. Can't I take the printed(file) and send to CSV?
Thanks!
import csv
with open('/users/h/documents/pyprojects/boy-names.txt','r') as file:
    for file in sorted(file):
        print(file, end='')
#the following isn't working.
with open('/users/h/documents/pyprojects/boy-names.csv', 'w', newline='') as csvFile:
    names = ['Column1']
    writer = csv.writer(names)
    print(file)
You can do something like this:
import csv
with open('boy-names.txt', 'rt') as file, open('boy-names.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(['Column1'])
    for boy_name in sorted(file.readlines()):
        boy_name = boy_name.rstrip('\n')
        print(boy_name)
        csv_writer.writerow([boy_name])
This is covered in the documentation.
The only tricky part is converting the lines from the file to a list of 1-element lists.
import csv
with open('/users/h/documents/pyprojects/boy-names.txt','r') as file:
    names = [[k.strip()] for k in sorted(file.readlines())]
with open('/users/h/documents/pyprojects/boy-names.csv', 'w', newline='') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerow(['Column1'])
    writer.writerows(names)
So, names will contain (for example):
[['Able'],['Baker'],['Charlie'],['Delta']]
The CSV writer expects to write a row or a set of rows. EACH ROW has to be a list (or tuple). That's why I built names the way I did. When calling writerows, the outer list contains the set of rows to be written, and each element of the outer list is a row. I want each row to contain one item, so each is a one-element list.
If I had created this:
['Able','Baker','Charlie','Delta']
then writerows would have treated each string as a sequence, resulting in a CSV file like this:
A,b,l,e
B,a,k,e,r
C,h,a,r,l,i,e
D,e,l,t,a
which is amusing but not very useful. And I know that because I did it while I was creating your answer.
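The difference can be reproduced in a few lines. This is a small sketch that writes to in-memory io.StringIO buffers instead of real files, using two of the example names from above:

```python
import csv
import io

names = ['Able', 'Baker']

wrong = io.StringIO()
csv.writer(wrong).writerows(names)  # each string is treated as a sequence of characters

right = io.StringIO()
csv.writer(right).writerows([[n] for n in names])  # each row is a one-element list

print(repr(wrong.getvalue()))  # 'A,b,l,e\r\nB,a,k,e,r\r\n'
print(repr(right.getvalue()))  # 'Able\r\nBaker\r\n'
```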

Extract strings from text file and write it to Excel

I am trying to write a Python script to extract text from a text file and write it to an Excel file.
The problem is that I do not know how to extract the strings next to the equals signs.
I am new to Python; at this stage I have only managed to open the file.
The data looks like below:
ADD IUACCAREALST: AREA=RA, MCC="510", MNC="28", LAC="0x020a", RAC="0x68", RACRANGE="0x73", SUBRANGE=SPECIAL_IMSI_RANGE, BEGIMSI="511100001243", ENDIMSI="53110100270380", CTRLTYPE=REJECT, CAUSE=ROAMING_NOT_ALLOWED_IN_LA;
ADD IUACCAREALST: AREA=RA, MCC="510", MNC="28", LAC="0x01Fa", RAC="0x67", RACRANGE="0x63", SUBRANGE=SPECIAL_IMSI_RANGE, BEGIMSI="", ENDIMSI="", CTRLTYPE=REJECT, CAUSE=ROAMING_NOT_ALLOWED_IN_LA;
Output should be like below:
#!/usr/bin/python
import csv
import re
fieldnames = ['AREA', 'MCC', 'MNC']
re_fields = re.compile(r'({})\s+=\s(.*)'.format('|'.join(fieldnames)), re.I)
with open('input.txt') as f_input, open('output.csv', 'wb') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames= fieldnames)
    csv_output.writeheader()
Your corrected pattern is HERE
I would break text like that into BLOCKS and then find the matches in each block:
import csv
import re
fn = 'input.txt'  # path to the text file
fieldnames = ['AREA', 'MCC', 'MNC']
re_fields = re.compile(r'({})\s*=\s*([^,]+),'.format('|'.join(fieldnames)))
with open(fn) as f_input:
    data = f_input.read()
for block in re.finditer(r'(?<=ADD IUACCAREALST:)(.*?)(?=ADD IUACCAREALST:|\Z)', data, flags=re.S | re.M):
    print(re_fields.findall(block.group(1)))
Prints:
[('AREA', 'RA'), ('MCC', '"510"'), ('MNC', '"28"')]
[('AREA', 'RA'), ('MCC', '"510"'), ('MNC', '"28"')]
At that point, use each list of tuples to create a dict forming that csv record; write it to the csv file. Done!
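That last step could look roughly like the sketch below. It reuses the two regular expressions from above; the sample text is a shortened version of the question's data, stripping the double quotes from the values is my own assumption about the desired output, and io.StringIO stands in for the real output file:

```python
import csv
import io
import re

fieldnames = ['AREA', 'MCC', 'MNC']
re_fields = re.compile(r'({})\s*=\s*([^,]+),'.format('|'.join(fieldnames)))
re_block = re.compile(r'(?<=ADD IUACCAREALST:)(.*?)(?=ADD IUACCAREALST:|\Z)', re.S)

# Shortened sample of the input text (most parameters omitted).
data = (
    'ADD IUACCAREALST: AREA=RA, MCC="510", MNC="28", LAC="0x020a", CTRLTYPE=REJECT;\n'
    'ADD IUACCAREALST: AREA=RA, MCC="510", MNC="28", LAC="0x01Fa", CTRLTYPE=REJECT;\n'
)

records = []
for block in re_block.finditer(data):
    pairs = re_fields.findall(block.group(1))
    # Turn the list of (name, value) tuples into one dict per record.
    records.append({key: value.strip('"') for key, value in pairs})

out = io.StringIO()  # stand-in for open('output.csv', 'w', newline='')
writer = csv.DictWriter(out, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(records)
print(out.getvalue())
```

The resulting CSV file can be opened directly in Excel.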

Python - Reading the contents of csv in python and appending it

import csv
with open("somecities.csv") as f:
    reader = csv.DictReader(f)
    data = [r for r in reader]
Contents of somecities.csv:
Country,Capital,CountryPop,AreaSqKm
Canada,Ottawa,35151728,9984670
USA,Washington DC,323127513,9833520
Japan,Tokyo,126740000,377972
Luxembourg,Luxembourg City,576249,2586
New to python and I'm trying to read and append a csv file. I've spent some time experimenting with some responses to similar questions with no luck--which is why I believe the code above to be pretty useless.
What I am essentially trying to achieve is to store each row from the CSV in memory using a dictionary, with the country names as keys, and values being tuples containing the other information in the table in the sequence they are in within the CSV file.
And from there I am trying to add three more cities to the csv(Country, Capital, CountryPop, AreaSqKm) and view the updated csv. How should I go about doing all of this?
The desired additions to the updated csv are:
Brazil, Brasília, 211224219, 8358140
China, Beijing, 1403500365, 9388211
Belgium, Brussels, 11250000, 30528
EDIT:
import csv
with open("somecities.csv", "r") as csvinput:
    with open("somecities_update.csv", "w") as csvresult:
        writer = csv.writer(csvresult, lineterminator='\n')
        reader = csv.reader(csvinput)
        all = []
        headers = next(reader)
        for row in reader:
            all.append(row)
        # Now we write to the new file
        writer.writerow(headers)
        for record in all:
            writer.writerow(record)
        #row.append(Brazil, Brasília, 211224219, 8358140)
        #row.append(China, Beijing, 1403500365, 9388211)
        #row.append(Belgium, Brussels, 11250000, 30528)
So assuming you can use pandas for this I would go about it this way:
import pandas as pd
df1 = pd.read_csv('your_initial_file.csv', index_col='Country')
df2 = pd.read_csv('your_second_file.csv', index_col='Country')
dfs = [df1,df2]
final_df = pd.concat(dfs)
DictReader will only represent each row as a dictionary, eg:
{
    "Country": "Canada",
    ...,
    "AreaSqKm": "9984670"
}
If you want to store the whole CSV as a dictionary you'll have to create your own:
import csv
all_data = {}
with open("somecities.csv", "r", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Key = country, values = tuple containing the rest of the data.
        all_data[row["Country"]] = (row["Capital"], row["CountryPop"], row["AreaSqKm"])
# Add the new cities to the dictionary here...
# To write the data to a new CSV
with open("newcities.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Write the header row first; the dictionary holds only the data rows.
    writer.writerow(["Country", "Capital", "CountryPop", "AreaSqKm"])
    for key, values in all_data.items():
        writer.writerow([key] + list(values))
As others have said, though, the pandas library could be a good choice. Check out its read_csv and to_csv functions.
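For the step marked "Add the new cities to the dictionary here...", a minimal sketch could look like this. The new rows are the ones listed in the question; io.StringIO objects stand in for somecities.csv (shortened to two rows) and for the output file, so the example runs on its own:

```python
import csv
import io

# In-memory stand-in for somecities.csv (two of the original rows).
source = io.StringIO(
    "Country,Capital,CountryPop,AreaSqKm\n"
    "Canada,Ottawa,35151728,9984670\n"
    "Japan,Tokyo,126740000,377972\n"
)
reader = csv.DictReader(source)
header = reader.fieldnames

# Key = country, value = tuple with the remaining columns in file order.
all_data = {}
for row in reader:
    all_data[row["Country"]] = tuple(row[name] for name in header[1:])

# The additions requested in the question.
all_data["Brazil"] = ("Brasília", "211224219", "8358140")
all_data["China"] = ("Beijing", "1403500365", "9388211")
all_data["Belgium"] = ("Brussels", "11250000", "30528")

out = io.StringIO()  # stand-in for open("newcities.csv", "w", newline="")
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
for country, values in all_data.items():
    writer.writerow([country, *values])
print(out.getvalue())
```

Because dictionaries keep insertion order in Python 3.7 and later, the new countries appear after the existing ones in the output.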
Just another idea: collect the rows in a list and append the new values to that list, as below (not tested):
import csv
new_rows = [
    ['Brazil', 'Brasília', 211224219, 8358140],
    ['China', 'Beijing', 1403500365, 9388211],
]
with open("somecities.csv", "r", newline='') as csvinput:
    with open("result.csv", "w", newline='') as csvresult:
        writer = csv.writer(csvresult, lineterminator='\n')
        reader = csv.reader(csvinput)
        all_rows = [next(reader)]   # header row
        all_rows.extend(reader)     # existing rows
        all_rows.extend(new_rows)   # rows to add
        writer.writerows(all_rows)
The simplest form I see, tested in Python 3.6:
Opening a file with the 'a' mode allows you to append to the end of the file instead of overwriting the existing content. Try that.
>>> with open("somecities.csv", "a") as fd:
...     fd.write("Brazil,Brasília,211224219,8358140\n")
OR
#!/usr/bin/python3.6
import csv
fields = ['Brazil', 'Brasília', '211224219', '8358140']
with open(r'somecities.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(fields)
