Merging two CSV files where a common column matches - Python

I have a CSV of users and a CSV of virtual machines, and I need to merge the users into their VMs, but only where their IDs match.
All I'm getting, though, is a huge file containing everything.
import csv

file_names = ['vms.csv', 'users.csv']
o_data = []
for afile in file_names:
    file_h = open(afile)
    a_list = []
    a_list.append(afile)
    csv_reader = csv.reader(file_h, delimiter=';')
    for row in csv_reader:
        a_list.append(row[0])
    o_data.append((n for n in a_list))
    file_h.close()
with open('output.csv', 'w') as op_file:
    csv_writer = csv.writer(op_file, delimiter=';')
    for row in list(zip(*o_data)):
        csv_writer.writerow(row)
I'm relatively new to Python - am I missing something?

I've always found pandas really helpful for tasks like this. You can simply load the datasets into pandas DataFrames and then use the merge function, which by default joins them on the columns the two frames have in common:
import pandas
vms = pandas.read_csv('vms.csv')
users = pandas.read_csv('users.csv')
output = pandas.merge(vms, users)
output.to_csv('output.csv', index=False)
You can find the documentation for the different options at http://pandas.pydata.org/pandas-docs/stable/merging.html
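If the ID column doesn't have the same name in both files, you can spell the join out explicitly with left_on/right_on. A minimal sketch with made-up data (the column names user_id and id are assumptions, not taken from the question):

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files; the column names
# 'user_id' and 'id' are assumptions -- use whatever your files contain
vms = pd.DataFrame({'vm': ['vm1', 'vm2', 'vm3'], 'user_id': [1, 2, 4]})
users = pd.DataFrame({'id': [1, 2, 3], 'name': ['ann', 'bob', 'cid']})

# Inner join: keeps only the rows where the IDs match on both sides
output = pd.merge(vms, users, left_on='user_id', right_on='id')
print(output)
```

By default merge performs an inner join, so VMs whose ID has no matching user are dropped; pass how='left' if you want to keep them.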

Related

How to combine CSV files without using pandas

I have 3 CSV files that I need to merge.
All three files share the same first three columns (firstname, secondname, age), but the remaining columns differ from file to file.
I am new to Python and need assistance with this; I can follow any code written. Thanks.
I have tried some code, but it isn't working.
file 1
firstname,secondname,age,address,postcode,height
gdsd,gas,uugd,gusa,uuh,hhuuw
kms,kkoil,jjka,kja,kaja,loj
iiow,uiuw,iue,oijw,uow,oiujw
ujis,oiiw,ywuq,sax,cxv,ywf
file 2
firstname,secondname,age,home-town,spousename,marital_staus
gdsd,gas,uugd,vbs,owu,nsvc
kms,kkoil,jjka,kja,kaja,loj
iiow,uiuw,iue,xxfaf,owuq,pler
ujis,oiiw,ywuq,gfhd,lzac,oqq
file 3
firstname,secondname,age,drive,educated,
gdsd,gas,uugd,no,yes
kms,kkoil,jjka,no,no
iiow,uiuw,iue,yes,no
ujis,oiiw,ywuq,yes,yes
desired result
firstname,secondname,age,hometown,spousename,marital_status,adress,post_code,height,drive,educated
note that firstname, secondname and age are the same across the 3 tables
I need working code, please.
Here's a generic solution for concatenating CSV files that have heterogeneous headers with Python.
What you need to do first is read the header of each CSV file to determine the set of "unique" field names.
Then you just read each input record and write it out, transforming it to match the new header (the union of all fields).
#!/usr/bin/env python3
import csv

paths = ['file1.csv', 'file2.csv', 'file3.csv']

# Collect the union of all header fields, preserving first-seen order
# (a plain set would scramble the column order)
fieldnames = []
for p in paths:
    with open(p, newline='') as f:
        for name in next(csv.reader(f)):
            if name not in fieldnames:
                fieldnames.append(name)

with open('combined.csv', 'w', newline='') as o:
    writer = csv.DictWriter(o, fieldnames=fieldnames)
    writer.writeheader()
    for p in paths:
        with open(p, newline='') as f:
            writer.writerows(csv.DictReader(f))
Remark: each file is opened twice, so this won't work for inputs that are streams (e.g. sys.stdin).
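Since the desired result in this question is actually a join on the three shared columns rather than a simple concatenation, here is a pandas sketch of that variant (the helper name join_csvs is made up; the file names are the ones assumed above):

```python
import pandas as pd
from functools import reduce

def join_csvs(paths, keys):
    """Inner-join a list of CSV files on the given key columns."""
    frames = [pd.read_csv(p) for p in paths]
    # Fold merge over the list: ((f1 join f2) join f3) ...
    return reduce(lambda left, right: pd.merge(left, right, on=keys), frames)

# merged = join_csvs(['file1.csv', 'file2.csv', 'file3.csv'],
#                    keys=['firstname', 'secondname', 'age'])
# merged.to_csv('combined.csv', index=False)
```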

Removing rows from a dataset imported from CSV files with Python

I want to work with the data imported from CSV files. However, there are many lines of information that I don't need in the CSV files. Let's say data from the first three rows and all rows after 125 should be removed. How can I get this job done using Python? I have figured out how to remove the first three rows, but I am still having problems with the rest.
import csv

csv_file = open('Raman_060320.csv')
csv_reader = csv.reader(csv_file, delimiter='\t')
for skip in range(3):
    next(csv_reader)
for row in csv_reader:
    print(row)
csv_file.close()
I am from the field of hydrology and don't know very much about programming (I've just begun to learn), so I would appreciate all the help I can get.
As suggested by Damzaky, using pandas:
import pandas as pd

df = pd.read_csv('Raman_060320.csv')
# Keep data rows 4-125 (the slice end is exclusive)
df = df[3:125]
# Save back to csv
df.to_csv('Raman_060320.csv', index=False)
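If you'd rather stay with the csv module you started from, itertools.islice can do the same slicing without loading the whole file. A sketch, keeping your file name and tab delimiter (the helper name middle_rows is made up):

```python
import csv
from itertools import islice

def middle_rows(path, skip=3, last=125, delimiter='\t'):
    """Yield rows skip+1 through last (1-based), dropping the rest."""
    with open(path, newline='') as f:
        # islice skips the first `skip` rows and stops after row `last`
        yield from islice(csv.reader(f, delimiter=delimiter), skip, last)

# for row in middle_rows('Raman_060320.csv'):
#     print(row)
```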

Pandas picks wrong columns with df[[]]

I have a large csv file, 40+ columns, that I'm trying to sort using pandas, and I only want to write selected columns into a new file. Here's my code:
Edit: I was probably wrong to assume I'd done everything correctly up to the end, so here's the entire file: I read in 12 csv files, combine them into one, filter the rows so that they are unique in the way I need them to be, then I want to filter again, this time selecting just the few columns.
I am completely new to Python, so the code probably looks disgusting, and I assume that's where the issue is.
if __name__ == "__main__":
    files = ['airOT199701.csv', 'airOT199702.csv', 'airOT199703.csv', 'airOT199704.csv', 'airOT199705.csv', 'airOT199706.csv', 'airOT199707.csv', 'airOT199708.csv', 'airOT199709.csv', 'airOT199710.csv', 'airOT199711.csv', 'airOT199712.csv']
    with open('filterflights.csv', 'w') as outcsv:
        writer = csv.DictWriter(outcsv, fieldnames=["YEAR","MONTH","DAY_OF_MONTH","DAY_OF_WEEK","FL_DATE","UNIQUE_CARRIER","TAIL_NUM","FL_NUM","ORIGIN_AIRPORT_ID","ORIGIN","ORIGIN_STATE_ABR","DEST_AIRPORT_ID","DEST","DEST_STATE_ABR","CRS_DEP_TIME","DEP_TIME","DEP_DELAY","DEP_DELAY_NEW","DEP_DEL15","DEP_DELAY_GROUP","TAXI_OUT","WHEELS_OFF","WHEELS_ON","TAXI_IN","CRS_ARR_TIME","ARR_TIME","ARR_DELAY","ARR_DELAY_NEW","ARR_DEL15","ARR_DELAY_GROUP","CANCELLED","CANCELLATION_CODE","DIVERTED","CRS_ELAPSED_TIME","ACTUAL_ELAPSED_TIME","AIR_TIME","FLIGHTS","DISTANCE","DISTANCE_GROUP","CARRIER_DELAY","WEATHER_DELAY","NAS_DELAY","SECURITY_DELAY","LATE_AIRCRAFT_DELAY","DIFFERENCE"])
        writer.writeheader()
        filewriter = csv.writer(outcsv, delimiter=',')
        for i in range(len(files)):
            reader = csv.reader(open(files[i], 'r'), delimiter=',')
            next(reader, None)
            result = set()
            for r in reader:
                r.append(abs(int(r[8])-int(r[11]))%25)
                key = (r[7],r[8],r[11])
                if key not in result:
                    filewriter.writerow(r)
                    result.add(key)
    df = pd.read_csv('filterflights.csv')
    df.header(3)
    df = df[["FL_DATE","FL_NUM","ORIGIN_AIRPORT_ID","ORIGIN","ORIGIN_STATE_ABR", "DEST_AIRPORT_ID","DEST","DEST_STATE_ABR", "DEP_TIME", "ARR_TIME", "DISTANCE", "DIFFERENCE"]]
    df.header(3)
    df.to_csv('filteredflights.csv', index=False)
I get the error: AttributeError: 'DataFrame' object has no attribute 'header' in line 23. All csv files are in the same folder as the python file.
Possible issue: the original csv files do not have a DIFFERENCE column; can that cause the issue? I'm trying to append the value with r.append, but maybe it doesn't know what to append to?
You can use DataFrame.reindex() to subset the data frame while preserving the given column order:
col_subset = ["FL_DATE","FL_NUM","ORIGIN_AIRPORT_ID","ORIGIN","ORIGIN_STATE_ABR", "DEST_AIRPORT_ID","DEST","DEST_STATE_ABR", "DEP_TIME", "ARR_TIME", "DISTANCE", "DIFFERENCE"]
df = df.reindex(columns=col_subset)
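A tiny self-contained demonstration with made-up columns (not the flight data): reindex returns the columns in exactly the order you list, and fills any column that doesn't exist with NaN instead of raising a KeyError.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Columns come back in the requested order; 'missing' is filled with NaN
subset = df.reindex(columns=['c', 'a', 'missing'])
print(list(subset.columns))  # ['c', 'a', 'missing']
```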

Better way to parse CSV into list or array

Is there a better way to create a list or a numpy array from this csv file? What I'm asking is how to do it and parse the file more gracefully than I did in the code below.
fname = open("Computers discovered recently by discovery method.csv").readlines()
lst = [elt.strip().split(",")[8:] for elt in fname if elt != "\n"][4:]
lst2 = []
for row in lst:
    print(row)
    if row[0].startswith("SMZ-") or row[0].startswith("MTR-"):
        lst2.append(row)
print(*lst2, sep="\n")
You can always use Pandas. As an example,
import pandas as pd
import numpy as np
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv')
To work with the values numerically, convert the frame to your favorite numeric type. You can write the whole conversion in one line:
result = df.values.astype("float")
You can also do the following:
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
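For example, on purely numeric data genfromtxt returns a 2-D float array, and its skip_header option drops leading junk lines. A small sketch with inline data standing in for my_file.csv:

```python
import numpy as np
from io import StringIO

# Inline stand-in for my_file.csv: one junk line, then numeric rows
data = StringIO("junk line\n1,2,3\n4,5,6\n")
arr = np.genfromtxt(data, delimiter=',', skip_header=1)
print(arr.shape)  # (2, 3)
```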
You can use pandas and specify the header row to make it work correctly on your sample file:
import pandas as pd
df = pd.read_csv('Computers discovered recently by discovery method.csv', header=2)
You can check your content using:
>>> df.head()
You can check headers using
>>> df.columns
And to convert it to numpy array you can use
>>> np_arr = df.values
It comes with a lot of options to parse and read CSV files. For more information, please check the docs.
I am not sure what you want, but try this:
import csv

with open("Computers discovered recently by discovery method.csv", 'r') as f:
    reader = csv.reader(f)
    ll = list(reader)
print(ll)
This should read the CSV line by line and store it as a list of rows.
You should never parse CSV structures manually unless you want to tackle all possible exceptions and CSV format oddities. Python has you covered in that regard with its csv module.
The main problem, in your case, stems from your data - there seems to be two different CSV structures in a single file so you first need to find where your second structure begins. Plus, from your code, it seems you want to filter out all columns before Details_Table0_Netbios_Name0 and include only rows whose Details_Table0_Netbios_Name0 starts with SMZ- or MTR-. So something like:
import csv

with open("Computers discovered recently by discovery method.csv") as f:
    reader = csv.reader(f)  # create a CSV reader
    for row in reader:  # skip the lines until we encounter the second CSV structure/header
        if row and row[0] == "Header_Table0_Netbios_Name0":
            break
    index = row.index("Details_Table0_Netbios_Name0")  # find where your columns begin
    result = []  # storage for the rows we're interested in
    for row in reader:  # read the rest of the CSV row by row
        if row and row[index][:4] in {"SMZ-", "MTR-"}:  # only include these rows
            result.append(row[index:])  # trim and append to the `result` list

print(result[10])  # etc.
# ['MTR-PC0BXQE6-LB', 'PR2', 'anisita', 'VALUEADDCO', 'VALUEADDCO', 'Heartbeat Discovery',
#  '07.12.2017 17:47:51', '13']
should do the trick.
Sample Code
import csv

csv_file = 'sample.csv'
with open(csv_file) as fh:
    reader = csv.reader(fh)
    for row in reader:
        print(row)
sample.csv
name,age,salary
clado,20,25000
student,30,34000
sam,34,32000

Python: removing old records and duplicates between two CSV files

Total beginner with Python, need help to achieve a task!
I have two csv files. old.csv and new.csv
Both have the same structure (columns A to Z), and each record has a unique numeric identifier in column F (the sixth column). Between these two CSVs there are a few duplicate records.
I’m looking for a way to eliminate records that are also in the old.csv, from the new.csv and output to a new file that has the same structure, so the new output.csv has truly only the new records.
What's a good way to achieve this? I need to be able to run this on a windows machine through a command line.
Any help is appreciated! Thanks in advance!
Read each CSV file and map its rows to tuples:
import csv

with open('old.csv', newline='') as f:
    old_rows = [tuple(row) for row in csv.reader(f)]
with open('new.csv', newline='') as fn:
    new_rows = [tuple(row) for row in csv.reader(fn)]
Collect all unique identifiers from old.csv into a set (column F is index 5):
old_ids = {row[5] for row in old_rows}
Keep only the rows from new.csv whose identifier is not in old.csv:
unique = [row for row in new_rows if row[5] not in old_ids]
Write those rows to output.csv:
with open('output.csv', 'w', newline='') as output:
    wr = csv.writer(output)
    for row in unique:
        wr.writerow(row)
Hope this helps!
A simple approach would be:
collect all identifiers from old.csv in a set
loop through new.csv
if a record has an identifier that's not in your set, write it to output.csv
You will probably want to use the csv module for reading and writing the files.
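The three steps above can be sketched as follows (column F is index 5, as in the question; the function name filter_new is made up):

```python
import csv

def filter_new(old_path, new_path, out_path, key_col=5):
    """Write the rows of new_path whose key isn't present in old_path."""
    # Step 1: collect all identifiers from the old file in a set
    with open(old_path, newline='') as f:
        old_ids = {row[key_col] for row in csv.reader(f) if row}
    # Steps 2-3: loop through the new file, keep unseen identifiers
    with open(new_path, newline='') as f, \
         open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for row in csv.reader(f):
            if row and row[key_col] not in old_ids:
                writer.writerow(row)

# filter_new('old.csv', 'new.csv', 'output.csv')
```

Saved as a script, this runs from the Windows command line with plain `python script.py`.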
