Two CSV files with data in different orders (No Pandas) - python

I have two csv files simulating patient data that I need to read in and compare.
Without using Pandas, I need to sort the second file by Subject_ID and append the sex of the patient to the first csv file. I don't know where to start without using Pandas. Any ideas?
So far my plan is to somehow work with a dictionary to try to re-group the second file.
import csv

with open('Patient_Sex.csv', 'r') as file_sex, open('Patient_FBG.csv', 'r') as file_fbg:
    patient_reader = csv.DictReader(file_sex)
    fbg_reader = csv.DictReader(file_fbg)
After this, it gets really muddy for me.

I think this is what you are looking for, assuming you are working with .csv files based on the data that you posted.
Basically you can parse each file into a list of dictionaries with csv.DictReader, and then you can manipulate the rows easily.
import csv

gender_data = []
full_data = []

with open("stack/new.csv", encoding="utf-8") as csvf:
    csvReader = csv.DictReader(csvf)
    for row in csvReader:
        gender_data.append(row)

with open("stack/info.csv", encoding="utf-8") as csvf:
    csvReader = csv.DictReader(csvf)
    for row in csvReader:
        full_data.append(row)

for x in gender_data:
    for y in full_data:
        if x["SUBJECT_ID"] == y["SUBJECT_ID"]:
            y["SEX"] = x["SEX"]

with open("stack/test.csv", "w", newline="") as outf:
    f = csv.writer(outf)
    f.writerow(["SUBJECT_ID", "YEAR_1", "YEAR_2", "YEAR_3", "SEX"])
    for x in full_data:
        f.writerow(
            [
                x["SUBJECT_ID"],
                x["YEAR_1"],
                x["YEAR_2"],
                x["YEAR_3"],
                x["SEX"] if "SEX" in x else "",
            ]
        )
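The nested loop above does O(n·m) comparisons. Keying the sex rows by SUBJECT_ID gives O(1) lookups instead; here is a minimal sketch of that idea with inline stand-in data (the real code would read the two files):

```python
import csv
import io

# Inline sample data standing in for the two files (hypothetical values)
sex_csv = "SUBJECT_ID,SEX\n1001,M\n1002,F\n"
fbg_csv = ("SUBJECT_ID,YEAR_1,YEAR_2,YEAR_3\n"
           "1001,100,105,110\n1002,90,95,99\n1003,120,118,116\n")

# One pass over the sex file builds a SUBJECT_ID -> SEX lookup table
sex_by_id = {row["SUBJECT_ID"]: row["SEX"]
             for row in csv.DictReader(io.StringIO(sex_csv))}

full_data = list(csv.DictReader(io.StringIO(fbg_csv)))
for row in full_data:
    # .get() leaves SEX empty when a subject has no match
    row["SEX"] = sex_by_id.get(row["SUBJECT_ID"], "")

print([r["SEX"] for r in full_data])  # ['M', 'F', '']
```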

You can do this without importing any modules by reading the csv files as lists of lines, appending the sex to each line of the main file when the Subject_ID matches:
with open('test1.csv') as csvfile:
    main_csv = [i.rstrip() for i in csvfile.readlines()]

with open('test2.csv') as csvfile:
    second_csv = [i.rstrip() for i in csvfile.readlines()]

for n, i in enumerate(main_csv):
    if n == 0:
        main_csv[n] = main_csv[n] + ',SEX'
    else:
        patient = i.split(',')[0]
        hits = [line.split(',')[-1] for line in second_csv if line.startswith(patient)]
        if hits:
            main_csv[n] = main_csv[n] + ',' + hits[0]
        else:
            main_csv[n] = main_csv[n] + ','

with open('test.csv', 'w') as f:
    f.write('\n'.join(main_csv))
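One caveat with this plain-string approach: splitting on ',' breaks when a field itself contains a quoted comma, a case the csv module handles correctly. A quick illustration:

```python
import csv
import io

line = '"Doe, Jane",F'  # a quoted field containing a comma

naive = line.split(',')                        # naive split breaks the name apart
proper = next(csv.reader(io.StringIO(line)))   # csv respects the quotes

print(naive)   # ['"Doe', ' Jane"', 'F']
print(proper)  # ['Doe, Jane', 'F']
```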

Related

Create multiple files from unique values of a column using inbuilt libraries of python

I started learning Python and was wondering if there is a way to create multiple files from the unique values of a column. I know there are hundreds of ways of getting it done through pandas, but I am looking to have it done through inbuilt libraries. I couldn't find a single example where it's done through inbuilt libraries.
Here is the sample csv file data:
uniquevalue|count
a|123
b|345
c|567
d|789
a|123
b|345
c|567
Sample output file:
a.csv
uniquevalue|count
a|123
a|123
b.csv
b|345
b|345
I am struggling with looping over the unique values in a column and then writing them out. Can someone explain the logic of how to do it? That would be much appreciated. Thanks.
import csv
from collections import defaultdict

header = []
data = defaultdict(list)
DELIMITER = "|"

with open("inputfile.csv", newline="") as csvfile:
    reader = csv.reader(csvfile, delimiter=DELIMITER)
    for i, row in enumerate(reader):
        if i == 0:
            header = row
        else:
            key = row[0]
            data[key].append(row)

for key, value in data.items():
    filename = f"{key}.csv"
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f, delimiter=DELIMITER)
        rows = [header] + value
        writer.writerows(rows)
import csv

with open('sample.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='|')
    next(reader)  # skip the header row
    for row in reader:
        # append mode so rows sharing a key accumulate in the same file
        with open(f"{row[0]}.csv", 'a', newline='') as inner:
            writer = csv.writer(inner, delimiter='|')
            writer.writerow(row)
The task can also be done without the csv module. The lines of the file are read, and read_file.read().splitlines()[1:] strips off the newline characters while skipping the header line. A set gives a unique collection of the input lines, which is used both to count the duplicates and to create the output files.
with open("unique_sample.csv", "r") as read_file:
    items = read_file.read().splitlines()[1:]

for line in set(items):
    with open(line[:line.index('|')] + '.csv', 'w') as output:
        output.write((line + '\n') * items.count(line))
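Note that items.count(line) inside the loop makes the counting quadratic; collections.Counter gets all the multiplicities in a single pass. A small sketch with inline stand-in data:

```python
from collections import Counter

# Inline stand-in for the file contents of unique_sample.csv
data = "uniquevalue|count\na|123\nb|345\na|123\n"
items = data.splitlines()[1:]  # drop the header line

# One pass over the data, instead of one items.count() per unique line
counts = Counter(items)
print(counts["a|123"])  # 2
```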

How to filter columns within a .CSV file and then save those filtered columns to a new .CSV file in Python?

I am analyzing a large weather data file, Data.csv. I need to write a program in Python that will filter the Data.csv file and keep the following columns only: STATION, NAME/LOCATION, DATE, AWND, SNOW. Then save the filtered file and name it filteredData.csv.
I am using Python 3.8. I have only been able to somewhat figure out how to filter the columns I need within a print function. How do I filter this file and then save the filtered file?
import csv

filename = 'Data.csv'
with open(filename, 'rt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print(row[0] + "," + row[1] + "," + row[2] + "," + row[3] + "," + row[4] + "," + row[13])
A small section of the Data.csv file
This can be done quickly using Pandas:
import pandas as pd

weather_data = pd.read_csv('Data.csv')
filtered_weather = weather_data[['Column_1', 'Column_2']]  # select the column names that you want
filtered_weather.to_csv('new_file', index=False)
If you're running this under Windows you can simply run the code you already wrote with "> newfile.csv" at the end of the command to redirect the output into a text file.
If you want to do it within the code though:
import csv

new_filename = 'Reduced_Data.csv'
filename = 'Data.csv'

with open(filename, 'rt') as f, open(new_filename, 'w') as output:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        output.write('"{}","{}","{}","{}","{}","{}"\n'.format(
            row[0], row[1], row[2], row[3], row[4], row[13]))
Check out the CSV reader and this example. You can do something like:
import csv

content = []
with open('Data.csv', 'r') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        content.append(row)

print(content)

# now writing them to a file:
with open('filteredData.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['STATION', 'NAME LOCATION', 'DATE', 'AWND', 'SNOW'])
    for i in range(1, len(content)):
        # I left out some columns, so they will not be in the file later
        writer.writerow([content[i][0], content[i][1], content[i][2], content[i][3], content[i][13]])
But honestly, I would go with the Pandas approach, even though that amounts to little more than copy & paste.
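For reference, here is a stdlib-only sketch that selects the columns by name rather than by index, using csv.DictWriter with extrasaction='ignore' so unwanted columns are dropped automatically (inline data stands in for Data.csv; the real code would open Data.csv and write filteredData.csv):

```python
import csv
import io

# Inline stand-in for Data.csv, with one extra column "X" to be filtered out
data = ("STATION,NAME,DATE,AWND,SNOW,X\n"
        "US1,Airport,2020-01-01,4.5,0.0,9\n")

keep = ["STATION", "NAME", "DATE", "AWND", "SNOW"]

reader = csv.DictReader(io.StringIO(data))
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=keep, extrasaction="ignore")
writer.writeheader()
writer.writerows(reader)  # columns not listed in `keep` are dropped

print(out.getvalue())
```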

Extract two columns sorted from CSV

I have a large csv file, containing multiple values, in the form
Date,Dslam_Name,Card,Port,Ani,DownStream,UpStream,Status
2020-01-03 07:10:01,aart-m1-m1,204,57,302xxxxxxxxx,0,0,down
I want to extract the Dslam_Name and Ani values, sort them by Dslam_Name, and write them to a new csv in two different columns.
So far my code is as follows:
import csv
import operator

with open('bad_voice_ports.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    sortedlist = sorted(readCSV, key=operator.itemgetter(1))
    for row in sortedlist:
        bad_port = row[1][:4], row[4][2:]
        print(bad_port)
        f = open("bad_voice_portsnew20200103SORTED.csv", "a+")
        f.write(row[1][:4] + " " + row[4][2:] + '\n')
        f.close()
But my Dslam_Name and Ani values are kept in the same column.
As a next step I would like to count how many times the same value appears in the 1st column.
You are forcing them to be a single column. Joining the two into a single string means Python no longer regards them as separate.
But try this instead:
import csv
import operator

with open('bad_voice_ports.csv') as readfile, open('bad_voice_portsnew20200103SORTED.csv', 'w') as writefile:
    readCSV = csv.reader(readfile)
    writeCSV = csv.writer(writefile)
    for row in sorted(readCSV, key=operator.itemgetter(1)):
        bad_port = row[1][:4], row[4][2:]
        print(bad_port)
        writeCSV.writerow(bad_port)
If you want to include the number of times each key occurred, you can easily include that in the program, too. I would refactor slightly to separate the reading and the writing.
import csv
import operator
from collections import Counter

with open('bad_voice_ports.csv') as readfile:
    readCSV = csv.reader(readfile)
    rows = []
    counts = Counter()
    for row in readCSV:
        rows.append([row[1][:4], row[4][2:]])
        counts[row[1][:4]] += 1

with open('bad_voice_portsnew20200103SORTED.csv', 'w') as writefile:
    writeCSV = csv.writer(writefile)
    for row in sorted(rows):
        print(row)
        writeCSV.writerow([counts[row[0]]] + row)
I would recommend to remove the header line from the CSV file entirely; throwing away (or separating out and prepending back) the first line should be an easy change if you want to keep it.
(Also, hard-coding input and output file names is problematic; maybe have the program read them from sys.argv[1:] instead.)
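A minimal sketch of that suggestion, with hypothetical default names as a fallback when no arguments are given:

```python
import sys

def resolve_names(argv, default_in="bad_voice_ports.csv",
                  default_out="sorted_output.csv"):
    """Return (input, output) file names from command-line args, with fallbacks."""
    args = list(argv)
    infile = args[0] if len(args) > 0 else default_in
    outfile = args[1] if len(args) > 1 else default_out
    return infile, outfile

# In a real script: infile, outfile = resolve_names(sys.argv[1:])
print(resolve_names([]))  # ('bad_voice_ports.csv', 'sorted_output.csv')
```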
So my suggestion is fairly simple. As I stated in a previous comment, there is good documentation on CSV reading and writing in Python here: https://realpython.com/python-csv/
As an example, to read the columns you need from a csv you can simply do this:
>>> import csv
>>> file = open('some.csv', mode='r')
>>> csv_reader = csv.DictReader(file)
>>> for line in csv_reader:
...     print(line["Dslam_Name"] + " " + line["Ani"])
...
This would return:
aart-m1-m1 302xxxxxxxxx
Now you can just as easily store the column values in a variable and later write them to a file, or open up a new file while reading lines and write the column values into it. I hope this helps you.
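Building on that, a small sketch that reads the two columns, sorts by Dslam_Name, and writes each pair as two separate CSV columns (inline data stands in for bad_voice_ports.csv):

```python
import csv
import io

# Inline stand-in for bad_voice_ports.csv (Ani values are hypothetical)
data = ("Date,Dslam_Name,Card,Port,Ani,DownStream,UpStream,Status\n"
        "2020-01-03 07:10:01,bbrt-m2-m2,204,57,302111,0,0,down\n"
        "2020-01-03 07:10:01,aart-m1-m1,204,57,302222,0,0,down\n")

reader = csv.DictReader(io.StringIO(data))
pairs = sorted((row["Dslam_Name"], row["Ani"]) for row in reader)

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Dslam_Name", "Ani"])  # header for the new file
writer.writerows(pairs)                 # each pair lands in two columns

print(pairs)  # [('aart-m1-m1', '302222'), ('bbrt-m2-m2', '302111')]
```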
After the help from @tripleee and @marxmacher, my final code is:
import csv
import operator
from collections import Counter

with open('bad_voice_ports.csv') as csv_file:
    readCSV = csv.reader(csv_file, delimiter=',')
    sortedlist = sorted(readCSV, key=operator.itemgetter(1))
    line_count = 0
    rows = []
    counts = Counter()
    for row in sortedlist:
        Dslam = row[1][:4]
        Ani = row[4][2:]
        if line_count == 0:
            print(row[1], row[4])
            line_count += 1
        else:
            rows.append([Dslam, Ani])
            counts[Dslam] += 1
            print(Dslam, Ani)
            line_count += 1

with open("bad_voice_portsnew202001061917.xls", "a+") as f:
    for row in sorted(rows):
        f.write(row[0] + '\t' + row[1] + '\t' + str(counts[row[0]]) + '\n')
print('Total of Bad ports =', str(line_count-1))
This way the desired values/columns are extracted from the initial csv file, and a new xls file is generated with those values stored in different columns, the occurrences per key counted, and the total number of entries reported.
Thanks for all the help, please feel free for any improvement suggestions!
You can use sorted:
import csv

_h, *data = csv.reader(open('filename.csv'))
with open('new_csv.csv', 'w', newline='') as f:
    write = csv.writer(f)
    write.writerows([[_h[1], _h[4]],
                     *sorted([(i[1], i[4]) for i in data], key=lambda x: x[0])])

Appending a CSV file with a new column in python

I am trying to create a clean csv file by merging some of variables together from an old file and appending them to a new csv file.
I have no problem running the data the first time. I get the output I want but whenever I try to append the data with a new variable (i.e. new column) it appends the variable to the bottom and the output is wonky.
I have basically been running the same code for each variable, except changing the groupvariables variable to my desired variables and then using f2 = open('outputfile.csv', "ab") <--- but with "ab" for append. Any help would be appreciated.
groupvariables = ['x', 'y']
f2 = open('outputfile.csv', "wb")
writer = csv.writer(f2, delimiter=",")
writer.writerow(("ID", "Diagnosis"))
for line in csv_f:
    line = line.rstrip('\n')
    columns = line.split(",")
    tempname = columns[0]
    tempindvar = columns[1:]
    templist = []
    for j in groupvariables:
        tempvar = tempindvar[headers.index(j)]
        if tempvar != ".":
            templist.append(tempvar)
    newList = list(set(templist))
    if len(newList) > 1:
        output = 'nomatch'
    elif len(newList) == 0:
        output = "."
    else:
        output = newList[0]
    tempoutrow = (tempname, output)
    writer.writerow(tempoutrow)
f2.close()
CSV is a line-based file format, so the only way to add a column to an existing CSV file is to read it into memory and overwrite it entirely, adding the new column to each line.
If all you want to do is add lines, though, appending will work fine.
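A minimal sketch of that read-modify-rewrite cycle (in-memory buffers stand in for the real files, and the new column values are hypothetical):

```python
import csv
import io

# Existing file contents (stand-in for outputfile.csv)
old = "ID,Diagnosis\n1,flu\n2,cold\n"
new_column = {"1": "M", "2": "F"}  # hypothetical new values keyed by ID

rows = list(csv.reader(io.StringIO(old)))
rows[0].append("Sex")                       # extend the header row
for row in rows[1:]:
    row.append(new_column.get(row[0], ""))  # extend each data row

out = io.StringIO()
csv.writer(out).writerows(rows)             # rewrite everything, don't append
print(out.getvalue())
```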
Here is something that might help. I assumed the first field on each row in each csv file is a primary key for the record and can be used to match rows between the two files. The code below reads the records from one file, stores them in a dictionary, then reads the records from the other file, appends the values to the dictionary, and writes out a new file. You can adapt this example to better fit your actual problem.
import csv

# using python3
db = {}

reader = csv.reader(open('t1.csv', 'r'))
for row in reader:
    key, *values = row
    db[key] = ','.join(values)

reader = csv.reader(open('t2.csv', 'r'))
for row in reader:
    key, *values = row
    if key in db:
        db[key] = db[key] + ',' + ','.join(values)
    else:
        db[key] = ','.join(values)

with open('combo.csv', 'w') as writer:
    for key in sorted(db.keys()):
        writer.write(key + ',' + db[key] + '\n')

Use Python to split a CSV file with multiple headers

I have a CSV file that is being constantly appended. It has multiple headers and the only common thing among the headers is that the first column is always "NAME".
How do I split the single CSV file into separate CSV files, one for each header row?
here is a sample file:
"NAME","AGE","SEX","WEIGHT","CITY"
"Bob",20,"M",120,"New York"
"Peter",33,"M",220,"Toronto"
"Mary",43,"F",130,"Miami"
"NAME","COUNTRY","SPORT","NUMBER","SPORT","NUMBER"
"Larry","USA","Football",14,"Baseball",22
"Jenny","UK","Rugby",5,"Field Hockey",11
"Jacques","Canada","Hockey",19,"Volleyball",4
"NAME","DRINK","QTY"
"Jesse","Beer",6
"Wendel","Juice",1
"Angela","Milk",3
If the size of the csv file is not huge -- so it can all fit in memory at once -- just use read() to read the file into a string and then apply a regex to this string:
import re

with open(ur_csv) as f:
    data = f.read()

chunks = re.finditer(r'(^"NAME".*?)(?=^"NAME"|\Z)', data, re.S | re.M)
for i, chunk in enumerate(chunks, 1):
    with open('/path/{}.csv'.format(i), 'w') as fout:
        fout.write(chunk.group(1))
If the size of the file is a concern, you can use mmap to create something that looks like a big string but is not all in memory at the same time.
Then use the mmap string with a regex to separate the csv chunks like so:
import mmap
import re

# note: in Python 3 the file must be opened in binary mode and the
# pattern must be a bytes pattern for re to work over the mmap
with open(ur_csv, 'rb') as f:
    mf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunks = re.finditer(rb'(^"NAME".*?)(?=^"NAME"|\Z)', mf, re.S | re.M)
    for i, chunk in enumerate(chunks, 1):
        with open('/path/{}.csv'.format(i), 'wb') as fout:
            fout.write(chunk.group(1))
In either case, this will write all the chunks in files named 1.csv, 2.csv etc.
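The regex split can be demonstrated in-memory on the sample data from the question: each chunk starts at a line beginning with "NAME" and runs until the next such line (or the end of the string).

```python
import re

# Abbreviated version of the sample data from the question
data = ('"NAME","AGE"\n"Bob",20\n'
        '"NAME","DRINK"\n"Jesse","Beer"\n')

# Same pattern as above: anchor at ^"NAME", lazily consume until the
# lookahead sees the next ^"NAME" header or end of input (\Z)
chunks = [m.group(1) for m in
          re.finditer(r'(^"NAME".*?)(?=^"NAME"|\Z)', data, re.S | re.M)]

print(len(chunks))  # 2
```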
Copy the input to a new output file each time you see a header line. Something like this (not checked for errors):
partNum = 1
outHandle = None
for line in open("yourfile.csv", "r").readlines():
    if line.startswith('"NAME"'):
        if outHandle is not None:
            outHandle.close()
        outHandle = open("part%d.csv" % (partNum,), "w")
        partNum += 1
    outHandle.write(line)
outHandle.close()
The above will break if the input does not begin with a header line or if the input is empty.
You can use the python csv package to read your source file and write multiple csv files based on the rule that if element 0 in your row == "NAME", spawn off a new file. Something like this...
import csv

outfile_name = "out_%d.csv"
out_num = 1

with open('nameslist.csv', newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    csv_buffer = []
    for row in csvreader:
        if row[0] != "NAME":
            csv_buffer.append(row)
        else:
            if csv_buffer:
                with open(outfile_name % out_num, 'w', newline='') as csvout:
                    csv.writer(csvout).writerows(csv_buffer)
                out_num += 1
            csv_buffer = [row]
    if csv_buffer:  # flush the final chunk
        with open(outfile_name % out_num, 'w', newline='') as csvout:
            csv.writer(csvout).writerows(csv_buffer)
P.S. I haven't actually tested this but that's the general concept
Given the other answers, the only modification that I would suggest would be to open the file using csv.DictReader. Pseudocode would be like this, assuming that the first line in the file is the first header.
Note that this assumes there is no blank line or other indicator between the entries, so that a 'NAME' header occurs right after data. If there were a blank line between appended files, then you could use that as an indicator to read infile.fieldnames on the next row. If you need to handle the inputs as a list, then the previous answers are better.
ifile = open(filename, 'r', newline='')
infile = csv.DictReader(ifile)
infields = infile.fieldnames
filenum = 1
ofile = open('outfile' + str(filenum), 'w', newline='')
outfields = infields  # this allows you to change the header field
outfile = csv.DictWriter(ofile, fieldnames=outfields, extrasaction='ignore')
outfile.writerow(dict((fn, fn) for fn in outfields))
for row in infile:
    if row['NAME'] != 'NAME':
        # process this row here and do whatever is needed
        pass
    else:
        ofile.close()
        # build infields again from this row
        infields = [row["NAME"], ...]  # this assumes you know the names & order
        # a dict cannot be pulled as a list and keep the order that you want
        filenum += 1
        ofile = open('outfile' + str(filenum), 'w', newline='')
        outfields = infields  # this allows you to change the header field
        outfile = csv.DictWriter(ofile, fieldnames=outfields, extrasaction='ignore')
        outfile.writerow(dict((fn, fn) for fn in outfields))
# this is the end of the loop; all data has been read and processed
ofile.close()
ifile.close()
If the exact order of the new header does not matter except for the name in the first entry, then you can transfer the new list as follows:
infields = [row['NAME']]
for k in row.keys():
    if k != 'NAME':
        infields.append(row[k])
This will create the new header with NAME in entry 0 but the others will not be in any particular order.