I have a CSV file and a set of rules that have to be applied to it, producing a new CSV based on the rules.
So it could go two ways:
Add a new column, with its own header and data
Take an existing column and alter the data of that column.
This is what I have so far:
def applyRules(directory):
    FILES = []
    for f in listdir(OUTPUT_DIR):
        writer = csv.writer(open("%s%s" % (DZINE_DIR, f), "wb"))
        for rule in Substring.objects.filter(source_file=f):
            from_column = rule.from_column
            to_column = rule.to_column
            reader = csv.DictReader(open("%s%s" % (OUTPUT_DIR, f)))
            headers = reader.fieldnames
            for row in reader:
                if rule.get_rule_type_display() == "substring":
                    string = rule.string.split(",")
                    # alter value
                    row[to_column] = string[0] + row[from_column] + string[1]
                if rule.from_column == rule.to_column:
                    print rule.from_column
                else:
                    print rule.from_column
The rule has a FROM_COLUMN and a TO_COLUMN. If both are the same, the column stays the same, but its data must be updated by the rule, in this case just adding a string before and/or after the current value.
When TO_COLUMN is different, a new column is added containing the altered data as above.
So currently I'm just changing the values in the dict, but I'm not sure how to get it back out to the new CSV etc.
If you open the output file as a DictWriter() object, then you can write out your altered dictionaries quite easily. You do need to determine your extra fieldnames ahead of time:
with open(os.path.join(OUTPUT_DIR, f), 'rb') as rfile:
    reader = csv.DictReader(rfile)
    headers = reader.fieldnames
    rules = Substring.objects.filter(source_file=f).all()

    # pre-process the rules to determine the headers
    for rule in rules:
        from_column = rule.from_column
        to_column = rule.to_column
        if from_column not in headers:
            # problem; perhaps raise an error?
            pass
        if to_column not in headers:
            headers.append(to_column)

    with open(os.path.join(DZINE_DIR, f), "wb") as wfile:
        writer = csv.DictWriter(wfile, fieldnames=headers)
        writer.writeheader()
        for row in reader:
            for rule in rules:
                from_column = rule.from_column
                to_column = rule.to_column
                if rule.get_rule_type_display() == "substring":
                    string = rule.string.split(",")
                    row[to_column] = string[0] + row[from_column] + string[1]
            writer.writerow(row)
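As a self-contained illustration of the same round trip (the file names, column names, and the "greeting" rule are all invented for the example), reading with DictReader and writing the extended rows with DictWriter looks like this:

```python
import csv

# Hypothetical input: two columns, plus a rule that derives a new
# "greeting" column by wrapping the "name" column in a prefix/suffix.
with open('in.csv', 'w', newline='') as f:
    f.write('name,age\nAlice,30\nBob,25\n')

with open('in.csv', newline='') as rfile:
    reader = csv.DictReader(rfile)
    headers = reader.fieldnames + ['greeting']  # new column, known up front
    with open('out.csv', 'w', newline='') as wfile:
        writer = csv.DictWriter(wfile, fieldnames=headers)
        writer.writeheader()
        for row in reader:
            row['greeting'] = 'Hello ' + row['name'] + '!'
            writer.writerow(row)
```

The key point is that the full fieldnames list must be passed to DictWriter before any row is written, which is why the rules have to be pre-processed first.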
I have a text file containing data that I want to reorganize into a different shape. The file contains lines of values separated by semicolons, with no header. Some lines contain values that are children of other lines. I can distinguish them by a code (1 or 2) and by their order: a child value is always on the line after its parent value. The number of child elements differs from one line to another. Parent values can have no child values.
To be more explicit, here is a data sample:
030;001;1;AD0192;
030;001;2;AF5612;AF5613;AF5614
030;001;1;CD0124;
030;001;2;CD0846;CD0847;CD0848
030;002;1;EG0376;
030;002;2;EG0666;EG0667;EG0668;EG0669;EG0670;
030;003;1;ZB0001;
030;003;1;ZB0002;
030;003;1;ZB0003;
030;003;2;ZB0004;ZB0005
The structure is:
The first three characters are an id
The next three characters are also an id
The next one is a code: when 1, the value (named key in my example) is a parent; when 2, the values are children of the line before.
The values after are keys, parent or children.
I want to store child values (with a code 2) in a list on the same line as their parent value.
Here is an example with my data sample above and a header:
id1;id2;key;children;
030;001;AD0192;[AF5612,AF5613,AF5614]
030;001;CD0124;[CD0846,CD0847,CD0848]
030;002;EG0376;[EG0666,EG0667,EG0668,EG0669,EG0670]
030;003;ZB0001;
030;003;ZB0002;
030;003;ZB0003;[ZB0004,ZB0005]
I'm able to build a CSV-delimited file from this source text file, add a header, use a DictReader to manipulate my columns easily, and write conditions to identify my parent and child values.
But how do I store the hierarchical elements (with a code 2) in a list on the same line as their parent key?
Here is my current Python script:
import csv

inputTextFile = 'myOriginalFile.txt'
csvFile = 'myNewFile.csv'
countKey = 0
countKeyParent = 0
countKeyChildren = 0

with open(inputTextFile, 'r', encoding='utf-8') as inputFile:
    stripped = (line.strip() for line in inputFile)
    lines = (line.split(";") for line in stripped if line)
    # Write a CSV file
    with open(csvFile, 'w', newline='', encoding='utf-8') as outputCsvFile:
        writer = csv.writer(outputCsvFile, delimiter=';')
        writer.writerow(('id1', 'id2', 'code', 'key', 'children'))
        writer.writerows(lines)

# Read the CSV
with open(csvFile, 'r', newline='', encoding='utf-8') as myCsvFile:
    csvReader = csv.DictReader(myCsvFile, delimiter=';', quotechar='"')
    for row in csvReader:
        countKey += 1
        if '1' in row['code']:
            countKeyParent += 1
            print("Parent: " + row['key'])
        elif '2' in row['code']:
            countKeyChildren += 1
            print("Children: " + row['key'])

print(f"----\nSum of elements: {countKey}\nParents keys: {countKeyParent}\nChildren keys: {countKeyChildren}")
A simple solution might be the following. First the data is loaded as a list of rows, each a list of strings; then the hierarchy you've described is built, and the output is written to a CSV file.
from typing import List

ID_FIRST = 0
ID_SECOND = 1
PARENT_FIELD = 2
KEY_FIELD = 3
IS_PARENT = "1"
IS_CHILDREN = "2"


def read(where: str) -> List[List[str]]:
    with open(where) as fp:
        data = fp.readlines()
    rows = []
    for line in data:
        fields = line.strip().split(';')
        rows.append([fields[ID_FIRST],
                     fields[ID_SECOND],
                     fields[PARENT_FIELD],
                     *[item for item in fields[KEY_FIELD:]
                       if item != ""]])
    return rows


def assign_parents(rows: List[List[str]]):
    parent_idx = 0
    for idx, fields in enumerate(rows):
        if fields[PARENT_FIELD] == IS_PARENT:
            parent_idx = idx
        if fields[PARENT_FIELD] == IS_CHILDREN:
            rows[parent_idx] += fields[KEY_FIELD:]


def write(where: str, rows: List[List[str]]):
    with open(where, 'w') as file:
        file.write("id1;id2;key;children;\n")
        for fields in rows:
            if fields[PARENT_FIELD] == IS_CHILDREN:
                # These have been grouped into their parents.
                continue
            string = ";".join(fields[:PARENT_FIELD])
            string += ";" + fields[KEY_FIELD] + ";"
            if len(fields[KEY_FIELD + 1:]) != 0:  # has children?
                children = ",".join(fields[KEY_FIELD + 1:])
                string += "[" + children + "]"
            file.write(string + '\n')


def main():
    rows = read('myOriginalFile.txt')
    assign_parents(rows)
    write('myNewFile.csv', rows)


if __name__ == "__main__":
    main()
For your example I get
id1;id2;key;children;
030;001;AD0192;[AF5612,AF5613,AF5614]
030;001;CD0124;[CD0846,CD0847,CD0848]
030;002;EG0376;[EG0666,EG0667,EG0668,EG0669,EG0670]
030;003;ZB0001;
030;003;ZB0002;
030;003;ZB0003;[ZB0004,ZB0005]
which appears to be correct.
So I have a CSV file with over 1M records (https://i.imgur.com/rhIhy5u.png).
I need the data arranged differently, so that the repeated "params" become columns themselves, for example category1, category2, category3 (there are over 20 categories and no repeats), while all the data keeps its relations.
I tried using "pandas" and "csv" in Python, but I am completely new to this and have never worked with data like this before.
import csv

with open('./data.csv', 'r') as _filehandler:
    param = []
    csv_file_reader = csv.DictReader(_filehandler)
    for row in csv_file_reader:
        if not row['Param'] in param:
            param.append(row['Param'])

col = ""
for p in param:
    col += str(p) + '; '
print(col)

import numpy as np
np.savetxt('./SortedWexdord.csv', param, delimiter=';', fmt='%s')
I have tried to think it through, but data wrangling is not my forte. Any ideas?
Here's something that should work. If you need more than one value per row normalized like this, you could edit the line beginning `category, value = ...` to grab a list of values instead of just row[1].
import csv

data = {}
with open('data.csv', 'r') as file:
    reader = csv.reader(file)
    next(reader)  # Skip header row
    for row in reader:
        category, value = row[0], row[1]  # Assumes category is in column 0 and target value is in column 1
        if category in data:
            data[category].append(value)
        else:
            data[category] = [value]  # New entry only for each unique category

with open('output.csv', 'w', newline='') as file:  # newline='' avoids double newlines on Windows
    writer = csv.writer(file)
    writer.writerow(['Category', 'Value'])
    for category in data:
        print([category] + data[category])
        writer.writerow([category] + data[category])  # A list starting with the category, then each value
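The same grouping can also be sketched with collections.defaultdict, which removes the membership test. The file names and the two-column layout here are the same assumptions as above, with invented sample data:

```python
import csv
from collections import defaultdict

# Hypothetical two-column input matching the layout assumed above.
with open('data.csv', 'w', newline='') as f:
    f.write('Param,Value\ncategory1,a\ncategory2,b\ncategory1,c\n')

data = defaultdict(list)
with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip header row
    for category, value in reader:
        data[category].append(value)  # missing keys start as empty lists

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Category', 'Value'])
    for category, values in data.items():
        writer.writerow([category] + values)
```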
I am trying to fix the first row of a CSV file. If a column name in the header starts with anything other than a-z, NUM has to be prepended. The following code fixes the special characters in each column of the first row, but somehow I can't get the !a-z part to work.
import csv
import glob

path = 'test.csv'
for fname in glob.glob(path):
    with open(fname, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        header = [column.replace('-', '_') for column in header]
        header = [column.replace('[!a-z]', 'NUM') for column in header]
What am I doing wrong? Please provide suggestions.
Thanks
You can do it like this.
# csv file:
# 2Hello,?WORLD
# 1,2

import csv

with open("test.csv", newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    print("Original header", header)
    header = [("NUM" + h[1:]) if not h[0].isalpha() else h for h in header]
    print("Modified header", header)
Output:
Original header ['2Hello', '?WORLD']
Modified header ['NUMHello', 'NUMWORLD']
The above list comprehension is equivalent to the following for loop:
for indx in range(len(header)):
    if not header[indx][0].isalpha():
        header[indx] = "NUM" + header[indx][1:]
If you want to replace only numbers, then use the following:
if header[indx][0].isdigit():
You can modify this according to your requirements in case if it changes based on many relevant string functions.
https://docs.python.org/2/library/string.html
I believe you would want to replace the `column.replace` portion with a regular expression; note that `[!a-z]` is glob syntax, and `str.replace` does not understand patterns at all. In regex, something along these lines prepends NUM when the first character is not a-z:
re.sub(r'^([^a-z])', r'NUM\1', column)
The full documentation reference is here for specifics: https://docs.python.org/2/library/re.html
https://www.regular-expressions.info/python.html
Since you said you want to prepend 'NUM', you could do something like this (which could be more efficient, but this shows the basic idea).
import string

column = '123'
if column[0] not in string.ascii_lowercase:
    column = 'NUM' + column
# column is now 'NUM123'
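Applied to a whole header row, a sketch might look like the following (the file names and column names are invented for the example):

```python
import csv
import string

# Hypothetical input whose header mixes valid and invalid column names.
with open('test.csv', 'w', newline='') as f:
    f.write('2hello,world,9lives\n1,2,3\n')

with open('test.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    rest = list(reader)

# Prepend NUM to any column name not starting with a-z.
header = ['NUM' + col if col[0] not in string.ascii_lowercase else col
          for col in header]

with open('fixed.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rest)
```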
I am trying to create a clean CSV file by merging some variables from an old file and appending them to a new CSV file.
I have no problem running the data the first time; I get the output I want. But whenever I try to append the data with a new variable (i.e. a new column), it appends the variable to the bottom and the output is wonky.
I have basically been running the same code for each variable, except changing the
groupvariables variable to my desired variables and then using f2 = open('outputfile.csv', "ab"), with "ab" to append. Any help would be appreciated.
groupvariables = ['x', 'y']
f2 = open('outputfile.csv', "wb")
writer = csv.writer(f2, delimiter=",")
writer.writerow(("ID", "Diagnosis"))
for line in csv_f:
    line = line.rstrip('\n')
    columns = line.split(",")
    tempname = columns[0]
    tempindvar = columns[1:]
    templist = []
    for j in groupvariables:
        tempvar = tempindvar[headers.index(j)]
        if tempvar != ".":
            templist.append(tempvar)
    newList = list(set(templist))
    if len(newList) > 1:
        output = 'nomatch'
    elif len(newList) == 0:
        output = "."
    else:
        output = newList[0]
    tempoutrow = (tempname, output)
    writer.writerow(tempoutrow)
f2.close()
CSV is a line-based file format, so the only way to add a column to an existing CSV file is to read it into memory and overwrite it entirely, adding the new column to each line.
If all you want to do is add lines, though, appending will work fine.
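A minimal sketch of that read-and-rewrite approach (the file name, the added "Severity" column, and its placeholder value are all hypothetical):

```python
import csv

# Hypothetical existing file to which we want to add a column.
with open('outputfile.csv', 'w', newline='') as f:
    f.write('ID,Diagnosis\n1,flu\n2,cold\n')

# Read the whole file into memory...
with open('outputfile.csv', newline='') as f:
    rows = list(csv.reader(f))

# ...append the new column to every line...
rows[0].append('Severity')
for row in rows[1:]:
    row.append('mild')  # placeholder value for the example

# ...and overwrite the file entirely.
with open('outputfile.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```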
Here is something that might help. I assumed the first field on each row in each CSV file is a primary key for the record and can be used to match rows between the two files. The code below reads the records from one file, stores them in a dictionary, then reads the records from the other file, appends those values to the dictionary, and writes out a new file. You can adapt this example to better fit your actual problem.
import csv

# using python3
db = {}
reader = csv.reader(open('t1.csv', 'r'))
for row in reader:
    key, *values = row
    db[key] = ','.join(values)

reader = csv.reader(open('t2.csv', 'r'))
for row in reader:
    key, *values = row
    if key in db:
        db[key] = db[key] + ',' + ','.join(values)
    else:
        db[key] = ','.join(values)

writer = open('combo.csv', 'w')
for key in sorted(db.keys()):
    writer.write(key + ',' + db[key] + '\n')
writer.close()
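Wrapped as a small function for illustration (the file names and sample rows are invented), the same key-based merge can be exercised end to end:

```python
import csv

def merge_by_key(path_a, path_b, out_path):
    # First file: map each key to its joined values.
    db = {}
    with open(path_a, newline='') as f:
        for key, *values in csv.reader(f):
            db[key] = ','.join(values)
    # Second file: append values for matching keys, add new keys.
    with open(path_b, newline='') as f:
        for key, *values in csv.reader(f):
            joined = ','.join(values)
            db[key] = db[key] + ',' + joined if key in db else joined
    # Write the combined records, sorted by key.
    with open(out_path, 'w') as out:
        for key in sorted(db):
            out.write(key + ',' + db[key] + '\n')

# Invented sample data: key 1 appears in both files, 2 and 3 in one each.
with open('t1.csv', 'w', newline='') as f:
    f.write('1,alice\n2,bob\n')
with open('t2.csv', 'w', newline='') as f:
    f.write('1,30\n3,carol\n')

merge_by_key('t1.csv', 't2.csv', 'combo.csv')
```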
I have a CSV file that is being constantly appended. It has multiple headers and the only common thing among the headers is that the first column is always "NAME".
How do I split the single CSV file into separate CSV files, one for each header row?
Here is a sample file:
"NAME","AGE","SEX","WEIGHT","CITY"
"Bob",20,"M",120,"New York"
"Peter",33,"M",220,"Toronto"
"Mary",43,"F",130,"Miami"
"NAME","COUNTRY","SPORT","NUMBER","SPORT","NUMBER"
"Larry","USA","Football",14,"Baseball",22
"Jenny","UK","Rugby",5,"Field Hockey",11
"Jacques","Canada","Hockey",19,"Volleyball",4
"NAME","DRINK","QTY"
"Jesse","Beer",6
"Wendel","Juice",1
"Angela","Milk",3
If the size of the csv files is not huge -- so all can be in memory at once -- just use read() to read the file into a string and then use a regex on this string:
import re

with open(ur_csv) as f:
    data = f.read()

chunks = re.finditer(r'(^"NAME".*?)(?=^"NAME"|\Z)', data, re.S | re.M)
for i, chunk in enumerate(chunks, 1):
    with open('/path/{}.csv'.format(i), 'w') as fout:
        fout.write(chunk.group(1))
If the size of the file is a concern, you can use mmap to create something that looks like a big string but is not all in memory at the same time.
Then use the mmap string with a regex to separate the csv chunks like so:
import mmap
import re

# In Python 3, regexes over an mmap operate on bytes, so the file is
# opened in binary mode and the pattern is a bytes literal.
with open(ur_csv, 'rb') as f:
    mf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunks = re.finditer(rb'(^"NAME".*?)(?=^"NAME"|\Z)', mf, re.S | re.M)
    for i, chunk in enumerate(chunks, 1):
        with open('/path/{}.csv'.format(i), 'wb') as fout:
            fout.write(chunk.group(1))
In either case, this will write all the chunks in files named 1.csv, 2.csv etc.
Copy the input to a new output file each time you see a header line. Something like this (not checked for errors):
partNum = 1
outHandle = None
for line in open("yourfile.csv", "r").readlines():
    if line.startswith('"NAME"'):
        if outHandle is not None:
            outHandle.close()
        outHandle = open("part%d.csv" % (partNum,), "w")
        partNum += 1
    outHandle.write(line)
outHandle.close()
The above will break if the input does not begin with a header line or if the input is empty.
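A sketch of one way to guard against both of those cases (the file names and sample rows are invented for the example):

```python
# Hypothetical multi-header input, as in the question.
with open("yourfile.csv", "w") as f:
    f.write('"NAME","AGE"\n"Bob",20\n"NAME","DRINK"\n"Jesse","Beer"\n')

partNum = 1
outHandle = None
with open("yourfile.csv", "r") as inHandle:
    for line in inHandle:
        if line.startswith('"NAME"'):
            if outHandle is not None:
                outHandle.close()
            outHandle = open("part%d.csv" % partNum, "w")
            partNum += 1
        if outHandle is None:
            # Input did not begin with a header line; skip (or raise).
            continue
        outHandle.write(line)
# Closing only if a file was ever opened also covers empty input.
if outHandle is not None:
    outHandle.close()
```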
You can use the Python csv package to read your source file and write multiple CSV files, based on the rule that if element 0 in your row == "NAME", you spawn off a new file. Something like this...
import csv

outfile_name = "out_%d.csv"
out_num = 1
with open('nameslist.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    csv_buffer = []
    for row in csvreader:
        if row[0] != "NAME":
            csv_buffer.append(row)
        else:
            # flush the previous chunk, if any, before starting a new one
            if csv_buffer:
                with open(outfile_name % out_num, 'wb') as csvout:
                    csv.writer(csvout).writerows(csv_buffer)
                out_num += 1
            csv_buffer = [row]
    # don't forget the last chunk
    if csv_buffer:
        with open(outfile_name % out_num, 'wb') as csvout:
            csv.writer(csvout).writerows(csv_buffer)
P.S. I haven't actually tested this but that's the general concept
Given the other answers, the only modification I would suggest is to open the file with csv.DictReader. Pseudocode would be like this, assuming that the first line in the file is the first header.
Note that this assumes there is no blank line or other indicator between the entries, so that a 'NAME' header occurs right after data. If there were a blank line between appended files, you could use that as an indicator to read infile.fieldnames on the next row. If you need to handle the input as lists, the previous answers are better.
ifile = open(filename, 'rb')
infile = csv.DictReader(ifile)
infields = infile.fieldnames
filenum = 1
ofile = open('outfile' + str(filenum), 'wb')
outfields = infields  # This allows you to change the header field
outfile = csv.DictWriter(ofile, fieldnames=outfields, extrasaction='ignore')
outfile.writerow(dict((fn, fn) for fn in outfields))
for row in infile:
    if row['NAME'] != 'NAME':
        # process this row here and do whatever is needed
        pass
    else:
        ofile.close()
        # build infields again from this row
        infields = [row["NAME"], ...]  # This assumes you know the names & order
        # A dict cannot be pulled as a list and keep the order that you want.
        filenum += 1
        ofile = open('outfile' + str(filenum), 'wb')
        outfields = infields  # This allows you to change the header field
        outfile = csv.DictWriter(ofile, fieldnames=outfields, extrasaction='ignore')
        outfile.writerow(dict((fn, fn) for fn in outfields))
# This is the end of the loop. All data has been read and processed
ofile.close()
ifile.close()
If the exact order of the new header does not matter, except for NAME in the first entry, then you can build the new list as follows:
infields = [row['NAME']]
for k in row.keys():
    if k != 'NAME':
        infields.append(row[k])
This will create the new header with NAME in entry 0, but the others will not be in any particular order.