I have three CSV files with attributes Product_ID, Name, Cost, Description. Each file contains Product_ID. I want to combine Name (file 1), Cost (file 2), and Description (file 3) into a new CSV file with Product_ID and all three attributes above. I need efficient code, as the files contain over 130,000 rows.
After combining all the data into the new file, I have to load it into a dictionary.
Like: Product_ID as key and Name, Cost, Description as value.
It might be more efficient to read each input .csv into a dictionary before creating your aggregated result.
Here's a solution that reads each file and stores the columns in a dictionary with Product_ID as the key. I assume that each Product_ID value exists in every file, that headers are included, and that there are no duplicate columns across the files aside from Product_ID.
import csv
from collections import defaultdict

entries = defaultdict(list)
files = ['names.csv', 'costs.csv', 'descriptions.csv']
headers = ['Product_ID']

for filename in files:
    with open(filename, newline='') as f:     # Open each file in files.
        reader = csv.reader(f)                # Create a reader to iterate csv lines
        heads = next(reader)                  # Grab first line (headers)
        pk = heads.index(headers[0])          # Get the position of 'Product_ID' in
                                              # the list of headers
        # Add the rest of the headers to the list of collected columns
        # (skip 'Product_ID')
        headers.extend([x for i, x in enumerate(heads) if i != pk])
        for row in reader:
            # For each line, add the new values (except 'Product_ID') to the
            # entries dict with the line's Product_ID value as the key
            entries[row[pk]].extend([x for i, x in enumerate(row) if i != pk])

with open('result.csv', 'w', newline='') as out:  # Open file to write csv lines
    writer = csv.writer(out)
    writer.writerow(headers)                  # Write the headers first
    for key, value in entries.items():
        # Write each product ID concatenated with the other values
        writer.writerow([key] + value)
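If pandas is an option, the same combine-then-index workflow can be sketched with `DataFrame.merge`. Tiny in-memory stand-ins are used below; in practice each frame would come from `pd.read_csv('names.csv')` etc., and the column names are assumptions carried over from the question:

```python
import pandas as pd

# Stand-ins for the three input files (assumed column names)
names = pd.DataFrame({'Product_ID': [1, 2], 'Name': ['pen', 'ink']})
costs = pd.DataFrame({'Product_ID': [1, 2], 'Cost': [3.5, 7.0]})
descriptions = pd.DataFrame({'Product_ID': [1, 2],
                             'Description': ['blue pen', 'black ink']})

# Inner merge on the shared key keeps only IDs present in all three files,
# matching the assumption above
merged = (names.merge(costs, on='Product_ID')
               .merge(descriptions, on='Product_ID'))
# merged.to_csv('result.csv', index=False)  # write the combined file

# Load into a dict: Product_ID -> [Name, Cost, Description]
lookup = {row.Product_ID: [row.Name, row.Cost, row.Description]
          for row in merged.itertuples(index=False)}
```

For 130,000-row files this stays well within memory and avoids hand-rolled column bookkeeping.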
A general solution that produces a record (possibly incomplete) for each id it encounters while processing the three files needs a specialized data structure, which fortunately is just a list with a preassigned number of slots:
d = {id: [name, None, None]
     for id, name in [line.strip().split(',') for line in open(fn1)]}

for line in open(fn2):
    id, cost = line.strip().split(',')
    if id in d:
        d[id][1] = cost
    else:
        d[id] = [None, cost, None]

for line in open(fn3):
    id, desc = line.strip().split(',')
    if id in d:
        d[id][2] = desc
    else:
        d[id] = [None, None, desc]

for id in d:
    if all(d[id]):
        print(','.join([id] + d[id]))
    else:
        # For this id you don't have complete info,
        # so you have to decide on your own what you want
        pass
If you are sure that you don't want to further process incomplete records, the code above can be simplified:
d = {id: [name] for id, name in [line.strip().split(',') for line in open(fn1)]}

for line in open(fn2):
    id, cost = line.strip().split(',')
    if id in d: d[id].append(cost)

for line in open(fn3):
    id, desc = line.strip().split(',')
    if id in d: d[id].append(desc)

for id in d:
    if len(d[id]) == 3: print(','.join([id] + d[id]))
I need help sorting a list from a text file. I'm reading a .txt file, then adding some data, then sorting it by population change %, then lastly writing that to a new text file.
The only thing that's giving me trouble now is the sort function. I think the for-statement syntax is what's giving me issues -- I'm unsure where in the code I would add the sort statement and how I would apply it to the output of the for loop.
The population change data I am trying to sort by is item [1] in the list.
#Read file into script (raw strings so backslashes aren't treated as escapes)
NCFile = open(r"C:\filelocation\NC2010.txt")

#Open a file to write to
PopulationChange = open(r"C:\filelocation\Sorted_Population_Change_Output.txt", "w")

#Read everything into lines, except for first (header) row
lines = NCFile.readlines()[1:]

#Pull relevant data and create population change variable
for aLine in lines:
    dataRow = aLine.split(",")
    countyName = dataRow[1]
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    popChange = ((population2010 - population2000) / population2000) * 100
    outputRow = countyName + ", %.2f" % popChange + "%\n"
    PopulationChange.write(outputRow)

NCFile.close()
PopulationChange.close()
You can fix your issue with a couple of minor changes. Split the line as you read it in and loop over the sorted lines:
lines = [aLine.split(',') for aLine in NCFile][1:]

#Pull relevant data and create population change variable
for dataRow in sorted(lines, key=lambda row: row[1]):
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    ...
However, if this is a csv you might want to look into the csv module. In particular, DictReader will read the data in as a list of dictionaries based on the header row. I'm making up the field names below, but you should get the idea. You'll notice I sort the data based on 'countyName' as it is read in:
from csv import DictReader, DictWriter

with open(r"C:\filelocation\NC2010.txt") as NCFile:
    reader = DictReader(NCFile)
    data = sorted(reader, key=lambda row: row['countyName'])

for row in data:
    population2000 = float(row['population2000'])
    population2010 = float(row['population2010'])
    popChange = ((population2010 - population2000) / population2000) * 100
    row['popChange'] = "{0:.2f}".format(popChange)

with open(r"C:\filelocation\Sorted_Population_Change_Output.txt", "w", newline='') as PopulationChange:
    # extrasaction='ignore' skips the other columns still present in each row
    writer = DictWriter(PopulationChange, fieldnames=['countyName', 'popChange'],
                        extrasaction='ignore')
    writer.writeheader()
    writer.writerows(data)
This will give you a 2-column csv of ['countyName', 'popChange']. You would need to swap in the correct fieldnames for your file.
You need to read all of the lines in the file before you can sort them. I've created a list called change to hold tuple pairs of the population change and the county name. This list is sorted and then saved.
with open("NC2010.txt") as NCFile:
    lines = NCFile.readlines()[1:]

change = []
for line in lines:
    row = line.split(",")
    county_name = row[1]
    population_2000 = float(row[6])
    population_2010 = float(row[8])
    pop_change = ((population_2010 / population_2000) - 1) * 100
    change.append((pop_change, county_name))

change.sort()

output_rows = ["{0}, {1:.2f}\n".format(pair[1], pair[0]) for pair in change]

with open("Sorted_Population_Change_Output.txt", "w") as PopulationChange:
    PopulationChange.writelines(output_rows)
I used a list comprehension to generate the output rows, swapping each pair back into the desired order, i.e. county name first.
I have a CSV (~1.5m rows) in the following format:
id, tag1, tag2, name1, value1
There are several rows with the same id. If rows share an id, they have the same tag1 and tag2. So what I want to do is append name1, value1 (which will be different) to the end of the row.
Example:
Original:
id,tag1,tag2,name1,value1
12,orange,car,john,32
13,green,bike,george,23
12,orange,car,elen,21
Final:
id,tag1,tag2,name1,value1
12,orange,car,john,32,elen,21
13,green,bike,george,23
The only way I can do it is with a brute-force script in Python: create a dictionary keyed by id, holding a list of all the other parameters, and each time I find an id that is already in the dictionary, just append the last two fields to that entry's list.
However, that is not the most efficient way to do it on such a big file. Is there any other way, maybe with a library?
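For reference, the brute-force dictionary approach described in the question can be sketched like this (using the sample rows from above, held in memory):

```python
import csv
import io

# Stand-in for the input file described in the question
raw = io.StringIO(
    "id,tag1,tag2,name1,value1\n"
    "12,orange,car,john,32\n"
    "13,green,bike,george,23\n"
    "12,orange,car,elen,21\n"
)

combined = {}   # id -> [tag1, tag2, name1, value1, name2, value2, ...]
order = []      # remember first-seen order of ids

reader = csv.reader(raw)
next(reader)    # skip the header
for id_, tag1, tag2, name, value in reader:
    if id_ in combined:
        # Same id seen before: just append the new name/value pair
        combined[id_].extend([name, value])
    else:
        combined[id_] = [tag1, tag2, name, value]
        order.append(id_)

rows = [[id_] + combined[id_] for id_ in order]
```

This is O(n) with one dict lookup per row, so even 1.5m rows are manageable as long as the collated result fits in memory; the sorted-input approach below avoids holding everything in memory at once.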
Kay's suggestion using sorted input data could look something like this:
with open('in.txt') as infile, open('out.txt', mode='w') as outfile:
    # Prime the first line
    line = infile.readline()
    # When collating lines, running_line will look like:
    # ['id,tag1,tag2', 'name1', 'value1', 'name2', 'value2', ...]
    # Prime it with just the 'id,tag1,tag2' of the first line
    running_line = [line[:-1].rsplit(',', 2)[0]]
    while line:
        curr_it12, name, value = line[:-1].rsplit(',', 2)
        if running_line[0] == curr_it12:
            # Current line's id/tag1/tag2 matches the previous line's.
            running_line.extend([name, value])
        else:
            # Current line's id/tag1/tag2 doesn't match. Output the previous...
            outfile.write(','.join(running_line) + '\n')
            # ...and start a new running_line
            running_line = [curr_it12, name, value]
        # Grab the next line
        line = infile.readline()
    # Flush the last line
    outfile.write(','.join(running_line) + '\n')
I have a csv file in the form of:
'userid','metric name (1-10)','value'
The column 'metric name' has upwards of 10 different metrics, so the same userid will have multiple rows associated with it. What I would like to accomplish is something like this:
'userid1', 'metric name 1'='value1', 'metric name 2'='value2', 'metric name 3'='value3'... 'metric name 10' = 'value10'
A single row for each userid with all the metrics and values associated with that user in k/v pairs
I started playing around with pivot but that function doesn't really do what I need it to...
import pandas as pd
data=pd.read_csv('bps.csv')
data.pivot('entityName', 'metricName', 'value').stack()
I am thinking I need to iterate through the dataset by user, grab the metrics associated with that user, and build the metric k/v pairs during each iteration before going on to a new user. I did a pretty thorough search of the internet, but I didn't find exactly what I was looking for. Please let me know if there is a simple library I could use.
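For what it's worth, the pivot attempted in the snippet above can produce one row per user if given keyword arguments and no trailing `.stack()` — a sketch with made-up data and the column names from the snippet:

```python
import pandas as pd

# Stand-in for bps.csv
data = pd.DataFrame({
    'entityName': ['u1', 'u1', 'u2'],
    'metricName': ['m1', 'm2', 'm1'],
    'value': [1, 2, 3],
})

# One row per user, one column per metric; metrics a user lacks become NaN
wide = data.pivot(index='entityName', columns='metricName', values='value')

# Each row is now the k/v mapping the question asks for
row = wide.loc['u1'].to_dict()
```

The `.stack()` in the original attempt undoes the pivot, which is why it appeared not to work.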
Here is a solution using only standard Python, without any framework.
Starting from the following data file:
id1,name,foo
id1,age,10
id2,name,bar
id2,class,example
id1,aim,demonstrate
You can execute the following code:
separator = ","
userIDKey = "userID"
defaultValue = "No data"
data = {}

# collect the data
with open("data.csv", 'r') as dataFile:
    for line in dataFile:
        # remove the end-of-line character
        line = line.replace("\n", "")
        userID, fieldName, value = line.split(separator)
        if userID not in data:
            data[userID] = {userIDKey: userID}
        data[userID][fieldName] = value

# retrieve all the column headers in use
columnsHeaders = set()
for key in data:
    dataset = data[key]
    for datasetKey in dataset:
        columnsHeaders.add(datasetKey)
columnsHeaders.remove(userIDKey)
columnsHeaders = sorted(columnsHeaders)

def getValue(key, dic):
    if key in dic:
        return dic[key]
    else:
        return defaultValue

# then export the result
with open("output.csv", 'w') as outputFile:
    # export the header line first
    outputFile.write(userIDKey)
    for header in columnsHeaders:
        outputFile.write(", {0}".format(header))
    outputFile.write("\n")
    # and export each line
    for key in data:
        dataset = data[key]
        outputFile.write(dataset[userIDKey])
        for header in columnsHeaders:
            outputFile.write(", {0}".format(getValue(header, dataset)))
        outputFile.write("\n")
And then you get the following result:
userID, age, aim, class, name
id1, 10, demonstrate, No data, foo
id2, No data, No data, example, bar
I think this code can easily be modified to match your objectives if required.
Hope it helps.
Arthur.
I have the csv file as follows:
product_name, product_id, category_id
book, , 3
shoe, 3, 1
lemon, 2, 4
I would like to update the product_id of each row by providing the column name, using Python's csv library.
So, for example, if I pass:
update_data = {"product_id": [1,2,3]}
then the csv file should be:
product_name, product_id, category_id
book, 1, 3
shoe, 2, 1
lemon, 3, 4
You can use your existing dict and iter to take items in order, e.g.:
import csv

update_data = {"product_id": [1, 2, 3]}

# Convert the values of your dict to be directly iterable so we can `next` them
to_update = {k: iter(v) for k, v in update_data.items()}

with open('input.csv', newline='') as fin, open('output.csv', 'w', newline='') as fout:
    # Create in/out csv readers, skip initial space so it matches the update dict,
    # and write the header out
    csvin = csv.DictReader(fin, skipinitialspace=True)
    csvout = csv.DictWriter(fout, csvin.fieldnames)
    csvout.writeheader()
    for row in csvin:
        # Update rows - if we have something left and it's in the update
        # dictionary, use that value, otherwise keep the value that's
        # already in the column.
        row.update({k: next(to_update[k], row[k]) for k in row if k in to_update})
        csvout.writerow(row)
Now, this assumes that each new column value goes to the corresponding row number and that existing values should be used once the new ones run out. You could change that logic to only use new values when the existing value is blank, for instance (or whatever other criteria you wish).
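That blank-only variant might be sketched like so (in-memory stand-ins for the files; the extra condition on `row[k].strip()` is the only change to the update line):

```python
import csv
import io

update_data = {"product_id": ["1", "2", "3"]}
to_update = {k: iter(v) for k, v in update_data.items()}

# Stand-ins for input.csv / output.csv
fin = io.StringIO("product_name, product_id, category_id\n"
                  "book, , 3\n"
                  "shoe, 3, 1\n"
                  "lemon, 2, 4\n")
fout = io.StringIO()

csvin = csv.DictReader(fin, skipinitialspace=True)
csvout = csv.DictWriter(fout, csvin.fieldnames)
csvout.writeheader()
for row in csvin:
    # Only consume a new value when the existing cell is blank;
    # non-blank cells keep their value and don't advance the iterator
    row.update({k: next(to_update[k], row[k])
                for k in row if k in to_update and not row[k].strip()})
    csvout.writerow(row)
```

Here only the blank product_id on the 'book' row is filled in; 'shoe' and 'lemon' keep their existing values.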
(assuming you're using 3.x)
Python has a CSV module in the standard library which helps read and amend CSV files.
Using that I'd find the index for the column you are after and store it in the dictionary you've made. Once that has been found it's simply a matter of popping the list item into each row.
import csv

update_data = {"product_id": [None, [1, 2, 3]]}
# I've nested the original list inside another so that we can hold the
# column index in the first position.

line_no = 0   # Simple counter for the first step.
new_csv = []  # Holds the new rows for when we rewrite the file.

with open('test.csv', 'r') as csvfile:
    # skipinitialspace=True so 'product_id' matches the header despite the
    # spaces after the commas
    filereader = csv.reader(csvfile, skipinitialspace=True)
    for line in filereader:
        if line_no == 0:
            for key in update_data:
                # Find the column's index and store it for later.
                update_data[key][0] = line.index(key)
        else:
            for key in update_data:
                # Using the column index, enter the new data into the correct
                # place whilst removing it from the input list.
                line[update_data[key][0]] = update_data[key][1].pop(0)
        new_csv.append(line)
        line_no += 1

with open('test.csv', 'w', newline='') as csvfile:
    filewriter = csv.writer(csvfile)
    for line in new_csv:
        filewriter.writerow(line)
So far all I am able to do within the function is store all the data twice over.
import csv

def csvWriter(filename, records):
    header = []
    for i in records:
        for v in i:
            header.append(v)
    test = open(filename, 'w')
    dict_wr = csv.DictWriter(test, header)
    dict_wr.writerow(dict(zip(header, header)))
    for i in records:
        dict_wr.writerow(dict(zip(header, i.values())))
    test.close()
    return '%d records processed.' % len(records)
File contains:
a,b,a,b
1,2,1,2
3,4,3,4
I believe I found the problem: inside the for loop, I'm having trouble creating the proper header.
It looks like your records in this example are 2 dictionaries with the same keys ('a' and 'b'). You can fix this by getting the header from the first dictionary only:
for v in records[0]:
    header.append(v)
However, if your other records have unique key values, you'll probably want to include them in the header:
for i in records:
    for v in i:
        if v not in header: header.append(v)
Finally, if you have repeated values in the header, and you have DictWriter write a row, it will repeat entries for each duplicated header value. For example,
import csv
testfile = open("test.csv", "w")
header = ["a","b", "a","b"] #duplicate header fields
writer = csv.DictWriter(testfile, header)
writer.writerow(dict(zip(header, header)))
writer.writerow({header[0]:"A", header[1]:"B"}) #notice it writes 4 values to the row
testfile.close()
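One way to sidestep the duplicate-header issue is to de-duplicate the keys while preserving first-seen order, for instance with `dict.fromkeys` (a sketch on in-memory records):

```python
import csv
import io

records = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]

# Collect every key across all records, dropping duplicates but
# keeping first-seen order
header = list(dict.fromkeys(k for rec in records for k in rec))

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=header)
writer.writeheader()   # writes the header row for us
writer.writerows(records)
```

This produces a single 'a,b' header and one row per record, instead of the doubled columns in the original output.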