Split CSV by columns [duplicate] - python

This question already has answers here:
splitting CSV file by columns
(4 answers)
Closed 4 years ago.
I am trying to split a CSV file containing stock data of 1500+ companies. The first column contains dates and subsequent columns contain company data.
Goal 1: I'm trying to split the huge CSV file into smaller pieces. Let's say 30 companies per smaller file. To do this, I need to split the CSV by column number, not rows. I've been looking up code snippets but I haven't found anything that does this exactly. Also, each separate file would need to contain the first column, i.e. the dates.
Goal 2: I want to make the company name a column of its own, the date a column of its own and the indicators columns of their own. So, I can call the data for a company as a single record (row) in Django - I don't need all the dates, just the last day of every quarter. Right now, I'm having to filter the data by date and indicator and set that as an object to display in my frontend.
If you have questions, just ask.
EDIT:
Following is some code I patched together.
import csv

class SimFinDataset:
    def __init__(self, dataFilePath, csvDelimiter="semicolon"):
        self.numIndicators = None
        self.numCompanies = 1
        # load data
        self.loadData(dataFilePath, csvDelimiter)

    def loadData(self, filePath, delimiter):
        numRow = 0
        delimiterChar = ";" if delimiter == "semicolon" else ","
        # text mode with newline='' is what the csv module expects in Python 3
        csvfile = open(filePath, 'r', newline='')
        reader = csv.reader(csvfile, delimiter=delimiterChar, quotechar='"')
        header = next(reader)
        row_count = sum(1 for _ in reader)
        # counting exhausted the reader; rewind the file and rebuild it
        csvfile.seek(0)
        reader = csv.reader(csvfile, delimiter=delimiterChar, quotechar='"')
        for row in reader:
            numRow += 1
            if numRow > 1 and numRow != row_count and numRow != row_count - 1:
                # company id row
                if numRow == 2:
                    rowLen = len(row)
                    idVal = None
                    for index, columnVal in enumerate(row):
                        if index > 0:
                            if idVal is not None and idVal != columnVal:
                                self.numCompanies += 1
                                if self.numIndicators is None:
                                    self.numIndicators = index - 1
                            if index + 1 == rowLen:
                                if self.numIndicators is None:
                                    self.numIndicators = index
                            idVal = columnVal
                if numRow > 2 and self.numIndicators is None:
                    return
                else:
                    filename = 1
                    # 'w' with newline='' instead of 'wb' (Python 3); note this
                    # reopens and overwrites the same file on every row
                    with open(str(filename) + '.csv', 'w', newline='') as outfile:
                        if self.numCompanies % 30 == 0:
                            print("im working")
                        spamwriter = csv.writer(outfile, delimiter=';')
                        spamwriter.writerow(header)
                        spamwriter.writerow(row)
                        filename += 1
        csvfile.close()
        #print(self.numIndicators)

dataset = SimFinDataset('new-data.csv', 'semicolon')

A solution for Goal 1 is here:
splitting CSV file by columns
Alternatively, here is the pandas way:
import pandas as pd
# let's say first 10 columns
csv_path="mycsv.csv"
out_path ="\\...\\out.csv"
pd.read_csv(csv_path).iloc[:, :10].to_csv(out_path)
You can also do something like
mydf.groupby("company_name").unstack()
to make each company a column of its own.
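For Goal 1 specifically, here is a hedged sketch (the helper name, demo data, and chunk size are my own assumptions, not code from the question) that splits the company columns into groups of a fixed size while always carrying the first date column along:

```python
import pandas as pd

def split_by_columns(df, chunk_size=30):
    """Yield DataFrames of `chunk_size` company columns each,
    always prepending the first (date) column."""
    date_col = df.columns[0]
    company_cols = df.columns[1:]
    for start in range(0, len(company_cols), chunk_size):
        yield df[[date_col, *company_cols[start:start + chunk_size]]]

# tiny demo frame: 1 date column + 5 company columns, chunks of 2
demo = pd.DataFrame({"Date": ["2021-09-02"], **{f"C{i}": [0.0] for i in range(5)}})
pieces = list(split_by_columns(demo, chunk_size=2))
# 5 company columns / 2 per chunk -> 3 pieces; each keeps the Date column
```

With the real file you would read it once with `pd.read_csv('new-data.csv', sep=';')` and write each piece out with `piece.to_csv(f"{n}.csv", sep=';', index=False)`.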

Related

How to index rows in csv files and take columns as arguments?

Here is a sample of my csv file:
Date Open High Low Close
9/2/2021 34.05 40.34 33.03 36.7
9/3/2021 35.9 41.98 34.9 36.89
Here is a sample of my code:
import csv

def StockMarket():
    while True:
        command = input('$')
        if command.lower() == 'quit':
            break
        elif command.lower() == 'readfiles':
            mrna_data, pfe_data = ReadFiles('MRNA.csv.numbers', 'PFE.1.csv')
        #elif command.lower() == 'pricesondate'

def ReadFiles(MRNA, PFE):
    file = open(MRNA, 'r', newline='')
    reader = csv.reader(file, delimiter='\t')
    next(reader)
    mrna_data = []
    for row in reader:
        #index rows only need column 0 and column 4
        mrna_data.append(row)
    file.close()  # close the file object; csv readers have no close()
    file = open(PFE, 'r', newline='')
    reader = csv.reader(file, delimiter='\t')
    next(reader)
    pfe_data = []
    for row in reader:
        pfe_data.append(row)
    file.close()
    return mrna_data, pfe_data
I want to keep only columns 0 and 4 (Date and Close), since they are the only ones I'm using. Then I would like to take a date (column 0) in "YYYY-MM-DD" format as an argument and have it return the corresponding column 4 value (this would be a separate function).
I've tried doing multiple methods from examples online and such but none of them work. If someone could help, I would really appreciate it.
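One possible sketch, assuming tab-delimited data shaped like the sample above (the inline `sample` string and the function name `price_on_date` are stand-ins, and the sample's dates are in M/D/YYYY rather than the requested YYYY-MM-DD format): keep only columns 0 and 4 while reading, store them in a dict keyed by date, and look prices up directly.

```python
import csv
import io

# inline stand-in for the real tab-delimited file
sample = ("Date\tOpen\tHigh\tLow\tClose\n"
          "9/2/2021\t34.05\t40.34\t33.03\t36.7\n"
          "9/3/2021\t35.9\t41.98\t34.9\t36.89\n")

reader = csv.reader(io.StringIO(sample), delimiter='\t')
next(reader)  # skip the header row
# keep only column 0 (Date) and column 4 (Close)
close_by_date = {row[0]: float(row[4]) for row in reader}

def price_on_date(date):
    return close_by_date.get(date)  # None if the date is absent
```

With a real file you would replace `io.StringIO(sample)` with an `open(...)` call inside a `with` block.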

How to print the 2nd column of a searched string in a CSV?

I am trying to search for a file name in a CSV (in column A). If it finds it, then I want to print only the second column (column B), not the whole row.
The CSV is like this:
File Name,ID
1234.bmp,1A
1111.bmp,2B
This is what I have so far, but it prints both the columns:
import os
import csv

f_name = os.listdir(r'C:\Users\Peter\Documents\Python test\Files')[0]

data = []
with open("test.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)

col = [x[0] for x in data]
if f_name in col:
    for x in range(len(data)):
        if f_name == data[x][0]:
            action = print(data[x])
else:
    print("File not listed")
You were close. You only had a problem with the indexing (and the print statement).
After this part of the code:
data = []
with open("test.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)
The data would now be a list of lists:
[
    ['File Name', 'ID'],
    ['1234.bmp', '1A'],
    ['1111.bmp', '2B']
]
In the part where you check the 1st column:
if f_name == data[x][0]:
    action = print(data[x])
You printed data[x] which would be one row. You need to index it further to access the 2nd column:
print(data[x]) # ['1234.bmp', '1A']
print(data[x][1]) # 1A
Furthermore, print returns None, so None would be saved into action:
>>> action = print("123")
123
>>> print(action)
None
You need to assign the value to action then print(action):
if f_name == data[x][0]:
    action = data[x][1]
    print(action)  # 1A or 2B
You can also further improve the code by eliminating col. I understand that it's for checking if f_name is in the 1st column ("File Name") of the CSV. Since you are already iterating over each row, you can already check it there if f_name is in row. If it finds it, store the index of that row in a variable (ex. idx_fname_in_csv), so that later, you can access it directly from data. This eliminates the extra variable col and avoids iterating over the data twice.
import os
import csv

f_name = os.listdir(r'C:\Users\Peter\Documents\Python test\Files')[0]

data = []
idx_fname_in_csv = -1  # invalid
with open("test.csv") as csvfile:
    reader = csv.reader(csvfile)
    for idx, row in enumerate(reader):
        data.append(row)
        if f_name in row:
            idx_fname_in_csv = idx

if idx_fname_in_csv > 0:
    action = data[idx_fname_in_csv][1]
    print(action)
else:
    print("File not listed")
Here data would still have the same contents (list of lists) but I used enumerate to keep track of the index.

I wrote a python script to dedupe a csv and I think it's 90% working. Could really use some help troubleshooting one issue

The code is supposed to find duplicates by comparing FirstName, LastName, and Email. All duplicates should be written to the Dupes.csv file, and all uniques should be written to Deduplicated.csv, but this is currently not happening.
Example:
If row A shows up in Original.csv 10 times, the code writes A1 to deduplicated.csv, and it writes A2 through A10 to dupes.csv.
This is incorrect. A1-A10 should ALL be written to the dupes.csv file, leaving only unique rows in deduplicated.csv.
Another strange behavior is that A2-A10 are all getting written to dupes.csv TWICE!
I would really appreciate any and all feedback as this is my first professional python script and I'm feeling pretty disheartened.
Here is my code:
import csv

def read_csv(filename):
    the_file = open(filename, 'r', encoding='latin1')
    the_reader = csv.reader(the_file, dialect='excel')
    table = []
    #As long as the table row has values we will add it to the table
    for row in the_reader:
        if len(row) > 0:
            table.append(tuple(row))
    the_file.close()
    return table

def create_file(table, filename):
    join_file = open(filename, 'w+', encoding='latin1')
    for row in table:
        line = ""
        #build up the new row - don't comma on last item so add last item separate
        for i in range(len(row) - 1):
            line += row[i] + ","
        line += row[-1]
        #adds the string to the new file
        join_file.write(line + '\n')
    join_file.close()

def main():
    original = read_csv('Contact.csv')
    print('finished read')
    #hold duplicate values
    dupes = []
    #holds all of the values without duplicates
    dedup = set()
    #pairs to know if we have seen a match before
    pairs = set()
    for row in original:
        #if row in dupes:
        #    dupes.append(row)
        if (row[4], row[5], row[19]) in pairs:
            dupes.append(row)
        else:
            pairs.add((row[4], row[5], row[19]))
            dedup.add(row)
    print('finished first parse')
    #go through and add in one more of each duplicate
    seen = set()
    for row in dupes:
        if row in seen:
            continue
        else:
            dupes.append(row)
            seen.add(row)
    print('writing files')
    create_file(dupes, 'duplicate_leads.csv')
    create_file(dedup, 'deduplicated_leads.csv')

if __name__ == '__main__':
    main()
You should look into the pandas module for this; it will be extremely fast, and much easier than rolling your own.
import pandas as pd

x = pd.read_csv('Contact.csv')
#use the names of the columns you want to check
duplicates = x.duplicated(['row4', 'row5', 'row19'], keep=False)
x[duplicates].to_csv('duplicates.csv')   #write duplicates
x[~duplicates].to_csv('uniques.csv')     #write uniques
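If pandas is not an option, here is a stdlib-only sketch of the same two-pass idea (the function name and demo rows are mine, not from the question): count each key first, then send every row whose key occurs more than once, including the first occurrence, to the duplicates list. That fixes the A1 vs A2-A10 split the question describes.

```python
from collections import Counter

def split_dupes(rows, key_cols=(4, 5, 19)):
    """Split rows into (dupes, uniques) by a tuple key of selected columns.
    Every row of a repeated key goes to dupes, including the first one."""
    def key(row):
        return tuple(row[i] for i in key_cols)
    counts = Counter(key(r) for r in rows)
    dupes = [r for r in rows if counts[key(r)] > 1]
    uniques = [r for r in rows if counts[key(r)] == 1]
    return dupes, uniques

# demo with short rows, so the key here is columns 0 and 1
rows = [("a", "b", "x"), ("a", "b", "y"), ("c", "d", "z")]
dupes, uniques = split_dupes(rows, key_cols=(0, 1))
```

For the real Contact.csv you would keep the default `key_cols=(4, 5, 19)` from the question and feed it the rows returned by `read_csv`.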

Combining two scripts into one code for csv file data verification

Hello everyone, I currently have two scripts that I would like to combine into one. The first script finds missing timestamps in a set of data, fills in a blank row with NaN values, and saves to an output file. The second script compares different rows in a set of data and creates a new column with True/False values based on the test condition.
If I run each script as a function and then call both from another function, I get two separate output files. How can I make this run with only one saved output file?
First Code
import pandas as pd
df = pd.read_csv("data5.csv", index_col="DateTime", parse_dates=True)
df = df.resample('1min').mean()
df = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq="1min"))
df.to_csv("output.csv", na_rep='NaN')
Second Code
with open('data5.csv', 'r') as f:
    rows = [row.split(',') for row in f]
rows = [[cell.strip() for cell in row if cell] for row in rows]

def isValidRow(row):
    return float(row[5]) <= 900 or all(float(val) > 7 for val in row[1:4])

header, rows = rows[0], rows[1:]
validRows = list(map(isValidRow, rows))

with open('output.csv', 'w') as f:
    f.write(','.join(header + ['IsValid']) + '\n')
    for row, valid in zip(rows, validRows):
        f.write(','.join(row + [str(valid)]) + '\n')
Let's put your code into functions of the file names:
def first_code(file_in, file_out):
    df = pd.read_csv(file_in, ...)
    ...
    df.to_csv(file_out, ...)

def second_code(file_in, file_out):
    with open(file_in, 'r') as f:
        ...
    ...
    with open(file_out, 'w') as f:
        ...
Your solution can then be:
first_code('data5.csv', 'output.csv')
second_code('output.csv', 'output.csv')
Hope it helps.
Note that there is no problem reading and writing the same file. Just be sure the file is closed beforehand to avoid side effects; this is done implicitly by using with, which is good practice.
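An alternative sketch (my own suggestion, not either answer's code): do both steps on one DataFrame in memory, so only a single file is ever written and no intermediate read-back is needed. Column positions mirror the question's `isValidRow`: field 5 overall becomes column index 4 once DateTime is the index, and fields 1-3 become column indexes 0-2; the column names `c0`-`c4` in the demo are placeholders.

```python
import pandas as pd

def add_is_valid(df):
    """Resample to 1-minute bins (inserting NaN rows for gaps), then
    replicate isValidRow as a vectorized boolean column."""
    df = df.resample("1min").mean()
    cols = df.columns
    df["IsValid"] = (df[cols[4]] <= 900) | (df[cols[:3]] > 7).all(axis=1)
    return df

# tiny demo: two minutes of data with one missing minute in between
idx = pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:02"])
demo = pd.DataFrame({"c0": [8.0, 1.0], "c1": [8.0, 1.0], "c2": [8.0, 1.0],
                     "c3": [0.0, 0.0], "c4": [1000.0, 500.0]}, index=idx)
result = add_is_valid(demo)
# the inserted all-NaN row fails both conditions, so its IsValid is False
```

With the real data you would read data5.csv once with `index_col="DateTime", parse_dates=True`, call `add_is_valid`, and write output.csv once with `na_rep='NaN'`.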
In the second code, change its input from data5.csv to output.csv, and make sure that file1.py and file2.py are in the same directory. Your modified code in a single file will then be as follows:
import pandas as pd

df = pd.read_csv("data5.csv", index_col="DateTime", parse_dates=True)
df = df.resample('1min').mean()
df = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq="1min"))
df.to_csv("output.csv", na_rep='NaN')

with open('output.csv', 'r') as f:
    rows = [row.split(',') for row in f]
rows = [[cell.strip() for cell in row if cell] for row in rows]

def isValidRow(row):
    return float(row[5]) <= 900 or all(float(val) > 7 for val in row[1:4])

header, rows = rows[0], rows[1:]
validRows = list(map(isValidRow, rows))

with open('output.csv', 'w') as f:
    f.write(','.join(header + ['IsValid']) + '\n')
    for row, valid in zip(rows, validRows):
        f.write(','.join(row + [str(valid)]) + '\n')

Replace element in column with previous one in CSV file using python

3rd UPDATE: to describe the problem precisely (first post, so I'm not able to format it well; sorry for this):
I have a CSV file called sample.csv. I need to add additional columns to this file, which I could do using the script below. What is missing from my script: if the present value in the column named "row" differs from the previous element, update the column named "value" with the previous row's "row" value; if not, set "value" to zero.
Hope my question is clear. Thanks a lot for your support.
My script:
#!/usr/local/bin/python3
import csv

inputfile = 'sample.csv'
with open(inputfile, 'r') as input, open('input.csv', 'w') as output:
    reader = csv.reader(input, delimiter=';')
    writer = csv.writer(output, delimiter=';')
    list1 = []
    header = next(reader)
    header.insert(1, 'value')
    header.insert(2, 'Id')
    list1.append(header)
    count = 0
    for column in reader:
        count += 1
        list1.append(column)
        myvalue = []
        myvalue.append(column[4])
        if count == 1:
            firstmyvalue = myvalue
        if count > 2 and myvalue != firstmyvalue:
            column.insert(0, myvalue[0])
        else:
            column.insert(0, 0)
        if column[0] != column[8]:
            del column[0]
            column.insert(0, 0)
        else:
            del column[0]
            column.insert(0, myvalue[0])
        column.insert(1, count)
        column.insert(0, 1)
    writer.writerows(list1)
sample.csv:-
rate;sec;core;Ser;row;AC;PCI;RP;ne;net
244000;262399;7;5;323;29110;163;-90.38;2;244
244001;262527;6;5;323;29110;163;-89.19;2;244
244002;262531;6;5;323;29110;163;-90.69;2;244
244003;262571;6;5;325;29110;163;-88.75;2;244
244004;262665;7;5;320;29110;163;-90.31;2;244
244005;262686;7;5;326;29110;163;-91.69;2;244
244006;262718;7;5;323;29110;163;-89.5;2;244
244007;262753;7;5;324;29110;163;-90.25;2;244
244008;277482;5;5;325;29110;203;-87.13;2;244
My expected output:-
rate;value;Id;sec;core;Ser;row;AC;PCI;RP;ne;net
1;0;1;244000;262399;7;5;323;29110;163;-90.38;2;244
1;0;2;244001;262527;6;5;323;29110;163;-89.19;2;244
1;0;3;244002;262531;6;5;323;29110;163;-90.69;2;244
1;323;4;244003;262571;6;5;325;29110;163;-88.75;2;244
1;325;5;244004;262665;7;5;320;29110;163;-90.31;2;244
1;320;6;244005;262686;7;5;326;29110;163;-91.69;2;244
1;326;7;244006;262718;7;5;323;29110;163;-89.5;2;244
1;323;8;244007;262753;7;5;324;29110;163;-90.25;2;244
1;324;9;244008;277482;5;5;325;29110;203;-87.13;2;244
This will do the part you were asking for in a generic way; however, your expected output clearly has more changes than the question asks for. I added the Id column just to show how you can order the column output too:
import pandas as pd

df = pd.read_csv('sample.csv', sep=";")
df.loc[:, 'value'] = None
df.loc[:, 'Id'] = df.index + 1
prev = None
for i, row in df.iterrows():
    if prev is not None:
        if row.row == prev.row:
            df.loc[i, 'value'] = prev.value
        else:
            df.loc[i, 'value'] = prev.row
    prev = row
df.to_csv('output.csv', index=False, sep=';',
          columns=['rate', 'value', 'Id', 'sec', 'core', 'Ser', 'row', 'AC', 'PCI', 'RP', 'ne', 'net'])
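For comparison, here is a vectorized sketch of the same rule (my own addition, not part of the answer above): shift() lines each row up against its predecessor's "row" value; where the two differ, keep that previous value, otherwise write 0. The demo reuses the "row" column from sample.csv so the result can be checked against the expected output.

```python
import pandas as pd

# the "row" column from sample.csv above
df = pd.DataFrame({"row": [323, 323, 323, 325, 320, 326, 323, 324, 325]})

prev = df["row"].shift()  # each row's predecessor; NaN for the first row
# keep the previous value where it differs from the current one, else 0;
# fillna(0) handles the first row, which has no predecessor
df["value"] = prev.where(df["row"] != prev, 0).fillna(0).astype(int)
# df["value"] -> [0, 0, 0, 323, 325, 320, 326, 323, 324]
```

This matches the "value" column of the expected output row for row, without an explicit Python loop.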
import csv

previous = []
with open('test.csv') as f:  # csv.reader needs a file object, not a bare name
    for i, entry in enumerate(csv.reader(f)):
        if not i:  # do this on first entry only
            previous = entry  # initialize here
            print(entry)
        else:  # other entries
            if entry[2] != previous[2]:  # check if this entry's row differs from the previous entry's row
                entry[1] = previous[2]  # copy the previous entry's row value into this entry
            previous = entry
            print(entry)
import csv

with open('test.csv') as f, open('output.csv', 'w') as o:
    out = csv.writer(o, delimiter='\t')
    out.writerow(["id", 'value', 'row'])
    reader = csv.DictReader(f, delimiter="\t")  # assuming the file is tab delimited
    prev_row = '100'
    for line in reader:
        if prev_row != line["row"]:
            prev_row = line["row"]
            out.writerow([line["id"], prev_row, line["row"]])
        else:
            out.writerow(line.values())
# no explicit close needed: the with statement closes both files
content of output.csv:
id value row
1 0 100
2 0 100
3 110 110
4 140 140