Replace element in column with previous one in CSV file using Python

3rd UPDATE: to describe the problem precisely:
================================================
First post, so I was not able to format it well. Sorry for this.
I have a CSV file called sample.csv. I need to add additional columns to this file, which I could do using the script below. What is missing in my script:
If the present value in the column named "row" differs from the previous one, update the column named "value" with the previous row's "row" value. If not, set the "value" column to zero.
Hope my question is clear. Thanks a lot for your support.
My script:
#!/usr/local/bin/python3
import csv, os, sys, time

inputfile = 'sample.csv'
with open(inputfile, 'r') as input, open('input.csv', 'w') as output:
    reader = csv.reader(input, delimiter=';')
    writer = csv.writer(output, delimiter=';')
    list1 = []
    header = next(reader)
    header.insert(1, 'value')
    header.insert(2, 'Id')
    list1.append(header)
    count = 0
    for column in reader:
        count += 1
        list1.append(column)
        myvalue = []
        myvalue.append(column[4])
        if count == 1:
            firstmyvalue = myvalue
        if count > 2 and myvalue != firstmyvalue:
            column.insert(0, myvalue[0])
        else:
            column.insert(0, 0)
        if column[0] != column[8]:
            del column[0]
            column.insert(0, 0)
        else:
            del column[0]
            column.insert(0, myvalue[0])
        column.insert(1, count)
        column.insert(0, 1)
    writer.writerows(list1)
sample.csv:
rate;sec;core;Ser;row;AC;PCI;RP;ne;net
244000;262399;7;5;323;29110;163;-90.38;2;244
244001;262527;6;5;323;29110;163;-89.19;2;244
244002;262531;6;5;323;29110;163;-90.69;2;244
244003;262571;6;5;325;29110;163;-88.75;2;244
244004;262665;7;5;320;29110;163;-90.31;2;244
244005;262686;7;5;326;29110;163;-91.69;2;244
244006;262718;7;5;323;29110;163;-89.5;2;244
244007;262753;7;5;324;29110;163;-90.25;2;244
244008;277482;5;5;325;29110;203;-87.13;2;244
My expected output:
rate;value;Id;sec;core;Ser;row;AC;PCI;RP;ne;net
1;0;1;244000;262399;7;5;323;29110;163;-90.38;2;244
1;0;2;244001;262527;6;5;323;29110;163;-89.19;2;244
1;0;3;244002;262531;6;5;323;29110;163;-90.69;2;244
1;323;4;244003;262571;6;5;325;29110;163;-88.75;2;244
1;325;5;244004;262665;7;5;320;29110;163;-90.31;2;244
1;320;6;244005;262686;7;5;326;29110;163;-91.69;2;244
1;326;7;244006;262718;7;5;323;29110;163;-89.5;2;244
1;323;8;244007;262753;7;5;324;29110;163;-90.25;2;244
1;324;9;244008;277482;5;5;325;29110;203;-87.13;2;244
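For reference, the whole transformation can be sketched with plain csv in a few lines (a minimal sketch, not the asker's script; the inline sample below stands in for sample.csv, shortened to four records):

```python
import csv
import io

# stand-in for sample.csv (same layout as above, shortened)
sample = """rate;sec;core;Ser;row;AC;PCI;RP;ne;net
244000;262399;7;5;323;29110;163;-90.38;2;244
244001;262527;6;5;323;29110;163;-89.19;2;244
244003;262571;6;5;325;29110;163;-88.75;2;244
244004;262665;7;5;320;29110;163;-90.31;2;244"""

reader = csv.reader(io.StringIO(sample), delimiter=';')
header = next(reader)
row_idx = header.index('row')

out_rows = [['value', 'Id'] + header]
prev = None  # "row" value of the previous record
for count, rec in enumerate(reader, start=1):
    # previous "row" value when it changed, otherwise zero
    value = prev if prev is not None and rec[row_idx] != prev else '0'
    out_rows.append([value, str(count)] + rec)
    prev = rec[row_idx]

for r in out_rows:
    print(';'.join(r))
```

The first two data rows get value 0 (no change in "row"), while the third and fourth pick up the previous "row" values 323 and 325, matching the expected output above.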

This will do the part you were asking for in a generic way; however, your output clearly has more changes than the question asks for. I added the Id column just to show how you can order the output columns too:
import pandas as pd

df = pd.read_csv('sample.csv', sep=';')
df.loc[:, 'value'] = None
df.loc[:, 'Id'] = df.index + 1
prev = None
for i, row in df.iterrows():
    if prev is not None:
        if row.row == prev.row:
            df.loc[i, 'value'] = prev.value
        else:
            df.loc[i, 'value'] = prev.row
    prev = row
df.to_csv('output.csv', index=False, sep=';',
          columns=['rate', 'value', 'Id', 'sec', 'core', 'Ser', 'row', 'AC', 'PCI', 'RP', 'ne', 'net'])
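The same rule can also be expressed without an iterrows() loop by comparing each row against shift() of the column (a sketch on a toy frame, assuming the question's "zero when unchanged" rule rather than this answer's exact propagation of value):

```python
import pandas as pd

df = pd.DataFrame({'row': [323, 323, 323, 325, 320]})
prev = df['row'].shift()  # previous row's "row" value, NaN for the first row
# keep the previous value where "row" changed, else substitute 0
df['value'] = prev.where(df['row'] != prev, 0).fillna(0).astype(int)
print(df['value'].tolist())  # [0, 0, 0, 323, 325]
```

Vectorized comparisons like this avoid the per-row Python overhead of iterrows() on large files.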

import csv

previous = []
with open('test.csv') as f:
    for i, entry in enumerate(csv.reader(f)):
        if not i:  # do this on the first entry only
            previous = entry  # initialize here
            print(entry)
        else:  # other entries
            if entry[2] != previous[2]:  # is this entry's row different from the previous entry's row?
                entry[1] = previous[2]  # copy the previous entry's row value into this entry
            previous = entry
            print(entry)

import csv

with open('test.csv') as f, open('output.csv', 'w', newline='') as o:
    out = csv.writer(o, delimiter='\t')
    out.writerow(['id', 'value', 'row'])
    reader = csv.DictReader(f, delimiter='\t')  # assuming the file is tab-delimited
    prev_row = '100'
    for line in reader:
        if prev_row != line['row']:
            prev_row = line['row']
            out.writerow([line['id'], prev_row, line['row']])
        else:
            out.writerow(line.values())
# the with-block closes both files, so no explicit o.close() is needed
content of output.csv:
id value row
1 0 100
2 0 100
3 110 110
4 140 140

Related

How to index rows in csv files and take columns as arguments?

Here is a sample of my csv file:
Date Open High Low Close
9/2/2021 34.05 40.34 33.03 36.7
9/3/2021 35.9 41.98 34.9 36.89
Here is a sample of my code:
import csv

def StockMarket():
    while True:
        command = input('$')
        if command.lower() == 'quit':
            break
        elif command.lower() == 'readfiles':
            mrna_data, pfe_data = ReadFiles('MRNA.csv.numbers', 'PFE.1.csv')
        # elif command.lower() == 'pricesondate'

def ReadFiles(MRNA, PFE):
    file = open(MRNA, 'r', newline='')
    reader = csv.reader(file, delimiter='\t')
    next(reader)
    mrna_data = []
    for row in reader:
        # index rows; only need columns 0 and 4
        mrna_data.append(row)
    file.close()  # close the file object; csv.reader has no close()
    file = open(PFE, 'r', newline='')
    reader = csv.reader(file, delimiter='\t')
    next(reader)
    pfe_data = []
    for row in reader:
        pfe_data.append(row)
    file.close()
    return mrna_data, pfe_data
I want to index columns 0 and 4 since they are the only ones I'm using. Then I would like to use column 0 as an argument in "YYYY-MM-DD" format, which would return the corresponding column 4 value (this would be a separate function).
I've tried multiple methods from examples online, but none of them work. If someone could help, I would really appreciate it.
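One way to sketch this: instead of keeping whole rows, read just the two needed columns into a dict keyed by date, which makes the "price on date" lookup a separate one-liner (the inline string stands in for the tab-delimited file; indices 0 and 4 are Date and Close as in the sample):

```python
import csv
import io

# stand-in for the tab-delimited stock file shown above
data = ("Date\tOpen\tHigh\tLow\tClose\n"
        "9/2/2021\t34.05\t40.34\t33.03\t36.7\n"
        "9/3/2021\t35.9\t41.98\t34.9\t36.89\n")

reader = csv.reader(io.StringIO(data), delimiter='\t')
next(reader)  # skip the header row
close_by_date = {row[0]: float(row[4]) for row in reader}

def price_on_date(date):
    # separate lookup function, as the question describes
    return close_by_date[date]

print(price_on_date('9/3/2021'))  # 36.89
```

The dates here are in the file's M/D/YYYY form; converting the keys to "YYYY-MM-DD" would be a small datetime.strptime step on top of this.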

How to filter a row based on multiple indexes and multiple conditions?

I have a file which looks like this:
#This is TEST-data
2020-09-07T00:00:03.230+02:00,ID-10,3,London,Manchester,London,1,1,1
2020-09-07T00:00:03.230+02:00,ID-10,3,London,London,Manchester,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London1,1
2020-09-07T00:00:03.230+02:00,ID-30,3,Madrid,Sevila,Sevilla,1,1,1
2020-09-07T00:00:03.230+02:00,ID-30,GGG,Madrid,Sevilla,Madrid,1
2020-09-07T00:00:03.230+02:00,ID-40,GGG,Madrid,Barcelona,1,1,1,1
2020-09-07T00:00:03.230+02:00
2020-09-07T00:00:03.230+02:00
Index [2] in each row shows how many cities are present in that specific row. So the first row has value 3 at index [2], which are London, Manchester, London.
I am trying to do the following:
For every row I need to check whether row[3] or any of the cities mentioned after it (based on the number of cities) is present in cities_to_filter. But this only needs to be done if row[2] is a number. I also need to handle the fact that some rows contain fewer than 2 items.
This is my code:
import csv

path = r'c:\data\ELK\Desktop\test_data_countries.txt'
cities_to_filter = ['Sevilla', 'Manchester']

def filter_row(row):
    if row[2].isdigit():
        amount_of_cities = int(row[2]) if len(row) > 2 else True
        cities_to_check = row[3:3+amount_of_cities]
        condition_1 = any(city in cities_to_check for city in cities_to_filter)
    return condition_1

with open(path, 'r') as output_file:
    reader = csv.reader(output_file, delimiter=',')
    next(reader)
    for row in reader:
        if filter_row(row):
            print(row)
Right now I receive the following error:
UnboundLocalError: local variable 'condition_1' referenced before assignment
You could do something like this:
import csv
import sys

def filter_row(row):
    '''Return True if the row should be removed.'''
    if len(row) > 2:
        if row[2].isdigit():
            amount_of_cities = int(row[2])
            cities_to_check = row[3:3+amount_of_cities]
        else:
            # no valid city count; just try the rest of the row
            cities_to_check = row[3:]
        return any(city in cities_to_check for city in cities_to_filter)
    print(f'Invalid row: {row}', file=sys.stderr)
    return True

with open(path, 'r') as input_file:
    reader = csv.reader(input_file, delimiter=',')
    next(reader)
    for row in reader:
        if filter_row(row):
            print(row)
In filter_row() the row length is checked first to ensure that a possible city count in row[2] is present. If the count is a number, it is used to calculate the upper bound for extracting the cities to check. Otherwise the row from index 3 to the end is processed, which will include the trailing number values, but probably no city names.
If a row has too few fields, it is filtered out by returning True, and an error message is printed.
I suggest you filter first to optimize everything.
Here is the beginning of the path you should explore:
import numpy as np
import pandas as pd

test_data = pd.DataFrame({
    'ID': ['ID-10', 'ID-10', 'ID-20', 'ID-20', 'ID-30', 'ID-30', 'ID-40'],
    'id': [3, 3, 2, 2, 3, 'GGG', 'GGG'],
    'cities': [['London', 'Manchester', 'London', 1, 1, 1],
               ['London', 'Manchester', 'London', 1, 1],
               ['London', 'London', 1, 1],
               ['London', 'London', 1, 1],
               ['Madrid', 'Sevilla', 'Sevilla', 1, 1, 1],
               ['Madrid', 'Sevilla', 'Sevilla', 1],
               ['Madrid', 'Barçelona', 1]],
})
cities_to_filter = ['Sevilla', 'Manchester']
numeric_id = test_data[test_data.id.str.isnumeric() != False]
_condition1 = test_data.index.isin(numeric_id[numeric_id.id > 2].index)
test_data['results'] = np.where(_condition1, 1, 0)
test_data
And then you apply an "any() in" check for filtering the cities, but there are a lot of ways to do that.
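That "any() in" step can be written with apply() on the cities column, for example (a sketch on a simplified frame, not this answer's exact data):

```python
import pandas as pd

cities_to_filter = ['Sevilla', 'Manchester']
test_data = pd.DataFrame({'cities': [['London', 'Manchester', 'London'],
                                     ['London', 'London'],
                                     ['Madrid', 'Sevilla', 'Sevilla']]})
# True for rows whose city list contains any city to filter on
mask = test_data['cities'].apply(
    lambda cs: any(city in cs for city in cities_to_filter))
print(mask.tolist())  # [True, False, True]
```

The boolean mask can then be combined with the numeric-id condition above before selecting rows.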

How to print the 2nd column of a searched string in a CSV?

I am trying to search for a file name in a CSV (in column A). If it finds it, then I want to print only the second column (column B), not the whole row.
The CSV is like this:
File Name,ID
1234.bmp,1A
1111.bmp,2B
This is what I have so far, but it prints both the columns:
import os
import csv

f_name = os.listdir(r'C:\Users\Peter\Documents\Python test\Files')[0]

data = []
with open("test.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)

col = [x[0] for x in data]
if f_name in col:
    for x in range(len(data)):
        if f_name == data[x][0]:
            action = print(data[x])
else:
    print("File not listed")
You were close. You only had a problem with the indexing (and the print statement).
After this part of the code:
data = []
with open("test.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)
The data would now be a list of lists:
[
['File Name', 'ID'],
['1234.bmp', '1A'],
['1111.bmp', '2B']
]
In the part where you check the 1st column:
if f_name == data[x][0]:
    action = print(data[x])
You printed data[x] which would be one row. You need to index it further to access the 2nd column:
print(data[x]) # ['1234.bmp', '1A']
print(data[x][1]) # 1A
Furthermore, print returns None, so None would be saved into action:
>>> action = print("123")
123
>>> print(action)
None
You need to assign the value to action then print(action):
if f_name == data[x][0]:
    action = data[x][1]
    print(action)  # 1A or 2B
You can also further improve the code by eliminating col. I understand that it's for checking if f_name is in the 1st column ("File Name") of the CSV. Since you are already iterating over each row, you can already check it there if f_name is in row. If it finds it, store the index of that row in a variable (ex. idx_fname_in_csv), so that later, you can access it directly from data. This eliminates the extra variable col and avoids iterating over the data twice.
import os
import csv

f_name = os.listdir(r'C:\Users\Peter\Documents\Python test\Files')[0]

data = []
idx_fname_in_csv = -1  # invalid
with open("test.csv") as csvfile:
    reader = csv.reader(csvfile)
    for idx, row in enumerate(reader):
        data.append(row)
        if f_name in row:
            idx_fname_in_csv = idx

if idx_fname_in_csv > 0:
    action = data[idx_fname_in_csv][1]
    print(action)
else:
    print("File not listed")
Here data would still have the same contents (list of lists) but I used enumerate to keep track of the index.

Split CSV by columns [duplicate]

This question already has answers here:
splitting CSV file by columns
(4 answers)
Closed 4 years ago.
I am trying to split a CSV file containing stock data of 1500+ companies. The first column contains dates and subsequent columns contain company data.
Goal 1: I'm trying to split the huge CSV file into smaller pieces. Let's say 30 companies per smaller file. To do this, I need to split the CSV by column number, not rows. I've been looking up code snippets but I haven't found anything that does this exactly. Also, each separate file would need to contain the first column, i.e. the dates.
Goal 2: I want to make the company name a column of its own, the date a column of its own and the indicators columns of their own. So, I can call the data for a company as a single record (row) in Django - I don't need all the dates, just the last day of every quarter. Right now, I'm having to filter the data by date and indicator and set that as an object to display in my frontend.
If you have questions, just ask.
EDIT:
Following is some code I patched together.
import os
import csv
from math import floor
from datetime import datetime
import re

class SimFinDataset:

    def __init__(self, dataFilePath, csvDelimiter="semicolon"):
        self.numIndicators = None
        self.numCompanies = 1
        # load data
        self.loadData(dataFilePath, csvDelimiter)

    def loadData(self, filePath, delimiter):
        numRow = 0
        delimiterChar = ";" if delimiter == "semicolon" else ","
        csvfile = open(filePath, 'r', newline='')  # text mode for csv in Python 3
        reader = csv.reader(csvfile, delimiter=delimiterChar, quotechar='"')
        header = next(reader)
        row_count = sum(1 for _ in reader)
        csvfile.seek(0)
        for row in reader:
            numRow += 1
            if numRow > 1 and numRow != row_count and numRow != row_count - 1:
                # company id row
                if numRow == 2:
                    rowLen = len(row)
                    idVal = None
                    for index, columnVal in enumerate(row):
                        if index > 0:
                            if idVal is not None and idVal != columnVal:
                                self.numCompanies += 1
                                if self.numIndicators is None:
                                    self.numIndicators = index - 1
                            if index + 1 == rowLen:
                                if self.numIndicators is None:
                                    self.numIndicators = index
                            idVal = columnVal
                if numRow > 2 and self.numIndicators is None:
                    return
            else:
                filename = 1
                with open(str(filename) + '.csv', 'w', newline='') as csvfile:
                    if self.numCompanies % 30 == 0:
                        print("im working")
                    spamwriter = csv.writer(csvfile, delimiter=';')
                    spamwriter.writerow(header)
                    spamwriter.writerow(row)
                    filename += 1
        # print(self.numIndicators)

dataset = SimFinDataset('new-data.csv', 'semicolon')
A solution for Goal 1 is here:
splitting CSV file by columns
However you have the pandas way:
import pandas as pd
# let's say first 10 columns
csv_path="mycsv.csv"
out_path ="\\...\\out.csv"
pd.read_csv(csv_path).iloc[:, :10].to_csv(out_path)
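The same iloc idea extends to Goal 1: slice the company columns in fixed-size blocks and prepend the date column to each block (a sketch on a synthetic frame; a chunk size of 3 is used instead of 30 to keep it short, and the column names are hypothetical):

```python
import pandas as pd

# synthetic wide frame: one date column plus 7 company columns
df = pd.DataFrame({'Date': ['2021-01-01', '2021-01-02']})
for n in range(7):
    df[f'C{n}'] = [n, n + 1]

chunk_size = 3
company_cols = df.columns[1:]  # everything except the date column
chunks = [df[['Date', *company_cols[i:i + chunk_size]]]
          for i in range(0, len(company_cols), chunk_size)]
# each chunk could now be written out, e.g. chunk.to_csv(f'part_{k}.csv', index=False)
print([list(c.columns) for c in chunks])
```

Every output block keeps the date column, as the question requires; the last block simply ends up narrower.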
You can also do something like
mydf.groupby("company_name").unstack()
to make each company a column of its own.
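For Goal 2, the reverse reshaping may be what's wanted: melt() turns each company column into rows with a company field of its own, so one record per (date, company) pair can be stored (a sketch with hypothetical tickers and values):

```python
import pandas as pd

# wide format: one column per company, as in the original CSV
wide = pd.DataFrame({'Date': ['2021-03-31', '2021-06-30'],
                     'AAPL': [122.1, 136.9],
                     'MSFT': [235.8, 270.9]})
# long format: one row per (date, company) pair
long = wide.melt(id_vars='Date', var_name='company', value_name='close')
print(long)
```

Rows in the long frame map naturally onto Django model instances, and filtering to quarter-end dates becomes a simple row filter.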

Read and Compare 2 CSV files on a row and column basis

I have two CSV files. data.csv and data2.csv.
I would like to first strip the two data files down to the data I am interested in. I have figured this part out with data.csv. I would then like to compare by row, making sure that if a row is missing it gets added.
Next I want to look at column 2. If there is a value there, then I want to write to column 3; if there is data in column 3, then write to column 4, etc.
My current program looks like so. I need some guidance.
Oh, and I am using Python v3.4.
#!/usr/bin/python
__author__ = 'krisarmstrong'

import csv

searched = ['aircheck', 'linkrunner at', 'onetouch at']

def find_group(row):
    """Return the group index of a row:
    0 if the row contains searched[0],
    1 if the row contains searched[1],
    etc.;
    -1 if not found.
    """
    for col in row:
        col = col.lower()
        for j, s in enumerate(searched):
            if s in col:
                return j
    return -1

inFile = open('data.csv')
reader = csv.reader(inFile)
inFile2 = open('data2.csv')
reader2 = csv.reader(inFile2)
outFile = open('data3.csv', "w")
writer = csv.writer(outFile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
header = next(reader)
header2 = next(reader2)

# Build a list of items to sort. If row 12 contains 'LinkRunner AT' (group 1),
# store a triple (1, 12, row).
# When the triples are sorted later, all rows in group 0 come first, then
# all rows in group 1, etc.
stored = []
writer.writerow([header[0], header[3]])
for i, row in enumerate(reader):
    g = find_group(row)
    if g >= 0:
        stored.append((g, i, row))
stored.sort()
for g, i, row in stored:
    writer.writerow([row[0], row[3]])
inFile.close()
outFile.close()
Perhaps try:
import csv

col1, col2 = [], []
with open('some.csv', newline='') as f:  # text mode; you're on Python 3.4
    reader = csv.reader(f)
    for row in reader:
        col1.append(row[0])
        col2.append(row[1])

for i in range(len(col1)):
    if col1[i] == '':
        pass  # thing to do if there is nothing for col1
    if col2[i] == '':
        pass  # thing to do if there is nothing for col2
This is a start at "making sure that if a row is missing to add it".
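The "add it if missing" part can be sketched by keying each file's rows on a shared column and appending whatever data2.csv has that data.csv lacks (the inline strings stand in for the two files, and the first-column id key is an assumption about your data):

```python
import csv
import io

# stand-ins for data.csv and data2.csv
data1 = "id,val\n1,x\n2,y\n"
data2 = "id,val\n2,y\n3,z\n"

rows1 = list(csv.reader(io.StringIO(data1)))
rows2 = list(csv.reader(io.StringIO(data2)))

# rows present in data2 but missing from data1, keyed on the first column
seen = {r[0] for r in rows1[1:]}
merged = rows1 + [r for r in rows2[1:] if r[0] not in seen]
print(merged)  # header plus ids 1, 2, 3
```

From there, the column-2/column-3 shifting you describe would be a per-row operation on the merged list before writing it back out with csv.writer.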
