Here is a sample of my csv file:
Date Open High Low Close
9/2/2021 34.05 40.34 33.03 36.7
9/3/2021 35.9 41.98 34.9 36.89
Here is a sample of my code:
import csv

def StockMarket():
    while True:
        command = input('$')
        if command.lower() == 'quit':
            break
        elif command.lower() == 'readfiles':
            mrna_data, pfe_data = ReadFiles('MRNA.csv.numbers', 'PFE.1.csv')
        #elif command.lower() == 'pricesondate'

def ReadFiles(MRNA, PFE):
    file = open(MRNA, 'r', newline='')
    reader = csv.reader(file, delimiter='\t')
    next(reader)
    mrna_data = []
    for row in reader:
        # only need columns 0 and 4 of each row
        mrna_data.append(row)
    file.close()
    file = open(PFE, 'r', newline='')
    reader = csv.reader(file, delimiter='\t')
    next(reader)
    pfe_data = []
    for row in reader:
        pfe_data.append(row)
    file.close()
    return mrna_data, pfe_data
I want to index columns 0 and 4 of each row (the date and the close), since they are the only ones I'm using. Then I would like a separate function that takes a date in "YYYY-MM-DD" format as an argument and returns the corresponding column 4 (close) value.
I've tried multiple methods from examples online, but none of them work. If someone could help, I would really appreciate it.
I have a file which looks like this:
#This is TEST-data
2020-09-07T00:00:03.230+02:00,ID-10,3,London,Manchester,London,1,1,1
2020-09-07T00:00:03.230+02:00,ID-10,3,London,London,Manchester,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London1,1
2020-09-07T00:00:03.230+02:00,ID-30,3,Madrid,Sevila,Sevilla,1,1,1
2020-09-07T00:00:03.230+02:00,ID-30,GGG,Madrid,Sevilla,Madrid,1
2020-09-07T00:00:03.230+02:00,ID-40,GGG,Madrid,Barcelona,1,1,1,1
2020-09-07T00:00:03.230+02:00
2020-09-07T00:00:03.230+02:00
Index [2] in each row shows how many cities are present in that specific row. So the first row has the value 3 at index [2], corresponding to London, Manchester, London.
I am trying to do the following:
For every row I need to check whether any of row[3] plus the cities mentioned after it (based on the number of cities) are present in cities_to_filter. But this only needs to be done if row[2] is a number. I also need to handle the fact that some rows contain fewer than three items.
This is my code:
import csv

path = r'c:\data\ELK\Desktop\test_data_countries.txt'
cities_to_filter = ['Sevilla', 'Manchester']

def filter_row(row):
    if row[2].isdigit():
        amount_of_cities = int(row[2]) if len(row) > 2 else True
        cities_to_check = row[3:3 + amount_of_cities]
        condition_1 = any(city in cities_to_check for city in cities_to_filter)
    return condition_1

with open(path, 'r') as output_file:
    reader = csv.reader(output_file, delimiter=',')
    next(reader)
    for row in reader:
        if filter_row(row):
            print(row)
Right now I receive the following error:
UnboundLocalError: local variable 'condition_1' referenced before assignment
You could do something like this:
import sys

def filter_row(row):
    '''Returns True if the row should be removed'''
    if len(row) > 2:
        if row[2].isdigit():
            amount_of_cities = int(row[2])
            cities_to_check = row[3:3 + amount_of_cities]
        else:
            # don't have a valid city count, just try the rest of the row
            cities_to_check = row[3:]
        return any(city in cities_to_check for city in cities_to_filter)
    print(f'Invalid row: {row}', file=sys.stderr)
    return True

with open(path, 'r') as input_file:
    reader = csv.reader(input_file, delimiter=',')
    next(reader)
    for row in reader:
        if filter_row(row):
            print(row)
In filter_row() the row length is checked to ensure that a possible city count in row[2] is present. If the count is a number, it is used to calculate the upper bound for extracting the cities to check. Otherwise the row from index 3 to the end is processed, which will include the additional number values, but probably no city names.
If there are too few fields, the row is filtered out by returning True and an error message is printed.
I suggest filtering first to optimize everything.
Here is the beginning of the path you should explore:
import pandas as pd
import numpy as np

test_data = pd.DataFrame({
    'ID': ['ID-10', 'ID-10', 'ID-20', 'ID-20', 'ID-30', 'ID-30', 'ID-40'],
    'id': [3, 3, 2, 2, 3, 'GGG', 'GGG'],
    'cities': [['London', 'Manchester', 'London', 1, 1, 1],
               ['London', 'Manchester', 'London', 1, 1],
               ['London', 'London', 1, 1],
               ['London', 'London', 1, 1],
               ['Madrid', 'Sevilla', 'Sevilla', 1, 1, 1],
               ['Madrid', 'Sevilla', 'Sevilla', 1],
               ['Madrid', 'Barçelona', 1]]})
cities_to_filter = ['Sevilla', 'Manchester']

numeric_id = test_data.id.str.isnumeric() != False
_condition1 = test_data.index.isin(test_data[numeric_id][test_data[numeric_id].id > 2].index)
test_data['results'] = np.where(_condition1, 1, 0)
test_data
OUTPUT: test_data now carries a results column, set to 1 for rows whose id is a number greater than 2.
And then you apply an 'any() in' for filtering the cities, but there are a lot of ways.
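A sketch of that last filtering step on list-valued cells (the small frame here is illustrative, not the full test data):

```python
import pandas as pd

test_data = pd.DataFrame({'ID': ['ID-10', 'ID-30'],
                          'cities': [['London', 'Manchester', 'London'],
                                     ['Madrid', 'Barcelona']]})
cities_to_filter = ['Sevilla', 'Manchester']

# keep rows whose city list contains any of the filtered cities
mask = test_data['cities'].apply(
    lambda cs: any(c in cities_to_filter for c in cs))
filtered = test_data[mask]
```

Here only the ID-10 row survives, since Manchester appears in its city list.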
I am trying to search for a file name in a CSV (in column A). If it finds it, then I want to print only the second column (column B), not the whole row.
The CSV is like this:
File Name,ID
1234.bmp,1A
1111.bmp,2B
This is what I have so far, but it prints both the columns:
import os
import csv

f_name = os.listdir(r'C:\Users\Peter\Documents\Python test\Files')[0]

data = []
with open("test.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)

col = [x[0] for x in data]
if f_name in col:
    for x in range(len(data)):
        if f_name == data[x][0]:
            action = print(data[x])
else:
    print("File not listed")
You were close. You only had a problem with the indexing (and the print statement).
After this part of the code:
data = []
with open("test.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)
The data would now be a list of lists:
[
    ['File Name', 'ID'],
    ['1234.bmp', '1A'],
    ['1111.bmp', '2B']
]
In the part where you check the 1st column:
if f_name == data[x][0]:
    action = print(data[x])
You printed data[x] which would be one row. You need to index it further to access the 2nd column:
print(data[x]) # ['1234.bmp', '1A']
print(data[x][1]) # 1A
Furthermore, print returns None, so None would be saved into action:
>>> action = print("123")
123
>>> print(action)
None
You need to assign the value to action then print(action):
if f_name == data[x][0]:
    action = data[x][1]
    print(action)  # 1A or 2B
You can also further improve the code by eliminating col. I understand that it's for checking if f_name is in the 1st column ("File Name") of the CSV. Since you are already iterating over each row, you can already check it there if f_name is in row. If it finds it, store the index of that row in a variable (ex. idx_fname_in_csv), so that later, you can access it directly from data. This eliminates the extra variable col and avoids iterating over the data twice.
import os
import csv

f_name = os.listdir(r'C:\Users\Peter\Documents\Python test\Files')[0]

data = []
idx_fname_in_csv = -1  # invalid
with open("test.csv") as csvfile:
    reader = csv.reader(csvfile)
    for idx, row in enumerate(reader):
        data.append(row)
        if f_name in row:
            idx_fname_in_csv = idx

if idx_fname_in_csv > 0:
    action = data[idx_fname_in_csv][1]
    print(action)
else:
    print("File not listed")
Here data would still have the same contents (list of lists) but I used enumerate to keep track of the index.
This question already has answers here:
splitting CSV file by columns
(4 answers)
Closed 4 years ago.
I am trying to split a CSV file containing stock data of 1500+ companies. The first column contains dates and subsequent columns contain company data.
Goal 1: I'm trying to split the huge CSV file into smaller pieces. Let's say 30 companies per smaller file. To do this, I need to split the CSV by column number, not rows. I've been looking up code snippets but I haven't found anything that does this exactly. Also, each separate file would need to contain the first column, i.e. the dates.
Goal 2: I want to make the company name a column of its own, the date a column of its own and the indicators columns of their own. So, I can call the data for a company as a single record (row) in Django - I don't need all the dates, just the last day of every quarter. Right now, I'm having to filter the data by date and indicator and set that as an object to display in my frontend.
If you have questions, just ask.
EDIT:
Following is some code I patched together.
import os
import csv
from math import floor
from datetime import datetime
import re

class SimFinDataset:
    def __init__(self, dataFilePath, csvDelimiter="semicolon"):
        self.numIndicators = None
        self.numCompanies = 1
        # load data
        self.loadData(dataFilePath, csvDelimiter)

    def loadData(self, filePath, delimiter):
        numRow = 0
        delimiterChar = ";" if delimiter == "semicolon" else ","
        csvfile = open(filePath, 'r', newline='')
        reader = csv.reader(csvfile, delimiter=delimiterChar, quotechar='"')
        header = next(reader)
        row_count = sum(1 for _ in reader)
        csvfile.seek(0)
        for row in reader:
            numRow += 1
            if numRow > 1 and numRow != row_count and numRow != row_count - 1:
                # company id row
                if numRow == 2:
                    rowLen = len(row)
                    idVal = None
                    for index, columnVal in enumerate(row):
                        if index > 0:
                            if idVal is not None and idVal != columnVal:
                                self.numCompanies += 1
                                if self.numIndicators is None:
                                    self.numIndicators = index - 1
                            if index + 1 == rowLen:
                                if self.numIndicators is None:
                                    self.numIndicators = index
                            idVal = columnVal
                if numRow > 2 and self.numIndicators is None:
                    return
            else:
                filename = 1
                with open(str(filename) + '.csv', 'w', newline='') as csvfile:
                    if self.numCompanies % 30 == 0:
                        print("im working")
                    spamwriter = csv.writer(csvfile, delimiter=';')
                    spamwriter.writerow(header)
                    spamwriter.writerow(row)
                    filename += 1
        #print(self.numIndicators)

dataset = SimFinDataset('new-data.csv', 'semicolon')
A solution for goal 1 is here:
splitting CSV file by columns
Alternatively, there is the pandas way:
import pandas as pd
# let's say first 10 columns
csv_path="mycsv.csv"
out_path ="\\...\\out.csv"
pd.read_csv(csv_path).iloc[:, :10].to_csv(out_path)
You can also do something like
mydf.groupby("company_name").unstack()
to make each company a column of its own.
I have two CSV files. data.csv and data2.csv.
I would like to first strip the two data files down to the data I am interested in. I have figured this part out with data.csv. I would then like to compare by row, making sure that if a row is missing it is added.
Next I want to look at column 2. If there is a value there, then I want to write to column 3; if there is data in column 3, then write to column 4, and so on.
My current program looks like so. I need some guidance.
Oh and I am using Python V3.4
#!/usr/bin/python
__author__ = 'krisarmstrong'

import csv

searched = ['aircheck', 'linkrunner at', 'onetouch at']

def find_group(row):
    """Return the group index of a row:
    0 if the row contains searched[0],
    1 if the row contains searched[1],
    etc.,
    -1 if not found.
    """
    for col in row:
        col = col.lower()
        for j, s in enumerate(searched):
            if s in col:
                return j
    return -1

inFile = open('data.csv')
reader = csv.reader(inFile)
inFile2 = open('data2.csv')
reader2 = csv.reader(inFile2)
outFile = open('data3.csv', "w")
writer = csv.writer(outFile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
header = next(reader)
header2 = next(reader2)

"""Build a list of items to sort. If row 12 contains 'LinkRunner AT' (group 1),
one stores a triple (1, 12, row).
When the triples are sorted later, all rows in group 0 will come first, then
all rows in group 1, etc.
"""
stored = []
writer.writerow([header[0], header[3]])
for i, row in enumerate(reader):
    g = find_group(row)
    if g >= 0:
        stored.append((g, i, row))
stored.sort()
for g, i, row in stored:
    writer.writerow([row[0], row[3]])
inFile.close()
outFile.close()
Perhaps try:
import csv

col1, col2 = [], []
with open('some.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        col1.append(row[0])
        col2.append(row[1])

for i in range(len(col1)):
    if col1[i] == '':
        pass  # thing to do if there is nothing for col1
    if col2[i] == '':
        pass  # thing to do if there is nothing for col2
This is a start at "making sure that if a row is missing to add it".