Python iterating/looking up a dictionary: unexpected behavior

I have a problem when I try to look up data in a CSV dictionary. A list of dates and times is in one CSV, and the code should look up the data for each specific date and time in a second CSV. I look for an exact match plus the 22 following records. The problem is that it only fetches results for the first date and time; the rest are not found even though I can see they're there. I feel like this has a very easy solution, but I can't think of anything. It must be a problem in my iteration code.
Code:
import csv

csv_eph = open("G:\\db.csv")
csv_reader_eph = csv.reader(csv_eph, delimiter=",")
csv_dict_eph = csv.DictReader(csv_eph)

csv_matches = open("G:\\query.csv")
csv_reader_matches = csv.reader(csv_matches, delimiter=",")
csv_dict_matches = csv.DictReader(csv_matches)

result = []
var = 0
for row in csv_dict_matches:
    datum = row["Date"]
    cas = row["Time"]
    result.append('\n')
    result.append(row)
    for eph in csv_dict_eph:
        if str(eph["datum"]) == str(datum) and str(eph["cas"]) == str(cas):
            var = 23
        if var > 0:
            result.append(eph)
            var = var - 1

with open("G:\\compiled.txt", "w") as output:
    for item in result:
        output.write(str(item))
        output.write('\n')
SOLUTION!
I implemented jasonharper's solution and it works flawlessly, many thanks. It was indeed a problem with the DictReader being exhausted at the end of the first pass. Fixed, it now looks like this and works as intended:
import csv

csv_eph = open("G:\\db.csv")
csv_reader_eph = csv.reader(csv_eph, delimiter=",")
csv_dict_eph = csv.DictReader(csv_eph)

csv_matches = open("G:\\query.csv")
csv_reader_matches = csv.reader(csv_matches, delimiter=",")
csv_dict_matches = csv.DictReader(csv_matches)

# jasonharper: read the DictReader into a list so it can be iterated repeatedly
eph_list = []
for eph in csv_dict_eph:
    eph_list.append(eph)
print(eph_list)

result = []
var = 0
for row in csv_dict_matches:
    print(row)
    datum = row["Date"]
    cas = row["Time"]
    result.append('\n')
    result.append(row)
    for eph in eph_list:
        if str(eph["datum"]) == str(datum) and str(eph["cas"]) == str(cas):
            var = 23
        if var > 0:
            result.append(eph)
            var = var - 1

with open("G:\\compiled.txt", "w") as output:
    for item in result:
        output.write(str(item))
        output.write('\n')

I believe changing:
csv_dict_eph = csv.DictReader(csv_eph)
to:
csv_dict_eph = list(csv.DictReader(csv_eph))
will fix the problem. A DictReader is a one-shot iterator over the underlying file: the first pass through the inner loop exhausts it, so every later lookup starts at end-of-file and finds nothing. Wrapping it in list() materializes all the rows up front, so they can be scanned once per match.
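For anyone puzzled by the exhaustion behavior, here is a minimal, self-contained sketch (the in-memory data is made up for illustration):

import csv
import io

# Tiny in-memory stand-in for db.csv (illustrative data only).
data = io.StringIO("datum,cas\n1.1.2024,12:00\n2.1.2024,13:00\n")

reader = csv.DictReader(data)
print(list(reader))  # first pass: both rows
print(list(reader))  # second pass: [] -- the iterator is exhausted

data.seek(0)
rows = list(csv.DictReader(data))
print(rows)  # a list, by contrast, can be scanned any number of times
print(rows)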

Related

Getting a total for a value in a CSV file row that is text

I am trying to get the total number of Mondays in my CSV file. My current code will return all the Mondays, but I need it to return the count, 1972. I am at a loss. I tried using a SearchCursor, but that was a nightmare. I am new to programming in Python, so I am looking for your individual wisdom. Thank you for your time; the code is below.
Csv_file_data (posted as an image): I am trying to get just the total of Mondays out of this CSV.
import csv

with open(r"C:\users\david\OneDrive\Documents\ArcGIS\Projects\MyProject1\Burglaries_TableToExcel.csv", 'r') as monday:
    reader = csv.reader(monday, delimiter=",")
    title = next(reader)[16]
    found_section = False
    header = None
    DayOfWeek_index = None
    DayOfWeek_sum = 'Monday'
    for row in reader:
        if not found_section:
            if len(row) > 0:
                if row[16] == "DayOfWeek":
                    header = next(reader)
                    DayOfWeek_index = header_index("Monday")
                    found_section = True
        else:
            if len(row) > 0:
                DayOfWeek_sum += float(row[DayOfWeek_index])
            else:
                break
print(DayOfWeek_sum)
An example, not tested as I was not going to hand transcribe the data from the image.
import csv

with open(r"C:\users\david\OneDrive\Documents\ArcGIS\Projects\MyProject1\Burglaries_TableToExcel.csv", 'r') as monday:
    mon_ct = 0
    csv_dict = csv.DictReader(monday)
    for row in csv_dict:
        if row["DayOfWeek"] == "Monday":
            mon_ct += 1
print(mon_ct)
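If counts for the other days are ever needed too, collections.Counter tallies the whole column in one pass; a minimal sketch, assuming the same DayOfWeek column as above:

import csv
from collections import Counter

# Tally every value in the DayOfWeek column in a single pass.
with open(r"C:\users\david\OneDrive\Documents\ArcGIS\Projects\MyProject1\Burglaries_TableToExcel.csv") as f:
    counts = Counter(row["DayOfWeek"] for row in csv.DictReader(f))

print(counts["Monday"])  # e.g. 1972 for the poster's data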

How to pick values at specific times on a date (large list of date, time, value)

I have a file with these columns: date, times, and value of a stock. Basically, per-minute value of stocks. I would like to calculate the difference in the value of a stock at 10 AM and 4 PM. This is the code I have so far:
fileName = "C:\\...\\US_200901_210907.csv"
with open(fileName) as f:
    for line in f.readlines()[1:]:
        split = line.split(";")
        time = split[3]
        date = split[2]
        for timev in f.readlines()[1:]:
            if timev == '100000':
                Spot = float(split[2])
            elif timev == '160000':
                Close = float(split[2])
            Diff = Spot - Close
            print(Diff)
I am not sure if I am doing this right. But the code needs to cycle/loop through each date first, find the value of the stock at '100000' and '160000' and then calculate the difference between the two. Then move to the next day. And at the end of all days, print the differences for each day.
The "Diff = Spot - Close" line also gives me an error, says "NameError: name 'Spot' is not defined"
Any help is appreciated.
Dataset looks like this (extract):
====================
After working more on this on my own, I was able to get this to work:
import csv

filename = "C:\\...\\US_200901_210907.csv"
with open(filename, 'r') as f:
    reader = csv.reader(f, delimiter=';')
    next(reader, None)  # skip header
    rows = list(reader)

listOfDates = []
for row in rows:
    if row[2] not in listOfDates:
        listOfDates.append(row[2])
print(listOfDates)

startPrice = 0
endPrice = 0
startPriceSet = False
endPriceSet = False
for date in listOfDates:
    for row in rows:  # iterate the rows directly; a hand-maintained index runs past the end on the second date
        if row[2] == date:
            if row[3] == '100000':
                startPrice = float(row[7])
                startPriceSet = True
            elif row[3] == '160000':
                endPrice = float(row[7])
                endPriceSet = True
    if startPriceSet and endPriceSet:
        print(date, startPrice, endPrice, startPrice - endPrice)
        startPriceSet = False
        endPriceSet = False
Why not leverage a pandas DataFrame for this calculation?
import pandas as pd

df = pd.read_csv("C:\\...\\US_200901_210907.csv")
# give appropriate column names before or after loading the data
# assuming we have the columns 'time', 'date' & 'stockvalue' in df
# might have to use pandas.to_datetime
# note: Python uses & for element-wise "and", not &&
print(df[(df['time'] == 'time1') & (df['date'] == 'date1')]['stockvalue']
      - df[(df['time'] == 'time2') & (df['date'] == 'date1')]['stockvalue'])
Also, why do you have an embedded for loop?
One approach with the sheet you have provided:
import pandas as pd
from collections import defaultdict

df = pd.read_excel("Data.xlsx", header=None, dtype='str')
out = defaultdict(lambda: defaultdict(float))
for rowindex, row in df.iterrows():
    date = row[2]
    name = row[0]
    if row[3] == "100000":
        out[name]['DATE'] = row[2]
        out[name]['START'] = float(row[4])
    if row[3] == "160000":
        out[name]['END'] = float(row[4])
for stock, data in out.items():
    # str() around the floats: concatenating a float to a str raises TypeError
    print(stock + ': DATE: ' + str(data['DATE']) + ' START: ' + str(data['START'])
          + ' END: ' + str(data['END']) + ' diff = ' + str(int(data['END'] - data['START'])))
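For the per-day 10 AM vs 4 PM difference specifically, a pivot does the whole job in a few lines; a sketch assuming the same layout as the csv-module solution above (';'-separated, column 2 = date, column 3 = time as text, column 7 = price):

import pandas as pd

# header=None + skiprows=1 gives integer column labels, matching the positions above.
df = pd.read_csv("C:\\...\\US_200901_210907.csv", sep=';', header=None,
                 skiprows=1, dtype=str)
df = df[df[3].isin(['100000', '160000'])].copy()  # keep only the two times of interest
df[7] = df[7].astype(float)

# One row per date, one column per time of day, then the difference.
pivot = df.pivot_table(index=2, columns=3, values=7, aggfunc='first')
print(pivot['100000'] - pivot['160000'])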

Merging two CSV files into a list of dictionaries

I have a task to do and I got stuck, because whatever I do it doesn't seem to work.
So I have two CSV files.
The first, called persons_file, contains the header line: id, name, surname.
And visits_file contains: id, person_id, site.
I have to write a function called merge that takes two files as arguments (both StringIO type) and returns a list of dictionaries with the number of visits for each user:
[ {
    "id": (person's id),
    "name": (person's name),
    "surname": (person's surname),
    "visits": (number of visits)
} ]
I came up with this and I don't know where my mistake is.
import io

def merge(persons_file, visits_file):
    line_counter = 0
    return_list = []
    list_of_person_ids = []
    visits = 0
    for row in visits_file:
        if line_counter == 0:
            line_counter += 1
            continue
        list_of_person_ids.append(row.split(',')[1])
    line_counter = 0
    for row in persons_file:
        if line_counter == 0:
            line_counter += 1
            continue
        help_dict = {}
        split_row = row.split(',')
        help_dict['id'] = split_row[0]
        help_dict['name'] = split_row[1]
        help_dict['surname'] = split_row[2][:len(split_row[2]) - 1]
        if split_row[0] in list_of_person_ids:
            visits = list_of_person_ids.count(split_row[0])
        help_dict['visits'] = str(visits)
        return_list.append(help_dict)
        visits = 0
    return return_list

file1 = open('persons_file.csv', mode='r')
file2 = open('visits_file.csv', mode='r')
persons_file_arg = io.StringIO(file1.read())
visits_file_arg = io.StringIO(file2.read())
list_of_visits = merge(persons_file_arg, visits_file_arg)
for i in list_of_visits:
    print(i)
file1.close()
file2.close()
I will be glad if anyone could help me.
What is the issue? Is the output not what you expected, or are you getting an exception? Your code seems like it should achieve the result you want, but I have a couple of suggestions that could simplify things.
Look into collections.Counter. You could call count_of_visits_by_person_id = Counter(list_of_person_ids) to get a result of the form {person_id: number_of_visits, ...}, and then use it to simply look up the number of visits in your next for loop, e.g.:
from collections import Counter

...

count_of_visits_by_person_id = Counter(list_of_person_ids)
for row in persons_file:
    if line_counter == 0:
        line_counter += 1
        continue
    help_dict = {}
    split_row = row.split(',')
    help_dict['id'] = split_row[0]
    help_dict['name'] = split_row[1]
    help_dict['surname'] = split_row[2][:-1]
    # [:len(split_row[2]) - 1] is equivalent to [:-1]
    # I assume you are stripping whitespace from the right side,
    # which can also be accomplished using split_row[2].rstrip()
    if split_row[0] in count_of_visits_by_person_id:
        visits = count_of_visits_by_person_id[split_row[0]]
    else:
        visits = 0
    help_dict['visits'] = str(visits)
    return_list.append(help_dict)
The generally simpler and safer way to open files is using the with statement. Here is an example; skipping the header with next() and then iterating the file directly avoids the readline() pattern, which would hit an IndexError on the empty string returned at end of file:
with open('visits_file.csv', mode='r') as visits_file:
    next(visits_file)  # skip the header line
    for row in visits_file:
        list_of_person_ids.append(row.split(',')[1])
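Putting both suggestions together, here is a compact sketch of merge() built on csv.DictReader and Counter; it assumes the headers have no stray spaces around the commas:

import csv
import io
from collections import Counter

def merge(persons_file, visits_file):
    # Count visits per person_id in one pass over visits_file.
    visit_counts = Counter(row['person_id'] for row in csv.DictReader(visits_file))
    return [
        {
            'id': row['id'],
            'name': row['name'],
            'surname': row['surname'],
            'visits': str(visit_counts[row['id']]),  # Counter returns 0 for missing keys
        }
        for row in csv.DictReader(persons_file)
    ]

# Example with in-memory data matching the described headers:
persons = io.StringIO("id,name,surname\n1,Ann,Lee\n2,Bo,Kim\n")
visits = io.StringIO("id,person_id,site\n1,1,a.com\n2,1,b.com\n3,2,c.com\n")
for person in merge(persons, visits):
    print(person)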

Python read XML file (nearly 50 MB)

I'm parsing an XML string into a CSV string, but it's going very slowly:
import copy
import xml.etree.ElementTree as ET

INDEX_COLUMN = "{urn:schemas-microsoft-com:office:spreadsheet}Index"
CELL_ELEMENT = "Cell"
DATA_ELEMENT = "Data"

def parse_to_csv_string(xml):
    print('parse_to_csv_string')
    csv = []
    parsed_data = serialize_xml(xml)
    rows = list(parsed_data[1][0])
    header = get_cells_text(rows[0])
    rows.pop(0)
    csv.append(",".join(header))
    for row in rows:
        values = get_cells_text(row)
        csv.append(",".join(values))
    return "\n".join(csv)

def serialize_xml(xml):
    return ET.fromstring(xml)

def get_cells_text(row):
    keys = []
    cells = normalize_row_cells(row)
    for elm in cells:
        keys.append(elm[0].text or "")
    while len(keys) < 92:
        keys.append("")
    return keys

def normalize_row_cells(row):
    cells = list(row)
    updated_cells = copy.deepcopy(cells)
    pos = 1
    for elm in cells:
        strIndexAttr = elm.get(INDEX_COLUMN)
        index = int(strIndexAttr) if strIndexAttr else pos
        while index > pos:
            empty_elm = ET.Element(CELL_ELEMENT)
            child = ET.SubElement(empty_elm, DATA_ELEMENT)
            child.text = ""
            updated_cells.insert(pos - 1, empty_elm)
            pos += 1
        pos += 1
    return updated_cells
The XML string sometimes misses a few columns and I need to iterate over it to fill in the missing ones; every row must have 92 columns. That's why I have some helper functions to manipulate the XML.
Right now I'm running my function as a Lambda with 4 GB of memory and still getting timeouts :(
Any idea how to improve performance?
The normalize_row_cells constructs ElementTree Element instances but get_cells_text is only interested in each instance's child's text attribute, so I would consider changing normalize_row_cells to just return the text. Also, it's performing copies and calling list.insert: inserting elements into the middle of lists can be expensive, because each element after the insertion point must be moved.
Something like this (untested code) avoids making copies and insertions and returns only the required text, making get_cells_text redundant.
def normalize_row_cells(row):
    cells = list(row)
    texts = []
    cell_pos = 0  # separate pointer into cells, so gaps don't desynchronize it
    for pos in range(1, 93):  # output columns 1..92
        if cell_pos < len(cells):
            elm = cells[cell_pos]
            strIndexAttr = elm.get(INDEX_COLUMN)
            index = int(strIndexAttr) if strIndexAttr else pos
            if index == pos:
                texts.append(elm[0].text or "")
                cell_pos += 1
                continue
        texts.append("")  # missing column (or no cells left): pad with ""
    return texts
If you can match your cells to their header names then using csv.DictWriter from the standard library might be even better (you need to profile to be sure).
import csv
import io

def parse_to_csv_string(xml):
    print('parse_to_csv_string')
    parsed_data = serialize_xml(xml)
    rows = list(parsed_data[1][0])
    header = get_cells_text(rows[0])
    rows.pop(0)  # don't write the header row as data
    with io.StringIO() as f:
        writer = csv.DictWriter(f, fieldnames=header)
        writer.writeheader()
        for row in rows:
            row = get_cells_text(row)
            writer.writerow(row)
        f.seek(0)
        data = f.read()
    return data

def get_cells_text(row):
    row_dict = {}
    for cell in row:
        column_name = get_column_name(cell)  # <- can this be done?
        row_dict[column_name] = cell[0].text or ""
    return row_dict
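If the up-front ET.fromstring parse itself turns out to be a large share of the time or memory, a streaming parse is also worth profiling. A minimal sketch using xml.etree.ElementTree.iterparse, assuming the same SpreadsheetML namespace and Row elements as the question; each yielded row could then go through normalize_row_cells as before:

import io
import xml.etree.ElementTree as ET

SS_NS = "{urn:schemas-microsoft-com:office:spreadsheet}"

def iter_rows(xml_string):
    # Stream Row elements one at a time instead of building the whole tree;
    # iterparse needs a file-like object, so wrap the encoded string.
    source = io.BytesIO(xml_string.encode("utf-8"))
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == SS_NS + "Row":
            yield elem
            elem.clear()  # free children of rows already processed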

List index out of range error when breaking a while loop in Python

Hi, I am new to Python and struggling my way out. Currently I am doing an append-Excel-files kind of task, and here's my sample code. I am getting a list index out of range error; as far as I can tell, the while loop is not breaking at the end of each Excel file. Any help would be appreciated. Thanks:
import xlrd
import glob
import os
import openpyxl
import csv
from xlrd import open_workbook
from os import listdir

row = {}
basedir = '../files/'
files = listdir('../files')
sheets = [filename for filename in files if filename.endswith("xlsx")]
header_is_written = False
for filename in sheets:
    print('Parsing {0}{1}\r'.format(basedir, filename))
    worksheet = open_workbook(basedir + filename).sheet_by_index(0)
    print(worksheet.cell_value(5, 6))
    counter = 0
    while True:
        row['plan name'] = worksheet.cell_value(1 + counter, 1).strip()
        row_values = worksheet.row_slice(counter + 1, start_colx=0, end_colx=30)
        row['Dealer'] = int(row_values[0].value)
        row['Name'] = str(row_values[1].value)
        row['City'] = str(row_values[2].value)
        row['State'] = str(row_values[3].value)
        row['Zip Code'] = int(row_values[4].value)
        row['Region'] = str(row_values[5].value)
        row['AOM'] = str(row_values[6].value)
        row['FTS Short Name'] = str(row_values[7].value)
        row['Overall Score'] = float(row_values[8].value)
        row['Overall Rank'] = int(row_values[9].value)
        row['Count of Ros'] = int(row_values[10].value)
        row['Count of PTSS Cases'] = int(row_values[11].value)
        row['% of PTSS cases'] = float(row_values[12].value)
        row['Rank of Cases'] = int(row_values[13].value)
        row['% of Not Prepared'] = float(row_values[14].value)
        row['Rank of Not Prepared'] = int(row_values[15].value)
        row['FFVt Pre Qrt'] = float(row_values[16].value)
        row['Rank of FFVt'] = int(row_values[17].value)
        row['CSI Pre Qrt'] = int(row_values[18].value)
        row['Rank of CSI'] = int(row_values[19].value)
        row['FFVC Pre Qrt'] = float(row_values[20].value)
        row['Rank of FFVc'] = int(row_values[21].value)
        row['OnSite'] = str(row_values[22].value)
        row['% of Onsite'] = str(row_values[23].value)
        row['Not Prepared'] = int(row_values[24].value)
        row['Open'] = str(row_values[25].value)
        row['Cost per Vin Pre Qrt'] = float(row_values[26].value)
        row['Damages per Visit Pre Qrt'] = float(row_values[27].value)
        row['Claim Sub time pre Qrt'] = str(row_values[28].value)
        row['Warranty Index Pre Qrt'] = str(row_values[29].value)
        counter += 1
        if row['plan name'] is None:
            break
        with open('table.csv', 'a', newline='') as f:
            w = csv.DictWriter(f, row.keys())
            if header_is_written is False:
                w.writeheader()
                header_is_written = True
            w.writerow(row)
In place of while True, use a for loop:
row['plan name'] = worksheet.cell_value(1 + counter, 1).strip()
row_values = worksheet.row_slice(counter + 1, start_colx=0, end_colx=30)
for values in row_values:
    row['Dealer'] = int(values.value)
    row['Name'] = str(values.value)
    ....
because while True means running the loop an infinite number of times (or until it hits a break keyword).
Read more about while loops.
A while True loop basically means: execute the following code block to infinity, unless a break or sys.exit statement gets you out.
So in your case, you need to terminate after the lines to append from the Excel file are exhausted. You have two options: check whether there are more lines to append, and if not, break.
A more suitable approach when reading a file is a for loop. This kind of loop terminates when the iterable is exhausted, as shown in the sketch below.
Also, you should consider gathering the content of the Excel file in one operation and saving it to a variable; then, once you have it, iterate over it and append it to the CSV.
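A minimal sketch of that for-loop version, meant to slot into the question's per-file loop (worksheet and row as defined there); sheet.nrows bounds the iteration, so the index can never run past the end:

# Sketch: a bounded loop over the sheet's rows instead of while True (xlrd API).
for row_index in range(1, worksheet.nrows):  # row 0 is assumed to be the header
    plan_name = str(worksheet.cell_value(row_index, 1)).strip()
    if not plan_name:  # stop at the first blank 'plan name' cell
        break
    row_values = worksheet.row_slice(row_index, start_colx=0, end_colx=30)
    row['plan name'] = plan_name
    row['Dealer'] = int(row_values[0].value)
    # ... remaining columns exactly as in the original code ...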
