Using Pyvaru for bulk data (CSV) validation - python
I am looking for a generic validator module to assist in sanitizing data and, importantly, giving back an error log stating why data has been rejected. I am working primarily with CSV files, each with an average of 40 columns and about 40,000 rows. A CSV file would have a mixture of Personal Identifying Information, Contact Information and details about the Account they hold with us.
E.g.
First Name|Last Name|Other Name|Passport Number|Date of Birth|Phone Number|Email Address|Physical Address|Account Number|Invoice Number|Date Opened|Amount Due|Date Due|etc|etc
I need to validate basic stuff like data type, data length, options/choices, ranges, mandatory fields, etc. There are also conditional validations, e.g. if an Amount Due value has been provided, then the Date Due must also be provided; if it hasn't, I raise an error.
Pyvaru provides some basic validation classes. Is it possible to implement both of these scenarios, basic validation plus conditional validation, with pyvaru? If yes, how would I structure the validations? Must I create objects, e.g. Identifier objects and Account objects, in order to use pyvaru?
Pyvaru validates Python objects (class instances and collections like dictionaries, lists and so on), so starting from a CSV I would convert each record into a dictionary using csv.DictReader.
So, given a CSV like:
policyID,statecode,county,eq_site_limit,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
119736,FL,CLAY COUNTY,498960,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0,0,0,30.063936,-81.707664,Residential,Masonry,3
206893,FL,CLAY COUNTY,190724.4,190724.4,190724.4,190724.4,190724.4,192476.78,0,0,0,0,30.089579,-81.700455,Residential,Wood,1
333743,FL,CLAY COUNTY,0,79520.76,0,0,79520.76,86854.48,0,0,0,0,30.063236,-81.707703,Residential,Wood,3
172534,FL,CLAY COUNTY,0,254281.5,0,254281.5,254281.5,246144.49,0,0,0,0,30.060614,-81.702675,Residential,Wood,1
The validation code would be something like:
import csv
from pyvaru import Validator
from pyvaru.rules import MaxLengthRule, MaxValueRule

class CsvRecordValidator(Validator):
    def get_rules(self) -> list:
        record: dict = self.data
        return [
            MaxLengthRule(apply_to=record.get('statecode'),
                          label='State Code',
                          max_length=2),
            # DictReader yields strings, so convert before comparing against a number
            MaxValueRule(apply_to=float(record.get('eq_site_limit') or 0),
                         label='Site limit',
                         max_value=40000),
        ]
with open('sample.csv', 'r') as csv_file:
    reader = csv.DictReader(csv_file)
    row = 0
    for record in reader:
        row += 1
        validator = CsvRecordValidator(record)
        validation = validator.validate()
        if not validation.is_successful():
            print(f'Row {row} did not validate. Details: {validation.errors}')
The example above is just a dumb example of what you can do; specifically, it checks that the "statecode" column has a max length of 2 and that "eq_site_limit" has a max value of 40,000.
You can implement your own rules by subclassing the abstract class ValidationRule and implementing the apply() method:
from pyvaru import ValidationRule

class ContainsHelloRule(ValidationRule):
    def apply(self) -> bool:
        return 'hello' in self.apply_to
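Regarding the conditional-validation part of the question (if Amount Due is present, Date Due must be too): one possible sketch is a custom rule that receives the whole row dictionary and checks the dependency itself. The column names come from the question's header row, and the error_message keyword is taken from pyvaru's documented ValidationRule constructor, so treat the details as assumptions if your version differs:

from pyvaru import ValidationRule, Validator

class DateDueRequiredRule(ValidationRule):
    """Passes unless 'Amount Due' is filled in while 'Date Due' is empty."""

    def apply(self) -> bool:
        record = self.apply_to  # here the whole row dict is passed to the rule
        amount_due = (record.get('Amount Due') or '').strip()
        date_due = (record.get('Date Due') or '').strip()
        # valid when Amount Due is empty, or when both values are provided
        return not amount_due or bool(date_due)

class AccountRecordValidator(Validator):
    def get_rules(self) -> list:
        record: dict = self.data
        return [
            DateDueRequiredRule(apply_to=record,
                                label='Date Due',
                                error_message='Date Due is required when Amount Due is provided.'),
        ]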
It's also possible to negate rules using bitwise operators... for example, using the previous custom rule which checks that a string must contain "hello", you can write:
~ ContainsHelloRule(apply_to=a_string, label="A test string")
and thus the rule will be valid only if the string does NOT contain "hello".
It would also be possible to validate CSV records without the dictionary conversion, by validating each line using a PatternRule with a validation regex... but of course you won't know which column value is invalid.
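For completeness, a rough sketch of that line-level approach might look like the following. It assumes PatternRule accepts the regex via a pattern argument (as in pyvaru's docs), and the regex shown only enforces a numeric policyID and a two-letter state code; a real one would be much stricter:

from pyvaru import Validator
from pyvaru.rules import PatternRule

class CsvLineValidator(Validator):
    def get_rules(self) -> list:
        line: str = self.data
        return [
            # only checks "digits,TWO-LETTER-CODE,anything" -- tighten as needed
            PatternRule(apply_to=line,
                        label='CSV line',
                        pattern=r'^\d+,[A-Z]{2},.+$'),
        ]

with open('sample.csv', 'r') as csv_file:
    next(csv_file)  # skip the header row
    for number, line in enumerate(csv_file, start=2):
        validation = CsvLineValidator(line.strip()).validate()
        if not validation.is_successful():
            print(f'Line {number} did not validate: {validation.errors}')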
Related
How to read and send CSV column data to PyTest test cases
We have a utility that calls APIs and saves their responses to CSV. That CSV (resp.csv) has the API request (column A) along with headers (column C) and payload (column B) in it. The body of each response is stored in column D, and the response code in column E (not visible in the CSV image). The CSV file looks like this: (screenshot not included)

I want to pass each response to a set of PyTest test cases which will have some assertions specific to the response. For checking the status code I can do this via a function call, which returns the response code before writing to CSV. But the requirement is to read the responses from the CSV and pass them to the assertions/test cases:

@pytest.mark.parametrize('test_input', check_response)
def test_APIResponse(test_input):
    print(check_response)
    assert test_input == 200, "Test pass"

How can I access the response body stored in the CSV (column D) and do assertions on it using PyTest test cases? Can someone guide me with this? Thanks
Check this out, I think it might help: docs. For your specific example you could do something like:

import pytest

def load_cases():
    # read_csv
    class Case:
        def __init__(self, number):
            self.i = number

        def __repr__(self):
            return '{number}'.format(number=self.i)

    return [Case(i) for i in range(5)]

def parametrize(name, values):
    # function for readable description
    return pytest.mark.parametrize(name, values, ids=map(repr, values))

@parametrize("case", load_cases())
def test_responses(case):
    print(case.i)

You are creating a class Case and storing everything you need inside it, then accessing its properties from the test. You can also play around with indirect parametrization for fixtures, but don't overcomplicate your code. To read a specific column use something like pandas, or just split your string.
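If the cases should come from resp.csv itself rather than range(5), load_cases could read the file with the standard csv module. A minimal sketch, assuming the CSV has a header row and the status code lives in a column named code (adjust the names to your file):

import csv
import pytest

def load_cases(path='resp.csv'):
    # each row of the CSV becomes one test case; DictReader keys come from the header row
    with open(path, newline='') as f:
        return list(csv.DictReader(f))

@pytest.mark.parametrize('case', load_cases())
def test_response_code(case):
    # 'code' is an assumed column name holding the HTTP status code
    assert int(case['code']) == 200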
I wrote a package called Parametrize From File that can be used to do this. I gave a detailed example of how to load test parameters from an XLSX file in another Stack Overflow post, but I'll briefly reiterate the important points here.

The only complication is that Parametrize From File expects to be able to load test cases as a dictionary of lists of dictionaries (see the docs for more info). This layout makes sense for YAML/TOML/NestedText files, but not for XLSX/CSV files, so we need to provide a function that loads the XLSX/CSV file in question and converts it to the expected format. pandas makes this pretty easy to do if you're willing to add the dependency; otherwise it probably wouldn't be too hard to write something yourself.

Edit: Here's a more concrete example. To begin, here's what the CSV file might look like:

request_,payload,header,response,code
http://localhost:8080/c/u/t,{"ci":""},{},{"ss":""},200
http://localhost:8080/c/u/t?Id=x,{"ci":""},{},{"res":""},200

A few things to note: The first row gives a name to each column. The code I've written relies on this, and uses those same names as the arguments to the parametrized test function. If your file doesn't have these headers, you would need to hard-code a name for each column. The name "request" is reserved by pytest, so we have to use "request_" here.

Here's what the corresponding test script might look like:

import parametrize_from_file as pff
from csv import DictReader
from collections import defaultdict

def load_csv(path):
    with open(path) as f:
        cases = list(DictReader(f))
    return defaultdict(lambda: cases)

pff.add_loader('.csv', load_csv)

@pff.parametrize
def test_api_request_response(request_, payload, header, response, code):
    assert request_ == ...
    assert payload == ...
    assert header == ...
    assert response == ...
    assert code == ...

A few more things to note: This assumes that the CSV file has the same base name as the test script. If this isn't the case, it's easy to specify a different path. The load function is expected to return a dictionary mapping test names (e.g. test_api_request_response) to lists of test cases, where each test case is a dictionary mapping parameter names (e.g. request_) to parameter values (e.g. http://localhost:8080). In this case the file doesn't specify any test names, so we cheat and use a defaultdict to return the same test cases for any test name.
Error in the coding: TypeError: list indices must be integers, not str
The code imports another file, which is working perfectly. But there is a problem in the line where I try to access the csv data with a column called 'account_key', which raises the TypeError above.

import file_import as fi

Function for collectively finding data necessary from a csv file:

def unique_students(csv_file):
    unique_students_list = set()
    for information in csv_file:
        unique_students_list.add(csv_file["account_key"])
    return len(unique_students_list)

#enrollment_num_rows = len(fi.enrollments)
#engagement_num_rows = len(fi.daily_engagement)
#submission_num_rows = len(fi.project_submissions)

#enrollment_num_unique_students = unique_students(fi.enrollments)
#engagement_num_unique_students = unique_students(fi.daily_engagement)
#submission_num_unique_students = unique_students(fi.project_submissions)
csv_file["account_key"] Lists expect a numeric index. As far as I know, only dictionaries accept String indices. I'm not entirely sure what this is supposed to do; I think your logic is flawed. You bind information in the for loop, then never use it. Even if the list did accept a string index, all it would do is populate the Set with the same information over and over since the for loop body remains the same same every loop. This would only work if you were expecting csv_file to be a custom container type that had side effects when indexed (like advancing some internal counter).
python: how to create list of/iterate through multiple instances of a variable
I am working with the pymarc library. My question is: how do I deal with multiple instances of a variable, either building a list or otherwise iterating through them? A MARC field can be accessed by adding the field number to the record variable. For instance, I have three instances of an 856 field in one record, which can be accessed as record['856'], but only the first instance is returned; assigning a variable record['856'][0] or record['856'][1] etc. doesn't work. I have tried creating a list, shown below, but that didn't work either.

from pymarc import MARCReader

with open('file.mrc', 'rb') as fh:
    reader = MARCReader(fh)
    for record in reader:
        """get all 856 fields -- delete unwanted 856 fields
        =856 40$uhttp://url1.org
        =856 40$uhttp://url2.org
        =856 40$uhttp://url3.org
        """
        eight56to956s = []
        eight56to956 = record['856']
        eight56to956s.append(eight56to956)
        print eight56to956s

I know how I would do this in php, but I'm not getting my head around the python syntax to even search the web for the right thing.
You need a dictionary, where you can set 856 as the key and then a list of the values you want tagged to 856:

your_856 = {856: ['=856 40$uhttp://url1.org', '=856 40$uhttp://url2.org', '=856 40$uhttp://url3.org']}

You can now access the values as with an index. Here is an example:

print(your_856[856][1])

This outputs =856 40$uhttp://url2.org
Get fields from a specific Jira issue
I'm trying to get all the fields and values from a specific issue. My code:

authenticated_jira = JIRA(options={'server': self.jira_server},
                          basic_auth=(self.jira_username, self.jira_password))
issue = authenticated_jira.issue(self.id)
print issue.fields()

Instead of returning the list of fields it returns:

<jira.resources.PropertyHolder object at 0x108431390>
authenticated_jira = JIRA(options={'server': self.jira_server},
                          basic_auth=(self.jira_username, self.jira_password))
issue = authenticated_jira.issue(self.id)
for field_name in issue.raw['fields']:
    print "Field:", field_name, "Value:", issue.raw['fields'][field_name]

Depending on the field type, sometimes you get a dictionary as the value and then you have to find the actual value you want inside it.
Found it using print self.issue_object.raw, which returns the raw JSON dictionary that can be iterated over and manipulated.
You can use issue.raw['fields']['desired_field'], but this is a rather indirect way of accessing the field values, because what you get in return is not consistent: sometimes lists of strings, sometimes plain strings, and sometimes bare values that don't have a key to access them with, so you'll have to iterate, count the location, and then parse to get the value, which is unreliable.

The best way is to use issue.fields.customfield_#. That way you don't have to do any parsing through the .raw fields. Almost everything has a customfield associated with it. You can pull issues from the REST API to find customfield numbers, or some of the fields that you get from using .raw will have a customfield id that looks like "customfield_11111", and that's what you'll use.
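For example (customfield_11111 is just a placeholder id here; look up the real id for your Jira instance as described above):

issue = authenticated_jira.issue(self.id)
value = issue.fields.customfield_11111  # hypothetical custom field id
print(value)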
Using the answer from @kobi-k but dumping in a better format, I used the following code (json needs to be imported):

import json

with open("fields.txt", "w") as f:
    json.dump(issue.raw, f, indent=4)

It dumped all the fields to a file named "fields.txt".
Search a single column for a particular value in a CSV file and return an entire row
Issue

The code does not correctly identify the input (item). It simply dumps to my failure message even if such a value exists in the CSV file. Can anyone help me determine what I am doing wrong?

Background

I am working on a small program that asks for user input (function not given here), searches a specific column in a CSV file (Item) and returns the entire row. The CSV data format is shown below. I have shortened the data from the actual amount (49 field names, 18000+ rows).

Code

import csv
from collections import namedtuple
from contextlib import closing

def search():
    item = 1000001
    raw_data = 'active_sanitized.csv'
    failure = 'No matching item could be found with that item code. Please try again.'
    check = False
    with closing(open(raw_data, newline='')) as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        while check == False:
            for row in map(item_data._make, read_data):
                if row.Item == item:
                    return row
                else:
                    return failure

CSV structure

active_sanitized.csv

Item;Name;Cost;Qty;Price;Description
1000001;Name here:1;1001;1;11;Item description here:1
1000002;Name here:2;1002;2;22;Item description here:2
1000003;Name here:3;1003;3;33;Item description here:3
1000004;Name here:4;1004;4;44;Item description here:4
1000005;Name here:5;1005;5;55;Item description here:5
1000006;Name here:6;1006;6;66;Item description here:6
1000007;Name here:7;1007;7;77;Item description here:7
1000008;Name here:8;1008;8;88;Item description here:8
1000009;Name here:9;1009;9;99;Item description here:9

Notes

My experience with Python is relatively little, but I thought this would be a good problem to start with in order to learn more. I determined the methods to open (and wrap in a close function) the CSV file, read the data via DictReader (to get the field names), and then create a named tuple to be able to quickly select the desired columns for the output (Item, Cost, Price, Name). Column order is important, hence the use of DictReader and namedtuple. While there is the possibility of hard-coding each of the field names, I felt that if the program can read them on file open, it would be much more helpful when working on similar files that have the same column names but different column organization.

Research

CSV Header and named tuple: What is the pythonic way to read CSV file data as rows of namedtuples?
Converting CSV data to tuple: How to split a CSV row so row[0] is the name and any remaining items are a tuple?

There were additional links of research, but I cannot post more than two.
You have three problems with this:

1. You return on the first failure, so it will never get past the first line.
2. You are reading strings from the file, and comparing to an int.
3. _make iterates over the dictionary keys, not the values, producing the wrong result (item_data(Item='Name', Name='Price', Cost='Qty', Qty='Item', Price='Cost', Description='Description')).

for row in (item_data(**data) for data in read_data):
    if row.Item == str(item):
        return row
return failure

This fixes the issues at hand - we check against a string, and we only return if none of the items matched (although you might want to begin converting the strings to ints in the data rather than this hackish fix for the string/int issue). I have also changed the way you are looping - using a generator expression makes for a more natural syntax, using the normal construction syntax for named attributes from a dict. This is cleaner and more readable than using _make and map(). It also fixes problem 3.
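Putting those fixes together, a corrected search() might look like the sketch below; it also converts Item to int when comparing (as suggested above) instead of keeping everything as strings:

import csv
from collections import namedtuple

def search(item=1000001, raw_data='active_sanitized.csv'):
    failure = 'No matching item could be found with that item code. Please try again.'
    with open(raw_data, newline='') as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        for row in (item_data(**data) for data in read_data):
            if int(row.Item) == item:
                return row
    return failure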