I'm in the process of building data-driven tests in Python using unittest and ddt.
Is there a way to select specific fields from the test data instead of having to pass all the fields as separate parameters?
For example:
I have a csv file containing customers as below:
Title,FirstName,Surname,AddressA,AddressB,AddressC,City,State,PostCode
Mr,Bob,Gergory,44 road end,A Town,Somewhere,LOS ANGELES,CA,90004
Miss,Alice,Woodrow,99 some street,Elsewhere,A City,LOS ANGELES,CA,90003
From this I'd like to be able to select just the first name, city and state in the test.
I can do this as below; however, this seems messy and will only get worse with wider files:
@data(get_test_data("customers.csv"))
@unpack
def test_create_new_customer(self, Title, FirstName, Surname, AddressA, AddressB, AddressC, City, State, PostCode):
    self.customer.enter_first_name(FirstName)
    self.customer.enter_city(City)
    self.customer.enter_state_code(State)
    self.customer.click_update()
I was hoping to be able to build a dictionary list out of the csv and then access it as below:
@data(get_test_data_as_dictionary("customers.csv"))
@unpack
def test_create_new_customer(self, test_data):
    self.customer.enter_first_name(test_data["FirstName"])
    self.customer.enter_city(test_data["City"])
    self.customer.enter_state_code(test_data["State"])
    self.customer.click_update()
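For reference, the helper I have in mind would be something along these lines (just a sketch built on csv.DictReader; get_test_data_as_dictionary is my own hypothetical name):

import csv

def get_test_data_as_dictionary(file_name):
    # Read each CSV row into a dict keyed by the header row,
    # so a test can pick out only the columns it cares about.
    with open(file_name, newline="") as f:
        return list(csv.DictReader(f))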
However, it would seem that ddt is smarter than I thought: it breaks the dictionary apart and still expects all the parameters to be declared.
Is there a better way to achieve what I'm after?
Related
We have a utility that calls APIs and saves their responses to a CSV file (resp.csv). The CSV holds the API request in column A, the payload in column B, the headers in column C, the response body in column D, and the response code in column E.
I want to pass each response to a set of PyTest test cases, each of which will have some assertions specific to the response.
For checking the status code I can do this via a function call that returns the response code before writing to CSV. But the requirement is to read the responses from the CSV and pass them to the assertions/test cases:
@pytest.mark.parametrize('test_input', check_response)
def test_APIResponse(test_input):
    print(check_response)
    assert test_input == 200, "Test pass"
How can I read the response body stored in the CSV (column D) and run assertions on it using PyTest test cases?
Can someone guide me with this?
Thanks
Check this out, I think the docs might help. For your specific example you could do something like:
import pytest

def load_cases():
    # read_csv
    class Case:
        def __init__(self, number):
            self.i = number

        def __repr__(self):
            return '{number}'.format(number=self.i)

    return [Case(i) for i in range(5)]

def parametrize(name, values):
    # function for readable test descriptions
    return pytest.mark.parametrize(name, values, ids=map(repr, values))

@parametrize("case", load_cases())
def test_responses(case):
    print(case.i)
You create a Case class, store everything you need inside it, and then access its properties from the test. You can also play around with indirect parametrization for fixtures, but don't overcomplicate your code.
To read a specific column use something like pandas or just split your string.
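For instance, a minimal sketch of reading a single column with pandas (the file name "resp.csv" and the column name "response_code" are assumptions, adjust them to your actual file):

import pandas as pd

# Load the CSV and pull one column out as a plain Python list.
df = pd.read_csv("resp.csv")
codes = df["response_code"].tolist()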
I wrote a package called Parametrize From File that can be used to do this. I gave a detailed example of how to load test parameters from an XLSX file in another Stack Overflow post, but I'll briefly reiterate the important points here.
The only complication is that Parametrize From File expects to be able to load test cases as a dictionary of lists of dictionaries (see the docs for more info). This layout makes sense for YAML/TOML/NestedText files, but not for XLSX/CSV files. So we need to provide a function that loads the XLSX/CSV file in question and converts it to the expected format. pandas makes this pretty easy to do if you're willing to add the dependency, otherwise it probably wouldn't be too hard to write something yourself.
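For instance, a pandas-based converter might look roughly like this (just a sketch; the concrete example below sticks to the standard library instead):

import pandas as pd
from collections import defaultdict

def load_csv(path):
    # Read every row into a dict keyed by the header row, then expose the
    # same list of cases to any test name via a defaultdict.
    cases = pd.read_csv(path).to_dict('records')
    return defaultdict(lambda: cases)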
Edit: Here's a more concrete example. To begin, here's what the CSV file might look like:
request_,payload,header,response,code
http://localhost:8080/c/u/t,{"ci":""},{},{"ss":""},200
http://localhost:8080/c/u/t?Id=x,{"ci":""},{},{"res":""},200
A few things to note:
The first row gives a name to each column. The code I've written relies on this, and uses those same names as the arguments to the parametrized test function. If your file doesn't have these headers, you would need to hard-code a name for each column.
The name "request" is reserved by pytest, so we have to use "request_" here.
Here's what the corresponding test script might look like:
import parametrize_from_file as pff
from csv import DictReader
from collections import defaultdict

def load_csv(path):
    with open(path) as f:
        cases = list(DictReader(f))
    return defaultdict(lambda: cases)

pff.add_loader('.csv', load_csv)

@pff.parametrize
def test_api_request_response(request_, payload, header, response, code):
    assert request_ == ...
    assert payload == ...
    assert header == ...
    assert response == ...
    assert code == ...
A few things to note:
This assumes that the CSV file has the same base name as the test script. If this isn't the case, it's easy to specify a different path.
The load function is expected to return a dictionary mapping test names (e.g. test_api_request_response) to lists of test cases, where each test case is a dictionary mapping parameter names (e.g. request_) to parameter values (e.g. http://localhost:8080). In this case the file doesn't specify any test names, so we cheat and use a defaultdict to return the same test cases for any test name.
I am looking to set up a data-driven approach for my Python Selenium project (there is none currently). I'm planning to have the data file as xlsx.
I use pytest in my project. Hence, I explored ddt, @data, @unpack and pytest.mark.parametrize.
I am able to read my Excel values and pass them with @data/@unpack or parametrize.
However, in my case, each of my tests will use selected columns from my data file - not all.
e.g. My data list will be like this (user, password, item_number, item_name): [('user1', 'abc', 1, 'it1234'), ('user2', 'def', 2, 'it5678')]
My function1 (test 1) will need to parameterize user and password columns only.
My function2 (test 2) will need to parameterize item_number and item_name columns only.
What library or method can I use for my need? Basically, I need to be able to parameterize specific columns from my data file for my tests.
I wrote a library called Parametrize From File that can load test parameters from data files like this. But I'm not sure that I fully understand your example. If this was your data file...
user | password | item number | item name
A | B | C | D
E | F | G | H
...would these be the tests you want to run?
@pytest.mark.parametrize(
    'user, password',
    [('A', 'B'), ('E', 'F')],
)
def test_1(user, password):
    assert ...

@pytest.mark.parametrize(
    'item_number, item_name',
    [('C', 'D'), ('G', 'H')],
)
def test_2(item_number, item_name):
    assert ...
In other words, are the user/password columns completely unrelated to the item_number/item_name columns? If no, I'm misunderstanding your question. If yes, this isn't very scalable. It's easy to imagine writing 100 tests, each with 2+ parameters, for a total of >200 columns! This format also breaks the convention that every value in a row should be related in some way. I'd recommend either putting the parameters for each test into their own file/worksheet, or using a file format that better matches the list-of-tuples/list-of-dicts structure expected by pytest, e.g. YAML, TOML, NestedText, etc.
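For example, a YAML file shaped like the dictionary-of-lists-of-dictionaries structure described above might look something like this (a sketch only; check the Parametrize From File docs for the exact layout it expects):

test_1:
  - {user: A, password: B}
  - {user: E, password: F}

test_2:
  - {item_number: C, item_name: D}
  - {item_number: G, item_name: H}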
With all that said, here's how you would load parameters from an xlsx file using Parametrize From File:
import pandas as pd
from collections import defaultdict
import parametrize_from_file as pff

def load_xlsx(path):
    """
    Load an xlsx file and return the data structure expected by Parametrize
    From File, which is a map of test names to test parameters. In this case,
    the xlsx file doesn't specify any test names, so we use a `defaultdict` to
    make all the parameters available to any test.
    """
    df = pd.read_excel(path)
    return defaultdict(lambda: df)

def get_cols(cols):
    """
    Extract specific columns from the parameters loaded from the xlsx file.
    The parameters are loaded as a pandas DataFrame, and need to be converted
    into a list of dicts in order to be understood by Parametrize From File.
    """
    def _get_cols(df):
        return df[cols].to_dict('records')
    return _get_cols

# Use the function we defined above to load xlsx files.
pff.add_loader('.xlsx', load_xlsx)

@pff.parametrize(preprocess=get_cols(['user', 'password']))
def test_1(user, password):
    pass

@pff.parametrize(preprocess=get_cols(['item_number', 'item_name']))
def test_2(item_number, item_name):
    pass
Note that this code would be much simpler if the parameters were organized in one of the formats I recommended above.
I'm brand new here and brand new to Python and programming in general. I wrote a simple script today that I'm pretty proud of as a new beginner. I used BS4 and Requests to scrape some data from a website. I put all of the data in dictionaries inside a list. The same key/value pairs exist for every list item. For simplicity, I'm left with something like this:
[{'country': 'us', 'state': 'new york', 'people': 50}, {'country': 'us', 'state': 'california', 'people': 30}]
Like I said, pretty simple, but then I can turn it into a Pandas dataframe and everything is organized with a few hundred different dictionaries inside the list. My next step is to run this scrape every hour for 5 hours--and the only thing that changes is the value of the 'people' key. All of a sudden I'm not sure a list of lists of dictionaries (did I say that right?!) is a great idea. Plus, I really only need to get the updated values of 'people' from the webpage. Is this something I can realistically do with built-in Python lists and dictionaries? I don't know much about databases, but I'm thinking that maybe SQLite might be good to use. I really only know about it in concept but haven't worked with it. Thoughts?
Ideally, after several scrapes, I would have easy access to the data to say, see 'people' in 'new york' over time. Or find at what time 'california' had the highest number of people. And then I could plot the data in 1000 different ways! I'd love any guidance or direction here. Thanks a bunch!
You could create a Python class, like this:
class StateStats:
    def __init__(self, country, state, people):
        self.country = country
        self.state = state
        self.people = people

    def update(self):
        # Do whatever your update script is here
        # Except, update the value self.people when it changes
        # Like this: self.people = newPeopleValueAsAVariable
        pass
And then create instances of it like this:
# For each country you have scraped, make a new instance of this class.
# This assumes that the list you gathered is stored in a variable named my_list.
state_stats_list = []
for dictionary in my_list:
    state_stats_list.append(
        StateStats(
            dictionary['country'],
            dictionary['state'],
            dictionary['people']
        )
    )

# Or, instead, you can just create the class instances
# when you scrape the webpage, instead of creating a
# list and then creating another list of classes from that list.
You could also use a database like SQLite, but I think this will be fine for your purpose. Hope this helps!
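If you do want to try the SQLite route later, here's a minimal sketch using the built-in sqlite3 module (the database file, table and column names are placeholders I made up, and my_list is the list of dicts from your scrape):

import sqlite3
from datetime import datetime

conn = sqlite3.connect("scrapes.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS people_counts "
    "(scraped_at TEXT, country TEXT, state TEXT, people INTEGER)"
)

# After each hourly scrape, insert one row per state.
now = datetime.now().isoformat()
for d in my_list:
    conn.execute(
        "INSERT INTO people_counts VALUES (?, ?, ?, ?)",
        (now, d['country'], d['state'], d['people']),
    )
conn.commit()

# Later: 'people' in 'new york' over time.
rows = conn.execute(
    "SELECT scraped_at, people FROM people_counts WHERE state = ?",
    ('new york',),
).fetchall()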
I am learning web scraping using scrapy and having pretty good fun with it. The only problem is I can't save the scraped data in the way I want to.
The code below scrapes reviews from Amazon. How can I make the storing of the data better?
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import csv

class Oneplus6Spider(scrapy.Spider):
    name = 'oneplus6'
    allowed_domains = ['amazon.in']
    start_urls = ['https://www.amazon.in/OnePlus-Silk-White-128GB-Storage/product-reviews/B078BNQ2ZS/ref=cm_cr_arp_d_viewopt_sr?ie=UTF8&reviewerType=all_reviews&filterByStar=positive&pageNumber=1']

    def parse(self, response):
        writer = csv.writer(open('jack.csv', 'w+'))
        opinions = response.xpath('//*[@class="a-size-base a-link-normal review-title a-color-base a-text-bold"]/text()').extract()
        for opinion in opinions:
            yield {'Opinion': opinion}
        reviewers = response.xpath('//*[@class="a-size-base a-link-normal author"]/text()').extract()
        for reviewer in reviewers:
            yield {'Reviewer': reviewer}
        verified = response.xpath('//*[@class="a-size-mini a-color-state a-text-bold"]/text()').extract()
        for verified_buyer in verified:
            yield {'Verified_buyer': verified_buyer}
        ratings = response.xpath('//span[@class="a-icon-alt"]/text()').extract()
        for rating in ratings:
            yield {'Rating': rating[0]}
        model_bought = response.xpath('//a[@class="a-size-mini a-link-normal a-color-secondary"]/text()').extract()
        for model in model_bought:
            yield {'Model': model}
I tried using scrapy's default -o export method and also tried using the csv module.
The data gets stored in a single row. I am very new to the pandas and csv modules and I can't figure out how to store the scraped data in a proper format.
It is storing all the values in one single row, but I want the different values in different rows, e.g. Reviews | Rating | Model.
I just can't figure out how to do it. How can I do it?
In your code you're yielding records of different types: they're all dict objects with a single key, but that key differs from record to record ("Opinion", "Reviewer", etc.).
In Scrapy, exporting data to CSV is handled by CsvItemExporter, whose _write_headers_and_set_fields_to_export method is what matters for your current problem, as the exporter needs to know the list of fields (column names) before writing the first item.
Specifically:
1. It'll first check the fields_to_export attribute (configured by the FEED_EXPORT_FIELDS setting via the feed exporter).
2. If unset:
   a. If the first item is a dict, it'll use all its keys as the column names.
   b. If the first item is a scrapy.Item, it'll use the keys from the item definition.
Thus there are several ways to resolve the problem:
You may define a scrapy.Item class with all the possible keys you need, and yield items of this type in your code (just fill in the one field you need, and leave the others empty, for any specific record).
Or, properly configure the FEED_EXPORT_FIELDS setting and leave the rest of your existing code unchanged (see the sketch below).
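For example, a minimal sketch of the second option (the field names simply mirror the keys your spider already yields):

# In settings.py, or as custom_settings on the spider:
class Oneplus6Spider(scrapy.Spider):
    name = 'oneplus6'
    custom_settings = {
        'FEED_EXPORT_FIELDS': ['Opinion', 'Reviewer', 'Verified_buyer', 'Rating', 'Model'],
    }

With the field list known up front, CsvItemExporter writes all five column headers and each yielded dict lands in its own column; you would still export with scrapy crawl oneplus6 -o jack.csv rather than opening a csv writer yourself.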
I suppose the hints above are sufficient. Please let me know if you need further examples.
One of the easiest ways to fix the CSV data format is to clean the data using Excel Power Query. Follow these steps:
Open the csv file in Excel.
Select all values using Ctrl+A.
Then click on Table from the Insert menu and create a table.
After creating the table, click on Data in the top menu and select From Table.
Now a new Power Query editor window opens.
Select any column and click on Split Column.
From Split Column, select By Delimiter.
Now choose a delimiter such as comma, space, etc.
Final step: select the Advanced option, which offers two choices: split into rows or into columns.
You can do all kinds of data cleaning with Power Query; this is the easiest way to set up the data format according to your needs.
I'm looking for a generic validator module to assist in sanitizing data and, importantly, giving back an error log stating why data has been rejected. I'm working primarily with CSV files, each with an average of 40 columns and about 40,000 rows. A CSV file would have a mixture of personally identifying information, contact information and details about the account the person holds with us.
E.g.
First Name|Last Name|Other Name|Passport Number|Date of Birth|Phone Number|Email Address|Physical Address|Account Number|Invoice Number|Date Opened|Amount Due|Date Due|etc|etc
I need to validate basic stuff like data type, data length, options/choices, ranges, mandatory fields, etc. There are also conditional validations, e.g. if an Amount Due value has been provided, then the Date Due must also be provided; if it hasn't, I raise an error.
Pyvaru provides some basic validation classes. Is it possible to implement both these scenarios, basic validation plus conditional validation, with pyvaru? If yes, how would I structure the validations? Must I create objects, e.g. Identifier objects and then Account objects, in order to use pyvaru?
Pyvaru validates python objects (class instances and collections like dictionaries, lists and so on), so starting from a CSV I would convert each record into a dictionary using the DictReader.
So, given a CSV like:
policyID,statecode,county,eq_site_limit,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
119736,FL,CLAY COUNTY,498960,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0,0,0,30.063936,-81.707664,Residential,Masonry,3
206893,FL,CLAY COUNTY,190724.4,190724.4,190724.4,190724.4,190724.4,192476.78,0,0,0,0,30.089579,-81.700455,Residential,Wood,1
333743,FL,CLAY COUNTY,0,79520.76,0,0,79520.76,86854.48,0,0,0,0,30.063236,-81.707703,Residential,Wood,3
172534,FL,CLAY COUNTY,0,254281.5,0,254281.5,254281.5,246144.49,0,0,0,0,30.060614,-81.702675,Residential,Wood,1
The validation code would be something like:
import csv
from pyvaru import Validator
from pyvaru.rules import MaxLengthRule, MaxValueRule

class CsvRecordValidator(Validator):
    def get_rules(self) -> list:
        record: dict = self.data
        return [
            MaxLengthRule(apply_to=record.get('statecode'),
                          label='State Code',
                          max_length=2),
            MaxValueRule(apply_to=record.get('eq_site_limit'),
                         label='Site limit',
                         max_value=40000),
        ]

with open('sample.csv', 'r') as csv_file:
    reader = csv.DictReader(csv_file)
    row = 0
    for record in reader:
        row += 1
        validator = CsvRecordValidator(record)
        validation = validator.validate()
        if not validation.is_successful():
            print(f'Row {row} did not validate. Details: {validation.errors}')
The example above is just a dumb example of what you can do; specifically, it checks that the "statecode" column has a max length of 2 and that "eq_site_limit" has a max value of 40k.
You can implement your own rules by subclassing the abstract class ValidationRule and implementing the apply() method:
class ContainsHelloRule(ValidationRule):
    def apply(self) -> bool:
        return 'hello' in self.apply_to
It's also possible to negate rules using bitwise operators... for example, using the previous custom rule, which checks that a string must contain "hello", you can write:
~ ContainsHelloRule(apply_to=a_string, label="A test string")
and thus the rule will be valid only if the string DOES NOT contain "hello".
It would also be possible to validate CSV records without dictionary conversion, by validating each line using a PatternRule with a validation regex... but of course you won't know which column value is invalid.
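As for the conditional validation asked about above (Amount Due present implies Date Due required), one approach, just a sketch, is to build the rule list conditionally inside get_rules() with a small custom rule (NotEmptyRule is a name I made up here; adjust the import path to wherever ValidationRule lives in your pyvaru version):

from pyvaru import Validator, ValidationRule

class NotEmptyRule(ValidationRule):
    def apply(self) -> bool:
        return bool(self.apply_to)

class AccountRecordValidator(Validator):
    def get_rules(self) -> list:
        record: dict = self.data
        rules = []
        # Conditional check: if an Amount Due value is present,
        # then Date Due becomes mandatory.
        if record.get('Amount Due'):
            rules.append(NotEmptyRule(apply_to=record.get('Date Due'),
                                      label='Date Due'))
        return rules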