Storing Scraped data in csv - python

I am learning web scraping using Scrapy and having a lot of fun with it. The only problem is that I can't save the scraped data the way I want to.
The code below scrapes reviews from Amazon. How can I store the data in a better way?
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import csv

class Oneplus6Spider(scrapy.Spider):
    name = 'oneplus6'
    allowed_domains = ['amazon.in']
    start_urls = ['https://www.amazon.in/OnePlus-Silk-White-128GB-Storage/product-reviews/B078BNQ2ZS/ref=cm_cr_arp_d_viewopt_sr?ie=UTF8&reviewerType=all_reviews&filterByStar=positive&pageNumber=1']

    def parse(self, response):
        writer = csv.writer(open('jack.csv', 'w+'))
        opinions = response.xpath('//*[@class="a-size-base a-link-normal review-title a-color-base a-text-bold"]/text()').extract()
        for opinion in opinions:
            yield {'Opinion': opinion}
        reviewers = response.xpath('//*[@class="a-size-base a-link-normal author"]/text()').extract()
        for reviewer in reviewers:
            yield {'Reviewer': reviewer}
        verified = response.xpath('//*[@class="a-size-mini a-color-state a-text-bold"]/text()').extract()
        for verified_buyer in verified:
            yield {'Verified_buyer': verified_buyer}
        ratings = response.xpath('//span[@class="a-icon-alt"]/text()').extract()
        for rating in ratings:
            yield {'Rating': rating[0]}
        model_bought = response.xpath('//a[@class="a-size-mini a-link-normal a-color-secondary"]/text()').extract()
        for model in model_bought:
            yield {'Model': model}
I tried Scrapy's default -o export option and also tried the csv module. The data gets stored in a single row. I am very new to the pandas and csv modules and I can't figure out how to store the scraped data in a proper format.
It stores all the values in one single row, but I want the different values on different rows, e.g. Reviews|Rating|Model. I just can't figure out how to do it. How can I do it?

Looking at your code, you're yielding records of different shapes: they're all dict objects with a single key, but that key differs from record to record ("Opinion", "Reviewer", etc.).
In Scrapy, exporting data to CSV is handled by CsvItemExporter, whose _write_headers_and_set_fields_to_export method is what matters for your problem: the exporter needs to know the full list of fields (the column names) before it writes the first item.
Specifically:
1. It will first check the fields_to_export attribute (configured by the FEED_EXPORT_FIELDS setting via the feed exporter).
2. If that is unset:
   a. If the first item is a dict, it will use all of its keys as the column names.
   b. If the first item is a scrapy.Item, it will use the keys from the item definition.
Thus there are several ways to resolve the problem:
You may define a scrapy.Item class with all possible keys you need, and yield items of this type in your code (just fill in the one field you need, and leave the others empty, for any specific record).
Or, properly configure the FEED_EXPORT_FIELDS setting and leave the rest of your existing code unchanged, as shown in the sketch below.
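A minimal sketch of the second option, added to the existing spider (the field names are simply the dict keys the spider already yields; custom_settings keeps the setting local to this spider):
import scrapy

class Oneplus6Spider(scrapy.Spider):
    name = 'oneplus6'
    # Declare the column order up front so the CSV exporter knows every
    # field before the first item is written.
    custom_settings = {
        'FEED_EXPORT_FIELDS': ['Opinion', 'Reviewer', 'Verified_buyer', 'Rating', 'Model'],
    }
    # keep parse() exactly as it is, then export with:
    #     scrapy crawl oneplus6 -o jack.csv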
I hope the hints and the sketch above are sufficient. Please let me know if you need further examples.

One of the easiest ways to fix up the CSV data format is to clean the data with Excel Power Query. Follow these steps:
1. Open the csv file in Excel.
2. Select all values using Ctrl+A.
3. On the Insert tab, click Table to create a table.
4. After creating the table, click Data in the top menu and select From Table.
5. A new Power Query editor window will open.
6. Select any column and click Split Column.
7. From Split Column, select By Delimiter.
8. Now choose the delimiter (comma, space, etc.).
9. Final step: select the advanced options, where you can choose to split into rows or columns.
You can do all kinds of data cleaning with these Power Query steps; this is the easiest way to set up the data format according to your needs.

Related

How to parse a complex text file using Python string methods or regex and export into tabular form

As the title mentions, my issue is that I don't quite understand how to extract the data I need for my table (the columns for the table I need are Date, Time, Courtroom, File Number, Defendant Name, Attorney, Bond, Charge, etc.).
I think regex is what I need, but my class did not go over this, so I am confused about how to parse the file in order to extract and output the correct data into an organized table...
I am supposed to turn my text file from this
https://pastebin.com/ZM8EPu0p
and export it into a more readable format like this- example output is below
Here is what I have so far.
def readFile(court):
    csv_rows = []
    # read and split txt file into pages & chunks of data by paragraph
    with open(court, "r") as file:
        data_chunks = file.read().split("\n\n")
        for chunk in data_chunks:
            chunk = chunk.strip  # .strip removes useless spaces
            if str(data_chunks[:4]).isnumeric():  # if first 4 characters are digits
                entry = None  # initialize an empty dictionary
            elif (
                str(data_chunks).isspace() and entry
            ):  # if we're on an empty line and the entry dict is not empty
                csv_rows.DictWriter(dialect="excel")  # turn csv_rows into needed output
                entry = {}
            else:
                # parse here?
                print(data_chunks)
    return csv_rows

readFile("/Users/mia/Desktop/School/programming/court.txt")
It is quite a lot of work to achieve that, but it is possible if you split it into a couple of sub-tasks.
First, your input looks like a plain text file, so you can parse it line by line (see https://www.w3schools.com/python/ref_file_readlines.asp).
Then, I noticed that your data can be split into pages. You will need to prepare quite a few regular expressions, but you can start with one that identifies where each page starts (your expressions may get quite complicated, so this is worth reading: https://www.w3schools.com/python/python_regex.asp).
The goal of this step is to collect all lines from a page in some container (it might be a list, a dict, whatever you find suitable).
Afterwards, write some code that parses the information page by page. For simplicity, I suggest starting with something easy, like the columns for "no, file number and defendant".
Once you can collect some data in a reliable manner, you can address the export part using pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
A rough sketch of these steps is below.
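The sketch below only illustrates the structure; the page-start pattern, the file name, and the parse_page helper are placeholders, since the real regular expressions depend on the actual layout of the court file:
import re
import pandas as pd

# Assumed marker for the start of a page; adjust it to the real file layout.
page_start = re.compile(r"^\s*Page\s+\d+", re.IGNORECASE)

def parse_page(lines):
    # Placeholder parser: start with a few easy columns and refine later.
    return {"first_line": lines[0] if lines else ""}

rows = []
current_page = []
with open("court.txt") as f:           # hypothetical path to the text file
    for line in f:                     # stream the file line by line
        if page_start.match(line) and current_page:
            rows.append(parse_page(current_page))
            current_page = []
        current_page.append(line.rstrip("\n"))
if current_page:
    rows.append(parse_page(current_page))

# Export the collected rows once each page yields a dict of column values.
pd.DataFrame(rows).to_excel("court_output.xlsx", index=False)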

Python Docx Module merges tables when added subsequently to document

I'm using the python-docx module with Python 3.9.0 to create Word docx files. The problem I have is the following:
A) I defined a table style named my_table_style
B) I open my template, add one table of that style to my document object and then I store the created file with the following code:
import os
from docx import Document
template_path = os.path.realpath(__file__).replace("test.py","template.docx")
my_file = Document(template_path)
my_file.add_table(1,1,style="my_table_style").rows[-1].cells[0].paragraphs[0].add_run("hello")
my_file.save(template_path.replace("template.docx","test.docx"))
When I now open test.docx, it's all good, there's one table with one row saying "hello".
NOW, when I use this syntax to create two of these tables:
import os
from docx import Document
template_path = os.path.realpath(__file__).replace("test.py","template.docx")
my_file = Document(template_path)
my_file.add_table(1,1,style="my_table_style").rows[-1].cells[0].paragraphs[0].add_run("hello")
my_file.add_table(1,1,style="my_table_style").rows[-1].cells[0].paragraphs[0].add_run("hello")
my_file.save(template_path.replace("template.docx","test.docx"))
Instead of getting two tables, each with one row saying "hello", I get one single table with two rows, each saying "hello". The formatting is however correct, according to my_table_style, so it seems that python-docx merges two subsequently added tables of the same table style. Is this normal behavior? How can I avoid that?
Cheers!
HINTS:
When I use print(len(my_file.tables)) to print the number of tables present in my_file, I actually get "2"! Also, when I change the style used in the second add_table line, it all works fine, so this seems to be related to using the same style. Any ideas, anyone?
Alright, so I figured it out: it seems to be default behaviour of Word to do what is described above. I manually created a table style my_custom_style in the template.docx file, where I customized the table border lines etc. so that a single table has the format I want, as if it were two tables.
Instead of using two add_table() statements, I then used:
new_table = my_file.add_table(1,1,style = "my_custom_style")
first_row = new_table.rows[-1]
second_row = new_table.add_row()
(You can access table styles defined in your template via python-docx simply by using the name of the table style you created manually in the Word template file you open as your Document object. Just make sure you tick the "add this table style to the word template" option when saving the style in Word, and it should all work.) Everything is working now; a fuller sketch is below.
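Putting it together, a minimal sketch of the workaround (same paths as in the question, with my_custom_style assumed to exist in template.docx):
import os
from docx import Document

template_path = os.path.realpath(__file__).replace("test.py", "template.docx")
my_file = Document(template_path)

# One table with two rows, instead of two adjacent tables that Word would
# display as a single merged table because they share the same style.
new_table = my_file.add_table(1, 1, style="my_custom_style")
new_table.rows[0].cells[0].paragraphs[0].add_run("hello")
second_row = new_table.add_row()
second_row.cells[0].paragraphs[0].add_run("hello")

my_file.save(template_path.replace("template.docx", "test.docx"))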

Pythonic way to solve a text normalization task

Basically, I have a Hive script file, from which I need to extract the names for all the tables created. For example, from the contents
...
create table Sales ...
...
create external table Persons ...
...
Sales and Persons should be extracted. To accomplish this, my basic idea is like:
Search for key phrases create table and create external table,
Extract the next token which should be the table name.
However, the input may not be canonical. For example,
Tab/newline may be used along with space as token delimiter
There may be multiple consecutive delimiters between tokens
Mixed use of upper and lower case letters like create TABLE
Therefore, I'm thinking about first normalizing the input to a canonical form before applying the basic algorithm. Then, with some effort, I came up with the following:
' '.join(input.split()).lower()
As a Python newcomer, I'm wondering whether this is the Pythonic way to solve the problem, or whether it may be flawed in the first place. Is there a simple way to do this in a streaming fashion, i.e., avoiding loading the whole input into memory at once?
Like some comments stated, regex is a neat and easy way to get what you want. If you don't mind getting lowercase results, this one should work:
import re
my_str = """
...
create table Sales ...
create TabLE
test
create external table Persons ...
...
"""
pattern = r"table\s+(\w+)\b"
items = re.findall(pattern, my_str.lower())
print(items)
It captures the next word after "table " (followed by at least one whitespace / newline).
To get the original case of the table names:
for x, item in enumerate(items):
    i = my_str.lower().index(item)
    items[x] = my_str[i:i+len(item)]
print(items)
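Regarding the streaming concern from the question, here is a hedged sketch that applies a similar pattern line by line with re.IGNORECASE, so the whole file never has to be loaded at once. It assumes the table name sits on the same line as its create statement (the multi-line example above would be missed), and the file name is hypothetical:
import re

pattern = re.compile(r"create\s+(?:external\s+)?table\s+(\w+)", re.IGNORECASE)

table_names = []
with open("script.hql") as f:      # hypothetical Hive script file
    for line in f:                 # stream line by line instead of reading it all
        table_names.extend(pattern.findall(line))

print(table_names)                 # the original case of the names is preserved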

Python ddt unittest select specific fields from test data

I'm in the process of building data driven tests in Python using unittest and ddt.
Is there a way for me to be able to select specific fields from the test data instead of having to pass all the fields as separate parameters?
For example:
I have a csv file containing customers as below:
Title,FirstName,Surname,AddressA,AddressB,AddressC,City,State,PostCode
Mr,Bob,Gergory,44 road end,A Town,Somewhere,LOS ANGELES,CA,90004
Miss,Alice,Woodrow,99 some street,Elsewhere,A City,LOS ANGELES,CA,90003
From this I'd like to be able to select just the first name, city and state in the test.
I can do this as below; however, this seems messy and will only get worse with wider files:
@data(get_test_data("customers.csv"))
@unpack
def test_create_new_customer(self, Title, FirstName, Surname, AddressA, AddressB, AddressC, City, State, PostCode):
    self.customer.enter_first_name(FirstName)
    self.customer.enter_city(City)
    self.customer.enter_state_code(State)
    self.customer.click_update()
I was hoping to be able to build a dictionary list out of the csv and then access it as below:
@data(get_test_data_as_dictionary("customers.csv"))
@unpack
def test_create_new_customer(self, test_data):
    self.customer.enter_first_name(test_data["FirstName"])
    self.customer.enter_city(test_data["City"])
    self.customer.enter_state_code(test_data["State"])
    self.customer.click_update()
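For reference, the get_test_data_as_dictionary helper imagined above could be a thin wrapper around csv.DictReader; this is only a sketch of the intent, with the name and usage taken from the snippet above:
import csv

def get_test_data_as_dictionary(file_name):
    # Hypothetical helper: read each CSV row into a dict keyed by the header row.
    with open(file_name, newline="") as f:
        return list(csv.DictReader(f))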
However, it would seem that ddt is smarter than I thought: it breaks out the data from the dictionary and still expects all the parameters to be declared.
Is there a better way to achieve what I'm after?

Search a single column for a particular value in a CSV file and return an entire row

Issue
The code does not correctly identify the input (item). It simply dumps to my failure message even if such a value exists in the CSV file. Can anyone help me determine what I am doing wrong?
Background
I am working on a small program that asks for user input (function not given here), searches a specific column in a CSV file (Item) and returns the entire row. The CSV data format is shown below. I have shortened the data from the actual amount (49 field names, 18000+ rows).
Code
import csv
from collections import namedtuple
from contextlib import closing

def search():
    item = 1000001
    raw_data = 'active_sanitized.csv'
    failure = 'No matching item could be found with that item code. Please try again.'
    check = False
    with closing(open(raw_data, newline='')) as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        while check == False:
            for row in map(item_data._make, read_data):
                if row.Item == item:
                    return row
                else:
                    return failure
CSV structure
active_sanitized.csv
Item;Name;Cost;Qty;Price;Description
1000001;Name here:1;1001;1;11;Item description here:1
1000002;Name here:2;1002;2;22;Item description here:2
1000003;Name here:3;1003;3;33;Item description here:3
1000004;Name here:4;1004;4;44;Item description here:4
1000005;Name here:5;1005;5;55;Item description here:5
1000006;Name here:6;1006;6;66;Item description here:6
1000007;Name here:7;1007;7;77;Item description here:7
1000008;Name here:8;1008;8;88;Item description here:8
1000009;Name here:9;1009;9;99;Item description here:9
Notes
My experience with Python is relatively little, but I thought this would be a good problem to start with in order to learn more.
I determined the methods to open (and wrap in a close function) the CSV file, read the data via DictReader (to get the field names), and then create a named tuple to be able to quickly select the desired columns for the output (Item, Cost, Price, Name). Column order is important, hence the use of DictReader and namedtuple.
While there is the possibility of hard-coding each of the field names, I felt that if the program can read them on file open, it would be much more helpful when working on similar files that have the same column names but different column organization.
Research
CSV Header and named tuple:
What is the pythonic way to read CSV file data as rows of namedtuples?
Converting CSV data to tuple: How to split a CSV row so row[0] is the name and any remaining items are a tuple?
There were additional links of research, but I cannot post more than two.
You have three problems with this:
You return on the first failure, so it will never get past the first line.
You are reading strings from the file, and comparing to an int.
_make iterates over the dictionary keys, not the values, producing the wrong result (item_data(Item='Name', Name='Price', Cost='Qty', Qty='Item', Price='Cost', Description='Description')).
for row in (item_data(**data) for data in read_data):
    if row.Item == str(item):
        return row
return failure
This fixes the issues at hand - we check against a string, and we only return if none of the items matched (although you might want to begin converting the strings to ints in the data rather than this hackish fix for the string/int issue).
I have also changed the way you are looping - using a generator expression makes for a more natural syntax, using the normal construction syntax for named attributes from a dict. This is cleaner and more readable than using _make and map(). It also fixes problem 3.
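If you'd rather do the comparison as integers, as hinted at above, one possible sketch (assuming every Item value in the CSV is numeric) is to convert the field while building each row:
for data in read_data:
    data['Item'] = int(data['Item'])  # convert the CSV string to int once per row
    row = item_data(**data)
    if row.Item == item:
        return row
return failure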
