Python: How do I style data scraping from an xlsx file?

Currently I am scraping some data from an xlsx file. My code works, but it looks like a mess, at least to me, and I am unsure whether it follows PEP 8.
from openpyxl import load_workbook
[...]
for row in sheet.iter_rows():
    id = row[0].value
    name = row[1].value
    second_name = row[2].value
    # ignore the following
    # middle_name = row[3].value
    city = row[4].value
    address = row[5].value
    field_x = row[7].value
    field_y = row[10].value
    some_function_to_save_to_database(id, name, second_name, ...)
etc. (Please note that I do extra validation on some of those values.)
So it works, but it feels a bit "clunky". Obviously I could pass the values directly to the function, making it some_function_to_save_to_database(row[0].value, row[1].value, ...), but is that any better? It feels like I lose a lot of readability that way.
So my question is: Is this a good approach, or should I map the field names to the row order? What is the proper way to style this kind of scraping?

Your code does not violate PEP 8. However, it is a little cumbersome, and it is not easy to maintain if the data changes. Maybe you can try:
DATA_INDEX_MAP = {
    'id': 0,
    'name': 1,
    'second_name': 2,
    'city': 4,
    'address': 5,
    'field_x': 7,
    'field_y': 10,
}

def get_data_from_row(row):
    return {key: row[index].value for key, index in DATA_INDEX_MAP.items()}

for row in sheet.iter_rows():
    data = get_data_from_row(row)
    some_function_to_save_to_database(**data)
Then, when the layout changes, all you need to modify is DATA_INDEX_MAP.
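As a runnable sketch of how the question's per-field validation could plug into this map (FakeCell, VALIDATORS, and the sample values are made up for illustration; real openpyxl cells expose .value the same way):

```python
# Hypothetical stand-in for an openpyxl cell, just to make the sketch runnable.
class FakeCell:
    def __init__(self, value):
        self.value = value

DATA_INDEX_MAP = {
    'id': 0,
    'name': 1,
    'second_name': 2,
    'city': 4,
    'address': 5,
}

# Optional per-field validators/converters (names and rules are made up).
VALIDATORS = {
    'id': int,
    'name': str.strip,
}

def get_data_from_row(row):
    data = {key: row[index].value for key, index in DATA_INDEX_MAP.items()}
    for key, validate in VALIDATORS.items():
        data[key] = validate(data[key])
    return data

row = [FakeCell(v) for v in ['7', ' Alice ', 'Smith', None, 'Springfield', 'Main St 1']]
data = get_data_from_row(row)
# data == {'id': 7, 'name': 'Alice', 'second_name': 'Smith',
#          'city': 'Springfield', 'address': 'Main St 1'}
```

This keeps both the column layout and the validation rules in one place per field.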

A lighter alternative to the dict in LiuChang's answer:
from operator import itemgetter

get_data = itemgetter(0, 1, 2, 4, 5, 7, 10)

for row in sheet.iter_rows():
    data = [x.value for x in get_data(row)]
    some_function_to_save_to_database(*data)
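For reference, itemgetter works on any indexable sequence, so its behaviour can be checked on a plain list (the sample row here is made up):

```python
from operator import itemgetter

# itemgetter(i, j, ...) builds a callable that returns a tuple of the
# items at those indices, in the order given, skipping everything else.
get_data = itemgetter(0, 1, 2, 4)
row = ['id', 'name', 'second_name', 'skipped', 'city']
picked = get_data(row)
# picked == ('id', 'name', 'second_name', 'city')
```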

Related

Pandas DF.output write to columns (current data is written all to one row or one column)

I am using Selenium to extract data from the HTML body of a webpage and am writing the data to a .csv file using pandas.
The data is extracted and written to the file; however, I would like to control the formatting so the data is written to specified columns. After reading many threads and docs I am still not able to understand how to do this.
The current CSV file output is as follows, with all data in one row or one column:
0,
B09KBFH6HM,
dropdownAvailable,
90,
1,
B09KBNJ4F1,
dropdownAvailable,
100,
2,
B09KBPFPCL,
dropdownAvailable,
110
or, if I use the [count] count += 1 method, it will be one row:
0,B09KBFH6HM,dropdownAvailable,90,1,B09KBNJ4F1,dropdownAvailable,100,2,B09KBPFPCL,dropdownAvailable,110
I would like the output to be formatted as follows:
/col1 /col2 /col3 /col4
0, B09KBFH6HM, dropdownAvailable, 90,
1, B09KBNJ4F1, dropdownAvailable, 100,
2, B09KBPFPCL, dropdownAvailable, 110
I have tried using the columns= option but get errors in the terminal, and I don't understand from the append docs which feature I should be using to achieve this:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html?highlight=append#pandas.DataFrame.append
A simplified version is as follows
from selenium import webdriver
import pandas as pd

price = []
driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")
select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]
for element in options:
    price.append(element.get_attribute("value"))
    price.append(element.get_attribute("class"))
    price.append(element.get_attribute("data-a-html-content"))
output = pd.DataFrame(price)
output.to_csv("Data.csv", encoding='utf-8-sig')
driver.close()
Do I need to parse each item separately and append it?
I would like each of the .get_attribute values to be written to a new column.
Is there any advice you can offer for a solution, as I am not very proficient with pandas? Thank you for your help.
An approach similar to #user17242583's, but a little shorter:
data = [[e.get_attribute("value"), e.get_attribute("class"), e.get_attribute("data-a-html-content")] for e in options]

df = pd.DataFrame(data, columns=['ASIN', 'dropdownAvailable', 'size'])  # the third column is maybe the product size
df.to_csv("Data.csv", encoding='utf-8-sig')
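A minimal, self-contained check of this list-of-lists approach, with hard-coded attribute tuples standing in for what Selenium would return (all values here are made up):

```python
import pandas as pd

# Hypothetical attribute values; the real script gets these from
# element.get_attribute(...) calls on each <option>.
options = [
    ("B09KBFH6HM", "dropdownAvailable", "90"),
    ("B09KBNJ4F1", "dropdownAvailable", "100"),
]
data = [list(t) for t in options]
df = pd.DataFrame(data, columns=["value", "class", "data-a-html-content"])
csv_text = df.to_csv(index=False)
# One row per option, one column per attribute.
```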
Adding all your items to the price list is going to cause them all to end up in one column. Instead, store a separate list for each column, in a dict, like this (name them whatever you want):
data = {
    'values': [],
    'classes': [],
    'data_a_html_contents': [],
}
...
for element in options:
    data['values'].append(element.get_attribute("value"))
    data['classes'].append(element.get_attribute("class"))
    data['data_a_html_contents'].append(element.get_attribute("data-a-html-content"))
...
output = pd.DataFrame(data)
output.to_csv("Data.csv", encoding='utf-8-sig')
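The dict-of-lists version can be sanity-checked the same way with mocked values (the data here is made up; in the real script the loop above fills the lists):

```python
import pandas as pd

# Mocked attribute lists standing in for the Selenium results.
data = {
    'values': ["B09KBFH6HM", "B09KBNJ4F1"],
    'classes': ["dropdownAvailable", "dropdownAvailable"],
    'data_a_html_contents': ["90", "100"],
}
output = pd.DataFrame(data)
# Each dict key becomes a column header; each list supplies that column's rows.
```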
You were collecting the value, class, and data-a-html-content attributes and appending them all to the same list, price. Hence, the list becomes:
price = [value1, class1, data-a-html-content1, value2, class2, data-a-html-content2, ...]
and within the dataframe everything ends up in a single column.
Solution
To get value, class, and data-a-html-content in separate columns you can adopt either of the two approaches below:
Pass a dictionary to the dataframe.
Pass a list of lists to the dataframe.
While #user17242583 and #h.devillefletcher suggest a dictionary, you can still achieve the same thing using a list of lists, as follows:
values = []
classes = []
data_a_html_contents = []
driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")
select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]
for element in options:
    values.append(element.get_attribute("value"))
    classes.append(element.get_attribute("class"))
    data_a_html_contents.append(element.get_attribute("data-a-html-content"))
df = pd.DataFrame(data=list(zip(values, classes, data_a_html_contents)), columns=['Value', 'Class', 'Data-a-Html-Content'])
df.to_csv("Data.csv", encoding='utf-8-sig')
References
You can find a couple of relevant detailed discussions in:
Selenium: Web-Scraping Historical Data from Coincodex and transform into a Pandas Dataframe
Python Selenium: How do I print the values from a website in a text file?

How to insert dictionaries as values into a dictionary using loop on python

I am currently facing a problem making my csv data into a dictionary.
I have 3 columns that I'd like to use in the file:
userID, placeID, rating
U1000, 12222, 3
U1000, 13333, 2
U1001, 13333, 4
I would like to make the result look like this:
{'U1000': {'12222': 3, '13333': 2},
'U1001': {'13333': 4}}
That is to say,
I would like to make my data structure look like:
sample = {}
sample["U1000"] = {}
sample["U1001"] = {}
sample["U1000"]["12222"] = 3
sample["U1000"]["13333"] = 2
sample["U1001"]["13333"] = 4
but I have a lot of data to be processed.
I'd like to get the result with a loop, but I have tried for 2 hours and failed.
---the following code may confuse you---
My result looks like this now:
{'U1000': ['12222', 3],
 'U1001': ['13333', 4]}
The value of the dict is a list rather than a dictionary, and the user "U1000" appears multiple times in the data but only once in my result.
I think my code has many mistakes; if you don't mind, please take a look:
reader = np.array(pd.read_csv("rating_final.csv"))
included_cols = [0, 1, 2]
sample = {}
target = []
target1 = []
for row in reader:
    content = list(row[i] for i in included_cols)
    target.append(content[0])
    target1.append(content[1:3])
sample = dict(zip(target, target1))
How can I improve the code?
I have looked through Stack Overflow, but due to a personal lack of ability I couldn't solve this. Can anyone please kindly help me with it?
Many thanks!!
This should do what you want:
import collections

reader = ...
sample = collections.defaultdict(dict)
for user_id, place_id, rating in reader:
    rating = int(rating)
    sample[user_id][place_id] = rating

print(sample)
# -> {'U1000': {'12222': 3, '13333': 2}, 'U1001': {'13333': 4}}
defaultdict is a convenience utility that provides default values whenever you try to access a key that is not in the dictionary. If you don't like it (for example because you want sample['non-existent-user-id'] to fail with KeyError), use this:
reader = ...
sample = {}
for user_id, place_id, rating in reader:
    rating = int(rating)
    if user_id not in sample:
        sample[user_id] = {}
    sample[user_id][place_id] = rating
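With the question's three sample rows inlined in place of the csv reader, the whole loop can be checked end to end:

```python
# Sample rows standing in for the csv reader from the question.
reader = [
    ("U1000", "12222", "3"),
    ("U1000", "13333", "2"),
    ("U1001", "13333", "4"),
]
sample = {}
for user_id, place_id, rating in reader:
    rating = int(rating)
    if user_id not in sample:
        sample[user_id] = {}
    sample[user_id][place_id] = rating
# sample == {'U1000': {'12222': 3, '13333': 2}, 'U1001': {'13333': 4}}
```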
You can get {'U1000': {'12222': 3, '13333': 2}, 'U1001': {'13333': 4}} with a dict of dicts:
sample = {}
for row in reader:
    userID, placeID, rating = row[:3]
    sample.setdefault(userID, {})[placeID] = rating  # Possibly int(rating)?
Alternatively, use collections.defaultdict(dict) to avoid the need for setdefault (or alternate approaches involving try/except KeyError or if userID in sample:, which sacrifice the atomicity of setdefault in exchange for not creating empty dicts unnecessarily):
import collections

sample = collections.defaultdict(dict)
for row in reader:
    userID, placeID, rating = row[:3]
    sample[userID][placeID] = rating

# Optional conversion back to plain dict
sample = dict(sample)
The conversion back to plain dict ensures future lookups don't auto-vivify keys, raising KeyError as normal, and it looks like a normal dict if you print it.
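The difference the conversion makes can be seen directly on a throwaway example:

```python
from collections import defaultdict

dd = defaultdict(dict)
dd['U1000']['12222'] = 3
_ = dd['new-user']   # merely reading a missing key auto-vivifies an empty dict
plain = dict(dd)     # plain dicts don't do that
try:
    plain['missing']
    raised = False
except KeyError:
    raised = True
# 'new-user' was auto-created in dd (and survives the copy);
# 'missing' raises KeyError on the plain dict, as normal.
```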
If included_cols is important (because names or column indices might change), you can use operator.itemgetter to speed up and simplify extracting all the desired columns at once:
from collections import defaultdict
from operator import itemgetter

included_cols = (0, 1, 2)
# If the columns in the data were actually:
#   rating, foo, bar, userID, placeID
# we'd do this instead, and itemgetter would handle all the rest:
# included_cols = (3, 4, 0)
get_cols = itemgetter(*included_cols)  # Create a function that gets all needed indices at once

sample = defaultdict(dict)
# map(get_cols, ...) efficiently converts each row to a tuple of just the
# three desired values as it goes, which also lets us unpack directly in
# the for loop, simplifying the code even more by naming all variables directly
for userID, placeID, rating in map(get_cols, reader):
    sample[userID][placeID] = rating  # Possibly int(rating)?

Appending to a list from an iterated dictionary

I've written a piece of script that I'm sure can be condensed. What I'm trying to achieve is an automatic version of this:
file1 = tkFileDialog.askopenfilename(title='Select the first data file')
file2 = tkFileDialog.askopenfilename(title='Select the first data file')
TurnDatabase = tkFileDialog.askopenfilename(title='Select the turn database file')
headers = pd.read_csv(file1, nrows=1).columns
data1 = pd.read_csv(file1)
data2 = pd.read_csv(file2)
This is how the data is collected.
There are many more lines of code which focus on picking out bits of the data; I'm not going to post them all.
This is what I'm trying to condense:
EntrySummary = []
for key in Entries1.viewkeys():
    MeanFRH = Entries1[key].hRideF.mean()
    MeanFRHC = Entries1[key].hRideFCalc.mean()
    MeanRRH = Entries1[key].hRideR.mean()
    # There's 30 more lines of these...
    # Then the list is updated with this:
    EntrySummary.append({'Turn Number': key, 'Avg FRH': MeanFRH, 'Avg FRHC': MeanFRHC, 'Avg RRH': MeanRRH, ...})  # and so on
EntrySummary = pd.DataFrame(EntrySummary)
EntrySummary.index = EntrySummary['Turn Number']
del EntrySummary['Turn Number']
This is the old code. What I've tried to do is this:
EntrySummary = []
for i in headers():
    EntrySummary.append({'Turn Number': key, str('Avg '[i]): str('Mean'[i])})
print EntrySummary
# The print is only there for me to see if it's worked.
However I'm getting this error at the minute:
for i in headers():
TypeError: 'Index' object is not callable
Any ideas as to where I've made a mistake? I've probably made a few...
Thank you in advance
Oli
If I'm understanding your situation correctly, you want to replace the long series of assignments in the "old code" you've shown with another loop that processes all of the different items automatically using the list of headers from your data files.
I think this is what you want:
EntrySummary = []
for key, value in Entries1.viewitems():
    entry = {"Turn Number": key}
    for header in headers:
        entry["Avg {}".format(header)] = getattr(value, header).mean()
    EntrySummary.append(entry)
You might be able to come up with some better variable names, since you know what the keys and values in Entries1 are (I do not, so I used generic names).
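A self-contained sketch of that loop, with a made-up Entries1 (two small per-turn DataFrames) so the output can be inspected; the column names are hypothetical, and .items() stands in for Python 2's .viewitems():

```python
import pandas as pd

# Hypothetical per-turn data: each turn maps to a DataFrame whose
# columns expose .mean(), like the question's Entries1.
Entries1 = {
    1: pd.DataFrame({"hRideF": [1.0, 3.0], "hRideR": [2.0, 4.0]}),
    2: pd.DataFrame({"hRideF": [5.0, 7.0], "hRideR": [6.0, 8.0]}),
}
headers = ["hRideF", "hRideR"]

EntrySummary = []
for key, value in Entries1.items():  # .viewitems() on Python 2
    entry = {"Turn Number": key}
    for header in headers:
        entry["Avg {}".format(header)] = getattr(value, header).mean()
    EntrySummary.append(entry)
# EntrySummary[0] == {'Turn Number': 1, 'Avg hRideF': 2.0, 'Avg hRideR': 3.0}
```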

Removing space in dataframe python

I am getting an error in my code because I tried to make a dataframe by calling an element from a csv. I have two columns I call from a file: CompanyName and QualityIssue. There are three types of quality issue: Equipment Quality, User, and Neither. I run into problems trying to refer to the dataframe column df.Equipment Quality, which obviously doesn't work because there is a space there. I want to take Equipment Quality from the original file and replace the space with an underscore.
input:
Top Calling Customers, Equipment Quality, User, Neither,
Customer 3, 2, 2, 0,
Customer 1, 0, 2, 1,
Customer 2, 0, 1, 0,
Customer 4, 0, 1, 0,
Here is my code:
import numpy as np
import pandas as pd
import pandas.util.testing as tm; tm.N = 3

# Get the data.
data = pd.DataFrame.from_csv('MYDATA.csv')
# Group the data by the CompanyName and QualityIssue columns.
byqualityissue = data.groupby(["CompanyName", "QualityIssue"]).size()
# Make a pandas dataframe of the grouped data.
df = pd.DataFrame(byqualityissue)
# Change the formatting of the data to match what I want SpiderPlot to read.
formatted = df.unstack(level=-1)[0]
# Replace NaN values with zero.
formatted[np.isnan(formatted)] = 0
includingtotals = pd.concat([formatted, pd.DataFrame(formatted.sum(axis=1),
                                                     columns=['Total'])], axis=1)
sortedtotal = includingtotals.sort_index(by=['Total'], ascending=[False])
sortedtotal.to_csv('byqualityissue.csv')
This seems to be a frequently asked question, and I tried lots of the solutions, but they didn't seem to work. Here is what I tried:
with open('byqualityissue.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    return [[x.strip() for x in row] for row in reader]
sentence.replace(" ", "_")
And
sortedtotal['QualityIssue'] = sortedtotal['QualityIssue'].map(lambda x: x.rstrip(' '))
And what I thought was the most promising, from http://pandas.pydata.org/pandas-docs/stable/text.html:
formatted.columns = formatted.columns.str.strip().str.replace(' ', '_')
but I got this error: AttributeError: 'Index' object has no attribute 'str'
Thanks for your help in advance!
Try:
formatted.columns = [x.strip().replace(' ', '_') for x in formatted.columns]
As I understand your question, the following should work (test it out with inplace=False to see how it looks first if you want to be careful):
sortedtotal.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
And if you have white space surrounding the column names, like: "This example "
sortedtotal.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
which strips leading/trailing whitespace, then converts internal spaces to "_".
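A quick check of the rename approach on a throwaway frame (the column names here are made up to mimic the question's data):

```python
import pandas as pd

# A column with surrounding whitespace and an internal space.
df = pd.DataFrame({" Equipment Quality ": [2, 0], "User": [2, 2]})
df = df.rename(columns=lambda x: x.strip().replace(" ", "_"))
# Columns are now 'Equipment_Quality' and 'User', so attribute
# access like df.Equipment_Quality works.
total = df.Equipment_Quality.sum()
```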

App engine: generating a search index. Does anyone have simpler code that breaks down how to build an index?

I went through the documentation and tried to follow the code examples on github, but I'm still confused.
Is this the procedure?
1) Generate an index:
index = search.Index(name = "geoSearch")
2) Define fields:
ID = #my ID
geopoint = #a lat long coordinate
fields = [
    search.TextField(name = "ID", value = ID),
    search.GeoField(name = "location", value = geopoint)]
3) Create a document to store fields:
doc = search.Document(fields = fields)
4) Then I'll iterate through, and add "fields" to my document one at a time like so:
search.Index(name = "geoSearch").add(doc)
And once I finish iterating through, then I'll have a search index? Does this seem reasonable? Thanks.
Note that you also need an index configuration file:
https://developers.google.com/appengine/docs/python/config/indexconfig
