I am scraping some data from a website. I have lists of names and prices from different pages. I want to column-stack the two lists for each page, and then build an array that contains the data for every page as rows. However, np.insert gives the following error:
ValueError: could not convert string to float
Here is the code:
import numpy as np

finallist = []
...
for c in range(0, 10):
    narr = np.array(names)
    parr = np.array(prices)
    nparr = np.array(np.column_stack((narr, parr)))
    finallist = np.insert(finallist, c, nparr)
What I want to accomplish is the following:
finallist = [[[name1, price1], [name2, price2], ...],  # from page one
             [[name1, price1], [name2, price2], ...],  # from page two
             ...]                                      # from page x
so that I will be able to reach all the data.
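For reference, a minimal sketch of one way to get that nested structure: append each page's stacked array to a plain Python list instead of calling np.insert (np.insert converts the empty list to a float array and then fails to cast the name strings, which is the error above). The names and prices below are placeholders standing in for whatever the scraper returns per page.

import numpy as np

finallist = []  # plain Python list, one entry per page

for page in range(10):
    # placeholder data for this page; real code would use the scraped lists
    names = ["name1", "name2", "name3"]
    prices = ["1.99", "2.49", "3.10"]

    nparr = np.column_stack((np.array(names), np.array(prices)))  # shape (3, 2)
    finallist.append(nparr)  # keep each page as its own 2-D array

# finallist[0] is the (name, price) table for page one, finallist[1] for page two, ...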
In the following code, I filter a dataset and obtain statistics for each subset using stats.linregress(x, y). I would like to merge the resulting lists into one table and then convert it to CSV. How do I merge the lists? I tried .append(), but it adds [...] at the end of each list. And how do I write these lists into one CSV? The code below only converts the last list to CSV. Also, where is the appropriate place to apply a .2f format to shorten the digits? Many thanks!
for i in df.ingredient.unique():
    mask = df.ingredient == i
    x_data = df.loc[mask]["single"]
    y_data = df.loc[mask]["total"]
    ing = df.loc[mask]["ingredient"]
    res = stats.linregress(x_data, y_data)
    result_list = list(res)
    #sum_table = result_list.append(result_list)
    sum_table = result_list
    np.savetxt("sum_table.csv", sum_table, delimiter=',')
    #print(f"{i} = res")
    #print(f"{i} = result_list")
output:
[0.555725080482033, 15.369647540612188, 0.655901508882146, 0.34409849111785396, 0.45223586826559015, [...]]
[0.8240446598271236, 16.290731244189164, 0.7821893273053173, 0.00012525348188386877, 0.16409500805404134, [...]]
[0.6967783360917531, 25.8981921144781, 0.861561500951743, 0.13843849904825695, 0.29030899523536124, [...]]
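A minimal sketch of one way this could be done, assuming the same df, stats, and np objects as in the snippet above: collect one row per ingredient into a list of lists, then write the whole table once after the loop. The fmt="%.2f" argument is where the digits get shortened to two decimals.

sum_table = []
for i in df.ingredient.unique():
    mask = df.ingredient == i
    res = stats.linregress(df.loc[mask]["single"], df.loc[mask]["total"])
    # one row per ingredient: slope, intercept, rvalue, pvalue, stderr
    sum_table.append(list(res))

# write all rows at once; fmt="%.2f" rounds every value to two decimal places
np.savetxt("sum_table.csv", sum_table, delimiter=",", fmt="%.2f")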
I have a dataframe that looks like the following, but with many rows:
import pandas as pd
data = {'intent': ['order_food', 'order_food', 'order_taxi', 'order_call', 'order_call', 'order_call', 'order_taxi'],
        'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab', 'call me at 6', 'she called me', 'order call', 'i would like a new taxi'],
        'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'], ['call', '6'], ['call'], ['order', 'call'], ['new', 'taxi']]}
df = pd.DataFrame(data, columns=['intent', 'Sent', 'key_words'])
I have calculated the Jaccard similarity using the code below (not my solution):
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    return intersection
and modified the code given by @Amit Amola to compare overlapping words between every possible pair of rows, and created a dataframe out of it:
overlapping_word_list = []
for val in list(combinations(range(len(data_new)), 2)):
    overlapping_word_list.append(f"the shared keywords between {data_new.iloc[val[0],0]} and {data_new.iloc[val[1],0]} sentences are: {lexical_overlap(data_new.iloc[val[0],1],data_new.iloc[val[1],1])}")

# creating an overlap dataframe
banking_overlapping_words_per_sent = DataFrame(overlapping_word_list, columns=['overlapping_list'])
@gold_cy's answer helped me, and I made some changes to it to get the output I like:
for intent in df.intent.unique():
    # loc returns a DataFrame but we need just the column
    rows = df.loc[df.intent == intent, ['intent', 'key_words', 'Sent']].values.tolist()
    combos = combinations(rows, 2)
    for combo in combos:
        x, y = rows
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")
The issue is that when there are more instances of the same intent, I run into the error:
ValueError: too many values to unpack (expected 2)
and I do not know how to handle that for the many more examples I have in my dataset.
Do you want this?
from itertools import combinations
from operator import itemgetter

items_to_consider = []
for item in list(combinations(zip(df.Sent.values, map(set, df.key_words.values)), 2)):
    keywords = list(map(itemgetter(1), item))
    intersect = keywords[0].intersection(keywords[1])
    if len(intersect) > 0:
        str_list = list(map(itemgetter(0), item))
        str_list.append(intersect)
        items_to_consider.append(str_list)

for i in items_to_consider:
    for item in i[2]:
        if item in i[0] and item in i[1]:
            print(f"Overlap of intent (order_food) for ({i[0]}) and ({i[1]}) is {item}")
I am scraping England's National Joint Registry data and have the results in the correct format I want when I do one hospital at a time. I eventually want to iterate over all hospitals, but first I decided to make an array of three different hospitals and figure out the iteration.
The code below gives me the correct format of the final results in a pandas DataFrame when I have just one hospital:
import requests
from bs4 import BeautifulSoup
import pandas
import numpy as np

r = requests.get("http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital")
c = r.content
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all(["div"], {"class": "toggle_container"})[1]
i = 0
temp = []
for item in all.find_all("td"):
    if i % 4 == 0:
        temp.append(soup.find_all("span")[4].text)
        temp.append(soup.find_all("h5")[0].text)
    temp.append(all.find_all("td")[i].text.replace(" ", ""))
    i = i + 1
table = np.array(temp).reshape(12, 6)
final = pandas.DataFrame(table)
final
In my iterated version, I cannot figure out a way to append each result set into a final DataFrame:
hosplist = ["http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital",
"http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Barnet%20Hospital",
"http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Altnagelvin%20Area%20Hospital"]
temp2 = []
df_final = pandas.DataFrame()
for item in hosplist:
r=requests.get(item)
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all(["div"],{"class":"toggle_container"})[1]
i=0
temp = []
for item in all.find_all("td"):
if i%4 ==0:
temp.append(soup.find_all("span")[4].text)
temp.append(soup.find_all("h5")[0].text)
temp.append(all.find_all("td")[i].text)
i=i+1
table = np.array(temp).reshape((int(len(temp)/6)),6)
temp2.append(table)
#df_final = pandas.DataFrame(df)
At the end, 'table' has all the data I want, but it's not easy to manipulate, so I want to put it in a DataFrame. However, I am getting a "ValueError: Must pass 2-d input" error.
I think this error is saying that I have 3 arrays, which would make the input 3-dimensional. This is just a practice iteration; there are over 400 hospitals whose data I plan to put into a dataframe, but I am stuck here now.
The simple answer to your question would be HERE.
The tough part was taking your code and finding what was not quite right.
Using your full code, I modified it as shown below. Please copy it and diff it against yours.
import requests
from bs4 import BeautifulSoup
import pandas
import numpy as np

hosplist = ["http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Barnet%20Hospital",
            "http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Altnagelvin%20Area%20Hospital"]

temp2 = []
df_final = pandas.DataFrame()
for item in hosplist:
    r = requests.get(item)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    all = soup.find_all(["div"], {"class": "toggle_container"})[1]
    i = 0
    temp = []
    for item in all.find_all("td"):
        if i % 4 == 0:
            temp.append(soup.find_all("span")[4].text)
            temp.append(soup.find_all("h5")[0].text)
        temp.append(all.find_all("td")[i].text)
        i = i + 1
    table = np.array(temp).reshape((int(len(temp) / 6)), 6)
    for array in table:
        newArray = []
        for x in array:
            try:
                x = x.encode("ascii")
            except:
                x = 'cannot convert'
            newArray.append(x)
        temp2.append(newArray)

df_final = pandas.DataFrame(temp2, columns=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print(df_final)
I tried to use a list comprehension for the ASCII conversion, which was absolutely necessary for the strings to show up in the dataframe, but the comprehension was throwing an error, so I built in an exception handler, and the exception never shows up.
I reorganized the code a little and was able to create the dataframe without having to encode.
Solution:
hosplist = ["http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital",
"http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Barnet%20Hospital",
"http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Altnagelvin%20Area%20Hospital"]
temp = []
temp2 = []
df_final = pandas.DataFrame()
for item in hosplist:
r=requests.get(item)
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all(["div"],{"class":"toggle_container"})[1]
i=0
for item in all.find_all("td"):
if i%4 ==0:
temp.append(soup.find_all("span")[4].text)
temp.append(soup.find_all("h5")[0].text)
temp.append(all.find_all("td")[i].text.replace("-","NaN").replace("+",""))
i=i+1
temp2.append(temp)
table = np.array(temp2).reshape((int(len(temp2[0])/6)),6)
df_final = pandas.DataFrame(table, columns=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
df_final
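As an aside, another way to sidestep the "Must pass 2-d input" error is to build one small DataFrame per hospital and stack them with pandas.concat instead of collecting raw NumPy arrays. This is only a rough sketch under the same assumptions as the code above (same URLs, same page layout of one span/h5 header plus groups of four td cells, same placeholder h1..h6 column names):

frames = []
for url in hosplist:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    container = soup.find_all("div", {"class": "toggle_container"})[1]
    hospital = soup.find_all("span")[4].text
    section = soup.find_all("h5")[0].text
    cells = [td.text for td in container.find_all("td")]

    # one row per group of four cells: hospital, section, then the four values
    rows = [[hospital, section] + cells[start:start + 4] for start in range(0, len(cells), 4)]
    frames.append(pandas.DataFrame(rows, columns=['h1', 'h2', 'h3', 'h4', 'h5', 'h6']))

# concat keeps everything 2-D: one row per group of cells, all hospitals stacked
df_final = pandas.concat(frames, ignore_index=True)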
I'm attempting to get the last 5 orders from currency exchanges through their respective JSON APIs. Everything is working except that some coins have fewer than 5 orders (asks/bids), which causes errors in the table write to Excel.
Here is what I have now:
import grequests
import json
import itertools

active_sheet("Livecoin Queries")

urls3 = [
    'https://api.livecoin.net/exchange/order_book?currencyPair=RBIES/BTC&depth=5',
    'https://api.livecoin.net/exchange/order_book?currencyPair=REE/BTC&depth=5',
]

requests = (grequests.get(u) for u in urls3)
responses = grequests.map(requests)

CellRange("B28:DJ48").clear()

def make_column(catalog_response, name):
    column = []
    catalog1 = catalog_response.json()[name]
    quantities1, rates1 = zip(*catalog1)
    for quantity, rate in zip(quantities1, rates1):
        column.append(quantity)
        column.append(rate)
    return column

bid_table = []
ask_table = []
for response in responses:
    try:
        bid_table.append(make_column(response, 'bids'))
        ask_table.append(make_column(response, 'asks'))
    except (KeyError, ValueError, AttributeError):
        continue

Cell(28, 2).table = zip(*ask_table)
Cell(39, 2).table = zip(*bid_table)
I've isolated the list of links down to just two, with the "REE" coin being the problem here.
I've tried:
for i in itertools.izip_longest(*bid_table):
    #Cell(28, 2).table = zip(*ask_table)
    #Cell(39, 2).table = zip(*i)
    print(i)
This prints out nicely in the terminal:
(screenshot: itertools terminal output)
NOTE: As of right now "REE" has zero bid orders, so it ends up creating an empty list:
(screenshot: empty list terminal output)
When printing to Excel I get a lot of strange outputs, none of which resemble what it looks like in the terminal. The way the information is set up in Excel requires it to be written via Cell(X, X).table.
My question is: how do I make zipping uneven lists play nice with tables in DataNitro?
EDIT1:
The problem is arising at catalog_response.json()[name]
def make_column(catalog_response, name):
    column = []
    catalog1 = catalog_response.json()[name]
    #quantities1, rates1 = list(itertools.izip_longest(*catalog1[0:5]))
    print(catalog1)
    #for quantity, rate in zip(quantities1, rates1):
    #    column.append(quantity)
    #    column.append(rate)
    #return column
Since there are zero bids, there is not even an empty list created, which is why I'm unable to zip them together:
ValueError: need more than 0 values to unpack
I suggest that you build the structure myTable that you intend to write back to Excel.
It should be a list of lists:
myTable = []
myRow = []
…build each myRow from your code…
If the list for myRow is too short, pad it with the proper number of None elements; in your case, if len(myRow) is 0, you need to append two None items:
myRow.append(None)
myRow.append(None)
Then add the row to the output table:
myTable.append(myRow)
So when ready, you have a well-formed nn x n table to output via:
Cell(nn, n).table = myTable
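To make the padding idea concrete, here is a rough sketch using the ask_table and bid_table columns from the question; DEPTH and pad_column are my own illustrative names, and izip_longest is the Python 2 spelling (Python 3 calls it itertools.zip_longest):

from itertools import izip_longest  # zip_longest on Python 3

DEPTH = 10  # 5 orders * 2 fields (quantity, rate) per column

def pad_column(column, depth=DEPTH):
    # pad a short (or empty) column with None so every column has the same length
    return list(column) + [None] * (depth - len(column))

# transpose the padded columns into rows, exactly like the zip(*...) calls above
ask_rows = list(izip_longest(*[pad_column(c) for c in ask_table]))
bid_rows = list(izip_longest(*[pad_column(c) for c in bid_table]))

# each *_rows value is now a well-formed rectangular table that DataNitro can write
Cell(28, 2).table = ask_rows
Cell(39, 2).table = bid_rows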
I have a one-column database with several URLs of the form:
'w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html',
'w.lejournal.fr/palmares/palmares-immobilier/',
'w.lejournal.fr/actualite/societe/adeline-hazan-devient-la-nouvelle-controleuse-des-lieux-de-privation-de-liberte_1558176.html'
I want to create a three-column database whose first column contains these exact URLs, whose second column contains the principal category of the page (actualite or palmares), and whose third column contains the secondary category of the page (politique, palmares-immobilier, or societe).
I can't give my code since I am not allowed to post URLs.
I want to use Python pandas.
Firstly: is this a good way to do it?
Secondly: how can I finish the concatenation?
Thank you very much.
With pure Python:
data = [
    'w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html',
    'w.lejournal.fr/palmares/palmares-immobilier/',
    'w.lejournal.fr/actualite/societe/adeline-hazan-devient-la-nouvelle-controleuse-des-lieux-de-privation-de-liberte_1558176.html'
]

result = []
for x in data:
    cols = x.split('/')
    result.append([x, cols[1], cols[2]])

print(result)
Result:
[
['w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html', 'actualite', 'politique'],
['w.lejournal.fr/palmares/palmares-immobilier/', 'palmares', 'palmares-immobilier'],
['w.lejournal.fr/actualite/societe/adeline-hazan-devient-la-nouvelle-controleuse-des-lieux-de-privation-de-liberte_1558176.html', 'actualite', 'societe']
]
You then only have to read from and write to your database.
If all your URLs start with http://, then you will need cols[3] and cols[4]:
data = [
    'http://w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html',
    'http://w.lejournal.fr/palmares/palmares-immobilier/',
    'http://w.lejournal.fr/actualite/societe/adeline-hazan-devient-la-nouvelle-controleuse-des-lieux-de-privation-de-liberte_1558176.html'
]

result = []
for x in data:
    cols = x.split('/')
    result.append([x, cols[3], cols[4]])

print(result)
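Since the question mentions wanting to use pandas, here is a minimal sketch of the same split done on a DataFrame column; the names url, cat1, and cat2 are only illustrative:

import pandas as pd

urls = [
    'w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html',
    'w.lejournal.fr/palmares/palmares-immobilier/',
    'w.lejournal.fr/actualite/societe/adeline-hazan-devient-la-nouvelle-controleuse-des-lieux-de-privation-de-liberte_1558176.html'
]

df = pd.DataFrame({'url': urls})
parts = df['url'].str.split('/')  # list of path components per row
df['cat1'] = parts.str[1]         # principal category (actualite, palmares, ...)
df['cat2'] = parts.str[2]         # secondary category (politique, societe, ...)
print(df)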
No need for pandas, regex can do this quite efficiently:
import re

ts = ['w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html',
      'w.lejournal.fr/palmares/palmares-immobilier/',
      'w.lejournal.fr/actualite/societe/adeline-hazan-devient-la-nouvelle-controleuse-des-lieux-de-privation-de-liberte_1558176.html']

rgx = r'(?<=w.lejournal.fr/)([aA-zZ]*)/([aA-zZ_-]*)(?=/)'
for url_address in ts:
    found_group = re.findall(rgx, url_address)
    for item in found_group:
        print(item)
This is what it returns:
('actualite', 'politique')
('palmares', 'palmares-immobilier')
('actualite', 'societe')
Of course, you wouldn't need to do this on a hard-coded list of URLs.