How can I use Pandas column to parse textfrom the web?

How can I use Pandas column to parse textfrom the web? - python

I've used the map function on a dataframe column of postcodes to create a new Series of tuples which I can then manipulate into a new dataframe.
def scrape_data(series_data):
#A bit of code to create the URL goes here
r = requests.get(url)
root_content = r.content
root = lxml.html.fromstring(root_content)
address = root.cssselect(".lr_results ul")
for place in address:
address_property = place.cssselect("li a")[0].text
house_type = place.cssselect("li")[1].text
house_sell_price = place.cssselect("li")[2].text
house_sell_date = place.cssselect("li")[3].text
return address_property, house_type, house_sell_price, house_sell_date
df = postcode_subset['Postcode'].map(scrape_data)
While it works where there is only one property on a results page, it fails to create a tuple for multiple properties.
What I'd like to be able to do is iterate through a series of pages and then add that content to a dataframe. I know that Pandas can convert nested dicts into dataframes, but really struggling to make it work. I've tried to use the answers at How to make a nested dictionary and dynamically append data but I'm getting lost.

At the moment your function only returns for the first place in address (usually in python you would yield (rather than return) to retrieve all the results as a generator.
When subsequently doing an apply/map, you'll usually want the function to return a Series...
However, I think you just want to return the following DataFrame:
return pd.DataFrame([{'address_ property': place.cssselect("li a")[0].text,
'house_type': place.cssselect("li")[1].text,
'house_sell_price': place.cssselect("li")[2].text,
'house_sell_date': place.cssselect("li")[3].text}
for place in address],
index=address)

To make the code work, I eventually reworked Andy Hayden's solution to:
listed = []
for place in address:
results = [{'postcode':postcode_bit,'address_ property': place.cssselect("li a")[0].text,
'house_type': place.cssselect("li")[1].text,
'house_sell_price': place.cssselect("li")[2].text,
'house_sell_date': place.cssselect("li")[3].text}]
listed.extend(results)
return listed
At least I understand a bit more about how Python data structures work now.

Related

Doing calculations while creating a List (in Python)

I'm getting data from an API and storing it on Python dictionary (and then a list of dictionaries).
I need to do calculations (max, sum, divisions...) on the dictionary data to create extra data to add to the same dictionary/list.
My current code looks like this:
stream = whatever (whatever, whatever)
keywords = []
for batch in stream:
for row in batch.results:
max_clicks = max(data_keywords["keywords_clicks"])
weighted_clicks = sum(data_keywords["keywords_weighted"])/sum(data_keywords["keywords_clicks"])
data_keywords = {}
data_keywords["keywords_text"] = row.ad_group_criterion.keyword.text
data_keywords["keywords_clicks"] = row.metrics.clicks
data_keywords["keywords_conversion_rate"] = row.metrics.conversions_from_interactions_rate
data_keywords["keywords_weighted"] = row.metrics.clicks * row.metrics.conversions_from_interactions_rate
data_keywords["etv"] = (data_keywords["keywords_clicks"]/max_clicks*data_keywords["keywords_conversion_rate"])+((1-data_keywords["keywords_clicks"]/max_clicks)*weighted_clicks)
keywords.append(data_keywords)
This doesn't work, it gives UnboundLocalError (local variable 'data_keywords' referenced before assignment). I've tried different options and got different errors.
data_keywords["etv"] is what I want to calculate ("max_clicks", "weighted_clicks" and data_keywords["keywords_weighted"] are intermediate calculations for that)
The main problem is that I need to calculate max and sum for all values inside the dictionary, then do a calculation using that max and sum for each value and then store the results in the dictionary itself.
So I don't know where to put the code to do the calculations (before the dictionary, inside the dictionary, after the dictionary or a mix)
I guess it should be possible, but I'm a Python/programming newbie and can't figure this out.
It's probably not relevant, but in case you are wondering, I'm trying to create a weighted sort (https://moz.com/blog/build-your-own-weighted-sort). And I can't use models/database to store data.
Thanks!
EDIT: Some extra info, in case it helps understand better what I need: The results that the keywords list gives without the calculations is something like this:
[{'keywords_text': 'whatever', 'keywords_clicks': 5, 'keywords_conversion_rate': 6.3}, {'keywords_text': 'whatever2', 'keywords_clicks': 50, 'keywords_conversion_rate': 2.3}, {'keywords_text': 'whatever3', 'keywords_clicks': 20, 'keywords_conversion_rate': 2.0}]
I want basically to add to this keywords list a new key/value of 'etv': 8.5 or whatever for each keyword. That etv should come from the formula that I put on my code (data_keywords["etv"] = ...) but maybe it needs changes to work in Python.
The info from this "original" keywords list comes directly from the API (I don't have that data stored anywhere) and it works perfectly if I just request the info and store it in that list. But when the problems come when I introduce the calculations (specially using sum and max inside a loop I guess).

The UnboundLocalError is because you are trying to access data_keywords["keywords_clicks"] before you have declared data_keywords or set the value for "keywords_clicks".
Also, I think you need to be clearer about what data structure you are trying to create. You mention "a list of dictionaries" which I don't see. Maybe you are trying to create a dictionary of lists, but it looks like you overwrite the dictionary values each time you go through your loop.

adding my response as an answer, as I do not have enough reputation to comment
To get rid of assignment error just move the line data_keywords = {} above max_clicks = max(data_keywords["keywords_clicks"])
Here you are trying to access a local variable before its declaration. The code in this case is trying to access a global variable which doesn't seems to exist.
stream = whatever (whatever, whatever)
keywords = []
for batch in stream:
for row in batch.results:
data_keywords = {}
max_clicks = max(data_keywords["keywords_clicks"])
weighted_clicks = sum(data_keywords["keywords_weighted"])/sum(data_keywords["keywords_clicks"])
data_keywords["keywords_text"] = row.ad_group_criterion.keyword.text
data_keywords["keywords_clicks"] = row.metrics.clicks
data_keywords["keywords_conversion_rate"] = row.metrics.conversions_from_interactions_rate
data_keywords["keywords_weighted"] = row.metrics.clicks * row.metrics.conversions_from_interactions_rate
data_keywords["etv"] = (data_keywords["keywords_clicks"]/max_clicks*data_keywords["keywords_conversion_rate"])+((1-data_keywords["keywords_clicks"]/max_clicks)*weighted_clicks)
keywords.append(data_keywords)
More on that here

You can't refer to elements of the dictionary before you create it. Move those variable assignments down to after you assign the dictionary elements.
for batch in stream:
for row in batch.results:
data_keywords = {}
data_keywords["keywords_text"] = row.ad_group_criterion.keyword.text
data_keywords["keywords_clicks"] = row.metrics.clicks
data_keywords["keywords_conversion_rate"] = row.metrics.conversions_from_interactions_rate
data_keywords["keywords_weighted"] = row.metrics.clicks * row.metrics.conversions_from_interactions_rate
max_clicks = max(data_keywords["keywords_clicks"])
weighted_clicks = sum(data_keywords["keywords_weighted"])/sum(data_keywords["keywords_clicks"])
data_keywords["etv"] = (data_keywords["keywords_clicks"]/max_clicks*data_keywords["keywords_conversion_rate"])+((1-data_keywords["keywords_clicks"]/max_clicks)*weighted_clicks)
keywords.append(data_keywords)

How to use zip to easily get the dataframe for each of the coin pair in the list through a defined function? (pandas)

I have defined a function which can get the history price for the coin:
def get_price(pair):
df=binance.fetch_ohlcv(pair,limit=258,timeframe="1d")
df=pd.DataFrame(df).rename(columns={0:"date",1:"open",2:"high",3:"low",4:"close",5:"volume"})
df["date"]=pd.to_datetime(df["date"],unit="ms")+pd.Timedelta(hours=8)
df.set_index("date",inplace=True)
return df
Then i want to use zip function to create two lists which can correspond to each other,so i can easily apply the function to get history data for each of the coin in the name list:
name=["btc","eth"]
symbol=["BTC/USDT","ETH/USDT"]
for name,pair in zip(name,symbol):
name=get_price(pair)
eth
But when i type "eth",to get the dataframe of "ETH/USDT", it gave me the error of "NameError: name 'eth' is not defined". The reason for me to do this is if i have a list of more than 10 pairs of coins, i dont want to use get_price function for each of them one by one in order to get the history data for all of them. can anyone help me to fix this errors? Thanks

You may have to do another function if you want to keep the things the way you have.
The first part (where you get the df seems fine since you have checked it). The issue is with the second part, where you are trying to run that function. Change is as below may help.
Example Run this & it will print ETH/USDT as output.
def func(x):
name=["btc","eth"]
symbol=["BTC/USDT","ETH/USDT"]
for name,pair in zip(name,symbol):
if name == x:
print(pair)
func('eth')
Similarly, if you want to run the function of getting the df, try something like this.
def func(x):
name=["btc","eth"]
symbol=["BTC/USDT","ETH/USDT"]
for name,pair in zip(name,symbol):
if name == x:
get_price(pair)
func('eth')

Use a dict to store the dataframes
Original Code:
name=["btc","eth"]
symbol=["BTC/USDT","ETH/USDT"]
for name,pair in zip(name,symbol):
name=get_price(pair)
The loop is trying to assign a pd.DataFrame object, to a string, which won't work.
I'm surprised that didn't cause a SyntaxError: can't assign to literal
Equivalently, name=get_price(pair) → 'eth' = pd.DataFrame()
From the notebook, I see can a list of names and symbols, for which you're trying to create individual dataframes.
The notebook shows, get_price returns a dataframe, when given a symbol.
Replacement Code:
The following code will create a dict of dataframes, where each string in name, is a key.
df_dict = dict()
for name, pair in zip(name, symbol):
df_dict[name] = get_price(pair)
df_dict['eth']

Changing Json from API with Python

I have a json code from API and I want to get new chat members with the code below but I only get the first two results and not the last (Tester). Why? It should itereate through the whole json file, shouldn't it?
r = requests.get("https://api.../getUpdates").json()
chat_members = []
a = 0
for i in r:
chat_members.append(r['result'][a]['message']['new_chat_members'][0]['last_name'])
a = a + 1
Json here:
{"ok":true,"result":[{"update_id":213849278,
"message":{"message_id":37731,"from":{"id":593029363,"is_bot":false,"first_name": "#tutu"},"chat":{"id":-1001272017174,"title":"tester account","username":"v_glob","type":"supergroup"},"date":1537470595,"new_chat_participant":{"id":593029363,"is_bot":false,"first_name":"tutu "},"new_chat_member":{"id":593029363,"is_bot":false,"first_name":"\u7535\u62a5\u589e\u7c89\uff0c\u4e2d\u82f1\u6587\u5ba2\u670d\uff0c\u62c9\u4eba\u6e05\u5783\u573e\u8f6f\u4ef6\uff0c\u5e7f\u544a\u63a8\u5e7f\uff0cKYC\u6750\u6599\u8ba4\u8bc1\uff0c","last_name":"#tutupeng"},"new_chat_members":[{"id":593029363,"is_bot":false,"first_name":"\u7535\u62a5\u589e\u7c89\uff0c\u4e2d\u82f1\u6587\u5ba2\u670d\uff0c\u62c9\u4eba\u6e05\u5783\u573e\u8f6f\u4ef6\uff0c\u5e7f\u544a\u63a8\u5e7f\uff0cKYC\u6750\u6599\u8ba4\u8bc1\uff0c","last_name":"#tutu"}]}},{"update_id":213849279,
"message":{"message_id":37732,"from":{"id":658150956,"is_bot":false,"first_name":"Rebecca","last_name":"Lawson"},"chat":{"id":-10012720,"title":"v glob OFFICIAL","username":"v_glob","type":"supergroup"},"date":1537484441,"new_chat_participant":{"id":65815,"is_bot":false,"first_name":"Rebecca","last_name":"Lawson"},"new_chat_member":{"id":65815,"is_bot":false,"first_name":"Rebecca","last_name":"Lawson"},"new_chat_members":[{"id":65815,"is_bot":false,"first_name":"Rebecca","last_name":"Lawson"}]}},{"update_id":213849280,
"message":{"message_id":12,"from":{"id":696749142,"is_bot":false,"first_name":"daniel","language_code":"cs-cz"},"chat":{"id":696749142,"first_name":"daniel","type":"private"},"date":1537537013,"text":"/stat","entities":[{"offset":0,"length":5,"type":"bot_command"}]}},{"update_id":213849281,
"message":{"message_id":37740,"from":{"id":669620,"is_bot":false,"first_name":"Ivan","last_name":"Tester"},"chat":{"id":-100127201,"title":"test account","username":"v_glob","type":"supergroup"},"date":1537537597,"new_chat_participant":{"id":669620191,"is_bot":false,"first_name":"Ivan","last_name":"Tester"},"new_chat_member":{"id":669620191,"is_bot":false,"first_name":"Ivan","last_name":"Tester"},"new_chat_members":[{"id":669620191,"is_bot":false,"first_name":"Ivan","last_name":"Tester"}]}}]}

Because you iterate over the entire response dict. The top level only has two items, so that's what you iterate over. Note that you don't actually use the iterator variable, and you have a completely unnecessary separate counter.
Instead, you should be iterating over the result dict:
for result in r['result']:
if "new_chat_members" in result['message']:
chat_members.append(result['message']['new_chat_members'][0]['last_name'])

A colleague of mine has come up with a solution:
for i in l['result']:
chat_members.append(i['message']['new_chat_member']['first_name'])
To sum up: Iterate through 'result' with no positional arguments

Python: Running function to append values to an empty list returns no values

This is probably a very basic question but I haven't been able to figure this out.
I'm currently using the following to append values to an empty list
shoes = {'groups':['running','walking']}
df_shoes_group_names = pd.DataFrame(shoes)
shoes_group_name=[]
for type in df_shoes_group_names['groups']:
shoes_group_name.append(type)
shoes_group_name
['running', 'walking']
I'm trying to accomplish the same using a for loop, however, when I execute the loop the list comes back as blank
shoes_group_name=[]
def list_builder(dataframe_name):
if 'shoes' in dataframe_name:
for type in df_shoes_group_names['groups']:
shoes_group_name.append(type)
list_builder(df_shoes_group_names)
shoes_group_name
[]
Reason for the function is that eventually I'll have multiple DF's with different product's so i'd like to just have if statements within the function to handle the creation of each list
so for example future examples could look like this:
df_shoes_group_names
df_boots_group_names
df_sandals_group_names
shoes_group_name=[]
boots_group_name=[]
sandals_group_name=[]
def list_builder(dataframe_name):
if 'shoes' in dataframe_name:
for type in df_shoes_group_names['groups']:
shoes_group_name.append(type)
elif 'boots' in dataframe_name:
for type in df_boots_group_names['groups']:
boots_group_name.append(type)
elif 'sandals' in dataframe_name:
for type in df_sandals_group_names['groups']:
sandals_group_name.append(type)
list_builder(df_shoes_group_names)
list_builder(df_boots_group_names)
list_builder(df_sandals_group_names)
Not sure if I'm approaching this the right way so any advice would be appreciated.
Best,

You should never call or search a variable name as if it were a string.
Instead, use a dictionary to store a variable number of variables.
Bad practice
# dataframes
df_shoes_group_names = pd.DataFrame(...)
df_boots_group_names = pd.DataFrame(...)
df_sandals_group_names = pd.DataFrame(...)
def foo(x):
if shoes in df_shoes_group_names: # <-- THIS WILL NOT WORK
# do something with x
Good practice
# dataframes
df_shoes_group_names = pd.DataFrame(...)
df_boots_group_names = pd.DataFrame(...)
df_sandals_group_names = pd.DataFrame(...)
dfs = {'shoes': df_shoes_group_names,
'boots': df_boots_group_names,
'sandals': df_sandals_group_names}
def foo(key):
if 'shoes' in key: # <-- THIS WILL WORK
# do something with dfs[key]

How to iterate and extract data from this specific JSON file example

I'm trying to extract data from a JSON file with Python.
Mainly, I want to pull out the date and time from the "Technicals" section, to put that in one column of a dataframe, as well as pulling the "AKG" number and putting that in the 2nd col of the dataframe. Yes, I've looked at similar questions, but this issue is different. Thanks for your help.
A downNdirty example of the JSON file is below:
{ 'Meta Data': { '1: etc'
'2: etc'},
'Technicals': { '2017-05-04 12:00': { 'AKG': '64.8645'},
'2017-05-04 12:30': { 'AKG': '65.7834'},
'2017-05-04 13:00': { 'AKG': '63.2348'}}}
As you can see, and what's stumping me, is while the date stays the same the time advances. 'AKG' never changes, but the number does. Some of the relevant code I've been using is below. I can extract the date and time, but I can't seem to reach the AKG numbers. Note, I don't need the "AKG", just the number.
I'll mention: I'm creating a DataFrame because this will be easier to work with when creating plots with the data...right? I'm open to an array of lists et al, or anything easier, if that will ultimately help me with the plots.
akg_time = []
akg_akg = []
technicals = akg_data['Technicals'] #akg_data is the entire json file
for item in technicals: #this works
akg_time.append(item)
for item in technicals: #this not so much
symbol = item.get('AKG')
akg_akg.append(symbol)
pp.pprint(akg_akg)
error: 'str' object has no attribute 'get'

You've almost got it. You don't even need the second loop. You can append the akg value in the first one itself:
for key in technicals: # renaming to key because that is a clearer name
akg_time.append(key)
akg_akg.append(technicals[key]['AKG'])
Your error is because you believe item (or key) is a dict. It is not. It is just a string, one of the keys of the technicals dictionary, so you'd actually need to use symbols = technicals[key].get('AKG').

Although Coldspeed answer is right: when you have a dictionary you loop through keys and values like this:
Python 3
for key,value in technicals.items():
akg_time.append(key)
akg_akg.append(value["akg"])
Python 2
for key,value in technicals.iteritems():
akg_time.append(key)
akg_akg.append(value["akg"])

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can I use Pandas column to parse textfrom the web? - python

Related

Doing calculations while creating a List (in Python)

How to use zip to easily get the dataframe for each of the coin pair in the list through a defined function? (pandas)

Changing Json from API with Python

Python: Running function to append values to an empty list returns no values

How to iterate and extract data from this specific JSON file example

Categories

Resources