How to transfer strings into integers with pandas - python

I'm trying to transfer the strings in plans_data into integers:
import pandas as pd
from bs4 import BeautifulSoup
import requests
plans_data = pd.DataFrame(columns = ['Country', 'Currency', 'Mobile', 'Basic', 'Standard', 'Premium'])
for index, row in countries_list.iterrows():
    country = row['ID']
    url = f'https://help.netflix.com/en/node/24926/{country}'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find("table", class_="c-table")
    try:
        plan_country = pd.read_html(results.prettify())[0]  # creates a list(!) of dataframe objects
        plan_country = plan_country.rename(columns={'Unnamed: 0': 'Currency'})
        plan_country = pd.DataFrame(plan_country.iloc[0, :]).transpose()
        plans_data = pd.concat([plans_data, plan_country], ignore_index=True)
    except AttributeError:
        country_name = row['Name']
        print(f'No data found for {country_name}.')
    plans_data.loc[index, 'Country'] = row['Name']
plans_data
First, I attempted the conversion using the float function:
# 1. Here we import pandas
import pandas as pd
# 2. Here we import numpy
import numpy as np
ans_2_1_ = float(plans_data['Basic', 'Standard', 'Premium'])
However, I always get the NameError:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_15368/3072127414.py in <module>
3 # 2. Here we import numpy
4 import numpy as np
----> 5 ans_2_1_ = float(plans_data['Basic', 'Standard', 'Premium'])
NameError: name 'plans_data' is not defined
How can I solve this problem?
If my code is not appropriate for my task of "transferring strings into integers", can you advise me on how to do the conversion?

The error indicates that the second piece of code does not know what plans_data is, so first make sure that plans_data is defined where you use it, i.e. in the same file or the same Jupyter notebook.
The second problem is that plans_data['Basic', 'Standard', 'Premium'] is not valid syntax.
Third, and probably your real question: how do you convert the values in those columns to floats?
The elements in the columns 'Basic', 'Standard' and 'Premium' are strings in currency format, e.g. '£ 5.99'. You can convert them to floats as follows (you need to do it for each column):
ans_2_1_ = plans_data['Basic'].str[1:].astype(float)
... # same for Standard and Premium
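If the currency marker is not always a single leading character, a slightly more defensive approach is to strip everything that is not part of the number before converting. A minimal sketch, assuming every price matches a '<symbol> 1,234.56'-style pattern:
import pandas as pd

# Hypothetical cleanup: drop every character except digits, the decimal
# point and a minus sign, then let pandas coerce the rest to floats
# (unparseable values become NaN instead of raising).
for col in ['Basic', 'Standard', 'Premium']:
    plans_data[col] = pd.to_numeric(
        plans_data[col].str.replace(r'[^\d.\-]', '', regex=True),
        errors='coerce',
    )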

Related

What is the problem in my Python API code?

What is the problem here? I wrote the code below, but I got errors.
import pandas as pd
import matplotlib.pyplot as plt
import requests
from alpha_vantage.timeseries import TimeSeries
key: "https://api.polygon.io/v2/aggs/ticker/AAPL/range/1/day/2023-01-09/2023-01-09?adjusted=true&sort=asc&limit=120&apiKey=<my_key>".read()
ts = TimeSeries(key, output_format='pandas')
data, meta = ts.get_intraday('', interval='1min', outputsize='full')
meta
data.info()
data.head()
plt.plot(data['4. close'])
columns = ['open', 'high', 'low', 'close', 'volume']
data.columns = columns
data['TradeDate'] = data.index.date
data['time'] = data.index.time
data.loc['2020-12-31']
market = data.between_time('09:30:00', '16:00:00').copy()
market.sort_index(inplace=True)
market.info()
market.groupby('TradeDate').agg({'low':min, 'high':max})
Error:
> C:\Users\yaray\PycharmProjects\pythonProject6\venv\Scripts\python.exe
> C:/Users/yaray/PycharmProjects/pythonProject6/main.py Traceback (most
> recent call last): File
> "C:\Users\yaray\PycharmProjects\pythonProject6\main.py", line 6, in
> <module>
> key: "https://api.polygon.io/v2/aggs/ticker/AAPL/range/1/day/2023-01-09/2023-01-09?adjusted=true&sort=asc&limit=120&apiKey=<my_key>".read()
> AttributeError: 'str' object has no attribute 'read'
You are trying to read a string.
You must first make the API call; you can then read the response of that call into a variable.
P.S. For readability, don't tack functionality onto the end of a very long line.
If you really have to assign a long string, save the string to a variable and call a method on it later.
Example:
key = "long_string"
read_key = key.read()
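Note that even after assigning it to a variable, a plain string still has no .read() method; what you read is the response object that comes back from the HTTP call. As a sketch of the working flow with the URL from the question (using the requests library; the key placeholder is left as-is):
import requests

url = ("https://api.polygon.io/v2/aggs/ticker/AAPL/range/1/day/2023-01-09/2023-01-09"
       "?adjusted=true&sort=asc&limit=120&apiKey=<my_key>")
response = requests.get(url)  # make the API call first...
payload = response.json()     # ...then read the response body as JSON
print(payload)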
Your code is a little wonky. Does this get you started? It is a little more than I can put in a comment and potentially less than I would put in an answer.
Based on https://pypi.org/project/alpha-vantage/ I think you want to get started with something more like this.
from alpha_vantage.timeseries import TimeSeries
my_api_key = "ad...HD"
ts = TimeSeries(key=my_api_key, output_format='pandas')
data, meta = ts.get_intraday('GOOGL')
print(meta)
print(data)
Let me know if that helps or not so I can remove this partial answer.

Can't manipulate dataframe in pandas

I don't understand why I can't do even the simplest data manipulation with this data I've scraped. I've tried all sorts of methods to manipulate the data, but they all come up with the same sort of error. Is my data even in a data frame yet? I can't tell.
import pandas as pd
from urllib.request import Request, urlopen
req = Request('https://smallcaps.com.au/director-transactions/'
, headers={'User-Agent': 'Mozilla/5.0'})
trades = urlopen(req).read()
df = pd.read_html(trades)
print(df)  # <-- this line prints the df and works fine
df.drop([0, 1])  # <-- this one raises the error below
print(df)
Error:
Traceback (most recent call last):
File "C:\Users\User\PycharmProjects\Scraper\DirectorTrades.py", line 10, in <module>
df.drop([0, 1])
AttributeError: 'list' object has no attribute 'drop'
The main issue, as mentioned, is that pandas.read_html() returns a list of dataframes, and you have to pick the one you want by index.
Is my data even in a data frame yet?
df = pd.read_html(trades) — no, it is not, because this gives you a list of dataframes.
df = pd.read_html(trades)[0] — yes, this gives you the first dataframe from that list.
Example
import pandas as pd
from urllib.request import Request, urlopen
req = Request('https://smallcaps.com.au/director-transactions/'
, headers={'User-Agent': 'Mozilla/5.0'})
trades = urlopen(req).read()
df = pd.read_html(trades)[0]
df.drop([0, 1])
df
Output
   Date       Code  Company                               Director     Value
0  27/4/2022  ESR   Estrella Resources                    L. Pereira   ↑$1,075
1  27/4/2022  LNY   Laneway Resources                     S. Bizzell   ↑126,750
2  26/4/2022  FGX   Future Generation Investment Company  G. Wilson    ↑$13,363
3  26/4/2022  CDM   Cadence Capital                       J. Webster   ↑$25,110
4  26/4/2022  TEK   Thorney Technologies                  A. Waislitz  ↑$35,384
5  26/4/2022  FGX   Future Generation Investment Company  K. Thorley   ↑$7,980
...
read_html returns a list of dataframes.
Try:
dfs = pd.read_html(trades)
dfs = [df.drop([0,1]) for df in dfs]
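One pitfall in both snippets: DataFrame.drop returns a new frame instead of modifying the original, so keep the result. A minimal sketch:
df = pd.read_html(trades)[0]
df = df.drop([0, 1])  # reassign, or use df.drop([0, 1], inplace=True)
print(df)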

How to remove every possible accent from a column in Python

I am new to Python. I have a data frame with a column named 'Name'. The column contains different types of accents, and I am trying to remove them; for example, rubén => ruben, zuñiga => zuniga, etc. I wrote the following code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import unicodedata
data=pd.read_csv('transactions.csv')
data.head()
nm=data['Name']
normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')
I am getting error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-41-1410866bc2c5> in <module>()
1 nm=data['Name']
----> 2 normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')
TypeError: normalize() argument 2 must be unicode, not Series
It is giving you that error because normalize() requires a plain string as its second argument, not a whole Series. I found an example of this online:
unicodedata.normalize('NFKD', u"Durrës Åland Islands").encode('ascii','ignore')
'Durres Aland Islands'
Try this for one column:
nm = nm.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
Try this for multiple columns:
obj_cols = data.select_dtypes(include=['O']).columns
data.loc[:, obj_cols] = data[obj_cols].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))
Try this for one column:
df[column_name] = df[column_name].apply(lambda x: unicodedata.normalize(u'NFKD', str(x)).encode('ascii', 'ignore').decode('utf-8'))
Change the column name according to your data columns.
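For completeness, here is a self-contained sketch of the normalize/encode/decode chain on a toy frame (the sample names are made up):
import pandas as pd

data = pd.DataFrame({'Name': ['rubén', 'zuñiga', 'Durrës']})  # hypothetical sample data
data['Name'] = (data['Name']
                .str.normalize('NFKD')
                .str.encode('ascii', errors='ignore')
                .str.decode('utf-8'))
print(data['Name'].tolist())  # ['ruben', 'zuniga', 'Durres']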

Beautiful Soup Wikipedia nested tables

I am new to Beautiful Soup and nested tables, so I am trying to get some experience by scraping a Wikipedia table.
I have searched the web for a good example, but unfortunately I have not found anything.
My goal is to parse, via pandas, the table "States of the United States of America" on this web page. As you can see from my code below, I have the following issues:
1) I cannot extract all the columns. My code apparently does not import all the columns properly into a pandas DataFrame, and it writes the entries of the third HTML-table column below the first column.
2) I do not know how to deal with colspan="2", which appears in some lines of the table. In my pandas DataFrame, I would like to have the same entry when the capital and the largest city are the same.
Here is my code. Note that I got stuck trying to overcome the first issue.
Code:
from urllib.request import urlopen
import pandas as pd
wiki='https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
page = urlopen(wiki)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page)
right_table=soup.find_all('table')[0] # First table
rows = right_table.find_all('tr')[2:]
A=[]
B=[]
C=[]
D=[]
F=[]
for row in rows:
    cells = row.findAll('td')
    # print(len(cells))
    if len(cells) >= 11:  # Only extract table body, not heading
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
        D.append(cells[3].find(text=True))
        F.append(cells[4].find(text=True))
df=pd.DataFrame(A,columns=['State'])
df['Capital']=B
df['Largest']=C
df['Statehood']=D
df['Population']=F
df
print(df)
Do you have any suggestions?
Any help in understanding BeautifulSoup better would be appreciated.
Thanks in advance.
Here's the strategy I would use.
I notice that each line in the table is complete but, as you say, some lines have two cities in the city columns and some have only one. This means that we can use the number of items in a line to determine whether we need to 'double' the city name mentioned in that line or not.
I begin the way you did.
>>> import re
>>> import requests
>>> import bs4
>>> page = requests.get('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> right_table=soup.find_all('table')[0]
Then I find all of the rows in the table and verify that it's at least approximately correct.
>>> trs = right_table('tr')
>>> len(trs)
52
I poke around until I find the lines for Alabama and Wyoming, the first and last rows, and display their texts. They're examples of the two types of rows!
>>> trs[2].text
'\n\xa0Alabama\nAL\nMontgomery\nBirmingham\n\nDec 14, 1819\n\n\n4,863,300\n\n52,420\n135,767\n50,645\n131,171\n1,775\n4,597\n\n7\n\n'
>>> trs[51].text
'\n\xa0Wyoming\nWY\nCheyenne\n\nJul 10, 1890\n\n\n585,501\n\n97,813\n253,335\n97,093\n251,470\n720\n1,864\n\n1\n\n'
I notice that I can split these strings on \n and \xa0. This can be done with a regex.
>>> [_ for _ in re.split(r'[\n\xa0]', trs[51].text) if _]
['Wyoming', 'WY', 'Cheyenne', 'Jul 10, 1890', '585,501', '97,813', '253,335', '97,093', '251,470', '720', '1,864', '1']
>>> [_ for _ in re.split(r'[\n\xa0]', trs[2].text) if _]
['Alabama', 'AL', 'Montgomery', 'Birmingham', 'Dec 14, 1819', '4,863,300', '52,420', '135,767', '50,645', '131,171', '1,775', '4,597', '7']
The if _ conditional in these list comprehensions is to discard empty strings.
The Wyoming string has a length of 12, Alabama's is 13. I would leave Alabama's string as it is for pandas. I would extend Wyoming's (and all the others of length 12) using:
>>> row = [_ for _ in re.split(r'[\n\xa0]', trs[51].text) if _]
>>> row[:3]+row[2:]
['Wyoming', 'WY', 'Cheyenne', 'Cheyenne', 'Jul 10, 1890', '585,501', '97,813', '253,335', '97,093', '251,470', '720', '1,864', '1']
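From there, the normalized rows can be fed straight to pandas. A rough sketch continuing from the trs defined above (the column names are my own labels for the first five fields):
import re
import pandas as pd

parsed = []
for tr in trs[2:]:
    row = [_ for _ in re.split(r'[\n\xa0]', tr.text) if _]
    if len(row) == 12:  # single-city row: duplicate the city name
        row = row[:3] + row[2:]
    parsed.append(row[:5])

df = pd.DataFrame(parsed, columns=['State', 'Code', 'Capital', 'Largest', 'Statehood'])
print(df.head())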
The solution below should fix both issues you have mentioned.
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
wiki='https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States?action=render'
page = urlopen(wiki)
soup = BeautifulSoup(page, 'html.parser')
right_table=soup.find_all('table')[0] # First table
rows = right_table.find_all('tr')[2:]
A=[]
B=[]
C=[]
D=[]
F=[]
for row in rows:
    cells = row.findAll('td')
    combine_cells = cells[1].get('colspan')  # Tells us whether the Capital and Largest city columns are combined
    cells = [cell.text.strip() for cell in cells]  # Extracts text and removes whitespace for each cell
    index = 0  # allows us to modify columns below
    A.append(cells[index])  # State Code
    B.append(cells[index + 1])  # Capital
    if combine_cells:  # Shift columns over by one if columns 2 and 3 are combined
        index -= 1
    C.append(cells[index + 2])  # Largest
    D.append(cells[index + 3])  # Established
    F.append(cells[index + 4])  # Population
df=pd.DataFrame(A,columns=['State'])
df['Capital']=B
df['Largest']=C
df['Statehood']=D
df['Population']=F
df
print(df)
Edit: Here's a cleaner version of the above code
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
wiki = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
page = urlopen(wiki)
soup = BeautifulSoup(page, 'html.parser')
table_rows = soup.find('table')('tr')[2:]  # Get all table rows
cells = [row('td') for row in table_rows]  # Get all cells from rows

def get(cell):  # Get stripped string from tag
    return cell.text.strip()

def is_span(cell):  # Check if cell has the 'colspan' attribute, e.g. <td colspan="2"></td>
    return cell.get('colspan')
df = pd.DataFrame()
df['State'] = [get(cell[0]) for cell in cells]
df['Capital'] = [get(cell[1]) for cell in cells]
df['Largest'] = [get(cell[2]) if not is_span(cell[1]) else get(cell[1]) for cell in cells]
df['Statehood'] = [get(cell[3]) if not is_span(cell[1]) else get(cell[2]) for cell in cells]
df['Population'] = [get(cell[4]) if not is_span(cell[1]) else get(cell[3]) for cell in cells]
print(df)
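As an aside, current pandas versions can shoulder most of this work themselves: pandas.read_html parses the table directly and repeats the value of a colspan="2" cell across the columns it spans. A minimal sketch, assuming the first table on the page is still the states table:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
df = pd.read_html(url)[0]  # colspan cells are expanded automatically
print(df.head())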

Arithmetic with pandas objects

I want to calculate the difference between two Pandas series in Python. Unfortunately, an error, which I cannot make sense of, is returned. The relevant part of my code is:
import urllib.request
import pandas as pd
base_url = "http://ichart.finance.yahoo.com/table.csv?s="
def get_data(base_url, ticker):
    url = base_url + ticker
    source = urllib.request.urlopen(url)
    return pd.read_csv(source, index_col=0, parse_dates=True, header=None)

ticker_list = {'INTC': 'Intel'}
for ticker in ticker_list:
    prices = get_data(base_url, ticker)
    prices.columns = 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'
    closing_prices = prices['Close']
    begin = closing_prices.ix[['2013-01-03']]
    end = closing_prices.ix[['2013-12-27']]
    difference = end.sub(begin)
Python returns the following error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
However, type(begin) returns pandas.core.series.Series, as does type(end). I used the method end.sub() because I wanted to adhere to the usage described here: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Series.sub.html. To address my problem, I (among other things) followed the recommendations in Subtract a column from one pandas dataframe from another, to no avail.
Do you have any idea where the mistake is buried in my code? In particular, why does Python state that I try to subtract strings? I am grateful for any help!
Update: Following @EdChum's comment, I would like to post some data. Typing begin gives:
2013-01-03    21.32
Name: Close, dtype: object
closing_prices.head() gives:
Date          Close
2014-08-07    32.68
2014-08-06    32.85
2014-08-05    32.82
2014-08-04    34.05
Name: Close, dtype: object
I had to change urllib to urllib2, and urllib.request.urlopen to just urllib2.urlopen, but it should otherwise be the same. The first issue was caused by the column names being stored as a value; eliminating header=None fixes that.
This should give you the difference between the first and last date specified:
import urllib2
import pandas as pd

base_url = "http://ichart.finance.yahoo.com/table.csv?s="

def get_data(base_url, ticker):
    url = base_url + ticker
    source = urllib2.urlopen(url)
    return pd.read_csv(source, index_col=0, parse_dates=True)

ticker_list = {'INTC': 'Intel'}
for ticker in ticker_list:
    prices = get_data(base_url, ticker)
    prices.columns = 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'
    closing_prices = prices['Close']
    closing_prices = closing_prices['2013-01-03':'2013-12-27']
    difference = closing_prices['2013-12-27'].values - closing_prices['2013-01-03'].values
    print difference
EDIT - Be sure to check the sorting of the data: it places the newest at the top and the oldest at the bottom for me.
This sounds more complicated than it really is: you'll need to convert end and begin to numeric data types. Try DataFrame.convert_objects:
...
begin = begin.convert_objects(convert_numeric=True)
end = end.convert_objects(convert_numeric=True)
difference = end.sub(begin)
Update: The following code works for me:
import urllib2
import pandas as pd
base_url = "http://ichart.finance.yahoo.com/table.csv?s="
def get_data(base_url, ticker):
    url = base_url + ticker
    source = urllib2.urlopen(url)
    return pd.read_csv(source, index_col=0, parse_dates=True, header=None)

ticker_list = {'INTC': 'Intel'}
for ticker in ticker_list:
    prices = get_data(base_url, ticker)
    prices.columns = 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'
    # this will convert the closing_prices Series to float
    closing_prices = prices['Close'].convert_objects(convert_numeric=True)
    # changed the double square brackets [[]] to single square brackets to
    # obtain a scalar, rather than a single-element Series
    begin = closing_prices.ix['2013-01-03']
    end = closing_prices.ix['2013-12-27']
    difference = end - begin
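Note that convert_objects and .ix have both been deprecated and removed in later pandas releases; on current pandas the same idea would be spelled with pd.to_numeric and .loc. A sketch (untested here, since the Yahoo ichart endpoint itself has been retired):
# convert the strings to floats, turning unparseable values into NaN
closing_prices = pd.to_numeric(prices['Close'], errors='coerce')
begin = closing_prices.loc['2013-01-03']
end = closing_prices.loc['2013-12-27']
difference = end - begin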
