Arithmetic with pandas objects - python

I want to calculate the difference between two Pandas series in Python. Unfortunately, an error, which I cannot make sense of, is returned. The relevant part of my code is:
import urllib.request
import pandas as pd

base_url = "http://ichart.finance.yahoo.com/table.csv?s="

def get_data(base_url, ticker):
    url = base_url + ticker
    source = urllib.request.urlopen(url)
    return pd.read_csv(source, index_col=0, parse_dates=True, header=None)

ticker_list = {'INTC': 'Intel'}

for ticker in ticker_list:
    prices = get_data(base_url, ticker)
    prices.columns = 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'
    closing_prices = prices['Close']
    begin = closing_prices.ix[['2013-01-03']]
    end = closing_prices.ix[['2013-12-27']]
    difference = end.sub(begin)
Python returns the following error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
However, type(begin) returns pandas.core.series.Series, as does type(end). I used the method end.sub() because I wanted to adhere to the instructions stated here: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Series.sub.html. To address my problem I, among other things, followed the recommendations in Subtract a column from one pandas dataframe from another, to no avail.
Do you have any idea where the mistake is buried in my code? In particular, why does Python claim that I am trying to subtract strings? I am grateful for any help!
Update: Following @EdChum's comment I would like to post some data. Typing begin gives:
2013-01-03    21.32
Name: Close, dtype: object
closing_prices.head() gives:
Date          Close
2014-08-07    32.68
2014-08-06    32.85
2014-08-05    32.82
2014-08-04    34.05
Name: Close, dtype: object

I had to change urllib to urllib2, and urllib.request.urlopen to just urllib2.urlopen, but it should otherwise be the same. The first issue was caused by the column names being read in as a data row; eliminating header=None fixes that.
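A quick way to see this for yourself, assuming the original get_data with header=None is still in place:
print(prices['Close'].dtype)   # object - the column holds strings, not floats
print(prices.head(1))          # the CSV's own header line shows up as the first data row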
This should give you the difference between the first and last date specified:
import urllib2
import pandas as pd

base_url = "http://ichart.finance.yahoo.com/table.csv?s="

def get_data(base_url, ticker):
    url = base_url + ticker
    source = urllib2.urlopen(url)
    return pd.read_csv(source, index_col=0, parse_dates=True)

ticker_list = {'INTC': 'Intel'}

for ticker in ticker_list:
    prices = get_data(base_url, ticker)
    prices.columns = 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'
    closing_prices = prices['Close']
    closing_prices = closing_prices['2013-01-03':'2013-12-27']
    difference = closing_prices['2013-12-27'].values - closing_prices['2013-01-03'].values
    print difference
EDIT: Be sure to check the sorting of the data. It places the newest at the top and the oldest at the bottom for me.
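If the sort order is an issue for the date slice above, a one-line sketch (my addition, same closing_prices assumed) to force chronological order first:
closing_prices = closing_prices.sort_index()   # oldest first, so the '2013-01-03':'2013-12-27' slice works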

This sounds more complicated than it really is: you'll need to convert end and begin to numeric data types. Try Series.convert_objects:
...
begin = begin.convert_objects(convert_numeric=True)
end = end.convert_objects(convert_numeric=True)
difference = end.sub(begin)
Update: The following code works for me:
import urllib2
import pandas as pd

base_url = "http://ichart.finance.yahoo.com/table.csv?s="

def get_data(base_url, ticker):
    url = base_url + ticker
    source = urllib2.urlopen(url)
    return pd.read_csv(source, index_col=0, parse_dates=True, header=None)

ticker_list = {'INTC': 'Intel'}

for ticker in ticker_list:
    prices = get_data(base_url, ticker)
    prices.columns = 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'
    # this will convert the closing_prices Series to float
    closing_prices = prices['Close'].convert_objects(convert_numeric=True)
    # changed the double square brackets [[]] to single square brackets to
    # obtain a scalar, rather than a single-element Series
    begin = closing_prices.ix['2013-01-03']
    end = closing_prices.ix['2013-12-27']
    difference = end - begin
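Note that convert_objects has since been deprecated in newer pandas versions; a one-line equivalent sketch using pd.to_numeric (errors='coerce' turns the stray header text into NaN):
closing_prices = pd.to_numeric(prices['Close'], errors='coerce')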

Related

Change dateformat

I have this code where I wish to change the date format, but I only manage to change one line and not the whole dataset.
Code:
import pandas as pd
df = pd.read_csv ("data_q_3.csv")
result = df.groupby ("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_column', None)
print ("Covid 19 top 10 countries based on confirmed case:")
print(result)
from datetime import datetime
datetime.fromisoformat("2020-03-18T12:13:09").strftime("%Y-%m-%d-%H:%M")
Does anyone know how to fix the code so that the datetime changes in the whole dataset?
Thanks!
After looking at your problem for a while, I figured out how to change the values in the 'DateTime' column. The only problem that may arise is if the 'Country/Region' column has duplicate location names.
Editing the time is simple, as all you have to do is make use of Python's slicing. You can slice a string by typing
string = 'abcdefghijklmnopqrstuvwxyz'
print(string[0:5])
which will result in abcde.
Below is the finished code.
import pandas as pd

# read the data
df = pd.read_csv("data_q_3.csv")
# top 10 countries by confirmed cases
result = df.groupby("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_columns', None)

# you need a for loop to go through the whole column
for row in result.index:
    # get the currently stored time
    time = result.at[row, 'DateTime']
    # reformat the time string by slicing it from index 0 to 10 (the date)
    # and from index 11 to 16 (hours:minutes), putting a dash in the middle
    time = time[0:10] + "-" + time[11:16]
    # store the new time in the result
    result.at[row, 'DateTime'] = time

# print result
print("Covid 19 top 10 countries based on confirmed case:")
print(result)
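Alternatively, a vectorized sketch without the loop (my addition, assuming the DateTime column holds ISO strings like '2020-03-18T12:13:09'):
result['DateTime'] = pd.to_datetime(result['DateTime']).dt.strftime('%Y-%m-%d-%H:%M')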

How to remove rows from a dataframe if before a certain date

I am trying to remove all rows from a dataframe where the dates in the 'date' column are before 1-11-2019 (1 November 2019).
The dataframe is produced by scraping google news (title, date, link, publisher). Here is the full code:
from bs4 import BeautifulSoup
import requests
import html5lib
import pandas as pd
import datetime

headers = {'User-Agent': 'Mozilla/5.0'}

# URL generator (scraping news for 'sega')
urlA = 'https://news.google.com/search?q='
urlB = 'sega'
urlC = '&hl=en-US&gl=US&ceid=US%3Aen'
url = urlA + urlB + urlC
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html5lib')
print(soup)

T = []
t = []
L = []
P = []

# Collecting data
for x in soup.find_all(class_='ipQwMb ekueJc RD0gLb'):
    title = x.text
    T.append(title)
    print(title)

for r in soup.find_all(class_='SVJrMe'):
    z = r.find('time')
    if z is not None:
        for y in r.find_all('time'):
            time = y.get('datetime')
            time = str(time).partition('T')
            time = time[0]
            time = datetime.datetime.strptime(time, "%Y-%m-%d").date()
            print(time)
            t.append(time)
    else:
        x = 'Not Specified'
        t.append(x)

for z in soup.find_all(class_='VDXfz'):
    links = z.get('href')
    links = links[1:]  # removing the dot (first character is always a dot in links, which is not required)
    urlx = 'https://news.google.com'
    links = urlx + links
    L.append(links)

for w in soup.find_all(class_='wEwyrc AVN2gc uQIVzc Sksgp'):
    publisher = w.text
    P.append(publisher)

# Checking lengths to see all are equal
print(len(T))
print(len(t))
print(len(P))
print(len(L))

df = pd.DataFrame({'Title': T, 'Date': t, 'Publisher': P, 'Link': L})
print(df)
Here is the current output (first 12 rows only):
As you can see, the dataframe includes dates from before the month of November; what I would like to do is delete all those rows. I have already converted the dates column into a datetime format (see the for r in soup.find_all(...) loop above, where time = datetime.datetime.strptime(...)).
Please advise what line of code to add to achieve the required function. Please let me know if any clarification is required.
IIUC, what you are looking for is:
df = df[df['Date'] > datetime.date(2019, 11, 1)]
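One caveat, sketched below: your scraper appends the string 'Not Specified' when no time is found, so the Date column can mix dates and strings, and the comparison would then fail. Coercing the column first sidesteps that (errors='coerce' turns the strings into NaT, which the comparison drops):
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df = df[df['Date'] > '2019-11-01']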
There are a couple of options. You can write a test in your loop to see if t (lower-case t seems to be your date) is before November; if so, don't even append the other items to their respective lists — see the sketch below.
There's also a dataframe method called drop: https://chrisalbon.com/python/data_wrangling/pandas_dropping_column_and_rows/
You could also use that. Personally, once you have the t variable, I'd test it: if it meets your criteria for adding to the list, then add the others; if not, move on.
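A minimal sketch of that in-loop test, reusing the question's variable names (the cutoff date is my reading of '1-11-2019'):
cutoff = datetime.date(2019, 11, 1)
if time >= cutoff:
    t.append(time)
    # ...and only append the matching title, link and publisher here too,
    # which requires restructuring the separate loops into one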

How to fix this “TypeError: sequence item 0: expected str instance, float found”

I am trying to combine the cell values (strings) in a dataframe column using groupby method, separating the cell values in the grouped cell using commas. I ran into the following error:
TypeError: sequence item 0: expected str instance, float found
The error occurs on the following line of code, see the code block for complete codes:
toronto_df['Neighbourhood'] = toronto_df.groupby(['Postcode','Borough'])['Neighbourhood'].agg(lambda x: ','.join(x))
It seems that in the groupby function, the index corresponding to each row in the un-grouped dataframe is automatically added to the string before it is joined, and this causes the TypeError. However, I have no idea how to fix the issue. I browsed a lot of threads but didn't find a solution. I would appreciate any guidance or assistance!
# Import necessary libraries
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

# Use BeautifulSoup to scrape the table from the Wikipedia page, and set up
# the dataframe containing all the information in the table
wiki_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(wiki_html, 'lxml')
# print(soup.prettify())
table = soup.find('table', class_='wikitable sortable')

table_columns = []
for th_txt in table.tbody.findAll('th'):
    table_columns.append(th_txt.text.rstrip('\n'))

toronto_df = pd.DataFrame(columns=table_columns)
for row in table.tbody.findAll('tr')[1:]:
    row_data = []
    for td_txt in row.findAll('td'):
        row_data.append(td_txt.text.rstrip('\n'))
    toronto_df = toronto_df.append({table_columns[0]: row_data[0],
                                    table_columns[1]: row_data[1],
                                    table_columns[2]: row_data[2]}, ignore_index=True)
toronto_df.head()

# Remove cells with a borough that is Not assigned
toronto_df.replace('Not assigned', np.nan, inplace=True)
toronto_df = toronto_df[toronto_df['Borough'].notnull()]
toronto_df.reset_index(drop=True, inplace=True)
toronto_df.head()

# If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
toronto_df['Neighbourhood'] = toronto_df.groupby(['Postcode','Borough'])['Neighbourhood'].agg(lambda x: ','.join(x))
toronto_df.drop_duplicates(inplace=True)
toronto_df.head()
The expected result of the 'Neighbourhood' column should separate the cell values in the grouped cell using commas, showing something like this (I cannot post images yet, so I just provide the link):
https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1557273600000&hmac=936wN3okNJ1UTDA6rOpQqwELESvqgScu08_Spai0aQQ
As mentioned in the comments, the NaN is a float, so trying to do string operations on it doesn't work (and this is the reason for the error message)
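A quick way to reproduce the error in isolation, with float('nan') playing the role of the NaN cell:
','.join(['North York', float('nan')])
# TypeError: sequence item 1: expected str instance, float found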
Replace the last part of your code with this:
The filling of the NaN is done with np.where according to the logic you specified in your comment:
# If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
toronto_df.Neighbourhood = np.where(toronto_df.Neighbourhood.isnull(), toronto_df.Borough, toronto_df.Neighbourhood)
# use transform rather than agg so the joined string aligns with the original rows;
# agg returns one value per group, which would not align when assigned back
toronto_df['Neighbourhood'] = toronto_df.groupby(['Postcode','Borough'])['Neighbourhood'].transform(lambda x: ','.join(x))
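Equivalently, the NaN fill can be written in one line with fillna:
toronto_df['Neighbourhood'] = toronto_df['Neighbourhood'].fillna(toronto_df['Borough'])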

Problems transforming data in a dataframe

I've written the function (tested and working) below:
import pandas as pd

def ConvertStrDateToWeekId(strDate):
    dateformat = '2016-7-15 22:44:09'  # example of the expected input format
    aDate = pd.to_datetime(strDate)
    wk = aDate.isocalendar()[1]
    yr = aDate.isocalendar()[0]
    Format_4_5_4_date = str(yr) + str(wk)
    return Format_4_5_4_date
and from what I have seen online I should be able to use it this way:
ml_poLines = result.value.select('PURCHASEORDERNUMBER', 'ITEMNUMBER', 'PRODUCTCOLORID', 'RECEIVINGWAREHOUSEID', ConvertStrDateToWeekId('CONFIRMEDDELIVERYDATE'))
However when I "show" my dataframe the "CONFIRMEDDELIVERYDATE" column is the original datetime string! NO errors are given.
I've also tried this:
ml_poLines['WeekId'] = (ConvertStrDateToWeekId(ml_poLines['CONFIRMEDDELIVERYDATE']))
and get the following error:
"ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions." which makes no sense to me.
I've also tried this with no success.
x = ml_poLines.toPandas();
x['testDates'] = ConvertStrDateToWeekId(x['CONFIRMEDDELIVERYDATE'])
ml_poLines2 = spark.createDataFrame(x)
ml_poLines2.show()
The above generates the following error:
AttributeError: 'Series' object has no attribute 'isocalendar'
What have I done wrong?
Your function ConvertStrDateToWeekId takes a string. But in the following line the argument of the function call is a series of strings:
x['testDates'] = ConvertStrDateToWeekId(x['CONFIRMEDDELIVERYDATE'])
A possible workaround for this error is to use the pandas apply function:
x['testDates'] = x['CONFIRMEDDELIVERYDATE'].apply(ConvertStrDateToWeekId)
But without more information about the kind of data you are processing it is hard to provide further help.
This was the work-around that I got to work:
`# convert the confirimedDeliveryDate to a WeekId
x= ml_poLines.toPandas();
x['WeekId'] = x[['ITEMNUMBER', 'CONFIRMEDDELIVERYDATE']].apply(lambda y:ConvertStrDateToWeekId(y[1]), axis=1)
ml_poLines = spark.createDataFrame(x)
ml_poLines.show()`
Not quite as clean as I would like.
Maybe someone else cam propose a cleaner solution.
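A possibly cleaner route, sketched here untested: register the helper as a Spark UDF so the toPandas() round trip isn't needed (assumes ml_poLines is the Spark DataFrame from above and pandas is available on the executors):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

week_id_udf = udf(ConvertStrDateToWeekId, StringType())
ml_poLines = ml_poLines.withColumn('WeekId', week_id_udf('CONFIRMEDDELIVERYDATE'))
ml_poLines.show()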

Replace certain value that is between two specific words

I am trying to replace a value inside a string column which sits between two specific words.
For example, in this dataframe I want to change
df
seller_name url
Lucas http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=102392852&buyer_item=106822419_1056424990
To this
url
http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=Lucas&buyer_item=106822419_1056424990
Look at the seller_name= part of the URL: I replaced the numbers with the real name.
I imagine something like replacing whatever sits between seller_name= and the next &.
This is just an example of what I want to do; really I have many rows in my dataframe, and the length of the numbers inside the seller name is not always the same.
Use apply and replace the string with the seller name:
Sample df
import pandas as pd
import re

df = pd.DataFrame({'seller_name': ['Lucas'],
                   'url': ['http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=102392852&buyer_item=106822419_1056424990']})

def myfunc(row):
    return re.sub(r'(seller_name=\d{1,})', 'seller_name=' + row.seller_name, row.url)

df['url'] = df.apply(lambda x: myfunc(x), axis=1)
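For reference, printing the reworked url from the sample df above gives:
print(df.loc[0, 'url'])
# http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=Lucas&buyer_item=106822419_1056424990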
seller_name = 'Lucas'
url = 'http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=102392852&buyer_item=106822419_1056424990'
a = url.index('seller_name=')
b = url.index('&', a)
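# a + 12 skips past 'seller_name=' (12 characters), so url[a+12:b] is the numeric ID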
out = url.replace(url[a+12:b],seller_name)
print(out)
Try this one:
This solution doesn't assume the order of your query parameters, or the length of the ID you're replacing. All it assumes is that your query is &-delimited and that the seller_name parameter is present.
split_by_amps = url.split('&')
for i in range(len(split_by_amps)):
    if split_by_amps[i].startswith('seller_name'):
        split_by_amps[i] = 'seller_name=' + 'Lucas'   # replace, rather than append to, the parameter
        break
result = '&'.join(split_by_amps)
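To apply this across the dataframe, a small sketch wrapping the same logic in a function (replace_seller is my name, not from the answer above):
def replace_seller(url, name):
    parts = url.split('&')
    for i in range(len(parts)):
        if parts[i].startswith('seller_name='):
            parts[i] = 'seller_name=' + name
            break
    return '&'.join(parts)

df['url'] = df.apply(lambda r: replace_seller(r.url, r.seller_name), axis=1)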
You can use regular expressions to substitute the code for the name:
import pandas as pd
import re

# For example, use a dictionary to map codes to names (keys must be strings,
# since the extracted code is a string)
seller_dic = {'102392852': 'Lucas'}

for i in range(len(df['url'])):
    # very careful with this: if a url doesn't have this structure it will
    # throw an error, so you may want to handle exceptions
    code = re.search(r'seller_name=\d+&', df['url'][i]).group(0)
    code = code.replace("seller_name=", "")
    code = code.replace("&", "")
    name = 'seller_name=' + seller_dic[code] + '&'
    url = re.sub(r'seller_name=\d+&', name, df['url'][i])
    df.loc[i, 'url'] = url
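As an aside, the standard library's URL tools can do the same thing without a hand-written regex; a sketch (quote_via=quote keeps the %20-style encoding of the other parameters):
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, quote

def set_seller(url, name):
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query['seller_name'] = name
    return urlunsplit(parts._replace(query=urlencode(query, quote_via=quote)))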
