SyntaxError comparing a date with a field in Python

Q) Resample the data to get prices for the end of the business month. Select the Adjusted Close for each stock.

import pandas as pd
amz = pd.read_csv('amz.csv')
mmm = pd.read_csv('mmm.csv')
ibm = pd.read_csv('ibm.csv')
fb = pd.read_csv('fb.csv')
amz_date = amz.loc['Date']==2017-6-30 #dataframe showing entire row for that date
amz_price = amz_date[:, ['Date', 'AdjClose']] #dataframe with only these 2 columns
mmm_date = mmm.loc['Date']==2017-6-30 #dataframe showing entire row for that date
mmm_price = mmm_date[:, ['Date', 'AdjClose']]
ibm_date = ibm.loc['Date']==2017-6-30 #dataframe showing entire row for that date
ibm_price = ibm_date[:, ['Date', 'AdjClose']]
fb_date = fb.loc['Date']==2017-6-30 #dataframe showing entire row for that date
fb_price = fb_date[:, ['Date', 'AdjClose']]
KeyError: 'Date'
What am I doing wrong? Also, 'Date' is a column in the CSV file.

Your particular problem is that 06 is not a legal way to write an integer in Python 3 (leading zeros are not allowed in decimal literals), so 2017-06-30 raises a SyntaxError. Dropping the leading zero only trades one error for another: 2017-6-30 is an arithmetic expression that evaluates to the integer 1981. You need to express this as a date, such as datetime.strptime("2017-06-30", "%Y-%m-%d") (note that strptime requires a format string), or compare against a date string.
Your KeyError: 'Date' comes from mmm.loc['Date'], which looks up a row labelled 'Date' rather than the column; use mmm['Date'] in a boolean mask instead.
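A minimal working sketch (assuming the CSV files and column names from your snippet):
import pandas as pd

mmm = pd.read_csv('mmm.csv')
# Parse the column once, then compare against a date string;
# the boolean mask replaces the .loc['Date'] row lookup that raised the KeyError
mmm['Date'] = pd.to_datetime(mmm['Date'])
mmm_date = mmm.loc[mmm['Date'] == '2017-06-30']  # entire row(s) for that date
mmm_price = mmm_date[['Date', 'AdjClose']]       # only these 2 columns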

Related

Pandas DF, DateOffset, creating new column

So I'm working with the JHU COVID-19 data, and they've let their recovered dataset go; they're no longer tracking it, just confirmed cases and deaths. What I'm trying to do here is recreate it. The table contains the confirmed cases and deaths for every country on every date, sorted by date. My getRecovered function below picks the date of each row, finds the date two weeks before it for that row's country, and returns a 'Recovered' column, which is the confirmed count from two weeks ago minus the deaths today.
Maybe a pointless exercise, but I'd still like to know how to do it, haha. I know it's a big dataset and there are a lot of operations there, but it's been running for 20 minutes now and still going. Did I do something wrong, or would it just take this long?
Thanks for any help, friends.
import pandas as pd
import wget

urls = [
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
]
[wget.download(url) for url in urls]

confirmed = pd.read_csv('time_series_covid19_confirmed_global.csv')
deaths = pd.read_csv('time_series_covid19_deaths_global.csv')
dates = confirmed.columns[4:]

confirmed_long_form = confirmed.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Confirmed'
)
deaths_long_form = deaths.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Deaths'
)
full_table = confirmed_long_form.merge(
    right=deaths_long_form,
    how='left',
    on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)
full_table['Date'] = pd.to_datetime(full_table['Date'])
full_table = full_table.sort_values(by='Date', ascending=True)

def getRecovered(row):
    ts = row['Date']
    country = row['Country/Region']
    ts = pd.Timestamp(ts)
    do = pd.tseries.offsets.DateOffset(n=14)
    newTimeStamp = ts - do
    oldrow = full_table.loc[(full_table['Date'] == newTimeStamp) & (full_table['Country/Region'] == country)]
    return oldrow['Confirmed'] - row['Deaths']

full_table['Recovered'] = full_table.apply(lambda row: getRecovered(row), axis=1)
full_table
Your function is being applied row by row, which is likely why performance is suffering. Pandas is fastest when you make use of vectorised functions. For example you can use
pd.to_datetime(full_table['Date'])
to convert the whole date column much faster (see here: Convert DataFrame column type from string to datetime).
You can then add the date offset to that column, something like:
full_table['Recovery_date'] = pd.to_datetime(full_table['Date']) - pd.tseries.offsets.DateOffset(n = 14)
You can then self merge the table on date==recovery_date (plus any other keys) and subtract the numbers.
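A rough sketch of that self-merge (assuming the full_table built above; Recovery_date is a name I'm introducing, and the keys may need adjusting for countries split into provinces):
full_table['Recovery_date'] = full_table['Date'] - pd.tseries.offsets.DateOffset(n=14)
# align each row with the row for the same place 14 days earlier
merged = full_table.merge(
    full_table[['Province/State', 'Country/Region', 'Date', 'Confirmed']],
    left_on=['Province/State', 'Country/Region', 'Recovery_date'],
    right_on=['Province/State', 'Country/Region', 'Date'],
    how='left',
    suffixes=('', '_past')
)
merged['Recovered'] = merged['Confirmed_past'] - merged['Deaths']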

I am working with the Covid-19 dataset from the European CDC. I have pivoted the data frame, but it seems that the values are being aggregated

I am working with the Covid-19 dataset and have used the pivot function below:
url = 'https://opendata.ecdc.europa.eu/covid19/casedistribution/csv'
data = pd.read_csv(url,parse_dates=['dateRep'],index_col=['dateRep'])
data.head()
data.index.name = "date"
data = data.drop(['day', 'month', 'year','geoId','countryterritoryCode','continentExp'], axis = 1)
data = data.rename(columns={'countriesAndTerritories': 'country', 'popData2019':'population', 'continentExp' : 'continent' })
df_pivot = data.pivot(index = 'date', columns = 'country', values = 'cases').fillna(0)
df_pivot
When I look at US cases on 2020-01-04, the number in the pivoted data frame is "24998.0", which is incorrect (it should be 0). I will appreciate any suggestions. Thanks.
pivot cannot aggregate, it only reshapes.
The issue is that pandas parses the index automatically and gets days and months confused: dateRep is day-first (dd/mm/yyyy), so it parsed April 1st as January 4th. The simplest fix is to parse the dates manually with an explicit format after reading.
import pandas as pd
df = pd.read_csv('https://opendata.ecdc.europa.eu/covid19/casedistribution/csv')
df['dateRep'] = pd.to_datetime(df['dateRep'], format='%d/%m/%Y')
df = df.set_index('dateRep')
And now we can see it will be fine:
df[df['countriesAndTerritories'] == 'United_States_of_America'].sort_index().plot(y='cases')
We can check that the automatic parsing in your method was indeed confused, mixing up April 1st and January 4th:
df[(df.index == '2020-04-01') & df['countriesAndTerritories'].eq('United_States_of_America')].cases
#dateRep
#2020-04-01 24998 # <-- That's your number for Jan 4th.
#Name: cases, dtype: int64
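Alternatively, read_csv itself accepts a dayfirst flag, so the parse can be fixed at read time (a one-liner sketch using the url defined above, not tested against this exact file):
df = pd.read_csv(url, parse_dates=['dateRep'], dayfirst=True, index_col=['dateRep'])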

How to find and add missing dates in a dataframe of sorted dates (descending order)?

In Python, I have a DataFrame with a column 'Date' (format e.g. 2020-06-26). This column is sorted in descending order: 2020-06-26, 2020-06-25, 2020-06-24...
The other column, 'Reviews', is made of text reviews of a website. My data can have multiple reviews on a given date or no reviews on another date. I want to find which dates are missing in column 'Date'. Then, for each missing date, add one row with the date in format='%Y-%m-%d' and an empty review in 'Reviews', to be able to plot them. How should I do this?
from datetime import date, timedelta
d = data['Date']
print(d[0])
print(d[-1])
date_set = set(d[-1] + timedelta(x) for x in range((d[0] - d[-1]).days))
missing = sorted(date_set - set(d))
missing = pd.to_datetime(missing, format='%Y-%m-%d')
idx = pd.date_range(start=min(data.Date), end=max(data.Date), freq='D')
#tried this
data = data.reindex(idx, fill_value=0)
data.head()
#Got TypeError: 'fill_value' ('0') is not in this Categorical's categories.
#also tried this
df2 = (pd.DataFrame(data.set_index('Date'), index=idx).fillna(0) + data.set_index('Date')).ffill().stack()
df2.head()
#Got ValueError: cannot reindex from a duplicate axis
This is my code:
for i in range(len(df)):
    if i > 0:
        prev = df.loc[i-1]["Date"]
        current = df.loc[i]["Date"]
        for a in range((prev - current).days):
            if a > 0:
                df.loc[df["Date"].count()] = [prev - timedelta(days=a), None]

df = df.sort_values("Date", ascending=False)
print(df)
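For comparison, a vectorised sketch that avoids the row loop and also tolerates several reviews per date (the duplicate dates are what broke reindex with the duplicate-axis error; column names follow the question):
df['Date'] = pd.to_datetime(df['Date'])  # ensure real datetimes first
idx = pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')
# dates present in the full range but absent from the data
missing = idx.difference(df['Date'].unique())
filler = pd.DataFrame({'Date': missing, 'Reviews': None})
df = (pd.concat([df, filler], ignore_index=True)
        .sort_values('Date', ascending=False)
        .reset_index(drop=True))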

Filtering Pandas Dataframe by date not working

I am downloading Bitcoin price data using the Cryptowatch API. Downloading the price data works well, but I only need data from the last month, i.e. from 29.10.2019 until 28.11.2019. I read several answers to similar questions, but they do not seem to work for my code: I get the same output after filtering as without filtering.
Here is my code:
import datetime
import requests
import pandas as pd

#define 1 day period
periods = '86400'
#get price data from cryptowatch API
resp = requests.get('https://api.cryptowat.ch/markets/bitfinex/btcusd/ohlc', params={'periods': periods})
resp.ok
#create pandas dataframe
data = resp.json()
df = pd.DataFrame(data['result'][periods], columns=[
    'CloseTime', 'OpenPrice', 'HighPrice', 'LowPrice', 'ClosePrice', 'Volume', 'NA'])
#make a date out of CloseTime
df['CloseTime'] = pd.to_datetime(df['CloseTime'], unit='s')
#make CloseTime the index of the dataframe
df.set_index('CloseTime', inplace=True)
#filter df by date until 1 month ago
df.loc[datetime.date(year=2019,month=10,day=29):datetime.date(year=2019,month=11,day=28)]
df
There is no error or anything, but the output is always the same, so filtering does not work.
Thank you very much in advance!!
Two things are going on here. First, df.loc[...] returns a new, filtered DataFrame that you never assign to anything, so printing df shows the original, unfiltered data. Second, use the string format of datetimes for filtering; check also the partial string indexing docs for more information:
df1 = df['2019-10-29':'2019-11-28']
Or:
s = datetime.datetime(year=2019,month=10,day=29)
e = datetime.datetime(year=2019,month=11,day=28)
df1 = df[s:e]
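Note that slicing by label like this assumes a sorted DatetimeIndex, and that you keep the result, which the original snippet did not (a short sketch assuming the DataFrame built in the question):
df = df.sort_index()
df1 = df['2019-10-29':'2019-11-28']
df1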

Python: Convert columns into date format and extract order

I am asking for help in transforming values into date format.
I have following data structure:
ID ACT1 ACT2 ACT3 ACT4
1 154438.0 154104.0 155321.0 155321.0
2 154042.0 154073.0 154104.0 154104.0
...
The numbers in columns ACT1-ACT4 need to be converted to dates. Some rows contain NaN values.
I found that following function helps me to get a Gregorian date:
from datetime import datetime, timedelta
gregorian = datetime.strptime('1582/10/15', "%Y/%m/%d")
modified_date = gregorian + timedelta(days=154438)
datetime.strftime(modified_date, "%Y/%m/%d")
It would be great to know how I can apply this transformation to all columns except for "ID" and whether the approach is correct (or could be improved).
After the transformation is applied, I need to extract the order of column items, sorted by date in ascending order. For instance
ID ORDER
1 ACT1, ACT3, ACT4, ACT2
2 ACT2, ACT1, ACT3, ACT4
Thank you!
It sounds like you have two questions here.
1) To change to datetime:
cols = [col for col in df.columns if col != 'ID']
df.loc[:, cols] = df.loc[:, cols].applymap(lambda x: datetime.strptime('1582/10/15', "%Y/%m/%d") + timedelta(days=x) if np.isfinite(x) else x)
2) To get the sorted column names:
df['ORDER'] = df.loc[:, cols].apply(lambda dr: ','.join(df.loc[:, cols].columns[dr.dropna().argsort()]), axis=1)
Note: the dropna above will omit columns with NaT values from the order string.
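Put together on the sample rows from the question (a small self-contained sketch following the approach above; sort_values().index is a minor variation on argsort):
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame({'ID': [1, 2],
                   'ACT1': [154438.0, 154042.0],
                   'ACT2': [154104.0, 154073.0],
                   'ACT3': [155321.0, 154104.0],
                   'ACT4': [155321.0, 154104.0]})
cols = [col for col in df.columns if col != 'ID']
base = datetime.strptime('1582/10/15', "%Y/%m/%d")
# convert day offsets to dates, leaving NaN untouched
df[cols] = df[cols].applymap(lambda x: base + timedelta(days=x) if np.isfinite(x) else x)
# column names of each row, ordered by ascending date
df['ORDER'] = df[cols].apply(lambda dr: ','.join(dr.dropna().sort_values().index), axis=1)
print(df[['ID', 'ORDER']])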
First I would make the input comma-separated so that it's much easier to handle, of the form:
ID,ACT1,ACT2,ACT3,ACT4
1,154438.0,154104.0,155321.0,155321.0
2,154042.0,154073.0,154104.0,154104.0
Then you can read each line using a CSV reader, extracting key/value pairs that have your column names as keys. You pop the ID off each row's dictionary to get its value (1, 2, etc.), and then reorder the remaining items by their value, which is the date. The code is below:
#!/usr/bin/env python3
import csv
from operator import itemgetter

idAndTuple = {}
with open('time.txt') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        myID = row.pop('ID', None)
        reorderedList = sorted(row.items(), key=itemgetter(1))
        idAndTuple[myID] = reorderedList
        print(myID, reorderedList)
The result when you run this is:
1 [('ACT2', '154104.0'), ('ACT1', '154438.0'), ('ACT3', '155321.0'), ('ACT4', '155321.0')]
2 [('ACT1', '154042.0'), ('ACT2', '154073.0'), ('ACT3', '154104.0'), ('ACT4', '154104.0')]
which I think is what you are looking for.
