Dummy variable for first of the month - python

The crux of this is that it's stock data, so the first trading day of the month might not always be the 1st. I have found a way of isolating these rows, but I don't know how to edit the dataframe to put a "1" next to each of them.
Hopefully this makes sense.
import pandas_datareader as web
import pandas as pd
import numpy as np
import datetime as dt
df = web.DataReader('AAPL', 'google')
df = df.set_index(pd.to_datetime(df['Date']))
df.sort_index(inplace=True)
print(df.groupby(pd.Grouper(freq='MS')).nth(0))
That's the code I'm using. Currently it prints the first trading day of each month correctly, but I'm not sure how to make a new column (D_FoM) with a 1 at every one of these dates.
I'm sure it's something easy but I can't work it out; R is much easier for this sort of thing, I feel.
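One way to do this (a minimal sketch, assuming df is indexed by its DatetimeIndex as in the code above) is to take the index of the first row in each monthly group and flag those dates:
first_days = df.groupby(pd.Grouper(freq='MS')).head(1).index   # first trading day of each month
df['D_FoM'] = df.index.isin(first_days).astype(int)            # 1 on those dates, 0 elsewhere
print(df.loc[df['D_FoM'] == 1].head())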

Related

How do you split an irregularly patterned string into columns for dataframes using pandas?

I did my due diligence on figuring this out, but am still stuck. I'm struggling to split an irregularly patterned string (i.e. text, float, and int in one string with an irregular number of spaces in between).
My goal is to split the 'Item_Description' column into 2 columns - 'Product Size' (i.e. "4.1 OUNCE") and 'Pack Size' (i.e. "1 PK") - please see my attempt below and my screenshot.
When I run the code, nothing happens. Also, since the number of spaces is different for each item, I had no luck creating new df columns with the split; I kept getting column errors.
Really appreciate your help!
import pandas as pd
import re
import csv
import io
from IPython.display import display
from google.colab import files
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded["Total item_level.csv"]))
df.Item_Description.str.split(" ", expand=True)
I think the following works (I didn't try it, though):
df["Pack_Size"] = df.Item_Description.str.split().map(lambda x: x[-2] + " " + x[-1])      # last two tokens, e.g. "1 PK"
df["Product_Size"] = df.Item_Description.str.split().map(lambda x: x[-4] + " " + x[-3])   # the two tokens before them, e.g. "4.1 OUNCE"

Pandas tz_convert: GMT and local time do not match

I am trying to convert UTC data to local time in Mozambique. Mozambique's local time is GMT+2, i.e. Africa/Maputo. However, when using .tz_localize('UTC').tz_convert(X), where X is either 'Etc/GMT+2' or 'Africa/Maputo', I get different answers. As an example:
import pandas as pd
import numpy as np
np.random.seed(2019)
N = 1000
rng = pd.date_range('2019-01-01', freq='10Min', periods=N)
df = pd.DataFrame(np.random.rand(N, 3), columns=['temp','depth','acceleration'], index=rng)
print(df.tz_localize('UTC').tz_convert('Etc/GMT+2'))
print(df.tz_localize('UTC').tz_convert('Africa/Maputo'))
The code that solves my problem is df.tz_localize('UTC').tz_convert('Africa/Maputo'). Therefore, I wonder if I have misunderstood the tz_convert('Etc/GMT+2') method, and why the two variants don't give the same answer. tz_convert('Etc/GMT-2') does the trick but is not intuitive, at least to me.
Thanks in advance.
Time zone conversion using the Etc area works in reverse (the offset sign is flipped relative to common usage), and perhaps those zones should be deprecated altogether, considering the following observation in their documentation:
These entries are mostly present for historical reasons, so that
people in areas not otherwise covered by the tz files could "zic -l"
to a time zone that was right for their area. These days, the
tz files cover almost all the inhabited world, so there's little
need now for the entries that are not on UTC.
Your workaround is correct, and the best explanation of why can be found here. I would stick with tz_convert('Africa/Maputo').
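A quick way to see the sign reversal (a small check, not part of the original answer):
ts = pd.Timestamp('2019-01-01 00:00', tz='UTC')
print(ts.tz_convert('Africa/Maputo'))   # 2019-01-01 02:00:00+02:00
print(ts.tz_convert('Etc/GMT-2'))       # 2019-01-01 02:00:00+02:00 (same +02:00 offset)
print(ts.tz_convert('Etc/GMT+2'))       # 2018-12-31 22:00:00-02:00 (actually UTC-2)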

Resample daily data into weekday data

I'm a beginner, but I can't seem to find a way with resample in pandas to turn daily OHLC data into weekday-anchored weeks, so I can analyse weekday returns over a period of time and how they changed (turn it into weekly data with, for example, Tuesday as the week start, and just use the first value so I don't see the data for the remaining days).
Hope this question is not too stupid, but I've been trying to figure it out for a few days and haven't found a working solution.
Thanks in advance!
import pandas as pd
import datetime
import pandas_datareader.data as web
import numpy as np
df = pd.read_csv("ethusdt.csv",parse_dates=["time"], index_col="time")
ohlc_dict = {
    'open': 'first',
    'high': 'max',
    'low': 'min',
    'close': 'last',
    'volume': 'sum',
    'daily_change': 'sum'
}
weeklydf = df.resample('W').agg(ohlc_dict)                        # default 'W' = 'W-SUN': weeks run Monday through Sunday
weeklydf['weekly_change'] = (weeklydf['close'] / weeklydf['open'] - 1) * 100
EDIT:
Since I probably haven't explained my thought correctly (my English is not very good, sorry), I'll try to explain the problem again.
In the dataframe I have the following data: open, high, low, close, and volume. My goal is to resample it to weekly data, which I already have, but with different starting days: for example, with the data right now the week always starts on Monday, but I want to change it to each of the different weekdays to find out how the cryptocurrency market "frontruns" itself slowly. I have seen this for a long time but would like to have statistical proof.
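One possible approach (a sketch, assuming the daily df and the ohlc_dict above): pandas' weekly frequency can be anchored to any weekday, e.g. 'W-MON' ends each bin on Monday so the week effectively starts on Tuesday, 'W-TUE' gives a Wednesday start, and so on. Looping over the anchors lets you compare the weekly change for each possible start day:
for anchor in ['W-SUN', 'W-MON', 'W-TUE', 'W-WED', 'W-THU', 'W-FRI', 'W-SAT']:
    wk = df.resample(anchor).agg(ohlc_dict)                       # weekly bins ending on that weekday
    wk['weekly_change'] = (wk['close'] / wk['open'] - 1) * 100    # open-to-close change per week
    print(anchor, wk['weekly_change'].mean())                     # compare across week starts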

Why is data lost when multi-label encoding in Python pandas, and how do I solve it?

Download the Data Here
Hi, I have data something like the below and would like to multi-label encode it.
The goal is something like this: target (screenshot)
But the problem is that data is lost when I multi-label encode it, something like below:
issue (screenshot)
using the following code:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
df_enc = df.drop('movieId', 1).join(df.movieId.str.join('|').str.get_dummies())
Can someone help me? Feel free to download the dataset, thank you.
That column, when read in with pandas, will be stored as a string, so first we need to convert it to an actual list.
From there, use .explode() to expand that list into a series (where the index will match the index it came from, and the values will be the values in the list).
Then crosstab the series' index against its values, so each distinct value becomes an indicator column.
Then join that back up with the dataframe on the index values.
Keep in mind that when you do one-hot encoding with high cardinality, your table will blow up into a huge, wide table. I just did this on the first 20 rows and ended up with 233 columns; with the 225,000+ rows it'll take a while (maybe a minute or so) to process, and you end up with close to 1,300 columns. This may be too complex for machine learning to do anything useful with (although it might work with deep learning). You could still try it and see what you get. What I would suggest testing is a way to simplify it: perhaps group the movie ids into a set number of genres or something like that, and then check whether simplifying improves your model/performance.
import pandas as pd
from ast import literal_eval
df = pd.read_csv('ratings_action.csv')
df.movieId = df.movieId.apply(literal_eval)           # parse the string "[1, 2, ...]" into an actual list
s = df['movieId'].explode()                           # one row per (user, movie id), original index preserved
df = df[['userId']].join(pd.crosstab(s.index, s))     # one indicator column per movie id, joined on index

I am having difficulty creating a dataframe in pandas from data taken from a particular URL. Could someone look into this?

import pandas as pd
daf = pd.read_html('https://github.com/justmarkham/DAT8/blob/master/data/beer.txt')
This extracts the dataset from the mentioned URL, but I am having trouble setting up the dataframe with the required index. Just let me know how to organise the dataset properly. If you don't understand my question, just look at the code, run it, and I guess you'll figure out what I am asking.
You could use pandas.read_csv:
import pandas as pd
# assuming the raw file is wanted; the blob URL serves GitHub's HTML page rather than the data itself
daf = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/beer.txt', sep=' ')
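If a particular column should serve as the row index (the "required index" mentioned in the question), read_csv can set it while reading; treating the first column as the index is only an assumption here:
daf = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/beer.txt', sep=' ', index_col=0)   # index_col=0 is an assumption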
