I have a DataFrame that is indexed by Date and has a couple of columns like this:
XLY UA
Date
2017-04-01 0.023991 0.060656
2017-05-01 0.010993 -0.081401
2017-06-01 -0.015596 0.130679
2017-07-01 0.019302 -0.101686
2017-08-01 -0.018608 -0.166207
2017-09-01 0.004684 -0.005298
2017-10-01 0.021203 -0.232357
2017-11-01 0.050658 0.034692
2017-12-01 0.021107 0.116513
2018-01-01 0.092411 -0.035285
2018-02-01 -0.034691 0.171206
...
2022-03-01 0.079468 0.039667
I have a python dictionary of weights
weights = {2022: 6, 2021: 5, 2020: 4, 2019: 3, 2018: 2, 2017: 1}
Is there a way to apply these weights to each row of the DataFrame? For example, the row 2022-03-01 would become 0.079468 * 6 and 0.039667 * 6, and so on for every row in the year 2022; when it gets to 2021 it would multiply by 5, and so forth.
I know I can do this with a loop; I am looking for a concise, functional version.
Use mul on axis=0:
weights = {2022: 6, 2021: 5, 2020: 4, 2019: 3, 2018: 2, 2017: 1}
cols = ['XLY', 'UA']
# map each row's year to its weight, then multiply element-wise down the rows
df[cols] = df[cols].mul(df.index.year.map(weights), axis=0)
print(df)
# Output
XLY UA
Date
2017-04-01 0.023991 0.060656
2017-05-01 0.010993 -0.081401
2017-06-01 -0.015596 0.130679
2017-07-01 0.019302 -0.101686
2017-08-01 -0.018608 -0.166207
2017-09-01 0.004684 -0.005298
2017-10-01 0.021203 -0.232357
2017-11-01 0.050658 0.034692
2017-12-01 0.021107 0.116513
2018-01-01 0.184822 -0.070570
2018-02-01 -0.069382 0.342412
...
2022-03-01 0.476808 0.238002
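For what happens under the hood: df.index.year gives each row's year, and .map(weights) turns those years into the per-row multipliers that mul broadcasts down the rows (axis=0). A tiny sketch of just that step, using made-up dates:
import pandas as pd

idx = pd.to_datetime(['2017-04-01', '2018-01-01', '2022-03-01'])
multipliers = idx.year.map({2017: 1, 2018: 2, 2022: 6})
print(multipliers.tolist())  # [1, 2, 6]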
I would do something like this:
import numpy as np

col_weights = np.array([weights[dt.year] for dt in df.index.get_level_values(0)])
df.loc[:, "XLY"] = df["XLY"] * col_weights
df.loc[:, "UA"] = df["UA"] * col_weights
The first line builds a weights array by mapping each index year through the weights dict.
The next lines apply the weights to each column.
I create the following DataFrame:
import pandas as pd
d = {'T': [1, 2, 4, 15], 'H': [3, 4, 6, 8]}
df = pd.DataFrame(data=d, index=['10.09.2018 13:15:00','10.09.2018 13:30:00', '10.09.2018 14:00:00', '10.09.2018 22:00:00'])
df.index = pd.to_datetime(df.index)
And get the following result.
Out[30]:
T H
2018-10-09 13:15:00 1 3
2018-10-09 13:30:00 2 4
2018-10-09 14:00:00 4 6
2018-10-09 22:00:00 15 8
As you can see, one value is missing at 13:45:00 and a lot of values are missing between 14:00 and 22:00.
Is there a way to automatically find the missing values and insert rows with the missing timestamps and NaN values?
I want to achieve this:
Out[30]:
T H
2018-10-09 13:15:00 1 3
2018-10-09 13:30:00 2 4
2018-10-09 13:45:00 nan nan
2018-10-09 14:00:00 4 6
2018-10-09 14:15:00 nan nan
...
2018-10-09 21:45:00 nan nan
2018-10-09 22:00:00 15 8
You can create a second DataFrame with the desired time step as its index and join it with the original data. The following code worked in my case:
# your code
import pandas as pd
d = {'T': [1, 2, 4, 15], 'H': [3, 4, 6, 8]}
df = pd.DataFrame(data=d, index=['10.09.2018 13:15:00','10.09.2018 13:30:00', '10.09.2018 14:00:00', '10.09.2018 22:00:00'])
df.index = pd.to_datetime(df.index)
# generate a second dataframe with the needed 15-minute index
# (36 periods cover 13:15 through 22:00; 40 runs slightly past the last value)
timerange = pd.date_range('10.09.2018 13:15:00', periods=40, freq='15min')
df2 = pd.DataFrame(index=timerange)
# join the original dataframe with the new one
newdf = df.join(df2, how='outer')
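For what it's worth, asfreq gets the same result more directly, since the original index already starts and ends on the 15-minute grid (a minimal sketch):
# reindex onto a regular 15-minute grid between the first and last timestamp;
# timestamps that were absent in the original become rows of NaN
newdf = df.asfreq('15min')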
This seems like a basic question. I want to use the datetime index of a pandas DataFrame as the x values of a machine learning algorithm for univariate time series comparisons.
I tried to isolate the index and then convert it to a number, but I get an error.
df=data["Close"]
idx=df.index
df.index.get_loc(idx)
Date
2014-03-31 0.9260
2014-04-01 0.9269
2014-04-02 0.9239
2014-04-03 0.9247
2014-04-04 0.9233
This is what I get when I add your code:
2019-04-24 00:00:00 0.7097
2019-04-25 00:00:00 0.7015
2019-04-26 00:00:00 0.7018
2019-04-29 00:00:00 0.7044
x (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
Name: Close, Length: 1325, dtype: object
I need a column running from 1 to the number of values in my dataframe.
First select column Close with double [] to get a one-column DataFrame, so it is possible to add a new column:
import numpy as np

df = data[["Close"]]
df["x"] = np.arange(1, len(df) + 1)
print (df)
Close x
Date
2014-03-31 0.9260 1
2014-04-01 0.9269 2
2014-04-02 0.9239 3
2014-04-03 0.9247 4
2014-04-04 0.9233 5
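As a side note on the double brackets above: single brackets return a Series (which has no columns to add to), while double brackets return a one-column DataFrame. A quick check with a toy frame:
import pandas as pd

data = pd.DataFrame({"Close": [0.9260, 0.9269]})
print(type(data["Close"]))    # <class 'pandas.core.series.Series'>
print(type(data[["Close"]]))  # <class 'pandas.core.frame.DataFrame'>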
You can add a column with the values range(1, len(df) + 1) like so:
df = pd.DataFrame({"y": [5, 4, 3, 2, 1]}, index=pd.date_range(start="2019-08-01", periods=5))
In [3]: df
Out[3]:
y
2019-08-01 5
2019-08-02 4
2019-08-03 3
2019-08-04 2
2019-08-05 1
df["x"] = range(1, len(df) + 1)
In [7]: df
Out[7]:
y x
2019-08-01 5 1
2019-08-02 4 2
2019-08-03 3 3
2019-08-04 2 4
2019-08-05 1 5
I have a dataframe (imported from Excel) which looks like this:
Date Period
0 2017-03-02 2017-03-01 00:00:00
1 2017-03-02 2017-04-01 00:00:00
2 2017-03-02 2017-05-01 00:00:00
3 2017-03-02 2017-06-01 00:00:00
4 2017-03-02 2017-07-01 00:00:00
5 2017-03-02 2017-08-01 00:00:00
6 2017-03-02 2017-09-01 00:00:00
7 2017-03-02 2017-10-01 00:00:00
8 2017-03-02 2017-11-01 00:00:00
9 2017-03-02 2017-12-01 00:00:00
10 2017-03-02 Q217
11 2017-03-02 Q317
12 2017-03-02 Q417
13 2017-03-02 Q118
14 2017-03-02 Q218
15 2017-03-02 Q318
16 2017-03-02 Q418
17 2017-03-02 2018
I am trying to convert the whole 'Period' column into a consistent format. Some elements already appear to be datetimes, others were read as strings (e.g. Q217), and others as ints (e.g. 2018). What is the fastest way to convert everything to datetime? I was trying with some masking, like this:
mask = df['Period'].str.startswith('Q', na=False)
list_quarter = df_final[mask]['Period'].tolist()
quarter_convert = {'1':'31/03', '2':'30/06', '3':'31/08', '4':'30/12'}
counter = 0
for element in list_quarter:
    element = element[1:]
    quarter = element[0]
    year = element[1:]
    daymonth = ''.join(str(quarter_convert.get(word, word)) for word in quarter)
    final = daymonth + '/' + year
    list_quarter[counter] = final
    counter += 1
However it fails when I try to substitute the modified elements in the original column:
df_nwe_final['Period'] = np.where(mask, pd.Series(list_quarter), df_nwe_final['Period'])
Of course I would need to do more or less the same for the 2018-style formats. However, I am sure I am missing something here, and there should be a much faster solution. Some fresh ideas from you would help! Thank you.
Reusing the code you show, let's first write a function that converts the Q-string to a datetime-parseable format (I adjusted the final format a little bit):
def convert_q_string(element):
    quarter_convert = {'1':'03-31', '2':'06-30', '3':'08-31', '4':'12-30'}
    element = element[1:]
    quarter = element[0]
    year = element[1:]
    daymonth = ''.join(str(quarter_convert.get(word, word)) for word in quarter)
    final = '20' + year + '-' + daymonth
    return final
We can now use this to first convert all 'Q'-strings, and then pd.to_datetime to convert all elements to proper datetime values:
In [2]: s = pd.Series(['2017-03-01 00:00:00', 'Q217', '2018'])
In [3]: mask = s.str.startswith('Q')
In [4]: s[mask] = s[mask].map(convert_q_string)
In [5]: s
Out[5]:
0 2017-03-01 00:00:00
1 2017-06-30
2 2018
dtype: object
In [6]: pd.to_datetime(s)
Out[6]:
0 2017-03-01
1 2017-06-30
2 2018-01-01
dtype: datetime64[ns]
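As an aside, pandas can work out quarter-end dates itself via Period, which avoids the hand-written mapping. A sketch, assuming the strings are always Q + quarter digit + two-digit year; note this yields calendar quarter ends (e.g. Q3 -> Sep 30), unlike the custom table above:
import pandas as pd

s = pd.Series(['2017-03-01 00:00:00', 'Q217', '2018'])
mask = s.str.startswith('Q')
# rewrite 'Q217' as '2017Q2' and take the quarter's last calendar day
s[mask] = s[mask].map(
    lambda q: pd.Period('20' + q[2:] + 'Q' + q[1], freq='Q').end_time.normalize()
)
print(pd.to_datetime(s))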
I have a .csv file that has data something like this:
#file...out/houses.csv
#data...sun may 1 11:20:43 2011
#user...abs12
#host...(null)
#group...class=house
#property..change_per_hour
#limit...0
#interval..10000000
#timestamp,house_0,house_1,house_2,house_3,.....,house_1000
2010-07-01 00:00:00 EDT,1.2,1.3,1.4,1.5,........,9.72
2010-07-01 01:00:00 EDT,2.2,2.3,2.4,2.5,........,19.72
2010-07-01 02:00:00 EDT,3.2,3.3,3.4,3.5,........,29.72
2010-07-01 05:00:00 EDT,5.2,5.3,5.4,5.5,........,59.72
2010-07-01 06:00:00 EDT,6.2,,6.4,,..............,
...
I want to convert this and save to a new .csv and the data should look like:
#file...out/houses.csv
#data...sun may 1 11:20:43 2011
#user...abs12
#host...(null)
#group...class=house
#property..change_per_hour
#limit...0
#interval..10000000
#EntityName,2010-07-01 00:00:00 EDT,2010-07-01 01:00:00 EDT,2010-07-01 02:00:00 EDT,2010-07-01 05:00:00 EDT,2010-07-01 06:00:00 EDT
house_0,1.2,2.2,3.2,5.2,6.2,...
house_1,1.3,2.3,3.3,5.3,,...
house_2,1.4,2.4,3.4,5.4,6.4,...
house_3,1.5,2.5,3.5,5.5,,...
...
house_1000,9.72,19.72,29.72,59.72,
I tried to use pandas: convert to a dictionary that looks like dtDict={'house_0':{'datetimestamp_1':'value_1','datetimestamp_2':'value_2'...}...}, but I am not able to build that dictionary and use pandas.DataFrame(dtDict) to do the conversion. I don't have to use pandas (anything in Python works), but I thought pandas was good for CSV manipulation. Any help?
Assuming it is in a pandas dataframe already, this works:
df = pd.DataFrame(
data=[[1, 3], [2, 5]],
index=[0, 1],
columns=['a', 'b']
)
Output:
>>> print(df)
a b
0 1 3
1 2 5
Then, transpose the dataframe:
>>> print(df.transpose())
0 1
a 1 2
b 3 5
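To connect this back to the actual file, here is a minimal end-to-end sketch. It assumes the layout shown above (eight '#' metadata lines, then a header row that itself starts with '#'); the file names are placeholders, and writing the metadata lines back out is omitted:
import pandas as pd

# skip the eight leading '#' metadata lines; line 9 is the header row
df = pd.read_csv('houses.csv', skiprows=8)

# pivot: houses become rows, timestamps become columns
out = df.set_index('#timestamp').transpose()
out.index.name = 'EntityName'
out.columns.name = None
out.to_csv('houses_transposed.csv')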
I've got a pandas DataFrame containing 70 years with hourly data, looking like this:
pressure
2015-06-01 18:00:00 945.6
2015-06-01 19:00:00 945.6
2015-06-01 20:00:00 945.4
2015-06-01 21:00:00 945.4
2015-06-01 22:00:00 945.3
I want to extract the winter months (D-J-F) from every year and generate a new DataFrame with a series of winters.
I found a lot of complicated approaches (e.g. extracting df.index.month as a new column and then addressing that column afterwards), but is there a straightforward way to get the winter months?
You can use map():
import datetime
import pandas as pd

df = pd.DataFrame({'date': [datetime.date(2015, 11, 1), datetime.date(2015, 12, 1),
                            datetime.date(2015, 1, 1), datetime.date(2015, 2, 1)],
                   'pressure': [1, 2, 3, 4]})
winter_months = [12, 1, 2]
print(df)
# date pressure
# 0 2015-11-01 1
# 1 2015-12-01 2
# 2 2015-01-01 3
# 3 2015-02-01 4
df = df[df["date"].map(lambda t: t.month in winter_months)]
print(df)
# date pressure
# 1 2015-12-01 2
# 2 2015-01-01 3
# 3 2015-02-01 4
EDIT: I noticed that in your example the dates are the dataframe's index. This still works:
df = df[df.index.map(lambda t: t.month in winter_months)]
I just found that
df[(df.index.month==12) | (df.index.month==1) | (df.index.month==2)]
works fine.
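A slightly more compact spelling of the same filter, for reference:
# keep only rows whose index month is December, January, or February
df[df.index.month.isin([12, 1, 2])]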