Apply Numpy function over entire Dataframe - python

I am applying this function over a dataframe df1 such as the following:
AA AB AC AD
2005-01-02 23:55:00 "EQUITY" "EQUITY" "EQUITY" "EQUITY"
2005-01-03 00:00:00 32.32 19.5299 32.32 31.0455
2005-01-04 00:00:00 31.9075 19.4487 31.9075 30.3755
2005-01-05 00:00:00 31.6151 19.5799 31.6151 29.971
2005-01-06 00:00:00 31.1426 19.7174 31.1426 29.9647
def func(x):
    for index, price in x.iteritems():
        x[index] = price / np.sum(x,axis=1)
    return x[index]
df3=func(df1.ix[1:])
However, I only get a single column returned as opposed to 3
2005-01-03 0.955843
2005-01-04 0.955233
2005-01-05 0.955098
2005-01-06 0.955773
2005-01-07 0.955877
2005-01-10 0.95606
2005-01-11 0.95578
2005-01-12 0.955621
I am guessing I am missing something in the formula to make it apply to the entire dataframe. Also how could I return the first index that has strings in its row?

Apply the function row by row instead:
def func(row):
    return row/np.sum(row)
df2 = pd.concat([df1[:1], df1[1:].apply(func, axis=1)], axis=0)
There are two steps:
df1[:1] extracts the first row, which contains the strings, while df1[1:] is the rest of the DataFrame. Concatenating the two afterwards puts the string row back on top, which answers the second part of your question.
To operate on each row, use the apply() method with axis=1. Your original function returns x[index], which selects just one column, hence the single-column result.
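As a vectorized alternative to apply(), the same row normalization can be sketched with DataFrame.div. This is only a minimal sketch on a cut-down copy of the question's data (two columns, two price rows); the names header, prices and weights are illustrative and do not come from the original posts.

```python
import pandas as pd

# Cut-down version of the question's frame: a row of strings followed by prices.
df1 = pd.DataFrame(
    {"AA": ["EQUITY", 32.32, 31.9075], "AB": ["EQUITY", 19.5299, 19.4487]},
    index=["2005-01-02 23:55:00", "2005-01-03 00:00:00", "2005-01-04 00:00:00"],
)

header = df1.iloc[:1]                # the row of strings
prices = df1.iloc[1:].astype(float)  # object columns converted to float

# Divide every row by its own sum; axis=0 aligns the row sums on the index.
weights = prices.div(prices.sum(axis=1), axis=0)

# Put the string row back on top.
result = pd.concat([header, weights])
print(result)
```

Each row of weights should sum to 1, and no Python-level function is called per row.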

Related

Converting Pandas DataFrame dates so that I can pick out particular dates

I have two dataframes with particular data that I'm needing to merge.
Date Greenland Antarctica
0 2002.29 0.00 0.00
1 2002.35 68.72 19.01
2 2002.62 -219.32 -59.36
3 2002.71 -242.83 46.55
4 2002.79 -209.12 63.31
.. ... ... ...
189 2020.79 -4928.78 -2542.18
190 2020.87 -4922.47 -2593.06
191 2020.96 -4899.53 -2751.98
192 2021.04 -4838.44 -3070.67
193 2021.12 -4900.56 -2755.94
[194 rows x 3 columns]
and
Date Mean Sea Level
0 1993.011526 -38.75
1 1993.038692 -39.77
2 1993.065858 -39.61
3 1993.093025 -39.64
4 1993.120191 -38.72
... ... ...
1021 2020.756822 62.83
1022 2020.783914 62.93
1023 2020.811006 62.98
1024 2020.838098 63.00
1025 2020.865190 63.00
[1026 rows x 2 columns]
My ultimate goal is to pull the data from the second DataFrame (the Mean Sea Level column) that comes from roughly the same time frame as the dates in the first DataFrame, and then merge it into the first DataFrame.
However, the only ways I can think of to select certain dates involve first converting the dates in the Date columns of both DataFrames to something pandas recognizes, and I have been unable to figure out how to do that. I worked out some code (below) that converts an individual date to a more common date format, but it has been difficult to apply it to all of the dates in a DataFrame. I'm also not sure whether pandas can then convert the result to a date format it recognizes.
from datetime import datetime
def fraction2datetime(year_fraction: float) -> datetime:
    year = int(year_fraction)
    fraction = year_fraction - year
    first = datetime(year, 1, 1)
    aux = datetime(year + 1, 1, 1)
    return first + (aux - first)*fraction
I also looked at pandas.to_datetime but I don't see a way to have it read the format the dates are initially in.
So does anyone have any guidance on this? Firstly with the conversion of dates, but also with the task of picking out the dates from the second dataframe if possible. Any help would be greatly appreciated.
Suppose you have these two DataFrames:
df1:
Date Greenland Antarctica
0 2020.79 -4928.78 -2542.18
1 2020.87 -4922.47 -2593.06
2 2020.96 -4899.53 -2751.98
3 2021.04 -4838.44 -3070.67
4 2021.12 -4900.56 -2755.94
df2:
Date Mean Sea Level
0 2020.756822 62.83
1 2020.783914 62.93
2 2020.811006 62.98
3 2020.838098 63.00
4 2020.865190 63.00
To convert the dates:
from datetime import datetime
def fraction2datetime(year_fraction: float) -> datetime:
    # Split the value into the whole year and the fraction of that year
    year = int(year_fraction)
    fraction = year_fraction - year
    first = datetime(year, 1, 1)
    aux = datetime(year + 1, 1, 1)
    # Scale the length of the year by the fraction and add it to January 1st
    return first + (aux - first) * fraction
df1["Date"] = df1["Date"].apply(fraction2datetime)
df2["Date"] = df2["Date"].apply(fraction2datetime)
print(df1)
print(df2)
Prints:
Date Greenland Antarctica
0 2020-10-16 03:21:35.999999 -4928.78 -2542.18
1 2020-11-14 10:04:47.999997 -4922.47 -2593.06
2 2020-12-17 08:38:24.000001 -4899.53 -2751.98
3 2021-01-15 14:23:59.999999 -4838.44 -3070.67
4 2021-02-13 19:11:59.999997 -4900.56 -2755.94
Date Mean Sea Level
0 2020-10-03 23:55:28.012795 62.83
1 2020-10-13 21:54:02.073603 62.93
2 2020-10-23 19:52:36.134397 62.98
3 2020-11-02 17:51:10.195198 63.00
4 2020-11-12 15:49:44.255992 63.00
For the join, you can use pd.merge_asof, which requires both DataFrames to be sorted by the key column. For example, this joins each row of df1 to the nearest date in df2 within a 30-day tolerance (tweak these values as needed):
x = pd.merge_asof(
    df1, df2, on="Date", tolerance=pd.Timedelta(days=30), direction="nearest"
)
print(x)
This prints:
Date Greenland Antarctica Mean Sea Level
0 2020-10-16 03:21:35.999999 -4928.78 -2542.18 62.93
1 2020-11-14 10:04:47.999997 -4922.47 -2593.06 63.00
2 2020-12-17 08:38:24.000001 -4899.53 -2751.98 NaN
3 2021-01-15 14:23:59.999999 -4838.44 -3070.67 NaN
4 2021-02-13 19:11:59.999997 -4900.56 -2755.94 NaN
to_datetime() accepts a format argument, but its strftime-style directives have no code for a fractional year, so dates like 2002.29 need a custom function applied with apply(). If performance is a concern, be aware that apply() does not perform as well as built-in pandas methods.
To combine the DataFrames you can use an outer join on the date column.
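If apply() turns out to be too slow, the fractional-year conversion can also be written with whole-column operations. The following is a rough sketch, not a drop-in replacement: the function name fraction_to_datetime is made up here, and the results should match the per-value function only up to rounding.

```python
import numpy as np
import pandas as pd

def fraction_to_datetime(years: pd.Series) -> pd.Series:
    """Convert decimal years (e.g. 2020.5) to timestamps, column-wise."""
    year = np.floor(years).astype(int)
    # January 1st of each year and of the following year
    start = pd.to_datetime(year.astype(str) + "-01-01")
    end = pd.to_datetime((year + 1).astype(str) + "-01-01")
    # Scale the length of each year by the fractional part
    return start + (end - start) * (years - year)

dates = fraction_to_datetime(pd.Series([2020.5, 2021.0]))
print(dates)
```

Because the year length is computed per row, leap years such as 2020 are handled the same way as in the original function.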

Select rows from Python DataFrame

I have got a Python DataFrame called "x" like this:
363108 05:01:00
363107 05:02:00
363106 05:03:00
363105 05:04:00
363104 05:05:00
...
4 16:57:00
3 16:58:00
2 16:59:00
1 17:00:00
0 17:01:00
The "time" column is string type.
I want to create a new DataFrame called "m" from all the rows in "x" such that the minute is "00".
I have tried m = x.loc[x["time"][3:5] == "00"] but I get "IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)."
Does anybody know how to do this please?
Use apply to evaluate the condition on each string:
x.loc[x["time"].apply(lambda s: s[3:5] == "00")]
Note: in your code, x["time"][3:5] slices the Series itself (a range of rows), not the characters of each string.
Another way is to create a new column in the existing DataFrame that holds the minutes field, sliced from each time string with the .str accessor, and then filter on it:
df['minutes']=df['time'].str[3:5]
other_df=df.loc[df['minutes']=="00"]
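Both approaches can be combined into a single expression with the .str accessor, which slices every string in the column. A minimal sketch on a few made-up rows in the same shape as the question's data:

```python
import pandas as pd

# A few illustrative rows; the real frame has a descending integer index.
x = pd.DataFrame(
    {"time": ["05:01:00", "16:59:00", "17:00:00", "17:01:00"]},
    index=[3, 2, 1, 0],
)

# .str[3:5] takes characters 3 and 4 of each string, i.e. the minutes.
m = x.loc[x["time"].str[3:5] == "00"]
print(m)
```

The boolean Series produced by the comparison has the same index as x, so the alignment error from the question does not occur.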

Pandas Resample-Sum without Zero filling

When resampling a Series with mean aggregation (daily to monthly), missing datetimes are filled with NaNs, which is fine since we can simply remove them with .dropna().
However, with sum aggregation, missing datetimes are filled with 0s (zeros), which is technically correct but a bit bothersome because a mask is needed to remove them.
Is there a more efficient way to resample with a sum aggregate without zero-filling or using masks? Preferably something similar to dropna(), but for dropping 0s.
For example:
ser = pd.Series([1]*6)
ser.index = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-03-01', '2000-03-02', '2000-05-01', '2000-05-02'])
# wanted output
# 2000-01-31 2.0
# 2000-03-31 2.0
# 2000-05-31 2.0
# ideal output but for aggregate sum.
ser.resample('M').mean().dropna()
# 2000-01-31 1.0
# 2000-03-31 1.0
# 2000-05-31 1.0
# not ideal
ser.resample('M').sum()
# 2000-01-31 2
# 2000-02-29 0
# 2000-03-31 2
# 2000-04-30 0
# 2000-05-31 2
Using .groupby() with pd.Grouper() gives exactly the same behavior as resampling.
# not ideal
ser.groupby(pd.Grouper(freq='M')).sum()
# 2000-01-31 2
# 2000-02-29 0
# 2000-03-31 2
# 2000-04-30 0
# 2000-05-31 2
Using .groupby() with index.year also works; however, there does not seem to be an equivalent 'identity' for calendar month. Note that .index.month is not what we are after, since it would merge the same month across different years.
ser = pd.Series([1]*6)
ser.index = pd.to_datetime(['2000-01-01', '2000-01-02', '2002-03-01', '2002-03-02', '2005-05-01', '2005-05-02'])
ser.groupby(ser.index.year).sum()
# 2000 2
# 2002 2
# 2005 2
Add pd.offsets.MonthEnd(0) to the DatetimeIndex of ser to roll every date forward to its month end (a date already on a month end stays put, whereas the default MonthEnd() would push it to the next month end), then pass the result to Series.groupby and aggregate using sum or mean:
grp = ser.groupby(ser.index + pd.offsets.MonthEnd(0))
s1, s2 = grp.sum(), grp.mean()
Result:
print(s1)
2000-01-31 2
2002-03-31 2
2005-05-31 2
dtype: int64
print(s2)
2000-01-31 1.0
2002-03-31 1.0
2005-05-31 1.0
dtype: float64
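Staying with resample is also possible: sum() takes a min_count argument, and with min_count=1 an empty bin yields NaN instead of 0, so dropna() works as it does for mean(). A small sketch on the question's first example; passing the MonthEnd offset object instead of the 'M' string is a choice made here to sidestep alias changes between pandas versions.

```python
import pandas as pd

ser = pd.Series([1] * 6)
ser.index = pd.to_datetime(
    ["2000-01-01", "2000-01-02", "2000-03-01", "2000-03-02", "2000-05-01", "2000-05-02"]
)

# min_count=1: bins with no observations become NaN rather than 0,
# so they can be dropped without also dropping genuine zero sums.
monthly = ser.resample(pd.offsets.MonthEnd()).sum(min_count=1).dropna()
print(monthly)
```

Unlike masking on zero, this keeps months whose values really do sum to 0. The result is float because NaN is introduced before the drop.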

Accessing column values with columns set up as values of an index

I have the following dataframe named df:
V1 V2
IDS
a 1 2
b 3 4
If I print out the index and the columns, this is the result:
> print(df.index)
Index(['a','b'],dtype='object',name='IDS',length=2)
> print(df.columns)
Index(['V1','V2'],dtype='object',length=2)
I want to perform a calculation on these two columns (row-wise) and add this to a new column. I have tried the following, but I can't seem to access the column as expected.
df['sum'] = df.apply(lambda row: row['V1'] + row['V2'], axis=1)
I get the following error running the last line of code:
KeyError: ('V1', 'occurred at index a')
How do I access these columns?
Update: contrived example is not showing the error, here is the actual dataframe I am working with:
DATE ... gathering_size_100_to_26 shelter_in_place
FIPS
10001 2020-01-22 ... 2020-01-01 2020-01-01
10002 2020-01-22 ... 2020-01-01 2020-01-02
10003 2020-02-25 ... 2020-01-01 2020-01-03
... ... ... ... ...
9013 2020-02-22 ... 2020-01-01 2020-01-01
I want to take the difference between 'gathering_size_100_to_26' and 'DATE', as well as 'shelter_in_place' and 'DATE' and replace this value in place.
You can add the columns directly, without apply:
df["v1_v2_sum"] = df["V1"] + df["V2"]
In general, avoid df.apply with user-defined functions: they perform poorly and are only needed when no vectorized alternative exists.
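The same column-wise style should carry over to the date differences described in the update. The sketch below assumes the columns hold date strings (or datetimes) and uses three of the rows shown in the question; the variable name dates is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "DATE": ["2020-01-22", "2020-01-22", "2020-02-25"],
        "gathering_size_100_to_26": ["2020-01-01", "2020-01-01", "2020-01-01"],
        "shelter_in_place": ["2020-01-01", "2020-01-02", "2020-01-03"],
    },
    index=pd.Index([10001, 10002, 10003], name="FIPS"),
)

# Parse DATE once, before any column is overwritten.
dates = pd.to_datetime(df["DATE"])

# Replace each policy column with its offset from DATE (a timedelta column).
for col in ["gathering_size_100_to_26", "shelter_in_place"]:
    df[col] = pd.to_datetime(df[col]) - dates

print(df)
```

If whole days are wanted rather than timedeltas, df[col].dt.days converts the result to integers.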
df = pd.DataFrame(data=[[0.8062, 0.9308], [0.364 , 0.6909]],index=['a','b'], columns=['V1','V2'])
print(df)
Output:
V1 V2
a 0.8062 0.9308
b 0.3640 0.6909
df['sum'] = df.apply(sum,axis=1)
print(df)
Output:
V1 V2 sum
a 0.8062 0.9308 1.7370
b 0.3640 0.6909 1.0549
I realized I had a typo... what was stated above (but reworked for my instance) works.

Apply function to all columns in pd dataframe using variables in other dataframes

I have a series of dataframes, some hold static values some are time series.
I have been able to add
I want to transpose the values from one time series to a new time series, applying a function which draws values from both the original time series dataframe and the dataframe which holds static values.
A snip of the time series and static dataframes are below.
Time series dataframe (Irradiance)
Datetime Date Time GHI DIF flagR SE SA TEMP
2017-07-01 00:11:00 01.07.2017 00:11 0 0 0 -9.39 -179.97 11.1
2017-07-01 00:26:00 01.07.2017 00:26 0 0 0 -9.33 -176.47 11.0
2017-07-01 00:41:00 01.07.2017 00:41 0 0 0 -9.14 -172.98 10.9
2017-07-01 00:56:00 01.07.2017 00:56 0 0 0 -8.83 -169.51 10.9
2017-07-01 01:11:00 01.07.2017 01:11 0 0 0 -8.40 -166.04 11.0
Static dataframe (Properties)
Bearing (degrees) Inclination (degrees)
Home ID
151631 244 29
151632 244 29
151633 184 44
I have written a function which I want to use to populate a new dataframe using values from both of these.
def dif(DIF, Inclination, GHI):
    global Albedo
    return DIF * (1 + math.cos(math.radians(Inclination)) / 2) + (GHI * Albedo * (1 - math.cos(math.radians(Inclination)) / 2))
When I have done the same within a single dataframe, I have used the NumPy vectorize function, so I thought I would be able to iterate over each column of the new dataframe using the following code.
for column in DIF:
    DIF[column] = np.vectorize(dif)(irradiance['DIF'], properties.iloc['Inclination (degrees)'][column], irradiance['GHI'])
Instead this throws the following error.
TypeError: cannot do positional indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [Inclination (degrees)] of <class 'str'>
I've checked the dtypes for the values of Inclination(degrees) but it is returned as Int64, not str so I'm not sure why this error is being generated.
I'm obviously missing something critical here. Are there alternative methods that would work better, or at all? Any help would be much appreciated.
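The TypeError itself comes from .iloc, which only accepts integer positions; 'Inclination (degrees)' is a column label, so the lookup has to go through .loc (row label first, then column label). One possible way to build the new frame is sketched below. It keeps the expression from the question unchanged but swaps math.cos for np.cos so it works on whole columns; the ALBEDO value and the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd

ALBEDO = 0.2  # hypothetical value; the question does not state one

def dif(DIF, inclination, GHI, albedo=ALBEDO):
    # Same expression as in the question, evaluated on whole columns.
    cos_incl = np.cos(np.radians(inclination))
    return DIF * (1 + cos_incl / 2) + GHI * albedo * (1 - cos_incl / 2)

irradiance = pd.DataFrame(
    {"GHI": [0.0, 100.0], "DIF": [0.0, 40.0]},
    index=pd.to_datetime(["2017-07-01 00:11", "2017-07-01 12:11"]),
)
properties = pd.DataFrame(
    {"Bearing (degrees)": [244, 184], "Inclination (degrees)": [29, 44]},
    index=pd.Index([151631, 151633], name="Home ID"),
)

# One output column per home; .loc looks up the inclination by label.
result = pd.DataFrame(
    {
        home: dif(
            irradiance["DIF"],
            properties.loc[home, "Inclination (degrees)"],
            irradiance["GHI"],
        )
        for home in properties.index
    }
)
print(result)
```

Passing the albedo as a parameter instead of a global is a design choice of this sketch, not something the question requires.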
