I'm converting a financial spreadsheet into Pandas, and this is a frequent challenge that comes up.
In Excel, suppose you have a calculation where, for columns 0 through n, each value depends on the previous column [shown in the format Cell(row, column)]: Cell(1, n) = (Cell(1, n-1)^2) * 5.
Obviously, for n=2, you could create a calculated column in Pandas:
df[2] = (df[1] ** 2) * 5
But for a chain of, say, 30 columns, that doesn't work, so currently I am using a for loop.
total_columns_needed = list(range(1, 100))
for i in total_columns_needed:
    df[i] = (df[i-1] ** 2) * 5
That loop works fine, but I am trying to see how I could use map and apply to make this cleaner. From what I've read, apply is a loop under the hood, so I'm not sure I will gain any speed from doing this, but it could shrink the code by a lot.
The problem that I've had with:
df.apply()
is that 1) there could be other columns not involved in the calculation (which arguably shouldn't be there if the data is properly normalised), and 2) the columns don't exist yet. Part 2 could possibly be solved by creating the dataframe with all the needed columns, but I'm trying to avoid that for other reasons.
Any help in solving this greatly appreciated!
To automatically generate a bunch of columns, without a loop:
In [433]:
df = pd.DataFrame({'Val': [0,1,2,3,4]})
In [434]:
print(df.Val.apply(lambda x: pd.Series(x + np.arange(0, 25, 5))))
0 1 2 3 4
0 0 5 10 15 20
1 1 6 11 16 21
2 2 7 12 17 22
3 3 8 13 18 23
4 4 9 14 19 24
numpy.arange(0,25,5) gives you array([ 0, 5, 10, 15, 20]). For each of the values in Val, we will add that value to array([ 0, 5, 10, 15, 20]), creating a new Series.
And finally, the new Series are put back together into a new DataFrame; a runnable sketch of the whole thing, including attaching the result to the original frame, follows.
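For reference, a self-contained sketch of the same idea, including the concat step back onto the original frame (the integer column labels 0-4 simply come from the positions in the generated Series):
import numpy as np
import pandas as pd
df = pd.DataFrame({'Val': [0, 1, 2, 3, 4]})
# Each value in Val expands into a Series of five derived values,
# so apply() returns a DataFrame with columns 0..4
new_cols = df.Val.apply(lambda x: pd.Series(x + np.arange(0, 25, 5)))
# Attach the generated columns next to the original column
df = pd.concat([df, new_cols], axis=1)
print(df)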
Related
I have a problem with how to appropriately code this condition. I'm currently creating a new pandas column in my dataframe, new_column, which performs a subtraction on the values in column test, based on which index of the data we are at. I'm currently using this code to get it to subtract a different value every fourth row:
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = np.where(data.index % 4,
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
print(data['new_column'])
[6,1,2,1,-5,0,-1,3,4,6]
However, I now want it to perform the higher subtraction on the first two positions in the column, then three subtractions with the original value, another two with the higher subtraction value, three small subtractions, and so forth. I thought I could do it with an | condition in my np.where statement:
data['new_column'] = np.where((data.index % 4) | (data.index % 5),
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
However, this didn't work, and I feel my maths may be slightly off. My desired output would look like this:
print(data['new_column'])
[6,-2,2,1,-2,-3,-4,3,7,6]
As you can see, this slightly shifts the pattern. Can I still use numpy.where() here, or do I have to take a new approach? Any help would be greatly appreciated!
As mentioned in the comment section, the output should equal
[6,-2,2,1,-2,-3,-4,2,7,6] instead of [6,-2,2,1,-2,-3,-4,3,7,6] according to your logic. Given that, you can do the following:
import pandas as pd
import numpy as np
from itertools import chain
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
index_pos_large_subtraction = list(chain.from_iterable((data.index[i], data.index[i+1]) for i in range(0, len(data)-1, 5)))
data['new_column'] = np.where(~data.index.isin(index_pos_large_subtraction), data['test']-subtraction_value, data['test']-subtraction_value_2)
# The next line is equivalent to the previous one
# data['new_column'] = np.where(data.index.isin(index_pos_large_subtraction), data['test']-subtraction_value_2, data['test']-subtraction_value)
---------------------------------------------
test new_column
0 12 6
1 4 -2
2 5 2
3 4 1
4 1 -2
5 3 -3
6 2 -4
7 5 2
8 10 7
9 9 6
---------------------------------------------
As you can see, np.where works fine. Your masking condition is the problem and needs to be adjusted; you are not selecting rows according to your logic.
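Since the intended pattern repeats every five rows (two large subtractions followed by three small ones), a minimal alternative sketch is to build the mask directly with modulo arithmetic instead of an explicit index list; this assumes the five-row cycle described above and is not part of the original answer:
import numpy as np
import pandas as pd
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
# Positions 0 and 1 of every block of five rows get the larger subtraction
mask_large = data.index % 5 < 2
data['new_column'] = np.where(mask_large,
                              data['test'] - subtraction_value_2,
                              data['test'] - subtraction_value)
print(data['new_column'].tolist())  # [6, -2, 2, 1, -2, -3, -4, 2, 7, 6]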
I am currently trying to apply the Hampel filter to my dataframe in Python, and I have looked around and there isn't a lot of documentation for its implementation in Python. I found one post, but it looks like it was created before there was an actual hampel package/function, and someone wrote their own rolling-mean calculation rather than using the filter from the package itself; even the site for the hampel package is minimal. I am looking at the number of Covid cases per day by FIPS code. I have a data frame of 470 time series (in days): each column is a different FIPS code and each row has the number of Covid cases for that day (indexed by date, not by day number from the start). The hampel package is very straightforward: it has two output options, it will either return a list of the indices where it thinks there are outliers, or it will replace the outliers with the median within the data.
The two ways of calling hampel are:
[IN]:
ts = pd.Series([1, 2, 1, 1, 1, 2, 13, 2, 1, 2, 15, 1, 2])
[IN]: # to return indices:
outlier_indices = hampel(ts, window_size=5, n=3)
print("Outlier Indices: ", outlier_indices)
[OUT]:
Outlier Indices: [6, 10]
[IN]: # to return the series with outliers replaced by rolling medians (I'm using this format)
ts_imputation = hampel(ts, window_size=5, n=3, imputation=True)
ts_imputation
[OUT]:
0 1.0
1 2.0
2 1.0
3 1.0
4 1.0
5 2.0
6 2.0
7 2.0
8 1.0
9 2.0
10 2.0
11 1.0
12 2.0
dtype: float64
So with my data frame I want it to replace the outliers in each column with the column median; I am using window = 21 and threshold = 6 (because of the data setup). I should mention that each column starts with a different number of 0's in its rows. So, for example, the first 80 rows of column one may be 0's, while the first 95 rows of column two are 0's, because each FIPS code has a different number of days. Given this, I tried to use the .apply method with the following function:
[IN]:
def hamp(col):
    no_out = hampel(col, window_size=21, n=6, imputation=True)
    return no_out
[IN]:
df = df.apply(hamp, axis=1)
However, when I print my data frame, it is now just all 0's. Can someone tell me what I am doing wrong?
Thank you!
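For reference, a minimal sketch of how the same hamp function could be applied column by column, which is what a per-FIPS filter needs (apply() passes columns by default, i.e. axis=0; axis=1 passes rows instead). The import line and the toy frame are assumptions for illustration, reusing the signature shown above:
import pandas as pd
from hampel import hampel  # assumed import path for the hampel package shown above
# Hypothetical toy frame: one column of daily case counts per FIPS code
df = pd.DataFrame({"06037": [1, 2, 1, 1, 1, 2, 13, 2, 1, 2, 15, 1, 2],
                   "36061": [0, 0, 3, 2, 2, 2, 3, 2, 50, 3, 2, 2, 3]})
def hamp(col):
    # Filter one column (one FIPS time series); on the real data use window_size=21, n=6
    return hampel(col, window_size=5, n=3, imputation=True)
# apply() passes columns by default (axis=0); axis=1 would pass rows instead
df_filtered = df.apply(hamp)
print(df_filtered)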
Recently, sktime added a HampelFilter:
from sktime.transformations.series.outlier_detection import HampelFilter
y = your_data
transformer = HampelFilter(window_length=10)
y_hat = transformer.fit_transform(y)
You can also read the documentation here: HampelFilter_sktime
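And a minimal sketch of using it on a frame with one FIPS series per column (column names and window length are made up for illustration; note that, if I remember correctly, sktime's HampelFilter marks detected outliers as NaN rather than imputing the median, so you may still want a fillna/interpolate step afterwards):
import pandas as pd
from sktime.transformations.series.outlier_detection import HampelFilter
# Hypothetical toy frame: one column of daily case counts per FIPS code
df = pd.DataFrame({"06037": [1, 2, 1, 1, 40, 2, 1, 2, 1, 2],
                   "36061": [0, 0, 3, 2, 2, 50, 3, 2, 3, 2]})
transformer = HampelFilter(window_length=5)
# Apply the filter to each FIPS column separately
df_clean = df.apply(lambda col: transformer.fit_transform(col))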
It looks like my last question was closed, but I forgot to mention the update below the first time: I am only modifying a few of the columns, not all of them.
What is the best way to modify (sort) a Series of data in a Pandas DataFrame?
For example, after importing some data, the columns should be in ascending order, but I need to reorder the data if they are not. The data is imported from a CSV into a pandas DataFrame.
num_1 num_2 num_3
date
2020-02-03 17 22 36
2020-02-06 52 22 14
2020-02-10 5 8 29
2020-02-13 10 14 30
2020-02-17 7 8 19
I would ideally find the second row (pandas Series) in the DataFrame as the record to be fixed:
num_1 num_2 num_3 num_4 num_5
date
2020-02-06 52 22 14 25 27
And modify it to be: (Only sorting nums 1-3 and not touching columns 4 & 5)
num_1 num_2 num_3 num_4 num_5
date
2020-02-06 14 22 52 25 27
I could iterate over the DataFrame and check for indexes that have Series data out of order by comparing each column to the column to its right, then write a custom sorter and write that record back into the DataFrame, but that seems clunky.
I have to imagine there's a more Pythonic (Pandas) way to do this type of thing. I just can't find it in the pandas documentation. I don't want to reorder the rows just make sure the values are in the appropriate order within the columns.
Update: I forgot to mention one of the most critical aspects. There are other columns in the DataFrame that should not be touched. So in the example above, only sort num_1, num_2, and num_3, not the others. I'm guessing I can use the solutions posed already: split the DataFrame, sort the first part, and re-merge them together. Is there an alternative?
Splitting and reconnecting does not sound bad to me; here is what I got:
cols_to_sort = ['num_1', 'num_2', 'num_3']
pd.concat([pd.DataFrame(np.sort(df[cols_to_sort].values),
                        columns=cols_to_sort,
                        index=df.index),
           df[df.columns[~df.columns.isin(cols_to_sort)]]],
          axis=1)
The best way is to use the sort_values() function and only allow it to work on the columns which require sorting.
cols = ['col1', 'col2', 'col3']
for index, row in df.iterrows():
    df.loc[index, cols] = row[cols].sort_values().to_numpy()
This loops through every row, sorts the values in the desired columns into ascending order, and writes them back.
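A usage sketch on the frame from the question (values copied from above; num_4 and num_5 omitted here for brevity):
import pandas as pd
df = pd.DataFrame({'num_1': [17, 52, 5, 10, 7],
                   'num_2': [22, 22, 8, 14, 8],
                   'num_3': [36, 14, 29, 30, 19]},
                  index=pd.to_datetime(['2020-02-03', '2020-02-06', '2020-02-10',
                                        '2020-02-13', '2020-02-17']))
df.index.name = 'date'
cols = ['num_1', 'num_2', 'num_3']
for index, row in df.iterrows():
    # Sort only this row's values for the selected columns, then write them back
    df.loc[index, cols] = row[cols].sort_values().to_numpy()
print(df)  # only the 2020-02-06 row changes: 52, 22, 14 -> 14, 22, 52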
Pandas does not support what you ask for out of the box (as far as I know). Usually each column is a different feature, so reordering values across columns within a row may seem a little odd.
Anyway, pandas works extremely well with numpy, and that is your rescue here.
You can convert relevant columns to numpy array, sort by row, and then put the result back in the dataframe.
import numpy as np
cols_list = ["num_1","num_2","num_3"]
tmp_arr = np.array(df.loc[:, cols_list])
tmp_arr.sort(axis=1)
df.loc[:, cols_list] = tmp_arr
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Day":range(1,5),"num_1":[5,2,7,1], "num_2":[2,7,4,10], "num_3":[7,27,64,10]})
print(df)
cols_list = ["num_1","num_2","num_3"]
tmp_arr = np.array(df.loc[:, cols_list])
tmp_arr.sort(axis=1)
df.loc[:, cols_list] = tmp_arr
print(df)
The first print result:
Day num_1 num_2 num_3
0 1 5 2 7
1 2 2 7 27
2 3 7 4 64
3 4 1 10 10
The second print result:
Day num_1 num_2 num_3
0 1 2 5 7
1 2 2 7 27
2 3 4 7 64
3 4 1 10 10
You can select whatever columns you like (cols_list).
After I wrote this, I found a similar solution here: Fastest way to sort each row in a pandas dataframe
Instead of relying on queries from SQL, I am trying to find ways that I can use pandas to do the same work but in a more time-efficient manner.
The problem I am trying to solve is best illustrated through the following simplified example:
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                   'value': [12, 8, 31, 14, 45, 12]})
Based on the data, I would like to change the values in the "value" column to 32, 15, and 14 where the id is 1, 2, and 4 respectively.
I managed to do this for one value with the following code:
df.loc[df['id'] == 1, 'value'] = 32
However, the problem is that repeating the above line for every id is very time-inefficient. So I wonder if anyone could help come up with a solution where I can update 20-30 values as fast as possible in a pandas table.
Thanks in advance
Use isin:
df.loc[df['id'].isin([1, 2, 4]), 'value'] = [32, 15, 14]
df
id value
0 1 32
1 2 15
2 3 31
3 4 14
4 5 45
5 6 12
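If the replacement order might not line up with the row order, an alternative sketch is to use a mapping dict (the dict and the fillna chain are my addition, not part of the answer above):
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                   'value': [12, 8, 31, 14, 45, 12]})
# Map each id to its new value; rows whose id is absent keep their old value
updates = {1: 32, 2: 15, 4: 14}
df['value'] = df['id'].map(updates).fillna(df['value']).astype(df['value'].dtype)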
(update: added desired data frame)
Let me start by saying that I'm reasonably confident that I found a solution to this problem several years ago, but I have not been able to re-find that solution.
Questions that address similar problems, but don't solve my particular problem include:
Efficiently select rows that match one of several values in Pandas DataFrame
Efficiently adding calculated rows based on index values to a pandas DataFrame
Compare Python Pandas DataFrames for matching rows
The Question
Let's say I have a data frame with many columns that I am working on:
big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
big
match_1
0 11
1 12
2 51
3 52
I also have a smaller data frame that, in theory, maps some conditional statement to a desired value:
# A smaller dataframe that we use to map values into the larger dataframe
small = pd.DataFrame({'is_even': [True, False], 'score': [10, 200]})
small
is_even score
0 True 10
1 False 200
The goal here would be to use a conditional statement to match each row in big to a single row in small. Assume that small is constructed such that there will always be one, and only one, match for each row in big. (If multiple rows in small happen to match, just pick the first one.)
The desired output would be something like:
desired = pd.DataFrame({'match_1': [11, 12, 51, 52], 'metric': [200, 10, 200, 10]})
desired
match_1 metric
0 11 200
1 12 10
2 51 200
3 52 10
I'm pretty sure that the syntax would look similar to:
big['score'] = small.loc[small['is_even'] == ((big['match_1'] % 2) == 0), 'score']
This won't work because small['is_even'] is a Series of length 2, while ((big['match_1'] % 2) == 0) is a Series of length 4. What I'm looking to do is, for each row in big, find the one row in small that matches based on a conditional.
If I can get a sequence that contains the correct row in small that matches each row in big, then I could do something like:
big['score'] = small.loc[matching_rows, 'score']
The question I have is: how do I generate the sequence matching_rows?
Things that (I think) aren't quite what I want:
If the columns in big and small were to match simply on constant values, this would be a straightforward use of either big.merge() or big.groupby(). However, in my case, the mapping can be an arbitrarily complex boolean conditional, for example:
(big['val1'] > small['threshold']) & (big['val2'] == small['val2']) & (big['val3'] > small['min_val']) & (big['val3'] < small['max_val'])
Solutions that rely on isin(), any(), etc, don't work, because the conditional check can be arbitrarily complex.
I could certainly create a function to apply() to the bigger DataFrame, but again, I'm pretty sure there was a simpler solution.
The answer may come down to 'calculate some intermediate columns until you can do a simple merge' or 'just use apply()', but I could swear that there was a way to do what I've described above.
One approach is to use a merge in which left_on is not a column, but a vector of keys. It's made simpler by setting the index of small to be is_even:
>>> small.set_index('is_even', inplace=True)
>>> condition = big['match_1'] % 2 == 0
>>> pd.merge(big, small, left_on=condition, right_index=True, how='left')
match_1 score
0 11 200
1 12 10
2 51 200
3 52 10
You can index small with True and False and just do a straight label lookup on it. Not sure it's all that much tidier than the intermediate column/merge:
In [127]: big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
In [128]: small = pd.DataFrame({'score': [10, 200]}, index=[True, False])
In [129]: big['score'] = (big.match_1 % 2 == 0).map(small['score'])
In [130]: big
Out[130]:
match_1 score
0 11 200
1 12 10
2 51 200
3 52 10