Subtract Two Dataframes by Index and Keep String Columns - python

I would like to subtract two data frames by indexes:
# importing pandas as pd
import pandas as pd

# Creating the first dataframe
df1 = pd.DataFrame({"Type": ['T1', 'T2', 'T3', 'T4', 'T5'],
                    "A": [10, 11, 7, 8, 5],
                    "B": [21, 5, 32, 4, 6],
                    "C": [11, 21, 23, 7, 9],
                    "D": [1, 5, 3, 8, 6]},
                   index=["2001", "2002", "2003", "2004", "2005"])
df1

# Creating the second dataframe
df2 = pd.DataFrame({"A": [1, 2, 2, 2],
                    "B": [3, 2, 4, 3],
                    "C": [2, 2, 7, 3],
                    "D": [1, 3, 2, 1]},
                   index=["2000", "2002", "2003", "2004"])
df2

# Desired result
df = pd.DataFrame({"Type": ['T1', 'T2', 'T3', 'T4', 'T5'],
                   "A": [10, 9, 5, 6, 5],
                   "B": [21, 3, 28, 1, 6],
                   "C": [11, 19, 16, 4, 9],
                   "D": [1, 2, 1, 7, 5]},
                  index=["2001", "2002", "2003", "2004", "2005"])
df
df1.subtract(df2)
However, in some cases this returns NaN; I would like to keep the values from df1 wherever the subtraction cannot be performed.

You could handle NaN using:
df1.subtract(df2).combine_first(df1).dropna(how='all')
output:
A B C D Type
2001 10.0 21.0 11.0 1.0 T1
2002 9.0 3.0 19.0 2.0 T2
2003 5.0 28.0 16.0 1.0 T3
2004 6.0 1.0 4.0 7.0 T4
2005 5.0 6.0 9.0 6.0 T5
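Here combine_first fills the NaN positions of the subtraction result with the corresponding values from df1 (this also restores the Type column), and dropna(how='all') removes the all-NaN row for index 2000, which exists only in df2. A minimal illustration of combine_first on toy data (hypothetical values):
import pandas as pd

a = pd.Series([1.0, None], index=["x", "y"])
b = pd.Series([10.0, 20.0], index=["x", "y"])
a.combine_first(b)   # x -> 1.0 (kept from a), y -> 20.0 (filled from b)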

You can use select_dtypes to separate the numeric columns, subtract the reindexed df2, then join back the non-numeric columns:
(df1.select_dtypes(include='number')
    .sub(df2.reindex(df1.index, fill_value=0))
    .join(df1.select_dtypes(exclude='number'))
)
Output:
A B C D Type
2001 10 21 11 1 T1
2002 9 3 19 2 T2
2003 5 28 16 1 T3
2004 6 1 4 7 T4
2005 5 6 9 6 T5
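A closely related sketch (my addition, not part of the original answers): sub also accepts a fill_value, so entries missing from df2 can be treated as zero during alignment, and reindexing to df1.index afterwards drops the extra "2000" row. Note the result may come back as float, since alignment introduces NaN before fill_value is applied:
(df1.select_dtypes(include='number')
    .sub(df2, fill_value=0)     # missing df2 entries subtract as 0
    .reindex(df1.index)         # keep only df1's rows, dropping "2000"
    .join(df1.select_dtypes(exclude='number'))
)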

Filling 0 with previous value at index

I have a df:
1 2 3 4 5 6 7 8 9 10
A 10 0 0 15 0 21 45 0 0 7
I am trying to fill the zeros in row A with the last preceding non-zero value, so that the df would look like this:
1 2 3 4 5 6 7 8 9 10
A 10 10 10 15 15 21 45 45 45 7
I tried:
df.loc[['A']].replace(to_replace=0, method='ffill').values
But this does not work; where is my mistake?
If you want to use your method, you need to work with Series on both sides:
df.loc['A'] = df.loc['A'].replace(to_replace=0, method='ffill')
Alternatively, you can mask the 0 with NaNs, and ffill the data on axis=1:
df.mask(df.eq(0)).ffill(axis=1)
output:
1 2 3 4 5 6 7 8 9 10
A 10.0 10.0 10.0 15.0 15.0 21.0 45.0 45.0 45.0 7.0
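As an aside, Series.replace with method='ffill' is deprecated in recent pandas releases; a sketch of the same row fix using mask and ffill instead (assuming the first value in the row is non-zero, so the cast back to int is safe):
row = df.loc['A']
# mask the zeros as NaN, forward-fill, restore the integer dtype
df.loc['A'] = row.mask(row.eq(0)).ffill().astype(int)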
Well, you should change your code a little bit and work with Series:
import pandas as pd

df = pd.DataFrame({'1': [10], '2': [0], '3': [0], '4': [15], '5': [0],
                   '6': [21], '7': [45], '8': [0], '9': [0], '10': [7]},
                  index=['A'])
print(df.apply(lambda x: pd.Series(x.values).replace(to_replace=0, method='ffill').values, axis=1))
Output:
A [10, 10, 10, 15, 15, 21, 45, 45, 45, 7]
dtype: object
This way, if you have multiple indices, the code still works:
import pandas as pd

df = pd.DataFrame({'1': [10, 11], '2': [0, 12], '3': [0, 0], '4': [15, 0], '5': [0, 3],
                   '6': [21, 3], '7': [45, 0], '8': [0, 4], '9': [0, 5], '10': [7, 0]},
                  index=['A', 'B'])
print(df.apply(lambda x: pd.Series(x.values).replace(to_replace=0, method='ffill').values, axis=1))
Output:
A [10, 10, 10, 15, 15, 21, 45, 45, 45, 7]
B [11, 12, 12, 12, 3, 3, 3, 4, 5, 5]
dtype: object
df.applymap(lambda x: pd.NA if x == 0 else x).fillna(method='ffill', axis=1)
1 2 3 4 5 6 7 8 9 10
A 10 10 10 15 15 21 45 45 45 7
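For newer pandas (2.1 and later), where applymap was renamed to DataFrame.map and fillna(method='ffill') was superseded by ffill, the equivalent sketch would be:
# same idea with non-deprecated API, assuming pandas >= 2.1
df.map(lambda x: pd.NA if x == 0 else x).ffill(axis=1)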

Sorting rows and removing NaN values

I have a dataset that looks like this:
state Item_Number
0 AP 1.0, 4.0, 20.0, 2.0, 11.0, 7.0
1 GOA 1.0, 4.0, nan, 2.0, 8.0, nan
2 GU 1.0, 4.0, 13.0, 2.0, 11.0, 7.0
3 KA 1.0, 23.0, nan, nan, 11.0, 7.0
4 MA 1.0, 14.0, 13.0, 2.0, 19.0, 21.0
I want to remove the NaN values, sort the rows, and convert the floats to ints. Afterwards the dataset should look like this:
state Item_Number
0 AP 1, 2, 4, 7, 11, 20
1 GOA 1, 2, 4, 8
2 GU 1, 2, 4, 7, 11, 13
3 KA 1, 7, 11, 23
4 MA 1, 2, 13, 14, 19, 21
Another solution using Series.str.split and Series.apply:
df['Item_Number'] = (df.Item_Number.str.split(',')
                       .apply(lambda x: ', '.join([str(z) for z in sorted(
                           [int(float(y)) for y in x if 'nan' not in y])])))
[out]
state Item_Number
0 AP 1, 2, 4, 7, 11, 20
1 GOA 1, 2, 4, 8
2 GU 1, 2, 4, 7, 11, 13
3 KA 1, 7, 11, 23
4 MA 1, 2, 13, 14, 19, 21
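The same logic can be broken into a named helper, which may be easier to read (a sketch assuming Item_Number holds comma-separated strings as shown):
def clean_items(s):
    # split on commas, skip 'nan' tokens, convert float -> int, then sort
    nums = [int(float(tok)) for tok in s.split(',') if 'nan' not in tok]
    return ', '.join(str(n) for n in sorted(nums))

df['Item_Number'] = df['Item_Number'].apply(clean_items)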
Use a list comprehension, removing missing values via the principle that NaN != NaN:
df['Item_Number'] = [sorted([int(float(y)) for y in x.split(',') if float(y) == float(y)]) for x in df['Item_Number']]
print (df)
state Item_Number
0 AP [1, 2, 4, 7, 11, 20]
1 GOA [1, 2, 4, 8]
2 GU [1, 2, 4, 7, 11, 13]
3 KA [1, 7, 11, 23]
4 MA [1, 2, 13, 14, 19, 21]
If need strings:
df['Item_Number'] = [' '.join(map(str, sorted([int(float(y)) for y in x.split(',') if float(y) == float(y)]))) for x in df['Item_Number']]
print (df)
state Item_Number
0 AP 1 2 4 7 11 20
1 GOA 1 2 4 8
2 GU 1 2 4 7 11 13
3 KA 1 7 11 23
4 MA 1 2 13 14 19 21
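A quick aside on the float(y) == float(y) test used above: IEEE 754 arithmetic defines NaN as unequal to every value, including itself, so the comparison is False exactly for the nan tokens:
x = float('nan')
x == x                          # False: NaN never equals itself
float('4.0') == float('4.0')    # True for ordinary numbers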

How to drop a value from a data series with a MultiIndex?

I have a data frame with the temperatures recorded per day/month/year.
I then find the lowest temperature of each month using the groupby and min functions, which gives a data series with a MultiIndex.
How can I drop the value for a specific year and month, e.g. year 2005, month 12?
# Find the lowest value per each month
[In] low = df.groupby([df['Date'].dt.year,df['Date'].dt.month])['Data_Value'].min()
[In] low
[Out]
Date Date
2005 1 -60
2 -114
3 -153
4 -13
5 -14
6 26
7 83
8 65
9 21
10 36
11 -36
12 -86
2006 1 -75
2 -53
3 -83
4 -30
5 36
6 17
7 85
8 82
9 66
10 40
11 -2
12 -32
2007 1 -63
2 -42
3 -21
4 -11
5 28
6 74
7 73
8 61
9 46
10 -33
11 -37
12 -97
[In] low.index
[Out] MultiIndex(levels=[[2005, 2006, 2007], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]],
names=['Date', 'Date'])
This works.
# dummy data
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    (2017,)*12 + (2018,)*12,
    list(range(1, 13))*2
], names=['year', 'month'])
df = pd.DataFrame({'value': np.random.randint(1, 20, len(mux))}, mux)
Then just use drop.
df.drop((2017, 12), inplace=True)
>>> print(df)
value
year month
2017 1 18
2 13
3 14
4 1
5 8
6 19
7 19
8 8
9 11
10 5
11 7 <<<
2018 1 9
2 18
3 9
4 14
5 7
6 4
7 6
8 12
9 12
10 1
11 19
12 10
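Applied to the question's low Series (assuming it is the MultiIndexed Series shown above), the same pattern drops the requested entry by passing the (year, month) tuple as the label:
low = low.drop((2005, 12))   # removes the entry for year 2005, month 12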

Pandas: Missing value imputation based on date

I have a pandas data-frame which is as follows:
df_first = pd.DataFrame({"id": [102, 102, 102, 102, 103, 103],
                         "val1": [np.nan, 4, np.nan, np.nan, 1, np.nan],
                         "val2": [5, np.nan, np.nan, np.nan, np.nan, 5],
                         "rand": [np.nan, 3, 7, 8, np.nan, 4],
                         "val3": [5, np.nan, np.nan, np.nan, 3, np.nan],
                         "unique_date": [pd.Timestamp(2002, 3, 3),
                                         pd.Timestamp(2002, 3, 5),
                                         pd.Timestamp(2003, 4, 5),
                                         pd.Timestamp(2003, 4, 9),
                                         pd.Timestamp(2003, 8, 7),
                                         pd.Timestamp(2003, 9, 7)],
                         "end_date": [pd.Timestamp(2005, 3, 3),
                                      pd.Timestamp(2003, 4, 7),
                                      np.nan,
                                      np.nan,
                                      pd.Timestamp(2003, 10, 7),
                                      np.nan]})
df_first
id val1 val2 rand val3 unique_date end_date
0 102 NaN 5.0 NaN 5.0 2002-03-03 2005-03-03
1 102 4.0 NaN 3.0 NaN 2002-03-05 2003-04-07
2 102 NaN NaN 7.0 NaN 2003-04-05 NaT
3 102 NaN NaN 8.0 NaN 2003-04-09 NaT
4 103 1.0 NaN NaN 3.0 2003-08-07 2003-10-07
5 103 NaN 5.0 4.0 NaN 2003-09-07 NaT
The missing value imputation should forward-fill the values that appear in each row that has an end_date value.
The forward fill should continue for as long as the unique_date is before that row's end_date, for the same id.
In other words, the forward fill should be done per id.
Lastly, the imputation should only touch the columns whose name contains val; no other columns have that pattern in their name. In case I haven't made myself clear enough, the solution for the data-frames posted above is shown below:
id val1 val2 rand val3 unique_date
0 102 NaN 5.0 NaN 5.0 2002-03-03
1 102 4.0 5.0 3.0 5.0 2002-03-05
2 102 4.0 5.0 7.0 5.0 2003-04-05
3 102 NaN 5.0 8.0 5.0 2003-04-09
4 103 1.0 NaN NaN 3.0 2003-08-07
5 103 1.0 5.0 4.0 3.0 2003-08-07
Let me know if you need any further clarification since the whole thing seems rather complicated at first sight.
Looking forward to your answers!
Sorry for the confusing question and explanation. In the end I was able to achieve what I wanted in the following way.
import numpy as np
import pandas as pd

df_first = pd.DataFrame({"id": [102, 102, 102, 102, 103, 103],
                         "val1": [np.nan, 4, np.nan, np.nan, 1, np.nan],
                         "val2": [5, np.nan, np.nan, np.nan, np.nan, 5],
                         "val3": [np.nan, 3, np.nan, np.nan, np.nan, 4],
                         "val4": [5, np.nan, np.nan, np.nan, 3, np.nan],
                         "rand": [3, np.nan, 1, np.nan, 5, 6],
                         "unique_date": [pd.Timestamp(2002, 3, 3),
                                         pd.Timestamp(2002, 3, 5),
                                         pd.Timestamp(2003, 4, 5),
                                         pd.Timestamp(2003, 4, 9),
                                         pd.Timestamp(2003, 8, 7),
                                         pd.Timestamp(2003, 9, 7)],
                         "end_date": [pd.Timestamp(2005, 3, 3),
                                      pd.Timestamp(2003, 4, 7),
                                      np.nan,
                                      np.nan,
                                      pd.Timestamp(2003, 10, 7),
                                      np.nan]})
display(df_first)

# positions of the "val" columns
indexes = []
columns = df_first.filter(like="val").columns
for column in columns:
    indexes.append(df_first.columns.get_loc(column))

elements = df_first.values[:, indexes]
ids = df_first.values[:, df_first.columns.get_loc("id")]
start_dates = df_first.values[:, df_first.columns.get_loc("unique_date")]
end_dates = df_first.values[:, df_first.columns.get_loc("end_date")]

# propagate each row's non-NaN "val" entries forward while the following
# rows share the same id and start before this row's end_date
for i in range(len(elements)):
    if pd.notnull(end_dates[i]):
        not_nan_indexes = np.argwhere(~pd.isnull(elements[i])).ravel()
        elements_prop = elements[i, not_nan_indexes]
        j = i
        while (j < len(elements) and start_dates[j] < end_dates[i]
               and ids[i] == ids[j]):
            elements[j, not_nan_indexes] = elements_prop
            j += 1

df_first[columns] = elements
df_first = df_first.drop(columns="end_date")
display(df_first)
The solution is probably overkill, but I was not able to find anything pandas-specific that achieves what I wanted.

Radius search in list of coordinates

I'm using python 2.7 and numpy (import numpy as np).
I have a list of x-y coordinates in the following shape:
coords = np.zeros((100, 2), dtype=np.int)
I have a list of values corresponding to these coordinates:
values = np.zeros(100, dtype=np.int)
My program is populating these arrays.
Now, for each coordinate, I want to find neighbours within radius r that have a non-zero value. What's the most efficient way to do that?
Demo:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
In [101]: np.random.seed(123)
In [102]: coords = np.random.rand(20, 2)
In [103]: r = 0.3
In [104]: d = pd.DataFrame(squareform(pdist(coords)))
In [105]: d
Out[105]:
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 0.000000 0.539313 0.138885 0.489671 0.240183 0.566555 0.343214 0.541508 0.525761 0.295906 0.566702 0.326087 0.045059
1 0.539313 0.000000 0.509028 0.765644 0.299834 0.212418 0.535287 0.253292 0.378472 0.305322 0.504946 0.501173 0.545672
2 0.138885 0.509028 0.000000 0.369830 0.240542 0.484970 0.459329 0.449965 0.591335 0.217102 0.434730 0.187983 0.100192
3 0.489671 0.765644 0.369830 0.000000 0.579235 0.639118 0.827519 0.585140 0.946945 0.474554 0.383486 0.266724 0.444612
4 0.240183 0.299834 0.240542 0.579235 0.000000 0.364005 0.335128 0.355671 0.368796 0.148598 0.482379 0.327450 0.251218
5 0.566555 0.212418 0.484970 0.639118 0.364005 0.000000 0.676135 0.055591 0.576447 0.272729 0.315123 0.399127 0.555655
6 0.343214 0.535287 0.459329 0.827519 0.335128 0.676135 0.000000 0.679527 0.281035 0.481218 0.813671 0.621056 0.387169
7 0.541508 0.253292 0.449965 0.585140 0.355671 0.055591 0.679527 0.000000 0.602427 0.245620 0.261309 0.350237 0.526773
8 0.525761 0.378472 0.591335 0.946945 0.368796 0.576447 0.281035 0.602427 0.000000 0.498845 0.811462 0.695304 0.559738
9 0.295906 0.305322 0.217102 0.474554 0.148598 0.272729 0.481218 0.245620 0.498845 0.000000 0.333842 0.208528 0.282959
10 0.566702 0.504946 0.434730 0.383486 0.482379 0.315123 0.813671 0.261309 0.811462 0.333842 0.000000 0.254850 0.533784
11 0.326087 0.501173 0.187983 0.266724 0.327450 0.399127 0.621056 0.350237 0.695304 0.208528 0.254850 0.000000 0.288072
12 0.045059 0.545672 0.100192 0.444612 0.251218 0.555655 0.387169 0.526773 0.559738 0.282959 0.533784 0.288072 0.000000
13 0.339648 0.350100 0.407307 0.769145 0.202592 0.501132 0.185248 0.511020 0.186913 0.347808 0.678357 0.527288 0.372879
14 0.530211 0.104003 0.473790 0.689158 0.303486 0.109841 0.589377 0.149459 0.468906 0.257676 0.404710 0.431203 0.527905
15 0.622118 0.178856 0.627453 0.923461 0.391044 0.387645 0.509836 0.431502 0.273610 0.450269 0.683313 0.656742 0.639993
16 0.337079 0.211995 0.297111 0.582175 0.113238 0.251168 0.434076 0.246505 0.403684 0.107671 0.409858 0.316172 0.337886
17 0.271897 0.311029 0.313864 0.668400 0.097022 0.424905 0.252905 0.426640 0.279160 0.243693 0.576241 0.422417 0.296806
18 0.664617 0.395999 0.554151 0.592343 0.504234 0.184188 0.833801 0.157951 0.758223 0.376555 0.212643 0.410605 0.642698
19 0.328445 0.719013 0.238085 0.186618 0.476045 0.642499 0.671657 0.594990 0.828653 0.413697 0.465589 0.245340 0.284878
13 14 15 16 17 18 19
0 0.339648 0.530211 0.622118 0.337079 0.271897 0.664617 0.328445
1 0.350100 0.104003 0.178856 0.211995 0.311029 0.395999 0.719013
2 0.407307 0.473790 0.627453 0.297111 0.313864 0.554151 0.238085
3 0.769145 0.689158 0.923461 0.582175 0.668400 0.592343 0.186618
4 0.202592 0.303486 0.391044 0.113238 0.097022 0.504234 0.476045
5 0.501132 0.109841 0.387645 0.251168 0.424905 0.184188 0.642499
6 0.185248 0.589377 0.509836 0.434076 0.252905 0.833801 0.671657
7 0.511020 0.149459 0.431502 0.246505 0.426640 0.157951 0.594990
8 0.186913 0.468906 0.273610 0.403684 0.279160 0.758223 0.828653
9 0.347808 0.257676 0.450269 0.107671 0.243693 0.376555 0.413697
10 0.678357 0.404710 0.683313 0.409858 0.576241 0.212643 0.465589
11 0.527288 0.431203 0.656742 0.316172 0.422417 0.410605 0.245340
12 0.372879 0.527905 0.639993 0.337886 0.296806 0.642698 0.284878
13 0.000000 0.408426 0.339019 0.274263 0.105627 0.668252 0.643427
14 0.408426 0.000000 0.282070 0.194058 0.345013 0.294029 0.663142
15 0.339019 0.282070 0.000000 0.344028 0.355134 0.568361 0.854775
16 0.274263 0.194058 0.344028 0.000000 0.181494 0.399730 0.513362
17 0.105627 0.345013 0.355134 0.181494 0.000000 0.581128 0.551910
18 0.668252 0.294029 0.568361 0.399730 0.581128 0.000000 0.649183
19 0.643427 0.663142 0.854775 0.513362 0.551910 0.649183 0.000000
result:
In [107]: d[(0 < d) & (d < r)].apply(lambda x: x.dropna().index.tolist())
Out[107]:
0 [2, 4, 9, 12, 17]
1 [4, 5, 7, 14, 15, 16]
2 [0, 4, 9, 11, 12, 16, 19]
3 [11, 19]
4 [0, 1, 2, 9, 12, 13, 16, 17]
5 [1, 7, 9, 14, 16, 18]
6 [8, 13, 17]
7 [1, 5, 9, 10, 14, 16, 18]
8 [6, 13, 15, 17]
9 [0, 2, 4, 5, 7, 11, 12, 14, 16, 17]
10 [7, 11, 18]
11 [2, 3, 9, 10, 12, 19]
12 [0, 2, 4, 9, 11, 17, 19]
13 [4, 6, 8, 16, 17]
14 [1, 5, 7, 9, 15, 16, 18]
15 [1, 8, 14]
16 [1, 2, 4, 5, 7, 9, 13, 14, 17]
17 [0, 4, 6, 8, 9, 12, 13, 16]
18 [5, 7, 10, 14]
19 [2, 3, 11, 12]
dtype: object
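Note that this demo never uses the values array from the question; the non-zero condition can be folded into the mask (values here are hypothetical for the 20 demo points, and broadcast across the columns so that a neighbour j is kept only if values[j] > 0):
values = np.random.randint(0, 2, 20)       # hypothetical values
mask = (0 < d) & (d < r) & (values > 0)    # column j kept only if values[j] > 0
mask.apply(lambda row: row.index[row].tolist(), axis=1)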
You can also do this with only NumPy and SciPy, which I find faster.
import numpy
from scipy.spatial.distance import pdist, squareform

SIZE = 512
N_PARTICLE = 100
RADIUS = 15
VALUE_THRESHOLD = 0

coords = numpy.random.randint(0, SIZE, size=(N_PARTICLE, 2))
values = numpy.random.randint(0, 2, N_PARTICLE)

square_dist = squareform(pdist(coords, metric='euclidean'))
condlist = []
for i, row in enumerate(square_dist):
    condlist.append(numpy.where((values > VALUE_THRESHOLD) & (row < RADIUS) & (row > 0))[0].tolist())
There must be a better way to do it, though.
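One option (my addition, assuming SciPy's k-d tree is acceptable here): scipy.spatial.cKDTree answers the radius query directly, without building the full O(n²) distance matrix:
import numpy as np
from scipy.spatial import cKDTree

coords = np.random.randint(0, 512, size=(100, 2))
values = np.random.randint(0, 2, 100)
r = 15

tree = cKDTree(coords)
# lists of indices of all points within radius r of each query point
neighbours = tree.query_ball_point(coords, r)
# keep only non-zero-valued neighbours, excluding the point itself
condlist = [[j for j in idx if j != i and values[j] > 0]
            for i, idx in enumerate(neighbours)]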
