Sorting rows and removing NaN values - python

I have a dataset that looks like this:
state Item_Number
0 AP 1.0, 4.0, 20.0, 2.0, 11.0, 7.0
1 GOA 1.0, 4.0, nan, 2.0, 8.0, nan
2 GU 1.0, 4.0, 13.0, 2.0, 11.0, 7.0
3 KA 1.0, 23.0, nan, nan, 11.0, 7.0
4 MA 1.0, 14.0, 13.0, 2.0, 19.0, 21.0
I want to remove the NaN values, sort each row, and convert the floats to ints. After completion, the dataset should look like this:
state Item_Number
0 AP 1, 2, 4, 7, 11, 20
1 GOA 1, 2, 4, 8
2 GU 1, 2, 4, 7, 11, 13
3 KA 1, 7, 11, 23
4 MA 1, 2, 13, 14, 19, 21
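
For reference, a minimal sketch to reconstruct the sample data, assuming Item_Number is stored as one comma-separated string per row (which is what the answers below rely on):
import pandas as pd

# hypothetical reconstruction of the sample data
df = pd.DataFrame({
    'state': ['AP', 'GOA', 'GU', 'KA', 'MA'],
    'Item_Number': ['1.0, 4.0, 20.0, 2.0, 11.0, 7.0',
                    '1.0, 4.0, nan, 2.0, 8.0, nan',
                    '1.0, 4.0, 13.0, 2.0, 11.0, 7.0',
                    '1.0, 23.0, nan, nan, 11.0, 7.0',
                    '1.0, 14.0, 13.0, 2.0, 19.0, 21.0']})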

One solution uses Series.str.split and Series.apply:
df['Item_Number'] = (df.Item_Number.str.split(',')
                       .apply(lambda x: ', '.join([str(z) for z in sorted([int(float(y)) for y in x if 'nan' not in y])])))
[out]
state Item_Number
0 AP 1, 2, 4, 7, 11, 20
1 GOA 1, 2, 4, 8
2 GU 1, 2, 4, 7, 11, 13
3 KA 1, 7, 11, 23
4 MA 1, 2, 13, 14, 19, 21

Use a list comprehension, removing missing values via the principle that NaN != NaN:
df['Item_Number'] = [sorted([int(float(y)) for y in x.split(',') if float(y) == float(y)])
                     for x in df['Item_Number']]
print (df)
state Item_Number
0 AP [1, 2, 4, 7, 11, 20]
1 GOA [1, 2, 4, 8]
2 GU [1, 2, 4, 7, 11, 13]
3 KA [1, 7, 11, 23]
4 MA [1, 2, 13, 14, 19, 21]
If you need strings:
df['Item_Number'] = [' '.join(map(str, sorted([int(float(y)) for y in x.split(',') if float(y) == float(y)])))
                     for x in df['Item_Number']]
print (df)
state Item_Number
0 AP 1 2 4 7 11 20
1 GOA 1 2 4 8
2 GU 1 2 4 7 11 13
3 KA 1 7 11 23
4 MA 1 2 13 14 19 21
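
The NaN != NaN trick works because, per IEEE 754 floating-point rules, NaN is the only value that does not compare equal to itself:
x = float('nan')
print(x == x)   # False, so `float(y) == float(y)` filters out the nan entries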

Related

Filling 0 with previous value at index

I have a df:
1 2 3 4 5 6 7 8 9 10
A 10 0 0 15 0 21 45 0 0 7
I am trying to fill row A so that each 0 is replaced with the previous non-zero value, so that the df would look like this:
1 2 3 4 5 6 7 8 9 10
A 10 10 10 15 15 21 45 45 45 7
I tried:
df.loc[['A']].replace(to_replace=0, method='ffill').values
But this does not work. Where is my mistake?
If you want to use your method, you need to work with Series on both sides:
df.loc['A'] = df.loc['A'].replace(to_replace=0, method='ffill')
Alternatively, you can mask the 0 with NaNs, and ffill the data on axis=1:
df.mask(df.eq(0)).ffill(axis=1)
output:
1 2 3 4 5 6 7 8 9 10
A 10.0 10.0 10.0 15.0 15.0 21.0 45.0 45.0 45.0 7.0
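Note the float output: masking with NaN forces a float dtype. If you need integers back, you can cast after filling (a sketch, assuming the first column contains no zeros, so no NaN survives the ffill):
df.mask(df.eq(0)).ffill(axis=1).astype(int)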
Well, you should change your code a little bit and work with a Series:
import pandas as pd
df = pd.DataFrame({'1': [10], '2': [0], '3': [0], '4': [15], '5': [0],
                   '6': [21], '7': [45], '8': [0], '9': [0], '10': [7]},
                  index=['A'])
print(df.apply(lambda x: pd.Series(x.values).replace(to_replace=0, method='ffill').values, axis=1))
Output:
A [10, 10, 10, 15, 15, 21, 45, 45, 45, 7]
dtype: object
This way, if you have multiple indices, the code still works:
import pandas as pd
df = pd.DataFrame({'1': [10, 11], '2': [0, 12], '3': [0, 0], '4': [15, 0], '5': [0, 3],
                   '6': [21, 3], '7': [45, 0], '8': [0, 4], '9': [0, 5], '10': [7, 0]},
                  index=['A', 'B'])
print(df.apply(lambda x: pd.Series(x.values).replace(to_replace=0, method='ffill').values, axis=1))
Output:
A [10, 10, 10, 15, 15, 21, 45, 45, 45, 7]
B [11, 12, 12, 12, 3, 3, 3, 4, 5, 5]
dtype: object
df.applymap(lambda x: pd.NA if x == 0 else x).fillna(method='ffill', axis=1)
1 2 3 4 5 6 7 8 9 10
A 10 10 10 15 15 21 45 45 45 7

Substract Two Dataframes by Index and Keep String Columns

I would like to subtract two data frames by index:
# importing pandas as pd
import pandas as pd
# Creating the first dataframe
df1 = pd.DataFrame({"Type": ['T1', 'T2', 'T3', 'T4', 'T5'],
                    "A": [10, 11, 7, 8, 5],
                    "B": [21, 5, 32, 4, 6],
                    "C": [11, 21, 23, 7, 9],
                    "D": [1, 5, 3, 8, 6]},
                   index=["2001", "2002", "2003", "2004", "2005"])
df1
# Creating the second dataframe
df2 = pd.DataFrame({"A": [1, 2, 2, 2],
                    "B": [3, 2, 4, 3],
                    "C": [2, 2, 7, 3],
                    "D": [1, 3, 2, 1]},
                   index=["2000", "2002", "2003", "2004"])
df2
# Desired result
df = pd.DataFrame({"Type": ['T1', 'T2', 'T3', 'T4', 'T5'],
                   "A": [10, 9, 5, 6, 5],
                   "B": [21, 3, 28, 1, 6],
                   "C": [11, 19, 16, 4, 9],
                   "D": [1, 2, 1, 7, 5]},
                  index=["2001", "2002", "2003", "2004", "2005"])
df
df1.subtract(df2)
However, in some cases it returns NaN; I would like to keep the values from df1 wherever the subtraction is not possible.
You could handle NaN using:
df1.subtract(df2).combine_first(df1).dropna(how='all')
output:
A B C D Type
2001 10.0 21.0 11.0 1.0 T1
2002 9.0 3.0 19.0 2.0 T2
2003 5.0 28.0 16.0 1.0 T3
2004 6.0 1.0 4.0 7.0 T4
2005 5.0 6.0 9.0 6.0 T5
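Here combine_first fills every NaN (including the non-numeric Type column, which subtract cannot handle) from df1, and dropna(how='all') removes the 2000 row that exists only in df2. If you also want the integer dtypes back, a possible follow-up (a sketch, assuming all remaining values are whole numbers):
out = df1.subtract(df2).combine_first(df1).dropna(how='all')
out[['A', 'B', 'C', 'D']] = out[['A', 'B', 'C', 'D']].astype(int)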
You can use select_dtypes to pick the numeric columns, then subtract the reindexed data:
(df1.select_dtypes(include='number')
    .sub(df2.reindex(df1.index, fill_value=0))
    .join(df1.select_dtypes(exclude='number'))
)
Output:
A B C D Type
2001 10 21 11 1 T1
2002 9 3 19 2 T2
2003 5 28 16 1 T3
2004 6 1 4 7 T4
2005 5 6 9 6 T5

I have a problem with duplicate rows in python

I want to remove duplicate data in rows from a pandas DataFrame without affecting the shape of the DataFrame.
I already tried the code below, but I could not bring the year column into the new DataFrame.
Any help is appreciated.
dicpn = dict()
for value in ss.Country.unique():
    if len(ss.loc[(ss.Country == value) & ss.Country.duplicated(keep=False)]) > 0:
        all_values = ss.loc[(ss.Country == value) & ss.Country.duplicated(keep=False), 'Count'].tolist()
        dicpn[value] = all_values
    elif len(ss.loc[(ss.Country == value) & ss.Country.duplicated(keep=False)]) == 0:
        dicpn[value] = ss.loc[(ss.Country == value), 'Count'].tolist()

# make a new dataframe
df2 = pd.DataFrame(columns=['landen', 'Count'])
df2.landen = list(dicpn.keys())
df2.Count = list(dicpn.values())
landen Count
0 Argentina [6, 15, 10, 4, 11, 1, 13, 7, 8, 1, 2, 2, 22, 3...
1 Australia [2, 1, 2230, 1, 3, 1, 5, 55, 38, 48, 9, 1, 2, ...
2 Belgium [1289, 1, 1620, 3, 8, 28, 13, 37, 2, 1, 560, 3...
3 Canada [1, 5, 230, 3, 4, 9, 3, 1, 1376, 159, 11, 44, ...
4 China [168, 12, 1, 114, 5, 8961, 1, 33, 4, 3, 23, 21...
5 Denmark [11, 20, 4, 2, 479, 140, 5, 53, 9, 2, 12, 16, ...
The original dataframe ss looks like:
Country PublishYear Count
8 Argentina 2003 6
9 Argentina 2014 15
10 Argentina 2010 10
11 Argentina 2015 4
12 Argentina 2014 11
... ... ... ...
254169 United States 2004 2
254170 United States 2004 955
254171 United States 2003 10
254172 United States 2015 16
254173 United States 2012 259
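For what it's worth, a groupby-based sketch that also carries the PublishYear column along (assuming ss has the columns Country, PublishYear and Count shown above):
df2 = (ss.groupby('Country')
         .agg(PublishYear=('PublishYear', list), Count=('Count', list))
         .reset_index()
         .rename(columns={'Country': 'landen'}))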

How to drop a value from data series with multiple index?

I have a data frame with the temperatures recorded per day/month/year.
Then I find the lowest temperature for each month using the groupby and min functions, which gives a Series with a MultiIndex.
How can I drop the value for a specific year and month, e.g. year 2005, month 12?
# Find the lowest value per each month
[In] low = df.groupby([df['Date'].dt.year,df['Date'].dt.month])['Data_Value'].min()
[In] low
[Out]
Date Date
2005 1 -60
2 -114
3 -153
4 -13
5 -14
6 26
7 83
8 65
9 21
10 36
11 -36
12 -86
2006 1 -75
2 -53
3 -83
4 -30
5 36
6 17
7 85
8 82
9 66
10 40
11 -2
12 -32
2007 1 -63
2 -42
3 -21
4 -11
5 28
6 74
7 73
8 61
9 46
10 -33
11 -37
12 -97
[In] low.index
[Out] MultiIndex(levels=[[2005, 2006, 2007], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]],
names=['Date', 'Date'])
This works.
# dummy data
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    (2017,)*12 + (2018,)*12,
    list(range(1, 13))*2
], names=['year', 'month'])
df = pd.DataFrame({'value': np.random.randint(1, 20, len(mux))}, index=mux)
Then just use drop.
df.drop((2017, 12), inplace=True)
>>> print(df)
value
year month
2017 1 18
2 13
3 14
4 1
5 8
6 19
7 19
8 8
9 11
10 5
11 7 <<<
2018 1 9
2 18
3 9
4 14
5 7
6 4
7 6
8 12
9 12
10 1
11 19
12 10
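Applied to the Series from the question, the same idea drops the December 2005 entry (assuming low is the groupby result shown above):
low = low.drop((2005, 12))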

Radius search in list of coordinates

I'm using python 2.7 and numpy (import numpy as np).
I have a list of x-y coordinates in the following shape:
coords = np.zeros((100, 2), dtype=np.int)
I have a list of values corresponding to these coordinates:
values = np.zeros(100, dtype=np.int)
My program is populating these arrays.
Now, for each coordinate, I want to find neighbours within radius r that have a non-zero value. What's the most efficient way to do that?
Demo:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
In [101]: np.random.seed(123)
In [102]: coords = np.random.rand(20, 2)
In [103]: r = 0.3
In [104]: d = pd.DataFrame(squareform(pdist(coords)))
In [105]: d
Out[105]:
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 0.000000 0.539313 0.138885 0.489671 0.240183 0.566555 0.343214 0.541508 0.525761 0.295906 0.566702 0.326087 0.045059
1 0.539313 0.000000 0.509028 0.765644 0.299834 0.212418 0.535287 0.253292 0.378472 0.305322 0.504946 0.501173 0.545672
2 0.138885 0.509028 0.000000 0.369830 0.240542 0.484970 0.459329 0.449965 0.591335 0.217102 0.434730 0.187983 0.100192
3 0.489671 0.765644 0.369830 0.000000 0.579235 0.639118 0.827519 0.585140 0.946945 0.474554 0.383486 0.266724 0.444612
4 0.240183 0.299834 0.240542 0.579235 0.000000 0.364005 0.335128 0.355671 0.368796 0.148598 0.482379 0.327450 0.251218
5 0.566555 0.212418 0.484970 0.639118 0.364005 0.000000 0.676135 0.055591 0.576447 0.272729 0.315123 0.399127 0.555655
6 0.343214 0.535287 0.459329 0.827519 0.335128 0.676135 0.000000 0.679527 0.281035 0.481218 0.813671 0.621056 0.387169
7 0.541508 0.253292 0.449965 0.585140 0.355671 0.055591 0.679527 0.000000 0.602427 0.245620 0.261309 0.350237 0.526773
8 0.525761 0.378472 0.591335 0.946945 0.368796 0.576447 0.281035 0.602427 0.000000 0.498845 0.811462 0.695304 0.559738
9 0.295906 0.305322 0.217102 0.474554 0.148598 0.272729 0.481218 0.245620 0.498845 0.000000 0.333842 0.208528 0.282959
10 0.566702 0.504946 0.434730 0.383486 0.482379 0.315123 0.813671 0.261309 0.811462 0.333842 0.000000 0.254850 0.533784
11 0.326087 0.501173 0.187983 0.266724 0.327450 0.399127 0.621056 0.350237 0.695304 0.208528 0.254850 0.000000 0.288072
12 0.045059 0.545672 0.100192 0.444612 0.251218 0.555655 0.387169 0.526773 0.559738 0.282959 0.533784 0.288072 0.000000
13 0.339648 0.350100 0.407307 0.769145 0.202592 0.501132 0.185248 0.511020 0.186913 0.347808 0.678357 0.527288 0.372879
14 0.530211 0.104003 0.473790 0.689158 0.303486 0.109841 0.589377 0.149459 0.468906 0.257676 0.404710 0.431203 0.527905
15 0.622118 0.178856 0.627453 0.923461 0.391044 0.387645 0.509836 0.431502 0.273610 0.450269 0.683313 0.656742 0.639993
16 0.337079 0.211995 0.297111 0.582175 0.113238 0.251168 0.434076 0.246505 0.403684 0.107671 0.409858 0.316172 0.337886
17 0.271897 0.311029 0.313864 0.668400 0.097022 0.424905 0.252905 0.426640 0.279160 0.243693 0.576241 0.422417 0.296806
18 0.664617 0.395999 0.554151 0.592343 0.504234 0.184188 0.833801 0.157951 0.758223 0.376555 0.212643 0.410605 0.642698
19 0.328445 0.719013 0.238085 0.186618 0.476045 0.642499 0.671657 0.594990 0.828653 0.413697 0.465589 0.245340 0.284878
13 14 15 16 17 18 19
0 0.339648 0.530211 0.622118 0.337079 0.271897 0.664617 0.328445
1 0.350100 0.104003 0.178856 0.211995 0.311029 0.395999 0.719013
2 0.407307 0.473790 0.627453 0.297111 0.313864 0.554151 0.238085
3 0.769145 0.689158 0.923461 0.582175 0.668400 0.592343 0.186618
4 0.202592 0.303486 0.391044 0.113238 0.097022 0.504234 0.476045
5 0.501132 0.109841 0.387645 0.251168 0.424905 0.184188 0.642499
6 0.185248 0.589377 0.509836 0.434076 0.252905 0.833801 0.671657
7 0.511020 0.149459 0.431502 0.246505 0.426640 0.157951 0.594990
8 0.186913 0.468906 0.273610 0.403684 0.279160 0.758223 0.828653
9 0.347808 0.257676 0.450269 0.107671 0.243693 0.376555 0.413697
10 0.678357 0.404710 0.683313 0.409858 0.576241 0.212643 0.465589
11 0.527288 0.431203 0.656742 0.316172 0.422417 0.410605 0.245340
12 0.372879 0.527905 0.639993 0.337886 0.296806 0.642698 0.284878
13 0.000000 0.408426 0.339019 0.274263 0.105627 0.668252 0.643427
14 0.408426 0.000000 0.282070 0.194058 0.345013 0.294029 0.663142
15 0.339019 0.282070 0.000000 0.344028 0.355134 0.568361 0.854775
16 0.274263 0.194058 0.344028 0.000000 0.181494 0.399730 0.513362
17 0.105627 0.345013 0.355134 0.181494 0.000000 0.581128 0.551910
18 0.668252 0.294029 0.568361 0.399730 0.581128 0.000000 0.649183
19 0.643427 0.663142 0.854775 0.513362 0.551910 0.649183 0.000000
result:
In [107]: d[(0 < d) & (d < r)].apply(lambda x: x.dropna().index.tolist())
Out[107]:
0 [2, 4, 9, 12, 17]
1 [4, 5, 7, 14, 15, 16]
2 [0, 4, 9, 11, 12, 16, 19]
3 [11, 19]
4 [0, 1, 2, 9, 12, 13, 16, 17]
5 [1, 7, 9, 14, 16, 18]
6 [8, 13, 17]
7 [1, 5, 9, 10, 14, 16, 18]
8 [6, 13, 15, 17]
9 [0, 2, 4, 5, 7, 11, 12, 14, 16, 17]
10 [7, 11, 18]
11 [2, 3, 9, 10, 12, 19]
12 [0, 2, 4, 9, 11, 17, 19]
13 [4, 6, 8, 16, 17]
14 [1, 5, 7, 9, 15, 16, 18]
15 [1, 8, 14]
16 [1, 2, 4, 5, 7, 9, 13, 14, 17]
17 [0, 4, 6, 8, 9, 12, 13, 16]
18 [5, 7, 10, 14]
19 [2, 3, 11, 12]
dtype: object
You can also do this with just numpy and scipy, which I find faster.
from scipy.spatial.distance import pdist, squareform
import numpy

SIZE = 512
N_PARTICLE = 100
RADIUS = 15
VALUE_THRESHOLD = 0

coords = numpy.random.randint(0, SIZE, size=(N_PARTICLE, 2))
values = numpy.random.randint(0, 2, N_PARTICLE)
square_dist = squareform(pdist(coords, metric='euclidean'))

condlist = []
for i, row in enumerate(square_dist):
    condlist.append(numpy.where((values > VALUE_THRESHOLD) & (row < RADIUS) & (row > 0))[0].tolist())
There must be a better way to do it, though.
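One common improvement is a k-d tree, which avoids building the full pairwise distance matrix; a sketch using scipy.spatial.cKDTree, with the same hypothetical SIZE/RADIUS setup as above:
import numpy
from scipy.spatial import cKDTree

coords = numpy.random.randint(0, 512, size=(100, 2))
values = numpy.random.randint(0, 2, 100)

tree = cKDTree(coords)
# indices of all points within the radius of each point (including the point itself)
neighbours = tree.query_ball_point(coords, r=15)
# keep only non-zero-valued neighbours, excluding the point itself
condlist = [[j for j in idx if j != i and values[j] > 0]
            for i, idx in enumerate(neighbours)]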
