How to add zero values at the beginning and end of a pandas DataFrame? - python

I have the following dataframe
     Inbound Value
1    NaN
2    NaN
3    NaN
4    ...
5    ...
19   NaN
20   130
21   130
22   140
23   140
24   170
25   170
26   ...
27   210
28   NaN
29   NaN
30   ...
..   ...
131  NaN
I would like to drop most of the NaN values, keeping only the first 11 and the last 11.
I know that data = data.dropna() drops all NaN values, but I want the result described above.

Use NumPy's r_ to build the ranges of positions to set to a value, then drop the remaining NaN. Note that the pd.np alias is deprecated, so import NumPy directly; also, 0:11 (not 0:10) covers the first 11 positions:
import numpy as np

df.iloc[np.r_[0:11, -11:0], df.columns.get_loc('Inbound Value')] = 0
df = df.dropna()
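A minimal, self-contained sketch of the same idea; the sample data below is made up to mirror the question's shape:
import numpy as np
import pandas as pd

# toy stand-in for the real data: NaN at both ends, values in the middle
df = pd.DataFrame(
    {'Inbound Value': [np.nan] * 19 + [130, 130, 140, 170, 210] + [np.nan] * 19},
    index=range(1, 44))

# zero out the first 11 and last 11 positions, then drop the remaining NaN
df.iloc[np.r_[0:11, -11:0], df.columns.get_loc('Inbound Value')] = 0
df = df.dropna()
print(df)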


Pandas Fill column values as previous

I have many columns that must hold their values from the previous row if a condition is met. The Y & Z columns decide the values of the other columns.
Y    Z   A    B    C    D
100  10  20   NaN  22   40
100  11  NaN  15   NaN  41
100  10  23   NaN  24   42
100  11  NaN  16   NaN  42
100  10  25   NaN  26   45
100  11  NaN  17   NaN  45
101  17  NaN  NaN  NaN  NaN
Expectation
Y    Z   A    B    C    D
100  10  20   NaN  22   40
100  11  20   15   22   41
100  10  23   15   24   42
100  11  23   16   24   42
100  10  25   16   26   45
100  11  25   17   26   45
101  17  NaN  NaN  NaN  NaN
So basically, if the value of Y is 100 and Z is 10, the values in column B should be copied from the previous value of B, and if Z is 11, the values of A and C should be copied from their previous values. I have around 20 columns like B and 20 columns like A and C. There are 50-60 columns like D; they should not be affected. And if the value of Y is anything other than 100, nothing needs to be done to columns A, B and C.
I was thinking of using
df['B'] = df['B'].shift().fillna(-1)
but I am not sure how to do it based on the condition and for many columns in one go.
Forward fill only the rows matching a mask, built by chaining Series.eq (for ==) with Series.isin (for membership testing) using & (bitwise AND):
import numpy as np

# if necessary, replace the string 'Nan' with real missing values (NaN)
df = df.replace('Nan', np.nan)

mask = df.Y.eq(100) & df.Z.isin([10, 11])
df[mask] = df[mask].ffill()
Another idea with DataFrame.mask:
df = df.mask(mask, df.ffill())
print (df)
Y Z A B C D
0 100 10 20 NaN 22 40
1 100 11 20 15 22 41
2 100 10 23 15 24 42
3 100 11 23 16 24 42
4 100 10 25 16 26 45
5 100 11 25 17 26 45
6 101 17 NaN NaN NaN NaN
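For reference, a self-contained sketch that reproduces the sample data from the question, so both approaches above can be run as-is:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Y': [100, 100, 100, 100, 100, 100, 101],
                   'Z': [10, 11, 10, 11, 10, 11, 17],
                   'A': [20, np.nan, 23, np.nan, 25, np.nan, np.nan],
                   'B': [np.nan, 15, np.nan, 16, np.nan, 17, np.nan],
                   'C': [22, np.nan, 24, np.nan, 26, np.nan, np.nan],
                   'D': [40, 41, 42, 42, 45, 45, np.nan]})

# forward fill only within the rows selected by the mask
mask = df.Y.eq(100) & df.Z.isin([10, 11])
df[mask] = df[mask].ffill()
print(df)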

How to insert a row into a dataframe

I have an existing dataframe like this
>>> print(dataframe)
sid
30 11
56 5
73 25
78 2
132 1
..
8531 25
8616 2
9049 1
9125 6
9316 11
Name: name, Length: 87, dtype: int64
I want to add a row like {'sid': 2, '': 100} to it but when I try this
df = pandas.DataFrame({'sid': [2], '': [100]})
df = df.set_index('sid')
dataframe = dataframe.append(df)
print(dataframe)
I end up with
sid
30 11.0 NaN
56 5.0 NaN
73 25.0 NaN
78 2.0 NaN
132 1.0 NaN
... ... ...
8616 2.0 NaN
9049 1.0 NaN
9125 6.0 NaN
9316 11.0 NaN
2 NaN 100.0
I'm hoping for something more like
sid
2 100
30 11
56 5
73 25
78 2
132 1
..
8531 25
8616 2
9049 1
9125 6
9316 11
Any idea how I can achieve that?
The way to do this was
dataframe.loc[2] = 100
Thanks anky!
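(For context: dataframe.loc[2] = 100 works because assigning with .loc to a label that doesn't exist yet enlarges the object, i.e. it adds a new row with that index label.)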
The reason for the problem above is that when you appended the two DataFrames, you forgot to set 'sid' as the index of the new one, so the two DataFrames had different structures when you appended them. Make sure both DataFrames have the same index before you append them.
data = [[30, 11], [56, 5], [73, 25]]  # test dataframe
dataframe = pd.DataFrame(data, columns=['sid', ''])
dataframe = dataframe.set_index('sid')
print(dataframe)
You get:
sid
30 11
56 5
73 25
Create df and set its index:
df = pd.DataFrame({'sid': [2], '': [100]})
df = df.set_index('sid')
You get:
sid
2 100
Then append them:
dataframe = df.append(dataframe)
print(dataframe)
You will get the desired outcome:
sid
2 100
30 11
56 5
73 25
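Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current versions the same approach works with pd.concat. A sketch using the same frames as above:
import pandas as pd

dataframe = pd.DataFrame([[30, 11], [56, 5], [73, 25]], columns=['sid', '']).set_index('sid')
df = pd.DataFrame({'sid': [2], '': [100]}).set_index('sid')

# both frames share the 'sid' index, so the columns line up cleanly
dataframe = pd.concat([df, dataframe])
print(dataframe)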

Calculate mean of df, BUT if >=1 of the values differs >20% from this mean, the mean is set to NaN

I want to calculate the mean of columns a, b, c, d of the dataframe, BUT if one of the four values in a row differs by more than 20% from this mean (of the four values), the mean has to be set to NaN.
Calculating the mean of the 4 columns is easy, but I'm stuck at defining the condition 'if not mean*0.8 <= each of the values in the data row <= mean*1.2, then mean = NaN'.
In the example, one or more of the values in ID 5 and ID 87 don't fit in the interval, and therefore the mean is set to NaN.
(NaN values in the initial dataframe are ignored when calculating the mean and when applying the 20% condition to the calculated mean.)
So I'm trying to calculate the mean only for the data rows with no 'outliers'.
Initial df:
ID a b c d
2 31 32 31 31
5 33 52 159 2
7 51 NaN 52 51
87 30 52 421 2
90 10 11 10 11
102 41 42 NaN 42
Desired df:
ID a b c d mean
2 31 32 31 31 31.25
5 33 52 159 2 NaN
7 51 NaN 52 51 51.33
87 30 52 421 2 NaN
90 10 11 10 11 10.50
102 41 42 NaN 42 41.67
Code:
import pandas as pd
import numpy as np


df = pd.DataFrame({"ID": [2,5,7,87,90,102],
"a": [31,33,51,30,10,41],
"b": [32,52,np.nan,52,11,42],
"c": [31,159,52,421,10,np.nan],
"d": [31,2,51,2,11,42]})

print(df)

a = df.loc[:, ['a','b','c','d']]

df['mean'] = (a.iloc[:,0:]).mean(1)


print(df)
b = df.mean.values[:,None]*0.8 < a.values[:,:] < df.mean.values[:,None]*1.2
print(b)
...
Try this:
# extract the relevant columns
s = df.iloc[:, 1:]
# row-wise mean
mean = s.mean(1)
# True where a value falls outside [mean*0.8, mean*1.2]
mask = s.lt(mean*.8, axis=0) | s.gt(mean*1.2, axis=0)
# set the mean to NaN where any value in the row violates the condition
df['mean'] = mean.mask(mask.any(1))
Output:
ID a b c d mean
0 2 31 32.0 31.0 31 31.250000
1 5 33 52.0 159.0 2 NaN
2 7 51 NaN 52.0 51 51.333333
3 87 30 52.0 421.0 2 NaN
4 90 10 11.0 10.0 11 10.500000
5 102 41 42.0 NaN 42 41.666667
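If you prefer a single comparison, the same 20% condition can be written as an absolute deviation from the mean (equivalent here, since all the row means in the sample are positive):
# |value - mean| > 0.2 * mean  <=>  value outside [mean*0.8, mean*1.2]
mask = s.sub(mean, axis=0).abs().gt(mean * 0.2, axis=0)
df['mean'] = mean.mask(mask.any(1))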

Pandas concat similar DataFrames and Series

I have a list of DataFrames, all with the same columns. Occasionally, one of them has only a single row and therefore ends up as a Series. When I try to concatenate this list with pd.concat, where there was a Series it puts what I want to be the columns into the index. See below for a minimal working example.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: d = {'a':np.random.randn(100), 'b':np.random.randn(100)}
In [4]: df = pd.DataFrame(d)
In [5]: thing1 = df.iloc[:10, :]
In [6]: thing1
Out[6]:
a b
0 -0.505268 -1.109089
1 -1.792729 -0.580566
2 -0.478042 0.410095
3 -0.758376 0.558772
4 0.112519 0.556316
5 -1.015813 -0.568148
6 1.234858 -1.062879
7 -0.455796 -0.107942
8 1.231422 0.780694
9 -1.082461 -1.809412
In [7]: thing2 = df.iloc[10,:]
In [8]: thing2
Out[8]:
a -1.527836
b 0.653610
Name: 10, dtype: float64
In [9]: thing3 = df.iloc[11:, :]
In [10]: thing3
Out[10]:
a b
11 -1.247939 -0.694491
12 1.359737 0.625284
13 -0.491533 -0.230665
14 1.360465 0.472451
15 0.691532 -1.822708
16 0.938316 1.310101
17 0.485776 -0.313206
18 1.398189 -0.232446
19 -0.626278 0.714052
20 -1.292272 -1.299580
21 -1.521746 -1.615611
22 1.464332 2.839602
23 0.707370 -0.162056
24 -1.825903 0.000278
25 0.917284 -0.094716
26 -0.239839 0.132572
27 -0.463240 -0.805458
28 1.174125 0.131057
29 0.183503 0.328603
30 0.045839 -0.244965
31 0.449265 0.642082
32 2.381600 -0.417044
33 0.276217 -0.257426
34 0.755067 0.012898
35 0.130339 -0.094300
36 -1.643097 0.038982
37 0.895719 0.789494
38 0.701480 -0.668440
39 -0.201400 1.441928
40 -2.018043 -0.106764
.. ... ...
70 0.971799 0.298164
71 1.307070 -2.093075
72 -1.049177 2.183065
73 -0.469273 -0.739449
74 0.685838 2.579547
75 1.994485 0.783204
76 -0.414760 -0.285766
77 -1.005873 -0.783886
78 1.486588 -0.349575
79 1.417006 -0.676501
80 1.284611 -0.817505
81 -0.624406 -1.659931
82 -0.921061 0.424663
83 -0.645472 -0.769509
84 -1.217172 -0.943542
85 -0.184948 0.482977
86 -0.253972 -0.080682
87 -0.699122 0.368751
88 1.391163 0.042899
89 -0.075512 0.019728
90 0.449151 0.486462
91 -0.182553 0.876379
92 -0.209162 0.390093
93 0.789094 1.570251
94 -1.018724 -0.084603
95 1.109534 1.840739
96 0.774806 -0.380387
97 0.534344 1.165343
98 1.003597 -0.221899
99 -0.659863 -1.061590
[89 rows x 2 columns]
In [11]: pd.concat([thing1, thing2, thing3])
Out[11]:
a b 0
0 -0.505268 -1.109089 NaN
1 -1.792729 -0.580566 NaN
2 -0.478042 0.410095 NaN
3 -0.758376 0.558772 NaN
4 0.112519 0.556316 NaN
5 -1.015813 -0.568148 NaN
6 1.234858 -1.062879 NaN
7 -0.455796 -0.107942 NaN
8 1.231422 0.780694 NaN
9 -1.082461 -1.809412 NaN
a NaN NaN -1.527836
b NaN NaN 0.653610
11 -1.247939 -0.694491 NaN
12 1.359737 0.625284 NaN
13 -0.491533 -0.230665 NaN
14 1.360465 0.472451 NaN
15 0.691532 -1.822708 NaN
16 0.938316 1.310101 NaN
17 0.485776 -0.313206 NaN
18 1.398189 -0.232446 NaN
19 -0.626278 0.714052 NaN
20 -1.292272 -1.299580 NaN
21 -1.521746 -1.615611 NaN
22 1.464332 2.839602 NaN
23 0.707370 -0.162056 NaN
24 -1.825903 0.000278 NaN
25 0.917284 -0.094716 NaN
26 -0.239839 0.132572 NaN
27 -0.463240 -0.805458 NaN
28 1.174125 0.131057 NaN
.. ... ... ...
70 0.971799 0.298164 NaN
71 1.307070 -2.093075 NaN
72 -1.049177 2.183065 NaN
73 -0.469273 -0.739449 NaN
74 0.685838 2.579547 NaN
75 1.994485 0.783204 NaN
76 -0.414760 -0.285766 NaN
77 -1.005873 -0.783886 NaN
78 1.486588 -0.349575 NaN
79 1.417006 -0.676501 NaN
80 1.284611 -0.817505 NaN
81 -0.624406 -1.659931 NaN
82 -0.921061 0.424663 NaN
83 -0.645472 -0.769509 NaN
84 -1.217172 -0.943542 NaN
85 -0.184948 0.482977 NaN
86 -0.253972 -0.080682 NaN
87 -0.699122 0.368751 NaN
88 1.391163 0.042899 NaN
89 -0.075512 0.019728 NaN
90 0.449151 0.486462 NaN
91 -0.182553 0.876379 NaN
92 -0.209162 0.390093 NaN
93 0.789094 1.570251 NaN
94 -1.018724 -0.084603 NaN
95 1.109534 1.840739 NaN
96 0.774806 -0.380387 NaN
97 0.534344 1.165343 NaN
98 1.003597 -0.221899 NaN
99 -0.659863 -1.061590 NaN
[101 rows x 3 columns]
Please note that for this problem, I need to maintain the original index.
I've spent a long time investigating the documentation but can't seem to figure out my problem. Is there an easy way around this?
thing2 = pd.DataFrame(thing2).transpose()
pd.concat([thing1, thing2, thing3])
In your case, transpose() will set the Series index as the columns, and then you can concatenate easily.
Documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html
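Two other ways to avoid the problem at the source are to keep one-row slices as DataFrames (both are sketches; df and thing2 refer to the objects from the example above):
# select with a list of positions so a one-row DataFrame comes back, not a Series
thing2 = df.iloc[[10], :]

# or convert an existing Series row back into a one-row DataFrame
thing2 = thing2.to_frame().T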

pandas concat is not possible, got error: why?

I'd like to add a totals row at the top of my pivot, but when I try to concat I get an error.
>>> table
Weight
Vcountry 1 2 3 4 5 6 7
V20001
1 86 NaN NaN NaN NaN NaN 92
2 41 NaN 71 40 50 51 49
3 NaN 61 60 61 60 25 62
4 51 NaN NaN NaN NaN NaN NaN
5 26 26 20 41 25 23 NaN
[5 rows x 7 columns]
That's the pivot table. And here is the totals row I'd like to join:
>>> totals_frame
Vcountry      1   2    3    4    5   6    7
totalCount  204  87  151  142  135  99  203
[1 rows x 7 columns]
>>> pc = [totals_frame, table]
>>> concat(pc)
Here's the output:
reindex_items
copy_if_needed=True)
File "C:\Python27\lib\site-packages\pandas\core\index.py", line 2887, in reindex
target = MultiIndex.from_tuples(target)
File "C:\Python27\lib\site-packages\pandas\core\index.py", line 2486, in from_tuples
arrays = list(lib.tuples_to_object_array(tuples).T)
File "inference.pyx", line 915, in pandas.lib.tuples_to_object_array (pandas\lib.c:43656)
TypeError: object of type 'long' has no len()
Here's a possible way: instead of using pd.concat, use pd.DataFrame.append (the TypeError most likely comes from concatenating a frame whose columns are a MultiIndex with one whose columns are flat). There's a bit of fiddling around with the index to do, but it's still quite neat, I think:
# Just setting up the dataframe:
df = pd.DataFrame({'country': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'weight': [1, 2, 3, 1, 2, 3],
                   'value': [10, 20, 30, 15, 25, 35]})
df = df.set_index(['country', 'weight']).unstack('weight')

# A bit of messing about to get the index right:
index = df.index.values.tolist()
index.append('Totals')

# Here's where the magic happens:
df = df.append(df.sum(), ignore_index=True)
df.index = index
which gives:
value
weight 1 2 3
A 10 20 30
B 15 25 35
Totals 25 45 65
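On pandas 2.0+, where DataFrame.append no longer exists, the same idea can be sketched with pd.concat, which also lets you put the totals row on top as the question asked:
# build the totals as a one-row DataFrame with the same (MultiIndex) columns
totals = df.sum().to_frame('Totals').T

# concatenating two frames with identical columns avoids the index mix-up
df = pd.concat([totals, df])
print(df)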
