python: inserting row at specifc index from one dataframe to another - python

I have a two dataframes as follows:
df1:
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
2 0 7 6 5 8
df2:
M N O P Q R S T
0 1 2 3
1 4 5 6
2 7 8 9
3 8 6 5
4 5 4 3
I have taken out a slice of data from df1 as follows:
>data_1 = df1.loc[0:1]
>data_1
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
Now I need to insert this data_1 into df2 at specific location of Index(0,P) (row,column). Is there any way to do it? I do not want to disturb the other columns in df2.
I can extract individual values of each cell and do it but since I have to do it for a large dataset, its not possible to do it cell-wise.
Cellwise method:
>var1 = df1.iat[0,1]
>var2 = df1.iat[0,0]
>df2.at[0, 'P'] = var1
>df2.at[0, 'Q'] = var2

If you specify all the columns, it is possible to do it as follows:
df2.loc[0:1, ['P', 'Q', 'R', 'S', 'T']] = df1.loc[0:1].values
Resulting dataframe:
M N O P Q R S T
0 1 2 3 8.0 6.0 4.0 9.0 7.0
1 4 5 6 2.0 6.0 3.0 8.0 5.0
2 7 8 9
3 8 6 5
4 5 4 3

You can rename columns and index names for match to second DataFrame, so possible use DataFrame.update for correct way specifiest by tuple pos:
data_1 = df1.loc[0:1]
print (data_1)
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
pos = (2, 'P')
data_1 = data_1.rename(columns=dict(zip(data_1.columns, df2.loc[:, pos[1]:].columns)),
index=dict(zip(data_1.index, df2.loc[pos[0]:].index)))
print (data_1)
P Q R S T
2 8 6 4 9 7
3 2 6 3 8 5
df2.update(data_1)
print (df2)
M N O P Q R S T
0 1 2 3 NaN NaN NaN NaN NaN
1 4 5 6 NaN NaN NaN NaN NaN
2 7 8 9 8.0 6.0 4.0 9.0 7.0
3 8 6 5 2.0 6.0 3.0 8.0 5.0
4 5 4 3 NaN NaN NaN NaN NaN
How working rename - idea is select all columns and all index values after specified column, index name by loc and then zip by columns names of data_1 with convert to dictionary. So last replace bot, index and columns names in data_1 by next columns, index values.

Related

Python How to drop rows of Pandas DataFrame whose value in a certain column is NaN

I have this DataFrame and want only the records whose "Total" column is not NaN ,and records when A~E has more than two NaN:
A B C D E Total
1 1 3 5 5 8
1 4 3 5 5 NaN
3 6 NaN NaN NaN 6
2 2 5 9 NaN 8
..i.e. something like df.dropna(....) to get this resulting dataframe:
A B C D E Total
1 1 3 5 5 8
2 2 5 9 NaN 8
Here's my code
import pandas as pd
dfInputData = pd.read_csv(path)
dfInputData = dfInputData.dropna(axis=1,how = 'any')
RowCnt = dfInputData.shape[0]
But it looks like no modification has been made even error
Please help!! Thanks
Use boolean indexing with count all columns without Total for number of missing values and not misisng values in Total:
df = df[df.drop('Total', axis=1).isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
Or filter columns between A:E:
df = df[df.loc[:, 'A':'E'].isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0

Pandas Rolling Groupby Shift back 1, Trying to lag rolling sum

I am trying to get a rolling sum of the past 3 rows for the same ID but lagging this by 1 row. My attempt looked like the below code and i is the column. There has to be a way to do this but this method doesnt seem to work.
for i in df.columns.values:
df.groupby('Id', group_keys=False)[i].rolling(window=3, min_periods=2).mean().shift(1)
id dollars lag
1 6 nan
1 7 nan
1 6 6.5
3 7 nan
3 4 nan
3 4 5.5
3 3 5
5 6 nan
5 5 nan
5 6 5.5
5 12 5.67
5 7 8.3
I am trying to get a rolling sum of the past 3 rows for the same ID but lagging this by 1 row.
You can create the lagged rolling sum by chaining DataFrame.groupby(ID), .shift(1) for the lag 1, .rolling(3) for the window 3, and .sum() for the sum.
Example: Let's say your dataset is:
import pandas as pd
# Reproducible datasets are your friend!
d = pd.DataFrame({'grp':pd.Series(['A']*4 + ['B']*5 + ['C']*6),
'x':pd.Series(range(15))})
print(d)
grp x
A 0
A 1
A 2
A 3
B 4
B 5
B 6
B 7
B 8
C 9
C 10
C 11
C 12
C 13
C 14
I think what you're asking for is this:
d['y'] = d.groupby('grp')['x'].shift(1).rolling(3).sum()
print(d)
grp x y
A 0 NaN
A 1 NaN
A 2 NaN
A 3 3.0
B 4 NaN
B 5 NaN
B 6 NaN
B 7 15.0
B 8 18.0
C 9 NaN
C 10 NaN
C 11 NaN
C 12 30.0
C 13 33.0
C 14 36.0

Pandas Dataframe merge without duplicating either side?

I often get tables containing similar information from different sources for "QC". Sometime I want to put these two tables side by side, output to excel to show others, so we can resolve discrepancies. To do so I want a 'lazy' merge with pandas dataframe.
say, I have two tables:
df a: df b:
n I II n III IV
0 a 1 2 0 a 1 2
1 a 3 4 1 a 0 0
2 b 5 6 2 b 5 6
3 c 9 9 3 b 7 8
I want to have results like:
a merge b
n I II III IV
0 a 1 2 1 2
1 a 3 4
2 b 5 6 5 6
3 b 7 8
4 c 9 9
of course this is what I got with merge():
a.merge(b, how='outer', on="n")
n I II III IV
0 a 1 2 1.0 2.0
1 a 1 2 0.0 0.0
2 a 3 4 1.0 2.0
3 a 3 4 0.0 0.0
4 b 5 6 5.0 6.0
5 b 5 6 7.0 8.0
6 c 9 9 NaN NaN
I feel there must be an easy way to do that, but all my solution were convoluted.
Is there a parameter in merge or concat for something like "no_copy"?
Doesn't look like you can do it with the information given alone, you need to introduce a cumulative count column to add to the merge columns. Consider this solution
>>> import pandas
>>> dfa = pandas.DataFrame( {'n':['a','a','b','c'] , 'I' : [1,3,5,9] , 'II':[2,4,6,9]}, columns=['n','I','II'])
>>> dfb = pandas.DataFrame( {'n':['a','b','b'] , 'III' : [1,5,7] , 'IV':[2,6,8] }, columns=['n','III','IV'])
>>>
>>> dfa['nCC'] = dfa.groupby( 'n' ).cumcount()
>>> dfb['nCC'] = dfb.groupby( 'n' ).cumcount()
>>> dm = dfa.merge(dfb, how='outer', on=['n','nCC'] )
>>>
>>>
>>> dfa
n I II nCC
0 a 1 2 0
1 a 3 4 1
2 b 5 6 0
3 c 9 9 0
>>> dfb
n III IV nCC
0 a 1 2 0
1 b 5 6 0
2 b 7 8 1
>>> dm
n I II nCC III IV
0 a 1.0 2.0 0 1.0 2.0
1 a 3.0 4.0 1 NaN NaN
2 b 5.0 6.0 0 5.0 6.0
3 c 9.0 9.0 0 NaN NaN
4 b NaN NaN 1 7.0 8.0
>>>
It has the gaps or lack of duplication where you want although the index isn't quite identical to your output. Because NaN's are involved the various columns get coerced to float64 types.
Adding the cumulative count essentially forces instances to match with each other across both sides, the first matches for a given level match the corresponding first level, and likewise for all instances of the level for all levels.

Pandas: rolling count if within a loop

In my data frame I want to create a column '5D_Peak' as a rolling max, and then another column with rolling count of historical data that's close to the peak. I wonder if there is an easier way to simply or ideally vectorise the calculation.
This is my codes in a plain but complicated way:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,4],[4,5,2],[3,5,8],[1,8,6],[5,2,8],[1,4,10],[3,5,9],[1,4,7],[1,4,6]], columns=list('ABC'))
df['5D_Peak']=df['C'].rolling(window=5,center=False).max()
for i in range(5,len(df.A)):
val=0
for j in range(i-5,i):
if df.loc[j,'C']>df.loc[i,'5D_Peak']-2 and df.loc[j,'C']<df.loc[i,'5D_Peak']+2:
val+=1
df.loc[i,'5D_Close_to_Peak_Count']=val
This is the output I want:
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 NaN
5 1 4 10 10.0 0.0
6 3 5 9 10.0 1.0
7 1 4 7 10.0 2.0
8 1 4 6 10.0 2.0
I believe this is what you want. You can set the two values below:
'''the window within which to search "close-to_peak" values'''
lkp_rng = 5
'''how close is close?'''
closeness_measure = 2
'''function to count the number of "close-to_peak" values in the lkp_rng'''
fc = lambda x: np.count_nonzero(np.where(x >= x.max()- closeness_measure))
'''apply fc to the coulmn you choose'''
df['5D_Close_to_Peak_Count'] = df['C'].rolling(window=lkp_range,center=False).apply(fc)
df.head(10)
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 3.0
5 1 4 10 10.0 3.0
6 3 5 9 10.0 3.0
7 1 4 7 10.0 3.0
8 1 4 6 10.0 2.0
I am guessing what you mean by "historical data".

pandas.DataFrame set all string values to nan

I have a pandas.DataFrame that contain string, float and int types.
Is there a way to set all strings that cannot be converted to float to NaN ?
For example:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 "wajdi"
to:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 NaN
You can use pd.to_numeric and set errors='coerce'
pandas.to_numeric
df['D'] = pd.to_numeric(df.D, errors='coerce')
Which will give you:
A B C D
0 1 2 5.0 7.0
1 0 4 NaN 15.0
2 4 8 9.0 10.0
3 11 5 8.0 0.0
4 11 5 8.0 NaN
Deprecated solution (pandas <= 0.20 only):
df.convert_objects(convert_numeric=True)
pandas.DataFrame.convert_objects
Here's the dev note in the convert_objects source code: # TODO: Remove in 0.18 or 2017, which ever is sooner. So don't make this a long term solution if you use it.
Here is a way:
df['E'] = pd.to_numeric(df.D, errors='coerce')
And then you have:
A B C D E
0 1 2 5.0 7 7.0
1 0 4 NaN 15 15.0
2 4 8 9.0 10 10.0
3 11 5 8.0 0 0.0
4 11 5 8.0 wajdi NaN
You can use pd.to_numeric with errors='coerce'.
In [30]: df = pd.DataFrame({'a': [1, 2, 'NaN', 'bob', 3.2]})
In [31]: pd.to_numeric(df.a, errors='coerce')
Out[31]:
0 1.0
1 2.0
2 NaN
3 NaN
4 3.2
Name: a, dtype: float64
Here is one way to apply it to all columns:
for c in df.columns:
df[c] = pd.to_numeric(df[c], errors='coerce')
(See comment by NinjaPuppy for a better way.)

Categories