df = pd.read_csv('test.txt',dtype=str)
print(df)
HE WE
0 aa NaN
1 181 76
2 22 13
3 NaN NaN
I want to overwrite values in this data frame with the following data frame at the matching indexes:
dff = pd.DataFrame({'HE' : [100,30]},index=[1,2])
print(dff)
HE
1 100
2 30
for i in dff.index:
    df._set_value(i, 'HE', dff._get_value(i, 'HE'))
print(df)
HE WE
0 aa NaN
1 100 76
2 30 13
3 NaN NaN
Is there a way to change it all at once without using 'for'?
Use DataFrame.update, which works in place:
df.update(dff)
print (df)
HE WE
0 aa NaN
1 100 76.0
2 30 13.0
3 NaN NaN
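For completeness, a minimal runnable sketch of the same fix (the frames are rebuilt by hand here to mirror the data above): update aligns on index and column labels, works in place, and NaN values in the other frame are never used to overwrite existing data.
import numpy as np
import pandas as pd
df = pd.DataFrame({'HE': ['aa', '181', '22', np.nan],
                   'WE': [np.nan, '76', '13', np.nan]})
dff = pd.DataFrame({'HE': [100, 30]}, index=[1, 2])
df.update(dff)   # in place: only index 1 and 2 of column HE are overwritten
print(df)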
Issue:
groupby on the data below results in an empty dataframe; I'm not sure how to fix it.
data:
business_group business_unit cost_center GL_code profit_center count
NaN a 12 12-09 09 1
NaN a 12 12-09 09 1
NaN b 23 23-87 87 1
NaN b 23 23-87 87 1
NaN b 34 34-76 76 1
groupby:
group_df = df.groupby(['business_group', 'business_unit','cost_center','GL_code','profit_center'],
as_index=False).count()
expected result:
business_group business_unit cost_center GL_code profit_center count
NaN a 12 12-09 09 2
NaN b 23 23-87 87 2
NaN b 34 34-76 76 1
result received:
Empty DataFrame
Columns: [business_group, business_unit, cost_center, GL_code, profit_center, count]
Index: []
That's because the NaN in business_group is a null value, and groupby() drops NaN group keys by default. You can pass dropna=False to groupby():
group_df = df.groupby(['business_group', 'business_unit',
                       'cost_center', 'GL_code', 'profit_center'],
                      dropna=False, as_index=False).count()
Output:
business_group business_unit cost_center GL_code profit_center count
0 NaN a 12 12-09 9 2
1 NaN b 23 23-87 87 2
2 NaN b 34 34-76 76 1
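Note that the dropna parameter was only added to groupby in pandas 1.1. A sketch of a workaround for older versions, assuming df holds the data above (the 'MISSING' placeholder label is just for illustration): fill the NaN key before grouping, then restore NaN afterwards.
import numpy as np
# pandas < 1.1 workaround: groupby has no dropna parameter there
group_df = (df.fillna({'business_group': 'MISSING'})
              .groupby(['business_group', 'business_unit',
                        'cost_center', 'GL_code', 'profit_center'],
                       as_index=False)
              .count())
group_df['business_group'] = group_df['business_group'].replace('MISSING', np.nan)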
I want to select the previous row's value only if it meets a certain condition
E.g.
df:
Value Marker
10 0
12 0
50 1
42 1
52 0
23 1
I want to select the previous row's value where Marker == 0 if the current row's Marker == 1.
Result:
df:
Value Marker Prev_Value
10 0 nan
12 0 nan
50 1 12
42 1 12
52 0 nan
23 1 52
I tried:
df['Prev_Value'] = np.where(df['Marker'] == 1, df['Value'].shift(), np.nan)
but that does not pick up the previous value conditionally the way I want.
condition = (df.Marker.shift() == 0) & (df.Marker == 1)
df['Prev_Value'] = np.where(condition, df.Value.shift(), np.nan)
Output:
df
Value Marker Prev_Value
0 10 0 NaN
1 12 0 NaN
2 50 1 12.0
3 42 1 NaN
4 52 0 NaN
5 23 1 52.0
You could try this:
df['Prev_Value'] = np.where(df['Marker'].diff() == 1, df['Value'].shift(1), np.nan)
Output:
df
Value Marker Prev_Value
0 10 0 NaN
1 12 0 NaN
2 50 1 12.0
3 42 1 NaN
4 52 0 NaN
5 23 1 52.0
If you want to get the value from the most recent Marker == 0 row whenever Marker == 1, you could try this:
# Walk the frame backwards; for each Marker == 1 row, take the last
# Value seen in an earlier Marker == 0 row (NaN if there is none).
prevro = []
for i in reversed(df.index):
    if df.iloc[i, 1] == 1:
        prevro_zero = df.iloc[0:i, 0][df.iloc[0:i, 1].eq(0)].tolist()
        if len(prevro_zero) > 0:
            prevro.append(prevro_zero[-1])
        else:
            prevro.append(np.nan)
    else:
        prevro.append(np.nan)
df['Prev_Value'] = list(reversed(prevro))
print(df)
Output:
Value Marker Prev_Value
0 10 0 NaN
1 12 0 NaN
2 50 1 12.0
3 42 1 12.0
4 52 0 NaN
5 23 1 52.0
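A vectorized sketch of the same idea, assuming the columns are named Value and Marker as above: keep Value only on the Marker == 0 rows, forward-fill it, and then expose it only on the Marker == 1 rows.
import pandas as pd
df = pd.DataFrame({'Value': [10, 12, 50, 42, 52, 23],
                   'Marker': [0, 0, 1, 1, 0, 1]})
# Value from the most recent Marker == 0 row, shown only where Marker == 1
prev = df['Value'].where(df['Marker'].eq(0)).ffill()
df['Prev_Value'] = prev.where(df['Marker'].eq(1))
print(df)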
df1 =
date column1 column2 column3 column4
1 123 124 125 126
2 23 24 25 26
3 42 43 44 45
df2 =
date c_xyz1 c_xyz2 c_xyz3
1 123 124 125
2 23 24 25
3 42 43 44
I need output like
df3 =
date column1 column2 column3 column4
1 123 124 125 126
2 23 24 25 26
3 42 43 44 45
date c_xyz1 c_xyz2 c_xyz3
1 123 124 125
2 23 24 25
3 42 43 44
I need to append df2 to df1, keeping both sets of column names.
df1 has 4 columns and df2 has 3 columns with different names, and I need to append them so the column names are preserved.
If it is possible while writing the xlsx file, that's fine too.
I'm pretty sure you can just
df3 = pandas.concat([df1,df2])
you might want
pandas.concat([df1,df2],axis=1)
import pandas
df1 = pandas.DataFrame({"A":[1,2,3],"B":[2,3,4]})
df2 = pandas.DataFrame({"C":[3,4,5],"D":[4,5,6]})
df3 = pandas.concat([df1,df2])
"""
A B C D
0 1.0 2.0 NaN NaN
1 2.0 3.0 NaN NaN
2 3.0 4.0 NaN NaN
0 NaN NaN 3.0 4.0
1 NaN NaN 4.0 5.0
2 NaN NaN 5.0 6.0
"""
df3_alt = pandas.concat([df1,df2],axis=1)
"""
A B C D
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
"""
You could combine them to make a CSV that has different columns in the middle, I guess:
with open("out.csv","wb") as f:
f.write("\n".join([df1.to_csv(),df2.to_csv()])
print("".join([df1.to_csv(), df2.to_csv()]))
"""
,A,B
0,1,2
1,2,3
2,3,4
,C,D
0,3,4
1,4,5
2,5,6
"""
Here is the Excel version:
with pandas.ExcelWriter("out.xlsx") as w:
    df1.to_excel(w)
    df2.to_excel(w, startrow=len(df1) + 1)
I write into xlsx using:
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
workbook = writer.book
worksheet = workbook.add_worksheet('Validation')
writer.sheets['Validation'] = worksheet
df.to_excel(writer, sheet_name='Validation', startrow=0, startcol=0)
another_df.to_excel(writer, sheet_name='Validation', startrow=20, startcol=0)
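One detail not shown in that snippet: the workbook is only written to disk once the writer is closed, so you would typically finish with something like the line below (or use the writer as a context manager, as in the previous answer).
writer.close()   # older examples call writer.save(); closing flushes the file to disk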
I have a df that looks like this
Day ID ID_2 AS D E AS1 D1 E1
29 72 Participant 1 PS 6 42 NaN NaN NaN NaN NaN NaN
35 78 Participant 1 NaN yes 3 no 2 no 2
49 22 Participant 2 PS 1 89 NaN NaN NaN NaN NaN NaN
85 18 Participant 2 NaN yes 3 no 2 no 2
I'm looking for a way to add the ID_2 column value to all rows where ID matches (i.e., for Participant 1, fill in the NaN values with the values from the other row where ID=Participant 1). I've looked into using combine but that doesn't seem to work for this particular case.
Expected output:
Day ID ID_2 AS D E AS1 D1 E1
29 72 Participant 1 PS 6 42 yes 3 no 2 no 2
35 78 Participant 1 PS 6 42 yes 3 no 2 no 2
49 22 Participant 2 PS 1 89 yes 3 no 2 no 2
85 18 Participant 2 PS 1 89 yes 3 no 2 no 2
or
Day ID ID_2 AS D E AS1 D1 E1
29 72 Participant 1 PS 6 42 NaN NaN NaN NaN NaN NaN
35 78 Participant 1 PS 6 42 yes 3 no 2 no 2
49 22 Participant 2 PS 1 89 NaN NaN NaN NaN NaN NaN
85 18 Participant 2 PS 1 89 yes 3 no 2 no 2
I think you could try
df.ID_2 = df.groupby('ID').ID_2.ffill()
# 29 PS 6 42
# 35 PS 6 42
# 49 PS 1 89
# 85 PS 1 89
Not tested, but something like this should work - can't copy your df into my browser.
print(df)
Day ID ID_2 AS D E AS1 D1 E1
0 72 Participant_1 PS_6_42 NaN NaN NaN NaN NaN NaN
1 78 Participant_1 NaN yes 3.0 no 2.0 no 2.0
2 22 Participant_2 PS_1_89 NaN NaN NaN NaN NaN NaN
3 18 Participant_2 NaN yes 3.0 no 2.0 no 2.0
df2 = df.set_index('ID').groupby('ID').transform('ffill').transform('bfill').reset_index()
print(df2)
ID Day ID_2 AS D E AS1 D1 E1
0 Participant_1 72 PS_6_42 yes 3 no 2 no 2
1 Participant_1 78 PS_6_42 yes 3 no 2 no 2
2 Participant_2 22 PS_1_89 yes 3 no 2 no 2
3 Participant_2 18 PS_1_89 yes 3 no 2 no 2
Just curious about the behavior of 'where' and why you would use it over 'loc'.
If I create a dataframe:
df = pd.DataFrame({'ID':[1,2,3,4,5,6,7,8,9,10],
'Run Distance':[234,35,77,787,243,5435,775,123,355,123],
'Goals':[12,23,56,7,8,0,4,2,1,34],
'Gender':['m','m','m','f','f','m','f','m','f','m']})
And then apply the 'where' function:
df2 = df.where(df['Goals']>10)
I get the following, which keeps the rows where Goals > 10 but sets everything else to NaN:
Gender Goals ID Run Distance
0 m 12.0 1.0 234.0
1 m 23.0 2.0 35.0
2 m 56.0 3.0 77.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 m 34.0 10.0 123.0
If however I use the 'loc' function:
df2 = df.loc[df['Goals']>10]
It returns the dataframe subsetted without the NaN values:
Gender Goals ID Run Distance
0 m 12 1 234
1 m 23 2 35
2 m 56 3 77
9 m 34 10 123
So essentially I am curious why you would use 'where' over 'loc/iloc' and why it returns NaN values?
Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from numpy. It runs over an array and checks if each element fits a condition. So it gives you back the entire array, with a result or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other='0'), to replace values that don't meet the condition with 0.
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 10 123 34 m
Also, while where is only for conditional filtering, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column names, while iloc uses their index number. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]:
Gender Goals
0 m 12
1 m 23
If you check the docs for DataFrame.where, it replaces rows by condition - with NaN by default, but it is possible to specify a value:
df2 = df.where(df['Goals']>10)
print (df2)
ID Run Distance Goals Gender
0 1.0 234.0 12.0 m
1 2.0 35.0 23.0 m
2 3.0 77.0 56.0 m
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 10.0 123.0 34.0 m
df2 = df.where(df['Goals']>10, 100)
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 100 100 100 100
4 100 100 100 100
5 100 100 100 100
6 100 100 100 100
7 100 100 100 100
8 100 100 100 100
9 10 123 34 m
Another syntax is called boolean indexing and is for filtering rows - it removes the rows that do not match the condition.
df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
9 10 123 34 m
With loc it is also possible to filter rows by condition and select columns by name(s):
s = df.loc[df['Goals']>10, 'ID']
print (s)
0 1
1 2
2 3
9 10
Name: ID, dtype: int64
df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
ID Gender
0 1 m
1 2 m
2 3 m
9 10 m
loc retrieves only the rows that match the condition.
where returns the whole dataframe, replacing the rows that don't match the condition (with NaN by default).
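A small related sketch, reusing the df defined above: DataFrame.mask is the inverse of where, so it NaNs the rows that do match the condition.
filtered = df.loc[df['Goals'] > 10]     # only the matching rows, original dtypes kept
masked = df.where(df['Goals'] > 10)     # same shape, non-matching rows become NaN
inverse = df.mask(df['Goals'] > 10)     # inverse of where: matching rows become NaN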