I have two pandas data frames df1 and df2
**df1**
cat  id  frequency
A    23          2
A    43          8
B    23         56
C    30          4

**df2**
id  (other cols)    A    B    C
23  ............  NaN  NaN  NaN
43  ............  NaN  NaN  NaN
30  ............  NaN  NaN  NaN
I am looking for a way to extract information from df1 into df2, resulting in the format below, where the values of columns A, B and C are the frequency values from df1.
**df2**
id  (other cols)   A   B   C
30  ............   0   0   4
23  ............   2  56   0
43  ............   8   0   0
Use DataFrame.pivot with DataFrame.combine_first. The pivot needs id as the index and cat as the columns so that the result aligns with df2 (chain .fillna(0) on A, B and C if you want zeros instead of NaN for missing cat/id pairs):

df11 = df1.pivot(index='id', columns='cat', values='frequency')

# if id is a column in df2
df = df2.set_index('id').combine_first(df11)

# if id is the index of df2
df = df2.combine_first(df11)
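A minimal runnable sketch of the whole flow, with the sample data rebuilt from the question (the extra column name other is made up for illustration):

import pandas as pd

df1 = pd.DataFrame({'cat': ['A', 'A', 'B', 'C'],
                    'id': [23, 43, 23, 30],
                    'frequency': [2, 8, 56, 4]})
df2 = pd.DataFrame({'id': [30, 23, 43],
                    'other': ['x', 'y', 'z']})  # stand-in for the other cols

# one row per id, one column per cat, frequencies as values
df11 = df1.pivot(index='id', columns='cat', values='frequency')

# fill df2's A/B/C from the pivoted table; 0 where a cat/id pair is absent
df = df2.set_index('id').combine_first(df11).fillna(0)
print(df)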
Issue:
groupby on the data below resulted in an empty DataFrame; not sure how to fix it. Please help, thanks.
data:
business_group business_unit cost_center GL_code profit_center count
NaN a 12 12-09 09 1
NaN a 12 12-09 09 1
NaN b 23 23-87 87 1
NaN b 23 23-87 87 1
NaN b 34 34-76 76 1
groupby:
group_df = df.groupby(['business_group', 'business_unit','cost_center','GL_code','profit_center'],
as_index=False).count()
expected result:
business_group business_unit cost_center GL_code profit_center count
NaN a 12 12-09 09 2
NaN b 23 23-87 87 2
NaN b 34 34-76 76 1
result received:
Empty DataFrame
Columns: [business_group, business_unit, cost_center, GL_code, profit_center, count]
Index: []
That's because the NaN values in business_group are nulls, and groupby() drops NaN group keys by default. You can pass dropna=False into groupby() (the parameter was added in pandas 1.1):
group_df = df.groupby(['business_group', 'business_unit', 'cost_center',
                       'GL_code', 'profit_center'],
                      dropna=False, as_index=False).count()
Output:
business_group business_unit cost_center GL_code profit_center count
0 NaN a 12 12-09 9 2
1 NaN b 23 23-87 87 2
2 NaN b 34 34-76 76 1
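A minimal reproduction, rebuilding the frame from the question (column dtypes are assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'business_group': [np.nan] * 5,
    'business_unit': ['a', 'a', 'b', 'b', 'b'],
    'cost_center': [12, 12, 23, 23, 34],
    'GL_code': ['12-09', '12-09', '23-87', '23-87', '34-76'],
    'profit_center': [9, 9, 87, 87, 76],
    'count': [1] * 5,
})

# Without dropna=False every group key containing NaN is dropped,
# which here discards all rows and yields an empty frame.
group_df = df.groupby(['business_group', 'business_unit', 'cost_center',
                       'GL_code', 'profit_center'],
                      dropna=False, as_index=False).count()
print(group_df)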
df1 =
date column1 column2 column3 column4
1 123 124 125 126
2 23 24 25 26
3 42 43 44 45
df2 =
date c_xyz1 c_xyz2 c_xyz3
1 123 124 125
2 23 24 25
3 42 43 44
I need output like:
df3 =
date column1 column2 column3 column4
1 123 124 125 126
2 23 24 25 26
3 42 43 44 45
date c_xyz1 c_xyz2 c_xyz3
1 123 124 125
2 23 24 25
3 42 43 44
I need to append df2 to df1, keeping each frame's own column names: df1 has 4 columns and df2 has 3 columns with different names. If it is possible while writing the xlsx file, that's fine too.
I'm pretty sure you can just

df3 = pandas.concat([df1, df2])

or, if you want the frames side by side, you might want

pandas.concat([df1, df2], axis=1)
import pandas
df1 = pandas.DataFrame({"A":[1,2,3],"B":[2,3,4]})
df2 = pandas.DataFrame({"C":[3,4,5],"D":[4,5,6]})
df3 = pandas.concat([df1,df2])
"""
A B C D
0 1.0 2.0 NaN NaN
1 2.0 3.0 NaN NaN
2 3.0 4.0 NaN NaN
0 NaN NaN 3.0 4.0
1 NaN NaN 4.0 5.0
2 NaN NaN 5.0 6.0
"""
df3_alt = pandas.concat([df1,df2],axis=1)
"""
A B C D
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
"""
You could combine them to make a CSV that has different columns in the middle, I guess:
with open("out.csv", "w") as f:
    f.write("\n".join([df1.to_csv(), df2.to_csv()]))
print("".join([df1.to_csv(), df2.to_csv()]))
"""
,A,B
0,1,2
1,2,3
2,3,4
,C,D
0,3,4
1,4,5
"""
Here is the Excel version:

with pandas.ExcelWriter("out.xlsx") as w:
    df1.to_excel(w)
    df2.to_excel(w, startrow=len(df1) + 1)
I write into xlsx using:

writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
workbook = writer.book
worksheet = workbook.add_worksheet('Validation')
writer.sheets['Validation'] = worksheet
df.to_excel(writer, sheet_name='Validation', startrow=0, startcol=0)
another_df.to_excel(writer, sheet_name='Validation', startrow=20, startcol=0)
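For reference, a minimal sketch of the same layout that skips the manual add_worksheet bookkeeping; recent pandas versions let repeated to_excel calls target the same sheet within one ExcelWriter (the sheet name 'Validation' and startrow=20 are kept from the snippet above, the frames are made up):

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
another_df = pd.DataFrame({"C": [3, 4, 5], "D": [4, 5, 6]})

# Two frames stacked on one sheet; the second starts at row 20.
with pd.ExcelWriter("test.xlsx", engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="Validation", startrow=0, startcol=0)
    another_df.to_excel(writer, sheet_name="Validation", startrow=20, startcol=0)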
I have dataframe as follows:
2017 2018
A B C A B C
0 12 NaN NaN 98 NaN NaN
1 NaN 23 NaN NaN 65 NaN
2 NaN NaN 45 NaN NaN 43
I want to convert this data frame into:
2017 2018
A B C A B C
0 12 23 45 98 65 43
First back-fill the missing values, then select the first row with a double [] to get a one-row DataFrame:
df = df.bfill().iloc[[0]]
#alternative
#df = df.ffill().iloc[[-1]]
print(df)
2017 2018
A B C A B C
0 12.0 23.0 45.0 98.0 65.0 43.0
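A self-contained sketch, with the MultiIndex frame rebuilt from the question:

import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([['2017', '2018'], ['A', 'B', 'C']])
df = pd.DataFrame([[12, np.nan, np.nan, 98, np.nan, np.nan],
                   [np.nan, 23, np.nan, np.nan, 65, np.nan],
                   [np.nan, np.nan, 45, np.nan, np.nan, 43]],
                  columns=cols)

# bfill pulls each column's single value up to row 0;
# iloc[[0]] keeps that row as a one-row DataFrame
out = df.bfill().iloc[[0]]
print(out)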
One could sum along the columns:
import pandas as pd
import numpy as np
# Create DataFrame:
tmp = np.hstack((np.diag([12., 23., 45.]), np.diag([98., 65., 43.])))
tmp[tmp == 0] = np.nan
df = pd.DataFrame(tmp)
# Sum:
df2 = pd.DataFrame(df.sum(axis=0)).T
Resulting in:
      0     1     2     3     4     5
0  12.0  23.0  45.0  98.0  65.0  43.0
This is convenient because DataFrame.sum ignores NaN by default. A couple of notes:
- The column names are lost here because the frame was rebuilt from a bare NumPy array; summing a frame that keeps its labels preserves them.
- All-NaN columns will return 0 in the result.
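For comparison, a sketch of the sum approach run on a frame that carries the question's MultiIndex columns, where the labels survive (data rebuilt from the question):

import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([['2017', '2018'], ['A', 'B', 'C']])
df = pd.DataFrame([[12, np.nan, np.nan, 98, np.nan, np.nan],
                   [np.nan, 23, np.nan, np.nan, 65, np.nan],
                   [np.nan, np.nan, 45, np.nan, np.nan, 43]],
                  columns=cols)

# sum skips NaN by default, leaving one value per column;
# to_frame().T turns the Series back into a one-row DataFrame
out = df.sum(axis=0).to_frame().T
print(out)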
I have the following DataFrame
VOTES CITY
24 A
22 A
20 B
NaN A
NaN A
30 B
NaN C
I need to fill the NaN with the mean of the values where CITY is 'A' or 'C'.
The following code I tried only updated the first row in VOTES; the rest were all set to NaN.
train['VOTES'][((train['VOTES'].isna()) & (train['CITY'].isin(['A','C'])))]=train['VOTES'].loc[((~train['VOTES'].isna()) & (train['CITY'].isin(['A','C'])))].astype(int).mean(axis=0)
After this, all values of 'VOTES' are NaN except the one record at index 0. The mean is calculated correctly, though.
Use Series.fillna only on the filtered rows, with the mean of those filtered rows:
train['VOTES_EN'] = train['VOTES'].astype(str).str.extract(r'(-?\d+\.?\d*)', expand=False).astype(float)
m = train['CITY'].isin(['A', 'C'])
mean = train.loc[m, 'VOTES_EN'].mean()
train.loc[m, 'VOTES_EN'] = train.loc[m, 'VOTES_EN'].fillna(mean)
train['VOTES_EN'] = train['VOTES_EN'].astype(int)
print(train)
VOTES CITY VOTES_EN
0 24.0 A 24
1 22.0 A 22
2 20.0 B 20
3 NaN A 23
4 NaN A 23
5 30.0 B 30
6 NaN C 23
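A minimal reproduction of the fill itself, assuming VOTES is already numeric so the regex-extraction step above can be skipped:

import numpy as np
import pandas as pd

train = pd.DataFrame({'VOTES': [24, 22, 20, np.nan, np.nan, 30, np.nan],
                      'CITY': list('AABAABC')})

m = train['CITY'].isin(['A', 'C'])   # rows where CITY is A or C
mean = train.loc[m, 'VOTES'].mean()  # mean over those rows only: 23.0
train.loc[m, 'VOTES'] = train.loc[m, 'VOTES'].fillna(mean)
print(train)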
I've cleaned my data as much as I can and read it into a pandas DataFrame. The problem is that different files have different numbers of columns, but what I want is always the second-to-last non-NaN column. Is there any way to pick it out? Here is an example of the data.
... f g h l
0 ... 39994 29.568 29.569 NaN
1 ... 39994 29.568 29.569 NaN
2 ... 39994 29.568 29.569 NaN
So I want column g in this case. In other files it could be f or anything else, depending on the number of all-NaN columns at the end, but it's always the second-to-last non-NaN column that I need. Thanks in advance for the help.
Similar idea to @piRSquared's. Essentially, use loc to keep the non-null columns, then use iloc to select the second-to-last.
df.loc[:, ~df.isnull().all()].iloc[:, -2]
Sample input:
a b c d e f g h i j
0 0 3 6 9 12 15 18 21 NaN NaN
1 1 4 7 10 13 16 19 22 NaN NaN
2 2 5 8 11 14 17 20 23 NaN NaN
Sample output:
0 18
1 19
2 20
Name: g, dtype: int32
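A self-contained version with the sample input rebuilt (column letters a-j as above):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(3, 8, order='F'),
                  columns=list('abcdefgh'))
df['i'] = np.nan  # trailing all-NaN columns as in the sample
df['j'] = np.nan

# drop all-NaN columns, then take the second-to-last remaining column
result = df.loc[:, ~df.isnull().all()].iloc[:, -2]
print(result)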
One-liner
df.loc[:, :df.columns[(df.columns == df.isnull().all().idxmax()).argmax() - 2]]
... f g
0 ... 39994 29.568
1 ... 39994 29.568
2 ... 39994 29.568
Readable
# identify null columns
nullcols = df.isnull().all()
# find the column heading for the first null column
nullcol = nullcols.idxmax()
# position of the first null column
nullcol_position = (df.columns == nullcol).argmax()
# get column 2 positions prior
col_2_prior_to_null_col = df.columns[nullcol_position - 2]
# get dataframe
print(df.loc[:, :col_2_prior_to_null_col])
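And a runnable version of the readable steps, on the same assumed sample data; note that idxmax presumes at least one all-NaN column exists:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(3, 8, order='F'),
                  columns=list('abcdefgh'))
df['i'] = np.nan
df['j'] = np.nan

nullcols = df.isnull().all()                         # mask of all-NaN columns
nullcol = nullcols.idxmax()                          # first all-NaN column: 'i'
nullcol_position = (df.columns == nullcol).argmax()  # its position: 8
col_2_prior_to_null_col = df.columns[nullcol_position - 2]  # 'g'

print(df.loc[:, :col_2_prior_to_null_col])  # slice up to and including 'g'
print(df[col_2_prior_to_null_col])          # or just the single column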