How do I extract column values from one dataframe to another? - python

I have two pandas data frames df1 and df2
**df1**
cat  id  frequency
A    23  2
A    43  8
B    23  56
C    30  4

**df2**
id  (other cols)  A    B    C
23  .............  NaN  NaN  NaN
43  .............  NaN  NaN  NaN
30  .............  NaN  NaN  NaN
I am looking for a way to extract information from df1 into df2, resulting in the format below, where the values of columns A, B and C are the frequency values from df1:
**df2**
id (other cols) A B C
30 .......... 0 0 4
23 .......... 2 56 0
43 .......... 8 0 0

Use DataFrame.pivot with DataFrame.combine_first:
df11 = df1.pivot(index='id', columns='cat', values='frequency')
#if 'id' is a column in df2
df = df2.set_index('id').combine_first(df11)
#if 'id' is already the index of df2
df = df2.combine_first(df11)
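A minimal, self-contained sketch of this approach, built from the question's data; the trailing fillna(0) is an addition to match the zeros in the expected output (combine_first alone would leave NaN where df1 has no frequency), and the 'other' column is a placeholder for the other cols:
import pandas as pd
import numpy as np

# df1 and df2 reconstructed from the question
df1 = pd.DataFrame({'cat': ['A', 'A', 'B', 'C'],
                    'id': [23, 43, 23, 30],
                    'frequency': [2, 8, 56, 4]})
df2 = pd.DataFrame({'id': [23, 43, 30],
                    'other': ['x', 'y', 'z'],
                    'A': np.nan, 'B': np.nan, 'C': np.nan})

# reshape df1: one row per id, one column per cat
df11 = df1.pivot(index='id', columns='cat', values='frequency')

# fill df2's empty A/B/C columns from df11, then 0 where df1 had no entry
out = df2.set_index('id').combine_first(df11).fillna(0)
print(out)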

Related

groupby results in empty dataframe

Issue:
groupby on the data below results in an empty dataframe; not sure how to fix it, please help, thanks.
data:
business_group business_unit cost_center GL_code profit_center count
NaN a 12 12-09 09 1
NaN a 12 12-09 09 1
NaN b 23 23-87 87 1
NaN b 23 23-87 87 1
NaN b 34 34-76 76 1
groupby:
group_df = df.groupby(['business_group', 'business_unit','cost_center','GL_code','profit_center'],
as_index=False).count()
expected result:
business_group business_unit cost_center GL_code profit_center count
NaN a 12 12-09 09 2
NaN b 23 23-87 87 2
NaN b 34 34-76 76 1
result received:
Empty DataFrame
Columns: [business_group, business_unit, cost_center, GL_code, profit_center, count]
Index: []
That's because the NaN in business_group is a null value, and groupby() drops NaN group keys by default. You can pass dropna=False to groupby():
group_df = df.groupby(['business_group', 'business_unit', 'cost_center',
                       'GL_code', 'profit_center'],
                      dropna=False, as_index=False).count()
Output:
business_group business_unit cost_center GL_code profit_center count
0 NaN a 12 12-09 9 2
1 NaN b 23 23-87 87 2
2 NaN b 34 34-76 76 1
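For reference, a minimal reproduction of the fix (the dropna parameter of groupby requires pandas 1.1 or later); the frame below is reconstructed from the question:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'business_group': [np.nan] * 5,
    'business_unit': ['a', 'a', 'b', 'b', 'b'],
    'cost_center': [12, 12, 23, 23, 34],
    'GL_code': ['12-09', '12-09', '23-87', '23-87', '34-76'],
    'profit_center': ['09', '09', '87', '87', '76'],
    'count': [1, 1, 1, 1, 1],
})

# without dropna=False the all-NaN business_group key is dropped
# and the result is empty; with it the NaN group is kept
group_df = df.groupby(['business_group', 'business_unit', 'cost_center',
                       'GL_code', 'profit_center'],
                      dropna=False, as_index=False).count()
print(group_df)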

How to append two data frames in pandas, keeping both data frames' column names

df1 =
date column1 column2 column3 column4
1 123 124 125 126
2 23 24 25 26
3 42 43 44 45
df2 =
date c_xyz1 c_xyz2 c_xyz3
1 123 124 125
2 23 24 25
3 42 43 44
I need output like:
df3 =
date column1 column2 column3 column4
1 123 124 125 126
2 23 24 25 26
3 42 43 44 45
date c_xyz1 c_xyz2 c_xyz3
1 123 124 125
2 23 24 25
3 42 43 44
I need to append df2 below df1, keeping each frame's own column names. df1 has 4 columns and df2 has 3 columns with different names, and I need both header rows in the result. If this is possible while writing the xlsx file, that is fine too.
I'm pretty sure you can just do
df3 = pandas.concat([df1, df2])
though you might want
pandas.concat([df1, df2], axis=1)
import pandas
df1 = pandas.DataFrame({"A":[1,2,3],"B":[2,3,4]})
df2 = pandas.DataFrame({"C":[3,4,5],"D":[4,5,6]})
df3 = pandas.concat([df1,df2])
"""
A B C D
0 1.0 2.0 NaN NaN
1 2.0 3.0 NaN NaN
2 3.0 4.0 NaN NaN
0 NaN NaN 3.0 4.0
1 NaN NaN 4.0 5.0
2 NaN NaN 5.0 6.0
"""
df3_alt = pandas.concat([df1,df2],axis=1)
"""
A B C D
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
"""
You could also combine the two CSV dumps into one file that has a second set of column headers in the middle:
with open("out.csv", "w") as f:
    f.write("\n".join([df1.to_csv(), df2.to_csv()]))
print("".join([df1.to_csv(), df2.to_csv()]))
"""
,A,B
0,1,2
1,2,3
2,3,4
,C,D
0,3,4
1,4,5
2,5,6
"""
here is the excel version...
with pandas.ExcelWriter("out.xlsx") as w:
    df1.to_excel(w)
    df2.to_excel(w, startrow=len(df1)+1)
I write into xlsx using:
writer = pd.ExcelWriter('test.xlsx',engine='xlsxwriter')
workbook=writer.book
worksheet=workbook.add_worksheet('Validation')
writer.sheets['Validation'] = worksheet
df.to_excel(writer,sheet_name='Validation',startrow=0 , startcol=0)
another_df.to_excel(writer,sheet_name='Validation',startrow=20, startcol=0)

Form single Row from all rows with corresponding values in pandas

I have dataframe as follows:
2017 2018
A B C A B C
0 12 NaN NaN 98 NaN NaN
1 NaN 23 NaN NaN 65 NaN
2 NaN NaN 45 NaN NaN 43
I want to convert this data frame into:
2017 2018
A B C A B C
0 12 23 45 98 65 43
First back-fill the missing values, then select the first row with a double [] to get a one-row DataFrame:
df = df.bfill().iloc[[0]]
#alternative
#df = df.ffill().iloc[[-1]]
print(df)
2017 2018
A B C A B C
0 12.0 23.0 45.0 98.0 65.0 43.0
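A self-contained version of the same idea, assuming the year/letter headers form a two-level column MultiIndex (reconstructed from the question):
import pandas as pd
import numpy as np

cols = pd.MultiIndex.from_product([['2017', '2018'], ['A', 'B', 'C']])
data = [[12, np.nan, np.nan, 98, np.nan, np.nan],
        [np.nan, 23, np.nan, np.nan, 65, np.nan],
        [np.nan, np.nan, 45, np.nan, np.nan, 43]]
df = pd.DataFrame(data, columns=cols)

# back-fill so row 0 holds the first non-missing value of every column,
# then keep that single row as a DataFrame
out = df.bfill().iloc[[0]]
print(out)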
One could sum along the columns:
import pandas as pd
import numpy as np
# Create DataFrame:
tmp = np.hstack((np.diag([12., 23., 45.]), np.diag([98., 65., 43.])))
tmp[tmp == 0] = np.nan
df = pd.DataFrame(tmp)
# Sum:
df2 = pd.DataFrame(df.sum(axis=0)).T
Resulting in:
      0     1     2     3     4     5
0  12.0  23.0  45.0  98.0  65.0  43.0
This is convenient because DataFrame.sum ignores NaN by default. A couple of notes:
One loses the column names in this approach.
All-NaN columns will return 0 in the result.

Update column with NaN with mean of filtered rows

I have the following DataFrame
VOTES CITY
24 A
22 A
20 B
NaN A
NaN A
30 B
NaN C
I need to fill the NaN with the mean of the values where CITY is 'A' or 'C'.
The following code I tried only updates the first row in VOTES; all the remaining rows were set to NaN.
train['VOTES'][((train['VOTES'].isna()) & (train['CITY'].isin(['A','C'])))]=train['VOTES'].loc[((~train['VOTES'].isna()) & (train['CITY'].isin(['A','C'])))].astype(int).mean(axis=0)
After this, all values of 'VOTES' are NaN except the one record at index 0. The mean is calculated correctly, though.
Use Series.fillna only on the filtered rows, with the mean of those filtered rows:
# extract the numeric part of VOTES (stored as text)
train['VOTES_EN'] = train['VOTES'].astype(str).str.extract(r'(-?\d+\.?\d*)').astype(float)
# mask of rows whose CITY is A or C
m = train['CITY'].isin(['A','C'])
mean = train.loc[m,'VOTES_EN'].mean()
# fill NaN only within the masked rows
train.loc[m,'VOTES_EN'] = train.loc[m,'VOTES_EN'].fillna(mean)
train['VOTES_EN'] = train['VOTES_EN'].astype(int)
print(train)
VOTES CITY VOTES_EN
0 24.0 A 24
1 22.0 A 22
2 20.0 B 20
3 NaN A 23
4 NaN A 23
5 30.0 B 30
6 NaN C 23
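If VOTES is already numeric, the str.extract step can be skipped; a hedged sketch of the same masking idea, filling VOTES in place rather than creating a VOTES_EN column (data reconstructed from the question):
import pandas as pd
import numpy as np

train = pd.DataFrame({'VOTES': [24, 22, 20, np.nan, np.nan, 30, np.nan],
                      'CITY': ['A', 'A', 'B', 'A', 'A', 'B', 'C']})

# rows in cities A or C; their non-missing mean is (24 + 22) / 2 = 23
m = train['CITY'].isin(['A', 'C'])
mean = train.loc[m, 'VOTES'].mean()

# fill NaN only within the masked rows, leaving other cities untouched
train.loc[m, 'VOTES'] = train.loc[m, 'VOTES'].fillna(mean)
print(train)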

Pandas select the second to last column which is also not nan

I've cleaned my data as much as I can and read it into a pandas DataFrame. The problem is that different files have different numbers of columns, but what I want is always the second-to-last non-NaN column. Is there any way to pick it out? Here is an example of the data.
... f g h l
0 ... 39994 29.568 29.569 NaN
1 ... 39994 29.568 29.569 NaN
2 ... 39994 29.568 29.569 NaN
So I want column g in this case. In other files it could be f or anything else, depending on how many NaN columns there are at the end, but it is always the second-to-last non-NaN column that I need. Thanks for the help in advance.
Similar idea to @piRSquared's. Essentially, use loc to keep the non-null columns, then use iloc to select the second to last.
df.loc[:, ~df.isnull().all()].iloc[:, -2]
Sample input:
a b c d e f g h i j
0 0 3 6 9 12 15 18 21 NaN NaN
1 1 4 7 10 13 16 19 22 NaN NaN
2 2 5 8 11 14 17 20 23 NaN NaN
Sample output:
0 18
1 19
2 20
Name: g, dtype: int32
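A runnable sketch of this, with the question's example frame reconstructed (the leading "..." columns are omitted):
import pandas as pd
import numpy as np

df = pd.DataFrame({'f': [39994, 39994, 39994],
                   'g': [29.568, 29.568, 29.568],
                   'h': [29.569, 29.569, 29.569],
                   'l': [np.nan, np.nan, np.nan]})

# drop columns that are entirely NaN, then take the second-to-last survivor
second_to_last = df.loc[:, ~df.isnull().all()].iloc[:, -2]
print(second_to_last)  # -> column g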
One liner
df.loc[:, :df.columns[(df.columns == df.isnull().all().idxmax()).argmax() - 2]]
... f g
0 ... 39994 29.568
1 ... 39994 29.568
2 ... 39994 29.568
Readable
# identify null columns
nullcols = df.isnull().all()
# find the column heading for the first null column
nullcol = nullcols.idxmax()
# position of that null column
nullcol_position = (df.columns == nullcol).argmax()
# get the column 2 positions prior
col_2_prior_to_null_col = df.columns[nullcol_position - 2]
# get dataframe
print(df.loc[:, :col_2_prior_to_null_col])
