I have a loop which generates dataframes with 2 columns each. When I try to append the dataframes vertically (stacking them), the code adds the new dataframes horizontally when I use pd.concat within a loop. The result does not merge the columns (which have the same length) properly. Instead, it adds 2 new columns for every loop iteration, creating a bunch of NaNs. How do I solve this?
import pandas as pd

df_master = pd.DataFrame()
columns = list(df_master)
data = []
for i in range(1, 3):
    # ... do something that returns a df2 with 2 columns
    data.append(df2)
df_master = pd.concat(data, axis=1)
df_master.head()
How do I stack the two new columns from every iteration into a single dataframe?
If you don't need to keep the column labels of the original dataframes, you can try renaming the column labels of each dataframe to the same ones (e.g. 0 and 1) before concat, for example:
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
Demo
df1
57 59
0 1 2
1 3 4
df2
138 140
0 11 12
1 13 14
data = [df1, df2]
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
df_master
0 1
0 1 2
1 3 4
2 11 12
3 13 14
I suppose the problem is that your columns have different names in each iteration, so you could easily solve it by calling df2.rename() and renaming them to the same names.
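For example, a minimal sketch of that idea (make_df2() is a hypothetical stand-in for whatever your loop body does):
import pandas as pd

data = []
for i in range(1, 3):
    df2 = make_df2(i)  # hypothetical: produces a 2-column dataframe
    # give every frame the same column labels so concat stacks them cleanly
    df2 = df2.rename(columns=dict(zip(df2.columns, ['a', 'b'])))
    data.append(df2)
df_master = pd.concat(data, ignore_index=True)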
It works for me if I change axis to 0 inside the concat command.
df_master = pd.concat(data, axis=0)
Pandas fills empty cells with NaNs in either scenario, as in the examples you see below.
df1 = pd.DataFrame({'col1':[11,12,13], 'col2': [21,22,23], 'col3':[31,32,33]})
df2 = pd.DataFrame({'col1':[111,112,113, 114], 'col2': [121,122,123,124]})
merge / join / concatenate data frames [df1, df2] vertically - add rows
pd.concat([df1,df2], ignore_index=True)
# output
col1 col2 col3
0 11 21 31.0
1 12 22 32.0
2 13 23 33.0
3 111 121 NaN
4 112 122 NaN
5 113 123 NaN
6 114 124 NaN
merge / join / concatenate data frames horizontally (aligning by index)
pd.concat([df1,df2], axis=1)
# output
col1 col2 col3 col1 col2
0 11.0 21.0 31.0 111 121
1 12.0 22.0 32.0 112 122
2 13.0 23.0 33.0 113 123
3 NaN NaN NaN 114 124
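Putting it together, the loop from the question can be rewritten so everything stacks vertically (a minimal sketch; build_df2() is a hypothetical placeholder for the loop body):
import pandas as pd

data = []
for i in range(1, 3):
    df2 = build_df2(i)  # hypothetical: returns a dataframe with 2 columns
    data.append(df2)
# concatenate once, after the loop, along axis=0 (rows)
df_master = pd.concat(data, axis=0, ignore_index=True)
df_master.head()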
Related
I have a pandas df that has columns that need to be summed with the previous column if they contain the word "extra". For example, here is my pandas df:
id laptops laptops extra battery cables monitor monitor extra
0 54 18 108 54 28 12
1 33 9 48 20 10 4
2 82 61 98 67 21 9
...
Is there a way in pandas to find columns that contain the word extra and sum them with the previous column? That would help clean up so much data.
Thank you
Remove extra text and aggregate sum for all columns:
df1 = (df.rename(columns=lambda x: x.replace(' extra', ''))
         .groupby(level=0, axis=1, sort=False)
         .sum())
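If your pandas version warns that groupby(axis=1) is deprecated, the same idea can be expressed by transposing (a sketch of the equivalent):
df1 = (df.rename(columns=lambda x: x.replace(' extra', ''))
         .T.groupby(level=0, sort=False).sum()  # group the duplicate labels on the transposed frame
         .T)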
Or filter the extra columns, strip ' extra' from their names and add them to the original columns, then finally remove the extra columns:
m = df.columns.str.endswith('extra')
df1 = (df.add(df.loc[:, m]
               .rename(columns=lambda x: x.replace(' extra', '')), fill_value=0)
         .loc[:, df.columns[~m]])  # select by label: alignment in add() sorts the columns
EDIT: To add each 'extra' column to the column immediately before it (matching by position, not by name), use:
m = df.columns.to_series().str.endswith('extra')
prev = m.shift(-1, fill_value=False)  # marks the column just before each 'extra' column
df.loc[:, prev] = df.loc[:, prev] + df.loc[:, m].to_numpy()
df = df.loc[:, ~m]
print (df)
   id  laptops  battery  cables  monitor
0   0       72      108      54       40
1   1       42       48      20       14
2   2      143       98      67       30
I have about 20 data frames, all of which have the same columns, and I would like to add their data into one empty data frame, but when I use my code
interested_freq
UPC CPC freq
0 136.0 B64G 2
1 136.0 H01L 1
2 136.0 H02S 1
3 244.0 B64G 1
4 244.0 H02S 1
5 257.0 B64G 1
6 257.0 H01L 1
7 312.0 B64G 1
8 312.0 H02S 1
list_of_lists = []
max_freq = df_interested_freq[df_interested_freq['freq'] == df_interested_freq['freq'].max()]
for row, cols in max_freq.iterrows():
    interested_freq = df_interested_freq[df_interested_freq['freq'] != 1]
    interested_freq
    list_of_lists.append(interested_freq)
list_of_lists
to append the first data frame, and then I changed the names in that code, hoping that it would append more data:
list_of_lists = []
for row, cols in max_freq.iterrows():
    interested_freq_1 = df_interested_freq_1[df_interested_freq_1['freq'] != 1]
    interested_freq_1
    list_of_lists.append(interested_freq_1)
list_of_lists
but the first data frame disappears and only the most recently appended data shows. Have I done something wrong?
One way to create a new DataFrame from an existing DataFrame is to use df.copy():
Here is the detailed documentation.
df.copy() is very relevant here because changing a subset of the data within the new dataframe can change the initial DataFrame, so you have a fair chance of losing your actual DataFrame; thus you need it.
Suppose the example DataFrame is df1:
>>> df1
col1 col2
1 11 12
2 21 22
Solution: you can use the df.copy() method as follows, which will carry the data along.
>>> df2 = df1.copy()
>>> df2
col1 col2
1 11 12
2 21 22
In case you need the new dataframe (df2) to be created like df1 but don't want the values to be carried across, you have the option to use the reindex_like() method.
>>> df2 = pd.DataFrame().reindex_like(df1)
# df2 = pd.DataFrame(data=np.nan,columns=df1.columns, index=df1.index)
>>> df2
col1 col2
1 NaN NaN
2 NaN NaN
Why do you use append here? It's not a list. Once you have the first dataframe (called df1, for example), try:
new_df = df1
new_df = pd.concat([new_df, df2])
You can do the same thing for all 20 dataframes.
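Rather than concatenating inside a loop, a common pattern is to collect the frames in a list and concatenate once at the end (a minimal sketch, assuming dfs holds all 20 frames):
import pandas as pd

dfs = [df1, df2]  # ... extend with the rest of your 20 frames
new_df = pd.concat(dfs, ignore_index=True)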
I'm quite new to pandas dataframes, and I'm experiencing some trouble joining two tables.
The first df has just 3 columns:
DF1:
item_id position document_id
336 1 10
337 2 10
338 3 10
1001 1 11
1002 2 11
1003 3 11
38 10 146
And the second has exactly the same two columns (and plenty of others):
DF2:
item_id document_id col1 col2 col3 ...
337 10 ... ... ...
1002 11 ... ... ...
1003 11 ... ... ...
What I need is to perform an operation which, in SQL, would look as follows:
DF1 join DF2 on
DF1.document_id = DF2.document_id
and
DF1.item_id = DF2.item_id
And, as a result, I want to see DF2, complemented with column 'position':
item_id document_id position col1 col2 col3 ...
What is a good way to do this using pandas?
I think you need merge with the default inner join, but it is necessary that there are no duplicated combinations of values in both columns:
print (df2)
item_id document_id col1 col2 col3
0 337 10 s 4 7
1 1002 11 d 5 8
2 1003 11 f 7 0
df = pd.merge(df1, df2, on=['document_id','item_id'])
print (df)
item_id position document_id col1 col2 col3
0 337 2 10 s 4 7
1 1002 2 11 d 5 8
2 1003 3 11 f 7 0
But if you need the position column in the third position:
df = pd.merge(df2, df1, on=['document_id','item_id'])
cols = df.columns.tolist()
df = df[cols[:2] + cols[-1:] + cols[2:-1]]
print (df)
item_id document_id position col1 col2 col3
0 337 10 2 s 4 7
1 1002 11 2 d 5 8
2 1003 11 3 f 7 0
If you're merging on all common columns as in the OP, you don't even need to pass on=, simply calling merge() will do the job.
merged_df = df1.merge(df2)
The reason is that under the hood, if on= is not passed, pd.Index.intersection is called on the columns to determine the common columns and merge on all of them.
A special thing about merging on common columns is that it doesn't matter which dataframe is on the right and which is on the left: the filtered rows are the same, because they are selected by looking up matching rows on the common columns. The only difference is where the columns are positioned; the columns of the right dataframe that are not in the left dataframe are added to the right of the left dataframe's columns. So unless the order of the columns matters (which can be fixed very easily using column selection or reindex()), it doesn't really matter which dataframe is on the right and which is on the left. In other words,
df12 = df1.merge(df2, on=['document_id','item_id']).sort_index(axis=1)
df21 = df2.merge(df1, on=['document_id','item_id']).sort_index(axis=1)
# df12 and df21 are the same.
df12.equals(df21) # True
This is not true if the columns to be merged on don't have the same name and you have to pass left_on= and right_on= (see #1 in this answer).
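For illustration, a small sketch with differently named key columns (the frames and names here are made up):
import pandas as pd

left = pd.DataFrame({'doc': [10, 11], 'pos': [1, 2]})
right = pd.DataFrame({'document_id': [10, 11], 'col1': ['s', 'd']})
merged = left.merge(right, left_on='doc', right_on='document_id')
# both key columns are kept, so drop the duplicate if you don't need it
merged = merged.drop(columns='document_id')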
I have a huge data set in a pandas data frame. It looks something like this
df = pd.DataFrame([[1,2,3,4],[31,14,13,11],[115,613,1313,1]], columns=['c1','c1','c2','c2'])
Here the first two columns have the same name, so they should be concatenated into a single column so that the values are one below another. So the dataframe should look something like this:
df1 = pd.DataFrame([[1,3],[31,13],[115,1313],[2,4],[14,11],[613,1]], columns=['c1','c2'])
Note: My original dataframe has many columns, so I cannot use a simple concat function to stack the columns. I also tried the stack function, apart from the concat function. What can I do?
Use groupby + cumcount to create a pd.MultiIndex. Reassign the columns with the new pd.MultiIndex and stack:
df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])
df1 = df.copy()
df1.columns = [df.columns, df.columns.to_series().groupby(level=0).cumcount()]
print(df1.stack().reset_index(drop=True))
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
Or with a bit of creativity, in one line
df.T.set_index(
    df.T.groupby([df.columns]).cumcount(),
    append=True
).unstack().T.reset_index(drop=True)
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
You could melt the dataframe, then count entries within each column to use as an index for the new dataframe, and then unstack it back, like this:
import pandas as pd
df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])
df1 = (pd.melt(df, var_name='column')
         .assign(n=lambda x: x.groupby('column').cumcount())
         .set_index(['n', 'column'])
         .unstack())
df1.columns = df1.columns.get_level_values(1)
print(df1)
Which produces
column c1 c2
n
0 1 3
1 31 13
2 115 1313
3 2 4
4 14 11
5 613 1
so I got a DataFrame by doing:
dfgrp=df.groupby(['CCS_Category_ICD9','Gender'])['f0_'].sum()
ndf=pd.DataFrame(dfgrp)
ndf
f0_
CCS_Category_ICD9 Gender
1 F 889
M 796
U 2
2 F 32637
M 33345
U 34
Where f0_ is the sum of the counts by Gender
All I really want is a simple one-level dataframe similar to this, which I got via
ndf=ndf.unstack(level=1)
ndf
f0_
Gender F M U
CCS_Category_ICD9
1 889.0 796.0 2.0
2 32637.0 33345.0 34.0
3 2546.0 1812.0 NaN
4 347284.0 213782.0 34.0
But what I want is:
CCS_Category_ICD9 F M U
1 889.0 796.0 2.0
2 32637.0 33345.0 34.0
3 2546.0 1812.0 NaN
4 347284.0 213782.0 34.0
I cannot figure out how to flatten or get rid of the levels associated with f0_ and Gender. All I need is the "M", "F", "U" column headings so I have a simple one-level dataframe. I have tried reset_index and set_index along with several other variations, with no luck...
In the end I want to have a simple crosstab with row and column totals (which my example does not show).
Well, I did (as suggested in one answer):
ndf = ndf.f0_.unstack()
ndf
Which gave me:
Gender F M U
CCS_Category_ICD9
1 889.0 796.0 2.0
2 32637.0 33345.0 34.0
3 2546.0 1812.0 NaN
4 347284.0 213782.0 34.0
Followed by:
nndf = ndf.reset_index()
nndf
Gender CCS_Category_ICD9 F M U
0 1 889.0 796.0 2.0
1 2 32637.0 33345.0 34.0
2 3 2546.0 1812.0 NaN
3 4 347284.0 213782.0 34.0
4 5 3493.0 7964.0 1.0
5 6 12295.0 9998.0 4.0
Which just about does it. But I cannot change the index name from Gender to something like Idx: no matter what I do, I get an extra row added with the new name, i.e. a row titled Idx just under Gender. Also, is there a more straightforward solution?
You can use
df.loc[:, 'f0_']
on the DataFrame resulting from .unstack(), i.e. select the first level of your MultiIndex columns, which only leaves the gender level, or alternatively
df.columns = df.columns.droplevel()
see MultiIndex.droplevel docs
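A small sketch of what droplevel() does here (the values are illustrative):
import pandas as pd

ndf = pd.DataFrame([[889, 796, 2]],
                   columns=pd.MultiIndex.from_product([['f0_'], ['F', 'M', 'U']]))
ndf.columns = ndf.columns.droplevel()  # drops the outer 'f0_' level
print(ndf.columns)  # Index(['F', 'M', 'U'], dtype='object')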
Because ndf is a pd.DataFrame, it has a column index. When you perform unstack(), it appends the last level of the row index to the column index. Since the columns already had f0_, you got a second level. To flatten the way you'd like, call unstack() on the column instead.
ndf = ndf.f0_.unstack()
The text Gender is the name of the column index. If you want to get rid of it, you have to overwrite the name attribute of that object.
ndf.columns.name = None
Use this right after the ndf.f0_.unstack()
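Putting the steps together (a sketch reusing the column names from the question):
ndf = (df.groupby(['CCS_Category_ICD9', 'Gender'])['f0_'].sum()
         .unstack()       # Gender levels become the columns
         .reset_index())  # CCS_Category_ICD9 becomes a regular column again
ndf.columns.name = None   # remove the leftover 'Gender' label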
Generally, use df.pivot when you want to use a column as the row index and another column as the column index. Use df.pivot_table when you need to aggregate values due to rows with duplicate (row, column) pairs.
In this case, instead of df.groupby(...)[...].sum().unstack() you could use
df.pivot_table:
import numpy as np
import pandas as pd
N = 100
df = pd.DataFrame({'CCS': np.random.choice([1, 2], size=N),
                   'Gender': np.random.choice(['F', 'M', 'U'], size=N),
                   'f0': np.random.randint(10, size=N)})
result = df.pivot_table(index='CCS', columns='Gender', values='f0', aggfunc='sum')
result.columns.name = None
result = result.reset_index()
yields
CCS F M U
0 1 89 104 90
1 2 66 65 65
Notice that after calling pivot_table(), the DataFrame result has named
index and column Indexes:
In [176]: result = df.pivot_table(index='CCS', columns='Gender', values='f0', aggfunc='sum'); result
Out[176]:
Gender F M U
CCS
1 89 104 90
2 66 65 65
The index is named CCS:
In [177]: result.index
Out[177]: Int64Index([1, 2], dtype='int64', name='CCS')
and the columns index is named Gender:
In [178]: result.columns
Out[178]: Index(['F', 'M', 'U'], dtype='object', name='Gender') # <-- notice the name='Gender'
To remove the name from an Index, assign None to the name attribute:
In [179]: result.columns.name = None
In [180]: result
Out[180]:
F M U
CCS
1 89 104 90
2 66 65 65
Though it's not needed here, to remove names from the levels of a MultiIndex,
assign a list of Nones to the names (plural) attribute:
result.columns.names = [None] * result.columns.nlevels
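For example (a small sketch with a two-level column index):
import pandas as pd

cols = pd.MultiIndex.from_product([['f0_'], ['F', 'M']], names=['src', 'Gender'])
result = pd.DataFrame([[1, 2]], columns=cols)
result.columns.names = [None] * result.columns.nlevels  # clears both level names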