how to concat 2 dataframes, keeping rows with same index - python

I have 2 dataframes that I want to concat while keeping the column order of df1:
df1:
index unnamed:0 unnamed:1 unnamed:393 unnamed:395
0 nan 1 ... 394 396
1 0 BB CC DD
df2:
index Service
220 ABC
222 ABB
394 CC
396 DD
....
the output should be like:
df3:
index
0 394 396
1 394 396
2 CC DD
3 CC DD
If I simply do df3 = pd.concat([df1, df2]) it just appends df2 as a whole below df1.
It is almost the same when using df3 = pd.concat([df1, df2], axis=1, ignore_index=True).
I think there is some issue with multi-indexing, but I don't know what to type.
Thanks

You can use df1.join(df2), which aligns the two frames on their index.
The documentation for DataFrame.join describes the available join types.
For more information about all the kinds of merging operations in pandas, you can also check the merging section of the user guide (https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html).
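A minimal sketch of that join, using a made-up df1 (the 'value' column is hypothetical; df2 follows the index/Service pairs shown in the question):

import pandas as pd

df1 = pd.DataFrame({'value': ['x', 'y']}, index=[394, 396])   # hypothetical stand-in for df1
df2 = pd.DataFrame({'Service': ['ABC', 'ABB', 'CC', 'DD']},
                   index=[220, 222, 394, 396])

df3 = df1.join(df2)   # aligns on the index, keeps only df1's rows
print(df3)            # rows 394 and 396 pick up 'CC' and 'DD'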

Related

Concatenating data from two files

There are 2 files opened with pandas. If there are common parts in the first column of the two files (the matching letter codes), I want to paste the data of the second column of the second file into the matched rows of the first file, and if there is no match, I want to write NaN. Is there a way to do this?
File1
0 1
0 JCW 574
1 MBM 4212
2 COP 7424
3 KVI 4242
4 ECX 424
File2
0 1
0 G=COP d4ssd5vwe2e2
1 G=DDD dfd23e1rv515j5o
2 G=FEW cwdsuve615cdldl
3 G=JCW io55i5i55j8rrrg5f3r
4 G=RRR c84sdw5e5vwldk455
5 G=ECX j4ut84mnh54t65y
File1#
0 1 2
0 JCW 574 io55i5i55j8rrrg5f3r
1 MBM 4212 NaN
2 COP 7424 d4ssd5vwe2e2
3 KVI 4242 NaN
4 ECX 424 j4ut84mnh54t65y
Use Series.str.extract to build a new Series of matched values from df2[0] (matching against the values in df1[0]), and then merge with a left join in DataFrame.merge:
import pandas as pd
import numpy as np

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
# extract the df1[0] value (if any) contained in each row of df2[0]
s = df2[0].str.extract(f'({"|".join(df1[0])})', expand=False)
# left-join df2's second column onto df1, matching df1[0] against the extracted values
df = df1.merge(df2[[1]], how='left', left_on=0, right_on=s)
df.columns = np.arange(len(df.columns))
print(df)
0 1 2
0 JCW 574 io55i5i55j8rrrg5f3r
1 MBM 4212 NaN
2 COP 7424 d4ssd5vwe2e2
3 KVI 4242 NaN
4 ECX 424 j4ut84mnh54t65y
Or, if you need to match only the last 3 characters of df1[0], use:
s = df2[0].str.extract(f'({"|".join(df1[0].str[-3:])})', expand=False)
df = df1.merge(df2[[1]], how='left', left_on=0, right_on=s)
df.columns = np.arange(len(df.columns))
print(df)
Have a look at the concat function of pandas using join='outer' (https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html). There is also this question and its answer that can help you.
It involves setting the column that is now called "0" as the index of each of your data frames, and then joining the two data frames based on their indices.
Also, may I suggest that you do not paste an image of your dataframes, but upload the data in a form that lets other people test their suggestions.
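As a rough sketch of that index-based approach, assuming df1 and df2 were read as in the answer above (and, as in its variant, using the last 3 characters of File2's first column as the matching key):

# key: 'G=COP' -> 'COP', 'G=JCW' -> 'JCW', etc.
code = df2[0].str[-3:]
# File2's second column, indexed by that code and renamed to avoid a column-name clash
lookup = df2.set_index(code)[[1]].rename(columns={1: 2})
# index File1 by its letter code and left-join; unmatched rows get NaN
joined = df1.set_index(0).join(lookup, how='left').reset_index()
print(joined)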

How to combine two datasets vertically in pandas?

I have a loop which generates dataframes with 2 columns each. When I try to append the dataframes vertically (stacking them on top of each other), the code instead adds the new dataframes horizontally when I use pd.concat within the loop. The result does not merge the columns (which have the same length) properly; instead, it adds 2 new columns for every loop iteration, creating a bunch of NaNs. How can I solve this?
df_master = pd.DataFrame()
columns = list(df_master)
data = []
for i in range(1, 3):
    # ... do something and return a df2 with 2 columns ...
    data.append(df2)
df_master = pd.concat(data, axis=1)
df_master.head()
How do I collapse the 2 new columns from every iteration into a single dataframe?
If you don't need to keep the column labels of the original dataframes, you can try renaming the column labels of each dataframe to the same labels (e.g. 0 and 1) before concat, for example:
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
Demo
df1
57 59
0 1 2
1 3 4
df2
138 140
0 11 12
1 13 14
data = [df1, df2]
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
df_master
0 1
0 1 2
1 3 4
2 11 12
3 13 14
I suppose the problem is that your columns have different names in each iteration, so you could easily solve it by calling df2.rename() and renaming them to the same names.
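For example, a minimal sketch of that renaming inside the loop (build_df2 is a hypothetical stand-in for whatever produces each 2-column dataframe):

import pandas as pd

data = []
for i in range(1, 3):
    df2 = build_df2(i)            # hypothetical: returns a dataframe with 2 columns
    df2.columns = ['a', 'b']      # give every chunk the same column names
    data.append(df2)

df_master = pd.concat(data, ignore_index=True)   # chunks now stack vertically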
It works for me if I change axis to 0 inside the concat command.
df_master = pd.concat(data, axis=0)
Pandas will fill empty cells with NaN in either scenario, as in the example below.
df1 = pd.DataFrame({'col1':[11,12,13], 'col2': [21,22,23], 'col3':[31,32,33]})
df2 = pd.DataFrame({'col1':[111,112,113, 114], 'col2': [121,122,123,124]})
merge / join / concatenate data frames [df1, df2] vertically - add rows
pd.concat([df1,df2], ignore_index=True)
# output
col1 col2 col3
0 11 21 31.0
1 12 22 32.0
2 13 23 33.0
3 111 121 NaN
4 112 122 NaN
5 113 123 NaN
6 114 124 NaN
merge / join / concatenate data frames horizontally (aligning by index)
pd.concat([df1,df2], axis=1)
# output
col1 col2 col3 col1 col2
0 11.0 21.0 31.0 111 121
1 12.0 22.0 32.0 112 122
2 13.0 23.0 33.0 113 123
3 NaN NaN NaN 114 124

Stacking columns one below other when the column names are same

I have a huge data set in a pandas data frame. It looks something like this
df = pd.DataFrame([[1,2,3,4],[31,14,13,11],[115,613,1313,1]], columns=['c1','c1','c2','c2'])
Here the first two columns have the same name, so they should be concatenated into a single column with the values one below another. The dataframe should look something like this:
df1 = pd.DataFrame([[1,3],[31,13],[115,1313],[2,4],[14,11],[613,1]], columns=['c1','c2'])
Note: my original dataframe has many columns, so I cannot use a simple concat call to stack them. I have also tried the stack function, apart from concat. What can I do?
Use groupby + cumcount to create a pd.MultiIndex, reassign the columns with the new pd.MultiIndex, and stack:
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])
df1 = df.copy()
# the second column level numbers the duplicates: ('c1', 0), ('c1', 1), ('c2', 0), ('c2', 1)
df1.columns = [df.columns, df.columns.to_series().groupby(level=0).cumcount()]
print(df1.stack().reset_index(drop=True))
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
Or with a bit of creativity, in one line
df.T.set_index(
    df.T.groupby([df.columns]).cumcount(),
    append=True
).unstack().T.reset_index(drop=True)
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
You could melt the dataframe, then count entries within each column to use as an index for the new dataframe, and then unstack it back like this:
import pandas as pd
df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])
df1 = (pd.melt(df, var_name='column')
         .assign(n=lambda x: x.groupby('column').cumcount())
         .set_index(['n', 'column'])
         .unstack())
df1.columns=df1.columns.get_level_values(1)
print(df1)
Which produces
column c1 c2
n
0 1 3
1 31 13
2 115 1313
3 2 4
4 14 11
5 613 1

groupby pandas dataframe and create another dataframe which represents the groupby results horizontally

I have a pandas dataframe by the name usabledata with columns ['marker','action','id']
usabledata = pd.DataFrame(columns=['marker','action','id'])
I ran the following commands on the usabledata dataframe:
counts = usabledata.groupby(['marker','action']).count()
counts = counts.drop(['marker','action'])
print(counts)
id
marker action
1 A 377
B 224
C 9881
D 149946
2 A 481
B 397
C 7468
D 147581
3 A 538
B 458
D 145916
Now, I want to create a pandas dataframe with the following format:
Marker A B C D
1 377 224 9881 149946
2 481 397 7468 147581
3 538 458 0 145916
Is it possible to do this using pandas dataframe in ipython notebook?
Also, is it possible to delete a column, for example column 'C', after obtaining this desired output?
One more question on the same problem: after obtaining the desired output, how can I add another column 'Fraction' which is just the ratio of columns 'A' and 'D'?
IIUC, you can call unstack with fillna on the grouped counts:
In [124]:
counts.unstack().fillna(0)
Out[124]:
action
marker A B C D
id
1 377 224 9881 149946
2 481 397 7468 147581
3 538 458 0 145916
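A hedged sketch for the other two parts of the question (dropping column 'C' and adding a 'Fraction' column), assuming counts is the grouped dataframe from above; the variable name wide is just illustrative:

# rows = marker, columns = action values
wide = counts['id'].unstack().fillna(0)
wide = wide.drop(columns='C')                # delete the 'C' column
wide['Fraction'] = wide['A'] / wide['D']     # ratio of column 'A' to column 'D'
print(wide)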

Appending two dataframes with same columns, different order

I have two pandas dataframes.
from pandas import DataFrame

noclickDF = DataFrame([[0, 123, 321], [0, 1543, 432]],
                      columns=['click', 'id', 'location'])
clickDF = DataFrame([[1, 123, 421], [1, 1543, 436]],
                    columns=['click', 'location', 'id'])
I simply want to join such that the final DF will look like:
click | id | location
0 123 321
0 1543 432
1 421 123
1 436 1543
As you can see, the column names of both original DFs are the same, but not in the same order. Also, there is no column to join on.
You could also use pd.concat:
In [36]: pd.concat([noclickDF, clickDF], ignore_index=True)
Out[36]:
click id location
0 0 123 321
1 0 1543 432
2 1 421 123
3 1 436 1543
Under the hood, DataFrame.append calls pd.concat.
DataFrame.append has code for handling various types of input, such as Series, tuples, lists and dicts. If you pass it a DataFrame, it passes straight through to pd.concat, so using pd.concat is a bit more direct. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so in current pandas pd.concat is the way to go.)
For future users (on pandas > 0.23.0):
You may also need to add sort=True to sort the non-concatenation axis when it is not already aligned (i.e. to retain the OP's desired concatenation behavior). I used the code contributed above and got a warning (see the question "Python Pandas User Warning"). The code below works and does not throw the warning.
In [36]: pd.concat([noclickDF, clickDF], ignore_index=True, sort=True)
Out[36]:
click id location
0 0 123 321
1 0 1543 432
2 1 421 123
3 1 436 1543
You can use append for that:
df = noclickDF.append(clickDF)
print(df)
click id location
0 0 123 321
1 0 1543 432
0 1 421 123
1 1 436 1543
and if you need to, you can reset the index with
df = df.reset_index(drop=True)
print(df)
click id location
0 0 123 321
1 0 1543 432
2 1 421 123
3 1 436 1543
