Concat two dataframes with different indices - python

I am trying to concatenate two dataframes. I've tried using merge(), join(), concat() in pandas, but none gave me my desired output.
df1:
Index  value
0      a
1      b
2      c
3      d
4      e
df2:
Index  value
1      f
2      g
3      h
4      i
5      j
desired output:
Index  col1  col2
0      a     f
1      b     g
2      c     h
3      d     i
4      e     j
Thanks in advance!

You can use pd.merge and specify a left join on the index, as follows:
import pandas as pd
df1 = pd.DataFrame(data={'value': list('ABCDE')})
df2 = pd.DataFrame(data={'value': list('FGHIJ')}, index=range(1, 6))
pd.merge(df1.rename(columns={'value': 'col1'}),
         df2.reset_index(drop=True).rename(columns={'value': 'col2'}),
         how='left', left_index=True, right_index=True)
-----------------------------------
col1 col2
0 A F
1 B G
2 C H
3 D I
4 E J
-----------------------------------

Does resetting the index of df2 work for your use case?
pd.concat([df1, df2.reset_index(drop=True)], axis=1) \
  .set_axis(['Col1', 'Col2'], axis=1)
Result
Col1 Col2
0 a f
1 b g
2 c h
3 d i
4 e j
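For completeness, the positional alignment can also be done by dropping down to NumPy, which bypasses index alignment entirely. A minimal sketch on the sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'value': list('abcde')})
df2 = pd.DataFrame({'value': list('fghij')}, index=range(1, 6))

# .to_numpy() strips df2's index, so the assignment is purely positional.
out = pd.DataFrame({'col1': df1['value'], 'col2': df2['value'].to_numpy()})
print(out)
```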

Related

How to get the rows based on unique column values of their first occurrence

I have a data frame like this:
df
col1 col2 col3
1 A B
1 D R
2 R P
2 D F
3 T G
1 R S
3 R S
I want to get the rows for the first 3 unique values of col1. If one of those col1 values appears again later in the df, it should be ignored.
The final data frame should look like:
df
col1 col2 col3
1 A B
1 D R
2 R P
2 D F
3 T G
What is the most efficient way to do this in pandas?
Create a helper series of consecutive group ids with Series.ne, Series.shift and Series.cumsum, and then filter by boolean indexing:
N = 3
df = df[df.col1.ne(df.col1.shift()).cumsum() <= N]
print (df)
col1 col2 col3
0 1 A B
1 1 D R
2 2 R P
3 2 D F
4 3 T G
Detail:
print (df.col1.ne(df.col1.shift()).cumsum())
0 1
1 1
2 2
3 2
4 3
5 4
6 5
Name: col1, dtype: int32
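For reference, the approach above can be run end to end; the sample frame below is reconstructed from the question:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2, 3, 1, 3],
                   'col2': list('ADRDTRR'),
                   'col3': list('BRPFGSS')})

# Each change in col1 starts a new consecutive group id.
group_id = df['col1'].ne(df['col1'].shift()).cumsum()
out = df[group_id <= 3]
print(out)
```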
Here is a solution which stops as soon as the first three different values have been found:
import pandas as pd
data="""
col1 col2 col3
1 A B
1 D R
2 R P
2 D F
3 T G
1 R S
3 R S
"""
import io

df = pd.read_csv(io.StringIO(data), sep=r'\s+')  # pd.compat.StringIO was removed; use io.StringIO
nbr = 3
dico={}
for index, row in df.iterrows():
    dico[row.col1] = True
    if len(dico.keys()) == nbr:
        df = df[0:index+1]
        break
print(df)
col1 col2 col3
0 1 A B
1 1 D R
2 2 R P
3 2 D F
4 3 T G
You can use the duplicated method in pandas:
mask1 = df.duplicated(keep="first")  # marks every repeat after its first occurrence
mask2 = df.duplicated(keep=False)    # marks all rows that occur more than once
mask = ~mask1 | ~mask2
df[mask]

map DataFrame index and forward fill nan values

I have a DataFrame whose integer index is missing some values (i.e. it is not equally spaced). I want to create a new DataFrame with equally spaced index values and forward-fill the column values. Below is a simple example:
have
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
0
0 A
2 B
4 C
want to use above and create:
0
0 A
1 A
2 B
3 B
4 C
Use reindex with method='ffill':
import numpy as np

df = df.reindex(np.arange(0, df.index.max() + 1), method='ffill')
Or:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print (df)
0
0 A
1 A
2 B
3 B
4 C
Using reindex and ffill:
df = df.reindex(range(df.index[0],df.index[-1]+1)).ffill()
print(df)
0
0 A
1 A
2 B
3 B
4 C
You can do this:
In [319]: df.reindex(list(range(df.index.min(),df.index.max()+1))).ffill()
Out[319]:
0
0 A
1 A
2 B
3 B
4 C
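The same reindex can also be written without NumPy, using pd.RangeIndex; a sketch on the sample frame:

```python
import pandas as pd

df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])

# RangeIndex covers every integer from the old index's min to its max.
out = df.reindex(pd.RangeIndex(df.index.min(), df.index.max() + 1), method='ffill')
print(out)
```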

Duplicating each row in a dataframe with counts

For each row in a dataframe, I wish to create duplicates of it with an additional column to identify each duplicate.
E.g Original dataframe is
A | A
B | B
I wish to make a duplicate of each row, with an additional column to identify it, resulting in:
A | A | 1
A | A | 2
B | B | 1
B | B | 2
You can use df.reindex followed by a groupby on df.index.
df = df.reindex(df.index.repeat(2))
df['count'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Similarly, using reindex and assign with np.tile:
import numpy as np

df = df.reindex(df.index.repeat(2))\
       .assign(count=np.tile(df.index, 2) + 1)\
       .reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Use Index.repeat with loc, for count groupby with cumcount:
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
print (df)
a b
0 A A
1 B B
df = df.loc[df.index.repeat(2)]
df['new'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Or:
df = df.loc[df.index.repeat(2)]
df['new'] = np.tile(range(int(len(df.index)/2)), 2) + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Setup
Borrowed from @jezrael
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
a b
0 A A
1 B B
Solution 1
Create a pd.MultiIndex with pd.MultiIndex.from_product
Then use pd.DataFrame.reindex
idx = pd.MultiIndex.from_product(
    [df.index, [1, 2]],
    names=[df.index.name, 'New']
)
df.reindex(idx, level=0).reset_index('New')
New a b
0 1 A A
0 2 A A
1 1 B B
1 2 B B
Solution 2
This uses the same loc and reindex concept used by @cᴏʟᴅsᴘᴇᴇᴅ and @jezrael, but simplifies the final answer by using list and int multiplication rather than np.tile.
df.loc[df.index.repeat(2)].assign(New=[1, 2] * len(df))
a b New
0 A A 1
0 A A 2
1 B B 1
1 B B 2
Use pd.concat() to repeat, and then groupby with cumcount() to count:
In [24]: df = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['A', 'B']})
In [25]: df
Out[25]:
col1 col2
0 A A
1 B B
In [26]: df_repeat = pd.concat([df]*3).sort_index()
In [27]: df_repeat
Out[27]:
col1 col2
0 A A
0 A A
0 A A
1 B B
1 B B
1 B B
In [28]: df_repeat["count"] = df_repeat.groupby(level=0).cumcount() + 1
In [29]: df_repeat # df_repeat.reset_index(drop=True); if index reset required.
Out[29]:
col1 col2 count
0 A A 1
0 A A 2
0 A A 3
1 B B 1
1 B B 2
1 B B 3
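All of the approaches above generalize to n copies per row; a parametrized sketch (the helper name is my own, not from the answers):

```python
import pandas as pd

def duplicate_with_counter(df, n):
    """Repeat each row n times and number the copies 1..n."""
    out = df.loc[df.index.repeat(n)]
    out = out.assign(count=out.groupby(level=0).cumcount() + 1)
    return out.reset_index(drop=True)

df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
result = duplicate_with_counter(df, 2)
print(result)
```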

Merging two dataframes, with different lengths, and repeating values

I have two dataframes with the same column 'Col A' that I want to merge on. However, in df2 the values in Col A are replicated a varying number of times. This replication is important to my problem and I cannot drop it. I want the final dataframe to look like df3, where the Col B values from df1 are repeated for each replication in Col A.
df1:
Col A  Col B
1      v
2      w
3      x
4      y

df2:
Col A  Col B
1      a
2      b
2      c
3      d
3      e
4      f
df3
Col A Col B Col C
1 a v
2 b w
2 c w
3 d x
3 e x
4 f y
Use merge:
df2.merge(df1, on='Col A')
Out:
Col A Col B_x Col B_y
0 1 a v
1 2 b w
2 2 c w
3 3 d x
4 3 e x
5 4 f y
And if necessary, rename afterwards:
df = df2.merge(df1, on='Col A')
df.columns = ['Col A', 'Col B', 'Col C']
For more info, see the pandas documentation on merging and joining.
I believe you need map with a Series created by set_index:
print (df1.set_index('Col A')['Col B'])
Col A
1 v
2 w
3 x
4 y
Name: Col B, dtype: object
df2['Col C'] = df2['Col A'].map(df1.set_index('Col A')['Col B'])
print (df2)
Col A Col B Col C
0 1 a v
1 2 b w
2 2 c w
3 3 d x
4 3 e x
5 4 f y
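If the _x/_y suffixes are a nuisance, renaming df1's column before the merge avoids the cleanup step entirely; a sketch on the sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'Col A': [1, 2, 3, 4], 'Col B': list('vwxy')})
df2 = pd.DataFrame({'Col A': [1, 2, 2, 3, 3, 4], 'Col B': list('abcdef')})

# Renaming up front means the merged frame already has the final column names.
df3 = df2.merge(df1.rename(columns={'Col B': 'Col C'}), on='Col A')
print(df3)
```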

keep the same factorizing between two data

We have two data sets with one variable, col1. Some levels are missing in the second data set. For example, let
import pandas as pd
df1 = pd.DataFrame({'col1':["A","A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
When we factorize df1
df1["f_col1"]= pd.factorize(df1.col1)[0]
df1
we get
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
But when we do it for df2
df2["f_col1"]= pd.factorize(df2.col1)[0]
df2
we get
col1 f_col1
0 A 0
1 B 1
2 D 2
3 E 3
This is not what I want. I want to keep the same factorization between data sets, i.e. in df2 we should have something like
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
Thanks.
PS: The two data sets are not always available at the same time, so I cannot concat them. The values should be stored from df1 and used for df2 when it becomes available.
You could concatenate the two DataFrames, then apply pd.factorize once to the entire column:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df = pd.concat({'df1':df1, 'df2':df2})
df['f_col1'], uniques = pd.factorize(df['col1'])
print(df)
yields
col1 f_col1
df1 0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
df2 0 A 0
1 B 1
2 D 3
3 E 4
To extract df1 and df2 from df you could use df.loc:
In [116]: df.loc['df1']
Out[116]:
col1 f_col1
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
In [117]: df.loc['df2']
Out[117]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
(But note that since performance of vectorized operations improve if you can apply them once to large DataFrames instead of multiple times to smaller DataFrames, you might be better off keeping df and ditching df1 and df2...)
Alternatively, if you must generate df1['f_col1'] first, and then compute
df2['f_col1'] later, you could use merge to join df1 and df2 on col1:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])
df2 = pd.merge(df2, df1, how='left')
print(df2)
yields
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
You could reuse the f_col1 column of df1 and map the values of df2.col1 by setting the index to df1.col1:
In [265]: df2.col1.map(df1.set_index('col1').f_col1)
Out[265]:
0 0
1 1
2 3
3 4
Details
In [266]: df2['f_col1'] = df2.col1.map(df1.set_index('col1').f_col1)
In [267]: df2
Out[267]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
In case df1 has multiple records per value, drop the duplicates using drop_duplicates:
In [290]: df1
Out[290]:
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
In [291]: df2.col1.map(df1.drop_duplicates().set_index('col1').f_col1)
Out[291]:
0 0
1 1
2 3
3 4
Name: col1, dtype: int32
You want to get the unique values across both sets of data, then create a series or a dictionary. This is your factorization, which can be used across both data sets. Use map to get the output you are looking for.
import numpy as np

u = np.unique(np.append(df1.col1.values, df2.col1.values))
f = pd.Series(range(len(u)), u)  # this is the factorization
Assign with map
df1['f_col1'] = df1.col1.map(f)
df2['f_col1'] = df2.col1.map(f)
print(df1)
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
print(df2)
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
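Since the question notes that the two frames are not available at the same time, note that pd.factorize also returns the array of uniques, which can be stored and looked up later with Index.get_indexer (a sketch; labels unseen in df1 would come back as -1):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ["A", "A", "B", "C", "D", "E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])

# Later, when df2 becomes available, reuse the stored uniques:
# get_indexer returns each label's position in the uniques array.
df2 = pd.DataFrame({'col1': ["A", "B", "D", "E"]})
df2['f_col1'] = pd.Index(uniques).get_indexer(df2['col1'])
print(df2)
```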
