Pandas: Key error in merge after creating index - python

I have large data frames to merge, and to make sure the merge takes place in a multiprocessing-friendly manner I decided to use indexes. But after creating the indexes I get a KeyError.
For example:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']})
(Pdb) df1
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
But second DataFrame:
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'C': ['C1', 'C2', 'C3', 'C4']})
(Pdb) df2
A C
0 A0 C1
1 A1 C2
2 A2 C3
3 A3 C4
Now I set indexes for both the data frames where column A is the index.
df1.set_index('A', inplace=True)
df2.set_index('A', inplace=True)
(Pdb) df1
B
A
A0 B0
A1 B1
A2 B2
A3 B3
(Pdb) df2
C
A
A0 C1
A1 C2
A2 C3
A3 C4
Now when I do the merge:
(Pdb) result = pd.merge(df1, df2, on='A')
*** KeyError: 'A'
But if I do this without creating an index, the merge takes place without a key error.
(Pdb) df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']})
(Pdb) df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'C': ['C1', 'C2', 'C3', 'C4']})
(Pdb) result = pd.merge(df1, df2, on='A')
(Pdb) result
A B C
0 A0 B0 C1
1 A1 B1 C2
2 A2 B2 C3
3 A3 B3 C4

If you merge on the index, you should both:
not specify the key in merge
use the left_index=True, right_index=True arguments to merge
Otherwise, you have to explicitly tell merge what your key is using on=

When you specify on='A' you're telling pandas that you want to use the column A to do the merge. When you make A the index, it's no longer a column in the frame.
You can either merge, and then make A the index, or make A the index and then specify left_index=True and right_index=True.
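Both fixes can be sketched side by side on the same toy frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'C': ['C1', 'C2', 'C3', 'C4']})

# Variant 1: merge first on the column, then set the index.
merged_then_indexed = pd.merge(df1, df2, on='A').set_index('A')

# Variant 2: set the index first, then merge on both indexes.
indexed_then_merged = pd.merge(df1.set_index('A'), df2.set_index('A'),
                               left_index=True, right_index=True)
```

Both variants produce the same frame, indexed by A with columns B and C.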


create list nest from each column of pandas dataframe in python

Here is a simple example of my dataframe:
A B C D
Date
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
so I want to create a nested list like [[A1,A2,A3,A4],[B1,B2,B3,B4],...]
I used a command like mylist = dataframe.values.tolist()
but it returns [[A1,B1,C1,D1],[A2,B2,C2,D2],...] instead
so is there a way to get the nested list I want?
# I use Python 3.8.5, and the pandas dataframe imports data from yfinance
Just transpose and then call values:
df.T.values.tolist()
[['A1', 'A2', 'A3', 'A4'],
['B1', 'B2', 'B3', 'B4'],
['C1', 'C2', 'C3', 'C4'],
['D1', 'D2', 'D3', 'D4']]
You could loop through the columns and append each one as a list:
li = []
for col in dataframe:
    li.append(dataframe[col].tolist())
Here is a possible solution for your problem:
l = []
for e in df.columns:
    column = df[e].values.tolist()
    l.append(column)
The idea is to take every column as a list, iterating over all the columns.
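A quick check of the transpose approach against a small frame shaped like the one in the question (the string values stand in for real yfinance data):

```python
import pandas as pd

df = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4'],
                   'B': ['B1', 'B2', 'B3', 'B4'],
                   'C': ['C1', 'C2', 'C3', 'C4'],
                   'D': ['D1', 'D2', 'D3', 'D4']})

# Row-wise nesting: what .values.tolist() gives directly.
by_row = df.values.tolist()

# Column-wise nesting: transpose first, or pull each column's values.
by_col = df.T.values.tolist()
by_col_loop = [df[c].tolist() for c in df.columns]
```

Both column-wise forms give the same result; the transpose one-liner is just more compact.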

Find values in one column that have association with certain values in another column

I have the following example pandas dataframe:
df = pd.DataFrame({'A': ['a1', 'a2', 'a3', 'a1', 'a2', 'a4'],
'B': ['b1', 'b1', 'b2', 'b3', 'b6', 'b6']})
I want to find the values in A that have association with all of the values of an input list in B.
For example, for an input lst = ['b1', 'b6'], the desired output is ['a2']. I have tried df[df['B'].isin(lst)] from here, but it is not sufficient on its own, and may not even be necessary.
First filter df on each element of lst using a list comprehension
a_list = [ df.loc[df['B']==el,'A'].tolist() for el in lst]
# [['a1', 'a2'], ['a2', 'a4']]
The values you are looking for are the intersection of all lists in a_list
list(set.intersection(*[set(x) for x in a_list]))
# ['a2']
We can do it with groupby + filter, using isin + all:
df.groupby('A').filter(lambda x : pd.Series(lst).isin(x['B']).all())
A B
1 a2 b1
4 a2 b6
Or
s=pd.crosstab(df.A,df.B).ge(1)[lst].all(1) # you have the series
s # s[s].index
A
a1 False
a2 True
a3 False
a4 False
dtype: bool
Using set.issubset to check if one list is in the other:
m = df.groupby('A')['B'].apply(lambda x: set(lst).issubset(set(x)))
df[df['A'].map(m)]
A B
1 a2 b1
4 a2 b6
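The issubset idea can be condensed into a short sketch that returns just the list of A values, using the question's data and lst:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a3', 'a1', 'a2', 'a4'],
                   'B': ['b1', 'b1', 'b2', 'b3', 'b6', 'b6']})
lst = ['b1', 'b6']

# For each A group, keep it only if its B values cover every element of lst.
mask = df.groupby('A')['B'].apply(lambda s: set(lst).issubset(set(s)))
matching_a = mask[mask].index.tolist()
```

Here only a2 is associated with both b1 and b6, so matching_a is ['a2'].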

DataFrame merge to exclude extra index column [duplicate]

This question already has answers here:
How to reset index in a pandas dataframe? [duplicate]
(3 answers)
Closed 3 years ago.
I want to merge two dataframes, df1 and df2, using the script below:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
df = df1.join(df2.iloc[1:].reset_index(), how='inner')
But there is one extra column named 'index' which I don't want. Is there any way to remove it?
A B index C D
0 A0 B0 1 C1 D1
1 A1 B1 2 C2 D2
2 A2 B2 3 C3 D3
What I want is just like this:
A B C D
0 A0 B0 C1 D1
1 A1 B1 C2 D2
2 A2 B2 C3 D3
Just pass drop=True to reset_index in your code:
df = df1.join(df2.iloc[1:].reset_index(drop=True), how='inner')
A B C D
0 A0 B0 C1 D1
1 A1 B1 C2 D2
2 A2 B2 C3 D3
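As a sanity check, here is the fixed join end to end: drop=True discards df2's old index instead of keeping it as a column, so the slice realigns to 0..2 and joins cleanly with df1.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

# Without drop=True, reset_index would keep the old index (1, 2, 3)
# as an extra 'index' column; with it, the slice is renumbered 0..2.
df = df1.join(df2.iloc[1:].reset_index(drop=True), how='inner')
```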

pandas dataframe row manipulation

I'm sure that I'm missing something simple, but I haven't been able to figure this one out.
I have a DataFrame in Pandas with multiple rows that have the same keys, but different information. I want to place these rows onto the same row.
df = pd.DataFrame({'key': ['K0', 'K0', 'K1', 'K2'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
This gives me a dataframe with 4 rows and 3 columns, but there is a duplicate value 'K0' in 'key'.
Is there any way to turn this into a dataframe with 3 rows, and 5 columns like shown below?
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
'A': ['A0', 'A2', 'A3'],
'B': ['B0', 'B2', 'B3'],
'A_1': ['A1', 'NaN', 'NaN'],
'B_1': ['B1', 'NaN', 'NaN']})
Perform a groupby on the cumcount, then concatenate the individual groups together.
gps = []
for i, g in df.groupby(df.groupby('key').cumcount()):
    gps.append(g.drop(columns='key').add_suffix(str(i + 1)).reset_index(drop=True))
r = pd.concat(gps, axis=1).sort_index(axis=1)
r['key'] = df.key.unique()
r
A1 A2 B1 B2 key
0 A0 A1 B0 B1 K0
1 A2 NaN B2 NaN K1
2 A3 NaN B3 NaN K2
You can shorten this somewhat using a list comprehension:
r = pd.concat(
    [g.drop(columns='key').add_suffix(str(i + 1)).reset_index(drop=True)
     for i, g in df.groupby(df.groupby('key').cumcount())],
    axis=1).sort_index(axis=1)
r['key'] = df.key.unique()
r
A1 A2 B1 B2 key
0 A0 A1 B0 B1 K0
1 A2 NaN B2 NaN K1
2 A3 NaN B3 NaN K2
Let's use set_index, groupby, cumcount, and unstack, then flatten the resulting MultiIndex columns with map and format:
df_out = df.set_index(['key', df.groupby('key').cumcount()]).unstack()
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out.reset_index()
Output:
key A_0 A_1 B_0 B_1
0 K0 A0 A1 B0 B1
1 K1 A2 None B2 None
2 K2 A3 None B3 None
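The set_index/unstack route can be verified end to end on the question's frame:

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K0', 'K1', 'K2'],
                   'A': ['A0', 'A1', 'A2', 'A3'],
                   'B': ['B0', 'B1', 'B2', 'B3']})

# Number duplicate keys 0, 1, ... within each group, then pivot that
# counter out into the columns.
out = df.set_index(['key', df.groupby('key').cumcount()]).unstack()
out.columns = out.columns.map('{0[0]}_{0[1]}'.format)
out = out.reset_index()
```

Keys with fewer duplicates than the maximum get missing values in the extra columns, matching the NaN padding the question asks for.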
This alters the layout less: just put key into the index to access the fields:
df2 = df.set_index([df.key,df.index])
Then
In [248]: df2.loc['K1']
Out[248]:
A B key
2 A2 B2 K1
In [249]: df2.loc['K0']
Out[249]:
A B key
0 A0 B0 K0
1 A1 B1 K0
and iterate over the rows.

Using Merge on a column and Index in Pandas

I have two separate dataframes that share a project number. In type_df, the project number is the index. In time_df, the project number is a column. I would like to count the number of rows that have a Project Type of 2. I am trying to do this with pandas.merge(). It works great when using two columns, but not indices. I'm not sure how to reference the index, or whether merge is even the right way to do this.
import pandas as pd
type_df = pd.DataFrame(data = [['Type 1'], ['Type 2']],
columns=['Project Type'],
index=['Project2', 'Project1'])
time_df = pd.DataFrame(data = [['Project1', 13], ['Project1', 12],
['Project2', 41]],
columns=['Project', 'Time'])
merged = pd.merge(time_df,type_df, on=[index,'Project'])
print merged[merged['Project Type'] == 'Type 2']['Project Type'].count()
Error:
NameError: name 'index' is not defined
Desired Output:
2
If you want to use an index in your merge you have to specify left_index=True or right_index=True, and then use left_on or right_on. For you it should look something like this:
merged = pd.merge(type_df, time_df, left_index=True, right_on='Project')
Another solution is to use DataFrame.join; note that the on column must live in the calling frame, which joins against the other frame's index:
df3 = time_df.join(type_df, on='Project')
Since pandas 0.23.0, the on, left_on, and right_on parameters may refer to either column names or index level names:
left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'key2': ['K0', 'K1', 'K0', 'K1']},
index=left_index)
right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3'],
'key2': ['K0', 'K0', 'K0', 'K1']},
index=right_index)
print (left)
A B key2
key1
K0 A0 B0 K0
K0 A1 B1 K1
K1 A2 B2 K0
K2 A3 B3 K1
print (right)
C D key2
key1
K0 C0 D0 K0
K1 C1 D1 K0
K2 C2 D2 K0
K2 C3 D3 K1
df = left.merge(right, on=['key1', 'key2'])
print (df)
A B key2 C D
key1
K0 A0 B0 K0 C0 D0
K1 A2 B2 K0 C1 D1
K2 A3 B3 K1 C3 D3
You must have the same column in each dataframe to merge on.
In this case, just make a 'Project' column for type_df, then merge on that:
type_df['Project'] = type_df.index.values
merged = pd.merge(time_df,type_df, on='Project', how='inner')
merged
# Project Time Project Type
#0 Project1 13 Type 2
#1 Project1 12 Type 2
#2 Project2 41 Type 1
print merged[merged['Project Type'] == 'Type 2']['Project Type'].count()
2
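Putting the left_index/right_on fix together with the count, in Python 3 syntax (the question uses Python 2 print statements):

```python
import pandas as pd

type_df = pd.DataFrame(data=[['Type 1'], ['Type 2']],
                       columns=['Project Type'],
                       index=['Project2', 'Project1'])
time_df = pd.DataFrame(data=[['Project1', 13], ['Project1', 12],
                             ['Project2', 41]],
                       columns=['Project', 'Time'])

# Join type_df's index against time_df's 'Project' column.
merged = pd.merge(type_df, time_df, left_index=True, right_on='Project')

# Count the merged rows whose project type is 'Type 2'.
type2_count = (merged['Project Type'] == 'Type 2').sum()
```

Project1 maps to 'Type 2' and appears twice in time_df, so the count comes out as 2, matching the desired output.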
