combination of two DF, pandas

I have two DataFrames.
First df
A B C
1 1 3
1 1 2
1 2 5
2 2 7
2 3 7
Second df
B D
1 5
2 6
3 4
The column B has the same meaning in both dfs. What is the easiest way to add column D to the first df, matched on the corresponding values of B? The output should be:
A B C D
1 1 3 5
1 1 2 5
1 2 5 6
2 2 7 6
2 3 7 4

In your case, perform a 'left' merge on column 'B' (here df is the first DataFrame and df1 is the second):
In [206]:
df.merge(df1, how='left', on='B')
Out[206]:
A B C D
0 1 1 3 5
1 1 1 2 5
2 1 2 5 6
3 2 2 7 6
4 2 3 7 4
Another method would be to set 'B' on your second df as the index and then call map:
In [215]:
df1 = df1.set_index('B')
df['D'] = df['B'].map(df1['D'])
df
Out[215]:
A B C D
0 1 1 3 5
1 1 1 2 5
2 1 2 5 6
3 2 2 7 6
4 2 3 7 4
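For reference, here is a minimal self-contained sketch of both approaches using the data from the question. The names first and second are placeholders chosen for this example, and validate="m:1" is an optional extra check that is not part of the answer above.

```python
import pandas as pd

# Data from the question; `first` and `second` are illustrative names.
first = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                      "B": [1, 1, 2, 2, 3],
                      "C": [3, 2, 5, 7, 7]})
second = pd.DataFrame({"B": [1, 2, 3], "D": [5, 6, 4]})

# Left merge: keep every row of `first` and look up D by B.
# validate="m:1" raises an error if B is not unique in `second`.
merged = first.merge(second, how="left", on="B", validate="m:1")

# Same lookup with map: index the second frame by B, then map B -> D.
mapped = first.assign(D=first["B"].map(second.set_index("B")["D"]))

print(merged)
print(mapped)
```

The merge form scales naturally to several lookup columns, while map is convenient when only a single column has to be brought over.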

Related

Autoincrement indexing after groupby with pandas on the original table

I cannot solve a very easy/simple problem in pandas. :(
I have the following table:
df = pd.DataFrame(data=dict(a=[1, 1, 1,2, 2, 3,1], b=["A", "A","B","A", "B", "A","A"]))
df
Out[96]:
a b
0 1 A
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 1 A
I would like to create an incrementing ID for each unique group (grouped by columns a and b). So the result would look like this (column c):
Out[98]:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
I tried with:
df.groupby(["a", "b"]).nunique().cumsum().reset_index()
Result:
Out[105]:
a b c
0 1 A 1
1 1 B 2
2 2 A 3
3 2 B 4
4 3 A 5
Unfortunately, this works only on the grouped dataset and not on the original one. As you can see, the original table has 7 rows, while the grouped result has only 5.
So could someone please help me on how to get the desired table:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Thank you in advance!

groupby + ngroup
df['c'] = df.groupby(['a', 'b']).ngroup() + 1
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1

Use pd.factorize after creating a tuple from the (a, b) columns of each row:
df['c'] = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))[0] + 1
print(df)
# Output
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
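One difference between the two answers is worth keeping in mind: ngroup numbers the groups in sorted key order by default, while pd.factorize numbers them in order of first appearance. On the question's data the two coincide. The sketch below uses a small made-up frame (not from the question) where they differ, and shows that passing sort=False to groupby should give the order-of-appearance numbering.

```python
import pandas as pd

# Hypothetical data where the first key seen is not the smallest one.
toy = pd.DataFrame({"a": [2, 1, 2], "b": ["B", "A", "B"]})

# Default groupby sorts the keys, so (1, "A") becomes group 1.
by_sorted_key = toy.groupby(["a", "b"]).ngroup() + 1

# sort=False keeps the order in which the keys first appear.
by_appearance = toy.groupby(["a", "b"], sort=False).ngroup() + 1

# factorize also numbers values in order of first appearance.
by_factorize = pd.factorize(toy[["a", "b"]].apply(tuple, axis=1))[0] + 1

print(by_sorted_key.tolist(), by_appearance.tolist(), by_factorize.tolist())
```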

split a string into separate columns in pandas

I have a dataframe with lots of data and 1 column that is structured like this:
index var_1
1 a=3:b=4:c=5:d=6:e=3
2 b=3:a=4:c=5:d=6:e=3
3 e=3:a=4:c=5:d=6
4 c=3:a=4:b=5:d=6:f=3
I am trying to structure the data in that column to look like this:
index a b c d e f
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
I have done the following thus far:
df1 = df['var_1'].str.split(':', expand=True)
I can then loop through the cols of df1 and do another split on '=', but then I'll just have loads of disorganised label cols and value cols.

Use list comprehension with dictionaries for each value and pass to DataFrame constructor:
comp = [dict([y.split('=') for y in x.split(':')]) for x in df['var_1']]
df = pd.DataFrame(comp).fillna(0).astype(int)
print (df)
a b c d e f
0 3 4 5 6 3 0
1 4 3 5 6 3 0
2 4 0 5 6 3 0
3 4 5 3 6 0 3
Or use Series.str.split with expand=True to get a DataFrame, reshape it with DataFrame.stack, split again on '=', drop the second level of the MultiIndex, append column 0 (the labels) as a new index level, and finally reshape with Series.unstack:
df = (df['var_1'].str.split(':', expand=True)
        .stack()
        .str.split('=', expand=True)
        .reset_index(level=1, drop=True)
        .set_index(0, append=True)[1]
        .unstack(fill_value=0)
        .rename_axis(None, axis=1))
print (df)
a b c d e f
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
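As a self-contained sketch of the dictionary approach, the example below rebuilds the sample column and keeps the original index labels. The helper name parse, the int() conversion inside it, and the final column sort are choices made for this example, not part of the answer above.

```python
import pandas as pd

# Sample data from the question, with the index labelled as shown there.
df = pd.DataFrame(
    {"var_1": ["a=3:b=4:c=5:d=6:e=3",
               "b=3:a=4:c=5:d=6:e=3",
               "e=3:a=4:c=5:d=6",
               "c=3:a=4:b=5:d=6:f=3"]},
    index=pd.Index([1, 2, 3, 4], name="index"),
)

def parse(cell):
    """Turn 'a=3:b=4' into {'a': 3, 'b': 4} (hypothetical helper)."""
    pairs = (pair.split("=") for pair in cell.split(":"))
    return {key: int(value) for key, value in pairs}

# Missing keys become NaN in the constructor, then are filled with 0.
wide = (pd.DataFrame([parse(cell) for cell in df["var_1"]], index=df.index)
          .fillna(0)
          .astype(int)
          .sort_index(axis=1))
print(wide)
```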

Here's one approach using str.get_dummies:
out = df.var_1.str.get_dummies(sep=':')
out = out * out.columns.str[2:].astype(int).values
out.columns = pd.MultiIndex.from_arrays([out.columns.str[0], out.columns])
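# DataFrame.max(level=...) was removed in pandas 2.0; group the transposed columns by their first level instead
print(out.T.groupby(level=0).max().T)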
a b c d e f
index
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3

You can apply "extractall" and "pivot".
After "extractall" you get:
0 1
index match
1 0 a 3
1 b 4
2 c 5
3 d 6
4 e 3
2 0 b 3
1 a 4
2 c 5
3 d 6
4 e 3
3 0 e 3
1 a 4
2 c 5
3 d 6
4 0 c 3
1 a 4
2 b 5
3 d 6
4 f 3
And in one step:
rslt= df.var_1.str.extractall(r"([a-z])=(\d+)") \
.reset_index(level="match",drop=True) \
.pivot(columns=0).fillna(0)
1
0 a b c d e f
index
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
#rslt.columns= rslt.columns.levels[1].values
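If flat column labels are wanted instead of the two-level header shown above, one option is to name the regex groups and pivot on them explicitly. This is only a sketch: it assumes the index is named "index" as in the question, and the group names key and value are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"var_1": ["a=3:b=4:c=5:d=6:e=3",
               "b=3:a=4:c=5:d=6:e=3",
               "e=3:a=4:c=5:d=6",
               "c=3:a=4:b=5:d=6:f=3"]},
    index=pd.Index([1, 2, 3, 4], name="index"),
)

# One row per label=value pair; named groups become column names.
pairs = df["var_1"].str.extractall(r"(?P<key>[a-z])=(?P<value>\d+)")
pairs["value"] = pairs["value"].astype(int)

# Pivot the long table back to one row per original index label.
rslt = (pairs.reset_index()
             .pivot(index="index", columns="key", values="value")
             .fillna(0)
             .astype(int)
             .rename_axis(None, axis=1))
print(rslt)
```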

need to filter rows present in one dataframe on another

I have two pandas DataFrames, and I need the rows of the second one whose combination of column values does not appear in the first.
Example:
df A
A B C D
6 4 1 6
7 6 6 3
1 6 2 9
8 0 4 9
1 0 2 3
8 4 7 5
4 7 1 1
3 7 3 4
5 2 8 8
3 2 8 8
5 2 8 8
df B
A B C D
1 0 2 3
8 4 7 5
4 7 1 1
1 0 2 3
8 4 7 5
4 7 1 1
3 7 3 4
5 2 8 8
1 1 1 1
2 2 2 2
1 1 1 1
Required output:
A B C D
1 1 1 1
2 2 2 2
1 1 1 1
I tried using pd.merge with inner/left joins on all columns, but it takes a lot of computation time and memory when there are many rows and columns. Is there another way to do this, such as iterating through each row of dfA against dfB on all columns and picking the rows that exist only in dfB?

You can use merge with the indicator parameter (the indicator column is named 'ind' here):
df_b.merge(df_a, on=['A','B','C','D'],
how='left', indicator='ind')\
.query('ind == "left_only"')\
.drop('ind', axis=1)
Output:
A B C D
9 1 1 1 1
10 2 2 2 2
11 1 1 1 1
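Note that df A in the question contains a repeated row (5 2 8 8), so the left merge repeats the matching df B row, which is why the index labels above run to 11. A hedged variation is to drop duplicates from df A before merging, so the result keeps one row per df B row. The sketch below rebuilds the question's data; the names cols, merged and only_in_b are illustrative.

```python
import pandas as pd

cols = ["A", "B", "C", "D"]
df_a = pd.DataFrame(
    [[6, 4, 1, 6], [7, 6, 6, 3], [1, 6, 2, 9], [8, 0, 4, 9],
     [1, 0, 2, 3], [8, 4, 7, 5], [4, 7, 1, 1], [3, 7, 3, 4],
     [5, 2, 8, 8], [3, 2, 8, 8], [5, 2, 8, 8]], columns=cols)
df_b = pd.DataFrame(
    [[1, 0, 2, 3], [8, 4, 7, 5], [4, 7, 1, 1], [1, 0, 2, 3],
     [8, 4, 7, 5], [4, 7, 1, 1], [3, 7, 3, 4], [5, 2, 8, 8],
     [1, 1, 1, 1], [2, 2, 2, 2], [1, 1, 1, 1]], columns=cols)

# De-duplicating the right side keeps the merge from multiplying rows.
merged = df_b.merge(df_a.drop_duplicates(), on=cols,
                    how="left", indicator=True)

# indicator=True adds a "_merge" column; keep rows found only in df_b.
only_in_b = merged.loc[merged["_merge"] == "left_only", cols]
print(only_in_b)
```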

Creating a new column in panda dataframe using logical indexing and group by

I have a data frame like below
df=pd.DataFrame({'a':['a','a','b','a','b','a','a','a'], 'b' : [1,0,0,1,0,1,1,1], 'c' : [1,2,3,4,5,6,7,8],'d':['1','2','1','2','1','2','1','2']})
df
Out[94]:
a b c d
0 a 1 1 1
1 a 0 2 2
2 b 0 3 1
3 a 1 4 2
4 b 0 5 1
5 a 1 6 2
6 a 1 7 1
7 a 1 8 2
I want something like below
df[(df['a']=='a') & (df['b']==1)]
In [97]:
df[(df['a']=='a') & (df['b']==1)].groupby('d')['c'].rank()
Out[97]:
0 1
3 1
5 2
6 2
7 3
dtype: float64
I want this rank as a new column in the DataFrame df, with NaN wherever there is no rank. So the final output will be something like below:
a b c d rank
0 a 1 1 1 1
1 a 0 2 2 NaN
2 b 0 3 1 NaN
3 a 1 4 2 1
4 b 0 5 1 NaN
5 a 1 6 2 2
6 a 1 7 1 2
7 a 1 8 2 3
I will appreciate all the help and guidance. Thanks a lot.

Almost there. You just need to call transform to return a Series whose index is aligned with your original df:
In [459]:
df['rank'] = df[(df['a']=='a') & (df['b']==1)].groupby('d')['c'].transform(pd.Series.rank)
df
Out[459]:
a b c d rank
0 a 1 1 1 1
1 a 0 2 2 NaN
2 b 0 3 1 NaN
3 a 1 4 2 1
4 b 0 5 1 NaN
5 a 1 6 2 2
6 a 1 7 1 2
7 a 1 8 2 3
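It may also be worth noting that groupby(...).rank() already returns a Series that carries the index labels of the filtered rows, so assigning it directly should align in the same way. A minimal sketch with the question's data (the name mask is introduced here for readability):

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', 'b', 'a', 'b', 'a', 'a', 'a'],
                   'b': [1, 0, 0, 1, 0, 1, 1, 1],
                   'c': [1, 2, 3, 4, 5, 6, 7, 8],
                   'd': ['1', '2', '1', '2', '1', '2', '1', '2']})

# Rows that should receive a rank.
mask = (df['a'] == 'a') & (df['b'] == 1)

# Assignment aligns on the index; rows outside the mask get NaN.
df['rank'] = df[mask].groupby('d')['c'].rank()
print(df)
```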

python pandas groupby() result

I have the following python pandas data frame:
df = pd.DataFrame( {
'A': [1,1,1,1,2,2,2,3,3,4,4,4],
'B': [5,5,6,7,5,6,6,7,7,6,7,7],
'C': [1,1,1,1,1,1,1,1,1,1,1,1]
} );
df
A B C
0 1 5 1
1 1 5 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 6 1
6 2 6 1
7 3 7 1
8 3 7 1
9 4 6 1
10 4 7 1
11 4 7 1
I would like to have another column storing a value of a sum over C values for fixed (both) A and B. That is, something like:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
I have tried with pandas groupby and it kind of works:
res = {}
for a, group_by_A in df.groupby('A'):
    group_by_B = group_by_A.groupby('B', as_index = False)
    res[a] = group_by_B['C'].sum()
but I don't know how to 'get' the results from res into df in the orderly fashion. Would be very happy with any advice on this. Thank you.

Here's one way (though it feels this should work in one go with an apply, I can't get it).
In [11]: g = df.groupby(['A', 'B'])
In [12]: df1 = df.set_index(['A', 'B'])
The groupby size function is the one you want; we just have to match its result to 'A' and 'B' by using them as the index:
In [13]: df1['D'] = g.size() # unfortunately this doesn't play nice with as_index=False
# Same would work with g['C'].sum()
In [14]: df1.reset_index()
Out[14]:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2

You could also do it in one line using transform applied to the groupby:
df['D'] = df.groupby(['A','B'])['C'].transform('sum')
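As a quick check, here is a self-contained sketch of the transform approach on the question's data. The variable name counts is introduced only for the example; because C is always 1 here, the group sum and the group size coincide.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'B': [5, 5, 6, 7, 5, 6, 6, 7, 7, 6, 7, 7],
    'C': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})

# transform broadcasts each group's sum back onto the original rows.
df['D'] = df.groupby(['A', 'B'])['C'].transform('sum')

# The number of rows per (A, B) group, broadcast the same way.
counts = df.groupby(['A', 'B'])['C'].transform('size')
print(df)
```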

You could also do it in one line using merge, as follows:
df = df.merge(pd.DataFrame({'D':df.groupby(['A', 'B'])['C'].size()}), left_on=['A', 'B'], right_index=True)

You can use this method:
columns = ['col1','col2',...]
df.groupby('col')[columns].sum()
If you want, you can also call .sort_values(by='colx', ascending=True) (or ascending=False) after .sum() to sort the final output by a specific column (colx) in ascending or descending order.
