I'm quite new to pandas and I'm stuck on this dataframe concatenation.
Let's say I have two dataframes:
import pandas as pd

df_1 = pd.DataFrame({
    "A": [1, 1, 2, 2, 3, 4, 4],
    "B": [1, 2, 1, 2, 1, 1, 3],
    "C": ['a', 'b', 'c', 'd', 'e', 'f', 'g']
})
and
df_2 = pd.DataFrame({
    "A": [1, 3, 4],
    "D": [1, 'm', 7]
})
I would like to concatenate/merge the two dataframes on matching values of column 'A', so that the resulting dataframe is:
df_3 = pd.DataFrame({
    "A": [1, 1, 3, 4, 4],
    "B": [1, 2, 1, 1, 3],
    "C": ['a', 'b', 'e', 'f', 'g'],
    "D": [1, 1, 'm', 7, 7]
})
How can I do that?
Thanks in advance.
Just do an inner merge:
df_1.merge(df_2, how="inner", on="A")
outputs
A B C D
0 1 1 a 1
1 1 2 b 1
2 3 1 e m
3 4 1 f 7
4 4 3 g 7
You can also do a left merge and then drop the rows with missing values:
df_3 = df_1.merge(df_2, on=['A'], how='left').dropna(axis=0)
Output:
A B C D
0 1 1 a 1
1 1 2 b 1
4 3 1 e m
5 4 1 f 7
6 4 3 g 7
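Note that dropna(axis=0) drops rows with a NaN in any column; here only D can be NaN after the merge, so it works, but restricting the drop to the merge-introduced column is a safer habit. A minimal variant (same output here):

df_3 = df_1.merge(df_2, on=['A'], how='left').dropna(subset=['D'])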
I have a DataFrame that looks like this:
df = pd.DataFrame({'type': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D'],
                   'value': [1, 1, 2, 3, 4, 5, 5, 5, 6, 6, 7, 7, 8],
                   })
I would like to create a unique id based on the type and value columns; the output should look like this:
df = pd.DataFrame({'type': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D'],
                   'value': [1, 1, 2, 3, 4, 5, 5, 5, 6, 6, 7, 7, 8],
                   'id': [1, 1, 2, 1, 2, 3, 3, 3, 1, 1, 2, 1, 2],
                   })
Use DataFrameGroupBy.rank:
df['id'] = df.groupby('type')['value'].rank(method='dense').astype(int)
print (df)
type value id
0 A 1 1
1 A 1 1
2 A 2 2
3 B 3 1
4 B 4 2
5 B 5 3
6 B 5 3
7 B 5 3
8 C 6 1
9 C 6 1
10 C 7 2
11 D 7 1
12 D 8 2
Or GroupBy.transform with factorize:
f = lambda x: pd.factorize(x)[0]
df['id'] = df.groupby('type')['value'].transform(f).add(1)
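Note that factorize numbers values by order of first appearance within each group, while rank(method='dense') numbers them by sorted order, so the two only agree here because value is already sorted within each group. A small illustration of the difference, on a hypothetical unsorted series:

s = pd.Series([3, 1, 3, 2])
print(s.rank(method='dense').astype(int).tolist())  # [3, 1, 3, 2]
print((pd.factorize(s)[0] + 1).tolist())            # [1, 2, 1, 3]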
Use:
t = df.groupby(['type']).transform(lambda x: x.iloc[0])
df['id'] = (df.groupby(['type', 'value'])[['type', 'value']]
              .apply(lambda x: x.name[1])
              .reset_index()
              .merge(df, on=['type', 'value'])[0]
            - t['value'] + 1)
Output:
type value id
0 A 1 1
1 A 1 1
2 A 2 2
3 B 3 1
4 B 4 2
5 B 5 3
6 B 5 3
7 B 5 3
8 C 6 1
9 C 6 1
10 C 7 2
11 D 7 1
12 D 8 2
yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
output:
A B
0 1 1
1 2 1
2 3 1
yf.nunique(axis=0)
output:
A 3
B 1
yf.nunique(axis=1)
output:
0 1
1 2
2 2
Could you please explain how axis=0 and axis=1 work? For axis=0, why are A=2 and B=1 ignored? And does nunique take the index into account as well?
You can count the number of unique values per column (axis=0) or per row (axis=1) with DataFrame.nunique.
yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
print (yf)
A B
0 1 1
1 2 1
2 3 1
print (yf.nunique(axis=0))
A 3
B 1
dtype: int64
print (yf.nunique(axis=1))
0 1
1 2
2 2
dtype: int64
It means:
A is 3, because there are 3 unique values in column A
0 is 1, because there is 1 unique value in row 0
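In other words, axis=0 aggregates down the rows (one result per column) and axis=1 aggregates across the columns (one result per row); the index labels are never counted as data. A minimal sketch of the same computation via apply, assuming the yf above:

print(yf.apply(lambda col: col.nunique()))          # one count per column, like axis=0
print(yf.apply(lambda row: row.nunique(), axis=1))  # one count per row, like axis=1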
import pandas
import numpy
df = pandas.DataFrame({'id_1' : [1,2,1,1,1,1,1,2,2,2,2],
'id_2' : [1,1,1,1,1,2,2,2,2,2,2],
'v_1' : [2,1,1,3,2,1,2,4,1,1,2],
'v_2' : [1,1,1,1,2,2,2,1,1,2,2],
'v_3' : [3,3,3,3,4,4,4,3,3,3,3]})
In [4]: df
Out[4]:
id_1 id_2 v_1 v_2 v_3
0 1 1 2 1 3
1 2 1 1 1 3
2 1 1 1 1 3
3 1 1 3 1 3
4 1 1 2 2 4
5 1 2 1 2 4
6 1 2 2 2 4
7 2 2 4 1 3
8 2 2 1 1 3
9 2 2 1 2 3
10 2 2 2 2 3
sub = df[(df['id_1'] == 1) & (df['id_2'] == 1)].copy()
sub['v_4'] = numpy.where(sub['v_1'] == sub['v_2'].shift(), 'A', \
numpy.where(sub['v_1'] == sub['v_3'].shift(), 'B', 'C'))
In [6]: sub
Out[6]:
id_1 id_2 v_1 v_2 v_3 v_4
0 1 1 2 1 3 C
2 1 1 1 1 3 A
3 1 1 3 1 3 B
4 1 1 2 2 4 C
I have a dataframe as defined above. I would like to perform an operation that categorizes whether v_1 equals the previous v_2 or v_3 within each group of (id_1, id_2).
I have done the operation on a sub-dataframe, as shown above. Now I would like a one-line way to combine the following groupby with the operation I applied to the sub-dataframe.
gbdf = df.groupby(by=['id_1', 'id_2'])
I have tried something like
gbdf['v_4'] = numpy.where(gbdf['v_1'] == gbdf['v_2'].shift(), 'A', \
numpy.where(gbdf['v_1'] == gbdf['v_3'].shift(), 'B', 'C'))
and the error was
'DataFrameGroupBy' object does not support item assignment
I also tried
df['v_4'] = numpy.where(gbdf['v_1'] == gbdf['v_2'].shift(), 'A', \
numpy.where(gbdf['v_1'] == gbdf['v_3'].shift(), 'B', 'C'))
but I believe the result was wrong: it does not align the groupby result with the original ordering.
I am wondering whether there is an elegant way to achieve this.
This gets you a list of dataframes that match the content of the dataframe sub, but for all results of the .groupby():
import numpy
import pandas
source = pandas.DataFrame(
{'id_1': [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2],
'id_2': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
'v_1': [2, 1, 1, 3, 2, 1, 2, 4, 1, 1, 2],
'v_2': [1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
'v_3': [3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3]})
def add_v4(df):
    df['v_4'] = numpy.where(df['v_1'] == df['v_2'].shift(), 'A',
                numpy.where(df['v_1'] == df['v_3'].shift(), 'B', 'C'))
    return df
dfs = [add_v4(pandas.DataFrame(slice)) for _, slice in source.groupby(by=['id_1', 'id_2'])]
print(dfs)
About this line:
dfs = [add_v4(pandas.DataFrame(slice)) for _, slice in source.groupby(by=['id_1', 'id_2'])]
It's a list comprehension that gets all the slices from the groupby and turns them into actual new dataframes before passing them to add_v4, which returns the modified dataframe to be added to the list.
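If you want everything back in a single dataframe, you can concatenate the list and restore the original row order. And because DataFrameGroupBy.shift returns a result aligned with the original index, the whole operation can also be written without splitting at all; a sketch of both, assuming the source dataframe above:

combined = pandas.concat(dfs).sort_index()

# or shift within each group directly; the shifted series keeps source's index
g = source.groupby(by=['id_1', 'id_2'])
source['v_4'] = numpy.where(source['v_1'] == g['v_2'].shift(), 'A',
                numpy.where(source['v_1'] == g['v_3'].shift(), 'B', 'C'))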
I'm trying to create a column containing a cumulative count of entries of tid, grouped by unique values of (rid, tid). The cumulative count should increment by the number of entries in each group, as shown in the df3 dataframe below, rather than one at a time.
import pandas as pd
df1 = pd.DataFrame({
'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3]})
rid tid
0 1 1
1 1 2
2 1 2
3 2 1
4 2 1
5 2 3
6 3 1
7 3 4
8 4 5
9 5 1
10 5 1
11 5 1
12 5 3
After the required operation, this should give:
df3 = pd.DataFrame({
'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3],
'groupentries': [1, 2, 2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 1],
'cumulativeentries': [1, 2, 2, 3, 3, 1, 4, 1, 1, 7, 7, 7, 2]})
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
The derived column that I'm after is the cumulativeentries column, although I've only figured out how to generate the intermediate column groupentries using pandas:
df1.groupby(["rid", "tid"]).size()
Values in cumulativeentries are actually a kind of running count.
The task is to count occurrences of the current tid in the "source area" of
the tid column:
from the beginning of the DataFrame,
up to (including) the end of the current group.
To compute both required values for each group, I defined
the following function:
def fn(grp):
    lastRow = grp.iloc[-1]  # last row of the current group
    lastId = lastRow.name   # index of this row
    tids = df1.truncate(after=lastId).tid
    return [grp.index.size, tids[tids == lastRow.tid].size]
To get the "source area" mentioned above, I used the truncate function.
In my opinion it is a very intuitive solution, based on the notion of the
"source area".
The function returns a list containing both required values:
the size of the current group,
how many tids equal to the current tid are in the
truncated tid column.
To apply this function, run:
df2 = df1.groupby(['rid', 'tid']).apply(fn).apply(pd.Series)\
.rename(columns={0: 'groupentries', 1: 'cumulativeentries'})
Details:
apply(fn) generates a Series containing 2-element lists.
apply(pd.Series) converts it to a DataFrame (with default column names).
rename sets the target column names.
And the last thing to do is to join this table to df1:
df1.join(df2, on=['rid', 'tid'])
which reproduces the df3 shown in the question.
For the first column, use GroupBy.transform with size. For the second, use a custom function that takes all values of the tid column from the start of the DataFrame up to the group's last index, compares them with the group's last value, and counts the matches with sum:
f = lambda x: (df1['tid'].iloc[:x.index[-1]+1] == x.iat[-1]).sum()
df1['groupentries'] = df1.groupby(["rid", "tid"])['rid'].transform('size')
df1['cumulativeentries'] = df1.groupby(["rid", "tid"])['tid'].transform(f)
print (df1)
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
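The lambda may be easier to follow spelled out as a named function; a commented sketch of the same idea, assuming df1 from the question (the name count_so_far is just for illustration):

def count_so_far(x):
    # x is the tid column restricted to one (rid, tid) group;
    # take every tid from the start of df1 through the group's last row
    tids_so_far = df1['tid'].iloc[:x.index[-1] + 1]
    # count how many of them equal this group's tid
    return (tids_so_far == x.iat[-1]).sum()

df1['cumulativeentries'] = df1.groupby(['rid', 'tid'])['tid'].transform(count_so_far)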
I have a dataframe like this where the columns are the scores of some metrics:
A B C D
4 3 3 1
2 5 2 2
3 5 2 4
I want to create a new column that summarizes which metrics each row scored over a set threshold in, using the column names as a string. So if the thresholds were A > 2, B > 3, C > 1, D > 3, I would want the new column to look like this:
A B C D NewCol
4 3 3 1 AC
2 5 2 2 BC
3 5 2 4 ABCD
I tried using a series of np.where:
df['NewCol'] = np.where(df['A'] > 2, 'A', '')
df['NewCol'] = np.where(df['B'] > 3, 'B', '')
etc.
but realized that each assignment overwrites the previous one, so the result was being overwritten with the last metric whenever all four metrics didn't meet their conditions, like so:
A B C D NewCol
4 3 3 1 C
2 5 2 2 C
3 5 2 4 ABCD
I am pretty sure there is an easier and correct way to do this.
You could do:
import pandas as pd
data = [[4, 3, 3, 1],
[2, 5, 2, 2],
[3, 5, 2, 4]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])
th = {'A': 2, 'B': 3, 'C': 1, 'D': 3}
df['result'] = [
    ''.join(k for k in df.columns if record[k] > th[k])
    for record in df.to_dict('records')
]
print(df)
Output
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD
Using dot: build a boolean mask with gt, then matrix-multiply it by the column names; for each row this concatenates the names of the columns where the mask is True.
s = pd.Series([2, 3, 1, 3], index=df.columns)
df.gt(s, axis=1).dot(df.columns)
Out[179]:
0 AC
1 BC
2 ABCD
dtype: object
#df['New'] = df.gt(s, axis=1).dot(df.columns)
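This works because dot sums the products of each boolean row with the column names: True * 'A' is 'A' and False * 'A' is the empty string. A quick look at the intermediate mask, assuming the df and s above:

print(df.gt(s, axis=1))
#        A      B      C      D
# 0   True  False   True  False
# 1  False   True   True  False
# 2   True   True   True   True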
Another option that operates in a vectorized, array-oriented fashion. It would be interesting to compare performance.
import pandas as pd
import numpy as np
# Data to test.
data = pd.DataFrame(
[
[4, 3, 3, 1],
[2, 5, 2, 2],
[3, 5, 2, 4]
]
, columns = ['A', 'B', 'C', 'D']
)
# Series to hold the thresholds.
thresholds = pd.Series([2, 3, 1, 3], index = ['A', 'B', 'C', 'D'])
# Subtract the series from the data, broadcasting, and then use sum to concatenate the strings.
data['result'] = np.where(data - thresholds > 0, data.columns, '').sum(axis = 1)
print(data)
Gives:
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD
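A subtle point, worth noting as an assumption check: np.where returns an object-dtype array here (because data.columns is an object-dtype Index), and it is the object dtype that lets .sum(axis=1) concatenate Python strings; a fixed-width unicode array would not support that reduction. Inspecting the intermediate:

parts = np.where(data - thresholds > 0, data.columns, '')
print(parts.dtype)  # object
print(parts)
# [['A' '' 'C' '']
#  ['' 'B' 'C' '']
#  ['A' 'B' 'C' 'D']]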