Row by row multiplication on pandas or numpy - python

I have one dataframe with classes and two components and a second one with elements and the same components.
df1:
df1 = pd.DataFrame({'Class':['banana', 'apple'], 'comp1':[1, 2], 'comp2':[-5, 4]})
df2:
df2 = pd.DataFrame({'Element':['K', 'Mg'], 'comp1':[3, -4], 'comp2':[1, 3]})
I want to multiply them row by row in a way that would generate the following output:
output = pd.DataFrame({'Class': ['banana', 'banana', 'apple', 'apple'], 'Element': ['K', 'Mg', 'K', 'Mg'], 'comp1':[3, -4, 6, -8], 'comp2':[-5, -15, 4, 12]})
Could you help me?

As I see it, this is a cartesian (cross) product, followed by some column manipulation to produce the desired output:
import pandas as pd
df1 = pd.DataFrame({'Class':['banana', 'apple'], 'comp1':[1, 2], 'comp2':[-5, 4]})
df2 = pd.DataFrame({'Element':['K', 'Mg'], 'comp1':[3, -4], 'comp2':[1, 3]})
# cross join: every row of df1 paired with every row of df2
output = df1.merge(df2, how='cross')
output['comp1'] = output.pop('comp1_x') * output.pop('comp1_y')
output['comp2'] = output.pop('comp2_x') * output.pop('comp2_y')
print(output)
expected = pd.DataFrame({'Class': ['banana', 'banana', 'apple', 'apple'], 'Element': ['K', 'Mg', 'K', 'Mg'], 'comp1':[3, -4, 6, -8], 'comp2':[-5, -15, 4, 12]})
print(expected.equals(output)) # True
'''
    Class Element  comp1  comp2
0  banana       K      3     -5
1  banana      Mg     -4    -15
2   apple       K      6      4
3   apple      Mg     -8     12
'''
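For larger frames, the same cross product can be sketched with NumPy broadcasting instead of a merge. This is a sketch, not the answer's method; it assumes the comp columns are the only ones to multiply and that rows should vary fastest over df2, matching the desired row order:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Class': ['banana', 'apple'], 'comp1': [1, 2], 'comp2': [-5, 4]})
df2 = pd.DataFrame({'Element': ['K', 'Mg'], 'comp1': [3, -4], 'comp2': [1, 3]})

a = df1[['comp1', 'comp2']].to_numpy()   # shape (2, 2): one row per class
b = df2[['comp1', 'comp2']].to_numpy()   # shape (2, 2): one row per element
# broadcast to (n1, n2, 2), then flatten to one row per (class, element) pair
prod = (a[:, None, :] * b[None, :, :]).reshape(-1, 2)

output = pd.DataFrame({
    'Class': np.repeat(df1['Class'].to_numpy(), len(df2)),
    'Element': np.tile(df2['Element'].to_numpy(), len(df1)),
    'comp1': prod[:, 0],
    'comp2': prod[:, 1],
})
print(output)
```

This avoids the intermediate `_x`/`_y` columns entirely, at the cost of handling the label columns by hand.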

Related

python - how to identify and remove rows with unique values from a subset of duplicate rows?

I have a dataframe with rows that are almost duplicates, except for the value on one column.
event = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
subj = [1, 1, 2, 2, 3, 3, 4, 4, 5, 6]
age = [22, 22, 56, 56, 32, 32, 48, 48, 19, 43]
sex = ['F', 'F', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'M']
fruit = ['apple', 'orange', 'apple', 'orange', 'grape', 'mango', 'grape', 'mango', 'apple', 'mango']
df = pd.DataFrame(list(zip(event, subj, age, sex, fruit)),
                  columns=['event', 'subj', 'age', 'sex', 'fruit'])
Each subject is associated with two fruits (either with apple/orange or grape/mango). I'd like to omit some rows so that each subject is associated with only one fruit, and each with a different fruit.
This is what I want my final dataframe to look like:
event = [1, 1, 2, 2, 3, 3]
subj = [1, 2, 3, 4, 5, 6]
age = [22, 56, 32, 48, 19, 43]
sex = ['F', 'M', 'M', 'F', 'F', 'M']
fruit = ['apple', 'orange', 'grape', 'mango', 'apple', 'mango']
df_new = pd.DataFrame(list(zip(event, subj, age, sex, fruit)),
                      columns=['event', 'subj', 'age', 'sex', 'fruit'])
I have thousands of rows and don't know which rows are "almost duplicates". I've tried using .duplicated() based on a subset, but it only allows me to keep first or last, so different subjects end up with the same fruit (for example, subjects 1 and 2 with 'apple' and subjects 3 and 4 with 'grape').
I am new to pandas and any help would be greatly appreciated.
Edit: to clarify some of the ambiguities in my question--within each event, the subjects should be unique. The subjects within the event should be associated with different fruits. There could be more than two subjects for each event, but each subject should not be associated with more than a single event (if subject 1 appeared in event 1, should not appear in other events, for example).
This should work: within each event the rows alternate fruits, so a row whose neighbours (one row above and one below, inside the same event) carry the same fruit is the redundant one to drop:
df['check'] = (
    df.groupby(['event'])
      .apply(lambda x: x['fruit'].shift(1) == x['fruit'].shift(-1))
      .reset_index(drop=True)
)
df = df[~df['check']].drop(columns=['check'])
print(df)
'''
   event  subj  age sex   fruit
0      1     1   22   F   apple
3      1     2   56   M  orange
4      2     3   32   M   grape
7      2     4   48   F   mango
8      3     5   19   F   apple
9      3     6   43   M   mango
'''
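If the pairing is less regular than strictly alternating rows, a more explicit (if slower) sketch walks each event and hands the i-th distinct subject its i-th candidate row. It reproduces the expected output on the toy data, but with thousands of rows it is worth verifying that it still assigns distinct fruits within each event:

```python
import pandas as pd

event = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
subj = [1, 1, 2, 2, 3, 3, 4, 4, 5, 6]
age = [22, 22, 56, 56, 32, 32, 48, 48, 19, 43]
sex = ['F', 'F', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'M']
fruit = ['apple', 'orange', 'apple', 'orange', 'grape',
         'mango', 'grape', 'mango', 'apple', 'mango']
df = pd.DataFrame(list(zip(event, subj, age, sex, fruit)),
                  columns=['event', 'subj', 'age', 'sex', 'fruit'])

pieces = []
for _, g in df.groupby('event'):
    for i, s in enumerate(g['subj'].unique()):
        sub = g[g['subj'] == s]
        # the i-th subject in the event takes its i-th candidate row
        # (wrapping around if a subject has fewer candidates)
        pieces.append(sub.iloc[[i % len(sub)]])
result = pd.concat(pieces)
print(result)
```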

Partially merge pandas data frame columns into a dictionary column

Given a Pandas df:
col1 col2 col3 col4
a 1 2 56
a 3 4 1
a 5 6 1
b 7 8 2
b 9 10 -11
c 11 12 9
...
Using pandas, how can such a data frame be reshaped so that multiple columns are represented by a single dictionary column with the column names as keys:
col1 dict_col
a { 'col2':1 ,'col3':2 , 'col4':56 }
a { 'col2':3 ,'col3':4 , 'col4':1 }
a { 'col2':5 ,'col3':6 , 'col4':1 }
b { 'col2':7 ,'col3':8 , 'col4':2 }
b { 'col2':9 ,'col3':10, 'col4':-11}
c { 'col2':11 ,'col3':12, 'col4':9 }
Note that this transformation needs to be done with pandas only, and just for a subset of the columns, across all rows of the data frame.
You can use this code:
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'a', 'a', 'b', 'b', 'c'],
    'col2': [1, 3, 5, 7, 9, 11],
    'col3': [2, 4, 6, 8, 10, 12],
    'col4': [56, 1, 1, 2, -11, 9]
})
cols = ['col2', 'col3', 'col4']
lst = []
for _, row in df[cols].iterrows():
    lst.append({col: row[col] for col in cols})
df['dict_col'] = lst
df = df[['col1', 'dict_col']]
print(df)
Output :
col1 dict_col
0 a {'col2': 1, 'col3': 2, 'col4': 56}
1 a {'col2': 3, 'col3': 4, 'col4': 1}
2 a {'col2': 5, 'col3': 6, 'col4': 1}
3 b {'col2': 7, 'col3': 8, 'col4': 2}
4 b {'col2': 9, 'col3': 10, 'col4': -11}
5 c {'col2': 11, 'col3': 12, 'col4': 9}
Try this command:
pd.DataFrame({'col1': df['col1'].values, 'dict_col': df.drop('col1', axis=1).to_dict(orient='records')})
Building the dict_col column content (NB: with orient='records', to_dict() returns a list of dictionaries):
dict_col = df.loc[:, ["col2", "col3","col4"]].to_dict(orient="records")
Then create df2 as a new dataframe (having 2 columns, col1, dict_col):
df2 = pd.DataFrame({"col1": df["col1"], "dict_col": dict_col})
print(df2)

Python Pandas Dataframe select row by max date in group with aggregate

I have a dataframe as follows:
df = pd.DataFrame({'id': ['A', 'A', 'B', 'B', 'C'],
'date': ['2021-01-01T14:54:42.000Z',
'2021-01-01T14:54:42.000Z',
'2021-01-01T14:55:42.000Z',
'2021-04-01T15:51:42.000Z',
'2021-03-01T15:51:42.000Z'],
'foo': ['apple', 'orange', 'apple', 'banana', 'pepper'],
'count': [3, 2, 4, 2, 1]})
I want to group the dataframe by id and date so that foo and count per date are aggregated lists. I then want to take the row with the most recent date per id.
Expected outcome
id date foo count
A '2021-01-01T14:54:42.000Z' ['apple', 'orange'] [3, 2]
B '2021-04-01T15:51:42.000Z' ['banana'] [2]
C '2021-03-01T15:51:42.000Z' ['pepper'] [1]
I've tried
df = df.sort_values(['id', 'date'], ascending=(True, False))
test_df = df.groupby(['id', 'date'], as_index=False)['foo', 'count'].agg(list).head(1).reset_index(drop=True)
but this only gives me the first row of the df. .first() gives me a TypeError. Any help is greatly appreciated.
In your case, a single groupby('id') with {'date': 'max'} aggregates foo and count across all dates per id (for B that yields [apple, banana]), which is not the expected outcome. Aggregate per (id, date) first, then keep the latest date per id; the ISO-8601 strings sort chronologically, so a plain string sort is enough:
out = (df.groupby(['id', 'date'], as_index=False)
         .agg({'foo': list, 'count': list})
         .sort_values('date')
         .drop_duplicates('id', keep='last')
         .sort_values('id')
         .reset_index(drop=True))
print(out)
'''
  id                      date              foo   count
0  A  2021-01-01T14:54:42.000Z  [apple, orange]  [3, 2]
1  B  2021-04-01T15:51:42.000Z         [banana]     [2]
2  C  2021-03-01T15:51:42.000Z         [pepper]     [1]
'''

Python appending a list to dataframe column

I have a dataframe from a Stata file and I would like to add a new column whose entry for each row is a numeric list. How can one accomplish this? Plain assignment complains about index size.
I tried initiating a new column of strings (I also tried integers) and then something like this, but it didn't work:
testdf['new_col'] = '0'
testdf['new_col'] = testdf['new_col'].map(lambda x : list(range(100)))
Here is a toy example resembling what I have:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15]}
testdf = pd.DataFrame.from_dict(data)
This is what I would like to have:
data2 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15], 'list' : [[1,2,3],[7,8,9,10,11],[9,10,11,12],[10,11,12,13,14,15]]}
testdf2 = pd.DataFrame.from_dict(data2)
My final goal is to use explode on that "list" column to duplicate the rows appropriately.
Try this bit of code:
import numpy as np

testdf['list'] = pd.Series(np.arange(i, j)
                           for i, j in zip(testdf['start_val'], testdf['end_val'] + 1))
testdf
Output:
   col_1 col_2  start_val  end_val                      list
0      3     a          1        3                 [1, 2, 3]
1      2     b          7       11         [7, 8, 9, 10, 11]
2      1     c          9       12           [9, 10, 11, 12]
3      0     d         10       15  [10, 11, 12, 13, 14, 15]
This uses a generator expression with zip, np.arange and the pd.Series constructor to build the lists.
If you'd stick to using the apply function:
import pandas as pd
import numpy as np
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15]}
df = pd.DataFrame.from_dict(data)
df['range'] = df.apply(lambda row: np.arange(row['start_val'], row['end_val']+1), axis=1)
print(df)
Output:
   col_1 col_2  start_val  end_val                     range
0      3     a          1        3                 [1, 2, 3]
1      2     b          7       11         [7, 8, 9, 10, 11]
2      1     c          9       12           [9, 10, 11, 12]
3      0     d         10       15  [10, 11, 12, 13, 14, 15]
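Since the stated end goal is to explode the list column into duplicated rows, either variant feeds straight into it. A sketch using the range column from the apply version (the ignore_index parameter assumes pandas >= 1.1):

```python
import numpy as np
import pandas as pd

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'],
        'start_val': [1, 7, 9, 10], 'end_val': [3, 11, 12, 15]}
df = pd.DataFrame.from_dict(data)
df['range'] = df.apply(lambda row: np.arange(row['start_val'], row['end_val'] + 1), axis=1)

# one output row per element of each list; ignore_index renumbers 0..n-1
long_df = df.explode('range', ignore_index=True)
print(long_df)
```

The other (non-list) columns are repeated for every element of the exploded list, which is exactly the row duplication described above.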

Replace values of a dataframe with the value of another dataframe

I have two pandas dataframes
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
The index of both data-frames are 'A'.
How to replace the values of df1's column 'B' with the values of df2 column 'B'?
RESULT of df1:
A B
1 8
3 10
5 12
Maybe dataframe.isin() is what you're searching for (note that this relies on the matching rows of df2 appearing in the same order as in df1, and on every value of df1['A'] being present in df2['A']):
df1['B'] = df2[df2['A'].isin(df1['A'])]['B'].values
print(df1)
Prints:
A B
0 1 8
1 3 10
2 5 12
One of possible solutions:
wrk = df1.set_index('A').B
wrk.update(df2.set_index('A').B)
df1 = wrk.reset_index()
The result is:
A B
0 1 8
1 3 10
2 5 12
Another solution, based on merge:
df1 = df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=['_x', ''])\
         .drop(columns=['B_x'])
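A further option that does not depend on row order is to map df1['A'] through a lookup built from df2. A sketch, assuming every value of df1['A'] occurs in df2['A'] (unmatched values would become NaN):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12],
                    'C': ['K', 'D', 'E', 'F', 'G']})

# build an A -> B lookup from df2, then align it to df1 by value, not position
df1['B'] = df1['A'].map(df2.set_index('A')['B'])
print(df1)
```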
