In the dataframe data, I want to group by 'Name', find the rows where 'Price1' and 'Price2' are equal, and then write the matching value from 'answer' to a new column within each 'Name' group. For example:
import pandas as pd

d = {
    'Name': ['Cat', 'Cat', 'Dog', 'Dog'],
    'Price1': [2, 1, 10, 3],
    'Price2': [5, 1, 7, 3],
    'answer': ['A', 'B', 'C', 'D']
}
data = pd.DataFrame(data=d)
   Name  Price1  Price2 answer
0   Cat       2       5      A
1   Cat       1       1      B  <--- match, get 'B'
2   Dog      10       7      C
3   Dog       3       3      D  <--- match, get 'D'
I want something like this (pseudocode):
data['result'] = data.groupby('Name')['answer'] where data['Price1'] == data['Price2']  # <---- this is the part I can't express
I expect the 2nd row (1 == 1) and the 4th row (3 == 3) to match, looking up 'B' and 'D' from the 'answer' column, so the result is:
data['result']
0 'B'
1 'B'
2 'D'
3 'D'
I've tried something like this:
data.groupby('Name')['Price1'].transform(lambda x: data['answer'][x == data['Price2']])
which gives error
ValueError: Can only compare identically-labeled Series objects
and tried this, not even using x:
data.groupby('Name')['Price1'].transform(lambda x: data['answer'][data['Price1'] == data['Price2']])
but the result is only filled at the matched indices:
data['result']
0 NaN
1 'B'
2 NaN
3 'D'
I think I am close but missing the key concept.
IIUC,
df.loc[df['Price1'] == df['Price2'], 'result'] = df['answer']
df['result'] = df.groupby('Name')['result'].transform('first')
print(df)
Output:
Name Price1 Price2 answer result
0 Cat 2 5 A B
1 Cat 1 1 B B
2 Dog 10 7 C D
3 Dog 3 3 D D
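The same idea also fits in one chained expression with Series.where; a minimal sketch ('first' simply picks the first match if a group ever has several):
df['result'] = (df['answer']
                .where(df['Price1'] == df['Price2'])   # keep 'answer' on matching rows, NaN elsewhere
                .groupby(df['Name'])
                .transform('first'))                   # broadcast the first non-NaN value per group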
You can also do the query-and-select operation inside groupby.apply:
out = (df.groupby('Name', as_index=False, group_keys=False)
.apply(lambda df_: df_.assign(result=df_.query('Price1 == Price2').eval('answer').item())))
print(out)
Name Price1 Price2 answer result
0 Cat 2 5 A B
1 Cat 1 1 B B
2 Dog 10 7 C D
3 Dog 3 3 D D
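Note that .item() raises a ValueError unless the query returns exactly one row per group. A sketch that tolerates zero or several matches (taking the first match and falling back to NaN, which is an assumption about the desired behaviour):
import numpy as np

def first_match(df_):
    # rows of this group where the two prices agree; may be empty
    hits = df_.loc[df_['Price1'] == df_['Price2'], 'answer']
    return df_.assign(result=hits.iloc[0] if len(hits) else np.nan)

out = df.groupby('Name', as_index=False, group_keys=False).apply(first_match)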
Related
I have a data frame that has a column where some values are See Above or See Below. The data frame looks something like this:
In[1]: df = pd.DataFrame([[1, 'Cat'], [1, 'See Above'], [4, 'See Below'],[2, 'Dog']], columns=['A','B'])
In[2]: df
Out[2]:
A B
0 1 Cat
1 1 See Above
2 4 See Below
3 2 Dog
How could I update these values based on the value in the row above or below? I have ~2300k rows for context.
import numpy as np

# mask values that are 'See Above' as NaN, then forward-fill
df['B'] = df['B'].mask(df['B'].eq('See Above'), np.nan).ffill()
# mask values that are 'See Below' as NaN, then back-fill
df['B'] = df['B'].mask(df['B'].eq('See Below'), np.nan).bfill()
df
A B
0 1 Cat
1 1 Cat
2 4 Dog
3 2 Dog
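Since Series.mask fills with NaN by default, the explicit np.nan can even be dropped; the same two steps as a compact sketch:
s = df['B']
# mask the markers (mask defaults to NaN), then fill from above / below
df['B'] = s.mask(s.eq('See Above')).ffill().mask(s.eq('See Below')).bfill()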
In a single line with replace:
df['B'] = df['B'].replace('See Above', np.nan).ffill().replace('See Below', np.nan).bfill()
print(df)
Result
A B
0 1 Cat
1 1 Cat
2 4 Dog
3 2 Dog
I have data like this
ID INFO
1 A=2;B=2;C=5
2 A=3;B=4;C=1
3 A=1;B=3;C=2
I want to split the Info columns into
ID A B C
1 2 2 5
2 3 4 1
3 1 3 2
I can split columns with one delimiter by using
df['A'], df['B'], df['C'] = df['INFO'].str.split(';').str
then split again on '=', but this seems inefficient when there are many rows, and especially when there are so many fields that they cannot be hard-coded beforehand.
Any suggestion would be greatly welcome.
You could use named groups together with Series.str.extract, then concat the 'ID' column back at the end. This assumes you always have A=, B= and C= in every line.
pd.concat([df['ID'],
           df['INFO'].str.extract(r'A=(?P<A>\d+);B=(?P<B>\d+);C=(?P<C>\d+)')], axis=1)
# ID A B C
#0 1 2 2 5
#1 2 3 4 1
#2 3 1 3 2
If you want a more flexible solution that can deal with cases where a single line might be 'A=1;C=2', split on ';' and partition on '=', then pivot at the end to get your desired output.
### Starting Data
#ID INFO
#1 A=2;B=2;C=5
#2 A=3;B=4;C=1
#3 A=1;B=3;C=2
#4 A=1;C=2
(df.set_index('ID')['INFO']
.str.split(';', expand=True)
.stack()
.str.partition('=')
.reset_index(-1, drop=True)
.pivot(columns=0, values=2)
)
# A B C
#ID
#1 2 2 5
#2 3 4 1
#3 1 3 2
#4 1 NaN 2
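Another sketch along the same lines uses str.extractall with a generic key=value pattern, so the column names come straight from the data:
out = (df.set_index('ID')['INFO']
       .str.extractall(r'(?P<key>\w+)=(?P<val>\d+)')  # one row per key=value pair
       .reset_index('match', drop=True)
       .pivot(columns='key', values='val')
       .rename_axis(columns=None)
       .reset_index())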
Browsing a Series is much faster than iterating across the rows of a dataframe.
So I would do:
pd.DataFrame([dict([x.split('=') for x in t.split(';')]) for t in df['INFO']], index=df['ID']).reset_index()
It gives as expected:
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
It should be faster than splitting the dataframe columns twice.
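A nice property of the list-of-dicts route is that rows with missing keys (e.g. 'A=1;C=2') simply come out as NaN; a quick sketch:
rows = ['A=2;B=2;C=5', 'A=1;C=2']
pd.DataFrame([dict(x.split('=') for x in t.split(';')) for t in rows])
#    A    B  C
# 0  2    2  5
# 1  1  NaN  2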
values = [dict(item.split("=") for item in value.split(";")) for value in df.INFO]
df[['A', 'B', 'C']] = pd.DataFrame(values)
This will give you the desired output:
   ID         INFO  A  B  C
0   1  A=2;B=2;C=5  2  2  5
1   2  A=3;B=4;C=1  3  4  1
2   3  A=1;B=3;C=2  1  3  2
Explanation:
The first line converts every value to a dictionary.
e.g.
x = 'A=2;B=2;C=5'
dict(item.split("=") for item in x.split(";"))
results in:
{'A': '2', 'B': '2', 'C': '5'}
DataFrame can take a list of dicts as an input and turn it into a dataframe.
Then you only need to assign the dataframe to the columns you want:
df[['A', 'B', 'C']] = pd.DataFrame(values)
Another solution is Series.str.findall to extract the values and then apply(pd.Series):
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
df = df.drop(columns="INFO")
Details:
df = pd.DataFrame([[1, "A=2;B=2;C=5"],
[2, "A=3;B=4;C=1"],
[3, "A=1;B=3;C=2"]],
columns=["ID", "INFO"])
print(df.INFO.str.findall(r'=(\d+)'))
# 0 [2, 2, 5]
# 1 [3, 4, 1]
# 2 [1, 3, 2]
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
print(df)
# ID INFO A B C
# 0 1 A=2;B=2;C=5 2 2 5
# 1 2 A=3;B=4;C=1 3 4 1
# 2 3 A=1;B=3;C=2 1 3 2
# Remove INFO column
df = df.drop("INFO", 1)
print(df)
# ID A B C
# 0 1 2 2 5
# 1 2 3 4 1
# 2 3 1 3 2
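As a performance aside, apply(pd.Series) is comparatively slow; assigning the list of lists directly is a common faster variant (a sketch on the original df, before INFO is dropped):
# a list of lists is assigned positionally, so no column alignment is involved
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').tolist()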
Another solution:
# split on ';'
# explode
# then split on '='
# and pivot
df_INFO = (df.INFO
.str.split(';')
.explode()
.str.split('=',expand=True)
.pivot(columns=0,values=1)
)
pd.concat([df.ID,df_INFO],axis=1)
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
I have a dictionary (dic) and a dataframe (df). One column of df contains keys of dic, and another column contains indices into dic's values (which are lists). I want to add a column to df that matches each key to its dic value and picks the element at that index.
input df:
A B C
1 a ` 0
2 b # 1
3 a # 1
4 c ¥ 0
5 b % 2
input dic:
{'a': ['apple', 'append'], 'b': ['boy', 'baby', 'bus'], 'c': ['cow', 'code'], 'd': ['dog', 'dislike']}
goal df:
A B C D
1 a ` 0 apple
2 b # 1 baby
3 a # 1 append
4 c ¥ 0 cow
5 b % 2 bus
This is my current code:
df['D'] = dic[df['A']][df['C']]
Error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Please correct it; the code should also run as efficiently as possible.
You could zip both columns of the dataframe into pairs, and use each pair to index the input dictionary and its inner lists:
pairs = tuple(zip(df['A'], df['C']))
# (('a', 0), ('b', 1), ('a', 1), ('c', 0), ('b', 2))
df['D'] = [dic[a][c] for a, c in pairs]
A B C D
1 a ` 0 apple
2 b # 1 baby
3 a # 1 append
4 c ¥ 0 cow
5 b % 2 bus
I would use map and lookup (assuming the dictionary is named d):
df['D']=pd.DataFrame(df.A.map(d).values.tolist(),
index=df.index).lookup(df.C.index,df.C.values)
print(df)
A B C D
1 a ` 0 apple
2 b # 1 baby
3 a # 1 append
4 c ¥ 0 cow
5 b % 2 bus
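Worth knowing: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A sketch of the same map-then-pick idea with NumPy fancy indexing (again assuming the dictionary is named d):
import numpy as np

tmp = pd.DataFrame(df.A.map(d).values.tolist(), index=df.index)
df['D'] = tmp.to_numpy()[np.arange(len(tmp)), df.C.to_numpy()]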
You could use merge and convert your input dictionary to a dataframe:
dd = {'a': ['apple', 'append'],
'b': ['boy', 'baby', 'bus'],
'c': ['cow', 'code'],
'd': ['dog', 'dislike']}
df_dd = pd.DataFrame.from_dict(dd, orient='index')
df.merge(df_dd.stack().rename('D').reset_index(),
left_on=['A', 'C'],
right_on=['level_0','level_1'])[['A','B','C','D']]
Output:
A B C D
0 a ` 0 apple
1 b # 1 baby
2 a # 1 append
3 c ¥ 0 cow
4 b % 2 bus
Reproducible code for the data:
import pandas as pd
d = {"a": "[1,2,3,4]", "b": "[1,2,3,4]"}
d = pd.DataFrame(list(d.items()))
d
0 1
0 a [1,2,3,4]
1 b [1,2,3,4]
I wanted to split/delimit "column 1" and create individual rows for each split values.
expected output:
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
Should I remove the brackets first and then split the values? I really have no idea how to approach this. Any reference that would help me solve this, please?
Based on the logic from that answer:
s = d[1]\
.apply(lambda x: pd.Series(eval(x)))\
.stack()
s.index = s.index.droplevel(-1)
s.name = "split"
d.join(s).drop(1, axis=1)
Because you have strings containing a list (and not lists) in your cells, you can use eval:
dict_v = {"a": "[1,2,3,4]", "b": "[1,2,3,4]"}
df = pd.DataFrame(list(dict_v.items()))
df = (df.rename(columns={0: 'l'}).set_index('l')[1]
      .apply(lambda x: pd.Series(eval(x))).stack()
      .reset_index().drop(columns='level_1').rename(columns={'l': 0, 0: 1}))
or another way could be to create a DataFrame (probably faster) such as:
df = (pd.DataFrame(df[1].apply(eval).tolist(),index=df[0])
.stack().reset_index(level=1, drop=True)
.reset_index(name='1'))
your output is
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
All the renames are just to reproduce your exact input/output.
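On pandas 0.25 or newer the same result also falls out of explode; a sketch starting from the question's frame d, using ast.literal_eval rather than bare eval to parse the strings:
import ast

d[1] = d[1].apply(ast.literal_eval)        # parse "[1,2,3,4]" into a real list
out = d.explode(1).reset_index(drop=True)  # one row per list element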
Df1
A B C
1 1 'a'
2 3 'b'
3 4 'c'
Df2
A B C
1 1 'k'
5 4 'e'
Expected output (after taking the difference of Df1 and Df2 and then merging, i.e. Df1 - Df2 followed by a merge):
A B C
1 1 'a'
2 3 'b'
3 4 'c'
5 4 'e'
The difference should be based on the two columns A and B, not all three. I do not care what column C contains in either Df1 or Df2.
try this:
In [44]: df1.set_index(['A','B']).combine_first(df2.set_index(['A','B'])).reset_index()
Out[44]:
A B C
0 1 1 'a'
1 2 3 'b'
2 3 4 'c'
3 5 4 'e'
This is an outer join, then taking column C from df1 where it exists and falling back to df2's value:
dfx = df1.merge(df2, how='outer', on=['A', 'B'])
dfx['C'] = dfx.apply(
lambda r: r.C_x if not pd.isnull(r.C_x) else r.C_y, axis=1)
dfx[['A', 'B', 'C']]
=>
A B C
0 1 1 a
1 2 3 b
2 3 4 c
3 5 4 e
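The row-wise apply can be replaced by a vectorised fillna; a minimal sketch:
dfx = df1.merge(df2, how='outer', on=['A', 'B'])
# prefer df1's C, fall back to df2's where df1 had no matching row
dfx['C'] = dfx['C_x'].fillna(dfx['C_y'])
dfx = dfx[['A', 'B', 'C']]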
Using concat and drop_duplicates:
output = pd.concat([df1, df2])
output = output.drop_duplicates(subset=["A", "B"], keep='first')
Desired df:
A B C
0 1 1 a
1 2 3 b
2 3 4 c
1 5 4 e
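If a clean 0..n index matters (note the duplicated label 1 above), finish with reset_index; a small sketch:
output = (pd.concat([df1, df2])
          .drop_duplicates(subset=["A", "B"], keep='first')
          .reset_index(drop=True))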