Say I have a DataFrame, call it one, like this:
non_multiply_col  col_1  col_2
          A Name      1      3
and a dict like this call it two:
{'col_1': 4, 'col_2': 5}
Is there a way to multiply all rows of one by the values in two, for the columns defined by two's keys, so the result would be:
non_multiply_col  col_1  col_2
          A Name      4     15
I tried using multiply, but I'm not really looking to join on anything specific. Maybe I'm not understanding how to use multiply correctly.
Thanks
mul/multiply works fine if the dictionary is converted to a Series:
d = {'col_1': 4, 'col_2': 5}
df.mul(pd.Series(d), axis=1)
#    col_1  col_2
# 0      4     15
In case you have more columns in the data frame than in the dictionary:
df = pd.DataFrame([{'col_1': 1, 'col_2': 3, 'col_3': 4}])
d = {'col_1': 4, 'col_2': 5}
cols_to_update = d.keys()  # you might need cols_to_update = list(d.keys()) in Python 3
# multiply the selected columns and update
df[cols_to_update] = df[cols_to_update].mul(pd.Series(d), axis=1)[cols_to_update]
df
#    col_1  col_2  col_3
# 0      4     15      4
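A variant that avoids selecting the columns twice (a sketch, not part of the original answer): reindex the multiplier Series over all of the frame's columns with fill_value=1, so columns absent from the dict are simply multiplied by 1:
df = df.mul(pd.Series(d).reindex(df.columns, fill_value=1), axis=1)
df
#    col_1  col_2  col_3
# 0      4     15      4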
I happened to find this works as well; I'm not sure whether there is any caveat to this usage:
df[d.keys()] *= pd.Series(d)
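As for caveats: one worth knowing (a general pandas point, not something this thread confirms) is that indexing columns with a raw dict_keys object has failed on some pandas versions, so wrapping the keys in list() is the safer spelling; also, the column selection raises a KeyError if any dict key has no matching column:
d = {'col_1': 4, 'col_2': 5}
df[list(d.keys())] *= pd.Series(d)  # list() keeps the indexer version-proof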
I have a very junior question in Python: I have a dataframe with a column containing some IDs, and a separate dataframe with 2 columns, one of which holds arrays:
df1 = pd.DataFrame({"some_id": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame([["A", [1, 2]], ["B", [3, 4]], ["C", [5]]], columns=['letter', 'some_ids'])
I want to add to df1 a new column "letter" that, for a given "some_id", looks up df2, checks whether this id is in df2['some_ids'], and returns df2['letter'].
I tried this:
df1['letter'] = df2[df1['some_id'].isin(df2['some_ids'])].letter
and get NaNs. Any suggestion where I made a mistake?
Create a dictionary that flattens the nested lists in a dict comprehension, and then use Series.map:
d = {x: a for a, b in zip(df2['letter'], df2['some_ids']) for x in b}  # {1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'C'}
df1['letter'] = df1['some_id'].map(d)
Or map by a Series created with DataFrame.explode and DataFrame.set_index:
df1['letter'] = df1['some_id'].map(df2.explode('some_ids').set_index('some_ids')['letter'])
Or use a left join, renaming the column first:
df1 = df1.merge(df2.explode('some_ids').rename(columns={'some_ids':'some_id'}), how='left')
print(df1)
   some_id letter
0        1      A
1        2      A
2        3      B
3        4      B
4        5      C
I have two dataframes like so:
data = {'A': [3, 2, 1, 0], 'B': [1, 2, 3, 4]}
data2 = {'A': [3, 2, 1, 0, 3, 2], 'B': [1, 2, 3, 4, 20, 2], 'C':[5,3,2,1, 5, 1]}
df1 = pd.DataFrame.from_dict(data)
df2 = pd.DataFrame.from_dict(data2)
Now I grouped df2 to get the mean of C for each (A, B) pair:
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
Now I would like to map these values into df1['new_c'] where the columns A and B match:
   A  B  new_c
0  3  1    5.0
1  2  2    2.0
2  1  3    2.0
3  0  4    1.0
where new_c is basically the average of C for every pair (A, B) from df2.
Note that A and B don't have to be keys of the dataframe (i.e. they aren't unique identifiers), which is why I originally wanted to map it with a dictionary, but failed with multiple keys.
How would I go about that?
Thank you for looking into it with me!
I found a solution to this:
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()  # keys are (A, B) tuples
df1['new_c'] = df1.apply(lambda x: values_to_map[x['A'], x['B']], axis=1)
Thanks for looking into it!
Just use np.vectorize:
import numpy as np
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
df1['new_c'] = np.vectorize(lambda a, b: values_to_map.get((a, b)))(df1['A'], df1['B'])
You can first form a MultiIndex from the [["A", "B"]] subset of the frame df1 and use its map function to map the A-B pairs to the desired grouped mean values:
cols = ["A", "B"]
mapper = df2.groupby(cols).C.mean()
df1["new_c"] = pd.MultiIndex.from_frame(df1[cols]).map(mapper)
to get
>>> df1
   A  B  new_c
0  3  1    5.0
1  2  2    2.0
2  1  3    2.0
3  0  4    1.0
(if an A-B pair in df1 isn't found in df2's groups, new_c corresponding to that pair will be NaN with this method.)
Note that neither pandas' apply nor np.vectorize is a truly "vectorized" routine. However, they might be fast enough for one's purposes and can prove more readable in places.
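If speed does matter, a fully vectorized alternative (a sketch, not shown in the thread) is a left merge against the grouped means; as with the MultiIndex approach, unmatched (A, B) pairs come out as NaN:
means = (df2.groupby(['A', 'B'], as_index=False)['C']
            .mean()
            .rename(columns={'C': 'new_c'}))  # one row per (A, B) pair
df1 = df1.merge(means, on=['A', 'B'], how='left')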
I have a dictionary like this:
{'a': {'col_1': [1, 2], 'col_2': ['a', 'b']},
'b': {'col_1': [3, 4], 'col_2': ['c', 'd']}}
When I try to convert this to a dataframe, I get this:
    col_1   col_2
a  [1, 2]  [a, b]
b  [3, 4]  [c, d]
But what I need is this:
  col_1 col_2
a     1     a
      2     b
b     3     c
      4     d
How can I get this format? Maybe I should change my input format as well?
Thanks for the help =)
You can use pd.DataFrame.from_dict setting orient='index' so the dictionary keys are set as the dataframe's indices, and then explode all columns by applying pd.Series.explode:
pd.DataFrame.from_dict(d, orient='index').apply(pd.Series.explode)
  col_1 col_2
a     1     a
a     2     b
b     3     c
b     4     d
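On pandas 1.3 or newer (an assumption about the version in use, since the thread doesn't state one), DataFrame.explode also accepts a list of columns directly, which avoids the per-column apply:
pd.DataFrame.from_dict(d, orient='index').explode(['col_1', 'col_2'])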
You could run a generator comprehension and apply pandas concat; the comprehension works on the values of the dictionary, which are themselves dictionaries:
pd.concat(pd.DataFrame(entry).assign(key=key) for key,entry in data.items()).set_index('key')
     col_1 col_2
key
a        1     a
a        2     b
b        3     c
b        4     d
Update: this still uses concatenation, but there is no need to assign the key to the individual dataframes:
(pd.concat([pd.DataFrame(entry)
            for key, entry in data.items()],
           keys=data)
   .droplevel(-1))
I have a dataframe of strings. The current dataframe looks like this:
[screenshot: current dataframe]
Each datapoint contains a string of dictionaries like the one below:
"{'Index': 1, 'TimeSpent': 74088, 'RealInc': 'Obstacle_bef', 'IdentifiedIncident': 'Obstacle', 'TrLev': 7, 'TakeOverDecision': 'stay_put'},{'Index': 2, 'TimeSpent': 11336, 'RealInc': 'Obstacle_after_success', 'IdentifiedIncident': 'Pedestrian', 'TrLev': 7 },{'Index': 3, 'TimeSpent': 38594, 'RealInc': 'Cyclist_before', 'IdentifiedIncident': 'Cyclist', 'TrLev': 7, 'TakeOverDecision': 'stay_put'},{'Index': 4, 'TimeSpent': 16011, 'RealInc': 'Cyclist_after_success', 'IdentifiedIncident': 'Pedestrian', 'TrLev': 7 }".
I would like to make a new dataframe where each column represents a key of those dictionaries. I have tried to use eval() as well as apply. But I think because every other dict is missing the 'TakeOverDecision' key, apply does not work on it.
Any suggestions or guidance on how to split this into a dataset that looks like the desired output below would be great!
[screenshot: desired dataframe]
Maybe this could help.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
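For the string shown in the question specifically, a minimal sketch (assuming each cell really is a comma-separated run of dict literals, as quoted) is to wrap the string in brackets and parse it with ast.literal_eval; building a DataFrame from the resulting sequence of dicts then fills the missing 'TakeOverDecision' entries with NaN:
import ast
import pandas as pd

s = "{'Index': 1, 'TimeSpent': 74088, 'TrLev': 7, 'TakeOverDecision': 'stay_put'},{'Index': 2, 'TimeSpent': 11336, 'TrLev': 7}"  # shortened cell
records = ast.literal_eval(f'[{s}]')  # wrap in brackets -> sequence of dicts
df = pd.DataFrame(records)            # missing keys become NaN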
Did you try the from_dict function?
import pandas as pd
df = pd.DataFrame.from_dict(data)
I have a pandas dataframe following the form in the example below:
data = {'id': [1,1,1,1,2,2,2,2,3,3,3], 'a': [-1,1,1,0,0,0,-1,1,-1,0,0], 'b': [1,0,0,-1,0,1,1,-1,-1,1,0]}
df = pd.DataFrame(data)
Now, what I want to do is create a pivot table such that for each of the columns except the id, I will have 3 new columns corresponding to the values. That is, for column a, I will create a_neg, a_zero and a_pos. Similarly, for b, I will create b_neg, b_zero and b_pos. The values for these new columns would correspond to the number of times those values appear in the original a and b column. The final dataframe should look like this:
result = {'id': [1, 2, 3], 'a_neg': [1, 1, 1],
          'a_zero': [1, 2, 2], 'a_pos': [2, 1, 0],
          'b_neg': [1, 1, 1], 'b_zero': [2, 1, 1], 'b_pos': [1, 2, 1]}
df_result = pd.DataFrame(result)
Now, to do this, I can do the following steps and arrive at my final answer:
by_a = df.groupby(['id', 'a']).count().reset_index().pivot(index='id', columns='a', values='b').fillna(0).astype(int)
by_a.columns = ['a_neg', 'a_zero', 'a_pos']
by_b = df.groupby(['id', 'b']).count().reset_index().pivot(index='id', columns='b', values='a').fillna(0).astype(int)
by_b.columns = ['b_neg', 'b_zero', 'b_pos']
df_result = by_a.join(by_b).reset_index()
However, I believe that that method is not optimal especially if I have a lot of original columns aside from a and b. Is there a shorter and/or more efficient solution for getting what I want to achieve here? Thanks.
A shorter solution, though still quite inefficient:
In [11]: df1 = df.set_index("id")
In [12]: g = df1.groupby(level=0)
In [13]: g.apply(lambda frame: frame.apply(lambda col: col.value_counts())).fillna(0).astype(int).unstack(1)
Out[13]:
     a         b
    -1  0  1  -1  0  1
id
1    1  1  2   1  2  1
2    1  2  1   1  1  2
3    1  2  0   1  1  1
Note: I think you should be aiming for the multi-index columns.
I'm reasonably sure I've seen a trick to remove the apply/value_count/fillna with something cleaner and more efficient, but at the moment it eludes me...
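One cleaner route (a sketch, and not necessarily the trick alluded to above) is to melt to long form and let pd.crosstab do the counting; it produces the multi-index columns directly, and a rename afterwards recovers the a_neg/a_zero/a_pos flavour:
m = df.melt(id_vars='id')                                # long form: id, variable, value
out = pd.crosstab(m['id'], [m['variable'], m['value']])  # counts per (id, column, value)
# optional: flatten the (variable, value) columns into a_neg, a_zero, ...
names = {-1: 'neg', 0: 'zero', 1: 'pos'}
out.columns = [f'{var}_{names[val]}' for var, val in out.columns]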