I have two data frames with different variable names:
df1 = pd.DataFrame({'A':[2,2,3],'B':[5,5,6]})
>>> df1
   A  B
0  2  5
1  2  5
2  3  6
df2 = pd.DataFrame({'C':[3,3,3],'D':[5,5,6]})
>>> df2
   C  D
0  3  5
1  3  5
2  3  6
I want to create a third data frame where the n-th column is the product of the n-th columns in the first two data frames. In the above example, df3 would have two columns X and Y, where df3.X = df1.A * df2.C and df3.Y = df1.B * df2.D.
df3 = pd.DataFrame({'X':[6,6,9],'Y':[25,25,36]})
>>> df3
   X   Y
0  6  25
1  6  25
2  9  36
Is there a simple pandas function that allows me to do this?
You can use mul to multiply df1 by the values of df2 (.values drops df2's column labels, so the multiplication is positional rather than label-aligned):
df3 = df1.mul(df2.values)
df3.columns = ['X','Y']
>>> df3
   X   Y
0  6  25
1  6  25
2  9  36
You can also use numpy:
import numpy as np
df3 = np.multiply(df1, df2.values)
df3.columns = ['X', 'Y']
Note: most NumPy operations accept a pandas Series or DataFrame, but in recent pandas versions binary ufuncs on two DataFrames align on labels, so pass the raw values of one frame when the column names differ.
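If you prefer to stay label-aligned in pure pandas, you can relabel df2's columns to match df1 before multiplying. A minimal sketch (set_axis is standard pandas; the X and Y names are applied the same way):
df3 = (df1 * df2.set_axis(df1.columns, axis=1)).set_axis(['X', 'Y'], axis=1)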
I'm running into a wall in terms of how to do this with Pandas. Given a dataframe (df1) with an ID column, and a separate dataframe (df2), how can I combine the two to make a third dataframe that preserves the ID column with all the possible combinations it could have?
df1
ID  name.x
1   a
2   b
3   c

df2
name.y
l
m
dataframe creation:
df1 = pd.DataFrame({'ID':[1,2,3],'name.x':['a','b','c']})
df2 = pd.DataFrame({'name.y':['l','m']})
Combined df:
ID  name.x  name.y
1   a       l
1   a       m
2   b       l
2   b       m
3   c       l
3   c       m
Create a column on each that is the same, do a full outer join, then keep the columns you want:
df1 = pd.DataFrame({'ID': [1, 2, 3], 'name.x': ['a', 'b', 'c']})
df2 = pd.DataFrame({'name.y': ['l', 'm']})
df1['join_col'] = True
df2['join_col'] = True
df3 = pd.merge(df1, df2, how='outer', on='join_col')
print(df3[['ID', 'name.x', 'name.y']])
will output:
   ID name.x name.y
0   1      a      l
1   1      a      m
2   2      b      l
3   2      b      m
4   3      c      l
5   3      c      m
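Since pandas 1.2 there is also a built-in cross join, so the helper column isn't needed at all:
df3 = pd.merge(df1, df2, how='cross')  # pandas >= 1.2
print(df3[['ID', 'name.x', 'name.y']])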
There are other questions on the same topic and they helped but I have an extra twist.
I have a dataframe with multiple values in some (but not all) cells.
df = pd.DataFrame({'a':["10-30-410","20-40-500","25-50"], 'b':["5-8-9","4", "99"]})
index  a          b
0      10-30-410  5-8-9
1      20-40-500  4
2      25-50      99
How can I split each cell by the dash "-" and create three new dataframes? Note that not all cells have multiple values, in which case the second and third dataframes get NA or blank (treating these as strings).
So I need df1 to be the first of those values:
index  a   b
0      10  5
1      20  4
2      25  99
And df2 would be:
index  a   b
0      30  8
1      40
2      50
And likewise for df3:
index  a    b
0      410  9
1      500
2
I got df1 with this
df1 = df.replace(r'(\d+).*(\d+).*(\d+)+', r'\1', regex=True)
But df2 doesn't quite work. I get the second values but also 4 and 99, which should be blank.
df2 = df.replace(r'(\d+)-(\d+).*', r'\2', regex=True)
index  a   b
0      30  8
1      40  4   <- should be blank
2      50  99  <- should be blank
Is this the right approach? I'm pretty good on regex but fuzzy with groups. Thank you.
Use str.split + concat + stack to get the data in a more usable format:
new_df = pd.concat(
    (df['a'].str.split('-', expand=True),
     df['b'].str.split('-', expand=True)),
    keys=('a', 'b'),
    axis=1
).stack(dropna=False).droplevel(0)
new_df:
      a     b
0    10     5
1    30     8
2   410     9
0    20     4
1    40  None
2   500  None
0    25    99
1    50  None
2  None  None
Expandable option for n cols:
cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
Then groupby level 0 + reset_index to create a list of dataframes:
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
dfs:
[    a   b
 0  10   5
 1  20   4
 2  25  99,
      a     b
 0   30     8
 1   40  None
 2   50  None,
       a     b
 0   410     9
 1   500  None
 2  None  None]
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
    'a': ["10-30-410", "20-40-500", "25-50"],
    'b': ["5-8-9", "4", "99"]
})
cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
print(dfs)
You can also try with filter:
k = pd.concat((df[c].str.split('-', expand=True).add_prefix(c + '-')
               for c in df.columns), axis=1).fillna('')
df1 = k.filter(like='0')
df2 = k.filter(like='1')
df3 = k.filter(like='2')
NOTE: To strip the digit from the column names, use: k.filter(like='0').rename(columns=lambda x: x.split('-')[0])
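A compact alternative in the same spirit, as a sketch: reindex guarantees that part columns 0-2 exist even when a column never splits into three pieces, so missing parts come out as NaN instead of raising:
parts = {c: df[c].str.split('-', expand=True).reindex(columns=range(3)) for c in df.columns}
df1, df2, df3 = (pd.DataFrame({c: p[i] for c, p in parts.items()}) for i in range(3))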
I've got a dataframe, like so:
ID  A
0   z
2   z
2   y
5   x
To which I want to add rows for each unique value of an ID column:
ID  A
0   z
2   z
2   y
5   x
0   b
2   b
5   b
I'm currently doing so in a very naïve way, which is quite inefficient/slow:
IDs = df["ID"].unique()
for ID in IDs:
    df = df.append(pd.DataFrame([[ID, "b"]], columns=df.columns), ignore_index=True)
How would I go to accomplish the same without the explicit foreach, only pandas function calls?
Use drop_duplicates, rewrite the column with assign, and attach the result with concat (DataFrame.append did the same but was removed in pandas 2.0):
df = pd.concat([df, df.drop_duplicates("ID").assign(A='b')], ignore_index=True)
# alternative on pandas < 2.0
# df = df.append(df.drop_duplicates("ID").assign(A='b'), ignore_index=True)
print(df)
   ID  A
0   0  z
1   2  z
2   2  y
3   5  x
4   0  b
5   2  b
6   5  b
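For reference, a minimal self-contained version (the frame construction is assumed from the sample above):
import pandas as pd

df = pd.DataFrame({'ID': [0, 2, 2, 5], 'A': ['z', 'z', 'y', 'x']})
new_rows = df.drop_duplicates('ID').assign(A='b')  # one new row per unique ID
df = pd.concat([df, new_rows], ignore_index=True)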
I am trying to merge two datasets using pandas. One is location (longitude and latitude) and the other is a time frame (0 to 24 hrs in 15-minute steps = 96 data points).
Here is the sample code:
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
df = pd.DataFrame([list(s1), list(s2)], columns = ["A", "B", "C"])
timeframe = 15  # assumed step size, per the 15-minute steps mentioned above
timeframe_array = []
for i in range(0, 3600, timeframe):
    timeframe_array.append(i)
And I want to get the data like this:
   A  B  C  time
0  1  2  3     0
1  1  2  3    15
2  1  2  3    30
3  1  2  3    45
...
How can I get the data like this?
While not particularly elegant, this should work:
from __future__ import division # only needed if you're using Python 2
import pandas as pd
from math import ceil
# Constants
timeframe = 15
total_t = 3600
Create df1:
s1 = [1, 2, 3]
s2 = [4, 5, 6]
df1 = pd.DataFrame([s1, s2], columns=['A', 'B', 'C'])
Next, we want to build df2 such that the sequence 0-3600 (step=15) is replicated for each row in df1. We can extract the number of rows with df1.shape[0] (which is 2 in this case).
df2 = pd.DataFrame({'time': range(0, total_t * df1.shape[0], timeframe)})
Next, you need to replicate the rows in df1 to match df2.
factor = ceil(df2.shape[0] / df1.shape[0])
df1_f = pd.concat([df1] * factor).sort_index().reset_index(drop=True)
Lastly, join the two data frames together and trim off any excess rows.
df3 = df1_f.join(df2, how='left')[:df2.shape[0]]
Pandas may have a built-in way to do this, but to my knowledge both join and merge can only make up a difference in rows by filling with a constant (NaN by default).
Result:
>>> print(df3.head(4))
   A  B  C  time
0  1  2  3     0
1  1  2  3    15
2  1  2  3    30
3  1  2  3    45
>>> print(df3.tail(4))
     A  B  C  time
476  4  5  6  7140
477  4  5  6  7155
478  4  5  6  7170
479  4  5  6  7185
>>> df3.shape
(480, 4)
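Pandas 1.2 did add a built-in way: if what you actually want is every location row paired with every timestamp of a single day (the 96 points mentioned in the question) rather than a continuing counter, a cross join is a one-liner. A sketch, reusing df from the question:
times = pd.DataFrame({'time': range(0, 1440, 15)})  # 24 hrs in 15-minute steps = 96 points
df3 = df.merge(times, how='cross')  # pandas >= 1.2: every row paired with every time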
I'm trying to merge two DataFrames, summing column values.
>>> print(df1)
   id name  weight
0   1    A       0
1   2    B      10
2   3    C      10
>>> print(df2)
   id name  weight
0   2    B      15
1   3    C      10
I need the weight values to be summed during the merge for matching values in the common columns.
merge = pd.merge(df1, df2, how='inner')
So the output will be something like the following:
   id name  weight
1   2    B      25
2   3    C      20
This solution also works if you want to sum more than one column. Assume the data frames:
>>> df1
   id name  weight  height
0   1    A       0       5
1   2    B      10      10
2   3    C      10      15
>>> df2
   id name  weight  height
0   2    B      25      20
1   3    C      20      30
You can concatenate them and group by the key columns.
>>> pd.concat([df1, df2]).groupby(['id', 'name']).sum().reset_index()
   id name  weight  height
0   1    A       0       5
1   2    B      35      30
2   3    C      30      45
Alternatively, using the original single-weight frames, an inner merge followed by a row-wise sum:
In [41]: pd.merge(df1, df2, on=['id', 'name']).set_index(['id', 'name']).sum(axis=1)
Out[41]:
id  name
2   B       25
3   C       20
dtype: int64
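To get back a regular DataFrame with a named weight column, reset the index with a name for the summed Series:
In [42]: pd.merge(df1, df2, on=['id', 'name']).set_index(['id', 'name']).sum(axis=1).reset_index(name='weight')
Out[42]:
   id name  weight
0   2    B      25
1   3    C      20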
If you set the common columns as the index, you can just sum the two dataframes, which is much simpler than merging:
In [30]: df1 = df1.set_index(['id', 'name'])
In [31]: df2 = df2.set_index(['id', 'name'])
In [32]: df1 + df2
Out[32]:
         weight
id name
1  A        NaN
2  B         25
3  C         20
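If you want to keep rows that appear in only one of the frames instead of getting NaN, add with a fill value:
In [33]: df1.add(df2, fill_value=0)  # (1, A) stays 0 instead of NaN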