Multiply two dataframes with same column names but different index - python

I have two dataframes: one with data, and one with a list of forecasting assumptions. The column names correspond, but the index levels do not (by design). How can I multiply columns A, B, and C in the first dataframe by the relevant columns in df2, as in my example below, while leaving the remainder of the original dataframe (column D) intact? Thanks!
df1list=[1,2,3,4,5,6,7,8,9,10]
df2list=[2017]
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4)), columns=list('ABCD'), index=list(df1list))
df2 = pd.DataFrame(np.random.randint(1,4,size=(1, 3)), columns=list('ABC'), index=list(df2list))

>>> df[['A','B','C']] * df2.values
       A    B    C
1     81  168  116
2     21    8    6
3    147  108   52
4     54   64  114
5     48   16   20
6     72  116   12
7     36  188  178
8     90   96  162
9     63  166  156
10   120   22   10
So to overwrite you can do:
df.loc[:,['A','B','C']] = df[['A','B','C']] * df2.values
And I guess to be more programmatic you can do:
df[df2.columns] *= df2.values
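A label-aware variant (a sketch, not from the original answer): squeeze the one-row df2 into a Series so the multiplication aligns on column names rather than on position, which protects you if the column orders ever diverge:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(10, 4)),
                  columns=list('ABCD'), index=range(1, 11))
df2 = pd.DataFrame(rng.integers(1, 4, size=(1, 3)),
                   columns=list('ABC'), index=[2017])
before = df.copy()

# squeeze() collapses the one-row frame into a Series indexed by column
# name, so mul() aligns on labels rather than column position
df[df2.columns] = df[df2.columns].mul(df2.squeeze(), axis=1)

print(df)  # A, B, C scaled; D untouched
```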

Related

How to select the "locals" min and max out of a list in a panda DataFrame

I'm struggling to figure out how to do the following:
I have a dataframe that looks like this (it's a little more complicated; this is just an example):
df = pd.DataFrame({'id' : ['id1','id2'], 'coverage' : ['1 2 3 4 5 6 7 8 9 10 11 13 14 15 16 17 18 19 20 40 41 42 43 44 45 46 47 48 49 50','1 2 3 4 5 6 7 8 9 10 100 101 102 103 104 105 106 107 108 109 110']})
And I want to generate a new key that only holds the min-max of every segment, basically it should look like this:
id coverage
0 id1 1 11 13 20 40 50
1 id2 1 10 100 110
It's a simple problem but I can't come up with any solutions, I know that map(lambda x:) could work...
Thanks!
Let's try:
# split the values and convert to integers
s = df['coverage'].str.split().explode().astype(int)
# continuous blocks
blocks = s.diff().ne(1).groupby(level=0).cumsum()
df['coverage'] = (s.groupby([s.index, blocks])
                   .agg(['min', 'max'])
                   .astype(str).agg(' '.join, axis=1)
                   .groupby(level=0).agg(' '.join))
First, split the strings and explode into one long Series, keeping the original index. Then take the difference between successive values within each group and flag where it is not equal to 1; each flag marks the start of a new contiguous block.
Alternatively, slice the exploded Series by this mask OR'd with the mask shifted back by one (keeping both the start and end points), then groupby and agg(list) (or ' '.join) to get your output:
# convert the exploded strings to numbers
s = pd.to_numeric(df.set_index('id')['coverage'].str.split().explode())
m = s.groupby(level=0).diff().ne(1)   # True at each block start
result = s[m | m.shift(-1).fillna(True)].groupby(level=0).agg(list)
id
id1 [1, 11, 13, 20, 40, 50]
id2 [1, 10, 100, 110]
Name: coverage, dtype: object
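For reference, the first (string-output) variant can be put together end-to-end like this (a sketch of the same steps, reconstructing the example frame from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['id1', 'id2'],
    'coverage': ['1 2 3 4 5 6 7 8 9 10 11 13 14 15 16 17 18 19 20 '
                 '40 41 42 43 44 45 46 47 48 49 50',
                 '1 2 3 4 5 6 7 8 9 10 100 101 102 103 104 '
                 '105 106 107 108 109 110']})

# split the values and convert to integers
s = df['coverage'].str.split().explode().astype(int)
# label contiguous runs: a new block starts wherever the step is not 1
blocks = s.diff().ne(1).groupby(level=0).cumsum()
# min/max per block, stringified and re-joined per original row
df['coverage'] = (s.groupby([s.index, blocks])
                   .agg(['min', 'max'])
                   .astype(str).agg(' '.join, axis=1)
                   .groupby(level=0).agg(' '.join))
print(df)
```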

Multiply each element of a column by each element of a different dataframe

I have two data frames with the same columns. The first has multiple rows; the second has only one row. I need to multiply the entries of the first data frame by those of the second, matching by column name.
DF:1
A B C
0 34 54 56
1 12 87 78
2 78 35 0
3 84 25 14
4 26 82 13
DF:2
A B C
0 2 3 1
Result
A B C
68 162 56
24 261 78
156 105 0
168 75 14
52 246 13
This will work; it multiplies the underlying NumPy arrays and rebuilds the DataFrame:
pd.DataFrame(df1.values*df2.values, columns=df1.columns, index=df1.index)
for col in df1.columns:
    df1[col] = df1[col].apply(lambda x: x * df2[col].iloc[0])
There's probably an even simpler solution. Play around with apply, map and transform.
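One such simpler route (a sketch assuming the example frames above): squeeze the one-row df2 into a Series and let pandas broadcast it across every row of df1, aligning on the column labels:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [34, 12, 78, 84, 26],
                    'B': [54, 87, 35, 25, 82],
                    'C': [56, 78, 0, 14, 13]})
df2 = pd.DataFrame({'A': [2], 'B': [3], 'C': [1]})

# mul with axis=1 broadcasts the squeezed one-row Series across df1's rows
result = df1.mul(df2.squeeze(), axis=1)
print(result)
```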

Determine the number of unique values between consecutive rows

I have one dataframe and want to add a column that counts how many values in each row are new relative to the previous row (the order within a row doesn't matter).
For instance, if row A is 12, 22, 5, 7 and row B is 22, 7, 3, 6, then the number for B is 2: rows A and B share 22 and 7 (although in a different order), and B has the two new numbers 3 and 6. So we add a column that records, for each row, the difference from the row above.
df = pd.DataFrame({'X': [22, 7, 43, 44, 56,67,7,38,29,130],'Y': [5,3,330,140,250,10,207,320,420,50],'Z': [7,6,136,144,312,10,82,63,42,12],'T':[12, 22, 4, 424, 256,167,27,38,229,30]},index=list('ABCDEFGHIJ'))
Thanks.
John Galt in his (now, unfortunately deleted) answer was on the right track with set operations.
In addition, accounting for duplicates will involve:
s = df.apply(set, axis=1)   # each row as a set of its values
# new values vs. the previous row, plus a correction for in-row duplicates
df['diffs'] = s.diff().fillna('').str.len() + (4 - s.str.len())
df
T X Y Z diffs
A 12 22 5 7 0
B 22 7 3 6 2
C 4 43 330 136 4
D 424 44 140 144 4
E 256 56 250 312 4
F 167 67 10 10 4
G 27 7 207 82 4
H 38 38 320 63 4
I 229 29 420 42 4
J 30 130 50 12 4
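Putting it together as a runnable sketch (reconstructing the example frame from the question; Series.diff on a column of sets computes elementwise set differences):

```python
import pandas as pd

df = pd.DataFrame({'X': [22, 7, 43, 44, 56, 67, 7, 38, 29, 130],
                   'Y': [5, 3, 330, 140, 250, 10, 207, 320, 420, 50],
                   'Z': [7, 6, 136, 144, 312, 10, 82, 63, 42, 12],
                   'T': [12, 22, 4, 424, 256, 167, 27, 38, 229, 30]},
                  index=list('ABCDEFGHIJ'))

s = df.apply(set, axis=1)   # each row as a set of its values
# count values not in the previous row's set, then add back anything
# "lost" to within-row duplicates (4 minus the size of the row's set)
df['diffs'] = s.diff().fillna('').str.len() + (4 - s.str.len())
print(df)
```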

How can I extract numeric ranges from 2 columns and print the range from both columns as tuples?

I'm quite new to bash scripting and Python programming; at the moment I have 2 columns which contain numeric sequences, as follows:
Col 1:
1
2
3
5
7
8
Col 2:
101
102
103
105
107
108
I need to extract the numeric ranges from both columns and print them according to where a sequence break occurs in either of the 2 columns; the result should be as follows:
1,3,101,103
5,5,105,105
7,8,107,108
I already received a useful suggestion for extracting numeric ranges from one column using awk:
awk 'NR==1||sqrt(($0-p)*($0-p))>1{print p; printf "%s", $0 ", "} {p=$0} END{print $0}' file
But now the problem is a bit more complex, as I have to include a second column with another numeric sequence, and the result must give the ranges from both columns wherever a sequence break occurs in either of the 2 columns.
To add a bit more complexity the sequences can be ascending and/or descending.
Trying to find a solution using pandas (data frames) and numpy libraries for python.
Thanks in advance.
Hello MaxU, thanks for your reply; unfortunately I'm hitting an issue with the following case:
Col 1:
7
8
9
10
11
Col 2:
52
51
47
46
45
Here the numeric sequence in the second column is descending from the beginning, and it generates as a result:
7,11,45,52
instead of:
7,8,51,52
8,11,45,47
Cheers.
UPDATE:
In [103]: df
Out[103]:
Col1 Col2
0 7 52
1 8 51
2 9 47
3 10 46
4 11 45
In [104]: (df.groupby((df.diff().abs() != 1).any(axis=1).cumsum()).agg(['min','max']))
Out[104]:
Col1 Col2
min max min max
1 7 8 51 52
2 9 11 45 47
OLD answer:
Here is one way (among many) to do it in Pandas:
Data:
In [314]: df
Out[314]:
Col1 Col2
0 1 101
1 2 102
2 3 103
3 5 105
4 8 108
5 7 107
6 6 106
7 9 109
NOTE: pay attention - the rows with indexes 4, 5 and 6 form a descending sequence
Solution:
In [350]: rslt = (df.groupby((df.diff().abs() != 1).all(axis=1).cumsum())
     ...:           .agg(['min','max']))
     ...:
In [351]: rslt
Out[351]:
Col1 Col2
min max min max
1 1 3 101 103
2 5 5 105 105
3 6 8 106 108
4 9 9 109 109
Now you can easily save it to a CSV file:
rslt.to_csv(r'/path/to/file_name.csv', index=False, header=None)
or just print it:
In [333]: print(rslt.to_csv(index=False, header=None))
1,3,101,103
5,5,105,105
6,8,106,108
9,9,109,109
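The updated grouping can be verified against the problematic descending case (a minimal sketch reconstructing the frame from the comment above):

```python
import pandas as pd

df = pd.DataFrame({'Col1': [7, 8, 9, 10, 11],
                   'Col2': [52, 51, 47, 46, 45]})

# start a new group wherever either column's step is not +/-1
breaks = (df.diff().abs() != 1).any(axis=1).cumsum()
rslt = df.groupby(breaks).agg(['min', 'max'])
print(rslt.to_csv(index=False, header=False))
```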

Pandas: Faster way to iterate through a dataframe and add new data based on operations

I want pandas to look at the values in 2 columns of each row of df1, look up the match in another dataframe df2, paste it into a new column of df1 in the same row, and continue:
alp = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
df1['NewCol'] = np.random.choice(alp)  # create new col and fill with a placeholder
for i in range(len(df1['code1'])):
    a = df1['code2'].iloc[i].upper()
    b = df1['code1'].str[-3:].iloc[i]
    df1['NewCol'].iloc[i] = df2.loc[b, a]
df1['code3'] = df1[['code3', 'NewCol']].max(axis=1)
df1 = df1.drop('NewCol', axis=1)
My inputs as below:
df1:
code1 code2 code3
0 XXXHYG a 12
1 XXXTBG a 23
2 XXXECT b 34
3 XXXKOL b 45
4 XXXBTW c 56
df2:
A B C D E
HYG 33 38 40 41 30
TBG 20 46 41 43 45
ECT 53 42 39 34 45
KOL 45 51 54 47 30
BTW 37 36 49 48 58
output needed:
code1 code2 code3
0 XXXHYG a 33
1 XXXTBG a 23
2 XXXECT b 42
3 XXXKOL b 51
4 XXXBTW c 56
When I run this over just 4200 rows in df1, the loop alone takes 222 seconds. There has got to be a way to utilize the power of pandas to do this faster?
thanks a lot for your time!
You could use apply (docs), but a faster way to do this if you have a lot of data would be to create a version of df2 with a MultiIndex (docs) using stack (docs) and look up values on this new DataFrame.
df3 = df1.copy()
tups = list(zip(df1['code1'].str[-3:], df1['code2'].str.upper()))
df3['code3'] = df2.stack()[tups].values
print(df3)
outputs
code1 code2 code3
0 XXXHYG a 33
1 XXXTBG a 20
2 XXXECT b 42
3 XXXKOL b 51
4 XXXBTW c 49
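Note the stacked lookup above replaces only the loop's lookup step; to also reproduce the original loop's final max against the existing code3 (as in the desired output), one could combine it with np.maximum. A sketch reconstructing the example frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'code1': ['XXXHYG', 'XXXTBG', 'XXXECT', 'XXXKOL', 'XXXBTW'],
                    'code2': list('aabbc'),
                    'code3': [12, 23, 34, 45, 56]})
df2 = pd.DataFrame([[33, 38, 40, 41, 30],
                    [20, 46, 41, 43, 45],
                    [53, 42, 39, 34, 45],
                    [45, 51, 54, 47, 30],
                    [37, 36, 49, 48, 58]],
                   index=['HYG', 'TBG', 'ECT', 'KOL', 'BTW'],
                   columns=list('ABCDE'))

# vectorised lookup via the stacked (row, column) MultiIndex
tups = list(zip(df1['code1'].str[-3:], df1['code2'].str.upper()))
looked_up = df2.stack()[tups].values

# keep whichever is larger: the existing code3 or the looked-up value
df1['code3'] = np.maximum(df1['code3'], looked_up)
print(df1)
```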
