numpy/pandas effective multiplication of arrays/dataframes - python

I have two pandas data frames, which look like this:
import pandas as pd
df_one = pd.DataFrame({
    'A': [1, 1, 2, 3, 4, 4, 4],
    'B1': [0.5, 0.0, 0.2, 0.1, 0.3, 0.2, 0.1],
    'B2': [0.2, 0.3, 0.1, 0.5, 0.3, 0.1, 0.2],
    'B3': [0.1, 0.2, 0.0, 0.9, 0.0, 0.3, 0.5]})
df_two = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'C1': [1.0, 9.0, 2.1, 9.0],
    'C2': [2.0, 3.0, 0.7, 1.1],
    'C3': [5.0, 4.0, 2.3, 3.4]})
df_one
A B1 B2 B3
0 1 0.5 0.2 0.1
1 1 0.0 0.3 0.2
2 2 0.2 0.1 0.0
3 3 0.1 0.5 0.9
4 4 0.3 0.3 0.0
5 4 0.2 0.1 0.3
6 4 0.1 0.2 0.5
df_two
A C1 C2 C3
0 1 1.0 2.0 5.0
1 2 9.0 3.0 4.0
2 3 2.1 0.7 2.3
3 4 9.0 1.1 3.4
What I would like to do is compute a scalar product, multiplying rows of the first data frame by rows of the second data frame, i.e., \sum_i B_i * C_i, but in such a way that a row in the first data frame is multiplied by a row in the second data frame only if the values in the A column match in both frames. I know how to do it with loops and if's, but I would like to do it in a more efficient numpy-like or pandas-like way. Any help much appreciated :)

Not sure if you want unique values for column A (If you do, use groupby on the result below)
pd.merge(df_one, df_two, on='A')
A B1 B2 B3 C1 C2 C3
0 1 0.5 0.2 0.1 1.0 2.0 5.0
1 1 0.0 0.3 0.2 1.0 2.0 5.0
2 2 0.2 0.1 0.0 9.0 3.0 4.0
3 3 0.1 0.5 0.9 2.1 0.7 2.3
4 4 0.3 0.3 0.0 9.0 1.1 3.4
5 4 0.2 0.1 0.3 9.0 1.1 3.4
6 4 0.1 0.2 0.5 9.0 1.1 3.4
pd.merge(df_one, df_two, on='A').apply(lambda s: sum([s['B%d'%i] * s['C%d'%i] for i in range(1, 4)]) , axis=1)
0 1.40
1 1.60
2 2.10
3 2.63
4 3.03
5 2.93
6 2.82
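For larger frames the row-wise apply can be avoided entirely: merge once, then multiply the B block by the C block with vectorized column operations. A minimal sketch, assuming df_one and df_two as defined in the question:
merged = pd.merge(df_one, df_two, on='A')
b_cols = ['B1', 'B2', 'B3']
c_cols = ['C1', 'C2', 'C3']
# element-wise product of the B columns and the C columns, summed across each row
merged['dot'] = (merged[b_cols].values * merged[c_cols].values).sum(axis=1)
This produces the same seven values as the apply above, without a Python-level loop over the rows.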

Another approach would be something similar to this:
import pandas as pd
df_one = pd.DataFrame({
    'A': [1, 1, 2, 3, 4, 4, 4],
    'B1': [0.5, 0.0, 0.2, 0.1, 0.3, 0.2, 0.1],
    'B2': [0.2, 0.3, 0.1, 0.5, 0.3, 0.1, 0.2],
    'B3': [0.1, 0.2, 0.0, 0.9, 0.0, 0.3, 0.5]})
df_two = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'C1': [1.0, 9.0, 2.1, 9.0],
    'C2': [2.0, 3.0, 0.7, 1.1],
    'C3': [5.0, 4.0, 2.3, 3.4]})
lookup = df_two.groupby(df_two.A)

def multiply_rows(row):
    other = lookup.get_group(row['A'])
    # We want every column after "A"
    x = row.values[1:]
    # "other" is a one-row DataFrame, so take its only row, again skipping "A"
    y = other.values[0, 1:]
    return x.dot(y)

# "axis=1" makes each row be passed in, rather than each column
result = df_one.apply(multiply_rows, axis=1)
print(result)
This results in:
0 1.40
1 1.60
2 2.10
3 2.63
4 3.03
5 2.93
6 2.82

I would zip together the rows and use a filter or a comprehension that takes only the row pairs where the A columns match.
Something like
[scalar_product(a, b) for a, b in zip(frame1, frame2) if a[0] == b[0]]
assuming that you're willing to fill in the appropriate material for scalar_product.
(Apologies if I've made a thinko here - this code is for example purposes only and has not been tested!)
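For what it's worth, here is a concrete (and equally untested-in-spirit) sketch of that comprehension idea; it builds a lookup from df_two so the row pairs line up by the A value rather than by position, and scalar_product is a hypothetical helper:
# hypothetical helper, not the answerer's tested code
lookup = {row.A: (row.C1, row.C2, row.C3) for row in df_two.itertuples(index=False)}
def scalar_product(b, c):
    return sum(x * y for x, y in zip(b, c))
result = [scalar_product((row.B1, row.B2, row.B3), lookup[row.A])
          for row in df_one.itertuples(index=False) if row.A in lookup]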

Related

Multiply each value in DataFrame cell according to multi-index name

Given this pandas Dataframe
import pandas as pd
list_index = pd.Series(['A' for i in range(2)] + ['B' for i in range(4)] + ['C' for i in range(3)] + ['D' for i in range(6)], name='indexes')
list_type = pd.Series(['a', 'c'] + ['a', 'b', 'c', 'd'] + ['f', 'g', 'i'] + ['a', 'c', 'd', 'e', 'f', 'g'], name='types')
df = pd.DataFrame({
    'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
}, index=[list_index, list_type])
indexes types value
A a 1
c 2
B a 3
b 4
c 5
d 6
C f 7
g 8
i 9
D a 10
c 11
d 12
e 13
f 14
g 15
I want to multiply each value by a factor (aka ratio) contained in another pandas.DataFrame:
ratio_df = pd.DataFrame({
    'ratio': [0.1, 0.2, 0.4, 0.5]
}, index=['A', 'B', 'C', 'D'])
ratio
A 0.1
B 0.2
C 0.4
D 0.5
So all values in df with 'indexes' == 'A' are multiplied by 0.1, all values with 'indexes' == 'B' are multiplied by 0.2, and so on.
I'm sure there is some smart way to do that, but right now I can't really think of it. I know I can 'expand' ratio_df to the same length as df (with reset_index() and then creating a new column in df containing the ratios) and then simply perform the * operation pairwise, but I'm not sure that's the fastest method.
I also looked at this answer but it's slightly different from my case.
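For comparison with the answers below, the 'expand' idea described above can be written without an explicit loop by mapping the first index level onto the ratios; a sketch, reusing df and ratio_df from the question:
ratios = df.index.get_level_values('indexes').map(ratio_df['ratio'])
df['new'] = df['value'] * ratios.values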
If you just need the product of the two columns, Series.mul can align on an index level.
Select the columns and call mul with the level:
df['value'].mul(ratio_df['ratio'], level='indexes')
or with index level number:
df['value'].mul(ratio_df['ratio'], level=0)
The result is an unnamed Series:
indexes types
A a 0.1
c 0.2
B a 0.6
b 0.8
c 1.0
d 1.2
C f 2.8
g 3.2
i 3.6
D a 5.0
c 5.5
d 6.0
e 6.5
f 7.0
g 7.5
dtype: float64
The resulting Series can be assigned back to df as needed:
df['new'] = df['value'].mul(ratio_df['ratio'], level='indexes')
df:
value new
indexes types
A a 1 0.1
c 2 0.2
B a 3 0.6
b 4 0.8
c 5 1.0
d 6 1.2
C f 7 2.8
g 8 3.2
i 9 3.6
D a 10 5.0
c 11 5.5
d 12 6.0
e 13 6.5
f 14 7.0
g 15 7.5
Rename the ratio column to value then use mul on level=0:
df.mul(ratio_df.rename(columns={'ratio': 'value'}), level=0)
Result
value
indexes types
A a 0.1
c 0.2
B a 0.6
b 0.8
c 1.0
d 1.2
C f 2.8
g 3.2
i 3.6
D a 5.0
c 5.5
d 6.0
e 6.5
f 7.0
g 7.5
Here's one way: reset only the first index level, join, multiply, and set_index back to the original:
out = (df.reset_index(level=1)
         .join(ratio_df)
         .assign(New=lambda x: x['value'] * x['ratio'])
         .set_index('types', append=True))
Output:
value ratio New
types
A a 1 0.1 0.1
c 2 0.1 0.2
B a 3 0.2 0.6
b 4 0.2 0.8
c 5 0.2 1.0
d 6 0.2 1.2
C f 7 0.4 2.8
g 8 0.4 3.2
i 9 0.4 3.6
D a 10 0.5 5.0
c 11 0.5 5.5
d 12 0.5 6.0
e 13 0.5 6.5
f 14 0.5 7.0
g 15 0.5 7.5

Finding common values between two csv file

I have two csv files (say, a and b), and both contain different datasets. The only column common to the two CSV files is id_no. I would like to create a final csv file that combines the data from both CSV files, matching rows on id_no.
a looks like
id_no a1 a2 a3 a4
1 0.5 0.2 0.1 10.20
2 1.5 0.1 0.2 11.25
3 2.5 0.7 0.3 12.90
4 3.5 0.8 0.4 13.19
5 7.5 0.6 0.3 14.21
b looks like
id_no A1
6 10.1
8 2.5
4 12.5
2 20.5
1 2.51
I am looking for a final csv file, say c, that shows the following output:
id_no a1 a2 a3 a4 A1
1 0.5 0.2 0.1 10.20 2.51
2 1.5 0.1 0.2 11.25 20.5
3 2.5 0.7 0.3 12.90 0
4 3.5 0.8 0.4 13.19 12.5
5 7.5 0.6 0.3 14.21 0
Use pandas.merge:
import pandas as pd
a = pd.read_csv("data1.csv")
b = pd.read_csv("data2.csv")
output = a.merge(b, on="id_no", how="left").fillna(0).set_index("id_no")
output.to_csv("output.csv")
>>> output
a1 a2 a3 a4 A1
id_no
1 0.5 0.2 0.1 10.20 2.51
2 1.5 0.1 0.2 11.25 20.50
3 2.5 0.7 0.3 12.90 0.00
4 3.5 0.8 0.4 13.19 12.50
5 7.5 0.6 0.3 14.21 0.00
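If instead only the rows whose id_no appears in both files are wanted (a strict intersection, rather than keeping every row of a and filling 0), the default inner join does it; a one-line variant of the merge above:
common = a.merge(b, on="id_no", how="inner")  # keeps only ids present in both files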
Using plain old python:
from csv import reader, writer
from pathlib import Path

# build a lookup of id_no -> A1 from the second file
with Path("file2.csv").open() as f:
    r = reader(f)
    header_b = next(r)
    data = {row[0]: row[1] for row in r}

rows = []
with Path("file1.csv").open() as f:
    r = reader(f)
    # final header: all of file1's columns plus file2's value column
    header = next(r) + header_b[-1:]
    for i, *row in r:
        if i in data:
            rows.append([i] + row + [data[i]])
        else:
            rows.append([i] + row + [0])

with Path("output.csv").open("w", newline="") as f:
    w = writer(f)
    w.writerow(header)
    w.writerows(rows)

Unexplained behavior with Pandas Split (group) + Apply + Rejoin (concat), but only when sorting

I'm trying to calculate distances between a column and its lag (shift) for groups in a Pandas dataframe. The groups need to be sorted so that the shift is one timestep before. The standard way to do this is to .groupby() (aka split), then .apply() the distance function over each group, then rejoin with .concat(). This works fine, but only when I don't explicitly sort the grouped dataframe. When I sort the grouped dataframe, I get an error in the rejoining step.
Here's my example code, for which I was able to reproduce the unexpected behavior:
import pandas as pd

def dist_apply(group):
    # when commented out, this code will run to completion (!)
    group.sort_values(by='T', inplace=True)
    group['shift'] = group['Y'].shift()
    group['dist'] = group['Y'] - group['shift']
    return group

df = pd.DataFrame({'X': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
                   'Y': [7, 1, 8, 3, 9, 5]})
print(df)
# split
df_g = df.groupby(['X'])
# apply
df_g = df_g.apply(dist_apply)
print(df_g)
# rejoin
df = pd.concat([df, df_g], axis=1)
print(df)
When the code that sorts the grouped dataframe is commented out, then the code prints this, which is expected:
X T Y
0 A 0.9 7
1 B 0.8 1
2 A 0.7 8
3 B 0.9 3
4 A 0.8 9
5 B 0.7 5
X T Y shift dist
0 A 0.9 7 NaN NaN
1 B 0.8 1 NaN NaN
2 A 0.7 8 7.0 1.0
3 B 0.9 3 1.0 2.0
4 A 0.8 9 8.0 1.0
5 B 0.7 5 3.0 2.0
X T Y X T Y shift dist
0 A 0.9 7 A 0.9 7 NaN NaN
1 B 0.8 1 B 0.8 1 NaN NaN
2 A 0.7 8 A 0.7 8 7.0 1.0
3 B 0.9 3 B 0.9 3 1.0 2.0
4 A 0.8 9 A 0.8 9 8.0 1.0
5 B 0.7 5 B 0.7 5 3.0 2.0
With the sorting line, the Traceback looks like this:
Traceback (most recent call last):
File "test.py", line 19, in <module>
df = pd.concat([df,df_g],axis=1)
File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 229, in concat
return op.get_result()
File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 420, in get_result
indexers[ax] = obj_labels.reindex(new_labels)[1]
File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2236, in reindex
target = MultiIndex.from_tuples(target)
File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 396, in from_tuples
arrays = list(lib.tuples_to_object_array(tuples).T)
File "pandas/_libs/lib.pyx", line 2287, in pandas._libs.lib.tuples_to_object_array
TypeError: object of type 'int' has no len()
Sorting but not running the concat prints me this for df_g:
X T Y shift dist
X
A 2 A 0.7 8 NaN NaN
4 A 0.8 9 8.0 1.0
0 A 0.9 7 9.0 -2.0
B 5 B 0.7 5 NaN NaN
1 B 0.8 1 5.0 -4.0
3 B 0.9 3 1.0 2.0
which shows that it's grouped differently than the printing of df_g without the sorting (above), but it's not clear how the concat is breaking in this case.
Update: I thought I had solved it by renaming the offending column ('X' in this case) and also using .reset_index() on the grouped dataframe before the merge.
df_g.columns = ['X_g','T','Y','shift','dist']
df = pd.concat([df,df_g.reset_index()],axis=1)
runs as expected, and prints this:
X T Y X level_1 X_g T Y shift dist
0 A 0.9 7 A 2 A 0.7 8 NaN NaN
1 B 0.8 1 A 4 A 0.8 9 8.0 1.0
2 A 0.7 8 A 0 A 0.9 7 9.0 -2.0
3 B 0.9 3 B 5 B 0.7 5 NaN NaN
4 A 0.8 9 B 1 B 0.8 1 5.0 -4.0
5 B 0.7 5 B 3 B 0.9 3 1.0 2.0
But looking closely, this column shows that the merge is done incorrectly:
1 B 0.8 1 A 4 A 0.8 9 8.0 1.0
I'm using Mac OSX with Python 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:05:27)
Pandas 0.24.2 + Numpy 1.17.3
and also tried upgrading to Pandas 0.25.3 and Numpy 1.17.5 with the same result.
This is tentatively working.
Rename columns to avoid duplicate:
df_g.columns = ['X_g','T','Y','shift','dist']
Reset index to single from multiindex:
df_g = df_g.reset_index(level=[0,1])
Simple merge, put df_g first if you want to keep the sorted-group order:
df = pd.merge(df_g,df)
gives me
X level_1 X_g T Y shift dist
0 A 2 A 0.7 8 NaN NaN
1 A 4 A 0.8 9 8.0 1.0
2 A 0 A 0.9 7 9.0 -2.0
3 B 5 B 0.7 5 NaN NaN
4 B 1 B 0.8 1 5.0 -4.0
5 B 3 B 0.9 3 1.0 2.0
Full code:
import pandas as pd

def dist_apply(group):
    group.sort_values(by='T', inplace=True)
    group['shift'] = group['Y'].shift()
    group['dist'] = group['Y'] - group['shift']
    return group

df = pd.DataFrame({'X': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
                   'Y': [7, 1, 8, 3, 9, 5]})
print(df)
df_g = df.groupby(['X'])
df_g = df_g.apply(dist_apply)
#print(df_g)
df_g.columns = ['X_g', 'T', 'Y', 'shift', 'dist']
df_g = df_g.reset_index(level=[0, 1])
#print(df_g)
df = pd.merge(df_g, df)
print(df)
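As an aside, the same per-group lagged distance can be computed without apply or concat at all, by sorting once and letting groupby align the shift; a sketch with the same data, assuming a reasonably recent pandas:
import pandas as pd
df = pd.DataFrame({'X': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
                   'Y': [7, 1, 8, 3, 9, 5]})
df = df.sort_values(['X', 'T'])
# shift within each group of X, so the lag never crosses group boundaries
df['shift'] = df.groupby('X')['Y'].shift()
df['dist'] = df['Y'] - df['shift']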

Python pandas data frame reshape

The data shown below is a simplified example. The actual data frame has 3750 rows and 2 columns. I need to reshape the data frame into another structure.
A A2
0.1 1
0.4 2
0.6 3
B B2
0.8 1
0.7 2
0.9 3
C C2
0.3 1
0.6 2
0.8 3
How can I reshape the above data frame horizontally, as follows:
A A2 B B2 C C2
0.1 1 0.8 1 0.3 1
0.4 2 0.7 2 0.6 2
0.6 3 0.9 3 0.8 3
You can reshape your data and create a new dataframe:
cols = 6
rows = 4
df = pd.DataFrame(df.values.T.reshape(cols,rows).T)
df.rename(columns=df.iloc[0]).drop(0)
A B C A2 B2 C2
1 0.1 0.8 0.3 1 1 1
2 0.4 0.7 0.6 2 2 2
3 0.6 0.9 0.8 3 3 3
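If the interleaved column order asked for in the question (A A2 B B2 C C2) matters, the columns can simply be reselected after the reshape; a small follow-up sketch based on the code above:
out = df.rename(columns=df.iloc[0]).drop(0)
out = out[['A', 'A2', 'B', 'B2', 'C', 'C2']]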
Try this if you don't want to hard-code the values:
import numpy as np
import pandas as pd

df['header'] = pd.to_numeric(df[0], errors='coerce')
l = df['header'].values
m_l = l.reshape((np.isnan(l).sum(), -1))[:, 1:]
h = df[df['header'].isnull()][0].values
print(pd.DataFrame(dict(zip(h, m_l))))
Output:
A B C
0 0.1 0.8 0.3
1 0.4 0.7 0.6
2 0.6 0.9 0.8

Create multiple dataframes based on the original dataframe columns number

I've searched for quite a while, but I haven't found any similar question. If there is one, please let me know!
I am currently trying to divide one dataframe into n dataframes, where n is equal to the number of columns of the original dataframe. All the new resulting dataframes must always keep the first column of the original dataframe. As an extra, it would be nice to gather them all together in a list, for example, for further access.
In order to visualize my intention, here is a brief example:
>> original df
GeneID A B C D E
1 0.3 0.2 0.6 0.4 0.8
2 0.5 0.3 0.1 0.2 0.6
3 0.4 0.1 0.5 0.1 0.3
4 0.9 0.7 0.1 0.6 0.7
5 0.1 0.4 0.7 0.2 0.5
My desired output would be something like this:
>> df1
GeneID A
1 0.3
2 0.5
3 0.4
4 0.9
5 0.1
>> df2
GeneID B
1 0.2
2 0.3
3 0.1
4 0.7
5 0.4
....
And so on, until all the columns from the original dataframe be covered.
What would be the best solution?
You can use df.columns to get all column names and then create sub-dataframes:
outdflist = []
# for each column beyond the first:
for col in oridf.columns[1:]:
    # create a sub-dataframe with the desired columns:
    subdf = oridf[['GeneID', col]]
    # append subdf to the list of dataframes:
    outdflist.append(subdf)

# to view all dataframes created:
for df in outdflist:
    print(df)
Output:
GeneID A
0 1 0.3
1 2 0.5
2 3 0.4
3 4 0.9
4 5 0.1
GeneID B
0 1 0.2
1 2 0.3
2 3 0.1
3 4 0.7
4 5 0.4
GeneID C
0 1 0.6
1 2 0.1
2 3 0.5
3 4 0.1
4 5 0.7
GeneID D
0 1 0.4
1 2 0.2
2 3 0.1
3 4 0.6
4 5 0.2
GeneID E
0 1 0.8
1 2 0.6
2 3 0.3
3 4 0.7
4 5 0.5
The above for loop can also be written more simply as a list comprehension:
outdflist = [oridf[['GeneID', col]] for col in oridf.columns[1:]]
You can do it with groupby:
d = {'df' + str(x): y for x, y in df.groupby(level=0, axis=1)}
d
Out[989]:
{'dfA': A
0 0.3
1 0.5
2 0.4
3 0.9
4 0.1, 'dfB': B
0 0.2
1 0.3
2 0.1
3 0.7
4 0.4, 'dfC': C
0 0.6
1 0.1
2 0.5
3 0.1
4 0.7, 'dfD': D
0 0.4
1 0.2
2 0.1
3 0.6
4 0.2, 'dfE': E
0 0.8
1 0.6
2 0.3
3 0.7
4 0.5, 'dfGeneID': GeneID
0 1
1 2
2 3
3 4
4 5}
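Note that the groupby split above puts GeneID into a frame of its own; if each piece should keep GeneID next to its value column, a dict comprehension over the remaining columns does it (a sketch, reusing oridf from the first answer):
d = {'df' + col: oridf[['GeneID', col]] for col in oridf.columns[1:]}
d['dfA']   # GeneID paired with column A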
You can create a list of column names, then loop through it, creating a new DataFrame on each iteration.
>>> import pandas as pd
>>> d = {'col1':[1,2,3], 'col2':[3,4,5], 'col3':[6,7,8]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2 col3
0 1 3 6
1 2 4 7
2 3 5 8
>>> newstuff=[]
>>> columns = list(df)
>>> for column in columns:
... newstuff.append(pd.DataFrame(data=df[column]))
Unless your dataframe is unreasonably massive, the above code should do the job.
