Multiply multi-dimensional matrix to get new dataframe with new column names - python

I created 2 DataFrames with shapes [6,2] and [3,2]. I want to multiply the 2 DataFrames to get a [6,3] matrix. I am using the loop below, but it gives me a return self._getitem_column(key) error. Below is an example.
df1 = pd.DataFrame({0: [1, 2, 3, 4, 5, 6],
                    1: [23, 24, 25, 26, 27, 28]})
df2 = pd.DataFrame({0: [1, 2, 3],
                    1: [12, 13, 14]})
for j in range(len(df2)):
    for i in range(len(df1)):
        df3 = (df1[i, 2] * df2[j, 2])
#expected result
df3 =
0    1    2    3
1  276  299  322
2  288  312  336
3  300  325  350
4  312  338  364
5  324  351  378
6  336  364  392
I am trying to replicate what I did in an Excel sheet.

It might be easier to leave it out of dataframes altogether, unless you have the information in dataframes currently (in which case, write back and I'll show you how to do that).
For now, this might be easier:
list1 = list(range(23, 29)) # note that you have to go one higher to include 28
list2 = list(range(12, 15)) # same deal
outputlist = []
for i in list1:
    for j in list2:
        outputlist.append(i * j)
import numpy as np
outputlist = np.array(outputlist).reshape(len(list1), len(list2))
import pandas as pd
df3 = pd.DataFrame(outputlist)
EDIT: Ok, this might get you where you need to go, then:
list3 = []
for i in range(len(df1)):
    for j in range(len(df2)):
        list3.append(df1.loc[i+1, 0] * df2.loc[j+1, 0])
import numpy as np
list3 = np.array(list3).reshape(len(df1), len(df2))
df3 = pd.DataFrame(list3)
EDIT AGAIN: Try this! Just make sure you replace "thenameofthecolumnindf1" with the actual name of the column in df1 that you're interested in, etc.
import numpy as np
list3 = []
for i in df1["thenameofthecolumnindf1"]:
    for j in df2["thenameofthecolumnindf2"]:
        list3.append(i * j)
list3 = np.array(list3).reshape(len(df1), len(df2))
df3 = pd.DataFrame(list3)
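As an aside, if all you need is every pairwise product of the two value columns, the nested loop can also be replaced by a single vectorized call. A minimal sketch, assuming the values of interest sit in the second column of each frame (as in the example data):
import numpy as np
import pandas as pd
# np.outer builds the full 6x3 table of pairwise products in one call;
# iloc[:, 1] is only an assumption about which column holds the values.
df3 = pd.DataFrame(np.outer(df1.iloc[:, 1], df2.iloc[:, 1]))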

The math for this simply won't work as written. For matrix multiplication, the number of columns in the first matrix (2) must equal the number of rows in the second matrix (3), so a [6,2] by [3,2] product is not defined. You're likely getting the key/indexing error because of that row/column mismatch.
You'll have to get the three different dimensions (6, 2 and 3) to line up properly in order to multiply the matrices, not just index pairs of values as is done above.
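To make the shape rule concrete, here is a small hedged sketch on toy arrays (not the OP's data) showing that the product only becomes defined once the second matrix is transposed:
import numpy as np
A = np.ones((6, 2))
B = np.ones((3, 2))
# A @ B raises ValueError: the inner dimensions (2 and 3) do not match.
# Transposing B gives shapes (6, 2) @ (2, 3), which yields a (6, 3) result.
C = A @ B.T
print(C.shape)  # (6, 3)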

Related

Calculating the Difference in values in a dataframe

I have a dataframe that looks like this:
index Rod_1 label
0 [[1.94559799] [1.94498416] [1.94618273] ... [1.8941952 ] [1.89461277] [1.89435902]] F0
1 [[1.94129488] [1.94268905] [1.94327065] ... [1.93593512] [1.93689935] [1.93802091]] F0
2 [[1.94034818] [1.93996006] [1.93940095] ... [1.92700882] [1.92514855] [1.92449449]] F0
3 [[1.95784532] [1.96333782] [1.96036528] ... [1.94958261] [1.95199495] [1.95308231]] F2
Each cell in the Rod_1 column has an array of 12 million values. I'm trying to calculate the difference between every two consecutive values in this array to remove seasonality. That way my model will perform better, potentially.
This is the code that I've written:
interval = 1
for j in range(0, len(df_all['Rod_1'])):
    for i in range(1, len(df_all['Rod_1'][0])):
        df_all['Rod_1'][j][i - interval] = df_all['Rod_1'][j][i] - df_all['Rod_1'][j][i - interval]
I have 45 rows, and as I said each cell has 12 million values, so it takes my laptop 20 min to calculate this. Is there a faster way to do this?
Thanks in advance.
This should be much faster. I've tested it with up to 1M elements per cell for 10 rows, which took 1.5 seconds to calculate the diffs (but a lot longer to build the test table).
import pandas as pd
import numpy as np
import time
#Create test data
np.random.seed(1)
num_rows = 10
rod1_array_lens = 5 #I tried with this at 1000000
possible_labels = ['F0','F1']
df = pd.DataFrame({
    'Rod_1':[[[np.random.randint(10)] for _ in range(rod1_array_lens)] for _ in range(num_rows)],
    'label':np.random.choice(possible_labels, num_rows)
})
#flatten Rod_1 from [[1],[2],[3]] --> [1,2,3]
#then use np.roll to make the diffs, throwing away the last element since it rolls over
start = time.time() #starting timing now
df['flat_Rod_1'] = df['Rod_1'].apply(lambda v: np.array([z for x in v for z in x]))
df['diffs'] = df['flat_Rod_1'].apply(lambda v: (np.roll(v,-1)-v)[:-1])
print('Took',time.time()-start,'to calculate diff')
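As a possible further simplification (an untested sketch on my part), NumPy's built-in np.diff computes the same consecutive differences in one call:
# np.diff(v) returns v[1:] - v[:-1], i.e. the same values as (np.roll(v, -1) - v)[:-1]
df['diffs'] = df['flat_Rod_1'].apply(np.diff)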

Am I using groupby.sum() correctly?

I have the following code, and a problem in the new_df["SUM"] line:
import pandas as pd
df = pd.read_excel(r"D:\Tesina\Proteoma Humano\Tablas\uno - copia.xlsx")
#df = pd.DataFrame({'ID': ['C9JLR9','O95391', 'P05114',"P14866"], 'SEQ': ['1..100,182..250,329..417,490..583', '1..100,206..254,493..586', '1..100', "1..100,284..378" ]})
df2 = pd.DataFrame
df["SEQ"] = df["SEQ"].replace("\.\."," ", regex =True)
new_df = df.assign(SEQ=df.SEQ.str.split(',')).explode('SEQ')
for index, row in df.iterrows():
    new_df['delta'] = new_df['SEQ'].map(lambda x: (int(x.split()[1])+1)-int(x.split()[0]) if x.split()[0] != '1' else (int(x.split()[1])+1))
    new_df["SUM"] = new_df.groupby(["ID"]).sum().reset_index(drop=True) #Here's the error, even though I can't see where
    df2 = new_df.groupby(["ID","SUM"], sort=False)["SEQ"].apply((lambda x: ','.join(x.astype(str)))).reset_index(name="SEQ")
To give some context, this is what the code does: it grabs every line with the same ID, separates the numbers with a "," in between, does some math with those numbers (that's where the "delta" line gets involved, which I know isn't really a delta), and finally sums up all the "delta" values for each ID, grouping them by their original ID so that I keep the same number of rows.
And when I use a sample of the data (the one that's commented out at the beginning), it works perfectly, giving me the output that I want:
       ID  SUM                            SEQ
0  C9JLR9  353  1 100,182 250,329 417,490 583
1  O95391  244          1 100,206 254,493 586
2  P05114  101                          1 100
3  P14866  196                  1 100,284 378
But when I apply it to my Excel file (which has 10471 rows), the groupby.sum() line doesn't work as it's supposed to (I've already checked everything else, so I know the error is within that line).
This is the output that I receive:
       ID  SUM                            SEQ
0  C9JLR9   39  1 100,182 250,329 417,490 583
1  O95391   20          1 100,206 254,493 586
2  P05114   33                          1 100
4  P98177   21                  1 100,176 246
You can clearly see that the SUM values differ (and are not correct at all). I haven't been able to figure out where those numbers come from, also. It's really weird.
If anyone is interested, the solution was provided in the comments: I had to change the line with the following:
new_df["SUM"] = new_df.groupby("ID")["delta"].transform("sum")

How to reshape a 183,223,040x4 matrix into 140 matrices of dimensions 1145x1145 without MemoryError?

I have a matrix of dimensions 183,223,040x4 with the variables showed below. There are 140 different values in 'REG', and 1145 different values of both 'SAMAC' and 'SAMAC.1'
I want to iterate over REG to get 140 matrices of size 1145*1145, with the right 'VALUE' in it.
I have tried the following:
- loop over countries
- create an empty 1145*1145 matrix, indexed with SAMAC and with column names SAMAC.1
- go line by line through the current dataframe
- check the value of SAMAC (rows) and SAMAC.1 (columns)
- locate SAMAC and SAMAC.1 in the empty matrix and assign the corresponding VALUE
import pandas as pd
import dask.dataframe as dd
all_sam=dd.read_csv(r'C:\GP2\all_sams_trial.csv',skiprows=1)
all_sam.head()
SAMAC SAMAC.1 REG Value
0 m_pdr m_pdr aus 0.0
1 m_wht m_pdr aus 0.0
2 m_gro m_pdr aus 0.0
3 m_v_f m_pdr aus 0.0
4 m_osd m_pdr aus 0.0
countries=list(all_sam["REG"].unique().compute())
col_names=list(all_sam["SAMAC"].unique().compute())
for country in countries:
    df = pd.DataFrame(0, index=col_names, columns=col_names)
    sam = all_sam[all_sam["REG"]==country].compute()
    for index, row in sam.iterrows():
        row_index = str(row["SAMAC"])
        col_index = str(row["SAMAC.1"])
        df.loc[row_index, col_index] = row['Value']
    print(index)
    df.to_csv(country+"_SAM.csv")
The problem is that it takes way too long to compute (around 2 days). Is there a way to speed this up?
Update 1: After understanding that the OP's problem is slow computation caused by the large size of the dataframe, here's the update.
Check the dtypes of columns using all_sam.dtypes and the size (in Mb) of your dataframe using:
all_sam.memory_usage(deep=True) / 1024 ** 2
Consider changing the column name 'SAMAC.1' to 'SAMAC_1', as the dot could cause errors in the following lines. Before processing, change the dtypes of 'REG', 'SAMAC' and 'SAMAC_1' to 'category':
all_sam.REG = all_sam.REG.astype('category')
all_sam.SAMAC = all_sam.SAMAC.astype('category')
all_sam.SAMAC_1 = all_sam.SAMAC_1.astype('category')
Depending on your requirement, you can downcast the dtype of the 'Value' column to float16, int16, int8, etc. using the below code:
all_sam.Value = all_sam.Value.astype('float16')
Check the size again.
all_sam.memory_usage(deep=True) / 1024 ** 2
Hopefully, this will enable faster computation.
Ref: towardsdatascience.com
I have taken a small example dataframe to put up a solution to your problem.
import pandas as pd
import numpy as np
df = pd.DataFrame({'REG':['A','A','A','A','A','A','B','B','B','B','B','B'],
                   'SAMAC1':['a','a','a','b','b','b','c','c','c','d','d','d'],
                   'SAMAC':['p','q','r','p','q','r','p','q','r','p','q','r'],
                   'value':[0,0,0,0,0,0,0,0,0,0,0,0]})
array_ = df[['REG','SAMAC1','SAMAC']].values.transpose()
index = pd.MultiIndex.from_arrays(array_, names=('REG', 'SAMAC1','SAMAC'))
df2 = df['value']
df2.index=index
country_labels = df2.index.get_level_values(0)
country_unique = country_labels.unique()
result_arr = []
for c in country_unique:
    df3 = df2[df2.index.get_level_values(0) == c]
    result_arr.append(df3.unstack().values)
result_arr = np.array(result_arr)
print(result_arr.shape)
Output: (2,2,3)
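If you also need one output file per REG value, as in the original loop, a hedged follow-up could be (assuming result_arr stays aligned with country_unique as built above; row/column labels are omitted for brevity):
# write one CSV per REG value, in the same order as country_unique
for c, mat in zip(country_unique, result_arr):
    pd.DataFrame(mat).to_csv(str(c) + "_SAM.csv")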

Pandas: `array_split` on a column of arrays, why don't I get back the max of the column?

My initial column looks like this:
spread%
0 0.002631183029370956687450895171
1 0.002624478865422741694443794361
2 0.002503969912244045131633932303
3 0.002634517528902797001731827513
(I have 95000 rows in total)
What I wanted to do is to divide these spreads into 100 bins. That's what I did:
spread_range = np.linspace(0.000001, 0.0001, num=300)
dfspread = pd.DataFrame(spread_range,columns=['spread%'])
sorted_array = np.sort(df['spread%'])
dfspread['spread%']=np.array_split(sorted_array, 300)
dfspread['spread%'] = dfspread['spread%'].str[1]
I had to first create a dataframe with placeholder values (spread_range) and then replace these values with the real values (last line). I did not know how to do it in one step...
This is my output:
spread%
295 0.006396490507889923995723419182
296 0.006601856970328614032555077092
297 0.006874901899230889970177366191
298 0.007286400912994813194530809917
299 0.008012436834225554885192314445
but I do not find my maximum value which is: 0.02828190624663463264290952354
Any idea why?

turning a two dimensional array into a two column dataframe pandas

If I have the following, how do I make pd.DataFrame() turn this array into a dataframe with two columns? What's the most efficient way? My current approach involves copying each element into a series and making dataframes out of them.
From this:
([[u'294 (24%) L', u'294 (26%) R'],
[u'981 (71%) L', u'981 (82%) R'],])
to
x y
294 294
981 981
rather than
x
[u'294 (24%) L', u'294 (26%) R']
My current approach (I'm looking for something more efficient):
numL = pd.Series(numlist).map(lambda x: x[0])
numR = pd.Series(numlist).map(lambda x: x[1])
nL = pd.DataFrame(numL, columns=['left_num'])
nR = pd.DataFrame(numR, columns=['right_num'])
nLR = nL.join(nR)
nLR
UPDATE: I noticed that my error simply comes down to the difference between calling pd.DataFrame() on a list versus a series. When you create a dataframe out of a series of lists, each list stays together in a single column; a list of lists, on the other hand, spreads the items across separate columns. That solved my problem in the most efficient way.
data = [[u'294 (24%) L', u'294 (26%) R'], [u'981 (71%) L', u'981 (82%) R'],]
clean_data = [[int(item.split()[0]) for item in row] for row in data]
# clean_data: [[294, 294], [981, 981]]
pd.DataFrame(clean_data, columns=list('xy'))
# x y
# 0 294 294
# 1 981 981
#
# [2 rows x 2 columns]
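For completeness, a tiny sketch of the list-versus-Series behaviour described above, using the same toy data:
import pandas as pd
data = [[u'294 (24%) L', u'294 (26%) R'], [u'981 (71%) L', u'981 (82%) R']]
print(pd.DataFrame(data))             # list of lists -> items spread across two columns
print(pd.DataFrame(pd.Series(data)))  # Series of lists -> one column, each cell holds a list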
