Duplicate columns & possible reduce dimensionality: KeyError 0 (Python, pandas)

I have the following data set:
As you can see, the shape is 21 rows x 50 columns.
I would like to apply the following condition:
If a row has "defaultstore" == 1, then the "FinalSL" column should receive 4 times the value in the "FCST:TOTAL" column.
So I created the following function to do this calculation:
def SLFinal(defaultStore, fcst):
    if defaultStore == 1:
        return fcst * 4
    else:
        return 2

SLFinal(DFstore.iloc[i], FcstList.iloc[i])
The function works, but I would like to apply it to my whole dataset, so I created the following loops to walk each row and store the data from the "defaultstore" and "FCST:TOTAL" columns:
Fcst = copiedData.iloc[:, 45:46]
FcstList = []
lenOfRows2 = len(copiedData)
for i in range(0, lenOfRows2):
    FcstList.append(Fcst.loc[i])

DFstore = copiedData.iloc[:, 46:47]
DFstoreList = []
lenOfRows2 = len(copiedData)
for i in range(0, lenOfRows2):
    DFstoreList.append(DFstore.loc[i])
And finally, the new list that will contain the values after the function is applied:
FinalSLlist1 = []
for i in range(0, lenOfRows2):
    Rows = []
    for j in range(45, 50):
        Rows.append(SLFinal(DFstore[i], FcstList[i]))
    FinalSLlist1.append(Rows)
But the following error occurs:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
   2693         # get column
   2694         if self.columns.is_unique:
-> 2695             return self._get_item_cache(key)
   2696
   2697         # duplicate columns & possible reduce dimensionality
KeyError: 0
What should I do?

You can use boolean indexing and avoid any loops like so:
df.loc[df.defaultstore==1, 'FCST:TOTAL'] *= 4
df.loc[df.defaultstore!=1, 'FCST:TOTAL'] = 2
It might be helpful to look at the pandas documentation on boolean indexing.
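For illustration, here is a minimal runnable sketch of that fix on mock data (the column names come from the question; the values are made up):

import pandas as pd

# hypothetical stand-in for the 21 x 50 frame in the question
df = pd.DataFrame({'defaultstore': [1, 0, 1],
                   'FCST:TOTAL': [10.0, 20.0, 30.0]})

df.loc[df.defaultstore == 1, 'FCST:TOTAL'] *= 4   # rows where defaultstore is 1
df.loc[df.defaultstore != 1, 'FCST:TOTAL'] = 2    # all other rows

print(df['FCST:TOTAL'].tolist())  # [40.0, 2.0, 120.0]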

Just simply use the apply() method:
import pandas as pd
df['FCST:TOTAL'] = df.apply(lambda x: x['FCST:TOTAL'] * 4 if x['defaultstore'] == 1 else 2, axis=1)
OR
If you are familiar with numpy then use the where() method, as it is more efficient than pandas' apply() method:
import numpy as np
df['FCST:TOTAL']=np.where(df['defaultstore']==1,df['FCST:TOTAL']*4,2)
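As a quick usage check, a hedged sketch on the same kind of mock data as above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'defaultstore': [1, 0], 'FCST:TOTAL': [10.0, 20.0]})
df['FCST:TOTAL'] = np.where(df['defaultstore'] == 1, df['FCST:TOTAL'] * 4, 2)
print(df['FCST:TOTAL'].tolist())  # [40.0, 2.0]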

Related

get the full row data from the found extrema

I am new to using pandas and I can't find a way to get the full row of the found extrema.
df = pd.read_csv('test.csv')
df['min'] = df.iloc[argrelextrema(df.Close.values, np.less_equal,
                                  order=10)[0]]['Close']
df['max'] = df.iloc[argrelextrema(df.Close.values, np.greater_equal,
                                  order=10)[0]]['Close']

# create lists for `min` and `max`
min_values_list = df['min'].dropna().tolist()
max_values_list = df['max'].dropna().tolist()
print(min_values_list, max_values_list)
It prints only the minima and maxima values, but I need the full row data of the found minima/maxima.
Example of data:
Datetime,Date,Open,High,Low,Close
2021-01-11 00:00:00+00:00,18638.0,1.2189176082611084,1.2199585437774658,1.2186205387115479,1.2192147970199585
If the list is required, then I would suggest:
def df_to_list_rowwise(df: pd.DataFrame) -> list:
    return [df.iloc[_, :].tolist() for _ in range(df.shape[0])]

df_min_values = df.iloc[argrelextrema(np.array(df.Close), np.less_equal)[0], :]
df_max_values = df.iloc[argrelextrema(np.array(df.Close), np.greater_equal)[0], :]
print(df_to_list_rowwise(df_min_values))
print(df_to_list_rowwise(df_max_values))
Would that help?
Try using df.dropna().index.tolist() instead of specifying the column: adding the column name returns just the value of a specific row in that column, not the whole row.
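For completeness, a minimal sketch (assuming the test.csv layout shown in the question) that keeps the full rows rather than only the Close values:

import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

df = pd.read_csv('test.csv')

# positional indices of local minima / maxima over a 10-row window
min_idx = argrelextrema(df.Close.values, np.less_equal, order=10)[0]
max_idx = argrelextrema(df.Close.values, np.greater_equal, order=10)[0]

# full rows (all columns) at those positions
min_rows = df.iloc[min_idx]
max_rows = df.iloc[max_idx]
print(min_rows.to_dict('records'))
print(max_rows.to_dict('records'))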

Python- trying to make new list combining values from other list

I'm trying to use two columns from an existing dataframe to generate a list of new strings with those values. I found a lot of examples doing something similar, but not the same thing, so I appreciate advice or links elsewhere if this is a repeat question. Thanks in advance!
If I start with a data frame like this:
import pandas as pd
df=pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
  id1  id2
0   a    1
1   b    2
2   c    3
I want to make a list that looks like
new_ids = ['a_1', 'b_2', 'c_3'], where each value comes from combining the value of id1 with the value of id2 in the same row.
I started by making lists from the columns, but can't figure out how to combine them into a new list. I also tried not using intermediate lists, but couldn't get that either. Error messages below are accurate to the mock data, but are different from the ones with real data.
# making separate lists version
# this function works
def get_ids(orig_df):
    id1_list = []
    id2_list = []
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return (id1_list, id2_list)

idlist1, idlist2 = get_ids(df)
# this is the part that doesn't work
new_id = []
for i, j in zip(idlist1, idlist2):
    row = '_'.join(str(idlist1[i]), str(idlist2[j]))
    new_id.append(row)
# ---------------------------------------------------------------------------
# AttributeError                            Traceback (most recent call last)
# <ipython-input-44-09983bd890a6> in <module>
#       1 newid_list=[]
#       2 for i in range(len(df)):
# ----> 3     n1=df['id1'[i].values]
#       4     n2=df['id2'[i].values]
#       5     nid= str(n1)+"_"+str(n2)
# AttributeError: 'str' object has no attribute 'values'
# skipping making lists (also doesn't work)
newid_list = []
for i in range(len(df)):
    n1 = df['id1'[i].values]
    n2 = df['id2'[i].values]
    nid = str(n1) + "_" + str(n2)
    newid_list.append(nid)
# ---------------------------------------------------------------------------
# TypeError                                 Traceback (most recent call last)
# <ipython-input-41-6b0c949a1ad5> in <module>
#       1 new_id=[]
#       2 for i,j in zip(idlist1,idlist2):
# ----> 3     row='_'.join(str(idlist1[i]),str(idlist2[j]))
#       4     new_id.append(row)
#       5     #return ', '.join(new_id)
# TypeError: list indices must be integers or slices, not str
(df.id1 + "_" + df.id2.astype(str)).tolist()
output:
['a_1', 'b_2', 'c_3']
Your approaches, corrected:
def get_ids(orig_df):
    id1_list = []
    id2_list = []
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return (id1_list, id2_list)

idlist1, idlist2 = get_ids(df)

# this is the part that didn't work
new_id = []
for i, j in zip(idlist1, idlist2):
    row = '_'.join([str(i), str(j)])
    new_id.append(row)
newid_list = []
for i in range(len(df)):
    n1 = df['id1'][i]
    n2 = df['id2'][i]
    nid = str(n1) + "_" + str(n2)
    newid_list.append(nid)
Points:
In the first approach, when you loop with zip, i and j are the data themselves, not indices, so use them directly and convert them to strings.
join() takes a list, so simply build a list from the two values, [str(i), str(j)], and pass it to join.
In the second approach, you can get each element of a column with df['id1'][i]; you don't need .values, which returns all elements of the column as a numpy array.
if you want to use values:
(df.id1.values + "_" + df.id2.values.astype(str)).tolist()
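Another compact variant (a sketch equivalent to the corrected zip loop above, using an f-string):

new_ids = [f"{a}_{b}" for a, b in zip(df.id1, df.id2)]
print(new_ids)  # ['a_1', 'b_2', 'c_3']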
Try this:
import pandas as pd

df = pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
index = 0
newid_list = []
while index < len(df):
    newid_list.append(str(df['id1'][index]) + '_' + str(df['id2'][index]))
    index += 1

How to parallelize row dataframe computations with dask

I have a dataframe like the following one:
index  paper_id                                  title                                               embedding
0      000a0fc8bbef80410199e690191dc3076a290117  PfSWIB, a potential chromatin regulator for va...  [-0.21326999, -0.39155999, 0.18850000, -0.0664...
1      000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a  Correlation between antimicrobial consumption ...  [-0.23322999, -0.27436000, -0.10449000, -0.536...
2      000b0174f992cb326a891f756d4ae5531f2845f7  Full Title: A systematic review of MERS-CoV (M...  [0.26385999, -0.07325000, 0.03762100, -0.12043...
Here the "embedding" column is a np.array() of some length, whose elements are floats. I need to compute the cosine similarity between every pair of paper_id, and my aim is to parallelize this since many of the computations are independent of each other. I thought dask delayed objects would be efficient for this purpose.
The code of my function is:
@dask.delayed
def cosine(vector1, vector2):
    # one can use only the very first elements of the embeddings,
    # i.e. the lengths of the embeddings must coincide
    num_elem = min(len(vector1), len(vector2))
    vec1_norm = np.linalg.norm(vector1[0:num_elem])
    vec2_norm = np.linalg.norm(vector2[0:num_elem])
    try:
        cosine = np.vdot(vector1[0:num_elem], vector2[0:num_elem]) / (vec1_norm * vec2_norm)
    except:
        cosine = 0.
    return cosine

delayed_cosine_matrix = np.eye(len(cosine_df), len(cosine_df))
for x in range(1, len(cosine_df)):
    for y in range(x):
        delayed_cosine_matrix[x, y] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
        delayed_cosine_matrix[y, x] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
This, however, returns an error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'Delayed'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-114-90cefc4986d5> in <module>
      3 for x in range(1, len(cosine_df)):
      4     for y in range(x):
----> 5         delayed_cosine_matrix[x,y] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
ValueError: setting an array element with a sequence.
Moreover, I would stress that I chose np.eye() since the cosine of a vector with itself is one, and that I would like to exploit the symmetry of the operator, i.e. cosine(x, y) == cosine(y, x).
Is there a way to efficiently do and parallelize it, or am I totally out of scope?
EDIT: I'm adding a small code snippet that reproduces the columns and layout needed for the dataframe (i.e. only "embeddings" and the index):
import numpy as np
import pandas as pd

emb_lengths = np.random.randint(100, 1000, size=100)
elements = [np.random.random(size=(1, x)) for x in emb_lengths]
my_df = pd.DataFrame(elements, columns=['embeddings'])
my_df.embeddings = my_df.embeddings.apply(lambda x: x[0])
my_df
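One hedged sketch of how to get this running (assuming cosine is decorated with @dask.delayed as above, and using the my_df frame from the snippet): a Delayed object cannot be assigned into a float numpy array, which is exactly what raises the error, so collect the delayed results in a plain Python list, materialize them all at once with dask.compute, and only then fill the matrix with the concrete floats:

import dask
import numpy as np

n = len(my_df)
tasks, coords = [], []
for x in range(1, n):
    for y in range(x):
        # build the task graph; nothing is computed yet
        tasks.append(cosine(my_df.embeddings[x], my_df.embeddings[y]))
        coords.append((x, y))

results = dask.compute(*tasks)  # run all the independent tasks in parallel

cosine_matrix = np.eye(n)
for (x, y), value in zip(coords, results):
    cosine_matrix[x, y] = value
    cosine_matrix[y, x] = value  # exploit cosine(x, y) == cosine(y, x)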

How to loop through different Dataframes to check in pandas whether any value in a given column is different than 0 and save the result in an array

The goal of this Python pandas code is to loop through several dataframes (each the result of an SQL query), check for every column of each dataframe whether it contains any value different from 0, and, based on that, append the column name to a given array (ready_data or pending_data) for each dataframe.
The code is as follows:
# 4). We will execute all the queries and change NaN to 0 so as to be able
# to track whether data is available or not
SQL_Queries = ('dutch_query', 'Fix_Int_Period_query', 'Zinsen_Port_query')
Dataframes = ('dutch', 'Fix_Int_Period', 'Zinsen_Port')
Clean_Dataframes = ('clean_dutch', 'clean_Fix_Int_Period', 'clean_Zinsen_Port')

dutch = pd.read_sql(dutch_query.format(ultimo=report_date), engine)
clean_dutch = dutch.fillna(0)

Fix_Int_Period = pd.read_sql(Fix_Int_Period_query.format(ultimo=report_date), engine)
clean_Fix_Int_Period = Fix_Int_Period.fillna(0)

Zinsen_Port = pd.read_sql(Zinsen_Port_query.format(ultimo=report_date), engine)
clean_Zinsen_Port = Zinsen_Port.fillna(0)
# 5). We will check whether all data is available by looping through the
# columns and checking whether values are different from 0
dutch_ready_data = []
dutch_pending_data = []
Fix_Int_Period_ready_data = []
Fix_Int_Period_pending_data = []
Zinsen_Port_ready_data = []
Zinsen_Port_pending_data = []

for df in Dataframes:
    for cdf in Clean_Dataframes:
        for column in cdf:
            if (((str(cdf) + [column]) != 0).any()) == False:
                (str((str(df)) + str('_pending_data'))).append([column])
            else:
                (str((str(df)) + str('_ready_data'))).append([column])
The error message I keep getting is:
TypeError Traceback (most recent call last)
<ipython-input-70-fa18d45f0070> in <module>
13 for cdf in Clean_Dataframes:
14 for column in cdf:
---> 15 if (((str(cdf)+[column]) != 0).any()) == False:
16 (str((str(df))+str('_pending_data'))).append([column])
17 else:
TypeError: can only concatenate str (not "list") to str
It would be much appreciated if someone could help me out.
Thousand thanks!
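A hedged sketch of one possible rewrite (dataframe names taken from the question): string concatenation builds a str, not a reference to a variable, which is what triggers the TypeError, so keep the cleaned dataframes and the result lists in dictionaries keyed by name instead:

clean_dfs = {
    'dutch': clean_dutch,
    'Fix_Int_Period': clean_Fix_Int_Period,
    'Zinsen_Port': clean_Zinsen_Port,
}
ready_data = {name: [] for name in clean_dfs}
pending_data = {name: [] for name in clean_dfs}

for name, cdf in clean_dfs.items():
    for column in cdf.columns:
        # a column is "ready" if any of its values differs from 0
        if (cdf[column] != 0).any():
            ready_data[name].append(column)
        else:
            pending_data[name].append(column)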

Given a pandas dataframe, is there an easy way to print out a command to generate it?

After running some commands I have a pandas dataframe, eg.:
>>> print df
   B  A
1  2  1
2  3  2
3  4  3
4  5  4
I would like to print this out so that it produces simple code that would recreate it, eg.:
DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
I tried pulling out each of the three pieces (data, columns and rows):
[[e for e in row] for row in df.iterrows()]
[c for c in df.columns]
[r for r in df.index]
but the first line fails because e is not a value but a Series.
Is there a pre-build command to do this, and if not, how do I do it? Thanks.
You can get the values of the data frame in array format by calling df.values:
df = pd.DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
arrays = df.values
cols = df.columns
index = df.index
df2 = pd.DataFrame(arrays, columns = cols, index = index)
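To turn those three pieces into a printable command string, a small hedged sketch:

cmd = "pd.DataFrame({data}, columns={cols}, index={idx})".format(
    data=df.values.tolist(),
    cols=list(df.columns),
    idx=list(df.index),
)
print(cmd)
# pd.DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])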
Based on #Woody Pride's approach, here is the full solution I am using. It handles hierarchical indices and index names.
from types import MethodType
from pandas import DataFrame, MultiIndex

def _gencmd(df, pandas_as='pd'):
    """
    With this addition to DataFrame's methods, you can use:
        df.command()
    to get the command required to regenerate the dataframe df.
    """
    if pandas_as:
        pandas_as += '.'
    index_cmd = df.index.__class__.__name__
    if type(df.index) == MultiIndex:
        index_cmd += '.from_tuples({0}, names={1})'.format([i for i in df.index], df.index.names)
    else:
        index_cmd += "({0}, name='{1}')".format([i for i in df.index], df.index.name)
    return 'DataFrame({0}, index={1}{2}, columns={3})'.format([[xx for xx in x] for x in df.values],
                                                              pandas_as,
                                                              index_cmd,
                                                              [c for c in df.columns])

# Python 2 style method binding; on Python 3, assign the function directly instead:
#   DataFrame.command = _gencmd
DataFrame.command = MethodType(_gencmd, None, DataFrame)
I have only tested it on a few cases so far and would love a more general solution.
