Duplicate columns & possible reduce dimensionality: KeyError 0 (Python, pandas)

I have the following data set:
As you can see, the shape is 21 rows x 50 columns.
I would like to apply the following condition:
If a row has "defaultstore" == 1, then the "FinalSL" column should receive 4 times the value in the "FCST:TOTAL" column.
So I created the following function to do this calculation:
def SLFinal(defaultStore, fcst):
    if defaultStore == 1:
        return fcst * 4
    else:
        return 2

SLFinal(DFstore.iloc[i], FcstList.iloc[i])
The function works, but I would like to apply it to my whole dataset, so I created the following loops to walk each row and store the data from the "defaultstore" and "FCST:TOTAL" columns:
Fcst = copiedData.iloc[:, 45:46]
FcstList = []
lenOfRows2 = len(copiedData)
for i in range(0, lenOfRows2):
    FcstList.append(Fcst.loc[i])

DFstore = copiedData.iloc[:, 46:47]
DFstoreList = []
lenOfRows2 = len(copiedData)
for i in range(0, lenOfRows2):
    DFstoreList.append(DFstore.loc[i])
And finally, the new list that will contain the values after the function is applied:
FinalSLlist1 = []
for i in range(0, lenOfRows2):
    Rows = []
    for j in range(45, 50):
        Rows.append(SLFinal(DFstore[i], FcstList[i]))
    FinalSLlist1.append(Rows)
But the following error occurs:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
   2693         # get column
   2694         if self.columns.is_unique:
-> 2695             return self._get_item_cache(key)
   2696
   2697         # duplicate columns & possible reduce dimensionality
KeyError: 0
What should I do?

You can use boolean indexing and avoid any loops like so:
df.loc[df.defaultstore==1, 'FCST:TOTAL'] *= 4
df.loc[df.defaultstore!=1, 'FCST:TOTAL'] = 2
It might be helpful to look at the pandas documentation on boolean indexing.
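For illustration, here is a minimal runnable sketch of that fix on mock data (the column names come from the question; the values are made up):

import pandas as pd

# hypothetical stand-in for the 21 x 50 frame in the question
df = pd.DataFrame({'defaultstore': [1, 0, 1],
                   'FCST:TOTAL': [10.0, 20.0, 30.0]})

df.loc[df.defaultstore == 1, 'FCST:TOTAL'] *= 4   # rows where defaultstore is 1
df.loc[df.defaultstore != 1, 'FCST:TOTAL'] = 2    # all other rows

print(df['FCST:TOTAL'].tolist())  # [40.0, 2.0, 120.0]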

Just simply use the apply() method:
import pandas as pd
df['FCST:TOTAL'] = df.apply(lambda x: x['FCST:TOTAL'] * 4 if x['defaultstore'] == 1 else 2, axis=1)
OR
If you are familiar with numpy then use the where() method, as it is more efficient than pandas' apply() method:
import numpy as np
df['FCST:TOTAL']=np.where(df['defaultstore']==1,df['FCST:TOTAL']*4,2)
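As a quick usage check, a hedged sketch on the same kind of mock data as above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'defaultstore': [1, 0], 'FCST:TOTAL': [10.0, 20.0]})
df['FCST:TOTAL'] = np.where(df['defaultstore'] == 1, df['FCST:TOTAL'] * 4, 2)
print(df['FCST:TOTAL'].tolist())  # [40.0, 2.0]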

Related

get the full row data from the found extrema

I am new to using pandas and I can't find a way to get the full row of the found extrema.
df = pd.read_csv('test.csv')
df['min'] = df.iloc[argrelextrema(df.Close.values, np.less_equal,
                                  order=10)[0]]['Close']
df['max'] = df.iloc[argrelextrema(df.Close.values, np.greater_equal,
                                  order=10)[0]]['Close']

# create lists for `min` and `max`
min_values_list = df['min'].dropna().tolist()
max_values_list = df['max'].dropna().tolist()
print(min_values_list, max_values_list)
It prints only the minima and maxima values, but I need the full row data of the found minima/maxima.
Example of data:
Datetime,Date,Open,High,Low,Close
2021-01-11 00:00:00+00:00,18638.0,1.2189176082611084,1.2199585437774658,1.2186205387115479,1.2192147970199585
If the list is required, then I would suggest:
def df_to_list_rowwise(df: pd.DataFrame) -> list:
    return [df.iloc[_, :].tolist() for _ in range(df.shape[0])]

df_min_values = df.iloc[argrelextrema(np.array(df.Close), np.less_equal)[0], :]
df_max_values = df.iloc[argrelextrema(np.array(df.Close), np.greater_equal)[0], :]
print(df_to_list_rowwise(df_min_values))
print(df_to_list_rowwise(df_max_values))
Would that help?
Try using df.dropna().index.tolist() instead of specifying the column: adding the column name returns just the value of a specific row in that column, not the whole row.
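For completeness, a minimal sketch (assuming the test.csv layout shown in the question) that keeps the full rows rather than only the Close values:

import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

df = pd.read_csv('test.csv')

# positional indices of local minima / maxima over a 10-row window
min_idx = argrelextrema(df.Close.values, np.less_equal, order=10)[0]
max_idx = argrelextrema(df.Close.values, np.greater_equal, order=10)[0]

# full rows (all columns) at those positions
min_rows = df.iloc[min_idx]
max_rows = df.iloc[max_idx]
print(min_rows.to_dict('records'))
print(max_rows.to_dict('records'))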

Python- trying to make new list combining values from other list

I'm trying to use two columns from an existing dataframe to generate a list of new strings with those values. I found a lot of examples doing something similar, but not the same thing, so I appreciate advice or links elsewhere if this is a repeat question. Thanks in advance!
If I start with a data frame like this:
import pandas as pd
df=pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
  id1  id2
0   a    1
1   b    2
2   c    3
I want to make a list that looks like
new_ids = ['a_1', 'b_2', 'c_3'], where each value comes from combining the value of id1 with the value of id2 in the same row.
I started by making lists from the columns, but can't figure out how to combine them into a new list. I also tried not using intermediate lists, but couldn't get that either. Error messages below are accurate to the mock data, but are different from the ones with real data.
# making separate lists version
# this function works
def get_ids(orig_df):
    id1_list = []
    id2_list = []
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return (id1_list, id2_list)

idlist1, idlist2 = get_ids(df)
# this is the part that doesn't work
new_id = []
for i, j in zip(idlist1, idlist2):
    row = '_'.join(str(idlist1[i]), str(idlist2[j]))
    new_id.append(row)
# ---------------------------------------------------------------------------
# AttributeError                            Traceback (most recent call last)
# <ipython-input-44-09983bd890a6> in <module>
#       1 newid_list=[]
#       2 for i in range(len(df)):
# ----> 3     n1=df['id1'[i].values]
#       4     n2=df['id2'[i].values]
#       5     nid= str(n1)+"_"+str(n2)
# AttributeError: 'str' object has no attribute 'values'
# skipping making lists (also doesn't work)
newid_list = []
for i in range(len(df)):
    n1 = df['id1'[i].values]
    n2 = df['id2'[i].values]
    nid = str(n1) + "_" + str(n2)
    newid_list.append(nid)
# ---------------------------------------------------------------------------
# TypeError                                 Traceback (most recent call last)
# <ipython-input-41-6b0c949a1ad5> in <module>
#       1 new_id=[]
#       2 for i,j in zip(idlist1,idlist2):
# ----> 3     row='_'.join(str(idlist1[i]),str(idlist2[j]))
#       4     new_id.append(row)
#       5     #return ', '.join(new_id)
# TypeError: list indices must be integers or slices, not str
(df.id1 + "_" + df.id2.astype(str)).tolist()
output:
['a_1', 'b_2', 'c_3']
Your approaches, corrected:
def get_ids(orig_df):
    id1_list = []
    id2_list = []
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return (id1_list, id2_list)

idlist1, idlist2 = get_ids(df)

# this is the part that didn't work
new_id = []
for i, j in zip(idlist1, idlist2):
    row = '_'.join([str(i), str(j)])
    new_id.append(row)
newid_list = []
for i in range(len(df)):
    n1 = df['id1'][i]
    n2 = df['id2'][i]
    nid = str(n1) + "_" + str(n2)
    newid_list.append(nid)
Points:
In the first approach, when you loop with zip, i and j are the data themselves, not indices, so use them directly and convert them to strings.
join() takes a list, so simply build a list from the two values, [str(i), str(j)], and pass it to join.
In the second approach, you can get each element of a column with df['id1'][i]; you don't need .values, which returns all elements of the column as a numpy array.
if you want to use values:
(df.id1.values + "_" + df.id2.values.astype(str)).tolist()
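Another compact variant (a sketch equivalent to the corrected zip loop above, using an f-string):

new_ids = [f"{a}_{b}" for a, b in zip(df.id1, df.id2)]
print(new_ids)  # ['a_1', 'b_2', 'c_3']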
Try this:
import pandas as pd

df = pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
index = 0
newid_list = []
while index < len(df):
    newid_list.append(str(df['id1'][index]) + '_' + str(df['id2'][index]))
    index += 1

How to parallelize row dataframe computations with dask

I have a dataframe like the following one:
index  paper_id                                  title                                               embedding
0      000a0fc8bbef80410199e690191dc3076a290117  PfSWIB, a potential chromatin regulator for va...  [-0.21326999, -0.39155999, 0.18850000, -0.0664...
1      000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a  Correlation between antimicrobial consumption ...  [-0.23322999, -0.27436000, -0.10449000, -0.536...
2      000b0174f992cb326a891f756d4ae5531f2845f7  Full Title: A systematic review of MERS-CoV (M...  [0.26385999, -0.07325000, 0.03762100, -0.12043...
Here the "embedding" column is a np.array() of some length, whose elements are floats. I need to compute the cosine similarity between every pair of paper_id, and my aim is to parallelize this since many of the computations are independent of each other. I thought dask delayed objects would be efficient for this purpose.
The code of my function is:
@dask.delayed
def cosine(vector1, vector2):
    # one can use only the very first elements of the embeddings,
    # i.e. the lengths of the embeddings must coincide
    num_elem = min(len(vector1), len(vector2))
    vec1_norm = np.linalg.norm(vector1[0:num_elem])
    vec2_norm = np.linalg.norm(vector2[0:num_elem])
    try:
        cosine = np.vdot(vector1[0:num_elem], vector2[0:num_elem]) / (vec1_norm * vec2_norm)
    except:
        cosine = 0.
    return cosine

delayed_cosine_matrix = np.eye(len(cosine_df), len(cosine_df))
for x in range(1, len(cosine_df)):
    for y in range(x):
        delayed_cosine_matrix[x, y] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
        delayed_cosine_matrix[y, x] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
This, however, returns an error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'Delayed'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-114-90cefc4986d5> in <module>
      3 for x in range(1, len(cosine_df)):
      4     for y in range(x):
----> 5         delayed_cosine_matrix[x,y] = cosine(cosine_df.embedding[x], cosine_df.embedding[y])
ValueError: setting an array element with a sequence.
Moreover, I would stress that I chose np.eye() since the cosine of a vector with itself is one, and that I would like to exploit the symmetry of the operator, i.e. cosine(x, y) == cosine(y, x).
Is there a way to efficiently do and parallelize it, or am I totally out of scope?
EDIT: I'm adding a small code snippet that reproduces the columns and layout needed for the dataframe (i.e. only "embeddings" and the index):
import numpy as np
import pandas as pd

emb_lengths = np.random.randint(100, 1000, size=100)
elements = [np.random.random(size=(1, x)) for x in emb_lengths]
my_df = pd.DataFrame(elements, columns=['embeddings'])
my_df.embeddings = my_df.embeddings.apply(lambda x: x[0])
my_df
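One hedged sketch of how to get this running (assuming cosine is decorated with @dask.delayed as above, and using the my_df frame from the snippet): a Delayed object cannot be assigned into a float numpy array, which is exactly what raises the error, so collect the delayed results in a plain Python list, materialize them all at once with dask.compute, and only then fill the matrix with the concrete floats:

import dask
import numpy as np

n = len(my_df)
tasks, coords = [], []
for x in range(1, n):
    for y in range(x):
        # build the task graph; nothing is computed yet
        tasks.append(cosine(my_df.embeddings[x], my_df.embeddings[y]))
        coords.append((x, y))

results = dask.compute(*tasks)  # run all the independent tasks in parallel

cosine_matrix = np.eye(n)
for (x, y), value in zip(coords, results):
    cosine_matrix[x, y] = value
    cosine_matrix[y, x] = value  # exploit cosine(x, y) == cosine(y, x)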

How to loop through different Dataframes to check in pandas whether any value in a given column is different than 0 and save the result in an array

The goal of this Python pandas code is to loop through several dataframes (each the result of an SQL query), check for every column of each dataframe whether it contains any value different from 0, and, based on that, append the column name to a given array (ready_data or pending_data) for each dataframe.
The code is as follows:
# 4). We will execute all the queries and change NaN to 0 so as to be able
# to track whether data is available or not
SQL_Queries = ('dutch_query', 'Fix_Int_Period_query', 'Zinsen_Port_query')
Dataframes = ('dutch', 'Fix_Int_Period', 'Zinsen_Port')
Clean_Dataframes = ('clean_dutch', 'clean_Fix_Int_Period', 'clean_Zinsen_Port')

dutch = pd.read_sql(dutch_query.format(ultimo=report_date), engine)
clean_dutch = dutch.fillna(0)

Fix_Int_Period = pd.read_sql(Fix_Int_Period_query.format(ultimo=report_date), engine)
clean_Fix_Int_Period = Fix_Int_Period.fillna(0)

Zinsen_Port = pd.read_sql(Zinsen_Port_query.format(ultimo=report_date), engine)
clean_Zinsen_Port = Zinsen_Port.fillna(0)
# 5). We will check whether all data is available by looping through the
# columns and checking whether values are different from 0
dutch_ready_data = []
dutch_pending_data = []
Fix_Int_Period_ready_data = []
Fix_Int_Period_pending_data = []
Zinsen_Port_ready_data = []
Zinsen_Port_pending_data = []

for df in Dataframes:
    for cdf in Clean_Dataframes:
        for column in cdf:
            if (((str(cdf) + [column]) != 0).any()) == False:
                (str((str(df)) + str('_pending_data'))).append([column])
            else:
                (str((str(df)) + str('_ready_data'))).append([column])
The error message I keep getting is:
TypeError Traceback (most recent call last)
<ipython-input-70-fa18d45f0070> in <module>
13 for cdf in Clean_Dataframes:
14 for column in cdf:
---> 15 if (((str(cdf)+[column]) != 0).any()) == False:
16 (str((str(df))+str('_pending_data'))).append([column])
17 else:
TypeError: can only concatenate str (not "list") to str
It would be much appreciated if someone could help me out.
Thousand thanks!
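A hedged sketch of one possible rewrite (dataframe names taken from the question): string concatenation builds a str, not a reference to a variable, which is what triggers the TypeError, so keep the cleaned dataframes and the result lists in dictionaries keyed by name instead:

clean_dfs = {
    'dutch': clean_dutch,
    'Fix_Int_Period': clean_Fix_Int_Period,
    'Zinsen_Port': clean_Zinsen_Port,
}
ready_data = {name: [] for name in clean_dfs}
pending_data = {name: [] for name in clean_dfs}

for name, cdf in clean_dfs.items():
    for column in cdf.columns:
        # a column is "ready" if any of its values differs from 0
        if (cdf[column] != 0).any():
            ready_data[name].append(column)
        else:
            pending_data[name].append(column)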

Given a pandas dataframe, is there an easy way to print out a command to generate it?

After running some commands I have a pandas dataframe, eg.:
>>> print df
   B  A
1  2  1
2  3  2
3  4  3
4  5  4
I would like to print this out so that it produces simple code that would recreate it, eg.:
DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
I tried pulling out each of the three pieces (data, columns and rows):
[[e for e in row] for row in df.iterrows()]
[c for c in df.columns]
[r for r in df.index]
but the first line fails because e is not a value but a Series.
Is there a pre-build command to do this, and if not, how do I do it? Thanks.
You can get the values of the data frame in array format by calling df.values:
df = pd.DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
arrays = df.values
cols = df.columns
index = df.index
df2 = pd.DataFrame(arrays, columns = cols, index = index)
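To turn those three pieces into a printable command string, a small hedged sketch:

cmd = "pd.DataFrame({data}, columns={cols}, index={idx})".format(
    data=df.values.tolist(),
    cols=list(df.columns),
    idx=list(df.index),
)
print(cmd)
# pd.DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])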
Based on #Woody Pride's approach, here is the full solution I am using. It handles hierarchical indices and index names.
from types import MethodType
from pandas import DataFrame, MultiIndex

def _gencmd(df, pandas_as='pd'):
    """
    With this addition to DataFrame's methods, you can use:
        df.command()
    to get the command required to regenerate the dataframe df.
    """
    if pandas_as:
        pandas_as += '.'
    index_cmd = df.index.__class__.__name__
    if type(df.index) == MultiIndex:
        index_cmd += '.from_tuples({0}, names={1})'.format([i for i in df.index], df.index.names)
    else:
        index_cmd += "({0}, name='{1}')".format([i for i in df.index], df.index.name)
    return 'DataFrame({0}, index={1}{2}, columns={3})'.format([[xx for xx in x] for x in df.values],
                                                              pandas_as,
                                                              index_cmd,
                                                              [c for c in df.columns])

# Python 2 style method binding; on Python 3, assign the function directly instead:
#   DataFrame.command = _gencmd
DataFrame.command = MethodType(_gencmd, None, DataFrame)
I have only tested it on a few cases so far and would love a more general solution.
