I'm trying to import a function from a file that lives in another folder, with this structure:
Data_Analytics
|--- Src
|    |--- __init.py__
|    |--- DEA_functions.py
|    |--- importing_modules.py
|--- Data Exploratory Analysis
|    |--- File.ipynb
So, from File.ipynb (I'm working in the notebook from now on) I want to call a function that I have in DEA_functions.py. To do that I typed:
import sys
sys.path.insert(1, "../")
from Src.Importing_modules import *
import Src.DEA_functions as DEA
There are no errors during the import, but when I want to call the function I get this error:
AttributeError: module 'Src.DEA_functions' has no attribute 'getIndexes'
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-29-534dff78ff93> in <module>
4
5 #There is a negative value that we want to delete
----> 6 DEA.getIndexes(df,df['Y3'].min())
7 df['Y3'].min()
8 df['Y3'].iloc[14289]=0
AttributeError: module 'Src.DEA_functions' has no attribute 'getIndexes'
The function is defined in the file this way:
def getIndexes(dfObj, value):
    '''Get index positions of value in dataframe i.e. dfObj.'''
    listOfPos = list()
    # Get bool dataframe with True at positions where the given value exists
    result = dfObj.isin([value])
    # Get list of columns that contain the value
    seriesObj = result.any()
    columnNames = list(seriesObj[seriesObj == True].index)
    # Iterate over the list of columns and fetch the row indexes where the value exists
    for col in columnNames:
        rows = list(result[col][result[col] == True].index)
        for row in rows:
            listOfPos.append((row, col))
    # Return a list of tuples indicating the positions of value in the dataframe
    return listOfPos
I hope I made myself clear, but if not, do not hesitate to ask whatever you need. I just want to use the functions I have defined in DEA_functions.py in my File.ipynb.
Thank you!
I found the error: I had assigned DEA as the short name for calling my functions. It looks like I had to use lower-case letters, so:
import Src.DEA_funct as dea
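For reference, a minimal sketch of the whole working import from the notebook, assuming the folder layout above and an existing DataFrame df (as in the traceback):
# Minimal sketch, run from "Data Exploratory Analysis/File.ipynb".
# Assumes Src/ is a package (it has an __init__ file) one level up from the notebook.
import sys
sys.path.insert(1, "../")            # make the project root importable

import Src.DEA_functions as dea      # lower-case alias, as in the fix above

positions = dea.getIndexes(df, df['Y3'].min())   # df is an existing DataFrame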
I'm trying to use two columns from an existing dataframe to generate a list of new strings with those values. I found a lot of examples doing something similar, but not the same thing, so I appreciate advice or links elsewhere if this is a repeat question. Thanks in advance!
If I start with a data frame like this:
import pandas as pd
df=pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
  id1  id2
0   a    1
1   b    2
2   c    3
I want to make a list that looks like:
new_ids = ['a_1', 'b_2', 'c_3']
where each value combines the id1 value and the id2 value from the same row (row 0 with row 0, and so on).
I started by making lists from the columns, but can't figure out how to combine them into a new list. I also tried not using intermediate lists, but couldn't get that either. Error messages below are accurate to the mock data, but are different from the ones with real data.
#making separate lists version
#this function works
def get_ids(orig_df):
    id1_list=[]
    id2_list=[]
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return(id1_list,id2_list)
idlist1,idlist2=get_ids(df)
#this is the part that doesn't work
new_id=[]
for i,j in zip(idlist1,idlist2):
    row='_'.join(str(idlist1[i]),str(idlist2[j]))
    new_id.append(row)
#------------------------------------------------------------------------
#AttributeError Traceback (most recent call #last)
#<ipython-input-44-09983bd890a6> in <module>
# 1 newid_list=[]
# 2 for i in range(len(df)):
#----> 3 n1=df['id1'[i].values]
# 4 n2=df['id2'[i].values]
# 5 nid= str(n1)+"_"+str(n2)
#AttributeError: 'str' object has no attribute 'values'
#skipping making lists (also doesn't work)
newid_list=[]
for i in range(len(df)):
    n1=df['id1'[i].values]
    n2=df['id2'[i].values]
    nid= str(n1)+"_"+str(n2)
    newid_list.append(nid)
#---------------------------------------------------------------------------
#TypeError Traceback (most recent call last)
#<ipython-input-41-6b0c949a1ad5> in <module>
# 1 new_id=[]
# 2 for i,j in zip(idlist1,idlist2):
#----> 3 row='_'.join(str(idlist1[i]),str(idlist2[j]))
# 4 new_id.append(row)
# 5 #return ', '.join(new_id)
#TypeError: list indices must be integers or slices, not str
(df.id1 + "_" + df.id2.astype(str)).tolist()
output:
['a_1', 'b_2', 'c_3']
Your approaches (corrected):
def get_ids(orig_df):
    id1_list=[]
    id2_list=[]
    for i in range(len(orig_df)):
        id1_list.append(orig_df['id1'].values[i])
        id2_list.append(orig_df['id2'].values[i])
    return(id1_list,id2_list)
idlist1, idlist2=get_ids(df)
#the part that didn't work, corrected
new_id=[]
for i,j in zip(idlist1,idlist2):
    row='_'.join([str(i),str(j)])
    new_id.append(row)
newid_list=[]
for i in range(len(df)):
    n1=df['id1'][i]
    n2=df['id2'][i]
    nid= str(n1)+"_"+str(n2)
    newid_list.append(nid)
Points:
In the first approach, when you loop with zip, i and j are the data themselves, not indices, so use them directly and convert them to strings.
join takes a list, so simply build a list from the two values, [str(i), str(j)], and pass it to join.
In the second approach, you can get each element of a column using df['id1'][i]; you don't need .values, which returns all elements of the column as a NumPy array.
If you want to use .values:
(df.id1.values + "_" + df.id2.values.astype(str)).tolist()
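The same zip idiom can also be written as a one-line list comprehension (a minimal sketch using the mock df above):
# Build 'id1_id2' strings row by row; zip pairs the two columns directly.
new_ids = [str(a) + "_" + str(b) for a, b in zip(df["id1"], df["id2"])]
# ['a_1', 'b_2', 'c_3']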
Try this:
import pandas as pd
df=pd.DataFrame(data=[["a",1],["b",2],["c",3]], columns=["id1","id2"])
index=0
newid_list=[]
while index < len(df):
    newid_list.append(str(df['id1'][index]) + '_' + str(df['id2'][index]))
    index+=1
The goal of this Python pandas code is to loop through several dataframes (the results of SQL queries), check for every column of each dataframe whether there is any value different from 0, and based on that, append the column name to a given list (ready_data or pending_data) for each dataframe.
The code is as follows:
#4). We will execute all the queries and change NaN to 0 so as to be able to track whether data is available or not
SQL_Queries = ('dutch_query', 'Fix_Int_Period_query', 'Zinsen_Port_query')
Dataframes = ('dutch', 'Fix_Int_Period', 'Zinsen_Port')
Clean_Dataframes = ('clean_dutch', 'clean_Fix_Int_Period', 'clean_Zinsen_Port')
dutch = pd.read_sql(dutch_query.format(ultimo=report_date), engine)
clean_dutch = dutch.fillna(0)
Fix_Int_Period = pd.read_sql(Fix_Int_Period_query.format(ultimo=report_date), engine)
clean_Fix_Int_Period = Fix_Int_Period.fillna(0)
Zinsen_Port = pd.read_sql(Zinsen_Port_query.format(ultimo=report_date), engine)
clean_Zinsen_Port = Zinsen_Port.fillna(0)
#5). We will check whether all data is available by looping through the columns and checking whether values are different from 0
dutch_ready_data=[]
dutch_pending_data=[]
Fix_Int_Period_ready_data=[]
Fix_Int_Period_pending_data=[]
Zinsen_Port_ready_data=[]
Zinsen_Port_pending_data=[]
for df in Dataframes:
    for cdf in Clean_Dataframes:
        for column in cdf:
            if (((str(cdf)+[column]) != 0).any()) == False:
                (str((str(df))+str('_pending_data'))).append([column])
            else:
                (str((str(df))+str('_ready_data'))).append([column])
The error message I keep getting is:
TypeError Traceback (most recent call last)
<ipython-input-70-fa18d45f0070> in <module>
13 for cdf in Clean_Dataframes:
14 for column in cdf:
---> 15 if (((str(cdf)+[column]) != 0).any()) == False:
16 (str((str(df))+str('_pending_data'))).append([column])
17 else:
TypeError: can only concatenate str (not "list") to str
It would be much appreciated if someone could help me out.
Thousand thanks!
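For what it's worth, the TypeError comes from str(cdf)+[column], which tries to concatenate a string and a list; more generally, variable names built as strings (e.g. str(df)+'_pending_data') cannot be used to reach the actual lists. Below is a minimal sketch of the usual dictionary-based pattern, reusing the names above purely for illustration, not a drop-in fix:
# Minimal sketch: keep the cleaned dataframes in a dict so they can be looked
# up by name instead of building variable names as strings.
clean_dfs = {
    'dutch': clean_dutch,
    'Fix_Int_Period': clean_Fix_Int_Period,
    'Zinsen_Port': clean_Zinsen_Port,
}

ready_data = {}
pending_data = {}
for name, cdf in clean_dfs.items():
    # A column is "ready" if it holds at least one value different from 0.
    ready_data[name] = [col for col in cdf.columns if (cdf[col] != 0).any()]
    pending_data[name] = [col for col in cdf.columns if not (cdf[col] != 0).any()]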
I'm trying to iterate over each row of a pandas dataframe named 'cd'.
If a specific cell in a row, e.g. [row, empl_accept], contains a substring, then I update another cell in the same dataframe, e.g. [row, empl_accept_a].
for row in range(0,len(cd.index),1):
    if 'Master' in cd.at[row,empl_accept]:
        cd.at[row,empl_accept_a] = '1'
    else:
        cd.at[row,empl_accept_a] = '0'
The code above is not working, and Jupyter Notebook displays the error:
TypeError Traceback (most recent call last)
<ipython-input-70-21b1f73e320c> in <module>
1 for row in range(0,len(cd.index),1):
----> 2 if 'Master' in cd.at[row,empl_accept]:
3 cd.at[row,empl_accept_a] = '1'
4 else:
5 cd.at[row,empl_accept_a] = '0'
TypeError: argument of type 'float' is not iterable
I'm not really sure what the problem is there, as the for loop contains no float variable.
Please do not use loops for this. You can do this in bulk with:
cd['empl_accept_a'] = cd['empl_accept'].str.contains('Master').astype(int).astype(str)
This will store '0' and '1' in the column. That being said, I am not convinced that storing these as strings is a good idea. You can just store them as bools with:
cd['empl_accept_a'] = cd['empl_accept'].str.contains('Master')
For example:
>>> cd
    empl_accept  empl_accept_a
0        Master           True
1         Slave          False
2         Slave          False
3  Master Windu           True
You need to check what value is placed at [row, empl_accept] in your dataframe. I'm sure there will be some numeric value at this location. Just print the value and you'll see the problem, if any.
print (cd.at[row,empl_accept])
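In this particular case the 'float' is very often a NaN left by a missing cell in empl_accept (missing values are read as float NaN), which is why 'Master' in <cell> blows up. If that turns out to be the issue, the vectorised answer above can handle it via the na parameter of str.contains (a minimal sketch):
# Treat rows with a missing empl_accept as "does not contain 'Master'".
cd['empl_accept_a'] = cd['empl_accept'].str.contains('Master', na=False)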
I wrote a function that iterates over the files in a folder and selects certain data. The .csv files look like this:
Timestamp Value Result
00-00-10 34567 1.0
00-00-20 45425
00-00-30 46773 0.0
00-00-40 64567
00-00-50 25665 1.0
00-01-00 25678
00-01-10 84358
00-01-20 76869 0.0
00-01-30 95830
00-01-40 87890
00-01-50 99537
00-02-00 85957 1.0
00-02-10 58840
They are saved in the path C:/Users/me/Desktop/myfolder/data, and I wrote the code in C:/Users/me/Desktop/myfolder. The function (after @Daniel R's suggestion):
PATH = os.getcwd()+'\DATA\\'

def my_function(SourceFolder):
    for i, file_path in enumerate(os.listdir(PATH)):
        df = pd.read_csv(PATH+file_path)
        mask = (
            (df.Result == 1)
            | (df.Result.ffill() == 1)
            | ((df.Result.ffill() == 0)
               & (df.groupby((df.Result.ffill() != df.Result.ffill().shift()).cumsum()).Result.transform('size') <= 100))
        )
        df = mask[df]
        df = df.to_csv(PATH+'df_{}.csv'.format(i))
My initial question was: how do I save each df[mask] to NewFolder without overwriting the data? The code above throws AttributeError: 'str' object has no attribute 'Result'.
AttributeError Traceback (most recent call last)
<ipython-input-3-14c0dbaf5ace> in <module>()
----> 1 retrieve_data('C:/Users/me/Desktop/myfolder/DATA/*.csv')
<ipython-input-2-ba68702431ca> in my_function(SourceFolder)
6 (df.Result == 1)
7 | (df.Result.ffill() == 1)
----> 8 | ((df.Result.ffill() == 0)
9 & (df.groupby((df.Result.ffill() != df.Result.ffill().shift()).cumsum()).Result.transform('size') <= 100)))
10 df = df[mask]
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
4370 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4371 return self[name]
-> 4372 return object.__getattribute__(self, name)
4373
4374 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'Result'
If your dataframe has a structure that satisfies the requirements for a pandas DataFrame:
import pandas as pd
import os

# Let '\DATA\\' be the directory where you keep your csv files, as a subdirectory of .getcwd()
PATH = os.getcwd()+'\DATA\\'

def my_function(source_folder):
    for i, file_path in enumerate(os.listdir(PATH)):
        # Use read_csv here, not DataFrame.
        # You are still working with a filepath, not a dictionary.
        df = pd.read_csv(PATH+file_path)
        mask = ((df.Result == 1) | (df.Result.ffill() == 1) |
                ((df.Result.ffill() == 0) &
                 (df.groupby((df.Result.ffill() !=
                              df.Result.ffill().shift()).cumsum()).Result.transform('size') <= 100)))
        df = df[mask]
        df = df.to_csv(PATH+'df_{}.csv'.format(i))
As a general rule, you should provide a sample of the data you are working on when asking a question like this one; otherwise the answers you receive may not work for you. Please update the question with a sample dataframe/csv file and a mock listing of the directory contents, so I can update this answer.
If srcPath is different from os.getcwd() you may have to compute the full path, or the path relative to .getcwd(), before iterating on the files.
Also, the call to list() above may not be necessary, test the code with or without it.
Lastly, why does my_function() require two input variables?
As far as I can see only one is needed, srcPath, which is used in the .glob() call; and since it is not passed to the function, it must be a global variable.
EDIT: I have updated the code above on the basis of the modifications to the original questions, and the comments to this post down below.
EDIT 2: Turns out that your call to the glob.glob() did not produce what you wanted. See the updated code.
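On the original question of saving each filtered frame to a new folder without overwriting anything, here is a minimal sketch; the NewFolder name and the simplified mask are illustrative only:
import os
import pandas as pd

IN_PATH = os.getcwd() + '\\DATA\\'                  # same input folder as above
OUT_PATH = os.path.join(os.getcwd(), 'NewFolder')   # hypothetical output folder
os.makedirs(OUT_PATH, exist_ok=True)                # create it if it does not exist

for i, file_name in enumerate(os.listdir(IN_PATH)):
    df = pd.read_csv(os.path.join(IN_PATH, file_name))
    mask = (df.Result == 1) | (df.Result.ffill() == 1)   # simplified for illustration
    # One output file per input file, written to a separate folder,
    # so the originals in DATA are never overwritten.
    df[mask].to_csv(os.path.join(OUT_PATH, 'df_{}.csv'.format(i)), index=False)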
I am looking to find the total number of players by counting the unique screen names.
# Dependencies
import pandas as pd
# Save path to data set in a variable
df = "purchase_data.json"
# Use Pandas to read data
data_file_pd = pd.read_json(df)
data_file_pd.head()
# Find total numbers of players
player_count = len(df['SN'].unique())
TypeError Traceback (most recent call last)
<ipython-input-26-94bf0ee04d7b> in <module>()
1 # Find total numbers of players
----> 2 player_count = len(df['SN'].unique())
TypeError: string indices must be integers
Without access to the original data, this is guess work. But I think you might want something like this:
# Save path variable (?)
json_data = "purchase_data.json"
# convert json data to Pandas dataframe
df = pd.read_json(json_data)
df.head()
player_count = len(df['SN'].unique())
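As a side note, pandas has a dedicated method for counting distinct values, so the same count can be written more compactly (same df as above):
# nunique() counts distinct screen names directly, without building the unique array first.
player_count = df['SN'].nunique()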
Simply put: if you are getting this error while connecting to a schema, close the web browser, kill the pgAdmin server, and restart it. Then it will work perfectly.