Saving each new dataframe created inside a for-loop in Python - python

I wrote a function that iterates over the files in a folder and selects certain data. The .csv files look like this:
Timestamp Value Result
00-00-10 34567 1.0
00-00-20 45425
00-00-30 46773 0.0
00-00-40 64567
00-00-50 25665 1.0
00-01-00 25678
00-01-10 84358
00-01-20 76869 0.0
00-01-30 95830
00-01-40 87890
00-01-50 99537
00-02-00 85957 1.0
00-02-10 58840
They are saved in the path C:/Users/me/Desktop/myfolder/data, and I wrote the code in C:/Users/me/Desktop/myfolder. The function (after Daniel R's suggestion):
PATH = os.getcwd() + '\\DATA\\'

def my_function(SourceFolder):
    for i, file_path in enumerate(os.listdir(PATH)):
        df = pd.read_csv(PATH + file_path)
        mask = (
            (df.Result == 1)
            | (df.Result.ffill() == 1)
            | ((df.Result.ffill() == 0)
               & (df.groupby((df.Result.ffill() != df.Result.ffill().shift()).cumsum())
                    .Result.transform('size') <= 100))
        )
        df = df[mask]
        df.to_csv(PATH + 'df_{}.csv'.format(i))
My initial question was: how do I save each df[mask] to NewFolder without overwriting the data? The code above now throws an AttributeError (full traceback below):
AttributeError Traceback (most recent call last)
<ipython-input-3-14c0dbaf5ace> in <module>()
----> 1 retrieve_data('C:/Users/me/Desktop/myfolder/DATA/*.csv')
<ipython-input-2-ba68702431ca> in my_function(SourceFolder)
6 (df.Result == 1)
7 | (df.Result.ffill() == 1)
----> 8 | ((df.Result.ffill() == 0)
9 & (df.groupby((df.Result.ffill() != df.Result.ffill().shift()).cumsum()).Result.transform('size') <= 100)))
10 df = df[mask]
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
4370 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4371 return self[name]
-> 4372 return object.__getattribute__(self, name)
4373
4374 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'Result'

Assuming your CSV files parse into well-formed pandas DataFrames, the following works:
import pandas as pd
import os

# Let '\\DATA\\' be the directory where you keep your csv files,
# as a subdirectory of os.getcwd()
PATH = os.getcwd() + '\\DATA\\'

def my_function(source_folder):
    for i, file_path in enumerate(os.listdir(PATH)):
        # Use read_csv here, not DataFrame: you are still working
        # with a file path, not a dictionary.
        df = pd.read_csv(PATH + file_path)
        mask = (
            (df.Result == 1)
            | (df.Result.ffill() == 1)
            | ((df.Result.ffill() == 0)
               & (df.groupby((df.Result.ffill() != df.Result.ffill().shift()).cumsum())
                    .Result.transform('size') <= 100))
        )
        df = df[mask]
        df.to_csv(PATH + 'df_{}.csv'.format(i))
As a general rule, when asking a question like this you should provide a sample of the data you are working with; otherwise the answers you receive may not work for you. Please update the question with a sample dataframe/CSV file and a mock listing of the directory contents, so I can update this answer.
If srcPath is different from os.getcwd(), you may have to compute the full path, or the path relative to os.getcwd(), before iterating over the files.
Also, the call to list() above may not be necessary; test the code with and without it.
Lastly, why does my_function() require two input variables? As far as I can see, only one is needed: srcPath, which is used in .glob() but is not passed to the function, so it must be a global variable.
EDIT: I have updated the code above based on the modifications to the original question and the comments below this post.
EDIT 2: It turns out that your call to glob.glob() did not produce what you wanted. See the updated code.
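As an alternative sketch (assuming the CSVs live in a DATA subdirectory next to the script; the OUTPUT folder name is my own choice, not from the question), pathlib sidesteps the backslash-escaping issues in the concatenated Windows paths above, and writing to a separate folder guarantees no input file is ever overwritten:

```python
import pathlib

import pandas as pd

def process_all(data_dir="DATA", out_dir="OUTPUT"):
    data_path = pathlib.Path(data_dir)
    out_path = pathlib.Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # write elsewhere: inputs stay intact
    for i, csv_file in enumerate(sorted(data_path.glob("*.csv"))):
        df = pd.read_csv(csv_file)
        # ... apply the mask from the answer above here ...
        # Name the output after the counter and the input stem, so each
        # source file maps to a distinct, recognisable result file.
        df.to_csv(out_path / "df_{}_{}.csv".format(i, csv_file.stem), index=False)
```

Deriving each output name from its input name also makes it easy to trace a result back to its source file later.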

Related

How to call a function into a file

I'm trying to import a function from a file in another folder, with this structure:
Data_Analytics
|--- Src
|    |--- __init__.py
|    |--- DEA_functions.py
|    |--- importing_modules.py
|--- Data Exploratory Analysis
|    |--- File.ipynb
So, from File.ipynb (for now I'm working in a notebook) I want to call a function that I have in the file DEA_functions.py. To do that I typed:
import sys
sys.path.insert(1, "../")
from Src.Importing_modules import *
import Src.DEA_functions as DEA
No errors during the importing process, but when I call the function I get this error:
AttributeError: module 'Src.DEA_functions' has no attribute 'getIndexes'
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-29-534dff78ff93> in <module>
4
5 #There is a negative value that we want to delete
----> 6 DEA.getIndexes(df,df['Y3'].min())
7 df['Y3'].min()
8 df['Y3'].iloc[14289]=0
AttributeError: module 'Src.DEA_functions' has no attribute 'getIndexes'
And the function is defined in the file this way:
def getIndexes(dfObj, value):
    '''Get index positions of value in dataframe i.e. dfObj.'''
    listOfPos = list()
    # Get bool dataframe with True at positions where the given value exists
    result = dfObj.isin([value])
    # Get list of columns that contain the value
    seriesObj = result.any()
    columnNames = list(seriesObj[seriesObj == True].index)
    # Iterate over the columns and fetch the row indexes where the value exists
    for col in columnNames:
        rows = list(result[col][result[col] == True].index)
        for row in rows:
            listOfPos.append((row, col))
    # Return a list of tuples indicating the positions of value in the dataframe
    return listOfPos
I hope I made myself clear, but if not, do not hesitate to ask about whatever you need. I just want to use the functions defined in my file DEA_functions.py from my File.ipynb.
Thank you!
I found the error: I had assigned DEA as the short name for calling my functions. It looks like I had to use lowercase letters instead. So:
import Src.DEA_funct as dea
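Aside from the naming fix, a common cause of "module 'X' has no attribute 'Y'" in notebooks is a stale import: the module was imported before the function was added, and Python caches module objects. A minimal self-contained sketch (file and function names hypothetical, standing in for Src.DEA_functions) of forcing a re-read with importlib.reload:

```python
import importlib
import os
import sys
import tempfile

# Hypothetical setup: write a tiny module to disk, import it, then change it.
pkg_dir = tempfile.mkdtemp()
mod_path = os.path.join(pkg_dir, "dea_functions.py")
with open(mod_path, "w") as f:
    f.write("def get_indexes(df_obj, value):\n    return 'v1'\n")

sys.path.insert(0, pkg_dir)        # same idea as sys.path.insert(1, '../')
importlib.invalidate_caches()
import dea_functions as dea        # Python caches this module object

# Add a new function to the file *after* the import...
with open(mod_path, "w") as f:
    f.write("def get_indexes(df_obj, value):\n    return 'v1'\n"
            "def new_helper():\n    return 'v2'\n")

# ...without a reload, dea.new_helper would raise AttributeError.
importlib.reload(dea)
print(dea.new_helper())            # -> v2
```

Restarting the notebook kernel has the same effect as the reload, at the cost of losing all in-memory state.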

Trying to call a function in Jupyterlab that I saved in my Working Directory

I'm using JupyterLab and was trying to save a function as a module in my cwd, so I could call it from a separate notebook.
Following other similar posts, I saved it as outlier.py and placed it in my cwd.
This is the function:
import pandas

def remove_pps_outliers(df):
    df_out = pandas.DataFrame()  # taking a new dataframe as output
    for key, subdf in df.groupby('location'):  # grouping by location
        m = np.mean(subdf.price_per_sqft)   # per-location mean of the sub-dataframe
        st = np.std(subdf.price_per_sqft)   # per-location standard deviation
        # keep only the datapoints above (m - st) and at or below (m + st),
        # i.e. within one standard deviation of the mean
        # (think of a normal distribution curve |___SD___ME|AN__SD__|)
        reduced_df = subdf[(subdf.price_per_sqft > (m - st)) &
                           (subdf.price_per_sqft <= (m + st))]
        # keep appending the per-location results to the output dataframe
        df_out = pandas.concat([df_out, reduced_df], ignore_index=True)
    return df_out
This is how I called it:
from outliers import *
df7 = remove_pps_outliers(df6)
df7.shape
and this is the error I'm getting:
~\outliers.py in remove_pps_outliers(df)
1 import pandas
----> 2
3 def remove_pps_outliers(df):
4 df_out = pandas.DataFrame() #taking new dataframe as output
5 for key, subdf in df.groupby('location'): # grouping by location
NameError: name 'pd' is not defined
Help?
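The traceback points at the likely cause: the saved outlier.py on disk still referenced pd, while the version shown uses np without ever importing numpy, so the file and the notebook are out of sync. A hedged sketch of a self-consistent outlier.py (column names taken from the question); after saving it, restart the kernel or reload the module so the notebook picks up the new file:

```python
import numpy as np
import pandas as pd

def remove_pps_outliers(df):
    """Keep rows within one standard deviation of the per-location mean price_per_sqft."""
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)   # per-location mean
        st = np.std(subdf.price_per_sqft)   # per-location standard deviation
        reduced_df = subdf[(subdf.price_per_sqft > (m - st)) &
                           (subdf.price_per_sqft <= (m + st))]
        df_out = pd.concat([df_out, reduced_df], ignore_index=True)
    return df_out
```

Importing both libraries explicitly at the top of the module means the file no longer depends on names defined elsewhere in the notebook.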

Function to print dataframe, that uses df name as an argument

Inside the function, I can't use an argument to stand in for the df in df.to_csv().
I have a long script to pull apart and understand. To do so, I want to save the different dataframes it uses and store them in order. I created a function that does this and adds the order number (number_of_interim_exports, zero-padded, e.g. 01) to the name from the argument.
My problem is that I need this for multiple dataframe names, but the df.to_csv part won't accept an argument in place of df...
def print_interim_results_any(name, num_exports, df_name):
    global number_of_interim_exports
    global print_interim_outputs
    if print_interim_outputs == 1:
        csvName = str(number_of_interim_exports).zfill(2) + "_" + name
        interimFileName = "interim_export_" + csvName + ".csv"
        df.to_csv(interimFileName, sep=';', encoding='utf-8', index=False)
        number_of_interim_exports += 1
I think I just screwed something else up: this works fine:
import pandas as pd

df = pd.DataFrame({1: [1, 2, 3]})

def f(frame):
    frame.to_csv("interimFileName.csv")

f(df)

AttributeError: 'Series' object has no attribute 'label'

I'm trying to follow a tutorial on sound classification in neural networks, and I've found 3 different versions of the same tutorial, all of which work, but they all reach a snag at this point in the code, where I get the "AttributeError: 'Series' object has no attribute 'label'" issue. I'm not particularly au fait with either NNs or Python, so apologies if this is something trivial like a deprecation error, but I can't seem to figure it out myself.
def parser(row):
    # function to load files and extract features
    file_name = os.path.join(os.path.abspath(data_dir), 'Train/train', str(row.ID) + '.wav')
    # handle the exception in case a file is corrupted
    try:
        # here kaiser_fast is a technique used for faster extraction
        X, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        # we extract mfcc features from the data
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    except Exception as e:
        print("Error encountered while parsing file: ", file_name)
        return None, None
    feature = mfccs
    label = row.Class
    return [feature, label]

temp = train.apply(parser, axis=1)
temp.columns = ['feature', 'label']
from sklearn.preprocessing import LabelEncoder
X = np.array(temp.feature.tolist())
y = np.array(temp.label.tolist())
lb = LabelEncoder()
y = np_utils.to_categorical(lb.fit_transform(y))
As mentioned, I've seen three different tutorials on the same subject, all of which end with the same "temp = train.apply(parser, axis=1); temp.columns = ['feature', 'label']" fragment, so I'm assuming this assigns correctly, but I don't know where else it's going wrong. Help appreciated!
Edit: Traceback added as requested; it turns out I'd pasted the wrong traceback earlier. I've also since found out that this is a case of converting the Series object to a DataFrame, so any help with that would be great.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-1613f53e2d98> in <module>()
1 from sklearn.preprocessing import LabelEncoder
2
----> 3 X = np.array(temp.feature.tolist())
4 y = np.array(temp.label.tolist())
5
/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
4370 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4371 return self[name]
-> 4372 return object.__getattribute__(self, name)
4373
4374 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'feature'
Your current implementation of the parser(row) method returns a list for each row of data from the train DataFrame, but the results are then collected as a pandas.Series object.
So your temp is actually a Series. The following line then has no effect:
temp.columns = ['feature', 'label']
Since temp is a Series, it has no columns, so temp.feature and temp.label don't exist, hence the error.
Change your parser() method as follows:
def parser(row):
    ...
    ...
    ...
    # Return pandas.Series instead of a list
    return pd.Series([feature, label])
By doing this, apply in temp = train.apply(parser, axis=1) will return a DataFrame, so the rest of your code will work.
I cannot say anything about the tutorials you are following. Maybe they were written against an older version of pandas that automatically converted the list to a DataFrame.
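The difference can be seen with a toy DataFrame (data and parser bodies hypothetical, standing in for the audio-feature extraction):

```python
import pandas as pd

train = pd.DataFrame({'ID': [1, 2], 'Class': ['dog', 'cat']})

def parser_list(row):
    # Returning a plain list: apply collects a Series of lists.
    return [row.ID * 10, row.Class]

def parser_series(row):
    # Returning a pd.Series: apply expands it into DataFrame columns.
    return pd.Series([row.ID * 10, row.Class])

temp_bad = train.apply(parser_list, axis=1)
temp_good = train.apply(parser_series, axis=1)
temp_good.columns = ['feature', 'label']  # the rename now actually works

print(type(temp_bad).__name__)   # Series
print(type(temp_good).__name__)  # DataFrame
```

With the Series-returning parser, temp_good.feature and temp_good.label resolve as ordinary columns, which is exactly what the LabelEncoder step needs.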

Some operations on DataFrame

I am working on parsing a *.csv file, so I'm trying to create a class that simplifies some operations on a DataFrame.
I've created two methods in order to parse a column 'z' that contains values for the 'Price' column.
def subr(self):
    isone = self.df.z == 1.0
    if isone.any():
        atone = self.df.Price[isone].iloc[0]
        self.df.loc[self.df.z.between(0.8, 2.5), 'Benchmark'] = atone
        # df.loc[(df.r >= .8) & (df.r <= 1.4), 'value'] = atone
    return self.df

def obtain_z(self):
    "Return a column with z for E_ref"
    self.z_col = self.subr()
    self.dfnew = self.df.groupby((self.df.z < self.df.z.shift()).cumsum()).apply(self.z_col)
    return self.dfnew

def main():
    x = ParseDataBase('data.csv')
    file_content = x.read_file()
    new_df = x.obtain_z()
I'm getting the following error:
'DataFrame' objects are mutable, thus they cannot be hashed
"'DataFrame' objects are mutable" means that we can change elements of the frame, but I'm not sure where I'm hashing anything.
I noticed that the use of apply(self.z_col) is what goes wrong, but I have no clue how to fix it.
You are passing the DataFrame self.df returned by self.subr() to apply, but apply only takes callables (functions) as parameters; see the examples in the pandas GroupBy.apply documentation.
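A small self-contained sketch of the distinction (toy data; column names from the question): apply wants a callable that receives each group, not a DataFrame a method has already returned:

```python
import pandas as pd

df = pd.DataFrame({'z': [1.0, 0.5, 1.0, 0.5],
                   'Price': [10, 20, 30, 40]})

# Same grouping expression as in obtain_z: a new group starts whenever z drops.
groups = df.groupby((df.z < df.z.shift()).cumsum())

# Wrong: groups.apply(df) tries to hash the DataFrame internally, raising
# "'DataFrame' objects are mutable, thus they cannot be hashed".
# Right: pass a callable that receives each sub-DataFrame:
result = groups.apply(lambda g: g.Price.sum())
print(result.tolist())  # [10, 50, 40]
```

In the class from the question, that would mean passing a function (or self.subr itself, without calling it) to apply, rather than the DataFrame that self.subr() returns.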
