First of all, thanks to everyone who takes the time to help.
I have started to learn Python and came across an opportunity to use it to my advantage at work.
I basically made a script that reads a Google Sheets file, imports it into pandas, and cleans up the data.
In the end, I just want the agents' names as columns, with all of their values from the resolucao column below them, so I can take the average resolution time for each agent, but I'm struggling to do it with a list comprehension / for loop.
This is what the DataFrame looks like after I cleaned it up.
And this is the code that I tried to run:
PS: Sorry for the messy code.
agentes_unique = list(df['Agente'].unique())
agentes_duplicated = df['Agente']
value_resolucao_duplicated = df['resolucao']
n_of_rows = []
for row in range(len(df)):
    n_of_rows.append(row)
i = 0
while n_of_rows[i] < len(n_of_rows):
    df2 = pd.DataFrame({agentes_unique[i]: (value for value in df['resolucao'][i] if df['Agente'][i] == agentes_unique[i])})
    i += 1
df2.to_excel('teste.xlsx', index=True, header=True)
But in the end it came to this error:
Traceback (most recent call last):
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\indexes\range.py", line 385, in get_loc
return self._range.index(new_key)
ValueError: 0 is not in range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Users\FELIPE\Desktop\Python\webscraping\bot_csv_extract\bot_login\main.py", line 50, in <module>
df2 = pd.DataFrame({'Agente': (valor for valor in df['resolucao'][i] if df['Agente'][i] == 'Gabriel')})
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\series.py", line 958, in __getitem__
return self._get_value(key)
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\series.py", line 1069, in _get_value
loc = self.index.get_loc(label)
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\indexes\range.py", line 387, in get_loc
raise KeyError(key) from err
KeyError: 0
I feel like I'm making some obvious mistake, but I can't fix it.
Again, thanks to anyone who tries to help.
Are you looking to do something like this? This is just sample data, but it should be a good start if I understand what you're trying to do.
data = {
    'Column1': ['Data', 'Another_Data', 'More_Data', 'Last_Data'],
    'Agente': ['Name', 'Another_Name', 'More_Names', 'Last_Name'],
    'Data': [1, 2, 3, 4]
}
df = pd.DataFrame(data)
df = df.pivot(index='Column1', columns='Agente', values='Data')
df = df.reset_index()
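For reference, the pivot should produce something like the frame below (combinations with no data become NaN); reset_index() then turns Column1 back into an ordinary column:
Agente        Another_Name  Last_Name  More_Names  Name
Column1
Another_Data           2.0        NaN         NaN   NaN
Data                   NaN        NaN         NaN   1.0
Last_Data              NaN        4.0         NaN   NaN
More_Data              NaN        NaN         3.0   NaN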
It is not recommended to use for loops against pandas DataFrames: it is considered messy and inefficient.
With some practice you will be able to approach problems in such a way that you will not need to use for loops in these cases.
From what I understand, your goal can be achieved in 3 simple steps:
1. Select the 2 columns of interest. I recommend you take a look at how to access different elements of a DataFrame:
df = df[["Agent", "resolucao"]]
2. Convert the column you want to average to a numeric value. Say seconds:
df["resolucao"] = pd.to_timedelta(df['resolucao'].astype(str)).dt.total_seconds()
3. Apply an average aggregation, via the groupby() function:
df = df.groupby(["Agente"]).mean().reset_index()
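Putting the three steps together on some made-up sample data (the agent names and duration strings below are assumptions about what your frame contains):
import pandas as pd

df = pd.DataFrame({
    'Agente': ['Gabriel', 'Ana', 'Gabriel'],        # hypothetical agents
    'resolucao': ['0:05:30', '0:02:00', '0:10:30']  # hypothetical durations
})

df = df[["Agente", "resolucao"]]
df["resolucao"] = pd.to_timedelta(df['resolucao'].astype(str)).dt.total_seconds()
df = df.groupby(["Agente"]).mean().reset_index()
print(df)
#     Agente  resolucao
# 0      Ana      120.0
# 1  Gabriel      480.0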
Hope this helps.
For next time, I also recommend not posting the data as an image, so that others can reproduce your code.
Cheers and keep it up!
Related
I'm very new to Python. I recently downloaded this project, which is used to analyze stock trends on Reddit. The project is located here:
They have this code in process.py:
def calculate_df(df):
    data_df = df.filter(['tickers', 'score', 'sentiment'])
    tickers_processed = pd.DataFrame(df.tickers.explode().value_counts())
    tickers_processed = tickers_processed.rename(columns={'tickers': 'counts'})
    tickers_processed['score'] = 0.0
    tickers_processed['sentiment'] = 0.0
    for idx, row_tick in enumerate(tickers_processed.iloc):
I'm getting an error when I try to enumerate tickers_processed.iloc:
Exception has occurred: NotImplementedError
ix is not iterable
Stack trace:
File "C:\Users\MyUser\Desktop\NLP\trading-bot-base\tickerrain\process.py", line 113, in calculate_df
for idx, row_tick in enumerate(tickers_processed.iloc):
File "C:\Users\MyUser\Desktop\NLP\trading-bot-base\tickerrain\process.py", line 152, in processed_df
return calculate_df(df), calculate_df(df_3), calculate_df(df_1)
I've looked at a few other questions about this; they said to try something like this instead:
for idx, row_tick in tickers_processed.iloc[::1]
I tried this and it didn't work either. Does anyone know how I can enumerate the iloc?
Try using df.iterrows() instead. .iloc is an indexer, not something you can iterate over, while iterrows() yields (index, row) pairs:
for idx, row_tick in tickers_processed.iterrows():
...
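For example, a minimal sketch with a frame shaped like tickers_processed (the ticker names are made up):
import pandas as pd

tickers_processed = pd.DataFrame(
    {'counts': [3, 1], 'score': [0.0, 0.0], 'sentiment': [0.0, 0.0]},
    index=['GME', 'AMC'],  # hypothetical tickers
)

# iterrows() yields (index_label, row_as_Series) pairs
for idx, row_tick in tickers_processed.iterrows():
    print(idx, row_tick['counts'])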
I'm trying to delete some rows from a huge dataset in pandas. I decided to use the iterrows() function to search for the indexes to delete (since I know that deleting while iterating is a bad idea).
Right now it looks like this:
list_to_delete = []
rows_to_delete = {}
for index, row in train.iterrows():
    if <some conditions>:
        list_to_delete.append(int(index))
        rows_to_delete[int(index)] = row
train = train.drop([train.index[i] for i in list_to_delete])
It's giving me this error:
Traceback (most recent call last):
File "C:/Users/patka/PycharmProjects/PARSER/getStatistics.py", line 115, in <module>
train = train.drop([train.index[i] for i in list_to_delete])
File "C:/Users/patka/PycharmProjects/PARSER/getStatistics.py", line 115, in <listcomp>
train = train.drop([train.index[i] for i in list_to_delete])
File "C:\Users\patka\PycharmProjects\PARSER\venv\lib\site-packages\pandas\core\indexes\base.py", line 3958, in __getitem__
return getitem(key)
IndexError: index 25378 is out of bounds for axis 0 with size 25378
How is that possible?
Before that, I created a copy of this dataset and tried to delete the chosen rows from the copy while iterating through the original (with inplace=True). Unfortunately, that raised an error saying that a NoneType object has no attribute 'drop'.
I would appreciate your help very much.
My example row looks like this:
resolution    Done
priority      Major
created       2000-07-04T13:13:52.000+0200
status        Resolved
Team          XBee
changelog     {'Team': {'from':...
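Note that iterrows() yields index labels, while train.index[i] performs a positional lookup, so once a stored label is greater than or equal to the frame's length the lookup falls out of bounds. A minimal sketch of the label-based variant, with a made-up condition standing in for <some conditions>:
list_to_delete = []
for index, row in train.iterrows():
    if row['priority'] == 'Major':    # hypothetical condition for illustration
        list_to_delete.append(index)  # index is already a label here
train = train.drop(list_to_delete)    # drop by label; no positional lookup needed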
I have tried several different methods to add a row to an existing pandas DataFrame. For example, I tried the solution here. However, I was not able to correct the issue, so I have reverted back to my original code in hopes someone can help me here.
Here is my code:
print('XDF Created, Starting Bucket Separation...')
XDFDFdrop = pd.DataFrame.duplicated(XDFDF, subset='LastSurveyMachineID')
index_of_unique = XDFDF.drop_duplicates(subset='LastSurveyMachineID')
for index, row in zip(XDFDFdrop, XDFDF.itertuples()):
    if index:
        goodBucket.append(row)
    else:
        badBucket.append(row)
goodBucketDF = pd.DataFrame(goodBucket)
badBucketDF = pd.DataFrame(badBucket)
print('Bucket Separation Complete, EmailPrefix to F+L Test Starting...')
for emp, fname, lname, row1 in zip(goodBucketDF['EmailPrefix'], goodBucketDF['Fname'], goodBucketDF['Lname'], goodBucketDF.itertuples()):
    for emp2, row2 in zip(goodBucketDF['EmailPrefix'], goodBucketDF.itertuples()):
        if columns != rows:
            temp = fuzz.token_sort_ratio((fname + lname), emp)
            temp2 = fuzz.token_sort_ratio((fname + lname), emp2)
            if abs(temp - temp2) < 10:
                badBucketDF.append(list(row2))
                goodBucketDF = goodBucketDF.drop(row2)
                removed = True
        rows += 1
        if removed:
            badBucketDF.append(list(row2))
            goodBucketDF = goodBucketDF.drop(row2)
            removed = False
    columns += 1
Please note: XDFDF is a relatively large dataset that is built using pandas and was pulled from a database (it should not affect the code you see; I just figured I would disclose that information).
This is my error:
Traceback (most recent call last):
File "/Users/john/PycharmProjects/Greatness/venv/Recipes.py", line 122, in <module>
goodBucketDF = goodBucketDF.drop([rows])
File "/Users/john/PycharmProjects/Greatness/venv/lib/python3.6/site-packages/pandas/core/frame.py", line 3694, in drop
errors=errors)
File "/Users/john/PycharmProjects/Greatness/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 3108, in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
File "/Users/john/PycharmProjects/Greatness/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 3140, in _drop_axis
new_axis = axis.drop(labels, errors=errors)
File "/Users/john/PycharmProjects/Greatness/venv/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4387, in drop
'labels %s not contained in axis' % labels[mask])
KeyError: 'labels [(15, '1397659289', 'joshi.penguin@gmail.com', 'jim', 'smith', '1994-05-04', 'joshi.penguin', 'CF032611-8A86-4688-9715-E1278E75D046')] not contained in axis'
Process finished with exit code 1
I would like to know if anyone has a solution to this error, so that I can take a row from one DataFrame and place it in the other DataFrame (it does not need to be in order, and I don't care whether the index duplicates or not). Once it is in its new DataFrame, I want to remove it from the old one.
My current issue is removing the row from the old DataFrame. Any help would be appreciated.
If you have any questions on the code please let me know and I will respond as soon as I can. Thank you for your help.
Edit 1
Below I have included a printout of row1. Hopefully this will help as well.
Pandas(Index=1, _1=2, entity_id='1180722688', email='assassin_penguin@live.com', Fname='jim', Lname='smith', Birthdate='1990-09-14', EmailPrefix='assassin_penguin', LastSurveyMachineID=None)
Given that XDFDF is a pandas.DataFrame, shouldn't the following work?
XDFDFdrop = pd.DataFrame.duplicated(XDFDF,subset='LastSurveyMachineID')
goodBucket = XDFDF.loc[~XDFDFdrop] #the ~ negates a boolean array
badBucket = XDFDF.loc[XDFDFdrop]
Edit:
The updated error comes from passing an entire row, rather than an index label, to pandas.DataFrame.drop.
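Since itertuples() yields namedtuples whose first field, Index, holds the row's index label, one sketch of a fix (not tested against your data) is to pass that label to drop:
# inside the inner loop, instead of goodBucketDF.drop(row2):
goodBucketDF = goodBucketDF.drop(row2.Index)  # Index is the namedtuple's label field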
This question is odd, since I know HOW to do something, but I don't know WHY I can't do it another way.
Suppose a simple data frame:
import pandas as pd
a = pd.DataFrame([[0,1], [2,3]])
I can slice this data frame very easily: the first column is a[[0]], the second is a[[1]]. Simple, isn't it?
Now, let's take a more complex data frame. This is part of my code:
var_vec = [i for i in range(100)]
num_of_sites = 100
row_names = ["_".join(["loc", str(i)]) for i in range(1, num_of_sites + 1)]
frame = pd.DataFrame(var_vec, columns=["Variable"], index=row_names)
spec_ab = [i**3 for i in range(100)]
frame[1] = spec_ab
The data frame frame is also a pandas DataFrame, just like a. I can get the second column very easily as frame[[1]]. But when I try frame[[0]] I get an error:
Traceback (most recent call last):
File "<ipython-input-55-0c56ffb47d0d>", line 1, in <module>
frame[[0]]
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 1991, in __getitem__
return self._getitem_array(key)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 2035, in _getitem_array
indexer = self.ix._convert_to_indexer(key, axis=1)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\indexing.py", line 1184, in _convert_to_indexer
indexer = labels._convert_list_indexer(objarr, kind=self.name)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\indexes\base.py", line 1112, in _convert_list_indexer
return maybe_convert_indices(indexer, len(self))
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\indexing.py", line 1856, in maybe_convert_indices
raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds
I can still use frame.iloc[:,0], but the problem is that I don't understand why I can't use simple slicing with [[]]. I use WinPython Spyder 3, if that helps.
Using your code:
import pandas as pd
var_vec = [i for i in range(100)]
num_of_sites = 100
row_names = ["_".join(["loc", str(i)]) for i in range(1, num_of_sites + 1)]
frame = pd.DataFrame(var_vec, columns=["Variable"], index=row_names)
spec_ab = [i**3 for i in range(100)]
frame[1] = spec_ab
If you print out frame, you get:
       Variable    1
loc_1         0    0
loc_2         1    1
loc_3         2    8
loc_4         3   27
loc_5         4   64
loc_6         5  125
......
So the cause of your problem becomes obvious: you have no column called 0.
In line one you create a list called var_vec.
In line 4 you make a DataFrame out of that list, but you specify the index values and the column name (which is usually good practice).
Numerical column names ('0', '1', ... as in the first example) only appear when you don't specify column names; a column name is a label, not a positional indexer.
If you want to access columns by their position, you can:
df[df.columns[0]]
What happens then is that you get the list of the df's columns, pick element 0, and pass it back to the df as a label.
Hope that helps you understand.
Edit:
Another (better) way would be:
df.iloc[:,0]
where ":" stands for all rows (rows are likewise indexed by position, from 0 up to the number of rows).
I have a file of data and want to select a specific state. From there I need to return the counts in a list, but some years will have missing data, so I need to fill in the missing values.
I am having some issues with my code; likely something is slightly off in my for loop:
def stateCountAsList(filepath, state):
    import pandas as pd
    pd.set_option('display.width', 200)
    import numpy as np
    dataFrame = pd.read_csv(filepath, header=0, sep='\t')
    df = dataFrame.iloc[0:638, :]
    dfState = df[df['State'] == state]
    yearList = range(1999, 2012)
    countsList = []
    for dfState['Year'] in yearList:
        countsList = dfState['Count']
    else:
        countsList.append(np.nan)
    return countsList
    print countsList.tolist()

stateCountAsList(filepath, state)
state = 'California'
Traceback:
C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py:59: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
for dfState['Year'] in yearList:
Traceback (most recent call last):
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 67, in <module>
stateCountAsList(filepath, state)
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 62, in stateCountAsList
countsList.append(np.nan)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\core\series.py", line 1466, in append
verify_integrity=verify_integrity)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\tools\merge.py", line 754, in concat
copy=copy)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\tools\merge.py", line 805, in __init__
raise TypeError("cannot concatenate a non-NDFrame object")
TypeError: cannot concatenate a non-NDFrame object
You have at least two different issues in your code:
The warning
A value is trying to be set on a copy of a slice from a DataFrame.
is triggered by for dfState['Year'] in yearList (line 59 in your code). In this line you try to loop over a range of years (1999 to 2012), but instead you implicitly assign each year value to dfState['Year']. dfState is not a copy but a "view" (http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy), since df = dataFrame.iloc[0:638,:] returns a view.
But as mentioned earlier, you don't want to assign a value to the DataFrame here, only loop over years. So the for-loop should look like:
for year in range(1999,2012):
...
The second issue is in line 62. Here, you try to append np.nan to your "list" countsList; but countsList is not a list anymore, it is a pandas Series!
Two lines before, you assigned a pd.Series to it (countsList = dfState['Count']), effectively changing its type. This gives you the TypeError: cannot concatenate a non-NDFrame object.
With this information you should be able to correct your loop.
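For example, a corrected loop could look like this (a sketch; it assumes you want one count per year, with NaN for years that have no row):
countsList = []
for year in range(1999, 2012):
    counts = dfState.loc[dfState['Year'] == year, 'Count']
    if len(counts) > 0:
        countsList.append(counts.iloc[0])  # this year's count
    else:
        countsList.append(np.nan)          # no data for this year
# then return countsList from the function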
As an alternative, you can get the desired result using pandas' query method (http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method-experimental):
def stateCountAsList(filepath, state):
    import pandas as pd
    import numpy as np
    dataFrame = pd.read_csv(filepath, header=0, sep='\t')
    df = dataFrame.iloc[0:638, :]
    # @state in the query string refers to the local variable state
    stateList = df.query("(State == @state) & (Year >= 1999) & (Year < 2012)").Count.tolist()
    return stateList