Get the first column of a pandas DataFrame? - python

I want to calculate the std of the first column of my prices DataFrame.
Here is my code:
import pandas as pd
def std(returns):
    return pd.DataFrame(returns.std(axis=0, ddof=0))

prices = pd.DataFrame([[-0.33333333, -0.25343423, -0.1666666667],
                       [+0.23432323, +0.14285714, -0.0769230769],
                       [+0.42857143, +0.07692308, +0.1818181818]])
print(std(prices.ix[:,0]))
When I run it, I get the following error:
Traceback (most recent call last):
File "C:\Users\*****\Documents\******\******\****.py", line 12, in <module>
print(std(prices.ix[:,0]))
File "C:\Users\*****\Documents\******\******\****.py", line 10, in std
return pd.DataFrame(returns.std(axis=0, ddof=0))
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 453, in __init__
raise PandasError('DataFrame constructor not properly called!')
pandas.core.common.PandasError: DataFrame constructor not properly called!
How can I fix that?
Thank you!

Take a closer look at what is going on in your code:
>>> prices.ix[:,0]
0 -0.333333
1 0.234323
2 0.428571
>>> prices.ix[:,0].std(axis=0, ddof=0)
0.32325861621668445
So you are calling the DataFrame constructor like this:
pd.DataFrame(0.32325861621668445)
The constructor has no idea what to do with a single float parameter. It needs some kind of sequence or iterable. Maybe what you want is this:
>>> pd.DataFrame([0.32325861621668445])
0
0 0.323259
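So one way to make the original helper work, keeping the DataFrame return type the question asks for, is to wrap the scalar in a list before passing it to the constructor (a minimal sketch; .iloc is used in place of the deprecated .ix):
import pandas as pd

def std(returns):
    # returns.std(...) yields a single float, so wrap it in a list
    # before handing it to the DataFrame constructor
    return pd.DataFrame([returns.std(axis=0, ddof=0)])

prices = pd.DataFrame([[-0.33333333, -0.25343423, -0.1666666667],
                       [+0.23432323, +0.14285714, -0.0769230769],
                       [+0.42857143, +0.07692308, +0.1818181818]])

print(std(prices.iloc[:, 0]))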

It should be as simple as this:
In [0]: prices[0].std()
Out[0]: 0.39590933234452624
Columns of DataFrames are Series. You can call Series methods on them directly.

How can I make a for loop to populate a DataFrame?

First of all, thanks to everyone who takes the time to help.
I have started to learn Python and came across an opportunity to use it to my advantage at work.
I basically made a script that reads a Google Sheets file, imports it into pandas and cleans up the data.
In the end, I just want to have the names of the agents as columns, with all of their values from the resolucao column below them, so I can take the average amount of time for each agent, but I'm struggling to do it with a list comprehension / for loop.
This is what the DataFrame looks like after I cleaned it up
And this is the code that I tried to run:
PS: Sorry for the messy code.
agentes_unique = list(df['Agente'].unique())
agentes_duplicated = df['Agente']
value_resolucao_duplicated = df['resolucao']
n_of_rows = []
for row in range(len(df)):
    n_of_rows.append(row)
i = 0
while n_of_rows[i] < len(n_of_rows):
    df2 = pd.DataFrame({agentes_unique[i]: (value for value in df['resolucao'][i] if df['Agente'][i] == agentes_unique[i])})
    i += 1
df2.to_excel('teste.xlsx', index=True, header=True)
But in the end it came to this error:
Traceback (most recent call last):
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\indexes\range.py", line 385, in get_loc
return self._range.index(new_key)
ValueError: 0 is not in range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Users\FELIPE\Desktop\Python\webscraping\bot_csv_extract\bot_login\main.py", line 50, in <module>
df2 = pd.DataFrame({'Agente': (valor for valor in df['resolucao'][i] if df['Agente'][i] == 'Gabriel')})
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\series.py", line 958, in __getitem__
return self._get_value(key)
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\series.py", line 1069, in _get_value
loc = self.index.get_loc(label)
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\indexes\range.py", line 387, in get_loc
raise KeyError(key) from err
KeyError: 0
I feel like I'm making some obvious mistake but I can't fix it.
Again, thanks to anyone who tries to help.
Are you looking to do something like this? This is just sample data, but it should be a good start if I understand what you're trying to do.
import pandas as pd

data = {
    'Column1': ['Data', 'Another_Data', 'More_Data', 'Last_Data'],
    'Agente': ['Name', 'Another_Name', 'More_Names', 'Last_Name'],
    'Data': [1, 2, 3, 4]
}
df = pd.DataFrame(data)
df = df.pivot(index='Column1', columns=['Agente'], values='Data')
df.reset_index()
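Note that reset_index() returns a new DataFrame rather than modifying df in place, so if you want to keep the flattened index you would reassign it (a small assumption about intent):
df = df.reset_index()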
It is not recommended to use for loops with pandas DataFrames: it is considered messy and inefficient.
With some practice you will be able to approach problems in such a way that you will not need to use for loops in these cases.
From what I understand, your goal can be realized in 3 simple steps:
1. Select the 2 columns of interest. I recommend you take a look at how to access different elements of a DataFrame:
df = df[["Agent", "resolucao"]]
2. Convert the column you want to average to a numeric value, say seconds:
df["resolucao"] = pd.to_timedelta(df['resolucao'].astype(str)).dt.total_seconds()
3. Apply an average aggregation, via the groupby() function:
df = df.groupby(["Agente"]).mean().reset_index()
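Putting the three steps together on a small, purely hypothetical sample (the agent names and durations below are made up for illustration):
import pandas as pd

# hypothetical stand-in for the cleaned-up sheet
df = pd.DataFrame({
    "Agente": ["Gabriel", "Gabriel", "Maria"],
    "resolucao": ["0 days 00:10:00", "0 days 00:20:00", "0 days 00:05:00"],
})

# 1. keep only the columns of interest
df = df[["Agente", "resolucao"]]
# 2. convert the durations to seconds
df["resolucao"] = pd.to_timedelta(df["resolucao"].astype(str)).dt.total_seconds()
# 3. average per agent
print(df.groupby(["Agente"]).mean().reset_index())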
Hope this helps.
For next time, I also recommend not posting the data as an image, so that others can reproduce your code.
Cheers and keep it up!

Pysolar get_azimuth function applied to pandas DataFrame

I have a pandas DataFrame with columns latitude, longitude (which are integer type) and a date column (datetime64[ns, UTC] - as needed for the function). I use the following line to produce a new column with the sun's azimuth:
daa['azimuth'] = daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date']), axis=1)
It crashes and I cannot figure out why; the only thing I know is that there is a problem with the date:
File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
If anyone has an idea what I am supposed to do with the date, it would be great, thanks.
This goes back to a bug in pandas, see issue #32174. pysolar.solar.get_azimuth calls the .utctimetuple() method of the given datetime object (or pd.Timestamp), which fails:
import pandas as pd
s = pd.to_datetime(pd.Series(["2020-01-01", "2020-01-02"])).dt.tz_localize('UTC')
s.iloc[0]
Out[3]: Timestamp('2020-01-01 00:00:00+0000', tz='UTC')
s.iloc[0].utctimetuple()
Traceback (most recent call last):
File "<ipython-input-4-f5e393f18fdb>", line 1, in <module>
s.iloc[0].utctimetuple()
File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
You can work around this by converting the pandas Timestamp to a Python datetime object, where utctimetuple works as expected. For the given example, you can use:
daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date'].to_pydatetime()), axis=1)
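If you prefer to avoid the per-row conversion inside apply, a rough alternative sketch (assuming daa has integer latitude/longitude columns and a tz-aware date column, as described) is to convert the whole date column once and iterate:
from pysolar.solar import get_azimuth

dates = daa['date'].dt.to_pydatetime()  # array of plain Python datetime objects
daa['azimuth'] = [
    get_azimuth(lat, lon, when)
    for lat, lon, when in zip(daa['latitude'], daa['longitude'], dates)
]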

How to print unique values of a column in a group using Pandas?

I am trying to print the unique values of the column ADO_name in my data set. Following is an example data set and the code I tried (which gives an error):
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

data = {'ADO_name': ['car1', 'car1', 'car1', 'car2', 'car2', 'car2'],
        'Time_sec': [0, 1, 2, 0, 1, 2],
        'Speed.kph': [50, 51, 52, 0, 0, 52]}
dframe = DataFrame(data)

for ado in dframe.groupby('ADO_name'):
    ado_name = ado["ADO_name"]
    adoID = ado_name.unique()
    print(adoID)
Traceback (most recent call last):
File "C:\Users\Quinton\AppData\Local\Temp\Rtmp88ifpB\chunk-code-188c39fc7de8.txt", line 14, in <module>
ado_name = ado["ADO_name"]
TypeError: tuple indices must be integers or slices, not str
What am I doing wrong and how do I fix it? Please help.
You can do: dframe["ADO_name"].unique().
If you want to keep the groupby loop, note that iterating over a groupby yields (name, group) tuples, so you have to take the group part before selecting the column. Here is what you need to correct in your code:
for ado in dframe.groupby('ADO_name'):
    ado_name = ado[1]["ADO_name"]
    adoID = ado_name.unique()
    print(adoID)
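Equivalently, since iterating over a groupby yields (name, group) pairs, you can unpack them directly in the loop (a small sketch of the same idea):
for name, group in dframe.groupby('ADO_name'):
    # name is the group key, group is the sub-DataFrame for that key
    print(name, group["ADO_name"].unique())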

Python Pandas: creating a dataframe using a function for one of the fields

I am trying to create a dataframe where one of the fields is calculated using a function. To do this I use the following code:
import pandas as pd

def didSurvive(sex):
    return int(sex == "female")

titanic_df = pd.read_csv("test.csv")

submission = pd.DataFrame({
    "PassengerId": titanic_df["PassengerId"],
    "Survived": didSurvive(titanic_df["Sex"])
})

submission.to_csv('titanic-predictions.csv', index=False)
When I run this code, I get the following error:
D:\Documents\kaggle\titanic>python predictor.py
File "predictor.py", line 3
def didSurvive() {
^
SyntaxError: invalid syntax
D:\Documents\kaggle\titanic>python predictor.py
Traceback (most recent call last):
File "predictor.py", line 10, in
"Survived": didSurvive(titanic_df["Sex"])
File "predictor.py", line 4, in didSurvive
return int(sex == "female")
File "C:\Python34\lib\site-packages\pandas\core\series.py", line 92,
in wrapper
"{0}".format(str(converter)))
TypeError: cannot convert the series to
D:\Documents\kaggle\titanic>
I think what is happening is that I'm trying to run int() on a Series of booleans instead of an individual boolean. How do I go about fixing this?
To convert the data type of a Series, you can use the astype() function; this should work:
def didSurvive(sex):
    return (sex == "female").astype(int)
You can also reformat the data during the import from the csv file:
titanic_df = pd.read_csv("test.csv", converters={'Sex':didSurvive})
submission = pd.DataFrame(titanic_df, columns=['PassengerId', 'Sex'])
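One caveat worth spelling out (an assumption about how the two variants combine): with read_csv's converters= option the function is called once per cell value, i.e. on a plain string, so the original per-value version of didSurvive is the one that fits that route:
import pandas as pd

def didSurvive(sex):
    # called on each cell value (a string) by read_csv's converters=
    return int(sex == "female")

titanic_df = pd.read_csv("test.csv", converters={'Sex': didSurvive})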

Apply SequenceMatcher to DataFrame

I'm new to pandas and Python in general, so I'm hoping someone can help me with this simple question. I have a large dataframe m with several million rows and seven columns, including ITEM_NAME_x and ITEM_NAME_y. I want to compare ITEM_NAME_x and ITEM_NAME_y using SequenceMatcher.ratio(), and add a new column to the dataframe with the result.
I've tried to come at this several ways, but keep running into errors:
>>> m.apply(SequenceMatcher(None, str(m.ITEM_NAME_x), str(m.ITEM_NAME_y)).ratio(), axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4416, in apply
return self._apply_standard(f, axis)
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4491, in _apply_standard
raise e
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4480, in _apply_standard
results[i] = func(v)
TypeError: ("'float' object is not callable", 'occurred at index 0')
Could someone help me fix this?
You have to apply a function, not a float, which is what the expression SequenceMatcher(None, str(m.ITEM_NAME_x), str(m.ITEM_NAME_y)).ratio() evaluates to.
Working demo (a draft):
import difflib
from functools import partial
import pandas as pd

def apply_sm(s, c1, c2):
    return difflib.SequenceMatcher(None, s[c1], s[c2]).ratio()

df = pd.DataFrame({'A': {1: 'one'}, 'B': {1: 'two'}})
print(df.apply(partial(apply_sm, c1='A', c2='B'), axis=1))
output:
1 0.333333
dtype: float64
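Applied to the original frame, the same pattern would look roughly like this (ITEM_NAME_x and ITEM_NAME_y are the question's columns; the new column name is just an example):
m['name_ratio'] = m.apply(partial(apply_sm, c1='ITEM_NAME_x', c2='ITEM_NAME_y'), axis=1)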
