Pysolar get_azimuth function applied to pandas DataFrame - python

I've got a pandas DataFrame with columns latitude and longitude (which are integer type) and a date column (datetime64[ns, UTC], as needed for the function). I use the following line to produce a new column with the sun's azimuth:
daa['azimuth'] = daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date']), axis=1)
It crashes and I cannot figure out why; the only thing I know is that there is a problem with the date:
File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
If anyone has an idea of what I am supposed to do with the date, it would be great, thanks.

This goes back to a bug in pandas, see issue #32174. pysolar.solar.get_azimuth calls the .utctimetuple() method of the given datetime object (or pd.Timestamp), which fails:
import pandas as pd
s = pd.to_datetime(pd.Series(["2020-01-01", "2020-01-02"])).dt.tz_localize('UTC')
s.iloc[0]
Out[3]: Timestamp('2020-01-01 00:00:00+0000', tz='UTC')
s.iloc[0].utctimetuple()
Traceback (most recent call last):
File "<ipython-input-4-f5e393f18fdb>", line 1, in <module>
s.iloc[0].utctimetuple()
File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
You can work around this by converting the pandas Timestamp to a Python datetime object, where utctimetuple works as expected. For the given example, you can use
daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date'].to_pydatetime()), axis=1)
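Putting the workaround together, here is a minimal self-contained sketch; the coordinates and timestamps below are made-up values for illustration only:
import pandas as pd
from pysolar.solar import get_azimuth

# Hypothetical data shaped like the question's DataFrame
daa = pd.DataFrame({
    'latitude': [48.2, 52.5],
    'longitude': [16.4, 13.4],
    'date': pd.to_datetime(['2020-06-01 12:00', '2020-06-01 12:00']).tz_localize('UTC'),
})

# Convert each tz-aware Timestamp to a plain datetime before handing it to pysolar
daa['azimuth'] = daa.apply(
    lambda row: get_azimuth(row['latitude'], row['longitude'],
                            row['date'].to_pydatetime()),
    axis=1,
)
print(daa['azimuth'])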

Related

Python pandas filter by word

I have a CSV file:
df=pd.read_csv(Path(os.getcwd()+r'\all_files.csv'), sep=',', on_bad_lines='skip', index_col=False, dtype='unicode')
column:
column=input("Column:")
word:
word=input("Word:")
I want to filter the CSV file:
df2=df[(df[column].dropna().str.contains(word.lower()))]
But when I enter ЄДРПОУ(Гр.8) as the column, I get an error:
Warning (from warnings module):
File "C:\python\python\FilterExcelFiles.py", line 35
df2=df[(df[column].dropna().str.contains(word.lower()))]
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
Traceback (most recent call last):
File "C:\python\python\FilterExcelFiles.py", line 51, in <module>
s()
File "C:\python\python\FilterExcelFiles.py", line 35, in s
df2=df[(df[column].dropna().str.contains(word.lower()))]
File "C:\Users\Станислав\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\frame.py", line 3496, in __getitem__
return self._getitem_bool_array(key)
File "C:\Users\Станислав\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\frame.py", line 3549, in _getitem_bool_array
key = check_bool_indexer(self.index, key)
File "C:\Users\Станислав\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexing.py", line 2383, in check_bool_indexer
raise IndexingError(
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
And I want to lowercase df[column].
You're dropping the NaNs in the indexer, which likely makes it shorter than the DataFrame; that mismatch is what causes the boolean-indexing error.
Don't dropna, the NaN will be False anyway:
df2 = df[df[column].str.contains(word.lower())]
Alternatively, if you had an operation that returned NaNs, you could fill them with False:
df2 = df[df[column].str.contains(word.lower()).fillna(False)]
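For reference, a minimal sketch of this approach on toy data; the rows and the search word are made up for illustration, and the column is lowercased as well since you also want to lower df[column]:
import pandas as pd

# Hypothetical data with a missing value in the searched column
df = pd.DataFrame({'ЄДРПОУ(Гр.8)': ['APPLE PIE', None, 'banana split']})
column, word = 'ЄДРПОУ(Гр.8)', 'PIE'

# Lowercase the column, match the lowercased word, and fill NaN with False
# so the boolean mask stays aligned with the DataFrame
mask = df[column].str.lower().str.contains(word.lower()).fillna(False)
print(df[mask])  # only the first row remains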
I have searched around for an answer and came across a similar post that might have the solution to your problem.
According to that post, the reason for this error is Python's default encoding, which is usually ASCII; the encoding can be checked with:
import sys
sys.getdefaultencoding()
To solve your problem, you need to change it to UTF-8 using the following (note that this only works on Python 2):
import sys
reload(sys) # Note this line is essential for the change
sys.setdefaultencoding('utf-8')
I would like to credit the original solution to #jochietoch.

Need a more efficient datetime conversion for/into pandas

Good morning, Stack Overflow.
I am trying to find a better way to read in a CSV file and parse the datetime. Unfortunately my data comes in as '%j:%H:%M:%S.%f', such as 234:17:33:00.000206700. I have the year sitting in another field of the header that I skip over, so this was my method of converting prior to setting the index, since I have date rollovers to account for. It works, but it is slower than I would like and is not intuitive.
dataframe = pd.read_csv(data_file,skiprows=np.arange(0,meta_lines),header=[0,1,2])
dataframe['Temp'] = meta['Date'].split('-')[2] + ' ' # splitting off the year from 08-22-2019
dataframe['Temp'] = dataframe[['Temp','AbsoluteTime']].apply(lambda x: ''.join(x),axis=1)
dataframe['AbsoluteTime'] = pd.to_datetime(dataframe['Temp'],format='%Y %j:%H:%M:%S.%f')
del dataframe['Temp']
dataframe.set_index('AbsoluteTime', inplace=True)
Originally I wanted to have pd.to_datetime parse without the %Y (resulting in the year 1900) and then use a timedelta to add X years; however, when I started down that path, I came across this error.
dataframe['AbsoluteTime']
Out[8]:
DDD:HH:MM:SS.sssssssss
Absolute Time
0 234:17:33:00.000206700
1 234:17:33:00.011264914
2 234:17:33:00.015721314
...
pd.to_datetime(dateframe['AbsoluteTime'],format='%j:%H:%M:%S.%f')
Traceback (most recent call last):
File "<ipython-input-9-6dfc074c2dc4>", line 1, in <module>
pd.to_datetime(dateframe['AbsoluteTime'],format='%j:%H:%M:%S.%f')
NameError: name 'dateframe' is not defined
pd.to_datetime(dataframe['AbsoluteTime'],format='%j:%H:%M:%S.%f')
Traceback (most recent call last):
File "<ipython-input-10-bfbf7ee22833>", line 1, in <module>
pd.to_datetime(dataframe['AbsoluteTime'],format='%j:%H:%M:%S.%f')
File "C:\Users\fkatzenb\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 512, in to_datetime
result = _assemble_from_unit_mappings(arg, errors=errors)
File "C:\Users\fkatzenb\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 582, in _assemble_from_unit_mappings
unit = {k: f(k) for k in arg.keys()}
File "C:\Users\fkatzenb\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 582, in <dictcomp>
unit = {k: f(k) for k in arg.keys()}
File "C:\Users\fkatzenb\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 577, in f
if value.lower() in _unit_map:
AttributeError: 'tuple' object has no attribute 'lower'
What gives? My problem isn't from having double brackets [[]] like the other threads about this error address. If I do this as a test, I see...
pd.to_datetime(['234:17:33:00.000206700'],format='%j:%H:%M:%S.%f')
Out[6]: DatetimeIndex(['1900-08-22 17:33:00.000206700'], dtype='datetime64[ns]', freq=None)
I was then just going to add a timedelta to that to shift the year to the current year.
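Something like this rough sketch is what I had in mind (the target year 2019 is just assumed here for illustration):
import pandas as pd

# Parse without %Y (the year defaults to 1900), then shift every timestamp
# into the target year with a DateOffset
ts = pd.to_datetime(['234:17:33:00.000206700'], format='%j:%H:%M:%S.%f')
shifted = ts + pd.DateOffset(years=2019 - 1900)
print(shifted)  # 1900-08-22 17:33:00... becomes 2019-08-22 17:33:00...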
My only thought is that it has to do with my multi-level column header (see my read_csv call). Thoughts? Suggestions?
Thanks!

Pandas TypeError: object of type 'float' has no len()

I'm doing some data-discovery using Python/Pandas.
MVCE: I have a CSV file with some street addresses and I want to find the length of the longest address in my file. (this is a simplified version of my actual problem)
I wrote this simple Python code:
import sys
import pandas as pd
df = pd.read_csv(sys.argv[1])
print(df['address'].map(len).max())
The address column is of type str, or so I thought (see below).
Why then do I get this error?
Traceback (most recent call last):
File "eval-lengths.py", line 8, in <module>
print(df['address'].map(len).max())
File "C:\Python35\lib\site-packages\pandas\core\series.py", line 2996, in map
arg, na_action=na_action)
File "C:\Python35\lib\site-packages\pandas\core\base.py", line 1004, in _map_values
new_values = map_f(values, mapper)
File "pandas/_libs/src\inference.pyx", line 1472, in pandas._libs.lib.map_infer
TypeError: object of type 'float' has no len()
Here's the output of df.info()
RangeIndex: 154733 entries, 0 to 154732
Data columns (total 2 columns):
address 154510 non-null object
zip 154732 non-null object
dtypes: object(2)
memory usage: 2.4+ MB
UPDATE
Here's a sample CSV file
address,zip
555 APPLE STREET,82101
1180 BANANA LAKE ROAD,81913
577 LEMON DR,81911
,99999
The last line is key to reproducing the problem.
You have missing data in your column, represented by NaNs (which are of float type).
Don't use map/apply, etc. for things like finding the length; just use str.len:
df['address'].str.len()
Items for which len() is not applicable automatically show up in the result as NaN. You can fillna(-1) those to indicate the result is invalid there.
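A minimal end-to-end sketch of this approach, using made-up rows shaped like the sample CSV in the question:
import pandas as pd

# Hypothetical data: one address is missing, as in the last CSV line
df = pd.DataFrame({'address': ['555 APPLE STREET', '1180 BANANA LAKE ROAD', None]})

lengths = df['address'].str.len()   # NaN where the address is missing
print(lengths.fillna(-1))           # mark missing rows as invalid
print(int(lengths.max()))           # length of the longest address -> 21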
My solution was to fillna with an empty string and then run the apply, like this:
df['address'].fillna('', inplace=True)
print(df['address'].map(len).max())

Python "InterfaceError: Error binding parameter 2 - probably unsupported type."

When I run the following code, I keep getting the "InterfaceError: Error binding parameter 2 - probably unsupported type" error, and I need help identifying where the problem is. Everything works fine up until I try to send the data to SQL with:
anagramsdf.to_sql('anagrams',con=conn,if_exists='replace',index=False)
cdf=pd.read_sql("select (distinct ID) from anagrams;",conn)
import pandas as pd
import sqlite3
conn = sqlite3.connect("anagrams")
xsorted=sorted(anagrams,key=sorted)
xunique=[x[0] for x in anagrams]
xunique=pd.Series(xunique)
xanagrams=pd.Series(anagrams)
anagramsdf=pd.concat([xunique,dfcount,xanagrams],axis=1)
anagramsdf.columns=['ID','anagram_count','anagram_list']
c=conn.cursor()
c.execute("create table anagrams(ID, anagram_count, anagram_list)")
conn.commit()
anagramsdf.to_sql('anagrams',con=conn,if_exists='replace',index=False)
cdf=pd.read_sql("select (distinct ID) from anagrams;",conn)
cdf=pd.read_sql("select max(anagram_count) from anagrams;",conn)
cdf
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
cdf=pd.read_sql("select * from anagrams where anagram_count=12;",conn)
pd.set_option('max_colwidth',200)
Full traceback error:
Traceback (most recent call last):
File "sqlpandas.py", line 88, in <module>
anagramsdf.to_sql('anagrams',con=conn,if_exists='replace',index=False)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 982, in to_sql
dtype=dtype)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/io/sql.py", line 549, in to_sql
chunksize=chunksize, dtype=dtype)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/io/sql.py", line 1567, in to_sql
table.insert(chunksize)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/io/sql.py", line 728, in insert
self._execute_insert(conn, keys, chunk_iter)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/io/sql.py", line 1357, in _execute_insert
conn.executemany(self.insert_statement(), data_list)
sqlite3.InterfaceError: Error binding parameter 2 - probably unsupported type.
Snippet from Dataframe:
ID anagram_count anagram_list
0 aa 1 (aa,)
1 anabaena 1 (anabaena,)
2 baaskaap 1 (baaskaap,)
3 caracara 1 (caracara,)
4 caragana 1 (caragana,)
I used the following code to change the datatypes to strings, and this solved the problem:
anagramsdf.dtypes
anagramsdf['ID']= anagramsdf['ID'].astype('str')
anagramsdf['anagram_list']= anagramsdf['anagram_list'].astype('str')
anagramsdf.to_sql('anagramsdf',con=conn,if_exists='append',index=False)
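For context, here is a minimal sketch of why the cast helps; the reading that sqlite3 cannot bind the tuples in anagram_list (parameter 2 in the error) is an inference from the traceback, and the toy data below is assumed:
import sqlite3
import pandas as pd

# Toy frame shaped like the snippet above: anagram_list holds Python tuples
df = pd.DataFrame({'ID': ['aa'], 'anagram_count': [1], 'anagram_list': [('aa',)]})
conn = sqlite3.connect(':memory:')
try:
    df.to_sql('anagrams', con=conn, if_exists='replace', index=False)
except Exception as exc:
    print(exc)  # expected: an "Error binding parameter" error (wording varies by Python version)

# Casting the tuple column to str makes every value bindable by sqlite3
df['anagram_list'] = df['anagram_list'].astype('str')
df.to_sql('anagrams', con=conn, if_exists='replace', index=False)
print(pd.read_sql('select * from anagrams;', conn))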
Using pandas 0.23.4, I had a column with datetime values (format '%Y-%m-%d %H:%M:%S') stored as strings that threw the same error when I passed it to the to_sql method. After converting it to datetime dtype it worked. Hope that's helpful to someone with the same issue :).
To convert (using the format mentioned above):
datetimeFormat = '%Y-%m-%d %H:%M:%S'
df['date'] = pd.to_datetime(df['date'], format=datetimeFormat, errors='coerce')
Sparrow's solution worked for me. If the index is not converted to a datetime, then SQL is going to throw "error binding parameter".
I used a column that had the datetimes, first converting it to the correct format and then using it as the index:
df.set_index(pd.to_datetime(df['datetime']), inplace=True)

Get the first pandas DataFrame's column?

I want to calculate the std of the first column of my prices DataFrame.
Here is my code:
import pandas as pd
def std(returns):
    return pd.DataFrame(returns.std(axis=0, ddof=0))
prices = pd.DataFrame([[-0.33333333, -0.25343423, -0.1666666667],
[+0.23432323, +0.14285714, -0.0769230769],
[+0.42857143, +0.07692308, +0.1818181818]])
print(std(prices.ix[:,0]))
When I run it, I get the following error:
Traceback (most recent call last):
File "C:\Users\*****\Documents\******\******\****.py", line 12, in <module>
print(std(prices.ix[:,0]))
File "C:\Users\*****\Documents\******\******\****.py", line 10, in std
return pd.DataFrame(returns.std(axis=0, ddof=0))
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 453, in __init__
raise PandasError('DataFrame constructor not properly called!')
pandas.core.common.PandasError: DataFrame constructor not properly called!
How can I fix that?
Thank you!
Take a closer look at what is going on in your code:
>>> prices.ix[:,0]
0 -0.333333
1 0.234323
2 0.428571
>>> prices.ix[:,0].std(axis=0, ddof=0)
0.32325861621668445
So you are calling the DataFrame constructor like this:
pd.DataFrame(0.32325861621668445)
The constructor has no idea what to do with a single float parameter. It needs some kind of sequence or iterable. Maybe what you want is this:
>>> pd.DataFrame([0.32325861621668445])
0
0 0.323259
It should be as simple as this:
In [0]: prices[0].std()
Out[0]: 0.39590933234452624
Columns of DataFrames are Series. You can call Series methods on them directly.
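As a small follow-up, if you want the population standard deviation (the ddof=0 from the question) rather than the sample one, pass it explicitly; this assumes the prices DataFrame defined above:
# Population standard deviation of the first column;
# this matches the 0.32325861621668445 computed earlier.
print(prices[0].std(ddof=0))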
