Combining columns using pandas - python

I am trying to combine date and time columns of a csv file and convert them to timestamp using pandas.
Here is a sample of my csv file when read into a dataframe:
Id   Station     Month  Parameter  Date        From       To
1.0  ANANDVIHAR  Dec    ?PM2.5     2015-12-01  ?00:00:00  ?00:59:00
The following code:
df['DateTime'] = df.apply(lambda row: datetime.datetime.strptime(row['Date']+ ':' + row['From'], '%Y.%m.%d:%H:%M:%S'), axis=1)
is giving the following error:
Traceback (most recent call last):
File "project101.py", line 36, in <module>
df['DateTime'] = df.apply(lambda row: datetime.datetime.strptime(row['Date']+ ':' + row['From'], '%Y.%m.%d:%H:%M:%S'), axis=1)
File "c:\Python27\lib\site-packages\pandas\core\frame.py", line 4133, in apply
return self._apply_standard(f, axis, reduce=reduce)
File "c:\Python27\lib\site-packages\pandas\core\frame.py", line 4229, in _apply_standard
results[i] = func(v)
File "project101.py", line 36, in <lambda>
df['DateTime'] = df.apply(lambda row: datetime.datetime.strptime(row['Date']+ ':' + row['From'], '%Y.%m.%d:%H:%M:%S'), axis=1)
File "c:\Python27\lib\_strptime.py", line 332, in _strptime
(data_string, format))
ValueError: ("time data '2015-12-01:\\xa000:00:00' does not match format '%Y.%m.%d:%H:%M:%S'", u'occurred at index 0')

You can simply do:
df['DateTime'] = pd.to_datetime(df['Date'].str.cat(df['From'], sep=' '),
                                format='%Y-%m-%d \xa0%H:%M:%S', errors='coerce')
The '\xa0' in the format string takes care of the question marks: they are a non-breaking space character (u'\xa0') that was misinterpreted when the file was read.
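An alternative sketch (assuming the stray character really is the non-breaking space u'\xa0' shown in the traceback) is to remove it before parsing, so the format string stays clean:
import pandas as pd

# strip the non-breaking space (u'\xa0') before concatenating and parsing
df['From'] = df['From'].str.replace(u'\xa0', '')
df['DateTime'] = pd.to_datetime(df['Date'].str.cat(df['From'], sep=' '),
                                format='%Y-%m-%d %H:%M:%S', errors='coerce')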

You can use pandas.Series.str.cat function.
Following code gives you a basic idea about this:
>>> pd.Series(['a', 'b', 'c']).str.cat(['A', 'B', 'C'], sep=',')
0    a,A
1    b,B
2    c,C
dtype: object
For more information, please check this:
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.str.cat.html
Hope this solves your problem...

I finally got a solution: I stripped the stray leading character from the time column and applied to_datetime() to the concatenated columns of the dataframe:
df['From'] = df['From'].map(lambda x: str(x)[1:])
df['FromTime'] = pd.to_datetime(df['Date'].str.cat(df['From'], sep=" "),format='%Y-%m-%d %H:%M:%S', errors='coerce')

Related

How to solve this error: TypeError: 'last' only supports a DatetimeIndex index

Getting error: 'last' only supports a DatetimeIndex index
def create_excel_file():
    master_list = []
    for name in filelist:
        new_path = Path(name).parent
        base = os.path.basename(new_path)
        final = os.path.splitext(base)[0]
        with open(name, "r") as f:
            soupObj = bs4.BeautifulSoup(f, "lxml")
        df = pd.DataFrame([(x["uri"], *x["t"].split("T"), x["u"], x["desc"])
                           for x in soupObj.find_all("log")],
                          columns=["Document", "Date", "Time", "User", "Description"])
        df.insert(0, 'Database', f'{final}')
        df['Document'] = df['Document'].astype(str)
        df['Date'] = pd.to_datetime(df['Date']).dt.date
        master_list.append(df)
    df = pd.concat(master_list, axis=0, ignore_index=True)
    df = df.sort_values(by='Date', ascending=True).set_index('Date').last('3M')
    df = df.sort_values(by='Date', ascending=False)
    df.to_excel("logfile.xlsx", index=True)

create_excel_file()
Please suggest what I am doing wrong.
Error message:
Traceback (most recent call last):
File "C:\Users\Desktop\project\Final test.py", line 40, in <module>
create_excel_file()
File "C:\Users\Desktop\project\Final test.py", line 34, in create_excel_file
df = df.sort_values(by='Date', ascending=True).set_index('Date').last('3M')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\AppData\Roaming\Python\Python311\site-packages\pandas\core\generic.py", line 9001, in last
raise TypeError("'last' only supports a DatetimeIndex index")
TypeError: 'last' only supports a DatetimeIndex index
Process finished with exit code 1
I am getting the error shown above.
From the documentation:
For a DataFrame with a sorted DatetimeIndex, this function selects the last few rows based on a date offset.
So you need to make sure that the column you sort on and set as the index actually holds datetime values.
In your code below, make sure the data in the 'Date' column are datetimes:
df = df.sort_values(by='Date', ascending=True).set_index('Date').last('3M')
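Note that in the code above, pd.to_datetime(df['Date']).dt.date turns the column into plain datetime.date objects (object dtype), so set_index('Date') does not produce a DatetimeIndex. A minimal sketch of a fix, keeping the column as datetime64 (the '3M' offset is taken from the question):
df['Date'] = pd.to_datetime(df['Date'])           # datetime64[ns]; skip .dt.date
df = df.sort_values(by='Date').set_index('Date')  # index is now a DatetimeIndex
df = df.last('3M')                                # keep the most recent 3 months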

How to merge 2+ columns with different length? ValueError: Length of values

I am trying to create a dataframe main_df which has date as the index, followed by df['high'] - df['low'] for each ticker.
Note:
in the example, the 3 tickers have data from 1996/1/1 to 2020/12/31;
ACN went public on 2001/07/19,
so the length of df['high'] - df['low'] differs between tickers.
The following code is what I used:
import pandas as pd

def test_demo():
    tickers = ['ADI', 'ACN', 'ABT']
    df2 = pd.DataFrame()
    main_df = pd.DataFrame()
    pd.set_option('display.max_columns', None)
    for count, ticker in enumerate(tickers):
        df = pd.read_csv('demo\{}.csv'.format(ticker))
        print(df)
        df = df.set_index('date')
        df2['date'] = df.index
        df2 = df2.set_index('date')
        df2[ticker] = df['high'] - df['low']
        if main_df.empty:
            main_df = df2
            count = 1
        else:
            main_df = main_df.join(df2, on='date', how='outer')
            # main_df = main_df.merge(df, on='date')
        # print(main_df)
        if count % 10 == 0:
            print(count)
    main_df.to_csv('testdemo.csv')

test_demo()
It gives me the following error and traceback:
Traceback (most recent call last):
File "D:\PycharmProjects\backtraderP1\Main.py", line 81, in <module>
from zfunctions.WebDemo import test_demo
File "D:\PycharmProjects\backtraderP1\zfunctions\WebDemo.py", line 33, in <module>
test_demo()
File "D:\PycharmProjects\backtraderP1\zfunctions\WebDemo.py", line 13, in test_demo
df2['date'] = df.index
File "C:\Users\Cornerstone\AppData\Roaming\Python\Python39\site-packages\pandas\core\frame.py", line 3163, in __setitem__
self._set_item(key, value)
File "C:\Users\Cornerstone\AppData\Roaming\Python\Python39\site-packages\pandas\core\frame.py", line 3242, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\Cornerstone\AppData\Roaming\Python\Python39\site-packages\pandas\core\frame.py", line 3899, in _sanitize_column
value = sanitize_index(value, self.index)
File "C:\Users\Cornerstone\AppData\Roaming\Python\Python39\site-packages\pandas\core\internals\construction.py", line 751, in sanitize_index
raise ValueError(
ValueError: Length of values (4895) does not match length of index (6295)
Process finished with exit code 1
The code passes the first iteration, processing ADI, and the error appears when it gets to the ACN data.
The lines df2['date'] = df.index and df2[ticker] = df['high'] - df['low'] shouldn't be the problem on their own, and both appear in answers to other posts, but the combination doesn't work in this case.
If someone can help me understand and solve this issue, that would be great.
Many thanks.
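For what it's worth, the traceback points at df2['date'] = df.index: df2 is reused across iterations, so on the second pass it still carries ADI's 6295-row index, and assigning ACN's 4895-value index to it fails. A hedged sketch of one way around this (file layout assumed from the question), building each ticker's column fresh and letting join align the dates:
import pandas as pd

def test_demo():
    tickers = ['ADI', 'ACN', 'ABT']
    main_df = pd.DataFrame()
    for ticker in tickers:
        df = pd.read_csv('demo/{}.csv'.format(ticker)).set_index('date')
        spread = (df['high'] - df['low']).rename(ticker)  # one named column per ticker
        # outer join aligns on the date index, filling NaN where a ticker has no data
        main_df = spread.to_frame() if main_df.empty else main_df.join(spread, how='outer')
    main_df.to_csv('testdemo.csv')

test_demo()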

tz_localize: KeyError: ('Asia/Singapore', u'occurred at index 0')

Reference to: Python pandas convert unix timestamp with timezone into datetime
Did a search on this topic but still can't find the answer.
I have a dataframe which is in the following format:
df
   timestamp
1  1549914000
2  1549913400
3  1549935000
3  1549936800
5  1549936200
I use the following to convert epoch to date:
df['date'] = pd.to_datetime(df['timestamp'], unit='s')
This line will produce a date that is always 8 hours behind my local time.
So I followed the example in the link, using apply + tz_localize with 'Asia/Singapore'; I tried the following code on the line right after the code above:
df['date'] = df.apply(lambda x: x['date'].tz_localize(x['Asia/Singapore']), axis=1)
but Python returns the error below:
Traceback (most recent call last):
File "/home/test/script.py", line 479, in <module>
schedule.every(10).minutes.do(main).run()
File "/opt/cloudera/parcels/Anaconda-4.0.0/lib/python2.7/site-packages/schedule/__init__.py", line 411, in run
ret = self.job_func()
File "/home/test/script.py", line 361, in main
df['date'] = df.apply(localize_ts, axis = 1)
File "/opt/cloudera/parcels/Anaconda-4.0.0/lib/python2.7/site-packages/pandas/core/frame.py", line 4877, in apply
ignore_failures=ignore_failures)
File "/opt/cloudera/parcels/Anaconda-4.0.0/lib/python2.7/site-packages/pandas/core/frame.py", line 4973, in _apply_standard
results[i] = func(v)
File "/home/test/script.py", line 359, in localize_ts
return pd.to_datetime(row['date']).tz_localize(row['Asia/Singapore'])
File "/opt/cloudera/parcels/Anaconda-4.0.0/lib/python2.7/site-packages/pandas/core/series.py", line 623, in __getitem__
result = self.index.get_value(self, key)
File "/opt/cloudera/parcels/Anaconda-4.0.0/lib/python2.7/site-packages/pandas/core/indexes/base.py", line 2574, in get_value
raise e1
KeyError: ('Asia/Singapore', u'occurred at index 0')
Did I replace .tz_localize(x['tz']) incorrectly?
As written, your code is looking for a column named Asia/Singapore. Try this instead:
df['date'] = df['date'].dt.tz_localize('Asia/Singapore')
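Note that tz_localize only attaches a timezone to a naive timestamp without shifting it. If the goal is to move the UTC-based epoch values into Singapore wall time (+8 hours), a localize-then-convert sketch like the following (column names taken from the question) would do it:
import pandas as pd

df = pd.DataFrame({'timestamp': [1549914000, 1549913400]})
df['date'] = pd.to_datetime(df['timestamp'], unit='s')  # naive timestamps in UTC
# attach UTC first, then convert to Singapore wall time
df['date'] = df['date'].dt.tz_localize('UTC').dt.tz_convert('Asia/Singapore')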
You can also try adding the 8-hour offset (28800 seconds) manually:
import pandas as pd

df = pd.DataFrame({'timestamp': [1549952400, 1549953600]}, index=['1', '2'])
df['timestamp2'] = df['timestamp'] + 28800  # shift by 8 hours
df['date'] = pd.to_datetime(df['timestamp2'], unit='s')
df = df.drop('timestamp2', axis=1)

Pandas return DataFrame from apply function?

sdf = sdf['Name1'].apply(lambda x: tryLookup(x, tdf))
tryLookup is a function that currently takes a string, namely the value of the Name1 column in sdf. Using apply, we map the function over every row of the sdf DataFrame.
Instead of tryLookup returning just a string, is there a way for it to return a DataFrame that I can merge with the sdf DataFrame? tryLookup has some extra information, and I want to include that in the results by adding it as new columns to all the rows in sdf.
So the return for tryLookup is as such:
return pd.Series({'BEST MATCH': bestMatch, 'SIMILARITY SCORE': humanScore})
I tried something such as
sdf = sdf.merge(sdf['Name1'].apply(lambda x: tryLookup(x, tdf)), left_index=True, right_index=True)
But that just throws:
Traceback (most recent call last):
File "lookup.py", line 160, in <module>
main()
File "lookup.py", line 40, in main
sdf = sdf.merge(sdf['Name1'].apply(lambda x: tryLookup(x, tdf)), left_index=True, right_index=True)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4618, in merge
copy=copy, indicator=indicator)
File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 58, in merge
copy=copy, indicator=indicator)
File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 473, in __init__
'type {0}'.format(type(right)))
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
Any help would be great. Thanks.
Try converting the pd.Series to a dataframe with pandas.Series.to_frame:
sdf = sdf.merge(sdf['Name1'].apply(lambda x: tryLookup(x, tdf)).to_frame(), left_index=True, right_index=True)
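For reference, a self-contained sketch of the overall pattern (tryLookup's matching logic is a placeholder here, not the original): building the extra columns as a DataFrame and joining it back on the index avoids the Series/DataFrame merge mismatch entirely.
import pandas as pd

def tryLookup(name, tdf):
    # placeholder scoring; the real matching against tdf is elided
    return pd.Series({'BEST MATCH': name.upper(), 'SIMILARITY SCORE': 1.0})

sdf = pd.DataFrame({'Name1': ['foo', 'bar']})
# expand each returned Series into a row, keeping sdf's index for alignment
extra = pd.DataFrame([tryLookup(x, None) for x in sdf['Name1']], index=sdf.index)
sdf = sdf.join(extra)  # adds the BEST MATCH and SIMILARITY SCORE columns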

using pandas read_csv with missing data

I am attempting to read a csv file where some rows may be missing chunks of data.
This seems to cause a problem for the pandas read_csv function when you specify the dtype: in order to convert from str to whatever the dtype specifies, pandas just tries to cast it directly, so if something is missing, things break down.
A MWE follows (it uses StringIO in place of a true file; however, the issue also happens with a real file):
import pandas as pd
import numpy as np
import io
datfile = io.StringIO("12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']
dtypes = [np.str, np.str, np.int, np.float, np.float]
dform = {name: dtypes[ind] for ind, name in enumerate(names)}
colconverters = {0: lambda s: s.strip(), 1: lambda s: s.strip()}
df = pd.read_table(datfile, sep='|', dtype=dform, converters=colconverters, header=None,
                   index_col=0, names=names, na_values=' ')
The error I get when I run this is:
Traceback (most recent call last):
File "pandas/parser.pyx", line 1084, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12580)
TypeError: Cannot cast array from dtype('O') to dtype('int64') according to the rule 'safe'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/aliounis/Repos/stellarpy/source/mwe.py", line 15, in <module>
index_col=0, names=names, na_values=' ')
File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 325, in _read
return parser.read()
File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
File "pandas/parser.pyx", line 904, in pandas.parser.TextReader._read_rows (pandas/parser.c:10022)
File "pandas/parser.pyx", line 1011, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:11397)
File "pandas/parser.pyx", line 1090, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12656)
ValueError: invalid literal for int() with base 10: ' '
Is there some way I can fix this? I looked through the documentation but didn't see anything that directly addresses this issue. Is this just a bug that needs to be reported to pandas?
Try this:
import pandas as pd
import numpy as np
import io
datfile = io.StringIO(u"12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']
dtypes = [np.str, np.str, np.str, np.float, np.float]
dform = {name: dtypes[ind] for ind, name in enumerate(names)}
colconverters = {0: lambda s: s.strip(), 1: lambda s: s.strip()}
df = pd.read_table(datfile, sep='|', dtype=dform, converters=colconverters, header=None,
                   na_values=' ')
df.columns = names
Edit: to convert the dtypes after the import:
df["number"] = df["number"].astype('int')
df["data"] = df["data"].astype('float')
Your data has a mix of blanks (as str) and numbers.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
id        2 non-null object
flag      2 non-null object
number    2 non-null object
data      2 non-null object
data2     2 non-null float64
dtypes: float64(1), object(4)
memory usage: 152.0+ bytes
If you look at data, it is np.float but gets converted to object, and data2 is np.float until it hits a blank, at which point it turns into object as well.
So, as Merlin pointed out, the main problem is that NaNs can't be ints, which is probably why pandas acts this way to begin with. I unfortunately didn't have a choice, so I had to make some changes to the pandas source code myself. I ended up having to change lines 1087-1096 of the file parser.pyx to
na_count_old = na_count
print(col_res)
for ind, row in enumerate(col_res):
    k = kh_get_str(na_hashset, row.strip().encode())
    if k != na_hashset.n_buckets:
        col_res[ind] = np.nan
        na_count += 1
    else:
        col_res[ind] = np.array(col_res[ind]).astype(col_dtype).item(0)

if na_count_old == na_count:
    # float -> int conversions can fail the above
    # even with no nans
    col_res_orig = col_res
    col_res = col_res.astype(col_dtype)
    if (col_res != col_res_orig).any():
        raise ValueError("cannot safely convert passed user dtype of "
                         "{col_dtype} for {col_res} dtyped data in "
                         "column {column}".format(col_dtype=col_dtype,
                                                  col_res=col_res_orig.dtype.name,
                                                  column=i))
which essentially goes through each element of a column and checks whether it is contained in the na list (note that we have to strip the values so that multi-space entries show up as being in the na list). If it is, that element is set to a double np.nan. If it is not in the na list, it is cast to the original dtype specified for that column (which means the column will hold multiple dtypes).
While this isn't a perfect fix (and is likely slow), it works for my needs, and maybe someone else with a similar problem will find it useful.
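As a side note for anyone hitting this today: recent pandas versions ship a nullable integer dtype ('Int64') that read_csv can apply directly, which sidesteps the NaN-can't-be-int limitation without patching the parser. A minimal sketch on the same sample data (using skipinitialspace so the blank fields become NaN; this is an assumption, not the original approach):
import io
import pandas as pd

datfile = io.StringIO(u"12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']
df = pd.read_csv(datfile, sep='|', header=None, names=names,
                 skipinitialspace=True, dtype={'number': 'Int64'})
print(df.dtypes)  # 'number' comes back as Int64, with <NA> for the missing value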
