I have a DataFrame with the following date field:
463 14-05-2019
535 03-05-2019
570 11-05-2019
577 09-05-2019
628 08-08-2019
630 25-05-2019
Name: Date, dtype: object
I have to format it as DDMMAAAA. This is what I'm doing inside a loop (for idx, row in df.iterrows():):
I'm removing the \- char using regex:
df.at[idx, 'Date'] = re.sub('\-', '', df.at[idx, 'Date'])
then using apply to enforce an 8-digit string with leading zeros:
df['Date'] = df['Date'].apply(lambda x: '{0:0>8}'.format(x))
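(As an aside, both steps can also be done without the loop using vectorized string methods; a sketch, not verified beyond the sample shown:)

df['Date'] = df['Date'].str.replace('-', '', regex=False).str.zfill(8)  # strip '-' and left-pad to 8 digits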
But even though the df['Date'] field has the 8 digits with the leading 0 in the df, when exporting it to csv the leading zeros are removed in the exported file, like below.
df.to_csv(path_or_buf=report, header=True, index=False, sep=';')
The field as it appears in the csv:
Dt_DDMMAAAA
30102019
12052019
7052019
26042019
3052019
22042019
25042019
2062019
I know I must be missing the point somewhere along the way here, but I just can't figure out what the issue is (or if it's even an issue, rather than a misused method).
IMO the simplest method is to use the date_format argument when writing to CSV. This means you will need to convert the "Date" column to datetime beforehand using pd.to_datetime; since your dates are day-first, also pass dayfirst=True so day and month don't get swapped.
(df.assign(Date=pd.to_datetime(df['Date'], dayfirst=True, errors='coerce'))
   .to_csv(path_or_buf=report, date_format='%d%m%Y', index=False))
This prints:
Date
14052019
03052019
11052019
09052019
08082019
25052019
More information on arguments to to_csv can be found in Writing a pandas DataFrame to CSV file.
What I would do is use strftime + to_excel. If you open the csv with a text editor, it will show the leading zeros; csv keeps no display formatting, so it is Excel that strips the zeros when it opens the file. In that case, you can write to Excel instead:
pd.to_datetime(df.Date, dayfirst=True).dt.strftime('%d%m%Y').to_excel('your.xls')
Out[722]:
463    14052019
535    03052019
570    11052019
577    09052019
628    08082019
630    25052019
Name: Date, dtype: object
Firstly, your method is producing a file which contains leading zeros just as you expect. I reconstructed this minimal working example from your description and it works just fine:
import pandas
import re

df = pandas.DataFrame([["14-05-2019"],
                       ["03-05-2019"],
                       ["11-05-2019"],
                       ["09-05-2019"],
                       ["08-08-2019"],
                       ["25-05-2019"]], columns=['Date'])

for idx in df.index:
    df.at[idx, 'Date'] = re.sub('\-', '', df.at[idx, 'Date'])

df['Date'] = df['Date'].apply(lambda x: '{0:0>8}'.format(x))
df.to_csv(path_or_buf="report.csv", header=True, index=False, sep=';')
At this point report.csv contains this (with leading zeros just as you wanted).
Date
14052019
03052019
11052019
09052019
08082019
25052019
Now, as to why you thought it wasn't working: if you are reading the file back in pandas, you can stop it from guessing the column type by specifying a dtype in read_csv:
df_readback = pandas.read_csv('report.csv', dtype={'Date': str})
Date
0 14052019
1 03052019
2 11052019
3 09052019
4 08082019
5 25052019
It might also be that you are reading this in Excel (I'm guessing this from the fact that you are using ; separators). Unfortunately there is no way to ensure that Excel reads this field correctly on double-click, but if this is your final target, you can see how to mangle your file for Excel to read correctly in this answer.
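One common version of that workaround, in sketch form (assuming Excel is the final consumer): wrap each value as an Excel text formula so the zeros survive display.

# Quote each value as ="..." so Excel treats the field as text and keeps leading zeros
df['Date'] = '="' + df['Date'] + '"'
df.to_csv(path_or_buf="report.csv", header=True, index=False, sep=';')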
I have a txt file with data and values like this one:
PP C timestamp HR RMSSD SCL
PP1 1 20120918T131600000 NaN NaN 80.239727
PP1 1 20120918T131700000 61 0.061420 77.365127
and I am importing it like that:
df = pd.read_csv('data.txt','\t', header=0)
which gives me a nice-looking dataframe.
Running
df.columns
shows this result Index(['PP', 'C', 'timestamp', 'HR', 'RMSSD', 'SCL'], dtype='object').
Now when I am trying to convert the timestamp column into a datetime column:
df["datetime"] = pd.to_datetime(df["timestamp"], format='%Y%m%dT%H%M%S%f')
I get this:
ValueError: time data 'timestamp' does not match format '%Y%m%dT%H%M%S%f' (match)
Any ideas would be appreciated.
First, the error message you're quoting is from the header row. It's trying to parse the literal string 'timestamp' as a timestamp, which is failing. If you're getting an error on an actual data row, show us that message.
All three of your posted data rows parse fine with your format in my testing:
>>> [pandas.to_datetime(s, format='%Y%m%dT%H%M%S%f')
...  for s in ['20120918T131600000', '20120918T131700000',
...            '20120918T131800000']]
[Timestamp('2012-09-18 13:16:00'), Timestamp('2012-09-18 13:17:00'), Timestamp('2012-09-18 13:18:00')]
I have no idea where you got format='%Y%m%dT%H%M%S%f'[:-3], which just removes the S%f from the format string, leaving it invalid. If you want to remove the last three digits of the data so that you can just use %H%M%S instead of %H%M%S%f, you need to put the [:-3] on the timestamp data value, not the format.
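Something like this is probably what was intended (a sketch, assuming the header row is consumed by read_csv so only data values get parsed):

# Trim the three millisecond digits from the data, not the format,
# then parse without %f
df["datetime"] = pd.to_datetime(df["timestamp"].str[:-3], format='%Y%m%dT%H%M%S')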
I am importing a csv file into python with the pandas package.
SP = pd.read_csv('S&P500 (5year).csv')
When I go to use the pct_change() method, it is unable to process the values, as they have been saved with the type 'str'.
I have tried using the .astype(float) method and it returns an error could not convert string to float: '1805.51'
The 'Adj Close**' are type str and I need them as type float
Date Open High Low Close* Adj Close** Volume
0 11/1/2013 1,758.70 1,813.55 1,746.20 1,805.81 1,805.81 63,628,190,00
1 12/1/2013 1,806.55 1,849.44 1,767.99 1,848.36 1,848.36 64,958,820,000.00
2 1/1/2014 1,845.86 1,850.84 1,770.45 1,782.59 1,782.59 75,871,910,000.00
3 2/1/2014 1,782.68 1,867.92 1,737.92 1,859.45 1,859.45 69,725,590,000.00
4 3/1/2014 1,857.68 1,883.97 1,834.44 1,872.34 1,872.34 71,885,030,000.00
Try adding the dtype and thousands arguments to the read_csv call. Replace column_name in the example with the column you need to convert to float. Since the values use commas as thousands separators, you need the thousands parameter when reading the csv.
Example:
SP = pd.read_csv('S&P500 (5year).csv', thousands=',', dtype={'column_name': float})
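If the data is already loaded and the column holds strings, a minimal sketch of converting in place (assuming commas are the only non-numeric characters in the column):

# Strip the thousands separators and cast to float; pct_change then works
SP['Adj Close**'] = SP['Adj Close**'].str.replace(',', '', regex=False).astype(float)
returns = SP['Adj Close**'].pct_change()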
I'd like to use .ftr files to quickly analyze hundreds of tables. Unfortunately I have some problems with decimal and thousands separators, similar to that post, just that read_feather does not allow for decimal=',', thousands='.' options. I've tried the following approaches:
df['numberofx'] = (
    df['numberofx']
    .apply(lambda x: x.str.replace(".", "", regex=True)
                      .str.replace(",", ".", regex=True))
)
resulting in
AttributeError: 'str' object has no attribute 'str'
when I change it to
df['numberofx'] = (
    df['numberofx']
    .apply(lambda x: x.replace(".", "").replace(",", "."))
)
I receive some strange (rounding) mistakes in the results, like 22359999999999998 instead of 2236 for some numbers that are higher than 1k. All below 1k are 10 times the real result, which is probably because of deleting the "." of the float and creating an int of that number.
Trying
df['numberofx'] = df['numberofx'].str.replace('.', '', regex=True)
also leads to some strange behavior in the results, as some numbers go into the 10^12 range while others remain at 10^3 as they should.
Here is how I create my .ftr files from multiple Excel files. I know I could simply create DataFrames from the Excel files, but that would slow down my daily calculations too much.
How can I solve that issue?
EDIT: The issue seems to come from reading in an Excel file as a df with non-US decimal and thousands separators and then saving it as feather. Using the pd.read_excel(f, encoding='utf-8', decimal=',', thousands='.') options for reading in the Excel file solved my issue. That leads to the next question: why does saving floats in a feather file lead to strange rounding errors like changing 2.236 to 2.2359999999999998?
The problem in your code is that when you check the column type in the DataFrame (pandas), you will find:
df.dtypes['numberofx']
result: object
So the suggested solution is to try:
df['numberofx'] = df['numberofx'].apply(pd.to_numeric, errors='coerce')
Another way to fix this problem is to convert your values to float:

def coerce_to_float(val):
    try:
        return float(val)
    except ValueError:
        return val

df['numberofx'] = df['numberofx'].apply(coerce_to_float)
To avoid floats displayed in scientific notation like '4.806105e+12', here is a sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'numberofx': ['4806105017087', '4806105017087', 'CN414149']})
print(df)

       numberofx
0  4806105017087
1  4806105017087
2       CN414149

print(pd.to_numeric(df['numberofx'], errors='coerce'))

0    4.806105e+12
1    4.806105e+12
2             NaN
Name: numberofx, dtype: float64

df['numberofx'] = pd.to_numeric(df['numberofx'], errors='coerce').fillna(0).astype(np.int64)
print(df['numberofx'])

0    4806105017087
1    4806105017087
2                0
Name: numberofx, dtype: int64
As mentioned in my edit, here is what solved my initial problem:

import glob
import pandas as pd

path = r"pathname\*_somename*.xlsx"
file_list = glob.glob(path)

for f in file_list:
    df = pd.read_excel(f, encoding='utf-8', decimal=',', thousands='.')
    for col in df.columns:
        # Cast mixed-type columns to str so feather can serialize them
        w = (df[[col]].applymap(type) != df[[col]].iloc[0].apply(type)).any(axis=1)
        if len(df[w]) > 0:
            df[col] = df[col].astype(str)
        if df[col].dtype == list:
            df[col] = df[col].astype(str)
    pathname = f[:-4] + "ftr"
    df.to_feather(pathname)

df.head()
I had to add the decimal=',', thousands='.' option for reading in an excel file, which I later saved as feather. So the problem did not arise when working with .ftr files but before. The rounding problems seem to come from saving numbers with different decimal and thousand separators as .ftr files.
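As to the rounding question: this is almost certainly not feather-specific. IEEE-754 doubles cannot represent most decimal fractions exactly, so any tool that prints floats at full precision will expose trailing digits like ...9999999998; a quick illustration:

# Binary floats store most decimal fractions only approximately;
# printing at full precision reveals the stored value
print(0.1 + 0.2)        # 0.30000000000000004
print(f"{2.236:.17g}")  # full-precision digits of the double nearest to 2.236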
I have a series of dataframes which I am exporting to Excel within the same file. A number of them appear to be stored as a list of dictionaries due to the way they have been constructed. I converted them using .from_dict, but when I use df.to_excel an error is raised.
An example of one of the df's which is raising the error is shown below. My code:
excel_writer = pd.ExcelWriter('My_DFs.xlsx')
df_Done_Major = df[
    (df['currency_str'].str.contains('INR|ZAR|NOK|HUF|MXN|PLN|SEK|TRY') == False) &
    (df['state'].str.contains('Done'))
][['Year_Month', 'state', 'currency_str', 'cust_cdr_display_name', 'rbc_security_type1', 'rfq_qty', 'rfq_qty_CAD_Equiv']].copy()
# Trades per bucket
df_Done_Major['Bucket'] = pd.cut(df_Done['rfq_qty'], bins=bins, labels=labels)
# Populate empty buckets with 0 so HK, SY and TK data can be pasted adjacently
df_Done_Major_Fill_Empty_Bucket = df_Done_Major.groupby(['Year_Month','Bucket'], as_index=False)['Bucket'].size()
mux = pd.MultiIndex.from_product([df_Done_Major_Fill_Empty_Bucket.index.levels[0], df_Done_Major['Bucket'].cat.categories])
df_Done_Major_Fill_Empty_Bucket = df_Done_Major_Fill_Empty_Bucket.reindex(mux, fill_value=0)
dfTemp = df_Done_Major_Fill_Empty_Bucket
display(dfTemp)
dfTemp = pd.DataFrame.from_dict(dfTemp)
display(dfTemp)
# Export
dfTemp.to_excel(excel_writer, sheet_name='Sheet1', startrow=0, startcol=21, na_rep=0, header=True, index=True, merge_cells= True)
2018-05  0K          0
         10K         2
         20K         4
         40K        10
         60K         3
         80K         1
         100K       14
         > 100K    273
dtype: int64
TypeError: Unsupported type <class 'pandas._libs.period.Period'> in write()
Even though I have converted it to a df, is there additional conversion required?
Update: I can get the data into Excel using the following, but the format of the dataframe is lost, which means significant Excel VBA to resolve.
list = [{"Data": dfTemp}, ]
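One possible fix, as a sketch only (untested against this data): the Excel writer engines cannot serialize pandas Period objects, and the Year_Month level of dfTemp's index holds Periods. Converting that level to strings before writing may avoid the TypeError:

# Assumes the Periods sit in level 0 of dfTemp's MultiIndex
dfTemp.index = dfTemp.index.set_levels(
    dfTemp.index.levels[0].astype(str), level=0)
dfTemp.to_excel(excel_writer, sheet_name='Sheet1', startrow=0, startcol=21,
                na_rep=0, header=True, index=True, merge_cells=True)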
I have the following data:
Example:
DRIVER_ID;TIMESTAMP;POSITION
156;2014-02-01 00:00:00.739166+01;POINT(41.8836718276551 12.4877775603346)
I want to create a pandas dataframe with 4 columns that are the id, time, longitude, latitude.
So far, I got:
cur_cab = pd.DataFrame.from_csv(
    path,
    sep=";",
    header=None,
    parse_dates=[1]).reset_index()
cur_cab.columns = ['cab_id', 'datetime', 'point']
path specifies the .txt file containing the data.
I already wrote a function that returns the longitude and latitude values from the POINT-formatted string.
How do I expand the data frame with the additional columns and the split values?
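(For reference, a sketch of applying such a function row-wise; parse_point is a hypothetical name for your parser, assumed to return a (longitude, latitude) tuple:)

# Unzip the (lon, lat) tuples returned by the parser into two new columns
cur_cab['longitude'], cur_cab['latitude'] = zip(*cur_cab['point'].apply(parse_point))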
After loading, if you're using a recent version of pandas then you can use the vectorised str methods to parse the column:
In [87]:
df[['pos_x', 'pos_y']] = df['point'].str[6:-1].str.split(expand=True)
df
Out[87]:
  cab_id                   datetime  \
0    156 2014-01-31 23:00:00.739166

                                       point             pos_x             pos_y
0  POINT(41.8836718276551 12.4877775603346)  41.8836718276551  12.4877775603346
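Note that the split values are still strings; to use them numerically you would likely convert them (a small follow-up to the above):

df[['pos_x', 'pos_y']] = df[['pos_x', 'pos_y']].astype(float)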
Also, you should stop using from_csv as it's no longer updated; use the top-level read_csv instead, so your loading code would be:
cur_cab = pd.read_csv(
    path,
    sep=";",
    header=None,
    parse_dates=[1],
    names=['cab_id', 'datetime', 'point'],
    skiprows=1)