I have a series of DataFrames which I am exporting to Excel within the same file. A number of them appear to be stored as a list of dictionaries due to the way they have been constructed. I converted them using .from_dict, but when I use df.to_excel an error is raised.
An example of one of the df's which raises the error is shown below. My code:
excel_writer = pd.ExcelWriter('My_DFs.xlsx')
df_Done_Major = df[
(df['currency_str'].str.contains('INR|ZAR|NOK|HUF|MXN|PLN|SEK|TRY')==False) &
(df['state'].str.contains('Done'))
][['Year_Month','state','currency_str','cust_cdr_display_name','rbc_security_type1','rfq_qty','rfq_qty_CAD_Equiv']].copy()
# Trades per bucket
df_Done_Major['Bucket'] = pd.cut(df_Done['rfq_qty'], bins=bins, labels=labels)
# Populate empty buckets with 0 so HK, SY and TK data can be pasted adjacently
df_Done_Major_Fill_Empty_Bucket = df_Done_Major.groupby(['Year_Month','Bucket'], as_index=False)['Bucket'].size()
mux = pd.MultiIndex.from_product([df_Done_Major_Fill_Empty_Bucket.index.levels[0], df_Done_Major['Bucket'].cat.categories])
df_Done_Major_Fill_Empty_Bucket = df_Done_Major_Fill_Empty_Bucket.reindex(mux, fill_value=0)
dfTemp = df_Done_Major_Fill_Empty_Bucket
display(dfTemp)
dfTemp = pd.DataFrame.from_dict(dfTemp)
display(dfTemp)
# Export
dfTemp.to_excel(excel_writer, sheet_name='Sheet1', startrow=0, startcol=21, na_rep=0, header=True, index=True, merge_cells= True)
2018-05 0K 0
10K 2
20K 4
40K 10
60K 3
80K 1
100K 14
> 100K 273
dtype: int64
TypeError: Unsupported type <class 'pandas._libs.period.Period'> in write()
Even though I have converted it to a DataFrame, is there additional conversion required?
Update: I can get the data into Excel using the following, but the format of the DataFrame is lost, which means significant Excel VBA to resolve.
list = [{"Data": dfTemp}, ]
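A minimal sketch of one possible fix, assuming the first index level is a PeriodIndex (the Excel writer cannot serialize Period objects, so cast that level to strings before exporting):
# Cast the Period level (or a plain Period index) to strings so to_excel
# receives text instead of Period objects.
if isinstance(dfTemp.index, pd.MultiIndex):
    dfTemp.index = dfTemp.index.set_levels(
        dfTemp.index.levels[0].astype(str), level=0)
else:
    dfTemp.index = dfTemp.index.astype(str)
dfTemp.to_excel(excel_writer, sheet_name='Sheet1', startrow=0, startcol=21)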
I'd like to use .ftr (feather) files to quickly analyze hundreds of tables. Unfortunately I have some problems with the decimal and thousands separators, similar to that post, except that read_feather does not allow for decimal=',', thousands='.' options. I've tried the following approaches:
df['numberofx'] = (
    df['numberofx']
    .apply(lambda x: x.str.replace(".", "", regex=True)
                      .str.replace(",", ".", regex=True))
)
resulting in
AttributeError: 'str' object has no attribute 'str'
when I change it to
df['numberofx'] = (
    df['numberofx']
    .apply(lambda x: x.replace(".", "").replace(",", "."))
)
I receive some strange (rounding) mistakes in the results, like 22359999999999998 instead of 2236 for some numbers higher than 1k. All numbers below 1k come out 10 times the real result, which is probably because deleting the "." of the float creates an int of that number.
Trying
df['numberofx'] = df['numberofx'].str.replace('.', '', regex=True)
also leads to some strange behavior in the results, as some numbers go into the 10^12 range while others remain at 10^3 as they should.
Here is how I create my .ftr files from multiple Excel files. I know I could simply create DataFrames from the Excel files, but that would slow down my daily calculations too much.
How can I solve that issue?
EDIT: The issue seems to come from reading in an Excel file, with non-US decimal and thousands separators, as a df and then saving it as feather. Using the pd.read_excel(f, encoding='utf-8', decimal=',', thousands='.') options for reading in the Excel file solved my issue. That leads to the next question:
Why does saving floats in a feather file lead to strange rounding errors like changing 2.236 to 2.2359999999999998?
The problem in your code is that when you check the column's type in the DataFrame, you will find:
df.dtypes['numberofx']
result: object
So the suggested solution is to try:
df['numberofx'] = df['numberofx'].apply(pd.to_numeric, errors='coerce')
Another way to fix this problem is to convert your values to float:
def coerce_to_float(val):
    try:
        return float(val)
    except ValueError:
        return val

df['numberofx'] = df['numberofx'].apply(coerce_to_float)
To avoid float output like '4.806105e+12', here is a sample:
df = pd.DataFrame({'numberofx':['4806105017087','4806105017087','CN414149']})
print (df)
numberofx
0 4806105017087
1 4806105017087
2 CN414149
print (pd.to_numeric(df['numberofx'], errors='coerce'))
0 4.806105e+12
1 4.806105e+12
2 NaN
Name: numberofx, dtype: float64
import numpy as np

df['numberofx'] = pd.to_numeric(df['numberofx'], errors='coerce').fillna(0).astype(np.int64)
print (df['numberofx'])
0 4806105017087
1 4806105017087
2 0
Name: numberofx, dtype: int64
As mentioned in my edit here is what solved my initial problem:
import glob
import pandas as pd

path = r"pathname\*_somename*.xlsx"
file_list = glob.glob(path)
for f in file_list:
    df = pd.read_excel(f, encoding='utf-8', decimal=',', thousands='.')
    for col in df.columns:
        # Cast mixed-type columns to str, since feather cannot store them.
        w = (df[[col]].applymap(type) != df[[col]].iloc[0].apply(type)).any(axis=1)
        if len(df[w]) > 0:
            df[col] = df[col].astype(str)
        # Same for columns holding lists.
        if df[col].dtype == list:
            df[col] = df[col].astype(str)
    pathname = f[:-4] + "ftr"  # replace the xlsx suffix with ftr
    df.to_feather(pathname)
df.head()
I had to add the decimal=',', thousands='.' options when reading in the Excel files, which I later saved as feather. So the problem did not arise when working with the .ftr files but before, when the Excel files were parsed with the wrong separators. The rounding problems seem to come from saving numbers with different decimal and thousands separators as .ftr files.
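As for the rounding itself, that looks like ordinary binary floating-point representation rather than anything feather-specific; a quick illustrative check (my addition, not from the original post):
# 2.236 has no exact float64 representation; printing enough significant
# digits reveals the value that is actually stored.
print("%.17g" % 2.236)  # 2.2359999999999998
Feather stores that exact binary value, while pandas' default display rounds it back to 2.236.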
I have a DataFrame with the following date field:
463 14-05-2019
535 03-05-2019
570 11-05-2019
577 09-05-2019
628 08-08-2019
630 25-05-2019
Name: Date, dtype: object
I have to format it as DDMMAAAA (i.e. DDMMYYYY). This is what I'm doing inside a loop (for idx, row in df.iterrows():):
I'm removing the - character using a regex:
df.at[idx, 'Date'] = re.sub('\-', '', df.at[idx, 'Date'])
then using apply to enforce an 8-digit string with leading zeros:
df['Date'] = df['Date'].apply(lambda x: '{0:0>8}'.format(x))
But even though the df['Date'] field has the 8 digits with the leading 0 in the df, when exporting it to CSV the leading zeros are removed in the exported file, as below.
df.to_csv(path_or_buf=report, header=True, index=False, sep=';')
field as in csv:
Dt_DDMMAAAA
30102019
12052019
7052019
26042019
3052019
22042019
25042019
2062019
I know I must be missing the point somewhere along the way here, but I just can't figure out what the issue is (or whether it's even an issue, rather than a misused method).
IMO the simplest method is to use the date_format argument when writing to CSV. This means you will need to convert the "Date" column to datetime beforehand using pd.to_datetime.
(df.assign(Date=pd.to_datetime(df['Date'], errors='coerce', dayfirst=True))
   .to_csv(path_or_buf=report, date_format='%d%m%Y', index=False))
This prints:
Date
14052019
03052019
11052019
09052019
08082019
25052019
More information on arguments to to_csv can be found in Writing a pandas DataFrame to CSV file.
What I would do is use strftime + to_excel, since a CSV keeps no display formatting: if you open it as plain text the leading zeros are there, but Excel strips them when it displays the file. So if you need the format to survive opening, write to Excel instead:
pd.to_datetime(df.Date, dayfirst=True).dt.strftime('%d%m%Y').to_excel('your.xls')
Out[722]:
463 14052019
535 03052019
570 11052019
577 09052019
628 08082019
630 25052019
Name: Date, dtype: object
Firstly, your method is producing a file which contains leading zeros just as you expect. I reconstructed this minimal working example from your description and it works just fine:
import pandas
import re
df = pandas.DataFrame([["14-05-2019"],
["03-05-2019"],
["11-05-2019"],
["09-05-2019"],
["08-08-2019"],
["25-05-2019"]], columns=['Date'])
for idx in df.index:
df.at[idx, 'Date'] = re.sub('\-', '', df.at[idx, 'Date'])
df['Date'] = df['Date'].apply(lambda x: '{0:0>8}'.format(x))
df.to_csv(path_or_buf="report.csv", header=True, index=False, sep=';')
At this point report.csv contains this (with leading zeros just as you wanted).
Date
14052019
03052019
11052019
09052019
08082019
25052019
Now, as to why you thought it wasn't working: whatever reads the file back in is guessing the column type. If you are mainly in pandas, you can stop it from guessing by specifying a dtype in read_csv:
df_readback = pandas.read_csv('report.csv', dtype={'Date': str})
Date
0 14052019
1 03052019
2 11052019
3 09052019
4 08082019
5 25052019
It might also be that you are reading this in Excel (I'm guessing this from the fact that you are using ; separators). Unfortunately there is no way to ensure that Excel reads this field correctly on double-click, but if this is your final target, you can see how to mangle your file for Excel to read correctly in this answer.
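One common mangling trick (an illustrative sketch, and my assumption rather than the content of the linked answer: Excel treats a ="..." cell as a text formula and keeps the leading zeros):
# Wrap each value in ="..." so Excel displays it as literal text.
df['Date'] = '="' + df['Date'] + '"'
df.to_csv(path_or_buf="report.csv", header=True, index=False, sep=';')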
I have a DataFrame with around 2.5 million rows and more than 7,000 columns (all categorical).
I iterate through each column, dummy the variables, do some processing, and concatenate onto a final DataFrame.
The code is below:
cat_count = 0
df_final = pd.DataFrame()

for each_col in cat_cols:
    df_temp = pd.DataFrame()
    df_single_col_data = df_data[[each_col]]
    cat_count += 1

    # Calculate uniques and nulls in each column to display in log file.
    uniques_in_column = len(df_single_col_data[each_col].unique())
    nulls_in_column = df_single_col_data.isnull().sum()
    print('%s has %s unique values and %s null values' % (each_col, uniques_in_column, nulls_in_column[0]))

    # Convert into dummies
    df_categorical_attribute = pd.get_dummies(df_single_col_data[each_col].astype(str), dummy_na=True, prefix=each_col)
    df_categorical_attribute = df_categorical_attribute.loc[:, df_categorical_attribute.var() != 0.0]  # Drop columns with 0 variance.

    # //// Some data processing code: /////
    df_final = pd.concat([df_final, df_categorical_attribute], axis=1)
    print('*' * 10 + "\n Variable number %s processed!" % cat_count)

# Write the final dataframe to a csv
df_final.to_csv('cat_processed.csv')
However, for such large data, df_final hogs up to 75% of the memory on the server, and I would like to reduce the memory footprint of this piece of code.
So what I am thinking is: process up to the 300th column and write the results to CSV; then process the next 300 columns, open the CSV, write to it, and close.
In that way, df_final at a time will hold the results of only 300 columns.
Can someone please help me with this?
Or, if there is any better way to deal with the issue, I would like to implement that too.
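A minimal sketch of that column-batching idea (illustrative only, assuming cat_cols and df_data as below; note that appending to a CSV adds rows, not columns, so each batch goes to its own file to be joined later):
import pandas as pd

BATCH = 300  # columns per batch; tune to available memory
for i in range(0, len(cat_cols), BATCH):
    cols = cat_cols[i:i + BATCH]
    dummies = pd.get_dummies(df_data[cols].astype(str), dummy_na=True)
    dummies = dummies.loc[:, dummies.var() != 0.0]  # drop zero-variance columns
    dummies.to_csv('cat_processed_%03d.csv' % (i // BATCH), index=False)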
Below is some sample data to replicate:
df_data
rev_m1_Transform ov_m1_Transform ana_m1_Transform oov_m1_Transform
0_to_12.95 34.95_to_846.4 65_to_74.95 64.9_to_1239.51
13.95_to_116.55 14.95_to_19.95 45.05_to_60.05 34.9_to_39.95
12.95_to_13.95 19.95_to_29.95 89.95_to_9491.36 54.95_to_59.95
0_to_12.95 0_to_14.95 0_to_29.949999 64.9_to_1239.51
0_to_12.95 19.95_to_29.95 74.95_to_83.9 54.95_to_59.95
0_to_12.95 0_to_14.95 0_to_29.9499 0_to_34.9
0_to_12.95 14.95_to_19.95 45.05_to_60.05 39.95_to_44.9
0_to_12.95 0_to_14.95 0_to_29.949 0_to_34.9
0_to_12.95 19.95_to_29.95 89.95_to_9491.36 54.95_to_59.95
cat_cols is a list with all the column names in df_data
Thanks
Instead of going over columns, go over chunks of rows, then apply the processing steps to all the columns in each chunk. For example:
CHUNKSIZE = 1000
first_chunk = True
for chunk in pd.read_csv("filename.csv", chunksize=CHUNKSIZE):
    # Apply processing steps here
    processed = process(chunk)
    # Append to the output, writing the header only for the first chunk.
    processed.to_csv("final.csv", mode="a", header=first_chunk, index=False)
    first_chunk = False
Use CHUNKSIZE according to your physical memory size (RAM).
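One rough way to pick it (an illustrative heuristic of mine, not from the original answer): measure the per-row memory of a small sample and divide your budget by it.
import pandas as pd

sample = pd.read_csv("filename.csv", nrows=1000)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
budget = 2 * 1024**3  # e.g. allow ~2 GB for one chunk in memory
CHUNKSIZE = int(budget / bytes_per_row)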
I have the following data:
Example:
DRIVER_ID;TIMESTAMP;POSITION
156;2014-02-01 00:00:00.739166+01;POINT(41.8836718276551 12.4877775603346)
I want to create a pandas DataFrame with four columns: id, time, longitude and latitude.
So far, I got:
cur_cab = pd.DataFrame.from_csv(
path,
sep=";",
header=None,
parse_dates=[1]).reset_index()
cur_cab.columns = ['cab_id', 'datetime', 'point']
path specifies the .txt file containing the data.
I already wrote a function that returns the longitude and latitude values from the POINT-formatted string.
How do I expand the data frame with the additional columns and the split values?
After loading, if you're using a recent version of pandas then you can use the vectorised str methods to parse the column:
In [87]:
df[['pos_x', 'pos_y']] = df['point'].str[6:-1].str.split(expand=True)
df
Out[87]:
cab_id datetime \
0 156 2014-01-31 23:00:00.739166
point pos_x pos_y
0 POINT(41.8836718276551 12.4877775603346) 41.8836718276551 12.4877775603346
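If you need numeric coordinates rather than strings, a small follow-up (my addition, not part of the original answer; it assumes the split always yields two numeric tokens):
df[['pos_x', 'pos_y']] = df[['pos_x', 'pos_y']].astype(float)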
Also, you should stop using from_csv, as it's no longer updated; use the top-level read_csv, so your loading code would be:
cur_cab = pd.read_csv(
path,
sep=";",
header=None,
parse_dates=[1],
names=['cab_id', 'datetime', 'point'],
skiprows=1)
I'm importing large amounts of http logs (80GB+) into a pandas HDFStore for statistical processing. Even within a single import file I need to batch the content as I load it. My tactic thus far has been to read the parsed lines into a DataFrame, then store the DataFrame into the HDFStore. My goal is to have the index key unique for a single key in the HDFStore, but each DataFrame restarts its own index values again. I was anticipating HDFStore.append() would have some mechanism to tell it to ignore the DataFrame index values and just keep adding to my HDFStore key's existing index values, but I cannot seem to find it. How do I import DataFrames and ignore the index values contained therein, while having the HDFStore increment its existing index values? Sample code below batches every 10 lines. Naturally the real thing would be larger.
if hd_file_name:
    """
    HDF5 output file specified.
    """
    hdf_output = pd.HDFStore(hd_file_name, complib='blosc')
    print hdf_output

    columns = ['source', 'ip', 'unknown', 'user', 'timestamp', 'http_verb', 'path', 'protocol', 'http_result',
               'response_size', 'referrer', 'user_agent', 'response_time']

    source_name = str(log_file.name.rsplit('/')[-1])  # HDF5 Tables don't play nice with unicode so explicit str(). :(

    batch = []
    for count, line in enumerate(log_file, 1):
        data = parse_line(line, rejected_output=reject_output)
        # Add our source file name to the beginning.
        data.insert(0, source_name)
        batch.append(data)
        if not (count % 10):
            df = pd.DataFrame(batch, columns=columns)
            hdf_output.append(KEY_NAME, df)
            batch = []
    # Flush any remaining partial batch.
    if (count % 10):
        df = pd.DataFrame(batch, columns=columns)
        hdf_output.append(KEY_NAME, df)
You can do it like this. The only trick is that the first time through, the store's table doesn't exist yet, so get_storer will raise.
import pandas as pd
import numpy as np
import os

files = ['test1.csv', 'test2.csv']
for f in files:
    pd.DataFrame(np.random.randn(10, 2), columns=list('AB')).to_csv(f)

path = 'test.h5'
if os.path.exists(path):
    os.remove(path)

with pd.get_store(path) as store:
    for f in files:
        df = pd.read_csv(f, index_col=0)
        try:
            nrows = store.get_storer('foo').nrows
        except:
            # First pass: the table doesn't exist yet.
            nrows = 0
        # Shift the incoming index past the rows already stored.
        df.index = pd.Series(df.index) + nrows
        store.append('foo', df)
In [10]: pd.read_hdf('test.h5','foo')
Out[10]:
A B
0 0.772017 0.153381
1 0.304131 0.368573
2 0.995465 0.799655
3 -0.326959 0.923280
4 -0.808376 0.449645
5 -1.336166 0.236968
6 -0.593523 -0.359080
7 -0.098482 0.037183
8 0.315627 -1.027162
9 -1.084545 -1.922288
10 0.412407 -0.270916
11 1.835381 -0.737411
12 -0.607571 0.507790
13 0.043509 -0.294086
14 -0.465210 0.880798
15 1.181344 0.354411
16 0.501892 -0.358361
17 0.633256 0.419397
18 0.932354 -0.603932
19 -0.341135 2.453220
You actually don't necessarily need a globally unique index (unless you want one), as HDFStore (through PyTables) provides one by uniquely numbering rows. You can always add these selection parameters:
In [11]: pd.read_hdf('test.h5','foo',start=12,stop=15)
Out[11]:
A B
12 -0.607571 0.507790
13 0.043509 -0.294086
14 -0.465210 0.880798