Pandas DataFrame and Excel datetime challenges with messy accounting data - Python

I frequently work with messy accounting data, which often comes in Excel files. When I load an Excel file into a dataframe using the code below, 00:00:00 gets appended to every date. I want to preserve the original date format the accountant created so that I can extract it, but I cannot extract a date as a string when it is formatted like this. Could someone explain this behaviour and how to prevent it?
xls = pd.ExcelFile('GLQ1.xlsx')
df = pd.read_excel(xls, 'JNS512051', header=None, skiprows=8)
df.head()
0 01002-0 Bank-Current NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN Opening Balance Sep 30/18 NaN NaN 666034
2 2018-10-01 00:00:00 CR CR8729 CR8729 Fast Cash Receipts 3868.61 - NaN
3 2018-10-01 00:00:00 CR CR8732 CR8732 Fast Cash Receipts 13348.4 - NaN
4 2018-10-02 00:00:00 CR CR8733 CR8733 Fast Cash Receipts 9671.88 - NaN

When importing xlsx files you can prevent pandas from interpreting the data types and leave the values as they are. You can achieve this with the dtype parameter of pd.read_excel(), like so:
df = pd.read_excel(xls, 'JNS512051', header=None, skiprows=8, dtype=object)
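If the goal is mainly to get the date back out as a plain string (Excel's on-screen number format itself is not something pandas sees, only the cell value), another option is to let the dates parse and then re-format them. A minimal sketch, assuming the dates sit in the first column as in the sample output above:

import pandas as pd

df = pd.read_excel('GLQ1.xlsx', sheet_name='JNS512051', header=None, skiprows=8)

# Column 0 holds the transaction dates in this layout (an assumption based on
# the sample rows above). Re-format parsed dates as strings; leave other cells alone.
dates = pd.to_datetime(df[0], errors='coerce')
df[0] = dates.dt.strftime('%Y-%m-%d').where(dates.notna(), df[0])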

Related

How to clean and rearrange a dataframe with pairs of date and price columns into a df with a common date index?

I have a dataframe of price data that looks like the following (with more than 10,000 columns):

      Unamed: 0   01973JAC3 corp  Unamed: 2   019754AA8 corp  Unamed: 4   01265RTJ7 corp  Unamed: 6   01988PAD0 corp  Unamed: 8   019736AB3 corp
1     2004-04-13  101.1           2008-06-16  99.1            2010-06-14  110.0           2008-06-18  102.1           NaT         NaN
2     2004-04-14  101.2           2008-06-17  100.4           2010-07-05  110.3           2008-06-19  102.6           NaT         NaN
3     2004-04-15  101.6           2008-06-18  100.4           2010-07-12  109.6           2008-06-20  102.5           NaT         NaN
4     2004-04-16  102.8           2008-06-19  100.9           2010-07-19  110.1           2008-06-21  102.6           NaT         NaN
5     2004-04-19  103.0           2008-06-20  101.3           2010-08-16  110.3           2008-06-22  102.8           NaT         NaN
...   ...         ...             ...         ...             ...         ...             ...         ...             NaT         NaN
3431  NaT         NaN             2021-12-30  119.2           NaT         NaN             NaT         NaN             NaT         NaN
3432  NaT         NaN             2021-12-31  119.4           NaT         NaN             NaT         NaN             NaT         NaN
(Those are 9-digit CUSIPs in the header, so every pair of columns represents the date and closing price for one security.)
I would like to:
1. find and get rid of empty pairs of date and price columns like "Unamed: 8" and "019736AB3 corp", and
2. rearrange the dataframe into a panel of monthly close prices, as follows:
Date        01973JAC3  019754AA8  01265RTJ7  01988PAD0
2004-04-30  102.1      NaN        NaN        NaN
2004-05-31  101.2      NaN        NaN        NaN
...         ...        ...        ...        ...
2021-12-30  NaN        119.2      NaN        NaN
2021-12-31  NaN        119.4      NaN        NaN
Edit:
I want to clarify my question. My dataframe has more than 10,000 columns, which makes it impossible to drop columns by name or rename them one by one. The pairs of date and price start and end at different times, are of different lengths (and of different frequencies). I'm looking for an efficient way to arrange them into a less messy form. Thanks.
Here is a sample of 30 columns: https://github.com/txd2x/datastore file name: sample-question2022-01.xlsx
I figured it out: stacking and then reshaping. Thanks for the help.
import pandas as pd

# stack every (date, price) column pair into long format
pieces = []
for i in range(len(price.columns) // 2):
    temp = pd.DataFrame(columns=['Date', 'ClosedPrice', 'CUSIP'])
    temp['Date'] = price.iloc[:len(price) - 1, 2 * i]        # every row except the trailing one
    temp['ClosedPrice'] = price.iloc[:len(price) - 1, 2 * i + 1]
    temp['CUSIP'] = price.columns[2 * i + 1][:9]              # first 9 characters = CUSIP
    pieces.append(temp)

df = pd.concat(pieces)                  # stack all the column pairs
df = df.dropna(axis=0, how='any')       # drop NaN rows
# reshape so Date is the index and the CUSIPs are the column headers
df = df.pivot(index='Date', columns='CUSIP', values='ClosedPrice')
df_monthly = df.resample('M').last()    # last price of each month
If you want to get rid of unneeded columns, use the following code:
df.drop("name_of_column", axis=1, inplace=True)
If you want to drop a particular (e.g. empty) row, use:
df.drop(df.index[row_number], inplace=True)
If you want to rearrange the data by date, you need to convert the column to a datetime object and then make it the index:
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date')
You probably want to rename the columns before doing any of the above: df.rename(columns={'first_column': 'first', 'second_column': 'second'}, inplace=True)
Update 01:
If you want to keep just some of those 10,000 columns, say 7 or 10 of them, use df = df[["first_column", "second_column", ...]]
If you want to get rid of all empty columns, use df = df.dropna(axis=1, how='all'). The how keyword takes two values: 'all' drops the whole row or column only if it is entirely NaN, 'any' drops it if it contains at least one NaN.
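A tiny made-up frame to illustrate the difference between the two values of how (column c is entirely NaN, column b has a single NaN):

import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3],
                     'b': [4, np.nan, 6],
                     'c': [np.nan, np.nan, np.nan]})

print(demo.dropna(axis=1, how='all'))   # drops only 'c' (all values NaN)
print(demo.dropna(axis=1, how='any'))   # drops 'b' and 'c' (at least one NaN)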
Update 02:
Now, if you have a lot of date columns and you want to keep just one of them (assuming the date column you have chosen has no NaN values), use the following code:
columns = df.columns.tolist()
for column in columns:
    try:
        # parse object columns that actually contain dates
        if df[column].dtypes == 'object':
            df[column] = pd.to_datetime(df[column])
        # drop every datetime column except the one called 'Date'
        if (df[column].dtypes == 'datetime64[ns]') and (column != 'Date'):
            df.drop(column, axis=1, inplace=True)
    except ValueError:
        pass
To rearrange the dataframe by month:
df.Date = pd.to_datetime(df.Date)
df['Month'] = df.Date.dt.month
df['Year'] = df.Date.dt.year
monthly_means = df.groupby(["Year", "Month"]).mean()
Update 03:
To combine all the date columns while preserving the data, use the following code:
import itertools

import numpy as np
import pandas as pd

df = pd.read_excel('sample_question2022-01.xlsx')

# drop columns that are mostly empty
columns = df.columns.tolist()
for column in columns:
    if df[column].isnull().sum() > 2300:
        df.drop(column, axis=1, inplace=True)

# rename the remaining columns to date1/Price1, date2/Price2, ...
columns = df.columns.tolist()
count_date = itertools.count(1)
count_price = itertools.count(1)
for column in columns:
    if df[column].dtypes == 'datetime64[ns]':
        df.rename(columns={column: f'date{next(count_date)}'}, inplace=True)
    else:
        df.rename(columns={column: f'Price{next(count_price)}'}, inplace=True)

# outer-merge every (date, price) pair on its date index
columns = df.columns.tolist()
merged = df[[columns[0], columns[1]]].set_index('date1')
k = 2
for i in range(2, len(columns) - 1, 2):
    merged = pd.merge(merged, df[[columns[i], columns[i + 1]]].set_index(f'date{k}'),
                      how='outer', left_index=True, right_index=True)
    k += 1
The only problem left is that it throws a MemoryError:
MemoryError: Unable to allocate 27.4 GiB for an array with shape (3677415706,) and data type int64
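One likely reason (an assumption, not verified against the sample file) is that the date columns contain long runs of NaT and repeated dates, so each successive outer merge becomes a many-to-many join and the index explodes. Dropping missing and duplicated dates inside each pair before merging keeps the joins one-to-one; a rough sketch of a modified merge loop:

# Hypothetical variant of the merge loop above: drop NaT and repeated dates
# inside each (date, price) pair before joining, so the outer merges stay
# one-to-one instead of blowing up into a many-to-many join.
merged = (df[[columns[0], columns[1]]]
          .dropna(subset=[columns[0]])
          .drop_duplicates(subset=[columns[0]])
          .set_index(columns[0]))
for i in range(2, len(columns) - 1, 2):
    pair = (df[[columns[i], columns[i + 1]]]
            .dropna(subset=[columns[i]])
            .drop_duplicates(subset=[columns[i]])
            .set_index(columns[i]))
    merged = pd.merge(merged, pair, how='outer', left_index=True, right_index=True)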

Compare and find duplicated values (not entire columns) across a data frame with Python

I have a large data frame of schedules, and I need to count the number of experiments run. The challenge is that an experiment's usage is repeated across rows (which is OK), but it is also duplicated in some, but not all, columns. I want to remove the second entry (if duplicated), but I can't delete the entire second column because it will contain some new values too. How can I compare individual entries of two columns side by side and delete the second if it is a duplicate?
An experiment's duration is a maximum of two days, so three days in a row means a new event with the same name starts on the third day.
The actual text of the experiment names is complicated and the data frame is 120 columns wide, so typing this in as a list or dictionary isn't possible. I'm hoping for a Python or NumPy function, but could use a loop.
Here are pictures for an example of the starting data frame and the desired output: starting data frame example, de-duplicated data frame example.
This is a hack, similar to @Params' answer, but it should be faster because you aren't calling .iloc a lot. The basic idea is to transpose the data frame, repeat an operation as many times as you need to compare all of the columns, and then transpose it back to get the result in the OP's format.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Monday':['exp_A','exp_A','exp_A','exp_A','exp_B',np.nan,np.nan,np.nan,'exp_D','exp_D'],
'Tuesday':['exp_A','exp_A','exp_A','exp_A','exp_B','exp_B','exp_C','exp_C','exp_D','exp_D'],
'Wednesday':['exp_A','exp_D',np.nan,np.nan,np.nan,'exp_B','exp_C','exp_C','exp_C',np.nan],
'Thursday':['exp_A','exp_D',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'exp_C',np.nan]
})
# transpose so the comparison runs down each original row, blank out the second
# of two consecutive duplicates (unless it starts a new run), then transpose back
df = df.T
for i in range(int(np.ceil(df.shape[0] / 2))):
    df[(df == df.shift(1)) & (df != df.shift(2))] = np.nan
df = df.T
Monday Tuesday Wednesday Thursday
0 exp_A NaN exp_A NaN
1 exp_A NaN exp_D NaN
2 exp_A NaN NaN NaN
3 exp_A NaN NaN NaN
4 exp_B NaN NaN NaN
5 NaN exp_B NaN NaN
6 NaN exp_C NaN NaN
7 NaN exp_C NaN NaN
8 exp_D NaN exp_C NaN
9 exp_D NaN NaN NaN
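Since the original goal was to count how many experiments were run, one way to finish from the de-duplicated frame above (every surviving non-NaN cell marks the first day of one run) is a stack plus value_counts. A sketch, assuming the de-duplicated frame is still called df:

# Count runs per experiment name: each remaining cell is the start of one run.
counts = df.stack().value_counts()
print(counts)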

How to update a Pandas Panel without duplicates

Currently I'm working on live-timing software for a motorsport application. For that, I have to crawl a live-timing webpage and copy the data into a big DataFrame. This DataFrame is the source of several diagrams I want to make. To keep my DataFrame up to date, I have to crawl the webpage very often.
I can download the data and save it as a pandas DataFrame. But my problem is the step from the freshly downloaded DataFrame to the big DataFrame that includes all the data.
import pandas as pd
import numpy as np
df1= pd.DataFrame({'Pos':[1,2,3,4,5,6],'CLS':['V5','V5','V5','V4','V4','V4'],
'Nr.':['13','700','30','55','24','985'],
'Zeit':['1:30,000','1:45,000','1:50,000','1:25,333','1:13,366','1:17,000'],
'Laps':['1','1','1','1','1','1']})
df2= pd.DataFrame({'Pos':[1,2,3,4,5,6],'CLS':['V5','V5','V5','V4','V4','V4'],
'Nr.':['13','700','30','55','24','985'],
'Zeit':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,],
'Laps':['2','2','2','2','2','2']})
df3= pd.DataFrame({'Pos':[1,2,3,4,5,6],'CLS':['V5','V5','V5','V4','V4','V4'],
'Nr.':['13','700','30','55','24','985'],
'Zeit':['1:31,000','1:41,000','1:51,000','1:21,333','1:11,366','1:11,000'],
'Laps':['2','2','2','2','2','2']})
df1.set_index(['CLS','Nr.','Laps'],inplace=True)
df2.set_index(['CLS','Nr.','Laps'],inplace=True)
df3.set_index(['CLS','Nr.','Laps'],inplace=True)
df1 shows a DataFrame from previous laps.
df2 shows a DataFrame during the second lap. The lap is not completed, so I have a NaN.
df3 shows a DataFrame after the second lap is completed.
My goal is to have just one row for each lap per car per class.
Either I have the problem that I get duplicates from incomplete laps, or all data gets overwritten.
I hope that someone can help me with this problem.
Thanks so far.
MrCrunsh
If I understand your problem correctly, your issue is that you have overlapping data for the second lap: information while the lap is still in progress and information after it's over. If you want to put all the information for a given lap in one row, I'd suggest using multi-index columns, or changing the column names to reflect the difference between measurements during and after laps.
df = pd.concat([df1, df3])
df = pd.concat([df, df2], axis=1, keys=['after', 'during'])
The result will look like this:
after during
Pos Zeit Pos Zeit
CLS Nr. Laps
V4 24 1 5 1:13,366 NaN NaN
2 5 1:11,366 5.0 NaN
55 1 4 1:25,333 NaN NaN
2 4 1:21,333 4.0 NaN
985 1 6 1:17,000 NaN NaN
2 6 1:11,000 6.0 NaN
V5 13 1 1 1:30,000 NaN NaN
2 1 1:31,000 1.0 NaN
30 1 3 1:50,000 NaN NaN
2 3 1:51,000 3.0 NaN
700 1 2 1:45,000 NaN NaN
2 2 1:41,000 2.0 NaN
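The other option mentioned above, renaming the columns instead of using a MultiIndex, could look roughly like this (a sketch with hypothetical column suffixes, assuming df1/df2/df3 are indexed on ['CLS', 'Nr.', 'Laps'] as in the question):

# Rename the overlapping columns, then join on the shared (CLS, Nr., Laps) index.
after = pd.concat([df1, df3]).rename(columns={'Pos': 'Pos_after', 'Zeit': 'Zeit_after'})
during = df2.rename(columns={'Pos': 'Pos_during', 'Zeit': 'Zeit_during'})
combined = after.join(during, how='outer')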

Complex case of filling NaNs in Pandas

Is there a way to go from this...
bloomberg morningstar yahoo
0 AAPL1 AAPL2 NaN
1 AAPL1 NaN AAPL3
2 NaN GOOG4 GOOG5
3 GOOG6 GOOG4 NaN
4 IBM7 NaN IBM8
5 NaN IBM9 IBM8
6 NaN NaN FB
... to this ...
bloomberg morningstar yahoo
0 AAPL1 AAPL2 AAPL3
1 GOOG6 GOOG4 GOOG5
2 IBM7 IBM9 IBM8
3 NaN NaN FB
... in Pandas?
I've munged my data enough to ensure that there will never be any "conflicting" information in a given column of the starting dataframe, e.g. the following is not possible...
A column Another column
0 AAPL1 One thing
1 AAPL1 Another thing
The only thing that can happen is that any given column either has 1) no information or 2) the right information, e.g.
A column Another column
0 AAPL1 NaN
1 AAPL1 The right information
All I want to do is fill the NaN's with the "right" information where available and then drop duplicates (which should be easy).
But some NaNs should remain, as I don't have enough data to infer their value, e.g. the FB row in the example.
Anybody have a good answer? Thanks for the help!
Here is some code to load the starting dataframe if you'd like to play around:
import pandas as pd
data = [
{'bloomberg': 'AAPL1', 'morningstar': 'AAPL2'},
{'bloomberg': 'AAPL1', 'yahoo': 'AAPL3'},
{'morningstar': 'GOOG4', 'yahoo': 'GOOG5'},
{'bloomberg': 'GOOG6', 'morningstar': 'GOOG4'},
{'bloomberg': 'IBM7', 'yahoo': 'IBM8'},
{'morningstar': 'IBM9', 'yahoo': 'IBM8'},
{'yahoo': 'FB'}]
df = pd.DataFrame(data)
Chaining ffill and bfill will do what you want:
df.ffill(axis=1).bfill(axis=1).drop_duplicates()
bloomberg morningstar yahoo
0 AAPL AAPL AAPL
2 GOOG GOOG GOOG
4 IBM IBM IBM

Read flat file to DataFrames using Pandas with field specifiers in-line

I'm attempting to read in a flat-file to a DataFrame using pandas but can't seem to get the format right. My file has a variable number of fields represented per line and looks like this:
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOCinpt|MIME=application/synthesis+ssml|TXID=NUAN-20131203004552049-FCJNJKDCAAANPCKEAAAAAAAA-txt|TXSZ=1167|UCPU=31|SCPU=15
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOCsynd|INPT=1167|DURS=5120|RSTT=stop|UCPU=31|SCPU=15
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOClise|LUSED=0|LMAX=100|OMAX=95|LFEAT=tts|UCPU=0|SCPU=0
I have the field separator at |, I've pulled a list of all unique keys into keylist, and am trying to use the following to read in the data:
keylist = ['TIME',
'CHAN',
# [truncated]
'DURS',
'RSTT']
test_fp = 'c:\\temp\\test_output.txt'
df = pd.read_csv(test_fp, sep='|', names=keylist)
This incorrectly builds the DataFrame as I'm not specifying any way to recognize the key label in the line. I'm a little stuck and am not sure which way to research -- should I be using .read_json() for example?
Not sure if there's a slick way to do this. Sometimes when the data structure is different enough from the norm it's easiest to preprocess it on the Python side. Sure, it's not as fast, but since you could immediately save it in a more standard format it's usually not worth worrying about.
One way:
with open("wfield.txt") as fp:
rows = (dict(entry.split("=",1) for entry in row.strip().split("|")) for row in fp)
df = pd.DataFrame.from_dict(rows)
which produces
>>> df
CHAN DURS EVNT INPT LFEAT LMAX LUSED \
0 FCJNJKDCAAANPCKEAAAAAAAA NaN NVOCinpt NaN NaN NaN NaN
1 FCJNJKDCAAANPCKEAAAAAAAA 5120 NVOCsynd 1167 NaN NaN NaN
2 FCJNJKDCAAANPCKEAAAAAAAA NaN NVOClise NaN tts 100 0
MIME OMAX RSTT SCPU TIME \
0 application/synthesis+ssml NaN NaN 15 20131203004552049
1 NaN NaN stop 15 20131203004552049
2 NaN 95 NaN 0 20131203004552049
TXID TXSZ UCPU
0 NUAN-20131203004552049-FCJNJKDCAAANPCKEAAAAAAA... 1167 31
1 NaN NaN 31
2 NaN NaN 0
[3 rows x 15 columns]
After you've got this, you can reshape as needed. (I'm not sure if you wanted to combine rows with the same TIME & CHAN or not.)
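If rows sharing the same TIME and CHAN should be collapsed into one, a groupby with first would take the first non-null value per field. A sketch (whether this is the right aggregation depends on the data):

# Collapse rows that share TIME and CHAN, keeping the first non-null value
# seen in every other column.
combined = df.groupby(['TIME', 'CHAN'], as_index=False).first()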
Edit: if you're using an older version of pandas which doesn't support passing a generator to from_dict, you can build it from a list instead:
df = pd.DataFrame(list(rows))
but note that you may have to convert columns from strings to numerical types after the fact.
