I'm currently working on live-timing software for a motorsport application. For that I have to crawl a live-timing web page frequently and copy the data into one big DataFrame, which is the source for several diagrams I want to create. To keep the DataFrame up to date, I have to crawl the web page very often.
I can download the data and store it as a pandas DataFrame. My problem is the step from the freshly downloaded DataFrame to the big DataFrame that holds all the data.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Pos': [1, 2, 3, 4, 5, 6],
                    'CLS': ['V5', 'V5', 'V5', 'V4', 'V4', 'V4'],
                    'Nr.': ['13', '700', '30', '55', '24', '985'],
                    'Zeit': ['1:30,000', '1:45,000', '1:50,000', '1:25,333', '1:13,366', '1:17,000'],
                    'Laps': ['1', '1', '1', '1', '1', '1']})
df2 = pd.DataFrame({'Pos': [1, 2, 3, 4, 5, 6],
                    'CLS': ['V5', 'V5', 'V5', 'V4', 'V4', 'V4'],
                    'Nr.': ['13', '700', '30', '55', '24', '985'],
                    'Zeit': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                    'Laps': ['2', '2', '2', '2', '2', '2']})
df3 = pd.DataFrame({'Pos': [1, 2, 3, 4, 5, 6],
                    'CLS': ['V5', 'V5', 'V5', 'V4', 'V4', 'V4'],
                    'Nr.': ['13', '700', '30', '55', '24', '985'],
                    'Zeit': ['1:31,000', '1:41,000', '1:51,000', '1:21,333', '1:11,366', '1:11,000'],
                    'Laps': ['2', '2', '2', '2', '2', '2']})
df1.set_index(['CLS', 'Nr.', 'Laps'], inplace=True)
df2.set_index(['CLS', 'Nr.', 'Laps'], inplace=True)
df3.set_index(['CLS', 'Nr.', 'Laps'], inplace=True)
df1 shows a DataFrame from the previous laps.
df2 shows a DataFrame during the second lap. The lap is not completed yet, so Zeit is NaN.
df3 shows a DataFrame after the second lap is completed.
My target is to have exactly one row per lap, per car, per class. Either I end up with duplicates from incomplete laps, or all data gets overwritten.
I hope someone can help me with this problem.
Thank you so far.
MrCrunsh
If I understand your problem correctly, your issue is that you have overlapping data for the second lap: information while the lap is still in progress and information after it's over. If you want to put all the information for a given lap in one row, I'd suggest using multi-index columns or changing the column names to reflect the difference between measurements taken during and after laps.
df = pd.concat([df1, df3])
df = pd.concat([df, df2], axis=1, keys=['after', 'during'])
The result will look like this:
after during
Pos Zeit Pos Zeit
CLS Nr. Laps
V4 24 1 5 1:13,366 NaN NaN
2 5 1:11,366 5.0 NaN
55 1 4 1:25,333 NaN NaN
2 4 1:21,333 4.0 NaN
985 1 6 1:17,000 NaN NaN
2 6 1:11,000 6.0 NaN
V5 13 1 1 1:30,000 NaN NaN
2 1 1:31,000 1.0 NaN
30 1 3 1:50,000 NaN NaN
2 3 1:51,000 3.0 NaN
700 1 2 1:45,000 NaN NaN
2 2 1:41,000 2.0 NaN
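If a column MultiIndex feels heavy, the renaming alternative mentioned above can be done with add_suffix before joining. This is a minimal sketch with made-up two-row stand-ins, not the full frames from the question:

```python
import numpy as np
import pandas as pd

# Made-up miniature stand-ins for the "after" and "during" snapshots
after = pd.DataFrame({'Laps': ['1', '2'],
                      'Zeit': ['1:30,000', '1:31,000']}).set_index('Laps')
during = pd.DataFrame({'Laps': ['2'], 'Zeit': [np.nan]}).set_index('Laps')

# Suffix the column names instead of building a column MultiIndex
combined = after.add_suffix('_after').join(during.add_suffix('_during'))
```

The join aligns on the Laps index, so laps that have no in-progress snapshot simply get NaN in the `_during` columns.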
I have an Excel sheet like this.
If I search using the method below, I get only one row.
df4 = df.loc[(df['NAME '] == 'HIR')]
df4
But I want to get all rows connected with this name (and likewise for birthdate and place).
Expected output:
How can I achieve this? How can I bind these rows together?
You need to forward fill the data with ffill():
df = df.replace('', np.nan) # in case you don't have null values, but you have empty strings
df['NAME '] = df['NAME '].ffill()
df4 = df.loc[(df['NAME '] == 'HIR')]
df4
That will then bring up all of the rows when you use loc. You can do this on other columns as well.
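The same pattern extends to several columns at once. A small sketch with made-up data (the trailing space in 'NAME ' is kept on purpose, since the question's column name has one):

```python
import numpy as np
import pandas as pd

# Made-up frame mirroring the screenshot's sparse columns
df = pd.DataFrame({'NAME ': ['HIR', np.nan, 'BH'],
                   'PLACE': ['USA', np.nan, 'INDIA'],
                   'HOBBY': ['DANCING', 'MUSIC', 'GAMES']})

# Forward-fill both sparse columns in one step
df[['NAME ', 'PLACE']] = df[['NAME ', 'PLACE']].ffill()
df4 = df.loc[df['NAME '] == 'HIR']
```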
First you need to remove the blank rows in your Excel sheet, then fill the values from the previous row:
import pandas as pd
df = pd.read_excel('so.xlsx')
df = df[~df['HOBBY'].isna()]
df[['SNO','NAME']] = df[['SNO','NAME']].ffill()
df
SNO NAME HOBBY COURSE BIRTHDATE PLACE
0 1.0 HIR DANCING BTECH 1990.0 USA
1 1.0 HIR MUSIC MTECH NaN NaN
2 1.0 HIR TRAVELLING AI NaN NaN
4 2.0 BH GAMES BTECH 1992.0 INDIA
5 2.0 BH BOOKS AI NaN NaN
6 2.0 BH SWIMMING NaN NaN NaN
7 2.0 BH MUSIC NaN NaN NaN
8 2.0 BH DANCING NaN NaN NaN
I have a large data frame of schedules, and I need to count the number of experiments run. The challenge is that usage for an experiment is repeated across rows (which is OK), but duplicated in some, but not all, columns. I want to remove the second entry if it is a duplicate, but I can't delete the entire second column because it will also contain some new values. How can I compare individual entries for two columns side by side and delete the second if it is a duplicate?
The duration of an experiment is a maximum of two days, so three days in a row is a new event with the same name starting on the third day.
The actual text of the experiment names is complicated and the data frame is 120 columns wide, so typing this in as a list or dictionary isn't possible. I'm hoping for a Python or NumPy function, but could use a loop.
Here are pictures of an example starting data frame and the desired output:
starting data frame example
de-duplicated data frame example
This is a hack, similar to @Params' answer, but faster because you aren't calling .iloc repeatedly. The basic idea is to transpose the data frame and repeat an operation as many times as needed to compare all of the columns, then transpose it back to get the result in the OP's format.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Monday': ['exp_A', 'exp_A', 'exp_A', 'exp_A', 'exp_B', np.nan, np.nan, np.nan, 'exp_D', 'exp_D'],
    'Tuesday': ['exp_A', 'exp_A', 'exp_A', 'exp_A', 'exp_B', 'exp_B', 'exp_C', 'exp_C', 'exp_D', 'exp_D'],
    'Wednesday': ['exp_A', 'exp_D', np.nan, np.nan, np.nan, 'exp_B', 'exp_C', 'exp_C', 'exp_C', np.nan],
    'Thursday': ['exp_A', 'exp_D', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'exp_C', np.nan]
})
df = df.T
for i in range(int(np.ceil(df.shape[0] / 2))):
    df[(df == df.shift(1)) & (df != df.shift(2))] = np.nan
df = df.T
Monday Tuesday Wednesday Thursday
0 exp_A NaN exp_A NaN
1 exp_A NaN exp_D NaN
2 exp_A NaN NaN NaN
3 exp_A NaN NaN NaN
4 exp_B NaN NaN NaN
5 NaN exp_B NaN NaN
6 NaN exp_C NaN NaN
7 NaN exp_C NaN NaN
8 exp_D NaN exp_C NaN
9 exp_D NaN NaN NaN
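Since the stated goal is counting experiments, each surviving cell in the de-duplicated frame marks one experiment start, so the tally can be a simple stack and value_counts. A sketch on a made-up miniature frame:

```python
import numpy as np
import pandas as pd

# Made-up de-duplicated schedule: each non-NaN cell is one experiment start
dedup = pd.DataFrame({'Monday': ['exp_A', np.nan],
                      'Tuesday': [np.nan, 'exp_B'],
                      'Wednesday': ['exp_A', np.nan]})

# stack() drops the NaNs, value_counts() tallies per experiment name
counts = dedup.stack().value_counts()
```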
I have a dataframe called 'Adj_Close' which looks like this:
AAPL TSLA GOOG
0 3.478462 NaN NaN
1 3.185191 NaN NaN
2 3.231803 NaN NaN
3 2.952128 NaN NaN
4 3.091966 NaN NaN
... ... ... ...
5005 261.779999 333.040009 1295.339966
5006 266.369995 336.339996 1306.689941
5007 264.290009 328.920013 1313.550049
5008 267.839996 331.290009 1312.989990
5009 267.250000 329.940002 1304.959961
I want to save each column ('AAPL', 'TSLA' and 'GOOG') as a new DataFrame.
The code should look something like this:
i = 0
n = 3
while i < n:
    df_{i} = Adj_Close.iloc[:, i]
    i += 1
Unfortunately, this is the wrong syntax. I hope someone can help me...
The natural way to do this in Python is to create a list of DataFrames, as in:
dataframes = []
for col in df.columns:
    new_df = pd.DataFrame(df[col])
    dataframes.append(new_df)
The result is a list (dataframes) that contains three separate DataFrames: one for Google, one for Tesla, and one for Apple.
[One can also define new variables using
globals()[my_var_name] = <some_value>
but I don't believe that's what you're looking for.]
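If the pieces need to be looked up by ticker later, a dict keyed by column name may be handier than a list. A sketch with made-up prices standing in for the Adj_Close frame:

```python
import pandas as pd

# Made-up stand-in for the Adj_Close frame
adj_close = pd.DataFrame({'AAPL': [3.47, 3.18, 3.23],
                          'TSLA': [333.04, 336.34, 328.92]})

# One single-column DataFrame per ticker, keyed by the column name
frames = {col: adj_close[[col]].copy() for col in adj_close.columns}
```

Afterwards `frames['AAPL']` is the single-column AAPL frame, with no need to remember its position in a list.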
I have a pandas DataFrame containing EOD financial data (OHLC) for analysis.
I'm using the https://github.com/cirla/tulipy library to generate technical indicator values that take a time period as an option. For example, ADX with timeperiod=5 shows the ADX over the last 5 days.
Because of this time period, the generated array of indicator values is always shorter than the DataFrame, since the prices of the first 5 days are needed to generate the ADX for day 6.
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=14)
df['mdi_14'] = mdi14
df['pdi_14'] = pdi14
>> ValueError: Length of values does not match length of index
Unfortunately, unlike TA-Lib for example, the tulip library does not provide NaN values for those first few empty days...
Is there an easy way to prepend these NaNs to the ndarray?
Or to insert it into df at a certain index and have NaN created for the rows before it automatically?
Thanks in advance, I've been researching for days!
Maybe make the shift yourself in the code?
period = 14
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=period
)
df['mdi_14'] = np.nan
df.loc[df.index[period - 1:], 'mdi_14'] = mdi14
I hope they will fill the first values with NaN in the library in the future. It's dangerous to leave time-series data like this without any labels.
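The padding can also be prepended to the ndarray itself, which answers the "prepend these NaN" part of the question directly. A sketch that assumes the indicator output is exactly period - 1 values shorter than the frame (stand-in data, not real tulipy output):

```python
import numpy as np
import pandas as pd

period = 14
df = pd.DataFrame({'close': np.arange(30.0)})
mdi14 = np.arange(30 - (period - 1), dtype=float)  # stand-in for ti.di() output

# Prepend period-1 NaNs so the array lines up with the frame's index
df['mdi_14'] = np.concatenate([np.full(period - 1, np.nan), mdi14])
```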
Here is a full MCVE, prepending rows of NaN to align with the original index:
df = pd.DataFrame(1, range(10), list('ABC'))
a = np.full((len(df) - 6, df.shape[1]), 2)
b = np.full((6, df.shape[1]), np.nan)
c = np.vstack([b, a])
d = pd.DataFrame(c, df.index, df.columns)
d
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 2.0 2.0 2.0
7 2.0 2.0 2.0
8 2.0 2.0 2.0
9 2.0 2.0 2.0
The C version of the tulip library includes a start function for each indicator (reference: https://tulipindicators.org/usage) that can be used to determine the output length of an indicator given a set of input options. Unfortunately, it does not appear that the python bindings library, tulipy, includes this functionality. Instead you have to resort to dynamically reassigning your index values to align the output with the original DataFrame.
Here is an example that uses the price series from the tulipy docs:
# Create the DataFrame with close prices (as a list, not a set, so the order is preserved)
prices = pd.DataFrame(data=[81.06, 81.59, 82.87, 83, 83.61, 83.15, 82.84, 83.99, 84.55,
                            84.36, 85.53, 86.54, 86.89, 87.77, 87.29], columns=['close'])
# Compute the technical indicator using tulipy and save the result in a DataFrame
bbands = pd.DataFrame(data=np.transpose(ti.bbands(real=prices['close'].to_numpy(), period=5, stddev=2)))
# Dynamically realign the index; note from the tulip library documentation that the
# price/volume data is expected to be ordered "oldest to newest (index 0 is oldest)"
bbands.index += prices.index.max() - bbands.index.max()
# Put the indicator values alongside the original DataFrame
prices[['BBANDS_5_2_low', 'BBANDS_5_2_mid', 'BBANDS_5_2_up']] = bbands
prices.head(15)
close BBANDS_5_2_low BBANDS_5_2_mid BBANDS_5_2_up
0 81.06 NaN NaN NaN
1 81.59 NaN NaN NaN
2 82.87 NaN NaN NaN
3 83.00 NaN NaN NaN
4 83.61 80.530042 82.426 84.321958
5 83.15 81.494061 82.844 84.193939
6 82.84 82.533343 83.094 83.654657
7 83.99 82.471983 83.318 84.164017
8 84.55 82.417750 83.628 84.838250
9 84.36 82.435203 83.778 85.120797
10 85.53 82.511331 84.254 85.996669
11 86.54 83.142618 84.994 86.845382
12 86.89 83.536488 85.574 87.611512
13 87.77 83.870324 86.218 88.565676
14 87.29 85.288871 86.804 88.319129
Is there a way to go from this...
bloomberg morningstar yahoo
0 AAPL1 AAPL2 NaN
1 AAPL1 NaN AAPL3
2 NaN GOOG4 GOOG5
3 GOOG6 GOOG4 NaN
4 IBM7 NaN IBM8
5 NaN IBM9 IBM8
6 NaN NaN FB
... to this ...
bloomberg morningstar yahoo
0 AAPL1 AAPL2 AAPL3
1 GOOG6 GOOG4 GOOG5
2 IBM7 IBM9 IBM8
3 NaN NaN FB
... in Pandas?
I've munged my data enough to ensure that there will never be any "conflicting" information in a given column of the starting dataframe, e.g. the following is not possible...
A column Another column
0 AAPL1 One thing
1 AAPL1 Another thing
The only thing that can happen is that any given column either has 1) no information or 2) the right information, e.g.
A column Another column
0 AAPL1 NaN
1 AAPL1 The right information
All I want to do is fill the NaN's with the "right" information where available and then drop duplicates (which should be easy).
But some NaNs should remain, as I don't have enough data to infer their value, e.g. the FB row in the example.
Anybody have a good answer? Thanks for the help!
Here is some code to load the starting dataframe if you'd like to play around:
import pandas as pd
data = [
{'bloomberg': 'AAPL1', 'morningstar': 'AAPL2'},
{'bloomberg': 'AAPL1', 'yahoo': 'AAPL3'},
{'morningstar': 'GOOG4', 'yahoo': 'GOOG5'},
{'bloomberg': 'GOOG6', 'morningstar': 'GOOG4'},
{'bloomberg': 'IBM7', 'yahoo': 'IBM8'},
{'morningstar': 'IBM9', 'yahoo': 'IBM8'},
{'yahoo': 'FB'}]
df = pd.DataFrame(data)
Chaining ffill and bfill along the columns will do what you want:
df.ffill(axis=1).bfill(axis=1).drop_duplicates()
bloomberg morningstar yahoo
0 AAPL AAPL AAPL
2 GOOG GOOG GOOG
4 IBM IBM IBM
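Note that on the question's actual data (where AAPL1 and AAPL2 differ) the row-filled rows are not literal duplicates, so drop_duplicates alone cannot collapse them. One way around that, assuming the company can be recovered by stripping the trailing digits from any available value, is to group on that derived key and take the first non-null value per column:

```python
import pandas as pd

data = [
    {'bloomberg': 'AAPL1', 'morningstar': 'AAPL2'},
    {'bloomberg': 'AAPL1', 'yahoo': 'AAPL3'},
    {'morningstar': 'GOOG4', 'yahoo': 'GOOG5'},
    {'bloomberg': 'GOOG6', 'morningstar': 'GOOG4'},
    {'bloomberg': 'IBM7', 'yahoo': 'IBM8'},
    {'morningstar': 'IBM9', 'yahoo': 'IBM8'},
    {'yahoo': 'FB'}]
df = pd.DataFrame(data)

# Assumed key: the first non-null value per row, with trailing digits stripped
key = df.bfill(axis=1).iloc[:, 0].str.replace(r'\d+$', '', regex=True)

# first() keeps the first non-null value per column within each group;
# sort=False preserves the order of first appearance
result = df.groupby(key, sort=False).first().reset_index(drop=True)
```

This reproduces the desired output, including the FB row whose bloomberg and morningstar entries stay NaN.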