Using reindex with duplicated axis - python

Let's say I have a dataframe with dates as index. Each row contains information about a certain event on that date. The problem is that there could be more than one event on said date.
This is an example DataFrame, df2:
     one  two
1/2  1.0  1.0
1/2  1.0  1.0
1/4  3.0  3.0
1/5  NaN  4.0
I want to add missing dates to the dataframe, and I used to be able to do it with .loc. Now .loc raises the following warning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
This is my code (it works but raises warning):
# I want to add any missing date- in this example, 1/3.
df2.loc[["1/2","1/3","1/4","1/5"]]
     one  two
1/2  1.0  1.0
1/2  1.0  1.0
1/3  NaN  NaN
1/4  3.0  3.0
1/5  NaN  4.0
I've tried using reindex as it suggests, but my index contains duplicated values so it doesn't work:
#This doesn't work
df2.reindex(["1/2","1/3","1/4","1/5"])
ValueError: cannot reindex from a duplicate axis
What can I do to replace the old loc?

One way is to join against an empty DataFrame built from the full list of labels:
df2.join(pd.DataFrame(index=["1/2","1/3","1/4","1/5"]), how='outer')
Out[193]:
     one  two
1/2  1.0  1.0
1/2  1.0  1.0
1/3  NaN  NaN
1/4  3.0  3.0
1/5  NaN  4.0
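An alternative, if you'd rather sidestep the join: append an all-NaN row for each label that is missing, then sort. A minimal sketch, where dates is assumed to be the full list of labels you want present:

import pandas as pd

dates = ["1/2", "1/3", "1/4", "1/5"]
# Only labels not already in the (possibly duplicated) index get a filler row.
missing = [d for d in dates if d not in df2.index]
filler = pd.DataFrame(index=missing, columns=df2.columns, dtype=float)
result = pd.concat([df2, filler]).sort_index()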

Related

How to create a dataframe from series object when iterating

I am iterating and as a result of a single iteration I acquire a pandas series object which looks like this:
DE_AT 118.55
DE_CZ 62.73
PL_DE 263.36
PL_SK 315.07
dtype: float64
Sometimes I might get different names and lengths for this series; for example, I might get:
DE_AT 118.55
DE_CZ 62.73
PL_DE 263.36
PL_NL 315.07
PL_UK 420
dtype: float64
Now I want to create a dataframe from these series objects while iterating, such that all the names end up in the index. From these two series objects I would like to get:
index 1 2
DE_AT 118.55 118.55
DE_CZ 62.73 62.73
PL_DE 263.36 263.36
PL_SK 315.07 NaN
PL_NL NaN 315.07
PL_UK NaN 420
Or maybe I can store them in a list and later create a dataframe?
Basic outer join of two series:
import pandas as pd

s1 = pd.Series(index=["DE_AT", "DE_CZ", "PL_DE", "PL_SK"], data=[1, 2, 3, 4]).to_frame()
s2 = pd.Series(index=["DE_AT", "DE_CZ", "PL_DE", "PL_NL", "PL_UK"], data=[1, 2, 3, 4, 5]).to_frame()
s1.join(s2, how="outer", lsuffix="1", rsuffix="2")
Output:
        01   02
DE_AT  1.0  1.0
DE_CZ  2.0  2.0
PL_DE  3.0  3.0
PL_NL  NaN  4.0
PL_SK  4.0  NaN
PL_UK  NaN  5.0
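To the follow-up question: yes, collecting the series in a list and concatenating once at the end is usually the cleanest route. A sketch, assuming a hypothetical get_series() that yields one series per iteration:

import pandas as pd

collected = []
for i in range(2):                # however many iterations you have
    s = get_series()              # hypothetical: produces one pd.Series
    s.name = i + 1                # becomes the column label
    collected.append(s)

# axis=1 outer-aligns all series on the union of their index labels,
# filling NaN where a label is absent from a given series.
df = pd.concat(collected, axis=1)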

Pandas Dataframe: grouping by index keeping only notnan value in each column

I have dataframes similar to the following ones:
,A,B
2020-01-15,1,
2020-01-15,,2
2020-01-16,3,
2020-01-16,,4
2020-01-17,5,
2020-01-17,,6
,A,B,C
2020-01-15,1,
2020-01-15,,2
2020-01-15,,,3
2020-01-16,4,
2020-01-16,,5
2020-01-16,,,6
2020-01-17,7,
2020-01-17,,8
2020-01-17,,,9
I need to transform them to the following:
,A,B
2020-01-15,1,2
2020-01-16,3,4
2020-01-17,5,6
,A,B,C
2020-01-15,1,2,3
2020-01-16,4,5,6
2020-01-17,7,8,9
I have tried with groupby().first() without success
Let us do groupby + first:
s = df.groupby(level=0).first()
              A    B
2020-01-15  1.0  2.0
2020-01-16  3.0  4.0
2020-01-17  5.0  6.0
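The same call covers the three-column frame as well. A self-contained sketch with the second sample, for reference:

import numpy as np
import pandas as pd

idx = ["2020-01-15"] * 3 + ["2020-01-16"] * 3 + ["2020-01-17"] * 3
df = pd.DataFrame({"A": [1, np.nan, np.nan, 4, np.nan, np.nan, 7, np.nan, np.nan],
                   "B": [np.nan, 2, np.nan, np.nan, 5, np.nan, np.nan, 8, np.nan],
                   "C": [np.nan, np.nan, 3, np.nan, np.nan, 6, np.nan, np.nan, 9]},
                  index=idx)

# first() takes the first non-NaN value per column within each index group,
# collapsing the staggered rows into one row per date.
print(df.groupby(level=0).first())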

Select rows containing a NaN following a specific value in Pandas

I am trying to create a new DataFrame consisting of the rows corresponding to the value 1.0 or NaN in the last column, whereby I only take the NaNs that appear under a 1.0 (that is, I'm interested in everything until a 0.0 appears).
Timestamp Value Mode
00-00-10 34567 1.0
00-00-20 45425
00-00-30 46773 0.0
00-00-40 64567
00-00-50 25665 1.0
00-00-60 25678
My attempt is:
for row in data.itertuples():
    while data[data.Mode != 0.0]:
        df2 = df2.append(row)
    else:
        # How do I differentiate between a NaN under a 1.0 and a NaN under a 0.0?
        print(df2)
The idea is to save every row until a 0.0 appears, and afterwards ignore every row until a 1.0 appears again.
You can use .ffill to figure out if it's a NaN below a 1 or a 0.
Here are the NaN values below a 1
df[df['Mode'].isnull() & (df['Mode'].ffill() == 1)]
# Timestamp Value Mode
#1 00-00-20 45425 NaN
#5 00-00-60 25678 NaN
To get all of the 1s and NaN below:
df[(df['Mode'].isnull() & (df['Mode'].ffill() == 1)) | (df.Mode == 1)]
# Timestamp Value Mode
#0 00-00-10 34567 1.0
#1 00-00-20 45425 NaN
#4 00-00-50 25665 1.0
#5 00-00-60 25678 NaN
Since Mode only ever holds 1 and 0, you can get away with slightly terser logic, though the NaNs in 'Mode' mean this might not always work (it does seem to work for the example above):
df[((df['Mode'].isnull()) & df['Mode'].ffill()) | df.Mode]
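For reference, a self-contained sketch of the corrected filter; the extra parentheses are required because & binds tighter than == in Python:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Timestamp": ["00-00-10", "00-00-20", "00-00-30",
                  "00-00-40", "00-00-50", "00-00-60"],
    "Value": [34567, 45425, 46773, 64567, 25665, 25678],
    "Mode": [1.0, np.nan, 0.0, np.nan, 1.0, np.nan],
})

# Keep rows that are 1.0, plus NaN rows whose last non-NaN predecessor was 1.0.
mask = (df["Mode"] == 1) | (df["Mode"].isnull() & (df["Mode"].ffill() == 1))
print(df[mask])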

Is it possible to write and read multiple DataFrames to/from one single file?

I'm currently dealing with a set of similar DataFrames having a double Header.
They have the following structure:
age height weight shoe_size
RHS height weight shoe_size
0 8.0 6.0 2.0 1.0
1 8.0 NaN 2.0 1.0
2 6.0 1.0 4.0 NaN
3 5.0 1.0 NaN 0.0
4 5.0 NaN 1.0 NaN
5 3.0 0.0 1.0 0.0
height weight shoe_size age
RHS weight shoe_size age
0 1.0 1.0 NaN NaN
1 1.0 2.0 0.0 2.0
2 1.0 NaN 0.0 5.0
3 1.0 2.0 0.0 NaN
4 0.0 1.0 0.0 3.0
Actually, the main differences are the ordering of the first header row, which could be made the same for all of them, and the position of the RHS column in the second header row. I'm currently wondering if there is an easy way of saving/reading all these DataFrames into/from a single CSV file, instead of having a separate CSV file for each of them.
Unfortunately, there isn't any reasonable way to store multiple dataframes in a single CSV such that retrieving each one would not be excessively cumbersome, but you can use pd.ExcelWriter and save to separate sheets in a single .xlsx file:
import pandas as pd

# The context manager saves and closes the file on exit
# (writer.save() was removed in pandas 2.0).
with pd.ExcelWriter('file.xlsx') as writer:
    for i, df in enumerate(df_list):
        df.to_excel(writer, sheet_name='sheet{}'.format(i))
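Reading them back is symmetric: pd.read_excel with sheet_name=None loads every sheet into a dict keyed by sheet name (add header=[0, 1] if your frames keep their double header). A sketch:

import pandas as pd

sheets = pd.read_excel('file.xlsx', sheet_name=None, index_col=0)
df0 = sheets['sheet0']   # one DataFrame per sheet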
Going back to your example (with random numbers instead of your values):
import pandas as pd
import numpy as np
h1 = [['age', 'height', 'weight', 'shoe_size'],['RHS','height','weight','shoe_size']]
df1 = pd.DataFrame(np.random.randn(3, 4), columns=h1)
h2 = [['height', 'weight', 'shoe_size','age'],['RHS','weight','shoe_size','age']]
df2 = pd.DataFrame(np.random.randn(3, 4), columns=h2)
First, reorder your columns (see "How to change the order of DataFrame columns?"):
df3 = df2[h1[0]]
Then, concatenate the two dataframes (see "Merge, join, and concatenate"):
df4 = pd.concat([df1,df3])
I don't know how you want to deal with the second row of your header (for now it is kept as two sub-column levels, which is not very elegant). If, from your point of view, that row is meaningless, just flatten the headers before concatenating:
df1.columns=h1[0]
df3.columns=h1[0]
df5 = pd.concat([df1,df3])
Finally, save it in CSV format (pandas.DataFrame.to_csv):
df4.to_csv('file_name.csv',sep=',')
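If you keep the two-level header, the round trip still works: read_csv accepts a list of header rows. A sketch, assuming the file written above:

# Rebuild the MultiIndex columns from the first two rows of the file.
df_back = pd.read_csv('file_name.csv', sep=',', header=[0, 1], index_col=0)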

Aligning 2 Python lists according to 2 other lists

I have two arrays, nlxTTL and ttlState. Both consist of a repeating pattern of 0s and 1s indicating an input voltage, which can be HIGH (1) or LOW (0), and both are recorded from the same source, which sends a TTL pulse (HIGH and LOW) with a 1-second pulse width.
But due to a logging mistake, some drops happen in the ttlState list, i.e. it doesn't log a strictly alternating sequence of 0s and 1s and ends up dropping values.
The good part is that I also log a timestamp for each TTL input received, for both lists. The inter-event timestamp differences clearly show when a pulse has been missed.
Here is an example of what data looks like:
nlxTTL, ttlState, nlxTime, ttlTime
0,0,1000,1000
1,1,2000,2000
0,1,3000,4000
1,1,4000,6000
0,0,5000,7000
1,1,6000,8000
0,0,7000,9000
1,1,8000,10000
As you can see, nlxTime and ttlTime clearly differ from each other. How can I use these timestamps to align all 4 lists?
When dealing with tabular data such as a CSV file, it's a good idea to use a library to make the process easier. I like the pandas dataframe library.
Now for your question, one way to think about this problem is that you really have two datasets... An nlx dataset and a ttl dataset. You want to join those datasets together by timestamp. Pandas makes tasks like this very easy.
import pandas as pd
from io import StringIO
data = """\
nlxTTL, ttlState, nlxTime, ttlTime
0,0,1000,1000
1,1,2000,2000
0,1,3000,4000
1,1,4000,6000
0,0,5000,7000
1,1,6000,8000
0,0,7000,9000
1,1,8000,10000
"""
# Load data into dataframe.
df = pd.read_csv(StringIO(data))
# Remove spaces from column names.
df.columns = [x.strip() for x in df.columns]
# Split the data into an nlx dataframe and a ttl dataframe.
nlx = df[['nlxTTL', 'nlxTime']].reset_index()
ttl = df[['ttlState', 'ttlTime']].reset_index()
# Merge the dataframes back together based on their timestamps.
# Use an outer join so missing data gets filled with NaNs instead
# of just dropping the rows.
merged_df = nlx.merge(ttl, left_on='nlxTime', right_on='ttlTime', how='outer')
# Get back to the original set of columns
merged_df = merged_df[df.columns]
# Print out the results.
print(merged_df)
This produces the following output.
nlxTTL ttlState nlxTime ttlTime
0 0.0 0.0 1000.0 1000.0
1 1.0 1.0 2000.0 2000.0
2 0.0 NaN 3000.0 NaN
3 1.0 1.0 4000.0 4000.0
4 0.0 NaN 5000.0 NaN
5 1.0 1.0 6000.0 6000.0
6 0.0 0.0 7000.0 7000.0
7 1.0 1.0 8000.0 8000.0
8 NaN 0.0 NaN 9000.0
9 NaN 1.0 NaN 10000.0
You'll notice that it fills in the dropped values with NaN values because we are doing an outer join. If this is undesirable, change the how='outer' parameter to how='inner' to perform an inner join. This will only keep records for which you have both an nlx and ttl response at that timestamp.
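If the two clocks drift so that corresponding timestamps are only approximately equal, an exact-key merge will miss them. pd.merge_asof does a tolerance-based join on sorted keys instead; a minimal sketch reusing the nlx and ttl frames from above (the tolerance of 500, in the same units as the timestamps, is an assumption to tune):

# Both frames must be sorted by their key column for merge_asof.
merged = pd.merge_asof(nlx.sort_values('nlxTime'),
                       ttl.sort_values('ttlTime'),
                       left_on='nlxTime', right_on='ttlTime',
                       direction='nearest', tolerance=500)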
