Pandas: Combine two DataFrames with the same structure [duplicate] - python

I want to perform a join/merge/append operation on a dataframe with datetime index.
Let's say I have df1 and I want to add df2 to it. df2 can have fewer or more columns, and overlapping indexes. For all rows where the indexes match, if df2 has the same column as df1, I want the values of df1 to be overwritten with those from df2.
How can I obtain the desired result?

How about: df2.combine_first(df1)?
In [33]: df2
Out[33]:
A B C D
2000-01-03 0.638998 1.277361 0.193649 0.345063
2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
2000-01-05 0.435507 -0.025162 -1.112890 0.324111
2000-01-06 -0.210756 -1.027164 0.036664 0.884715
2000-01-07 -0.821631 -0.700394 -0.706505 1.193341
2000-01-10 1.015447 -0.909930 0.027548 0.258471
2000-01-11 -0.497239 -0.979071 -0.461560 0.447598
In [34]: df1
Out[34]:
A B C
2000-01-03 2.288863 0.188175 -0.040928
2000-01-04 0.159107 -0.666861 -0.551628
2000-01-05 -0.356838 -0.231036 -1.211446
2000-01-06 -0.866475 1.113018 -0.001483
2000-01-07 0.303269 0.021034 0.471715
2000-01-10 1.149815 0.686696 -1.230991
2000-01-11 -1.296118 -0.172950 -0.603887
2000-01-12 -1.034574 -0.523238 0.626968
2000-01-13 -0.193280 1.857499 -0.046383
2000-01-14 -1.043492 -0.820525 0.868685
In [35]: df2.comb
df2.combine df2.combineAdd df2.combine_first df2.combineMult
In [35]: df2.combine_first(df1)
Out[35]:
A B C D
2000-01-03 0.638998 1.277361 0.193649 0.345063
2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
2000-01-05 0.435507 -0.025162 -1.112890 0.324111
2000-01-06 -0.210756 -1.027164 0.036664 0.884715
2000-01-07 -0.821631 -0.700394 -0.706505 1.193341
2000-01-10 1.015447 -0.909930 0.027548 0.258471
2000-01-11 -0.497239 -0.979071 -0.461560 0.447598
2000-01-12 -1.034574 -0.523238 0.626968 NaN
2000-01-13 -0.193280 1.857499 -0.046383 NaN
2000-01-14 -1.043492 -0.820525 0.868685 NaN
Note that it takes the values from df1 for indices that do not overlap with df2. If this doesn't do exactly what you want, I would be willing to improve this function / add options to it.
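For reference, here is a minimal self-contained sketch of the same idea (the frames, dates, and values below are made up, not the ones from the session above):

import numpy as np
import pandas as pd

# df1 has columns A-C; df2 overlaps it partially and has an extra column D.
idx1 = pd.date_range('2000-01-03', periods=6, freq='B')
idx2 = pd.date_range('2000-01-05', periods=6, freq='B')
df1 = pd.DataFrame(np.random.randn(6, 3), index=idx1, columns=list('ABC'))
df2 = pd.DataFrame(np.random.randn(6, 4), index=idx2, columns=list('ABCD'))

# Wherever both frames have a value, df2 wins; df1 fills in everything else.
combined = df2.combine_first(df1)
print(combined)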

For a merge like this, the update method of a DataFrame is useful.
Taking the examples from the documentation:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, 2.1, np.nan],
                    [np.nan, 7., np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4]],
                   index=[1, 2])
Data before the update:
>>> df1
0 1 2
0 NaN 3.0 5.0
1 -4.6 2.1 NaN
2 NaN 7.0 NaN
>>>
>>> df2
0 1 2
1 -42.6 NaN -8.2
2 -5.0 1.6 4.0
Let's update df1 with data from df2:
df1.update(df2)
Data after the update:
>>> df1
0 1 2
0 NaN 3.0 5.0
1 -42.6 2.1 -8.2
2 -5.0 1.6 4.0
Remarks:
It's important to notice that this is an in-place operation, modifying the DataFrame that calls update.
Also note that non-NaN values in df1 are not overwritten by NaN values in df2.
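Since update works in place, a common pattern (a small sketch, not part of the original answer) is to update a copy when the original df1 should be preserved:

merged = df1.copy()
merged.update(df2)  # NaN values in df2 still do not overwrite existing values in merged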

Assign new value to a cell in pd.DataFrame which is a pd.Series when series index isn't unique

Here is my data if anyone wants to try to reproduce the problem:
https://github.com/LunaPrau/personal/blob/main/O_paired.csv
I have a pd.DataFrame (called O) of 1402 rows × 1402 columns, with columns and index both of the form ['XXX-icsd', 'YYY-icsd', ...] and cell values that are mostly np.float64, some np.nan and, problematically, some pandas.core.series.Series.
             202324-icsd  644068-icsd  27121-icsd  93847-icsd  154319-icsd
202324-icsd     0.000000     0.029729         NaN    0.098480     0.097867
644068-icsd          NaN     0.000000         NaN    0.091311     0.091049
27121-icsd      0.144897     0.137473         0.0    0.081610     0.080442
93847-icsd           NaN          NaN         NaN    0.000000     0.005083
154319-icsd          NaN          NaN         NaN         NaN     0.000000
The problem is that some cells (e.g. O.loc["192693-icsd", "192401-icsd"]) contain a pandas.core.series.Series of form:
192693-icsd 0.129562
192693-icsd 0.129562
Name: 192401-icsd, dtype: float64
I'm struggling to make this cell contain only a np.float64.
I tried:
O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0]
and various other known forms of assigning a new value to a cell in a pd.DataFrame, but they only assign the new value to every element of the Series in this cell, e.g. if I do
O.loc["192693-icsd", "192401-icsd"] = 5
then when calling O.loc["192693-icsd", "192401-icsd"] I get:
192693-icsd 5.0
192693-icsd 5.0
Name: 192401-icsd, dtype: float64
How to modify O.loc["192693-icsd", "192401-icsd"] so that it is of type np.float64?
It's not that df.loc["192693-icsd", "192401-icsd"] contains a Series; your index just isn't unique. This is especially obvious looking at these outputs:
>>> df.loc["192693-icsd"]
202324-icsd 644068-icsd 27121-icsd 93847-icsd 154319-icsd 28918-icsd 28917-icsd ... 108768-icsd 194195-icsd 174188-icsd 159632-icsd 89111-icsd 23308-icsd 253341-icsd
192693-icsd NaN NaN NaN NaN 0.146843 NaN NaN ... NaN 0.271191 NaN NaN NaN NaN 0.253996
192693-icsd NaN NaN NaN NaN 0.146843 NaN NaN ... NaN 0.271191 NaN NaN NaN NaN 0.253996
[2 rows x 1402 columns]
# And the fact that this returns the same:
>>> df.at["192693-icsd", "192401-icsd"]
192693-icsd 0.129562
192693-icsd 0.129562
Name: 192401-icsd, dtype: float64
You can fix this with a groupby, but you have to decide what to do with the non-unique groups. It looks like they're the same, so we'll combine them with max:
df = df.groupby(level=0).max()
Now it'll work as expected:
>>> df.loc["192693-icsd", "192401-icsd"]
0.129562120551387
Your non-unique index values are:
>>> df.index[df.index.duplicated()]
Index(['193303-icsd', '192693-icsd', '416602-icsd'], dtype='object')
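For completeness, a tiny self-contained reproduction of the diagnosis above (the labels here are made up):

import pandas as pd

# With a duplicated index label, .loc/.at return a Series even for a single column.
df = pd.DataFrame({"x": [1.0, 1.0]}, index=["a-icsd", "a-icsd"])
print(df.loc["a-icsd", "x"])   # a 2-element Series, not a scalar

# Collapsing the duplicate labels (here with max, as above) restores scalar access.
df = df.groupby(level=0).max()
print(df.loc["a-icsd", "x"])   # 1.0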
IIUC, you can try DataFrame.applymap to check each cell's type and take the first value if it is a Series:
df = df.applymap(lambda x: x.iloc[0] if isinstance(x, pd.Series) else x)
This effectively does what your O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0] attempt was trying to do, for every such cell at once.
Check this colab link: https://colab.research.google.com/drive/1XFXuj4OBu8GXQx6DTqv04XellmFcFWbC?usp=sharing

Pandas: Apply function that references other rows & another DataFrame

I am trying to build a function to use in a df.apply() that references 1) other rows, and 2) another DatetimeIndex.
dt_index = DatetimeIndex(['2022-09-16', '2022-12-16', '2023-03-10', '2023-06-16',
                          '2023-09-15', '2023-12-15', '2024-03-15', '2024-06-14'],
                         dtype='datetime64[ns]', freq=None)
Regarding the main df:
df.index = DatetimeIndex(['2022-08-30', '2022-08-31', '2022-09-01', '2022-09-02',
                          '2022-09-03', '2022-09-04', '2022-09-05', '2022-09-06',
                          '2022-09-07', '2022-09-08',
                          ...
                          '2024-08-20', '2024-08-21', '2024-08-22', '2024-08-23',
                          '2024-08-24', '2024-08-25', '2024-08-26', '2024-08-27',
                          '2024-08-28', '2024-08-29'],
                         dtype='datetime64[ns]', name='index', length=731, freq=None)
df:
                3M      1Y      2Y
2022-08-30 1.00 1.00 1.00 1.000000
2022-08-31 2.50 2.50 2.50 2.500000
2022-09-01 3.50 3.50 3.50 3.500000
2022-09-02 5.50 5.50 5.50 5.833333
2022-09-03 5.65 5.65 5.65 5.983333
... ... ... ... ...
2024-08-25 630.75 615.75 599.75 607.750000
2024-08-26 631.75 616.75 600.75 608.750000
2024-08-27 632.75 617.75 601.75 609.750000
2024-08-28 633.75 618.75 602.75 610.750000
2024-08-29 634.75 619.75 603.75 611.750000
My goal is to use a function that:
For each index value x in df, find the closest two values in dt_index (I have this below)
Then, in df, return: (x - id_low) / (id_high - id_low)
def transform(x, dt_index):
    id_low = dt_index.iloc[dt_index.get_loc(x, method='ffill')]
    id_high = dt_index.iloc[dt_index.get_loc(x, method='bfill')]
It's part 2 that I don't know how to write, as it references other rows in df outside of the one the function is being applied to.
Any help appreciated!
After fixing the inaccuracies in your code, you can simply reference your dataframe df inside the function:
def transform(x, dt_index):
    id_low = dt_index[dt_index.get_indexer([x.name], method='ffill')][0]
    id_high = dt_index[dt_index.get_indexer([x.name], method='bfill')][0]
    return (x - df.loc[id_low]) / (df.loc[id_high] - df.loc[id_low])
df.transform(transform, dt_index=dt_index, axis=1)
Example:
df = pd.DataFrame(np.arange(24).reshape(6, 4))
dt_index = pd.Index([0,2,5])
# Result:
0 1 2 3
0 NaN NaN NaN NaN
1 0.500000 0.500000 0.500000 0.500000
2 NaN NaN NaN NaN
3 0.333333 0.333333 0.333333 0.333333
4 0.666667 0.666667 0.666667 0.666667
5 NaN NaN NaN NaN
Note: the NaN rows are due to the mathematically undefined result of 0/0, which occurs when id_low == id_high == x.name.
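If you would rather have 0 than NaN for dates that fall exactly on a dt_index entry, one option (a sketch under that assumption, not part of the answer above) is to short-circuit that case:

def transform(x, dt_index):
    id_low = dt_index[dt_index.get_indexer([x.name], method='ffill')][0]
    id_high = dt_index[dt_index.get_indexer([x.name], method='bfill')][0]
    if id_low == id_high:        # x.name is itself in dt_index, so the ratio would be 0/0
        return x * 0.0           # one possible convention: define the result as 0
    return (x - df.loc[id_low]) / (df.loc[id_high] - df.loc[id_low])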

resample Pandas dataframe and merge strings in column

I want to resample a pandas dataframe and apply different functions to different columns. The problem is that I cannot properly process a column with strings. I would like to apply a function that joins the strings with a delimiter such as " - ". This is a data example:
import pandas as pd
import numpy as np
idx = pd.date_range('2017-01-31', '2017-02-03')
data = [[1, 10, "ok"], [2, 20, "merge"], [3, 30, "us"]]
dates = pd.DatetimeIndex(['2017-01-31', '2017-02-03', '2017-02-03'])
d = pd.DataFrame(data, index=dates, columns=list('ABC'))
A B C
2017-01-31 1 10 ok
2017-02-03 2 20 merge
2017-02-03 3 30 us
Resampling the numeric columns A and B with a sum and mean aggregator works. Column C, however, only sort of works with sum (it ends up in the second position, which might mean that something is failing).
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': sum})
A C B
2017-01-31 1.0 a 10.0
2017-02-01 NaN 0 NaN
2017-02-02 NaN 0 NaN
2017-02-03 5.0 merge us 25.0
I would like to get this:
...
2017-02-03 5.0 merge - us 25.0
I tried using lambda in different ways but without success (not shown).
If I may ask a second related question: I can do some post-processing for this, but how do I fill missing cells in different columns with zeros or ""?
Your agg function for column 'C' should be a join:
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': ' - '.join})
A B C
2017-01-31 1.0 10.0 ok
2017-02-01 NaN NaN
2017-02-02 NaN NaN
2017-02-03 5.0 25.0 merge - us
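As for the second question, one way (a sketch, assuming the aggregated frame from above) is a per-column fillna, so the numeric columns get 0 and the string column gets "":

r = d.resample('D').agg({'A': sum, 'B': np.mean, 'C': ' - '.join})
r = r.fillna({'A': 0, 'B': 0, 'C': ''})  # zeros for the numeric columns, "" for the strings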

pandas read_csv skiprows not working

I am trying to skip some rows that have incorrect values in them.
Here is the data when I read it in from a file without using the skiprows argument.
>> df
MstrRecNbrTxt UnitIDNmb PersonIDNmb PersonTypeCde
2194593 P NaN NaN NaN
2194594 300146901 1.0 1.0 1.0
4100689 DAT NaN NaN NaN
4100690 300170330 1.0 1.0 1.0
5732515 DA NaN NaN NaN
5732516 300174170 2.0 1.0 1.0
I want to skip rows 2194593, 4100689, and 5732515. I would expect to not see those rows in the table that I have read in.
>> df = pd.read_csv(file, sep='|', low_memory=False,
                    usecols=cols_to_use,
                    skiprows=[2194593, 4100689, 5732515])
Yet when I print it again, those rows are still there.
>> df
MstrRecNbrTxt UnitIDNmb PersonIDNmb PersonTypeCde
2194593 P NaN NaN NaN
2194594 300146901 1.0 1.0 1.0
4100689 DAT NaN NaN NaN
4100690 300170330 1.0 1.0 1.0
5732515 DA NaN NaN NaN
5732516 300174170 2.0 1.0 1.0
Here is the data:
{'PersonIDNmb': {2194593: nan,
2194594: 1.0,
4100689: nan,
4100690: 1.0,
5732515: nan,
5732516: 1.0},
'PersonTypeCde': {2194593: nan,
2194594: 1.0,
4100689: nan,
4100690: 1.0,
5732515: nan,
5732516: 1.0},
'UnitIDNmb': {2194593: nan,
2194594: 1.0,
4100689: nan,
4100690: 1.0,
5732515: nan,
5732516: 2.0},
'\ufeffMstrRecNbrTxt': {2194593: 'P',
2194594: '300146901',
4100689: 'DAT',
4100690: '300170330',
5732515: 'DA',
5732516: '300174170'}}
What am I doing wrong?
My end goal is to get rid of the NaN values in my dataframe so that the data can be read in as integers and not as floats (because it makes it difficult to join this table to other non-float tables).
Working example... hope this helps!
from io import StringIO
import pandas as pd
import numpy as np
txt = """index,col1,col2
0,a,b
1,c,d
2,e,f
3,g,h
4,i,j
5,k,l
6,m,n
7,o,p
8,q,r
9,s,t
10,u,v
11,w,x
12,y,z"""
indices_to_skip = np.array([2, 6, 11])
# I offset `indices_to_skip` by one in order to account for header
df = pd.read_csv(StringIO(txt), index_col=0, skiprows=indices_to_skip + 1)
print(df)
col1 col2
index
0 a b
1 c d
3 g h
4 i j
5 k l
7 o p
8 q r
9 s t
10 u v
12 y z
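Given the stated end goal (getting rid of the NaN rows so the ID columns can be integers), an alternative (a hedged sketch, assuming the frame has already been read as in the question) is to drop and cast after reading instead of skipping rows:

cols = ['UnitIDNmb', 'PersonIDNmb', 'PersonTypeCde']
df = df.dropna(subset=cols)      # removes the 'P'/'DAT'/'DA' rows, which carry the NaNs
df[cols] = df[cols].astype(int)  # now safe to cast the float columns to int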

Using .ix loses headers

How come when I use:
dfa=df1["10_day_mean"].ix["2015":"2015"]
The dataframe dfa has no header?
dfa:
Date
2015-01-10 2.000000
2015-01-20 3.000000
df1:
10day_mean Attenuation Channel1 Channel2 Channel3 Channel4 \
Date
2004-02-27 3.025 2.8640 NaN NaN NaN NaN
Is there a way to change the header of dfa? When I plot it, my legend shows 10_day_mean and I wish to relabel it as "Daily mean of every 10 days".
Thanks guys
I tried
dfa=dfa.rename(columns={0:"rename"})
and
dfa=dfa.rename(columns={"10day_mean":"rename"})
But then it says None.
Your confusion here is that when you do this:
dfa = df1["10_day_mean"].ix["2015":"2015"]
this returns a Series, which has no columns, so the output doesn't show a column name above the values; it is shown at the bottom of the summary info as the Series' name.
To get the output you desired, you can use double subscripting to force a DataFrame with a single column to be returned:
dfa=df1[["10_day_mean"]].ix["2015":"2015"]
Example:
In [90]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[90]:
a b c
0 -1.002036 -1.703049 2.123096
1 0.497920 1.556211 -1.807895
2 0.400020 -0.703138 1.452735
3 -0.296604 -0.227155 -0.311047
4 -0.314948 -0.654925 -0.434458
In [91]:
df['a'].iloc[2:4]
Out[91]:
2 0.400020
3 -0.296604
Name: a, dtype: float64
In [92]:
df[['a']].iloc[2:4]
Out[92]:
a
2 0.400020
3 -0.296604
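And to get the legend label the asker wants, a small sketch (using .loc, since .ix has since been removed from pandas, and assuming the column really is named "10_day_mean" as in the first snippet):

dfa = df1[["10_day_mean"]].loc["2015":"2015"]
dfa = dfa.rename(columns={"10_day_mean": "Daily mean of every 10 days"})
dfa.plot()  # the legend now shows the renamed column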
