map DataFrame index and forward fill nan values - python

I have a DataFrame with an integer index that is missing some values (i.e. it is not equally spaced). I want to create a new DataFrame with equally spaced index values and forward-filled column values. Below is a simple example.
What I have:
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
0
0 A
2 B
4 C
I want to use the above to create:
0
0 A
1 A
2 B
3 B
4 C

Use reindex with method='ffill' (numpy is needed for arange):
import numpy as np
df = df.reindex(np.arange(0, df.index.max() + 1), method='ffill')
Or:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print (df)
0
0 A
1 A
2 B
3 B
4 C

Using reindex and ffill:
df = df.reindex(range(df.index[0],df.index[-1]+1)).ffill()
print(df)
0
0 A
1 A
2 B
3 B
4 C

You can do this:
In [319]: df.reindex(list(range(df.index.min(),df.index.max()+1))).ffill()
Out[319]:
0
0 A
1 A
2 B
3 B
4 C
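All three answers amount to the same thing; the only difference is whether the fill happens during the reindex (method='ffill') or afterwards (.ffill()). A minimal end-to-end sketch using the example data above:
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
# Fill while reindexing ...
a = df.reindex(range(df.index.min(), df.index.max() + 1), method='ffill')
# ... or reindex first (new labels get NaN), then forward fill.
b = df.reindex(range(df.index.min(), df.index.max() + 1)).ffill()
print(a.equals(b))  # True for this example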


Copy data from a row to another row in Pandas dataframe based on condition

Let's say I have a (pandas) dataframe like this:
Index A ID B C
1 a 1 0 0
2 b 2 0 0
3 c 2 a a
4 d 3 0 0
I want to copy the data of the third row to the second row, because their IDs match but the second row's data is not filled. However, I want to leave column 'A' intact. I am looking for a result like this:
Index A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
What would you suggest as solution?
You can try replacing '0' with NaN, then ffill() + bfill() per ID group using groupby() + apply():
df[['B','C']]=df[['B','C']].replace('0',float('NaN'))
df[['B','C']]=df.groupby('ID')[['B','C']].apply(lambda x:x.ffill().bfill()).fillna('0')
output of df:
Index A ID B C
0 1 a 1 0 0
1 2 b 2 a a
2 3 c 2 a a
3 4 d 3 0 0
Note: you can also use the transform() method in place of the apply() method.
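For example, the apply line above could be written with transform instead (a sketch of the same idea on the same df):
df[['B','C']] = df.groupby('ID')[['B','C']].transform(lambda x: x.ffill().bfill()).fillna('0')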
You can use combine_first:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
ID A B C Index
0 1 a 0 0 1.0
1 2 b a a 2.0
2 2 c a a 3.0
3 3 d 0 0 4.0
import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]
y = df[columns_to_replace].loc[index_to_use_to_replace]
df.loc[index_to_be_replaced] = pd.concat([x, y])
print(df)
Does this solve your problem? I would also look at other pandas functions, such as join and merge (a rough merge sketch follows the output below).
❯ python3 b.py
A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
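As a rough sketch of the merge idea (reusing the example DataFrame built above, where a plain 0 marks the unfilled cells; this is an illustration, not part of the original answer):
# Keep only the rows that already carry real B/C values, keyed by ID.
filled = df.loc[df['B'].ne(0) & df['C'].ne(0), ['ID', 'B', 'C']]
# Left-merge them back so every row with a matching ID receives the values.
# Note that merge resets the index.
result = df[['A', 'ID']].merge(filled, on='ID', how='left').fillna(0)
print(result)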

Convert dataframe to pivot table with booleans (0, 1) with Pandas [duplicate]

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line:
pd.get_dummies(data=df, columns=['A', 'B'])
The columns argument specifies where to do the one-hot encoding.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series; see below for the workaround):
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
...: 'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column separately and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
A B
a b c x y z
0 1 0 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 1 0 1 0 0
4 0 0 1 1 0 0
5 1 0 0 0 1 0
6 0 1 0 0 1 0
7 0 0 1 0 0 1
If you don't want the multi-index column, then remove the keys=.. from the concat function call.
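That is, a flat set of dummy columns:
pd.concat([pd.get_dummies(df[col]) for col in df], axis=1)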
Somebody may have something more clever, but here are two approaches. Assuming you have a dataframe named df with columns 'Name' and 'Year' you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
...: dummies = pd.get_dummies(df[column])
...: df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: import patsy
In [95]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for loop. First, separate the categorical data from the DataFrame using select_dtypes(include="object"); then apply get_dummies to each column iteratively in a for loop, as shown in the code below:
train_cate = train_data.select_dtypes(include="object")
test_cate = test_data.select_dtypes(include="object")
# vectorize categorical data
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2

Assign each unique value of a column to the whole DataFrame, as if the data frame duplicated itself based on the values of another column

I am trying to iterate over the values of a column in df2 and assign each value to df1, as if df1 multiplied itself based on the values of that column from df2.
Let's say I have df1 as per below:
df1
1
2
3
and df2 as per below:
df2
A
B
C
I want a third dataframe, df3, to end up like below:
df3
1 A
2 A
3 A
1 B
2 B
3 B
1 C
2 C
3 C
For now I have tried the code below:
for i, value in ACS_shock['scenario'].iteritems():
    df1['sec'] = df1[i] = value[:]
But when I generate the file from df1, my output looks like this:
1 A B C
2 A B C
3 A B C
Any idea how I can correct this code? Much appreciated.
You can use pd.concat and np.repeat:
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.Series([1,2,3])
>>> df1
0 1
1 2
2 3
dtype: int64
>>> df2 = pd.Series(list('ABC'))
>>> df2
0 A
1 B
2 C
dtype: object
>>> df3 = pd.DataFrame({'df1': pd.concat([df1]*3).reset_index(drop=True),
...                     'df2': np.repeat(df2, 3).reset_index(drop=True)})
>>> df3
df1 df2
0 1 A
1 2 A
2 3 A
3 1 B
4 2 B
5 3 B
6 1 C
7 2 C
8 3 C
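If your pandas is 1.2 or newer, a cross merge is another way to get every pairing (a sketch, not part of the original answer; putting df2 on the left keeps the result grouped by its values):
df3 = pd.merge(df2.rename('df2'), df1.rename('df1'), how='cross')[['df1', 'df2']]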

Duplicating each row in a dataframe with counts

For each row in a dataframe, I wish to create duplicates of it with an additional column to identify each duplicate.
E.g. the original dataframe is
A | A
B | B
I wish to make a duplicate of each row, with an additional column to identify it, resulting in:
A | A | 1
A | A | 2
B | B | 1
B | B | 2
You can use df.reindex followed by a groupby on df.index.
df = df.reindex(df.index.repeat(2))
df['count'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Similarly, using reindex and assign with np.tile:
df = df.reindex(df.index.repeat(2))\
.assign(count=np.tile(df.index, 2) + 1)\
.reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Use Index.repeat with loc; for the count, use groupby with cumcount:
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
print (df)
a b
0 A A
1 B B
df = df.loc[df.index.repeat(2)]
df['new'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Or:
df = df.loc[df.index.repeat(2)]
df['new'] = np.tile(range(int(len(df.index)/2)), 2) + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Setup
Borrowed from #jezrael
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
a b
0 A A
1 B B
Solution 1
Create a pd.MultiIndex with pd.MultiIndex.from_product
Then use pd.DataFrame.reindex
idx = pd.MultiIndex.from_product(
    [df.index, [1, 2]],
    names=[df.index.name, 'New']
)
df.reindex(idx, level=0).reset_index('New')
New a b
0 1 A A
0 2 A A
1 1 B B
1 2 B B
Solution 2
This uses the same loc and reindex concept used by #cᴏʟᴅsᴘᴇᴇᴅ and #jezrael, but simplifies the final answer by using list and int multiplication rather than np.tile.
df.loc[df.index.repeat(2)].assign(New=[1, 2] * len(df))
a b New
0 A A 1
0 A A 2
1 B B 1
1 B B 2
Use pd.concat() to repeat, and then groupby with cumcount() to count:
In [24]: df = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['A', 'B']})
In [25]: df
Out[25]:
col1 col2
0 A A
1 B B
In [26]: df_repeat = pd.concat([df]*3).sort_index()
In [27]: df_repeat
Out[27]:
col1 col2
0 A A
0 A A
0 A A
1 B B
1 B B
1 B B
In [28]: df_repeat["count"] = df_repeat.groupby(level=0).cumcount() + 1
In [29]: df_repeat # df_repeat.reset_index(drop=True); if index reset required.
Out[29]:
col1 col2 count
0 A A 1
0 A A 2
0 A A 3
1 B B 1
1 B B 2
1 B B 3

Concat two dataframes using python

We have one dataframe like
-0.140447131 0.124802527 0.140780106
0.062166349 -0.121484447 -0.140675515
-0.002989106 0.13984927 0.004382326
and the other as
1
1
2
We need to concat both dataframes like
-0.140447131 0.124802527 0.140780106 1
0.062166349 -0.121484447 -0.140675515 1
-0.002989106 0.13984927 0.004382326 2
Let's say your first dataframe is like
In [281]: df1
Out[281]:
a b c
0 -0.140447 0.124803 0.140780
1 0.062166 -0.121484 -0.140676
2 -0.002989 0.139849 0.004382
And, the second like,
In [283]: df2
Out[283]:
d
0 1
1 1
2 2
Then you could create a new column in df1 using df2:
In [284]: df1['d_new'] = df2['d']
In [285]: df1
Out[285]:
a b c d_new
0 -0.140447 0.124803 0.140780 1
1 0.062166 -0.121484 -0.140676 1
2 -0.002989 0.139849 0.004382 2
The assumption, however, is that both dataframes share a common index.
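If the indexes do not line up, one option (a sketch, not from the original answer) is to assign by position instead:
df1['d_new'] = df2['d'].values  # positional assignment, ignores index alignment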
Use pd.concat and specify axis=1 (columns):
df_new = pd.concat([df1, df2], axis=1)
>>> df_new
0 1 2 0
0 -0.140447 0.124803 0.140780 1
1 0.062166 -0.121484 -0.140676 1
2 -0.002989 0.139849 0.004382 2
