I want to append a specific number of empty rows to this DataFrame:
df = pd.DataFrame({'cow': [2, 4, 8],
                   'shark': [2, 0, 0],
                   'pudle': [10, 2, 1]})
With df = df.append(pd.Series(), ignore_index=True) I can append one empty row; how can I append x rows?
You can use df.reindex to achieve this goal.
df.reindex(list(range(0, 10))).reset_index(drop=True)
cow shark pudle
0 2.0 2.0 10.0
1 4.0 0.0 2.0
2 8.0 0.0 1.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
The list you pass to df.reindex becomes the new index, so its length is the total number of rows in the result. If your DataFrame has 3 rows, reindexing to a 10-element list adds 7 new empty rows.
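If you would rather say "append x rows" than hard-code the target length, a small sketch of the same idea (assuming the DataFrame has a default RangeIndex; x is the number of empty rows to add):
x = 7
df = df.reindex(range(len(df) + x)).reset_index(drop=True)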
I'm not too pandas savvy, but if you can already add one empty row, why not just try writing a for loop and appending x times?
for i in range(x):
    df = df.append(pd.Series(), ignore_index=True)
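Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same loop needs pd.concat instead; a rough equivalent sketch, assuming x empty rows are wanted:
import numpy as np
for i in range(x):
    df = pd.concat([df, pd.DataFrame([[np.nan] * df.shape[1]], columns=df.columns)],
                   ignore_index=True)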
You could do:
import pandas as pd
df = pd.DataFrame({'cow': [2, 4, 8],
                   'shark': [2, 0, 0],
                   'pudle': [10, 2, 1]})
n = 10
df = df.append([[] for _ in range(n)], ignore_index=True)
print(df)
Output
cow shark pudle
0 2.0 2.0 10.0
1 4.0 0.0 2.0
2 8.0 0.0 1.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
Try with reindex
out = df.reindex(df.index.tolist() + [df.index.max() + 1] * 5)  # optionally .reset_index(drop=True)
Out[93]:
cow shark pudle
0 2.0 2.0 10.0
1 4.0 0.0 2.0
2 8.0 0.0 1.0
3 NaN NaN NaN
3 NaN NaN NaN
3 NaN NaN NaN
3 NaN NaN NaN
3 NaN NaN NaN
Create an empty dataframe of the appropriate size and append it:
import numpy as np
df = df.append(pd.DataFrame([[np.nan] * df.shape[1]] * n, columns=df.columns),
               ignore_index=True)
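On pandas 2.0 and later, where append is gone, the same idea can be sketched with pd.concat:
df = pd.concat([df, pd.DataFrame([[np.nan] * df.shape[1]] * n, columns=df.columns)],
               ignore_index=True)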
Related
I am hoping someone can help me optimize the following Python/pandas code. My code works, but I know there must be a cleaner and faster way to perform this operation.
I am looking for an optimized strategy because my use case will involve 16 unique ADC types, as opposed to 4 in the example below. Also, my initial pandas Series (i.e. the ADC_TYPE column) will be several hundred thousand data points long, rather than 8 as in the example below.
import numpy as np
import pandas as pd
from enum import Enum
data_dict = {"RAW": [4000076160, 5354368, 4641792, 4289860736,
4136386944, 5440384, 4772864, 4289881216],
"ADC_TYPE": [3, 7, 8, 9,
3, 7, 8, 9]}
df = pd.DataFrame(data_dict)
print(df)
The initial DataFrame (i.e. df) is:
RAW ADC_TYPE
0 4000076160 3
1 5354368 7
2 4641792 8
3 4289860736 9
4 4136386944 3
5 5440384 7
6 4772864 8
7 4289881216 9
I then manipulate the DataFrame above using the following code:
unique_types = df["ADC_TYPE"].unique()
dict_concat = {"RAW": [],
"ADC_TYPE_3": [],
"ADC_TYPE_7": [],
"ADC_TYPE_8": [],
"ADC_TYPE_9": []}
df_concat = pd.DataFrame(dict_concat)
for adc_type in unique_types:
    df_group = (df.groupby(["ADC_TYPE"])
                  .get_group(adc_type)
                  .rename(columns={"ADC_TYPE": f"ADC_TYPE_{adc_type}"}))
    df_concat = pd.concat([df_concat, df_group])
print(df_concat.sort_index())
The returned DataFrame (i.e. df_concat) is displayed below. The ordering of RAW and the associated ADC_TYPE values must remain unchanged; I need the returned DataFrame to look just like the one below.
RAW ADC_TYPE_3 ADC_TYPE_7 ADC_TYPE_8 ADC_TYPE_9
0 4.000076e+09 3.0 NaN NaN NaN
1 5.354368e+06 NaN 7.0 NaN NaN
2 4.641792e+06 NaN NaN 8.0 NaN
3 4.289861e+09 NaN NaN NaN 9.0
4 4.136387e+09 3.0 NaN NaN NaN
5 5.440384e+06 NaN 7.0 NaN NaN
6 4.772864e+06 NaN NaN 8.0 NaN
7 4.289881e+09 NaN NaN NaN 9.0
This is just a pivot table with a prefix.
Edit: to preserve sorting, you can reindex from the original DataFrame:
df = pd.DataFrame({'RAW': {0: 4000076160, 1: 5354368, 2: 4641792, 3: 4289860736,
                           4: 4136386944, 5: 5440384, 6: 4772864, 7: 4289881216},
                   'ADC_TYPE': {0: 3, 1: 7, 2: 8, 3: 9, 4: 3, 5: 7, 6: 8, 7: 9}})
out = (df.pivot(index='RAW', columns='ADC_TYPE', values='ADC_TYPE')
         .add_prefix('ADC_TYPE_')
         .reset_index()
         .rename_axis(None, axis=1))
out = out.set_index('RAW').reindex(df['RAW']).reset_index()
Output
RAW ADC_TYPE_3 ADC_TYPE_7 ADC_TYPE_8 ADC_TYPE_9
0 4000076160 3.0 NaN NaN NaN
1 5354368 NaN 7.0 NaN NaN
2 4641792 NaN NaN 8.0 NaN
3 4289860736 NaN NaN NaN 9.0
4 4136386944 3.0 NaN NaN NaN
5 5440384 NaN 7.0 NaN NaN
6 4772864 NaN NaN 8.0 NaN
7 4289881216 NaN NaN NaN 9.0
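One caveat worth hedging: pivot requires the new index to be unique, so this assumes there are no duplicate RAW readings; with duplicates you would need to pivot on a positional helper index instead.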
Here is a way using str.get_dummies()
df2 = df.set_index('RAW')['ADC_TYPE'].astype(str).str.get_dummies()
(df2.mul(pd.to_numeric(df2.columns),axis=1)
.mask(lambda x: x.eq(0))
.rename('ADC_TYPE_{}'.format,axis=1)
.reset_index())
Here is a slightly different way using pd.get_dummies()
df2 = pd.get_dummies(df.set_index('RAW'),columns = ['ADC_TYPE'])
df2.mul((df2.columns.str.split('_').str[-1]).astype(int)).where(lambda x: x.ne(0))
You can also use set_index() and unstack()
(df.set_index(['RAW',df['ADC_TYPE'].astype(str).map('ADC_TYPE_{}'.format)])['ADC_TYPE']
.unstack().reindex(df['RAW']).reset_index())
Output:
RAW ADC_TYPE_3 ADC_TYPE_7 ADC_TYPE_8 ADC_TYPE_9
0 4000076160 3.0 NaN NaN NaN
1 5354368 NaN 7.0 NaN NaN
2 4641792 NaN NaN 8.0 NaN
3 4289860736 NaN NaN NaN 9.0
4 4136386944 3.0 NaN NaN NaN
5 5440384 NaN 7.0 NaN NaN
6 4772864 NaN NaN 8.0 NaN
7 4289881216 NaN NaN NaN 9.0
I liked the idea of using get_dummies, so I modified it a bit:
df = (pd.get_dummies(df, 'ADC_TYPE', '_', columns=['ADC_TYPE'])
.replace(1, np.nan)
.apply(lambda x: x.fillna(df['ADC_TYPE']))
.replace(0, np.nan))
Output:
RAW ADC_TYPE_3 ADC_TYPE_7 ADC_TYPE_8 ADC_TYPE_9
0 4000076160 3.0 NaN NaN NaN
1 5354368 NaN 7.0 NaN NaN
2 4641792 NaN NaN 8.0 NaN
3 4289860736 NaN NaN NaN 9.0
4 4136386944 3.0 NaN NaN NaN
5 5440384 NaN 7.0 NaN NaN
6 4772864 NaN NaN 8.0 NaN
7 4289881216 NaN NaN NaN 9.0
Using crosstab:
out = pd.crosstab(
df["RAW"], df["ADC_TYPE"], values=df["ADC_TYPE"], aggfunc="first"
).rename_axis(None, axis=1)
out.columns = out.columns.map("ADC_TYPE_{}".format)
out = out.reindex(df["RAW"]).reset_index()
print(out)
RAW ADC_TYPE_3 ADC_TYPE_7 ADC_TYPE_8 ADC_TYPE_9
0 4000076160 3.0 NaN NaN NaN
1 5354368 NaN 7.0 NaN NaN
2 4641792 NaN NaN 8.0 NaN
3 4289860736 NaN NaN NaN 9.0
4 4136386944 3.0 NaN NaN NaN
5 5440384 NaN 7.0 NaN NaN
6 4772864 NaN NaN 8.0 NaN
7 4289881216 NaN NaN NaN 9.0
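A note on the design choice: because crosstab aggregates, the aggfunc="first" version tolerates duplicate RAW values (keeping the first ADC_TYPE seen per RAW), whereas a plain pivot on RAW would raise on duplicates.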
I have a DataFrame with a range index and no data; in the real data the index is a time range.
E.g.
df_main = pd.DataFrame(index = pd.RangeIndex(0,15,1))
See Fig 1.
I also have several DataFrames with varying columns and indexes, and I just want to join them onto the main DataFrame based on index:
df1 = pd.DataFrame({'value': [1, 2, 3, 5]}, index = pd.RangeIndex(0,4,1))
df2 = pd.DataFrame({'value': [5, 6, 7, 8]}, index = pd.RangeIndex(4,8,1))
df3 = pd.DataFrame({'value2': [9, 8, 7, 6]}, index = pd.RangeIndex(0,4,1))
df4 = pd.DataFrame({'value': [1, 2],'value2': [3, 4],'value3': [5, 6]}, index = pd.RangeIndex(10,12,1))
See Figs 2, 3, 4 and 5.
I tried concat:
display(pd.concat([df_main,df1,df2,df3,df4]))
This gives me the unwanted output you can see in Fig 6.
I also tried join, which results in an error I did not understand:
ValueError: Indexes have overlapping values: Index(['value', 'value2'], dtype='object')
What I want is the output you can see in Fig 7.
You could groupby the index and aggregate with first:
pd.concat([df_main, df1, df2, df3, df4]).groupby(level=0).first()
[out]
value value2 value3
0 1.0 9.0 NaN
1 2.0 8.0 NaN
2 3.0 7.0 NaN
3 5.0 6.0 NaN
4 5.0 NaN NaN
5 6.0 NaN NaN
6 7.0 NaN NaN
7 8.0 NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 1.0 3.0 5.0
11 2.0 4.0 6.0
12 NaN NaN NaN
13 NaN NaN NaN
14 NaN NaN NaN
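This works because first takes the first non-null value per column within each index label, so the overlapping pieces collapse into single rows while df_main contributes the full 0 to 14 index.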
Use reduce and DataFrame.combine_first:
from functools import reduce
df = reduce((lambda x, y: x.combine_first(y)), [df_main,df1,df2,df3,df4])
print(df)
value value2 value3
0 1.0 9.0 NaN
1 2.0 8.0 NaN
2 3.0 7.0 NaN
3 5.0 6.0 NaN
4 5.0 NaN NaN
5 6.0 NaN NaN
6 7.0 NaN NaN
7 8.0 NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 1.0 3.0 5.0
11 2.0 4.0 6.0
12 NaN NaN NaN
13 NaN NaN NaN
14 NaN NaN NaN
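Here combine_first aligns on both index and columns and patches the caller's NaNs with values from the argument, so starting the reduction from df_main keeps the full index while each later frame only fills the gaps.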
Here is what I have tried and what error I received:
>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1,2,3,4,5],"B":[5,4,3,2,1],"C":[0,0,0,0,0],"D":[1,1,1,1,1]})
>>> df
A B C D
0 1 5 0 1
1 2 4 0 1
2 3 3 0 1
3 4 2 0 1
4 5 1 0 1
>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1,2,3,4,5],"B":[5,4,3,2,1],"C":[0,0,0,0,0],"D":[1,1,1,1,1]})
>>> first = [2,2,2,2,2,2,2,2,2,2,2,2]
>>> first = pd.DataFrame(first).T
>>> first.index = [2]
>>> df = df.join(first)
>>> df
A B C D 0 1 2 3 4 5 6 7 8 9 10 11
0 1 5 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 3 0 1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 4 2 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
>>> second = [3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3]
>>> second = pd.DataFrame(second).T
>>> second.index = [1]
>>> df = df.join(second)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python35\lib\site-packages\pandas\core\frame.py", line 6815, in join
rsuffix=rsuffix, sort=sort)
File "C:\Python35\lib\site-packages\pandas\core\frame.py", line 6830, in _join_compat
suffixes=(lsuffix, rsuffix), sort=sort)
File "C:\Python35\lib\site-packages\pandas\core\reshape\merge.py", line 48, in merge
return op.get_result()
File "C:\Python35\lib\site-packages\pandas\core\reshape\merge.py", line 552, in get_result
rdata.items, rsuf)
File "C:\Python35\lib\site-packages\pandas\core\internals\managers.py", line 1972, in items_overlap_with_suffix
'{rename}'.format(rename=to_rename))
ValueError: columns overlap but no suffix specified: Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='object')
I am trying to add new lists as extra columns at specific indexes of the main DataFrame df.
When I tried this with first, it worked, as you can see in the output. But when I tried the same with second, I received the error above.
Kindly let me know what I can do in this situation to achieve the result I am expecting.
Use DataFrame.combine_first instead of join if you need to assign to the same columns created before; finally, use DataFrame.reindex with the list of columns for the expected ordering:
df = pd.DataFrame({"A":[1,2,3,4,5],"B":[5,4,3,2,1],"C":[0,0,0,0,0],"D":[1,1,1,1,1]})
orig = df.columns.tolist()
first = [2,2,2,2,2,2,2,2,2,2,2,2]
first = pd.DataFrame(first).T
first.index = [2]
df = df.combine_first(first)
second = [3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3]
second = pd.DataFrame(second).T
second.index = [1]
df = df.combine_first(second)
df = df.reindex(orig + first.columns.tolist(), axis=1)
print (df)
A B C D 0 1 2 3 4 5 6 7 8 9 10 11
0 1 5 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 0 1 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
2 3 3 0 1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 4 2 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Yes, this is expected behaviour, because join works much like an SQL join: it joins on the provided index and concatenates all the columns together. The problem arises from the fact that pandas does not accept two columns with the same name, so if the two DataFrames share column names, it looks for a suffix to add to those columns to avoid name clashes. This is controlled with the lsuffix and rsuffix arguments of the join method.
Conclusion: there are 2 ways to solve this:
Either provide a suffix so that pandas is able to resolve the name clashes; or
Make sure that you don't have overlapping columns in the first place (see the sketch below).
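A minimal sketch of the second option, renaming the incoming columns before the join so nothing overlaps (the 'second_' prefix is just an illustrative choice):
second = second.add_prefix('second_')
df = df.join(second)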
You have to specify the suffixes since the column names are the same. Assuming you are trying to add the second values as new columns horizontally:
df = df.join(second, lsuffix='first', rsuffix='second')
A B C D 0first 1first 2first 3first 4first 5first ... 10second 11second 12 13 14 15 16 17 18 19
0 1 5 0 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 0 1 NaN NaN NaN NaN NaN NaN ... 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
2 3 3 0 1 2.0 2.0 2.0 2.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 2 0 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 0 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Consider this df:
import pandas as pd, numpy as np
df = pd.DataFrame.from_dict({'id': ['A', 'B', 'A', 'C', 'D', 'B', 'C'],
                             'val': [1, 2, -3, 1, 5, 6, -2],
                             'stuff': ['12', '23232', '13', '1234', '3235', '3236', '732323']})
Question: how do I produce a table with as many columns as there are unique ids ({A, B, C, D}) and
as many rows as df, where, for example, the column corresponding to id == A holds the values:
1,
np.nan,
-2,
np.nan,
np.nan,
np.nan,
np.nan
(that is, the result of df.groupby('id')['val'].cumsum() joined on the indexes of df).
UMMM pivot
pd.pivot(df.index,df.id,df.val).cumsum()
Out[33]:
id A B C D
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 -2.0 NaN NaN NaN
3 NaN NaN 1.0 NaN
4 NaN NaN NaN 5.0
5 NaN 8.0 NaN NaN
6 NaN NaN -1.0 NaN
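A hedged aside: this relies on the old array-based pd.pivot(index, columns, values) call, which current pandas no longer accepts; on recent versions the equivalent sketch is the keyword form:
df.pivot(columns='id', values='val').cumsum()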
One way via a dictionary comprehension and pd.DataFrame.where:
res = pd.DataFrame({i: df['val'].where(df['id'].eq(i)).cumsum() for i in df['id'].unique()})
print(res)
A B C D
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 -2.0 NaN NaN NaN
3 NaN NaN 1.0 NaN
4 NaN NaN NaN 5.0
5 NaN 8.0 NaN NaN
6 NaN NaN -1.0 NaN
For a small number of groups, you may find this method efficient:
df = pd.concat([df]*1000, ignore_index=True)
def piv_transform(df):
    return pd.pivot(df.index, df.id, df.val).cumsum()

def dict_transform(df):
    return pd.DataFrame({i: df['val'].where(df['id'].eq(i)).cumsum() for i in df['id'].unique()})
%timeit piv_transform(df) # 17.5 ms
%timeit dict_transform(df) # 8.1 ms
Certainly cleaner answers have been supplied - see pivot.
df1 = (pd.DataFrame(data=[df.id == x for x in df.id.unique()]).T
         .mul(df.groupby(['id']).cumsum().squeeze(), axis=0))
df1.columns = df.id.unique()
df1.applymap(lambda x: np.nan if x == 0 else x)
A B C D
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 -2.0 NaN NaN NaN
3 NaN NaN 1.0 NaN
4 NaN NaN NaN 5.0
5 NaN 8.0 NaN NaN
6 NaN NaN -1.0 NaN
Short and simple:
df.pivot(columns='id', values='val').cumsum()
I'm preparing data for machine learning, where the data is in a pandas DataFrame that looks like this:
Column v1 v2
first 1 2
second 3 4
third 5 6
Now I want to transform it into:
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
first 1 2 1 2 Nan Nan Nan Nan
second 3 4 Nan Nan 3 4 Nan Nan
third 5 6 Nan Nan Nan Nan 5 6
What I've tried is something like this:
# we know how many values there are but
# length can be changed into length of [1, 2, 3, ...] values
values = ['v1', 'v2']
# data with description from above is saved in data
for value in values:
    data[str(data['Column'] + '-' + value)] = data[value]
The result is columns whose names are the string representation of a whole Series, like:
['first-v1' 'second-v1' ...], ['first-v2' 'second-v2' ...]
and they do contain the correct values. What am I doing wrong? And is there a more optimal way to do this, since my data is big?
Thank you for your time!
You can use unstack with swapping and sorting of the MultiIndex in the columns:
df = (data.set_index('Column', append=True)[values]
          .unstack()
          .swaplevel(0, 1, axis=1)
          .sort_index(axis=1))
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Or stack + unstack:
df = data.set_index('Column', append=True).stack().unstack([1,2])
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Lastly, join to the original:
df = data.join(df)
print (df)
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 \
0 first 1 2 1.0 2.0 NaN NaN NaN
1 second 3 4 NaN NaN 3.0 4.0 NaN
2 third 5 6 NaN NaN NaN NaN 5.0
third-v2
0 NaN
1 NaN
2 6.0
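For comparison, a pivot-based sketch of the same reshape, using the values list from the question (assuming a pandas version where pivot accepts a list of values):
out = data.pivot(columns='Column', values=values)
out.columns = out.columns.swaplevel(0, 1).map('-'.join)
df = data.join(out.sort_index(axis=1))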