I have two data frames, let's say:
dataframe A with a column 'name':
name
0 4
1 2
2 1
3 3
Another dataframe B with two columns, 'name' and 'value':
name value
0 3 5
1 2 6
2 4 7
3 1 8
I want to rearrange the values in dataframe B according to the name column in dataframe A.
I am expecting a final dataframe similar to this:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Here are two options:
dfB.set_index('name').loc[dfA.name].reset_index()
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Or,
dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
dfA
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Timings:
import numpy as np
import pandas as pd
prng = np.random.RandomState(0)
names = np.arange(10**7)
prng.shuffle(names)
dfA = pd.DataFrame({'name': names})
prng.shuffle(names)
dfB = pd.DataFrame({'name': names, 'value': prng.randint(0, 100, 10**7)})
%timeit dfB.set_index('name').loc[dfA.name].reset_index()
1 loop, best of 3: 2.27 s per loop
%timeit dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
1 loop, best of 3: 1.65 s per loop
%timeit dfB.set_index('name').ix[dfA.name].reset_index()
1 loop, best of 3: 1.66 s per loop
(Note: .ix is deprecated and was removed in pandas 1.0; it is timed here only for comparison.)
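As an aside (my addition, not part of the original answer): a left merge gives the same reordering, since pandas preserves the row order of the left frame for a left join:
# A left merge keeps dfA's row order, so values line up with dfA['name']
dfA.merge(dfB, on='name', how='left')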
This question already has answers here: Split / Explode a column of dictionaries into separate columns with pandas (13 answers). Closed 4 years ago.
I have a really simple Pandas dataframe where each cell contains a list. I'd like to split each element of the list into its own column. I can do that by exporting the values and then creating a new dataframe, but this doesn't seem like a good way to do it, especially if my dataframe had a column aside from the list column.
import pandas as pd
df = pd.DataFrame(data=[[[8,10,12]],
                        [[7,9,11]]])
df = pd.DataFrame(data=[x[0] for x in df.values])
Desired output:
0 1 2
0 8 10 12
1 7 9 11
Follow-up based on @Psidom's answer:
If I did have a second column:
df = pd.DataFrame(data=[[[8,10,12], 'A'],
                        [[7,9,11], 'B']])
How do I not lose the other column?
Desired output:
0 1 2 3
0 8 10 12 A
1 7 9 11 B
You can loop through the Series with the apply() function, converting each list to a Series; this automatically expands the list into columns:
df[0].apply(pd.Series)
# 0 1 2
#0 8 10 12
#1 7 9 11
Update: To keep other columns of the data frame, you can concatenate the result with the columns you want to keep:
pd.concat([df[0].apply(pd.Series), df[1]], axis = 1)
# 0 1 2 1
#0 8 10 12 A
#1 7 9 11 B
You could do pd.DataFrame(df[col].values.tolist()), which is much faster (~500x):
In [820]: pd.DataFrame(df[0].values.tolist())
Out[820]:
0 1 2
0 8 10 12
1 7 9 11
In [821]: pd.concat([pd.DataFrame(df[0].values.tolist()), df[1]], axis=1)
Out[821]:
0 1 2 1
0 8 10 12 A
1 7 9 11 B
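One caveat (my addition, not in the original answer): pd.DataFrame(...) builds a fresh RangeIndex, so if df has a non-default index, pass it explicitly so the axis=1 concatenation still aligns row-for-row:
# Keep the original index so the concat aligns on it
expanded = pd.DataFrame(df[0].values.tolist(), index=df.index)
pd.concat([expanded, df[1]], axis=1)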
Timings
Medium
In [828]: df.shape
Out[828]: (20000, 2)
In [829]: %timeit pd.DataFrame(df[0].values.tolist())
100 loops, best of 3: 15 ms per loop
In [830]: %timeit df[0].apply(pd.Series)
1 loop, best of 3: 4.06 s per loop
Large
In [832]: df.shape
Out[832]: (200000, 2)
In [833]: %timeit pd.DataFrame(df[0].values.tolist())
10 loops, best of 3: 161 ms per loop
In [834]: %timeit df[0].apply(pd.Series)
1 loop, best of 3: 40.9 s per loop
I'm using Pandas to come up with a new column that looks through an entire column of values [1-100] and counts how many values are less than the current row's value.
See the df example below:
A NewCol
1 0
3 2
2 1
5 4
8 5
3 2
Essentially, for each row I need to look at the entire Column A, and count how many values are less than the current row. So for Value 5, there are 4 values that are less (<) than 5 (1,2,3,3).
What would be the easiest way of doing this?
Thanks!
One way to do this is to use rank with method='min':
df['NewCol'] = (df['A'].rank(method='min') - 1).astype(int)
Output:
A NewCol
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
I am using numpy broadcasting:
s=df.A.values
(s[:,None]>s).sum(1)
Out[649]: array([0, 2, 1, 4, 5, 2])
#df['NewCol']=(s[:,None]>s).sum(1)
Timings:
df=pd.concat([df]*1000)
%%timeit
s=df.A.values
(s[:,None]>s).sum(1)
10 loops, best of 3: 83.7 ms per loop
%timeit (df['A'].rank(method='min') - 1).astype(int)
1000 loops, best of 3: 479 µs per loop
Try this code:
A = [Your numbers]
less_than = []
# O(n**2) nested loops: for each element, count how many values are smaller
for element in A:
    counter = 0
    for number in A:
        if number < element:
            counter += 1
    less_than.append(counter)
# df['NewCol'] = less_than  # attach the result back to the dataframe
You can do it this way:
import pandas as pd
df = pd.DataFrame({'A': [1,3,2,5,8,3]})
df['NewCol'] = 0
for idx, row in df.iterrows():
    df.loc[idx, 'NewCol'] = (df.loc[:, 'A'] < row.A).sum()
print(df)
A NewCol
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
Another way is to sort and reset the index (note that this gives tied values distinct positions, unlike rank(method='min'), and you would still need to map the result back to the original row order):
m=df.A.sort_values().reset_index(drop=True).reset_index()
m.columns=['new','A']
print(m)
new A
0 0 1
1 1 2
2 2 3
3 3 3
4 4 5
5 5 8
You didn't specify whether speed or memory usage was important (or whether you have a very large dataset). The "easiest" way to do it is straightforward: calculate how many values are less than i for each entry i in the column, and collect those into a new column:
df=pd.DataFrame({'A': [1,3,2,5,8,3]})
col=df['A']
df['new_col']=[ sum(col<i) for i in col ]
print(df)
Result:
A new_col
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
There might be more efficient ways to do this on large datasets, such as sorting your column first.
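For example (a hypothetical sketch, my addition, not part of the original answer): sorting the column once and then using np.searchsorted produces the same counts in O(n log n) instead of O(n^2):
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1,3,2,5,8,3]})
# searchsorted with side='left' returns, for each value, the number of
# sorted entries strictly less than it
sorted_a = np.sort(df['A'].values)
df['new_col'] = np.searchsorted(sorted_a, df['A'].values, side='left')
print(df)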
Suppose I have the following data:
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
id1 id2 x
0 1 1 10
1 1 2 20
2 1 3 50
3 2 1 15
4 2 2 20
5 2 3 30
6 3 1 40
7 3 2 70
The dataframe is sorted along the two ids. Suppose I'd like to know the value of x for the FIRST observation within each group of id1 observations. The result would look like:
id1 id2 x first_x
1 1 10 10
1 2 20 10
1 3 50 10
2 1 15 15
2 2 20 15
2 3 30 15
3 1 40 40
3 2 70 40
How do I achieve this 'subscripting'? Ideally, the new column would be filled for each observation.
I thought along the lines of
df['first_x'] = df.groupby(['id1'])[0]
I think the simplest is transform with 'first':
df['first_x'] = df.groupby('id1')['x'].transform('first')
Or map with a Series created by drop_duplicates:
df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
print (df)
id1 id2 x first_x
0 1 1 10 10
1 1 2 20 10
2 1 3 50 10
3 2 1 15 15
4 2 2 20 15
5 2 3 30 15
6 3 1 40 40
7 3 2 70 40
The first is the shortest and fastest solution:
np.random.seed(123)
N = 1000000
L = list('abcde')
df = pd.DataFrame({'id1': np.random.randint(10000, size=N),
                   'x': np.random.randint(10000, size=N)})
df = df.sort_values('id1').reset_index(drop=True)
print (df)
In [179]: %timeit df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')
10 loops, best of 3: 125 ms per loop
In [180]: %%timeit
...: first_xs = df.groupby(['id1']).first().to_dict()['x']
...:
...: df['first_x'] = df['id1'].map(lambda id: first_xs[id])
...:
1 loop, best of 3: 524 ms per loop
In [181]: %timeit df['first_x'] = df.groupby('id1')['x'].transform('first')
10 loops, best of 3: 54.9 ms per loop
In [182]: %timeit df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
10 loops, best of 3: 142 ms per loop
Something like this?
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
df = df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')
As you need to consider the entire dataframe when building values for each row, you need an intermediate step.
The following gets your first_x value using a group by, then uses that as a map to add a new column.
import pandas as pd
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
first_xs = df.groupby(['id1']).first().to_dict()['x']
df['first_x'] = df['id1'].map(lambda id: first_xs[id])
This question already has answers here: How can I replicate rows of a Pandas DataFrame? (10 answers). Closed 2 years ago.
I have a pandas.DataFrame as follows:
df1 =
a b
0 1 2
1 3 4
I'd like to repeat each row three times to become:
df2 =
a b
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4
Currently df2 is made with a loop, which is not efficient.
How can I get df2 from df1 in a vectorized (matrix) way that is faster?
Build a one-dimensional indexer to slice both the values array and the index. You must take care of the index as well to get your desired results.
use np.repeat on an np.arange to get the indexer
construct a new dataframe using this indexer on both values and the index
r = np.arange(len(df)).repeat(3)
pd.DataFrame(df.values[r], df.index[r], df.columns)
a b
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4
You can use np.repeat
df = pd.DataFrame(np.repeat(df.values,[3,3], axis = 0), columns = df.columns)
You get
a b
0 1 2
1 1 2
2 1 2
3 3 4
4 3 4
5 3 4
Time testing:
%timeit pd.DataFrame(np.repeat(df.values,[3,3], axis = 0))
1000 loops, best of 3: 235 µs per loop
%timeit pd.concat([df] * 3).sort_index()
best of 3: 1.26 ms per loop
Numpy is definitely faster in most cases, so no surprises there.
EDIT: I am not sure whether you would be looking for repeating indices, but in case you do:
pd.DataFrame(np.repeat(df.values,3, axis = 0), index = np.repeat(df.index, 3), columns = df.columns)
I do not know if it is more efficient than your loop, but it is easy enough to construct:
Code:
pd.concat([df] * 3).sort_index()
Test Code:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))
print(pd.concat([df] * 3).sort_index())
Results:
a b
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4
You can use numpy.repeat with the scalar 3 and pass the columns parameter to the DataFrame constructor:
df = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
print (df)
a b
0 1 2
1 1 2
2 1 2
3 3 4
4 3 4
5 3 4
If you really want the duplicated index, note that it can complicate some pandas functions, such as reindex, which fails with duplicate labels:
r = np.repeat(np.arange(len(df.index)), 3)
df = pd.DataFrame(df.values[r], df.index[r], df.columns)
print (df)
a b
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4
Not the fastest (not the slowest either), but the shortest solution so far.
# Build an index array and extract the rows to build the desired new df. This handles index and data all at once.
df.iloc[np.repeat(df.index,3)]
Out[270]:
a b
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4
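One caveat (my addition, not in the original answer): iloc is positional, and df.index works as the indexer here only because the default RangeIndex labels equal the positions. With an arbitrary index, a positional arange is safer:
import numpy as np
# Positional repeat that works for any index
df.iloc[np.repeat(np.arange(len(df)), 3)]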
Is it possible to put percentile cuts on all columns of a dataframe without using a loop? This is how I am doing it now:
df = pd.DataFrame(np.random.randn(10,5))
df_q = pd.DataFrame()
for i in list(range(len(df.columns))):
    df_q[i] = pd.qcut(df[i], 5, labels=list(range(5)))
I am hoping there is a slick pandas solution for this to avoid the use of a loop.
Thanks!
pd.qcut accepts a 1D array or Series as its argument. Applying pd.qcut to every column therefore requires multiple calls to pd.qcut. So no matter how you dress it up, there will be a loop -- either explicit or implicit.
You could, for example, use apply to call pd.qcut for each column:
In [46]: df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
Out[46]:
0 1 2 3 4
0 4 0 3 0 3
1 0 0 2 3 0
2 3 4 1 2 3
3 4 1 1 1 4
4 3 2 2 4 1
5 2 4 3 0 1
6 2 3 0 4 4
7 1 3 4 2 2
8 0 1 4 3 0
9 1 2 0 1 2
but under the hood, df.apply uses a for-loop, so it really isn't very different from your for-loop:
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
In [47]: %timeit df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
100 loops, best of 3: 2.9 ms per loop
In [48]: %%timeit
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
100 loops, best of 3: 2.95 ms per loop
Note that
for i in list(range(len(df.columns))):
will only work if the columns of df happen to be sequential integers starting at 0.
It is more robust to use
for col in df:
to iterate over the columns of the DataFrame.
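For instance (a minimal sketch, my addition), the same quantile binning can be written as a dict comprehension over the column labels, which works for any column names:
import numpy as np
import pandas as pd
# Hypothetical frame with non-integer column labels
df = pd.DataFrame(np.random.randn(10, 5), columns=list('abcde'))
df_q = pd.DataFrame({col: pd.qcut(df[col], 5, labels=list(range(5)))
                     for col in df})
print(df_q)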