Populating a Pandas DataFrame from another DataFrame based on column names - python

I have a DataFrame of the following form:
   a  b  c
0  1  4  6
1  3  2  4
2  4  1  5
And I have a list of column names that I need to use to create a new DataFrame using the columns of the first DataFrame that correspond to each label. For example, if my list of columns is ['a', 'b', 'b', 'a', 'c'], the resulting DataFrame should be:
   a  b  b  a  c
0  1  4  4  1  6
1  3  2  2  3  4
2  4  1  1  4  5
I've been trying to figure out a fast way of performing this operation, because I'm dealing with extremely large DataFrames and I don't think looping is a reasonable option.

You can just use the list to select them:
In [44]:
cols = ['a', 'b', 'b', 'a', 'c']
df[cols]
Out[44]:
   a  b  b  a  c
0  1  4  4  1  6
1  3  2  2  3  4
2  4  1  1  4  5

[3 rows x 5 columns]
So there is no need for a loop: once you have created your dataframe df, indexing it with the list of column names selects them in that order and creates the df you want.
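A minimal runnable sketch of this, rebuilding the question's frame:

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({'a': [1, 3, 4], 'b': [4, 2, 1], 'c': [6, 4, 5]})

# Indexing with a list of labels selects (and repeats) columns in that order.
cols = ['a', 'b', 'b', 'a', 'c']
result = df[cols]
print(result)
```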

You can do that directly:
>>> df
   a  b  c
0  1  4  6
1  3  2  4
2  4  1  5
>>> column_names
['a', 'b', 'b', 'a', 'c']
>>> df[column_names]
   a  b  b  a  c
0  1  4  4  1  6
1  3  2  2  3  4
2  4  1  1  4  5

[3 rows x 5 columns]

From pandas 0.17 onwards you can use reindex:
In [795]: cols = ['a', 'b', 'b', 'a', 'c']
In [796]: df.reindex(columns=cols)
Out[796]:
   a  b  b  a  c
0  1  4  4  1  6
1  3  2  2  3  4
2  4  1  1  4  5
Note: Ideally, you don't want to have duplicate column names.
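A small sketch, using the question's data, of why duplicate column names are best avoided: selecting a duplicated label returns a DataFrame rather than a Series.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 4], 'b': [4, 2, 1], 'c': [6, 4, 5]})
dup = df.reindex(columns=['a', 'b', 'b', 'a', 'c'])

# With duplicated labels, selecting one label no longer gives a Series:
print(type(dup['a']))  # both 'a' columns come back as a DataFrame
print(type(df['a']))   # a unique label gives a plain Series
```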

Related

How can I get the positions of a group?

I have a pandas dataframe like this:
data = [[1, 'a'], [2, 'a'], [3, 'b'], [4, 'b'], [5, 'a'], [6, 'c']]
df1 = pd.DataFrame(data, columns=['Id', 'Group'])
Id Group
1 a
2 a
3 b
4 b
5 a
6 c
Without changing the order, I need to get the position of every Id within its Group.
Basically, I want the output below:
Id Group position
1 a 1
2 a 2
3 b 1
4 b 2
5 a 3
6 c 1
Try transform + cumcount:
df1['position'] = df1.groupby('Group').transform('cumcount') + 1
   Id Group  position
0   1     a         1
1   2     a         2
2   3     b         1
3   4     b         2
4   5     a         3
5   6     c         1
You can simply do it with .cumcount:
df1['position'] = df1.groupby('Group').cumcount() + 1
GroupBy.cumcount numbers each item in each group from 0 to the length of that group minus 1. It is not an aggregating function that produces a condensed result, so there is no need to use .transform() to propagate an aggregated result back to each item of the group.
Result:
print(df1)
   Id Group  position
0   1     a         1
1   2     a         2
2   3     b         1
3   4     b         2
4   5     a         3
5   6     c         1
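A self-contained sketch of the .cumcount approach, using the data from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6],
                    'Group': ['a', 'a', 'b', 'b', 'a', 'c']})

# cumcount numbers rows 0..n-1 within each group; +1 makes it start at 1.
df1['position'] = df1.groupby('Group').cumcount() + 1
print(df1)
```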

How to read a dataframe row by row and write to another dataframe row by row

I am trying to read two dataframes:
First dataframe is like:
A B C
1 2 3
The second dataframe is like:
D E F
8 9 12
2 4 6
3 5 8
2 5 7
Now I want a third dataframe which would:
- have the same length as the second dataframe
- have the D E F columns with their values
- have the values of the A, B, C columns copied and repeated for the whole length of the frame
So essentially the third dataframe should look like:
A B C D E F
1 2 3 8 9 12
1 2 3 2 4 6
1 2 3 3 5 8
1 2 3 2 5 7
The code I have tried so far is:
dataframe1
dataframe2
dataframe3 = pd.DataFrame(columns = ['A', 'B', 'C', 'D', 'E', 'F'])
for i in range(len(dataframe2)):
    dataframe3['A'] = dataframe1['A']
    dataframe3['B'] = dataframe1['B']
    dataframe3['C'] = dataframe1['C']
    dataframe3[i, 'D'] = dataframe2[i, 'D']
    dataframe3[i, 'E'] = dataframe2[i, 'E']
    dataframe3[i, 'F'] = dataframe2[i, 'F']
But this is not creating the desired results.
Use pd.concat() + ffill():
out = pd.concat([df1, df2], axis=1).ffill(downcast='infer')
# pd.concat([df1, df2], axis=1).fillna(downcast='infer', method='ffill')
Output of out:
   A  B  C  D  E   F
0  1  2  3  8  9  12
1  1  2  3  2  4   6
2  1  2  3  3  5   8
3  1  2  3  2  5   7
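A minimal runnable sketch of this approach. Note the downcast argument to ffill is deprecated in recent pandas versions, so this sketch restores the integer dtype with an explicit astype instead:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})
df2 = pd.DataFrame({'D': [8, 2, 3, 2], 'E': [9, 4, 5, 5], 'F': [12, 6, 8, 7]})

# Concatenate side by side; rows beyond df1's length become NaN,
# then forward-fill copies the single row down the frame.
out = pd.concat([df1, df2], axis=1).ffill()

# Restore the integer dtype lost to the intermediate NaNs (assumes all-int data).
out[['A', 'B', 'C']] = out[['A', 'B', 'C']].astype(int)
print(out)
```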

Pandas cumulative count on new value

I have a data frame like the below one.
df = pd.DataFrame()
df['col_1'] = [1, 1, 1, 2, 2, 2, 3, 3, 3]
df['col_2'] = ['A', 'B', 'B', 'A', 'B', 'C', 'A', 'A', 'B']
df
   col_1 col_2
0      1     A
1      1     B
2      1     B
3      2     A
4      2     B
5      2     C
6      3     A
7      3     A
8      3     B
I need to group by col_1 and, within each group, update a cumulative count whenever there is a new value in col_2, something like the data frame below.
   col_1 col_2  col_3
0      1     A      1
1      1     B      2
2      1     B      2
3      2     A      1
4      2     B      2
5      2     C      3
6      3     A      1
7      3     A      1
8      3     B      2
I could do this using lists and a dictionary, but couldn't find a way using pandas built-in functions.
Use factorize with a lambda function in GroupBy.transform:
df['col_3'] = df.groupby('col_1')['col_2'].transform(lambda x: pd.factorize(x)[0]+1)
print (df)
   col_1 col_2  col_3
0      1     A      1
1      1     B      2
2      1     B      2
3      2     A      1
4      2     B      2
5      2     C      3
6      3     A      1
7      3     A      1
8      3     B      2
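For reference, pd.factorize on its own codes values in order of first appearance, which is what drives the per-group numbering above:

```python
import pandas as pd

# factorize assigns a 0-based integer code to each distinct value,
# in order of first appearance, and returns the unique values alongside.
codes, uniques = pd.factorize(['A', 'B', 'B', 'C', 'A'])
print(codes)    # [0 1 1 2 0]
print(uniques)  # ['A' 'B' 'C']
```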

How can I write a for loop in Python to copy a pandas dataframe?

I'm a newbie in data engineering. Now I'm trying to write Python code to duplicate data from a pandas dataframe. For example,
data :
A B C D E F G E
1 2 3 4 0 1 0 1
5 6 7 8 0 1 1 0
9 1 2 3 0 1 0 1
I need to copy the dataframe to:
dfE = A B C D E
1 2 3 4 0
5 6 7 8 0
9 1 2 3 0
dfF = A B C D F
1 2 3 4 1
5 6 7 8 1
9 1 2 3 1
dfG...
Help me please...
Hi piyaphong, welcome to Stack Overflow.
Pandas lets you select columns by their names; the code below should solve your case.
from io import StringIO
import pandas as pd
data = """
A B C D E F G E
1 2 3 4 0 1 0 1
5 6 7 8 0 1 1 0
9 1 2 3 0 1 0 1
"""
df = pd.read_csv(StringIO(data), sep=' ')
dfE = df[['A', 'B', 'C', 'D', 'E']]
dfF = df[['A', 'B', 'C', 'D', 'F']]
dfG = df[['A', 'B', 'C', 'D', 'G']]
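If you really want a loop, a small sketch (column names from the sample, ignoring the duplicated E header, and storing the copies in a dict keyed by column name rather than separate variables) could be:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 9], 'B': [2, 6, 1], 'C': [3, 7, 2],
                   'D': [4, 8, 3], 'E': [0, 0, 0], 'F': [1, 1, 1],
                   'G': [0, 1, 0]})

base = ['A', 'B', 'C', 'D']
# One sub-frame per extra column; .copy() detaches each from df.
frames = {col: df[base + [col]].copy() for col in ['E', 'F', 'G']}
print(frames['F'])
```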

How to sort rows for specific columns in a DataFrame?

I have a big DataFrame with 10 columns. I want to sort all rows just for two specific columns. For example, if this is my data frame:
   A  B  C
0  5  1  8
1  8  2  2
2  9  3  3
I want to sort just columns A and B within each row, so the answer should be:
   A  B  C
0  1  5  8
1  2  8  2
2  3  9  3
Thank you.
Call np.sort on that specific subset of columns and assign the result back using loc:
df.loc[:, ['A', 'B']] = np.sort(df.loc[:, ['A', 'B']], axis=1)
df
   A  B  C
0  1  5  8
1  2  8  2
2  3  9  3
I am using sort:
s=df[['A','B']]
s.values.sort()
df.update(s)
df
   A  B  C
0  1  5  8
1  2  8  2
2  3  9  3
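A runnable sketch of the np.sort approach with an explicit axis=1, using the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 8, 9], 'B': [1, 2, 3], 'C': [8, 2, 3]})

# np.sort with axis=1 sorts each row of the selected columns independently;
# assigning back through .loc leaves column C untouched.
df.loc[:, ['A', 'B']] = np.sort(df.loc[:, ['A', 'B']], axis=1)
print(df)
```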