Any help will be appreciated.
I have 2 DataFrames.
The first DataFrame, schedule, consists of an activity schedule per person, as follows:
PersonID Person Origin Destination
3-1 1 A B
3-1 1 B A
13-1 1 C D
13-1 1 D C
13-2 2 A B
13-2 2 B A
And I have another DataFrame, household, containing the details of the person/agent.
PersonID1 Age1 Gender1 PersonID2 Age2 Gender2
3-1 20 M NaN NaN NaN
13-1 45 F 13-2 17 M
I want to perform a VLOOKUP on these two using pd.merge. Since the lookup (merge) depends on the person's ID, I tried to do that with a condition.
def merging(row):
    if row['Person'] == 1:
        row = pd.merge(row, household, how='left', left_on=['PersonID'], right_on=['Age1', 'Gender1'])
    else:
        row = pd.merge(row, household, how='left', left_on=['PersonID'], right_on=['Age2', 'Gender2'])
    return row

schedule_merged = schedule.apply(merging, axis=1)
However, for some reason, it just doesn't work. The error says ValueError: len(right_on) must equal len(left_on). I'm aiming to end up with this kind of data:
PersonID Person Origin Destination Age Gender
3-1 1 A B 20 M
3-1 1 B A 20 M
13-1 1 C D 45 F
13-1 1 D C 45 F
13-2 2 A B 17 M
13-2 2 B A 17 M
I think I messed up the pd.merge lines. While it might be easier to use VLOOKUP in Excel, it's just too heavy for my PC, since I have to apply this to hundreds of thousands of rows. How could I do this properly? Thanks!
This is how I would do it if the real dataset is no more complicated than the given example. Otherwise I would suggest looking at pd.melt() for more complex unpivoting.
import pandas as pd
import numpy as np
# Create Dummy schedule DataFrame
d = {'PersonID': ['3-1', '3-1', '13-1', '13-1', '13-2', '13-2'], 'Person': ['1', '1', '1', '1', '2', '2'], 'Origin': ['A', 'B', 'C', 'D', 'A', 'B'], 'Destination': ['B', 'A', 'D', 'C', 'B', 'A']}
schedule = pd.DataFrame(data=d)
schedule
# Create Dummy household DataFrame
d = {'PersonID1': ['3-1', '13-1'], 'Age1': ['20', '45'], 'Gender1': ['M', 'F'], 'PersonID2': [np.nan, '13-2'], 'Age2': [np.nan, '17'], 'Gender2': [np.nan, 'M']}
household = pd.DataFrame(data=d)
household
# Select columns for PersonID1 and rename columns
household1 = household[['PersonID1', 'Age1', 'Gender1']]
household1.columns = ['PersonID', 'Age', 'Gender']
# Select columns for PersonID2 and rename columns
household2 = household[['PersonID2', 'Age2', 'Gender2']]
household2.columns = ['PersonID', 'Age', 'Gender']
# Concat them together
household_new = pd.concat([household1, household2])
# Merge household and schedule df together on PersonID
schedule = schedule.merge(household_new, how='left', left_on='PersonID', right_on='PersonID', validate='many_to_one')
Output
PersonID Person Origin Destination Age Gender
3-1 1 A B 20 M
3-1 1 B A 20 M
13-1 1 C D 45 F
13-1 1 D C 45 F
13-2 2 A B 17 M
13-2 2 B A 17 M
Related
There is a DataFrame.
I would like to add a column 'e' after checking the conditions below:
if the value of 'c' is in column 'a' AND the value of 'd' is in column 'b' at the same row, then the value of 'e' is 'OK',
else ''
import pandas as pd
import numpy as np
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9']}
df = pd.DataFrame(A)
The result I want to get is
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9'], 'e':['OK','','','']}
You can merge df with itself on ['a', 'b'] on the left and ['c', 'd'] on the right. If index 'survives' the merge, then e should be OK:
df['e'] = np.where(
    df.index.isin(df.merge(df, left_on=['a', 'b'], right_on=['c', 'd']).index),
    'OK', '')
df
Output:
a b c d e
0 0 4 1 1 OK
1 2 5 2 4
2 1 1 3 2
3 4 7 6 9
P.S. Before the merge, we need to convert the a and b columns to str type (or c and d to numeric) so that we can compare c with a, and d with b:
df[['a', 'b']] = df[['a', 'b']].astype(str)
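An alternative sketch (not from the original answer) that avoids relying on the merge result's index: a left merge with indicator=True marks each row whose (c, d) pair also occurs somewhere as an (a, b) pair:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 2, 1, 4], 'b': [4, 5, 1, 7],
                   'c': ['1', '2', '3', '6'], 'd': ['1', '4', '2', '9']})

# Cast a/b to str so they compare against the string columns c/d
df[['a', 'b']] = df[['a', 'b']].astype(str)

# drop_duplicates keeps the merge one-to-at-most-one, so the result
# stays aligned row-for-row with df even if (a, b) pairs repeat
pairs = df[['a', 'b']].drop_duplicates()

# indicator=True adds a '_merge' column: 'both' where the row's (c, d)
# pair was found among the (a, b) pairs, 'left_only' otherwise
merged = df.merge(pairs, left_on=['c', 'd'], right_on=['a', 'b'],
                  how='left', indicator=True)
df['e'] = np.where(merged['_merge'].eq('both'), 'OK', '')
print(df)  # e is ['OK', '', '', '']
```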
List1 = [[1,A,!,a],[2,B,#,b],[7,C,&,c],[1,B,#,c],[4,D,#,p]]
The output should look like this: each column should contain one element from each sublist. For example:
column1: [1, 2, 7, 1, 4]
column2: [A, B, C, B, D]
column3: [!, #, &, #, #]
column4: [a, b, c, c, p]
all in the same DataFrame.
Assuming that you actually meant for List1 to be this (all elements are strings):
list1 = [["1","A","!","a"],["2","B","#","b"],["7","C","&","c"],["1","B","#","c"],["4","D","#","p"]]
I don't think you need to do anything except pass list1 to the DataFrame constructor. There are several ways to pass information to a DataFrame; using a list of lists constructs unnamed columns.
print(pd.DataFrame(list1))
0 1 2 3
0 1 A ! a
1 2 B # b
2 7 C & c
3 1 B # c
4 4 D # p
Given the list below:
l = [['1', 'A', '!', 'a'], ['2', 'B', '#', 'b'], ['7', 'C', '&', 'c'], ['1', 'B', '#', 'c'], ['4', 'D', '#', 'p']]
You can use pandas.DataFrame to convert it as below:
import pandas as pd
pd.DataFrame(l, columns=['c1', 'c2', 'c3', 'c4'])
# columns parameter for passing customized column names
Result:
c1 c2 c3 c4
0 1 A ! a
1 2 B # b
2 7 C & c
3 1 B # c
4 4 D # p
As commented (and illustrated by John L.'s answer), pandas.DataFrame should be sufficient. If what you actually want is a transposed DataFrame, transpose it manually:
import pandas as pd
df = pd.DataFrame(List1).T
Or beforehand using zip:
df = pd.DataFrame(list(zip(*List1)))
Both of which return:
0 1 2 3 4
0 1 2 7 1 4
1 A B C B D
2 ! # & # #
3 a b c c p
I am trying to have a dataframe that includes the following two outputs, side by side as columns:
finalcust = mainorder_df, custname1_df
print(finalcust)
finalcust
Out[46]:
(10 10103.0
26 10104.0
39 10105.0
54 10106.0
72 10107.0
...
2932 10418.0
2941 10419.0
2955 10420.0
2977 10424.0
2983 10425.0
Name: ordernumber, Length: 213, dtype: float64,
1 Signal Gift Stores
2 Australian Collectors, Co.
3 La Rochelle Gifts
4 Baane Mini Imports
5 Mini Gifts Distributors Ltd.
...
117 Motor Mint Distributors Inc.
118 Signal Collectibles Ltd.
119 Double Decker Gift Stores, Ltd
120 Diecast Collectables
121 Kelly's Gift Shop
Name: customerName, Length: 91, dtype: object)
I have tried pd.merge, but it says I am not allowed since there is no common column.
Does anyone have any idea?
What are you actually trying to accomplish?
General Merging with df.merge()
The data frames cannot be merged because they are not related in any way. Pandas expects them to have a common column in order to know how to merge them. See the pandas.DataFrame.merge docs.
Example: If you wanted to take information from a customer information sheet and add it to an order list.
import pandas as pd

customers = ['A', 'B', 'C', 'D']
addresses = ['Address_A', 'Address_B', 'Address_C', 'Address_D']
df1 = pd.DataFrame({'Customer': customers,
                    'Info': addresses})
df2 = pd.DataFrame({'Customer': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B'],
                    'Order': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
df = df1.merge(df2)
df =
  Customer       Info  Order
0        A  Address_A      1
1        A  Address_A      5
2        A  Address_A      9
3        B  Address_B      2
4        B  Address_B      6
5        B  Address_B     10
6        C  Address_C      3
7        C  Address_C      7
8        D  Address_D      4
9        D  Address_D      8
Combining with df.concat()
If they had the same columns, you would use concat to stack them vertically. There is a post about it here.
Example: Adding a new list of customers to the customer df
import pandas as pd
customers = ['A', 'B', 'C', 'D']
addresses = ['Address_A', 'Address_B', 'Address_C', 'Address_D']
new_customers = ['E', 'F', 'G', 'H']
new_addresses = ['Address_E', 'Address_F', 'Address_G', 'Address_G']
df1 = pd.DataFrame({'Customer': customers,
                    'Info': addresses})
df2 = pd.DataFrame({'Customer': new_customers,
                    'Info': new_addresses})
df = pd.concat([df1, df2])
df =
Customer Info
0 A Address_A
1 B Address_B
2 C Address_C
3 D Address_D
0 E Address_E
1 F Address_F
2 G Address_G
3 H Address_G
Combining "Side by Side" by Adding a New Column
The side by side method of combination would be adding a column.
Example: Adding a new column to customer information df.
import pandas as pd
customers = ['A', 'B', 'C', 'D']
addresses = ['Address_A', 'Address_B', 'Address_C', 'Address_D']
phones = [1,2,3,4]
df = pd.DataFrame({'Customer': customers,
                   'Info': addresses})
df['Phones'] = phones
df =
Customer Info Phones
0 A Address_A 1
1 B Address_B 2
2 C Address_C 3
3 D Address_D 4
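Coming back to the question's two Series of different lengths: if the goal is only to display them next to each other, with no row relationship implied, pd.concat with axis=1 also works and pads the shorter one with NaN. A sketch with a few values taken from the question's output:

```python
import pandas as pd

# Two Series of different lengths, like ordernumber (213 rows) and
# customerName (91 rows) in the question
orders = pd.Series([10103.0, 10104.0, 10105.0], name='ordernumber')
names = pd.Series(['Signal Gift Stores', 'La Rochelle Gifts'], name='customerName')

# axis=1 places them side by side as columns; reset_index(drop=True)
# aligns them positionally, and the shorter column is padded with NaN
finalcust = pd.concat([orders.reset_index(drop=True),
                       names.reset_index(drop=True)], axis=1)
print(finalcust)
```

Whether the NaN padding is acceptable depends on what the combined table is for; if each order should carry its customer's name, a merge on a shared key is still the right tool.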
Actually Doing...?
If you are trying to assign a customer name to an order, that can't be done with the data you have here.
Hope this helps.
How do you combine multiple columns into one staggered column? For example, if I have data:
Column 1 Column 2
0 A E
1 B F
2 C G
3 D H
And I want it in the form:
Column 1
0 A
1 E
2 B
3 F
4 C
5 G
6 D
7 H
What is a good, vectorized pythonic way to go about doing this? I could probably do some sort of df.apply() hack but I'm betting there is a better way. The application is putting multiple dimensions of time series data into a single stream for ML applications.
First stack the columns and then drop the multiindex:
df.stack().reset_index(drop=True)
Out:
0 A
1 E
2 B
3 F
4 C
5 G
6 D
7 H
dtype: object
To get a dataframe:
pd.DataFrame(df.values.reshape(-1, 1), columns=['Column 1'])
For a Series answering the OP's question:
pd.Series(df.values.flatten(), name='Column 1')
For a Series used in the timing tests:
pd.Series(get_df(n).values.flatten(), name='Column 1')
Timing
code
def get_df(n=1):
    df = pd.DataFrame({'Column 2': {0: 'E', 1: 'F', 2: 'G', 3: 'H'},
                       'Column 1': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}})
    return pd.concat([df for _ in range(n)])
(Timing plots omitted: given sample, given sample × 10,000, given sample × 1,000,000.)
Hello, I have the following DataFrame:
df =
ID Value
a 45
b 3
c 10
And another DataFrame with the numeric ID of each value:
df1 =
ID ID_n
a 3
b 35
c 0
d 7
e 1
I would like to have a new column in df with the numeric ID, so:
df =
ID Value ID_n
a 45 3
b 3 35
c 10 0
Thanks
Use pandas merge:
import pandas as pd

df1 = pd.DataFrame({
    'ID': ['a', 'b', 'c'],
    'Value': [45, 3, 10]
})
df2 = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd', 'e'],
    'ID_n': [3, 35, 0, 7, 1],
})
df1.set_index(['ID'], drop=False, inplace=True)
df2.set_index(['ID'], drop=False, inplace=True)
print(pd.merge(df1, df2, on="ID", how='left'))
output:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You could use join(),
In [14]: df1.join(df2)
Out[14]:
Value ID_n
ID
a 45 3
b 3 35
c 10 0
If you want index to be numeric you could reset_index(),
In [17]: df1.join(df2).reset_index()
Out[17]:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You can do this in a single operation. join works on the index, which you don't appear to have set. Just set the index to ID, join df1 after also setting its index to ID, and then reset your index to return your original DataFrame with the new column added.
>>> df.set_index('ID').join(df1.set_index('ID')).reset_index()
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
Also, because you don't do an in-place set_index on df1, its structure remains the same (i.e. you don't change its indexing).
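For a single lookup column like ID_n, Series.map is a lightweight alternative to merge/join that behaves much like a plain VLOOKUP; a minimal sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b', 'c'], 'Value': [45, 3, 10]})
df1 = pd.DataFrame({'ID': ['a', 'b', 'c', 'd', 'e'],
                    'ID_n': [3, 35, 0, 7, 1]})

# Build a lookup Series indexed by ID, then map each ID in df through it;
# IDs present only in df1 (d, e) are simply ignored
df['ID_n'] = df['ID'].map(df1.set_index('ID')['ID_n'])
print(df)  # ID_n column is [3, 35, 0]
```

Unlike a left merge, map cannot duplicate rows, so it is a safe choice when the lookup table's ID column is unique.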