Pandas python - matching values - python

I currently have two dataframes that have two matching columns. For example :
Data frame 1 with columns : A,B,C
Data frame 2 with column : A
I want to keep all lines in the first dataframe that have the values that the A contains. For example if df2 and df1 are:
df1
A B C
0 1 3
4 2 5
6 3 1
8 0 0
2 1 1
df2
Α
4
6
1
So in this case, I want to only keep the second and third line of df1.
I tried doing it like this, but it didnt work since both dataframes are pretty big:
for index, row in df1.iterrows():
counter = 0
for index2,row2 in df2.iterrows():
if row["A"] == row2["A"]:
counter = counter + 1
if counter == 0:
df2.drop(index, inplace=True)

Use isin to test for membership:
In [176]:
df1[df1['A'].isin(df2['A'])]
Out[176]:
A B C
1 4 2 5
2 6 3 1

Or use the merge method:
df1= pandas.DataFrame([[0,1,3],[4,2,5],[6,3,1],[8,0,0],[2,1,1]], columns = ['A', 'B', 'C'])
df2= pandas.DataFrame([4,6,1], columns = ['A'])
df2.merge(df1, on = 'A')

Related

Python Pandas Compare 2 dataFrame [duplicate]

This question already has an answer here:
Find unique column values out of two different Dataframes
(1 answer)
Closed 1 year ago.
i'm working on python with Pandas and i have 2 dataFrame
1 'A'
2 'B'
1 'A'
2 'B'
3 'C'
4 'D'
and i want to return the difference:
1 'C'
2 'D'
You can concatenate two dataframes and drop duplicates:
pd.concat([df1, df2]).drop_duplicates(keep=False)
If your dataframe contains more columns you can add a certain column name as a subset:
pd.concat([df1, df2]).drop_duplicates(subset='col_name', keep=False)
What i retrieve with pd.concat([df1, df2]).drop_duplicates(keep=False)
(N = name of column)
df1:
N
0 A
1 B
2 C
df2:
N
0 A
1 B
2 C
df3
N
0 A
1 B
2 C
0 A
1 B
2 C
Value in df is phone Number without '+' in it. i can't show them.
i import them with :
df1 = pd.DataFrame(ListResponse, columns=['33000000000'])
df2 = pd.read_csv('number.csv')
ListResponse return List with number and number.csv is ListResponse that i save in csv the last time i run the script
edit:
(what i want in this case is "Empty DataFrame")
just test with new value :
df3:
N
0 A
1 B
2 C
3 D
0 B
1 C
2 D
Edit2: i think drop_duplicate is not working because my func implement new value as index = 0 and not index = length+1 like you can see just above. but when same values in both df, it not return me empty df...

Shuffle Columns in Dataframe

I want to shuffle columns without order; completely pseudo-randomly, on one line of code.
Before:
A B
0 1 2
1 1 2
After:
B A
0 2 1
1 2 1
My attempts so far:
df = df.reindex(columns=columns)
df.sample(frac=1, axis=1)
df.apply(np.random.shuffle, axis=1)
You can use np.random.default_rng()'s permutation with a seed to make it reproducible.
df = df[np.random.default_rng(seed=42).permutation(df.columns.values)]
Use DataFrame.sample with the axis argument set to columns (1):
df = df.sample(frac=1, axis=1)
print(df)
B A
0 2 1
1 2 1
Or use Series.sample with columns converted to Series and change order of columns by subset:
df = df[df.columns.to_series().sample(frac=1)]
print(df)
B A
0 2 1
1 2 1
Use numpy.random.permutation with list of column names.
df = df[np.random.permutation(df.columns)]

Drop rows and sort one dataframe according to another

I have two pandas data frames (df1 and df2):
# df1
ID COL
1 A
2 F
2 A
3 A
3 S
3 D
4 D
# df2
ID VAL
1 1
2 0
3 0
3 1
4 0
My goal is to append the corresponding val from df2 to each ID in df1. However, the relationship is not one-to-one (this is my client's fault and there's nothing I can do about this). To solve this problem, I want to sort df1 by df2['ID'] such that df1['ID'] is identical to df2['ID'].
So basically, for any row i in 0 to len(df2):
if df1.loc[i, 'ID'] == df2.loc[i, 'ID'] then keep row i in df1.
if df1.loc[i, 'ID'] != df2.loc[i, 'ID'] then drop row i from df1 and repeat.
The desired result is:
ID COL
1 A
2 F
3 A
3 S
4 D
This way, I can use pandas.concat([df1, df2['ID']], axis=0) to assign df2[VAL] to df1.
Is there a standardized way to do this? Does pandas.merge() have a method for doing this?
Before this gets voted as a duplicate, please realize that len(df1) != len(df2), so threads like this are not quite what I'm looking for.
This can be done with merge on both ID and the order within each ID:
(df1.assign(idx=df1.groupby('ID').cumcount())
.merge(df2.assign(idx=df2.groupby('ID').cumcount()),
on=['ID','idx'],
suffixes=['','_drop'])
[df1.columns]
)
Output:
ID COL
0 1 A
1 2 F
2 3 A
3 3 S
4 4 D
The simplest way I can see of getting the result you want is:
# Add a count for each repetition of the ids to temporary frames
x = df1.assign(id_counter=df1.groupby('ID').cumcount())
y = df2.assign(id_counter=df2.groupby('ID').cumcount())
# Merge using the ID and the repetition counter
df1 = pd.merge(x, y, how='right', on=['ID', 'id_counter']).drop('id_counter', axis=1)
Which would produce this output:
ID COL VAL
0 1 A 1
1 2 F 0
2 3 A 0
3 3 S 1
4 4 D 0

Concatenate dataframes alternating rows with Pandas

I have two dataframes df1 and df2 that are defined like so:
df1 df2
Out[69]: Out[70]:
A B A B
0 2 a 0 5 q
1 1 s 1 6 w
2 3 d 2 3 e
3 4 f 3 1 r
My goal is to concatenate the dataframes by alternating the rows so that the resulting dataframe is like this:
dff
Out[71]:
A B
0 2 a <--- belongs to df1
0 5 q <--- belongs to df2
1 1 s <--- belongs to df1
1 6 w <--- belongs to df2
2 3 d <--- belongs to df1
2 3 e <--- belongs to df2
3 4 f <--- belongs to df1
3 1 r <--- belongs to df2
As you can see the first row of dff corresponds to the first row of df1 and the second row of dff is the first row of df2. The pattern repeats until the end.
I tried to reach my goal by using the following lines of code:
import pandas as pd
df1 = pd.DataFrame({'A':[2,1,3,4], 'B':['a','s','d','f']})
df2 = pd.DataFrame({'A':[5,6,3,1], 'B':['q','w','e','r']})
dfff = pd.DataFrame()
for i in range(0,4):
dfx = pd.concat([df1.iloc[i].T, df2.iloc[i].T])
dfff = pd.concat([dfff, dfx])
However this approach doesn't work because df1.iloc[i] and df2.iloc[i] are automatically reshaped into columns instead of rows and I cannot revert the process (even by using .T).
Question: Can you please suggest me a nice and elegant way to reach my goal?
Optional: Can you also provide an explanation about how to convert a column back to row?
I'm unable to comment on the accepted answer, but note that the sort operation in unstable by default, so you must choose a stable sorting algorithm.
pd.concat([df1, df2]).sort_index(kind='merge')
IIUC
In [64]: pd.concat([df1, df2]).sort_index()
Out[64]:
A B
0 2 a
0 5 q
1 1 s
1 6 w
2 3 d
2 3 e
3 4 f
3 1 r

Pandas: merge multiple dataframes and control column names?

I would like to merge nine Pandas dataframes together into a single dataframe, doing a join on two columns, controlling the column names. Is this possible?
I have nine datasets. All of them have the following columns:
org, name, items,spend
I want to join them into a single dataframe with the following columns:
org, name, items_df1, spend_df1, items_df2, spend_df2, items_df3...
I've been reading the documentation on merging and joining. I can currently merge two datasets together like this:
ad = pd.DataFrame.merge(df_presents, df_trees,
on=['practice', 'name'],
suffixes=['_presents', '_trees'])
This works great, doing print list(aggregate_data.columns.values) shows me the following columns:
[org', u'name', u'spend_presents', u'items_presents', u'spend_trees', u'items_trees'...]
But how can I do this for nine columns? merge only seems to accept two at a time, and if I do it sequentially, my column names are going to end up very messy.
You could use functools.reduce to iteratively apply pd.merge to each of the DataFrames:
result = functools.reduce(merge, dfs)
This is equivalent to
result = dfs[0]
for df in dfs[1:]:
result = merge(result, df)
To pass the on=['org', 'name'] argument, you could use functools.partial define the merge function:
merge = functools.partial(pd.merge, on=['org', 'name'])
Since specifying the suffixes parameter in functools.partial would only allow
one fixed choice of suffix, and since here we need a different suffix for each
pd.merge call, I think it would be easiest to prepare the DataFrames column
names before calling pd.merge:
for i, df in enumerate(dfs, start=1):
df.rename(columns={col:'{}_df{}'.format(col, i) for col in ('items', 'spend')},
inplace=True)
For example,
import pandas as pd
import numpy as np
import functools
np.random.seed(2015)
N = 50
dfs = [pd.DataFrame(np.random.randint(5, size=(N,4)),
columns=['org', 'name', 'items', 'spend']) for i in range(9)]
for i, df in enumerate(dfs, start=1):
df.rename(columns={col:'{}_df{}'.format(col, i) for col in ('items', 'spend')},
inplace=True)
merge = functools.partial(pd.merge, on=['org', 'name'])
result = functools.reduce(merge, dfs)
print(result.head())
yields
org name items_df1 spend_df1 items_df2 spend_df2 items_df3 \
0 2 4 4 2 3 0 1
1 2 4 4 2 3 0 1
2 2 4 4 2 3 0 1
3 2 4 4 2 3 0 1
4 2 4 4 2 3 0 1
spend_df3 items_df4 spend_df4 items_df5 spend_df5 items_df6 \
0 3 1 0 1 0 4
1 3 1 0 1 0 4
2 3 1 0 1 0 4
3 3 1 0 1 0 4
4 3 1 0 1 0 4
spend_df6 items_df7 spend_df7 items_df8 spend_df8 items_df9 spend_df9
0 3 4 1 3 0 1 2
1 3 4 1 3 0 0 3
2 3 4 1 3 0 0 0
3 3 3 1 3 0 1 2
4 3 3 1 3 0 0 3
Would doing a big pd.concat() and then renaming all the columns work for you? Something like:
desired_columns = ['items', 'spend']
big_df = pd.concat([df1, df2[desired_columns], ..., dfN[desired_columns]], axis=1)
new_columns = ['org', 'name']
for i in range(num_dataframes):
new_columns.extend(['spend_df%i' % i, 'items_df%i' % i])
bid_df.columns = new_columns
This should give you columns like:
org, name, spend_df0, items_df0, spend_df1, items_df1, ..., spend_df8, items_df8
I've wanted this as well at times but been unable to find a built-in pandas way of doing it. Here is my suggestion (and my plan for the next time I need it):
Create an empty dictionary, merge_dict.
Loop through the index you want for each of your data frames and add the desired values to the dictionary with the index as the key.
Generate a new index as sorted(merge_dict).
Generate a new list of data for each column by looping through merge_dict.items().
Create a new data frame with index=sorted(merge_dict) and columns created in the previous step.
Basically, this is somewhat like a hash join in SQL. Seems like the most efficient way I can think of and shouldn't take too long to code up.
Good luck.

Categories