I have a very large data frame, and also a small data frame, both with the same columns.
The small data frame contains some rows that are already present in the big one. I want to append the small data frame to the big one so that the result contains no duplicates.
I could simply append and then drop the duplicates, but that wastes memory by keeping the duplicated data frame around.
Is there a more memory-efficient way to do this?
What about isin?
Data:
df1 = pd.DataFrame({'a': [1,2,3,4,5,6,7]})
df2 = pd.DataFrame({'a': [3,4,9]})
Code:
df1 = pd.concat([df1, df2[~df2['a'].isin(df1['a'])]], ignore_index=True)
Output:
   a
0  1
1  2
2  3
3  4
4  5
5  6
6  7
7  9
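A caveat worth noting (my addition, not part of the original answer): `DataFrame.isin(DataFrame)` matches values label by label, so it only reports True where the same value sits at the same index, while `Series.isin` checks plain membership, which is what deduplication needs. A quick sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7]})
df2 = pd.DataFrame({'a': [3, 4, 9]})

# DataFrame.isin compares element-wise at matching labels:
# df2 row 0 (value 3) is checked against df1 row 0 (value 1) -> False.
print(df2.isin(df1)['a'].tolist())       # [False, False, False]

# Series.isin checks membership anywhere in the other column:
print(df2['a'].isin(df1['a']).tolist())  # [True, True, False]
```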
Data:
df1 = pd.DataFrame({'a': [1,2,3,4,5,6,7]})
df2 = pd.DataFrame({'a': [3,8,4,9]})
Use merge with indicator=True to flag the rows of df2 that are not present in df1:
df3 = df2.merge(df1, how='left', indicator=True)
a _merge
0 3 both
1 8 left_only
2 4 both
3 9 left_only
Now, select rows with 'left_only',
df3 = df3[df3['_merge'] == 'left_only'].iloc[:, :-1]
Finally, append them.
df1 = pd.concat([df1, df3], ignore_index=True)
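For reference, the three steps can also be collapsed into one chained expression; this is a sketch using the same df1/df2 as above:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7]})
df2 = pd.DataFrame({'a': [3, 8, 4, 9]})

# Merge with an indicator, keep the rows found only in df2, drop the
# helper column, and append the survivors in a single pass.
df1 = pd.concat(
    [df1,
     df2.merge(df1, how='left', indicator=True)
        .query('_merge == "left_only"')
        .drop(columns='_merge')],
    ignore_index=True,
)
print(df1['a'].tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```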
I have a DataFrame with 100 columns (only three are shown here) and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd
df = pd.DataFrame()
df['id'] = [1, 2, 3]
df['c1'] = [1, 5, 1]
df['c2'] = [-1, 6, 5]
df
I want to gather the values of all columns for each id and put them into one column. For example, for id=1 I want to put its c1 and c2 values (1 and -1) into one column. Here is the DataFrame that I want.
Note: df.melt does not solve my question, since I want to keep the ids as well.
Note 2: I have already tried stack with reset_index, and it does not help:
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id"; then stack + reset_index:
out = (df.set_index('id').stack()
.droplevel(1).reset_index(name='c'))
Output:
id c
0 1 1
1 1 -1
2 2 5
3 2 6
4 3 1
5 3 5
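For what it's worth, melt can also keep the ids if you pass id_vars (the question's note may have tried it without); a sketch, with kind='stable' so the row order within each id stays deterministic:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'c1': [1, 5, 1], 'c2': [-1, 6, 5]})

# id_vars keeps 'id' alongside the unpivoted values; the 'variable'
# column (c1/c2 labels) is dropped since only the values are wanted.
out = (df.melt(id_vars='id', value_name='c')
         .drop(columns='variable')
         .sort_values('id', kind='stable', ignore_index=True))
print(out)
```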
I have two dataframes, df1 and df2, both containing duplicate rows. I want to merge them. What I have tried so far is to remove the duplicates from df2, since I need all the rows from df1.
This question might look like a duplicate, but I didn't find any solution or hints for this particular scenario.
data = {'Name':['ABC', 'DEF', 'ABC','MNO', 'XYZ','XYZ','PQR','ABC'],
'Age':[1,2,3,4,2,1,2,4]}
data2 = {'Name':['XYZ', 'NOP', 'ABC','MNO', 'XYZ','XYZ','PQR','ABC'],
'Sex':['M','F','M','M','M','M','F','M']}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
dfn = df1.merge(df2.drop_duplicates('Name'),on='Name')
print(dfn)
Result of the above snippet:
Name Age Sex
0 ABC 1 M
1 ABC 3 M
2 ABC 4 M
3 MNO 4 M
4 XYZ 2 M
5 XYZ 1 M
6 PQR 2 F
This works perfectly well for the data above, but I have a large dataset, and there this method behaves differently: I am getting many more rows than expected in dfn.
I suspect the extra rows come from the larger number of duplicates, but I cannot afford to delete the duplicate rows from df1.
Apologies, I am not able to share the actual data as it is too large!
Edit: A sample result from the actual data: even after removing duplicates from df2, the result dfn still contains repeated rows, although df1 has only one entry each for ABC and XYZ.
Thanks in advance!
Try to drop_duplicates from df1 too:
dfn = pd.merge(df1.drop_duplicates(), df2.drop_duplicates('Name'),
               on='Name', how='left')
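The extra rows the asker sees follow from merge semantics: if a key occurs m times on the left and n times on the right, the merge emits m × n rows for that key. A minimal illustration (my own toy frames, not the asker's data):

```python
import pandas as pd

left = pd.DataFrame({'Name': ['ABC', 'ABC'], 'Age': [1, 3]})
right = pd.DataFrame({'Name': ['ABC', 'ABC'], 'Sex': ['M', 'M']})

# The key 'ABC' appears twice on each side: 2 x 2 = 4 result rows.
print(len(left.merge(right, on='Name')))                          # 4

# Deduplicating the right side keeps one result row per left row.
print(len(left.merge(right.drop_duplicates('Name'), on='Name')))  # 2
```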
I have two dataframes:
df1 =
          0                2
1  _A1-Site_0_norm  _A1-Site_1_norm
and df2 =
          0         2
2  0.500000  0.012903
3  0.010870  0.013793
4  0.011494  0.016260
I want to use df1 as the header of df2, so that df1 becomes either the column header or the first row:
   _A1-Site_0_norm  _A1-Site_1_norm
2  0.500000         0.012903
3  0.010870         0.013793
4  0.011494         0.016260
I have many columns, so it will not work to hard-code
df2.columns = ["_A1-Site_0_norm", "_A1-Site_1_norm"]
I thought of building a list of all the items in df1 and passing it to df2.columns, but I am having problems converting the elements in the first row of df1 into a list.
I am not married to that approach; any alternative is welcome.
Many thanks.
If I understood your question correctly, this example should work for you:
d = {'A': [1], 'B': [2], 'C': [3]}
df = pd.DataFrame(data=d)
d2 = {'1': ['D'], '2': ['E'], '3': ['F']}
df2 = pd.DataFrame(data=d2)
df.columns = df2.iloc[0].tolist()  # take the first row of df2 as the new header
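The same idea applied to the original question: take the single header row with iloc[0] and assign it. A sketch with the question's header strings and made-up numeric values:

```python
import pandas as pd

# A one-row frame holding the intended header ...
header = pd.DataFrame([['_A1-Site_0_norm', '_A1-Site_1_norm']])
# ... and the data frame that should receive it.
data = pd.DataFrame([[0.5, 0.012903], [0.01087, 0.013793]])

# iloc[0] picks the first row as a Series; tolist() flattens it
# into a plain list suitable for assignment to .columns.
data.columns = header.iloc[0].tolist()
print(list(data.columns))  # ['_A1-Site_0_norm', '_A1-Site_1_norm']
```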
I have about 20 data frames, all with the same columns, and I would like to add their data into one empty data frame. But when I use my code
interested_freq
UPC CPC freq
0 136.0 B64G 2
1 136.0 H01L 1
2 136.0 H02S 1
3 244.0 B64G 1
4 244.0 H02S 1
5 257.0 B64G 1
6 257.0 H01L 1
7 312.0 B64G 1
8 312.0 H02S 1
list_of_lists = []
max_freq = df_interested_freq[df_interested_freq['freq'] == df_interested_freq['freq'].max()]
for row, cols in max_freq.iterrows():
    interested_freq = df_interested_freq[df_interested_freq['freq'] != 1]
    list_of_lists.append(interested_freq)
list_of_lists
to append the first data frame, and I then change the names in that code, hoping it will append more data:
list_of_lists = []
for row, cols in max_freq.iterrows():
    interested_freq_1 = df_interested_freq_1[df_interested_freq_1['freq'] != 1]
    list_of_lists.append(interested_freq_1)
list_of_lists
but the first data frame disappears and only the most recently appended data remains. Have I done something wrong?
One way to create a new DataFrame from an existing DataFrame is to use df.copy(); see the pandas documentation for the details.
df.copy() is very much relevant here, because changing a subset of the data within the new dataframe would otherwise change the initial DataFrame too, so without it you have a fair chance of losing your actual DataFrame.
Suppose the example DataFrame is df1:
>>> df1
col1 col2
1 11 12
2 21 22
As a solution, you can use the df.copy method as follows, which will carry the data along:
>>> df2 = df1.copy()
>>> df2
col1 col2
1 11 12
2 21 22
In case you need the new dataframe (df2) to be created like df1, but don't want the values inserted across it, you have the option to use the reindex_like() method:
>>> df2 = pd.DataFrame().reindex_like(df1)
# df2 = pd.DataFrame(data=np.nan,columns=df1.columns, index=df1.index)
>>> df2
col1 col2
1 NaN NaN
2 NaN NaN
Why do you use append here? It's not a list. Once you have the first dataframe (called df1, for example), try:
new_df = df1
new_df = pd.concat([new_df, df2])
You can do the same thing for all 20 dataframes.
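With 20 frames, a common pattern is to collect them in a list and call pd.concat once at the end, which avoids repeatedly copying a growing frame. A sketch with three small frames standing in for the 20:

```python
import pandas as pd

# Three toy frames with the same columns, standing in for the 20.
frames = [pd.DataFrame({'freq': [i, i + 1]}) for i in range(3)]

# One concat over the whole list instead of appending in a loop.
combined = pd.concat(frames, ignore_index=True)
print(len(combined))  # 6
```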
I am new to pandas and I am trying to join two dataframes based on the equality of one specific column. For example, suppose I have the following:
df1
A B C
1 2 3
2 2 2
df2
A B C
5 6 7
2 8 9
Both dataframes have the same columns and the value of only one column (say A) might be equal. What I want as output is this:
df3
A B C B C
2 8 9 2 2
The values for column 'A' are unique in both dataframes.
Thanks
pd.concat([df1.set_index('A'), df2.set_index('A')], axis=1, join='inner')
If you wish to maintain column A as a non-index, then:
pd.concat([df1.set_index('A'), df2.set_index('A')], axis=1, join='inner').reset_index()
Alternatively, you could just do:
df3 = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))
And then you can keep track of each value's origin
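Run against the question's frames, that merge keeps only the shared A value and disambiguates the repeated column names with the suffixes; a quick check:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [2, 2], 'C': [3, 2]})
df2 = pd.DataFrame({'A': [5, 2], 'B': [6, 8], 'C': [7, 9]})

# Inner join on A: only A=2 survives; B/C from each side get a suffix.
df3 = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))
print(df3.columns.tolist())  # ['A', 'B_1', 'C_1', 'B_2', 'C_2']
print(df3.values.tolist())   # [[2, 2, 2, 8, 9]]
```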