I have two dataframes:

df1

x1   y1   x2     y2     label
0    0    1240   1755   label1
0    0    1240   2      label2

df2

x1       y1      x2       y2      text
992.0    943.0   1166.0   974.0   tex1
1110.0   864.0   1166.0   890.0   text2
Based on a condition like the following:
if df1['x1'] >= df2['x1'] or df1['y1'] >= df2['y1']:
    # I want to add a new column 'text' in df1 with the text from df2.
    df1['text'] = df2['text']
What's more, it is possible for df2 to have more than one row that makes the above condition True, so I will need another if statement to pick the best match from df2.
My problem here is not the conditions, but how I am supposed to approach the interaction between the two dataframes. Any help or advice would be appreciated.
If you want to iterate from df1 through every row of df2 and return a match, you can do it with the .apply() function on df1 and use df2 as a lookup table.
NOTE: In the example below I return only the first match (by using .iloc[0]), not all the matches.
Create two dummy dataframes
import pandas as pd
df1 = pd.DataFrame({'x1': [1, 2, 3], 'y1': [1, 5, 6]})
df2 = pd.DataFrame({'x1': [11, 1, 13], 'y1': [3, 52, 26], 'text': ['text1', 'text2', 'text3']})
Create a lookup function
def apply_condition(row, df):
    condition = ((row['x1'] >= df['x1']) | (row['y1'] >= df['y1']))
    return df[condition]['text'].iloc[0]  # ATTENTION: only the first match is returned
Create new column and print results
df1['text'] = df1.apply(lambda row: apply_condition(row, df2), axis=1)
df1.head()
Result:
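With the dummy dataframes above, the output should look roughly like this:

   x1  y1   text
0   1   1  text2
1   2   5  text1
2   3   6  text1

If you need all matches instead of only the first one, a small variation of the lookup function (just a sketch on the same dummy data) could join them:

def apply_condition_all(row, df):
    # hypothetical variant: returns every matching text, comma-separated
    condition = ((row['x1'] >= df['x1']) | (row['y1'] >= df['y1']))
    return ', '.join(df.loc[condition, 'text'])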
I have two dataframes that look like this
df1 ==
IDLocation x-coord y-coord
1 -1.546 7.845
2 3.256 1.965
.
.
35 5.723 -2.724
df2 ==
PIDLocation DIDLocation
14 5
3 2
7 26
I want to replace the columns PIDLocation and DIDLocation with Px-coord, Py-coord, Dx-coord, Dy-coord, where PIDLocation and DIDLocation are both IDLocation values and each IDLocation corresponds to an x-coord and y-coord in the first dataframe.
If you set the ID column as the index of df1, you can get the coord values by indexing. I changed the values in df2 in the example below to avoid index errors that would result from not having the full dataset.
import pandas as pd
df1 = pd.DataFrame({'IDLocation': [1, 2, 35],
'x-coord': [-1.546, 3.256, 5.723],
'y-coord': [7.845, 1.965, -2.724]})
df2 = pd.DataFrame({'PIDLocation': [35, 1, 2],
'DIDLocation': [2, 1, 35]})
df1.set_index('IDLocation', inplace=True)
df2['Px-coord'] = [df1['x-coord'].loc[i] for i in df2.PIDLocation]
df2['Py-coord'] = [df1['y-coord'].loc[i] for i in df2.PIDLocation]
df2['Dx-coord'] = [df1['x-coord'].loc[i] for i in df2.DIDLocation]
df2['Dy-coord'] = [df1['y-coord'].loc[i] for i in df2.DIDLocation]
del df2['PIDLocation']
del df2['DIDLocation']
print(df2)
Px-coord Py-coord Dx-coord Dy-coord
0 5.723 -2.724 3.256 1.965
1 -1.546 7.845 -1.546 7.845
2 3.256 1.965 5.723 -2.724
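If you prefer to avoid the explicit list comprehensions, a vectorized sketch using Series.map (assuming df1 still has IDLocation set as its index and df2 still has its original PIDLocation/DIDLocation columns) would be:

df2['Px-coord'] = df2['PIDLocation'].map(df1['x-coord'])
df2['Py-coord'] = df2['PIDLocation'].map(df1['y-coord'])
df2['Dx-coord'] = df2['DIDLocation'].map(df1['x-coord'])
df2['Dy-coord'] = df2['DIDLocation'].map(df1['y-coord'])
df2 = df2.drop(columns=['PIDLocation', 'DIDLocation'])

Series.map looks each ID up in df1's index, so it does the same lookup as the list comprehensions above but without a Python-level loop.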
I have a dataframe that looks like this:
total_customers total_customer_2021-03-31 total_purchases total_purchases_2021-03-31
1 10 4 6
3 14 3 2
Now, I want to sum up, row-wise, the columns that are the same except for the suffix. I.e. the expected output is:
total_customers total_purchases
11 10
17 5
The reason I cannot do this manually is that I have 100+ column pairs, so I need an efficient way to do it. Also, the order of the columns is not predictable. What do you recommend?
Thanks!
Somehow we need to get an Index of columns such that paired columns share the same name; then we can do a groupby sum on axis=1:
cols = pd.Index(['total_customers', 'total_customers',
'total_purchases', 'total_purchases'])
result_df = df.groupby(cols, axis=1).sum()
With the shown example, we can use str.replace to replace an optional s, followed by an underscore, followed by the date format (four digits-two digits-two digits), with a single s. This pattern may need to be modified depending on the actual column names:
cols = df.columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)
result_df = df.groupby(cols, axis=1).sum()
result_df:
total_customers total_purchases
0 11 10
1 17 5
Setup and imports:
import pandas as pd
df = pd.DataFrame({
'total_customers': [1, 3],
'total_customer_2021-03-31': [10, 14],
'total_purchases': [4, 3],
'total_purchases_2021-03-31': [6, 2]
})
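Note that grouping along columns with axis=1 is deprecated in newer pandas releases (2.1+). If that affects your environment, a small sketch of an equivalent approach is to transpose, group on the index, and transpose back:

cols = df.columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)
result_df = df.T.groupby(cols).sum().T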
Assuming that your dataframe is called df, a direct solution is:
sum_customers = df['total_customers'] + df['total_customer_2021-03-31']
sum_purchases = df['total_purchases'] + df['total_purchases_2021-03-31']
df_total = pd.DataFrame({'total_customers': sum_customers, 'total_purchases': sum_purchases})
and that will give you the output you want.
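With the sample data from the question, this should give something like:

   total_customers  total_purchases
0               11               10
1               17                5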
import pandas as pd
data = {"total_customers": [1, 3], "total_customer_2021-03-31": [10, 14], "total_purchases": [4, 3], "total_purchases_2021-03-31": [6, 2]}
df = pd.DataFrame(data=data)
final_df = pd.DataFrame()
final_df["total_customers"] = df.filter(regex='total_customers*').sum(1)
final_df["total_purchases"] = df.filter(regex='total_purchases*').sum(1)
output
final_df
total_customers total_purchases
0 11 10
1 17 5
Using #HenryEcker's sample data, and building off of the example in the docs, you can create a function and groupby on the column axis:
def get_column(column):
    if column.startswith('total_customer'):
        return 'total_customers'
    return 'total_purchases'
df.groupby(get_column, axis=1).sum()
total_customers total_purchases
0 11 10
1 17 5
I shortened the column headings while coding, just for info.
data = {"total_c" : [1,3], "total_c_2021" :[10,14],
"total_p": [4,3], "total_p_2021": [6,2]}
df = pd.DataFrame(data)
df["total_customers"] = df["total_c"] + df["total_c_2021"]
df["total_purchases"] = df["total_p"] + df["total_p_2021"]
If you don't want to see the other columns, you can drop them by selecting only the new ones:
df = df.loc[:, ['total_customers', 'total_purchases']]
NEW PART
So I might have found a starting point for your solution! I don't know your column names, but the following code can be adapted if your column names follow a pattern (dates, names, etc.). Could you rename the columns with a loop?
df['total_customer'] = df[[col for col in df.columns if col.startswith('total_c')]].sum(axis=1)
And this solution might be helpful for you with some alterations.
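Building on the idea of renaming by pattern, here is a minimal loop-based sketch (assuming the suffix is always an underscore followed by a date, as in the example) that collects the paired columns and sums them:

import re
import pandas as pd

groups = {}
for col in df.columns:
    # strip an optional trailing 's' plus a '_YYYY-MM-DD' suffix, normalising to the plural name
    base = re.sub(r's?_\d{4}-\d{2}-\d{2}$', 's', col)
    groups.setdefault(base, []).append(col)

result = pd.DataFrame({base: df[cols].sum(axis=1) for base, cols in groups.items()})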
My integers become NaNs when I add the index to the DataFrame.
I run this:
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows)
print (newDF)
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer'])
newDF.index.name = 'Store'
print(newDF)
and I get this:
guavas pears avocados
0 10 111 200
1 20 222 3000
guavas pears avocados
Store
Thriftway NaN NaN NaN
Meijer NaN NaN NaN
The "old" newDF has index [0, 1] while the "new" newDF has index ['Thriftway', 'Meijer']. When using the DataFrame-constructor with a DataFrame, i.e. pd.DataFrame(newDF, index=['Thriftway', 'Meijer']), pandas internally does a reindex with the list in the index-argument on the index of newDF.
Values in the new index that do not have corresponding records in the DataFrame are assigned NaN. The index [0, 1] and the index ['Thriftway', 'Meijer'] have no overlapping values, thus the result is a DataFrame where every value is NaN.
To appreciate this try running the following:
import pandas as pd
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows)
print (newDF)
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer', 0, 1])
newDF.index.name = 'Store'
print(newDF)
and notice that the new DataFrame now contains the old data. To achieve what you want you can instead assign the new index labels directly to the existing DataFrame like so:
import pandas as pd
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows)
print(newDF)
newDF.index = ['Thriftway', 'Meijer']
newDF.index.name = 'Store'
You can even reproduce what pandas is doing internally by using the index-argument of reindex on the original (0/1-indexed) DataFrame:
pd.DataFrame(newRows).reindex(index=['Thriftway', 'Meijer'])
The result is, as before, a DataFrame where labels that were not in the DataFrame before have been assigned NaN:
guavas pears avocados
Thriftway NaN NaN NaN
Meijer NaN NaN NaN
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer'])
In the above line, you are passing both a dataframe and an index to pd.DataFrame().
From the source code of pandas.DataFrame(), I have picked out the relevant parts below, under the assumption that data is a dataframe:
def __init__(
    self,
    data=None,
    index: Optional[Axes] = None,
    columns: Optional[Axes] = None,
    dtype: Optional[Dtype] = None,
    copy: bool = False,
):
    if isinstance(data, BlockManager):
        if index is None and columns is None and dtype is None and copy is False:
            # GH#33357 fastpath
            NDFrame.__init__(self, data)
            return

        mgr = self._init_mgr(
            data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
        )
If an index is given, pandas.DataFrame() reindexes the passed dataframe to it: the result keeps the same columns, and any index label that is not present in the passed dataframe gets NaN in every cell.
If index is not given, it creates a dataframe identical to the passed dataframe, including its index, columns and data.
As far as I understand, you want to set the index of your dataframe to something other than 0, 1. However,
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer'])
actually builds newDF from the given index (['Thriftway', 'Meijer']). Since newDF currently has no values for these two index labels, the column values are written as NaN for them.
Two possible solutions for setting up your custom index can be like this:
you specify the index when you create your dataframe:
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows, index=['Thriftway', 'Meijer'])
newDF.index.name = 'Store'
print(newDF)
you use set_index afterwards:
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows)
newDF = newDF.set_index(pd.Index(['Thriftway', 'Meijer']))
newDF.index.name = 'Store'
print(newDF)
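Either way, the printed DataFrame should look something like this:

           guavas  pears  avocados
Store
Thriftway      10    111       200
Meijer         20    222      3000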
I want to fill the 'C' column of df2 (~100,000 rows) with values from the same column of df (~1,000,000 rows). df often contains the same row several times but with wrong data, so I always want to take the first value of column 'C'.
df = pd.DataFrame([[100, 1, 2], [100, 3, 4], [100, 5, 6], [101, 7, 8], [101, 9, 10]],
columns=['A', 'B', 'C'])
df2=pd.DataFrame([[100,0],[101,0]], columns=['A', 'C'])
for i in range(0, len(df2.index)):
    # My Question:
    df2[i, 'C'] = first value of the 'C' column of df where the 'A' column is the same in both dataframes. E.g. the first value for 100 would be 2 and then the first value for 101 would be 8
In the end, my output should be a table like this:
df2=pd.DataFrame([[100,2],[101,8]], columns=['A', 'C'])
You can try this:
df2['C'] = df.groupby('A')['C'].first().values
Which will give you:
A C
0 100 2
1 101 8
first() returns the first value of every group.
You then want to assign those values to the df2 column; unfortunately, you cannot assign the result directly like this:
df2['C'] = df.groupby('A')['C'].first()
because the above line will result in:
A C
0 100 NaN
1 101 NaN
(You can read about the cause here: Adding new column to pandas DataFrame results in NaN)
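A sketch of an alternative that avoids relying on row order (it aligns on the 'A' values instead, using the same sample data):

df2['C'] = df2['A'].map(df.groupby('A')['C'].first())

This gives the same result here, but it stays correct even if df2's rows are not in the same order as the groups in df.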
I have two dataframes:
df1:
A B C
1 ss 123
2 sv 234
3 sc 333
df2:
A dd xc
1 ss 123
df2 will always have a single row. How to check whether there is a match for that row of df2, in df1?
Using NumPy comparisons with np.all with parameter axis=1 for rows:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['ss', 'sv', 'sc'], 'C': [123, 234, 333]})
df2 = pd.DataFrame({'A': [1], 'dd': ['ss'], 'xc': [123]})
df3 = df1.loc[np.all(df1.values == df2.values, axis=1),:]
Or:
df3 = df1.loc[np.all(df1[['B','C']].values == df2[['dd','xc']].values, axis=1),:]
print(df3)
A B C
0 1 ss 123
In addition to Sandeep's answer, you can do:
df1[np.all(df1.values == df2.values, 1)].any().any()
to get a boolean.
Or another way:
df1[(df2.values==df1.values).all(1)].any().any()
Or:
pd.merge(df1,df2).equals(df1)
Note: both output True
Check specific column (same as Sandeep's):
df1[col].isin(df2[col]).any()
How to check whether there is a match for that row of df2, in df1?
You can align columns and then check equality of df1 with the only row of df2:
df2.columns = df1.columns
res = (df1 == df2.iloc[0]).all(1).any() # True
The benefit of this solution is you aren't subsetting df1 (expensive), but instead constructing a Boolean dataframe / array (cheap) and checking if all values in at least one row are True.
This is still not particularly efficient as you are considering every row in df1 rather than stopping when a condition is satisfied. With numeric data, in particular, there are more efficient solutions.
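For illustration, a rough sketch of an early-exit check (an assumption-laden example, not from the answer above: it presumes the columns have already been aligned as shown, and compares values positionally):

import numpy as np

row = df2.to_numpy()[0]
# stop scanning df1 as soon as one row matches
match_found = any(np.array_equal(r, row) for r in df1.to_numpy())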