I have two dataframes:
df1 = pd.DataFrame({'Code': ['10', '100', '1010'],
                    'Value': [25, 50, 75]})
df2 = pd.DataFrame({'ID': ['A', 'B', 'C'],
                    'Codes': ['10', '100;1010', '100'],
                    'Value': [25, 125, 50]})
Column "Codes" in df2 can contain multiple codes separated by ";". If this is the case, I need to sum up their values from df1.
I tried .map(), but it did not work for rows that contain multiple codes. I also ended up converting code '1010' to the value '2525'.
How do I enforce an exact match and sum the values for ";"-separated codes?
1. explode() the split list of Codes
2. merge() with df1 and calculate the total, grouping on the original index of df2
3. create a new column with the calculated total
df1 = pd.DataFrame({"Code": ["10", "100", "1010"], "Value": [25, 50, 75]})
df2 = pd.DataFrame(
{"ID": ["A", "B", "C"], "Codes": ["10", "100;1010", "100"], "Value": [25, 125, 50]}
)
df2.join(
df2["Codes"]
.str.split(";")
.explode()
.reset_index()
.merge(df1, left_on="Codes", right_on="Code")
.groupby("index")
.agg({"Value": "sum"}),
rsuffix="_calc",
)
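Here explode() keeps df2's original row labels, reset_index() turns them into an "index" column, grouping on that column sums the matched values back per original row, and join() attaches the totals to df2: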
  ID     Codes  Value  Value_calc
0  A        10     25          25
1  B  100;1010    125         125
2  C       100     50          50
def add_values(d1, d2):  # renamed from `sum` to avoid shadowing the builtin
    d1['sum'] = d1['Value'] + d2['Value']
    print(d1)

add_values(df1.loc[df2['Codes'].isin(df1['Code'])], df2)

If the code in df2 is in df1, then add the values.
We can make a lookup table of the Code-to-Value mapping from df1, use .map() on df2's exploded list of Codes, and finally sum the mapped values for the same ID to arrive at the desired value, as follows:
1. Make a lookup table of Code to Value mapping from df1:
mapping = df1.set_index('Code')['Value']
2. Use .map() on df2 to map the expanded list of Codes to the mapping. Sum up the mapped values for the same ID to arrive at the desired value:
df2a = df2.set_index('ID')            # set `ID` as index
df2a['value_map'] = (
    df2a['Codes'].str.split(';')      # split by semicolon
        .explode()                    # expand the split values into rows
        .map(mapping)                 # map each Code through the mapping
        .groupby('ID').sum()          # group-sum by ID
)
df2 = df2a.reset_index()              # reset `ID` from index back to a data column
Result:
print(df2)
  ID     Codes  Value  value_map
0  A        10     25         25
1  B  100;1010    125        125
2  C       100     50         50
I have two dataframes. One has months 1-5 and a value for each month, which are the same for every ID; the other has an ID and a unique multiplier, e.g.:
data = [['m', 10], ['a', 15], ['c', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'Unique'])
data2 = [[1, 0.2], [2, 0.3], [3, 0.01], [4, 0.5], [5, 0.04]]
df2 = pd.DataFrame(data2, columns=['Month', 'Value'])
I want to compute sum(value / (1 + unique)^(Month/12)). E.g. for ID m, I want to compute value/(1 + 10)^(Month/12) for every row in df2, and sum the results. I wrote a for-loop to do this, but my real table has 277,000 entries, so this takes too long!
df['baseTotal'] = 0
for i in df.index.unique():
    for i in df2.Month.unique():
        df['base'] = df2['Value'] / pow(1 + df.loc[i, 'Unique'], df2['Month'] / 12.0)
        df['baseTotal'] = df['baseTotal'] + df['base']
Is there a more efficient way to do this?
df['Unique'].apply(lambda x: (df2['Value']/((1+x) ** (df2['Month']/12))).sum())
0    0.609983
1    0.563753
2    0.571392
Name: Unique, dtype: float64
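For reference, a minimal end-to-end sketch of this approach on the sample frames, assigning the result to a baseTotal column (the column name simply mirrors the original loop):

import pandas as pd

df = pd.DataFrame([['m', 10], ['a', 15], ['c', 14]], columns=['ID', 'Unique'])
df2 = pd.DataFrame([[1, 0.2], [2, 0.3], [3, 0.01], [4, 0.5], [5, 0.04]],
                   columns=['Month', 'Value'])

# for each multiplier, discount every month's value and sum the terms
df['baseTotal'] = df['Unique'].apply(
    lambda x: (df2['Value'] / (1 + x) ** (df2['Month'] / 12)).sum()
)
print(df)
#   ID  Unique  baseTotal
# 0  m      10   0.609983
# 1  a      15   0.563753
# 2  c      14   0.571392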
I have a df of the below format. Here the 'Count' column is the count of 'Col3'. Considering the first two rows, it has 2 counts of L and 1 count of W.
input_data = {
'Col1' : ['A','A','A','A','B','B','C'],
'Col2' : ['D','D','T','T','D','D','T'],
'Col3' : ['L','W','L','W','W','L','W'],
'Count': [2,1,3,2,3,2,2]
}
input_df = pd.DataFrame(input_data)
print(input_df)
I want to convert this df into the below format (required_df):
output_data = {
'Col1' : ['A','A','B','C'],
'Col2' : ['D','T','D','T'],
'New_Col3' : ['L/W','L/W','L/W','W'],
'W_Count' : [1,3,3,2],
'L_Count' : [2,2,2,0]
}
i.e., the first 2 rows of the first df are converted into the first row of required_df. For each unique pair of ['Col1','Col2'], the values of 'Col3' are joined with '/' and the counts are added as 2 new columns, W_Count and L_Count. Note the variation in the last row, where there is only a W value.
I know we need to do groupby(['Col1','Col2']), but I am unable to work out how to get the values of the other two columns as reflected in required_df. What would be the best way to achieve this?
As the real data is sensitive, I cannot share it here. But the original data is large, with lakhs (hundreds of thousands) of rows.
You can combine a groupby.agg and pivot_table:
(df
 .groupby(['col1', 'col2'])
 .agg(**{'New_col3': ('col3', lambda x: '/'.join(sorted(x)))})
 .join(df.pivot_table(index=['col1', 'col2'],
                      columns='col3',
                      values='col4',
                      fill_value=0)
         .add_suffix('_count'))
 .reset_index()
)
Output:
  col1 col2 New_col3  L_count  W_count
0    A    D      L/W        2        1
1    A    T      L/W        3        2
2    B    D      L/W        2        3
3    C    T        W        0        2
Used input:
df = pd.DataFrame({'col1': list('AAAABBC'),
                   'col2': list('DDTTDDT'),
                   'col3': list('LWLWWLW'),
                   'col4': (2, 1, 3, 2, 3, 2, 2)})
There are two dataframes:
df1 = pd.DataFrame({'year':[2000, 2001, 2002], 'city':['NY', 'AL', 'TX'], 'zip':[100, 200, 300]})
df2 = pd.DataFrame({'year':[2000, 2001, 2002], 'city':['NY', 'AL', 'TX'], 'zip':["95-150", "160-220", "190-310"], 'value':[10, 20, 30]})
The main df is df1 and I want to add the 'value' column from df2 to df1 based off of a matching year, city, and zip. The problem is that the zip of df2 is given in a range and I want to attach 'value' only if df1's zip is within a given range. I'm not sure how to do this. I've tried a few things like:
# Match indices so that new cols will attach when equal indices
df1 = df1.set_index(['year', 'city'])
df2 = df2.set_index(['year', 'city'])
# Split range of zip into a list
df2['zip'] = df2['zip'].str.split("-")
# Attach 'value' to df1 if df1's zip is greater than df2's min zip AND less than df2's max zip
df1['value'] = df2.loc[(df2['zip'].str[0].astype(int) <= df1['zip']) &
                       (df2['zip'].str[1].astype(int) >= df1['zip']), 'value']
Which gives me this error: ValueError: Can only compare identically-labeled Series objects
Split and make sure they're ints:
df2[['start', 'end']] = df2['zip'].str.split('-', expand=True).astype(int)
Then use Series.between:
df1['value'] = df1['zip'].between(df2['start'], df2['end'])
   year city  zip  value
0  2000   NY  100   True
1  2001   AL  200   True
2  2002   TX  300   True
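Note that between returns booleans. If the goal is to attach df2's actual value only where the zip falls inside the range (NaN otherwise), a minimal sketch, assuming the frames still share the default 0..n index as in the output above, so rows line up one-to-one:

# keep df2's value where df1's zip is inside [start, end], NaN elsewhere
in_range = df1['zip'].between(df2['start'], df2['end'])
df1['value'] = df2['value'].where(in_range)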
I have two DFs. I want to iterate through the rows of DF1, filter all the rows of DF2 with the same id, and collect the column "B" values in a new column of DF1.
data = {'id': [1,2,3]}
df1 = pd.DataFrame(data)
data = {'id': [1, 1, 3,3,3], 'B': ['ab', 'bc','ad','ds','sd']}
df2 = pd.DataFrame(data)
DF1 - id (15k rows)
DF2 - id, col1 (50M rows)
Desired output
data = {'id': [1,2,3],'B':['[ab,bc]','[]','[ad,ds,sd]']}
pd.DataFrame(data)
def func(row):  # row is a single row of df1
    temp3 = df2.merge(pd.DataFrame(data=[row.values] * len(row), columns=row.index),
                      how='right', on=['id'])
    temp1 = temp3.B.values
    return temp1

df1['B'] = df1.apply(func, axis=1)
I am using merge for the filtering and applying the function to df1. The code takes 1 hour to execute on the large data frame. How can I make this run faster?
Are you looking for a simple filter and grouped listification?
df2[df2['id'].isin(df1['id'])].groupby('id', as_index=False)[['B']].agg(list)
   id             B
0   1      [ab, bc]
1   3  [ad, ds, sd]
Note that grouping as lists is considered suboptimal in terms of performance.
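If ids from df1 that have no rows in df2 (like 2 in the sample) must still appear with an empty list, as in the desired output, one possible sketch:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'id': [1, 1, 3, 3, 3], 'B': ['ab', 'bc', 'ad', 'ds', 'sd']})

# group df2's B values into lists per id, then map them onto df1;
# ids absent from df2 map to NaN, which is replaced with an empty list
lists = df2.groupby('id')['B'].agg(list)
df1['B'] = df1['id'].map(lists).apply(lambda x: x if isinstance(x, list) else [])
print(df1)
#    id             B
# 0   1      [ab, bc]
# 1   2            []
# 2   3  [ad, ds, sd]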
I have 14 DataFrames.
They all have an index and one column called 'Total'.
Here is an example of 1 DataFrame:
https://i.gyazo.com/8b31f92a469e31df89a29e4588427362.png
The index is 'Res Area'
The column is 'Total'
So what I want to do is merge them all into 1 dataframe where the index will be
the name of the df and the column 'Total' to compare all of these DFs.
I've tried putting the dfs in a dictionary with the key being the name of the df and the value its Total of the top 10 added together, but it doesn't work:
df = pd.DataFrame({'Res Area': resarea_df.Total[:10].sum(),
                   'Year Built': yearbuilt_df.Total[:10].sum(),
                   'Retail Area': retailarea_df.Total[:10].sum()})
I get an error that says:
If using all scalar values, you must pass an index
I just want to merge all the dfs into 1 df so I can see each df's top 10 Totals summed together, compared with each other, which I will then plot on a graph.
You are calling the wrong constructor for your DataFrame. With a dictionary of scalar values whose keys should become the index, you want the .from_dict constructor:
import pandas as pd
data= {'data1': 1, 'data2': 2, 'data3': 15}
pd.DataFrame.from_dict(data, orient='index', columns=['Total'])
# Total
#data1 1
#data2 2
#data3 15
To explain the problem you are having: when constructing a DataFrame from a dictionary with pd.DataFrame, the default is to make the keys of the dictionary the columns. Typically the values of the passed dictionary are array-like, which allows pandas to determine how many rows to make. With all scalar values and no index, however, there is no way to know how many rows are needed.
data= {'data1': 1, 'data2': 2, 'data3': 15}
pd.DataFrame(data)
#ValueError: If using all scalar values, you must pass an index
To do this correctly, you would specify an index:
pd.DataFrame(data, index=[0])
# data1 data2 data3
#0 1 2 15
Or make at least one value of data array-like:
data2 = {'data1': 1, 'data2': 2, 'data3': [15]}
pd.DataFrame(data2)
# data1 data2 data3
#0 1 2 15
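Applied back to the original goal, a sketch (the three frames below are hypothetical stand-ins for the asker's 14; the [:10] slice mirrors the top-10 slicing from the question):

import pandas as pd

# hypothetical stand-ins for three of the asker's frames
resarea_df = pd.DataFrame({'Total': [120, 90, 75, 60, 50]})
yearbuilt_df = pd.DataFrame({'Total': [30, 25, 20, 15, 10]})
retailarea_df = pd.DataFrame({'Total': [80, 70, 60, 50, 40]})

# sum each frame's first 10 Totals and assemble one comparison frame
totals = {'Res Area': resarea_df['Total'][:10].sum(),
          'Year Built': yearbuilt_df['Total'][:10].sum(),
          'Retail Area': retailarea_df['Total'][:10].sum()}
summary = pd.DataFrame.from_dict(totals, orient='index', columns=['Total'])
print(summary)
#              Total
# Res Area       395
# Year Built     100
# Retail Area    300
summary.plot.bar()  # compare the summed Totals on one chart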