Please help me: how do I tackle many-to-many matching with a condition?
import pandas as pd
company1 = {'Product': ['Pro_1','Pro_3','Pro_3','Pro_5'],
'product_date': ['2013-05-09','2012-12-02','2013-10-25','2016-08-25']}
df = pd.DataFrame(company1, columns = ['Product', 'product_date'])
print (df)
company2 = {'Product': ['Pro_1','Pro_2','Pro_2','Pro_3','Pro_3','Pro_3','Pro_3','Pro_5','Pro_5'],
'Start': ['2013-01-01','2012-01-02','2013-01-02','2014-01-01','2011-01-02','2012-01-02','2013-01-02','2014-01-25', '2017-01-26'],
'end': ['2014-01-01','2013-01-01','2013-12-31','2014-12-01','2012-01-01','2013-01-01','2013-12-31','2017-01-25', '2018-01-20'],
'inventory': [20,30,50,30,40,10,20,30,20]}
df2 = pd.DataFrame(company2, columns = ['Product', 'Start','end','inventory'])
print (df2)
result = {'Product': ['Pro_1','Pro_3','Pro_3','Pro_5'],
'inventory': [20,10,20,30]}
df3 = pd.DataFrame(result, columns = ['Product', 'inventory'])
print(df3)
I want to match df (company1) and df2 (company2) on 'Product', keep only the rows where 'product_date' falls between the 'Start' and 'end' dates, and return the 'inventory' from df2. The expected result is df3.
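One possible approach, sketched here under the assumption that the date ranges are inclusive on both ends and that each product_date falls into at most one range per product, is to merge on 'Product' and then filter with Series.between. The names merged, in_range and matched are illustrative and not from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Pro_1', 'Pro_3', 'Pro_3', 'Pro_5'],
    'product_date': ['2013-05-09', '2012-12-02', '2013-10-25', '2016-08-25'],
})
df2 = pd.DataFrame({
    'Product': ['Pro_1', 'Pro_2', 'Pro_2', 'Pro_3', 'Pro_3', 'Pro_3', 'Pro_3', 'Pro_5', 'Pro_5'],
    'Start': ['2013-01-01', '2012-01-02', '2013-01-02', '2014-01-01', '2011-01-02',
              '2012-01-02', '2013-01-02', '2014-01-25', '2017-01-26'],
    'end': ['2014-01-01', '2013-01-01', '2013-12-31', '2014-12-01', '2012-01-01',
            '2013-01-01', '2013-12-31', '2017-01-25', '2018-01-20'],
    'inventory': [20, 30, 50, 30, 40, 10, 20, 30, 20],
})

# Parse the date strings so the comparison is chronological.
df['product_date'] = pd.to_datetime(df['product_date'])
df2['Start'] = pd.to_datetime(df2['Start'])
df2['end'] = pd.to_datetime(df2['end'])

# Many-to-many merge on Product: every date is paired with every range of that product.
merged = df.merge(df2, on='Product', how='left')

# Keep only the pairs where the date lies inside the range (inclusive).
in_range = merged['product_date'].between(merged['Start'], merged['end'])
matched = merged.loc[in_range, ['Product', 'inventory']].reset_index(drop=True)
print(matched)
```

Because the merge builds every date/range pair per product before filtering, the intermediate frame can get large when a product has many ranges.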
I want to add a dict to a dataframe, where some of the dict's values are themselves dicts or lists.
Example:
abc = {'id': 'niceId',
'category': {'sport':'tennis',
'land': 'USA'
},
'date': '2022-04-12T23:33:21+02:00'
}
Now, I want to add this dict to a dataframe. I tried this, but it failed:
df = pd.DataFrame(abc, columns = abc.keys())
Output:
ValueError: All arrays must be of the same length
I'm thankful for your help.
Your question is not very clear in terms of what your expected output is. But assuming you want to create a dataframe where the columns should be id, category, date and numbers (just added to show the list case) in which each cell in the category column keeps a dictionary and each cell in the numbers column keeps a list, you may use from_dict method with transpose:
abc = {'id': 'niceId',
'category': {'sport':'tennis',
'land': 'USA'
},
'date': '2022-04-12T23:33:21+02:00',
'numbers': [1,2,3,4,5]
}
df = pd.DataFrame.from_dict(abc, orient="index").T
gives you a dataframe as:
   id      category                          date                       numbers
0  niceId  {'sport':'tennis','land': 'USA'}  2022-04-12T23:33:21+02:00  [1,2,3,4,5]
So let's say you want to add another item to this dataframe:
efg = {'id': 'notniceId',
'category': {'sport':'swimming',
'land': 'UK'
},
'date': '2021-04-12T23:33:21+02:00',
'numbers': [4,5]
}
df2 = pd.DataFrame.from_dict(efg, orient="index").T
pd.concat([df, df2], ignore_index=True)
gives you a dataframe as:
   id         category                           date                       numbers
0  niceId     {'sport':'tennis','land': 'USA'}   2022-04-12T23:33:21+02:00  [1,2,3,4,5]
1  notniceId  {'sport':'swimming','land': 'UK'}  2021-04-12T23:33:21+02:00  [4,5]
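As an alternative sketch (not the method used in the answer above), passing a list of dicts to the DataFrame constructor treats each dict as one row, so nested dicts and lists stay intact as cell values. The variable names mirror the example above.

```python
import pandas as pd

abc = {'id': 'niceId',
       'category': {'sport': 'tennis', 'land': 'USA'},
       'date': '2022-04-12T23:33:21+02:00',
       'numbers': [1, 2, 3, 4, 5]}
efg = {'id': 'notniceId',
       'category': {'sport': 'swimming', 'land': 'UK'},
       'date': '2021-04-12T23:33:21+02:00',
       'numbers': [4, 5]}

# Each dict in the list becomes one row; keys become column names.
df = pd.DataFrame([abc, efg])
print(df)
```

Wrapping a single dict in a list, as in pd.DataFrame([abc]), should likewise avoid the "All arrays must be of the same length" error, because the dict is then read as a row rather than as a mapping of columns.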
I have the following task.
I have this data:
import pandas
import numpy as np
data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
'phone': [912341.0, np.nan , 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
' email': ['todd#gmail.com', 'chris#gmail.com', np.nan, 'ben#gmail.com', np.nan ,np.nan , 'joe#gmail.com', 'rick#gmail.com'],
'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
'most_visited_place': ['Turkey', 'Spain',np.nan , 'Germany', 'Germany', 'Spain',np.nan , 'Spain']
}
df = pandas.DataFrame(data)
What I have to do is for every feature column (most_visited_airport etc.) and its values (Heathrow, Beijing, Tokyo) I have to generate personal information and output it to a file.
E.g. If we look at most_visited_airport and Heathrow
I need to output three files containing the names, emails and phones of the people who visited the airport the most.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [ x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each]==i,'name']
        df2 = df.loc[df[each]==i,' email']
        df3 = df.loc[df[each]==i,'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I am not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which would not only get the unique values, but split up the dataframe for you:
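for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        # ' email' keeps the leading space used in the question's column name
        for attr in ('name', 'phone', ' email'):
            # write the current attribute (not always 'name') for this group
            group[attr].dropna().to_csv(f'{column}_{key}_{attr.strip()}.csv')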
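The groupby idea can be tried end to end with a small self-contained sketch. It makes a few assumptions that are not in the question: a reduced copy of the data, an 'email' column without the leading space, a temporary output directory, and a hypothetical helper named export_groups.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Reduced sample; 'email' has no leading space here (an assumption).
df = pd.DataFrame({
    'name': ['Todd', 'Chris', 'Jackie', 'Rick'],
    'phone': [912341.0, np.nan, 912343.0, np.nan],
    'email': ['todd#gmail.com', 'chris#gmail.com', np.nan, 'rick#gmail.com'],
    'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', 'Heathrow'],
})

def export_groups(frame, out_dir, attrs=('name', 'phone', 'email')):
    """Write one CSV per (feature column, value, attribute); return the file names."""
    written = []
    for column in frame.filter(regex='^most'):
        # groupby skips rows where the feature value is NaN by default
        for key, group in frame.groupby(column):
            for attr in attrs:
                path = Path(out_dir) / f'{column}_{key}_{attr}.csv'
                group[attr].dropna().to_csv(path, index=False)
                written.append(path.name)
    return written

out_dir = tempfile.mkdtemp()
written = export_groups(df, out_dir)
print(written)
```

The loops over columns and attributes remain, but the per-value boolean masks are replaced by a single groupby pass per feature column.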
You can do it this way.
cols = df.filter(regex='most').columns.values
def func_current_cols_to_csv(most_col):
    place = df[most_col].dropna().unique().tolist()
    csv_cols = ['name', 'phone', ' email']
    result = [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
              for i in place for j in csv_cols]
    return result
[func_current_cols_to_csv(i) for i in cols]
Also, if you want to keep the index when writing to CSV (by leaving out index=False), do not forget to reset it before writing.
I have the below JSON string in data. I want it to look like the Expected Result Below
import json
import pandas as pd
data = [{'useEventValue': True,
'eventConditions': [{'type': 'CATEGORY',
'matchType': 'EXACT',
'expression': 'ABC'},
{'type': 'ACTION',
'matchType': 'EXACT',
'expression': 'DEF'},
{'type': 'LABEL', 'matchType': 'REGEXP', 'expression': 'GHI|JKL'}]}]
Expected Result:
   Category_matchType  Category_expression  Action_matchType  Action_expression  Label_matchType  Label_expression
0  EXACT               ABC                  EXACT             DEF                REGEXP           GHI|JKL
What I've Tried:
This question is similar, but I'm not using the index the way the OP is. Following this example, I've tried using json_normalize and then using various forms of melt, stack, unstack, pivot, etc. But there has to be an easier way!
# this bit of code produces the below result where I can start using reshaping functions to get to what I need but it seems messy
df = pd.json_normalize(data, 'eventConditions')
   type      matchType  expression
0  CATEGORY  EXACT      ABC
1  ACTION    EXACT      DEF
2  LABEL     REGEXP     GHI|JKL
We can use json_normalize to read the json data as pandas dataframe, then use stack followed by unstack to reshape the dataframe
df = pd.json_normalize(data, 'eventConditions')
df = df.set_index([df.groupby('type').cumcount(), 'type']).stack().unstack([1, 2])
df.columns = df.columns.map('_'.join)
CATEGORY_matchType CATEGORY_expression ACTION_matchType ACTION_expression LABEL_matchType LABEL_expression
0 EXACT ABC EXACT DEF REGEXP GHI|JKL
If your data is not too large in size, you could maybe process the json data first and then create a dataframe like this:
import pandas as pd
import json
data = [{'useEventValue': True,
'eventConditions': [{'type': 'CATEGORY',
'matchType': 'EXACT',
'expression': 'ABC'},
{'type': 'ACTION',
'matchType': 'EXACT',
'expression': 'DEF'},
{'type': 'LABEL', 'matchType': 'REGEXP', 'expression': 'GHI|JKL'}]}]
new_data = {}
for i in data:
    for event in i['eventConditions']:
        for key in event.keys():
            if key != 'type':
                col_name = event['type'] + '_' + key
                # list.append returns None, so collect values via setdefault instead of a conditional assignment
                new_data.setdefault(col_name, []).append(event[key])
df = pd.DataFrame(new_data)
df
Just found a way to do it with Pandas only:
df = pd.json_normalize(data, 'eventConditions')
df = df.melt(id_vars=[('type')])
df['type'] = df['type'] + '_' + df['variable']
df.drop(columns=['variable'], inplace=True)
df.set_index('type', inplace=True)
df = df.T
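The melt approach can also be written as one chain. The sketch below is one possible variant: it resets the row label to 0 and clears the leftover columns name so the result resembles the expected output. The names wide and col are illustrative only.

```python
import pandas as pd

data = [{'useEventValue': True,
         'eventConditions': [{'type': 'CATEGORY', 'matchType': 'EXACT', 'expression': 'ABC'},
                             {'type': 'ACTION', 'matchType': 'EXACT', 'expression': 'DEF'},
                             {'type': 'LABEL', 'matchType': 'REGEXP', 'expression': 'GHI|JKL'}]}]

df = pd.json_normalize(data, 'eventConditions')

wide = (
    df.melt(id_vars='type')                                   # long form: type, variable, value
      .assign(col=lambda d: d['type'] + '_' + d['variable'])  # e.g. CATEGORY_matchType
      .set_index('col')['value']                              # Series keyed by the new column name
      .to_frame().T                                           # one row, one column per key
      .reset_index(drop=True)                                 # row label 'value' -> 0
)
wide.columns.name = None  # drop the leftover 'col' label on the columns
print(wide)
```

The columns come out grouped by attribute (all matchType columns first, then all expression columns); reindexing with an explicit column list would be needed to interleave them as in the expected result.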
I have two csv files:
live_file.csv
Supplier SKU, Manufacturer SKU, Price
ABCD, 900000, 10
EFGH, 800000, 10
old_file.csv
Supplier SKU, Manufacturer SKU, Price
ABCD, 91234, 10
EFGHX, 85332, 10
I want to find the matching values in the Supplier SKU column. When I find a match, I want to take the Manufacturer SKU value from old_file.csv and put it in live_file.csv, so my result will be:
Supplier SKU, Manufacturer SKU, Price
ABCD, 91234, 10
EFGH, 800000, 10
This is what i tried:
import pandas as pd
live_file = pd.read_csv("live.csv")
old_file = pd.read_csv("old.csv")
old_file = old_file.set_index('Supplier SKU')['Manufacturer SKU'].dropna()
live_file['Manufacturer SKU'] = live_file['Supplier SKU'].replace(old_file)
live_file.to_csv(r'final.csv')
But this does not work, the end file is the same as the live file at the beginning, any help?
You can basically do a left-merge join of the two files on the column Supplier SKU
and then keep the value of column Manufacturer SKU from old_file when the merge matches, otherwise keep the value from live_file
live_file["Manufacturer SKU"] = pd.merge(live_file[["Supplier SKU", "Manufacturer SKU"]],
old_file[["Supplier SKU", "Manufacturer SKU"]],
how="left",
on="Supplier SKU",
suffixes=(None, "__right"),
indicator="merge_flag")\
.apply(lambda row: row["Manufacturer SKU"]
if row["merge_flag"] == "left_only"
else row["Manufacturer SKU__right"], axis=1)
Set 'Supplier SKU' as the index using set_index on both frames, then call update on the live-file frame, passing the old-file frame.
import pandas as pd
df1 = pd.DataFrame({
'Supplier SKU': ['ABCD','EFGH'],
'Manufacturer SKU': [900000, 800000],
'Price': [10, 10]
}).set_index('Supplier SKU')
df2 = pd.DataFrame({
'Supplier SKU': ['ABCD','EFGHX'],
'Manufacturer SKU': [91234, 85332],
'Price': [10, 10]
}).set_index('Supplier SKU')
df1.update(df2)
print(df1)
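Result: the ABCD row now has Manufacturer SKU 91234 from df2, while EFGH, which has no match in df2, keeps 800000.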
To prevent 'Price' also being updated, you can just drop Price in df2:
df1.update(df2.drop(columns='Price'))
PS: call df1.reset_index() to turn 'Supplier SKU' back into an ordinary column.
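Another option, sketched here rather than taken from the answers above, is to look up each Supplier SKU with Series.map and fall back to the live value where there is no match. It assumes that Supplier SKU values are unique in the old file. It also passes skipinitialspace=True because the sample files have a space after each comma, which would otherwise end up inside the column names. The in-memory strings stand in for the real files.

```python
import io

import pandas as pd

live_csv = """Supplier SKU, Manufacturer SKU, Price
ABCD, 900000, 10
EFGH, 800000, 10
"""
old_csv = """Supplier SKU, Manufacturer SKU, Price
ABCD, 91234, 10
EFGHX, 85332, 10
"""

# skipinitialspace strips the blank after each comma, including in the header row.
live = pd.read_csv(io.StringIO(live_csv), skipinitialspace=True)
old = pd.read_csv(io.StringIO(old_csv), skipinitialspace=True)

# Lookup table: Supplier SKU -> Manufacturer SKU from the old file (assumed unique keys).
lookup = old.set_index('Supplier SKU')['Manufacturer SKU']

live['Manufacturer SKU'] = (
    live['Supplier SKU'].map(lookup)       # NaN where the SKU is missing from the old file
        .fillna(live['Manufacturer SKU'])  # keep the live value in that case
        .astype(int)
)
print(live)
```

Unlike Series.replace on the key column, map returns the looked-up value only for keys that exist in the lookup, so unmatched rows can be handled explicitly with fillna.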
I am having trouble figuring out how to properly transpose data in a DataFrame in order to calculate differences between actuals and targets. Doing something like: df['difference'] = df['Revenue'] - df['Target'], is straightforward so this is more a question of desired output formatting.
Assume you have a DataFrame with the following columns and values:
The desired outputs would be a roll-up across both sources and a comparison at the Source level. Assume there are 30+ additional data points similar to revenue, users, and new users:
and
Any and all suggestions are very much appreciated.
Setup
df = pd.DataFrame([
['2016-06-01', 15000, 10000, 1000, 900, 100, 50, 'US'],
['2016-06-01', 16000, 12000, 1500, 1200, 150, 100, 'UK']
], columns=['Date', 'Revenue', 'Target', 'Users', 'Target', 'New Users', 'Target', 'Source'])
df
Your columns are not unique. I'll start with moving Source and Date into the index and renaming the columns.
df1 = df.copy()
df1.Date = pd.to_datetime(df1.Date)
df1 = df1.set_index(['Date', 'Source'])
idx = pd.MultiIndex.from_product([['Revenue', 'Users', 'New Users'], ['Actual', 'Target']])
df1.columns = idx
df1
Then move the first level of columns to the index
df1 = df1.stack(0)
df1
From here, I'm going to sum sources across ['Revenue', 'Users', 'New Users'] and assign the result to df2.
df2 = df1.groupby(level=-1).sum()
df2
Finally:
df2['Difference'] = df2.Actual - df2.Target
df1['Difference'] = df1.Actual - df1.Target
df2
df1.stack().unstack([0, 1, -1])
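For a quick check of the numbers, here is a compact sketch of the same roll-up. It defines the difference by subtraction, as in the question, and swaps in sum().unstack() and xs() in place of the stack and groupby steps above. The names tidy, rollup and per_source are illustrative only.

```python
import pandas as pd

rows = [
    ['2016-06-01', 'US', 15000, 10000, 1000, 900, 100, 50],
    ['2016-06-01', 'UK', 16000, 12000, 1500, 1200, 150, 100],
]

# Two-level columns: metric on top, Actual/Target underneath.
columns = pd.MultiIndex.from_product(
    [['Revenue', 'Users', 'New Users'], ['Actual', 'Target']]
)
index = pd.MultiIndex.from_tuples([tuple(r[:2]) for r in rows], names=['Date', 'Source'])
tidy = pd.DataFrame([r[2:] for r in rows], index=index, columns=columns)

# Roll-up across sources: one row per metric, Actual and Target side by side.
rollup = tidy.sum().unstack()
rollup['Difference'] = rollup['Actual'] - rollup['Target']

# Per-source comparison: Actual minus Target for every metric.
per_source = tidy.xs('Actual', axis=1, level=1) - tidy.xs('Target', axis=1, level=1)

print(rollup)
print(per_source)
```

Since the metric names live in a column level, adding the 30+ further metrics would only change the list passed to from_product.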