The two dataframes I am comparing are of different sizes (they share the same index values, though), and I suppose that is why I am getting the error. Can you please suggest a way to get around that? I am looking for those rows in df2 whose user_id matches those of df1. Thanks, and I appreciate your response.
import numpy as np
import pandas as pd

data = np.array([['user_id','comment','label'],
                 [100,'RT #Dvillain_: #oomf should text me.',0],
                 [100,'Buy viagra',1],
                 [101,'#nowplaying M.C. Shan - Juice Crew Law on',0],
                 [101,'Buy viagra two',1]])
data2 = np.array([['user_id','comment','label'],
                  [100,'First comment',0],
                  [100,'Buy viagra',1],
                  [102,'Buy viagra two',1]])
df1 = pd.DataFrame(data=data[1:,0:], columns=data[0,0:])
df2 = pd.DataFrame(data=data2[1:,0:], columns=data2[0,0:])  # use data2's header row
df = df2[df2['user_id'] == df1['user_id']]  # raises ValueError: the Series are not identically labeled
You are looking for isin
df = df2[df2['user_id'].isin(df1['user_id'])]
df
Out[814]:
user_id comment label
0 100 First comment 0
1 100 Buy viagra 1
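For context: == on two Series is an element-wise, index-aligned comparison, so pandas insists that the Series be identically labeled; isin is a plain membership test, so the two frames' lengths and indexes don't matter. A minimal sketch of the difference (toy Series, just for illustration):

import pandas as pd

s1 = pd.Series([100, 101])
s2 = pd.Series([100, 100, 102])
# s2 == s1  # raises ValueError: the Series are not identically labeled
print(s2.isin(s1))  # True, True, False: membership only, no alignment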
I am new to data science and your help is appreciated. My question is about grouping a dataframe by its columns so that a bar chart can be plotted for each subject's status.
my csv file is something like this
Name,Maths,Science,English,sports
S1,Pass,Fail,Pass,Pass
S2,Pass,Pass,NA,Pass
S3,Pass,Fail,Pass,Pass
S4,Pass,Pass,Pass,NA
S5,Pass,Fail,Pass,NA
expected output:
Subject,Status,Count
Maths,Pass,5
Science,Pass,2
Science,Fail,3
English,Pass,4
English,NA,1
Sports,Pass,3
Sports,NA,2
You can do this with pandas. It's not exactly the same output format as in the question, but it definitely carries the same information:
import pandas as pd
# reading csv
df = pd.read_csv("input.csv")
# turning columns into rows
melt_df = pd.melt(df, id_vars=['Name'],
                  value_vars=['Maths', 'Science', 'English', 'sports'],
                  var_name='Subject', value_name='Status')
# fill NaN values, otherwise the groupby below would drop them
melt_df = melt_df.fillna("Unknown")
# counting per group of subject and status.
result_df = melt_df.groupby(["Subject", "Status"]).size().reset_index(name="Count")
Then you get the following result:
Subject Status Count
0 English Pass 4
1 English Unknown 1
2 Maths Pass 5
3 Science Fail 3
4 Science Pass 2
5 sports Pass 3
6 sports Unknown 2
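Since the end goal is a bar chart per subject and status, you can pivot Status into columns and let pandas draw grouped bars from result_df. A minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

plot_df = result_df.pivot(index='Subject', columns='Status', values='Count').fillna(0)
plot_df.plot(kind='bar')  # one group of bars per subject
plt.ylabel('Count')
plt.tight_layout()
plt.show()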
PS: Going forward, always include the code you've tried so far.
To match your output exactly, this is what you could do:
import pandas as pd

df = pd.read_csv('c:/temp/data.csv')  # or wherever your csv file is
subjects = ['Maths', 'Science', 'English', 'sports']  # or take df.columns and drop 'Name'
grouped_rows = []
for eachsub in subjects:
    rows = df.groupby(eachsub)['Name'].count()
    idx = list(rows.index)
    if 'Pass' in idx:
        grouped_rows.append([eachsub, 'Pass', rows['Pass']])
    if 'Fail' in idx:
        grouped_rows.append([eachsub, 'Fail', rows['Fail']])
new_df = pd.DataFrame(grouped_rows, columns=['Subject', 'Status', 'Count'])
print(new_df)
I must add, though, that I would avoid the for loop. My approach would be just these two lines, running value_counts over all subject columns at once:

subjects = ['Maths', 'Science', 'English', 'sports']
grouped_rows = df[subjects].apply(pd.Series.value_counts)

Depending on your application, you already have the data available in grouped_rows.
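And if you do want the exact Subject/Status/Count layout from the question, one way (a sketch, not the only one) is to stack that result:

counts = df[subjects].apply(pd.Series.value_counts)
tidy = (counts.stack()                        # Series indexed by (Status, Subject)
              .rename_axis(['Status', 'Subject'])
              .reset_index(name='Count')
              [['Subject', 'Status', 'Count']])
print(tidy)  # counts come out as floats because some cells were NaN; cast if needed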
I'm stuck on a very strange problem:
I have two dfs and I have to match the strings of one df with the strings of the other df, by similarity.
The target column is the name of the television program (program_name_1 & program_name_2).
In order to limit the set of candidates, I also used the column 'channel' as a filter.
The function applies the fuzzy algorithm and returns, for each element of program_name_1, its match from program_name_2 and the similarity score between them.
The really strange thing is that the output is fine only for the first channel; for all the following channels it isn't. The first column (from scorer_test_1), which just holds program_name_1, is always correct, but the program_name_2 and similarity columns are NaN.
I did a lot of checks on the dfs: I am sure that the column names match the names in the lists and that the other channels contain all the data I'm asking for.
The strangest thing is that the first channel and all the other channels are in the same df, so there should be no differences between the channels' data.
I will show you toy datasets to help you understand the problem better:
df1 = {'Channel': ['1','1','1','2','2','2','3','4'], 'program_name_1': ['party','animals','gucci','the simpson', 'cars', 'mathematics', 'bikes', 'chef']}
df2 = {'Channel': ['1','1','1','2','2','2','3','4'], 'program_name_2': ['parties','gucci_gucci','animal','simpsons', 'math', 'the car', 'bike', 'cooking']}
df1 = pd.DataFrame(df1, columns = ['Channel','program_name_1'])
df2 = pd.DataFrame(df2, columns = ['Channel','program_name_2'])
that will print for the df1:
Channel program_name_1
1 party
1 animals
1 gucci
2 the simpson
2 cars
2 mathematics
3 bikes
4 chef
and for the df2:
Channel program_name_2
1 parties
1 gucci_gucci
1 animal
2 simpsons
2 math
2 the car
3 bike
4 cooking
and here is the code:

scorer_test_1 = df1.loc[df1['Channel'] == '1']['program_name_1']
scorer_test_2 = df2.loc[df2['Channel'] == '1']['program_name_2']

# creation of a function for the score
# (assumes: from fuzzywuzzy import process; scorer_dict is defined elsewhere)
def scorer_tester_function(x):
    matching_list = []
    similarity = []
    # iterate on the rows
    for i in scorer_test_1:
        if pd.isnull(i):
            matching_list.append(np.nan)
            similarity.append(np.nan)
        else:
            ratio = process.extract(i, scorer_test_2, limit=5, scorer=scorer_dict[x])
            matching_list.append(ratio[0][0])
            similarity.append(ratio[0][1])
    my_df = pd.DataFrame()
    my_df['program_name_1'] = scorer_test_1
    my_df['program_name_2'] = pd.Series(matching_list)
    my_df['similarity'] = pd.Series(similarity)
    return my_df

print(scorer_tester_function('R').head())
This is the output I would like to get for all the channels, but I only get it if I pass the first channel to the code:
for the channel[1]:
program_name_1 program_name_2 similarity
party parties 95
animals animal 95
gucci gucci_gucci 75
for the channel[2]:
program_name_1 program_name_2 similarity
the simpson simpsons 85
cars the car 75
mathematics math 70
This is the output I get if I ask for channel 2 or any later one:
code:
scorer_test_1 = df1.loc[df1['Channel'] == '2']['program_name_1']
scorer_test_2 = df2.loc[df2['Channel'] == '2']['program_name_2']
output:
Channel program_name_1 program_name_2 similarity
2 the simpson NaN NaN
2 cars NaN NaN
2 mathematics NaN NaN
I hope someone can help me :)
Thanks!
This was an index mismatch: resetting the index after adding the first data series does the trick!
def scorer_tester_function(x):
    matching_list = []
    similarity = []
    # iterate on the rows
    for i in scorer_test_1:
        if pd.isnull(i):
            matching_list.append(np.nan)
            similarity.append(np.nan)
        else:
            ratio = process.extract(i, scorer_test_2, limit=5)  # , scorer=scorer_dict[x])
            matching_list.append(ratio[0][0])
            similarity.append(ratio[0][1])
    my_df = pd.DataFrame()
    my_df['program_name_1'] = scorer_test_1
    print(my_df.index)
    # drop the filtered index (e.g. 3, 4, 5 for channel '2') so it lines up
    # with the default 0..n-1 index of the Series built from matching_list
    my_df.reset_index(inplace=True)
    print(my_df.index)
    my_df['program_name_2'] = pd.Series(matching_list)
    my_df['similarity'] = pd.Series(similarity)
    return my_df
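For what it's worth, you can also sidestep the alignment issue without touching the index: assigning a plain list (instead of a pd.Series) fills the column positionally, so the filtered index is irrelevant. A minimal variant of the two assignments:

my_df['program_name_2'] = matching_list  # plain list: positional, no index alignment
my_df['similarity'] = similarity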
I have this dataframe data where I have about 10,000 records of sold items for 201 authors.
I want to add a column to this dataframe holding the average price for each author.
First I create this new column average_price, and then I create another dataframe df
with 201 rows of authors and their average price. (At least I think this is the right way to do this.)
data["average_price"] = 0
df = data.groupby('Author Name', as_index=False)['price'].mean()
df looks like this
Author Name price
0 Agnes Cleve 107444.444444
1 Akseli Gallen-Kallela 32100.384615
2 Albert Edelfelt 207859.302326
3 Albert Johansson 30012.000000
4 Albin Amelin 44400.000000
... ... ...
196 Waldemar Lorentzon 152730.000000
197 Wilhelm von Gegerfelt 25808.510638
198 Yrjö Edelmann 53268.928571
199 Åke Göransson 87333.333333
200 Öyvind Fahlström 351345.454545
Now I want to use this df to populate the average_price column in the larger dataframe data.
I could not come up with a way to do this, so I tried a for loop, which is not working. (And I know you should avoid for loops when working with dataframes.)

for index, row in data.iterrows():
    for ind, r in df.iterrows():
        if row["Author Name"] == r["Author Name"]:
            row["average_price"] = r["price"]  # note: row is a copy, so data is never updated

So I wonder how this should be done?
You can use transform and groupby to add a new column:
data['average price'] = data.groupby('Author Name')['price'].transform('mean')
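As a quick check (toy numbers, purely for illustration): transform returns a Series aligned to the original index, so each group's mean is broadcast back onto every row of that group:

import pandas as pd

toy = pd.DataFrame({'Author Name': ['A', 'A', 'B'], 'price': [10, 20, 40]})
toy['average price'] = toy.groupby('Author Name')['price'].transform('mean')
print(toy)
#   Author Name  price  average price
# 0           A     10           15.0
# 1           A     20           15.0
# 2           B     40           40.0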
I think based on what you described, you should use the .join method on a pandas dataframe. You don't need to create the 'average_price' column manually. This should simply work for your case:

df = data.groupby('Author Name')['price'].mean().rename('average_price')
data = data.join(df, on='Author Name')

(join matches the 'Author Name' column in data against the index of df, which is why the groupby keeps 'Author Name' as the index here.)
Now you can get the average price from the data['average_price'] column.
Hope this helps!
I think the easiest way to do that would be a join (aka pandas.merge):

df_data = pd.DataFrame([...])  # your data here
df_agg_data = (df_data.groupby('Author Name', as_index=False)['price'].mean()
                      .rename(columns={'price': 'average_price'}))  # avoids a price_x/price_y clash
df_data = df_data.merge(df_agg_data, on="Author Name")
print(df_data)
I have been spending the entire day trying to figure this issue out, and nothing from Stack Overflow on the topic is helping.
I am making calculations over groupby objects, but the output is off. I am assuming that there is something wrong with my use of the apply method but cannot figure out what.
Here is my toy dataset to illustrate my issue:
data1 = pd.DataFrame({'Id' : ['001','001','001','001','001','001','001','001','001',
'002','002','002','002','002','002','002','002','002',],
'Date': ['2020-01-12', '2019-12-30', '2019-12-01','2019-11-01', '2019-08-04', '2019-08-04', '2019-08-01', '2019-07-20', '2019-06-04',
'2020-01-11', '2019-12-12', '2019-12-01','2019-12-01', '2019-09-10', '2019-08-10', '2019-08-01', '2019-06-20', '2019-06-01'],
'Quantity' :[4,5,6,8,12,14,16,19,20, 8,7,6,5,4,3,2,1,0]
})
and my code looks like this:
today_month = int(time.strftime("%m"))
data1['Date'] =pd.to_datetime(data1['Date'])
data1 = data1[data1.Id.apply(lambda x: x.isnumeric())]
data2 = data1.groupby('Id').apply(lambda x: x.set_index('Date').resample('M').sum())
forecast = pd.DataFrame()
forecast['Id'] = data1['Id'].unique()
data3 = data2.groupby(level='Id').tail(5)
forecast['trendup'] = data3.apply(lambda x: data3['Quantity'].is_monotonic_increasing).sum()
forecast['trenddown'] = data3.apply(lambda x: data3['Quantity'].is_monotonic_decreasing).sum()
forecast['trend_status'] = np.where(~(forecast['trendup'] | forecast['trenddown']), 'Not_trending', 'trending')
forecast['L0'] = data3.apply(lambda x: data3['Quantity'].mean()).sum()
the output is this:
Id trendup trenddown trend_status L0
0 001 0 0 Not_trending 5.3
1 002 0 0 Not_trending 5.3
UPDATE:
the desired output is:
Id trendup trenddown trend_status L0
0 001 True False trending 12.3
1 002 False False Not_trending 13.0
Here is the goal of this piece of code:
The goal is to prepare data covering several products for a forecasting method (Holt's method if a trend is identified, ES if not).
For this I check for a consecutive trend thanks to the is_monotonic functions,
then I use the output dataframe to gather which items are trending or not, in order to decide which model to use.
L0 is the T0 time for the forecast, which corresponds to the oldest month in the tailed dataframe.
First of all, I am confused why "is_monotonic" does not return True or False but 0 in the output dataframe.
Second of all, I don't understand why L0 returns the mean of the whole dataset and not of each group of the groupby object.
My Python level is pretty limited and I have run out of things to try to solve this. Any help on this would be amazing!
IIUC, although the results don't seem to be even close:

data1 = data1.sort_values("Date", axis=0, ascending=False)
data1["obs"] = data1.groupby("Id").cumcount()
data2 = (data1.loc[data1["obs"] < 5]
              .groupby("Id")
              .apply(lambda x: pd.Series({
                  "trendup": x["Quantity"].is_monotonic_increasing,
                  "trenddown": x["Quantity"].is_monotonic_decreasing,
                  "LO": x["Quantity"].mean()})))
data2["trend_status"] = np.where(np.logical_or(data2["trendup"], data2["trenddown"]),
                                 "trending", "Not_trending")
Outputs:
trendup trenddown LO trend_status
Id
001 True False 7.0 trending
002 False True 6.0 trending
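If you want the same shape as the desired output in the question (Id as a regular column), resetting the index should be enough, e.g.:

forecast = data2.reset_index()[['Id', 'trendup', 'trenddown', 'trend_status', 'LO']]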
Following a "chain" of rows and counting the consecutive months from a CSV file.
Currently I am reading a CSV file with 5 columns of interest (based on insurance policies):
CONTRACT_ID START_DATE END_DATE CANCEL_FLAG OLD_CON_ID
123456 2015-05-30 2016-05-30 0 8788
123457 2014-03-20 2015-03-20 0 12000
123458 2009-12-20 2010-12-20 0 NaN
...
I want to count the number of consecutive months a Contract chain goes for.
Example: take the START_DATE from the contract at the "front" of the chain (oldest contract) and the END_DATE from the end of the chain (newest contract). The oldest contract is defined as either the one before a cancelled contract in a chain or the one that has no OLD_CON_ID value.
Each row represents a contract, and OLD_CON_ID points to the previous contract's ID. The desired output is how many months the contract chain goes back until a gap (i.e. the customer didn't have a contract for a period of time). If there is nothing in that column, then that is the first contract in the chain.
CANCEL_FLAG should also cut the chain because a value of 1 designates that the contract was cancelled.
My current code counts the number of active contracts for each year by filtering the dataframe like so:

df_contract = df_contract[
    (df_contract['START_DATE'] <= pd.to_datetime('2015-05-31')) &
    (df_contract['END_DATE'] >= pd.to_datetime('2015-05-31')) &
    (df_contract['CANCEL_FLAG'] == 0)
]
activecount = df_contract.count()
print(activecount['CONTRACT_ID'])
Here are the first 6 lines of code in which I create the dataframes and adjust the datetime values:
file_name = 'EXAMPLENAME.csv'
df = pd.read_csv(file_name)
df_contract = pd.read_csv(file_name)
df_CUSTOMERS = pd.read_csv(file_name)
df_contract['START_DATE'] = pd.to_datetime(df_contract['START_DATE'])
df_contract['END_DATE'] = pd.to_datetime(df_contract['END_DATE'])
Ideal output is something like:
FIRST_CONTRACT_ID CHAIN_LENGTH CON_MONTHS
1234567 5 60
1500001 1 4
800 10 180
Those data points would then be graphed.
EDIT2: CSV file changed, might be easier now. Question updated.
Not sure if I totally understand your requirement, but does something like this work?

df_contract['TOTAL_YEARS'] = (df_contract['END_DATE'] - df_contract['START_DATE']) / np.timedelta64(1, 'Y')
df_contract.loc[(df_contract['CANCEL_FLAG'] == 1) & (df_contract['TOTAL_YEARS'] > 1), 'TOTAL_YEARS'] = 1
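Since the question ultimately asks for months (CON_MONTHS), the same idea works with a month-sized timedelta, e.g. (assuming the same columns):

df_contract['TOTAL_MONTHS'] = (df_contract['END_DATE'] - df_contract['START_DATE']) / np.timedelta64(1, 'M')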
After a lot of trial and error I got it working!
This finds the time difference between the first and last contracts in the chain, as well as the length of the chain.
Not the cleanest code by far, but it works:
test = 'START_DATE'
df_short = df_policy[['OLD_CON_ID', test, 'CONTRACT_ID']]
df_short.rename(columns={'OLD_CON_ID': 'PID', 'CONTRACT_ID': 'CID'}, inplace=True)
df_test = df_policy[['CONTRACT_ID', 'END_DATE']]
df_test.rename(columns={'CONTRACT_ID': 'CID', 'END_DATE': 'PED'}, inplace=True)
df_copy1 = df_short.copy()
df_copy2 = df_short.copy()
df_copy2.rename(columns={'PID': 'PPID', 'CID': 'PID'}, inplace=True)
df_merge1 = pd.merge(df_short, df_copy2, how='left', on=['PID'])
df_merge1['START_DATE_y'].fillna(df_merge1['START_DATE_x'], inplace=True)
df_merge1.rename(columns={'START_DATE_x': '1_EFF', 'START_DATE_y': '2_EFF'}, inplace=True)
The copy, merge, fillna, and rename code is repeated for five merged dataframes, and then:

df_merged = pd.merge(df_merge5, df_test, how='right', on=['CID'])
df_merged['TOTAL_MONTHS'] = (df_merged['PED'] - df_merged['6_EFF']) / np.timedelta64(1, 'M')
df_merged4 = df_merged[df_merged['PED'] >= pd.to_datetime('2015-07-06')]
df_merged4['CHAIN_LENGTH'] = df_merged4.drop(
    ['PED', '1_EFF', '2_EFF', '3_EFF', '4_EFF', '5_EFF'], axis=1
).apply(lambda row: len(pd.unique(row)), axis=1) - 3
Hopefully my code is understood and will help someone in the future.