I have the following two dataframe structures:
           roc_100
               max             min
industry     Banks  Health   Banks  Health
date
2015-03-15    3456     456     345     567
2015-03-16    6576     565     435     677
2015-03-17    5478     657     245     123
and:
           roc_100
               max     min
date
2015-03-15     546    7856
2015-03-16     677     456
2015-03-17    3546     346
As can be seen, the difference between the two dataframes is that the bottom one doesn't have an 'industry' level. The rest of the structure is the same, i.e. it also has dates along the left and is grouped under roc_100, which contains max and min.
What I need to do is add the columns from the bottom dataframe to the top dataframe and give the added columns an industry name, e.g. 'benchmark'. The resulting dataframe should be as follows:
             roc_100
                 max                           min
industry       Banks    Health Benchmark      Banks    Health Benchmark
date
2015-03-15      3456       456       546       345       567      7856
2015-03-16      6576       565       677       435       677       456
2015-03-17      5478       657      3546       245       123       346
I have tried using append and join, but neither option has worked so far, because one dataframe has an 'industry' level and the other doesn't.
Edit:
I have managed to merge them correctly using:
industry_df = industry_df.merge(benchmark_df, how='inner', left_index=True, right_index=True)
The only problem now is that the newly added columns still don't have an 'industry'.
This means that if I just want one industry, eg: Health, then I can do:
print(industry_df['roc_100', 'max', 'Health'])
That works, but if I want to print all the industries including the newly added columns I can't do that. If I try:
print(industry_df['roc_100', 'max'])
This only prints out the newly added columns because they are the only ones which don't have an 'industry'. Is there a way to give these newly merged columns a name ('industry')?
You can use stack() and unstack() to bring the two dataframes to identical index structures, with industries as columns. Then assign a new benchmark column. As a last step, restore the initial index/column structure with the same stack() and unstack().
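A minimal sketch of that idea, assuming the column level is actually named 'industry' and labelling the benchmark columns 'Benchmark' (variable names taken from the question, so adjust as needed):

import pandas as pd

# Bring the 'industry' level of the columns into the row index.
stacked = industry_df.stack(level='industry')

# Give the benchmark frame the same (date, industry) index shape,
# labelling every row as 'Benchmark'.
bench = benchmark_df.copy()
bench.index = pd.MultiIndex.from_product(
    [bench.index, ['Benchmark']], names=['date', 'industry'])

# Combine, then move 'industry' back into the columns.
combined = pd.concat([stacked, bench]).unstack('industry').sort_index(axis=1)

An equivalent route is to add the missing level directly to benchmark_df.columns (for example with pd.MultiIndex.from_tuples) and then join the two frames.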
I am new to Python. I have a dataframe obtained from a SQL query result:
UserId  UserName  Reason_details
851     Bob       [{"reasonId":264, "reasonDescription":"prohibited", "reasonCodes":[1, 2]}, {"reasonId":267, "reasonDescription":"Expired", "reasonCodes":[25]}]
852     Jack      [{"reasonId":273, "reasonDescription":"Restricted", "reasonCodes":[29]}]
I want to modify this dataframe by flattening the Reason_details column, with each reason in a new row:
UserId  UserName  Reason_id  Reason_description  Reason_codes
851     Bob       264        Prohibited          1
851     Bob       264        Prohibited          2
851     Bob       267        Expired             25
852     Jack      273        Restricted          29
I flattened this data using good old for loops, iterating over each row of the source dataframe, reading the value of each key in the Reason_details column with json.loads, and then building the final dataframe.
But I feel there has to be a better way of doing this using the dataframe and JSON functions in Python.
PS: In my actual dataset there are 63 columns and 8 million rows, of which only the Reason_details column holds JSON. So my existing approach, which iterates over all rows and columns, converts them into a 2D list first, and builds the final dataframe from that, is very inefficient.
Can you try this:
df = df.explode('Reason_details')                       # one reason dict per row
df = (df.join(df['Reason_details'].apply(pd.Series))    # expand dict keys into columns
        .drop('Reason_details', axis=1)
        .explode('reasonCodes').drop_duplicates())      # one reason code per row
Here is a slightly different manner:
df[['UserId', 'UserName']].merge(df['Reason_details']
                                 .explode()                # convert list to rows
                                 .apply(pd.Series)         # creates dict keys as columns
                                 .explode('reasonCodes'),  # convert reason codes into rows
                                 left_index=True,          # merge with original DF
                                 right_index=True)
UserId UserName reasonId reasonDescription reasonCodes
0 851 Bob 264 prohibited 1
0 851 Bob 264 prohibited 2
0 851 Bob 267 Expired 25
1 852 Jack 273 Restricted 29
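If performance matters at 8 million rows, a sketch using json.loads plus pd.json_normalize might avoid the per-row Python loop; it is not from the answers above, and the column and key names are taken from the question:

import json
import pandas as pd

# If Reason_details arrives from SQL as a JSON string, parse it once up front.
df['Reason_details'] = df['Reason_details'].apply(json.loads)

# One row per reason dict, then expand the dicts into columns.
exploded = df.explode('Reason_details').reset_index(drop=True)
reasons = pd.json_normalize(exploded['Reason_details'].tolist())

# One row per reason code, with the column names requested in the question.
out = (exploded.drop(columns='Reason_details')
               .join(reasons)
               .explode('reasonCodes')
               .rename(columns={'reasonId': 'Reason_id',
                                'reasonDescription': 'Reason_description',
                                'reasonCodes': 'Reason_codes'}))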
I have a post-outer-merge DataFrame where a column holds the merge indicator (left_only, right_only) that distinguishes one row from another.
But I want only the columns that changed, not the whole row. I'm filtering just by left_only, but I need only the columns that effectively changed, and the ID column needs to stay 'locked' in place: it is the only one that can't be removed when I keep just the values that changed.
For example:
ID     Manufacturer  TAG  ID2  _merge
10003  Apple         334  999  left_only
10003  Samsung       223  999  right_only
10004  Samsung       253  567  left_only
10004  Samsung       253  999  right_only
The output should be:
ID     Manufacturer  TAG  ID2
10003  Apple         334  999
10004                      567
And just for the record, I'm doing the merge like this:
df_final = (pd.merge(
    df, df2, how='outer',
    indicator=True)
).query('_merge != "both"')
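For what it's worth, one possible sketch of the filtering step (not an answer from the thread; it assumes ID is unique on each side of the merge) is to line the left_only and right_only rows up on ID and keep a cell only where the two sides differ:

import pandas as pd

# Split the merge result by indicator and align the two sides on ID.
left = df_final.query('_merge == "left_only"').set_index('ID').drop(columns='_merge')
right = df_final.query('_merge == "right_only"').set_index('ID').drop(columns='_merge')

# Keep a value only where the left and right rows differ; blank out the rest.
changed = left.where(left.ne(right), '').reset_index()
print(changed)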
This is a very specific problem: my code is very slow, and I wonder if I'm doing something obviously wrong or whether there's a better way.
The situation: I have two dataframes, frame and contacts. frame is a database of people, and contacts is points of contact with these people. They look something like:
frame:
name
id
166 Bob
253 Serge
1623 Anna
766 Benna
981 Paul
contacts:
id type date
0 253 email 2016-01-05
1 1623 sale 2012-05-12
2 1623 email 2017-12-22
3 253 sale 2018-02-15
I want to add two columns to frame, 'most_recent' and '3 year contact count', which give the most recent contact (if there is one) and the number of contacts in the past 3 years.
(frame is ~100,000 rows, and contacts is ~95,000)
So far, I'm reducing the amount of ids to iterate over, then creating a dict for each id with the right values:
id_list = [i for i in frame.index if i in contacts['id']]
freq_rec_dict = {i: [contacts.loc[contacts['id']==i, 'value'].max(),
                     len(contacts.loc[(contacts['id']==i) & (contacts['value'] > dt(2016,1,1))])]
                 for i in id_list}
Then, I turn the dict into a dataframe and perform a join:
freq_rec_df = pd.DataFrame.from_dict(freq_rec_dict, orient='index',columns=['most_recent','3 year contact count'])
result = frame.join(freq_rec_df)
This does give me what I need, but the dictionary comprehension took 30 minutes - I feel like there must be a more efficient way to do this (I will need this in the future). Any ideas would be much appreciated - thanks!
You don't specify your desired output, but here goes. You should leverage the built-in groupby method instead of taking your data out of a frame, back into a frame, and then merging:
contacts.groupby('id')[['date','type']].max()
date type
id
253 2018-02-15 sale
1623 2017-12-22 sale
Which you can do in one line if you need to save memory space. Again, you don't give a preferred output, so I used a left join. You could also use 'inner' to keep only rows where records exist.
df=pd.merge(frame,contacts.groupby('id')[['date','type']].max(), left_index=True, right_index=True, how='left')
name date type
id
166 Bob NaN NaN
253 Serge 2018-02-15 sale
1623 Anna 2017-12-22 sale
766 Benna NaN NaN
981 Paul NaN NaN
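If the three-year count is also needed, a sketch along the same groupby lines might look like the following; the cutoff date mirrors the dt(2016,1,1) in the question, and 'date' is assumed to be the contact-date column:

import pandas as pd

contacts['date'] = pd.to_datetime(contacts['date'])
cutoff = pd.Timestamp('2016-01-01')

# One pass over contacts instead of one scan per id.
summary = contacts.groupby('id').agg(
    most_recent=('date', 'max'),
    **{'3 year contact count': ('date', lambda s: (s > cutoff).sum())},
)

# ids with no contacts get NaN in both new columns.
result = frame.join(summary)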
I have a pandas dataframe as shown below:
Name ID1 ID2
Joe 248 248
Joe 248 326
Joe 721 248
Anna 295 295
Bob 721 248
Bob 721 326
Bob 248 566
I need to keep only the rows that do not have matching ID1 & ID2, with the added rule that if the IDs matched at least once for a Name, then all rows for that Name should be dropped.
For example:
For Name = Joe, IDs match once (248), so remove all rows with Joe.
For Name = Bob, IDs never match, so keep all rows with Bob.
So far, I've tried:
Dropping duplicates by sorting names and checking if IDs match or not. But this does not take into account IDs matching at least once.
df = df.sort_values(['Name']).drop_duplicates(['Name'],keep='first')
I'm not sure if pandas can drop duplicates with a condition where something matches at least once.
If I understand correctly, you can calculate the names to remove and then use Boolean indexing:
names_to_remove = df.loc[df['ID1'] == df['ID2'], 'Name'].values
res = df[~df['Name'].isin(names_to_remove)]
print(res)
Name ID1 ID2
4 Bob 721 248
5 Bob 721 326
6 Bob 248 566
df.groupby('Name').apply(lambda grp: grp if not (grp['ID1'] == grp['ID2']).any() else None).dropna()
Explanation: group by Name; if there is no row in the group where ID1 equals ID2, return the group. Otherwise return None, and then drop the null rows.
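The same condition can also be expressed with groupby.filter, which keeps a group only when the function returns True (a sketch, using the column names from the question):

# Keep every Name whose group never has a row with ID1 == ID2.
res = df.groupby('Name').filter(lambda g: not (g['ID1'] == g['ID2']).any())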
I have a data set that I've been cleaning and to clean it I needed to put it into a pivot table to summarize some of the data. I'm now putting it back into a dataframe so that I can merge it with some other dataframes. df1 looks something like this:
Count Region Period ACV PRJ
167 REMAINING US WEST 3/3/2018 5 57
168 REMAINING US WEST 3/31/2018 10 83
169 SAN FRANCISCO 1/13/2018 99 76
170 SAN FRANCISCO 1/20/2018 34 21
df2 looks something like this:
Count MKTcode Region
11 RSMR0 REMAINING US SOUTH
12 RWMR0 REMAINING US WEST
13 SFR00 SAN FRANCISCO
I've tried merging them with this code:
df3 = pd.merge(df1, df2, on='Region', how='inner')
but for some reason pandas is not interpreting the Region columns as the same data: the merge turns up NaN in the MKTcode column and seems to append df2 to df1, like this:
Count Region Period ACV PRJ MKTcode
193 WASHINGTON, D.C. 3/3/2018 36 38 NaN
194 WASHINGTON, D.C. 3/31/2018 12 3 NaN
195 ATLANTA NaN NaN NaN ATMR0
196 BOSTON NaN NaN NaN B2MRN
I've tried inner and outer joins, but the real problem seems to be that pandas is interpreting the Region column of each dataframe as different elements.
The MKTcode and Region columns in df2 have only 12 observations, and each observation occurs only once, whereas df1 has several repeating instances in the Region column (multiples of the same city). Is there a way I can just create a list of the 12 MKTcodes that I need and perform a merge where each matches the regions I designate, like a one-to-many match?
Thanks.
When a merge isn't working as expected, the first thing to do is look at the offending columns.
The biggest culprit in most cases is trailing/leading whitespaces. These are usually introduced when DataFrames are incorrectly read from files.
Try getting rid of extra whitespace characters by stripping them out. Assuming you need to join on the "Region" column, use
for df in (df1, df2):
    # Strip the column(s) you're planning to join with
    df['Region'] = df['Region'].str.strip()
Now, merging should work as expected,
pd.merge(df1, df2, on='Region', how='inner')
Count_x Region Period ACV PRJ Count_y MKTcode
0 167 REMAINING US WEST 3/3/2018 5 57 12 RWMR0
1 168 REMAINING US WEST 3/31/2018 10 83 12 RWMR0
2 169 SAN FRANCISCO 1/13/2018 99 76 13 SFR00
3 170 SAN FRANCISCO 1/20/2018 34 21 13 SFR00
Another possibility, if you're still getting NaNs, is a difference in whitespace characters between words. For example, a value with a stray double space or a non-breaking space will not compare as equal with its single-spaced counterpart, even though the two look identical when printed.
This time, the fix is to use str.replace:
for df in (df1, df2):
    # Collapse every run of whitespace into a single space
    df['Region'] = df['Region'].str.replace(r'\s+', ' ', regex=True)
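If the merge still produces NaNs after cleaning, a quick diagnostic sketch (not part of the original answer) is to compare the key sets directly and see which Region values fail to line up:

# Values present in one frame's join key but not the other.
only_in_df1 = set(df1['Region']) - set(df2['Region'])
only_in_df2 = set(df2['Region']) - set(df1['Region'])
print(only_in_df1)
print(only_in_df2)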