Flattening dataframe JSON column to new rows in Python

I am new to Python. I have a dataframe obtained from a SQL query result:
UserId  UserName  Reason_details
851     Bob       [{"reasonId":264, "reasonDescription":"prohibited", "reasonCodes":[1, 2]}, {"reasonId":267, "reasonDescription":"Expired", "reasonCodes":[25]}]
852     Jack      [{"reasonId":273, "reasonDescription":"Restricted", "reasonCodes":[29]}]
I want to modify this dataframe by flattening the Reason_details column, with each reason in a new row:
UserId  UserName  Reason_id  Reason_description  Reason_codes
851     Bob       264        Prohibited          1
851     Bob       264        Prohibited          2
851     Bob       267        Expired             25
852     Jack      273        Restricted          29
I flattened this data using good old for loops, iterating over each row of the source dataframe, reading the value of each key in the Reason_details column with json.loads, and then building the final dataframe.
But I feel there has to be a better way of doing this using dataframe and JSON functions in Python.
PS: In my actual dataset there are 63 columns and 8 million rows, of which only the Reason_details column contains JSON. So my existing approach is very inefficient: it iterates over all rows and all columns, converts everything into a 2D list first, and builds the final dataframe from that.

You can try this:
df = df.explode('Reason_details')
df = (df.join(df['Reason_details'].apply(pd.Series))
        .drop('Reason_details', axis=1)
        .explode('reasonCodes').drop_duplicates())
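Note that the snippet above assumes Reason_details already holds Python lists of dicts. If the SQL result instead returns the column as JSON strings, a minimal sketch to parse it first (keeping the question's column name) could be:

import json

# parse JSON text into lists of dicts before exploding (assumes every row holds valid JSON)
df['Reason_details'] = df['Reason_details'].apply(json.loads)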

Here is a slightly different approach:
df[['UserId', 'UserName']].merge(df['Reason_details']
                                 .explode()               # convert list to rows
                                 .apply(pd.Series)        # dict keys become columns
                                 .explode('reasonCodes'), # convert reason codes into rows
                                 left_index=True,         # merge with the original df
                                 right_index=True)
UserId UserName reasonId reasonDescription reasonCodes
0 851 Bob 264 prohibited 1
0 851 Bob 264 prohibited 2
0 851 Bob 267 Expired 25
1 852 Jack 273 Restricted 29
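Another option worth noting: pd.json_normalize can expand the reason dicts into columns directly. A sketch, assuming Reason_details has already been parsed into lists of dicts (see the note above) and every row has at least one reason:

import pandas as pd

exploded = df.explode('Reason_details').reset_index(drop=True)     # one reason dict per row
reasons = pd.json_normalize(exploded['Reason_details'].tolist())   # reasonId / reasonDescription / reasonCodes columns
out = (exploded.drop(columns='Reason_details')
               .join(reasons)
               .explode('reasonCodes'))                             # one reason code per row

Because json_normalize works on the already-parsed dicts in one pass, it avoids the per-row apply(pd.Series), which tends to matter at 8 million rows.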

Related

Sorting Values by rows in data frame

I have a 4-column data frame with numerical values and NaN. What I need is to put the largest numbers in the first columns, so that the first column always has the maximum value and the second column the next maximum value.
for x in Exapand_re_metrs[0]:
    for y in Exapand_re_metrs[1]:
        for z in Exapand_re_metrs[2]:
            for a in Exapand_re_metrs[3]:
                lista = [x, y, z, a]
                lista.sort()
                df["AREA_Mayor"] = lista[0]
                df["AREA_Menor"] = lista[1]
I'm not so sure what you want to do, but here is a solution according to what I understood:
From what I see, you have a dataframe with several columns and you would like it grouped into a single column with the values from highest to lowest, so I will create a dataframe with almost the same characteristics, as follows:
import pandas as pd
import numpy as np
cols = 3
rows = 4
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 1000, (rows, cols)), columns= ["A","B","C"])
print(df)
A B C
0 684 559 629
1 192 835 763
2 707 359 9
3 723 277 754
Now I will gather all the columns into a single column and organize the values in descending order like this:
data = df.to_numpy().flatten()
data = pd.DataFrame(data)
data.sort_values(by=[0],ascending=False)
So as a result we obtain a single column where the values are in descending order:
0
4 835
5 763
11 754
9 723
6 707
0 684
2 629
1 559
7 359
10 277
3 192
8 9
Note: this code fragment should be adapted to your script; I didn't do that because I don't know your dataset. Lastly, my English is not that good, so sorry for any grammatical errors.
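For the literal per-row requirement in the question (largest value in the first column of each row), a minimal sketch, assuming df is the original 4-column numeric frame and NaN values are allowed, could be:

import numpy as np
import pandas as pd

vals = df.to_numpy(dtype=float)
# np.sort places NaN last; negating before and after sorts each row in descending order,
# still keeping NaN at the end of the row
df_sorted = pd.DataFrame(-np.sort(-vals, axis=1), index=df.index, columns=df.columns)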

How to create a table when the column names are stored in a column of one table and the data for those columns is in a column of another table

I have two csv files: one has the column names, and the data for those column names is inside a column of the other csv. This is the structure of those csv files:
id  unique_ref  money_spent
1   abcd123     120
2   bcde234     145
3   cdef345     450
4   defg456     412
5   abcd123     127
6   bcde234     148
7   cdef345     489
8   defg456     415
id       fields
abcd123  apple
bcde234  orange
cdef345  grape
defg456  watermelon
Now what I want is to create another CSV which will have these fields as columns and the money_spent as data according to the unique_ref. I can't specify the column name to pivot or transpose because in the real data there are many fields.
I can use SQL or/and Python
You can iterate over the second table's id, mask all rows with the same unique_ref, and build a list of lists from that.
the_list = [
    [the_row["fields"], data_1[data_1["unique_ref"] == the_row["id"]]["money_spent"].to_numpy().tolist()]
    for index, the_row in data_2.iterrows()
]
Now you have a list as:
[['abcd123', [120, 127]], ['bcde234', [145, 148]], ['cdef345', [450, 489]], ['defg456', [412, 415]]]
Using that, you can create a new dataframe. Note that you need the transpose of the data:
the_df = pd.DataFrame(
    list(map(list, zip(*[i[1] for i in the_list]))),
    columns=[i[0] for i in the_list]
)
The dataframe:
apple orange grape watermelon
0 120 145 450 412
1 127 148 489 415
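A more pandas-native route, sketched under the assumption that the frames are named data_1 and data_2 as above (the helper column 'row' and the output file name are hypothetical), is to merge on the reference key and pivot with a per-group counter as the new row index:

import pandas as pd

merged = data_1.merge(data_2, left_on='unique_ref', right_on='id', suffixes=('', '_map'))
merged['row'] = merged.groupby('fields').cumcount()             # 0, 1, ... within each field
wide = merged.pivot(index='row', columns='fields', values='money_spent')
wide.to_csv('output.csv', index=False)                          # hypothetical output file name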

Efficiently creating frequency and recency columns

This is a very specific problem: my code is very slow, and I wonder if I'm doing something obviously wrong or whether there's a better way.
The situation: I have two dataframes, frame and contacts. frame is a database of people, and contacts is points of contact with these people. They look something like:
frame:
       name
id
166     Bob
253   Serge
1623   Anna
766   Benna
981    Paul
contacts:
id type date
0 253 email 2016-01-05
1 1623 sale 2012-05-12
2 1623 email 2017-12-22
3 253 sale 2018-02-15
I want to add two columns to frame, 'most_recent' and '3 year contact count', which give the most recent contact (if there is one) and the number of contacts in the past 3 years.
(frame is ~100,000 rows, and contacts is ~95,000)
So far, I'm reducing the amount of ids to iterate over, then creating a dict for each id with the right values:
id_list = [i for i in frame.index if i in contacts['id']]
freq_rec_dict = {i: [contacts.loc[contacts['id'] == i, 'value'].max(),
                     len(contacts.loc[(contacts['id'] == i) & (contacts['value'] > dt(2016, 1, 1))])]
                 for i in id_list}
Then, I turn the dict into a dataframe and perform a join:
freq_rec_df = pd.DataFrame.from_dict(freq_rec_dict, orient='index',columns=['most_recent','3 year contact count'])
result = frame.join(freq_rec_df)
This does give me what I need, but the dictionary comprehension took 30 minutes - I feel like there must be a more efficient way to do this (I will need this in the future). Any ideas would be much appreciated - thanks!
You don't specify your desired output, but here goes. You should leverage the built-in groupby method instead of taking your data out of a frame, back into a frame, and then merging:
contacts.groupby('id')[['date','type']].max()
date type
id
253 2018-02-15 sale
1623 2017-12-22 sale
You can do this in one line if you need to save memory. Again, you don't give a preferred output, so I used a left join; you could also use 'inner' to keep only rows where records exist.
df=pd.merge(frame,contacts.groupby('id')[['date','type']].max(), left_index=True, right_index=True, how='left')
name date type
id
166 Bob NaN NaN
253 Serge 2018-02-15 sale
1623 Anna 2017-12-22 sale
766 Benna NaN NaN
981 Paul NaN NaN
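The groupby above gives the most recent contact but not the three-year count. A sketch that computes both in one pass, assuming contacts['date'] is (or is first converted to) a datetime column and mirroring the dt(2016, 1, 1) cutoff from the question, might be:

import pandas as pd

contacts['date'] = pd.to_datetime(contacts['date'])
summary = contacts.groupby('id').agg(
    most_recent=('date', 'max'),
    three_year_count=('date', lambda s: (s > pd.Timestamp('2016-01-01')).sum()),
)
result = frame.join(summary)                                     # frame is indexed by id
result['three_year_count'] = result['three_year_count'].fillna(0).astype(int)

The name three_year_count is a placeholder; rename it to '3 year contact count' if you need the exact label.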

Simplify query to just 'where' by making new column with pandas?

I have SQL queries that are applied to a column. These are implemented through a function called Select_analysis.
Form:
Select_analysis(input_shapefile, output_name, {where_clause}) # it only accepts up to a where clause
Example:
SELECT * from OT # OT is a dataset
GROUP BY OT.CA # CA is a number that may appear many times, therefore we group by that field.
HAVING ((Count(OT.OBJECTID))>1) # an id that appears more than once.
OT dataset
objectid CA
1 125
2 342
3 263
1 125
We group by CA. About HAVING: it applies to the rows whose objectid appears more than once, which is objectid 1 in this example.
My idea is to make another column that stores a result which can then be accessed with a simple where clause in the Select_analysis function.
example: OT dataset
objectid CA count_of_objectid_aftergroupby
1 125 2
2 342 1
3 263 1
1 125 2
So then can be:
Select_analysis(roads.shp,output.shp, count_of_objectid_aftergroupby > '1')
Note: it has to be done in such a way that the Select_analysis function is used at the end.
Assuming that you are pulling the data into pandas since it's tagged pandas, here's one possible solution:
df=pd.DataFrame({'objectID':[1,2,3,1],'CA':[125,342,463,125]}).set_index('objectID')
objectID CA
1 125
2 342
3 463
1 125
df['count_of_objectid_aftergroupby']=[df['CA'].value_counts().loc[x] for x in df['CA']]
objectID CA count_of_objectid_aftergroupby
1 125 2
2 342 1
3 463 1
1 125 2
The list comp does basically this:
- Pull the value counts for each item in df['CA'] as a series.
- Use loc to index into the series at each value of 'CA' to find the count of that value.
- Put that item into a list.
- Append that list as a new column.
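As a side note, the same column can be produced without the list comprehension using a grouped transform; a sketch over the same example frame:

# count how often each CA value occurs and broadcast it back to every row
df['count_of_objectid_aftergroupby'] = df.groupby('CA')['CA'].transform('count')
# then keep only the duplicated groups before handing the result to Select_analysis
duplicated = df[df['count_of_objectid_aftergroupby'] > 1]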

Adding column from dataframe with different structure

I have the following two dataframe structures:
             roc_100
                 max             min
industry       Banks  Health   Banks  Health
date
2015-03-15      3456     456     345     567
2015-03-16      6576     565     435     677
2015-03-17      5478     657     245     123
and:
             roc_100
                 max     min
date
2015-03-15       546    7856
2015-03-16       677     456
2015-03-17      3546     346
As can be seen, the difference between the two dataframes is that the bottom one doesn't have an 'industry' level. But the rest of the dataframe structure is the same, i.e. it also has dates along the left and is grouped under roc_100, which contains max and min.
What I need to do is add the columns from the bottom dataframe to the top dataframe, and give the added columns an industry name, eg: 'benchmark'. The resulting dataframe should be as follows:
             roc_100
                 max                         min
industry       Banks  Health  Benchmark   Banks  Health  Benchmark
date
2015-03-15      3456     456        546     345     567       7856
2015-03-16      6576     565        677     435     677        456
2015-03-17      5478     657       3546     245     123        346
I have tried using append and join, but neither option has worked so far because the one dataframe has an 'industry' and the other doesn't.
Edit:
I have managed to merge them correctly using:
industry_df = industry_df.merge(benchmark_df, how='inner', left_index=True, right_index=True)
The only problem now is that the newly added columns still don't have an 'industry'.
This means that if I just want one industry, eg: Health, then I can do:
print(industry_df['roc_100', 'max', 'Health'])
That works, but if I want to print all the industries including the newly added columns I can't do that. If I try:
print(industry_df['roc_100', 'max'])
This only prints out the newly added columns because they are the only ones which don't have an 'industry'. Is there a way to give these newly merged columns a name ('industry')?
You can use stack() and unstack() to bring the two dataframes to identical index structures with the industries as columns, then assign the new benchmark column. As a last step, restore the initial index/column structure with the same stack() and unstack().
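As an alternative to the stack()/unstack() route, a minimal sketch that simply adds a 'Benchmark' industry level to the benchmark frame's columns before concatenating (assuming the frames are named industry_df and benchmark_df as in the edit, and that industry_df has the three column levels shown):

import pandas as pd

bench = benchmark_df.copy()
bench.columns = pd.MultiIndex.from_tuples(
    [(metric, agg, 'Benchmark') for metric, agg in bench.columns],  # tag every column with the new industry
    names=industry_df.columns.names,
)
combined = pd.concat([industry_df, bench], axis=1).sort_index(axis=1)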
