Extract the JSON data inside an array column of a pandas DataFrame - Python

I am working on extracting a JSON array column from a DataFrame in Python using the pandas library. I have data like this:
>df
  id    partnerid  payments
  5263  org1244    [{"sNo": 1, "amount":"1000"}, {"sNo": 2, "amount":"500"}]
  5264  org1245    [{"sNo": 1, "amount":"2000"}, {"sNo": 2, "amount":"600"}]
  5265  org1246    [{"sNo": 1, "amount":"3000"}, {"sNo": 2, "amount":"700"}]
I want to extract the JSON data inside the list and add it as columns in the same DataFrame, like this:
>mod_df
  id    partnerid  sNo  amount
  5263  org1244    1    1000
  5263  org1244    2    500
  5264  org1245    1    2000
  5264  org1245    2    600
  5265  org1246    1    3000
  5265  org1246    2    700
I have tried this approach:
import pandas as pd
import json as j
df = pd.read_parquet('sample.parquet')
js_loads = df['payments'].apply(j.loads)
js_list = list(js_loads)
j_data = j.dumps(js_list)
df = df.join(pd.read_json(j_data))
df = df.drop(columns=['payments'])
But this works only if the column contains a single JSON object, not a list of JSON objects.
Can someone explain how I can achieve my desired output?

Convert it to a list with ast.literal_eval and use explode() to transform each element into a row, replicating the other columns.
Then use .apply(pd.Series) to convert the dict-like values to Series.
Finally, concatenate to the original DataFrame using pd.concat().
Example:
import ast
import pandas as pd

# sample data
d = {'col1': [0, 1, 2],
     'payments': ['[{"sNo": 1, "amount":"1000"}, {"sNo": 2, "amount":"500"}]',
                  '[{"sNo": 1, "amount":"2000"}, {"sNo": 2, "amount":"600"}]',
                  '[{"sNo": 1, "amount":"3000"}, {"sNo": 2, "amount":"700"}]']}
df = pd.DataFrame(data=d, index=[0, 1, 2])

df['payments'] = df['payments'].apply(ast.literal_eval)
df = df.explode('payments')
out = pd.concat([df.drop(['payments'], axis=1),
                 df['payments'].apply(pd.Series)],
                axis=1).reset_index(drop=True)
output:
   col1  sNo amount
0     0    1   1000
1     0    2    500
2     1    1   2000
3     1    2    600
4     2    1   3000
5     2    2    700
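As an alternative sketch: since the strings are valid JSON, you can also parse them with json.loads and flatten the exploded dicts with pd.json_normalize (available since pandas 1.0), which is usually faster than .apply(pd.Series) on larger frames:
import json
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2],
                   'payments': ['[{"sNo": 1, "amount":"1000"}, {"sNo": 2, "amount":"500"}]',
                                '[{"sNo": 1, "amount":"2000"}, {"sNo": 2, "amount":"600"}]',
                                '[{"sNo": 1, "amount":"3000"}, {"sNo": 2, "amount":"700"}]']})

# Parse each string, explode one payment per row, then flatten the dicts.
exploded = df.assign(payments=df['payments'].apply(json.loads)).explode('payments')
out = pd.concat([exploded.drop(columns='payments').reset_index(drop=True),
                 pd.json_normalize(exploded['payments'].tolist())],
                axis=1)
print(out)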


Merging two columns on value

Reproducible dataframe:
import pandas as pd

data = {'refid': ['1.2.34', '1.2.35', '1.3.66',
                  '1.6.99', '1.9.00', '1.87.66',
                  '1.98.00', '1.100.1', '1.101.3']}
my_index = pd.MultiIndex.from_arrays([["A"]*6 + ["B"]*3, [1, 1, 1, 2, 2, 2, 1, 1, 1]], names=["ID-A", "ID-B"])
df = pd.DataFrame(data, index=my_index)
I want a new column that combines ID-B with refid truncated at its second delimiter.
For example, for ID-B 1 and refid 1.2.34, the secondary refid should be 1.2 and the unique ID should be 1_1.2.
You can use get_level_values with str.extract and concatenate the values after converting to string:
df['new'] = (df.index.get_level_values('ID-B').astype(str) + '_'
             + df['refid'].str.extract(r'(\d+\.\d+)', expand=False))
output:
              refid      new
ID-A ID-B
A    1      1.2.34    1_1.2
     1      1.2.35    1_1.2
     1      1.3.66    1_1.3
     2      1.6.99    2_1.6
     2      1.9.00    2_1.9
     2     1.87.66   2_1.87
B    1     1.98.00   1_1.98
     1     1.100.1  1_1.100
     1     1.101.3  1_1.101
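A regex-free sketch of the same idea, assuming every refid contains at least two dots: split from the right and keep everything before the last dot.
# '1.100.1'.rsplit('.', 1)[0] -> '1.100'
secondary = df['refid'].str.rsplit('.', n=1).str[0]
df['new'] = df.index.get_level_values('ID-B').astype(str) + '_' + secondary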

How to name Pandas Dataframe Columns automatically?

I have a pandas DataFrame df with 102 columns. Each column is named differently, say A, B, C, etc., giving the original DataFrame the following structure:
Column A  Column B  Column C  ...
Row 1
Row 2
...
Row n
I would like to change the column names from A, B, C, etc. to F1, F2, F3, ..., F102. I tried using df.columns but wasn't successful in renaming them this way. Is there a simple way to automatically rename all column names to F1 to F102, instead of renaming each column individually?
df.columns = ["F" + str(i) for i in range(1, 103)]
Note:
Instead of a “magic” number 103 you may use the calculated number of columns (+ 1), e.g.
len(df.columns) + 1, or
df.shape[1] + 1.
(Thanks to ALollz for this tip in his comment.)
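For example, a minimal sketch that derives the count from the DataFrame itself:
# Renames however many columns df actually has (f-strings need Python 3.6+).
df.columns = [f"F{i}" for i in range(1, len(df.columns) + 1)]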
One way to do this is to extract the column names and row values as lists, rename the names in a loop, and rebuild the DataFrame:
import pandas as pd

d = {'Column A': [1, 2, 3, 4, 5, 4, 3, 2, 1],
     'Column B': [1, 2, 3, 4, 5, 4, 3, 2, 1],
     'Column c': [1, 2, 3, 4, 5, 4, 3, 2, 1]}
dataFrame = pd.DataFrame(data=d)

cols = list(dataFrame.columns.values)  # original column names as a list
index = 1  # start at 1
for column in cols:
    cols[index - 1] = "F" + str(index)  # rename the column based on its position
    index += 1  # add one to index

vals = dataFrame.values.tolist()  # get the values for the rows
newDataFrame = pd.DataFrame(vals, columns=cols)  # rebuild with the new column names
print(newDataFrame)
Output:
   F1  F2  F3
0   1   1   1
1   2   2   2
2   3   3   3
3   4   4   4
4   5   5   5
5   4   4   4
6   3   3   3
7   2   2   2
8   1   1   1
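A sketch of the same renaming without rebuilding the DataFrame, using rename with a dict comprehension (my variant, not part of the answer above):
# Map each existing name to F1, F2, ... in current column order.
newDataFrame = dataFrame.rename(
    columns={old: f"F{i}" for i, old in enumerate(dataFrame.columns, start=1)}
)
print(newDataFrame)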

Pandas: counting unique datetime values in group by gives weird values

So I have this DataFrame, built so that for id equal to 2 there are two different values in the num and my_date columns:
from datetime import datetime
import pandas as pd

a = pd.DataFrame({'id': [1, 2, 3, 2],
                  'my_date': [datetime(2017, 1, i) for i in range(1, 4)] + [datetime(2017, 1, 1)],
                  'num': [2, 3, 1, 4]})
For convenience, this is the DataFrame:
   id    my_date  num
0   1 2017-01-01    2
1   2 2017-01-02    3
2   3 2017-01-03    1
3   2 2017-01-01    4
If I want to count the number of unique values for each id, I'd do
grouped_a = a.groupby('id').agg({'my_date': pd.Series.nunique,
                                 'num': pd.Series.nunique}).reset_index()
grouped_a.columns = ['id', 'num_unique_my_date', 'num_unique_num']
which gives a weird result for the datetime column. It looks like counting unique values on the datetime type (which pandas converts to datetime64[ns]) is not working?
This is a bug; see pandas GitHub issue 14423.
But you can use SeriesGroupBy.nunique, which works nicely:
grouped_a = a.groupby('id').agg({'my_date': 'nunique',
                                 'num': 'nunique'}).reset_index()
grouped_a.columns = ['id', 'num_unique_my_date', 'num_unique_num']
print(grouped_a)
   id  num_unique_my_date  num_unique_num
0   1                   1               1
1   2                   2               2
2   3                   1               1
If the DataFrame has only these 3 columns, you can use:
grouped_a = a.groupby('id').agg(['nunique']).reset_index()
grouped_a.columns = ['id', 'num_unique_my_date', 'num_unique_num']
print(grouped_a)
   id  num_unique_my_date  num_unique_num
0   1                   1               1
1   2                   2               2
2   3                   1               1
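On pandas 0.25 or newer, named aggregation expresses the same thing without the separate rename step; a sketch:
grouped_a = a.groupby('id').agg(
    num_unique_my_date=('my_date', 'nunique'),
    num_unique_num=('num', 'nunique'),
).reset_index()
print(grouped_a)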

How to delete the randomly sampled rows of a dataframe, to avoid sampling them again?

I have a DataFrame (df) of 12 rows x 5 columns. I sample 1 row from each label and create a new DataFrame (df1) of 3 rows x 5 columns. The next time I sample rows from df, I need to avoid choosing rows that are already in df1. How can I delete the already-sampled rows from df?
import pandas as pd
import numpy as np
# 12x5
df = pd.DataFrame(np.random.rand(12, 5))
label = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label

# 3x5
df1 = pd.concat(g.sample(1) for idx, g in df.groupby('label'))

# My attempt. It should be a 9x5 dataframe
df2 = pd.concat(f.drop(idx) for idx, f in df1.groupby('label'))
df
df1
df2
Starting with this DataFrame:
df = pd.DataFrame(np.random.rand(12, 5))
label=np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
Your first sample is this:
df1 = pd.concat(g.sample(1) for idx, g in df.groupby('label'))
For the second sample, you can drop df1's indices from df:
pd.concat(g.sample(1) for idx, g in df.drop(df1.index).groupby('label'))
Out:
          0         1         2         3         4  label
2  0.188005  0.765640  0.549734  0.712261  0.334071      1
4  0.599812  0.713593  0.366226  0.374616  0.952237      2
8  0.631922  0.585104  0.184801  0.147213  0.804537      3
This is not an in-place operation; it doesn't modify the original DataFrame. It just drops the rows, returns a copy, and samples from that copy. If you want it to be permanent, you can do:
df2 = df.drop(df1.index)
And sample from df2 afterwards.
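Putting it together, a sketch of repeated sampling without replacement across rounds (the remaining and samples names are mine, not from the answer):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(12, 5))
df['label'] = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

remaining = df.copy()
samples = []
for _ in range(3):  # three rounds of one row per label
    picked = pd.concat(g.sample(1) for _, g in remaining.groupby('label'))
    samples.append(picked)
    remaining = remaining.drop(picked.index)  # sampled rows can't be drawn again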

Python - insert multiples rows into an existing data frame

I am trying to insert two rows into an existing DataFrame, but can't seem to get it to work. The existing df is:
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
I want to add a blank row after each of the 1st and 2nd blocks of rows. I would like the new DataFrame to look like this:
df_new = pd.DataFrame({"a" : [1,2,0,3,4,0,5,6], "block" : [1, 1, 0, 2, 2, 0, 3, 3]})
The rows don't need to contain any values; I'm planning to use them as placeholders for something else. I've looked into adding rows, but most posts suggest appending a single row to the beginning or end of a DataFrame, which won't work in my case.
Any suggestions?
import pandas as pd

# Adds a new row to a DataFrame
# oldDf   - the DataFrame to which the row will be added
# index   - the position at which the row will be inserted
# rowData - the new data to be added as a row
# returns - a new DataFrame with the row added
def AddRow(oldDf, index, rowData):
    # pandas 2.0 removed DataFrame.append, so build the result with pd.concat
    newDf = pd.concat([oldDf.head(index),
                       pd.DataFrame(rowData),
                       oldDf.tail(-index)])
    # Clean up the row indexes so there aren't any duplicates.
    # Figured you may want this.
    newDf = newDf.reset_index(drop=True)
    return newDf

# Initial data
df1 = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "block": [1, 1, 2, 2, 3, 3]})

# Insert rows
blankRow = {"a": [0], "block": [0]}
df2 = AddRow(df1, 2, blankRow)
df2 = AddRow(df2, 5, blankRow)
For the sake of performance, you can remove the reset_index() call from the AddRow() function and simply call it once after you've added all your rows.
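A sketch of that variant, assuming AddRow no longer calls reset_index() internally:
df2 = AddRow(df1, 2, blankRow)
df2 = AddRow(df2, 5, blankRow)
df2 = df2.reset_index(drop=True)  # one cleanup pass at the end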
If you always want to insert the new row of zeros after each group of values in the block column, you can do the following:
Start with your data frame:
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
Group it using the values in the block column:
gr = df1.groupby('block')
Add a row of zeros to the end of each group:
df_new = gr.apply(lambda x: pd.concat([x, pd.DataFrame({'a': [0], 'block': [0]})],
                                      ignore_index=True))
Reset the indexes of the new dataframe:
df_new.reset_index(drop = True, inplace=True)
You can simply group the data by the block column, concatenate the placeholder at the bottom of each group, then append each group to a new DataFrame.
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
df1  # original data
Out[67]:
   a  block
0  1      1
1  2      1
2  3      2
3  4      2
4  5      3
5  6      3
df_group = df1.groupby('block')
pieces = []  # each group with its placeholder appended
for name, group in df_group:
    # pandas 2.0 removed DataFrame.append, so collect the pieces and concat them
    pieces.append(pd.concat([group, pd.DataFrame({"a": [0], "block": [0]})]))
df = pd.concat(pieces, ignore_index=True)  # final data
df
Out[71]:
   a  block
0  1      1
1  2      1
2  0      0
3  3      2
4  4      2
5  0      0
6  5      3
7  6      3
8  0      0
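A more vectorized sketch of the same idea (my own variant, not from the answers above): build all the placeholders at once, index them at each block's last row, and let a stable sort slot each one in right after its block.
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "block": [1, 1, 2, 2, 3, 3]})

# One placeholder per block, sharing the index of that block's last row.
last_rows = df1.groupby("block").tail(1).index
placeholders = pd.DataFrame({"a": 0, "block": 0}, index=last_rows)

# A stable sort keeps each original row ahead of its placeholder.
df_new = (pd.concat([df1, placeholders])
            .sort_index(kind="stable")
            .reset_index(drop=True))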
