JSON list flatten to dataframe as multiple columns with prefix - python

I have a JSON document with some nested/array items like the one below, and I'm looking at flattening it before saving it into a CSV.
[{'SKU': 'SKU1', 'name': 'test name 1',
  'ItemSalesPrices': [{'SourceNumber': 'OEM', 'AssetNumber': 'TEST1A', 'UnitPrice': 1600},
                      {'SourceNumber': 'RRP', 'AssetNumber': 'TEST1B', 'UnitPrice': 1500}]},
 {'SKU': 'SKU2', 'name': 'test name 2',
  'ItemSalesPrices': [{'SourceNumber': 'RRP', 'AssetNumber': 'TEST2', 'UnitPrice': 1500}]}]
I have attempted the good solution here, flattern nested JSON and retain columns (or Panda json_normalize), but got nowhere, so I'm hoping to get some tips from the community. The desired output is:
SKU   Name         ItemSalesPrices_OEM_UnitPrice  ItemSalesPrices_OEM_AssetNumber  ItemSalesPrices_RRP_UnitPrice  ItemSalesPrices_RRP_AssetNumber
SKU1  test name 1  1600                           TEST1A                           1500                           TEST1B
SKU2  test name 2                                                                  1500                           TEST2
Thank you

Use json_normalize:
import numpy as np
import pandas as pd

first = ['SKU', 'name']
# L is the list of dicts shown in the question
df = pd.json_normalize(L, 'ItemSalesPrices', first)
print(df)
  SourceNumber AssetNumber  UnitPrice   SKU         name
0          OEM      TEST1A       1600  SKU1  test name 1
1          RRP      TEST1B       1500  SKU1  test name 1
2          RRP       TEST2       1500  SKU2  test name 2
Then you can pivot the values - use sum for numeric columns and join for string columns:
f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ','.join(x)
df1 = (df.pivot_table(index=first,
                      columns='SourceNumber',
                      aggfunc=f))
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.rename_axis(None, axis=1).reset_index()
print (df1)
SKU name AssetNumber_OEM AssetNumber_RRP UnitPrice_OEM \
0 SKU1 test name 1 TEST1A TEST1B 1600.0
1 SKU2 test name 2 NaN TEST2 NaN
UnitPrice_RRP
0 1500.0
1 1500.0
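For completeness, here is a self-contained sketch that strings the two steps together and rebuilds the column names in the ItemSalesPrices_<Source>_<Field> order the question asks for (the records variable and the final rename/to_csv line are illustrative additions, not part of the original answer):
import numpy as np
import pandas as pd

records = [{'SKU': 'SKU1', 'name': 'test name 1',
            'ItemSalesPrices': [{'SourceNumber': 'OEM', 'AssetNumber': 'TEST1A', 'UnitPrice': 1600},
                                {'SourceNumber': 'RRP', 'AssetNumber': 'TEST1B', 'UnitPrice': 1500}]},
           {'SKU': 'SKU2', 'name': 'test name 2',
            'ItemSalesPrices': [{'SourceNumber': 'RRP', 'AssetNumber': 'TEST2', 'UnitPrice': 1500}]}]

first = ['SKU', 'name']
df = pd.json_normalize(records, 'ItemSalesPrices', first)

# sum numeric columns, join string columns
f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ','.join(x)
df1 = df.pivot_table(index=first, columns='SourceNumber', aggfunc=f)

# columns are (field, source) tuples such as ('UnitPrice', 'OEM');
# rebuild them as ItemSalesPrices_<Source>_<Field>
df1.columns = [f'ItemSalesPrices_{source}_{field}' for field, source in df1.columns]
df1 = df1.reset_index()
df1.to_csv('items.csv', index=False)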

Related

Pandas AttributeError: 'str' object has no attribute 'loc'

this is my code:
DF['CustomerId'] = DF['CustomerId'].apply(str)
print(DF.dtypes)
for index, row in merged.iterrows():
    DF = DF.loc[(DF['CustomerId'] == str(row['CustomerId'])), 'CustomerId'] = row['code']
My goal is to do this:
if DF['CustomerId'] is equal to row['CustomerId'], then change the value of DF['CustomerId'] to row['code'];
else leave it as it is.
row['CustomerId'] and DF['CustomerId'] should be strings. I know that loc does not work like this with strings, but how can I do this with the string type?
thanks
You can approach this without looping by merging the two dataframes on the common CustomerId column using .merge(), and then updating the CustomerId column with the code column originating from the `merged` dataframe using .update(), as follows:
df_out = DF.merge(merged, on='CustomerId', how='left')
df_out['CustomerId'].update(df_out['code'])
Demo
Data Preparation:
data = {'CustomerId': ['11111', '22222', '33333', '44444'],
        'CustomerInfo': ['Albert', 'Betty', 'Charles', 'Dicky']}
DF = pd.DataFrame(data)
print(DF)
CustomerId CustomerInfo
0 11111 Albert
1 22222 Betty
2 33333 Charles
3 44444 Dicky
data = {'CustomerId': ['11111', '22222', '44444'],
        'code': ['A1011111', 'A1022222', 'A1044444']}
merged = pd.DataFrame(data)
print(merged)
CustomerId code
0 11111 A1011111
1 22222 A1022222
2 44444 A1044444
Run New Code
# ensure the CustomerId columns are strings, as you did
DF['CustomerId'] = DF['CustomerId'].astype(str)
merged['CustomerId'] = merged['CustomerId'].astype(str)
df_out = DF.merge(merged, on='CustomerId', how='left')
print(df_out)
CustomerId CustomerInfo code
0 11111 Albert A1011111
1 22222 Betty A1022222
2 33333 Charles NaN
3 44444 Dicky A1044444
df_out['CustomerId'].update(df_out['code'])
print(df_out)
# `CustomerId` column updated as required if there are corresponding entries in dataframe `merged`
CustomerId CustomerInfo code
0 A1011111 Albert A1011111
1 A1022222 Betty A1022222
2 33333 Charles NaN
3 A1044444 Dicky A1044444
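If you prefer not to carry the extra code column in the result, a map-based alternative should also work (a sketch of my own, not part of the answer above): build a CustomerId-to-code lookup from merged and fall back to the original id where there is no match.
# lookup Series indexed by CustomerId
lookup = merged.set_index('CustomerId')['code']
# map the ids that have a code, keep the original id otherwise
DF['CustomerId'] = DF['CustomerId'].map(lookup).fillna(DF['CustomerId'])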

How to create columns from rows given key:value pair in the column in pandas?

I have a DataFrame of this kind:
pd.DataFrame({'label': ['A', 'test1: A', 'test2: A', 'B', 'test1: B', 'test3: B'],
              'value': [1, 2, 3, 4, 5, 6]})
label value
0 A 1
1 test1: A 2
2 test2: A 3
3 B 4
4 test1: B 5
5 test3: B 6
And I need to convert to this:
pd.DataFrame({'label': ['A', 'B'],
              'value': [1, 4],
              'test1:': [2, 5],
              'test2:': [3, None],
              'test3:': [None, 6]})
label value test1: test2: test3:
0 A 1 2 3.0 NaN
1 B 4 5 NaN 6.0
I need to keep one row per unique label, with the keys merged to the right as columns when they are present in the data. The keys may vary and have different names for each value.
Feel free to suggest a better title for this question, because I could not find a better way to name the problem.
EDIT:
This solution partly contains what I need, but there is no decent way to add columns representing the key in the label column. Ideally, something like a function that takes the df as input is needed.
Extract information into two data frames and merge them.
# rows whose label carries a 'testN:' key
df2 = df[df['label'].str.contains('test')]
# split 'test1: A' into the key and the base label
df3 = df2['label'].str.split(expand=True).rename(columns={0: "test", 1: "label"})
df3['value'] = df2['value']
# one column per key, one row per base label
df3 = df3.pivot_table(index='label', columns='test', values='value')
# rows with the plain labels, with the key columns merged back on
df2 = df[~df['label'].str.contains('test')]
df4 = pd.merge(df2, df3, on='label')
Output
label value test1: test2: test3:
0 A 1 2.0 3.0 NaN
1 B 4 5.0 NaN 6.0
Here's a way to do that:
# tag the plain labels as 'value:<label>' so they pivot alongside the test keys
df.loc[~df.label.str.contains(":"), "label"] = df.loc[~df.label.str.contains(":"), "label"].str.replace(r"(^.*$)", r"value:\1", regex=True)
# split into the key part and the base label, stripping the space after the colon
labels = df.label.str.split(":", expand=True).rename(columns={0: "label1", 1: "label2"})
labels["label2"] = labels["label2"].str.strip()
df = pd.concat([df, labels], axis=1)
# one column per key, one row per base label
df = pd.pivot_table(df, index="label2", columns="label1", dropna=False)
df.columns = [c[1] for c in df.columns]
df.index.name = "label"
The output is:
test1 test2 test3 value
label
A 2.0 3.0 NaN 1.0
B 5.0 NaN 6.0 4.0
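Since the question asks for something like a function with the df as input, here is a sketch wrapping the first answer's approach in a reusable function (the name spread_tests is my own choice, not from either answer):
import pandas as pd

def spread_tests(df):
    # rows whose label carries a 'testN: <label>' key
    is_test = df['label'].str.contains(':')
    base = df[~is_test]            # plain labels with their base value
    tests = df[is_test].copy()
    # split 'test1: A' into the key ('test1:') and the base label ('A')
    parts = tests['label'].str.split(expand=True)
    tests['test'] = parts[0]
    tests['label'] = parts[1]
    # one column per key, one row per base label
    wide = tests.pivot_table(index='label', columns='test', values='value').reset_index()
    return base.merge(wide, on='label', how='left')

df = pd.DataFrame({'label': ['A', 'test1: A', 'test2: A', 'B', 'test1: B', 'test3: B'],
                   'value': [1, 2, 3, 4, 5, 6]})
print(spread_tests(df))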

How to add a dataset identifier (like id column) when append two or more datasets?

I have multiple datasets in CSV format that I would like to import and append. Each dataset has the same column names (fields), but different values and lengths.
For example:
df1
date name surname age address
...
df2
date name surname age address
...
I would like to have
df=df1+df2
date name surname age address dataset
(df1) 1
... 1
(df2) 2
... 2
i.e. I would like to add a new column that identifies which dataset each row comes from (dataset 1 or dataset 2).
How can I do it?
Is this what you're looking for?
Note: the example has fewer columns than yours, but the method is the same.
import pandas as pd
df1 = pd.DataFrame({
    'name': [f'Name{i}' for i in range(5)],
    'age': range(10, 15)
})
df2 = pd.DataFrame({
    'name': [f'Name{i}' for i in range(20, 22)],
    'age': range(20, 22)
})
combined = pd.concat([df1, df2])
combined['dataset'] = [1] * len(df1) + [2] * len(df2)
print(combined)
Output
name age dataset
0 Name0 10 1
1 Name1 11 1
2 Name2 12 1
3 Name3 13 1
4 Name4 14 1
0 Name20 20 2
1 Name21 21 2
There is also a keys parameter in concat; naming that key level and resetting it into a column gives the identifier directly:
combined = pd.concat([df1, df2], keys=[1, 2], names=['dataset']).reset_index(level=0)
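With the df1 and df2 from the first answer, printing combined should give something like the following (the dataset identifier becomes a regular column and the original row indices are kept):
print(combined)
   dataset    name  age
0        1   Name0   10
1        1   Name1   11
2        1   Name2   12
3        1   Name3   13
4        1   Name4   14
0        2  Name20   20
1        2  Name21   21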
In Spark with Scala, I would do something like this:
import org.apache.spark.sql.functions._
val df1 = sparkSession.read
  .option("inferSchema", "true")
  .json("/home/shredder/Desktop/data1.json")
val df2 = sparkSession.read
  .option("inferSchema", "true")
  .json("/home/shredder/Desktop/data2.json")
val df1New = df1.withColumn("dataset",lit(1))
val df2New = df2.withColumn("dataset",lit(2))
val df3 = df1New.union(df2New)
df3.show()

Select rows based on a where statement

How can I select values that have the word "link" in them and put them in category 1, those that have "popcorn" in them in category 2, and everything else in category 3?
Here is a sample, but my actual dataset has hundreds of rows:
data = {'model': [['Lisa', 'link'], ['Lisa 2', 'popcorn'], ['telephone', 'rabbit']],
        'launched': [1983, 1984, 1991]}
df = pd.DataFrame(data, columns=['model', 'launched'])
Desired
Model launched category
['Lisa', 'link'] 1983 1
['Lisa 2', 'popcorn'] 1984 2
['telephone', 'rabbit'] 1991 3
You could use np.select to set category to 1 or 2 depending on whether 'link' or 'popcorn' is contained in a given list, with a default of 3 for the case where neither is contained:
import numpy as np
c1 = ['link' in i for i in df.model]
c2 = ['popcorn' in i for i in df.model]
df['category'] = np.select([c1,c2], [1,2], 3)
model launched category
0 [Lisa, link] 1983 1
1 [Lisa 2, popcorn] 1984 2
2 [telephone, rabbit] 1991 3
You can use the apply function.
Create a function:
def get_categories(row):
    if 'link' in row.model:
        return 1
    elif 'popcorn' in row.model:
        return 2
    else:
        return 3
And then call it like that:
df['category'] = df.apply(get_categories, axis=1)
df
Outputs:
model launched category
0 [Lisa, link] 1983 1
1 [Lisa 2, popcorn] 1984 2
2 [telephone, rabbit] 1991 3
EDIT:
Based on #gred_data comment, you can actually do that in one line in order to increase performance:
df['category'] = df.model.apply(lambda x: 1 if 'link' in x else 2 if 'popcorn' in x else 3)
df
Gets you the same result.

Dropping duplicate rows but keeping certain values Pandas

I have two similar dataframes that I concatenated; they have a lot of repeated values because they are basically the same data set, but for different years.
The problem is that one of the sets has some values missing whereas the other sometimes has these values.
For example:
Name   Unit  Year  Level
Nik    1     2000  12
Nik    1           12
John   2     2001  11
John   2     2001  11
Stacy  1           8
Stacy  1     1999  8
...
I want to drop duplicates on the subset = ['Name', 'Unit', 'Level'] since some repetitions don't have years.
However, I'm left with the data that has no Year and I'd like to keep the data with these values:
Name   Unit  Year  Level
Nik    1     2000  12
John   2     2001  11
Stacy  1     1999  8
...
How do I keep these values rather than the blanks?
Use sort_values with the default parameter na_position='last' (so it can be omitted), and then drop_duplicates:
print (df)
Name Unit Year Level
0 Nik 1 NaN 12
1 Nik 1 2000.0 12
2 John 2 2001.0 11
3 John 2 2001.0 11
4 Stacy 1 NaN 8
5 Stacy 1 1999.0 8
subset = ['Name', 'Unit', 'Level']
df = df.sort_values('Year').drop_duplicates(subset)
Or:
df = df.sort_values(subset + ['Year']).drop_duplicates(subset)
print (df)
Name Unit Year Level
5 Stacy 1 1999.0 8
1 Nik 1 2000.0 12
2 John 2 2001.0 11
Another solution uses GroupBy.first to return the first non-missing value of Year per group:
df = df.groupby(subset, as_index=False, sort=False)['Year'].first()
print (df)
Name Unit Level Year
0 Nik 1 12 2000.0
1 John 2 11 2001.0
2 Stacy 1 8 1999.0
One solution that comes to mind is to first sort the concatenated dataframe by year with the sort_values function:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
then drop duplicates with the keep='first' parameter:
df = df.sort_values('Year').drop_duplicates(subset=['Name', 'Unit', 'Level'], keep="first")
I would suggest that you look at the creation step of your merged dataset.
When merging the data sets, you can do so on multiple columns, i.e.:
df = pd.merge(left, right, how='outer', on=['Name', 'Unit', 'Level'], suffixes=['', '_r'])
With the outer join you collect all data sets and remove duplicates right away. The only thing left is to merge the Year column which you can do like so:
df['Year'] = df[['Year', 'Year_r']].apply(lambda x: x['Year'] if (x['Year'] is not np.nan and x['Year'] != '') else x['Year_r'], axis=1)
This fills the gaps and afterwards you are able to simply drop the 'Year_r' column.
The benefit here is that not only NaN values of missing years are covered but also missing Years which are represented as empty strings.
Below is a small working example:
import pandas as pd
import numpy as np
left = pd.DataFrame({'Name': ['Adam', 'Beatrice', 'Crissy', 'Dumbo', 'Peter', 'Adam'],
                     'Unit': ['2', '4', '6', '2', '4', '12'],
                     'Year': ['', '2009', '1954', '2025', '2012', '2024'],
                     'Level': ['L1', 'L1', 'L0', 'L4', 'L3', 'L10']})
right = pd.DataFrame({'Name': ['Adam', 'Beatrice', 'Crissy', 'Dumbo'],
                      'Unit': ['2', '4', '6', '2'],
                      'Year': ['2010', '2009', '1954', '2025'],
                      'Level': ['L1', 'L1', 'L0', 'L4']})
df = pd.merge(left, right, how='outer', on=['Name', 'Unit', 'Level'], suffixes=['', '_r'])
df['Year'] = df[['Year', 'Year_r']].apply(lambda x: x['Year'] if (x['Year'] is not np.nan and x['Year'] != '') else x['Year_r'], axis=1)
df
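As a possible simplification of the row-wise apply above (my own variation, not part of the original answer), the empty strings can be treated as missing so that fillna does the gap filling, after which the helper column is dropped:
# treat empty strings as missing, fall back to the year from `right`,
# then drop the helper column
df['Year'] = df['Year'].replace('', np.nan).fillna(df['Year_r'])
df = df.drop(columns='Year_r')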
