how can I add multiple levels to each category - python

For example, I have multiple locations under the location column, and I want to add group numbers within each location. But the number of groups differs between locations.
e.g. df1
Location
Chicago
Minneapolis
Dallas
.
.
.
and df2
Location times
Chicago 2
Minneapolis 5
Dallas 1
. .
. .
. .
What I want to get is:
Location Group
Chicago 1
Chicago 2
Minneapolis 1
Minneapolis 2
Minneapolis 3
Minneapolis 4
Minneapolis 5
Dallas 1
.
.
.
What I have now repeats the same number of groups for every location: 17 groups within each location. But I just realized the number of groups differs between locations... and then I don't know what to do next.
import numpy as np
import pandas as pd

filled_results = pd.DataFrame()  # start from an empty frame
filled_results['location'] = results['location'].unique()
filled_results['times'] = 17
filled_results = filled_results.loc[filled_results.index.repeat(filled_results.times)]
v = pd.Series(range(1, 18))
filled_results['group'] = np.tile(v, len(filled_results) // len(v) + 1)[:len(filled_results)]
filled_results = filled_results.drop(columns=['times'])
I was thinking about a for loop but don't know how to write it: for each unique location in df1, assign groups 1 to x based on the number of groups in df2.

I think I found a solution myself. It's straightforward if you treat it as adding an index within each group. Here's the solution:
df = pd.DataFrame()
df['location'] = df1['location'].unique()
df = pd.merge(df, df2, on='location', how='left')
df = df.loc[df.index.repeat(df.times)]
df['Group'] = df.groupby('location')['times'].rank(method='first', ascending=True)
df['Group'] = df['Group'].astype(int)
df = df.drop(columns=['times'])
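A slightly shorter variant of the same idea replaces the rank() call with groupby().cumcount(), which numbers rows within each group directly. This is a sketch on the sample data from the question (the 'location'/'times' column names are assumed):

```python
import pandas as pd

# sample df2 from the question: each location with its number of groups
df2 = pd.DataFrame({'location': ['Chicago', 'Minneapolis', 'Dallas'],
                    'times': [2, 5, 1]})

# repeat each row 'times' times, then number the rows within each location
out = df2.loc[df2.index.repeat(df2['times'])].reset_index(drop=True)
out['Group'] = out.groupby('location').cumcount() + 1
out = out.drop(columns=['times'])
print(out)
```

Because the repeated rows for each location are contiguous, cumcount() yields 1..n per location without needing a rank over a helper column.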

You can check out this code:
data = [
    {'name': 'Chicago', 'c': 2},
    {'name': 'Minneapolis', 'c': 5},
    {'name': 'Dallas', 'c': 1}
]
result = []
for location in data:
    for i in range(0, location['c']):
        result.append({'name': location['name'], 'group': i + 1})
result will be:
[{'group': 1, 'name': 'Chicago'}, {'group': 2, 'name': 'Chicago'}, {'group': 1, 'name': 'Minneapolis'}, {'group': 2, 'name': 'Minneapolis'}, {'group': 3, 'name': 'Minneapolis'}, {'group': 4, 'name': 'Minneapolis'}, {'group': 5, 'name': 'Minneapolis'}, {'group': 1, 'name': 'Dallas'}]

How to access nested dictionary from a list of dictionaries

I have a list of dictionaries (sorry, it's a bit complex, but I'm trying to show the real data):
[{'alerts': [{'city': ' city name1',
'country': 'ZZ',
'location': {'x': 1, 'y': 3},
'milis': 1582337463000},
{'city': ' city name2',
'country': 'ZZ',
'location': {'x': 1, 'y': 3},
'pubMillis': 1582337573000,
'type': 'TYPE2'}],
'end': '11:02:00:000',
'start': '11:01:00:000'},
{'alerts': [{'city': ' city name3',
'country': 'ZZ',
'location': {'x': 1, 'y': 3},
'milis': 1582337463000}],
'end': '11:02:00:000',
'start': '11:01:00:000'}]
In general the list structure is like this :
[
{ [
{ {},
},
{ {},
}
],
},
{ [
{ {},
},
{ {},
}
],
}
]
If I want to access city name1, I can access using this line of code : alerts[0]['alerts'][0]['city'].
If I want to access city name2, I can access using this code : alerts[0]['alerts'][1]['city'].
How can I access this in a loop?
Use nested loops:
Where alerts equals the list of dicts
for x in alerts:
    for alert in x['alerts']:
        print(alert['city'])
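The same traversal can also be written as a flat list comprehension. A small self-contained sketch, with the sample structure trimmed to the relevant keys:

```python
# sample structure from the question, trimmed to the relevant keys
# (the city names keep their leading spaces, as in the question's data)
alerts = [
    {'alerts': [{'city': ' city name1'}, {'city': ' city name2'}],
     'start': '11:01:00:000', 'end': '11:02:00:000'},
    {'alerts': [{'city': ' city name3'}],
     'start': '11:01:00:000', 'end': '11:02:00:000'},
]

# the nested loops above, flattened into one expression
cities = [alert['city'] for x in alerts for alert in x['alerts']]
print(cities)  # [' city name1', ' city name2', ' city name3']
```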
Use pandas
data equals your sample list of dicts
import pandas as pd
# create the dataframe and explode the list of dicts
df = pd.DataFrame(data).explode('alerts').reset_index(drop=True)
# json_normalize the dicts and join back to df
df = df.join(pd.json_normalize(df.alerts))
# drop the alerts column as it's no longer needed
df.drop(columns=['alerts'], inplace=True)
# output
start end country city milis location.x location.y type pubMillis
0 11:01:00:000 11:02:00:000 ZZ city name1 1.582337e+12 1 3 NaN NaN
1 11:01:00:000 11:02:00:000 ZZ city name2 NaN 1 3 TYPE2 1.582338e+12
2 11:01:00:000 11:02:00:000 ZZ city name3 1.582337e+12 1 3 NaN NaN
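For reference, here is a self-contained version of those steps with the sample list from the question pasted in (json_normalize is fed a plain list of dicts, and the drop of the exploded column is kept at the end):

```python
import pandas as pd

# sample list of dicts from the question (city names kept verbatim,
# including the leading spaces)
data = [
    {'alerts': [{'city': ' city name1', 'country': 'ZZ',
                 'location': {'x': 1, 'y': 3}, 'milis': 1582337463000},
                {'city': ' city name2', 'country': 'ZZ',
                 'location': {'x': 1, 'y': 3},
                 'pubMillis': 1582337573000, 'type': 'TYPE2'}],
     'end': '11:02:00:000', 'start': '11:01:00:000'},
    {'alerts': [{'city': ' city name3', 'country': 'ZZ',
                 'location': {'x': 1, 'y': 3}, 'milis': 1582337463000}],
     'end': '11:02:00:000', 'start': '11:01:00:000'}]

# one row per nested alert, with the dict keys spread into columns
df = pd.DataFrame(data).explode('alerts').reset_index(drop=True)
df = df.join(pd.json_normalize(df['alerts'].tolist())).drop(columns=['alerts'])

cities = df['city'].str.strip().tolist()
print(cities)  # ['city name1', 'city name2', 'city name3']
```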
What is the goal? To get all city names?
>>> for top_level_alert in alerts:
for nested_alert in top_level_alert['alerts']:
print(nested_alert['city'])
city name1
city name2
city name3

how to merge multiple unique id

I have a data similar to below:
Id Car Code ShowTime
1 Honda A 10/18/2017 14:45
2 Honda A 10/18/2017 17:10
3 Honda C 10/18/2017 19:35
4 Toyota B 10/18/2017 12:20
5 Toyota B 10/18/2017 14:45
My code below returns multiple rows of output when I include Id, which is unique:
all_car_schedules = db.session.query(Schedules.id, Schedules.code,
                                     Car.carname, Schedules.showtime) \
    .filter(Schedules.id == Car.id)
df = pd.read_sql(all_car_schedules.statement, db.session.bind)
df[['show_date', 'start_times', 'median']] = df.showtime.str.split(' ', expand=True)
df['start_times'] = df['start_times'] + df['median']
df.drop('screening', axis=1, inplace=True)
df.drop('median', axis=1, inplace=True)
df_grp = df.groupby(['id', 'code', 'carname'])
df_grp_time_stacked = df_grp['start_times'].apply(list).reset_index()
df_grp_time_stacked['start_times'] = df_grp_time_stacked['start_times'].apply(lambda x: x[0] if (len(x) == 1) else x)
return_to_dict = df_grp_time_stacked.to_dict(orient='records')
Code above returns multiple rows when the expected output should be:
"data":{
'id': '1',
'schedule': {
'car': 'Honda',
'show_date': '10/18/2017',
'time_available': [
'14:45',
'17:10',
],
'code': 'A'
}
},{
'id': '3',
'schedule': {
'car': 'Honda',
'show_date': '10/18/2017',
'time_available': [
'19:35'
],
'code': 'C'
}
},{
'id': '4',
'schedule': {
'car': 'Toyota',
'show_date': '10/18/2017',
'time_available': [
'12:20',
'14:45'
],
'code': 'B'
}
}
I am also using sqlite3 as the db, and I am not sure if the query needs to change. Please let me know your thoughts and help me with this. Thank you so much.
You could use the groupby() function combined with the list option:
df = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                   'Car': ['Honda', 'Honda', 'Honda', 'Toyota', 'Toyota'],
                   'Code': ['A', 'A', 'B', 'C', 'C'],
                   'show date': ['10/18/2017', '10/18/2017', '10/18/2017',
                                 '10/18/2017', '10/18/2017'],
                   'start_times': ['14:45', '17:10', '19:35', '12:20', '14:45']})
df.groupby(['Car', 'Code', 'show date'])['start_times'].apply(list)
Output :
start_times
Car Code show date
Honda A 10/18/2017 [14:45, 17:10]
B 10/18/2017 [19:35]
Toyota C 10/18/2017 [12:20, 14:45]
If you want to keep the first Id, add a 'first' aggregation for the Id column, like so:
df.groupby(['Car', 'Code', 'show date']).agg({'start_times' : list, 'Id' : 'first'})
# Output
start_times Id
Car Code show date
Honda A 10/18/2017 [14:45, 17:10] 1
B 10/18/2017 [19:35] 3
Toyota C 10/18/2017 [12:20, 14:45] 4
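To get from the grouped frame to the nested records shown in the question, one option is named aggregation plus a comprehension. This is a sketch using the question's sample data; the 'show_date' column name (the date part already split out of ShowTime) is an assumption:

```python
import pandas as pd

# the question's sample data, with the date already split out of ShowTime
df = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                   'Car': ['Honda', 'Honda', 'Honda', 'Toyota', 'Toyota'],
                   'Code': ['A', 'A', 'C', 'B', 'B'],
                   'show_date': ['10/18/2017'] * 5,
                   'start_times': ['14:45', '17:10', '19:35', '12:20', '14:45']})

# collect times into lists and keep the first Id per group
grouped = (df.groupby(['Car', 'Code', 'show_date'], as_index=False)
             .agg(time_available=('start_times', list),
                  first_id=('Id', 'first')))

# reshape each grouped row into the nested dict structure from the question
data = [{'id': str(r.first_id),
         'schedule': {'car': r.Car, 'show_date': r.show_date,
                      'time_available': r.time_available, 'code': r.Code}}
        for r in grouped.itertuples()]
```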

Operations on multiple data frames in pandas

I have several tables that look like this:
ID YY ZZ
2 97 826
2 78 489
4 47 751
4 110 322
6 67 554
6 88 714
code:
raw = {'ID': [2, 2, 4, 4, 6, 6],
       'YY': [97, 78, 47, 110, 67, 88],
       'ZZ': [826, 489, 751, 322, 554, 714]}
df = pd.DataFrame(raw)
For each of these dfs, I have to perform a number of operations: first group by ID, then extract the length and the mean of the ZZ column, and put the results in a new df.
New df that looks like this
Cities length mean
Paris 0 0
Madrid 0 0
Berlin 0 0
Warsaw 0 0
London 0 0
code:
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
'length': 0,
'mean': 0}
df2 = pd.DataFrame(raw2)
I pulled out the average and the size of individual groups
df_grouped = df.groupby('ID').ZZ.size()
df_grouped2 = df.groupby('ID').ZZ.mean()
The problem occurs when transferring the results to the new table, because it does not contain all the cities and the results must be matched by the appropriate key.
I tried to use a dictionary:
dic_cities = {"Paris": df_grouped.loc[2],
              "Madrid": df_grouped.loc[4],
              "Warsaw": df_grouped.loc[6],
              "Berlin": df_grouped.loc[8],
              "London": df_grouped.loc[10]}
Unfortunately, I'm receiving KeyError: 8
I have 19 df's from which I have to extract this data and the final tables have to look like this:
Cities length mean
Paris 2 657.5
Madrid 2 536.5
Berlin 0 0.0
Warsaw 2 634.0
London 0 0.0
Does anyone know how to deal with it using groupby and the dictionary or knows a better way to do it?
First, you should index df2 on 'Cities':
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
        'length': 0,
        'mean': 0}
df2 = pd.DataFrame(raw2).set_index('Cities')
Then you should reverse your dictionary:
dic_cities = {2: "Paris",
              4: "Madrid",
              6: "Warsaw",
              8: "Berlin",
              10: "London"}
Once this is done, the processing is as simple as a groupby:
import numpy as np  # needed for np.mean below

for i, sub in df.groupby('ID'):
    df2.loc[dic_cities[i]] = sub.ZZ.agg([len, np.mean]).tolist()
Which gives for df2:
length mean
Cities
Paris 2.0 657.5
Madrid 2.0 536.5
Berlin 0.0 0.0
Warsaw 2.0 634.0
London 0.0 0.0
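If the loop over 19 dfs feels heavy, an alternative sketch is to aggregate once, map the IDs through the reversed dictionary, and reindex so that missing cities get 0:

```python
import pandas as pd

# sample data from the question
raw = {'ID': [2, 2, 4, 4, 6, 6],
       'YY': [97, 78, 47, 110, 67, 88],
       'ZZ': [826, 489, 751, 322, 554, 714]}
df = pd.DataFrame(raw)

dic_cities = {2: 'Paris', 4: 'Madrid', 6: 'Warsaw', 8: 'Berlin', 10: 'London'}

# aggregate once, map IDs to city names, then reindex: cities with no
# matching ID get filled with 0, matching the expected final table
stats = df.groupby('ID')['ZZ'].agg(length='size', mean='mean')
stats.index = stats.index.map(dic_cities)
cities = ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London']
df2 = stats.reindex(cities, fill_value=0).rename_axis('Cities')
print(df2)
```

This avoids both the explicit loop and the KeyError, since reindex fills the missing keys instead of looking them up.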
See this:
import pandas as pd
# setup raw data
raw = {'ID': [2, 2, 4, 4, 6, 6,], 'YY': [97,78,47,110,67,88], 'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)
# get mean values
mean_values = df.groupby('ID').mean()
# drop column
mean_values = mean_values.drop(['YY'], axis=1)
# get occurrence number
occurrence = df.groupby('ID').size()
# save data
result = pd.concat([occurrence, mean_values], axis=1, sort=False)
# rename columns
result.rename(columns={0:'length', 'ZZ':'mean'}, inplace=True)
# city data
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'], 'length': 0, 'mean': 0}
df2 = pd.DataFrame(raw2)
# rename indexes
df2 = df2.rename(index={0: 2, 1: 4, 2: 8, 3: 6, 4: 10})
# merge data
df2['length'] = result['length']
df2['mean'] = result['mean']
Output:
Cities length mean
2 Paris 2.0 657.5
4 Madrid 2.0 536.5
8 Berlin NaN NaN
6 Warsaw 2.0 634.0
10 London NaN NaN

Converting pandas JSON rows into separate columns

I have a pandas dataframe where one of the columns is in JSON format. It contains lists of movie production companies for a given title. Below is the sample structure:
ID | production_companies
---------------
1 | "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]"
2 | "[{'name': 'Walt Disney Pictures', 'id': 2}]"
3 | "[{'name': 'Bold Films', 'id': 2266}, {'name': 'Blumhouse Productions', 'id': 3172}, {'name': 'Right of Way Films', 'id': 32157}]"
4 | nan
5 | nan
6 | nan
7 | "[{'name': 'Ghost House Pictures', 'id': 768}, {'name': 'North Box Productions', 'id': 22637}]"
As you see one movie (row) can have multiple production companies. I want to create for each movie separate columns containing names of the producers. Columns should look like: name_1, name_2, name_3,... etc. If there is no second or third producer it should be NaN.
I don't have much experience working with JSON, and I've tried a few approaches (iterators with lambda functions), but they are not even close to what I need.
Therefore I hope for your help, guys!
EDIT:
The following code ("movies" is the main database):
from pandas.io.json import json_normalize
companies = list(movies['production_companies'])
json_normalize(companies)
gives me the following error:
AttributeError: 'str' object has no attribute 'values'
Adding on to @Andy's answer to address the OP's question.
This part was by @Andy:
import pandas as pd
import numpy as np
import ast
import itertools
# dummy data
df = pd.DataFrame({
"ID": [1,2,3],
"production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})
# remove the nans
df.dropna(inplace=True)
# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(lambda x: ast.literal_eval(x))
My additions to answer OP's requirements:
tmp_lst = []
for idx, item in df.groupby(by='ID'):
    # Crediting this part to @Andy
    tmp_df = pd.DataFrame(list(itertools.chain(*item["production_companies"].values.tolist()))).drop(columns='id')
    # Transpose dataframe
    tmp_df = tmp_df.T
    # Add back movie id to tmp_df
    tmp_df['ID'] = item['ID'].values
    # Accumulate tmp_df from all unique movie ids
    tmp_lst.append(tmp_df)
pd.concat(tmp_lst, sort=False)
Result:
0 1 2 ID
name Paramount Pictures United Artists Metro-Goldwyn-Mayer (MGM) 1
name Walt Disney Pictures NaN NaN 3
This should do it
import pandas as pd
import numpy as np
import ast
import itertools
# dummy data
df = pd.DataFrame({
"ID": [1,2,3],
"production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})
# remove the nans
df.dropna(inplace=True)
# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(lambda x: ast.literal_eval(x))
# flatten the column of lists into a single list, and convert to DataFrame
pd.DataFrame(list(itertools.chain(*df["production_companies"].values.tolist())))
Which yields:
id name
0 4 Paramount Pictures
1 60 United Artists
2 8411 Metro-Goldwyn-Mayer (MGM)
3 2 Walt Disney Pictures
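Both answers produce a long format; the question also asks for name_1, name_2, ... columns per movie, with NaN where a movie has fewer producers. A sketch of that reshaping on the same dummy data (the NaN padding falls out of building a DataFrame from ragged lists):

```python
import ast
import pandas as pd
import numpy as np

# dummy data from the answers above
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'production_companies': [
        "[{'name': 'Paramount Pictures', 'id': 4}, "
        "{'name': 'United Artists', 'id': 60}, "
        "{'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]",
        np.nan,
        "[{'name': 'Walt Disney Pictures', 'id': 2}]"]})

# parse the JSON-ish strings; NaN rows become empty lists
parsed = df['production_companies'].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else [])
names = parsed.apply(lambda lst: [d['name'] for d in lst])

# ragged lists become columns, with shorter rows padded to NaN
wide = pd.DataFrame(names.tolist(), index=df.index)
wide.columns = [f'name_{i + 1}' for i in range(wide.shape[1])]
out = df[['ID']].join(wide)
print(out)
```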

creating an empty JSON file and updating it with a pandas dataframe's rows in python

So, I have a pandas dataframe with a large number of rows, one of which might look like this:
data_store.iloc[0]
Out[5]:
mac_address 00:03:7f:05:c0:06
Visit 1/4/2016
storeid Ritika - Bhubaneswar
Mall or not High Street
Address 794, Sahid Nagar, Janpath, Bhubaneswar-751007
Unnamed: 4 OR
Locality JanPath
Affluence Index Locality 4
Lifestyle Index 5
Tourist Attraction In the locality? 0
City Bhubaneswar
Pop Density City 2131
Population Density of City NaN
City Affluence Index Medium
Mall / Shopping Complex High Street
Mall Premiumness Index NaN
Multiplex NaN
Offices Nearby NaN
Food Court NaN
Average Footfall NaN
Average Rental NaN
Size of the mall NaN
Area NaN
Upscale Street NaN
Place of Worship in vicinity NaN
High Street / Mall High Street
Brand Premiumness Index 4
Restaurant Nearby? 0
Store Size Large
Area.1 2600
There may be more values in place of the NaNs; take this just as an example. The unique key here is mac_address, so I want to start with an empty JSON document and update it for each row of data, like:
{
mac_address: "00:03:7f:05:c0:06"
{
"cities" : [
{
"City Name1" : "Wittenbergplatz",
"City count" : "12"
},
{
"City Name2" : "Spichernstrasse",
"City Count" : "19"
},
{
"City Name3" : "Weberwiese",
"City count" : "30"
}
]
}
}
The city count is the number of times a mac_address visited a city. By reading this particular row, I would like to add the city Bhubaneswar with count 1. For each new row, I would like to check whether the mac_address is already in the JSON (for that I would probably have to import the JSON into a Python dictionary or something; suggestions welcome). If the mac_address is already there, I would like to update its info with that row; if not, I would like to add the mac_address as a new field and store the row's info under it. I have to do it in python with pandas dataframes, as I have some familiarity with them. Any help on this?
Your desired JSON file is not valid JSON; here is the error message thrown by an online JSON validator:
Input:
{
"mac_address": "00:03:7f:05:c0:06" {
"cities": [{
"City Name1": "Wittenbergplatz",
"City count": "12"
}, {
"City Name2": "Spichernstrasse",
"City Count": "19"
}, {
"City Name3": "Weberwiese",
"City count": "30"
}]
}
}
Error
Error: Parse error on line 2:
..."00:03:7f:05:c0:06" { "cities": [{
-----------------------^
Expecting 'EOF', '}', ':', ',', ']', got '{'
this solution might help you to start:
In [440]: (df.groupby(['mac_address','City'])
   .....:    .size()
   .....:    .reset_index()
   .....:    .rename(columns={0:'count'})
   .....:    .groupby('mac_address')
   .....:    .apply(lambda x: x[['City','count']].to_dict('records'))
   .....:    .to_dict()
   .....: )
Out[440]:
{'00:03:7f:05:c0:01': [{'City': 'aaa', 'count': 1}],
'00:03:7f:05:c0:02': [{'City': 'bbb', 'count': 1}],
'00:03:7f:05:c0:03': [{'City': 'ccc', 'count': 2}],
'00:03:7f:05:c0:05': [{'City': 'xxx', 'count': 1},
{'City': 'zzz', 'count': 1}],
'00:03:7f:05:c0:06': [{'City': 'aaa', 'count': 1},
{'City': 'bbb', 'count': 1}],
'00:03:7f:05:c0:07': [{'City': 'aaa', 'count': 3},
{'City': 'bbb', 'count': 1}]}
data:
In [441]: df
Out[441]:
mac_address City
0 00:03:7f:05:c0:06 aaa
1 00:03:7f:05:c0:06 bbb
2 00:03:7f:05:c0:07 aaa
3 00:03:7f:05:c0:07 bbb
4 00:03:7f:05:c0:07 aaa
5 00:03:7f:05:c0:01 aaa
6 00:03:7f:05:c0:02 bbb
7 00:03:7f:05:c0:03 ccc
8 00:03:7f:05:c0:03 ccc
9 00:03:7f:05:c0:07 aaa
10 00:03:7f:05:c0:05 xxx
11 00:03:7f:05:c0:05 zzz
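To persist the resulting dictionary the way the question describes, the json module is enough. A minimal sketch (the int() conversion matters because json serialization rejects numpy integers); rebuilding the counts and rewriting the whole file on each run is usually simpler than editing the JSON document in place:

```python
import json
import pandas as pd

# a small slice of the sample data above
df = pd.DataFrame({'mac_address': ['00:03:7f:05:c0:06', '00:03:7f:05:c0:06',
                                   '00:03:7f:05:c0:07'],
                   'City': ['aaa', 'bbb', 'aaa']})

# count visits per (mac_address, city) and nest the counts per mac_address;
# int(n) converts numpy int64 to a plain int the json module accepts
counts = {}
for (mac, city), n in df.groupby(['mac_address', 'City']).size().items():
    counts.setdefault(mac, []).append({'City': city, 'count': int(n)})

# serialize; with an open file handle, json.dump(counts, fh, indent=2)
# writes the same text to disk, and json.load(fh) reads it back for updates
text = json.dumps(counts, indent=2)
loaded = json.loads(text)
```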