I have this dataframe
import pandas as pd

d = {'parameters': [{'Year': '2018',
                     'Median Age': 'nan',
                     'Total Non-Family Household Income': 289.0,
                     'Total Household Income': 719.0,
                     'Gini Index of Income Inequality': 0.4121}]}
df_sample = pd.DataFrame(data=d)
df_sample.head()
I want to convert that JSON into separate pandas columns. How do I do this? Assume I only have the dataframe, not the parameter d.
I saw this example
# which columns contain JSON
json_cols = ['device', 'geoNetwork', 'totals', 'trafficSource']
for column in json_cols:
    c_load = test[column].apply(json.loads)
    c_list = list(c_load)
    c_dat = json.dumps(c_list)
    test = test.join(pd.read_json(c_dat))
    test = test.drop(column, axis=1)
But this does not seem too pythonic...
Use json_normalize:
df_sample = pd.json_normalize(data=d, record_path=['parameters'])
Resulting dataframe:

   Year Median Age  Total Non-Family Household Income  Total Household Income  Gini Index of Income Inequality
0  2018        nan                              289.0                   719.0                           0.4121
UPD:
If you already have the dataframe loaded, then applying pd.Series should work:
df_sample = df_sample['parameters'].apply(pd.Series)
# or df_sample['parameters'].map(json.loads).apply(pd.Series) if the values are strings rather than dicts
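json_normalize can also be pointed straight at the column, which avoids the row-wise apply. A minimal sketch, assuming the cells already hold dicts (use json.loads first if they are strings):

import pandas as pd

# expand the dicts in the 'parameters' column into their own columns,
# then drop the original column and join the expanded columns back
expanded = pd.json_normalize(df_sample['parameters'])
df_sample = df_sample.drop(columns='parameters').join(expanded)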
My question essentially is:
I have 3 major columns and 4 rows
The Delta column ALWAYS needs to be formatted as a whole number percent ex: 35%
Toyota & Honda Sales need to be formatted differently depending on the row
Spend and Revenue need to be $XXX,XXX ex: $100,000
Sale count needs to be a whole number XXX,XXX ex: 500
Present-Value/Sale needs to always be percent ex: 35%
Put another way, I have one column that has a single formatting regimen, but two others that have variable formatting depending on row. Any idea for this?
# This is what I have to start
data = {'Toyota Sales Performance': [500000.0000, 150000.0000, 100.0000, .2500],
        'Honda Sales Performance': [750000.0000, 100000.0000, 200.0000, .3500],
        'Delta': [.25, .35, .50, .75]}
df = pd.DataFrame(data, index=['Total Spend',
                               'Total Revenue',
                               'Total Sale Count',
                               'Present-Value/Sale'])
df
What I would like to see
data2 = {'Toyota Sales Performance': ['$500,000', '$150,000', 100, '25%'],
         'Honda Sales Performance': ['$750,000', '$100,000', 200, '35%'],
         'Delta': ['25%', '35%', '50%', '75%']}
df2 = pd.DataFrame(data2, index=['Total Spend',
                                 'Total Revenue',
                                 'Total Sale Count',
                                 'Present-Value/Sale'])
df2
You can use apply() to run your own function on every column.
import pandas as pd

data = {
    'Toyota Sales Performance': [500000.0000, 150000.0000, 100.0000, .2500],
    'Honda Sales Performance': [750000.0000, 100000.0000, 200.0000, .3500],
    'Delta': [.25, .35, .50, .75]
}
df = pd.DataFrame(data, index=['Total Spend',
                               'Total Revenue',
                               'Total Sale Count',
                               'Present-Value/Sale'])

# Delta always gets the same whole-number percent format
#df['Delta'] = df['Delta'].apply(lambda val: f'{val:.0%}')
df['Delta'] = df['Delta'].apply('{:.0%}'.format)

# format each entry of a sales column according to its row label
def convert_column(col):
    return pd.Series({
        'Total Spend': "${:,}".format(int(col['Total Spend'])),
        'Total Revenue': "${:,}".format(int(col['Total Revenue'])),
        'Total Sale Count': int(col['Total Sale Count']),
        'Present-Value/Sale': "{:.0%}".format(col['Present-Value/Sale']),
    })

cols = ['Toyota Sales Performance', 'Honda Sales Performance']
df[cols] = df[cols].apply(convert_column, axis=0)

print(df)
Result:
                   Toyota Sales Performance Honda Sales Performance Delta
Total Spend                        $500,000                $750,000   25%
Total Revenue                      $150,000                $100,000   35%
Total Sale Count                        100                     200   50%
Present-Value/Sale                      25%                     35%   75%
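If the formatting is only needed for display and the underlying numbers should stay numeric, a Styler-based approach is an alternative. A minimal sketch, starting from the numeric df defined in the question (before anything is converted to strings) and assuming the result is rendered in a notebook or as HTML:

import pandas as pd

cols = ['Toyota Sales Performance', 'Honda Sales Performance']
styled = (df.style
            .format('{:.0%}', subset=['Delta'])
            .format('${:,.0f}', subset=pd.IndexSlice[['Total Spend', 'Total Revenue'], cols])
            .format('{:,.0f}', subset=pd.IndexSlice[['Total Sale Count'], cols])
            .format('{:.0%}', subset=pd.IndexSlice[['Present-Value/Sale'], cols]))
styled  # the styled view is formatted per row/column; df itself is unchanged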
I have the following sample DF:
       Car Model     Sales
0  Mercedes Benz       NaN
1           Audi  100000.0
2        Renault   50000.0
I have 2 calculations
Calculate the number of blank rows in the sales column
missingSalesInfo = DF['Sales'].isnull().sum()
missingSalesInfo = ("Number of missing values: ",missingSalesInfo)
Calculate the total car sales
totalSales = DF['Sales'].sum()
totalSales = ("Total car sales: ",totalSales)
What I want to do is create a new DF, let's call it DF2, to store the above results.
See example below
DF2

                Description  Results
0  Number of missing values        1
1           Total car sales   150000
Use Series.agg with a dictionary of aggregate functions, convert the result to integers, turn the Series into a DataFrame with Series.reset_index, and set the new column names with DataFrame.set_axis:
df2 = (df['Sales'].agg({'Number of missing values': lambda x: x.isna().sum(),
                        'Total car sales': 'sum'})
                  .astype(int)
                  .reset_index()
                  .set_axis(['Description', 'Results'], axis=1)
       )
print(df2)

                Description  Results
0  Number of missing values        1
1           Total car sales   150000
Alternative:
df2 = (df['Sales'].agg({'Number of missing values': lambda x: x.isna().sum(),
                        'Total car sales': 'sum'})
                  .astype(int)
                  .reset_index())
df2.columns = ['Description', 'Results']
That's just a question of how to create a DataFrame. You can do that in a few ways, but here it's done with a dictionary:
df2 = pd.DataFrame({
    'Description': ['Number of missing values', 'Total car sales'],
    'Results': [DF['Sales'].isnull().sum(), DF['Sales'].sum()]
})
There are several ways to do that. Here's a simple one:
df_2 = pd.DataFrame()
# wrap the scalars in lists so the new frame gets one row
df_2['Number of missing values'] = [DF['Sales'].isnull().sum()]
df_2['Total car sales'] = [DF['Sales'].sum()]
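If the two-column Description / Results layout from the question is wanted, the wide one-row frame above can be transposed afterwards; a small follow-up sketch:

# reshape the one-row frame into the Description / Results layout
df_2 = df_2.T.reset_index().set_axis(['Description', 'Results'], axis=1)
print(df_2)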
When I'm working in SQL, I find almost all the things I do with a column are related to the following four operations:
Add a column.
Remove a column.
Rename a column.
Change a column type.
What is the preferred way to do these four DML operations in pandas? For example, let's suppose I am starting with the following DataFrame:
import pandas as pd
df = pd.DataFrame([
    {'product': 'drink', 'brand': 'spindrift', 'id': '19'},
    {'product': 'cup', 'brand': None, 'id': '11'}
])
How would I:
Change the df.id column from a string (or object, as it says) to an int64?
Rename the column product to product_type?
Add a new column called 'cost' with values [2.99, 3.99]?
Remove the column called brand?
Simple and complete:
import numpy as np
import pandas as pd

df = pd.DataFrame([
    {'product': 'drink', 'brand': 'spindrift', 'id': '19'},
    {'product': 'cup', 'brand': None, 'id': '11'}
])

# Change the df.id column from a string (or object) to an int64
df['id'] = df['id'].astype(np.int64)

# Rename the column product to product_type
df = df.rename(columns={'product': 'product_type'})

# Add a new column called 'cost' with values [2.99, 3.99]
df['cost'] = pd.Series([2.99, 3.99])

# Remove the column called brand
df = df.drop(columns='brand')
These functions can also be chained together, though I would not recommend it, since it is not as readable as the step-by-step version above:
# do all the steps above in a single chain
# (note: DataFrame.astype takes no axis argument)
df = (df.astype({'id': np.int64})
        .rename(columns={'product': 'product_type'})
        .assign(cost=[2.99, 3.99])
        .drop(columns='brand'))
There is also another way, in which you can use inplace=True. This modifies the DataFrame in place, and I don't recommend it, as it is not as explicit as the first method:
# Using inplace=True where it is supported
df['id'] = df['id'].astype(np.int64)  # Series.astype has no inplace option
df.rename(columns={'product': 'product_type'}, inplace=True)
# No change from previous
df['cost'] = pd.Series([2.99, 3.99])
# pop brand out
df.pop('brand')
print(df)
You can perform these steps like this (starting with your original data frame):
# add a column
df = pd.concat([df, pd.Series([2.99, 3.99], name='cost')], axis=1)
# change column name
df = df.rename(columns={'product': 'product_type'})
# remove brand
df = df.drop(columns='brand')
# change data type
df['id'] = df['id'].astype('int')
print(df)
  product_type  id  cost
0        drink  19  2.99
1          cup  11  3.99
You could do:
df = pd.DataFrame([
    {'product': 'drink', 'brand': 'spindrift', 'id': '19'},
    {'product': 'cup', 'brand': None, 'id': '11'}
])

df = (df.assign(cost=[2.99, 3.99],
                id=lambda d: d.id.astype(int))
        .drop(columns=['brand'])
        .rename({"product": 'product_type'}, axis=1))
This should work
# change datatype
>>> df['id'] = df['id'].astype('int64')
>>> df.dtypes
brand      object
id          int64
product    object

# rename column
>>> df.rename(columns={'product': 'product_type'}, inplace=True)
>>> df
       brand  id product_type
0  spindrift  19        drink
1       None  11          cup

# create new column
>>> df['Cost'] = pd.Series([2.99, 3.99])
>>> df
       brand  id product_type  Cost
0  spindrift  19        drink  2.99
1       None  11          cup  3.99

# drop column
>>> df.drop(['brand'], axis=1, inplace=True)
>>> df
   id product_type  Cost
0  19        drink  2.99
1  11          cup  3.99
I already have an idea as to how I'm going to do this - I'm just curious about whether my method is the most efficient.
So for instance, let's say that for whatever reason, I have the following table:
The first 4 columns in the table are all repeated - they just give info about the employee. The reason these rows repeat is because that employee handles multiple clients.
In some cases, I am missing info on the Age and Employee duration of an employee. Another colleague gave me this information in an excel sheet.
So now, I have info on Brian's and Dennis' age and employment duration, and I need to fill in those values for every row that has their employee IDs. My plan for doing that is this:
data = {"14": # Brian's Employee ID
{"Age":31,
:"Employment Duration":3},
"21": # Dennis' Employee ID
{"Age":45,
"Employment Duratiaon":12}
}
After making the above dictionary of dictionaries with the necessary values, my plan is to iterate over each row in the above dataframe, and fill in the 'Age' and 'Employment Duration' columns based on the value in 'Employee ID':
for index, row in df.iterrows():
    if row["Employee ID"] in data:
        df.at[index, "Age"] = data[row["Employee ID"]]["Age"]
        df.at[index, "Employment Duration"] = data[row["Employee ID"]]["Employment Duration"]
That's my plan for populating the missing values!
I'm curious about whether there's a simpler way that's just not presenting itself to me, because this was the first thing that sprang to mind!
Don't iterate over rows in pandas when you can avoid it. Instead, make the most of the pandas library with operations like these:
Assume we have a dataframe:
data = pd.DataFrame({
    'name': ['john', 'john', 'mary', 'mary'],
    'age': ['', '', 25, 25]
})
Which looks like:
   name age
0  john
1  john
2  mary  25
3  mary  25
We can apply a lambda function like so (use x['name'] rather than x.name, since .name on a row Series refers to its index label):
data['age'] = data.apply(lambda x: 27 if x['name'] == 'john' else x['age'], axis=1)
Or we can use pandas .loc (selecting the rows and the column in a single call avoids chained assignment):
data.loc[data['name'] == 'john', 'age'] = 27
Test them out and compare how long each takes to execute vs. iterating over rows.
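A rough way to run that comparison, using Python's timeit with the toy data frame above (timings will vary by machine):

import timeit

def with_apply():
    return data.apply(lambda x: 27 if x['name'] == 'john' else x['age'], axis=1)

def with_loc():
    d = data.copy()
    d.loc[d['name'] == 'john', 'age'] = 27
    return d

print(timeit.timeit(with_apply, number=1000))
print(timeit.timeit(with_loc, number=1000))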
Ensure the missing values are represented as null values (np.NaN). The second set of information should be stored in another DataFrame with the same column labels.
Then, by setting the index to 'Employee ID', update will align on the indices and fill in the missing values.
Sample Data
import pandas as pd
import numpy as np

df = pd.DataFrame({'Employee ID': ["11", "11", "14", "21"],
                   'Name': ['Alan', 'Alan', 'Brian', 'Dennis'],
                   'Age': [14, 14, np.NaN, np.NaN],
                   'Employment Duration': [3, 3, np.NaN, np.NaN],
                   'Clients Handled': ['A', 'B', 'C', 'G']})

data = {"14": {"Age": 31, "Employment Duration": 3},
        "21": {"Age": 45, "Employment Duration": 12}}
df2 = pd.DataFrame.from_dict(data, orient='index')
Code
#df = df.replace('', np.NaN)  # only needed if the missing values are empty strings rather than NaN
df = df.set_index('Employee ID')
df.update(df2, overwrite=False)

print(df)
               Name   Age  Employment Duration Clients Handled
Employee ID
11             Alan  14.0                  3.0               A
11             Alan  14.0                  3.0               B
14            Brian  31.0                  3.0               C
21           Dennis  45.0                 12.0               G
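If 'Employee ID' should go back to being an ordinary column after the update, the index can be restored; a small follow-up to the code above:

# move 'Employee ID' from the index back into a regular column
df = df.reset_index()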
I have the following DataFrame in pandas:
import pandas as pd
example_data = [{'ticker': 'aapl', 'loc': 'us'}, {'ticker': 'mstf', 'loc': 'us'}, {'ticker': 'baba', 'loc': 'china'}, {'ticker': 'ibm', 'loc': 'us'}, {'ticker': 'db', 'loc': 'germany'}]
df = pd.DataFrame(example_data)
print(df)

       loc ticker
0       us   aapl
1       us   mstf
2    china   baba
3       us    ibm
4  germany     db
I want to create a new DataFrame such that each row is created from the original df but rows with loc counts greater than 2 are excluded. That is, the new df is created by looping through the old df, counting the number of loc rows that have come before, and including / excluding the row based on this count.
The following code gives the desired output.
country_counts = {}
output = []
for row in df.values:
    if row[0] not in country_counts:
        country_counts[row[0]] = 1
    else:
        country_counts[row[0]] += 1
    if country_counts[row[0]] <= 2:
        output.append({'loc': row[0], 'ticker': row[1]})

new_df = pd.DataFrame(output)
print(new_df)
       loc ticker
0       us   aapl
1       us   mstf
2    china   baba
3  germany     db
The output excludes the 4th row in the original df because its loc count is greater than 2 (i.e. 3).
Does there exist a better method to perform this type of operation? Any help is greatly appreciated.
How about groupby and .head:
In [90]: df.groupby('loc').head(2)
Out[90]:
       loc ticker
0       us   aapl
1       us   mstf
2    china   baba
4  germany     db
Also, be careful with your column names, since loc clashes with the .loc method.
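If you prefer a boolean mask, which mirrors the "count how many rows with this loc came before" logic from the question, groupby().cumcount() works as well:

# cumcount numbers the rows within each loc group starting at 0,
# so keeping values < 2 keeps at most the first two rows per loc
new_df = df[df.groupby('loc').cumcount() < 2].reset_index(drop=True)
print(new_df)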