New to Python and trying to print from a DataFrame:
import pandas as pd

customers = {'NAME': ['Breadpot', 'Hoviz', 'Hovis', 'Grenns', 'Magnolia', 'Dozen', 'Sun'],
'CITY': ['Sydney', 'Manchester', 'London', 'London', 'Chicago', 'San Francisco', 'San Francisco'],
'COUNTRY': ['Australia', 'UK', 'UK', 'UK', 'USA', 'USA', 'USA'],
'CPERSON': ['Sam.Keng@info.com', 'harry.ham@hoviz.com', 'hamlet.host@hoviz.com', 'grenns@grenns.com', 'man@info.com', 'dozen@dozen.com', 'sunny@sun.com'],
'EMPLCNT': [250, 150, 1500, 200, 1024, 1000, 2000],
'CONTRCNT': [48, 7, 12800, 12800, 25600, 5, 2],
'CONTRCOST': [1024.00, 900.00, 10510.50, 128.30, 512000.00, 1000.20, 10000.01]
}
df = pd.DataFrame(customers, columns=['CITY', 'COUNTRY', 'CPERSON', 'EMPLCNT', 'CONTRCNT', 'EMPLCNT', 'CONTRCOST'])
new_df = df.loc[df['CONTRCNT'].idxmax()]
print('City with the largest number of signed contracts:')
print(new_df['CITY'],'(', new_df['CONTRCNT'], 'contracts)')
I'm trying to get the code to return "City with the largest number of signed contracts:" followed by the city and the number of contracts, but I keep getting this instead:
City with the largest number of signed contracts:
4 Chicago
4 Chicago
Name: CITY, dtype: object ( CONTRCNT CONTRCNT
4 25600 25600
4 25600 25600 contracts)
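The doubled rows come from the stray second 'EMPLCNT' in your columns list: the resulting frame has two columns with the same label, so lookups return duplicated results rather than scalars. A minimal sketch of the direct fix, assuming the duplicate was unintended:
import pandas as pd

# Build the frame without the duplicated 'EMPLCNT' column
df = pd.DataFrame(customers, columns=['CITY', 'COUNTRY', 'CPERSON',
                                      'EMPLCNT', 'CONTRCNT', 'CONTRCOST'])

# Single row with the largest contract count
row = df.loc[df['CONTRCNT'].idxmax()]
print('City with the largest number of signed contracts:')
print(row['CITY'], '(', row['CONTRCNT'], 'contracts)')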
This should also work if you want the totals per city (same customers dict as above, with the duplicate 'EMPLCNT' dropped from the columns list):
df = pd.DataFrame(customers, columns=['CITY', 'COUNTRY', 'CPERSON', 'EMPLCNT', 'CONTRCNT', 'CONTRCOST'])
new_df = df.groupby('CITY').sum(numeric_only=True).sort_values(by='CONTRCNT', ascending=False)
print('City with the largest number of signed contracts:')
print(new_df.index[0], '(', new_df['CONTRCNT'].iloc[0], 'contracts)')
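Equivalently, you can sum just the one column you need and skip the text columns entirely; a compact sketch using the same df:
totals = df.groupby('CITY')['CONTRCNT'].sum()
print('City with the largest number of signed contracts:')
print(totals.idxmax(), '(', totals.max(), 'contracts)')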
I have a DataFrame like:
id  country  city   amount  duplicated
1   France   Paris  200     1
2   France   Paris  200     1
3   France   Lyon   50      2
4   France   Lyon   50      2
5   France   Lyon   50      2
And I would like to store a list per distinct value in duplicated, like:
list 1
[
{
"id": 1,
"country": "France",
"city": "Paris",
"amount": 200,
},
{
"id": 2,
"country": "France",
"city": "Paris",
"amount": 200,
}
]
list 2
[
{
"id": 3,
"country": "France",
"city": "Lyon",
"amount": 50,
},
{
"id": 4,
"country": "France",
"city": "Lyon",
"amount": 50,
},
{
"id": 5,
"country": "France",
"city": "Lyon",
"amount": 50,
}
]
I tried filtering duplicates with
df[df.duplicated(['country','city','amount', 'duplicated'], keep = False)]
but it just returns the same df.
You can use groupby (your duplicated filter returns the whole frame because keep=False marks every member of each duplicate group, and every row here belongs to one):
lst = (df.groupby(['country', 'city', 'amount'])  # or .groupby('duplicated')
         .apply(lambda x: x.to_dict('records'))
         .tolist())
Output:
>>> lst
[[{'id': 3,
'country': 'France',
'city': 'Lyon',
'amount': 50,
'duplicated': 2},
{'id': 4,
'country': 'France',
'city': 'Lyon',
'amount': 50,
'duplicated': 2},
{'id': 5,
'country': 'France',
'city': 'Lyon',
'amount': 50,
'duplicated': 2}],
[{'id': 1,
'country': 'France',
'city': 'Paris',
'amount': 200,
'duplicated': 1},
{'id': 2,
'country': 'France',
'city': 'Paris',
'amount': 200,
'duplicated': 1}]]
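Note that groupby sorts the group keys by default, which is why the Lyon group comes first above; pass sort=False to keep the groups in their original row order:
lst = (df.groupby('duplicated', sort=False)
         .apply(lambda x: x.to_dict('records'))
         .tolist())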
Another solution if you want a dict indexed by duplicated key:
data = {k: v.to_dict('records') for k, v in df.set_index('duplicated').groupby(level=0)}
>>> data[1]
[{'id': 1, 'country': 'France', 'city': 'Paris', 'amount': 200},
{'id': 2, 'country': 'France', 'city': 'Paris', 'amount': 200}]
>>> data[2]
[{'id': 3, 'country': 'France', 'city': 'Lyon', 'amount': 50},
{'id': 4, 'country': 'France', 'city': 'Lyon', 'amount': 50},
{'id': 5, 'country': 'France', 'city': 'Lyon', 'amount': 50}]
If I understand you correctly, you can use DataFrame.to_dict('records') to make your lists:
list_1 = df[df['duplicated'] == 1].to_dict('records')
list_2 = df[df['duplicated'] == 2].to_dict('records')
Or for an arbitrary number of values in the column, you can make a dict:
result = {}
for value in df['duplicated'].unique():
    result[value] = df[df['duplicated'] == value].to_dict('records')
My current progress
I currently have a pandas DataFrame with 5 different instances (rows):
df = pd.DataFrame({
    'Name': ['John', 'Mark', 'Kevin', 'Ron', 'Amira'],
    'ID': [110, 111, 112, 113, 114],
    'Job title': ['xox', 'xoy', 'xoz', 'yow', 'uyt'],
    'Manager': ['River', 'Trevor', 'John', 'Lydia', 'Connor'],
    'M2': ['Shaun', 'Mary', 'Ronald', 'Cary', 'Miranda'],
    'M3': ['Clavis', 'Sharon', 'Randall', 'Mark', 'Doug'],
    'M4': ['Pat', 'Karen', 'Brad', 'Chad', 'Anita'],
    'M5': ['Ty', 'Jared', 'Bill', 'William', 'Bob'],
    'Location': ['US', 'US', 'JP', 'CN', 'JA']
})
lst = ['River', 'Pat', 'Brad', 'William', 'Clogah']  # avoid shadowing the builtin 'list'
I need to filter and drop all rows in the pandas DataFrame that contain 0 values from my list, and also those that contain more than one value from my list. In the case above, row 1 would be dropped because two of its names appear in the list, and rows 2 and 5 would be dropped because none of their names do.
Row 1, i.e. ('John', 110, 'xox', 'River', 'Shaun', 'Clavis', 'Pat', 'Ty', 'US'), would be dropped because both 'River' and 'Pat' are in the list.
Row 2, i.e. ('Mark', 111, 'xoy', 'Trevor', 'Mary', 'Sharon', 'Karen', 'Jared', 'US'), would be dropped because the row does not contain any values from my list.
Row 5, i.e. ('Amira', 114, 'uyt', 'Connor', 'Miranda', 'Doug', 'Anita', 'Bob', 'JA'), would be dropped because the row does not contain any values from my list.
The two other instances would be kept.
Original Printed DF
0: 'Name', 'ID', 'Job title', 'Manager', 'M2', 'M3', 'M4', 'M5', 'Location'
1: 'John', 110, 'xox', 'River', 'Shaun', 'Clavis', 'Pat', 'Ty', 'US'
2: 'Mark', 111, 'xoy', 'Trevor', 'Mary', 'Sharon', 'Karen', 'Jared', 'US'
3: 'Kevin', 112, 'xoz', 'John', 'Ronald', 'Randall', 'Brad', 'Bill', 'JP'
4: 'Ron', 113, 'yow', 'Lydia', 'Cary', 'Mark', 'Chad', 'William', 'CN'
5: 'Amira', 114, 'uyt', 'Connor', 'Miranda', 'Doug', 'Anita', 'Bob', 'JA'
Filtered Printed DF
3: 'Kevin', 112, 'xoz', 'John', 'Ronald', 'Randall', 'Brad', 'Bill', 'JP'
4: 'Ron', 113, 'yow', 'Lydia', 'Cary', 'Mark', 'Chad', 'William', 'CN'
The current process only filters out rows that don't contain a value from my managers list. I want to keep rows with exactly one manager from the list, but not rows with no managers from the list.
Not the prettiest way to achieve this, but this will work:
import pandas as pd

d = {
"Name": ["John", "Mark", "Kevin", "Ron", "Amira"],
"ID": [110, 111, 112, 113, 114],
"Job title": ["xox", "xoy", "xoz", "yow", "uyt"],
"M1": ["River", "Trevor", "John", "Lydia", "Connor"],
"M2": ["Shaun", "Mary", "Ronald", "Cary", "Miranda"],
"M3": ["Clavis", "Sharon", "Randall", "Mark", "Doug"],
"M4": ["Pat", "Karen", "Brad", "Chad", "Anita"],
"M5": ["Ty", "Jared", "Bill", "William", "Bob"],
"Location": ["US", "US", "JP", "CN", "JA"],
}
df = pd.DataFrame(d)
# Isolate the manager columns in their own DataFrame
managers = ["River", "Pat", "Brad", "William", "Clogah"]
df_managers = df[["M1", "M2", "M3", "M4", "M5"]]
# Flag employees that have exactly one manager from the list
has_one_manager = []
for i in range(df_managers.shape[0]):
    if len(set(df_managers.iloc[i]).intersection(managers)) == 1:
        has_one_manager.append(True)
    else:
        has_one_manager.append(False)
df["one manager"] = has_one_manager
df[df["one manager"]]
here you go:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Mark', 'Kevin', 'Ron', 'Amira'],
'ID': [110, 111, 112, 113, 114],
'Job title': ['xox', 'xoy', 'xoz', 'yow', 'uyt'],
'Manager': ['River', 'Trevor', 'John', 'Lydia', 'Connor'],
'M2': ['Shaun', 'Mary', 'Ronald', 'Cary', 'Miranda'],
'M3': ['Clavis', 'Sharon', 'Randall', 'Mark', 'Doug'],
'M4': ['Pat', 'Karen', 'Brad', 'Chad', 'Anita'],
'M5': ['Ty', 'Jared', 'Bill', 'William', 'Bob'],
'Location': ['US', 'US', 'JP', 'CN', 'JA']}
)
managers = ['River', 'Pat', 'Brad', 'William', 'Clogah']
manager_cols = ['Manager', 'M2', 'M3', 'M4', 'M5']
mask = df[manager_cols].isin(managers)
filtered_df = df[mask.values.sum(axis=1) < 2]
print(filtered_df)
To filter out the zero-match rows as well (so only rows with exactly 1 manager stay):
filtered_df = df[mask.values.sum(axis=1) == 1]
Vectorial solution using a mask:
m = (df.filter(regex=r'^M')
       .isin(lst)
       .sum(axis=1).eq(1)
     )
out = df.loc[m]
Output:
Name ID Job title Manager M2 M3 M4 M5 Location
2 Kevin 112 xoz John Ronald Randall Brad Bill JP
3 Ron 113 yow Lydia Cary Mark Chad William CN
I have a list of dictionaries with the following keys: country, points, price. I need to get an average of points and price for each country. Here is the list:
0: {country: "US", points: 96, price: 235}
1: {country: "Spain", points: 96, price: 110}
2: {country: "US", points: 96, price: 90}
3: {country: "US", points: 96, price: 65}
And I need a list of dictionaries back with country and their averages.
I have gotten to a point where I have a list of dictionaries with the sum of price and points:
[{'country': 'Albania', 'points': 176, 'price': 40.0}, {'country': 'Argentina', 'points': 480488, 'price': 116181.0}, {'country': 'Australia', 'points': 430092, 'price': 152979.0}
Now I need to get averages. I was thinking of creating another key holding the count per country and then doing the division in a for loop, but I'm not sure that's the right approach... Thanks for the help!
My code below:
count_dict = country_count.to_dict()
# Output
{'US': 62139,
'Italy': 18784,
'France': 14785,
'Spain': 8160}
# Get the sum of points and price for each country
grouped_data = wine_data.groupby('country').agg({'points':'sum', 'price':'sum'})
# Reset the index in order to convert df into a list of dictionaries
country_data = grouped_data.reset_index()
country_list = country_data.to_dict('records')
# Output
[{'country': 'Albania', 'points': 176, 'price': 40.0}, {'country': 'Argentina', 'points': 48048 etc]
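For what it's worth, you can ask groupby for means directly instead of sums; a sketch assuming wine_data is the full DataFrame from your code above:
# Average points and price per country, straight to a list of dicts
grouped_data = wine_data.groupby('country', as_index=False).agg({'points': 'mean', 'price': 'mean'})
country_list = grouped_data.to_dict('records')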
Have you tried passing your data into a pandas DataFrame and working with it there?
You can do it as follows. First, make a DataFrame:
import pandas as pd
import numpy as np
d = {
0: {'country': "US", 'points': 96, 'price': 235},
1: {'country': "Spain", 'points': 96, 'price': 110},
2: {'country': "US", 'points': 96, 'price': 90},
3: {'country': "US", 'points': 96, 'price': 65}
}
df = pd.DataFrame(d).transpose()
Out:
country points price
0 US 96 235
1 Spain 96 110
2 US 96 90
3 US 96 65
Then group by country:
# just to make sure they are numeric
df[['points','price']] = df[['points','price']].astype('float64')
df.groupby('country').mean()
Out:
points price
country
Spain 96.0 110.0
US 96.0 130.0
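And since the goal was a list of dictionaries, you can finish with to_dict('records'):
result = df.groupby('country', as_index=False)[['points', 'price']].mean().to_dict('records')
# [{'country': 'Spain', 'points': 96.0, 'price': 110.0}, {'country': 'US', 'points': 96.0, 'price': 130.0}]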
I have a variable with a list imported from Excel that looks like the below:
cities= [{'City': 'Buenos Aires',
'Country': 'Argentina',
'Population': 2891000,
'Area': 4758},
{'City': 'Toronto',
'Country': 'Canada',
'Population': 2800000,
'Area': 2731571},
{'City': 'Pyeongchang',
'Country': 'South Korea',
'Population': 2581000,
'Area': 3194},
{'City': 'Marakesh', 'Country': 'Morocco', 'Population': 928850, 'Area': 200},
{'City': 'Albuquerque',
'Country': 'New Mexico',
'Population': 559277,
'Area': 491},
{'City': 'Los Cabos',
'Country': 'Mexico',
'Population': 287651,
'Area': 3750},
{'City': 'Greenville', 'Country': 'USA', 'Population': 84554, 'Area': 68},
{'City': 'Archipelago Sea',
'Country': 'Finland',
'Population': 60000,
'Area': 8300},
{'City': 'Walla Walla Valley',
'Country': 'USA',
'Population': 32237,
'Area': 33},
{'City': 'Salina Island', 'Country': 'Italy', 'Population': 4000, 'Area': 27},
{'City': 'Solta', 'Country': 'Croatia', 'Population': 1700, 'Area': 59},
{'City': 'Iguazu Falls',
'Country': 'Argentina',
'Population': 0,
'Area': 672}]
I just want the value 'Population' from each city.
What is the most efficient or easiest way to make a list of each city's 'Population' value?
Below is the code that I came up with, but it's inefficient.
City_Population = [cities[0]['Population'], cities[1]['Population'], cities[2]['Population']]
I am currently learning Python and any advice would be helpful!
Thank you!
Using list comprehension:
print([city['Population'] for city in cities])
OUTPUT:
[2891000, 2800000, 2581000, 928850, 559277, 287651, 84554, 60000, 32237, 4000, 1700, 0]
EDIT:
In case a city has no 'Population' key:
print([city['Population'] for city in cities if 'Population' in city])
OUTPUT (removed population from a few cities in the list):
[2891000, 2800000, 2581000, 928850, 287651, 84554, 32237, 4000]
Use .get(); that way you will get None for cities where 'Population' is not defined.
populations = [city.get('Population') for city in cities]
If you don't want the empty values:
populations = [pop for pop in populations if pop is not None]
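Since the list came out of Excel anyway, a pandas one-liner is another option; a small sketch assuming pandas is installed:
import pandas as pd

# Each dict becomes a row; grab the one column back as a plain list
populations = pd.DataFrame(cities)['Population'].tolist()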
I have some code that compares actual data to target data, where the actual data lives in one DataFrame and the target in another. I need to look up the target, bring it into the df with the actual data, and then compare the two. In the simplified example below, I have a set of products and a set of locations all with unique targets.
I'm using a nested for loop to pull this off: looping through the products and then the locations. The problem is that my real life data is larger on all dimensions, and it takes up an inordinate amount of time to loop through everything.
I've looked at various SO articles and none (that I can find!) seem to be related to pandas and/or relevant for my problem. Does anyone have a good idea on how to vectorize this code?
import pandas as pd
import numpy as np
import time
employee_list = ['Joe', 'Bernie', 'Elizabeth', 'Kamala', 'Cory', 'Pete',
'Amy', 'Andrew', 'Beto', 'Jay', 'Kristen', 'Julian',
'Mike', 'John', 'Tulsi', 'Tim', 'Eric', 'Seth', 'Howard',
'Bill']
location_list = ['Denver', 'Boulder', 'Phoenix', 'Reno', 'Portland',
'Eugene', 'San Francisco']
product_list = ['Product1', 'Product2', 'Product3', 'Product4', 'Product5']
tgt_data = {'Location' : location_list,
'Product1' : [600, 200, 750, 225, 450, 175, 900],
'Product2' : [300, 100, 350, 125, 200, 90, 450],
'Product3' : [700, 250, 950, 275, 600, 225, 1200],
'Product4' : [200, 100, 250, 75, 150, 75, 300],
'Product5' : [900, 300, 1000, 400, 600, 275, 1300]}
tgt_df = pd.DataFrame(data = tgt_data)
employee_data = {'Employee' : employee_list,
'Location' : ['Boulder', 'Denver', 'Portland', 'Denver',
'San Francisco', 'Phoenix', 'San Francisco',
'Eugene', 'San Francisco', 'Reno', 'Denver',
'Phoenix', 'Denver', 'Portland', 'Reno',
'Boulder', 'San Francisco', 'Phoenix',
'San Francisco', 'Phoenix'],
'Product1' : np.random.randint(1, 1000, 20),
'Product2' : np.random.randint(1, 700, 20),
'Product3' : np.random.randint(1, 1500, 20),
'Product4' : np.random.randint(1, 500, 20),
'Product5' : np.random.randint(1, 1500, 20)}
emp_df = pd.DataFrame(data = employee_data)
start = time.time()
for p in product_list:
    for l in location_list:
        emp_df.loc[emp_df['Location'] == l, p + '_tgt'] = (
            tgt_df.loc[tgt_df['Location'] == l, p].values)
    emp_df[p + '_pct'] = emp_df[p] / emp_df[p + '_tgt']
print(emp_df)
end = time.time()
print(end - start)
If the target dataframe is guaranteed to have unique locations, you can use a join to make this process really quick. Reusing the imports and setup code from the question, we can now use our join.
product_tgt_cols = [product+'_tgt' for product in product_list]
print(product_tgt_cols) #['Product1_tgt', 'Product2_tgt', 'Product3_tgt', 'Product4_tgt', 'Product5_tgt']
product_pct_cols = [product+'_pct' for product in product_list]
print(product_pct_cols) #['Product1_pct', 'Product2_pct', 'Product3_pct', 'Product4_pct', 'Product5_pct']
start = time.time()
#join on location to get _tgt columns
emp_df = emp_df.join(tgt_df.set_index('Location'), on='Location', rsuffix='_tgt')
#divide the entire product arrays using numpy, store in temp
temp = emp_df[product_list].values/emp_df[product_tgt_cols].values
#create a new temp df for the _pct results, and assign back to emp_df
emp_df = emp_df.assign(**pd.DataFrame(temp, columns = product_pct_cols))
print(emp_df)
end = time.time()
print("with join: ",end - start)
You have "wide format" dataframes. I find "long format" easier to manipulate.
# turn emp_df into long
# indexed by "Employee", "Location", and "Product"
emp_df = (emp_df.set_index(['Employee', 'Location'])
.stack().to_frame())
emp_df.head()
                            0
Employee Location
Joe      Boulder  Product1  238
                  Product2  135
                  Product3  873
                  Product4  153
                  Product5  373
# turn tgt_df into a long series
# indexed by "Location" and "Product"
tgt_df = tgt_df.set_index('Location').stack()
tgt_df.head()
# set target for employees by locations:
emp_df['target'] = (emp_df.groupby('Employee')[0]
                          .apply(lambda x: tgt_df))
# percentage
emp_df['pct'] = emp_df[0]/emp_df['target']
# you can get the wide format back by
# emp_df = emp_df.unstack(level=2)
# which will give you a dataframe with
# multi-level index and multi-level column
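Equivalently, instead of the groupby/apply trick for the target column, you can align the two long objects on (Location, Product) directly; a sketch using the same names as above:
# Look up each row's target by its (Location, Product) index levels
emp_df['target'] = tgt_df.reindex(emp_df.index.droplevel('Employee')).values
emp_df['pct'] = emp_df[0] / emp_df['target']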