I want to look up the value of the lookup_table's Code column based on the combination of text from two different columns of the data table (VMType and Location). See the example below:
Data:

VMType    Location
DSv3      East Europe
ESv3      East US
ESv3      East Asia
DSv4      Central US
Ca2       Central US
lookup_table:

Type                                  Code
Dv3/DSv3 - Gen Purpose East Europe    abc123
Dv3/D1 - Gen Purpose West US          abc321
Dav4/DSv4 - Gen Purpose Central US    bbb321
Eav3/ESv3 - Hi Tech East Asia         def321
Eav3/ESv3 - Hi Tech East US           xcd321
Csv2/Ca2 - Hi Tech Central US         xcc321
I want to do something like
data['new_column'] = lookup_table['Code'] where lookup_table['Type'] == Data['VMType'] + '*' + Data['Location']
or, removing the wildcard, it could be evaluated as:
data['new_column'] = lookup_table['Code'] where lookup_table['Type'] contains Data['VMType'] AND lookup_table['Type'] contains Data['Location']
Resulting in:

Data:

VMType    Location     new_column
DSv3      East Europe  abc123
ESv3      East US      xcd321
ESv3      East Asia    def321
DSv4      Central US   bbb321
Ca2       Central US   xcc321
Ideally this can be done without iterating through the df.
First, extract VMType and Location columns from the Type string in lookup_table: the VMType is the token between the '/' and the ' - ', and the Location is the last two words. Then merge with your data dataframe:
lookup_table['VMType'] = lookup_table['Type'].str.extract(r'/(\S+) -', expand=False)
lookup_table['Location'] = lookup_table['Type'].str.split().str[-2:].str.join(' ')
lookup_table = lookup_table[['VMType', 'Location', 'Code']]
data = data.merge(lookup_table, on=['VMType', 'Location'], how='left')
Output:
  VMType     Location    Code
0   DSv3  East Europe  abc123
1   ESv3      East US  xcd321
2   ESv3    East Asia  def321
3   DSv4   Central US  bbb321
4    Ca2   Central US  xcc321
Rename Code to new_column afterwards if you want that exact column name.
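If the Type strings were less regular, the asker's "contains" formulation can also be done without extracting columns, via a cross join plus a substring filter. This is only a sketch: it assumes each (VMType, Location) pair matches at most one Type, that no VMType is a substring of an unrelated Type, and that pandas >= 1.2 is available for how='cross'.

```python
import pandas as pd

data = pd.DataFrame({
    'VMType': ['DSv3', 'ESv3', 'ESv3', 'DSv4', 'Ca2'],
    'Location': ['East Europe', 'East US', 'East Asia', 'Central US', 'Central US'],
})
lookup_table = pd.DataFrame({
    'Type': ['Dv3/DSv3 - Gen Purpose East Europe',
             'Dv3/D1 - Gen Purpose West US',
             'Dav4/DSv4 - Gen Purpose Central US',
             'Eav3/ESv3 - Hi Tech East Asia',
             'Eav3/ESv3 - Hi Tech East US',
             'Csv2/Ca2 - Hi Tech Central US'],
    'Code': ['abc123', 'abc321', 'bbb321', 'def321', 'xcd321', 'xcc321'],
})

# Pair every data row with every lookup row, then keep only the pairs
# where Type contains both the VMType and the Location text
pairs = data.merge(lookup_table, how='cross')
mask = pairs.apply(
    lambda r: r['VMType'] in r['Type'] and r['Location'] in r['Type'], axis=1)
data = data.merge(pairs.loc[mask, ['VMType', 'Location', 'Code']],
                  on=['VMType', 'Location'], how='left')
```

The row-wise apply makes this slower than the extract-and-merge approach on big frames, but it tolerates irregular Type text.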
Assuming I have the following toy dataframe, df:
Country Population Region HDI
China 100 Asia High
Canada 15 NAmerica V.High
Mexico 25 NAmerica Medium
Ethiopia 30 Africa Low
I would like to create new columns based on the population, region, and HDI of Ethiopia in a loop. I tried the following method, but it is time-consuming when a lot of columns are involved.
df['Population_2'] = df['Population'][df['Country'] == "Ethiopia"]
df['Region_2'] = df['Region'][df['Country'] == "Ethiopia"]
df['Population_2'] = df['Population_2'].ffill().bfill()  # fillna(method='ffill') alone discards its result and only fills forward
My final DataFrame df should look like:
Country Population Region HDI Population_2 Region_2 HDI_2
China 100 Asia High 30 Africa Low
Canada 15 NAmerica V.High 30 Africa Low
Mexico 25 NAmerica Medium 30 Africa Low
Ethiopia 30 Africa Low 30 Africa Low
How about this?
for col in ['Population', 'Region', 'HDI']:
    df[col + '_2'] = df.loc[df.Country == 'Ethiopia', col].iat[0]
I don't quite understand the broader point of what you're trying to do, and if Ethiopia could have multiple values the solution might be different. But this works for the problem as you presented it.
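For completeness, here is a self-contained run of that loop on the toy frame from the question; .iat[0] takes the first match, so this assumes exactly one Ethiopia row:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['China', 'Canada', 'Mexico', 'Ethiopia'],
    'Population': [100, 15, 25, 30],
    'Region': ['Asia', 'NAmerica', 'NAmerica', 'Africa'],
    'HDI': ['High', 'V.High', 'Medium', 'Low'],
})

for col in ['Population', 'Region', 'HDI']:
    # .iat[0] extracts a scalar from the (assumed single) Ethiopia row;
    # assigning a scalar to a column broadcasts it to every row
    df[col + '_2'] = df.loc[df.Country == 'Ethiopia', col].iat[0]
```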
You can use:
# select Ethiopia row and add suffix "_2" to the columns (except Country)
s = (df.drop(columns='Country')
       .loc[df['Country'].eq('Ethiopia')]
       .add_suffix('_2')
       .squeeze())
# broadcast as new columns
df[s.index] = s
output:
Country Population Region HDI Population_2 Region_2 HDI_2
0 China 100 Asia High 30 Africa Low
1 Canada 15 NAmerica V.High 30 Africa Low
2 Mexico 25 NAmerica Medium 30 Africa Low
3 Ethiopia 30 Africa Low 30 Africa Low
You can use assign, also assuming that you have only one row corresponding to Ethiopia:
d = dict(zip(df.columns.drop('Country').map('{}_2'.format),
             df.set_index('Country').loc['Ethiopia']))
df = df.assign(**d)
print(df):
Country Population Region HDI Population_2 Region_2 HDI_2
0 China 100 Asia High 30 Africa Low
1 Canada 15 NAmerica V.High 30 Africa Low
2 Mexico 25 NAmerica Medium 30 Africa Low
3 Ethiopia 30 Africa Low 30 Africa Low
I have an extremely long dataframe with a lot of data that I have to clean before I can proceed with data visualization. There are several things I have in mind that need to be done, and I can do each of them to a certain extent, but I don't know how to, or if it's even possible to, do them together.
This is what I have to do:
Find the highest arrival count every year and see if the mode of transport is by air, sea or land.
period arv_count Mode of arrival
0 2013-01 984350 Air
1 2013-01 129074 Sea
2 2013-01 178294 Land
3 2013-02 916372 Air
4 2013-02 125634 Sea
5 2013-02 179359 Land
6 2013-03 1026312 Air
7 2013-03 143194 Sea
8 2013-03 199385 Land
... ... ... ...
78 2015-03 940077 Air
79 2015-03 133632 Sea
80 2015-03 127939 Land
81 2015-04 939370 Air
82 2015-04 118120 Sea
83 2015-04 151134 Land
84 2015-05 945080 Air
85 2015-05 123136 Sea
86 2015-05 154620 Land
87 2015-06 930642 Air
88 2015-06 115631 Sea
89 2015-06 138474 Land
This is an example of what the data looks like. I don't know if it's necessary but I have created another column just for year like so:
def year_extract(year):
    return year.split('-')[0].strip()

df1 = pd.DataFrame(df['period'])
df1 = df1.rename(columns={'period': 'Year'})
df1 = df1['Year'].apply(year_extract)
df1 = pd.DataFrame(df1)
df = pd.merge(df, df1, left_index=True, right_index=True)
I know how to use groupby and I know how to find a maximum, but I don't know if it is possible to find the maximum within a group, e.g. the highest arrival count in 2013, 2014, 2015, etc.
The data above is the total arrival count for all countries based on mode of transport and period, but the original data also had hundreds of additional rows in which the region and country are stated; I dropped them because I didn't know how to use or clean them. It looks like this:
period region country moa arv_count
2013-01 Total Total Air 984350
2013-01 Total Total Sea 129074
2013-01 Total Total Land 178294
2013-02 Total Total Air 916372
... ... ... ... ...
2015-12 AMERICAS USA Land 2698
2015-12 AMERICAS Canada Land 924
2013-01 ASIA China Air 136643
2013-01 ASIA India Air 55369
2013-01 ASIA Japan Air 51178
I would also like to make use of the region data if possible. I'm hoping to create a clustered column chart with the 7 regions as the x axis and arrival count as the y axis, each region showing the arrival count via land, sea and air, but I feel like there is too much excess data that I don't know how to deal with right now.
For example, I don't know how to deal with the period and the country, because all I need is the total arrival count for land, sea and air by region and year, regardless of country and month.
I used this dataframe to test the code (the one in your question):
df = pd.DataFrame([['2013-01', 'Total', 'Total', 'Air', 984350],
                   ['2013-01', 'Total', 'Total', 'Sea', 129074],
                   ['2013-01', 'Total', 'Total', 'Land', 178294],
                   ['2013-02', 'Total', 'Total', 'Air', 916372],
                   ['2015-12', 'AMERICAS', 'USA', 'Land', 2698],
                   ['2015-12', 'AMERICAS', 'Canada', 'Land', 924],
                   ['2013-01', 'ASIA', 'China', 'Air', 136643],
                   ['2013-01', 'ASIA', 'India', 'Air', 55369],
                   ['2013-01', 'ASIA', 'Japan', 'Air', 51178]],
                  columns=['period', 'region', 'country', 'moa', 'arv_count'])
Here is the code to get the sum of arrival counts, by year, region and type (sea, land air):
First add a 'year' column:
df['year'] = pd.to_datetime(df['period']).dt.year
Then group by (year, region, moa) and sum arv_count in each group:
df.groupby(['region', 'year', 'moa']).arv_count.sum()
Here is the output:
region year moa
AMERICAS 2015 Land 3622
ASIA 2013 Air 243190
Total 2013 Air 1900722
Land 178294
Sea 129074
I hope this is what you were looking for!
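Two parts of the question that the grouped sum doesn't yet cover can be sketched from the same frame: the highest arrival count per year (with its mode of arrival) via idxmax, and the region-by-mode shape needed for the clustered column chart via unstack. This is a sketch on a cut-down sample; column names follow the question's data:

```python
import pandas as pd

df = pd.DataFrame([['2013-01', 'Total', 'Total', 'Air', 984350],
                   ['2013-01', 'Total', 'Total', 'Sea', 129074],
                   ['2013-01', 'ASIA', 'China', 'Air', 136643],
                   ['2013-01', 'ASIA', 'India', 'Air', 55369],
                   ['2015-12', 'AMERICAS', 'USA', 'Land', 2698],
                   ['2015-12', 'AMERICAS', 'Canada', 'Land', 924]],
                  columns=['period', 'region', 'country', 'moa', 'arv_count'])
df['year'] = pd.to_datetime(df['period']).dt.year

# 1. Highest arrival count per year, keeping the whole row so the mode
#    of arrival comes along: idxmax returns the row label of each group's max
top_per_year = df.loc[df.groupby('year')['arv_count'].idxmax()]

# 2. Shape for a clustered column chart: one row per region, one column
#    per mode of arrival; pivot.plot.bar() would then draw one group of
#    bars per region
pivot = (df.groupby(['region', 'moa'])['arv_count'].sum()
           .unstack(fill_value=0))
```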
I have a dataframe called wine that contains a bunch of rows I need to drop.
How do I drop all rows whose 'country' value makes up less than 1% of the whole?
Here are the proportions:
#proportion of wine countries in the data set
wine.country.value_counts() / len(wine.country)
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
New Zealand 0.009069
Israel 0.006133
Greece 0.004493
Canada 0.002526
Hungary 0.001755
Romania 0.001558
...
I got lazy and didn't include all of the results, but I think you catch my drift. I need to drop all rows for countries with proportions less than .01.
Here is the head of my dataframe:
country designation points price province taster_name variety year price_category
Portugal Avidagos 87 15.0 Douro Roger Voss Portuguese Red 2011.0 low
You can compute the proportions, then keep only the rows whose country clears the cutoff, something like:
props = wine.country.value_counts(normalize=True)
wine = wine[wine.country.isin(props[props >= .01].index)]
From that dataset it should give you something like this:
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
Figured it out:
country_filter = wine.country.value_counts(normalize=True) > 0.01
country_index = country_filter[country_filter.values == True].index
wine = wine[wine.country.isin(list(country_index))]
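An equivalent one-pass sketch (same > 0.01 cutoff as above) maps each row's country to its normalized frequency, avoiding the intermediate index. The toy data here is made up to illustrate the boundary:

```python
import pandas as pd

# Toy data: 99 US rows and a single Hungary row (share exactly 0.01)
wine = pd.DataFrame({'country': ['US'] * 99 + ['Hungary']})

# Map each row's country to its share of the whole, then keep rows
# whose share exceeds 1%
freq = wine['country'].value_counts(normalize=True)
wine = wine[wine['country'].map(freq) > 0.01]
```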
I have a report that identifies key drivers of an overall number/trend. I would like to automate the ability to list/identify the underlying records based on a percentage of that number. For example, if the net change for sales of widgets in the South (region) is -5,000.00, with both positives and negatives underneath, I would like to identify the drivers that make up at least ~90% (-4,500.00) of that -5,000.00 total, from largest to smallest.
data
region OfficeLocation sales
South 1 -500
South 2 300
South 3 -1000
South 4 -2000
South 5 300
South 6 -700
South 7 -400
South 8 800
North 11 300
North 22 -400
North 33 1000
North 44 800
North 55 900
North 66 -800
for South, the total sales is -3200. I would like to identify/list the drivers that make up at least 90% of this move (in descending order); 90% of -3200 is -2880, and the moves/sales for South offices 3 & 4 sum to -3000, so the output for this request would be:
region OfficeLocation sales
South 3 -1000
South 4 -2000
for North, the total sales is +1800; at least 90% of 1800 is 1620, and the moves/sales for North offices 33 & 44 sum to 1800, so the output for this request would be:
region OfficeLocation sales
North 33 1000
North 44 800
Dataset above has both positive and negative trends for south/north. Any help you can provide would be greatly appreciated!
As mentioned in the comment, it isn't clear what to do in the 'North' case as the sum is positive there, but ignoring that, you could do something like the following:
In [200]: df[df.groupby('region').sales.apply(lambda g: g <= g.loc[(g.sort_values().cumsum() > 0.9*g.sum()).idxmin()])]
Out[200]:
region OfficeLocation sales
2 South 3 -1000
3 South 4 -2000
13 North 66 -800
If, in the positive case, you want to find as few elements as possible that together make up 90% of the sum of the sales, the above solution can be adapted as follows:
def is_driver(group):
    s = group.sum()
    if s > 0:
        group *= -1
        s *= -1
    a = group.sort_values().cumsum() > 0.9*s
    return group <= group.loc[a.idxmin()]
In [168]: df[df.groupby('region').sales.apply(is_driver)]
Out[168]:
region OfficeLocation sales
2 South 3 -1000
3 South 4 -2000
10 North 33 1000
12 North 55 900
Note that in the case of a tie, only one element is picked out.
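For reference, a self-contained run on the question's data. Note that the df[...groupby().apply(...)] pattern can hit index-alignment issues on newer pandas versions (apply may return a MultiIndex), so this sketch routes the same logic through transform instead:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['South'] * 8 + ['North'] * 6,
    'OfficeLocation': [1, 2, 3, 4, 5, 6, 7, 8, 11, 22, 33, 44, 55, 66],
    'sales': [-500, 300, -1000, -2000, 300, -700, -400, 800,
              300, -400, 1000, 800, 900, -800],
})

def is_driver(group):
    # Flip sign for positive groups so we always hunt the dominant direction
    s = group.sum()
    if s > 0:
        group = group * -1
        s = s * -1
    a = group.sort_values().cumsum() > 0.9 * s
    return group <= group.loc[a.idxmin()]

# transform returns a boolean Series aligned with df's index
mask = df.groupby('region')['sales'].transform(is_driver).astype(bool)
drivers = df[mask]
```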