Combining multiple column names with the same starting phrase into one column name - python

I have a sample dataframe from an Excel file as below:
d = {"Id":[1,2],
"Freight charge - 694.5 KG # USD 0.68/KG":[340,0],
"Terminal Handling Charges":[0,0],
"IOR FEE":[0,0],
"Handling - 694.5 KG # USD 0.50/KG":[357,0],
"Delivery Cartage - 694.5 KG # USD 0.25/KG":[0,0],
"Fuel Surcharge - 694.5 KG # USD 0.25/KG":[346,0],
"War Risk Surcharge - 694.5 KG # USD 0.14/KG":[0,0],
"Freight charge - 97.5 KG # USD 1.30/KG":[0,124],
"Airway Bill Fee":[0,0],
"Handling":[0,0],
"Terminal Handling Charges - 97.5 KG # USD 0.18/KG":[0,34],
"Delivery Cartage- White glove service":[0,20]
}
df = pd.DataFrame.from_dict(d)
I have put 0, but in the actual data it would be NA.
This is what the dataframe looks like.
I want to combine all columns that begin with a certain phrase into one column, and the values from those columns should fill the corresponding rows. For example, I have columns above that start with "Freight charge -". I want to make them just one column, "Freight Charges", and the values those columns hold should become the values of this column. I want to do the same for other columns that share a starting phrase, e.g. "Delivery Cartage" should be named "Delivery Charges", and anywhere I have "Handling" it should become "Handling charges".
I want something like below:
ID  Freight Charges  Handling  Fuel Surcharge  Delivery Charges
1   340              357       346             NA
2   124              NA        NA              20
I have added only sample column names. Expect that there can be more than two columns with the same starting phrase (like Freight Charges), each with different ending text, so I need a generic solution that can take any number of columns with the same starting phrase and convert them into one column name.

You can rename the columns as below (the last line also preserves the column names in order):
def colname(c):
    if 'freight charge' in c.lower():
        return 'Freight Charge'
    elif 'delivery cartage' in c.lower():
        return 'Delivery Charges'
    elif 'handling' in c.lower():
        return 'Handling charges'
    else:
        return c

cols = [colname(col) for col in df.columns]
df.columns = cols

# preserve the order of the renamed columns
old_cols = df.columns.unique().values
and you can combine the values with:
df = df.groupby(lambda x: x, axis=1).sum()
Update: re-order the columns as before
df = df[list(old_cols)]
Here is the expected output
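If there are many more starting phrases, the if/elif chain can be swapped for a lookup over a prefix map; a small sketch, with the phrases and target names taken from the question as examples (order matters if one phrase contains another):
prefix_map = {
    'freight charge': 'Freight Charge',
    'delivery cartage': 'Delivery Charges',
    'fuel surcharge': 'Fuel Surcharge',
    'handling': 'Handling charges',
}

def colname(c):
    low = c.lower()
    for phrase, new_name in prefix_map.items():
        if phrase in low:   # same matching rule as the if/elif version
            return new_name
    return c                # leave unmatched columns unchanged

df.columns = [colname(c) for c in df.columns]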

import numpy as np
Replace 0 with NaN. Drop columns that have no non-NaN values (thresh=1). Split the column names on the - character and take string index 0. Finally, combine columns with the same name:
df2 = df.replace(0, np.nan).dropna(thresh=1, axis='columns')
df2.columns = df2.columns.str.split('([-])').str[0]
df2.groupby(lambda x: x, axis=1).sum()
Shorter version
df.columns = df.columns.str.split('([-])').str[0]
df.replace(0, np.nan).dropna(thresh=1, axis='columns').groupby(lambda x: x, axis=1).sum()
   Delivery Cartage   Freight charge   Fuel Surcharge   Handling    Id   Terminal Handling Charges
0               0.0            340.0            346.0      357.0   1.0                         0.0
1              20.0            124.0              0.0        0.0   2.0                        34.0
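Both answers group with axis=1, which is deprecated in recent pandas (2.1+). A roughly equivalent sketch that transposes, groups the duplicate column labels, and transposes back; min_count=1 keeps NaN where a row has no value at all instead of summing it to 0:
# after renaming, several columns share a label; combine them by label
combined = df.T.groupby(level=0).sum(min_count=1).T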

Related

Handling a column with dates and missing dates

I have the following code to estimate profit from the buy and sell prices of crypto tokens.
import pandas as pd
# Read text file into pandas DataFrame
# --------------------------------------
df = pd.read_csv("crypto.txt", comment="#", skip_blank_lines=True, delim_whitespace=True).dropna()
# Display DataFrame
# -----------------
print(df)
print()
# Replace commas in number
# --------------------------------------
df['BuyPrice'] = df['BuyPrice'].str.replace(',', '').astype(float)
df['SellPrice'] = df['SellPrice'].str.replace(',', '').astype(float)
df['Size'] = df['Size'].str.replace(',', '').astype(float)
df['Profit'] = df.SellPrice - df.BuyPrice
# Sort BuyPrice column in ascending order
# --------------------------------------
df = df.sort_values('BuyPrice', ignore_index=True)
#df = df.sort_values('BuyPrice').reset_index(drop=True)
print()
# Sum all the numerical values and create a 'Total' row
# -----------------------------------------------------
df.loc['Total'] = df.sum(numeric_only=True)
# Replace NaN by empty space
# ---------------------------
df = df.fillna('')
df = df.rename({'BuyPrice': 'Buy Price', 'SellPrice': 'Sell Price'}, axis=1)
# Display Final DataFrame
# -----------------
print(df)
Now the output only shows the rows with sensible entries in the 'Date' column. I get:
Coin BuyPrice SellPrice Size Date
1 1INCH 2,520 3180 10 23-10-2021
3 SHIB 500 450 200,000 27-10-2021
4 DOT 1650 2500 1 June 01, 2021
Coin Buy Price Sell Price Size Date Profit
0 SHIB 500.0 450.0 200000.0 27-10-2021 -50.0
1 DOT 1650.0 2500.0 1.0 June 01, 2021 850.0
2 1INCH 2520.0 3180.0 10.0 23-10-2021 660.0
Total 4670.0 6130.0 200011.0 1460.0
Clearly, we can see the rows without dates have been ignored. How could one tackle this issue? How can Pandas understand they are dates?
crypto.txt file contains:
Coin BuyPrice SellPrice Size Date
#--- --------- ---------- ---- -----------
ADA 1,580 1,600 1 NA
1INCH 2,520 3180 10 23-10-2021
SHIB 261.6 450 200,000 NA
SHIB 500 450 200,000 27-10-2021
DOT 1650 2500 1 "June 01, 2021"
It seems I couldn't write the last row's Date entry within single inverted commas. Is it possible to convert all the dates into one single format?
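A minimal sketch of one way to keep the rows with missing dates and normalise the mixed formats; it assumes pandas >= 2.0 for format='mixed' (drop that argument on older versions and to_datetime will infer each entry), and the dd-mm-yyyy target format is just an example:
import pandas as pd

# read without dropping rows, so entries with a missing Date survive
df = pd.read_csv("crypto.txt", comment="#", skip_blank_lines=True,
                 delim_whitespace=True)

# parse the mixed formats; missing/unparseable values become NaT
# (format="mixed" needs pandas >= 2.0)
df["Date"] = pd.to_datetime(df["Date"], format="mixed", errors="coerce")

# render every date in one chosen format, e.g. dd-mm-yyyy; NaT rows stay NaN
df["Date"] = df["Date"].dt.strftime("%d-%m-%Y")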

Compare two data frames, find the common elements, and fill column value if not present

So I have two data frames, with some common keywords.
for example :
df1 = {'keyword': ['Computer', 'Phone', 'Printer'],
       'Price1': [1200, 800, 200],
       'category': ['first', 'second', 'first']
       }
df2 = {'keyword': ['Computer', 'Phone', 'Printer', 'chair'],
       'Price2': [1200, 800, 200, 40]
       }
As you can see above, one df has a category feature, while the other doesn't.
So what I want to do is combine the two dfs, keep the common items as they are, and, if some keyword is present in one df ('chair', in our case) and absent in the other, add the values from the df where that keyword exists and fill the categorical feature (category) with a particular value, 'third' for example.
While not entirely clear, I think you want combine_first:
df2.combine_first(df1)
NB. I transformed the dictionaries to dataframes first with dfX = pd.DataFrame(dfX)
output:
Price1 Price2 category keyword
0 1200.0 1200 first Computer
1 800.0 800 second Phone
2 200.0 200 first Printer
3 NaN 40 NaN chair
Alternatively, use merge:
df1.merge(df2, on='keyword', how='outer')
output:
keyword Price1 category Price2
0 Computer 1200.0 first 1200
1 Phone 800.0 second 800
2 Printer 200.0 first 200
3 chair NaN NaN 40
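If you prefer the merge route but still want the placeholder category, the remaining NaNs can be filled afterwards; a small sketch, using 'third' as the example label from the question:
out = df1.merge(df2, on='keyword', how='outer')
# rows that only exist in df2 (e.g. 'chair') have no category yet
out['category'] = out['category'].fillna('third')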
Building upon mozway's answer, if the prices of the items do not vary across the DataFrames, you don't need separate Price1 and Price2 column names.
Also, after joining the data, you can fill the remaining NAs in the category column with any word you want using fillna().
Here is the streamlined code for you:
import pandas as pd
df1 = pd.DataFrame({'keyword': ['Computer', 'Phone', 'Printer'],
                    'Price': [1200, 800, 200],
                    'category': ['first', 'second', 'first']
                    })
df2 = pd.DataFrame({'keyword': ['Computer', 'Phone', 'Printer', 'chair'],
                    'Price': [1200, 800, 200, 40]
                    })

df_combined = df1.combine_first(df2)

# Arbitrarily set the word for unknown categories
keyword = "third"
df_combined["category"] = df_combined["category"].fillna(keyword)
And this is its output:
Price category keyword
0 1200.0 first Computer
1 800.0 second Phone
2 200.0 first Printer
3 40.0 third chair

Pivoting COVID-19 JH Data to Time Series Rows

I am trying to pivot the Johns Hopkins Data so that date columns are rows and the rest of the information stays the same. The first seven columns should stay columns, but the remaining columns (date columns) should be rows. Any help would be appreciated.
Load and Filter data
import pandas as pd
import numpy as np
deaths_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv'
confirmed_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv'
dea = pd.read_csv(deaths_url)
con = pd.read_csv(confirmed_url)
dea = dea[(dea['Province_State'] == 'Texas')]
con = con[(con['Province_State'] == 'Texas')]
View recency of data and pivot
# get the most recent date in the data
mostRecentDate = con.columns[-1]  # the last column name is the latest date
# show the data frame
con.sort_values(by=mostRecentDate, ascending = False).head(10)
# save this index variable to save the order.
index = data.columns.drop(['Province_State'])
# The pivot_table method will eliminate duplicate entries from Countries with more than one city
data.pivot_table(index = 'Admin2', aggfunc = sum)
# formatting using a variety of methods to process and sort data
finalFrame = data.transpose().reindex(index).transpose().set_index('Admin2').sort_values(by=mostRecentDate, ascending=False).transpose()
The resulting data frame looks like this; however, it did not preserve any of the dates.
I have also tried:
date_columns = con.iloc[:, 7:].columns
con.pivot(index = date_columns, columns = 'Admin2', values = con.iloc[:, 7:])
ValueError: Must pass DataFrame with boolean values only
Edit:
As per guidance I tried the melt command listed in the first answer, but it does not create rows of dates; it just removes all the non-date values.
date_columns = con.iloc[:, 7:].columns
con.melt(id_vars=date_columns)
The end result should look like this:
Date iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ Combined_Key
1/22/2020 US USA 840 48001 Anderson Texas US 31.81534745 -95.65354823 Anderson, Texas, US
1/22/2020 US USA 840 48003 Andrews Texas US 32.30468633 -102.6376548 Andrews, Texas, US
1/22/2020 US USA 840 48005 Angelina Texas US 31.25457347 -94.60901487 Angelina, Texas, US
1/22/2020 US USA 840 48007 Aransas Texas US 28.10556197 -96.9995047 Aransas, Texas, US
Use pandas melt. Great example here.
Example:
In [41]: cheese = pd.DataFrame({'first': ['John', 'Mary'],
....: 'last': ['Doe', 'Bo'],
....: 'height': [5.5, 6.0],
....: 'weight': [130, 150]})
....:
In [42]: cheese
Out[42]:
first last height weight
0 John Doe 5.5 130
1 Mary Bo 6.0 150
In [43]: cheese.melt(id_vars=['first', 'last'])
Out[43]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In [44]: cheese.melt(id_vars=['first', 'last'], var_name='quantity')
Out[44]:
first last quantity value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In your case, you need to be operating on your dataframe (i.e. con or finalFrame, wherever your date columns are), and id_vars should be the columns you want to keep fixed (the non-date metadata columns), not the date columns. For example:
con.melt(id_vars=list(con.columns[:7]))
See specific example here.
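A fuller sketch of that reshaping for the JHU frame; it assumes the date columns are labelled like 1/22/20 (as in the CSVs above) and uses 'Cases' purely as a placeholder name for the value column:
import re

# treat every column that looks like m/d/yy as a date column; keep the rest fixed
date_cols = [c for c in con.columns if re.fullmatch(r'\d{1,2}/\d{1,2}/\d{2,4}', c)]
id_cols = [c for c in con.columns if c not in date_cols]

long = con.melt(id_vars=id_cols, value_vars=date_cols,
                var_name='Date', value_name='Cases')
long['Date'] = pd.to_datetime(long['Date'])
long = long.sort_values(['Date', 'Admin2'])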

pandas ValueError: cannot reindex from a duplicate axis when trying to do calculation based on values from another df

I have 2 dfs:
df2
dec_pl cur_key
0 JPY
1 HKD
df1
cur amount
JPY 80
HKD 20
USD 70
I'd like to look up dec_pl in df2 for each 'cur' in df1 and calculate df1.converted_amount = df1.amount * 10 ** (2 - df2.dec_pl); i.e. df1.amount times 10 to the power of (2 - df2.dec_pl). If there is no corresponding df2.cur_key for a df1.cur (e.g. USD), then just use its amount:
df1 = df1.set_index('cur')
df2 = df2.set_index('cur_key')
df1['converted_amount'] = (df1.amount*10**(2 - df2.dec_pl)).fillna(df1['amount'], downcast='infer')
but i got
ValueError: cannot reindex from a duplicate axis
I am wondering what's the best way to do this. The result should look like:
df1
cur amount converted_amount
JPY 80 8000
HKD 20 200
USD 70 70
One possible problem is duplicates in the cur_key column, like:
print (df2)
dec_pl cur_key
0 0 HKD
1 1 HKD
df1 = df1.set_index('cur')
Solutions are to aggregate the duplicates so each cur_key is unique - e.g. by sum:
df2 = df2.groupby('cur_key').sum()
Or remove duplicates - keep only first or last values per cur_key:
#first default value
df2 = df2.drop_duplicates('cur_key').set_index('cur_key')
#last value
#df2 = df2.drop_duplicates('cur_key', keep='last').set_index('cur_key')
df1['converted_amount'] = (df1.amount*10**(2 - df2.dec_pl)).fillna(df1['amount'], downcast='infer')
print (df1)
amount converted_amount
cur
JPY 80 80
HKD 20 200
USD 70 70
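An alternative sketch that avoids index alignment entirely by mapping dec_pl onto df1; it assumes df1 and df2 are the original frames with cur and cur_key as regular columns:
# map dec_pl by currency; cur_key must be unique, so drop duplicates first
dec = df1['cur'].map(df2.drop_duplicates('cur_key').set_index('cur_key')['dec_pl'])
# NaN propagates through the power/multiplication, so fillna falls back to amount
df1['converted_amount'] = (df1['amount'] * 10.0 ** (2 - dec)).fillna(df1['amount'])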

Pandas groupby stored in a new dataframe

I have the following code:
import pandas as pd
df1 = pd.DataFrame({'Counterparty': ['Bank', 'Bank', 'GSE', 'PSE'],
                    'Sub Cat': ['Tier1', 'Small', 'Small', 'Small'],
                    'Location': ['US', 'US', 'UK', 'UK'],
                    'Amount': [50, 55, 65, 55],
                    'Amount1': [1, 2, 3, 4]})
df2=df1.groupby(['Counterparty','Location'])[['Amount']].sum()
df2.dtypes
df1.dtypes
The df2 data frame does not have the columns that I am grouping by (Counterparty and Location). Any ideas why this is the case? Both Amount and Amount1 are numeric fields. I just want to sum Amount and aggregate across Amount1.
To get the grouping columns back from the index, add the as_index=False parameter or call reset_index():
df2=df1.groupby(['Counterparty','Location'])[['Amount']].sum().reset_index()
print (df2)
Counterparty Location Amount
0 Bank US 105
1 GSE UK 65
2 PSE UK 55
df2=df1.groupby(['Counterparty','Location'], as_index=False)[['Amount']].sum()
print (df2)
Counterparty Location Amount
0 Bank US 105
1 GSE UK 65
2 PSE UK 55
If you aggregate over all columns, nuisance (non-numeric) columns are excluded automatically - here the Sub Cat column is omitted:
df2=df1.groupby(['Counterparty','Location']).sum().reset_index()
print (df2)
Counterparty Location Amount Amount1
0 Bank US 105 3
1 GSE UK 65 3
2 PSE UK 55 4
df2=df1.groupby(['Counterparty','Location'], as_index=False).sum()
Remove the double brackets around the 'Amount' and make them single brackets. You're telling it to only select one column.
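If you also want Amount1 aggregated differently in the same pass, named aggregation keeps both the grouping columns and explicit per-column functions; a sketch where 'max' for Amount1 is just an example choice:
df2 = df1.groupby(['Counterparty', 'Location'], as_index=False).agg(
    Amount=('Amount', 'sum'),    # total Amount per group
    Amount1=('Amount1', 'max'),  # example: keep the largest Amount1 per group
)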
