Parsing through misaligned columns in dataframe - python

I have source data that comes as a raw CSV with misaligned columns, some of which are missing certain columns entirely. The desired data can be located in any of columns 1-30. The key to finding where the data lies is that "1 Yr Cost" is present in the header of every sub-frame.
Example source data:
import pandas as pd
from io import StringIO
sourceCSV = """col0,col1,col2,col3,col4,col5,col6
,,Cost,1 Mn Cost,1 Yr Cost,,
,Michigan,$50 ,$55 ,$65 ,,
,,,,Cost,1 Mn Cost,1 Yr Cost
,,,Indiana,$40 ,$45 ,$55
,Cost,1 Mn Cost,1 Yr Cost,,,
New York,$25 ,$30 ,$35 ,,,
,,Cost,1 Yr Cost,,,
,Florida,$10 ,$20 ,,,"""
csvStringIO = StringIO(sourceCSV)
dfSource = pd.read_csv(csvStringIO, sep=",", header=None)
col0      col1      col2       col3       col4       col5       col6
null      null      Cost       1 Mn Cost  1 Yr Cost  null       null
null      Michigan  50         55         65         null       null
null      null      null       null       Cost       1 Mn Cost  1 Yr Cost
null      null      null       Indiana    40         45         55
null      Cost      1 Mn Cost  1 Yr Cost  null       null       null
New York  25        30         35         null       null       null
null      null      Cost       1 Yr Cost  null       null       null
null      Florida   10         20         null       null       null
I need to parse through the data and get it into a format similar to the one below:

Location  Cost  1 Mn Cost  1 Yr Cost
Michigan  $50   $55        $65
Indiana   $40   $45        $55
New York  $25   $30        $35
Florida   $10   null       $20
The only thing I can figure out is to manually loop through each column, but this is very inefficient. What is the best way to accomplish this?

One possibility that works in this particular case:
- treat alternating rows as headers and data: after skipping the first line, each header row is immediately followed by its data row
- add "Location" as the first cell of each header row
- stack to remove the NaNs
- reshape
csvStringIO.seek(0)  # rewind: the buffer was consumed by the first read_csv
dfSource = pd.read_csv(csvStringIO, sep=",", skiprows=1, header=None)

(dfSource[0].where(dfSource.index % 2 == 1, 'Location').to_frame()  # write 'Location' into col 0 of header rows
 .join(dfSource.iloc[:, 1:])
 .set_index([dfSource.index // 2, dfSource.index % 2])  # level 0: pair number, level 1: header(0)/data(1)
 .stack().droplevel(-1).to_frame('value')               # stacking drops the NaNs
 .pipe(lambda d: d.set_index(d.groupby(level=[0, 1]).cumcount(), append=True))
 .unstack(1).droplevel(1)['value']                      # put header names and values side by side
 .pivot(columns=0, values=1)                            # spread header names into columns
)
Output:
0  1 Mn Cost  1 Yr Cost  Cost  Location
0        $55        $65   $50  Michigan
1        $45        $55   $40   Indiana
2        $30        $35   $25  New York
3        NaN        $20   $10   Florida
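Note that this relies on the header and data rows strictly alternating after the skipped first line, which holds for the sample data; the groupby-based answer below makes the same pairing assumption but keys it off the "Cost" marker instead.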

The malformed dataset does have some regularity: the rows containing the needed column headers and the rows containing the values come in pairs, except for the location values, which sit one column to the left of the first cost value and must be captured separately.
Start by loading the CSV, skipping the first row with the unneeded col0, col1, col2, ... labels:
csvStringIO.seek(0)  # rewind the buffer if it was already read
df = pd.read_csv(csvStringIO, sep=",", header=None, skiprows=1)
Then we apply a short processing step (group the row pairs, collecting the column names and the values separately):
def f(x):
    # x holds one header row followed by its data row
    return pd.DataFrame(columns=['Location'] + x.iloc[0].dropna().tolist(),
                        data=[x.iloc[-1].dropna().values])

res_df = df.groupby((df == 'Cost').any(axis=1).cumsum()).apply(f).reset_index(drop=True)
   Location Cost 1 Mn Cost 1 Yr Cost
0  Michigan  $50       $55       $65
1   Indiana  $40       $45       $55
2  New York  $25       $30       $35
3   Florida  $10       NaN       $20
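To see why the grouping works (illustrative only): only header rows contain the literal cell 'Cost', so the cumulative sum gives each header/data pair a shared label:
# Each header/data pair shares one cumsum label.
key = (df == 'Cost').any(axis=1).cumsum()
print(key.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4]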

Related

Handling a column with dates and missing dates

I have the following code to estimate profit from the buy and sell prices of crypto tokens.
import pandas as pd
# Read text file into pandas DataFrame
# --------------------------------------
df = pd.read_csv("crypto.txt", comment="#", skip_blank_lines=True, delim_whitespace=True).dropna()
# Display DataFrame
# -----------------
print(df)
print()
# Replace commas in number
# --------------------------------------
df['BuyPrice'] = df['BuyPrice'].str.replace(',', '').astype(float)
df['SellPrice'] = df['SellPrice'].str.replace(',', '').astype(float)
df['Size'] = df['Size'].str.replace(',', '').astype(float)
df['Profit'] = df.SellPrice - df.BuyPrice
# Sort BuyPrice column in ascending way
# --------------------------------------
df = df.sort_values('BuyPrice', ignore_index=True)
#df = df.sort_values('BuyPrice').reset_index(drop=True)
print()
# Sum all the numerical values and create a 'Total' row
# -----------------------------------------------------
df.loc['Total'] = df.sum(numeric_only=True)
# Replace NaN by empty space
# ---------------------------
df = df.fillna('')
df = df.rename({'BuyPrice': 'Buy Price', 'SellPrice': 'Sell Price'}, axis=1)
# Display Final DataFrame
# -----------------
print(df)
Now the output only shows the rows with sensible entries in the 'Date' column. I get:
    Coin BuyPrice SellPrice     Size           Date
1  1INCH    2,520      3180       10     23-10-2021
3   SHIB      500       450  200,000     27-10-2021
4    DOT     1650      2500        1  June 01, 2021

        Coin  Buy Price  Sell Price      Size           Date  Profit
0       SHIB      500.0       450.0  200000.0     27-10-2021   -50.0
1        DOT     1650.0      2500.0       1.0  June 01, 2021   850.0
2      1INCH     2520.0      3180.0      10.0     23-10-2021   660.0
Total            4670.0      6130.0  200011.0                 1460.0
Clearly, the rows without dates have been dropped. How can one tackle this issue? How can pandas recognise these values as dates?
crypto.txt file contains:
Coin BuyPrice SellPrice Size Date
#--- --------- ---------- ---- -----------
ADA 1,580 1,600 1 NA
1INCH 2,520 3180 10 23-10-2021
SHIB 261.6 450 200,000 NA
SHIB 500 450 200,000 27-10-2021
DOT 1650 2500 1 "June 01, 2021"
It seems I couldn't write the last row's date entry within single quotation marks. Is it possible to convert all the dates into one single format?
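No answer is shown for this question here, but as a sketch of one possible approach (my assumption: the two formats in the sample, 23-10-2021 and "June 01, 2021", are the only ones that occur), the rows with missing dates can be kept by skipping dropna(), and the mixed formats normalised with two passes of pd.to_datetime:
import pandas as pd

# Sketch only: keep rows with missing dates by not calling dropna().
# delim_whitespace is deprecated in newer pandas; sep=r'\s+' is equivalent.
df = pd.read_csv("crypto.txt", comment="#", skip_blank_lines=True,
                 delim_whitespace=True)

# First pass parses day-first dates like 23-10-2021; the second pass
# catches formats like "June 01, 2021". Anything unparseable stays NaT.
first = pd.to_datetime(df['Date'], format='%d-%m-%Y', errors='coerce')
second = pd.to_datetime(df['Date'], format='%B %d, %Y', errors='coerce')
df['Date'] = first.fillna(second).dt.strftime('%Y-%m-%d')
print(df)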

Formatting dataframe into excel

I am doing all of my data manipulation in Python and have all the required values in a dataframe.
I am not sure how to export the dataframe to Excel in the following format (merged cells for categories, etc.) -
Eg DF-
Item No  Item Name  Category    Italy Count  Netherlands Count  France Count  Grand Total
1        Item A     Category 1  5            10                 20            35
1        Item B     Category 1  5            10                 20            35
Format - (screenshot of the desired merged-cell Excel layout omitted)
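No answer is shown for this question either; below is a minimal sketch of one way to get merged category cells, assuming openpyxl as the engine (the column layout is taken from the sample DF, not from the missing screenshot):
import pandas as pd

df = pd.DataFrame({
    'Item No': [1, 1],
    'Item Name': ['Item A', 'Item B'],
    'Category': ['Category 1', 'Category 1'],
    'Italy Count': [5, 5],
    'Netherlands Count': [10, 10],
    'France Count': [20, 20],
    'Grand Total': [35, 35],
})

with pd.ExcelWriter('report.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, index=False, sheet_name='Report')
    ws = writer.sheets['Report']
    col = df.columns.get_loc('Category') + 1  # 1-based Excel column number
    # Merge each run of consecutive identical Category values.
    for _, g in df.groupby((df['Category'] != df['Category'].shift()).cumsum()):
        first, last = g.index[0] + 2, g.index[-1] + 2  # +2: header row, 1-based rows
        if last > first:
            ws.merge_cells(start_row=first, start_column=col,
                           end_row=last, end_column=col)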

assign string value to a cell in pandas

I've created a new row for storing the mean values of all columns. Now I'm trying to assign a name to the very first cell of the new row.
I've tried the conventional method of assigning a value by pointing to the cell index. It doesn't return any error, but it doesn't seem to store the value in the cell.
  Items Description   Duration  China  Japan  Korea
0               GDP  2012-2013  40000  35000  12000
1               GDP  2013-2014  45000  37000  12500
2               NAN        NAN  42500  36000  12250
data11.loc[2,'Items Description'] = 'Average GDP'
Instead of returning the dataframe below, the code still gives the previous output.
  Items Description   Duration  China  Japan  Korea
0               GDP  2012-2013  40000  35000  12000
1               GDP  2013-2014  45000  37000  12500
2       Average GDP        NAN  42500  36000  12250
For me this works fine, but here are two alternatives for setting a value by the last row and a column name.
The first is DataFrame.loc, specifying the last index value by indexing:
data11.loc[data11.index[-1], 'Items Description'] = 'Average GDP'
The second is DataFrame.iloc with -1 to get the last row and Index.get_loc to get the position of the Items Description column:
data11.iloc[-1, data11.columns.get_loc('Items Description')] = 'Average GDP'
print (data11)
  Items Description   Duration  China  Japan  Korea
0               GDP  2012-2013  40000  35000  12000
1               GDP  2013-2014  45000  37000  12500
2       Average GDP        NAN  42500  36000  12250

Pandas groupby stored in a new dataframe

I have the following code:
import pandas as pd

df1 = pd.DataFrame({'Counterparty': ['Bank', 'Bank', 'GSE', 'PSE'],
                    'Sub Cat': ['Tier1', 'Small', 'Small', 'Small'],
                    'Location': ['US', 'US', 'UK', 'UK'],
                    'Amount': [50, 55, 65, 55],
                    'Amount1': [1, 2, 3, 4]})
df2=df1.groupby(['Counterparty','Location'])[['Amount']].sum()
df2.dtypes
df1.dtypes
The df2 dataframe does not have the columns that I am grouping by (Counterparty and Location). Any ideas why this is the case? Both Amount and Amount1 are numeric fields. I just want to sum Amount and also aggregate Amount1.
To get the grouping columns back as regular columns, add the as_index=False parameter or call reset_index:
df2=df1.groupby(['Counterparty','Location'])[['Amount']].sum().reset_index()
print (df2)
Counterparty Location Amount
0 Bank US 105
1 GSE UK 65
2 PSE UK 55
df2=df1.groupby(['Counterparty','Location'], as_index=False)[['Amount']].sum()
print (df2)
Counterparty Location Amount
0 Bank US 105
1 GSE UK 65
2 PSE UK 55
If you aggregate over all columns, nuisance columns are excluded automatically - here the non-numeric column Sub Cat is omitted:
df2=df1.groupby(['Counterparty','Location']).sum().reset_index()
print (df2)
Counterparty Location Amount Amount1
0 Bank US 105 3
1 GSE UK 65 3
2 PSE UK 55 4
df2=df1.groupby(['Counterparty','Location'], as_index=False).sum()
Remove the double brackets around 'Amount' and make them single brackets. Double brackets select a one-column DataFrame; single brackets select just the Series.
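To illustrate the point about brackets (a quick sketch reusing df1 from above):
# Single brackets select a Series, double brackets a one-column DataFrame;
# either way the group keys land in the index unless as_index=False is used.
s = df1.groupby(['Counterparty', 'Location'])['Amount'].sum()    # Series
d = df1.groupby(['Counterparty', 'Location'])[['Amount']].sum()  # DataFrame
print(type(s).__name__, type(d).__name__)  # Series DataFrame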

Update Specific Pandas Rows with Value from Different Dataframe

I have a pandas dataframe that contains budget data but my sales data is located in another dataframe that is not the same size. How can I get my sales data updated in my budget data? How can I write conditions so that it makes these updates?
DF budget:
cust type loc rev sales spend
0 abc new north 500 0 250
1 def new south 700 0 150
2 hij old south 700 0 150
DF sales:
cust type loc sales
0 abc new north 15
1 hij old south 18
DF budget outcome:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
Any thoughts?
Assuming that the 'cust' column is unique in your other df, you can call map on the sales df after setting its index to the 'cust' column. This maps each 'cust' in the budget df to its sales value. You will get NaN where values are missing, so call fillna(0) to fill those:
In [76]:
df['sales'] = df['cust'].map(df1.set_index('cust')['sales']).fillna(0)
df
Out[76]:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
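A small follow-up note, not part of the original answer: map plus fillna leaves the column as float because of the intermediate NaN values; cast back if integer sales are wanted:
# fillna(0) leaves the column as float; cast back to int if desired.
df['sales'] = df['cust'].map(df1.set_index('cust')['sales']).fillna(0).astype(int)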
