I am receiving files, and in some files the columns are named differently.
For example:
In file 1, column names are: "studentID", "ADDRESS", "Phone_number".
In file 2, column names are: "Common_ID", "Common_Address", "Mobile_number".
In file 3, column names are: "S_StudentID", "S_ADDRESS", "HOME_MOBILE".
After loading the file data into dataframes, I want to pass a dictionary with mappings like:
StudentId -> STUDENT_ID
Common_ID -> STUDENT_ID
S_StudentID -> STUDENT_ID
ADDRESS -> S_ADDRESS
Common_Address -> S_ADDRESS
S_ADDRESS -> S_ADDRESS
The reason for doing this is that my next dataframe reads columns named "STUDENT_ID" and "S_ADDRESS"; if it does not find "S_ADDRESS" and "STUDENT_ID" in the dataframe, it will throw an error for files whose names are not standardized. I want to run my dataframe and get values from those files after renaming them in the above DF. One more question: when I run the new df, will it pick up the column names from the dictionary for the columns that hold data?
You can build the dictionary the way you want and use toDF with a list comprehension to rename the columns.
Input dataframe and column names:
from pyspark.sql import functions as F
df = spark.createDataFrame([], 'Common_ID string, ADDRESS string, COL3 string')
print(df.columns)
# ['Common_ID', 'ADDRESS', 'COL3']
Dictionary and toDF:
dict_cols = {
    'StudentId': 'STUDENT_ID',
    'Common_ID': 'STUDENT_ID',
    'S_StudentID': 'STUDENT_ID',
    'ADDRESS': 'S_ADDRESS',
    'Common_Address': 'S_ADDRESS',
    'S_ADDRESS': 'S_ADDRESS'
}
df = df.toDF(*[dict_cols.get(c, c) for c in df.columns])
Resultant column names:
print(df.columns)
# ['STUDENT_ID', 'S_ADDRESS', 'COL3']
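If you receive many such files, you could wrap the rename in a small helper so every file goes through the same step. A minimal sketch, assuming an active SparkSession named spark; the helper name and file paths are illustrative, not from the original post:

def standardize_columns(df, mapping):
    # rename every known column; leave unrecognised columns untouched
    return df.toDF(*[mapping.get(c, c) for c in df.columns])

# hypothetical usage: standardize each incoming file before downstream processing
df1 = standardize_columns(spark.read.csv('file1.csv', header=True), dict_cols)
df2 = standardize_columns(spark.read.csv('file2.csv', header=True), dict_cols)

After this rename, downstream code that selects "STUDENT_ID" or "S_ADDRESS" will find the data under the standardized names, which also answers the follow-up question above.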
Use a dict and a list comprehension. A similar way, which likewise works even if some of the columns are not in the dictionary, is
df.toDF(*[dict_cols[x] if x in dict_cols else x for x in df.columns]).show()
+----------+---------+----+
|STUDENT_ID|S_ADDRESS|COL3|
+----------+---------+----+
+----------+---------+----+
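Note that dict_cols[x] if x in dict_cols else x is just a longhand for the dict_cols.get(x, x) used above: both fall back to the original column name when it is missing from the dictionary.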
Related
Problem: I have a dataframe whose column headers contain variations of several strings: 'Fee_code', 'zip_code', etc., and also some others such as 'street_address', 'violation_street address', etc.
Expected Outcome: A list of all the column headers that match the keywords Fee, address, code, name, and possibly others depending on the specific file I'll work on. Note that I DO want to keep the 'agency_name' column header.
Solution: I came up with this function to list all of the strings listed above, and some more:
def drop_cols(df):
    list1 = list(df.filter(like='nam', axis=1))
    list1.remove('agency_name')
    list2 = list(df.filter(like='add', axis=1))
    list3 = list(df.filter(like='fee', axis=1))
    list4 = list(df.filter(like='code', axis=1))
    list5 = list(df.filter(like='status', axis=1))
    entry = list1 + list2 + list3 + list4 + list5
    return entry
Challenge: This code works, but it's bulky, and I'm wondering if there are better ways to achieve the same result.
Sample of column headers: 'ticket_id', 'agency_name', 'inspector_name', 'violator_name', 'violation_street_number', 'violation_street_name', 'violation_zip_code', 'mailing_address_str_number', 'mailing_address_str_name', 'city', 'state', 'zip_code', 'non_us_str_code', 'country', 'ticket_issued_date', 'hearing_date', 'violation_code', 'violation_description', 'disposition', 'fine_amount', 'admin_fee', 'state_fee', 'late_fee', 'discount_amount', 'clean_up_cost', 'judgment_amount', 'payment_amount', 'balance_due', 'payment_date', 'payment_status', 'collection_status', 'grafitti_status', 'compliance_detail', 'compliance']
One way you could go about it:
#create search collection of relevant terms
search='|'.join(['fee','address','code','name'])
#use the filter method in pandas with the regex option
#then drop the 'agency_name' column
#d is the dataframe
d.filter(regex=search,axis=1).drop('agency_name',axis=1)
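If you need the list of matching headers itself (as the expected outcome asks) rather than the filtered dataframe, a small variation on the same idea should work. A sketch reusing the search pattern defined above:

# column names matching the pattern, minus the one we want to keep
cols = [c for c in d.filter(regex=search, axis=1).columns if c != 'agency_name']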
The data set I pulled from an API return looks like this:
([['Date', 'Value']],
[[['2019-08-31', 445000.0],
['2019-07-31', 450000.0],
['2019-06-30', 450000.0]]])
I'm trying to create a DataFrame with two columns from the data:
Date & Value
Here's what I've tried:
df = pd.DataFrame(city_data, index=['a', 'b'], columns=['Names'] . ['Names1'])

city_data[['Date','Value']] = city_data['Date'].str.split(',', expand=True)
city_data

city_data.append({"header": column_value, "Value": date_value})
city_data = pd.DataFrame()
This code was used to create the dataset. I pulled the lists from the API return:
column_value = data["dataset"]["column_names"]
date_value = data["dataset"]["data"]
city_data = ([column_value], [date_value])
city_data
Instead of creating a dataframe with two columns from the data, in most cases I get the "TypeError: list indices must be integers or slices, not str"
Is this what you are looking for:
d = ([['Date', 'Value']],
[[['2019-08-31', 445000.0],
['2019-07-31', 450000.0],
['2019-06-30', 450000.0]]])
pd.DataFrame(d[1][0], columns=d[0][0])
returns:
         Date     Value
0  2019-08-31  445000.0
1  2019-07-31  450000.0
2  2019-06-30  450000.0
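If the double indexing feels opaque, an equivalent sketch that unpacks the tuple first may read better (the variable names here are illustrative):

headers, rows = d  # d is the (header lists, data lists) tuple from the API
df = pd.DataFrame(rows[0], columns=headers[0])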
I'm in the initial stages of doing some 'machine learning'.
I'm trying to create a new data frame, and one of its columns doesn't appear to be recognised.
I've loaded an Excel file with 2 columns (removed the index). All fine.
Code:
df = pd.read_excel('scores.xlsx',index=False)
df=df.rename(columns=dict(zip(df.columns,['Date','Amount'])))
df.index=df['Date']
df=df[['Amount']]
#creating dataframe
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date','Amount'])
for i in range(0, len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Amount'][i] = data['Amount'][i]
The error:
KeyError: 'Date'
Not really sure what's the problem here.
Any help greatly appreciated
I think in line 4 (df=df[['Amount']]) you reduce your dataframe to just one column, "Amount".
To add to @Grzegorz Skibinski's answer: the problem is that after line 4 there is no longer a 'Date' column. The Date column was assigned to the index and removed, and while the index is named "Date", you can't use 'Date' as a key to get the index; you have to use data.index[i] instead of data['Date'][i].
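A minimal sketch of the fix, assuming the goal is simply to line Date and Amount up again: move the index back into a column with reset_index instead of looping.

# reset_index turns the 'Date' index back into an ordinary column
data = df.sort_index(ascending=True, axis=0).reset_index()
new_data = data[['Date', 'Amount']].copy()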
It seems that you have an error in the formatting of your Date column name.
To check that you don't have an error in the column names, you can print them:
import pandas as pd
# create data
data_dict = {}
data_dict['Fruit '] = ['Apple', 'Orange']
data_dict['Price'] = [1.5, 3.24]
# create dataframe from dict
df = pd.DataFrame.from_dict(data_dict)
# Print columns names
print(df.columns.values)
# Print "Fruit " column
print(df['Fruit '])
This code outputs:
['Fruit ' 'Price']
0 Apple
1 Orange
Name: Fruit , dtype: object
We clearly see that the "Fruit " column has a trailing space. This is an easy mistake to make, especially when the data comes from Excel.
If you try to call "Fruit" instead of "Fruit " you obtain the error you have:
KeyError: 'Fruit'
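A common safeguard against this kind of problem, offered here as a sketch rather than as part of the original answer, is to normalize the headers right after loading:

# strip stray whitespace from every column name
df.columns = df.columns.str.strip()
print(df['Fruit'])  # now works, since the trailing space is gone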
I have 3 lists:
name = ['Robert']
age = ['25']
gender = ['m']
I want to create a dataframe like the one shown below (with the list names as the column names):
     name age gender
0  Robert  25      m
This is what I'm doing to get this dataframe :
data=pd.DataFrame([name,age,gender]).T
data.columns=['name','age','gender']
I want to know whether there is a better way of doing this.
Dataframe from columns
Note that the pd.DataFrame constructor accepts a dictionary of column labels mapped to lists of values, so you can use:
df = pd.DataFrame({'name': name, 'age': age, 'gender': gender})
Dataframe from rows
Alternatively, you can feed rows using a list comprehension with zip. This creates a list of lists, each sublist representing a single row:
name = ['Robert']
age = ['25']
gender = ['m']
L = [list(row) for row in zip(name, age, gender)]
df = pd.DataFrame(L, columns=['name', 'age', 'gender'])
print(df)
     name age gender
0  Robert  25      m
The above can be written functionally using map:
L = list(map(list, zip(name, age, gender)))
The fastest way:
pd.DataFrame(dict(name=['Robert'],age=['25'],gender=['m']))
pd.DataFrame takes data as its first parameter, which can be a numpy.ndarray, dict, or DataFrame.
Considering that you don't have more variables than name, age, and gender defined, I think this might work:
not_my_data = set(dir())
# define your variables
name=['Robert']
age=['25']
gender=['m']
my_data = set(dir()) - not_my_data
pd.DataFrame({k:globals()[k] for k in my_data})
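Note that this trick assumes the variables are defined at module level, where dir() and globals() see the same names; inside a function you would need locals() instead, and any other variables created between the two dir() calls would be swept into the dataframe as well.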
Option 1
d = {'name':['Robert'],'age':['25'],'gender':['m']}
pd.DataFrame.from_dict(d)
Option 2
Form the dict on the fly -
pd.DataFrame.from_dict(dict(name=['Robert'], age=['25'], gender=['m']))
name=['Robert']
age=['25']
gender=['m']
data = pd.DataFrame({"name":name,"age":age,"gender":gender})
I have a large dataset where shape = (184215, 82)
Out of the 82 columns, I would only like to import a select 6 to conserve memory, because I need to inner join and conduct some analysis on the data.
Is there a way to restrict the columns being created with pd.read_table(), or is there a way to drop the unnecessary columns after the fact? (The file is a CSV with no column header; I had to create the column names after the fact.)
For example here is the list of 82 column:
['COBDate' 'structRefID' 'transactionID' 'tradeID' 'tradeLegID'
'tradeVersion' 'baseCptyID' 'extCptyID' 'extLongName' 'portfolio'
'productClass' 'productGroup' 'productType' 'RIC' 'CUSIP' 'ISIN' 'SEDOL'
'underlyingCurrency' 'foreignCurrency' 'notional' 'notionalCCY' 'quantity'
'MTM' 'tradeDate' 'startDate' 'expiryDate' 'optExerciseType'
'settlementDate' 'settlementType' 'payoff' 'buySell' 'strike' 'rate'
'spread' 'rateType' 'paymentFreq' 'resetFreq' 'modelUsed' 'sentWSS'
'Multiplier' 'PayoutCCY' 'Comments' 'TraderCode' 'AsnOptionStyle'
'BarrierDirection' 'BarrierMonitoringFreq' 'DayCountConv'
'SingleBarrierLevel' 'DownBarrierLevel' 'DownRebateAmount'
'UpBarrierLevel' 'UpRebateAmount' 'IsOptionOnFwd' 'NDFixingDate'
'NDFixingPage' 'NDFixingRate' 'PayoutAmount' 'Underlying' 'WSSID'
'WindowEndDate' 'WindowStartDate' 'InstrumentID' 'EffectiveDate' 'CallPut'
'IsCallable' 'IsExchTraded' 'IsRepay' 'MutualPutDate' 'OptionExpiryStyle'
'IndexTerm' 'PremiumSettlementDate' 'PremiumCcy' 'PremiumAmount'
'ExecutionDateTime' 'FlexIndexFlag' 'NotionalPrincipal' 'r_Premium'
'cpty_type' 'IBTSSID' 'PackageID' 'Component' 'Schema' 'pandas_index']
I only want the following as an example:
'COBDate' 'baseCptyID' 'extCptyID' 'portfolio' 'strike' 'rate'
'spread'
For a CSV with no column header:
pd.read_table('data.csv', usecols=[0, 1, 2], header=None)  # 'data.csv' stands for your file path
where [0, 1, 2] are the numbers of the columns that have to be read.
If the csv contains column headers, you can also specify them by name:
cols_to_read = ['COBDate', 'baseCptyID', 'extCptyID', 'portfolio', 'strike', 'rate', 'spread']
pd.read_table('data.csv', usecols=cols_to_read)
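Since the file has no header row, you can also assign your standardized names at read time. A sketch under that assumption; the file name and the positions in usecols are placeholders to adapt to the real layout:

import pandas as pd

# read only the selected positions; header=None because the file has no header row
df = pd.read_table('trades.csv', sep=',', header=None, usecols=[0, 6, 7, 9, 31, 32, 33])
# name the surviving columns in positional order
df.columns = ['COBDate', 'baseCptyID', 'extCptyID', 'portfolio', 'strike', 'rate', 'spread']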