I have a large dataset where shape = (184215, 82)
Out of the 82 columns, I would only like to import a select 6 to conserve memory, because I need to inner join and conduct some analysis on the data.
Is there a way to restrict the columns being created by pd.read_table(), or is there a way to drop the unnecessary columns after the fact? (The file is a CSV with no column header; I had to create the column names after the fact.)
For example, here is the list of 82 columns:
['COBDate' 'structRefID' 'transactionID' 'tradeID' 'tradeLegID'
'tradeVersion' 'baseCptyID' 'extCptyID' 'extLongName' 'portfolio'
'productClass' 'productGroup' 'productType' 'RIC' 'CUSIP' 'ISIN' 'SEDOL'
'underlyingCurrency' 'foreignCurrency' 'notional' 'notionalCCY' 'quantity'
'MTM' 'tradeDate' 'startDate' 'expiryDate' 'optExerciseType'
'settlementDate' 'settlementType' 'payoff' 'buySell' 'strike' 'rate'
'spread' 'rateType' 'paymentFreq' 'resetFreq' 'modelUsed' 'sentWSS'
'Multiplier' 'PayoutCCY' 'Comments' 'TraderCode' 'AsnOptionStyle'
'BarrierDirection' 'BarrierMonitoringFreq' 'DayCountConv'
'SingleBarrierLevel' 'DownBarrierLevel' 'DownRebateAmount'
'UpBarrierLevel' 'UpRebateAmount' 'IsOptionOnFwd' 'NDFixingDate'
'NDFixingPage' 'NDFixingRate' 'PayoutAmount' 'Underlying' 'WSSID'
'WindowEndDate' 'WindowStartDate' 'InstrumentID' 'EffectiveDate' 'CallPut'
'IsCallable' 'IsExchTraded' 'IsRepay' 'MutualPutDate' 'OptionExpiryStyle'
'IndexTerm' 'PremiumSettlementDate' 'PremiumCcy' 'PremiumAmount'
'ExecutionDateTime' 'FlexIndexFlag' 'NotionalPrincipal' 'r_Premium'
'cpty_type' 'IBTSSID' 'PackageID' 'Component' 'Schema' 'pandas_index']
I only want the following columns, for example:
'COBDate' 'baseCptyID' 'extCptyID' 'portfolio' 'strike' 'rate'
'spread'
For a CSV with no column header (note that read_table defaults to a tab separator, so pass sep=',' for a CSV):
pd.read_table('file.csv', sep=',', header=None, usecols=[0, 1, 2])
where [0, 1, 2] are the positions of the columns to be read.
If the CSV contains column headers, you can also specify them by name:
cols_to_read = ['COBDate', 'baseCptyID', 'extCptyID', 'portfolio', 'strike', 'rate', 'spread']
pd.read_table('file.csv', sep=',', usecols=cols_to_read)
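A minimal sketch combining both ideas for a header-less CSV; the file contents and column names here are made up for illustration (an in-memory buffer stands in for the real file):

```python
import io
import pandas as pd

# Hypothetical header-less CSV with 5 columns, of which we keep 3.
raw = io.StringIO(
    "2023-01-02,ref1,cpty1,100.5,0.02\n"
    "2023-01-03,ref2,cpty2,200.0,0.03\n"
)

# names labels every column positionally; usecols then restricts parsing
# to the listed positions, so the other columns are never loaded.
df = pd.read_csv(
    raw,
    header=None,
    names=["COBDate", "ref", "baseCptyID", "strike", "rate"],
    usecols=[0, 2, 3],
)
```

Because filtering happens during parsing, only the selected columns are materialized, which is what saves memory on a wide file.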
Related
I am receiving files and for some files columns are named differently.
For example:
In file 1, column names are: "studentID" , "ADDRESS", "Phone_number".
In file 2, column names are: "Common_ID", "Common_Address", "Mobile_number".
In file 3, column names are: "S_StudentID", "S_ADDRESS", "HOME_MOBILE".
I want to pass a dictionary after loading the file data into dataframes and in that dictionary I want to pass values like:
StudentId -> STUDENT_ID
Common_ID -> STUDENT_ID
S_StudentID -> STUDENT_ID
ADDRESS -> S_ADDRESS
Common_Address -> S_ADDRESS
S_ADDRESS -> S_ADDRESS
The reason for doing this is that my next step reads columns like "STUDENT_ID" and "S_ADDRESS" from the dataframe, and it will throw an error for files whose names are not standardized if it cannot find them. I want to rename the columns as above and then read the values from those files. One more question: when I run the new dataframe, will it pick the column name from the dictionary that has data in it?
You can keep the dictionary as you have it and use toDF with a list comprehension to rename the columns.
Input dataframe and column names:
from pyspark.sql import functions as F
df = spark.createDataFrame([], 'Common_ID string, ADDRESS string, COL3 string')
print(df.columns)
# ['Common_ID', 'ADDRESS', 'COL3']
Dictionary and toDF:
dict_cols = {
    'StudentId': 'STUDENT_ID',
    'Common_ID': 'STUDENT_ID',
    'S_StudentID': 'STUDENT_ID',
    'ADDRESS': 'S_ADDRESS',
    'Common_Address': 'S_ADDRESS',
    'S_ADDRESS': 'S_ADDRESS'
}
df = df.toDF(*[dict_cols.get(c, c) for c in df.columns])
Resultant column names:
print(df.columns)
# ['STUDENT_ID', 'S_ADDRESS', 'COL3']
Use a dict and a list comprehension. An equivalent way, which also works when some of the columns are not in the dictionary:
df.toDF(*[dict_cols[x] if x in dict_cols else x for x in df.columns]).show()
+----------+---------+----+
|STUDENT_ID|S_ADDRESS|COL3|
+----------+---------+----+
+----------+---------+----+
I have a text file with data which looks like this:
NCP_341_1834_0022.png 2 0 130 512 429
I would like to split the data into different columns with names like this:
['filename','class','xmin','ymin','xmax','ymax']
I have done this:
test_txt = pd.read_csv(r"../input/covidxct/train_COVIDx_CT-3A.txt")
test_txt.to_csv(r"../working/test/train.csv",index=None, sep='\t')
train = pd.read_csv("../working/test/train.csv")
However, when I download the .csv file, all the data is in one column instead of 6 columns. How can I fix this?
Just set the right separator (the default is ','):
test_txt = pd.read_csv(r"../input/covidxct/train_COVIDx_CT-3A.txt", sep=' ', header=None)
if you are using test_COVIDx_CT-3A.txt from Kaggle.
Don't forget to set header=None since there is no header row. You can also pass names=['filename', 'class', 'xmin', 'ymin', 'xmax', 'ymax'] to replace the default integer column names (0, 1, 2, ...).
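A short sketch of this read using the sample line from the question; an in-memory buffer stands in for the Kaggle file path:

```python
import io
import pandas as pd

# One line shaped like the question's data file.
raw = io.StringIO("NCP_341_1834_0022.png 2 0 130 512 429\n")

df = pd.read_csv(
    raw,
    sep=" ",      # fields are space-separated, not comma-separated
    header=None,  # the file has no header row
    names=["filename", "class", "xmin", "ymin", "xmax", "ymax"],
)
```

With the separator and names set up front, there is no need for the intermediate to_csv/read_csv round trip from the question.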
Just to answer my own question: you can use str.split to split the single column into several. For me, I split it into 6 columns, for my 6 labels (6 columns require n=5 splits):
train[['filename', 'class', 'xmin', 'ymin', 'xmax', 'ymax']] = train['NCP_96_1328_0032.png 2 9 94 512 405'].str.split(' ', n=5, expand=True)
train.head()
Then just drop the column you don't need (assign the result back, since drop does not modify in place by default):
train = train.drop(train.columns[[0]], axis=1)
When writing the pandas mainTable dataframe to mainTable.csv, the name of the index column is missing after the file is written.
Why does this happen, since I have specified index=True?
mainTable.to_csv(r'/Users/myuser/Completed/mainTable.csv',index=True)
mainTable = pd.read_csv('mainTable.csv')
print(mainTable.columns)
print output:
MacBook-Pro:Completed iliebogdanbarbulescu$ python map.py
Index(['Unnamed: 0', 'name', 'iso_a3', 'geometry', 'iso_code', 'continent'], dtype='object')
Save with index_label='Index_name', since by default index_label=None and the index name is omitted from the file.
See the pandas to_csv() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
mainTable.to_csv(r'/Users/myuser/Completed/mainTable.csv',index=True, index_label='Index_name')
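A sketch of the full round trip, with an in-memory buffer instead of a real path and 'Index_name' as an illustrative label:

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"]}, index=[10, 20])

# Write with a named index column...
buf = io.StringIO()
df.to_csv(buf, index=True, index_label="Index_name")

# ...and point read_csv at that column so it becomes the index again,
# instead of appearing as 'Unnamed: 0'.
buf.seek(0)
restored = pd.read_csv(buf, index_col="Index_name")
```

Setting df.index.name before calling to_csv achieves the same result, since index_label=None falls back to the index's own name when one is set.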
I'm trying to get descriptive statistics for a column of data (the tb column, a list of numbers) for every individual (i.e., each ID). Normally I'd use a for i in range(len(list)) statement, but since the ID is not a number, I'm unsure how to do that. Any tips would be helpful! The code below gets me descriptive statistics for the entire tb column, instead of for the tb data of each individual in the ID list.
df = pd.DataFrame(pd.read_csv("SurgeryTpref.csv")) #importing data
df.columns = ['date', 'time', 'tb', 'ID','before_after'] #column headers
df.to_numpy()
import pandas as pd
# read the data in with
df = pd.read_clipboard(sep=',')
# data
,date,time,tb,ID,before_after
0,6/29/20,4:15:33 PM,37.1,SR10,after
1,6/29/20,4:17:33 PM,38.1,SR10,after
2,6/29/20,4:19:33 PM,37.8,SR10,after
3,6/29/20,4:21:33 PM,37.5,SR10,after
4,6/29/20,4:23:33 PM,38.1,SR10,after
5,6/29/20,4:25:33 PM,38.5,SR10,after
6,6/29/20,4:27:33 PM,38.6,SR10,after
7,6/29/20,4:29:33 PM,37.6,SR10,after
8,6/29/20,4:31:33 PM,35.5,SR10,after
9,6/29/20,4:33:33 PM,34.7,SR10,after
summary = []
for individual in (ID):
    vals = df['tb'].describe()
    summary.append(vals)
print(summary)
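The loop above recomputes statistics for the whole tb column on every pass; a sketch of the per-ID approach with groupby (sample data shaped like the question's):

```python
import pandas as pd

# Small sample in the shape of the question's data.
df = pd.DataFrame({
    "tb": [37.1, 38.1, 35.0, 36.0],
    "ID": ["SR10", "SR10", "SR11", "SR11"],
})

# groupby splits tb by ID, so describe() runs once per individual;
# the result has one row per ID with count, mean, std, quartiles, etc.
summary = df.groupby("ID")["tb"].describe()
```

This also sidesteps the non-numeric-ID problem entirely, since groupby keys do not need to be numbers.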
I'm in the initial stages of doing some 'machine learning'.
I'm trying to create a new data frame and one of the columns doesn't appear to be recognised..?
I've loaded an Excel file with 2 columns (removed the index). All fine.
Code:
df = pd.read_excel('scores.xlsx',index=False)
df=df.rename(columns=dict(zip(df.columns,['Date','Amount'])))
df.index=df['Date']
df=df[['Amount']]
#creating dataframe
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date','Amount'])
for i in range(0, len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Amount'][i] = data['Amount'][i]
The error:
KeyError: 'Date'
Not really sure what's the problem here.
Any help greatly appreciated
I think in line 4 (df=df[['Amount']]) you reduce your dataframe to just one column, "Amount".
To add to @Grzegorz Skibinski's answer: the problem is that after line 4 there is no longer a 'Date' column. The Date column was assigned to the index and removed, and while the index is named "Date", you can't use 'Date' as a key to reach the index; use data.index[i] instead of data['Date'][i].
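A minimal sketch of that fix, reproducing the situation with made-up values and reading the dates back through the index:

```python
import pandas as pd

# Reproduce the question's setup: 'Date' moved to the index, then dropped.
df = pd.DataFrame({"Date": ["2020-01-01", "2020-01-02"], "Amount": [10, 20]})
df.index = df["Date"]
df = df[["Amount"]]

data = df.sort_index(ascending=True, axis=0)

# data['Date'] would raise KeyError here; the values live in data.index.
new_data = pd.DataFrame(
    {"Date": data.index, "Amount": data["Amount"].to_numpy()}
).reset_index(drop=True)
```

Alternatively, calling data.reset_index() would turn the named index back into a regular 'Date' column and make the original loop work unchanged.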
It seems that you have an error in the name of your Date column.
To check that you don't have an error in the column names, you can print them:
import pandas as pd
# create data
data_dict = {}
data_dict['Fruit '] = ['Apple', 'Orange']
data_dict['Price'] = [1.5, 3.24]
# create dataframe from dict
df = pd.DataFrame.from_dict(data_dict)
# Print columns names
print(df.columns.values)
# Print "Fruit " column
print(df['Fruit '])
This code outputs:
['Fruit ' 'Price']
0 Apple
1 Orange
Name: Fruit , dtype: object
We clearly see that the "Fruit " column has a trailing space. This is an easy mistake to make, especially when the data comes from Excel.
If you try to access "Fruit" instead of "Fruit ", you get the error you have:
KeyError: 'Fruit'