When I write the pandas mainTable dataframe to mainTable.csv, the name of the index column is missing from the written file.
Why does this happen, since I have specified index=True?
mainTable.to_csv(r'/Users/myuser/Completed/mainTable.csv',index=True)
mainTable = pd.read_csv('mainTable.csv')
print(mainTable.columns)
Print output:
MacBook-Pro:Completed iliebogdanbarbulescu$ python map.py
Index(['Unnamed: 0', 'name', 'iso_a3', 'geometry', 'iso_code', 'continent'], dtype='object')
Save with index_label='Index_name', since by default index_label=None (a header for the index column is only written when the index has a name).
See the pandas .to_csv() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
mainTable.to_csv(r'/Users/myuser/Completed/mainTable.csv',index=True, index_label='Index_name')
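To make the round trip lossless, read the file back with index_col pointing at the same label. A minimal sketch, reusing the path from the question:

mainTable.to_csv(r'/Users/myuser/Completed/mainTable.csv', index=True, index_label='Index_name')
mainTable = pd.read_csv(r'/Users/myuser/Completed/mainTable.csv', index_col='Index_name')
print(mainTable.columns)  # the index is restored, so no 'Unnamed: 0' column appears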
Related
I am receiving files, and in some files the columns are named differently.
For example:
In file 1, column names are: "studentID" , "ADDRESS", "Phone_number".
In file 2, column names are: "Common_ID", "Common_Address", "Mobile_number".
In file 3, column names are: "S_StudentID", "S_ADDRESS", "HOME_MOBILE".
After loading the file data into dataframes, I want to pass a dictionary with mappings like:
StudentId -> STUDENT_ID
Common_ID -> STUDENT_ID
S_StudentID -> STUDENT_ID
ADDRESS -> S_ADDRESS
Common_Address -> S_ADDRESS
S_ADDRESS -> S_ADDRESS
The reason for doing this is that my next dataframe reads column names like "STUDENT_ID" and "S_ADDRESS", and if it does not find "S_ADDRESS" and "STUDENT_ID" in the dataframe, it will throw an error for files whose names are not standardized. I want to run my dataframe and get values from those files after renaming them as above. One more question: when I run the new df, will it pick the column name from the dictionary for the column that has data in it?
You can keep the dictionary as you have it and use toDF with a list comprehension to rename the columns.
Input dataframe and column names:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([], 'Common_ID string, ADDRESS string, COL3 string')
print(df.columns)
# ['Common_ID', 'ADDRESS', 'COL3']
Dictionary and toDF:
dict_cols = {
    'StudentId': 'STUDENT_ID',
    'Common_ID': 'STUDENT_ID',
    'S_StudentID': 'STUDENT_ID',
    'ADDRESS': 'S_ADDRESS',
    'Common_Address': 'S_ADDRESS',
    'S_ADDRESS': 'S_ADDRESS'
}
df = df.toDF(*[dict_cols.get(c, c) for c in df.columns])
Resultant column names:
print(df.columns)
# ['STUDENT_ID', 'S_ADDRESS', 'COL3']
Use a dict and a list comprehension. This works even if some of the columns are not in the dictionary:
df.toDF(*[dict_cols[x] if x in dict_cols else x for x in df.columns]).show()
+----------+---------+----+
|STUDENT_ID|S_ADDRESS|COL3|
+----------+---------+----+
+----------+---------+----+
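Since the real use case involves several differently-named files, the same mapping can be applied to each file as it is loaded. A minimal sketch, assuming the files are CSVs with headers at hypothetical paths:

paths = ['file1.csv', 'file2.csv', 'file3.csv']  # hypothetical paths
frames = [spark.read.csv(p, header=True) for p in paths]
# normalize every file's columns to the standard names before further processing
frames = [df.toDF(*[dict_cols.get(c, c) for c in df.columns]) for df in frames]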
I have a Dataframe I receive from a crawler that I am importing into a database for long-term storage.
The problem I am running into is that a large number of the dataframes have column names with uppercase letters and whitespace.
I have a fix for it but I was wondering if it can be done any cleaner than this:
def clean_columns(dataframe):
    # lowercase each column name and replace spaces with underscores
    for column in dataframe.columns:
        dataframe.rename(columns={column: column.lower().replace(" ", "_")},
                         inplace=True)
    return dataframe
print(dataframe.columns)
Index(['Daily Foo', 'Weekly Bar'], dtype='object')
dataframe = clean_columns(dataframe)
print(dataframe.columns)
Index(['daily_foo', 'weekly_bar'], dtype='object')
You can do it via the columns attribute:
df.columns = df.columns.str.lower().str.replace(' ', '_')
OR via the rename() method:
df = df.rename(columns=lambda x: x.lower().replace(' ', '_'))
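A quick check of the one-liner on a throwaway frame (a minimal sketch):

import pandas as pd

df = pd.DataFrame(columns=['Daily Foo', 'Weekly Bar'])
df.columns = df.columns.str.lower().str.replace(' ', '_')
print(df.columns)
# Index(['daily_foo', 'weekly_bar'], dtype='object')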
I want to count the number of times a value in the Child column appears in the Parent column, then display this count in a new column named "child count". See the preview dfs below.
I have this done via VBA (COUNTIFS), but now I need dynamic visualization and an animated display with data fed from a directory. So I resorted to Python and pandas, and tried the code below after searching and reading answers like: Countif in pandas with multiple conditions | Determine if value is in pandas column | Iterate over rows in Pandas df | many others...
but I still can't get the expected output as illustrated in the image below.
Any help will be very much appreciated. Thanks in advance.
#import libraries
import pandas as pd
import numpy as np
import os
#get datasets
path_dataset = r'D:\Auto'
df_ns = pd.read_csv(os.path.join(path_dataset, 'Scripts', 'data.csv'), index_col = False, encoding = 'ISO-8859-1', engine = 'python')
#preview dataframe
df_ns
#tried
df_ns.groupby(['Child','Parent', 'Site Name']).size().reset_index(name='child count')
#preview output
[images: preview dataframe, preview output, expected output]
[Edited] My data
Child = ['Tkt01', 'Tkt02', 'Tkt03', 'Tkt04', 'Tkt05', 'Tkt06', 'Tkt07', 'Tkt08', 'Tkt09', 'Tkt10']
Parent = [' ', ' ', 'Tkt03', ' ', ' ', 'Tkt03', ' ', 'Tkt03', ' ', 'Tkt06']
Site_Name = ['Yaounde', 'Douala', 'Bamenda', 'Bafoussam', 'Kumba', 'Garoua', 'Maroua', 'Ngaoundere', 'Buea', 'Ebolowa']
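For reference, those lists can be assembled into a dataframe like so (a minimal sketch):

import pandas as pd

df_ns = pd.DataFrame({'Child': Child, 'Parent': Parent, 'Site Name': Site_Name})
print(df_ns)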
I created a lookalike of your df.
[image: before]
Try this code:
df['Count'] = [len(df[df['parent'].str.contains(value, regex=False)]) for value in df['child']]
#breaking it down as line-by-line code
counts = []
for value in df['child']:
    # rows where the child value appears within the parent column
    found = df[df['parent'].str.contains(value, regex=False)]
    counts.append(len(found))
df['Count'] = counts
[image: after]
Hope this works for you.
Since I don't have access to your data, I cannot check the code I am giving you. I suspect you will have problems with NaN values on this line, but you can give it a try:
df_ns['child_count'] = df_ns['Parent'].groupby(df_ns['Child']).value_counts()
I give a name to the new column and assign values to it directly through the groupby -> value_counts chain.
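Another hedged option, assuming child values must match parent values exactly (whole values, not substrings): count each Parent value with value_counts and map the counts onto Child.

counts = df_ns['Parent'].value_counts()
# children that never appear as a parent get a count of 0
df_ns['child count'] = df_ns['Child'].map(counts).fillna(0).astype(int)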
I'm in the initial stages of doing some 'machine learning'.
I'm trying to create a new data frame, and one of its columns doesn't appear to be recognised.
I've loaded an Excel file with 2 columns (removed the index). All fine.
Code:
df = pd.read_excel('scores.xlsx',index=False)
df=df.rename(columns=dict(zip(df.columns,['Date','Amount'])))
df.index=df['Date']
df=df[['Amount']]
#creating dataframe
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date','Amount'])
for i in range(0,len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Amount'][i] = data['Amount'][i]
The error:
KeyError: 'Date'
Not really sure what's the problem here.
Any help greatly appreciated
I think in line 4 you reduce your dataframe to just one column, "Amount".
To add to @Grzegorz Skibinski's answer: the problem is that after line 4 there is no longer a 'Date' column. The 'Date' column was assigned to the index and then removed, and while the index has the name "Date", you can't use 'Date' as a key to get the index; you have to use data.index[i] instead of data['Date'][i].
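A minimal sketch of that fix, keeping 'Date' as a regular column instead of moving it to the index (the file name comes from the question; the rest is an assumption):

import pandas as pd

df = pd.read_excel('scores.xlsx')
df = df.rename(columns=dict(zip(df.columns, ['Date', 'Amount'])))
# sort by Date but keep it as a normal column so data['Date'] keeps working
data = df.sort_values('Date').reset_index(drop=True)
new_data = data[['Date', 'Amount']].copy()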
It seems that the problem is with the name of your Date column.
To check that you don't have an error in the column names, you can print them:
import pandas as pd
# create data
data_dict = {}
data_dict['Fruit '] = ['Apple', 'Orange']
data_dict['Price'] = [1.5, 3.24]
# create dataframe from dict
df = pd.DataFrame.from_dict(data_dict)
# Print columns names
print(df.columns.values)
# Print "Fruit " column
print(df['Fruit '])
This code outputs:
['Fruit ' 'Price']
0 Apple
1 Orange
Name: Fruit , dtype: object
We clearly see that the "Fruit " column has a trailing space. This is an easy mistake to make, especially when using Excel.
If you try to access "Fruit" instead of "Fruit ", you get the error you have:
KeyError: 'Fruit'
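A common safeguard against this, right after loading, is to strip whitespace from every column name:

# remove leading/trailing whitespace from all column names
df.columns = df.columns.str.strip()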
I have a large dataset with shape = (184215, 82).
Out of the 82 columns, I would like to import only a select 6 to conserve memory, because I need to inner join and conduct some analysis on the data.
Is there a way to restrict the columns being created by pd.read_table(), or is there a way to drop the unnecessary columns after the fact? (The file is a CSV with no column header; I had to create the column names after the fact.)
For example here is the list of 82 column:
['COBDate' 'structRefID' 'transactionID' 'tradeID' 'tradeLegID'
'tradeVersion' 'baseCptyID' 'extCptyID' 'extLongName' 'portfolio'
'productClass' 'productGroup' 'productType' 'RIC' 'CUSIP' 'ISIN' 'SEDOL'
'underlyingCurrency' 'foreignCurrency' 'notional' 'notionalCCY' 'quantity'
'MTM' 'tradeDate' 'startDate' 'expiryDate' 'optExerciseType'
'settlementDate' 'settlementType' 'payoff' 'buySell' 'strike' 'rate'
'spread' 'rateType' 'paymentFreq' 'resetFreq' 'modelUsed' 'sentWSS'
'Multiplier' 'PayoutCCY' 'Comments' 'TraderCode' 'AsnOptionStyle'
'BarrierDirection' 'BarrierMonitoringFreq' 'DayCountConv'
'SingleBarrierLevel' 'DownBarrierLevel' 'DownRebateAmount'
'UpBarrierLevel' 'UpRebateAmount' 'IsOptionOnFwd' 'NDFixingDate'
'NDFixingPage' 'NDFixingRate' 'PayoutAmount' 'Underlying' 'WSSID'
'WindowEndDate' 'WindowStartDate' 'InstrumentID' 'EffectiveDate' 'CallPut'
'IsCallable' 'IsExchTraded' 'IsRepay' 'MutualPutDate' 'OptionExpiryStyle'
'IndexTerm' 'PremiumSettlementDate' 'PremiumCcy' 'PremiumAmount'
'ExecutionDateTime' 'FlexIndexFlag' 'NotionalPrincipal' 'r_Premium'
'cpty_type' 'IBTSSID' 'PackageID' 'Component' 'Schema' 'pandas_index']
I only want a small subset, for example:
'COBDate' 'baseCptyID' 'extCptyID' 'portfolio' 'strike' 'rate'
'spread'
For a csv with no column header (using a placeholder file path):
pd.read_table('your_file.csv', sep=',', header=None, usecols=[0, 1, 2])
where [0, 1, 2] are the positions of the columns to be read.
If the csv contains column headers, you can also specify them by name:
cols_to_read = ['COBDate', 'baseCptyID', 'extCptyID', 'portfolio', 'strike', 'rate', 'spread']
pd.read_table('your_file.csv', sep=',', usecols=cols_to_read)
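Since this file has no header and the names were created after the fact, both steps can also be combined at read time (a sketch, assuming all_cols holds the full 82-name list shown above and a placeholder file path):

df = pd.read_table('your_file.csv', sep=',', header=None, names=all_cols,
                   usecols=cols_to_read)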