Add columns and reorder them - Python

First data frame:
Index(['AvailabilityZone', 'CreateTime', 'Encrypted', 'Size',
       'SnapshotId', 'State', 'VolumeId', 'Iops', 'VolumeType',
       'MultiAttachEnabled', 'KmsKeyId', 'instanceId', 'name', 'Attachments'],
      dtype='object')
Second data frame:
Index(['Attachments', 'AvailabilityZone', 'CreateTime', 'Size',
       'SnapshotId', 'VolumeId', 'Iops', 'Tags', 'VolumeType',
       'KmsKeyId', 'instanceId', 'name'],
      dtype='object')
I am calling an API to pull data, but the columns come back in a different order, and some columns are present in one response but missing from another.
For example, the first data frame has 'MultiAttachEnabled' and 'State', but the second data frame does not. I also want to change the column order and remove some columns such as Tags and Encrypted.
In the final CSV file I want to get:
Attachments,
AvailabilityZone ,
CreateTime,
KmsKeyId,
Size,
SnapshotId,
State,
VolumeId,
Iops,
VolumeType,
MultiAttachEnabled,
instanceId,
Throughput

You can try the following, which adds the missing columns and then orders the columns by name.
import numpy as np

# Required columns
columns = ['Attachments', 'AvailabilityZone', 'CreateTime', 'KmsKeyId', 'Size', 'SnapshotId', 'State', 'VolumeId', 'Iops', 'VolumeType', 'MultiAttachEnabled', 'instanceId', 'Throughput']

# Get the columns missing from the DataFrame
missing_columns = set(columns).difference(set(df.columns))

# Add the missing columns, filled with NaN
for i in missing_columns:
    df[i] = np.nan

# Reorder the columns alphabetically
df = df.reindex(sorted(df.columns), axis=1)
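Note that sorted() gives an alphabetical order, while the order you listed is not alphabetical. A single reindex over the required list is a sketch that handles all three requirements at once: it adds the missing columns as NaN, drops anything not listed (such as Tags and Encrypted), and enforces exactly the order given (the output file name below is illustrative):

# reindex adds missing labels as NaN columns, drops unlisted ones,
# and returns the columns in exactly the order of 'columns'
df = df.reindex(columns=columns)
df.to_csv('volumes.csv', index=False)  # illustrative file name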

Related

How to check that a file is in the required format in Python

I have an Excel sheet of data with 20-something columns. The customer has a requirement: they want to know if any column is missing from the Excel file. I am using pandas to convert the data into DataFrames. I used if statements for a few columns, but as that is a rigid solution they want something better.
Any suggestions? Are there any libraries for this?
Thanks.
I want to check whether the file has all the required columns and display an error if it does not.
Here I created a DataFrame, but you would be using df = pd.read_excel('myfile.xlsx').
My DataFrame has only the three following columns:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'Sarah', 'Jack'],
        'Age': [20, 21, 19, 18],
        'Sex': ['M', 'M', 'F', 'M']}
df = pd.DataFrame(data)
Then I'll make a list of the required columns:
REQUIRED_COLUMNS = [
'Name',
'Age',
'Occupation',
'Sex'
]
# Make the columns a set to avoid quadratic lookups.
dfColumns = set(df.columns)
for col in REQUIRED_COLUMNS:
    if col not in dfColumns:
        print(f"Column '{col}' is missing.")
Et voilà
>>> Column 'Occupation' is missing.
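If you only need the full set of missing names at once, a set difference is a compact variant of the same check (a sketch reusing REQUIRED_COLUMNS and df from above):

missing = set(REQUIRED_COLUMNS) - set(df.columns)
if missing:
    print(f"Missing columns: {', '.join(sorted(missing))}")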

How to combine rows into one row when other rows share the same name in a certain column

Original and expected result (shown as screenshots in the original post).
Task:
I am trying to merge the 'urls' column into one row when the same name appears in the other column ('Full Path'), using Python and a Jupyter notebook.
I have tried using groupby, but it doesn't give me the result I want.
Code:
df.groupby("Full Path").apply(lambda x: ", ".join(x)).reset_index()
This is not what I am expecting (the output was shown as a screenshot in the original post).
The reason it is not working is that you need to modify the 'Full Path' column before passing it to groupby, since the full paths differ.
Based on the sample here, the following should work:
df['Full Path'] = df['Full Path'].str.split('/').str[0:2].str.join('/')
test = df.groupby(by=['Full Path']).agg({'url': ', Next'.join})
test['url'] = test['url'].str.replace("Next","\n")
This code assumes, of course, that the grouping you want for the full path is determined by the first two path segments. The \n will disappear when you write the df out to Excel.
NOTE: Unless the Type and Date fields all hold the same value, you cannot include them in the groupby: for example, with groupby(['Full Path', 'Type', 'Date']) not all the links would be aggregated for an individual path+folder combination. If you wanted them included as comma-separated, newline-split columns like url, you would need to add them to the agg statement and apply the replace to them as well, as shown below.
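For instance, if you did want Type and Date collapsed the same way (purely illustrative; in the sample below those fields hold a single repeated value each), the agg dict and the replace just grow:

test = df.groupby(by=['Full Path']).agg(
    {'url': ', Next'.join, 'Type': ', Next'.join, 'Date': ', Next'.join})
for col in ['url', 'Type', 'Date']:
    test[col] = test[col].str.replace('Next', '\n')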
Code used for testing:
import pandas as pd

pd.options.display.max_colwidth = 999

data_dict = {
    'Full Path': [
        'downloads/Residences Singapore',
        'downloads/Residences Singapore/15234523524352',
        'downloads/Residences Singapore/41242341324',
    ],
    'Type': [
        'Folder',
        'File',
        'File',
    ],
    'Date': [
        '07-05-22 19:24',
        '07-05-22 19:24',
        '07-05-22 19:24',
    ],
    'url': [
        'https://www.google.com/drive/storage/345243534534522345',
        'https://www.google.com/drive/storage/523405923405672340567834589065',
        'https://www.google.com/drive/storage/90658360945869012141234',
    ],
}
df = pd.DataFrame(data_dict)
df['Full Path'] = df['Full Path'].str.split('/').str[0:2].str.join('/')
test = df.groupby(by=['Full Path']).agg({'url': ', Next'.join})
test['url'] = test['url'].str.replace("Next", "\n")
test
Output (shown as a screenshot in the original post).
Just group by the 'Full Path' field with 'url' as the value field, and aggregate with a comma separator.
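A minimal sketch of that suggestion, assuming the 'Full Path' and 'url' column names from the test data above:

out = df.groupby('Full Path', as_index=False)['url'].agg(', '.join)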

KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported

I have a data frame with the following columns:
job_post.columns
Index(['Job.ID_list', 'Provider', 'Status', 'Slug', 'Title', 'Position',
'Company', 'City', 'State.Name', 'State.Code', 'Address', 'Latitude',
'Longitude', 'Industry', 'Job.Description', 'Requirements', 'Salary',
'Listing.Start', 'Listing.End', 'Employment.Type', 'Education.Required',
'Created.At', 'Updated.At', 'Job.ID_desc', 'text'],
dtype='object')
I want to select only the following columns from the dataframe:
columns_job_post = ['Job.ID_listing', 'Slug', 'Position', 'Company', 'Industry', 'Job.Description','Employment.Type', 'Education.Required', 'text'] # columns to keep
However, I get the result:
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported
I solved the issue by writing:
jobs_final = job_post.reindex(columns = columns_job_post)
Similarly, I have a data frame with the following columns:
cand_exp.columns
Index(['Applicant.ID', 'Position.Name', 'Employer.Name', 'City', 'State.Name',
'State.Code', 'Start.Date', 'End.Date', 'Job.Description', 'Salary',
'Can.Contact.Employer', 'Created.At', 'Updated.At'],
dtype='object')
I also selected just some columns from the whole list using .loc, but I didn't get the KeyError: Passing list-like...
columns_cand_exp = ['Applicant.ID', 'Position.Name', 'Employer.Name', 'Job.Description', 'Salary']  # columns to keep
resumes_final = cand_exp.loc[:, columns_cand_exp]
What is the reason for this?
Thank you in advance!
Because in the first example you introduced column names that do not exist in the original data frame (e.g. Job.ID_listing; the actual column is Job.ID_list).
In the second example, all the columns were already in the original data frame.
As the error says: 'Passing list-likes to .loc or [] with any missing labels is no longer supported'.
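If you would rather keep only the labels that really exist (instead of getting NaN-filled columns from reindex), Index.intersection avoids the KeyError as well; a sketch using the names above:

existing = job_post.columns.intersection(columns_job_post)
jobs_final = job_post.loc[:, existing]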

Python custom method set to new variable changes old variable

I have created a class with two methods, NRG_load and NRG_flat. The first loads a CSV, converts it into a DataFrame and applies some filtering; the second takes this DataFrame and, after creating two columns, melts it into a flat (long) format.
I am trying out these methods with the following code:
nrg105 = eNRG.NRG_load('nrg_105a.tsv')
nrg105_flat = eNRG.NRG_flat(nrg105, '105')
where eNRG is the class, and the second argument '105' is needed by an if block within the method to create the aforementioned columns.
The behaviour I cannot explain is that the second line - the one with the NRG_flat method - changes the nrg105 values.
Note that if I only run the NRG_load method, I get the expected DataFrame.
What is the behaviour I am missing? It's not the first time I have used syntax like this, but I have never had problems before, so I don't know where to look.
Thank you in advance for all of your suggestions.
EDIT: as requested, here is the class' code:
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 16 15:22:21 2019
@author: CAPIZZI Filippo Antonio
"""
import pandas as pd
from FixFilename import FixFilename as ff
from SplitColumn import SplitColumn as sc
from datetime import datetime as ddt


class EurostatNRG:
    # This class includes the modules needed to load and filter
    # the Eurostat NRG files

    # Default countries' list to be used by the functions
    COUNTRIES = [
        'EU28', 'AL', 'AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'EL',
        'ES', 'FI', 'FR', 'GE', 'HR', 'HU', 'IE', 'IS', 'IT', 'LT', 'LU', 'LV',
        'MD', 'ME', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK',
        'TR', 'UA', 'UK', 'XK'
    ]

    # Default years of analysis
    YEARS = list(range(2005, int(ddt.now().year) - 1))
    # NOTE: the 'datetime' library will give the current year, but since
    # the code is using the 'range' function, the end year will always be
    # current-1 (e.g. if we are in 2019, 'current year' will be 2018).
    # Thus, I have added "-1" because the end year is t-2.

    INDIC_PROD = pd.read_excel(
        './Datasets/VITO/map_nrg.xlsx',
        sheet_name=[
            'nrg105a_indic', 'nrg105a_prod', 'nrg110a_indic', 'nrg110a_prod',
            'nrg110'
        ],
        convert_float=True)

    def NRG_load(dataset, countries=COUNTRIES, years=YEARS, unit='ktoe'):
        # This module will load and refine the NRG dataset,
        # preparing it to be filtered
        # Fix eventual flags
        dataset = ff.fix_flags(dataset)
        # Load the dataset into a DataFrame
        df = pd.read_csv(
            dataset,
            delimiter='\t',
            encoding='utf-8',
            na_values=[':', ': ', ' :'],
            decimal='.')
        # Clean up spaces from the column names
        df.columns = df.columns.str.strip()
        # Remove the mentioned column because it's not needed
        if 'Flag and Footnotes' in df.columns:
            df.drop(columns=['Flag and Footnotes'], inplace=True)
        # Split the first column into separate columns
        df = sc.nrg_split_column(df)
        # Rename the columns
        df.rename(
            columns={
                'country': 'COUNTRY',
                'fuel_code': 'KEY_PRODUCT',
                'nrg_code': 'KEY_INDICATOR',
                'unit': 'UNIT'
            },
            inplace=True)
        # Filter the dataset
        df = EurostatNRG.NRG_filter(
            df, countries=countries, years=years, unit=unit)
        return df

    def NRG_filter(df, countries, years, unit):
        # This module will filter the input DataFrame 'df',
        # keeping only the 'countries', 'years' and 'unit' selected
        # First, all of the units not of interest are removed
        df.drop(df[df.UNIT != unit.upper()].index, inplace=True)
        # Then, all of the countries not of interest are filtered out
        df.drop(df[~df['COUNTRY'].isin(countries)].index, inplace=True)
        # Finally, all of the years not of interest are removed,
        # and the columns are rearranged according to the desired output
        main_cols = ['KEY_INDICATOR', 'KEY_PRODUCT', 'UNIT', 'COUNTRY']
        cols = main_cols + [str(y) for y in years if y not in main_cols]
        df = df.reindex(columns=cols)
        return df

    def NRG_flat(df, name):
        # This module prepares the DataFrame to be flattened,
        # then gives it as output
        # Assign the indicators' and products' names
        if '105' in name:  # 'name' is the name of the dataset
            # Create the 'INDICATOR' column
            indic_dic = dict(
                zip(EurostatNRG.INDIC_PROD['nrg105a_indic'].KEY_INDICATOR,
                    EurostatNRG.INDIC_PROD['nrg105a_indic'].INDICATOR))
            df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
            # Create the 'PRODUCT' column
            prod_dic = dict(
                zip(
                    EurostatNRG.INDIC_PROD['nrg105a_prod'].KEY_PRODUCT.astype(
                        str), EurostatNRG.INDIC_PROD['nrg105a_prod'].PRODUCT))
            df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)
        elif '110' in name:
            # Create the 'INDICATOR' column
            indic_dic = dict(
                zip(EurostatNRG.INDIC_PROD['nrg110a_indic'].KEY_INDICATOR,
                    EurostatNRG.INDIC_PROD['nrg110a_indic'].INDICATOR))
            df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
            # Create the 'PRODUCT' column
            prod_dic = dict(
                zip(
                    EurostatNRG.INDIC_PROD['nrg110a_prod'].KEY_PRODUCT.astype(
                        str), EurostatNRG.INDIC_PROD['nrg110a_prod'].PRODUCT))
            df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)
        # Delete the columns 'KEY_INDICATOR' and 'KEY_PRODUCT', and
        # rearrange the columns in the desired order
        df.drop(columns=['KEY_INDICATOR', 'KEY_PRODUCT'], inplace=True)
        main_cols = ['INDICATOR', 'PRODUCT', 'UNIT', 'COUNTRY']
        year_cols = [y for y in df.columns if y not in main_cols]
        cols = main_cols + year_cols
        df = df.reindex(columns=cols)
        # Melt the DataFrame to have it in flat format
        df = df.melt(
            id_vars=df.columns[:4], var_name='YEAR', value_name='VALUE')
        # Convert the 'VALUE' column into float numbers
        df['VALUE'] = pd.to_numeric(df['VALUE'], downcast='float')
        # Drop rows that have no indicators (it means they are not in
        # the Excel file with the products of interest)
        df.dropna(subset=['INDICATOR', 'PRODUCT'], inplace=True)
        return df
EDIT 2: if this could help, this is the error I receive when using the EurostatNRG class in IPython:
[autoreload of EurostatNRG failed: Traceback (most recent call last):
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 244, in check
    superreload(m, reload, self.old_objects)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 394, in superreload
    update_generic(old_obj, new_obj)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 331, in update_generic
    update(a, b)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 279, in update_class
    if (old_obj == new_obj) is True:
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 1478, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().]
I managed to find the culprit.
In the NRG_flat method, the lines:
df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
...
df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)
modify the caller's df DataFrame as well, so I had to change them to use the pandas assign method:
df = df.assign(INDICATOR=df.KEY_INDICATOR.map(indic_dic))
...
df = df.assign(PRODUCT=df.KEY_PRODUCT.map(prod_dic))
I no longer get any error.
Thank you for replying!
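For context, the behaviour the asker was missing is that NRG_flat mutated the DataFrame it received: column assignment such as df['INDICATOR'] = ... modifies the object in place, and nrg105 and the method's df parameter refer to the same object, so the caller sees the change. .assign sidesteps this by returning a new DataFrame. An alternative sketch (not the author's code) is to copy the input at the top of the method:

def NRG_flat(df, name):
    # Work on a copy so the caller's DataFrame stays untouched;
    # every in-place operation below then affects only the copy.
    df = df.copy()
    # ... rest of the method unchanged ...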

Merge Multiple CSV with different column name but same definition

I have different sources (CSV) of similar data sets which I want to merge into a single data set and write to my DB. Since the data comes from different sources, they use different headers in their CSVs, and I want to merge the columns that share a logical meaning.
So far, I have tried reading all the headers first and re-reading the files to get all the data into a single data frame, then using if/else to merge the columns with the same meaning. Ideally I would like to create a mapping file with all possible column names per column and then read the CSVs using that mapping. The data is not ordered or sorted between files. The number of columns might differ too, but they all have the columns I am interested in.
Sample data:
File 1:
id, name, total_amount...
1, "test", 123 ..
File 2:
member_id, tot_amnt, name
2, "test2", 1234 ..
I want this to look like:
id, name, total_amount...
1, "test", 123...
2, "test2", 1234...
...
I can't think of an elegant way to do this; it would be great to get some direction or help with it.
Thanks
Use skiprows and header=None to skip the header, names to specify your own list of column names, and concat to merge into a single df, i.e.:
import pandas as pd
pd.concat([
pd.read_csv('file1.csv',skiprows=1,header=None,names=['a','b','c']),
pd.read_csv('file2.csv',skiprows=1,header=None,names=['a','b','c'])]
)
Edit: If the files differ only in column order, you can pass different orders to names, and if you want to select a subset of columns, use usecols. But you need to set up this mapping in advance, either by probing the file or by some other rule.
This requires mapping files to handlers somehow, i.e.:
file1.csv
id, name, total_amount
1, "test", 123
file2.csv
member_id, tot_amnt, ignore, name
2, 1234, -1, "test2"
The following selects the three common columns and renames/reorders them:
import pandas as pd
pd.concat([
pd.read_csv('file1.csv',skiprows=1,header=None,names=['id','name','value'],usecols=[0,1,2]),
pd.read_csv('file2.csv',skiprows=1,header=None,names=['id','value','name'],usecols=[0,1,3])],
sort=False
)
Edit 2:
And a nice way to apply this is to use lambdas and maps, i.e.:
parsers = {
    "schema1": lambda f: pd.read_csv(f, skiprows=1, header=None, names=['id','name','value'], usecols=[0,1,2]),
    "schema2": lambda f: pd.read_csv(f, skiprows=1, header=None, names=['id','value','name'], usecols=[0,1,3])
}
map = {
    "file2.csv": "schema2",
    "file1.csv": "schema1"
}
pd.concat([parsers[v](k) for k, v in map.items()], sort=False)
This is what I ended up doing and found to be the cleanest solution. Thanks, David, for your help.
dict1 = {'member_number': 'id', 'full name': 'name', …}
dict2 = {'member_id': 'id', 'name': 'name', …}

parsers = {
    # 'mapping' avoids shadowing the built-in dict
    "schema1": lambda f, mapping: pd.read_csv(f, index_col=False, usecols=list(mapping.keys())),
    "schema2": lambda f, mapping: pd.read_csv(f, index_col=False, usecols=list(mapping.keys()))
}
map = {
    'schema1': ('a_file.csv', dict1),
    'schema2': ('b_file.csv', dict2)
}
total = []
for k, v in map.items():
    d = parsers[k](v[0], v[1])            # read the file with its own column subset
    d.rename(columns=v[1], inplace=True)  # normalise the column names
    total.append(d)
final_df = pd.concat(total, sort=False)
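Since the two parsers above are identical apart from their mapping, the indirection can also be dropped entirely; a minimal sketch under the same assumptions (file names and dict1/dict2 as above):

import pandas as pd

mappings = {'a_file.csv': dict1, 'b_file.csv': dict2}
final_df = pd.concat(
    [pd.read_csv(path, index_col=False, usecols=list(m.keys())).rename(columns=m)
     for path, m in mappings.items()],
    sort=False)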
