xlsx file to list for dataset - python

Here is a copy of my code. When I run it, the list output is in a weird format that I can't figure out how to change.
'''
import pandas as pd
inputdata = []
data = pd.read_excel('testfile.xlsx')
df = pd.DataFrame(data, columns=['black_toner_percentage'])
for i in df.values:
    inputdata.append(i)
print(inputdata)
'''
array([100], dtype=int64), array([100], dtype=int64)
This is the format my list is going into, and I'd like it to be [100, 100, etc.].
What am I doing wrong here?
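A note added for context (not from the original post): df.values yields one NumPy array per row, which is why each appended element prints as array([100], dtype=int64). A minimal sketch of one way to get a flat list of plain ints instead, assuming the column is named black_toner_percentage as above:

import pandas as pd

df = pd.read_excel('testfile.xlsx')
# tolist() on the Series gives plain Python ints instead of one NumPy array per row
inputdata = df['black_toner_percentage'].tolist()
print(inputdata)  # e.g. [100, 100, ...]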

Related

How to convert list data to xml using python

I have been trying to convert list data to an XML file, but I am getting this error: ValueError: Invalid tag name '0'
This is my header: 'Name,Job Description,Course'
Code:
import pandas as pd
lst = ['Name,Job Description,Course',
       'Bob,Backend Developer,MCA',
       'Raj,Business Analyst,BMS',
       'Alice,FullStack Developer,CS']
df = pd.DataFrame(lst)
with open('output.xml', 'w') as myfile:
    myfile.write(df.to_xml())
The df you created is not in the right shape. pd.DataFrame(lst) puts each whole comma-separated string into a single column whose name is the integer 0, and to_xml cannot use 0 (or names containing spaces) as an XML tag name, which is what raises ValueError: Invalid tag name '0'.
To save a DataFrame as XML, the columns need proper string names with no spaces.
The solution below works. Hope this is what you are trying to achieve.
import pandas as pd

lst = [['Name', 'Job_Description', 'Course'],
       ['Bob', 'Backend Developer', 'MCA'],
       ['Raj', 'Business Analyst', 'BMS'],
       ['Alice', 'FullStack Developer', 'CS']]
# pass the header row itself as the column names (wrapping it in another list would create a MultiIndex)
df = pd.DataFrame(lst[1:], columns=lst[0])
print(df)
df.to_xml('./output.xml')
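As an illustration (not output copied from the original answer), assuming pandas' default to_xml settings (root element data, row element row, index included), the resulting output.xml should look roughly like:

<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <Name>Bob</Name>
    <Job_Description>Backend Developer</Job_Description>
    <Course>MCA</Course>
  </row>
  ...
</data>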

How to read list of parquets with partially overlapping set of columns in dask?

Consider this code:
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [11, 12]})
df1.to_parquet("df1.parquet")
df2 = pd.DataFrame({'A': [3, 4], 'C': [13, 14]})
df2.to_parquet("df2.parquet")

all_files = ["df1.parquet", "df2.parquet"]
full_df = dd.read_parquet(all_files)
# dask.compute(full_df)  # KeyError: "['B'] not in index"
def normalize(df):
    df_cols = set(df.columns)
    for c in ['A', 'B', 'C']:
        if c not in df_cols:
            df[c] = np.nan
    df = df[sorted(df.columns)]
    return df

normal_df = full_df.map_partitions(normalize)
dask.compute(normal_df)  # Still gives KeyError
I was hoping that after the normalization using map_partitions I wouldn't get the KeyError, but read_parquet probably fails before reaching the map_partitions step.
I could have created the DataFrame from a list of delayed objects which would each read one file and normalize the columns, but I want to avoid using delayed objects for this reason
The other option, suggested by SultanOrazbayev, is to use a dask dataframe like this:
def normal_ddf(path):
    df = dd.read_parquet(path)
    return normalize(df)  # normalize should work with both pandas and dask frames

full_df = dd.concat([normal_ddf(path) for path in all_files])
The problem with this is that, when all_files contains a large number of files (10K), it takes a long time to create the dataframe, since all those dd.read_parquet calls happen sequentially. Although dd.read_parquet doesn't need to load the whole file, it still needs to read some metadata to get the column info, and doing that sequentially on 10k files adds up.
So, what is the proper/efficient way to read a bunch of parquet files that don't all have the same set of columns?
dd.concat should take care of your normalization.
Consider this example:
import string

import dask.dataframe as dd
import numpy as np
import pandas as pd

N = 100_000
all_files = []
for col in string.ascii_uppercase[1:]:
    # every file shares column A but has a different second column
    df = pd.DataFrame({
        "A": np.random.normal(size=N),
        col: (np.random.normal(size=N) ** 2) * 50,
    })
    fname = f"df_{col}.parquet"
    all_files.append(fname)
    df.to_parquet(fname)

full_df = dd.concat([dd.read_parquet(path) for path in all_files]).compute()
And I get this on my task stream dashboard:
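As a side note (not part of the original answer), the same dd.concat approach can be sanity-checked directly against the question's two small files; columns that are missing from a given file simply come back as NaN:

import dask.dataframe as dd

# assumes df1.parquet (columns A, B) and df2.parquet (columns A, C) from the question exist
full_df = dd.concat([dd.read_parquet(p) for p in ["df1.parquet", "df2.parquet"]])
print(full_df.compute())  # B is NaN for rows from df2.parquet, C is NaN for rows from df1.parquet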
Another option, not mentioned in the comments by @Michael Delgado, is to load each parquet into a separate dask dataframe and then stitch them together. Here's the rough pseudocode:
def normal_ddf(path):
    df = dd.read_parquet(path)
    return normalize(df)  # normalize should work with both pandas and dask frames

full_df = dd.concat([normal_ddf(path) for path in all_files])

Prevent pandas from changing int to float/date?

I'm trying to merge a series of xlsx files into one, which works fine.
However, when I read a file, columns containing ints are transformed into floats (or dates?) when I merge and output them to csv. I have tried to visualize this in the picture. I have seen some solutions to this where dtype is used to "force" specific columns into int format. However, I do not always know the index or the title of the column, so I need a more scalable solution.
Anyone with some thoughts on this?
Thank you in advance
# specify folder with xlsx-files
xlsFolder = "{}/system".format(directory)

dfMaster = pd.DataFrame()

# make a list of all xlsx-files in folder
xlsFolderContent = os.listdir(xlsFolder)
xlsFolderList = []
for file in xlsFolderContent:
    if file[-5:] == ".xlsx":
        xlsFolderList.append(file)

for xlsx in xlsFolderList:
    print(xlsx)
    xl = pd.ExcelFile("{}/{}".format(xlsFolder, xlsx))
    for sheet in xl.sheet_names:
        if "_Errors" in sheet:
            print(sheet)
            dfSheet = xl.parse(sheet)
            dfSheet.fillna(0, inplace=True)
            dfMaster = dfMaster.append(dfSheet)
            print("len of dfMaster:", len(dfMaster))

dfMaster.to_csv("{}/dfMaster.csv".format(xlsFolder), sep=";")
Data sample:
Try passing dtype='object' as a parameter of pd.read_excel (or ExcelFile.parse) to prevent pandas from inferring the data type of each column. (The int-to-float conversion typically happens because columns missing from some sheets, or blank cells, introduce NaN when reading and merging, and an integer column containing NaN is upcast to float.) You can also simplify your code using pathlib:
import pathlib

import pandas as pd

directory = pathlib.Path('your_path_directory')
xlsFolder = directory / 'system'

data = []
for xlsFile in xlsFolder.glob('*.xlsx'):
    # dtype='object' keeps the values as read, so ints are not upcast to float
    sheets = pd.read_excel(xlsFile, sheet_name=None, dtype='object')
    for sheetname, df in sheets.items():
        if '_Errors' in sheetname:
            data.append(df.fillna('0'))

pd.concat(data).to_csv(xlsFolder / 'dfMaster.csv', sep=';')
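Not part of the original answer, just a small illustration of the underlying behaviour: when frames whose columns don't fully overlap are concatenated (or a sheet has blank cells), NaN values appear, and an integer column containing NaN is upcast to float64, which is why the ints come out as floats.

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})  # x is int64
b = pd.DataFrame({'y': [3]})     # has no 'x' column
print(pd.concat([a, b]).dtypes)  # x and y both become float64 because of the introduced NaNs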

UCI dataset: How to extract features and change the data into usable format after reading the data on python

I am looking to apply some ml algorithms on the data set from https://archive.ics.uci.edu/ml/datasets/University.
I noticed that the data is unstructured. I want the data to have the features as the columns and the observations as the rows, so I need help with parsing this dataset.
Any help will be appreciated. Thanks.
Below is what I have tried:
column_names = ["University-name",
                "State",
                "location",
                "Control",
                "number-of-students",
                "male:female (ratio)",
                "student:faculty (ratio)",
                "sat-verbal",
                "sat-math",
                "expenses",
                "percent-financial-aid",
                "number-of-applicants",
                "percent-admittance",
                "percent-enrolled",
                "academics",
                "social",
                "quality-of-life",
                "academic-emphasis"]

data_list = []
data = ['https://archive.ics.uci.edu/ml/machine-learning-databases/university/university.data',
        'https://archive.ics.uci.edu/ml/machine-learning-databases/university/university.data',
        ...]
for file in data:
    df = pd.read_csv(file, names=column_names)
    data_list.append(df)
The data is not structured in a way that pandas can parse directly. Do something like this:
import re

import pandas as pd
import requests

data = "https://archive.ics.uci.edu/ml/machine-learning-databases/university/university.data"
temp = requests.get(data).text

fdic = {'def-instance': [], 'state': []}
for col in fdic.keys():
    # pull whatever follows "(<tag> " up to a newline, closing paren or backslash
    fdic[col].extend(re.findall(rf'\({col} ([^\\\n)]*)', temp))

pd.DataFrame(fdic)
The output:
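As an extra, optional step (not from the original answer): before adding more keys to fdic for the remaining fields in column_names, you can list which attribute tags actually appear in the file instead of guessing their spelling. This reuses re and temp from the snippet above:

# print every token that immediately follows an opening parenthesis in the raw text
tags = sorted(set(re.findall(r'\((\S+)', temp)))
print(tags)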

Converting a pandas dataframe to JSON file

I am trying to convert a pandas DataFrame to a JSON file. The following image shows my data:
Screenshot of the dataset from MS Excel
I am using the following code:
import os

import pandas as pd

os.chdir("G:\\My Drive\\LEC dashboard\\EnergyPlus simulation files\\DEC\\Ahmedabad\\Adaptive set point\\CSV")
df = pd.read_csv('Adap_40-_0_0.1_1.5_0.6.csv')
df2 = df.filter(like='[C](Hourly)', axis=1)
df3 = df.filter(like='[C](Hourly:ON)', axis=1)
df4 = df.filter(like='[%](Hourly)', axis=1)
df5 = df.filter(like='[%](Hourly:ON)', axis=1)
df6 = pd.concat([df2, df3, df4, df5], axis=1)
df6.to_json("123.json", orient='columns')
In the output, I am getting a dictionary as each value. However, I need a list as the value.
The output I am getting: the JSON output produced by the code above (screenshot).
The output that is desired: (screenshot).
I have tried the different orient options of to_json but nothing works.
There might be other ways of doing this, but one way is this:
import json

import pandas as pd

test = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6]})
with open('test.json', 'w') as f:
    # to_dict(orient='list') maps each column name to a plain Python list of its values
    json.dump(test.to_dict(orient='list'), f)
The result file will look like this: '{"a": [1, 2, 3, 4, 5, 6]}'
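Applied to the question's df6 (a hedged sketch assuming df6 has been built as in the question), the same idea would be:

import json

with open('123.json', 'w') as f:
    json.dump(df6.to_dict(orient='list'), f)  # each column name maps to a plain JSON list of values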
There is a built-in function of pandas called to_json:
df.to_json(r'Path_to_file\file_name.json')
Take a look at the documentation if you need more specifics: https://pandas.pydata.org/pandas-docs/version/0.24/reference/api/pandas.DataFrame.to_json.html
