Loading Data Set with Breaks - python

I'm trying to load a dataset that contains breaks, and I'm looking for an intelligent way to handle them. I've gotten started with the code included below.
As you can see, the data within the file posted on the public FTP site starts at line 11, ends at line 23818, then starts again at line 23823 and ends at line 45630.
import pandas as pd
import numpy as np
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
url = urlopen("http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/10_Portfolios_Prior_12_2_Daily_CSV.zip")
#Download Zipfile and create pandas DataFrame
zipfile = ZipFile(BytesIO(url.read()))
df = pd.read_csv(zipfile.open('10_Portfolios_Prior_12_2_Daily.CSV'), header=0,
                 names=['asof_dt','1','2','3','4','5','6','7','8','9','10'], skiprows=10).dropna()
df['asof_dt'] = pd.to_datetime(df['asof_dt'], format = "%Y%m%d")
I would ideally like the first set to have version number "1", the second to have "2", and so on.
Any help would be greatly appreciated. Thank you.
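One possible approach (a sketch, not tested against the actual French data file): read everything without dropping rows, flag the rows whose first column is an 8-digit date, and start a new version number whenever a data row follows a non-data row (the blank lines and second header block act as separators). The column name and date format follow the question's snippet:

```python
import pandas as pd

def tag_blocks(df):
    # Data rows have an 8-digit yyyymmdd value in asof_dt; anything else
    # (blank lines, a second header block) separates the data blocks.
    is_data = df["asof_dt"].astype(str).str.fullmatch(r"\d{8}")
    # A new block starts at a data row not immediately preceded by a data row.
    block_start = is_data & ~is_data.shift(fill_value=False)
    df = df.assign(version=block_start.cumsum())
    out = df[is_data].copy()
    out["asof_dt"] = pd.to_datetime(out["asof_dt"], format="%Y%m%d")
    return out
```

You would apply this to the frame read with `skiprows=10` but without `.dropna()`, so the separator rows are still present to mark the block boundaries.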

Related

I am trying to make a function that iterates through names, assigns a serial number in a certain pattern, then saves it in CSV and JSON files

I am trying to build a function that iterates over the names in a CSV I provide, extracts the last serial number written to a JSON file, increments it for each name, and puts the serial number beside every name in the CSV. What I get is that the function generates the first serial number successfully and saves it in the JSON file, but it fails to add it to the CSV via pandas and fails to update the number in the JSON file.
this is the code of the function:
from docx import Document
import pandas as pd
from datetime import datetime
import time
import os
from docx2pdf import convert
import json
date=datetime.date(datetime.now())
strdate=date.strftime("%d-%m-%Y")
year=date.strftime("%Y")
month=date.strftime("%m")
def genrateserial(a):
    jsonFile1 = open("data_file.json", "r")
    lastserial = jsonFile1.read()
    jsonFile1.close()
    for d in range(len(lastserial)):
        if lastserial[d] == "\"":
            lastserial[d].replace("\"", "")
    jsonFile1.close()
    if strdate == "01" or (month[1] != lastserial[8]):
        num = 1
        last = f"JO/{year}{month}{num}"
        data = f"{last}"
        jsonstring = json.dumps(data)
        jsonfile2 = open("data_file.json", "w")
        jsonfile2.write(jsonstring)
        jsonfile2.close()
    database = pd.read_csv(a)
    df = pd.DataFrame(database)
    df = df.dropna(axis=0)
    for z in range(len(df.Name)):
        newentry = f"JO/{year}{month}{num+1}"
        jsonstring1 = json.dumps(newentry)
        jsonfile3 = open("data_file.json", "w")
        jsonfile3.write(jsonstring1)
        jsonfile3.close()
        df.iloc[[z], 3] = newentry
genrateserial('database.csv')
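One observation (mine, not from the post): the modified DataFrame is never written back to disk with `to_csv`, and the JSON counter is rewritten with the same `num+1` value on every loop iteration, so neither file ends up updated. A minimal sketch of the increment-and-persist pattern, using hypothetical file names and a hypothetical `prefix` in place of the date logic:

```python
import json
import pandas as pd

def assign_serials(csv_path, counter_path, prefix="JO/202205"):
    # Load the last serial number; start from 0 if no counter file exists yet.
    try:
        with open(counter_path) as f:
            last = json.load(f)
    except FileNotFoundError:
        last = 0

    df = pd.read_csv(csv_path).dropna(axis=0)
    # Give every row the next serial in sequence.
    df["Serial"] = [f"{prefix}{last + i + 1}" for i in range(len(df))]

    # Persist both the updated CSV and the new counter value.
    df.to_csv(csv_path, index=False)
    with open(counter_path, "w") as f:
        json.dump(last + len(df), f)
    return df
```

Storing the counter as a plain JSON integer avoids the string-slicing (`lastserial[8]`) and quote-stripping steps entirely.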

How to iterate over a directory of XML files and extract data in python

I need to read an XML file and fetch the data into a dataframe. I have developed this to extract data from one XML file.
import pandas as pd
import numpy as np
import xml.etree.cElementTree as et
import datetime
tree = et.parse('/data/dump_xml/1013.xml')
root = tree.getroot()
NAME = []
for name in root.iter('name'):
    NAME.append(name.text)
print(NAME[0])
print(NAME[1])
UPDATE = []
for update in root.iter('lastupdate'):
    UPDATE.append(update.text)
updated = datetime.datetime.fromtimestamp(int(UPDATE[0]))
lastupdate = updated.strftime('%Y-%m-%d %H:%M:%S')
ParaValue = []
for parameterevalue in root.iter('value'):
    ParaValue.append(parameterevalue.text)
print(ParaValue[0])
print(ParaValue[1])
print(lastupdate, NAME[0], ParaValue[0])
print(lastupdate,NAME[1],ParaValue[1])
For each file I need to get the two results below as output:
2022-05-23 11:25:01 in 1.5012356187e+05
2022-05-23 11:25:01 out 1.7723777592e+05
Now I need to do this for all the XML files in /data/dump_xml/ and build a df with all the data in one execution. Can someone help me improve my code?
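A possible refactor (a sketch; the XML structure is assumed from the question's snippet, with `lastupdate`, `name`, and `value` elements): wrap the per-file logic in a function, then iterate over the directory with `glob` and collect the rows into one DataFrame.

```python
import datetime
import glob
import xml.etree.ElementTree as ET

import pandas as pd

def parse_dump(path):
    # Pull the update timestamp plus every (name, value) pair from one file.
    root = ET.parse(path).getroot()
    ts = int(root.find(".//lastupdate").text)
    updated = datetime.datetime.fromtimestamp(ts).strftime("%Y-%m-%d %H:%M:%S")
    names = [n.text for n in root.iter("name")]
    values = [v.text for v in root.iter("value")]
    return [(updated, n, v) for n, v in zip(names, values)]

def load_dumps(pattern="/data/dump_xml/*.xml"):
    # Collect rows from every matching file into a single DataFrame.
    rows = []
    for path in sorted(glob.glob(pattern)):
        rows.extend(parse_dump(path))
    return pd.DataFrame(rows, columns=["lastupdate", "name", "value"])
```

Calling `load_dumps()` then gives one DataFrame with a row per name/value pair across all files, e.g. `load_dumps().head()`.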

Download a dataset that is a zip file containing lots of CSV files in a notebook for data analysis

I am doing a data science project.
I am using a Google notebook for the job.
My dataset resides here, and I want to access it directly in the Python notebook.
I am using the following line of code to read it:
df = pd.read_csv('link')
But the command line throws an error like the one below.
What should I do?
It's difficult to answer exactly as data is lacking, but here you go for this kind of request:
you have to import ZipFile and urlopen in order to get the data from the URL, extract the data from the zip, and then use the CSV file for pandas processing.
from zipfile import ZipFile
from urllib.request import urlopen
import pandas as pd
import os
URL = 'https://he-s3.s3.amazonaws.com/media/hackathon/hdfc-bank-ml-hiring-challenge/application-scorecard-for-customers/05d2b4ea-c-Dataset.zip'
# open and save the zip file onto computer
url = urlopen(URL)
output = open('05d2b4ea-c-Dataset.zip', 'wb') # note the flag: "wb"
output.write(url.read())
output.close()
# read the zip file as a pandas dataframe
df = pd.read_csv('05d2b4ea-c-Dataset.zip')
# if keeping the zip file on disk is not wanted, then:
os.remove('05d2b4ea-c-Dataset.zip') # remove the copy of the zipfile on disk
Use the urllib module to download the zip file into memory; it returns a file-like object that you can read(), then pass to ZipFile (a standard package).
Since here there are multiple files like
['test_data/AggregateData_Test.csv', 'test_data/TransactionData_Test.csv', 'train_data/AggregateData_Train.csv', 'train_data/Column_Descriptions.xlsx', 'train_data/sample_submission.csv', 'train_data/TransactionData_Train.csv']
Load them into a dict of dataframes keyed by filename. Altogether the code will be:
from urllib.request import urlopen
from zipfile import ZipFile
from io import BytesIO
import pandas as pd

zip_in_memory = urlopen("https://he-s3.s3.amazonaws.com/media/hackathon/hdfc-bank-ml-hiring-challenge/application-scorecard-for-customers/05d2b4ea-c-Dataset.zip").read()
z = ZipFile(BytesIO(zip_in_memory))
dict_of_dfs = {file.filename: pd.read_csv(z.open(file.filename))
               for file in z.infolist()
               if file.filename.endswith('.csv')}
Now you can access dataframes of each csv like dict_of_dfs['test_data/AggregateData_Test.csv'].
Of course, all of this is unnecessary if you just download the zip from the link yourself and pass it in as a local zip file.

Unable to read a text file into jupyter notebook on mac

I downloaded this dataset and stored it in a folder called AutomobileDataset.
I cross checked the working directory using:
import pandas as pd
import numpy as np
import os
os.chdir("/Users/madan/Desktop/ML/Datasets/AutomobileDataset")
os.getcwd()
Output:
'/Users/madan/Desktop/ML/Datasets/AutomobileDataset'
Then I tried reading the file using pandas:
import pandas as pd
import numpy as np
import os
os.chdir("/Users/madan/Desktop/ML/Datasets/AutomobileDataset")
os.getcwd()
automobile_data = pd.read_csv("AutomobileDataset.txt", sep = ',',
header = None, na_values = '?')
automobile_data.head()
Output:
---------------------------------------------------------------------------
ParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 26
Someone please help me with this; I don't know where I am making a mistake.
You can try this:
import os
# Read in a plain text file
with open(os.path.join("c:user/xxx/xx", "xxx.txt"), "r") as f:
    text = f.read()
print(text)
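The ParserError itself means pandas saw 26 fields on line 2 after inferring only 1 field from line 1, which usually indicates the file starts with a stray non-data line (or the download saved an HTML page instead of the raw data). A quick way to check is to inspect the first few raw lines before parsing; `preview_lines` is a hypothetical helper, not part of pandas:

```python
def preview_lines(path, n=5):
    # Return the first n raw lines so delimiters and field counts
    # can be compared by eye before calling pd.read_csv.
    with open(path) as f:
        return [next(f, "").rstrip("\n") for _ in range(n)]
```

If the first line turns out to be a title or HTML tag rather than a data row, `pd.read_csv(..., skiprows=1)` (adjusting the count to what the preview shows) should get past the error.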

How to read large data from temp file in Spotfire using IronPython

I need to read large data from temp file in Spotfire using IronPython.
First, I exported my TIBCO data table to a temp file using the ExportText() method:
#Temp file for storing the TablePlot data
tempFolder = Path.GetTempPath()
tempFilename = Path.GetTempFileName()
#Export TablePlot data to the temp file
tp = tablePlotViz.As[TablePlot]()
writer = StreamWriter(tempFilename)
tp.ExportText(writer)
After that, I opened the temp file using the open() method:
f = open(tempFilename)
Now, when I start reading the data from the opened file and writing it back into a string variable, it takes too much time, and my Spotfire screen stops responding.
Does anyone have an idea about this?
My data table is about 8 MB.
Code is:
from Spotfire.Dxp.Application.Visuals import TablePlot, HtmlTextArea
import clr
import sys
clr.AddReference('System.Data')
import System
from System.Data import DataSet, DataTable, XmlReadMode
from Spotfire.Dxp.Data import DataType, DataTableSaveSettings
from System.IO import StringReader, StreamReader, StreamWriter, MemoryStream, SeekOrigin, FileStream, FileMode,Path, File
from Spotfire.Dxp.Data.Export import DataWriterTypeIdentifiers
from System.Threading import Thread
from Spotfire.Dxp.Data import IndexSet
from Spotfire.Dxp.Data import RowSelection
from Spotfire.Dxp.Data import DataValueCursor
from Spotfire.Dxp.Data import DataSelection
from Spotfire.Dxp.Data import DataPropertyClass
from Spotfire.Dxp.Data import Import
from Spotfire.Dxp.Data.Import import TextFileDataSource, TextDataReaderSettings
from System import Array
from Spotfire.Dxp.Application.Visuals import VisualContent
from Spotfire.Dxp.Application.Visuals import TablePlot
from System.IO import Path, StreamWriter
from System.Text import StringBuilder
#Temp file for storing the TablePlot data
tempFolder = Path.GetTempPath()
tempFilename = Path.GetTempFileName()
#Export TablePlot data to the temp file
tp = tablePlotViz.As[TablePlot]()
writer = StreamWriter(tempFilename)
tp.ExportText(writer)
#Build the table
sb = StringBuilder()
#Open the temp file for reading
f = open(tempFilename)
#build the html table
html = " <TABLE id='table' style='display:none;'>\n"
html += "<THEAD>"
html += " <TR><TH>"
html += " </TH><TH>".join(f.readline().split("\t")).strip()
html += " </TH></TR>"
html += "</THEAD>\n"
html += "<TBODY>\n"
for line in f:
    html += "<TR><TD>"
    html += "</TD><TD>".join(line.split("\t")).strip()
    html += "</TD></TR>\n"
#Assign all the HTML data to the text area
print html
The code works fine with small data sets.
If I understand correctly, the intention behind the code is to read the Table Plot visualization data into a string for further use in an HTML text area.
There is an alternative way of doing this without writing the data to a temporary file: use a memory stream to export the data and convert the exported text to a string for reuse. The sample code can be referred from here.
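The memory-stream approach mentioned above can be sketched as follows (Spotfire IronPython; the API names come from the question's own imports, and `tablePlotViz` is assumed to be a script parameter as in the question — this runs only inside a Spotfire script context):

```python
from Spotfire.Dxp.Application.Visuals import TablePlot
from System.IO import MemoryStream, SeekOrigin, StreamReader, StreamWriter

# Export the table plot text into an in-memory stream instead of a temp file.
stream = MemoryStream()
writer = StreamWriter(stream)
tp = tablePlotViz.As[TablePlot]()
tp.ExportText(writer)
writer.Flush()

# Rewind the stream and read the exported text back as a single string.
stream.Seek(0, SeekOrigin.Begin)
text = StreamReader(stream).ReadToEnd()
```

Separately, repeated `html += ...` concatenation in a large loop is itself slow in IronPython; appending the pieces with the `StringBuilder` the script already creates (`sb.Append(...)`, then `sb.ToString()` at the end) should reduce the freeze regardless of where the data is read from.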
