How to get all pages data from site and save? - python

I have a page with currency exchange rates. I want to download all the data from 2000.01.01 to 2018.12.01. On the page I can download or get the data for one day, but I want it for the whole period (or at least for one year) and save it to a CSV file. How can I do this?
I have tried to get a single date and save it to CSV, and I also tried to parse the page with urllib, but I still can't get all the data I need.
import pandas as pd
data = pd.read_html('http://www.nbt.tj/ru/kurs/kurs.php?date=01.02.2016')
data = data[2]
data.to_csv('currencies.csv', index=False)

Create a date range in the custom format, loop over it, read each DataFrame and write it to the same CSV file in append mode; write the header only for the first DataFrame and drop it for the rest:
dates = pd.date_range('2010-01-01', '2018-12-01').strftime('%d.%m.%Y')
for i, x in enumerate(dates):
    data = pd.read_html('http://www.nbt.tj/ru/kurs/kurs.php?date={}'.format(x))[2]
    if i == 0:
        data.to_csv('currencies.csv', index=False)
    else:
        data.to_csv('currencies.csv', index=False, mode='a', header=None)
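As an alternative sketch (not part of the original answer), you can also collect the daily tables in a list and write the file once at the end. The table index 2 and the added date column are assumptions about the page layout:
import time
import pandas as pd

frames = []
dates = pd.date_range('2010-01-01', '2018-12-01').strftime('%d.%m.%Y')
for d in dates:
    # index 2 is assumed to be the exchange-rate table, as in the answer above
    table = pd.read_html('http://www.nbt.tj/ru/kurs/kurs.php?date={}'.format(d))[2]
    table['date'] = d        # remember which day each row belongs to
    frames.append(table)
    time.sleep(0.5)          # small pause to be polite to the server
pd.concat(frames, ignore_index=True).to_csv('currencies.csv', index=False)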

Related

Cannot get a file to be read into a list of stock tickers and then get yfinance data for each

I am trying to read a CSV file into a DataFrame and then iterate over each ticker to get some Yahoo Finance data, but I struggle with matching the right data type read from the CSV file. The problem is that yfinance needs a str for the ticker parameter, e.g. data = yf.download("AAPL", start="2019-01-01", end="2022-04-20"),
and I cannot convert the DataFrame row item into a str.
This is my code:
combined = yf.download("SPY", start="2019-01-01", end="2019-01-02")
for index, row in stockslist.iterrows():
    data = yf.download([index, row["ticker"].to_string], start="2019-01-01", end="2022-04-20")
and this is the csv file
The question is basically about this part of the code: [index, row["ticker"].to_string]. I cannot get it to pass each row of the DataFrame as a ticker argument to yfinance.
The error I get is "TypeError: expected string or bytes-like object "
The download function doesn't understand the [index, row["ticker"].to_string] parameter - where is it supposed to come from?
You have to give it proper input, for example by building a list with the ticker values from the CSV and then passing each element of that list to the download function.
A quick example with fictional code:
#initiate the list with the tickers
array_ticker = ['APPL', 'MSFT', '...']
#read the list and download data for each ticker
for i in range(len(array_ticker)):
    data = yf.download(array_ticker[i], start="2019-01-01", end="2022-04-20")
UPDATE:
If you want to keep using the DataFrame as you do now, here is a simple piece of code to help you sort out your issue:
import pandas as pd
d = {'ticker': ['APPL', 'MSFT', 'GAV']}
ticker_list = pd.DataFrame(data=d) #creating the dataframe
print(ticker_list) #print the whole dataframe
print('--------')
print(ticker_list.iloc[1]['ticker']) #print the 2nd value of the column ticker
Same topic : How to iterate over rows in a DataFrame in Pandas
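Putting the two pieces together, a minimal sketch of the loop the question is after could look like the following; the file name stocks.csv and its ticker column are assumptions based on the question:
import pandas as pd
import yfinance as yf

stockslist = pd.read_csv("stocks.csv")   # assumed to contain a 'ticker' column

frames = {}
for _, row in stockslist.iterrows():
    ticker = str(row["ticker"]).strip()  # plain string, which yf.download expects
    frames[ticker] = yf.download(ticker, start="2019-01-01", end="2022-04-20")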

How can I filter a csv file based on its columns in python?

I have a CSV file with over 5,000,000 rows of data that looks like this (except that it is in Farsi):
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766140,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,1050000,7266.44,5,Concrete,13890108,5166884645
766146,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,700000,4844.29,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
770822,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,50,500000,1730.1,5,Concrete,13890114,5166884645
I would like to write code that treats the first row as the header and then extracts the rows for two specific cities (Kish and Qeshm) and saves them into a new CSV file. Something like this one:
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
It's worth mentioning that I'm very new to Python.
I've written the following block to define the headers, but this is the furthest I've gotten so far.
import pandas as pd
path = '/Users/Desktop/sample.csv'
df = pd.read_csv(path, header=[0])
df.head()
You don't need to use header=... because the default is to treat the first row as the header, so
df = pd.read_csv(path)
Then, to keep only the rows that match a condition:
df2 = df[df['City'].isin(['Kish', 'Qeshm'])]
And you can save it with
df2.to_csv(another_path)
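Since the file has over 5,000,000 rows, a hedged end-to-end sketch can read it in chunks so the whole file never sits in memory at once; the column name 'City' comes from the sample above, and the file names are assumptions:
import pandas as pd

wanted = {'Kish', 'Qeshm'}

# read the big file in pieces and keep only the matching rows from each piece
chunks = pd.read_csv('sample.csv', chunksize=500_000)
filtered = pd.concat(chunk[chunk['City'].isin(wanted)] for chunk in chunks)
filtered.to_csv('kish_qeshm.csv', index=False)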

DataFrame Split On Rows and apply on header one column using Python Pandas

I'm working on a project and ran into a messy situation where I have to split a DataFrame based on its first column. The DataFrame comes from SQL queries and goes through a lot of manipulation, which is why I'm not posting that code here.
Target: the DataFrame I have looks like the screenshot below, and it is available as an xlsx file.
Output: I'm looking for output like the attached file here:
The thing is that I'm not able to come up with the logic to do this on the DataFrame itself, as I'm a newbie in Python.
I think you can do this:
df = df.set_index('Placement# Name')
df['Date'] = df['Date'].dt.strftime('%m-%d-%Y')
df_sub = df[['Delivered Impressions','Clicks','Conversion','Spend']].sum(level=0)\
         .assign(Date='Subtotal')
df_sub['CTR'] = df_sub['Clicks'] / df_sub['Delivered Impressions']
df_sub['eCPA'] = df_sub['Spend'] / df_sub['Conversion']
df_out = pd.concat([df, df_sub]).set_index('Date', append=True).sort_index(level=0)
startline = 0
writer = pd.ExcelWriter('testxls.xlsx', engine='openpyxl')
for n, g in df_out.groupby(level=0):
    g.to_excel(writer, startrow=startline, index=True)
    startline += len(g) + 2
writer.save()
Load the Excel file into a Pandas dataframe, then extract rows based on condition.
dframe = pandas.read_excel("sample.xlsx")
dframe = dframe.loc[dframe["Placement# Name"] == "Needed value"]
Where "needed value" would be the value of one of those rows.

Python: outputting lists to excel

For my master's thesis, I need to calculate expected returns for a number of stocks on a given event date. I have written the following code, which does what I intend (matching Fama & French factors with a sample of event dates). However, when I try to export the result to Excel I can't seem to get the correct output, i.e. it doesn't contain column headings such as the dates, the names of the Fama & French factors and the corresponding rows.
Does anybody have a workaround for this? Any improvements are gladly appreciated. Here is my code:
import pandas as pd

# Data import
ff_five = pd.read_excel('C:/Users/MBV/Desktop/cmon.xlsx',
                        infer_datetime_format=True)
df = pd.read_csv('C:/Users/MBV/Desktop/4.csv', parse_dates=True,
                 infer_datetime_format=True)

# Converting dates to datetime
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)

# Creating an empty placeholder
end_date = []

# Iterating over the event dates, creating a start and end date 60 months apart
for index, row in df.iterrows():
    end_da = row['Date'] - pd.DateOffset(months=60)
    end_date.append(end_da)

end_date_df = pd.DataFrame(data=end_date)
m = pd.merge(end_date_df, df, left_index=True, right_index=True)
m.columns = ['Start', 'End']

ff_factors = []
for index, row in m.iterrows():
    ff_five['Date'] = pd.to_datetime(ff_five['Date'])
    time_range = (ff_five['Date'] > row['Start']) & (ff_five['Date'] <= row['End'])
    df = ff_five.loc[time_range]
    ff_factors.append(df)
EDIT:
Here is my attempt at getting the data from Python to Excel.
ff_factors_df = pd.DataFrame(data=ff_factors)
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('estimation_data.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
ff_factors_df.to_csv(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
Outputting a DataFrame to CSV or Excel should be doable with
ff_five.to_excel('Filename.xls')
Change to_excel to to_csv if you want a CSV instead.
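Since ff_factors in the question is a list of DataFrames, a hedged sketch would concatenate it first and then export the combined frame, which keeps the column headings:
import pandas as pd

# ff_factors is the list of per-event DataFrames built in the question's loop
estimation_data = pd.concat(ff_factors, keys=range(len(ff_factors)))
estimation_data.to_excel('estimation_data.xlsx', sheet_name='Sheet1')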
OK, I tried to interpret what you were trying to do, although it wasn't very clear. If I read it correctly, you are trying to create some additional columns based on other data. Instead of creating separate lists, you could just add them as new columns and then output only the columns you want. Something like this, maybe (I had to make some assumptions and create some fake data to see if this is on the right track):
import pandas as pd
ff_five = pd.DataFrame()
ff_five['Date'] = ["2012-11-01", "2012-11-30"]
df = pd.DataFrame()
df['Date'] = ["2012-12-01", "2012-12-30"]
df['Date'] = pd.to_datetime(df['Date'])
df['End'] = df['Date'] - pd.DateOffset(months=60)
df.columns = ['Start', 'End']
ff_five['Date'] = pd.to_datetime(ff_five['Date'])
df['ff_factor'] = (ff_five['Date'] > df['Start']) & (ff_five['Date'] <= df['End'])
df.to_excel('estimation_data.xlsx', sheet_name='Sheet1')

Write to_csv with Python Pandas: Choose which column index to insert new data

I have a set of data output in my program that I want to write to a .csv file. I am able to make a new file with the old input data, followed by the new data in the last column to the right. How can I control which column my output data goes to? Also, how can I choose not to include the old input data in my new file? I'm new to pandas.
Thanks!
Loading from file:
import pandas as pd
df = pd.read_csv('D:\\Apps\\Coursera\\Kaggle-Titanic\\Data\\train.csv', header = 0)
Some manipulation:
df['Gender'] = df.Sex.map(lambda x: 0 if x=='female' else 1)
df['FamilySize'] = df.SibSp + df.Parch
Copy some fields to a new DataFrame:
result = df[['Sex', 'Survived', 'Age']]
Delete fields that are not needed:
del result['Sex']
Save to the file:
result.to_csv('D:\\Apps\\Coursera\\Kaggle-Titanic\\Swm\\result.csv', index=False)
Or if you want to save only some fields or in some specific order:
df[['Sex', 'Survived', 'Age']].to_csv('D:\\Apps\\Coursera\\Kaggle-Titanic\\Swm\\result.csv', index=False)
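To control exactly which column position the new data ends up in, one option is DataFrame.insert; a small sketch reusing the 'Gender' example above (position 1 is an arbitrary choice):
import pandas as pd

df = pd.read_csv('D:\\Apps\\Coursera\\Kaggle-Titanic\\Data\\train.csv', header=0)

# compute the new column and place it at a chosen position instead of the end
gender = df.Sex.map(lambda x: 0 if x == 'female' else 1)
result = df.drop(columns=['Sex'])   # leave the old input column out of the output
result.insert(1, 'Gender', gender)  # insert the new data at column position 1
result.to_csv('result.csv', index=False)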
