I'm relatively new to Python.
I have an Excel file from which I read Column A ("url") and Column B ("name").
In the future the columns will have no header row, so I need to read directly from Column A and Column B, starting at the first cell.
I tried using index_col(0) but can't really seem to get the hang of it.
This is a simple image download script.
import requests
import pandas as pd

df = pd.read_excel(r'C:\Users\exdata1.xlsx')
for index, row in df.iterrows():
    url = row['url']
    file_name = url.split('/')
    r = requests.get(url)
    file_name = (row['name'] + ".jpeg")
    if r.status_code == 200:
        with open(file_name, "wb") as f:
            f.write(r.content)
    print(file_name)
I tried this below without any good result:

url = row['index_col(0)']  # 0 for Excel column "A"
file_name = (row['index_col(1)'] + ".jpeg")  # 1 for Excel column "B"

Appreciate any support!
You can set header=None as an argument of pandas.read_excel and give names to your columns.
Try this:
import requests
import pandas as pd
df = pd.read_excel(r'C:\Users\exdata1.xlsx', header=None, names=['url', 'name'])
for index, row in df.iterrows():
    url = row['url']
    file_name = url.split('/')
    r = requests.get(url)
    file_name = (row['name'] + '.jpeg')
    if r.status_code == 200:
        with open(file_name, 'wb') as f:
            f.write(r.content)
        print(file_name)
If your file has no column names, pandas assigns placeholder values such as Unnamed: 0 to each column; you can check that by printing df.info() or df.head().
You can assign column names when reading the file so your df always has column names, or rename the placeholders afterwards:

df.rename(columns={'Unnamed: 0': 'url', 'Unnamed: 1': 'name'}, inplace=True)

Then you are good to go.
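If you'd rather skip naming the columns entirely, here is a minimal sketch (reusing the file path from the question) that reads with header=None and addresses the columns by position, so the loop starts at the very first cell:

import pandas as pd

df = pd.read_excel(r'C:\Users\exdata1.xlsx', header=None)
for _, row in df.iterrows():
    url = row[0]                  # Excel column A
    file_name = row[1] + '.jpeg'  # Excel column B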
Related
Currently my code looks at CSV files in a folder and replaces strings if the file has the column 'PROD_NAME' in its data. If it doesn't have the column 'PROD_NAME', I'm trying to delete those files from the folder. I can get my code to print which CSV files do not have the column with a little debugging, but I can't figure out how to actually delete or remove them from the folder they are in. I have tried an if statement that calls os.remove() and still nothing happens. No errors or anything; it just finishes the script with all the files still in the folder. Here is my code. Any help is appreciated. Thanks!
def worker():
    filenames = glob.glob(dest_dir + '\\*.csv')
    print("Finding all files with column PROD_NAME")
    time.sleep(3)
    print("Changing names of products in these tables...")
    for filename in filenames:
        my_file = Path(os.path.join(dest_dir, filename))
        try:
            with open(filename):
                # read data
                df1 = pd.read_csv(filename, skiprows=1, encoding='ISO-8859-1')  # read column header only - to get the list of columns
                dtypes = {}
                for col in df1.columns:  # make all columns text, to avoid formatting errors
                    dtypes[col] = 'str'
                df1 = pd.read_csv(filename, dtype=dtypes, skiprows=1, encoding='ISO-8859-1')
                if 'PROD_NAME' not in df1.columns:
                    os.remove(filename)
                # Replaces text in files
                if 'PROD_NAME' in df1.columns:
                    df1 = df1.replace("NABVCI", "CLEAR_BV")
                    df1 = df1.replace("NAMVCI", "CLEAR_MV")
                    df1 = df1.replace("NA_NRF", "FA_GUAR")
                    df1 = df1.replace("N_FPFA", "FA_FLEX")
                    df1 = df1.replace("NAMRFT", "FA_SECURE_MVA")
                    df1 = df1.replace("NA_RFT", "FA_SECURE")
                    df1 = df1.replace("NSPFA7", "FA_PREFERRED")
                    df1 = df1.replace("N_ENHA", "FA_ENHANCE")
                    df1 = df1.replace("N_FPRA", "FA_FLEX_RETIRE")
                    df1 = df1.replace("N_SELF", "FA_SELECT")
                    df1 = df1.replace("N_SFAA", "FA_ADVANTAGE")
                    df1 = df1.replace("N_SPD1", "FA_SPD1")
                    df1 = df1.replace("N_SPD2", "FA_SPD2")
                    df1 = df1.replace("N_SPFA", "FA_LIFESTAGES")
                    df1 = df1.replace("N_SPPF", "FA_PLUS")
                    df1 = df1.replace("N__CFA", "FA_CHOICE")
                    df1 = df1.replace("N__OFA", "FA_OPTIMAL")
                    df1 = df1.replace("N_SCNI", "FA_SCNI")
                    df1 = df1.replace("NASCI_", "FA_SCI")
                    df1 = df1.replace("NASSCA", "FA_SSC")
                    df1.to_csv(filename, index=False, quotechar="'")
        except:
            if 'PROD_NAME' in df1.columns:
                print("Could not find string to replace in this file: " + filename)

worker()
Written below is a block of code that reads the raw csv data. It extracts the first row of data (containing the column names) and looks for the column name PROD_NAME. If it finds it, it sets found to True. Else, it sets found to False. To prevent trying to delete the files whilst open, the removal is done outside of the open().
import os

filename = "test.csv"
with open(filename) as f:  # Any code executed in here is while the file is open
    if "PROD_NAME" in f.readlines()[0].split(","):  # Replace "PROD_NAME" with the string you are looking for
        print("found")
        found = True
    else:
        print("not found")
        found = False

if not found:
    os.remove(filename)
else:
    pass  # Carry out replacements here / load it in pandas
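A minimal sketch of how that check could drive the whole folder cleanup (dest_dir is assumed to be the same folder variable as in the question):

import glob
import os

for filename in glob.glob(os.path.join(dest_dir, '*.csv')):
    with open(filename) as f:
        header = f.readline().strip().split(',')
    if 'PROD_NAME' not in header:
        os.remove(filename)  # the file is closed by now, so the delete succeeds on Windows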
I have an Excel workbook with a list of URLs; each row has a URL pointing to a PDF. I am trying to download each PDF and store it in a separate folder, but I ended up downloading a single PDF file every time. I am only downloading files whose status code is 200. Below is my code:
import numpy as np
import pandas as pd
import xlrd
import wget
import requests

count = 0
df = pd.read_excel('Sample Training Data.xlsx')
row_count = print(len(df))
for col in df.columns:
    for url in df[col]:
        # check if the url has .pdf extension
        if '.pdf' in url:
            filename = url
            r = requests.get(filename)
            # check the status code
            if r.status_code == 200:
                print(filename)
                count = count + 1
for i in range(0, count):
    with open(r"D:\Juwi\Downloaded PDF\file_" + str(i) + ".pdf", 'wb') as f:
        f.write(r.content)
Maybe you need
for index, row in df.iterrows():
instead of
for col in df.columns:
Assuming the urls are in column 0 of the dataframe you can extract the column to a Python list.
url_list = df.iloc[:,0].tolist() # Change 0 to another number if urls in a different column
Then you can loop through url_list.
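For what it's worth, the single-file symptom comes from the final for i in range(0, count) loop: it runs after all the requests have finished, so every file is written with the content of the last response r. A minimal sketch (the folder path is reused from the question) that writes each PDF as soon as it is downloaded:

import requests

url_list = df.iloc[:, 0].tolist()  # change 0 if the urls are in a different column
for i, url in enumerate(url_list):
    if '.pdf' not in str(url):
        continue
    r = requests.get(url)
    if r.status_code == 200:
        # write immediately so each file gets its own response content
        with open(r"D:\Juwi\Downloaded PDF\file_" + str(i) + ".pdf", 'wb') as f:
            f.write(r.content)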
I have some code that reads all the CSV files in a certain folder and concatenates them into one Excel file. This code works as long as the CSVs have headers, but I'm wondering if there is a way to alter my code if my CSVs didn't have any headers.
Here is what works:
path = r'C:\Users\Desktop\workspace\folder'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    df = df[~df['Ran'].isin(['Active'])]
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
frame.drop_duplicates(subset=None, inplace=True)
What this is doing is deleting any row in my CSVs with the word "Active" under the "Ran" column. But if I didn't have a "Ran" header for this column, is there another way to read this and do the same thing?
Thanks in advance!
df = df[~df['Ran'].isin(['Active'])]
Instead of selecting a column by name, select it by index. If the 'Ran' column is the third column in the csv use...
df = df[~df.iloc[:,2].isin(['Active'])]
If some of your files have headers and some don't, then you probably should look at the first line of each file before you make a DataFrame with it.

for filename in all_files:
    with open(filename) as f:
        first = next(f).strip().split(',')
        if first == ['my', 'list', 'of', 'headers']:
            header = 0
            names = None
        else:
            header = None
            names = ['my', 'list', 'of', 'headers']
        f.seek(0)  # rewind so read_csv sees the whole file
        df = pd.read_csv(f, index_col=None, header=header, names=names)
    df = df[~df['Ran'].isin(['Active'])]
If I understood your question correctly ...
If the header is missing, yet you know the data format, you can pass the desired column labels as a list, such as: ['id', 'thing1', 'ran', 'other_stuff'] into the names parameter of read_csv.
Per the pandas docs:
names : array-like, optional
List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
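A minimal sketch of that approach, assuming the 'Ran' values live in the third column and using the placeholder labels from the example above:

import glob
import pandas as pd

li = []
for filename in glob.glob(path + "/*.csv"):  # path as defined in the question
    # supply our own labels because the file has no header row
    df = pd.read_csv(filename, header=None,
                     names=['id', 'thing1', 'ran', 'other_stuff'])
    df = df[~df['ran'].isin(['Active'])]
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)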
I'm wondering how to get parsed tables from pandas into a single CSV. I have managed to get each table into a separate CSV, but I would like them all in one CSV. This is my current code, which produces multiple CSVs:
import pandas as pd
import csv

url = "https://fasttrack.grv.org.au/RaceField/ViewRaces/228697009?raceId=318809897"
data = pd.read_html(url, attrs={'class': 'ReportRaceDogFormDetails'})
for i, datas in enumerate(data):
    datas.to_csv("new{}.csv".format(i), header=False, index=False)
I think you need only concat, because data is a list of DataFrames:
df = pd.concat(data, ignore_index=True)
df.to_csv(file, header=False, index=False)
You have 2 options:
You can tell pandas to append data while writing to the CSV file.
data = pd.read_html(url, attrs={'class': 'ReportRaceDogFormDetails'})
for datas in data:
    datas.to_csv("new.csv", header=False, index=False, mode='a')
Merge all the tables into one DataFrame and then write that into the CSV file.
data = pd.read_html(url, attrs = {'class': 'ReportRaceDogFormDetails'} )
df = pd.concat(data, ignore_index=True)
df.to_csv("new.csv", header=False, index=False)
Edit
To still separate the dataframes in the CSV file, we shall have to stick with option #1, but with a few additions:
data = pd.read_html(url, attrs={'class': 'ReportRaceDogFormDetails'})
with open('new.csv', 'a') as csv_stream:
    for datas in data:
        datas.to_csv(csv_stream, header=False, index=False)
        csv_stream.write('\n')
Alternatively, collect the tables in a list and concatenate them at the end:

all_dfs = []
for i, datas in enumerate(data):
    datas.to_csv("new{}.csv".format(i), header=False, index=False)
    all_dfs.append(datas)  # append the DataFrame itself; to_csv returns None when given a path
result = pd.concat(all_dfs)
I'm looking for the best way to rename my header using DictReader/DictWriter, to add to my other steps already done.
This is what I am trying to do to the source data example below:

Remove the first 2 lines
Reorder the columns (header & data) to 2, 1, 3 vs the source file
Rename the header to ASXCode, CompanyName, GICS

Where I'm at:
If I use reader = csv.reader(inf), the first lines are removed and the columns reordered, but as expected there is no header rename.
Alternately, when I run the DictReader line reader = csv.DictReader(inf, fieldnames=('ASXCode', 'CompanyName', 'GICS')), I receive the error 'dict contains fields not in fieldnames:' and it shows the first row of data rather than the header.
I'm a bit stuck on how to get around this, so any tips are appreciated.
Source Data example
ASX listed companies as at Mon May 16 17:01:04 EST 2016
Company name ASX code GICS industry group
1-PAGE LIMITED 1PG Software & Services
1300 SMILES LIMITED ONT Health Care Equipment & Services
1ST AVAILABLE LTD 1ST Health Care Equipment & Services
My Code
import csv
import urllib.request
from itertools import islice
local_filename = "C:\\myfile.csv"
url = ('http://mysite/afile.csv')
temp_filename, headers = urllib.request.urlretrieve(url)
with open(temp_filename, 'r', newline='') as inf, \
open(local_filename, 'w', newline='') as outf:
# reader = csv.DictReader(inf, fieldnames=('ASXCode', 'CompanyName', 'GICS'))
reader = csv.reader(inf)
fieldnames = ['ASX code', 'Company name', 'GICS industry group']
writer = csv.DictWriter(outf, fieldnames=fieldnames)
# 1. Remove top 2 rows
next(islice(reader, 2, 2), None)
# 2. Reorder Columns
writer.writeheader()
for row in csv.DictReader(inf):
writer.writerow(row)
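For reference, a hedged sketch of how the DictReader/DictWriter pair itself could do the rename, assuming the file is comma separated and only the first two lines are junk: give DictReader fieldnames in the order the columns appear in the file, skip the original header row manually, and let DictWriter's fieldnames do the reordering.

import csv

with open(temp_filename, 'r', newline='') as inf, \
     open(local_filename, 'w', newline='') as outf:
    for _ in range(3):  # skip the 2 junk lines plus the original header row
        next(inf)
    # fieldnames must follow the column order of the source file
    reader = csv.DictReader(inf, fieldnames=('CompanyName', 'ASXCode', 'GICS'))
    # DictWriter's fieldnames set the output column order
    writer = csv.DictWriter(outf, fieldnames=('ASXCode', 'CompanyName', 'GICS'))
    writer.writeheader()
    for row in reader:
        writer.writerow(row)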
IIUC here is a solution using pandas and its function read_csv:
import pandas as pd
#Considering that you have your data in a file called 'stock.txt'
#and it is tab separated, by default the blank lines are not read by read_csv,
#hence set the header=1
df = pd.read_csv('stock.txt', sep='\t',header=1)
#Rename the columns as required
df.columns= ['CompanyName', 'ASXCode', 'GICS']
#Reorder the columns as required
df = df[['ASXCode','CompanyName','GICS']]
Based on your tips I got it working in the end. I hadn't used pandas before, so I had to read up a little first.
I eventually worked out that pandas uses a DataFrame, so I had to do a few things differently with the to_csv function, and eventually added the index=False parameter to to_csv to remove the df index.
All great now, thank you.
import csv
import os
import urllib.request
import pandas as pd
local_filename = "C:\\myfile.csv"
url = ('http://mysite/afile.csv')
temp_filename, headers = urllib.request.urlretrieve(url)
#using pandas dataframe
df = pd.read_csv(temp_filename, sep=',',header=1) #skip header
df.columns = ['CompanyName', 'ASXCode', 'GICS'] #rename columns
df = df[['ASXCode','CompanyName','GICS']] #reorder columns
df.to_csv(local_filename, sep=',', index=False)
os.remove(temp_filename) # clean up