I want to create something like a csv file database for a school project.
I have rows with values that stay the same, like the UID, but also values that I want to keep track of over time, like timestamps. Simplified example below:
UID       Timestamps
First     Date1,Date2,Date3
Second    Date1,Date2,Date3,Date4,Date5
What I get right now is:
UID       Timestamps
First     Date1
Second    Date1
Is there an elegant way to do the following in pandas that is reasonably fast and doesn't require iteration: if a specific UID has already been seen, append the timestamp to its associated cell?
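For illustration, here is a minimal sketch of the shape I'm after, assuming the raw observations arrive as one row per (UID, timestamp) pair; a groupby can collapse them without explicit iteration:

import pandas as pd

# Hypothetical long-format input: one row per observed change
raw = pd.DataFrame({
    'UID': ['First', 'Second', 'First', 'Second', 'First'],
    'Timestamp': ['Date1', 'Date1', 'Date2', 'Date2', 'Date3'],
})

# Collapse to one row per UID, comma-joining the timestamps
wide = raw.groupby('UID', sort=False)['Timestamp'].agg(','.join).reset_index()
print(wide)
#       UID          Timestamp
# 0   First  Date1,Date2,Date3
# 1  Second        Date1,Date2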
EDIT:
I hope I can make my question clearer now (sorry, I am new to this). I want to scrape a list of links (the UIDs) and add a timestamp every time the script detects a change in an HTML element associated with the link. The logic for checking whether a change has been made does not exist yet.
from datetime import datetime
import pandas as pd

links_extended = []
datestamp_extended = []
while True:
    try:
        a = driver.find_element_by_id('thermoplast').find_elements_by_tag_name("a")
        links = [x.get_attribute("href") for x in a]
        datestamp = []
        now = datetime.now()
        for _ in links:  # one timestamp per link found
            datestamp.append(now.strftime("%Y-%m-%d %H:%M:%S"))
        datestamp_extended.extend(datestamp)
        links_extended.extend(links)
    # if no next link is found, the while loop ends
    except:
        hot_dict = {'Link': links_extended, 'Datestamps': datestamp_extended}
        hotlist_df = pd.DataFrame(data=hot_dict)
        break
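A sketch of an alternative shape I'm considering (assuming the change-detection logic exists): collect the timestamps per link in a dict and build the frame once at the end, so no cell ever needs appending:

from collections import defaultdict
import pandas as pd

timestamps_by_link = defaultdict(list)

# inside the scraping loop, whenever a change is detected for `link`:
#     timestamps_by_link[link].append(now.strftime("%Y-%m-%d %H:%M:%S"))

# after the loop, one row per link:
hotlist_df = pd.DataFrame({
    'Link': list(timestamps_by_link),
    'Datestamps': [','.join(v) for v in timestamps_by_link.values()],
})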
Thank you very much! :)
I am scraping data with Python. I get a CSV file and can split it into columns in Excel later. But I am encountering an issue I have not been able to solve: sometimes the scraped items have two statuses and sometimes just one. The second status shifts the other values in those rows to the right, so the dates do not all end up in the same column, which would be useful for sorting.
Do you have any idea how to merge the columns when there are two statuses, or any other solution?
Maybe it is also an issue that I still need to separate the values into columns manually with Excel.
Here is my code:
# call packages
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd

# define driver etc.
service_obj = Service("C:\\Users\\joerg\\PycharmProjects\\dynamic2\\chromedriver.exe")
browser = webdriver.Chrome(service=service_obj)

# create loop
initiative_list = []
for i in range(0, 2):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page=' + str(i)
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiatives = [item.text for item in initiative_item]
    initiative_list.extend(initiatives)
df = pd.DataFrame(initiative_list)

# create csv
print(df)
df.to_csv('Initiativen.csv')
df.columns = ['tosplit']
new_df = df['tosplit'].str.split('\n', expand=True)
print(new_df)
new_df.to_csv('Initiativennew.csv')
I tried to merge the columns if there are two statuses.
"make the columns merge if there are two statuses for example or other solutions"
If by "statuses" you mean the yellow labels ending in OPEN/UPCOMING/etc., then it should be taken care of by the following parts of getDetails_iiaRow (below the dividing line):
labels = cssSelect(iiaEl, 'div.field span.label')
and then
'labels': ', '.join([l.text.strip() for l in labels])
So, multiple labels will be separated by commas (or whatever separator you call .join on).
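For example, an item carrying two labels comes out as a single cell:

', '.join(['OPEN', 'UPCOMING'])   # -> 'OPEN, UPCOMING'
', '.join(['OPEN'])               # -> 'OPEN' (a single label is unchanged)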
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
Instead of doing it like this and then having to split and clean things, you should consider extracting each item in a more specific manner and representing each "row" as a dictionary (with the column names as the keys, so nothing gets misaligned later). If you wrap it in a function:
def cssSelect(el, sel): return el.find_elements(By.CSS_SELECTOR, sel)

def getDetails_iiaRow(iiaEl):
    title = cssSelect(iiaEl, 'div.search-result-title')
    labels = cssSelect(iiaEl, 'div.field span.label')
    iiarDets = {
        'title': title[0].text.strip() if title else None,
        'labels': ', '.join([l.text.strip() for l in labels])
    }
    cvSel = 'div[translate]+div:last-child'
    for c in cssSelect(iiaEl, f'div:has(>{cvSel})'):
        colName = cssSelect(c, 'div[translate]')[0].text.strip()
        iiarDets[colName] = cssSelect(c, cvSel)[0].text.strip()
    link = iiaEl.get_attribute('href')
    if link[:1] == '/':
        link = f'https://ec.europa.eu{link}'  # prepend the domain to relative links
    iiarDets['link'] = link                   # use the fixed-up link
    return iiarDets
then you can simply loop through the pages like:
initiative_list = []
for i in range(0, 2):
    url = f'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page={i}'
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_list += [
        getDetails_iiaRow(iia) for iia in
        cssSelect(browser, 'initivative-item>article>a')
    ]
and then, since it's all cleaned already, you can directly save the data with
pd.DataFrame(initiative_list).to_csv('Initiativen.csv', index=False)
The output I got for the first 3 pages looks like: [screenshot omitted]
I think it is worth working a little bit harder to get your data rationalised before putting it in the csv rather than trying to unpick the damage once ragged data has been exported.
A quick look at each record in the page suggests that there are five main items that you want to export and these correspond to the five top-level divs in the a element.
The complexity (as you note) comes because there are sometimes two statuses specified, and in that case there is sometimes a separate date range for each and sometimes a single date range.
I have therefore chosen to put the three ever-present fields in the first three columns, followed by the status + date-range columns as pairs. Finally, I have removed the field names (these should effectively become the column headings) to leave only the variable data in the rows.
initiatives = [processDiv(item) for item in initiative_item]
def processDiv(item):
    divs = item.find_elements(By.XPATH, "./article/a/div")
    if "\n" in divs[0].text:
        statuses = divs[0].text.split("\n")
        if len(divs) > 5:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[5].text.split("\n")[1]]
        else:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[4].text.split("\n")[1]]
    else:
        return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                divs[0].text, divs[4].text.split("\n")[1]]
The above approach sticks as close to yours as I can manage. You will clearly need to rework the pandas code to reflect the slightly altered data structure.
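As a sketch of that rework (the headers here are placeholders, not the site's real field names), passing explicit columns lets the five-element and seven-element rows line up, with the second status pair left as NaN where absent:

import pandas as pd

# Placeholder headers - adjust to the real field names
cols = ['title', 'type', 'feedback_period', 'status_1', 'dates_1', 'status_2', 'dates_2']
df = pd.DataFrame(initiative_list, columns=cols)  # shorter rows are padded with NaN
df.to_csv('Initiativen.csv', index=False)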
Personally, I would invest even more time in clearly identifying the best definitions for the fields that represent each piece of data you wish to retrieve (rather than simply divs 0-5), and extract the text directly from them (rather than messing around with split). That way you are far more likely to create robust code that can be maintained over time (though perhaps that is not your goal).
I'm trying to fetch the data from every row of one specific column of a table. The problem is that the loop fetches the first row multiple times and never moves on to the next row. Here is the relevant code.
numRows = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr'))
numColumns = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/thead/tr[2]/th'))
print(numRows)
# Prints 139
print(numColumns)
# prints 21

for i in range(numRows + 1):
    df = []
    value = driver.find_element_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr['{}']/td[16]".format(i))
    df.append(value.text)
    print(df)
As the print statements show, I have all the rows and columns of my table, so that part works. But when I try to iterate over all the rows of one specific column, I only get the first value. I tried to solve this with the format() method, but that doesn't seem to fix it. Any idea how I can solve this problem?
Please try iterating through all the results found instead of finding individual elements. (Note that tr['{}'] wraps the index in quotes, so the XPath predicate becomes a non-empty string, which is always true; every iteration therefore matches all rows and find_element returns the first one. It would need to be tr[{}].) I cannot test the code since I do not have access to the HTML file.
found_elements = driver.find_elements_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr/td[16]")

df = []
for el in found_elements:  # one element per row, already in document order
    df.append(el.text)
print(df)
I found the following code to work:
rader = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr'))
#kolonner = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/thead/tr[2]/th'))
kolonneFinish = []
kolonneBib = []
for i in range(1, rader + 1):
    valueFinish = driver.find_element_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr[" + str(i) + "]/td[16]")
    kolonneFinish.append(valueFinish.text)
I have no idea what the + + does, so if someone knows, please comment.
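The + ... + is just Python string concatenation: it splices str(i) into the XPath so each pass targets a different tr. For example (XPath shortened here for readability):

i = 3
xpath = ".../tbody/tr[" + str(i) + "]/td[16]"  # concatenation -> ".../tbody/tr[3]/td[16]"
xpath = ".../tbody/tr[{}]/td[16]".format(i)    # the same string built with format()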
I have some code and am wondering how I can properly format its output and save it to a CSV file. I haven't been able to find a question that answers this in the way I want, and since I am fairly new to Python I'm having trouble adapting code from similar questions to suit my needs.
It breaks down to the following:
assets = [asset_1, asset_2, asset_3]
for i in range(len(assets)):
    price = data.current(assets[i], "price")
Now if I
print price
it prints as
price 1
price 2
price 3
I can get it to print as price 1, price 2, price 3 without trouble, but it seems pretty janky to write something like print price[0], price[1], price[2], newline, especially since I hope to extend this to many more assets in the future. All the solutions I've come across that remove the newlines then produce price1, price2, price3, price1, price2..., and I want the newline specifically after price 3. I can also export the data to a CSV without trouble, but the CSV just has one column, similar to the first print statement. How can I get each of the 3 prices into one CSV row followed by a newline?
I attempted to follow this answer and changed my print statements to follow suit, but got a syntax error when trying to print prices[index], the index in my case being 0:2 (I tried 1:3 just in case, but it didn't work as expected).
Here is my actual code.
from catalyst.api import record, symbol, symbols
from datetime import datetime
import os, csv, pytz, sys
import catalyst

def initialize(context):
    # Portfolio assets list
    context.assets = [symbol("XMR_DASH"), symbol("BTC_XMR"), symbol("BTC_DASH")]
    '''
    # Creates a .CSV file with the same name as this script to store results
    context.csvfile = open(os.path.splitext(os.path.basename(__file__))[0]+'.csv', 'w+')
    context.csvwriter = csv.writer(context.csvfile)
    '''

def handle_data(context, data):
    date = context.blotter.current_dt  # Current time for each iteration in the simulation
    price = data.current(context.assets, "price")
    print price
    '''
    for i in range(0,3):
        price = data.current(context.assets[i], "price")
        print price[any index gives syntax error]
    '''

def analyze(context=None, results=None):
    pass
    '''
    # Close open file properly at the end
    context.csvfile.close()
    '''

start = datetime(2017, 7, 30, 0, 0, 0, 0, pytz.utc)
end = datetime(2017, 7, 31, 0, 0, 0, 0, pytz.utc)
results = catalyst.run_algorithm(start=start, end=end, initialize=initialize,
                                 capital_base=10000, handle_data=handle_data,
                                 bundle='poloniex', analyze=analyze, data_frequency='minute')
I don't need help with writing data to a csv, but do need help formatting the data that is written. Preferably an answer would also work with print statements so I can visualize what's going on before going to the csv.
You can simply print the array this way:
assets = [asset_1, asset_2, asset_3]
print(",".join([str(data.current(asset, "price")) for asset in assets]))
",".join(array) joins the array with a comma between values; print then outputs the joined string with a newline at the end.
[str(data.current(asset, "price")) for asset in assets] builds the list of price strings you want to join.
Try changing the below:
for i in range(0, 3):
    price = data.current(context.assets[i], "price")
    print price[any index gives syntax error]
to:
price = []
for i in range(0, 3):
    price.append(data.current(context.assets[i], "price"))
print(','.join([str(x) for x in price]))
And the issue you were facing, I believe, is that price is a single float/double/int and not a list of prices, so any index on price would fail. With this change we build a price list first.
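And if the goal is one CSV row per iteration, a minimal sketch (Python 3 style; the csv module takes care of the commas and the trailing newline):

import csv

price = [0.1, 0.2, 0.3]  # stand-in for the three asset prices
with open('prices.csv', 'a', newline='') as f:
    csv.writer(f).writerow(price)  # writes one line: 0.1,0.2,0.3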
I have a dataframe in Python using pandas. It has two columns called 'dropoff_latitude' and 'pickup_latitude'. I want to make a function that creates a third column based on these two variables (it runs them through an API).
So I wrote a function:
def dropoff_info(row):
    dropoff_latitude = row['dropoff_latitude']
    dropoff_longitude = row['dropoff_longitude']
    dropoff_url2 = "http://data.fcc.gov/api/block/find?format=json&latitude=%s&longitude=%s&showall=true" % (dropoff_latitude, dropoff_longitude)
    dropoff_resp2 = requests.get(dropoff_url2)
    dropoff_results2 = json.loads(dropoff_resp2.text)
    dropoffinfo = dropoff_results2["Block"]["FIPS"][2:11]
    return dropoffinfo
then I would run it as
df['newcolumn'] = dropoffinfo(df)
However it doesn't work.
Upon troubleshooting I find that when I print dropoff_latitude it looks like this:
0 40.773345947265625
1 40.762149810791016
2 40.770393371582031
...
And so I think that the URL can't get generated. I want dropoff_latitude to look like this when printed:
40.773345947265625
40.762149810791016
40.770393371582031
...
And I don't know how to specify that I want just the values, without the index.
When I tried
dropoff_latitude = row['dropoff_latitude'][1]
dropoff_longitude = row['dropoff_longitude'][1]
It just gave me the values from the 1st row so that obviously didn't work.
Ideas please? I am very new to working with dataframes... Thank you!
Alex - with pandas we typically like to avoid loops, but in your particular case the need to ping a remote server for data pretty much requires one. So I'd do something like the following:
l = []
for i in df.index:
    dropoff_latitude = df.loc[i, 'dropoff_latitude']
    dropoff_longitude = df.loc[i, 'dropoff_longitude']
    dropoff_url2 = "http://data.fcc.gov/api/block/find?format=json&latitude=%s&longitude=%s&showall=true" % (dropoff_latitude, dropoff_longitude)
    dropoff_resp2 = requests.get(dropoff_url2)
    dropoff_results2 = json.loads(dropoff_resp2.text)
    l.append(dropoff_results2["Block"]["FIPS"][2:11])
df['new'] = l
The key here is the .loc[i, ...] bit that gives you the ability to go through each row one by one, and call out the associated column to create the variables to send to your API.
Regarding your question about a drain on your memory - that's a little above my pay-grade, but I really don't think you have any other options in this case (unless your API has some kind of batch request that allows you to pull a larger data set in one call).
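For what it's worth, the same row-by-row call can also be written with your original function and apply; a sketch (it still hits the API once per row, so it is no faster than the loop):

import json
import requests

def dropoff_info(row):
    url = ("http://data.fcc.gov/api/block/find?format=json"
           "&latitude=%s&longitude=%s&showall=true"
           % (row['dropoff_latitude'], row['dropoff_longitude']))
    return json.loads(requests.get(url).text)["Block"]["FIPS"][2:11]

df['new'] = df.apply(dropoff_info, axis=1)  # passes each row as a Series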
I have a CSV file, which contains patch names, their release date and some other info in separate columns. I am trying to write a Python script that will ask the user for a Patch name and once it gets the input, will check if the Patch is in the CSV file and print out the Release Date.
So far, I have written the following piece of code, which I based on this answer.
import csv

patch = raw_input("Please provide your Patchname: ")
with open("CSV_File1.csv") as my_file1:
    reader = csv.DictReader(my_file1)
    for row in reader:
        for k in row:
            if row[k] == patch:
                print "According to the CSV_File1 database: " + row[k]
This way I get the Patch name printed on the screen. I don't know how to traverse the column with the Dates, so that I can print the date that corresponds to the row with the Patch name that I provided as input.
In addition, I would like to check if that patch is the last released one. If it isn't, then print the latest one along with its release date. My problem is that the CSV file contains patch names of different software versions, so I can't just print the last of the list. For example:
PatchXXXYY,...other columns...,Release Date,... <--- (this is the header row of the CSV file)
Patch10000,...,date
Patch10001,...,date
Patch10002,...,date
Patch10100,...,date
Patch10101,...,date
Patch10102,...,date
Patch10103,...,date
Patch20000,...,date
...
So, if my input is "Patch10000", then I should get its release date and the latest available Patch, which in this case would be Patch10002, and its release date. But NOT Patch20000, as that would be a different software version. A preferable output would like this:
According to the CSV_File1 database: Patch10100 was released on
"date". The latest available patch is "Patch10103", which was
released on "date".
That's because the "XXX" digits in the PatchXXXYY above represent the software version, and the "YY" digits the patch number. I hope this is clear.
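In other words, for a name like Patch10103:

name = "Patch10103"
version = name[5:8]    # "101" - the software version
patch_no = name[8:10]  # "03"  - the patch number within that version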
Thanks in advance!
The csv module works fine, but I just wanted to throw pandas in, as this can be a good use case for it. There may be better ways to handle this, but it's a fun example. This assumes that your columns are labeled Patch_Name and Release_Date, so you will need to correct those.
import pandas as pd
my_file1 = pd.read_csv("CSV_File1.csv", error_bad_lines=False)
patch = raw_input("Please provide your Patchname: ")
#Find row that matches patch and store the index as idx
idx = my_file1[my_file1["Patch_Name"] == patch].index.tolist()
#Get the date value from row by index number
date = my_file1.get_value(idx[0], "Release_Date")
print "According to the CSV_File1 database: {} {}".format(patch, date)
There are great ways to filter and compare the data in a CSV with Pandas as well. I would give more descriptive solutions if I had more time. I highly suggest looking into the Pandas documentation.
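As a sketch of that kind of filtering (same assumed Patch_Name / Release_Date labels as above), the latest patch within the same software version could be found like this:

# Patch names look like PatchXXXYY: chars 5-7 are the version, 8-9 the patch number
version = patch[:8]  # e.g. "Patch101"
same = my_file1[my_file1["Patch_Name"].str.startswith(version)]
latest = same.loc[same["Patch_Name"].str.slice(8, 10).astype(int).idxmax()]
print "The latest available patch is {}, released on {}".format(
    latest["Patch_Name"], latest["Release_Date"])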
You're almost there, though I'm a wee bit confused - your sample data doesn't have a header row. If it doesn't then you shouldn't be using a DictReader but if it does you can take this approach.
version = patch[:8]
latest_patch = ''
last_patch_data = None

with open("CSV_File1.csv") as my_file1:
    reader = csv.DictReader(my_file1)
    for row in reader:
        # This works because of ASCII ordering. First,
        # we make sure the package starts with the right
        # version - e.g. Patch200
        if row['Package'].startswith(version):
            # Now we grab the next two numbers, so from
            # Patch20042 we're grabbing '42'
            patch_number = row['Package'][8:10]
            # '02' > '' is True, and '42' > '02' is also True
            if patch_number > latest_patch:
                # If we have a greater patch number, we
                # want to store that, along with the row that
                # had it. We could just store the patch & date
                # but it's fine to store the whole row
                latest_patch = patch_number
                last_patch_data = row

        # No need to iterate over the keys, you *know* the
        # column containing the patch. Here it's
        # titled 'Package'
        #for k in row:
        #    if row[k] == patch:
        if row['Package'] == patch:
            # assuming the date header is 'Registration'
            print("According to the CSV_File1 database: {patch!r}"
                  " was released on {date!r}".format(patch=row['Package'],
                                                     date=row['Registration']))

# `None` is a singleton, which means that we can use `is`,
# rather than `==`. If we didn't even *start* with the same
# version, there was certainly no patch. You may prefer a
# different message, of course.
if last_patch_data is None:
    print('No patch found')
else:
    print('The latest available patch is {patch!r},'
          ' which was released on {date!r}'.format(patch=last_patch_data['Package'],
                                                   date=last_patch_data['Registration']))
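One caveat on the string comparison used above ('42' > '02'): it only works because the patch numbers are always zero-padded to two characters. Strings compare character by character, so unpadded numbers misbehave; converting to int first is the safer variant:

'02' < '42'            # True  - zero-padded strings happen to sort correctly
'9' < '10'             # False - character-by-character comparison goes wrong
int('9') < int('10')   # True  - ints always compare numerically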