I've written a piece of script that at the moment I'm sure can be condensed. What I'm try to achieve is an automatic version of this:
file1 = tkFileDialog.askopenfilename(title='Select the first data file')
file2 = tkFileDialog.askopenfilename(title='Select the first data file')
TurnDatabase = tkFileDialog.askopenfilename(title='Select the turn database file')
headers = pd.read_csv(file1, nrows=1).columns
data1 = pd.read_csv(file1)
data2 = pd.read_csv(file2)
This is how the data is collected.
There's many more lines of code which focus on picking out bits of the data. I'm not going to post it all.
This is what I'm trying to condense:
EntrySummary = []
for key in Entries1.viewkeys():
MeanFRH = Entries1[key].hRideF.mean()
MeanFRHC = Entries1[key].hRideFCalc.mean()
MeanRRH = Entries1[key].hRideR.mean()
# There's 30 more lines of these...
# Then the list is updated with this:
EntrySummary.append({'Turn Number': key, 'Avg FRH': MeanFRH, 'Avg FRHC': MeanFRHC, 'Avg RRH': MeanRRH,... # and so on})
EntrySummary = pd.DataFrame(EntrySummary)
EntrySummary.index = EntrySummary['Turn Number']
del EntrySummary['Turn Number']
This is the old code. What I've tried to do is this:
EntrySummary = []
for i in headers():
EntrySummary.append({'Turn Number': key, str('Avg '[i]): str('Mean'[i])})
print EntrySummary
# The print is only there for me to see if it's worked.
However I'm getting this error at the minute:
for i in headers():
TypeError: 'Index' object is not callable
Any ideas as to where I've made a mistake? I've probably made a few...
Thank you in advance
Oli
If I'm understanding your situation correctly, you want to replace the long series of assignments in the "old code" you've shown with another loop that processes all of the different items automatically using the list of headers from your data files.
I think this is what you want:
EntrySummary = []
for key, value in Entries1.viewitems():
entry = {"Turn Number": key}
for header in headers:
entry["Avg {}".format(header)] = getattr(value, header).mean()
EntrySummary.append(entry)
You might be able to come up with some better variables names, since you know what the keys and values in Entries1 are (I do not, so I used generic names).
Related
I am scraping data with python. I get a csv file and can split it into columns in excel later. But I am encountering an issue I have not been able to solve. Sometimes the scraped items have two statuses and sometimes just one. The second status is thus moving the other values in the columns to the right and as a result the dates are not all in the same column which would be useful to sort the rows.
Do you have any idea how to make the columns merge if there are two statuses for example or other solutions?
Maybe is is also an issue that I still need to separate the values into columns manually with excel.
Here is my code
#call packages
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
# define driver etc.
service_obj = Service("C:\\Users\\joerg\\PycharmProjects\\dynamic2\\chromedriver.exe")
browser = webdriver.Chrome(service=service_obj)
# create loop
initiative_list = []
for i in range(0, 2):
url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page='+str(i)
browser.get(url)
time.sleep(random.randint(5, 10))
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
initiative_list.extend(initiatives)
df = pd.DataFrame(initiative_list)
#create csv
print(df)
df.to_csv('Initiativen.csv')
df.columns = ['tosplit']
new_df = df['tosplit'].str.split('\n', expand=True)
print(new_df)
new_df.to_csv('Initiativennew.csv')
I tried to merge the columns if there are two statuses.
make the columns merge if there are two statuses for example or other solutions
[If by "statuses" you mean the yellow labels ending in OPEN/UPCOMING/etc, then] it should be taken care of by the following parts of the getDetails_iiaRow (below the dividing line):
labels = cssSelect(iiaEl, 'div.field span.label')
and then
'labels': ', '.join([l.text.strip() for l in labels])
So, multiple labels will be separated by commas (or any other separator you apply .join to).
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
Instead of doing it like this and then having to split and clean things, you should consider extracting each item in a more specific manner and have each "row" be represented as a dictionary (with the column-names as the keys, so nothing gets mis-aligned later). If you wrap it as a function:
def cssSelect(el, sel): return el.find_elements(By.CSS_SELECTOR, sel)
def getDetails_iiaRow(iiaEl):
title = cssSelect(iiaEl, 'div.search-result-title')
labels = cssSelect(iiaEl, 'div.field span.label')
iiarDets = {
'title': title[0].text.strip() if title else None,
'labels': ', '.join([l.text.strip() for l in labels])
}
cvSel = 'div[translate]+div:last-child'
for c in cssSelect(iiaEl, f'div:has(>{cvSel})'):
colName = cssSelect(c, 'div[translate]')[0].text.strip()
iiarDets[colName] = cssSelect(c, cvSel)[0].text.strip()
link = iiaEl.get_attribute('href')
if link[:1] == '/':
link = f'https://ec.europa.eu/{link}'
iiarDets['link'] = iiaEl.get_attribute('href')
return iiarDets
then you can simply loop through the pages like:
initiative_list = []
for i in range(0, 2):
url = f'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page={i}'
browser.get(url)
time.sleep(random.randint(5, 10))
initiative_list += [
getDetails_iiaRow(iia) for iia in
cssSelect(browser, 'initivative-item>article>a ')
]
and the since it's all cleaned already, you can directly save the data with
pd.DataFrame(initiative_list).to_csv('Initiativen.csv', index=False)
The output I got for the first 3 pages looks like:
I think it is worth working a little bit harder to get your data rationalised before putting it in the csv rather than trying to unpick the damage once ragged data has been exported.
A quick look at each record in the page suggests that there are five main items that you want to export and these correspond to the five top-level divs in the a element.
The complexity (as you note) comes because there are sometimes two statuses specified, and in that case there is sometimes a separate date range for each and sometimes a single date range.
I have therefore chosen to put the three ever present fields as the first three columns, followed next by the status + date range columns as pairs. Finally I have removed the field names (these should effectively become the column headings) to leave only the variable data in the rows.
initiatives = [processDiv(item) for item in initiative_item]
def processDiv(item):
divs = item.find_elements(By.XPATH, "./article/a/div")
if "\n" in divs[0].text:
statuses = divs[0].text.split("\n")
if len(divs) > 5:
return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], statuses[0], divs[4].text.split("\n")[1], statuses[1], divs[5].text.split("\n")[1]]
else:
return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], statuses[0], divs[4].text.split("\n")[1], statuses[1], divs[4].text.split("\n")[1]]
else:
return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], divs[0].text, divs[4].text.split("\n")[1]]
The above approach sticks as close to yours as I can. You will clearly need to rework the pandas code to reflect the slightly altered data structure.
Personally, I would invest even more time in clearly identifying the best definitions for the fields that represent each piece of data that you wish to retrieve (rather than as simply divs 0-5), and extract the text directly from them (rather than messing around with split). In this way you are far more likely to create robust code that can be maintained over time (perhaps not your goal).
I have a very long (100+ columns) CSV file to process. Sometimes the people that generate the file make mistakes in the names of columns. If they make a mistake would just like to ignore that column so I have this very long block of code where I check for KeyError on each piece of data.
cr = csv.DictReader(response)
for row in cr:
try:
creation_time = row['Creation Time']
except KeyError:
creation_time = ''
try:
current_pm_active = row['Current Active']
except KeyError:
current_pm_active = ''
try:
current_pm_total = row['Current Total']
except KeyError:
current_pm_total = ''
... and so on and on ...
I suspect there is probably a better way to code this. Thanks!
Update. Thanks for the questions. The reason I am placing data into individual variables is that I will be inserting these values in the Django Model like so:
updated_vmt, created = Vmt.objects.update_or_create(
cluster=cluster,
added=datetime.datetime.today().strftime('%Y-%m-%d'),
defaults={
'current_pm_active' : current_pm_active,
'current_pm_total' : current_pm_total,
... big long list ...
}
)
Save your values in a dictionary instead of individual variables; use a dictionary for the mapping from the expected CSV columns to your "variable" names. Now you can loop over those keys instead of having a bunch of if statements.
If you put the values in a dictionary keyed the way you have for defaults, then you can just use defaults = my_dictionary_of_row_data in your insert statement, avoiding another "big long list".
You can use pandas.read_csv to generate a DataFrame, and cycle through columns
Here's a portion of what the Excel file looks like. Meant to include this the first time. Thanks for the help so far.
Name Phone Number Carrier
FirstName LastName1 3410142531 Alltel
FirstName LastName2 2437201754 AT&T
FirstName LastName3 9247224091 Boost Mobile
FirstName LastName4 6548310018 Cricket Wireless
FirstName LastName5 8811620411 Project Fi
I am converting a list of names, phone numbers, and carriers to a dictionary for easy reference by other code. The idea is separate code will be able to call a name and access that person's phone number and carrier.
I got the output I need, but I'm wondering if there were an easier way I could have accomplished this task and get the same output. Though it's fairly concise, I'm interested in any module or built in of which I'm not aware. My python skills are beginner at best. I wrote this in Thorny with Python 3.6.4. Thanks!
#Imports
import pandas as pd
import math
# Assign spreadsheet filename to `file`
file = 'Phone_Numbers.xlsx'
# Load spreadsheets
xl = pd.ExcelFile(file)
# Load a sheet into a DataFrame by name: df1
df1 = xl.parse('Sheet1', header=0)
# Put the dataframe into a dictionary to start
phone_numbers = df1.to_dict(orient='records')
# Converts PhoneNumbers.xlsx to a dictionary
x=0
temp_dict = {}
for item in phone_numbers:
temp_list = []
for key in phone_numbers[x]:
tempholder = phone_numbers[x][key]
if (isinstance(tempholder, float) or isinstance(tempholder, int)) and math.isnan(tempholder) == False: # Checks to see if there is a blank and if the phone number comes up as a float
# Converts any floats to string for use in later code
tempholder = str(int(tempholder))
else:
pass
temp_list.append(tempholder)
temp_dict[temp_list[0]] = temp_list[1:] # Makes the first item in the list the key and add the rest as values
x += 1
print(temp_dict)
Here's the desired output:
{'FirstName LastName1': ['3410142531', 'Alltel'], 'FirstName LastName2': [2437201754, 'AT&T'], 'FirstName LastName3': [9247224091, 'Boost Mobile'], 'FirstName LastName4': [6548310018, 'Cricket Wireless'], 'FirstName LastName5': [8811620411, 'Project Fi']
One way to do it would be to iterate through the dataframe and use a dictionary comprehension:
temp_dict = {row['Name']:[row['Phone Number'], row['Carrier']] for _, row in df.iterrows()}
where df is your original dataframe (the result of xl.parse('Sheet1', header=0)). This basically iterates through all rows in your dataframe, creating a dictionary key for each Name, with Phone number and carrier as it's values (in a list), as you indicated in your output.
To make sure that your phone number is not null (as you did in your loop), you could add an if clause to your dict comprehension, such as this:
temp_dict = {row['Name']:[row['Phone Number'], row['Carrier']]
for _, row in df.iterrows()
if not math.isnan(row['Phone Number'])}
df.set_index('Name').T.to_dict('list')
should do the job ,Here df is your dataframe
I have some data in a txt file and I would like to load it into a list of dicts. I would normally use csv.ReadDict(open('file')), however this data does not have the key values in the first row. Instead it has a number of rows commented out before the data actually begins. Also, sometimes, the commented rows will not always be at beginning of the file, but could be at the end of the file.
However, all line should always have the same fields, and I guess I could hard-code these field names (or key values) as they shouldn't change.
Sample Date
# twitter data
# retrieved at: 07.08.2014
# total number of records: 5
# exported by: userXYZ
# fields: date, time, username, source
10.12.2013; 02:00; tweeterA; web
10.12.2013; 02:01; tweeterB; iPhone
10.13.2013; 02:04; tweeterC; android
10.13.2013; 02:08; tweeterC; web
10.13.2013; 02:10; tweeterD; iPhone
Below is the what I've been able to figure out so far, but I need some help getting it worked out.
My Code
header = ['date', 'time', 'username', 'source']
data = []
for line in open('data.txt'):
if not line.startswith('#'):
data.append(line)
Desired Format
[{'date':'10.12.2013', 'time':'02:00', 'username':'tweeterA', 'source':,'web'},
{'date':'10.12.2013', 'time':'02:01', 'username':'tweeterB', 'source':,'iPhone'},
{'date':'10.12.2013', 'time':'02:04', 'username':'tweeterC', 'source':,'android'},
{'date':'10.12.2013', 'time':'02:08', 'username':'tweeterC', 'source':,'web'},
{'date':'10.12.2013', 'time':'02:10', 'username':'tweeterD', 'source':,'iPhone'}]
If you want a list of dicts where each dict corresponds to a row try this:
list_of_dicts = [{key: value for (key, value) in zip(header, line.strip().split('; '))} for line in open('abcd.txt') if not line.strip().startswith('#')]
for line in open('data.txt'):
if not line.startswith('#'):
data.append(line.split("; "))
at least assuming I understand you correctly
or more succinct
data = [line.split("; ") for line in open("data.txt") if not line.strip().startswith("#")]
list_of_dicts = map(lambda row:dict(zip(header,row)),data)
depending on your version of python you may get an iterator back from map in which case just do
list_of_dicts = list(map(lambda row:dict(zip(header,row)),data))
I read from a file and stored into artists_tag with column names .
Now this file has multiple columns and I need to generate a new data structure which has 2 columns from the artists_tag as it is and the most frequent value from the 'Tag' column as the 3rd column value.
Here is what I have written as of now:
import pandas as pd
from collections import Counter
def parse_artists_tags(filename):
df = pd.read_csv(filename, sep="|", names=["ArtistID", "ArtistName", "Tag", "Count"])
return df
def parse_user_artists_matrix(filename):
df = pd.read_csv(filename)
return df
# artists_tags = parse_artists_tags(DATA_PATH + "\\artists-tags.txt")
artists_tags = parse_artists_tags("C:\\Users\\15-J001TX\\Documents\\ml_task\\artists-tags.txt")
#print(artists_tags)
user_art_mat = parse_user_artists_matrix("C:\\Users\\15-J001TX\\Documents\\ml_task\\userart-mat-training.csv")
#print ("Number of tags {0}".format(len(artists_tags))) # Change this line. Should be 952803
#print ("Number of artists {0}".format(len(user_art_mat))) # Change this line. Should be 17119
# TODO Implement this. You can change the function arguments if necessary
# Return a data structure that contains (artist id, artist name, top tag) for every artist
def calculate_top_tag(all_tags):
temp = all_tags.Tag
a = Counter(temp)
a = a.most_common()
print (a)
top_tags = all_tags.ArtistID,all_tags.ArtistName,a;
return top_tags
top_tags = calculate_top_tag(artists_tags)
# Print the top tag for Nirvana
# Artist ID for Nirvana is 5b11f4ce-a62d-471e-81fc-a69a8278c7da
# Should be 'Grunge'
print ("Top tag for Nirvana is {0}".format(top_tags)) # Complete this line
In the last method calculate_top_tag I don't understand how to choose the most frequent value from the 'Tag' column and put it as the third column for top_tags before returning it.
I am new to python and my knowledge of syntax and data structures is limited. I did try the various solutions mentioned for finding the most frequent value from the list but they seem to display the entire column and not one particular value. I know this is some trivial syntax issue but after having searched for long I still cannot figure out how to get this one.
edit 1 :
I need to find the most common tag for a particular artist and not the most common overall.
But again, I don't know how to.
edit 2 :
here is the link to the data files:
https://github.com/amplab/datascience-sp14/raw/master/hw2/hw2data.tar.gz
I'm sure there is a more succint way of doing it, but this should get you started:
# returns a df grouped by ArtistID and Tag
tag_counts = artists_tags.groupby(['ArtistID', 'Tag'])
# sum up tag counts and sort in descending order
tag_counts = tag_counts.sum().sort('Count', ascending=False).reset_index()
# keep only the top ranking tag per artist
top_tags = tag_counts.groupby('ArtistID').first()
# top_tags is now a dataframe which contains the top tag for every artist
# We can simply lookup the top tag for Nirvana via it's index:
top_tags.ix['5b11f4ce-a62d-471e-81fc-a69a8278c7da'][0]
# 'Grunge'