Link HTML and Python (yes, instead of a database!) - python

I have a dataset, and when the user selects a particular piece of data, the relevant rows/columns should be displayed. Concretely, I have a CSV file with different professions, their average pay, locations and the skills needed. Now if the user selects a profession, everything linked to that profession tuple should be displayed.
Example: the columns of a row are: Lawyer, $45000, US and Canada, Degree in law.
Now if the user selects his profession to be lawyer, the various linked values like $45000 and US and Canada should be displayed one after the other. How can I do this directly from a CSV file?
I will design this part of the website in Python Flask.

To answer your question of whether you can do this directly from a CSV file: I don't think there is a way to validate information or search records directly in a CSV. You could, however, try one of the following approaches.
One way this can be done is by simply opening the CSV file, saving each line/record as a sublist, and appending it to a parent list. Add the following code snippet to your application:
final_list = []
with open('your_file.csv', 'r') as f:
    # Split each line on commas so every record becomes a sublist
    final_list = [line.strip().split(',') for line in f]
print(final_list)
# [['Lawyer', '$45000', 'US and Canada', 'Degree in law'],
#  ['Doctor', ...], ...]
In case you want to use a Python module instead, check this: someone has already answered a similar question.
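If you then want the lookup itself (the part your Flask view would call), a minimal sketch using the built-in csv module could look like this; the file name and the Profession/Pay/Locations/Skills header row are assumptions about your data:
import csv

def lookup_profession(profession, path='your_file.csv'):
    # Scan the CSV for the first row whose profession column matches;
    # the header names here are hypothetical.
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            if row['Profession'].strip().lower() == profession.strip().lower():
                return row  # e.g. {'Profession': 'Lawyer', 'Pay': '$45000', ...}
    return None  # profession not found
Your Flask route could call this with the selected profession and display the returned fields one after the other.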
Hope this helps :)

Related

Replace a value on one data set with a value from another data set with a dynamic lookup

This question relates primarily to Alteryx; however, if it can be done in Python, or in R within an Alteryx workflow using the R tool, that would work as well.
I have two data sets.
Address (contains address information: Line1, Line2, City, State, Zip)
USPS (contains USPS abbreviations: Street to ST, Boulevard to BLVD, etc.)
Goal: Look at the Line1 string in the Address data set. IF it CONTAINS one of the street types in the USPS data set, I want to replace that part of the string with its proper abbreviation, which is in a different column of the USPS data set.
Example: 123 Main Street would become 123 Main St.
What I have tried:
Imported the two data sets.
Union the two data sets with the instruction of Output All Fields for When Fields Differ.
Added a formula, but this is where I am getting stuck. So far it reads:
if [Addr1] Contains(String, Target)
Not sure how to have it look in the USPS for one of the values. I am also not certain if this sort of dynamic lookup can take place.
If this can be done in Python (I know very basic Python, so I don't have code for this yet; I do not know where to start other than importing the data), then I can use Python within Alteryx.
Any assistance would be great. Please let me know if you need additional information.
Thank you in advance.
Use the Find Replace tool in Alteryx. This tool is akin to a lookup. Furthermore, use the Alteryx Community as a go-to for these types of questions.
Input the Address dataset into the top anchor of the Find Replace tool and the USPS dataset into the bottom anchor. You'll want to find any part of the address field using the lookup field and replace it with the abbreviation field. If you need to do this across several fields in the Address dataset, then you could replicate this logic or you could use a Record ID tool, Transpose, run this logic on one field, and then Cross Tab back to the original schema. It's an advanced recipe that you'll want to master in Alteryx.
https://help.alteryx.com/current/FindReplace.htm
The overall logic that can be used is here: Using str_detect (or some other function) and some way to loop through a list to essentially perform a vlookup
However, in order to extend this to Alteryx, you would need to add the Alteryx R tool. Also, some of the code would need to be changed to the syntax that Alteryx expects.
Read in the data with:
read.Alteryx('#Link Number', mode = 'data.frame')
After that, the question linked above provides the overall framework for the logic. Reiterated here (note that str_replace_all needs the stringr package loaded):
library(stringr)

usps[] = lapply(usps, as.character)

## Copies the original address data to a new column that will
## be altered. Preserves the original formatting for rollback
## if necessary.
vendorData$new_addr1 = as.character(vendorData$Addr1)

## Loops through the dictionary, replacing all of the common names
## with their USPS-approved abbreviations for the Addr1 field.
for (i in 1:nrow(usps)) {
  vendorData$new_addr1 = str_replace_all(
    vendorData$new_addr1,
    pattern = paste0("\\b", usps$Abbreviation[i], "\\b"),
    replacement = usps$USPS_Abbrv_updated[i]
  )
}
Finally, in order to see the output, you need to write a statement that sends it to one of the 5 output anchors the R tool has. Here is the code for that:
write.Alteryx(data, #)
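Since you mentioned being able to use Python within Alteryx, here is a rough pandas equivalent of the same replacement loop; this is a sketch only, and the file and column names (address.csv, usps.csv, Addr1, Street_Name, Abbreviation) are assumptions to adapt to your actual data:
import re
import pandas as pd

address = pd.read_csv('address.csv')
usps = pd.read_csv('usps.csv')

# Copy the original column so the raw address is preserved for rollback
address['new_addr1'] = address['Addr1'].astype(str)

# Replace each full-word street type with its abbreviation
for _, row in usps.iterrows():
    pattern = r'\b' + re.escape(str(row['Street_Name'])) + r'\b'
    address['new_addr1'] = address['new_addr1'].str.replace(
        pattern, str(row['Abbreviation']), case=False, regex=True
    )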

Basics of connecting python to the web and validating user input

I'm relatively new, and I'm just at a loss as to where to start. I don't expect detailed step-by-step responses (though, of course, those are more than welcome), but any nudges in the right direction would be greatly appreciated.
I want to use the Gutenberg python library to select a text based on a user's input.
Right now I have the code:
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
text = strip_headers(load_etext(11)).strip()
where the number represents the text (in this case 11 = Alice in Wonderland).
Then I have a bunch of code about what to do with the text, but I don't think that's relevant here. (If it is let me know and I can add it).
Basically, instead of just selecting a text, I want to let the user do that. I want to ask the user for their choice of author, and if Project Gutenberg (PG) has pieces by that author, have them then select from the list of book titles (if PG doesn't have anything by that author, return some response along the lines of "sorry, don't have anything by $author_name, pick someone else"). And then once the user has decided on a book, have the number corresponding to that book be entered into the code.
I just have no idea where to start in this process. I know how to handle user input, but I don't know how to take that input and search for something online using it.
Ideally, I'd be able to handle things like spelling mistakes too, but that may be down the line.
I really appreciate any help anyone has the time to give. Thanks!
The gutenberg module includes facilities for searching for a text by metadata, such as author. The example from the docs is:
from gutenberg.query import get_etexts
from gutenberg.query import get_metadata
print(get_metadata('title', 2701)) # prints frozenset([u'Moby Dick; Or, The Whale'])
print(get_metadata('author', 2701)) # prints frozenset([u'Melville, Hermann'])
print(get_etexts('title', 'Moby Dick; Or, The Whale')) # prints frozenset([2701, ...])
print(get_etexts('author', 'Melville, Hermann')) # prints frozenset([2701, ...])
It sounds as if you already know how to read a value from the user into a variable, and replacing the literal author in the above would be as simple as doing something like:
author_name = my_get_input_from_user_function()
texts = get_etexts('author', author_name)
Note the following caveat from the same section:
Before you use one of the gutenberg.query functions you must populate the local metadata cache. This one-off process will take quite a while to complete (18 hours on my machine) but once it is done, any subsequent calls to get_etexts or get_metadata will be very fast. If you fail to populate the cache, the calls will raise an exception.
With that in mind, I haven't tried the code I've presented in this answer because I'm still waiting for my local cache to populate.
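Putting it together, a sketch of the whole flow might look like the following (equally untested for the same reason, and assuming the cache is populated):
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
from gutenberg.query import get_etexts, get_metadata

author_name = input("Author (as 'Last, First'): ")
text_ids = get_etexts('author', author_name)
if not text_ids:
    print("Sorry, don't have anything by " + author_name + ", pick someone else.")
else:
    # List the available titles alongside their etext numbers
    for text_id in sorted(text_ids):
        print(text_id, ', '.join(get_metadata('title', text_id)))
    choice = int(input('Enter the number of the book you want: '))
    text = strip_headers(load_etext(choice)).strip()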

Append to list won't keep the last input on program restart?

I am trying to learn Python by doing.
Aim of the code below: to form part of a larger file in which I will be checking whether all the info for a customer (i.e. address, email address, contact person, etc.) is up to date in a list (I am not sure whether to use lists, arrays or dictionaries?). If yes, I want it to give options to do various things for the customer.
The code below basically checks whether a customer exists in the list. If not, it is supposed to add the customer name in c to the list.
When I run the program it works. But as soon as I restart the program, the last addition is gone: if I entered the customer ABC in the last run of the program, ABC is not in the list.
Can someone point me in the right direction on this? Also, can I pass the values in the list on to multiple dictionaries as keys, for further values (i.e. email address etc.) to be added?
customer = ['GMS']
print("Enter Customer Name:")
c = input()
if c in customer:
    print("Customer Exists")
else:
    customer.append(c)
    print("Added to list")
Your program is fine as far as it goes. It does input, and it does append to the list.
However, all the data in the program will go away as soon as the program exits. The only way to retain information across runs is to save the information in some kind of persistent storage. As Rok Novosel mentions in the comment, this can be done with the pickle module, though as a beginner you might want to defer that until later.
At this stage of your learning, I'd recommend looking at file operations: opening and closing, reading and writing. For a single list like this, the writelines() and readlines() file methods would be the simplest way to save and restore, respectively.
As for your dictionary question: yes, since you're making sure the customer names are unique, you can use them as dictionary keys. Storing that data would be more complicated; you could use pickle, or work out a file structure to parse on input.
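For example, a minimal sketch of that dictionary idea using pickle (the customers.pkl file name is just an assumption):
import os
import pickle

STORE = 'customers.pkl'

# Restore the dict on startup; start fresh on the very first run
if os.path.exists(STORE):
    with open(STORE, 'rb') as f:
        customers = pickle.load(f)
else:
    customers = {'GMS': {}}

name = input('Enter Customer Name: ')
if name in customers:
    print('Customer exists')
else:
    customers[name] = {}  # room for email address, contact person, etc.
    print('Added to list')

# Save the dict so it survives a restart
with open(STORE, 'wb') as f:
    pickle.dump(customers, f)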
Q1: Your data resides in memory during one execution instance. When the program exits, the memory is freed and your data is not automatically stored elsewhere. You may use a format you like to store it onto the disk where data is persistent. Simply writing to a file could work for you at this moment of your learning.
Q2: Yes, you may use a dictionary.
Open the file and read it into a list:
with open('file', 'r') as f:
    customers = [line.strip() for line in f]
Do whatever you want with the list. Then write it back to a file to persist the customers on disk:
with open('file', 'w') as f:
    for c in customers:
        f.write(c + '\n')
Note that the with statement closes the file for you, so explicit f.close() calls are not needed.

Searching big files using a list in Python - How can I improve the speed?

I have a folder with 300+ .txt files with total size of 15GB+. These files contain tweets. Each line is a different tweet. I have a list of keywords I'd like to search the tweets for. I have created a script that searches each line of every file for every item on my list. If the tweet contains the keyword, then it writes the line into another file. This is my code:
# Search each file for every item in keywords
print("Searching the files of " + filename + " for the appropriate keywords...")
for file in os.listdir(file_path):
    f = open(file_path + file, 'r')
    for line in f:
        for key in keywords:
            if re.search(key, line, re.IGNORECASE):
                db.write(line)
This is the format each line has:
{"created_at":"Wed Feb 03 06:53:42 +0000 2016","id":694775753754316801,"id_str":"694775753754316801","text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF","source":"\u003ca href=\"http:\/\/www.facebook.com\/twitter\" rel=\"nofollow\"\u003eFacebook\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":5981342,"id_str":"5981342","name":"Lava Kafle","screen_name":"lkafle","location":"Kathmandu, Nepal","url":"http:\/\/about.me\/lavakafle","description":"#deerwalkinc 24000+ tweeps bigdata #Team #Genomics http:\/\/deerwalk.com #Genetic #Testing #population #health #management #BigData #Analytics #java #hadoop","protected":false,"verified":false,"followers_count":24742,"friends_count":23169,"listed_count":1481,"favourites_count":147252,"statuses_count":171880,"created_at":"Sat May 12 04:49:14 +0000 2007","utc_offset":20700,"time_zone":"Kathmandu","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_link_color":"088253","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/5981342\/1416802075","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/MqmDwbCDAF","expanded_url":"http:\/\/fb.me\/Yj1JW9bJ","display_url":"fb.me\/Yj1JW9bJ","indices":[45,68]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1454482422661"}
The script works but it takes a lot of time. For ~40 keywords it needs more than 2 hours. Obviously my code is not optimized. What can I do to improve the speed?
p.s. I have read some relevant questions regarding searching and speed but I suspect that the problem in my script lies in the fact that I'm using a list for the keywords. I've tried some of the suggested solutions but to no avail.
1) External library
If you're willing to lean on external libraries (and time to execute is more important than the one-off time cost to install), you might be able to gain some speed by loading each file into a simple Pandas DataFrame and performing the keyword search as a vector operation. To get the matching tweets, you would do something like:
import pandas as pd

# Each line of the files is a JSON object, so read them as JSON Lines
dataframe_from_text = pd.read_json("/path/to/file.txt", lines=True)
# Boolean mask over the text column (a DataFrame itself has no .str accessor)
matched_tweets_index = dataframe_from_text["text"].str.contains("keyword_a|keyword_b", case=False)
matched_tweets = dataframe_from_text[matched_tweets_index]  # filter the full dataframe
# You'd then have a mini dataframe of matching tweets in `matched_tweets`.
# You could loop through these to save them out to a file using the `.to_dict(orient="records")` format.
Dataframe operations within Pandas can be really quick so might be worth investigating.
2) Group your regex
Looks like you're not logging which keyword you matched against. If that's true, you could group your keywords into a single regex query, built once outside the loop, like so:
keywords_combined = "|".join(keywords)
for line in f:
    if re.search(keywords_combined, line, re.IGNORECASE):
        db.write(line)
I've not tested this, but by reducing the number of regex calls per line it could trim some time off.
Why it's slow
You are regex-searching through a JSON dump, which is not always a good idea. For example, if your keywords include words like user, time, profile and image, each line will result in a match, because the JSON format for tweets has all these terms as dictionary keys.
Besides, the raw JSON is huge: each tweet is more than 1 kB in size (this one is 2.1 kB), but the only part that's relevant in your sample is:
"text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF",
And this is less than 100 bytes; a typical tweet is still less than 140 characters despite recent changes to the API.
Things to try:
Pre-compile the regex, as suggested by Padraic Cunningham.
Option 1: Load this data into a PostgreSQL JSONB field. JSONB fields are indexable and can be searched very quickly.
Option 2: Load this into any old database, with the content of the text field getting its own column so that that column can be searched easily.
Option 3: Last but not least, extract just the text field into its own file. You can have a CSV file where the first column is the screen name and the second is the text of the tweet. Your 15GB would shrink to about 1GB. (A sketch of this follows below.)
In short what you are doing now is searching the whole farm for the needle when you only need to search the haystack.
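As a sketch of Option 3, you could pre-process the files once with the standard json and csv modules (the folder and output file names here are assumptions):
import csv
import json
import os

file_path = 'tweets/'
with open('tweets_text.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    for name in os.listdir(file_path):
        with open(os.path.join(file_path, name), 'r') as f:
            for line in f:
                try:
                    tweet = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip malformed lines
                writer.writerow([tweet['user']['screen_name'], tweet['text']])
After that, the keyword search only has to scan the small CSV instead of the full JSON dumps.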

Python 3 Extracting Candidate Words from a Debate File

This is my first post, so I'm sorry if I do anything wrong. That said, I searched for the question and found something similar that was never answered due to the OP not giving sufficient information. This is also homework, so I'm just looking for a hint. I really want to get this on my own.
I need to read in a debate file (.txt), and pull and store all of the lines that one candidate says to put in a word cloud. The file format is supposed to help, but I'm blanking on how to do this. The hint is that each time a new person speaks, their name followed by a colon is the first word in the first line. However, candidates' data can span multiple lines. I am supposed to store each person's lines separately. Here is a sample of the file:
LEHRER: This debate and the next three -- two presidential, one vice
presidential -- are sponsored by the Commission on Presidential
Debates. Tonight's 90 minutes will be about domestic issues and will
follow a format designed by the commission. There will be six roughly
15-minute segments with two-minute answers for the first question,
then open discussion for the remainder of each segment.
Gentlemen, welcome to you both. Let's start the economy, segment one,
and let's begin with jobs. What are the major differences between the
two of you about how you would go about creating new jobs?
LEHRER: You have two minutes. Each of you have two minutes to start. A
coin toss has determined, Mr. President, you go first.
OBAMA: Well, thank you very much, Jim, for this opportunity. I want to
thank Governor Romney and the University of Denver for your
hospitality.
There are a lot of points I want to make tonight, but the most
important one is that 20 years ago I became the luckiest man on Earth
because Michelle Obama agreed to marry me.
This is what I have for a function so far:
def getCandidate(myFile):
    file = open(myFile, "r")
    obama = []
    romney = []
    lehrer = []
    file = file.readlines()
I'm just not sure how to iterate through the data so that it separates each person's words correctly. I created a dummy file to create the word cloud, and I'm able to do that fine, so all I am wondering is how to extract the information I need.
Thank you! If there is more information I can offer please let me know. This is a beginning Python course.
EDIT: New code added from a response. This works to an extent, but it only grabs the first line of each candidate's response, not the entire response. I need to write code that keeps storing each line under the current candidate until a new name appears at the start of a line.
def getCandidate(myFile, candidate):
    file = open(myFile, "r")
    OBAMA = []
    ROMNEY = []
    LEHRER = []
    file = file.readlines()
    for line in file:
        if line.startswith("OBAMA:"):
            OBAMA.append(line)
        if line.startswith("ROMNEY:"):
            ROMNEY.append(line)
        if line.startswith("LEHRER:"):
            LEHRER.append(line)
    if candidate == "OBAMA":
        return OBAMA
    if candidate == "ROMNEY":
        return ROMNEY
EDIT: I now have a new question. How can I generalize the function so that I can open any debate file between two people and a moderator? I am having a lot of trouble with this one.
I've been given a hint to look at the beginning of each line and check whether its first word ends in ":", but I'm still not sure how to do this. I tried splitting each line on spaces and then looking at the first item in the line, but that's as far as I've gotten.
The hint is this: after you split the file into lines, iterate over them and check with the string method startswith for each candidate, then append.
Iterating over a file is very simple:
for row in file:
    do_something_with_row
EDIT:
To keep appending lines until you find a new candidate, keep track of the last candidate seen in a variable, and if you don't find any match at the beginning of the line, stick with the same candidate as before.
if line.startswith('OBAMA:'):
    last_seen = OBAMA
    OBAMA.append(line)
elif line.startswith('ROMNEY:'):
    last_seen = ROMNEY
    ROMNEY.append(line)
elif line.startswith('LEHRER:'):
    last_seen = LEHRER
    LEHRER.append(line)
else:
    last_seen.append(line)
By the way, I would change the definition of the function: instead of taking the name of one candidate and returning only his lines, it would be better to return a dictionary with the candidate names as keys and their lines as values, so you wouldn't need to parse the file more than once. When you work with bigger files this could be a lifesaver.
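A minimal sketch of that dictionary version, which also addresses your generalization question, since it detects any "NAME:" speaker tag (the colon check on the first word is exactly the hint you were given) instead of hard-coding the three names:
def get_candidates(my_file):
    # Map each speaker's name to the list of their lines
    speakers = {}
    last_seen = None
    with open(my_file, 'r') as f:
        for line in f:
            first_word = line.split(' ', 1)[0] if line.strip() else ''
            if first_word.endswith(':'):
                # A new speaker tag starts this line
                last_seen = first_word.rstrip(':')
                speakers.setdefault(last_seen, [])
            if last_seen is not None:
                speakers[last_seen].append(line)
    return speakers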
