Parsing a paragraph: detecting sentences without punctuation - python

Let's say I have the following text:
Steps toward this goal include: Increasing efficiency of mobile networks, data centers, data transmission, and spectrum allocation Reducing the amount of data apps have to pull from networks through caching, compression, and futuristic technologies like peer-to-peer data transfer Making investments in accessibility profitable by educating people about the uses of data, creating business models that thrive when free data access is offered initially, and building out credit card infrastructure so carriers can move from pre-paid to post-paid models that facilitate investment If the plan works, mobile operators will gain more customers and invest more in accessibility; phone makers will see people wanting better devices; Internet providers will get to connect more people; and people will receive affordable Internet so they can join the knowledge economy and connect with the people they care about.
As you can tell by reading the text, these are multiple sentences (a list of points). How can I split this text into sentences? I've tried using python NLTK but no luck. Checking for uppercase letters won't work either, as it isn't very reliable.
Any ideas on how to solve this problem?
Thanks.

If I understood you correctly, this little snippet could help (note: tested on Python 2.7.5):
paragraph = 'Steps toward this goal include: Increasing efficiency of mobile networks, data centers, data transmission, and spectrum allocation Reducing the amount of data apps have to pull from networks through caching, compression, and futuristic technologies like peer-to-peer data transfer Making investments in accessibility profitable by educating people about the uses of data, creating business models that thrive when free data access is offered initially, and building out credit card infrastructure so carriers can move from pre-paid to post-paid models that facilitate investment If the plan works, mobile operators will gain more customers and invest more in accessibility; phone makers will see people wanting better devices; Internet providers will get to connect more people; and people will receive affordable Internet so they can join the knowledge economy and connect with the people they care about.'
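words = []
separators = ['.',',',':',';']
oldValue = 0
# Walk the paragraph and cut a chunk every time a separator is found
for value in range(len(paragraph)):
    if paragraph[value] in separators:
        words.append(paragraph[oldValue:value+1])
        # Skip the separator and the space that follows it
        oldValue = value+2
for word in words:
    print(word)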
[EDIT]
You could also easily add an uppercase-letter check with:
if paragraph[value].isupper():
    words.append(paragraph[oldValue:value+1])
...
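The uppercase check can also be written as a regular expression. This is only a minimal sketch, assuming that a capitalised word following a lowercase letter or one of the separators `.`, `:`, `;` starts a new item; the names `BOUNDARY` and `split_items` and the shortened sample text are made up for illustration. As the question points out, the heuristic is not reliable: a proper noun in the middle of a sentence (for example "affordable Internet") triggers a false split.

```python
import re

# Assumed heuristic: a new item starts at a capitalised word that follows
# a lowercase letter or one of the separators . : ;
BOUNDARY = re.compile(r'(?<=[a-z.:;])\s+(?=[A-Z])')

def split_items(text):
    """Split text at the assumed item boundaries, dropping empty pieces."""
    return [part for part in BOUNDARY.split(text) if part]

sample = ("Steps include: Increasing efficiency of networks "
          "Reducing the amount of data If the plan works, operators gain customers.")

for item in split_items(sample):
    print(item)
```

On the shortened sample this yields four items, but running it on the full paragraph from the question would also cut "affordable Internet" in two, so treat it as a starting point rather than a solution.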

Related

How to convert gzip.GzipFile to dictionary?

I have a gz format file. The file is very big and the first line is as follow:
{"originaltitle":"Leasing Specialist - WPM Real Estate Management","workexperiences":[{"company":"Home Properties","country":"US","customizeddaterange":"","daterange":{"displaydaterange":"","startdate":null,"enddate":null},"description":"Responsibilities: Inspect tour routes, models and show apartments daily to ensure cleanliness. Greeting prospective residents; determining the needs and preferences of the prospect and professionally present specific apartments while providing information regarding features and benefits. Answering incoming calls in a cheerful and professional manner. Handle each call accordingly whether it is a prospect call or an irate resident that just moved in. Develop and maintain Resident relations through the courtesy of on-site personnel, promptness of maintenance calls, and knowledge of community policies. Learn to develop professional sales and closing techniques. Accompany prospects to model apartments and discusses size and layout of rooms, available facilities, such as swimming pool and saunas, location of shopping centers, services available, and terms of lease. Demonstrate thorough knowledge and use of lead tracking system. Make follow-up calls to prospective Residents who did not fill out an application. Compile and update listings of available rental units.","location":"Baltimore, MD","normalizedtitle":"leasing specialist","title":"Leasing Specialist"},{"company":"WPM Real Estate Management","country":"US","customizeddaterange":"1 year, 3 months","daterange":{"displaydaterange":"July 2017 to October 2018","startdate":{"displaydate":"July 2017","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"October 2018","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities: Inspect tour routes, models and show apartments daily to ensure cleanliness. 
Greeting prospective residents; determining the needs and preferences of the prospect and professionally present specific apartments while providing information regarding features and benefits. Answering incoming calls in a cheerful and professional manner. Handle each call accordingly whether it is a prospect call or an irate resident that just moved in. Develop and maintain Resident relations through the courtesy of on-site personnel, promptness of maintenance calls, and knowledge of community policies. Learn to develop professional sales and closing techniques. Accompany prospects to model apartments and discusses size and layout of rooms, available facilities, such as swimming pool and saunas, location of shopping centers, services available, and terms of lease. Demonstrate thorough knowledge and use of lead tracking system. Make follow-up calls to prospective Residents who did not fill out an application. Compile and update listings of available rental units.","location":"Baltimore, MD","normalizedtitle":"leasing specialist","title":"Leasing Specialist"},{"company":"Westminster Management","country":"US","customizeddaterange":"1 year","daterange":{"displaydaterange":"June 2016 to June 2017","startdate":{"displaydate":"June 2016","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"June 2017","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities: Tour vacant units and model with future prospects.Process applications. Answer emails and incoming phone calls. Prepare lease agreement for signing. Collect all monies that is due on dateof move-in. Enter resident repair orders for resident. Walk vacant units to ensure that the unit is ready for show. Complete residency and employment verifications. 
Income qualify all applicants.","location":"Baltimore, MD","normalizedtitle":"leasing consultant","title":"Leasing Consultant"},{"company":"MARYLAND MANAGEMENT COMPANY","country":"US","customizeddaterange":"1 year, 1 month","daterange":{"displaydaterange":"April 2015 to May 2016","startdate":{"displaydate":"April 2015","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"May 2016","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities: Lease apartments, sign lease agreements, complete residence maintenance repairrequest, answer phones, customer service, processed prospects applications, opened and closedinventory, responded to Level One emails Accomplishments: I was able to successfully finish FairHousing requirements. The first month I was able to properly and accurately process a application and move-in documents. Skills Used: The skills I used while at Americana were strong team work, strongcommunication, interpersonal, and leadership.","location":"Glen Burnie, MD","normalizedtitle":"leasing agent","title":"Leasing Agent"},{"company":"Amazon.com","country":"US","customizeddaterange":"1 year, 5 months","daterange":{"displaydaterange":"September 2014 to February 2016","startdate":{"displaydate":"September 2014","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"February 2016","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities: I assure customers are receiving the correct merchandise in a timely fashion.And evaluate inventoryAccomplishments:I exceeded Amazon expectations of receiving 2800 items per hour, which allowed me to train otherassociates, building confidence and skills.Skills Used:The skills i used while performing my task were strong leadership, strong communications, and beingdetailed orientated.","location":"Baltimore, MD","normalizedtitle":"customer service representative","title":"Customer Service Representative"},{"company":"Carmax 
Superstore","country":"US","customizeddaterange":"1 year, 2 months","daterange":{"displaydaterange":"February 2014 to April 2015","startdate":{"displaydate":"February 2014","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"April 2015","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities:Greet customersSearch for the right vehicle that best suits the customers needs and wantsSubmit financial applicationsAssist customer with the purchasing process and document signingEnter customers information for appraisal offerAssist customer with purchasing Car Max extended warrantiesConducted follow- up on a daily, weekly, and monthy basisAccomplishments:I was acknowledged by the district for having 100% in Car Max extended warranties. Also I wasacknowledged by the district for having one of the highest Voice Of Customer survey scores. I passedthe 6 week training, obtaining my sales licenseSkills Used:I demonstrate strong communication, interpersonal and listening skills. 
I also have strongorganizational skills.","location":"Nottingham, MD","normalizedtitle":"sales consultant","title":"Certified Sales Consultant"},{"company":"rue21","country":"US","customizeddaterange":"1 year, 8 months","daterange":{"displaydaterange":"June 2011 to February 2013","startdate":{"displaydate":"June 2011","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"February 2013","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities: Managed profit goals on a daily basisCustomer ServiceReceived Incoming shipmentDelivered daily bank depsoitsMaintained store appearanceOverlooked sales associates performanceCreated daily goals for each sales associateAccomplishments:The impact that I was able to have during my time at Rue21, I was able to build a strong team of individuals who were scored top in the region for Customer Service.Skills Used:I demonstrated strong leadership and verbal communication.","location":"Dundalk, MD","normalizedtitle":"assistant store manager","title":"Assistant Store Manager"},{"company":"Shaws Jewelers","country":"US","customizeddaterange":"1 year, 5 months","daterange":{"displaydaterange":"November 2009 to April 2011","startdate":{"displaydate":"November 2009","granularity":"MONTH","isodate":{"date":null}},"enddate":{"displaydate":"April 2011","granularity":"MONTH","isodate":{"date":null}}},"description":"Responsibilities: Customer serviceGeneral office( typing, faxing, )Made outgoing calls to valued customersCleaned and maintained show cases and lunch roomPrepared jewlery repair tickets for outgoing shipmentAccomplishments:During my time at Shaws Jewelers I was able to demonstrate excellent customer service.Also I wasable to achieve personal profit goals and credit application goals on a daily basis. I was acknowledged and rewarded by my DM for excellent team participation and over achieving the 6 standards on a dailybasis.Skills Used:I demonstrated strong verbal and listening skills. 
Also I have excellent interpersonal skills.","location":"Dundalk, MD","normalizedtitle":"sales associate","title":"Sales Associate"}],"skillslist":[{"monthsofexperience":0,"text":"yardi"},{"monthsofexperience":0,"text":"marketing"},{"monthsofexperience":0,"text":"outlook"},{"monthsofexperience":0,"text":"receptionist"},{"monthsofexperience":0,"text":"management"}],"url":"/r/Lashannon-Felton/1062d3b8cbb13886","additionalinfo":""}\n'
I am not familiar with gzip.GzipFile format.
Is there a way to make it a dictionary?
You will want to make use of the json module and the gzip module in Python, both of which are part of the Python Standard Library.
The gzip module provides the GzipFile class, as well as the open(),
compress() and decompress() convenience functions. The GzipFile class
reads and writes gzip-format files, automatically compressing or
decompressing the data so that it looks like an ordinary file object.
To read the compressed file, you can call gzip.open().
Opening the file in the default rb mode returns a gzip.GzipFile object, from which you can obtain a bytes-like object by calling read().
Then, using json.loads(), you can convert the raw data into a usable Python object -- a dictionary.
The snippet below is a simple demonstration of this in action:
import gzip
import json
with gzip.open('gzipped_file.json.gz', 'rb') as f:
    raw_json = f.read()
data = json.loads(raw_json)
print(type(data))
# Prints <class 'dict'>
print(data)
# Prints {'originaltitle': 'Leasing Specialist - WPM Real Estate Management', 'workexperience ...
print(data['workexperiences'][0]['company'])
# Prints Home Properties
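Since the question mentions that the file is very big and shows only its first line, the file may hold one JSON object per line. That is an assumption, not something the question confirms. Under that assumption, a minimal sketch that avoids loading everything at once could read the file in text mode and parse it line by line. The helper name `iter_json_lines` and the throwaway sample file are made up for illustration.

```python
import gzip
import json
import os
import tempfile

def iter_json_lines(path):
    """Yield one parsed object per non-empty line of a gzipped file."""
    # 'rt' makes gzip.open decode the bytes to text while decompressing
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo on a small throwaway file standing in for the real one
records = [{"originaltitle": "Leasing Specialist"},
           {"originaltitle": "Leasing Agent"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    loaded = list(iter_json_lines(path))

print(type(loaded[0]))
print(loaded[0]["originaltitle"])
```

Because `iter_json_lines` is a generator, only one line is held in memory at a time when you loop over it directly instead of calling `list()`.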

Python API request - For Loop causing Index errors

Fairly new to Python.... struggling with the for loop in my code, specifically the assignment of Key: 'topic_title'.
I keep receiving a "list index out of range" error. The JSON response at the "solicitation_topics" is nested so I believe I need to pass the index and this works when trying to access directly from the python terminal, however within the function I keep getting the error. Any help would be greatly appreciated.
import requests, json
def get_solicitations():
    # api-endpoint
    URL = "https://www.sbir.gov/api/solicitations.json"
    # defining a params dict for the parameters to be sent to the API
    PARAMS = {"keyword": 'sbir'}
    # sending get request and saving the response as response object
    r = requests.get(url = URL, params = PARAMS)
    # extracting data in json format
    api_data = r.json()
    # storing selected json data into a dict
    solicitations = []
    for data in api_data:
        temp = {
            'solicitation_title': data['solicitation_title'],
            'program': data['program'],
            'agency': data['agency'],
            'branch': data['branch'],
            'close_date': data['close_date'],
            'solicitation_link': data['sbir_solicitation_link'],
            'topic_title': data['solicitation_topics'][0]['topic_title'],
        }
        solicitations.append(temp)
    return (solicitations)
A snippet of the JSON response looks like this:
[
{
"solicitation_title": "Interactive Digital Media STEM Resources for Pre- College and Informal Science Education Audiences (SBIR) (R43/R44 Clinical Trial Not Allowed) ",
"solicitation_number": "PAR-20-244 ",
"program": "SBIR",
"phase": "BOTH",
"agency": "Department of Health and Human Services",
"branch": "National Institutes of Health",
"solicitation_year": "2020",
"release_date": "2020-06-25",
"open_date": "2020-08-04",
"close_date": "2022-09-03",
"application_due_date": [
"2020-09-04",
"2021-09-03",
"2022-09-02"
],
"occurrence_number": null,
"sbir_solicitation_link": "https://www.sbir.gov/node/1703169",
"solicitation_agency_url": "https://grants.nih.gov/grants/guide/pa-files/PAR-20-244.html",
"current_status": "open",
"solicitation_topics": [
{
"topic_title": "Interactive Digital Media STEM Resources for Pre-College and Informal Science Education Audiences (SBIR) (R43/R44 Clinical Trial Not Allowed) ",
"branch": "National Institutes of Health",
"topic_number": "PAR-20-244 ",
"topic_description": "The educational objective of this FOA is to provide opportunities for eligible SBCs to submit NIH SBIR grant applications to develop IDM STEM products that address student career choice and health and medicine topics for: (1) pre-kindergarten to grade 12 (P-12) students and teachers or (2) informal science education (ISE) audiences. The second educational objective is to inform the American public that their quality of health is defined by lifestyle. If this message is understood, people can begin to live longer and reduce the healthcare burden to society. Therefore, this FOA also encourages IDM STEM products that will increase public health literacy and stimulate behavioral changes towards a healthier lifestyle. The research objective of this FOA is the development of new educational products that will advance our understanding of how IDM STEM-based gaming can improve student learning. It is anticipated that increasing underserved and minority student achievement in STEM fields through IDM STEM resources will encourage these students to pursue health-related careers that will increase their economic and social opportunities. A diverse health care workforce will help to expand health care access for the underserved, foster research in neglected areas of societal need, and enrich the pool of managers and policymakers to meet the needs of a diverse population.\r\n\r\nIDM is a bridge technology that converts game-based activities from a social pastime to a powerful educational tool that challenges students with problem solving, conceptual reasoning and goal-oriented decision making. Well-designed IDM products mimic successful teacher pedagogy and exploit student interest in games for learning. IDM STEM products also integrate imbedded learning, e.g., what the student knows and new knowledge gained in the gaming process, into problem solving skills. IDM products provide real time student assessment. 
Unlike standardized classroom testing where student achievement is a pass or fail process, IDM-based assessment is interactive, does not punish the student, and provides feedback on how to move to the next level of play. IDM products are intended to generate long-term changes in student performance, educational outcomes and career choices.\r\n\r\nThis FOA also encourages IDM STEM products that will increase public health literacy and stimulate behavioral changes towards a healthier lifestyle. Types of applications submitted to this FOA may vary with the target audience, scientific content, educational purpose and method of delivery. IDM STEM products may include but are not limited to: game-based curricula, resources that promote attitude changes toward learning, new skills development, teamwork and group activities, public participation in scientific research (citizen science) projects, and behavioral changes in lifestyle and health. IDM STEM products designed to increase the number of underserved students, e.g., American Indian, Alaska Native, Pacific Islanders, African American, Hispanic, disabled, or otherwise underrepresented individuals considering careers in basic, behavioral or clinical research are encouraged.\r\n\r\nIDM STEM products may be designed for use in-classroom or out-of-classroom settings, e.g., as supplements to existing classroom curricula, for after-school science clubs, libraries, hospital waiting rooms and science museums. IDM products may target children in group settings or individually, with or without adult or teacher participation or supervision.\r\n\r\nThe proposed project may use any IDM gaming technology or platform but the platform chosen should be accessible to the target group.\r\n\r\n",
"sbir_topic_link": "https://www.sbir.gov/node/1703171",
"subtopics": []
}
]
},
]
Replicating your code, it looks like solicitation_topics can be an empty list. I added this line to your function:
print(f"title = {data['solicitation_title']}, topics: {data['solicitation_topics']}")
And I found this (one of several) empties:
title = PHS 2020 Omnibus Solicitation of the NIH, CDC and FDA for Small Business Innovation Research Grant Applications (Parent SBIR [R43/R44] Clinical Trial Not Allowed), topics: []
You will need to figure out how to guard against that.
If you want to skip the empty ones you could put a continue at the top of the loop:
if not data['solicitation_topics']:
    continue
Or if you want to still preserve the solicitations with no topics, you should generate the title you want above, and then use that in your temp:
if data['solicitation_topics']:
    topic_title = data['solicitation_topics'][0]['topic_title']
else:
    topic_title = 'Not Supplied'
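The guard can also be pulled into a small helper so the loop body stays readable. The sketch below is only one possible arrangement: the helper name `first_topic_title` is hypothetical, and the two sample records are made up to mirror the "has topics" and "empty list" cases rather than taken from the API.

```python
def first_topic_title(data, default='Not Supplied'):
    """Return the first topic title, or a default when there are no topics."""
    # .get() also covers a missing key; "or []" covers an explicit null
    topics = data.get('solicitation_topics') or []
    if topics:
        return topics[0].get('topic_title', default)
    return default

# Made-up records mirroring the two cases discussed above
sample_data = [
    {'solicitation_title': 'With topics',
     'solicitation_topics': [{'topic_title': 'Interactive Digital Media'}]},
    {'solicitation_title': 'Without topics',
     'solicitation_topics': []},
]

for data in sample_data:
    print(data['solicitation_title'], '->', first_topic_title(data))
```

Inside the original function, the dictionary entry would then read `'topic_title': first_topic_title(data)`.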

auto-generate e-commerce tags from item description

We are developing an e-commerce portal that enables users to list their items (name, description, tags) on the site.
However, we realized that users do not understand item tags very well: some write arbitrary words and others leave the field blank. To deal with this, I thought about using an entity extractor to generate tags. First, I tried passing this listing to Calais:
I'm a Filipino Male looking for Office Assistant job,with knowledge in MS Word,Excel,Power Point & Internet Browsing,i'm a quick learner with clear & polite communicative skills,immense flexibility in terms of work assignments and work hours,and performing my duties with full dedication,integrity and honesty.
I got these tags: Religion Belief, Positive psychology, Integrity, Evaluation, Behavior, Psychology, Skill.
Then I tried Stanford NER and got: Excel, Power, Point, &, Internet, Browsing
After that, I stopped trying these solutions, as I thought they would not fit, and started thinking about an e-commerce-related thesaurus containing product/brand names and trade-related terms, which I could use to filter user-generated posts and find the proper tags. However, I couldn't find one.
First question: did I miss something?
Second question: are there better scenarios for this (i.e., generating the tags)?
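To make the thesaurus idea concrete, here is a minimal sketch of filtering a listing against a controlled vocabulary. Everything in it is assumed for illustration: the function name `extract_tags`, the tiny vocabulary, and the normalisation rule (lowercase, punctuation treated as spaces). A real thesaurus would be far larger and would need synonyms and spelling variants.

```python
import re

def normalize(text):
    """Lowercase the text and turn every run of non-alphanumerics into one space."""
    return " " + re.sub(r"[^a-z0-9]+", " ", text.lower()).strip() + " "

def extract_tags(description, vocabulary):
    """Return the vocabulary terms that occur as whole words in the description."""
    haystack = normalize(description)
    return [term for term in vocabulary if normalize(term) in haystack]

# Hypothetical vocabulary; a real one would come from the thesaurus
vocabulary = ["Office Assistant", "MS Word", "Excel", "Power Point", "Photoshop"]

listing = ("I'm a Filipino Male looking for Office Assistant job,with knowledge "
           "in MS Word,Excel,Power Point & Internet Browsing")

print(extract_tags(listing, vocabulary))
```

Padding both sides with spaces keeps multi-word terms such as "Power Point" together and stops a term from matching inside a longer word.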

Does Django scale? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I'm building a web application with Django. The reasons I chose Django were:
I wanted to work with free/open-source tools.
I like Python and feel it's a long-term language, whereas regarding Ruby I wasn't sure, and PHP seemed like a huge hassle to learn.
I'm building a prototype for an idea and wasn't thinking too much about the future. Development speed was the main factor, and I already knew Python.
I knew the migration to Google App Engine would be easier should I choose to do so in the future.
I heard Django was "nice".
Now that I'm getting closer to thinking about publishing my work, I start being concerned about scale. The only information I found about the scaling capabilities of Django is provided by the Django team (I'm not saying anything to disregard them, but this is clearly not objective information...).
My questions:
What's the "largest" site that's built on Django today? (I measure size mostly by user traffic)
Can Django deal with 100,000 users daily, each visiting the site for a couple of hours?
Could a site like Stack Overflow run on Django?
"What are the largest sites built on Django today?"
There isn't any single place that collects information about traffic on Django built sites, so I'll have to take a stab at it using data from various locations. First, we have a list of Django sites on the front page of the main Django project page and then a list of Django built sites at djangosites.org. Going through the lists and picking some that I know have decent traffic we see:
Instagram: What Powers Instagram: Hundreds of Instances, Dozens of Technologies.
Pinterest: Alexa rank 37 (21.4.2015) and 70 Million users in 2013
Bitbucket: 200TB of Code and 2.500.000 Users
Disqus: Serving 400 million people with Python.
curse.com: 600k daily visits.
tabblo.com: 44k daily visits, see Ned Batchelder's posts Infrastructure for modern web sites.
chesspark.com: Alexa rank about 179k.
pownce.com (no longer active): Alexa rank about 65k.
Mike Malone of Pownce, in his EuroDjangoCon presentation on Scaling Django Web Apps says "hundreds of hits per second". This is a very good presentation on how to scale Django, and makes some good points including (current) shortcomings in Django scalability.
HP had a site built with Django 1.5: ePrint center. However, as of November 2015 the entire website was migrated and this link is just a redirect. The website was a worldwide service handling subscriptions to Instant Ink and related services HP offered (*).
"Can Django deal with 100,000 users daily, each visiting the site for a couple of hours?"
Yes, see above.
"Could a site like Stack Overflow run on Django?"
My gut feeling is yes but, as others answered and Mike Malone mentions in his presentation, database design is critical. Strong proof might also be found at www.cnprog.com if we can find any reliable traffic stats. Anyway, it's not just something that will happen by throwing together a bunch of Django models :)
There are, of course, many more sites and bloggers of interest, but I have got to stop somewhere!
A blog post about using Django to build a high-traffic site: michaelmoore.com, described as a top-10,000 website (see the Quantcast stats and compete.com stats).
(*) The author of this edit, who added that reference, used to work as an outsourced developer on that project.
We're doing load testing now. We think we can support 240 concurrent requests (a sustained rate of 120 hits per second 24x7) without any significant degradation in the server performance. That would be 432,000 hits per hour. Response times aren't small (our transactions are large) but there's no degradation from our baseline performance as the load increases.
We're using Apache front-ending Django and MySQL. The OS is Red Hat Enterprise Linux (RHEL). 64-bit. We use mod_wsgi in daemon mode for Django. We've done no cache or database optimization other than to accept the defaults.
Everything runs in one VM on a 64-bit Dell with (I think) 32 GB of RAM.
Since performance is almost the same for 20 or 200 concurrent users, we don't need to spend huge amounts of time "tweaking". Instead we simply need to keep our base performance up through ordinary SSL performance improvements, ordinary database design and implementation (indexing, etc.), ordinary firewall performance improvements, etc.
What we do measure is our load test laptops struggling under the insane workload of 15 processes running 16 concurrent threads of requests.
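The figures quoted in this answer can be cross-checked with a few lines of arithmetic. The sketch below takes the numbers as stated (120 hits per second, 15 processes with 16 threads each). The use of Little's law to estimate an average response time is an addition, not something the answer claims, and it assumes every load-test thread keeps exactly one request in flight at all times.

```python
HITS_PER_SECOND = 120
PROCESSES = 15
THREADS_PER_PROCESS = 16

# Sustained throughput expressed per hour
hits_per_hour = HITS_PER_SECOND * 60 * 60

# One request in flight per load-test thread (assumption)
concurrent_requests = PROCESSES * THREADS_PER_PROCESS

# Little's law: requests in flight = throughput * average time in system
implied_response_seconds = concurrent_requests / HITS_PER_SECOND

print(hits_per_hour)             # 432000
print(concurrent_requests)       # 240
print(implied_response_seconds)  # 2.0
```

An implied average of about two seconds per request is consistent with the remark that response times "aren't small".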
Not sure about the number of daily visits but here are a few examples of large Django sites:
disqus.com (talk from djangocon)
bitbucket.org (write up)
lanyrd.com (source)
support.mozilla.com (source code)
addons.mozilla.org (source code) (talk from djangocon)
theonion.com (write up)
The guardian.co.uk comment system uses Django (source)
Instagram
Pinterest
Rdio
Here is a link to list of high traffic Django sites on Quora.
What's the "largest" site that's built on Django today? (I measure size mostly by user traffic)
In the US, it was Mahalo. I'm told they handle roughly 10 million uniques a month. Now, in 2019, Mahalo is powered by Ruby on Rails.
Abroad, it is the Globo network (a network of news, sports, and entertainment sites in Brazil); Alexa ranks them in the top 100 globally (around 80th currently).
Other notable Django users include PBS, National Geographic, Discovery, NASA (actually a number of different divisions within NASA), and the Library of Congress.
Can Django deal with 100k users daily, each visiting the site for a couple of hours?
Yes -- but only if you've written your application right, and if you've got enough hardware. Django's not a magic bullet.
Could a site like StackOverflow run on Django?
Yes (but see above).
Technology-wise, easily: see soclone for one attempt. Traffic-wise, compete pegs StackOverflow at under 1 million uniques per month. I can name at least a dozen Django sites with more traffic than SO.
Scaling web apps is not about web frameworks or languages; it is about your architecture.
It's about how you handle your browser cache, your database cache, how you use non-standard persistence providers (like CouchDB), how well tuned your database is, and a lot of other stuff...
Playing devil's advocate a little bit:
You should check the DjangoCon 2008 Keynote, delivered by Cal Henderson, titled "Why I hate Django" where he pretty much goes over everything Django is missing that you might want to do in a high traffic website. At the end of the day you have to take this all with an open mind because it is perfectly possible to write Django apps that scale, but I thought it was a good presentation and relevant to your question.
The largest Django site I know of is the Washington Post, which would certainly indicate that it can scale well.
Good design decisions probably have a bigger performance impact than anything else. Twitter is often cited as a site which embodies the performance issues with another dynamic interpreted language based web framework, Ruby on Rails - yet Twitter engineers have stated that the framework isn't as much an issue as some of the database design choices they made early on.
Django works very nicely with memcached and provides some classes for managing the cache, which is where you would resolve the majority of your performance issues. What you deliver on the wire is almost more important than your backend in reality - using a tool like yslow is critical for a high performance web application. You can always throw more hardware at your backend, but you can't change your users bandwidth.
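The caching idea behind this advice is often called cache-aside: look in the cache first, and only fall back to the expensive work on a miss. The sketch below illustrates the pattern in plain Python. The `TTLCache` class is a made-up in-process stand-in for a cache server such as memcached, and `cached_call` and `slow_query` are hypothetical names; in a Django project you would use Django's own cache framework rather than this class.

```python
import time

class TTLCache:
    """Tiny in-process stand-in for a cache server such as memcached."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss
            del self._store[key]
            return None
        return value

    def set(self, key, value, timeout):
        self._store[key] = (value, self._clock() + timeout)


def cached_call(cache, key, timeout, compute):
    """Cache-aside: return the cached value, or compute and store it."""
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.set(key, value, timeout)
    return value


calls = []

def slow_query():
    """Pretend to be an expensive database query and count invocations."""
    calls.append(1)
    return "front page"

cache = TTLCache()
first = cached_call(cache, "home", 60, slow_query)
second = cached_call(cache, "home", 60, slow_query)
print(first, second, len(calls))
```

The second call is served from the cache, so the expensive function runs only once within the 60-second window. Note that this simple version cannot cache a legitimate `None` result.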
I was at the EuroDjangoCon conference the other week, and this was the subject of a couple of talks - including from the founders of what was the largest Django-based site, Pownce (slides from one talk here). The main message is that it's not Django you have to worry about, but things like proper caching, load balancing, database optimisation, etc.
Django actually has hooks for most of those things - caching, in particular, is made very easy.
I'm sure you're looking for a more solid answer, but the most obvious objective validation I can think of is that Google pushes Django for use with its App Engine framework. If anybody knows about and deals with scalability on a regular basis, it's Google. From what I've read, the most limiting factor seems to be the database back-end, which is why Google uses their own...
As stated in the High Performance Django book (also go through the Cal Henderson material), with further details quoted below:
It’s not uncommon to hear people say “Django doesn’t scale”. Depending on how you look at it, the statement is either completely true or patently false. Django, on its own, doesn’t scale.
The same can be said of Ruby on Rails, Flask, PHP, or any other language used by a database-driven dynamic website.
The good news, however, is that Django interacts beautifully with a suite of caching and load balancing tools that will allow it to scale to as much traffic as you can throw at it. Contrary to what you may have read online, it can do so without replacing core components often labeled as “too slow” such as the database ORM or the template layer.
Disqus serves over 8 billion page views per month. Those are some huge numbers.
These teams have proven Django most certainly does scale.
Our experience here at Lincoln Loop backs it up.
We’ve built big Django sites capable of spending the day on the Reddit homepage without breaking a sweat.
Django’s scaling success stories are almost too numerous to list at this point.
It backs Disqus, Instagram, and Pinterest. Want some more proof? Instagram was able to sustain over 30 million users on Django with only 3 engineers (2 of which had no back-end development
Many of the web apps and sites we use every day are built with Python or Django. Here are some of them.
Washington Post
The Washington Post’s website is a hugely popular online news source to accompany their daily paper. Its huge volume of views and traffic is easily handled by the Django web framework.
Washington Post - 52.2 million unique visitors (March, 2015)
NASA
The National Aeronautics and Space Administration’s official website is the place to find news, pictures, and videos about their ongoing space exploration. This Django website can easily handle huge amounts of views and traffic.
2 million visitors monthly
The Guardian
The Guardian is a British news and media website owned by the Guardian Media Group. It contains nearly all of the content of the newspapers The Guardian and The Observer. This large volume of data is handled by Django.
The Guardian (commenting system) - 41.6 million unique visitors (October, 2014)
YouTube
We all know YouTube as the place to upload cat videos and fails. As one of the most popular websites in existence, it provides us with endless hours of video entertainment. The Python programming language powers it and the features we love.
DropBox
DropBox started the online document storing revolution that has become part of daily life. We now store almost everything in the cloud. Dropbox allows us to store, sync, and share almost anything using the power of Python.
Survey Monkey
Survey Monkey is the largest online survey company. They can handle over one million responses every day on their rewritten Python website.
Quora
Quora is the number one place online to ask a question and receive answers from a community of individuals. On their Python website relevant results are answered, edited, and organized by these community members.
Bitly
A majority of the code for Bitly URL shortening services and analytics are all built with Python. Their service can handle hundreds of millions of events per day.
Reddit
Reddit is known as the front page of the internet. It is the place online to find information or entertainment based on thousands of different categories. Posts and links are user generated and are promoted to the top through votes. Many of Reddit’s capabilities rely on Python for their functionality.
Hipmunk
Hipmunk is an online consumer travel site that compares the top travel sites to find you the best deals. This Python website’s tools allow you to find the cheapest hotels and flights for your destination.
For more, see:
25-of-the-most-popular-python-and-django-websites
What-are-some-well-known-sites-running-on-Django
I think we might as well add Instagram, Apple's App of the Year for 2011, to the list; it uses Django intensively.
Yes it can. It could be Django with Python or Ruby on Rails. It will still scale.
There are a few different techniques. First, caching is not scaling. You could have several application servers balanced with nginx as the front end, in addition to hardware load balancer(s).
To scale on the database side you can go pretty far with read slaves in MySQL / PostgreSQL if you go the RDBMS way.
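For what it's worth, here is a minimal sketch of how reads could be pointed at replicas in Django using a database router. The alias names primary, replica1 and replica2 are assumptions for illustration; they would have to match entries in your own DATABASES setting, and the class would be listed in DATABASE_ROUTERS. Keep in mind that replicas can lag slightly behind the primary, so a read right after a write may return stale data.

```python
import random


class PrimaryReplicaRouter:
    """Send reads to a replica and writes to the primary."""

    # Hypothetical aliases; they must match keys in settings.DATABASES.
    primary = "primary"
    replicas = ["replica1", "replica2"]

    def db_for_read(self, model, **hints):
        # Spread read queries across the replicas at random.
        return random.choice(self.replicas)

    def db_for_write(self, model, **hints):
        # All writes go to the single primary.
        return self.primary

    def allow_relation(self, obj1, obj2, **hints):
        # Every alias holds the same data, so relations are always fine.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Migrate only the primary; replicas get changes via replication.
        return db == self.primary
```

The router is a plain class, so it can be tried out on its own before wiring it into a project.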
Some good examples of heavy traffic websites in Django could be:
Pownce, when it was still around.
Disqus (generic shared comments manager)
All the newspaper-related websites: Washington Post and others.
You can feel safe.
Here's a list of some relatively high-profile things built in Django:
The Guardian's "Investigate your MP's expenses" app
Politifact.com (here's a blog post talking about the (positive) experience). The site won a Pulitzer.
NY Times' Represent app
EveryBlock
Peter Harkins, one of the programmers over at WaPo, lists all the stuff they’ve built with Django on his blog
It's a little old, but someone from the LA Times gave a basic overview of why they went with Django.
The Onion's AV Club was recently moved from (I think) Drupal to Django.
I imagine a number of these sites probably get well over 100k hits per day. Django can certainly do 100k hits/day and more. But YMMV in getting your particular site there, depending on what you're building.
There are caching options at the Django level (for example, caching querysets and views in memcached can work wonders) and beyond (upstream caches like Squid). Database server specifications will also be a factor (and usually the place to splurge), as is how well you've tuned it. Don't assume, for example, that Django's going to set up indexes properly. Don't assume that the default PostgreSQL or MySQL configuration is the right one.
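To make the caching idea concrete, here is a rough sketch of the cache-aside pattern in plain Python. TTLCache is a hypothetical in-process stand-in for a cache server such as memcached, not Django's cache API, and slow_query is an invented placeholder for an expensive database query. The point is only that the second lookup never reaches the database.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a cache server (hypothetical helper)."""

    def __init__(self):
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Entry is too old: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, timeout):
        self._store[key] = (time.monotonic() + timeout, value)


def cached_lookup(cache, key, compute, timeout=60):
    """Cache-aside: return the cached value, or compute and store it."""
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.set(key, value, timeout)
    return value


db_hits = []


def slow_query():
    # Placeholder for an expensive database query.
    db_hits.append(1)
    return ["post 1", "post 2"]


cache = TTLCache()
first = cached_lookup(cache, "front_page_posts", slow_query)
second = cached_lookup(cache, "front_page_posts", slow_query)
print(len(db_hits))  # the query ran once; the second call was served from cache
```

In a real project you would presumably use Django's own cache framework with a memcached or Redis backend rather than a dict, but the flow is the same.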
Furthermore, you always have the option of having multiple application servers running Django if that is the slow point, with a software or hardware load balancer in front.
Finally, are you serving static content on the same server as Django? Are you using Apache or something like nginx or lighttpd? Can you afford to use a CDN for static content? These are things to think about, but it's all very speculative. 100k hits/day isn't the only variable: how much do you want to spend? How much expertise do you have managing all these components? How much time do you have to pull it all together?
The developer advocate for YouTube gave a talk about scaling Python at PyCon 2012, which is also relevant to scaling Django.
YouTube has more than a billion users, and YouTube is built on Python.
I have been using Django for over a year now, and am very impressed with how it manages to combine modularity, scalability and speed of development. Like with any technology, it comes with a learning curve. However, this learning curve is made a lot less steep by the excellent documentation from the Django community. Django has been able to handle everything I have thrown at it really well. It looks like it will be able to scale well into the future.
BidRodeo Penny Auctions is a moderately sized Django powered website. It is a very dynamic website and does handle a good number of page views a day.
Note that if you're expecting 100K users per day, that are active for hours at a time (meaning max of 20K+ concurrent users), you're going to need A LOT of servers. SO has ~15,000 registered users, and most of them are probably not active daily. While the bulk of traffic comes from unregistered users, I'm guessing that very few of them stay on the site more than a couple minutes (i.e. they follow google search results then leave).
For that volume, expect at least 30 servers ... which is still a rather heavy 1,000 concurrent users per server.
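The numbers above can be sanity-checked with some back-of-the-envelope arithmetic. This is only a sketch: "a couple of hours" is taken as 2, the peak-to-average ratio of 3 is my own assumption, and 30 servers at 1,000 users each corresponds to a peak of about 30,000 concurrent users.

```python
import math


def average_concurrent_users(daily_users, hours_per_user):
    """Average number of users online at once, spread evenly over 24 hours."""
    return daily_users * hours_per_user / 24


def servers_needed(peak_concurrent_users, users_per_server):
    """Round up: a fraction of a server still means one more machine."""
    return math.ceil(peak_concurrent_users / users_per_server)


average = average_concurrent_users(100_000, 2)  # "a couple of hours" taken as 2
peak = average * 3  # assumed peak-to-average ratio, not a measured figure
print(round(average), round(peak))
print(servers_needed(30_000, 1_000))
```

Traffic is never spread evenly over the day, which is why the peak figure, not the average, should drive the server count.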
My experience with Django is minimal but I do remember in The Django Book they have a chapter where they interview people running some of the larger Django applications. Here is a link. I guess it could provide some insights.
It says curse.com is one of the largest Django applications with around 60-90 million page views in a month.
What's the "largest" site that's built on Django today? (I measure size mostly by user traffic)
Pinterest
disqus.com
More here: https://www.shuup.com/en/blog/25-of-the-most-popular-python-and-django-websites/
Can Django deal with 100,000 users daily, each visiting the site for a couple of hours?
Yes, provided you use a proper architecture and database design, caching, load balancers, and multiple servers or nodes.
Could a site like Stack Overflow run on Django?
Yes, you just need to follow the advice in the answer to the second question.
I don't think the issue is really about Django scaling.
I really suggest you look into your architecture; that's what will help you with your scaling needs. If you get that wrong, it doesn't matter how well Django performs. Performance != Scale. You can have a system that has amazing performance but does not scale, and vice versa.
Is your application database bound? If it is, then your scaling issues lie there as well. How are you planning on interacting with the database from Django? What happens when your database cannot process requests as fast as Django accepts them? What happens when your data outgrows one physical machine? You need to account for how you plan on dealing with those circumstances.
Moreover, what happens when your traffic outgrows one app server? How you handle sessions in this case can be tricky; more often than not you would probably require a shared-nothing architecture. Again, that depends on your application.
In short, the language is not what determines scale; a language is responsible for performance (again, depending on your application, different languages perform differently). It is your design and architecture that make scaling a reality.
I hope it helps, would be glad to help further if you have questions.
Another example is rasp.yandex.ru, a Russian transport timetable service. Its traffic satisfies your requirements.
If you have a site with some static content, then putting a Varnish server in front will dramatically increase your performance. Even a single box can then easily spit out 100 Mbit/s of traffic.
Note that with dynamic content, using something like Varnish becomes a lot more tricky.
I develop high-traffic sites using Django for the national broadcaster in Ireland. It works well for us. Developing a high-performance site is about more than just choosing a framework. A framework will only be one part of a system that is only as strong as its weakest link. Using the latest framework 'X' won't solve your performance issues if the problem is slow database queries or a badly configured server or network.
The question is not whether Django can scale.
The right approach is to understand which network design patterns and tools to put under your Django/Symfony/Rails project so that it scales well.
Some ideas:
Multiplexing.
Reverse proxy. Ex: Nginx, Varnish
Memory-cached sessions. Ex: Redis
Clustering your project and database for load balancing and fault tolerance. Ex: Docker
Use a third party to store assets. Ex: Amazon S3
Hope it helps a bit. This is my tiny rock to the mountain.
Even though there have been a lot of great answers here, I just feel like pointing out something that nobody has put emphasis on:
It depends on the application
If your application is light on writes, as in you are reading a lot more data from the DB than you are writing, then scaling Django should be fairly trivial. Heck, it comes with some fairly decent output/view caching straight out of the box. Make use of that, use, say, Redis as a cache provider, put a load balancer in front of it, spin up n instances, and you should be able to deal with a VERY large amount of traffic.
Now, if you have to do thousands of complex writes a second? Different story. Is Django going to be a bad choice? Well, not necessarily, depends on how you architect your solution really, and also, what your requirements are.
Just my two cents :-)
If you want to use open source then there are many options for you. But Python is the best among them, as it has many libraries and a super awesome community.
These are a few reasons which might change your mind:
Python is very good, but it is an interpreted language, which makes it slow. However, there are many accelerators and caching services that partly solve this problem.
If you are thinking about rapid development then Ruby on Rails is the best of all. The main motto of this framework (RoR) is to give developers a comfortable experience. If you compare Ruby and Python, both have nearly the same syntax.
Google App Engine is a very good service, but it binds you to its scope; you don't get the chance to experiment with new things. Instead, you can use the Digital Ocean cloud, which charges only $5/month for its simplest droplet. Heroku is another free service where you can deploy your product.
Yes! Yes! What you heard is totally correct, but here are some examples that use other technologies:
Rails: Github, Twitter (previously), Shopify, Airbnb, Slideshare, Heroku etc.
PHP: Facebook, Wikipedia, Flickr, Yahoo, Tumblr, Mailchimp etc.
The conclusion is that a framework or language won't do everything for you. Better architecture, design, and strategy will give you a scalable website. Instagram is the biggest example: a small team managing a huge amount of data. Here is a blog post about its architecture; it's a must-read.
You can definitely run a high-traffic site in Django. Check out this pre-Django 1.0 but still relevant post here: http://menendez.com/blog/launching-high-performance-django-site/
Check out this micro news aggregator called EveryBlock.
It's entirely written in Django. In fact they are the people who developed the Django framework itself.
Spreading the tasks evenly (in short, optimizing each and every aspect, including DBs, files, images, CSS etc.) and balancing the load across several other resources is necessary once your site/application starts growing. Or you make some more space for it to grow. Adopting technologies like CDNs and the cloud is a must for huge sites. Just developing and tweaking an application won't give you one hundred percent satisfaction; other components also play an important role.

list comprehension multiplying itself, and it isn't checking according to conditionals

I am trying to fix my condition, which should work like this: if any forbidden keyword is found in string or string_2, skip it; if no forbidden keyword is found but a word from skills is found, save it. However, the else part prints the result 10 times.
string = "opportunity: this opportunity would suit a budding hacker who is seeking a first step into a commercial role or a tester with 1-3 years of experience. this is a great opportunity to utilise your experience in penetration testing, vulnerability assessments and delivering outcomes while also expanding your knowledge and skillset. benefits: perform red team engagements excellent training & development budget attendance at local and international conferences responsibilities include: working with a diverse range of customers identify and solve security problems perform penetration testing and vulnerability assessments maintain and improve penetration testing and methodologies delivery of technical reports and documentation ideally you will have: ideally current security clearance or minimum australian citizenship certifications such as oscp, sans, crest highly regarded fluent with linux command line and windows powershell experience performing assessments on client networks ability to clearly communicate vulnerability details and risks for a confidential discussion about this opportunity or to discuss other opportunities within it security & risk please contact specialist infosec recruiter john smith on 0123 456 789 or email johnsmith#example.com. australian citizens only – ideally already with a security clearance. want to know more about me? connect with me on linkedin"
string_2 = "your new company this melbourne based consultancy boasts a unique depth and breadth of capabilities across cyber security, application security, data & analytics, cloud and digital transformations. they continue to deliver rich insight, innovative strategies and solutions that help their clients reach their potential. about the opportunity this is an outstanding opportunity to utilise your experience in penetration testing and vulnerability assessments. you will use your skills to prepare high quality reports detailing security issues, making recommendations and identifying solutions. the types of testing can include vulnerability assessment, penetration testing and application security assessment. what you’ll need to succeed passion, drive and enthusiasm! demonstrated experience performing internal and external penetration testing, web application penetration testing and mobile application penetration testing industry certifications such as sans, oscp, crest crt/cct or osce strong knowledge of common vulnerabilities such as owasp top 10 and sans top 25 scripting experience - javascript, objective c and python a very strong technical background and a passion for security the ability to think outside the box what you'll get in return our client is looking for an individual that is seeking longevity in their next role and in return offers the chance to join an equal opportunity employer that is passionate about diversity. also on offer is ongoing personal and professional development, providing you with the right tools and support to thrive. what you need to do now if you’re interested in this role, click ‘apply now’ or for more information and a confidential discussion on this role or any others within it security contact john smith at johnsmith#example.com"
forbidden = ['clearance','TS/SCI','4+ years','5+ years','6+ years','7+ years','8+ years','9+ years','10+ years','11+ years','12+ years']
skills = ['owasp']
for s_prefix in forbidden:
    if s_prefix in string:
        print(s_prefix)
    else:
        print("save it")
skill_match = [s_prefix for s_prefix in forbidden if s_prefix in string]
print(skill_match)
if len(skill_match) > 0 :
    pass
I am getting save it printed multiple times. Once clearance is found, the text should be marked as flagged; if no red-flagged keyword is found and a keyword from skills is found, it should be saved.
clearance
save it
save it
save it
save it
save it
save it
save it
save it
save it
save it
['clearance']
[Finished in 0.0s]
sample:
string = "snip active cleared snip..." # skip or remove because contains cleared
string2 = "snip owasp..... php , devops" # save it because contains owasp
If you only want one line to be printed in your for loop, you probably need to change its logic, since currently it always prints something on every iteration.
One approach might be to print and break out of the loop if you spot one of the strings you're searching for, and to attach the else clause to the for loop instead of to the if. An else on a loop gets run only if the loop ended normally, not if it was escaped early by a break:
for s_prefix in forbidden:
    if s_prefix in string:
        print(s_prefix)
        break
else:
    print("save it")
If you don't need to print the matching prefix string, you could also play around with any or all.
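As a rough sketch of the any approach: the classify function and its "skip" / "save" / "ignore" return values are names I made up for illustration, and the sample texts are shortened. Lowercasing both sides is my own assumption; without it an uppercase keyword like 'TS/SCI' could never match the lowercase texts from the question.

```python
def classify(text, forbidden, skills):
    """Return 'skip' if a forbidden keyword appears, 'save' if a skill does."""
    lowered = text.lower()
    # any() stops at the first match, so each check yields exactly one result.
    if any(word.lower() in lowered for word in forbidden):
        return "skip"
    if any(word.lower() in lowered for word in skills):
        return "save"
    return "ignore"  # neither a forbidden keyword nor a skill was found


forbidden = ['clearance', 'TS/SCI', '4+ years', '5+ years']
skills = ['owasp']

string = "ideally current security clearance or minimum australian citizenship"
string_2 = "strong knowledge of common vulnerabilities such as owasp top 10"

print(classify(string, forbidden, skills))    # skip
print(classify(string_2, forbidden, skills))  # save
```

Because the forbidden check runs first, a text containing both a forbidden keyword and a skill is still skipped.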
