I recently began teaching myself data analysis and machine learning and quickly ran into my first issue.
I have data from a REST API stored as JSON. My dataset is a folder of nearly 350,000 text files containing the JSON returned by the Riot API match endpoint (I store League of Legends games), totalling 11 GB of uncompressed text. The file names are the match IDs.
Obviously I can't load all of that into memory (8 GB) to analyze it or to handle it with scikit-learn. And even if I could, parsing it is extremely slow (getting the number of soloQ games, the average win ratio of champions, and so on). I've been told to store the data in a SQLite database, but I haven't really decided what to do. SQLite should be fine, since future analyses won't need all the features, so I could SELECT only what I need.
What is the best approach, and what should I know beforehand? Is there any essential data-analysis knowledge I'm missing?
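One way to make this manageable is to walk the folder once and load each match into SQLite incrementally, so nothing ever has to fit in memory. A minimal sketch, assuming each file holds a single match object; the file, table, and column names (e.g. gameMode) are purely illustrative:

import json
import sqlite3
from pathlib import Path

# File/table names below are assumptions for illustration.
conn = sqlite3.connect("matches.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS matches (
        match_id  TEXT PRIMARY KEY,
        game_mode TEXT,
        raw_json  TEXT
    )
""")

for path in Path("matches/").glob("*"):        # one file per match, named by match ID
    with open(path, encoding="utf-8") as f:
        match = json.load(f)
    conn.execute(
        "INSERT OR IGNORE INTO matches VALUES (?, ?, ?)",
        (path.stem, match.get("gameMode"), json.dumps(match)),
    )

conn.commit()
conn.close()

After that, aggregate questions (soloQ counts, average win ratios) become single SELECT/GROUP BY queries instead of another pass over 11 GB of text.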
I handed my project to an inexperienced programmer, and I've since been told that he "has not used database indexing", and that this is what is using massive CPU on my server.
That seems right: my SQL CPU consumption is huge for such a small project with at most 300 requests/s.
Is it possible to convert that database somehow to a more efficient version?
If it is, could you help me with this?
(The project is a Python Telegram bot, and I already know the basics of Python.)
I used to load everything from the database into a Python variable, process it there (a lot faster and with less CPU usage), and update the variable every time the database changed.
But I need more efficiency, and doing this for every part of the bot is very hard (I've only done it for a few of the bot's commands).
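For reference, the usual fix is simply to add indexes on the columns your queries filter on; a minimal sketch, assuming SQLite and a hypothetical users table keyed by telegram_id:

import sqlite3

conn = sqlite3.connect("bot.db")   # database file name is an assumption

# Index the column used in WHERE clauses so lookups stop scanning the whole table.
conn.execute("CREATE INDEX IF NOT EXISTS idx_users_telegram_id ON users (telegram_id)")
conn.commit()

# EXPLAIN QUERY PLAN shows whether a query now uses the index instead of a full table scan.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE telegram_id = ?", (12345,)
).fetchall())

With the right indexes in place, the "load everything into a Python variable" workaround is usually no longer needed.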
I am new to web scraping with Python.
My question is somewhat theoretical: if I want to write a script that scans 100k+ pages of a website, what is the best way to approach such a problem?
I believe that sending a new request for each web page and then reading the data from it would be very time- and resource-consuming. Is there a way around that?
I have already tried the naïve approach of sending a new HTTP request every time; it somewhat worked but was very slow (scanning 3k pages took over 3 hours).
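Most of that time is spent waiting on the network, so sending requests concurrently usually helps far more than anything else; a minimal sketch using a thread pool and the requests library (the URLs and worker count are placeholders):

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 1001)]   # placeholder URLs

def fetch(url):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return url, resp.text

results = {}
with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(fetch, u) for u in urls]
    for future in as_completed(futures):
        url, html = future.result()
        results[url] = html

asyncio with aiohttp scales even further, but a thread pool is the smallest change from the naïve loop; either way, respect the site's rate limits and robots.txt.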
I am looking for a way to generate random data for testing purposes (around 10,000 files). For this testing I have a JSON schema.
Generating the JSchema:
void RunTest()
{
    // Build a JSON schema from the TestClass type (Newtonsoft.Json.Schema)
    JSchemaGenerator generator = new JSchemaGenerator();
    JSchema schemaBuilding = generator.Generate(typeof(TestClass));
}
The code itself is in C#, so ideally I would have a C# solution, though a Python solution is also acceptable. I have found a number of questions on this topic that point to websites or focus only on a single fixed sample, but I can't find how to do this in C# (or, less preferably, Python). Does anybody have a good way of doing this?
As for the reason for doing this, it's twofold: (1) it tests the stability of the system by feeding in a lot of random data, looking for edge cases we haven't thought of, and (2) it's a load test (so basically a smoke + load test).
In Oxygen Developer there is a tool that allows you to generate random JSON files from a JSON Schema, but you need to do this manually from an interface. The action can be found in the Tools menu and it opens a dialog box where you can configure various options for generating the JSON instances.
You can find more details in the user manual: https://www.oxygenxml.com/doc/versions/23.0/ug-editor/topics/json-schema-instance-generator-2.html
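If it has to run unattended for ~10,000 files, a programmatic generator also works. Below is a minimal Python sketch that handles only a small subset of JSON Schema (object, array, string, integer, number, boolean) and is not a full implementation; the file names are placeholders:

import json
import random
import string

def random_instance(schema):
    # Generate a random value for a (very small) subset of JSON Schema.
    t = schema.get("type")
    if t == "object":
        return {name: random_instance(sub)
                for name, sub in schema.get("properties", {}).items()}
    if t == "array":
        return [random_instance(schema["items"]) for _ in range(random.randint(0, 5))]
    if t == "string":
        return "".join(random.choices(string.ascii_letters, k=random.randint(1, 12)))
    if t == "integer":
        return random.randint(-1000, 1000)
    if t == "number":
        return random.uniform(-1000.0, 1000.0)
    if t == "boolean":
        return random.choice([True, False])
    return None   # keywords like enum, $ref, required are ignored in this sketch

with open("test_class.schema.json", encoding="utf-8") as f:   # schema exported from the JSchema
    schema = json.load(f)

for i in range(10000):
    with open(f"testdata_{i}.json", "w", encoding="utf-8") as f:
        json.dump(random_instance(schema), f)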
Is it okay to use Python pandas to manipulate tabular data within a Flask/Django web application?
My web app receives blocks of data which we visualise in chart form on the web. We then provide the user with some data-manipulation operations, like sorting the data or deleting a given column. We have our own custom code to perform these operations, but it would be much easier to do with pandas; however, I'm not sure whether that is a good idea or not.
It's a good question. pandas can be used if the dataset isn't too big; if the dataset is really big, I think you could use Spark DataFrames or RDDs, and if the data grows over time you could consider streaming it with PySpark.
Actually yes, but don't forget to move your computation into a separate process if it takes too long.
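For reference, the operations described map directly onto pandas calls; a minimal sketch (the column names are hypothetical):

import pandas as pd

# A block of data as it might arrive, already parsed from the request body
df = pd.DataFrame({"name": ["b", "a", "c"], "score": [2, 3, 1], "notes": ["x", "y", "z"]})

df = df.sort_values("score", ascending=False)   # sort the data
df = df.drop(columns=["notes"])                 # delete a given column

payload = df.to_dict(orient="records")          # ready to serialize for the chart

Keeping this per-request and stateless, and moving anything heavy into a separate process as suggested above, keeps the Flask/Django request cycle responsive.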
I'm currently working on my first website using the Django framework. Major parts of my content are fetched from a third-party API, which requires three API requests in order to fetch all the data I need.
My problem is that this slows down performance a lot: my page load time is about 1-2 seconds, which I don't find satisfying at all.
I'm looking for a few alternatives/best practices for this kind of scenario. What would one do to speed up page load times? So far, I've been thinking of running a cron job in the background which calls the API for all users that are currently logged in and stores the data in my local database, which has a much faster response time.
The other alternative would be loading the API data separately and adding it to the page once it has been loaded; however, I don't know at all how this would work.
Any other ideas or tips on how I can improve this?
Thank you!
Tobias
A common practice is to build a cache: first look for the data in your local database, and if it doesn't exist, call the API and save the result.
Without more information it's impossible to write a working example.
You could write a custom method that does it all in one place.
import requests

def call_data(api_id):
    try:
        # Return the cached copy if we already have it
        data = DataModel.objects.get(api_id=api_id)
    except DataModel.DoesNotExist:
        # Otherwise fetch from the API and store the result locally
        payload = requests.get("http://api-call/").json()
        data = DataModel.objects.create(api_id=api_id, **payload)
    return data
This is just an example, not production code; at the very least it needs some validation of the API response.
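Along the same lines, Django's built-in cache framework avoids the need for a dedicated model; a minimal sketch (the endpoint URL, cache key, and timeout are assumptions):

import requests
from django.core.cache import cache

def get_api_data(resource_id):
    key = f"api-data-{resource_id}"
    data = cache.get(key)
    if data is None:
        # Cache miss: hit the third-party API once, then keep the result for 10 minutes
        resp = requests.get(f"https://api.example.com/resource/{resource_id}", timeout=5)
        data = resp.json()
        cache.set(key, data, timeout=600)
    return data

Whether you cache in a model or in Django's cache, the page only pays the 1-2 second API cost on a cache miss.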