pd.read_table for .dat fills null values - python

I am trying to learn data analysis using "Python for Data Analysis" by Wes McKinney.
There is a .dat file with the following data :
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
I'm trying to import it using:
unames=['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('D:/INSOFE/Python_practice/users.dat', sep='::', header=None,names=unames,engine='python')
But the resulting DataFrame shows nulls (NaN) instead of the data.
Please let me know what I'm doing wrong.

The read_table method expects relatively clean data; if you've simply saved the web page containing the table (cf. the clarifying comments), you will end up with a file full of HTML, which pandas will not know what to do with.
Instead, you will want to get the raw contents of the file. In principle you could simply copy the 6040 lines from GitHub into your favorite text editor and save the contents as users.dat.
GitHub makes your life a bit simpler than that by supplying a view of the raw data as well.
With that, if you choose to save the file, most browsers (including e.g. Firefox) will produce a proper users.dat with only the data. Command line tools such as wget or curl allow you to get at the same data without having to use a fully-fledged browser.
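For example, a minimal sketch, assuming you fetch the raw file with requests (the raw URL below is a placeholder for the "Raw" link on GitHub):

import pandas as pd
import requests

# Placeholder: substitute the actual "Raw" link for users.dat on GitHub
raw_url = 'https://raw.githubusercontent.com/<user>/<repo>/master/users.dat'

# Save the raw contents locally, then read the file as before
with open('users.dat', 'wb') as f:
    f.write(requests.get(raw_url).content)

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('users.dat', sep='::', header=None,
                      names=unames, engine='python')
print(users.head())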

Related

Accessing Hovertext with html

I am trying to access hover text found on graph points at this site (bottom):
http://matchhistory.na.leagueoflegends.com/en/#match-details/TRLH1/1002200043?gameHash=b98e62c1bcc887e4&tab=overview
I have the full site html but I am unable to find the values displayed in the hover text. All that can be seen when inspecting a point are x and y values that are transformed versions of these values. The mapping can be determined with manual input taken from the hovertext but this defeats the purpose of looking at the html. Additionally, the mapping changes with each match history so it is not feasible to do this for a large number of games.
Is there any way around this?
Thank you.
Explanation
Nearly everything on this webpage is loaded via JSON through JavaScript. We don't even have to request the original page. You will, however, have to piece the page back together via the ids of items, masteries, etc., which won't be too hard, because you can request masteries similarly to how we fetch items below.
So, I went through the network tab in inspect and I noticed that it loaded the following JSON formatted URL:
https://acs.leagueoflegends.com/v1/stats/game/TRLH1/1002200043?gameHash=b98e62c1bcc887e4
Notice that it contains a gameHash and the game id (the same ones as in the link you posted). This response contains everything you need to rebuild the page, provided you also fetch the JSON files it relies on.
Dealing with JSON
You can use json.loads in Python to load it, but a great tool I would recommend is:
https://jsonformatter.curiousconcept.com/
You copy and paste JSON in there and it will help you understand the data structure.
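For instance, a minimal sketch that loads the match-stats JSON linked above and inspects its top-level structure (nothing beyond the URL itself is assumed about the field names inside):

import requests, json

stats_url = ('https://acs.leagueoflegends.com/v1/stats/game/TRLH1/1002200043'
             '?gameHash=b98e62c1bcc887e4')
stats = json.loads(requests.get(stats_url).text)

# See which top-level keys the response exposes before digging further
print(list(stats.keys()))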
Fetching items
The webpage loads all this information via a JSON file:
https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json
It contains all of the information and tooltips about each item in the game. You can access your desired item via theirJson['data']['1001']; each item image on the page is named after its id (1001 in this example).
For instance, for 'Boots of Speed':
import requests, json

# Fetch the item data file and look up 'Boots of Speed' by its item id
itemJson = json.loads(requests.get('https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json').text)
print(itemJson['data']['1001'])
An alternative: Selenium
Selenium could be used for this. You should look it up; it has been ported to several programming languages, one being Python. It may work as you want it to here, but I sincerely think the JSON method described above, although a little more convoluted, will perform faster (and speed, based on your post, seems to be an important factor).

Download/read GovData with R

I stumbled over the following site and I wanted to download the data for the digital elevation model for the waterways.
https://www.govdata.de/web/guest/daten/-/details/1c669080-c804-11e4-8731-1681e6b88ec1bkg
Now, I have following problem, I do not understand how I can download the data.
Anybody knows how I could download the data, e.g. by using the programming language R or Python.
You will need to be on the webpage where the data is stored, not the webpage with the links to the data. Depending on what format the data is in, you will need to change sep = '\t' to fit your needs; for example, a CSV would use sep = ','.
You will then need to fine-tune the formatting.
library(RCurl)
urlcontent <- getURL('https://www.govdata.de/web/guest/daten/-/details/1c669080-c804-11e4-8731-1681e6b88ec1bkg')
DATA <- read.table(textConnection(urlcontent), header = TRUE, sep = '\t')
Note that read.table may only work with a TSV-style page; you will need to fine-tune the reading of the page based on its formatting.
EDIT:
Using the dataset's API link as the URL, I was able to fetch the page, but I then hit an access error: I do not have permission to download the data. This may be another error in the code, or an actual credential problem on the website's side.
library(RCurl)
urlcontent <- getURL('https://www.govdata.de/ckan/api/rest/dataset/1c669080-c804-11e4-8731-1681e6b88ec1bkg')
DATA <- read.table(textConnection(urlcontent), header = TRUE, sep = '\t')
Error: You don't have permission to access this server
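If Python is an option instead, here is a rough sketch along the same lines, assuming the CKAN endpoint above returns the usual 'resources' list (those field names are an assumption about a typical CKAN response, not something verified against this dataset):

import requests

api_url = ('https://www.govdata.de/ckan/api/rest/dataset/'
           '1c669080-c804-11e4-8731-1681e6b88ec1bkg')
resp = requests.get(api_url)
resp.raise_for_status()  # will surface the same permission error seen above, if any
dataset = resp.json()

# 'resources'/'format'/'url' are the usual CKAN fields for downloadable files;
# adjust if this portal structures its metadata differently
for resource in dataset.get('resources', []):
    print(resource.get('format'), resource.get('url'))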

PYPDFTK Error 32

I am back at coding above my head again, producing a program that will automatically fill out some partner PDFs with our employees' information. Currently, we have a process that involves an over-40-page PDF which fills itself in automatically with information once you make it through the first couple of pages.
What we are looking to build is a UI that allows the employee to type their info once, have it pumped through the 40 pages filling in all the key form fields, then break the PDF up appropriately and file the pieces to the correct folders for compliance.
90% of this I have experience coding, from the UI to splitting up a completed file, but my problem is working with the PDF to fill in forms. I have experience using tools such as PDFMiner and PDFQuery to scrape a PDF, but I am stuck on entering data into one.
I am currently attempting to use PyPDFTK, but when setting it up via their example I can't even clear the first step, as the temp file the code appears to be trying to access is not accessible. See the basic example code:
import pypdftk

datas = {
    'firstname': 'Julien',
    'company': 'revolunet',
    'price': 42
}
generated_pdf = pypdftk.fill_form('main.pdf', datas)
It keeps producing an error 32 and I can't figure out why. Is this the best option, and if so, how can I try to remedy this?
Thank you,
Andy.
pypdftk author here; can you please paste the traceback?
Generally this means the pdftk binary was not found; you can change its location with the PDFTK_PATH environment variable. See https://github.com/revolunet/pypdftk/blob/master/README.md#pdftk-path
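A minimal sketch of that suggestion (the pdftk path below is only an example for a typical Windows install; point PDFTK_PATH at wherever pdftk is actually installed on your machine):

import os

# Tell pypdftk where the pdftk binary lives; set this before importing pypdftk.
# The path below is only an example -- adjust it for your install.
os.environ['PDFTK_PATH'] = r'C:\Program Files (x86)\PDFtk\bin\pdftk.exe'

import pypdftk

datas = {'firstname': 'Julien', 'company': 'revolunet', 'price': 42}
generated_pdf = pypdftk.fill_form('main.pdf', datas)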

Properly watch websites for updates

I wrote a script that I'm using to push updates to Pushbullet channels whenever a new Nexus factory image is released. A separate channel exists for each of the first 11 devices on that page, and I'm using a rather convoluted script to watch for updates. The full setup is here (specifically this script), but I'll briefly summarize the script below. My question is this: This is clearly not the correct way to be doing this, as it's very susceptible to multiple points of failure. What would be a better method of doing this? I would prefer to stick with Python, but I'm open to other languages if they would be simpler/better.
(This question is prompted by the fact that I updated my Apache 2.4 config tonight, and it apparently triggered a slight change in the output of the local files that are watched by urlwatch, so ALL 11 channels got an erroneous update pushed to them.)
Basic script functionality (some nonessential parts are not included):
Create dictionary of each device codename associated with its full model name
Get existing Nexus Factory Images page using Requests
Make bs4 object from source code
For each of the 11 devices in the dictionary (loop), do the following:
Open/create page in public web directory for the device
Write source to that page, filtered using bs4: str(soup.select("h2#" + dev + " ~ table")[0])
Call urlwatch on the page to check for updates, save output to temp file
If temp file size is > 0 then the page has changed, so push update to the appropriate channel
Remove webpage and temp file
A thought I had while typing this question: would a possible solution be to save each current version string (for example 5.1.0 (LMY47I)) as a pickled variable, and then, if urlwatch detects a difference, compare the new version string to the pickled one and only push if they differ? I would throw in regex matching as well, to ensure that the new string matches the old format and just has updated data. Could this at least be a good temporary measure to prevent future false alarms?
Scraping is inherently fragile, but if they don't change the source format it should be pretty straightforward in this case. You should parse the webpage into a data structure. Using bs4 is fine for this. The end result should be a python dictionary:
{
    'mantaray': {
        '4.2.2 (JDQ39)': {'link': 'https://...'},
        '4.3 (JWR66Y)': {'link': 'https://...'},
    },
    ...
}
Save this structure with json.dumps. Now every time you parse the page you can generate a similar data structure and compare it to the one you have on disk (update the saved one each time after you are done).
Then the only part left is comparing the data structures. You can iterate over all models and check that each version in the current version of the page exists in the previous version. If it does not, you have a new version.
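A minimal sketch of that save-and-compare step (scrape_current_images and the state-file name are placeholders for however you build the dictionary with bs4):

import json
import os

STATE_FILE = 'nexus_images.json'  # placeholder path for the saved snapshot

def load_previous():
    if not os.path.exists(STATE_FILE):
        return {}
    with open(STATE_FILE) as f:
        return json.load(f)

def find_new_versions(current, previous):
    # Yield (device, version) pairs present now but missing from the last run
    for device, versions in current.items():
        known = previous.get(device, {})
        for version in versions:
            if version not in known:
                yield device, version

current = scrape_current_images()   # placeholder: your bs4 parsing step
previous = load_previous()

for device, version in find_new_versions(current, previous):
    print('New image for %s: %s' % (device, version))  # push to Pushbullet here

with open(STATE_FILE, 'w') as f:
    json.dump(current, f)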
You can also potentially generate an easy to use API for this using https://www.kimonolabs.com/ instead of doing the parsing yourself.

Form through Email that can be Parsed Using Python

I want to email out a document that will be filled in by many people and emailed back to me. I will then parse the responses using Python and load them into my database.
What is the best format to send out the initial document in?
I was thinking of an interactive .pdf, but I do not want to have to pay for Adobe XI. Alternatively, maybe an .html file, but I'm not sure how easy it is to save its state once it's been filled in so it can be emailed back to me. A .xls file may also be a solution, but I'm leaning away from it simply because it would not be a particularly professional-looking format.
The key points are:
Answers can be easily parsed using Python
The format should be common enough to open on most computers
The document should look relatively pleasing to the eye
Send them a web page with a FORM section, complete with some JavaScript to grab the contents of the controls and send them to you (e.g. in JSON format) when they press "submit".
Another option is to set it up as a web application. There are several Python web frameworks that could be used for that. You could then e-mail people a link to the web-app.
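As a rough sketch of that idea, using Flask as just one possible framework (the route, field handling, and output file are illustrative, not prescribed):

from flask import Flask, request, jsonify
import json

app = Flask(__name__)

@app.route('/submit', methods=['POST'])
def submit():
    # Collect the posted form fields and append them to a file;
    # in practice you would insert them into your database instead
    answers = request.form.to_dict()
    with open('responses.jsonl', 'a') as f:
        f.write(json.dumps(answers) + '\n')
    return jsonify(status='ok')

if __name__ == '__main__':
    app.run()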
Why don't you use Google Docs for the form? Create the form in Google Docs and save the answers to a spreadsheet, then use any Python Excel reader (Google them) to read the file. This way you don't need to parse through mails, and it will be performance-friendly too. Or you could just make a simple form using App Engine and save the data directly to the database.
