If I have a list of URLs, is it possible to parse them in Python and extract the key/value pairs of these server calls, without opening a browser manually, and save them to a local file?
The only library I found for CSV is pandas, but I found nothing for the first part. Any example would be perfect.
You can investigate one of the built-in or third-party libraries that let Python perform browser-like operations and record the results. Filter those results, then use the built-in csv library to write them out.
You will probably need one of the lower level libraries:
urllib/urllib2/urllib3
And you may need to override one or more of its methods to record the transaction data that you are looking for.
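If the key/value pairs you want are already in the query strings of the URLs in your list, you may not need to fetch anything. Here is a minimal sketch using only the standard library. The function names and the column names (url, host, key, value) are made up for illustration, and this does not capture requests that a page fires off after it loads.

```python
import csv
from urllib.parse import urlsplit, parse_qsl


def url_params_to_rows(urls):
    """Return one dict per query parameter found in each URL."""
    rows = []
    for url in urls:
        parts = urlsplit(url)
        # keep_blank_values=True so that "key=" is not silently dropped
        for key, value in parse_qsl(parts.query, keep_blank_values=True):
            rows.append({"url": url, "host": parts.netloc,
                         "key": key, "value": value})
    return rows


def save_rows(rows, path):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "host", "key", "value"])
        writer.writeheader()
        writer.writerows(rows)
```

You would call url_params_to_rows(your_list) and pass the result to save_rows together with an output filename.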
I recently started coding, but took a brief break. I started a new job and I'm under some confidentiality restrictions. I need to make sure Python and pandas are secure before I do this. I'll also be talking with IT on Monday.
I was wondering whether pandas is a purely local library, or whether data gets sent to or from somewhere else. If I write something with pandas, will the data be stored somewhere by pandas?
The best example of what I'm doing is a Medium article about pulling data out of tables that don't have CSV exports.
https://medium.com/#ageitgey/quick-tip-the-easiest-way-to-grab-data-out-of-a-web-page-in-python-7153cecfca58
Creating a DataFrame out of a dict, doing vectorized operations on its rows, printing out slices of it, etc. are all completely local. I'm not sure why this matters. Is your IT department going to say, "Well, this looks fishy—but some random guy on the internet says it's safe, so forget our policies, we'll allow it"? But, for what it's worth, you have this random guy on the internet saying it's safe.
However, Pandas can be used to make network requests. Some of the IO functions can take a URL instead of a filename or file object. Some of them can also use another library that does so. For example, if you have lxml installed, read_html will pass the filename to lxml to open, and if that filename is an HTTP URL, lxml will go fetch it.
This is rarely a concern, but if you want to get paranoid, you could imagine ways in which it might be.
For example, let's say your program is parsing user-supplied CSV files and doing some data processing on them. That's safe; there's no network access at all.
Now you add a way for the user to specify CSV files by URL, and you pass them into read_csv and go fetch them. Still safe; there is network access, but it's transparent to the end user and obviously needed for the user's task; if this weren't appropriate, your company wouldn't have asked you to add this feature.
Now you add a way for CSV files to reference other CSV files: if column 1 is #path/to/other/file, you recursively read and parse path/to/other/file and embed it in place of the current row. Now, what happens if I can give one of your users a CSV file where, buried at line 69105, there's #http://example.com/evilendpoint?track=me (an endpoint which does something evil, but then returns something that looks like a perfectly valid thing to insert at line 69105 of that CSV)? Now you may be facilitating my hacking of your employees, without even realizing it.
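If you ever did build that kind of file-reference feature, one way to keep it from reaching out to the network is to check each reference before following it. This is only a sketch under my own assumptions: is_remote_reference is a hypothetical helper, and the list of schemes is illustrative, not complete.

```python
from urllib.parse import urlsplit

# Schemes treated as "network" here; an illustrative choice, not a full list.
REMOTE_SCHEMES = {"http", "https", "ftp"}


def is_remote_reference(ref):
    """Return True if ref looks like a network URL rather than a local path."""
    return urlsplit(ref).scheme.lower() in REMOTE_SCHEMES
```

The recursive reader would then skip, or raise on, any reference for which this returns True.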
Of course this is a more limited version of exactly the same functionality that's in every web browser with HTML pages. But maybe your IT department has gotten paranoid and clamped down security on browsers and written an application-level sniffer to detect suspicious followup requests from HTML, and haven't thought to do the same thing for references in CSV files.
I don't think that's a problem a sane IT department should worry about. If your company doesn't trust you to think about these issues, they shouldn't hire you and assign you to write software that involves scraping the web. But then not every IT department is sane about what they do and don't get paranoid about. ("Sure, we can forward this under-1024 port to your laptop for you… but you'd better not install a newer version of Firefox than 16.0…")
I have main.py,header.py and var.py
header.py
import var
class table():
    def __init__(self, name):
        self.name = name
var.py
month = "jen"
table = "" # tried to make empty container which can save table instance but don't know how
main.py
import header
import var
var.table = header.table(var.month)
var.month = "feb"
After this program ends, I want the modified var.table and var.month to be saved in var.py.
When your program ends, all your values are lost—unless you save them first, and load them on the next run. There are a variety of different ways to do this; which one you want depends on what kind of data you have and what you're doing with it.
The one thing you never, ever want to do is print arbitrary objects to a file and then try to figure out how to parse them later. If the answer to any of your questions is ast.literal_eval, you're saving things wrong.
One important thing to consider is when you save. If someone quits your program with ^C, and you only save during clean shutdowns, all your changes are gone.
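As a rough sketch of "load on start, save even on ^C", assuming your state fits in a plain dict and using JSON (covered below) as the format. The names load_state and run_and_save are mine, not from any library.

```python
import json
import os


def load_state(path, default):
    """Return the saved state, or a copy of the default on the first run."""
    if not os.path.exists(path):
        return dict(default)
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def run_and_save(state, path, work):
    """Run work(state) and save state afterwards, even on ^C or an error."""
    try:
        work(state)
    finally:
        # The finally block also runs while a KeyboardInterrupt propagates.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(state, f)
```

This won't help if the process is killed outright, so for anything important you may also want to save periodically while running.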
Numpy/Pandas
Numpy and Pandas have their own built-in functions for saving data. See the Numpy docs and Pandas docs for all of the options, but the basic choices are:
Text (e.g., np.savetxt): Portable formats, editable in a spreadsheet.
Binary (e.g., np.save): Small files, fast saving and loading.
Pickle (see below, but also builtin functions): Can save arrays with arbitrary Python objects.
HDF5. If you need HDF5 or NetCDF, you probably already know that you need it.
List of strings
If all you have is a list of single-line strings, you just write them to a file and read them back line by line. It's hard to get simpler, and it's obviously human-readable.
If you need a short name for each value, or need separate sections, but your values are still all simple strings, you may want to look at configparser for CFG/INI files. But as soon as you get more complicated than that, look for a different format.
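For the configparser case, a minimal sketch might look like this. The section name "state" and the keys month and table are assumptions based on the question; note that everything comes back as a string.

```python
import configparser


def save_settings(path, month, table_name):
    """Write two simple string values to an INI-style file."""
    config = configparser.ConfigParser()
    config["state"] = {"month": month, "table": table_name}
    with open(path, "w", encoding="utf-8") as f:
        config.write(f)


def load_settings(path):
    """Read the values back as a plain dict of strings."""
    config = configparser.ConfigParser()
    config.read(path, encoding="utf-8")
    return dict(config["state"])
```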
Python source
If you don't need to save anything, only load data (that your users might want to edit), you can use Python itself as a format—either a module that you import, or a script file that you exec. This can of course be very dangerous, but for a config file that's only being edited by people who already have your entire source code on their computer, that may not be a problem.
JSON and friends
JSON can save a single dict or list to a file and load it back. JSON is built into the Python standard library, and most other languages can also load and save it. JSON files are human-editable, although not beautiful.
JSON dicts and lists can be nested structures with other dicts and lists inside, and can also contain strings, ints, floats, bools, and None, but nothing else. You can extend the json library with converters for other types, but it's a bit of work.
YAML is (almost) a superset of JSON that's easier to extend, and allows for prettier human-editable files. It doesn't have builtin support in the standard library, but there are a number of solid libraries on PyPI, like ruamel.yaml.
Both JSON and YAML can only save one dict or list per file. (The library will let you save multiple objects, but you won't be able to load them back, so be careful.) The simplest way around this is to create one big dict or list with all of your data packed into it. But JSON Lines allows you to save multiple JSON dicts in a single file, at the cost of human readability. You can load it just by for line in file: obj = json.loads(line), and you can save it with just the standard library if you know what you're doing, but you can also find third-party libraries like json-lines to do it for you.
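To make the JSON Lines idea concrete, here is a small standard-library sketch. The helper names are hypothetical; the only trick is that json.dumps without an indent argument never emits a newline, so one object fits on one line.

```python
import json


def write_jsonl(path, objects):
    """Write each object as one JSON document per line."""
    with open(path, "w", encoding="utf-8") as f:
        for obj in objects:
            f.write(json.dumps(obj) + "\n")


def read_jsonl(path):
    """Load every non-blank line back into a Python object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```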
Key-value stores
If what you want to store fits into a dict, but you want to have it on disk all the time instead of explicitly saving and loading, you want a key-value store.
dbm is an old but still functional format, as long as your keys and values are all small-ish strings and you don't have tons of them. Python makes a dbm look like a dict, so you don't need to change most of your code at all.
shelve extends dbm to let you save arbitrary values instead of just strings. It does this by using Pickle (see below), meaning it has the same safety issues, and it can also be slow.
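A quick sketch of shelve, assuming a single-process program. remember and recall are made-up helper names, and shelve may create one or more files based on the path you give it, depending on the dbm backend.

```python
import shelve


def remember(path, key, value):
    """Store a picklable value under a string key."""
    with shelve.open(path) as db:
        db[key] = value


def recall(path, key, default=None):
    """Fetch a stored value, or the default if the key is missing."""
    with shelve.open(path) as db:
        return db.get(key, default)
```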
More powerful key-value stores (and related things) are generally called NoSQL databases. There are lots of them nowadays; Redis is one of the popular choices. There's more to learn, but it can be worth it.
CSV
CSV stands for "comma-separated values", although there are variations that use whitespace or other characters. CSV is built into the standard library.
It's a great format when you have a list of objects all with the same fields, as long as all of the members are strings or numbers. But don't try to stretch it beyond that.
CSV files are just barely human-editable as text—but they can be edited very easily in spreadsheet programs like Excel or Google Sheets.
Pickle
Pickle is designed to save and load just about anything. This can be dangerous if you're reading arbitrary pickle files supplied by users, but it can also be very convenient. Pickle actually can't quite save and load everything unless you do a lot of work to add support to some of your types, but there's a third-party library named dill that extends support a lot further.
Pickle files are not at all human-readable, and are only compatible with Python, and sometimes not even with older versions of Python.
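For completeness, a minimal pickle round trip could look like the following. The helper names are mine. The usual caveat applies: only load pickle files you created yourself.

```python
import pickle


def save_object(obj, path):
    """Serialize any picklable object to a binary file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Load an object back; unsafe on files from untrusted sources."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Unlike JSON, this handles things like sets and tuples without any extra work.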
SQL
Finally, you can always build a full relational database. This isn't quite as scary as it sounds.
Python has a database called sqlite3 built into the standard library.
If that looks too complicated, you may want to consider SQLAlchemy, which lets you store and query data without having to learn the SQL language. Or, if you search around, there are a number of fancier ORMs, and libraries that let you run custom list comprehensions directly against databases, and so on.
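To give a feel for how little sqlite3 code a simple case needs, here is a sketch that uses a hypothetical settings table as a persistent key/value store. The table and function names are assumptions for illustration.

```python
import sqlite3


def open_store(path):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT)"
    )
    return conn


def set_value(conn, key, value):
    """Insert or overwrite one key; the with-block commits on success."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
            (key, value),
        )


def get_value(conn, key, default=None):
    """Return the stored value, or the default if the key is missing."""
    row = conn.execute(
        "SELECT value FROM settings WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else default
```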
Other formats
There are zillions of other standards out there for data files; a few even come with support in the standard library. They can be useful for special cases: plist files match what Apple uses for preferences on macOS and iOS; netrc files are a long-established way to store a list of server logins; XML is perfect if you have a time machine that can only travel to the year 2000; etc. But usually, you're better off using one of the common formats mentioned above.
I'm using urllib to open one site and get some information on it.
Is there a way to "open" this site only to the part I need and discard the rest (discard I mean don't open/load the rest)?
I'm not sure what you are trying to do. If you are simply trying to parse the site to find the useful "information", then I recommend using the library BeautifulSoup. That library makes it easy to keep certain parts of the site while discarding the rest.
If, however, you are trying to save download bandwidth by downloading only a piece of the site, then you will need to do a lot more work. If that is the case, please say so in your question and I'll update the answer.
You should be able to call read(bytes) instead of read(); this reads a given number of bytes instead of everything. Append each chunk to the bytes already downloaded and check whether they contain what you're looking for. Once they do, you should be able to stop the download with .close().
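That read-a-bit-at-a-time approach could be sketched like this. It assumes response is any object with read(n) and close(), such as what urllib.request.urlopen returns. The function name, marker, chunk_size, and max_bytes are all illustrative choices of mine.

```python
def read_until(response, marker, chunk_size=1024, max_bytes=1_000_000):
    """Read chunks until marker (bytes) appears, then stop and close."""
    data = b""
    while marker not in data and len(data) < max_bytes:
        chunk = response.read(chunk_size)
        if not chunk:
            # End of the response; the marker was never found.
            break
        data += chunk
    response.close()
    return data
```

For example, read_until(urllib.request.urlopen(url), b"</title>") would stop shortly after the page title. How much bandwidth this really saves depends on how much the server and OS had already buffered.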
Is it possible to set up tables for MySQL in Python?
Here's my problem: I have a bunch of .txt files which I want to load into a MySQL database. Instead of creating tables in phpMyAdmin manually, is it possible to do the following things all in Python?
Create table, including data type definition.
Load many files one by one. I only know the LOAD DATA LOCAL INFILE command, which loads one file.
Many thanks
Yes, it is possible. You'll need to read the data from the CSV files using the csv module.
http://docs.python.org/library/csv.html
Then insert the data using a Python MySQL binding. Here is a good starter tutorial:
http://zetcode.com/databases/mysqlpythontutorial/
If you already know Python, it will be easy.
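Roughly, the two steps could look like the sketch below. It assumes conn is a connection from whichever DB-API MySQL driver you install, and that each file holds comma-separated name,amount lines. The readings table, its columns, and the function name are hypothetical. The placeholder is a parameter because MySQL drivers generally use %s while some other DB-API drivers, such as sqlite3, use ?.

```python
import csv

# Hypothetical table; adjust the column names and types to your files.
CREATE_SQL = (
    "CREATE TABLE IF NOT EXISTS readings "
    "(name VARCHAR(100), amount INTEGER)"
)


def load_files(conn, paths, placeholder="%s", delimiter=","):
    """Create the table if needed, then insert every row of every file."""
    cur = conn.cursor()
    cur.execute(CREATE_SQL)
    insert_sql = "INSERT INTO readings (name, amount) VALUES ({0}, {0})".format(
        placeholder
    )
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            rows = [(name, int(amount))
                    for name, amount in csv.reader(f, delimiter=delimiter)]
        cur.executemany(insert_sql, rows)
    conn.commit()
    cur.close()
```

To load a whole folder, you could pass something like glob.glob("data/*.txt") as paths.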
It is. Typically what you want to do is use an Object-Relational Mapping library.
Probably the most widely used in the python ecosystem is SQLAlchemy, but there is a lot of magic going on in it, so if you want to keep a tighter control on your DB schema, or if you are learning about relational DB's and want to follow along what the code does, you might be better off with something lighter like Canonical's storm.
EDIT: Just thought to add. The reason to use ORMs is that they provide a very handy way to manipulate data and interface with the DB. But if all you will ever want to do is write a script to convert textual data to MySQL tables, then you might get along with something even easier. Check the tutorial linked from the official MySQL website, for example.
HTH!
I've looked through the current related questions but have not managed to find anything similar to my needs.
I'm in the process of creating an affiliate store using zencart. One of the issues is that zencart is not designed for redirects and affiliate stores, but it can be done. I will be changing the store so it acts like a showcase store showing prices.
There is a mod called easy populate which allows me to upload data feeds. This is all well and good, but my affiliate link will not be in each product. I can add it manually after uploading the data feed by going to each product and adding it as an image with a redirect link. However, with over 500 items that is going to be a long, repetitive and time-consuming job.
I have been told that I can add the links to the data feed before uploading it to zencart, and that this should be done using Python. I've been reading about Python for several days now and feel I'm looking for the wrong things. Could someone please advise on the simplest way to get this done?
I hope the question makes sense
thanks
abs
You could craft a Python script using the csv module like this:
>>> import csv
>>> # In Python 3, open in text mode with newline='' (not 'wb')
>>> cartWriter = csv.writer(open('yourcart.csv', 'w', newline=''))
>>> cartWriter.writerow(['Product', 'yourinfo', 'yourlink'])
You need to know how the link should be formatted; hopefully it can be composed from the other fields present in the CSV file.
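If the link really can be built from a field that is already in the feed, a sketch like this would add it to every row in one pass. The column names product_id and affiliate_link and the template format are assumptions; substitute whatever your feed and affiliate program actually use.

```python
import csv


def add_affiliate_links(in_file, out_file, link_template,
                        id_field="product_id", link_field="affiliate_link"):
    """Copy a CSV feed, appending a link column built from id_field."""
    reader = csv.DictReader(in_file)
    fieldnames = list(reader.fieldnames) + [link_field]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # link_template uses {id} where the product id should go
        row[link_field] = link_template.format(id=row[id_field])
        writer.writerow(row)
```

You would open the original feed for reading and a new file for writing (both with newline=''), then call add_affiliate_links(src, dst, "https://example.com/go?id={id}").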
First, use the csv module as systempuntoout told you. Second, you will want to change your headers to:
mimetype='text/csv'
Content-Disposition = 'attachment; filename=name_of_your_file.csv'
The way to do it depends very much on your website implementation. In pure Python you would probably do that with an HttpResponse object. In Django as well, but there are some shortcuts.
You can find a video demonstrating how to create CSV files with Python on showmedo. It's not free however.
Now, how to provide a link to download the CSV depends on your website. What technology is behind it: pure Python, Django, Pylons, TurboGears?
If you can't answer that question, you should ask your boss for training on your infrastructure before trying to make changes to it.