I am trying to develop a classifier for documents. I am relatively new to Python and I am trying to figure out the best/standard way of structuring the data for storage, since I want to feed the dataset to machine learning algorithms.
I am ingesting txt files and I was thinking of having one column hold the entire document content and a second column hold the class (0 or 1 in my case). I initially tried creating a list of lists, such as [["the sky is blue", 1], ["the sky is grey", 1], ["the sky is red", 0]].
I was also trying to create a pandas DataFrame because I thought its structure might be more suitable for data manipulation.
I would go with that. Given that the goal is to build and train a classifier, you will need to extract/compute features from the text of the files. When you do, the ability to easily generate and add new columns to a DataFrame will come in handy.
However, it also depends on the size of the data you will be crunching. If you have massive data, you should look into different concepts and frameworks (for instance TensorFlow).
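A minimal sketch of the DataFrame approach, reusing the toy documents from the question; the column names are my own choice, not a standard:

import pandas as pd

docs = [
    ["the sky is blue", 1],
    ["the sky is grey", 1],
    ["the sky is red", 0],
]
df = pd.DataFrame(docs, columns=["text", "label"])

# Adding a derived feature later is a one-liner, which is the main reason
# a DataFrame beats plain nested lists here.
df["n_words"] = df["text"].str.split().str.len()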
Related
I am currently trying to build an adjustment framework for a forecasting tool.
Essentially, I have an ML tool that you upload financial data to, and it creates a forecast. I am adding a feature where the user can manually adjust some of the forecast (e.g., add $1M to FY22Q3), but I am stuck on how to make the user inputs dynamic based on the categories in their data. Right now it is hard-coded around the 4 categories in my example workbook data. I want to be able to read the data frame and have the inputs be driven by the column header names. Any help would be appreciated. Thanks!
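A minimal sketch of the idea, not the asker's actual tool: the adjustable categories are read from the DataFrame's column headers instead of being hard-coded. The file name and layout (one column per category) are assumptions for illustration.

import pandas as pd

forecast = pd.read_csv("forecast.csv", index_col=0)  # columns = categories

# Prompt once per category found in the data, whatever the categories are.
adjustments = {}
for category in forecast.columns:
    raw = input(f"Adjustment for {category} (blank for none): ")
    adjustments[category] = float(raw) if raw else 0.0

# Apply each adjustment to its category's forecast column.
for category, amount in adjustments.items():
    forecast[category] += amount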
I am working on a project that tracks my Spotify play history, which I'll use to generate personalized playlists.
I am saving my play history to a local dataframe (that I append to every week) as a pickle file. I also have a second dataframe that contains specific track features, also pickled locally.
I’m wondering 1) if there is a better way of structuring my data and 2) if I should be using other ways of saving my data.
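For reference, a minimal sketch of the weekly append-and-pickle workflow described above; the file name and the columns of new_plays are placeholders, not the asker's actual schema.

import pandas as pd

# This week's plays, e.g. pulled from the Spotify API (placeholder data).
new_plays = pd.DataFrame(
    {"track_id": ["abc123"], "played_at": ["2023-01-01T12:00:00Z"]}
)

try:
    history = pd.read_pickle("play_history.pkl")
    history = pd.concat([history, new_plays], ignore_index=True)
except FileNotFoundError:
    history = new_plays

# Drop duplicates in case the same plays were fetched twice, then re-save.
history = history.drop_duplicates(subset=["track_id", "played_at"])
history.to_pickle("play_history.pkl")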
I have extracted 6 months of email metadata and saved it as a csv file. The csv now contains only two columns (from and to email addresses). I want to build a graph where the vertices are the people I communicated with and who communicated with me, and the edges are the communication links, labelled by how many emails we exchanged. What is the best approach for going about this?
One approach is to use Linked Data principles (although this is not advisable if you are short on time and don't have a background in Linked Data). Here's a possible workflow:
1) Represent each entity as a URI.
2) Use an existing ontology (such as foaf) to describe the data.
3) Transform the data into RDF (Resource Description Framework).
4) Use an RDF visualization tool.
Since RDF is inherently a graph, you will be able to visualize your data as well as extend it.
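A minimal sketch of steps 1-3 using rdflib. The example addresses and the ad-hoc EX properties (sender, receiver, emailCount) are placeholders of my own, not part of foaf or any standard ontology:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/email/")

g = Graph()
g.bind("foaf", FOAF)

pairs = [("alice@example.org", "bob@example.org", 12)]  # (from, to, count)

for sender, receiver, count in pairs:
    s, r = EX[sender], EX[receiver]
    for person, addr in ((s, sender), (r, receiver)):
        g.add((person, RDF.type, FOAF.Person))
        g.add((person, FOAF.mbox, URIRef(f"mailto:{addr}")))

    # Model each communication link as its own resource so the email count
    # (the edge weight) has somewhere to live.
    link = EX[f"link/{sender}/{receiver}"]
    g.add((link, EX.sender, s))
    g.add((link, EX.receiver, r))
    g.add((link, EX.emailCount, Literal(count)))

print(g.serialize(format="turtle"))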
If you are unfamiliar with Linked Data, another way to view the graphs is Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/). This approach is much simpler but lacks the benefits of semantic interoperability, assuming you care about those in the first place.
Cytoscape might be able to import your data in that format and build a network from it.
http://www.cytoscape.org/
Your question mentions Python but does not say how much of this you want to do in Python. I will assume Python is a tool you know, but that the main goal is to get the data visualized. In that case:
1) Use the Gephi network analysis tool. There are tools that can use your CSV file as-is, and Gephi is one of them. In your case the edge weights (the number of emails exchanged between two email addresses) need to be preserved, which can be done using the "mixed" variation of Gephi's CSV format.
2) Another option is to pre-process your CSV file (e.g. using Python), calculate the edge weights (the number of emails between every two email addresses), and save the result in any format you like, as sketched below. The result can then be visualized in network analysis tools (such as Gephi) or directly in Python (e.g. using https://graph-tool.skewed.de).
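A minimal sketch of option 2, assuming the CSV has two columns named "from" and "to" (the actual column names in your file may differ):

import pandas as pd

df = pd.read_csv("emails.csv")

# The edge weight is simply how many rows share the same (from, to) pair.
edges = df.groupby(["from", "to"]).size().reset_index(name="weight")

# Save as a weighted edge list that Gephi, graph-tool or networkx can import.
edges.to_csv("edges.csv", index=False)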
Here's an example of an email network analysis project (though their graph does not show weights).
I'm trying to use HDF5 to store time-series EEG data. These files can be quite large and consist of many channels, and I like the features of the HDF5 file format (lazy I/O, dynamic compression, MPI, etc.).
One common thing to do with EEG data is to mark sections of data as 'interesting'. I'm struggling with a good way to store these marks in the file. I see soft/hard links supported for linking the same dataset to other groups, etc -- but I do not see any way to link to sections of the dataset.
For example, let's assume I have a dataset called EEG containing sleep data. Let's say I run an algorithm that takes a while to process the data and generates indices corresponding to periods of REM sleep. What is the best way to store these index ranges in an HDF5 file?
The best I can think of right now is to create a dataset with three columns: the first column holds a string label for the event ("REM1"), and the second/third columns hold the start/end index respectively. The only reason I don't like this solution is that HDF5 datasets are pretty fixed in size: if I decide later that a period of REM sleep was mis-identified and I need to add/remove that event, the dataset size would need to change (and deleting the dataset and recreating it with a new size is suboptimal). Compound this with the fact that I may have MANY events (imagine marking eye-blink events), and this becomes more of a problem.
I'm more curious to find out if there's functionality in the HDF5 file that I'm just not aware of, because this seems like a pretty common thing that one would want to do.
I think what you want is a Region Reference: essentially, a way to store a reference to a slice of your data. In h5py, you create them with the regionref property and NumPy slicing syntax, so if you have a dataset called ds and the start and end indices of your REM period, you can do:
rem_ref = ds.regionref[start:end]
ds.attrs['REM1'] = rem_ref
ds[ds.attrs['REM1']] # Will be a 1-d set of values
You can store regionrefs pretty naturally — they can be attributes on a dataset, objects in a group, or you can create a regionref-type dataset and store them in there.
In your case, I might create a group ("REM_periods" or something) and store the references in there (sketched below). Creating a "REM_periods" dataset and storing the regionrefs there is reasonable too, but you run into the whole "datasets don't handle variable lengths very well" issue.
Storing them as attrs on the dataset might be OK, too, but it'd get awkward if you wanted to have more than one event type.
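A minimal, self-contained sketch of the "group of region references" idea, assuming a reasonably recent h5py (h5py.regionref_dtype was added around 2.10; older versions use h5py.special_dtype). The file name, group name and index ranges are made up for illustration.

import numpy as np
import h5py

with h5py.File("sleep.h5", "w") as f:
    eeg = f.create_dataset("EEG", data=np.random.randn(100_000))

    rem = f.create_group("REM_periods")
    periods = {"REM1": (2_000, 5_000), "REM2": (40_000, 44_000)}

    for label, (start, end) in periods.items():
        # One scalar regionref dataset per labelled period inside the group.
        ref_ds = rem.create_dataset(label, shape=(), dtype=h5py.regionref_dtype)
        ref_ds[()] = eeg.regionref[start:end]

    # Adding or removing an event later means adding/deleting a tiny scalar
    # dataset inside the group, not resizing anything.
    rem1_ref = rem["REM1"][()]
    rem1_values = eeg[rem1_ref]  # 1-d array of just that REM period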
The corpus consists of strings (file names) and their checksums, so I expect its entropy to be higher than that of normal text. Also, the collection is too large to be analysed in full, so I'm going to sample it to create a global dictionary. Is there a fancy machine learning approach for my task?
Which algorithm or, better, library should I use?
I'm using python in case it matters.
I would suggest you use sparse coding. It allows you to use your data set to infer an overcomplete dictionary which is then used to encode your data. If your data is indeed of similar nature, this could work well for you.
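A rough sketch of what that could look like in scikit-learn, under a big assumption of mine: that character n-gram count vectors are a usable numeric representation of your file-name/checksum strings. The sample strings are placeholders for a sample of your corpus.

from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.text import CountVectorizer

sample = [
    "report_2021.txt 9f86d081",
    "report_2022.txt 60303ae2",
    "notes.md fd61a03a",
]

# Turn each string into a vector of character n-gram counts.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3))
X = vectorizer.fit_transform(sample).toarray()

# Learn an (over)complete dictionary from the sample and sparse-code it.
learner = MiniBatchDictionaryLearning(n_components=20, alpha=1.0, random_state=0)
codes = learner.fit_transform(X)   # sparse codes, one row per input string
atoms = learner.components_        # the learned dictionary atoms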