sklearn.datasets.load_svmlight_file(f, n_features=None, dtype=<class ‘numpy.float64’>, multilabel=False, zero_based=’auto’, query_id=False, offset=0, length=-1)
[source]
Load datasets in the svmlight / libsvm format into sparse CSR matrix
This format is a text-based format, with one sample per line. It does not store zero valued features hence is suitable for sparse dataset.
The first element of each line can be used to store a target variable to predict.
This format is used as the default format for both svmlight and the libsvm command line programs.
Parsing a text based source can be expensive. When working on repeatedly on the same dataset, it is recommended to wrap this loader with joblib.Memory.cache to store a memmapped backup of the CSR results of the first call and benefit from the near instantaneous loading of memmapped structures for the subsequent calls.
In case the file contains a pairwise preference constraint (known as “qid” in the svmlight format) these are ignored unless the query_id parameter is set to True. These pairwise preference constraints can be used to constraint the combination of samples when using pairwise loss functions (as is the case in some learning to rank problems) so that only pairs with the same query_id value are considered.
This implementation is written in Cython and is reasonably fast. However, a faster API-compatible loader is also available at:
https://github.com/mblondel/svmlight-loaderParameters: |
f : {str, file-like, int} (Path to) a file to load. If a path ends in “.gz” or “.bz2”, it will be uncompressed on the fly. If an integer is passed, it is assumed to be a file descriptor. A file-like or file descriptor will not be closed by this function. A file-like object must be opened in binary mode. n_features : int or None The number of features to use. If None, it will be inferred. This argument is useful to load several files that are subsets of a bigger sliced dataset: each subset might not have examples of every feature, hence the inferred shape might vary from one slice to another. n_features is only required if dtype : numpy data type, default np.float64 Data type of dataset to be loaded. This will be the data type of the output numpy arrays multilabel : boolean, optional, default False Samples may have several labels each (see http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html) zero_based : boolean or “auto”, optional, default “auto” Whether column indices in f are zero-based (True) or one-based (False). If column indices are one-based, they are transformed to zero-based to match Python/NumPy conventions. If set to “auto”, a heuristic check is applied to determine this from the file contents. Both kinds of files occur “in the wild”, but they are unfortunately not self-identifying. Using “auto” or True should always be safe when no query_id : boolean, default False If True, will return the query_id array for each file. offset : integer, optional, default 0 Ignore the offset first bytes by seeking forward, then discarding the following bytes up until the next new line character. length : integer, optional, default -1 If strictly positive, stop reading any new line of data once the position in the file has reached the (offset + length) bytes threshold. |
---|---|
Returns: |
X : scipy.sparse matrix of shape (n_samples, n_features) y : ndarray of shape (n_samples,), or, in the multilabel a list of tuples of length n_samples. query_id : array of shape (n_samples,) query_id for each sample. Only returned when query_id is set to True. |
To use joblib.Memory to cache the svmlight file:
from sklearn.externals.joblib import Memory from sklearn.datasets import load_svmlight_file mem = Memory("./mycache") @mem.cache def get_data(): data = load_svmlight_file("mysvmlightfile") return data[0], data[1] X, y = get_data()
© 2007–2017 The scikit-learn developers
Licensed under the 3-clause BSD License.
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html