I need to fit a RandomForestRegressor from sklearn.ensemble.
from sklearn import ensemble
forest = ensemble.RandomForestRegressor(**RF_tuned_parameters)
model = forest.fit(train_fold, train_y)
yhat = model.predict(test_fold)
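This code always worked until I did some preprocessing of the data (train_y).
The error message says:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
model = forest.fit(train_fold, train_y)
Previously train_y was a Series; now it is a NumPy array (a column vector). If I apply train_y.ravel(), it becomes a row vector and the warning disappears, but the prediction step takes a very long time (actually it never finishes...).
In the RandomForestRegressor docs I found that train_y should be defined as y: array-like, shape = [n_samples] or [n_samples, n_outputs].
Any idea how to solve this issue?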
TL;DR
Use
y = np.squeeze(y)
instead of
y = y.ravel()
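To show where that line would sit in the code from the question, here is a minimal sketch. The random data, the `n_estimators=10` setting, and the name `y_1d` are assumptions made up for illustration; only `train_fold`, `train_y`, `forest`, `model`, and `yhat` come from the question.

```python
import warnings

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.exceptions import DataConversionWarning

# Made-up data standing in for the question's folds.
rng = np.random.default_rng(0)
train_fold = rng.normal(size=(20, 3))
train_y = train_fold.sum(axis=1).reshape(-1, 1)  # column vector, shape (20, 1)

y_1d = np.squeeze(train_y)  # drops the size-1 axis, shape (20,)

with warnings.catch_warnings():
    # Turn the warning from the question into an error so it cannot pass unnoticed.
    warnings.simplefilter("error", DataConversionWarning)
    forest = RandomForestRegressor(n_estimators=10, random_state=0)
    model = forest.fit(train_fold, y_1d)

yhat = model.predict(train_fold)
print(y_1d.shape, yhat.shape)  # (20,) (20,)
```

Fitting inside the `catch_warnings` block is only there to make the check explicit; in normal use the squeeze line alone should be enough.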
Although NumPy's ravel() is a valid way to get the desired result in this particular case, I would recommend using numpy.squeeze() instead.
The problem is that if your y (a NumPy array) has a shape of e.g. (100, 2), then y.ravel() flattens it into a single 1-D array of shape (200,), interleaving the two variables row by row. This might not be what you want when the columns are separate output variables that have to be treated on their own.
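A small sketch of that flattening, using a made-up 3-sample, 2-target array (`y_two_targets` is a hypothetical name) so the result is easy to read:

```python
import numpy as np

# Hypothetical target matrix: 3 samples, 2 output variables.
y_two_targets = np.array([[1, 10],
                          [2, 20],
                          [3, 30]])

flattened = y_two_targets.ravel()

print(flattened.shape)     # (6,)
print(flattened.tolist())  # [1, 10, 2, 20, 3, 30]
```

The two targets end up mixed in one 1-D array, so an estimator would see six samples of a single target rather than three samples of two targets.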
On the other hand, numpy.squeeze() only removes redundant dimensions (i.e. those of size 1). So if your array's shape is (100, 1), the result has shape (100,), whereas an array of shape (100, 2) is left unchanged, as none of its dimensions has size 1.
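The same comparison for squeeze can be sketched as follows; the arrays `column_y` and `two_target_y` are made up for illustration:

```python
import numpy as np

column_y = np.arange(5).reshape(-1, 1)  # shape (5, 1): a column vector
two_target_y = np.zeros((5, 2))         # shape (5, 2): two output variables

squeezed_column = np.squeeze(column_y)
squeezed_two_target = np.squeeze(two_target_y)

print(squeezed_column.shape)      # (5,)
print(squeezed_two_target.shape)  # (5, 2)
```

If y could ever have a single row, passing `axis=1` to `np.squeeze` limits the squeeze to the column axis, at the cost of raising an error when that axis is not of size 1.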
With Neuraxle, you can easily solve this issue:
p = Pipeline([
    # expected outputs shape: (n, 1)
    OutputTransformerWrapper(NumpyRavel()),
    # expected outputs shape: (n, )
    RandomForestRegressor(**RF_tuned_parameters)
])
p, outputs = p.fit_transform(data_inputs, expected_outputs)
Neuraxle is a sklearn-like framework for hyperparameter tuning and AutoML in deep learning projects!