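# [Machine Learning] Feature-Weighting Rules for Logistic Regression in Financial Risk Modeling and the Principles of Deep Learning CNNs

Financial risk control is one of the most mature applications of machine learning in traditional industries. Logistic regression, the classic algorithm for risk modeling, holds a central place in credit approval, fraud detection, and credit scoring because it is interpretable, cheap to train, and easy to deploy. As financial data volumes grow and risk patterns become more complex, however, a single logistic regression model shows its limits in feature expressiveness. This article explains the feature-weighting rules and sample-weighting methods used with logistic regression in risk control, then introduces the principles of convolutional neural networks (CNNs) and discusses how the two algorithms can work together.

## The Central Role of Logistic Regression in Financial Risk Control

Despite its name, logistic regression is a linear model for binary classification. In risk control it is mainly used to predict the probability of binary events such as borrower default or transaction fraud.

### Mathematical Basis of Logistic Regression

Logistic regression passes the output of a linear model through the sigmoid function, mapping it into the interval (0, 1) so that it can be read as a probability. The model is:

$$P(y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$$

where $P(y=1 \mid X)$ is the predicted probability of the positive class given features $X$, and $\beta_i$ is the weight coefficient of the $i$-th feature.

### Probability Mapping with the Sigmoid Function

The sigmoid function $\sigma(z) = 1/(1+e^{-z})$ maps any real number into the range 0 to 1, so the model output can be interpreted directly as a probability. As the linear combination $z$ tends to positive infinity the probability approaches 1; as $z$ tends to negative infinity it approaches 0.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


z_values = np.linspace(-10, 10, 100)
probabilities = sigmoid(z_values)
print("Sigmoid mapping example")
print(f"Probability at z=-5: {sigmoid(-5):.4f}")
print(f"Probability at z=0: {sigmoid(0):.4f}")
print(f"Probability at z=5: {sigmoid(5):.4f}")
```

## Feature-Weighting Rules in Detail

In a risk model, the feature weights determine how strongly each risk factor influences the final score. Sensible weighting is a precondition for a model that works.

### Economic Interpretation of the Weights

The coefficient $\beta_i$ is the change in the log-odds when feature $x_i$ increases by one unit and all other features are held fixed. In risk control this means:

$$\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

When $\beta_i > 0$, feature $x_i$ is positively associated with the probability of default; when $\beta_i < 0$, it is negatively associated.

```python
def calculate_odds_ratio(coefficients, feature_names):
    # exp(beta_i) is the multiplicative change in the odds per unit increase of x_i
    odds_ratios = np.exp(coefficients)
    result = pd.DataFrame({
        "feature": feature_names,
        "coefficient": coefficients,
        "odds_ratio": odds_ratios,
        "direction": ["positive" if c > 0 else "negative" for c in coefficients],
    })
    # Sort by the column that actually exists in the frame
    return result.sort_values("odds_ratio", ascending=False)


feature_names = ["income_level", "debt_ratio", "credit_history_length", "recent_inquiries",
                 "loan_amount", "job_stability", "education_level", "age"]
sample_coefficients = np.array([-0.45, 0.62, -0.38, 0.51, 0.28, -0.55, -0.22, -0.15])
weight_analysis = calculate_odds_ratio(sample_coefficients, feature_names)
print("Feature weight analysis")
print(weight_analysis)
```

### Standardizing Features Before Comparing Weights

In multivariate logistic regression, features measured on different scales produce coefficients that cannot be compared directly. Standardizing the features first puts the coefficients on a common scale.

```python
class StandardizedLogisticRegression:
    def __init__(self, C=1.0, penalty="l2"):
        self.C = C
        self.penalty = penalty
        self.scaler = StandardScaler()
        self.model = LogisticRegression(C=self.C, penalty=self.penalty, solver="liblinear")

    def fit(self, X, y):
        X_scaled = self.scaler.fit_transform(X)
        self.model.fit(X_scaled, y)
        return self

    def get_standardized_weights(self, feature_names):
        weights = self.model.coef_[0]
        # argsort of argsort turns the sort order into a per-feature rank (1 = largest |weight|)
        ranks = np.argsort(np.argsort(-np.abs(weights))) + 1
        return pd.DataFrame({
            "feature": feature_names,
            "standardized_weight": weights,
            "abs_weight": np.abs(weights),
            "importance_rank": ranks,
        })

    def predict_proba(self, X):
        X_scaled = self.scaler.transform(X)
        return self.model.predict_proba(X_scaled)
```

### How Multicollinearity Affects the Weights

Risk features are often correlated; income level and loan amount, for example, usually move together. Multicollinearity makes the weight estimates unstable and inflates their variance, which undermines interpretability.

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor


def calculate_vif(X, feature_names):
    # X is expected to be a pandas DataFrame of numeric features
    vif_data = pd.DataFrame()
    vif_data["feature"] = feature_names
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif_data


def handle_collinearity(X, feature_names, threshold=10):
    vif_df = calculate_vif(X, feature_names)
    high_vif_features = vif_df[vif_df["VIF"] > threshold]["feature"].tolist()
    low_vif_features = vif_df[vif_df["VIF"] <= threshold]["feature"].tolist()
    print(f"High-collinearity features (VIF > {threshold}): {high_vif_features}")
    print(f"Retained features: {low_vif_features}")
    return low_vif_features, high_vif_features
```

### Feature Selection with L1 Regularization

L1 regularization (Lasso) shrinks the weights of unimportant features to exactly zero, so it performs feature selection and yields a sparse solution at the same time.

```python
class L1FeatureSelector:
    def __init__(self, C_range=np.logspace(-3, 1, 20)):
        self.C_range = C_range
        self.selected_features = None

    def select_features(self, X, y, feature_names):
        # Count the surviving (non-zero) coefficients for each regularization strength
        selected_features_count = []
        for C_val in self.C_range:
            model = LogisticRegression(penalty="l1", C=C_val, solver="saga", max_iter=1000)
            model.fit(X, y)
            n_selected = np.sum(np.abs(model.coef_[0]) > 1e-6)
            selected_features_count.append(n_selected)

        # Pick the sparsest model that still keeps at least one feature; the plain
        # minimum can land on a C that zeroes out every coefficient. In practice C
        # is usually chosen by cross-validated performance instead.
        nonzero_counts = [n for n in selected_features_count if n > 0]
        if not nonzero_counts:
            raise ValueError("every C in C_range removed all features")
        best_C = self.C_range[selected_features_count.index(min(nonzero_counts))]

        final_model = LogisticRegression(penalty="l1", C=best_C, solver="saga", max_iter=1000)
        final_model.fit(X, y)
        selected_indices = np.where(np.abs(final_model.coef_[0]) > 1e-6)[0]
        self.selected_features = [feature_names[i] for i in selected_indices]
        return self.selected_features, final_model.coef_[0][selected_indices]
```

## Sample-Weighting Techniques

Risk data is almost always class-imbalanced: defaults make up only a small fraction of all samples. Sample weighting is a key technique for dealing with this.

### The Effect of Class Imbalance

On an imbalanced dataset, logistic regression tends to predict the majority class. Overall accuracy looks high, but the model identifies very few of the minority (default) samples.

```python
def demonstrate_class_imbalance():
    np.random.seed(42)
    n_majority = 10000
    n_minority = 200
    X_majority = np.random.normal(0, 1, (n_majority, 2))
    X_minority = np.random.normal(2, 1, (n_minority, 2))
    y_majority = np.zeros(n_majority)
    y_minority = np.ones(n_minority)
    X = np.vstack([X_majority, X_minority])
    y = np.hstack([y_majority, y_minority])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model_no_weight = LogisticRegression()
    model_no_weight.fit(X_train, y_train)
    preds = model_no_weight.predict(X_test)
    print("Model performance without sample weights")
    print(f"Predicted as class 0: {np.sum(preds == 0)}")
    print(f"Predicted as class 1: {np.sum(preds == 1)}")
    return X_train, X_test, y_train, y_test
```

### Strategies for Computing Sample Weights

Common remedies for imbalance in risk control include inverse-frequency weighting and SMOTE oversampling. SMOTE resamples the data rather than reweighting it; the code below covers the weighting approach.

```python
from sklearn.utils.class_weight import compute_class_weight


def compute_sample_weights(y_train, method="balanced"):
    if method == "balanced":
        classes = np.unique(y_train)
        weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
        weight_by_class = dict(zip(classes, weights))
        sample_weights = np.array([weight_by_class[label] for label in y_train])
    elif method == "inverse":
        # Inverse class frequency, normalized so both classes carry equal total weight
        n_samples = len(y_train)
        n_pos = np.sum(y_train)
        n_neg = n_samples - n_pos
        sample_weights = np.where(y_train == 1,
                                  n_samples / (2 * n_pos),
                                  n_samples / (2 * n_neg))
    else:
        sample_weights = np.ones(len(y_train))
    return sample_weights


def train_with_sample_weights(X_train, y_train, X_test, y_test):
    sample_weights = compute_sample_weights(y_train, method="balanced")

    # Option 1: let scikit-learn derive the class weights
    model_weighted = LogisticRegression(class_weight="balanced")
    model_weighted.fit(X_train, y_train)

    # Option 2: pass explicit per-sample weights
    model_custom = LogisticRegression()
    model_custom.fit(X_train, y_train, sample_weight=sample_weights)

    score_balanced = model_weighted.score(X_test, y_test)
    score_custom = model_custom.score(X_test, y_test)
    print(f"Accuracy with class_weight='balanced': {score_balanced:.4f}")
    print(f"Accuracy with custom sample weights: {score_custom:.4f}")
    return model_weighted, model_custom
```

### Cost-Sensitive Learning

In risk control, misclassifying a defaulter as a good customer costs far more than misclassifying a good customer as a defaulter. Cost-sensitive learning optimizes the model by assigning different costs to the two kinds of error.

```python
class CostSensitiveLogisticRegression:
    def __init__(self, cost_fp=1.0, cost_fn=10.0):
        self.cost_fp = cost_fp  # cost of flagging a good customer
        self.cost_fn = cost_fn  # cost of missing a defaulter
        self.model = None

    def custom_loss(self, y_true, y_pred):
        # Cost-weighted log loss, useful for reporting; fit() applies the same
        # costs through sample weights
        epsilon = 1e-7
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        cost_matrix = np.where(y_true == 1, self.cost_fn, self.cost_fp)
        loss = -cost_matrix * (y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return np.mean(loss)

    def fit(self, X, y):
        sample_weight = np.where(y == 1, self.cost_fn, self.cost_fp)
        self.model = LogisticRegression()
        self.model.fit(X, y, sample_weight=sample_weight)
        return self

    def predict(self, X):
        return self.model.predict(X)

    def predict_proba(self, X):
        return self.model.predict_proba(X)
```

## Choosing Evaluation Metrics

Accuracy alone does not describe a risk model well. The KS statistic, AUC-ROC, and the confusion matrix are the more commonly used measures.

### The KS Statistic and Threshold Optimization

The KS (Kolmogorov-Smirnov) statistic measures how well the model separates positive from negative samples, and it is one of the core evaluation metrics in risk control.

```python
def calculate_ks(y_true, y_pred_proba):
    # Convert to arrays so mismatched pandas indexes cannot misalign labels and scores
    data = pd.DataFrame({"y_true": np.asarray(y_true), "y_pred": np.asarray(y_pred_proba)})
    data = data.sort_values("y_pred", ascending=False).reset_index(drop=True)
    total_good = (data["y_true"] == 0).sum()
    total_bad = (data["y_true"] == 1).sum()
    data["cum_good"] = (data["y_true"] == 0).cumsum() / total_good
    data["cum_bad"] = (data["y_true"] == 1).cumsum() / total_bad
    data["ks"] = np.abs(data["cum_good"] - data["cum_bad"])
    max_ks = data["ks"].max()
    ks_threshold = data.loc[data["ks"].idxmax(), "y_pred"]
    return max_ks, ks_threshold


def evaluate_risk_model(y_true, y_pred_proba):
    from sklearn.metrics import roc_auc_score, confusion_matrix

    ks_value, threshold = calculate_ks(y_true, y_pred_proba)
    auc_value = roc_auc_score(y_true, y_pred_proba)
    # Classify at the score where the KS gap is largest
    y_pred_binary = (np.asarray(y_pred_proba) >= threshold).astype(int)
    cm = confusion_matrix(y_true, y_pred_binary)
    tn, fp, fn, tp = cm.ravel()
    evaluation = {
        "KS": ks_value,
        "AUC": auc_value,
        "best_threshold": threshold,
        "precision": tp / (tp + fp) if (tp + fp) > 0 else 0,
        "recall": tp / (tp + fn) if (tp + fn) > 0 else 0,
        "F1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0,
    }
    return evaluation
```

| Metric | Range | Risk-control benchmark | What it measures |
| --- | --- | --- | --- |
| KS | 0-1 | > 0.3 acceptable, > 0.5 excellent | Discriminative power |
| AUC-ROC | 0.5-1 | > 0.75 acceptable, > 0.85 excellent | Overall ranking ability |
| Precision | 0-1 | > 0.7 | Control of false alarms |
| Recall | 0-1 | > 0.6 | Ability to capture risk |

## Principles of Deep Learning CNNs

Convolutional neural networks are best known for image tasks, but their core idea also has a niche in financial risk control.

### Core Components of a CNN

A CNN consists of convolutional layers, pooling layers, and fully connected layers. Convolutional layers extract features, pooling layers reduce dimensionality, and fully connected layers make the classification decision.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FinancialCNN(nn.Module):
    # Expects input of shape (batch, input_channels, sequence_length)
    def __init__(self, input_channels=1, sequence_length=128, num_classes=2):
        super().__init__()
        self.conv1 = nn.Conv1d(in_channels=input_channels, out_channels=32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)
        self.dropout = nn.Dropout(0.3)
        self._to_linear = None
        self._compute_flat_size(input_channels, sequence_length)
        self.fc1 = nn.Linear(self._to_linear, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def _compute_flat_size(self, channels, length):
        # Run a dummy tensor through the conv stack to find the flattened size
        with torch.no_grad():
            x = torch.zeros(1, channels, length)
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = self.pool(F.relu(self.conv3(x)))
        self._to_linear = x.numel()

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(x.size(0), -1)
        x = self.dropout(F.relu(self.fc1(x)))
        return self.fc2(x)
```

### One-Dimensional Convolution over Financial Time Series

Transaction data is inherently sequential. A one-dimensional convolution (Conv1D) slides along the time axis and captures local temporal patterns.

```python
class TimeSeriesFeatureExtractor(nn.Module):
    # Expects input of shape (batch, 5, sequence_length); returns (batch, 64)
    def __init__(self):
        super().__init__()
        self.conv_block = nn.Sequential(
            nn.Conv1d(5, 16, kernel_size=7, padding=3),
            nn.BatchNorm1d(16),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, x):
        x = self.conv_block(x)
        return x.squeeze(-1)

    def extract_transaction_patterns(self, transaction_sequences):
        self.eval()
        with torch.no_grad():
            features = self.forward(transaction_sequences)
        return features.cpu().numpy()
```

### Understanding CNN Filters Through Their Statistics

Through training, convolution kernels can discover local patterns in financial data on their own, such as bursts of small transactions within a short window or sudden large transfers.

```python
def visualize_conv_filters(model, layer_name="conv1"):
    conv_layer = None
    for name, module in model.named_modules():
        if name == layer_name:
            conv_layer = module
            break
    if conv_layer is None:
        return None
    # Conv1d weights have shape (out_channels, in_channels, kernel_size),
    # so per-filter statistics reduce over axes 1 and 2
    filters = conv_layer.weight.detach().cpu().numpy()
    filter_stats = pd.DataFrame({
        "filter_index": range(filters.shape[0]),
        "mean": filters.mean(axis=(1, 2)),
        "std": filters.std(axis=(1, 2)),
        "l2_norm": np.sqrt((filters ** 2).sum(axis=(1, 2))),
    })
    return filter_stats.sort_values("l2_norm", ascending=False)
```

## Using Logistic Regression and CNNs Together

Logistic regression and CNNs are not mutually exclusive choices in risk control. Combining their strengths can produce a stronger model.

### Collaboration in Feature Engineering

Logistic regression depends on hand-designed features, while a CNN can learn feature representations from raw data. Feeding CNN-extracted features into logistic regression keeps a linear, inspectable decision layer on top of automatically learned features.

```python
class HybridRiskModel:
    def __init__(self, cnn_feature_dim=64):
        # The extractor is assumed to have been trained elsewhere; with random
        # initial weights its features carry little signal
        self.cnn = TimeSeriesFeatureExtractor()
        self.lr = LogisticRegression()
        self.feature_dim = cnn_feature_dim

    def extract_cnn_features(self, sequential_data):
        self.cnn.eval()
        with torch.no_grad():
            cnn_features = self.cnn(torch.as_tensor(sequential_data, dtype=torch.float32))
        return cnn_features.cpu().numpy()

    def fit(self, static_features, sequential_data, y):
        cnn_features = self.extract_cnn_features(sequential_data)
        combined_features = np.hstack([static_features, cnn_features])
        self.lr.fit(combined_features, y)
        return self

    def predict(self, static_features, sequential_data):
        cnn_features = self.extract_cnn_features(sequential_data)
        combined_features = np.hstack([static_features, cnn_features])
        return self.lr.predict(combined_features)

    def get_feature_importance(self, static_feature_names):
        cnn_feature_names = [f"CNN_Feature_{i}" for i in range(self.feature_dim)]
        all_feature_names = list(static_feature_names) + cnn_feature_names
        importance = pd.DataFrame({
            "feature": all_feature_names,
            "weight": self.lr.coef_[0],
            "abs_weight": np.abs(self.lr.coef_[0]),
        })
        return importance.sort_values("abs_weight", ascending=False)
```

### Two-Stage Training Strategy

The two-stage strategy first trains a CNN on the raw sequence data, then concatenates what the CNN produces with the hand-designed features, and finally fits a logistic regression for classification. In the code below, the CNN's predicted default probability serves as the extracted feature.

```python
class TwoStageRiskTrainer:
    def __init__(self, cnn_epochs=50, lr_C=1.0):
        self.cnn_epochs = cnn_epochs
        self.lr_C = lr_C
        self.cnn = None
        self.lr = None
        self.scaler = StandardScaler()

    def stage1_train_cnn(self, sequential_data, y, batch_size=32):
        # sequential_data must match FinancialCNN's defaults: shape (N, 1, 128)
        self.cnn = FinancialCNN()
        dataset = torch.utils.data.TensorDataset(
            torch.as_tensor(sequential_data, dtype=torch.float32),
            torch.as_tensor(y, dtype=torch.long),
        )
        loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
        optimizer = torch.optim.Adam(self.cnn.parameters(), lr=0.001)
        self.cnn.train()
        for epoch in range(self.cnn_epochs):
            epoch_loss = 0.0
            for batch_x, batch_y in loader:
                optimizer.zero_grad()
                outputs = self.cnn(batch_x)
                loss = F.cross_entropy(outputs, batch_y)
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()

    def _cnn_default_proba(self, sequential_data):
        # Keep only the positive-class probability: the two softmax columns sum
        # to 1, so passing both would add a perfectly collinear feature
        self.cnn.eval()
        with torch.no_grad():
            logits = self.cnn(torch.as_tensor(sequential_data, dtype=torch.float32))
            return F.softmax(logits, dim=1)[:, [1]].cpu().numpy()

    def stage2_train_lr(self, static_features, sequential_data, y):
        combined = np.hstack([static_features, self._cnn_default_proba(sequential_data)])
        combined_scaled = self.scaler.fit_transform(combined)
        self.lr = LogisticRegression(C=self.lr_C, max_iter=1000)
        self.lr.fit(combined_scaled, y)

    def predict(self, static_features, sequential_data):
        combined = np.hstack([static_features, self._cnn_default_proba(sequential_data)])
        combined_scaled = self.scaler.transform(combined)
        return self.lr.predict_proba(combined_scaled)[:, 1]
```

## A Practical Risk-Modeling Case

The following walks through building a credit risk scorecard, from data preprocessing to monitoring the deployed model.

### Data Preprocessing and Feature Engineering

```python
class CreditScorecardPipeline:
    def __init__(self):
        self.feature_columns = None
        self.bins = {}
        self.woe_encoders = {}
        self.iv_values = {}

    def calculate_woe_iv(self, X, y, feature, n_bins=10):
        data = pd.DataFrame({feature: X[feature], "target": np.asarray(y)})
        if pd.api.types.is_numeric_dtype(data[feature]):
            # Quantile binning; integer bin codes keep the WoE lookup independent
            # of how interval labels are formatted
            data["bin"], bin_edges = pd.qcut(data[feature], q=n_bins, labels=False,
                                             duplicates="drop", retbins=True)
            self.bins[feature] = bin_edges
        else:
            data["bin"] = data[feature]
        grouped = data.groupby("bin")["target"].agg(
            good_count=lambda x: (x == 0).sum(),
            bad_count=lambda x: (x == 1).sum(),
        ).reset_index()
        total_good = grouped["good_count"].sum()
        total_bad = grouped["bad_count"].sum()
        grouped["good_dist"] = grouped["good_count"] / total_good
        grouped["bad_dist"] = grouped["bad_count"] / total_bad
        epsilon = 0.0001  # avoids log(0) and division by zero in empty bins
        # WoE is defined here as ln(bad share / good share): higher means riskier
        grouped["woe"] = np.log(
            (grouped["bad_dist"] + epsilon) / (grouped["good_dist"] + epsilon)
        )
        grouped["iv"] = (grouped["bad_dist"] - grouped["good_dist"]) * grouped["woe"]
        iv_total = grouped["iv"].sum()
        self.woe_encoders[feature] = dict(zip(grouped["bin"], grouped["woe"]))
        self.iv_values[feature] = iv_total
        return grouped, iv_total

    def transform_woe(self, X):
        X_woe = pd.DataFrame(index=X.index)
        # Iterate over the columns actually passed in, so a subset of the fitted
        # features (for example after IV filtering) can be transformed
        for feature in X.columns:
            if feature in self.bins:
                codes = pd.cut(X[feature], bins=self.bins[feature], labels=False,
                               include_lowest=True)
            else:
                codes = X[feature]
            # Unseen categories and out-of-range values get a neutral WoE of 0
            X_woe[feature] = codes.map(self.woe_encoders[feature]).astype(float).fillna(0)
        return X_woe

    def fit(self, X, y):
        self.feature_columns = X.columns.tolist()
        iv_results = {}
        for feature in self.feature_columns:
            _, iv = self.calculate_woe_iv(X, y, feature)
            iv_results[feature] = iv
        self.iv_summary = pd.DataFrame(
            list(iv_results.items()), columns=["feature", "IV"]
        ).sort_values("IV", ascending=False)
        return self.iv_summary
```

### Model Training and Validation

```python
def train_credit_scorecard(X_train, y_train, X_val, y_val):
    pipeline = CreditScorecardPipeline()
    iv_summary = pipeline.fit(X_train, y_train)
    # Keep features whose information value exceeds 0.02
    high_iv_features = iv_summary[iv_summary["IV"] > 0.02]["feature"].tolist()
    X_train_woe = pipeline.transform_woe(X_train[high_iv_features])
    X_val_woe = pipeline.transform_woe(X_val[high_iv_features])

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_woe)
    X_val_scaled = scaler.transform(X_val_woe)

    model = LogisticRegression(penalty="l2", C=0.1, class_weight="balanced", solver="liblinear")
    model.fit(X_train_scaled, y_train)

    train_proba = model.predict_proba(X_train_scaled)[:, 1]
    val_proba = model.predict_proba(X_val_scaled)[:, 1]
    train_ks, _ = calculate_ks(y_train, train_proba)
    val_ks, _ = calculate_ks(y_val, val_proba)
    print(f"Training KS: {train_ks:.4f}")
    print(f"Validation KS: {val_ks:.4f}")
    return pipeline, scaler, model, train_proba, val_proba
```

### Scorecard Conversion

A standard scorecard converts the predicted probability from logistic regression into an integer score that business users can read and apply easily.

```python
class ScorecardConverter:
    def __init__(self, base_score=600, base_odds=50, pdo=20):
        # base_odds is the good:bad odds at base_score; pdo is the number of
        # points that doubles those odds
        self.base_score = base_score
        self.base_odds = base_odds
        self.pdo = pdo
        self.factor = pdo / np.log(2)
        self.offset = base_score - self.factor * np.log(base_odds)

    def probability_to_score(self, proba):
        # Clip so that probabilities of exactly 0 or 1 do not produce log(0)
        proba = np.clip(proba, 1e-10, 1 - 1e-10)
        odds = proba / (1 - proba)  # bad:good odds, hence the minus sign below
        score = self.offset - self.factor * np.log(odds)
        return np.clip(score, 0, 1000).astype(int)

    def score_to_risk_level(self, score):
        if score >= 700:
            return "low risk"
        elif score >= 600:
            return "medium-low risk"
        elif score >= 500:
            return "medium risk"
        elif score >= 400:
            return "medium-high risk"
        else:
            return "high risk"

    def create_scorecard_table(self, model, feature_names, scaler):
        coef = model.coef_[0]
        intercept = model.intercept_[0]
        scorecard_data = []
        for i, feature in enumerate(feature_names):
            weight = coef[i]
            # Undo the standardization to get the weight per unit of WoE
            scaled_weight = weight / scaler.scale_[i]
            score_contribution = -self.factor * scaled_weight
            scorecard_data.append({
                "feature": feature,
                "coefficient": weight,
                "points_per_unit": round(score_contribution, 2),
            })
        scorecard_df = pd.DataFrame(scorecard_data)
        # Fold the scaler's mean shift into the intercept before converting to points
        raw_intercept = intercept - np.sum(coef * scaler.mean_ / scaler.scale_)
        base_points = self.offset - self.factor * raw_intercept
        print(f"Base points (offset and intercept): {base_points:.2f}")
        print(f"PDO: {self.pdo}")
        return scorecard_df.sort_values("points_per_unit", ascending=False)
```

### Model Monitoring and Iteration

Once a risk model is live, its performance stability and any drift in the feature or score distributions need continuous monitoring.

```python
class ModelMonitor:
    def __init__(self, baseline_ks, baseline_auc):
        self.baseline_ks = baseline_ks
        self.baseline_auc = baseline_auc
        self.performance_history = []

    def calculate_psi(self, expected_distribution, actual_distribution, n_bins=10):
        # Inputs are predicted probabilities in [0, 1], compared on equal-width bins
        bin_edges = np.linspace(0, 1, n_bins + 1)
        expected_bins = np.histogram(expected_distribution, bins=bin_edges)[0] + 1e-10
        actual_bins = np.histogram(actual_distribution, bins=bin_edges)[0] + 1e-10
        expected_pct = expected_bins / expected_bins.sum()
        actual_pct = actual_bins / actual_bins.sum()
        psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
        return psi

    def check_model_stability(self, current_scores, current_y_true):
        current_ks, _ = calculate_ks(current_y_true, current_scores)
        ks_drop = self.baseline_ks - current_ks
        # Test the larger threshold first so the severe level is reachable
        if ks_drop > 0.2:
            alert_level = "critical"
        elif ks_drop > 0.1:
            alert_level = "warning"
        else:
            alert_level = "normal"
        return {
            "current_ks": current_ks,
            "baseline_ks": self.baseline_ks,
            "ks_drop": ks_drop,
            "alert_level": alert_level,
        }

    def record_performance(self, date, ks_value, auc_value, psi_value):
        self.performance_history.append({
            "date": date,
            "KS": ks_value,
            "AUC": auc_value,
            "PSI": psi_value,
        })

    def generate_monitor_report(self):
        if not self.performance_history:
            return {"records": 0, "alerts": 0, "alert_rate": 0}
        report_df = pd.DataFrame(self.performance_history)
        alert_count = len(report_df[report_df["KS"] < self.baseline_ks - 0.1])
        total_records = len(report_df)
        return {
            "records": total_records,
            "alerts": alert_count,
            "alert_rate": alert_count / total_records,
            "mean_ks": report_df["KS"].mean(),
            "mean_psi": report_df["PSI"].mean(),
        }
```

## Summary

Logistic regression remains a mainstay of financial risk modeling; its main strengths are interpretable feature weights and a simple model form. Standardization resolves differences in feature scale, L1 regularization performs feature selection, and sample weighting combined with cost-sensitive learning addresses class imbalance. Together these steps support a well-performing credit scoring model. CNNs add a new technical angle: one-dimensional convolutions can extract risk features from sequential transaction data automatically, and pairing them with logistic regression combines automatic feature extraction with an interpretable decision layer. In practice, continuous monitoring and iteration after deployment matter just as much: tracking metrics such as PSI and KS reveals model decay early and helps keep the risk system stable over the long run.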
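As a compact illustration of the monitoring idea mentioned in the summary, the sketch below computes PSI and KS for a baseline period and a later period, then maps the KS drop to an alert level. It is a minimal, NumPy-only sketch rather than the article's `ModelMonitor` class: the helper names `ks_statistic`, `psi`, and `alert_level` are hypothetical, the scores are synthetic draws from beta distributions chosen purely for illustration, and the 0.1 and 0.2 alert thresholds are the ones used in the monitoring code of this article.

```python
import numpy as np


def ks_statistic(y_true, scores):
    """Largest gap between cumulative bad and good rates, scanning scores from high to low."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores), kind="stable")
    y_sorted = y_true[order]
    cum_bad = np.cumsum(y_sorted == 1) / max((y_true == 1).sum(), 1)
    cum_good = np.cumsum(y_sorted == 0) / max((y_true == 0).sum(), 1)
    return float(np.max(np.abs(cum_bad - cum_good)))


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index over equal-width bins on [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


def alert_level(ks_drop):
    # Thresholds assumed from the article's monitoring section
    if ks_drop > 0.2:
        return "critical"
    if ks_drop > 0.1:
        return "warning"
    return "normal"


rng = np.random.default_rng(0)

# Synthetic baseline: about 5% defaults, defaulters receive clearly higher scores
y_base = rng.binomial(1, 0.05, 20000)
base_scores = rng.beta(2 + 3 * y_base, 8)

# Synthetic later period: scores drift upward and separate the classes less well
y_now = rng.binomial(1, 0.05, 20000)
now_scores = rng.beta(3 + y_now, 6)

ks_base = ks_statistic(y_base, base_scores)
ks_now = ks_statistic(y_now, now_scores)
psi_value = psi(base_scores, now_scores)
alert = alert_level(ks_base - ks_now)

print(f"baseline KS={ks_base:.3f}, current KS={ks_now:.3f}")
print(f"PSI={psi_value:.3f}, alert={alert}")
```

PSI only needs the scores, so it can be tracked as soon as a new batch is scored, whereas KS has to wait until the default labels for that batch have matured. That difference is one reason the two are usually monitored side by side.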
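The scorecard scaling used in this article can also be checked with a few lines of arithmetic. The sketch below is a standalone illustration, not the `ScorecardConverter` class itself: the function name `score_from_probability` is hypothetical, and it assumes the same defaults as the article (base score 600, good:bad base odds of 50, PDO of 20) with the score rising as the good:bad odds rise.

```python
import math


def score_from_probability(p_bad, base_score=600.0, base_odds=50.0, pdo=20.0):
    """Map a default probability to an unrounded score.

    factor = pdo / ln(2), so doubling the good:bad odds adds exactly pdo points;
    offset is chosen so that odds equal to base_odds land on base_score.
    """
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    good_odds = (1.0 - p_bad) / p_bad
    return offset + factor * math.log(good_odds)


# Default probabilities that correspond to good:bad odds of 50:1 and 100:1
p_at_base = 1.0 / (1.0 + 50.0)
p_at_double = 1.0 / (1.0 + 100.0)

score_base = score_from_probability(p_at_base)
score_double = score_from_probability(p_at_double)

print(f"score at 50:1 odds:  {score_base:.2f}")
print(f"score at 100:1 odds: {score_double:.2f}")
print(f"difference: {score_double - score_base:.2f}")
```

Because the score is linear in the log-odds, each feature's contribution is also linear in its weight, which is what allows a scorecard to be published as a table of points per feature bin.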