The problem of non-convergence of TensorFlow multivariate linear regression parameters

When using TensorFlow for multiple linear regression, I run into a problem with the parameters not converging. The issue seems to lie in the choice of optimizer: with tf.train.AdamOptimizer the parameters converge and the loss is reasonable, but the fitted weights and bias do not match the original coefficients, which is the first thing I don't understand; with tf.train.GradientDescentOptimizer the loss keeps increasing and I can't find out why. I'm a beginner and have not been able to figure this out, so I hope someone can help explain it. The amount of code is not large; here it is:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# generate three input features, each with 100 samples
X1 = np.matrix(np.random.uniform(-10, 10, 100)).T
X2 = np.matrix(np.linspace(-10, 10, 100)).T
X3 = np.matrix(np.linspace(-10, 10, 100)).T
X_input = np.concatenate((X1, X2, X3), axis=1)
# true coefficients: 20, -35, 4.3 and intercept 25
Y_input = 20 * X1 - 35 * X2 + 4.3 * X3 + 25 * np.ones((100, 1))

# variables to be learned: weights and bias
W = tf.Variable(tf.random_uniform(shape=[3, 1]))
b = tf.Variable(tf.random_uniform(shape=[1, 1]))

# placeholders for the inputs and the target
X = tf.placeholder(dtype=tf.float32, shape=[None, 3])
Y = tf.placeholder(dtype=tf.float32, shape=[None, 1])

# linear model: predicted output (the bias broadcasts over the batch)
Y_pred = tf.matmul(X, W) + b

# mean squared error over the 100 samples
loss = tf.reduce_sum(tf.square(Y_pred - Y)) / 100

# Adam optimizer with learning rate 0.01
opt = tf.train.AdamOptimizer(0.01).minimize(loss)
# plain gradient descent (with this one the loss keeps increasing)
# opt = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# record the loss at every iteration for plotting
x_axis = []
y_axis = []

with tf.Session() as sess:
    # initialize all variables
    sess.run(tf.global_variables_initializer())
    print("training, please wait...")
    for i in range(20000):
        sess.run(opt, feed_dict={Y: Y_input, X: X_input})
        x_axis.append(i)
        y_axis.append(sess.run(loss, feed_dict={Y: Y_input, X: X_input}))
    print("finish training!")
    print("W:", sess.run(W), "\nb:", sess.run(b))
    print(sess.run(loss, feed_dict={Y: Y_input, X: X_input}))
    plt.plot(x_axis, y_axis)
    plt.show()
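
For reference, the coefficients that actually reproduce Y_input can be checked outside TensorFlow with an ordinary least-squares solve. Below is a minimal sketch using numpy on the same generated data (the reshape calls are just my way of building column vectors); note that X2 and X3 come from the same linspace, so the design matrix is rank-deficient and more than one weight combination fits Y_input exactly.

import numpy as np

# regenerate the data exactly as in the question
X1 = np.random.uniform(-10, 10, (100, 1))
X2 = np.linspace(-10, 10, 100).reshape(-1, 1)
X3 = np.linspace(-10, 10, 100).reshape(-1, 1)
Y = 20 * X1 - 35 * X2 + 4.3 * X3 + 25

# design matrix with a column of ones for the intercept
A = np.hstack([X1, X2, X3, np.ones((100, 1))])

# minimum-norm least-squares solution; lstsq handles the rank-deficient case
coef, _, rank, _ = np.linalg.lstsq(A, Y, rcond=None)
print("rank of design matrix:", rank)  # prints 3, not 4, because X2 == X3
print("coefficients:\n", coef)         # W1 is about 20, W2 + W3 about -30.7, intercept about 25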

It is easy to overfit
with such a small amount of data, but that is not the main problem. The main reason is that plain gradient descent has no momentum and can more easily get stuck, or, if the learning rate is too large for the data, make the loss grow; Adam has built-in momentum and adaptive step sizes, so it is generally harder to get stuck and tends to perform better.
As for the weights you set at the start, they are only used to compute Y_input; the model fits its own weight values and ignores the ones you used to generate the data, so it is normal that they are different. It would be strange if they came out exactly the same.
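
If you still want to use tf.train.GradientDescentOptimizer, one thing worth trying (the 0.001 below is only a guess for this data scale, not a verified value) is a smaller learning rate, since plain gradient descent without momentum or adaptive step sizes tends to blow up when the step is too large for the input scale. This is a drop-in replacement for the optimizer line in the script above:

# drop-in replacement for the optimizer line; 0.001 is only a guess,
# shrink it further (0.0005, 0.0001, ...) if the loss still grows
opt = tf.train.GradientDescentOptimizer(0.001).minimize(loss)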
