Evaluating Whether Implementing Matrix Multiplication (Dot Product) for Data Manipulation Affects the Performance of a Predictive Model
In the realm of data-driven insights and predictive modelling, the integrity and quality of data are paramount. Data manipulation can protect sensitive information by altering or masking it, which matters wherever data security and privacy are at stake. A pertinent concern arises, however: does data manipulation impact the performance of a predictive model? Before we start investigating this, let's first understand why data manipulation is important.
By implementing data manipulation, sensitive data columns can be modified or masked, so that an attacker who compromises an internal account cannot read the sensitive values.
According to IBM, the global average total cost of a data breach in 2023 was 4.45 million US dollars, a 15% increase over three years. Figures like these make manipulating data a process worth considering carefully.
In this article, we will implement matrix multiplication (dot product) as a technique to manipulate the data.
What is Matrix Multiplication and Dot Product?
Matrix multiplication specifies a set of rules for multiplying matrices together to produce a new matrix.
Not all pairs of matrices can be multiplied, and the dimensions of the inputs determine the dimensions of the result:
- The number of columns of the 1st matrix must equal the number of rows of the 2nd
- The product of an M x N matrix and an N x K matrix is an M x K matrix: the result takes its row count from the 1st matrix and its column count from the 2nd
The dot product is a specific type of matrix multiplication applicable when dealing with vectors (1D matrices). Alternatively, when working with 2D matrices, you can reshape them into 1D vectors and then compute the dot product.
In summary, while matrix multiplication is a broader concept applicable to matrices of varying dimensions, the dot product is a specific case of matrix multiplication when dealing with vectors or flattened matrices.
Here are examples of both 1D (vectors) and 2D matrices for illustrating matrix multiplication and dot product:
One Dimensional Matrices (vectors)
A = [1, 2, 3]
B = [4, 5, 6]
Dot Product:
For the dot product of vectors A and B:
A • B (dot product) = (1 * 4) + (2 * 5) + (3 * 6) = 32
Two Dimensional Matrices
A = [[1, 2, 3],
[4, 5, 6]]
B = [[7, 8],
[9, 10],
[11, 12]]
Matrix Multiplication:
Matrix multiplication of A and B (resulting in a 2x2 matrix):
Result = [[58, 64],
[139, 154]]
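The two examples above can be checked with NumPy, whose `dot` and `@` operators implement exactly these rules:

```python
import numpy as np

# 1D vectors: the dot product sums the element-wise products
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])
print(A.dot(B))  # 32

# 2D matrices: a (2x3) matrix times a (3x2) matrix gives a 2x2 matrix
A2 = np.array([[1, 2, 3],
               [4, 5, 6]])
B2 = np.array([[7, 8],
               [9, 10],
               [11, 12]])
print(A2 @ B2)  # [[58, 64], [139, 154]]
```

Note that trying `A2 @ B2` with incompatible shapes (e.g. both 2x3) raises a `ValueError`, which is NumPy enforcing the column/row rule above.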
How Can Matrix Multiplication (Dot Product) Be Used to Mask Data in a Predictive Model?
Matrix multiplication (dot product) is not a conventional data masking technique by itself. It is a mathematical operation used for various purposes in linear algebra and data manipulation. However, in some specific contexts, matrix multiplication can be used in a process that could be considered a form of data masking.
For instance, matrix multiplication can be used to apply a transformation to the data, such as encryption or obfuscation. This transformation could be viewed as a form of data masking if it alters the original data in a way that hides its true nature or obscures sensitive information.
Let’s roll up our sleeves and go through the process.
Suppose you have completed the data preprocessing step and the exploratory data analysis (EDA) step, and you are now ready to proceed with model training and evaluation. I will compare the performance results of two methods: one with data masking and one without, using RMSE as the performance metric.
1. Create a function to measure the performance metrics
import math
import sklearn.metrics

def eval_regressor(y_true, y_pred):
    # Root mean squared error
    rmse = math.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    print(f'RMSE: {rmse:.2f}')
    # Coefficient of determination, reported as-is (R2 can be negative, so no square root)
    r2 = sklearn.metrics.r2_score(y_true, y_pred)
    print(f'R2: {r2:.2f}')
2. Data Transformation and Creating Matrix P2
import numpy as np

# Create the masking matrix P2: square, with one row/column per feature in X
rng = np.random.default_rng(seed=42)
P2 = rng.random(size=(X.shape[1], X.shape[1]))
P2_inverse = np.linalg.inv(P2)  # raises LinAlgError if P2 happens to be singular
P2_inverse
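A random square matrix is almost surely invertible, but it costs little to check. A minimal self-contained sketch (the feature count here is a placeholder for `X.shape[1]`):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_features = 3  # placeholder for X.shape[1]
P2 = rng.random(size=(n_features, n_features))
P2_inverse = np.linalg.inv(P2)  # would raise LinAlgError for a singular matrix

# P2 @ P2_inverse should be (numerically) the identity matrix
assert np.allclose(P2 @ P2_inverse, np.eye(n_features))
```

Invertibility is what makes this masking usable: a singular P2 would destroy information irrecoverably.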
3. Training the model without masking the data
from sklearn.linear_model import LinearRegression

# Baseline: fit and evaluate on the original (unmasked) features
reg_withoutmasking = LinearRegression()
reg_withoutmasking.fit(X, y)
y_pred_withoutmasking = reg_withoutmasking.predict(X)
eval_regressor(y, y_pred_withoutmasking)
The result:
4. Data Manipulation Process
# Mask the features by right-multiplying with P2
X_mask = X.dot(P2)
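The inverse computed earlier is what makes this masking reversible for authorized users: right-multiplying the masked matrix by P2_inverse recovers the original features. A self-contained sketch with synthetic data (this `X` is a stand-in for the real feature matrix):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
X = rng.random(size=(5, 3))           # stand-in for the real feature matrix
P2 = rng.random(size=(X.shape[1], X.shape[1]))
P2_inverse = np.linalg.inv(P2)

X_mask = X.dot(P2)                    # masked features: unreadable without P2
X_recovered = X_mask.dot(P2_inverse)  # unmasking: (X P2) P2^-1 = X

assert np.allclose(X_recovered, X)    # original values recovered
```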
5. Training the model with data that has been masked
# Fit and evaluate on the masked features
reg_masking = LinearRegression()
reg_masking.fit(X_mask, y)
y_pred_masking = reg_masking.predict(X_mask)
eval_regressor(y, y_pred_masking)
Result:
Conclusion
The RMSE value of the LinearRegression model with or without masking is the SAME.
This is expected: because P2 is invertible, X_mask = X · P2 spans the same column space as X, so a linear model can fit exactly the same predictions on the masked features as on the originals. Masking the data with matrix multiplication (dot product) therefore does not affect the performance of this predictive model.
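The whole experiment can be reproduced end to end on synthetic data; the data and variable names below are my own placeholders, not the article's dataset:

```python
import math
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(seed=42)
X = rng.random(size=(200, 4))                               # synthetic features
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

P2 = rng.random(size=(X.shape[1], X.shape[1]))              # invertible mask
X_mask = X @ P2

# Train once on the original features and once on the masked features
rmse_plain = math.sqrt(mean_squared_error(
    y, LinearRegression().fit(X, y).predict(X)))
rmse_masked = math.sqrt(mean_squared_error(
    y, LinearRegression().fit(X_mask, y).predict(X_mask)))

# The two RMSE values agree up to floating-point error
print(f'{rmse_plain:.6f} vs {rmse_masked:.6f}')
```

Note this equality relies on the model being linear and P2 being invertible; a non-linear model or a rank-deficient mask offers no such guarantee.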
Thank you for dedicating your time to read this article! Stay tuned for more insightful content in my upcoming articles :)
You can visit my GitHub account for the complete code:
Reference:
1. https://www.ibm.com/reports/data-breach
2. https://pathlock.com/learn/5-data-masking-techniques-and-why-you-need-them/
3. https://builtin.com/data-science/dot-product-matrix
4. https://ml-cheatsheet.readthedocs.io/en/latest/linear_algebra.html#matrix-multiplication