Steps to Compute Precision, Recall, and F1 Score
When evaluating models or products for their ability to scan and mask Personally Identifiable Information (PII) in your data, it’s crucial to follow a systematic approach. Let’s assume you have a dataset with 1,000,000 rows, and you want to scan and mask each row. Here’s a simplified guide to checking the effectiveness of a model or product in identifying and masking PII:
1. Run the Text through PII Masking APIs: Use the masking product (such as Protecto) to process the dataset and obtain results.
2. Decide on the Sample Size: Determine the sample size based on your desired confidence level and margin of error.
3. Pick the Sample Set: Randomly select a number of records that meets or exceeds the sample size calculated in Step 2.
4. Establish Ground Truth: Manually go through the text in the sample and identify the PII/PHI to establish the ground truth.
5. Calculate Metrics: Manually compare the ground truth against the API results to compute the metrics: Precision, Recall, and F1 score.
By following these steps, you can effectively evaluate which model or product performs better in scanning and masking PII in your data. This blog covers steps 2 and 5, focusing on the formulas and computations involved.
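Before diving in, here is a minimal Python sketch of how the five steps fit together. It is illustrative only: `mask_pii` is a hypothetical stand-in for whatever masking API you are evaluating, and the ground-truth comparison (Steps 4 and 5) remains a manual exercise.

```python
import random

def draw_evaluation_sample(rows, sample_size, mask_pii):
    """Run the masking API over the dataset and draw a random sample for manual review.

    rows: the full dataset (e.g., 1,000,000 text records)
    sample_size: computed in Step 2 from the confidence level and margin of error
    mask_pii: hypothetical callable wrapping the masking API (Step 1)
    """
    masked = [mask_pii(row) for row in rows]                 # Step 1: run the API
    indices = random.sample(range(len(rows)), sample_size)   # Step 3: random sample
    # Steps 4 and 5: manually establish ground truth for each sampled row,
    # then compare it against the API output to tally TP/FP/TN/FN.
    return [(rows[i], masked[i]) for i in indices]
```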
Decide on the Sample Size (Step 2)
Determining an appropriate sample size to check if PII is masked in a dataset depends on several factors, including the size of the dataset, the proportion of PII, and the acceptable level of confidence in the results.
Determine the Confidence Level and Margin of Error:
- Confidence Level (Z-score): Typically set at 95%, meaning you can be 95% confident that the sample results reflect the full dataset within the margin of error. The Z-score for 95% confidence is 1.96.
- Margin of Error (E): The acceptable error margin is often set at 5%.
- Estimate the Population Proportion (p): If you have prior knowledge or historical data, use that proportion. If not, use 0.5 (50%) as a conservative estimate since it provides maximum variability.
- Use the Sample Size Formula (n): n = (Z² × p × (1 − p)) / E²
- Adjust for Finite Population (if necessary): If the dataset size (N) is large (above 10,000), you don't need the finite population correction. If it is less than 10,000, apply the correction n_adjusted = n / (1 + (n − 1) / N). A short code sketch of both formulas follows this list.
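As a quick computational aid, here is a minimal Python sketch of both formulas. The function name `required_sample_size` is our own invention, not a library API:

```python
import math

def required_sample_size(z: float, p: float, e: float, population: int | None = None) -> int:
    """Sample size needed to estimate a proportion.

    z: Z-score for the desired confidence level (e.g., 1.96 for 95%)
    p: estimated population proportion (use 0.5 if unknown)
    e: margin of error (e.g., 0.05 for 5%)
    population: dataset size N; pass it to apply the finite
        population correction (relevant when N < 10,000)
    """
    n = (z ** 2) * p * (1 - p) / (e ** 2)
    if population is not None:
        n = n / (1 + (n - 1) / population)  # finite population correction
    return math.ceil(n)  # round up to the next whole record
```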
Example: Sample Size Estimation
How many records should be sampled to test if PII is masked in a dataset of 1 million records, where only 10% of the records contain PII, for confidence levels of 99%, 95%, and 80%?
Parameters:
- Population size (N): 1,000,000 records
- Proportion of records with PII (p): 10% or 0.10
- Z-score (Z): 1.96 for a 95% confidence level
- Margin of error (E): 5% or 0.05
Calculations:
For 95% Confidence Level:
n = (1.96² × 0.10 × 0.90) / 0.05² = 0.3457 / 0.0025 ≈ 138.3, which rounds up to 139.
Interpretation: If 10% of the records contain PII, you can evaluate the models by calculating your scores on a sample of 139 rows from the dataset of 1 million. This sample size lets you state, with 95% confidence and a 5% margin of error, whether the PII is masked properly.
Since the population size is large, the finite population correction is not needed.
To recompute the above for other confidence levels, here are the Z-scores to substitute:
- 99% confidence level (Z = 2.576)
- 95% confidence level (Z = 1.96)
- 80% confidence level (Z = 1.282)
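Plugging the example parameters (p = 0.10, E = 0.05) into the `required_sample_size` sketch above gives the sample size at each confidence level:

```python
for label, z in (("99%", 2.576), ("95%", 1.96), ("80%", 1.282)):
    print(f"{label} confidence: {required_sample_size(z, p=0.10, e=0.05)} records")
# 99% confidence: 239 records
# 95% confidence: 139 records
# 80% confidence: 60 records
```

As expected, a higher confidence level demands a larger sample.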
Calculate Metrics (Step 5)
The F1 score, recall, and precision are metrics used to evaluate the performance of a classification model. In the context of checking if PII is masked in a dataset, these metrics can help measure the accuracy of identifying correctly masked PII records.
Definitions:
- Precision: The ratio of correctly identified positive observations to the total predicted positives. It answers the question: “Of all the records predicted as having PII, how many actually have PII?”
- Recall (Sensitivity): The ratio of correctly identified positive observations to all observations in the actual class. It answers the question: “Of all the records that actually contain PII, how many did we correctly identify and mask?”
- F1 Score: The harmonic mean of precision and recall. It provides a single metric that balances both precision and recall.
Calculations:
Given a sample, you will typically have four possible outcomes:
- True Positives (TP): PII is correctly identified.
- False Positives (FP): Non-PII is incorrectly identified as PII.
- True Negatives (TN): Non-PII is correctly identified as not PII.
- False Negatives (FN): PII is missed, i.e., not identified as PII.
Formulas:
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Example:
Let’s consider an example with the following counts from the sample of 139 records we picked. For simplicity, assume there is only one type of PII (Person Names):
- True Positives (TP) = 12: The model correctly identified and masked 12 Person Names.
- False Positives (FP) = 3: The model incorrectly masked 3 non-names as Person Names.
- True Negatives (TN) = 122: The model correctly identified 122 rows as containing no names.
- False Negatives (FN) = 2: The model missed identifying and masking 2 Person Names.
Step-by-Step Calculation:
- Precision = TP / (TP + FP) = 12 / (12 + 3) = 12 / 15 = 0.8
- Recall = TP / (TP + FN) = 12 / (12 + 2) = 12 / 14 ≈ 0.8571
- F1 Score = 2 × (0.8 × 0.8571) / (0.8 + 0.8571) ≈ 0.8276
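The same arithmetic in a few lines of Python; the helper function is our own, and the counts are from the example above:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=12, fp=3, fn=2)
print(f"Precision={p:.4f}, Recall={r:.4f}, F1={f1:.4f}")
# Precision=0.8000, Recall=0.8571, F1=0.8276
```

If you have per-record labels rather than aggregate counts, scikit-learn's `precision_recall_fscore_support` yields the same numbers.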
Interpretation:
- Precision of 0.8 means that 80% of the records the model flagged as containing PII actually did contain PII.
- Recall of 0.8571 means that 85.7% of all the records containing PII were correctly identified; about 14.3% of PII records were missed.
- F1 Score of 0.8276 reflects a balanced performance between precision and recall, indicating good overall accuracy. An F1 score of 80% is typically considered good, and anything above 90% is considered excellent in the market.
By following these steps and calculations, you can effectively measure the accuracy of your PII scanning and masking models.