Data mining risk factors for cervical cancer
Sept – Dec 2017

Scenario
As part of my graduate studies in Data Science, I teamed up with a classmate to perform a series of data mining tasks to help the World Health Organization (WHO) Cancer Control Programme achieve better reproductive health outcomes for women. Specifically, we analyzed risk factors for cervical cancer to help WHO improve cancer outcomes through prevention, treatment, and public education, supported by policies, plans, and programs that promote public health.
Our Process
Data Mining Objectives
- Identify trends and correlations between specific risk factors and the presence of human papillomavirus (HPV) leading to a diagnosis of cervical cancer.
- Predict the probability that women in target groups will develop cervical cancer.
- Classify which risk factors are more likely to lead to HPV infection.
Dataset: Risk Factors for Cervical Cancer
| Source | Hospital Universitario de Caracas, Venezuela |
| Size | 858 instances, 36 attributes |
| Attributes | Age of initial sexual activity, number of sexual partners, age of first pregnancy, number of pregnancies, tobacco use, birth control method, STD status, other cancer-related diagnoses, … |
Data Challenges: Preprocessing
We performed a series of preprocessing steps to make our data usable: scrubbing instances with missing or corrupt values, defining unclear attribute labels, and accounting for attributes absent from the dataset that might otherwise have improved the accuracy of our predictions.
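The scrubbing step can be sketched roughly as follows. This is a minimal illustration in plain Python, not the tooling used in the study: the missing-value markers, column names, and sample rows are all assumptions made up for the example.

```python
import csv
import io

# Assumed missing-value markers; the source does not say how gaps were encoded.
MISSING = {"", "?", "NA"}


def drop_incomplete_rows(rows, required):
    """Keep only rows where every required column holds a usable value."""
    kept = []
    for row in rows:
        if all((row.get(col) or "").strip() not in MISSING for col in required):
            kept.append(row)
    return kept


# Tiny made-up sample; the column names are illustrative only.
sample_csv = """age,smokes,num_pregnancies
34,1,2
29,?,1
41,0,
52,1,4
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
clean = drop_incomplete_rows(rows, required=["age", "smokes", "num_pregnancies"])
print(len(rows), "rows before,", len(clean), "rows after")
```

Dropping rows is the simplest option; with only 858 instances, imputing values instead may be worth considering so that less data is lost.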
Data Preparation: Logistic Regression
We started our investigation by performing logistic regression on the entire dataset, with the diagnosis of cervical cancer as our target (dependent) variable. The results showed that the attributes for abnormal cervical changes (dysplasia), smoking, and other cancer diagnoses were the three most significant factors for cervical cancer; the model reached 99% accuracy.
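To show what logistic regression does with a binary target, here is a small sketch written from scratch in plain Python. It is not the software or configuration used in the study; the two features, the toy data, the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Squash a real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def predict(w, xi):
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1 if sigmoid(z) >= 0.5 else 0


# Made-up rows: [dysplasia, smokes]; target 1 = cancer diagnosis.
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0]

weights = fit_logistic(X, y)
predictions = [predict(weights, xi) for xi in X]
train_accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print("weights (bias, dysplasia, smokes):", [round(w, 2) for w in weights])
print("training accuracy:", train_accuracy)
```

The size and sign of each fitted weight indicate how strongly an attribute pushes the prediction toward a diagnosis, which is one way attributes can be ranked by importance.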

Data Mining Process
Which behavior factors may lead to cervical cancer diagnosis?
Since our target was categorical in nature, we used classification techniques. We chose J48 decision trees (Weka's implementation of the C4.5 algorithm) because they produce a robust model and make the results easier to visualize.
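C4.5-style trees such as J48 choose each split by comparing attributes on their gain ratio. The sketch below computes that measure in plain Python to show how one attribute can outrank another; the attribute names and the eight toy records are invented for illustration and are not taken from the study's data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain from splitting on `values`, divided by the split entropy."""
    total = len(labels)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(v, []).append(lab)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Made-up records: 1 = yes, 0 = no.
diagnosis = [1, 1, 0, 0, 1, 0, 0, 0]
smokes = [1, 1, 0, 0, 1, 0, 1, 0]
uses_iud = [0, 1, 0, 1, 0, 1, 0, 1]

print("gain ratio, smokes:", round(gain_ratio(smokes, diagnosis), 3))
print("gain ratio, IUD:", round(gain_ratio(uses_iud, diagnosis), 3))
```

In this toy data the smoking attribute separates the diagnoses more cleanly, so a tree built this way would place it nearer the root.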
To make the model more useful and reduce the risk of overfitting, we first trained it on a subset of the data with known outcomes. We then used the remainder of the data to test its predictions, where the model reached 98% accuracy.
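A hold-out evaluation of this kind can be sketched as below. The 70/30 split ratio and the random seed are assumptions, since the write-up does not state how the data was divided; only the dataset size of 858 comes from the description above.

```python
import random


def holdout_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the known outcomes."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Stand-in record IDs for the 858 instances in the dataset.
record_ids = list(range(858))
train, test = holdout_split(record_ids)
print(len(train), "training rows,", len(test), "test rows")

# Accuracy is measured only on rows the model never saw during training.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Because the dataset has far fewer positive diagnoses than negative ones, accuracy alone can look high even for a weak model, so per-class measures such as recall are worth checking alongside it.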
Results
Results revealed that smoking is the most important factor for cancer diagnosis, especially when the patient also presents with another type of cancer. However, when we ran additional models with different combinations of Group 1 attributes, the attributes for packs per year and number of years smoking had no significant effect on cancer outcomes. Whether a patient smokes at all makes the biggest difference.
Both smoking and the presence of other cancers weaken the immune system, reducing the body’s ability to fight off infections, including HPV.
Which birth control method (Pill or IUD) will be more likely to result in cervical cancer?
After removing missing values (13% for OC and 14% for IUD), converting numeric values to nominal for categorical attributes, and discretizing the values for years of birth control use, we ran more decision tree models.
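The two conversions mentioned here, numeric flags to nominal labels and years of use to bins, might look something like this. The 0-7 and 8-14 ranges match the bins reported in the results; the "15+" label, the boundary handling for fractional years, and the function names are assumptions made for the sketch.

```python
def to_nominal(flag):
    """Turn a 0/1 numeric flag into a nominal yes/no label."""
    return "yes" if int(flag) == 1 else "no"


def bin_years(years):
    """Map years of birth control use to a nominal bin label.

    Fractional values above 7 fall into the next bin; this boundary
    rule is an assumption, not something the write-up specifies.
    """
    if years <= 7:
        return "0-7"
    if years <= 14:
        return "8-14"
    return "15+"  # assumed label for the longest-duration bin


# Made-up values for illustration.
for years in [0.5, 7, 8, 14, 20]:
    print(years, "->", bin_years(years))
print(to_nominal(1), to_nominal(0))
```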

Results
As we expected, the results indicated that CIN (dysplasia) and HPV are the primary predictors of cervical cancer. However, the model also predicted that the number of years of hormonal contraceptive use may influence cervical cancer outcomes. Interestingly, women in the bin with the longest duration of hormonal contraceptive use were less likely to develop cervical cancer than those who used this birth control method for 0-7 or 8-14 years.
Women who don’t use birth control (269 instances in this case) are more at risk not only for unplanned pregnancies but also for STIs, which may increase their chances of developing cervical cancer.
Do STIs have any positive relationship with cervical cancer diagnosis, and if so, which STI is the leading factor?
With this subset of the data, we tested whether the presence of STDs had a positive relationship with the target diagnosis and, if so, which type of STD was the leading factor.
Results
Decision tree models found no correlation, indicating that the presence of other STDs does not appear to be significant for cancer outcomes. We concluded that more models and more data are needed to say conclusively whether this factor affects cancer outcomes.
Are there any interesting patterns in the data?
Since we had no predefined target for this question, we applied unsupervised k-means clustering to all attributes as a baseline for exploration, then refined the attribute set based on the results.
Four clusters were generated using 14 attributes, with Cluster 0 forming the highest-risk group for cervical cancer. Women in this group tend to be smokers who are less likely to use birth control and more likely to have STDs (and HPV in particular), CIN (precancerous cervical cell changes), and cervical cancer.
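For readers unfamiliar with k-means, the sketch below shows the basic loop of assigning points to the nearest centroid and then recomputing centroids. It is a plain Python illustration under simplifying assumptions: two made-up numeric features and two clusters rather than the 14 attributes and four clusters used in the study, with a fixed iteration count and seed.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points into k groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Made-up points forming two obvious groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

With real attributes on different scales, values would normally be normalized first so that no single attribute dominates the distance calculation.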

Results
The highest-risk cluster did not differ significantly from the lowest-risk cluster for age of sexual initiation, number of pregnancies, or number of sexual partners. Thus, these factors may not be as important for cancer outcomes.
Deployment Recommendations
Clustering and Classification
The WHO can use our clustering models to identify the most at-risk populations of women and then apply our predictive models using new data instances to predict cervical cancer in these groups.
More data
WHO should aggregate larger and more robust datasets for further study. This will not only improve the accuracy and predictive power of our models, but also open the door to new behavioral insights, risk associations, and more meaningful health solutions.
A deeper dive into risk factors
Our model may also be used as a foundation to guide further data collection and analysis for researching other risk factors for cervical dysplasia and pre-cancer, such as the presence of other cancers and compromised immune system functioning.
Ethical considerations
HIPAA compliance, patient privacy, and informed consent should be fully understood by both patients and health professionals.
For more details about our Data Science study, take a peek at the full report.
https://annesawyerux.com/wp-content/uploads/2019/11/assignment_deliverable3.pdf