Infant Birth Weight Prediction

Linsi Lin

Introduction

Infant birth weight within the normal range is a significant indicator, for parents as well as doctors, that the baby is healthy. A significant deviation can indicate an abnormality. This dataset is from a Kaggle competition and has 101,399 observations and 36 predictors. For this project, I picked 24 predictors: infant sex, parents’ age and educational level, completed weeks of gestation, past birth information, and some complications that the mother might have. Together these represent both external and internal factors associated with the mother and her environment. The goal is to find the model that predicts infant birth weight as accurately as possible while still allowing the significance of each predictor to be interpreted.

Variable Names and Descriptions:

FAGE: age of father
GAINED: weight gained during pregnancy in pounds
VISITS: number of prenatal visits during the pregnancy
MAGE: mother's age
FEDUC: father's years of education
MEDUC: mother's years of education
TOTALP: total pregnancies
BDEAD: number of children born alive now dead
TERMS: number of other terminations
WEEKS: completed weeks of gestation
CIGNUM: average number of cigarettes smoked daily (mother)
DRINKNUM: average number of drinks consumed daily (mother)
SEX: sex of the baby (1 is boy and 2 is girl)
MARITAL: marital status of the parents (1 is married and 2 is unmarried)
LOUTCOME: outcome of last delivery
DIABETES: mother has/had diabetes (0 is no and 1 is yes)
HYDRAM: mother has/had hydramnios/oligohydramnios (0 is no and 1 is yes)
HYPERCH: mother has/had chronic hypertension (0 is no and 1 is yes)
HYPERPR: mother has/had pregnancy hypertension (0 is no and 1 is yes)
ECLAMP: mother has/had eclampsia (0 is no and 1 is yes)
CERVIX: mother has/had incompetent cervix (0 is no and 1 is yes)
PINFANT: mother has/had previous infant over 8.8 pounds (0 is no and 1 is yes)
PRETERM: mother has/had previous preterm/small infant (0 is no and 1 is yes)
UTERINE: mother has/had uterine bleeding (0 is no and 1 is yes)
BWEIGHT: baby's weight at birth in pounds (response variable)

Get the Data

The dataset has no missing values.

Data Preprocessing

Exploratory Data Analysis

Among all the numeric variables, WEEKS is the most strongly correlated with infant birth weight. Multicollinearity does not appear to be an issue in this dataset.
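As a rough illustration of this kind of check, the sketch below computes the Pearson correlation of each numeric column with the response. The data are synthetic stand-ins, not the Kaggle dataset, and the function name and coefficients are hypothetical.

```python
import numpy as np

def correlations_with_response(X, y):
    """Pearson correlation of each column of X with the response y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # center each predictor
    yc = y - y.mean()                # center the response
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    return num / den

# Synthetic stand-in data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
weeks = rng.normal(39, 2, size=500)
gained = rng.normal(30, 10, size=500)
bweight = 0.3 * weeks + 0.01 * gained + rng.normal(0, 1, size=500)

r = correlations_with_response(np.column_stack([weeks, gained]), bweight)
print(dict(zip(["WEEKS", "GAINED"], r.round(2))))
```

The same function applied to pairs of predictors gives the correlation matrix used to look for multicollinearity.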

Linear Model

Residual plots are a useful graphical tool for identifying non-linearity. The red line is a smooth fit to the residuals, intended to make it easier to identify a trend. Ideally, the residual plot shows no discernible pattern. Here, however, the red line is somewhat curved and the residuals show non-constant variance. The presence of a discernible pattern indicates that the relationship between infant birth weight and the predictors may not be truly linear.

The Breusch-Pagan (BP) test gives a p-value below the 0.05 significance level. We can therefore reject the null hypothesis that the variance of the residuals is constant and infer that heteroscedasticity is indeed present, which confirms what we observe in the graph above.
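To make the idea behind the test concrete, here is a minimal sketch of the studentized (Koenker) form of the Breusch-Pagan statistic: regress the squared OLS residuals on the predictors and take n times the R-squared of that auxiliary regression. The data are synthetic and the function name is hypothetical; the analysis above may have used a library routine with different defaults.

```python
import numpy as np

def breusch_pagan_lm(X, y):
    """Studentized Breusch-Pagan LM statistic: n * R^2 of e^2 on X."""
    y = np.asarray(y, dtype=float)
    Z = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid2 = (y - Z @ beta) ** 2                 # squared OLS residuals
    gamma, *_ = np.linalg.lstsq(Z, resid2, rcond=None)
    fitted = Z @ gamma
    ss_res = ((resid2 - fitted) ** 2).sum()
    ss_tot = ((resid2 - resid2.mean()) ** 2).sum()
    return len(y) * (1.0 - ss_res / ss_tot)

# Synthetic data whose noise grows with x (assumption, for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=1000)
y = 2.0 * x + rng.normal(0, 1, size=1000) * 0.5 * x

lm_stat = breusch_pagan_lm(x, y)
# Under the null, the statistic is approximately chi-squared with one
# degree of freedom per predictor; the 5% critical value for 1 df is 3.841.
print(lm_stat > 3.841)
```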

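After correcting the standard errors for heteroscedasticity, the coefficient estimates do not change; only the standard errors do. The model therefore remains good for interpretation.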
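The sketch below shows why only the standard errors move. It assumes the correction is a heteroscedasticity-consistent "sandwich" estimator (here the simplest variant, HC0); the variant actually used in the analysis is not stated, and the data and function name are hypothetical.

```python
import numpy as np

def ols_with_robust_se(X, y):
    """OLS coefficients with classical and HC0 (sandwich) standard errors."""
    y = np.asarray(y, dtype=float)
    Z = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    n, k = Z.shape
    XtX_inv = np.linalg.inv(Z.T @ Z)
    beta = XtX_inv @ Z.T @ y                     # same point estimates either way
    resid = y - Z @ beta
    sigma2 = resid @ resid / (n - k)
    se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))
    meat = Z.T @ (Z * resid[:, None] ** 2)       # sum of e_i^2 * z_i z_i'
    se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_classical, se_robust

# Synthetic heteroscedastic data (assumption, for illustration only).
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=1000)
y = 2.0 * x + rng.normal(0, 1, size=1000) * 0.5 * x

beta, se_classical, se_robust = ols_with_robust_se(x, y)
print(beta, se_classical, se_robust)
```

The coefficient vector is computed once and shared by both sets of standard errors, which is why the interpretation of each coefficient is unaffected.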

Skewness is a measure of the asymmetry of a distribution. A symmetrical distribution has a skewness of 0, so a normal distribution has a skewness of 0. Skewness essentially measures the relative size of the two tails. As a rule of thumb, skewness should be between -1 and 1.
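For reference, a minimal sketch of the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second). Library implementations may apply a small-sample bias correction, so their values can differ slightly; the function name is hypothetical.

```python
import numpy as np

def sample_skewness(x):
    """Moment-based skewness: m3 / m2**1.5 (no bias correction)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = (d ** 2).mean()    # second central moment
    m3 = (d ** 3).mean()    # third central moment
    return m3 / m2 ** 1.5

symmetric = sample_skewness([1, 2, 3, 4, 5])      # symmetric sample
right_tailed = sample_skewness([1, 1, 1, 10])     # long right tail
print(symmetric, right_tailed)
```

A symmetric sample gives a value of zero, and a long right tail gives a positive value.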

Linear Model Selection

The true model may not be linear, but we can use Linear Model Selection and Regularization to get an idea of what predictors are most significant as we will need to use those predictors for non-linear methods.

Best subset selection and backward selection choose the same best eight predictors.
Forward selection picks seven of the same predictors, differing only in one predictor, MAGE. Note that WEEKS and GAINED are the two most significant predictors, as they were included in all three models throughout the eight steps. This will be very useful when applying non-linear methods to the data, either by adding interaction terms or by adding polynomial terms.
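Forward selection as described here can be sketched as a greedy loop: at each step, add the predictor that lowers the residual sum of squares the most. This is a simplified illustration on synthetic data with hypothetical function names, not the routine used to produce the results above.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of OLS on an intercept plus the given columns."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def forward_selection(X, y, n_steps):
    """Greedily add the column that most reduces RSS, n_steps times."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_steps):
        best = min(remaining, key=lambda c: rss(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: only columns 0 and 2 carry signal (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(0, 0.5, size=400)

order = forward_selection(X, y, n_steps=2)
print(order)
```

Backward selection runs the same idea in reverse, starting from the full model and dropping the least useful predictor at each step.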

Ridge and Lasso Regression

Principal Components Regression and Partial Least Squares

Non-linear Model

Logarithmic Transformation

There is some evidence of a slight non-linear relationship in the data, especially at the far right and far left. Most observations lie between exp(0.5)=1.6 pounds and exp(2)=7.4 pounds. This makes sense for infant birth weight, as very few observations are either lower than 1.6 pounds or higher than 7.4 pounds.
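The back-transformation from the log scale to pounds can be checked directly. This is only the arithmetic behind the two bounds quoted above; the variable names are hypothetical.

```python
import math

# Bounds on the log(BWEIGHT) scale, as read off the plot.
log_bounds = (0.5, 2.0)

# exp() undoes the natural-log transformation and returns pounds.
pound_bounds = tuple(round(math.exp(v), 1) for v in log_bounds)
print(pound_bounds)  # (1.6, 7.4)
```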

Polynomial Regression

Step Function

Cubic Splines

Regression Trees

Bagging and Random Forest

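Bagging considers all 24 predictors at each split, while Random Forest considers only a random subset of about the square root of 24, that is around 5 predictors, at each split. Bagging is therefore just a special case of Random Forest in which the subset is the full set of predictors; restricting the subset is how Random Forest tries to decorrelate the trees.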
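The difference between the two methods comes down to how many predictors are candidates at each split. The sketch below illustrates only that sampling step, with hypothetical names; rounding conventions for the square root vary between implementations, and rounding to the nearest integer is an assumption made here to match the "around 5" figure.

```python
import math
import random

P = 24                              # number of predictors in this project
m_bagging = P                       # bagging: every predictor is a candidate
m_forest = round(math.sqrt(P))      # sqrt(24) is about 4.9, so 5 here

def candidate_predictors(n_predictors, m, rng):
    """Indices of the m predictors considered at a single split."""
    return sorted(rng.sample(range(n_predictors), m))

rng = random.Random(0)
bagging_split = candidate_predictors(P, m_bagging, rng)
forest_split = candidate_predictors(P, m_forest, rng)
print(len(bagging_split), len(forest_split))
```

Because each split in a Random Forest sees a different small subset, a single strong predictor cannot dominate every tree, which is what reduces the correlation between trees.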

Boosting

Summary plot

Conclusion

The test MSE does not differ much among the methods. However, Boosting is the best choice if the goal is to predict infant birth weight as accurately as possible, since it yields the lowest test MSE. Linear Regression is the best option for interpreting the significance of each predictor, since the difference between Boosting and Linear Regression is only 0.15 pounds.

The results from all models consistently indicate that completed weeks of gestation is the most important predictor of infant birth weight. Holding all other predictors constant, one more gestational week is associated with an increase of 0.29 pounds in infant birth weight.

The following are some more findings:

– Sex of the baby
The birth weight of the baby is correlated with their sex. Male babies are found to be a little heavier than female babies.

– Age and educational levels of parents
These predictors turn out to be insignificant for the prediction of infant birth weight.

– Total pregnancies and past birth information
A mother’s past pregnancy history plays an important role in predicting infant birth weight. For example, mothers who have had previous terminations tend to have lighter babies. Mothers who have had children born alive who are now dead also tend to have newborns with a lower birth weight.

– Completed weeks of gestation
This turns out to be the dominant predictor, as it is directly correlated with the weight of the baby.

– Weight gained during pregnancy
How much weight the mother gained during the pregnancy is strongly related to the weeks of gestation. It turns out to be a significant predictor of infant birth weight.

– Average number of cigarettes and drinks per day (Mother)
The results indicate that cigarettes tend to affect infant birth weight negatively. Surprisingly, however, the number of drinks the mother consumed per day is insignificant for infant birth weight.

– Medical History
Medical conditions, irrespective of their seriousness, can have an impact on infant birth weight. For example, a mother with diabetes tends to have a heavier infant. A mother with uterine bleeding, chronic hypertension or an incompetent cervix may have a lighter infant. The difference in infant birth weight for a mother who has an incompetent cervix can be 1 pound.