Oral Defense of Doctoral Dissertation: Redouane Betrouni
Nov 22, 2021, 11:00 AM - 12:30 PM
Doctor of Philosophy in Computational Sciences and Informatics
Department of Computational and Data Sciences
College of Science
George Mason University
Redouane Betrouni
Bachelor of Science, University of Sciences and Technology Houari Boumediene, 1996
Master of Science, George Mason University, 2009
EFFICIENT DATA SPLITTING METHODS FOR MACHINE LEARNING MODEL FITTING
Monday, November 22, 2021, 11:00 a.m. to 12:30 p.m.
All are invited to attend.
Committee
Jason Kinser, Committee Chair
Edward Wegman
Igor Griva
Dhafer Marzougui
Abstract: In this Ph.D. dissertation, I develop a new sampling method, which I name PCA-Systematic sampling, as an improved stratified systematic sampling design for optimally splitting data into training and testing subsets. This procedure helps machine learning algorithms avoid the classical mistake of overfitting. Although it may be slightly more computationally expensive, it compensates for this apparent weakness by giving a better estimate of test error and improving prediction accuracy. The dissertation provides computational and theoretical evidence for the benefits of the proposed sampling design over traditional approaches. Examples and mathematical evidence show how traditional splitting methods, such as partitioning data by simple random sampling, can distort the relationship between important covariates and the variable of interest in the test dataset and consequently lead to either poor model construction or poor assessment of model fit. I also create a sampling utility score index as a data quality control tool for assessing data splits or sampling designs. I derive and study the mathematical properties of this index, conduct a sensitivity analysis to investigate how it behaves under different sampling design scenarios, and use Monte Carlo simulation to assess its significance. Finally, this dissertation contributes to the fields of survey sampling and predictive modeling by applying the newly developed methodology to three distinct publicly available datasets. I show how the PCA-Systematic sampling design can be applied to real survey data such as the Annual Survey of Public Employment and Payroll (ASPEP) and the American Housing Survey (AHS), and I provide evidence of improved estimates in comparison with traditional systematic sampling methods.
My novel PCA-Stratified-Systematic sampling method outperforms current state-of-the-art sampling methods on the classification problem for the Fisher Iris data.
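The abstract does not spell out the algorithm, so the following is only a rough illustration of one plausible reading of "PCA-Systematic" splitting, not the dissertation's actual method. It assumes the rows are ordered by their score on the first principal component and that every k-th row, from a random start, goes to the test set. The function name `pca_systematic_split` and its parameters `test_fraction` and `seed` are hypothetical.

```python
import numpy as np

def pca_systematic_split(X, test_fraction=0.2, seed=0):
    """Hypothetical sketch: systematic sampling along the first principal component.

    This is an assumed interpretation for illustration, not the
    procedure defined in the dissertation.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Center the data; the first right-singular vector of the centered
    # matrix is the first principal axis.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ vt[0]

    # Order rows by their first-component score.
    order = np.argsort(scores, kind="stable")

    # Sampling interval k: take every k-th row of the ordered data.
    k = max(2, int(round(1.0 / test_fraction)))
    rng = np.random.default_rng(seed)
    start = int(rng.integers(k))  # random start in [0, k)

    test_idx = np.sort(order[start::k])
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    return train_idx, test_idx


# Usage on synthetic data (100 rows, 4 features).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
train_idx, test_idx = pca_systematic_split(X, test_fraction=0.2)
```

Because the test rows are spread evenly along the ordered scores, the test subset covers the same range of the dominant direction of variation as the training subset, which is the kind of balance a simple random split does not guarantee.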
Join Zoom Meeting
https://gmu.zoom.us/j/97730097375?pwd=N0Uxb1JuWmduMUVCSUdsL0gxbkJ4Zz09
Meeting ID: 977 3009 7375
Passcode: 823461
One tap mobile
+13017158592,,97730097375#,,,,*823461# US (Washington DC)
+12678310333,,97730097375#,,,,*823461# US (Philadelphia)
Dial by your location
+1 301 715 8592 US (Washington DC)
+1 267 831 0333 US (Philadelphia)