880259: Statistics and Methodology

General

Language of instruction: English
Teaching format: Plenary sessions and labs (Lecture schedule)
Assessment format: Graded weekly assignments and a final exam (Exam schedule)
Level: Master
Study load: 6 ECTS credits
Enrollment: Via Blackboard before the start of lectures
Blackboard information: Link to Blackboard (If you receive the message 'Guest are not allowed in this course', you still need to log in to Blackboard)

Lecturer(s)

C.H. van Heck MSc (coordinator, spring)

dr. K.M. Lang (coordinator, fall)


Course objectives

This course will provide students with a firm grounding in the fundamentals of statistical inference/prediction and prepare them to hone their data modeling skills as they progress in the Data Science: Business and Governance track.

After completing this course:

  1. The student understands the fundamentals of statistical inference and prediction, the concept of a statistical model, and how these tools are used by data scientists.
     a. The student can accurately describe the concepts of statistical inference and prediction and give real-world examples of each.
     b. The student can accurately describe how statistical inference and prediction are used by data scientists.
     c. When presented with an applied regression or classification problem, the student can choose an appropriate modeling strategy for the problem and justify their choice.
  2. The student understands the importance of modeling uncertainty and can apply statistical reasoning when analyzing real-world problems to evaluate the certainty of their own conclusions.
     a. The student can describe what “uncertainty” means in applied data science problems.
     b. The student can describe why data scientists need to control for uncertainty and can articulate the consequences of failing to do so.
     c. When applying statistical modeling techniques to real data, the student will consider the uncertainty in their estimates/conclusions when interpreting/presenting their findings.
  3. The student can apply linear regression and logistic regression methods to create new, domain-relevant knowledge from the information contained in real-world data.
     a. When provided with a clean, flat dataset and a regression problem involving variables in the dataset, the student can estimate a multiple linear regression model and provide relevant, substantive answers.
     b. When provided with a clean, flat dataset and a classification problem involving variables in the dataset, the student can estimate a multiple logistic regression model and provide relevant, substantive answers.
     c. In addition to the competencies in (a) and (b), the student can apply the fitted linear or logistic regression model to new data and provide model-based predictions of the criterion value or group membership, respectively.


Course content

Statistics can be defined as the process of learning from data—a key task in data science. Data scientists use statistics to make trustworthy, generalizable recommendations.

This course will focus on the aspects of statistics that are most useful for data scientists. For example, statistical modeling will be presented primarily from the perspective of predictive analytics, with relatively less time devoted to inference about specific parameter estimates.

The course content will be roughly divided into three modules:

  • Introduction to statistical inference, prediction, and modeling
  • Modeling linear dependence via linear regression
  • Classification via logistic regression


Additional information

The class will meet twice weekly: once in a plenary lecture session and once in small lab sessions.

  • The plenary lecture sessions will focus on presenting new material.
  • The lab sessions will cover R skills, practical applications, and homework questions.
  • Students are expected to bring their own laptop computers to all class meetings.

Students will complete 6 assignments in which they will apply the skills covered in the lectures and labs.

  • The assignments will be performed primarily using the open source statistical software R.
  • The assignments will be administered through a web-based platform that will be made accessible through Blackboard.
  • Each assignment will be graded as pass/fail.
  • Students may not re-sit individual assignments.
  • Failure to submit an assignment will result in a “fail” score for that assignment.
  • All assignments must be completed individually.

The course will be rounded off with a written exam.

  • The final exam will use a multiple choice response format.
  • Students may re-sit the final exam, in accordance with University policy.

To pass the course, the student must pass both the assignment component and the final exam.

  • To pass the assignment component, the student must get a “pass” score on at least 5 of the 6 assignments.
  • If a student gets a “fail” score on 2 or more of the assignments, they will fail the assignment component.
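The pass rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not an official grading tool: the function names and constants are hypothetical, and the re-sit options (practical exam, final exam re-sit) are not modeled.

```python
from typing import Sequence

# Values taken from the rule above; the names themselves are hypothetical.
NUM_ASSIGNMENTS = 6
REQUIRED_PASSES = 5


def passes_assignment_component(results: Sequence[bool]) -> bool:
    """Return True if at least 5 of the 6 assignments received a "pass".

    `results` holds one entry per assignment. An assignment that was not
    submitted counts as a fail and should be recorded as False.
    """
    if len(results) != NUM_ASSIGNMENTS:
        raise ValueError(f"expected {NUM_ASSIGNMENTS} results, got {len(results)}")
    return sum(results) >= REQUIRED_PASSES


def passes_course(results: Sequence[bool], exam_passed: bool) -> bool:
    """The course is passed only if both components are passed."""
    return passes_assignment_component(results) and exam_passed
```

Under this sketch, a student with one failed assignment and a passed exam passes the course, while a student with two failed assignments does not, regardless of the exam result.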

If a student fails the assignment component, they can re-sit the entire component by taking a practical exam (e.g., analyzing a real dataset and writing a report under timed testing conditions). Full details of the re-sit practical exam option will be provided at a later date.    

 

Required Prerequisites

Familiarity with basic algebraic operations (e.g., how to re-arrange a linear equation to isolate a variable), descriptive statistics (e.g., mean, variance, quantiles), and simple data visualizations (e.g., scatter plots, bar graphs, line graphs) is assumed.

Students should also have some prior exposure to basic inferential statistics (e.g., t-test, correlation), although they need not be able to implement these methods.

 

Recommended Prerequisites

Although not required, a conceptual understanding of probability and elementary calculus (the concept of a derivative and an anti-derivative/integral) will be helpful, as will some basic programming skills.

 

Compulsory Reading

Students will read several chapters from An Introduction to Statistical Learning (ISL). The full citation for ISL is:

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. New York: Springer.

An electronic copy of ISL is freely available here:

http://www-bcf.usc.edu/~gareth/ISL/index.html

Students will also read several scientific articles (approximately one per week). Electronic versions of these articles will be distributed via Blackboard.

 

Recommended Reading

The Elements of Statistical Learning (ESL) is the more technical (and comprehensive) predecessor to ISL. ESL represents an excellent (supplementary) reference for the material covered in this course. The full citation for ESL is: 

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). New York: Springer.

An electronic copy of ESL is freely available here: http://web.stanford.edu/~hastie/ElemStatLearn/


Compulsory literature

  1. See above.


Recommended literature

  1. See above.


Recommended prior knowledge

Familiarity with basic data visualizations (scatterplots, bar graphs)


Required prior knowledge

Understanding of descriptive statistics and basic algebraic manipulations


Compulsory for

  • Data Science: Business and Governance (2017)
  • Data Science: Business and Governance (spring) (2017)

(22-jan-2018)