Class syllabus for Fall 2004

Course overview

This course studies the practical aspects of large-scale genomic data processing. Emphasis will be placed on projects that carry out the major genomic data processing steps. Important bioinformatic tools licensed from various genome research centers will be used in the projects. Students will also learn to write simple scripts to process their own data. Topics include base-calling, raw sequence cleaning and contaminant removal; sequence database construction, search and update; shotgun assembly procedures and EST clustering methods; genome closure strategies and practices; sequence homology search and function prediction; annotation and submission of GenBank reports; data collection and dissemination through the Internet; and scripting languages for linking these steps into an automatic genomic data processing pipeline. Additional topics in post-genomic research will also be discussed, including whole-genome oligo microarray design, microarray image processing, expression profile clustering, and pathway databases. See the class schedule for an overview of the tentative course content for this semester.

Class place and time

Classes will be held TuTh 11:00-12:15 at Molecular Biology 1424.


Hui-Hsien Chou
Office: 203 Science II, 294-9242.
Office hours: Wed 10:00-11:00 or by appointment.


Mr. Sang-Jin Kim
Office: Atanasoff B02.
Office hours: Mon 9-10 and Wed 1-2, or by appointment.

Class resources

Class web page:
You should periodically check the class web page for class information updates, lecture notes and handouts, project assignments, and your scores accumulated so far. Most documents will be in HTML format, readable directly in your browser. Others, especially those with complex graphics and tables, may be released in the Portable Document Format (PDF). You can visit the Adobe web site to download a free PDF viewer suitable for your system.

Class account:
This course uses many software tools that are available only on UNIX platforms. If you do not already have a Computer Science Department UNIX account, first use the CS account activation page to register your personal information. In the comment field on that page, state that you are taking the CS596 class and need a class account. One day after the web registration, go to the computing laboratory in 116 Atanasoff Hall and use the UNIX Account Activation Terminal (large sign, next to the windows) to activate your account. It is very important that you obtain this account and agree to the license agreement form (see below) so that your access to the class material can be enabled.

Class database:
You will also have access to a database server. You will store some of your project data in the database to learn about database tables for genomic research. Your database account will be created for you once you send in your license agreement form.

A few words about the class account and database

Note that genomic data sets can be enormous. For example, the chromatogram files used to assemble a mere 150 kb BAC can take up to 1 GB of disk space! Although our projects are designed so that you will not need to keep much of the data at the end, you have to be careful about how much disk space your temporary results take up in your account and/or the database. Remove unnecessary files and tables as soon as possible to free up space for your next project. If you exceed your disk quota, you may be denied access to your accounts until you delete some files. This limitation is not imposed by the instructor but by the system administrators of the machines you use.
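
One way to keep an eye on your disk usage is a short script that totals the file sizes under a directory. The following is only a minimal sketch in Python, not a class tool: the function names and the example directory name are hypothetical, and the numbers it reports may differ from what the system's own quota accounting shows.

```python
import os

def directory_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue  # do not count the targets of symbolic links
            try:
                total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total

def usage_by_subdirectory(path):
    """Return (size_in_bytes, name) pairs for each immediate subdirectory, largest first."""
    sizes = []
    for entry in os.listdir(path):
        full = os.path.join(path, entry)
        if os.path.isdir(full) and not os.path.islink(full):
            sizes.append((directory_size(full), entry))
    return sorted(sizes, reverse=True)

# Example use (the directory name "projects" is an assumption):
#   for size, name in usage_by_subdirectory("projects"):
#       print(f"{size / 1e6:10.1f} MB  {name}")
```

Listing the largest subdirectories first makes it easier to see which temporary results to delete before starting the next project.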

Instructional material

The wide spectrum of class topics cannot be covered by a single textbook, and most information about the bioinformatic tools can be obtained from multiple sources on the Internet. We therefore have not chosen a single textbook for this class. Instead, listed below are important reference books that cover certain topics of this class. Obtaining these reference books is optional, and you can decide for yourself which of them, if any, you want to keep. We will rely on lecture notes, online documentation, and class handouts as our main teaching material. If you wish to pursue a career in this field, however, the following books may be worth collecting:

Most relevant reference books:

  1. Developing Bioinformatics Computer Skills, by Cynthia Gibas and Per Jambeck. O'Reilly & Associates, Inc., Sebastopol, CA, 2001. ISBN 1-56592-664-1. If you can afford to get just one book, get this one.
  2. The Cartoon Guide to Genetics, by Larry Gonick and Mark Wheelis. Harper Perennial, New York, 1991. ISBN 0-06-273099-1. If you are not a biology student, this book can get you started, with fun.
  3. Learning Perl, by R.L. Schwartz and T. Christiansen. O'Reilly & Associates, Inc., Sebastopol, CA, 1997. ISBN 1-56592-284-0. If you choose Perl as your scripting language for projects, get this book.

Programming/Unix reference books:

  1. UNIX Utilities - a Programmer's Reference, by R.S. Tare. McGraw-Hill Book Co., New York, 1988. ISBN 0-07-100355-X.
  2. Red Hat Linux Getting Started Guide, by Red Hat Software, Inc., Durham, North Carolina.
  3. HTML: The Definitive Guide, by C. Musciano and B. Kennedy. O'Reilly & Associates, Inc., Sebastopol, CA, 1998. ISBN 1-56592-492-4.
  4. Beginning Perl for Bioinformatics, by James Tisdall. O'Reilly & Associates, Inc., Sebastopol, CA, 2001. ISBN 0-596-00080-4.
  5. Mac OS X Panther in a Nutshell, by Chuck Toporek, Chris Stone and Jason McIntosh. O'Reilly & Associates, Inc., Sebastopol, CA, 2001. ISBN 0-596-00606-3.

Reference books about bioinformatic algorithms:

  1. Introduction to Computational Molecular Biology, by J. Setubal and J. Meidanis. PWS Publishing Company, Boston, 1997. ISBN 0-534-95262-3.
  2. Computational Methods in Molecular Biology, edited by S.L. Salzberg, D.B. Searls and S. Kasif. Elsevier Sciences B. V., Amsterdam, The Netherlands, 1998. ISBN 0-444-50204-1.
  3. Biological Sequence Analysis, by R. Durbin, S. Eddy, A. Krogh and G. Mitchison. Cambridge University Press, Cambridge, United Kingdom, 1998. ISBN 0-521-62971-3.
  4. Algorithms on Strings, Trees and Sequences, by Dan Gusfield. Cambridge University Press, Cambridge, United Kingdom, 1997. ISBN 0-521-58519-8.

Reference books about genomic research lab procedures:

  1. Genome Mapping: A Practical Approach, edited by P.H. Dear. Oxford University Press, New York, 1997. ISBN 0-19-963630-3.
  2. Automated DNA Sequencing and Analysis, edited by M.D. Adams, C. Fields and J.C. Venter. Academic Press, New York, 1994. ISBN 0-12-717010-3.

Final Exam

Wednesday, December 15 at 9:45AM.


There will be a final exam, which will make up 30% of your semester score. The remaining 70% will be divided among 7 projects. Some projects are very easy, while others require some programming effort. The weight of each project will depend on its difficulty level. If you have completed all projects by the end of the semester, the instructor may give you 5% extra credit, especially when your score is at the borderline between two grade levels. This is solely at the discretion of the instructor and cannot be negotiated.

Your performance on the projects will determine the majority of your final score, so you are advised to start working on each project as early as possible. The due date of each project is clearly indicated in the class schedule. Late projects will be penalized at the rate of 5% for the first day, 15% for the second day and 20% thereafter. Because the projects follow the genomic data processing steps in order, some projects depend on the results of previous ones. If you skip a project, it may be hard to finish the rest.

At the end of the semester, we will normalize everybody's scores by setting the highest score to 100 points and adjusting the others accordingly. Your score will then be rounded to the nearest whole point, and your letter grade will be determined by the following table:

Grade Cutoff scores
A 90 points and above
A- 87 points and above
B+ 84 points and above
B 80 points and above
B- 77 points and above
C+ 74 points and above
C 70 points and above
C- 67 points and above
D+ 64 points and above
D 60 points and above
D- 57 points and above
F all other scores, or cheating in any way
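
The normalization and the cutoff table can be illustrated with a short Python sketch. It is only an illustration, not the instructor's actual grading program. It assumes that "adjusting the others accordingly" means scaling every score by the same factor and that halves round up; the function names and the sample scores are made up. Late penalties and the discretionary extra credit are left out.

```python
import math

# Cutoffs copied from the table above, highest first.
CUTOFFS = [
    (90, "A"), (87, "A-"), (84, "B+"), (80, "B"), (77, "B-"),
    (74, "C+"), (70, "C"), (67, "C-"), (64, "D+"), (60, "D"), (57, "D-"),
]

def normalize(raw_scores):
    """Scale all scores so that the highest one becomes 100 points.

    Assumption: every score is multiplied by the same factor.
    """
    top = max(raw_scores.values())
    if top <= 0:
        raise ValueError("the highest score must be positive")
    return {name: 100.0 * score / top for name, score in raw_scores.items()}

def round_half_up(points):
    """Round to the nearest whole point; halves round up (an assumption)."""
    return math.floor(points + 0.5)

def letter_grade(points):
    """Return the first grade whose cutoff the points reach, else F."""
    for cutoff, grade in CUTOFFS:
        if points >= cutoff:
            return grade
    return "F"

def final_grades(raw_scores):
    """Normalize, round, and look up a letter grade for every student."""
    return {
        name: letter_grade(round_half_up(points))
        for name, points in normalize(raw_scores).items()
    }

# Made-up example: the top raw score of 92 becomes 100 points,
# so a raw score of 80 becomes 86.96, which rounds to 87.
example = final_grades({"student1": 92.0, "student2": 80.0, "student3": 50.0})
```

In this made-up example the three students would receive A, A- and F respectively.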

Licensing agreement

To prepare for this class, the instructor has put great effort into collecting many important bioinformatic software tools from genome research centers and companies around the world. We also license their actual genomic data as project material. Our licensors support higher education and are willing to let us use their proprietary software tools and data without fee. However, they ask that we use them only in our class and never distribute them outside of it. To comply with the licensing requirements set forth by our licensors, and to prevent the termination of our licenses should any of us violate their request, you need to agree to this license agreement. Print out a copy, fill in your account information so we can enable your access to the class material, sign it, and give it back to the instructor. It is very important that you agree to and abide by the licensing agreement so that we can continue to offer this course. If you have any concerns about the licensing agreement, feel free to discuss them with the instructor.

Academic integrity

Any academic dishonesty, including but not limited to exchange of program code, cheating during exams, plagiarism of projects, fabrication of results, and sabotaging others' efforts, will be viewed as an academic offense and will result in an F grade. Serious cases will be forwarded to the appropriate university committees for additional disciplinary action. General discussions of projects at a conceptual level, including sharing experiences in the use of tools and help with debugging, are allowed and encouraged.

Additional tips

  • Do not procrastinate on projects. Start working on your projects as soon as they are released. Do not wait until the day before the deadline. You will find that projects take more time to finish when you are under pressure. This cannot be overstated: do not postpone working on your projects!
  • Stay informed. Check the class home page frequently to see if there are updates to the project descriptions or hints for the projects.
  • For programming projects, design before you code. It is always easier than starting with some scratch code and "patch while you go". You should have a good picture of what you are going to do before you start doing it. Otherwise, you may end up spending hours debugging your code.
  • Understand what you learn in class, make good use of the project examples, learn how to use the various tools introduced in class, become familiar with the UNIX environment or other platforms you are working on, and never feel embarrassed to ask questions if you are not sure.
  • Test your projects thoroughly before you submit. It is not how you see your projects, but how the TA sees them that will determine your score. There is no point arguing with the TA if your projects fail his tests. So test again to make sure your projects actually function as specified before you submit. You can submit your projects as many times as you wish before the due date.