chainsawriot

Data

Posted on Dec 30, 2009 by Chung-hong Chan

Coding and data entry are the Cinderellas of survey method, attracting little academic interest or concern compared with sampling, interviewing and tests of significance. Yet a survey, like the proverbial chain, is probably as good as its weakest link. And if enough care, thought and time are not devoted to these aspects of the study the validity and usefulness of the whole operation are jeopardised. We have no magical alternatives to the painstaking and methodical attention to detail which are needed for this part of the study. To do it well you need to be obsessional.

~ Anne Cartwright & Clive Seale, The Natural History of a Survey, 1990

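Data.
Whenever the topic of data comes up, I think of how Michael Crichton described biology in his novel The Andromeda Strain ((I have been quoting MC a lot lately; I quoted him in the current installment of 四顆書釘 as well)):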

Biology, the retarded child. . . . Even in the time of Newton and Galileo, men knew more about the moon and other heavenly bodies than they did about their own.

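Fortunately, biology has finally become an important discipline. But has data, as a discipline, been taken seriously? Or is it, as Cartwright & Seale put it above, a Cinderella, or, as Crichton said of biology, a retarded child?
Researchers collect data every day, but do we actually understand data?
True, at least two disciplines have branched out from data, namely data warehousing and information science, and computer science has database design and so on. But most people work with a limited budget: when you do research, would you hire someone who specializes in data? Most academics regard data handling as a simple skill that anyone is bound to have. Or they assume that learning Access is the same as learning how to handle data ((just as knowing SPSS is taken to mean knowing statistics)).
I previously wrote a book about the data frame. I want to expand it into a book on data handling for clinical and epidemiological research. ((It actually applies to social science too; it is just that most of the examples will use clinical and epidemiological data.)) This may turn out to be the concrete version of 「真正對你 Research 有用的電腦課」 (the computer course that is actually useful for your research). I once wrote a little about database normalization, then said I would cover implementation, and even said I would use SQLite. That promise remains unfulfilled to this day. Having thought it over, I feel SQLite is too difficult and would not add much. I can introduce the data entry software EpiData instead. In the first half of 2010, apart from writing 大整肅, my other task is to add an EpiData chapter to my data book. Perhaps in the second half of the year, if I have time, I will write another chapter, on data cleaning. With that, this three-chapter book will be more or less complete.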

Chapter 1: Planning, Data structure design, documentation (Using EpiData)
Chapter 2: Data manipulation (Data frame in R)
Chapter 3: Data Cleansing (Using R)

This is the grand plan I have set myself for 2010.