
Features of R Seen Through Three Small Examples

NTL01

This post introduces some basic features of R through three short code snippets.

1 Computing the Signed Log (a log transform that also handles negative values)
～Based on the Data Transformation material in Chapter 4 of Practical Data Science with R

signedlog10=function(x){
  ifelse(abs(x)<=1,0,sign(x)*log10(abs(x)))
}
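
A minimal usage sketch of the function follows. The input values are hypothetical and chosen only to cover each case: large negative, large positive, and values with absolute value at most 1, which are mapped to 0.

```r
# Same definition as in the text.
signedlog10 <- function(x) {
  ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x)))
}

# Hypothetical input values chosen for illustration.
x <- c(-1000, -10, -0.5, 0, 1, 100)
result <- signedlog10(x)
result
# [1] -3 -1  0  0  0  2
```

Because ifelse(), abs(), sign() and log10() are all vectorised, the function works on a whole vector at once without an explicit loop.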

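1.1 signedlog10 is a function definition. It can be typed into the console {the RStudio interface is said to look a lot like MATLAB}, or into an R Script or an R Markdown file. RStudio, the main IDE for R, treats the code the same way whichever route is used; the different input routes differ mainly in how the code is run and how results are displayed.
1.2 Defining a function uses the same assignment as defining a variable (although the class of the resulting object is different). Most people use the equals sign; pedants and book authors quietly batch-replace every equals sign with " <- " before the final draft. lol...
1.3 One of R's strengths is mathematical and statistical computing, which was considered and designed in depth at the core of the language. ifelse(), abs(), sign() and log10() in the function above are all built-in functions. So are sample(), factor(), rep() and table(), and even higher-level functions such as predict() and glm() {generalized linear model}, which come from the stats package that ships with R and is attached by default, so no external package has to be installed or loaded.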
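
Points 1.2 and 1.3 can be checked with a short sketch. The variable names and the model formula are illustrative assumptions; mtcars is a data set that ships with R, and the logistic regression is used only to show that glm() and predict() are available without a library() call.

```r
# `=` and `<-` both perform assignment at the top level.
a = 3.14
b <- 3.14
same_value <- identical(a, b)   # TRUE

# A function is assigned like any other object; only its class differs.
f <- function(x) x^2
class(b)                        # "numeric"
class(f)                        # "function"

# glm() and predict() come from the stats package, attached by default.
# Illustrative model: transmission type (am) as a function of weight (wt).
fit <- glm(am ~ wt, data = mtcars, family = binomial)
p <- predict(fit, newdata = data.frame(wt = 3), type = "response")
```

Here p is the fitted probability for a hypothetical car with wt = 3; its exact value depends on the fit and is not shown.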

2 Coarsen the Levels of a Factor (simplify a set of categorical levels by merging some levels or singling out individual ones)
～Based on R Programming for Bioinformatics, p. 17

> y=sample(letters[1:5],20,replace = T)  # draw 20 letters at random, with replacement, from "a" to "e"
> y
 [1] "a" "d" "a" "a" "b" "e" "c" "a" "e" "e" "c" "d" "e" "e" "e" "c" "b" "c" "e"
[20] "a"

> v=as.factor(y)  # convert the character vector y into a factor (categorical, unordered)
> v
 [1] a d a a b e c a e e c d e e e c b c e a
Levels: a b c d e    # five levels, one per distinct value in y

> xx=list(I=c("a","e"),II=c("b","c","d"))  # a named list with two components, I and II; each holds the old levels to be merged
> levels(v)=xx  # assign xx as the levels of v; v is still a factor, but its levels attribute has changed
> v
 [1] I  II I  I  II I  II I  I  I  II II I  I  I  II II II I  I
Levels: I II  # v still has 20 elements, but only two levels remain after coarsening

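Working with levels is another example of R's high-level encapsulation. To recode a factor (similar to an enum) you do not edit its 20 elements one by one; you assign to its levels attribute, and R remaps the underlying codes, reducing five levels to two.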
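
The same coarsening can be reproduced with a fixed input, which makes the effect on the underlying integer codes visible. The vector below is a hypothetical stand-in for the random sample, so the result does not depend on sample().

```r
# Fixed, hypothetical input so the result is reproducible.
y <- c("a", "d", "b", "e", "c", "a")
v <- factor(y)
codes_before <- as.integer(v)   # 1 4 2 5 3 1

# Merge a and e into level I, and b, c and d into level II.
levels(v) <- list(I = c("a", "e"), II = c("b", "c", "d"))
codes_after <- as.integer(v)    # 1 2 2 1 2 1

levels(v)                       # "I"  "II"
table(v)                        # three elements in each level
```

The length of v is unchanged; only the set of levels and the codes pointing into it are different.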

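3 Counting Streaks
～Based on the statistics lab in week 3 of the Coursera course Introduction to Probability and Data
Counting streaks is an intermediate step in analysing the "hot hand" question. The task is to write a function that reports the streaks in an ordered sequence of shots, where H is a hit and M is a miss. A streak can have length 0 (no hit at all), 1 (a single hit), 2 (two hits in a row), and so on. The implementation is explained in the comments of the code below.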

function (x) 
{
  # If x is not an atomic vector (e.g. a data frame), use its first column.
  if (!is.atomic(x)) 
    x = x[, 1]
  # Stop if the data contain anything other than "H" or "M".
  if (any(!x %in% c("H", "M"))) 
    stop("Input should only contain hits (\"H\") and misses (\"M\")")
  y = rep(0, length(x))   # vector of zeros with the same length as x
  y[x == "H"] = 1         # mark every hit with 1
  y = c(0, y, 0)          # pad with a 0 at both ends
  wz = which(y == 0)      # positions of all zeros (misses and the two pads)
  # diff() returns the gaps between consecutive elements, e.g.
  # diff(c(1, 2, 3, 5)) gives 1 1 2. Consecutive zero positions differ by
  # at least 1, so subtracting 1 gives the number of hits between two
  # zeros: 0 when the two zeros are adjacent.
  streak = diff(wz) - 1
  return(data.frame(length = streak))
}
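
The steps inside the function can be traced by hand on a short sequence. The shot sequence below is hypothetical; the intermediate values in the comments follow directly from the definitions of which() and diff().

```r
x <- c("H", "M", "M", "H", "H", "M", "H")   # hypothetical shot sequence

y <- rep(0, length(x))
y[x == "H"] <- 1          # 1 0 0 1 1 0 1
y <- c(0, y, 0)           # 0 1 0 0 1 1 0 1 0
wz <- which(y == 0)       # 1 3 4 7 9
streak <- diff(wz) - 1    # 1 0 2 1
```

Read from left to right, the result is a streak of 1 (the first H), a streak of 0 (the second M comes straight after another miss), a streak of 2 (H H), and a final streak of 1 closed by the padding zero at the end.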

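The key to this example is locating the positions of the zeros (the misses) and turning the gaps between those positions into streak lengths. Java has indexOf for String, but for arrays and other more complex data types a search like this generally has to be written by hand, whereas R provides it in its core through vectorised functions such as which() and diff().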