> library(xlsx)
> library(partykit)
> library(data.table)

I am using randomly generated data. In this example I am selling jeans and I am curious about what the selling period of different types of jeans depends on. I have data about 10000 sales, including the fit (straight, bootcut or skinny), length (short or normal), color (blue, black, red or yellow) and price ($10-124) of the jeans, and the number of days it took to sell (0-3 days, 4-14 days or more).
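The generating code is not shown in the post; here is a minimal sketch that produces a data frame of the same shape. The class probabilities and the simplified column name Price are my assumptions, and unlike the real example, Days here is independent of the predictors, so this only reproduces the structure:

```r
# Hypothetical simulation of jeans sales data of the shape described above;
# levels match the text, the probabilities are invented.
set.seed(1)
n <- 10000
data <- data.frame(
  Days   = sample(c("0-3", "4-14", "More"), n, replace = TRUE),
  Fit    = sample(c("Straight", "Bootcut", "Skinny"), n, replace = TRUE),
  Price  = sample(10:124, n, replace = TRUE),
  Color  = sample(c("Blue", "Black", "Red", "Yellow"), n, replace = TRUE),
  Length = sample(c("Short", "Normal"), n, replace = TRUE),
  stringsAsFactors = TRUE
)
str(data)
```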

> head(data)
  Days      Fit Price $ Color Length
2  0-3 Straight      81   Red Normal
3  0-3 Straight      61   Red  Short
1 4-14  Bootcut      87   Red Normal
4 4-14  Bootcut      47   Red Normal
5 4-14   Skinny      82   Red Normal
6 4-14 Straight      88 Black Normal
> nrow(data)
[1] 10000

In order to estimate the accuracy of the model, I will split my data into a training and a test dataset.

> set.seed(9898)
> testind<-sample(c(1:nrow(data)), nrow(data)*0.2)
> test<-data[testind,]
> train<-data[-testind,]

The tree and the prediction:

> tree<-ctree(Days~., data=train, mincriterion = 0.99)
> test$predict<-predict(tree, newdata = test, type = "response")

When fitting the model we could also control the test type, the size of the tree, the splitting criteria etc. with a list of parameters in the control = ctree_control() argument. Here I only set mincriterion: it is the value of 1 - p that a test statistic must exceed before a split is implemented, so 0.99 is stricter than the default of 0.95.

The accuracy of the model is 0.631, which is quite good. Without the model, by guessing alone, our accuracy would be about 0.33 (since the explained variable has 3 values), so 0.631 is a great improvement.

> test$match<-0
> test$match[which(test$predict==test$Days)]<-1
> accuracy<-sum(test$match)/nrow(test)
> accuracy
[1] 0.631

Now I am visualizing my tree. First I define the colors: columncol gives the colors of the columns in the terminal nodes, labelcol the background colors of the labels in the inner nodes, and indexcol the background colors of the indexes.

> columncol<-hcl(c(270, 260, 250), 200, 30, 0.6)
> labelcol<-hcl(200, 200, 50, 0.2)
> indexcol<-hcl(150, 200, 50, 0.4)

The image will be saved immediately as Tree.jpg in the working directory.

> jpeg("Tree.jpg", width=2000, height=750)
> plot.new()
> plot(tree, type = c("simple"), gp = gpar(fontsize = 15),
+      drop_terminal = TRUE, tnex=1,
+      inner_panel = node_inner(tree, abbreviate = TRUE,
+        fill = c(labelcol, indexcol), pval = TRUE, id = TRUE),
+      terminal_panel=node_barplot(tree, col = "black", fill = columncol[c(1,2,3)], beside = TRUE,
+        ymax = 1, ylines = TRUE, widths = 1, gap = 0.1,
+        reverse = FALSE, id = TRUE))
> title(main="My Beautiful Decision Tree", cex.main=3, line=1)
> dev.off()

And here is the tree:

It may also be important to see by what rules the data were partitioned. Now I list the rules produced for each class. With this script we get the results in a formatted form, saved as an xls file.

First I save, for each terminal node, the predictions and the rules by which we got them.

> terminals <- data.frame(response = predict(tree, type = "response"))
> terminals$prob <- predict(tree, type = "prob")
> rules<-partykit:::.list.rules.party(tree)
> terminals$rule <- rules[as.character(predict(tree, type = "node"))]

Then I format and aggregate the values and the variable names, and save the result as Rules.xls.

> value<-cbind(terminals$rule, terminals[,which(grepl("prob", colnames(terminals)))])
> colnames(value)[1]<-"rules"
> DF<-data.table(value)
> terminals<-as.data.frame(DF[,sapply(.SD, function(x) list(max=max(x))),by=list(rules)])
> terminalsname<-cbind(names(rules), rules)
> terminals<-merge(terminals, terminalsname, by=c("rules"))
> names<-colnames(terminals)[which(grepl(".max", colnames(terminals)))]
> names<-unlist(strsplit(names, "[.]"))[-which(unlist(strsplit(names, "[.]"))=="max")]
> colnames(terminals)[which(grepl(".max", colnames(terminals)))]<-names
> colnames(terminals)[which(colnames(terminals)=="V1")]<-"Terminal_Node"
> colnames(terminals)[which(colnames(terminals)=="rules")]<-"Rule"
> terminals<-terminals[,c(5,1,2,3,4)]
> terminals$Rule[which(grepl("\"NA\"", terminals$Rule))]<-gsub("\"NA\"", "", terminals$Rule[which(grepl("\"NA\"", terminals$Rule))])
> terminals$Terminal_Node<-as.numeric(as.character(terminals$Terminal_Node))
> terminals<-terminals[order(terminals$Terminal_Node),]
> write.xlsx(terminals, "Rules.xls", row.names=FALSE)

So here are the decision rules.

You can learn more about decision trees here or here.

Terminal_Node | Rule | 0-3 | 4-14 | More |
4 | Fit %in% c("Bootcut", "Skinny") & Price $ <= 65 & Price $ <= 50 | 0.77 | 0.23 | 0.00 |
6 | Fit %in% c("Bootcut", "Skinny") & Price $ <= 65 & Price $ > 50 & Fit %in% c("Bootcut", ) | 0.62 | 0.37 | 0.00 |
7 | Fit %in% c("Bootcut", "Skinny") & Price $ <= 65 & Price $ > 50 & Fit %in% c("Skinny", ) | 0.96 | 0.04 | 0.00 |
11 | Fit %in% c("Bootcut", "Skinny") & Price $ > 65 & Price $ <= 99 & Fit %in% c("Bootcut", ) & Price $ <= 76 | 0.54 | 0.45 | 0.01 |
12 | Fit %in% c("Bootcut", "Skinny") & Price $ > 65 & Price $ <= 99 & Fit %in% c("Bootcut", ) & Price $ > 76 | 0.45 | 0.52 | 0.03 |
13 | Fit %in% c("Bootcut", "Skinny") & Price $ > 65 & Price $ <= 99 & Fit %in% c("Skinny", ) | 0.92 | 0.08 | 0.00 |
14 | Fit %in% c("Bootcut", "Skinny") & Price $ > 65 & Price $ > 99 | 0.25 | 0.63 | 0.13 |
16 | Fit %in% c("Straight") & Price $ <= 56 | 0.27 | 0.64 | 0.09 |
18 | Fit %in% c("Straight") & Price $ > 56 & Price $ <= 74 | 0.17 | 0.70 | 0.13 |
20 | Fit %in% c("Straight") & Price $ > 56 & Price $ > 74 & Price $ <= 84 | 0.13 | 0.66 | 0.21 |
21 | Fit %in% c("Straight") & Price $ > 56 & Price $ > 74 & Price $ > 84 | 0.07 | 0.68 | 0.25 |


In my example I use normally distributed random data with some correlation between the variables. X is the dependent variable and A, B, C, D and E are the independent variables.

> head(data)
           X          A          B          C          D            E
1  0.5029670 -0.5410024  1.1435710  0.3336631  0.3655851  0.001020111
2 -0.6840197  0.2687756  0.7954579  0.6327431  3.1429626 -0.003303845
3 -0.9907283  0.7637982 -0.5391294  0.1208131 -1.5397873 -0.003529975
4 -0.4805976 -1.5478674 -1.5155526  0.8240618 -0.5815753 -0.013569565
5  0.7124916 -2.2332242  0.8393972  0.5629028 -0.1723444 -0.001465729
6  0.8452909  0.4507010  1.2326080 -0.9208669  0.2225247  0.013846007
> cor(data)
            X            A            B            C            D           E
X  1.00000000  0.619898093  0.251345711 -0.12850031 -0.003335120  0.19875112
A  0.61989809  1.000000000  0.012738346 -0.13976283 -0.004907749  0.09347111
B  0.25134571  0.012738346  1.000000000 -0.13507005 -0.001018186  0.52016382
C -0.12850031 -0.139762829 -0.135070053  1.00000000  0.014933635 -0.91568895
D -0.00333512 -0.004907749 -0.001018186  0.01493364  1.000000000 -0.01315734
E  0.19875112  0.093471114  0.520163823 -0.91568895 -0.013157335  1.00000000
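The post does not show how these correlated variables were generated. One common sketch (my assumption, not necessarily the author's code) multiplies iid standard normals by the Cholesky factor of a target correlation matrix; the target value 0.6 here is an invented example:

```r
# Simulate two standard normal variables with a target correlation of 0.6
# (a sketch; the target correlation is an invented example).
set.seed(42)
n  <- 1000
CM <- matrix(c(1.0, 0.6,
               0.6, 1.0), nrow = 2)   # desired correlation matrix
Z  <- matrix(rnorm(2 * n), ncol = 2)  # independent standard normals
XA <- Z %*% chol(CM)                  # chol() gives R with CM = t(R) %*% R
cor(XA)[1, 2]                         # close to 0.6
```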

The dependent variable:

Enter method

With the enter method no variable selection is performed, regardless of the variables' partial explanatory power, so every independent variable is included in the model.

> library(boot)
> enter <- function(data, num, form)
+ {
+   data<-data[num,]
+   fit <- lm(form, data)
+   res<-as.numeric(summary(fit)$r.square)
+   return(res)
+ }
> results <- boot(data=data, statistic=enter,
+                 R=100, form=X ~ A + B + C + D + E)

> summary(results)
    R original   bootBias    bootSE bootMed
1 100  0.55754 0.00082494 0.0064244 0.55797

> plot(results)

The value of the R-squared was about 0.558 in every sample; no big differences occurred.

Forward method

The Forward method begins with an empty model. First it selects the independent variable whose absolute correlation with the dependent variable is the largest. Second, it selects the variable with the largest partial correlation, i.e. the one with the maximal correlation with the dependent variable after accounting for the first selected variable. Subsequent variables enter in the same way, based on their partial correlation with the dependent variable given the variables already entered. The order of the explanatory variables matters (the later entrants are less and less able to increase the explanatory power of the model, no matter which variables are already in). Once entered, variables stay in the model.
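The forward() function used in this post is not in base R (it comes from an add-on package, e.g. mixlm); a comparable AIC-based forward search can be sketched in base R with stats::step(). Everything below (toy data, coefficients) is invented for illustration:

```r
# Forward selection with stats::step(): start from the empty model and let
# step() add variables from the scope as long as AIC improves.
set.seed(1)
n <- 200
d <- data.frame(A = rnorm(n), B = rnorm(n), C = rnorm(n),
                D = rnorm(n), E = rnorm(n))
d$X <- 0.6 * d$A + 0.4 * d$B + rnorm(n)   # only A and B truly matter
fwd <- step(lm(X ~ 1, data = d), direction = "forward",
            scope = ~ A + B + C + D + E, trace = 0)
names(coef(fwd))   # should contain A and B
```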

> forward_actual_results<-rep(0,6)
> names(forward_actual_results)<-c("r_square","A","B","C","D","E")
> selection_forward<-function(data,num)
+ {
+   data<-data[num,]
+   fit <- forward(lm(X ~ A + B + C + D + E, data))
+   forward_actual_results[1]<-as.numeric(summary(fit)$r.square)
+   actual_selected<-rownames(summary(fit)$coef)[-1]
+   forward_actual_results[which(names(forward_actual_results) %in% actual_selected)]<-
+     forward_actual_results[which(names(forward_actual_results) %in% actual_selected)]+1
+   return(forward_actual_results)
+ }
> forward_results<-boot(data=data, statistic=selection_forward,
+                       R=100)

> forward<-summary(forward_results)
> rownames(forward)<-c("R-squared", "A", "B", "C", "D", "E")
> forward
            R original  bootBias   bootSE bootMed
R-squared 100  0.55754 -0.017099 0.040908 0.55754
A         100  1.00000  0.000000 0.000000 1.00000
B         100  1.00000  0.000000 0.000000 1.00000
C         100  1.00000 -0.150000 0.358870 1.00000
D         100  0.00000  0.200000 0.402015 0.00000
E         100  1.00000 -0.150000 0.358870 1.00000

The explanatory power of the model is the same as with the enter method; the value of R-squared was about 0.55754. In most of the models the D variable was left out, and in some cases the C and E variables too. The A and B variables were always in.

Backward elimination

The Backward elimination, as the name suggests, essentially works in reverse to the Forward method: at the start all independent variables are in the model, and the algorithm excludes those whose removal does not significantly reduce the model's explanatory power.

> backward_actual_results<-rep(0,6)
> names(backward_actual_results)<-c("r_square","A","B","C","D","E")
> selection_backward<-function(data,num)
+ {
+   data<-data[num,]
+   fit <- backward(lm(X ~ A + B + C + D + E, data))
+   backward_actual_results[1]<-as.numeric(summary(fit)$r.square)
+   actual_selected<-rownames(summary(fit)$coef)[-1]
+   backward_actual_results[which(names(backward_actual_results) %in% actual_selected)]<-
+     backward_actual_results[which(names(backward_actual_results) %in% actual_selected)]+1
+   return(backward_actual_results)
+ }
> backward_results<-boot(data=data, statistic=selection_backward,
+                        R=100)

> backward<-summary(backward_results)
> rownames(backward)<-c("R-squared", "A", "B", "C", "D", "E")
> backward
            R original   bootBias    bootSE bootMed
R-squared 100  0.55754 -0.0001725 0.0033409 0.55754
A         100  1.00000  0.0000000 0.0000000 1.00000
B         100  1.00000  0.0000000 0.0000000 1.00000
C         100  1.00000  0.0000000 0.0000000 1.00000
D         100  0.00000  0.2400000 0.4292347 0.00000
E         100  1.00000  0.0000000 0.0000000 1.00000


The results of the Backward elimination were more stable, while the value of the R-squared was the same as with the forward method. The A, B, C and E variables were always included, and the variable D was mostly excluded.

Stepwise selection

The Stepwise method is similar to the Forward method; in a sense it is its "improved" version. It first enters the variable with the largest absolute correlation and then proceeds like the Forward method, always entering the variable with the largest absolute partial correlation. However, after each step the model also checks backwards: if there is a variable whose exclusion would not significantly reduce the explanatory power of the model, it drops out. New variables may change the partial correlations of the variables already inside and "reduce their attractiveness". We can say stepwise selection is "better" than the forward and backward methods in that it considers the predictors together, but we cannot trust it blindly and have to check whether the model makes sense. An often mentioned example: people look good from the outside in jeans and a T-shirt, but we know it is better to wear underwear as well, even if it is less effective and time-consuming to put on.
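Again, stepWise() comes from an add-on package; in base R, stats::step() with direction = "both" performs the same kind of search, allowing variables to both leave and re-enter at every step. A toy sketch (data and coefficients invented):

```r
# Stepwise selection with stats::step(direction = "both"): starting from the
# full model, each step may add or drop a variable, judged by AIC.
set.seed(2)
n <- 200
d <- data.frame(A = rnorm(n), B = rnorm(n), C = rnorm(n), D = rnorm(n))
d$X <- 0.7 * d$A + rnorm(n)               # only A truly matters
sw <- step(lm(X ~ A + B + C + D, data = d), direction = "both", trace = 0)
names(coef(sw))   # should contain A
```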

> stepWise_actual_results<-rep(0,6)
> names(stepWise_actual_results)<-c("r_square","A","B","C","D","E")
> selection_stepWise<-function(data,num)
+ {
+   data<-data[num,]
+   fit <- stepWise(lm(X ~ A + B + C + D + E, data))
+   stepWise_actual_results[1]<-as.numeric(summary(fit)$r.square)
+   actual_selected<-rownames(summary(fit)$coef)[-1]
+   stepWise_actual_results[which(names(stepWise_actual_results) %in% actual_selected)]<-
+     stepWise_actual_results[which(names(stepWise_actual_results) %in% actual_selected)]+1
+   return(stepWise_actual_results)
+ }
> stepWise_results<-boot(data=data, statistic=selection_stepWise,
+                        R=100)

> stepWise<-summary(stepWise_results)
> rownames(stepWise)<-c("R-squared", "A", "B", "C", "D", "E")
> stepWise
            R original  bootBias   bootSE bootMed
R-squared 100  0.55754 -0.030778 0.050863 0.55754
A         100  1.00000  0.000000 0.000000 1.00000
B         100  1.00000  0.000000 0.000000 1.00000
C         100  1.00000 -0.270000 0.446196 1.00000
D         100  0.00000  0.000000 0.000000 0.00000
E         100  1.00000 -0.270000 0.446196 1.00000

The value of the R-squared was the same as with the Forward and Backward methods. The variables A and B were always in, D was always out, and C and E were mostly included.


In this example I will use the data from the Creating a complex chart in R post.

> data
          Total Product A Product B Product C Product D Profit
New York    787       340       160       161       126  83000
Bangkok     751       356       206       127        62  62000
Paris       743       408       155        93        87  85000
El Paso     696       259       209       108       120  78000
Sidney      696       270       190       176        60  47000
Shanghai    654       238       198       141        77  41000
Auckland    643       191       168       148       136  39000
Beijing     630       235       117       160       118  76000
Houston     570       199       175        99        97  51000
Amsterdam   531       180        95       170        86  55000

Of course we could automate the data import and combination as well, but because of the many possibilities and different data sources, this post is not about that.

To make a nice pptx from our data, first we need a template. In our case it is the Template.pptx.

> library(ReporteRs)
> report<-pptx(template="Template.pptx")

Next we format the title slide. With the addSlide command we define which slide layout to use (here Title). The slide layouts have to be defined in the template pptx. We also add a title, a subtitle and a date.

> report <- addSlide(report, "Title")
> report<-addTitle(report, "Top Sales Offices")
> report<-addSubtitle(report, "Reporting example - RManic")
> report<-addDate(report)

The title slide:

On the next slide we want to have some text too, so we will use a Title and Content layout. It will show the top 3 sales offices according to different criteria.

> report <- addSlide(report, "Title and Content" )
> report<-addTitle(report, "Top 3 Sales Offices according to")

First we specify the rows. The current data about the offices will be substituted automatically. There will be 16 rows on this slide.

> par1<-pot("Profit:", textProperties(font.size = 15, font.weight="bold"))
> par2<-pot(paste(rownames(data[order(-data$Profit),])[1], ": ", data$"Profit"[order(-data$"Profit")][1], "$", collapse=""), textProperties( font.size = 12 ))
> par3<-pot(paste(rownames(data[order(-data$Profit),])[2], ": ", data$"Profit"[order(-data$"Profit")][2], "$", collapse=""), textProperties( font.size = 12 ))
> par4<-pot(paste(rownames(data[order(-data$Profit),])[3], ": ", data$"Profit"[order(-data$"Profit")][3], "$", collapse=""), textProperties( font.size = 12 ))
> par5<-pot("number of sales of product A", textProperties(font.size = 15, font.weight="bold"))
> par6<-pot(paste(rownames(data[order(-data$"Product A"),])[1], ": ", data$"Product A"[order(-data$"Product A")][1], collapse=""), textProperties( font.size = 12 ))
> par7<-pot(paste(rownames(data[order(-data$"Product A"),])[2], ": ", data$"Product A"[order(-data$"Product A")][2], collapse=""), textProperties( font.size = 12 ))
> par8<-pot(paste(rownames(data[order(-data$"Product A"),])[3], ": ", data$"Product A"[order(-data$"Product A")][3], collapse=""), textProperties( font.size = 12 ))
> par9<-pot("number of sales of product B", textProperties(font.size = 15, font.weight="bold"))
> par10<-pot(paste(rownames(data[order(-data$"Product B"),])[1], ": ", data$"Product B"[order(-data$"Product B")][1], collapse=""), textProperties( font.size = 12 ))
> par11<-pot(paste(rownames(data[order(-data$"Product B"),])[2], ": ", data$"Product B"[order(-data$"Product B")][2], collapse=""), textProperties( font.size = 12 ))
> par12<-pot(paste(rownames(data[order(-data$"Product B"),])[3], ": ", data$"Product B"[order(-data$"Product B")][3], collapse=""), textProperties( font.size = 12 ))
> par13<-pot("number of sales of product C", textProperties(font.size = 15, font.weight="bold"))
> par14<-pot(paste(rownames(data[order(-data$"Product C"),])[1], ": ", data$"Product C"[order(-data$"Product C")][1], collapse=""), textProperties( font.size = 12 ))
> par15<-pot(paste(rownames(data[order(-data$"Product C"),])[2], ": ", data$"Product C"[order(-data$"Product C")][2], collapse=""), textProperties( font.size = 12 ))
> par16<-pot(paste(rownames(data[order(-data$"Product C"),])[3], ": ", data$"Product C"[order(-data$"Product C")][3], collapse=""), textProperties( font.size = 12 ))

Then we define the paragraphs. There will be ordered and unordered lists on two levels. With append=TRUE the paragraphs are appended to the slide.

> report<-addParagraph(report, par1, append=TRUE, par.properties=parProperties(list.style = 'ordered', text.align="left", padding.top=15, level=1))
> report<-addParagraph(report, set_of_paragraphs(par2, par3, par4), append=TRUE, par.properties=parProperties(list.style = 'unordered', text.align="left", padding.left=15, level=2))
> report<-addParagraph(report, par5, append=TRUE, par.properties=parProperties(list.style = 'ordered', text.align="left", padding.top=15, level=1))
> report<-addParagraph(report, set_of_paragraphs(par6, par7, par8), append=TRUE, par.properties=parProperties(list.style = 'unordered', text.align="left", padding.left=15, level=2))
> report<-addParagraph(report, par9, append=TRUE, par.properties=parProperties(list.style = 'ordered', text.align="left", padding.top=15, level=1))
> report<-addParagraph(report, set_of_paragraphs(par10, par11, par12), append=TRUE, par.properties=parProperties(list.style = 'unordered', text.align="left", padding.left=15, level=2))
> report<-addParagraph(report, par13, append=TRUE, par.properties=parProperties(list.style = 'ordered', text.align="left", padding.top=15, level=1))
> report<-addParagraph(report, set_of_paragraphs(par14, par15, par16), append=TRUE, par.properties=parProperties(list.style = 'unordered', text.align="left", padding.left=15, level=2))

The Title and Content Slide:

Finally we create a slide with a chart. We will use the chart from the Creating a complex chart in R post (some sizes, e.g. margins and fonts, will be different).

> report <- addSlide( report, "Title and Content" )
> report<-addTitle(report, "Chart")

There are some variables we have to calculate and define first.

> data$CumProp<-cumsum(data$Total)/10000
> maxCumProp<-max(data$CumProp)
> data$CumProp<-data$CumProp*(max(data$Total)/max(data$CumProp))
> labels<-sapply(data[,2:5], function(x) paste(round(100*(x/data[,1]), 0), "%", sep=" ") )
> labelcoor<-cbind(data[,2]/2,
+                  data[,2]+data[,3]/2,
+                  data[,2]+data[,3]+data[,4]/2,
+                  data[,2]+data[,3]+data[,4]+data[,5]/2)
> legend<-c("Product A", "Product B", "Product C", "Product D")
> colors<-c("#CC941399", "#005A9452", "#C383ED99", "#00B8C199", "#FF000099")

The function of the chart:

> chart<-function()
+ {
+   par(xpd=TRUE, bg="transparent", mar=c(0,0.7,1,0.5))
+   layout(matrix(c(1,1,1,2,3,4,0,5,0,6,7,0), 4, 3, byrow = TRUE), width=c(1,3,2),
+          heights=c(0.6,2,0.7,0.8), respect = TRUE)
+   #1
+   plot.new()
+   title("Top 10 Sales Offices", line=-1, cex.main=3.5, font.main=2)
+   #2
+   plot.new()
+   text(0.6, 0.5, "Absolute Number of Sales", cex=2, srt = 90)
+   #3
+   chart<-barplot(t(as.matrix(data[,2:5])), axes=FALSE, border="transparent", mar=c(0,0,0,0),
+                  beside=FALSE, col=colors[1:4], axisnames=FALSE, space=0.3:0.3)
+   lines(chart, data$CumProp, lwd=1)
+   points(chart, data$CumProp, pch=16)
+   axis(4, at=seq(from=0, to=max(data[,1]), by=max(data[,1])/4),
+        lab=paste(round(100*seq(from=0, to=maxCumProp, by=maxCumProp/4),2), "%", sep=""),
+        las=2, cex.axis=1.5)
+   axis(2, at=seq(from=0, to=max(data[,1]), by=max(data[,1])/4),
+        lab=round(seq(from=0, to=max(data[,1]), by=max(data[,1])/4),0),
+        las=2, cex.axis=1.5)
+   text(chart+0.1, labelcoor, labels, cex=1.5)
+   text(c(0.7, chart[2:length(chart)]), data$Total+30, data$Total, cex=1.7)
+   #4
+   plot.new()
+   text(0.2, 0.5, "Cumulative Proportion of Sales", cex=2, srt = 270)
+   legend(0.35,0.8, legend, fill=colors[1:4], border="transparent", cex=2, bty="n")
+   #5
+   plot.new()
+   text(seq(from=0.02, to=0.98, by=0.93/9), t(rep(0.7, length(rownames(data)))), rownames(data), cex=1.5, srt = 30)
+   #6
+   plot.new()
+   text(x=0.4, y=0.55, "Profit (USD)", cex=2, srt=90)
+   #7
+   profit<-plot(data$Profit, type="b", axes=FALSE, col=colors[5], lwd=1, xlab="", ylab="")
+   text(seq(from=1.2, to=10.2, by=1), data$Profit+5000, data$Profit, cex=1.5)
+   axis(2, at=seq(from=35000, to=100000, by=100000/4),
+        lab=round(seq(from=35000, to=100000, by=100000/4),0),
+        las=2, cex.axis=1.5)
+ }

And we add the chart to the slide and save our report. The chart will be editable in the pptx; if we don't want that, we can set editable=FALSE.

> report <- addPlot( report, function() chart(), width=9, height=5)
> writeDoc( report, "Rmanic.pptx" )

The last slide:

It is also possible to create a batch file from our script, schedule the report delivery, and enjoy the results without any additional work.


In our example we would like to visualise the number of sales of 4 different products and the profit across our 10 best-selling offices. Here are the data we would like to plot:

> data
          Total Product A Product B Product C Product D Profit
New York    787       340       160       161       126  83000
Bangkok     751       356       206       127        62  62000
Paris       743       408       155        93        87  85000
El Paso     696       259       209       108       120  78000
Sidney      696       270       190       176        60  47000
Shanghai    654       238       198       141        77  41000
Auckland    643       191       168       148       136  39000
Beijing     630       235       117       160       118  76000
Houston     570       199       175        99        97  51000
Amsterdam   531       180        95       170        86  55000

The first column holds the total number of sales of the 4 products, the second to fifth columns the number of sales of each product, and the last column the profit of each office in USD (hypothetical data).

Before the plot we have to calculate some additional things.

For the added line we need the cumulative proportion of the total number of sales, and its maximum value for the axis belonging to the line. In our example we have other offices as well, and the total number of sales across all offices is 10000.

> total<-10000
> data$CumProp<-cumsum(data$Total)/total
> maxCumProp<-max(data$CumProp)

In order to stretch this line from the bottom to the top of the chart, we have to scale its values by the ratio of the maximal number of sales to the maximal original value of the calculated cumulative proportion.

> data$CumProp<-data$CumProp*(max(data$Total)/max(data$CumProp))

We calculate the coordinates of the column labels (the middle of each column segment) and edit the labels: instead of the absolute number of sales, we show the proportion of sales of each product in percent. We also define the text of the legend and the chosen colors.

> labels<-sapply(data[,2:5], function(x) paste(round(100*(x/data[,1]), 0), "%", sep=" ") )
> labelcoor<-cbind(data[,2]/2,
+                  data[,2]+data[,3]/2,
+                  data[,2]+data[,3]+data[,4]/2,
+                  data[,2]+data[,3]+data[,4]+data[,5]/2)
> legend<-c("Product A", "Product B", "Product C", "Product D")
> colors<-c("#CC941399", "#005A9452", "#C383ED99", "#00B8C199", "#FF000099")

Now we can start to create our chart.

We will save our chart as a png. First we define its size and some graphical parameters. The background will be transparent, so we can use the chart for example in a ppt with a background color.

> png("~/Chart.png", width=2000, height=1000)
> par(xpd=TRUE, bg="transparent", mar=c(5,4.5,3.2,3))

With the layout command we divide the chart.

> layout(matrix(c(1,1,1,2,3,4,0,5,0,6,7,0), 4, 3, byrow = TRUE), width=c(0.6,4.6,1.4),
+        heights=c(0.4,2,0.5,0.7), respect = TRUE)

This means the chart area is divided into 12 cells. The first row holds the 1st plot (it will be the title), the second row the 2nd, 3rd and 4th plots (left axis name, stacked column chart, legend), the third row only the 5th, and the last row the 6th and 7th. The widths and heights of the parts are determined here too.

1 | 1 | 1 |
2 | 3 | 4 |
0 | 5 | 0 |
6 | 7 | 0 |

After specifying which plot goes where, we can start plotting.

> #1
> plot.new()
> title("Top 10 Sales Offices", line=-4, cex.main=5, font.main=2)
> #2
> plot.new()
> text(0.9, 0.5, "Absolute Number of Sales", cex=3, srt = 90)

For the 3rd plot we define the axes separately, because we want different scales. The axis names are located in the two neighbouring parts of the layout, where we can position them more precisely.

> #3
> chart<-barplot(t(as.matrix(data[,2:5])), axes=FALSE, border="transparent", mar=c(0,0,0,0),
+                beside=FALSE, col=colors[1:4], axisnames=FALSE, space=0.3:0.3)
> lines(chart, data$CumProp, lwd=2)
> points(chart, data$CumProp, pch=16)
> axis(4, at=seq(from=0, to=max(data[,1]), by=max(data[,1])/4),
+      lab=paste(round(100*seq(from=0, to=maxCumProp, by=maxCumProp/4),2), "%", sep=""),
+      las=2, cex.axis=2.4)
> axis(2, at=seq(from=0, to=max(data[,1]), by=max(data[,1])/4),
+      lab=round(seq(from=0, to=max(data[,1]), by=max(data[,1])/4),0),
+      las=2, cex.axis=2.4)
> text(chart+0.1, labelcoor, labels, cex=2.9)
> text(c(0.7, chart[2:length(chart)]), data$Total+25, data$Total, cex=2.9)
> #4
> plot.new()
> text(0.03, 0.5, "Cumulative Proportion of Sales", cex=3, srt = 270)
> legend(0.15,0.3, legend, fill=colors[1:4], border="transparent", cex=4, bty="n")
> #5
> plot.new()
> text(seq(from=0.02, to=0.98, by=0.96/9), t(rep(1, length(rownames(data)))), rownames(data), cex=3, srt = 30)
> #6
> plot.new()
> text(x=0.9, y=0.5, "Profit (USD)", cex=3, srt=90)
> #7
> profit<-plot(data$Profit, type="b", axes=FALSE, col=colors[5], lwd=4, xlab="", ylab="")
> text(seq(from=1, to=(20), by=1), data$Profit+7000, data$Profit, cex=2.3)
> axis(2, at=seq(from=35000, to=100000, by=100000/4),
+      lab=round(seq(from=35000, to=100000, by=100000/4),0),
+      las=2, cex.axis=2.4)
> dev.off()

Learn more about creating plots in R here and here.


Two facing magnets keep the rose in the air, and it is balanced by four fishing lines.

Materials:

- two ring magnets (60 x 10 mm, adhesive force: 6 kg)
- a round plastic flowerpot (diameter: 65 mm, height: 65 mm)
- a rose in a plastic pot (diameter max: 60 mm, height max: 50 mm)
- a trivet or a flat dish (diameter: 175 mm)
- fishing line (diameter: 0.35 mm)
- adhesive tape
- powerful universal glue
- felt, beads and ribbon for decoration

The plastic flowerpot and the rose shouldn’t weigh much.

First cut 4 long (15 cm) pieces of fishing line. Mark four points at equal distances on the flowerpot with a protractor or something that has a right angle. These are the places of the fishing lines.

Affix the fishing lines firmly to the designated locations with the tape. Be careful, they will want to escape. Hide the tape with some felt and decorate it with the ribbon or other lightweight stuff.

Measure out the length of the fishing lines. If they are too short, the rose will fly too low; if they are too long, the rose will not be balanced and will fall. In my case 5.5 cm of fishing line holds the rose stable, but the stronger your magnet, the higher the rose can fly, so you may have to experiment a bit. You can also put some beads on the fishing lines if you want.

Affix the fishing lines firmly to the trivet or flat dish at equal distances.

Affix one magnet to the bottom of the plastic flowerpot and one to the middle of the trivet; the same poles should face each other.

Check how the flowerpot flies and correct the length of the fishing lines if necessary.

Put the rose in the pot and check it again. When it flies perfectly, hide the trivet with felt.

Congratulations, your flying rose is ready!

Here and here you can find some other great experiments with magnet for kids.

Here, here and here are some further decoration ideas with magnets.


The correlation can be interpreted as the cosine of the angle between the centered, normalized vectors of the variables.

cor(A,B) = cos(α)

Therefore the correlation between A and C is zero if their vectors are orthogonal (cos(90°) = 0). The correlation between two variables is higher if the angle between their vectors is smaller, so for a fixed angle between the vectors of A and C, the correlations of A with B and of B with C are largest when the vector of B lies in the same plane as the vectors of A and C.

For example, if the correlations between A and B and between B and C are equal, their maximum value is cos(45°) = √2/2 ≈ 0.7071 when the correlation between A and C is zero.
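A quick numeric check of this interpretation (my illustration): for centered vectors, the Pearson correlation equals exactly the cosine of the angle between them.

```r
# Correlation as cosine: center the two vectors, then compare their
# normalized dot product (the cosine of the enclosed angle) with cor().
set.seed(7)
a <- rnorm(50)
b <- a + rnorm(50)
ac <- a - mean(a)
bc <- b - mean(b)
cos_angle <- sum(ac * bc) / (sqrt(sum(ac^2)) * sqrt(sum(bc^2)))
all.equal(cos_angle, cor(a, b))   # TRUE
```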

Let’s see how it works in practice!

We generate 3 normally distributed random variables (n = 1000).

> options(digits=7)
> set.seed(234)
> M <- matrix(rnorm(3000), ncol=3)
> colnames(M) <- c("A", "B", "C")
> head(M)
               A            B           C
[1,] -1.34352141 -0.158314852 -0.41120490
[2,]  0.62177555  0.018813945 -0.27796435
[3,]  0.80087466  0.498246468  0.40257018
[4,] -1.38889241 -1.675263002  0.45676675
[5,] -0.71435686  3.003174741 -0.43762865
[6,] -0.32406105 -0.608898653  1.36512746

We define the desired correlation matrix. The value of the variable maxCor is √2/2 ≈ 0.7071067 (rounded down to 7 digits), i.e. the maximal correlation between A and B and between B and C if the two are equal and the correlation between A and C is 0.

> maxCor <- floor(1e7*sqrt(2)/2)/1e7
> CM <- matrix(c(1,maxCor,0,
+               maxCor,1,maxCor,
+               0,maxCor,1), nrow=3)
> colnames(CM) <- c("A", "B", "C")
> rownames(CM) <- c("A", "B", "C")
> CM
          A         B         C
A 1.0000000 0.7071067 0.0000000
B 0.7071067 1.0000000 0.7071067
C 0.0000000 0.7071067 1.0000000

We transform the 3 generated variables to reach the given correlation matrix using the Cholesky decomposition.

> L <- chol(CM)
> ABC <- M %*% t(L)
> head(ABC)
               A           B              C
[1,] -1.45546690 -0.52315033 -0.00027866842
[2,]  0.63507902 -0.26466082 -0.00018837296
[3,]  1.15318808  0.75488359  0.00027281678
[4,] -2.57348211 -0.72782332  0.00030954511
[5,]  1.40920812  1.68593692 -0.00029657546
[6,] -0.75461737  0.93457073  0.00092512980

Let's see whether the correlations between the variables are as we wanted!

> cor(ABC)
            A          B           C
A 1.000000000 0.35850510 0.034449684
B 0.358505099 1.00000000 0.812815583
C 0.034449684 0.81281558 1.000000000

The resulting correlation matrix is not a perfect match, but as we can see, a distribution like this is theoretically possible.
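Part of the mismatch is not sampling noise: in R, chol() returns the upper-triangular factor L with CM = t(L) %*% L, so the transform that actually reproduces CM is M %*% L rather than M %*% t(L). A corrected sketch (my note, not from the original post), rebuilding the matrices locally:

```r
# Apply the Cholesky factor on the correct side:
# cov(M %*% L) = t(L) %*% L = CM when the columns of M are iid N(0,1).
set.seed(234)
M  <- matrix(rnorm(3000), ncol = 3)
mc <- floor(1e7 * sqrt(2) / 2) / 1e7
CM <- matrix(c(1, mc, 0,
               mc, 1, mc,
               0, mc, 1), nrow = 3)
ABC2 <- M %*% chol(CM)    # L itself, not t(L)
round(cor(ABC2), 3)       # close to CM, up to sampling noise
```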

What happens if we increase the correlation between A and B and between B and C a little (by 0.0000001)?

> maxCor <- floor(1e7*sqrt(2)/2)/1e7+1e-7
> CM <- matrix(c(1,maxCor,0,
+               maxCor,1,maxCor,
+               0,maxCor,1), nrow=3)
> L <- chol(CM)
Error in chol.default(CM) :
  the leading minor of order 3 is not positive definite

We get an error message, because such a distribution doesn't exist. The desired correlation matrix is not positive definite (it has a negative eigenvalue), so it cannot be a correlation matrix. If the correlations between A and B and between B and C are greater than √2/2, then the correlation between A and C cannot be 0.

> eigen(CM)
$values
[1]  2.0000000e+00  1.0000000e+00 -2.6606238e-08

$vectors
           [,1]           [,2]        [,3]
[1,] 0.50000000 -7.0710678e-01  0.50000000
[2,] 0.70710678 -4.4408920e-16 -0.70710678
[3,] 0.50000000  7.0710678e-01  0.50000000

Of course the correlations between A and B and between B and C can be smaller than these values, because if the vector of B is not in the plane of the vectors of A and C, the angles between A and B and between B and C can be larger, and so the correlations between them are smaller.

Similarly, if the correlations between A and B and between B and C differ, for example the angle between the vectors of A and B is 55°, the correlation between B and C is maximal when the angle between the corresponding vectors is 90° - 55° = 35°. So if the correlation between A and B is cos(55°) = 0.574, the correlation between A and C can be 0 only if the correlation between B and C is at most cos(35°) = 0.819.

> maxCorAB <- 0.5735764
> maxCorBC <- 0.8191520
> CM <- matrix(c(1,maxCorAB,0,
+               maxCorAB,1,maxCorBC,
+               0,maxCorBC,1), nrow=3)
> CM
          [,1]      [,2]     [,3]
[1,] 1.0000000 0.5735764 0.000000
[2,] 0.5735764 1.0000000 0.819152
[3,] 0.0000000 0.8191520 1.000000
> L <- chol(CM)
> ABC <- M %*% t(L)
> cor(ABC)
            [,1]      [,2]        [,3]
[1,] 1.000000000 0.3341330 0.032695729
[2,] 0.334132999 1.0000000 0.770568600
[3,] 0.032695729 0.7705686 1.000000000

If, however, we increase the correlation between A and B or between B and C slightly above this maximum, we get an error message again.

> maxCorAB <- 0.5735764+1e-7
> maxCorBC <- 0.8191520
> CM <- matrix(c(1,maxCorAB,0,
+               maxCorAB,1,maxCorBC,
+               0,maxCorBC,1), nrow=3)
> L <- chol(CM)
Error in chol.default(CM) :
  the leading minor of order 3 is not positive definite
> maxCorAB <- 0.5735764
> maxCorBC <- 0.8191520+1e-7
> CM <- matrix(c(1,maxCorAB,0,
+               maxCorAB,1,maxCorBC,
+               0,maxCorBC,1), nrow=3)
> L <- chol(CM)
Error in chol.default(CM) :
  the leading minor of order 3 is not positive definite

The correlations between the variables are not independent of each other. The correlation between two variables can move only within a given framework (even if this framework is quite wide) once their correlations with a third variable are fixed.
