Saturday 25 May 2013

Student choices between SAS and R in teaching presentations.

This is my second post in a series of posts (the first one is here) about a SAS/R course that I've recently finished teaching to MSc students at +Cardiff University.

I taught this course using a hybrid of flipped classrooms / IBL (although I'm cautious when using the term IBL as I'm not entirely sure my approach fits with any variation of Moore's methods). I gave students access to all the content of the class beforehand (including notes, exercises and a series of screencasts - all the materials are here if they're of interest). The students were then given "challenges" and had to deliver their solutions as presentations to the other students in the class. The aim of this was to get the students to teach themselves/each other and quite often I would not actually have to say much at all during a class (this allowed for a better use of 'me' by the students during the lab sessions).

(Here's a previous post about flipping the classroom and here's one about IBL)

In the previous post I described how students chose to use SAS and/or R in their class test. Most students chose SAS despite displaying a preference for R when asked.

The above is 1 of 3 assessments that the students have had to go through. This post is about the second assessment: a group presentation. In this presentation I asked students to teach an aspect of SAS or R that had not been covered in class (you can see the brief here).

I believe that this is a particularly important thing to assess as I in no way can pretend to teach them everything. It's important that they know how to learn new things that they might need in their career.

I was expecting groups to select a particular language and then a particular topic but interestingly most groups chose to look at both languages and compare strengths and merits.
I had 6 groups and here are the subjects that they looked at:
  • Time series forecasting: both in SAS and R;
  • Principal component analysis: both in SAS and R;
  • Random sampling: both in SAS and R;
  • Survey sampling (in SAS) and creating a gif of the Mandelbrot set (in R);
  • Scorecard building in SAS;
  • Mapping and spatial analysis: in R.
The 3 groups who carried out a single thing in both SAS and R did a good job of describing strengths and weaknesses of both languages. It was a pleasure to see and again reassures me that an important message has gotten across to the students which is that there is not 1 best tool but an appropriate tool for a particular job.

I'm planning on putting the code/slides up for these talks so that they can serve as resources for students doing the course next year, but I want to wait till I've marked their final piece of work and asked the students if they mind. In the meantime I'll repost this gif of the Mandelbrot set made by 1 of the groups (I thought this was cool!):



I still want to write generally about the teaching/learning methods in this class and will do that later but if it's of interest my PCUTL portfolio is available here and in there I describe and justify a lot of what I'm doing.

Monday 20 May 2013

Probability of saying 'yes' to academic responsibilities

I've just read a great post by +Adriana Salerno: Learning to say no.

In that post Adriana discusses how in mathematics (and I'm sure a bunch of other/most fields) one needs a long period of uninterrupted time to work on research. She links to this Big Bang Theory clip:



She also however talks about how as an early career researcher it's important to take opportunities for responsibilities as and when they come. This is something that rings very true to me. Growing up I played a lot of rugby and basically had a "Say yes to coach" attitude ("Vince, you're slow, run sprints" - "Yes coach", "Vince, you're going to sit on the bench this week" - "Yes coach" etc... - Although I actually said "Oui Monsieur" as all my rugby was played in France, but I digress...).

I've kind of taken that attitude into the early days of my career (I'm still a 'young pup' academia wise) but I also am very grateful for every opportunity that gets sent my way (I'm very lucky to be sitting on various committees, the editorial boards for a couple of journals and am in the middle of preparing not 1 but 2 brand new courses which is a great opportunity as opposed to being given other people's courses!).

Having said that, as Adriana points out in her blog it's important to find a balance so that I can also do some research.

The point of this post is not to say that I've figured out how to do that but to post this xkcd style graph that I made using this package on github: XKCDify.

If this was done by Randall Munroe the Alt Text would be far better...


I'm about at the point where the solid line meets the dashed line (ie the "unknown" for me). I suspect that I'm still being quite optimistic as to how low the probability of saying yes will go for me as I still generally do as I'm told and appreciate the opportunities greatly :)

In Adriana's post she talks about a "research day", I might try to be strict on that...

PS Here's another similar kind of graph that +Paul Harper (my head of research group) put together when he was actually looking back a bit on his 10 years in Academia.

(If anyone's interested here's the repo with the code I used to get that plot, I actually used +Sage Mathematical Software System 's find_fit command to fit a quintic to the few points I wanted to have on there... There might be a better way to do that though...)

Saturday 18 May 2013

Student choices between SAS and R

I'm going to be writing a couple of posts looking back at a class I've taught that's just coming to an end (at the time of writing this I've got one more group presentation to see).

The course teaches SAS and R in parallel on our MSc course (if it's of interest all the teaching materials are here).

I'll be blogging about this class as I taught it a bit differently to the usual "Students Listen - Teachers Lecture" style. I'll get back to that more in future posts (although a lot of what I've done is in my PCUTL module 2 portfolio).

The purpose of this post will be to briefly discuss two questions that were on my class test that I feel give some (very shallow) information as to how the students experienced the course. (The class test was made of 4 questions: q1 - a simple task to be performed in both SAS and R, q2 - a task in SAS, q3 - a task in R and finally q4 - a task in either language.)

The class is taught over 5 weeks:
  1. Week 1: Introduction and basic statistics
  2. Week 2: Data manipulation
  3. Week 3: Programming
  4. Week 4: Extras (for example we take a look at proc optmodel and ggplot2).
  5. Week 5: A 2 hour class test
The first question on the class test (you can see it here) asked the students to rank their enjoyment of each week (the purpose of this question was to give them a nice easy starting point). So a ranking of 1 implied a favorite week while a ranking of 5 implied the least favorite.

This plot shows the mean ranking given to each week:


This data and the following discussion should be taken with no implied rigour: I'm not analysing this too closely and also students might just have written down any sequence of rankings without thinking about things too much (this was after all a test).


First of all it does seem that the students enjoyed the class test the least (which I guess is to be expected).

Secondly, it looks like the first week was perhaps less enjoyed in general.

I think this is also to be expected: I taught the class in a way that I don't believe would have been familiar to the students (I tried to encourage them to teach each other and themselves) so perhaps that first week was just a bit too unfamiliar. I'll try and rectify that in future years (if only by pointing students to what students from previous years did).

Now for the second question.



By design I hope that the students learn how to carry out various programming tasks in SAS and R seeing the strengths and weaknesses of each language as they go. The last question of the class test (again here) involved a bit of data manipulation on small data sets and I believe that the main difficulty was that this question did not force a language on the students. In essence choosing the language was the most important point of the question.

My personal approach would have been to use R for this particular question but interestingly most students chose SAS. Some used a combination (often starting in SAS before realising that perhaps R was better suited which led to a bit of a clumsy hybrid). Overall SAS was used by 77% of the students for question 4 (some of whom managed the task very well!).

During the group presentations I've been asking students afterwards a "this does not count" question (ie making it clear that there's no wrong or right answer to this and that I'm simply interested/curious):
'If you were starting a consultancy company tomorrow but could use only one package, either SAS or R, which would you pick?'
The really pleasing thing is that almost all of the students miss the constraint in my question and immediately reply something like:
"It depends on the kind of consultancy we'll be doing."
When I re-iterate the constraint (after telling them that that's the actual right answer :)), I'd say that a majority of students seem to prefer R. Perhaps the bias towards SAS in the class test was due to the conditions (time was short) but overall it's been nice to see that most students realise that it's about finding the right tool for a particular job.

I'm yet to see the individual course work that they'll be handing in this week, which is of a very similar format. I wonder which language they'll have picked for Q4...

EDIT: Here's the next post in this series (looking at choices between SAS and R during teaching presentations).

Sunday 5 May 2013

Counting semi magic squares using generating functions and Sage.

+John Cook has written a couple of posts recently that caught my eye and they kind of start with this one: Rolling dice for normal sample: Python version.

Go check it out, it's pretty cool, but basically John Cook shows how to use the SymPy package to obtain coefficients from generating functions in Python.

I thought I'd do something similar but with +Sage Mathematical Software System (which is built on Python and would be using SymPy on this occasion) but also on a slightly different topic: Semi Magic Squares.

A semi magic square is defined as a square matrix with non-negative integer entries whose row and column sums are all equal to some constant $r$.

So for example semi magic squares of size 3 will be of the form:

$\begin{pmatrix} a&b&r-a-b\\ c&d&r-c-d\\ r-a-c&r-b-d&a+b+c+d-r \end{pmatrix}$

(with $0\leq a, b, c, d, a+c, b+d, a+b, c+d \leq r \leq a+b+c+d$ and $a,b,c,d \in \mathbb{Z}$)

Using this it can be shown that $|\text{SMS}(3,r)|=\binom{r+4}{4}+\binom{r+3}{4}+\binom{r+2}{4}$ (where SMS(n,r) is the set of semi magic squares of size $n$ and line sum $r$). A really nice and accessible proof of this is in this paper by Bóna: A New Proof of the Formula for the Number of 3 by 3 Magic Squares.
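As a quick sanity check of that formula, here is a minimal Python sketch (not taken from the paper) that counts the squares by brute force using the parametrisation in $a, b, c, d$ above and compares the count with the binomial expression. The function names are illustrative only, and `math.comb` assumes Python 3.8 or later.

```python
from itertools import product
from math import comb


def count_sms3_brute(r):
    """Count 3x3 semi magic squares of line sum r by enumerating a, b, c, d."""
    count = 0
    for a, b, c, d in product(range(r + 1), repeat=4):
        # The remaining five entries are forced; all must be non-negative.
        forced = (r - a - b, r - c - d, r - a - c, r - b - d, a + b + c + d - r)
        if min(forced) >= 0:
            count += 1
    return count


def count_sms3_formula(r):
    """Closed form: C(r+4, 4) + C(r+3, 4) + C(r+2, 4)."""
    return comb(r + 4, 4) + comb(r + 3, 4) + comb(r + 2, 4)


# The two counts agree for small line sums.
for r in range(6):
    assert count_sms3_brute(r) == count_sms3_formula(r)

print([count_sms3_formula(r) for r in range(4)])  # [1, 6, 21, 55]
```

The brute force loop is only practical for small $r$ (it checks $(r+1)^4$ candidates), which is exactly why a closed form is nice to have.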

That is however not the point of this blog post :) In fact I'm just going to show the Sage code needed to count the elements of the set SMS(n,2).

There's a nice result obtained in A combinatorial distribution problem by Anand, Dumir and Gupta in 1966 that expresses the number of elements in SMS(n,2) as a generating function:

$\frac{e^{\frac{x}{2}}}{\sqrt{1-x}}=\sum_{n=0}^{\infty}|\text{SMS}(n,2)|\frac{x^n}{n!^2}$

In the rest of this post I'll now show how to use +Sage Mathematical Software System to get the numbers of Semi Magic Squares of line sum 2 and size $n$.

First of all we need to define the above generating function. In Sage this is easy and as we're going to use the variable $x$ we don't need to declare it as a symbolic variable (this is done by default for $x$ in Sage).

Here's the natural way of doing this:
sage: f(x) = e ^ (x / 2)/(sqrt(1 - x))

We can get a plot of $f(x)$ easily in Sage:
sage: plot(f(x),x,-10,1)

which gives:



To obtain the Taylor series expansion of this function (up to $x^5$) we simply call:
sage: taylor(f(x),x,0,5)

Which returns:
69/160*x^5 + 47/96*x^4 + 7/12*x^3 + 3/4*x^2 + x + 1

Recalling the generating function formula above we see that $|\text{SMS}(5,2)|$ is: $(5!)^2\times \frac{69}{160}=6210$.

We can use the usual Python syntax to define a function that will return $|\text{SMS}(n,2)|$ for any $n$:
sage: def SMS(n):
....:     # coefficient of x^n in the Taylor expansion, scaled by (n!)^2
....:     return taylor(f(x), x, 0, n).coefficient(x, n) * factorial(n)^2

This makes use of the coefficient method on symbolic expressions in Sage.

To get $|\text{SMS}(n,2)|$ for $0\leq n \leq 9$ we can use:
sage: print([SMS(n) for n in range(10)])

which gives:
[1, 1, 3, 21, 282, 6210, 202410, 9135630, 545007960, 41514583320]

If we throw that sequence in to the OEIS (my best friend during my PhD) we indeed get the correct sequence: A000681 (if you don't know about the OEIS it's awesome, be sure to click one of those two links).
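If you don't have Sage to hand, the same numbers can be cross-checked in plain Python. This is only a sketch of one way to do it: it multiplies the power series of $e^{x/2}$ and $(1-x)^{-1/2}$ with exact fractions and then scales by $(n!)^2$. The function name `sms2_counts` is made up for this illustration.

```python
from fractions import Fraction
from math import factorial


def sms2_counts(n_max):
    """Return |SMS(n, 2)| for n = 0..n_max from the generating function."""
    # Coefficients of exp(x/2): 1 / (2^k k!)
    exp_half = [Fraction(1, 2**k * factorial(k)) for k in range(n_max + 1)]
    # Coefficients of (1 - x)^(-1/2): binomial(2k, k) / 4^k
    inv_sqrt = [
        Fraction(factorial(2 * k), factorial(k) ** 2 * 4**k)
        for k in range(n_max + 1)
    ]
    counts = []
    for n in range(n_max + 1):
        # Cauchy product gives the coefficient of x^n in the product series.
        coeff = sum(exp_half[k] * inv_sqrt[n - k] for k in range(n + 1))
        value = coeff * factorial(n) ** 2
        assert value.denominator == 1  # the scaled coefficient is an integer
        counts.append(value.numerator)
    return counts


print(sms2_counts(5))  # [1, 1, 3, 21, 282, 6210]
```

Using `Fraction` keeps everything exact, so there is no rounding to worry about when the factorials get large.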

In my previous post today I just used pure Python to carry out simulations to estimate $\pi$ but there are certain things that I find Sage to be better suited for (like generating functions :) ).

In my PhD I looked at a set of objects similar to Semi Magic Squares called: Alternating Sign Matrices.

You can find some of the arguments I briefly talk about here in my thesis (the thought of someone actually reading it is terrifying). I used Mathematica throughout my PhD but (without wanting to sound like a sales pitch) if I had known about Sage (not a sponsor) at the time I would have probably used Sage.

(All the math on this page is rendered using http://www.mathjax.org/ which I'm not entirely sure to have configured correctly for +Blogger; if some of the math hasn't rendered try reloading the page...)

Using Monte Carlo Simulation for the Estimation of Pi

A while ago on G+, +Sara Del Valle posted a link to this article The 12 Most Controversial Facts In Mathematics. Whether or not the title of that article is precise is a whole other question but one of the problems in there discusses the use of probability to estimate $\pi$.

At the time I wrote a small Python script to randomly generate points that lie in a square of side length $2r$ (this approach of repeated random sampling is called Monte Carlo simulation). If we let $S$ be the total number of points and $C$ be the number of points that lie in an inscribed circle of radius $r$ then:

$$P(\text{point in circle})=\frac{\pi r^2}{4r^2}=\frac{\pi}{4}$$



Using this, if we generate a high enough number of points we can estimate P(point in circle) by $\frac{C}{S}$. We generate a point $(x,y)$ (with the circle centred at the origin) and check if that point is in the circle by checking if $x^2+y^2\leq r^2$. Using all this we get:

$\pi\approx \frac{4C}{S}$
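The idea can be sketched in a few lines of Python. This is a minimal illustration of the approach rather than the script from the repo; the function name `estimate_pi` and its `seed` parameter are assumptions made for the example.

```python
import random


def estimate_pi(num_points, r=1.0, seed=None):
    """Estimate pi by sampling points uniformly in the square [-r, r] x [-r, r]."""
    rng = random.Random(seed)
    in_circle = 0  # this is C in the text
    for _ in range(num_points):  # num_points is S in the text
        x = rng.uniform(-r, r)
        y = rng.uniform(-r, r)
        if x * x + y * y <= r * r:
            in_circle += 1
    return 4 * in_circle / num_points


print(estimate_pi(100000, seed=0))
```

Passing a seed makes a run repeatable; leaving it as `None` gives a different estimate each time.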

This morning I've modified the script slightly so that it generates a few more plots:
  • If the relevant option is set it will generate a new plot for every single point;
  • A plot of the estimate of $\pi$ as a function of the number of points.
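To look at the estimate as a function of the number of points, one option is to record $\frac{4C}{S}$ after every new point. The sketch below does this for the unit circle ($r=1$); the name `running_pi_estimates` is hypothetical and the plotting itself is left out.

```python
import random


def running_pi_estimates(num_points, seed=None):
    """Return the estimate 4C/S after each of the first num_points samples."""
    rng = random.Random(seed)
    in_circle = 0
    estimates = []
    for total in range(1, num_points + 1):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            in_circle += 1
        estimates.append(4 * in_circle / total)
    return estimates


estimates = running_pi_estimates(10000, seed=0)
for s in (10, 100, 1000, 10000):
    print(s, estimates[s - 1])
```

The early values jump around a lot and the later ones settle down, which is the behaviour a plot of this list makes visible.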
Here's a plot of the estimate of $\pi$:



The main point of me changing this was to put together this screencast that describes the process:



If it's of interest the github repo with the code is available here.

On a related note, here's another video I put together a while back showing the basic process that can be used to simulate a queueing process:


Friday 3 May 2013

Results from my online version of the two thirds of the average game.

In my previous blog post I described briefly the online version of the two thirds of the average game that I set up using Google's app engine (the main reason for this was to make sure I knew how to do it as I plan on using something similar in an upcoming class).

I invited people to play and kept the game going for a week. I was hoping for a few more participants than I got but it was nice to have 76 people take the time to play.

Drum roll...

The winning guess was 15 and this was guessed by a single individual who did not leave a url but whose google username makes me think their name is Dalibor Smid: so congratulations and thanks for playing :)

The distribution of the guesses is fairly traditional for this experiment with peaks at the guesses that are two thirds of the previous peak (66, 44, 29, 19, etc.).
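Those peaks come from repeatedly taking two thirds of the previous value. Here is a small Python sketch of that reasoning together with a way of picking the winner; it assumes guesses are whole numbers between 0 and 100 and that values are rounded down, and the list of guesses is made up for illustration (it is not the data from the game).

```python
def level_k_guesses(start=100, levels=4):
    """Repeatedly take two thirds of the previous value, rounding down."""
    guesses = []
    value = start
    for _ in range(levels):
        value = (2 * value) // 3
        guesses.append(value)
    return guesses


def winning_guess(guesses):
    """Return the guess closest to two thirds of the mean of all guesses."""
    target = 2 * sum(guesses) / (3 * len(guesses))
    return min(guesses, key=lambda g: abs(g - target))


print(level_k_guesses())  # [66, 44, 29, 19]

example = [66, 44, 29, 19, 50, 0, 100]  # made-up guesses, not the real data
print(winning_guess(example))
```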


Thanks a lot to everyone for playing and also for everyone who re-shared my invitation:
and others that I apologise for forgetting!

I'm going to leave the game up there and reset it. I'll let the data store fill up until reaching 200 people and then reset it again (assuming 200 people ever decide to play :) ), and so on.

So if you want to have another go please do go play: