
Friday, May 26, 2017

Introducing Kotlin Statistics




Last fall I wrote an article, Kotlin for Data Science, in which I proposed Kotlin as a general-purpose programming language for data science and analytics. To my surprise, I later found out that a lot of folks had read the article and quietly started to investigate the idea. In Spring 2017 the conversation began to grow in the Kotlin community, but nothing set fire to the prospect quite like Google announcing its support for Kotlin. With that announcement, as well as the fact that Kotlin compiles to JavaScript and soon LLVM, it is clear that Kotlin is poised to gain adoption across multiple domains.

Since my previous article, the topic of integrating data science workflows with software engineering has gained traction across data science communities. O'Reilly posted an interesting article about using Go for data science, and a quick Google search for "DevOps for Data Science" reveals the same theme: closing the gap between data science and software engineering is a logical next step in progressing data science as a discipline.

I believe that Python and R will continue to be tools used for analytics, but they are not sustainable for workflows that require continuous delivery into production. Of course, libraries like Apache Spark and others are striving to support APIs in multiple languages, but Kotlin has a lot of potential to usher in a bigger picture. That is what I hope to show with the Kotlin Statistics library I released this week. It is not a silver bullet for all problems in data science, nor does it have advanced features like ML at the moment. Rather, I want Kotlin Statistics to show how Kotlin's inferred static typing and abstraction can make data science code simpler and more tactical, but also resilient and refactorable. Not to mention, the tooling for Kotlin is fantastic with IntelliJ IDEA.


A Quick Tour


Kotlin Statistics is not yet at a 1.0 version, but it should give you a good set of tools to start doing fundamental statistical analysis.

Take, for example, the Kotlin code below, where I declare a Patient type with a first name, last name, gender, birthday, and white blood cell count. I also have an enum called Gender reflecting a MALE/FEMALE category. Of course, I could import this data from a text file, a database, or another source, but for now I am going to declare the patients in literal Kotlin code:

import java.time.LocalDate

data class Patient(val firstName: String,
                   val lastName: String,
                   val gender: Gender,
                   val birthday: LocalDate,
                   val whiteBloodCellCount: Int)


val patients = listOf(
        Patient("John", "Simone", Gender.MALE, LocalDate.of(1989, 1, 7), 4500),
        Patient("Sarah", "Marley", Gender.FEMALE, LocalDate.of(1970, 2, 5), 6700),
        Patient("Jessica", "Arnold", Gender.FEMALE, LocalDate.of(1980, 3, 9), 3400),
        Patient("Sam", "Beasley", Gender.MALE, LocalDate.of(1981, 4, 17), 8800),
        Patient("Dan", "Forney", Gender.MALE, LocalDate.of(1985, 9, 13), 5400),
        Patient("Lauren", "Michaels", Gender.FEMALE, LocalDate.of(1975, 8, 21), 5000),
        Patient("Michael", "Erlich", Gender.MALE, LocalDate.of(1985, 12, 17), 4100),
        Patient("Jason", "Miles", Gender.MALE, LocalDate.of(1991, 11, 1), 3900),
        Patient("Rebekah", "Earley", Gender.FEMALE, LocalDate.of(1985, 2, 18), 4600),
        Patient("James", "Larson", Gender.MALE, LocalDate.of(1974, 4, 10), 5100),
        Patient("Dan", "Ulrech", Gender.MALE, LocalDate.of(1991, 7, 11), 6000),
        Patient("Heather", "Eisner", Gender.FEMALE, LocalDate.of(1994, 3, 6), 6000),
        Patient("Jasper", "Martin", Gender.MALE, LocalDate.of(1971, 7, 1), 6000)
)

enum class Gender {
    MALE,
    FEMALE
}
If you find the LocalDate.of() or other parts of the declaration to be redundant and wordy, you can easily create functions or type aliases to make things more concise, but I am not going to digress into that right now.
Let's start with some basic analysis: what are the average and standard deviation of whiteBloodCellCount across all the patients? The average() function comes with the Kotlin standard library, and Kotlin Statistics adds a standardDeviation() extension function alongside it, so we can find both quickly:

fun main(args: Array<String>) {

    val averageWbcc =
            patients.map { it.whiteBloodCellCount }.average()

    val standardDevWbcc =
            patients.map { it.whiteBloodCellCount }.standardDeviation()

    println("Average WBCC: $averageWbcc, Std Dev WBCC: $standardDevWbcc")

}
We should get this output:
Average WBCC: 5346.153846153846, Std Dev WBCC: 1412.2177503341948
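To make the second number less of a black box, here is a minimal sketch that recomputes both figures with the standard library alone. The helper name sampleStandardDeviation is hypothetical and not part of Kotlin Statistics. It divides by n - 1 (the sample standard deviation), which is an assumption about what the library does, although it is the variant that matches the output above:

```kotlin
import kotlin.math.sqrt

// White blood cell counts copied from the patients list declared earlier.
val wbcCounts = listOf(4500, 6700, 3400, 8800, 5400, 5000, 4100,
        3900, 4600, 5100, 6000, 6000, 6000)

// Hypothetical helper, not part of Kotlin Statistics: sample standard
// deviation, dividing the summed squared deviations by n - 1.
fun sampleStandardDeviation(values: List<Int>): Double {
    require(values.size > 1) { "need at least two values" }
    val mean = values.average()
    val sumOfSquares = values.map { (it - mean) * (it - mean) }.sum()
    return sqrt(sumOfSquares / (values.size - 1))
}

fun main() {
    println("Average WBCC: ${wbcCounts.average()}")
    println("Std Dev WBCC: ${sampleStandardDeviation(wbcCounts)}")
}
```

Dividing by n instead of n - 1 would give the population standard deviation, which comes out smaller for the same 13 values.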
However, we sometimes need to slice our data not only for more detailed insight but also to judge our sample. For example, did we get a representative sample with our patients for both male and female? We can use the countBy() operator in Kotlin Statistics to count a Collection or Sequence of items by a keySelector as shown here:

fun main(args: Array<String>) {

    val genderCounts = patients.countBy(
            keySelector = { it.gender }
    )

    println(genderCounts)
}

This returns a Map<Gender,Int>, reflecting the patient count by gender. Here is what it looks like in the output from our code above:
{MALE=8, FEMALE=5}
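For comparison, a count by key can be approximated with the standard library's groupingBy() and eachCount(). This is only a sketch of the idea, not the library's implementation, and countByKey is a hypothetical name chosen to avoid confusion with the real countBy():

```kotlin
enum class Gender { MALE, FEMALE }

// Genders listed in the same order as the patients declared earlier.
val genders = listOf(
        Gender.MALE, Gender.FEMALE, Gender.FEMALE, Gender.MALE, Gender.MALE,
        Gender.FEMALE, Gender.MALE, Gender.MALE, Gender.FEMALE, Gender.MALE,
        Gender.MALE, Gender.FEMALE, Gender.MALE
)

// Hypothetical helper (not the library's countBy): groups items by key
// and counts the members of each group using only the standard library.
fun <T, K> Iterable<T>.countByKey(keySelector: (T) -> K): Map<K, Int> =
        groupingBy(keySelector).eachCount()

fun main() {
    println(genders.countByKey { it })
}
```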
Okay, so our sample is a bit MALE-heavy, but let's move on. We can also find the average white blood cell count by gender using averageBy(). This accepts not only a keySelector lambda but also an intMapper to select an integer off each Patient (we could also use doubleMapper, bigDecimalMapper, etc.). In this case, we are selecting the whiteBloodCellCount off each Patient and averaging it by Gender, as shown next:

fun main(args: Array<String>) {

    val averageWbccByGender = patients.averageBy(
            keySelector = { it.gender },
            intMapper = { it.whiteBloodCellCount }
    )

    println(averageWbccByGender)
}

{MALE=5475.0, FEMALE=5140.0}

So the average WBCC for MALE is 5475, and FEMALE is 5140.
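The same group-then-average idea can be sketched with groupBy() and mapValues() from the standard library. The names averageByKey and Reading are hypothetical, with Reading standing in for Patient so the example stays short. This illustrates the technique and makes no claim about how averageBy() is written internally:

```kotlin
enum class Gender { MALE, FEMALE }

// Trimmed-down stand-in for Patient carrying only the fields used here.
data class Reading(val gender: Gender, val whiteBloodCellCount: Int)

val readings = listOf(
        Reading(Gender.MALE, 4500), Reading(Gender.FEMALE, 6700),
        Reading(Gender.FEMALE, 3400), Reading(Gender.MALE, 8800),
        Reading(Gender.MALE, 5400), Reading(Gender.FEMALE, 5000),
        Reading(Gender.MALE, 4100), Reading(Gender.MALE, 3900),
        Reading(Gender.FEMALE, 4600), Reading(Gender.MALE, 5100),
        Reading(Gender.MALE, 6000), Reading(Gender.FEMALE, 6000),
        Reading(Gender.MALE, 6000)
)

// Hypothetical helper: group by key, map each member to an Int,
// then average the Ints within each group.
fun <T, K> Iterable<T>.averageByKey(
        keySelector: (T) -> K,
        intMapper: (T) -> Int
): Map<K, Double> =
        groupBy(keySelector)
                .mapValues { (_, group) -> group.map(intMapper).average() }

fun main() {
    val result = readings.averageByKey(
            keySelector = { it.gender },
            intMapper = { it.whiteBloodCellCount }
    )
    println(result)
}
```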

What about age? Did we get a good sampling of younger and older patients? If you look at our Patient class, we only have a birthday to work with, which is a Java 8 LocalDate. But using Java 8's date and time utilities, we can derive the age in years in the keySelector like this:

import java.time.temporal.ChronoUnit

fun main(args: Array<String>) {

    val patientCountByAge = patients.countBy(
            keySelector = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()) }
    )

    println(patientCountByAge)
}
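And here is the output as of the date of this post (the code calls LocalDate.now(), so the ages will shift if you run it later):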

{28=1, 47=1, 37=1, 36=1, 31=2, 41=1, 25=2, 32=1, 43=1, 23=1, 45=1}
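Because LocalDate.now() moves with the calendar, one way to keep an analysis like this reproducible is to pass the reference date in explicitly. The sketch below assumes a hypothetical helper named ageInYears and pins the date to this post's publication date. ChronoUnit.YEARS.between() counts only complete years, so a patient whose birthday has not yet come around in the reference year is still the younger age:

```kotlin
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Hypothetical helper: whole years elapsed between a birthday and a
// caller-supplied reference date, so results do not depend on "today".
fun ageInYears(birthday: LocalDate, asOf: LocalDate): Int =
        ChronoUnit.YEARS.between(birthday, asOf).toInt()

fun main() {
    val asOf = LocalDate.of(2017, 5, 26)

    // Birthday already passed in 2017: 28 complete years.
    println(ageInYears(LocalDate.of(1989, 1, 7), asOf))

    // Birthday still ahead in 2017: only 31 complete years.
    println(ageInYears(LocalDate.of(1985, 12, 17), asOf))
}
```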

A count by exact age is not very meaningful. It would be better to count by age ranges, like 20-29, 30-39, and 40-49. We can do this using the binByXXX() operators. To bin by an Int value such as age, we call binByInt() with a rangeStart of 20 and a binSize of 10, which produces a BinModel whose bins each span 10 years starting at 20. We also provide the value we are binning on using binMapper, which here is the patient's age, as shown below:

fun main(args: Array<String>) {

    val binnedPatients = patients.binByInt(
            binMapper = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()).toInt() },
            binSize = 10,
            rangeStart = 20
    )

    binnedPatients.forEach {
        println(it)
    }
}

And here is the output showing all our Patient items binned up in a BinModel by these age ranges:

Bin(range=20..29, value=[Patient(firstName=John, lastName=Simone, gender=MALE, birthday=1989-01-07, whiteBloodCellCount=4500), Patient(firstName=Jason, lastName=Miles, gender=MALE, birthday=1991-11-01, whiteBloodCellCount=3900), Patient(firstName=Dan, lastName=Ulrech, gender=MALE, birthday=1991-07-11, whiteBloodCellCount=6000), Patient(firstName=Heather, lastName=Eisner, gender=FEMALE, birthday=1994-03-06, whiteBloodCellCount=6000)])
Bin(range=30..39, value=[Patient(firstName=Jessica, lastName=Arnold, gender=FEMALE, birthday=1980-03-09, whiteBloodCellCount=3400), Patient(firstName=Sam, lastName=Beasley, gender=MALE, birthday=1981-04-17, whiteBloodCellCount=8800), Patient(firstName=Dan, lastName=Forney, gender=MALE, birthday=1985-09-13, whiteBloodCellCount=5400), Patient(firstName=Michael, lastName=Erlich, gender=MALE, birthday=1985-12-17, whiteBloodCellCount=4100), Patient(firstName=Rebekah, lastName=Earley, gender=FEMALE, birthday=1985-02-18, whiteBloodCellCount=4600)])
Bin(range=40..49, value=[Patient(firstName=Sarah, lastName=Marley, gender=FEMALE, birthday=1970-02-05, whiteBloodCellCount=6700), Patient(firstName=Lauren, lastName=Michaels, gender=FEMALE, birthday=1975-08-21, whiteBloodCellCount=5000), Patient(firstName=James, lastName=Larson, gender=MALE, birthday=1974-04-10, whiteBloodCellCount=5100), Patient(firstName=Jasper, lastName=Martin, gender=MALE, birthday=1971-07-01, whiteBloodCellCount=6000)])
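The arithmetic behind fixed-width bins is simple enough to sketch with the standard library: integer division finds how many whole bins sit between rangeStart and the value. The names binRangeFor and binByIntRange are hypothetical, the ages are the ones from the earlier output, and the result is a plain sorted Map rather than the library's BinModel:

```kotlin
// Patient ages as of the date of this post, in declaration order.
val ages = listOf(28, 47, 37, 36, 31, 41, 31, 25, 32, 43, 25, 23, 45)

// Hypothetical helper: the closed Int range of the bin that holds value.
// Assumes value >= rangeStart, since integer division truncates toward zero.
fun binRangeFor(value: Int, binSize: Int, rangeStart: Int): IntRange {
    require(binSize > 0) { "binSize must be positive" }
    require(value >= rangeStart) { "value lies below rangeStart" }
    val lower = rangeStart + ((value - rangeStart) / binSize) * binSize
    return lower..(lower + binSize - 1)
}

// Hypothetical helper: groups items by their bin range, ordered by the
// lower bound of each range.
fun <T> Iterable<T>.binByIntRange(
        binMapper: (T) -> Int,
        binSize: Int,
        rangeStart: Int
): Map<IntRange, List<T>> =
        groupBy { binRangeFor(binMapper(it), binSize, rangeStart) }
                .toSortedMap(compareBy<IntRange> { it.first })

fun main() {
    ages.binByIntRange(binMapper = { it }, binSize = 10, rangeStart = 20)
            .forEach { (range, members) -> println("$range -> $members") }
}
```

Only bins that actually receive an item appear in this sketch. Whether the library also emits empty bins is not something the output above can tell us.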

We can look up the bin for a given age using an accessor syntax. For example, we can retrieve the Bin for the age 25 like this, and it will return the 20-29 bin:

fun main(args: Array<String>) {

    val binnedPatients = patients.binByInt(
            binMapper = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()).toInt() },
            binSize = 10,
            rangeStart = 20
    )

    println(binnedPatients[25])
}

If we do not want to collect the items into bins but rather perform an aggregation on each bin, we can do that by also providing a groupOp argument. This is a lambda specifying how to reduce the List<Patient> of each Bin to a single value. Below is the average white blood cell count by age range:

fun main(args: Array<String>) {

    val avgWbccByAgeRange = patients.binByInt(
            binMapper = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()).toInt() },
            binSize = 10,
            rangeStart = 20,
            groupOp = { it.map { it.whiteBloodCellCount }.average() }
    )

    println(avgWbccByAgeRange)
}

Here is the output, showing that the average white blood cell count for every age range falls in the 5000s:

BinModel(bins=[Bin(range=20..29, value=5100.0), Bin(range=30..39, value=5260.0), Bin(range=40..49, value=5700.0)])
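Reducing each bin is just one more step after grouping: apply a function to every bin's list of members. Here is a minimal standard-library sketch under the same assumptions as before, with a hypothetical helper named binAndReduce and age/count pairs standing in for full Patient objects:

```kotlin
// (age as of this post, white blood cell count) for each patient.
val agesAndCounts = listOf(
        28 to 4500, 47 to 6700, 37 to 3400, 36 to 8800, 31 to 5400,
        41 to 5000, 31 to 4100, 25 to 3900, 32 to 4600, 43 to 5100,
        25 to 6000, 23 to 6000, 45 to 6000
)

// Hypothetical helper: bins items by an Int value, orders the bins by
// lower bound, then reduces each bin's members with groupOp.
fun <T, R> Iterable<T>.binAndReduce(
        binMapper: (T) -> Int,
        binSize: Int,
        rangeStart: Int,
        groupOp: (List<T>) -> R
): Map<IntRange, R> =
        groupBy {
            val value = binMapper(it)
            require(value >= rangeStart) { "value lies below rangeStart" }
            val lower = rangeStart + ((value - rangeStart) / binSize) * binSize
            lower..(lower + binSize - 1)
        }
                .toSortedMap(compareBy<IntRange> { it.first })
                .mapValues { (_, members) -> groupOp(members) }

fun main() {
    val avgWbccByAgeRange = agesAndCounts.binAndReduce(
            binMapper = { it.first },
            binSize = 10,
            rangeStart = 20,
            groupOp = { members -> members.map { it.second }.average() }
    )
    println(avgWbccByAgeRange)
}
```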

Using let() for Multiple Calculations


There may be times you want to perform multiple aggregations to create reports of various metrics. This is usually achievable using Kotlin's let() function. Say you wanted to find the 1st, 25th, 50th, 75th, and 100th percentiles by gender. We can declare a local Kotlin extension function called wbccPercentileByGender(), which takes a collection of patients and calculates a given percentile separately for each gender. Then we can invoke it for the five desired percentiles and package the results in a Map<Double,Map<Gender,Double>>, as shown below:


fun main(args: Array<String>) {

    fun Collection<Patient>.wbccPercentileByGender(percentile: Double) =
            percentileBy(
                    percentile = percentile,
                    keySelector = { it.gender },
                    doubleMapper = { it.whiteBloodCellCount.toDouble() }
            )

    val percentileQuadrantsByGender = patients.let {
        mapOf(1.0 to it.wbccPercentileByGender(1.0),
                25.0 to it.wbccPercentileByGender(25.0),
                50.0 to it.wbccPercentileByGender(50.0),
                75.0 to it.wbccPercentileByGender(75.0),
                100.0 to it.wbccPercentileByGender(100.0)
        )
    }

    percentileQuadrantsByGender.forEach(::println)
}

OUTPUT:

1.0={MALE=3900.0, FEMALE=3400.0}
25.0={MALE=4200.0, FEMALE=4000.0}
50.0={MALE=5250.0, FEMALE=5000.0}
75.0={MALE=6000.0, FEMALE=6350.0}
100.0={MALE=8800.0, FEMALE=6700.0}
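Percentiles can be estimated in several ways, so it helps to see one concrete method. The sketch below assumes the interpolation rule pos = p * (n + 1) / 100 on the sorted values, clamped to the minimum and maximum. That rule is an assumption rather than something taken from the library's source, but it does reproduce every number in the output above. The function name percentile is hypothetical:

```kotlin
// White blood cell counts split by gender, from the patients list.
val maleCounts = listOf(4500.0, 8800.0, 5400.0, 4100.0, 3900.0,
        5100.0, 6000.0, 6000.0)
val femaleCounts = listOf(6700.0, 3400.0, 5000.0, 4600.0, 6000.0)

// Hypothetical helper: percentile by linear interpolation at the
// 1-based position p * (n + 1) / 100 in the sorted values.
fun percentile(values: List<Double>, p: Double): Double {
    require(values.isNotEmpty()) { "values must not be empty" }
    require(p > 0.0 && p <= 100.0) { "p must be in (0, 100]" }
    val sorted = values.sorted()
    val n = sorted.size
    val pos = p * (n + 1) / 100.0
    if (pos < 1.0) return sorted.first()   // below the first item
    if (pos >= n) return sorted.last()     // at or beyond the last item
    val lowerIndex = pos.toInt()           // 1-based index of lower neighbor
    val fraction = pos - lowerIndex
    val lower = sorted[lowerIndex - 1]
    val upper = sorted[lowerIndex]
    return lower + fraction * (upper - lower)
}

fun main() {
    listOf(1.0, 25.0, 50.0, 75.0, 100.0).forEach { p ->
        println("$p={MALE=${percentile(maleCounts, p)}, " +
                "FEMALE=${percentile(femaleCounts, p)}}")
    }
}
```

For example, the 25th percentile of the eight MALE values sits at position 2.25, a quarter of the way from 4100 to 4500, which gives 4200.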

Summary

This was a fairly simple introduction to Kotlin Statistics and the functionality I have built so far. Be sure to read the project's README to see the more comprehensive set of operators available in the library. Over time, I plan to extend it with linear regression, charting, and other features. I am also thinking of adding Bayesian model support after I finish scoping it out.

But more importantly, I hope this demonstrates Kotlin's efficacy in being tactical but robust. Kotlin is capable of rapid turnaround for quick ad hoc analysis, but you can take that statically-typed code and put it in production if you need to. While I am seeking to add more functionality to this, it would be awesome to see others contribute to the idea of using Kotlin for these kinds of purposes.
