ISA's Big-R Workshop

<jonathan.rosenblatt@weizmann.ac.il>

date: 23.10.2014

[Written with R Markdown]

Typical BIG Workflow

  1. Store and pre-process in:
    1. NoSQL database (e.g. Cassandra)
    2. Distributed file system (e.g. Hadoop, …)
  2. Dump data to an SQL database (e.g. MySQL, SAS, …)
  3. Analyze using your favorite software
    (R, SAS, SPSS, RapidMiner, Assembly, …):
    1. Using the software's routines.
    2. Using the database's routines.

Remarks:

  • I will only talk about 3.1.
  • Eddie will talk about 1 & 2.

You know you have a problem when

  • Job never starts
    cannot allocate vector of size n Mb
  • Job never ends
  • Job starts and two days later
    cannot allocate vector of size n Mb
    #reasonsforhomicidalrampage

Many paths to the same destination

But might require different resources.

  1. Linear algebra example:

    • A %*% B %*% x: a matrix-matrix product followed by a matrix-vector product.
    • A %*% (B %*% x): two matrix-vector products; same result, far cheaper (see the sketch after this list).
    • Digression: optimized BLAS libraries (OpenBLAS, ATLAS, MKL).
  2. Optimization example:

    • Newton-Raphson: computes and inverts the Hessian at every step (heavy on CPU and RAM).
    • Fisher Scoring: uses the expected information instead of the Hessian (similar cost, often more stable).
    • Gradient Descent: needs only the gradient (lighter on CPU and RAM per step, but more steps).
    • Stochastic Gradient Descent: updates on one observation (or mini-batch) at a time (minimal RAM; data can be streamed).
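
To see the difference in resources, a minimal sketch (the matrix size n is made up for illustration):

n <- 2000
A <- matrix(rnorm(n^2), n, n)
B <- matrix(rnorm(n^2), n, n)
x <- rnorm(n)

system.time(A %*% B %*% x)    # evaluates (A %*% B) first: an n-by-n matrix product, O(n^3) work
system.time(A %*% (B %*% x))  # two matrix-vector products, O(n^2) work

The first form also allocates an n-by-n intermediate matrix; the second only a length-n vector.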

Diagnose the effective constraint

  • CPU?
  • RAM?
  • RAM is (partly) interchangeable with run time: the swap file spills RAM to disk, at a large speed cost.

Fix the right problem!
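
A minimal way to check which resource you are actually burning, using base R only (the toy matrix is made up for illustration):

x <- matrix(rnorm(1e6), ncol = 100)

print(object.size(x), units = "Mb")   # RAM footprint of the object
system.time(svd(x))                   # CPU time of a typical computation on it
gc()                                  # memory currently used, and what can be reclaimed

Also watch the R process in your OS monitor (e.g. top or the Task Manager) while the job runs.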

If CPU is the constraint: parallelize

  • Within R:
    • parallel::parApply() family.
    • foreach::foreach() parallelized loops.
  • Third-party:
    • Condor, SGE, AWS-SQS, …
  • Remember!
    • Memory not shared (typically).
    • Memory consumption multiplies.
  • Very useful for simulations (a minimal sketch follows this list).
  • Talk to me offline or ask Tal Galili™ for another workshop.
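
A minimal sketch of a parallel simulation with the parallel package (bundled with R); the cluster size and the toy bootstrap task are illustrative only:

library(parallel)

cl <- makeCluster(2)   # two worker processes; each holds its OWN copy of any exported data

# A toy simulation: 100 bootstrap means, farmed out to the workers.
boot_means <- parSapply(cl, 1:100, function(i) {
  x <- rnorm(1e4)
  mean(sample(x, replace = TRUE))
})

stopCluster(cl)

foreach::foreach() achieves the same with loop syntax, given a registered backend such as doParallel.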

If RAM is the constraint

Keep listening.

Background

  • Base R bundles in-memory, batch algorithms.
  • Recall:
    • Memory(RAM) \( = \mathcal O(GB) \)
    • Memory(HD) \( = \mathcal O(TB) \)
    • Memory(Server) \( = \mathcal O(PB) \)
    • Memory(Cluster) \( = \omega(PB) \)
  • Batch algorithms typically need about \( 3 \times \) the object's size in RAM.

Escaping SMALL-DATA land

  1. Algorithms:
    • Exploit sparseness (see the sketch after this list).
    • From batch to streaming.
    • From in-RAM to in-HD.
  2. Scale RAM:
    • Buy more RAM.
    • Rent more RAM (cloud).
  3. Scale HD:
    • Buy more HD.
    • Rent more HD (cloud).
    • Go to Eddie's talk (in 1.5h).
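
On exploiting sparseness, a minimal sketch with the Matrix package, showing the RAM saving when most entries are zero (the sizes are illustrative):

library(Matrix)

dense <- matrix(0, nrow = 1e4, ncol = 100)
dense[sample(length(dense), 1000)] <- 1   # ~0.1% of the entries are non-zero
sparse <- Matrix(dense, sparse = TRUE)    # compressed sparse format: stores only the non-zeros

print(object.size(dense), units = "Mb")
print(object.size(sparse), units = "Mb")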

Outline of examples

  1. Diagnose your bottleneck.
  2. From batch to streaming (see the sketch after this outline):
    • Stream-from-RAM regression.
    • Exploit sparsity.
    • Stream-from-RAM classification.
  3. From in-RAM to out-of-RAM:
    • Stream-from-HD regression.
    • Stream-from-HD classification.
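
As a taste of the streaming idea, a minimal sketch with the biglm package from CRAN (one common choice; the mtcars toy data and the chunking are illustrative, not necessarily the examples used later):

library(biglm)

# Feed the data chunk by chunk: only the current chunk, plus a small set of
# sufficient statistics, needs to sit in RAM.
chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

fit <- biglm(mpg ~ wt + hp, data = chunks[[1]])
for (chunk in chunks[-1]) {
  fit <- update(fit, chunk)
}
summary(fit)

The same update() pattern applies when chunks are read from disk, which is the stream-from-HD case.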

Disclaimers

  1. BIG cannot be solved live.
  2. First time lecture. Good luck everyone.
  3. Don't use packrat over Dropbox.
  4. If you are unfamiliar with magrittr, your life is about to change.

Old you

# Nested dplyr calls read inside-out (assumes a 'flights' table with dep_delay, date, hour columns).
library(dplyr)

hourly_delay <- filter(
  summarise(
    group_by(
      filter(
        flights,
        !is.na(dep_delay)
      ),
      date, hour
    ),
    delay = mean(dep_delay),
    n = n()
  ),
  n > 10
)

New you

# The same pipeline with magrittr's %>%: reads top to bottom.
hourly_delay <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(date, hour) %>%
  summarise(
    delay = mean(dep_delay),
    n = n()
  ) %>%
  filter(n > 10)