<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>randyzwitch - Articles</title>
    <description>Data Science, Programming, Technology, Design</description>
    <link>http://randyzwitch.com</link>
    
      
      <item>
        <title>Building pyarrow with CUDA support</title>
        
          <description>&lt;p&gt;The other day I was looking to read an Arrow buffer on GPU using Python, but as far as I could tell, none of the provided pyarrow packages on conda or pip are built with CUDA support. Like many of the packages in the compiled-C-wrapped-by-Python ecosystem, &lt;a href=&quot;http://arrow.apache.org/&quot;&gt;Apache Arrow&lt;/a&gt; is thoroughly documented, but the number of permutations of how you could choose to build &lt;a href=&quot;http://arrow.apache.org/docs/developers/python.html#building-on-linux-and-macos&quot;&gt;pyarrow with CUDA support&lt;/a&gt; quickly becomes overwhelming.&lt;/p&gt;

</description>
        
        <pubDate>Fri, 03 Apr 2020 00:00:00 +0000</pubDate>
        <link>http://randyzwitch.com/pyarrow-cuda-support/</link>
        <guid isPermaLink="true">http://randyzwitch.com/pyarrow-cuda-support/</guid>
        <content type="html" xml:base="/pyarrow-cuda-support/">&lt;p&gt;The other day I was looking to read an Arrow buffer on GPU using Python, but as far as I could tell, none of the provided pyarrow packages on conda or pip are built with CUDA support. Like many of the packages in the compiled-C-wrapped-by-Python ecosystem, &lt;a href=&quot;http://arrow.apache.org/&quot;&gt;Apache Arrow&lt;/a&gt; is thoroughly documented, but the number of permutations of how you could choose to build &lt;a href=&quot;http://arrow.apache.org/docs/developers/python.html#building-on-linux-and-macos&quot;&gt;pyarrow with CUDA support&lt;/a&gt; quickly becomes overwhelming.&lt;/p&gt;

&lt;p&gt;In this post, I’ll show how to build pyarrow with CUDA support on Ubuntu using Docker and &lt;a href=&quot;https://docs.python-guide.org/dev/virtualenvs/#lower-level-virtualenv&quot;&gt;virtualenv&lt;/a&gt;. These directions closely follow the official Apache Arrow docs, except that I explain them step by step and show only the single build toolchain I used.&lt;/p&gt;

&lt;h2 id=&quot;step-1-docker-with-gpu-support&quot;&gt;Step 1: Docker with GPU support&lt;/h2&gt;

&lt;p&gt;Even though I use Ubuntu 18.04 LTS on a workstation with an NVIDIA GPU, whenever I undertake a project like this, I like to use a Docker container to keep everything isolated. The last thing you want is to change dependencies for one project, break something else, and end up debugging environment errors. Thankfully, &lt;a href=&quot;https://hub.docker.com/r/nvidia/cuda/&quot;&gt;NVIDIA Docker developer images&lt;/a&gt; are available via DockerHub:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;docker run &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--gpus&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all &lt;span class=&quot;nt&quot;&gt;--rm&lt;/span&gt; nvidia/cuda:10.1-devel-ubuntu18.04 bash
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-it&lt;/code&gt; flag puts us inside the container at a bash prompt, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--gpus=all&lt;/code&gt; allows the Docker container to access my workstation’s GPUs and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--rm&lt;/code&gt; deletes the container after we’re done to save space.&lt;/p&gt;

&lt;h2 id=&quot;step-2-setting-up-the-ubuntu-docker-container&quot;&gt;Step 2: Setting up the Ubuntu Docker container&lt;/h2&gt;

&lt;p&gt;Docker images pulled from DockerHub are frequently bare-bones in terms of the libraries they include, and their packages can usually be updated as well. For building pyarrow, it’s useful to install the following:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apt update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt upgrade &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt;

apt &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;git &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
wget &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
libssl-dev &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
autoconf &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
flex &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
bison &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
llvm-7 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
clang &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
cmake &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
python3-pip &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
libjemalloc-dev &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
libboost-dev &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
libboost-filesystem-dev &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
libboost-system-dev &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
libboost-regex-dev  &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
python3-dev &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In a later step, we’ll use the Arrow third-party dependency script to ensure all needed dependencies are present, but these are a good start.&lt;/p&gt;

&lt;h2 id=&quot;step-3-cloning-apache-arrow-from-github&quot;&gt;Step 3: Cloning Apache Arrow from GitHub&lt;/h2&gt;

&lt;p&gt;Cloning Arrow from GitHub is pretty straightforward. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git checkout apache-arrow-0.15.0&lt;/code&gt; line is optional; I needed version 0.15.0 for the project I was exploring, but if you want to build from the master branch of Arrow, you can omit that line.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git clone https://github.com/apache/arrow.git /repos/arrow
&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; /repos/arrow
git submodule init &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; git submodule update
git checkout apache-arrow-0.15.0
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PARQUET_TEST_DATA&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PWD&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/cpp/submodules/parquet-testing/data&quot;&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ARROW_TEST_DATA&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PWD&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/testing/data&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;step-4-installing-remaining-apache-arrow-dependencies&quot;&gt;Step 4: Installing remaining Apache Arrow dependencies&lt;/h2&gt;

&lt;p&gt;As mentioned in Step 2, some of the dependencies for building Arrow are system-level and can be installed via apt. To ensure that we have all the remaining third-party dependencies, we can use the provided script in the Arrow repository:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip3 &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;virtualenv
virtualenv pyarrow
&lt;span class=&quot;nb&quot;&gt;source&lt;/span&gt; ./pyarrow/bin/activate
pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;six numpy pandas cython pytest hypothesis
&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;dist
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ARROW_HOME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;pwd&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;/dist
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;pwd&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;/dist/lib:&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;cpp
./thirdparty/download_dependencies.sh &lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/arrow-thirdparty
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The script downloads all of the necessary libraries and sets environment variables that are picked up later, which is amazingly helpful.&lt;/p&gt;

&lt;h2 id=&quot;step-5-building-apache-arrow-c-library&quot;&gt;Step 5: Building Apache Arrow C++ library&lt;/h2&gt;

&lt;p&gt;pyarrow links to the Arrow C++ library, so the library needs to be built and installed before we can build the pyarrow wheel:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;build &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;build

cmake &lt;span class=&quot;nt&quot;&gt;-DCMAKE_INSTALL_PREFIX&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$ARROW_HOME&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DCMAKE_INSTALL_LIBDIR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;lib &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_FLIGHT&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_GANDIVA&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_ORC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_WITH_BZ2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_WITH_ZLIB&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_WITH_ZSTD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_WITH_LZ4&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_WITH_SNAPPY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_WITH_BROTLI&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_PARQUET&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_PYTHON&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_PLASMA&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_BUILD_TESTS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DARROW_CUDA&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
..

make &lt;span class=&quot;nt&quot;&gt;-j&lt;/span&gt;
make &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is a pretty standard workflow for building a C or C++ library. We create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build&lt;/code&gt; directory, call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cmake&lt;/code&gt; from inside that directory to set up the options we want to use, then run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make install&lt;/code&gt; to compile and install the library, respectively. I copied all of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-DARROW_*&lt;/code&gt; options above straight from the Arrow documentation; Arrow doesn’t take long to build with these options, but it’s possible that only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-DARROW_PYTHON=ON&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-DARROW_CUDA=ON&lt;/code&gt; are truly necessary to build pyarrow.&lt;/p&gt;

&lt;h2 id=&quot;step-6-building-pyarrow-wheel&quot;&gt;Step 6: Building pyarrow wheel&lt;/h2&gt;

&lt;p&gt;With the Apache Arrow C++ library built and installed, we can now build the Python wheel:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; /repos/arrow/python
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PYARROW_WITH_PARQUET&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PYARROW_WITH_CUDA&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1
python setup.py build_ext &lt;span class=&quot;nt&quot;&gt;--build-type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;release &lt;span class=&quot;nt&quot;&gt;--bundle-arrow-cpp&lt;/span&gt; bdist_wheel
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

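&lt;p&gt;As cmake and make run, you’ll eventually see the following in the build logs, which shows that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_cuda&lt;/code&gt; extension module is being compiled and linked, which is the behavior we want:&lt;/p&gt;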

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cmake &lt;span class=&quot;nt&quot;&gt;--build&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--config&lt;/span&gt; release &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;  5%] Compiling Cython CXX &lt;span class=&quot;nb&quot;&gt;source &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;_cuda...
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;  5%] Built target _cuda_pyx
Scanning dependencies of target _cuda
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; 11%] Building CXX object CMakeFiles/_cuda.dir/_cuda.cpp.o
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; 16%] Linking CXX shared module release/_cuda.cpython-36m-x86_64-linux-gnu.so
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; 16%] Built target _cuda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When the process finishes, the final wheel will be in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/repos/arrow/python/dist&lt;/code&gt; directory.&lt;/p&gt;
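
&lt;p&gt;Because the wheel’s file name embeds the version, git hash and build date, it can be handy to locate it programmatically before installing it. The following is a minimal sketch rather than anything from the Arrow tooling: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;find_wheels&lt;/code&gt; is a hypothetical helper name, and it assumes the wheel’s file name starts with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyarrow-&lt;/code&gt;.&lt;/p&gt;

```python
from pathlib import Path


def find_wheels(dist_dir):
    """Return the pyarrow wheel files in dist_dir, newest first.

    ASSUMPTION: wheels are named pyarrow-*.whl; find_wheels is a
    hypothetical helper, not part of Arrow or pip.
    """
    wheels = Path(dist_dir).glob("pyarrow-*.whl")
    # Sort by modification time so the most recent build comes first.
    return sorted(wheels, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # Directory used in this post; prints nothing if it does not exist.
    for wheel in find_wheels("/repos/arrow/python/dist"):
        print(wheel)
```

&lt;p&gt;The first entry of the returned list is the most recently built wheel, which is the one you would normally pass to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip install&lt;/code&gt;.&lt;/p&gt;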

&lt;h2 id=&quot;step-7-optional-validate-build&quot;&gt;Step 7 (optional): Validate build&lt;/h2&gt;

&lt;p&gt;If you want to validate that your pyarrow wheel was built with CUDA support, you can install it and try importing the CUDA module:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;pyarrow&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; root@9260485caca3:/repos/arrow/python/dist# pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;pyarrow-0.15.1.dev0+g40d468e16.d20200402-cp36-cp36m-linux_x86_64.whl
Processing ./pyarrow-0.15.1.dev0+g40d468e16.d20200402-cp36-cp36m-linux_x86_64.whl
Requirement already satisfied: six&amp;gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1.0.0 &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; /repos/arrow/pyarrow/lib/python3.6/site-packages &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;from &lt;span class=&quot;nv&quot;&gt;pyarrow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;0.15.1.dev0+g40d468e16.d20200402&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1.14.0&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Requirement already satisfied: numpy&amp;gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1.14 &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; /repos/arrow/pyarrow/lib/python3.6/site-packages &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;from &lt;span class=&quot;nv&quot;&gt;pyarrow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;0.15.1.dev0+g40d468e16.d20200402&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1.18.2&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Installing collected packages: pyarrow
Successfully installed pyarrow-0.15.1.dev0+g40d468e16.d20200402
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;pyarrow&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; root@9260485caca3:/repos/arrow/python/dist# python
Python 3.6.9 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;default, Nov  7 2019, 10:44:02&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;GCC 8.3.0] on linux
Type &lt;span class=&quot;s2&quot;&gt;&quot;help&quot;&lt;/span&gt;, &lt;span class=&quot;s2&quot;&gt;&quot;copyright&quot;&lt;/span&gt;, &lt;span class=&quot;s2&quot;&gt;&quot;credits&quot;&lt;/span&gt; or &lt;span class=&quot;s2&quot;&gt;&quot;license&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;more information.
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; from pyarrow import cuda
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
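
&lt;p&gt;The same import check can also be scripted, for example as a quick smoke test after installing the wheel. This is only a sketch under my own assumptions: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;has_module&lt;/code&gt; is a hypothetical helper name, it uses nothing beyond &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;importlib&lt;/code&gt; from the standard library, and it assumes a pyarrow build without CUDA fails the import with an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ImportError&lt;/code&gt;.&lt;/p&gt;

```python
import importlib


def has_module(name):
    """Return True if the named module imports cleanly, False otherwise.

    ASSUMPTION: has_module is a hypothetical helper for illustration;
    a missing or non-CUDA pyarrow is expected to raise ImportError.
    """
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Equivalent in spirit to typing "from pyarrow import cuda" at the REPL.
    print("pyarrow.cuda available:", has_module("pyarrow.cuda"))
```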

&lt;p&gt;If the line &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;from pyarrow import cuda&lt;/code&gt; runs without error, we know that our pyarrow build with CUDA support was successful.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>A Beginner's Look at BenchmarkTools.jl</title>
        
          <description>&lt;p&gt;In all the years I’ve been programming in Julia, I’ve never really been concerned with performance. Which is to say, I’ve appreciated that &lt;em&gt;other people&lt;/em&gt; are interested in performance and have proven that Julia can be as fast as any other high-performance language out there. But I’ve never been one to pore over the &lt;a href=&quot;https://docs.julialang.org/en/v1/manual/performance-tips/&quot;&gt;Performance Tips&lt;/a&gt; section of the Julia manual trying to squeeze out every last bit of performance.&lt;/p&gt;

</description>
        
        <pubDate>Mon, 16 Dec 2019 00:00:00 +0000</pubDate>
        <link>http://randyzwitch.com/benchmarktools-julia-benchmarking/</link>
        <guid isPermaLink="true">http://randyzwitch.com/benchmarktools-julia-benchmarking/</guid>
        <content type="html" xml:base="/benchmarktools-julia-benchmarking/">&lt;p&gt;In all the years I’ve been programming in Julia, I’ve never really been concerned with performance. Which is to say, I’ve appreciated that &lt;em&gt;other people&lt;/em&gt; are interested in performance and have proven that Julia can be as fast as any other high-performance language out there. But I’ve never been one to pore over the &lt;a href=&quot;https://docs.julialang.org/en/v1/manual/performance-tips/&quot;&gt;Performance Tips&lt;/a&gt; section of the Julia manual trying to squeeze out every last bit of performance.&lt;/p&gt;

&lt;p&gt;But now that I’ve released &lt;a href=&quot;https://www.omnisci.com/blog/announcing-omnisci.jl-a-julia-client-for-omnisci&quot;&gt;OmniSci.jl&lt;/a&gt;, and given that one of our company’s major selling points is &lt;a href=&quot;https://www.omnisci.com/platform&quot;&gt;accelerated analytics&lt;/a&gt;, I figured it was time to stop assuming I wrote decent-ish code and really pay attention to performance. This post highlights my experience as a beginner, and hopefully will show how others can get started in learning to optimize their Julia code.&lt;/p&gt;

&lt;h2 id=&quot;read-the-manuals&quot;&gt;Read The Manuals!&lt;/h2&gt;

&lt;p&gt;As I mentioned above, I’ve written Julia for many years now, and in that time I’ve grown up with many of the tips in the performance tips section of the documentation. Things like &lt;a href=&quot;https://docs.julialang.org/en/v1/manual/performance-tips/#Write-%22type-stable%22-functions-1&quot;&gt;“write type stable functions”&lt;/a&gt; and &lt;a href=&quot;https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables-1&quot;&gt;“avoid global variables”&lt;/a&gt; are things that I’ve internalized as good programming practices, as opposed to doing them just because they are performant. But with this long familiarity with the language comes laziness, and by not reading the BenchmarkTools.jl documentation, I started off benchmarking incorrectly. Consider this example:&lt;/p&gt;

&lt;div class=&quot;language-julia highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Random&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OmniSci&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BenchmarkTools&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Base&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Threads&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#change defaults, since examples long-running&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;BenchmarkTools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DEFAULT_PARAMETERS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seconds&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BenchmarkTools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DEFAULT_PARAMETERS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;samples&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#generate test data&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;gendata&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;typemin&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;typemax&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;gendata&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generic&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; with&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;method&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;int64_10x6&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gendata&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;^&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int64&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#Test whether broadcasting more/less efficient than pre-allocating results array&lt;/span&gt;
       &lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; preallocate&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

           &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Vector&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OmniSci&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TStringValue&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;undef&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;))&lt;/span&gt;

           &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
               &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OmniSci&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TStringValue&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;])&lt;/span&gt;
           &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

           &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;
       &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;preallocate&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generic&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; with&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;method&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nd&quot;&gt;@benchmark&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v61&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OmniSci&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TStringValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int64_10x6&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BenchmarkTools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Trial&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;memory&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;estimate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;297.55&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MiB&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;allocs&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;estimate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;6000005&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;--------------&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;minimum&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;mf&quot;&gt;750.146&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ms&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.00&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;median&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;      &lt;span class=&quot;mf&quot;&gt;1.014&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;29.38&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;        &lt;span class=&quot;mf&quot;&gt;1.151&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;28.38&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;maximum&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;mf&quot;&gt;1.794&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;43.06&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;--------------&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;samples&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;          &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;evals&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nd&quot;&gt;@benchmark&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v62&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;preallocate&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int64_10x6&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BenchmarkTools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Trial&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;memory&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;estimate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;297.55&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MiB&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;allocs&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;estimate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;6000002&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;--------------&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;minimum&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;mf&quot;&gt;753.877&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ms&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.00&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;median&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;      &lt;span class=&quot;mf&quot;&gt;1.021&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;28.30&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;        &lt;span class=&quot;mf&quot;&gt;1.158&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;28.10&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;maximum&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;mf&quot;&gt;1.806&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;43.17&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;--------------&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;samples&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;          &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;evals&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The benchmark above tests whether it’s worth &lt;a href=&quot;https://docs.julialang.org/en/v1/manual/performance-tips/#Pre-allocating-outputs-1&quot;&gt;pre-allocating the results array&lt;/a&gt; vs. using the more convenient &lt;a href=&quot;https://docs.julialang.org/en/v1/manual/functions/#man-vectorized-1&quot;&gt;dot broadcasting syntax&lt;/a&gt;. The idea here is that growing an array over and over can be inefficient when you know the result size at the outset. Yet across every statistic in the output above, pre-allocating the array is &lt;em&gt;slightly worse&lt;/em&gt;, even though we’re passing the compiler more knowledge up front. This didn’t sit well with me, so I consulted the BenchmarkTools.jl manual and found the following about &lt;a href=&quot;https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#interpolating-values-into-benchmark-expressions&quot;&gt;variable interpolation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A good rule of thumb is that &lt;strong&gt;external variables should be explicitly interpolated into the benchmark expression&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Interpolating the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int64_10x6&lt;/code&gt; input array into the function takes it from being a global variable to a local, and sure enough, we see roughly a &lt;strong&gt;6% improvement&lt;/strong&gt; in the minimum time when we pre-allocate the array:&lt;/p&gt;

&lt;div class=&quot;language-julia highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nd&quot;&gt;@benchmark&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v61i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OmniSci&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TStringValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int64_10x6&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BenchmarkTools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Trial&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;memory&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;estimate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;297.55&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MiB&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;allocs&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;estimate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;6000002&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;--------------&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;minimum&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;mf&quot;&gt;763.817&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ms&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.00&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;median&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;      &lt;span class=&quot;mf&quot;&gt;960.446&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ms&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;24.02&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;        &lt;span class=&quot;mf&quot;&gt;1.178&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;28.68&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;maximum&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;mf&quot;&gt;1.886&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;45.11&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;--------------&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;samples&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;          &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;evals&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nd&quot;&gt;@benchmark&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v62i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;preallocate&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int64_10x6&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BenchmarkTools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Trial&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;memory&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;estimate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;297.55&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MiB&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;allocs&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;estimate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;6000002&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;--------------&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;minimum&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;mf&quot;&gt;721.597&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ms&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.00&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;median&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;      &lt;span class=&quot;mf&quot;&gt;1.072&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;30.45&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;        &lt;span class=&quot;mf&quot;&gt;1.234&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;32.92&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;maximum&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;mf&quot;&gt;1.769&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;44.51&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;--------------&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;samples&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;          &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;evals&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Whether or not that 6% improvement holds up over repeated runs (the median and mean times above still favor broadcasting), at least on minimum time we’re no longer worse off for pre-allocating, which fits my mental model of how things should work.&lt;/p&gt;
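
&lt;p&gt;As a self-contained illustration of the interpolation rule, here is a minimal sketch that times the same call with and without &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$&lt;/code&gt;. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;to_strings&lt;/code&gt; function and the small &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x&lt;/code&gt; array are hypothetical stand-ins for the functions and data benchmarked above, not code from the library, and the gap between the two timings may well be within noise on your machine:&lt;/p&gt;

&lt;div class=&quot;language-julia highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;using BenchmarkTools

# Hypothetical input: smaller than the post's int64_10x6 so the sketch runs quickly
x = rand(Int64, 10^4)

# Hypothetical stand-in for the function being benchmarked
function to_strings(v::Vector{Int64})
    out = Vector{String}(undef, length(v))
    for i in eachindex(v)
        out[i] = string(v[i])
    end
    return out
end

# x is looked up as a non-constant global inside the benchmark expression
t_global = @belapsed to_strings(x)

# $x splices the value in, so the benchmark treats it like a local variable
t_interp = @belapsed to_strings($x)

# @belapsed reports the minimum time in seconds
println((global_s = t_global, interpolated_s = t_interp))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;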

&lt;h2 id=&quot;evaluate-your-benchmark-over-the-range-of-inputs-you-care-about&quot;&gt;Evaluate Your Benchmark Over the Range of Inputs You Care About&lt;/h2&gt;

&lt;p&gt;In the comparison above, I evaluate the benchmark over &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10^6&lt;/code&gt; observations. How did I choose 1 million as the “right” number of events to test, instead of just testing 1 or 10 events? My general goal for benchmarking this code is to speed up the methods of loading data into an OmniSciDB database. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TStringValue&lt;/code&gt; is one of the internal methods used during a row-wise table load, converting whatever data is present in an array or DataFrame from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;::Type{T}&lt;/code&gt; into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;String&lt;/code&gt; (think iterating over a text file by line). Since users trying to accelerate their database operations are probably working with millions to billions of data points, I’m interested in understanding how the functions perform at these volumes of data.&lt;/p&gt;

&lt;p&gt;The other conscious decision I made was the environment to test on. I could test this on massive CPU- and GPU-enabled servers, but I’m testing this on my Dell XPS 15 laptop. Why? Because I’m actually interested in how things perform under real-world conditions for a realistic user. Testing the performance characteristics of a high-end server with tons of memory and cores would be fun, but I want to make sure any performance improvements are broadly applicable, rather than just the result of throwing more hardware at the problem.&lt;/p&gt;

&lt;p&gt;I was less concerned with controlling for garbage collection, using a fresh session before each measurement, or other “best case scenario” optimizations. I would expect my users to be more analytics- and data-science-focused, so re-using the same session is going to be common. If the performance improvements aren’t completely obvious, I’m not going to incorporate them into the codebase.&lt;/p&gt;
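
&lt;p&gt;One way to check a range of input sizes rather than a single one is to sweep over them in a loop. This is a minimal sketch, with plain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;string&lt;/code&gt; broadcasting standing in for the library call and the sizes kept small so it finishes quickly; the sizes and variable names are assumptions for illustration:&lt;/p&gt;

&lt;div class=&quot;language-julia highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;using BenchmarkTools

# Hypothetical sizes; extend toward 10^6 and beyond as time and memory allow
sizes = (10^2, 10^3, 10^4)
ns_per_element = Float64[]

for n in sizes
    v = rand(Int64, n)
    # $v interpolates the local array into the benchmark expression
    t = @belapsed string.($v)
    # Convert the minimum time (seconds) into nanoseconds per element
    push!(ns_per_element, t / n * 1e9)
end

println(collect(zip(sizes, ns_per_element)))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If the per-element cost stays roughly flat as the size grows, a result measured at one size is more likely to carry over to the others.&lt;/p&gt;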

&lt;h2 id=&quot;case-study-speeding-up-tstringvalue&quot;&gt;Case Study: Speeding Up &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TStringValue&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;For my test, I benchmark the following methods:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;broadcasting: current library default&lt;/li&gt;
  &lt;li&gt;pre-allocating result array&lt;/li&gt;
  &lt;li&gt;pre-allocated result array with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@inbounds&lt;/code&gt; macro&lt;/li&gt;
  &lt;li&gt;pre-allocated result array with threads&lt;/li&gt;
  &lt;li&gt;pre-allocated result array with threads and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@inbounds&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
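
&lt;p&gt;The library code behind each method isn’t reproduced here, so the following is only a sketch of what the five variants could look like. The function names and the per-element &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;convert_one&lt;/code&gt; (a plain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;string&lt;/code&gt; call standing in for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TStringValue&lt;/code&gt; constructor) are assumptions for illustration:&lt;/p&gt;

&lt;div class=&quot;language-julia highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical per-element conversion standing in for OmniSci.TStringValue
convert_one(x) = string(x)

# 1. Broadcasting: the output array is allocated by the broadcast machinery
broadcasted(v::Vector) = convert_one.(v)

# 2. Pre-allocate the result array, then fill it in a plain loop
function preallocated(v::Vector)
    out = Vector{String}(undef, length(v))
    for i in eachindex(v)
        out[i] = convert_one(v[i])
    end
    return out
end

# 3. Same loop, with bounds checks elided by @inbounds
function preallocated_inbounds(v::Vector)
    out = Vector{String}(undef, length(v))
    @inbounds for i in eachindex(v)
        out[i] = convert_one(v[i])
    end
    return out
end

# 4. Pre-allocated result, iterations split across the available threads
function threaded(v::Vector)
    out = Vector{String}(undef, length(v))
    Threads.@threads for i in eachindex(v)
        out[i] = convert_one(v[i])
    end
    return out
end

# 5. Threads plus @inbounds on the element access
function threaded_inbounds(v::Vector)
    out = Vector{String}(undef, length(v))
    Threads.@threads for i in eachindex(v)
        @inbounds out[i] = convert_one(v[i])
    end
    return out
end
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The threaded versions only run in parallel when Julia is started with more than one thread, for example by setting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JULIA_NUM_THREADS&lt;/code&gt; environment variable before launching.&lt;/p&gt;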

&lt;h3 id=&quot;10x6-observations&quot;&gt;10^6 observations&lt;/h3&gt;

&lt;div id=&quot;ts_106&quot; style=&quot;height:400px;width:950px;&quot;&gt;&lt;/div&gt;
&lt;script type=&quot;text/javascript&quot;&gt;

    // Initialize after dom ready
    var myChart = echarts.init(document.getElementById(&quot;ts_106&quot;));

    // Load data into the ECharts instance
    myChart.setOption(
{&quot;xAxis&quot;:[{&quot;splitNumber&quot;:5,&quot;axisLabel&quot;:{&quot;show&quot;:true,&quot;interval&quot;:&quot;auto&quot;,&quot;rotate&quot;:0,&quot;inside&quot;:false,&quot;formatter&quot;:&quot;{value}&quot;,&quot;margin&quot;:8},&quot;data&quot;:[&quot;broadcast&quot;,&quot;pre-allocate&quot;,&quot;pre-allocate/inbounds&quot;,&quot;threads&quot;,&quot;threads/inbounds&quot;],&quot;scale&quot;:false,&quot;gridIndex&quot;:0,&quot;minInterval&quot;:0,&quot;zlevel&quot;:0,&quot;triggerEvent&quot;:false,&quot;z&quot;:0,&quot;inverse&quot;:false,&quot;nameLocation&quot;:&quot;middle&quot;,&quot;nameGap&quot;:30,&quot;silent&quot;:true,&quot;type&quot;:&quot;category&quot;}],&quot;ec_charttype&quot;:&quot;xy plot&quot;,&quot;series&quot;:[{&quot;name&quot;:&quot;Min&quot;,&quot;yAxisIndex&quot;:0,&quot;xAxisIndex&quot;:0,&quot;smooth&quot;:false,&quot;data&quot;:[752.568,748.719,738.013,249.117,241.585],&quot;markLine&quot;:{&quot;data&quot;:[],&quot;lineStyle&quot;:{}},&quot;type&quot;:&quot;bar&quot;},{&quot;name&quot;:&quot;Median&quot;,&quot;yAxisIndex&quot;:0,&quot;xAxisIndex&quot;:0,&quot;smooth&quot;:false,&quot;data&quot;:[990.071,988.012,967.184,253.161,246.792],&quot;markLine&quot;:{&quot;data&quot;:[],&quot;lineStyle&quot;:{}},&quot;type&quot;:&quot;bar&quot;}],&quot;theme&quot;:{&quot;geo&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#000000&quot;}},&quot;emphasis&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;rgb(100,0,0)&quot;}}},&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderColor&quot;:&quot;#444444&quot;,&quot;borderWidth&quot;:0.5,&quot;areaColor&quot;:&quot;#eeeeee&quot;},&quot;emphasis&quot;:{&quot;borderColor&quot;:&quot;#444444&quot;,&quot;borderWidth&quot;:1,&quot;areaColor&quot;:&quot;rgba(255,215,0,0.8)&quot;}}},&quot;parallel&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;border
Width&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;markPoint&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#eeeeee&quot;}},&quot;emphasis&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#eeeeee&quot;}}}},&quot;visualMap&quot;:{&quot;color&quot;:[&quot;#e01f54&quot;,&quot;#e7dbc3&quot;]},&quot;funnel&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;bar&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;barBorderColor&quot;:&quot;#ccc&quot;,&quot;barBorderWidth&quot;:0},&quot;emphasis&quot;:{&quot;barBorderColor&quot;:&quot;#ccc&quot;,&quot;barBorderWidth&quot;:0}}},&quot;map&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#000000&quot;}},&quot;emphasis&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;rgb(100,0,0)&quot;}}},&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderColor&quot;:&quot;#444444&quot;,&quot;borderWidth&quot;:0.5,&quot;areaColor&quot;:&quot;#eeeeee&quot;},&quot;emphasis&quot;:{&quot;borderColor&quot;:&quot;#444444&quot;,&quot;borderWidth&quot;:1,&quot;areaColor&quot;:&quot;rgba(255,215,0,0.8)&quot;}}},&quot;scatter&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;pie&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;graph&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#eeeeee&quot;}}},&quot;symbolSize&quot;:4,&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderCo
lor&quot;:&quot;#ccc&quot;}},&quot;smooth&quot;:false,&quot;symbol&quot;:&quot;emptyCircle&quot;,&quot;color&quot;:[&quot;#e01f54&quot;,&quot;#001852&quot;,&quot;#f5e8c8&quot;,&quot;#b8d2c7&quot;,&quot;#c6b38e&quot;,&quot;#a4d8c2&quot;,&quot;#f3d999&quot;,&quot;#d3758f&quot;,&quot;#dcc392&quot;,&quot;#2e4783&quot;,&quot;#82b6e9&quot;,&quot;#ff6347&quot;,&quot;#a092f1&quot;,&quot;#0a915d&quot;,&quot;#eaf889&quot;,&quot;#6699FF&quot;,&quot;#ff6666&quot;,&quot;#3cb371&quot;,&quot;#d5b158&quot;,&quot;#38b6b6&quot;],&quot;lineStyle&quot;:{&quot;normal&quot;:{&quot;color&quot;:&quot;#aaaaaa&quot;,&quot;width&quot;:1}}},&quot;backgroundColor&quot;:&quot;rgba(0,0,0,0)&quot;,&quot;line&quot;:{&quot;symbolSize&quot;:4,&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:1}},&quot;smooth&quot;:false,&quot;symbol&quot;:&quot;emptyCircle&quot;,&quot;lineStyle&quot;:{&quot;normal&quot;:{&quot;width&quot;:2}}},&quot;candlestick&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderColor0&quot;:&quot;#b8d2c7&quot;,&quot;color&quot;:&quot;#e01f54&quot;,&quot;borderColor&quot;:&quot;#f5e8c8&quot;,&quot;borderWidth&quot;:1,&quot;color0&quot;:&quot;#001852&quot;}}},&quot;sankey&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;valueAxis&quot;:{&quot;axisLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}},&quot;axisLabel&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333&quot;},&quot;show&quot;:true},&quot;splitLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:[&quot;#ccc&quot;]}},&quot;splitArea&quot;:{&quot;areaStyle&quot;:{&quot;color&quot;:[&quot;rgba(250,250,250,0.3)&quot;,&quot;rgba(200,200,200,0.3)&quot;]},&quot;show&quot;:false},&quot;axisTick&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#3
33&quot;}}},&quot;toolbox&quot;:{&quot;iconStyle&quot;:{&quot;normal&quot;:{&quot;borderColor&quot;:&quot;#999999&quot;},&quot;emphasis&quot;:{&quot;borderColor&quot;:&quot;#666666&quot;}}},&quot;categoryAxis&quot;:{&quot;axisLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}},&quot;axisLabel&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333&quot;},&quot;show&quot;:true},&quot;splitLine&quot;:{&quot;show&quot;:false,&quot;lineStyle&quot;:{&quot;color&quot;:[&quot;#ccc&quot;]}},&quot;splitArea&quot;:{&quot;areaStyle&quot;:{&quot;color&quot;:[&quot;rgba(250,250,250,0.3)&quot;,&quot;rgba(200,200,200,0.3)&quot;]},&quot;show&quot;:false},&quot;axisTick&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}}},&quot;tooltip&quot;:{&quot;axisPointer&quot;:{&quot;crossStyle&quot;:{&quot;color&quot;:&quot;#cccccc&quot;,&quot;width&quot;:1},&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#cccccc&quot;,&quot;width&quot;:1}}},&quot;timeline&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#293c55&quot;}},&quot;emphasis&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#293c55&quot;}}},&quot;controlStyle&quot;:{&quot;normal&quot;:{&quot;color&quot;:&quot;#293c55&quot;,&quot;borderColor&quot;:&quot;#293c55&quot;,&quot;borderWidth&quot;:0.5},&quot;emphasis&quot;:{&quot;color&quot;:&quot;#293c55&quot;,&quot;borderColor&quot;:&quot;#293c55&quot;,&quot;borderWidth&quot;:0.5}},&quot;checkpointStyle&quot;:{&quot;color&quot;:&quot;#e43c59&quot;,&quot;borderColor&quot;:&quot;rgba(194,53,49,0.5)&quot;},&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;color&quot;:&quot;#293c55&quot;,&quot;borderWidth&quot;:1},&quot;emphasis&quot;:{&quot;color&quot;:&quot;#a9334c&quot;}},&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#293c55&quot;,&quot;width&quot;:1}},&quot;radar&quot;:{&quot;symbolSize&quot;:4,&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:1
}},&quot;smooth&quot;:false,&quot;symbol&quot;:&quot;emptyCircle&quot;,&quot;lineStyle&quot;:{&quot;normal&quot;:{&quot;width&quot;:2}}},&quot;logAxis&quot;:{&quot;axisLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}},&quot;axisLabel&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333&quot;},&quot;show&quot;:true},&quot;splitLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:[&quot;#ccc&quot;]}},&quot;splitArea&quot;:{&quot;areaStyle&quot;:{&quot;color&quot;:[&quot;rgba(250,250,250,0.3)&quot;,&quot;rgba(200,200,200,0.3)&quot;]},&quot;show&quot;:false},&quot;axisTick&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}}},&quot;textStyle&quot;:{},&quot;gauge&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;boxplot&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;color&quot;:[&quot;#e01f54&quot;,&quot;#001852&quot;,&quot;#f5e8c8&quot;,&quot;#b8d2c7&quot;,&quot;#c6b38e&quot;,&quot;#a4d8c2&quot;,&quot;#f3d999&quot;,&quot;#d3758f&quot;,&quot;#dcc392&quot;,&quot;#2e4783&quot;,&quot;#82b6e9&quot;,&quot;#ff6347&quot;,&quot;#a092f1&quot;,&quot;#0a915d&quot;,&quot;#eaf889&quot;,&quot;#6699FF&quot;,&quot;#ff6666&quot;,&quot;#3cb371&quot;,&quot;#d5b158&quot;,&quot;#38b6b6&quot;],&quot;title&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333333&quot;},&quot;subtextStyle&quot;:{&quot;color&quot;:&quot;#aaaaaa&quot;}},&quot;dataZoom&quot;:{&quot;dataBackgroundColor&quot;:&quot;rgba(47,69,84,0.3)&quot;,&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333333&quot;},&quot;handleSize&quot;:&quot;100%&quot;,&quot;handleColor&quot;:&quot;#a7b7cc
&quot;,&quot;fillerColor&quot;:&quot;rgba(167,183,204,0.4)&quot;,&quot;backgroundColor&quot;:&quot;rgba(47,69,84,0)&quot;},&quot;timeAxis&quot;:{&quot;axisLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}},&quot;axisLabel&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333&quot;},&quot;show&quot;:true},&quot;splitLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:[&quot;#ccc&quot;]}},&quot;splitArea&quot;:{&quot;areaStyle&quot;:{&quot;color&quot;:[&quot;rgba(250,250,250,0.3)&quot;,&quot;rgba(200,200,200,0.3)&quot;]},&quot;show&quot;:false},&quot;axisTick&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}}},&quot;legend&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333333&quot;}}},&quot;yAxis&quot;:[{&quot;splitNumber&quot;:5,&quot;axisLabel&quot;:{&quot;show&quot;:true,&quot;interval&quot;:&quot;auto&quot;,&quot;rotate&quot;:0,&quot;inside&quot;:false,&quot;formatter&quot;:&quot;{value}&quot;,&quot;margin&quot;:8},&quot;scale&quot;:false,&quot;gridIndex&quot;:0,&quot;minInterval&quot;:0,&quot;zlevel&quot;:0,&quot;triggerEvent&quot;:false,&quot;z&quot;:0,&quot;inverse&quot;:false,&quot;nameLocation&quot;:&quot;middle&quot;,&quot;nameGap&quot;:50,&quot;silent&quot;:true,&quot;type&quot;:&quot;value&quot;}],&quot;toolbox&quot;:{&quot;feature&quot;:{},&quot;orient&quot;:&quot;vertical&quot;,&quot;itemSize&quot;:15,&quot;height&quot;:&quot;auto&quot;,&quot;zlevel&quot;:0,&quot;z&quot;:2,&quot;itemGap&quot;:20,&quot;right&quot;:&quot;auto&quot;,&quot;top&quot;:&quot;center&quot;,&quot;width&quot;:&quot;auto&quot;,&quot;show&quot;:false,&quot;showTitle&quot;:true},&quot;ec_width&quot;:1000,&quot;ec_height&quot;:500,&quot;tooltip&quot;:{&quot;triggerOn&quot;:&quot;mousemove&quot;,&quot;enterable&quot;:true,&quot;borderColor&quot;:&quot;#333&quot;,&quot;transitionDuration&quot;:0.4,&quot;hideDelay&quot;:100,&quot;padding&quot;:5,&quot;showDelay&quot;:0,&quo
t;borderWidth&quot;:0,&quot;showContent&quot;:true,&quot;backgroundColor&quot;:&quot;rgba(50,50,50,0.7)&quot;,&quot;trigger&quot;:&quot;item&quot;,&quot;alwaysShowContent&quot;:false,&quot;confine&quot;:false,&quot;show&quot;:true},&quot;grid&quot;:[{&quot;height&quot;:&quot;auto&quot;,&quot;show&quot;:false,&quot;width&quot;:&quot;auto&quot;,&quot;backgroundColor&quot;:&quot;transparent&quot;}],&quot;aria&quot;:{&quot;show&quot;:true},&quot;title&quot;:[{&quot;left&quot;:&quot;left&quot;,&quot;borderColor&quot;:&quot;transparent&quot;,&quot;bottom&quot;:&quot;auto&quot;,&quot;padding&quot;:5,&quot;zlevel&quot;:0,&quot;borderWidth&quot;:1,&quot;target&quot;:&quot;blank&quot;,&quot;z&quot;:2,&quot;itemGap&quot;:5,&quot;shadowOffsetY&quot;:0,&quot;shadowOffsetX&quot;:0,&quot;right&quot;:&quot;auto&quot;,&quot;top&quot;:&quot;auto&quot;,&quot;subtarget&quot;:&quot;blank&quot;,&quot;show&quot;:true}],&quot;ec_renderer&quot;:&quot;canvas&quot;,&quot;legend&quot;:{&quot;itemWidth&quot;:25,&quot;data&quot;:[&quot;Min&quot;,&quot;Median&quot;],&quot;borderColor&quot;:&quot;transparent&quot;,&quot;orient&quot;:&quot;horizontal&quot;,&quot;bottom&quot;:&quot;auto&quot;,&quot;height&quot;:&quot;auto&quot;,&quot;zlevel&quot;:0,&quot;padding&quot;:5,&quot;borderWidth&quot;:1,&quot;inactiveColor&quot;:&quot;#ccc&quot;,&quot;z&quot;:2,&quot;align&quot;:&quot;auto&quot;,&quot;itemGap&quot;:10,&quot;itemHeight&quot;:14,&quot;backgroundColor&quot;:&quot;transparent&quot;,&quot;shadowOffsetY&quot;:0,&quot;shadowOffsetX&quot;:0,&quot;right&quot;:&quot;auto&quot;,&quot;top&quot;:&quot;auto&quot;,&quot;width&quot;:&quot;auto&quot;,&quot;selectedMode&quot;:true,&quot;show&quot;:true}} );
&lt;/script&gt;

&lt;p&gt;The first three groups of bars on the left compare the single-threaded methods. You can see that pre-allocating the output array is marginally faster than broadcasting, and using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@inbounds&lt;/code&gt; macro is incrementally faster still, but neither method provides enough speedup to be worth implementing. The difference between the red (minimum) and blue (median) bars reflects garbage collection occurring, but again, the three methods aren’t different enough to notice anything interesting.&lt;/p&gt;

&lt;p&gt;For the multi-threaded tests, I’m using 6 threads (one per physical core), and we’re seeing roughly a &lt;strong&gt;3x speedup&lt;/strong&gt;. Like the single-threaded tests above, using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@inbounds&lt;/code&gt; is only marginally faster, not enough to justify the added code complexity. Interestingly, these multi-threaded benchmarks didn’t trigger garbage collection &lt;em&gt;at all&lt;/em&gt; across my five iterations; I’m not sure whether this is specific to threading, but it’s something to explore outside of this blog post.&lt;/p&gt;
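
&lt;p&gt;To put a number on “roughly 3x”, the speedups can be computed directly from the minimum and median values plotted in the chart above (assumed here to be in milliseconds). The variable names are just for illustration:&lt;/p&gt;

&lt;div class=&quot;language-julia highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Values copied from the chart above, assumed to be milliseconds
methods_tested = [:broadcast, :preallocate, :preallocate_inbounds, :threads, :threads_inbounds]
min_ms    = [752.568, 748.719, 738.013, 249.117, 241.585]
median_ms = [990.071, 988.012, 967.184, 253.161, 246.792]

# Speedup of each method relative to broadcasting (the first entry)
min_speedup    = min_ms[1] ./ min_ms
median_speedup = median_ms[1] ./ median_ms

for (m, a, b) in zip(methods_tested, min_speedup, median_speedup)
    println(m, '\t', round(a, digits = 2), '\t', round(b, digits = 2))
end
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On minimum time the threaded version comes out about 3.0x faster than broadcasting. On median time the ratio is closer to 3.9x, since the single-threaded medians include the garbage collection overhead noted above.&lt;/p&gt;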

&lt;h3 id=&quot;10x7-observations&quot;&gt;10x7 observations&lt;/h3&gt;

&lt;p&gt;To see how these calculation methods might change at a larger scale, I increased the number of observations by an order of magnitude and saw the following results:&lt;/p&gt;

&lt;div id=&quot;ts_108&quot; style=&quot;height:400px;width:950px;&quot;&gt;&lt;/div&gt;
&lt;script type=&quot;text/javascript&quot;&gt;

    // Initialize after dom ready
    var myChart = echarts.init(document.getElementById(&quot;ts_108&quot;));

    // Load data into the ECharts instance
    myChart.setOption(
{&quot;xAxis&quot;:[{&quot;splitNumber&quot;:5,&quot;axisLabel&quot;:{&quot;show&quot;:true,&quot;interval&quot;:&quot;auto&quot;,&quot;rotate&quot;:0,&quot;inside&quot;:false,&quot;formatter&quot;:&quot;{value}&quot;,&quot;margin&quot;:8},&quot;data&quot;:[&quot;broadcast&quot;,&quot;pre-allocate&quot;,&quot;pre-allocate/inbounds&quot;,&quot;threads&quot;,&quot;threads/inbounds&quot;],&quot;scale&quot;:false,&quot;gridIndex&quot;:0,&quot;minInterval&quot;:0,&quot;zlevel&quot;:0,&quot;triggerEvent&quot;:false,&quot;z&quot;:0,&quot;inverse&quot;:false,&quot;nameLocation&quot;:&quot;middle&quot;,&quot;nameGap&quot;:30,&quot;silent&quot;:true,&quot;type&quot;:&quot;category&quot;}],&quot;ec_charttype&quot;:&quot;xy plot&quot;,&quot;series&quot;:[{&quot;name&quot;:&quot;Min&quot;,&quot;yAxisIndex&quot;:0,&quot;xAxisIndex&quot;:0,&quot;smooth&quot;:false,&quot;data&quot;:[26.316,27.064,26.219,2.717,2.641],&quot;markLine&quot;:{&quot;data&quot;:[],&quot;lineStyle&quot;:{}},&quot;type&quot;:&quot;bar&quot;},{&quot;name&quot;:&quot;Median&quot;,&quot;yAxisIndex&quot;:0,&quot;xAxisIndex&quot;:0,&quot;smooth&quot;:false,&quot;data&quot;:[39.332,38.925,39.387,17.659,16.659],&quot;markLine&quot;:{&quot;data&quot;:[],&quot;lineStyle&quot;:{}},&quot;type&quot;:&quot;bar&quot;}],&quot;theme&quot;:{&quot;geo&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#000000&quot;}},&quot;emphasis&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;rgb(100,0,0)&quot;}}},&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderColor&quot;:&quot;#444444&quot;,&quot;borderWidth&quot;:0.5,&quot;areaColor&quot;:&quot;#eeeeee&quot;},&quot;emphasis&quot;:{&quot;borderColor&quot;:&quot;#444444&quot;,&quot;borderWidth&quot;:1,&quot;areaColor&quot;:&quot;rgba(255,215,0,0.8)&quot;}}},&quot;parallel&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:
0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;markPoint&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#eeeeee&quot;}},&quot;emphasis&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#eeeeee&quot;}}}},&quot;visualMap&quot;:{&quot;color&quot;:[&quot;#e01f54&quot;,&quot;#e7dbc3&quot;]},&quot;funnel&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;bar&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;barBorderColor&quot;:&quot;#ccc&quot;,&quot;barBorderWidth&quot;:0},&quot;emphasis&quot;:{&quot;barBorderColor&quot;:&quot;#ccc&quot;,&quot;barBorderWidth&quot;:0}}},&quot;map&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#000000&quot;}},&quot;emphasis&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;rgb(100,0,0)&quot;}}},&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderColor&quot;:&quot;#444444&quot;,&quot;borderWidth&quot;:0.5,&quot;areaColor&quot;:&quot;#eeeeee&quot;},&quot;emphasis&quot;:{&quot;borderColor&quot;:&quot;#444444&quot;,&quot;borderWidth&quot;:1,&quot;areaColor&quot;:&quot;rgba(255,215,0,0.8)&quot;}}},&quot;scatter&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;pie&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;graph&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#eeeeee&quot;}}},&quot;symbolSize&quot;:4,&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&q
uot;#ccc&quot;}},&quot;smooth&quot;:false,&quot;symbol&quot;:&quot;emptyCircle&quot;,&quot;color&quot;:[&quot;#e01f54&quot;,&quot;#001852&quot;,&quot;#f5e8c8&quot;,&quot;#b8d2c7&quot;,&quot;#c6b38e&quot;,&quot;#a4d8c2&quot;,&quot;#f3d999&quot;,&quot;#d3758f&quot;,&quot;#dcc392&quot;,&quot;#2e4783&quot;,&quot;#82b6e9&quot;,&quot;#ff6347&quot;,&quot;#a092f1&quot;,&quot;#0a915d&quot;,&quot;#eaf889&quot;,&quot;#6699FF&quot;,&quot;#ff6666&quot;,&quot;#3cb371&quot;,&quot;#d5b158&quot;,&quot;#38b6b6&quot;],&quot;lineStyle&quot;:{&quot;normal&quot;:{&quot;color&quot;:&quot;#aaaaaa&quot;,&quot;width&quot;:1}}},&quot;backgroundColor&quot;:&quot;rgba(0,0,0,0)&quot;,&quot;line&quot;:{&quot;symbolSize&quot;:4,&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:1}},&quot;smooth&quot;:false,&quot;symbol&quot;:&quot;emptyCircle&quot;,&quot;lineStyle&quot;:{&quot;normal&quot;:{&quot;width&quot;:2}}},&quot;candlestick&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderColor0&quot;:&quot;#b8d2c7&quot;,&quot;color&quot;:&quot;#e01f54&quot;,&quot;borderColor&quot;:&quot;#f5e8c8&quot;,&quot;borderWidth&quot;:1,&quot;color0&quot;:&quot;#001852&quot;}}},&quot;sankey&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;valueAxis&quot;:{&quot;axisLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}},&quot;axisLabel&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333&quot;},&quot;show&quot;:true},&quot;splitLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:[&quot;#ccc&quot;]}},&quot;splitArea&quot;:{&quot;areaStyle&quot;:{&quot;color&quot;:[&quot;rgba(250,250,250,0.3)&quot;,&quot;rgba(200,200,200,0.3)&quot;]},&quot;show&quot;:false},&quot;axisTick&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}}},
&quot;toolbox&quot;:{&quot;iconStyle&quot;:{&quot;normal&quot;:{&quot;borderColor&quot;:&quot;#999999&quot;},&quot;emphasis&quot;:{&quot;borderColor&quot;:&quot;#666666&quot;}}},&quot;categoryAxis&quot;:{&quot;axisLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}},&quot;axisLabel&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333&quot;},&quot;show&quot;:true},&quot;splitLine&quot;:{&quot;show&quot;:false,&quot;lineStyle&quot;:{&quot;color&quot;:[&quot;#ccc&quot;]}},&quot;splitArea&quot;:{&quot;areaStyle&quot;:{&quot;color&quot;:[&quot;rgba(250,250,250,0.3)&quot;,&quot;rgba(200,200,200,0.3)&quot;]},&quot;show&quot;:false},&quot;axisTick&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}}},&quot;tooltip&quot;:{&quot;axisPointer&quot;:{&quot;crossStyle&quot;:{&quot;color&quot;:&quot;#cccccc&quot;,&quot;width&quot;:1},&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#cccccc&quot;,&quot;width&quot;:1}}},&quot;timeline&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#293c55&quot;}},&quot;emphasis&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#293c55&quot;}}},&quot;controlStyle&quot;:{&quot;normal&quot;:{&quot;color&quot;:&quot;#293c55&quot;,&quot;borderColor&quot;:&quot;#293c55&quot;,&quot;borderWidth&quot;:0.5},&quot;emphasis&quot;:{&quot;color&quot;:&quot;#293c55&quot;,&quot;borderColor&quot;:&quot;#293c55&quot;,&quot;borderWidth&quot;:0.5}},&quot;checkpointStyle&quot;:{&quot;color&quot;:&quot;#e43c59&quot;,&quot;borderColor&quot;:&quot;rgba(194,53,49,0.5)&quot;},&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;color&quot;:&quot;#293c55&quot;,&quot;borderWidth&quot;:1},&quot;emphasis&quot;:{&quot;color&quot;:&quot;#a9334c&quot;}},&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#293c55&quot;,&quot;width&quot;:1}},&quot;radar&quot;:{&quot;symbolSize&quot;:4,&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:1}},&quot;smo
oth&quot;:false,&quot;symbol&quot;:&quot;emptyCircle&quot;,&quot;lineStyle&quot;:{&quot;normal&quot;:{&quot;width&quot;:2}}},&quot;logAxis&quot;:{&quot;axisLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}},&quot;axisLabel&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333&quot;},&quot;show&quot;:true},&quot;splitLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:[&quot;#ccc&quot;]}},&quot;splitArea&quot;:{&quot;areaStyle&quot;:{&quot;color&quot;:[&quot;rgba(250,250,250,0.3)&quot;,&quot;rgba(200,200,200,0.3)&quot;]},&quot;show&quot;:false},&quot;axisTick&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}}},&quot;textStyle&quot;:{},&quot;gauge&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;boxplot&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;color&quot;:[&quot;#e01f54&quot;,&quot;#001852&quot;,&quot;#f5e8c8&quot;,&quot;#b8d2c7&quot;,&quot;#c6b38e&quot;,&quot;#a4d8c2&quot;,&quot;#f3d999&quot;,&quot;#d3758f&quot;,&quot;#dcc392&quot;,&quot;#2e4783&quot;,&quot;#82b6e9&quot;,&quot;#ff6347&quot;,&quot;#a092f1&quot;,&quot;#0a915d&quot;,&quot;#eaf889&quot;,&quot;#6699FF&quot;,&quot;#ff6666&quot;,&quot;#3cb371&quot;,&quot;#d5b158&quot;,&quot;#38b6b6&quot;],&quot;title&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333333&quot;},&quot;subtextStyle&quot;:{&quot;color&quot;:&quot;#aaaaaa&quot;}},&quot;dataZoom&quot;:{&quot;dataBackgroundColor&quot;:&quot;rgba(47,69,84,0.3)&quot;,&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333333&quot;},&quot;handleSize&quot;:&quot;100%&quot;,&quot;handleColor&quot;:&quot;#a7b7cc&quot;,&quot
;fillerColor&quot;:&quot;rgba(167,183,204,0.4)&quot;,&quot;backgroundColor&quot;:&quot;rgba(47,69,84,0)&quot;},&quot;timeAxis&quot;:{&quot;axisLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}},&quot;axisLabel&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333&quot;},&quot;show&quot;:true},&quot;splitLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:[&quot;#ccc&quot;]}},&quot;splitArea&quot;:{&quot;areaStyle&quot;:{&quot;color&quot;:[&quot;rgba(250,250,250,0.3)&quot;,&quot;rgba(200,200,200,0.3)&quot;]},&quot;show&quot;:false},&quot;axisTick&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}}},&quot;legend&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333333&quot;}}},&quot;yAxis&quot;:[{&quot;splitNumber&quot;:5,&quot;axisLabel&quot;:{&quot;show&quot;:true,&quot;interval&quot;:&quot;auto&quot;,&quot;rotate&quot;:0,&quot;inside&quot;:false,&quot;formatter&quot;:&quot;{value}&quot;,&quot;margin&quot;:8},&quot;scale&quot;:false,&quot;gridIndex&quot;:0,&quot;minInterval&quot;:0,&quot;zlevel&quot;:0,&quot;triggerEvent&quot;:false,&quot;z&quot;:0,&quot;inverse&quot;:false,&quot;nameLocation&quot;:&quot;middle&quot;,&quot;nameGap&quot;:50,&quot;silent&quot;:true,&quot;type&quot;:&quot;value&quot;}],&quot;toolbox&quot;:{&quot;feature&quot;:{},&quot;orient&quot;:&quot;vertical&quot;,&quot;itemSize&quot;:15,&quot;height&quot;:&quot;auto&quot;,&quot;zlevel&quot;:0,&quot;z&quot;:2,&quot;itemGap&quot;:20,&quot;right&quot;:&quot;auto&quot;,&quot;top&quot;:&quot;center&quot;,&quot;width&quot;:&quot;auto&quot;,&quot;show&quot;:false,&quot;showTitle&quot;:true},&quot;ec_width&quot;:1000,&quot;ec_height&quot;:500,&quot;tooltip&quot;:{&quot;triggerOn&quot;:&quot;mousemove&quot;,&quot;enterable&quot;:true,&quot;borderColor&quot;:&quot;#333&quot;,&quot;transitionDuration&quot;:0.4,&quot;hideDelay&quot;:100,&quot;padding&quot;:5,&quot;showDelay&quot;:0,&quot;borderWidt
h&quot;:0,&quot;showContent&quot;:true,&quot;backgroundColor&quot;:&quot;rgba(50,50,50,0.7)&quot;,&quot;trigger&quot;:&quot;item&quot;,&quot;alwaysShowContent&quot;:false,&quot;confine&quot;:false,&quot;show&quot;:true},&quot;grid&quot;:[{&quot;height&quot;:&quot;auto&quot;,&quot;show&quot;:false,&quot;width&quot;:&quot;auto&quot;,&quot;backgroundColor&quot;:&quot;transparent&quot;}],&quot;aria&quot;:{&quot;show&quot;:true},&quot;color&quot;:[&quot;#10222B&quot;,&quot;#95AB63&quot;,&quot;#BDD684&quot;,&quot;#E2F0D6&quot;,&quot;#F6FFE0&quot;],&quot;title&quot;:[{&quot;left&quot;:&quot;left&quot;,&quot;borderColor&quot;:&quot;transparent&quot;,&quot;bottom&quot;:&quot;auto&quot;,&quot;padding&quot;:5,&quot;zlevel&quot;:0,&quot;borderWidth&quot;:1,&quot;target&quot;:&quot;blank&quot;,&quot;z&quot;:2,&quot;itemGap&quot;:5,&quot;shadowOffsetY&quot;:0,&quot;shadowOffsetX&quot;:0,&quot;right&quot;:&quot;auto&quot;,&quot;top&quot;:&quot;auto&quot;,&quot;subtarget&quot;:&quot;blank&quot;,&quot;show&quot;:true}],&quot;ec_renderer&quot;:&quot;canvas&quot;,&quot;legend&quot;:{&quot;itemWidth&quot;:25,&quot;data&quot;:[&quot;Min&quot;,&quot;Median&quot;],&quot;borderColor&quot;:&quot;transparent&quot;,&quot;orient&quot;:&quot;horizontal&quot;,&quot;bottom&quot;:&quot;auto&quot;,&quot;height&quot;:&quot;auto&quot;,&quot;zlevel&quot;:0,&quot;padding&quot;:5,&quot;borderWidth&quot;:1,&quot;inactiveColor&quot;:&quot;#ccc&quot;,&quot;z&quot;:2,&quot;align&quot;:&quot;auto&quot;,&quot;itemGap&quot;:10,&quot;itemHeight&quot;:14,&quot;backgroundColor&quot;:&quot;transparent&quot;,&quot;shadowOffsetY&quot;:0,&quot;shadowOffsetX&quot;:0,&quot;right&quot;:&quot;auto&quot;,&quot;top&quot;:&quot;auto&quot;,&quot;width&quot;:&quot;auto&quot;,&quot;selectedMode&quot;:true,&quot;show&quot;:true}});
&lt;/script&gt;

&lt;p&gt;As with the 1 million observation tests, there isn’t much difference between the three single-threaded methods. All three are within a few percent of each other (and all three triggered garbage collection in each of their five runs).&lt;/p&gt;

&lt;p&gt;For the multi-threaded tests, an interesting performance scenario emerged. Like the 1 million point tests, it’s possible to get a run where garbage collection isn’t triggered, which leads to a large min/median difference in the multi-threaded tests. If you can avoid garbage collection, using six threads here gives nearly a &lt;strong&gt;10x speedup&lt;/strong&gt;, and at the median, where both single-threaded and multi-threaded runs trigger garbage collection, you still get a &lt;strong&gt;2x speedup&lt;/strong&gt;.&lt;/p&gt;
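
&lt;p&gt;As a rough check on those two figures, the ratios can be recomputed from the Min and Median values plotted in the chart above. This is only a sketch, written in Python rather than Julia; the timings are copied from the chart data, while the variable and function names are mine and purely illustrative.&lt;/p&gt;

```python
# Timings copied from the chart above (same units for every entry).
methods = ["broadcast", "pre-allocate", "pre-allocate/inbounds", "threads", "threads/inbounds"]
min_times = [26.316, 27.064, 26.219, 2.717, 2.641]
median_times = [39.332, 38.925, 39.387, 17.659, 16.659]


def speedup(times, baseline="broadcast", candidate="threads"):
    """Return how many times faster `candidate` is than `baseline`."""
    return times[methods.index(baseline)] / times[methods.index(candidate)]


# Min vs. min: the fastest threaded run avoided garbage collection.
min_speedup = speedup(min_times)
# Median vs. median: both sides include garbage collection.
median_speedup = speedup(median_times)

print(round(min_speedup, 1), round(median_speedup, 1))  # prints 9.7 2.2
```

&lt;p&gt;Passing a different baseline or candidate name to the hypothetical speedup helper compares any other pair of methods from the chart.&lt;/p&gt;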

&lt;h2 id=&quot;parallelism--compiler-hinting&quot;&gt;Parallelism &amp;gt; Compiler Hinting&lt;/h2&gt;

&lt;p&gt;In the case study above, I’ve demonstrated that for this problem, threading is the first thing to pursue when speeding up the OmniSci.jl load table methods. While pre-allocating the output array and using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@inbounds&lt;/code&gt; macro did show some slight speedups, using threads to perform the calculations is where the largest improvements occurred. The pre-allocation step falls out naturally from the way I wrote the threading methods, so I’ll incorporate that too. Disabling bounds-checking on arrays using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@inbounds&lt;/code&gt; seems more dangerous than it is worth, even though none of these methods should ever get outside of their bounds.&lt;/p&gt;

&lt;p&gt;Overall, I hope this post has demonstrated that you don’t have to fancy yourself a high-frequency trader or a bit-twiddler to find ways to improve your Julia code. The first step is reading the manuals for benchmarking, and then like any other pursuit, the only way to get a feeling for what works is to try things.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All of the code for this blog post can be found in this &lt;a href=&quot;https://gist.github.com/randyzwitch/dbe9ce13aa819a1306d62610bb58b173&quot;&gt;GitHub gist&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.4.16 Release Notes</title>
        
          <description>&lt;p&gt;It’s been a while since the last update, but RSiteCatalyst is still going strong! Thanks to &lt;a href=&quot;https://github.com/slin30&quot;&gt;Wen&lt;/a&gt; for submitting a &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues/210&quot;&gt;fix&lt;/a&gt;/&lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues/239&quot;&gt;enhancement&lt;/a&gt; to enable the ability to use multiple columns from a Classification within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueDataWarehouse&lt;/code&gt; function. No other bug fixes were made, nor was any additional functionality added.&lt;/p&gt;

</description>
        
        <pubDate>Tue, 05 Nov 2019 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-4-16-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-4-16-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-4-16-release-notes/">&lt;p&gt;It’s been a while since the last update, but RSiteCatalyst is still going strong! Thanks to &lt;a href=&quot;https://github.com/slin30&quot;&gt;Wen&lt;/a&gt; for submitting a &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues/210&quot;&gt;fix&lt;/a&gt;/&lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues/239&quot;&gt;enhancement&lt;/a&gt; to enable the ability to use multiple columns from a Classification within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueDataWarehouse&lt;/code&gt; function. No other bug fixes were made, nor was any additional functionality added.&lt;/p&gt;

&lt;p&gt;Version 1.4.16 of RSiteCatalyst was submitted to CRAN yesterday and should be available for download in the coming days.&lt;/p&gt;

&lt;h2 id=&quot;community-contributions&quot;&gt;Community Contributions&lt;/h2&gt;
&lt;p&gt;As I’ve mentioned in many a blog post before this one, I encourage all users of the software to continue reporting bugs via &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues&quot;&gt;GitHub issues&lt;/a&gt;, especially if you can provide a working code example. Even better, a fix via pull request will ensure that your bug is addressed in a timely manner and for the benefit of others in the community.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Please don’t email me directly via the address in the RSiteCatalyst package; emails will not be returned. Having a valid email contact in the package is a requirement for listing a package on CRAN so that CRAN can contact the package author; it is not meant to imply I can/will provide endless, personalized support for free.&lt;/em&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>DataOps Summit&amp;#58; Streaming Real-time Telemetry With OmniSci and StreamSets</title>
        
          <description>&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/GqLCK3Eohss&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

</description>
        
        <pubDate>Thu, 31 Oct 2019 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/omnisci-dataops-streamsets/</link>
        <guid isPermaLink="true">http://randyzwitch.com/omnisci-dataops-streamsets/</guid>
        <content type="html" xml:base="/omnisci-dataops-streamsets/">&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/GqLCK3Eohss&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;In this talk from the &lt;a href=&quot;https://www.dataopssummit-sf.com/&quot;&gt;StreamSets DataOps 2019 conference&lt;/a&gt;, I provide an overview of the &lt;a href=&quot;https://www.omnisci.com/blog/creating-the-omnisci-f1-demo&quot;&gt;data pipeline for the OmniSci F1 Demo&lt;/a&gt;. Using &lt;a href=&quot;https://streamsets.com/opensource&quot;&gt;StreamSets Data Collector&lt;/a&gt; in concert with &lt;a href=&quot;https://kafka.apache.org/&quot;&gt;Apache Kafka&lt;/a&gt; and &lt;a href=&quot;https://github.com/omnisci/omniscidb&quot;&gt;OmniSciDB&lt;/a&gt;, you can create a full real-time data pipeline for telemetry data using only open-source components.&lt;/p&gt;

&lt;p&gt;The talk outlines using the UDP listener for StreamSets to collect packets from the F1 2018 game, writing the packets to Kafka, reading from Kafka and using Groovy to parse the packets, and using the &lt;a href=&quot;https://github.com/omnisci/omnisci-jdbc&quot;&gt;OmniSci JDBC driver&lt;/a&gt; to insert the data into one of nine OmniSciDB tables. With this workflow, you have a robust platform for accelerated analytics, using the power of GPUs for fast computation.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href=&quot;https://github.com/omnisci/vehicle-telematics-analytics-demo&quot;&gt;https://github.com/omnisci/vehicle-telematics-analytics-demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Speakerdeck: &lt;a href=&quot;https://speakerdeck.com/omnisci/the-f1-demo-streaming-real-time-telemetry-using-apache-kafka-and-streamsets&quot;&gt;https://speakerdeck.com/omnisci/the-f1-demo-streaming-real-time-telemetry-using-apache-kafka-and-streamsets&lt;/a&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>ODSC webinar&amp;#58; End-to-End Data Science Without Leaving the GPU</title>
        
          <description>&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/5xBMrO-BSy8&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

</description>
        
        <pubDate>Thu, 11 Jul 2019 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/omnisci-cudf-rapids-odsc-webinar/</link>
        <guid isPermaLink="true">http://randyzwitch.com/omnisci-cudf-rapids-odsc-webinar/</guid>
        <content type="html" xml:base="/omnisci-cudf-rapids-odsc-webinar/">&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/5xBMrO-BSy8&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;In this webinar sponsored by the &lt;a href=&quot;https://odsc.com/&quot;&gt;Open Data Science Conference&lt;/a&gt; (ODSC), I outline a brief history of GPU analytics and the problems that using GPU analytics solves relative to using other parallel computation methods such as Hadoop. I also demonstrate how &lt;a href=&quot;https://www.omnisci.com/&quot;&gt;OmniSci&lt;/a&gt; fits into the broader GPU-accelerated data science workflow, with examples provided using Python.&lt;/p&gt;

&lt;p&gt;Check out the video, grab the Jupyter Notebook from the &lt;a href=&quot;https://github.com/omnisci/odscwebinar&quot;&gt;odscwebinar&lt;/a&gt; repo and get started with OmniSci and GPU-accelerated data science!&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>PyData NYC 2018&amp;#58; End-to-End Data Science Without Leaving the GPU</title>
        
          <description>&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/gQszQcFHcZc&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

</description>
        
        <pubDate>Fri, 01 Feb 2019 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/omnisci-cudf-pydata-nyc-2018/</link>
        <guid isPermaLink="true">http://randyzwitch.com/omnisci-cudf-pydata-nyc-2018/</guid>
        <content type="html" xml:base="/omnisci-cudf-pydata-nyc-2018/">&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/gQszQcFHcZc&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;This talk is from October 2018, and it’s comical to see how much has changed in the &lt;a href=&quot;https://rapids.ai/&quot;&gt;GOAI/RAPIDS&lt;/a&gt; ecosystem since then! Regardless, the high-level concepts of how &lt;a href=&quot;https://www.omnisci.com&quot;&gt;OmniSci&lt;/a&gt; works and the concepts behind GPU dataframes (then: pygdf, now: &lt;a href=&quot;https://github.com/rapidsai/cudf&quot;&gt;cudf&lt;/a&gt;) remain the same, so watching this talk still has value if you are interested in an end-to-end GPU workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.omnisci.com/blog/pymapd_0.7_updated_RAPIDS_support_pyarrow_python_more/&quot;&gt;With the release of pymapd 0.7 a few days ago&lt;/a&gt;, getting started with GPU data science is just a matter of having an NVIDIA GPU and &lt;a href=&quot;https://github.com/omnisci/mapd-core&quot;&gt;OmniSci Core (OSS)&lt;/a&gt; and a quick conda command to set up your environment:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conda install -c conda-forge -c nvidia -c rapidsai -c numba -c defaults pymapd cudf python=3.6&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So check out the video, grab the Jupyter Notebook from the &lt;a href=&quot;https://github.com/omnisci/pydatanyc2018&quot;&gt;pydatanyc2018 GitHub repo&lt;/a&gt; and get started with GPU-accelerated data science!&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Using pandas and pymapd for ETL into OmniSci</title>
        
          <description>&lt;p&gt;I’ve got &lt;a href=&quot;https://pydata.org/nyc2018/schedule/presentation/41/&quot;&gt;PyData NYC 2018&lt;/a&gt; in two days and rather than finishing up my talk, I just realized that my source data has a silent corruption due to non-standard timestamps. Here’s how I fixed this using pandas and then uploaded the data to &lt;a href=&quot;https://omnisci.com&quot;&gt;OmniSci&lt;/a&gt;.&lt;/p&gt;

</description>
        
        <pubDate>Tue, 16 Oct 2018 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/omnisci-pymapd-etl/</link>
        <guid isPermaLink="true">http://randyzwitch.com/omnisci-pymapd-etl/</guid>
        <content type="html" xml:base="/omnisci-pymapd-etl/">&lt;p&gt;I’ve got &lt;a href=&quot;https://pydata.org/nyc2018/schedule/presentation/41/&quot;&gt;PyData NYC 2018&lt;/a&gt; in two days and rather than finishing up my talk, I just realized that my source data has a silent corruption due to non-standard timestamps. Here’s how I fixed this using pandas and then uploaded the data to &lt;a href=&quot;https://omnisci.com&quot;&gt;OmniSci&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;computers-are-dumb-make-things-easier-for-them&quot;&gt;Computers Are Dumb, MAKE THINGS EASIER FOR THEM!&lt;/h2&gt;

&lt;p&gt;Literally every data tool in the world can read the &lt;a href=&quot;https://www.iso.org/iso-8601-date-and-time-format.html&quot;&gt;ISO-8601 timestamp format&lt;/a&gt;. Conversely, not every tool in the world can read the date formats written by Excel or whatever other horrible tool people use to generate the CSV files seen in the wild. While I should’ve been more diligent checking my data ingestion, I didn’t until I created a wonky report…&lt;/p&gt;

&lt;p&gt;Let’s take a look at the format that tripped me up:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/excelformatdates.png&quot; alt=&quot;Excel data format sucks&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Month/Day/Year Hour:Minute:Second AM/PM&lt;/code&gt; feels very much like an Excel date format that you get when Excel is used as a display medium. Unfortunately, when you write CSV files like this, the next tool to read them has to 1) understand that these columns are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timestamps&lt;/code&gt; and 2) guess the format if the user doesn’t specify it.&lt;/p&gt;

&lt;p&gt;In my case, I didn’t do descriptive statistics on my timestamp columns and had a silent truncation(!) of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AM/PM&lt;/code&gt; portion of the data. So instead of having 24 hours in the day, the parser read the data as follows (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#AM&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#PM&lt;/code&gt; are my comments for clarity):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;datetime_beginning_utc
2001-01-01 01:00:00 #AM
2001-01-01 01:00:00 #PM
2001-01-01 02:00:00 #AM
2001-01-01 02:00:00 #PM
2001-01-01 03:00:00 #AM
2001-01-01 03:00:00 #PM
2001-01-01 04:00:00 #AM
2001-01-01 04:00:00 #PM
2001-01-01 05:00:00 #AM
2001-01-01 05:00:00 #PM
2001-01-01 06:00:00 #AM
2001-01-01 06:00:00 #PM
2001-01-01 07:00:00 #AM
2001-01-01 07:00:00 #PM
2001-01-01 08:00:00 #AM
2001-01-01 08:00:00 #PM
2001-01-01 09:00:00 #AM
2001-01-01 09:00:00 #PM
2001-01-01 10:00:00 #AM
2001-01-01 10:00:00 #PM
2001-01-01 11:00:00 #AM
2001-01-01 11:00:00 #PM
2001-01-01 12:00:00 #AM
2001-01-01 12:00:00 #PM
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So while the data looked like it had been imported correctly (it is, after all, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timestamp&lt;/code&gt;), it wasn’t until I noticed that hours 13-23 were missing from my data that I realized I had an error.&lt;/p&gt;
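
&lt;p&gt;The failure mode can be reproduced with nothing but the Python standard library. The sketch below is not the parser that actually bit me; it simulates one by cutting off the AM/PM marker, and the sample timestamps are generated for illustration rather than taken from my data.&lt;/p&gt;

```python
from datetime import datetime

# Hypothetical sample: one Excel-style timestamp per hour of a single day.
raw = [
    datetime(2001, 1, 1, hour).strftime("%m/%d/%Y %I:%M:%S %p")
    for hour in range(24)
]

# A parser that ignores the AM/PM marker (simulated by cutting it off and
# reading the hour as a 24-hour value) only ever sees hours 1 through 12.
truncated = [datetime.strptime(s.rsplit(" ", 1)[0], "%m/%d/%Y %H:%M:%S") for s in raw]

# Parsing with the 12-hour directive plus %p recovers all 24 hours.
correct = [datetime.strptime(s, "%m/%d/%Y %I:%M:%S %p") for s in raw]

truncated_hours = sorted({t.hour for t in truncated})
correct_hours = sorted({t.hour for t in correct})

print(len(truncated_hours), len(correct_hours))  # prints 12 24
```

&lt;p&gt;Counting the distinct hours in a timestamp column is a cheap descriptive statistic that would have caught this problem at ingestion time.&lt;/p&gt;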

&lt;h2 id=&quot;pandas-to-the-rescue&quot;&gt;Pandas To The Rescue!&lt;/h2&gt;

&lt;p&gt;Fixing this issue is as straightforward as reading the CSV into Python using pandas and specifying the date format:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;datetime&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_csv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/mnt/storage1TB/hrl_load_metered/hrl_load_metered.csv&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;n&quot;&gt;parse_dates&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
                 &lt;span class=&quot;n&quot;&gt;date_parser&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;datetime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;datetime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strptime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;%m/%d/%Y %I:%M:%S %p&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/pandasdatetimefix.png&quot; alt=&quot;Yay pandas!&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can see from the output above that pandas has followed our directive about the format, and the data appear to have been parsed correctly. A good secondary check here is that the difference in timestamps is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-5&lt;/code&gt;, which is the offset of the East Coast of the United States relative to UTC.&lt;/p&gt;
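
&lt;p&gt;To make the check for missing hours explicit, here is a minimal sketch using only the standard library. It reuses the format string from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read_csv&lt;/code&gt; call; the helper names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parse_12h&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;missing_hours&lt;/code&gt; and the sample strings are made up for illustration and are not rows from the actual file:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;from datetime import datetime

# Same format as the read_csv call: 12-hour clock (%I) plus AM/PM marker (%p)
FMT = '%m/%d/%Y %I:%M:%S %p'


def parse_12h(text):
    # Turn a 12-hour timestamp string into a datetime on the 24-hour clock
    return datetime.strptime(text, FMT)


def missing_hours(stamps):
    # Hours of the day (0-23) that never show up once the stamps are parsed
    seen = {parse_12h(s).hour for s in stamps}
    return sorted(set(range(24)) - seen)


# Illustrative values only (assumed layout, not taken from the real CSV)
sample = ['01/01/2001 08:00:00 AM', '01/01/2001 08:00:00 PM',
          '01/01/2001 12:00:00 AM', '01/01/2001 12:00:00 PM']
print([parse_12h(s).hour for s in sample])  # prints [8, 20, 0, 12]&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Running something like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;missing_hours&lt;/code&gt; over a full day of raw strings should return an empty list; if the AM/PM marker had been dropped during parsing, hours 13-23 would be reported as missing.&lt;/p&gt;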

&lt;h2 id=&quot;uploading-to-omnisci-directly-from-pandas&quot;&gt;Uploading to OmniSci Directly From Pandas&lt;/h2&gt;

&lt;p&gt;Since my PyData talk is going to be using OmniSci, I need to upload this corrected data or rebuild all my work (I’ll opt for fixing my source). Luckily, the &lt;a href=&quot;https://pymapd.readthedocs.io/en/latest/&quot;&gt;pymapd&lt;/a&gt; package provides tight integration to an OmniSci database, providing a means of uploading the data directly from a pandas dataframe:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pymapd&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#connect to database
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pymapd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;localhost&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;9091&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;mapd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;HyperInteractive&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dbname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;mapd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#truncate table so that table definition can be reused
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;truncate table hrl_load_metered&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#re-load data into table
#with none of the optional arguments, pymapd infers that this is an insert operation, since table name exists
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load_table_columnar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hrl_load_metered&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;I have a pre-existing table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hrl_load_metered&lt;/code&gt; on the database, so I can truncate the table to remove its (incorrect) data but keep the table structure. Then I can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load_table_columnar&lt;/code&gt; to insert the cleaned up data into my table and now my data is correct.&lt;/p&gt;

&lt;h2 id=&quot;computers-may-be-dumb-but-humans-are-lazy&quot;&gt;Computers May Be Dumb, But Humans Are Lazy&lt;/h2&gt;

&lt;p&gt;At the beginning, I joked that computers are dumb. Computers are just tools that do exactly what a human programs them to do, and really, it was my laziness that caused this data error. Luckily, I did catch this before my talk and the fix is pretty easy.&lt;/p&gt;

&lt;p&gt;I’d like to say I’m going to remember to check my data going forward, but in reality, I’m just documenting this here for the next time I make the same, lazy mistake.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Mirroring an FTP Using lftp and cron</title>
        
          <description>&lt;p&gt;As my Developer Advocate role leads me to doing more and more Sysadmin/Data Engineer type of work, I continuously find myself looking for more efficient ways of copying data folders to where I need them. While there are a lot of great GUI ETL tools out there, for me the simplest &lt;em&gt;and fastest&lt;/em&gt; way tends to be using linux utilities. Here’s how to mirror an FTP using lftp, with a cron repeater every five minutes.&lt;/p&gt;

</description>
        
        <pubDate>Thu, 06 Sep 2018 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/mirror-ftp-lftp/</link>
        <guid isPermaLink="true">http://randyzwitch.com/mirror-ftp-lftp/</guid>
        <content type="html" xml:base="/mirror-ftp-lftp/">&lt;p&gt;As my Developer Advocate role leads me to doing more and more Sysadmin/Data Engineer type of work, I continuously find myself looking for more efficient ways of copying data folders to where I need them. While there are a lot of great GUI ETL tools out there, for me the simplest &lt;em&gt;and fastest&lt;/em&gt; way tends to be using linux utilities. Here’s how to mirror an FTP using lftp, with a cron repeater every five minutes.&lt;/p&gt;

&lt;h2 id=&quot;data-are-on-an-ftp-need-further-processing&quot;&gt;Data are on an FTP, Need Further Processing&lt;/h2&gt;

&lt;p&gt;The problem I have is data that exists on a remote FTP, but are in a binary format that is incompatible with loading directly into &lt;a href=&quot;https://www.omnisci.com/&quot;&gt;OmniSci&lt;/a&gt;. My current plan is to use Python to convert the binary format into CSV, but with the data on a server that I don’t control, I need to make a copy somewhere else.&lt;/p&gt;

&lt;p&gt;It’s also the case that the data are roughly 300GB &lt;em&gt;per day&lt;/em&gt;, streaming in at various intervals across the day, so I need to make sure that any copying I do is thoughtful. Downloading 300GB of data per day is bad enough, doing it multiple times even worse!&lt;/p&gt;

&lt;h2 id=&quot;lftp-mirror-to-the-rescue&quot;&gt;lftp &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mirror&lt;/code&gt; to the Rescue!&lt;/h2&gt;

&lt;p&gt;The best choice in my case seems to be copying the files onto a VM I own. &lt;a href=&quot;https://lftp.yar.ru/&quot;&gt;lftp&lt;/a&gt; has a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mirror&lt;/code&gt; command to do just that. Here is the one-liner I’m using:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;lftp &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;mirror -c -e --parallel=20 --verbose /pub/data/nccf/com/nwm/prod /nwmftp/prod;quit;&quot;&lt;/span&gt; ftp.government.gov&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lftp -e&lt;/code&gt;: Execute the command in quotes. In this case, the FTP allows anonymous access, so no user/pw arguments are needed&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mirror&lt;/code&gt;: Mirror command for lftp&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-c&lt;/code&gt;: Continue a mirror job if possible, resuming interrupted downloads instead of starting over (c = “continue”)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-e&lt;/code&gt;: Delete files at the destination that are no longer on the source (i.e. keep folders in perfect sync)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--parallel&lt;/code&gt;: Allow multiple connections for parallel downloading of multiple files&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--verbose&lt;/code&gt;: Print lots of messages, helpful for debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With all of the flags in place, the last two arguments are the source (remote FTP) and destination (my VM) directories. Finally, I add a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;quit&lt;/code&gt; statement to exit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lftp&lt;/code&gt; once the mirror process is over. This is mostly hygiene since I plan to run this on a cron scheduler and don’t want to leave the sessions open.&lt;/p&gt;
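
&lt;p&gt;Since the downstream conversion is planned in Python anyway, the same call can be assembled there. This is only a sketch: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mirror_argv&lt;/code&gt; is a hypothetical helper, it assumes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lftp&lt;/code&gt; is on the PATH, and it makes no attempt to handle paths containing spaces:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import shlex


def mirror_argv(host, remote_dir, local_dir, parallel=20):
    # Build the argument list for the one-liner: mirror remote_dir into local_dir, then quit
    script = 'mirror -c -e --parallel={} --verbose {} {};quit;'.format(
        parallel, remote_dir, local_dir)
    return ['lftp', '-e', script, host]


argv = mirror_argv('ftp.government.gov', '/pub/data/nccf/com/nwm/prod', '/nwmftp/prod')
# Show the shell-quoted equivalent without actually connecting anywhere
print(shlex.join(argv))&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Passing the list to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;subprocess.run(argv, check=True)&lt;/code&gt; would then start the mirror and raise an error on a non-zero exit code.&lt;/p&gt;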

&lt;h2 id=&quot;run-this-every-five-minutes-forever&quot;&gt;Run This Every Five Minutes, Forever&lt;/h2&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cron&lt;/code&gt; really is one of the greatest timesavers ever, especially in that it allows super-repetitive work to be automated away, usually with a single line. Here is the line I added after calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;crontab -e&lt;/code&gt; on the command-line:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;/5 &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; /home/username/pull_from_ftp.sh&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Quite simply, “every 5 minutes, run pull_from_ftp.sh”. Creating &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pull_from_ftp.sh&lt;/code&gt; is as straightforward as creating a text file:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;c&quot;&gt;#!/bin/bash&lt;/span&gt;
lftp &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;mirror -c -e --parallel=20 --verbose /pub/data/nccf/com/nwm/prod /nwmftp/prod;quit;&quot;&lt;/span&gt; ftp.government.gov&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;thats-it-yes&quot;&gt;That’s It? YES!&lt;/h2&gt;

&lt;p&gt;With just a few characters short of a full tweet, you can mirror an entire folder from an FTP, automatically. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lftp&lt;/code&gt; in combination with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cron&lt;/code&gt; helped me factor out hundreds of lines of pre-existing Python code, which not only removed untested, copy-pasted code from the workflow but also added parallel downloading, increasing data throughput.&lt;/p&gt;

&lt;p&gt;Not bad for a couple of free Linux utilities :)&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Creating an OmniSci ODBC Connection in RStudio Server</title>
        
          <description>&lt;p&gt;&lt;em&gt;Edit 10/1/2018: When I wrote this blog post, the company and product were named MapD. I’ve changed the title to reflect the new company name, but left the MapD references below to hopefully avoid confusion&lt;/em&gt;&lt;/p&gt;

</description>
        
        <pubDate>Tue, 21 Aug 2018 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/mapd-odbc-rstudio-server/</link>
        <guid isPermaLink="true">http://randyzwitch.com/mapd-odbc-rstudio-server/</guid>
        <content type="html" xml:base="/mapd-odbc-rstudio-server/">&lt;p&gt;&lt;em&gt;Edit 10/1/2018: When I wrote this blog post, the company and product were named MapD. I’ve changed the title to reflect the new company name, but left the MapD references below to hopefully avoid confusion&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/mapd-odbc-rstudio-connection.png&quot; alt=&quot;MapD ODBC RStudio Server&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In my post &lt;a href=&quot;https://www.mapd.com/blog/installing-mapd-on-microsoft-azure/&quot;&gt;&lt;em&gt;Installing MapD on Microsoft Azure&lt;/em&gt;&lt;/a&gt;, I showed how to install MapD Community Edition on Microsoft Azure, using Ubuntu 16.04 LTS as the base image. One thing I glossed over during the firewall/security section was that I opened ports for Jupyter Notebook and other data science tools, but I didn’t actually show how to install any of those tools.&lt;/p&gt;

&lt;p&gt;For this post, I’ll cover how to install MapD ODBC drivers and create a connection within RStudio server.&lt;/p&gt;

&lt;h2 id=&quot;1-installing-rstudio-server-on-microsoft-azure&quot;&gt;1. Installing RStudio Server on Microsoft Azure&lt;/h2&gt;

&lt;p&gt;With an &lt;a href=&quot;https://github.com/mapd/mapd_on_azure&quot;&gt;Ubuntu VM running MapD&lt;/a&gt;, installing RStudio Server takes but a handful of commands. The &lt;a href=&quot;https://www.rstudio.com/products/rstudio/download-server/&quot;&gt;RStudio Server download&lt;/a&gt;/install page has fantastic instructions, but if you are looking for &lt;a href=&quot;https://www.jumpingrivers.com/blog/hosting-rstudio-server-on-azure/&quot;&gt;Azure-specific RStudio Server install&lt;/a&gt; instructions, this blog post from &lt;a href=&quot;https://www.jumpingrivers.com/&quot;&gt;Jumping Rivers&lt;/a&gt; does a great job.&lt;/p&gt;

&lt;h2 id=&quot;2-installing-an-odbc-driver-manager&quot;&gt;2. Installing an ODBC Driver Manager&lt;/h2&gt;

&lt;p&gt;There are two major ODBC driver managers for Linux and macOS: &lt;a href=&quot;http://www.unixodbc.org/&quot;&gt;unixODBC&lt;/a&gt; and &lt;a href=&quot;http://www.iodbc.org/dataspace/doc/iodbc/wiki/iodbcWiki/WelcomeVisitors&quot;&gt;iODBC&lt;/a&gt;. I have had more overall ODBC driver installation success with unixODBC than iODBC; here are the instructions for building unixODBC from source:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;c&quot;&gt;#download source and extract&lt;/span&gt;
wget ftp://ftp.unixodbc.org/pub/unixODBC/unixODBC-2.3.7.tar.gz
&lt;span class=&quot;nb&quot;&gt;gunzip &lt;/span&gt;unixODBC&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.tar.gz
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;xvf unixODBC&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.tar

&lt;span class=&quot;c&quot;&gt;#compile and install&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;unixODBC-2.3.7
./configure
make
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;make &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If you want to check everything is installed correctly, you can run the following command:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;odbc_config &lt;span class=&quot;nt&quot;&gt;--cflags&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#result&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-DHAVE_UNISTD_H&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-DHAVE_PWD_H&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-DHAVE_SYS_TYPES_H&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-DHAVE_LONG_LONG&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-DSIZEOF_LONG_INT&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;8 &lt;span class=&quot;nt&quot;&gt;-I&lt;/span&gt;/usr/local/include&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;3-installing-mapd-odbc-driver-system-wide&quot;&gt;3. Installing MapD ODBC Driver System-wide&lt;/h2&gt;

&lt;p&gt;With unixODBC installed, the next step is to &lt;a href=&quot;https://www.mapd.com/docs/latest/6_odbc.html&quot;&gt;install the MapD ODBC drivers&lt;/a&gt;. ODBC drivers for MapD are provided as part of &lt;a href=&quot;https://www.mapd.com/platform/downloads/&quot;&gt;MapD Enterprise Edition&lt;/a&gt;, so you’ll need to contact your sales representative to get the appropriate version for your MapD installation.&lt;/p&gt;

&lt;p&gt;For Linux, the MapD ODBC drivers are provided as a tarball, which when extracted provides all of the necessary ODBC driver files:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;c&quot;&gt;#make a directory to extract files into&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;mapd_odbc &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;mapd_odbc
&lt;span class=&quot;nb&quot;&gt;tar&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-xvf&lt;/span&gt; ../mapd_odbc_installer_linux_3.80.1.36.tar.gz

&lt;span class=&quot;c&quot;&gt;#move to /opt/mapd/mapd_odbc (or wherever the other MapD files are)&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; .. &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;mv &lt;/span&gt;mapd_odbc /opt/mapd/mapd_odbc&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;By convention, MapD suggests placing the ODBC drivers in the same directory as your installation (frequently, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/opt/mapd&lt;/code&gt;). Wherever you choose to place the directory, you need to add that location to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/odbcinst.ini&lt;/code&gt; file:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;MapD Driver]
Driver          &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; /opt/mapd/mapd_odbc/libs/libODBC.so&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;At this point, we have everything we need to define a connection string within R using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;odbc&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;odbc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbConnect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;odbc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;odbc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Driver&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;MapD Driver&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Server&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;localhost&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Database&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;mapd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UID&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;mapd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PWD&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;helloRusers!&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Port&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;9091&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Depending on your use case/security preferences, there are two downsides to this method: 1) the credentials are in plain-text in the middle of the script and 2) the &lt;a href=&quot;https://support.rstudio.com/hc/en-us/articles/115010915687-Using-RStudio-Connections&quot;&gt;RStudio Connection window&lt;/a&gt; also shows the credentials in plain-text until you delete the connection. This can be remedied by defining a DSN (data source name).&lt;/p&gt;

&lt;h2 id=&quot;4-defining-a-dsn&quot;&gt;4. Defining A DSN&lt;/h2&gt;

&lt;p&gt;A DSN is what people usually think of when installing ODBC drivers, as it holds some/all of the actual details for connecting to the database. DSN files can be placed in two locations: system-wide in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/odbc.ini&lt;/code&gt; or in an individual user’s home directory (needs to be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/.odbc.ini&lt;/code&gt;, a hidden file).&lt;/p&gt;

&lt;p&gt;In order to have the credentials completely masked in the RStudio session, place the following in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/odbc.ini&lt;/code&gt; file:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;MapD Production]
&lt;span class=&quot;nv&quot;&gt;Driver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;MapD Driver
&lt;span class=&quot;nv&quot;&gt;PWD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;helloRusers!
&lt;span class=&quot;nv&quot;&gt;UID&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;mapd
&lt;span class=&quot;nv&quot;&gt;HOST&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;localhost
&lt;span class=&quot;nv&quot;&gt;DATABASE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;mapd
&lt;span class=&quot;nv&quot;&gt;PORT&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;9091&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
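
&lt;p&gt;One easy mistake is a DSN whose &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Driver&lt;/code&gt; value doesn’t match any section name in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;odbcinst.ini&lt;/code&gt;. As a rough sanity check, the two files can be compared with Python’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;configparser&lt;/code&gt;. This is a sketch, not part of unixODBC: the helper names are made up, the file contents are pasted in as strings, and a DSN that points straight at a driver library path instead of a driver name would be flagged even though it is valid:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import configparser

# Contents of /etc/odbcinst.ini and /etc/odbc.ini from this post, as plain strings
ODBCINST = '''
[MapD Driver]
Driver = /opt/mapd/mapd_odbc/libs/libODBC.so
'''

ODBC = '''
[MapD Production]
Driver=MapD Driver
PWD=helloRusers!
UID=mapd
HOST=localhost
DATABASE=mapd
PORT=9091
'''


def load_ini(text):
    # Interpolation is off so special characters in passwords are read literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def unresolved_dsns(odbc_text, odbcinst_text):
    # DSN names whose Driver entry matches no section in odbcinst.ini
    drivers = set(load_ini(odbcinst_text).sections())
    dsns = load_ini(odbc_text)
    return [name for name in dsns.sections()
            if dsns.get(name, 'Driver', fallback='') not in drivers]


print(unresolved_dsns(ODBC, ODBCINST))  # prints []&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;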

&lt;p&gt;Within the RStudio Connection pane, we can now test our DSN:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/mapd-odbc-rstudio-dsn-test.png&quot; alt=&quot;MapD ODBC RStudio Server DSN Test&quot; /&gt;&lt;/p&gt;

&lt;p&gt;With the DSN defined, the R connection code becomes much shorter, with no credentials exposed within the R session:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;library&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;DBI&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
con &amp;lt;- dbConnect&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;odbc::odbc&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;, &lt;span class=&quot;s2&quot;&gt;&quot;MapD Production&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;odbc-a-big-bag-of-hurt-but-super-useful&quot;&gt;ODBC: A Big Bag Of Hurt, But Super Useful&lt;/h2&gt;

&lt;p&gt;While the instructions above aren’t the easiest to work through, once you have ODBC set up and working one time, it’s usually just a matter of appending various credentials to the existing files to add databases.&lt;/p&gt;

&lt;p&gt;From a MapD perspective, ODBC is supported through our Enterprise Edition, but it is the &lt;em&gt;slowest&lt;/em&gt; way to work with the database. Up to this point, we’ve focused mostly on supporting Python through the &lt;a href=&quot;https://github.com/mapd/pymapd&quot;&gt;pymapd&lt;/a&gt; package and the &lt;a href=&quot;https://www.mapd.com/blog/scaling-pandas-to-the-billions-with-ibis-and-mapd/&quot;&gt;MapD Ibis backend&lt;/a&gt;, but there’s no technical reason why R can’t also be a first-class citizen.&lt;/p&gt;

&lt;p&gt;So if you’re interested in helping develop an R package for MapD, whether using &lt;a href=&quot;https://rstudio.github.io/reticulate/articles/introduction.html&quot;&gt;reticulate&lt;/a&gt; to wrap pymapd or to help develop Apache Thrift bindings and Apache Arrow native code, &lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;send me a Twitter message&lt;/a&gt; or &lt;a href=&quot;https://www.linkedin.com/in/randyzwitch/&quot;&gt;connect via LinkedIn&lt;/a&gt; (or any other way to contact me) and we’ll figure out how to collaborate!&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Data Science Without Leaving the GPU</title>
        
          <description>&lt;p&gt;&lt;em&gt;Edit 10/1/2018: When I wrote this blog post, the company and product were named MapD. I’ve changed the title to reflect the new company name, but left the MapD references below to hopefully avoid confusion&lt;/em&gt;&lt;/p&gt;

</description>
        
        <pubDate>Mon, 23 Jul 2018 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/mapd-apache-arrow-xgboost/</link>
        <guid isPermaLink="true">http://randyzwitch.com/mapd-apache-arrow-xgboost/</guid>
        <content type="html" xml:base="/mapd-apache-arrow-xgboost/">&lt;p&gt;&lt;em&gt;Edit 10/1/2018: When I wrote this blog post, the company and product were named MapD. I’ve changed the title to reflect the new company name, but left the MapD references below to hopefully avoid confusion&lt;/em&gt;&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/9I207CIvk5Y?rel=0&quot; frameborder=&quot;0&quot; allow=&quot;autoplay; encrypted-media&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Data has been growing rapidly for some time now, but CPU-based analytics solutions haven’t been able to sustain the same rate of growth in order to keep up. CPUs in desktop and laptop machines have started adding more cores, but even a 4- or 8-core CPU can only do so much work. Eventually the bottleneck will become not having enough bandwidth to keep all the CPU cores ‘fed’ with data to manipulate. &lt;a href=&quot;/big-data-hadoop-amazon-ec2-cloudera-part-1/&quot;&gt;Hadoop&lt;/a&gt; provides a framework for working with larger datasets, but its distributed nature can often feel like setting it up is more hassle than it’s worth.&lt;/p&gt;

&lt;p&gt;GPU-based analytics solutions provide a great middle ground: high parallelism via thousands of GPU cores, without automatically requiring a networked, multi-node architecture such as Hadoop. A single &lt;a href=&quot;/building-data-science-workstation-2017/&quot;&gt;data science workstation&lt;/a&gt; with 2-4 GPUs can reasonably handle hundreds of millions of records, especially when using the &lt;a href=&quot;https://www.omnisci.com/blog/scaling-pandas-to-the-billions-with-ibis-and-mapd/&quot;&gt;Ibis backend for MapD&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this webinar, I demonstrate how to do each step of a machine learning workflow, from exploring a dataset to adding features to estimating an xgboost model for predicting the amount of tip a user will give after a taxi ride. Because MapD incorporates Apache Arrow under the hood for its data transfer, this can all be done seamlessly by passing pointers, rather than needing expensive I/O operations, between each tool used. Not having to transfer the data off of the GPU has interesting implications for analytics, which I also discuss towards the end of the talk.&lt;/p&gt;

&lt;p&gt;Enjoy!&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Parallel, Disk-Efficient .zip to .gz Conversion</title>
        
          <description>&lt;p&gt;Similar to my last post about needing to &lt;a href=&quot;https://randyzwitch.com/bulk-loading-postgis/&quot;&gt;merge shapefiles using Postgis&lt;/a&gt;, I recently downloaded a bunch of energy data from the federal government. 13,370 files to be exact. While the data size itself isn’t that large (~8GB, compressed), an open-source tool I was looking to evaluate only supports &lt;em&gt;gzip&lt;/em&gt; compression instead of the &lt;em&gt;zip&lt;/em&gt; compressed files I actually had.&lt;/p&gt;

</description>
        
        <pubDate>Mon, 18 Jun 2018 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/zip-to-gzip-conversion-parallel/</link>
        <guid isPermaLink="true">http://randyzwitch.com/zip-to-gzip-conversion-parallel/</guid>
        <content type="html" xml:base="/zip-to-gzip-conversion-parallel/">&lt;p&gt;Similar to my last post about needing to &lt;a href=&quot;https://randyzwitch.com/bulk-loading-postgis/&quot;&gt;merge shapefiles using Postgis&lt;/a&gt;, I recently downloaded a bunch of energy data from the federal government. 13,370 files to be exact. While the data size itself isn’t that large (~8GB, compressed), an open-source tool I was looking to evaluate only supports &lt;em&gt;gzip&lt;/em&gt; compression instead of the &lt;em&gt;zip&lt;/em&gt; compressed files I actually had.&lt;/p&gt;

&lt;p&gt;While I could’ve used this opportunity to merge the files together into one and do all the data cleaning, I became obsessed with figuring out how to just switch the compression scheme. Here’s the one-liner that emerged:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;find &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-type&lt;/span&gt; f &lt;span class=&quot;nt&quot;&gt;-name&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'*.zip'&lt;/span&gt; | parallel &lt;span class=&quot;s2&quot;&gt;&quot;unzip -p -q {} | gzip &amp;gt; {.}.gz &amp;amp;&amp;amp; rm {}&quot;&lt;/span&gt;

find &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-type&lt;/span&gt; f &lt;span class=&quot;nt&quot;&gt;-name&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'*.zip'&lt;/span&gt;
  - find all zip files &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;the current directory, including subdirectories

| parallel
  - take input list passed by find &lt;span class=&quot;nb&quot;&gt;command&lt;/span&gt;, run some &lt;span class=&quot;nb&quot;&gt;command &lt;/span&gt;against each argument &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;parallel

&lt;span class=&quot;s2&quot;&gt;&quot;unzip -p -q {} | gzip &amp;gt; {.}.gz &amp;amp;&amp;amp; rm {}&quot;&lt;/span&gt;
  - unzip a file, with flags &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; to pass data to STDOUT and &lt;span class=&quot;nt&quot;&gt;-q&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;quiet mode
  - &lt;span class=&quot;o&quot;&gt;{}&lt;/span&gt; represents the input file, which comes from list passed by find
  - &lt;span class=&quot;nb&quot;&gt;gzip &lt;/span&gt;takes STDOUT as its input, writes to a file whose name is determined by &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;.&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;the input file name going into parallel, where the &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; removes the file extension&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  - &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;rm&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{}&lt;/span&gt; is run after the &lt;span class=&quot;nb&quot;&gt;gzip &lt;/span&gt;process finishes, removing the original .zip file&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As a one-liner, it’s not the hardest to comprehend, but it’s also not the most intuitive. The key idea is that once we find all of the zip files, we can unzip/gzip them in parallel. This works because each conversion is independent of the others: every file is unzipped and then gzipped by its own single-threaded process, so we’re not parallelizing the work on any one file. We’re just kicking off multiple single-threaded processes at once instead of leaving the other cores in the CPU idle.&lt;/p&gt;

&lt;p&gt;Once the unzip-to-gzip process has occurred, then I delete the original zip file. So for the most part, this process can be considered to take constant disk space (if you ignore that 4-8 files are being processed at one time).&lt;/p&gt;
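
&lt;p&gt;For anyone who would rather stay in Python than install GNU parallel, the same idea can be sketched with the standard library. This is an illustration, not what I ran: the names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zip_to_gzip&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;convert_tree&lt;/code&gt; are made up for this example, it assumes each archive holds a single data file, and it uses a thread pool instead of separate processes (CPython’s zlib bindings release the GIL while compressing, so the threads should still overlap most of the work).&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import gzip
import shutil
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def zip_to_gzip(zip_path):
    '''Stream the first member of a zip archive into NAME.gz, then delete the zip.'''
    zip_path = Path(zip_path)
    gz_path = zip_path.with_suffix('.gz')
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: one data file per archive (unzip -p would emit every member)
        member = archive.namelist()[0]
        with archive.open(member) as src, gzip.open(gz_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)
    # Only reached if the copy above did not raise
    zip_path.unlink()
    return gz_path


def convert_tree(root, workers=4):
    '''Convert every .zip under root, several files at a time.'''
    zips = sorted(Path(root).rglob('*.zip'))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zip_to_gzip, zips))&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;convert_tree('.', workers=8)&lt;/code&gt; from the data directory would roughly mirror the one-liner. Because the zip is removed only after its gzip copy is written, disk usage stays about flat here too.&lt;/p&gt;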

&lt;p&gt;Like many one-liners, this took longer to figure out than was actually worth the time savings. But such is life, and now it’s available in the wild for the next person who wonders how to do something like this!&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Bulk Loading Shapefiles Into Postgres/Postgis</title>
        
          <description>&lt;p&gt;Recently I’ve been doing a fair bit of work with geospatial data, mostly on the data preparation side. While there are common data formats, I have found that because so much of this data are sourced from government agencies, the data are often in many files that can be concatenated.&lt;/p&gt;

</description>
        
        <pubDate>Fri, 01 Jun 2018 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/bulk-loading-postgis/</link>
        <guid isPermaLink="true">http://randyzwitch.com/bulk-loading-postgis/</guid>
        <content type="html" xml:base="/bulk-loading-postgis/">&lt;p&gt;Recently I’ve been doing a fair bit of work with geospatial data, mostly on the data preparation side. While there are common data formats, I have found that because so much of this data are sourced from government agencies, the data are often in many files that can be concatenated.&lt;/p&gt;

&lt;p&gt;In this example, I will show how to take a few dozen county-level shapefiles of parcel data from Utah and load it into a single table in &lt;a href=&quot;https://www.postgresql.org/&quot;&gt;Postgres&lt;/a&gt;/&lt;a href=&quot;https://postgis.net/&quot;&gt;Postgis&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;step-1-downloading-shapefiles&quot;&gt;Step 1: Downloading Shapefiles&lt;/h2&gt;

&lt;p&gt;The following shell commands come from an in-progress collaboration with a friend, where we are going to analyze daily air quality in Utah over the past several years. &lt;a href=&quot;https://opendata.utah.gov/browse&quot;&gt;Utah is open-data-friendly&lt;/a&gt;, providing &lt;a href=&quot;https://gis.utah.gov/data/cadastre/parcels/#UtahLIRParcels&quot;&gt;shapefiles for every parcel of land in Utah&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While it may have been possible to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wget&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;curl&lt;/code&gt; to download every shapefile, they are stored within Google Drive with a bunch of hashed URLs, so I just clicked on each file instead of trying to be clever. So if you want to follow along with this blog post exactly, you’ll need to download the 25 zip files of Utah shapefiles:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;ls&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-lrt&lt;/span&gt; utah_lir_shapefiles/
total 408688
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch    954984 Jun  1 13:10 Parcels_Beaver_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   7183466 Jun  1 13:10 Parcels_BoxElder_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   9152777 Jun  1 13:10 Parcels_Cache_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   3279384 Jun  1 13:10 Parcels_Carbon_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch    356058 Jun  1 13:10 Parcels_Daggett_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch  18908413 Jun  1 13:10 Parcels_Davis_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   3900415 Jun  1 13:10 Parcels_Duchesne_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   2689950 Jun  1 13:10 Parcels_Garfield_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   2156109 Jun  1 13:10 Parcels_Grand_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   8107608 Jun  1 13:10 Parcels_Iron_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   1975537 Jun  1 13:10 Parcels_Juab_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   3273485 Jun  1 13:10 Parcels_Kane_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   2741403 Jun  1 13:10 Parcels_Millard_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   1110627 Jun  1 13:10 Parcels_Morgan_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   2970626 Jun  1 13:10 Parcels_Rich_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch 200183664 Jun  1 13:11 Parcels_SaltLake_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   1397522 Jun  1 13:11 Parcels_SanJuan_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   1576757 Jun  1 13:11 Parcels_Sanpete_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   7911261 Jun  1 13:11 Parcels_Summit_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   4480456 Jun  1 13:11 Parcels_Tooele_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch  69690149 Jun  1 13:11 Parcels_Utah_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch   5025674 Jun  1 13:11 Parcels_Wasatch_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch  35896908 Jun  1 13:11 Parcels_Washington_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch    298313 Jun  1 13:11 Parcels_Wayne_LIR.zip
&lt;span class=&quot;nt&quot;&gt;-rw-rw-r--&lt;/span&gt; 1 rzwitch rzwitch  23225130 Jun  1 13:11 Parcels_Weber_LIR.zip&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;step-2-bulk-unzip&quot;&gt;Step 2: Bulk Unzip&lt;/h2&gt;

&lt;p&gt;With all of these files in the same directory at the same level (i.e. no subfolders), it’s pretty easy to bulk unzip the files, with one caveat: to move the contents of the unzipped files into a new directory, you need to use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-d&lt;/code&gt; flag:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;utah_lir_shapefiles_unzipped &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; unzip utah_lir_shapefiles/&lt;span class=&quot;se&quot;&gt;\*&lt;/span&gt;.zip &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; utah_lir_shapefiles_unzipped&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The reason I created a new directory (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mkdir&lt;/code&gt;) and then unzipped the files into a new directory is that when doing analysis, I always like to keep the source data separate, so that I always have the option of starting completely over. It also can make shell globs easier :)&lt;/p&gt;
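
&lt;p&gt;For reference, here is a rough Python sketch of the same step using only the standard library. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unzip_all&lt;/code&gt; is hypothetical, and unlike a bare &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mkdir&lt;/code&gt; it doesn’t complain if the destination directory already exists.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import zipfile
from pathlib import Path


def unzip_all(src_dir, dest_dir):
    '''Extract every .zip directly inside src_dir into dest_dir.

    The archives themselves are left untouched, so the source data stays separate.
    '''
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(src_dir).glob('*.zip')):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
            # Record the member names so the caller can see what was written
            extracted.extend(zf.namelist())
    return extracted&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With the directory names from this post, the call would be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unzip_all('utah_lir_shapefiles', 'utah_lir_shapefiles_unzipped')&lt;/code&gt;.&lt;/p&gt;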

&lt;h2 id=&quot;step-3-creating-postgis-table-definition&quot;&gt;Step 3: Creating Postgis Table Definition&lt;/h2&gt;

&lt;p&gt;After all of the county zip files are unzipped, you get 25 sub-directories structured like the following:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;ls&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-ltr&lt;/span&gt;
total 10916
&lt;span class=&quot;nt&quot;&gt;-rw-rw-rw-&lt;/span&gt; 1 rzwitch rzwitch   67868 Sep  3  2017 Parcels_Beaver_LIR.shx
&lt;span class=&quot;nt&quot;&gt;-rw-rw-rw-&lt;/span&gt; 1 rzwitch rzwitch   28280 Sep  3  2017 Parcels_Beaver_LIR.shp.xml
&lt;span class=&quot;nt&quot;&gt;-rw-rw-rw-&lt;/span&gt; 1 rzwitch rzwitch 1503304 Sep  3  2017 Parcels_Beaver_LIR.shp
&lt;span class=&quot;nt&quot;&gt;-rw-rw-rw-&lt;/span&gt; 1 rzwitch rzwitch    3036 Sep  3  2017 Parcels_Beaver_LIR.sbx
&lt;span class=&quot;nt&quot;&gt;-rw-rw-rw-&lt;/span&gt; 1 rzwitch rzwitch   83052 Sep  3  2017 Parcels_Beaver_LIR.sbn
&lt;span class=&quot;nt&quot;&gt;-rw-rw-rw-&lt;/span&gt; 1 rzwitch rzwitch     425 Sep  3  2017 Parcels_Beaver_LIR.prj
&lt;span class=&quot;nt&quot;&gt;-rw-rw-rw-&lt;/span&gt; 1 rzwitch rzwitch 9471508 Sep  3  2017 Parcels_Beaver_LIR.dbf
&lt;span class=&quot;nt&quot;&gt;-rw-rw-rw-&lt;/span&gt; 1 rzwitch rzwitch       5 Sep  3  2017 Parcels_Beaver_LIR.cpg&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The .shp files from the 25 counties all have the same format, which is very convenient. In this step, we can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shp2pgsql&lt;/code&gt; utility that comes with Postgis to read a shapefile, determine the proper schema, then create the table in the database:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;shp2pgsql &lt;span class=&quot;nt&quot;&gt;-I&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; 26912 &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; utah_lir_shapefiles_unzipped/Parcels_Beaver_LIR/Parcels_Beaver_LIR.shp &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
utahlirparcels  | psql &lt;span class=&quot;nt&quot;&gt;-h&lt;/span&gt; localhost &lt;span class=&quot;nt&quot;&gt;-U&lt;/span&gt; &amp;lt;username&amp;gt; &amp;lt;database&amp;gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The key flag here is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-p&lt;/code&gt;, which means ‘prepare mode’; the shapefile will get read, a table created, but no data loaded. By not loading the data in this step, it makes looping over the files easier later, as no special logic is required to keep the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Parcels_Beaver_LIR.shp&lt;/code&gt; from being duplicated in Postgis (because it was never loaded in the first place).&lt;/p&gt;

&lt;h2 id=&quot;step-4-bulk-loading-shapefiles-into-postgis&quot;&gt;Step 4: Bulk Loading Shapefiles into Postgis&lt;/h2&gt;

&lt;p&gt;The last steps of the loading process are to 1) get all of the shapefile locations and 2) feed them to shp2pgsql:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;i &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;find utah_lir_shapefiles_unzipped/ &lt;span class=&quot;nt&quot;&gt;-type&lt;/span&gt; f &lt;span class=&quot;nt&quot;&gt;-name&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'*.shp'&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do
  &lt;/span&gt;shp2pgsql &lt;span class=&quot;nt&quot;&gt;-I&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; 26912 &lt;span class=&quot;nt&quot;&gt;-a&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$i&lt;/span&gt; utahlirparcels  | psql &lt;span class=&quot;nt&quot;&gt;-h&lt;/span&gt; localhost &lt;span class=&quot;nt&quot;&gt;-U&lt;/span&gt; &amp;lt;username&amp;gt; &amp;lt;database&amp;gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;To get all of the shapefile locations, I use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;find&lt;/code&gt; with flags &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-type f&lt;/code&gt; (regular files only) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-name&lt;/code&gt; to search for the pattern within the directory. This command goes through the entire set of subdirectories and gets all the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.shp&lt;/code&gt; files. From there, I iterate over the list of files using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;for i in...&lt;/code&gt;, then pass the value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$i&lt;/code&gt; into a similar &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shp2pgsql&lt;/code&gt; as above. However, rather than using flag &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-p&lt;/code&gt; for ‘prepare’, we are now going to use flag &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-a&lt;/code&gt; for ‘append’. This will perform an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT INTO utahlirparcels()&lt;/code&gt; statement for Postgres, loading in the actual data from the 25 shapefiles.&lt;/p&gt;
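
&lt;p&gt;One caveat with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;for i in $(find ...)&lt;/code&gt; is that the shell splits the output on whitespace, so a path containing a space would break the loop. The Utah file names have none, so it doesn’t matter here. A rough Python equivalent that side-steps the issue is sketched below. It assumes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shp2pgsql&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;psql&lt;/code&gt; are on the PATH, the helper names are hypothetical, and the flags are the same ones used in the shell commands in this post.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import subprocess
from pathlib import Path


def shp2pgsql_args(shapefile, table, srid=26912, mode='-a'):
    '''Build the shp2pgsql argument list: mode -p prepares the table, -a appends rows.'''
    return ['shp2pgsql', '-I', '-s', str(srid), mode, str(shapefile), table]


def load_shapefiles(root, table, user, database, host='localhost'):
    '''Append every .shp under root to table by piping shp2pgsql into psql.'''
    psql = ['psql', '-h', host, '-U', user, database]
    for shapefile in sorted(Path(root).rglob('*.shp')):
        args = shp2pgsql_args(shapefile, table)
        # Passing a list (no shell) keeps paths with spaces intact
        with subprocess.Popen(args, stdout=subprocess.PIPE) as producer:
            subprocess.run(psql, stdin=producer.stdout, check=True)
        if producer.returncode != 0:
            raise RuntimeError('shp2pgsql failed for ' + str(shapefile))&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Passing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mode='-p'&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shp2pgsql_args&lt;/code&gt; gives the argument list for the table-creation command from Step 3.&lt;/p&gt;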

&lt;h2 id=&quot;spend-time-now-to-save-time-later&quot;&gt;Spend Time Now To Save Time Later&lt;/h2&gt;

&lt;p&gt;Like so much of shell scripting, figuring out these commands took longer than I would’ve expected. Certainly, they took longer to figure out than it would’ve taken to copy-paste a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shp2pgsql&lt;/code&gt; 25 times! But by taking the time upfront to figure out a generic method of looping over shapefiles, the next time (and every time after that) I find myself needing to do this, this code will be available to load multiple shapefiles into Postgis.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Using RSiteCatalyst With Microsoft PowerBI Desktop</title>
        
          <description>&lt;p&gt;With pretty regular frequency I get emails asking if RSiteCatalyst can be used with &lt;a href=&quot;https://powerbi.microsoft.com/en-us/&quot;&gt;Microsoft Power BI&lt;/a&gt;. While admittedly I’m not a frequent user of the Windows operating system (nor dashboarding tools like Tableau or Power BI), I am pleased to report that it is in fact possible to call the &lt;a href=&quot;https://marketing.adobe.com/developer/documentation/analytics-reporting-1-4/whatsnew&quot;&gt;Adobe Analytics API&lt;/a&gt; with Power BI via RSiteCatalyst!&lt;/p&gt;

</description>
        
        <pubDate>Tue, 13 Mar 2018 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-powerbi-desktop-microsoft/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-powerbi-desktop-microsoft/</guid>
        <content type="html" xml:base="/rsitecatalyst-powerbi-desktop-microsoft/">&lt;p&gt;With pretty regular frequency I get emails asking if RSiteCatalyst can be used with &lt;a href=&quot;https://powerbi.microsoft.com/en-us/&quot;&gt;Microsoft Power BI&lt;/a&gt;. While admittedly I’m not a frequent user of the Windows operating system (nor dashboarding tools like Tableau or Power BI), I am pleased to report that it is in fact possible to call the &lt;a href=&quot;https://marketing.adobe.com/developer/documentation/analytics-reporting-1-4/whatsnew&quot;&gt;Adobe Analytics API&lt;/a&gt; with Power BI via RSiteCatalyst!&lt;/p&gt;

&lt;h2 id=&quot;step-1-call-adobe-analytics-api-using-get-data-menu&quot;&gt;Step 1: Call Adobe Analytics API Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Get Data&lt;/code&gt; Menu&lt;/h2&gt;

&lt;p&gt;The majority of getting RSiteCatalyst to work within Power BI desktop is getting the R script correct. From the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Get Data&lt;/code&gt; menu, choose the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;More...&lt;/code&gt; menu option to bring up all of the data import tools that Power BI defines:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/rsitecatalyst-powerbi-getdata.png&quot; alt=&quot;rsitecatalyst powerbi get data&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Once you choose ‘R Script’, an input box will open where you can place your RSiteCatalyst function calls:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/rsitecatalyst-powerbi-rscript.png&quot; alt=&quot;rsitecatalyst powerbi rscript&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After hitting ‘OK’, Power BI will evaluate your R code, determining which statements return a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data.frame&lt;/code&gt; (which is the only allowable data structure imported into Power BI). You can choose which &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data.frame(s)&lt;/code&gt; you want to import from the ‘Navigator’ window:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/rsitecatalyst-powerbi-navigator.png&quot; alt=&quot;rsitecatalyst powerbi navigator&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Once you hit ‘OK’, Power BI imports the data and you can use your Adobe Analytics data just as you would in R with RSiteCatalyst (or as you would any other data source, such as a CSV file or a database).&lt;/p&gt;

&lt;h2 id=&quot;limitations&quot;&gt;Limitations&lt;/h2&gt;

&lt;p&gt;While it’s possible to call RSiteCatalyst through Power BI, there are some limitations to keep in mind.&lt;/p&gt;

&lt;p&gt;First, RSiteCatalyst will only work with Microsoft Power BI &lt;em&gt;Desktop&lt;/em&gt;, which is installed locally on your machine. The Power BI Service, which is more of a shared dashboard/data store environment, &lt;a href=&quot;https://docs.microsoft.com/en-us/power-bi/service-r-packages-support#requirements-and-limitations-of-r-packages&quot;&gt;does not allow external API calls&lt;/a&gt; as part of its security model. So while you can analyze your data locally, you cannot share dashboards to the Power BI Service.&lt;/p&gt;

&lt;p&gt;The second limitation I’ve noticed is that Power BI doesn’t read from a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.Renviron&lt;/code&gt; file (at least, not from the default Windows location that R GUI reads). So you will need to place your credentials directly in the R script, which is never really ideal (though, may not be a big deal all things considered).&lt;/p&gt;

&lt;p&gt;Finally, the R script runs synchronously, so when placing multiple calls in the same R script you will need to wait for all of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data.frame&lt;/code&gt; results before you can use them within Power BI. This is the same default behavior within R, sans using &lt;a href=&quot;https://rstudio.github.io/promises/&quot;&gt;Promises&lt;/a&gt; or parallelism of some sort, but it’s still important to keep in mind.&lt;/p&gt;

&lt;h2 id=&quot;dashboards-dashboards-dashboards&quot;&gt;Dashboards, Dashboards, Dashboards!&lt;/h2&gt;

&lt;p&gt;With a few minutes work, I was able to create this rudimentary dashboard (&lt;a href=&quot;/assets/r_code/rsitecatalyst_powerbi_example.R&quot;&gt;R code&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/rsitecatalyst-powerbi-dashboard.png&quot; alt=&quot;rsitecatalyst powerbi dashboard&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Someone with more interesting/higher volume data could surely do better. But the most important thing in my opinion is that Microsoft has built an awesome integration with R and that creating dashboards in Power BI is &lt;em&gt;waaaaaay&lt;/em&gt; easier than the last time I tried to create a dashboard using Excel and the Adobe Report Builder plugin.&lt;/p&gt;

&lt;p&gt;Happy dashboarding!&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Getting Started With OmniSci, Part 2: Electricity Dataset</title>
        
          <description>&lt;p&gt;&lt;em&gt;Edit 10/1/2018: When I wrote this blog post, the company and product were named MapD. I’ve changed the title to reflect the new company name, but left the MapD references below to hopefully avoid confusion&lt;/em&gt;&lt;/p&gt;

</description>
        
        <pubDate>Fri, 23 Feb 2018 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/mapd-pjm-electricity-data/</link>
        <guid isPermaLink="true">http://randyzwitch.com/mapd-pjm-electricity-data/</guid>
        <content type="html" xml:base="/mapd-pjm-electricity-data/">&lt;p&gt;&lt;em&gt;Edit 10/1/2018: When I wrote this blog post, the company and product were named MapD. I’ve changed the title to reflect the new company name, but left the MapD references below to hopefully avoid confusion&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In my &lt;a href=&quot;http://randyzwitch.com/mapd-install-load-data/&quot;&gt;previous MapD post&lt;/a&gt;, I loaded &lt;a href=&quot;http://www.pjm.com/markets-and-operations/ops-analysis/historical-load-data.aspx&quot;&gt;electricity data&lt;/a&gt; into MapD Community Edition, intentionally ignoring the &lt;em&gt;what&lt;/em&gt; of the data to keep that post from being too overwhelming. Now let’s take a step back and explain the dataset, show how I used Python to format the data that was loaded into MapD, then use the MapD Immerse UI to build a simple dashboard.&lt;/p&gt;

&lt;h2 id=&quot;pjm-metered-load-data&quot;&gt;PJM Metered Load Data&lt;/h2&gt;

&lt;p&gt;I started off my career at &lt;a href=&quot;http://pjm.com/&quot;&gt;PJM&lt;/a&gt; doing long-term electricity demand forecasting, to help power engineers do transmission line studies for reliability and to support expansion of the electrical grid in the U.S. Because PJM is a quasi-government agency, they provide over &lt;a href=&quot;http://www.pjm.com/markets-and-operations/ops-analysis/historical-load-data.aspx&quot;&gt;25 years of hourly electricity usage&lt;/a&gt; for the Eastern and Central U.S., both in aggregate and by local power region (roughly, the local power company territories).&lt;/p&gt;

&lt;p&gt;However, just because the data is available doesn’t mean it’s &lt;em&gt;convenient&lt;/em&gt;, and unfortunately, the data are stored as Excel spreadsheets. This is easily remedied using pandas (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;v0.22.0&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;python3.6&lt;/code&gt;):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;os&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#change to directory with files for convenience
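&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chdir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expanduser&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;~/electricity_data&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;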

&lt;span class=&quot;c1&quot;&gt;#first sheet in workbook contains all info for years 1993-1999
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df1993_1999&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1993&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#melt, append df1993-df1999 together
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_melted&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DataFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df1993_1999&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;columns&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df1993_1999&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;columns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tolist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_melt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;melt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id_vars&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'ACTUAL_DATE'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'ZONE_NAME'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;var_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;HOUR_ENDING&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;MW&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;df_melted&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df_melted&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_melt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#multiple sheets to concatenate
#too much variation for a one-liner
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d2000&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2000-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;17&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2001&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2001-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2002&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2002-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;18&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2003&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2003-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;19&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2004&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2004-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2005&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2005-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;27&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2006&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2006-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;29&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2007&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2007-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;29&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2008&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2008-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;29&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2009&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2009-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;29&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2010&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2010-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;29&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2011&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2011-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2012&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2012-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;33&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2013&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2013-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;34&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2014&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2014-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;34&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A:Z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2015&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2015-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;B:AA&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2016&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2016-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;B:AA&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2017&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2017-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;42&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;B:AA&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d2018&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_excel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2018-hourly-loads.xls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sheet_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usecols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;B:AA&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#loop over dataframes, read in matrix-formatted data, melt to normalized form
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;workbook&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d2000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2002&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2003&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2004&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2005&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2006&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2007&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2008&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2009&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2010&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;n&quot;&gt;d2011&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2012&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2013&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2014&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2015&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2016&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2017&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2018&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;workbook&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;temp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;workbook&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;temp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;columns&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df1993_1999&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;columns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tolist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;#standardize column names
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;temp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ACTUAL_DATE&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_datetime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;temp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ACTUAL_DATE&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;#force datetime, excel reader wonky
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;df_melted&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;concat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_melted&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;melt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;temp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id_vars&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'ACTUAL_DATE'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'ZONE_NAME'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;var_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;HOUR_ENDING&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;MW&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;#DataFrame.append was removed in pandas 2.0&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#(4941384, 4)
#130MB as CSV
#remove any dates that are null, artifacts from excel reader
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_melted&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;notnull&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_melted&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ACTUAL_DATE&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_csv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hourly_loads.csv&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;    
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The code is a bit verbose, mostly because I didn’t take the time to work out how to determine programmatically how many tabs each workbook has. But the concept is the same each time: read an Excel file, get the data into a dataframe, then convert the data to &lt;em&gt;long&lt;/em&gt; form. So instead of having 26 columns (Date, Zone, Hr1-Hr24), we have 4 columns (Date, Zone, Hour, MW), which is often a more convenient way to access the data (especially when using SQL).&lt;/p&gt;

&lt;p&gt;The final statement writes out a CSV of approximately 4 million rows, the same dataset that was loaded &lt;a href=&quot;http://randyzwitch.com/mapd-install-load-data/&quot;&gt;using mapdql&lt;/a&gt; in the first post.&lt;/p&gt;
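
&lt;p&gt;To make the wide-to-long conversion concrete, here is a minimal sketch of &lt;code&gt;pd.melt&lt;/code&gt; on a toy frame. The column names match the ones used above, but the frame has only three hour columns (&lt;code&gt;HR1&lt;/code&gt;-&lt;code&gt;HR3&lt;/code&gt; are assumed names) and the load values are made up for illustration.&lt;/p&gt;

```python
import pandas as pd

# Toy wide-format frame: one row per date and zone, one column per hour.
# Only three hour columns are used; the values are made up for illustration.
wide = pd.DataFrame({
    "ACTUAL_DATE": pd.to_datetime(["2017-07-20", "2017-07-21"]),
    "ZONE_NAME": ["MIDATL", "MIDATL"],
    "HR1": [30000.0, 31000.0],
    "HR2": [29000.0, 30500.0],
    "HR3": [28500.0, 30000.0],
})

# Melt to long form: the id columns repeat, and every hour column
# becomes its own row with the hour label in HOUR_ENDING.
long_form = pd.melt(
    wide,
    id_vars=["ACTUAL_DATE", "ZONE_NAME"],
    var_name="HOUR_ENDING",
    value_name="MW",
)

print(long_form.shape)  # 2 rows x 3 hour columns gives (6, 4)
```

&lt;p&gt;Each input row turns into one output row per hour column, which is why the real dataset grows from a few thousand rows per sheet into millions of rows overall.&lt;/p&gt;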

&lt;h2 id=&quot;top-10-usage-days-by-season&quot;&gt;Top 10 Usage Days By Season&lt;/h2&gt;

&lt;p&gt;One of the metrics I used to monitor as part of my job was the top 5/top 10 peak electricity usage days for each Summer (high A/C usage) and Winter (electric space heating) season. Back then, I ran SAS against an enterprise database, and the results would come back &lt;em&gt;eventually&lt;/em&gt;…&lt;/p&gt;

&lt;p&gt;Obviously, it’s not fair to compare the performance of today’s GPUs with late ’90s enterprise databases, but back then it took a non-trivial amount of effort to run this query and keep the report updated. With MapD, I can produce the same report in ~100ms:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;--MapD doesn't currently support window functions, so need to precalculate maximum by day&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qry&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;select&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;actual_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;zone_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MW&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;daily_max_usage&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hourly_loads&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;where&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zone_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'MIDATL'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;actual_date&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;between&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2017-06-01'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2017-09-30'&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;group&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;by&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;select&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;hl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;actual_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;hl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zone_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;hl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hour_ending&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;hl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MW&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hourly_loads&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hl&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;inner&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;join&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qry&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;on&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;actual_date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;actual_date&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;daily_max_usage&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mw&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;order&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;by&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;daily_max_usage&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;desc&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;limit&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/pjm-2017-summer-coincident-peaks.png&quot; alt=&quot;top 10 electric usage&quot; /&gt;&lt;/p&gt;
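
&lt;p&gt;For readers without a GPU handy, the shape of the query above (a CTE that computes each day’s maximum, joined back to the hourly rows to recover the hour in which that maximum occurred) can be tried on a toy table. The following is only a sketch under stated assumptions: it uses Python’s built-in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sqlite3&lt;/code&gt; module as a stand-in for MapD, it spells out the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;group by&lt;/code&gt; columns by name, and the three days of load values are invented for illustration rather than taken from the PJM data.&lt;/p&gt;

```python
import sqlite3

# Toy stand-in for the hourly_loads table; these values are invented for illustration.
rows = [
    ('2017-07-19', 'MIDATL', 16, 54000.0),
    ('2017-07-19', 'MIDATL', 17, 55500.0),
    ('2017-07-20', 'MIDATL', 17, 56200.0),
    ('2017-07-20', 'MIDATL', 18, 55900.0),
    ('2017-07-21', 'MIDATL', 15, 53100.0),
    ('2017-07-21', 'MIDATL', 16, 53800.0),
]

conn = sqlite3.connect(':memory:')
conn.execute(
    'create table hourly_loads ('
    'actual_date text, zone_name text, hour_ending integer, mw real)'
)
conn.executemany('insert into hourly_loads values (?, ?, ?, ?)', rows)

# Step 1 (the CTE): find the maximum load for each day in the summer window.
# Step 2 (the join): pull the hourly row whose load equals that daily maximum.
peak_hours_sql = '''
with qry as (
    select actual_date, zone_name, max(mw) as daily_max_usage
    from hourly_loads
    where zone_name = 'MIDATL'
      and actual_date between '2017-06-01' and '2017-09-30'
    group by actual_date, zone_name
)
select hl.actual_date, hl.zone_name, hl.hour_ending, hl.mw
from hourly_loads as hl
inner join qry
    on qry.actual_date = hl.actual_date
   and qry.daily_max_usage = hl.mw
order by qry.daily_max_usage desc
limit 10
'''

peak_hours = conn.execute(peak_hours_sql).fetchall()
for row in peak_hours:
    print(row)
```

&lt;p&gt;Each output row is one day’s peak hour, sorted from the highest daily peak downward, which is the same structure the results table in the screenshot shows.&lt;/p&gt;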

&lt;p&gt;The thing about returning an answer in 100ms or so is that it’s fast enough that calling these results from a webpage/dashboard would be very responsive; that’s where MapD Immerse comes in.&lt;/p&gt;

&lt;h2 id=&quot;building-a-dashboard-using-mapd-immerse&quot;&gt;Building A Dashboard Using MapD Immerse&lt;/h2&gt;

&lt;p&gt;Rather than copy/pasting the query in and running it, it’s pretty easy to build an automated report using the Immerse dashboard builder. I’m limited to a single data source because I’m using MapD Community Edition, but in just a few minutes I was able to create the following dashboard:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/mapd-immerse-dashboard.png&quot; alt=&quot;mapd immerse dashboard&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I took the query from above and built a view to encapsulate it, so I didn’t have to worry about the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;with&lt;/code&gt; statement or joins; I could just use the view as if the results were pre-calculated. From there, adding in a results table and two bar charts was fairly quick (in the same drag-and-drop style as Tableau or other BI/reporting tools).&lt;/p&gt;

&lt;p&gt;While this dashboard is pretty rudimentary in its design, were this data source set up as real-time using Apache Kafka or similar, this chart would always be up-to-date for use on a TV screen or as a browser bookmark without any additional data or web engineering.&lt;/p&gt;

&lt;p&gt;Obviously, many dashboarding tools exist, but it’s important to note that no pre-aggregation, column indexing or other standard database performance tricks are being employed (outside of specialized hardware and fast GPU RAM caching). Even with 10 dashboard tiles updating serially at 100ms each, you are still in the 1-2s page load range, on par with the fastest-loading dynamic webpages on the internet.&lt;/p&gt;

&lt;h2 id=&quot;programmatic-analytics-using-pymapd&quot;&gt;Programmatic analytics using pymapd&lt;/h2&gt;

&lt;p&gt;While dashboarding can be very effective for keeping senior management up-to-date, the real value of data is unlocked with more in-depth analytics and segmentation. In my next blog post, I’ll cover how to access MapD using &lt;a href=&quot;http://pymapd.readthedocs.io/en/latest/&quot;&gt;pymapd&lt;/a&gt; in Python, doing more advanced visualizations and maybe even some machine learning…&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.4.14 Release Notes</title>
        
          <description>&lt;p&gt;Like the last several updates, this blog post will be fairly short, given only a single bug fix was added.&lt;/p&gt;

</description>
        
        <pubDate>Fri, 16 Feb 2018 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-4-14-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-4-14-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-4-14-release-notes/">&lt;p&gt;Like the last several updates, this blog post will be fairly short, given only a single bug fix was added.&lt;/p&gt;

&lt;p&gt;Thanks again to GitHub user &lt;a href=&quot;https://github.com/leocwlau&quot;&gt;leocwlau&lt;/a&gt; who &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues/232&quot;&gt;reported&lt;/a&gt; that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetReportSuiteGroups&lt;/code&gt; function added an additional field AND provided the &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/pull/233&quot;&gt;solution&lt;/a&gt;. No other bug fixes were made, nor was any additional functionality added.&lt;/p&gt;

&lt;p&gt;Version 1.4.14 of RSiteCatalyst was submitted to CRAN today and should be available for download in the coming days.&lt;/p&gt;

&lt;h2 id=&quot;community-contributions&quot;&gt;Community Contributions&lt;/h2&gt;
&lt;p&gt;As I’ve mentioned in many a blog post before this one, I encourage all users of the software to continue reporting bugs via &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues&quot;&gt;GitHub issues&lt;/a&gt;, especially if you can provide a working code example. Even better, a fix via pull request will ensure that your bug is addressed in a timely manner and for the benefit of others in the community.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Please don’t email directly via the email in the RSiteCatalyst package; it will not be returned. Having a valid email contact in the package is a requirement for listing a package on CRAN so that CRAN can contact the package author; it is not meant to imply I can/will provide endless, personalized support for free.&lt;/em&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Getting Started With OmniSci, Part 1: Docker Install and Loading Data</title>
        
          <description>&lt;p&gt;It’s been nearly five years since I wrote about &lt;a href=&quot;http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/&quot;&gt;Getting Started with Hadoop&lt;/a&gt; for big data. In those years, there have been incremental improvements in columnar file formats and dramatic computation speed improvements with Apache Spark, but I still wouldn’t call the Hadoop ecosystem convenient for actual data &lt;em&gt;analysis&lt;/em&gt;. During this same time period, thanks to &lt;a href=&quot;https://developer.nvidia.com/&quot;&gt;NVIDIA&lt;/a&gt; and their &lt;a href=&quot;https://devblogs.nvidia.com/even-easier-introduction-cuda/&quot;&gt;CUDA library&lt;/a&gt; for general-purpose calculations on GPUs, graphics cards went from enabling visuals on a computer to enabling massively-parallel calculations as well.&lt;/p&gt;

</description>
        
        <pubDate>Thu, 01 Feb 2018 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/mapd-install-load-data/</link>
        <guid isPermaLink="true">http://randyzwitch.com/mapd-install-load-data/</guid>
        <content type="html" xml:base="/mapd-install-load-data/">&lt;p&gt;It’s been nearly five years since I wrote about &lt;a href=&quot;http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/&quot;&gt;Getting Started with Hadoop&lt;/a&gt; for big data. In those years, there have been incremental improvements in columnar file formats and dramatic computation speed improvements with Apache Spark, but I still wouldn’t call the Hadoop ecosystem convenient for actual data &lt;em&gt;analysis&lt;/em&gt;. During this same time period, thanks to &lt;a href=&quot;https://developer.nvidia.com/&quot;&gt;NVIDIA&lt;/a&gt; and their &lt;a href=&quot;https://devblogs.nvidia.com/even-easier-introduction-cuda/&quot;&gt;CUDA library&lt;/a&gt; for general-purpose calculations on GPUs, graphics cards went from enabling visuals on a computer to enabling massively-parallel calculations as well.&lt;/p&gt;

&lt;p&gt;Building upon CUDA is &lt;a href=&quot;https://www.omnisci.com/&quot;&gt;MapD&lt;/a&gt;, an analytics platform that allows for super-fast SQL queries and interactive visualizations. In this blog post, I’ll show how to use Docker to install &lt;a href=&quot;https://www.omnisci.com/blog/2017/05/08/mapd-open-sources-gpu-powered-database/&quot;&gt;MapD Community Edition&lt;/a&gt; and load &lt;a href=&quot;http://www.pjm.com/markets-and-operations/ops-analysis/historical-load-data.aspx&quot;&gt;hourly electricity demand&lt;/a&gt; data to analyze.&lt;/p&gt;

&lt;h2 id=&quot;installing-mapd-ce-using-dockernvidia-docker&quot;&gt;Installing MapD CE using Docker/nvidia-docker&lt;/h2&gt;

&lt;p&gt;While CUDA makes it &lt;em&gt;possible&lt;/em&gt; to do calculations on GPUs, I wouldn’t go so far as to say it is easy, and that includes just getting everything installed! Luckily, there are Docker and &lt;a href=&quot;https://devblogs.nvidia.com/nvidia-docker-gpu-server-application-deployment-made-easy/&quot;&gt;nvidia-docker&lt;/a&gt;, which provide all-in-one &lt;em&gt;containers&lt;/em&gt; with all necessary drivers and libraries installed to build upon. MapD provides instructions for installing &lt;a href=&quot;https://www.mapd.com/docs/latest/getting-started/docker-gpu-ce-recipe/&quot;&gt;MapD CE using nvidia-docker&lt;/a&gt;, with the main installation command as follows:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;##nvidia-docker version 2&lt;/span&gt;
docker run &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nvidia &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/mapd-docker-storage:/mapd-storage &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; 9090-9092:9090-9092 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
mapd/mapd-ce-cuda
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;When you kick off this command (I’m using an SSH terminal into a &lt;a href=&quot;http://randyzwitch.com/building-data-science-workstation-2017/&quot;&gt;remote Ubuntu desktop&lt;/a&gt;), Docker will download all the required images from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mapd/mapd-ce-cuda&lt;/code&gt; repository and start a background process for the MapD database and the Immerse visualization interface/web server:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/docker-dl-images.png&quot; alt=&quot;docker images&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Once all of the images are downloaded, you can find the container that was created using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;docker container ls&lt;/code&gt;, then run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;docker exec -it &amp;lt;container id&amp;gt; bash&lt;/code&gt; to drop into a Bash shell on the already-running container. At this point, MapD Community Edition is up and running!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/docker-container-ls.png&quot; alt=&quot;docker ls&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;loading-data-using-the-immerse-interface&quot;&gt;Loading Data Using the Immerse Interface&lt;/h2&gt;

&lt;p&gt;Once the Bash shell opens in the terminal, you can now interact with MapD via the Docker container. However, for beginning exploration, it’s much simpler to use the Immerse Web Interface at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;localhost:9092&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/mapd-immerse.png&quot; alt=&quot;mapd immerse&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Uploading data via the Data Manager interface is reasonably performant for smaller files; a test file with four columns and a million or so rows loaded in a few seconds (dependent on your upload speed, obviously):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/mapd-import-table.png&quot; alt=&quot;mapd data manager&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Edit the column names and types if you want (the CSV reader gets it right for me most of the time). Then, once the ‘Save Table’ button is clicked, MapD will import the CSV data into a columnar binary format, so that the GPU can operate directly on the data rather than re-reading the CSV for each query.&lt;/p&gt;

&lt;h2 id=&quot;loading-data-using-the-command-line&quot;&gt;Loading Data Using the Command Line&lt;/h2&gt;

&lt;p&gt;While browser GUIs are great for some things, I’m still very much a command-line guy, at least for things like loading data. MapD provides the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mapdql&lt;/code&gt; interface to load and query data, very much like psql for Postgres and other databases. To load my dataset of 4.9 million rows by 4 columns, I used the following commands:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker container &lt;span class=&quot;nb&quot;&gt;ls
&lt;/span&gt;CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                              NAMES
51be4b888448        mapd/mapd-ce-cuda   &lt;span class=&quot;s2&quot;&gt;&quot;/bin/sh -c '/mapd/s…&quot;&lt;/span&gt;   44 hours ago        Up 44 hours         0.0.0.0:9090-9092-&amp;gt;9090-9092/tcp   nifty_heisenberg

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; 51be4b888448 bash
root@1f64b2dcc316:/mapd# bin/mapdql
Password: &amp;lt;default is &lt;span class=&quot;s2&quot;&gt;&quot;HyperInteractive&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
User mapd connected to database mapd

mapdql&amp;gt; create table hourly_loads&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
..&amp;gt; ACTUAL_DATE DATE,
..&amp;gt; ZONE_NAME TEXT,
..&amp;gt; HOUR_ENDING SMALLINT,
..&amp;gt; MW FLOAT&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

mapdql&amp;gt; copy hourly_loads from &lt;span class=&quot;s1&quot;&gt;'hourly_loads.csv'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
Result
Loaded: 4898472 recs, Rejected: 0 recs &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;0.923000 secs
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.mapd.com/docs/latest/mapd-core-guide/data-definition/&quot;&gt;DDL&lt;/a&gt; for MapD seems pretty much the same as every other database language. First you define a table’s columns and their types, then you can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;copy&lt;/code&gt; command to load data from a CSV. The statement that prints upon success begins to give an indication of the speed MapD provides, loading nearly 5 million records in less than a second.&lt;/p&gt;

&lt;h2 id=&quot;simplistic-query-performance&quot;&gt;Simplistic Query Performance&lt;/h2&gt;

&lt;p&gt;Up to this point, I’ve intentionally not described the data I uploaded into MapD; in my next post, I’ll cover the dataset I’m using and how I converted the data from Excel spreadsheets into a CSV. But before ending this post, I wanted to show a brief summary of the performance of MapD:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/mapd-query-speed.png&quot; alt=&quot;mapd query speed&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The first query shows a simple record count by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hour_ending&lt;/code&gt; dimension in my table, something you might run if you weren’t too familiar with the table. You’ll notice that running this &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;group by&lt;/code&gt; across the 5 million row dataset took 5143ms, which isn’t so fast. What’s going on?&lt;/p&gt;

&lt;p&gt;Because this is the first query from a cold start, MapD needs to load data into GPU RAM. So while the first query takes a few seconds, the second query displays a &lt;em&gt;warmed-up&lt;/em&gt; level of performance: 212ms to scan 5 million rows, filter by a few values of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zone_name&lt;/code&gt; column, then group by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hour_ending&lt;/code&gt;. For reference, a &lt;a href=&quot;https://sciencing.com/fast-blink-eye-5199669.html&quot;&gt;human blink takes 100-400 ms&lt;/a&gt;, so this second query quite literally finished in the blink of an eye…&lt;/p&gt;
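
&lt;p&gt;To separate cold-start cost from warmed-up performance in your own tests, one option is to run the same statement several times and record each wall-clock time. The sketch below is an illustration under stated assumptions, not MapD tooling: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;time_query&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_query&lt;/code&gt; are hypothetical helpers, and an in-memory &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sqlite3&lt;/code&gt; database with generated rows stands in for the real connection, so its numbers say nothing about GPU performance.&lt;/p&gt;

```python
import sqlite3
import time

def time_query(run_query, sql, repeats=3):
    '''Run sql repeatedly; return the elapsed milliseconds of each run, in order.'''
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(sql)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms

# Stand-in database so the sketch runs anywhere; the rows are generated, not real load data.
conn = sqlite3.connect(':memory:')
conn.execute('create table hourly_loads (hour_ending integer, mw real)')
conn.executemany(
    'insert into hourly_loads values (?, ?)',
    [(hour % 24 + 1, float(hour)) for hour in range(10000)],
)

def run_query(sql):
    # Hypothetical wrapper; swap in a real database client call to test another engine.
    return conn.execute(sql).fetchall()

count_by_hour = 'select hour_ending, count(*) from hourly_loads group by hour_ending'
timings = time_query(run_query, count_by_hour)

# The first entry includes any one-time warm-up cost; later entries are warm runs.
for run_number, elapsed_ms in enumerate(timings, start=1):
    print('run', run_number, round(elapsed_ms, 3), 'ms')
```

&lt;p&gt;Comparing the first timing against the later ones gives a rough picture of how much of a slow first query is one-time setup rather than steady-state cost.&lt;/p&gt;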

&lt;h2 id=&quot;dashboards-streaming-data-and-more&quot;&gt;Dashboards, Streaming Data and more…&lt;/h2&gt;

&lt;p&gt;This first blog post just scratched the surface of what is possible using just the Community Edition of MapD. In future blog posts, I will provide the code to create the dataset, do some basic descriptive statistics, and even do some analysis and dashboarding of historical electricity demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update, 2/1/2018 4:49 p.m.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Per Todd Mostak from MapD, the second query would likely have run even faster than 212ms had I run it again:&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-conversation=&quot;none&quot; data-cards=&quot;hidden&quot; data-partner=&quot;tweetdeck&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Also fyi that 212ms query will likely run faster if you run it again (even changing the literals). MapD compiles queries the first time it sees a new query plan but then can reuse the same compiled code if you change literals (like for zone_name).&lt;/p&gt;&amp;mdash; Todd Mostak (@ToddMostak) &lt;a href=&quot;https://twitter.com/ToddMostak/status/959181487848525824?ref_src=twsrc%5Etfw&quot;&gt;February 1, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;</content>
      </item>
      
    
      
      <item>
        <title>Parallelizing Distance Calculations Using A GPU With CUDAnative.jl</title>
        
          <description>&lt;p&gt;Hacker News discussion: &lt;a href=&quot;https://news.ycombinator.com/item?id=15021244&quot;&gt;link&lt;/a&gt;&lt;/p&gt;

</description>
        
        <pubDate>Mon, 14 Aug 2017 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/cudanative-jl-julia/</link>
        <guid isPermaLink="true">http://randyzwitch.com/cudanative-jl-julia/</guid>
        <content type="html" xml:base="/cudanative-jl-julia/">&lt;p&gt;Hacker News discussion: &lt;a href=&quot;https://news.ycombinator.com/item?id=15021244&quot;&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://randyzwitch.com/notebooks/cudanative_haversine_julia_example.ipynb&quot;&gt;Code as Julia Jupyter Notebook&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Julia has the reputation as a “fast” language in that it’s possible to write high-performing code. However, what I appreciate most about Julia is not just that the code is fast, but rather that Julia makes high-performance concepts &lt;em&gt;accessible&lt;/em&gt; without having to have a deep computer science or compiled language background (neither of which I possess!)&lt;/p&gt;

&lt;p&gt;For version 0.6 of Julia, another milestone has been reached in the “accessible” high-performance category: the ability to &lt;a href=&quot;https://julialang.org/blog/2017/03/cudanative&quot;&gt;run Julia code natively on NVIDIA GPUs&lt;/a&gt; through the &lt;a href=&quot;https://github.com/JuliaGPU/CUDAnative.jl&quot;&gt;CUDAnative.jl&lt;/a&gt; package. While CUDAnative.jl is still very much in its development stages, the package is already far enough along that within a few hours, as a complete beginner to GPU programming, I was able to see in excess of 20x speedups for my toy example to calculate haversine distance.&lt;/p&gt;

&lt;h2 id=&quot;getting-started&quot;&gt;Getting Started&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://julialang.org/blog/2017/03/cudanative&quot;&gt;CUDAnative.jl introduction blog post&lt;/a&gt; and &lt;a href=&quot;http://juliagpu.github.io/CUDAnative.jl/stable/#Installation-1&quot;&gt;documentation&lt;/a&gt; cover the installation process in-depth, so I won’t repeat the details here. I’m already a regular compile-from-source Julia user and I found the installation process pretty easy on my &lt;a href=&quot;http://randyzwitch.com/building-data-science-workstation-2017/&quot;&gt;CUDA-enabled Ubuntu workstation&lt;/a&gt;. If you can already do TensorFlow, Keras or other GPU tutorials on your computer, getting CUDAnative.jl to work shouldn’t take more than 10-15 minutes.&lt;/p&gt;

&lt;h2 id=&quot;julia-cpu-implementation&quot;&gt;Julia CPU Implementation&lt;/h2&gt;

&lt;p&gt;To get a feel for what sort of speedup I could expect from using a GPU, I wrote a naive implementation of a distance matrix calculation in Julia:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;#https://github.com/quinnj/Rosetta-Julia/blob/master/src/Haversine.jl&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;haversine&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lon1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lon2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;6372.8&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;asin&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sind&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;^&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cosd&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;lat1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cosd&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sind&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lon2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lon1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;^&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; pairwise_dist&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Vector&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;},&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lon&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Vector&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;})&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;#Pre-allocate, since size is known&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;#Brute force fill in each cell, ignore that distance [i,j] = distance [j,i]&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;
            &lt;span class=&quot;nd&quot;&gt;@inbounds&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;haversine&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lon&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lat&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lon&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;])&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Example benchmark call&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;lat10000&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10000&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;.*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;45&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;lon10000&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10000&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;.*&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;120&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@time&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;native_julia_cellwise&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pairwise_dist&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat10000&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lon10000&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The above code takes vectors of lat/lon values, then calculates the &lt;a href=&quot;https://rosettacode.org/wiki/Haversine_formula&quot;&gt;haversine distance&lt;/a&gt; between every pair of points. This algorithm is naive in that a distance matrix is symmetric (i.e. the distance from A to B is the same as from B to A), so I could’ve done half the work by setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;result[i,j]&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;result[j,i]&lt;/code&gt; to the same value, but as a measure of work for a benchmark this toy example is fine. Also note that this implementation runs on a single core; no CPU-core-level parallelization has been implemented.&lt;/p&gt;

&lt;p&gt;Or to put all that another way: if someone wanted to tackle this problem without thinking very hard, the implementation might look like this.&lt;/p&gt;

&lt;h2 id=&quot;cudanativejl-implementation&quot;&gt;CUDAnative.jl Implementation&lt;/h2&gt;

&lt;p&gt;There are two parts to the CUDAnative.jl implementation: the kernel (i.e. the actual calculation) and the boilerplate code for coordinating data transfers between the CPU and GPU.&lt;/p&gt;

&lt;h4 id=&quot;kernel-code&quot;&gt;Kernel Code&lt;/h4&gt;

&lt;p&gt;The kernel code has similarities to the CPU implementation, with a few key differences:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The method signature takes one lat/lon point plus the full lat/lon vectors, so each call computes one point against all the others rather than the whole pairwise distance matrix&lt;/li&gt;
  &lt;li&gt;Boilerplate code for the thread index on the GPU (0-indexed vs. normal Julia 1-indexing)&lt;/li&gt;
  &lt;li&gt;The trigonometric functions need to be prefixed with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CUDAnative.&lt;/code&gt;, since the GPU functions aren’t the same as the functions from Base Julia&lt;/li&gt;
  &lt;li&gt;Rather than returning an array from the function, we write directly to GPU memory through the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;out&lt;/code&gt; argument&lt;/li&gt;
&lt;/ul&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CUDAnative&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CUDAdrv&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Calculate one point vs. all other points simultaneously&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; kernel_haversine&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;latpoint&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lonpoint&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;AbstractVector&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;},&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lon&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;AbstractVector&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;},&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;AbstractVector&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;})&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;#Thread index&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;#Need to do the n-1 dance, since CUDA expects 0 and Julia does 1-indexing&lt;/span&gt;
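    &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockDim&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;threadIdx&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;nothing&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#Surplus threads in the last block have no element to process&lt;/span&gt;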
    &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;6372.8&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CUDAnative&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;asin&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CUDAnative&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CUDAnative&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sind&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;latpoint&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;^&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CUDAnative&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cosd&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;])&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CUDAnative&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cosd&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;latpoint&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CUDAnative&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sind&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lonpoint&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lon&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;^&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;))&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;#Return nothing, since we're writing directly to the out array allocated on GPU&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;nothing&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
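
&lt;p&gt;The index arithmetic is the easiest part of the kernel to get wrong, so here is a CPU-only sketch of it. The helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;global_index&lt;/code&gt; is hypothetical: it just replays the kernel’s formula with ordinary integers standing in for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blockIdx()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blockDim()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;threadIdx()&lt;/code&gt;, and assumes the launch sizes used in the coordination code below (at most 1024 threads per block).&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;# Replays i = (blockIdx().x - 1) * blockDim().x + threadIdx().x on the CPU
# All three inputs are 1-based, matching how the kernel uses them
global_index(block, blockdim, thread) = (block - 1) * blockdim + thread

n = 10000                      # number of points
threads = min(n, 1024)         # threads per block
blocks = cld(n, threads)       # ceiling division: enough blocks to cover n

first_index = global_index(1, threads, 1)
last_index = global_index(blocks, threads, threads)
surplus = last_index - n       # threads whose index falls past the end of the arrays

@show blocks first_index last_index surplus
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With these numbers, 10 blocks of 1024 threads cover indices 1 through 10240, so the last block has 240 threads with no element to work on. That is why kernels of this shape usually compare &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i&lt;/code&gt; against the array length before reading or writing.&lt;/p&gt;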

&lt;h4 id=&quot;coordination-code&quot;&gt;Coordination Code&lt;/h4&gt;

&lt;p&gt;The coordination code is similar to what you might see in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;main()&lt;/code&gt; function in C or Java, where the kernel is applied to the input data. I am using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dev&lt;/code&gt; keyword argument, with a default value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CuDevice(0)&lt;/code&gt;, to indicate that the code should run on the first (in my case, only) GPU device.&lt;/p&gt;

&lt;p&gt;The remainder of the code has comments on its purpose, primarily:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Transfer Julia CPU arrays to GPU arrays (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CuArray&lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;Set number of threads/blocks&lt;/li&gt;
  &lt;li&gt;Calculate distance between a point and all other points in the array, write back to CPU&lt;/li&gt;
&lt;/ul&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;#validated kernel_haversine/distmat returns same answer as CPU haversine method (not shown)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; distmat&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Vector&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;},&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lon&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Vector&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;};&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dev&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CuDevice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CuDevice&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;))&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;#Create a context&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CuContext&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dev&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;#Change to objects with CUDA context&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;d_lat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CuArray&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;d_lon&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CuArray&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lon&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;d_out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CuArray&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Vector&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;))&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;#Calculate number of calculations, threads, blocks&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;len&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;threads&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1024&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;blocks&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cld&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;threads&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;#Julia side accumulation of results to relieve GPU memory pressure&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;accum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;# run and time the test&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;secs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CUDAdrv&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;@elapsed&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;
            &lt;span class=&quot;nd&quot;&gt;@cuda&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blocks&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;threads&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_haversine&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lon&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_lat&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_lon&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_out&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;accum&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Vector&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_out&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;#Clean up context&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;destroy!&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;#Return timing and bring results back to Julia&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;secs&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;accum&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Example benchmark call&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;timing&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;distmat&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lat10000&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lon10000&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;≈&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;native_julia_cellwise&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#validate results equivalent CPU and GPU&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The code is written to process one row of the distance matrix at a time to minimize GPU memory usage. By writing out the results to the CPU after each loop iteration, I have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;n-1&lt;/code&gt; extra CPU transfers. That is less performant than calculating all the distances first and then transferring once, but otherwise my consumer-grade GPU with 6GB of RAM would run out of memory before completing the calculation.&lt;/p&gt;
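
&lt;p&gt;A rough back-of-the-envelope check of that trade-off, assuming the largest benchmark size of 100000 points and 4-byte &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Float32&lt;/code&gt; values (the variable names are just for illustration):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;n = 100000
bytes_per_value = sizeof(Float32)             # 4 bytes

full_matrix_gb = n^2 * bytes_per_value / 1e9  # every distance resident on the GPU at once
one_row_gb = n * bytes_per_value / 1e9        # what d_out holds per loop iteration

@show full_matrix_gb one_row_gb
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;That works out to roughly 40GB for the full matrix versus well under a megabyte per row, which is why only one row at a time lives on the 6GB card.&lt;/p&gt;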

&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;

&lt;p&gt;The performance characteristics of the CPU and GPU calculations are below for various sizes of distance matrices. Having not done any GPU calculations before, I was surprised to see how much of a penalty there is for writing back and forth to the GPU. As you can see from the navy-blue line, the execution time is roughly constant (about 0.14 to 0.17 seconds) for matrices of size 1 to 1000, representing the fixed cost of moving the data from the CPU to the GPU.&lt;/p&gt;

&lt;p&gt;Of course, once we get above 1000x1000 matrices, the GPU really starts to shine. Due to the log scale, it’s a bit hard to see the magnitude differences, but at 100000x100000 there is a &lt;strong&gt;23x&lt;/strong&gt; reduction in execution time (565.008s CPU vs. 24.32s GPU).&lt;/p&gt;

&lt;div id=&quot;linep&quot; style=&quot;height:400px;width:800px;&quot;&gt;&lt;/div&gt;
&lt;script type=&quot;text/javascript&quot;&gt;
    // Initialize after dom ready
    var myChart = echarts.init(document.getElementById(&quot;linep&quot;));

    // Load data into the ECharts instance
    myChart.setOption(
{&quot;xAxis&quot;:[{&quot;splitNumber&quot;:5,&quot;axisLine&quot;:{&quot;show&quot;:false,&quot;onZero&quot;:true,&quot;lineStyle&quot;:{&quot;normal&quot;:{},&quot;emphasis&quot;:{}}},&quot;axisLabel&quot;:{&quot;show&quot;:true,&quot;interval&quot;:&quot;auto&quot;,&quot;rotate&quot;:0,&quot;inside&quot;:false,&quot;formatter&quot;:&quot;{value}&quot;,&quot;margin&quot;:8},&quot;scale&quot;:true,&quot;gridIndex&quot;:0,&quot;name&quot;:&quot;Matrix dimensions (square)&quot;,&quot;minInterval&quot;:0,&quot;zlevel&quot;:0,&quot;triggerEvent&quot;:false,&quot;z&quot;:0,&quot;splitLine&quot;:{&quot;show&quot;:false,&quot;interval&quot;:&quot;auto&quot;,&quot;lineStyle&quot;:{&quot;normal&quot;:{},&quot;emphasis&quot;:{}}},&quot;inverse&quot;:false,&quot;nameLocation&quot;:&quot;middle&quot;,&quot;nameGap&quot;:30,&quot;silent&quot;:true,&quot;type&quot;:&quot;log&quot;}],&quot;ec_charttype&quot;:&quot;xy plot&quot;,&quot;series&quot;:[{&quot;name&quot;:&quot;CPU&quot;,&quot;yAxisIndex&quot;:0,&quot;xAxisIndex&quot;:0,&quot;smooth&quot;:true,&quot;data&quot;:[[1.0,6.0e-6],[10.0,1.7e-5],[100.0,0.001091],[1000.0,0.090409],[10000.0,5.620437],[100000.0,565.008425]],&quot;markLine&quot;:{&quot;data&quot;:[],&quot;lineStyle&quot;:{&quot;normal&quot;:{},&quot;emphasis&quot;:{}}},&quot;large&quot;:true,&quot;type&quot;:&quot;line&quot;,&quot;largeThreshold&quot;:2000},{&quot;name&quot;:&quot;GPU&quot;,&quot;yAxisIndex&quot;:0,&quot;xAxisIndex&quot;:0,&quot;smooth&quot;:true,&quot;data&quot;:[[1.0,0.14232168],[10.0,0.15084915],[100.0,0.15897949],[1000.0,0.16998644],[10000.0,0.6376571],[100000.0,24.32015]],&quot;markLine&quot;:{&quot;data&quot;:[],&quot;lineStyle&quot;:{&quot;normal&quot;:{},&quot;emphasis&quot;:{}}},&quot;large&quot;:true,&quot;type&quot;:&quot;line&quot;,&quot;largeThreshold&quot;:2000}],&quot;theme&quot;:{&quot;geo&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#000000&quot;}},&quot;emphasis&quot;:{&quot;textSty
le&quot;:{&quot;color&quot;:&quot;rgb(100,0,0)&quot;}}},&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderColor&quot;:&quot;#444444&quot;,&quot;borderWidth&quot;:0.5,&quot;areaColor&quot;:&quot;#eeeeee&quot;},&quot;emphasis&quot;:{&quot;borderColor&quot;:&quot;#444444&quot;,&quot;borderWidth&quot;:1,&quot;areaColor&quot;:&quot;rgba(255,215,0,0.8)&quot;}}},&quot;parallel&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;markPoint&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#eeeeee&quot;}},&quot;emphasis&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#eeeeee&quot;}}}},&quot;visualMap&quot;:{&quot;color&quot;:[&quot;#e01f54&quot;,&quot;#e7dbc3&quot;]},&quot;funnel&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;bar&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;barBorderColor&quot;:&quot;#ccc&quot;,&quot;barBorderWidth&quot;:0},&quot;emphasis&quot;:{&quot;barBorderColor&quot;:&quot;#ccc&quot;,&quot;barBorderWidth&quot;:0}}},&quot;map&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#000000&quot;}},&quot;emphasis&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;rgb(100,0,0)&quot;}}},&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderColor&quot;:&quot;#444444&quot;,&quot;borderWidth&quot;:0.5,&quot;areaColor&quot;:&quot;#eeeeee&quot;},&quot;emphasis&quot;:{&quot;borderColor&quot;:&quot;#444444&quot;,&quot;borderWidth&quot;:1,&quot;areaColor&quot;:&quot;rgba(255,215,0,0.8)&quot;}}},&quot;scatter&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphas
is&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;pie&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;graph&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#eeeeee&quot;}}},&quot;symbolSize&quot;:4,&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}},&quot;smooth&quot;:false,&quot;symbol&quot;:&quot;emptyCircle&quot;,&quot;color&quot;:[&quot;#e01f54&quot;,&quot;#001852&quot;,&quot;#f5e8c8&quot;,&quot;#b8d2c7&quot;,&quot;#c6b38e&quot;,&quot;#a4d8c2&quot;,&quot;#f3d999&quot;,&quot;#d3758f&quot;,&quot;#dcc392&quot;,&quot;#2e4783&quot;,&quot;#82b6e9&quot;,&quot;#ff6347&quot;,&quot;#a092f1&quot;,&quot;#0a915d&quot;,&quot;#eaf889&quot;,&quot;#6699FF&quot;,&quot;#ff6666&quot;,&quot;#3cb371&quot;,&quot;#d5b158&quot;,&quot;#38b6b6&quot;],&quot;lineStyle&quot;:{&quot;normal&quot;:{&quot;color&quot;:&quot;#aaaaaa&quot;,&quot;width&quot;:1}}},&quot;backgroundColor&quot;:&quot;rgba(0,0,0,0)&quot;,&quot;line&quot;:{&quot;symbolSize&quot;:4,&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:1}},&quot;smooth&quot;:false,&quot;symbol&quot;:&quot;emptyCircle&quot;,&quot;lineStyle&quot;:{&quot;normal&quot;:{&quot;width&quot;:2}}},&quot;candlestick&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderColor0&quot;:&quot;#b8d2c7&quot;,&quot;color&quot;:&quot;#e01f54&quot;,&quot;borderColor&quot;:&quot;#f5e8c8&quot;,&quot;borderWidth&quot;:1,&quot;color0&quot;:&quot;#001852&quot;}}},&quot;sankey&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;valueAxis&quot;:{&quot;axisLine&quot;:{&quot;show&quot;
:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}},&quot;axisLabel&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333&quot;},&quot;show&quot;:true},&quot;splitLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:[&quot;#ccc&quot;]}},&quot;splitArea&quot;:{&quot;areaStyle&quot;:{&quot;color&quot;:[&quot;rgba(250,250,250,0.3)&quot;,&quot;rgba(200,200,200,0.3)&quot;]},&quot;show&quot;:false},&quot;axisTick&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}}},&quot;toolbox&quot;:{&quot;iconStyle&quot;:{&quot;normal&quot;:{&quot;borderColor&quot;:&quot;#999999&quot;},&quot;emphasis&quot;:{&quot;borderColor&quot;:&quot;#666666&quot;}}},&quot;categoryAxis&quot;:{&quot;axisLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}},&quot;axisLabel&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333&quot;},&quot;show&quot;:true},&quot;splitLine&quot;:{&quot;show&quot;:false,&quot;lineStyle&quot;:{&quot;color&quot;:[&quot;#ccc&quot;]}},&quot;splitArea&quot;:{&quot;areaStyle&quot;:{&quot;color&quot;:[&quot;rgba(250,250,250,0.3)&quot;,&quot;rgba(200,200,200,0.3)&quot;]},&quot;show&quot;:false},&quot;axisTick&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}}},&quot;tooltip&quot;:{&quot;axisPointer&quot;:{&quot;crossStyle&quot;:{&quot;color&quot;:&quot;#cccccc&quot;,&quot;width&quot;:1},&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#cccccc&quot;,&quot;width&quot;:1}}},&quot;timeline&quot;:{&quot;label&quot;:{&quot;normal&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#293c55&quot;}},&quot;emphasis&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#293c55&quot;}}},&quot;controlStyle&quot;:{&quot;normal&quot;:{&quot;color&quot;:&quot;#293c55&quot;,&quot;borderColor&quot;:&quot;#293c55&quot;,&quot;borderWidth&quot;:0.5},&quot;emphasis&quot;:{&quot;color&quot;:&quot;#293c55&quot;,&quot;borderColor&quot;:&quot;#29
3c55&quot;,&quot;borderWidth&quot;:0.5}},&quot;checkpointStyle&quot;:{&quot;color&quot;:&quot;#e43c59&quot;,&quot;borderColor&quot;:&quot;rgba(194,53,49,0.5)&quot;},&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;color&quot;:&quot;#293c55&quot;,&quot;borderWidth&quot;:1},&quot;emphasis&quot;:{&quot;color&quot;:&quot;#a9334c&quot;}},&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#293c55&quot;,&quot;width&quot;:1}},&quot;radar&quot;:{&quot;symbolSize&quot;:4,&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:1}},&quot;smooth&quot;:false,&quot;symbol&quot;:&quot;emptyCircle&quot;,&quot;lineStyle&quot;:{&quot;normal&quot;:{&quot;width&quot;:2}}},&quot;logAxis&quot;:{&quot;axisLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}},&quot;axisLabel&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333&quot;},&quot;show&quot;:true},&quot;splitLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:[&quot;#ccc&quot;]}},&quot;splitArea&quot;:{&quot;areaStyle&quot;:{&quot;color&quot;:[&quot;rgba(250,250,250,0.3)&quot;,&quot;rgba(200,200,200,0.3)&quot;]},&quot;show&quot;:false},&quot;axisTick&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}}},&quot;textStyle&quot;:{},&quot;gauge&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;},&quot;emphasis&quot;:{&quot;borderWidth&quot;:0,&quot;borderColor&quot;:&quot;#ccc&quot;}}},&quot;boxplot&quot;:{&quot;itemStyle&quot;:{&quot;normal&quot;:{&quot;borderWidth&quot;:1},&quot;emphasis&quot;:{&quot;borderWidth&quot;:2}}},&quot;color&quot;:[&quot;#e01f54&quot;,&quot;#001852&quot;,&quot;#f5e8c8&quot;,&quot;#b8d2c7&quot;,&quot;#c6b38e&quot;,&quot;#a4d8c2&quot;,&quot;#f3d999&quot;,&quot;#d3758f&quot;,&quot;#dcc392&quot;,&quot;#2e4783&quot;,&quot;#82b6e9&quot;,&quot;#ff6347&quot;,&quot;#a092f1&quot;,&quot;#0a915d&quot;,&quot;#eaf889&quot;,&quot;#6699FF&quot;,&quo
t;#ff6666&quot;,&quot;#3cb371&quot;,&quot;#d5b158&quot;,&quot;#38b6b6&quot;],&quot;title&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333333&quot;},&quot;subtextStyle&quot;:{&quot;color&quot;:&quot;#aaaaaa&quot;}},&quot;dataZoom&quot;:{&quot;dataBackgroundColor&quot;:&quot;rgba(47,69,84,0.3)&quot;,&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333333&quot;},&quot;handleSize&quot;:&quot;100%&quot;,&quot;handleColor&quot;:&quot;#a7b7cc&quot;,&quot;fillerColor&quot;:&quot;rgba(167,183,204,0.4)&quot;,&quot;backgroundColor&quot;:&quot;rgba(47,69,84,0)&quot;},&quot;timeAxis&quot;:{&quot;axisLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}},&quot;axisLabel&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333&quot;},&quot;show&quot;:true},&quot;splitLine&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:[&quot;#ccc&quot;]}},&quot;splitArea&quot;:{&quot;areaStyle&quot;:{&quot;color&quot;:[&quot;rgba(250,250,250,0.3)&quot;,&quot;rgba(200,200,200,0.3)&quot;]},&quot;show&quot;:false},&quot;axisTick&quot;:{&quot;show&quot;:true,&quot;lineStyle&quot;:{&quot;color&quot;:&quot;#333&quot;}}},&quot;legend&quot;:{&quot;textStyle&quot;:{&quot;color&quot;:&quot;#333333&quot;}}},&quot;yAxis&quot;:[{&quot;splitNumber&quot;:5,&quot;axisLine&quot;:{&quot;show&quot;:false,&quot;onZero&quot;:true,&quot;lineStyle&quot;:{&quot;normal&quot;:{},&quot;emphasis&quot;:{}}},&quot;axisLabel&quot;:{&quot;show&quot;:true,&quot;interval&quot;:&quot;auto&quot;,&quot;rotate&quot;:0,&quot;inside&quot;:false,&quot;formatter&quot;:&quot;{value}&quot;,&quot;margin&quot;:8},&quot;scale&quot;:true,&quot;gridIndex&quot;:0,&quot;name&quot;:&quot;Time in 
seconds&quot;,&quot;minInterval&quot;:0,&quot;zlevel&quot;:0,&quot;triggerEvent&quot;:false,&quot;z&quot;:0,&quot;inverse&quot;:false,&quot;nameLocation&quot;:&quot;middle&quot;,&quot;nameGap&quot;:50,&quot;silent&quot;:true,&quot;type&quot;:&quot;log&quot;}],&quot;toolbox&quot;:{&quot;feature&quot;:{},&quot;orient&quot;:&quot;vertical&quot;,&quot;itemSize&quot;:15,&quot;height&quot;:&quot;auto&quot;,&quot;zlevel&quot;:0,&quot;z&quot;:2,&quot;itemGap&quot;:20,&quot;right&quot;:&quot;auto&quot;,&quot;top&quot;:&quot;center&quot;,&quot;width&quot;:&quot;auto&quot;,&quot;show&quot;:false,&quot;showTitle&quot;:true},&quot;ec_width&quot;:800,&quot;ec_height&quot;:400,&quot;grid&quot;:[{&quot;height&quot;:&quot;auto&quot;,&quot;show&quot;:false,&quot;width&quot;:&quot;auto&quot;,&quot;backgroundColor&quot;:&quot;transparent&quot;}],&quot;title&quot;:[{&quot;left&quot;:&quot;center&quot;,&quot;borderColor&quot;:&quot;transparent&quot;,&quot;bottom&quot;:&quot;auto&quot;,&quot;padding&quot;:5,&quot;zlevel&quot;:0,&quot;borderWidth&quot;:1,&quot;target&quot;:&quot;blank&quot;,&quot;z&quot;:2,&quot;itemGap&quot;:5,&quot;shadowOffsetY&quot;:0,&quot;shadowOffsetX&quot;:0,&quot;right&quot;:&quot;auto&quot;,&quot;top&quot;:&quot;auto&quot;,&quot;subtarget&quot;:&quot;blank&quot;,&quot;textStyle&quot;:{&quot;fontFamily&quot;:&quot;sans-serif&quot;,&quot;fontStyle&quot;:&quot;normal&quot;,&quot;color&quot;:&quot;#000&quot;,&quot;fontSize&quot;:14,&quot;fontWeight&quot;:&quot;normal&quot;},&quot;show&quot;:true,&quot;text&quot;:&quot;Haversine distance: CPU vs. 
GPU&quot;}],&quot;legend&quot;:{&quot;itemWidth&quot;:25,&quot;data&quot;:[&quot;CPU&quot;,&quot;GPU&quot;],&quot;borderColor&quot;:&quot;transparent&quot;,&quot;orient&quot;:&quot;horizontal&quot;,&quot;bottom&quot;:&quot;auto&quot;,&quot;height&quot;:&quot;auto&quot;,&quot;zlevel&quot;:0,&quot;padding&quot;:5,&quot;borderWidth&quot;:1,&quot;inactiveColor&quot;:&quot;#ccc&quot;,&quot;z&quot;:2,&quot;align&quot;:&quot;auto&quot;,&quot;itemGap&quot;:10,&quot;itemHeight&quot;:14,&quot;backgroundColor&quot;:&quot;transparent&quot;,&quot;shadowOffsetY&quot;:0,&quot;shadowOffsetX&quot;:0,&quot;right&quot;:&quot;right&quot;,&quot;top&quot;:&quot;middle&quot;,&quot;width&quot;:&quot;auto&quot;,&quot;selectedMode&quot;:true,&quot;show&quot;:true}}
);
&lt;/script&gt;

&lt;h2 id=&quot;what-i-learned&quot;&gt;What I Learned&lt;/h2&gt;

&lt;p&gt;There are myriad things I learned from this project, but most important is that GPGPU processing can be accessible for people like myself without a CS background. Julia isn’t the first high-level language to provide CUDA functionality, but the fact that the code is so similar to native Julia makes GPU computing something I can include in my toolbox &lt;em&gt;today&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Over time, I’m sure I’ll get better results as I learn more about CUDA and as CUDAnative.jl continues to smooth out its rough edges. But the fact that, as a beginner, I could achieve such large speedups with just an hour or two of coding and sparse CUDAnative.jl documentation bodes well for the future of GPU computing in Julia.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://randyzwitch.com/notebooks/cudanative_haversine_julia_example.ipynb&quot;&gt;Code as Julia Jupyter Notebook&lt;/a&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.4.13 Release Notes</title>
        
          <description>&lt;p&gt;This blog post will be fairly short, given the minor nature of the update.&lt;/p&gt;

</description>
        
        <pubDate>Sun, 23 Jul 2017 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-4-13-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-4-13-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-4-13-release-notes/">&lt;p&gt;This blog post will be fairly short, given the minor nature of the update.&lt;/p&gt;

&lt;p&gt;Several users complained about OAUTH2 authentication not working, which I didn’t know because I usually use the legacy authentication method! Luckily, GitHub user &lt;a href=&quot;https://github.com/leocwlau&quot;&gt;leocwlau&lt;/a&gt; reported the &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues/223&quot;&gt;issue&lt;/a&gt; AND provided the &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/pull/224&quot;&gt;solution&lt;/a&gt;. No other bug fixes were made, nor was any additional functionality added.&lt;/p&gt;

&lt;p&gt;So if you ran into an issue where your login no longer worked, version 1.4.13 of RSiteCatalyst should remedy the issue. Even if you hadn’t run into this authentication issue, users should still upgrade, as all updates are cumulative in nature.&lt;/p&gt;

&lt;h2 id=&quot;community-contributions&quot;&gt;Community Contributions&lt;/h2&gt;
&lt;p&gt;As I’ve mentioned in many a blog post before this one, I encourage all users of the software to continue reporting bugs via &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues&quot;&gt;GitHub issues&lt;/a&gt;, especially if you can provide a working code example. Even better, a fix via pull request will ensure that your bug is addressed in a timely manner, for the benefit of others in the community.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Please don’t email me directly via the address in the RSiteCatalyst package; it will not be returned. Having a valid email contact in the package is a requirement for listing a package on CRAN, so that they can contact the package author; it is not meant to imply I can/will provide endless, personalized support for free.&lt;/em&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.4.12 (and 1.4.11) Release Notes</title>
        
          <description>&lt;p&gt;I released version 1.4.12 of RSiteCatalyst before I wrote the release notes for version 1.4.11, so this blog post will treat both releases as one. Users should upgrade directly to version 1.4.12 as the releases are cumulative in nature.&lt;/p&gt;

</description>
        
        <pubDate>Mon, 10 Apr 2017 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-4-12-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-4-12-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-4-12-release-notes/">&lt;p&gt;I released version 1.4.12 of RSiteCatalyst before I wrote the release notes for version 1.4.11, so this blog post will treat both releases as one. Users should upgrade directly to version 1.4.12 as the releases are cumulative in nature.&lt;/p&gt;

&lt;h2 id=&quot;get-method-additions&quot;&gt;Get* method additions&lt;/h2&gt;

&lt;p&gt;Two &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Get*&lt;/code&gt; methods were added in this release: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetClickMapReporting&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetPreviousServerCalls&lt;/code&gt;, mostly for completeness. Analytics users will likely not need to use these methods, but they are useful for &lt;a href=&quot;http://randyzwitch.com/adobe-analytics-implementation-documentation/&quot;&gt;generating documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;bug-fixes&quot;&gt;Bug fixes&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Fixed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetLogin&lt;/code&gt; function, adding &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;selected_ims_group_list&lt;/code&gt; parameter to response (caused test suite failure)&lt;/li&gt;
  &lt;li&gt;Fixed issue with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ViewProcessingRules&lt;/code&gt; where nested rules threw errors (#214)&lt;/li&gt;
  &lt;li&gt;Fixed issue with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetMarketingChannelRules&lt;/code&gt; where nested rules threw errors and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;matches&lt;/code&gt; column duplicated values across rows (#180)&lt;/li&gt;
  &lt;li&gt;Added ability to use a segment in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueDataWarehouse&lt;/code&gt; function, which was previously implemented incorrectly (#216)&lt;/li&gt;
  &lt;li&gt;Fixed issue with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueDataWarehouse&lt;/code&gt; not returning the proper number of results when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;enqueueOnly = FALSE&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Fixed encoding for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueDataWarehouse&lt;/code&gt; to correctly use UTF-8-BOM (#198)&lt;/li&gt;
  &lt;li&gt;Fixed parser for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetFeeds&lt;/code&gt;, to unnest ‘activity’ data frame into discrete columns&lt;/li&gt;
  &lt;li&gt;Fixed issue where message &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Error in if (!is.null(elements[i, ]$classification) &amp;amp;&amp;amp; nchar(elements[i, : missing value where TRUE/FALSE needed&lt;/code&gt; displayed when using multiple elements in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Queue*&lt;/code&gt; function (#207)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;community-contributions-an-adobe-summit-bounce&quot;&gt;Community Contributions (An Adobe Summit bounce?!)&lt;/h2&gt;

&lt;p&gt;In the past month, the number of GitHub issues submitted has increased dramatically, a good problem to have!&lt;/p&gt;

&lt;p&gt;I encourage all users of the software to continue reporting bugs via GitHub issues, especially if you can provide a working code example. Even better, a fix via pull request will ensure that your bug is addressed in a timely manner, for the benefit of others in the community.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Self-Service Adobe Analytics Data Feeds!</title>
        
          <description>&lt;p&gt;I’ve written several posts about the Adobe Analytics Analytics (née Clickstream) Data Feed (links: &lt;a href=&quot;http://randyzwitch.com/adobe-analytics-clickstream-raw-data-feed/&quot;&gt;1&lt;/a&gt;,&lt;a href=&quot;http://randyzwitch.com/adobe-analytics-clickstream-data-feed-relational-database/&quot;&gt;2&lt;/a&gt;,&lt;a href=&quot;http://randyzwitch.com/adobe-analytics-clickstream-data-feed-calculations/&quot;&gt;3&lt;/a&gt;) over the past several years. The &lt;a href=&quot;https://marketing.adobe.com/resources/help/en_US/reference/analytics-data-feed.html&quot;&gt;Analytics Data Feed&lt;/a&gt; is an invaluable tool for moving beyond aggregate-level reporting information about your customers to really in-depth, customer-level analytics.&lt;/p&gt;

</description>
        
        <pubDate>Fri, 03 Mar 2017 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/adobe-clickstream-data-feed-self-service-admin-panel/</link>
        <guid isPermaLink="true">http://randyzwitch.com/adobe-clickstream-data-feed-self-service-admin-panel/</guid>
        <content type="html" xml:base="/adobe-clickstream-data-feed-self-service-admin-panel/">&lt;p&gt;I’ve written several posts about the Adobe Analytics Analytics (née Clickstream) Data Feed (links: &lt;a href=&quot;http://randyzwitch.com/adobe-analytics-clickstream-raw-data-feed/&quot;&gt;1&lt;/a&gt;,&lt;a href=&quot;http://randyzwitch.com/adobe-analytics-clickstream-data-feed-relational-database/&quot;&gt;2&lt;/a&gt;,&lt;a href=&quot;http://randyzwitch.com/adobe-analytics-clickstream-data-feed-calculations/&quot;&gt;3&lt;/a&gt;) over the past several years. The &lt;a href=&quot;https://marketing.adobe.com/resources/help/en_US/reference/analytics-data-feed.html&quot;&gt;Analytics Data Feed&lt;/a&gt; is an invaluable tool for moving beyond aggregate-level reporting information about your customers to really in-depth, customer-level analytics.&lt;/p&gt;

&lt;p&gt;While the Analytics Data Feed is nowhere near as easy to use as the Adobe UI, Report Builder, Analytics Workspace or even &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst&quot;&gt;RSiteCatalyst&lt;/a&gt;, the hardest part for me when I was a digital analytics consultant was just obtaining the files in the first place (to say nothing of restarting a failed feed)! Luckily, Adobe has now built an interface accessible by any Adobe Analytics admin, removing the need for ClientCare to set up and maintain feeds.&lt;/p&gt;

&lt;p&gt;In this post, I will briefly highlight how to set up an Analytics Data Feed from inside Adobe Analytics and give my impressions of the tool (as it exists at the time of writing). Note that this post is not meant to be a substitute for the official documentation; Adobe provides detailed information about the entire process &lt;a href=&quot;https://marketing.adobe.com/resources/help/en_US/reference/analytics-data-feed.html&quot;&gt;in their Help section&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;landing-the-files-via-ftpsftps3&quot;&gt;Landing the Files via FTP/SFTP/S3&lt;/h2&gt;

&lt;p&gt;After clicking ‘Admin -&amp;gt; Data Feeds’ in the Adobe Analytics Admin menu/panel, you should see an interface similar to the following:
&lt;img src=&quot;/assets/img/adobe-analytics-data-feed-landing-page.png&quot; alt=&quot;landing page&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After clicking the “(+) Add” button, you will be brought to a page with three parts: Feed Information, Delivery Location and Data Column Definitions. The top half of the page looks as follows:
&lt;img src=&quot;/assets/img/adobe-analytics-data-feed-ftp.png&quot; alt=&quot;feed info&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I find these two panels fairly self-explanatory. On the left, you give your data feed a name, choose the report suite(s) you want the data feed for, pick the delivery granularity, then choose the start/end date or choose ‘continuous’ for indefinite future delivery. On the right, you provide the server information for your delivery method: FTP/SFTP or Amazon S3.&lt;/p&gt;

&lt;p&gt;There are several things I love about this setup:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Delivery via SFTP and S3: For the longest time, FTP was the only choice, which disappointed security-minded folks who wanted SFTP. With S3 delivery, you can house your data files and read them into Hadoop directly (assuming you use AWS); this means you don’t necessarily have to do anything with the data until you need it, then you can fire up an EMR job to get your data. Dump the cluster when you’re done. Nice.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Multiple Report Suites, same window: I’m so happy to see that this interface supports choosing multiple report suites within the same “feed”. This is so much easier than creating a separate feed for each report suite when, in many cases, the settings will stay the same (other than the report suite, of course).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I don’t love:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Not saving FTP information: When you do choose to create separate feeds, there doesn’t appear to be a way to use the same credentials without typing them in yourself. Not a huge deal if your password is “Mom”, a little more annoying if you auto-generate a password that looks like “aHR0cDovL3JhbmR5endpdGNoLmNvbQ==”&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;data-column-definitions&quot;&gt;Data Column Definitions&lt;/h2&gt;

&lt;p&gt;After choosing the report suites and providing the delivery details, the remaining step is to decide which data fields you want as part of the data feed. Luckily, there are templates that can be chosen, so that the user doesn’t have to remember every single field and the order that they need.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/adobe-analytics-data-feed-definition.png&quot; alt=&quot;column structure&quot; /&gt;&lt;/p&gt;

&lt;p&gt;My favored practice when working with the data feeds is to give every feed the same structure, even if certain report suites don’t have nearly the same number of eVars and props implemented. I’m of the mind that taking all the fields, then letting the ETL process handle what to do with the data is much less error-prone. You can always re-run an ETL process if you forgot something; if you forget a column in your data feed, you need to re-process all of the feeds.&lt;/p&gt;

&lt;p&gt;Empty columns are also much less wasteful in the age of distributed computing (such as Hadoop), as &lt;a href=&quot;http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/&quot;&gt;columnar data formats&lt;/a&gt;/databases remove much of the performance penalty of empty columns. Using dictionary encoding and columnar data structures, you can represent an entire column of missing values as {NA: &amp;lt;number of rows&amp;gt;} and spend mere bytes for data storage and still have lossless compression.&lt;/p&gt;
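
&lt;p&gt;To make the idea concrete, here is a minimal Python sketch of run-length encoding, which is closely related to (but simpler than) what columnar formats do internally. It is not how Parquet or any particular database implements its encodings; the function names are made up for illustration, and &lt;code&gt;None&lt;/code&gt; stands in for NA.&lt;/p&gt;

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse each run of identical consecutive values into a (value, count) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    column = []
    for value, count in pairs:
        column.extend([value] * count)
    return column


# A column that is entirely missing; None stands in for NA.
empty_column = [None] * 1_000_000
encoded = run_length_encode(empty_column)
print(encoded)  # [(None, 1000000)]

# Decoding restores every row, so nothing is lost.
print(run_length_decode(encoded) == empty_column)  # True
```

&lt;p&gt;A million empty rows collapse to a single pair, which is why carrying unused eVar and prop columns in every feed costs very little once the data lands in a columnar store.&lt;/p&gt;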

&lt;h2 id=&quot;save-and-repeat&quot;&gt;Save and Repeat&lt;/h2&gt;

&lt;p&gt;Once you have your data feed defined, you save your work and that’s it! I’m not sure what SLA might be in place, but it seems like there’s roughly a 48-hour delay between submitting a new feed and having the data start to process. Once the processing starts, your notification inbox will quickly look like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/adobe-analytics-data-feed-email-notification.png&quot; alt=&quot;emails&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;automated-monitoring-with-rsitecatalyst&quot;&gt;Automated Monitoring with RSiteCatalyst&lt;/h2&gt;
&lt;p&gt;If you want to have more control around monitoring than just staring at your inbox, you can use RSiteCatalyst to get the processing status of your feed:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RSiteCatalyst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Authentication (assumes credentials saved in .Renviron)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SCAuth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sys.getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;USER&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sys.getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;SECRET&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Get full list of report suites&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rs&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetReportSuites&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Pass in all non-virtual report suites&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;feedstatus&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetFeeds&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;is.na&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;virtual&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rsid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetFeeds&lt;/code&gt; returns a data frame containing the feed information, as well as the processing status:
&lt;img src=&quot;/assets/img/adobe-analytics-data-feed-rsitecatalyst.png&quot; alt=&quot;processing status&quot; /&gt;&lt;/p&gt;

&lt;p&gt;It’s not hard to imagine what a cronjob would look like, polling the Adobe API every few minutes to see which feeds are complete, then kicking off an ETL process.&lt;/p&gt;
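
&lt;p&gt;As a rough sketch of what one such cron invocation might do, here is a Python outline (Python rather than R, purely for illustration). Everything named here is an assumption: &lt;code&gt;fetch_statuses&lt;/code&gt; stands in for whatever wraps the &lt;code&gt;GetFeeds&lt;/code&gt; call and returns a mapping of feed id to status, &lt;code&gt;start_etl&lt;/code&gt; is a placeholder for your own ETL trigger, and the &lt;code&gt;complete&lt;/code&gt; status string is hypothetical rather than taken from the Adobe API.&lt;/p&gt;

```python
import json
from pathlib import Path


def feeds_ready_for_etl(statuses, already_processed):
    """Return ids of feeds that are complete and have not been handled yet."""
    return sorted(
        feed_id
        for feed_id, status in statuses.items()
        if status == "complete" and feed_id not in already_processed  # hypothetical status value
    )


def run_once(fetch_statuses, start_etl, state_file):
    """One cron invocation: start ETL for newly completed feeds and remember them."""
    state_path = Path(state_file)
    # The state file is a JSON list of feed ids handled by earlier runs.
    processed = set(json.loads(state_path.read_text())) if state_path.exists() else set()

    ready = feeds_ready_for_etl(fetch_statuses(), processed)
    for feed_id in ready:
        start_etl(feed_id)  # placeholder for the real ETL trigger

    state_path.write_text(json.dumps(sorted(processed | set(ready))))
    return ready
```

&lt;p&gt;Keeping the list of already-processed feeds in a small state file means the job can run every few minutes without kicking off the same ETL twice.&lt;/p&gt;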

&lt;h2 id=&quot;data-liberated&quot;&gt;Data, Liberated!&lt;/h2&gt;

&lt;p&gt;This blog post has been relatively light on code and analysis, but I hope I’ve highlighted that the barriers to obtaining the most granular data Adobe can provide have been completely removed. With the new Analytics Data Feed interface, the possibility of bespoke customer analytics is only a few button clicks and an FTP away.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Building a Data Science Workstation (2017)</title>
        
          <description>&lt;p&gt;Update, 3/14/2018: While I’ve still maintained the same basic workstation, I have done some upgrades. Before GPU prices skyrocketed, I added a 1080ti to the workstation along with the 1060. However, this required an upgrade of my case to fit both. I ended up getting a &lt;a href=&quot;https://www.corsair.com/uk/en/Categories/Products/Cases/Graphite-Series%E2%84%A2-760T-Full-Tower-Windowed-Case/p/CC-9011073-WW&quot;&gt;Corsair 760T&lt;/a&gt;, which is an enormous increase in space over the Corsair mid-tower I had before.&lt;/p&gt;

</description>
        
        <pubDate>Wed, 18 Jan 2017 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/building-data-science-workstation-2017/</link>
        <guid isPermaLink="true">http://randyzwitch.com/building-data-science-workstation-2017/</guid>
        <content type="html" xml:base="/building-data-science-workstation-2017/">&lt;p&gt;Update, 3/14/2018: While I’ve still maintained the same basic workstation, I have done some upgrades. Before GPU prices skyrocketed, I added a 1080ti to the workstation along with the 1060. However, this required an upgrade of my case to fit both. I ended up getting a &lt;a href=&quot;https://www.corsair.com/uk/en/Categories/Products/Cases/Graphite-Series%E2%84%A2-760T-Full-Tower-Windowed-Case/p/CC-9011073-WW&quot;&gt;Corsair 760T&lt;/a&gt;, which is an enormous increase in space over the Corsair mid-tower I had before.&lt;/p&gt;

&lt;p&gt;Since I was already in the case, I also added an AIO water-cooler to my CPU, a &lt;a href=&quot;https://www.corsair.com/us/en/Categories/Products/Cooling/Hydro-Series%E2%84%A2-H115i-PRO-RGB-280mm-Liquid-CPU-Cooler/p/CW-9060032-WW&quot;&gt;Corsair H115i Pro&lt;/a&gt;. While this wasn’t strictly necessary (I wasn’t having overheating issues with a single GPU and overclocked CPU), with two GPUs I figured that it was a good precaution. The water-cooler is also much quieter than my original air-cooled heatsink, which is nice.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/data-science-workstation-2018.JPG&quot; alt=&quot;data science workstation 2018&quot; /&gt;&lt;/p&gt;

&lt;p&gt;So now I have a dual-GPU rig with a huge amount of open space, which is great for airflow. Depending on how you calculate things, I’m at around $3000 for a workstation that should serve me well for the next several years (hopefully more!).&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;(&lt;em&gt;original post below&lt;/em&gt;)&lt;/p&gt;
&lt;blockquote class=&quot;twitter-tweet&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;i need some kind of super computer to parse a multi-gigabyte XML file, &lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;@randyzwitch&lt;/a&gt;&lt;/p&gt;&amp;mdash; jason thompson (@usujason) &lt;a href=&quot;https://twitter.com/usujason/status/821429000270528512&quot;&gt;January 17, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;For all the downsides of social media these days, the people I’ve met and the inside jokes bring me immense joy. From just this one tweet, several people messaged me aghast 1) at the idea of multi-gigabyte XML and 2) that I apparently have a computer for just this purpose!&lt;/p&gt;

&lt;p&gt;Recently, I did build a workstation as a test bed for learning more about data science, specifically in the areas of Docker and GPU computing. Here’s what I built.&lt;/p&gt;

&lt;h2 id=&quot;machine-specs-and-assembling-from-parts&quot;&gt;Machine Specs and Assembling from Parts&lt;/h2&gt;

&lt;p&gt;While there are several specialist workstation companies like &lt;a href=&quot;http://www.titancomputers.com/SearchResults.asp?Search=workstation&quot;&gt;Titan Computing&lt;/a&gt; that sell configurable workstations, assembling a computer from parts is pretty easy, and more importantly, can be significantly cheaper if you choose ‘consumer-grade’ parts instead of ‘server-grade’. Because I’m not planning on curing cancer or building Skynet with this computer, I opted not to get a dual-chip motherboard and used a single Intel i7 chip rather than going with Xeon workstation class chip(s). I also chose to use standard DDR4 RAM rather than ECC RAM. Here’s the full spec list:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://pcpartpicker.com/list/TPwTjc&quot;&gt;PCPartPicker part list&lt;/a&gt;&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Component&lt;/th&gt;
      &lt;th&gt;Details&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;CPU&lt;/td&gt;
      &lt;td&gt;Intel Core i7-5820K 3.3GHz 6-Core Processor&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;CPU Cooler&lt;/td&gt;
      &lt;td&gt;Cooler Master Hyper 212 EVO 82.9 CFM Sleeve Bearing CPU Cooler&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Motherboard&lt;/td&gt;
      &lt;td&gt;Asus X99-A II ATX LGA2011-3 Motherboard&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Memory&lt;/td&gt;
      &lt;td&gt;Crucial Ballistix Sport LT 32GB (2 x 16GB) DDR4-2400 Memory&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Memory&lt;/td&gt;
      &lt;td&gt;Crucial Ballistix Sport LT 32GB (2 x 16GB) DDR4-2400 Memory&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Memory&lt;/td&gt;
      &lt;td&gt;Crucial Ballistix Sport LT 32GB (2 x 16GB) DDR4-2400 Memory&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Memory&lt;/td&gt;
      &lt;td&gt;Crucial Ballistix Sport LT 32GB (2 x 16GB) DDR4-2400 Memory&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Storage&lt;/td&gt;
      &lt;td&gt;Samsung 850 EVO-Series 250GB 2.5” Solid State Drive&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Storage&lt;/td&gt;
      &lt;td&gt;Hitachi Deskstar 1TB 3.5” 7200RPM Internal Hard Drive&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Video Card&lt;/td&gt;
      &lt;td&gt;EVGA GeForce GTX 1060 6GB 6GB SC GAMING Video Card&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Case&lt;/td&gt;
      &lt;td&gt;Corsair SPEC-02 ATX Mid Tower Case&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Power Supply&lt;/td&gt;
      &lt;td&gt;Corsair CXM 750W 80+ Bronze Certified Semi-Modular ATX Power Supply&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Optical Drive&lt;/td&gt;
      &lt;td&gt;Asus DRW-24B1ST/BLK/B/AS DVD/CD Writer&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Total: ~$2000&lt;/p&gt;

&lt;p&gt;So for about $2000 and a few hours of assembly time, I have the rough equivalent of an r3.4xlarge instance on AWS ($1.33/hr on-demand at the time of writing). It would take about 1500 hours of usage to break even vs. using AWS, but cost isn’t really the point; the convenience of having the computer in my house and not having to do the startup/shutdown/EC2 images/S3/firewall/etc. dance is more than worth it to me, since I can focus on learning instead of operations.&lt;/p&gt;
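
&lt;p&gt;The break-even figure is simple division, sketched below in Python using only the two numbers quoted above. It deliberately ignores electricity, depreciation and AWS storage or transfer charges, so treat it as a back-of-the-envelope estimate.&lt;/p&gt;

```python
workstation_cost = 2000.00  # USD, approximate total from the parts list
aws_hourly_rate = 1.33      # USD per hour, r3.4xlarge on-demand as quoted in the post

break_even_hours = workstation_cost / aws_hourly_rate
break_even_days = break_even_hours / 24  # if the instance ran around the clock

print(f"{break_even_hours:.0f} hours, or about {break_even_days:.0f} days of continuous use")
# 1504 hours, or about 63 days of continuous use
```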

&lt;h2 id=&quot;ubuntu-1604lts-cuda-and-other-installs&quot;&gt;Ubuntu 16.04LTS, CUDA, and Other Installs&lt;/h2&gt;

&lt;p&gt;While I did use Windows 10 to validate that I put my hardware together correctly (and to mess with overclocking settings using the ASUS motherboard tools), I decided to use Ubuntu 16.04LTS as my base operating system. This allows for the most ‘server-like’ operations that I’m used to from a Linux environment. I enabled an internal static IP through my router, and for the most part, I either SSH into the machine from my MacBook Pro or use web UIs such as Jupyter Notebook (again, remotely from my laptop).&lt;/p&gt;

&lt;p&gt;I tried to install the NVIDIA drivers from source for the GTX1060 GPU, but eventually gave up and went the &lt;a href=&quot;http://tipsonubuntu.com/2016/08/24/nvidia-367-44-support-titan-x-pascal-gtx-1060/&quot;&gt;apt package manager route&lt;/a&gt; and everything works fine. Though I rarely sit at my desktop, I do have a 4K monitor hooked up to this computer which looks &lt;em&gt;gorgeous&lt;/em&gt; with a video card of this caliber, and I have CUDA installed and working as well.&lt;/p&gt;

&lt;p&gt;From there, I installed any number of tools: Python, R, Julia, MonetDB, Docker, Neo4j, Postgres, Spark, BlazingDB…any/all of which I hope to write about more in the near future.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.4.10 Release Notes</title>
        
          <description>&lt;p&gt;Version 1.4.10 of RSiteCatalyst brings a handful of new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Get*&lt;/code&gt; methods, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueDatawarehouse&lt;/code&gt; and a couple of bug fixes/low-level code improvements.&lt;/p&gt;

</description>
        
        <pubDate>Tue, 13 Dec 2016 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-4-10-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-4-10-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-4-10-release-notes/">&lt;p&gt;Version 1.4.10 of RSiteCatalyst brings a handful of new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Get*&lt;/code&gt; methods, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueDatawarehouse&lt;/code&gt; and a couple of bug fixes/low-level code improvements.&lt;/p&gt;

&lt;h2 id=&quot;queuedatawarehouse&quot;&gt;QueueDatawarehouse&lt;/h2&gt;

&lt;p&gt;The most useful user-facing change IMO is the addition of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueDatawarehouse&lt;/code&gt; method, which allows for submitting (and &lt;em&gt;sometimes&lt;/em&gt; retrieving) Data Warehouse requests via R. This should be a huge timesaver for those of you using Data Warehouse as a substitute for the Adobe Analytics raw data feed (my employer alone pulls hundreds of Data Warehouse feeds per day).&lt;/p&gt;

&lt;p&gt;In the coming days, I’ll write a blog post in more detail about how to use this method effectively to query Data Warehouse, but in the meantime, here’s a sample function call:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#API Credentials&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SCAuth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sys.getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;USER&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sys.getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;SECRET&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#FTP Credentials&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FTP&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sys.getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;FTP&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FTPUSER&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sys.getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;FTPUSER&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FTPPW&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sys.getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;FTPPW&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Write QueueDataWarehouse report to FTP&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report.id&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueDataWarehouse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;report-suite&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2016-11-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2016-11-07&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;page&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;enqueueOnly&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ftp&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FTP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;21&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;directory&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/DWtest/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;username&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FTPUSER&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FTPPW&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filename&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;myreport.csv&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;viewprocessingrules&quot;&gt;ViewProcessingRules&lt;/h2&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ViewProcessingRules&lt;/code&gt; has also been added on an experimental basis. This API method is not documented (as of the writing of this post), so Adobe may choose to modify or remove it in a future release. As the method name indicates, this new feature allows for viewing the processing rules for a list of report suites, including the behaviors that define the rules. There is currently no (public) method for &lt;em&gt;setting&lt;/em&gt; processing rules via the API.&lt;/p&gt;

&lt;p&gt;As processing rules are a super-user functionality, I would love it if someone in the community could verify that this method works for a large number of report suites/rules.&lt;/p&gt;

&lt;h2 id=&quot;get-method-additions&quot;&gt;Get* method additions&lt;/h2&gt;

&lt;p&gt;Three Get* methods were added in this release: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetVirtualReportSuiteSettings&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetReportSuiteGroups&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetTimeStampEnabled&lt;/code&gt;, each corresponding to a setting that can be viewed within the Adobe Analytics admin panel.&lt;/p&gt;

&lt;h2 id=&quot;bug-fixes&quot;&gt;Bug fixes&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;GetRealTimeSettings: fixed a bug where passing a list of report suites led to a parsing error&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;QueueSummary: Redefined method arguments to make &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date&lt;/code&gt; an optional parameter, allowing for more elegant use of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date.to&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date.from&lt;/code&gt; parameters.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;package-level-improvements&quot;&gt;Package-level improvements&lt;/h2&gt;

&lt;p&gt;At long last, I’ve added integration testing via the AppVeyor testing service. For the longest time, I’ve had a very nonchalant attitude towards Windows (since I don’t use it), but given that RSiteCatalyst is enterprise software and so many businesses use Windows, I figured it was time.&lt;/p&gt;

&lt;p&gt;Luckily, none of the tests in the test suite threw any errors specific to Windows, so effectively this change is just defensive programming against that class of error in the future.&lt;/p&gt;

&lt;h2 id=&quot;community-contributions&quot;&gt;Community Contributions&lt;/h2&gt;

&lt;p&gt;As in the past several releases, there have been contributions from the community keeping RSiteCatalyst moving forward! Special thanks to Diego Villuendas Pellicero for writing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueDataWarehouse&lt;/code&gt; functionality and Johann de Boer for highlighting that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueSummary&lt;/code&gt; could be improved.&lt;/p&gt;

&lt;p&gt;I encourage all users of the software to continue reporting bugs via GitHub issues, especially if you can provide a working code example. Even better, a fix via pull request will ensure that your bug is addressed in a timely manner, to the benefit of others in the community.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>WordPress to Jekyll: A 30x Speedup</title>
        
          <description>&lt;p&gt;About a month ago, I switched this blog from WordPress hosted on Bluehost to Jekyll on GitHub Pages. I suspected moving to a static website would be faster than generating HTML via PHP, and it is certainly cheaper (GitHub Pages is “free”). But it wasn’t until I needed a dataset for some data visualization development that I realized how much of an improvement it has been!&lt;/p&gt;

</description>
        
        <pubDate>Mon, 10 Oct 2016 00:00:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/wordpress-jekyll-30x-speedup/</link>
        <guid isPermaLink="true">http://randyzwitch.com/wordpress-jekyll-30x-speedup/</guid>
        <content type="html" xml:base="/wordpress-jekyll-30x-speedup/">&lt;p&gt;About a month ago, I switched this blog from WordPress hosted on Bluehost to Jekyll on GitHub Pages. I suspected moving to a static website would be faster than generating HTML via PHP, and it is certainly cheaper (GitHub Pages is “free”). But it wasn’t until I needed a dataset for some data visualization development that I realized how much of an improvement it has been!&lt;/p&gt;

&lt;h2 id=&quot;packages-packages-packages&quot;&gt;Packages, Packages, Packages&lt;/h2&gt;

&lt;p&gt;With the release of v0.5 of Julia, I’ve been working (less) on updating my packages and making new packages (more), because making new stuff is more fun than maintaining old stuff! One of the packages I’ve been building is for the &lt;a href=&quot;http://echarts.baidu.com/&quot;&gt;ECharts visualization library&lt;/a&gt; (v3) from Baidu. While Julia doesn’t necessarily need another visualization library, visualization is something I’m interested in and learning is easier when you’re solving problems you like. And since the world doesn’t need another Iris example, I decided to share some real world website performance data :)&lt;/p&gt;

&lt;h2 id=&quot;line-chart&quot;&gt;Line Chart&lt;/h2&gt;

&lt;p&gt;One of the first features I developed for &lt;a href=&quot;https://github.com/randyzwitch/ECharts.jl&quot;&gt;ECharts.jl&lt;/a&gt; was X-Y charts, which I posit is the most common chart type in business. One thing that is great about the underlying ECharts JavaScript library is that interactivity is really easy to achieve:&lt;/p&gt;

&lt;div id=&quot;linep&quot; style=&quot;height:400px;width:800px;&quot;&gt;&lt;/div&gt;
&lt;script type=&quot;text/javascript&quot;&gt;
    // Initialize after dom ready
    var myChart = echarts.init(document.getElementById(&quot;linep&quot;));

    // Load data into the ECharts instance
    myChart.setOption({&quot;xAxis&quot;:[{&quot;scale&quot;:false,&quot;gridIndex&quot;:0,&quot;splitNumber&quot;:5,&quot;minInterval&quot;:0,&quot;silent&quot;:true,&quot;data&quot;:[&quot;2016-06-20&quot;,&quot;2016-06-21&quot;,&quot;2016-06-22&quot;,&quot;2016-06-23&quot;,&quot;2016-06-24&quot;,&quot;2016-06-25&quot;,&quot;2016-06-26&quot;,&quot;2016-06-27&quot;,&quot;2016-06-28&quot;,&quot;2016-06-29&quot;,&quot;2016-06-30&quot;,&quot;2016-07-01&quot;,&quot;2016-07-02&quot;,&quot;2016-07-03&quot;,&quot;2016-07-04&quot;,&quot;2016-07-05&quot;,&quot;2016-07-06&quot;,&quot;2016-07-07&quot;,&quot;2016-07-08&quot;,&quot;2016-07-09&quot;,&quot;2016-07-10&quot;,&quot;2016-07-11&quot;,&quot;2016-07-12&quot;,&quot;2016-07-13&quot;,&quot;2016-07-14&quot;,&quot;2016-07-15&quot;,&quot;2016-07-16&quot;,&quot;2016-07-17&quot;,&quot;2016-07-18&quot;,&quot;2016-07-19&quot;,&quot;2016-07-20&quot;,&quot;2016-07-21&quot;,&quot;2016-07-22&quot;,&quot;2016-07-23&quot;,&quot;2016-07-24&quot;,&quot;2016-07-25&quot;,&quot;2016-07-26&quot;,&quot;2016-07-27&quot;,&quot;2016-07-28&quot;,&quot;2016-07-29&quot;,&quot;2016-07-30&quot;,&quot;2016-07-31&quot;,&quot;2016-08-01&quot;,&quot;2016-08-02&quot;,&quot;2016-08-03&quot;,&quot;2016-08-04&quot;,&quot;2016-08-05&quot;,&quot;2016-08-06&quot;,&quot;2016-08-07&quot;,&quot;2016-08-08&quot;,&quot;2016-08-09&quot;,&quot;2016-08-10&quot;,&quot;2016-08-11&quot;,&quot;2016-08-12&quot;,&quot;2016-08-13&quot;,&quot;2016-08-14&quot;,&quot;2016-08-15&quot;,&quot;2016-08-16&quot;,&quot;2016-08-17&quot;,&quot;2016-08-18&quot;,&quot;2016-08-19&quot;,&quot;2016-08-20&quot;,&quot;2016-08-21&quot;,&quot;2016-08-22&quot;,&quot;2016-08-23&quot;,&quot;2016-08-24&quot;,&quot;2016-08-25&quot;,&quot;2016-08-26&quot;,&quot;2016-08-27&quot;,&quot;2016-08-28&quot;,&quot;2016-08-29&quot;,&quot;2016-08-30&quot;,&quot;2016-08-31&quot;,&quot;2016-09-01&quot;,&quot;2016-09-02&quot;,&quot;2016-09-03&quot;,&quot;2016-09-04&quot;,&quot;2016-09-05&quot;,&quot;2016-09-06&quot
;,&quot;2016-09-07&quot;,&quot;2016-09-08&quot;,&quot;2016-09-09&quot;,&quot;2016-09-10&quot;,&quot;2016-09-11&quot;,&quot;2016-09-12&quot;,&quot;2016-09-13&quot;,&quot;2016-09-14&quot;,&quot;2016-09-15&quot;,&quot;2016-09-16&quot;,&quot;2016-09-17&quot;,&quot;2016-09-18&quot;,&quot;2016-09-19&quot;,&quot;2016-09-20&quot;,&quot;2016-09-21&quot;,&quot;2016-09-22&quot;,&quot;2016-09-23&quot;,&quot;2016-09-24&quot;,&quot;2016-09-25&quot;,&quot;2016-09-26&quot;,&quot;2016-09-27&quot;,&quot;2016-09-28&quot;,&quot;2016-09-29&quot;,&quot;2016-09-30&quot;,&quot;2016-10-01&quot;,&quot;2016-10-02&quot;,&quot;2016-10-03&quot;,&quot;2016-10-04&quot;,&quot;2016-10-05&quot;,&quot;2016-10-06&quot;,&quot;2016-10-07&quot;],&quot;inverse&quot;:false,&quot;type&quot;:&quot;category&quot;,&quot;nameLocation&quot;:&quot;middle&quot;,&quot;nameGap&quot;:30}],&quot;yAxis&quot;:[{&quot;scale&quot;:false,&quot;gridIndex&quot;:0,&quot;name&quot;:&quot;Load time in ms&quot;,&quot;splitNumber&quot;:5,&quot;minInterval&quot;:0,&quot;silent&quot;:true,&quot;inverse&quot;:false,&quot;type&quot;:&quot;value&quot;,&quot;nameLocation&quot;:&quot;middle&quot;,&quot;nameGap&quot;:50}],&quot;toolbox&quot;:{&quot;feature&quot;:{&quot;dataView&quot;:{&quot;show&quot;:true,&quot;title&quot;:&quot;Data View&quot;,&quot;lang&quot;:[&quot;Data View&quot;,&quot;Cancel&quot;,&quot;Refresh&quot;]},&quot;restore&quot;:{&quot;show&quot;:true,&quot;title&quot;:&quot;Restore&quot;},&quot;saveAsImage&quot;:{&quot;show&quot;:true,&quot;title&quot;:&quot;Save As 
PNG&quot;},&quot;magicType&quot;:{&quot;show&quot;:true,&quot;title&quot;:{&quot;line&quot;:&quot;Line&quot;,&quot;bar&quot;:&quot;Bar&quot;,&quot;tiled&quot;:&quot;Tiled&quot;,&quot;chord&quot;:&quot;Chord&quot;,&quot;stack&quot;:&quot;Stack&quot;,&quot;pie&quot;:&quot;Pie&quot;,&quot;force&quot;:&quot;Force&quot;,&quot;funnel&quot;:&quot;Funnel&quot;},&quot;type&quot;:[&quot;bar&quot;,&quot;line&quot;]}},&quot;itemSize&quot;:15,&quot;orient&quot;:&quot;vertical&quot;,&quot;height&quot;:&quot;auto&quot;,&quot;zlevel&quot;:0,&quot;z&quot;:2,&quot;itemGap&quot;:20,&quot;right&quot;:&quot;auto&quot;,&quot;top&quot;:&quot;center&quot;,&quot;width&quot;:&quot;auto&quot;,&quot;show&quot;:true,&quot;showTitle&quot;:true},&quot;ec_width&quot;:800,&quot;ec_height&quot;:400,&quot;ec_charttype&quot;:&quot;xy plot&quot;,&quot;color&quot;:[&quot;#2C3E50&quot;,&quot;#E74C3C&quot;,&quot;#ECF0F1&quot;,&quot;#3498DB&quot;,&quot;#2980B9&quot;],&quot;title&quot;:[{&quot;left&quot;:&quot;left&quot;,&quot;borderColor&quot;:&quot;transparent&quot;,&quot;bottom&quot;:&quot;auto&quot;,&quot;padding&quot;:5,&quot;zlevel&quot;:0,&quot;borderWidth&quot;:1,&quot;target&quot;:&quot;blank&quot;,&quot;z&quot;:2,&quot;itemGap&quot;:5,&quot;shadowOffsetY&quot;:0,&quot;shadowOffsetX&quot;:0,&quot;right&quot;:&quot;auto&quot;,&quot;subtext&quot;:&quot;Switching from WordPress on Bluehost to Jekyll on GitHub 
(2016/09/06)&quot;,&quot;top&quot;:&quot;auto&quot;,&quot;subtarget&quot;:&quot;blank&quot;,&quot;show&quot;:true,&quot;text&quot;:&quot;randyzwitch.com&quot;}],&quot;dataZoom&quot;:[{&quot;show&quot;:true}],&quot;series&quot;:[{&quot;name&quot;:&quot;loadtime_ms&quot;,&quot;data&quot;:[1282,1728,1047,1111,1027,643,757,1049,1201,1265,1617,1145,614,673,1023,1323,1117,1048,904,647,830,761,759,607,1141,1022,864,743,866,1328,1147,973,1178,1093,927,998,1195,1167,1023,1329,1051,929,1037,897,1197,1179,1402,1018,605,2261,2059,2383,2402,1385,2068,2290,2627,1862,2494,2753,1556,898,926,1158,1253,1403,655,497,544,526,503,575,545,628,467,518,568,513,386,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null],&quot;smooth&quot;:false,&quot;minSize&quot;:&quot;0%&quot;,&quot;type&quot;:&quot;line&quot;,&quot;maxSize&quot;:&quot;100%&quot;},{&quot;name&quot;:&quot;post&quot;,&quot;data&quot;:[null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,386,629,533,453,279,193,83,45,40,46,44,29,34,46,36,29,32,40,35,32,47,43,36,38,36,26,35,35,35,32,40,33],&quot;smooth&quot;:false,&quot;minSize&quot;:&quot;0%&quot;,&quot;type&quot;:&quot;line&quot;,&quot;maxSize&quot;:&quot;100%&quot;}]});
&lt;/script&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ECharts&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DataFrames&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Read in data&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;readtable&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/assets/data/website_time_data.csv&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Make data two different series that overlap, so endpoint touches&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pre&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2016-09-06&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;nothing&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zip&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loadtime_ms&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;])]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;post&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2016-09-06&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;nothing&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zip&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loadtime_ms&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;])]&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Graph code&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hcat&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pre&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;post&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ec_width&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;800&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;seriesnames!&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;loadtime_ms&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;post&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;colorscheme!&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;palette&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;acw&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;FlatUI&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;yAxis!&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Load time in ms&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;title!&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;randyzwitch.com&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
          &lt;span class=&quot;n&quot;&gt;subtext&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Switching from WordPress on Bluehost to Jekyll on GitHub (2016/09/06)&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;toolbox!&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;chartTypes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;bar&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;line&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;slider!&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
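
&lt;p&gt;The pre/post split in the code above is easier to see on a handful of rows. Here is a minimal sketch in Python (not the Julia used in this post), using five rows around the switch date taken from the chart data. The names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CUTOVER&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rows&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pre&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;post&lt;/code&gt; are made up for illustration:&lt;/p&gt;

```python
# Illustrative sketch only: mirrors the pre/post split from the Julia code,
# on a five-row excerpt around the switch date.
CUTOVER = "2016-09-06"

# (date, load time in ms) pairs
rows = [
    ("2016-09-04", 568),
    ("2016-09-05", 513),
    ("2016-09-06", 386),
    ("2016-09-07", 629),
    ("2016-09-08", 533),
]

# ISO-8601 date strings sort chronologically, so string comparison is enough.
# Both series keep the cutover row, which is what makes the two lines touch.
pre = [None if date > CUTOVER else ms for date, ms in rows]
post = [ms if date >= CUTOVER else None for date, ms in rows]

print(pre)   # [568, 513, 386, None, None]
print(post)  # [None, None, 386, 629, 533]
```

&lt;p&gt;The missing values become the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt; entries in the chart options, which is why each line is only drawn over its own date range.&lt;/p&gt;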

&lt;p&gt;Even though I switched from WordPress to Jekyll on 9/6/2016, it appears that the page cache for Google Webmaster Tools didn’t really expire until 9/12/2016 or so. In the average case, the load time went from 1128ms to 38ms! Of course, this isn’t really a &lt;em&gt;fair&lt;/em&gt; comparison, as presumably GitHub Pages runs on much better hardware than the cheap Bluehost hosting I have, and I didn’t reimplement most of the garbage I had on the WordPress version of the blog. But from a user-experience standpoint, good lord what an improvement!&lt;/p&gt;
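
&lt;p&gt;The “30x” in the title is simply the ratio of those two averages. A quick sanity check, sketched in Python purely for illustration (the variable names are made up; the two numbers are the averages quoted above):&lt;/p&gt;

```python
# Average load times quoted in the post, in milliseconds
wordpress_ms = 1128
jekyll_ms = 38

# Speedup is the old time divided by the new time
speedup = wordpress_ms / jekyll_ms
print(f"Speedup: {speedup:.1f}x")  # Speedup: 29.7x
```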

&lt;h2 id=&quot;box-plots&quot;&gt;Box Plots&lt;/h2&gt;

&lt;p&gt;Wanting to test out further functionality, I also made some box plots of the load time variation:&lt;/p&gt;

&lt;div id=&quot;boxp&quot; style=&quot;height:400px;width:800px;&quot;&gt;&lt;/div&gt;
&lt;script type=&quot;text/javascript&quot;&gt;
    // Initialize after dom ready
    var myChartp = echarts.init(document.getElementById(&quot;boxp&quot;));

    // Load data into the ECharts instance
    myChartp.setOption({&quot;xAxis&quot;:[{&quot;splitNumber&quot;:5,&quot;boundaryGap&quot;:true,&quot;data&quot;:[&quot;WordPress&quot;,&quot;Jekyll&quot;],&quot;scale&quot;:false,&quot;gridIndex&quot;:0,&quot;minInterval&quot;:0,&quot;inverse&quot;:false,&quot;nameLocation&quot;:&quot;middle&quot;,&quot;nameGap&quot;:30,&quot;silent&quot;:true,&quot;type&quot;:&quot;category&quot;}],&quot;yAxis&quot;:[{&quot;splitNumber&quot;:5,&quot;scale&quot;:false,&quot;gridIndex&quot;:0,&quot;name&quot;:&quot;Load time in ms&quot;,&quot;minInterval&quot;:0,&quot;min&quot;:0,&quot;inverse&quot;:false,&quot;nameLocation&quot;:&quot;middle&quot;,&quot;nameGap&quot;:50,&quot;silent&quot;:true,&quot;type&quot;:&quot;value&quot;}],&quot;toolbox&quot;:{&quot;feature&quot;:{&quot;dataView&quot;:{&quot;show&quot;:true,&quot;title&quot;:&quot;Data View&quot;,&quot;lang&quot;:[&quot;Data View&quot;,&quot;Cancel&quot;,&quot;Refresh&quot;]},&quot;restore&quot;:{&quot;show&quot;:true,&quot;title&quot;:&quot;Restore&quot;},&quot;saveAsImage&quot;:{&quot;show&quot;:true,&quot;title&quot;:&quot;Save As PNG&quot;}},&quot;itemSize&quot;:15,&quot;orient&quot;:&quot;vertical&quot;,&quot;height&quot;:&quot;auto&quot;,&quot;zlevel&quot;:0,&quot;z&quot;:2,&quot;itemGap&quot;:20,&quot;right&quot;:&quot;auto&quot;,&quot;top&quot;:&quot;center&quot;,&quot;width&quot;:&quot;auto&quot;,&quot;show&quot;:true,&quot;showTitle&quot;:true},&quot;ec_width&quot;:800,&quot;ec_height&quot;:400,&quot;ec_charttype&quot;:&quot;box&quot;,&quot;color&quot;:[&quot;#004358&quot;,&quot;#1F8A70&quot;,&quot;#BEDB39&quot;,&quot;#FFE11A&quot;,&quot;#FD7400&quot;],&quot;title&quot;:[{&quot;left&quot;:&quot;left&quot;,&quot;borderColor&quot;:&quot;transparent&quot;,&quot;bottom&quot;:&quot;auto&quot;,&quot;padding&quot;:5,&quot;zlevel&quot;:0,&quot;borderWidth&quot;:1,&quot;target&quot;:&quot;blank&quot;,&quot;z&quot;:2,&quot;itemGap&quot;:5,&quot;shadowOffsetY&quot;:0,&quot;shadowOffsetX&quot;:0,&quot;right&quot;:&quot;auto&quot;,&quot;subtext&quot;:&quot;Switching from WordPress on Bluehost to Jekyll on GitHub (2016/09/06)&quot;,&quot;top&quot;:&quot;auto&quot;,&quot;subtarget&quot;:&quot;blank&quot;,&quot;show&quot;:true,&quot;text&quot;:&quot;randyzwitch.com&quot;}],&quot;series&quot;:[{&quot;name&quot;:&quot;boxplot&quot;,&quot;data&quot;:[[-35.25,750.0,1037.0,1273.5,2058.75],[19.75,33.25,36.0,42.25,55.75]],&quot;smooth&quot;:false,&quot;minSize&quot;:&quot;0%&quot;,&quot;type&quot;:&quot;boxplot&quot;,&quot;maxSize&quot;:&quot;100%&quot;},{&quot;name&quot;:&quot;outliers&quot;,&quot;data&quot;:[[&quot;WordPress&quot;,2261.0],[&quot;WordPress&quot;,2059.0],[&quot;WordPress&quot;,2383.0],[&quot;WordPress&quot;,2402.0],[&quot;WordPress&quot;,2068.0],[&quot;WordPress&quot;,2290.0],[&quot;WordPress&quot;,2627.0],[&quot;WordPress&quot;,2494.0],[&quot;WordPress&quot;,2753.0],[&quot;Jekyll&quot;,83.0]],&quot;smooth&quot;:false,&quot;minSize&quot;:&quot;0%&quot;,&quot;type&quot;:&quot;scatter&quot;,&quot;maxSize&quot;:&quot;100%&quot;}]}
);
&lt;/script&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ECharts&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DataFrames&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Read in data&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;readtable&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/Users/randyzwitch/Desktop/website_load_time.csv&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pre&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2016-09-06&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;nothing&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zip&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loadtime_ms&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;])]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;post&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2016-09-12&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;nothing&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zip&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loadtime_ms&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;])]&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Remove nulls&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;pre&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pre&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;nothing&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;post&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;post&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;nothing&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Graph code&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;box&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pre&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;post&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;names&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;WordPress&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Jekyll&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ec_width&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;800&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;colorscheme!&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;palette&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;acw&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;VitaminC&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;yAxis!&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Load time in ms&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nameGap&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;title!&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;randyzwitch.com&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
           &lt;span class=&quot;n&quot;&gt;subtext&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Switching from WordPress on Bluehost to Jekyll on GitHub (2016/09/06)&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;toolbox!&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Usually, a box plot comparison that is as smushed as the Jekyll plot vs the WordPress one would be a poor visualization, but in this case I think it actually works. The load time for the Jekyll version of this blog is so quick and so consistent that it would barely register as an outlier if it were WordPress! It’s crazy to think that the lower whisker (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Q1 - 1.5 * IQR&lt;/code&gt;) for WordPress is roughly the mean/median/min load time of Jekyll.&lt;/p&gt;

&lt;h2 id=&quot;where-to-go-next&quot;&gt;Where To Go Next?&lt;/h2&gt;
&lt;p&gt;This blog post is really just an interesting finding from my experience moving to Jekyll on GitHub. As it stands now, ECharts.jl is still in pre-METADATA mode. I assume that this would be a useful enough package to submit to METADATA some day, but that depends on how much further I get smoothing the rough edges. If there are people who are interested in cleaning up this package further, I’d absolutely love to collaborate.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Bulk Downloading Adobe Analytics Data</title>
        
          <description>&lt;p&gt;&lt;em&gt;This blog post also serves as release notes for RSiteCatalyst v1.4.9, as only one feature was added (batch report request and download). But it’s a feature big enough for its own post!&lt;/em&gt;&lt;/p&gt;

</description>
        
        <pubDate>Thu, 21 Jul 2016 08:25:02 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-bulk-download-version-1-4-9-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-bulk-download-version-1-4-9-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-bulk-download-version-1-4-9-release-notes/">&lt;p&gt;&lt;em&gt;This blog post also serves as release notes for RSiteCatalyst v1.4.9, as only one feature was added (batch report request and download). But it’s a feature big enough for its own post!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Recently, I was asked how I would approach replicating the &lt;a href=&quot;http://33sticks.com/rsitecatalyst-market-basket-analysis-adobe-analytics/&quot;&gt;market basket analysis&lt;/a&gt; blog post I wrote for &lt;a href=&quot;http://33sticks.com/&quot;&gt;33 Sticks&lt;/a&gt;, but using a lot more data. Like, months and months of order-level data. While you &lt;em&gt;might&lt;/em&gt; be able to submit multiple months’ worth of data in a single RSiteCatalyst call, it’s a lot more elegant to request data from the Adobe Analytics API in several calls. With the new batch-submit and batch-receive functionality in RSiteCatalyst, this process can be a LOT faster.&lt;/p&gt;

&lt;h2 id=&quot;non-batched-method&quot;&gt;Non-Batched Method&lt;/h2&gt;

&lt;p&gt;Prior to version 1.4.9 of RSiteCatalyst, API calls could only be made in a serial fashion:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RSiteCatalyst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dplyr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SCAuth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sys.getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;USER&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sys.getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;SECRET&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;combined_orders&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as.Date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2016-06-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as.Date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2016-06-30&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;day&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;as.character&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as.Date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;origin&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1970-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order_details&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueRanked&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reportsuite.id&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;reportsuite&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date.from&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date.to&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;elements&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'evar13'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'product'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'revenue'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'units'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'orders'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;50000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;10000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interval.seconds&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order_details&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order_date&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;combined_orders&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind_rows&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;combined_orders&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order_details&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order_details&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The underlying assumption from a package development standpoint was that the user would be working in an interactive fashion: submit a report request, then wait to get the answer back. Nothing in this code is inherently slow from an R standpoint; you just had to wait until one report was calculated by the Adobe Analytics API before the next one was submitted.&lt;/p&gt;

&lt;h2 id=&quot;batch-method&quot;&gt;Batch Method&lt;/h2&gt;

&lt;p&gt;Of course, most APIs can process multiple calls simultaneously, and the Adobe Analytics API is no exception. Thanks to user &lt;a href=&quot;https://github.com/shashispace&quot;&gt;shashispace&lt;/a&gt;, it’s now possible to submit all of your report calls at once, then retrieve the results:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;queued&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as.Date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2016-06-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as.Date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2016-06-30&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;day&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;as.character&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as.Date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;origin&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1970-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reportid&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueRanked&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reportsuite.id&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;reportsuite&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date.from&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date.to&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;elements&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'evar13'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'product'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'revenue'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'units'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'orders'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;50000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;10000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interval.seconds&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;enqueueOnly&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queued&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queued&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reportid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queued_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queued&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queued_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind_rows&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queued_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetReport&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This code is nearly identical to the serial snippet above, except for 1) the addition of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;enqueueOnly = TRUE&lt;/code&gt; keyword argument and 2) lowering the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;interval.seconds&lt;/code&gt; keyword argument to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt; second instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;60&lt;/code&gt;. When you use the enqueueOnly keyword, instead of returning the report results back, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Queue*&lt;/code&gt; function will return the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;report.id&lt;/code&gt;; by accumulating these &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;report.id&lt;/code&gt; values in a list, we can next retrieve the reports and bind them together using dplyr.&lt;/p&gt;

&lt;h2 id=&quot;performance-gain-4x-speed-up&quot;&gt;Performance gain: 4x speed-up&lt;/h2&gt;

&lt;p&gt;Although the code snippets are nearly identical, it is much faster to submit all of the reports up front and then retrieve the results. With every request already in the queue, the API can process several reports concurrently, and while you are retrieving the results of one call the others continue to process in the background.&lt;/p&gt;

&lt;p&gt;I wouldn’t have thought this would make such a difference, but retrieving one month of daily order-level data went from taking 2420 seconds to 560 seconds! If you were to retrieve the same amount of daily data for an entire year, that would save roughly 6 hours of processing time.&lt;/p&gt;
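
&lt;p&gt;The arithmetic behind those figures can be sketched in a few lines of R. This is only a back-of-the-envelope check, and it assumes all twelve months take about as long as the one month I timed; the variable names are made up for illustration and are not part of RSiteCatalyst:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Timings from the one-month test, in seconds
serial_secs &amp;lt;- 2420  # one request at a time
queued_secs &amp;lt;- 560   # all requests enqueued up front

# Speed-up factor
speedup &amp;lt;- serial_secs / queued_secs

# Assumption: every month behaves like the month that was timed
hours_saved_per_year &amp;lt;- (serial_secs - queued_secs) * 12 / 3600

round(speedup, 1)     # 4.3
hours_saved_per_year  # 6.2
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;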

&lt;h2 id=&quot;keep-the-pull-requests-coming&quot;&gt;Keep The Pull Requests Coming!&lt;/h2&gt;

&lt;p&gt;The last several RSiteCatalyst releases have been driven by contributions from the community and I couldn’t be happier! Given that I don’t spend much time in my professional life now using Adobe Analytics, having improvements driven by a community of users using the library daily is just so rewarding.&lt;/p&gt;

&lt;p&gt;So if you have a suggestion for improvement (and especially if you find a bug), please submit an &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues&quot;&gt;issue on GitHub&lt;/a&gt;. Submitting questions and issues to GitHub is the easiest way for me to provide support, while also giving other users the possibility to answer your question before I might. It will also provide a means for others to determine if they are experiencing a new or previously-known problem.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Adobe Analytics Clickstream Data Feed: Calculations and Outlier Analysis</title>
        
          <description>&lt;p&gt;In a previous post, I outlined how to load &lt;a href=&quot;http://randyzwitch.com/adobe-analytics-clickstream-data-feed-relational-database/&quot;&gt;daily Adobe Analytics Clickstream data feeds&lt;/a&gt; into a PostgreSQL database. While this isn’t a long-term scalable solution for large e-commerce companies doing millions of page views per day, for exploratory analysis a relational database structure can work well until a more robust solution is put into place (such as Hadoop/Spark).&lt;/p&gt;

</description>
        
        <pubDate>Tue, 24 May 2016 11:11:20 +0000</pubDate>
        <link>
        http://randyzwitch.com/adobe-analytics-clickstream-data-feed-calculations/</link>
        <guid isPermaLink="true">http://randyzwitch.com/adobe-analytics-clickstream-data-feed-calculations/</guid>
        <content type="html" xml:base="/adobe-analytics-clickstream-data-feed-calculations/">&lt;p&gt;In a previous post, I outlined how to load &lt;a href=&quot;http://randyzwitch.com/adobe-analytics-clickstream-data-feed-relational-database/&quot;&gt;daily Adobe Analytics Clickstream data feeds&lt;/a&gt; into a PostgreSQL database. While this isn’t a long-term scalable solution for large e-commerce companies doing millions of page views per day, for exploratory analysis a relational database structure can work well until a more robust solution is put into place (such as Hadoop/Spark).&lt;/p&gt;

&lt;h2 id=&quot;data-validation-&quot;&gt;Data Validation &amp;lt;groan&amp;gt;&lt;/h2&gt;

&lt;p&gt;Before digging too deeply into the data, we should validate that the data feed as loaded into our database (&lt;a href=&quot;https://gist.github.com/randyzwitch/7a9c48e7132e6ed9dfb0d02ec906961c&quot;&gt;custom database view code&lt;/a&gt;) matches what we observe from other sources (mainly, the Adobe Analytics interface and/or &lt;a href=&quot;http://randyzwitch.com/tags/#rsitecatalyst&quot;&gt;RSiteCatalyst&lt;/a&gt;). Given that the Adobe Analytics data feed represents an export of the underlying data, and Adobe provides the formulas in the &lt;a href=&quot;https://marketing.adobe.com/resources/help/en_US/sc/clickstream/datafeeds_calculate.html&quot;&gt;data feed documentation&lt;/a&gt;, &lt;em&gt;in theory&lt;/em&gt; you should be able to replicate the numbers exactly:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;# &quot;Source 1&quot;: Pull data from the API using RSiteCatalyst&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;RSiteCatalyst&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SCAuth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sys.getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;USER&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sys.getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;SECRET&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;overtime&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueOvertime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date.from&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2016-04-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date.to&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2016-05-17&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;visitors&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date.granularity&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;day&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# &quot;Source 2&quot;: Pull data from Postgres database&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RPostgreSQL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Connect to database&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbConnect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbDriver&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;PostgreSQL&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;postgres&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;localhost&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;5432&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;adobe&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbdata&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbGetQuery&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                     &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;select
                     date(date_time) as date_localtime,
                     sum(CASE WHEN post_page_event = '0' THEN 1 END) as pageviews,
                     count(distinct ARRAY_TO_STRING(ARRAY[post_visid_high::text, post_visid_low::text, visit_num::text], '')) as visits,
                     count(distinct ARRAY_TO_STRING(ARRAY[post_visid_high::text, post_visid_low::text], '')) as visitors
                     from usefuldata
                     where date_time &amp;gt;= '2016-04-01' and date_time &amp;lt; '2016-05-18' and exclude_hit = '0'
                     group by 1
                     order by 1;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Compare data sources&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diff_pv&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;overtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pageviews&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbdata&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pageviews&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diff_pv&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;47&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diff_visits&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;overtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;visits&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbdata&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;visits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diff_visits&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;47&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diff_visitors&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;overtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;visitors&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbdata&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;visitors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diff_visitors&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;47&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The code snippet above shows the validation, and sure enough, the “two different sources” show the exact same values (i.e. the difference is 0 on all 47 days), so everything has been loaded properly into the PostgreSQL database.&lt;/p&gt;
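
&lt;p&gt;One caveat: subtracting the two columns directly assumes that both sources return one row per day, in the same order. A slightly more defensive version of the check joins on the date first. Here’s a minimal sketch using toy data frames; the values and the column names (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pageviews&lt;/code&gt;) are made up for illustration, so adjust them to whatever your API pull and query actually return:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Toy stand-ins for the API pull and the database query (values are made up)
api &amp;lt;- data.frame(date = as.Date(c(&quot;2016-04-01&quot;, &quot;2016-04-02&quot;, &quot;2016-04-03&quot;)),
                  pageviews = c(120, 95, 130))
db  &amp;lt;- data.frame(date = as.Date(c(&quot;2016-04-03&quot;, &quot;2016-04-01&quot;, &quot;2016-04-02&quot;)),
                  pageviews = c(130, 120, 95))

# Join on date so each row is compared with its own day
both &amp;lt;- merge(api, db, by = &quot;date&quot;, suffixes = c(&quot;_api&quot;, &quot;_db&quot;))

# Every date should survive the join, and every difference should be zero
stopifnot(nrow(both) == nrow(api))
stopifnot(all(both$pageviews_api - both$pageviews_db == 0))
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stopifnot&lt;/code&gt; call fails, a day is missing from one source or the counts disagree, which is exactly what you want to find out before doing any further analysis.&lt;/p&gt;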

&lt;h2 id=&quot;finding-anomalies-for-creating-bot-rules&quot;&gt;Finding Anomalies For Creating Bot Rules&lt;/h2&gt;

&lt;p&gt;With the data validated, we can now start digging deeper into the data. As an example, although I have &lt;a href=&quot;https://marketing.adobe.com/resources/help/en_US/reference/bot_rules.html&quot;&gt;bot filtering&lt;/a&gt; enabled, this only handles bots on the &lt;a href=&quot;http://www.iab.com/guidelines/iab-abc-international-spiders-bots-list/&quot;&gt;IAB bot list&lt;/a&gt;, not necessarily people trying to scrape my site (or worse).&lt;/p&gt;

&lt;p&gt;To create a &lt;a href=&quot;https://marketing.adobe.com/resources/help/en_US/reference/t_create_bot_rules.html&quot;&gt;custom bot rule in Adobe Analytics&lt;/a&gt;, you can use IP address(es) and/or User-Agent string. However, as part of data exploration we are not limited to just these features (assuming, of course, that you can map your feature set back to an IP/User-Agent combo). To identify outlier behavior, I’m going to use a technique called ‘&lt;a href=&quot;http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf&quot;&gt;local outlier factors&lt;/a&gt;’ using the &lt;a href=&quot;https://cran.r-project.org/web/packages/Rlof/index.html&quot;&gt;Rlof&lt;/a&gt; package in R with the following data features:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Distinct Days Visited&lt;/li&gt;
  &lt;li&gt;Total Pageviews&lt;/li&gt;
  &lt;li&gt;Total Visits&lt;/li&gt;
  &lt;li&gt;Distinct Pages Viewed&lt;/li&gt;
  &lt;li&gt;Pageviews Per Visit&lt;/li&gt;
  &lt;li&gt;Average Views Per Page&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren’t the only features I could’ve used, but it should be pretty easy to view bot/scraper traffic using these metrics. Here’s the code:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;# Local outlier factor calculation&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RPostgreSQL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Rlof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbConnect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbDriver&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;PostgreSQL&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;postgres&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;localhost&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;5432&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;adobe&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics_lof&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbGetQuery&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;select
                          ip,
                          distinct_days_visited,
                          pageviews,
                          visits,
                          distinct_pages_viewed,
                          pageviews/visits::double precision as pv_per_visit,
                          pageviews/distinct_pages_viewed::double precision as avg_views_per_page
                          from
                          (
                          select
                          ip,
                          sum(CASE WHEN post_page_event = '0' THEN 1 END) as pageviews,
                          count(distinct ARRAY_TO_STRING(ARRAY[post_visid_high::text, post_visid_low::text, visit_num::text, visit_start_time_gmt::text], '')) as visits,
                          count(distinct post_pagename) as distinct_pages_viewed,
                          count(distinct date(date_time)) as distinct_days_visited
                          from usefuldata
                          where exclude_hit = '0'
                          group by 1
                          ) a
                          where visits &amp;gt; 1 and pageviews &amp;gt; 1;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;


&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# The higher the value of k, the more likely lof will be calculated...&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# ...but more generic the clusters&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# NaN/Inf occurs with points on top of one another/div by zero, which is likely...&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# ...with web data when most visitors have 1-2 sessions&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_lof&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics_lof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Append results, get top 500 worst scoring IP addresses&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;results&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics_lof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_lof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_lof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;worst500&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;head&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;results&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;is.infinite&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_lof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;500&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;A local outlier factor close to 1 means a point sits in a region about as dense as that of its neighbors, while a value greater than 1 marks it as a potential outlier. Here’s a visual of the lof scores for the top 500 &lt;em&gt;worst&lt;/em&gt; scoring IP addresses &lt;a href=&quot;https://gist.github.com/randyzwitch/178d72e01e30943f6af82c48a47c4478&quot;&gt;(vegalite R graph code)&lt;/a&gt;:&lt;/p&gt;

&lt;div id=&quot;vis&quot;&gt;&lt;/div&gt;

&lt;p&gt;We can see from the graph that there are at least 500 IP addresses that are potential outliers (since the line doesn’t go below a lof value of 1). These points are a good starting place for going back to the overall table and inspecting the complete data feed records for each IP address.&lt;/p&gt;
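
&lt;p&gt;Turning the scores into a shortlist of IP addresses to inspect takes a few lines of base R. The sketch below uses a small made-up data frame in place of the real results (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ip&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lof&lt;/code&gt; values are hypothetical), and it swaps in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is.finite()&lt;/code&gt; so that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NaN&lt;/code&gt; scores are dropped along with the infinite ones:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Hypothetical LOF scores; the values are made up for illustration
scores &amp;lt;- data.frame(
  ip  = c(&quot;10.0.0.1&quot;, &quot;10.0.0.2&quot;, &quot;10.0.0.3&quot;, &quot;10.0.0.4&quot;, &quot;10.0.0.5&quot;),
  lof = c(0.98, 3.7, Inf, NaN, 1.4),
  stringsAsFactors = FALSE
)

# is.finite() is FALSE for Inf, -Inf, NaN and NA, so one test drops them all
usable &amp;lt;- scores[is.finite(scores$lof), ]

# Keep scores above 1 and sort from most to least anomalous
candidates &amp;lt;- usable[usable$lof &amp;gt; 1, ]
candidates &amp;lt;- candidates[order(-candidates$lof), ]

candidates$ip  # &quot;10.0.0.2&quot; &quot;10.0.0.5&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The resulting IP list is what you would feed into a follow-up query against the data feed table to pull the underlying hits.&lt;/p&gt;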

&lt;h2 id=&quot;but-what-about-business-value&quot;&gt;But what about business value?&lt;/h2&gt;

&lt;p&gt;The example above just scratches the surface of what’s possible when you have access to the raw data from Adobe Analytics. It’s possible to do these calculations on my laptop using R because I only have a few hundred thousand records and IP addresses. But this kind of ops work is pretty low-value: unless you are trying to detect system hacking, finding hidden scrapers/spiders in your data to filter out just modifies the denominator of your KPIs; it doesn’t lead to real money per se.&lt;/p&gt;

&lt;p&gt;In the last post of this series, I’ll cover how to work with the datafeed using Spark, and provide an example of using &lt;a href=&quot;http://spark.apache.org/docs/latest/mllib-guide.html&quot;&gt;Spark MLlib&lt;/a&gt; to increase site engagement.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Adobe: Give Credit. You DID NOT Write RSiteCatalyst.</title>
        
          <description>&lt;p&gt;&lt;strong&gt;EDIT 5/10/2016 1:30pm: Several folks from Adobe Analytics/Adobe Marketing Cloud have contacted me, and everything is resolved. I can’t untweet other people’s retweets/shares or delete comments on LinkedIn, but if everyone could stop sharing any more that would be great. 🙂&lt;/strong&gt;&lt;/p&gt;
&lt;hr /&gt;

</description>
        
        <pubDate>Mon, 09 May 2016 12:31:51 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-not-adobe-product/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-not-adobe-product/</guid>
        <content type="html" xml:base="/rsitecatalyst-not-adobe-product/">&lt;p&gt;&lt;strong&gt;EDIT 5/10/2016 1:30pm: Several folks from Adobe Analytics/Adobe Marketing Cloud have contacted me, and everything is resolved. I can’t untweet other people’s retweets/shares or delete comments on LinkedIn, but if everyone could stop sharing any more that would be great. 🙂&lt;/strong&gt;&lt;/p&gt;
&lt;hr /&gt;

&lt;p&gt;As an &lt;a href=&quot;https://github.com/randyzwitch&quot;&gt;author of several open-source software projects&lt;/a&gt;, I’ve taken for granted that people using the software share the same community values as I do. Open-source authors provide their code “&lt;a href=&quot;http://www.howtogeek.com/howto/31717/what-do-the-phrases-free-speech-vs.-free-beer-really-mean/&quot;&gt;free&lt;/a&gt;” to the community so that others may benefit without having to re-invent the wheel. The only &lt;em&gt;expectation&lt;/em&gt; (but not an actual &lt;em&gt;requirement&lt;/em&gt; per se) is attribution to the package author(s) as a thank you for the time and effort they put into writing and maintaining a quality piece of software.&lt;/p&gt;

&lt;p&gt;However, when others take direct credit for writing a package they did not, it crosses into a different realm. Adobe, you DID NOT write RSiteCatalyst, nor have you made any &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/graphs/contributors&quot;&gt;meaningful contributions&lt;/a&gt;. To take credit for RSiteCatalyst, either implicitly or explicitly, is a slight to the work of those who have contributed.&lt;/p&gt;

&lt;h2 id=&quot;adobe-summit-2014-attribution&quot;&gt;Adobe Summit 2014: Attribution!&lt;/h2&gt;

&lt;p&gt;In the beginning, there seemed to be no problem providing &lt;a href=&quot;https://blogs.adobe.com/digitalmarketing/analytics/playing-hits-summit-2014-filtered-metrics-error-monitoring/&quot;&gt;proper attribution&lt;/a&gt;. I count Ben Gaines as one of my stronger professional acquaintances (dare I say, even a friend), so I was honored that he not only mentioned me on stage at his Adobe Summit 2014 presentation, but also followed up with an &lt;a href=&quot;https://blogs.adobe.com/digitalmarketing/analytics/playing-hits-summit-2014-filtered-metrics-error-monitoring/&quot;&gt;official Adobe blog post&lt;/a&gt; re-capping his main points:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2016/05/rsitecatalyst-attribution-1024x603.png&quot; alt=&quot;rsitecatalyst-attribution&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Perfect. My package got wide exposure to the intended audience, which in turn makes it easier to devote time for development and maintenance. The recognition also helped me professionally in that time period, so if I never thanked you publicly Ben, thank you!&lt;/p&gt;

&lt;h2 id=&quot;adobe-summit-2015an-inconspicuous-absence&quot;&gt;Adobe Summit 2015: An Inconspicuous Absence&lt;/h2&gt;

&lt;p&gt;In 2015, RSiteCatalyst moved from a “Tip” to a &lt;a href=&quot;http://video.tv.adobe.com/v/2314t_876d7009-77fb-4a67-86bc-70475fddf88e/&quot;&gt;full-fledged presentation&lt;/a&gt;. I was honored when I first heard that an entire hour would be dedicated to reviewing the package, but no attribution was given:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2016/05/rsitecatalyst-resources-1024x569.png&quot; alt=&quot;rsitecatalyst-resources&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I mean, it was obviously okay to link to non-Adobe websites like &lt;a href=&quot;http://statmethods.net/&quot;&gt;statmethods.net&lt;/a&gt; (a great reference btw) and to &lt;a href=&quot;http://shiny.rstudio.com/&quot;&gt;Shiny&lt;/a&gt;…but okay, attribution is not a requirement.&lt;/p&gt;

&lt;h2 id=&quot;adobe-summit-2016-we-at-adobe&quot;&gt;Adobe Summit 2016: ‘We at Adobe…’&lt;/h2&gt;

&lt;p&gt;The non-mention at Adobe Summit 2015 could be attributed to an oversight; the following during the &lt;a href=&quot;http://summit.adobe.com/na/sessions/summit-online/online2016/#/video/15150t_b09a171f-dc7c-4ff3-b71c-cf79dedb6e94&quot;&gt;2016 RSiteCatalyst Adobe Summit presentation&lt;/a&gt; cannot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2016/05/rsitecatalyst-randy-zwitch-1024x659.png&quot; alt=&quot;rsitecatalyst-randy-zwitch&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Just so we’re clear, this isn’t me noticing the slide notes in a PDF or PPT I shouldn’t have access to. The screenshot above is directly from the &lt;a href=&quot;http://summit.adobe.com/na/sessions/summit-online/online2016/#/video/15150t_b09a171f-dc7c-4ff3-b71c-cf79dedb6e94&quot;&gt;Adobe Summit video&lt;/a&gt; and the statement was said nearly verbatim during the presentation. And it’s not like this was a one-off comment…it’s the same damn presentation as 2015, and I KNOW this script went through several rounds of review and practice by the presenters.&lt;/p&gt;

&lt;h2 id=&quot;it-costs-0-to-do-what-is-right&quot;&gt;It Costs $0 To Do What Is Right&lt;/h2&gt;

&lt;p&gt;It may be hard for RSiteCatalyst users to believe, but this was the first open-source project I ever wrote AND the means by which I learned how to write R code AND the first time I ever accessed an API. Since then, &lt;a href=&quot;https://github.com/WillemPaling&quot;&gt;Willem Paling&lt;/a&gt; did an amazing job refactoring/re-writing a majority of the package when the Adobe Analytics API was updated from version 1.3 to 1.4, and there have been numerous other contributions from the user community. Maybe one day, &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst&quot;&gt;the repo&lt;/a&gt; will even reach 100 stars on GitHub…&lt;/p&gt;

&lt;p&gt;But save for a &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/commit/706d2997cc9ec3a95eff756308110c12e217e1ca&quot;&gt;single commit&lt;/a&gt; to a README file from an employee, Adobe, you have contributed &lt;em&gt;zero&lt;/em&gt; to the development and maintenance of this package. To claim otherwise is beyond distasteful to the ethos of open-source software. I’ve never asked for compensation of any kind; and again, I recognize that you don’t even need to attribute the work at all.&lt;/p&gt;

&lt;p&gt;Just don’t take credit yourselves for providing this functionality to your customers. You did not write &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst&quot;&gt;RSiteCatalyst&lt;/a&gt;, Adobe; a community of (unpaid) volunteers did.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Travis CI: &quot;You Have Too Many Tests LOLZ!&quot;</title>
        
          <description>&lt;p&gt;As part of getting &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-version-1-4-8-release-notes/&quot;&gt;RSiteCatalyst 1.4.8 ready for CRAN&lt;/a&gt;, I’ve managed to accumulate hundreds of &lt;a href=&quot;https://github.com/hadley/testthat&quot;&gt;testthat&lt;/a&gt; tests across 63 test files. Each of these tests runs on &lt;a href=&quot;http://randyzwitch.com/authentication-travis-ci/&quot;&gt;Travis CI against an authenticated API&lt;/a&gt;, and the API frequently queues long-running reports. Long-story-short, my builds started failing, creating the error log message quoted below:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
        
        <pubDate>Tue, 05 Apr 2016 11:41:53 +0000</pubDate>
        <link>
        http://randyzwitch.com/travisci-10minute-timeout-build-error/</link>
        <guid isPermaLink="true">http://randyzwitch.com/travisci-10minute-timeout-build-error/</guid>
        <content type="html" xml:base="/travisci-10minute-timeout-build-error/">&lt;p&gt;As part of getting &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-version-1-4-8-release-notes/&quot;&gt;RSiteCatalyst 1.4.8 ready for CRAN&lt;/a&gt;, I’ve managed to accumulate hundreds of &lt;a href=&quot;https://github.com/hadley/testthat&quot;&gt;testthat&lt;/a&gt; tests across 63 test files. Each of these tests runs on &lt;a href=&quot;http://randyzwitch.com/authentication-travis-ci/&quot;&gt;Travis CI against an authenticated API&lt;/a&gt;, and the API frequently queues long-running reports. Long-story-short, my builds started failing, creating the error log message quoted below:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;stalled-build&quot;&gt;Stalled Build?&lt;/h2&gt;

&lt;p&gt;The most frustrating thing about this error is that all my tests run successfully (albeit taking a looooong time) through RStudio, so I wasn’t quite sure what the problem was with the &lt;a href=&quot;https://travis-ci.org/&quot;&gt;Travis CI&lt;/a&gt; build. Travis CI does provide a comment about this in their &lt;a href=&quot;https://docs.travis-ci.com/user/common-build-problems/#My-builds-are-timing-out&quot;&gt;documentation&lt;/a&gt;, but even then it didn’t solve my problem:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;When a long running command or compile step regularly takes longer than 10 minutes without producing any output, you can adjust your build configuration to take that into consideration.&lt;/p&gt;

  &lt;p&gt;The shell environment in our build system provides a function that helps to work around that, at least for longer than 10 minutes.&lt;/p&gt;

  &lt;p&gt;If you have a command that doesn’t produce output for more than 10 minutes, you can prefix it with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;travis_wait&lt;/code&gt;, a function that’s exported by our build environment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;travis_wait&lt;/code&gt; command would work if I were installing packages, but my errors were during tests, so this parameter isn’t the answer. Luckily, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;testthat&lt;/code&gt; provides a test filtering mechanism, providing a solution by allowing the tests to be broken up into smaller chunks.&lt;/p&gt;

&lt;h2 id=&quot;regex-to-the-rescue&quot;&gt;Regex To The Rescue…&lt;/h2&gt;

&lt;p&gt;For many applications, the default testthat configuration example will work just fine:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;R CMD check&lt;br /&gt;
Create tests/testthat.R that contains:&lt;br /&gt;
library(testthat)&lt;br /&gt;
library(yourpackage)&lt;br /&gt;
test_check(“yourpackage”)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, hidden within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test_check()&lt;/code&gt; arguments is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filter&lt;/code&gt;, which will take a regular expression to filter which files in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test&lt;/code&gt; folder will get run when the command is triggered by R CMD check. Why is this important? Because each time a new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test_check()&lt;/code&gt; function gets called, output gets written to stdout, and thus avoids 10 minutes passing without producing any output. Here’s an example of what my successful build logs now look like (&lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/tree/master/tests&quot;&gt;GitHub code for the testthat code structure&lt;/a&gt;):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;checking tests…&lt;br /&gt;
Running ‘testthat-build.R’&lt;br /&gt;
Running ‘testthat-get.R’ [5s/267s]&lt;br /&gt;
Running ‘testthat-queuefallout.R’ [1s/59s]&lt;br /&gt;
Running ‘testthat-queueovertime.R’ [3s/210s]&lt;br /&gt;
Running ‘testthat-queuepathing.R’ [2s/55s]&lt;br /&gt;
Running ‘testthat-queueranked.R’ [2s/183s]&lt;br /&gt;
Running ‘testthat-queuesummary.R’ [2s/136s]&lt;br /&gt;
Running ‘testthat-queuetrended.R’ [17s/346s]&lt;br /&gt;
Running ‘testthat-save.R’ [1s/46s]&lt;br /&gt;
OK&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can now see that instead of getting a single output message of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Running testthat.R&lt;/code&gt;, I have nine separate test files running, none of which take 10 minutes to complete. For my package, each of my test files is labeled based on the function name, and I can end up using really simple regex literals such as the following:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;testthat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test_check&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;RSiteCatalyst&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;get&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;So each file with the word “get” in the filename will be run by this function; I’m not worried about writing complex regexes here, since at worst my matching is too broad and I run the same test multiple times.&lt;/p&gt;
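
&lt;p&gt;To preview which files a given filter would pick up before pushing to Travis CI, a rough base-R approximation can help. This is only a sketch: the file names are made up for illustration, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preview_filter()&lt;/code&gt; is a hypothetical helper rather than part of testthat, and it assumes (per my reading of the testthat documentation) that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test-&lt;/code&gt; prefix and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.R&lt;/code&gt; suffix are stripped before the regex is applied:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Hypothetical test file names, for illustration only
test_files &amp;lt;- c(&quot;test-getelements.R&quot;, &quot;test-getmetrics.R&quot;,
                &quot;test-queueranked.R&quot;, &quot;test-QueueSummary.R&quot;)

# Approximate the matching: drop the &quot;test-&quot; prefix and &quot;.R&quot; suffix,
# then keep the names whose stem matches the regular expression
preview_filter &amp;lt;- function(files, filter) {
  stems &amp;lt;- sub(&quot;\\.[rR]$&quot;, &quot;&quot;, sub(&quot;^test-?&quot;, &quot;&quot;, files))
  files[grepl(filter, stems)]
}

preview_filter(test_files, &quot;get&quot;)    # the two &quot;get&quot; files
preview_filter(test_files, &quot;queue&quot;)  # only the lower-case &quot;queue&quot; file
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;An empty result from a check like this is a hint that the corresponding &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test_check()&lt;/code&gt; call would find nothing to run.&lt;/p&gt;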

&lt;h2 id=&quot;but-be-careful-of-case-sensitivity&quot;&gt;…But Be Careful Of Case-Sensitivity!&lt;/h2&gt;

&lt;p&gt;The one caveat to simple regex filtering above is that if you’re not careful, you’ll get no match from your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test_check()&lt;/code&gt; function, which will fail the build on Travis CI. I spent hours trying to figure out why my tests ran fine on OSX, but failed on Travis. Eventually, I even &lt;a href=&quot;https://github.com/hadley/testthat/issues/434&quot;&gt;filed an issue&lt;/a&gt; against hadley’s repo, feeling silly as soon as I found out that my error was due to case-sensitivity in Linux but not OSX (or Windows for that matter).&lt;/p&gt;

&lt;p&gt;So, pay attention, and if all else fails, go with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filter = &quot;summary|Summary&quot;&lt;/code&gt; or similar to match the case of your filenames!&lt;/p&gt;
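
&lt;p&gt;As a quick sanity check, the effect of case on a pattern can be tried in base R with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grepl()&lt;/code&gt; before committing. The names below are hypothetical, and this only illustrates regex matching itself, not how any particular filesystem treats file names:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Hypothetical file stems that differ only in case
stems &amp;lt;- c(&quot;queuesummary&quot;, &quot;QueueSummary&quot;)

grepl(&quot;summary&quot;, stems)          # TRUE FALSE: lower-case pattern misses &quot;QueueSummary&quot;
grepl(&quot;summary|Summary&quot;, stems)  # TRUE TRUE: the alternation covers both spellings
grepl(&quot;[Ss]ummary&quot;, stems)       # TRUE TRUE: equivalent bracket expression
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;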

&lt;h2 id=&quot;you-can-never-really-have-too-many-tests&quot;&gt;You Can Never Really Have Too Many Tests&lt;/h2&gt;

&lt;p&gt;Obviously, the title of this blog post is in jest; Travis CI doesn’t care what you’re running or comment on how many tests you run. But hopefully this blog post provides the answer to the next person down the line running into this issue. Don’t delete your tests; run multiple &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test_check()&lt;/code&gt; functions, and the file name printed to the console every few minutes should resolve the problem.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.4.8 Release Notes</title>
        
          <description>&lt;p&gt;For being in RSiteCatalyst retirement, I’m ending up working on more functionality lately ¯_(ツ)_/¯. Here are the changes for RSiteCatalyst 1.4.8, which should be &lt;a href=&quot;https://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot;&gt;available on CRAN&lt;/a&gt; shortly:&lt;/p&gt;

</description>
        
        <pubDate>Mon, 04 Apr 2016 10:05:15 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-4-8-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-4-8-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-4-8-release-notes/">&lt;p&gt;For being in RSiteCatalyst retirement, I’m ending up working on more functionality lately ¯_(ツ)_/¯. Here are the changes for RSiteCatalyst 1.4.8, which should be &lt;a href=&quot;https://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot;&gt;available on CRAN&lt;/a&gt; shortly:&lt;/p&gt;

&lt;h2 id=&quot;segment-stacking&quot;&gt;Segment Stacking&lt;/h2&gt;

&lt;p&gt;RSiteCatalyst now has the ability to take multiple values in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;segment.id&lt;/code&gt; keyword for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Queue*&lt;/code&gt; functions. This functionality was graciously provided by &lt;a href=&quot;https://twitter.com/FootballActuary&quot;&gt;Adam Gitzes&lt;/a&gt;, closing an &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues/129&quot;&gt;issue&lt;/a&gt; that was nearly a year old. At times it felt like I was hazing him with change requests, but for Adam’s first open-source contribution, this is a huge addition in functionality.&lt;/p&gt;

&lt;p&gt;So now you are able to pass multiple segments into a function call and get an ‘AND’ behavior like so:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;stacked_seg&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueRanked&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2016-03-08&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2016-03-09&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;page&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;segment.id&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;5433e4e6e4b02df70be4ac63&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;54adfe3de4b02df70be5ea08&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The result (Visits from Social AND Visits from Apple Browsers):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2016/04/rsitecatalyst-segment-stacking-1024x58.png&quot; alt=&quot;rsitecatalyst-segment-stacking&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;queuesummary-now-with-dateto-and-datefrom-keywords&quot;&gt;QueueSummary: Now with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date.to&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date.from&lt;/code&gt; keywords&lt;/h2&gt;

&lt;p&gt;In response to &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues/158&quot;&gt;GitHub issue #158&lt;/a&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date.to&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date.from&lt;/code&gt; parameters were added; this was a minor, but long-term oversight (it’s always been possible to do this in the Adobe Analytics API). So now rather than just specifying the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date&lt;/code&gt; keyword and getting a full-year summary or a full-month, you can specify any arbitrary start/end dates.&lt;/p&gt;

&lt;h2 id=&quot;trivial-fixes-silenced-httr-message-clarified-documentation&quot;&gt;Trivial Fixes: Silenced &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;httr&lt;/code&gt; message, clarified documentation&lt;/h2&gt;

&lt;p&gt;Starting with the newest version of httr, you get a message for any API call where the encoding wasn’t set. So for long running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Queue*&lt;/code&gt; requests, you may have received dozens of warnings to stdout about &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;No encoding supplied: defaulting to UTF-8.&quot;&lt;/code&gt; This has been remedied, and the warning should no longer occur.&lt;/p&gt;

&lt;p&gt;Also, the &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/blob/master/man/QueueRanked.Rd#L86-#L93&quot;&gt;documentation for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Queue*&lt;/code&gt; functions&lt;/a&gt; was clarified to show an example of using SAINT classifications as the report breakdown; hopefully this didn’t cause too much confusion to anyone else.&lt;/p&gt;

&lt;h2 id=&quot;volunteers-wanted&quot;&gt;Volunteers Wanted!&lt;/h2&gt;

&lt;p&gt;As I referenced in the first paragraph, while I’m fully committed to maintaining RSiteCatalyst, I don’t currently have the time/desire to continue to develop the package to improve functionality. Given that I don’t use this package for my daily work, it’s hard for me to dedicate time to the project.&lt;/p&gt;

&lt;p&gt;Thanks again to Adam Gitzes who stepped up and provided significant effort to close an outstanding feature request. I would love it if others in the digital analytics community would follow Adam’s lead; don’t worry about whether you are ‘good enough’, get a working solution together and we’ll figure out how to harden the code and get it merged. Be the code change you want to see in the world 🙂&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Adobe Analytics Clickstream Data Feed: Loading To Relational Database</title>
        
          <description>&lt;p&gt;In my previous post about the &lt;a href=&quot;http://randyzwitch.com/adobe-analytics-clickstream-raw-data-feed/&quot;&gt;Adobe Analytics Clickstream Data Feed&lt;/a&gt;, I showed how it was possible to take a single day worth of data and build a dataframe in R. However, most likely your analysis will require using multiple days/weeks/months of data, and given the size and complexity of the feed, loading the files into a relational database makes a lot of sense.&lt;/p&gt;

</description>
        
        <pubDate>Fri, 18 Mar 2016 14:42:23 +0000</pubDate>
        <link>
        http://randyzwitch.com/adobe-analytics-clickstream-data-feed-relational-database/</link>
        <guid isPermaLink="true">http://randyzwitch.com/adobe-analytics-clickstream-data-feed-relational-database/</guid>
        <content type="html" xml:base="/adobe-analytics-clickstream-data-feed-relational-database/">&lt;p&gt;In my previous post about the &lt;a href=&quot;http://randyzwitch.com/adobe-analytics-clickstream-raw-data-feed/&quot;&gt;Adobe Analytics Clickstream Data Feed&lt;/a&gt;, I showed how it was possible to take a single day worth of data and build a dataframe in R. However, most likely your analysis will require using multiple days/weeks/months of data, and given the size and complexity of the feed, loading the files into a relational database makes a lot of sense.&lt;/p&gt;

&lt;p&gt;Although there may be database-specific “fast-load” tools more appropriate for this application, this blog post will show how to handle this process using only R and &lt;a href=&quot;http://www.postgresql.org/download/&quot;&gt;PostgreSQL&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;file-organization&quot;&gt;File Organization&lt;/h2&gt;

&lt;p&gt;Before getting into the loading of the data into PostgreSQL, I like to sort my files by type into separate directories (remember from the &lt;a href=&quot;http://randyzwitch.com/adobe-analytics-clickstream-raw-data-feed/&quot;&gt;previous post&lt;/a&gt;, you’ll receive three files per day). R makes OS-level operations simple enough:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#### 1. Setting directory to FTP folder where files incoming from Adobe&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;## Has ~2000 files in it from 2 years of data&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;setwd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;~/Downloads/datafeed/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#### 2. Sort files into three separate folders&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;## Manifests - plain text files&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir.exists&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;manifest&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir.create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;manifest&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;list.files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;*.txt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file.rename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;manifest&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;## Server calls tsv.gz&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir.exists&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;servercalls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir.create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;servercalls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;list.files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;*.tsv.gz&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file.rename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;servercalls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;## Lookup files .tar.gz&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir.exists&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;lookup&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir.create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;lookup&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;list.files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;*.tar.gz&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file.rename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;lookup&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Were there more file types, I could’ve abstracted this into a function instead of copying the code three times, but the idea is the same: check whether the directory exists and, if it doesn’t, create it and move the files into it.&lt;/p&gt;
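
&lt;p&gt;For what it’s worth, here is a minimal sketch of what that abstraction might look like, written in Python rather than R. The function name, the folder names and the sample file names are all made up for illustration, and unlike the R snippets this version moves files even when the folder already exists.&lt;/p&gt;

```python
import shutil
import tempfile
from pathlib import Path


def move_into_folder(base_dir, pattern, folder_name):
    """Create base_dir/folder_name if it is missing, then move matching files into it."""
    base = Path(base_dir)
    target = base / folder_name
    target.mkdir(exist_ok=True)
    moved = []
    # sorted() materialises the file list before anything is moved
    for path in sorted(base.glob(pattern)):
        if path.is_file():
            shutil.move(str(path), str(target / path.name))
            moved.append(path.name)
    return moved


# Demo in a throwaway directory; the file names are hypothetical
with tempfile.TemporaryDirectory() as tmp:
    for name in ["01-report_2016-01-01.tsv.gz",
                 "report_2016-01-01-lookup_data.tar.gz",
                 "report_2016-01-01.txt"]:
        (Path(tmp) / name).touch()
    moved_lookup = move_into_folder(tmp, "*.tar.gz", "lookup")
    moved_manifest = move_into_folder(tmp, "*.txt", "manifest")
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(moved_lookup, moved_manifest, remaining)
```

&lt;p&gt;One call per file type replaces each copy of the check-create-move block, which only starts to pay off once there are more than a handful of file types.&lt;/p&gt;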

&lt;h2 id=&quot;connecting-and-loading-data-topostgresql-from-r&quot;&gt;Connecting and Loading Data to PostgreSQL from R&lt;/h2&gt;

&lt;p&gt;Once our files are organized, we can begin loading them into PostgreSQL using the &lt;a href=&quot;https://cran.r-project.org/web/packages/RPostgreSQL/index.html&quot;&gt;RPostgreSQL&lt;/a&gt; R package. RPostgreSQL is &lt;a href=&quot;https://github.com/rstats-db/DBI&quot;&gt;DBI-compliant&lt;/a&gt;, so the connection code looks the same as it would for any other type of database engine. The biggest caveat of loading your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;servercall&lt;/code&gt; data into a database is that the first load is almost guaranteed to require loading everything as text (using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;colClasses = &quot;character&quot;&lt;/code&gt; argument in R). You’ll need to load the data as text because Adobe Analytics implementations necessarily change over time, and text is the only column format that guarantees no loss of data (we can fix the schema later within Postgres, either by using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALTER TABLE&lt;/code&gt; or by writing a view).&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RPostgreSQL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Connect to database&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbConnect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbDriver&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;PostgreSQL&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;postgres&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;localhost&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;5432&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;adobe&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Set directory to avoid having to use paste to build urls&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;setwd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;~/Downloads/datafeed/servercalls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Set column headers for server calls&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;column_headers&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read.delim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;~/Downloads/datafeed/lookup/column_headers.tsv&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stringsAsFactors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Loop over entire list of files&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Setting colClasses to character only way to guarantee all data loads&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#File formats or implementations can change over time; fix schema in database after data loaded&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;list.files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read.csv2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\t&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;header&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stringsAsFactors&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;colClasses&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;character&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbWriteTable&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'servercalls'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row.names&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Run analyze in PostgreSQL so that query planner has accurate information&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbGetQuery&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;analyze servercalls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With this small amount of code, we’ve generated the table definition structure (&lt;a href=&quot;https://gist.github.com/randyzwitch/e26b97d26689b6b31044&quot;&gt;see here for the underlying Postgres code&lt;/a&gt;), loaded the data, and told Postgres to analyze the table to gather statistics for efficient queries. Sweet, two years of data loaded with minimal effort!&lt;/p&gt;
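
&lt;p&gt;To make the “load as text, fix the schema later” idea concrete, here is a small, hedged sketch in Python. It uses the standard library’s SQLite driver as a stand-in for PostgreSQL, two invented daily extracts, and hypothetical column names (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hit_date&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;page&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;page_views&lt;/code&gt;); none of it comes from a real data feed.&lt;/p&gt;

```python
import csv
import io
import sqlite3

# Two made-up daily extracts; the last row has an empty third field,
# the kind of value that breaks a numeric column on load
day1 = "2016-01-01\t/home\t3\n2016-01-01\t/about\t1\n"
day2 = "2016-01-02\t/home\t5\n2016-01-02\t/pricing\t\n"

conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
conn.execute("CREATE TABLE servercalls (v1 TEXT, v2 TEXT, v3 TEXT)")

# Every field goes in as text, so nothing is rejected or coerced on load
for extract in (day1, day2):
    rows = csv.reader(io.StringIO(extract), delimiter="\t")
    conn.executemany("INSERT INTO servercalls VALUES (?, ?, ?)", rows)

# Fix the schema afterwards with a view that renames and casts the text columns
conn.execute(
    "CREATE VIEW servercalls_typed AS "
    "SELECT v1 AS hit_date, v2 AS page, "
    "CAST(NULLIF(v3, '') AS INTEGER) AS page_views "
    "FROM servercalls"
)

totals = conn.execute(
    "SELECT hit_date, SUM(page_views) FROM servercalls_typed "
    "GROUP BY hit_date ORDER BY hit_date"
).fetchall()
print(totals)
```

&lt;p&gt;The raw table keeps every value exactly as it arrived, and the view can be dropped and rewritten whenever the interpretation of a column changes, without reloading anything.&lt;/p&gt;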

&lt;h2 id=&quot;loading-lookup-tables-into-postgresql&quot;&gt;Loading Lookup Tables Into PostgreSQL&lt;/h2&gt;

&lt;p&gt;With the server call data loaded into our database, we now need to load our lookup tables. Lucky for us, these do maintain a constant format, so we don’t need to worry about setting all the fields to text; RPostgreSQL should infer the column types correctly.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RPostgreSQL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Connect to database&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbConnect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbDriver&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;PostgreSQL&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;postgres&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;localhost&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;5432&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;adobe&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;setwd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;~/Downloads/datafeed/lookup/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Create function due to repetitiveness&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Since we're loading lookup tables with mostly same values each time, put source file in table&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loadlookup&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tblname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read.csv2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tblname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;.tsv&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\t&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;header&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stringsAsFactors&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbWriteTable&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tblname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row.names&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#untar files, place in directory by day&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;list.files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;*.tar.gz&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;untar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tbl&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;browser_type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;browser&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;color_depth&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;column_headers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
               &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;connection_type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;country&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;event&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;javascript_version&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
               &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;languages&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;operating_systems&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;plugins&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;referrer_type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
               &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;resolution&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;search_engines&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loadlookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tbl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;SHORTCUT&lt;/strong&gt;: The dimension tables that are common to all report suites don’t really change over time, although that isn’t guaranteed. In the 758 days of files I loaded (&lt;a href=&quot;https://gist.github.com/randyzwitch/5ed2f4fc8574b91efd29&quot;&gt;code&lt;/a&gt;), the only files having more than one value for a given key were: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;browser&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;browser_type&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;operating_systems&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;search_engines&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event&lt;/code&gt; (report suite specific for every company) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;column_headers&lt;/code&gt; (report suite specific for every company). So if you’re doing a bulk load of data, it’s generally sufficient to use the newest lookup table and save yourself some time. If you are processing the data every day, you can use an &lt;a href=&quot;https://wiki.postgresql.org/wiki/UPSERT&quot;&gt;upsert process&lt;/a&gt; and generally there will be few if any updates.&lt;/p&gt;
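
&lt;p&gt;As a rough illustration of the daily upsert idea, the sketch below uses Python’s built-in SQLite driver as a stand-in for PostgreSQL. It assumes a version of SQLite new enough to support the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT ... ON CONFLICT ... DO UPDATE&lt;/code&gt; syntax (PostgreSQL 9.5 and later accept the same form); the table layout, the helper name and the lookup values are invented for the example.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
conn.execute("CREATE TABLE browser (id INTEGER PRIMARY KEY, name TEXT)")


def upsert_lookup(rows):
    """Insert new keys and overwrite the value of keys that already exist."""
    conn.executemany(
        "INSERT INTO browser (id, name) VALUES (?, ?) "
        "ON CONFLICT (id) DO UPDATE SET name = excluded.name",
        rows,
    )


# Hypothetical lookup rows for two consecutive days
upsert_lookup([(1, "Chrome 48"), (2, "Firefox 44")])   # first day's file
upsert_lookup([(2, "Firefox 44.0"), (3, "Safari 9")])  # next day: one change, one new key

rows = conn.execute("SELECT id, name FROM browser ORDER BY id").fetchall()
print(rows)
```

&lt;p&gt;Each day’s lookup file only touches the keys it contains, so the table converges on the newest value per key, which is the same result the bulk-load shortcut gives you.&lt;/p&gt;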

&lt;h2 id=&quot;lets-do-analytics&quot;&gt;Let’s Do Analytics!!!!???!!!&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;moan&gt;Why is there always so much ETL work, I want to data science the hell out of some data&lt;/moan&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At this point, if you were uploading the same amount of data for the traffic my blog does (not much), you’d be about 1-2 hours into loading data, still having done no analysis. In fact, in order to do analysis, you’d still need to modify the column names and types in your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;servercalls&lt;/code&gt; table, update the lookup tables to have the proper column names, and maybe you’d even want to pre-summarize the tables into views/materialized views for Page View/Visit/Visitor level. Whew, that’s a lot of work just to calculate daily page views.&lt;/p&gt;

&lt;p&gt;Yes it is. But taking on a project like this isn’t for page views; just use the Adobe Analytics UI!&lt;/p&gt;

&lt;p&gt;In a future blog post or two, I’ll demonstrate how to use this relational database layout to perform analyses not possible within the Adobe Analytics interface, and also show how we can skip this ETL process altogether using a &lt;a href=&quot;http://blog.cask.co/2015/03/schema-on-read-in-action/&quot;&gt;schema-on-read process&lt;/a&gt; with Spark.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Calling RSiteCatalyst From Python</title>
        
          <description>&lt;blockquote class=&quot;twitter-tweet&quot; data-lang=&quot;en&quot;&gt;
  &lt;p dir=&quot;ltr&quot; lang=&quot;en&quot;&gt;
    &lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;@randyzwitch&lt;/a&gt; Do you know if anyone has gotten RSiteCat running in a Jupyter Notebook that ran RPY2? Tired of using 2 different environments
  &lt;/p&gt;

&lt;/blockquote&gt;
</description>
        
        <pubDate>Mon, 22 Feb 2016 10:34:44 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-adobe-analytics-python/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-adobe-analytics-python/</guid>
        <content type="html" xml:base="/rsitecatalyst-adobe-analytics-python/">&lt;blockquote class=&quot;twitter-tweet&quot; data-lang=&quot;en&quot;&gt;
  &lt;p dir=&quot;ltr&quot; lang=&quot;en&quot;&gt;
    &lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;@randyzwitch&lt;/a&gt; Do you know if anyone has gotten RSiteCat running in a Jupyter Notebook that ran RPY2? Tired of using 2 different environments
  &lt;/p&gt;

  &lt;p&gt;
    — Adam Gitzes (@FootballActuary) &lt;a href=&quot;https://twitter.com/FootballActuary/status/700350988842995712&quot;&gt;February 18, 2016&lt;/a&gt;
  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This will be a very short post, because the only “new” information I’m going to provide is the minimal example to answer the question. Yes, it is in fact possible to call RSiteCatalyst from Python and seems to work well. The most important things are 1) making sure you install &lt;a href=&quot;http://rpy2.readthedocs.org/en/version_2.7.x/&quot;&gt;rpy2&lt;/a&gt; and 2) loading &lt;a href=&quot;http://pandas.pydata.org/&quot;&gt;Pandas&lt;/a&gt; (since so much of RSiteCatalyst is API calls returning data frames). It doesn’t hurt to already have experience using &lt;a href=&quot;http://randyzwitch.com/tags/#rsitecatalyst&quot;&gt;RSiteCatalyst&lt;/a&gt; in &lt;a href=&quot;/tags/#r&quot;&gt;R&lt;/a&gt;, since all we’re doing here is using Python to pass code to R.&lt;/p&gt;

&lt;h2 id=&quot;setup-code-rpy2-and-pandas&quot;&gt;Setup Code: rpy2 and Pandas&lt;/h2&gt;

&lt;p&gt;To call an R package from Python, the rpy2 package works very well, both from the REPL and from Jupyter Notebook. For RSiteCatalyst, here is the setup code:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;

&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;rpy2.robjects.packages&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rpackages&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;rpy2.robjects&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pandas2ri&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Activate ability to translate R objects to pandas data frames
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pandas2ri&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;activate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Load RSiteCatalyst into Python
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rsc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rpackages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;importr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'RSiteCatalyst'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With this code run, now you can make calls to the RSiteCatalyst R package, just as if you were in R itself.&lt;/p&gt;

&lt;h2 id=&quot;sample-call-getreportsuites&quot;&gt;Sample Call: GetReportSuites&lt;/h2&gt;

&lt;p&gt;Just to prove it works, here’s a code snippet using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetReportSuites()&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;# Call GetReportSuites to confirm it works
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rsc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetReportSuites&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;pandas2ri&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ri2py_dataframe&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;And in Jupyter Notebook, you would see something similar to:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2016/02/rsitecatalyst-rpy2-1-1024x424.png&quot; alt=&quot;rsitecatalyst-rpy2&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;but-why&quot;&gt;But, Why?&lt;/h2&gt;

&lt;p&gt;So that’s about it…if you wanted to, you could call RSiteCatalyst from Python without much effort. There aren’t a whole lot of reasons to do so, unless, like Adam above, you’d rather just use Python. I suppose if you wanted to use some other Python packages, such as &lt;a href=&quot;http://flask.pocoo.org/docs/0.10/&quot;&gt;Flask&lt;/a&gt; to create a dashboard or &lt;a href=&quot;http://stanford.edu/~mwaskom/software/seaborn/&quot;&gt;Seaborn&lt;/a&gt; for visualization, you might want to do this. Until I got this tweet, it never occurred to me to do this, so YMMV.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Edit, 2/26/16: Adam Gitzes, who originally asked the question, also provides a different solution using Jupyter Notebook magics at his &lt;a href=&quot;http://maassmedia.com/r-site-catalyst-python.php&quot;&gt;blog post here&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.4.7 (and 1.4.6) Release Notes</title>
        
          <description>&lt;p&gt;It seems as though I missed release notes for RSiteCatalyst version 1.4.6, so we’ll do those and RSiteCatalyst 1.4.7 (now on CRAN) at the same time…&lt;/p&gt;

</description>
        
        <pubDate>Mon, 01 Feb 2016 09:24:04 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-4-7-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-4-7-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-4-7-release-notes/">&lt;p&gt;It seems as though I missed release notes for RSiteCatalyst version 1.4.6, so we’ll do those and RSiteCatalyst 1.4.7 (now on CRAN) at the same time…&lt;/p&gt;

&lt;h2 id=&quot;rsitecatalyst-146&quot;&gt;RSiteCatalyst 1.4.6&lt;/h2&gt;

&lt;p&gt;This release was mostly tweaking some settings, specifically:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Adding a second &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;top&lt;/code&gt; argument within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Queue*&lt;/code&gt; functions for more control over the results returned. It used to be the case that a breakdown report with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;top&lt;/code&gt; argument would return, say, the top 10 values of the first variable and up to 50,000 values for the breakdown. Now you can control the second level breakdown as well, such as the top 10 pages and top 5 browsers for those pages.&lt;/li&gt;
  &lt;li&gt;Allowing validation of the API call to be disabled before submitting. I never ran into this, but a user was seeing the API return validation errors under high volume. So if you have any weird issues, disable validation using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;validate = FALSE&lt;/code&gt; keyword argument.&lt;/li&gt;
  &lt;li&gt;The package now handles the situation where the API returns an unexpected type for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reportID&lt;/code&gt; and automatically converts it to the proper type (a low-level issue, not a user-facing one).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those changes carry forward into version RSiteCatalyst 1.4.7, so there is no reason for a user to stick with this release.&lt;/p&gt;

&lt;h2 id=&quot;rsitecatalyst-147---no-more-unicode-errors&quot;&gt;RSiteCatalyst 1.4.7 - No more Unicode Errors!&lt;/h2&gt;

&lt;p&gt;I was surprised it took so long for someone to report this error, but &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues/151&quot;&gt;#151&lt;/a&gt; finally reported a case from a user in Germany where search keywords were being mangled due to the presence of an umlaut. UTF-8 encoding is now the default for both calling the API and processing the results, so this issue will hopefully not arise again.&lt;/p&gt;

&lt;p&gt;Additionally, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;locale&lt;/code&gt; argument has been added, to set the proper locale for your report suite. This is specified through the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SCAuth()&lt;/code&gt; function, with the list of possible &lt;a href=&quot;https://marketing.adobe.com/developer/documentation/analytics-reporting-1-4/r-reportdescriptionlocale&quot; target=&quot;_blank&quot;&gt;locales provided by the Adobe documentation&lt;/a&gt;. So if, even after using 1.4.7 with UTF-8 encoding by default, you are still seeing errors, try setting the locale to the country you are in or the country setting of the report suite.&lt;/p&gt;

&lt;h2 id=&quot;feature-requestsbugs&quot;&gt;Feature Requests/Bugs&lt;/h2&gt;

&lt;p&gt;As always, if you come across bugs or have feature requests, please continue to use the &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues&quot;&gt;RSiteCatalyst GitHub Issues&lt;/a&gt; page to submit issues. Don’t worry about cluttering up the page with tickets; please fill out a new issue for anything you encounter (along with the code you’ve already tried that is failing), unless you are SURE that it is the same problem someone else is facing.&lt;/p&gt;

&lt;p&gt;However, outside of patching really serious bugs, I will likely &lt;strong&gt;not spend any more time improving this package in the future&lt;/strong&gt;; my interests have changed, and RSiteCatalyst is pretty much complete as far as I’m concerned. That said, contributors are also &lt;em&gt;very welcome&lt;/em&gt;. If there is a feature you’d like added, and especially if you can fix an outstanding issue reported at GitHub, we’d love to have your contributions. Willem and I are both parents of young children and have real jobs outside of open-source software creation, so we welcome any meaningful contributions to RSiteCatalyst.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>A Million Text Files And A Single Laptop</title>
        
          <description>&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2016/01/million-files-size.png&quot; alt=&quot;GNU Parallel Cat Unix&quot; /&gt;&lt;/p&gt;

</description>
        
        <pubDate>Thu, 28 Jan 2016 09:53:42 +0000</pubDate>
        <link>
        http://randyzwitch.com/gnu-parallel-medium-data/</link>
        <guid isPermaLink="true">http://randyzwitch.com/gnu-parallel-medium-data/</guid>
        <content type="html" xml:base="/gnu-parallel-medium-data/">&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2016/01/million-files-size.png&quot; alt=&quot;GNU Parallel Cat Unix&quot; /&gt;&lt;/p&gt;

&lt;p&gt;More often than I would like, I receive datasets where the data has only been partially cleaned, such as the picture on the right: hundreds, thousands…even millions of tiny files. Usually when this happens, the data all have the same format (such as having been generated by sensors or other memory-constrained devices).&lt;/p&gt;

&lt;p&gt;The problem with data like this is that 1) it’s inconvenient to think about a dataset as a million individual pieces, 2) the data in aggregate are too large to hold in RAM, but 3) the data are small enough that using Hadoop or even a relational database seems like overkill.&lt;/p&gt;

&lt;p&gt;Surprisingly, with judicious use of &lt;a href=&quot;http://www.gnu.org/software/parallel/&quot;&gt;GNU Parallel&lt;/a&gt;, stream processing and a relatively modern computer, you can efficiently process annoying, “medium-sized” data as described above.&lt;/p&gt;

&lt;h2 id=&quot;data-generation&quot;&gt;Data Generation&lt;/h2&gt;

&lt;p&gt;For this blog post, I used a combination of R and Python to generate the data: the “Groceries” dataset from the &lt;a href=&quot;https://cran.r-project.org/web/packages/arules/vignettes/arules.pdf&quot;&gt;arules&lt;/a&gt; package for sampling transactions (with replacement), and the Python &lt;a href=&quot;https://github.com/joke2k/faker&quot;&gt;Faker (fake-factory)&lt;/a&gt; package to generate fake customer profiles and for creating the 1MM+ text files:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#R Code
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arules&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Groceries&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Groceries&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;groceries.txt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;,&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Python Code
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;csv&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;faker&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Faker&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fake&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Faker&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DataFrame&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Create customer file of 1,234,567 customers with fake data
# Use dataframe index as a way to generate unique customer id
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fake&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;simple_profile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1234567&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;customer_df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DataFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;customer_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cust_id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customer_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Read in the transactions file written by the R code above
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;groceries.txt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;transactions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;readlines&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Remove new line character
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transactions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transactions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Generate transactions by cust_id
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#file format:
#cust_id::int
#store_id::int
#transaction_datetime::string/datetime
#items::string
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#for each customer...
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1234567&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;#...create a file...
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'/transactions/custfile_%s'&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'w'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;csvfile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;trans&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;csv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;writer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;csvfile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;delimiter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;' '&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;quotechar&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'&quot;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;quoting&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;csv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QUOTE_MINIMAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;#...that contains all of the transactions they've ever made
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;365&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;trans&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;writerow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fake&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zipcode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fake&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date_time_this_decade&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;before_now&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;after_now&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transactions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transactions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]])&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;problem-1-concatenating-cat---outtxt-&quot;&gt;Problem 1: Concatenating (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cat * &amp;gt;&amp;gt; out.txt&lt;/code&gt; ?!)&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;http://man7.org/linux/man-pages/man1/cat.1.html&quot;&gt;cat&lt;/a&gt; utility in Unix-y systems is familiar to most anyone who has ever opened up a Terminal window. Take some or all of the files in a folder and concatenate them together into one big file. But something funny happens once you get enough files…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; out.txt

&lt;span class=&quot;nt&quot;&gt;-bash&lt;/span&gt;: /bin/cat: Argument list too long&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;That’s a fun thought…too many files for the computer to keep track of. As it turns out, the limit is on the total size of the argument list: the operating system caps how many bytes of arguments can be handed to a new process (you can check the cap with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;getconf ARG_MAX&lt;/code&gt;). The asterisk in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cat&lt;/code&gt; command gets expanded by the shell before running, so the above statement tries to pass 1,234,567 file names to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cat&lt;/code&gt; and you get an error message.&lt;/p&gt;
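
&lt;p&gt;To get a feel for the numbers, here is a rough Python sketch that compares the size of such an argument list with the operating system’s limit. The file names mirror the pattern used by the generator script above; the helper name argv_bytes is made up for this example, and the estimate ignores environment variables and pointer overhead, so treat the result as a lower bound.&lt;/p&gt;

```python
import os


def argv_bytes(names):
    """Approximate bytes needed to pass names as command-line arguments."""
    # Each argument reaches the new process as a NUL-terminated string.
    return sum(len(os.fsencode(name)) + 1 for name in names)


# Hypothetical file names, following the custfile_N pattern used above
names = ["custfile_%s" % i for i in range(1234567)]

needed = argv_bytes(names)
limit = os.sysconf("SC_ARG_MAX")  # the value that getconf ARG_MAX reports

print("argument list needs roughly", needed, "bytes")
print("system limit (ARG_MAX) is", limit, "bytes")
```

&lt;p&gt;On a typical laptop the first number should come out far larger than the second, which is why the command is refused before cat ever starts.&lt;/p&gt;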

&lt;p&gt;One (naive) solution would be to loop over every file (a completely serial operation):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;f &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$f&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; ../transactions_cat/transactions.csv&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Roughly &lt;strong&gt;10,093 seconds&lt;/strong&gt; later, you’ll have your concatenated file. Nearly three hours is quite a coffee break…&lt;/p&gt;

&lt;h2 id=&quot;solution-1-gnu-parallel--concatenation&quot;&gt;Solution 1: GNU Parallel &amp;amp; Concatenation&lt;/h2&gt;

&lt;p&gt;Above, I mentioned that looping over each file gets you past the error condition of too many arguments, but it is a serial operation. If you look at your computer usage during that operation, you’ll likely see that only a fraction of a core of your computer’s CPU is being utilized. We can greatly improve that through the use of GNU Parallel:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;span class=&quot;nb&quot;&gt;ls&lt;/span&gt; | parallel &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-j&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$f&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;cat {} &amp;gt;&amp;gt; ../transactions_cat/transactions.csv&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$f&lt;/code&gt; value passed to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-j&lt;/code&gt; is a placeholder to highlight that you can choose the number of concurrent jobs; however, you will not get infinitely linear scaling, as shown below (&lt;a href=&quot;https://gist.github.com/randyzwitch/ee0f738b5895e059fa2a&quot;&gt;graph code, Julia&lt;/a&gt;):&lt;/p&gt;

&lt;div id=&quot;cat&quot;&gt;
&lt;/div&gt;

&lt;p&gt;Given that the graph represents a single run at each level of parallelism, it’s a bit difficult to say &lt;em&gt;exactly&lt;/em&gt; where the parallelism gets maxed out, but at roughly 10 concurrent jobs, there’s no additional benefit. It’s also interesting to point out what the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-m&lt;/code&gt; argument represents; by specifying &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-m&lt;/code&gt;, you allow multiple arguments (i.e. multiple text files) to be inserted into each &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cat&lt;/code&gt; command that parallel runs, rather than starting one process per file. This &lt;em&gt;alone&lt;/em&gt; leads to an 8x speedup over the naive loop solution.&lt;/p&gt;

&lt;h2 id=&quot;problem-2-data--ram&quot;&gt;Problem 2: Data &amp;gt; RAM&lt;/h2&gt;

&lt;p&gt;Now that we have a single file, we’ve removed the “one million files” cognitive dissonance, but now we have a second problem: at 19.93GB, the amount of data exceeds the RAM in my laptop (2014 MBP, 16GB of RAM). So in order to do analysis, either a bigger machine is needed or processing has to be done in a streaming or “chunked” manner (such as using the &lt;a href=&quot;http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk&quot;&gt;“chunksize” keyword in pandas&lt;/a&gt;).&lt;/p&gt;
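
&lt;p&gt;As a rough illustration of the streaming idea, here is a minimal Python sketch that reads rows in the same space-delimited layout one at a time and keeps only a running tally in memory. The sample rows, the store numbers, and the function name transactions_per_store are invented for illustration; a real run would pass an open file handle instead of the in-memory sample.&lt;/p&gt;

```python
import csv
import io
from collections import Counter

# Made-up rows in the layout: cust_id store_id "datetime" items
SAMPLE = (
    '0 10001 "2014-01-05 09:30:00" "whole milk,rolls/buns"\n'
    '0 10002 "2014-01-05 17:02:11" soda\n'
    '1 10001 "2015-07-19 12:45:30" "soda,whole milk"\n'
)


def transactions_per_store(lines):
    """Count rows per store while holding only one row in memory at a time."""
    counts = Counter()
    reader = csv.reader(lines, delimiter=" ", quotechar='"')
    for cust_id, store_id, timestamp, items in reader:
        counts[store_id] += 1
    return counts


counts = transactions_per_store(io.StringIO(SAMPLE))
print(counts)
```

&lt;p&gt;Memory use grows with the number of distinct stores rather than with the size of the file, which is the property that makes this approach workable when the data exceed RAM.&lt;/p&gt;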

&lt;p&gt;But continuing on with our use of GNU Parallel, suppose we wanted to answer the following types of questions about our transactions data:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;How many unique products were sold?&lt;/li&gt;
  &lt;li&gt;How many transactions were there per day?&lt;/li&gt;
  &lt;li&gt;How many total items were sold per store, per month?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If it’s not clear from the list above, in all three questions there is an “embarrassingly parallel” portion of the computation. Let’s take a look at how to answer all three of these questions in a time- and RAM-efficient manner:&lt;/p&gt;

&lt;h5 id=&quot;q1-unique-products&quot;&gt;Q1: Unique Products&lt;/h5&gt;

&lt;p&gt;Given the format of the data file (transactions in a single column array), this question is the hardest to parallelize, but using a neat trick with the &lt;a href=&quot;http://www.linfo.org/tr.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tr&lt;/code&gt;&lt;/a&gt; (transliterate) utility, we can map our data to one product per row as we stream over the file:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;# Serial method (i.e. no parallelism)&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# This is a simple implementation of map &amp;amp; reduce; tr statements represent one map, sort -u statements one reducer&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# cut -d ' ' -f 5- transactions.csv | \     - Using cut, take everything from the 5th column and over from the transactions.csv file&lt;/span&gt;
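&lt;span class=&quot;c&quot;&gt;# tr -d \&quot; | \                              - Using tr, trim off double-quotes. This leaves us with a comma-delimited string of products representing a transaction&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# tr ',' '\n' | \                           - Using tr, replace each comma with a newline, so that each product is on its own line&lt;/span&gt;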
&lt;span class=&quot;c&quot;&gt;# sort -u | \                               - Using sort, put similar items together, but only output the unique values&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# wc -l                                     - Count number of unique lines, which after de-duping, represents number of unique products&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;time cut&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;' '&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; 5- transactions.csv | &lt;span class=&quot;nb&quot;&gt;tr&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;tr&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;','&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'\n'&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sort&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-u&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;wc&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt;
331

real	292m7.116s

&lt;span class=&quot;c&quot;&gt;# Parallelized version, default chunk size of 1MB. This will use 100% of all CPUs (real and virtual)&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Also map &amp;amp; reduce; tr statements a single map, sort -u statements multiple reducers (8 by default)&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;time cut&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;' '&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; 5- transactions.csv | &lt;span class=&quot;nb&quot;&gt;tr&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;tr&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;','&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'\n'&lt;/span&gt; | parallel &lt;span class=&quot;nt&quot;&gt;--pipe&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--block&lt;/span&gt; 1M &lt;span class=&quot;nb&quot;&gt;sort&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-u&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sort&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-u&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;wc&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt;
331

&lt;span class=&quot;c&quot;&gt;# block size performance - Making block size smaller might improve performance&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Number of jobs can also be manipulated (not evaluated)&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# --500K:               73m57.232s&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# --Default 1M:         75m55.268s (3.84x faster than serial)&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# --2M:                 79m30.950s&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# --3M:                 80m43.311s&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The trick here is that we replace the commas in each transaction with newline characters; the effect of this is taking a single transaction row and returning multiple rows, one for each product. Then we pass that down the line, eventually using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sort -u&lt;/code&gt; to de-dup the list and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wc -l&lt;/code&gt; to count the number of unique lines (i.e. products).&lt;/p&gt;

&lt;p&gt;In a serial fashion, it takes quite some time to calculate the number of unique products. Incorporating GNU Parallel, just using the defaults, gives nearly a 4x speedup!&lt;/p&gt;
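
&lt;p&gt;The same split-then-combine idea can be sketched in Python. This is only an illustration of the pattern, not what GNU Parallel does internally: the sample item lists and the helper names blocks and unique_in_block are made up, and the blocks are processed one after another here, although each one is independent and could be handed to a separate worker.&lt;/p&gt;

```python
from itertools import islice

# Made-up item lists, one per transaction (quotes already stripped)
ITEM_LISTS = [
    "whole milk,rolls/buns",
    "soda",
    "soda,whole milk",
    "rolls/buns,yogurt",
]


def blocks(rows, size):
    """Yield consecutive lists of at most size rows."""
    it = iter(rows)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def unique_in_block(block):
    """First stage: split rows on commas and de-duplicate within one block."""
    return {product for row in block for product in row.split(",")}


# One partial result per block, then a final pass to merge them
partials = [unique_in_block(b) for b in blocks(ITEM_LISTS, 2)]
unique_products = set().union(*partials)

print(len(unique_products))
```

&lt;p&gt;The final merge is still needed because the same product can show up in more than one block, which mirrors the second sort -u in the shell pipeline.&lt;/p&gt;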

&lt;h5 id=&quot;q2-transactions-by-day&quot;&gt;Q2. Transactions By Day&lt;/h5&gt;

&lt;p&gt;If the file format could be considered undesirable in question 1, for question 2 the format is perfect. Since each row represents a transaction, all we need to do is perform the equivalent of a SQL &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Group By&lt;/code&gt; on the date and count the rows:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;# Data is at transaction level, so just need to do equivalent of 'group by' operation&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Using cut again, we choose field 3, which is the date part of the timestamp&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# sort | uniq -c is a common pattern for doing a 'group by' count operation&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Final tr step is to trim the leading quotation mark from date string&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;time cut&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;' '&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; 3 transactions.csv | &lt;span class=&quot;nb&quot;&gt;sort&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;uniq&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;tr&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;

real	76m51.223s

&lt;span class=&quot;c&quot;&gt;# Parallelized version&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Quoting can be annoying when using parallel, so writing a Bash function is often much easier than dealing with escaping quotes&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# To do 'group by' operation using awk, need to use an associative array&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Because we are doing parallel operations, need to pass awk output to awk again to return final counts&lt;/span&gt;

awksub &lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;awk&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'{a[$3]+=1;}END{for(i in a)print i&quot; &quot;a[i];}'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; awksub
&lt;span class=&quot;nb&quot;&gt;time &lt;/span&gt;parallel &lt;span class=&quot;nt&quot;&gt;--pipe&lt;/span&gt; awksub &amp;lt; transactions.csv | &lt;span class=&quot;nb&quot;&gt;awk&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'{a[$1]+=$2;}END{for(i in a)print i&quot; &quot;a[i];}'&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;tr&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sort

&lt;/span&gt;real	8m22.674s &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;9.05x faster than serial&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Using GNU Parallel starts to become complicated here, but you do get a 9x speed-up by calculating rows by date in chunks, then “reducing” again by calculating total rows by date (a trick I picked up at this &lt;a href=&quot;http://www.rankfocus.com/use-cpu-cores-linux-commands/&quot;&gt;blog post&lt;/a&gt;).&lt;/p&gt;
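
&lt;p&gt;To make the two-stage idea concrete without needing GNU Parallel installed, here is a small sketch that feeds two hand-made chunks through the same kind of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;awk&lt;/code&gt; steps. The sample rows and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reduce_counts&lt;/code&gt; name are assumptions for illustration; the two chunks stand in for the blocks that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parallel --pipe&lt;/code&gt; would hand to each worker:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;# Stage 1 (per chunk): count rows per date; field 3 holds the date
awksub () { awk '{a[$3]+=1} END{for(i in a) print i, a[i]}'; }

# Stage 2 (reduce): add up the partial counts emitted by every chunk
reduce_counts () { awk '{a[$1]+=$2} END{for(i in a) print i, a[i]}' | sort; }

# Two hypothetical chunks of space-delimited rows
{
  printf '%s\n' 'x y 2015-01-01' 'x y 2015-01-02' | awksub
  printf '%s\n' 'x y 2015-01-01' 'x y 2015-01-01' | awksub
} | reduce_counts
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The same date can land in more than one chunk, so the first pass only produces partial counts; the second &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;awk&lt;/code&gt; adds them up (here, 1 + 2 = 3 rows for 2015-01-01).&lt;/p&gt;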

&lt;h5 id=&quot;q3-total-items-per-store-per-month&quot;&gt;Q3. Total items Per store, Per month&lt;/h5&gt;

&lt;p&gt;For this example, it could be that my command-line fu is weak, but the serial method actually turns out to be the fastest. Of course, at a 14 minute run time, the real-time benefits to parallelization aren’t that great.&lt;/p&gt;

&lt;p&gt;It may be possible that one of you out there knows how to do this correctly, but an interesting thing to note is that the serial version already uses 40-50% of the available CPU. So parallelization might yield a 2x speedup, but saving seven minutes per run isn’t worth spending hours trying to find the optimal settings.&lt;/p&gt;

&lt;h2 id=&quot;but-ive-got-multiple-files&quot;&gt;But, I’ve got MULTIPLE files…&lt;/h2&gt;

&lt;p&gt;The three examples above showed that it’s possible to process datasets larger than RAM in a realistic amount of time using GNU Parallel. However, the examples also showed that working with Unix utilities can become complicated rather quickly. Shell scripts can help move beyond the “one-liner” syndrome, when the pipeline gets so long you lose track of the logic, but eventually problems are more easily solved using other tools.&lt;/p&gt;

&lt;p&gt;The data that I generated at the beginning of this post represented two concepts: transactions and customers. Once you get to the point where you want to do joins, summarize by multiple columns, estimate models, etc., loading data into a database or an analytics environment like R or Python makes sense. But hopefully this post has shown that a laptop is capable of analyzing WAY more data than most people believe, using many tools written decades ago.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>21st Century C: Error 64 on OSX When Using Make</title>
        
          <description>&lt;p&gt;To end 2015, I decided to finally learn C, instead of making it a 2016 resolution! I had previously done the &lt;a href=&quot;http://c.learncodethehardway.org/book/&quot;&gt;‘Learn C The Hard Way’&lt;/a&gt; tutorials, taken about a year off, and thus forgotten everything.&lt;/p&gt;

</description>
        
        <pubDate>Thu, 31 Dec 2015 13:17:30 +0000</pubDate>
        <link>
        http://randyzwitch.com/21st-century-c-error-64-osx/</link>
        <guid isPermaLink="true">http://randyzwitch.com/21st-century-c-error-64-osx/</guid>
        <content type="html" xml:base="/21st-century-c-error-64-osx/">&lt;p&gt;To end 2015, I decided to finally learn C, instead of making it a 2016 resolution! I had previously done the &lt;a href=&quot;http://c.learncodethehardway.org/book/&quot;&gt;‘Learn C The Hard Way’&lt;/a&gt; tutorials, taken about a year off, and thus forgotten everything.&lt;/p&gt;

&lt;p&gt;Rather than re-do the same material, I decided to get &lt;a href=&quot;http://shop.oreilly.com/product/0636920033677.do&quot;&gt;’21st Century C’&lt;/a&gt; from O’Reilly and work through that. Unfortunately, there is an error/misprint in the very beginning chapters that makes doing the exercises near impossible on OSX. This error manifests itself as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c99: invalid argument 'all' to -W Error 64&lt;/code&gt;. If you encounter this error on OSX (I’m using OSX 10.11.2 El Capitan as of writing this post), here are three methods for fixing the issue.&lt;/p&gt;

&lt;h2 id=&quot;error-64&quot;&gt;Error 64!&lt;/h2&gt;

&lt;p&gt;When the discussion of using &lt;a href=&quot;https://www.gnu.org/software/make/&quot;&gt;Makefiles&lt;/a&gt; begins on page 15, there is a discussion of the “smallest practicable makefile”, which is just six lines long:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;nv&quot;&gt;P&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;program_name
&lt;span class=&quot;nv&quot;&gt;OBJECTS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CFLAGS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-g&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-Wall&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-O3&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;LDLIBS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;c99

&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;P&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;: &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;OBJECTS&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Unfortunately, this doesn’t &lt;em&gt;quite&lt;/em&gt; work on OSX. Page 11 in the book sort-of references that a fix is needed, but the directions aren’t so clear…&lt;/p&gt;

&lt;h2 id=&quot;error-64-solution-1-book-fix-updated&quot;&gt;Error 64, solution 1: Book Fix, updated&lt;/h2&gt;

&lt;p&gt;To use the book fix, you are supposed to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create a file named &lt;em&gt;c99&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;Put the line &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gcc -std=c99 $*&lt;/code&gt; OR &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clang $*&lt;/code&gt; in the &lt;em&gt;c99&lt;/em&gt; file&lt;/li&gt;
  &lt;li&gt;Add the directory containing the file to your PATH in Terminal (such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;export PATH=&quot;/Users/computeruser:$PATH&quot;&lt;/code&gt; if the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c99&lt;/code&gt; file were located in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/Users/computeruser&lt;/code&gt; directory)&lt;/li&gt;
  &lt;li&gt;Run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chmod +x c99&lt;/code&gt; on the file to make it executable&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you add this work-around to your PATH, then open a fresh Terminal window (or run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;source .bash_profile&lt;/code&gt; to refresh the Bash settings), you should be able to use Make to compile your C code.&lt;/p&gt;
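
&lt;p&gt;Put together, the steps might look like the following Terminal session. This is a sketch rather than the book’s exact text: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/c99-wrapper&lt;/code&gt; directory is a hypothetical location, and the wrapper uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;$@&quot;&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$*&lt;/code&gt; so that arguments containing spaces are passed through intact:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;# Hypothetical directory to hold the wrapper script
mkdir -p &quot;$HOME/c99-wrapper&quot;

# Write a two-line wrapper that forwards every argument to gcc in C99 mode
# (tee also echoes the two lines to the screen)
printf '%s\n' '#!/bin/sh' 'exec gcc -std=c99 &quot;$@&quot;' | tee &quot;$HOME/c99-wrapper/c99&quot;

# Make the wrapper executable and put its directory first on the PATH
chmod +x &quot;$HOME/c99-wrapper/c99&quot;
export PATH=&quot;$HOME/c99-wrapper:$PATH&quot;

# Should now print the wrapper's path rather than the system c99
command -v c99
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;export&lt;/code&gt; line only affects the current session; to keep the change, it would need to go in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.bash_profile&lt;/code&gt;.&lt;/p&gt;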

&lt;p&gt;But to be honest, this seems like a really weird “fix” to me, as it overrides the C compiler settings for any program run via Terminal. I prefer one of two alternate solutions.&lt;/p&gt;

&lt;h2 id=&quot;error-64-solution-2-makefile-change&quot;&gt;Error 64, solution 2: Makefile Change&lt;/h2&gt;

&lt;p&gt;As I was researching this, a helpful Twitter user noted:&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; lang=&quot;en&quot; data-conversation=&quot;none&quot;&gt;
  &lt;p&gt;
    &lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;@randyzwitch&lt;/a&gt; Remove space between CFLAGS and =, and replace c99 with cc. See man c99, -W is not -Wwarnings.
  &lt;/p&gt;

  &lt;p&gt;
    — Eugene Teo (@datajottings) &lt;a href=&quot;https://twitter.com/datajottings/status/682214537341190145&quot;&gt;December 30, 2015&lt;/a&gt;
  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When you switch the ‘c99’ reference to just ‘cc’ in the Makefile, everything works fine. Here’s the subtly different, corrected Makefile:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;nv&quot;&gt;P&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;program_name
&lt;span class=&quot;nv&quot;&gt;OBJECTS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CFLAGS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-g&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-Wall&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-O3&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;LDLIBS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;cc

&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;P&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;: &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;OBJECTS&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;error-64-solution-3-switch-to-clang&quot;&gt;Error 64, solution 3: Switch to Clang&lt;/h2&gt;

&lt;p&gt;The final solution I came across is to skip the GCC compiler and use an alternate compiler called Clang, which is also generally available on OSX (especially with XCode installed). Like solution 2 above, the Makefile is just subtly different:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;nv&quot;&gt;P&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;program_name
&lt;span class=&quot;nv&quot;&gt;OBJECTS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CFLAGS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-g&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-Wall&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-O3&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;LDLIBS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;clang

&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;P&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;: &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;OBJECTS&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Whether to use GCC or Clang as your compiler is really beyond the scope of this blog post; as &lt;em&gt;21st Century C&lt;/em&gt; discusses, it really shouldn’t matter (especially when you are just learning the mechanics of the language).&lt;/p&gt;

&lt;h2 id=&quot;error-64-be-gone&quot;&gt;Error 64, Be Gone!&lt;/h2&gt;

&lt;p&gt;There’s not really much more to say at this point; this blog post is mainly documentation for anyone who comes across this error in the future. I’ve continued on through the book using Clang, but suffice to say, it’s not the compiler that writes poor-quality, non-compiling code, it’s the user. Ah, the fun of learning 🙂&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Four Tactics For Well Thought Out Business Requirements</title>
        
          <description>&lt;blockquote class=&quot;twitter-tweet&quot; data-partner=&quot;tweetdeck&quot;&gt;
  &lt;p dir=&quot;ltr&quot; lang=&quot;en&quot;&gt;
    &lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;@randyzwitch&lt;/a&gt; how to get (reasonably) well thought-through requirements from business people?
  &lt;/p&gt;

&lt;/blockquote&gt;
</description>
        
        <pubDate>Fri, 21 Aug 2015 10:42:13 +0000</pubDate>
        <link>
        http://randyzwitch.com/well-thought-out-business-requirements/</link>
        <guid isPermaLink="true">http://randyzwitch.com/well-thought-out-business-requirements/</guid>
        <content type="html" xml:base="/well-thought-out-business-requirements/">&lt;blockquote class=&quot;twitter-tweet&quot; data-partner=&quot;tweetdeck&quot;&gt;
  &lt;p dir=&quot;ltr&quot; lang=&quot;en&quot;&gt;
    &lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;@randyzwitch&lt;/a&gt; how to get (reasonably) well thought-through requirements from business people?
  &lt;/p&gt;

  &lt;p&gt;
    — Art Webb (@arthurlwebb) &lt;a href=&quot;https://twitter.com/arthurlwebb/status/634710548685418496&quot;&gt;August 21, 2015&lt;/a&gt;
  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of the most common issues in business (especially large corporations) is trying to nail down the requirements for a given analysis request. The “business people” on the front-lines are talking to their higher-ups about what they think are important questions for the business to solve, but by the time the question gets to the analyst or developer, it sounds something like:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It would be interesting to model using SAS how our customers shop for our merchandise by channel and what overlaps there are between demographics, geography, product type and tenure. But we also have to timebox this, we can’t be boiling-the-ocean just looking for needles-in-a-haystack.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Say WHAT? Mr. Business Person, I cannot help you if you do not run that mess through &lt;a href=&quot;http://unsuck-it.com/&quot;&gt;Unsuck-It&lt;/a&gt; first.&lt;/p&gt;

&lt;p&gt;In all seriousness, I’ve found there are a few great ways for an analyst to refine a “question” like the one above into an actionable plan of attack. So the next time you get a jargon-filled, completely generic analysis request such as the one above, try these four tactics.&lt;/p&gt;

&lt;h3 id=&quot;1-all-requests-should-be-phrased-in-the-form-of-a-question&quot;&gt;1. All Requests Should Be Phrased In The Form Of A Question&lt;/h3&gt;

&lt;p&gt;The first thing to notice about the mock interaction above is that there are no question marks; it’s not a question! For an analyst or developer to work effectively, &lt;em&gt;questions&lt;/em&gt; need to be presented, not bland &lt;em&gt;statements&lt;/em&gt;. For example, a refinement series of questions from the analyst might include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;You need a &lt;em&gt;model&lt;/em&gt;? What type of model? Do you mean a predictive model, a decision tree for understanding, a PivotTable for you to poke at, a one-page PowerPoint slide to give your boss?&lt;/li&gt;
  &lt;li&gt;You specified four attributes (demographics, geography, product type and tenure). Do you have a hypothesis around these attributes (or are you just brain-blabbing)?&lt;/li&gt;
  &lt;li&gt;What is meant by “shop”? Do you mean how do customers browse our goods online and in stores, the purchase cycle, what goods are frequently purchased together or something else?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that in all three of the refinement questions above, you are taking a generic idea and really drilling into what is needed. It is &lt;em&gt;the analyst&lt;/em&gt; that is the expert in the techniques for analyzing data, so the analyst should be helping the business person to take a raw analysis request and make it into answerable questions.&lt;/p&gt;

&lt;h3 id=&quot;2-separate-the-tools-from-the-question&quot;&gt;2. Separate The Tools From The Question&lt;/h3&gt;

&lt;p&gt;The second thing to notice in the mock interaction above is the statement “&lt;em&gt;using SAS&lt;/em&gt;”. I didn’t write that to pick on SAS, but rather, this exact statement was said to me early in my career. I had a boss who would try and guess which tool was appropriate for the question he was asking. I presume that he was trying to gauge how hard he thought the problem was, or try to signal to me how hard he thought the problem was. In the end, a plain SQL query with the results copied into an Excel table was all that was necessary.&lt;/p&gt;

&lt;p&gt;As the analyst, confirm whether &lt;em&gt;the tool&lt;/em&gt; is actually part of the &lt;em&gt;deliverable&lt;/em&gt;. Meaning, if you need to deliver a Tableau workbook, ok, specifying “use Tableau” is an important part of the business question. But if the requirement is “production-quality visualizations”, Tableau may or may not be the right tool or might just be one part of a larger workflow.&lt;/p&gt;

&lt;h3 id=&quot;3-every-question-is-interesting-to-someone-solve-the-valuable-ones&quot;&gt;3. Every Question Is Interesting To Someone. Solve The Valuable Ones.&lt;/h3&gt;

&lt;p&gt;Paraphrasing the aphorism “&lt;em&gt;The path to hell is paved with good intentions&lt;/em&gt;”, the path to doing low-value work your entire career is answering questions that start “Wouldn’t it be interesting if…”.&lt;/p&gt;

&lt;p&gt;The basis for these statements is often a tangent in another meeting, where high-level executives think there is information that should just be available at everyone’s fingertips. But if you were to ask “What business action would you take if you knew this piece of information?” or “Is it worth me stopping a project worth $1 million in Pre-Tax Profit per month to answer this for you?”, you’ll suddenly find the question becomes a lot less &lt;em&gt;interesting&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So always have estimates of the business impact of what you are currently working on and ask for the same estimate of those who ask for your time. Projects that are &lt;em&gt;valuable&lt;/em&gt; to the business are “interesting”, everything else is just &lt;em&gt;making work&lt;/em&gt; for other people.&lt;/p&gt;

&lt;h3 id=&quot;4-dont-just-solved-the-stated-question-solve-the-unstated-question-too&quot;&gt;4. Don’t Just Solve The Stated Question. Solve The Unstated Question Too.&lt;/h3&gt;

&lt;p&gt;Finally, when I read the mock interaction above, there are actually &lt;em&gt;two&lt;/em&gt; questions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Stated: Do we understand our customer’s purchasing behaviors?&lt;/li&gt;
  &lt;li&gt;Unstated: How do we optimize our business to take into account our customer’s purchasing behaviors?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For sure, a deep understanding of the customer base is important no matter the product. But the unstated question of “What are we going to &lt;em&gt;do about it&lt;/em&gt;?” is so much more valuable to answer (i.e. tactic #3).&lt;/p&gt;

&lt;p&gt;So even if the refined question becomes ‘Build a customer segmentation based on past purchases’, go one step further and figure out how to implement your findings. Create a test plan for increasing email click-through-rates based on the segments or optimize your display bidding, maybe build a recommender system for your website…implementation of new ideas is always going to be more valuable than just analyzing the past.&lt;/p&gt;

&lt;h3 id=&quot;always-be-assertive&quot;&gt;Always Be Assertive.&lt;/h3&gt;

&lt;p&gt;If the key to sales is “Always Be Closing”, the key to quality analysis is “Always Be Assertive”. Ask questions. Make people think about what they are doing, what they ask of others and what can be done to improve the business. It’s a rare, ego-centric co-worker who doesn’t appreciate collaborating to get to a better quality question (and answer!) than they originally started with.&lt;/p&gt;

&lt;p&gt;Being able to read into what other people are asking for, estimating its value, then delivering more than they even knew they were asking for has helped me tremendously throughout my career. Hopefully by doing some or all of the tactics above, you’ll see a marked improvement in your analysis and career as well!&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.4.5 Release Notes</title>
        
          <description>&lt;p&gt;It’s only been a month since the &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-version-1-4-4-release-notes/&quot;&gt;last RSiteCatalyst update&lt;/a&gt;, and this update is also a pretty minor update in terms of functionality.&lt;/p&gt;

</description>
        
        <pubDate>Mon, 17 Aug 2015 09:43:36 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-4-5-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-4-5-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-4-5-release-notes/">&lt;p&gt;It’s only been a month since the &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-version-1-4-4-release-notes/&quot;&gt;last RSiteCatalyst update&lt;/a&gt;, and this update is also a pretty minor update in terms of functionality.&lt;/p&gt;

&lt;h2 id=&quot;set-your-own-endpoint&quot;&gt;Set Your Own Endpoint&lt;/h2&gt;

&lt;p&gt;For the overseas users (or companies with weird setups), you can now use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;endpoint&lt;/code&gt; argument in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SCAuth()&lt;/code&gt; function to specify your API endpoint. For the most part, this is not recommended, as RSiteCatalyst pings the Adobe Analytics API to evaluate the proper API endpoint to use, but if for some reason you are having issues, you can override what the Adobe API says.&lt;/p&gt;

&lt;h2 id=&quot;new-functions&quot;&gt;New Functions&lt;/h2&gt;

&lt;p&gt;For this release, I briefly looked through the API explorer to see if there were any useful methods that had been missed. This release adds four: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetFunctions&lt;/code&gt; (Get definitions of all formulas/functions in Adobe Analytics), &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueSummary&lt;/code&gt; (Get summary metrics for numerous report suites at once), &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetPrivacySettings&lt;/code&gt; (Privacy Settings at a report suite level), and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetTemplate&lt;/code&gt; (Get template that a current report suite was built from). With the exception of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueSummary()&lt;/code&gt;, none of these functions will likely get you much in the way of additional analytics capabilities, but they are there should you want to use them.&lt;/p&gt;

&lt;h2 id=&quot;feature-requestsbugs&quot;&gt;Feature Requests/Bugs&lt;/h2&gt;

&lt;p&gt;As always, if you come across bugs or have feature requests, please continue to use the &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues&quot;&gt;RSiteCatalyst GitHub Issues&lt;/a&gt; page to submit issues. Don’t worry about cluttering up the page with tickets, please fill out a new issue for anything you encounter (with code you’ve already tried and is failing), unless you are SURE that it is the same problem someone else is facing.&lt;/p&gt;

&lt;p&gt;Outside of patching really serious bugs, I will likely &lt;strong&gt;not spend any more time improving this package in the future&lt;/strong&gt;; my interests have changed, and RSiteCatalyst is pretty much complete as far as I’m concerned. That said, contributors are also &lt;em&gt;very welcome&lt;/em&gt;. If there is a feature you’d like added, and especially if you can fix an outstanding issue reported at GitHub, we’d love to have your contributions. Willem and I are both parents of young children and have real jobs outside of open-source software creation, so we welcome any meaningful contributions to RSiteCatalyst that anyone would like to make.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>JuliaCon 2015: Everyday Analytics and Visualization (video)</title>
        
          <description>&lt;p&gt;At long last, here’s the video of my presentation from JuliaCon 2015, discussing common analytics tasks and visualization. This is really two talks, the first being an example of using the citibike NYC API to analyze ridership of their public bike program, and the second a discussion of the Vega.jl package.&lt;/p&gt;

</description>
        
        <pubDate>Fri, 14 Aug 2015 10:29:41 +0000</pubDate>
        <link>
        http://randyzwitch.com/juliacon-2015-everyday-analytics-and-visualization-video/</link>
        <guid isPermaLink="true">http://randyzwitch.com/juliacon-2015-everyday-analytics-and-visualization-video/</guid>
        <content type="html" xml:base="/juliacon-2015-everyday-analytics-and-visualization-video/">&lt;p&gt;At long last, here’s the video of my presentation from JuliaCon 2015, discussing common analytics tasks and visualization. This is really two talks, the first being an example of using the citibike NYC API to analyze ridership of their public bike program, and the second a discussion of the Vega.jl package.&lt;/p&gt;

&lt;p&gt;Speaking at JuliaCon 2015 at MIT CSAIL is the professional highlight of my year; hopefully even more of you will attend next year.&lt;/p&gt;

&lt;p&gt;Enjoy!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Edit: For those of you who would like to follow-along using the actual &lt;a href=&quot;https://github.com/randyzwitch/juliacon2015&quot;&gt;presentation code&lt;/a&gt;, it is available on GitHub.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;citibank-bike-data&quot;&gt;Citi Bike Data&lt;/h2&gt;
&lt;iframe src=&quot;https://www.youtube.com/embed/0F8tC3ofH4g?start=135&quot; width=&quot;640&quot; height=&quot;360&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;allowfullscreen&quot;&gt;&lt;/iframe&gt;

&lt;h2 id=&quot;vegajl-presentation&quot;&gt;Vega.jl Presentation&lt;/h2&gt;
&lt;iframe src=&quot;https://www.youtube.com/embed/0F8tC3ofH4g?start=3005&quot; width=&quot;640&quot; height=&quot;360&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;allowfullscreen&quot;&gt;&lt;/iframe&gt;</content>
      </item>
      
    
      
      <item>
        <title>Apple MacBook Pro Model A1286 Declared Vintage - The End Of An Era</title>
        
          <description>&lt;p&gt;It’s hard to believe it’s been over 2.5 years since I wrote about my experience with Apple trying to get my &lt;a href=&quot;http://randyzwitch.com/broken-macbook-pro-hinge-fixed-free/&quot;&gt;Broken MacBook Pro Hinge&lt;/a&gt; fixed. Since that time, my Late 2008 MacBook Pro continued to work flawlessly, most of the time keeping up with the scientific programming I would do in R, Python or Julia.&lt;/p&gt;

</description>
        
        <pubDate>Fri, 07 Aug 2015 20:07:12 +0000</pubDate>
        <link>
        http://randyzwitch.com/apple-macbook-pro-model-a1286-late-2008-vintage/</link>
        <guid isPermaLink="true">http://randyzwitch.com/apple-macbook-pro-model-a1286-late-2008-vintage/</guid>
        <content type="html" xml:base="/apple-macbook-pro-model-a1286-late-2008-vintage/">&lt;p&gt;It’s hard to believe it’s been over 2.5 years since I wrote about my experience with Apple trying to get my &lt;a href=&quot;http://randyzwitch.com/broken-macbook-pro-hinge-fixed-free/&quot;&gt;Broken MacBook Pro Hinge&lt;/a&gt; fixed. Since that time, my Late 2008 MacBook Pro continued to work flawlessly, most of the time keeping up with the scientific programming I would do in R, Python or Julia.&lt;/p&gt;

&lt;p&gt;Unfortunately, it seems near impossible (if not completely impossible) to get an OEM A1281 battery as a drop-in replacement. When I went to the Apple Store at Suburban Square, PA, the “Genius” that looked at my computer took 15-20 minutes to look on the Apple website (which I obviously did before arriving, so no value-add there), only to show me a battery in stock that didn’t fit my model of computer. Only after shaming him into looking up the actual part number, was he able to utter the phrase:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Oh, no, we don’t have those any more. Your model MacBook Pro was declared “Vintage”. No more original parts are available from Apple.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Of course&lt;/em&gt; it is. After getting home, I was able to find this &lt;a href=&quot;https://support.apple.com/en-us/HT201624&quot;&gt;service bulletin&lt;/a&gt; from Apple, which outlines which models are obsolete. Apparently, it’s a hard and fast rule that once five years from the end of manufacturing arrives, a model is declared vintage (unless local laws require longer service). So even though the only “problem” with my MacBook Pro is that I was only getting one hour of battery life per charge (or less if I’m compiling code), the computer is destined for a new life somewhere else.&lt;/p&gt;

&lt;h3 id=&quot;vintage-for-me-powerful-for-thee&quot;&gt;“Vintage” For Me, Powerful For Thee&lt;/h3&gt;

&lt;p&gt;While I realize I could go the 3rd-party route and get a replacement battery, at some point, you can only spend so much money keeping older technology alive. Since I use computers pretty intensively, I ended up getting a “new” (used) 2011 MacBook Pro from a neighborhood listing that has decent life on the OEM battery. Surprisingly, I was able to get $360 for my Late-2008 MacBook Pro, being fully honest about the condition, issues and battery life. The older woman who I sold it to fully understood, but worked at a desk and didn’t care about the battery! She also said:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This is easily the most powerful computer I’ve ever owned.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Apple, like I said in my original post, you’ve got a customer for life. And while I’ve moved on to a newer machine, it’s beyond amazing to me that a 7-year-old computer will continue to live on and work at a high level of performance. And with my 2011 MacBook Pro, I still have the option to upgrade the parts (though I don’t need to…SSD, 16GB of RAM and a quad-core i7 processor already!)&lt;/p&gt;

&lt;p&gt;The Retina MacBooks are nice, but very incremental. Here’s hoping the 2011 MacBook Pro lasts as long as my Late 2008 MacBook Pro did!&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Authenticated API Testing Using Travis CI</title>
        
          <description>&lt;p&gt;As I’ve become more serious about contributing in the open-source community, having quality tests for my packages has been something I’ve spent much more time on than when I was just writing quick-and-dirty code for my own purposes. My most used open-sourced package is &lt;a href=&quot;http://randyzwitch.com/tags/#rsitecatalyst&quot;&gt;RSiteCatalyst&lt;/a&gt;, which accesses the Adobe Analytics (authenticated) API, which poses a problem: how do you maintain a project on GitHub with a full test suite, while at the same time not hard-coding your credentials in plain sight for everyone to see?&lt;/p&gt;

</description>
        
        <pubDate>Thu, 06 Aug 2015 20:21:50 +0000</pubDate>
        <link>
        http://randyzwitch.com/authentication-travis-ci/</link>
        <guid isPermaLink="true">http://randyzwitch.com/authentication-travis-ci/</guid>
        <content type="html" xml:base="/authentication-travis-ci/">&lt;p&gt;As I’ve become more serious about contributing in the open-source community, having quality tests for my packages has been something I’ve spent much more time on than when I was just writing quick-and-dirty code for my own purposes. My most used open-sourced package is &lt;a href=&quot;http://randyzwitch.com/tags/#rsitecatalyst&quot;&gt;RSiteCatalyst&lt;/a&gt;, which accesses the Adobe Analytics (authenticated) API, which poses a problem: how do you maintain a project on GitHub with a full test suite, while at the same time not hard-coding your credentials in plain sight for everyone to see?&lt;/p&gt;

&lt;p&gt;The answer ends up being using &lt;a href=&quot;http://docs.travis-ci.com/user/environment-variables/#Encrypted-Variables&quot;&gt;encrypted environment variables&lt;/a&gt; within &lt;a href=&quot;https://travis-ci.org/&quot;&gt;Travis CI&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;testthat&quot;&gt;Testthat!&lt;/h3&gt;

&lt;p&gt;Hadley Wickham provides a great testing framework in &lt;a href=&quot;https://github.com/hadley/testthat&quot;&gt;testthat&lt;/a&gt;; while I wouldn’t go as far as he does to say that the package makes testing &lt;em&gt;fun&lt;/em&gt;, it certainly makes testing &lt;em&gt;easy&lt;/em&gt;. Let’s take a look at some of the tests in RSiteCatalyst for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueOvertime&lt;/code&gt; function:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;test_that&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Validate QueueOvertime using legacy credentials&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;skip_on_cran&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Correct [masked] credentials&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SCAuth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sys.getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;USER&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sys.getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;SECRET&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Single Metric, No granularity (summary report)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;aa&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueOvertime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-12-31&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Validate returned value is a data.frame&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expect_is&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;aa&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;data.frame&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Single Metric, Daily Granularity&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bb&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueOvertime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-12-31&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;day&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Validate returned value is a data.frame&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expect_is&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;data.frame&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Single Metric, Week Granularity&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cc&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueOvertime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-12-31&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;week&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Validate returned value is a data.frame&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expect_is&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;data.frame&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Two Metrics, Week Granularity&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dd&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueOvertime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-12-31&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;week&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Validate returned value is a data.frame&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expect_is&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;data.frame&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Two Metrics, Month Granularity, Social Visitors&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ee&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueOvertime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-12-31&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;month&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;5433e4e6e4b02df70be4ac63&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Validate returned value is a data.frame&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expect_is&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ee&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;data.frame&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Two Metrics, Day Granularity, Social Visitors, Anomaly Detection&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ff&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueOvertime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-12-31&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;day&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;5433e4e6e4b02df70be4ac63&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;anomaly.detection&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Validate returned value is a data.frame&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expect_is&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;data.frame&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;



&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;From the code above, you can see the tests are fairly simplistic; for a given number of permutations of arguments of the function, I test to see if a data frame was returned. This is because, for the most part, RSiteCatalyst is just a means of generating JSON calls, submitting them to the Adobe Analytics API, then parsing the results into an R data frame.&lt;/p&gt;

&lt;p&gt;Since there is very little additional logic in the package, I don’t spend a bunch of time testing what data is actually returned (i.e. what is returned depends on the Adobe Analytics API, not R). What is interesting is line 6; I reference &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sys.getenv()&lt;/code&gt; twice in order to pass in my username and key for the Adobe Analytics API, which feels very “interactive R”, but the goal is automated testing. Filling in those two environment variables is where Travis CI comes in.&lt;/p&gt;
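
&lt;p&gt;One wrinkle is what happens when those variables are empty, such as on a fork or a fresh checkout. Below is a minimal sketch of a guard, assuming you want a quick yes/no answer before calling the API. Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;have_credentials&lt;/code&gt; is a hypothetical helper of my own naming (it is not part of RSiteCatalyst or testthat), and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEMO_&lt;/code&gt; variables are placeholders:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Hypothetical helper (not part of RSiteCatalyst or testthat):
# TRUE only when every named environment variable is set and non-empty
have_credentials &amp;lt;- function(vars = c(&quot;USER&quot;, &quot;SECRET&quot;)) {
  all(nzchar(Sys.getenv(vars)))
}

# Placeholder variables for illustration only; the secret is deliberately empty
Sys.setenv(DEMO_USER = &quot;myusername&quot;, DEMO_SECRET = &quot;&quot;)
creds_ok &amp;lt;- have_credentials(c(&quot;DEMO_USER&quot;, &quot;DEMO_SECRET&quot;))
print(creds_ok)&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Inside a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test_that()&lt;/code&gt; block, a check like this could be paired with testthat’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;skip()&lt;/code&gt; so that a missing credential shows up as a skipped test rather than a failed one.&lt;/p&gt;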

&lt;h3 id=&quot;travis-ci-configuration&quot;&gt;Travis CI Configuration&lt;/h3&gt;

&lt;p&gt;In order to have any automation using Travis CI, you need to create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.travis.yml&lt;/code&gt; configuration file. While you can read the &lt;a href=&quot;http://docs.travis-ci.com/user/languages/r/&quot;&gt;Travis docs to create the .travis.yml file for R&lt;/a&gt;, you’re probably better off just using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;use_travis&lt;/code&gt; function from &lt;a href=&quot;https://github.com/hadley/devtools&quot;&gt;devtools&lt;/a&gt; (also from Hadley, little surprise!) to create the file for you. In terms of &lt;a href=&quot;http://docs.travis-ci.com/user/encryption-keys/&quot;&gt;creating encrypted keys to use with Travis&lt;/a&gt;, you’ll need to use the &lt;a href=&quot;https://github.com/travis-ci/travis.rb&quot;&gt;Travis CLI tool&lt;/a&gt;, which is distributed as a Ruby gem (i.e. package). If you view the &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/blob/master/.travis.yml&quot;&gt;RSiteCatalyst .travis.yml file&lt;/a&gt;, you can see that I define two global “secure” variables, the values of which are the output from running a command similar to the following in the Travis CLI tool:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;travis encrypt &lt;span class=&quot;nv&quot;&gt;RANDY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ZWITCH
Please add the following to your .travis.yml file:

  secure: &lt;span class=&quot;s2&quot;&gt;&quot;b6S4dBc7arvox8UpuFqkz+VP2UmAW/S/B/vgaAdZiZQqUp78YDR6VYdAYN3WisCK1VLGjOVVPQvGxLik0pQokF8FU3sjX0ekH6vSJeqg4utrEZmVtNvdDLEVAmagFy8Fyduow3U4CPW7rzXqvAE4cIVqGR5Lv2KLf8ANUGn+y3E=&quot;&lt;/span&gt;

Pro Tip: You can add it automatically by running with &lt;span class=&quot;nt&quot;&gt;--add&lt;/span&gt;.&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

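&lt;p&gt;If this seems insecure, note that every time you run the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;encrypt&lt;/code&gt; command with the same arguments, you get a different value. As far as I can tell, this is not because new keys are created each time (Travis CI keeps one RSA key pair per repository); rather, the encryption step mixes in random padding, so the same input doesn’t produce the same ciphertext twice.&lt;/p&gt;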

&lt;h3 id=&quot;setting-up-authenticated-testing-locally&quot;&gt;Setting Up Authenticated Testing Locally&lt;/h3&gt;

&lt;p&gt;If you get as far as setting up encrypted Travis CI keys and tests using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;testthat&lt;/code&gt;, the final step is really for convenience. With the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.travis.yml&lt;/code&gt; file, Travis CI sets the R environment variables on THEIR system; on your local machine, the environment variables aren’t set. Even if the environment variables were set, they would be set to the Travis CI encrypted values, which is not what I want to pass to my authentication function in my R package.&lt;/p&gt;

&lt;p&gt;To set the authentication variables locally, so that they are available each time you hit ‘check’ to build and check against CRAN errors, you just need to modify the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.Renviron&lt;/code&gt; file for R:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;n&quot;&gt;USER&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;myusername&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SECRET&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;mysecret&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With that minor change, in addition to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.travis.yml&lt;/code&gt; file, you’ll have a seamless environment for developing and testing R packages.&lt;/p&gt;
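
&lt;p&gt;If you want to confirm that R parses a file in this format the way you expect, here is a minimal sketch using base R’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;readRenviron()&lt;/code&gt;. It writes a throwaway file to a temporary path rather than touching your real &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.Renviron&lt;/code&gt;; the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEMO_&lt;/code&gt; variable names are placeholders, and the values are left unquoted to keep the demo simple:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Write a throwaway environment file (placeholder names and values)
renviron_path &amp;lt;- tempfile(fileext = &quot;.Renviron&quot;)
writeLines(c(&quot;DEMO_USER=myusername&quot;, &quot;DEMO_SECRET=mysecret&quot;), renviron_path)

# readRenviron() sets the variables for the current R session
loaded &amp;lt;- readRenviron(renviron_path)

# Read the values back the same way the tests do
creds &amp;lt;- Sys.getenv(c(&quot;DEMO_USER&quot;, &quot;DEMO_SECRET&quot;))
print(creds)&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;R reads the real &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.Renviron&lt;/code&gt; at startup, so after editing it you’ll need to restart your R session (or call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;readRenviron()&lt;/code&gt; on it) before the new values are visible.&lt;/p&gt;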

&lt;h3 id=&quot;testing-is-like-flossing&quot;&gt;Testing Is Like Flossing…&lt;/h3&gt;

&lt;p&gt;As easy as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;testthat&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;devtools&lt;/code&gt; packages make testing, and as inexpensive as Travis CI is as a service (free for open source projects!), there’s really no excuse to provide packaged-up code and not include tests. Hopefully this blog post has demonstrated that it’s possible to include tests even when authentication is necessary without compromising your credentials.&lt;/p&gt;

&lt;p&gt;So let’s all be sure to include tests, not just pay lip service to the idea that testing is useful. Code testing only works if you actually do it 🙂&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Getting Started: Adobe Analytics Clickstream Data Feed</title>
        
          <description>&lt;blockquote&gt;
  &lt;p&gt;“Well, first you need a TMS and a three-tiered data layer, then some jQuery with a node backend to inject customer data into the page asynchronously if you want to avoid cookie-based limitations with cross-domain tracking and be Internet Explorer 4 compatible…”&lt;/p&gt;
&lt;/blockquote&gt;

</description>
        
        <pubDate>Tue, 04 Aug 2015 09:00:35 +0000</pubDate>
        <link>
        http://randyzwitch.com/adobe-analytics-clickstream-raw-data-feed/</link>
        <guid isPermaLink="true">http://randyzwitch.com/adobe-analytics-clickstream-raw-data-feed/</guid>
        <content type="html" xml:base="/adobe-analytics-clickstream-raw-data-feed/">&lt;blockquote&gt;
  &lt;p&gt;“Well, first you need a TMS and a three-tiered data layer, then some jQuery with a node backend to inject customer data into the page asynchronously if you want to avoid cookie-based limitations with cross-domain tracking and be Internet Explorer 4 compatible…”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Blah Blah Blah. There’s a whole cottage industry around jargon-ing each other to death about digital data collection. But why? Why do we focus on &lt;em&gt;tools&lt;/em&gt;, instead of &lt;em&gt;the data&lt;/em&gt;? Because the tools are necessarily inflexible, so we work backwards from the pre-defined reports we have to the data needed to populate them correctly. Let’s go the other way for once: clickstream data to analysis &amp;amp; reporting.&lt;/p&gt;

&lt;p&gt;In this blog post, I will show the structure of the Adobe Analytics Clickstream Data Feed and how to work with a day’s worth of data within R. Clickstream data isn’t as raw as pure server logs, but the only limit to what we can calculate from clickstream data is what we can accomplish with a bit of programming and imagination. In later posts, I’ll show how to store a year’s worth of data in a relational database, how to store the same data in Hadoop, and how to do analysis using modern tools such as Apache Spark.&lt;/p&gt;

&lt;p&gt;This blog post will not cover the mechanics of getting the feed delivered via FTP. The &lt;a href=&quot;https://marketing.adobe.com/resources/help/en_US/sc/clickstream/datafeeds_configure.html&quot;&gt;Adobe Clickstream Feed documentation&lt;/a&gt; is sufficiently clear in how to get started.&lt;/p&gt;

&lt;h3 id=&quot;ftpfile-structure&quot;&gt;FTP/File Structure&lt;/h3&gt;

&lt;p&gt;Once your Adobe Clickstream Feed starts being delivered via FTP, you’ll have a file listing that looks similar to the following:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2015/07/adobe-clickstream-data-ftp.png&quot; alt=&quot;adobe-clickstream-data-ftp&quot; /&gt;&lt;/p&gt;

&lt;p&gt;What you’ll notice is that with daily delivery, three files are provided, each having a consistent file naming format:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\d+-\S+_\d+-\d+-\d+.tsv.gz&lt;/code&gt;: the main file containing the server call level data.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\S+_\d+-\d+-\d+-lookup_data.tar.gz&lt;/code&gt;: the lookup tables, header files, etc.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\S+_\d+-\d+-\d+.txt&lt;/code&gt;: the manifest file, delivered last so that any automated processes know that Adobe is finished transferring.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The regular expressions will be unnecessary for working with our single day of data, but it’s good to realize that there is a consistent naming structure.&lt;/p&gt;
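
&lt;p&gt;For an automated process handling many days of deliveries, the patterns above could be used to sort files by type. This is only a sketch: the file names are the ones from my single day of data, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;classify_file&lt;/code&gt; is a hypothetical helper, and I’ve tightened the patterns slightly by anchoring them and escaping the literal dots:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# File names from a single day of delivery
files &amp;lt;- c(&quot;01-zwitchdev_2015-07-13.tsv.gz&quot;,
           &quot;zwitchdev_2015-07-13-lookup_data.tar.gz&quot;,
           &quot;zwitchdev_2015-07-13.txt&quot;)

# Anchored versions of the three naming patterns, with literal dots escaped
patterns &amp;lt;- c(
  servercalls = &quot;^\\d+-\\S+_\\d+-\\d+-\\d+\\.tsv\\.gz$&quot;,
  lookup      = &quot;^\\S+_\\d+-\\d+-\\d+-lookup_data\\.tar\\.gz$&quot;,
  manifest    = &quot;^\\S+_\\d+-\\d+-\\d+\\.txt$&quot;
)

# Hypothetical helper: name of the single matching pattern, or NA otherwise
classify_file &amp;lt;- function(f) {
  hits &amp;lt;- names(patterns)[vapply(patterns, grepl, logical(1), x = f, perl = TRUE)]
  if (length(hits) == 1) hits else NA_character_
}

file_types &amp;lt;- vapply(files, classify_file, character(1), USE.NAMES = FALSE)
print(data.frame(file = files, type = file_types))&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Returning &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NA&lt;/code&gt; for anything that doesn’t match exactly one pattern makes unexpected files easy to spot instead of silently guessing.&lt;/p&gt;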

&lt;h3 id=&quot;checking-md5-hashes&quot;&gt;Checking md5 hashes&lt;/h3&gt;

&lt;p&gt;As part of the manifest file, Adobe provides &lt;a href=&quot;https://en.wikipedia.org/wiki/MD5&quot;&gt;md5 hashes&lt;/a&gt; of the files. These serve at least two purposes: 1) making sure that the files truly were delivered in full and 2) making sure that the files haven’t been manipulated/tampered with. In order to check that your md5 hashes match the values provided by Adobe, we can do the following in R:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;setwd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;~/Downloads/datafeed/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Read in Adobe manifest file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;manifest&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read.table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev_2015-07-13.txt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stringsAsFactors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;manifest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;key&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Use digest library to calculate md5 hashes&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;digest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;servercalls_md5&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;digest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;01-zwitchdev_2015-07-13.tsv.gz&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;algo&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;md5&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lookup_md5&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;digest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev_2015-07-13-lookup_data.tar.gz&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;algo&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;md5&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Check to see if hashes contained in manifest file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;servercalls_md5&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%in%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;manifest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#[1] TRUE&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lookup_md5&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%in%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;manifest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#[1] TRUE&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As we can see, both calculated hashes are contained within the manifest, so we can be confident that the files we downloaded haven’t been modified.&lt;/p&gt;
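
&lt;p&gt;As an aside, base R can compute the same kind of hash without an extra package: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tools::md5sum()&lt;/code&gt; returns the MD5 of a file as a string. The sketch below uses a throwaway temp file and a made-up one-row &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;manifest_demo&lt;/code&gt; data frame (both are assumptions for illustration, not the real manifest) just to show the shape of the comparison:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;#Sketch only: demo_file and manifest_demo are stand-ins, not the real data feed
demo_file &amp;lt;- tempfile(fileext = &quot;.tsv&quot;)
writeLines(&quot;a\tb&quot;, demo_file)

#tools::md5sum returns a named character vector; drop the name before comparing
demo_md5 &amp;lt;- unname(tools::md5sum(demo_file))

#Pretend manifest holding the hash in a `value` column
manifest_demo &amp;lt;- data.frame(value = demo_md5, stringsAsFactors = FALSE)
demo_md5 %in% manifest_demo$value #[1] TRUE
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;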

&lt;h3 id=&quot;unzipping-and-loading-raw-files-to-data-frames&quot;&gt;Unzipping and Loading Raw Files to Data Frames&lt;/h3&gt;

&lt;p&gt;Now that our file hashes are validated, it’s time to load the files into R. For the example files, I would be able to fit the entire day into RAM because my blog gets very little traffic. However, I’m still going to limit the rows brought in (via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nrows = 500&lt;/code&gt;), as if we were working with a large e-commerce website with millions of visits per day:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Get list of lookup files from tarball&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;files_tar&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;untar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev_2015-07-13-lookup_data.tar.gz&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Extract files to _temp directory. Directory will be created if it doesn't exist&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;untar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev_2015-07-13-lookup_data.tar.gz&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exdir&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;_temp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Read each file into a data frame&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#If coding like this in R offends you, keep it to yourself...&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;files_tar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_name&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unlist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strsplit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;.tsv&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fixed&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;temp_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read.delim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;_temp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;header&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stringsAsFactors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#column_headers not used as lookup table&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_name&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;column_headers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;temp_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;assign&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;temp_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;temp_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#gz files can be read directly into dataframes from base R&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Could also use `readr` library for performance&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;servercall_data&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read.delim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;~/Downloads/datafeed/01-zwitchdev_2015-07-13.tsv.gz&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                       &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;header&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stringsAsFactors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nrows&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;500&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Use column_headers to label servercall_data data frame using first row of data&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;servercall_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;column_headers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,]&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If we were loading this data into a database, we’d be done with our processing; we have all of our data read into R and it would be a trivial exercise to load the data into a database (we’ll do this in a separate blog post). But since we’re going to analyze this single day of clickstream data, we need to join these 14 data frames together.&lt;/p&gt;
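
&lt;p&gt;If creating a dozen-plus variables with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assign&lt;/code&gt; feels uncomfortable, one alternative (a sketch, not what the loop above does) is to keep the lookup tables in a single named list. The temp directory and the two tiny files here are invented for the demo:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;#Sketch only: build two fake lookup files in a temp directory
demo_dir &amp;lt;- tempfile(&quot;lookup_demo&quot;)
dir.create(demo_dir)
writeLines(c(&quot;1\tChrome&quot;, &quot;2\tFirefox&quot;), file.path(demo_dir, &quot;browser.tsv&quot;))
writeLines(c(&quot;1\tUSA&quot;, &quot;2\tCanada&quot;), file.path(demo_dir, &quot;country.tsv&quot;))

#Read every .tsv into one named list instead of separate variables
demo_files &amp;lt;- list.files(demo_dir, pattern = &quot;\\.tsv$&quot;)
lookups &amp;lt;- lapply(demo_files, function(f) {
  df &amp;lt;- read.delim(file.path(demo_dir, f), header = FALSE, stringsAsFactors = FALSE)
  names(df) &amp;lt;- c(&quot;id&quot;, sub(&quot;\\.tsv$&quot;, &quot;&quot;, f))
  df
})
names(lookups) &amp;lt;- sub(&quot;\\.tsv$&quot;, &quot;&quot;, demo_files)

lookups$browser
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The SQL query later in this post refers to each table by its variable name, so for this walkthrough the separate data frames are the more convenient fit.&lt;/p&gt;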

&lt;h3 id=&quot;sql-the-most-important-language-for-analytics&quot;&gt;SQL: The Most Important Language for Analytics&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;As a slight tangent, if you don’t know SQL, then you’re going to have a really hard time doing any sort of advanced analytics. There are literally millions of tutorials on the Internet (including &lt;a href=&quot;http://randyzwitch.com/sqldf-package-r/&quot;&gt;this one from me&lt;/a&gt;), and understanding how to join and retrieve data from databases is the key to being more than just a report monkey.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The prior code creates 14 data frames because the data is delivered in a &lt;a href=&quot;http://www.studytonight.com/dbms/database-normalization.php&quot;&gt;normalized&lt;/a&gt; structure from Adobe. Now we are going to &lt;a href=&quot;http://searchdatamanagement.techtarget.com/definition/denormalization&quot;&gt;de-normalize&lt;/a&gt; the data, which is just a fancy way of saying “join the files together in order to make a gigantic table.”&lt;/p&gt;

&lt;p&gt;There are probably a dozen different ways to join data frames using just R code, but I’m going to do it using the &lt;a href=&quot;https://cran.r-project.org/web/packages/sqldf/index.html&quot;&gt;sqldf&lt;/a&gt; package so that I can use SQL. This will allow for a single, declarative statement that shows the relationship between the lookup and fact tables:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqldf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;select
sc.*,
browser.browser as browser_name,
browser_type,
connection_type.connection_type as connection_name,
country.country as country_name,
javascript_version,
languages.languages as languages,
operating_systems,
referrer_type,
resolution.resolution as screen_resolution,
search_engines
from servercall_data as sc
left join browser on sc.browser = browser.id
left join browser_type on sc.browser = browser_type.id
left join connection_type on sc.connection_type = connection_type.id
left join country on sc.country = country.id
left join javascript_version on sc.javascript = javascript_version.id
left join languages on sc.language = languages.id
left join operating_systems on sc.os = operating_systems.id
left join referrer_type on sc.ref_type = referrer_type.id
left join resolution on sc.resolution = resolution.id
left join search_engines on sc.post_search_engine = search_engines.id
;
&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;denormalized_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqldf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;There are three lookup tables that weren’t used: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;color_depth&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plugins&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event&lt;/code&gt;. The first two don’t have a lookup column in my data feed (click link for a full listing of &lt;a href=&quot;https://marketing.adobe.com/resources/help/en_US/sc/clickstream/datafeeds_reference.html&quot;&gt;Adobe Clickstream data feed&lt;/a&gt; columns available). These columns aren’t really useful for my purposes anyway, so not a huge loss. The third table, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event&lt;/code&gt; list, requires a separate processing step.&lt;/p&gt;
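
&lt;p&gt;For comparison, here is a rough base-R equivalent of a single left join using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;merge()&lt;/code&gt;. The two small data frames are made up for illustration; only the idea of an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id&lt;/code&gt; column plus a name column mirrors the lookup tables above:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;#Sketch only: toy fact and lookup tables
sc_demo &amp;lt;- data.frame(hit = 1:3, browser = c(2, 1, 99))
browser_demo &amp;lt;- data.frame(id = c(1, 2), browser_name = c(&quot;Chrome&quot;, &quot;Firefox&quot;),
                           stringsAsFactors = FALSE)

#all.x = TRUE keeps every row of sc_demo, like a SQL left join
#Unmatched keys (99 here) get NA in the lookup columns
joined &amp;lt;- merge(sc_demo, browser_demo, by.x = &quot;browser&quot;, by.y = &quot;id&quot;, all.x = TRUE)
joined
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;merge()&lt;/code&gt; sorts on the join key by default, so the rows come back in a different order than they went in, and ten lookups would mean ten chained calls rather than one statement.&lt;/p&gt;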

&lt;h3 id=&quot;processing-event-data&quot;&gt;Processing Event Data&lt;/h3&gt;

&lt;p&gt;As normalized as the Adobe Clickstream Data Feed is, there is one oddity: the events per server call come in a comma-delimited string in a single column with a lookup table. This implies that a separate level of processing is necessary, outside of SQL, since the column “key” is actually multiple keys and the lookup table specifies one event type per row. So if you were to try and join the data together, you wouldn’t get any matches.&lt;/p&gt;
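
&lt;p&gt;A tiny illustration of the problem, using invented event ids and names: the raw string never equals any single id, but once it is split on commas, each piece does.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;#Sketch only: made-up event lookup and a made-up event list string
event_demo &amp;lt;- data.frame(id = c(&quot;1&quot;, &quot;2&quot;, &quot;20&quot;),
                         event = c(&quot;Purchase&quot;, &quot;Product View&quot;, &quot;Custom Event&quot;),
                         stringsAsFactors = FALSE)
event_list_demo &amp;lt;- &quot;1,20&quot;

#The whole string is not a key in the lookup table
event_list_demo %in% event_demo$id #[1] FALSE

#Splitting on commas gives individual keys that do match
keys &amp;lt;- unlist(strsplit(event_list_demo, &quot;,&quot;, fixed = TRUE))
keys %in% event_demo$id #[1] TRUE TRUE
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;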

&lt;p&gt;To deal with this in R, we are going to do an EXTREMELY wasteful operation: we are going to create a data frame with a column for each possible event, then evaluate each row to see if that event occurred. This will use a massive amount of RAM, but of course, this is a feature/limitation of R which wouldn’t be an issue if the data were stored in a database.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Create friendly names in events table replacing spaces with underscores&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;names_filled&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;_&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Initialize a data frame with all 0 values&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Dimensions are number of observations as rows, with a column for every possible event&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ncol&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nrow&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;servercall_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Parse comma-delimited string into vector&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Each vector value represents column name in event_df, assign value of 1 in that row only&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;servercall_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;is.na&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;servercall_data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;post_event_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unlist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strsplit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;as.character&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;servercall_data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;post_event_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
          &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;as.character&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Rename columns with &quot;friendly&quot; names&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;names_filled&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Horizontally join datasets to create final dataset&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;oneday_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;denormalized_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With the final &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cbind&lt;/code&gt; command, we’ve created a 500 row x 1562 column dataset representing a sample of rows from one day of the Adobe Clickstream Data Feed. Having the data denormalized in this fashion takes 6.13 MB of RAM…extrapolating linearly to 1 million rows, you would need 12.26 GB of RAM (per day of data you want to analyze, if stored solely in memory).&lt;/p&gt;
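
&lt;p&gt;The extrapolation is plain linear scaling, sketched below with the figures quoted above. To measure a data frame of your own, base R’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;object.size()&lt;/code&gt; is one option; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;demo_df&lt;/code&gt; is just a small placeholder, not the real dataset.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;#Figures from the text: 6.13 MB for a 500-row sample
mb_per_row &amp;lt;- 6.13 / 500

#Scale to 1 million rows and convert MB to GB (roughly 12.26)
mb_per_row * 1e6 / 1000

#Measuring an object yourself (demo_df is a stand-in for oneday_df)
demo_df &amp;lt;- data.frame(matrix(data = 0, ncol = 10, nrow = 500))
format(object.size(demo_df), units = &quot;MB&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;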

&lt;h3 id=&quot;next-step-analytics&quot;&gt;Next Step: Analytics?!&lt;/h3&gt;

&lt;p&gt;A thousand words in and 91 lines of R code and we still haven’t done any actual analytics. But we’ve completed the first step in any analytics project: data prep!&lt;/p&gt;

&lt;p&gt;In future blog posts in this series, I’ll demonstrate how to actually use this data in analytics, from re-creating reports available in the Adobe Analytics UI (to prove the data is the same) to more advanced analysis such as using association rules, which can be one method for creating a “You may also like…” functionality such as the one at the bottom of this blog.&lt;/p&gt;

&lt;h2 id=&quot;example-files&quot;&gt;Example Files:&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://randyzwitch.com/wp-content/uploads/2015/08/zwitchdev_2015-07-13.txt&quot; target=&quot;_blank&quot;&gt;http://randyzwitch.com/wp-content/uploads/2015/08/zwitchdev_2015-07-13.txt&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://randyzwitch.com/wp-content/uploads/2015/08/zwitchdev_2015-07-13-lookup_data.tar.gz&quot; target=&quot;_blank&quot;&gt;http://randyzwitch.com/wp-content/uploads/2015/08/zwitchdev_2015-07-13-lookup_data.tar.gz&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://randyzwitch.com/wp-content/uploads/2015/08/01-zwitchdev_2015-07-13.tsv.gz&quot; target=&quot;_blank&quot;&gt;http://randyzwitch.com/wp-content/uploads/2015/08/01-zwitchdev_2015-07-13.tsv.gz&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.4.4 Release Notes</title>
        
          <description>&lt;p&gt;It’s been about six months since the last &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-version-1-4-3-release-notes/&quot;&gt;RSiteCatalyst update&lt;/a&gt;, and this update is really just a single bug fix, but a big bug fix at that!&lt;/p&gt;

</description>
        
        <pubDate>Mon, 13 Jul 2015 09:40:20 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-4-4-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-4-4-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-4-4-release-notes/">&lt;p&gt;It’s been about six months since the last &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-version-1-4-3-release-notes/&quot;&gt;RSiteCatalyst update&lt;/a&gt;, and this update is really just a single bug fix, but a big bug fix at that!&lt;/p&gt;

&lt;h3 id=&quot;sparse-data--opaque-error-messages&quot;&gt;Sparse Data = Opaque Error Messages&lt;/h3&gt;

&lt;p&gt;Numerous people have reported receiving an error message from RSiteCatalyst similar to the following:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;‘names’ attribute [1] must be the same length as the vector [0]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is about the least helpful message that could’ve been returned, but it was an R message indicating an internal function trying to overwrite the column names vector (which had non-zero length) with a vector of length zero (which is an error in the context of a data frame). Thankfully, Willem Paling was able to squash this bug (hopefully) once-and-for-all; the error occurs when a user tries to do a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Queue*&lt;/code&gt; report with multiple breakdowns, where NULL data is returned by the Adobe API for one of the breakdowns.&lt;/p&gt;
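
&lt;p&gt;For the curious, the same kind of message can be produced in plain R whenever a names vector and the object being named differ in length. The sketch below is only a guess at the shape of the failure, using an empty list as a stand-in for an empty breakdown; it is not the actual RSiteCatalyst code:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;#Sketch only: an empty list stands in for the NULL breakdown data
empty_result &amp;lt;- list()

#Names and object lengths disagree, so the assignment raises an error
#tryCatch captures the error text instead of stopping the script
msg &amp;lt;- tryCatch({
  names(empty_result) &amp;lt;- &quot;breakdown&quot;
  &quot;no error&quot;
}, error = function(e) conditionMessage(e))
msg
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;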

&lt;p&gt;So hopefully, if you’ve run into this error before (which I have to imagine was quite frustrating), you shouldn’t see this again with v1.4.4 of RSiteCatalyst. Additionally, tests will be added to the test suite to attempt to trigger this warning, so that this horrible monster of a bug doesn’t appear again.&lt;/p&gt;

&lt;h3 id=&quot;authentication-messaging&quot;&gt;Authentication Messaging&lt;/h3&gt;

&lt;p&gt;The only other change of substance was to modify the message returned after calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SCAuth()&lt;/code&gt;; some users were having issues with API calls not working after RSiteCatalyst had returned &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;'Authentication Succeeded'&lt;/code&gt; to the console. RSiteCatalyst never actually validates that your credentials are correct, just that they are stored within the session. The console message has been updated to reflect this.&lt;/p&gt;

&lt;h3 id=&quot;proper-punctuation-prevents-poor-documentation&quot;&gt;Proper Punctuation Prevents Poor Documentation!&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2015/07/title-case.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The Eagle-Eyed among you might have noticed that my DESCRIPTION file was out of CRAN spec for many months. This has now been fixed, so that the meaning is as clear as possible.&lt;/p&gt;

&lt;h2 id=&quot;feature-requestsbugs&quot;&gt;Feature Requests/Bugs&lt;/h2&gt;

&lt;p&gt;As always, if you come across bugs or have feature requests, please continue to use the &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues&quot;&gt;RSiteCatalyst GitHub Issues&lt;/a&gt; page to submit issues. Don’t worry about cluttering up the page with tickets; please fill out a new issue for anything you encounter (with the code you’ve already tried that is failing), unless you are SURE that it is the same problem someone else is facing.&lt;/p&gt;

&lt;p&gt;Contributors are also &lt;em&gt;very welcome&lt;/em&gt;. If there is a feature you’d like added, and especially if you can fix an outstanding issue reported on GitHub, we’d love to have your contributions. Willem and I are both parents of young children and have real jobs outside of open-source software creation, so we welcome any meaningful contributions to RSiteCatalyst.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Vega.jl, Rebooted</title>
        
          <description>&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2015/05/pie-300x251.png&quot; alt=&quot;pie&quot; /&gt;
&lt;img src=&quot;/wp-content/uploads/2015/05/donut-e1432224478621.png&quot; alt=&quot;donut&quot; /&gt;&lt;/p&gt;

</description>
        
        <pubDate>Thu, 21 May 2015 12:56:07 +0000</pubDate>
        <link>
        http://randyzwitch.com/vega-jl-julia/</link>
        <guid isPermaLink="true">http://randyzwitch.com/vega-jl-julia/</guid>
        <content type="html" xml:base="/vega-jl-julia/">&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2015/05/pie-300x251.png&quot; alt=&quot;pie&quot; /&gt;
&lt;img src=&quot;/wp-content/uploads/2015/05/donut-e1432224478621.png&quot; alt=&quot;donut&quot; /&gt;&lt;/p&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;
  Mmmmm, baked goods!
&lt;/p&gt;

&lt;h3 id=&quot;rebooting-vegajl&quot;&gt;Rebooting Vega.jl&lt;/h3&gt;

&lt;p&gt;Recently, I’ve found myself without a project to hack on, and I’ve always been interested in learning more about browser-based visualization. So I decided to revive the work that &lt;a href=&quot;https://github.com/johnmyleswhite&quot; target=&quot;_blank&quot;&gt;John Myles White&lt;/a&gt; had done in building &lt;a href=&quot;https://github.com/johnmyleswhite/Vega.jl&quot;&gt;Vega.jl&lt;/a&gt; nearly two years ago. And since I’ll be giving an analytics &amp;amp; visualization workshop at &lt;a href=&quot;http://juliacon.org/&quot; target=&quot;_blank&quot;&gt;JuliaCon 2015&lt;/a&gt;, I figure I better study the topic in a bit more depth.&lt;/p&gt;

&lt;h3 id=&quot;back-in-working-order&quot;&gt;Back In Working Order!&lt;/h3&gt;

&lt;p&gt;The first thing I tackled here was to upgrade the syntax to target v0.4 of Julia. This is just my developer preference, to avoid using &lt;a href=&quot;https://github.com/JuliaLang/Compat.jl&quot; target=&quot;_blank&quot;&gt;Compat.jl&lt;/a&gt; when there are so many more visualizations I’d like to support. So if you’re using v0.4, you shouldn’t see any deprecation errors; if you’re using v0.3, well, eventually you’ll use v0.4!&lt;/p&gt;

&lt;p&gt;Additionally, I modified the package to recognize the traction that Jupyter Notebook has gained in the community. Whereas the original version of Vega.jl only displayed output in a tab in a browser, I’ve overloaded the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;writemime&lt;/code&gt; method to display &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:VegaVisualization&lt;/code&gt; inline for any environment that can display HTML. If you use Vega.jl from the REPL, you’ll still get the same default browser-opening behavior as existed before.&lt;/p&gt;

&lt;h3 id=&quot;the-first-visualizationyou-addedwas-a-pie-chart&quot;&gt;The First Visualization You Added Was A Pie Chart…&lt;/h3&gt;

&lt;h3 id=&quot;and-followed-with-a-donut-chart&quot;&gt;…And Followed With a Donut Chart?&lt;/h3&gt;

&lt;p&gt;Yup. I’m a troll like that. Besides, being loudly against pie charts is blowhardy (even if studies have shown that people are too stupid to evaluate them).&lt;/p&gt;

&lt;p&gt;Adding these two charts (besides trolling) was a proof-of-concept that I understood the codebase sufficiently in order to extend the package. Now that the syntax is working for Julia v0.4, I understand how the package works (important!), and have improved the workflow by supporting Jupyter Notebook, I plan to create all of the visualizations featured in the &lt;a href=&quot;http://trifacta.github.io/vega/editor/&quot; target=&quot;_blank&quot;&gt;Trifacta Vega Editor&lt;/a&gt; and other standard visualizations such as boxplots. If the community has requests for the order of implementation, I’ll try and accommodate them. Just add a feature request on &lt;a href=&quot;https://github.com/johnmyleswhite/Vega.jl/issues&quot; target=&quot;_blank&quot;&gt;Vega.jl GitHub issues&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;why-not-gadfly-youre-not-starting-a-language-war-are-you&quot;&gt;Why Not Gadfly? You’re Not Starting A Language War, Are You?&lt;/h3&gt;

&lt;p&gt;No, I’m not that big of a troll. Besides, I don’t think we’ve squeezed all the juice (blood?!) out of the &lt;a href=&quot;http://blog.datacamp.com/r-or-python-for-data-analysis/&quot; target=&quot;_blank&quot;&gt;R vs. Python infographic&lt;/a&gt; yet; we don’t need another pointless debate.&lt;/p&gt;

&lt;p&gt;My sole reason for not improving &lt;a href=&quot;http://dcjones.github.io/Gadfly.jl/&quot; target=&quot;_blank&quot;&gt;Gadfly&lt;/a&gt; is just that I plain don’t understand how the codebase works! There are many amazing computer scientists &amp;amp; developers in the Julia community, and I’m not really one of them. I do, however, understand how to generate JSON strings and in that sense, Vega is the perfect platform for me to contribute.&lt;/p&gt;

&lt;h3 id=&quot;collaborators-wanted&quot;&gt;Collaborators Wanted!&lt;/h3&gt;

&lt;p&gt;If you’re interested in visualization, as well as learning Julia and/or contributing to a package, Vega.jl might be a good place to start. I’m always up for collaborating with people, and creating new visualizations isn’t that difficult (especially with the Trifacta examples). So hopefully some of you will be interested enough to join me in adding one more great visualization library to the Julia community.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Sessionizing Log Data Using data.table [Follow-up #2]</title>
        
          <description>&lt;p&gt;Thanks to user &lt;a title=&quot;dnlbrky comment&quot; href=&quot;http://randyzwitch.com/sessionizing-log-data-dplyr-r-window-functions/#comment-16205&quot; target=&quot;_blank&quot;&gt;dnlbrky&lt;/a&gt;, we now have a third way to accomplish sessionizing log data for any arbitrary time out period (see methods &lt;a href=&quot;/sessionizing-log-data-sql/&quot; title=&quot;Sessionizing Log Data Using SQL&quot;&gt;1&lt;/a&gt; and &lt;a href=&quot;/sessionizing-log-data-dplyr-r-window-functions/&quot; title=&quot;Sessionizing Log Data Using dplyr [Follow-up]&quot;&gt;2&lt;/a&gt;), this time using data.table from R along with magrittr for piping:&lt;/p&gt;

</description>
        
        <pubDate>Tue, 20 Jan 2015 09:01:19 +0000</pubDate>
        <link>
        http://randyzwitch.com/sessionizing-log-data-using-data-table-follow-2/</link>
        <guid isPermaLink="true">http://randyzwitch.com/sessionizing-log-data-using-data-table-follow-2/</guid>
        <content type="html" xml:base="/sessionizing-log-data-using-data-table-follow-2/">&lt;p&gt;Thanks to user &lt;a title=&quot;dnlbrky comment&quot; href=&quot;http://randyzwitch.com/sessionizing-log-data-dplyr-r-window-functions/#comment-16205&quot; target=&quot;_blank&quot;&gt;dnlbrky&lt;/a&gt;, we now have a third way to accomplish sessionizing log data for any arbitrary time out period (see methods &lt;a href=&quot;/sessionizing-log-data-sql/&quot; title=&quot;Sessionizing Log Data Using SQL&quot;&gt;1&lt;/a&gt; and &lt;a href=&quot;/sessionizing-log-data-dplyr-r-window-functions/&quot; title=&quot;Sessionizing Log Data Using dplyr [Follow-up]&quot;&gt;2&lt;/a&gt;), this time using data.table from R along with magrittr for piping:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;magrittr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data.table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;## Download, unzip, and load data (first 10,000 lines):&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;single_col_timestamp&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;http://randyzwitch.com/wp-content/uploads/2015/01/single_col_timestamp.csv.gz&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gzcon&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;readLines&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;10000L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;textConnection&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read.csv&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;setDT&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;## Convert to timestamp:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;single_col_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as.POSIXct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;## Order by uid and event_timestamp:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;setkey&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;single_col_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;## Sessionize the data (more than 30 minutes between events is a new session):&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;single_col_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;session_id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;cumsum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as.numeric&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;_&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;## Examine the results:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#single_col_timestamp[uid %like% &quot;a55bb9&quot;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;single_col_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;session_id&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%like%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;fc895c3babd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
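
&lt;p&gt;For anyone who wants to check the logic outside of R, here is a minimal sketch of the same idea in plain Python. It is not dnlbrky’s code: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sessionize&lt;/code&gt; function, its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timeout&lt;/code&gt; parameter and the sample events are all made up for illustration, and it assumes the events arrive as (uid, timestamp) pairs.&lt;/p&gt;

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter


def sessionize(events, timeout=timedelta(minutes=30)):
    """Label (uid, timestamp) pairs with a session id of the form 'uid_n'.

    n starts at 0 for each uid and goes up by one every time the gap between
    two consecutive events of that uid is longer than the timeout.
    """
    labeled = []
    # Sort by uid, then timestamp (the role setkey() plays in the R code)
    ordered = sorted(events, key=itemgetter(0, 1))
    for uid, group in groupby(ordered, key=itemgetter(0)):
        counter = 0
        previous = None
        for _, ts in group:
            if previous is not None and ts - previous > timeout:
                counter += 1  # a long gap starts a new session
            labeled.append((uid, ts, f"{uid}_{counter}"))
            previous = ts
    return labeled


# Hypothetical sample data, not taken from the CSV file used in the post
sample = [
    ("a", datetime(2015, 1, 1, 10, 0)),
    ("a", datetime(2015, 1, 1, 11, 30)),
    ("b", datetime(2015, 1, 1, 9, 0)),
    ("a", datetime(2015, 1, 1, 10, 20)),
]

for row in sessionize(sample):
    print(row)
```

&lt;p&gt;With this sample, the first two events for uid a share session a_0, the third starts a_1 after a 70-minute gap, and uid b gets b_0. The data.table version above gets the same effect from a keyed table and a grouped &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cumsum&lt;/code&gt;.&lt;/p&gt;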

&lt;p&gt;I agree with dnlbrky in that this feels a little better than the dplyr method for heavy SQL users like me, but ultimately, I still think the SQL method is the most elegant and obvious to understand. But that’s the great thing with open-source software; pick any tool you want, accomplish whatever you choose using any method you choose.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Sessionizing Log Data Using dplyr [Follow-up]</title>
        
          <description>&lt;p&gt;Last week, I wrote a blog post showing how to &lt;a href=&quot;http://randyzwitch.com/sessionizing-log-data-sql&quot;&gt;sessionize log data using standard SQL&lt;/a&gt;. The main idea of that post is that if your analytics platform supports window functions (like Postgres and Hive do), you can make quick work out of sessionizing logs. Here’s the winning query:&lt;/p&gt;

</description>
        
        <pubDate>Tue, 13 Jan 2015 16:24:52 +0000</pubDate>
        <link>
        http://randyzwitch.com/sessionizing-log-data-dplyr-r-window-functions/</link>
        <guid isPermaLink="true">http://randyzwitch.com/sessionizing-log-data-dplyr-r-window-functions/</guid>
        <content type="html" xml:base="/sessionizing-log-data-dplyr-r-window-functions/">&lt;p&gt;Last week, I wrote a blog post showing how to &lt;a href=&quot;http://randyzwitch.com/sessionizing-log-data-sql&quot;&gt;sessionize log data using standard SQL&lt;/a&gt;. The main idea of that post is that if your analytics platform supports window functions (like Postgres and Hive do), you can make quick work out of sessionizing logs. Here’s the winning query:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;select&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_event_boundary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;session_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;minutes_since_last_interval&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;new_event_boundary&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;from&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;--Query 1: Define boundary events&lt;/span&gt;
			&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;select&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
			&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minutes_since_last_interval&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;when&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ELSE&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;END&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_event_boundary&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;single_col_timestamp&lt;/span&gt;
			&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;One nested sub-query and two window functions are all it takes to calculate the event boundaries and create a unique identifier for sessions for any arbitrary timeout chosen.&lt;/p&gt;
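
&lt;p&gt;If window functions are new to you, the two steps can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not a translation the database performs: the function names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;boundary_flags&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;session_ids&lt;/code&gt; and the sample timestamps are hypothetical, and the events are assumed to belong to a single uid and to be sorted by time already.&lt;/p&gt;

```python
from datetime import datetime
from itertools import accumulate


def boundary_flags(timestamps, timeout_minutes=30):
    """Inner query: 1 where the gap to the previous event exceeds the timeout, else 0."""
    flags = []
    previous = None
    for ts in timestamps:
        if previous is None:
            # lag() is NULL on the first row, so the CASE falls through to 0
            flags.append(0)
        else:
            gap_minutes = (ts - previous).total_seconds() / 60
            flags.append(1 if gap_minutes > timeout_minutes else 0)
        previous = ts
    return flags


def session_ids(flags):
    """Outer query: running sum of the flags, like sum(...) OVER (ORDER BY event_timestamp)."""
    return list(accumulate(flags))


# Hypothetical events for one uid, already sorted by time
events = [
    datetime(2015, 1, 1, 10, 0),
    datetime(2015, 1, 1, 10, 20),
    datetime(2015, 1, 1, 11, 30),
    datetime(2015, 1, 1, 11, 45),
]

flags = boundary_flags(events)
print(flags)
print(session_ids(flags))
```

&lt;p&gt;Here only the third event follows a gap longer than 30 minutes, so the flags are 0, 0, 1, 0 and the running sum gives session ids 0, 0, 1, 1. In the SQL, the PARTITION BY uid clause repeats this per user.&lt;/p&gt;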

&lt;h2 id=&quot;its-hadleys-house-were-justleasing&quot;&gt;It’s Hadley’s House, We’re Just Leasing&lt;/h2&gt;

&lt;p&gt;Up until today, I hadn’t really done anything using dplyr.  But having a bunch of free time this week and hearing people talk so much about how great dplyr is, I decided to see what it would take to replicate this same exercise using R. dplyr has support for Postgres as a back-end, and has verbs that translate R code into window functions, so I figured it had to be possible. Here’s what I came up with:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;###Sessionization using dplyr&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dplyr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Open a localhost connection to Postgres&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Use table 'single_col_timestamp'&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#group by uid and sort by timestamp for window function&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Do minutes calculation, working around missing support for extract(epoch from timestamp)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Calculate event boundary and unique id via cumulative sum window function&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sessions&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  
        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;src_postgres&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;logfiles&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tbl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;single_col_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;group_by&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arrange&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mutate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minutes_since_last_event&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DATE_PART&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'day'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;86400&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DATE_PART&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'hour'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;3600&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DATE_PART&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'minute'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DATE_PART&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'second'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                           &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mutate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_boundary&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minutes_since_last_event&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
               &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;session_id&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order_by&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;cumsum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minutes_since_last_event&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Show query syntax&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;show_query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sessions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Actually run the query&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;answer&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sessions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Generally, I’m not a fan of the pipe operator, but I figured I’d give it a shot since everyone else seems to like it. This is one nasty bit of R code, but ultimately, it is possible to get the same result as writing SQL directly. I did need to take a few roundabout routes, specifically in calculating the minutes between timestamps and in substituting the CASE expression into the window function rather than calling it by name, but it’s basically the same logic.&lt;/p&gt;

&lt;h2 id=&quot;why-does-this-work&quot;&gt;Why Does This Work?&lt;/h2&gt;

&lt;p&gt;If you compare the SQL code above to the R code, you might be wondering why the dplyr code works. Certainly, working the dplyr way gives me cognitive dissonance, as you generally specify the verbs in the reverse order from how you would write them in SQL. But if you call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;show_query(sessions)&lt;/code&gt;, you can see that dplyr is generating SQL under the hood (I formatted the code for easier viewing):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
	&lt;span class=&quot;nv&quot;&gt;&quot;uid&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;nv&quot;&gt;&quot;minutes_since_last_event&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;nv&quot;&gt;&quot;event_boundary&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;nv&quot;&gt;&quot;session_id&quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
			&lt;span class=&quot;nv&quot;&gt;&quot;uid&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
			&lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
			&lt;span class=&quot;nv&quot;&gt;&quot;minutes_since_last_event&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;minutes_since_last_event&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ELSE&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;END&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;event_boundary&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;minutes_since_last_event&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ELSE&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;uid&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ROWS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UNBOUNDED&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PRECEDING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;session_id&quot;&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;
			&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
				&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
					&lt;span class=&quot;nv&quot;&gt;&quot;uid&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
					&lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
					&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DATE_PART&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'day'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LAG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;uid&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
						&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DATE_PART&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'hour'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LAG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;uid&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
						&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DATE_PART&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'minute'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LAG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;uid&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
						&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DATE_PART&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'second'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LAG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;uid&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;minutes_since_last_event&quot;&lt;/span&gt;
				&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;single_col_timestamp&quot;&lt;/span&gt;
				&lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;uid&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;
			&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;_W1&quot;&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;_W2&quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As with all SQL-generating tools, the generated code is a bit inelegant; however, I have to say that I’m truly impressed that dplyr was able to handle this scenario at all, given that this example has to be at least an edge case, if not a corner case, of what dplyr is meant for in terms of data manipulation.&lt;/p&gt;

&lt;h2 id=&quot;so-dplyris-going-to-become-part-of-your-toolbox&quot;&gt;So, dplyr Is Going To Become Part Of Your Toolbox?&lt;/h2&gt;

&lt;p&gt;While it was possible to re-create the same functionality, ultimately, I don’t see myself using dplyr a whole lot. In the case of using databases, it seems more efficient and portable just to write the SQL directly; at the very least, it’s what I’m already comfortable doing as part of my analytics workflow. For manipulating data frames, maybe I’d use it (I do use plyr extensively in my &lt;a href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot;&gt;RSiteCatalyst&lt;/a&gt; package), but I’d probably be more inclined to use &lt;a href=&quot;http://randyzwitch.com/sqldf-package-r/&quot;&gt;sqldf&lt;/a&gt; instead.&lt;/p&gt;

&lt;p&gt;But that’s just me, not a reflection on the package quality. Happy manipulating, however you choose to do it! 🙂&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Sessionizing Log Data Using SQL</title>
        
          <description>&lt;p&gt;Over my career as a predictive modeler/data scientist, the most important step(s) in any data project without question have been data cleaning and feature engineering. By taking the data you have, correcting flaws and reformulating raw data into additional business-specific concepts, you ensure that you move beyond pure mathematical optimization and actually solve a &lt;em&gt;business problem&lt;/em&gt;. While “big data” is often held up as the future of knowing everything, when it comes down to it, a Hadoop cluster is more often a “Ha-dump” cluster: the place data gets dumped without any proper ETL.&lt;/p&gt;

</description>
        
        <pubDate>Thu, 08 Jan 2015 11:57:56 +0000</pubDate>
        <link>
        http://randyzwitch.com/sessionizing-log-data-sql/</link>
        <guid isPermaLink="true">http://randyzwitch.com/sessionizing-log-data-sql/</guid>
        <content type="html" xml:base="/sessionizing-log-data-sql/">&lt;p&gt;Over my career as a predictive modeler/data scientist, the most important step(s) in any data project without question have been data cleaning and feature engineering. By taking the data you have, correcting flaws and reformulating raw data into additional business-specific concepts, you ensure that you move beyond pure mathematical optimization and actually solve a &lt;em&gt;business problem&lt;/em&gt;. While “big data” is often held up as the future of knowing everything, when it comes down to it, a Hadoop cluster is more often a “Ha-dump” cluster: the place data gets dumped without any proper ETL.&lt;/p&gt;

&lt;p&gt;For this blog post, I’m going to highlight a common request for time-series data: combining discrete events into sessions. Whether you are dealing with sensor data, television viewing data, digital analytics data or any other stream of events, the problem of interest is usually how a human interacts with a machine over a given period of time, not each individual event.&lt;/p&gt;

&lt;p&gt;While I usually use Hive (Hadoop) for daily work, I’m going to use Postgres (via OSX &lt;a title=&quot;Postgres.app OSX&quot; href=&quot;http://postgresapp.com&quot; target=&quot;_blank&quot;&gt;Postgres.app&lt;/a&gt;) to make this as widely accessible as possible. In general, this process will work with any infrastructure/SQL-dialect that supports &lt;a href=&quot;http://www.postgresql.org/docs/9.1/static/tutorial-window.html&quot;&gt;window functions&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;connecting-to-databaseload-data&quot;&gt;Connecting to Database/Load Data&lt;/h2&gt;

&lt;p&gt;For lightweight tasks, I find that psql (the Postgres command-line tool) is easy enough. Here are the commands to create a database to hold our data and to load our two .csv files (download &lt;a href=&quot;/wp-content/uploads/2015/01/single_col_timestamp.csv.gz&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;/wp-content/uploads/2015/01/two_col_timestamp.csv.gz&quot;&gt;here&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2015/01/psql-load-data.png&quot; alt=&quot;psql-load-data&quot; /&gt;&lt;/p&gt;

&lt;p&gt;These files contain timestamps generated for 1000 uid values.&lt;/p&gt;

&lt;h2 id=&quot;query-1-inner-determining-session-boundary-using-a-window-function&quot;&gt;Query 1 (“Inner”): Determining Session Boundary Using A Window Function&lt;/h2&gt;

&lt;p&gt;In order to determine the boundary of each session, we can use a window function along with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lag()&lt;/code&gt;, which lets the row currently being processed be compared against the prior row. Of course, for all of this to work correctly, the data needs to be sorted in time order within each user:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;--Create boundaries at 30 minute timeout&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;select&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minutes_since_last_interval&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;when&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ELSE&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;END&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_event_boundary&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;single_col_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;For this query, we use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lag()&lt;/code&gt; function on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_timestamp&lt;/code&gt; column, and we use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;over partition by uid order by event_timestamp&lt;/code&gt; to define the window over which we want to do our calculation. To clarify how this syntax works, I’ve added a column showing how many minutes have passed between consecutive events, which validates that the 30-minute timeout is calculated correctly. The result is as follows:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2015/01/sql-session-boundary-definition.png&quot; alt=&quot;sql-session-boundary-definition&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For each row where the value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minutes_since_last_interval &amp;gt; 30&lt;/code&gt;, there is a value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt; for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_event_boundary&lt;/code&gt;.&lt;/p&gt;
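
&lt;p&gt;To see the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lag()&lt;/code&gt; comparison in isolation, here is a minimal sketch that runs in Postgres without loading the .csv files. The three inline rows, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;events&lt;/code&gt; alias and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prior_event_timestamp&lt;/code&gt; column are made up for illustration; only the window definition is taken from the query above:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;--Hypothetical inline rows standing in for single_col_timestamp
select
uid,
event_timestamp,
--The prior row's timestamp within the same uid, in time order
lag(event_timestamp) OVER (PARTITION BY uid ORDER BY event_timestamp) as prior_event_timestamp,
--Same gap calculation as the main query, expressed in minutes
(extract(epoch from event_timestamp) - lag(extract(epoch from event_timestamp)) OVER (PARTITION BY uid ORDER BY event_timestamp))/60 as minutes_since_last_interval
from (values
  (1, timestamp '2015-01-01 10:00:00'),
  (1, timestamp '2015-01-01 10:10:00'),
  (1, timestamp '2015-01-01 11:00:00')
) as events (uid, event_timestamp);
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The first row for a uid has no prior row, so &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lag()&lt;/code&gt; returns NULL there; a comparison against NULL is not true, so the CASE expression in the main query falls through to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ELSE 0&lt;/code&gt;. The other two rows show gaps of 10 and 50 minutes, and only the 50-minute gap would be flagged as a boundary.&lt;/p&gt;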

&lt;h2 id=&quot;query-2-outer-creating-a-session-id&quot;&gt;Query 2 (“Outer”): Creating A Session ID&lt;/h2&gt;

&lt;p&gt;The query above defines the event boundaries (which is helpful), but if we want to calculate session-level metrics, we need to create a unique id for each set of rows that are part of one session. To do this, we’re again going to use a window function:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;select&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_event_boundary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;session_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;minutes_since_last_interval&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;new_event_boundary&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;from&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;--Query 1: Define boundary events&lt;/span&gt;
			&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;select&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
			&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minutes_since_last_interval&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;when&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ELSE&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;END&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_event_boundary&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;single_col_timestamp&lt;/span&gt;
			&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This query defines the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;over partition by uid order by event_timestamp&lt;/code&gt; window, but rather than using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lag()&lt;/code&gt; this time, we’re going to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sum()&lt;/code&gt; for the outer query. The effect of using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sum()&lt;/code&gt; in our window function is to do a cumulative sum; every time &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt; shows up, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;session_id&lt;/code&gt; field gets incremented by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt;. If there is a value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt;, the sum is still the same as the row above and thus has the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;session_id&lt;/code&gt;. This is easier to understand visually:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2015/01/sessionized-data.png&quot; alt=&quot;sessionized-data&quot; /&gt;&lt;/p&gt;

&lt;p&gt;At this point, we have a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;session_id&lt;/code&gt; for each group of rows where there have been no 30-minute gaps in behavior.&lt;/p&gt;
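
&lt;p&gt;If the cumulative sum is easier to follow outside of SQL, here’s a minimal Python sketch of just that step. The flag values are made up for illustration (they aren’t rows from the example table), and the list stands in for a single uid’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_event_boundary&lt;/code&gt; column, already ordered by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event_timestamp&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;from itertools import accumulate

# Hypothetical boundary flags for one uid, ordered by event_timestamp:
# 1 marks the first event after a gap of more than 30 minutes, 0 otherwise
new_event_boundary = [0, 0, 1, 0, 0, 1, 0]

# Running total of the flags, playing the role of
# sum(new_event_boundary) OVER (PARTITION BY uid ORDER BY event_timestamp)
session_id = list(accumulate(new_event_boundary))

print(session_id)  # prints [0, 0, 1, 1, 1, 2, 2]
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Each 1 bumps the running total, and each 0 carries the previous total forward, so consecutive rows keep the same id until the next boundary shows up.&lt;/p&gt;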

&lt;h2 id=&quot;final-query-cleaned-up&quot;&gt;Final Query: Cleaned Up&lt;/h2&gt;

&lt;p&gt;Although the query in the previous section technically gets the job done, I usually concatenate the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uid&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;session_id&lt;/code&gt; together. I do this just to highlight that the value is usually a ‘key’ value, not a metric in itself (though it can be). Concatenating the keys and removing the teaching columns results in the following query:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;o&quot;&gt;--&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Query&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Outer&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uses&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;do&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cumulative&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;the&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;concatenate&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span 
class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'-'&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_event_boundary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BY&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ORDER&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BY&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;varchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;session_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
			&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;--&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Query&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Define&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;boundary&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;events&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
			&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
			&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
			&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
			&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;case&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;when&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BY&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uid&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ORDER&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BY&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_timestamp&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;then&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ELSE&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_event_boundary&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
			&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;single_col_timestamp&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
			&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2015/01/final-sessionized-data.png&quot; alt=&quot;final-sessionized-data&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;window-functions-will-you-marry-me&quot;&gt;Window Functions, Will You Marry Me?&lt;/h2&gt;

&lt;p&gt;The first time I was asked to try and solve sessionization of time-series data using Hive, I was sure the answer would be that I’d have to get a peer to write some nasty custom Java code to be able to generate unique ids; in retrospect, the solution is so obvious and simple that I wish I had tried to do this years ago. This is a pretty easy problem to solve using imperative programming, but if you’ve got a gigantic amount of hardware in an RDBMS or Hadoop, SQL takes care of all of the calculation without needing to think through looping (or more complicated logic/data structures).&lt;/p&gt;

&lt;p&gt;Window functions fall into a weird space in the SQL language, given that they allow you to do sequential calculations when SQL should generally be thought of as “set-level” calculations (i.e. no implied order and table-wide calculations vs. row/state-specific). But now that I’ve gotten the hang of them, I can’t imagine my analytical life without them.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.4.3 Release Notes</title>
        
          <description>&lt;p&gt;It’s a new year, so…new version of RSiteCatalyst on CRAN! For the most part, this release fixes a handful of bugs that weren’t noticed with the prior release 1.4.2 (oops!), but there are pieces of additional functionality.&lt;/p&gt;

</description>
        
        <pubDate>Tue, 06 Jan 2015 13:00:40 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-4-3-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-4-3-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-4-3-release-notes/">&lt;p&gt;It’s a new year, so…new version of RSiteCatalyst on CRAN! For the most part, this release fixes a handful of bugs that weren’t noticed with the prior release 1.4.2 (oops!), but there are pieces of additional functionality.&lt;/p&gt;

&lt;h2 id=&quot;new-functionality-data-feed-monitoring&quot;&gt;New functionality: Data Feed monitoring&lt;/h2&gt;

&lt;p&gt;If you have hourly or daily data feeds delivered via FTP, you can now look up the details of a single data feed using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetFeed()&lt;/code&gt;, or list all of a company’s feeds and the processing status of each using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetFeeds()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For example, calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetFeed()&lt;/code&gt; with a specific feed number will return the following information as a data frame:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2015/01/rsitecatalyst-getfeed.png&quot; alt=&quot;rsitecatalyst-getfeed&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Similarly, if you call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetFeeds(&quot;report-suite&quot;)&lt;/code&gt;, you’ll get the following information as a data frame:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2015/01/rsitecatalyst-getfeeds.png&quot; alt=&quot;rsitecatalyst-getfeeds&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I only have one feed set up for testing, but if there were more feeds delivered each day, they would show up as additional rows in the data frame. The interpretation here is that the daily feed for 1/5/15 was delivered (the 05:00:00 is GMT).&lt;/p&gt;

&lt;h2 id=&quot;bug-fixes&quot;&gt;Bug Fixes&lt;/h2&gt;

&lt;p&gt;RSiteCatalyst v1.4.2 attempted to fix an issue where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueRanked&lt;/code&gt; would error if two SAINT classifications were used. Unfortunately, by fixing that issue, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueRanked&lt;/code&gt; ONLY worked with SAINT Classifications. This was only out in the wild for a month, so hopefully it didn’t really affect anyone.&lt;/p&gt;

&lt;p&gt;Additionally, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;segment.id&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;segment.name&lt;/code&gt; weren’t printing out to the data frame in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Queue*&lt;/code&gt; functions. This has also been fixed.&lt;/p&gt;

&lt;h2 id=&quot;test-suite-using-travis-ci&quot;&gt;Test Suite Using Travis CI&lt;/h2&gt;

&lt;p&gt;To avoid future errors like the ones mentioned above, a full test suite using &lt;a href=&quot;https://github.com/hadley/testthat&quot;&gt;testthat&lt;/a&gt; has been added to RSiteCatalyst and is monitored via &lt;a href=&quot;https://travis-ci.org/randyzwitch/RSiteCatalyst&quot;&gt;Travis CI&lt;/a&gt;. While there is coverage for every public function within the package, there are likely additional tests that can be added for functionality I didn’t cover. If anyone out there has particularly weird cases that aren’t incorporated in the &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/tree/master/tests/testthat&quot;&gt;test suite&lt;/a&gt;, please feel free to file an issue or submit a pull request and I’ll figure out how to incorporate them.&lt;/p&gt;

&lt;h2 id=&quot;datawarehouse-api&quot;&gt;&lt;del&gt;DataWarehouse API&lt;/del&gt;&lt;/h2&gt;

&lt;p&gt;&lt;del&gt;Finally, the last bit of changes to RSiteCatalyst in v1.4.3 are internal preparations for a new package I plan to release in the coming months: &lt;a title=&quot;AdobeDW DataWarehouse&quot; href=&quot;https://github.com/randyzwitch/AdobeDW&quot; target=&quot;_blank&quot;&gt;AdobeDW&lt;/a&gt;. Several folks have asked for the ability to control Data Warehouse reports via R; for various reasons, I thought it made sense to break this out from RSiteCatalyst into its own package. If there are any R-and-Adobe-Analytics enthusiasts out there that would like to help development, please let me know! &lt;/del&gt;&lt;/p&gt;

&lt;h2 id=&quot;feature-requestsbugs&quot;&gt;Feature Requests/Bugs&lt;/h2&gt;

&lt;p&gt;As always, if you come across bugs or have feature requests, please continue to use the &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues&quot;&gt;RSiteCatalyst GitHub Issues&lt;/a&gt; page to submit issues. Don’t worry about cluttering up the page with tickets; please fill out a new issue for anything you encounter (including the code you’ve already tried that is failing), unless you are SURE that it is the same problem someone else is facing.&lt;/p&gt;

&lt;p&gt;And finally, like I end every blog post about RSiteCatalyst, please note that &lt;strong&gt;I’m not an Adobe employee&lt;/strong&gt;. This hasn’t been an issue for a few months, so maybe next time I won’t end the post with this boilerplate :)&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Review: Data Science at the Command Line</title>
        
          <description>&lt;p&gt;Admission: I didn’t &lt;em&gt;really know&lt;/em&gt; how computers worked until around 2012.&lt;/p&gt;

</description>
        
        <pubDate>Mon, 15 Dec 2014 10:22:46 +0000</pubDate>
        <link>
        http://randyzwitch.com/data-science-command-line-review-janssens/</link>
        <guid isPermaLink="true">http://randyzwitch.com/data-science-command-line-review-janssens/</guid>
        <content type="html" xml:base="/data-science-command-line-review-janssens/">&lt;p&gt;Admission: I didn’t &lt;em&gt;really know&lt;/em&gt; how computers worked until around 2012.&lt;/p&gt;

&lt;p&gt;For the majority of my career, I’ve worked for large companies with centralized IT functions. Like many statisticians, I fell into a comfortable position of learning SAS in a Windows environment, had Ops people to fix any Unix problems I’d run into and DBAs to load data into a relational database environment.&lt;/p&gt;

&lt;p&gt;Then I became a consultant at a boutique digital analytics firm. To say I was punching above my weight is an understatement. All of a sudden it was time to go into various companies, have a one-hour kickoff meeting, then start untangling the spaghetti mess that represented their various technology systems. I also needed to figure out the boutique firm’s hacked-together AWS and Rackspace infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://datascienceatthecommandline.com/&quot;&gt;&lt;img src=&quot;/wp-content/uploads/2014/12/data-science-command-line.png&quot; alt=&quot;data-science-command-line&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m starting off this review with this admission, because my story of learning to work from the command line parallels &lt;a href=&quot;http://datascienceatthecommandline.com/&quot;&gt;Data Science at the Command Line&lt;/a&gt; author &lt;a href=&quot;https://twitter.com/jeroenhjanssens&quot;&gt;Jeroen Janssens&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Around five years ago, during my PhD program, I gradually switched from using Microsoft Windows to GNU/Linux…Out of necessity I quickly became comfortable using the command line. Eventually, as spare time got more precious, I settled down with a GNU/Linux distribution known as Ubuntu…&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;Preface, pg. xi&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because a solid majority of people have never learned anything beyond a point-and-click interface (Windows or Mac), the title of the book &lt;em&gt;Data Science at the Command Line&lt;/em&gt; is somewhat unfortunate; this is a book for ANYONE looking to start manipulating files efficiently from the command line.&lt;/p&gt;

&lt;h2 id=&quot;getting-started-safely&quot;&gt;Getting Started, Safely&lt;/h2&gt;

&lt;p&gt;One of the best parts of &lt;em&gt;Data Science at the Command Line&lt;/em&gt; is that it comes with a &lt;a href=&quot;http://datasciencetoolbox.org/&quot;&gt;pre-built virtual machine&lt;/a&gt; with 80-100 or more command line tools installed. This is a very fast and safe way to get started with the command line, as the tools are pre-installed and no matter what command you run while you’re learning, you won’t destroy a computer you actually care about!&lt;/p&gt;

&lt;p&gt;Chapters 2 and 3 walk through installing the virtual machine, the essential concepts of the command line, basic commands that show simple (but powerful!) ways to chain command line tools together, and how to obtain data. What I find so refreshing about these two chapters is that Janssens assumes zero knowledge of the command line on the reader’s part; they are the most accessible summary of how and why to use the command line I’ve ever read (&lt;a href=&quot;http://cli.learncodethehardway.org/book/&quot;&gt;Zed Shaw’s CLI tutorial&lt;/a&gt; is a close second, but is quite terse).&lt;/p&gt;

&lt;h2 id=&quot;the-osemn-model&quot;&gt;The OSEMN model&lt;/h2&gt;

&lt;p&gt;The middle portion of the book covers the &lt;a href=&quot;http://www.dataists.com/2010/09/a-taxonomy-of-data-science/&quot;&gt;OSEMN model&lt;/a&gt; (Obtain-Scrub-Explore-Model-iNterpret) of data science; another way this book is refreshing is that rather than jump right into machine learning/predictive modeling, the author spends a considerable amount of time covering the gory details of real analysis projects: manipulating data from the format you &lt;em&gt;receive&lt;/em&gt; (XML, JSON, sloppy CSV files, etc.) and taking the (numerous) steps required to get the format you &lt;em&gt;want&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;By introducing tools such as &lt;a href=&quot;http://csvkit.readthedocs.io/en/latest/&quot;&gt;csvkit&lt;/a&gt; (CSV manipulation), &lt;a href=&quot;http://stedolan.github.io/jq/&quot;&gt;jq&lt;/a&gt; (JSON processor), and classic tools such as &lt;a href=&quot;https://www.gnu.org/software/sed/manual/sed.html&quot;&gt;sed&lt;/a&gt; (stream editor) and &lt;a href=&quot;http://www.gnu.org/software/gawk/manual/gawk.html&quot;&gt;(g)awk&lt;/a&gt;, the reader gets a full treatment of how to deal with malformed data files (which in my experience are the only type available in the wild!). Chapter 6 (“Managing Your Data Workflow”) is also a great introduction to &lt;a href=&quot;http://en.wikipedia.org/wiki/Reproducibility#Reproducible_research&quot;&gt;reproducible research&lt;/a&gt; using &lt;a href=&quot;http://blog.factual.com/introducing-drake-a-kind-of-make-for-data&quot;&gt;Drake&lt;/a&gt; (Make for Data Analysis). This is an area that I will personally be focusing my time on, as I tend to run a lot of one-off commands in HDFS and, as of now, just copy them into a plain-text file. Reproducing = copy-paste in my case, which defeats the purpose of computers and scripting!&lt;/p&gt;

&lt;h2 id=&quot;an-idea-can-be-stretched-too-far&quot;&gt;An Idea Can Be Stretched Too Far&lt;/h2&gt;

&lt;p&gt;Chapters 8 and 9 cover Parallel Processing using &lt;a href=&quot;http://www.gnu.org/software/parallel/&quot;&gt;GNU Parallel&lt;/a&gt; and Modeling Data respectively. While GNU Parallel is a tool I could see using sometime in the future, I do feel like building models and creating visualizations straight from the command line is getting pretty close to just being a parlor trick. Yes, it’s obviously possible to do such things (and the author even wrote his own &lt;a href=&quot;https://github.com/jeroenjanssens/data-science-at-the-command-line/blob/master/tools/Rio&quot;&gt;command line tool Rio&lt;/a&gt; for using R from the command line), but with the amount of iteration, feature building and fine-tuning that goes on, I’d rather use &lt;a href=&quot;http://ipython.org/notebook.html&quot;&gt;IPython Notebook&lt;/a&gt; or &lt;a href=&quot;http://www.rstudio.com/&quot;&gt;RStudio&lt;/a&gt; to give me the flexibility I need to really iterate effectively.&lt;/p&gt;

&lt;h2 id=&quot;a-book-for-everyone&quot;&gt;A Book For Everyone&lt;/h2&gt;

&lt;p&gt;As I mentioned above, I really feel that &lt;em&gt;Data Science at the Command Line&lt;/em&gt; is a book well suited for anyone who does data analysis. Jeroen Janssens has done a fantastic job of taking his original &lt;a href=&quot;http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html&quot;&gt;“7 command-line tools for data science”&lt;/a&gt; blog post and extending the idea to a full-fledged book. This book has a prominent place in my work library next to &lt;a href=&quot;http://shop.oreilly.com/product/0636920023784.do&quot;&gt;Python for Data Analysis&lt;/a&gt;, and in the past two months I’ve referred to each book at roughly the same rate. At under $30 for the paperback on Amazon, there’s more than enough content to make you a better data scientist.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Introducing Twitter.jl</title>
        
          <description>&lt;p&gt;This is possibly the latest “announcement” of a package ever, given that &lt;a href=&quot;https://github.com/randyzwitch/Twitter.jl&quot;&gt;Twitter.jl&lt;/a&gt; has existed on &lt;a href=&quot;https://github.com/JuliaLang/METADATA.jl&quot; title=&quot;Julia METADATA&quot;&gt;METADATA&lt;/a&gt; for nearly a year now, but that’s how things go sometimes. Here’s how to get started with Twitter.jl.&lt;/p&gt;

</description>
        
        <pubDate>Mon, 08 Dec 2014 17:12:58 +0000</pubDate>
        <link>
        http://randyzwitch.com/twitter-api-julia/</link>
        <guid isPermaLink="true">http://randyzwitch.com/twitter-api-julia/</guid>
        <content type="html" xml:base="/twitter-api-julia/">&lt;p&gt;This is possibly the latest “announcement” of a package ever, given that &lt;a href=&quot;https://github.com/randyzwitch/Twitter.jl&quot;&gt;Twitter.jl&lt;/a&gt; has existed on &lt;a href=&quot;https://github.com/JuliaLang/METADATA.jl&quot; title=&quot;Julia METADATA&quot;&gt;METADATA&lt;/a&gt; for nearly a year now, but that’s how things go sometimes. Here’s how to get started with Twitter.jl.&lt;/p&gt;

&lt;h2 id=&quot;hello-world&quot;&gt;Hello, World!&lt;/h2&gt;

&lt;p&gt;If ‘Hello, World!’ is the canonical example of getting started with a programming language, the Twitter API is becoming the first place to start for people wanting to learn about APIs. Authenticating with the Twitter API using Julia is similar to using the R or Python packages, except that rather than doing the OAuth “dance”, Twitter.jl takes all four authentication values in one function:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Twitter&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;apikey&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;q8Qw7WJTVP...&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;apisecret&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;FIichPpGJxiOssN...&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;accesstoken&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;98689850-v0zZNr...&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;accesstokensecret&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;w7bDg9K0c493T...&quot;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;twitterauth&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;apikey&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;apisecret&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;accesstoken&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;accesstokensecret&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;All four of these values can be found after registering at the &lt;a href=&quot;https://dev.twitter.com/&quot;&gt;Twitter Developer page&lt;/a&gt; and creating an application. Having all four values in your script is less secure than providing just the API key and API secret, so in the future I’ll likely implement the full OAuth “handshake”. One thing to keep in mind is that this function currently performs no validation of your credentials; all it does is define a global variable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;twittercred&lt;/code&gt; for later use by the various functions that create the OAuth headers. To shout “Hello, World!” to all of your Twitter followers, you can use the following code:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;post_status_update&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Hello, World!&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;general-packagefunction-structure&quot;&gt;General Package/Function Structure&lt;/h2&gt;

&lt;p&gt;From the example above, you can see that the function naming follows the &lt;a href=&quot;https://dev.twitter.com/rest/public&quot;&gt;Twitter REST API&lt;/a&gt; naming convention, with the HTTP verb first and the endpoint as the remainder of the function name. As such, it’s a good idea at this early stage of the package to have the Twitter documentation open while using it, so that you can quickly find the methods you are looking for.&lt;/p&gt;

&lt;p&gt;For each function/API endpoint, I’ve gone through and determined which parameters are required; these are required arguments in the Julia functions. For all other options, each function takes a second optional &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dict{String, String}&lt;/code&gt; for any option shown in the Twitter documentation. While this Dict structure allows for ultimate flexibility (and quick definition of functions!), I do realize that it’s less than optimal that you don’t know what optional arguments each Twitter endpoint allows.&lt;/p&gt;

&lt;p&gt;As an example, suppose you wanted to search for tweets containing the hashtag &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#julialang&lt;/code&gt;. The minimum function call is as follows:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;span class=&quot;n&quot;&gt;julia_tweets&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_search_tweets&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;#julialang&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;By default, the API will return the 15 most recent tweets containing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#julialang&lt;/code&gt; hashtag. To return the most recent 100 tweets (the maximum per API ‘page’), you can pass the “count” parameter via the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;options&lt;/code&gt; Dict:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;span class=&quot;n&quot;&gt;julia_tweets_100&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_search_tweets&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;#julialang&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Dict&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;count&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;100&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
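
&lt;p&gt;To make the pattern of one required argument plus an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;options&lt;/code&gt; Dict concrete, here is a minimal, self-contained sketch of how such a function might fold both into a single query string. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build_query&lt;/code&gt; helper is hypothetical and is not part of Twitter.jl; it also skips URL encoding and OAuth signing, both of which a real request would need.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;# Hypothetical helper (not Twitter.jl code): one required argument,
# everything else supplied through an optional Dict of strings.
function build_query(q::String; options = Dict{String, String}())
    # Required and optional parameters end up in the same Dict
    params = merge(Dict(&quot;q&quot; =&amp;gt; q), options)
    # Sort by key so the resulting string is deterministic
    sorted_params = sort(collect(params))
    return join([&quot;$k=$v&quot; for (k, v) in sorted_params], &quot;&amp;amp;&quot;)
end

build_query(&quot;julialang&quot;)
# &quot;q=julialang&quot;

build_query(&quot;julialang&quot;; options = Dict(&quot;count&quot; =&amp;gt; &quot;100&quot;))
# &quot;count=100&amp;amp;q=julialang&quot;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The upside of this design is that a new Twitter option never requires a new function signature; the downside, as noted above, is that nothing in the signature tells you which keys are valid.&lt;/p&gt;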

&lt;h2 id=&quot;composite-types-and-dataframes-definitions&quot;&gt;Composite Types and DataFrame Definitions&lt;/h2&gt;

&lt;p&gt;The Twitter API is structured into 4 return data types (&lt;a href=&quot;https://dev.twitter.com/overview/api/places&quot;&gt;Places&lt;/a&gt;, &lt;a href=&quot;https://dev.twitter.com/overview/api/users&quot;&gt;Users&lt;/a&gt;, &lt;a href=&quot;https://dev.twitter.com/overview/api/tweets&quot;&gt;Tweets&lt;/a&gt;, and &lt;a href=&quot;https://dev.twitter.com/overview/api/entities&quot;&gt;Entities&lt;/a&gt;), and I’ve mimicked these types using Julia &lt;a href=&quot;http://julia.readthedocs.org/en/latest/manual/types/#composite-types&quot;&gt;Composite Types&lt;/a&gt;. As such, most functions in Twitter.jl return an array of a specific type, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Array{TWEETS,1}&lt;/code&gt; from the prior &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#julialang&lt;/code&gt; search example. The benefit of defining custom types for the returned Twitter data is that rudimentary DataFrame methods have also been defined:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DataFrame&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;julia_tweets_100&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;I describe these DataFrames as ‘rudimentary’ as they parse the top level of JSON into columns, which results in some DataFrame columns having complex data types such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dict()&lt;/code&gt; (and within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dict()&lt;/code&gt;, nested Dicts!). As a running theme in this post, this is something I hope to get around to improving in the future.&lt;/p&gt;
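
&lt;p&gt;One possible way to get rid of the nested &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dict&lt;/code&gt; columns would be to flatten each parsed JSON object into dotted column names before building the DataFrame. This is only a sketch: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;flatten_dict&lt;/code&gt; helper and the sample tweet are made up for illustration and are not Twitter.jl code.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;# Hypothetical helper: recursively flatten nested Dicts into &quot;parent.child&quot; keys
function flatten_dict(d::AbstractDict, prefix::String = &quot;&quot;)
    flat = Dict{String, Any}()
    for (key, value) in d
        name = isempty(prefix) ? string(key) : string(prefix, &quot;.&quot;, key)
        if value isa AbstractDict
            # Descend into the nested Dict, carrying the key path along
            merge!(flat, flatten_dict(value, name))
        else
            flat[name] = value
        end
    end
    return flat
end

# Made-up stand-in for one parsed tweet
tweet = Dict(&quot;text&quot; =&amp;gt; &quot;Hello #julialang&quot;,
             &quot;user&quot; =&amp;gt; Dict(&quot;screen_name&quot; =&amp;gt; &quot;example&quot;,
                            &quot;entities&quot; =&amp;gt; Dict(&quot;url&quot; =&amp;gt; &quot;http://example.com&quot;)))

flat = flatten_dict(tweet)
sort(collect(keys(flat)))
# [&quot;text&quot;, &quot;user.entities.url&quot;, &quot;user.screen_name&quot;]&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Each flattened key could then become its own DataFrame column holding a simple value instead of a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dict&lt;/code&gt;.&lt;/p&gt;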

&lt;h2 id=&quot;want-to-get-started-developing-julia-start-here&quot;&gt;Want to Get Started Developing Julia? Start Here!&lt;/h2&gt;

&lt;p&gt;One of the common questions I get asked is how to get started with Julia, both from a learning perspective and from a package development perspective. Hacking away on the core Julia codebase is great if you have the ability, but the code can certainly be intimidating (the people are quite friendly though). Creating a package isn’t necessarily hard, but you have to think about an idea you want to implement. The third alternative is…&lt;/p&gt;

&lt;p&gt;…improve the Twitter package! If you go to the &lt;a href=&quot;https://github.com/randyzwitch/Twitter.jl&quot;&gt;GitHub page for Twitter.jl&lt;/a&gt;, you’ll see a long list of TODO items that need to be worked on. The hardest part (building the OAuth headers) has already been taken care of. What’s left is &lt;a href=&quot;http://randyzwitch.com/julia-metaprogramming-refactoring/&quot;&gt;refactoring the code for simplification&lt;/a&gt;, factoring out the &lt;a href=&quot;https://github.com/randyzwitch/OAuth.jl&quot;&gt;OAuth code in general into a new Julia library&lt;/a&gt; (also partially started), then building the Streaming API functions, cleaning up the DataFrame methods to remove the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dict&lt;/code&gt; column types, paging through API results…and so on.&lt;/p&gt;

&lt;p&gt;So if any of you are on the sidelines wanting to get some practice on developing packages, without needing to worry about learning Astrophysics first, I’d love to collaborate. And if any Julia programming masters want to collaborate, well that’s great too. All help and pull requests are welcomed.&lt;/p&gt;

&lt;p&gt;In the meantime, hopefully some of you will find this package useful for natural language processing, social networking analysis or even creating bots 😉&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.4.2 Release Notes</title>
        
          <description>&lt;p&gt;RSiteCatalyst version 1.4.2 is now available on CRAN. This update was primarily bug fixes with one additional feature added.&lt;/p&gt;

</description>
        
        <pubDate>Wed, 03 Dec 2014 23:01:05 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-4-2-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-4-2-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-4-2-release-notes/">&lt;p&gt;RSiteCatalyst version 1.4.2 is now available on CRAN. This update was primarily bug fixes with one additional feature added.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Fixed QueueRanked function to allow multiple SAINT classifications to be specified. This allows for breaking down a SAINT classification with another SAINT classification, such as breaking down tracking codes by marketing channel and by campaign&lt;/li&gt;
  &lt;li&gt;Fixed bug in internal function, to allow for using the same element multiple times in a QueueRanked function call. This was a necessary fix for allowing multiple SAINT classifications in #1&lt;/li&gt;
  &lt;li&gt;Exported the previously internal function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SubmitJsonQueueReport&lt;/code&gt; to allow for submitting JSON requests directly to the Adobe Analytics API without all of the R function scaffolding. This approximates the functionality of the &lt;a href=&quot;https://marketing.adobe.com/developer/get-started/api-explorer&quot;&gt;Adobe API Explorer&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the most part, most people won’t notice any differences between this release and version 1.4.1. That said, special thanks go out to Jason Morgan (&lt;a href=&quot;https://github.com/framingeinstein&quot;&gt;@framingeinstein&lt;/a&gt;) for identifying the two bugs that were fixed AND submitting fixes.&lt;/p&gt;

&lt;h2 id=&quot;feature-requestsbugs&quot;&gt;Feature Requests/Bugs&lt;/h2&gt;

&lt;p&gt;As always, if you come across bugs or have feature requests, please continue to use the &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues&quot;&gt;RSiteCatalyst GitHub Issues&lt;/a&gt; page to submit issues. Don’t worry about cluttering up the page with tickets; please fill out a new issue for anything you encounter (including the code you’ve already tried that is failing), unless you are SURE that it is the same problem someone else is facing.&lt;/p&gt;

&lt;p&gt;And finally, like I end every blog post about RSiteCatalyst, please note that &lt;strong&gt;I’m not an Adobe employee&lt;/strong&gt;. Please don’t send me your API credentials, expect immediate replies (especially for you e-commerce folks sweating the holiday season!) or ask to set up phone calls to troubleshoot your problems. This is open-source software…Willem Paling and I did the hard part writing it, you’re expected to support yourself as best as possible unless you believe you’re encountering a bug. Then use GitHub.&lt;/p&gt;
      </item>
      
    
      
      <item>
        <title>Destroy Your Data Using Excel With This One Weird Trick!</title>
        
          <description>&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/11/pie-charts-are-better.png&quot; alt=&quot;All you pie-chart haters are wishing I used one here&quot; /&gt;&lt;/p&gt;

</description>
        
        <pubDate>Thu, 20 Nov 2014 10:07:55 +0000</pubDate>
        <link>
        http://randyzwitch.com/excel-destroys-data/</link>
        <guid isPermaLink="true">http://randyzwitch.com/excel-destroys-data/</guid>
        <content type="html" xml:base="/excel-destroys-data/">&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/11/pie-charts-are-better.png&quot; alt=&quot;All you pie-chart haters are wishing I used one here&quot; /&gt;&lt;/p&gt;

&lt;div&gt;
  &lt;p class=&quot;wp-caption-text&quot;&gt;
    All you pie-chart haters are wishing I used one here.
  &lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;I often use Twitter as a place to vent about the horribleness of Excel, from the product itself to the analyses its UI and workflow influence. Admittedly, some of this is snobbish preference: if everyone used my preferred tools, then the world would be a better place! But let me back off my snobbishness a bit and just say this: please feel free to use any tool you want, up to and including pencil-and-paper…JUST.STOP.USING.EXCEL.&lt;/p&gt;

&lt;p&gt;Excel arbitrarily destroys data for fun, as evidenced by the example below.&lt;/p&gt;

&lt;h2 id=&quot;who-gives-a-f-about-seconds-im-10-minutes-late-everywhere&quot;&gt;Who Gives A ‘F’ About Seconds? I’m 10 minutes Late Everywhere!&lt;/h2&gt;

&lt;p&gt;CSV files have many flaws, but at least they are just plain text. It doesn’t take any special software to read them and you can open and close them without loss of fidelity…except if you open them with Excel.&lt;/p&gt;

&lt;p&gt;Suppose you have a CSV file with timestamps in ISO8601 format. Depending on which text editor you use, it might look something like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/11/timestamp.png&quot; alt=&quot;timestamp&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s open our file in Excel:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/11/excel-dates.png&quot; alt=&quot;excel-dates&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The first thing you might notice is that not only does Excel change the date formatting in the file to be more “‘Murica!”, it doesn’t even have the courtesy to use one of its existing date or time formats! And rather than keep the date the way it was, or standardize the dates to the way the rest of the world writes them, or even keep fixed-width columns, Excel feels like it should also hide the seconds! Makes sense…seconds are for other people to see, if/when they highlight an individual cell.&lt;/p&gt;

&lt;p&gt;So, you’ve opened this file, but can’t remember if you made any changes outside of applying auto-width to the columns. The data still &lt;em&gt;looks&lt;/em&gt; right, so you hit ‘Save’ when prompted by Excel. But you remember that your favorite programmer asked for a CSV file, and it’s already a CSV file, so you hit save, ignore the ‘features’ Excel brags about and email it back to your co-worker. Here’s what they receive:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/11/excel-fidelity-loss.png&quot; alt=&quot;excel-fidelity-loss&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Reading this back in our plain-text editor, we can now see we have a loss of fidelity of between 37 and 47 seconds on each cell of data. Whereas Excel keeps track of your timestamps while you’re in a SPREADSHEET, if you save as plain text, Excel assumes you want to keep the format it automatically applied to your data (automatically! silently!), and thus, destroys your file. In what world would you not care about seconds in your timestamps?&lt;/p&gt;

&lt;p&gt;Remember, this mis-feature occurs even if the only thing you do is open a plain-text file in Excel and hit save. No other Excel actions are needed to destroy your data.&lt;/p&gt;
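
&lt;p&gt;The loss is easy to reproduce in code. Here is a minimal sketch using Julia’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dates&lt;/code&gt; standard library. The timestamp is made up, and I’m assuming that the display format Excel applies is roughly &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;m/d/yy h:mm&lt;/code&gt; and that the seconds are truncated rather than rounded:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;using Dates

# Made-up ISO 8601 timestamp, standing in for one cell of the CSV file
original = DateTime(&quot;2014-11-20T10:07:55&quot;)

# Assumed Excel-style display format: no seconds
displayed = Dates.format(original, &quot;m/d/yy H:MM&quot;)

# What survives once only the displayed text is written back to disk
saved = floor(original, Minute)

# The fidelity lost on the round trip
lost = original - saved

println(displayed)
println(lost)   # 55000 milliseconds&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Once the file has been saved in that format, no parser can recover the missing seconds, because they are simply no longer in the text.&lt;/p&gt;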

&lt;h2 id=&quot;excel-only-the-proper-tool-if-you-dont-care&quot;&gt;Excel: Only The Proper Tool If You Don’t Care&lt;/h2&gt;

&lt;p&gt;If you don’t care about using the proper tool for analytics, don’t want to learn something new, don’t want numerical accuracy, hate visually interesting graphics, don’t need reproducibility…use Excel. For everything else, there’s everything else. Don’t be a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VLOOKUP&lt;/code&gt; guru, use SQL. Don’t store your data in Excel just because it allows for a million rows, use a database. If you need point-and-click graphics, at least spring for Tableau so the defaults look nicer.&lt;/p&gt;

&lt;p&gt;Or, learn to code using open-source languages for a total licensing cost of $0. Every analyst would get value from knowing one open-source analytics language, even superficially, so that you can write simple calculation scripts and document your thought process. A side benefit is that by coding, you can also use version control like Git or SVN. Then, you can keep different versions of your thinking, and the next analyst down the line can see how your analysis has evolved.&lt;/p&gt;

&lt;p&gt;And while I’m ranting, a special message for all you ‘top-tier’ analytics consultants out there: you should know SEVERAL of the common analytics languages. If you do your “analysis” in Excel, you are a hack or you are just providing &lt;em&gt;reporting&lt;/em&gt; for $300/hr. Use better tools, your clients deserve better. I have infinitely more respect for someone who delivers a sloppy set of slides and a documented R script than someone who knows how to put drop-shadows on MS Office documents and makes fancy decks. You are being judged not just by the C-Suite, but also by snobs like me. And when contract renewal time comes around, they do ask my opinion and I do make comments on how sophisticated the toolset you used was (or the lack thereof if you’re using Excel).&lt;/p&gt;

&lt;p&gt;It’s nearly 2015, do better. Stop Using Excel.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Code Refactoring Using Metaprogramming</title>
        
          <description>&lt;p&gt;It’s been nearly a year since I wrote &lt;a href=&quot;https://github.com/randyzwitch/Twitter.jl/&quot;&gt;Twitter.jl&lt;/a&gt;, back when I seemingly had MUCH more free time. In these past 10 months, I’ve used Julia quite a bit to develop other packages, and I try to use it at work when I know I’m not going to be collaborating with others (since my colleagues don’t know Julia, not because it’s bad for collaboration!).&lt;/p&gt;

</description>
        
        <pubDate>Tue, 18 Nov 2014 09:11:06 +0000</pubDate>
        <link>
        http://randyzwitch.com/julia-metaprogramming-refactoring/</link>
        <guid isPermaLink="true">http://randyzwitch.com/julia-metaprogramming-refactoring/</guid>
        <content type="html" xml:base="/julia-metaprogramming-refactoring/">&lt;p&gt;It’s been nearly a year since I wrote &lt;a href=&quot;https://github.com/randyzwitch/Twitter.jl/&quot;&gt;Twitter.jl&lt;/a&gt;, back when I seemingly had MUCH more free time. In these past 10 months, I’ve used Julia quite a bit to develop other packages, and I try to use it at work when I know I’m not going to be collaborating with others (since my colleagues don’t know Julia, not because it’s bad for collaboration!).&lt;/p&gt;

&lt;p&gt;One of the things that’s obvious from my earlier Julia code is that I didn’t understand how powerful metaprogramming can be, so here’s a simple example where I can replace 50 lines of Julia code with 10.&lt;/p&gt;

&lt;h2 id=&quot;ctrl-a-ctrl-c-ctrl-p-repeat&quot;&gt;CTRL-A, CTRL-C, CTRL-V. Repeat.&lt;/h2&gt;

&lt;p&gt;Admittedly, when I started on the Twitter package, I fully meant to go back and clean up the codebase, but moved onto something more fun instead. The Twitter package started out as a means of learning how to use the &lt;a href=&quot;https://github.com/JuliaWeb/Requests.jl&quot;&gt;Requests.jl&lt;/a&gt; library to make API calls and of figuring out the OAuth syntax I needed (which itself should be factored out of Twitter.jl); after that, I copied and pasted the same basic function structure over and over. While that was fast, what I was left with was this (currently, the help.jl file in the Twitter package):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;span class=&quot;c&quot;&gt;#############################################################&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Help section Functions for Twitter API&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#############################################################&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; get_help_configuration&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Dict&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}())&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_oauth&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;https://api.twitter.com/1.1/help/configuration.json&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parse&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; get_help_languages&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Dict&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}())&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_oauth&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;https://api.twitter.com/1.1/help/languages.json&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parse&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; get_help_privacy&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Dict&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}())&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_oauth&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;https://api.twitter.com/1.1/help/privacy.json&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parse&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; get_help_tos&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Dict&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}())&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_oauth&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;https://api.twitter.com/1.1/help/tos.json&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parse&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; get_application_rate_limit_status&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Dict&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}())&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_oauth&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;https://api.twitter.com/1.1/application/rate_limit_status.json&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parse&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;It’s pretty clear that this is the same exact code pattern, right down to the spacing! The way to interpret this code is that for these five Twitter API methods, there are no required inputs. Optionally, there is the ‘options’ keyword that allows for specifying a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dict()&lt;/code&gt; of options. For these five functions, there are no options you can pass to the Twitter API, so even this keyword is redundant. These are simple functions so I don’t gain a lot by way of maintainability by using metaprogramming, but at the same time, one of the core tenets of programming is ‘Don’t Repeat Yourself’, so let’s clean this up.&lt;/p&gt;

&lt;h2 id=&quot;for-symbol-in-symbolslist&quot;&gt;For :symbol in symbolslist…&lt;/h2&gt;

&lt;p&gt;In order to clean this up, we need to take out the unique parts of each function, then interpolate them into a single function definition passed to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@eval&lt;/code&gt; macro as follows:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;span class=&quot;n&quot;&gt;funcname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_help_configuration&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_help_languages&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_help_privacy&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_help_tos&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_application_rate_limit_status&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;endpoint&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;help/configuration.json&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;help/languages.json&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;help/privacy.json&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;help/tos.json&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;application/rate_limit_status.json&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;func&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;endp&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zip&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;funcname&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;endpoint&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nd&quot;&gt;@eval&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;($&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;func&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)(;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Dict&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}())&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_oauth&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;https://api.twitter.com/1.1/&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;endp&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parse&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;What’s happening in this code is that I define two tuples: one of function names (as symbols, denoted by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:&lt;/code&gt;) and one of the API endpoints. We can then iterate over the two tuples, substituting the function names and endpoints into the code. When the package is loaded, this code evaluates, defining the five functions for use in the Twitter package.&lt;/p&gt;

&lt;h2 id=&quot;wha&quot;&gt;Wha?&lt;/h2&gt;

&lt;p&gt;Yeah, so metaprogramming can be simple, but it can also be mind-bending. It’s one thing to not repeat yourself; it’s another to write something so complex that even YOU can’t remember how the code works. But somewhere in between lies a sweet spot where you can refactor whole swaths of code and streamline your codebase. Metaprogramming is used throughout the Julia codebase, so if you’re interested in seeing more examples of metaprogramming, check out the Julia source code, the &lt;a href=&quot;https://github.com/JuliaWeb/Requests.jl/blob/master/src/Requests.jl&quot; title=&quot;Requests.jl code&quot;&gt;Requests.jl&lt;/a&gt; package (where I first saw this) or really anyone who actually knows what they are doing. I’m just a metaprogramming pretender at this point 🙂&lt;/p&gt;

&lt;p&gt;To read additional discussion around this specific example, see the Julia-Users discussion at: &lt;a href=&quot;https://groups.google.com/forum/#!topic/julia-users/zvJmqB2N0GQ&quot;&gt;https://groups.google.com/forum/#!topic/julia-users/zvJmqB2N0GQ&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edit, 11/22/2014:&lt;/strong&gt; &lt;a href=&quot;http://www.reddit.com/r/Julia/comments/2mvtnr/code_refactoring_using_metaprogramming_in_julia/cma5g25&quot;&gt;DarthToaster on Reddit&lt;/a&gt; provided another fantastic way to approach refactoring, using macros:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;macro&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; endpoint&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;quote&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; $&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;esc&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;))(;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Dict&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}())&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_oauth&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;https://api.twitter.com/1.1/&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;path&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parse&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;nd&quot;&gt;@endpoint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_help_configuration&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;help/configuration.json&quot;&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@endpoint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_help_languages&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;help/languages.json&quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.4.1 Release Notes</title>
        
          <description>&lt;h2 id=&quot;changes&quot;&gt;Changes&lt;/h2&gt;

</description>
        
        <pubDate>Mon, 10 Nov 2014 10:01:36 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-4-1-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-4-1-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-4-1-release-notes/">&lt;h2 id=&quot;changes&quot;&gt;Changes&lt;/h2&gt;

&lt;p&gt;Version 1.4.1 of RSiteCatalyst is now available on CRAN. There were a handful of bug fixes and new features added, including:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Fixed bug in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueRanked&lt;/code&gt; function where only 10 results were returned when requesting multiple element reports. Function now returns up to 50,000 per breakdown (API limit)&lt;/li&gt;
  &lt;li&gt;Created better error message to inform user to login with credentials instead of making function call without proper API credentials&lt;/li&gt;
  &lt;li&gt;Added support for using SAINT classifications in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueRanked/QueueTrended&lt;/code&gt; functions&lt;/li&gt;
  &lt;li&gt;Added more error checking to make functions fail more elegantly&lt;/li&gt;
  &lt;li&gt;Added remaining GET methods from Reporting/Administration API&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;additional-get-methods&quot;&gt;Additional GET methods&lt;/h2&gt;

&lt;p&gt;This version of RSiteCatalyst has roughly 20 new GET methods, mostly providing additional report suite information for those who might desire to generate their documentation programmatically rather than manually. New API methods include (but are not limited to):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetMarketingChannelRules&lt;/code&gt;: Get a list of all criteria used to build the Marketing Channels report&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetReportDescription&lt;/code&gt;: For a given bookmark_id, get the report definition&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetListVariables&lt;/code&gt;: Get a list of the List Variables defined for a report suite&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetLogins&lt;/code&gt;: Get all logins for a given Company&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you were the type of person who enjoyed this blog post showing how to &lt;a href=&quot;http://randyzwitch.com/adobe-analytics-implementation-documentation/&quot; title=&quot;Adobe Analytics Report Suite documentation R&quot;&gt;auto-generate Adobe Analytics documentation&lt;/a&gt;, I encourage you to take a look at these newly incorporated functions and use them to improve your documentation even further.&lt;/p&gt;

&lt;h2 id=&quot;feature-requestsbugs&quot;&gt;Feature Requests/Bugs&lt;/h2&gt;

&lt;p&gt;If you come across any bugs, or have any feature requests, please continue to use the &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues&quot;&gt;RSiteCatalyst GitHub Issues&lt;/a&gt; page to make tickets. While I’ve responded to many of you via the maintainer email provided in the R package itself, it’s much more efficient (and you’re much more likely to get a response) if you use the GitHub Issues page. Don’t worry about cluttering up the page with tickets; please fill out a new issue for anything you encounter, unless you are SURE that it is the same problem someone else is facing.&lt;/p&gt;

&lt;p&gt;And finally, like I end every blog post about RSiteCatalyst, please note that &lt;strong&gt;I’m not an Adobe employee&lt;/strong&gt;. Please don’t send me your API credentials, expect immediate replies or ask to set up phone calls to troubleshoot your problems. This is open-source software…Willem Paling and I did the hard part writing it, you’re expected to support yourself as best as possible unless you believe you’re encountering a bug. Then use GitHub 🙂&lt;/p&gt;
      </item>
      
    
      
      <item>
        <title>Evaluating BreakoutDetection</title>
        
          <description>&lt;p&gt;A couple of weeks ago, Twitter open-sourced their &lt;a href=&quot;https://blog.twitter.com/2014/breakout-detection-in-the-wild&quot;&gt;BreakoutDetection&lt;/a&gt; package for R, a package designed to determine shifts in time-series data. The &lt;a href=&quot;https://blog.twitter.com/2014/breakout-detection-in-the-wild&quot;&gt;Twitter announcement&lt;/a&gt; does a great job of explaining the main technique for detection (E-Divisive with Medians), so I won’t rehash that material here. Rather, I wanted to see how this package works relative to the &lt;a href=&quot;http://randyzwitch.com/anomaly-detection-adobe-analytics-api/&quot;&gt;anomaly detection&lt;/a&gt; feature in the Adobe Analytics API, which I’ve &lt;a href=&quot;http://randyzwitch.com/anomaly-detection-adobe-analytics-api/&quot;&gt;written about previously&lt;/a&gt;.&lt;/p&gt;

</description>
        
        <pubDate>Thu, 06 Nov 2014 21:24:00 +0000</pubDate>
        <link>
        http://randyzwitch.com/twitter-breakoutdetection-r-package-evaluation/</link>
        <guid isPermaLink="true">http://randyzwitch.com/twitter-breakoutdetection-r-package-evaluation/</guid>
        <content type="html" xml:base="/twitter-breakoutdetection-r-package-evaluation/">&lt;p&gt;A couple of weeks ago, Twitter open-sourced their &lt;a href=&quot;https://blog.twitter.com/2014/breakout-detection-in-the-wild&quot;&gt;BreakoutDetection&lt;/a&gt; package for R, a package designed to determine shifts in time-series data. The &lt;a href=&quot;https://blog.twitter.com/2014/breakout-detection-in-the-wild&quot;&gt;Twitter announcement&lt;/a&gt; does a great job of explaining the main technique for detection (E-Divisive with Medians), so I won’t rehash that material here. Rather, I wanted to see how this package works relative to the &lt;a href=&quot;http://randyzwitch.com/anomaly-detection-adobe-analytics-api/&quot;&gt;anomaly detection&lt;/a&gt; feature in the Adobe Analytics API, which I’ve &lt;a href=&quot;http://randyzwitch.com/anomaly-detection-adobe-analytics-api/&quot;&gt;written about previously&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;getting-time-series-data-using-rsitecatalyst&quot;&gt;Getting Time-Series Data Using RSiteCatalyst&lt;/h2&gt;

&lt;p&gt;To use a real-world dataset to evaluate this package, I’m going to use roughly ten months of daily pageviews generated from my blog. The hypothesis here is that if the BreakoutDetection package works well, it should be able to detect the boundaries around when I publish a blog post (the dates of which I know with certainty) and when articles of mine get shared on sites such as Reddit. From past experience, I get about a 3-day lift in pageviews post-publishing, as the article gets tweeted out, published on &lt;a href=&quot;http://www.r-bloggers.com/&quot;&gt;R-Bloggers&lt;/a&gt; or &lt;a href=&quot;http://www.juliabloggers.com/&quot;&gt;JuliaBloggers&lt;/a&gt; and shared accordingly.&lt;/p&gt;

&lt;p&gt;Here’s the code to get daily pageviews using &lt;a href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot; title=&quot;RSiteCatalyst&quot;&gt;RSiteCatalyst&lt;/a&gt; (Adobe Analytics):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Installing BreakoutDetection package&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;install.packages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;devtools&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;devtools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;install_github&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;twitter/BreakoutDetection&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BreakoutDetection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;RSiteCatalyst&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SCAuth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;company&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;secret&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Get pageviews for each day in 2014&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pageviews_2014&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueOvertime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'report-suite'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                               &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date.from&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'2014-02-24'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                               &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date.to&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'2014-11-05'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                               &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metric&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'pageviews'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                               &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date.granularity&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'day'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#v1.0.1 of package requires specific column names and dataframe format&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;formatted_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pageviews_2014&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;datetime&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;formatted_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;timestamp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;count&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;One thing to notice here is that BreakoutDetection requires either a single R vector or a specifically formatted data frame. In this case, because I have a timestamp, I use lines 17-18 to get the data into the required format.&lt;/p&gt;

&lt;h2 id=&quot;breakoutdetection---default-example&quot;&gt;BreakoutDetection - Default Example&lt;/h2&gt;

&lt;p&gt;In the Twitter announcement, they provide an example, so let’s evaluate those defaults first:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/11/breakoutdetection-defaults.png&quot; alt=&quot;breakoutdetection-defaults&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In order to validate my hypothesis, the package would need to detect 12 ‘breakouts’ or so, as I’ve published 12 blog posts during the sample time period. Mentally drawing lines between the red boundaries, we can see three definitive upward mean shifts, but far fewer than the 12 I expected.&lt;/p&gt;

&lt;h2 id=&quot;breakoutdetection---modifying-the-parameters&quot;&gt;BreakoutDetection - Modifying The Parameters&lt;/h2&gt;

&lt;p&gt;Given that the chart above doesn’t fit how I think my data are generated, we can modify two main parameters: beta and min.size. From the documentation:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;beta: A real numbered constant used to further control the amount of penalization. This is the default form of penalization, if neither (or both) beta or (and) percent are supplied this argument will be used. The default value is beta=0.008.&lt;/p&gt;

  &lt;p&gt;min.size:  The minimum number of observations between change points&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The first parameter I’m going to experiment with is min.size, because it requires no in-depth knowledge of the EDM technique! The value used in the first example was 24 (days) between intervals, which seems extreme in my case. It’s reasonable that I might publish a blog post per week, so let’s back that number down to 5 and see how the result changes:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/11/breakout-5.png&quot; alt=&quot;breakout-5&quot; /&gt;&lt;/p&gt;

&lt;p&gt;With 17 predicted intervals, we’ve somewhat overshot the number of blog posts. Not that the package is wrong per se; the boundaries are surrounding many of the spikes in the data, but perhaps having this many breakpoints isn’t useful from a monitoring standpoint. So setting the min.size parameter somewhere between 5 and 24 points would give us more than 3 breakouts, but fewer than 17. There is also the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;beta&lt;/code&gt; parameter that can be played with, but I’ll leave that as an exercise for another day.&lt;/p&gt;
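
&lt;p&gt;As a rough sketch of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;min.size&lt;/code&gt; comparison above, the two runs could look something like the following. This assumes the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;formatted_df&lt;/code&gt; data frame built earlier. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;method&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;beta&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;degree&lt;/code&gt; values are an assumption borrowed from the example in the package README, and may not be the exact settings behind the charts in this post:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;library(BreakoutDetection)

#Assumed settings: method, beta and degree are taken from the package README example
res_24 &amp;lt;- breakout(formatted_df, min.size = 24, method = 'multi', beta = .001, degree = 1, plot = TRUE)
res_5 &amp;lt;- breakout(formatted_df, min.size = 5, method = 'multi', beta = .001, degree = 1, plot = TRUE)

#Number of estimated change points for each setting
length(res_24$loc)
length(res_5$loc)

#Show the chart for the smaller minimum interval
res_5$plot
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Trying a few values between 5 and 24 in the same way is probably the quickest route to a number of breakouts that is useful for monitoring.&lt;/p&gt;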

&lt;h2 id=&quot;anomaly-detection---adobe-analytics&quot;&gt;Anomaly Detection - Adobe Analytics&lt;/h2&gt;

&lt;p&gt;From my prior post about &lt;a href=&quot;http://randyzwitch.com/anomaly-detection-adobe-analytics-api/&quot;&gt;Anomaly Detection with the Adobe Analytics API&lt;/a&gt;, Adobe has chosen to use Holt-Winters/Exponential Smoothing as their technique. Here’s what that looks like for the same time-period (code as &lt;a href=&quot;http://randyzwitch.com/wp-content/uploads/2014/11/adobe_anomaly.png&quot;&gt;GitHub Gist&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/11/adobe_analytics.png&quot; alt=&quot;adobe_analytics&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Even though the ideas behind both techniques are similar, it’s clear that the two methods don’t quite represent the same thing. In the case of the Adobe Analytics Anomaly Detection, it’s looking datapoint-by-datapoint, with a smoothing model built from the prior 35 points. If a point exceeds the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;upper-&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lower-control&lt;/code&gt; limits, then it’s an anomaly, but not necessarily indicative of a true level shift like the BreakoutDetection package is measuring.&lt;/p&gt;
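
&lt;p&gt;To make the difference concrete, here is a minimal sketch of a point-by-point check in the spirit of what’s described above. This is not Adobe’s implementation: the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;flag_anomalies&lt;/code&gt;, the multiplier &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;k&lt;/code&gt; and the use of simple exponential smoothing via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HoltWinters()&lt;/code&gt; from base R (with trend and seasonality switched off) are all assumptions for illustration. Only the 35-point trailing window comes from the description above:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;#Hypothetical helper, not Adobe's algorithm: flag points outside control limits
#built from a smoothing model fit to the prior `window` points
flag_anomalies &amp;lt;- function(x, window = 35, k = 2) {
  n &amp;lt;- length(x)
  stopifnot(n &amp;gt; window)
  flagged &amp;lt;- rep(NA, n)  #first `window` points have no model, so they stay NA

  for (i in (window + 1):n) {
    history &amp;lt;- x[(i - window):(i - 1)]

    #Simple exponential smoothing (no trend, no seasonal component)
    fit &amp;lt;- HoltWinters(history, beta = FALSE, gamma = FALSE)
    forecast &amp;lt;- as.numeric(predict(fit, n.ahead = 1))

    #Control limits: forecast +/- k standard deviations of the one-step-ahead errors
    spread &amp;lt;- sd(residuals(fit))
    flagged[i] &amp;lt;- abs(x[i] - forecast) &amp;gt; k * spread
  }

  flagged
}

#Usage with the data frame from earlier in this post
anomalies &amp;lt;- flag_anomalies(formatted_df$count)
which(anomalies)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;A point flagged this way is a single-day outlier relative to its recent history, which is a different question from whether the mean level of the series has shifted.&lt;/p&gt;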

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/twitter/BreakoutDetection&quot;&gt;BreakoutDetection package&lt;/a&gt; is definitely cool, but it is a bit raw, especially the default graphics. But the package definitely does work, as evidenced by how well it put boundaries around the traffic spikes when I set the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;min.size&lt;/code&gt; parameter equal to five.&lt;/p&gt;

&lt;p&gt;Additionally, I tried to read more about the underlying methodology, but the only references that come up in Google seem to be references to the R package itself! I wish I had a better feeling for how the beta parameter influences the graph, but I guess that will come over time as I use the package more. But I’m definitely glad that Twitter open-sourced this package, as I’ve often wondered about how to detect level shifts in a more operational setting, and now I have a method to do so.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Visualizing Website Pathing With Sankey Charts</title>
        
          <description>&lt;p&gt;In my prior post on &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-d3-network-graphs/&quot; title=&quot;Visualizing Website Structure With Network Graphs&quot;&gt;visualizing website structure using network graphs&lt;/a&gt;, I referenced that network graphs showed the pairwise relationships between two pages (in a bi-directional manner). However, if you want to analyze how your visitors are pathing through your site, you can visualize your data using a Sankey chart.&lt;/p&gt;

</description>
        
        <pubDate>Wed, 10 Sep 2014 21:27:10 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-website-pathing-sankey-charts/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-website-pathing-sankey-charts/</guid>
        <content type="html" xml:base="/rsitecatalyst-website-pathing-sankey-charts/">&lt;p&gt;In my prior post on &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-d3-network-graphs/&quot; title=&quot;Visualizing Website Structure With Network Graphs&quot;&gt;visualizing website structure using network graphs&lt;/a&gt;, I referenced that network graphs showed the pairwise relationships between two pages (in a bi-directional manner). However, if you want to analyze how your visitors are pathing through your site, you can visualize your data using a Sankey chart.&lt;/p&gt;

&lt;h2 id=&quot;visualizing-single-page-to-next-page-pathing&quot;&gt;Visualizing Single Page-to-Next Page Pathing&lt;/h2&gt;

&lt;p&gt;Most digital analytics tools allow you to visualize the path between pages. In the case of Adobe Analytics, the Next Page Flow diagram is limited to 10 second-level branches in the visualization. However, the Adobe Analytics API has no such limitation, and as such we can use RSiteCatalyst to create the following visualization (&lt;a href=&quot;https://gist.github.com/randyzwitch/008be202b94bde7c4359&quot;&gt;GitHub Gist containing R code&lt;/a&gt;):&lt;/p&gt;

&lt;iframe src=&quot;http://randyzwitch.com/wp-content/uploads/2014/09/sankey.html&quot; width=&quot;750&quot; height=&quot;650&quot; frameborder=&quot;0&quot; scrolling=&quot;no&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;The data processing for this visualization is nearly identical to the network diagrams. We can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueuePathing()&lt;/code&gt; from RSiteCatalyst to download our pathing data, except in this case, I specified an exact page name as the first level of the pathing pattern instead of using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;::anything::&lt;/code&gt; operator. In all Sankey charts created by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;d3Network&lt;/code&gt;, you can hover over the right-hand side nodes to see the values (you can also drag around the nodes on either side if you desire!). It’s pretty clear from this diagram that I need to do a better job retaining my visitors, as the most common path from this page is to leave. 🙁&lt;/p&gt;
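
&lt;p&gt;For reference, a sketch of the single-page version of the call might look like the following. The page name, report suite and date range are hypothetical placeholders (substitute your own), and the exact code used for the chart above is in the Gist:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;library(&quot;RSiteCatalyst&quot;)
SCAuth(&quot;name&quot;, &quot;secret&quot;)

#Hypothetical page name as the first level, then any next page
pathpattern &amp;lt;- c(&quot;http://example.com/my-hadoop-page/&quot;, &quot;::anything::&quot;)
next_page &amp;lt;- QueuePathing(&quot;report-suite&quot;,
                          &quot;2014-01-01&quot;,
                          &quot;2014-08-31&quot;,
                          metric=&quot;pageviews&quot;,
                          element=&quot;page&quot;,
                          pathpattern,
                          top = 50000)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;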

&lt;h2 id=&quot;many-to-many-page-pathing&quot;&gt;Many-to-Many Page Pathing&lt;/h2&gt;

&lt;p&gt;The example above picks a single page related to Hadoop, then shows how my visitors continue through my site; sometimes, they go to other Hadoop pages, some view &lt;a title=&quot;Data Science content&quot; href=&quot;http://randyzwitch.com/#DataScience&quot; target=&quot;_blank&quot;&gt;Data Science related content&lt;/a&gt; or any number of other paths. If we want, however, we can visualize how all visitors path through all pages. Like the force-directed graph, we can get this information by using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(&quot;::anything::&quot;, &quot;::anything::&quot;)&lt;/code&gt; path pattern with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueuePathing()&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Multi-page pathing&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;d3Network&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;RSiteCatalyst&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#### Authentication&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SCAuth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;secret&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#### Get All Possible Paths with (&quot;::anything::&quot;, &quot;::anything::&quot;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pathpattern&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;::anything::&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;::anything::&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;next_page&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueuePathing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-08-31&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;page&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pathpattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;50000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Optional step: Cleaning my pagename URLs to remove the domain for clarity&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;next_page&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;http://randyzwitch.com/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;next_page&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ignore.case&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;next_page&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;http://randyzwitch.com/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;next_page&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ignore.case&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Filter out Entered Site and duplicate rows, &amp;gt;=120 for chart legibility&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;next_page&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;120&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Entered Site&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Get unique values of page name to create nodes df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Create an index value, starting at 0&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodes&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as.data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unique&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodevalue&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;as.numeric&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row.names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Convert string to numeric nodeid&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by.x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by.y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;segment.id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;segment.name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;source&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by.x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by.y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;segment.id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;segment.name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;source&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;target&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Create next page Sankey chart&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d3output&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;~/Desktop/sankey_all.html&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d3Sankey&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Links&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Nodes&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Source&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;source&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
         &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Target&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;target&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Value&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NodeID&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
         &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fontsize&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodeWidth&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d3output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;width&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;750&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;height&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;700&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Running the code above provides the following visualization:&lt;/p&gt;

&lt;iframe src=&quot;http://randyzwitch.com/wp-content/uploads/2014/09/sankey_all1.html&quot; width=&quot;750&quot; height=&quot;700&quot; frameborder=&quot;0&quot; scrolling=&quot;no&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;For legibility, I’m only plotting paths that occur at least 120 times. Given a large enough display, it would be possible to visualize all valid combinations of paths.&lt;/p&gt;

&lt;p&gt;One thing to keep in mind with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;d3.js&lt;/code&gt; library is a weird hiccup: if your dataset contains “duplicate” paths such that both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Source -&amp;gt; Target&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Target -&amp;gt; Source&lt;/code&gt; exist, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;d3.js&lt;/code&gt; will go into an infinite loop or fail to show any visualization. My R code doesn’t provide a solution to this issue, but it should be trivial to remove these “duplicates” should they arise in your dataset.&lt;/p&gt;
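
&lt;p&gt;As a rough illustration of that cleanup, here is a minimal sketch written in plain Python rather than R; it is not part of the original analysis. It assumes each link is a hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(source, target, value)&lt;/code&gt; tuple mirroring the columns of the links data frame, and it keeps whichever direction of a pair has the larger value. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;drop_reciprocal_links&lt;/code&gt; is made up for this example.&lt;/p&gt;

```python
def drop_reciprocal_links(links):
    """Keep one direction per node pair, preferring the larger value.

    `links` is an iterable of hypothetical (source, target, value) tuples.
    """
    best = {}
    for row in links:
        source, target, value = row
        # A frozenset gives the same key for both directions of a pair
        key = frozenset((source, target))
        # max() returns its first argument on ties, so the row that was
        # seen first is kept when both directions have equal values
        best[key] = max(best.get(key, row), row, key=lambda r: r[2])
    # Dicts preserve insertion order, so pairs come back in first-seen order
    return list(best.values())


# Made-up example: the pair (0, 1) appears in both directions
links = [(0, 1, 300), (1, 0, 150), (0, 2, 200)]
print(drop_reciprocal_links(links))  # [(0, 1, 300), (0, 2, 200)]
```

&lt;p&gt;Keeping the larger flow is only one possible policy; summing the two values or dropping the pair entirely would be equally easy variations, depending on what you want the chart to show.&lt;/p&gt;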

&lt;h2 id=&quot;interpretation&quot;&gt;Interpretation&lt;/h2&gt;

&lt;p&gt;Unlike the network graphs, Sankey Charts are fairly easy to understand. The “worst” path on my site in terms of keeping visitors on site is where I praised Apple for &lt;a href=&quot;http://randyzwitch.com/broken-macbook-pro-hinge-fixed-free/&quot;&gt;fixing my MacBook Pro screen&lt;/a&gt; out-of-warranty. The easy explanation for this poor performance is that this article attracts people who aren’t really my target audience in data science, but are instead looking for information about getting THEIR screens fixed. If I wanted to engage these readers more, I guess I would need to write more Apple-related content.&lt;/p&gt;

&lt;p&gt;To the extent there are multi-stage paths, these tend to be &lt;a href=&quot;http://randyzwitch.com/tags/#hadoop&quot;&gt;Hadoop&lt;/a&gt; and &lt;a href=&quot;http://randyzwitch.com/tags/#julia&quot;&gt;Julia&lt;/a&gt;-related content. This makes sense as both technologies are fairly new, I have a lot more content in these areas, and especially in the case of Julia, I’m one of the few people writing practical content. So I’m glad to see I’m achieving some level of success in these areas.&lt;/p&gt;

&lt;p&gt;Hopefully this blog post and my previous post on &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-d3-network-graphs/&quot;&gt;visualizing your website visitors using network graphs&lt;/a&gt; have given you a feel for the &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-version-1-4-release-notes/&quot;&gt;new functionality available in RSiteCatalyst v1.4&lt;/a&gt;, as well as providing a new way of thinking about data visualization beyond just the default graphs provided by the Adobe Analytics interface.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Creating A Stacked Bar Chart in Seaborn</title>
        
          <description>&lt;p&gt;&lt;a href=&quot;http://randyzwitch.com/wp-content/uploads/2014/09/stacked_bar.csv&quot;&gt;Download chart data&lt;/a&gt;&lt;/p&gt;

</description>
        
        <pubDate>Tue, 09 Sep 2014 08:01:39 +0000</pubDate>
        <link>
        http://randyzwitch.com/creating-stacked-bar-chart-seaborn/</link>
        <guid isPermaLink="true">http://randyzwitch.com/creating-stacked-bar-chart-seaborn/</guid>
        <content type="html" xml:base="/creating-stacked-bar-chart-seaborn/">&lt;p&gt;&lt;a href=&quot;http://randyzwitch.com/wp-content/uploads/2014/09/stacked_bar.csv&quot;&gt;Download chart data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The other day I was having a heck of a time trying to figure out how to make a stacked bar chart in Seaborn. But in true open-source/community fashion, I ended up getting a response from the creator of Seaborn via Twitter:&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; lang=&quot;en&quot; data-conversation=&quot;none&quot;&gt;
  &lt;p&gt;
    &lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;@randyzwitch&lt;/a&gt; I don't really like stacked bar charts, I'd suggest maybe using pointplot / factorplot with kind=point
  &lt;/p&gt;

  &lt;p&gt;
    — Michael Waskom (@michaelwaskom) &lt;a href=&quot;https://twitter.com/michaelwaskom/status/507608729840152578&quot;&gt;September 4, 2014&lt;/a&gt;
  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So there you go. I don’t want to put words in Michael’s mouth, but if he’s not a fan, then it sounded like it was up to me to find my own solution if I wanted a stacked bar chart. I hacked around on the &lt;a href=&quot;http://pandas.pydata.org/pandas-docs/stable/visualization.html&quot;&gt;pandas plotting functionality&lt;/a&gt; a while, went to the &lt;a href=&quot;http://matplotlib.org/1.3.1/examples/pylab_examples/bar_stacked.html&quot;&gt;matplotlib documentation/example for a stacked bar chart&lt;/a&gt;, tried Seaborn some more and then it hit me…I’ve gotten so used to these amazing open-source packages that my brain has atrophied! Creating a stacked bar chart is SIMPLE, even in Seaborn (and even if Michael doesn’t like them 🙂 )&lt;/p&gt;

&lt;h2 id=&quot;stacked-bar-chart--sum-of-two-series&quot;&gt;Stacked Bar Chart = Sum of Two Series&lt;/h2&gt;

&lt;p&gt;In trying so hard to create a stacked bar chart, I neglected the most obvious part. Given two series of data, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Series 1&lt;/code&gt; (“bottom”) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Series 2&lt;/code&gt; (“top”), to create a stacked bar chart you just need to create:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;Series&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Series&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Series&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Once you have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Series 3&lt;/code&gt; (“total”), you can use the overlay feature of matplotlib and Seaborn to create your stacked bar chart. Plot “total” first, which will become the base layer of the chart. Because the total will be greater-than-or-equal-to the “bottom” series (as long as the “top” values are non-negative), overlaying the “bottom” series on the “total” series leaves only the “top” portion of each bar visible, so the “top” series appears stacked on top:&lt;/p&gt;

&lt;h4 id=&quot;background-total-series&quot;&gt;Background: “Total” Series&lt;/h4&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/09/background_total.png&quot; alt=&quot;background_total&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;overlay-bottom-series&quot;&gt;Overlay: “Bottom” Series&lt;/h4&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/09/bottom_plot1.png&quot; alt=&quot;bottom_plot&quot; /&gt;&lt;/p&gt;
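
&lt;p&gt;The arithmetic behind the overlay can be checked without any plotting at all. The following is a minimal sketch in plain Python, with made-up values and a hypothetical helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stack_layers&lt;/code&gt;; it assumes the “top” series is non-negative, since otherwise the “total” bar would be shorter than the “bottom” bar and the overlay would hide it.&lt;/p&gt;

```python
def stack_layers(bottom, top):
    """Return (total, visible_top) for a two-series overlay chart."""
    # The overlay trick relies on total being at least as tall as bottom,
    # which only holds when every top value is non-negative.
    if not all(value >= 0 for value in top):
        raise ValueError("top series must be non-negative")
    total = [b + t for b, t in zip(bottom, top)]
    # Drawing bottom over total hides the lower part of each bar;
    # what remains visible is total - bottom, i.e. the top series.
    visible_top = [tot - b for tot, b in zip(total, bottom)]
    return total, visible_top


# Made-up values for three groups
total, visible_top = stack_layers([3, 5, 2], [2, 1, 4])
print(total)        # [5, 6, 6]
print(visible_top)  # [2, 1, 4]
```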

&lt;h2 id=&quot;end-result-stacked-bar-chart&quot;&gt;End Result: Stacked Bar Chart&lt;/h2&gt;

&lt;p&gt;Running the code in the same IPython Notebook cell results in the following chart (&lt;a href=&quot;http://randyzwitch.com/wp-content/uploads/2014/09/stacked_bar.csv&quot;&gt;download chart data&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/09/stacked-bar-seaborn.png&quot; alt=&quot;stacked-bar-seaborn&quot; /&gt;&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;matplotlib&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyplot&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;matplotlib&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mpl&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;seaborn&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matplotlib&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inline&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Read in data &amp;amp; create total column
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stacked_bar_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_csv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;r&quot;C:\stacked_bar.csv&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;stacked_bar_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;total&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stacked_bar_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Series1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stacked_bar_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Series2&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Set general plot properties
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_style&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;white&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;({&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;figure.figsize&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)})&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Plot 1 - background - &quot;total&quot; (top) series
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;barplot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stacked_bar_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Group&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stacked_bar_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;total&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;color&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;red&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Plot 2 - overlay - &quot;bottom&quot; series
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bottom_plot&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;barplot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stacked_bar_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Group&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stacked_bar_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Series1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;color&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;#0000A3&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;n&quot;&gt;topbar&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Rectangle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;red&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;edgecolor&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'none'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;bottombar&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Rectangle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'#0000A3'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;edgecolor&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'none'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;legend&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bottombar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;topbar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Bottom Bar'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'Top Bar'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ncol&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;prop&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'size'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;draw_frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Optional code - Make plot look nicer
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;despine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;bottom_plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_ylabel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Y-axis label&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;bottom_plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_xlabel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;X-axis label&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Set fonts to consistent 16pt size
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;item&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bottom_plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xaxis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bottom_plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;yaxis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
             &lt;span class=&quot;n&quot;&gt;bottom_plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_xticklabels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bottom_plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_yticklabels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;item&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_fontsize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;dont-overthink-things&quot;&gt;Don’t Overthink Things!&lt;/h2&gt;

&lt;p&gt;In the end, creating a stacked bar chart in Seaborn took me 4 hours of messing around trying everything under the sun, then 15 minutes once I remembered what a stacked bar chart actually represents. Hopefully this will save someone else from the same misery.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://randyzwitch.com/wp-content/uploads/2014/09/stacked_bar.csv&quot;&gt;Download chart data&lt;/a&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Visualizing Website Structure With Network Graphs</title>
        
          <description>&lt;p&gt;Last week, &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-version-1-4-release-notes/&quot;&gt;version 1.4 of RSiteCatalyst&lt;/a&gt; was released, making it possible to get site pathing information directly within R. It’s now easy to create impressive-looking network graphs from your Adobe Analytics data using &lt;a href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot;&gt;RSiteCatalyst&lt;/a&gt; and &lt;a href=&quot;http://cran.r-project.org/web/packages/d3Network/index.html&quot;&gt;d3Network&lt;/a&gt;. In this blog post, I will cover simple and force-directed network graphs, which show the pairwise relationships between pages. In a follow-up blog post, I will show how to visualize longer paths using &lt;a href=&quot;http://www.sankey-diagrams.com/&quot;&gt;Sankey diagrams&lt;/a&gt;, also from the d3Network package.&lt;/p&gt;

</description>
        
        <pubDate>Mon, 08 Sep 2014 06:40:38 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-d3-network-graphs/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-d3-network-graphs/</guid>
        <content type="html" xml:base="/rsitecatalyst-d3-network-graphs/">&lt;p&gt;Last week, &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-version-1-4-release-notes/&quot;&gt;version 1.4 of RSiteCatalyst&lt;/a&gt; was released, making it possible to get site pathing information directly within R. It’s now easy to create impressive-looking network graphs from your Adobe Analytics data using &lt;a href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot;&gt;RSiteCatalyst&lt;/a&gt; and &lt;a href=&quot;http://cran.r-project.org/web/packages/d3Network/index.html&quot;&gt;d3Network&lt;/a&gt;. In this blog post, I will cover simple and force-directed network graphs, which show the pairwise relationships between pages. In a follow-up blog post, I will show how to visualize longer paths using &lt;a href=&quot;http://www.sankey-diagrams.com/&quot;&gt;Sankey diagrams&lt;/a&gt;, also from the d3Network package.&lt;/p&gt;

&lt;h2 id=&quot;obtainingpathing-data-with-queuepathing&quot;&gt;Obtaining Pathing Data With QueuePathing&lt;/h2&gt;

&lt;p&gt;Although the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueuePathing()&lt;/code&gt; function is new to RSiteCatalyst, its syntax should feel familiar (even with all of the breaking changes we made!). In the case of creating our network graphs, we want to download all pairwise combinations of pages, which is easy to do using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;::anything::&lt;/code&gt; operator:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;RSiteCatalyst&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;d3Network&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#### Authentication&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SCAuth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;username&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;secret&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#### Get Pathing data using ::anything:: wildcards&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Results are limited by the API to 50000&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pathpattern&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;::anything::&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;::anything::&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queue_pathing_pages&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueuePathing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;zwitchdev&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                    &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                    &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-08-31&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;page&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pathpattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;50000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Because we are using a pathing pattern of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(&quot;::anything::&quot;, &quot;::anything::&quot;)&lt;/code&gt;, the data frame that is returned from this function will have three columns: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;step.1&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;step.2&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt;, which is the number of occurrences of the path.&lt;/p&gt;
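
&lt;p&gt;To make that shape concrete, here is a minimal sketch of a data frame with the same three columns. The rows are invented for illustration (the name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_paths&lt;/code&gt; and every value in it are assumptions, not real output from the API); only the column names come from the description above.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Hypothetical rows; only the column names mirror the QueuePathing() result
toy_paths &amp;lt;- data.frame(
  step.1 = c(&quot;Entered Site&quot;, &quot;home&quot;, &quot;home&quot;, &quot;about&quot;),
  step.2 = c(&quot;home&quot;, &quot;about&quot;, &quot;Exited Site&quot;, &quot;home&quot;),
  count  = c(120, 45, 60, 7),
  stringsAsFactors = FALSE
)

# Each row is one (from page, to page) pair plus how often it occurred
str(toy_paths)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;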

&lt;h2 id=&quot;plotting-graph-using-d3simplenetwork&quot;&gt;Plotting Graph Using d3SimpleNetwork&lt;/h2&gt;

&lt;p&gt;Before jumping into the plotting, we need to do some quick data cleaning. Lines 1-5 below are optional; I don’t set the Adobe Analytics s.pageName on each of my blog pages (a worst practice if there ever was one!), so I use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sub()&lt;/code&gt; function in base R to strip the domain name from the beginning of each page name. The other data frame modification is to remove the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;'Entered Site'&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;'Exited Site'&lt;/code&gt; rows from the page name pairs. Although this is important information generally, these behaviors aren’t needed to show the pairwise relationship between pages.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Optional step: Cleaning my pagename URLs to remove the domain for graph clarity&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queue_pathing_pages&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;http://randyzwitch.com/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queue_pathing_pages&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ignore.case&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queue_pathing_pages&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;http://randyzwitch.com/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queue_pathing_pages&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ignore.case&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#### Remove Enter and Exit site values&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#This information is important for analysis, but not related to website structure&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;graph_links&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queue_pathing_pages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Entered Site&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Exited Site&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#### First pass - Simple Network&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Setting standAlone = TRUE creates a full HTML file to view graph&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Set equal to FALSE to just get the d3 JavaScript&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;simpleoutput1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;C:/Users/rzwitc200/Desktop/simpleoutput1.html&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d3SimpleNetwork&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;graph_links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Source&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Target&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;height&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;600&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;width&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;750&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fontsize&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linkDistance&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;charge&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;-50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linkColour&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;#666&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodeColour&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;#3182bd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodeClickColour&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;#E34A33&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;textColour&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;#3182bd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;opacity&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0.6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;standAlone&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;simpleoutput1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Running the above code results in the following graph:&lt;/p&gt;

&lt;iframe src=&quot;http://randyzwitch.com/wp-content/uploads/2014/09/simpleoutput1.html&quot; width=&quot;750&quot; height=&quot;500&quot; frameborder=&quot;0&quot; scrolling=&quot;no&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Hmmm…looks like a blob of spaghetti, a common occurrence when creating graphs. We can do better.&lt;/p&gt;

&lt;h2 id=&quot;pruning-edges-from-thegraph&quot;&gt;Pruning Edges From The Graph&lt;/h2&gt;

&lt;p&gt;There are many &lt;a title=&quot;Pruning Edges from Network&quot; href=&quot;http://link.springer.com/chapter/10.1007%2F978-3-642-31830-6_13&quot; target=&quot;_blank&quot;&gt;complex algorithms for determining how to prune edges/nodes from a network&lt;/a&gt;. To keep things simple, I’m going to use a basic rule: a path has to occur more than 5 times to be included in the network. This prunes roughly 80% of the pairwise page combinations while keeping ~75% of the occurrences. This is easy to do using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;subset()&lt;/code&gt; function in R:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#### Second pass: thin the spaghetti blob!&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Require path to happen more than some number of times (count &amp;gt; x)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#What constitutes &quot;low volume&quot; will depend on your level of traffic&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;simpleoutput2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;C:/Users/rzwitc200/Desktop/simpleoutput2.html&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d3SimpleNetwork&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;graph_links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Source&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Target&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;height&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;600&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;width&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;750&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fontsize&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linkDistance&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;charge&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;-100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linkColour&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;#666&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodeColour&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;#3182bd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodeClickColour&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;#E34A33&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;textColour&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;#3182bd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;opacity&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0.6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;standAlone&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;simpleoutput2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The result of pruning the number of edges is a much less cluttered graph:&lt;/p&gt;

&lt;iframe src=&quot;http://randyzwitch.com/wp-content/uploads/2014/09/simpleoutput2.html&quot; width=&quot;750&quot; height=&quot;500&quot; frameborder=&quot;0&quot; scrolling=&quot;no&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Even with fewer edges in the graph, we still lose some of the information about the pages, since we don’t know what topics/groups the pages represent. We can fix that using a slightly more complex version of the d3Network graph code.&lt;/p&gt;

&lt;h2 id=&quot;force-directed-graphs&quot;&gt;Force-directed graphs&lt;/h2&gt;

&lt;p&gt;The graphs above outline the structure of randyzwitch.com, but they can be improved by adding color-coding to the nodes to represent the topic of the post, as well as making the edges thicker/thinner based on how frequently the path occurs. This can be done using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;d3ForceNetwork()&lt;/code&gt; function like so:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#### Force directed network&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Limit to more than 5 occurrences, as in the simple network&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_graph_links&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;graph_links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Get unique values of page name to create nodes df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Create an index value, starting at 0&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_nodes&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as.data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unique&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_graph_links&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_graph_links&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step.2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_nodes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_nodes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodevalue&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;as.numeric&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row.names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_nodes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Create groupings for node colors&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#This is user-specific in terms of how to create these groupings&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Due to few number of pages/topics, I am manually coding this&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grouping&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;(hadoop|hive|pig)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;perl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;(julia|uaparser-jl)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;perl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;([r]?sitecatalyst|adobe-analytics|omniture)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;perl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;(wordpress|twenty-eleven|scrappy)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;perl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;data-science|ec2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;perl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;python&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;perl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;(digital-analytics|google-analytics|web-analyst)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;perl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;(macbook|iphone)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;perl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;(randyzwitch|about|page)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;perl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;(rstudio|rcmdr|r-language|jsonlite|r-language-oddities|tag/r|automated-re-install-of-packages-for-r-3-0|learning-r-sas|creating-dummy-variables-data-frame-r)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;perl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Create group column&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_nodes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;group&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_nodes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grouping&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Append numeric nodeid to pagename&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_graph_links&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_graph_links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_nodes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by.x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by.y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_graph_links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;source&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_graph_links&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_graph_links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_nodes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by.x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by.y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#merge() puts the key column first, so step.2 now precedes step.1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_graph_links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;step.1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;source&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;target&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d3output&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;C:/Users/rzwitc200/Desktop/fd_graph.html&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Create force-directed graph&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d3ForceNetwork&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Links&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_graph_links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Nodes&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd_nodes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Source&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;source&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
               &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Target&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;target&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NodeID&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
               &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Group&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;group&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;opacity&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0.8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Value&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
               &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d3output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
               &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;charge&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;-90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
               &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fontsize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Running the code results in the following force-directed graph:&lt;/p&gt;

&lt;iframe src=&quot;http://randyzwitch.com/wp-content/uploads/2014/09/fd_graph.html&quot; width=&quot;750&quot; height=&quot;500&quot; frameborder=&quot;0&quot; scrolling=&quot;no&quot;&gt;&lt;/iframe&gt;

&lt;h2 id=&quot;interpretation&quot;&gt;Interpretation&lt;/h2&gt;

&lt;p&gt;I’m not going to lie, all three of these diagrams are hard to interpret. Like wordclouds, network graphs are often visually interesting, yet it is difficult to extract any concrete information from them. Network graphs also tend to reinforce what you already know (you or someone you know designed your website, so you should already have a feel for its structure!).&lt;/p&gt;

&lt;p&gt;However, in the case of the force-directed graph above, I do see some interesting patterns. Specifically, there are a considerable number of nodes that aren’t attached to the main network structure. This may be occurring due to my method of pruning the network edges. More likely, these disconnected nodes represent “dead-ends” in my blog, either because few pages link to them, because there are technical errors, or because they are high bounce-rate pages or one-off topics that satiate the reader.&lt;/p&gt;
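
&lt;p&gt;One way to list those disconnected nodes rather than eyeballing them is to label the connected components of the graph and flag every node outside the largest one. The snippet below is a minimal base R sketch of that idea, not part of the original analysis: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nodes&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;links&lt;/code&gt; data frames are made-up stand-ins for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd_nodes&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd_graph_links&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;component_labels()&lt;/code&gt; is a hypothetical helper name. It assumes zero-indexed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;source&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target&lt;/code&gt; columns, as in the code above.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;#Toy stand-ins for fd_nodes and fd_graph_links (hypothetical values)
nodes &amp;lt;- data.frame(name = c(&quot;home&quot;, &quot;about&quot;, &quot;r-post&quot;, &quot;julia-post&quot;, &quot;orphan-a&quot;, &quot;orphan-b&quot;),
                    stringsAsFactors = FALSE)
links &amp;lt;- data.frame(source = c(0, 0, 2, 4),
                    target = c(1, 2, 3, 5))

#Label connected components: every node starts in its own component,
#and each link merges the components of its two endpoints
component_labels &amp;lt;- function(n_nodes, source, target){
  label &amp;lt;- seq_len(n_nodes)
  for(i in seq_along(source)){
    a &amp;lt;- label[source[i] + 1]  #+1 because the link indices start at 0
    b &amp;lt;- label[target[i] + 1]
    if(a != b){
      #Relabel the whole component carrying the larger label
      label[label == max(a, b)] &amp;lt;- min(a, b)
    }
  }
  return(label)
}

nodes$component &amp;lt;- component_labels(nrow(nodes), links$source, links$target)

#Treat the most common label as the main network
main_component &amp;lt;- as.integer(names(which.max(table(nodes$component))))
disconnected &amp;lt;- nodes$name[nodes$component != main_component]
print(disconnected)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With the toy data, the two “orphan” pages are flagged because they only link to each other. On real data, the same approach should give a concrete list of pages to check for bounce rate or missing inbound links.&lt;/p&gt;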

&lt;p&gt;In terms of action I can take, I can certainly look up the bounce rate for these disconnected pages/nodes and re-write the content to make it more ‘sticky’. There’s also the question of how my “Related Posts” plugin determines related pages. As far as I know, it’s quite naive, using the existing words on the page to determine relationships between posts. So one follow-up could be to create an actual recommender system to better suggest content to my readers. Perhaps that’s a topic for a different blog post.&lt;/p&gt;

&lt;p&gt;Regardless of the actions I’ll end up taking from this information, hopefully this blog post has sparked some ideas about how to use RSiteCatalyst in a non-standard way, extending the standard digital analytics information you are capturing with Adobe Analytics into interesting visualizations and potential new insights.&lt;/p&gt;

&lt;h4 id=&quot;example-data&quot;&gt;Example Data&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;For those of you who aren’t Adobe Analytics customers (or are, but don’t have API access), here are the &lt;a href=&quot;/wp-content/uploads/2014/09/queue_pathing_pages.csv&quot;&gt;data from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;queue_pathing_pages&lt;/code&gt; data frame&lt;/a&gt; above. Just read this data into R, then you should be able to follow along with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;d3Network&lt;/code&gt; code.&lt;/em&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.4 Release Notes</title>
        
          <description>&lt;p&gt;It felt like it would never happen, but &lt;a href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot;&gt;RSiteCatalyst v1.4&lt;/a&gt; is now available on CRAN! There are numerous changes in this version of the package, so unlike previous posts, there won’t be any code examples.&lt;/p&gt;

</description>
        
        <pubDate>Mon, 01 Sep 2014 20:30:14 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-4-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-4-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-4-release-notes/">&lt;p&gt;It felt like it would never happen, but &lt;a href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot;&gt;RSiteCatalyst v1.4&lt;/a&gt; is now available on CRAN! There are numerous changes in this version of the package, so unlike previous posts, there won’t be any code examples.&lt;/p&gt;

&lt;h2 id=&quot;this-version-is-one-big-breaking-change&quot;&gt;THIS VERSION IS ONE BIG BREAKING CHANGE&lt;/h2&gt;

&lt;p&gt;While not the most important &lt;em&gt;improvement&lt;/em&gt;, it can’t be stressed enough that migrating to v1.4 of RSiteCatalyst is likely going to require re-writing some of your prior code. There are numerous reasons for the breaking changes, including:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Adobe made breaking changes to the API between v1.3 and v1.4, so we had to as well&lt;/li&gt;
  &lt;li&gt;I partnered with &lt;a title=&quot;Willem Paling GitHub&quot; href=&quot;https://github.com/WillemPaling&quot; target=&quot;_blank&quot;&gt;Willem Paling&lt;/a&gt;, who merged his &lt;a title=&quot;RAA - Original Source for RSiteCatalyst 1.4&quot; href=&quot;https://github.com/WillemPaling/RAA&quot; target=&quot;_blank&quot;&gt;RAA&lt;/a&gt; codebase into RSiteCatalyst to contribute most of the code in this version&lt;/li&gt;
  &lt;li&gt;Better consistency in R functions around keywords and options&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Of the changes listed above, I think #2 and #3 are the biggest benefit to end-users of RSiteCatalyst. The codebase is now much cleaner and more consistent in terms of the keyword arguments, has better error handling, and having a second person helping maintain the project has led to a better overall package.&lt;/p&gt;

&lt;p&gt;Where you’ll see the most difference is that all keyword arguments are now lowercase, and multi-word keyword arguments are now separated by a period instead of underscores or weird caMelCAse. We tried to maintain the same keyword order where possible to minimize code re-writes.&lt;/p&gt;

&lt;h2 id=&quot;pathing-and-fallout-reports&quot;&gt;Pathing and Fallout Reports&lt;/h2&gt;

&lt;p&gt;Probably the most useful improvement to RSiteCatalyst comes from those breaking changes by Adobe, which is the inclusion of Pathing and Fallout reports! I can’t say with absolute certainty, but I think with these two additional reports, the API is pretty much at parity with the Adobe Analytics interface itself. So now you can create your funnels using &lt;a title=&quot;ggplot2 documentation&quot; href=&quot;http://ggplot2.org/&quot; target=&quot;_blank&quot;&gt;ggplot2&lt;/a&gt;, make force-directed graphs or Sankey charts using &lt;a title=&quot;d3Network documentation&quot; href=&quot;http://christophergandrud.github.io/d3Network/&quot; target=&quot;_blank&quot;&gt;d3Network&lt;/a&gt; or just do simple reporting of top ‘Next Pages’ and the like.&lt;/p&gt;

&lt;h2 id=&quot;support-for-oauth-authentication&quot;&gt;Support for OAuth Authentication&lt;/h2&gt;

&lt;p&gt;As part of Adobe’s commitment to consolidating systems under the single Adobe Marketing Cloud, authentication with the API using OAuth is now possible. How to set up OAuth authentication is beyond the scope of this blog post, but you can get more information at this link: &lt;a title=&quot;Adobe Marketing Cloud OAuth&quot; href=&quot;https://marketing.adobe.com/resources/help/en_US/mcloud/link_accounts.html&quot; target=&quot;_blank&quot;&gt;Adobe Marketing Cloud OAuth&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For those of you who don’t have OAuth credentials set up yet, the “legacy” version of authentication is still available in RSiteCatalyst.&lt;/p&gt;

&lt;h2 id=&quot;getclassifications-inline-segmentation-and-more&quot;&gt;GetClassifications, Inline Segmentation and More&lt;/h2&gt;

&lt;p&gt;Finally, there is now additional functionality on the descriptive side: you can download which Classifications are defined for a report suite, segments can be defined inline (i.e. from R) for the ‘Queue’ reports using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BuildClassificationValueSegment()&lt;/code&gt; function, and functions that existed in previous versions of RSiteCatalyst tend to have more options than before.&lt;/p&gt;

&lt;h2 id=&quot;summarywe-want-to-hear-from-you&quot;&gt;Summary/We Want To Hear From You&lt;/h2&gt;

&lt;p&gt;While this new version of RSiteCatalyst has some annoying breaking changes, overall the package is much more robust than prior versions. I think the increase in functionality is well worth the minor annoyance of re-writing some code. Additionally, Adobe will eventually deprecate v1.3 of their API, so it’s better to move over sooner rather than later.&lt;/p&gt;

&lt;p&gt;But for all of the improvements that have been made, there’s always room for improvement, whether it’s fixing unforeseen bugs, adding new features, improving the documentation or anything else. For all suggestions, bug fixes and the like, please submit them to the &lt;a title=&quot;RSiteCatalyst GitHub&quot; href=&quot;https://github.com/randyzwitch/RSiteCatalyst&quot; target=&quot;_blank&quot;&gt;GitHub repository&lt;/a&gt; so that Willem and I can evaluate and incorporate them. We’re also VERY open to any of you in the R community who are able to patch the code or add new features. As a friend in the data science community says, a Pull Request is always better than a Feature Request 🙂&lt;/p&gt;

&lt;p&gt;Happy API’ing everyone!&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Visualizing Analytics Languages With VennEuler.jl</title>
        
          <description>&lt;p&gt;It often doesn’t take much to get me off track, and on a holiday weekend…well, I was just begging for a fun way to shirk. Enter Harlan Harris:&lt;/p&gt;

</description>
        
        <pubDate>Fri, 29 Aug 2014 15:16:24 +0000</pubDate>
        <link>
        http://randyzwitch.com/visualizing-analytics-languages-venneuler-jl/</link>
        <guid isPermaLink="true">http://randyzwitch.com/visualizing-analytics-languages-venneuler-jl/</guid>
        <content type="html" xml:base="/visualizing-analytics-languages-venneuler-jl/">&lt;p&gt;It often doesn’t take much to get me off track, and on a holiday weekend…well, I was just begging for a fun way to shirk. Enter Harlan Harris:&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-cards=&quot;hidden&quot; data-partner=&quot;tweetdeck&quot;&gt;
  &lt;p&gt;
    someone redo this area-prop'l Venn w/ my Julia pkg! &lt;a href=&quot;http://t.co/Mh8rXZbRgY&quot;&gt;http://t.co/Mh8rXZbRgY&lt;/a&gt; &lt;a href=&quot;http://t.co/RDWNQHTw3S&quot;&gt;http://t.co/RDWNQHTw3S&lt;/a&gt; &lt;a href=&quot;http://t.co/ljujd9DG0T&quot;&gt;http://t.co/ljujd9DG0T&lt;/a&gt; via &lt;a href=&quot;https://twitter.com/revodavid&quot;&gt;@revodavid&lt;/a&gt;
  &lt;/p&gt;

  &lt;p&gt;
    — Harlan Harris (@HarlanH) &lt;a href=&quot;https://twitter.com/HarlanH/statuses/505365468363100160&quot;&gt;August 29, 2014&lt;/a&gt;
  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Hey, I’m someone looking for something to do! And I like writing Julia code! So let’s have a look at recreating this diagram in Julia using VennEuler.jl (&lt;a title=&quot;VennEuler.jl example&quot; href=&quot;http://nbviewer.ipython.org/gist/randyzwitch/860e1d9ae5a12cb61b1b&quot; target=&quot;_blank&quot;&gt;IJulia Notebook link&lt;/a&gt;):&lt;/p&gt;

&lt;div style=&quot;width: 490px&quot; class=&quot;wp-caption alignnone&quot;&gt;
  &lt;img src=&quot;http://revolution-computing.typepad.com/.a/6a010534b1db25970b01a73e0af9c7970d-800wi&quot; alt=&quot;&quot; width=&quot;480&quot; height=&quot;427&quot; /&gt;

  &lt;p class=&quot;wp-caption-text&quot;&gt;
    Source: Revolution R/KDNuggets
  &lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href=&quot;http://blog.revolutionanalytics.com/2014/08/r-tops-kdnuggets-data-analysis-software-poll-for-4th-consecutive-year.html&quot; target=&quot;_blank&quot;&gt;http://blog.revolutionanalytics.com/2014/08/r-tops-kdnuggets-data-analysis-software-poll-for-4th-consecutive-year.html&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;installing-venneulerjl&quot;&gt;Installing VennEuler.jl&lt;/h2&gt;

&lt;p&gt;Because VennEuler.jl is not in METADATA as of the time of writing, instead of using Pkg.add() you’ll need to run:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;Pkg&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clone&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;https://github.com/HarlanH/VennEuler.jl.git&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Note that VennEuler uses some of the more exotic packages (at least to me) like NLopt and Cairo, so you might need to have a few additional dependencies installed with the package.&lt;/p&gt;

&lt;h2 id=&quot;data&quot;&gt;Data&lt;/h2&gt;

&lt;p&gt;The data was a bit confusing to me at first, since the percentages add up to more than 100% (people could vote multiple times). In order to create a dataset to use, I took the percentages, multiplied by 1000, then re-created the voting pattern. The data for the graph can be downloaded from &lt;a title=&quot;Dataset&quot; href=&quot;http://randyzwitch.com/wp-content/uploads/2014/08/kdnuggets_language_survey_2014.csv&quot; target=&quot;_blank&quot;&gt;this link&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;code---circles&quot;&gt;Code - Circles&lt;/h2&gt;

&lt;p&gt;With a few modifications, I basically re-purposed Harlan’s code from the &lt;a href=&quot;https://github.com/HarlanH/VennEuler.jl/blob/master/test/DC2.jl&quot;&gt;package test files&lt;/a&gt;. The circle result is as follows:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;VennEuler&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;readcsv&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/home/rzwitch/Desktop/kdnuggets_language_survey_2014.csv&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;header&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bool&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Circles&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;eo&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;make_euler_object&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EulerSpec&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# circles, for now&lt;/span&gt;

&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minf&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minx&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ret&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;optimize&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eo&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random_state&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eo&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ftol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xtol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.0025&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;120&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pop&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;got &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;minf at &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;minx (returned &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ret)&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;render&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/home/rzwitch/Desktop/kd.svg&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;eo&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minx&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/08/venneulercircles.png&quot; alt=&quot;venneulercircles&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Since the percentages of R, SAS, and Python users aren’t dramatically different (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;49.81%&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;33.42%&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;40.97%&lt;/code&gt; respectively) and the visualizations are circles, it’s a bit hard to tell that R is about 16 percentage points higher than SAS and 9 percentage points higher than Python.&lt;/p&gt;
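
&lt;p&gt;To put rough numbers on why the circles look so similar, here is a minimal sketch of the arithmetic. It is not part of VennEuler.jl; the helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;radius_ratio&lt;/code&gt; is a name made up for illustration, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;round&lt;/code&gt; keyword syntax assumes a current Julia 1.x release rather than the version used elsewhere in this post. The only assumption about the diagram is that circle area is proportional to each share, so the radius grows with the square root of the share:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;# Survey shares quoted above (percent of respondents)
shares = [(&quot;R&quot;, 49.81), (&quot;SAS&quot;, 33.42), (&quot;Python&quot;, 40.97)]

# Hypothetical helper: for area-proportional circles,
# the radius scales with the square root of the area
radius_ratio(share, reference) = sqrt(share / reference)

reference = shares[1][2]  # use R as the baseline

for (name, share) in shares
    gap = round(reference - share; digits=2)                 # percentage points below R
    ratio = round(radius_ratio(share, reference); digits=2)  # radius relative to R
    println(&quot;$(name): $(gap) points below R, radius ratio $(ratio)&quot;)
end
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Under that assumption, the SAS circle would have a radius of roughly 0.82 times the R circle and the Python circle roughly 0.91 times, which is why a 16-point gap in share shows up as a much smaller visual difference.&lt;/p&gt;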

&lt;h2 id=&quot;code--rectangles&quot;&gt;Code - Rectangles&lt;/h2&gt;

&lt;p&gt;Alternatively, we can use rectangles to represent the areas:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;VennEuler&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;readcsv&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/home/rzwitch/Desktop/kdnuggets_language_survey_2014.csv&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;header&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bool&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Rectangles&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;eo&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;make_euler_object&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;EulerSpec&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rectangle&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EulerSpec&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rectangle&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;EulerSpec&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rectangle&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)],&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;sizesum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minf&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minx&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ret&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;optimize_iteratively&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eo&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random_state&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eo&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ftol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xtol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.0025&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pop&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;phase 1: got &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;minf at &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;minx (returned &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ret)&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minf&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minx&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ret&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;optimize&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eo&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minx&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ftol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xtol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.001&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pop&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;phase 2: got &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;minf at &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;minx (returned &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ret)&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;render&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/home/rzwitch/Desktop/kd-rects.svg&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;eo&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minx&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/08/venneulerrectangles.png&quot; alt=&quot;venneulerrectangles&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here, it’s slightly easier to see that SAS and Python are about the same area-wise and that R is larger, although the different dimensions do obscure this fact a bit.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;If I spent more time with this package, I’m sure I could make something even more aesthetically pleasing. And for that matter, it’s still a pre-production package that will no doubt get better in the future. But at the very least, there is a way to create an area-proportional representation of relationships using VennEuler.jl in Julia.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>String Interpolation for Fun and Profit</title>
        
          <description>&lt;p&gt;In a previous post, I showed how I frequently use &lt;a href=&quot;http://randyzwitch.com/julia-odbc-jl/&quot;&gt;Julia as a ‘glue’ language&lt;/a&gt; to connect multiple systems in a complicated data pipeline. For this blog post, I will show two more examples where I use Julia for general programming, rather than for computationally-intense programs.&lt;/p&gt;

</description>
        
        <pubDate>Mon, 14 Jul 2014 12:01:10 +0000</pubDate>
        <link>
        http://randyzwitch.com/string-interpolation-julia/</link>
        <guid isPermaLink="true">http://randyzwitch.com/string-interpolation-julia/</guid>
        <content type="html" xml:base="/string-interpolation-julia/">&lt;p&gt;In a previous post, I showed how I frequently use &lt;a href=&quot;http://randyzwitch.com/julia-odbc-jl/&quot;&gt;Julia as a ‘glue’ language&lt;/a&gt; to connect multiple systems in a complicated data pipeline. For this blog post, I will show two more examples where I use Julia for general programming, rather than for computationally-intense programs.&lt;/p&gt;

&lt;h2 id=&quot;string-buildingintroduction&quot;&gt;String Building: Introduction&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;http://docs.julialang.org/en/latest/manual/strings/&quot;&gt;Strings section of the Julia Manual&lt;/a&gt; provides a very in-depth treatment of the considerations when using strings within Julia. For the purposes of my examples, there are only three things to know:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Strings are immutable within Julia and 1-indexed&lt;/li&gt;
  &lt;li&gt;Strings are easily created through a syntax familiar from most languages:&lt;/li&gt;
&lt;/ul&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;authorname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;randy zwitch&quot;&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&quot;randy zwitch&quot;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typeof&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;authorname&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;ul&gt;
  &lt;li&gt;String interpolation is most easily done using dollar-sign notation. Additionally, parentheses can be used to avoid symbol ambiguity:&lt;/li&gt;
&lt;/ul&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;interpolated&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;the author of this blog post is &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(authorname)&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&quot;the author of this blog post is randy zwitch&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
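
&lt;p&gt;As a small sketch of where the parentheses matter (the file name and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_2014&lt;/code&gt; suffix are made up for illustration): when the interpolated variable is immediately followed by characters that are valid in an identifier, Julia would otherwise read them as part of the variable name. Parentheses also allow a whole expression to be interpolated:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;authorname = &quot;randy zwitch&quot;

# Without parentheses, &quot;$authorname_2014.tab&quot; would look for a
# variable named authorname_2014 and fail with an UndefVarError
filename = &quot;$(authorname)_2014.tab&quot;

# Any expression can go inside the parentheses, not just a variable
note = &quot;the author name has $(length(authorname)) characters&quot;

println(filename)
println(note)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;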

&lt;p&gt;&lt;del&gt;If you are using large volumes of textual data, you’ll want to pay attention to the difference between the various string types that Julia provides (&lt;em&gt;UTF8/16/32, ASCII, Unicode, etc&lt;/em&gt;), but for the purposes of this blog post we’ll just be using the &lt;em&gt;ASCIIString&lt;/em&gt; type by not explicitly declaring the string type and only using ASCII characters.&lt;/del&gt;&lt;/p&gt;

&lt;p&gt;EDIT, 9/8/2016: Starting with version 0.5, Julia defaults to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;String&lt;/code&gt; type, which uses the UTF-8 character encoding.&lt;/p&gt;

&lt;h2 id=&quot;example-1-repetitive-queries&quot;&gt;Example 1: Repetitive Queries&lt;/h2&gt;

&lt;p&gt;As part of my data engineering responsibilities at work, I often get requests to pull a sample of every table in a new database in our Hadoop cluster. This type of request usually comes from the business owner, who wants to evaluate whether the data set has been imported correctly, but doesn’t actually want to write any sort of queries. So using the &lt;a href=&quot;https://github.com/quinnj/ODBC.jl&quot;&gt;ODBC.jl&lt;/a&gt; package, I repeatedly run the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;select * from &amp;lt;tablename&amp;gt;&lt;/code&gt; query and save the results to individual .tab files:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;       &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;     &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fresh&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;approach&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;technical&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;computing&lt;/span&gt;
&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;     &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;Documentation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;http&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;://&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;docs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;julialang&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;org&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;__&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;kt&quot;&gt;Type&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;help()&quot;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;list&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;help&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;topics&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;`&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;Version&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prerelease&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4028&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2014&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;07&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;23&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;42&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;UTC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|\&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|\&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;Commit&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2185&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bd1&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;days&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;old&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;master&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;                   &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;x86_64&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w64&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mingw32&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ODBC&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ODBC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Production hiveserver2&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pwd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ODBC&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Connection&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Object&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;----------------------&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Connection&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Data&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Source&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Production&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hiveserver2&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Production&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hiveserver2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Connection&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Number&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Contains&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;resultset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;No&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tables&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;show tables in db;&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;elapsed&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.167028049&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seconds&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tbl&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tables&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tab_name&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;select * from db.&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(tbl) &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;limit 1000;&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;C:&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;data_dump&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(tbl)&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;.tab&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;delim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;'\t'&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;While the query is simple, writing and running it hundreds of times by hand would be a waste of effort. With a simple loop over the array of table names, I can write a 1,000-row sample of each of hundreds of tables to its own .tab file in five lines of code.&lt;/p&gt;
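
&lt;p&gt;A minimal sketch of the same pattern that can be checked without a Hive connection: the helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sample_job&lt;/code&gt; and the table names below are hypothetical (they are not part of ODBC.jl or of the original session), and the code assumes a current Julia 1.x release rather than the 0.3 prerelease shown above. It only builds the query text and output path for each table, so the strings can be inspected before anything is sent to the database.&lt;/p&gt;

```julia
# Hypothetical helper (not an ODBC.jl function): build the sampling query and
# the output path for one table, mirroring the strings used in the loop above.
function sample_job(db::AbstractString, tbl::AbstractString;
                    limit::Integer = 1000,
                    outdir::AbstractString = "C:\\data_dump")
    sql = "select * from $(db).$(tbl) limit $(limit);"
    outfile = "$(outdir)\\$(tbl).tab"
    return (sql = sql, outfile = outfile)
end

# Stand-in for the tab_name column returned by "show tables in db;"
tables = ["customers", "orders", "interactions"]

# One (sql, outfile) pair per table
jobs = [sample_job("db", tbl) for tbl in tables]

# Print each statement next to the file it would be written to
for job in jobs
    println(job.sql, "  writes to  ", job.outfile)
end
```

&lt;p&gt;Keeping the string construction in its own function makes it easy to eyeball a few generated statements first; each pair could then be handed to whatever query call your driver provides.&lt;/p&gt;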

&lt;h2 id=&quot;example-2-generating-query-code&quot;&gt;Example 2: Generating Query Code&lt;/h2&gt;

&lt;p&gt;In another task, I was asked to join a handful of Hive tables, then transpose the result from “long” to “wide”, so that each id value had a single row instead of several. This is fairly trivial to do using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CASE&lt;/code&gt; statements in SQL; the problem arises when you have thousands of potential row values to transpose into columns! Instead of getting carpal tunnel syndrome typing out thousands of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CASE&lt;/code&gt; statements, I decided to use Julia to generate the SQL code itself:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;#Starting portion of query, the groupby columns&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;groupbycols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;select
interact.interactionid,
interact.agentname,
interact.agentid,
interact.agentgroup,
interact.agentsupervisor,
interact.sitename,
interact.dnis,
interact.agentextension,
interact.interactiondirection,
interact.interactiontype,
interact.customerid,
interact.customercity,
interact.customerstate,
interact.interactiondatetime,
interact.durationinms,&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Generate CASE statements based on the number of possible values of queryid&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; casestatements&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;repetitions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int64&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;repetitions&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;MAX(CASE WHEN q.queryid = &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;queryid then q.score END) as q&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(queryid)&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;score,&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;repetitions&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;MIN(CASE WHEN q.queryid = &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;queryid then q.startoffsetinms END) as q&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(queryid)&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;startoffset,&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;repetitions&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;MAX(CASE WHEN q.queryid = &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;queryid then q.endoffsetinms END) as q&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(queryid)&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;endoffset,&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
	&lt;span class=&quot;c&quot;&gt;#Last clause, so repeat it up to number of repetitions minus 1, then do simple print to get line without comma at end&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;repetitions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;SUM(CASE WHEN q.queryid = &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;queryid and q.score &amp;gt; q.mediumthreshold THEN 1 END) as q&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(queryid)&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;hits,&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;SUM(CASE WHEN q.queryid = &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;repetitions and q.score &amp;gt; q.mediumthreshold THEN 1 END) as q&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(repetitions)&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;hits&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Ending table statement&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tablestatements&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;from db.table1 as interact
left join db.table2 as q on (interact.interactionid = q.interactionid)
left join db.table3 as t on (interact.interactionid = t.interactionid)
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15;&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Submitting all of the statements on one line is usually frowned upon, but this will generate my SQL code&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupbycols&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;);&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;casestatements&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;);&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tablestatements&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interactionid&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;agentname&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;agentid&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;agentgroup&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;agentsupervisor&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sitename&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dnis&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;agentextension&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interactiondirection&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interactiontype&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customerid&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customercity&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customerstate&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interactiondatetime&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;durationinms&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q1score&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q2score&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q3score&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q4score&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q5score&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MIN&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;startoffsetinms&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q1startoffset&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MIN&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;startoffsetinms&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q2startoffset&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MIN&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;startoffsetinms&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q3startoffset&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MIN&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;startoffsetinms&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q4startoffset&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MIN&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;startoffsetinms&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q5startoffset&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;endoffsetinms&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q1endoffset&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;endoffsetinms&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q2endoffset&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;endoffsetinms&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q3endoffset&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;endoffsetinms&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q4endoffset&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;endoffsetinms&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q5endoffset&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SUM&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mediumthreshold&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q1hits&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SUM&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mediumthreshold&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q2hits&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SUM&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mediumthreshold&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q3hits&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SUM&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mediumthreshold&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q4hits&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SUM&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queryid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mediumthreshold&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q5hits&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;db&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;table1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;join&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;db&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;table2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;on&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interactionid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interactionid&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;join&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;db&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;table3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;on&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interact&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interactionid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interactionid&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;group&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;by&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The example here only repeats the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CASE&lt;/code&gt; statements five times, which wouldn’t really be that much typing. However, for my actual application, the number of possible values was 2153, leading to a query result with 8157 columns! Suffice it to say, I’d still be writing that code if I had decided to do it by hand.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Like my ‘glue language’ post, I hope this post has shown that Julia can be used for more than grunting about microbenchmark performance. Whereas I used to use Python for weird string operations like this, I’m finding that the dollar-sign interpolation syntax in Julia feels more comfortable to me than the Python string-formatting mini-language (although that’s not particularly difficult either). So if you’ve been hesitant to jump into learning Julia because you think it’s only useful for Mandelbrot calculations or complex linear algebra, rest assured that Julia is just as at home doing quick general-programming tasks.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Maybe I Don't Really Know R After All</title>
        
          <description>&lt;p&gt;Lately, I’ve been feeling that I’m spreading myself too thin in terms of programming languages. At work, I spend most of my time in &lt;a title=&quot;Hive blog posts&quot; href=&quot;http://randyzwitch.com/tags/#hive&quot; target=&quot;_blank&quot;&gt;Hive&lt;/a&gt;/SQL, with the occasional Python for my smaller data. I really prefer &lt;a href=&quot;http://randyzwitch.com/tags/#julia&quot;&gt;Julia&lt;/a&gt;, but I’m alone at work on that one. And since I maintain a package on CRAN (&lt;a title=&quot;RSiteCatalyst&quot; href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot; target=&quot;_blank&quot;&gt;RSiteCatalyst&lt;/a&gt;), I frequently spend my evenings bug fix programming in R. Then, there’s the desire to learn a Java-based language like Scala (or, Java)…maybe Spark for my Hadoop work…&lt;/p&gt;

</description>
        
        <pubDate>Thu, 26 Jun 2014 11:18:36 +0000</pubDate>
        <link>
        http://randyzwitch.com/r-language-oddities/</link>
        <guid isPermaLink="true">http://randyzwitch.com/r-language-oddities/</guid>
        <content type="html" xml:base="/r-language-oddities/">&lt;p&gt;Lately, I’ve been feeling that I’m spreading myself too thin in terms of programming languages. At work, I spend most of my time in &lt;a title=&quot;Hive blog posts&quot; href=&quot;http://randyzwitch.com/tags/#hive&quot; target=&quot;_blank&quot;&gt;Hive&lt;/a&gt;/SQL, with the occasional Python for my smaller data. I really prefer &lt;a href=&quot;http://randyzwitch.com/tags/#julia&quot;&gt;Julia&lt;/a&gt;, but I’m alone at work on that one. And since I maintain a package on CRAN (&lt;a title=&quot;RSiteCatalyst&quot; href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot; target=&quot;_blank&quot;&gt;RSiteCatalyst&lt;/a&gt;), I frequently spend my evenings bug fix programming in R. Then, there’s the desire to learn a Java-based language like Scala (or, Java)…maybe Spark for my Hadoop work…&lt;/p&gt;

&lt;p&gt;So last night, when I ran into this series of follies with R, it made me wonder whether I really understand how R works.&lt;/p&gt;

&lt;h2 id=&quot;jsonlitefromjson&quot;&gt;jsonlite::fromJSON&lt;/h2&gt;

&lt;p&gt;As part of the overall concept of my RSiteCatalyst package, I’m trying to make it as easy as possible for digital analysts to get their data via the &lt;a title=&quot;Adobe Analytics API&quot; href=&quot;https://marketing.adobe.com/developer/en_US&quot; target=&quot;_blank&quot;&gt;Adobe Analytics API&lt;/a&gt;. As such, I abstract away the need to &lt;a title=&quot;Building JSON in R: Three Methods&quot; href=&quot;http://randyzwitch.com/r-json-jsonlite-sprintf-paste/&quot; target=&quot;_blank&quot;&gt;build JSON&lt;/a&gt; to request reports and to parse the API response from JSON into a data frame. Sometimes it’s easy, but sometimes you get something like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/06/nested_r_dataframe.png&quot; alt=&quot;nested_r_dataframe&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In case it’s not clear what’s going on here, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fromJSON()&lt;/code&gt; from &lt;a title=&quot;jsonlite CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/jsonlite/index.html&quot; target=&quot;_blank&quot;&gt;jsonlite&lt;/a&gt; returns a data frame as best it can, but we have a list (of data frames!) nested inside a column named “breakdown”. There are 12 rows here, but the proper data structure would be to take each data frame inside ‘breakdown’ and append all of the fields from its parent row, repeating those values down the rows. So something like 72 rows (12 original rows, each with a 6-row data frame inside the ‘breakdown’ column).&lt;/p&gt;
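
&lt;p&gt;To make that target shape concrete, here is a minimal sketch built from made-up data rather than a real API response. The object name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_df&lt;/code&gt; and all of the values are assumptions for illustration only; the one thing taken from the screenshot is the idea of a list of data frames stored in a ‘breakdown’ column.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Toy stand-in for the parsed response: 2 outer rows, each holding a 3-row data frame
toy_df &amp;lt;- data.frame(name = c(&quot;Mon&quot;, &quot;Tue&quot;), year = c(2014, 2014),
                     stringsAsFactors = FALSE)
toy_df$breakdown &amp;lt;- list(
  data.frame(name = c(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;), counts = c(1, 2, 3)),
  data.frame(name = c(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;), counts = c(4, 5, 6))
)

nrow(toy_df)                          # number of outer rows
sapply(toy_df$breakdown, nrow)        # rows in each nested data frame
sum(sapply(toy_df$breakdown, nrow))   # rows the flattened data frame should have
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Applied to 12 outer rows that each carry a 6-row breakdown, the same sum gives the 72 rows mentioned above.&lt;/p&gt;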

&lt;h2 id=&quot;loop-and-accumulate&quot;&gt;Loop and Accumulate&lt;/h2&gt;

&lt;p&gt;Because this is such a small data frame, and because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*apply&lt;/code&gt; functions are too frustrating in most cases, to parse this I went with the tried-and-true loop and accumulate. But instead of immediately getting what I wanted, I got this fantastic R warning message:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Loop over df and accumulate results&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;temp&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,],&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;breakdown_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;temp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;There&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;were&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;warnings&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;use&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;warnings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;see&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;them&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;warnings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Warning&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;In&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;check.names&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;were&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;found&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;short&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;variable&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;and&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;have&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;been&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;discarded&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Row names from a short variable? Off to Stack Overflow, the savior of all language hackers, which lets me know I just need to &lt;a title=&quot;R row names short variable discarded&quot; href=&quot;http://stackoverflow.com/questions/23534066/cbind-warnings-row-names-were-found-from-a-short-variable-and-have-been-discar&quot; target=&quot;_blank&quot;&gt;add an argument to my &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cbind()&lt;/code&gt; call&lt;/a&gt;. Trying again:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Loop over df and accumulate results&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Adding row.names = NULL silences the warning&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;temp&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,],&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;breakdown_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row.names&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;temp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
 &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;year&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;month&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;day&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;            &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;hour&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;minute&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;breakdownTotal&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
 &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;trend&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;counts&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  &lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;So I successfully created an (84,10)-sized data frame, but &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cbind()&lt;/code&gt; allowed me to give two columns in the data frame the same name, “name”! Running ‘parsed_df$name’ at the REPL returns the first instance. So now I have to use the fragile method of referring to the second ‘name’ column by position if I want to access it (or rename it using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;names()&lt;/code&gt;, of course). I only noticed this behavior because I tried to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plyr::rename&lt;/code&gt; and it kept changing the names of both columns!&lt;/p&gt;
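
&lt;p&gt;The duplicate-name behavior is easy to reproduce on a toy example. This is only a sketch with invented data: the two small data frames and their values are assumptions, not the Adobe Analytics response. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make.unique()&lt;/code&gt; is a base R function, shown here as one possible way to disambiguate the names; it is not the approach this post ends up using.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Invented one-row and two-row data frames that both have a &quot;name&quot; column
left  &amp;lt;- data.frame(name = &quot;Monday&quot;, year = 2014, stringsAsFactors = FALSE)
right &amp;lt;- data.frame(name = c(&quot;page_a&quot;, &quot;page_b&quot;), counts = c(10, 20),
                    stringsAsFactors = FALSE)

# cbind() recycles the single row of `left` and keeps both &quot;name&quot; columns
combined &amp;lt;- cbind(left, right, row.names = NULL)
names(combined)
combined$name     # `$` returns the first column called &quot;name&quot;
combined[[3]]     # the second &quot;name&quot; column, reached by position

# One base R option: make the names unique before going any further
names(combined) &amp;lt;- make.unique(names(combined))
names(combined)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;After the last step the second column is called ‘name.1’, so both columns can be addressed by name rather than by position.&lt;/p&gt;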

&lt;h2 id=&quot;final-solution&quot;&gt;Final Solution&lt;/h2&gt;

&lt;p&gt;To get past my duplicate name issue, I eventually renamed the ‘name’ column of each breakdown data frame before calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cbind()&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Separate breakdown list and original data frame into different objects&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_raw&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;breakdown_list&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;breakdown&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;breakdown&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Loop over df and accumulate results&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;right_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;breakdown_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;right_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plyr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;right_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_raw&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;elements&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;temp&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,],&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;right_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row.names&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;temp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;counts&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_raw&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In the end, I found a solution to my problem, but it seems like the more I use R, the more oddities I’m able to encounter/generate. At this point, I’m starting to question whether I really understand the underpinnings of how R works. It might be time to stop trying so hard to be a language polyglot and focus on really learning a few of these tools in-depth.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Using Julia As A &quot;Glue&quot; Language</title>
        
          <description>&lt;p&gt;While much of the focus in the Julia community has been on the performance aspects of Julia relative to other scientific computing languages, Julia is also perfectly suited to ‘glue’ together multiple data sources/languages. In this blog post, I will cover how to create an interactive plot using &lt;a title=&quot;Gadfly.jl documentation&quot; href=&quot;http://dcjones.github.io/Gadfly.jl/&quot; target=&quot;_blank&quot;&gt;Gadfly.jl&lt;/a&gt;, by first preparing the data using Hadoop and &lt;a title=&quot;Teradata Aster&quot; href=&quot;http://www.asterdata.com/&quot; target=&quot;_blank&quot;&gt;Teradata Aster&lt;/a&gt; via &lt;a title=&quot;Julia ODBC&quot; href=&quot;https://github.com/quinnj/ODBC.jl&quot; target=&quot;_blank&quot;&gt;ODBC.jl&lt;/a&gt;.&lt;/p&gt;

</description>
        
        <pubDate>Tue, 24 Jun 2014 08:57:31 +0000</pubDate>
        <link>
        http://randyzwitch.com/julia-odbc-jl/</link>
        <guid isPermaLink="true">http://randyzwitch.com/julia-odbc-jl/</guid>
        <content type="html" xml:base="/julia-odbc-jl/">&lt;p&gt;While much of the focus in the Julia community has been on the performance aspects of Julia relative to other scientific computing languages, Julia is also perfectly suited to ‘glue’ together multiple data sources/languages. In this blog post, I will cover how to create an interactive plot using &lt;a title=&quot;Gadfly.jl documentation&quot; href=&quot;http://dcjones.github.io/Gadfly.jl/&quot; target=&quot;_blank&quot;&gt;Gadfly.jl&lt;/a&gt;, by first preparing the data using Hadoop and &lt;a title=&quot;Teradata Aster&quot; href=&quot;http://www.asterdata.com/&quot; target=&quot;_blank&quot;&gt;Teradata Aster&lt;/a&gt; via &lt;a title=&quot;Julia ODBC&quot; href=&quot;https://github.com/quinnj/ODBC.jl&quot; target=&quot;_blank&quot;&gt;ODBC.jl&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The example problem I am going to solve is calculating and visualizing the number of airplanes in the air, by hour, in the U.S. for the year 1987. Because of the structure and storage of the underlying data, I will need to write some custom Hive code, upload the data to Teradata Aster via a command-line utility, re-calculate the number of flights per hour using a built-in Aster function, then use Julia to visualize the data.&lt;/p&gt;

&lt;h2 id=&quot;step-1-getting-data-from-hadoop&quot;&gt;Step 1: Getting Data From Hadoop&lt;/h2&gt;

&lt;p&gt;In a prior set of &lt;a title=&quot;Getting Started Using Hadoop, Part 3: Loading Data&quot; href=&quot;http://randyzwitch.com/uploading-data-hadoop-amazon-ec2-cloudera-part-3/&quot; target=&quot;_blank&quot;&gt;blog posts&lt;/a&gt;, I talked about loading the &lt;a title=&quot;Airline dataset&quot; href=&quot;http://stat-computing.org/dataexpo/2009/&quot; target=&quot;_blank&quot;&gt;airline dataset&lt;/a&gt; into Hadoop, then &lt;a title=&quot;Getting Started With Hadoop, Final: Analysis Using Hive &amp;amp; Pig&quot; href=&quot;http://randyzwitch.com/getting-started-hadoop-hive-pig/&quot; target=&quot;_blank&quot;&gt;analyzing the dataset using Hive or Pig&lt;/a&gt;. Using ODBC.jl, we can use Hive via Julia to submit our queries. The hardest part of setting up this process is making sure that you have the appropriate Hive drivers for your Hadoop cluster and credentials (which isn’t covered here). Once you have your DSN set up, running Hive queries is as easy as the following:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ODBC&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Connect to Hadoop cluster via Hive (pre-defined Windows DSN in ODBC Manager)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;hiveconn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ODBC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Production hiveserver2&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;your-user-name&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pwd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;your-password-here&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Clean data, return results directly to file&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#Data returned will have origin of flight, flight takeoff, flight landing and elapsed time&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;hive_query_string&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&quot;select
origin,
from_unixtime(flight_takeoff_datetime_origin) as flight_takeoff_datetime_origin,
from_unixtime(flight_takeoff_datetime_origin + (actualelapsedtime * 60)) as flight_landing_datetime_origin,
actualelapsedtime
from
(select
origin,
unix_timestamp(CONCAT(year,&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;, month, &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;, dayofmonth, &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;, SUBSTR(LPAD(deptime, 4, 0), 1, 2), &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;, SUBSTR(LPAD(deptime, 4, 0), 3, 4), &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;, &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;))  as flight_takeoff_datetime_origin,
actualelapsedtime
from vw_airline
where year = 1987 and actualelapsedtime &amp;gt; 0) inner_query;&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Run query, save results directly to file&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hive_query_string&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hiveconn&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;C:&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;airline_times.csv&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;delim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;','&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In this code, I’ve written my query as a Julia string, to keep my code easily modifiable. Then, I pass the Julia string object to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;query()&lt;/code&gt; function, along with my ODBC &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;connection&lt;/code&gt; object. This query runs on Hadoop through Hive, then streams the result directly to my local hard drive, making this a very RAM efficient (though I/O inefficient!) operation.&lt;/p&gt;
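
&lt;p&gt;For readers more at home in Python, here is a minimal sketch of the same streaming pattern: fetch rows in batches and write each batch straight to a CSV stream, so the full result never sits in RAM. It is not a description of what ODBC.jl does internally. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stream_query_to_csv&lt;/code&gt; helper is hypothetical, and an in-memory SQLite table with made-up rows stands in for the Hive connection so the example runs anywhere.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import csv
import io
import sqlite3


def stream_query_to_csv(conn, sql, out, batch_size=1000):
    # Write query results to a CSV stream in batches, so that only
    # batch_size rows are held in memory at any one time.
    cursor = conn.execute(sql)
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    total = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
    return total


# Hypothetical stand-in data: an in-memory SQLite table instead of Hive.
conn = sqlite3.connect(':memory:')
conn.execute('create table vw_airline (origin text, actualelapsedtime integer)')
conn.executemany('insert into vw_airline values (?, ?)',
                 [('PHL', 95), ('SFO', 320), ('ORD', 0)])

buffer = io.StringIO()
n_rows = stream_query_to_csv(
    conn,
    'select origin, actualelapsedtime from vw_airline '
    'where actualelapsedtime != 0 order by origin',
    buffer,
    batch_size=2,
)
print(n_rows)
print(buffer.getvalue())&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Swapping the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;io.StringIO&lt;/code&gt; buffer for an open file handle would send the rows to disk instead, which is the trade-off described above: little RAM used, at the cost of extra I/O.&lt;/p&gt;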

&lt;h2 id=&quot;step-2-shelling-out-to-load-data-to-aster&quot;&gt;Step 2: Shelling Out To Load Data To Aster&lt;/h2&gt;

&lt;p&gt;Now that I’ve created the file with my Hadoop results in it, I have a decision point: I can either A) do the rest of the analysis in Julia or B) use a different tool for my calculations. Because this is a toy example, I’m going to do my calculations in Teradata Aster, which provides a convenient function called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;burst()&lt;/code&gt; to regularize timestamps into fixed intervals. But before I can use Aster to ‘burst’ my data, I first need to upload it to the database.&lt;/p&gt;

&lt;p&gt;While I could loop over the data within Julia and insert each record one at a time, Teradata provides a command-line utility to upload data in parallel. Running command-line scripts from within Julia is as easy as using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run()&lt;/code&gt; function, with each command surrounded by backticks:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;#Connect to Aster (pre-defined Windows DSN in ODBC Manager)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;asterconn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ODBC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;aster01&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;your-user-name&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pwd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;your-password&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Create table to hold airline results&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;create_airline_table_statement&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&quot;create table temp.airline
(origin varchar,
flight_takeoff_datetime_origin timestamp,
flight_landing_datetime_origin timestamp,
actualelapsedtime int,
partition key (origin))&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Execute query&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create_airline_table_statement&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;asterconn&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Create airport table&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#Data downloaded from http://openflights.org/data.html&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;create_airport_table_statement&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&quot;create table temp.airport
(airport_id int,
name varchar,
city varchar,
country varchar,
IATAFAA varchar,
ICAO varchar,
latitude float,
longitude float,
altitude int,
timezone float,
dst varchar,
partition key (country))&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Execute query&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create_airport_table_statement&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;asterconn&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Upload data via run() command&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#ncluster_loader utility already on Windows PATH&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`ncluster_loader -h 192.168.1.1 -U your-user-name -w your-password -d aster01 -c --skip-rows=1 --el-enabled --el-table e_dist_error_2 --el-schema temp temp.airline C:\\airline_times.csv`&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`ncluster_loader -h 192.168.1.1 -U your-user-name -w your-password -d aster01 -c --el-enabled --el-table e_dist_error_2 --el-schema temp temp.airport C:\\airports.dat`&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;While I could’ve run this at the command-line, having all of this within an IJulia Notebook keeps all my work together, should I need to re-run this in the future.&lt;/p&gt;
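
&lt;p&gt;For comparison, the same shell-out pattern can be sketched in Python with the standard library’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;subprocess&lt;/code&gt; module. This is only an illustration: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_command&lt;/code&gt; is a hypothetical helper, and the current Python interpreter stands in for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ncluster_loader&lt;/code&gt; so the snippet runs without Aster installed.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import subprocess
import sys


def run_command(args):
    # Run an external command given as a list of arguments (no shell),
    # raising CalledProcessError if it exits with a non-zero status.
    completed = subprocess.run(args, check=True, capture_output=True, text=True)
    return completed.stdout


# Stand-in for a loader call such as ncluster_loader: the current Python
# interpreter is used so the sketch runs on any machine.
loader_args = [sys.executable, '-c', 'print(42)']
output = run_command(loader_args)
print(output.strip())&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Passing the command as a list of arguments sidesteps shell quoting problems, much as Julia’s backtick commands are not passed through a shell.&lt;/p&gt;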

&lt;h2 id=&quot;step-3-using-aster-for-calculations&quot;&gt;Step 3: Using Aster For Calculations&lt;/h2&gt;

&lt;p&gt;With my data now loaded in Aster, I can normalize the timestamps to UTC, then ‘burst’ the data into regular time intervals. Again, all of this can be done via ODBC from within Julia:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;#Normalize timestamps from local time to UTC time&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;aster_view_string&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;
create view temp.vw_airline_times_utc as
select
row_number() over(order by flight_takeoff_datetime_origin) as unique_flight_number,
origin,
flight_takeoff_datetime_origin,
flight_landing_datetime_origin,
flight_takeoff_datetime_origin - (INTERVAL '1 hour' * timezone) as flight_takeoff_datetime_utc,
flight_landing_datetime_origin - (INTERVAL '1 hour' * timezone) as flight_landing_datetime_utc,
timezone
from temp.airline
left join temp.airport on (airline.origin = airport.iatafaa);&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Execute query&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;aster_view_string&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;asterconn&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Teradata Aster SQL-MR functionality, accessed via ODBC query&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;burst_query_string&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&quot;create table temp.airline_burst_hour distribute by hash (origin) as
SELECT
*,
&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;INTERVAL_START&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;::date as calendar_date,
extract(HOUR from &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;INTERVAL_START&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;) as hour_utc
FROM BURST(
     ON (select
        unique_flight_number,
        origin,
        flight_takeoff_datetime_utc,
        flight_landing_datetime_utc
        FROM temp.vw_airline_times_utc
)
     START_COLUMN('flight_takeoff_datetime_utc')
     END_COLUMN('flight_landing_datetime_utc')
     BURST_INTERVAL('3600')
);&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Execute query&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;burst_query_string&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;asterconn&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Since it might not be clear what I’m doing here, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;burst()&lt;/code&gt; function in Aster takes a row of data with a start and end timestamp, and (potentially) returns multiple rows, one for each regular interval that falls between the two timestamps. If you’re familiar with pandas in Python, it’s similar in functionality to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;resample&lt;/code&gt; on a series of timestamps.&lt;/p&gt;
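
&lt;p&gt;To make the idea concrete, here is a minimal Python sketch, assuming hour-long intervals aligned to the clock. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;burst&lt;/code&gt; function below is a hypothetical re-implementation for illustration, not Aster’s code, and it does not try to reproduce Aster’s exact output columns or edge-case behavior. The two flights are made-up data.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import math
from collections import Counter
from datetime import datetime, timedelta


def burst(start, end, interval_seconds=3600):
    # Split the span from start to end into one row per clock-aligned
    # interval, returned as (interval_start, piece_start, piece_end).
    step = timedelta(seconds=interval_seconds)
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    # Round the start down to the beginning of its interval.
    first = midnight + ((start - midnight) // step) * step
    n_intervals = math.ceil((end - first) / step)
    rows = []
    for k in range(n_intervals):
        interval_start = first + k * step
        interval_end = interval_start + step
        rows.append((interval_start,
                     max(start, interval_start),
                     min(end, interval_end)))
    return rows


# Made-up flights: (takeoff, landing), both already in UTC.
flights = [
    (datetime(1987, 10, 1, 8, 40), datetime(1987, 10, 1, 10, 15)),
    (datetime(1987, 10, 1, 9, 5), datetime(1987, 10, 1, 9, 50)),
]

# Count how many flights are in the air during each hour.
flights_per_hour = Counter(
    row[0].hour
    for takeoff, landing in flights
    for row in burst(takeoff, landing)
)
print(sorted(flights_per_hour.items()))&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The first flight touches three clock hours and the second touches one, so hour 9 is counted twice. Summing rows per hour in this way is what the query in Step 4 does against the bursted table.&lt;/p&gt;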

&lt;h2 id=&quot;step-4-download-smaller-data-into-julia-visualize&quot;&gt;Step 4: Download Smaller Data Into Julia, Visualize&lt;/h2&gt;

&lt;p&gt;Now that the data has been processed from Hadoop to Aster through a series of queries, we have a much smaller dataset that can be loaded into RAM and processed by Julia:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;#Calculate the number of flights per hour per day&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;flights_query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;
select
calendar_date,
hour_utc,
sum(1) as num_flights
from temp.airline_burst_hour
group by 1,2
order by 1,2;&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Bring results into Julia DataFrame&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;flights_per_day&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flights_query&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;asterconn&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Gadfly&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Create boxplot, with one box plot per hour&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;set_default_plot_size&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cm&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cm&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flights_per_day&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hour_utc&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;num_flights&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;Guide&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xlabel&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Hour UTC&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;Guide&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ylabel&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Flights In Air&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;Guide&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Number of Flights In Air To/From U.S. By Hour - 1987&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;Scale&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y_continuous&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minvalue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxvalue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4000&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;Geom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;boxplot&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The Gadfly code above produces the following plot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/airline_plot.png&quot; alt=&quot;gadfly&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Since this chart is in UTC, the interpretation of the trend might not be obvious. Because the airline dataset represents flights either leaving or returning to the United States, there are far fewer planes in the air overnight and in the early morning hours (UTC 7-10, or 2-5am Eastern). During the hours when the airports are open, there appears to be a limit of roughly 2500 planes per hour in the sky.&lt;/p&gt;

&lt;h2 id=&quot;why-not-do-all-of-this-in-julia&quot;&gt;Why Not Do All Of This In Julia?&lt;/h2&gt;

&lt;p&gt;At this point, you might wonder why I went through all of this effort. Couldn’t this all be done in Julia?&lt;/p&gt;

&lt;p&gt;Yes, you probably could do all of this work in Julia with a sufficiently large amount of RAM. As a proof-of-concept, I hope I’ve shown that there is much more to Julia than micro-benchmarking Julia’s speed relative to other scientific programming languages. You’ll notice that in none of my code have I used any type annotations, as none would really make sense (nor would they improve performance).  And although this is a toy example purposely using multiple systems, I much more frequently use Julia in this manner at work than doing linear algebra or machine learning.&lt;/p&gt;

&lt;p&gt;So next time you’re tempted to use Python or R or shell scripting or whatever, consider Julia as well. Julia is just as at home as a scripting language as it is as a scientific computing language.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Five Hard-Won Lessons Using Hive</title>
        
          <description>&lt;p&gt;&lt;em&gt;EDIT, 9/8/2016: Hive has come a long way in the two years since I wrote this. While some of the code snippets might still work, it’s likely the case that this information is so out-of-date as to be nothing more than a reflection of working with Hadoop in 2014.&lt;/em&gt;&lt;/p&gt;

</description>
        
        <pubDate>Thu, 12 Jun 2014 13:01:18 +0000</pubDate>
        <link>
        http://randyzwitch.com/hive-five-hard-won-lessons/</link>
        <guid isPermaLink="true">http://randyzwitch.com/hive-five-hard-won-lessons/</guid>
        <content type="html" xml:base="/hive-five-hard-won-lessons/">&lt;p&gt;&lt;em&gt;EDIT, 9/8/2016: Hive has come a long way in the two years since I wrote this. While some of the code snippets might still work, it’s likely the case that this information is so out-of-date as to be nothing more than a reflection of working with Hadoop in 2014.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I’ve been spending a ton of time lately on the data &lt;em&gt;engineering&lt;/em&gt; side of ‘data science’, so I’ve been writing a lot of Hive queries. Hive is a great tool for querying large amounts of data, without having to know very much about the underpinnings of Hadoop. Unfortunately, there are a lot of things about Hive (version 0.12 and before) that aren’t quite the same as SQL and have caused me a bunch of frustration; here they are, in no particular order.&lt;/p&gt;

&lt;h2 id=&quot;1-set-hive-temp-directory-tosame-as-final-output-directory&quot;&gt;1. Set Hive Temp directory To Same As Final Output Directory&lt;/h2&gt;

&lt;p&gt;When doing a “Create Table As” (CTAS) statement in Hive, &lt;a title=&quot;Hive scratch directory&quot; href=&quot;http://doc.mapr.com/display/MapR/Hive#Hive-HiveScratchDirectory&quot; target=&quot;_blank&quot;&gt;Hive allocates temp space for the Map and Reduce portions of the job&lt;/a&gt;. If you’re not lucky, the temp space for the job will be somewhere different than where your table actually ends up being saved, resulting in TWO I/O operations instead of just one. This can lead to a painful delay between when your Hive job says it is finished and when the table becomes available (one time, I saw a 30-hour delay writing 5TB of data).&lt;/p&gt;

&lt;p&gt;If your Hive jobs seem to hang after the Job Tracker says they are complete, try this setting at the beginning of your session:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;span class=&quot;nb&quot;&gt;set &lt;/span&gt;hive.optimize.insert.dest.volume&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;2column-aliasing-in-group-byorder-by&quot;&gt;2. Column Aliasing In Group By/Order By&lt;/h2&gt;

&lt;p&gt;Not sure why this isn’t a default, but if you want to be able to reference your column names by position (i.e. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;group by 1,2&lt;/code&gt;) instead of by name (i.e. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;group by name, age&lt;/code&gt;), then run this at the beginning of your session:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;span class=&quot;nb&quot;&gt;set &lt;/span&gt;hive.groupby.orderby.position.alias&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;3-be-aware-of-predicate-push-down-rules&quot;&gt;3. Be Aware Of Predicate Push-Down Rules&lt;/h2&gt;

&lt;p&gt;In Hive, you can get great performance gains if you A) partition your table by commonly used columns/business concepts (i.e. Day, State, Market, etc.) and B) use the partitions in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause. These are known as &lt;a href=&quot;https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-PartitionBasedQueries&quot;&gt;partition-based queries&lt;/a&gt;. Otherwise, if you don’t use a partition in your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause, you will get a full table scan.&lt;/p&gt;

&lt;p&gt;Unfortunately, when doing an OUTER JOIN, Hive will sometimes ignore the fact that your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause is on a partition and do a full table scan anyway. In order to get Hive to &lt;a href=&quot;https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior#OuterJoinBehavior-PredicatePushdownRules&quot;&gt;push your predicate down&lt;/a&gt; and avoid a full table scan, put your predicate on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JOIN&lt;/code&gt; instead of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;--#### Assume sales Hive table partitioned by day_id ####--&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;--Full Table Scan&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;select&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;employees&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;employees&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;join&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;on&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;employees&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;employee_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;where&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;day_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;between&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2014-03-01'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2014-05-31'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;--Partition-based query&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;select&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;employees&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;employees&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;join&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;on&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;employees&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;employee_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;day_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;between&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2014-03-01'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2014-05-31'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

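&lt;p&gt;If you don’t want to think about the different rules, you can generally put your limiting clauses in your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JOIN&lt;/code&gt; clause instead of your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause. It &lt;em&gt;should&lt;/em&gt; just be a matter of preference (until your query performance indicates it isn’t!). One caveat for outer joins: a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; filter on the right-hand table of a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LEFT JOIN&lt;/code&gt; also drops the unmatched rows, while the same filter in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JOIN&lt;/code&gt; clause keeps them, so check that both forms return the rows you expect.&lt;/p&gt;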

&lt;h2 id=&quot;4-calculate-and-append-percentiles-using-cross-join&quot;&gt;4. Calculate And Append Percentiles Using CROSS JOIN&lt;/h2&gt;

&lt;p&gt;Suppose you want to calculate the top 10% of your customers by sales. If you try to do the following, Hive will complain about needing a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUP BY&lt;/code&gt;, because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;percentile_approx()&lt;/code&gt; is a summary function:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;--Hive expects that you want to calculate your percentiles by account_number and sales&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;--This code will generate an error about a missing GROUP BY statement&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;select&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;account_number&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CASE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHEN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;percentile_approx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sales&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ELSE&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;END&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;top10pct_sales&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;To get around the need for a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUP BY&lt;/code&gt;, we can calculate the percentile in a subquery and attach it with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CROSS JOIN&lt;/code&gt;. A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CROSS JOIN&lt;/code&gt; is another name for a Cartesian Join, meaning all of the rows from the first table will be joined to ALL of the rows of the second table. Because the subquery only returns one row, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CROSS JOIN&lt;/code&gt; provides the desired effect of joining the percentile values back to the original table while keeping the same number of rows from the original table. Generally, you don’t want to do a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CROSS JOIN&lt;/code&gt; (because relational data generally is joined on a key), but this is a good use case.&lt;/p&gt;
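
&lt;p&gt;A minimal sketch of that pattern, assuming a hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;customer_sales&lt;/code&gt; table that holds the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;account_number&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sales&lt;/code&gt; columns used above (the table name and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pct90&lt;/code&gt; alias are illustrative, not from a real schema):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;-- Hypothetical table: customer_sales(account_number, sales)
select
a.account_number,
a.sales,
CASE WHEN a.sales &amp;gt; b.pct90 THEN 1 ELSE 0 END as top10pct_sales
from customer_sales a
cross join (
  -- Subquery returns exactly one row: the approximate 90th percentile of sales
  select percentile_approx(sales, 0.9) as pct90
  from customer_sales
) b;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt; contains a single row, the result should have exactly one row per row of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;customer_sales&lt;/code&gt;.&lt;/p&gt;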

&lt;h2 id=&quot;5-calculating-a-histogram&quot;&gt;5.  Calculating a Histogram&lt;/h2&gt;

&lt;p&gt;Creating a histogram using Hive should be as simple as calling the &lt;a href=&quot;https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-histogram_numeric():Estimatingfrequencydistributions&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;histogram_numeric()&lt;/code&gt;&lt;/a&gt; function. However, the syntax and results of this function are just plain weird. To create a histogram, you can run the following:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;select&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;histogram_numeric&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;salary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;salary_hist&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;from&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sample_08&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;--Results&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;23507&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;68627450983&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;31881&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7647058824&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;340&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;39824&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;11498257844&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;287&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;47615&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;58011049725&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;181&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;55667&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;01219512195&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;164&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;59952&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;499999999985&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;66034&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;67153284674&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;137&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;75642&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;31707317074&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;82&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;82496&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;13636363638&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span 
class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;44&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;91431&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;66666666667&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100665&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;71428571428&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;21&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;107326&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;66666666667&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;121248&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;74999999999&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;142070&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;153896&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6666666667&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;162310&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;169810&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span 
class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;176740&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;193925&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;206770&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The results of this query come back as a list, which is very un-SQL-like! To get the data as a table, we can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LATERAL VIEW&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLODE&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bin_center&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bigint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bin_height&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;select&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;histogram_numeric&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;salary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;salary_hist&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;sample_08&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LATERAL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VIEW&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;explode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;salary_hist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;exploded_table&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;--Results&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;bin_center&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;bin_height&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;23507&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;31881&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;340&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;39824&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;287&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;47615&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;181&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;55667&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;164&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;59952&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;66034&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;137&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;75642&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;82&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;82496&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;44&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;91431&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;100665&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;21&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;107326&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;121248&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;142070&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;153896&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;162310&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;169810&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;17&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;176740&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;18&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;193925&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;19&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;206770&lt;/span&gt;	&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;However, now that we have a table of data, it’s still not clear how to create a histogram, as the &lt;em&gt;center of variable-width bins&lt;/em&gt; is what is returned by Hive. The &lt;a href=&quot;https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-histogram_numeric():Estimatingfrequencydistributions&quot;&gt;Hive documentation for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;histogram_numeric()&lt;/code&gt;&lt;/a&gt; references Gnuplot, Excel, Mathematica and MATLAB, which I can only assume can deal with plotting the centers?  Eventually I’ll figure out how to deal with this using R or Python, but for now, I just use the table as a quick gauge of what the data looks like.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Building JSON in R: Three Methods</title>
        
          <description>&lt;p&gt;When I set out to build &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst&quot;&gt;RSiteCatalyst&lt;/a&gt;, I had a few major goals: learn R, build a &lt;a href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot;&gt;CRAN&lt;/a&gt;-worthy package and learn the &lt;a href=&quot;https://marketing.adobe.com/developer/en_US/documentation&quot;&gt;Adobe Analytics API&lt;/a&gt;. As I reflect back on how the package has evolved over the past two years and what I’ve learned, I think the most valuable lesson was how to deal with JSON (and strings in general).&lt;/p&gt;

</description>
        
        <pubDate>Tue, 13 May 2014 13:27:39 +0000</pubDate>
        <link>
        http://randyzwitch.com/r-json-jsonlite-sprintf-paste/</link>
        <guid isPermaLink="true">http://randyzwitch.com/r-json-jsonlite-sprintf-paste/</guid>
        <content type="html" xml:base="/r-json-jsonlite-sprintf-paste/">&lt;p&gt;When I set out to build &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst&quot;&gt;RSiteCatalyst&lt;/a&gt;, I had a few major goals: learn R, build a &lt;a href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot;&gt;CRAN&lt;/a&gt;-worthy package and learn the &lt;a href=&quot;https://marketing.adobe.com/developer/en_US/documentation&quot;&gt;Adobe Analytics API&lt;/a&gt;. As I reflect back on how the package has evolved over the past two years and what I’ve learned, I think the most valuable lesson was how to deal with JSON (and strings in general).&lt;/p&gt;

&lt;p&gt;JSON is ubiquitous as a data-transfer mechanism over the web, and R does a decent job providing the functionality to not only read JSON but also to create JSON. There are at least three methods I know of to build JSON strings, and this post will cover the pros and cons of each method.&lt;/p&gt;

&lt;h3 id=&quot;method-1-building-json-using-paste&quot;&gt;Method 1: Building JSON using paste&lt;/h3&gt;

&lt;p&gt;As a beginning R user, I wasn’t aware of how many great user-contributed packages are out there. So throughout the RSiteCatalyst source code you can see &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/blob/master/R/QueueOvertime.R#L75-78&quot;&gt;gems&lt;/a&gt; like:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#&quot;metrics&quot; would be user input passed in as a function argument&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;a&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;b&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;c&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Loop over the metrics list, appending proper curly braces&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics_conv&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'{&quot;id&quot;:'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'&quot;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'&quot;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'}'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Collapse the list into a proper comma separated string&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics_final&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics_conv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;collapse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;, &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The code above loops over a character vector (using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lapply&lt;/code&gt; instead of a for loop like a good R user!), wrapping each element in curly braces, then flattening the list down to a string. While this code works, it’s quite a brittle way to build JSON. You end up needing to worry about matching quotation marks, remembering if you need curly braces, brackets or singletons…overall, it’s a maintenance nightmare to build strings this way.&lt;/p&gt;

&lt;p&gt;Of course, if you have a &lt;em&gt;really simple&lt;/em&gt; JSON string you need to build, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;paste()&lt;/code&gt; doesn’t have to be off-limits, but for a majority of the cases I’ve seen, it’s probably not a good idea.&lt;/p&gt;

&lt;h3 id=&quot;method-2-building-json-using-sprintf&quot;&gt;Method 2: Building JSON using sprintf&lt;/h3&gt;

&lt;p&gt;Somewhere in the middle of building version 1 of RSiteCatalyst, I started learning Python. For those of you who aren’t familiar, Python has a &lt;a href=&quot;https://docs.python.org/2/library/stdtypes.html#string-formatting&quot;&gt;string interpolation operator&lt;/a&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;%&lt;/code&gt;, which allows you to do things like the following:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;In&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Here's a string substitution for my name: %s&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Randy&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;Out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Here's a string substitution for my name: Randy&quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Thinking that this was the most useful thing I’d ever seen in programming, I naturally searched to see if R had the same functionality. Of course, I quickly learned that many languages with C roots have &lt;a href=&quot;http://en.wikipedia.org/wiki/Printf_format_string&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;printf/sprintf&lt;/code&gt;&lt;/a&gt;, and R is no exception. So I started &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/blob/master/R/QueueTrended.R#L115-119&quot;&gt;building JSON using sprintf&lt;/a&gt; in the following manner:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;elements_list&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'{&quot;id&quot;:&quot;%s&quot;,
                          &quot;top&quot;: &quot;%s&quot;,
                          &quot;startingWith&quot;:&quot;%s&quot;,
                          &quot;search&quot;:{&quot;type&quot;:&quot;%s&quot;, &quot;keywords&quot;:[%s]}
                          }'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;startingWith&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;searchType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;searchKW2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In this example, we’re now passing R objects into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sprintf()&lt;/code&gt; function, with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;%s&lt;/code&gt; tokens everywhere we need to substitute text. This is certainly an improvement over &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;paste()&lt;/code&gt;, especially given that Adobe provides example JSON via their &lt;a href=&quot;https://marketing.adobe.com/developer/en_US/get-started/api-explorer&quot;&gt;API explorer&lt;/a&gt;. So I copied the example strings, replaced their examples with my tokens and voilà! Better JSON string building.&lt;/p&gt;
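
&lt;p&gt;As a rough sketch of why this helps (the metric names below are made up for illustration and not taken from RSiteCatalyst): &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sprintf()&lt;/code&gt; is vectorized, so the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lapply&lt;/code&gt; loop from Method 1 could collapse into a single call:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;#Hypothetical metrics, standing in for user input
metrics &amp;lt;- c(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;)

#sprintf() recycles the template over the vector: one JSON object per metric
metrics_conv &amp;lt;- sprintf('{&quot;id&quot;:&quot;%s&quot;}', metrics)

#Collapse the character vector into one comma-separated string
metrics_final &amp;lt;- paste(metrics_conv, collapse=&quot;, &quot;)
#metrics_final is now: {&quot;id&quot;:&quot;a&quot;}, {&quot;id&quot;:&quot;b&quot;}, {&quot;id&quot;:&quot;c&quot;}
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Keep in mind that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sprintf()&lt;/code&gt; does no escaping, so a value that itself contains a double quote would still produce invalid JSON.&lt;/p&gt;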

&lt;h3 id=&quot;method-3-building-json-using-a-packagejsonlite-rjson-or-rjsonio&quot;&gt;Method 3: Building JSON using a package (jsonlite, rjson or RJSONIO)&lt;/h3&gt;

&lt;p&gt;While &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sprintf()&lt;/code&gt; made building JSON much easier, there is still a frequent code smell in RSiteCatalyst, as evidenced by the &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/blob/master/R/GetTrafficVars.R#L31-39&quot;&gt;following&lt;/a&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Converts report_suites to JSON&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toJSON&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toJSON&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#API request&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;postRequest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ReportSuite.GetTrafficVars&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'{&quot;rsid_list&quot;:'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'}'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;At some point, I realized that using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toJSON()&lt;/code&gt; function from &lt;a href=&quot;http://cran.r-project.org/web/packages/rjson/index.html&quot;&gt;rjson&lt;/a&gt; would take care of formatting R objects as strings, yet I didn’t make the leap to understanding that I could build the &lt;em&gt;whole string&lt;/em&gt; using R objects translated by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toJSON()&lt;/code&gt;! So I have more hard-to-maintain code where I’m checking the class/length of objects and formatting them. The efficient way to do this using rjson would be:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Efficient method&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rjson&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rsid_list&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;A&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;B&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;C&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;request.body&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toJSON&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#API request&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;postRequest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ReportSuite.GetTrafficVars&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;request.body&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With the code above, we’re building JSON in a very R-looking manner; just R objects and functions, and in return getting the output we want. While it’s slightly less obvious what is being created by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;request.body&lt;/code&gt;, there’s literally zero bracket-matching, quoting issues or anything else to worry about in building our JSON. That’s not to say that there isn’t a learning curve to using a JSON package, but I’d rather figure out whether I need a character vector or list than burn my eyes out looking for mismatched quotes and brackets!&lt;/p&gt;
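
&lt;p&gt;For comparison, here is a minimal sketch of the same idea using jsonlite (assuming the package is installed; the values are placeholders, not real report suites). By default jsonlite encodes even a length-one vector as a JSON array, so the length check from the earlier snippet isn’t needed, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auto_unbox = TRUE&lt;/code&gt; is the switch for when you want scalars instead:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;library(jsonlite)

#A single (placeholder) report suite still becomes a JSON array
single &amp;lt;- toJSON(list(rsid_list = &quot;A&quot;))
#single is: {&quot;rsid_list&quot;:[&quot;A&quot;]}

#auto_unbox = TRUE turns length-one vectors into scalars
scalars &amp;lt;- toJSON(list(id = &quot;pageviews&quot;, top = 10), auto_unbox = TRUE)
#scalars is: {&quot;id&quot;:&quot;pageviews&quot;,&quot;top&quot;:10}
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Whether you want arrays or scalars depends on what the API expects for each field, so it’s worth comparing the output against the example JSON before sending a request.&lt;/p&gt;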

&lt;h3 id=&quot;collaborating-makes-you-a-better-programmer&quot;&gt;Collaborating Makes You A Better Programmer&lt;/h3&gt;

&lt;p&gt;Like any pursuit, you can get pretty far on your own through hard work and self-study. However, I wouldn’t be nearly where I am without collaborating with others (especially learning about how to build JSON properly in R!). A majority of the RSiteCatalyst code for the upcoming version 1.4 was re-written by &lt;a href=&quot;https://github.com/WillemPaling&quot; title=&quot;Willem Paling GitHub&quot; target=&quot;_blank&quot;&gt;Willem Paling&lt;/a&gt;, where he added consistency to keyword arguments, switched to &lt;a href=&quot;http://cran.r-project.org/web/packages/jsonlite/index.html&quot; title=&quot;jsonlite CRAN&quot; target=&quot;_blank&quot;&gt;jsonlite&lt;/a&gt; for better JSON parsing to Data Frames, and most importantly for the topic of this post, cleaned up the method of building all the required JSON strings!&lt;/p&gt;

&lt;p&gt;Edit 5/13: For a more thorough example of building complex JSON using jsonlite, check out &lt;a href=&quot;https://github.com/randyzwitch/RSiteCatalyst/blob/version_1_4/R/QueueRanked.R#L67-114&quot; title=&quot;Complex R jsonlite example&quot; target=&quot;_blank&quot;&gt;this example&lt;/a&gt; from the v1.4 branch of RSiteCatalyst. The linked example R code populates the required arguments from this &lt;a href=&quot;https://gist.github.com/randyzwitch/762343d5e8d8501af522&quot; title=&quot;Example JSON call from Adobe API Explorer&quot; target=&quot;_blank&quot;&gt;JSON outline&lt;/a&gt; provided by Adobe.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Using SQL Workbench with Apache Hive</title>
        
          <description>&lt;p&gt;If you’ve spent any non-trivial amount of time working with Hadoop and Hive at the command line, you’ve likely wished that you could interact with Hadoop like you would any other database. If you’re lucky, your Hadoop administrator has already installed the &lt;a href=&quot;http://gethue.com/&quot;&gt;Hue&lt;/a&gt; front-end to your cluster, which allows for &lt;a href=&quot;http://randyzwitch.com/uploading-data-hadoop-amazon-ec2-cloudera-part-3/&quot;&gt;interacting with Hadoop via an easy-to-use browser interface&lt;/a&gt;. However, if you don’t have Hue, Hive also supports access via JDBC; the downside is, setup is not as easy as including a single JDBC driver.&lt;/p&gt;

</description>
        
        <pubDate>Fri, 25 Apr 2014 14:05:57 +0000</pubDate>
        <link>
        http://randyzwitch.com/sql-workbench-apache-hadoop-hive/</link>
        <guid isPermaLink="true">http://randyzwitch.com/sql-workbench-apache-hadoop-hive/</guid>
        <content type="html" xml:base="/sql-workbench-apache-hadoop-hive/">&lt;p&gt;If you’ve spent any non-trivial amount of time working with Hadoop and Hive at the command line, you’ve likely wished that you could interact with Hadoop like you would any other database. If you’re lucky, your Hadoop administrator has already installed the &lt;a href=&quot;http://gethue.com/&quot;&gt;Hue&lt;/a&gt; front-end to your cluster, which allows for &lt;a href=&quot;http://randyzwitch.com/uploading-data-hadoop-amazon-ec2-cloudera-part-3/&quot;&gt;interacting with Hadoop via an easy-to-use browser interface&lt;/a&gt;. However, if you don’t have Hue, Hive also supports access via JDBC; the downside is, setup is not as easy as including a single JDBC driver.&lt;/p&gt;

&lt;p&gt;While there are paid database administration tools such as &lt;a href=&quot;http://www.aquafold.com/dbspecific/apache_hive_client.html&quot;&gt;Aqua Data Studio&lt;/a&gt; that support Hive, I’m an open source kind of guy, so this tutorial will show you how to use &lt;a href=&quot;http://www.sql-workbench.net/&quot;&gt;SQL Workbench&lt;/a&gt; to access Hive via JDBC. This tutorial assumes that you are proficient enough to get SQL Workbench installed on whatever computing platform you are using (Windows, OSX, or Linux).&lt;/p&gt;

&lt;h3 id=&quot;download-hadoop-jars&quot;&gt;Download Hadoop jars&lt;/h3&gt;

&lt;p&gt;The hardest part of using Hive via JDBC is getting all of the required jars. At work I am using a &lt;a href=&quot;http://www.mapr.com/&quot;&gt;MapR distribution of Hadoop&lt;/a&gt;, and each Hadoop vendor platform provides drivers for their version of Hadoop. For MapR, all of the required Java .jar files are located at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/opt/mapr/hive/hive-0.1X/lib&lt;/code&gt; (where X represents the Hive version number you are using).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/04/mapr-hive-jars.png&quot; alt=&quot;mapr-hive-jars&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
    Download all the .jar files in one shot, just in case you need them in the future
 &lt;/p&gt;

&lt;p&gt;Since it’s not always clear which .jar files are required (especially for other projects/setups you might be doing), I just downloaded the entire set of files and placed them in a directory called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hadoop_jars&lt;/code&gt;. If you’re not using MapR, you’ll need to find and download your vendor-specific version of the following .jar files:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;hive-exec.jar&lt;/li&gt;
  &lt;li&gt;hive-jdbc.jar&lt;/li&gt;
  &lt;li&gt;hive-metastore.jar&lt;/li&gt;
  &lt;li&gt;hive-service.jar&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, you will need the following general Hadoop jars (Note: for clarity/long-term applicability of this blog post, I have removed the version number from all of the jars):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;hive-cli.jar&lt;/li&gt;
  &lt;li&gt;libfb303.jar&lt;/li&gt;
  &lt;li&gt;slf4j-api.jar&lt;/li&gt;
  &lt;li&gt;commons-logging.jar&lt;/li&gt;
  &lt;li&gt;hadoop-common.jar&lt;/li&gt;
  &lt;li&gt;httpcore.jar&lt;/li&gt;
  &lt;li&gt;httpclient.jar&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whew. Once you have the Hive JDBC driver and the 10 other .jar files, we can begin the installation process.&lt;/p&gt;

&lt;h3 id=&quot;setting-up-hive-jdbc-driver&quot;&gt;Setting up Hive JDBC driver&lt;/h3&gt;

&lt;p&gt;Setting up the JDBC driver is simply a matter of providing SQL Workbench with the location of all 11 of the required .jar files. After clicking &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;File -&amp;gt; Manage Drivers&lt;/code&gt;, you’ll want to click on the white page icon to create a New Driver. Use the Folder icon to add the .jars:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/04/sqlworkbench-hive-driver-setup.png&quot; alt=&quot;sqlworkbench-hive-driver-setup&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Classname&lt;/code&gt; box, if you are using a relatively new version of Hive, you’ll be using HiveServer2. In that case, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Classname&lt;/code&gt; for the Hive driver is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;org.apache.hive.jdbc.HiveDriver&lt;/code&gt; (this should pop up on-screen; you just need to select the value). You are not required to put any value for the Sample URL. Hit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OK&lt;/code&gt; and the driver window will close.&lt;/p&gt;

&lt;h3 id=&quot;connection-window&quot;&gt;Connection Window&lt;/h3&gt;

&lt;p&gt;With the Hive driver defined, all that’s left is to define the connection string. Assuming your Hadoop administrator didn’t change the default &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;port&lt;/code&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10000&lt;/code&gt;, your connection string should look as follows:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2014/04/sqlworkbench-hive-connectionstring.png&quot; alt=&quot;sqlworkbench-hive-connectionstring&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As stated above, I’m assuming you are using HiveServer2; if so, your connection string will be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jdbc:hive2://your-hadoop-cluster-location:10000&lt;/code&gt;. After that, type in your Username and Password and you should be all set.&lt;/p&gt;

&lt;h3 id=&quot;using-hive-with-sql-workbench&quot;&gt;Using Hive with SQL Workbench&lt;/h3&gt;

&lt;p&gt;Assuming you have achieved success with the instructions above, you’re now ready to use Hive like any other database. You will be able to submit your Hive code via the Query Window, view your schemas/tables (via the ‘Database Explorer’ functionality which opens in a separate tab) and generally use Hive like any other relational database.&lt;/p&gt;

&lt;p&gt;Of course, it’s good to remember that Hive isn’t actually a relational database! From my experience, using Hive via SQL Workbench works pretty well, but the underlying processing is still in Hadoop. So you’re not going to get the clean cancelling of queries like you would with an RDBMS, there can be a significant lag in getting answers back (due to the Hive overhead), and you can blow up your computer streaming back results larger than available RAM…but it beats working at the command line.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Real-time Reporting with the Adobe Analytics API</title>
        
          <description>&lt;p&gt;Starting with &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-version-1-3-release-notes/&quot;&gt;version 1.3.1 of RSiteCatalyst&lt;/a&gt;, you can now access the &lt;a href=&quot;https://developer.omniture.com/en_US/documentation/sitecatalyst-reporting/c-real-time#concept_AD1D9EC2BC9C4897B9DE3C99D0066B8E&quot;&gt;real-time reporting capabilities of the Adobe Analytics API&lt;/a&gt; through a familiar R interface. Here’s how to get started…&lt;/p&gt;

</description>
        
        <pubDate>Mon, 10 Mar 2014 11:28:21 +0000</pubDate>
        <link>
        http://randyzwitch.com/real-time-reporting-adobe-analytics-api/</link>
        <guid isPermaLink="true">http://randyzwitch.com/real-time-reporting-adobe-analytics-api/</guid>
        <content type="html" xml:base="/real-time-reporting-adobe-analytics-api/">&lt;p&gt;Starting with &lt;a href=&quot;http://randyzwitch.com/rsitecatalyst-version-1-3-release-notes/&quot;&gt;version 1.3.1 of RSiteCatalyst&lt;/a&gt;, you can now access the &lt;a href=&quot;https://developer.omniture.com/en_US/documentation/sitecatalyst-reporting/c-real-time#concept_AD1D9EC2BC9C4897B9DE3C99D0066B8E&quot;&gt;real-time reporting capabilities of the Adobe Analytics API&lt;/a&gt; through a familiar R interface. Here’s how to get started…&lt;/p&gt;

&lt;h2 id=&quot;getrealtimeconfiguration&quot;&gt;GetRealTimeConfiguration&lt;/h2&gt;

&lt;p&gt;Before using the real-time reporting capabilities of Adobe Analytics, you first need to indicate which metrics and elements you are interested in seeing in real-time. To see which reports are already set up for real-time access on a given report suite, you can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetRealTimeConfiguration()&lt;/code&gt; function:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Get Real-Time reports that are already set up&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;realtime_reports&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetRealTimeConfiguration&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;reportsuite&amp;gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;It’s likely the case that the first time you set this up, you’ll already see a real-time report for ‘Instances-Page-Site Section-Referring Domain’. You can leave this report in place, or switch the parameters using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SaveRealTimeConfiguration()&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;saverealtimeconfiguration&quot;&gt;SaveRealTimeConfiguration&lt;/h2&gt;

&lt;p&gt;If you want to add/modify which real-time reports are available in a report suite, you can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SaveRealTimeConfiguration()&lt;/code&gt; function:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;SaveRealTimeConfiguration&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;reportsuite&amp;gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metric1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;instances&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;elements1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;page&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;referringdomain&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;sitesection&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metric2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;revenue&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;elements2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;referringdomain&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;sitesection&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metric3&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;orders&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;elements3&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;products&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Up to three real-time reports can be stored at any given time. Note that you can mix and match which reports you want to modify; you don’t have to submit all three reports at once. Finally, keep in mind that it can take up to 15 minutes for the API to incorporate your real-time report changes, so if you don’t get your data right away, don’t keep re-submitting the function call!&lt;/p&gt;
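
&lt;p&gt;As a minimal sketch of the mix-and-match point, the call below submits only the first report slot, reusing the argument names from the example above. It assumes that slots left out of the call keep their existing configuration; that is an assumption rather than documented behavior, so confirm the result with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetRealTimeConfiguration()&lt;/code&gt; once the change has propagated:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;#Sketch: modify only the first real-time report slot
#Assumes you have already authenticated with the API in this session
library(RSiteCatalyst)

SaveRealTimeConfiguration(&quot;&amp;lt;report suite&amp;gt;&quot;,
  metric1 = &quot;instances&quot;,
  elements1 = c(&quot;page&quot;, &quot;referringdomain&quot;, &quot;sitesection&quot;)
)

#Changes can take up to 15 minutes to appear, so check back later
#rather than re-submitting the call
realtime_reports &amp;lt;- GetRealTimeConfiguration(&quot;&amp;lt;report suite&amp;gt;&quot;)&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;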

&lt;h2 id=&quot;getrealtimereport&quot;&gt;GetRealTimeReport&lt;/h2&gt;

&lt;p&gt;Once you have your real-time reports set up in the API, you can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetRealTimeReport()&lt;/code&gt; function in order to access your reports. There are numerous parameters for customization; selected examples are below.&lt;/p&gt;

&lt;h3 id=&quot;minimum-example---overtime-report&quot;&gt;Minimum Example - Overtime Report&lt;/h3&gt;

&lt;p&gt;The simplest function call for a real-time report is to create an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Overtime&lt;/code&gt; report (monitoring a metric over a specific time period):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;rt&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetRealTimeReport&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;report suite&amp;gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;instances&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The result of this call will be a DataFrame having 15 rows of one minute granularity for your metric. This is a great way to monitor real-time orders &amp;amp; revenue during a flash sale, see how users are accessing a landing page for an email marketing campaign or any other metric where you want up-to-the-minute status updates.&lt;/p&gt;

&lt;h3 id=&quot;granularity-offset-periods&quot;&gt;Granularity, Offset, Periods&lt;/h3&gt;

&lt;p&gt;If you want to have a time period other than the last 15 minutes, or one minute granularity is too volatile for the metric you are monitoring, you can add additional arguments to modify the returned DataFrame:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;rt2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetRealTimeReport&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;reportsuite&amp;gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;instances&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;periodMinutes&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;5&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;periodCount&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;12&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;periodOffset&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;10&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;For this function call, we will receive instances for the last hour (12 periods) of five minute granularity, with a 10 minute offset (meaning, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;now - 10 minutes ago&lt;/code&gt; is the first time period reported).&lt;/p&gt;
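
&lt;p&gt;The arithmetic behind these three arguments can be sketched in plain R, with no API call involved. The variable names are illustrative only, and the sketch assumes the offset shifts the whole reporting window back from the current time:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;#Illustrative values matching the call above (not RSiteCatalyst objects)
period_minutes &amp;lt;- 5    #granularity of each row
period_count   &amp;lt;- 12   #number of rows returned
period_offset  &amp;lt;- 10   #minutes to shift back from now

#Total time covered: 5 minutes x 12 periods = 60 minutes
window_minutes &amp;lt;- period_minutes * period_count

#Assumed window boundaries; subtracting a number from Sys.time() subtracts seconds
window_end   &amp;lt;- Sys.time() - period_offset * 60
window_start &amp;lt;- window_end - window_minutes * 60

window_minutes
window_start
window_end&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;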

&lt;h3 id=&quot;single-elements&quot;&gt;Single Elements&lt;/h3&gt;

&lt;p&gt;Beyond just monitoring a metric over time, you can specify an element such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;page&lt;/code&gt; to receive your metrics by:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;GetRealTimeReport&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;reportsuite&amp;gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;instances&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;page&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;periodMinutes&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;9&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;periodCount&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;3&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This function call will return Instances by Page, for the last 27 minutes (3 rows/periods per page, 9 minute granularity…just because!). Additionally, there are other arguments such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;algorithm&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;algorithmArgument&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;firstRankPeriod&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;floorSensitivity&lt;/code&gt; that allow for creating reports similar to what is provided in the Real-Time tab in the Adobe Analytics interface.&lt;/p&gt;

&lt;p&gt;Currently, even though the Adobe Analytics API supports real-time reports with three breakdowns, only one element breakdown is supported by RSiteCatalyst; it is planned to extend these functions in RSiteCatalyst to fully support the real-time capabilities in the near future.&lt;/p&gt;

&lt;h2 id=&quot;from-dataframe-to-something-shiny&quot;&gt;From DataFrame to Something ‘Shiny’&lt;/h2&gt;

&lt;p&gt;If we’re talking real-time reports, we’re probably talking about dashboarding. If we’re talking about R and dashboarding, then naturally, &lt;a title=&quot;R ggvis&quot; href=&quot;http://ggvis.rstudio.com/&quot; target=&quot;_blank&quot;&gt;ggvis&lt;/a&gt;/&lt;a title=&quot;Shiny Web Applications&quot; href=&quot;http://www.rstudio.com/shiny/&quot; target=&quot;_blank&quot;&gt;Shiny&lt;/a&gt; comes to mind. While providing a full ggvis/Shiny example is beyond the scope of this blog post, it’s my hope to provide a working example in a future blog post. Stay tuned!&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.3 Release Notes</title>
        
          <description>&lt;p&gt;Version 1.3 of the RSiteCatalyst package to access the Adobe Analytics API is now available on CRAN! Changes include:&lt;/p&gt;

</description>
        
        <pubDate>Tue, 04 Feb 2014 09:44:19 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-3-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-3-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-3-release-notes/">&lt;p&gt;Version 1.3 of the RSiteCatalyst package to access the Adobe Analytics API is now available on CRAN! Changes include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Search via regex functionality in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueRanked/QueueTrended&lt;/code&gt; functions&lt;/li&gt;
  &lt;li&gt;Support for Realtime API reports: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Overtime&lt;/code&gt; and one-element &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Ranked&lt;/code&gt; report&lt;/li&gt;
  &lt;li&gt;Allow for variable API request timing in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Queue*&lt;/code&gt; functions&lt;/li&gt;
  &lt;li&gt;Fixed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;validate&lt;/code&gt; flag in JSON request to work correctly&lt;/li&gt;
  &lt;li&gt;Deprecated &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetAdminConsoleLog&lt;/code&gt; (appears to be removed from the API)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;searching-via-regex-functionality&quot;&gt;Searching via Regex functionality&lt;/h3&gt;

&lt;p&gt;RSiteCatalyst now supports the search functionality of the API, similar in nature to using the Advanced Filter/Search feature within Reports &amp;amp; Analytics. Here are some examples for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueRanked&lt;/code&gt; function:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Top 100 Pages where the pagename starts with &quot;Categories&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Uses searchKW argument&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queue_ranked_pages_search&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueRanked&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;reportsuite&amp;gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2013-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-01-28&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                         &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;page&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                         &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;100&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                         &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;searchKW&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;^Categories&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  
                                          &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Top 100 Pages where the pagename starts with &quot;Categories&quot; OR contains &quot;Home Page&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Uses searchKW and searchType arguments&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queue_ranked_pages_search_or&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueRanked&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;reportsuite&amp;gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                            &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2013-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                            &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-01-28&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                            &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                            &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;page&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                            &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;100&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                            &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;searchKW&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;^Categories&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Home Page&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                            &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;searchType&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;OR&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                            &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;QueueTrended function calls work in a similar manner, returning elements broken down by time rather than a single record per element name.&lt;/p&gt;

&lt;h3 id=&quot;realtime-reporting-api&quot;&gt;Realtime Reporting API&lt;/h3&gt;

&lt;p&gt;Accessing the &lt;a href=&quot;https://developer.omniture.com/en_US/documentation/sitecatalyst-reporting/c-real-time#concept_AD1D9EC2BC9C4897B9DE3C99D0066B8E&quot;&gt;Adobe Analytics Realtime API&lt;/a&gt; now has limited support in RSiteCatalyst. Note that this is different than just using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;currentData&lt;/code&gt; parameter within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Queue*&lt;/code&gt; functions, as the realtime API methods provide data within a minute of that data being generated on-site. Currently, RSiteCatalyst only supports the most common types of reports: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Overtime&lt;/code&gt; (no eVar or prop breakdown) and one-element breakdown.&lt;/p&gt;

&lt;p&gt;Because of the extensive new functionality for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetRealTimeConfiguration()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SaveRealTimeConfiguration()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetRealTimeReport()&lt;/code&gt; functions, code examples will be provided as a separate blog post.&lt;/p&gt;

&lt;h3 id=&quot;variable-request-timing-for-queue-function-calls&quot;&gt;Variable request timing for Queue function calls&lt;/h3&gt;

&lt;p&gt;This feature is to fix the issue of having an API request run so long that RSiteCatalyst gave up on retrieving an answer. Usually, API requests come back in a few seconds, but in selected cases a call could run so long as to exhaust the number of attempts (previously, 10 minutes). You can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxTries&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;waitTime&lt;/code&gt; arguments to specify how many times you’d like RSiteCatalyst to retrieve the report and the wait time between calls:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Change timing of function call&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Wait 30 seconds between attempts to retrieve the report, try 5 times&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;queue_overtime_visits_pv_day_social_anomaly2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueOvertime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;reportsuite&amp;gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                                              &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2013-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                                              &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2014-01-28&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                                              &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                                              &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;day&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                                              &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Visit_Social&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                                              &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;anomalyDetection&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                                              &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;currentData&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                                              &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;maxTries&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                                              &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;waitTime&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If you don’t specify either of these arguments, RSiteCatalyst will default to trying every five seconds to retrieve the report, up to 120 tries.&lt;/p&gt;
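
&lt;p&gt;To see how these two arguments combine, here is a small sketch in plain R. The helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_wait_minutes()&lt;/code&gt; is hypothetical and not part of RSiteCatalyst; it only multiplies the number of tries by the wait between them, ignoring the time each request itself takes:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;#Hypothetical helper: approximate upper bound on time spent waiting for a report
max_wait_minutes &amp;lt;- function(max_tries, wait_time) {
  #wait_time is in seconds, so divide by 60 to get minutes
  max_tries * wait_time / 60
}

#Defaults described above: 120 tries, 5 seconds apart = 10 minutes
max_wait_minutes(max_tries = 120, wait_time = 5)

#Settings from the example call: 5 tries, 30 seconds apart = 2.5 minutes
max_wait_minutes(max_tries = 5, wait_time = 30)&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;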

&lt;h3 id=&quot;new-contributor-willem-paling&quot;&gt;New Contributor: Willem Paling&lt;/h3&gt;

&lt;p&gt;I’m pleased to announce that I’ve got a new contributor for RSiteCatalyst, &lt;a title=&quot;WillemPaling on Twitter&quot; href=&quot;https://twitter.com/WillemPaling&quot; target=&quot;_blank&quot;&gt;Willem Paling&lt;/a&gt;! Willem did a near-complete re-write of the underlying code to access the API, and rather than have multiple packages out in the wild, we’ve decided to merge our works. So look forward to better-written R code and more complete access to the Adobe Analytics APIs in future releases…&lt;/p&gt;

&lt;h3 id=&quot;support&quot;&gt;Support&lt;/h3&gt;

&lt;p&gt;If you run into any problems with RSiteCatalyst, please &lt;a title=&quot;RSiteCatalyst GitHub issues&quot; href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues&quot; target=&quot;_blank&quot;&gt;file an issue on GitHub&lt;/a&gt; so it can be tracked properly. Note that I’m not an Adobe employee, so I can only provide so much support, as in most cases I can’t validate your settings to ensure you are set up correctly (nor do I have any inside information about how the system works :) )&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Edit 2/20/2014: I mistakenly forgot to add the new real-time functions to the R NAMESPACE file, and as such, you won’t be able to use them if you are using version 1.3. Upgrade to 1.3.1 to access the real-time functionality.&lt;/em&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Getting Started With Hadoop, Final: Analysis Using Hive &amp; Pig</title>
        
          <description>&lt;p&gt;We’ve finally made it to the final post in this tutorial! In my prior posts about getting started with &lt;a title=&quot;Hadoop posts&quot; href=&quot;http://randyzwitch.com/tags/#hadoop&quot; target=&quot;_blank&quot;&gt;Hadoop&lt;/a&gt;, we’ve covered the entire lifecycle from how to set up a small cluster using Amazon EC2 and Cloudera through how to load data using Hue. With our data loaded in HDFS, we can finally move on to the actual analysis portion of the &lt;a title=&quot;Airline dataset&quot; href=&quot;http://stat-computing.org/dataexpo/2009/the-data.html&quot; target=&quot;_blank&quot;&gt;airline dataset&lt;/a&gt; using Hive and Pig.&lt;/p&gt;

</description>
        
        <pubDate>Sun, 12 Jan 2014 20:25:32 +0000</pubDate>
        <link>
        http://randyzwitch.com/getting-started-hadoop-hive-pig/</link>
        <guid isPermaLink="true">http://randyzwitch.com/getting-started-hadoop-hive-pig/</guid>
        <content type="html" xml:base="/getting-started-hadoop-hive-pig/">&lt;p&gt;We’ve finally made it to the final post in this tutorial! In my prior posts about getting started with &lt;a title=&quot;Hadoop posts&quot; href=&quot;http://randyzwitch.com/tags/#hadoop&quot; target=&quot;_blank&quot;&gt;Hadoop&lt;/a&gt;, we’ve covered the entire lifecycle from how to set up a small cluster using Amazon EC2 and Cloudera through how to load data using Hue. With our data loaded in HDFS, we can finally move on to the actual analysis portion of the &lt;a title=&quot;Airline dataset&quot; href=&quot;http://stat-computing.org/dataexpo/2009/the-data.html&quot; target=&quot;_blank&quot;&gt;airline dataset&lt;/a&gt; using Hive and Pig.&lt;/p&gt;

&lt;h2 id=&quot;basic-descriptive-statistics-using-hive&quot;&gt;Basic Descriptive Statistics Using Hive&lt;/h2&gt;

&lt;p&gt;In &lt;a title=&quot;Getting Started with Hadoop Part 4&quot; href=&quot;http://randyzwitch.com/hadoop-creating-tables-hive/&quot; target=&quot;_blank&quot;&gt;part 4&lt;/a&gt; of this tutorial, we used a Hive script to create a view named “vw_airline” to hold all of our airline data. Running a simple query is as easy as running the following in the Hive window in Hue. Note that this is ANSI-standard SQL code, even though we are submitting it using Hive:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/11/simple-hive-query.png&quot; alt=&quot;simple-hive-query&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A simple query like this is a great way to get a feel for the table, including determining whether or not the files were loaded correctly. Once the results are displayed, you can create simple visualizations like bar charts, line plots, scatterplots and pie charts. The results of this query are shown below. Knowing this dataset, I can tell that the files were loaded incorrectly; the dips at years 1994 and 2004 reflect too few records, so those files will need to be reloaded.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/11/hive-visualization-results.png&quot; alt=&quot;hive-visualization-results&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The row counts for 1994 and 2004 are too low, which I validated using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wc -l 1994.csv&lt;/code&gt; at the command line (outside of Hadoop).&lt;/p&gt;

&lt;p&gt;Besides simple counts, Hive supports nearly all standard SQL syntax: aggregate functions such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SUM&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COUNT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MIN&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MAX&lt;/code&gt;, etc., table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;joins&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user-defined functions (UDF)&lt;/code&gt;, window functions…pretty much everything that you are used to from other SQL tools. AFAIK, the only thing that Hive doesn’t support is nested sub-queries, but that’s part of the &lt;a title=&quot;Hortonworks Stinger Initiative&quot; href=&quot;http://hortonworks.com/labs/stinger/&quot; target=&quot;_blank&quot;&gt;Stinger initiative for improving Hive&lt;/a&gt;. However, depending on the nested subquery being performed, you might be able to accomplish the same thing using a &lt;a title=&quot;Hive LEFT SEMI JOIN&quot; href=&quot;https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins#LanguageManualJoins-Examples&quot; target=&quot;_blank&quot;&gt;LEFT SEMI JOIN&lt;/a&gt;.&lt;/p&gt;
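
&lt;p&gt;To make the semi-join idea concrete, here is a minimal Python sketch of what a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LEFT SEMI JOIN&lt;/code&gt; computes: keep each left-hand row that has at least one match on the right, and return only the left-hand columns. The sample rows and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;left_semi_join&lt;/code&gt; helper are hypothetical and only illustrate the semantics; this is not Hive code.&lt;/p&gt;

```python
def left_semi_join(left_rows, right_rows, key):
    """Return left rows whose key value appears at least once in right_rows."""
    # Build the set of keys present on the right side once
    right_keys = {row[key] for row in right_rows}
    # Each left row is emitted at most once, with left-hand columns only
    return [row for row in left_rows if row[key] in right_keys]


# Hypothetical sample data, not taken from the airline dataset
flights = [
    {"origin": "PHL", "dest": "ORD"},
    {"origin": "JFK", "dest": "LAX"},
    {"origin": "PHL", "dest": "ATL"},
]
# The duplicate entry is deliberate: it must not duplicate the output rows
hubs = [{"origin": "PHL"}, {"origin": "PHL"}]

result = left_semi_join(flights, hubs, "origin")
print(result)
```

&lt;p&gt;This is the same question an uncorrelated &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IN&lt;/code&gt; subquery answers, which is why the semi-join can stand in for that kind of nested query.&lt;/p&gt;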

&lt;h2 id=&quot;using-pig-for-analytics&quot;&gt;Using Pig for Analytics&lt;/h2&gt;

&lt;p&gt;It’s important to realize that Hadoop isn’t just another RDBMS where you run SQL. Using Pig, you can &lt;a title=&quot;Pig syntax basics&quot; href=&quot;http://pig.apache.org/docs/r0.12.0/start.html#data-work-with&quot; target=&quot;_blank&quot;&gt;write scripts for calculation&lt;/a&gt; in a similar manner to using other high-level languages such as Python or R.&lt;/p&gt;

&lt;p&gt;For example, suppose we wanted to calculate the average distance for each route. A Pig script to calculate this might look like the following:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-pig&quot; data-lang=&quot;pig&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;--Load data from view to use
air = LOAD 'default.vw_airline' USING org.apache.hcatalog.pig.HCatLoader();

--Use FOREACH to limit data to origin, dest, distance
--Concatenate origin and destination together, separated by a pipe
--CONCAT appears to only allow two arguments, which is why the function is called twice (to allow 3 arguments)
origindest = FOREACH air generate CONCAT(origin, CONCAT('|' , dest)) as route, distance;

--Group origindest dataset by route
groupedroutes = GROUP origindest BY (route);

--Calculate average distance by route
avg_distance = FOREACH groupedroutes GENERATE group, AVG(origindest.distance);

--Show results in Pig shell
dump avg_distance;

--Write out results to text file, separated by tab (default)
store avg_distance into '/user/hue/avg_distance';
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;While it is possible to calculate average distance using Hive and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUP BY&lt;/code&gt; statement, one of the benefits to using Pig is having control over every step of the data flow. So while Hive queries tend to answer a single question at a time, Pig allows an analyst to chain together any number of steps in a data flow. In the example above, we could pass the average distance for each route to another transformation, join it back to the original dataset or do anything else our analyst minds can imagine!&lt;/p&gt;
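
&lt;p&gt;For readers more comfortable in Python, the same data flow can be sketched step by step in plain Python. This is only an illustration of the chaining idea, assuming a handful of made-up rows in place of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vw_airline&lt;/code&gt; view; it runs on one machine and is not a substitute for Pig on a cluster. The final join-back step is an example of passing the result on to another transformation.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the vw_airline view
air = [
    {"origin": "PHL", "dest": "ORD", "distance": 678},
    {"origin": "PHL", "dest": "ORD", "distance": 680},
    {"origin": "JFK", "dest": "LAX", "distance": 2475},
]

# FOREACH ... GENERATE: keep a pipe-delimited route and the distance
origindest = [(row["origin"] + "|" + row["dest"], row["distance"]) for row in air]

# GROUP ... BY route
groupedroutes = defaultdict(list)
for route, distance in origindest:
    groupedroutes[route].append(distance)

# AVG of distance within each group
avg_distance = {
    route: sum(distances) / len(distances)
    for route, distances in groupedroutes.items()
}

# One more chained step: join the average back onto each original row
air_with_avg = [
    dict(row, avg_distance=avg_distance[row["origin"] + "|" + row["dest"]])
    for row in air
]

print(avg_distance)
```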

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Over these five blog posts, I’ve outlined how to get started with Hadoop and ‘Big Data’ using Amazon and Cloudera/Hortonworks. Hopefully I’ve been able to demystify the &lt;a title=&quot;Hadoop concepts&quot; href=&quot;http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/&quot; target=&quot;_blank&quot;&gt;concepts and terminology behind Hadoop&lt;/a&gt;, shown that &lt;a title=&quot;Hadoop on Amazon EC2 using Cloudera&quot; href=&quot;http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-2/&quot; target=&quot;_blank&quot;&gt;setting up a Hadoop cluster using Cloudera on Amazon EC2&lt;/a&gt; isn’t insurmountable, and shown that &lt;a title=&quot;Loading data into Hadoop HDFS&quot; href=&quot;http://randyzwitch.com/uploading-data-hadoop-amazon-ec2-cloudera-part-3/&quot; target=&quot;_blank&quot;&gt;loading data&lt;/a&gt; and analyzing it using Hive and Pig isn’t dramatically different from using SQL on other database systems you may have encountered in the past.&lt;/p&gt;

&lt;p&gt;While there’s a lot of hype around ‘Big Data’, data sizes aren’t going to be getting any smaller in the future. So spend the $20 in AWS charges and build a Hadoop cluster! There’s no better way to learn than by doing…&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Quickly Create Dummy Variables in a Data Frame</title>
        
          <description>&lt;p&gt;On Quora, a question was asked about how to fix the error of the &lt;a title=&quot;Error in Random Forest 32 levels categorical variable&quot; href=&quot;https://www.quora.com/Random-Forests/How-can-I-fix-the-error-in-the-package-randomForest&quot; target=&quot;_blank&quot;&gt;randomForest package in R not being able to handle more than 32 levels in a categorical variable&lt;/a&gt;. Seeing as how I’ve seen this question asked on Kaggle forums, StackOverflow and elsewhere, here’s the answer: code your own &lt;em&gt;dummy variables&lt;/em&gt; instead of relying on Factors!&lt;/p&gt;

</description>
        
        <pubDate>Thu, 02 Jan 2014 13:58:51 +0000</pubDate>
        <link>
        http://randyzwitch.com/creating-dummy-variables-data-frame-r/</link>
        <guid isPermaLink="true">http://randyzwitch.com/creating-dummy-variables-data-frame-r/</guid>
        <content type="html" xml:base="/creating-dummy-variables-data-frame-r/">&lt;p&gt;On Quora, a question was asked about how to fix the error of the &lt;a title=&quot;Error in Random Forest 32 levels categorical variable&quot; href=&quot;https://www.quora.com/Random-Forests/How-can-I-fix-the-error-in-the-package-randomForest&quot; target=&quot;_blank&quot;&gt;randomForest package in R not being able to handle more than 32 levels in a categorical variable&lt;/a&gt;. Seeing as how I’ve seen this question asked on Kaggle forums, StackOverflow and elsewhere, here’s the answer: code your own &lt;em&gt;dummy variables&lt;/em&gt; instead of relying on Factors!&lt;/p&gt;

&lt;h2 id=&quot;code-snippet&quot;&gt;Code snippet&lt;/h2&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Generate example dataframe with character column&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;example&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as.data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;A&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;A&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;B&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;F&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;C&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;G&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;C&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;D&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;E&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;F&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;example&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;strcol&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#For every unique value in the string column, create a new 1/0 column&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#This is what Factors do &quot;under-the-hood&quot; automatically when passed to function requiring numeric data&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;level&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unique&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;example&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strcol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;example&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;dummy&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;level&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;_&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ifelse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;example&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strcol&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;level&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As the code above shows, it’s trivial to generate your own 1/0 columns of data instead of relying on Factors. There are two things to keep in mind when creating your own dummy variables:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The problem you are trying to solve&lt;/li&gt;
  &lt;li&gt;How much RAM you have available&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While it may make sense to generate dummy variables for Customer State (~50 for the United States), if you were to use the code above on City Name, you’d likely either run out of RAM or find out that there are too many levels to be useful. Of course, with any qualitative statement such as “too many levels to be useful”, oftentimes the only way to definitively know is to try it! Just make sure you save your work before running this code, just in case you run out of RAM. Or, use someone else’s computer for testing 😉&lt;/p&gt;
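
&lt;p&gt;To get a feel for how quickly the number of columns grows, here is a rough Python sketch of the same 1/0 encoding, using the same example values as the R snippet. The variable names are my own, and the cell count is only a back-of-the-envelope proxy for memory use, assuming one stored value per row per level.&lt;/p&gt;

```python
# Same example values as the R snippet above
strcol = ["A", "A", "B", "F", "C", "G", "C", "D", "E", "F"]

# Unique levels, in order of first appearance
levels = list(dict.fromkeys(strcol))

# One 1/0 column per level, named dummy_LEVEL
dummies = {
    "dummy_" + level: [1 if value == level else 0 for value in strcol]
    for level in levels
}

# Dense storage grows as rows x levels, which is why a high-cardinality
# column such as City Name gets expensive quickly
cell_count = len(strcol) * len(levels)

print(levels)
print(cell_count)
```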

&lt;p&gt;Edit 1/2/14: John Myles White brought up a good point via Twitter about RAM usage:&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-conversation=&quot;none&quot; data-cards=&quot;hidden&quot; data-partner=&quot;tweetdeck&quot;&gt;
  &lt;p&gt;
    &lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;@randyzwitch&lt;/a&gt; If you're running out of RAM with dummy variables, you probably want to use a sparse matrix instead of a data.frame.
  &lt;/p&gt;

  &lt;p&gt;
    — John Myles White (@johnmyleswhite) &lt;a href=&quot;https://twitter.com/johnmyleswhite/statuses/418821463563829248&quot;&gt;January 2, 2014&lt;/a&gt;
  &lt;/p&gt;
&lt;/blockquote&gt;</content>
      </item>
      
    
      
      <item>
        <title>Adobe Analytics Implementation Documentation in 60 Seconds</title>
        
          <description>&lt;p&gt;When I was working as a digital analytics consultant, no question quite had the ability to cause belly laughs AND angst as, “Can you send me an updated copy of your implementation documentation?” I saw companies that were spending six-or-seven-figures annually on their analytics infrastructure, multi-millions in salary for employees and yet the only way to understand what data they were collecting was to inspect their JavaScript code.&lt;/p&gt;

</description>
        
        <pubDate>Mon, 09 Dec 2013 09:46:30 +0000</pubDate>
        <link>
        http://randyzwitch.com/adobe-analytics-implementation-documentation/</link>
        <guid isPermaLink="true">http://randyzwitch.com/adobe-analytics-implementation-documentation/</guid>
        <content type="html" xml:base="/adobe-analytics-implementation-documentation/">&lt;p&gt;When I was working as a digital analytics consultant, no question quite had the ability to cause belly laughs AND angst as, “Can you send me an updated copy of your implementation documentation?” I saw companies that were spending six-or-seven-figures annually on their analytics infrastructure, multi-millions in salary for employees and yet the only way to understand what data they were collecting was to inspect their JavaScript code.&lt;/p&gt;

&lt;p&gt;Luckily for Adobe Analytics customers, the API provides a means of generating the framework for a properly-documented implementation. Here’s how to do it using &lt;a title=&quot;RSiteCatalyst CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot; target=&quot;_blank&quot;&gt;RSiteCatalyst&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;generating-adobe-analytics-documentation-file&quot;&gt;Generating Adobe Analytics documentation file&lt;/h2&gt;

&lt;p&gt;The code below outlines the commands needed to generate an Excel file (&lt;a title=&quot;Example Excel file Adobe Analytics Documentation&quot; href=&quot;http://randyzwitch.com/wp-content/uploads/2013/12/adobe_analytics_implementation_doc.xlsx&quot; target=&quot;_blank&quot;&gt;see example&lt;/a&gt;) with six tabs containing the basic structure of an Adobe Analytics implementation. This report contains all of the report suites you have access to, the elements that reports can be broken down by, traffic variables (props), conversion variables (eVars), success events and segments available for reporting.&lt;/p&gt;

&lt;p&gt;Additionally, within each tab metadata is provided that contains the various settings for variables, so you’ll be able to document the expiration settings for eVars, participation, list variables, segment types and so on. &lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;RSiteCatalyst&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;WriteXLS&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Validate that underlying Perl modules for WriteXLS are installed correctly&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Will return &quot;Perl found. All required Perl modules were found&quot; if installed correctly&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;testPerl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#### 1. Pull data for all report suites to create one comprehensive report ####&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Authenticate with Adobe Analytics API&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SCAuth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;user:company&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;sharedsecret&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Get Report Suites&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetReportSuites&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Get Available Elements&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;elements&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetElements&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rsid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Get eVars&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;evars&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetEvars&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rsid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Get Segments&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;segments&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetSegments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rsid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Get Success Events&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;events&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetSuccessEvents&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rsid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Get Traffic Vars&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;props&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetProps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suites&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rsid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#### 2. Generate a single Excel file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Create list of report suite objects, written as strings&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;objlist&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;report_suites&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;elements&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;evars&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;segments&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;events&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;props&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Write out Excel file with auto-width columns, a bolded header row and filters turned on&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;WriteXLS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;objlist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/Users/randyzwitch/Desktop/adobe_analytics_implementation_doc.xlsx&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
         &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AdjWidth&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BoldHeaderRow&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AutoFilter&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The only “gotchas” to keep in mind when using the script above are that the user running it will only receive data for report suites they have access to (which is determined by the Admin panel settings within Adobe Analytics) and that you need to have the &lt;a title=&quot;WriteXLS&quot; href=&quot;http://cran.r-project.org/web/packages/WriteXLS/index.html&quot; target=&quot;_blank&quot;&gt;WriteXLS&lt;/a&gt; package installed to write to Excel. The WriteXLS package relies on Perl under the hood, so you’ll need to validate that Perl and its required modules are installed correctly, which is done using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;testPerl()&lt;/code&gt; function in the package.&lt;/p&gt;

&lt;h2 id=&quot;this-is-pretty-bare-bones-no&quot;&gt;This is pretty bare-bones, no?&lt;/h2&gt;

&lt;p&gt;After you run this code, you’ll have an Excel file that has all of the underlying characteristics of your Adobe Analytics implementation. It’s important to realize that this is only the &lt;em&gt;starting&lt;/em&gt; point; a great set of documentation will contain other pieces of information such as where/when the value is set (on entry, every page, when certain events occur, etc.), a layman’s explanation of what the data element means and other &lt;em&gt;business information&lt;/em&gt; so your stakeholders can be confident they are using the data correctly. Additionally, you might consider creating a single Excel file for every report suite in your implementation. It’s trivial to modify the code above to subset each data frame for a single value of rsid, then write to separate Excel files. Regardless of how you structure your documentation, DOCUMENT YOUR IMPLEMENTATION! The employees that come after you (and your future self!) will thank you.&lt;/p&gt;
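
&lt;p&gt;The per-report-suite idea boils down to “group the rows by rsid, then write one file per group”. Here is a minimal Python sketch of that pattern, assuming made-up rows and column names (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;evar&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;name&lt;/code&gt;) and writing CSV files to a scratch directory rather than Excel. It does not call the Adobe Analytics API, and the real columns returned by RSiteCatalyst will differ.&lt;/p&gt;

```python
import csv
import os
import tempfile
from collections import defaultdict

# Hypothetical rows; real ones would come from the API results
evars = [
    {"rsid": "suite_a", "evar": "evar1", "name": "Campaign"},
    {"rsid": "suite_b", "evar": "evar1", "name": "Search Term"},
    {"rsid": "suite_a", "evar": "evar2", "name": "Login Status"},
]

# Subset the rows by report suite ID
rows_by_rsid = defaultdict(list)
for row in evars:
    rows_by_rsid[row["rsid"]].append(row)

# Write one file per report suite into a scratch directory
out_dir = tempfile.mkdtemp()
written = []
for rsid, rows in sorted(rows_by_rsid.items()):
    path = os.path.join(out_dir, rsid + "_evars.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["rsid", "evar", "name"])
        writer.writeheader()
        writer.writerows(rows)
    written.append(path)

print(written)
```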

&lt;p&gt;&lt;em&gt;EDIT 2/4/2016: Thanks to reader &lt;a href=&quot;https://twitter.com/CSitty&quot; target=&quot;_blank&quot;&gt;@CSitty&lt;/a&gt; for pointing out the R code became a little stale. The documentation generating code should now work again for RSiteCatalyst versions &amp;gt;= 1.4 and WriteXLS &amp;gt;= 4.0 (basically, any current version as of the time of this update).&lt;/em&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Using Amazon EC2 with IPython Notebook</title>
        
          <description>&lt;p&gt;Last week, I wrote a guest blog post at &lt;a title=&quot;Guest post at Bad Hessian&quot; href=&quot;http://badhessian.org/2013/11/cluster-computing-for-027hr-using-amazon-ec2-and-ipython-notebook/&quot; target=&quot;_blank&quot;&gt;Bad Hessian&lt;/a&gt; about how to use IPython Notebook along with Amazon EC2 as your data science &amp;amp; analytics platform. I won’t reproduce the whole article here, but if you are interested in step-by-step instructions on how to set up an Amazon EC2 instance to use IPython Notebook, see the SlideShare presentation below which outlines the steps needed to set up a remote IPython Notebook environment (or, &lt;a title=&quot;amazon-ec2-ipython-installation PDF&quot; href=&quot;http://randyzwitch.com/wp-content/uploads/2013/11/cluster-computing-ipython-ec2.pdf&quot; target=&quot;_blank&quot;&gt;PDF download&lt;/a&gt;).&lt;/p&gt;

</description>
        
        <pubDate>Thu, 21 Nov 2013 20:13:11 +0000</pubDate>
        <link>
        http://randyzwitch.com/ipython-notebook-amazon-ec2/</link>
        <guid isPermaLink="true">http://randyzwitch.com/ipython-notebook-amazon-ec2/</guid>
        <content type="html" xml:base="/ipython-notebook-amazon-ec2/">&lt;p&gt;Last week, I wrote a guest blog post at &lt;a title=&quot;Guest post at Bad Hessian&quot; href=&quot;http://badhessian.org/2013/11/cluster-computing-for-027hr-using-amazon-ec2-and-ipython-notebook/&quot; target=&quot;_blank&quot;&gt;Bad Hessian&lt;/a&gt; about how to use IPython Notebook along with Amazon EC2 as your data science &amp;amp; analytics platform. I won’t reproduce the whole article here, but if you are interested in step-by-step instructions on how to set up an Amazon EC2 instance to use IPython Notebook, see the SlideShare presentation below which outlines the steps needed to set up a remote IPython Notebook environment (or, &lt;a title=&quot;amazon-ec2-ipython-installation PDF&quot; href=&quot;http://randyzwitch.com/wp-content/uploads/2013/11/cluster-computing-ipython-ec2.pdf&quot; target=&quot;_blank&quot;&gt;PDF download&lt;/a&gt;).&lt;/p&gt;

&lt;iframe src=&quot;http://www.slideshare.net/slideshow/embed_code/28501345&quot; width=&quot;476&quot; height=&quot;400&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;If you already have experience setting up EC2 images and just need the IPython Notebook settings, here are the commands that are needed to set up your IPython public notebook server.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#### Start IPython, generate SHA1 password to use for IPython Notebook server
&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ipython&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Python&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;2.7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Anaconda&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x86_64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;default&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Oct&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2013&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;07&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Type&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;copyright&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;credits&quot;&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;or&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;license&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;more&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;information&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;IPython&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;--&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;An&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;enhanced&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Interactive&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Python&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;?&lt;/span&gt;         &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Introduction&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;overview&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;of&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IPython&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'s features.
%quickref -&amp;gt; Quick reference.
help      -&amp;gt; Python'&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;own&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;help&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;system&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;object&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;?&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Details&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;about&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'object'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'object??'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;extra&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;details&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;In&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;IPython.lib&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;passwd&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;In&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;passwd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Enter&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Verify&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'sha1:207eb1f4671f:92af695...'&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#### Create nbserver profile
&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ipython&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;profile&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;create&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nbserver&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ProfileCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Generating&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;config&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'/.ipython/profile_nbserver/ipython_config.py'&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ProfileCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Generating&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;config&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'/.ipython/profile_nbserver/ipython_qtconsole_config.py'&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ProfileCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Generating&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;config&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'/.ipython/profile_nbserver/ipython_notebook_config.py'&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ProfileCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Generating&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;config&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'/.ipython/profile_nbserver/ipython_nbconvert_config.py'&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#### Create self-signed SSL certificate
&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;openssl&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;req&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x509&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nodes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;days&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;365&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;newkey&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rsa&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2048&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;keyout&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mycert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pem&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mycert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pem&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#### Modify ipython_notebook_config.py configuration file
#### Add these lines to the top of the file; no other changes necessary
#### Obviously, you'll want to add your path to the .pem key and your password
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Configuration file for ipython-notebook.
&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Kernel config
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IPKernelApp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pylab&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'inline'&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# if you want plotting support always
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Notebook config
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NotebookApp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;certfile&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'/home/ubuntu/certificates/mycert.pem'&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NotebookApp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ip&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'*'&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NotebookApp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;open_browser&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NotebookApp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;password&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'sha1:207eb1f4671f:92af695...'&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# It is a good idea to put it on a known, fixed port
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NotebookApp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;port&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8888&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#### Start IPython Notebook on the remote server
&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ipython&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;notebook&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;--&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;profile&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nbserver&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
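
&lt;p&gt;For anyone curious about the string that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;passwd()&lt;/code&gt; returns: as far as I can tell, it has the form algorithm:salt:digest, where the digest is a hash of the password followed by the salt. The sketch below rebuilds that format using only the Python 3 standard library. The helper names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make_hash&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;check_hash&lt;/code&gt; are hypothetical, and this illustrates the format under that assumption rather than reproducing IPython’s own implementation.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import hashlib
import secrets


def make_hash(password, algorithm='sha1', salt_len=12):
    # Hypothetical helper: build an 'algorithm:salt:digest' string.
    # Assumption: the digest covers the password followed by the salt.
    salt = secrets.token_hex(salt_len // 2)  # salt_len hex characters
    digest = hashlib.new(algorithm, (password + salt).encode('utf-8')).hexdigest()
    return ':'.join((algorithm, salt, digest))


def check_hash(hashed, password):
    # Hypothetical helper: recompute the digest with the stored salt and compare.
    algorithm, salt, digest = hashed.split(':', 2)
    candidate = hashlib.new(algorithm, (password + salt).encode('utf-8')).hexdigest()
    return secrets.compare_digest(candidate, digest)


hashed = make_hash('correct horse')
print(hashed.split(':')[0])                 # sha1
print(check_hash(hashed, 'correct horse'))  # True
print(check_hash(hashed, 'wrong'))          # False
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The practical point is that only the salted hash goes into the configuration file, never the plain-text password.&lt;/p&gt;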

&lt;p&gt;Happy IPython Notebooking!&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Adding Line Numbers in IPython/Jupyter Notebooks</title>
        
          <description>&lt;p&gt;Lately, I’ve been using Jupyter Notebooks for all of my Python and Julia coding. The ability to develop and submit small snippets of code and create plots inline is just so useful that it has broken the stranglehold of using an IDE while I’m coding. However, the one thing that was missing for a smooth transition was line numbers in the cells; luckily, this can be achieved in two ways.&lt;/p&gt;

</description>
        
        <pubDate>Tue, 19 Nov 2013 08:48:18 +0000</pubDate>
        <link>
        http://randyzwitch.com/line-numbers-ipython-notebook/</link>
        <guid isPermaLink="true">http://randyzwitch.com/line-numbers-ipython-notebook/</guid>
        <content type="html" xml:base="/line-numbers-ipython-notebook/">&lt;p&gt;Lately, I’ve been using Jupyter Notebooks for all of my Python and Julia coding. The ability to develop and submit small snippets of code and create plots inline is just so useful that it has broken the stranglehold of using an IDE while I’m coding. However, the one thing that was missing for a smooth transition was line numbers in the cells; luckily, this can be achieved in two ways.&lt;/p&gt;

&lt;h2 id=&quot;keyboard-shortcut&quot;&gt;Keyboard Shortcut&lt;/h2&gt;

&lt;p&gt;The easiest way to add line numbers to a Jupyter Notebook is to use the keyboard shortcut: press &lt;strong&gt;Ctrl-m&lt;/strong&gt; to enter Command Mode, then type &lt;strong&gt;L&lt;/strong&gt;. Select the cell you want line numbers in, then use the shortcut to toggle them on or off.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/11/ipython-notebook-line-numbers.png&quot; alt=&quot;ipython-notebook-line-numbers&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;add-line-numbers-to-all-cells-at-startup&quot;&gt;Add Line Numbers to All Cells at Startup&lt;/h2&gt;

&lt;p&gt;&lt;del&gt;While the keyboard shortcut is great for toggling line numbers on/off, I prefer having line numbers always on. Luckily, the IPython Dev folks on Twitter were kind enough to explain how to do this:&lt;/del&gt;&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; lang=&quot;en&quot;&gt;
  &lt;p style=&quot;text-align: center;&quot;&gt;
    &lt;del&gt;&lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;@randyzwitch&lt;/a&gt; add `IPython.Cell.options_default.cm_config.lineNumbers = true;` to your custom.js&lt;/del&gt;
  &lt;/p&gt;

  &lt;p style=&quot;text-align: center;&quot;&gt;
    &lt;del&gt;— IPython Dev (@IPythonDev) &lt;a href=&quot;https://twitter.com/IPythonDev/statuses/394906726828236800&quot;&gt;October 28, 2013&lt;/a&gt;&lt;/del&gt;
  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;del&gt;I use OSX with the default ‘profile_default’ profile, so the path for my custom.js file for IPython is:&lt;/del&gt;&lt;/p&gt;

&lt;pre&gt;&lt;del&gt;/Users/randyzwitch/.ipython/profile_default/static/custom/&lt;/del&gt;&lt;/pre&gt;

&lt;p&gt;&lt;del&gt;Similarly, you can do the same for IJulia:&lt;/del&gt;&lt;/p&gt;

&lt;pre&gt;&lt;del&gt;/Users/randyzwitch/.ipython/profile_julia/static/custom&lt;/del&gt;&lt;/pre&gt;

&lt;p&gt;&lt;del&gt;If you are using a different operating system than OSX, or you are using OSX and you don’t see a custom.js file in these locations, a quick search for custom.js will get you to the right file location. Once you open up the custom.js file, you can place the line of JavaScript anywhere in the file, as long as it’s not inside any of the pre-existing functions in the file.&lt;/del&gt;&lt;/p&gt;

&lt;p&gt;&lt;del&gt;Once you place the line of JavaScript in your file, you’ll need to restart IPython/IJulia completely for the change to take effect. After that, you’ll have line numbers in each cell, each Notebook!&lt;/del&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Edit 11/4/2015: Thanks to reader Nat Dunn, I’ve been made aware that the above method no longer works, which isn’t a surprise given the number of changes in the move from IPython Notebook to the Jupyter project over the past 2 years.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For the (currently) correct method of &lt;a href=&quot;https://www.webucator.com/blog/2015/11/show-line-numbers-by-default-in-ipython-notebook/&quot; target=&quot;_blank&quot;&gt;adding line numbers to Jupyter Notebook by default&lt;/a&gt;, please see &lt;a href=&quot;https://www.webucator.com/blog/2015/11/show-line-numbers-by-default-in-ipython-notebook/&quot; target=&quot;_blank&quot;&gt;Nat’s post&lt;/a&gt; with the correct instructions on modifying the custom.js file.&lt;/em&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.2 Release Notes</title>
        
          <description>&lt;p&gt;Version 1.2 of the &lt;a title=&quot;RSiteCatalyst CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot; target=&quot;_blank&quot;&gt;RSiteCatalyst&lt;/a&gt; package to access the Adobe Analytics API is now available on CRAN! Changes include:&lt;/p&gt;

</description>
        
        <pubDate>Tue, 05 Nov 2013 08:29:16 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-2-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-2-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-2-release-notes/">&lt;p&gt;Version 1.2 of the &lt;a title=&quot;RSiteCatalyst CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot; target=&quot;_blank&quot;&gt;RSiteCatalyst&lt;/a&gt; package to access the Adobe Analytics API is now available on CRAN! Changes include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Removed RCurl package dependency&lt;/li&gt;
  &lt;li&gt;Changed argument order for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetAdminConsoleLog&lt;/code&gt; to avoid error when date not passed&lt;/li&gt;
  &lt;li&gt;Return proper numeric type for metric columns&lt;/li&gt;
  &lt;li&gt;Fixed bug in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetEVars&lt;/code&gt; function&lt;/li&gt;
  &lt;li&gt;Added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;validate:true&lt;/code&gt; flag to API to improve error reporting&lt;/li&gt;
  &lt;li&gt;Removed remaining references to Omniture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most users, the only noticeable change will be that you no longer need to call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;as.numeric()&lt;/code&gt; on the metric columns of a data frame returned by an API call, as all functions now return the proper numeric type.&lt;/p&gt;

&lt;h3 id=&quot;changes-from-development-version&quot;&gt;Changes from Development Version&lt;/h3&gt;

&lt;p&gt;If you installed the 1.2 development version directly from GitHub, the only difference between it and the stable CRAN version of the package is that support for the Adobe Analytics Real Time API has been removed. This functionality will continue to be developed on the &lt;a title=&quot;RSiteCatalyst version 1.3&quot; href=&quot;https://github.com/randyzwitch/RSiteCatalyst/tree/version_1_3&quot; target=&quot;_blank&quot;&gt;1.3 development branch&lt;/a&gt; on GitHub.&lt;/p&gt;

&lt;h3 id=&quot;testing&quot;&gt;Testing&lt;/h3&gt;

&lt;p&gt;For this release, I’ve made a more concerted effort to test RSiteCatalyst on various platforms outside of OSX (where I do my development). RSiteCatalyst works in the following environments:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;OSX Lion and prior&lt;/li&gt;
  &lt;li&gt;Ubuntu 12.04 LTS&lt;/li&gt;
  &lt;li&gt;Windows 7 64-bit SP1&lt;/li&gt;
  &lt;li&gt;Windows 8.1 64-bit&lt;/li&gt;
  &lt;li&gt;R 2.15.2 and newer&lt;/li&gt;
  &lt;li&gt;R and RStudio&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your environment is not listed above, RSiteCatalyst will still likely work, as there is no operating-system-specific code in the package. If you run into issues, verify that you have all package dependencies installed, that your Adobe account has Web Service Access privileges (set in the Admin panel), that you have permission to access the report suites you are requesting (also an Admin panel setting), and that your company doesn’t have any firewall settings that would prevent API access.&lt;/p&gt;

&lt;h3 id=&quot;support&quot;&gt;Support&lt;/h3&gt;

&lt;p&gt;If you run into any problems with RSiteCatalyst, please &lt;a title=&quot;RSiteCatalyst GitHub issues&quot; href=&quot;https://github.com/randyzwitch/RSiteCatalyst/issues&quot; target=&quot;_blank&quot;&gt;file an issue on GitHub&lt;/a&gt; so it can be tracked properly. Note that I’m not an Adobe employee, so I can only provide so much support, as in most cases I can’t validate your settings to ensure you are set up correctly (nor do I have any inside information about how the system works 🙂 )&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Clustering Search Keywords Using K-Means Clustering</title>
        
          <description>&lt;p&gt;One of the key tenets of doing impactful digital analysis is understanding what your visitors are trying to accomplish. One of the easiest ways to do this is to analyze the words your visitors use to arrive on the site (search keywords) and the words they use while on the site (on-site search).&lt;/p&gt;

</description>
        
        <pubDate>Tue, 17 Sep 2013 14:41:01 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-k-means-clustering/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-k-means-clustering/</guid>
        <content type="html" xml:base="/rsitecatalyst-k-means-clustering/">&lt;p&gt;One of the key tenets of doing impactful digital analysis is understanding what your visitors are trying to accomplish. One of the easiest ways to do this is to analyze the words your visitors use to arrive on the site (search keywords) and the words they use while on the site (on-site search).&lt;/p&gt;

&lt;p&gt;Although Google has made it much more difficult to analyze search keywords over the past several years (due to their passing of &lt;a title=&quot;(not provided): Using R and the Google Analytics API&quot; href=&quot;http://randyzwitch.com/r-google-analytics-api/&quot; target=&quot;_blank&quot;&gt;“(not provided)”&lt;/a&gt; instead of the actual keywords), we can create customer intent segments based on the keywords that are still being passed using unsupervised clustering methods such as k-means clustering.&lt;/p&gt;

&lt;h2 id=&quot;concept-k-means-clusteringunsupervised-learning&quot;&gt;Concept: K-Means Clustering/Unsupervised Learning&lt;/h2&gt;

&lt;p&gt;&lt;a title=&quot;k-means clustering&quot; href=&quot;http://en.wikipedia.org/wiki/K-means_clustering&quot; target=&quot;_blank&quot;&gt;K-means clustering&lt;/a&gt; is one of many techniques within &lt;a title=&quot;Unsupervised learning Wikipedia&quot; href=&quot;http://en.wikipedia.org/wiki/Unsupervised_learning&quot; target=&quot;_blank&quot;&gt;unsupervised learning&lt;/a&gt; that can be used for text analysis. &lt;em&gt;Unsupervised&lt;/em&gt; refers to the fact that we’re trying to understand the structure of our underlying data, rather than trying to optimize for a specific, pre-labeled criterion (such as creating a predictive model for conversion). Unsupervised learning is a great technique for exploratory analysis because the analyst imposes few assumptions on the data, so previously unexamined relationships can be discovered &lt;em&gt;then&lt;/em&gt; analyzed. Contrast that with the analyst specifying groups up front (such as &lt;em&gt;visitors from mobile&lt;/em&gt; or &lt;em&gt;visitors from social&lt;/em&gt;) and then evaluating how various metrics differ across those pre-defined groups.&lt;/p&gt;

&lt;p&gt;Without getting too technical, k-means clustering is a method of partitioning data into ‘k’ subsets, where each data element is assigned to the closest cluster based on the distance of the data element from the center of the cluster. In order to use k-means clustering with text data, we need to do some text-to-numeric transformation of our text data. Luckily, R provides several packages to simplify the process.&lt;/p&gt;
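
&lt;p&gt;To make the assign-to-the-closest-center idea concrete, here is a toy sketch in plain Python (not the R code used in the rest of this post). The points are made up, the function names are hypothetical, and the starting centers are passed in by hand so the result is reproducible; real implementations usually pick starting centers randomly.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def squared_distance(a, b):
    # Squared Euclidean distance between two points of equal dimension.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centers, iterations=10):
    # Toy k-means loop: assign every point to its nearest center, then
    # move each center to the mean of the points assigned to it.
    centers = list(initial_centers)
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centers)),
                      key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centers


# Made-up 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = kmeans(points, initial_centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Here ‘k’ is simply the number of starting centers supplied, and each label is the index of the center a point ended up closest to.&lt;/p&gt;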

&lt;h2 id=&quot;converting-text-to-numeric-data-document-term-matrix&quot;&gt;Converting Text to Numeric Data: Document-Term Matrix&lt;/h2&gt;

&lt;p&gt;Since I use Adobe Analytics on this blog, I’m going to use the &lt;a title=&quot;RSiteCatalyst&quot; href=&quot;http://randyzwitch.com/rsitecatalyst&quot; target=&quot;_blank&quot;&gt;RSiteCatalyst package&lt;/a&gt; to get my natural search keywords into a dataframe. Once the keywords are in a dataframe, we can use the &lt;a title=&quot;RTextTools&quot; href=&quot;http://www.rtexttools.com/&quot; target=&quot;_blank&quot;&gt;RTextTools&lt;/a&gt; package to create a document-term matrix, where each row is a search term and each column is a 1/0 representation of whether a single word is contained within that search term.&lt;/p&gt;
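
&lt;p&gt;As a rough picture of what that 1/0 layout looks like, here is a toy sketch in plain Python, separate from the R workflow. The keywords are made up, the function name is hypothetical, and the sketch only lowercases and splits on whitespace; it does no stemming or punctuation removal.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def document_term_matrix(documents):
    # Binary document-term matrix: one row per document, one column per
    # distinct word, 1 if the word occurs in the document and 0 otherwise.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted(set(word for words in tokenized for word in words))
    matrix = [[1 if word in words else 0 for word in vocabulary]
              for words in tokenized]
    return vocabulary, matrix


# Made-up search keywords, for illustration only
keywords = ['r k-means clustering', 'r google analytics']
vocabulary, matrix = document_term_matrix(keywords)
print(vocabulary)  # ['analytics', 'clustering', 'google', 'k-means', 'r']
for row in matrix:
    print(row)     # [0, 1, 0, 1, 1] then [1, 0, 1, 0, 1]
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Each row is now a numeric vector, which is the form k-means needs in order to compute distances between search terms.&lt;/p&gt;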

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#### 0. Setup&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;RSiteCatalyst&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;RTextTools&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Loads many packages useful for text mining&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#### 1. RSiteCatalyst code - Get Natural Search Keywords &amp;amp; Metrics&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Set credentials&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SCAuth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;username&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;company&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shared&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;secret&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Get list of search engine terms&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;searchkeywords&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueRanked&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report_suite&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2013-02-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2013-09-16&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;entries&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;instances&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;bounces&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;searchenginenaturalkeyword&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;100000&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;startingWith&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#### 2. Process keywords into format suitable for text mining&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Create document-term matrix, passing data cleaning options&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Stem the words to avoid multiples of similar words&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Need to set minimum word length to 1 because &quot;r&quot; is a likely term&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dtm&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create_matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;searchkeywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'Natural Search Keyword'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                     &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stemWords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                     &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;removeStopwords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                     &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minWordLength&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                     &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;removePunctuation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;create_matrix&lt;/code&gt; function, I’m using four keyword arguments to process the data:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stemWords&lt;/code&gt; reduces a word down to its root, which is a standardization method to avoid having multiple versions of words referring to the same concept (e.g. “argue”, “arguing” and “argued” all reduce to ‘argu’)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;removeStopwords&lt;/code&gt; eliminates common English words such as “they”, “he”, “always”; I set it to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FALSE&lt;/code&gt; here, so those words stay in the matrix&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minWordLength&lt;/code&gt; sets the minimum number of characters that constitutes a ‘word’, which I set to 1 because of the high likelihood of ‘r’ being a keyword&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;removePunctuation&lt;/code&gt; removes periods, commas, etc.&lt;/li&gt;
&lt;/ol&gt;
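
&lt;p&gt;To make the idea of a document-term matrix concrete, here is a minimal base-R sketch. It does not call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;create_matrix&lt;/code&gt; and applies no stemming; the three keywords and the variable names are made up for illustration.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Hypothetical search keywords (illustrative only, not from the real report)
keywords &amp;lt;- c(&quot;r kmeans clustering&quot;, &quot;wordpress remove footer&quot;, &quot;r clustering&quot;)

# Lowercase, strip punctuation, and split each keyword on whitespace
tokens &amp;lt;- strsplit(gsub(&quot;[[:punct:]]&quot;, &quot;&quot;, tolower(keywords)), &quot;\\s+&quot;)

# Vocabulary: every distinct term, including the one-character term &quot;r&quot;
vocab &amp;lt;- sort(unique(unlist(tokens)))

# One row per keyword, one column per term, cells hold term counts
toy_dtm &amp;lt;- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
rownames(toy_dtm) &amp;lt;- keywords

toy_dtm
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Each row is one keyword and each column one term, which is the shape that the clustering step later in this post works on.&lt;/p&gt;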

&lt;h2 id=&quot;popular-words&quot;&gt;Popular Words&lt;/h2&gt;

&lt;p&gt;If you are unfamiliar with the terms that might be contained in your dataset, you can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;findFreqTerms&lt;/code&gt; function to see which terms occur with a minimum frequency. Here are the terms that occur at least 20 times in the search keywords for this blog:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Inspect most popular words, minimum frequency of 20&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;findFreqTerms&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dtm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lowfreq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;15&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2008&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2009&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2011&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;a&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;            &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ad&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;add&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;adsens&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;air&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;analyt&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;and&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;appl&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;at&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;back&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;bezel&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;black&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       
 &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;17&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;book&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;bookmark&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;     &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;break&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;broke&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;broken&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;bubbl&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;by&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;can&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         
 &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;25&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;case&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;chang&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;child&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;code&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;comment&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;comput&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;cost&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;cover&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       
 &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;33&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;crack&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;css&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;custom&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;data&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;delet&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;disabl&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;display&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;do&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          
 &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;41&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;doe&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;drop&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;edit&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;eleven&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;em209&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;entri&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;fix&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;footer&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      
 &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;49&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;footerphp&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;for&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;free&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;from&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;get&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;glue&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;googl&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;hadoop&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      
 &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;57&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;header&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;hing&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;how&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;i&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;            &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;if&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;imag&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;in&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;is&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          
 &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;65&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;it&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;laptop&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;late&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;lcd&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;lid&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;link&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;logo&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;loos&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        
 &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;73&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;mac&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;macbook&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;make&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;mobil&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;modifi&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;much&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;my&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;navig&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       
 &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;81&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;of&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;off&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;omnitur&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;on&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;page&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;permalink&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;php&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;post&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        
 &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;89&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;power&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pro&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;problem&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;program&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;proud&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;r&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;            &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;remov&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;repair&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      
 &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;97&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;replac&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;report&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;sas&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;screen&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;separ&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;site&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;sitecatalyst&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;store&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;105&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;tag&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;text&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;the&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;theme&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;this&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;tighten&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;to&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;top&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;113&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;turn&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;twenti&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;twentyeleven&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;uncategor&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;unibodi&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;use&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;variabl&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;version&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;     
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;121&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;view&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;vs&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;warranti&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;     &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;was&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;what&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;will&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;with&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;         &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;wordpress&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;   
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;129&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;wp&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;           &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;you&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
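
&lt;p&gt;Conceptually, this filter sums each term’s column and keeps the terms whose total meets the threshold. A rough base-R sketch of that idea follows; the count matrix and the threshold of 2 are made up for illustration, and this is not the actual &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;findFreqTerms&lt;/code&gt; source.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Made-up term counts: one row per keyword, one named column per term
counts &amp;lt;- rbind(c(r = 1, wordpress = 0, clustering = 1, macbook = 0),
                c(r = 1, wordpress = 1, clustering = 0, macbook = 0),
                c(r = 1, wordpress = 0, clustering = 1, macbook = 1))

# Total occurrences of each term across all keywords
term_totals &amp;lt;- colSums(counts)

# Keep only the terms that occur at least twice overall
frequent_terms &amp;lt;- names(term_totals)[term_totals &amp;gt;= 2]
frequent_terms
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;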

&lt;h2 id=&quot;guessing-at-k-a-first-run-at-clustering&quot;&gt;Guessing at ‘k’: A First Run at Clustering&lt;/h2&gt;

&lt;p&gt;Once we have our data set up, we can very quickly run the k-means algorithm within R. The one downside to using k-means clustering as a technique is that the user must choose ‘k’, the number of clusters expected from the dataset. In the absence of any heuristics about what ‘k’ to use, I can guess that there are five topics on this blog:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Data Science&lt;/li&gt;
  &lt;li&gt;Digital Analytics&lt;/li&gt;
  &lt;li&gt;R&lt;/li&gt;
  &lt;li&gt;Julia&lt;/li&gt;
  &lt;li&gt;WordPress&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Running the following code, we can see if the algorithm agrees:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#I think there are 5 main topics: Data Science, Web Analytics, R, Julia, Wordpress&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kmeans5&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kmeans&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dtm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Merge cluster assignment back to keywords&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kw_with_cluster&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as.data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;searchkeywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'Natural Search Keyword'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kmeans5&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cluster&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kw_with_cluster&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;keyword&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;kmeans5&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Make df for each cluster result, quickly &quot;eyeball&quot; results&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cluster1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kw_with_cluster&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kmeans5&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cluster2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kw_with_cluster&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kmeans5&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cluster3&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kw_with_cluster&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kmeans5&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cluster4&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kw_with_cluster&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kmeans5&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cluster5&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kw_with_cluster&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kmeans5&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Opening the dataframes to observe the results, it seems that the algorithm disagrees:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Cluster 1: “Free-for-All” cluster: not well separated (41.1% of terms)&lt;/li&gt;
  &lt;li&gt;Cluster 2: “wordpress” and “remove” (4.9% of terms)&lt;/li&gt;
  &lt;li&gt;Cluster 3: “powered by wordpress” (4.3% of terms)&lt;/li&gt;
  &lt;li&gt;Cluster 4: “twenty eleven” (13.5% of terms)&lt;/li&gt;
  &lt;li&gt;Cluster 5: “macbook” (36.2% of terms)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of the clusters, the most coherent is cluster 5, which is fairly homogeneous: nearly all of its terms are about ‘macbook’. Clusters 2-4 are all about WordPress, albeit different topics surrounding blogging. And cluster 1 is a large hodge-podge of terms that seem unrelated. Clearly, five isn’t the proper value for ‘k’.&lt;/p&gt;
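
&lt;p&gt;The percentages in the list above are each cluster’s share of the keywords. One way to compute such shares is sketched below on random toy data rather than the real document-term matrix; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_data&lt;/code&gt;, the seed, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nstart&lt;/code&gt; value are assumptions for illustration.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;set.seed(42)  # k-means starts from random centers, so fix the seed

# Toy data: 60 points in two dimensions around three made-up centers
toy_data &amp;lt;- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                  matrix(rnorm(40, mean = 5), ncol = 2),
                  matrix(rnorm(40, mean = 10), ncol = 2))

# Fit three clusters, trying 10 random starts and keeping the best
toy_fit &amp;lt;- kmeans(toy_data, centers = 3, nstart = 10)

# Share of observations assigned to each cluster, in percent
cluster_share &amp;lt;- round(100 * prop.table(table(toy_fit$cluster)), 1)
cluster_share
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Because the cluster labels themselves are arbitrary, it is the sizes and contents of the clusters that matter, not which number each one receives.&lt;/p&gt;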

&lt;h2 id=&quot;selecting-k-using-elbow-method&quot;&gt;Selecting ‘k’ Using ‘Elbow Method’&lt;/h2&gt;

&lt;p&gt;Instead of randomly choosing values of ‘k’, then looking at each cluster result until we find one we like, we can take a more automated approach to picking ‘k’. Every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kmeans&lt;/code&gt; object returned by R includes a metric, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tot.withinss&lt;/code&gt;, which is the total within-cluster sum of squares: the squared distances from each observation to its assigned cluster center, summed across all clusters.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#accumulator for cost results&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cost_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#run kmeans for all clusters up to 100&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Run kmeans for each level of i, allowing up to 100 iterations for convergence&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;km&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kmeans&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dtm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;centers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iter.max&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Combine cluster number and cost together, write to df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cost_df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cost_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;km&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tot.withinss&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cost_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;cluster&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;cost&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cost_df&lt;/code&gt; dataframe accumulates the results for each run, which can then be plotted using ggplot2 (&lt;a title=&quot;ggplot2 k-means elbow method gist&quot; href=&quot;https://gist.github.com/randyzwitch/6597905&quot; target=&quot;_blank&quot;&gt;ggplot2 Gist here&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/09/elbow-plot.png&quot; alt=&quot;elbow-plot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The plot above illustrates a technique known informally as the ‘elbow method’, where we look for breakpoints in the cost plot to understand where we should stop adding clusters. We can see that the slope of the cost function gets flatter at 10 clusters, then flatter again around 20 clusters. This means that as we add clusters above 10 (or 20), each additional cluster becomes less effective at reducing the distance from each data point to its cluster center (i.e. it reduces the variance less). So while we haven’t determined an absolute, single ‘best’ value of ‘k’, we have narrowed down a range of values for ‘k’ to evaluate.&lt;/p&gt;

&lt;p&gt;Ultimately, the best value of ‘k’ will be determined as a combination of a heuristic method like the ‘Elbow Method’, along with analyst judgement after looking at the results. Once you’ve determined your optimal cluster definitions, it’s trivial to calculate metrics such as Bounce Rate, Pageviews per Visit, Conversion Rate or Average Order Value to see how well the clusters actually describe different behaviors on-site.&lt;/p&gt;
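
&lt;p&gt;As a rough illustration of what looking for a breakpoint means, here is a minimal Python sketch (it is not part of the R analysis above). The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;costs&lt;/code&gt; values are made up, standing in for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tot.withinss&lt;/code&gt; at k = 1 through 8, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;elbow_candidate&lt;/code&gt; is a hypothetical helper that flags the ‘k’ after which the cost reduction falls off most sharply. Treat it as one simple heuristic under those assumptions, not a substitute for looking at the plot.&lt;/p&gt;

```python
# Hypothetical cost values standing in for tot.withinss at k = 1..8
costs = [1000.0, 800.0, 620.0, 460.0, 430.0, 410.0, 395.0, 385.0]


def elbow_candidate(costs):
    """Return the k (1-based) where the cost curve bends most sharply."""
    # Cost reduction gained by moving from k to k + 1 clusters
    drops = [costs[i] - costs[i + 1] for i in range(len(costs) - 1)]
    # How much that reduction shrinks from one step to the next;
    # bends[i] describes the bend at k = i + 2
    bends = [drops[i] - drops[i + 1] for i in range(len(drops) - 1)]
    best = max(range(len(bends)), key=lambda i: bends[i])
    return best + 2


print(elbow_candidate(costs))  # 4
```

&lt;p&gt;With real data the curve is noisier, so a candidate found this way is only a starting point for the analyst judgement described above.&lt;/p&gt;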

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;K-means clustering is one of many unsupervised learning techniques that can be used to understand the underlying structure of a dataset. When used with text data, k-means clustering can provide a great way to organize the thousands-to-millions of words being used by your customers to describe their visits. Once you understand what your customers are trying to do, you can tailor your on-site experiences to match these needs, and adjust your reporting/dashboards to monitor the various customer groups.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;EDIT: For those who want to play around with the code but don’t use Adobe Analytics, here is the &lt;a title=&quot;search keyword file&quot; href=&quot;http://randyzwitch.com/wp-content/uploads/2013/09/searchkeywords_0913.csv&quot; target=&quot;_blank&quot;&gt;file of search keywords&lt;/a&gt; I used. Once you read the .csv file into a dataframe and name it searchkeywords, you should be able to replicate everything in this blog post.&lt;/em&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Fun With Just-In-Time Compiling: Julia, Python, R and pqR</title>
        
          <description>&lt;p&gt;Recently I’ve been spending a lot of time trying to learn &lt;a title=&quot;Julia language&quot; href=&quot;http://julialang.org/&quot; target=&quot;_blank&quot;&gt;Julia&lt;/a&gt; by doing the problems at &lt;a title=&quot;Project Euler&quot; href=&quot;http://projecteuler.net/&quot; target=&quot;_blank&quot;&gt;Project Euler&lt;/a&gt;. What’s great about these problems is that they get me out of my normal design patterns, since I don’t generally think about prime numbers, factorials and other number theory problems during my normal workday. These problems have also given me the opportunity to really think about how computers work, since Julia allows the programmer to pass type declarations to the just-in-time compiler (JIT).&lt;/p&gt;

</description>
        
        <pubDate>Mon, 02 Sep 2013 19:57:45 +0000</pubDate>
        <link>
        http://randyzwitch.com/python-pypy-julia-r-pqr-jit-just-in-time-compiler/</link>
        <guid isPermaLink="true">http://randyzwitch.com/python-pypy-julia-r-pqr-jit-just-in-time-compiler/</guid>
        <content type="html" xml:base="/python-pypy-julia-r-pqr-jit-just-in-time-compiler/">&lt;p&gt;Recently I’ve been spending a lot of time trying to learn &lt;a title=&quot;Julia language&quot; href=&quot;http://julialang.org/&quot; target=&quot;_blank&quot;&gt;Julia&lt;/a&gt; by doing the problems at &lt;a title=&quot;Project Euler&quot; href=&quot;http://projecteuler.net/&quot; target=&quot;_blank&quot;&gt;Project Euler&lt;/a&gt;. What’s great about these problems is that they get me out of my normal design patterns, since I don’t generally think about prime numbers, factorials and other number theory problems during my normal workday. These problems have also given me the opportunity to really think about how computers work, since Julia allows the programmer to pass type declarations to the just-in-time compiler (JIT).&lt;/p&gt;

&lt;p&gt;As I’ve been working on optimizing my Julia code, I wanted to see how fast a single problem could be solved using each of the languages/techniques I know. So I benchmarked one of the Project Euler problems using &lt;a title=&quot;Julia language&quot; href=&quot;http://julialang.org/&quot; target=&quot;_blank&quot;&gt;Julia&lt;/a&gt;, &lt;a title=&quot;Python language&quot; href=&quot;http://python.org/&quot; target=&quot;_blank&quot;&gt;Python&lt;/a&gt;, &lt;a title=&quot;Numba&quot; href=&quot;http://numba.pydata.org/&quot; target=&quot;_blank&quot;&gt;Python with Numba&lt;/a&gt;, &lt;a title=&quot;Pypy&quot; href=&quot;http://pypy.org/&quot; target=&quot;_blank&quot;&gt;PyPy&lt;/a&gt;, &lt;a title=&quot;R&quot; href=&quot;http://cran.us.r-project.org/&quot; target=&quot;_blank&quot;&gt;R&lt;/a&gt;, R using the &lt;a title=&quot;R compiler&quot; href=&quot;http://stat.ethz.ch/R-manual/R-devel/library/compiler/html/compile.html&quot; target=&quot;_blank&quot;&gt;compiler&lt;/a&gt; package, &lt;a title=&quot;pqR&quot; href=&quot;http://radfordneal.wordpress.com/2013/06/22/announcing-pqr-a-faster-version-of-r/&quot; target=&quot;_blank&quot;&gt;pqR&lt;/a&gt; and pqR using the compiler package. Here’s what I found…&lt;/p&gt;

&lt;h2 id=&quot;problem&quot;&gt;Problem&lt;/h2&gt;

&lt;p&gt;The problem I’m using for the benchmark is calculating the smallest number that is divisible by all of the numbers in a factorial, i.e. by every integer from 1 to n. For example, for the numbers in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;5!&lt;/code&gt;, 60 is the smallest number that is divisible by 2, 3, 4 and 5. Here’s the Julia code:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt; smallestdivisall&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int64&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;factorial&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;break&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;elseif&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;All code versions follow this same pattern: the outer loop runs from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt; up to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;n!&lt;/code&gt;, since by definition the last value in the loop will be divisible by all of the numbers in the factorial. The inner loop does a modulo calculation for each number in the factorial, checking to see if there is a remainder after division. If there is a remainder, we break out of the inner loop and move to the next candidate. Once there is no remainder on the modulo calculation and the inner loop value of j equals the last number in the factorial (i.e. the candidate is divisible by all of the factorial numbers), we have found the minimum number.&lt;/p&gt;
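
&lt;p&gt;For readers who don’t know Julia, the same pattern can be sketched in Python. This is my own minimal translation for illustration, not necessarily the code from the Gist linked in the next section; the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;smallest_divisible_by_all&lt;/code&gt; is hypothetical and the sketch assumes Python 3, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;range&lt;/code&gt; is lazy.&lt;/p&gt;

```python
import math


def smallest_divisible_by_all(n):
    """Smallest positive integer divisible by every integer from 1 to n."""
    # Outer loop: candidates from 1 up to n!, which is always divisible by 1..n
    for i in range(1, math.factorial(n) + 1):
        # Inner loop: test each divisor, giving up at the first remainder
        for j in range(1, n + 1):
            if i % j != 0:
                break
            elif j == n:
                # Every divisor from 1 to n passed, so i is the answer
                return i


print(smallest_divisible_by_all(5))  # 60
```

&lt;p&gt;The early &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;break&lt;/code&gt; matters: most candidates fail on a small divisor, so the inner loop rarely runs to completion.&lt;/p&gt;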

&lt;h2 id=&quot;benchmarking---overall&quot;&gt;Benchmarking - Overall&lt;/h2&gt;

&lt;p&gt;Here are the results of the eight permutations of languages/techniques (see &lt;a title=&quot;GitHub Gist for JIT test&quot; href=&quot;https://gist.github.com/randyzwitch/6341926&quot; target=&quot;_blank&quot;&gt;this&lt;/a&gt; GitHub Gist for the actual code used, &lt;a title=&quot;compiler results&quot; href=&quot;http://randyzwitch.com/wp-content/uploads/2013/09/jit.csv&quot; target=&quot;_blank&quot;&gt;this link&lt;/a&gt; for results file, and &lt;a title=&quot;ggplot2 code&quot; href=&quot;https://gist.github.com/randyzwitch/6414244&quot; target=&quot;_blank&quot;&gt;this&lt;/a&gt; GitHub Gist for the ggplot2 code):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/08/jit-comparison.png&quot; alt=&quot;jit-comparison&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Across the range of tests from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;5!&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;20!&lt;/code&gt;, Julia is the fastest to find the minimum number. Python with Numba is second and PyPy is third. pqR fares better than R in general, but using the compiler package can narrow the gap.&lt;/p&gt;

&lt;p&gt;To make more useful comparisons, in the next section I’ll compare each language to its “compiled” function state.&lt;/p&gt;

&lt;h2 id=&quot;benchmarking---individual&quot;&gt;Benchmarking - Individual&lt;/h2&gt;

&lt;h3 id=&quot;python&quot;&gt;Python&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/09/JITpython-e1378131849775.png&quot; alt=&quot;JITpython&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Amongst the native Python code options, I saw a 16x speedup by using PyPy instead of Python 2.7.6 (10.62s vs. 172.06s at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;20!&lt;/code&gt;). Using Numba with Python instead of PyPy nets an &lt;em&gt;incremental&lt;/em&gt; ~40% speedup using the &lt;a title=&quot;autojit example&quot; href=&quot;http://numba.pydata.org/&quot; target=&quot;_blank&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@autojit&lt;/code&gt;&lt;/a&gt; decorator (7.63s vs. 10.63s at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;20!&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;So in the case of Python, using two lines of code with the Numba JIT compiler you can get substantial improvements in performance without needing to do any code re-writes. This is a great benefit given that you can stay in native Python, since PyPy doesn’t support all existing packages within the Python ecosystem.&lt;/p&gt;
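
&lt;p&gt;The speedup figures quoted above follow directly from the timings at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;20!&lt;/code&gt;. Here is the arithmetic as a small Python sketch; the variable names are mine, and since the PyPy time is quoted as both 10.62s and 10.63s, each figure is used where it was quoted.&lt;/p&gt;

```python
# Timings in seconds at 20!, as quoted in the text
cpython_s = 172.06
pypy_s = 10.62         # figure quoted in the CPython comparison
pypy_vs_numba_s = 10.63  # figure quoted in the Numba comparison
numba_s = 7.63

# How many times faster PyPy is than CPython
pypy_speedup = round(cpython_s / pypy_s, 1)

# Incremental speedup of Numba over PyPy, as a percentage
numba_gain_pct = round((pypy_vs_numba_s / numba_s - 1) * 100)

print(pypy_speedup)    # 16.2
print(numba_gain_pct)  # 39
```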

&lt;h3 id=&quot;rpqr&quot;&gt;R/pqR&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/09/JITr-e1378132951124.png&quot; alt=&quot;JITr&quot; /&gt;&lt;/p&gt;

&lt;p&gt;It’s understood in the R community that &lt;a title=&quot;Why are R loops slow?&quot; href=&quot;http://stackoverflow.com/questions/7142767/why-are-loops-slow-in-r&quot; target=&quot;_blank&quot;&gt;loops are not a strong point&lt;/a&gt; of the language. In the case of this problem, I decided to use loops because 1) it keeps the code pattern similar across languages and 2) I hoped I’d see the max benefit from the compiler package by not trying any funky R optimizations up front.&lt;/p&gt;

&lt;p&gt;As expected, pqR is generally faster than R and using the compiler package is faster than not using the compiler. I saw ~30% improvement using pqR relative to R and ~20% &lt;em&gt;incremental&lt;/em&gt; improvement using the compiler package with pqR. Using the compiler package within R showed ~35% improvement.&lt;/p&gt;

&lt;p&gt;So unlike the case with Python, where you could just use Python with Numba and stay within the same language/environment, if you can use pqR &lt;em&gt;and&lt;/em&gt; the compiler package, you can get a performance benefit from using both.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;For a comparison like I’ve done above, it’s easy to get carried away and extrapolate the results from one simple test to all programming problems ever. “&lt;em&gt;Julia is the best language for all cases ever!!!11111eleventy!&lt;/em&gt;” would be easy to proclaim, but not all problems are looping problems using simple division. Once you get into writing longer programs that involve other tasks such as string manipulation and accessing APIs, or that use a technique from a package only available in one ecosystem but not another, which tool is “best” for solving a problem becomes a much more difficult decision. The only way to know how much improvement you can see from different techniques &amp;amp; tools is to profile your program(s) and experiment.&lt;/p&gt;

&lt;p&gt;The main thing that I took away from this exercise is that no matter which tool you are comfortable with to do analysis, there are potentially large performance improvements that can be made &lt;em&gt;just&lt;/em&gt; by using a JIT without needing to dramatically re-write your code. For those of us who don’t know C (and/or are too lazy to re-write our code several times to wring out a little extra performance), that’s a great thing.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>RSiteCatalyst Version 1.1 Release Notes</title>
        
          <description>&lt;p&gt;&lt;a title=&quot;RSiteCatalyst on CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot; target=&quot;_blank&quot;&gt;RSiteCatalyst&lt;/a&gt; version 1.1 is now available on CRAN. Changes from version 1 include:&lt;/p&gt;

</description>
        
        <pubDate>Sun, 25 Aug 2013 13:54:32 +0000</pubDate>
        <link>
        http://randyzwitch.com/rsitecatalyst-version-1-1-release-notes/</link>
        <guid isPermaLink="true">http://randyzwitch.com/rsitecatalyst-version-1-1-release-notes/</guid>
        <content type="html" xml:base="/rsitecatalyst-version-1-1-release-notes/">&lt;p&gt;&lt;a title=&quot;RSiteCatalyst on CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/RSiteCatalyst/index.html&quot; target=&quot;_blank&quot;&gt;RSiteCatalyst&lt;/a&gt; version 1.1 is now available on CRAN. Changes from version 1 include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Support for Correlations/Subrelations in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueRanked&lt;/code&gt; function&lt;/li&gt;
  &lt;li&gt;Support for Current Data in all &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Queue*&lt;/code&gt; functions&lt;/li&gt;
  &lt;li&gt;Support for Anomaly Detection in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueOvertime&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueueTrended&lt;/code&gt; functions (&lt;a title=&quot;Anomaly Detection Adobe Analytics&quot; href=&quot;http://randyzwitch.com/anomaly-detection-adobe-analytics-api/&quot; target=&quot;_blank&quot;&gt;example usage with ggplot2 graph&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;A shorter wait between API calls (from 5 seconds to 2 seconds) and a longer total retry window before a report is declared failed (from 100 seconds to 10 minutes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For those of you Adobe Analytics (Omniture) users who haven’t yet tried to use the Adobe Analytics API, I’ve created an &lt;a title=&quot;RSiteCatalyst main page&quot; href=&quot;http://randyzwitch.com/rsitecatalyst&quot; target=&quot;_blank&quot;&gt;introduction video&lt;/a&gt; to get started. There will also continue to be examples of using this package on this blog on the &lt;a title=&quot;RSiteCatalyst usage examples&quot; href=&quot;http://randyzwitch.com/tags/#rsitecatalyst&quot; target=&quot;_blank&quot;&gt;RSiteCatalyst&lt;/a&gt; tag. Enjoy!&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Getting Started Using Hadoop, Part 4: Creating Tables With Hive</title>
        
          <description>&lt;p&gt;In the previous three tutorials (&lt;a title=&quot;Hadoop for beginners&quot; href=&quot;http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/&quot; target=&quot;_blank&quot;&gt;1&lt;/a&gt;, &lt;a title=&quot;Building Hadoop cluster on Amazon EC2&quot; href=&quot;http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-2/&quot; target=&quot;_blank&quot;&gt;2&lt;/a&gt;, &lt;a title=&quot;Loading data Hadoop Hue&quot; href=&quot;http://randyzwitch.com/uploading-data-hadoop-amazon-ec2-cloudera-part-3/&quot; target=&quot;_blank&quot;&gt;3&lt;/a&gt;), we’ve covered the background of Hadoop, how to build a proof-of-concept Hadoop cluster using Amazon EC2 and how to upload a .zip file to the cluster using Hue. In Part 4, we’ll use the data uploaded from the .zip file to create a master table of all files, as well as creating a view.&lt;/p&gt;

</description>
        
        <pubDate>Thu, 22 Aug 2013 18:03:19 +0000</pubDate>
        <link>
        http://randyzwitch.com/hadoop-creating-tables-hive/</link>
        <guid isPermaLink="true">http://randyzwitch.com/hadoop-creating-tables-hive/</guid>
        <content type="html" xml:base="/hadoop-creating-tables-hive/">&lt;p&gt;In the previous three tutorials (&lt;a title=&quot;Hadoop for beginners&quot; href=&quot;http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/&quot; target=&quot;_blank&quot;&gt;1&lt;/a&gt;, &lt;a title=&quot;Building Hadoop cluster on Amazon EC2&quot; href=&quot;http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-2/&quot; target=&quot;_blank&quot;&gt;2&lt;/a&gt;, &lt;a title=&quot;Loading data Hadoop Hue&quot; href=&quot;http://randyzwitch.com/uploading-data-hadoop-amazon-ec2-cloudera-part-3/&quot; target=&quot;_blank&quot;&gt;3&lt;/a&gt;), we’ve covered the background of Hadoop, how to build a proof-of-concept Hadoop cluster using Amazon EC2 and how to upload a .zip file to the cluster using Hue. In Part 4, we’ll use the data uploaded from the .zip file to create a master table of all files, as well as creating a view.&lt;/p&gt;

&lt;h2 id=&quot;creating-tables-using-hive&quot;&gt;Creating Tables Using Hive&lt;/h2&gt;

&lt;p&gt;Like SQL for ‘regular’ relational databases, Hive is the tool we can use within Hadoop to create tables from data loaded into HDFS. Because Hadoop was built with large, messy data in mind, there are some amazingly convenient features for creating and loading data, such as being able to load all files in a directory (assuming they have the same format).  Here’s the Hive statement we can use to load the &lt;a title=&quot;airline dataset&quot; href=&quot;http://stat-computing.org/dataexpo/2009/the-data.html&quot; target=&quot;_blank&quot;&gt;airline dataset&lt;/a&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;-- Create table from yearly airline csv files&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXTERNAL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;airline&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`Year`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`Month`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`DayofMonth`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`DayOfWeek`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`DepTime`&lt;/span&gt;  &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`CRSDepTime`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`ArrTime`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`CRSArrTime`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`UniqueCarrier`&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`FlightNum`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`TailNum`&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`ActualElapsedTime`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`CRSElapsedTime`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`AirTime`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`ArrDelay`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`DepDelay`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`Origin`&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`Dest`&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`Distance`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`TaxiIn`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`TaxiOut`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`Cancelled`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`CancellationCode`&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`Diverted`&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`CarrierDelay`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`WeatherDelay`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`NASDelay`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`SecurityDelay`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;`LateAircraftDelay`&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ROW&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FORMAT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DELIMITED&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FIELDS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TERMINATED&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;','&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ESCAPED&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TEXTFILE&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LOCATION&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'/user/hue/airline/airline'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The above statement starts by outlining the structure of the table, which is mostly integers with a few string columns. The next four lines of code specify what type of data we have: delimited text files where the fields are terminated (separated) by commas and where a delimiter inside a field is escaped using a backslash. Finally, we give the location of our files, which is the directory where we uploaded the .zip file in &lt;a title=&quot;Part 3&quot; href=&quot;http://randyzwitch.com/uploading-data-hadoop-amazon-ec2-cloudera-part-3/&quot; target=&quot;_blank&quot;&gt;part 3 of this tutorial&lt;/a&gt;. Note that we specify an “external table”, which means that if we drop the ‘airline’ table, we will still retain our raw data. Had we not specified the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;external&lt;/code&gt; keyword, Hive would’ve moved our raw data files into Hive, and had we decided to drop the ‘airline’ table, all our data would be deleted. Specifying &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;external&lt;/code&gt; also lets us build multiple tables on the same underlying dataset if we so choose.&lt;/p&gt;

&lt;h2 id=&quot;creating-a-view-using-hive&quot;&gt;Creating a View Using Hive&lt;/h2&gt;

&lt;p&gt;One thing that’s slightly awkward about Hive is that you can’t specify that there is a header row in your files. As such, once the above code loads, we have 22 rows in our ‘airline’ table where the data is invalid. Another thing that’s awkward about Hive is that there are no row-level operations, so you can’t delete data! However, we can very easily fix our problem using a view:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;-- Create view to &quot;remove&quot; 22 bad records from our table&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;create&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;view&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vw_airline&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;select&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;airline&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;where&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uniquecarrier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;&quot;UniqueCarrier&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now that we have our view defined, we no longer have to explicitly exclude the rows in every future query we run. Just like in SQL, views are “free” from a performance standpoint, as they don’t require any additional data storage space (they just represent stored code references).&lt;/p&gt;

&lt;h2 id=&quot;time-for-analysis&quot;&gt;Time for Analysis?!&lt;/h2&gt;

&lt;p&gt;If you’ve made it this far, you’ve waited a long time to do some actual analysis! The next and final part of this tutorial will do some interesting things using Hive and/or Pig to analyze the data. The origin of this dataset was a data mining contest to predict why a flight would arrive late to its destination and we’ll do examples towards that end.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Anomaly Detection Using The Adobe Analytics API</title>
        
          <description>&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/08/anomaly-detection-adobe-analytics.jpg&quot; alt=&quot;anomaly-detection-adobe-analytics&quot; /&gt;&lt;/p&gt;

</description>
        
        <pubDate>Thu, 15 Aug 2013 10:57:56 +0000</pubDate>
        <link>
        http://randyzwitch.com/anomaly-detection-adobe-analytics-api/</link>
        <guid isPermaLink="true">http://randyzwitch.com/anomaly-detection-adobe-analytics-api/</guid>
        <content type="html" xml:base="/anomaly-detection-adobe-analytics-api/">&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/08/anomaly-detection-adobe-analytics.jpg&quot; alt=&quot;anomaly-detection-adobe-analytics&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As digital marketers &amp;amp; analysts, we’re often asked to quantify when a metric goes beyond just random variation and becomes an actual “unexpected” result. In cases such as &lt;em&gt;A/B..N&lt;/em&gt; testing, it’s easy to calculate a t-test to quantify the difference between two testing populations, but &lt;a title=&quot;Why t-test not appropriate for time-series&quot; href=&quot;http://www.indiana.edu/~statmath/stat/all/ttest/ttest1.html&quot; target=&quot;_blank&quot;&gt;for time-series metrics, using a t-test is likely not appropriate&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To determine whether a time-series has become “out-of-control”, we can use Exponential Smoothing to forecast the Expected Value, as well as calculate Upper Control Limits (UCL) and Lower Control Limits (LCL). To the extent a data point exceeds the UCL or falls below the LCL, we can say that statistically a time-series is no longer within the expected range. There are numerous ways to &lt;a title=&quot;R time-series&quot; href=&quot;http://cran.r-project.org/web/views/TimeSeries.html&quot; target=&quot;_blank&quot;&gt;create time-series models using R&lt;/a&gt;, but for the purposes of this blog post I’m going to focus on Exponential Smoothing, which is how the &lt;a title=&quot;Anomaly Detection Adobe Analytics API&quot; href=&quot;https://developer.omniture.com/en_US/documentation/sitecatalyst-reporting/c-anomaly#concept_E51D14B9899A4974BD946A77D7368BC5&quot; target=&quot;_blank&quot;&gt;anomaly detection&lt;/a&gt; feature is implemented within the Adobe Analytics API.&lt;/p&gt;

&lt;h3 id=&quot;holt-winters--exponential-smoothing&quot;&gt;Holt-Winters &amp;amp; Exponential Smoothing&lt;/h3&gt;

&lt;p&gt;There are three techniques that the Adobe Analytics API uses to build time-series models:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Holt-Winters Additive (Triple Exponential Smoothing)&lt;/li&gt;
  &lt;li&gt;Holt-Winters Multiplicative (Triple Exponential Smoothing)&lt;/li&gt;
  &lt;li&gt;Holt Trend Corrected (Double Exponential Smoothing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The formulas behind each of the techniques are easily found elsewhere, but the main point behind the three techniques is that time-series data can have a &lt;span style=&quot;text-decoration: underline;&quot;&gt;long-term trend&lt;/span&gt; (Double Exponential Smoothing) and/or a &lt;span style=&quot;text-decoration: underline;&quot;&gt;seasonal trend&lt;/span&gt; (Triple Exponential Smoothing). To the extent that a time-series has a seasonal component, the seasonal component can be &lt;em&gt;additive&lt;/em&gt; (a fixed amount of increase across the series, such as the number of degrees increase in temperature in Summer) or &lt;em&gt;multiplicative&lt;/em&gt; (a multiplier relative to the level of the series, such as a 10% increase in sales during holiday periods).&lt;/p&gt;

&lt;p&gt;The Adobe Analytics API simplifies the choice of which technique to use by calculating a forecast using all three methods, then choosing the method that has the best fit as calculated by the model having the minimum (squared) error. It’s important to note that while this is probably an okay model selection method for detecting anomalies, this method does not guarantee that the model chosen is the actual “best” forecast model to fit the data.&lt;/p&gt;
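
&lt;p&gt;To make the “fit several models, keep the one with the smallest squared error” idea concrete, here is a minimal Python sketch. It is an illustration only, not Adobe’s implementation: the function names, the fixed smoothing parameters and the use of simple exponential smoothing as a second candidate are all assumptions of mine, and the seasonal Holt-Winters variants are left out for brevity.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;# Hypothetical helpers for illustration; parameters are fixed, not fitted.

def ses_forecasts(series, alpha=0.5):
    # One-step-ahead forecasts from simple exponential smoothing.
    level = series[0]
    forecasts = []
    for y in series[1:]:
        forecasts.append(level)  # forecast is made before seeing y
        level = alpha * y + (1 - alpha) * level
    return forecasts

def holt_forecasts(series, alpha=0.5, beta=0.5):
    # One-step-ahead forecasts from Holt trend-corrected (double) smoothing.
    # Needs at least two points; the trend starts as the first difference.
    level = series[0]
    trend = series[1] - series[0]
    forecasts = []
    for y in series[1:]:
        forecasts.append(level + trend)
        new_level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts

def sse(series, forecasts):
    # Sum of squared one-step-ahead forecast errors.
    return sum((y - f) ** 2 for y, f in zip(series[1:], forecasts))

def pick_model(series):
    # Return the name of the candidate with the smallest squared error.
    candidates = {
        'simple': ses_forecasts(series),
        'holt': holt_forecasts(series),
    }
    return min(candidates, key=lambda name: sse(series, candidates[name]))
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;On a perfectly linear series such as the numbers 1 through 10, the Holt forecasts are exact, so &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pick_model&lt;/code&gt; returns &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;'holt'&lt;/code&gt;. As noted above, the smallest in-sample error only tells you which candidate tracked the history most closely, not which one will forecast best.&lt;/p&gt;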

&lt;h3 id=&quot;rsitecatalyst-api-call&quot;&gt;RSiteCatalyst API call&lt;/h3&gt;

&lt;p&gt;Using the RSiteCatalyst R package version 1.1, it’s trivial to access the anomaly detection feature:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Run until version &amp;gt; 1.0 on CRAN&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;devtools&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;install_github&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;RSiteCatalyst&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;randyzwitch&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;master&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Run if version &amp;gt;= 1.1 on CRAN&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;RSiteCatalyst&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#API Authentication&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SCAuth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;username&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;company&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shared_secret&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#API function call&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pageviews_w_forecast&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QueueOvertime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reportSuiteID&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;report&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;suite&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dateFrom&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2013-06-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dateTo&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2013-08-13&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pageviews&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dateGranularity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;day&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;anomalyDetection&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Once the function call is run, you will receive a DataFrame of ‘Day’ granularity with the actual metric and three additional columns for the forecasted value, UCL and LCL.  Graphing these data using ggplot2 (&lt;a title=&quot;ggplot2 Anomaly Detection graph&quot; href=&quot;https://gist.github.com/randyzwitch/6241051&quot; target=&quot;_blank&quot;&gt;Graph Code Here - GitHub Gist&lt;/a&gt;), we can now see on which days an anomalous result occurred:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/08/anomaly-detection-adobe-analytics1.png&quot; alt=&quot;Huge spike in traffic July 23 - 24&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The red dots in the graph above indicate days where page views either exceeded the UCL or fell below the LCL. During the July 23 - 24 timeframe, traffic to this blog spiked dramatically due to a &lt;a title=&quot;A Beginner’s Look at Julia&quot; href=&quot;http://randyzwitch.com/julia-language-beginners/&quot; target=&quot;_blank&quot;&gt;blog post about the Julia programming language&lt;/a&gt;, and stayed above the norm for about a week afterwards.&lt;/p&gt;
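
&lt;p&gt;The rule behind the red dots can be sketched in a few lines of Python. The rows below are toy numbers I made up for illustration (they are not this blog’s actual traffic), and the tuple layout and function names are hypothetical rather than anything returned by the API or RSiteCatalyst.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;# Toy rows: (day, pageviews, lower control limit, upper control limit).
# All values are invented for illustration.
rows = [
    ('2013-07-22', 410, 300, 520),
    ('2013-07-23', 1950, 310, 530),
    ('2013-07-24', 1200, 320, 560),
    ('2013-07-25', 505, 330, 700),
]

def is_anomaly(value, lcl, ucl):
    # Clamping a value into [lcl, ucl] leaves it unchanged only when it
    # already lies within the control limits, so a changed value means
    # it fell below the LCL or exceeded the UCL.
    return min(max(value, lcl), ucl) != value

def anomalous_days(rows):
    # Keep only the days whose metric is outside the control limits.
    return [day for day, value, lcl, ucl in rows if is_anomaly(value, lcl, ucl)]

print(anomalous_days(rows))  # ['2013-07-23', '2013-07-24']
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;A value sitting exactly on a limit is treated as in control here; whether the API draws the boundary the same way is something I have not verified.&lt;/p&gt;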

&lt;h3 id=&quot;anomaly-detection-limitations&quot;&gt;Anomaly Detection Limitations&lt;/h3&gt;

&lt;p&gt;There are two limitations to keep in mind when using the Anomaly Detection feature of the Adobe Analytics API:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Anomaly Detection is currently only available for ‘Day’ granularity&lt;/li&gt;
  &lt;li&gt;Forecasts are built on 35 days of past history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In neither case do I view these limitations as dealbreakers. The first limitation is just an engineering decision, which I’m sure could be expanded if enough people used this functionality.&lt;/p&gt;

&lt;p&gt;As for the 35-day period used to build the forecasts, this is a trade-off between calculation time and capturing a long-term and/or seasonal trend in the data. Using 35 days as your time period, you get five weeks of day-of-week seasonality, as well as 35 points to calculate a ‘long-term’ trend. If the time period is of concern in terms of what constitutes a ‘good forecast’, then there are plenty of other techniques that can be explored using R (or any other statistical software for that matter).&lt;/p&gt;

&lt;h3 id=&quot;elevating-the-discussion&quot;&gt;Elevating the discussion&lt;/h3&gt;

&lt;p&gt;I have to give a hearty ‘Well Done!’ to the Adobe Analytics folks for elevating the discussion in terms of digital analytics. By using statistical techniques like Exponential Smoothing, analysts can move away from qualitative statements like “Does it look like &lt;em&gt;something&lt;/em&gt; is wrong with our data?” to actually quantifying &lt;em&gt;when&lt;/em&gt; KPIs are “too far” away from the norm and should be explored further.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Tabular Data I/O in Julia</title>
        
          <description>&lt;p&gt;Importing tabular data into Julia can be done in (at least) three ways: reading a delimited file into an array, reading a delimited file into a DataFrame and accessing databases using ODBC.&lt;/p&gt;

</description>
        
        <pubDate>Tue, 06 Aug 2013 10:05:38 +0000</pubDate>
        <link>
        http://randyzwitch.com/julia-import-data/</link>
        <guid isPermaLink="true">http://randyzwitch.com/julia-import-data/</guid>
        <content type="html" xml:base="/julia-import-data/">&lt;p&gt;Importing tabular data into Julia can be done in (at least) three ways: reading a delimited file into an array, reading a delimited file into a DataFrame and accessing databases using ODBC.&lt;/p&gt;

&lt;h3 id=&quot;reading-a-file-into-an-array-using-readdlm&quot;&gt;Reading a file into an array using readdlm&lt;/h3&gt;

&lt;p&gt;The most basic way to read data into Julia is through the use of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;readdlm&lt;/code&gt; function, which will create an array:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;readdlm&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;delim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Type&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If you are reading in a fairly normal delimited file, you can get away with just using the first two arguments, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;source&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delim&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;airline_array&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;readdlm&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/Users/randyzwitch/airline/1987.csv&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;','&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;airline_array&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1311827&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;29&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typeof&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;airline_array&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Any&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;It’s important to note that by only specifying the first two arguments, you leave it up to Julia to determine the type of array to return. In the code example above, an array of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Any&lt;/code&gt; is returned, as the .csv file I read in was not of a homogeneous type such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Int64&lt;/code&gt; or &lt;del&gt;ASCII&lt;/del&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;String&lt;/code&gt;. If you know for certain which type of array you want, you can specify the data type using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;type&lt;/code&gt; argument:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;airline_array&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;readdlm&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/Users/randyzwitch/airline/1987.csv&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;','&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;airline_array&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1311827&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;29&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typeof&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;airline_array&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;It’s probably the case that unless you are looking to do linear algebra or other specific mathy type work, you’ll likely find that reading your data into a DataFrame will be more comfortable to work with (especially if you are coming from an R, Python/pandas or even spreadsheet tradition).&lt;/p&gt;

&lt;p&gt;To write an array out to a file, you can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;writedlm&lt;/code&gt; function (defaults to comma-separated):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;writedlm&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filename&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;delim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;reading-a-file-into-a-dataframe-using-readtable&quot;&gt;Reading a file into a DataFrame using readtable&lt;/h3&gt;

&lt;p&gt;As I covered in my prior blog post about Julia, you can also &lt;a title=&quot;Julia for Beginners&quot; href=&quot;http://randyzwitch.com/julia-language-beginners/&quot; target=&quot;_blank&quot;&gt;read delimited files into Julia using the DataFrames package&lt;/a&gt;, which returns a DataFrame instead of an array. Besides reading delimited files, the DataFrames package also supports reading gzipped files on the fly:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DataFrames&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;airline_df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;readtable&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/Users/randyzwitch/airline/1987.csv.gz&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;airline_df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1311826&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;29&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typeof&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;airline_df&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataFrame&lt;/span&gt;  &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;methods&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DataFrame&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;see&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;constructors&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;From what I understand, in the future you will be able to read files directly from Amazon S3 into a DataFrame (this is already supported in the &lt;a title=&quot;Julia Amazon S3&quot; href=&quot;https://github.com/amitmurthy/AWS.jl&quot; target=&quot;_blank&quot;&gt;AWS package&lt;/a&gt;), but for now, the DataFrames package works only on local files. Writing a DataFrame to file can be done with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;writetable&lt;/code&gt; function:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;writetable&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filename&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DataFrame&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;By default, the &lt;a title=&quot;Julia DataFrames&quot; href=&quot;http://juliastats.github.io/DataFrames.jl/io.html&quot; target=&quot;_blank&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;writetable&lt;/code&gt; function&lt;/a&gt; will use the delimiter specified by the filename extension and default to printing the column names as a header.&lt;/p&gt;

&lt;h3 id=&quot;accessing-databases-using-odbc&quot;&gt;Accessing Databases using ODBC&lt;/h3&gt;

&lt;p&gt;The third major way of importing tabular data into Julia is through the use of ODBC access to various databases such as MySQL and PostgreSQL.&lt;/p&gt;

&lt;h4 id=&quot;using-a-dsn&quot;&gt;Using a DSN&lt;/h4&gt;

&lt;p&gt;The &lt;a title=&quot;Julia ODBC package&quot; href=&quot;https://github.com/karbarcca/ODBC.jl&quot; target=&quot;_blank&quot;&gt;Julia ODBC package&lt;/a&gt; provides functionality to connect to a database using a Data Source Name (DSN). Assuming you store all the credentials in your DSN (server name, username, password, etc.), connecting to a database is as easy as:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ODBC&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ODBC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;MySQL&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Connection&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MySQL&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;successful&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Of course, if you don’t want to store your password in your DSN (especially in the case where there are multiple users for a computer), you can pass the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;usr&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pwd&lt;/code&gt; arguments to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ODBC.connect&lt;/code&gt; function:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;ODBC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dsn&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pwd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h4 id=&quot;using-a-connection-string&quot;&gt;Using a connection string&lt;/h4&gt;

&lt;p&gt;Alternatively, you can build your own connection strings within a Julia session using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;advancedconnect&lt;/code&gt; function:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;#Amazon Redshift/Postgres connection string&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;red&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;advancedconnect&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Driver={psqlODBC};ServerName=reporting.XXXXX.us-east-1.redshift.amazonaws.com;Username=XXXX;Password=XXXX;Database=XXXX;Port=XXXX&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Connection&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Driver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;psqlODBC&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;};&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ServerName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reporting&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;XXXXX&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;us&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;east&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;redshift&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;amazonaws&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;com&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Username&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;XXXX&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Password&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;XXXX&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Database&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;XXXX&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Port&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;XXXX&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;successful&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#MySQL connection string&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;advancedconnect&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Driver={MySQL};user=root;server=localhost;database=airline;&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Connection&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Driver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MySQL&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;};&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;root&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;server&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;localhost&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;database&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;airline&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;successful&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
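
&lt;p&gt;The connection strings above all follow the same shape: a driver name in braces followed by semicolon-separated key/value pairs. As a language-agnostic illustration (sketched here in Python rather than Julia), a small helper can assemble such a string. The function name is hypothetical and not part of the ODBC package, and the keys a given driver accepts depend on that driver.&lt;/p&gt;

```python
def build_connection_string(driver, **params):
    """Join a driver name and key/value pairs into an ODBC-style string."""
    # Doubled braces in an f-string produce literal { and } characters.
    parts = [f"Driver={{{driver}}}"]
    parts.extend(f"{key}={value}" for key, value in params.items())
    return ";".join(parts) + ";"


# Rebuild the MySQL example from the text above.
conn_str = build_connection_string(
    "MySQL", user="root", server="localhost", database="airline"
)
print(conn_str)
```

&lt;p&gt;Building the string from named parameters keeps credentials out of hard-coded literals, since the values can come from environment variables or a prompt instead.&lt;/p&gt;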

&lt;p&gt;Regardless of which way you connect, you can query data using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;query&lt;/code&gt; function. If you want your output as a DataFrame, you can assign the result of the function to an object. If you want to save the results to a file, you specify the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;file&lt;/code&gt; argument:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ODBC&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ODBC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;MySQL&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Connection&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MySQL&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;successful&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Save query results into a DataFrame called 'results'&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;results&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Select * from a1987;&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typeof&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;results&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataFrame&lt;/span&gt;  &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;methods&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DataFrame&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;see&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;constructors&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Save query results to a file, tab-delimited (default)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;julia&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Select * from a1987;&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;output.tab&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;delim&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;'\t'&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;summary&quot;&gt;Summary&lt;/h3&gt;

&lt;p&gt;Overall, importing data into Julia is no easier or more difficult than in any other language. The biggest thing I’ve noticed thus far is that Julia is a bit less efficient than Python/pandas or R in terms of the amount of RAM needed to store data. In my experience, this is really only an issue once you are working with 1GB+ files (of course, depending on the resources available to you on your machine).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Edit 3/25/2016: A much more up-to-date method of &lt;a href=&quot;https://cbrownley.wordpress.com/2015/05/29/reading_writing_csv_with_r_python_julia/&quot; target=&quot;_blank&quot;&gt;reading CSV data into Julia&lt;/a&gt; can be found at this &lt;a href=&quot;https://cbrownley.wordpress.com/2015/05/29/reading_writing_csv_with_r_python_julia/&quot; target=&quot;_blank&quot;&gt;blog post by Clinton Brownley&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Hadoop Streaming with Amazon Elastic MapReduce, Python and mrjob</title>
        
          <description>&lt;p&gt;In a previous rant about &lt;a title=&quot;Data Science &amp;amp; Innovation&quot; href=&quot;http://randyzwitch.com/data-science-innovation/&quot; target=&quot;_blank&quot;&gt;data science &amp;amp; innovation&lt;/a&gt;, I made reference to a problem I’m having at work where I wanted to classify roughly a quarter-billion URLs by predicted website content (without having to actually visit the website). A few colleagues have asked how you go about even starting to solve a problem like that, and the answer is &lt;em&gt;massively parallel processing&lt;/em&gt;.&lt;/p&gt;

</description>
        
        <pubDate>Wed, 31 Jul 2013 12:34:58 +0000</pubDate>
        <link>
        http://randyzwitch.com/amazon-elastic-map-reduce-mrjob-python/</link>
        <guid isPermaLink="true">http://randyzwitch.com/amazon-elastic-map-reduce-mrjob-python/</guid>
        <content type="html" xml:base="/amazon-elastic-map-reduce-mrjob-python/">&lt;p&gt;In a previous rant about &lt;a title=&quot;Data Science &amp;amp; Innovation&quot; href=&quot;http://randyzwitch.com/data-science-innovation/&quot; target=&quot;_blank&quot;&gt;data science &amp;amp; innovation&lt;/a&gt;, I made reference to a problem I’m having at work where I wanted to classify roughly a quarter-billion URLs by predicted website content (without having to actually visit the website). A few colleagues have asked how you go about even starting to solve a problem like that, and the answer is &lt;em&gt;massively parallel processing&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;attacking-the-problem-using-a-local-machine&quot;&gt;Attacking the problem using a local machine&lt;/h2&gt;

&lt;p&gt;In order to classify the URLs, the first thing that’s needed is a customized dictionary of words relative to our company’s subject matter. When you have a corpus of words that are already defined (such as a digitized book), finding the population of words is relatively simple: split the text based on spaces &amp;amp; punctuation and you’re more or less done. However, with a URL, you have one continuous string with no word boundaries. One way to try and find the boundaries would be the following in Python:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;collections&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;nltk&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Dictionary from Unix
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;internal_dict&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/usr/share/dict/words&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#Stopwords corpus from NLTK
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stopwords&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nltk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;corpus&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stopwords&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'english'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Build english_dictionary of prospect words
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;english_dictionary&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;internal_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stopwords&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;#make sure only &quot;big&quot;, useful words included
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;english_dictionary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rstrip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#How many words are in the complete dictionary?        
&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;english_dictionary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Import urls
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;line&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/path/to/urls/file.csv&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;#Build counter dictionary
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wordcount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;collections&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;english_dictionary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;#Loop over all possible English words
&lt;/span&gt;  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;#Loop over all urls in list
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;wordcount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;#Once word found, add to dictionary counter&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The problem with approaching the word search in this manner is that you are limited to the power of your local machine. In my case, with a relatively new MacBook Pro, I can process 1,000 lines in 19 seconds as a single-threaded process. At 250,000,000 URLs, that’s 4.75 million seconds…79,167 minutes…1,319 hours…55 days…&lt;strong&gt;1.8 months! &lt;/strong&gt; Of course, 1.8 months is for the data I have &lt;span style=&quot;text-decoration: underline;&quot;&gt;now&lt;/span&gt;, which is accumulating every second of every day. Clearly, to find just the custom dictionary of words, I’ll need to employ MANY more computers/tasks.&lt;/p&gt;
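
&lt;p&gt;The extrapolation is simple enough to check with a few lines of Python. This is only a back-of-the-envelope sketch: the inputs are the 1,000-lines-in-19-seconds timing and the 250,000,000 URL count quoted above, and the helper name is made up for illustration.&lt;/p&gt;

```python
def estimate_runtime(total_urls, seconds_per_batch, batch_size=1000):
    """Extrapolate total runtime from one timed sample batch."""
    seconds = total_urls / batch_size * seconds_per_batch
    return {
        "seconds": seconds,
        "minutes": seconds / 60,
        "hours": seconds / 3600,
        "days": seconds / 86400,
    }


# 250 million URLs at 19 seconds per 1,000 URLs, single-threaded.
estimate = estimate_runtime(total_urls=250_000_000, seconds_per_batch=19)
for unit, value in estimate.items():
    print(f"{unit}: {value:,.0f}")
```

&lt;p&gt;The estimate assumes the per-URL cost stays constant as the file grows, which is reasonable here because each URL is compared against the same fixed dictionary.&lt;/p&gt;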

&lt;h2 id=&quot;amazon-elasticmapreduce--lots-of-horsepower&quot;&gt;Amazon Elastic MapReduce = Lots of Horsepower&lt;/h2&gt;

&lt;p&gt;One thing you might notice about the Python code above is that the two loops have no real reason to be run serially; each comparison of URL and dictionary word can be run independently of the others (a problem often referred to as “&lt;a title=&quot;Embarrassingly parallel&quot; href=&quot;http://english.stackexchange.com/questions/83677/what-is-embarrassing-about-an-embarrassingly-parallel-problem&quot; target=&quot;_blank&quot;&gt;embarrassingly parallel&lt;/a&gt;”). This type of programming pattern is well suited to running on a Hadoop cluster. With Amazon Elastic MapReduce (EMR), we can provision tens, hundreds, even thousands of compute instances to process this URL-dictionary word comparison, and thus get our answer much faster. The one downside of using Amazon EMR to access Hadoop is that EMR expects to get a Java &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.jar&lt;/code&gt; file containing your MapReduce code. Luckily, there is a Python package called &lt;a title=&quot;MRjob Python package&quot; href=&quot;http://pythonhosted.org/mrjob/&quot; target=&quot;_blank&quot;&gt;MRJob&lt;/a&gt; that runs your Python code through Hadoop Streaming, so that users don’t have to switch languages to get massively parallel processing.&lt;/p&gt;

&lt;h2 id=&quot;writing-mapreduce-code&quot;&gt;Writing MapReduce code&lt;/h2&gt;

&lt;p&gt;The Python code above, keeping a tally of words &amp;amp; number of occurrences, IS a version of the MapReduce coding paradigm. Going through the looping process to do the comparison is the “Map” portion of the code and the sum of the word values is the “Reduce” step. However, in order to use EMR, we need to modify the above code to remove the loop over URLs:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;mrjob.job&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MRJob&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MRWordCounter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MRJob&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;    

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;mapper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;english_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;english_dict&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'aal'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'aalii'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'aam'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'aani'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'zythum'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'zyzomys'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'zyzzogeton'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;english_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;reducer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;occurrences&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;occurrences&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__name__&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'__main__'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;MRWordCounter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The reason we remove the loop over the lines of the URL file is that it is implicit in the EMR/Hadoop style of processing. We will specify the file that we want to process when we run our Python script, then EMR will distribute the URLs file across all the Hadoop nodes. Essentially, our 250,000,000 lines of URLs will become 1,000 tasks of 250,000 URLs each (assuming 125 nodes of 8 tasks each).&lt;/p&gt;
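
&lt;p&gt;To see why dropping the URL loop is safe, it can help to simulate the map, shuffle and reduce steps on a single machine. The following is a minimal sketch in plain Python, not how mrjob or Hadoop is implemented: the word list, the URLs and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_locally&lt;/code&gt; helper are all made up for illustration, and the round-robin chunking only loosely mimics how Hadoop splits an input file.&lt;/p&gt;

```python
from collections import defaultdict

# Tiny stand-ins for the real inputs (both lists are hypothetical).
WORDS = ["sports", "travel", "recipe"]
URLS = [
    "example.com/sportsnews",
    "example.com/travel/sports",
    "example.com/recipebox",
    "example.com/about",
]


def mapper(url):
    """Map step: emit (word, 1) for every dictionary word found in one URL."""
    for word in WORDS:
        if word in url:
            yield word, 1


def reducer(word, occurrences):
    """Reduce step: sum the 1s emitted for a single word."""
    return word, sum(occurrences)


def run_locally(urls, num_tasks):
    # Split the input into chunks; each chunk stands in for one map task.
    chunks = [urls[i::num_tasks] for i in range(num_tasks)]
    # Shuffle: group every emitted value by its key across all chunks.
    grouped = defaultdict(list)
    for chunk in chunks:
        for url in chunk:
            for word, count in mapper(url):
                grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in grouped.items())


counts = run_locally(URLS, num_tasks=2)
print(counts)
```

&lt;p&gt;Because the mapper only ever looks at one URL at a time, the totals are the same no matter how many chunks the input is split into, which is what lets Hadoop spread the work across as many tasks as you are willing to pay for.&lt;/p&gt;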

&lt;h3 id=&quot;calling-emr-from-the-python-command-line&quot;&gt;Calling EMR from the Python command line&lt;/h3&gt;

&lt;p&gt;Once we have our Python MRJob code written, we can submit our code to EMR from the command line. Here’s what an example command looks like:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;python ~/Desktop/mapreduce.py &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; emr s3://&amp;lt;s3bucket&amp;gt;/url_unload/0000_part_01 &lt;span class=&quot;nt&quot;&gt;--output-dir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;s3://&amp;lt;s3bucket&amp;gt;/url_output &lt;span class=&quot;nt&quot;&gt;--num-ec2-instances&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;81
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;There are many more options available in the MRJob package, so I highly suggest that you read the &lt;a title=&quot;MRJobs EMR options&quot; href=&quot;http://pythonhosted.org/mrjob/guides/emr-quickstart.html&quot; target=&quot;_blank&quot;&gt;documentation for EMR options&lt;/a&gt;. Also note that MRJob uses a configuration file to hold the options for its “runners”, including the one for EMR. Yelp (the maker of the MRJob package) has posted an &lt;a title=&quot;MRJob .conf file&quot; href=&quot;https://github.com/Yelp/mrjob/blob/master/mrjob.conf.example&quot; target=&quot;_blank&quot;&gt;example of the mrjob.conf file&lt;/a&gt; with the most common options to use. In this file, you can specify your Amazon API keys, the type of instances you want to use (I use c1.xlarge spot instances for the most part), where your SSH keys are located and so on.&lt;/p&gt;

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;p&gt;In terms of performance, I have 8 files of URLs at 5GB each (~17.5 million lines per file) that I’m running through the MRJob code above. The first file was run with 19 c1.xlarge instances, creating on average 133 mappers and 65 reducers and taking 917 minutes (&lt;em&gt;3.14 seconds/1000 lines&lt;/em&gt;). The second file was run with 80 c1.xlarge instances, creating 560 mappers and 160 reducers and taking 218 minutes (&lt;em&gt;0.75 seconds/1000 lines&lt;/em&gt;). So using roughly four times as many instances leads to roughly one-fourth of the run-time.&lt;/p&gt;
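
&lt;p&gt;The per-1,000-line figures and the scaling claim can be reproduced from the run times quoted above. This is just a quick arithmetic sketch; the function name is made up for illustration, and the 17.5 million lines per file is the approximate figure from the text.&lt;/p&gt;

```python
# Approximate number of URL lines in each 5GB file, as quoted in the text.
LINES_PER_FILE = 17_500_000


def seconds_per_thousand_lines(minutes, lines=LINES_PER_FILE):
    """Convert a total run time in minutes to seconds per 1,000 lines."""
    return minutes * 60 / (lines / 1000)


first = seconds_per_thousand_lines(917)   # run with 19 instances
second = seconds_per_thousand_lines(218)  # run with 80 instances

print(f"19 instances: {first:.2f} s per 1,000 lines")
print(f"80 instances: {second:.2f} s per 1,000 lines")
print(f"instance ratio: {80 / 19:.2f}x, speedup: {917 / 218:.2f}x")
```

&lt;p&gt;The instance ratio and the speedup come out nearly identical, which is what near-linear scaling looks like for this job.&lt;/p&gt;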

&lt;p&gt;For the most part, you can expect linear performance in terms of adding nodes to your EMR cluster. I know at some point, Hadoop will decide that it no longer needs to add any more mappers/reducers, but I haven’t had the desire to find out exactly how many I’d need to add to get to that point! 🙂&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>A Beginner's Look at Julia</title>
        
          <description>&lt;p&gt;Over the past month or so, I’ve been playing with a new scientific programming language called ‘&lt;a title=&quot;Julia language&quot; href=&quot;http://julialang.org/&quot; target=&quot;_blank&quot;&gt;Julia&lt;/a&gt;’, which aims to be a high-level language with performance approaching that of C. With that goal in mind, Julia could be a replacement for the ‘multi-language’ problem of needing to move between R, Python, MATLAB, C, Fortran, Scala, etc. within a single scientific programming project.  Here are some observations that might be helpful for others looking to get started with Julia.&lt;/p&gt;

</description>
        
        <pubDate>Tue, 23 Jul 2013 12:16:34 +0000</pubDate>
        <link>
        http://randyzwitch.com/julia-language-beginners/</link>
        <guid isPermaLink="true">http://randyzwitch.com/julia-language-beginners/</guid>
        <content type="html" xml:base="/julia-language-beginners/">&lt;p&gt;Over the past month or so, I’ve been playing with a new scientific programming language called ‘&lt;a title=&quot;Julia language&quot; href=&quot;http://julialang.org/&quot; target=&quot;_blank&quot;&gt;Julia&lt;/a&gt;’, which aims to be a high-level language with performance approaching that of C. With that goal in mind, Julia could be a replacement for the ‘multi-language’ problem of needing to move between R, Python, MATLAB, C, Fortran, Scala, etc. within a single scientific programming project.  Here are some observations that might be helpful for others looking to get started with Julia.&lt;/p&gt;

&lt;h3 id=&quot;get-used-to-git-and-make&quot;&gt;Get used to ‘Git’ and ‘make’&lt;/h3&gt;

&lt;p&gt;While there are &lt;a title=&quot;Julia language downloads&quot; href=&quot;http://julialang.org/downloads/&quot; target=&quot;_blank&quot;&gt;pre-built binaries&lt;/a&gt; for Julia, due to the rapid pace of development, it’s best to build Julia from source. To be able to keep up with the literally dozens of code changes per day, you can clone the &lt;a title=&quot;Julia GitHub repo&quot; href=&quot;https://github.com/JuliaLang/julia&quot; target=&quot;_blank&quot;&gt;Julia GitHub repository&lt;/a&gt; to your local machine. If you use one of the &lt;a title=&quot;GitHub GUI downloads&quot; href=&quot;http://git-scm.com/downloads/guis&quot; target=&quot;_blank&quot;&gt;GitHub GUIs&lt;/a&gt;, this is as easy as hitting the ‘Sync Branch’ button to receive all of the newest code updates.&lt;/p&gt;

&lt;p&gt;To install Julia, you need to compile the code. The instructions for each supported operating system are listed on the &lt;a title=&quot;Julia GitHub repo&quot; href=&quot;https://github.com/JuliaLang/julia&quot; target=&quot;_blank&quot;&gt;Julia GitHub page&lt;/a&gt;. For Mac users, use Terminal to navigate to the directory where you cloned Julia, then run the following command, where ‘n’ refers to the number of concurrent processes you want the compiler to use:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;make &lt;span class=&quot;nt&quot;&gt;-j&lt;/span&gt; n
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;I use 8 concurrent processes on a 2013 MacBook Pro and it works pretty well, certainly much faster than a single process. Note that the first time you run the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make&lt;/code&gt; command, the build will take much longer than successive builds, as Julia downloads all of the libraries it requires. After the first build, you can just run the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make&lt;/code&gt; command with a single process, as the code updates don’t take very long to build.&lt;/p&gt;

&lt;p&gt;Package management is also done via GitHub. To add &lt;a title=&quot;Julia packages&quot; href=&quot;http://pkg.julialang.org/&quot; target=&quot;_blank&quot;&gt;Julia packages&lt;/a&gt; to your install, you use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Pkg.add()&lt;/code&gt; function, with the package name in double-quotes.&lt;/p&gt;

&lt;h3 id=&quot;julia-code-feels-very-familiar&quot;&gt;Julia code feels very familiar&lt;/h3&gt;

&lt;h4 id=&quot;text-file-import&quot;&gt;Text file import&lt;/h4&gt;

&lt;p&gt;Although the &lt;a title=&quot;Julia documentation&quot; href=&quot;http://docs.julialang.org/en/latest/manual/introduction.html#man-introduction-1&quot; target=&quot;_blank&quot;&gt;Julia documentation&lt;/a&gt; makes numerous references to MATLAB in terms of code similarity, Julia feels very familiar to me as an R and Python user. Take reading a .csv file into a dataframe and finding the dimensions of the resulting object:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;#R: Read in 1987.csv from airline dataset into a dataframe&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#No import statement needed to create a dataframe in R&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;airline1987&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;csv&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;~/airline/1987.csv&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;airline1987&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1311826&lt;/span&gt;      &lt;span class=&quot;mi&quot;&gt;29&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Python: use pandas to create a dataframe&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;airline1987&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_csv&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/Users/randyzwitch/airline/1987.csv&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;airline1987&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Out&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1311826&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;29&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Julia: use DataFrames to create a dataframe&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DataFrames&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;airline1987&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;readtable&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/Users/randyzwitch/airline/1987.csv&quot;&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;airline1987&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1311826&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;29&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In each language, the basic syntax is to call a ‘read’ function and specify the .csv filename; the function’s defaults then handle a basic file. I could also have specified other keyword arguments, but for the purposes of this example I kept it simple.&lt;/p&gt;
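
&lt;p&gt;To make the “call a read function, then check the dimensions” pattern concrete without needing the airline files, here is a minimal Python sketch that uses only the standard-library &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;csv&lt;/code&gt; module on a tiny in-memory sample. The sample rows and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table_shape&lt;/code&gt; helper are made up for illustration; they are not part of pandas, DataFrames, or the airline dataset.&lt;/p&gt;

```python
import csv
import io

# Hypothetical three-row sample standing in for a real .csv file
SAMPLE_CSV = """Year,Month,DayofMonth,DepDelay
1987,10,14,11
1987,10,15,-1
1987,10,17,11
"""


def table_shape(handle, delimiter=","):
    """Return (rows, columns) for delimited text, not counting the header row."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)               # first line holds the column names
    row_count = sum(1 for _ in reader)  # remaining lines are data rows
    return row_count, len(header)


print(table_shape(io.StringIO(SAMPLE_CSV)))
```

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delimiter&lt;/code&gt; keyword plays the same role as the optional keyword arguments mentioned above: the default covers a basic file, and you only override it when the file calls for it.&lt;/p&gt;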

&lt;h4 id=&quot;looping&quot;&gt;Looping&lt;/h4&gt;

&lt;p&gt;Looping in Julia is similar to other languages. Python requires consistent indentation for each level of a loop, with a colon ending each ‘for’ and ‘if’ line. And although you generally don’t use many loops in R, writing one requires parentheses and curly braces.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;#Python looping to create a term-frequency dictionary&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;collections&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Counter&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;term_freq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Counter&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;english_dictionary&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url_list&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;term_freq&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Julia looping to create a term-frequency dictionary&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;term_freq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Dict&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int64&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;}()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;english_dictionary&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url_list&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;search&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;term_freq&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;term_freq&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If you’re coming from a Python background, you can see that there’s not a ton of difference between Python looping into a dictionary vs. Julia. The biggest differences are the use of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;end&lt;/code&gt; control-flow word and that Julia doesn’t currently have the convenience “Counter” object type. R doesn’t natively have a dictionary type, but you can add a similar concept using the &lt;a title=&quot;CRAN hash package&quot; href=&quot;http://cran.r-project.org/web/packages/hash/&quot; target=&quot;_blank&quot;&gt;hash&lt;/a&gt; package.&lt;/p&gt;
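
&lt;p&gt;If you want to run the Python version of the loop end to end, the following sketch fills in the two inputs with made-up values. The contents of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;english_dictionary&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;url_list&lt;/code&gt; are assumptions for illustration only (the real lists aren’t shown in this post), and the membership test checks each word against one URL at a time.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical inputs; the post does not show the real word list or URLs
english_dictionary = ["data", "julia", "python"]
url_list = [
    "http://example.com/julia-for-data-science",
    "http://example.com/python-data-tools",
    "http://example.com/about",
]

term_freq = Counter()
for word in english_dictionary:
    for url in url_list:
        if word in url:  # substring test against one URL at a time
            term_freq[word] += 1

print(term_freq.most_common())
```

&lt;p&gt;Because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Counter&lt;/code&gt; treats missing keys as zero, there is no need for the explicit default lookup that the Julia version does with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get&lt;/code&gt;.&lt;/p&gt;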

&lt;h4 id=&quot;vectorization&quot;&gt;Vectorization&lt;/h4&gt;

&lt;p&gt;While not required to achieve high performance, Julia also provides the &lt;a title=&quot;Is looping as a programming construct bad?&quot; href=&quot;http://slendrmeans.wordpress.com/2013/05/11/julia-loops/&quot; target=&quot;_blank&quot;&gt;functional programming construct of vectorization and list comprehensions&lt;/a&gt;. In R, you use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*apply&lt;/code&gt; family of functions instead of loops in order to &lt;a title=&quot;Functional programming in R&quot; href=&quot;https://github.com/hadley/devtools/wiki/Functional-programming&quot; target=&quot;_blank&quot;&gt;apply a function to multiple elements in a list&lt;/a&gt;. In Python, there are the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;map&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reduce&lt;/code&gt; functions, but there is also the concept of list comprehensions. In Julia, both of the aforementioned functionalities are possible.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-julia&quot; data-lang=&quot;julia&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;#Cube every number from 1 to 100&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Python map function (list() materializes the result in Python 3; range excludes its end value)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cubes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)))&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Python list comprehension&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cubes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)]&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#R sapply function&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cubes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sapply&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Julia map function&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cubes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Julia list comprehension&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cubes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;x&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;x&quot;&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In each case, the syntax is &lt;em&gt;just about&lt;/em&gt; the same to apply a function across a list/array of numbers.&lt;/p&gt;

&lt;h3 id=&quot;a-small-but-intense-community&quot;&gt;A small, but intense community&lt;/h3&gt;

&lt;p&gt;One thing that’s important to note about Julia at this stage is that it’s very early. If you’re going to be messing around with Julia, there’s going to be a lot of alone-time experimenting and reading the &lt;a title=&quot;Julia documentation&quot; href=&quot;http://docs.julialang.org/en/latest/&quot; target=&quot;_blank&quot;&gt;Julia documentation&lt;/a&gt;. There are also several other resources including a &lt;a title=&quot;Julia users Google group&quot; href=&quot;https://groups.google.com/forum/?fromgroups=#!forum/julia-users&quot; target=&quot;_blank&quot;&gt;Julia-Users Google group&lt;/a&gt;, &lt;a title=&quot;Julia for R programmers&quot; href=&quot;http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf&quot; target=&quot;_blank&quot;&gt;Julia for R programmers&lt;/a&gt;, individual discussions on GitHub in the ‘Issues’ section of each Julia package, and a few tutorials floating around (&lt;a title=&quot;Julia tutorials&quot; href=&quot;http://forio.com/julia/tutorials-list&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt; and &lt;a title=&quot;Julia meta tutorial&quot; href=&quot;http://datacommunitydc.org/blog/2013/07/a-julia-meta-tutorial/&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Beyond just the written examples though, I’ve found that the budding Julia community is very helpful and willing in terms of answering questions. I’ve been bugging the hell out of &lt;a title=&quot;John Myles White&quot; href=&quot;http://www.johnmyleswhite.com/&quot; target=&quot;_blank&quot;&gt;John Myles White&lt;/a&gt; and he hasn’t complained (yet!), and even when code issues are raised through the users group or on GitHub, ultimately everyone has been very respectful and eager to help. So don’t be intimidated by the fact that Julia has a very MIT and Ph.D-ness to it…jump right in and migrate some of your favorite code over from other languages.&lt;/p&gt;

&lt;p&gt;While I haven’t moved to using Julia for my everyday workload, I’m gaining enough facility with it that I’m starting to consider using Julia for selected projects. Once the language matures a bit more, &lt;del&gt;&lt;a title=&quot;Julia Studio&quot; href=&quot;http://forio.com/julia/&quot; target=&quot;_blank&quot;&gt;JuliaStudio&lt;/a&gt; starts to approach &lt;a title=&quot;RStudio&quot; href=&quot;http://www.rstudio.com/&quot; target=&quot;_blank&quot;&gt;RStudio&lt;/a&gt; in terms of functionality&lt;/del&gt;, and I get more familiar with the language in general, I can see Julia taking over for at least one if not all of my scientific programming languages.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Getting Started Using Hadoop, Part 3: Loading Data</title>
        
          <description>&lt;p&gt;In part 2 of the “Getting Started Using Hadoop” series, I discussed how to &lt;a title=&quot;Build a Hadoop cluster Amazon EC2&quot; href=&quot;http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-2/&quot; target=&quot;_blank&quot;&gt;build a Hadoop cluster on Amazon EC2&lt;/a&gt; using Cloudera CDH. This post will cover how to get your data into the Hadoop Distributed File System (HDFS) using the publicly available “&lt;a title=&quot;Airline dataset&quot; href=&quot;http://stat-computing.org/dataexpo/2009/the-data.html&quot; target=&quot;_blank&quot;&gt;Airline Dataset&lt;/a&gt;”. While there are multiple ways to upload data into HDFS, this post will only cover the easiest method, which is to use the Hue ‘File Browser’ interface.&lt;/p&gt;

</description>
        
        <pubDate>Wed, 22 May 2013 11:39:19 +0000</pubDate>
        <link>
        http://randyzwitch.com/uploading-data-hadoop-amazon-ec2-cloudera-part-3/</link>
        <guid isPermaLink="true">http://randyzwitch.com/uploading-data-hadoop-amazon-ec2-cloudera-part-3/</guid>
        <content type="html" xml:base="/uploading-data-hadoop-amazon-ec2-cloudera-part-3/">&lt;p&gt;In part 2 of the “Getting Started Using Hadoop” series, I discussed how to &lt;a title=&quot;Build a Hadoop cluster Amazon EC2&quot; href=&quot;http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-2/&quot; target=&quot;_blank&quot;&gt;build a Hadoop cluster on Amazon EC2&lt;/a&gt; using Cloudera CDH. This post will cover how to get your data into the Hadoop Distributed File System (HDFS) using the publicly available “&lt;a title=&quot;Airline dataset&quot; href=&quot;http://stat-computing.org/dataexpo/2009/the-data.html&quot; target=&quot;_blank&quot;&gt;Airline Dataset&lt;/a&gt;”. While there are multiple ways to upload data into HDFS, this post will only cover the easiest method, which is to use the Hue ‘File Browser’ interface.&lt;/p&gt;

&lt;h2 id=&quot;loading-data-into-hdfs-using-hue&quot;&gt;Loading data into HDFS using Hue&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/05/hadoop-hue-file-browser-e1367455309802.png&quot; alt=&quot;hadoop-hue-file-browser&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;wp-caption-text&quot;&gt;
'File Browser' in Hue (Cloudera)
&lt;/p&gt;

&lt;p&gt;Loading data into Hadoop using Hue is by far the easiest way to get started. Hue offers a GUI with a “File Browser” like the one you normally see in Windows or OS X. The workflow here is to download each year of Airline data to your local machine, then upload each file using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Upload -&amp;gt; Files&lt;/code&gt; menu drop-down.&lt;/p&gt;

&lt;p&gt;While downloading files from one site on the Internet, then uploading files to somewhere else on the Internet is somewhat wasteful of time and bandwidth, as a tutorial to &lt;em&gt;get started&lt;/em&gt; with Hadoop this isn’t the worst thing in the world. For those of you who are OSX users and comfortable using Bash from the command line, here’s some code so you don’t have to babysit the download process:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c&quot;&gt;# Download and decompress each year of the airline data (1987 through 2008)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;i &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;1987..2008&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;do&lt;/span&gt;
  curl http://stat-computing.org/dataexpo/2009/&lt;span class=&quot;nv&quot;&gt;$i&lt;/span&gt;.csv.bz2 &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$i&lt;/span&gt;.csv.bz2
  bunzip2 &lt;span class=&quot;nv&quot;&gt;$i&lt;/span&gt;.csv.bz2
&lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Because you are going to be uploading a bunch of text files to your Hadoop cluster, I’d recommend zipping the files prior to upload. It doesn’t matter whether you use .zip or .gz files, with one key distinction: if you use &lt;strong&gt;.zip&lt;/strong&gt; files, you will upload using the &lt;span style=&quot;text-decoration: underline;&quot;&gt;“Zip Files”&lt;/span&gt; button in the File Browser; if you choose &lt;strong&gt;.gz&lt;/strong&gt;, then you must use the &lt;span style=&quot;text-decoration: underline;&quot;&gt;“Files”&lt;/span&gt; line in the File Browser. Not only will zipping the files make the upload faster, but it will also make sure you only need to do the process once (as opposed to hitting the upload button on each file). Using the .zip file upload process, you should see something like the following…a new folder with all of the files extracted automatically:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/05/hue-file-browser-unzipped.png&quot; alt=&quot;hue-file-browser-unzipped&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;wp-caption-text&quot;&gt;
.zip file automatically extracted into folder with files (Hortonworks)
&lt;/p&gt;
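
&lt;p&gt;The zipping step can also be scripted. As a rough sketch, assuming the yearly .csv files sit together in one local directory, Python’s standard-library &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zipfile&lt;/code&gt; module can bundle them into a single archive for the “Zip Files” upload. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zip_csv_files&lt;/code&gt; and the paths in the closing comment are hypothetical.&lt;/p&gt;

```python
import zipfile
from pathlib import Path


def zip_csv_files(source_dir, archive_path):
    """Bundle every .csv file in source_dir into one compressed .zip archive."""
    csv_paths = sorted(Path(source_dir).glob("*.csv"))
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for csv_path in csv_paths:
            # arcname drops the directory part so files sit at the archive root
            archive.write(csv_path, arcname=csv_path.name)
    return [csv_path.name for csv_path in csv_paths]


# Example call (both paths are assumptions about your local layout):
# zip_csv_files(Path.home() / "airline", "airline.zip")
```

&lt;p&gt;Storing the files at the archive root keeps the extracted folder flat, one file per year.&lt;/p&gt;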

&lt;h3 id=&quot;next-steps&quot;&gt;Next Steps&lt;/h3&gt;

&lt;p&gt;With the airline .csv files loaded for each year, we can use Pig or Hive to load the tables into a master dataset &amp;amp; schema. That will be the topic of the next tutorial.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Innovation Will Never Be At The Push Of A Button</title>
        
          <description>&lt;blockquote class=&quot;twitter-tweet&quot; data-conversation=&quot;none&quot;&gt;
  &lt;p&gt;
    @&lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;randyzwitch&lt;/a&gt; @&lt;a href=&quot;https://twitter.com/benjamingaines&quot;&gt;benjamingaines&lt;/a&gt; @&lt;a href=&quot;https://twitter.com/usujason&quot;&gt;usujason&lt;/a&gt; I am envisioning the data science equivalent of an autonomous vehicle pileup.
  &lt;/p&gt;

&lt;/blockquote&gt;
</description>
        
        <pubDate>Fri, 17 May 2013 10:28:47 +0000</pubDate>
        <link>
        http://randyzwitch.com/data-science-innovation/</link>
        <guid isPermaLink="true">http://randyzwitch.com/data-science-innovation/</guid>
        <content type="html" xml:base="/data-science-innovation/">&lt;blockquote class=&quot;twitter-tweet&quot; data-conversation=&quot;none&quot;&gt;
  &lt;p&gt;
    @&lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;randyzwitch&lt;/a&gt; @&lt;a href=&quot;https://twitter.com/benjamingaines&quot;&gt;benjamingaines&lt;/a&gt; @&lt;a href=&quot;https://twitter.com/usujason&quot;&gt;usujason&lt;/a&gt; I am envisioning the data science equivalent of an autonomous vehicle pileup.
  &lt;/p&gt;

  &lt;p&gt;
    — Todd Belcher (@toddmetrics) &lt;a href=&quot;https://twitter.com/toddmetrics/status/335030724375756800&quot;&gt;May 16, 2013&lt;/a&gt;
  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Recently, I’ve been getting my blood pressure up reading (marketing) articles about “big data” and “data science”.  What saddens me about the whole discussion is that there is the underlying premise that what is stopping companies from “harnessing the power of big data” is just the lack of an easy-to-use, push-button tool. Respectfully, if you believe this, you should bow out of the conversation altogether.&lt;/p&gt;

&lt;h3 id=&quot;math-is-hard-and-stuff&quot;&gt;Math is hard and stuff.&lt;/h3&gt;

&lt;p&gt;The first article that really bothered me is titled “&lt;a href=&quot;http://smartdatacollective.com/deanabbott/115886/do-predictive-modelers-need-know-math&quot;&gt;Do Predictive Modelers Need to Know Math?&lt;/a&gt;” This is a provocative title from a veteran in the data mining/data science industry, and his conclusion is basically ‘&lt;em&gt;Yes, but not everyone on the team needs to be able to hand-solve equations.&lt;/em&gt;’ I think that’s a fair point within the context of needing to understand the mathematical concepts behind algorithms, but not needing to be bogged down by notation.&lt;/p&gt;

&lt;p&gt;Extending that idea a little further, how far away from the math should a business be comfortable with an employee pushing the button on a machine learning algorithm? Should the CEO be building predictive models? The Intern? A Call Center Rep? For me, I think the answer falls back on the allegory of the &lt;a href=&quot;http://www.snopes.com/business/genius/where.asp&quot;&gt;highly-specialized tradesperson&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Driver: “How can you charge $100 for five minutes’ work? All you did was put a bolt on and turn the wrench a few times!”&lt;/p&gt;

  &lt;p&gt;Mechanic: “I didn’t charge you for the parts, I charged you for knowing where to put the wrench…”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The value a data scientist brings to a business is not the ability to push the buttons in a GUI like &lt;a href=&quot;http://rattle.togaware.com/&quot;&gt;rattle for R&lt;/a&gt;, &lt;a href=&quot;http://www.cs.waikato.ac.nz/ml/weka/&quot;&gt;Weka&lt;/a&gt;, &lt;a href=&quot;http://www.angoss.com/predictive-analytics-software/products/data-analysis-software&quot;&gt;KnowledgeSeeker&lt;/a&gt;, or &lt;a href=&quot;http://www.sas.com/technologies/analytics/datamining/miner/&quot;&gt;SAS Enterprise Miner&lt;/a&gt;. What your data scientist brings to the table is knowing the underlying assumptions that go into a model, how the algorithm works, which algorithm is appropriate for the business problem being solved and &lt;em&gt;how to tell when the model/algorithm has failed&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Of all of the things listed, the experience of knowing when the model can fail or has failed is what you’re paying for. That knowledge doesn’t come from just pushing the GUI buttons a bunch of times. And if you’re making million-dollar decisions based on an algorithm, it’s worth paying the salary for a person who really understands the model.&lt;/p&gt;

&lt;h3 id=&quot;hire-a-mathematician-get-a-programmer-free&quot;&gt;Hire a mathematician, get a programmer free&lt;/h3&gt;

&lt;p&gt;The next article that bothered me is the debate &lt;a href=&quot;http://www.zdnet.com/debate/business-analytics-do-we-need-data-scientists/10119786/rebuttal/#skip-intro&quot;&gt;“Business Analytics: Do we need data scientists?”&lt;/a&gt; The &lt;em&gt;No&lt;/em&gt; argument boils down to the idea that only things that can be made easy and are sufficiently developed are useful/valuable. Thus, because a general analyst can’t use Excel, but rather might need to write a SQL query or write a program to put together a dataset, the problem-domain is too difficult. The &lt;em&gt;No&lt;/em&gt; debater also describes data scientists as “adversarial” and “pretentious”, accuses them of project “snobbery”, etc.&lt;/p&gt;

&lt;p&gt;But here’s the thing…the problem-domain isn’t particularly difficult if you hire someone with above-average math proficiency. Any decent graduate program in mathematics, statistics, computer science, economics, finance, psychology and others will be using data through programming. Now, the languages may vary between Java, R, Python, Matlab, C++, SAS, Octave, Eviews or others, but the language doesn’t matter, they’ll learn whatever language your company is using once you hire them. They also will learn the systems you are using to store your data, whether it’s a standard relational database, a NoSQL database, or a parallel processing platform like Hadoop.&lt;/p&gt;

&lt;p&gt;How can I be certain that the math person you hire will be able to learn all that’s necessary for data science? Because the type of person who likes math &amp;amp; programming is probably a ‘system builder’ type of person. The type of person who played with Legos growing up. The type of person who built their own desktop computer back in the day. The type of person who thinks &lt;em&gt;How It’s Made&lt;/em&gt; is much more interesting TV than mindless reality shows. The type of person who WANTS to know how a database is storing data, what new open-source technology is out there, and how many nodes they can connect together before their program won’t finish any faster.&lt;/p&gt;

&lt;p&gt;As far as the adversarial/pretentious/snobby comment, all I can say is I’ve never witnessed that. The people I know in the data science community are the nicest people, willing to share code, collaborate on ideas and talk until they lose their voice about how to solve an interesting problem.&lt;/p&gt;

&lt;h3 id=&quot;data-science-is-about-innovative-research-not-reporting&quot;&gt;Data Science is about innovative research, not reporting&lt;/h3&gt;

&lt;p&gt;I’ve read four academic papers this week. I’m not in graduate school.&lt;/p&gt;

&lt;p&gt;As some of you might know, I started a new position at a startup which provides real-time intelligence for the lead generation industry. As such, I’ve got access to billions of records of unstructured data and just as much structured data. And as a startup, there are several warts that need to be fixed with respect to data storage. So on any given day, I might go from a MySQL database to Amazon Redshift (columnar RDBMS), Amazon DynamoDB (NoSQL) and plain ol’ .csv files via Excel or massive .csv files on Amazon S3. To access this data, I’ve used a combination of R, Python, SQL Workbench, and MySQL Workbench using OSX, Ubuntu desktop and a ‘headless’ Ubuntu image on Amazon EC2.&lt;/p&gt;

&lt;p&gt;Why am I giving you all this jibber-jabber about research papers and tools? Because the idea of building a one-size-fits-all tool to solve the problem I’m working on just doesn’t make sense. And for that matter, I’m not even sure the problem I’m working on is worth solving. But that’s the thing…I don’t KNOW it’s not worth solving, so I need to find out. I’ve got a quarter-billion URLs that I think I can extract information from, just to give our clients ONE more data element to use to optimize their marketing strategies. There may be an already existing algorithm I can use, or maybe I’ll try this research paper on &lt;a href=&quot;http://research.microsoft.com/apps/mobile/publication.aspx?id=144355&quot;&gt;“word breaking”&lt;/a&gt; I found from Microsoft Research. Once I find out the answer, if it’s valuable, then I need to be able to implement my algorithm into our real-time API, because it’s likely whatever language I end up using isn’t going to be what our API is written in.&lt;/p&gt;

&lt;p&gt;So if these aren’t the type of problems you’re working on, then maybe there is an all-in-one tool out there for you to use (and that’s okay). But these are the types of edge-case problems that I think about when I think about “data science”, and as such, it will always be custom and ad-hoc. There are many awesome open-source tools I will use to help me along the way, but it will never make sense to build an easy-to-use tool for a problem a few dozen companies may ever need to know the answer to.&lt;/p&gt;

&lt;h3 id=&quot;use-the-data-you-have-to-do-something-extraordinary&quot;&gt;Use the data you have to do something extraordinary&lt;/h3&gt;

&lt;blockquote class=&quot;twitter-tweet&quot;&gt;
  &lt;p&gt;
    If you don't 'get' something, own it. Don't dump your dumb garbage into the world.
  &lt;/p&gt;

  &lt;p&gt;
    — marc maron (@marcmaron) &lt;a href=&quot;https://twitter.com/marcmaron/status/335167001427320832&quot;&gt;May 16, 2013&lt;/a&gt;
  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’m already 1100 words into this rant, so I’ll finish up with a few admissions. Yes, “data science” is a somewhat ridiculous name for the combination of advanced analytics and data engineering that it represents. And yes, there are plenty of vendors out there peddling hype about the grandeur of ‘Big Data’ and why every business MUST jump on board or be left behind.&lt;/p&gt;

&lt;p&gt;But rather than focusing on why something is “useless” or “stupid” or “hype”, just ask yourself “Can I solve the business problems I have today using the tools I currently have access to?” If the answer is yes, then great, get to work. If not, maybe you can find someone to help you get where you’re going (and that person may or may not call themselves a “Data Scientist”). Either way, let’s all move forward and do something extraordinary. It’s the least we can do for our customers.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Getting Started Using Hadoop, Part 2: Building a Cluster</title>
        
          <description>&lt;p&gt;In &lt;a title=&quot;Getting Started With Hadoop, Part 1&quot; href=&quot;http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/&quot; target=&quot;_blank&quot;&gt;Part 1 of this series&lt;/a&gt;, I discussed some of the basic concepts around Hadoop, specifically when it’s appropriate to use Hadoop to solve your data engineering problems and the terminology of the Hadoop eco-system. This post will cover how to install your own Hadoop cluster on Amazon EC2 using Cloudera Manager.&lt;/p&gt;

</description>
        
        <pubDate>Thu, 25 Apr 2013 17:33:48 +0000</pubDate>
        <link>
        http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-2/</link>
        <guid isPermaLink="true">http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-2/</guid>
        <content type="html" xml:base="/big-data-hadoop-amazon-ec2-cloudera-part-2/">&lt;p&gt;In &lt;a title=&quot;Getting Started With Hadoop, Part 1&quot; href=&quot;http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/&quot; target=&quot;_blank&quot;&gt;Part 1 of this series&lt;/a&gt;, I discussed some of the basic concepts around Hadoop, specifically when it’s appropriate to use Hadoop to solve your data engineering problems and the terminology of the Hadoop eco-system. This post will cover how to install your own Hadoop cluster on Amazon EC2 using Cloudera Manager.&lt;/p&gt;

&lt;p&gt;Like prior posts talking about &lt;a title=&quot;Amazon EC2 posts&quot; href=&quot;http://randyzwitch.com/tags/#amazon_ec2&quot; target=&quot;_blank&quot;&gt;Amazon EC2&lt;/a&gt;, this post assumes you have some basic facility with Linux, submitting instructions via the command line, etc. Because really, if you’re interested in Hadoop, using the command line probably isn’t a limiting factor!&lt;/p&gt;

&lt;h2 id=&quot;building-a-18-node-hadoop-cluster&quot;&gt;Building an 18-node Hadoop Cluster&lt;/h2&gt;

&lt;p&gt;The SlideShare presentation below shows the steps to building an 18-node Hadoop cluster, using a single &lt;em&gt;m1.large&lt;/em&gt; EC2 instance as the ‘Name Node’ and 18 &lt;em&gt;m1.medium&lt;/em&gt; EC2 instances as the ‘Data Nodes’.  I chose 18 nodes because according to &lt;a title=&quot;Cloudera Manager Example&quot; href=&quot;http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/&quot; target=&quot;_blank&quot;&gt;Cloudera&lt;/a&gt;, 20 is the maximum that can be activated at one time through the Amazon API, so let’s stay under the max to avoid any errors. It’s possible to add more instances later through the Cloudera Manager (up to 50 total), if so desired.&lt;/p&gt;

&lt;p&gt;Note that going through this tutorial will cost $2.40/hr at current prices ($0.24/hr per &lt;em&gt;m1.large&lt;/em&gt; instance and $0.12/hr per &lt;em&gt;m1.medium&lt;/em&gt; instance).&lt;/p&gt;
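
&lt;p&gt;As a quick check on that figure, here is the arithmetic as a small shell sketch. The variable names are mine; the prices and instance counts are simply the ones quoted above, so adjust them if pricing has changed.&lt;/p&gt;

```shell
# Hourly prices in cents, taken from the figures quoted above
large_cents=24      # the single m1.large (Name Node)
medium_cents=12     # each m1.medium (Data Node)
medium_count=18     # number of m1.medium instances

# Work in whole cents because shell arithmetic is integer-only
total_cents=$(( large_cents + medium_count * medium_cents ))

# Format the total as dollars and cents per hour
hourly_cost=$(printf '$%d.%02d/hr' $(( total_cents / 100 )) $(( total_cents % 100 )))
echo "$hourly_cost"
```

&lt;p&gt;Changing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;medium_count&lt;/code&gt; is an easy way to see what a smaller or larger cluster would cost per hour before launching anything.&lt;/p&gt;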

&lt;iframe style=&quot;border: 1px solid #CCC; border-width: 1px 1px 0; margin-bottom: 5px;&quot; src=&quot;http://www.slideshare.net/slideshow/embed_code/19982722&quot; height=&quot;421&quot; width=&quot;512&quot; allowfullscreen=&quot;&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Since the SlideShare presentation is potentially not so friendly on the eyes, I’ve also created a &lt;a title=&quot;Cloudera Amazon EC2 instructions&quot; href=&quot;http://randyzwitch.com/wp-content/uploads/2013/04/cloudera-amazon-ec2.pdf&quot; target=&quot;_blank&quot;&gt;PDF download&lt;/a&gt; that’s full resolution.&lt;/p&gt;

&lt;h2 id=&quot;next-steps&quot;&gt;Next Steps&lt;/h2&gt;

&lt;p&gt;Once you make it through all these steps to set up a Hadoop cluster, you are ready to do some analysis. &lt;a href=&quot;http://randyzwitch.com/uploading-data-hadoop-amazon-ec2-cloudera-part-3/&quot; title=&quot;Upload data into HDFS using Hue&quot;&gt;Part 3 of this tutorial&lt;/a&gt; will cover how to upload data into HDFS using Hue.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update, 7/13/13:&lt;/em&gt; As is the case with any open-source project, there have been several changes to the Cloudera Manager that make setup easier. When getting started, on the screen where it asks “Which Cloudera do you want to deploy?”, choose ‘Cloudera Standard’. Also, once you get to slides 13-14 where you click on the link to get started with Hue, the link now works correctly (you don’t need to search for the Amazon DNS any more!)&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Getting Started Using Hadoop, Part 1: Intro</title>
        
          <description>&lt;p&gt;For the last couple of days I’ve been at the eMetrics conference in San Francisco. There were several panels that discussed big data, both from an engineering standpoint as well as how to adopt newer technologies from a business perspective.&lt;/p&gt;

</description>
        
        <pubDate>Thu, 18 Apr 2013 12:47:15 +0000</pubDate>
        <link>
        http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/</link>
        <guid isPermaLink="true">http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/</guid>
        <content type="html" xml:base="/big-data-hadoop-amazon-ec2-cloudera-part-1/">&lt;p&gt;For the last couple of days I’ve been at the eMetrics conference in San Francisco. There were several panels that discussed big data, both from an engineering standpoint as well as how to adopt newer technologies from a business perspective.&lt;/p&gt;

&lt;p&gt;Unfortunately, there wasn’t a whole lot of practical information on how to actually get started using ‘big data’ technologies, of which Hadoop is one.  Luckily, it’s fairly easy to create a proof-of-concept Hadoop cluster using Amazon EC2 and Cloudera.&lt;/p&gt;

&lt;p&gt;This series will be at least 5 parts, as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Intro to Hadoop ecosystem and concepts&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-2/&quot; title=&quot;Setting up Hadoop Cluster on Amazon EC2&quot;&gt;Setting up Hadoop cluster on Amazon EC2&lt;/a&gt; using &lt;a title=&quot;Cloudera Amazon EC2&quot; href=&quot;http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/&quot; target=&quot;_blank&quot;&gt;Cloudera&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a title=&quot;Populating HDFS using Hue&quot; href=&quot;http://randyzwitch.com/uploading-data-hadoop-amazon-ec2-cloudera-part-3/&quot; target=&quot;_blank&quot;&gt;Populating HDFS with airline dataset&lt;/a&gt; files using &lt;a title=&quot;Hadoop Hue&quot; href=&quot;http://cloudera.github.io/hue/&quot; target=&quot;_blank&quot;&gt;Hue&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Use &lt;a title=&quot;Hive joins&quot; href=&quot;https://cwiki.apache.org/Hive/languagemanual-joins.html&quot; target=&quot;_blank&quot;&gt;Hive&lt;/a&gt; and/or &lt;a title=&quot;Apache Pig&quot; href=&quot;http://pig.apache.org/&quot; target=&quot;_blank&quot;&gt;Pig&lt;/a&gt; to &lt;a title=&quot;Creating Tables with Hive&quot; href=&quot;http://randyzwitch.com/hadoop-creating-tables-hive/&quot; target=&quot;_blank&quot;&gt;stack datasets into one master dataset&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a title=&quot;Analysis using Pig &amp;amp; Hive&quot; href=&quot;http://randyzwitch.com/getting-started-hadoop-hive-pig/&quot; target=&quot;_blank&quot;&gt;Doing analytics on the combined Airline dataset using Pig and/or Hive&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;My aim with this series is to &lt;em&gt;simply&lt;/em&gt; explain why you might want to consider using Hadoop for your data storage and processing. There’s a lot of marketing &amp;amp; vendor &lt;del&gt;bullshit&lt;/del&gt; excitement surrounding the term ‘big data’, so for this blog series, I’m just going to focus on the most important points for an analyst/marketer to understand. And other than this sentence, there will be no mentions of &lt;em&gt;MS Excel&lt;/em&gt; in terms of ‘big data’, which is &lt;a title=&quot;Use R not Excel&quot; href=&quot;http://blog.revolutionanalytics.com/2013/04/more-reasons-not-to-use-excel-for-modeling.html&quot; target=&quot;_blank&quot;&gt;barely an appropriate tool for analysis&lt;/a&gt; in general, let alone analysis at scale.&lt;/p&gt;

&lt;h3 id=&quot;what-is-hadoop--why-are-people-talking-about-it&quot;&gt;What Is Hadoop &amp;amp; Why Are People Talking About It?&lt;/h3&gt;

&lt;p&gt;At its simplest, Hadoop provides a parallel-processing computing framework for data storage and processing. A parallel-processing framework is important for enterprise-level analysis because of physical limitations on how quickly a single machine can process information.&lt;/p&gt;

&lt;p&gt;As an example, suppose you want to create a report that looks at 1 trillion daily credit card transactions. It’s possible to do your calculations on your local desktop using a tool like SAS. However, the amount of time to process that much data on a desktop with 8GB-16GB of RAM might be 8 hours, 10 hours….24 hours?! So an analyst trying to get an answer can start a SINGLE business question at 8am and &lt;em&gt;hope&lt;/em&gt; they get their answer before it’s time to leave at the end of the day. Suffice it to say, not a particularly efficient way to run a business.&lt;/p&gt;

&lt;p&gt;The solution might seem to add more processors and RAM to a desktop, but what happens when you add more users asking questions? Now you need an enterprise-class server such as Oracle or Teradata (and a few million dollars!). And for every terabyte of data you want to store, you’ll need a few thousand dollars. And that’s just for your nicely structured data…what happens when you want to start storing data such as free-form text that’s not so cleanly structured? Eventually, these types of &lt;em&gt;engineering questions&lt;/em&gt; lead you towards a solution like Hadoop.&lt;/p&gt;

&lt;p&gt;The reason why there is so much discussion around Hadoop as a data platform is that it solves the problems stated above: excessive time to process vast amounts of data and excessive cost of data storage. By using “commodity hardware” along with some fancy engineering, Hadoop provides an extremely cost-effective and flexible way to handle your enterprise data.&lt;/p&gt;

&lt;h3 id=&quot;if-hadoop-is-so-great-why-doesnt-everyone-use-it&quot;&gt;If Hadoop is so Great, Why Doesn’t Everyone Use It?&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;“Fast, Cheap And Good. Everyone should use Hadoop!” - Every vendor in marketplace&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Just like you (probably) don’t use a screwdriver to stir a pot of chicken soup, not every data storage and analysis problem requires the extreme flexibility that Hadoop can provide. From the example above with credit card transactions, a standard relational database might continue to be an acceptable solution if you’re just running a basic SQL query to sum across the rows. But once your data starts moving beyond “rows and columns” and into things such as free-form text, images, clickstream data…the more Hadoop makes sense.&lt;/p&gt;

&lt;p&gt;It’s a tautology, but you know you need a solution like Hadoop when you suspect you need a solution like Hadoop! If you already have a highly functioning data mart that answers your business questions, you probably don’t need to re-engineer everything &lt;em&gt;just because&lt;/em&gt;. If you’re an Internet startup trying to create the next Facebook, then a standard relational database probably won’t cut it.&lt;/p&gt;

&lt;p&gt;The best example I heard at eMetrics about the need for Hadoop was from Bob Page (now at Hortonworks, a Hadoop vendor): when Bob was at eBay, for the longest time they were throwing away data, specifically images from the listings. High storage costs were leading to an undesirable business outcome (deletion), and the data was unstructured (images)…so a Hadoop framework made sense to implement. Once implemented, eBay could look across years of auctions to answer their business questions.&lt;/p&gt;

&lt;h3 id=&quot;im-an-analyst-not-an-engineerwhats-the-minimum-i-need-to-know-to-get-started&quot;&gt;I’m An Analyst, Not An Engineer…What’s The Minimum I Need To Know To Get Started?&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;“MapReduce, Pigs, HCatalogs, Elephants, Bees, Zoos…Ooozie (Uzi’s)? WTF is everyone talking about?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you’ve made it this far and you’re not an engineer or DBA, you’re probably someone who’s interested in data science. You may be someone who already uses R, Python, Ruby or Java. Or, you’re a masochist. In any case, here are the minimum concepts I think you need to know to get started for later blog posts:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;span style=&quot;text-decoration: underline;&quot;&gt;MapReduce:&lt;/span&gt; Not explicitly a Hadoop idea, but the idea that data can be split into chunks by a key (“Map”) and then processed into information by one or more functions/transformations (“Reduce”). In the Hadoop sense, MapReduce is generally a reference to a “job” written in Java that performs a data transformation&lt;/li&gt;
  &lt;li&gt;&lt;span style=&quot;text-decoration: underline;&quot;&gt;HDFS:&lt;/span&gt; Hadoop Distributed File System. Raw data gets imported into HDFS (either structured or unstructured), then distributed around to all of the various nodes to allow for parallel processing&lt;/li&gt;
  &lt;li&gt;&lt;span style=&quot;text-decoration: underline;&quot;&gt;Hive:&lt;/span&gt; SQL-like interface so that analysts don’t have to write MapReduce code directly&lt;/li&gt;
  &lt;li&gt;&lt;span style=&quot;text-decoration: underline;&quot;&gt;Pig:&lt;/span&gt; A scripting language used for analysis. Generally, an analyst will use Hive and/or Pig to do their work&lt;/li&gt;
  &lt;li&gt;&lt;span style=&quot;text-decoration: underline;&quot;&gt;HCatalog:&lt;/span&gt; A ‘Data Warehouse’ layer on top of HDFS, similar to how you define a database table (a series of columns in a table with formats)&lt;/li&gt;
&lt;/ul&gt;
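
&lt;p&gt;To make the “Map” and “Reduce” steps a bit more concrete, here is a minimal sketch in plain shell rather than Hadoop itself. It runs in a single process on a handful of made-up records (the carrier codes and delay values are hypothetical, not taken from any real dataset), so it only illustrates the shape of the idea: emit key/value pairs, group them by key, then boil each group down to one number.&lt;/p&gt;

```shell
# Hypothetical sample records in the form carrier,delay_minutes
records='AA,10
UA,5
AA,20
DL,0
UA,15'

# "Map": each record is already a key,value pair (carrier,delay)
# "Shuffle": the first sort brings identical keys next to each other
# "Reduce": awk sums the values per key; the final sort makes the
#           output order predictable, since awk's key order is not
totals=$(printf '%s\n' "$records" \
  | sort -t, -k1,1 \
  | awk -F, '{ sum[$1] += $2 } END { for (k in sum) print k "," sum[k] }' \
  | sort)

printf '%s\n' "$totals"
```

&lt;p&gt;On a real cluster the records would live in HDFS and the grouping would happen across many nodes at once, but the map, group and reduce flow is the same one that Hive and Pig generate for you behind the scenes.&lt;/p&gt;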

&lt;h3 id=&quot;next-steps&quot;&gt;Next Steps&lt;/h3&gt;

&lt;p&gt;With the above five Hadoop concepts in place, the next few posts will be to set up a proof-of-concept Hadoop cluster on Amazon EC2, processing ~12GB of publicly available data from the ‘&lt;a title=&quot;Airline dataset&quot; href=&quot;http://stat-computing.org/dataexpo/2009/the-data.html&quot; target=&quot;_blank&quot;&gt;Airline dataset&lt;/a&gt;’. That’s not ‘big’ as ‘big data’ goes, but it’s big enough to be fun to work with.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Instructions for Installing &amp;#038; Using R on Amazon EC2</title>
        
          <description>&lt;p&gt;If you’re an R user, you’ve surely heard all the hype around ‘big data’ and how R is commonly used to analyze these volumes of data. One thing that’s often missing from the discussion is HOW to work around issues using big data and R, specifically how to deal with the fact that R stores all its objects in-memory.&lt;/p&gt;

</description>
        
        <pubDate>Mon, 08 Apr 2013 16:36:07 +0000</pubDate>
        <link>
        http://randyzwitch.com/r-amazon-ec2/</link>
        <guid isPermaLink="true">http://randyzwitch.com/r-amazon-ec2/</guid>
        <content type="html" xml:base="/r-amazon-ec2/">&lt;p&gt;If you’re an R user, you’ve surely heard all the hype around ‘big data’ and how R is commonly used to analyze these volumes of data. One thing that’s often missing from the discussion is HOW to work around issues using big data and R, specifically how to deal with the fact that R stores all its objects in-memory.&lt;/p&gt;

&lt;p&gt;While you can use packages such as &lt;a title=&quot;ff package&quot; href=&quot;http://cran.r-project.org/web/packages/ff/index.html&quot; target=&quot;_blank&quot;&gt;ff&lt;/a&gt; and &lt;a title=&quot;bigmemory&quot; href=&quot;http://cran.r-project.org/web/packages/bigmemory/index.html&quot; target=&quot;_blank&quot;&gt;bigmemory&lt;/a&gt; to overcome the in-memory limits of your local machine, these additional packages do require some re-engineering of your code. Instead, consider using &lt;a title=&quot;Amazon Web Services&quot; href=&quot;http://aws.amazon.com/ec2/&quot; target=&quot;_blank&quot;&gt;Amazon EC2&lt;/a&gt; to provision the resources you need.  Here are two ways to get started…&lt;/p&gt;

&lt;h3 id=&quot;use-a-pre-made-ami&quot;&gt;Use a Pre-Made AMI&lt;/h3&gt;

&lt;p&gt;In the great open-source tradition, there are already R Amazon EC2 AMI images available out there to use. The way I got started was using the pre-built images that &lt;a title=&quot;RStudio AMI Images&quot; href=&quot;http://www.louisaslett.com/RStudio_AMI/&quot; target=&quot;_blank&quot;&gt;Louis Aslett&lt;/a&gt; provides on his website.  Louis also provides great instructions on learning about EC2, so if you’ve never worked with R in the cloud or are just looking to get up and running fast, his website is a great means to do so.&lt;/p&gt;

&lt;h3 id=&quot;build-your-own-image&quot;&gt;Build Your Own Image&lt;/h3&gt;

&lt;p&gt;Alternatively, suppose you want to build your own customized image. For example, say you wanted to build a proof-of-concept ‘big data’ environment, so you want &lt;a title=&quot;R download at CRAN&quot; href=&quot;http://cran.r-project.org/&quot; target=&quot;_blank&quot;&gt;R&lt;/a&gt;, &lt;a title=&quot;Python download&quot; href=&quot;http://python.org/&quot; target=&quot;_blank&quot;&gt;Python&lt;/a&gt;, &lt;a title=&quot;MySQL download&quot; href=&quot;http://dev.mysql.com/&quot; target=&quot;_blank&quot;&gt;MySQL&lt;/a&gt; and &lt;a title=&quot;MongoDB&quot; href=&quot;http://www.mongodb.org/&quot; target=&quot;_blank&quot;&gt;MongoDB&lt;/a&gt;.  The commands to accomplish this are listed below. Note that I’m assuming you have a &lt;a title=&quot;AWS FAQ&quot; href=&quot;http://aws.amazon.com/ec2/faqs/&quot; target=&quot;_blank&quot;&gt;basic understanding of working through the Amazon Web Service Console (AWS)&lt;/a&gt;, including being able to get to the ‘Classic Wizard’ for launching an EC2 instance. You also should have a basic understanding of &lt;a title=&quot;Command Line tutorial&quot; href=&quot;http://cli.learncodethehardway.org/book/&quot; target=&quot;_blank&quot;&gt;working from the command line&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id=&quot;setting-up-amazon-ec2-instance&quot;&gt;Setting Up Amazon EC2 Instance&lt;/h4&gt;

&lt;ol&gt;
  &lt;li&gt;Launch an Ubuntu 12.04.1 LTS 64-bit image. You can use a free “t1.micro” image while building, then provision more resources later once you’re ready for analysis.&lt;/li&gt;
  &lt;li&gt;Accept defaults until you get to Key-Pair tab. The Key-Pair is what allows you to login securely to your Amazon EC2 image without a password. Create and download a Key-Pair if you don’t already have one or choose an existing Key-Pair if you do.&lt;/li&gt;
  &lt;li&gt;When you get to the ‘Security Groups’ tab, create a security group that has the following ports open: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;22&lt;/code&gt; (SSH), &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;80&lt;/code&gt; (HTTP), &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;443&lt;/code&gt; (HTTPS), &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;3389&lt;/code&gt; (RDP, optional), and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8787&lt;/code&gt; (RStudio Server).&lt;/li&gt;
  &lt;li&gt;Work through the rest of the Wizard until your instance is launched.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4 id=&quot;connecting-to-amazon-ec2-instance&quot;&gt;Connecting to Amazon EC2 Instance&lt;/h4&gt;

&lt;ol&gt;
  &lt;li&gt;There are two ways to connect to your EC2 image, both of which can be found by going to the “Actions” tab in the AWS console, then selecting “Connect” from the drop-down. The rest of this tutorial assumes you connect via a stand-alone SSH client (such as Terminal for Mac OSX)&lt;/li&gt;
  &lt;li&gt;Connect to your instance by typing the code provided to you, such as: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh -i me-aws.pem ubuntu@ec2-50-19-18-120.compute-1.amazonaws.com&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Be sure that before you submit this code, you either modify the line to put the directory in front of your Key-Pair, or “cd” to the directory where the Key-Pair is located&lt;/li&gt;
  &lt;li&gt;After submitting the connect code, you will get a warning saying that the ‘authenticity can’t be established, do you want to continue?’  Type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;yes&lt;/code&gt; and hit enter to log in.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4 id=&quot;installing-base-r&quot;&gt;Installing Base R&lt;/h4&gt;

&lt;p&gt;Once you are logged in, there are about a dozen commands that need to be submitted. Some commands run quickly, others can take 10-15 minutes to run through the entire installation process. Each command is prefixed with “sudo” so that it runs with the access rights needed for installation. Submit each line one at a time.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;span class=&quot;c&quot;&gt;#Create a user, home directory and set password&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;useradd rstudio
&lt;span class=&quot;nb&quot;&gt;sudo mkdir&lt;/span&gt; /home/rstudio
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;passwd rstudio
&lt;span class=&quot;nb&quot;&gt;sudo chmod&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-R&lt;/span&gt; 0777 /home/rstudio

&lt;span class=&quot;c&quot;&gt;#Update all files from the default state&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get upgrade

&lt;span class=&quot;c&quot;&gt;#Add CRAN mirror to custom sources.list file using vi&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;vi /etc/apt/sources.list.d/sources.list

&lt;span class=&quot;c&quot;&gt;#Add following line (or your favorite CRAN mirror)&lt;/span&gt;
deb http://lib.stat.cmu.edu/R/CRAN/bin/linux/ubuntu precise/

&lt;span class=&quot;c&quot;&gt;#Update files to use CRAN mirror&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#Don't worry about error message&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update

&lt;span class=&quot;c&quot;&gt;#Install latest version of R&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#Install without verification&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;r-base
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;While not strictly required to run R, I also like to run the following commands to install the Curl and XML packages as well, which are useful if you want to use R to connect to any web data/APIs.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;span class=&quot;c&quot;&gt;#Install in order to use RCurl &amp;amp; XML&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;aptitude &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;libcurl4-openssl-dev
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;libxml2-dev
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With these commands run, you will now be able to run R from the command line just by typing “R” at the prompt. However, it would be a crime to do all this work and not install RStudio Server, which makes working in R so much easier.&lt;/p&gt;

&lt;h4 id=&quot;installing-rstudio-server&quot;&gt;Installing RStudio Server&lt;/h4&gt;
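
&lt;p&gt;A minimal sketch of the install itself, assuming the usual Ubuntu route of downloading the RStudio Server .deb package and installing it with gdebi. The function name and its URL argument are placeholders of mine: copy the link for the current 64-bit Ubuntu package from the RStudio download page rather than relying on any particular version.&lt;/p&gt;

```shell
# Sketch only: nothing is installed until the function is called
# with the .deb URL copied from the RStudio download page.
install_rstudio_server() {
  deb_url="$1"                 # assumption: link to the rstudio-server .deb
  deb_file="${deb_url##*/}"    # file name is the last part of the URL

  # gdebi installs a local .deb along with its dependencies
  sudo apt-get install gdebi-core

  # Download the package into the current directory, then install it
  wget "$deb_url"
  sudo gdebi "$deb_file"
}
```

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;install_rstudio_server&lt;/code&gt; with that link runs the three steps in order. RStudio Server listens on port 8787 by default, which is why that port was opened in the security group earlier.&lt;/p&gt;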

&lt;p&gt;Once RStudio Server is installed, you can access RStudio through your local browser. Navigate to the Public DNS of your image on port 8787, similar to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;http://ec2-50-19-18-120.compute-1.amazonaws.com:8787&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;login&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;password&lt;/code&gt; will be the values you used in the image creation process (I used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rstudio&lt;/code&gt; as my username above).&lt;/p&gt;

&lt;h4 id=&quot;installing-mysql-python-and-mongodb&quot;&gt;Installing MySQL, Python, and MongoDB&lt;/h4&gt;

&lt;p&gt;If you’ve made it this far, I’m sure you realize that installing additional packages will only take a line or two of code. Even better, Python is installed by default on Linux, so we really only need to install MySQL and MongoDB.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-shell&quot; data-lang=&quot;shell&quot;&gt;&lt;span class=&quot;c&quot;&gt;#Install MySQL&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;mysql-common
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;mysql-server

&lt;span class=&quot;c&quot;&gt;#Install MongoDB&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;mongodb
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;summary&quot;&gt;Summary&lt;/h3&gt;

&lt;p&gt;The steps above can be intimidating if you’ve never used Linux or worked on the command line, but once you get the hang of it, your ability to use R on ‘big data’ (however you define it) will be much improved. For anywhere from a few pennies to a few dollars per hour, you can use hardware with 16-64GB of RAM or more.&lt;/p&gt;

&lt;p&gt;EDIT, 4/9: The code is wrapping weirdly on some monitors. &lt;a title=&quot;Amazon EC2 RStudio commands&quot; href=&quot;http://randyzwitch.com/wp-content/uploads/2013/04/amazon-ec2-rstudio.txt&quot; target=&quot;_blank&quot;&gt;Click here&lt;/a&gt; for the commands in a .txt file.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Automated Re-Install of Packages for R 3.0</title>
        
          <description>&lt;p&gt;With the big release of &lt;a title=&quot;R 3.0 introduction&quot; href=&quot;http://www.r-bloggers.com/r-3-0-0-is-released-whats-new-and-how-to-upgrade/&quot; target=&quot;_blank&quot;&gt;R 3.0&lt;/a&gt; today comes an unfortunate side effect of needing to re-install all of your packages. Luckily, R provides a pretty easy method of getting all of your packages into a list for automated re-install.  Here’s how to do it for OSX users with a default install to the Library:&lt;/p&gt;

</description>
        
        <pubDate>Wed, 03 Apr 2013 10:10:09 +0000</pubDate>
        <link>
        http://randyzwitch.com/automated-re-install-of-packages-for-r-3-0/</link>
        <guid isPermaLink="true">http://randyzwitch.com/automated-re-install-of-packages-for-r-3-0/</guid>
        <content type="html" xml:base="/automated-re-install-of-packages-for-r-3-0/">&lt;p&gt;With the big release of &lt;a title=&quot;R 3.0 introduction&quot; href=&quot;http://www.r-bloggers.com/r-3-0-0-is-released-whats-new-and-how-to-upgrade/&quot; target=&quot;_blank&quot;&gt;R 3.0&lt;/a&gt; today comes an unfortunate side effect of needing to re-install all of your packages. Luckily, R provides a pretty easy method of getting all of your packages into a list for automated re-install.  Here’s how to do it for OSX users with a default install to the Library:&lt;/p&gt;
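
&lt;p&gt;A minimal sketch of this approach, assuming the old packages live in a versioned library folder. The path in the usage comments and the helper names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;list_packages&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;packages_to_reinstall&lt;/code&gt; are hypothetical, so adjust them for your own machine:&lt;/p&gt;

```r
# Sketch only: helper names and the example path below are assumptions.

# Package names are the row names of the matrix returned by installed.packages()
list_packages = function(lib = NULL) {
  rownames(installed.packages(lib.loc = lib))
}

# Return the packages that were installed before but are not installed now
packages_to_reinstall = function(old_pkgs, current_pkgs) {
  setdiff(old_pkgs, current_pkgs)
}

# Usage (commented out so nothing is installed by accident):
# old_lib = "/Library/Frameworks/R.framework/Versions/2.15/Resources/library"  # assumed path
# to_install = packages_to_reinstall(list_packages(old_lib), list_packages())
# install.packages(to_install)
```

&lt;p&gt;Keeping the comparison in its own function makes it easy to check the list before anything is downloaded.&lt;/p&gt;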

&lt;p&gt;For Windows users, the same general process should work, assuming you change the file reference in the &lt;em&gt;installed.packages&lt;/em&gt; function to the proper Windows location. The one downside to this method is that only packages that are &lt;a title=&quot;CRAN&quot; href=&quot;http://cran.r-project.org/&quot; target=&quot;_blank&quot;&gt;listed on CRAN&lt;/a&gt; will be reinstalled, so if you installed anything using devtools, you’ll need to re-install those packages separately. But at the very least, the code snippet above is a quick way to re-install most of your packages. EDIT, 4/4/13: Per Noam below, you can also use a more direct method: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;update.packages(ask=FALSE, checkBuilt = TRUE)&lt;/code&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>The Fun of Error Trapping: R Package Edition</title>
        
          <description>&lt;p&gt;For the last month or so I’ve been working on an R package to make accessing the &lt;a title=&quot;Omniture Reporting API&quot; href=&quot;https://developer.omniture.com/&quot; target=&quot;_blank&quot;&gt;Adobe (Omniture) Digital Marketing Suite Reporting API&lt;/a&gt; easier.  As part of this development effort, I’m at the point where I’m intentionally introducing errors into my function inputs, trying to guess some of the ways users might incorrectly input arguments into each function.  Imagine my surprise when I saw this:&lt;/p&gt;

</description>
        
        <pubDate>Mon, 25 Feb 2013 17:15:46 +0000</pubDate>
        <link>
        http://randyzwitch.com/r-error-message-fun/</link>
        <guid isPermaLink="true">http://randyzwitch.com/r-error-message-fun/</guid>
        <content type="html" xml:base="/r-error-message-fun/">&lt;p&gt;For the last month or so I’ve been working on an R package to make accessing the &lt;a title=&quot;Omniture Reporting API&quot; href=&quot;https://developer.omniture.com/&quot; target=&quot;_blank&quot;&gt;Adobe (Omniture) Digital Marketing Suite Reporting API&lt;/a&gt; easier.  As part of this development effort, I’m at the point where I’m intentionally introducing errors into my function inputs, trying to guess some of the ways users might incorrectly input arguments into each function.  Imagine my surprise when I saw this:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;span style=&quot;color: #0000ff;&quot;&gt;&amp;gt; result &amp;lt;- content(json)&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #ff0000;&quot;&gt;Loading required package: XML&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #ff0000;&quot;&gt;Error in parser(content, …) : could not find function “htmlTreeParse”&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #ff0000;&quot;&gt;In addition: Warning message:&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #ff0000;&quot;&gt;In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;color: #ff0000;&quot;&gt;there is no package called ‘XML’&lt;/span&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The main idea behind the functions I’ve written is making REST calls to the Omniture API, which, when done correctly, return valid &lt;a title=&quot;JSON documentation&quot; href=&quot;http://www.json.org/&quot; target=&quot;_blank&quot;&gt;JSON&lt;/a&gt;. From there, each response is converted from binary (or whatever format it comes back in) using the &lt;a title=&quot;httr R package&quot; href=&quot;http://cran.r-project.org/web/packages/httr/index.html&quot; target=&quot;_blank&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;content&lt;/code&gt;&lt;/a&gt; function from the &lt;a title=&quot;httr R package&quot; href=&quot;http://cran.r-project.org/web/packages/httr/index.html&quot; target=&quot;_blank&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;httr&lt;/code&gt;&lt;/a&gt; package. Without any additional arguments, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;content&lt;/code&gt; function tries to guess the proper translation method.&lt;/p&gt;

&lt;p&gt;The guessing is all fine and good until you don’t pass a valid JSON string!  In this case, the function guesses that the response might be XML (the returned error is actually HTML), tries to load the XML package…then says it can’t load the XML package. A two-for-one error!&lt;/p&gt;
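
&lt;p&gt;One way to sidestep the guessing is to check the response type before parsing. The sketch below is an assumption on my part rather than code from the package: the helper names are hypothetical, and it works on plain strings so it runs without a network call. With httr, the content type could come from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http_type(response)&lt;/code&gt; and the body from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;content(response, as = &quot;text&quot;, encoding = &quot;UTF-8&quot;)&lt;/code&gt;.&lt;/p&gt;

```r
# Sketch only: helper names are hypothetical, not part of httr.

# TRUE when a Content-Type header value denotes JSON
is_json_type = function(content_type) {
  grepl("^application/json", tolower(content_type))
}

# Return the body text only when the server says it is JSON;
# otherwise stop with a message that shows what actually came back
check_json_response = function(content_type, body_text) {
  if (!is_json_type(content_type)) {
    stop("Expected JSON but received '", content_type, "': ",
         substr(body_text, 1, 60))
  }
  body_text
}

# A JSON response passes through unchanged
print(check_json_response("application/json; charset=utf-8", "{\"ok\": true}"))

# An HTML error page is caught before any parser gets involved
result = try(check_json_response("text/html", "Server error"), silent = TRUE)
print(inherits(result, "try-error"))
```

&lt;p&gt;The error then names the real problem (an HTML page instead of JSON) rather than a missing XML package.&lt;/p&gt;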

&lt;p&gt;Maybe it’s just me, but I’m finding this hilarious after a long day of programming. Maybe it’s because I’m no longer intimidated by an error like this, and as such, I’ve gotten over the steep learning curve of R.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note:  Hadley, if you read this, I’m not saying your httr package has any sort of bug or anything. Just that I found this particular error amusing.&lt;/em&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>(not provided): Using R and the Google Analytics API</title>
        
          <description>&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/01/google-not-provided.png&quot; alt=&quot;(not provided) terms from Google average 35%-60% of all organic search terms&quot; /&gt;&lt;/p&gt;

</description>
        
        <pubDate>Fri, 11 Jan 2013 16:27:33 +0000</pubDate>
        <link>
        http://randyzwitch.com/r-google-analytics-api/</link>
        <guid isPermaLink="true">http://randyzwitch.com/r-google-analytics-api/</guid>
        <content type="html" xml:base="/r-google-analytics-api/">&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2013/01/google-not-provided.png&quot; alt=&quot;(not provided) terms from Google average 35%-60% of all organic search terms&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
(not provided) terms from Google average 35%-60% of all Google organic search terms
&lt;/p&gt;

&lt;p&gt;For power users of Google Analytics, there is a heavy dose of spreadsheet work that accompanies any decent analysis.  But even with Excel in tow, it’s often difficult to get the data &lt;em&gt;just right&lt;/em&gt; without resorting to formula hacks and manual table formatting.  This is where the Google Analytics API and R can come very much in handy.&lt;/p&gt;

&lt;h2 id=&quot;connecting-to-the-google-analytics-api-using-r&quot;&gt;Connecting to the Google Analytics API using R&lt;/h2&gt;

&lt;p&gt;I’m not going to say that connecting to the Google Analytics API is easy &lt;em&gt;per se&lt;/em&gt;, but with the &lt;a href=&quot;http://skardhamar.github.com/rga/&quot; title=&quot;R Google Analytics API package&quot;&gt;rga package&lt;/a&gt; written by “skardhamar” on GitHub, it’s easier than if you had to develop the connection code yourself!  However, before you can get started making calls to the Google Analytics API, you need to register within the &lt;a href=&quot;https://code.google.com/apis/console/&quot; title=&quot;Google Analytics API console&quot;&gt;Google Analytics API console&lt;/a&gt;.  There you can define a new project and then you’ll be able to make your API calls via R.&lt;/p&gt;

&lt;p&gt;After you have your API access straightened out, the &lt;a href=&quot;http://skardhamar.github.com/rga/&quot; title=&quot;RGA package instructions&quot;&gt;GitHub page for the rga package&lt;/a&gt; has all the details on how to authenticate using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rga.open&lt;/code&gt; function.  I chose to use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;where&lt;/code&gt; argument so that I can continuously hit the API across many sessions without having to do browser authentication each time.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;rga.open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;instance&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;~/Documents/R/ga-api&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;analyzing-not-provided-as-a-google-analytics-organic-search-term&quot;&gt;Analyzing (not provided) as a Google Analytics organic search term&lt;/h2&gt;

&lt;p&gt;Once connected to the Google Analytics API, it’s time to submit our API calls.  I used two API calls to create the graph at the top of the post, which shows the percentage of all Google organic search terms that are listed as “(not provided)” for the entire history of this blog.  The two API calls were to download the number of total organic search term visits by date from Google and the number of “(not provided)” visits by date, also from Google.  Here’s the API call for the “(not provided)” data (replace XXXXXXXX with your profile ID):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;visits_notprovided.df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ga&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getData&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;XXXXXXXX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start.date&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2011-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end.date&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2013-01-10&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga:visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filters&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga:keyword==(not provided);ga:source==google;ga:medium==organic&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dimensions&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga:date&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1500&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sort&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga:date&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The result of this API call provides an R data frame containing two columns: date and number of visits where the search term was “(not provided)”.&lt;/p&gt;

&lt;h2 id=&quot;munging-the-data-using-r&quot;&gt;Munging the data using R&lt;/h2&gt;

&lt;p&gt;After pulling the data into R, all that’s left is to merge the data frames, do a few calculations, then make the boxplot.  Because the default object returned by the rga package is a data frame, it’s trivial to use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;merge&lt;/code&gt; function in R to join the data frames, then use a few calculated columns to create the percentage of visits that are “(not provided)”.&lt;/p&gt;
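
&lt;p&gt;To make the merge step concrete, here is a small sketch on made-up data that uses the same column names as the full script. The numbers are invented, and filling missing “(not provided)” counts with zero is my assumption for illustration, not something the script itself does:&lt;/p&gt;

```r
# Toy data standing in for the two API results (values are made up)
np = data.frame(hit_date = as.Date(c("2013-01-01", "2013-01-02")),
                np_visits = c(40, 30))
total = data.frame(hit_date = as.Date(c("2013-01-01", "2013-01-02", "2013-01-03")),
                   total_visits = c(100, 60, 50))

# all = TRUE keeps every date from both data frames (a full outer join on hit_date)
merged = merge(np, total, all = TRUE)

# Dates with no "(not provided)" rows come back as NA; here they are treated as zero
merged$np_visits[is.na(merged$np_visits)] = 0

# Share of organic visits whose search term was hidden
merged$pct_np = merged$np_visits / merged$total_visits
print(merged)
```

&lt;p&gt;Without the zero-fill, the third date would carry an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NA&lt;/code&gt; percentage instead of a zero.&lt;/p&gt;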

&lt;h2 id=&quot;what-was-that-google-only-10-of-searches-are-supposed-to-be-not-provided&quot;&gt;What was that Google, only 10% of searches are supposed to be (not provided)?&lt;/h2&gt;

&lt;p&gt;By now, it’s beating a dead horse that the percentage of “(not provided)” search results from Google FAR exceeds what they said it would.  This blog gets about 5,000 visits a month, and due to the technical nature of the blog many of the users are on Chrome (which does secure search automatically) or on iOS (which also does secure search).  But at minimum, this graph illustrates the power of using the Google Analytics API via R; I can update this graph at my leisure by running my script, and I can create a graphic that’s not possible within Excel.&lt;/p&gt;

&lt;p&gt;Full code:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#### Connecting to Google Analytics API via R&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#### Uses OAuth 2.0&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#### https://developers.google.com/analytics/devguides/reporting/core/v3/ for documentation&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Install devtools package &amp;amp; rga - This is only done one time&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;install.packages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;devtools&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;devtools&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;install_github&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;skardhamar/rga&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;


&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Load rga package - requires bitops, RCurl, rjson&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Load lubridate to handle dates&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rga&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lubridate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Authenticating to GA API. Go to https://code.google.com/apis/console/ and create&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# an API application.  Don't need to worry about the client id and shared secret for&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# this R code, it is not needed&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# If file listed in &quot;where&quot; location doesn't exist, browser window will open.&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Allow access, copy code into R console where prompted&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Once the file in the &quot;where&quot; directory is created, you will have continuous access to&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# API without needing to do browser authentication&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rga.open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;instance&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;~/Documents/R/ga-api&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;


&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Get (not provided) Search results.  Replace XXXXXXXX with your profile ID from GA&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;visits_notprovided.df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ga&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getData&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;XXXXXXXX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start.date&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2011-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end.date&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2013-01-10&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga:visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filters&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga:keyword==(not provided);ga:source==google;ga:medium==organic&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dimensions&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga:date&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1500&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sort&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga:date&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;visits_notprovided.df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;hit_date&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;np_visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Get sum of all Google Organic Search results.  Replace XXXXXXXX with your profile ID from GA&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;visits_orgsearch.df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ga&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getData&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;XXXXXXXX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start.date&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2011-01-01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end.date&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2013-01-10&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga:visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filters&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga:source==google;ga:medium==organic&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dimensions&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga:date&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1500&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sort&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ga:date&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;visits_orgsearch.df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;hit_date&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;total_visits&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Merge files, create metrics, limit dataset to just days when tags firing&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merged.df&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;visits_notprovided.df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;visits_orgsearch.df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;all&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merged.df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;search_term_provided&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merged.df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;total_visits&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merged.df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np_visits&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merged.df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pct_np&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merged.df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np_visits&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merged.df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;total_visits&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merged.df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;yearmo&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merged.df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hit_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;month&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merged.df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hit_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;final_dataset&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merged.df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;total_visits&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;


&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Visualization - boxplot by month&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# Main plot, minus y axis tick labels&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;boxplot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pct_np&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;yearmo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;final_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Google (not provided)\nPercentage of Total Organic Searches&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xlab&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Year-Month&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ylab&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Percent (not provided)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;orange&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ylim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;.8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;yaxt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;n&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Create tick sequence and format axis labels&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ticks&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;.8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;.2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;label_ticks&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;%1.f%%&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ticks&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;at&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ticks&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;label_ticks&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;</content>
      </item>
      
    
      
      <item>
        <title>Video:  SQL Queries in R using sqldf</title>
        
          <description>&lt;p&gt;This video covers how to run SQL queries using the ‘sqldf’ package within R. This sqldf tutorial was part of a &lt;a href=&quot;http://keystonesolutions.com&quot;&gt;Keystone Solutions&lt;/a&gt; podcast discussion about data science and what skills beginning analysts should be learning to improve their skill set.&lt;/p&gt;

</description>
        
        <pubDate>Mon, 17 Dec 2012 13:22:51 +0000</pubDate>
        <link>
        http://randyzwitch.com/sqldf-package-r/</link>
        <guid isPermaLink="true">http://randyzwitch.com/sqldf-package-r/</guid>
        <content type="html" xml:base="/sqldf-package-r/">&lt;p&gt;This video covers how to run SQL queries using the ‘sqldf’ package within R. This sqldf tutorial was part of a &lt;a href=&quot;http://keystonesolutions.com&quot;&gt;Keystone Solutions&lt;/a&gt; podcast discussion about data science and what skills beginning analysts should be learning to improve their skill set.&lt;/p&gt;
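
&lt;p&gt;For readers who prefer text to video, here is a minimal sketch of the idea (not the code from the video). It assumes the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sqldf&lt;/code&gt; package is installed from CRAN, and it uses R’s built-in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mtcars&lt;/code&gt; data frame as a stand-in for the tutorial’s data files; the name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cyl_summary&lt;/code&gt; is just for illustration.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Assumes sqldf is installed: install.packages(&quot;sqldf&quot;)
library(sqldf)

# Data frames in the workspace can be referenced by name as if they were SQL tables.
# mtcars ships with R; it stands in here for the tutorial's own data files.
cyl_summary &amp;lt;- sqldf(&quot;
  SELECT cyl, COUNT(*) AS n_cars, AVG(mpg) AS avg_mpg
  FROM mtcars
  GROUP BY cyl
  ORDER BY cyl
&quot;)

# The result is an ordinary data frame with one row per cylinder count
print(cyl_summary)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With no other database driver loaded, sqldf runs the query through SQLite, so the SQL dialect to write is SQLite’s.&lt;/p&gt;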

&lt;p&gt;The example files from this tutorial can be downloaded from this link:&lt;/p&gt;

&lt;p&gt;&lt;a title=&quot;SQL R Tutorial data files&quot; href=&quot;http://randyzwitch.com/wp-content/uploads/2013/11/r-sql-demo-files.zip&quot; target=&quot;_blank&quot;&gt;Example Data files&lt;/a&gt;&lt;/p&gt;

&lt;iframe src=&quot;http://www.youtube.com/embed/s2oTUsAJfjI&quot; height=&quot;480&quot; width=&quot;640&quot; allowfullscreen=&quot;&quot; frameborder=&quot;0&quot;&gt;&lt;/iframe&gt;</content>
      </item>
      
    
      
      <item>
        <title>Video: Overlay Histogram in R (Normal, Density, Another Series)</title>
        
          <description>&lt;p&gt;This video explains how to overlay histogram plots in R for 3 common cases: overlaying a histogram with a normal curve, overlaying a histogram with a density curve, and overlaying a histogram with a second data series plotted on a secondary axis.&lt;/p&gt;

</description>
        
        <pubDate>Fri, 09 Nov 2012 13:40:01 +0000</pubDate>
        <link>
        http://randyzwitch.com/overlay-histogram-in-r/</link>
        <guid isPermaLink="true">http://randyzwitch.com/overlay-histogram-in-r/</guid>
        <content type="html" xml:base="/overlay-histogram-in-r/">&lt;p&gt;This video explains how to overlay histogram plots in R for 3 common cases: overlaying a histogram with a normal curve, overlaying a histogram with a density curve, and overlaying a histogram with a second data series plotted on a secondary axis.&lt;/p&gt;

&lt;p&gt;Note: Towards the end of the video (around minute 14), I misspeak when talking about the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;padj&lt;/code&gt; parameter of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mtext&lt;/code&gt; function. The setting doesn’t “left truncate” the label; I meant “right align”, “left align”, etc.&lt;/p&gt;

&lt;iframe width=&quot;640&quot; height=&quot;480&quot; src=&quot;http://www.youtube.com/embed/C67KNai92Mo&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;#Step 0:  load/prepare data&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Read in data&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample_data&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read.csv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;~/Desktop/test_data.csv&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# &quot;Explode&quot; counts by age back to unsummarized &quot;raw&quot; data&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rep.int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample_data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample_data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;


&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#1. Histogram with normal distribution or density curve overlaid&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;


&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#1A.  Create histogram&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xlim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ylim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;.2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;breaks&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;length.out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;22&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xlab&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Age&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ylab&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Percentage of Accounts&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Age Distribution of Accounts\n (where 0 &amp;lt;= age &amp;lt;= 20)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prob&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;lightgray&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#1B.  Do one of the following, either put the normal distribution on the histogram&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#     or put the smoothed density function&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Calculate normal distribution having mean/sd equal to data plotted in the&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#histogram above&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;length.out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;500&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
       &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dnorm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;length.out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;500&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
             &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;l&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;red&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Add smoothed density function to histogram, smoothness toggled using&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#&quot;adjust&quot; parameter&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lines&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;density&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adjust&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;blue&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#2 Histogram with line plot overlaid&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#2A.  Create histogram with extra border space on right-hand side&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#Extra border space &quot;2&quot; on right  (bottom, left, top, right)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;par&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;oma&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xlim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ylim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;.2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
     &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;breaks&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;age.exploded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;length.out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;22&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xlab&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Age&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
     &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ylab&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Percentage of Accounts&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Age Distribution of Accounts vs. Subscription Rate \n (where reported age &amp;lt;= 20)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
     &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prob&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;lightgray&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#2B.  Add overlaid line plot, create a right-side numeric axis&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;par&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample_data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subscribe_pct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xlab&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ylab&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;b&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;red&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;axes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#2C.  Add right-side axis label&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mtext&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Subscription Rate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;side&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;padj&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
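
&lt;p&gt;If you don’t have the sample file handy, the sketch below builds a stand-in data frame with the same column names the code above relies on (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;age&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;subscribe_pct&lt;/code&gt;). The values are made up purely for illustration, and the one-bin-per-age breaks are my assumption rather than what the video uses. It shows what the “explode” step does and why the normal and density curves land on the same y scale as the histogram bars.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Hypothetical stand-in for test_data.csv; the values are invented for illustration
sample_data &amp;lt;- data.frame(
  age           = 0:20,
  count         = round(1000 * dnorm(0:20, mean = 10, sd = 4)),
  subscribe_pct = seq(0.05, 0.25, length.out = 21)
)

# &quot;Explode&quot; the summarized counts: each age is repeated `count` times
age.exploded &amp;lt;- rep.int(sample_data$age, sample_data$count)
stopifnot(length(age.exploded) == sum(sample_data$count))

# hist() computes densities alongside counts; prob=TRUE only tells it to draw them.
# Densities are scaled so the bar areas sum to 1, which is the same scale
# that dnorm() and density() use.
h &amp;lt;- hist(age.exploded, breaks = seq(-0.5, 20.5, by = 1), plot = FALSE)
total_area &amp;lt;- sum(h$density * diff(h$breaks))
print(total_area)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With this data frame in place of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read.csv&lt;/code&gt; call, the plotting steps above can be run as written.&lt;/p&gt;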

&lt;p&gt;File Download:&lt;/p&gt;

&lt;p&gt;&lt;a title=&quot;Histogram overlay in R&quot; href=&quot;http://randyzwitch.com/wp-content/uploads/2012/11/histogram-overlay-r.zip&quot; target=&quot;_blank&quot;&gt;Histogram overlay in R code and sample data file&lt;/a&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Video:  R, RStudio, Rcmdr &amp; rattle</title>
        
          <description>&lt;p&gt;I did a screencast for my co-workers to show how to get started with R: first what a base installation of R looks like, then how to improve your workflow using RStudio, Rcmdr or rattle.  The examples are somewhat pedestrian, but they give a feel for what using R actually looks like.&lt;/p&gt;

</description>
        
        <pubDate>Fri, 07 Sep 2012 12:07:45 +0000</pubDate>
        <link>
        http://randyzwitch.com/video-r-rstudio-rcmdr-rattle/</link>
        <guid isPermaLink="true">http://randyzwitch.com/video-r-rstudio-rcmdr-rattle/</guid>
        <content type="html" xml:base="/video-r-rstudio-rcmdr-rattle/">&lt;p&gt;I did a screencast for my co-workers to show how to get started with R: first what a base installation of R looks like, then how to improve your workflow using RStudio, Rcmdr or rattle.  The examples are somewhat pedestrian, but they give a feel for what using R actually looks like.&lt;/p&gt;

&lt;p&gt;If you have any questions, comments, or jeers about how bad I am at R, feel free to leave them in the comments section!&lt;/p&gt;

&lt;iframe src=&quot;https://player.vimeo.com/video/48599583&quot; width=&quot;640&quot; height=&quot;400&quot; frameborder=&quot;0&quot; webkitallowfullscreen=&quot;&quot; mozallowfullscreen=&quot;&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;p&gt;&lt;a href=&quot;https://vimeo.com/48599583&quot;&gt;R Demo - Randy Zwitch&lt;/a&gt; from &lt;a href=&quot;https://vimeo.com/user13204299&quot;&gt;Keystone Solutions&lt;/a&gt; on &lt;a href=&quot;https://vimeo.com&quot;&gt;Vimeo&lt;/a&gt;.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Getting Started Using R, Part 2: Rcmdr</title>
        
          <description>&lt;p&gt;In my first post in this series, I discussed &lt;a title=&quot;Getting Started Using R, Part 1:  RStudio&quot; href=&quot;http://randyzwitch.com/getting-started-using-rstudio/&quot; target=&quot;_blank&quot;&gt;RStudio&lt;/a&gt;, an IDE that adds significant functionality and consistency to a basic installation of R.  In this post, I will discuss &lt;a title=&quot;Rcmdr download at CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/Rcmdr/index.html&quot; target=&quot;_blank&quot;&gt;Rcmdr&lt;/a&gt;, a GUI that provides the ability to do basic business statistics without having to code in R.&lt;/p&gt;

</description>
        
        <pubDate>Mon, 06 Aug 2012 10:54:18 +0000</pubDate>
        <link>
        http://randyzwitch.com/getting-started-using-r-rcmdr/</link>
        <guid isPermaLink="true">http://randyzwitch.com/getting-started-using-r-rcmdr/</guid>
        <content type="html" xml:base="/getting-started-using-r-rcmdr/">&lt;p&gt;In my first post in this series, I discussed &lt;a title=&quot;Getting Started Using R, Part 1:  RStudio&quot; href=&quot;http://randyzwitch.com/getting-started-using-rstudio/&quot; target=&quot;_blank&quot;&gt;RStudio&lt;/a&gt;, an IDE that adds significant functionality and consistency to a basic installation of R.  In this post, I will discuss &lt;a title=&quot;Rcmdr download at CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/Rcmdr/index.html&quot; target=&quot;_blank&quot;&gt;Rcmdr&lt;/a&gt;, a GUI that provides the ability to do basic business statistics without having to code in R.&lt;/p&gt;

&lt;h2 id=&quot;rcmdr-r-commander&quot;&gt;Rcmdr (“R Commander”)&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/08/rcmdr1.png&quot; alt=&quot;rcmdr&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
Example Rcmdr window with the &quot;Statistics&quot; menu expanded
&lt;/p&gt;

&lt;p&gt;&lt;a title=&quot;Rcmdr download at CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/Rcmdr/index.html&quot; target=&quot;_blank&quot;&gt;Rcmdr&lt;/a&gt; is a package for R that was &lt;a title=&quot;Rcmdr main site&quot; href=&quot;http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/&quot; target=&quot;_blank&quot;&gt;created by John Fox&lt;/a&gt; at McMaster University in Canada as a means of providing the basic statistics functionality for classroom use.  In this way, &lt;a title=&quot;Rcmdr download at CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/Rcmdr/index.html&quot; target=&quot;_blank&quot;&gt;Rcmdr&lt;/a&gt; is somewhat similar to &lt;a href=&quot;http://www.sas.com/technologies/bi/query_reporting/guide/&quot; title=&quot;SAS Enterprise Guide&quot;&gt;SAS Enterprise Guide&lt;/a&gt;, a GUI that allows quick access to the power of SAS without the requirement of writing code.&lt;/p&gt;

&lt;p&gt;While using &lt;a title=&quot;Rcmdr download at CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/Rcmdr/index.html&quot; target=&quot;_blank&quot;&gt;Rcmdr&lt;/a&gt; won’t allow you to tap into every single advanced feature that R provides, it does provide a lot of great “general” functionality that can be used in everyday business such as summary statistics, t-tests, ANOVA, linear regression modeling, graphing and data re-coding.&lt;/p&gt;

&lt;h2 id=&quot;using-rcmdr&quot;&gt;Using Rcmdr&lt;/h2&gt;

&lt;p&gt;For the most part, the &lt;a title=&quot;Rcmdr download at CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/Rcmdr/index.html&quot; target=&quot;_blank&quot;&gt;Rcmdr&lt;/a&gt; dialog boxes all look very similar.  Only the most useful options are provided, such as the variable(s) you are looking to interrogate, variable(s) you’d like to break down your analysis by, what statistics you want the output to display (mean, median, mode, etc.) and so on.  The dialog boxes vary depending on whether you are estimating a model or plotting a graph, but in my preliminary usage I haven’t found any dialog boxes that were so confusing that I needed to check the “Help” files.&lt;/p&gt;

&lt;p&gt;For example, suppose I wanted to make a boxplot of my data, income by job type. To do so, I would go to the “Graphs” menu and select “Boxplot”, which provides me with the following dialog box:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/08/rcmdr-boxplot-dialog-box.png&quot; alt=&quot;rcmdr-boxplot-dialog-box&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
Rcmdr options for creating a Boxplot
&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/08/rcmdr-boxplot.png&quot; alt=&quot;rcmdr-boxplot&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
Boxplot output created by Rcmdr
&lt;/p&gt;

&lt;p&gt;Within this dialog box, there are only 3 choices:  variable to plot (income), variable to break down the graph by (type), and “Identify outliers with mouse”, which lets the user click on the resulting graph to designate outliers to be labeled.  When I click “OK” in the dialog box, the result is the boxplot shown above. We can see that the “bc” (blue-collar) group has a lower median income than “prof” (professional and managerial) and “wc” (white-collar).&lt;/p&gt;

&lt;p&gt;One of the best features of &lt;a title=&quot;Rcmdr download at CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/Rcmdr/index.html&quot; target=&quot;_blank&quot;&gt;Rcmdr&lt;/a&gt; is that not only do we get the output we requested, but the code window also shows the code that was necessary to create the boxplot.  In this example, the underlying R code is relatively simple:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;boxplot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;income&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ylab&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;income&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xlab&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Duncan&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;By providing the underlying code, &lt;a href=&quot;http://cran.r-project.org/web/packages/Rcmdr/index.html&quot; title=&quot;Rcmdr download at CRAN&quot;&gt;Rcmdr&lt;/a&gt; serves as a teaching tool to move the beginning user towards coding in R directly, or at least, modifying the tool-generated code to include titles or whatever options the user wants to add to the original analysis/output.&lt;/p&gt;
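
&lt;p&gt;As a rough sketch of that last point, here is how the generated call might be extended with a title and a fill color. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;main&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;col&lt;/code&gt; arguments are standard &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;boxplot&lt;/code&gt; options; the title text and color are my own choices, and I’m assuming the Duncan data set is being loaded from the carData package.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Assumption: the carData package is installed and provides the Duncan data set
data(Duncan, package = &quot;carData&quot;)

# Same call that Rcmdr generated, plus a title (main) and a fill color (col)
boxplot(income ~ type, data = Duncan,
        ylab = &quot;income&quot;, xlab = &quot;type&quot;,
        main = &quot;Income by occupation type&quot;,
        col = &quot;lightgray&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;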

&lt;h2 id=&quot;installation-of-rcmdr&quot;&gt;Installation of Rcmdr&lt;/h2&gt;

&lt;p&gt;Sadly, &lt;a href=&quot;http://cran.r-project.org/web/packages/Rcmdr/index.html&quot; title=&quot;Rcmdr download at CRAN&quot;&gt;Rcmdr&lt;/a&gt; is one of those add-ins that seems to work better on Windows than Mac OSX, at least for the installation portion. I’ve been able to successfully install Rcmdr on my relatively old MacBook Pro, but it did take a bit of time to figure out.  Luckily, the instructions to install &lt;a title=&quot;Rcmdr download at CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/Rcmdr/index.html&quot; target=&quot;_blank&quot;&gt;Rcmdr&lt;/a&gt; on a Mac are fairly well laid out in &lt;a href=&quot;http://wiki.math.yorku.ca/index.php/R:Installing_R_and_Rcmdr_on_a_MAC&quot; title=&quot;Rcmdr on Mac OSX&quot;&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, once you get over the hurdle of downloading tcltk and XQuartz (X11 emulator), the program seems to work the same on both platforms.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Getting Started Using R, Part 1:  RStudio</title>
        
          <description>&lt;p&gt;&lt;del&gt;Despite &lt;a href=&quot;http://randyzwitch.com/learning-r-sas/&quot; title=&quot;Learning R has really made me appreciate SAS&quot;&gt;my preference for SAS over R&lt;/a&gt;,&lt;/del&gt; there are some add-ons to “basic” R that I’ve found that have made my learning process way easier. While I’m still in my infancy in learning R, I feel like once I found these additional tools, my ability to use R to get work done improved significantly.&lt;/p&gt;

</description>
        
        <pubDate>Sat, 04 Aug 2012 11:58:14 +0000</pubDate>
        <link>
        http://randyzwitch.com/getting-started-using-rstudio/</link>
        <guid isPermaLink="true">http://randyzwitch.com/getting-started-using-rstudio/</guid>
        <content type="html" xml:base="/getting-started-using-rstudio/">&lt;p&gt;&lt;del&gt;Despite &lt;a href=&quot;http://randyzwitch.com/learning-r-sas/&quot; title=&quot;Learning R has really made me appreciate SAS&quot;&gt;my preference for SAS over R&lt;/a&gt;,&lt;/del&gt; there are some add-ons to “basic” R that I’ve found that have made my learning process way easier. While I’m still in my infancy in learning R, I feel like once I found these additional tools, my ability to use R to get work done improved significantly.&lt;/p&gt;

&lt;p&gt;In this first post of three, I’ll discuss &lt;a title=&quot;R Studio main site&quot; href=&quot;http://rstudio.org/&quot; target=&quot;_blank&quot;&gt;RStudio&lt;/a&gt;, a more friendly access point to the default installation of R.  My second post will discuss &lt;a title=&quot;Rcmdr download at CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/Rcmdr/index.html&quot; target=&quot;_blank&quot;&gt;Rcmdr&lt;/a&gt;, a GUI developed for students taking a basic college-level course in Statistics.  The third post will cover &lt;a title=&quot;rattle download CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/rattle/index.html&quot; target=&quot;_blank&quot;&gt;rattle&lt;/a&gt;, a GUI specifically designed for data mining (as opposed to more general statistics like &lt;a title=&quot;Rcmdr download at CRAN&quot; href=&quot;http://cran.r-project.org/web/packages/Rcmdr/index.html&quot; target=&quot;_blank&quot;&gt;Rcmdr&lt;/a&gt;).&lt;/p&gt;

&lt;h2 id=&quot;rstudio&quot;&gt;RStudio&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/08/r-studio.png&quot; alt=&quot;r-studio&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
R Studio is an IDE that dramatically improves the R experience
&lt;/p&gt;

&lt;p&gt;&lt;a title=&quot;R Studio download&quot; href=&quot;http://rstudio.org/download/&quot; target=&quot;_blank&quot;&gt;RStudio&lt;/a&gt; is an open-source Integrated Development Environment (IDE) that provides a more consistent user experience to R.  There are many great features of &lt;a title=&quot;R Studio download&quot; href=&quot;http://rstudio.org/download/&quot; target=&quot;_blank&quot;&gt;RStudio&lt;/a&gt; over “basic” R, including:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Consistent windowing between sessions (customizable by the user)&lt;/li&gt;
  &lt;li&gt;Point-and-click exploration of data frames and other data objects&lt;/li&gt;
  &lt;li&gt;Importing data files through dialog box functionality&lt;/li&gt;
  &lt;li&gt;Customizable code syntax highlighting, auto-complete, and Help menu access from the code editor&lt;/li&gt;
  &lt;li&gt;Ability to see all installed packages, turn on packages using a checkbox, and download libraries (and their dependencies) without having to write any code&lt;/li&gt;
  &lt;li&gt;Version control using Git (for example, with repositories hosted on GitHub)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While &lt;a title=&quot;R Studio download&quot; href=&quot;http://rstudio.org/download/&quot; target=&quot;_blank&quot;&gt;RStudio&lt;/a&gt; doesn’t provide a GUI that will help you run a regression model or build a graph, it provides a more “friendly” environment to work in as compared to the command-line interface of a default installation of R.  I find that by having elements like the currently active data objects and available/active packages with links to the Help files “exposed” at all times, &lt;a href=&quot;http://rstudio.org/download/&quot; title=&quot;R Studio download&quot;&gt;RStudio&lt;/a&gt; reminds me of where my analysis has been and gives me a quick way to think about “What Else?” to pursue if I hit a roadblock.&lt;/p&gt;

&lt;h2 id=&quot;installation-of-rstudio&quot;&gt;Installation of RStudio&lt;/h2&gt;

&lt;p&gt;&lt;a title=&quot;R Studio download&quot; href=&quot;http://rstudio.org/download/&quot; target=&quot;_blank&quot;&gt;RStudio&lt;/a&gt; installs like any other program for Windows or Mac OSX.  As far as I can tell, there is no advantage to using &lt;a title=&quot;R Studio download&quot; href=&quot;http://rstudio.org/download/&quot; target=&quot;_blank&quot;&gt;RStudio&lt;/a&gt; in one environment over the other; both the Windows and OSX versions seem to work equally well.  The most important consideration is that RStudio is just an “add-on”, so to speak; it does not include R itself.  So be sure to go to one of the &lt;a title=&quot;CRAN downloads for R&quot; href=&quot;https://cran.r-project.org/&quot; target=&quot;_blank&quot;&gt;Comprehensive R Archive Network (CRAN) sites&lt;/a&gt; to download R first.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Learning R Has Really Made Me Appreciate SAS</title>
        
          <description>&lt;p&gt;EDIT, 9/9/2016: Four years later, this blog post is a comical look back in time. It’s hard to believe that I could think this way! Having used R (and Python, Julia), I will never return back to the constraints of using SAS. The inflexible nature of everything having to be a Dataset in SAS vs. the infinite flexibility of data structures in programming-oriented languages makes it no contest.&lt;/p&gt;

</description>
        
        <pubDate>Wed, 25 Jul 2012 11:34:03 +0000</pubDate>
        <link>
        http://randyzwitch.com/learning-r-sas/</link>
        <guid isPermaLink="true">http://randyzwitch.com/learning-r-sas/</guid>
        <content type="html" xml:base="/learning-r-sas/">&lt;p&gt;EDIT, 9/9/2016: Four years later, this blog post is a comical look back in time. It’s hard to believe that I could think this way! Having used R (and Python, Julia), I will never return back to the constraints of using SAS. The inflexible nature of everything having to be a Dataset in SAS vs. the infinite flexibility of data structures in programming-oriented languages makes it no contest.&lt;/p&gt;

&lt;p&gt;But I’ll leave this here to remind myself how today’s frustration leads to tomorrow’s breakthroughs.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;For the past 18 months, it seems like all I’ve heard about in the digital marketing industry is “big data”, and with that, mentions of using Hadoop and R to solve these sorts of problems.  Why are these tools the most often mentioned?  Because they are open source, i.e. free of charge!&lt;/p&gt;

&lt;p&gt;But as I’ve tried to learn R, I keep asking myself…are all of my colleagues out of their minds?  Or, am I just beyond learning something new?  As of right now, R is just one big hack on top of a hack to me, and the software is only “free” if you don’t consider lost productivity.&lt;/p&gt;

&lt;h2 id=&quot;need-new-functionality-just-download-another-r-package&quot;&gt;Need new functionality, just download another R package!&lt;/h2&gt;

&lt;p&gt;One of the biggest “pros” I see thrown around for R relative to a tool like SAS is that when a new statistical technique is invented, someone will code it in R immediately.  A company like SAS may take 5 years to implement the feature, or it may not get implemented at all.  That’s all fine and good, but the problem I’ve found is that there are 10 ways to do something in R, and I spend more time downloading packages (along with other packages that are dependencies) than I do learning A SINGLE WAY to do something correctly.&lt;/p&gt;

&lt;p&gt;For example, take trying to get summary statistics by group.  In SAS, you use a Proc Summary statement, with either a BY group statement or a CLASS statement.  It’s fairly simple and it works.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;proc summary data= hs0; var _numeric_; class prgtype; output out=results mean= /autolabel autoname inherit; run;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In R, I ran the following code, which should be roughly equivalent:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;by(hs0, hs0$prgtype, mean)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Very simple, fewer lines…and technically wrong, throwing 6 unhelpful errors for a single line of code, because it was decided that calling “mean” on a data frame would be deprecated in R.  WHY???  It’s so simple, why modify the language like that?&lt;/p&gt;

&lt;p&gt;According to the error message, I’m supposed to use colMeans instead…but once you get to how, you’re on your own; the Help documentation is garbage.  Some combination of “by” and “colMeans” might work, but I don’t have an example to follow.&lt;/p&gt;

&lt;p&gt;Google sent me to the &lt;a title=&quot;Quick-R website&quot; href=&quot;http://www.statmethods.net/&quot; target=&quot;_blank&quot;&gt;Quick-R&lt;/a&gt; website, and I found a “&lt;a title=&quot;Descriptive Statistics in R&quot; href=&quot;http://www.statmethods.net/stats/descriptives.html&quot; target=&quot;_blank&quot;&gt;descriptive statistics&lt;/a&gt;” article with by group processing…with the recommendation of using the “psych” package or the “doBy” package.  But &lt;a title=&quot;Comprehensive R Archive Network&quot; href=&quot;https://cran.r-project.org/&quot; target=&quot;_blank&quot;&gt;CRAN&lt;/a&gt; won’t let me download all of the dependencies, so again, stuck trying to do the simplest thing in statistics.&lt;/p&gt;
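
&lt;p&gt;For reference, the “by” plus “colMeans” combination can be sketched in base R without any extra packages. This is only an illustration under assumptions: the original &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hs0&lt;/code&gt; data isn’t included here, so the small data frame below is a hypothetical stand-in, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;write&lt;/code&gt; columns and their values are made up. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aggregate&lt;/code&gt; call is an alternative that returns one data frame, which is probably closer to what Proc Summary writes out.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# Hypothetical stand-in for hs0: one grouping column plus two numeric columns
hs0 &amp;lt;- data.frame(
  prgtype = c(&quot;academic&quot;, &quot;academic&quot;, &quot;general&quot;, &quot;general&quot;, &quot;vocational&quot;, &quot;vocational&quot;),
  read    = c(57, 68, 44, 63, 47, 50),
  write   = c(52, 59, 33, 44, 52, 46)
)

# Keep only the numeric columns, since a mean of the grouping column is meaningless
numeric_cols &amp;lt;- sapply(hs0, is.numeric)

# by() + colMeans(): one vector of column means per program type
by(hs0[, numeric_cols], hs0$prgtype, colMeans)

# aggregate(): the same group means collected into a single data frame
group_means &amp;lt;- aggregate(hs0[, numeric_cols],
                         by = list(prgtype = hs0$prgtype),
                         FUN = mean)
print(group_means)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;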

&lt;h2 id=&quot;lets-be-fast-and-run-everything-in-ram&quot;&gt;Let’s be fast and run everything in RAM!&lt;/h2&gt;

&lt;p&gt;My next favorite hassle in R is that you are expected to continuously monitor how many data elements you have active in a workspace.  R runs completely in RAM (as opposed to SAS, which uses a combination of RAM for processing and hard disks for storage), so if you want to do something really “big”, you will quickly choke your computer.  I tried to work with a &lt;em&gt;single day&lt;/em&gt; of Omniture data from the raw data feed, and my MacBook Pro with 6GB of memory was shot.  I believe the file was 700,000 rows by 300 columns, but I could be mis-remembering.  That’s not even enough data to think about performance-tuning a program in SAS; even sloppy code will run quickly.&lt;/p&gt;

&lt;p&gt;How does one solve these memory errors in R?  Porting to the Amazon cloud seems to be the most commonly given suggestion.  But that’s more setup time, getting an R instance over to Amazon, your data over to Amazon…and now you are renting hardware.&lt;/p&gt;

&lt;h2 id=&quot;r-is-great-for-data-visualization&quot;&gt;R is great for data visualization!&lt;/h2&gt;

&lt;p&gt;From what I’ve seen from the demo(graphics) tutorial, R does have some pretty impressive visualization capabilities.  Contour maps, histograms, boxplots…there seems to be a lot of capability here beyond the realm of a tool like Excel (which, besides not being free, isn’t really for visualization).  SAS has some graphics capabilities, but they are a bit hard to master.&lt;/p&gt;

&lt;p&gt;But for all of the hassle to get your data formatted properly, downloading endless packages, avoiding memory errors, you could just pay for Tableau and get working.  Then, once you have your visualizations done in Tableau, if you are using Tableau server you can share interactive dashboards with others.  As far as I know, R graphics are static image exports, so you’re stuck with “flat” presentations.&lt;/p&gt;

&lt;h2 id=&quot;maybe-its-just-me&quot;&gt;Maybe, it’s just me&lt;/h2&gt;

&lt;p&gt;For R diehards, the above verbiage probably just sounds like whining from someone who is too new to appreciate the greatness of R or too stuck in the “old SAS way”.  That’s certainly possible.  But from my first several weeks of trying to use R, the level of frustration is way beyond anything I experienced when I was learning SAS.&lt;/p&gt;

&lt;p&gt;Luckily, I don’t have any consulting projects that require R or SAS at the moment, so I can continue to try and learn why everyone thinks R is so great.  But from where I sit right now, the licensing fee from SAS doesn’t seem so bad when it allows me to get to doing productive work instead of building my own statistics software piece-by-piece.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>My Top 20 Least Useful Omniture Reports</title>
        
          <description>&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/07/data-squirrel-262x300.png&quot; alt=&quot;data-squirrel&quot; /&gt;&lt;/p&gt;

</description>
        
        <pubDate>Wed, 18 Jul 2012 00:14:04 +0000</pubDate>
        <link>
        http://randyzwitch.com/least-useful-omniture-reports/</link>
        <guid isPermaLink="true">http://randyzwitch.com/least-useful-omniture-reports/</guid>
        <content type="html" xml:base="/least-useful-omniture-reports/">&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/07/data-squirrel-262x300.png&quot; alt=&quot;data-squirrel&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
Just because data CAN be captured doesn't mean it SHOULD be!
&lt;/p&gt;

&lt;p&gt;In a prior post about &lt;a href=&quot;http://randyzwitch.com/customize-adobe-sitecatalyst-menu/&quot; title=&quot;For maximum user understanding, customize the SiteCatalyst menu&quot;&gt;customizing the SiteCatalyst menu interface&lt;/a&gt;, I discussed how simple changes such as hiding empty Omniture variables/reports and re-organizing the menu structure will help improve understanding within your organization.  In the spirit of even further interface optimization, here are 20 reports within Omniture that I feel can be hidden due to their lack of business-actionable information.&lt;/p&gt;

&lt;p&gt;Here are my Top 20, in no particular order:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Mobile:  Color Depth&lt;/li&gt;
  &lt;li&gt;Mobile:  Information Services&lt;/li&gt;
  &lt;li&gt;Mobile:  Decoration Mail Support&lt;/li&gt;
  &lt;li&gt;Mobile:  PTT&lt;/li&gt;
  &lt;li&gt;Mobile:  Device Number Transmit&lt;/li&gt;
  &lt;li&gt;Mobile:  Browser URL Length&lt;/li&gt;
  &lt;li&gt;Mobile:  DRM&lt;/li&gt;
  &lt;li&gt;Mobile:  Mail URL Length&lt;/li&gt;
  &lt;li&gt;Mobile:  Java version&lt;/li&gt;
  &lt;li&gt;Mobile:  Manufacturer&lt;/li&gt;
  &lt;li&gt;Technology:  Connection Types&lt;/li&gt;
  &lt;li&gt;Technology:  Monitor Color Depth&lt;/li&gt;
  &lt;li&gt;Technology:  JavaScript Version&lt;/li&gt;
  &lt;li&gt;Technology:  Monitor Resolutions&lt;/li&gt;
  &lt;li&gt;Visitor Profile:  Top-Level Domains&lt;/li&gt;
  &lt;li&gt;Visitor Profile:  Domains&lt;/li&gt;
  &lt;li&gt;Visitor Profile:  Geosegmentation&lt;/li&gt;
  &lt;li&gt;Traffic Sources:  All Search Page Ranking&lt;/li&gt;
  &lt;li&gt;Traffic Sources: Original Referring Domains&lt;/li&gt;
  &lt;li&gt;Custom Variable:  s.server report&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;mobile-reports&quot;&gt;Mobile reports&lt;/h2&gt;

&lt;p&gt;For the most part, the information in the separate reports can be determined just by knowing the device (which is also a default Omniture report). So, a single report can take the place of 10.&lt;/p&gt;

&lt;p&gt;There’s also the pesky issue that the reports more often than not show “Unknown” for 90%+ of the mobile traffic (at least, in the U.S.).  So not only can the data be determined from knowing the mobile device being used, the additional reports aren’t even well populated.&lt;/p&gt;

&lt;h2 id=&quot;technology-reports&quot;&gt;Technology reports&lt;/h2&gt;

&lt;p&gt;The “Connection Type” report, along with “Monitor Color Depth”, measures things that haven’t been an issue for too many years to keep reporting on. The answer is always the same: LAN, 16-bit or higher.&lt;/p&gt;

&lt;p&gt;“Monitor resolution” is irrelevant in the face of also having “Browser Width” &amp;amp; “Browser Height” reports (the true size of the web page “real estate” on screen).&lt;/p&gt;

&lt;p&gt;Finally, JavaScript version?  The JavaScript report with “Enabled/Disabled” is likely more than enough information.  Or, you can just include jQuery in your website and know with 100% certainty what version is being used.&lt;/p&gt;

&lt;h2 id=&quot;visitor-profile-reports&quot;&gt;Visitor Profile reports&lt;/h2&gt;

&lt;p&gt;My dislike of the identified Visitor Profile reports is due to halfway implementation.  The “GeoSegmentation” report shows a nice map representation, but only of traffic metrics like Page Views and Visits.  Why not open this up to conversion variables and really make the visualization useful, instead of needing to rely on the “flat”, non-map Visitor Zip (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s.zip&lt;/code&gt;) report?&lt;/p&gt;

&lt;p&gt;For the “Domains” and “Top-Level Domains” report, you have granularity issues; the “Top-Level Domains” report is sort-of a country-level report, but the U.S. has several line items.  The “Domains” report shows what ISP people are using to access the Internet (which I think is generally useless in itself), but again…it spans geography, so the ISP network someone is on may not even have the same technology.  So what are we really measuring in these reports?&lt;/p&gt;

&lt;h2 id=&quot;traffic-sources-reports&quot;&gt;Traffic Sources reports&lt;/h2&gt;

&lt;p&gt;The “All Search Page Ranking” report seems like it could be useful, until you realize that 1) it aggregates all search engines (whose different algorithms provide different rankings) and 2) with personalized search, rankings are no longer static. Literally every single person could see a different link position for the same search term.  So while this report may have made sense for SEO measurement in the past, it’s really past its prime…use the right SEO tool for the job (Conductor, SEOmoz, and the like).&lt;/p&gt;

&lt;p&gt;The “Original Referring Domains” report is weird in its own way…the absolute first URL that referred you to the site.  Really?  As Avinash has said, giving 100% credit to the first touchpoint is like giving your first girlfriend credit for you marrying your wife (paraphrased).  This report is very limited in its usefulness IMO, especially given the advances in attribution modeling in the past several years.&lt;/p&gt;

&lt;h2 id=&quot;custom-variable-sserver-report&quot;&gt;Custom Variable:  s.server report&lt;/h2&gt;

&lt;p&gt;The only custom variable report I have on this list is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s.server&lt;/code&gt; report; hopefully, all of your other custom variables are capturing only business-useful information!&lt;/p&gt;

&lt;p&gt;The reason I dislike the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s.server&lt;/code&gt; variable/report is the same reason I dislike the “All Search Page Ranking” report; use the right tool for the job.  This is a lazy way of monitoring server volume for load balancing.  But if you’re doing the job well on the back-end, shouldn’t every server have the same level of volume?&lt;/p&gt;

&lt;p&gt;Even if the answer to the previous question is no (I’m not a network engineer, clearly), having an &lt;em&gt;operational&lt;/em&gt; report like this doesn’t make much sense to me in a &lt;em&gt;marketing&lt;/em&gt; reporting tool.&lt;/p&gt;

&lt;h2 id=&quot;hide-in-the-menu-dont-restrict-access&quot;&gt;Hide in the menu, don’t restrict access&lt;/h2&gt;

&lt;p&gt;Hiding reports in the Omniture menu interface doesn’t mean the info stops being collected or becomes unavailable to all users.  Rather, the option to use the reports isn’t immediately obvious (since they don’t show up in the menu).  Power Users can still find these reports using the search box if necessary to answer an oddball question.&lt;/p&gt;

&lt;p&gt;But in my experience, the information in these reports is generally not business useful, or is lacking in some critical way.  If you can’t make &lt;em&gt;regular, high impact decisions&lt;/em&gt; with the info, then you’re better off never looking at it at all.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Apple Has Earned a Customer for Life</title>
        
          <description>&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/06/macbook-pro-broken-hinge-screen.jpg&quot; alt=&quot;macbook pro broken hinge&quot; /&gt;&lt;/p&gt;

</description>
        
        <pubDate>Mon, 25 Jun 2012 20:38:57 +0000</pubDate>
        <link>
        http://randyzwitch.com/broken-macbook-pro-hinge-fixed-free/</link>
        <guid isPermaLink="true">http://randyzwitch.com/broken-macbook-pro-hinge-fixed-free/</guid>
        <content type="html" xml:base="/broken-macbook-pro-hinge-fixed-free/">&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/06/macbook-pro-broken-hinge-screen.jpg&quot; alt=&quot;macbook pro broken hinge&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
    Broken MacBook Pro hinge (due to glue failure)
 &lt;/p&gt;

&lt;p&gt;I used to think that when people talked about the “legendary Apple customer service” that there was plenty of hyperbole thrown in for good measure.  Until it happened to me with my broken MacBook Pro hinge.&lt;/p&gt;

&lt;h2 id=&quot;broken-macbook-pro-hinge---plenty-of-search-results&quot;&gt;“Broken MacBook Pro Hinge” - Plenty of search results&lt;/h2&gt;

&lt;p&gt;When the screen on my late 2008 15” MacBook Pro started separating from the hinge, the first thing I did was search Google.  There I found more than enough &lt;a title=&quot;Broken MacBook Pro Google search results&quot; href=&quot;http://www.google.com/#hl=en&amp;amp;sclient=psy-ab&amp;amp;q=broken+Macbook+pro+hinge&amp;amp;oq=broken+Macbook+pro+hinge&amp;amp;aq=f&amp;amp;aqi=g-K1g-bK1g-bsK1g-bK1&amp;amp;aql=&amp;amp;gs_l=hp.3..0i30j0i8i30j0i8i10i30j0i8i30.1489.8286.0.8403.32.24.2.4.4.0.429.4546.1j13j7j1j1.23.0...1.0.d5zdW3pAo3g&amp;amp;pbx=1&amp;amp;bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&amp;amp;fp=c1e99b5acbebabce&amp;amp;biw=1600&amp;amp;bih=702&quot; target=&quot;_blank&quot;&gt;search results&lt;/a&gt; to make me believe this was a widespread issue with this vintage of laptop.  And since the laptop was out of warranty, most of the results talked about re-gluing the aluminum screen cover to the hinge.&lt;/p&gt;

&lt;p&gt;After trying to re-attach the hinge to the screen using epoxy, I headed over to the Apple store in King of Prussia, PA.  To say this first encounter at the Genius Bar was frustrating is an understatement.&lt;/p&gt;

&lt;h2 id=&quot;you-shouldve-bought-applecare&quot;&gt;You should’ve bought AppleCare&lt;/h2&gt;

&lt;p&gt;Apple &lt;del&gt;cashiers&lt;/del&gt; “Geniuses” and fanboys alike are very big on pushing the AppleCare warranty, selling you with tales that Apple will fix &lt;em&gt;anything&lt;/em&gt; in that extended time period.  While that may be true, extended warranties generally don’t pay off for the consumer, and as such, I don’t buy them.&lt;/p&gt;

&lt;p&gt;Not that it would have mattered for me anyway.  My MacBook Pro is well beyond 3 years old, one of the first unibody models that came out.  You’d think the Apple “Genius” would’ve known that after checking the serial number, but instead just kept repeating robotically:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“You should’ve bought AppleCare.  You should’ve bought AppleCare.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Even when I asked, “A glue failure doesn’t seem like a manufacturer’s defect?” or “I should’ve paid $349 for an extended warranty to protect against $0.05 of faulty glue?”&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“You should’ve bought AppleCare.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At that point, after being asked if I dropped the laptop, given a series of robotic answers, told that I should’ve spent $349 on a warranty that wouldn’t have fixed my problem, and generally treated like a monkey, I felt like smashing the laptop right on the Genius Bar just to make a scene.  Instead, I walked out feeling worse than when I arrived, with crippled MacBook Pro in hand.&lt;/p&gt;

&lt;h2 id=&quot;maybe-an-apple-certified-repair-facility-can-help&quot;&gt;Maybe an Apple Certified Repair facility can help&lt;/h2&gt;

&lt;p&gt;Since I wasn’t going back for a second round of stupidity at the King of Prussia Apple Store, I decided to look up an independent shop to see what the cost of repair would be.  The repair guy immediately said “Oh, I’ve seen this a few times recently…it’s probably around $500-$600 to fix.”&lt;/p&gt;

&lt;p&gt;$%^$&amp;amp;%*(#!  For $600, I’d be about 30-35% of the way to a new 15” MacBook Pro.  Again I left a store without doing anything, and feeling worse than when I arrived.  I either needed to pay $600 or pay $2000+ to get the newer equivalent of my laptop.&lt;/p&gt;

&lt;h2 id=&quot;one-more-trip-to-the-apple-store&quot;&gt;One more trip to the Apple Store&lt;/h2&gt;

&lt;p&gt;Several weeks had passed and my laptop became pretty much unusable.  I decided to bite the bullet and pay to get the screen fixed.  I also decided to go back to an Apple Store (this time, in Ardmore, PA) to have them fix it.  I figured if I’d have to pay, might as well guarantee it would get fixed properly.&lt;/p&gt;

&lt;p&gt;When I walked up to the Genius Bar, the Apple “Genius” still asked me if I dropped my laptop (&lt;em&gt;sidebar:  Is this part of the mind tricks they give everyone?  There isn’t a scratch on the thing, let alone any dents&lt;/em&gt;).  After the Apple employee looked over the laptop, I told him in my most dejected voice that I wanted to find out how much it would be to replace the screen.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Apple Genius:  “How about ‘free’?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I damn near fell off the stool I was sitting on.  How could the Apple Store in King of Prussia have been so unhelpful, and then 5 minutes into the same explanation I get an offer to get the screen fixed FREE at the Suburban Square Apple Store in Ardmore?&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Apple Genius:  “And we can probably get this back to you by tomorrow.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Needless to say, I didn’t want to do anything except hit ‘Accept’ on the electronic repair form.  I’d come too far to mess this gift up!&lt;/p&gt;

&lt;h2 id=&quot;apple-youve-earned-yourself-a-lifetime-customer&quot;&gt;Apple, you’ve earned yourself a lifetime customer&lt;/h2&gt;

&lt;p&gt;Maybe I got lucky.  Maybe it was perseverance.  Maybe this screen/hinge defect has shown up too many times in the last six weeks and Apple could no longer ignore it.&lt;/p&gt;

&lt;p&gt;Maybe it’s because I asked twice at two different Genius appointments. Or maybe Apple has realized I’ve spent several thousand dollars with them in the past several years, with this MacBook Pro, iMac, several iPhones and an iPad.  That level of spend probably doesn’t even get me in the top 50% of non-business customers, but it’s not negligible either.&lt;/p&gt;

&lt;p&gt;Whatever the reason, by comping me the $492.41, Apple has “bought” themselves a customer for life.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/06/mac-repair-order.png&quot; alt=&quot;em209-mac-repair-order&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
    The cost of a broken MacBook Pro hinge? Apparently, $492.41!
 &lt;/p&gt;

&lt;p&gt;Edit: To read the follow-up on what eventually became of this MacBook Pro, &lt;a href=&quot;http://randyzwitch.com/apple-macbook-pro-model-a1286-late-2008-vintage/&quot;&gt;click here&lt;/a&gt; for an article about my replacement battery interaction with Apple.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>For Maximum User Understanding, Customize the SiteCatalyst Menu</title>
        
          <description>&lt;blockquote class=&quot;twitter-tweet&quot; data-conversation=&quot;none&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;&lt;a href=&quot;https://twitter.com/nancyskoons&quot;&gt;@nancyskoons&lt;/a&gt; &lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;@randyzwitch&lt;/a&gt; &lt;a href=&quot;https://twitter.com/shawncreed&quot;&gt;@shawncreed&lt;/a&gt; Best Practice #1: Customizing anything is better than customizing nothing. &lt;a href=&quot;https://twitter.com/hashtag/measure?src=hash&quot;&gt;#measure&lt;/a&gt; &lt;a href=&quot;https://twitter.com/hashtag/omniture?src=hash&quot;&gt;#omniture&lt;/a&gt;&lt;/p&gt;&amp;mdash; Jason Egan (@jasonegan) &lt;a href=&quot;https://twitter.com/jasonegan/status/210398632082538497&quot;&gt;June 6, 2012&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

</description>
        
        <pubDate>Wed, 06 Jun 2012 17:31:39 +0000</pubDate>
        <link>
        http://randyzwitch.com/customize-adobe-sitecatalyst-menu/</link>
        <guid isPermaLink="true">http://randyzwitch.com/customize-adobe-sitecatalyst-menu/</guid>
        <content type="html" xml:base="/customize-adobe-sitecatalyst-menu/">&lt;blockquote class=&quot;twitter-tweet&quot; data-conversation=&quot;none&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;&lt;a href=&quot;https://twitter.com/nancyskoons&quot;&gt;@nancyskoons&lt;/a&gt; &lt;a href=&quot;https://twitter.com/randyzwitch&quot;&gt;@randyzwitch&lt;/a&gt; &lt;a href=&quot;https://twitter.com/shawncreed&quot;&gt;@shawncreed&lt;/a&gt; Best Practice #1: Customizing anything is better than customizing nothing. &lt;a href=&quot;https://twitter.com/hashtag/measure?src=hash&quot;&gt;#measure&lt;/a&gt; &lt;a href=&quot;https://twitter.com/hashtag/omniture?src=hash&quot;&gt;#omniture&lt;/a&gt;&lt;/p&gt;&amp;mdash; Jason Egan (@jasonegan) &lt;a href=&quot;https://twitter.com/jasonegan/status/210398632082538497&quot;&gt;June 6, 2012&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/06/stock-menu-109x300.png&quot; alt=&quot;stock-menu&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
Default Omniture report menu
&lt;/p&gt;

&lt;p&gt;Visits vs. Visitors vs. Unique Visitors…click-throughs, view-throughs, bounces…these concepts in digital analytics are fairly abstract, and many in business and marketing never really grasp the concepts fully.  Knowing the enormous amount of learning that needs to take place for digital success, why do we make our internal stakeholders hunt for data that’s organized by TOOL definitions, instead of by business function?&lt;/p&gt;

&lt;p&gt;In this case, the “tool” that I’m referring to here is Omniture SiteCatalyst.  To be clear, there’s nothing excessively &lt;em&gt;wrong&lt;/em&gt; about the default menu structure in Omniture, just that in my experience, understanding by end-users can be greatly enhanced by customizing the Omniture menu.&lt;/p&gt;

&lt;p&gt;Simple modifications such as 1) hiding Omniture variables and products not in use, 2) organizing reports by logical business function, and 3) placing custom reports and calculated metrics next to the standard SiteCatalyst reports will get users making decisions with their data that much faster.&lt;/p&gt;

&lt;h2 id=&quot;1-hide-omniture-variables-and-products-not-being-used&quot;&gt;1)  Hide Omniture variables and products not being used&lt;/h2&gt;

&lt;p&gt;Do your users a favor and hide the Omniture products such as Test &amp;amp; Target, Survey, and Genesis if you aren’t using them.  Same thing with any custom traffic (props) and custom conversion variables (eVars) that aren’t being used.  Nothing will distract your users faster than clicking on folders with advertisements (T&amp;amp;T, Survey) or worse, frustrate the user by making them wonder “What data is &lt;em&gt;supposed to be&lt;/em&gt; in this report?”&lt;/p&gt;

&lt;p&gt;Just by hiding or disabling these empty reports and tools advertisements, you should see an increased confidence in data quality.  Or at the very least, keep the conversation from taking a detour.&lt;/p&gt;

&lt;h2 id=&quot;2-organize-sitecatalyst-reports-by-logical-business-function&quot;&gt;2)  Organize SiteCatalyst reports by logical business function&lt;/h2&gt;

&lt;p&gt;Your internal users aren’t thinking about Omniture variable structures when they are trying to find the answer to their business questions.  So why do we keep our data artificially separated by “Custom Events”, “Custom Conversions” and “Custom Traffic”?&lt;/p&gt;

&lt;p&gt;Worse yet, who remembers that the number of Facebook Likes can be found at “&lt;em&gt;Site Metrics -&amp;gt; Custom Events -&amp;gt; Custom Events 21-30&lt;/em&gt;?”  And why are Facebook Likes next to “Logins”?  Does that mean Facebook Logins?  Probably not.&lt;/p&gt;

&lt;p&gt;Wouldn’t it be better for our users to organize reports by business function, such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Financial/Purchase Metrics&lt;/strong&gt; (Revenue, Discounts, Shipping, AOV, Units, Revenue Per Visit)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Usability&lt;/strong&gt; (Browser, Percent of Page Viewed, Operating System)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;SEO&lt;/strong&gt; (Non-campaign visits, Referring Domains)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Mobile&lt;/strong&gt; (Device, browser, resolution)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Site Engagement&lt;/strong&gt; (Page Views, Internal Campaigns, Logins)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Site Merchandising&lt;/strong&gt; (Products Viewed, Cart Add Ratio, Cross-Sell)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Social&lt;/strong&gt; (Facebook Likes, Pinterest Pins, Visits from Social domains)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Paid Campaigns&lt;/strong&gt; (Email, Paid Search, Display)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Traffic&lt;/strong&gt; (Total Visits, Geosegmentation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The list above isn’t meant to be exhaustive, or necessarily how you should organize your SiteCatalyst menus.  But for me, organizing the reports by the business function keeps my business thinking flowing, rather than trying to remember how Omniture was implemented by variable type.&lt;/p&gt;

&lt;h2 id=&quot;3-place-custom-reports-and-calculated-metrics-next-to-the-standard-sitecatalyst-reports&quot;&gt;3)  Place custom reports and calculated metrics next to the standard SiteCatalyst reports&lt;/h2&gt;

&lt;p&gt;This is probably more like “2b” to the above, but there’s no reason to keep custom reports and calculated metric reports segregated either.  Custom reports happen because of a specific business need, and the same thing with calculated metrics.  By placing these reports along with the out-of-the-box reports from SiteCatalyst, you take away the artificial distinction between data natively in SiteCatalyst and business-specific data populated by a web developer.&lt;/p&gt;

&lt;h2 id=&quot;why-you-wouldnt-want-to-customize&quot;&gt;Why wouldn’t you want to customize?&lt;/h2&gt;

&lt;p&gt;Shawn makes two great points in &lt;a title=&quot;Dont customize SiteCatalyst&quot; href=&quot;http://shawncreed.com/blog/sitecatalyst-menu-customization.htm&quot; target=&quot;_blank&quot;&gt;his post&lt;/a&gt; about (not) customizing the SiteCatalyst menu: users require special training and menu customization isn’t scalable.&lt;/p&gt;

&lt;h3 id=&quot;users-need-special-training&quot;&gt;&lt;em&gt;Users need special training&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;Users need to be trained anyway.  I don’t think either of us is suggesting moving all of the menus around after an implementation has been in place for years…but if you’re a company just starting out, why not start off customized?&lt;/p&gt;

&lt;p&gt;Fellow Keystoner Tim Patten also commented to me via Twitter DM that power users are used to the “default” menu, and that it’s annoying to have to learn a new menu when switching companies.  I’m not really worried about power users; I’m thinking about the hundreds of users in thousands of organizations who can’t get beyond page views and visits.  Power users can pick up a new menu quickly, switch back to default, or use the search box.&lt;/p&gt;

&lt;h3 id=&quot;menu-customization-isnt-scalable&quot;&gt;&lt;em&gt;Menu Customization isn’t scalable&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;This is very much true.  The larger the company, and the more complex and varied the tracking, the less scalable menu customization becomes.  This is probably an area where specific dashboards are a much better strategy than customizing the menus.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;For me, one of the first things I look for when working with a company looking to get their digital analytics program off the ground is whether they’ve customized their Omniture menu structure.  As a free customization, it’s something that companies should at least &lt;em&gt;consider&lt;/em&gt;.  Organizing reports by business function requires a business to think about the questions they want to regularly answer, will keep novice users from focusing on implementation concepts, and overall is just better because it’s how I think 🙂&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This blog post is a continuation of a &lt;a title=&quot;Original Tweet about SiteCatalyst Menu Customization&quot; href=&quot;https://twitter.com/randyzwitch/status/210042295859417090&quot; target=&quot;_blank&quot;&gt;Twitter conversation&lt;/a&gt; with Shawn C. Reed (&lt;a title=&quot;Shawn C. Reed Twitter account&quot; href=&quot;https://twitter.com/#!/shawncreed&quot; target=&quot;_blank&quot;&gt;@shawncreed&lt;/a&gt;), Jason Egan (&lt;a title=&quot;Jason Egan Twitter&quot; href=&quot;https://twitter.com/#!/jasonegan&quot; target=&quot;_blank&quot;&gt;@jasonegan&lt;/a&gt;), Tim Patten (&lt;a title=&quot;Tim Patten Twitter&quot; href=&quot;https://twitter.com/#!/timpatten&quot; target=&quot;_blank&quot;&gt;@timpatten&lt;/a&gt;) and others.  Shawn’s counter-argument can be found &lt;a title=&quot;Why Shawn C. Reed prefers not to customize SiteCatalyst&quot; href=&quot;http://shawncreed.com/blog/sitecatalyst-menu-customization.htm&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.  Jason wrote about &lt;a title=&quot;Jason Egan blog post&quot; href=&quot;http://www.jasonegan.net/2009/09/26/omniture-sitecatalyst-menu-customization-and-custom-reports/&quot; target=&quot;_blank&quot;&gt;Omniture menu customization&lt;/a&gt; a few years back.  And finally, if you want to read more pros-and-cons about SiteCatalyst menu customization, see the Adobe blog posts &lt;a title=&quot;Adobe post 1&quot; href=&quot;http://blogs.adobe.com/digitalmarketing/analytics/taking-sitecatalyst-menus-to-the-masses-part-i/&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt; and &lt;a title=&quot;Adobe post 2&quot; href=&quot;http://blogs.adobe.com/digitalmarketing/analytics/taking-sitecatalyst-menus-to-the-masses-part-ii/&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Effect Of Modified Bounce Rate In Google Analytics</title>
        
          <description>&lt;p&gt;A few months back, Justin Cutroni posted on his &lt;a title=&quot;Justin Cutroni web analytics blog&quot; href=&quot;http://cutroni.com/blog&quot; target=&quot;_blank&quot;&gt;blog&lt;/a&gt; some jQuery code that &lt;a title=&quot;Modifying Bounce Rate and Time on Site in Google Analytics&quot; href=&quot;http://cutroni.com/blog/2012/02/21/advanced-content-tracking-with-google-analytics-part-1/&quot; target=&quot;_blank&quot;&gt;modifies how Google Analytics tracks content&lt;/a&gt;.  Specifically, the code snippet changes how bounce rate and time on site are calculated, creates a custom variable to classify whether visitors are “Readers” vs. “Scanners” and adds some Google Analytics events to track how far down the page visitors are reading.&lt;/p&gt;

</description>
        
        <pubDate>Thu, 10 May 2012 08:05:02 +0000</pubDate>
        <link>
        http://randyzwitch.com/bounce-rate-modification-google-analytics-cutroni/</link>
        <guid isPermaLink="true">http://randyzwitch.com/bounce-rate-modification-google-analytics-cutroni/</guid>
        <content type="html" xml:base="/bounce-rate-modification-google-analytics-cutroni/">&lt;p&gt;A few months back, Justin Cutroni posted on his &lt;a title=&quot;Justin Cutroni web analytics blog&quot; href=&quot;http://cutroni.com/blog&quot; target=&quot;_blank&quot;&gt;blog&lt;/a&gt; some jQuery code that &lt;a title=&quot;Modifying Bounce Rate and Time on Site in Google Analytics&quot; href=&quot;http://cutroni.com/blog/2012/02/21/advanced-content-tracking-with-google-analytics-part-1/&quot; target=&quot;_blank&quot;&gt;modifies how Google Analytics tracks content&lt;/a&gt;.  Specifically, the code snippet changes how bounce rate and time on site are calculated, creates a custom variable to classify whether visitors are “Readers” vs. “Scanners” and adds some Google Analytics events to track how far down the page visitors are reading.&lt;/p&gt;

&lt;p&gt;Given that this blog is fairly technical and specific in nature, I was interested in seeing how the standard Google Analytics metrics would change if I implemented this code and how my changes &lt;a title=&quot;Justin Cutroni bounce rate code results&quot; href=&quot;http://cutroni.com/blog/2012/02/23/advanced-content-tracking-with-google-analytics-part-2/&quot; target=&quot;_blank&quot;&gt;compared to Justin’s&lt;/a&gt;.  I’ve always suspected my bounce rate in the 80-90% range didn’t really represent whether people were finding value in my content.  The results were quite surprising to say the least!&lt;/p&gt;

&lt;h2 id=&quot;bounce-rate---dropped-through-the-floor&quot;&gt;Bounce Rate - Dropped through the floor!&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/05/bounce-rate-graph-google-analytics-1024x212.png&quot; alt=&quot;bounce-rate-graph-google-analytics&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
Starting April 24th, Bounce Rate drops considerably!
&lt;/p&gt;

&lt;p&gt;As expected, implementing the content tracking code caused a significant drop in bounce rate, due to counting scrolling as a page “interaction” using Google Analytics events. Thus, the definition of bounce rate changed from &lt;em&gt;single page view visits&lt;/em&gt; to &lt;em&gt;visitors that don’t interact with the page by scrolling at least 150 pixels&lt;/em&gt;.&lt;/p&gt;
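
&lt;p&gt;To make the definition change concrete, here is a minimal Python sketch. It is a simplified model, not Justin’s jQuery code and not how Google Analytics works internally: I’m assuming a visit counts as a bounce when it contains exactly one interaction hit, and the hit records, the &lt;code&gt;pageview&lt;/code&gt; and &lt;code&gt;scroll_event&lt;/code&gt; helpers, and the three sample visits are all made up for illustration.&lt;/p&gt;

```python
def is_bounce(hits):
    """Simplified model: a visit bounces if it has exactly one interaction hit."""
    return sum(1 for hit in hits if hit["interaction"]) == 1


def bounce_rate(visits):
    """Share of visits that bounce, as a fraction between 0 and 1."""
    return sum(1 for hits in visits if is_bounce(hits)) / len(visits)


def pageview():
    return {"type": "pageview", "interaction": True}


def scroll_event():
    # Hypothetical event fired once the visitor scrolls 150 pixels or more.
    return {"type": "event", "interaction": True}


# Three made-up single-page visits; two of the visitors scrolled.
without_tracking = [[pageview()], [pageview()], [pageview()]]
with_tracking = [
    [pageview()],
    [pageview(), scroll_event()],
    [pageview(), scroll_event()],
]

print(bounce_rate(without_tracking))
print(bounce_rate(with_tracking))
```

&lt;p&gt;The visitors behave identically in both lists; only the tracking differs. Without the scroll event every single-page visit bounces, while with it only the visitor who never scrolled does.&lt;/p&gt;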

&lt;p&gt;In the case of my blog, the bounce rate dropped from &lt;strong&gt;80-90%&lt;/strong&gt; to &lt;strong&gt;5-15%&lt;/strong&gt;!  This result tells me that people who arrive on-site aren’t arriving by accident, that they are specifically interested in the content.  Sure, I could’ve validated this using incoming search term research, but this provides a second data point.  The content I provide not only ranks well in Google, but once on-site also causes readers to want to see what the article contains.&lt;/p&gt;

&lt;h2 id=&quot;readers-vs-scanners&quot;&gt;Readers vs. Scanners&lt;/h2&gt;

&lt;p&gt;Even with the bounce rate drop above, I really don’t get a good feeling about whether people are actually reading the content.  Sure, people are scrolling 150px or more, but due to the ADHD nature of the web, plenty of people scroll without reading just to see what else is on the page!  That’s where the “Readers vs. Scanners” report comes in:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/05/google-analytics-reader-vs-scanner.png&quot; alt=&quot;google-analytics-reader-vs-scanner&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
62% of visits only scan instead of read - Need to do better here!
&lt;/p&gt;

&lt;p&gt;The report above shows that only 38% of visits to the site actually READ an article, rather than just quickly scroll.  This is disappointing, but now that I’ve got the information being tracked, I can set up a goal in Google Analytics with the aim of improving the ratio of actual readers vs. quick scrollers.&lt;/p&gt;

&lt;h2 id=&quot;average-visit-duration---still-useless&quot;&gt;Average Visit Duration - Still useless&lt;/h2&gt;

&lt;p&gt;Like the bounce rate definition change above, average visit duration and average time on page also change definitions when using the jQuery content tracking code.  Given that Google Analytics calculates time metrics by measuring the time between page views or events, by adding more events on the page, all time on site metrics have to increase (by definition).&lt;/p&gt;
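
&lt;p&gt;A small sketch of why the time metrics can only go up, assuming (as a simplification) that a visit’s duration is just the gap between its first and last hit. The timestamps below are invented, and &lt;code&gt;visit_duration&lt;/code&gt; is a hypothetical helper, not a Google Analytics function.&lt;/p&gt;

```python
def visit_duration(hit_times):
    """Simplified model: seconds between the first and last hit of a visit."""
    return max(hit_times) - min(hit_times)


# Made-up timestamps, in seconds since the visit began.
pageview_only = [0]               # one page view, nothing to measure against
with_scroll_events = [0, 12, 95]  # page view, start reading, end of article

print(visit_duration(pageview_only))       # 0
print(visit_duration(with_scroll_events))  # 95
```

&lt;p&gt;The same visitor spending the same 95 seconds on the page is recorded as zero seconds in the first case, simply because there is no second hit to measure against.&lt;/p&gt;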

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/05/avg-visit-duration-google-analytics-1024x230.png&quot; alt=&quot;avg-visit-duration-google-analytics&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
Hard to see because of the Y-axis, but Avg. Visit Duration increases significantly as well.
&lt;/p&gt;

&lt;p&gt;That said, average visit duration is still a pretty useless metric, given that an increase/decrease in this metric &lt;a title=&quot;Avinash:  You are what you measure&quot; href=&quot;http://www.kaushik.net/avinash/measure-choose-smarter-kpis-incentives/&quot; target=&quot;_blank&quot;&gt;doesn’t immediately tell you&lt;/a&gt; “good” or “bad”…&lt;/p&gt;

&lt;h2 id=&quot;content-consumption-funnel&quot;&gt;Content Consumption “Funnel”&lt;/h2&gt;

&lt;p&gt;Finally, the last change that occurs when you implement the content tracking code is a series of Google Analytics events that measure how far down the page visitors actually get.  This report, in combination with the Readers vs. Scanners report, helps explain reader engagement better than any generic “Time on Site” metric can.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/05/content-consumption-google-analytics-1024x145.png&quot; alt=&quot;content-consumption-google-analytics&quot; /&gt;&lt;/p&gt;

&lt;p&gt;From this report, I can see that of the 2,102 articles loaded:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;89.4%&lt;/strong&gt; of the articles have a “StartReading” event fired&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;89.8%&lt;/strong&gt; of those who start to read an article reach the bottom of the article.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;19.7%&lt;/strong&gt; of those who reach the end of the article scroll past the comments to reach the true end of page&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first metric above is analogous to subtracting the bounce rate from 1, the percentage of articles viewed that don’t bounce.  The second metric (complete articles seen), with a success rate of 89.8% is ripe for segmentation.  I stated above that only 38% actually READ an article, so segmenting the above report by “Readers” vs. “Scanners” will surely lower the success rate in the “Readers” population.&lt;/p&gt;

&lt;p&gt;Finally, that &amp;lt;20% actually touch the true bottom of page is surprising to me, since this blog really doesn’t get many comments!  If there were thousands of comments and the pages were really long, ok, no one sees the bottom…but here?  I’ll have to think about this a bit.&lt;/p&gt;
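
&lt;p&gt;Since each percentage in the funnel above is conditional on the previous step, it can help to chain them back to the 2,102 article loads. The sketch below only does that arithmetic; the resulting counts are approximations derived from the rounded percentages, not figures read from the report.&lt;/p&gt;

```python
articles_loaded = 2102

# Each rate is conditional on the previous step, as in the list above.
steps = [
    ("StartReading fired", 0.894),
    ("Reached end of article", 0.898),
    ("Reached true end of page", 0.197),
]

remaining = float(articles_loaded)
funnel = []
for label, rate in steps:
    remaining = remaining * rate
    funnel.append((label, round(remaining), remaining / articles_loaded))

for label, count, share in funnel:
    print(f"{label}: about {count} loads ({share:.1%} of all loads)")
```

&lt;p&gt;Expressed against all article loads, roughly 80% reach the end of the article and roughly 16% reach the true end of the page.&lt;/p&gt;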

&lt;h2 id=&quot;great-update-to-google-analytics-default-settings&quot;&gt;Great update to Google Analytics default settings!&lt;/h2&gt;

&lt;p&gt;Overall, my impression of the &lt;a title=&quot;jQuery Google Analytics content tracking snippet&quot; href=&quot;http://cutroni.com/blog/2012/02/21/advanced-content-tracking-with-google-analytics-part-1/&quot; target=&quot;_blank&quot;&gt;jQuery code snippet&lt;/a&gt; developed by Justin and others is that it is &lt;em&gt;extremely useful&lt;/em&gt; for understanding how visitors interact with content sites.  The only downside I see here is that it changes the definition of bounce rate within Google Analytics, which could be confusing to others who 1) aren’t aware of the code snippet running on-site or 2) don’t quite understand the subtleties of Google Analytics implementation with respect to Events and the &lt;a title=&quot;Google Analytics Non-Interaction Events&quot; href=&quot;https://developers.google.com/analytics/devguides/collection/gajs/eventTrackerGuide#non-interaction&quot; target=&quot;_blank&quot;&gt;non-interaction setting&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But since this is my personal blog, I don’t need to worry about others mis-interpreting my Google Analytics data, so I’m going to keep this functionality installed!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update 7/25/12:  Google Analytics published a similar method to the one described above, using “setTimeout” to &lt;a title=&quot;Google Analytics Modified Bounce Rate article&quot; href=&quot;http://analytics.blogspot.com/2012/07/tracking-adjusted-bounce-rate-in-google.html&quot; target=&quot;_blank&quot;&gt;modify bounce rate&lt;/a&gt; based solely on time-on-page&lt;/em&gt;.&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Adobe Discover 3:  First Impressions</title>
        
          <description>&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/04/adobe-discover-logo.png&quot; alt=&quot;Adobe Discover&quot; /&gt;&lt;/p&gt;

</description>
        
        <pubDate>Fri, 27 Apr 2012 08:24:21 +0000</pubDate>
        <link>
        http://randyzwitch.com/adobe-discover-3-first-impressions/</link>
        <guid isPermaLink="true">http://randyzwitch.com/adobe-discover-3-first-impressions/</guid>
        <content type="html" xml:base="/adobe-discover-3-first-impressions/">&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/04/adobe-discover-logo.png&quot; alt=&quot;Adobe Discover&quot; /&gt;&lt;/p&gt;

&lt;p&gt;With yesterday’s code release, &lt;del&gt;Omniture&lt;/del&gt; Adobe released version 3 of their “Discover” tool, THE way to perform web analysis within the Adobe Digital Marketing Suite.  While SiteCatalyst has its place for basic reporting, to really dig deep into your data for actionable insights there’s no substitute for using Discover.&lt;/p&gt;

&lt;p&gt;But as with every product overhaul, there is the potential to change things that users liked while not making enough improvements to excite the user base…but luckily, that’s not the case with Discover 3.  Here’s how I see the new features and design changes.&lt;/p&gt;

&lt;h2 id=&quot;new-darth-vader-interface&quot;&gt;New “Darth Vader” interface&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/04/adobe-discover-3-screenshot.png&quot; alt=&quot;adobe-discover-3-screenshot&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
&quot;Ooh, tough looking. Just like hardcore web analysts!&quot;
&lt;/p&gt;

&lt;p&gt;Of all the cool things about Discover 3, I’m not sure the new color palette is one of them.  &lt;a title=&quot;Adobe Discover 3 announcement&quot; href=&quot;http://blogs.adobe.com/digitalmarketing/analytics/discover-3-0-the-new-ui-might-just-be-as-cool-as-the-analysts-who-use-it/&quot; target=&quot;_blank&quot;&gt;Several reasons were given by Adobe&lt;/a&gt; for choosing the carbon colored interface: trying to match analysts’ personalities (yuck!), reducing eye strain (ok), and consistent branding (eh).  Of the three, I’ll say that reducing eye strain is a worthy goal, although Discover never struck me as “eye-burning” in the past.&lt;/p&gt;

&lt;p&gt;Maybe I’ll grow to like it, but right now, it seems really dark.  The light gray text on dark gray background needs a bit more contrast, and in general, the interface feels kinda depressing.&lt;/p&gt;

&lt;h2 id=&quot;calendars---no-more--sliders&quot;&gt;Calendars - No more #^%&amp;amp;$ sliders!&lt;/h2&gt;

&lt;p&gt;Now we’re getting somewhere.  The slider interface in Discover 2 never made sense to me.  You pick your time period up front, open a report, and then to modify the time period within an individual report you needed to move a bunch of jerky sliders around.&lt;/p&gt;

&lt;p&gt;In Discover 3, we now have the same style calendar interface as SiteCatalyst.  Makes sense from a consistency standpoint within the Adobe Digital Marketing Suite and a general UX standpoint.  Pointing at two dates on the calendar is way easier and faster than moving endpoints of a slider!&lt;/p&gt;

&lt;h2 id=&quot;heterogeneous-pathing&quot;&gt;Heterogeneous Pathing&lt;/h2&gt;

&lt;p&gt;This is so completely badass and the best new feature of Discover 3.  No longer are you confined to a fallout report that includes just one Omniture variable type.  So if I want to do a funnel that measures visits containing a few different pages, then triggering a Facebook ‘Like’ event, a Cart Open, then an Exit Link, I can now do so!&lt;/p&gt;

&lt;p&gt;You can also switch from “Visit-level” to “Visitor-level” on the fly, which can also be useful depending on how you view your business.  Some people like to think about every visit being an opportunity to convert on-site, whereas Avinash advocates in his &lt;a title=&quot;Web Analytics 2.0 link&quot; href=&quot;http://www.amazon.com/gp/product/0470529393/ref=as_li_ss_tl?ie=UTF8&amp;amp;tag=thefuquexpe-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=0470529393&quot; target=&quot;_blank&quot;&gt;Web Analytics 2.0 book&lt;/a&gt; that using &lt;a title=&quot;Avinash Visitors Conversion Rate&quot; href=&quot;http://www.kaushik.net/avinash/excellent-analytics-tip5-conversion-rate-basics-best-practices/&quot; target=&quot;_blank&quot;&gt;Visitors as the denominator for conversion rate&lt;/a&gt; is the proper thought model.  I won’t weigh in on the difference in this post, but it’s cool that we can now change back-and-forth to see what the differences in the data are.&lt;/p&gt;
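
&lt;p&gt;For anyone who hasn’t compared the two denominators before, here is a tiny Python sketch of the difference. The visit log is invented, and this is just the arithmetic of “orders per visit” versus “orders per unique visitor”, not anything Discover exposes as code.&lt;/p&gt;

```python
# Made-up visit log: (visitor_id, converted_during_visit)
visits = [
    ("a", False),
    ("a", True),
    ("b", False),
    ("c", False),
    ("c", False),
]

orders = sum(1 for _, converted in visits if converted)
unique_visitors = len({visitor for visitor, _ in visits})

visit_level_rate = orders / len(visits)
visitor_level_rate = orders / unique_visitors

print(f"Visit-level conversion rate:   {visit_level_rate:.1%}")
print(f"Visitor-level conversion rate: {visitor_level_rate:.1%}")
```

&lt;p&gt;Same single order, but the rate is one in five at the visit level and one in three at the visitor level, which is why it’s handy to flip between the two views.&lt;/p&gt;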

&lt;h2 id=&quot;table-builder&quot;&gt;Table Builder&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/04/adobe-discover-3-table-builder.png&quot; alt=&quot;adobe-discover-3-table-builder&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
Nice drag-and-drop options, very PivotTable like
&lt;/p&gt;

&lt;p&gt;Finally, the last really obvious difference between Discover 2 and Discover 3 is the table builder while using ranked reports.  Like the eye-strain issue talked about above, the amount of time that it took for reports to build never really seemed like an issue to me.  Perhaps that’s the &lt;del&gt;SAS&lt;/del&gt; programmer side of me that often waits hours to return a result of a complex set of commands.&lt;/p&gt;

&lt;p&gt;But now that I’ve used the table builder, it’s definitely an improvement on how data tables get built.  You get to specify each element you want in the table first, THEN the data gets retrieved.  It may sound like a small change, but when you already know what you want, not having to wait for the table to build while you keep dragging in metrics does &lt;em&gt;feel like&lt;/em&gt; it’s way faster to get the table you are looking for.&lt;/p&gt;

&lt;h2 id=&quot;adobe-discover-3---definitely-an-improvement&quot;&gt;Adobe Discover 3 - Definitely an improvement&lt;/h2&gt;

&lt;p&gt;There are probably 20 other things I haven’t noticed yet in the new Discover 3 interface, but from what I have used so far, this is a great upgrade in functionality!  It feels faster to get things completed with the table builder and the new pathing functionality across all variable types is a long time coming.  Now, if only there was a different color palette I could choose, it’d be perfect…maybe something like this?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/04/omniture-discover-1.5.png&quot; alt=&quot;omniture-discover-1.5&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
You should prefer green, not carbon.
&lt;/p&gt;</content>
      </item>
      
    
      
      <item>
        <title>Using Omniture SiteCatalyst Target Report To Calculate YOY growth</title>
        
          <description>&lt;p&gt;Of the hundreds of stock reports and capabilities present within Adobe (Omniture) SiteCatalyst, calculating year-over-year growth isn’t the easiest thing to do.  And while conversion reports (eVars) have the “Compare Dates” functionality within the calendar menu, we can’t quickly plot the difference between two time periods within a dashboard.  This is where the Omniture SiteCatalyst Target report comes in handy.&lt;/p&gt;

</description>
        
        <pubDate>Wed, 22 Feb 2012 08:15:18 +0000</pubDate>
        <link>
        http://randyzwitch.com/omniture-sitecatalyst-target-report/</link>
        <guid isPermaLink="true">http://randyzwitch.com/omniture-sitecatalyst-target-report/</guid>
        <content type="html" xml:base="/omniture-sitecatalyst-target-report/">&lt;p&gt;Of the hundreds of stock reports and capabilities present within Adobe (Omniture) SiteCatalyst, calculating year-over-year growth isn’t the easiest thing to do.  And while conversion reports (eVars) have the “Compare Dates” functionality within the calendar menu, we can’t quickly plot the difference between two time periods within a dashboard.  This is where the Omniture SiteCatalyst Target report comes in handy.&lt;/p&gt;

&lt;h2 id=&quot;setting-up-your-goal&quot;&gt;Setting up your “Goal”&lt;/h2&gt;

&lt;p&gt;Within the Omniture Knowledge Base &lt;a href=&quot;https://omniture-help.custhelp.com/app/answers/detail/a_id/2153/kw/targets&quot; target=&quot;_blank&quot;&gt;KB2153&lt;/a&gt;, I think Omniture does a disservice by stating:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Targets&lt;/em&gt; are quantifiable goals that you can place within the SiteCatalyst interface and compare against reports.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While this is a true statement, I think one of the reasons that the Omniture SiteCatalyst Target report isn’t more widely used is that people don’t realize a target doesn’t &lt;em&gt;have to be&lt;/em&gt; a future “goal” per se; any set of numbers can be used.  When last year’s numbers are used, the report becomes a year-over-year comparison!&lt;/p&gt;

&lt;p&gt;For this example, I’m going to be comparing page views year-over-year.  Here’s what the page views summary by month looks like for 2011:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/02/omniture-page-views-report.png&quot; alt=&quot;omniture-page-views-report&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
Omniture Page Views Report - 2011
&lt;/p&gt;

&lt;p&gt;Using this report to set a year-over-year target, we can see the early months are in the low thousands of page views, increasing to 12,000-14,000 later in the year.&lt;/p&gt;

&lt;h2 id=&quot;omniture-sitecatalyst-target-interface---inputting-our-numbers&quot;&gt;Omniture SiteCatalyst Target interface - Inputting our numbers&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/02/Screen-Shot-2012-02-22-at-7.49.46-AM.png&quot; alt=&quot;Screen Shot 2012-02-22 at 7.49.46 AM&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
Monthly Target Setup within Omniture Interface
&lt;/p&gt;

&lt;p&gt;Assuming you are using Omniture SiteCatalyst v15, you set up a Target report under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Favorites -&amp;gt; Targets -&amp;gt; Manage Targets&lt;/code&gt;, then choose &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Add New&lt;/code&gt; once you’re in the Targets menu.  I’ll be setting up a monthly target for Page Views, so I’ll just type it in instead of using the file upload capability.  For this example, we want to apply this target to “Entire Site” for the “Page Views” Metric.  The date range will be all of 2012, with “Monthly” granularity.  This will give you 12 boxes to type in the 2011 Page View results, and once we hit “Ok” to save, we’ll have our year-over-year report set up.&lt;/p&gt;

&lt;h2 id=&quot;getting-the-year-over-year-graph-within-omniture&quot;&gt;Getting the Year-over-Year graph within Omniture&lt;/h2&gt;

&lt;p&gt;Showing the results of our newly created Target report is as easy as going to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Favorites -&amp;gt; Targets&lt;/code&gt;, then choosing the appropriate Target.  By default, the report will show like a normal metric report, with a green overlay for your targets:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/02/Omniture-target-report-default.png&quot; alt=&quot;Omniture-target-report-default&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
Default view of Omniture SiteCatalyst Target report
&lt;/p&gt;

&lt;p&gt;The above report shows that Page Views for January 2012 are well above our January 2011 “Target” and that February has already exceeded the goal as well…which is great since we’ve got 7 more days left in the month!&lt;/p&gt;

&lt;p&gt;If we want to show the year-over-year delta, however, we can choose the “Variance” report option at the top of the graph.  Doing so will show the following report:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/wp-content/uploads/2012/02/omniture-target-variance.png&quot; alt=&quot;omniture-target-variance&quot; /&gt;&lt;/p&gt;

&lt;p class=&quot;wp-caption-text&quot;&gt;
Omniture SiteCatalyst Target Report - &quot;Variance&quot; option
&lt;/p&gt;

&lt;p&gt;By placing this report in a dashboard, we can quickly evaluate whether Page Views have grown by month year-over-year.  It’s disappointing that the only graph option Adobe provides is the raw amount by which the metric is higher or lower than the target, instead of a percentage difference view, but the percentage difference is calculated as part of the data table view that goes along with this report.&lt;/p&gt;
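
&lt;p&gt;If you export the numbers, the same comparison is easy to reproduce outside the interface.  The sketch below is only an illustration: the page view figures are made up, the function name is hypothetical, and it assumes percentage difference means (actual - target) / target, which may not match exactly how SiteCatalyst calculates its data table.&lt;/p&gt;

```python
def yoy_change(actual, target):
    """Return (variance, percent difference) of actual against last year's target."""
    variance = actual - target
    # Assumption: percent difference is measured relative to the target value
    percent = variance / target * 100 if target else None
    return variance, percent

# Made-up monthly page views; last year's values play the role of the "Target"
target_2011 = {"Jan": 3000, "Feb": 4000}
actual_2012 = {"Jan": 15000, "Feb": 13000}

results = {month: yoy_change(actual_2012[month], target_2011[month]) for month in target_2011}
for month, (variance, percent) in results.items():
    print(f"{month}: variance {variance:+,} page views, {percent:+.1f}% year-over-year")
```

&lt;p&gt;The variance is the quantity the “Variance” graph plots, while the percentage is the figure you would otherwise have to read off the data table.&lt;/p&gt;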

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;When talking about “growth” many businesses aren’t content with just year-over-year growth, usually aiming for 10%, 20%…10,000% growth.  These are goals that work well to track within the Omniture SiteCatalyst Target report.  But year-over-year growth can be worth monitoring too, and the Omniture SiteCatalyst Target report is a great way to do so.&lt;/p&gt;</content>
      </item>
      
    
  </channel>
</rss>
