Category Archives: Programming

Optimizing Sidekiq For Maximum CPU Performance on Multicore Systems

I have been working a lot with somewhat large datasets (millions of records) that benefit from parallel processing. I thought Sidekiq’s multi-threading was going to be a great solution for this, but upon further investigation, I noticed my work was only marginally faster and my CPU was never at 100%. In fact, it was hovering around 25%… what gives? Maybe my jobs are IO bound? Nope, that wasn’t the case… top showed CPU wait time at 0.0. The CPU wasn’t waiting for more IO! What could be the issue?

Global Interpreter Lock (GIL) Sadness

On further research, I learned that all MRI Ruby threads run one at a time, even on a multi-core system! This is the Global Interpreter Lock protecting non-thread-safe code. JRuby and Rubinius have threads that can run in parallel, but I didn’t have a chance to try them. Reading this Toptal article was very helpful for understanding the difference between concurrency and parallelism in Ruby. (Sorry for using “parallel” incorrectly in previous posts!)
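
To see the GIL in action, here is a tiny Ruby sketch of my own (not from any benchmark in this post) that times the same CPU-bound work run serially and then across threads; on MRI the threaded run takes about as long as the serial one, while on JRuby or Rubinius it can actually be faster:

require 'benchmark'

# CPU-bound busywork with no IO, so the GIL is the only thing in the way.
def burn_cpu
  200_000.times { |i| Math.sqrt(i) * Math.sin(i) }
end

Benchmark.bm(10) do |bm|
  # Four rounds of work on a single thread.
  bm.report('serial')   { 4.times { burn_cpu } }

  # The same four rounds on four threads. Under MRI only one thread runs
  # Ruby code at a time, so this is not ~4x faster.
  bm.report('threaded') { Array.new(4) { Thread.new { burn_cpu } }.each(&:join) }
end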

Solution For Maxing Out CPU in Sidekiq

So the only way to max out CPU utilization with Sidekiq is to use more processes. All you have to do is spawn more Sidekiq processes with the same configuration file and they will be added to the pool of workers. Neat and simple! Note that more worker processes mean more memory: while worker threads can share memory, worker processes do not, and if you spawn too many processes, you’ll run out of memory quickly. Also, in general, it is better to spawn only as many worker processes as you have logical CPUs.

I wrote a quick script to manage starting/stopping sidekiq workers. Feel free to use it too if you’d like:

#!/bin/bash
 
NUM_WORKERS=2
NUM_PROCESSES=4
 
# http://www.ostricher.com/2014/10/the-right-way-to-get-the-directory-of-a-bash-script/
get_script_dir () {
     SOURCE="${BASH_SOURCE[0]}"
     # While $SOURCE is a symlink, resolve it
     while [ -h "$SOURCE" ]; do
          DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
          SOURCE="$( readlink "$SOURCE" )"
          # If $SOURCE was a relative symlink (no "/" prefix), resolve it relative to the symlink's base directory
          [[ $SOURCE != /* ]] && SOURCE="$DIR/$SOURCE"
     done
     DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
     echo "$DIR"
}
 
start_sidekiq_workers() {
  echo "Starting $NUM_PROCESSES sidekiq procesess with $NUM_WORKERS each."
  for n in `eval echo {1..$NUM_PROCESSES}`; do
    bundle exec sidekiq -r "$(get_script_dir)/../config/environment.rb" -c $NUM_WORKERS &
  done
}
 
case $1 in
  stop)
  ps aux|grep "sidekiq 3"|grep -v grep|awk '{print $2}'|xargs kill
  ;;
  start)
  start_sidekiq_workers
  ;;
  status)
  ps aux|grep "sidekiq 3"|grep -v grep
  ;;
  *)
  start_sidekiq_workers
  ;;
esac

Results

After playing around with worker threads and processes, here are the results of the job I was working on with different parameters:

Completed importing all files in 01:39:48:119774276. – 25 workers, 1 process
Completed importing all files in 00:44:06:214396540. – 10 workers, 2 processes
Completed importing all files in 00:28:21:940166878. – 5 workers, 4 processes
Completed importing all files in 00:17:51:737359697. – 4 workers, 4 processes
Completed importing all files in 00:11:04:804641568. – 2 workers, 8 processes
Completed importing all files in 00:09:59:336971420. – 1 worker, 16 processes

Clearly, using more processes is faster than just more worker threads. Just make sure you have enough memory! For my test run, I could only divide up my work into 16 jobs, so I couldn’t test with more processes… but I think at a certain point, adding more processes would not help make the job run any faster and would probably start slowing down the system with overhead. I recommend running benchmarks on a small subset of your data to determine what the right balance would be before processing the whole thing! You can save a lot of time if you can make a 20 hour job turn into a 2 hour job.

I would love to see how multi-threaded processes work with Rubinius. It’s been pretty fun learning about concurrency and parallel computing in a Ruby context.

MySQL – Processing 8.5 Million Rows In a Reasonable Amount of Time

I had to crunch through a database of approximately 8.5 million records computing a hash, validating a few fields, and then updating a column with the results of the hash. Because I needed to compute the hash, I couldn’t just use an UPDATE statement to work on all the rows – I had to read and update each of the 8.5 million rows in my script. Sounds painful already!

On my initial attempt at SELECT-ing each record, calculating the hash, and UPDATE-ing, I was able to do about 150 rows per second… Let’s see, that’ll take about 20 hours… We can do better than that… :)

Skip to TL;DR

Parallelization & Optimization

The first thing that I wanted to do was make use of all 8 logical cores (Core i7 with hyper-threading). That’s why I chose to use Sidekiq: we could have multiple workers crunching different chunks of data and hopefully saturate both IO and CPU. I used Boson to make my app a simple command-line application, but I could have easily used rake tasks too.

Here is pass #1 at this task:

Command Line To Start Worker:

#!/usr/bin/env ruby
 
require File.expand_path( '../../config/environment', __FILE__)
 
require 'boson/runner'
require 'sidekiq'
require 'workers/process_worker'
 
class ProcessRunner < Boson::Runner
  def process( num_workers = 10 )
    num_workers=num_workers.to_i
    puts 'spawning workers'
 
    Sidekiq.redis {|conn| conn.set('timer', Time.now.to_f) }
    num_workers.times do | worker_number |
      ProcessWorker.perform_async worker_number, 100
    end
  end
end
 
ProcessRunner.start

Worker:

require 'sidekiq'
require 'sequel'
 
class ProcessWorker
  include Sidekiq::Worker
 
  def perform(worker_number, chunk_size)
    DB.transaction do
      dataset = DB[:work_table].select_all(:work_table).left_outer_join(:output_table, :fk_id => :id).where(:result => nil).limit(chunk_size,worker_number*chunk_size)
      output_table = DB[:output_table]
      dataset.each do |row|
        result = # Process the row data here
 
        output_table.insert(:result => result, :fk_id => row[:id])
      end
 
      if dataset.count > 0
        puts "spawn worker to do more work again."
        ProcessWorker.perform_async worker_number, chunk_size
      else
        start_time=Time.at(Sidekiq.redis {|conn| conn.get('timer') }.to_f)
        end_time = Time.now
 
        puts "Worker ##{worker_number} reports job took #{(end_time - start_time)*1000} milliseconds"
      end
    end
  end
end

My strategy here was to have a secondary table (output_table) that I would dump my results into and join to the primary table, work_table. With the left outer join, I would search for rows in work_table that had NULL in the result field, indicating they had not been processed yet. I tried out Sequel for this project because I had never used it before and thought it’d be nice to have a Ruby model-like way of accessing the data. It turned out sorta nice. The data was divided into chunks among the workers, each chunk_size rows large (100 in the example)… So it would look something like this:

--------------------------------------
| row 1-99    | chunk 0 for worker 0 |
| row 100-199 | chunk 1 for worker 1 |
| row 200-299 | chunk 2 for worker 2 |
| row 300-399 | chunk 3 for worker 3 |
| row 400-499 | chunk 4 for worker 4 |
| ...         | ..                   |
--------------------------------------

This algorithm and implementation did not turn out nice, though! Because I was SELECT-ing data and processing it non-atomically, if, for instance, worker 1 finished rows 100-199 and asked for another set of data, it would start working on rows 200-299 at the same time as worker 2, which was already supposed to be working on those rows…

After processing rows 100-199, the new chunk 1 was rows 200-299:

--------------------------------------
| row 1-99    | chunk 0 for worker 0 |
| row 100-199 | COMPLETED            | (not returned to SELECT query)
| row 200-299 | chunk 1 for worker 1 | <- now both worker 1 and worker 2
| row 300-399 | chunk 2 for worker 2 |    are working on these rows
| row 400-499 | chunk 3 for worker 3 |
| ...         | ..                   |
--------------------------------------

Bad bad bad race conditions… Inefficient processing of data… still only getting about 300 rows processed per second… 10 hours still too long for the job… back to Google, StackOverflow, and the MySQL manual…

MySQL Bulk Data Recommendations (LOAD DATA INFILE)

The MySQL documentation has a lot of great tips for optimizing bulk inserts, but the most useful I found was this:

When loading a table from a text file, use LOAD DATA INFILE. This is usually 20 times faster than using INSERT statements. See Section 13.2.6, “LOAD DATA INFILE Syntax”.

Speed of INSERT Statements

I’ll take a 20x speed-up! Let’s rewrite the code to take advantage of LOAD DATA INFILE… I’ll be using tmp CSV files.
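
For reference, the raw statement behind this approach looks roughly like the following (a sketch of mine run through Sequel’s DB.run; the file path and column names mirror the worker below, and your server may or may not require the LOCAL keyword):

csv_path = "/tmp/worker_0_output0.txt"

# Bulk-load "fk_id,result" rows from the CSV into output_table in one statement.
DB.run(<<-SQL)
  LOAD DATA LOCAL INFILE '#{csv_path}'
  INTO TABLE output_table
  FIELDS TERMINATED BY ','
  (fk_id, result)
SQL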

Job Invoker:

  def process( num_workers = 10 )
    Sidekiq.redis {|conn| conn.set('process_workers', num_workers.to_i) }
    puts 'spawning workers'
 
    Sidekiq.redis {|conn| conn.set('timer', Time.now.to_f) }
    num_workers.times do | worker_number |
      ProcessWorker.perform_async worker_number, 1000
    end
  end

Worker:

require 'sidekiq'
require 'sequel'
require 'sequel/load_data_infile'
 
class ProcessWorker
  include Sidekiq::Worker
 
  def total_num_of_workers
    Sidekiq.redis {|conn| conn.get('process_workers') }
  end
 
  def perform(worker_number, chunk_size, current_count = 0)
    file_data = File.open("/tmp/worker_#{worker_number}_output#{current_count}.txt", 'w')
 
    dataset = DB[:work_table].select_all(:work_table).left_outer_join(:output_table, :fk_id => :id).limit(chunk_size,total_num_of_workers.to_i*chunk_size*current_count+worker_number*chunk_size)
    output_table = DB[:output_table]
    dataset.each do |row|
      result = # Process the row data here
 
      file_data.puts "#{row[:id]},#{result}"
    end
 
    file_data.close
 
    output_table.load_csv_infile("/tmp/worker_#{worker_number}_output#{current_count}.txt", [ :fk_id, :result ])
 
 
    ## TODO: it'd be nice if we clean up the tmp directory when we're done.
 
    if dataset.count > 0
      puts "spawn worker to do more work again."
 
      ProcessWorker.perform_async worker_number, chunk_size, current_count+1
    else
      start_time_raw=Sidekiq.redis {|conn| conn.get('timer') }
      start_time=Time.at(start_time_raw.to_f)
      end_time = Time.now
 
      puts "Worker ##{worker_number} reports job took #{(end_time - start_time)*1000} milliseconds"
    end
  end
end

I made the following changes to the code:

  • I’m using the sequel load_data_infile gem… I love that there’s a gem for everything. :)
  • I avoided the race condition by using LIMIT and OFFSET to always advance through the table instead of basing the query on data that is still being processed.
  • We write each processed batch to a CSV file and have MySQL import it.

Let’s give it a run…

Worker #1 reports validate job took 2153123.5738263847 milliseconds

Yay! We were finally able to process the full dataset in a reasonable amount of time! Roughly half an hour is not too bad for 8.5 million records, right? Actually, we can probably optimize this a bit more…

Scanning By Primary ID instead of LIMIT/OFFSET

I noticed that the workers processing data at the beginning of the table were returning fast – 1-5 seconds per chunk – but as we reached the end of the table, around the 4 million mark, it got extremely slow and the CPU would start working really hard.

I did a little digging and found that the issue is that LIMIT/OFFSET has to scan the ENTIRE dataset up to the offset you ask for, discard all the rows at the beginning, and only then return the rows you want. In other words, MySQL was going through 4,001,000 records to give me records 4,000,000-4,001,000. It was going through 4,002,000 records to give me the next chunk, and so forth… No wonder the query was getting slower and slower!
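
To make the difference concrete, here is a small Sequel sketch (table and column names follow the examples above) contrasting the two query styles; the second one lets MySQL seek straight into the primary key index instead of counting rows from the start:

chunk_size = 1_000
offset     = 4_000_000

# Offset pagination: MySQL still walks past the first 4,000,000 rows
# before it can hand back the 1,000 we asked for.
slow = DB[:work_table].limit(chunk_size, offset)

# Primary-key range scan: jumps directly to id >= 4,000,000 via the index,
# so the cost stays flat no matter how deep into the table we are.
low  = 4_000_000
high = low + chunk_size
fast = DB[:work_table].where { (id >= low) & (id < high) }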

Let’s fix that:

  def perform(worker_number, chunk_size, current_count = 0)
    file_data = File.open("/tmp/worker_#{worker_number}_output#{current_count}.txt", 'w')
 
 
    low = total_num_of_workers.to_i*chunk_size*current_count+worker_number*chunk_size
    high = total_num_of_workers.to_i*chunk_size*current_count+worker_number*chunk_size+chunk_size
    dataset = DB[:work_table].select_all(:work_table).left_outer_join(:output_table, :fk_id => :id).where{(Sequel.qualify(:work_table,:id) >= low)}.where{Sequel.qualify(:work_table,:id) < high}
    # debugging
    puts dataset.sql
    output_table = DB[:output_table]
    dataset.each do |row|
      result = # Process the row data here
 
      file_data.puts "#{row[:id]},#{result}"
    end
 
    file_data.close
 
    output_table.load_csv_infile("/tmp/worker_#{worker_number}_output#{current_count}.txt", [ :fk_id, :result ])
 
 
    ## TODO: it'd be nice if we clean up the tmp directory when we're done.
 
    last_auto_increment_id = DB["SELECT `AUTO_INCREMENT` FROM  INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'my_db' AND TABLE_NAME = 'work_table'"].first[:AUTO_INCREMENT]
    if total_num_of_workers.to_i*chunk_size*current_count < last_auto_increment_id
      puts "spawn worker to do more work again."
 
      ProcessWorker.perform_async worker_number, chunk_size, current_count+1
    else
      start_time_raw=Sidekiq.redis {|conn| conn.get('timer') }
      start_time=Time.at(start_time_raw.to_f)
      end_time = Time.now
 
      puts "Worker ##{worker_number} reports validate job took #{(end_time - start_time)*1000} milliseconds"
    end
  end
end

We are now scanning the table by primary ID from 1 up to the AUTO_INCREMENT counter, so we are guaranteed to cover all the rows. This SELECT method was fast! If there was no data in a range, the query returned almost instantaneously. I had some gaps in my table from deletes, so it was critical that the query came back fast when it had no results. Overall, I probably lost only a few trivial seconds skipping over deleted IDs. Let’s run our benchmark again:

Worker #9 reports validate job took 208492.26823838233 milliseconds

Wow! We’re able to process through 8.5 million records within FOUR MINUTES! That is certainly fast enough for me to be working with this database on a regular basis.

Applying What We Learned to UPDATEs

It is unfortunate that we can’t use LOAD DATA INFILE for updates, but let us see if we can apply the primary key scan and parallelization techniques to UPDATE-ing the existing output_table that we created.

require 'sidekiq'
require 'sequel'
 
class ProcessUpdateWorker
  include Sidekiq::Worker
 
  def total_num_of_workers
    Sidekiq.redis {|conn| conn.get('process_update_workers') }
  end
 
 
  def perform(worker_number, chunk_size, current_count = 0)
    DB.run('SET autocommit=0;')
    DB.run('SET unique_checks=0;')
    DB.run('SET foreign_key_checks=0;')
 
 
    DB.transaction do
      low = total_num_of_workers.to_i*chunk_size*current_count+worker_number*chunk_size
      high = total_num_of_workers.to_i*chunk_size*current_count+worker_number*chunk_size+chunk_size
 
      dataset = DB[:work_table].select_all(:work_table).select_append(Sequel.qualify(:output_table, :result)).left_outer_join(:output_table, :fk_id => :id).where(:result2 => nil).where{(Sequel.qualify(:work_table,:id) >= low)}.where{Sequel.qualify(:work_table,:id) < high}
 
      output_table = DB[:output_table]
      puts dataset.sql
      dataset.each do |row|
        result2 = # process row here
        output_table.where("fk_id= ?", row[:id]).update(:result2 => "#{result2}")
      end
 
      last_auto_increment_id = DB["SELECT `AUTO_INCREMENT` FROM  INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'my_db' AND TABLE_NAME = 'work_table'"].first[:AUTO_INCREMENT]
      if total_num_of_workers.to_i*chunk_size*current_count < last_auto_increment_id
        puts "spawn worker to do more work again."
        ProcessUpdateWorker.perform_async worker_number, chunk_size, current_count+1
      else
        start_time=Time.at(Sidekiq.redis {|conn| conn.get('validate_timer') }.to_f)
        end_time = Time.now
 
        puts "Worker ##{worker_number} reports job took #{(end_time - start_time)*1000} milliseconds"
      end
    end
 
    DB.run('SET autocommit=1;')
    DB.run('SET unique_checks=1;')
    DB.run('SET foreign_key_checks=1;')
 
  end
end

For this job, I added the recommended tweaks of turning autocommit, unique_checks, and foreign_key_checks off (and back on when done), but with my database, I don’t think I saw any improvements. I wasn’t really using unique or foreign keys much.

Anyways, after running this job, I was able to make 8.5 million updates in 26 minutes:

Worker #6 reports validate job took 1557370.9786546987 milliseconds

Not that bad, but could we make this faster??

LOAD DATA INFILE to temp table + UPDATE by JOIN

Doing more research, I saw that you can UPDATE a column in one table from another table via a join. That gave me the idea of loading the column I wanted to change into a tmp table and then overwriting the column I wanted to update. Would it work?
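
The core of it is MySQL’s multi-table UPDATE syntax – the same statement the worker below runs once all of the chunks have been loaded into tmp_table:

# Copy result2 from the freshly loaded tmp_table into output_table by joining
# the two tables on the foreign key; one statement updates every matched row.
DB.run(<<-SQL)
  UPDATE output_table, tmp_table
  SET output_table.result2 = tmp_table.result2
  WHERE output_table.fk_id = tmp_table.fk_id;
SQL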

Job Invoker:

  def bulk_insert_then_update( num_workers = 10)
    Sidekiq.redis {|conn| conn.set('bulk_insert_then_update_num_workers', num_workers.to_i)}
    Sidekiq.redis {|conn| conn.del('worker_complete_count')}
    puts 'starting bulk insert then update job'
 
    Sidekiq.redis {|conn| conn.set('bulk_insert_then_update_timer', Time.now.to_f) }
    DB.create_table! :tmp_table do
      primary_key :id
      foreign_key :fk_id, :work_table
      String :result2
    end
    num_workers.times do | worker_number |
      BulkInsertThenUpdate.perform_async worker_number, 20000
    end
  end

Worker:

require 'sidekiq'
require 'sequel'
 
class BulkInsertThenUpdate
  include Sidekiq::Worker
 
  def total_num_of_workers
    Sidekiq.redis {|conn| conn.get('bulk_insert_then_update_num_workers') }
  end
 
  def perform(worker_number, chunk_size, current_count = 0)
    file_data = File.open("/tmp/worker_#{worker_number}_output#{current_count}.txt", 'w')
 
    low = total_num_of_workers.to_i*chunk_size*current_count+worker_number*chunk_size
    high = total_num_of_workers.to_i*chunk_size*current_count+worker_number*chunk_size+chunk_size
    dataset = DB[:work_table].select_all(:work_table).select_append(Sequel.qualify(:output_table,:result)).left_outer_join(:output_table, :fk_id => :id).where(Sequel.qualify(:output_table,:result2) => nil).where{(Sequel.qualify(:work_table,:id) >= low)}.where{Sequel.qualify(:work_table,:id) < high}
    puts dataset.sql
    tmp_table = DB[:tmp_table]
    dataset.each do |row|
      result2 = # calculate results from row here
      file_data.puts "#{row[:id]},#{result2}"
    end
 
    file_data.close
 
    tmp_table.load_csv_infile("/tmp/worker_#{worker_number}_output#{current_count}.txt", [ :fk_id, :result2 ])
 
 
    ## TODO: it'd be nice if we clean up the tmp directory when we're done.
 
    last_auto_increment_id = DB["SELECT `AUTO_INCREMENT` FROM  INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'my_db' AND TABLE_NAME = 'work_table'"].first[:AUTO_INCREMENT]
    if total_num_of_workers.to_i*chunk_size*current_count < last_auto_increment_id
      puts "spawn worker to do more work again."
 
      BulkInsertThenUpdate.perform_async worker_number, chunk_size, current_count+1
    else
      complete_worker_count = Sidekiq.redis {|conn| conn.incr('worker_complete_count') }
 
      if complete_worker_count.to_i == total_num_of_workers.to_i
        puts "all workers complete - running bulk sql update command..."
        DB.run('UPDATE output_table,tmp_table SET output_table.result2 = tmp_table.result2 WHERE output_table.fk_id = tmp_table.fk_id;')
        DB.run('DROP TABLE tmp_table')
      end
      start_time_raw=Sidekiq.redis {|conn| conn.get('bulk_insert_then_update_timer') }
      start_time=Time.at(start_time_raw.to_f)
      end_time = Time.now
 
      puts "Worker ##{worker_number} reports validate job took #{(end_time - start_time)*1000} milliseconds"
    end
  end
end

My strategy here was to do the exact same thing as the LOAD DATA INFILE approach, then, when all the workers were done, copy the data from tmp_table into output_table in one UPDATE join. I used the last worker to finish to run that query, by keeping a count of completed workers in Redis.

What’s the runtime?

Worker #7 reports validate job took 394662.35620292247 milliseconds

We got updates down to under 7 minutes!

Final Optimization Notes

To tune the system, you want to pay close attention to the CPU time and the iowait time, which you can see with top. If the iowait time is high, consider lowering the chunk size to work on smaller chunks at a time. I was able to keep my iowait time under 5%. If the CPU isn’t fully loaded, feel free to up the number of workers.

For my system:

i7-3770 3.4GHz
16GB RAM
256GB Samsung 850 Pro
MySQL 5.5

I ended up with 20 workers and a 20000 chunk size (rows selected at once). When I tried increasing the number of workers further, it actually had a negative effect on the benchmark speed, so there is a maximum that is beneficial. If you have more RAM, I would also consider tuning the MySQL server and even trying to keep the whole DB buffered in memory (innodb_buffer_pool_size). MySQLTuner is a good resource, as is dba.stackexchange.com.
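
As a rough sanity check on that (a sketch of mine using the same Sequel DB handle as above; 'my_db' is the placeholder schema name from the earlier queries):

# Current InnoDB buffer pool size, in bytes.
pool_bytes = DB["SHOW VARIABLES LIKE 'innodb_buffer_pool_size'"].first[:Value].to_i

# Approximate on-disk size of the schema's data plus indexes.
db_bytes = DB["SELECT SUM(data_length + index_length) AS bytes FROM information_schema.TABLES WHERE table_schema = 'my_db'"].first[:bytes].to_i

puts "buffer pool: #{pool_bytes / 1024**2} MB, database: #{db_bytes / 1024**2} MB"
# If the database fits comfortably inside the buffer pool, most reads never hit disk.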

TL;DR

  • Parallelize the work with Sidekiq – make full use of modern multi-core PCs!
  • Use LOAD DATA INFILE over INSERT statements
  • Scan the table by the primary key with WHERE id BETWEEN start AND end rather than using LIMIT/OFFSET
  • Large updates can be made by loading data into a tmp_table with LOAD DATA INFILE and then updating the column via a join with the tmp_table
  • Watch your CPU usage and iowait time and optimize so that your CPU utilization is high and your iowait time is low.

Finally:

8.5 million rows can be SELECT’d and INSERT’d within 3.5 minutes (1.5 minutes after the update below).
8.5 million rows can be SELECT’d and UPDATE’d within 7 minutes (4.5 minutes after the update below).

**Update** I made it even faster by learning how to truly parallelize Sidekiq!

I would love to hear if you know of techniques that are even faster than this!

RSpec let! and before

‘let’ in RSpec allows you to define a memoized helper method whose value you can use in your examples. Official docs:

Use let to define a memoized helper method. The value will be cached across
multiple calls in the same example but not across examples.

‘let!’ forces the helper to be invoked in an implicit `before` hook before each example. I was curious about the order of execution of ‘before’ hooks and let!, so I ran a simple test:

context 'when testing before' do
  before :all do
    p 'before :all invoked'
  end
 
  before do
    p 'before :each invoked'
  end
 
  let!(:test) { p 'let! invoked' }
 
  it 'prints debug statements' do
  end
end

The results were:

"before :all invoked"
"before :each invoked"
"let! invoked"

So it seems the before hooks are run BEFORE the let! helpers. This is useful to note: if you need the let! method to run before the before block, you’ll need to call it manually yourself, like the following:

context 'when testing before' do
  before :all do
    p 'before :all invoked'
  end
 
  before do
    test
    p 'before :each invoked'
  end
 
  let!(:test) { p 'let! invoked' }
 
  it 'prints debug statements' do
  end
end

and we get:

"before :all invoked"
"let! invoked"
"before :each invoked"

I’m officially a Ruby On Rails Developer Now

I know I don’t post too much to this blog, but you might have noticed that I have stopped posting about PHP and started posting about Ruby and Rails. Why’s that? Well, it’s because my company has switched over to being a RoR shop. Here’s my short opinion on it…

Ruby and the Rails community seem both more mature and more bleeding edge. How can I put those two together? Well, all the stuff I was working on in PHP was strongly influenced by, or straight-out copied from, Rails. I think Rails was one of the most popular MVC frameworks and it inspired many of the others (and yes, I know there were MVC frameworks before Rails). You see, Symfony draws its design from Rails:

app/console -> rails/rake
doctrine -> activerecord

As Rails has been doing it for much longer (Rails is on version 4), it has a much more developed community. I find more gems and have to write less of my own stuff compared to PHP. Granted, I now make PRs to the Ruby/Rails community too.

I’m still getting my head around Ruby as a language, but it is much more concise to write than PHP. Blocks are something that definitely gives it an advantage over other languages. Yes, you can implement something similar in PHP or other languages with anonymous functions or callbacks, but Ruby makes it very natural and intuitive. Overall, after getting over the initial hump of learning a new language, I like it.

One thing that has made the transition easier is having a JetBrains IDE. I moved straight from PhpStorm to RubyMine and was very happy that almost 100% of the keybindings I was used to moved right over to RubyMine. The Ruby debugger actually seems more flexible than the PHP debugger, as I can execute arbitrary code at a breakpoint. This is great for inspecting objects and variables. You could probably do this with PHP, but it just wasn’t as simple.

I look forward to the next languages I’ll be learning. So far, going from PHP -> Java -> PHP -> Ruby hasn’t been a bad experience for me. Each time, I learned something new and was able to apply principles from one language in another. I think learning different languages helps broaden your perspective on design patterns, as each language lends itself to writing code in a certain manner.

Why Can’t I Put Debug Statements in Symfony Core?

Just a quick tip for anyone trying to debug Symfony core files, such as \Symfony\Component\HttpFoundation\Request or almost anything in HttpFoundation. I tried adding a debug print_r() and was wondering why my code was not being executed. There doesn’t seem to be any other place where the Request class is defined… I thought maybe it was a cache issue, so I did:

$ app/console cache:clear

It didn’t help… It turns out that all these files are cached and lumped into the app/bootstrap.php.cache file and that is only regenerated through a composer install via:

Sensio\\Bundle\\DistributionBundle\\Composer\\ScriptHandler::buildBootstrap

So to summarize, if you want to add debug statements to HttpFoundation classes, you’ll have to edit the bootstrap.php.cache file. Be careful though and don’t mess up your framework! Hope this helps somebody! :)

Online Regex Editor

I don’t know about you, but it isn’t the most fun part of my job when I have to pull out Regular Expressions and make a super long expression to match something… I especially despise when I have to fix someone else’s Regex!

^((([0-9]+)\.([0-9]+)\.([0-9]+)(?:-([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?)(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?)$

You try figuring out what that does!

Well, it doesn’t have to be as painful with the online regex tool I recently found out about:

http://regexr.com/

On this site, you can write a regex and sample text that needs to be matched, and it will highlight whether it matches or not.

[Screenshot: regexr.com highlighting matched text]

As well as get helpful hints to remember what the special characters mean:

[Screenshot: regexr.com hints for special character meanings]

Regex can definitely be confusing as many characters have special meaning based on the context – are we talking about a literal “:” or a ?: that means non-capturing group?
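
As a quick illustration in Ruby (my own simplified semver-ish pattern, not the one linked below), here is the difference between a capturing group and a non-capturing (?:...) group:

version = "1.2.3-beta.1"

# The (\d+) groups are captured; the (?:-...) wrapper only groups the optional
# pre-release part so the trailing "?" can apply to it, without creating a
# capture of its own.
pattern = /\A(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?\z/

if (m = pattern.match(version))
  puts m[1] # => "1"       (major)
  puts m[2] # => "2"       (minor)
  puts m[3] # => "3"       (patch)
  puts m[4] # => "beta.1"  (pre-release, captured by the inner group)
end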

All in all, this tool makes Regular Expressions so easy! They also have a community feature so you can see if someone has uploaded a Regex pattern that you can build off of.

For my project today, I had to build a Regex that matches a semver version string. Here’s the Regex pattern I built with a little help from my coworker – thanks Daniel!

http://regexr.com/39s32

Asserting the Output of Complex Arrays in PHPUnit/PHPStorm

In unit testing, have you ever had to assert a really complex array that you didn’t really want to generate the whole expected value for by hand? Such as:

array (
  0 => 
  array (
    0 => 'var 0-0',
    1 => 'var 0-1',
    2 => 'var 0-2',
    3 => 'var 0-3',
    4 => 'var 0-4',
    5 => 'var 0-5',
    6 => 'var 0-6',
    7 => 'var 0-7',
    8 => 'var 0-8',
    9 => 'var 0-9',
  ),
  1 => 
  array (
    0 => 'var 1-0',
    1 => 'var 1-1',
    2 => 'var 1-2',
    3 => 'var 1-3',
    4 => 'var 1-4',
    5 => 'var 1-5',
    6 => 'var 1-6',
    7 => 'var 1-7',
    8 => 'var 1-8',
    9 => 'var 1-9',
  ),
  2 => 
  array (
    0 => 'var 2-0',
    1 => 'var 2-1',
    2 => 'var 2-2',
    3 => 'var 2-3',
    4 => 'var 2-4',
    5 => 'var 2-5',
    6 => 'var 2-6',
    7 => 'var 2-7',
    8 => 'var 2-8',
    9 => 'var 2-9',
  ),
  3 => 
  array (
    0 => 'var 3-0',
    1 => 'var 3-1',
    2 => 'var 3-2',
    3 => 'var 3-3',
    4 => 'var 3-4',
    5 => 'var 3-5',
    6 => 'var 3-6',
    7 => 'var 3-7',
    8 => 'var 3-8',
    9 => 'var 3-9',
  ),
  4 => 
  array (
    0 => 'var 4-0',
    1 => 'var 4-1',
    2 => 'var 4-2',
    3 => 'var 4-3',
    4 => 'var 4-4',
    5 => 'var 4-5',
    6 => 'var 4-6',
    7 => 'var 4-7',
    8 => 'var 4-8',
    9 => 'var 4-9',
  ),
  5 => 
  array (
    0 => 'var 5-0',
    1 => 'var 5-1',
    2 => 'var 5-2',
    3 => 'var 5-3',
    4 => 'var 5-4',
    5 => 'var 5-5',
    6 => 'var 5-6',
    7 => 'var 5-7',
    8 => 'var 5-8',
    9 => 'var 5-9',
  ),
  6 => 
  array (
    0 => 'var 6-0',
    1 => 'var 6-1',
    2 => 'var 6-2',
    3 => 'var 6-3',
    4 => 'var 6-4',
    5 => 'var 6-5',
    6 => 'var 6-6',
    7 => 'var 6-7',
    8 => 'var 6-8',
    9 => 'var 6-9',
  ),
  7 => 
  array (
    0 => 'var 7-0',
    1 => 'var 7-1',
    2 => 'var 7-2',
    3 => 'var 7-3',
    4 => 'var 7-4',
    5 => 'var 7-5',
    6 => 'var 7-6',
    7 => 'var 7-7',
    8 => 'var 7-8',
    9 => 'var 7-9',
  ),
  8 => 
  array (
    0 => 'var 8-0',
    1 => 'var 8-1',
    2 => 'var 8-2',
    3 => 'var 8-3',
    4 => 'var 8-4',
    5 => 'var 8-5',
    6 => 'var 8-6',
    7 => 'var 8-7',
    8 => 'var 8-8',
    9 => 'var 8-9',
  ),
  9 => 
  array (
    0 => 'var 9-0',
    1 => 'var 9-1',
    2 => 'var 9-2',
    3 => 'var 9-3',
    4 => 'var 9-4',
    5 => 'var 9-5',
    6 => 'var 9-6',
    7 => 'var 9-7',
    8 => 'var 9-8',
    9 => 'var 9-9',
  ),
  10 => 
  array (
    0 => 'var 10-0',
    1 => 'var 10-1',
    2 => 'var 10-2',
    3 => 'var 10-3',
    4 => 'var 10-4',
    5 => 'var 10-5',
    6 => 'var 10-6',
    7 => 'var 10-7',
    8 => 'var 10-8',
    9 => 'var 10-9',
  ),
)

(I don’t want to type that up by hand!)

Well, guess what? You can be lazy about it… Simply use var_export and PHP will generate all the code for you to copy and paste.

var_export($complexObj);

**DISCLAIMER WARNING** This is generally a VERY bad practice. You are always supposed to create the assertions independently from what the code outputs. It is very easy to copy and paste a mistake in the output without noticing. If you do this, be sure to look very carefully at the output and make sure it is exactly what it is supposed to be.

PHPUnit Mocking

Mocking is great for very lean unit tests. It allows you to set up a repeatable scenario to test as many of the states of a unit as you want. In general, I want to be testing only a single method or a single class. If you’re testing more than this, you’re not writing unit tests – you’re writing integration tests! A good measure of whether you are writing proper unit tests is to change something in the code to make a test fail. If you end up with a cascade of failing tests, you have integration tests, not unit tests!

So in unit testing, you want to be able to isolate a single unit and manipulate the inputs to cover all types of conditions and verify the validity of the outputs to those conditions. I like to think of it like an algebraic inequality problem. You always check your boundaries and then a value within the different boundary regions. Think of it similarly with unit testing.

// Calculator.php
<?php
class Calculator {
    public function getNumberFromUserInput() {
        // complicated function to get number from user input 
    }
 
    public function divideBy($num2) {
        return $this->getNumberFromUserInput()/$num2;
    }
}
 
// CalculatorTest.php
<?php
 
include_once("Calculator.php");
 
class CalculatorTest extends \PHPUnit_Framework_TestCase {
    public function testDivideByPositiveNumber() {
        $calcMock=$this->getMock('\Calculator',array('getNumberFromUserInput'));
        $calcMock->expects($this->once())
            ->method('getNumberFromUserInput')
            ->will($this->returnValue(10));
        $this->assertEquals(5,$calcMock->divideBy(2));
    }
 
    public function testDivideByZero() {
        $calcMock=$this->getMock('\Calculator',array('getNumberFromUserInput'));
        $calcMock->expects($this->once())
            ->method('getNumberFromUserInput')
            ->will($this->returnValue(10));
        $this->assertEquals(NAN, $calcMock->divideBy(0));
 
    }
 
    public function testDivideByNegativeNumber() {
        $calcMock=$this->getMock('\Calculator',array('getNumberFromUserInput'));
        $calcMock->expects($this->once())
            ->method('getNumberFromUserInput')
            ->will($this->returnValue(10));
        $this->assertEquals(-2,$calcMock->divideBy(-5));
 
    }
}

As you can see from the example, the inputs and outputs of functions are not always so straightforward. While we have the standard input and output passed through the method parameters and the return value, more often than not, inputs and outputs are also received and sent by calling other functions. With mocking, though, this is no issue. We can mock any function and define exactly what it should return. We can also verify that the function was called.

Let’s break down the example.

$calcMock=$this->getMock('\Calculator',array('getNumberFromUserInput'));

We are creating a new Calculator mock. The first parameter tells PHPUnit what class to mock; the 2nd parameter tells PHPUnit to mock only the ‘getNumberFromUserInput’ method and nothing else, since we need the real divideBy() in order to test it. getMock() has a lot of useful options; refer to the API docs or Mark Mzyk’s blog post I found on the getMock method signature.

The next line(s) set up the mocked function.

1
2
3
$calcMock->expects($this->once())
            ->method('getNumberFromUserInput')
            ->will($this->returnValue(10));

Line 1: We are saying here that we expect this method to be called exactly once. If it is called more or fewer times than that, the test will fail.

Line 2: The method name we are mocking out.

Line 3: The return value of the mocked out function. You can also throw exceptions, return back one of the arguments unmodified, or even call another callback function. The code for that is $this->throwException(new Exception()), $this->returnArgument($ArgumentNumber) and $this->returnCallback($callbackMethod) respectively. With the callback, all the parameters you pass to the mock will be passed to the callback.

Anyways, back to our example, I ran the tests and we see at our 0 boundary that we get behavior that we were not expecting and our tests fail.

PHPUnit_Framework_Error_Warning : Division by zero

Let’s fix that up.

class Calculator {
    public function getNumberFromUserInput() {
        // complicated function to get number from user input
    }
 
    public function divideBy($num2) {
        if ($num2 == 0) return NAN;
        return $this->getNumberFromUserInput()/$num2;
    }
}

A few more failures…

PHPUnit_Framework_ExpectationFailedException : Expectation failed for method name is equal to  when invoked 1 time(s).
Method was expected to be called 1 times, actually called 0 times.

And finally…

public function testDivideByZero() {
    $calcMock=$this->getMock('\Calculator',array('getNumberFromUserInput'));
    $calcMock->expects($this->never())
        ->method('getNumberFromUserInput')
        ->will($this->returnValue(10));
    $this->assertEquals(NAN, $calcMock->divideBy(0));
}

There are many options that you can apply to expects(). Refer to this table: http://www.phpunit.de/manual/3.0/en/mock-objects.html#mock-objects.tables.matchers Oddly, I could not find the same documentation for the current version of PHPUnit.

Lastly, if our calculator outputted data via a function, we can still test the output by mocking the output function.

// Calculator.php
class Calculator {
    public function getNumberFromUserInput() {
        // complicated function to get number from user input
    }
 
    public function printToScreen($value) {
        // another complicated function
    }
 
    public function divideBy($num2) {
        if ($num2 == 0) return $this->printToScreen("NaN");
        $this->printToScreen($this->getNumberFromUserInput()/$num2);
    }
}
 
// CalculatorTest.php
..
public function testDivideByPositiveNumber() {
    $calcMock=$this->getMock('\Calculator',array('getNumberFromUserInput', 'printToScreen'));
    $calcMock->expects($this->once())
        ->method('getNumberFromUserInput')
        ->will($this->returnValue(10));
    $calcMock->expects($this->once())
        ->method('printToScreen')
        ->with($this->equalTo('5'));
    $calcMock->divideBy(2);
}
..

->with() will test that the method is called with the parameters passed. If your original function has multiple parameters, just add them as multiple arguments to with(): ->with($this->equalTo($param1), $this->anything(), $this->equalTo($param3)). You can use any of the constraints that PHPUnit supports – http://www.phpunit.de/manual/3.2/en/api.html#api.assert.tables.constraints

You can git clone this full example on my github gist – https://gist.github.com/4558701

So with that said, test your code! Mocking makes it dead simple to test things. There should be no excuse for not having tests on all aspects of your code.

Heap Sort Visualization #2

I enhanced the heap sort algorithm code so it may be more useful for teaching purposes. Enjoy! Feel free to use it wherever you want, and if you find it really useful, leave me a comment!




Source code: HeapSort RandomArray

Built with Processing

Heap Sort Visualization

Here’s some work I did on visualizing heap sort. The goal was to make it easy to understand how heap sort works. All code is written in Java/Processing (processing.org). Enjoy!




Source code: HeapSort RandomArray

Built with Processing