This is not my creation but wanted to share this incredible diagram on redirection in Linux.

I saw a really great issue in the Rails repo where someone was able to demonstrate a full Rails issue without having to recreate an entire application. I thought this was fantastic because I have never set up an example like this before, and I think it would save a ton of time when iterating on ideas or reproducing issues.
# frozen_string_literal: true

# 1. Add any gems that you may need inline
require "bundler/inline"

gemfile(true) do
  source "https://rubygems.org"

  git_source(:github) { |repo| "https://github.com/#{repo}.git" }

  # Activate the gem you are reporting the issue against.
  gem "rails", github: "rails/rails", branch: "master"
  gem "sqlite3"
  gem "pry"
end

require "active_record"
require "minitest/autorun"
require "logger"
require "pry"

# This connection will do for database-independent bug reports.
ActiveRecord::Base.establish_connection(adapter: "sqlite3", database: ":memory:")
ActiveRecord::Base.logger = Logger.new(STDOUT)

ActiveRecord::Schema.define do
  create_table :posts, force: true do |t|
    t.string :name
    t.text :description
    t.datetime :routing_start_date
    t.datetime :routing_end_date
  end
end

class Post < ActiveRecord::Base
  has_many :comments
end

class BugTest < Minitest::Test
  def test_association_stuff
    post = Post.create!(name: [])

    post.routing_start_date = DateTime::Infinity
    assert_equal DateTime::Infinity, post.routing_start_date
  end
end
Elasticsearch is absolutely incredible as a search data store because it abstracts away a lot of the cruft related to analyzing, distributing, and returning search results. At some point in the evolution of your Elasticsearch setup, you will need to be able to serve searches and migrate an index at the same time. This post outlines one strategy for handling this type of live migration.
Things get complicated when you have a search index that needs to be available at all times while you also need to add to or change its mapping as it serves search requests. For instance, say you are moving a field to a new type that is different from the old one, or adding a copy_to for a new field.
To handle this, the gem that I use is es-elasticity. Assuming you have a document model called City::Document, you would accomplish this by issuing the message City::Document.rebuild_index(recreate: true). This will take care of all the internals needed in order to:
– Create the new index with the new mapping and migrate the current data into it
– Allow all searching to take place as normal during the migration
– Delete the old index as soon as the data is migrated
But, how the hell does this work?
The Elasticsearch documentation hints at how to handle this with index aliases: https://www.elastic.co/guide/en/elasticsearch/guide/current/index-aliases.html
The source for how es-elasticity does it is here: https://github.com/doximity/es-elasticity/blob/master/lib/elasticity/strategies/alias_index.rb
The strategy revolves around two aliases that sit in front of a timestamped physical index such as city_docs-2018-09-07_03:03:15.413063:
– main alias: the name of the index prefix, city_docs
– update alias: the name of the index prefix with _update suffixed, city_docs_update
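For illustration only, here is roughly how such a timestamped index with both aliases could be created by hand (es-elasticity does this for you; the settings body is omitted):

curl -X PUT "localhost:9200/city_docs-2018-09-07_03:03:15.413063" -H 'Content-Type: application/json' -d '{
  "aliases": { "city_docs": {}, "city_docs_update": {} }
}'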
To start, let's look at what our setup actually looks like by asking what main and update are pointing to. Find our current indexes and aliases, i.e. find what the main alias and the update alias are aliased to right now. %2A is the URL-encoded '*', because we normally will not know the timestamp of the index's creation.
curl localhost:9200/city_docs-%2A/_alias/city_docs
-> {"city_docs-2018-09-07_03:03:15.413063":{"aliases":{"city_docs":{}}}}

curl localhost:9200/city_docs-%2A/_alias/city_docs_update
-> {"city_docs-2018-09-07_03:03:15.413063":{"aliases":{"city_docs_update":{}}}}

main_alias = "city_docs-2018-09-07_03:03:15.413063"
update_alias = "city_docs-2018-09-07_03:03:15.413063"
Now that we have both our main_alias and update_alias, we can run a couple of preflight checks and bail on the reindex if either of them fails.
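The gem's exact checks live in alias_index.rb; as a rough sketch of the kind of preflight you could run yourself (assuming jq is installed, and assuming the conditions worth checking are "each alias resolves to exactly one index" and "both aliases agree"):

main=$(curl -s "localhost:9200/city_docs-%2A/_alias/city_docs" | jq -r 'keys[]')
update=$(curl -s "localhost:9200/city_docs-%2A/_alias/city_docs_update" | jq -r 'keys[]')

# A migration is probably already in flight if the main alias spans several indexes.
[ "$(echo "$main" | wc -l)" -eq 1 ] || { echo "main alias points at more than one index, aborting"; exit 1; }

# Both aliases should resolve to the same physical index before we start.
[ "$main" = "$update" ] || { echo "main and update aliases disagree, aborting"; exit 1; }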
Now that the state of the world is right, we can begin by creating a new index to hold all of our data under the new mapping.

timestamp_now = "2020-11-27_03:03:15.413063"
new_index = "city_docs-#{timestamp_now}"

curl -X PUT "localhost:9200/#{new_index}" -d '{ ...your Index Settings }'
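Concretely, that settings body would carry the new mappings (and any analysis settings). The fields below are purely hypothetical, using the single "city" mapping type this post's examples assume:

curl -X PUT "localhost:9200/city_docs-2020-11-27_03:03:15.413063" -H 'Content-Type: application/json' -d '{
  "settings": { "number_of_shards": 1 },
  "mappings": {
    "city": {
      "properties": {
        "name":        { "type": "keyword" },
        "description": { "type": "text" },
        "population":  { "type": "integer" }
      }
    }
  }
}'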
Now we need to set up the system so that we can migrate. At this point we have two indexes, Old 2018 and New 2020, and both update_alias and main_alias point to Old 2018. The next step is to point the update alias at only New 2020, and to point the main alias at both New 2020 and Old 2018. Pointing the main alias at both indexes is the secret sauce that allows us to do a live migration: all writes will now go only to the new index, while main continues to read from both old and new while data is being migrated.
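Under the hood this re-pointing is a single atomic call to the _aliases endpoint, roughly along these lines (index names taken from the examples above):

curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d '{
  "actions": [
    { "remove": { "index": "city_docs-2018-09-07_03:03:15.413063", "alias": "city_docs_update" } },
    { "add":    { "index": "city_docs-2020-11-27_03:03:15.413063", "alias": "city_docs_update" } },
    { "add":    { "index": "city_docs-2020-11-27_03:03:15.413063", "alias": "city_docs" } }
  ]
}'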
Be sure to flush the indexes to clear the transaction logs.
curl -X POST localhost:9200/#{original_index}/_flush
Now that we have all of the plumbing set up, the next step is to move all of our data over in batches. Normally you would do this in something like Sidekiq or another background processing system.
1. Create a cursor to go over the records in batches of 100
cursor = curl -X GET "localhost:9200/#{original_index}/_search?scroll=10m&search_type=query_then_fetch&size=100"
This returns both the first batch of search results and a scroll_id reference to the next set of results (fetched as sketched below). We loop over these batches and perform the following basic algorithm.
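A follow-up scroll request looks roughly like this; the scroll_id placeholder stands in for whatever the previous response returned:

curl -X GET "localhost:9200/_search/scroll" -H 'Content-Type: application/json' -d '{
  "scroll": "10m",
  "scroll_id": "<scroll_id from the previous response>"
}'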
2. Weed out any documents that no longer exist in the original index because they have been deleted since we began the migration. To accomplish this, we map all documents in the batch into a bulk request using the search result's _id, _type, and the original_index, and request all of those docs using mget. Store this in a current_docs variable.

curl -X GET "http://localhost:9200/_mget?refresh=true" -H 'Content-Type: application/json' -d '{
  "docs": [
    { "_index": "city_docs-2018-09-07_03:50:27.108387", "_type": "city", "_id": "100203" },
    { "_index": "city_docs-2018-09-07_03:50:27.108387", "_type": "city", "_id": "100211" }
  ]
}'
3. This approach supports removing fields on a reindex (something that Lucene does not): take the new mapping and work out which fields are still needed in our new index.

defined_mapping_fields = index_def[:mappings][docs.first["_type"]]["properties"].keys
4. Reduce the current_docs so that we only keep docs that still exist on the index, take only the keys from them that exist in our current mapping, and bulk update with that (see the sketch below).
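The bulk write for a batch might look something like this against the new index; the document values are made up:

cat > bulk_body.ndjson <<'EOF'
{"index":{"_index":"city_docs-2020-11-27_03:03:15.413063","_type":"city","_id":"100203"}}
{"name":"Portland","description":"example description"}
EOF

curl -X POST "localhost:9200/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary @bulk_body.ndjson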
5. Do a final check for documents that don't exist anymore by repeating the step where we grab all of the docs from the old index again, then delete any of those that no longer exist from the new index.
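Assuming such a stale copy needs to come out of the new index, that cleanup is a plain document delete (the id here is hypothetical):

curl -X DELETE "localhost:9200/city_docs-2020-11-27_03:03:15.413063/city/100211"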
Now that everything is migrated, we remove the main alias from the Old 2018 index and then delete the old index.
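In raw API terms, that final cleanup is roughly:

curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d '{
  "actions": [
    { "remove": { "index": "city_docs-2018-09-07_03:03:15.413063", "alias": "city_docs" } }
  ]
}'

curl -X DELETE "localhost:9200/city_docs-2018-09-07_03:03:15.413063"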
So now we have a relatively straightforward process where we can rebuild indexes without any downtime. This approach accepts that it is OK to get double reads during a migration (the same document can come back from both the old and the new index) in order to have zero downtime for the migration. The system could be updated to a different strategy that allows single reads, at the cost of additional complexity and reduced reliability (or increased latency/disk space).