Repopulating the Taaalk database from archive.org

I'm the founder of Taaalk and a junior software engineer at Board Intelligence
Software Engineer at Stripe. Blog at robertheaton.com
Follow this Taaalk

4 followers

2,311 views

Joshua Summers
23:34, 30 Jan 22 (edit: 23:49, 30 Jan 22)
So to provide some context, when I launched Taaalk (this website; a platform for long form text-based conversations) in late 2020 I hosted it on a Hetzner server. For reasons I still do not understand, one day the site went down. When I got round to looking on the server it contained zero Taaalk-related code, and because I didn't rectify the situation quickly my seven days' worth of server backups were empty too. This meant I lost the most important thing I had built up: the database of existing Taaalks (conversations).
After redeploying on Heroku, I now want to repopulate the database. The content of the original site exists on archive.org's wayback machine. I used the wayback_machine_downloader gem to download the files and have hosted them in the wayback_taaalk repository so we can discuss. Individual Taaalks are hosted in the 't' directory. The html for each Taaalk is in an index.html file in a directory named after the Taaalk's slug.

I think it's also important to establish the schema of Taaalk so we know what we are working with. A simplified version with all the essentials is:
class Taaalk < ApplicationRecord
  belongs_to :user 
  # the user who created the Taaalk

  has_many :speakers
  # Taaalk speakers != users, this is so an individual
  # user can have different profiles depending on who
  # they are talking to

  has_many :messages

  has_one_attached :image
  # the Taaalk's optional background image
end
class Speaker < ApplicationRecord
  belongs_to :taaalk
  belongs_to :user
  has_many :messages

  has_one_attached :image # profile image

  has_rich_text :biography 
  # a speaker's bio; rich text used to enable links
  # (e.g. robertheaton.com in your bio above) 
end
class Message < ApplicationRecord
  belongs_to :user
  belongs_to :taaalk
  belongs_to :speaker

  has_rich_text :content 
  # the raw input from the rich_text_area I'm typing this
  # message in now. We can probably ignore.
  # e.g. Message.last.content.body.to_s
  # "<div class=\"trix-content\">\n
  #   <div>Hi!</div>\n
  # </div>\n"

  has_rich_text :safe_content
  # the raw content which has gone through my javascript
  # parser which adds "text bubbles" to each block. 
  # safe_content is used to render messages in front end.
  # e.g. Message.last.safe_content.body.to_s
  # "<div class=\"trix-content\">\n
  #   <div class=\"tlk-bubble-holder\"><div class=\"tlk-bubble\">Hi!</div></div>\n
  # </div>\n"
end
class User < ApplicationRecord
  has_many :taaalks
  has_many :speakers
  has_many :messages

  has_one_attached :image

  has_rich_text :biography

  validates :username, presence: true
  validates :email, presence: true
end
So, now the foundations of the situation are in place I will restate the goal: we want to go from the individual index.html files in each Taaalk slug directory from the archive.org download, to a database populated with as much of the data as we can extract.
What are the first thoughts that are going through your mind?
Robert Heaton
09:09, 31 Jan 22
We have 2 main tasks - parse the data from the Wayback Machine HTML dumps, then insert it into the Taaalk database. Parsing will be easier than inserting the data because it doesn't depend on the rest of the Taaalk application. I think we should therefore start by focussing on inserting.
Do the HTML dumps have all the data we will need? Yes, mostly. They have message contents, usernames, user bios, message created times. As far as I can tell they don't contain user emails.
We need to think about how to handle migrating users, especially since we don't have email addresses for them. Are we happy making these pages read-only and having the Taaalks finalised? If we are then I would propose we add a `login_disabled` (or something similar) flag to the `User` model and set the migrated users' username/password hash to something like `DISABLED` when we create them. We can add extra logic to prevent logins from `login_disabled` users (although it shouldn't technically be needed if the password hash field is nonsense).
On the other hand, we may want to make the Taaalks live and have users be able to log into the accounts that initially created them again. If so then we'll either need to create some sort of account claim flow, or manually set the emails using your knowledge of who the people are and trigger password resets for them all.
We have 2 main options for how to actually ingest the data. We could either run a script in the Taaalk Heroku console that inserts data into the database directly, or make HTTP requests to the Taaalk application server that mimic the requests that users' browsers would make. I think that the script approach is likely our best bet, for reasons that we can talk about later.
A few other thoughts:
  • We should set up a QA instance of Taaalk to test the migration against, since we are doing our own Taaalk on the production instance at the moment
  • If it's easy then we should make sure that any old permalinks (eg. to an individual Taaalk) are still valid. As far as I can see all of the important permalinks are indexed by slug (eg. `/t/exploring-obsessive-compulsive-disorder`), so just repopulating the database should give us that for free.
  • Do we want to set the created timestamps of the messages to what they were initially written? I'm sure it's possible to do this in Rails, but by default it will set them all to having been created at the time we run the script
  • Profile pictures may require some extra thought, but should still be easy enough
Joshua Summers
21:28, 04 Feb 22 (edit: 21:30, 04 Feb 22)
OK, I'll make some decisions, add some comments and ask some questions:
  • User accounts
    I think all Taaalks (bar one) will not continue. So creating faux users is acceptable. I can always manually alter one in the console afterwards. I think we should add an admin message at the base of each "imported" Taaalk to let the user know how to get in touch to claim their account.
  • Script vs HTTP requests
    Script
  • QA instance
    Is this so we don't accidentally explode production? How do we do this, and when?
  • Message timestamps
    Let's use the original ones. I think that will add some flavour to this challenge too.
  • Profile pictures
    When I visit Taaalk on the wayback machine I don't see the profile images, however I do see the Taaalk background image. Neither are in the wayback download so we might have to give this a miss.
So I can see roughly where we are going with this, but am not sure of the best way to start. What is the first tangible thing you would do? Or is there anything else we need to do before we get our hands dirty?
Robert Heaton
14:14, 06 Feb 22 (edit: 14:15, 06 Feb 22)
Good question. I'd write a script that parses the HTML of each Taaalk from the Wayback dump into a data structure and writes them to a JSON file. For example:
[
    {
        "slug": "my-taaalk-slug",
        "interviewer": {
            "name": "Alice"
        },
        "interviewee": {
            "name": "Bob"
        },
        "messages": [
            {
                "created": 1234567890,
                "author": "interviewer",
                "contents": "Hello Bob, how are you?"
            },
            // ...etc...
        ]
    },
    // ...etc...
]
Once we've done this, we can read the JSON file into another script that will write the data to our database. We could also work on this second script first using dummy data for development, either approach is fine.
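A minimal sketch of that hand-off between the two scripts, assuming a file shaped like the example above. The `taaalks.json` filename, the sample values and the lookup of the author by the `"interviewer"`/`"interviewee"` key are illustrative assumptions, and a real import script would write to the database instead of printing:

```ruby
require 'json'
require 'tmpdir'

# Made-up data in the shape of the example above.
taaalks = [
  {
    "slug" => "my-taaalk-slug",
    "interviewer" => { "name" => "Alice" },
    "interviewee" => { "name" => "Bob" },
    "messages" => [
      { "created" => 1234567890, "author" => "interviewer", "contents" => "Hello Bob, how are you?" }
    ]
  }
]

# The parsing script would finish by writing everything out as JSON.
json_path = File.join(Dir.mktmpdir, "taaalks.json")
File.write(json_path, JSON.pretty_generate(taaalks))

# The import script would start by reading it back in.
loaded = JSON.parse(File.read(json_path))

lines = loaded.flat_map do |taaalk|
  taaalk["messages"].map do |message|
    # "author" holds the key of the speaker hash on the same Taaalk.
    author_name = taaalk[message["author"]]["name"]
    "#{taaalk["slug"]}: #{author_name} said #{message["contents"].inspect}"
  end
end

puts lines
```

Because the JSON file is the only thing the two scripts share, either one can be developed and tested without the other.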
At the same time you could set up another Heroku account and deploy Taaalk to it. This will be our QA account that we'll test our data migration in before running it in production.
By the way, I think that at least some profile images actually are in the download (https://github.com/JoshInLisbon/wayback_taaalk/tree/main/img).
Joshua Summers
23:02, 07 Feb 22
Those images are from an ancient static version of Taaalk. They won't be tied to the Taaalks we are working with.
When it comes to a new Heroku deployment, are you talking about creating a staging environment? Or should I simply clone Taaalk and deploy that instance? (If there is a difference.)
Nice, I really like the JSON idea. I think we should parse the HTML first as that will help us understand the details of what we have to insert.
When it comes to our folder-of-folders-containing-html-files => JSON script, where do you envisage this being run? Would it be a `.rb` file in the wayback_taaalk repository which we might run with something like:
$ ruby html_articles_to_json.rb
Or should the relevant directories and files be uploaded onto the Taaalk Rails application in a temporary (medium-term) directory so we can work with them as we please? My feeling is that the latter would give us a smoother connection between the raw html files and our database. What do you think? 
Robert Heaton
08:55, 08 Feb 22
Yep the Heroku environments doc you linked to looks like a great place to start.
Starting with the script sounds good. There are a lot of options for where we could write and run it. We could certainly add the raw HTML to the Taaalk repo and develop the script in there, but we don't need to. I think that the main reason to do that would be to have that work recorded in git in case you need it in the future. This is nice, but optional. It would also be completely fine to write the script in the wayback repo and then copy only the finished JSON into the main Taaalk repo for when we come to run the migration.
Joshua Summers
11:14, 13 Feb 22 (edit: 11:55, 13 Feb 22)
OK, we are live with a staging environment: https://limitless-woodland-54671.herokuapp.com/
I also built the foundations of a `script.rb` file which is in the root directory of the wayback_taaalk project. I'd never worked with something that needed to navigate through directories, but it was easy enough.
Dir.each_child("t") do |d|
  dir = "t/" + d
  puts dir
  Dir.each_child(dir) do |f|
    # f is only the entry's name, so join it with its directory before reading
    file_data = File.read(File.join(dir, f))
    puts file_data[0, 30]
  end
end
When I ran the script it would succeed for a while, outputting the directory name and a snippet of html, but it failed halfway through. I realised that it was hitting a "t/invite" directory which, instead of housing an `index.html` file, held subdirectories housing the invites to different Taaalks, with this url structure: https://taaalk.co/t/invite/repopulating-the-taaalk-database-from-archive-org. Several invite links had been publicly available on the Taaalk site (for example, the practice in my Taaalk feedback Taaalk which lets anyone join in), and archive.org must have indexed them too.
This commit deletes this directory and includes our script.
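An alternative to deleting the directory would be to have the script skip anything under `t` that does not hold an `index.html` directly. A small sketch, assuming the same layout as the download; the `taaalk_index_paths` helper and the throwaway directory names are made up for illustration:

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: return the index.html path for every directory under
# `root` that directly contains one, and skip everything else.
def taaalk_index_paths(root)
  Dir.children(root).sort
     .map { |child| File.join(root, child, "index.html") }
     .select { |path| File.file?(path) }
end

Dir.mktmpdir do |tmp|
  root = File.join(tmp, "t")

  # A normal Taaalk: t/<slug>/index.html
  FileUtils.mkdir_p(File.join(root, "a-normal-taaalk"))
  File.write(File.join(root, "a-normal-taaalk", "index.html"), "<html></html>")

  # The invite case: t/invite/<slug>/index.html, one level deeper
  FileUtils.mkdir_p(File.join(root, "invite", "an-invited-taaalk"))
  File.write(File.join(root, "invite", "an-invited-taaalk", "index.html"), "<html></html>")

  # Only the normal Taaalk is listed; t/invite has no index.html of its own.
  taaalk_index_paths(root).each { |path| puts path.sub(tmp + "/", "") }
end
```

Deleting the directory is simpler for a one-off job; the check only pays off if the download might be regenerated later.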
It's not clear to me how to begin parsing. Do we want to / can we iterate through each html chunk without having to extract it manually from our massive `index.html` string? Do we want to do it that way? What is your instinct telling you?
Robert Heaton
06:59, 14 Feb 22
There's a Ruby gem called nokogiri that parses HTML and allows you to reference different segments of it using either CSS selector syntax or another syntax called XPath that I always have to look up. Slightly modifying the example from the docs:
require 'nokogiri'

# Fetch and parse HTML document
doc = Nokogiri::HTML(DATA_FROM_FILE)

# Search for nodes by css
doc.css('nav ul.menu li a', 'article h2').each do |link|
  puts link.content
end
We can use nokogiri to parse each HTML file, pull out the data that we want, and store it in a data structure in our program. Then at the end of the program we can write out all of the data to our JSON file.
I'm not worried about the HTML files being large - I would guess we'll still be able to process 10+ files/second, even just on your laptop. If we were dealing with something like 100,000 files then we might want to start thinking about optimisations and error handling.
Joshua Summers
12:02, 19 Feb 22
Ok, the script is coming along. This is a work in progress:
require 'nokogiri'

taaalks = []

Dir.each_child("t") do |d|
  dir = "t/" + d
  Dir.chdir(dir) do
    taaalk = {}
    data = Nokogiri::HTML(File.open("index.html"))

    taaalk[:title] = data.css('h1')[0].text

    taaalk[:speakers] = []
    data.css('.spkr-info').each do |spkr|
      speaker = {}
      spkr.css('h3 a').each do |s|
        speaker[:profile_path] = s['href']
        speaker[:name] = s.text
      end
      speaker[:twitter_handle] = spkr.css('.twitter-handle').text
      speaker[:bio] = spkr.css('.trix-content').inner_html
      taaalk[:speakers] << speaker
    end

    taaalks << taaalk
  end
end

pp taaalks

# => [...,
{:title=>"Bitcoin Maxima & Other Crypto Things",
  :speakers=>
   [{:profile_path=>"/u/joshua-summers",
     :name=>"Joshua Summers",
     :twitter_handle=>" JoshSummers1234",
     :bio=>
      "\n" +
      "  <div>I'm the founder of Taaalk &acirc;&#156;&#140;&iuml;&cedil;&#143;. Confused by crypto. Hopefully not for long.</div>\n"},
    {:profile_path=>"/u/thomas-hartman",
     :name=>"Thomas Hartman",
     :twitter_handle=>" thomashartman1",
     :bio=>
      "\n" +
      "  <div>Bitcoin investor, wealth manager, and project consultant. <br>blog: <a href=\"https://standardcrypto.wordpress.com/\">https://standardcrypto.wordpress.com/</a>\n" +
      "</div>\n"}]}, 
...]
For the speaker biographies, I decided to return the `inner_html` because it's a rich text field, so this way we can capture `<a>` tags, etc.
Is there anything I'm doing from a Ruby point of view which is particularly inelegant? Always looking to improve my fundamentals. 
Robert Heaton
14:03, 19 Feb 22
Nice! Few suggestions:
You can save a few lines and make your code terser using the `Dir[]` syntax.
Even without that syntax, I don't think this is a good time to use `chdir`. It's not a big deal here, but changing the working directory mid-script makes it harder to work out where a filepath points to. I'd prefer `File.open(File.join("t", d, "index.html"))`, although thanks to `Dir[]` we don't even have to do that.
You can save a few more lines using `Array#map` in 2 places. I like using `Array#map` wherever possible because it makes it clearer what a block of code is doing, and reduces the probability of a bug where you eg. forget to actually add new elements to your list at the end of your `each` block.
I haven't tested this so it might have a bug or 2, but should be mostly correct:
require 'nokogiri'

taaalks = Dir["./t/*/index.html"].map do |path|
  page = Nokogiri::HTML(File.open(path))

  speakers = page.css('.spkr-info').map do |spkr_info|
    # css returns a NodeSet, so take the first matching link
    info_header = spkr_info.css('h3 a')[0]
    {
      name: info_header.text,
      profile_path: info_header['href'],
      twitter_handle: spkr_info.css('.twitter-handle').text,
      bio: spkr_info.css('.trix-content').inner_html
    }
  end

  {
    title: page.css('h1')[0].text,
    speakers: speakers,
  }
end
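The `each` versus `Array#map` point can be seen in isolation in a tiny example. The names and the slug-style transformation are only there to have something to transform:

```ruby
names = ["Joshua Summers", "Thomas Hartman"]

# With `each`, the array has to be created first and filled by hand.
# Forgetting the `<<` line would leave it silently empty.
slugs_with_each = []
names.each do |name|
  slugs_with_each << name.downcase.tr(" ", "-")
end

# With `map`, whatever the block returns becomes the new element.
slugs_with_map = names.map { |name| name.downcase.tr(" ", "-") }

p slugs_with_each # => ["joshua-summers", "thomas-hartman"]
p slugs_with_map  # => ["joshua-summers", "thomas-hartman"]
```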
Joshua Summers
09:21, 26 Feb 22 (edit: 09:35, 26 Feb 22)
Nice refactor. 
I definitely am underutilising #map. My thought process is too linear and these functions are not deeply ingrained enough in me; I think, "I need an array that I can fill"; I don't think "I can create and fill the array at the same time". I'm making the same mistake with my hashes. Instead of defining an empty one to fill, I should define them on the fly. It is also much easier to read.
I also didn't know about file "globbing", the Dir.glob("...") method, or its shorthand Dir["..."]. Is there a reason you are starting your string with "./t/" and not simply "t/"?
And yes, I think I was thinking too literally about the folder structure as something real that I had to physically navigate; enter a directory to access the file in it, etc... I didn't consider that for the computer the directory is not as real as it is for a mouse-clicking human like me. I just need the path to open a file, I don't need to be in the directory containing the file.
I will get on and refactor + build this out to capture the info we need... 
Robert Heaton
11:21, 26 Feb 22
No particular reason to start with ./, I primarily do it to emphasise that the path is deliberately meant to be relative to the current directory and that I didn't forget a leading / or ~/. But I think it's personal taste.
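A quick way to see that the two spellings find the same files is to run both against a throwaway directory. The directory and file names below are made up, and `Dir.chdir` is used only so that the relative patterns resolve inside the temporary directory:

```ruby
require 'tmpdir'
require 'fileutils'

results = {}

Dir.mktmpdir do |tmp|
  FileUtils.mkdir_p(File.join(tmp, "t", "example-taaalk"))
  File.write(File.join(tmp, "t", "example-taaalk", "index.html"), "<html></html>")

  Dir.chdir(tmp) do
    results[:bare]   = Dir["t/*/index.html"]
    results[:dotted] = Dir["./t/*/index.html"]
    results[:glob]   = Dir.glob("t/*/index.html")
  end
end

p results[:bare]   # => ["t/example-taaalk/index.html"]
p results[:dotted] # => ["./t/example-taaalk/index.html"]
p results[:glob]   # => ["t/example-taaalk/index.html"]
```

The only difference is that the returned paths keep whatever prefix the pattern had.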
Joshua Summers
01:37, 27 Feb 22 (edit: 08:10, 27 Feb 22)
Cool, makes sense. 
OK, I think we are there with our Taaalks-JSON-blob-making script.
require 'nokogiri'

taaalks = Dir["t/*/index.html"].map do |path|
  page = Nokogiri::HTML(File.open(path))

  speakers = page.css('.spkr-info').map do |spkr_info|
    info_header = spkr_info.css('h3 a')[0]
    {
      name: info_header.text,
      id: info_header.attr('class').delete("^0-9"),
      side: spkr_info.attr('class').gsub('spkr-info spkr-info-',''),
      profile_path: info_header['href'],
      twitter_handle: spkr_info.css('.twitter-handle').text.strip,
      bio: spkr_info.css('.trix-content').inner_html,
    }
  end

  messages = page.css('.tlk-blob').map do |tlk_blob|
    {
      speaker_id: tlk_blob.css('div[class^="name-spkr-"]')[0].attr('class').delete("^0-9"),
      created_at: tlk_blob.css('div[class="tlk-blob-date"]').text.partition("(")[0].strip,
      message: tlk_blob.search('div[target="spkr-color-"]').children.to_html
    }
  end

  {
    title: page.css('h1')[0].text,
    speakers: speakers,
    messages: messages,
  }
end
I added side and id to each speaker. 
side is a string with a value of left or right, which is used in the Taaalk show.html.erb file to determine if someone's "bubbles" are left or right aligned. For example, in this Taaalk you are right and I am left.
The id seemed like the obvious way to join speakers with their messages.
With the messages, as well as the id of the message's speaker, we are extracting the created at date, and ignoring any edit dates; for example when a post has been edited you get the following string in the html: "23:34, 30 Jan 22 (edit: 23:49, 30 Jan 22)". This is why I partition at "(" - .partition("(")[0].strip
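If the original timestamps are to be reused at import time, the extracted string will need converting into a real time value at some point. A hedged sketch using Ruby's standard library; the `parse_taaalk_timestamp` helper is hypothetical, and treating the page times as UTC is an assumption that nothing in the download confirms:

```ruby
require 'time'

# Hypothetical helper: turn a scraped date string into a Time.
# Assumes the times shown on the page are UTC.
def parse_taaalk_timestamp(raw)
  original = raw.split("(")[0].strip # drop any "(edit: ...)" suffix
  Time.strptime("#{original} +0000", "%H:%M, %d %b %y %z")
end

created_at = parse_taaalk_timestamp("23:34, 30 Jan 22 (edit: 23:49, 30 Jan 22)")
puts created_at.utc.iso8601 # => 2022-01-30T23:34:00Z
```

The resulting value could then be passed explicitly as `created_at` when each message record is created, so the record does not default to the time the import runs.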
We are extracting the message as a string containing the html.
# individual message example
{
  :speaker_id=>"129",
  :created_at=>"18:20, 05 Feb 21",
  :message=>
      "\n" +
      "            <div class=\"tlk-blob-msg\">\n" +
      "<div class=\"trix-content\">\n" +
      "  <div class=\"tlk-bubble-holder\"><div class=\"tlk-bubble\">...</div></div>\n" +
      "</div>\n" +
      "\n" +
      "            </div>\n" +
      "          "
}
The html string looks a little odd... and I'm slightly concerned that the \n characters will cause some sort of problem in future, but it feels good enough to work with for now.
Do you have any comments on the latest script code?
And I guess we want to get our json over to the Taaalk rails app. What sort of file would I save that in? A .rb file? Or something else?
Robert Heaton
10:48, 27 Feb 22
Tiniest nit - I'd use String#split instead of #partition since you don't need to keep the ( character itself. I think split is a simpler and more common method and when you used partition instead my immediate thought was "oh there must be a reason why he used this instead of split".
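For comparison, this is what the two methods return for an edited and an unedited date string, using the formats quoted earlier in this conversation:

```ruby
edited   = "23:34, 30 Jan 22 (edit: 23:49, 30 Jan 22)"
unedited = "18:20, 05 Feb 21"

# partition always returns three parts: before, separator, after.
p edited.partition("(")   # => ["23:34, 30 Jan 22 ", "(", "edit: 23:49, 30 Jan 22)"]
p unedited.partition("(") # => ["18:20, 05 Feb 21", "", ""]

# split drops the separator and returns only the pieces around it.
p edited.split("(")       # => ["23:34, 30 Jan 22 ", "edit: 23:49, 30 Jan 22)"]
p unedited.split("(")     # => ["18:20, 05 Feb 21"]

# Either way, the first element stripped of whitespace is the original date.
created_at = edited.split("(")[0].strip
p created_at              # => "23:34, 30 Jan 22"
```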
I'd add a line at the end to print out the data in actual JSON, rather than as a Ruby hash:
require 'json'
puts(taaalks.to_json)
Then save this as a .json file in your app somewhere. The import script is going to be another Ruby script that we run from the Heroku shell, so at some point we'll need to figure out how to access the file from there.
Joshua Summers
09:42, 06 Mar 22
OK great. It's done. For the sake of completeness, here is the final script:
require 'nokogiri'
require 'json'

taaalks = Dir["t/*/index.html"].map do |path|
  page = Nokogiri::HTML(File.open(path))

  speakers = page.css('.spkr-info').map do |spkr_info|
    info_header = spkr_info.css('h3 a')[0]
    {
      name: info_header.text,
      id: info_header.attr('class').delete("^0-9"),
      side: spkr_info.attr('class').gsub('spkr-info spkr-info-',''),
      profile_path: info_header['href'],
      twitter_handle: spkr_info.css('.twitter-handle').text.strip,
      bio: spkr_info.css('.trix-content').inner_html,
    }
  end

  messages = page.css('.tlk-blob').map do |tlk_blob|
    {
      speaker_id: tlk_blob.css('div[class^="name-spkr-"]')[0].attr('class').delete("^0-9"),
      created_at: tlk_blob.css('div[class="tlk-blob-date"]').text.split("(")[0].strip,
      message: tlk_blob.search('div[target="spkr-color-"]').children.to_html
    }
  end

  {
    title: page.css('h1')[0].text,
    speakers: speakers,
    messages: messages,
  }
end

File.open('taaalks.json', 'w') { |file| file.write(taaalks.to_json) }
As well as changing .partition to .split, I created a taaalks.json file and wrote taaalks.to_json into it (instead of putsing taaalks and copying that output into a JSON file).
I saved taaalks.json into the root directory of the Taaalk project... now I think about it I'm not sure this is the right location. My plan was to make a taaalk_importer.rb script, however if I am going to be running this in the Heroku console, I probably can't do ruby taaalk_importer.rb. I guess I need to write a rake task, or create a class which I can run -  e.g. TaaalkImporter.call.
What do you think would be the best approach? And if there is one, why is it better than the others?
Robert Heaton
15:04, 06 Mar 22
I think a rake task is a good idea, since I think that's the most appropriate tool that Heroku provides for running scripts. The only thing I'm not sure about is where to put your JSON file so that your task can find and read it. I'd try things out in your QA Heroku instance.
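Whichever entry point is chosen, the reading half of the importer can be kept as plain Ruby so that a rake task or a console session only has to call it. A rough sketch under several assumptions: the `TaaalkImporter` name comes from the suggestion above, but its interface here is invented, the sample data is made up, and the block is where the real app would create its `Taaalk`, `Speaker` and `Message` records:

```ruby
require 'json'
require 'tmpdir'

# Hypothetical importer: reads the JSON file and hands each Taaalk, with its
# messages joined to their speakers by id, to a block that does the saving.
class TaaalkImporter
  def self.call(json_path)
    JSON.parse(File.read(json_path)).each do |taaalk|
      speakers_by_id = taaalk["speakers"].to_h { |speaker| [speaker["id"], speaker] }
      messages = taaalk["messages"].map do |message|
        message.merge("speaker" => speakers_by_id.fetch(message["speaker_id"]))
      end
      yield taaalk["title"], taaalk["speakers"], messages
    end
  end
end

# Made-up sample in the shape produced by the final parsing script.
sample = [
  {
    "title" => "An example Taaalk",
    "speakers" => [{ "id" => "1", "name" => "Alice", "side" => "left" }],
    "messages" => [
      { "speaker_id" => "1", "created_at" => "18:20, 05 Feb 21", "message" => "<div>Hi!</div>" }
    ]
  }
]
path = File.join(Dir.mktmpdir, "taaalks.json")
File.write(path, sample.to_json)

summaries = []
TaaalkImporter.call(path) do |title, speakers, messages|
  # In the Rails app, this is where the records would be created.
  summaries << "#{title}: #{speakers.size} speaker(s), #{messages.size} message(s)"
end
puts summaries
```

Where the JSON file should live so that the task can find it on Heroku remains the open question, so the path is passed in as an argument here.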