Ruby on Rails Web Scraping Kimurai on Heroku

Is Kimurai a stable choice?

I started an app and now I want some data to work with. I could just copy and paste it over, but that's not much fun and I wouldn't get to try any new tools.

Previously I have used Puppeteer (pptr.dev), Selenium (selenium.dev), Cheerio (cheerio.js.org), or Beautiful Soup (crummy.com/software/BeautifulSoup/bs4/doc) as standalone services run on a schedule.

This time I want to add this to the app and run it as part of my Rails setup. I also want to learn more about where Ruby web scraping is at for single page/dynamic/progressive web apps. And finally, exploit familiarity - so Ruby or JS were my primary choices here.

Based on this post (scrapingbee.com/blog/web-scraping-ruby) I set up a test using Kimurai (github.com/vifreefly/kimuraframework). I had to downgrade to Ruby 2.7.3 to get it running, as it has not been updated for Ruby 3.x yet. That seems to be in progress though.

Once I had it running with the provided example, I was pretty confident I could get what I need from the library. It's unclear how easy it is to debug so far, but the errors are already better than what I have experienced with the JS options, with no config needed.

To integrate it into Rails I added a new folder, app/scrapers/, and defined my classes in files like app/scrapers/scraper_name.rb. This works, and the files are now autoloaded into the application. It also means I can run the scrapers from the Rails console during development, instead of calling each file directly with ruby, and be sure they can access any other Rails magic I might want to extend them with.
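For reference, a minimal sketch of what one of those classes can look like, based on the spider API in the Kimurai README. The class name, URL, selector, and output path here are all placeholders:

```ruby
# app/scrapers/example_scraper.rb
# Minimal Kimurai spider; requires the kimurai gem.
class ExampleScraper < Kimurai::Base
  @name = "example_scraper"
  @engine = :selenium_chrome       # headless Chrome driven by Selenium
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    # response is a Nokogiri::HTML::Document of the rendered page
    response.css("h1").each do |heading|
      save_to "tmp/headings.json", { title: heading.text.strip }, format: :json
    end
  end
end
```

With autoloading in place, `ExampleScraper.crawl!` can be run straight from `rails console`.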

I also set up a Rails initializer for Kimurai, config/initializers/kimurai.rb, to manage its config on app boot.
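A sketch of that initializer, using the configure block from the Kimurai README; the specific option values below are just examples:

```ruby
# config/initializers/kimurai.rb
# Runs once on app boot; requires the kimurai gem.
Kimurai.configure do |config|
  config.logger = Rails.logger   # send spider logs to the Rails log
  config.time_zone = "UTC"       # example setting; adjust to taste
end
```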

I ended up adding the pry-byebug gem (github.com/deivid-rodriguez/pry-byebug) to start debugging my scrapers.

What I have discovered so far is that there is no way to swap in a headed browser for local development. Relying only on screenshots and pry for debugging, without being able to watch the scraper run, is quite a slowdown, and it seems to be a limitation of Kimurai rather than of the underlying tools. Maybe there is a config option I have missed?
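For what it's worth, dropping a breakpoint into a spider's parse method works fine. Per the Kimurai README, `browser` inside a spider is the Capybara session, so `save_screenshot` is available from the pry prompt (the selector below is a placeholder):

```ruby
# Inside a Kimurai spider; requires kimurai and pry-byebug.
def parse(response, url:, data: {})
  binding.pry   # pause here; inspect response, url, data interactively
  # From the pry session you can, for example:
  #   response.css("h1").map(&:text)   # query the parsed page
  #   browser.save_screenshot          # dump a PNG from the headless browser
end
```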

Now, with a working example, let's try to get this deployed. I'm working with a basic Heroku setup so far. I assume we may need some buildpacks or other installs for Selenium, Chrome, and the Chrome webdriver. I'm looking to avoid moving to Docker just yet.

So far I ended up with:

File: app.json
{
  "buildpacks": [
    {
      "url": "heroku/ruby"
    },
    {
      "url": "https://github.com/heroku/heroku-buildpack-chromedriver"
    },
    {
      "url": "https://github.com/heroku/heroku-buildpack-google-chrome"
    }
  ]
}

This has extended the Heroku build time quite a bit, since it now installs Chrome and the webdriver. A separate service just for scraping might be better, but then we lose some of the single-app advantage for a one-person project, and some ease of development.
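If you prefer the CLI over app.json, the same buildpacks can be attached with heroku buildpacks:add. Order matters, so the index flags below keep the Ruby buildpack first ("your-app" is a placeholder app name):

```shell
# Attach the three buildpacks in order on an existing Heroku app
heroku buildpacks:add --index 1 heroku/ruby --app your-app
heroku buildpacks:add --index 2 https://github.com/heroku/heroku-buildpack-chromedriver --app your-app
heroku buildpacks:add --index 3 https://github.com/heroku/heroku-buildpack-google-chrome --app your-app
```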

I also updated the Gemfile to explicitly include the webdriver gems:

gem "kimurai"
gem "webdrivers", "~> 4.0", require: false
# gem "chromedriver-helper" deprecated
gem "selenium-webdriver"

Thanks to poster here: github.com/heroku/heroku-buildpack-chromedr..

Finally, it looks like we also need to tell Kimurai where to find the Chrome and chromedriver installs on Heroku, with some ENV-backed settings added to the initializer we set up earlier. Found over here: milk1000cc.hatenablog.com/entry/2019/11/04/..

  # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
  config.selenium_chrome_path = ENV["SELENIUM_CHROME_PATH"].presence || "/usr/bin/chromium-browser"
  # Provide custom selenium chromedriver path (default is "/usr/local/bin/chromedriver"):
  config.chromedriver_path = ENV["CHROMEDRIVER_PATH"].presence || "~/.local/bin/chromedriver"
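Those ENV names come from our own initializer, so they still need values set on Heroku. The paths below are assumptions; check each buildpack's README or the build output for where the binaries actually land:

```shell
# Placeholder paths; verify on your dyno with `heroku run bash`
heroku config:set SELENIUM_CHROME_PATH=/app/.apt/usr/bin/google-chrome --app your-app
heroku config:set CHROMEDRIVER_PATH=/app/.chromedriver/bin/chromedriver --app your-app
```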