Selenium on Heroku

I spent a lot of time struggling to get a Selenium WebDriver setup in NodeJs that works locally on my Windows machine and deploys easily to Heroku. I finally got it working using PhantomJs as the headless browser. I was able to get Chrome and Firefox working locally, but not on Heroku. Here are the steps I took to get it working:

1. Install PhantomJS 2.1.1

http://phantomjs.org/download.html

2. Set up a Heroku app with an additional buildpack for PhantomJs:

https://github.com/stomita/heroku-buildpack-phantomjs

3. Set up a simple NodeJs/Express app. I am currently on NodeJS LTS 6.10.3. Selenium-WebDriver will not work on node < 6.

package.json:
{
  "name": "untitled2",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "start": "node index.js",
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "express": "^4.15.2",
    "selenium-webdriver": "^3.4.0"
  },
  "engines": {
    "node": "6.10.3",
    "npm": "4.5.0"
  }
}
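
Since Selenium-WebDriver will not work on Node versions below 6, it can help to check the running version at startup rather than hit an obscure error later. A minimal sketch, assuming a hypothetical helper named `meetsMinimumNode` that is not part of the original app:

```javascript
// Hypothetical helper: returns true when a Node version string such as
// "6.10.3" (with or without a leading "v") has a major version >= minMajor.
function meetsMinimumNode(version, minMajor) {
  var major = parseInt(String(version).replace(/^v/, '').split('.')[0], 10);
  return major >= minMajor;
}

// process.versions.node holds the running Node version without the "v".
if (!meetsMinimumNode(process.versions.node, 6)) {
  console.error('selenium-webdriver needs Node 6 or newer; found ' + process.versions.node);
  process.exitCode = 1;
}
```

On Heroku the "engines" field above already pins the version, so a check like this is mostly useful when running locally.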

4. Expose a simple /test endpoint on your app to do the Selenium-WebDriver "Hello World" example:

index.js:

var webdriver = require('selenium-webdriver');
var express = require('express');
var app = express();

var port = process.env.PORT || 14000;
var By = webdriver.By;

app.get('/test', function (req, res) {
    // Start a PhantomJS session; the phantomjs binary must be in the
    // project root or on the PATH.
    var driver = new webdriver.Builder()
        .forBrowser('phantomjs')
        .build();
    // With selenium-webdriver 3.x these calls are queued and run in order.
    driver.get('http://www.google.com/ncr');
    driver.findElement(By.name('q')).sendKeys('webdriver');
    driver.findElement(By.name('btnG')).click();
    // Poll the page title for up to 5 seconds.
    driver.wait(function() {
        return driver.getTitle().then(function(title) {
            console.log(title);
            return title === 'webdriver - Google Search';
        });
    }, 5000).then(function() {
        res.status(200).send('Done');
    }, function(error) {
        // On timeout the error object is sent back as JSON.
        res.status(200).send(error);
    });
    // Queued after the wait, so the browser closes once the check finishes.
    driver.quit();
});

app.listen(port, function () {
    console.log('Example app listening on port: ',port)
})
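
The handler above relies on selenium-webdriver 3.x queueing its commands so that they run in order, including the final driver.quit(). As a hedged sketch, the same flow can be written with explicit promise chaining so the ordering and the cleanup are visible in the code. `runSearch` is a hypothetical name, and `driver` and `By` are passed in by the caller:

```javascript
// Sketch: run the Google search and always quit the driver, whether the
// title check passes or the wait times out. Not the original handler.
function runSearch(driver, By, timeoutMs) {
  function cleanup() {
    return Promise.resolve(driver.quit());
  }
  return Promise.resolve()
    .then(function () { return driver.get('http://www.google.com/ncr'); })
    .then(function () { return driver.findElement(By.name('q')).sendKeys('webdriver'); })
    .then(function () { return driver.findElement(By.name('btnG')).click(); })
    .then(function () {
      return driver.wait(function () {
        return driver.getTitle().then(function (title) {
          return title === 'webdriver - Google Search';
        });
      }, timeoutMs);
    })
    .then(
      // Success: close the browser, then report 'Done'.
      function () { return cleanup().then(function () { return 'Done'; }); },
      // Failure: close the browser, then pass the original error along.
      function (err) { return cleanup().then(function () { throw err; }); }
    );
}
```

Inside the route this could be called as runSearch(driver, webdriver.By, 5000) and the resulting promise used to send the response.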

Notes:

  • I copied phantomjs.exe straight into my project root. It should also work if you install it and add the installed location of phantomjs.exe to your PATH.
  • I am using a local port of 14000, so you can run it and go to http://localhost:14000/test to try it out.
  • I have it set to a 5-second timeout. If it times out, the promise is rejected with an error and the response is: {"name":"TimeoutError"}
  • When pushing to Heroku you can watch for the PhantomJS Buildpack to be installed successfully:
remote: -----> Build succeeded!
remote: -----> PhantomJS app detected
remote: -----> Extracting PhantomJS 2.1.1 binaries to /tmp/build_a6395fe7656f5bfcbfc7cfa31d3f8381/vendor/phantomjs
remote: -----> exporting PATH and LIBRARY_PATH
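
The {"name":"TimeoutError"} output mentioned in the notes is what you get when an error object whose only enumerable property is `name` is serialized to JSON; the message is dropped because it is not enumerable. The sketch below reproduces this with a stand-in `TimeoutError` class (an assumption about how the library's error is built, not its actual source) and a hypothetical `describeError` helper that keeps the message:

```javascript
// Stand-in for the library's timeout error. Assumption: `name` is assigned
// as an own property in the constructor, which makes it enumerable.
class TimeoutError extends Error {
  constructor(message) {
    super(message);
    this.name = this.constructor.name;
  }
}

// Hypothetical helper: copy the useful fields into a plain object so that
// both survive JSON serialization.
function describeError(err) {
  return { name: err.name, message: err.message };
}

var err = new TimeoutError('Wait timed out after 5000ms');
console.log(JSON.stringify(err));                // {"name":"TimeoutError"}
console.log(JSON.stringify(describeError(err))); // {"name":"TimeoutError","message":"Wait timed out after 5000ms"}
```

In the /test handler, sending describeError(error) instead of the raw error would make the timeout message visible in the response.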

Comments

  1. Exactly what I needed to get me started. Thanks!

  2. I may have written too soon. I get:

    /app/node_modules/selenium-webdriver/lib/promise.js:2634
    Jul 06 11:17:08 comparity-qa app/web.1: throw error;
    Jul 06 11:17:08 comparity-qa app/web.1: ^
    Jul 06 11:17:08 comparity-qa app/web.1: Error: Server terminated early with status 2
    Jul 06 11:17:08 comparity-qa app/web.1: at Error (native)
    Jul 06 11:17:08 comparity-qa app/web.1: at earlyTermination.catch.e (/app/node_modules/selenium-webdriver/remote/index.js:252:52)
    Jul 06 11:17:08 comparity-qa app/web.1: at process._tickCallback (internal/process/next_tick.js:103:7)
    Jul 06 11:17:08 comparity-qa app/web.1: From: Task: WebDriver.createSession()
    Jul 06 11:17:08 comparity-qa app/web.1: at Function.createSession (/app/node_modules/selenium-webdriver/lib/webdriver.js:777:24)
    Jul 06 11:17:08 comparity-qa app/web.1: at Function.createSession (/app/node_modules/selenium-webdriver/phantomjs.js:220:55)

  3. Your blog only talks about running on localhost and installing on Heroku. Did you get it to run on Heroku? If so, how?

    Replies
    1. Hey Kevin, I was away for a week... Let me get you a public working repo you can clone and push to Heroku tomorrow and we can figure out why it's not working.

    2. More than I could ask for. Thank you!

    3. I made a public repo, with quick instructions on how to get it up on Heroku. It worked for me without any errors. Let me know if you can get it working: https://github.com/AlexViderman/heroku-selenium

    4. Thanks very much for this! I had to divert to another task but back to this tomorrow. I will be sure to follow up.

    5. It works! http://floating-fortress-72388.herokuapp.com/test (might get deleted later) I don't find a code or package difference indicating why my attempt failed. I owe you one. My struggles to get it working inside my existing Node app could have been dependency driven. But then I couldn't get a clean app running from your blog post. Hope I discover where I went wrong. Thanks so much for your help.

    6. Good luck! I tried to keep the code and dependencies as clean as possible. What node version did you specify in your existing package.json for Heroku to install?

    7. My existing app was on Node 6.9.4. I copied from this blog when I tried stand alone but got the same error. I thought maybe it was NODE_MODULES_CACHE. I still don't know.

      I submitted a PR to your repo to update to 6.11.1

    8. Thank you for the updates. I have about 30 private node/heroku repos on 6.10.3, and it's been a pain upgrading and testing all of them :)

  4. Alex- super informative. I know this is a bit much but hoping you can help me out.
    I have the following node route using selenium and chrome driver which is working correctly and returning expected html in the console:

    app.get('/google', function (req, res) {
        var driver = new webdriver
            .Builder()
            .forBrowser('chrome')
            .build();

        driver.get('https://www.google.com');
        driver
            .manage()
            .window()
            .setSize(1200, 1024);
        driver.wait(webdriver.until.elementLocated({xpath: '//*[@id="lst-ib"]'}));
        return driver
            .findElement({xpath: '//*[@id="lst-ib"]'})
            .sendKeys('stackoverflow' + webdriver.Key.RETURN)
            .then((html) => {
                return driver
                    .findElement({xpath: '//*[@id="rso"]/div[1]/div/div/div/div'})
                    .getAttribute("innerHTML");
            })
            .then((result) => {
                console.log(result);
            })
            .then(() => {
                res
                    .status(200)
                    .send('ok');
            });
    });
    I have also installed the PhantomJS driver and tested that it's working by returning the page title, which works. When I use the exact route above and replace chrome with phantomjs, I get no results returned. There are no errors, just no printout in my console. The status and result are never sent to the browser, so it doesn't appear to be stepping through the promise chain.

    Any suggestions?

    Replies
    1. I spent some time trying to figure out why it's not working and am still looking into it. The xpath lookup by "id" is not working for some reason, so the findElement just sits there. You can add a timeout to the driver.wait and it will fail and let you know it timed out because it couldn't find the element:

      driver.wait(webdriver.until.elementLocated({xpath: '//*[@id="lst-ib"]'}), 2000);

      If you do an xpath query by name, it actually finds the element. This is not a solution but at least a clue that something with search by id is not right:

      driver.wait(webdriver.until.elementLocated({xpath: '//*[@name="q"]'}), 2000);

      I'll post an update if I can figure out what's going on.

    2. Ok I figured it out. Google renders different html based on your user-agent. The html for the default Phantomjs user-agent does not render the "lst-ib" id. If you set the user-agent to Chrome, your code works. Here is how I set the user-agent of Phantomjs to my Chrome user-agent:

      var driver = new webdriver
          .Builder()
          // .forBrowser('phantomjs')
          .withCapabilities(webdriver.Capabilities.phantomjs()
              .set("phantomjs.page.settings.userAgent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"))
          .build();

      Let me know if that works for you!

      This is how I debugged it. I output the contents of the form and saw it didn't have an element with the matching id. As soon as I changed the user-agent, the element with the id appeared:

      driver.findElement({xpath: '//*[@name="f"]'}).getAttribute("innerHTML").then(html=> {
      return res
      .status(200)
      .send(html)

      })
      return;
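
The user-agent override can also be expressed as a plain capabilities object, which to my knowledge Builder.withCapabilities accepts in place of a Capabilities instance. A small sketch, with `phantomCapabilities` as a hypothetical helper name:

```javascript
// Hypothetical helper: build PhantomJS capabilities with a custom user-agent.
// The key name is the same one used with Capabilities.set() above.
function phantomCapabilities(userAgent) {
  return {
    browserName: 'phantomjs',
    'phantomjs.page.settings.userAgent': userAgent
  };
}

var caps = phantomCapabilities(
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
);
console.log(caps['phantomjs.page.settings.userAgent']);
```

The object would then be passed as new webdriver.Builder().withCapabilities(caps).build(), which keeps the user-agent string in one place if several routes need it.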

  5. I've tried to crawl the live score, which is rendered by the hltv server. However, I can't do it with phantomjs... can you help me?
    var express = require('express');
    var router = express.Router();
    var cheerio = require('cheerio');
    var webdriver = require('selenium-webdriver');


    // GET todayMatch
    router.get('/',function (req, res) {
    const {Builder, until} = require('selenium-webdriver');
    let driver = new webdriver.Builder()
    .withCapabilities(webdriver.Capabilities.phantomjs()
    .set("phantomjs.page.settings.userAgent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"))
    .build();
    const hltv_url = 'https://www.hltv.org/';

    driver.get(hltv_url)
    .then(() => driver.wait(until.titleIs('CS:GO News & Coverage | HLTV.org'), 1000))
    .then(() => driver.executeScript("window.scrollTo(0, document.body.scrollHeight);"))
    .then(() => driver.getPageSource())
    .then((source) => {
    const $ = cheerio.load(source);
    var items = [];
    $('.top-border-hide').find(".hotmatch-box.a-reset").each((_,ele) => {
    items.push($(ele));
    });
    console.log(items[0]);
    console.log(items[0].html());
    console.log(items[0].text());
    //Do whatever you want with the result

    //console.log(item.html());
    })
    .then(() => {
    driver.quit();
    });
    res.render('pages/score_api');
    });
    module.exports = router;

    Replies
    1. I tried your code locally, without using express, and got a reply from the hltv server:
      (I removed the tags; it won't let me post html here)

      div class="teambox match star0-filter "
      div class="teamrows"
      div class="teamrow"img alt="Germany" src="https://static.hltv.org/images/bigflags/30x20/DE.gif" class="flag" title="Germany" span class="team"ALTERNATE aTTaX/span/div
      div class="teamrow"img alt="Sweden" src="https://static.hltv.org/images/bigflags/30x20/SE.gif" class="flag" title="Sweden" span class="team"GODSENT/span/div
      /div
      div class="twoRowExtra"
      div class="livescore twoRowExtraRow"span data-livescore-current-map-score="" data-livescore-team="4501"/span/div
      div class="livescore twoRowExtraRow"span data-livescore-current-map-score="" data-livescore-team="6902"/span/div
      /div
      /div

      This is 100% your code:

      const cheerio = require('cheerio');
      const webdriver = require('selenium-webdriver');
      const {Builder, until} = require('selenium-webdriver');

      let driver = new webdriver.Builder()
      .withCapabilities(webdriver.Capabilities.phantomjs()
      .set("phantomjs.page.settings.userAgent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"))
      .build();

      const hltv_url = 'https://www.hltv.org/';

      driver.get(hltv_url)
      .then(() => driver.wait(until.titleIs('CS:GO News & Coverage | HLTV.org'), 1000))
      .then(() => driver.executeScript("window.scrollTo(0, document.body.scrollHeight);"))
      .then(() => driver.getPageSource())
      .then((source) => {
      const $ = cheerio.load(source);
      var items = [];
      $('.top-border-hide').find(".hotmatch-box.a-reset").each((_,ele) => {
      items.push($(ele));
      });
      console.log(items[0].html());
      })
      .then(() => {
      driver.quit();
      });

      What issue are you having? Where are you hosting this?

    2. My issue is crawling a live score. When I use Inspect Element on the hltv.org site I get:


      span data-livescore-current-map-score="" data-livescore-team="8677" class="trailing"> 14/span

      span data-livescore-current-map-score="" data-livescore-team="6947" class="leading"> 15/span
      However, when I use PhantomJS I just get:

      span data-livescore-current-map-score="" data-livescore-team="8677" class="trailing"> /span>
      span data-livescore-current-map-score="" data-livescore-team="6947" class="leading">

      The difference is that the result I get doesn't have the live score, which is rendered by scorebot.

    3. This comment has been removed by the author.

    4. If I use Selenium Grid like this:

      const {Builder, until} = require('selenium-webdriver');
      let driver = new Builder()
      .forBrowser('firefox')
      .usingServer( 'http://172.17.50.54:8080/wd/hub')
      .build();
      const hltv_url = 'https://www.hltv.org/';

      I also get the live score data, but only locally, not on Heroku, because I can't get Firefox on Heroku.

    5. Hi Duy,

      I see what you mean now! I spent some time playing around with it, and it looks like hltv uses WebSockets to populate the live scores, and it does not seem like PhantomJs supports them. That's probably why it works in Firefox out of the box for you and not in PhantomJs.
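
One quick way to test that guess is to ask the page itself whether a WebSocket constructor exists. The sketch below assumes a hypothetical helper named `hasWebSocket` and a `driver` that provides selenium-webdriver's executeScript method. A `true` result only shows that the API is present, not that a connection to the score server would succeed:

```javascript
// Hypothetical helper: run a one-line script in the page and resolve to
// true when the browser defines window.WebSocket.
function hasWebSocket(driver) {
  return Promise.resolve(
    driver.executeScript('return typeof window.WebSocket !== "undefined";')
  );
}
```

After driver.get(hltv_url), calling hasWebSocket(driver).then(function (ok) { console.log('WebSocket available:', ok); }) would print the result for whichever browser is in use.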

      Personally, I have moved away from PhantomJs and have been using Puppeteer with Headless Chromium. I have gotten it to work on Heroku. I put together some quick code that was able to scrape the live scores locally:

      const puppeteer = require('puppeteer');

      function timeout(ms) {
          return new Promise(resolve => setTimeout(resolve, ms));
      }

      (async () => {
          const browser = await puppeteer.launch();
          const page = await browser.newPage();
          await page.goto('https://www.hltv.org/');
          await timeout(1000);
          const matches = await page.evaluate(() => [...document.querySelectorAll('.teambox.match')].map(elem => elem.innerHTML));

          console.log(matches);

          await browser.close();
      })();

      Here is an example of someone running Headless Chromium with Puppeteer on Heroku: https://github.com/alvarcarto/url-to-pdf-api

      Basically, you just need this build pack: https://github.com/jontewks/puppeteer-heroku-buildpack
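
When launching Chromium on Heroku, the sandbox usually has to be turned off through launch arguments. The sketch below picks the options from the environment; `launchOptions` is a hypothetical helper, and the use of the DYNO environment variable to detect Heroku is an assumption on my part:

```javascript
// Hypothetical helper: choose Puppeteer launch options for the environment.
// Assumption: Heroku dynos expose a DYNO environment variable, and Chromium
// there needs the sandbox disabled.
function launchOptions(env) {
  if (env.DYNO) {
    return { args: ['--no-sandbox', '--disable-setuid-sandbox'] };
  }
  return {};
}

console.log(launchOptions(process.env));
```

The result would be passed as puppeteer.launch(launchOptions(process.env)), so the same code runs unchanged locally and on Heroku.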

      Good luck! Let me know how you end up solving your problem!

    6. Thank you so much. This is my first problem when deploying my app to Heroku :D I'm just learning to crawl website data and make an API for my Android app. Thanks for your help :D

    7. My next sticking point is crawling data in another time zone. For example, I want it to crawl data as if it were located in Vietnam, but my server is located in the US. So I'm trying to find out how to set up the Puppeteer browser's environment for a Vietnam location.

  6. I can't understand. You said that you "copied phantomjs.exe straight into my project root". Question: for which platform do I need to download PhantomJS for Heroku (Windows, Mac OS X, Linux 64-bit, etc.)?
    Will it work only remotely, with no need to install it on the local computer? Please explain. Thx!

