Parsing HTML snippet in JavaScript

u0206397 · Oct 17, 2017

I am doing web scraping, using AJAX to fetch some HTML text, which I then parse to extract certain data.

My current method is like this:

Code:

$.get( "domain.com/interesting-web-page.html", function( htmldata ) {
  var div = document.createElement("div");
  div.innerHTML = htmldata;
  var data = div.querySelectorAll('.css-class-of-interest');
});

Line 3 (div.innerHTML = htmldata) causes many errors in the console because of broken img src URLs. The img src values are relative to the original web page that I pulled the data from, and my web page executing the AJAX GET is not on domain.com.

I am stuck in a chicken-and-egg problem. The HTML snippet must be added to the DOM before I can run querySelectorAll to extract the tags I want. However, doing so produces unnecessary errors, because the web browser tries to retrieve the broken img src URLs.

If I don't add it to the DOM, I cannot use querySelectorAll, and parsing the HTML myself sounds like a bad idea. I wonder whether jQuery or some other JavaScript library can simply parse the HTML and build a DOM without trying to download those images.

davidktw · Oct 17, 2017

You are already introducing it into the DOM. Try using a document fragment and see if it helps. Otherwise, look at virtual DOM libraries and search for a way to select inside the virtual DOM. From what I found online, there is jQuery integration with one such library:
https://github.com/Matt-Esch/virtual-dom/wiki

Also look at https://developer.mozilla.org/en-US/docs/Web/API/DOMParser

The last resort is to have a Node.js backend do the job.

u0206397 · Oct 17, 2017

Yes, this is exactly the problem.

To restate the problem simply: to use the querySelectorAll() method to extract the tags of interest, the HTML content must already be in the DOM as nodes. However, the unprocessed HTML contains img src attributes whose URLs are broken because the HTML was fetched via AJAX from another domain. When added to the DOM, the images immediately start downloading and fail as expected, generating errors in the web browser console.

Currently, a normal user won't see those console errors, and the web page works fine as it is. However, if there is a proper way to avoid the unwanted image loading, and thus the error messages, that would be ideal.

As said, I will take a look at what you have suggested. Thanks.

cwchong · Oct 17, 2017

If it's only the img src you are concerned about, wouldn't an easy hack be to replace every /src="/ with the domain prefixed, wherever a domain is not already there?
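A minimal sketch of that hack, assuming the URL of the page the HTML was fetched from is known. The function name absolutizeImgSrc, the sample markup and the sample URLs are hypothetical. It relies on the standard URL constructor to resolve each relative src against the source page, and it only handles quoted src attributes, so unusual or malformed markup may slip through unchanged.

```javascript
// Rewrite the src of every <img> tag so it is absolute, resolved against
// the URL of the page the HTML came from. Regex-based, so it only handles
// quoted src attributes and can miss unusual markup.
function absolutizeImgSrc(html, baseUrl) {
  const imgSrc = /(<img\b[^>]*?\ssrc\s*=\s*)(["'])(.*?)\2/gi;
  return html.replace(imgSrc, function (match, prefix, quote, src) {
    try {
      // new URL(relative, base) leaves already-absolute URLs unchanged.
      return prefix + quote + new URL(src, baseUrl).href + quote;
    } catch (e) {
      return match; // leave unparseable URLs as they are
    }
  });
}

// Hypothetical page URL and markup, for illustration only.
const sample = '<img src="pics/a.png"><img src="https://cdn.example.com/b.png">';
console.log(absolutizeImgSrc(sample, "https://domain.com/dir/page.html"));
```

Note that this swaps failed requests for real ones: once the rewritten HTML is assigned to innerHTML, the browser would fetch the images from the original domain instead of skipping them.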

davidktw · Oct 17, 2017

u0206397 said:
Yes, this is exactly the problem.

To restate the problem simply: to use the querySelectorAll() method to extract the tags of interest, the HTML content must already be in the DOM as nodes. However, the unprocessed HTML contains img src attributes whose URLs are broken because the HTML was fetched via AJAX from another domain. When added to the DOM, the images immediately start downloading and fail as expected, generating errors in the web browser console.

Currently, a normal user won't see those console errors, and the web page works fine as it is. However, if there is a proper way to avoid the unwanted image loading, and thus the error messages, that would be ideal.

As said, I will take a look at what you have suggested. Thanks.

I have tried it; DOMParser is your easiest approach.

Code:

var parser = new DOMParser();
// Parse into a separate document; the MIME type must be exactly "text/html".
var doc = parser.parseFromString("<img src='https://cdn.pixabay.com/photo/2017/01/16/15/18/soap-bubble-1984310__480.jpg'>", "text/html");

It will generate the following structure:
HTML
|-HEAD
`-BODY
   `-IMG

doc.querySelector("img"); will return
<img src="https://cdn.pixabay.com/photo/2017/01/16/15/18/soap-bubble-1984310__480.jpg">
as a node.

Nothing is executed or loaded. You can also parse using "text/xml", but that has some interesting effects, which I will leave you to explore.
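Applied to the original question, the same idea might look like the sketch below. The helper name selectFromHtml is hypothetical, and the ParserCtor parameter is an assumption added only so the helper can be exercised outside a browser; in a browser it defaults to the built-in DOMParser.

```javascript
// Parse an HTML string into a separate document and return the elements
// matching a CSS selector. Nothing is attached to the current page.
// ParserCtor is injectable for testing; browsers supply DOMParser.
function selectFromHtml(html, selector, ParserCtor = globalThis.DOMParser) {
  const doc = new ParserCtor().parseFromString(html, "text/html");
  return Array.from(doc.querySelectorAll(selector));
}

// Browser usage with jQuery, mirroring the code in the first post:
// $.get("domain.com/interesting-web-page.html", function (htmldata) {
//   var data = selectFromHtml(htmldata, ".css-class-of-interest");
// });
```

The returned elements belong to the parsed document, not to the page, so reading their text or attributes should not trigger any image requests.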

u0206397 · Oct 17, 2017

Ah, this looks good. I checked Mozilla's documentation, and it says DOMParser is an experimental feature available only in newer web browsers. I am keeping my fingers crossed.

Thanks a million!

u0206397 · Oct 17, 2017

cwchong said:
If it's only the img src you are concerned about, wouldn't an easy hack be to replace every /src="/ with the domain prefixed, wherever a domain is not already there?

I am afraid a simplistic regex replacement of img tags will fail if the HTML content is less than well-formed or outright malformed. My parsing would go haywire.
