Parsing HTML snippet in JavaScript

u0206397

Senior Member
Joined
Jul 15, 2009
Messages
764
Reaction score
0
I am doing web scraping using AJAX to get some HTML text, which I parse to extract certain data.

My current method is like this:

Code:
$.get("https://domain.com/interesting-web-page.html", function (htmldata) {
  var div = document.createElement("div");
  div.innerHTML = htmldata; // the browser starts fetching every img src here
  var data = div.querySelectorAll(".css-class-of-interest");
});

Line 3, div.innerHTML = htmldata, causes many errors in the console because of broken img src URLs: the current web page is not on the same domain. The img src values are relative to the original web page that I pulled the data from, and my web page executing the AJAX GET is not on domain.com.

I am stuck in this chicken-and-egg problem. The HTML snippet must be added to the DOM before I can run querySelectorAll to extract the tags I want. However, doing so produces unnecessary errors, because the web browser tries to fetch the broken img src URLs.

If I don't add it to the DOM, I cannot use querySelectorAll, and parsing HTML myself sounds like a bad idea. I wonder if jQuery or some other JavaScript library can simply parse the HTML and build a DOM without trying to download those img src.
 

davidktw

Arch-Supremacy Member
Joined
Apr 15, 2010
Messages
13,547
Reaction score
1,300
You are already introducing it into the DOM. Try using a document fragment and see if it helps. Otherwise, look at virtual DOM libraries and search for methods to select inside the virtual DOM. From what I found online, there is jQuery integration with one such library:
https://github.com/Matt-Esch/virtual-dom/wiki

Also look at https://developer.mozilla.org/en-US/docs/Web/API/DOMParser

The last resort is to have a Node.js backend do the job.
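
One possible reading of the document fragment suggestion is a <template> element, whose content is a DocumentFragment that belongs to an inert document, so images inside it are not fetched. The sketch below is only an illustration under that assumption; queryInertFragment and its parameters are hypothetical names, not something from this thread.

```javascript
// Sketch: parse HTML into a <template>'s inert DocumentFragment and query it.
// `queryInertFragment` is a hypothetical helper name used for illustration.
function queryInertFragment(htmlText, selector, doc) {
  // Default to the page's document; another document object can be passed in.
  const ownerDoc = doc || globalThis.document;
  if (!ownerDoc) {
    throw new Error("No document is available in this environment");
  }
  const template = ownerDoc.createElement("template");
  // Assigning innerHTML on a template fills template.content, not the live page.
  template.innerHTML = htmlText;
  // template.content is a DocumentFragment, which supports querySelectorAll.
  return Array.from(template.content.querySelectorAll(selector));
}

// Browser usage with the class name from the first post:
// var data = queryInertFragment(htmldata, ".css-class-of-interest");
```

The returned elements can still be read (text, attributes) without ever being attached to the visible page.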
 

u0206397

Senior Member
Joined
Jul 15, 2009
Messages
764
Reaction score
0
Yes, this is exactly the problem.

To restate the problem simply: to use the querySelectorAll() method to extract the tags of interest, the HTML content must already be in the DOM as nodes. However, the unprocessed HTML contains img src with broken URLs, because it was fetched via AJAX from another domain. Once added to the DOM, the images immediately start trying to download and fail as expected, generating errors in the web browser console.

Currently, a normal user won't see those console errors, and the web page is working fine as it is. However, if there is a proper way to avoid the unwanted image loading, and thus the error messages, that would be the best scenario.

As said, I will take a look at what you have suggested. Thanks. :)
 

cwchong

Master Member
Joined
Jan 7, 2005
Messages
4,654
Reaction score
96
If it's only the img src you are concerned about, wouldn't an easy hack be to replace every /src="/ with the domain prefixed, if a domain is not already there?
 

davidktw

Arch-Supremacy Member
Joined
Apr 15, 2010
Messages
13,547
Reaction score
1,300
I have tried it; DOMParser is your easiest approach.
Code:
var parser = new DOMParser();
var doc = parser.parseFromString("<img src='https://cdn.pixabay.com/photo/2017/01/16/15/18/soap-bubble-1984310__480.jpg'>", "text/html");

It will generate the following structure
HTML
|-HEAD
`-BODY
   `-IMG

doc.querySelector("img"); will return
<img src="https://cdn.pixabay.com/photo/2017/01/16/15/18/soap-bubble-1984310__480.jpg">
as a node.

There won't be any script execution or image loading. You can also parse using "text/xml", but there will be some interesting effects; I will leave you to explore that.
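
Applied to the original $.get callback, the same idea might look like the sketch below. It is only an illustration: extractByClass and its parameters are made-up names, and the URL and class name are the placeholders from the first post.

```javascript
// Sketch: parse fetched HTML with DOMParser and query the resulting document.
// `extractByClass` is a hypothetical helper name used for illustration.
function extractByClass(htmlText, selector, ParserCtor) {
  // Default to the browser's DOMParser; a substitute can be passed in.
  const Parser = ParserCtor || globalThis.DOMParser;
  if (typeof Parser !== "function") {
    throw new Error("DOMParser is not available in this environment");
  }
  // The MIME type must be exactly "text/html" (no surrounding spaces).
  const doc = new Parser().parseFromString(htmlText, "text/html");
  // The parsed document is separate from the page, so nothing is rendered.
  return Array.from(doc.querySelectorAll(selector));
}

// Browser usage inside the original callback:
// $.get("https://domain.com/interesting-web-page.html", function (htmldata) {
//   var data = extractByClass(htmldata, ".css-class-of-interest");
// });
```

Returning a plain array rather than a NodeList is just a convenience so that map and filter can be used on the result.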
 

u0206397

Senior Member
Joined
Jul 15, 2009
Messages
764
Reaction score
0
Ah, this looks good. I checked on Mozilla and it says DOMParser is an experimental feature only in newer web browsers. I am keeping my fingers crossed. :D

Thanks a million!
 

u0206397

Senior Member
Joined
Jul 15, 2009
Messages
764
Reaction score
0
If it's only the img src you are concerned about, wouldn't an easy hack be to replace every /src="/ with the domain prefixed, if a domain is not already there?

I am afraid a simplistic img tag replacement using regex will fail if the HTML content is poorly formed or malformed. My parsing would go haywire.
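
If absolute image URLs are still wanted, one alternative to rewriting the raw HTML string is to resolve each src value with the standard URL constructor once the HTML has been parsed. This is a minimal sketch, not a tested solution for this site: resolveSrc is a hypothetical helper, and BASE is the placeholder URL from the first post.

```javascript
// Sketch: resolve a possibly relative src against the page it was fetched from.
// `resolveSrc` is a hypothetical helper; BASE is the placeholder URL from the thread.
const BASE = "https://domain.com/interesting-web-page.html";

function resolveSrc(rawSrc, baseUrl) {
  // The URL constructor keeps absolute URLs as they are and
  // resolves relative ones against baseUrl.
  return new URL(rawSrc, baseUrl).href;
}

// Browser usage on a document parsed elsewhere (for example by DOMParser):
// doc.querySelectorAll("img[src]").forEach(function (img) {
//   img.setAttribute("src", resolveSrc(img.getAttribute("src"), BASE));
// });
```

Because this works on attribute values rather than on the HTML text, it does not depend on how well-formed the surrounding markup is.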
 