I am doing web scraping: I use AJAX to fetch some HTML text, which I then parse to extract certain data.
My current method is like this:
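JavaScript (jQuery):
$.get( "domain.com/interesting-web-page.html", function( htmldata ) {
    var div = document.createElement("div");    // detached container for the fetched markup
    div.innerHTML = htmldata;                    // parse the fetched HTML into the div
    var data = div.querySelectorAll('.css-class-of-interest');  // pick out the elements I want
});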
Line 3: div.innerHTML = htmldata causes many errors in the console because of broken img src URLs. The img src values are relative to the original web page that I pulled the data from, and my page, which executes the AJAX GET, is not on domain.com, so the browser resolves them against the wrong domain.
I am stuck in a chicken-and-egg problem. The HTML snippet has to be turned into DOM nodes before I can run querySelectorAll to extract the tags I want. However, doing so produces needless errors, because the browser tries to retrieve the broken img src URLs.
If I don't build a DOM, I cannot use querySelectorAll, and parsing the HTML myself sounds like a bad idea. Is there a way, in jQuery or some other JavaScript library, to simply parse the HTML and build a DOM without trying to download those img src URLs?
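For illustration, here is a minimal sketch of the kind of thing I am after, using the browser's standard DOMParser instead of innerHTML. My understanding is that a document produced by DOMParser has no browsing context, so its img elements should not be fetched, but I have not confirmed this in every browser. The helper name extractElements and its optional parser argument are made up for this example; they are not part of jQuery or any library.

```javascript
// Sketch only: parse fetched HTML into a separate, inert document so that
// querySelectorAll works without putting the markup into the live page.
// `extractElements` is a hypothetical helper name, not a jQuery or browser API.
function extractElements(html, selector, parser) {
  // DOMParser is a standard browser API; the optional `parser` argument is an
  // assumption of this sketch so the function can be exercised outside a browser.
  parser = parser || new DOMParser();

  // Parse the string as a full HTML document. This document is not attached
  // to the page that is currently displayed.
  var doc = parser.parseFromString(html, "text/html");

  // Copy the resulting NodeList into a plain array for easier handling.
  return Array.prototype.slice.call(doc.querySelectorAll(selector));
}
```

Inside the $.get callback above, extractElements(htmldata, '.css-class-of-interest') would take the place of the createElement and innerHTML lines.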