Help - Web scraping HTML webpage

emo_121 · Oct 2, 2020

I need some help with web scraping. I'm not sure if this is possible.

Some details:
I'm using Open Hardware Monitor, which has its own remote web server.

On my RPi, going to the URL 192.168.1.1:8085 shows the data for the CPU temps, volts, etc.
My RPi is connected locally only.

Currently my code is as follows:

Code:

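import pandas as pd

# The http:// scheme is needed so pandas treats this as a URL to fetch
URL = 'http://192.168.1.1:8085'

# read_html returns a list of DataFrames, one per <table> found in the page
tables = pd.read_html(URL)
print(len(tables))
print(tables)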

It manages to print something, but I am unable to see the whole table.

Any help on how to print the values that I want?
For example, how do I script it to store the CPU Vcore temperature as array = [min, value, max] and print it out?

Any help will be appreciated!

davidktw · Oct 3, 2020

Resolve your syntax errors first, then give us the HTML output of your Open Hardware Monitor webpage.

emo_121 · Oct 3, 2020

Apologies. My code was typed out manually as I couldn't copy and paste it here. I have updated the code in my original post.

The html output is as follows:

Code:

<tr data-bind="attr: { 'id': 'node-' + id(), 'class': parent.id()?'child-of-node-' + parent.id():'' }" id="node-4" class="child-of-node-3 initialized parent expanded">
              <td data-bind="html: '<img src=' + ImageURL() + ' />  ' + Text()" style="padding-left: 58px; cursor: pointer;"><span style="margin-left: -19px; padding-left: 19px" class="expander"></span><img src="images_icon/voltage.png">  Voltages</td>
              <td data-bind="text: Min"></td>
              <td data-bind="text: Value"></td>
              <td data-bind="text: Max"></td>
            </tr>

davidktw · Oct 3, 2020

You can't use this web scraping method to get your data because your webpage is rendered dynamically on the client. You will need a headless browser technique. Consider using Selenium, Puppeteer, Playwright or Cypress for this. A headless browser is like a real browser, but it is not shown on your desktop. It has the CSS, JS and DOM engines needed to render the page properly, so it can render your webpage on the client side and still lets you script it to extract data from the page for your own application.
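
A quick way to check whether a static parser can see the numbers is to feed it the markup and look at the text of each cell. Here is a minimal sketch using only Python's standard library, with the row posted above as input (attributes trimmed; the names `CellCollector` and `static_cells` are just made up for this example). Note that this row is the "Voltages" group header, so treat it as an illustration of the check rather than proof for every row.

```python
from html.parser import HTMLParser

# One row copied from the page source posted above (attributes trimmed for brevity).
ROW = """
<tr id="node-4" class="child-of-node-3 initialized parent expanded">
  <td><img src="images_icon/voltage.png">  Voltages</td>
  <td data-bind="text: Min"></td>
  <td data-bind="text: Value"></td>
  <td data-bind="text: Max"></td>
</tr>
"""


class CellCollector(HTMLParser):
    """Collects the text content of every <td> cell it sees."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


def static_cells(html):
    """Return the stripped text of each <td>, as a static parser sees it."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


print(static_cells(ROW))
```

This prints `['Voltages', '', '', '']`: the label is there, but the Min/Value/Max cells carry no text. If the cells are still empty when you run the same check on the raw response from your own server, the values are being filled in by JavaScript after the page loads, which a static parser never sees.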

Here is a Node.js tutorial for you:
https://medium.com/@andrejsabrickis/scrapping-the-content-of-single-page-application-spa-with-headless-chrome-and-puppeteer-d040025f752b

Actually, you don't really need to scrape the webpage. Client-side dynamic pages normally make a lot of AJAX calls to populate the page, so you can watch the network tab of your browser's developer console and find out which web requests actually retrieve the data. Normally the responses are in JSON format. You can then simply fire those web requests using any tool you like to obtain the JSON response and parse it yourself. This method is lower level than web scraping, but it is still part of the technique, and it has much lower overhead and is more straightforward from a technical standpoint.

If you want, contact me via PM and I will quickly show you what I am talking about in a scheduled Zoom session.

emo_121 · Oct 5, 2020

From what you say:

I need Node.js to do the dynamic scraping, aka the headless browser technique.

I don't understand the AJAX calls paragraph. LOL

It's alright. I will try to read up and learn Node.js and try out the link you sent!! Really helpful for me and I appreciate it...
Though, I have another curious question. Since I am able to extract the data by specifying the IP and port number, can't I just read it directly in code from the IP and port number?
Sorry if I'm unclear or confusing. I'm thinking in an I2C protocol style, where I just identify which data I want and separate it.

davidktw · Oct 5, 2020

It is not Node.js. Node.js is just one of the many possible language bindings for the web driver that drives the browsers.

Take Selenium for example: you can use the Java Selenium WebDriver to drive browsers such as Firefox, Safari, Chrome, IE, etc. There are also bindings for other languages such as Python, Ruby and C#.

You can read up more at https://www.softwaretestingmaterial.com/selenium-webdriver-architecture/

AJAX is short for Asynchronous JavaScript and XML. This technique allows the browser to fire off asynchronous web requests to servers, which behave like background requests that don't refresh the webpage by default. It is the fundamental technique, around for many years, that makes vibrant client-side dynamic webpages possible. Without it, every action on the webpage, like clicking a button or a hyperlink, would need to refresh the whole page. You should go read up more.

When I say you can focus on AJAX to do what you want, I simply mean that you don't actually need to parse the entire webpage just to get the data you need. Web applications are normally designed to make extensive use of AJAX to perform what we call remote procedure calls, or RPC for short. When you click a button, one possible design is to react to the click event and call a JavaScript function. This function creates an AJAX web request and fires it at the server. The server returns some data, often in JSON, XML or HTML, though it could be plain text too. Using this data, the function then updates the webpage, so you get the illusion that the webpage is not refreshing while the HTML table data does get refreshed. This is what we call a rich client application.

I am suggesting that you monitor the web requests made in the browser and identify which of the AJAX calls returns the data set that populates the table. Then, instead of rendering the webpage, simply mimic that AJAX request to the server, which will return you the data. Parse the data and get what you really wanted.
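
To make that concrete, here is a rough Python sketch of the "mimic the request, parse the data" idea. It is built on assumptions that need checking against your own network tab: that the page loads its numbers from a JSON endpoint (guessed here as `/data.json`), and that each node has `Text`, `Min`, `Value`, `Max` and `Children` fields, which is inferred from the data-bind attributes in the HTML posted earlier. The sensor name and the sample values are invented for the demo.

```python
import json
from urllib.request import urlopen

# ASSUMPTION: the sensor tree is served as JSON at this path. Confirm the real
# URL and field names in the browser's network tab before relying on this.
DATA_URL = "http://192.168.1.1:8085/data.json"


def fetch_tree(url=DATA_URL, timeout=5):
    """Fire the same kind of request the page makes and decode the JSON reply."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def find_sensor(node, name):
    """Depth-first search for the first node whose Text equals name."""
    if node.get("Text") == name:
        return node
    for child in node.get("Children", []):
        found = find_sensor(child, name)
        if found is not None:
            return found
    return None


def min_value_max(tree, name):
    """Return [min, value, max] for the named sensor, or None if it is absent."""
    node = find_sensor(tree, name)
    if node is None:
        return None
    return [node.get("Min"), node.get("Value"), node.get("Max")]


# Hypothetical sample shaped like the assumed response; the values are invented.
SAMPLE = {
    "Text": "Sensor",
    "Children": [
        {
            "Text": "Voltages", "Min": "", "Value": "", "Max": "",
            "Children": [
                {"Text": "CPU VCore", "Min": "0.864 V", "Value": "1.200 V",
                 "Max": "1.344 V", "Children": []},
            ],
        },
    ],
}

print(min_value_max(SAMPLE, "CPU VCore"))
```

Against the real server you would call `min_value_max(fetch_tree(), "CPU VCore")` instead of using `SAMPLE`, after adjusting the URL and the sensor name to whatever your network tab shows.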

I don’t understand your “curious” question, please elaborate.

emo_121 · Oct 5, 2020

Yes, I def have a lot to learn. HAHA

Regarding the curious question:
the web server is continuously sending out data to update the web page, correct?

Instead of using a web page on the Pi, can I use a software port data decoder (idk if it exists)? Meaning the data comes in through the Ethernet port, and a script analyses the data and separates it.

I'm thinking of this because of my recent project.
I have a server and 2 robots. The robots continuously send out data on a 2.4 GHz channel.
My server script intercepts the data, separates it and displays it according to whether it's from robot A or robot B.
NOTE: The robot example is standalone. It just uses the 2.4 GHz channel, not the WiFi protocol.

parchiao · Oct 5, 2020

Bookmark to scrape some stuff later.

davidktw · Oct 5, 2020

The web server does not "normally" continuously send out data to update the webpage. While there are such techniques, I doubt you will be using them. Pushing data from a web server to clients is a rather advanced technique, and one needs to specially design such push events.

Most web systems use a REQUEST-RESPONSE communication pattern.

This is a typical web dialog

YELLOW is the HTTP REQUEST headers. No HTTP BODY is sent.
CYAN is the HTTP RESPONSE headers
MAGENTA is the HTTP RESPONSE body
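
The same kind of dialog can be reproduced on one machine. The sketch below (Python standard library only) starts a throwaway HTTP server on localhost, sends it one GET request and returns the three parts of the response: status, headers and body. The handler, the `/data.json` path and the JSON payload are all made up for the demo; this is not what Open Hardware Monitor serves.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server side: answers every GET with a small JSON body."""

    def do_GET(self):
        body = json.dumps({"Text": "CPU VCore", "Value": "1.200 V"}).encode("utf-8")
        self.send_response(200)                               # status line
        self.send_header("Content-Type", "application/json")  # response headers
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                                # response body

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def one_exchange():
    """Run one request/response round trip and return (status, headers, body)."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        # Request: method, path and headers; a GET carries no body.
        conn.request("GET", "/data.json", headers={"Accept": "application/json"})
        response = conn.getresponse()
        status = response.status
        headers = dict(response.getheaders())
        body = response.read().decode("utf-8")
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return status, headers, body


status, headers, body = one_exchange()
print(status)
print(headers["Content-Type"])
print(body)
```

Nothing is sent by the server until the client asks: one request goes out, one response comes back, and the connection is done. A page that appears to update by itself is normally just repeating this exchange in the background.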

If I get what you are asking, monitoring network traffic is not that easy. If you are using HTTPS, you would be doing a deliberate man-in-the-middle (MITM) attack. There are split tunnel techniques, but they are even more involved than what I have described. TCP is only ordered at the endpoints, not in the middle.

You will want to tackle your problem at the application layer, not the network layer. I don't understand why you want to complicate matters. What you are asking for is actually quite simple.

Look at the HTTP request I made. It could very well be an AJAX request.

Here is an example of a dynamic table.
You can run it yourself at https://jsfiddle.net/hLzfs1uc/

If you were to scrape this example, you would need to use a browser to execute the JavaScript and run the function. Otherwise, scraping the HTML will only get you

Code:

<table id="mytable">
  <tr>
    <td>Click button below to get output...</td>
  </tr>
</table>
<button onclick="get_data()">Get Data</button>

What I am suggesting is to look at your browser developer console. Here is what I get when I click on the button.

If you click on the request and look at the response, you will see this

Hence I am suggesting that you hit the AJAX call directly. Here is a curl example.

Code:

curl 'https://cors-anywhere.herokuapp.com/http://dummy.restapiexample.com/api/v1/employees' \
-X 'GET' \
-H 'Pragma: no-cache' \
-H 'Accept: */*' \
-H 'Origin: https://fiddle.jshell.net' \
-H 'Cache-Control: no-cache' \
-H 'Accept-Language: en-sg' \
-H 'Host: cors-anywhere.herokuapp.com' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15' \
-H 'Referer: https://fiddle.jshell.net/' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Connection: keep-alive'

This way, you can skip all the scraping. Just concentrate on the JSON data that populates the table. It could be complex JSON, but it is likely to be structured, unless the webpage designer made it so complicated that it ends up easier to parse the DOM directly.

There are quite a few things I have shared here that you will want to learn.

LastNameTan · Jan 3, 2021

Very good, detailed answer to the OP. However, I would add that there's a chance the request is not a public API that you can call. It may be origin-restricted or require a cookie in the header, in which case you might get an error response from calling the API. If it is public, then there's no issue.

davidktw · Jan 3, 2021

The authentication concern is valid. However, the same technique can carry cookies, and can be extended to multiple curl calls to acquire cookies just like a browser does, so everything still works in a similar fashion, just with more invocations.

Btw, origin restrictions on cookies and AJAX requests are enforced by the browser. curl doesn't care, except that it respects the cookie's domain policy, and that is to avoid sending excessive cookies rather than for security.
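
As a rough illustration of the cookie point, here is a Python sketch that plays both sides on localhost: a made-up server whose `/login` sets a session cookie and whose `/data` refuses requests without it, and a client that first asks without cookies and then logs in with a cookie jar, the way a browser (or curl with a cookie file) would. The paths, cookie name and payload are hypothetical.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import HTTPCookieProcessor, build_opener


class CookieHandler(BaseHTTPRequestHandler):
    """Hypothetical server: /login sets a cookie, everything else requires it."""

    def do_GET(self):
        if self.path == "/login":
            self._reply(200, b"logged in", {"Set-Cookie": "session=abc123; Path=/"})
        elif "session=abc123" in self.headers.get("Cookie", ""):
            self._reply(200, b'{"Value": "1.200 V"}')
        else:
            self._reply(403, b"forbidden")

    def _reply(self, status, body, extra=None):
        self.send_response(status)
        for key, value in (extra or {}).items():
            self.send_header(key, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def demo():
    server = HTTPServer(("127.0.0.1", 0), CookieHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    try:
        # 1) Plain request with no cookie: the server rejects it.
        try:
            build_opener().open(base + "/data", timeout=5)
            first = 200
        except HTTPError as err:
            first = err.code
        # 2) An opener with a cookie jar stores the cookie from /login
        #    and sends it back automatically on the next request.
        opener = build_opener(HTTPCookieProcessor(CookieJar()))
        opener.open(base + "/login", timeout=5).read()
        second = opener.open(base + "/data", timeout=5).read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
    return first, second


first, second = demo()
print(first)
print(second)
```

The first request comes back as 403 and the second returns the JSON, with no browser involved. The only extra work compared with a public endpoint is the additional call that picks up the cookie.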

Trader11 · Jan 3, 2021

No React ah?

davidktw · Jan 3, 2021

Do Angular, React, Vue or any of these frameworks change the HTTP req/res model that I demonstrated in my earlier post? If so, can you elaborate on how it would greatly affect the technique I have described?

Trader11 · Jan 6, 2021

It doesn't. But TS can use React as well to load the data in the page. IMO, it's easier to handle the frontend by dividing it into components. Then TS doesn't need to refresh all the HTML elements, just update the components that need to be refreshed.

davidktw · Jan 6, 2021

Why waste resources running a browser when a simple CLI utility like curl, which runs with minimal resources, can do exactly the same job?

It doesn't even need to care how the page looks. It just needs to care about the structure of the data returned, which these days is just a simple JSON object.

The topic is web scraping, not designing a web app.

SpicyBird · Jan 6, 2021

If you want to use React, you can use axios from npm:

import axios from 'axios';

// Fetch a URL and log the response data, or the error if the request fails
const getUrl = (url) => {
  axios.get(url)
    .then(response => {
      console.log(response.data);
    })
    .catch(err => {
      console.log(err);
    });
};

getUrl("http://some.com");

----------------------------------
Sorry for the poor code indentation, I just typed this roughly here without an IDE.
