How do I crawl web data over HTTPS using Java?

davidktw

Arch-Supremacy Member
Joined
Apr 15, 2010
Messages
13,547
Reaction score
1,300
It is fine, but anyway I need to do a DOM traversal.

It is like this:

<h2>Treasurer</h2>
<table><tr>
<td>user id</td><td>contact</td></tr>
repeat
repeat
</table>

<h1>Treasurer</h1>
<table><tr>
<td>user id</td><td>contact</td></tr>
</table>

Then I need to store the type of each user and the contact details.

That is quite straightforward using jsoup's CSS-like selectors. Read up on it in the jsoup cookbook page "Use selector-syntax to find elements".
What is your concern?
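As a rough illustration of the selector approach, here is a minimal sketch that assumes the page really is laid out as in the quoted post: an h2 heading followed directly by a table whose rows hold a user id cell and a contact cell. The class name RoleTables and the sample values are made up for the example, and jsoup must be on the classpath.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

final class RoleTables {

    /** Maps each h2 heading text to the (user id, contact) rows of the table right after it. */
    static Map<String, List<String[]>> extract(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, List<String[]>> rowsByRole = new LinkedHashMap<>();
        for (Element heading : doc.select("h2")) {
            Element next = heading.nextElementSibling();
            if (next == null || !"table".equals(next.tagName())) {
                continue; // no table directly after this heading
            }
            List<String[]> rows = new ArrayList<>();
            for (Element row : next.select("tr")) {
                Elements cells = row.select("td");
                if (cells.size() >= 2) {
                    rows.add(new String[] { cells.get(0).text(), cells.get(1).text() });
                }
            }
            rowsByRole.put(heading.text(), rows);
        }
        return rowsByRole;
    }

    public static void main(String[] args) {
        // Made-up sample markup and values, shaped like the sketch in the quoted post.
        String html = "<h2>Treasurer</h2><table><tr><td>u01</td><td>c01</td></tr></table>"
                    + "<h2>Trader</h2><table><tr><td>u02</td><td>c02</td></tr></table>";
        for (Map.Entry<String, List<String[]>> e : extract(html).entrySet()) {
            for (String[] row : e.getValue()) {
                System.out.println(e.getKey() + "," + row[0] + "," + row[1]);
            }
        }
    }
}
```

Walking from each heading to its next sibling keeps the heading text attached to its own table, which a flat "h2,table" selection would not do by itself.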
 

bhtan760

Banned
Joined
Aug 11, 2013
Messages
249
Reaction score
0
I am currently not using an IDE, just Notepad and the java -cp <library> <java file> commands.

For the CSS selector, I was thinking of h2; its sibling should be the table.

When I say a cut, I mean the page has about 10k lines, but it reads up to about 3k and then stops responding. It takes about 4 s to load, by the way. I suspect the cause is that the page takes too long to load. I was looking at this: Jsoup.connect(url).userAgent(USER_AGENT).timeout(10*1000).get(); I suspected the timeout was too short, but either way it doesn't work.

So I am using h2,table as the selectors.

Currently I am using the command line and System.out.print for debugging, with Print Screen and Ctrl-C to stop and view the output. I also write to an output CSV file to view the log.

Later, I still need to use an enumerated type to split the users into trader, treasurer, etc.

But the problem is ArrayList<User> userlist = new ArrayList<User>(); how can I split it into multiple users?
 

davidktw

I am currently not using an IDE, just Notepad and the java -cp <library> <java file> commands.

That wouldn't affect anything. However, if you are interested, I can recommend some industry-standard approaches to development.

1) Drop Notepad and use something better suited to development. Either learn to use an IDE or, at least, start with a text editor such as EditPlus, UltraEdit or jEdit. These editors have a lot more to offer, such as syntax colouring, text folding and automatic indentation, and many of them understand different file formats and provide language-aware syntax colouring. If you are up to the challenge, pick up Vim/gVim or Emacs, which are really powerful text editors and more.

2) Learn how to manage software builds using Apache Ant. It is a standard build tool in the Java industry, the equivalent of Unix "make".

The items below are optional, but I personally think that if you are making your way into the IT industry (I suppose you are), these skill sets are indispensable.

3) Learn how to develop in Unix environments, specifically Linux, since it is a freely available operating system that greatly encourages "free" software development. The platform provides tons of development utilities that are readily available in today's Linux distributions at almost no cost to you. Not to mention that it can be installed on a minimally provisioned guest OS and still perform well.

4) Learn how to version-control your code using either Git (recommended) or at least Subversion.

For the CSS selector, I was thinking of h2; its sibling should be the table.

When I say a cut, I mean the page has about 10k lines, but it reads up to about 3k and then stops responding. It takes about 4 s to load, by the way. I suspect the cause is that the page takes too long to load. I was looking at this: Jsoup.connect(url).userAgent(USER_AGENT).timeout(10*1000).get(); I suspected the timeout was too short, but either way it doesn't work.

I can't comment much on how you are selecting your elements from the DOM unless I know which site you are attempting to parse and extract information from. Perhaps you could provide the URL of the site you are testing with.

10 s to read 10k lines is way too long, unless the site you are connecting to is really that slow. How do you know it stops reading?

You might want to split the way you obtain the Document into two steps, so you can find out whether it is the parsing that has the issue or the retrieval of the data over the network that is really too slow, like this:

Code:
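// needs: org.jsoup.Jsoup, org.jsoup.Connection, org.jsoup.nodes.Document
Connection conn = Jsoup.connect(url).userAgent(USER_AGENT);
Connection.Response resp = conn.execute(); // step 1: retrieval only
Document doc = resp.parse();               // step 2: parsing only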
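To make "how do you know it stops reading" measurable, a small helper like the following could summarise the raw response text before any parsing happens. This is only a sketch: the class name BodyReport is made up, and checking for a trailing closing html tag is a rough heuristic that assumes the page is a complete HTML document.

```java
import java.util.Locale;

final class BodyReport {
    final int chars;
    final int lines;
    final boolean endsWithClosingHtml;

    private BodyReport(int chars, int lines, boolean endsWithClosingHtml) {
        this.chars = chars;
        this.lines = lines;
        this.endsWithClosingHtml = endsWithClosingHtml;
    }

    /** Summarises the raw response text so a cut-off download shows up before parsing. */
    static BodyReport of(String body) {
        int lines = 0;
        for (int i = 0; i < body.length(); i++) {
            if (body.charAt(i) == '\n') {
                lines++;
            }
        }
        // Count a last line that has no trailing newline.
        if (!body.isEmpty() && body.charAt(body.length() - 1) != '\n') {
            lines++;
        }
        boolean closed = body.trim().toLowerCase(Locale.ROOT).endsWith("</html>");
        return new BodyReport(body.length(), lines, closed);
    }

    @Override
    public String toString() {
        return chars + " chars, " + lines + " lines, ends with </html>: " + endsWithClosingHtml;
    }
}
```

With jsoup, the string to pass in would be resp.body() from the two-step snippet, printed before resp.parse() is called. If the raw body is already short, the parser is not at fault, and the fetch settings are the next thing to check: the timeout, and, if the jsoup version in use has Connection.maxBodySize, the body size limit described in its Javadoc.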

So I am using h2,table as the selectors.

Currently I am using the command line and System.out.print for debugging, with Print Screen and Ctrl-C to stop and view the output. I also write to an output CSV file to view the log.

Later, I still need to use an enumerated type to split the users into trader, treasurer, etc.

But the problem is ArrayList<User> userlist = new ArrayList<User>(); how can I split it into multiple users?

I don't understand what you mean by users. Could you be more specific in your enquiry?
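If the question is how to keep users of several types in one ArrayList and still separate them by type later, one possible reading is sketched below. The Role enum, the fields of User and the grouping method are all assumptions made for illustration; none of them come from the poster's code.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class UserGrouping {

    /** Hypothetical user types, taken from the two names mentioned in the question. */
    enum Role { TRADER, TREASURER }

    /** Hypothetical record of one table row plus the heading it was found under. */
    static final class User {
        final String userId;
        final String contact;
        final Role role;

        User(String userId, String contact, Role role) {
            this.userId = userId;
            this.contact = contact;
            this.role = role;
        }
    }

    /** Splits one flat list into one list per role. */
    static Map<Role, List<User>> groupByRole(List<User> users) {
        Map<Role, List<User>> byRole = new EnumMap<>(Role.class);
        for (User user : users) {
            byRole.computeIfAbsent(user.role, r -> new ArrayList<>()).add(user);
        }
        return byRole;
    }
}
```

Storing the role on each User means a single ArrayList<User> is enough while scraping, and the split into traders, treasurers and so on can happen afterwards.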
 

bhtan760

1) Notepad++ is OK for me.

2) OK, I flipped through my lecture slides. Apparently it is a test environment like JUnit. The .zip package of Ant can be downloaded from http://ant.apache.org; I just need to figure it out.

3) I have used Unix before. I learnt what Ubuntu is, and I downloaded games using that sudo sget command or whatever it was, I cannot remember.

4) Yes, I am making copies (v1, v2, v3) on my desktop. I used Git before and found it seriously very CONFUSING, because a team member forced me to use it even though I don't it.


I mean the code,

<table>
<tr>
<td>
100 lines into it
<td...cut here
supposed to have 1000 lines
 

davidktw

1) Notepad++ is OK for me.

2) OK, I flipped through my lecture slides. Apparently it is a test environment like JUnit. The .zip package of Ant can be downloaded from http://ant.apache.org; I just need to figure it out.

Ant is a build tool, not a testing tool. It can integrate with JUnit to run test cases, but it is not a testing tool.

3) I have used Unix before. I learnt what Ubuntu is, and I downloaded games using that sudo sget command or whatever it was, I cannot remember.

Then use it more, and read up more on how to be productive in a Linux environment. I can assure you that you will be rewarded more than by operating only in a Microsoft Windows environment.

4) Yes, I am making copies (v1, v2, v3) on my desktop. I used Git before and found it seriously very CONFUSING, because a team member forced me to use it even though I don't it.

Yes, versioning by keeping multiple copies of the code is possible. However, apart from the duplicated space each copy requires, it only works for one party, which is yourself.

In a collaborative project, you need to know who changed which part of the code, when it was changed and why it was changed. You need to be able to quickly revert part of the code or a single file to a previous revision without reverting the whole project. And when multiple people change the same part of the code, the event needs to be captured and the parties involved made aware of the conflict.

You can't do all of this without a proper version control system. Git was born out of the need for distributed collaboration among developers from all over the world. It also solves some of the arcane issues of Subversion.

One of the more noticeable features of Git is that it makes branching inexpensive and easy to use, which allows development to branch more often for better code management. This part probably won't be of much use to you until you are working on production environments and ongoing enhancements to projects.

I mean the code,

<table>
<tr>
<td>
100 lines into it
<td...cut here
supposed to have 1000 lines

Like I said, you need to ascertain that the "get" process is complete before you can determine whether the parsing is faulty and producing the "cut" effect you are getting. Can you be absolutely sure that the cut is not caused by a parsing error?
 

bhtan760

Need help.

..../showOnlineUsers/html/1380161056314 the URL keeps changing.

The parsing is not faulty anymore.

I have already passed it to my boss to test.

 

bhtan760

..../showOnlineUsers/html/1380161056314

The number 1380161056314, the URL parameter, keeps changing every time I log in.
 

davidktw

..../showOnlineUsers/html/1380161056314

The number 1380161056314, the URL parameter, keeps changing every time I log in.

Well, if I guess right, this is a timestamp of some sort that helps to prevent caching.
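One way to test the timestamp guess is to read the number as milliseconds since the Unix epoch, as in the sketch below. The class name CacheBuster and the withFreshTimestamp helper are hypothetical, and whether the server accepts any current timestamp in that position is an assumption that would need to be checked against the real site.

```java
import java.time.Instant;

final class CacheBuster {

    /** Interprets the trailing number of the URL as milliseconds since the Unix epoch. */
    static Instant decode(String lastPathSegment) {
        return Instant.ofEpochMilli(Long.parseLong(lastPathSegment));
    }

    /** Appends the current time in milliseconds to a base URL ending in "/" (assumed format). */
    static String withFreshTimestamp(String baseUrl) {
        return baseUrl + System.currentTimeMillis();
    }

    public static void main(String[] args) {
        System.out.println(decode("1380161056314")); // prints 2013-09-26T02:04:16.314Z
    }
}
```

The value from the post decodes to a moment on 26 September 2013 (UTC), which is at least consistent with a time-based, cache-defeating parameter rather than a login token.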
 

bhtan760

Hi there,

On this, I am trying to POST data, but the request has a challengetoken parameter. How do I handle that?

And what if the login dialog box is a pop-up JavaScript window that prompts for login?
 

bhtan760

I mean Chrome pops up a dialog box that asks me for authentication.
 

davidktw

I mean Chrome pops up a dialog box that asks me for authentication.

If the browser is popping up a dialog box, then it's likely HTTP Basic Authentication. Do you have the credentials? If so, just inject them into the request headers on the first visit. Go read up on how Basic Authentication is passed from the browser to the server.
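For reference, the Basic scheme sends an Authorization header whose value is the word Basic followed by the Base64 encoding of "user:password". A minimal sketch of building that value with the JDK (Java 8 or later) follows; the class name BasicAuth is made up, and the credentials in the usage note are the well-known example pair from the HTTP authentication RFCs, not real ones.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class BasicAuth {

    /** Builds the value of the Authorization request header for HTTP Basic Authentication. */
    static String headerValue(String user, String password) {
        String pair = user + ":" + password;
        String encoded = Base64.getEncoder().encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }
}
```

For example, headerValue("Aladdin", "open sesame") yields "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==". With jsoup, that string could be passed via Connection.header("Authorization", value). Base64 is an encoding, not encryption, so this should only be sent over HTTPS.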
 

davidktw

You mean it is possible to set a request header in Java for HTTP 1.0 or 1.1?

Why not? It's a necessity for an HTTP-compliant client. How can something be an HTTP client if it cannot control the request headers? You will want to look at the HttpURLConnection.setRequestProperty method.

Alternatively, you may want to use a more well-established HTTP client library from Apache: Apache HttpComponents HttpClient.
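A minimal sketch of controlling request headers with the JDK's own HttpURLConnection is shown here. The class name RequestHeaders, the header values and the 10-second timeouts are illustrative assumptions only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class RequestHeaders {

    /** Prepares, but does not send, a GET request with custom request headers. */
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setRequestProperty("Accept", "text/html");
            conn.setConnectTimeout(10_000); // milliseconds
            conn.setReadTimeout(10_000);    // milliseconds
            return conn; // nothing goes over the network until the response is requested
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The request is only sent once something like getResponseCode() or getInputStream() is called on the returned connection, so every header has to be set before that point. An Authorization header is added the same way, with setRequestProperty("Authorization", value).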
 

bhtan760


davidktw

So, for example, jsoup-1.7.2.jar is used for source code, right?

Most distributed jar files are binary-only packages. They normally do not contain source code, though that is not an absolute rule.

Where you place your jar file is not fixed either; it's a convention. There are a few different conventions for organizing project directories.

"SRC" for source code, "LIB" for libraries, "CONF" for configuration, "BIN" for executable binaries. It's up to you to configure your build tool to look for each type of file in its own directory.
 