How do I crawl web data over HTTPS using Java?

bhtan760

Banned
Joined
Aug 11, 2013
Messages
249
Reaction score
0
How do I crawl web data over HTTPS using Java?

My company's website can only be accessed over an encrypted VPN.

When I compile my code, I come across errors like ...PKIX...keystore...cacert...

I tried out this code from mkyong, and with it I am able to crawl data from sites like Google:

http://www.mkyong.com/java/java-https-client-httpsurlconnection-example/

I am OK with using Python or C as well.
 

davidktw

Arch-Supremacy Member
Joined
Apr 15, 2010
Messages
13,547
Reaction score
1,299

The problem most likely has to do with your VPN configuration. The code on the site you posted above has no issues. I had no trouble compiling and running it as-is on my Mac using java version "1.7.0_25".

Did you try it against google.com, as in the original code? Does that work? When you are on the VPN, is all your traffic routed through the VPN connection, or only the traffic going to your company's site? Also, does your VPN set any proxy settings, or is there a proxy that you have to use? The Java VM does not use the proxy settings from your OS by default; you have to specify them when starting the JVM, in other words when running your Java application.

Update:

Okay, I think I know your problem. Your company is using either a self-signed certificate or a certificate that is not signed by one of the well-known root certificates, which means your company either has its own private CA or uses a very unpopular CA. In this case, refer to the following webpage on how to disable SSL validation: http://www.nakov.com/blog/2009/07/16/disable-certificate-validation-in-java-ssl-connections/

If you would rather not do that, go read up on how to create a truststore and how to apply it to your Java application.
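If a proxy does turn out to be required, a minimal sketch of setting it from inside the program could look like the following. The standard JVM properties `https.proxyHost` and `https.proxyPort` are real; the host name `proxy.example.internal` and port 8080 are placeholders and not values from this thread.

```java
// Sketch only: the proxy host and port below are placeholders.
class ProxySettings {
    /** Sets the standard JVM proxy properties used for HTTPS connections. */
    static void useHttpsProxy(String host, int port) {
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", Integer.toString(port));
    }

    public static void main(String[] args) {
        // Call this before opening any HTTPS connection.
        useHttpsProxy("proxy.example.internal", 8080);
        System.out.println(System.getProperty("https.proxyHost")
                + ":" + System.getProperty("https.proxyPort"));
    }
}
```

The same effect can be had without code changes by starting the JVM with `-Dhttps.proxyHost=... -Dhttps.proxyPort=...`.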
 

bhtan760

google.com definitely works.

What is left is checking with the VPN. Most probably it is the proxy.

 

davidktw


I have just answered your question; read my first post again.

Update:
Get the root CA certificate from your company. Or, if you know for certain that the HTTPS site you are accessing uses a self-signed certificate, meaning the certificate is signed with the very same private key that the web server is running on, then just add the server's certificate that was issued to you to a Java keystore.

Use either the root CA certificate or the server's certificate (for the self-signed scenario). Assume the certificate file is root_ca.crt.
Take note that keytool reads the certificate in DER (binary) format; it also accepts the Base64 printable (PEM) encoding.

Code:
keytool -import -trustcacerts -file root_ca.crt -alias ca -keystore mytruststore.jks

After that you may invoke your java program as such.
Code:
java -Djavax.net.ssl.trustStore=mytruststore.jks HttpsClient
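For cases where changing the JVM command line is awkward, the same truststore can be loaded in code. The following is a minimal sketch, not the mkyong code: the class name `TrustStoreLoader` is made up for illustration, and it assumes `mytruststore.jks` sits in the working directory.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

class TrustStoreLoader {
    /** Reads a JKS truststore; a null password skips the integrity check. */
    static KeyStore loadJks(InputStream in, char[] password)
            throws GeneralSecurityException, IOException {
        KeyStore ks = KeyStore.getInstance("JKS");
        ks.load(in, password);
        return ks;
    }

    /** Builds an SSLContext whose trust decisions come from the given keystore. */
    static SSLContext contextFor(KeyStore trustStore) throws GeneralSecurityException {
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, tmf.getTrustManagers(), null);
        return ctx;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("mytruststore.jks")) {
            SSLContext ctx = contextFor(loadJks(in, null));
            // Every HttpsURLConnection opened after this line uses the custom truststore.
            HttpsURLConnection.setDefaultSSLSocketFactory(ctx.getSocketFactory());
        }
    }
}
```

Unlike disabling validation, this keeps certificate checking on and only widens the set of trusted issuers.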
 

bhtan760

Hey, I have seen this keytool before. Is it easy to use?

 

davidktw


It's not that keytool is difficult to use. You need to understand how certificates are validated when a secure channel is set up.

Get a better understanding of how RSA public-key cryptography is used, how certificates work, how signing works and who signs them, what chain certificates and root certificates are, why a browser will complain that some certificates are invalid, and so forth.

The main tool people normally use to deal with certificates is the OpenSSL toolkit. It is the universal toolkit across the board. Java keytool offers part of what OpenSSL can do with X.509 certificates, importing them into and exporting them out of a Java keystore, but it is far more limited when it comes to signing and so forth.
 

bhtan760

Hi there,

Are .pem files the same as .jks files?
Update: when I try to run with the .pem file:
C:\Users\aoo.360T\Desktop>java -Djavax.net.ssl.trustStore=office-efw01.pem HttpsClient
java.net.SocketException: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: Default, provider: SunJSSE, class: sun.security.ssl.SSLContextImpl$DefaultSSLContext)

You mentioned how to load the keystore via the command line. How do I do so in Eclipse?
 

davidktw


PEM files are the ASCII-armored output of X.509 certificates. They are just plain X.509 certificates and are not to be mistaken for a keystore.

A keystore is a different kind of storage file. A keystore contains one or more public/private keys, certificates, etc.

JKS is the Java keystore format; PKCS#12 is a keystore standard from RSA Laboratories.

I thought I already told you that you can load the truststore and keystore using system properties on the command line? Read more at the IBM i information center.

The default trust manager reads these system properties during initialisation.

If what you are interested in is manipulating a KeyStore, read the Java Platform SE 6 documentation.
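To make the certificate-versus-keystore distinction concrete, here is a small sketch that reads a single X.509 certificate file and places it into an in-memory keystore. It assumes the file `office-efw01.pem` from the post above holds one certificate; the class name `PemToKeyStore` and the alias `ca` are illustrative. Pointing `javax.net.ssl.trustStore` straight at a .pem file probably fails because that file is a certificate and not a keystore.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;

class PemToKeyStore {
    /** Creates an empty in-memory keystore of the JVM's default type. */
    static KeyStore emptyKeyStore() throws GeneralSecurityException, IOException {
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        ks.load(null, null); // null stream: start with no entries
        return ks;
    }

    /** Parses one X.509 certificate (PEM or DER encoded) and stores it under the alias. */
    static void addCertificate(KeyStore ks, String alias, InputStream certStream)
            throws GeneralSecurityException {
        CertificateFactory cf = CertificateFactory.getInstance("X.509");
        Certificate cert = cf.generateCertificate(certStream);
        ks.setCertificateEntry(alias, cert);
    }

    public static void main(String[] args) throws Exception {
        KeyStore ks = emptyKeyStore();
        try (InputStream in = new FileInputStream("office-efw01.pem")) {
            addCertificate(ks, "ca", in);
        }
        System.out.println("Entries in keystore: " + ks.size());
    }
}
```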
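For Eclipse, the usual route is to put the same `-D` flags into the VM arguments box of the run configuration. As an alternative, the properties can be set in code, as in this sketch; the class name `TrustStoreProperties` is made up, and the file name is the one from the earlier keytool example.

```java
class TrustStoreProperties {
    /** Points the default JSSE trust manager at a truststore file of the given type. */
    static void useTrustStore(String path, String type) {
        System.setProperty("javax.net.ssl.trustStore", path);
        System.setProperty("javax.net.ssl.trustStoreType", type);
    }

    public static void main(String[] args) {
        // Must run before the first HTTPS connection is opened, because the
        // default SSL context reads these properties when it is initialised.
        useTrustStore("mytruststore.jks", "JKS");
        System.out.println(System.getProperty("javax.net.ssl.trustStore"));
    }
}
```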
 

bhtan760

My company only has files with the .p12 and .der extensions,

for .pem is of

I cannot find a .crt file.

I will try the .der first and try my luck,

using your -Djavax... method.
 

davidktw


DER is the binary encoding, as opposed to PEM. You can convert between these 2 formats freely. Java will take the DER format. CRT is just a popular EXTENSION for X.509 certificates. It does not indicate whether the certificate is stored in PEM or DER format; you will need to open the file to find out.

P12 is the EXTENSION for a PKCS#12 keystore. It is possible to convert a PKCS#12 keystore to a Java keystore using the KEYTOOL from JDK 6 and above.

Please kindly google for this information.
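Before converting, it can help to see what the .p12 file actually contains. The sketch below opens a PKCS#12 file through the standard `KeyStore` API and lists its entries. The class name `Pkcs12Inspector` is illustrative, and the file path and password are supplied by the caller since the thread does not give them.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class Pkcs12Inspector {
    /** Opens a PKCS#12 keystore; a null stream yields an empty keystore. */
    static KeyStore open(InputStream in, char[] password)
            throws GeneralSecurityException, IOException {
        KeyStore ks = KeyStore.getInstance("PKCS12");
        ks.load(in, password);
        return ks;
    }

    /** Lists each alias and whether it holds a private key or only a certificate. */
    static List<String> describe(KeyStore ks) throws GeneralSecurityException {
        List<String> lines = new ArrayList<String>();
        for (String alias : Collections.list(ks.aliases())) {
            lines.add(alias + ": "
                    + (ks.isKeyEntry(alias) ? "key entry" : "certificate entry"));
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // args[0] = path to the .p12 file, args[1] = its password.
        try (InputStream in = new FileInputStream(args[0])) {
            for (String line : describe(open(in, args[1].toCharArray()))) {
                System.out.println(line);
            }
        }
    }
}
```

JSSE can also be told to read such a file directly by adding `-Djavax.net.ssl.trustStoreType=PKCS12` next to the `trustStore` property.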
 

bhtan760

I am trying your method of disabling SSL validation now.
OK so far, 4 out of 5 websites work,
but one website gives this:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 401 for URL:
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
at Example.main(Example.java:44)

What is a 401 code?

davidktw

Google for "HTTP 401". No offence, but I recommend that you take a more proactive role in learning.
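HTTP 401 means "Unauthorized": the server wants credentials before it will return the page. If the site happens to use HTTP Basic authentication (an assumption; the thread does not say which scheme it uses), the header can be built as in this sketch. It relies on `java.util.Base64`, so it assumes Java 8 or later, and `user` and `pass` are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BasicAuth {
    /** Builds the value of an HTTP Basic "Authorization" header. */
    static String headerValue(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // "user" and "pass" are placeholder credentials.
        System.out.println(headerValue("user", "pass")); // prints "Basic dXNlcjpwYXNz"
    }
}
```

The value would then be attached with `connection.setRequestProperty("Authorization", BasicAuth.headerValue(user, password))` before reading the response.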
 

bhtan760

By the way,

how does the performance of StringBuffer compare to String?
It should be faster, right?

I need to read through the tables. How can I do it?
 

davidktw


String itself is not the slow part, so the question as asked is incorrect. What is slow in Java is repeated concatenation of Strings, because every concatenation creates a new String object, whereas a StringBuffer appends into the same buffer.

I think you need to be more explicit about what you are trying to do and what you propose to do. What do you mean by "read through the tables"? Explain yourself more thoroughly.
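A small sketch of the buffer-based approach, for example when collecting the lines of a downloaded page. It uses `StringBuilder`, the unsynchronized sibling of `StringBuffer`, which is a substitution on my part; the class and method names are illustrative.

```java
class ConcatDemo {
    /** Appends every line into one buffer instead of creating a new String per "+". */
    static String joinLines(String[] lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(joinLines(new String[] {"<html>", "</html>"}));
    }
}
```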
 

bhtan760

I have crawled some data, however the output appears to be cut off, meaning it stops at a certain line of the HTML page.

Could it be because of a problem with the timeout? Do I need to set it longer?

I am using jsoup to traverse, by the way.
 

davidktw


I don't use jsoup, myself, so what timer are you talking about ? You will need to be more specific with errors you encounter and provide the logs if available. I'm not aware of what "cut" you are referring to either.
 

bhtan760


It is fine, but anyway I need to do a DOM traversal.

The page looks like this:

<h2>Treasurer</h2>
<table><tr>
<td>user id</td><td>contact</td>
</tr>
repeat
repeat
</table>

<h1>Treasurer</h1>
<table><tr>
<td>user id</td><td>contact</td>
</tr>
</table>

Then I need to store the type of the users and the customer contact details.
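One possible shape for that traversal, sketched with the JDK's built-in DOM parser so it runs without extra libraries. This only works on well-formed markup; real-world HTML would need a tolerant parser such as jsoup, which is not shown here. The class name `UserTableWalker` and the sample values `u001` and `alice@example.com` are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class UserTableWalker {
    /** Returns one "type | user id | contact" string per table row. */
    static List<String> extract(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        List<String> rows = new ArrayList<String>();
        String currentType = null;
        // Walk the root's children in document order: a heading sets the
        // user type for the table that follows it.
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            Element e = (Element) n;
            String tag = e.getTagName();
            if (tag.equals("h1") || tag.equals("h2")) {
                currentType = e.getTextContent().trim();
            } else if (tag.equals("table")) {
                NodeList trs = e.getElementsByTagName("tr");
                for (int i = 0; i < trs.getLength(); i++) {
                    NodeList tds = ((Element) trs.item(i)).getElementsByTagName("td");
                    if (tds.getLength() >= 2) {
                        rows.add(currentType + " | " + tds.item(0).getTextContent().trim()
                                + " | " + tds.item(1).getTextContent().trim());
                    }
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page following the structure described above.
        String page = "<body><h2>Treasurer</h2><table><tr>"
                + "<td>u001</td><td>alice@example.com</td></tr></table></body>";
        for (String row : extract(page)) {
            System.out.println(row); // prints "Treasurer | u001 | alice@example.com"
        }
    }
}
```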
 