How do I crawl web data over HTTPS using Java?

bhtan760

Banned
Joined
Aug 11, 2013
Messages
249
Reaction score
0
Code:
Document doc1 = Jsoup.connect(exportmonitoring).userAgent(USER_AGENT).timeout(0).post();

Hi there,

The website takes about 5 seconds to load everything. How do I set a delay or something to make it load fully first?
 

davidktw

Arch-Supremacy Member
Joined
Apr 15, 2010
Messages
13,547
Reaction score
1,299
Do you mean it takes 5 seconds to get back a Document object? If so, either the network is slow or the web server that "exportmonitoring" refers to is slow.

What delay do you want? Once you get back a Document object, it has been fully retrieved.
 

bhtan760


I mean the output is still cut off somewhere when I print out the document.
 

davidktw


Give me the URL you are referring to. I will retrieve it with jsoup and take a look at what you are talking about.
 

bhtan760


It is internal, reachable by VPN only.
The output is still cut off at a <td....... tag; only 50 out of 100 rows are retrieved.
But the document takes 10 seconds to load in the browser itself.
 

bhtan760

Code:
// Apache HttpClient: request, response, credentials and connection setup
import org.apache.http.HttpEntity;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.conn.ssl.SSLSocketFactory;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.SingleClientConnManager;
import org.apache.http.util.EntityUtils;

// JSSE classes for the trust-all certificate setup
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

import java.security.SecureRandom;
import java.security.cert.X509Certificate;

public class login{
	public static void main(String[] args) throws Exception{

		SSLContext sslContext = SSLContext.getInstance("SSL");

		// set up a TrustManager that trusts everything
		sslContext.init(null, new TrustManager[] { new X509TrustManager() {
		            public X509Certificate[] getAcceptedIssuers() {
		                    System.out.println("getAcceptedIssuers =============");
		                    return null;
		            }

		            public void checkClientTrusted(X509Certificate[] certs,
		                            String authType) {
		                    System.out.println("checkClientTrusted =============");
		            }

		            public void checkServerTrusted(X509Certificate[] certs,
		                            String authType) {
		                    System.out.println("checkServerTrusted =============");
		            }
		} }, new SecureRandom());
		
		SSLSocketFactory sf = new SSLSocketFactory(sslContext);
		Scheme httpsScheme = new Scheme("https", 443, sf);
		SchemeRegistry schemeRegistry = new SchemeRegistry();
		schemeRegistry.register(httpsScheme);

	

		
		CredentialsProvider credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(
                new AuthScope("localhost", 443),
                new UsernamePasswordCredentials(args[0], args[1]));
       
    	// apache HttpClient version >4.2 should use BasicClientConnectionManager
		SingleClientConnManager cm = new SingleClientConnManager(schemeRegistry);
		DefaultHttpClient httpclient = new DefaultHttpClient(cm);
		httpclient.setCredentialsProvider(credsProvider);
        
		try {
            HttpGet httpget = new HttpGet("https://mycompanysites.com");

            System.out.println("executing request" + httpget.getRequestLine());
            CloseableHttpResponse response = httpclient.execute(httpget);
            try {
                HttpEntity entity = response.getEntity();

                System.out.println("----------------------------------------");
                System.out.println(response.getStatusLine());
                if (entity != null) {
                    System.out.println("Response content length: " + entity.getContentLength());
                }
                EntityUtils.consume(entity);
            } finally {
                response.close();
            }
        } finally {
            httpclient.close();
        }
	}



}

I have already imported what is required.
I have set up the authentication, but the response still says that authorization is required.

checkServerTrusted =============
getAcceptedIssuers =============
----------------------------------------
HTTP/1.1 401 Authorization Required
Response content length: 494

I am using DefaultHttpClient and have set the credentials provider.

In the AuthScope I put localhost. What could be wrong?

494 Security Agreement Required
The server has received a request that requires a negotiated security mechanism, and the response contains a list of suitable security mechanisms for the requester to choose between, or a digest authentication challenge.

The reason being, I need to do basic authentication with HttpClient and disable SSL at the same time.
 

davidktw

Well obviously that is wrong. SSL or not, HTTP basic authentication is not related to that.
Code:
CredentialsProvider credsProvider = new BasicCredentialsProvider();
    credsProvider.setCredentials(
            new AuthScope("localhost", 443),
            new UsernamePasswordCredentials(args[0], args[1]));

What are you telling the HttpClient to do? Only pass the credentials over when your target host is localhost at port 443? Is that what you are trying to achieve? Are you trying to authenticate against a web server running at localhost port 443?

Please read the HttpClient Authentication Guide carefully and see what is wrong with your code.

As for your custom SSL socket factory, are you constructing it for any particular reason? As I said, HTTPS is not related to HTTP Basic/Digest authentication.
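
To illustrate the scoping point, here is a rough sketch using the JDK's own java.net.Authenticator instead of HttpClient's AuthScope, so treat it as an analogy and not as the HttpClient API. ScopedAuthenticator, lookup and the admin/password credentials are made-up names for illustration; mycompanysites.com is just the placeholder host from the posted code. The idea is the same: credentials are handed out only when the challenging host and port match the scope.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

// Illustration only: the class name, scope constants and credentials are hypothetical.
final class ScopedAuthenticator extends Authenticator {
    static final String TARGET_HOST = "mycompanysites.com";
    static final int TARGET_PORT = 443;

    private final String user;
    private final char[] password;

    ScopedAuthenticator(String user, char[] password) {
        this.user = user;
        this.password = password.clone();
    }

    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        // Hand out credentials only when the challenge comes from the scoped host and port.
        boolean inScope = TARGET_HOST.equalsIgnoreCase(getRequestingHost())
                && getRequestingPort() == TARGET_PORT;
        return inScope ? new PasswordAuthentication(user, password) : null;
    }

    // Asks the installed default Authenticator for credentials for the given host and port.
    static PasswordAuthentication lookup(String host, int port) {
        return Authenticator.requestPasswordAuthentication(
                host, null, port, "https", "realm", "basic");
    }

    public static void main(String[] args) {
        Authenticator.setDefault(new ScopedAuthenticator("admin", "password".toCharArray()));
        System.out.println("target host gets credentials: " + (lookup(TARGET_HOST, TARGET_PORT) != null));
        System.out.println("localhost gets credentials:   " + (lookup("localhost", 443) != null));
    }
}
```

A scope of localhost:443 would behave the other way round: a challenge from the real target host would find no credentials, and the request would come back as 401.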
 

bhtan760


No SSL, but my HTTPS site has HTTP basic auth as well.

No digest.
 

davidktw


As I said, HTTP authentication is not related to HTTPS. The authentication flow for Digest and Basic is the same; only what you pass over the wire is different. Read the article at the URL I provided in my earlier post. You will see the problem in the code fragment mentioned above.
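
To make "what you pass over the wire" concrete for Basic, here is a minimal sketch that builds the Authorization header value. It assumes Java 8 or later for java.util.Base64; BasicAuthHeader and the admin/password pair are made-up names for illustration. The header value comes out the same whether the request goes over http or https, since TLS only wraps the whole exchange.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustration only: the class name and the credentials are hypothetical.
final class BasicAuthHeader {

    // Builds the Authorization header value for HTTP Basic:
    // the word "Basic", a space, then base64("user:password").
    static String of(String user, String password) {
        String pair = user + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // prints: Basic YWRtaW46cGFzc3dvcmQ=
        System.out.println(of("admin", "password"));
    }
}
```

Base64 is an encoding, not encryption, so anyone who can read the header can recover the password. That is the reason Basic is normally used over HTTPS, even though the two mechanisms are independent.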
 

bhtan760

Code:
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

import org.apache.commons.codec.binary.Base64;

public class ichinga {

	public static void main(String[] args) throws Exception {

		try {

			// Create a trust manager that does not validate certificate chains.
			// This switches off certificate checks entirely, so use it for testing only.
			TrustManager[] trustAllCerts = new TrustManager[] { new X509TrustManager() {
					public X509Certificate[] getAcceptedIssuers() {
						return new X509Certificate[0];
					}
					public void checkClientTrusted(X509Certificate[] certs, String authType) {
					}
					public void checkServerTrusted(X509Certificate[] certs, String authType) {
					}
				}
			};
			// Install the all-trusting trust manager
			SSLContext sc = SSLContext.getInstance("SSL");
			sc.init(null, trustAllCerts, new SecureRandom());
			HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());

			// Create and install an all-trusting host name verifier
			HostnameVerifier allHostsValid = new HostnameVerifier() {
				public boolean verify(String hostname, SSLSession session) {
					return true;
				}
			};
			HttpsURLConnection.setDefaultHostnameVerifier(allHostsValid);

			String webPage = "http://example.com";
			String name = "admin";
			String password = "password";

			// Basic auth sends base64("name:password") in the Authorization header
			String authString = name + ":" + password;
			System.out.println("auth string: " + authString);
			byte[] authEncBytes = Base64.encodeBase64(authString.getBytes(StandardCharsets.UTF_8));
			String authStringEnc = new String(authEncBytes, StandardCharsets.US_ASCII);
			System.out.println("Base64 encoded auth string: " + authStringEnc);

			URL url = new URL(webPage);
			URLConnection urlConnection = url.openConnection();
			urlConnection.setRequestProperty("Authorization", "Basic " + authStringEnc);

			// Read the whole response body; UTF-8 is an assumption about the page encoding
			StringBuilder sb = new StringBuilder();
			try (InputStream is = urlConnection.getInputStream();
					InputStreamReader isr = new InputStreamReader(is, StandardCharsets.UTF_8)) {
				int numCharsRead;
				char[] charArray = new char[1024];
				while ((numCharsRead = isr.read(charArray)) > 0) {
					sb.append(charArray, 0, numCharsRead);
				}
			}
			String result = sb.toString();
			System.out.println("*** BEGIN ***");
			System.out.println(result);
			System.out.println("*** END ***");
		} catch (MalformedURLException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

I was able to combine HTTP basic auth with disabling the SSL certificate checks.
I searched for HTTP basic authentication and managed to find this:
http://www.avajava.com/tutorials/lessons/how-do-i-connect-to-a-url-using-basic-authentication.html
 