Python and web scraping

u0206397

Senior Member
Why do most people choose Python for web scraping?

In comparison, web scraping with Java, C#, C++ or PHP is practically unheard of, or at least much less common.

Perl has some web scraping tools, but from my casual observation it is still not as popular as Python.
 

davidktw

Arch-Supremacy Member
Java and C/C++ are tedious for web scraping because you have to recompile every time the site changes and the scraper needs adjusting, which makes for a slow turnaround. Scripting languages give a much faster edit-and-run cycle.

In my opinion, Perl is one of the best scraping tools because of its powerful regex support, its highly dynamic nature and the large collection of modules on CPAN. Unfortunately it is not one of the easiest tools to master unless you put in good effort to learn it.

Python is picking up because of its ease of use and its recent momentum in the community, so it gets more exposure. I suppose it is more popular not because of its web scraping capabilities but because of the availability of its analysis libraries. Some scraped data goes into data analysis, such as sentiment analysis, and in that case Python does have an advantage. Why not do both jobs with the same tool?

If you are really into web scraping, I would say node.js is a very good fit, because it can interpret and execute scripts immediately. There are also a couple of really good headless browser tools like PhantomJS and SlimerJS, which load a site dynamically so that you can then extract information from the rendered page.
 

tangent314

Moderator
For web scraping, a dynamically typed language makes handling JSON requests and responses A LOT easier, so that eliminates golang, Java, C, C++ and C#.
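
To illustrate the point about JSON, here is a minimal sketch using only Python's standard `json` module. The payload and its field names (`items`, `title`, `price`) are made up for the example and do not come from any real site.

```python
import json

# Hypothetical JSON body, as a scraper might receive it from a site's API.
raw = '{"items": [{"title": "Widget", "price": 9.5}, {"title": "Gadget", "price": null}]}'

# json.loads returns plain dicts and lists; no response classes are declared up front.
data = json.loads(raw)

# Collect (title, price) pairs, skipping entries whose price is missing or null.
priced = [
    (item["title"], item["price"])
    for item in data["items"]
    if item.get("price") is not None
]
print(priced)
```

In a statically typed language the same step usually means declaring types that mirror the payload, or falling back to a generic map type and casting as you go.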

Error handling in node.js is a complete joke, and so is the callback hell in its async model. PHP was designed for embedding a web app into HTML. Sure, you can try to do scraping with it by bringing in libcurl, but it will still be a painful experience. Perl works, but these days Python does everything Perl can do in a simpler manner and everyone's gravitating towards it.

Scraping is just a joy to do with Python. The requests library is easy to use, there is BeautifulSoup if you need to play with forms, regex is available if you need to scrape HTML instead of JSON, and for advanced stuff where you need to simulate user input there is Selenium. And to run multiple concurrent requests, use gevent with monkey patching, which makes all the async handling transparent without having to go through node.js callback hell.
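
As a rough sketch of the "regex on HTML" case mentioned above, the snippet below pulls links out of a page using only the standard `re` module. The HTML string and the `LINK_RE` name are invented for the example; in practice the text would come from a fetched page, for instance via the requests library.

```python
import re

# Hypothetical HTML snippet standing in for the body of a fetched page.
html = """
<ul>
  <li><a href="/thread/1">Python and web scraping</a></li>
  <li><a href="/thread/2">Perl vs Python</a></li>
</ul>
"""

# Capture the href value and the link text of each simple anchor tag.
LINK_RE = re.compile(r'<a\s+href="([^"]+)"\s*>(.*?)</a>', re.S)

links = LINK_RE.findall(html)
for href, text in links:
    print(href, text)
```

A pattern like this only holds up on markup as regular as the sample. For messier pages an HTML parser such as BeautifulSoup is usually the safer choice.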
 

albertchan659

Member
It seems Python is much preferred because it is relatively easy to set up threads, and it has interesting libraries that make web scraping fast and reliable.
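
A minimal sketch of the threading point, using the standard library's `concurrent.futures`. The `fetch` function is a placeholder that returns a fake body so the example runs offline; `fetch_all` and the example.com URLs are likewise just illustrative names. A real scraper would swap in an actual HTTP call.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for a real HTTP request; returns a fake body so this runs offline.
    return f"<html>body of {url}</html>"

def fetch_all(urls, fetcher=fetch, max_workers=4):
    # pool.map runs fetcher on worker threads and yields results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetcher, urls)))

urls = ["https://example.com/a", "https://example.com/b"]
pages = fetch_all(urls)
print(sorted(pages))
```

Threads suit this kind of work because a scraper spends most of its time waiting on the network rather than computing.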
 

imgroot

Junior Member
Simply because Scrapy is prevalent in the web scraping community. There are not many open source scraping libraries in other languages that are robust and easy to use.
 