Web crawling with Python

Travis_Touchdown · Jan 20, 2020

Hi all, do you guys do web crawling using python?

Is it useful and fun?

Are there any websites that we cannot do web crawling due to legal reasons?

davidktw · Jan 20, 2020

Travis_Touchdown said:
Hi all, do you guys do web crawling using python?

Is it useful and fun?

Are there any websites that we cannot do web crawling due to legal reasons?

You can crawl with any language.

Whether it is fun or useful depends on why you want or need it.

In fact, you can crawl with just a combination of tools like curl, wget, grep, awk and sed, without resorting to a general-purpose programming language, as long as you know what you are doing.

*By Right*, the website's T&C will determine the legality. Remember that just because the information is publicly available doesn't mean you can do anything with it. End-users are still legally bound by the EULA and/or T&C. End-users don't own the contents. Practise common sense.

ceecookie · Jan 20, 2020

Beware of doing that with websites hosted by Akamai. They share a common client reputation filter (if enabled by website owner) which will get ur IP blocked by many Akamai websites if u trigger the filter at one website

davidktw · May 3, 2020

gmail0 said:
sorry to hijack, can anyone help with this web crawl? i need to write a script entirely in python, without calling system (eg head, tail) and do the equivalent of "tail and head" after the following lines:

DATA = requests.get('website.com/FILE.txt')
// tail DATA.text | head

The FILE.txt is:

NotNeededData
NotNeededData
NotNeededData
NeededData
NeededData
NotNeededData
NotNeededData

It could be a lot simpler if we simply assume a newline (\n) marks the start of a new line. However, it is very common for a web response to end with a carriage return + newline combination, and in some cases it doesn't. Thus I will safely ignore the last '\n' if it is the last character of the web response.

I didn't provide the "head" solution since I think you should do some homework yourself
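For readers following along, here is a minimal sketch of the "tail" half only. It is not the code from the original reply; the function name `tail` and the sample `body` string are assumptions made for illustration. It relies on `str.splitlines()`, which treats "\n" and "\r\n" alike and does not produce an extra empty entry for a trailing line break.

```python
def tail(text, n=10):
    """Return the last n lines of text as a list, similar to Unix `tail -n`."""
    # splitlines() handles both "\n" and "\r\n" line endings, and a trailing
    # line break at the end of the response does not add an empty last line.
    lines = text.splitlines()
    return lines[-n:] if n > 0 else []


# Hypothetical stand-in for DATA.text from the question above.
body = (
    "NotNeededData\r\nNotNeededData\r\nNotNeededData\r\n"
    "NeededData\r\nNeededData\r\n"
    "NotNeededData\r\nNotNeededData\r\n"
)
last_four = tail(body, 4)
```

Note that the result is a list of lines, so `last_four[2]` is a whole line, whereas indexing the raw string (`body[2]`) gives a single character.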

davidktw · May 4, 2020

gmail0 said:
Thanks davidktw for your help but I got the same result as your script did, which does not work. The text file does have "\r\n" as line break.

result: html[2] returns the 3rd character
wanted: html[2] to return the 3rd line

The text file will be changing per second and so I cannot pre-guess how many lines are there in total. I only want to remove, say the first 5 rows and last 3 rows, and then be loaded into pandas.

For example, using the w3.org text file, I wish to load these two lines into pandas dataframe

21, EXCLAMATION MARK, A1, INVERTED EXCLAMATION MARK
22, QUOTATION MARK, A2, CENT SIGN

What you want is probably this

There are edge-case issues with the solution above. Here is a revised version

davidktw · May 4, 2020

gmail0 said:
Thank you once again but the "consume" is actually unknown, I want to remove the first 5 rows and last 3 rows, both fixed. So your code will not work if the "body length" changed, say from 100 lines changed to 200 lines. For 100 lines, I want to print 100-5-3=92 lines while 200-5-3=192 lines.

Then go and modify until it works for you.

You can always find out the total number of lines in the response first. Using head and tail is a two-pass approach, which is not efficient in the first place, but unless we are dealing with a large data set, a less efficient approach that gives a simpler solution is fine.
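As a rough sketch of both options, assuming the goal is to drop a fixed number of leading and trailing lines from a body of unknown length: `trim_lines` counts the lines first and slices, while `trim_stream` does it in a single pass by holding back the last few lines in a small buffer. Both function names and the default counts (5 and 3, taken from the question) are illustrative assumptions, not code from this thread.

```python
from collections import deque


def trim_lines(text, skip_head=5, skip_tail=3):
    """Simple version: split into lines, count them, slice off both ends."""
    lines = text.splitlines()
    # Clamp at 0 so a short body cannot turn the end index negative.
    end = max(len(lines) - skip_tail, 0)
    return lines[skip_head:end]


def trim_stream(lines, skip_head=5, skip_tail=3):
    """Single pass: yield every line except the first skip_head and last skip_tail."""
    held_back = deque()
    for index, line in enumerate(lines):
        if index < skip_head:
            continue  # still inside the header rows
        held_back.append(line)
        # A line is only safe to emit once skip_tail newer lines follow it.
        if len(held_back) > skip_tail:
            yield held_back.popleft()


# Hypothetical bodies of 100 and 200 lines, as in the question.
body_100 = "\r\n".join(f"row {i}" for i in range(100))
body_200 = "\r\n".join(f"row {i}" for i in range(200))
kept_100 = trim_lines(body_100)
kept_200 = list(trim_stream(body_200.splitlines()))
```

The remaining lines can then be joined back with "\n" and handed to pandas, for example through `io.StringIO`.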

There are many ways to skin a cat, find one that works best for you.

davidktw · May 4, 2020

sooqing said:
hi davidktw, nice to see u again, hope u can rem our email exchange about free software. anyway i am sorry to disturb u in this thread, but what is currently the bleeding edge of an optimized way to do these cutting/adding/sorting a ~500gibi - 1tebi .txt, excluding going for hardware fpga? i am trying to achieve o(lg n)

my servers are running a mixture of bsd, linux and using c/c++ and soon c++20

Perhaps map-reduce approaches would work for you? I don't think going FPGA will be of significant gain for you, since you are not dealing with complex processing per unit. The main dominant factor should be your data size as you deal it out across multiple processing units.

If you have a large number of cores/processors within the same server, then your network will not be a concern as you distribute your information out, unless you are using a shared filesystem holding the same piece of data. Even so, you are still talking about multiple systems drawing the data from the disk.

AWS EMR, Apache Hadoop or similar tools that fan the data out and aggregate it back should work. Even MPI or PVM should work, but I guess these days most people are more interested in the more recent parallel techniques.

I personally don't have a lot of experience with large datasets, so my suggestions may not be the top-notch, proven approaches.

Do your tasks allow for embarrassingly parallel distribution? I ask because I am not aware what kind of adding and cutting operations you are referring to. For sorting, within each node you can use the best sorting algorithm, and when you are combining the results back, use a merge as in merge sort. I have read about sample sort, but didn't really try it out myself, for I have little use for it.
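To illustrate the "sort within each node, then merge when combining" idea on a single machine, here is a small sketch. The chunks stand in for the data each node would hold; `sort_chunks`, `merge_sorted` and the sample data are hypothetical names and values for illustration. It uses the standard library's `heapq.merge`, which lazily merges already-sorted inputs.

```python
import heapq


def sort_chunks(records, chunk_size):
    """Split records into fixed-size chunks and sort each one independently,
    as each node would do with its own share of the data."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield sorted(chunk)
            chunk = []
    if chunk:
        yield sorted(chunk)  # final, possibly smaller, chunk


def merge_sorted(chunks):
    """Combine already-sorted chunks with a k-way merge.
    heapq.merge only keeps one pending item per input in its heap."""
    return heapq.merge(*chunks)


# Illustrative data only.
data = [42, 7, 19, 3, 88, 61, 5, 27, 14, 90, 1]
result = list(merge_sorted(sort_chunks(data, chunk_size=4)))
```

On a real cluster the sorted chunks would be files or network streams rather than in-memory lists, but the merge step has the same shape.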

Without further understanding the details of how you are operating, it might be hard to find out where exactly are the bottlenecks and what solution to offer.

Personally I would encourage you to try out MPI. An open-source implementation is Open MPI. The one I was using in my uni should be MPICH, on a cluster in NUS during my Parallel and Concurrent Programming course. I was using C to parallelise sorting algorithms.

However, you don't need to use C, since bindings are offered for other programming languages such as Perl, Python, Java, etc. It is a very interesting distributed programming methodology where you can write all the nodes' responsibilities in just one program, and it can be designed to work differently depending on which node it is running on.

You have complete control over the distributed topology, using the common fanning-out approach, or a tree distribution technique using recursion, and so forth. Nodes are not necessarily systems; they could be processors, cores, processes or even threads (I assume). Thus you could have a development environment all self-contained within your own system, using VMs, or use them in the cloud. In AWS there is a service that allows you to use MPI, https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_03_batch_mpi.html. I have not used it myself, so it is something for you to explore.

Some recent works and issues worth reading https://jiaweizhuang.github.io/blog/

davidktw · May 15, 2020

sooqing said:
our primary requirement is to make full use of hardware, like using all the ram all the time. we have tried out mapreduce and java based hadoop, but both are very inefficient, utilization less than 7 times to our 25 nodes setup. we tried to accelerate by adding more ram and more storage but the limit is reached. our storage are all hdd (8 per node) instead of ssd as they are running at 90% utilization on xfs, 32gb ram at 100%, xeon cpu at 30%, 10gbps network at 20%

You have too much disk I/O, since neither your network nor your CPU is saturated. If your working data is too large to stay in memory and keeps spilling over to disk, that is what you are experiencing. Without in-depth knowledge of your setup, I won't be able to advise much.

One possible optimisation is setting the file system to a less strictly ordered journal. There are some settings for ext4, but since you are using XFS, the options could be different. Using SSDs will help a lot if your I/O is random instead of sequential. 32GB isn't a lot if your dataset is really large, which is why AWS instances for large memory usage can go up to 768GB in the r5 instance types.

That is the value of the cloud: you get that power when you need it, without keeping it available all the time.

davidktw · May 16, 2020

sooqing said:
we tried to add more ram (current is 32gb x 25), faster hdd 10k 15k, but it doesnt affect speed thus we never tried moving to ssd, most time was spent to improve algo to reduce seek and random io. oh we are using 2 way external merge sort, so read n write never happen at the same time on the drives.

cto has suggested using infini band but it is too ex. we are a startup on shoestring

Why don't you guys try replicating the environment in AWS under a placement group, which can offer you a significant performance gain? A new environment that gives you more dynamic allocation of resources can give you more insight into what is really going on, and you can adjust the knobs to get a better understanding of where the bottlenecks are before you settle on what is to be changed on your own infra.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html

Using CloudWatch with agents installed on the instances, you can get a consolidated set of performance charts of what is going on across the cluster.

Well if you have too much synchronisation, that could be why your cpu is not saturated. That is possible.

davidktw · May 16, 2020

newmont91 said:
sq is running a quantitative fund, dun tink the company will put their software codes on the cloud

I don't know enough about quantitative funds, but I don't see how that affects the choice of infrastructure provider. Perhaps you can enlighten me?

https://aws.amazon.com/solutions/case-studies/aminvest/
https://aws.amazon.com/solutions/case-studies/qrt/
https://www.slideshare.net/mobile/AmazonWebServices/how-aqr-capital-uses-aws-to-research-new-investment-signals

Simply searching for the terms "quantitative fund aws" returns no lack of results, both on cloud usage by investment firms and on such firms recruiting personnel who know AWS.

davidktw · May 17, 2020

gmail0 said:
hi davidktw, like to ask if python can be used to write a drive defragger?

Short answer is "I don't know", because I have never used Python to perform such tasks. My last endeavour with low-level disk access was back when the system ran in real mode, not protected mode, and the OS didn't really care how software was using the disk. That was written in C with some BIOS calls and/or assembly code.

Generally these days, hard disks are no longer as naive. Even if you defragment your hard disk, the most you are getting is contiguous packing of a file. I am not sure if it really brings much performance gain to the system, since the hard disk layout exposed to the OS is no longer the same as the underlying physical layout.

To my knowledge, what you need is low-level access to the disk offered by the OS kernel API. Regardless of the programming language you use, what you need is access to the hardware as abstracted by the OS.

In the case of Windows, here is some information. You can continue your search in this area.

In case you are not aware, you do not need to defragment an SSD, because seeking on an SSD no longer incurs the same magnitude of penalty as on a magnetic spinning hard disk. With wear-levelling firmware in the SSD, defragmenting totally defeats the purpose.
