
Creating a Web Crawler using Golang

Web crawlers are often used to retrieve content from a website so that we can extract the pieces we need. In this article we will use Golang to build a simple Web Crawler that collects content such as the URLs found on a website's pages.

Project Preparation

Now we will create a new project by creating the learn-golang-web-crawler folder. After that, initialize the project module with this command.

go mod init github.com/santekno/learn-golang-web-crawler
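
The crawler code in this article imports the golang.org/x/net/html package to parse HTML. Since this package is not part of the standard library, you will most likely need to add it to the module first, for example with this command.

go get golang.org/x/net/html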

Creating a Web Crawler using sequential

First we will try the sequential approach, where we simply use a regular loop to crawl the website.

Create a main.go file then fill the file with the code below.

package main

import (
	"fmt"
	"net/http"
	"time"

	"golang.org/x/net/html"
)

// fetched records which URLs have already been crawled so that the
// same page is not visited more than once.
var fetched map[string]bool

// Crawl fetches url, prints it, and recursively follows the links it
// finds until the depth limit is reached.
func Crawl(url string, depth int) {
	if depth < 0 {
		return
	}
	urls, err := findLinks(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s\n", url)
	fetched[url] = true
	for _, u := range urls {
		if !fetched[u] {
			Crawl(u, depth-1)
		}
	}
}

// findLinks performs an HTTP GET on url, parses the response body as
// HTML, and returns every link found on the page.
func findLinks(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		resp.Body.Close()
		return nil, fmt.Errorf("getting %s: %s", url, resp.Status)
	}
	doc, err := html.Parse(resp.Body)
	resp.Body.Close()
	if err != nil {
		return nil, fmt.Errorf("parsing %s as HTML: %v", url, err)
	}
	return visit(nil, doc), nil
}

// visit walks the HTML node tree and appends the href attribute of
// every <a> element to links.
func visit(links []string, n *html.Node) []string {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				links = append(links, a.Val)
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		links = visit(links, c)
	}
	return links
}

func main() {
	fetched = make(map[string]bool)
	now := time.Now()
	Crawl("http://santekno.com", 2)
	fmt.Println("time taken:", time.Since(now))
}

Here is a short explanation of what each of the functions we created is used for.

  • The visit(links []string, n *html.Node) []string function walks the HTML node tree of a single page and collects the href value of every <a> element it finds, returning all of the URLs discovered on that page (see the standalone sketch after this list).
  • The findLinks(url string) ([]string, error) function fetches the given URL, checks that the website responded successfully, parses the HTML body, and hands the document to visit.
  • The Crawl(url string, depth int) function drives the crawl: it records each URL it has already visited in the fetched map so the same page is not crawled repeatedly, and recursively follows newly found links until the depth limit is reached.
  • The main function defines the starting URL to be crawled and is the entry point of this program.
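
To make the behaviour of visit easier to picture, here is a minimal, self-contained sketch. It is separate from main.go, and the HTML snippet in it is made up purely for illustration: it parses a small page held in a string and prints the links that visit collects from it.

package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// visit is the same traversal as in main.go: it appends the href
// attribute of every <a> element it encounters in the node tree.
func visit(links []string, n *html.Node) []string {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				links = append(links, a.Val)
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		links = visit(links, c)
	}
	return links
}

func main() {
	// A tiny example page; the content here is only for illustration.
	page := `<html><body>
		<a href="https://www.santekno.com/">home</a>
		<a href="/tutorial/hardware/">hardware</a>
	</body></html>`

	doc, err := html.Parse(strings.NewReader(page))
	if err != nil {
		panic(err)
	}
	fmt.Println(visit(nil, doc))
	// Output: [https://www.santekno.com/ /tutorial/hardware/]
}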

Do you understand the functions one by one? If so, we will try to run this program directly with the command below.

go run main.go

The program will run and access the URL that we defined in the main function. Don't forget to make sure the internet connection on your computer or laptop is stable so that the process does not take too long.

Once it has finished running, you will see output in the terminal like this.

found: https://www.santekno.com/jenis-jenis-name-server/
found: https://www.santekno.com/tutorial/hardware/
time taken: 3m7.149923291s

We can see that it takes about 3 minutes and 7 seconds to crawl the santekno.com website. That is not terribly long, but the time also depends on the speed of your internet connection.

For a single URL this may be acceptable, but what if we want to crawl 100 URLs or websites? Done sequentially, that would take roughly 100 times as long, around 300 minutes, which is far too slow.

Then how can we make the process even faster to do a Web Crawler? In the next process we will try to change the Crawler process using Concurrent which we have learned before.

Changing the Web Crawler using Concurrent

We will modify the previous Web Crawler by adding some improvements, namely by using channels and goroutines. First, create a struct like this.

// result holds the crawl outcome for a single URL: the URL itself,
// the links found on it, any error, and the remaining depth.
type result struct {
	url   string
	urls  []string
	err   error
	depth int
}

This struct holds the result of crawling a single URL. Next, add a channel at the beginning of the Crawl function.

results := make(chan *result)

With this channel in place the crawler can use goroutines, so modify the Crawl function as shown below.

func Crawl(url string, depth int) {
	results := make(chan *result)

	// fetch crawls a single URL and sends its outcome on the results channel.
	fetch := func(url string, depth int) {
		urls, err := findLinks(url)
		results <- &result{url, urls, err, depth}
	}

	go fetch(url, depth)
	fetched[url] = true

	// fetching counts goroutines that are still in flight; the loop
	// ends once every launched fetch has delivered its result.
	for fetching := 1; fetching > 0; fetching-- {
		res := <-results
		if res.err != nil {
			fmt.Println(res.err)
			continue
		}

		fmt.Printf("found: %s\n", res.url)
		if res.depth > 0 {
			for _, u := range res.urls {
				if !fetched[u] {
					fetching++
					go fetch(u, res.depth-1)
					fetched[u] = true
				}
			}
		}
	}
	close(results)
}

We can see that we created a fetch closure that calls the findLinks function and sends its result into the results channel. Note that fetch is then started in its own goroutine, as mentioned earlier.

Next, look at the loop. Here we receive every result from the results channel. When does this loop finish? The fetching counter tracks how many goroutines are still in flight: it starts at 1, goes up by one for every new fetch we launch, and goes down by one on every iteration, so the loop ends once fetching reaches 0.
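
As an aside, the counting pattern can be seen on its own in the short sketch below. It is not part of the crawler; the sleep simply stands in for findLinks. Each launched goroutine increases pending, each loop iteration decreases it, and the loop stops once the count reaches zero.

package main

import (
	"fmt"
	"time"
)

func main() {
	results := make(chan int)

	// launch starts one goroutine that does some work and reports back.
	launch := func(n int) {
		go func() {
			time.Sleep(10 * time.Millisecond) // stand-in for findLinks
			results <- n
		}()
	}

	launch(0)

	// pending plays the same role as fetching in Crawl: it counts
	// goroutines that have been started but whose result has not yet
	// been received.
	for pending := 1; pending > 0; pending-- {
		n := <-results
		fmt.Println("got result from", n)
		if n < 3 { // pretend each result spawns one more job
			pending++
			launch(n + 1)
		}
	}
}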

Alright, let’s run this last modification with the same command as above.

go run main.go

After it finishes running, you will see how long the crawling process took.

found: https://www.santekno.com/tags/encoder
found: https://www.santekno.com/categories/tutorial/page/2/
time taken: 11.673643875s

The difference is striking: the initial sequential process took about 3 minutes, but after modifying it to run concurrently the same crawl finishes in only about 11 seconds.

Conclusion

We often use a Web Crawler for specific needs, especially when we want to analyze data that already exists on a particular website. So if you want to build a Web Crawler with Golang, pay attention to concurrency and use it where you can: it makes the crawl far more efficient and shorter, especially when you need to crawl more than one website.
