Web Crawlers are often used to retrieve content from a website so that we get exactly the data we need. In this article we will use Golang to build a simple Web Crawler that collects content such as the URLs found on a website's pages.
Project Preparation
Now we will create a new project by creating a learn-golang-web-crawler folder. After that, initialize the project module with this command.
go mod init github.com/santekno/learn-golang-web-crawler
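Because the crawler below uses the golang.org/x/net/html package for HTML parsing, you will most likely also need to add that dependency to the module, for example with the command below (running go mod tidy after writing the code works just as well).
go get golang.org/x/net/html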
Creating a Web Crawler using the sequential approach
First we will use the sequential approach, where a plain loop and recursion crawl the website one page at a time.
Create a main.go file, then fill it with the code below.
package main

import (
	"fmt"
	"net/http"
	"time"

	"golang.org/x/net/html"
)

// fetched records every URL that has already been crawled.
var fetched map[string]bool

// Crawl fetches the page at url, prints it, and recursively
// crawls every link it finds until depth is exhausted.
func Crawl(url string, depth int) {
	if depth < 0 {
		return
	}

	urls, err := findLinks(url)
	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Printf("found: %s\n", url)
	fetched[url] = true

	for _, u := range urls {
		if !fetched[u] {
			Crawl(u, depth-1)
		}
	}
}

// findLinks downloads the page at url, parses it as HTML,
// and returns every link found on it.
func findLinks(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		resp.Body.Close()
		return nil, fmt.Errorf("getting %s: %s", url, resp.Status)
	}

	doc, err := html.Parse(resp.Body)
	resp.Body.Close()
	if err != nil {
		return nil, fmt.Errorf("parsing %s as HTML: %v", url, err)
	}

	return visit(nil, doc), nil
}

// visit walks the HTML node tree and appends the value of every
// href attribute on <a> elements to links.
func visit(links []string, n *html.Node) []string {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				links = append(links, a.Val)
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		links = visit(links, c)
	}
	return links
}

func main() {
	fetched = make(map[string]bool)

	now := time.Now()
	Crawl("http://santekno.com", 2)
	fmt.Println("time taken:", time.Since(now))
}
Here is a brief explanation of what each of these functions does.
- The visit(links []string, n *html.Node) []string function walks the HTML tree of a single page and collects the value of every link it finds; all URLs found on that page are returned as the result (see the sketch after this list).
- The findLinks(url string) ([]string, error) function fetches the URL to be crawled, checks whether the website responded successfully, and parses its HTML page, which is then handed to the visit function.
- The Crawl(url string, depth int) function keeps track of URLs that have already been found so that the same URL does not get crawled repeatedly.
- The main function defines the URL to be crawled and is the entry point of the program.
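To make the role of visit more concrete, here is a minimal, hedged sketch that runs it against a small in-memory HTML snippet instead of a live website. The snippet and the temporary main function are illustrative only and assume the visit function above is available in the same package (for example in a separate scratch program).
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// Illustrative main: parse a tiny HTML document in memory and
// print every href value that visit collects from it.
func main() {
	const page = `<html><body>
<a href="https://www.santekno.com/">home</a>
<a href="/tutorial/hardware/">tutorial</a>
</body></html>`

	doc, err := html.Parse(strings.NewReader(page))
	if err != nil {
		panic(err)
	}

	for _, link := range visit(nil, doc) {
		fmt.Println(link)
	}
}
Running this prints the two href values, which is exactly the kind of slice findLinks hands back to Crawl for a real page.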
Do you understand each function now? If so, let's run the program with the command below.
go run main.go
The program will run and access the URL that we defined in the main function. Make sure the internet connection on your computer or laptop is stable so that the process does not take too long.
Once it has finished running, the terminal output will look like this.
found: https://www.santekno.com/jenis-jenis-name-server/
found: https://www.santekno.com/tutorial/hardware/
time taken: 3m7.149923291s
We can see that it takes about 3 minutes and 7 seconds to crawl the santekno.com website. Not too long, although the exact time also depends on your internet connection.
For a single URL that may be acceptable, but what if we want to crawl 100 URLs or websites? Done sequentially, that would take roughly 100 times as long as the first one, around 300 minutes, which is far too slow.
So how can we make the Web Crawler even faster? In the next section we will rewrite the crawling process to run concurrently, using the concurrency features we learned about before.
Changing the Web Crawler to use a concurrent approach
We will modify the previous Web Crawler with a few improvements, namely by using channels. First, create a struct like this.
type result struct {
	url   string
	urls  []string
	err   error
	depth int
}
This struct stores the result of crawling a single URL. Next, add a channel at the beginning of the Crawl function.
results := make(chan *result)
We add this channel so the crawler can run its fetches in goroutines, and we modify the Crawl function as shown below.
// Crawl now fetches pages concurrently: every page is downloaded
// in its own goroutine and the results are collected over a channel.
func Crawl(url string, depth int) {
	results := make(chan *result)

	// fetch downloads one page and sends the outcome to the channel.
	fetch := func(url string, depth int) {
		urls, err := findLinks(url)
		results <- &result{url, urls, err, depth}
	}

	go fetch(url, depth)
	fetched[url] = true

	// fetching counts the goroutines that are still in flight.
	// The loop ends once every started fetch has reported back.
	for fetching := 1; fetching > 0; fetching-- {
		res := <-results
		if res.err != nil {
			fmt.Println(res.err)
			continue
		}

		fmt.Printf("found: %s\n", res.url)

		if res.depth > 0 {
			for _, u := range res.urls {
				// fetched is read and written only in this goroutine,
				// so the map needs no extra locking.
				if !fetched[u] {
					fetching++
					go fetch(u, res.depth-1)
					fetched[u] = true
				}
			}
		}
	}

	close(results)
}
We can see that we created a fetch closure that calls the findLinks function and sends its result to the results channel. Note that we then run fetch in a goroutine, as mentioned earlier.
Now look at the loop. In this loop we receive every result from the results channel. When does the loop finish? It finishes when the fetching counter drops to 0, that is, when there are no more fetches in flight.
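The counting technique itself has nothing to do with crawling. The hedged sketch below shows the same pattern with plain integers instead of URLs; the names job, out, and inFlight are illustrative only.
package main

import "fmt"

// Standalone sketch of the "fetching counter" pattern:
// start with one in-flight job, decrement on every receive,
// and increment again each time a result spawns new work.
func main() {
	out := make(chan int)

	job := func(n int) {
		out <- n // report back to the coordinating loop
	}

	go job(3)
	for inFlight := 1; inFlight > 0; inFlight-- {
		n := <-out
		fmt.Println("received:", n)
		if n > 0 {
			// Each result spawns one more job until n reaches 0,
			// just as each crawled page can spawn more fetches.
			inFlight++
			go job(n - 1)
		}
	}
	close(out)
}
The loop exits exactly when every started goroutine has sent its value, which is the same guarantee the crawler relies on.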
Alright, let’s run this last modification with the same command as above.
go run main.go
After it finishes running, we can see how long the crawl took this time.
found: https://www.santekno.com/tags/encoder
found: https://www.santekno.com/categories/tutorial/page/2/
time taken: 11.673643875s
It's amazing how much faster this is: the initial process took about 3 minutes, but after modifying it to run concurrently the same crawl finished in only about 11 seconds.
Conclusion
Web Crawlers are often used for specific needs, especially when we want to analyze data that already exists on a particular website. So if you build a Web Crawler in Golang, consider making it concurrent: the process becomes far more efficient and takes much less time, especially when you are crawling more than one website.