What are Sitemaps?#

Google introduced the Sitemaps protocol so web developers can publish lists of links from across their sites. The basic premise is that some sites have a large number of dynamic pages that are only available through the use of forms and user entries. The Sitemap files contains URLs to these pages so that web crawlers can find them. Bing, Google, Yahoo and Ask now jointly support the Sitemaps protocol.

You can see LOC’s sitemap here: https://www.loc.gov/sitemap.xml Notice that the sitemap contains links to other sitemaps.

OK so what can I do with it?#

How about we find out how often the WWI sheet music collection is updated?:

import requests
from xml.etree import ElementTree


def getIndividualUrls(sitemap, count_map):
    print('.', end='')
    r = requests.get(sitemap)    
    if r.status_code != 200:
        print("\nProblem getting url %s, response was %s" % (sitemap, r.status_code))
        return
    
    tree = ElementTree.fromstring(r.content)
    if tree.tag == "{http://www.sitemaps.org/schemas/sitemap/0.9}sitemapindex":
        for element in tree.iter("{http://www.sitemaps.org/schemas/sitemap/0.9}loc"):
            getIndividualUrls(element.text, count_map)
    else:
        for updated in tree.iter("{http://www.sitemaps.org/schemas/sitemap/0.9}changefreq"):
            count_map[updated.text] = count_map.get(updated.text, 0) + 1
            
count = dict()
print("getting the data from the site maps:")
getIndividualUrls("http://www.loc.gov/collections/world-war-i-sheet-music/sitemap.xml", count)
print("")

for key,value in count.items():
    print("%s items are updated %s" % (value, key))
    
getting the data from the site maps
..............
13000 items are updated weekly