Thursday, August 27, 2015

Scraping With Scrapy!

Web Crawling Part 1 - Scraping with Scrapy!

We know the internet is a goldmine for useless information - a mishmash of human knowledge. Information that speaks so loudly that sometimes it’s hard to hear what really matters. So how can we filter out the noise, and get at the important stuff?

Web scraping allows us to access the gems of data embedded within a web page. And much like Perl was the original Swiss Army Knife for the web, it seems Python has stepped in and become the modern programmer's MacGyver kit, with a tool/framework/library that fits almost every situation. Web crawling and scraping are no different. Enter Scrapy, an amazing library for quickly developing, testing, and deploying your web scrapers.

So what is Scrapy again? Scrapy is both a web crawler and a web scraper. What does that mean? First, it means that Scrapy can navigate a site's structure by following links to different pages within or outside of the site's domain. Second, as Scrapy navigates these webpages, it can peel away the layers of structural information on a webpage (i.e. HTML) to access only the specific content that you want. This makes it very useful for extracting globs of raw data from the web.

What we will be scraping: 2014 NFL Season Play-By-Play Data

What you need to proceed:

*May require running conda install scrapy from the command line if not available after install

Overview:

Here, in Part 1 we’ll focus on leveraging the Scrapy framework to quickly parse all of the play-by-play data for the 2014 NFL season. In Part 2 we’ll build off our knowledge here to implement a full-fledged web crawler to parse multiple seasons of NFL game data.

A dataflow overview

 Brief Overview

  • The spider schedules requests for download (1)
  • The engine manages these requests and then executes them concurrently (2,3)
  • When the response is available (4) the engine sends it back to the spider (5,6)
  • The spider then scrapes the response (7) and either returns more requests to the scheduler (1), or returns scraped items from the webpage (8a,8b).

First Steps…

To begin implementing the spider we need to set up our environment so that Scrapy knows where to find the items necessary to execute our spider. To do this we use the command scrapy startproject nfl_pbp_data to automatically set up the directories and files necessary for our spider. So find a directory you would like the project to reside in, and execute the above from your command line. You can skip the next section if you're already familiar with XPath and web scraping.
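For reference, the startproject command generates a project skeleton roughly like the following (a sketch of the standard Scrapy layout; minor details may vary between Scrapy versions):

nfl_pbp_data/
    scrapy.cfg             # deploy/configuration file
    nfl_pbp_data/          # the project's Python module
        __init__.py
        items.py           # item definitions (optional - we'll use plain dicts)
        pipelines.py       # item pipelines (covered in Part 2)
        settings.py        # project settings (e.g. DOWNLOAD_DELAY)
        spiders/           # this is where our spider code will live
            __init__.py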

Recon

But wait… Before we continue we need to get an idea of where our data is located within the webpage. This requires a bit of recon on the page to determine its layout and how best to access the information. The best tools for this nowadays come with every major browser (Firefox, Chrome, Safari, IE, Opera). They are normally hidden away in some developer tab/option that may not be readily apparent until you search for it. If in doubt, check here.

Now, for this walk-through we're going to scrape the 2014 NFL season play-by-play data. This will be a good example for getting a basic understanding of the scraping process, and might even prove useful to anyone who likes playing with the numbers. This data is available for free use and can be obtained from a number of sources. We'll grab our data from pro-football-reference.com, which for the past few years has had access to the official play-by-play data feeds pushed directly from NFL.com.

So now fire up your favorite browser and developer tool, and head on over to the 2014 games page at http://www.pro-football-reference.com/years/2014/games.htm. The first thing to notice here is that the data is set up in a very nice, tabular format. It's also not polluted with ads or other extraneous information. This will make our job a whole lot easier.

To access the particular information we want, the Scrapy framework allows you to query the page using XPath or CSS selectors. Here we'll be using XPath selectors, but you can easily substitute the CSS selector equivalents for our queries. I won't be going into much detail about XPath, but I will try to explain the XPath queries we choose and why we've chosen them. For an excellent overview and reference for XPath, take a look at w3schools and MSDN.

Since we'll be extracting the game data, we actually need to click one of the boxscore links. This will take you to a game summary page that has a wealth of information, including the exact information we're looking for - the play-by-play data. If you inspect the tag containing the play-by-play data you'll see that it resides under <table id="pbp_data".... And this is where we'll begin constructing our XPaths.

Scrape Before You Crawl - Scrapy Shell

So one of the most useful tools in the Scrapy toolbox is the Scrapy shell. It pulls a web response into an IPython shell environment, which acts as a sort of playground for us to toss around the HTML until we've figured out how to extract our data.

 What’s IPython?

Python actually comes with its own shell, but IPython adds a bevy of features that essentially turn the shell into a programmer's playground. If you find it really useful, you may also want to keep an eye on Jupyter, the platform-agnostic version of IPython that will be compatible with many other languages (C#, C++, Javascript, Perl, Haskell, R, Ruby, etc…).

You can fire up the Scrapy shell by traveling to the top-level directory of your project and running scrapy shell "http://www.pro-football-reference.com/boxscores/201409040sea.htm", which should bring up an IPython terminal. If not, refer to the IPython installation link above for some troubleshooting guidance.

The data that we want to access is located under <table id="pbp_data"> so we’ll start by using that as the start of our XPath.
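A minimal shell session along those lines (output abbreviated; the exact selector data will differ) might look like:

>>> response
<200 http://www.pro-football-reference.com/boxscores/201409040sea.htm>
>>> response.xpath('//table[@id="pbp_data"]')
[<Selector xpath='//table[@id="pbp_data"]' data='<table id="pbp_data" ...'>]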

In the first line you can see that the object response holds the response information captured by the Scrapy shell from our web address. And then in the second line we can see that the XPath returns an object with the results of our query. These objects are called selectors and are one of the primary weapons for carving out your data. From this selector we can now build out a more complex XPath that finds all table rows of the game data and then filters out any header rows (which conveniently are labeled under the class thead).
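That fuller query might look something like this in the shell (stored in a variable so we can keep poking at it; the output below is just a sketch):

>>> rows = response.xpath('//table[@id="pbp_data"]/tbody/tr[not(contains(@class,"thead"))]')
>>> rows
[<Selector xpath='//table[@id="pbp_data"]/tbody/tr...' data='<tr>...'>,
 <Selector xpath='//table[@id="pbp_data"]/tbody/tr...' data='<tr>...'>,
 ...]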

When you run this you may notice that Scrapy returns a list of objects from its query. This is because the XPath we've used extracts each table row and then returns the results as a list. So we can access each item just like a normal Python list.

So you might be wondering: okay, I've queried my data, but how do I actually extract the raw data? That's simple: you use the extract method. Now, before you get carried away attaching extract to the end of all of your selector queries, realize that extract always returns a list, whether it grabs one item or a hundred. So if you plan on actually storing this data you will need to pull it out of the list object.
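For example, grabbing the quarter column (td[1]) from the first row might look like this (the '1' is a hypothetical value; indexing into the list pulls out the raw string):

>>> rows[0].xpath('td[1]//text()').extract()
['1']
>>> rows[0].xpath('td[1]//text()').extract()[0]
'1'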

It becomes slightly more complex if extract returns multiple items in the list. Consider this:
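The detail column (td[6]) is a good example, since the play description is usually broken up by player links into several text nodes (the play text below is purely illustrative):

>>> rows[0].xpath('td[6]//text()').extract()
['(15:00) ', 'R.Wilson', ' pass short left to ', 'P.Harvin', ' for 7 yards']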

XPath Note: The queries above use the // to indicate a recursive descent through the html hierarchy. This allows the query to search the entirety of the HTML document, or in the case of extracting a specific column, we can search through the entire <td> tag.

Here the data we're extracting is not a single contiguous text node, so our selector collects all of the text nodes it finds and returns them in a list. We can stitch the data back together using Python's nifty join method. Note: this works well in our example, but may not be the best representation for other data sets. Figure out what works best for you.
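Continuing the illustrative session from above, the join looks like:

>>> ''.join(rows[0].xpath('td[6]//text()').extract())
'(15:00) R.Wilson pass short left to P.Harvin for 7 yards'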

Let Your Spider Do The Crawling

By now, hopefully you feel comfortable scraping web data using selectors and the Scrapy shell. From here we’ll turn our attention back to implementing a functional web crawler using the rest of the Scrapy framework. First, go to your spiders directory (from the top level project directory it will be under nfl_pbp_data/spiders) and create a new python file called NFLStatsSpider.py. This will hold the guts of our spider, and is where all of the spiders you want Scrapy to use should reside. Now insert the code below as our basic template.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GameDataSpider(CrawlSpider):
  name = "nfl_pbp_data"
  start_urls = ["http://www.pro-football-reference.com/years/2014/games.htm"]

  def parse_game_data(self, response):
    pass

 Explanation

  • CrawlSpider: one of the generic spider classes Scrapy provides that adds some additional functionality to make crawling and scraping even easier. Here, we use CrawlSpider as our base class.
  • name: the name of our spider. This is how Scrapy references our spider. You don’t have to tell it the filename, or update some huge configuration file, just set the name attribute, and anytime you need to reference your spider, you can call it by that name.
  • start_urls: the initial urls fed to our spider that begin the crawling process.

One of the benefits of using the CrawlSpider class is that it allows you to set rules for easily extracting links for your spider to scrape. The rules have several options, but the most important ones are the LinkExtractor object and the callback argument.


class GameDataSpider(CrawlSpider):
  name = "nfl_pbp_data"
  start_urls = ["http://www.pro-football-reference.com/years/2014/games.htm"]

  # the comma following the first item is REQUIRED for single rules
  rules = (
    Rule(LinkExtractor(allow=('boxscores/\w+\.htm')), callback='parse_game_data'),
  )
 

Here we use the LinkExtractor to specify which links we want to follow using a regular expression. Again, we can use the scrapy shell as a scratchpad to figure out what works. Then, we use the callback argument to specify what method to use for scraping each of the extracted links.

Note: Remember that we don't need links on the game data page, we need the links to the game data page, so to run the code above successfully, you can run fetch('http://www.pro-football-reference.com/years/2014/games.htm') to update the response object with the HTML from the boxscores page.
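With the games page fetched, you can also sanity-check the rule's regular expression right in the shell (a sketch; extract_links returns Link objects, and the URL shown is just the boxscore page we used earlier):

>>> fetch('http://www.pro-football-reference.com/years/2014/games.htm')
>>> from scrapy.linkextractors import LinkExtractor
>>> links = LinkExtractor(allow=('boxscores/\w+\.htm')).extract_links(response)
>>> links[0].url
'http://www.pro-football-reference.com/boxscores/201409040sea.htm'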

Scalpels and Sledgehammers

In Scrapy the parser's job is to extract data from a response and return either data items or more requests. In this example we're using the LinkExtractor to gather the requests we need to follow, so we only need to return data items. This is actually much easier than it sounds since we've already done the hard part of determining the correct XPaths to use to extract our information. The only addition at this point is to take the extracted data and compile it into an object to be returned to the spider. For this Scrapy uses Item objects, whose function is almost identical to Python dictionaries. The item objects are essentially a dictionary wrapper that informs Scrapy that there are scraped items to process. Because it is a dictionary we can describe our data as we extract it.
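If you would rather declare your fields up front, the equivalent Item definition is only a few lines (a sketch using scrapy.Item and scrapy.Field; the PlayItem name is just for illustration, and in this post we'll stick with plain dicts):

import scrapy

class PlayItem(scrapy.Item):
    # one Field per column we scrape, plus our own bookkeeping keys
    game_id = scrapy.Field()
    date = scrapy.Field()
    play_id = scrapy.Field()
    quarter = scrapy.Field()
    time = scrapy.Field()
    down = scrapy.Field()
    togo = scrapy.Field()
    location = scrapy.Field()
    detail = scrapy.Field()
    away_points = scrapy.Field()
    home_points = scrapy.Field()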


  def parse_game_data(self, response):

    for row in response.xpath('//table[@id="pbp_data"]/tbody/tr[not(contains(@class,"thead"))]'):
      play = {}
      play['game_id'] = [game_id]
      play['date'] = [date]
      play['play_id'] = [play_id]
      play['quarter'] = row.xpath('td[1]//text()').extract()
      play['time'] = row.xpath('td[2]//text()').extract()
      play['down'] = row.xpath('td[3]//text()').extract()
      play['togo'] = row.xpath('td[4]//text()').extract()
      play['location'] = row.xpath('td[5]//text()').extract()
      play['detail'] = row.xpath('td[6]//text()').extract()
      play['away_points'] = row.xpath('td[7]//text()').extract()
      play['home_points'] = row.xpath('td[8]//text()').extract()
      play_id += 1

    
      # yield the item so that the scrapy engine can continue 
      # processing web requests concurrently
      yield play

Here we are able to extract all of the game data and assign it to our item (a plain Python dict here) using whatever keys we choose, rather than having to define this information ahead of time. This is an added convenience in the newer versions of Scrapy that allows you to begin creating your scraper without predefining your data.

In the above code we also leverage the power of XPath by first querying for all rows of game data, and then again upon iteration we query each specific row to extract exactly the information we want. This is an approach I call Scalpels and Sledgehammers, and is one of the powerful features of the Scrapy selectors. We can use the general selectors to create queries that smash up the data into large chunks, and then fine-tune them to make precision cuts that extract the real gems.

A Crawling Spree

And finally we get to the fruition of our hard work: executing the spider and letting it do its thing. The code below is the final version of our spider.


import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GameDataSpider(CrawlSpider):
  name = "nfl_pbp_data"
  start_urls = ["http://www.pro-football-reference.com/years/2014/games.htm"]

  # the comma following the first item is REQUIRED for single rules
  rules = (
    Rule(LinkExtractor(allow=('boxscores/\w+\.htm')), callback='parse_game_data'),
  )

  def parse_game_data(self, response):
    # uses the filename as the game id that takes the form yyyymmdd0hometeam 
    # (i.e. 201409040sea) - purely a convenience choice
    game_id = response.url[response.url.rfind('/')+1:-4]
    date = game_id[:4] + '-' + game_id[4:6] + '-' + game_id[6:8]

    play_id = 1
    for row in response.xpath('//table[@id="pbp_data"]/tbody/tr[not(contains(@class,"thead"))]'):
      play = {}
      play['game_id'] = [game_id]
      play['date'] = [date]
      play['play_id'] = [play_id]
      play['quarter'] = row.xpath('td[1]//text()').extract()
      play['time'] = row.xpath('td[2]//text()').extract()
      play['down'] = row.xpath('td[3]//text()').extract()
      play['togo'] = row.xpath('td[4]//text()').extract()
      play['location'] = row.xpath('td[5]//text()').extract()
      play['detail'] = row.xpath('td[6]//text()').extract()
      play['away_points'] = row.xpath('td[7]//text()').extract()
      play['home_points'] = row.xpath('td[8]//text()').extract()
      play_id += 1

      # sanitize extracted data
      for key in play.keys():
        
        # no empty keys
        if not play[key]:
          play[key] = ''

        # join lists with multiple elements 
        elif len(play[key]) > 1:
          play[key] = ''.join(play[key])

        else:
          play[key] = play[key][0]
    
      # yield the item so that the scrapy engine can continue 
      # processing web requests concurrently
      yield play

The spider can now be executed by moving to the top directory of your project and executing scrapy crawl nfl_pbp_data -s DOWNLOAD_DELAY=5 -o game_data.csv.

 Options meaning

  • DOWNLOAD_DELAY: Controls the time between requests. This is a VERY important part of crawling websites. Please be kind. See the note below and the settings sketch that follows.
  • -o (output): The file to write the scraped data to; the format is inferred from the file extension. Scrapy supports the following popular formats: JSON, XML, and CSV. In Part 2, we'll see how an item pipeline, which post-processes each item, allows us to store the data in any format (e.g. a database). See Feed Exports in the Scrapy documentation for more information.
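If you'd rather not pass these settings on the command line every run, the same knobs can live in the project's settings.py (a minimal sketch; DOWNLOAD_DELAY and ROBOTSTXT_OBEY are standard Scrapy settings):

# nfl_pbp_data/settings.py
BOT_NAME = 'nfl_pbp_data'

# Be polite: wait a few seconds between requests instead of hammering the server
DOWNLOAD_DELAY = 5

# Respect the target site's robots.txt rules while crawling
ROBOTSTXT_OBEY = True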

Legalese and Scraping Etiquette

If you look at the crawler execution command above, we've assigned a value to the DOWNLOAD_DELAY setting. This is important scraping etiquette, as many sites mistake web crawlers for Denial-of-Service (DoS) attacks, since a crawler can bombard a server with requests in a manner very similar to a DoS attack. In fact, many web sites try to thwart DoS attacks (and web crawlers in some cases) by blocking users with abnormal browsing activity. You should always consider this when building your crawler, and use reasonable delays between your requests.

It is important to remember that scraping data from websites usually bypasses the content author's intended use of the information. Most websites expect users to view their information at a reasonable pace, and pro-football-reference.com has even taken considerable steps to let users make personal use of the data offline. And just because you can doesn't mean you should. It's always important to read the website's terms of use so that you understand the conditions placed on the data you're extracting. And if you're doing serious crawling you should also take a look at the site's robots.txt file so that you understand its rules for web crawlers. A robots.txt file is used by most large websites and is honored by search engines and other large-scale crawling services.

Wrap-up

The point of this post was to provide a more comprehensive walk-through of creating a basic web crawler in Scrapy. So many times the basic tutorials on the web lack the complexity to let you do any real work, so I've tried to provide this example as a resource for those looking for a beginner-to-intermediate overview of building a scraper that can extract a large data set spread across multiple pages. In Part 2 we'll build off of this example and cover more advanced features of the Scrapy framework such as anonymous scraping, custom export feeds, sharing data across requests, and multiple item scraping. My source for this example is provided below along with the 2014 play-by-play game data in CSV, JSON, and SQLite.

References

Thursday, August 20, 2015

Operator Overloading in C#

Introduction

What happens when a natural disaster causes a backlog of 911 calls? Operator overload! Hahaha, I slay me. OK, so maybe there's a different form of operator overloading I'd like to discuss today, and it's the kind where you can overload the operators of the C# language. Want to redefine what == means for your class? Have at it. Want to add together two widgets and get a resulting mutant widget of doom? Go nuts I say! Operator overloading lets you redefine what most of the C# language operators mean for your class.



Here is the code; it is written using VS2013 Community Edition.

Example

I'm going to let an example do the talking for just a minute here:

using System;
using System.Linq;

namespace BlogOperatorOverload
{
    public class Monster
    {
        public string MonsterType { get; set; }
        public string Name { get; set; }

        public override string ToString()
        {
            return String.Format("{0}::{1}", MonsterType, Name);
        }

        public static Monster operator+(Monster monster1, Monster monster2)
        {
            if ((monster1.MonsterType.Equals("dragon", StringComparison.OrdinalIgnoreCase) && 
                monster2.MonsterType.Equals("man", StringComparison.OrdinalIgnoreCase)) || 
                (monster1.MonsterType.Equals("man", StringComparison.OrdinalIgnoreCase) && 
                monster2.MonsterType.Equals("dragon", StringComparison.OrdinalIgnoreCase)) 
                )
                return new Monster() { MonsterType = "Dragon-Man", Name = "Trogdor!" };
            else
                return new Monster() { MonsterType = monster1.MonsterType + "_" + monster2.MonsterType, Name = String.Join("", monster1.Name.Reverse()) + String.Join("", monster2.Name.Reverse()) };
        }
    }
}


So this first part of the sample is a class named Monster. It has 2 properties that define what it is to be a monster, an overridden ToString method that we'll use to check our output in a little bit, and most importantly it overloads the "+" addition operator. See the syntax of that? It's pretty easy: "public static [returntype] operator[operator_to_overload](params)". In this case I'm adding instance monster1 to instance monster2. I put some interesting logic in this method to handle a special edge case, and in the else I put the catch-all addition of 2 monsters. All we're doing here is creating a new resulting Monster instance that has a MonsterType and Name. Pretty cool huh? This has nothing to do with addition, but we can now add together 2 instances of the Monster class as though they were numbers! Let's see that part of the example now:

using System;

namespace BlogOperatorOverload
{
    class Program
    {
        static void Main(string[] args)
        {
            var aMan = new Monster() { Name = "Timmy", MonsterType = "Man" };
            var aDragon = new Monster() { Name = "Puff", MonsterType = "Dragon" };
            var troggy = aMan + aDragon;
            Console.WriteLine(troggy.ToString());
            var aBear = new Monster() { Name = "Billy", MonsterType = "Bear" };
            var aTiger = new Monster() { Name = "Tina", MonsterType = "Tiger" };
            var qtf = aBear + aTiger;
            Console.WriteLine(qtf.ToString());
            Console.ReadKey();
        }
    }
}


Another simple bit of code. We first declare an instance of Monster named aMan, declare another one named aDragon, and then add aMan to aDragon to get a 3rd instance variable named troggy. Any guesses as to the output of this line? If you feel like cheating the answer is a little further down the page.

Now we create an instance of Monster named aBear, another named aTiger, and add them together. Go ahead, be wild, guess what the output will be!

Or just cheat and check out this screenshot:

Yes that's right folks, it's really that easy to redefine operators in C#. When would you want to use this? Hell if I know, I haven't actually found a practical use for it yet. Drop me a line if you find one!

What's Next?

Create your own samples and play with some of the other overloadable operators. There's a link down in the Resources section that lists the overloadable operators.

Resources

Operator Overloading Tutorial
Overloadable Operators

Thursday, July 23, 2015

.Equals != == ...?

Equality!


Confusing title huh? I found out recently that .Equals() doesn't always behave the same as ==. More specifically, I was trying to compare an Int32 (int) to an Int64 (long). It was a bit more complicated than that, really: I needed to find items in a list of complicated objects that matched (just by this number) complicated objects of another type in another list, and I kept coming up with no matches. I knew I should have some, but the comparison was coming up false for everything! I had a check kind of like this:
return SomeObj.SomeIntProp.Equals(SomeOtherObj.SomeLongProp);

I didn't realize at the time I initially coded it that one was an Int32 and the other an Int64, but I didn't really think it mattered anyways. So I changed it to something like this just for giggles:
return SomeObj.SomeIntProp == SomeOtherObj.SomeLongProp;

And sure enough, my code was now getting back the correct number of true results that I thought it should! Just so you know, the reason is this: there are 2 overloaded versions of Int32.Equals. One accepts an Int32 parameter, and the other an Object parameter. If the argument can be implicitly converted to Int32 (another int, for example), the compiler picks the first overload, which simply checks that the numbers are the same. There is no implicit conversion from Int64 to Int32, so the call binds to the second overload instead, which takes an Object parameter and returns false unless the boxed value actually is an Int32. Boom!

Here's a sample console app you can run to see this in action:

using System;

namespace BlogEquals
{
    class Program
    {
        static void Main(string[] args)
        {
            int int1 = 34;
            long long1 = 34;
            Console.WriteLine("int1 == long1: " + (int1 == long1).ToString());
            Console.WriteLine("int1.Equals(long1): " + int1.Equals(long1).ToString());
            Console.ReadKey();
        }
    }
}


 The output is this:


What's Next?

See for yourself what happens if the right-side of the comparison is an Int16...can you explain the result?

Resources

Int32.Equals

Thursday, July 9, 2015

C# Anonymous Functions

Intro

Anonymous Functions: Things that activist hacker group does when they go into the potty room. Darn, I keep forgetting this is a programming blog! OK in this case an Anonymous Function is an "inline statement or expression that can be used wherever a delegate type is expected." (1). In layman's terms, rather than supplying a fully-defined function to a delegate, you can create an anonymous function within your normal code and use that. It's hard to explain, so let's see it in action.



Code Samples in VS2013, .Net 4.5.

Simple Use Case



using System;

namespace BlogAnonymousFunction
{
    public class Thingy
    {
        public string DoAThing(int val1, int val2, Func<int, int, int> preProcessor = null)
        {
            int val3 = -1;
            if (preProcessor != null)
                val3 = preProcessor(val1, val2);
            return String.Format("val1:{0}, val2:{1}, val3: {2}", val1, val2, val3);
        }
    }
}


using System;

namespace BlogAnonymousFunction
{
    class Program
    {
        static void Main(string[] args)
        {
            var thingy = new Thingy();
            var result = thingy.DoAThing(24, 42, null);
            Console.WriteLine(result);
            result = thingy.DoAThing(24, 42, new Func<int, int, int>((val1, val2) => val1 + val2));
            Console.WriteLine(result);
            Console.ReadKey();
        }
    }
}



A simple example to be certain. Our class Thingy has a single method DoAThing that accepts a function as a parameter. If the caller passes in a value for this optional function parameter we will call it and assign the result to our 3rd value.

In the Main method we call DoAThing twice. First we call it passing in a null function parameter. In the second call we pass in an anonymous function that adds the first and second parameters, returning the result. You can see the results of our two function calls in the output window screenshot below.


Cool huh? Simple too, but it does take a few repetitions to get used to the odd syntax. Enjoy!

What's Next?

You can also write anonymous functions with no return value (via the Action delegates), you can create variables that reference a generic function/method, and more! Search on teh webs, you'll find plenty.

Resources

(1) Anonymous Functions
The Generic Func Delegates

Friday, June 19, 2015

.Net Web API Help Files

Intro

Help me Rhonda, help help me Rhonda...
Sorry about that, my parents used to listen to the Beach Boys all the time in the car and that tune is pretty catchy. So I was sitting here thinking about ways to help people, and I thought of Web API help files. When you make a Web API that you want to publish for people, creating documentation can get a bit tedious. Wouldn't it be great if there was an automatic way of generating help files? Yeah you know it baby, there is such a thing! And of course, it's a NuGet package that's nice and easy to plug into your existing Web API projects in .Net. Let's see what this sucker does for us!


Note: The sample project is in Visual Studio 2013, .Net 4.5.

Usage

So the NuGet package name is Microsoft ASP.NET Web API 2.2 Help Page. Bit of a mouthful there huh?

To start off our efforts, I've created a blank Web API project in Visual Studio. I created a single controller named SampleController, with a get method that returns "hi". Here's the code for it in case you want to play along:

using System.Web.Http;

namespace BlogWebApiHelp.Controllers
{
    public class SampleController : ApiController
    {
        public IHttpActionResult Get()
        {
            return Ok("hi");
        }
    }
}


If you run the project in your browser and look at http://[baseurl]/api/sample, you'll get back a JSON response (or XML depending on your browser) with the word "hi".

OK time to install the Help package to see what it does for us. Open up the package manager console via Tools-->NuGet Package Manager-->Package Manager Console. Type this text into the Package Manager Console and hit Enter: Install-Package Microsoft.AspNet.WebApi.HelpPage. The package is now installed and you'll notice your solution has a new folder named Areas. There's quite a bit of content in here that's all Mvc-style, and yes you can customize it all you like. It's what generates the help pages for your API that you're about to see.

There's one more small code change you have to make in order for this stuff to work. Open up your Global.asax.cs file and put this one line in your Application_Start method:

AreaRegistration.RegisterAllAreas();


Note that you will have to add a using directive for System.Web.Mvc if you don't have one already.

Let's see this little sucker in action! Fire up the Web API and navigate to http://[baseurl]/help. You should see a badass little page like this one on yer screen:

That's sweet! It found the Sample controller and is showing us that it has a Get method which we can execute by calling api/Sample. Huh, this thing doesn't have a description though? Lame! Stupid package should document my methods for me. Well it isn't quite that cool, but it's pretty close. Close the browser and head back to Visual Studio. Put an XML comment on the Get method of SampleController. There's a small code change you have to make too. Open up the file Areas\HelpPage\App_Start\HelpPageConfig.cs. In here you need to uncomment this line:

config.SetDocumentationProvider(new XmlDocumentationProvider(HttpContext.Current.Server.MapPath("~/App_Data/XmlDocument.xml")));

As you can see, this line tells the Help package to look for a file named XmlDocument.xml in your web app's App_Data folder. We haven't created any such file, so let's tell our project to autogenerate that XML file for us. Right-click on your project and open up the Properties window. Check the checkbox next to "XML documentation file:", and for the name type "App_Data\XmlDocument.xml". Save.


Run this thing again and open up the help url. Aww yeah baby, our XML documentation is now our Web API Help file documentation! It doesn't get much better than this folks. You can put documentation on methods and classes, it will pick up parameter documentation, and you can style these pages as you like. Awesome!



What's Next?

Try creating your own Web API with help files, or download the source code from the link above and play with mine. Open up the two links below in the Resources section and see what other features this thing has. Tip: you can hide methods and controllers from the API documentation!

Resources

Microsoft ASP.NET Web API 2.2 Help Page
Creating Help Pages for ASP.Net Web API

Thursday, June 4, 2015

AutoMapper with C#

Intro

Our world is a cold, cruel place where mapping the properties of one type to the properties of another type is a tedious chore, best left to bleary, dreary late night work while buzzed on enough caffeine to bug-eye a rhino. But wait, what light yon winder breaks? or something...


AutoMapper! If you guessed that AutoMapper is a new way to create your own maps from within a car, you'd be wrong! AutoMapper is a NuGet package that you can use in .Net to map the properties of one type to the properties of another type. If for example you have a middle-tier Person class that needs to map to a backend database class called PersonDb, AutoMapper can take most of the tedium away from you. No more setting Person1.FirstName = Person2.FirstName, yadda yadda. All you have to do is make a single call to initialize the configuration, make another call to map object 1 to object 2, and you're golden! Let's see it in action.

Note: Sample Project is in VS2013 Community Edition.

Samples

OK let's get movin. Create yourself a console application. You'll then need to install the NuGet package named AutoMapper. I'll show you a different way than I've used in the past, maybe you'll find this easier or maybe not. Anyways...


then in the Package Manager Console, type "Install-Package AutoMapper"...


OK now we've got AutoMapper installed in our console app. Add yourself a couple classes to the project, named Type1 and Type2. Here is the code for each: 


using System;

namespace BlogAutoMapper
{
    public class Type1
    {
        public string FirstName { get; set; }
        public int Age { get; set; }
        public DateTime DateBoughtFirstCar { get; set; }
    }
}


using System;

namespace BlogAutoMapper
{
    public class Type2
    {
        public string FirstName { get; set; }
        public int Age { get; set; }
        public DateTime DateBoughtFirstCar { get; set; }
    }
} 

Pretty simple stuff here. Now let's use AutoMapper. First open up your Program.cs file and add a "using AutoMapper;" statement up in the using statements. Now plug the following code into your Main method:

using System;
using AutoMapper;

namespace BlogAutoMapper
{
    class Program
    {
        static void Main(string[] args)
        {
            var type1 = new Type1() { Age = 44, DateBoughtFirstCar = DateTime.Parse("01/01/2000"), FirstName = "Perriwinkle" };
            Mapper.CreateMap<Type1, Type2>();
            var type2 = Mapper.Map<Type2>(type1);
            Console.WriteLine("type2.DateBoughtFirstCar: " + type2.DateBoughtFirstCar.ToShortDateString() + "...type2.Age: " + type2.Age + "...type2.FirstName: " + type2.FirstName);
            Console.ReadKey();
        }
    }
}

That's it! If you had 10, 20, or even 50 properties, it would map them for you. Pretty handy huh? Here's what the output looks like, just to prove it does what it says it does:



What's Next?

AutoMapper has a ton more features. I've shown you only the very basics. It can handle copying lists of objects, it can hit up sub-objects, you can customize the value copying (note how right now the source and destination properties have to be the same name), you can make it look at private member fields, and so much more. Play around with AutoMapper and see what it can do for you. I wonder if it could be used as a poor man's object cloning, hmm...

Resources

AutoMapper
Keith Durham: Thanks for pointing out this gem to me sir!

Sudoku Design and Implementation in C#

Introduction

Any Sudoku fans out there? This cool little combinatorial puzzle has been around for thousands of years in many different forms, but the most popular version that we know of today gained a mainstream audience only within the last decade. Most people believe that the game originated in Japan, but the earliest records indicate that during the 18th century the genius Swiss mathematician Leonhard Euler began formulating a number puzzle very similar to the Sudoku we know today.

So why all this talk about Sudoku? Well, it turns out that the solution to the puzzles that most people reason out in their heads can be conveniently constructed into a computer algorithm that can solve any Sudoku in fractions of a second. And better yet, we can implement the solution in our favorite programming language - C#!

Source Code Version: Visual Studio 2013 .Net 4.5

The downloadable code is not exactly the same as the code presented here. It has a few additional optimizations that don't assist in our search for knowledge and edification. However, it should be easy to follow the logic discussed here.

The Idea

First, let’s think about the algorithm before we begin coding. If you’ve never played Sudoku before, here is the objective:

Fill a 9x9 grid with digits so that each column, each row, and each of the nine 3x3 sub-grids that compose the grid (also called “boxes”, “blocks”, “regions”, or “sub-squares”) contains all of the digits from 1 to 9.

Simple, right? So essentially the sequence of our game is as follows:

  1. Select an unfilled square
  2. Pick a number between 1–9
  3. If the number is not in the same row, column, or grid as the selected square, then place the number, otherwise, select a different number.
  4. If no number matches our conditions, then we made a mistake earlier, so back up to the previously selected square.
  5. Rinse and repeat until the puzzle is filled.

This is the basic skeleton of our algorithm. In fact, with just this algorithm we can find a solution to every Sudoku puzzle. The problem is that there are roughly 6.67 x 10^21 possible completed Sudoku grids, so letting our program just fumble around guessing solutions is probably not something we're going to hang around for. The question is, can we make our program's job easier? And the answer is "Yes."

The Design

At the heart of Sudoku is essentially a decision optimization problem. Currently our sequence doesn't do anything intelligent with its guesses. It just randomly picks a square and then randomly picks a number, but what if we could add some intelligence to those guesses? Instead of picking a random square, it would make more sense to only pick squares that give us the greatest possibility of being right. This allows us to minimize the number of incorrect guesses, which will reduce the number of times we need to backtrack, ultimately improving the performance of our solution. There are many decision optimization problems that can be solved using this technique. Check out the Dancing Links algorithm created by Donald Knuth for a more general approach to solving problems like this.[1]

Minimization (aka Pruning)

So, how do we do this? Well, first, let's look at the minimization (also read optimization) problem before us. If we are minimizing our problem space, then there must be a set of constraints (or rules) that we must adhere to in order for our decision to be optimal, and since we know the rules of the game, let's start there.

  1. Each cell can only have one digit between 1 and 9.

  2. Each row must contain digits 1 to 9.*

  3. Each column must contain digits 1 to 9.*

  4. Each region must contain digits 1 to 9.*

*Since each row, column and region contains 9 cells, this also implies that there are no repeated digits

I imagine this as a scratch-off card for each cell, where each card has the numbers 1 to 9. From this point on let's change the arbitrary and ambiguous term numbers to candidates. This will make it clear that we are making a decision among a collection of choices. Now, as we apply the constraints, we can scratch off candidates that don't belong. Next, imagine the row, column, and region constraints as a series of blocks that prevent us from placing certain numbers in their areas. So when a number is placed on the grid, we now have a block placed on that number in the corresponding row, column and region. Let's further clarify this analogy with a graphic.

This graphic shows a 3x3 region of cells, with their candidates listed in each cell. It then shows the first row, column, and region; and the constraints placed in those regions. So, if we restrict the cells in this region according to these constraints, we find that our choices begin to narrow rather quickly. For example, in the first column, we went from a possible 9 candidates, to only 3 with only a region’s worth of information. Now, armed with all the information from the entire board we can solve most Sudoku puzzles with little to no backtracking.

Constraint Propagation (I call it rippling)

The next design consideration is actually an add-on to the last section. It has to do with maintaining the puzzle's constraints. So we've selected a good set of constraints to optimize our decision making process, and after applying that to our board, we've "nearly" got a complete solution. However, it seems that each time we make a decision the board that we previously optimized is no longer valid. Why? Well, because each decision made changes the state of the board, thus making the previous constraints out-of-date. What we need is for the information from each selection to be shared with all affected cells. Well, we know that the cells in the row, column, and region of the selected cell will now have an additional constraint based on the candidate placed in the selected cell. Thus, we can ensure that they all have this information by "rippling" the new information across those cells. This technique is known as constraint propagation, a powerful optimization technique used to solve complex decision problems where choices have consequences that "ripple" across the solution space.[2]

One key point to note here is that once we’ve finished pruning the board, we never need to do it again, as long as we accurately maintain the state of the board through constraint propagation. This means that we must keep track of the changed cells after each selection, so that if we need to backtrack, we can simply restore the changed cells.

Backtracking

And lastly comes the most crucial step to the Sudoku algorithm - backtracking. The idea is simple enough. Keep placing numbers in squares until our constraints tell us we can’t do that. Now, once they tell us we can’t do that, and there are no other possible candidates left for that square, it’s time to try a new approach, so we will back up until we reach the previous “fork” in the road, which is to say, we backtrack until we reach a cell with an unselected candidate. This means we need to keep track of previous states so that we can easily undo changes we’ve made. If you’re thinking this smells like recursion, then you’re definitely tracking the right scent. To complete the design, we’ve got one more diagram that shows part of an initial solution, where the decisions led it to a dead end (an unsolvable path), so it backtracks to the first new path.

Implementation

Now for the fun part! Hopefully, after reading the above design approach the idea is clear enough that should you feel sufficiently motivated, you could go out and design your own Sudoku solver without too much trouble.

The setup

  1. I used a 9x9 integer array to represent the grid. The unsolved squares have the value 0, so I can easily tell if a square is unsolved.

  2. Since keeping track of the constraints requires lots of indexing and counting, I created a class to implement an array of booleans (a nice way to keep track of constraints) that can easily be indexed by the board numbers 1 - 9, and will keep track of the number of candidates that are true. It also implements the IEnumerable and IEnumerator interfaces so that I can easily iterate through the list of available candidates, which makes our use of the foreach statement easier to code and a bit more expressive. If you're a bit hazy on the enumeration interfaces in C#, try this to get the nitty-gritty.[3]

Take a look at the initial code from my SudokuSolver class below:

// Convenience class for tracking candidates
class Candidate : IEnumerable
{
    bool[] m_values;
    int m_count;
    int m_numCandidates;

    public int Count { get { return m_count; } }

    public Candidate(int numCandidates, bool initialValue)
    {
        m_values = new bool[numCandidates];
        m_count = 0;
        m_numCandidates = numCandidates;

        for (int i = 1; i <= numCandidates; i++)
            this[i] = initialValue;
    }

    public bool this[int key]
    {
        // Allows candidates to be referenced by actual value (i.e. 1-9, rather than 0 - 8)
        get { return m_values[key - 1]; }

        // Automatically tracks the number of candidates
        set
        {
            m_count += (m_values[key - 1] == value) ? 0 : (value == true) ? 1 : -1;
            m_values[key - 1] = value;
        }
    }

    public void SetAll(bool value)
    {
        for (int i = 1; i <= m_numCandidates; i++)
            this[i] = value;
    }

    public override string ToString()
    {
        StringBuilder values = new StringBuilder();
        foreach (int candidate in this)
            values.Append(candidate);
        return values.ToString();
    }

    public IEnumerator GetEnumerator()
    {
        return new CandidateEnumerator(this);
    }

    // Enumerator simplifies iterating over candidates
    private class CandidateEnumerator : IEnumerator
    {
        private int m_position;
        private Candidate m_c;

        public CandidateEnumerator(Candidate c)
        {
            m_c = c;
            m_position = 0;
        }

        // only iterates over valid candidates
        public bool MoveNext()
        {
            ++m_position;
            if (m_position <= m_c.m_numCandidates)
            {
                if (m_c[m_position] == true)
                    return true;
                else
                    return MoveNext();
            }
            else
            {
                return false;
            }
        }

        public void Reset()
        {
            m_position = 0;
        }

        public object Current
        {
            get { return m_position; }
        }
    }
}

And now the board, constraints, and other definitions.

// True values for row, grid, and region constraint matrices
// mean that they contain that candidate, inversely,
// True values in the candidate constraint matrix means that it
// is a possible value for that cell.
Candidate[,] m_cellConstraintMatrix;
Candidate[] m_rowConstraintMatrix;
Candidate[] m_colConstraintMatrix;
Candidate[,] m_regionConstraintMatrix;

// Actual puzzle grid (uses 0s for unsolved squares)
int[,] m_grid;

// Another convenience structure. Easy and expressive way
// of passing around row, column information.
struct Cell
{
    public int row, col;
    public Cell(int r, int c) { row = r; col = c; }
}

// helps avoid iterating over solved squares
HashSet<Cell> solved;
HashSet<Cell> unsolved;

// Tracks the cells changed due to propagation (i.e. the rippled cells)
Stack<HashSet<Cell>> changed;

// Bookkeeping referenced by the constructor and solver below: bucketList is
// used in the full source to group unsolved cells by candidate count, steps
// counts selections, and writer logs the solver's progress.
HashSet<Cell>[] bucketList;
int steps;
TextWriter writer = Console.Out;

public SudokuSolver(int[,] initialGrid)
{
    m_grid = new int[9, 9];
    m_cellConstraintMatrix = new Candidate[9, 9];
    m_rowConstraintMatrix = new Candidate[9];
    m_colConstraintMatrix = new Candidate[9];
    m_regionConstraintMatrix = new Candidate[9, 9];
    solved = new HashSet<Cell>();
    unsolved = new HashSet<Cell>();
    changed = new Stack<HashSet<Cell>>();
    bucketList = new HashSet<Cell>[10];
    steps = 0;

    // initialize constraints

    for (int row = 0; row < 9; row++)
    {
        for (int col = 0; col < 9; col++)
        {
            // copy grid, and turn on all Candidates for every cell
            m_grid[row, col] = initialGrid[row, col];
            m_cellConstraintMatrix[row, col] = new Candidate(9, true);
        }
    }

    for (int i = 0; i < 9; i++)
    {
        m_rowConstraintMatrix[i] = new Candidate(9, false);
        m_colConstraintMatrix[i] = new Candidate(9, false);
        bucketList[i] = new HashSet<Cell>();
    }
    bucketList[9] = new HashSet<Cell>();

    for (int row = 0; row < 3; row++)
        for (int col = 0; col < 3; col++)
            m_regionConstraintMatrix[row, col] = new Candidate(9, false);

    InitializeMatrices();
    PopulateCandidates();
}

Setting the constraints (Minimization)

As we discussed in the design, the code needs to make intelligent decisions based on relevant information. The relevant information in our Sudoku solver is the constraints, and before we begin plugging away we need to minimize the number of choices by setting our constraints. This can be done by adding a constraint to the row, column, and region for every filled-in square. We can then update each cell's constraints by reducing the available candidates based on the row, column, and region constraints discovered previously.

private void InitializeMatrices()
{
    for (int row = 0; row < 9; row++)
    {
        for (int col = 0; col < 9; col++)
        {
            // if the square is solved update the candidate list
            // for the row, column, and region
            if (m_grid[row, col] > 0)
            {
                int candidate = m_grid[row, col];
                m_rowConstraintMatrix[row][candidate] = true;
                m_colConstraintMatrix[col][candidate] = true;
                m_regionConstraintMatrix[row / 3, col / 3][candidate] = true;
            }
        }
    }
}

private void PopulateCandidates()
{
    //Add possible candidates by checking
    //the rows, columns and grid
    for (int row = 0; row < 9; row++)
    {
        for (int col = 0; col < 9; col++)
        {
            //if solved, then there are no possible candidates
            if (m_grid[row, col] > 0)
            {
                m_cellConstraintMatrix[row, col].SetAll(false);
                solved.Add(new Cell(row, col));
            }
            else
            {
                // populate each cell with possible candidates
                // by checking the row, col, and grid associated 
                // with that cell
                foreach (int candidate in m_rowConstraintMatrix[row])
                    m_cellConstraintMatrix[row, col][candidate] = false;
                foreach (int candidate in m_colConstraintMatrix[col])
                    m_cellConstraintMatrix[row, col][candidate] = false;
                foreach (int candidate in m_regionConstraintMatrix[row / 3, col / 3])
                    m_cellConstraintMatrix[row, col][candidate] = false;

                Cell c = new Cell(row, col);
                unsolved.Add(c);
            }
        }
    }
}

Cell Selection

Now that we have enough information about the board to make an optimal decision, we must MAKE THE DECISION. The best way to do this is to pick the cell with the least number of candidates, as this would obviously be the easiest decision to make. This requires that we search the grid for "a" minimum cell. Remember, there might be multiple cells with the same number of candidates. In that case, we just pick the first cell we find.

private Cell NextCell()
{
    if (unsolved.Count == 0)
        return new Cell(-1, -1); // easy way to signal a solved puzzle

    Cell min = unsolved.First();
    foreach (Cell cell in unsolved)
        min = (m_cellConstraintMatrix[cell.row, cell.col].Count < m_cellConstraintMatrix[min.row, min.col].Count) ? cell : min;

    return min;
}

In this explanation code I chose to use a shorter min-find approach, but in the source code example I use bucket lists to dynamically track the minimum after each cell selection. This allows my minimum lookup to take place in essentially one operation. It improved performance by about 25%, so the implementation approach can definitely have a big impact on how your solution performs.

Candidate Selection and Deselection

Although this code is likely the most complex looking in this section, selection and deselection are really just maintenance routines for managing the state of the board. Selecting/deselecting the cell is not problematic, but updating all of the affected cells is a bit trickier. Upon selection, we need to iterate through the row, column, and region, and remove the just-selected candidate from cells that contained it. The process is the same for unselecting a cell, except we are returning the candidate back to the previously changed cells. And thanks to the changed-cell tracking in the SelectCandidate routine, this process is made simpler by only having to iterate through cells we know changed.

private void SelectCandidate(Cell aCell, int candidate)
{
    HashSet<Cell> changedCells = new HashSet<Cell>();

    // place candidate on grid
    m_grid[aCell.row, aCell.col] = candidate;

    // remove candidate from cell constraint matrix
    m_cellConstraintMatrix[aCell.row, aCell.col][candidate] = false;

    // add the candidate to the cell, row, col, region constraint matrices
    m_colConstraintMatrix[aCell.col][candidate] = true;
    m_rowConstraintMatrix[aCell.row][candidate] = true;
    m_regionConstraintMatrix[aCell.row / 3, aCell.col / 3][candidate] = true;

    /**** RIPPLE ACROSS COL, ROW, REGION ****/

    // (propagation)
    // remove candidates across unsolved cells in the same
    // row and col.
    for (int i = 0; i < 9; i++)
    {
        // only change unsolved cells containing the candidate
        if (m_grid[aCell.row, i] == 0)
        {
            if (m_cellConstraintMatrix[aCell.row, i][candidate] == true)
            {
                // remove the candidate
                m_cellConstraintMatrix[aCell.row, i][candidate] = false;

                //update changed cells (for backtracking)
                changedCells.Add(new Cell(aCell.row, i));
            }
        }
        // only change unsolved cells containing the candidate
        if (m_grid[i, aCell.col] == 0)
        {
            if (m_cellConstraintMatrix[i, aCell.col][candidate] == true)
            {
                // remove the candidate
                m_cellConstraintMatrix[i, aCell.col][candidate] = false;

                //update changed cells (for backtracking)
                changedCells.Add(new Cell(i, aCell.col));
            }
        }
    }

    // (propagation)
    // remove candidates across unsolved cells in the same
    // region.
    int grid_row_start = aCell.row / 3 * 3;
    int grid_col_start = aCell.col / 3 * 3;
    for (int row = grid_row_start; row < grid_row_start + 3; row++)
        for (int col = grid_col_start; col < grid_col_start + 3; col++)
            // only change unsolved cells containing the candidate
            if (m_grid[row, col] == 0)
            {
                if (m_cellConstraintMatrix[row, col][candidate] == true)
                {
                    // remove the candidate
                    m_cellConstraintMatrix[row, col][candidate] = false;

                    //update changed cells (for backtracking)
                    changedCells.Add(new Cell(row, col));
                }
            }

    // add cell to solved list
    unsolved.Remove(aCell);
    solved.Add(aCell);
    changed.Push(changedCells);
}


private void UnselectCandidate(Cell aCell, int candidate)
{
    // 1) Remove selected candidate from grid
    m_grid[aCell.row, aCell.col] = 0;

    // 2) Add that candidate back to the cell constraint matrix.
    //    Since it wasn't selected, it can still be selected in the 
    //    future
    m_cellConstraintMatrix[aCell.row, aCell.col][candidate] = true;

    // 3) Remove the candidate from the row, col, and region constraint matrices
    m_rowConstraintMatrix[aCell.row][candidate] = false;
    m_colConstraintMatrix[aCell.col][candidate] = false;
    m_regionConstraintMatrix[aCell.row / 3, aCell.col / 3][candidate] = false;

    // 4) Add the candidate back to any cells that changed from
    //    its selection (propagation).
    foreach (Cell c in changed.Pop())
    {
        m_cellConstraintMatrix[c.row, c.col][candidate] = true;
    }

    // 5) Add the cell back to the list of unsolved
    solved.Remove(aCell);
    unsolved.Add(aCell);
}

Backtracking

And finally, we must be able to tell our solver to back up when it goes astray. This is implemented using recursion, but could also be implemented iteratively, with a bit more effort. This function gets the next cell and then tries to determine what candidate should go in that cell. Once it places a candidate, it goes to the next cell, and it keeps going until all the cells have been solved - we’re done - or we reach a dead end (we’ve used all the candidates). In the latter situation, it just continually calls the deselect method discussed previously until it reaches a cell with unselected candidates.

private bool SolveRecurse(Cell nextCell)
{
    // Our base case: No more unsolved cells to select, 
    // thus puzzle solved
    if (nextCell.row == -1)
        return true;

    // Loop through all candidates in the cell
    foreach (int candidate in m_cellConstraintMatrix[nextCell.row, nextCell.col])
    {
        writer.WriteLine("{4} -> ({0}, {1}) : {2} ({3})", nextCell.row, nextCell.col,
            m_cellConstraintMatrix[nextCell.row, nextCell.col], m_cellConstraintMatrix[nextCell.row, nextCell.col].Count, steps++);

        SelectCandidate(nextCell, candidate);

        // Move to the next cell.
        // if it returns false backtrack
        if (SolveRecurse(NextCell()) == false)
        {
            ++steps;
            writer.WriteLine("{0} -> BACK", steps);
            UnselectCandidate(nextCell, candidate);
            continue;
        }
        else // if we receive true here this means the puzzle was solved earlier
            return true; 
    }

    // return false if path is unsolvable
    return false;

}

Conclusion

Hopefully by now you're feeling confident enough to solve any Sudoku puzzle using the algorithm design techniques discussed here. I also hope that you might find additional applications of this decision optimization algorithm, as it can be used to solve a myriad of very difficult real-world problems. In addition to looking at the code, you might want to try executing it a few times to see how it performs, and to watch the selection and backtracking process. As I'll be the first to admit, my solution is not the most efficient one out there, and there are a few links to other implementations.[4] One of the fastest I've seen is the kudoku implementation in C.[5] It claims to solve 1000 hard Sudoku puzzles in less than 2 seconds. If you were ever stumped at Sudoku before, now you know you don't have to be ever again.


  1. Dancing Links. Link

  2. Constraint Satisfaction. Link

  3. Enumerator Interface in C#. Link

  4. Another similar implementation in Python. Link.

  5. Sudoku C implementation. Link