Regex will suck.
.*? isn't enough. You have to restrict the characters allowed so that the match can't run past the end of an attribute value, for example with [^"]+
And if the attributes can come in a different order, as with the second <a> and the one you're trying to match, the regex gets far more complex: n attributes can appear in n! different orders, so the number of alternatives grows factorially.
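Here's a quick way to see the difference. This is only a sketch in Python (used purely for illustration, the same thing happens in .NET), and the sample HTML and both patterns are made up for the demo:

```python
import re

# Hypothetical sample: the first link has no class attribute, the second does.
html = '<a href="/one">first</a> <a href="/two" class="internallink">second</a>'

# Lazy wildcard: nothing stops .*? from running out of the first tag
# and on into the second one until it finds the class attribute.
loose = re.compile(r'<a href="(.*?)" class="internallink">')

# Restricted class: [^"]+ cannot cross the closing quote of the value.
strict = re.compile(r'<a href="([^"]+)" class="internallink">')

loose_href = loose.search(html).group(1)
strict_href = strict.search(html).group(1)

print(repr(loose_href))   # spans both tags
print(repr(strict_href))  # just the second link's href
```

The lazy version still "matches", it just captures everything from the first href through to the second one, which is usually worse than not matching at all.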
Say there's the title (A), class (B), href (C), and target (D). To match the full tag you have to do something like
Code:
<a (A(B(CD|DC)|C(BD|DB)|D(BC|CB))|B(A(CD|DC)|C(AD|DA)|D(AC|CA))|C(A(BD|DB)|B(AD|DA)|D(AB|BA))|D(A(BC|CB)|B(AC|CA)|C(AB|BA)))>
(remember to expand A, B, C, and D into the actual regex patterns for each attribute), and that doesn't even account for optional attributes.
Regardless, capturing is a pain because you have so many different places in the regex where the desired information can show up.
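For a sense of the scale, the orderings can be generated rather than typed out. A rough Python sketch, assuming simplified double-quoted patterns for A, B, C, and D (the patterns and the sample tag are hypothetical, and this builds a flat alternation instead of the nested form above):

```python
import re
from itertools import permutations

# Hypothetical per-attribute patterns standing in for A, B, C, D.
attrs = {
    "title": r'title="[^"]*"',
    "class": r'class="internallink"',
    "href": r'href="[^"]+"',
    "target": r'target="_blank"',
}

# One alternative per ordering, with whitespace between attributes.
orderings = [r"\s+".join(p) for p in permutations(attrs.values())]
pattern = re.compile(r"<a\s+(?:" + "|".join(orderings) + r")\s*>")

tag = '<a target="_blank" href="/x?Page=1" class="internallink" title="t">'
print(len(orderings))                  # 4! orderings
print(pattern.match(tag) is not None)  # all four present, in any order
```

Add a fifth attribute and you're at 120 alternatives, and a tag that is missing any one of them still won't match.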
Unfortunately .NET doesn't seem to have a decent HTML parser. There is a compromise you can make:
Code:
<a(?:\s+(?:title='[^']*'|title="[^"]*"|title=[^\s>]+|class=['"]?internallink['"]?|href='([^']+)'|href="([^"]+)"|href=([^\s>]+)|target=['"]?_blank['"]?))+\s*>
which will match pretty much every A tag made up of one or more of those attributes. You then go through each match and check whether one of the three href capturing groups matched. If not, skip it. Otherwise HTML-decode the captured string, use System.Uri to parse the URL, and use HttpUtility.ParseQueryString to get the individual query string values. Then filter out the links that don't have the Page or File keys.
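The whole post-processing step can be sketched like this. It's in Python rather than .NET, so html.unescape stands in for the HTML decode, urlsplit for System.Uri, and parse_qs for HttpUtility.ParseQueryString. A_TAG and wanted_links are names made up for this example, the pattern is a variant that allows whitespace between attributes and keeps the quotes outside the href groups, and "has the Page or File keys" is assumed to mean at least one of the two is present:

```python
import html
import re
from urllib.parse import parse_qs, urlsplit

# Variant of the compromise pattern. Groups 1-3 are the three href forms
# (single-quoted, double-quoted, unquoted).
A_TAG = re.compile(
    r"""<a(?:\s+(?:title='[^']*'|title="[^"]*"|title=[^\s>]+"""
    r"""|class=['"]?internallink['"]?"""
    r"""|href='([^']+)'|href="([^"]+)"|href=([^\s>]+)"""
    r"""|target=['"]?_blank['"]?))+\s*>"""
)

def wanted_links(page, keys=("Page", "File")):
    """Yield (href, query dict) for links whose query has one of keys."""
    for m in A_TAG.finditer(page):
        # Take whichever href alternative matched; skip tags with no href.
        raw = next((g for g in m.groups() if g is not None), None)
        if raw is None:
            continue
        href = html.unescape(raw)               # e.g. &amp; becomes &
        query = parse_qs(urlsplit(href).query)  # key -> list of values
        if any(k in query for k in keys):
            yield href, query
```

Usage would be something like `for href, query in wanted_links(page_source): ...`, where each query value is a list because a key can appear more than once in a query string.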