Regex

How to detect all urls in a string to match Twitter’s requirements (Windows 8(.1) and Windows Phone 8(.1))

url

As you might guess, I am still working on that app I mentioned in my last blog post. As I was diving deeper into the functions I want, I recognized that Twitter does a very well url handling server side.

Like the official documentation says, every url will be shortened with a link that has 22 characters (23 for https urls).

I was trying to write a RegEx expression to detect all the links that can be:

  • http
  • https
  • just plain domain names like “msicc.net”

This is not as easy as it sounds, and so I was a bit struggling. I then talked with @_MadMatt (follow him!) who has a lot of experience with twitter. My first attempt was a bit confusing as I did first only select http and https, then the plain domain names.

I found the names by their domain ending, but had some problems to get their length (which is essential). After the very helpful talk with Matthieu, I finally found a very good working RegEx expression here on GitHub.

I tested it with tons of links, and I got the desired results and it is now also very easy for me to get their length.

Recovering the amount of time I needed for this, I decided to share my solution with you. Here is the method I wrote:

        public int CalculateTweetCountWithLinks(int currentCount, string text)
        {
            int resultCount = 0;

            if (text != string.Empty)
            {
                //detailed explanation: https://gist.github.com/gruber/8891611
                string pattern = @"(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'.,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))";

                //generating a MatchCollection
                MatchCollection linksInText = Regex.Matches(text, pattern, RegexOptions.Multiline);

                //going forward only when links where found
                if (linksInText.Count != 0)
                {
                    //important to set them to 0 to get the correct count
                    int linkValueLength = 0;
                    int httpOrNoneCount = 0;
                    int httpsCount = 0;

                    foreach (Match m in linksInText)
                    {
                        //https urls need 23 characters, http and others 22
                        if (m.Value.Contains("https://"))
                        {
                            httpsCount = httpsCount + 1;
                        }
                        else
                        {
                            httpOrNoneCount = httpOrNoneCount + 1;
                        }

                        linkValueLength = linkValueLength + m.Value.Length;
                    }

                    //generating summaries of character counts
                    int httpOrNoneReplacedValueLength = httpOrNoneCount * 22;
                    int httpsReplacedValueLength = httpsCount * 23;

                    //calculating final count
                    resultCount = (currentCount - linkValueLength) + (httpOrNoneReplacedValueLength + httpsReplacedValueLength);                    
                }
                else
                {
                    resultCount = currentCount;
                }
            }
            return resultCount;
        }

First, we are detecting links in the string using the above mentioned RegEx expression and collect them in a MatchCollection.

As https urls have a 23 character length on t.co (Twitter’s url shortener), I am generating two new counts – one for https, one for all other urls.

The last step is to substract the the length of all Match values and add the newly calculated replaced link values lengths.

Add this little method to your TextChanged event, and you will be able to detect the character count on the fly.

As always, I hope this is helpful for some of you.

Happy coding, everyone!

 

Posted by msicc in Archive, 0 comments

Detect and remove Emojis from Text on Windows Phone

noemoticonskeyboard

In my recent project, I work with a lot of text that get’s its value from user input. To enable the best possible experience for my users, I choose the InputScope Text on all TextBoxes, because it provides the word suggestions while writing.

The Text will be submitted to a webserver via a REST API. And now the problem starts. The emojis that are part of the Windows Phone OS are not supported by the API and the webserver.

Of course, I was immediately looking for a way to get around this. I thought this might be helpful for some of you, so I am sharing two little helper methods for detecting and removing the emojis .

After publishing the first version of this post, I got some feedback that made me investigating a bit more on this topic.  The emojis are so called unicode symbols, and thanks to the Unicode behind it, they are compatible with all platforms that have the matching Unicode list implemented.

Windows Phone has  a subset of all available Unicode characters in the OS keyboard, coming from different ranges in the Unicode characters charts. Like we have to do sometimes, we have to maintain our own list to make sure that all emojis are covered by our app in this case and update our code if needed.

If you want to learn more about unicode charts, they are officially available here: http://www.unicode.org/charts/

Update 2: I am using this methods in a real world app. Although the underlying Unicode can be used, often normal text will be read as emoji. That’s why I reverted back to my initial version with the emojis in it. I never had any problems with that.

Now let’s have a look at the code I am using. First is my detecting method, that returns a bool after checking the text (input):

             public static bool HasUnsoppertedCharacter(string text)
             {
            string pattern = @"[{allemojisshere}]";

            Regex RegexEmojisKeyboard = new Regex(pattern);

            bool booleanreturnvalue = false;

            if (RegexEmojisKeyboard.IsMatch(text))
            {
                booleanreturnvalue = true;
            }
            else if (!RegexEmojisKeyboard.IsMatch(text))
            {
                booleanreturnvalue = false;
            }
            return booleanreturnvalue;
            }

 

As you can see, I declared a character range with all emojis. If one or more emojis is found, the bool will always return true. This can be used to display a MessageBox for example while the user is typing.

The second method removes the emojis from any text that is passed as input string.

             public static string RemovedUnSoppertedCharacterString(string text)
             {
            string result = string.Empty;
            string cleanedResult = string.Empty;

            string pattern = @"[{allemojishere}]";

            MatchCollection matches = Regex.Matches(text, pattern);

            foreach (Match match in matches)
            {
                result = Regex.Replace(text, pattern, string.Empty);
                cleanedResult = Regex.Replace(result, "  ", " ");
            }
            return cleanedResult;
             }

Also here I am using the character range with all emojis . The method writes all occurrences of emojis into a MatchCollection for Regex. I iterate trough this collection to remove all of them. The Method also checks the string for double spaces in the text and makes it a single space, as this happens while removing the emojis .

User Experience hint:

use this method with care, as it could be seen as a data loss from your users. I am using the first method to display a MessageBox to the user that emojis are not supported and that they will be removed, which I am doing with the second method. This way, my users are informed and they don’t need to do anything to correct that.

You might have noticed that there is a placeholder “{allemojishere}” in the code above. WordPress or the code plugin I use aren’t supporting the Emoticons in code, that’s why I attached my helper class.

As always, I hope this will be helpful for some of you.

Happy coding!

Posted by msicc in Archive, 1 comment