RegExp


Anyone happen to know of a reasonably friendly page to find out more about Regular Expressions on? - Kazuhiko

MoonShadow: we could make this it.

a matches a
b matches b
.
.

. matches any character except \n

[abcd] matches any of a,b,c,d
[a-z] matches anything between a and z inclusive
[a-zA-Z0-9] matches any unaccented letter and any of the ten digits
[[:alpha:]] matches any alphabetic character in the current locale (if it's supported at all)

a? matches zero or one a
a* matches zero or more as
a+ matches one or more as (if it's supported; sometimes you must use \+.)

(stuff) or sometimes \(stuff\) groups together stuff

: (abcd)+ matches one or more lots of abcd

Uh - literal quoting the brackets etc. should cause them to be treated as literals by the regexp parser.

: Not necessarily. It depends on the regexp language you are using.

and so on.. is that the sort of thing you're after? What language are you using? - each implementation has its own quirks..
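For anyone who wants to poke at the basics listed above, here is a minimal sketch in Python. That choice is an assumption (nobody in this thread is using Python, and each flavour has its own quirks), and the test strings are invented for illustration. [[:alpha:]] is left out because Python's re module does not support POSIX classes.

```python
import re

# (pattern, test string, should the whole string match?)
examples = [
    (r"a.c", "abc", True),          # . matches any one character...
    (r"a.c", "a\nc", False),        # ...except \n
    (r"[abcd]", "c", True),         # any one of a, b, c, d
    (r"[a-z]", "Q", False),         # ranges are case-sensitive
    (r"[a-zA-Z0-9]", "7", True),    # unaccented letters and digits
    (r"ba?d", "bd", True),          # zero or one a
    (r"ba*d", "baaad", True),       # zero or more a
    (r"ba+d", "bd", False),         # one or more a, so none is not enough
    (r"(abcd)+", "abcdabcd", True), # one or more lots of abcd
    (r"\(abcd\)", "(abcd)", True),  # in this flavour, escaped brackets are literals
]

# fullmatch insists that the pattern covers the whole test string
results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in examples]

for (pattern, text, _), got in zip(examples, results):
    print(f"{pattern!r} against {text!r}: {got}")
```

The last entry is the point made above about escaping: in this flavour a backslash turns the brackets into literals, whereas in others it is the escaped form that does the grouping.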

Ah, didn't know that, 'though I should have guessed. I had a good idea about the above but haven't seen any working examples so I would probably be lost on the details. I was trying to read the RegExp page you linked but got very lost when it went onto some of the examples of how and why things don't work. - Kazuhiko

I actually found man 7 regex to be pretty comprehensive; I do quite a lot of evil sed/awk/sh hackery, so you can always ask for advice :) - Emperor

BwaHaHaHa... Nope, I come from the Windows side of the fence :) .net now has ways of using regexp for validation (plus the fact I was kind of interested anyway). - Kazuhiko

Well, man pages are online, consider [This link] for example?

Kazuhiko is currently being most self-congratulatory for having finally written some regex replace statements that appear to work :) (and are, of course, extremely simple)

He is, however, somewhat confused by some of the syntax in a match he found on the net (which is used in the dotnet environment to appropriately link http refs):

Pattern: (?<url>http://(?:[\w-]+\.)+[\w-]+(?:/[\w-./?%&~=]*[^.])?)

Replace: <a href="${url}">${url}</a>

PeterTaylor observes that the character class [\w-./?%&~=] should probably also include +, which is used by URL encoding for space.

: Ah, thanks. As noted below, I also need to add #, so the original writer obviously missed a few cases. --Kazuhiko

My first question concerns the . in the last two sets of square brackets. Surely this needs to be escaped to prevent it from matching any character?

Inside square brackets, it has no special meaning - it just means a dot character.

: Thanks. Seems a bit of an odd exception, but I guess it works.

 sl236@debian:~$ perl -e '"abc.efg" =~ /([^.]+)/ and print "$1\n"'
 abc

The above expression actually absorbs one too many characters so the final set could well be at fault,

Example..?

"http://ftp.xyz/ n" produces "<a href="http://ftp.xyz/ ">http://ftp.xyz/ </a>n" and, as far as I can tell from random guesses, "http://ftp.xyz/string n" produces "<a href="http://ftp.xyz/string ">http://ftp.xyz/string </a>n" for any string matching [\w-./?%&~=]*. In particular, it dismally fails to chop off any trailing . which I presumed (perhaps incorrectly) was the point of the [^.] segment. "http://ftp.xyz/xyzr.asp. n" produces "<a href="http://ftp.xyz/xyzr.asp. ">http://ftp.xyz/xyzr.asp. </a>n"

: Kazuhiko notes with amusement that the Wiki has just proven its great superiority in matching addresses :)

[^.] matches anything except a dot - but you've already eaten all valid URL characters by the time you get there, so the character it eats must be from the subsequent text, the way I read it; that's what's sucking up your extra character. Try spitting it back out after you've eaten it - does this work?
(?<url>http://(?:[\w-]+\.)+[\w-]+(?:/[\w-./?%&~=]*)?)(?<nondot>[^.]?)

: Replace: <a href="${url}">${url}</a>${nondot}

but the [\w-./?%&~=] set does not appear to match a # on testing.

Correct; that says "match a 'word' character (a-z, A-Z, 0-9 or _), a -, a ., a /, ..." - # isn't in the list.

: Yep, I was just contrasting this against my (mis)understanding of '.' in square brackets. Now noted that both # and + need to be added to the last match group.
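The extra-character problem and the suggested "spit it back out" fix can be tried side by side. This is a rough sketch in Python rather than .net, so two things are changed and should be treated as assumptions: named groups are spelled (?P<url>...) and referenced as \g<url>, and the - inside the long character class is escaped because Python rejects [\w-.]. The function names are made up for the example.

```python
import re

# The pattern as found on the net (Python spelling, hyphen escaped).
ORIGINAL = re.compile(
    r"(?P<url>http://(?:[\w-]+\.)+[\w-]+(?:/[\w\-./?%&~=]*[^.])?)"
)

# The suggested revision: the [^.] is moved out of the url group
# into its own optional group so it can be put back afterwards.
REVISED = re.compile(
    r"(?P<url>http://(?:[\w-]+\.)+[\w-]+(?:/[\w\-./?%&~=]*)?)(?P<nondot>[^.]?)"
)


def link_original(text):
    """Wrap URLs in anchors using the pattern as found."""
    return ORIGINAL.sub(r'<a href="\g<url>">\g<url></a>', text)


def link_revised(text):
    """Wrap URLs in anchors, re-emitting the character eaten by [^.]."""
    return REVISED.sub(r'<a href="\g<url>">\g<url></a>\g<nondot>', text)


print(link_original("http://ftp.xyz/ n"))
print(link_revised("http://ftp.xyz/ n"))
print(link_revised("http://ftp.xyz/xyzr.asp. n"))
```

With the revised pattern the space ends up outside the link, but a trailing . is still swallowed, because . is itself in the character class and the * is greedy. Chopping it off would need a further change.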

My second question is about the ? at the start of a parenthesised group. ?<name> appears to define a name that can be used in the replacement, which seems useful, but why use ?: at the start of the other groupings?

(stuff) means "group stuff and assign it to a numbered variable". So /(.)(.)(.)/ matches three characters and assigns them to the variables $1, $2 and $3 which you can then use. (?:stuff) means "group stuff but don't assign it to anything". So /(.)(?:.)(.)/ gets the first matched character into $1, forgets the second, and puts the third into $2. MoonShadow has never seen the (?<text> ..) idiom before. Does any of that make any sense at all?

: Yes, thanks. Well explained. I don't know if ?<name> is Microsoft specific, but I'm grateful for it. It seems a lot easier than counting matches to work out which number I want to reference, although I guess the same effect could be achieved by using '?:' on everything else. --Kazuhiko
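A small sketch of the three kinds of group, again in Python as an assumption. Python spells a named group (?P<name>...) rather than (?<name>...), and the group names first and last are invented for the example.

```python
import re

# Two capturing groups with a non-capturing group in between:
# the middle character is matched but not remembered.
numbered = re.fullmatch(r"(.)(?:.)(.)", "abc").groups()

# The same thing with names instead of positions.
named = re.fullmatch(r"(?P<first>.)(?:.)(?P<last>.)", "abc").groupdict()

# Names can be used in the replacement, so there is no counting of brackets.
swapped = re.sub(r"(?P<first>.)(?:.)(?P<last>.)", r"\g<last>\g<first>", "abc")

print(numbered)  # ('a', 'c')
print(named)     # {'first': 'a', 'last': 'c'}
print(swapped)   # ca
```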

Could someone verify my understanding of this before I tie my brain in knots?

. matches any single character other than the new line character (\n)
But in WindowsWorld, an end-of-line actually consists of a carriage return (\r) followed by a new line (\n)

So, if I want a set of lines that start with a set character, I have to look for \r\n(character).*, but the . absorbs the \r so subsequent lines commence with just \n? --Kazuhiko

: Line start and end characters don't usually get matched at all like that. Try ^(character).*

: Sorry, badly described. I'm hunting down a match within a string that contains multiple lines (which may or may not start with (character)). I don't care if the whole string matches (especially). --Kazuhiko

: Correct in principle, unless Perl is running in a Windows environment and/or is configured to silently change \r\n into \n on non-binary files. You don't have to know or care whether this is the case, or what line endings are being used, though - use the [m] flag - do something like /^(character)(.*)$/m - now $1 contains the character, $2 contains everything else on that line (including any \r, since . matches it, but not the \n itself). So in a substitution you'd do s/^(character)(.*)$/$1whatever/mg or something. If your logic cares about what the line endings actually are, the s flag might be more useful. If you want to get rid of them in the match, do a few chomps on it; or globally, do something like s/\r\n$/\n/mg. It's generally quite rare in Perl to be dealing with things like that as a block rather than line-by-line, though - in for(<>) {print $_;}, for instance, $_ will never contain more than one newline. You could also use a split. What do you actually need to do?

Ah, sorry Senji... Hadn't realised 'til MoonShadow's post that ^ could match start of line rather than start of string. If I do use ^ and $ to indicate start and end of line, where does the carriage return/line feed actually fall? Inside or outside (or on) the ^ or $?

Just after the $. - MoonShadow

Cool. Thanks. I'll play with that --Kazuhiko

: One of the easiest things to do, of course, is to s/\r\n/\n/g first - then you know everything is single returned. --Vitenka

 $ perl -e '$test="a\nb\n"; $test=~s/^(.*)$/S$1E/mg;$test=~s/\n/n/g; print $test;'
 SaEnSbEn
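To see where the \r and \n end up when the string uses Windows line endings, here is a sketch in Python (an assumption; other flavours may differ in the details). The sample text and the choice of * as the leading character are invented for the example.

```python
import re

# A multi-line string with Windows line endings; only some lines start with *.
text = "* one\r\nplain\r\n* two\r\n"

# MULTILINE makes ^ and $ match at the start and end of each line.
LINE = re.compile(r"^(\*)(.*)$", re.MULTILINE)

# . matches \r but not \n, so the \r lands inside the second group
# and the \n falls just after the $.
raw = LINE.findall(text)

# Normalising first, as suggested above, keeps the \r out of the match.
unified = LINE.findall(text.replace("\r\n", "\n"))

print(raw)      # [('*', ' one\r'), ('*', ' two\r')]
print(unified)  # [('*', ' one'), ('*', ' two')]
```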

As to usage, I'm imitating the Wiki and trying to make bullet points behave properly by grouping them inside <ul> tags. --Kazuhiko

So presumably you have control of how the data is fed to you..? - MoonShadow

Umm... Don't think I entirely understand the question. I have the string, I do stuff to it, I display the string... --Kazuhiko

How do you get the string, was my question - because usually you have to go through hoops to get Perl to give you a string containing multiple lines. - MoonShadow

You have control of more stages of the process than just this substitution, so you know what form the string is in (or can force it to be in any form you choose) before you start acting on it here. --Vitenka

Yes, but the point I'm trying to make is if you're going through hoops getting everything into a large string just so you can go through hoops doing a too-complicated substitution on it, why not just do things more simply and deal with the input line-by-line? - MoonShadow

: 'Cause I'm not using perl :) I'm in .net - the string is the whole string, not fed line by line --Kazuhiko
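For the whole-string case, one possible approach is to match each run of consecutive bullet lines in one go and wrap the run. This is only a sketch in Python, not .net, and it makes assumptions the thread does not confirm: bullets are taken to be lines starting with "* ", line endings are normalised first, and the name wrap_bullets is made up.

```python
import re

# One or more consecutive lines that each start with "* ".
BULLET_RUN = re.compile(r"(?:^\* .*\n?)+", re.MULTILINE)


def wrap_bullets(text):
    """Wrap each run of bullet lines in <ul>, one <li> per line."""
    # Normalise line endings so . never picks up a stray \r.
    text = text.replace("\r\n", "\n")

    def to_list(match):
        # Drop the leading "* " from each line in the run.
        items = [line[2:] for line in match.group(0).splitlines()]
        body = "".join(f"<li>{item}</li>\n" for item in items)
        return f"<ul>\n{body}</ul>\n"

    return BULLET_RUN.sub(to_list, text)


print(wrap_bullets("intro\r\n* a\r\n* b\r\nend\r\n"))
```

The replacement is a function rather than a fixed string, which is what lets each run be split into items before it is put back.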

Oh - when did :alpha: get added? It's a useful expansion. --Vitenka

[regexp syntax summary]

As part of the same task as the above, I am looking through the string to make pop-up 'glossary' links in the text. In its simplest form this takes the pattern \[glossary:(?<word>.*?)\] and replaces it with <a href="javascript://" onclick="showGlossary('${word}')">${word}</a>.

This will work nicely unless the word contains < or > (not very likely) which would break the html or contains a ' (quite possible) which would break the javascript. Is there any simple way to replace these with the appropriate 'safe' versions at the same time as the replacement, or do I have to cycle through all the matches and handle the replacement of characters per match? --Kazuhiko

: How does the glossary lookup cope with these characters? I can't think of a way of doing that in a single regexp unless you have something equivalent to the Perl [(?{ ...code... })] construct. If they are illegal characters in that context, it might be helpful to think of "removal/substitution of illegal characters" as a separate operation from "substitution of glossary links" - something like "while glossary links with illegal characters exist, find the first one and remove or substitute the first illegal character in it, and each time you search, start searching at the last glossary link where an illegal character was found" should do the trick, depending on how big your strings are and how many illegal characters they typically contain. I'd be curious to see what other people suggest though :) - MoonShadow
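One other possibility, where the regexp library allows it, is to pass a function as the replacement so that each match can be cleaned up as it is substituted. The sketch below does this in Python, which is an assumption; .net's Regex.Replace has an overload taking a MatchEvaluator delegate that looks like it plays the same role. The function name link_glossary and the escaping rules chosen are illustrative, not a claim about how showGlossary expects its argument.

```python
import html
import re

# Python spelling of \[glossary:(?<word>.*?)\]
GLOSSARY = re.compile(r"\[glossary:(?P<word>.*?)\]")


def link_glossary(text):
    """Replace [glossary:word] with a link, escaping the word per match."""

    def build_link(match):
        word = match.group("word")
        # Make the word safe inside a single-quoted javascript string.
        js_word = word.replace("\\", "\\\\").replace("'", "\\'")
        # Then make the whole call safe inside a double-quoted html attribute.
        # html.escape also turns ' into &#x27;, which browsers decode again.
        call = html.escape(f"showGlossary('{js_word}')")
        return f'<a href="javascript://" onclick="{call}">{html.escape(word)}</a>'

    return GLOSSARY.sub(build_link, text)


print(link_glossary("see [glossary:regex] and [glossary:cat's <b>]"))
```

This keeps the whole thing to a single pass over the string, with the character clean-up done per match rather than as a separate search loop.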

CategoryComputing | CategoryAbbreviation