help with regex

bordemstirs · Aug 16, 2011

I know this isn't a mafia specific issue, but this is the only board I'm subscribed to with fellow coders.

I'm pulling text from a website, and I'm basically trying to pull blocks of text whose individual lines do not start with a space, provided any of the lines contains a keyword.
The keyword is always followed by a less-than symbol.

The regex I'm using to match this right now is

Code:

matcher m = create_matcher("\\r?\\n([^ ](?:.|\\r?\\n(?!\\s))+?)" + i + "<(.+?)\\r?\\n ", t);

where i is the keyword.

My problem, however, is that this generates a stack overflow error. Is there a better way to pull this without using the Or statement (the | alternation), which is where the stack overflow starts to occur?

xKiv · Aug 16, 2011

bordemstirs said:
Is there a better way to pull this without using the Or statement (the | alternation), which is where the stack overflow starts to occur?

Don't do it all with a regex.
idea:
Loop over all lines of the page. Remember the last line number where you transitioned from "starts with space" to "doesn't start with space", and keep processing while lines don't start with a space, aggregating all such consecutive lines into an object. Also keep a flag signalling whether you saw your keyword in any line of the current block.
When you hit a line starting with a space, or the end of the page, and the flag is true, declare the current block "found". Either way, reset the block and the flag before moving on.

Suddenly you have an algorithm that's linear in page size and non-recursive (as opposed to one that can be, if I am not mistaken, exponential in the number of lines and recursive to a depth linear in the number of lines).
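
A minimal sketch of the loop described above, written in Python rather than ASH so it can be run outside mafia. The name find_keyword_blocks and its variables are made up for illustration. It assumes the keyword is always followed by "<", and it treats blank lines as block boundaries as well as space-led lines, which is a guess based on the \s lookahead in the original pattern.

```python
import re

def find_keyword_blocks(page, keyword):
    """Collect runs of lines that don't start with whitespace,
    keeping only runs where some line contains keyword + '<'."""
    needle = keyword + "<"
    found = []
    block = []      # lines of the block currently being collected
    seen = False    # has any line of this block contained the needle?
    # The trailing "" acts as a sentinel so a block that runs to the
    # end of the page is closed like any other.
    for line in re.split(r"\r?\n", page) + [""]:
        if line and not line[0].isspace():
            block.append(line)
            seen = seen or needle in line
            continue
        # Boundary line: keep the block only if the flag was set, then reset.
        if block and seen:
            found.append("\n".join(block))
        block, seen = [], False
    return found
```

Each line is looked at once and nothing recurses, so the cost grows with the page length only.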

bordemstirs · Aug 16, 2011

xKiv said:
Don't do it all with a regex.
idea:
Loop over all lines of the page. Remember the last line number where you transitioned from "starts with space" to "doesn't start with space", and keep processing while lines don't start with a space, aggregating all such consecutive lines into an object. Also keep a flag signalling whether you saw your keyword in any line of the current block.
When you hit a line starting with a space, or the end of the page, and the flag is true, declare the current block "found". Either way, reset the block and the flag before moving on.

Suddenly you have an algorithm that's linear in page size and non-recursive (as opposed to one that can be, if I am not mistaken, exponential in the number of lines and recursive to a depth linear in the number of lines).

I was hoping to be able to keep regex, since it's powerful, easy to change if something changes in the way the page is formatted, and takes up less space in the code (and as a byproduct is easier to understand at a glance). I suppose I can just read a line at a time, though I wouldn't need to keep track of everything the way you described: once I move through a block of text and determine that it doesn't have my keyword, I can just disregard it and move on.
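
One possible middle ground, sketched in Python rather than ASH: keep a regex for carving out the blocks, but move the keyword test out of the pattern, so the pattern repeats whole lines instead of alternating over single characters. The names BLOCK and blocks_with_keyword are invented for this example, and the sketch does not verify whether mafia's regex engine would run an equivalent pattern without the overflow.

```python
import re

# One block = one or more consecutive lines whose first character
# is not whitespace. Each repetition consumes a whole line.
BLOCK = re.compile(r"(?:^\S.*(?:\r?\n|\Z))+", re.MULTILINE)

def blocks_with_keyword(page, keyword):
    """Return the blocks in which keyword + '<' appears on some line."""
    needle = keyword + "<"
    return [
        m.group(0).rstrip("\r\n")   # drop the block's trailing line break
        for m in BLOCK.finditer(page)
        if needle in m.group(0)
    ]
```

If the page format changes, only the BLOCK pattern needs adjusting; the keyword check stays a plain substring test.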

Thanks though, I'll do this for now.

bordemstirs · Aug 16, 2011

visit_url() help

Does visit_url() return nothing when a page attempts to redirect? I have a URL that, when I view it in my browser, is basically a custom 404 page, but mafia reports its length() as 0.

Also, I was having some trouble getting split_string() to work with the default delimiter. Does it only recognize one kind of line break? I had to resort to "\r?\n".
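
For illustration, here is the difference between the two kinds of line break, in Python rather than ASH. This only shows why a "\r?\n" delimiter copes with mixed line endings; it says nothing about what split_string()'s default delimiter actually is. The sample text is made up.

```python
import re

text = "first\r\nsecond\nthird"   # one CRLF break, one bare LF break

only_lf = text.split("\n")         # a stray "\r" stays on the first piece
either = re.split(r"\r?\n", text)  # both kinds of break are consumed
```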
