Getting the Text Between HTML Tags in PHP

Suppose you’re automatically parsing a webpage, and you come across the following kind of thing:

blah blah
some starting text
some useful content
some ending text
blah blah

We want to parse out the useful content from among the non-useful stuff, and we know there’s some starting text and some ending text that wraps the useful content.

A better example:

I like chicken
<div class="dog" style="border: 0px">
     I don't like to eat fish
</div>
<div class="fish">
I like to eat pork
</div>

How can we obtain just the text inside the first div if we know the exact code that comes immediately before and immediately after it?

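Here’s a PHP function you can use to accomplish this. It only uses the explode() function, which splits on a literal string. (The older split() function treated its first argument as a regular expression, was deprecated in PHP 5.3, and was removed in PHP 7.)

<?php
function get_stuff_between($string, $start, $end){
	// Keep everything after the first occurrence of $start.
	$content = explode($start, $string, 2);
	if (count($content) < 2) {
		return '';	// $start was not found
	}
	// Keep everything before the first $end that follows.
	$content = explode($end, $content[1], 2);
	return $content[0];
}
?>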

If we have the example HTML code stored in the string $content, then we can obtain the text we want by calling the function like this:

get_stuff_between($content, '<div class="dog" style="border: 0px">', '</div>
<div class="fish">');

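This returns the string

I don't like to eat fish

along with the whitespace and line breaks that surround it in the HTML, which you can strip with trim().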

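Note that the function takes the first occurrence of the start string and the first occurrence of the end string after it, so both should be unique in the document, or at least appear first around the content you want. In practice, there are often such unique strings immediately before and after the content you want to parse.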
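
If a delimiter might be missing from some documents, it can help to tell "not found" apart from an empty result. Here is a minimal sketch of one way to do that with strpos() and substr(); the function name get_stuff_between_or_null and the sample HTML are made up for illustration and are not part of the function above.

```php
<?php
// Hypothetical helper (an assumption for illustration): returns null when
// either delimiter is missing, so callers can tell "not found" from "".
function get_stuff_between_or_null($string, $start, $end) {
    // Find the first occurrence of the start delimiter.
    $from = strpos($string, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);

    // Find the first end delimiter that follows the start delimiter.
    $to = strpos($string, $end, $from);
    if ($to === false) {
        return null;
    }

    return substr($string, $from, $to - $from);
}

// Made-up sample input.
$html = '<p>intro</p><div class="dog">woof</div><div class="fish">blub</div>';
var_dump(get_stuff_between_or_null($html, '<div class="dog">', '</div>'));
var_dump(get_stuff_between_or_null($html, '<div class="cat">', '</div>'));
```

The first call should give the text inside the "dog" div, and the second should give null because that start string never appears.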

This is handy when you’re automatically parsing many HTML documents and want to extract only the portion of each one that’s neatly delimited by invariant tags.

There are probably more flexible ways of doing this using preg_match or preg_match_all, but I found this way to be easy and sufficient for what I needed to do today.
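
For the curious, here is a rough sketch of what the preg_match_all route might look like. It collects every delimited section rather than only the first one. The function name get_all_between and the sample list markup are assumptions made up for this example, and I haven't tuned it for very large documents.

```php
<?php
// Hypothetical helper (an assumption for illustration): collect every
// section found between $start and $end, not just the first one.
function get_all_between($string, $start, $end) {
    // preg_quote() escapes regex metacharacters so the delimiters match
    // literally; '/' is escaped too because it is the pattern delimiter.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/s';

    // The 's' modifier lets '.' match newlines; '.*?' is non-greedy, so
    // each match stops at the nearest end delimiter.
    preg_match_all($pattern, $string, $matches);

    return $matches[1];  // contents of the capture group for each match
}

// Made-up sample input.
$list = "<li>pork</li>\n<li>fish</li>\n<li>chicken</li>";
print_r(get_all_between($list, '<li>', '</li>'));
```

Because the delimiters go through preg_quote(), you can still pass plain strings like an opening tag with attributes without worrying about regex syntax.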