Forum: Author Hangout

How does SOL determine the word-count of a story?

Sarkasmus

So, I have a weird question.

When I open an author's page, every single story there also displays how many words that story contains. Now, for my own story, however, that number is curious.

According to SOL, my story has 284,250 words. But, according to the Word document I wrote the story in, it only has 279,407 words.

So... where are those extra 5k words coming from after the story is published?

awnlee jawking

@Sarkasmus

On the subject of word counts, is there anywhere within the Author Stats facilities where the word counts are shown?

AJ

Sarkasmus

@awnlee jawking

Not that I could find, no.

awnlee jawking

@Sarkasmus

Thanks.

To find SOL's word counts for my stories, I have to resort to 'Go To Homepage', except I can never remember where that is found, so I end up clicking on 'Author' and looking myself up.

Obviously I'm not the smartest frog in the blender :-(

AJ

Switch Blayde

@awnlee jawking

I end up clicking on 'Author' and looking myself up.

That's what @Sarkasmus does. His question was how is the word count calculated? His stories' word counts listed differ from the word counts Word gives him.

Lazeez Jiddan (Webmaster)

@Sarkasmus

According to SOL, my story has 284,250 words. But, according to the Word document I wrote the story in, it only has 279,407 words.

The system uses the PHP built in function 'str_word_count()' to get the word count for the story after stripping all html tags from the text.

When I open the story in my text editor it reports 278,520 words.

I don't know how PHP counts words exactly, but I know that numbers or digits don't count as words and if a word has a digit inside like fri3nd, that counts as two words.
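
The behaviour described above can be approximated outside PHP. The following is a rough Python sketch, not SOL's code: it assumes plain single-byte text and treats a word as a run of ASCII letters that may contain, but not start with, an apostrophe or hyphen. The function name `approx_php_word_count` is made up for illustration.

```python
import re

# Assumption: a word is a run of ASCII letters that may contain,
# but not start with, an apostrophe or a hyphen.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z'-]*")


def approx_php_word_count(text: str) -> int:
    """Rough stand-in for str_word_count() on plain single-byte text."""
    return len(WORD_RE.findall(text))


# Digits are not letters, so they end the current word and never start one.
for sample in ["my friend", "my fri3nd", "chapter 12", "don't stop"]:
    print(f"{sample!r}: {approx_php_word_count(sample)}")
```

Under these assumptions "fri3nd" is seen as "fri" and "nd", and "12" contributes nothing, which matches the two effects mentioned in the post.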

Dominions Son

@Lazeez Jiddan (Webmaster)

I googled information about the word count feature in Word. What I found reads like it counts space-delimited strings, so 123/456 would count as one word and fri3nd would count as one word.

If he's using HTML or markup code for formatting, would the PHP word count function count that?

Sarkasmus

@Lazeez Jiddan (Webmaster)

Thanks for the info!

I just made the mistake of trying to read up on that further, and found this:

Apple's Pages counts "2-7 mg/v" as four words while Microsoft Word counts the same string as two words.

...I don't even know why ANY software should count "2-7 mg/v" as words...

Dominions Son

@Sarkasmus

...I don't even know why ANY software should count "2-7 mg/v" as words...

Because Word defines words as space delimited sub-strings with no exclusions. If you took out the space in the middle it would only count as one word.
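
A minimal Python sketch of that space-delimited rule, for comparison. It only illustrates the rule as described in this post; Word's actual implementation is not public, and `whitespace_word_count` is a made-up name.

```python
def whitespace_word_count(text: str) -> int:
    """Count whitespace-delimited substrings, with no exclusions."""
    # str.split() with no arguments splits on runs of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


for sample in ["2-7 mg/v", "2-7mg/v", "123/456", "my fri3nd."]:
    print(f"{sample!r}: {whitespace_word_count(sample)}")
```

With this rule "2-7 mg/v" is two words, and removing the space makes it one, regardless of what characters the pieces contain.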

Sarkasmus

@Dominions Son

That... actually makes sense.

I guess the only alternative would be to include an ever-expanding word list into the software, and then have it compare each substring with every entry in that list.
Counting words would then probably take longer than writing the story in the first place :D

Dominions Son

@Sarkasmus

It could be worse. There are some older word counting algorithms that counted every x characters as a word without even excluding spaces. The calculations were based off of an average word length.
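
A sketch of that kind of character-based estimate in Python. The divisor of 6 characters per word is a hypothetical value chosen for illustration; the post does not say what x was.

```python
import math


def estimated_word_count(text: str, chars_per_word: int = 6) -> int:
    """Estimate words as total characters (spaces included) / chars_per_word."""
    # chars_per_word is a hypothetical average; every character counts,
    # including spaces and punctuation.
    return math.ceil(len(text) / chars_per_word)


print(estimated_word_count("stone cold dead"))  # 15 characters
```

The result depends only on the length of the text, so a page of long words and a page of short words of equal length get the same count.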

Switch Blayde

@Dominions Son

Because Word defines words as space delimited sub-strings with no exclusions.

An example:

stone cold dead (3 words)
stone-cold dead (2 words)

Both are correct. Both mean the same thing. But the 1st one is one more word.

Dicrostonyx

@Switch Blayde

Yes, but they only mean the same thing in this case. There are other adjective-noun constructions in which the hyphen can change meaning, which is why counting it as two words rather than three makes sense.

helmut_meukel

@Lazeez Jiddan (Webmaster)

The system uses the PHP built in function 'str_word_count()' to get the word count for the story after stripping all html tags from the text.

I don't know how PHP counts words exactly, but I know that numbers or digits don't count as words and if a word has a digit inside like fri3nd, that counts as two words.

Depending on how exactly it works, it may count words with HTML tags within the word as two words.
E.g. the author tries to explain an acronym by writing out the full name with each character used in the acronym set in bold.
(POTUS written out in full may then count as ten words instead of five!)
The same goes for a bold first character in the opening word of the first paragraph.
Other constructs like " < span class="big_initial">A< /span>fter dinner " would be stripped on SOL but treated similarly by PHP.

HM.

Lazeez Jiddan (Webmaster)

@helmut_meukel

it may count words with HTML tags within the word as two words.

Somehow many of you missed this part of my answer:

after stripping all html tags from the text.
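
Why the order matters can be shown with a small Python sketch. It assumes a crude regex tag remover as a stand-in for whatever stripping SOL actually performs, and an ASCII-letter word pattern as a stand-in for the PHP count; both function names are made up.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")                 # crude tag matcher (assumption)
WORD_RE = re.compile(r"[A-Za-z][A-Za-z'-]*")    # approximate word rule


def count_after_stripping(markup: str) -> int:
    """Remove tags entirely, then count: split words are rejoined."""
    return len(WORD_RE.findall(TAG_RE.sub("", markup)))


def count_with_tags_as_spaces(markup: str) -> int:
    """Replace tags with spaces, then count: tags inside words split them."""
    return len(WORD_RE.findall(TAG_RE.sub(" ", markup)))


sample = '<span class="big_initial">A</span>fter dinner'
print(count_after_stripping(sample), count_with_tags_as_spaces(sample))
```

When tags are removed without leaving anything behind, "A" and "fter" rejoin into one word before counting, so a tag inside a word does not inflate the total.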

Pixy

@Lazeez Jiddan (Webmaster)

Somehow many of you missed this part of my answer:

We are good at that ... 😀

Michael Loucks

@Lazeez Jiddan (Webmaster)

From the PHP manual:

For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters. Note that multibyte locales are not supported.

A locale (see locale(7)), is:

A locale is a set of language and cultural rules. These cover aspects such as language for messages, different character sets, lexicographic conventions, and so on. A program needs to be able to determine its locale and act accordingly to be portable to different cultures.

Gauthier

@Lazeez Jiddan (Webmaster)

The system uses the PHP built in function 'str_word_count()' to get the word count for the story after stripping all html tags from the text.

There is one major problem with that: str_word_count() only works with single-byte code pages. It doesn't handle UTF-8 correctly and also breaks completely on HTML entities (not sure if you leave them in the text). Either you have to write your own count function, or convert the HTML entities (PHP's html_entity_decode does an incomplete job, by the way) and then transliterate to ASCII with iconv before using str_word_count. That should give you a number closer to the Word count, except for the handling of non-breaking spaces and apostrophes.
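
The effect can be sketched in Python. This is an illustration under assumptions, not the PHP pipeline: the ASCII-letter pattern stands in for a single-byte word count, `html.unescape` stands in for entity decoding, and Unicode NFKD decomposition plus dropping non-ASCII characters stands in for iconv transliteration (characters with no ASCII decomposition simply disappear).

```python
import html
import re
import unicodedata

WORD_RE = re.compile(r"[A-Za-z][A-Za-z'-]*")  # approximate single-byte word rule


def ascii_only_count(text: str) -> int:
    """Count runs of ASCII letters; accented letters and entities split words."""
    return len(WORD_RE.findall(text))


def normalized_count(text: str) -> int:
    """Decode entities and fold accents to ASCII before counting."""
    decoded = html.unescape(text)                        # &eacute; becomes an accented e
    decomposed = unicodedata.normalize("NFKD", decoded)  # base letter + combining mark
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only_count(ascii_text)


for sample in ["na\u00efve caf\u00e9", "na&iuml;ve caf&eacute;"]:
    print(f"{sample!r}: {ascii_only_count(sample)} vs {normalized_count(sample)}")
```

Both samples contain two words, but the unnormalized count sees three and five respectively, which is the kind of inflation described above.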

Lazeez Jiddan (Webmaster)

@Gauthier

Either you have to write your own count function

The difference is not big enough to make it necessary.

Soronel

@Sarkasmus

I could also believe that the two systems count abbreviations differently, i.e. "i.e.", wouldn't surprise me if one counted that as one word and another as two.

Switch Blayde

@Soronel

"i.e."

Word counts it as one word.

Keet

@Sarkasmus

For 'normal' html the easiest and closest to accurate is counting the spaces in the body part of 'real' chapters (prologue, chapter, epilogue, etc).
Including a count of line breaks can add a little accuracy if the html is consistently formatted without empty lines.
Stripping the html tags first increases the accuracy but not as much as you would think since monolithic tags like p, i, b, etc. don't count as separate words because they don't add an extra space. It's mostly tags with class= and img tags that add to the count because of the spaces. (For an epub that could be a lot more because in an epub almost every tag has a class= part.)
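
A small Python sketch of the space-counting idea, showing how a class= attribute adds to the count while simple tags do not. The regex tag remover and the function name are assumptions for illustration only.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")  # crude tag matcher (assumption)


def space_count_estimate(text: str) -> int:
    """Estimate words as (spaces + line breaks + 1)."""
    if not text.strip():
        return 0
    return text.count(" ") + text.count("\n") + 1


simple = "<p>It was <i>late</i>.</p>"
with_class = '<p class="first">It was <i>late</i>.</p>'

print(space_count_estimate(simple))                       # tags add no spaces
print(space_count_estimate(with_class))                   # class= adds one space
print(space_count_estimate(TAG_RE.sub("", with_class)))   # stripped first
```

The simple tags leave the estimate at three, the class= attribute pushes it to four, and stripping the tags first brings it back to three.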

It all depends on how accurate you want to be against how much processing time you want to spend. Using the PHP function for SOL is an easy and fast solution but obviously not the most accurate.

Switch Blayde

@Keet

the easiest and closest to accurate is counting the spaces

In a previous post, I showed how Word counted two words hyphenated. It gave a count of 1.

When you use an emdash, some styles say to put a space before and after it. Some say not to use a space. Word, with or without the spaces, knows to count them as two words. For example, both of the following sentences are 3 words:

The girl — wow.
The girl—wow.

Keet

@Switch Blayde

In a previous post, I showed how Word counted two words hyphenated. It gave a count of 1.

When you use an emdash, some styles say to put a space before and after it. Some say not to use a space. Word, with or without the spaces, knows to count them as two words. For example, both of the following sentences are 3 words:

The girl — wow.
The girl—wow.

It depends on the definition you use of what is a word. Following that definition you can create a set of rules. Counting the spaces is one of those rules. You could expand the rule-set with counting hyphens and emdashes as a space, and even expand those rules by counting consecutive spaces/hyphens/emdashes as a single space. I'm sure Word has a specific set of rules it uses for the word count and probably checks those rules while you type.
I'm pretty sure I could build a rule set for SOL html that is close to perfect but the processing time would be longer than simply using the PHP function str_word_count().
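
One possible version of such an expanded rule set, sketched in Python. The choice of separators (whitespace, hyphen, em dash) follows the rules listed above; everything else, including the function name, is an assumption, and the result deliberately differs from Word for hyphenated words.

```python
import re

# Treat whitespace, hyphens and em dashes (U+2014) as separators,
# and collapse consecutive separators into one.
SEPARATORS = re.compile(r"[\s\-\u2014]+")


def rule_based_count(text: str) -> int:
    """Count the non-empty pieces left after splitting on separator runs."""
    return len([piece for piece in SEPARATORS.split(text) if piece])


for sample in ["The girl \u2014 wow.", "The girl\u2014wow.", "stone-cold dead"]:
    print(f"{sample!r}: {rule_based_count(sample)}")
```

Both em dash variants come out as three words here, while "stone-cold dead" also becomes three, so whether the hyphen belongs in the separator set depends on which definition of a word is wanted.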
