Moving from Fixed-Size Chunks to Semantic Integrity


I am a mobile developer specialising in Android and Flutter. I maintain open-source Flutter packages on GitHub and pub.dev. This blog is where I share my experiences and insights from my development journey.

  • Semantic chunking is more accurate, but you need to balance that against embedding time, especially on a limited machine.

  • For general articles, lead + body chunking actually works well.

I was building my multi-language news bank with pgvector and an embedding model, starting from zero knowledge. Here is my journey from fixed-size chunks, to lead + body chunks (first paragraph + the rest), to semantic chunks.

In the beginning, I followed other developers' steps and used the fixed-size method. It worked fine at first: a chunk size of 500 tokens is actually not bad for English text. That held until I started working with Traditional Chinese and Japanese.

It was a disaster.

Words were cut in the middle. 🫠 For example, くまもと (Kumamoto, the place) gets cut to くまも, which reads more like the start of the mascot's name, くまモン (Kumamon).
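To see the cut happen, here is a minimal sketch of fixed-size chunking. It is an illustration and not my production code: it counts characters instead of model tokens to keep things simple, and the function name `fixed_size_chunks` and the sample sentence are made up for the example.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of `size` characters.

    Word boundaries are ignored on purpose, which is exactly the problem.
    Assumption: characters stand in for tokens here.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "くまもとに行く"  # hypothetical sample: "go to Kumamoto"
chunks = fixed_size_chunks(text, 3)
print(chunks)  # ['くまも', 'とに行', 'く']
```

No chunk contains the full word くまもと any more, so a query for the place has nothing whole to match against.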

Some keywords were cut in half and could no longer be matched by a query. The embedding model turns each chunk into a vector, and the search compares the query vector against those vectors to rank the results. A chunk with a truncated word cannot get a proper vector, so it cannot match properly.

So next I tried cutting each article into at least two parts: the first paragraph, plus one or more body chunks depending on the length of the article. Most writers use the first paragraph to state the core message, so it is a good fit for narrowing down the vector search. Later on, I append the whole article to the LLM agent's context for the analysis task. This worked quite well in the general case. The problem is long articles, where the first paragraph cannot cover everything, especially business analysis pieces.
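The lead + body split can be sketched roughly like this. The assumptions are mine for the example: paragraphs are separated by a blank line, body size is measured in characters, and `lead_body_chunks` and `max_body_chars` are hypothetical names.

```python
def lead_body_chunks(article: str, max_body_chars: int = 1000) -> list[str]:
    """Return the first paragraph as its own chunk, then body chunks.

    Body paragraphs are packed together until adding one more would exceed
    `max_body_chars`. A paragraph is never split, so a single paragraph
    longer than the limit still ends up as one oversized chunk.
    """
    paragraphs = [p.strip() for p in article.split("\n\n") if p.strip()]
    if not paragraphs:
        return []

    chunks = [paragraphs[0]]  # the lead: usually carries the core message
    current = ""
    for paragraph in paragraphs[1:]:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_body_chars:
            chunks.append(current)  # close the chunk that is already full
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


article = "Lead.\n\nBody one.\n\nBody two."
print(lead_body_chunks(article))                     # lead + one body chunk
print(lead_body_chunks(article, max_body_chars=12))  # lead + two body chunks
```

A short article gives exactly two chunks, and a longer one simply gets more body chunks while the lead stays on its own.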

Therefore, I switched to semantic chunking, which compares sentence vectors directly instead of wording and is more accurate. The downside is that the embedding step is significantly slower. In my case, news articles arrive hourly, so the volume still leaves enough time for it. If I ever need to process news minute by minute, I may need to reconsider.
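One common way to do semantic chunking is to embed every sentence and start a new chunk whenever two neighbouring sentences are not similar enough. The sketch below follows that idea under loud assumptions: `toy_embed` is only a word-count stand-in for a real embedding model (it would not work for Chinese or Japanese without word segmentation), the threshold of 0.2 is arbitrary, and sentence splitting is left to the caller. It is not the exact pipeline I run.

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Stand-in for a real embedding model: lowercase word counts."""
    return dict(Counter(re.findall(r"\w+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.2,
) -> list[str]:
    """Group consecutive sentences; break where similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]  # one embedding call per sentence
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([sentence])    # topic changed: start a new chunk
        else:
            groups[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


sentences = [
    "The bank raised interest rates.",
    "Higher interest rates slow lending.",
    "The typhoon hit Kumamoto today.",
    "Kumamoto closed schools after the typhoon.",
]
result = semantic_chunks(sentences)
print(result)  # two chunks: one about rates, one about the typhoon
```

The cost shows up in `vectors = [embed(s) for s in sentences]`: every sentence needs its own embedding call before any chunk exists, which is why this approach is so much slower than the two earlier ones.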