This article will help you learn about Bureau Works' default segmentation rules and how to change them.
1. Default segmentation rules
2. How to change your segmentation rules
2.1. Account level
2.2. Organizational Unit level
1. Default segmentation rules
In Bureau Works, your content must be separated into segments for several reasons, including productivity, content analysis, and TM matches.
When you upload a document to our platform, it undergoes an automatic segmentation process, which considers various parameters. In most cases, the segmentation will follow a natural structure that makes sense from the reader's perspective. However, depending on the file extension and other parameters, Bureau Works may apply different rules for segmentation.
Below are some of the rules we will use to segment the text:
- Paragraph separators, new line and tab characters
- Punctuation characters used in various languages, such as periods, question marks, exclamation points, etc.
- Bureau Works will also consider the double-byte period "。"
Below, you can find some of the rules we'll use to avoid text segmentation:
- Commonly used abbreviations such as Mr., Mrs., Prof., e.g., Vol., etc.
- Month abbreviations in dates, such as Nov. 12, 2023 and Sept. 5, 2023
- Numbers with dots between the digits, such as 2.55, 100.1 and 0.78
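To make the interaction between the two lists above more concrete, here is a minimal Python sketch of rule-based segmentation. It is not Bureau Works' actual rule file (which is not visible from the UI); the `RULES` table and the `segment` function are hypothetical names, and the patterns only approximate the examples listed above. The sketch assumes the SRX convention that, at each position in the text, the first rule that matches decides whether a break happens there.

```python
import re

# Illustrative rules only; NOT Bureau Works' real default configuration.
# Each rule is (is_break, pattern_before_position, pattern_after_position).
# Exceptions (no-break rules) come first so they win over the generic breaks.
RULES = [
    (False, r"\b(?:Mr|Mrs|Prof|Vol|Nov|Sept)\.", r"\s"),  # abbreviations, months
    (False, r"\be\.g\.", r"\s"),                          # "e.g."
    (False, r"\d\.", r"\d"),                              # numbers such as 2.55
    (True, r"[.?!]+", r"\s"),                             # sentence punctuation
    (True, r"。", r""),                                    # double-byte period
    (True, r"[\n\t]", r""),                               # new line and tab
]


def segment(text):
    """Split text into segments; the first matching rule wins at each position."""
    compiled = [
        (is_break, re.compile(f"(?:{before})\\Z"), re.compile(after))
        for is_break, before, after in RULES
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # "before" must end exactly at pos, "after" must start at pos.
            if before.search(text[:pos]) and after.match(text[pos:]):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule decides; skip the rest
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]


print(segment("Mr. Smith paid 2.55 on Nov. 12, 2023. Was it enough? Yes!"))
# ['Mr. Smith paid 2.55 on Nov. 12, 2023.', 'Was it enough?', 'Yes!']
```

Because the exceptions are listed before the generic punctuation rule, "Mr." and "Nov." do not end a segment even though each is followed by a space.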
2. How to change your segmentation rules
As mentioned earlier, Bureau Works covers the most common scenarios for text segmentation. However, if your use case requires specific handling for text segmentation, you can provide instructions to the platform to tailor it to your needs. For instance, if you have a product name that ends with a dot or use an abbreviation not covered by default, you can add a new rule to our default configuration, and Bureau Works will adapt accordingly.
Our default rule file is not visible from the UI. Feel free to contact our support team for guidance and testing. They will find the best way to configure the segmentation file based on your needs.
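For the product-name case mentioned above, the following Python sketch generates a small SRX file with a custom exception. It is only an illustration under stated assumptions: the element and attribute names follow the public SRX 2.0 format, "Acme." is a hypothetical product name, `build_srx` is a hypothetical helper, and the exact layout Bureau Works expects (and how a custom file combines with the default rules) should be confirmed with the support team.

```python
import re
import xml.etree.ElementTree as ET

SRX_NS = "http://www.lisa.org/srx20"  # SRX 2.0 namespace


def build_srx(no_break_terms, rule_name="Custom"):
    """Return SRX 2.0 text with one no-break rule per term plus a generic break rule."""
    ET.register_namespace("", SRX_NS)  # serialize with a default namespace

    def q(tag):
        return f"{{{SRX_NS}}}{tag}"

    root = ET.Element(q("srx"), {"version": "2.0"})
    ET.SubElement(root, q("header"), {"segmentsubflows": "yes", "cascade": "no"})
    body = ET.SubElement(root, q("body"))
    language_rules = ET.SubElement(body, q("languagerules"))
    language_rule = ET.SubElement(
        language_rules, q("languagerule"), {"languagerulename": rule_name}
    )

    # Exceptions first: do not break after a listed term followed by whitespace.
    for term in no_break_terms:
        rule = ET.SubElement(language_rule, q("rule"), {"break": "no"})
        ET.SubElement(rule, q("beforebreak")).text = r"\b" + re.escape(term)
        ET.SubElement(rule, q("afterbreak")).text = r"\s"

    # Generic rule (an assumption): break after sentence punctuation and whitespace.
    rule = ET.SubElement(language_rule, q("rule"), {"break": "yes"})
    ET.SubElement(rule, q("beforebreak")).text = r"[.?!]+"
    ET.SubElement(rule, q("afterbreak")).text = r"\s"

    # Map every language code to the rule set defined above.
    map_rules = ET.SubElement(body, q("maprules"))
    ET.SubElement(
        map_rules,
        q("languagemap"),
        {"languagepattern": ".*", "languagerulename": rule_name},
    )
    return ET.tostring(root, encoding="unicode")


# "Acme." is a hypothetical product name ending with a dot.
print(build_srx(["Acme."]))
```

The printed XML can be saved with an .srx extension and uploaded as described in the sections below.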
2.1. Account level
To change the segmentation rules for your entire account, click on "Settings", "My Account", "Segmentation Settings" and then click on "Add new segmentation file".
Once you've chosen your .SRX file, you'll need to select the source language you want to apply it to, activate the rules, and then click on "Save".
2.2. Organizational Unit level
You can configure your segmentation rules at the Organizational Unit level as well. It's important to note that this configuration will override any settings made at the Account level.
To change the segmentation rules for an Organizational Unit, open the one you want to update, click on the "Segmentation Settings" tab and then click on "Add new segmentation file".
Once you've chosen your .SRX file, you'll need to select the source language you want to apply it to, activate the rules, and then click on "Save".
If any questions were not answered in this article, please don't hesitate to contact our support team at help@bureauworks.com.