Bulk Indexing Documents

The bulk index documents endpoint indexes all the documents of a custom datasource through a series of /bulkindexdocuments requests that share a common uploadId. Bulk indexing fully replaces the list of documents stored in Glean: after a successful bulk upload, all documents that were not part of the most recent upload are deleted asynchronously.

Similar bulk indexing endpoints exist for other objects such as users, groups, employees, and teams.

Choosing `/indexdocuments` vs `/bulkindexdocuments`

When deciding between /indexdocuments and /bulkindexdocuments, it's important to understand their primary functions and use cases:

/bulkindexdocuments: This endpoint is designed for completely refreshing the datasource. It replaces the existing corpus with the documents provided; documents that are not part of the upload are deleted. Use this endpoint when you need to replace the existing corpus and upload all documents anew.
/indexdocuments: This endpoint is intended for incremental updates. It lets you add a batch of new documents or update existing ones without affecting the other documents in the index. Choose this option when you want to keep the existing documents intact while adding or updating specific documents.

When to use each endpoint:

Use /bulkindexdocuments:
- When you need to perform a full refresh of the datasource.
- When all existing documents need to be replaced with a new set of documents.
Use /indexdocuments:
- When you need to add new documents to the existing index.
- When you need to update specific documents while keeping the rest of the index unchanged.

By selecting the appropriate endpoint based on your needs, you can efficiently manage your document indexing process.

Making your first successful request to `/bulkindexdocuments`

Here is a sample request to the /bulkindexdocuments endpoint.

cURL

curl -X POST https://customer-be.glean.com/api/index/v1/bulkindexdocuments \
  -H 'Authorization: Basic <Token>' \
  -d '
{ 
  "uploadId": "test-upload-id", 
  "isFirstPage": true, 
  "isLastPage": true, 
  "forceRestartUpload": true,
  "datasource": "gleantest",
  "documents": [
    {
      "datasource": "gleantest",
      "objectType": "EngineeringDoc",
      "id": "test-doc-1",
      "title": "How to bulk index documents",
      "body": {
        "mimeType": "text/plain",
        "textContent": "This doc will help you make your first successful bulk index document request"
      },
      "permissions": {
        "allowedUsers": [
          {
            "email": "myuser@bluesky.test",
            "datasourceUserId": "myuser-datasource-id",
            "name": "My User"
          }
        ],
        "allowAllDatasourceUsersAccess": true
      },
      "viewURL": "https://www.glean.engineering.co.in/test-doc-1",
      "customProperties": [
        {
          "name": "Org",
          "value": "Infrastructure"
        }
      ]
    }
  ] 
}'

Python

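import glean_indexing_api_client
from glean_indexing_api_client.api import documents_api
from glean_indexing_api_client.model.bulk_index_documents_request import BulkIndexDocumentsRequest
from glean_indexing_api_client.model.content_definition import ContentDefinition
from glean_indexing_api_client.model.document_definition import DocumentDefinition
from glean_indexing_api_client.model.document_permissions_definition import DocumentPermissionsDefinition

# Please refer to the Getting Started page for more details on how to set up api_client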
document_api = documents_api.DocumentsApi(api_client)

documents = [
    DocumentDefinition(
        datasource="gleantest",
        object_type="EngineeringDoc",
        title="How to bulk index documents",
        id="test-doc-1",
        view_url="https://www.glean.engineering.co.in/test-doc-1",
        body=ContentDefinition(
            mime_type="text/plain",
            text_content="This doc will help you make your first successful bulk index document request",
        ),
        permissions=DocumentPermissionsDefinition(allow_anonymous_access=True),
    )
]

bulk_index_documents_request = BulkIndexDocumentsRequest(
    upload_id="test-upload-id",
    datasource="gleantest",
    documents=documents,
    is_first_page=True,
    is_last_page=True,
    force_restart_upload=True,
)

# example passing only required values which don't have defaults set
try:
    document_api.bulkindexdocuments_post(bulk_index_documents_request)
except glean_indexing_api_client.ApiException as e:
    print("Exception when calling DocumentsApi->bulkindexdocuments_post: %s\n" % e)

Let's look at the different fields you need to successfully index documents to Glean. Note that this is just a sample request with the minimal fields required to index content. For an exhaustive list of fields, refer to the API reference.

Bulk upload model

The bulk upload endpoints delete all entries that are not part of the most recent upload. For example, the /bulkindexdocuments endpoint deletes all documents that are not present in the most recent upload.

Concurrent uploads are not allowed, i.e., you cannot start a new upload before the previous upload has finished.

There are some fields that are common across all bulk upload endpoints. We will be describing them here:

uploadId

This is the ID that uniquely identifies an upload. Use the same uploadId for every paginated request that belongs to one upload, and a different uploadId for each new upload.

isFirstPage

This denotes whether the page being uploaded is the first page. It must be true for the first request of an upload and false for all subsequent requests. The request fails if you start a new upload before the previous upload has finished.

isLastPage

This denotes whether the page being uploaded is the last page. It must be true for the last request of an upload and false for all other requests. You cannot start a subsequent upload until the last page (the page with isLastPage = true) has been uploaded. If you want to start a new upload ignoring the previous upload state, use the forceRestartUpload field.

forceRestartUpload

This is required if you want to start a new upload while the previous upload has not finished or has failed. If the previous upload was unsuccessful and this flag is not set, the request fails.
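The interplay of these fields can be sketched as a small helper that splits a document list into request bodies. This is only an illustration, not part of any Glean client: the function name `build_bulk_pages`, the default page size of 100, and the choice to send forceRestartUpload only on the first page are assumptions made for the example.

```python
import uuid
from typing import Any, Dict, Iterator, List


def build_bulk_pages(
    documents: List[Dict[str, Any]],
    datasource: str,
    page_size: int = 100,  # hypothetical page size, not a documented limit
    force_restart: bool = False,
) -> Iterator[Dict[str, Any]]:
    """Yield one /bulkindexdocuments request body per page of documents."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    # One uploadId shared by every page of this upload.
    upload_id = f"upload-{uuid.uuid4()}"
    starts = range(0, len(documents), page_size)
    for index, start in enumerate(starts):
        is_first = index == 0
        page: Dict[str, Any] = {
            "uploadId": upload_id,
            "isFirstPage": is_first,
            "isLastPage": index == len(starts) - 1,
            "datasource": datasource,
            "documents": documents[start:start + page_size],
        }
        # Assumption: the restart flag only matters when a new upload begins.
        if is_first and force_restart:
            page["forceRestartUpload"] = True
        yield page
```

Each yielded dictionary can be serialized as the JSON body of one request. An empty document list yields no pages, so the helper never starts an upload that would replace the corpus with nothing.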

In addition to bulk upload fields, we also have:

disableStaleDocumentDeletionCheck

The /bulkindexdocuments endpoint asynchronously deletes all documents that weren't part of the most recent upload session. An erroneous bulk upload can therefore wipe far more documents than intended. To mitigate this, a deletion check pauses the deletion of stale documents for 7 days if the percentage of documents being deleted exceeds 20%. If you intentionally want to delete more than 20% of your previously uploaded documents, specify disableStaleDocumentDeletionCheck = true, which disables this check and allows the documents to be deleted. Note that documents are deleted asynchronously. If you want deletions to take effect immediately, use the /processalldocuments endpoint.
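Before sending an upload, a client can estimate whether it would trip this check. The sketch below assumes the percentage is measured against the set of previously uploaded document IDs; how Glean computes it internally is not specified here, and the helper names are hypothetical.

```python
from typing import Iterable, Set

DELETION_THRESHOLD_PERCENT = 20.0  # threshold stated in the text above


def stale_percentage(previous_ids: Iterable[str], new_ids: Iterable[str]) -> float:
    """Percentage of previously uploaded IDs that are absent from the new upload."""
    previous: Set[str] = set(previous_ids)
    if not previous:
        return 0.0
    stale = previous - set(new_ids)
    return 100.0 * len(stale) / len(previous)


def would_pause_deletion(previous_ids: Iterable[str], new_ids: Iterable[str]) -> bool:
    """True if the estimated stale share exceeds the 20% threshold."""
    return stale_percentage(previous_ids, new_ids) > DELETION_THRESHOLD_PERCENT
```

If `would_pause_deletion` returns True and the large deletion is intentional, that is the case where setting disableStaleDocumentDeletionCheck = true applies.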

Document model

The following is the basic document model used for indexing a new document to Glean. There are other fields too which you can use for advanced functionality. You can refer to them in the API reference docs here.

datasource

Represents the document datasource.

id

This is a unique identifier for the document. Each document should have a unique id.
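Since every page of an upload shares one corpus, a quick client-side check for repeated IDs can catch mistakes before any request is sent. This is a minimal sketch; the function name `find_duplicate_ids` is hypothetical, and documents are assumed to be plain dictionaries shaped like the JSON sample above.

```python
from collections import Counter
from typing import Any, Dict, List


def find_duplicate_ids(documents: List[Dict[str, Any]]) -> List[str]:
    """Return the document IDs that occur more than once, sorted."""
    counts = Counter(doc["id"] for doc in documents)
    return sorted(doc_id for doc_id, count in counts.items() if count > 1)
```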

body

This is used to specify the content which will be used to populate the document body. You also need to specify the mime type of the content.

allowedUsers

This represents a list of users who will be able to view this document. A user is represented by three fields: email, datasourceUserId, and name. You do not need to populate the datasourceUserId field if you specified isUserReferencedByEmail = true when adding the datasource config. In that case, the email is used to identify a user for permissions.
Please note that you must index users before uploading documents to ensure that the permissions are captured. Refer to the Permissions tutorial for more details.
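The rule about datasourceUserId can be illustrated with a small builder for allowedUsers entries. The function name and the `referenced_by_email` parameter are hypothetical; raising an error when the ID is missing is an assumption about what a careful client would do, not documented server behavior.

```python
from typing import Dict, Optional


def build_allowed_user(
    email: str,
    name: str,
    datasource_user_id: Optional[str] = None,
    referenced_by_email: bool = False,  # mirrors isUserReferencedByEmail on the datasource
) -> Dict[str, str]:
    """Build one entry for the allowedUsers list."""
    user = {"email": email, "name": name}
    if datasource_user_id:
        user["datasourceUserId"] = datasource_user_id
    elif not referenced_by_email:
        # Without email-based referencing, the ID is what identifies the user.
        raise ValueError("datasourceUserId is needed unless users are referenced by email")
    return user
```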

viewURL

This represents the document view URL. It is a required field, and the request fails if it is missing for any document. The viewURL must also match the urlRegex specified when the datasource was created.
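A pre-flight check along these lines can flag documents whose viewURL is missing or does not match the datasource pattern. It is a sketch under assumptions: the function name is hypothetical, and `re.match` (anchored at the start of the string) is a guess at the matching semantics, which may differ from how Glean applies urlRegex.

```python
import re
from typing import Any, Dict, List


def find_invalid_view_urls(documents: List[Dict[str, Any]], url_regex: str) -> List[str]:
    """Return IDs of documents with a missing or non-matching viewURL."""
    pattern = re.compile(url_regex)
    invalid = []
    for doc in documents:
        view_url = doc.get("viewURL")
        if not view_url or not pattern.match(view_url):
            invalid.append(doc["id"])
    return invalid
```

For the sample document above, a hypothetical urlRegex such as `https://www\.glean\.engineering\.co\.in/.*` would accept its viewURL.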

allowAnonymousAccess

This can be set to true if anyone who is signed into Glean can view search results for the document, even if they are not a user of this custom datasource.

allowAllDatasourceUsersAccess

This can be set to true if all users of the datasource (as uploaded using the Identity APIs) can view this document.

customProperties

This is a list of name-value pairs. These properties populate additional facets, which let you search with operators like "Org":"Infrastructure" in Glean. Note that property names must be predefined when the datasource is created.
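As an illustration, a plain dictionary can be converted into the list form shown in the sample request while rejecting names that were not predefined. The function name and the `allowed_names` parameter are hypothetical; the caller is assumed to know which property names the datasource defines.

```python
from typing import Any, Dict, Iterable, List


def to_custom_properties(
    properties: Dict[str, Any], allowed_names: Iterable[str]
) -> List[Dict[str, Any]]:
    """Convert {"Org": "Infrastructure"} into [{"name": "Org", "value": "Infrastructure"}]."""
    unknown = sorted(set(properties) - set(allowed_names))
    if unknown:
        raise ValueError(f"property names not defined on the datasource: {unknown}")
    return [{"name": name, "value": value} for name, value in properties.items()]
```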

Next steps

You can check the status of your document using our debugging/troubleshooting APIs. Refer to their documentation for details on how to use them.
For the indexed document to show up in the Glean UI, the datasource must be enabled for search. For now, Glean needs to enable it internally, but in the future this will be available through the Glean Admin Console. Once these steps are done, you should be able to search for the indexed document in Glean when logged in as a user who has permission to view it.
Note that it takes around 15-20 minutes for documents to be indexed and appear in your Glean UI.
