KB "Website" datasource not-discovering non-home p...
# 🤝help
w
In a KB, I'm using a publicly viewable, but not search-engine-indexed, Google Site as the main data source. This Google Site has a home page, with several other pages linked from the main nav. In the KB edit interface -> "Add A Website Source" -> content found on "A Website" -> enter URL -> "Discover pages", only the home page is 'discovered', and the main nav is evidently not followed. When I test the discovery with a different (commercial) website, 'discover' does 'see' pages linked from the source URL as expected, so I don't think the discovery function is broken in a general sense. However, I would like to understand what governs the discovery behavior so that I can get the Google Sites data source properly discovered and ingested by the KB. All guidance is much appreciated! Side note: I believe the "discovery" behavior changed sometime in the last 3 or 4 weeks. In my initial testing, it did seem to follow links on Google Sites, but I didn't capture evidence of that, so I can't say for certain.
f
Can you try adding specific pages instead? Go to the website source and read the text that pops up with it. There you will see that you can switch it to specific pages instead of discovery.
w
@fresh-fireman-491 thanks for the reply! I did try adding the site as "Specific Web Pages" in the edit interface, but pages linked from the main nav were still not detected. In a real-world situation, one of these website sources may contain any number of pages, so it would be preferable for the pages linked from the nav to be discovered automatically when the site is added as a source, rather than having to add them as individual pages. The confusing thing (for me) is that page discovery does work for some website sources, and the Google Sites nav structure seems pretty standard, so I'm having a hard time figuring out why it's not being 'discovered' in the same way when added as a KB source.
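One way to sanity-check the "pretty standard nav" assumption is to list the links that are present in the page's raw HTML, since a crawler that does not run JavaScript can only follow those. The sketch below is only a diagnostic under that assumption: how the KB discovery actually finds pages is not documented in this thread, and `LinkCollector` and `same_site_links` are hypothetical helper names, not part of any KB API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href="..."> in the HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def same_site_links(html, base_url):
    """Return de-duplicated links that stay on the same host as base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen = []
    for link in collector.links:
        if urlparse(link).netloc == host and link not in seen:
            seen.append(link)
    return seen
```

Feeding it the HTML saved from "view source" on the home page (not the rendered DOM) shows which subpages a simple link-following crawler could reach. If the nav links are missing from that list, the nav is probably being built client-side.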
f
Sounds weird. I don't think I can do much more, but maybe @bumpy-butcher-41910 can help us here, he's a real expert!
b
it's likely because it's not being indexed by a search engine... do you have a similar site that is being indexed that you can reference against?
w
ah! that definitely applies here, the site in question is not indexed by search engines -- I'll put up another one that is, for comparison purposes -- @bumpy-butcher-41910 is that (i.e., whether the site is indexed by search engines) what governs the discovery behavior?
Interesting! I created another site with the same structure, left search indexing enabled, and when adding it to the KB, again only the home page (and not any of the several linked subpages) was discovered
b
hm, in theory that's what should be governing the ability for a bot to find pages on a site
if the new site you're using to test is brand new it's possible it hasn't been crawled by any search engines yet
w
Makes sense -- mucho thanks to both of you fellas! Y'all are the best
b
\o/
keep me posted btw, would be happy to get to the bottom of this
w
based on the info already provided, I feel pretty confident that the search-indexing factor is the root of the issue -- I unpublished the other comparison site already, but will republish another one on my personal account and try it after the crawlers have had a few hours to index it, to confirm
@bumpy-butcher-41910 just to make sure I understand what governs the behavior -- is it that the KB website discovery is respecting robots.txt / the robots noindex directive?
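For anyone wanting to test that hypothesis on their own site, here is a minimal sketch that checks the two signals mentioned: whether a robots.txt would block a URL, and whether a page carries a `<meta name="robots">` noindex directive. It uses only the Python standard library and works on text you have already downloaded. Whether the KB discovery honors either signal is unconfirmed in this thread; `allowed_by_robots_txt` and `has_noindex` are hypothetical helper names.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() == "robots":
            # content looks like "noindex, nofollow"
            for directive in attr.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.append(directive)


def allowed_by_robots_txt(robots_txt, url, user_agent="*"):
    """True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def has_noindex(html):
    """True if the page HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

Running both checks against the non-indexed site and an indexed comparison site would show whether the two actually differ in these signals. A noindex directive can also be sent as an `X-Robots-Tag` HTTP header, which this sketch does not inspect.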