marit
Posts:8
| 03/26/2008 10:46 AM |
|
| Hello
I've installed Seamus with full-text search. I can see that it is indexing my files and folders. How can I get Seamus to index all my tabs? It is important that it crawls/spiders through all tabs, even those that have custom-built modules.
Is there something that I have to set?
Thank you very much.
marit
Current settings in Seamus:
If No Records are Returned: Show "No Records Found"
Update Index Every: 6
Global Crawler: Yes
All other settings are empty/not set. |
|
John Henley
Posts:25
| 03/26/2008 1:34 PM |
|
I too am very confused. I purchased this because your features said: "The Venexus Seamus module is an aggregation module that will crawl/spider your DotNetNuke portal (Tabs, Modules, Folders, and Files), external websites (including non-DotNetNuke websites), and any RSS feed." That led me to believe that it would crawl/spider HTML content on non-DNN sites which don't have RSS feeds. Is that correct? If so, how do I set it up? |
|
John Henley
Posts:25
| 03/26/2008 1:37 PM |
|
| For example, I see this in the errors:
ErrorID: 153
When: 3/26/2008 1:36:56 PM
URL: http://www.danalytics.com/guru/letter/
Message: 200 - text/html |
|
John Henley
Posts:25
| 03/28/2008 3:07 PM |
|
Is this really an error? Should questions like this be posted in the issue tracker or asked here? |
|
John Henley
Posts:25
| 03/28/2008 3:11 PM |
|
| Can you help me understand what needs to be done in order to crawl/spider HTML content on non-DNN sites which don't have RSS feeds? Do I just add the URL as a feed? |
|
tmunn78
Posts:1
| 03/28/2008 3:39 PM |
|
| You can add items to be crawled through Seamus via the Add Feed link. It will give you an option for Feed or Website.
Also, look at this link for reference.
http://www.venexus.com/Support/ProductForums/tabid/132/forumid/2/postid/35/view/topic/Default.aspx
|
|
John Henley
Posts:25
| 03/31/2008 11:32 AM |
|
"It will give you an option for Feed or Website" I don't have that option when using the Add Feed link. There is just a 'Feed URL' field. Is there a difference between an RSS URL and an HTML URL? |
|
Jeff Smith
Posts:131
| 03/31/2008 11:51 AM |
|
| Make sure you enable "Users Can Add Feeds" in the module settings. When you do, you should see a link at the bottom of the main view on Seamus that says "Add Site". When you click it, you should see radio buttons for RSS Feed or Site URL, and you can add a Title and URL. |
|
marit
Posts:8
| 04/07/2008 8:44 AM |
|
| When I add a URL, say "www.venexus.com", will Seamus then add just the homepage, or the entire site with all tabs?
Thanks very much! |
|
John Henley
Posts:25
| 04/07/2008 10:05 AM |
|
| I'd like to understand this better as well. Does it work for non-DNN sites? |
|
Jeff Smith
Posts:131
| 04/07/2008 11:55 AM |
|
| Currently, if you add a URL like http://www.venexus.com, Seamus will crawl that page and all other pages it finds on the site using a traditional crawling method, provided Global Crawler is enabled. As it crawls http://www.venexus.com, each URL found on a page is added to the aggregation queue. It will also go on to index other sites that are linked from the Venexus website. For example, if we have a link to www.dotnetnuke.com, it will begin indexing that site as well. The Global Crawler option is limited only by the exclusion list (Edit Seamus > Manage Excludes) for the sites it will crawl. That is why we provide the Moderation option and the VenexusSearchQueue to manage any pages/documents that are crawled.
There is currently no way to index only a single site (that is, to ignore external URLs outside a specified domain). However, in a future release we will be adding advanced domain management (placeholder in Edit Seamus > Manage Domains). From there you will be able to approve/ban domains to be crawled. When you add a domain in Manage Domains, it will automatically be added to the aggregation queue, and only URLs containing the approved domain name will be crawled.
This is a fairly large piece of work, with a lot of heavy lifting in the aggregation queue and index related to approving or banning a domain. It is planned for the 2.6 release of Seamus. |
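For readers trying to picture the behavior described above, here is a minimal Python sketch of a queue-based crawler with an exclusion list and an optional approved-domain list. It is not Seamus code and does not reflect how Seamus is implemented. The names `crawl`, `get_links`, `excluded`, `approved`, `FAKE_WEB`, and `ads.example.com` are all hypothetical, and a small in-memory link table stands in for real page fetches.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seed_url, get_links, excluded=(), approved=None, max_pages=100):
    """Breadth-first crawl starting at seed_url (illustrative only).

    get_links(url): hypothetical helper returning the URLs found on a page.
    excluded: host names that are never crawled (an exclusion list).
    approved: if given, only these host names are crawled; None means
              links to any host are followed (global crawling).
    """
    excluded = set(excluded)
    approved = None if approved is None else set(approved)

    def allowed(url):
        host = urlparse(url).netloc.lower()
        if host in excluded:
            return False
        return approved is None or host in approved

    queue = deque([seed_url])   # pages waiting to be crawled
    seen = {seed_url}           # avoids queueing the same URL twice
    indexed = []
    while queue and len(indexed) < max_pages:
        url = queue.popleft()
        if not allowed(url):
            continue
        indexed.append(url)
        # Every URL found on the page goes onto the queue.
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


# Tiny in-memory "web" standing in for real HTTP requests (made-up links).
FAKE_WEB = {
    "http://www.venexus.com/": [
        "http://www.venexus.com/Support",
        "http://www.dotnetnuke.com/",
    ],
    "http://www.venexus.com/Support": ["http://www.venexus.com/"],
    "http://www.dotnetnuke.com/": ["http://ads.example.com/"],
    "http://ads.example.com/": [],
}


def fake_links(url):
    return FAKE_WEB.get(url, [])
```

Calling `crawl("http://www.venexus.com/", fake_links)` follows links onto the other hosts, which mirrors the global crawling described in the post. Passing `excluded={"ads.example.com"}` skips that host, and passing `approved={"www.venexus.com"}` keeps the crawl on a single site, which is roughly what the planned domain management is meant to allow.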
|
John Henley
Posts:25
| 04/07/2008 4:43 PM |
|
Thank you for updating the Seamus documentation; that is very helpful. I do have a question, of course, regarding the "limitations":
"If you have set security on a page or file that is not public and requires authentication, Seamus will not be able to crawl and index that page or document."
Does this apply regardless of which of the four methods is used to crawl, or just to using the scheduled task? |
|
Jeff Smith
Posts:131
| 04/08/2008 1:35 PM |
|
Currently, Seamus does not handle authentication with any crawl method. However, this has been moved up in priority because a client needs that functionality. It will be in the next release (2.5.5). |
|
marit
Posts:8
| 04/09/2008 5:17 AM |
|
Limiting the Global Crawler to a defined list of domains is an important feature for us. When do you expect version 2.6 to be released? |
|
Jeff Smith
Posts:131
| 05/29/2008 10:04 AM |
|
| This has been released. |
|