Hey Linux community,

I’m struggling with a file management issue and hoping you can help. I have a large media collection spread across multiple external hard drives. Often, when I’m looking for a specific file, I can’t remember which drive it’s on.

I’m looking for a file indexing and search tool that meets the following requirements:

  • Ability to scan multiple locations
  • Option to exclude specific folders or subfolders from both scan and search
  • File indexing for quicker searches
  • Capability to search indexed files even when the original drive is disconnected
  • Real-time updates as files change

Any recommendations for tools that meet most or all of these criteria? It would be a huge help in organizing and finding my media files.

Thanks in advance for any suggestions!

  • boredsquirrel@slrpnk.net
    6 months ago

    KDE’s Baloo might suit that.

    • runs in background
    • indexes
    • index probably kept?
    • can exclude directories

    I don’t know which tool uses Baloo to search; KFind doesn’t.

  • Euro@lemmy.ml
    6 months ago

    Funnily enough I’ve been looking for a similar utility.

    I use Jellyfin and YaCy for my local media and documents.

    Jellyfin isn’t really a search engine, and it may or may not work if you disconnect the drives.

    In my experience with shows and movies, it does great with metadata and with displaying what I have in my collection. However, it’s not as good for searching images/videos, as you have to search for the exact image/video name (unless the file has metadata).

    YaCy, on the other hand, is much more like a traditional search engine, index and all. It’s great for documents (HTML, Markdown, plain text, even DOCX), but doesn’t do well with media files: it can’t pull their metadata, so you have to search all media by title.

    I don’t think YaCy has real-time updating; if it does, I don’t know how to enable it.

    Both YaCy and Jellyfin have a way to exclude things, but they work completely differently: YaCy has a URL-based blacklist, while Jellyfin only displays content from the folders you point it at (basically a whitelist).

    There was a program that I had stumbled across that was able to index a photos folder using image recognition to generate a description that you could search. I have since forgotten the name of the program but it does exist, and if I find it again I’ll update this comment.

    Personally I want something that works like yacy for traditional documents, and can use image recognition for images, but I have yet to find it.

    EDIT: I found the program that does image recognition: sist2. I’ve tried it once before; in my experience the SQLite search is a bit janky but works decently enough, imo. I haven’t tried the other indexing method.

  • tla@lemmy.world
    6 months ago

    To index file paths: GNU locate. Creating the index with updatedb is quick, too. To search: locate <part of path>, e.g. locate artist.flac

  • ssm@lemmy.sdf.org
    6 months ago

    Ability to scan multiple locations

    find /path/one /path/two [expression]
    

    Option to exclude specific folders or subfolders from both scan and search

    find /some/path -type d \( -name exclusion1 -o -name exclusion2 ... \) -prune -o [expression] -print
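
For directory exclusion in find, -prune is the piece that actually stops the descent; a negated -name test on its own only filters what gets matched, and find still walks into the directory. A minimal sketch of the pruning form, assuming a throwaway tree (the names keep, cache, and tmp are made up for illustration):

```shell
#!/bin/sh
# Build a small throwaway tree; all directory and file names are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/keep" "$demo/cache/deep" "$demo/tmp"
touch "$demo/keep/song.flac" "$demo/cache/deep/thumb.jpg" "$demo/tmp/junk.part"

# List regular files under $1, skipping everything below directories
# named "cache" or "tmp". -prune stops find from descending into them;
# the branch after -o handles everything that was not pruned.
list_files_excluding() {
    find "$1" -type d \( -name cache -o -name tmp \) -prune -o -type f -print
}

list_files_excluding "$demo"
```

Only the file under keep/ is listed; nothing below cache/ or tmp/ is visited at all, which also saves time on large trees.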
    

    File indexing for quicker searches

    Not indexing, but you can speed up the per-file work by running it in parallel, if your xargs has the -P extension.

    # find -print0 is an extension which separates the files found with '\0'
    # xargs -0 is an extension that splits its input on '\0' instead of spaces and newlines
    # xargs -P _x_ is an extension that runs up to _x_ invocations of _utility_ in parallel (separate processes) instead of in serial
    find /some/path [expression] -print0 | xargs -0 -P "$(command_to_get_cpu_threads)" _utility_ [args]
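
A concrete version of that pipeline might look like the following. This is only a sketch: cksum stands in for _utility_, the function names are made up, and it assumes an xargs that supports -0 and -P (GNU and BSD xargs both do).

```shell
#!/bin/sh
# Pick a worker count: nproc (GNU coreutils) if present, else getconf, else 1.
cpu_count() {
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Checksum every regular file under $1, running up to cpu_count cksum
# processes at once. -print0 / -0 keep names containing spaces intact;
# -n 32 hands each cksum process at most 32 files.
checksum_tree() {
    find "$1" -type f -print0 | xargs -0 -n 32 -P "$(cpu_count)" cksum
}
```

Usage would be something like checksum_tree /some/path > checksums.txt. The output order is not stable between runs because the workers finish at different times, and on spinning external drives the disk, not the CPU, is usually the bottleneck.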
    

    Capability to search indexed files even when the original drive is disconnected

    I don’t know what the use case for this is, but you could write a cron script that periodically dumps the names of the files under a mount point to a path like ~/var/log/something, or use a domain-specific unmount script that dumps the paths before unmounting.
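
A rough sketch of that dump-and-search idea, with one listing per drive so a hit also tells you which drive to plug in. The index location, the function names, and the drive labels are all assumptions for illustration:

```shell
#!/bin/sh
# Where the per-drive listings live. This default is an assumption; any
# directory on the internal disk works.
INDEX_DIR="${INDEX_DIR:-$HOME/.local/share/drive-index}"

# index_drive LABEL MOUNTPOINT [EXCLUDED_DIR_NAME]
# Write every file path under MOUNTPOINT to $INDEX_DIR/LABEL.txt,
# skipping directories with the given name (default: lost+found).
index_drive() {
    label=$1
    mount=$2
    exclude=${3:-lost+found}
    tmp="$INDEX_DIR/$label.txt.new"
    mkdir -p "$INDEX_DIR"
    find "$mount" -type d -name "$exclude" -prune -o -type f -print > "$tmp"
    # Keep the previous listing if the scan came back empty
    # (for example because the drive is not mounted).
    if [ -s "$tmp" ]; then
        mv "$tmp" "$INDEX_DIR/$label.txt"
    else
        rm -f "$tmp"
        return 1
    fi
}

# search_index PATTERN
# Case-insensitive search across all saved listings. Works with the
# drives unplugged; each hit is printed as "LABEL.txt:/path/to/file".
search_index() {
    ( cd "$INDEX_DIR" && grep -i -H -e "$1" -- *.txt )
}
```

A cron job or unmount script would call something like index_drive movies-a /mnt/movies-a, and search_index 'artist.*flac' then answers "which drive was that on?" from the saved listings alone.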

    Real-time updates as files change

    Would require a non-portable script that stores each file’s mtime in an array and compares the old mtime against the new mtime using stat, and then loop. Maybe implement as a daemon.

    • solrize@lemmy.world
      6 months ago

      [search indexed files that are offline] One would hope this is not possible.

      I think the idea is to store the search index in a separate place from the files. For indexing text though, I’ve found that the index is comparable in size to the file itself. It’s not entirely clear to me what OP wants to search. Something like email? Obviously if it’s just metadata for media files (kilobyte text description of a gigabyte video) then the search index can be tiny.

      Real-time updates as files change

      Would require non-portable script that stores each file’s mtime in an array and compares the old mtime against the new mtime using stat, and then loop. Maybe implement as a daemon.

      That is what inotify is for.
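
As a rough illustration of the inotify route from a shell script: the sketch below assumes inotifywait from the inotify-tools package is installed, and keeps a plain one-path-per-line index file in sync. The function names and the index format are made up for this example.

```shell
#!/bin/sh
# apply_events INDEX_FILE
# Read "EVENTS|PATH" lines on stdin and keep INDEX_FILE (one path per
# line) in sync: add created or moved-in files, drop deleted or
# moved-out ones. Directory events are skipped.
apply_events() {
    index=$1
    touch "$index"
    while IFS='|' read -r events path; do
        case $events in
            *ISDIR*) ;;
            *CREATE*|*MOVED_TO*)
                grep -qxF -e "$path" "$index" || printf '%s\n' "$path" >> "$index" ;;
            *DELETE*|*MOVED_FROM*)
                grep -vxF -e "$path" "$index" > "$index.tmp" || true
                mv "$index.tmp" "$index" ;;
        esac
    done
}

# watch_drive MOUNTPOINT INDEX_FILE
# Runs until interrupted. Assumes inotifywait (inotify-tools) is installed.
watch_drive() {
    inotifywait -m -r -q -e create -e delete -e moved_to -e moved_from \
        --format '%e|%w%f' "$1" | apply_events "$2"
}
```

Something like watch_drive /mnt/movies-a ~/movies-a.txt would run for as long as the drive is mounted. Known gaps in this sketch: moving a whole directory does not generate events for the files inside it, file names containing newlines will confuse the reader loop, and recursive watches on very large trees can hit the kernel's inotify watch limit.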

      I realize your overall answer was mostly snark, but the problems mentioned really do take some work to solve. For example, if you want to index email, you want the indexer to understand email headers so it can do the right things with the timestamps and other fields. You can’t just chuck everything into a big generic search engine and press “blend”.

      I will mention git-annex which is for sort of a different problem, but it can help you manually track where your offline files are, more or less.

      • ssm@lemmy.sdf.org
        6 months ago

        Sorry, I have .world blocked, so I didn’t see your reply until now (wish I could block instances without blocking instance replies, but whatever).

        It’s not entirely clear to me what OP wants to search. Something like email? Obviously if it’s just metadata for media files (kilobyte text description of a gigabyte video) then the search index can be tiny.

        Yeah I amended my post earlier to recommend logging with a domain specific unmount script, but I don’t know why they want to do this.

        I realize your overall answer was mostly snark

        Apparently I’m so good at trolling I troll people even when I’m not trying to troll. :<

        This is what inotify is for

        If inotify works for you, that’s fine. I don’t have any experience with it; maybe I’ll look into it after this, if the use case ever comes up.

        You can’t just chuck everything into a big generic search engine and press “blend”

        Eh, regex (EREs) is good enough for 99% of use cases, honestly. For the other 1%, consider using an easier-to-parse file format.

        • solrize@lemmy.world
          6 months ago

          Yeah I amended my post earlier to recommend logging with a domain specific unmount script, but I don’t know why they want to do this.

          They have umpty jillion terabytes of video on a shelf full of external HDDs and they want to know what files are on which drives. In the old days we had racks full of mag tapes and had the same issue. It’s not something new.

          For info about inotify, try web search.

          For text search, you start needing real indexing once you’re over maybe a GB of text. Before that, you can live with grep or SQL tables or whatever.