
    Note to instagram.com
    I'm not using your API. I'm not republishing user content. Claims against this post were only made because I upset a user here; see this comment for details.

    The Plan
    To archive Instagram accounts with zero prejudice. I've made a start and have already downloaded 3400* accounts (2,236,271 files, 633GB). In talks with other archivers, what gets archived has been a concern of theirs. I've archived accounts containing animals, girls, cars, tattoos, etc. Some of you will only want to archive certain pages, but being selective will only hinder and slow the project over time.

    UPDATE: After running this project for a while and coding ways around the initial username scraping issues, we can now scrape 2 million usernames in 24 hours, so not half bad. I've allocated 300TB on my local network for storage and post-processing of the data. One box is storing 5TB+ / 27,642,727 items.

    If you're a serious archiver and have something to offer please read on.

    The Process
    I'm using RipMe 1.4.1 to get the accounts. This is a very reliable Java tool, now maintained by /u/metaprime and originally written by /u/4_pr0n. The problem with this tool is that, while it reliably downloads the images and videos, it doesn't collect the metadata of the posts (post time, location, caption, tags, comments). I'm open to new ideas if anyone has a better tool in mind that can get the post metadata as well.

    This is how I'm currently getting the data:

    -- /IGArchiving ~ Parent directory

    ripme.jar ~ Obvious.

    rip-parallel.sh ~ Allows us to pass user lists to ripme using parallel to speed up the downloading process.

    --- /rips ~ Output directory

    tar.sh ~ tars the user directories, removes original files, adds current date to the filename.
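For anyone who wants to see what the tar step amounts to, here is a rough Python sketch of the behaviour described for tar.sh. It is not the actual script: the function name `tar_user_dirs`, the `<user>_<YYYY-MM-DD>.tar` naming and the uncompressed tar format are all assumptions made for illustration.

```python
import shutil
import tarfile
from datetime import date
from pathlib import Path


def tar_user_dirs(rips_dir, today=None):
    """Tar each user directory under rips_dir, then delete the original.

    Hypothetical stand-in for tar.sh; the archive name pattern is an assumption.
    """
    rips = Path(rips_dir)
    stamp = (today or date.today()).isoformat()
    archives = []
    for user_dir in sorted(p for p in rips.iterdir() if p.is_dir()):
        archive = rips / f"{user_dir.name}_{stamp}.tar"
        with tarfile.open(str(archive), "w") as tar:
            # Store paths relative to the user directory name
            tar.add(str(user_dir), arcname=user_dir.name)
        # Remove the originals only after the archive has been written and closed
        shutil.rmtree(str(user_dir))
        archives.append(archive)
    return archives
```

Calling `tar_user_dirs("rips")` from the parent directory would leave one dated tar per account in the output directory.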

    ========================

    UPDATE: We're now using a new user list format. Our new tool (to be released) outputs the user list one user per line without prefixing https://instagram.com/, so we now pass the lists to rip-parallel.sh like so: ./rip-parallel.sh 1st_followers_list_2mil.txt (example list)

    ========================
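If you have lists lying around in both formats, something like the following Python sketch could normalise them to the one-user-per-line form. The helper names `read_usernames` and `profile_url` are made up for this example, and dropping duplicates is my own addition, not something the list tool is known to do.

```python
PREFIX = "https://instagram.com/"


def read_usernames(lines):
    """Yield bare usernames from lines with or without the URL prefix."""
    seen = set()
    for line in lines:
        name = line.strip()
        if name.startswith(PREFIX):
            name = name[len(PREFIX):]
        name = name.strip("/")
        # Skip blank lines and repeated users (deduplication is an assumption)
        if name and name not in seen:
            seen.add(name)
            yield name


def profile_url(username):
    """Rebuild the full profile URL for a bare username."""
    return PREFIX + username
```

For a file on disk, `list(read_usernames(open("1st_followers_list_2mil.txt")))` would give the cleaned list.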

    To speed up your downloads, take note of --jobs 3 in rip-parallel.sh; this is how many instances of ripme.jar will be spawned to download the Instagram accounts. I'm running 18 jobs on an 8-core mid-range Xeon with 32GB RAM. RAM overhead isn't bad at around 3GB, and the cores sit at 50-70% load once it gets going. Using parallel my traffic looks like this. On this hardware I could push it to 24 jobs at once, but this machine was running other tasks.
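To make the jobs idea concrete, here is a minimal Python sketch of the same pattern: a fixed-size pool that runs one ripme.jar process per account. This is not rip-parallel.sh itself (which uses parallel). The names `ripme_command` and `rip_all` are hypothetical, and I'm assuming RipMe is invoked with its `-u` URL option and that ripme.jar sits in the current directory.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ripme_command(username, jar="ripme.jar"):
    """Build the command line for one account (assumes RipMe's -u option)."""
    return ["java", "-jar", jar, "-u", f"https://instagram.com/{username}"]


def rip_all(usernames, jobs=3, run=subprocess.call):
    """Run one ripme.jar process per user, at most `jobs` at a time.

    Returns a dict mapping each username to the process exit code.
    """
    usernames = list(usernames)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Each worker thread blocks on its own child process
        codes = pool.map(lambda user: run(ripme_command(user)), usernames)
        return dict(zip(usernames, codes))
```

Threads are enough here because each worker just waits on an external Java process; the `jobs` argument plays the same role as `--jobs` in the shell script.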

    This process could be streamlined, but other than that it works.

    Storage
    Nobody wants to store hundreds of terabytes themselves, and with no foreseeable end or timescale on this project we will be pushing the tars (non-searchable) up to archive.org in the hope they don't mind :3 I'll be managing the items and keeping them between 5-800GB each; a generous user in our IRC will be helping me in that effort.
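One simple way to keep items under a size cap is to fill each item greedily until the next tar would push it over. The Python sketch below shows that idea only; `group_into_items` is a hypothetical helper, the cap is whatever limit you choose to pass in, and this is not necessarily how the items are actually being assembled.

```python
def group_into_items(tars, max_bytes):
    """Greedily group (name, size_bytes) pairs into items under max_bytes.

    A tar larger than max_bytes still gets an item of its own.
    """
    items, current, total = [], [], 0
    for name, size in tars:
        # Close the current item if this tar would push it over the cap
        if current and total + size > max_bytes:
            items.append(current)
            current, total = [], 0
        current.append(name)
        total += size
    if current:
        items.append(current)
    return items
```

Feeding it the tar names and sizes in upload order gives one list of tars per archive.org item.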

    Thank you to those donating storage boxes for this project!

    Goals of the project or why?
    As mentioned above, I want to grab any and all content. However, something interesting to note is that Google generally isn't caching all IG accounts, so images of girls from IG are often used to catfish people online. I'd like to build a database of the images we grab that is searchable by image/filename, much like Google reverse image search and /u/4_pr0n's i.rarchiv.es (code available here), in order to maintain another tool that works against the creepy catfishing folk.

    How you can help...
    UPDATE: Full steam ahead, nothing holding us back.

    A huge thank you to those donating storage, bandwidth and code to this project.

    The current hang-up is finding a quick way to scrape user IDs from Instagram. I made the list above the very slow way: by having an account, following 7500 users (the limit) and then scraping them from my own account. The process is slow because IG blocks you temporarily if you follow a large number of users in a short amount of time.

    I've searched and haven't found any free, reliable code/app to scrape users. However, there is this tool ($49.99) that seems to be near perfect for this: it lets you find users with a certain post, follower and following count. But here's hoping one of you can find/build a free way to scrape users! :D

    When downloading using the above process you can't even saturate a 100/100 line without parallelizing the ripme.jar process for each account, and I haven't looked into doing this yet. Any help here is appreciated, as 1Gbit+ lines are aplenty among us DataHoarders.

    You can follow the above process and help in the archiving effort.

