
Instagram Data Scraping


Reference Worksheets


Digital Research Skills for App Studies



App Studies Initiative / Digital Methods Initiative / Public Data Lab


Reference Worksheet II

Instagram Data Scraping

http://bit.ly/asirw-2

Version: October 2020






















App Studies Initiative

http://www.appstudies.org/ 

Digital Methods Initiative

University of Amsterdam

https://www.digitalmethods.net/ 

Public Data Lab
http://publicdatalab.org/

Department of Digital Humanities
King’s College London

https://www.kcl.ac.uk/ddh

Created by Liliana Bounegru, Marloes Geboers, Jonathan Gray, Anne Helmond,
Stijn Peeters, Fernando van der Vlist (alphabetical).

Content is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0)

Part of the Media Studies Reference Worksheets 

Contents



1. Introduction

1.1 Instagram Scraping

1.2 Tool: DMI Instagram Scraper

1.3 Installation (macOS)

1.4 Using the DMI Instagram Scraper

1.4.1. Tool explanation

1.4.2. Tool input and output

1.4.3. Reading result files and folders

2. Recipes

2.1. Hashtag analysis

2.1.1. Collecting and opening hashtag data from the DMI Instagram Scraper

2.2.2. Analysing top posts associated with a given hashtag

2.2.3. Analysing top accounts associated with a given hashtag

2.2.4. Creating image grids for top images associated with a given hashtag

2.2.5. Co-hashtag analysis with Gephi

2.2.6. Some other lines of inquiry and recipe ideas

2.2. Account Analysis

2.2.1. Recipe

3. Resources



1. Introduction



In this worksheet you will learn to:

  • Capture data from Instagram for given hashtags or account names.

  • Analyse top posts associated with a given hashtag.

  • Analyse top accounts associated with a given hashtag.

  • Create image grids for top images associated with a given hashtag.

  • Perform co-hashtag analysis with Gephi.

  • Explore (resonating) visual content.


1.1 Instagram Scraping

This worksheet introduces you to capturing data from Instagram for given hashtags or account names. With the DMI Instagram Scraper you can query Instagram using one or more hashtags or account names; the tool retrieves, among other things, timestamps, caption texts, comments, image URLs, account names, hashtags and the number of likes and comments. What does all this (meta)data do for us as researchers?


Studying Instagram content requires a methodological approach that differs from traditional content analysis. Image-based posts are multi-modal and networked by online publics, who constantly contextualize and recontextualize images. Leaver, Highfield and Abidin explain that images on social media are never isolated objects that can sensibly be studied without taking into account the visual and text-based information surrounding them: “What images we choose to share has meaning because of its form, its context, and our communicative intentions [...] This is aided by the surrounding contextual information, be it captions, hashtags, profile information, comments or other annotations” (2019, pp. 44-45).


Hashtags are extensively studied through their use, diffusion or repurposing. Mapping online protests and counter-protests and studying social visuality within diverging discursive spaces are among the most common uses (see among many others: Abidin, 2016; Freelon et al., 2016; Gallager et al., 2018; Meraz, 2018). To get a sense of the contextual space of Instagram content, hashtags are indispensable: they function as linguistic markers (Zappavigna, 2011) that reveal discursive frameworks. This is why it is common to query social platforms by hashtag; hashtags often point, further down the line, to relevant actors (enter account analyses). Furthermore, like other social platforms, Instagram allows engagement with content, most notably through commenting on and liking posts. Such affective investments also generate useful metadata pointing to the amplification of particular visual content patterns (Niederer & Colombo, 2019; Geboers, 2019; Pearce et al., 2018). See also the work on the resonance of climate change solutions on Instagram (DeGaetano, 2019), which exemplifies different ways of visualizing research on Instagram.


1.2 Tool: DMI Instagram Scraper

A number of scripts have been developed to capture Instagram data. This worksheet details the DMI Instagram Scraper, available from: https://github.com/digitalmethodsinitiative/dmi-instascraper/.


The DMI Instagram Scraper is built “on top of” Instaloader: it provides a simple user interface for it, so that you don’t need to work with command-line interfaces or Python to use it. The scraper was developed by the Digital Methods Initiative and is distributed under the Mozilla Public License.


The DMI Instagram Scraper can be downloaded for macOS and Windows from the GitHub repository linked above.


1.3 Installation (macOS)

  1. Download and double click dmi-instascraper-*-macOS.dmg to install.

  2. Drag “DMI Instagram Scraper” onto Applications.

  3. Go to Applications and open “DMI Instagram Scraper.app”

  4. You will receive the following notification: “‘DMI Instagram Scraper.app’ cannot be opened because the developer cannot be verified.”

  5. Click Cancel.

  6. Go to System Preferences > Security & Privacy > General > (scroll down) > “‘DMI Instagram Scraper.app’ was blocked from use because it is not from an identified developer” > click Open Anyway.

  7. You will receive the following notification: “macOS cannot verify the developer of ‘DMI Instagram Scraper.app’. Are you sure you want to open it?” Click Open.


1.4 Using the DMI Instagram Scraper

1.4.1. Tool explanation

The DMI Instagram Scraper is built on top of Instaloader, a ‘battle-tested, reliable MIT-licensed Python package’ (Bex T., 2020). Instaloader exposes its internally used methods and structures as a Python module, which makes it a powerful and intuitive Python API for Instagram and allows you to further customize how media and metadata are obtained. If you are familiar with Python or command-line apps, you may find it interesting to take a look at Instaloader itself.


Instaloader scrapes data directly from the Instagram website. As such, anything you query is scraped in the order in which it appears on Instagram. The hashtag search, for example, returns the top results for a given hashtag query, as listed in Instagram’s top search results (e.g. https://www.instagram.com/explore/tags/qamom/ for the “qamom” query). For more on how Instagram recommends and ranks content in the search and explore feature, see e.g. this article. Keep this in mind while collecting data: results are collected in an order determined by Instagram’s algorithm. The tool is also limited to the search features that Instagram’s website provides, so it is not possible to, for example, search within a given date range. In some cases a creative work-around is available. For example, you could scrape enough items to reach the start of your desired date range, and then delete from the result any posts that fall outside that range.
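The date-range work-around can also be applied after scraping with a few lines of Python. The following is a minimal sketch, not part of the tool: it assumes the `timestamp` column holds UNIX seconds (as described in Table 1), and the function name `filter_by_date` and the sample rows are made up for illustration.

```python
import csv
import io
from datetime import datetime, timezone


def filter_by_date(rows, start, end):
    """Keep only rows whose UNIX `timestamp` falls within [start, end)."""
    kept = []
    for row in rows:
        # Assumption: the column holds seconds since 1970-01-01 UTC.
        posted = datetime.fromtimestamp(float(row["timestamp"]), tz=timezone.utc)
        if start <= posted < end:
            kept.append(row)
    return kept


# Made-up sample standing in for a scraped result file.
sample_csv = io.StringIO(
    "id,timestamp\n"
    "post_a,1601510400\n"  # 2020-10-01 00:00 UTC
    "post_b,1598918400\n"  # 2020-09-01 00:00 UTC
)
rows = list(csv.DictReader(sample_csv))

start = datetime(2020, 9, 15, tzinfo=timezone.utc)
end = datetime(2020, 10, 15, tzinfo=timezone.utc)
in_range = filter_by_date(rows, start, end)
print([row["id"] for row in in_range])
```

To use this on your own data, replace the sample with `open("your-result-file.csv", newline="", encoding="utf-8")` and write the kept rows back out with `csv.DictWriter`.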



1.4.2. Tool input and output

For general use, the tool requires the following input:

  1. In ‘Query’, specify the #hashtag, or @username you wish to study.

  2. In ‘Items per Query’ specify how many results you want. The more results you request, the longer it will take for the tool to run. If you are running a test, specify 5.

  3. In ‘Also scrape’ determine whether you also need: 

    1. ‘Comments’: all comments underneath a post 

    2. ‘Photo files’: the images of a post

    3. ‘Metadata files’: the metadata of your photo files (typically you won’t need this)

  4. In ‘File name’ you can give your result file a proper name, e.g. 2020-10-05_hashtag-feminism.csv

  5. In ‘Folder to scrape’, choose the folder where your results will be saved.

  6. Hit ‘Start scraping’ and see the progress in the Status field.

  7. You can find the results in the folder you specified in step 5. Your results consist of two elements:

    1. Your result file that you named in step 4, e.g. 2020-10-05_hashtag-feminism.csv

    2. A folder with the same name as your result file.

  8. The .csv file is a comma-separated file that contains your results. It can be opened in your favorite spreadsheet software, such as Excel, or imported into Google Sheets (see the worksheet on Data Management for information on working with .csv files). If you selected the option to scrape ‘Photo files’, the folder contains your images as .jpg files and the metadata for your images as .json files. Image file names connect to the IDs in the .csv file. It is important to retain this connection between images and metadata, as you will need it further down the pipeline.


1.4.3. Reading result files and folders

Example of results in a #hashtag .csv file:



Your .csv file for a #hashtag contains a row for each result and has the following elements:


Table 1. Available data fields

Fields returned for #hashtag queries (the fields for @username queries are still to be documented):

id: the unique identifier of the Instagram post. When comments are scraped, each comment gets a numerical id of its own.


thread_id: the id of the post a row belongs to. When comments are scraped, each comment is a new row with its own numerical id, and this column keeps it connected to the id of the post it was left on.


parent_id: either the id of the post or, when a comment replies to another user (@mention) in the comment thread, the numerical id of the comment it responds to.


body: the caption of an image


author: the username of the poster of the image or comment 


timestamp: the UNIX timestamp of when the image was posted


type: content type (picture, video, or comment)


url: link to the image


thumbnail_url: link to the thumbnail of the image


hashtags: the hashtags used with the image, separated by commas


usertags: other Instagram users who have been tagged in the image


mentioned: other Instagram users who have been mentioned in the image caption


num_likes: the number of likes the image has received


num_comments: the number of comments the image has received


subject: unknown field.


photo_file: the location on your hard drive where the photo file (.jpg) is located


metadata_file: the location on your hard drive where the metadata (.json) is located
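To see how id, thread_id and parent_id fit together, the sketch below groups comment rows under the post they belong to. It is an illustration under assumptions: the column names are taken to be lower-case exactly as listed above, comment rows are taken to have the value “comment” in the type column, and the function name `group_comments` and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def group_comments(rows):
    """Split rows into posts (keyed by id) and comments (keyed by thread_id)."""
    posts = {}
    comments = defaultdict(list)
    for row in rows:
        if row["type"] == "comment":
            # thread_id points back to the post the comment was left on.
            comments[row["thread_id"]].append(row)
        else:
            posts[row["id"]] = row
    return posts, comments


# Made-up sample: one post with two comments, the second replying to the first.
sample_csv = io.StringIO(
    "id,thread_id,parent_id,author,type\n"
    "POST1,POST1,POST1,alice,picture\n"
    "101,POST1,POST1,bob,comment\n"
    "102,POST1,101,carol,comment\n"
)
posts, comments = group_comments(list(csv.DictReader(sample_csv)))

for post_id in posts:
    print(post_id, "has", len(comments[post_id]), "comments")
```

Check the actual header row and type values of your own file before relying on these names.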



2. Recipes



The recipes below are designed to help you get started with the DMI Instagram Scraper and familiarise yourself with some of its essential features for collecting and analysing Instagram data.


2.1. Hashtag analysis

2.1.1. Collecting and opening hashtag data from the DMI Instagram Scraper

This first recipe guides you through collecting data with the DMI Instagram Scraper, opening it for further exploration, and viewing the original posts about which data was gathered.


  1. Enter the hashtag that you’d like to query for

    1. In this case we can try “qamom”, which has emerged in relation to conspiracy theory communities. You can read more about how the “qamom” phenomenon emerged in this article.

  2. Fill in the “items per query”

    1. Here you may wish to think about how many posts are appropriate for your research question. This may also depend on how popular your hashtag is. In order to inform decisions about this you may try a small sample before gathering a larger number of posts. In this case we’ll try 100 posts as a sample. Please note that instructing the tool to scrape large numbers of posts might slow it down significantly. 

  3. Decide on whether you’d also like to scrape comments, photo files and/or metadata files.

    1. As we’re just trying a smaller “test run” to start with, we’ll leave these empty for now. Once you have decided on the final number of posts, you may want to obtain all of these elements so that you have copies of the photos, comments and metadata files that correspond to your dataset. Keeping copies of photos is essential because the web is not static; Instagram image URLs also expire, usually after a couple of months. Please note that instructing the tool to scrape this additional data might slow it down significantly.

  4. Give the file a name that means you’ll remember what it is :-)

    1. With digital methods research you’ll likely end up downloading lots of files and trying out lots of different things - and soon you’ll have a very cluttered desktop/downloads folder. To help you remember what the files are so you can refer back and find what you’re looking for you might want to include things like the date and the number of items in the name of the file. Here we’ll use “instagram-scrape-qamom-100posts-151020.csv” so it is very clear when we did it and how many posts we were looking for, as well as the reminder that this is a dataset from Instagram.

  5. Decide where on your machine you want to store your data

    1. You might want to store it straight into a dedicated project folder to keep things organised.

  6. Start scraping!

    1. Hit the “Start Scraping!” button and you should see feedback in the “Status” section below the button.

    2. If you run into any issues at this stage you may want to check things like:

      1. Are there posts associated with this hashtag on the public interface to Instagram? (Sometimes hashtags may be moderated so all posts associated with them are removed/no longer retrievable…)

      2. Have you confirmed that it works with a smaller sample? If you are trying to scrape a larger number of posts, it might be worth trying a much smaller number first to see that things work as intended.

      3. Does it work with all of the scrape options deselected? Sometimes it can take a while to download photos, so if something appears not to be working you can first try to get only the CSV file, without the extras, and see if that works.

      4. If things still don’t work and you can’t figure out why, you can consider filing a “bug report” on GitHub here. Where possible this should include screenshots, details of your operating system and the steps you have taken, so that others can try to reproduce the issue and figure out how to fix it. Some tips on filing a bug report can be found here and here.

  7. Open the CSV file.

    1. If all of the steps above work you will end up with a CSV file. CSV stands for “comma-separated values” and is a simple format for tabular data (often used as a common way to exchange data between different tools and software packages).

    2. You can open the CSV file by opening or importing it in Excel or in Google Sheets.

    3. The list of hashtags per post can be found in the “hashtags” column.

    4. The “id” column contains the identifier of each of the original posts that have been scraped. To see a post, put its id into a web browser, preceded by “https://www.instagram.com/p/”. So, for example, the id “CGLFjgJgjRw” would give the URL:
      https://www.instagram.com/p/CGLFjgJgjRw 

Screenshot of CSV file opening in Google Sheets.
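If you want the post URL for every row rather than one at a time, the rule from the last step can be written as a small helper. This is only a sketch; the function name `post_url` is made up, and it simply prefixes the id as described above.

```python
def post_url(post_id):
    """Build the public post URL by prefixing the id, as described in the recipe."""
    return "https://www.instagram.com/p/" + post_id.strip()


print(post_url("CGLFjgJgjRw"))
```

In Google Sheets the same can be done with a formula that joins the prefix and the id cell.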


2.2.2. Analysing top posts associated with a given hashtag

Once you’ve downloaded the hashtag data what can you do with it? This recipe looks at how to obtain a list of the top posts.


  1. Open the CSV file in Google Sheets

    1. While you can use Excel or any other software you like, in this example we’re going to walk through the analysis process in Google Sheets, which is particularly convenient for group project work (and also for troubleshooting).

    2. You can upload the file to Google Drive and then click “open in” and select Google Sheets in order to import the CSV file.

  2. Turn on filtering by selecting the headers in the top row of your data and then clicking “Data” > “Turn on filter”

    1. Once this is done you can then sort and filter any of the columns by any of the headers. 

  3. Sort the posts by number of likes by clicking the green triangle next to the “num_likes” header and then clicking “Sort Z -> A”, so that the posts with the most likes appear at the top.

    1. You can now see the posts in order of the number of likes they received.

    2. You can also use the same process to sort by number of comments. What might the differences be between posts with larger numbers of likes and larger numbers of comments?
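The same ranking can be reproduced outside a spreadsheet. The sketch below is one possible approach, not the worksheet’s own method: it assumes the column names from Table 1, treats empty cells as zero, and skips comment rows; the function name `top_posts` and the sample rows are made up.

```python
import csv
import io


def top_posts(rows, column="num_likes", n=10):
    """Return the n posts with the highest value in `column`, highest first."""
    # Comment rows (if scraped) are skipped; an empty cell counts as 0.
    posts = [row for row in rows if row.get("type") != "comment"]
    return sorted(posts, key=lambda row: int(row[column] or 0), reverse=True)[:n]


# Made-up sample standing in for a scraped result file.
sample_csv = io.StringIO(
    "id,num_likes,num_comments\n"
    "post_a,5,40\n"
    "post_b,120,3\n"
    "post_c,,0\n"
)
rows = list(csv.DictReader(sample_csv))

print([row["id"] for row in top_posts(rows)])
print([row["id"] for row in top_posts(rows, column="num_comments")])
```

Comparing the two printed lists is a quick way to start on the question of how the most-liked and most-commented posts differ.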



2.2.3. Analysing top accounts associated with a given hashtag

What else can you do with the data from the Instagram Scraper? This recipe looks at how to obtain a list of the top accounts.


  1. Open the CSV file in Google Sheets

    1. As above, while you can use Excel or any other software you like, in this example we’re going to walk through the analysis process in Google Sheets, which is particularly convenient for group project work (and also for troubleshooting).

    2. You can upload the file to Google Drive and then click “open in” and select Google Sheets in order to import the CSV file.

  2. Create a new pivot table by clicking “Data” > “Pivot Table”

    1. Confirm that the correct range of data in the original sheet is selected, and then click “Ok”.

    2. Pivot tables are a way to summarise, analyse, sort and filter data in a given sheet.

  3. Create a summary of the top posters by navigating to the “Pivot table editor” on the right hand side and clicking “Add” next to “Rows” and selecting “author”.

  4. Then click “Add” next to “Values” and select “author” and “COUNTA” in the “Summarise by” menu in order to count the authors.

  5. Select “COUNTA of author” in the “Sort by” menu under “Rows” and then select “Descending” under “Order” to obtain a list of posters with the most prolific at the top.
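For readers who prefer a script, the pivot table above amounts to counting how often each value occurs in the “author” column. A minimal sketch, assuming that column name; the function name `top_authors` and the sample rows are made up. Like the pivot table, it counts every row, so comment rows are included if you scraped comments.

```python
from collections import Counter


def top_authors(rows, n=10):
    """Count rows per author and return the n most frequent, most prolific first."""
    return Counter(row["author"] for row in rows).most_common(n)


# Made-up sample rows.
rows = [
    {"id": "post_a", "author": "alice"},
    {"id": "post_b", "author": "bob"},
    {"id": "post_c", "author": "alice"},
]
for author, count in top_authors(rows):
    print(author, count)
```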



2.2.4. Creating image grids for top images associated with a given hashtag

As well as analysing top posts and top users in your dataset, you may also look at the top images. This recipe is based on this simple visual methods recipe for creating image grids (refer to it for more detailed steps and associated readings).


  1. Open your dataset in Google Sheets, turn on filtering and filter by top posts (as per the recipe above on analysing top posts).

  2. Create a new “Image” column by right clicking on the “url” column and selecting “Insert 1 right”

  3. Go to the top cell in the new “Image” column and click the “function” button, or click “Insert” > “Function”, and then select “Google” > “IMAGE”.

    • If you’d prefer you can simply type into the cell: =IMAGE(URL)

  4. When the function is selected in the new cell, click on the image URL immediately to the left of the cell in order to provide the image URL as an input. You’ll then see a small preview of the image in the “Image” column.

  5. Click the small green square in the bottom right hand corner of the newly created image cell in order to “fill down” to the bottom of the sheet.

  6. Resize the width and height of the “Image” column so that you can see a bigger preview of the image.
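An image grid can also be generated as a small HTML page instead of a spreadsheet column. The sketch below is an alternative under assumptions, not the recipe’s own method: it uses the “thumbnail_url” and “url” columns from Table 1, and the function name `image_grid_html`, the grid dimensions and the sample URLs are made up.

```python
import html


def image_grid_html(rows, columns=5, size=150):
    """Return an HTML fragment showing one image per row in a fixed grid."""
    cells = []
    for row in rows:
        # Prefer the thumbnail; fall back to the full image URL.
        src = html.escape(row.get("thumbnail_url") or row["url"], quote=True)
        cells.append(f'<img src="{src}" width="{size}" height="{size}" alt="">')
    style = f"display:grid;grid-template-columns:repeat({columns},{size}px);gap:4px"
    return f'<div style="{style}">\n' + "\n".join(cells) + "\n</div>"


# Made-up sample rows, already sorted by number of likes.
rows = [
    {"url": "https://example.org/a.jpg?x=1&y=2", "thumbnail_url": ""},
    {"url": "https://example.org/b.jpg", "thumbnail_url": "https://example.org/b_thumb.jpg"},
]
grid = image_grid_html(rows, columns=2)
print(grid)
```

Save the output to an .html file and open it in a browser. Because Instagram image URLs expire, a grid built from downloaded photo files (the “photo_file” column) will stay viewable for longer than one built from URLs.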





2.2.5. Co-hashtag analysis with Gephi

As well as looking at top posts and top accounts in your dataset, you can also see which other hashtags are used along with the one that you have chosen to investigate. Here we will do this by creating co-hashtag graphs with Gephi. For more about making network data you can also see this worksheet.



  1. Go to Table2Net and open the original CSV file that you downloaded with your hashtag data (as per above): https://medialab.github.io/table2net/ 

    • First we need to convert your tabular data file (CSV) into a network data file so that it can be opened in Gephi. This involves selecting which elements of your CSV file are the “nodes” and which are the “edges”.

    • Once the file is loaded you should see a preview of your dataset and you can check that it all looks correct.

  2. The next step is to select the type of network. As there will be only one type of node in your graph (hashtags) you can select “Normal (one type of node)”.

  3. Next we need to specify which column defines the nodes. This is the “hashtags” column so you should select this from the dropdown menu.

  4. As the dataset contains multiple hashtags per cell separated by commas, you should select “Comma separated” under the next dropdown menu.

    • Once you have selected this you should see a preview of your dataset so you can check that Table2Net is extracting the hashtags separately rather than treating them as one long hashtag!

  5. Scroll down to “Links” and select “id”.

  6. Scroll all the way to the bottom and click “Build and download the network”.

  7. Now download and open the Gephi software.

  8. Click “File” > “Open” to open the network graph and select the file that you have downloaded from Table2Net.

  9. Spatialise the network by selecting the “Force Atlas 2” algorithm from the layout panel on the left.

    • This enables you to see which hashtags co-occur. The nodes represent the hashtags, and the edges represent the posts in which they appear together.

    • You can “tune” the network by using “Dissuade Hubs” and “LinLog mode”, and by adjusting the Scaling and Gravity. For more on this you can see the Gephi tutorial videos.

  10. Run the modularity function in the “Statistics” panel by clicking “Run” next to “Modularity”.

    • This enables you to identify possible clusters or groups of hashtags and colour them accordingly (in just a moment).

  11. Resize the nodes by clicking on the little concentric circles under the “Appearance” menu on the left and then clicking “Ranking” and selecting “Degree” from the dropdown menu.

  12. Colour the nodes according to their modularity class by clicking the little colour palette from the “Appearance” menu and then selecting “Partition” and selecting “Modularity Class” from the dropdown menu.

  13. Once you have finished adjusting the layout, switch to the “Preview” mode via the menu at the top. Tick the check boxes next to “Show Labels”, “Proportional Size”, “Show Edges” and “Curved”, refresh to check that everything looks how you want it to, and then click the “Export: SVG/PDF/PNG” button at the bottom of the page to export a PDF.

    • The PDF file displays the network in a way that lets you continue to explore and zoom in within the document, and share it with others you may be working with.
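To make clear what the co-hashtag network contains, the sketch below builds the same kind of edge list by hand: two hashtags are linked each time they appear in the same post. It illustrates the idea and is not Table2Net’s implementation. It assumes the comma-separated “hashtags” column, lower-cases hashtags so that variants in capitalisation are merged (a choice you may not want), and uses a made-up function name and sample rows.

```python
from collections import Counter
from itertools import combinations


def cohashtag_edges(rows):
    """Count, for every pair of hashtags, the number of posts using both."""
    edges = Counter()
    for row in rows:
        # Split the cell, trim spaces, drop empties and duplicates within a post.
        tags = sorted({tag.strip().lower() for tag in row["hashtags"].split(",") if tag.strip()})
        for first, second in combinations(tags, 2):
            edges[(first, second)] += 1
    return edges


# Made-up sample rows.
rows = [
    {"id": "post_a", "hashtags": "a,b,c"},
    {"id": "post_b", "hashtags": "B, a"},
    {"id": "post_c", "hashtags": ""},
]
for (first, second), weight in cohashtag_edges(rows).most_common():
    print(first, second, weight)
```

The weight of each pair corresponds to the number of posts in which the two hashtags co-occur, which is what the edges in the Gephi graph represent.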



2.2.6. Some other lines of inquiry and recipe ideas

Now that you have explored the basics, if you’d like to have a go at some more advanced and experimental options you could try the following:


  • Can you count the number of different hashtags that appear in the hashtags column in your dataset? This would involve splitting them out and counting them to get overall summaries. You could try this using Google Sheets, Excel or OpenRefine.

  • Can you see which hashtags are associated with which accounts? You could use the co-hashtag network recipe above, but create a “bi-partite” network using Table2Net and then explore the associations between users and hashtags.

  • Which are the images associated with different hashtags in your dataset? You could use the visual methods recipes as inspiration to create image grids per hashtag, or image-hashtag networks to explore this.

  • Which images are posted by which accounts? You could use the filters in Google Sheets (as per above) to create separate lists of images, sorted by engagement, and then use those to create image grids per account.

  • How are the images associated with hashtags changing over time? You could create image grids of the most-liked images for successive time periods.

  • How could you analyse the texts of posts associated with a given hashtag? For example, you could use the WORDij software to explore the texts of comments in your dataset. This creates a semantic network.

  • How can you examine posts over time? You could convert the timestamps in the dataset and format the data for use with tools like RankFlow in order to see the rise and fall of different hashtags in your dataset over time.

  • What are the entities in the images associated with your hashtag? You could use this recipe in order to cluster the images according to entities which machine learning algorithms detect in them. 

  • While these recipes explore how to use datasets with one hashtag, what kinds of issues arise when it comes to analysing multiple hashtags with the Instagram Scraper tool?
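As a starting point for the first and the time-based ideas above, the sketch below splits the “hashtags” column and counts hashtags overall and per month. It is a sketch under assumptions: it relies on the column names from Table 1 and on “timestamp” holding UNIX seconds, and the function names and sample rows are made up.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def split_hashtags(cell):
    """Split a comma-separated hashtags cell into a clean list."""
    return [tag.strip() for tag in cell.split(",") if tag.strip()]


def hashtag_counts(rows):
    """Count how often each hashtag occurs across all rows."""
    counts = Counter()
    for row in rows:
        counts.update(split_hashtags(row["hashtags"]))
    return counts


def hashtag_counts_by_month(rows):
    """Count hashtags separately for each month (keyed as YYYY-MM, in UTC)."""
    by_month = defaultdict(Counter)
    for row in rows:
        posted = datetime.fromtimestamp(float(row["timestamp"]), tz=timezone.utc)
        by_month[posted.strftime("%Y-%m")].update(split_hashtags(row["hashtags"]))
    return by_month


# Made-up sample rows (2020-09-01 and 2020-10-01, both 00:00 UTC).
rows = [
    {"timestamp": "1598918400", "hashtags": "a,b"},
    {"timestamp": "1601510400", "hashtags": "a, c"},
]
print(hashtag_counts(rows).most_common())
for month, counts in sorted(hashtag_counts_by_month(rows).items()):
    print(month, dict(counts))
```

The per-month counts could then be reshaped into the input format of a tool such as RankFlow.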


2.2. Account Analysis

2.2.1. Recipe


3. Resources



Further reading


