User:Fæ/Flickr API detail
Description
Several of Fæ's batch upload projects have relied on custom use of the Flickr API (see https://www.flickr.com/services/api/). Flickr currently supplies this as a free service, though with reasonable throttle limits. Several browser tools exist for Flickr uploads, including the Commons upload wizard which can upload 50 images at a time from Flickr.
Using a custom script makes it possible to tailor the upload, with more intelligent automatic mapping of Flickr tags, group names and album names to Commons categories, or by searching Flickr titles and descriptions for clues to category allocation. The generic upload tools have no system for coping with weak file names, such as "IMG_9634" in the screenshot example. With a custom script it is possible either to skip these incompatible names, or to suggest or automatically use alternatives generated from the Flickr description field or other data.
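As an illustration of the weak-name handling described above, here is a minimal sketch, not the script actually used. The pattern list, the function name `choose_title` and the 80-character limit are assumptions made for the example.

```python
import re

# Assumed patterns for camera-generated titles; extend to suit the stream.
WEAK_TITLE = re.compile(r"^(IMG|DSC|DSCN|DSCF|P)[_\- ]?\d+$", re.IGNORECASE)


def choose_title(title, description, max_len=80):
    """Return a usable file title, or None if the photo should be skipped."""
    title = title.strip()
    if title and not WEAK_TITLE.match(title):
        return title
    # Weak or empty title: fall back to the first non-empty description line.
    for line in description.splitlines():
        line = line.strip()
        if line:
            return line[:max_len].rstrip()
    return None
```

Returning None leaves the decision to skip, or to prompt for a name, with the calling code.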
Flickr has its own set of licences (see https://www.flickr.com/services/api/flickr.photos.licenses.getInfo.html; at the time of writing the API documentation on Flickr needs updating). These were extended in 2015 to include CC0 and Public Domain Mark. Six out of the eleven current options are potentially compatible with Commons. A custom script makes it possible to set more appropriate templates, and to trigger other events based on this Flickr file information, than standard tools could allow for; for example using {{PD-1923}} for old photographs based on dates in the description, even if the Flickr image uses a default non-commercial restriction.
2021 Update
In 2021 the Pywikibot code was rewritten from scratch to be compliant with Python 3 and to take advantage of the flickrapi Python module.
As part of being a good Flickr citizen, the script limits API accesses to below 3,300 per hour, which in turn gives an effective limit of 1,500 photograph uploads per hour. A number of past improvements can be factored back in based on the needs of future projects and the return on volunteer programme time. For example, it is an open question to what extent auto-cats based on Flickr tags, photosets, galleries or groups, or auto-name correction to compensate for the title blacklist filter, might work.
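The two figures imply roughly 3,300 / 1,500 = 2.2 API calls per uploaded photograph. One simple way to stay near such a limit is to space calls evenly. The sketch below is an assumption about how this could be done, not the script's actual throttle; the class name `Throttle` and its parameters are hypothetical.

```python
import time


class Throttle:
    """Space out calls so that roughly max_per_hour are made at most."""

    def __init__(self, max_per_hour=3300, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour  # seconds between calls
        self._clock = clock   # injectable so the class can be tested
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` immediately before each API request gives about 1.09 seconds between requests at the default setting.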
Licenses
Image licensing for each Flickr photograph is checked automatically and mapped as below by default. If the license is not in this list, or maps to an empty template string, the file is rejected for upload:
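# Map of Flickr license id to Wikimedia Commons template.
# An empty string means the license is not accepted for upload.
licdic = {
    "0": "",   # All Rights Reserved
    "1": "",   # Attribution-NonCommercial-ShareAlike
    "2": "",   # Attribution-NonCommercial
    "3": "",   # Attribution-NonCommercial-NoDerivs
    "4": "Cc-by-2.0",   # Attribution
    "5": "Cc-by-sa-2.0",   # Attribution-ShareAlike
    "6": "",   # Attribution-NoDerivs
    "7": "Flickr-no known copyright restrictions",
    "8": "PD-USGov",   # United States Government Work
    "9": "Cc-zero",   # Public Domain Dedication (CC0)
    "10": "PDMark-owner",   # Public Domain Mark
}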
The index numbers for licenses are available at https://www.flickr.com/services/api/explore/flickr.photos.licenses.getInfo
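A minimal sketch of how the mapping above could be applied follows. The function name `commons_template` is hypothetical, and the dictionary is repeated only so that the example runs on its own.

```python
# Copy of the mapping above, repeated so the sketch is self-contained.
licdic = {
    "0": "", "1": "", "2": "", "3": "",
    "4": "Cc-by-2.0", "5": "Cc-by-sa-2.0", "6": "",
    "7": "Flickr-no known copyright restrictions",
    "8": "PD-USGov", "9": "Cc-zero", "10": "PDMark-owner",
}


def commons_template(license_id):
    """Return the wikitext license template, or None to reject the upload."""
    # Unknown ids and ids mapped to "" are both treated as rejections.
    name = licdic.get(str(license_id), "")
    return "{{" + name + "}}" if name else None
```

Treating unknown ids the same as empty strings means that any license Flickr adds later is rejected until the mapping is reviewed by hand.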
Source code
See #2021 Update.
Several times there have been requests for Fæ's particular source code. Some versions have been released, for example this Flickr upload of 10,000 sanitation photographs. However, though these might be interesting to look at, running them may be a problem for a new user. Here are the issues:
- The code was originally written for 'compat', which is the previous version of Pywikibot, though from late 2016 the script is compatible with 'core'.
- Modifying and running the code will require programming experience in Python. It may also need installing and understanding custom modules which themselves exist in various versions, for example colorama (which could probably be safely stubbed out) or flickrapi (which is essential).
- The code may not be compatible with the latest versions of Python. The code has most recently been reliant on Python 2.7.10.
- Changes to either the Flickr API or the Commons API, especially relating to bot uploads or white-listing sites, may permanently break older versions.
In conclusion, if you are tempted to try this from scratch, don't. It's much easier to read up on the GLAMwiki toolset (off-line from 2021) or Flickr2Commons and focus on the data you want to be represented on Commons, rather than spending your time debugging the use of odd Python modules that happen to be handy for Fæ for historical reasons, unless you naturally enjoy poking about in old clunky code.
Example batch uploads
Escuerda
These uploads are from the Escuerda.net Flickrstream, a project requested after the political communications group swapped their license to CC-BY-SA. The stream publishes photographs from the Portuguese 'Left Bloc' political events, but is run separately from any political party. The photographs go back more than a decade and cover canvassing, protests and meetings.
Internet Archive (large) images
Category:Files from Internet Archive Book Images Flickr stream
The source Flickrstream run by the Internet Archive is problematic for several reasons. A custom script was created that:
- Selects images over 2,500 pixels on the longest side (estimated at 2% to 4% of images). This is problematic as Flickr often returns a "Null" size on the standard query for original image size. Where this happens the script loads a temporary version of the original image into memory and analyses that.
- Relies on the Flickr API call people.getPublicPhotos, because photos.search gave erratic results for this large Flickrstream.
- Adds extra error traps for calls to urllib2.urlopen(), as Flickr servers may time out. This is possibly related to passing so many calls to the API when testing for image sizes.
- Checks categories for matches to Flickr tags of the form "booksubject:<subject>".
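The size test and the tag matching from the list above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the input shapes (a list of dicts with "label", "width" and "height" keys, modelled on a getSizes-style response, and a list of raw tag strings) are assumptions.

```python
MIN_LONGEST_SIDE = 2500  # pixel threshold from the list above


def longest_side(sizes):
    """Return the longest side of the 'Original' size, or None if unknown."""
    for size in sizes:
        if size.get("label") != "Original":
            continue
        try:
            return max(int(size["width"]), int(size["height"]))
        except (KeyError, TypeError, ValueError):
            return None  # missing or "Null" dimensions
    return None


def is_large(sizes):
    """True or False when the size is known; None when a fallback is needed."""
    side = longest_side(sizes)
    return None if side is None else side > MIN_LONGEST_SIDE


def book_subjects(raw_tags):
    """Extract <subject> from raw tags of the form 'booksubject:<subject>'."""
    prefix = "booksubject:"
    subjects = []
    for tag in raw_tags:
        if tag.lower().startswith(prefix):
            subject = tag[len(prefix):].strip()
            if subject and subject not in subjects:
                subjects.append(subject)
    return subjects
```

A None result from `is_large` marks the case where the image itself has to be loaded and measured, as described in the first list item.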
If this code is to be used for more than a few thousand uploads, it may be worth having more sophisticated traps on the specialist 'tag' text, including testing further for book title, author or artist information. Blank page checks might be possible using the Python Imaging Library, though this would likely be slow and hardware intensive. At the time of writing, Wikidata does not cater for identification of books.
See User:Fæ/Project_list/Internet_Archive for project reports and other technical details.
Sasha Kargaltsev's Flickrstream
Category:Photographs by Sasha Kargaltsev
For the Flickrstream https://www.flickr.com/photos/kargaltsev/, uploads are confirmed as CC-BY-2.0 before uploading to Commons by using an interactive command-line script. The process is:
- The script runs in Python using the standard flickrapi package.
- Standard currently available Flickr licence types are loaded from flickr.photos.licenses.getInfo.
- Identify the next photograph in the Flickrstream and load photo information using flickr.photos.getInfo.
- Check that the license id matches the standard license type for 'Attribution License' which is defined as CC-BY-2.0. If this test fails, the upload is automatically skipped.
- Manually confirm the upload in the command window and prompt for any changes to title or description.
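The steps above can be sketched as below. This is a hedged outline rather than the script itself: the function names are hypothetical, `photo` and `licenses` are plain dicts standing in for the API responses, and the license name string is taken from the description above and may differ from what Flickr returns today.

```python
def attribution_license_id(licenses):
    """Find the id of the 'Attribution License' in a licenses list."""
    for lic in licenses:
        if lic["name"] == "Attribution License":
            return str(lic["id"])
    return None


def review_photo(photo, licenses, ask=input):
    """Return (title, description) to upload, or None to skip the photo."""
    wanted = attribution_license_id(licenses)
    if wanted is None or str(photo["license"]) != wanted:
        return None  # not CC-BY-2.0: skipped automatically, no prompt shown
    answer = ask("Upload '%s'? [y/N] " % photo["title"]).strip().lower()
    if answer != "y":
        return None
    # Blank answers keep the values published on Flickr.
    title = ask("New title (blank to keep): ").strip() or photo["title"]
    description = (ask("New description (blank to keep): ").strip()
                   or photo["description"])
    return title, description
```

The license comparison runs before any prompt, so the operator is only ever asked about photographs that have already passed the automated check.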
This level of semi-automation gives a higher level of confidence that the uploaded image was published on Flickr on a CC-BY-2.0 license than any pure human visual check could provide, and can be considered equivalent to an automated bot check. With Kargaltsev's stream a significant proportion of published photographs are marked as restricted; this is shown on Commons as an additional field on the Commons image page (by checking the API provided photo information). Standard upload tools may be unable to manage restricted images. --Fæ (talk) 22:55, 6 June 2014 (UTC)
Sasha Taylor's Flickrstream
Collection of West Midlands Police Museum: 458R
The process is identical to Kargaltsev's stream, with minor changes:
- An addition of the Flickr Attribution-ShareAlike License which is the same as CC-BY-SA-2.0; any other licenses are skipped.
- Uploads are filtered to the West Midlands Police Museum Flickr set rather than checking the whole stream in date order.
- The prompt for title or description changes is skipped: titles containing "DSC" are auto-named, and blank descriptions (which appears to be all of them) are filled in with a default text.
- An institution template is used.
I was asked by Pigsonthewing to upload this set after an earlier edit-a-thon with the Museum, as attempts to use normal upload tools had repeatedly failed to transfer them systematically. It is unclear why; possibly the width and height are swapped in the metadata for some files. This conflict may be causing the FlickreviewR bot to fail with the error "size_not_found" with some files. --Fæ (talk) 23:32, 17 June 2014 (UTC)
Thinktank, Birmingham
- Uploads filtered by Flickr-title on "banner" and "sign" to reduce the number of potential copyright issues.
- No institution template used.
--Fæ (talk) 00:26, 22 January 2015 (UTC)
Flickr "The Commons"
I have generally extended the logic to test for FlickrCommons licenses. These use {{Flickr-no known copyright restrictions}}. Where the date is deduced to be early enough, {{PD-1923}} is added. A key advantage of using a custom script is to add information about Flickr sets and tags which are skipped by current standard tools, plus using these as the basis of some intelligent category matching.
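One possible way to deduce the date is sketched here, as an assumption rather than the logic actually used. It takes the latest four-digit year found in the title and description text, and adds {{PD-1923}} only if that year is before 1923. The function names and the year pattern are hypothetical.

```python
import re

# Four-digit years from 1500 to 2099; an assumed range for the example.
YEAR = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")


def latest_year(text):
    """Return the latest four-digit year mentioned in text, or None."""
    years = [int(y) for y in YEAR.findall(text)]
    return max(years) if years else None


def templates_for(text, base="Flickr-no known copyright restrictions",
                  cutoff=1923):
    """Return the list of license templates to place on the file page."""
    templates = ["{{%s}}" % base]
    year = latest_year(text)
    # Using the latest year is deliberately cautious: any later date
    # in the text prevents the public domain template being added.
    if year is not None and year < cutoff:
        templates.append("{{PD-1923}}")
    return templates
```

Text with no recognisable year gets the base template only, so undated photographs are never marked as {{PD-1923}} by this check.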
The first batch upload using this script was to populate Files from Miami University Libraries - Digital Collections: 6,829R. --Fæ (talk) 15:43, 10 March 2015 (UTC)