Wikipedia talk:Chemical infobox/Archive 6


Justification for Emolecules links?

All CAS numbers link to http://emolecules.com, which appears to be an ad-driven listing of chemical suppliers, completely unrelated to CAS. I don't think Wikipedia should drive traffic from all our chemistry articles to such a site without a very good justification. Was such a justification given when those links were first installed? AxelBoldt (talk) 19:32, 27 November 2007 (UTC)

Those links come from the {{Chembox CASNo}} template. DMacks (talk) 19:44, 27 November 2007 (UTC)
I have changed it to PubChem. Сасусlе 16:41, 12 January 2008 (UTC)
I am not sure I support that change. PubChem is not fully complete, and it does not provide the kind of linkfarm that eMolecules does. Moreover, PubChem is already linked from its own field, 'pubchem'. AFAIK no other site delivers what eMolecules does. But I'd like to hear some more input on this. --Dirk Beetstra T C 17:18, 12 January 2008 (UTC)
I do not think that coverage is a big problem, as almost all Wikipedia compounds are in PubChem. The problems I noticed were related to polymers and complex compounds, the absence of a documented CAS filter in PubChem (leading to false hits), and the absence of database and supplier crosslinks. PubChem is linked to from the emolecules results page, so I am inclined to change it back, even if I do not like the commercial nature of that site. Сасусlе 18:33, 12 January 2008 (UTC)
What I meant was that I do not like the nature and setup of PubChem: the links they provide are not complete, and they point to selected literature but not into the documents themselves, etc. What eMolecules offers is broader: it links to commercial suppliers (a good source of information about compounds), and it links to PubChem and other literature databases. I think that for now eMolecules is a good one for CAS (a good linkfarm), even with the commercial part of eMolecules (as long as their information is unbiased). I do see AxelBoldt's point that it drives traffic to a commercial site, but since there is no good alternative (except for not providing the information at all, with the consequence of biased links or linkfarms in the external links sections, since there are many suppliers for most chemicals), this is fine for now. --Dirk Beetstra T C 19:01, 12 January 2008 (UTC)
I have changed it back to emolecules. Сасусlе 04:21, 13 January 2008 (UTC)
ChemSpider has been encouraged to provide further information about chemical vendors so we have started to enable this. One example of the efforts has been discussed here.--ChemSpiderMan (talk) 02:34, 24 July 2008 (UTC)

HMIS

Has any thought been given to including the HMIS Color Bar? I know in my lab, we pay more attention to the HMIS than to the NFPA diamonds. —MJCdetroit (talk) 16:09, 5 December 2007 (UTC)

Units

I am in the process of systematically comparing the data in the infoboxes. Is there systematization of the units (e.g. g/cm^3) either in the actual units used or the syntax used to represent them? Petermr (talk) 23:32, 6 December 2007 (UTC)

Generally, g/mL for liquids and g/cm^3 for solids, though some liquids are quoted in g/cm^3 as well. There is some use of kg/m^3, but that is rare. --Rifleman 82 (talk) 12:18, 9 December 2007 (UTC)
There are the WP:MOSNUM and the Wikipedia:WikiProject Chemicals/Style guidelines, which clearly make you write 100&nbsp;g/cm<sup>3</sup> and &minus;10&nbsp;&deg;C (for 100 g/cm3 and −10 °C). Note that there are bots and editor tools which replace &deg; with the unicode ° symbol. Wim van Dorst (talk) 20:59, 31 December 2007 (UTC).

Semantic information

I am interested in extracting the data in the infoboxes in a semantic manner. Given that the data are added into templates, is there a way of accessing these templates for the data? Scraping the HTML leads to several problems. Petermr (talk) 23:32, 6 December 2007 (UTC)

Screen-scraping is strongly discouraged, but the database is freely downloadable. (See Wikipedia:Database download.) You can then run a SQL query to identify the records containing the infoboxes, and then use regular expressions to parse out the needed fields. --Arcadian (talk) 01:26, 7 December 2007 (UTC)
Thanks very much. I would prefer the XML dump - do you know how large it is? I presumably have to filter out the molecules from that. Petermr (talk) 11:04, 9 December 2007 (UTC)
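To illustrate the "use regular expressions to parse out the needed fields" step suggested above, here is a minimal sketch in Python 3. The sample wikitext, the field names in it, and the helper `parse_fields` are assumptions made for illustration; real chemboxes nest templates and span multiple lines, which this simple line-based pattern does not handle.

```python
import re

# Hypothetical sample wikitext; real pages vary a lot in formatting.
SAMPLE = """{{chembox new
| Name = Ethanol
| CASNo =
| Density = 0.789 g/cm<sup>3</sup>
}}"""

# One "| name = value" pair per line. Only spaces and tabs are allowed
# around "=" so that an empty value does not swallow the next line.
FIELD_RE = re.compile(r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)

def parse_fields(wikitext):
    """Return a dict of the 'name = value' lines found in the wikitext."""
    return {m.group(1): m.group(2) for m in FIELD_RE.finditer(wikitext)}

fields = parse_fields(SAMPLE)
print(fields["Density"])
```

A line-based pattern keeps the sketch short; a robust extractor would first isolate the template by matching nested braces, then split on top-level `|` characters.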


Request for new properties

Triple point

Could we add a field for triple point? I tried to but screwed it up. Furmanj (talk) 16:44, 19 December 2007 (UTC)

Coefficient of Thermal Expansion

Please add the property Coefficient of thermal expansion. Uiteoi (talk) 09:55, 22 April 2008 (UTC)

Specific Heat Capacity

Please add a property for Specific Heat Capacity or Specific Heat in J kg⁻¹ K⁻¹ -- 10:36, 22 April 2008 (UTC)

There is already a Heat Capacity property in the Thermo-chemistry section, but heat capacity relates to an object and not to a material, and its unit is therefore J K⁻¹ -- Uiteoi (talk) 10:36, 22 April 2008 (UTC)

The heat capacity property actually refers to "molar specific heat capacity" according to Template:Chembox HeatCapacity. The mole is only one of several units (alongside kg and g) used to represent the amount of material. -- Uiteoi (talk) 12:59, 22 April 2008 (UTC)

After looking at the pages referring to the heat capacity template (only 2 at the time), I found that contributors have always been referring to "specific heat capacity". Since no page was referring to "molar specific heat capacity", I decided to change the template to reflect this actual usage. Another option might have been to modify the pages, but this would have required calculating the molar heat capacity, which contributors are not supposed to do. -- Uiteoi (talk) 12:59, 22 April 2008 (UTC).

You could have calculated if you had wished to, we regularly calculate our own molar masses etc., if only to check the results! Specific heat capacity and molar heat capacity are two separate concepts but, as you point out, the only uses of this template are for specific heat capacity and so I have left your changes. Physchim62 (talk) 19:02, 22 April 2008 (UTC)
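For reference, the conversion discussed above is a single multiplication: molar heat capacity equals specific heat capacity times molar mass. A minimal Python 3 sketch follows; the function name is hypothetical, and the values for water (about 4.18 J g⁻¹ K⁻¹ and 18.015 g/mol) are rounded textbook figures used only as an example.

```python
def molar_heat_capacity(specific_j_per_g_k, molar_mass_g_per_mol):
    """Convert specific heat capacity (J g^-1 K^-1) to molar (J mol^-1 K^-1)."""
    return specific_j_per_g_k * molar_mass_g_per_mol

def specific_per_gram(specific_j_per_kg_k):
    """Convert J kg^-1 K^-1 to J g^-1 K^-1."""
    return specific_j_per_kg_k / 1000.0

# Liquid water, rounded illustrative values (assumption, not template data).
water_molar = molar_heat_capacity(specific_per_gram(4180.0), 18.015)
print(round(water_molar, 1))
```

Keeping the unit in the function and parameter names makes it harder to mix up the per-gram, per-kilogram, and per-mole quantities that this thread distinguishes.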

WLN

Please add a field for WLN. 82.207.115.213 (talk) 15:12, 5 May 2008 (UTC)

Chembox Related: Isomers, by Oxidation, by Reduction, by Hydratation, by Dehydratation, OtherFunctionalGroup, etc

I think the Chembox Related could include other naturally occurring related chemical components. Namely: Isomers (ex: isopropanol in the propanol chembox, ammonium thiocyanate in the thiourea box), by Oxidation (ex: iron(III) oxide in the iron(II) oxide box, propanol in the propane box), by Reduction, by Hydratation (or is it Hydration?) (ex: sulfuric acid in the sulfur trioxide box, ethanol in the ethylene box), by Dehydratation, OtherFunctionalGroup (ex: ethylamine in the ethanol box, sulfuryl fluoride in the sulfuric acid box). Albmont (talk) 17:02, 1 April 2009 (UTC) I didn't add links because I don't want to pollute the "afluent pages" with a discussion

Extracting semantic chemistry from the infobox

I am interested in extracting the information from infoboxes and putting it under RDF and CML. This is motivated in part by DBPedia's demo that WP can evolve as a machine-parsable knowledgebase as well as a human-readable encyclopedia. I appreciate that there has been more than one generation of infobox and have read {{chembox new}} and some of the subsequent links. I have posted an initial example of a WP molecule in RDF at [1]. The information was taken from the XML representation of the information. I have a number of questions - please forgive the rather long list and also my ignorance of current WP activities or practice:

  • are there other activities in creating semantic chemistry for or from WP that I need to know about?
  • is there a BNF or software for parsing the Chemistry infobox - ideally into RDF or XML? If not I'll probably end up writing one
  • is there any check on the syntax of the information submitted in a field? if so does the contributor get an error flag?
  • what is the policy on scientific units? To conform to a controlled vocabulary?
  • what is the range of allowed characters and character encoding. Even in the XML I have encountered many characters outside the ASCII (32-127) range and many outside ISOLatin-1. Unless the encoding is given these can cause problems.
  • what percentage of infoboxes are in the new format and is there a policy for converting old ones to new ones?
  • if data are available elsewhere is it possible to add them automatically to WP by creating infobox entries.

Petermr (talk) 16:25, 31 December 2007 (UTC)

Hi Peter, interesting lead. Would you please be so kind as to explain what RDF, CML, and BNF actually mean? Wim van Dorst (talk) 17:12, 31 December 2007 (UTC).

My mistake - sorry. They all have WP entries:

  • Resource_Description_Framework perhaps best seen here as the basis for the Semantic Web. By creating all the chemistry in RDF we get a completely semantic collection. This means that there are many pieces of software that can process it already.
  • Chemical Markup Language (CML). Chemistry represented in XML. This allows molecules, properties, reactions, spectra and crystallography to be formally encoded and therefore machine-parsable. At the risk of being immodest it would be a great thing if chemical data in WP were in CML.
  • Backus–Naur_form BNF. A formal description of syntax that allows the automatic generation of parsing software. I assume that the infobox syntax has a BNF somewhere. At present I am guessing what the syntax is and how to parse it.

More generally I think that the data in WP could be very important. I have already gone on public record as saying that I think WP chemistry will start to become the de facto standard for undergraduate chemical reference, at least for common compounds. Petermr (talk) 17:38, 31 December 2007 (UTC)

Hi Peter, this explains it well. For the machine-readable chemical wikipages, people are currently introducing InChI coding into the chembox. There's even a nifty thing in there to make it not show when you don't want it (they tend to be over-long). I presume that InChI would be the thing you'd be looking for. And for promoting Chemistry on Wikipedia, you definitely need a word with Dr Walker, who has given presentations in the real world promoting exactly that! Wim van Dorst (talk) 19:44, 31 December 2007 (UTC).

Thanks very much, Wim. It's actually EVERYTHING in the infobox that I want (not just the InChI). Appearance, physical properties, etc. So, for example, RDF allows you to ask questions like "give me all compounds that are yellow and melt over 200 Celsius". I mailed Martin recently but it's holiday time... The technical questions are probably general WP-technology ones and I don't know who the Template experts are in chemistry. In principle it should be possible to extract all the chemistry automatically. A bonus is that this will automatically detect errors in syntax (and possibly in values - e.g. a constraint that Mpt must be less than Bpt can be checked). Petermr (talk) 19:54, 31 December 2007 (UTC)
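The kind of question and constraint check described above can be sketched without any RDF tooling, using plain Python 3 dictionaries as a stand-in for extracted infobox data. The record layout, the field names, and all compound names and values below are made up for illustration; a real RDF store would express the same filter as a query instead.

```python
# Hypothetical records; names and values are invented for illustration only.
compounds = [
    {"name": "compound A", "appearance": "yellow crystals",
     "melting_c": 250.0, "boiling_c": 410.0},
    {"name": "compound B", "appearance": "white powder",
     "melting_c": 300.0, "boiling_c": 500.0},
    {"name": "compound C", "appearance": "yellow oil",
     "melting_c": 15.0, "boiling_c": 220.0},
    {"name": "compound D", "appearance": "yellow solid",
     "melting_c": 320.0, "boiling_c": 290.0},  # deliberately inconsistent
]

def yellow_melting_above(records, limit_c=200.0):
    """Names of records described as yellow that melt above limit_c."""
    return [r["name"] for r in records
            if "yellow" in r["appearance"].lower() and r["melting_c"] > limit_c]

def melting_not_below_boiling(records):
    """Names of records violating the constraint melting point < boiling point."""
    return [r["name"] for r in records if r["melting_c"] >= r["boiling_c"]]

hits = yellow_melting_above(compounds)
suspect = melting_not_below_boiling(compounds)
print(hits, suspect)
```

The second function is the value check mentioned above: any record it returns probably has a data-entry error and can be flagged for a human to review.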

Along these lines, I've written a python program to extract wiki pages with chembox new on them, and another to extract the chemboxes from these pages so I can extract density information. FWIW, there were 3553 entries as of mid-March 2008, of which about 1600 have values entered for density. My density parser is not very clever, and I've edited a number of the wikipedia entries to fill in the gaps. It won't be enough for a complete RDF converter, but it is a start. [PAK] 69.140.172.225 (talk) 03:08, 29 March 2008 (UTC)

Thanks a lot! I'll email some interested parties. Walkerma (talk) 04:24, 29 March 2008 (UTC)

extractchem.py:

"""
Extract all pages from enwiki-latest-pages-articles.xml.bz2
containing "{{chembox new", writing them to chempages.xml.
For the 3.6 Gb compressed wiki database of March 2008, this
operation takes 100 minutes on a 2GHz pentium, yielding 3663
pages, 1653 of which have density information.
""" 

import bz2

def next_page(file):
    """Return (page_text, has_chembox); page_text is "" at end of file."""
    # Skip header
    while True:
        line = file.readline()
        if line.startswith("  <page>"): break
        # End of file: return a pair so the caller can always unpack it
        if line == "": return "", False

    # Read page
    page = line
    has_chembox = False
    while True:
        line = file.readline()
        if line == "": break  # truncated dump: stop instead of looping forever
        if "{{chembox new" in line.lower():
            has_chembox = True
        page += line
        if line.startswith("  </page>"): break
    return page,has_chembox

def process_file(ifile, ofile):
    while True:
        # Read from the ifile argument, not the builtin name "file"
        page, has_chembox = next_page(ifile)
        if page == "": break
        if has_chembox:
            ofile.write(page)

ifile = bz2.BZ2File('enwiki-latest-pages-articles.xml.bz2')
ofile = open('chempages.xml','w')
process_file(ifile,ofile)
ofile.close()
ifile.close()

processchem.py

# -*- coding: utf-8 -*-
# This program is public domain
"""
Process chemical pages from wikipedia.

Usage:
   wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
   python extractchem.py
   python processchem.py

The first command downloads the latest wikipedia dump.  At 3.5+ Gb, this takes
an hour or more.  The next command extracts chembox articles from the dump.
This takes 1.5+ hours.  The third command extracts the chembox entries and
looks at the density field.  This is quick, but you will need to change it
frequently to account for variation in human input.
""" 

import re
from xmlescape import xmlunescape


title_matcher = re.compile("<title>(.*)</title>")
def find_chembox(page,title):
    """
    Extract a chembox from the wiki page.
    """
    # Skip to start of <text> block
    text = page.find("<text")
    # Skip to start of chembox within <text>
    start = page.lower().find("{{chembox new",text)
    if start < 0: raise ValueError("Missing chembox in "+title)
    k = start+13

    # We are starting with a nesting level of 1.  Go until the end of the
    # page or until nesting level reaches zero.
    nesting = 1
    while True:
        if k >= len(page):
            # If we reach the end of the page, then we are missing }}
            raise ValueError("Mismatched {{Chembox new ... }} in "+title)
        elif page[k:k+2] == "{{":
            # Increase nesting level on {{
            nesting += 1
            k += 2
        elif page[k:k+2] == "}}":
            # Decrease nesting level on }}
            # If nesting level reaches zero we are at the end of the box
            nesting -= 1
            k += 2
            if nesting == 0:
                return page[start:k]
        elif page.startswith("&lt;!--", k):
            # Skip escaped XML comment
            k = page.find('--&gt;',k+4)
            if k < 0:
                raise ValueError("Mismatched <!--  ... --> in "+title)
        else:
            # Default is move to next character.
            k+=1

def next_page(file):
    """
    Get the next title/chembox
    """
    # Skip header
    while True:
        line = file.readline()
        if line.startswith("  <page>"): break
        if line == "": return None,""

    # Read page
    lines = [line]
    while True:
        line = file.readline()
        lines += [line]
        if line.startswith("  </page>"): break

    # Convert to a long string
    page = "".join(lines)
    match = title_matcher.search(page)
    title = match.group(1)
    chembox = find_chembox(page,title)
    chembox = xmlunescape(chembox.decode('UTF-8'))
    return title,chembox

density_matcher = re.compile(r"\|\s*[Dd]ensity\s*=\s*([^|]*)\s*\|")
def find_density(chembox,title):
    match = density_matcher.search(chembox)
    if not match: return None,""

    # Convert spaces
    density = match.group(1)
    for form in ["&nbsp;","&thinsp;"]:
        density = density.replace(form," ")

    # Regularize units
    for form,becomes in [("&middot;"," "),
                         (u"·"," "),
                         ("&minus;","-"),
                         (u"−","-"),
                         (u"³","<sup>3</sup>"),
                         (u"°",""),
                         ("&deg;"," "),
                         ]:
        density = density.replace(form,becomes)
    
    for form in [
                 "kg/dm<sup>3</sup>",
                 "kg dm<sup>-3</sup>",
                 "kg.dm<sup>-3</sup>",
                 "kg/dm^3","kg/dm3",
                 "kg/L", "kg/l",
                 "kg l<sup>-1</sup>",
                 ]:
        density = density.replace(form,"#mL")#"g/cm**3")

    for form in [
                 "mg/cm<sup>3</sup>",
                 "mg cm<sup>-3</sup>",
                 "mg.cm<sup>-3</sup>",
                 "g/L","g/l",
                 "g L<sup>-1</sup>",
                 "g.L<sup>-1</sup>",
                 "g/dm<sup>3</sup>",
                 "g dm<sup>-3</sup>",
                 "g.dm<sup>-3</sup>",
                 "kg/m3",
                 "kg/m<sup>3</sup>",
                 "kg m<sup>-3</sup>",
                 "kg m-3",
                 "kg.m<sup>-3</sup>",
                 ]:
        density = density.replace(form,"#L")#"g/L")

    for form in [
                 "g/cm<sup>3</sup>",
                 "g cm<sup>-3</sup>",
                 "g.cm<sup>-3</sup>",
                 "g/cm^3","g/cm3","g/cc",
                 "g/mL", "g/ml",
                 "g ml<sup>-1</sup>",
                 ]:
        density = density.replace(form,"#mL")#"g/cm**3")
    density = density.strip()

    # If empty return None
    if density == "": return None,""
    #print density,"===",title#,match.group(1).strip()

    # Split into density/caveat
    endvalue = density.find(' ')
    if endvalue>0:
        value = density[:endvalue].strip()
        caveat = density[endvalue+1:].strip()
    else:
        value = density
        caveat = ""

    # Missing density?
    if value in ["-","?"]:
        return None,caveat

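    # Floating point density?
    try:
        return float(value),caveat
    except ValueError:
        pass

    # European decimal point ','?
    try:
        return float(value.replace(',','.')),caveat
    except ValueError:
        pass

    # Value range?  Return the midpoint, keeping the full text as the caveat.
    # A split into anything other than two parts also raises ValueError.
    try:
        lo,hi = value.split('-')
        return (float(lo)+float(hi))/2,density
    except ValueError:
        pass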

    # Unknown
    print title,"unparsed density   -->   ",density
    return None,density
    

def process_file(file):
    while True:
        try:
            title,chembox = next_page(file)
        except ValueError,msg:
            print msg
        else:
            if title is None: break
            density, caveat = find_density(chembox,title)
            #if density != None: print title,density,'::',caveat

file = open('chempages.xml','rU')
process_file(file)

xmlescape.py:

# Author xmlescape: Gabriel Genellina
# Author xmlunescape: Leif K-Brooks, based on work by Aaron Swartz
# Source http://www.thescripts.com/forum/thread594350.html
# This version is modified from the original.
"""Escape special xml characters in a text string"""

from htmlentitydefs import codepoint2name,name2codepoint
import re

unichr2entity = dict((unichr(code),u'&%s;'%name) 
    for code,name in codepoint2name.iteritems() if code !=38)

def xmlescape(text,d=unichr2entity):
    """xmlstr = xmlescape(str)
    Convert text into a form suitable for inclusion in an XML file,
    with characters such as '&' replaced by &amp;
    """
    if u"&" in text:
        text = text.replace(u"&",u"&amp;")
    for key,value in d.iteritems():
        if key in text:
            text = text.replace(key,value)
    return text

def _replace_entity(m):
    """regular expression character replacement function for xmlunescape.
    """
    s = m.group(1)
    if s[0] == u'#':
        s = s[1:]
        try:
            if s[0] in u'xX':
                c = int(s[1:], 16)
            else:
                c = int(s)
            return unichr(c)
        except ValueError:
            return m.group(0)
    else:
        try:
            return unichr(name2codepoint[s])
        except (ValueError, KeyError):
            return m.group(0)

_entity_re = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", re.UNICODE)
def xmlunescape(s):
    """str = xmlunescape(xmlstr)
    Replace XML entities with original ISO characters.
    """
    return _entity_re.sub(_replace_entity, s)


if __name__ == "__main__":
    s = "<>&;"
    xmls = xmlescape(s)
    uns = xmlunescape(xmls)
    print "%s => %s => %s"%(s,xmls,uns)
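The listings above are Python 2 (`print` statements, `unichr`, `htmlentitydefs`). For readers on Python 3, a rough equivalent of the escape and unescape round trip can be sketched with the standard `html` module. This is an assumption about what a port might look like, not the original authors' code, and it differs in one respect: `html.escape` only escapes `&`, `<`, `>` and quotes, whereas `xmlescape` above also rewrites other characters as named entities.

```python
import html

s = "<>&;"
escaped = html.escape(s)           # '&' is handled first, so nothing is double-escaped
restored = html.unescape(escaped)  # resolves named and numeric character references

# Entities as they appear in wiki markup, e.g. in density or temperature fields
value = html.unescape("&minus;10&nbsp;&deg;C")

print("%s => %s => %s" % (s, escaped, restored))
```

After unescaping, `value` contains the Unicode minus sign, a no-break space and the degree sign, so any later unit matching has to handle those characters rather than the entity names.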

Capitalisation

The standard style on Wikipedia is to avoid unnecessary capitalisation: see Wikipedia:Manual of Style (capital letters). So "Material Safety Data Sheet" should become "Material safety data sheet". It's already capitalised this way in the main article (Material safety data sheet). The only reason I don't do this myself is lack of time to deal with the consequent broken links. Hairy Dude (talk) 15:53, 6 January 2008 (UTC)

Interesting…I've usually seen it written with Title Capitalization (and I've never seen it abbreviated as "msds"). Pinging Talk:Material safety data sheet. DMacks (talk) 16:12, 6 January 2008 (UTC)
I've generally seen it used in title capitalization, and that's how I've written it for some 30 years. But I am less concerned about how it is written out in full than I am about how it is written as an initialism. As an initialism it is always written in caps, as MSDS. —Preceding unsigned comment added by Pzavon (talkcontribs) 01:43, 7 January 2008 (UTC)

Is the {{chembox new}} stable and complete enough to formally replace {{chembox}}?

Hi all, since 2005 we have formally used the wikitable-formatted {{chembox}}. Approximately a year ago, the {{chembox new}} was developed for transcluded instead of substituted tables. During 2007, several problems with the transcluded version were solved, and overall it was much enhanced, up to the point that last November the old template was quietly adjusted to work with the new template (Thanks, Dirk, PC). I have not been including many chemboxes in new articles recently, but on the whole I have the impression that all practical problems with the {{chembox new}} have now been solved. That would now allow the discussion to formally move away from the old style to the new style, including updating the project page. Any comments? Wim van Dorst (talk) 17:01, 6 January 2008 (UTC).

I strongly support this. --Arcadian (talk) 01:55, 7 January 2008 (UTC)
Can we make a change to chembox to replace 'Molecular formula' with 'Chemical formula'? I have been looking at the lead(II) nitrate article, and every time I see Pb(NO3)2 next to 'molecular formula' I grumble - no molecules in lead(II) nitrate, after all. EdChem (talk) 14:50, 12 January 2008 (UTC)
Hi Wim, I guess you missed the major sweep performed by Rifleman 82 (well, his bot, chem-awb), which has replaced many, many {{chembox}}es with {{chembox new}}s. Formally chembox is now 'obsolete', though there seems to be a problem in removing it altogether, which is something PC and I have looked into. There are still some old chemboxes there, which will have to be done by hand (some contain formatting which could not be caught automatically, some have strange fields, and there are probably also some that we did not find).
Re EdChem, you are right, I guess that has to be changed, feel free to edit these templates, only the top templates are protected because they are quite sensitive, intricate and transcluded to many pages (see , it should be in there.
Hope this helps. --Dirk Beetstra T C 15:05, 12 January 2008 (UTC)
{{Chembox new}} has been stable for at least 18 months now, with one exception linked to a change in the documentation. It can be expanded beyond my original plans (and has been, thanks to all involved), but the basic structure has been thoroughly tested. As Dirk mentions, many chemboxes have already been automagically converted: I'm waiting for info as to which need more delicate attention or greater community input. The box we have on xylene, for example, cannot be converted into the new format. Physchim62 (talk) 16:00, 12 January 2008 (UTC)
Please could somebody then update Wikipedia:Chemical_infobox. Сасусlе 16:24, 12 January 2008 (UTC)
I believe there are fewer than 1000 entries left to be fixed, but AWB hasn't been working for quite a while. Let me try again, to finish them all. --Rifleman 82 (talk) 01:53, 13 January 2008 (UTC)
I've done a final run to clear the 400 or so remaining. There are 89 left which require hand coding. See the remaining worklist: [2] Help would be appreciated! --Rifleman 82 (talk) 02:06, 14 January 2008 (UTC)
Is The Presidents (song) actually supposed to be on that list, or is it a mistake? ~XarBioGeek (talk) 02:20, 14 January 2008 (UTC)
There are a few strange articles such as Muhammed which transclude our header. I removed the few I saw, but I must have missed this one. You can ignore it. I'll remove it by hand now. --Rifleman 82 (talk) 02:29, 14 January 2008 (UTC)

Is there any consensus as to whether infoboxes like NatOrganic or LabChemical will be taken under the general Chemical infobox? I occasionally see these kinds of infoboxes, but don't know why they are differentiated from other chemicals. It also seems that a lot of inorganics don't have infoboxes, is that just my perception? Casforty (talk) 04:24, 10 September 2008 (UTC)

In my opinion, you may freely replace NatOrganic and LabChemical. --Arcadian (talk) 02:55, 16 September 2008 (UTC)

It appears that the template automatically creates a link for pKb and points to Acid_dissociation_constant.

I didn't find that link to be helpful.

I clicked the "Basicity" link in the table of properties here Sodium_hydroxide. It took me to Acid_dissociation_constant, where the word "basicity" appears only once -- and without definition.

I was familiar with pH, but not pKa or pKb.

My suspicion as to the meaning of "basicity" was encouraged here:

where the word appears only twice, but contains this:

pKa + pKb = 14

Which, when combined with this:

leads to the (perhaps erroneous) conclusion that, except for strong bases (which Sodium_hydroxide may or may not be):

pH + pKb ≈ 14

That was a long way around to come to an uneasy conclusion.

And pH is a quantity for an aqueous solution. Can a solid have a pKb value? Is the pKb value of -2.43 for Sodium_hydroxide at the point of maximum Solubility in water (111 g/100 ml at 20°C)? If so, it might be helpful to note that the tabulated pKb value is given for the point of tabulated maximum solubility (which is temperature-dependent). -Ac44ck (talk) 18:29, 8 January 2008 (UTC)

Picture layout

I think it would be helpful if you included an explanation of how to manage the pictures (i.e. ImageFileL1 vs ImageFileR1), as I was unable to convert an older chembox template's layout to this one without an extensive search for an example. XarBiogeek (talk) 22:24, 13 January 2008 (UTC)

That information is in the documentation of {{chembox new}}, the older chembox should be replaced with that template, if possible. Hope this helps. --Dirk Beetstra T C 16:43, 16 January 2008 (UTC)